HFEPX Metric Hub

Latency Metric Papers

Updated from the current HFEPX corpus (2026-04-13). This page tracks 60 papers that report latency. Use it to compare how latency is measured across human-feedback and evaluation studies.

Papers: 60 · Last published: Apr 9, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: High.

Metric Coverage

100.0%

60 sampled papers include metric names.

Benchmark Anchoring

25.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

5.0%

3 papers report calibration/adjudication/IAA controls.

None of the 60 papers in this sample is flagged as low-signal.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Use the top metric-reliable papers first, then compare benchmark context in the matrix before drawing conclusions.
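
The coverage percentages above are plain ratios over the 60-paper sample. A minimal Python sketch of that arithmetic follows, using the counts reported on this page; the helper name `coverage_pct` and the dictionary keys are hypothetical and not part of any HFEPX tooling.

```python
# Counts as reported on this page for the 60-paper latency sample.
TOTAL_PAPERS = 60
counts = {
    "metric_coverage": 60,      # papers that include metric names
    "benchmark_anchoring": 15,  # papers with explicit dataset/benchmark anchors
    "quality_controls": 3,      # papers reporting calibration/adjudication/IAA
}


def coverage_pct(present: int, total: int = TOTAL_PAPERS) -> float:
    """Share of papers with a given attribute, as a percentage (one decimal)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * present / total, 1)


coverage = {name: coverage_pct(n) for name, n in counts.items()}
print(coverage)
```

With these counts the three values come out to 100.0, 25.0, and 5.0, matching the figures shown in the triage panel.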

Why This Matters For Eval Research

Use this page to compare how latency is operationalized across benchmarks and rater setups.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Latency is often paired with automatic metrics and LLM-as-judge evaluation.

Metric Interpretation

latency: 60 papers
accuracy: 20 papers
cost: 15 papers
throughput: 7 papers

Benchmark Context

DROP: 2 papers
MS MARCO: 2 papers
ARC-Challenge: 1 paper

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Apr 9, 2026 · Citations: 0 · Score: 8.0

Metrics: Precision, Latency · Eval: Automatic Metrics
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Apr 8, 2026 · Citations: 0 · Score: 8.0

Metrics: Accuracy, Latency · Eval: Automatic Metrics
SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT
Apr 7, 2026 · Citations: 0 · Score: 8.0

Metrics: Recall, Latency · Eval: Automatic Metrics
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Apr 7, 2026 · Citations: 0 · Score: 8.0

Metrics: F1, Latency · Eval: LLM as Judge, Automatic Metrics
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Apr 6, 2026 · Citations: 0 · Score: 8.0

Metrics: Accuracy, Pass@1 · Eval: Automatic Metrics
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Apr 2, 2026 · Citations: 0 · Score: 8.0

Metrics: Accuracy, Latency · Eval: Automatic Metrics, Simulation Env

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Date	Metrics	Benchmarks	Eval Modes	Quality Controls
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory	Apr 9, 2026	Precision, Latency	Latentneeds Bench	Automatic Metrics	Not reported
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models	Apr 8, 2026	Accuracy, Latency	GSM8K, TruthfulQA	Automatic Metrics	Not reported
SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT	Apr 7, 2026	Recall, Latency	Not reported	Automatic Metrics	Calibration
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations	Apr 7, 2026	F1, Latency	SQuAD	LLM as Judge, Automatic Metrics	Not reported
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency	Apr 6, 2026	Accuracy, Pass@1	Full Duplex Bench	Automatic Metrics	Not reported
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks	Apr 2, 2026	Accuracy, Latency	Not reported	Automatic Metrics, Simulation Env	Calibration
FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval	Mar 31, 2026	F1, Recall	MS MARCO	Automatic Metrics	Not reported
LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications	Mar 28, 2026	Latency, Latency p95	BEIR	Automatic Metrics	Not reported
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?	Mar 27, 2026	Accuracy, Latency	FormalProofBench	Automatic Metrics	Not reported
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control	Apr 7, 2026	Latency	Not reported	Automatic Metrics	Not reported
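
One possible way to read the matrix is to treat two papers' latency figures as candidates for comparison only when they share both a benchmark and an eval mode. The Python sketch below applies that rule to the ten rows above. The rule itself, the function name `comparable_groups`, and the shortened paper names are assumptions made for illustration, not a procedure defined by HFEPX.

```python
from collections import defaultdict

# (short paper name, metrics, benchmarks, eval modes), transcribed from the matrix.
# An empty benchmark list stands for "Not reported".
rows = [
    ("PASK", ["Precision", "Latency"], ["Latentneeds Bench"], ["Automatic Metrics"]),
    ("Gemma 4 / Phi-4 / Qwen3", ["Accuracy", "Latency"], ["GSM8K", "TruthfulQA"], ["Automatic Metrics"]),
    ("SemLink", ["Recall", "Latency"], [], ["Automatic Metrics"]),
    ("Weakly Supervised Distillation", ["F1", "Latency"], ["SQuAD"], ["LLM as Judge", "Automatic Metrics"]),
    ("Full-Duplex-Bench-v3", ["Accuracy", "Pass@1"], ["Full Duplex Bench"], ["Automatic Metrics"]),
    ("SEAL", ["Accuracy", "Latency"], [], ["Automatic Metrics", "Simulation Env"]),
    ("FGR-ColBERT", ["F1", "Recall"], ["MS MARCO"], ["Automatic Metrics"]),
    ("LLM Readiness Harness", ["Latency", "Latency p95"], ["BEIR"], ["Automatic Metrics"]),
    ("FormalProofBench", ["Accuracy", "Latency"], ["FormalProofBench"], ["Automatic Metrics"]),
    ("MMEmb-R1", ["Latency"], [], ["Automatic Metrics"]),
]


def comparable_groups(rows, metric="Latency"):
    """Group papers reporting `metric` by (benchmark, eval mode); keep groups of 2+."""
    groups = defaultdict(list)
    for name, metrics, benchmarks, modes in rows:
        if metric not in metrics:
            continue  # paper does not list the metric in the matrix
        for bench in benchmarks:  # papers without a benchmark contribute nothing
            for mode in modes:
                groups[(bench, mode)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


print(comparable_groups(rows))
```

For these ten rows the result is an empty mapping, because no two papers list the same benchmark. Under this (deliberately strict) rule, none of the top-10 latency figures would be directly comparable without further normalization.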

Researcher Workflow (Detailed)

Checklist

Gap: Human feedback

Human feedback is present in 2 of 60 papers.
Gap: Quality controls

Quality controls are present in 3 of 60 papers.
Gap: Benchmarks

Benchmarks are present in 15 of 60 papers.
Strong: Metrics

Metrics are present in all 60 papers.
Gap: Known rater population

Known rater population is present in 5 of 60 papers.
Gap: Known annotation unit

Known annotation unit is present in 12 of 60 papers.

Strengths

Metrics are present in all 60 papers.

Known Gaps

Human feedback is present in 2 of 60 papers.
Quality controls are present in 3 of 60 papers.
Benchmarks are present in 15 of 60 papers.

Suggested Next Analyses

Review the most recent latency papers first, then compare benchmark context before reusing the metric.

Recommended Queries

Search Latency papers

Known Limitations

This page is a synthetic fallback generated from extraction data, because the cached metric payload for latency was missing.

Research Utility Snapshot (Detailed)

Top Metrics

Latency (60)
Accuracy (20)
Cost (15)
Throughput (7)

Evaluation Modes

Automatic Metrics (45)
LLM as Judge (2)
Simulation Env (2)

Top Benchmarks

DROP (2)
MS MARCO (2)
ARC-Challenge (1)
BEIR (1)

Agentic Mix

None (53)
Long Horizon (4)
Multi Agent (2)
Tool Use (2)

Top Papers Reporting This Metric

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models
Shilin Yan, Jintao Tong, Hongwei Xue, Xiaojun Tang, Yangyang Wang · Apr 9, 2026 · Citations: 0

Automatic Metrics General

The advent of agentic multimodal models has empowered systems to actively interact with external environments.
KV Cache Offloading for Context-Intensive Tasks
Andrey Bocharnikov, Ivan Ermakov, Denis Kuznedelev, Vyacheslav Zhdanovskiy, Yegor Yershov · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Prior evaluations have largely focused on tasks that do not require extracting large amounts of information from the context.
Alloc-MoE: Budget-Aware Expert Activation Allocation for Efficient Mixture-of-Experts Inference
Baihui Liu, Kaiyuan Tian, Wei Wang, Zhaoning Zhang, Linbo Qiao · Apr 9, 2026 · Citations: 0

Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Ming Lei · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment.
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory
Zhifei Xie, Zongzheng Hu, Fangda Ye, Xin Zhang, Haobo Chai · Apr 9, 2026 · Citations: 0

Automatic Metrics General

Prior work remains largely confined to laboratory settings, leaving a clear gap in real-world proactive agent: depth, complexity, ambiguity, precision and real-time constraints.
DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs
Nayoung Choi, Jonathan Zhang, Jinho D. Choi · Jan 12, 2026 · Citations: 0

Automatic Metrics General

Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.
See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs
Yicheng Ji, Jun Zhang, Jinpeng Chen, Cong Wang, Lidan Shou · Apr 7, 2026 · Citations: 0

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Efficient Learned Data Compression via Dual-Stream Feature Decoupling
Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu · Apr 8, 2026 · Citations: 0

Law, Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Md Motaleb Hossen Manik, Ge Wang · Apr 8, 2026 · Citations: 0

Automatic Metrics Math

We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and…
MARS: Enabling Autoregressive Models Multi-Token Generation
Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun · Apr 8, 2026 · Citations: 0

Automatic Metrics General

When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks.
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang · Feb 24, 2025 · Citations: 0

Coding

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this…
MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control
Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang · Apr 7, 2026 · Citations: 0

Automatic Metrics General

Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
BOSCH: Black-Box Binary Optimization for Short-Context Attention-Head Selection in LLMs
Abbas Ghaddar, Ivan Kobyzev, Boxing Chen, Yufei Cui · Apr 7, 2026 · Citations: 0

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
CoGate-LSTM: Prototype-Guided Feature-Space Gating for Mitigating Gradient Dilution in Imbalanced Toxic Comment Classification
Noor Islam S. Mohammad · Oct 19, 2025 · Citations: 0

Automatic Metrics General

On the Jigsaw Toxic Comment benchmark, CoGate-LSTM achieves 0.881 macro-F1 (95% CI: [0.873, 0.889]) and 96.0% accuracy, outperforming fine-tuned BERT by 6.9 macro-F1 points (p < 0.001) and XGBoost by 4.7, while using only 7.3M parameters…
Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker
Matthias De Lange, Jens-Joris Decorte, Jeroen Van Hautte · Nov 11, 2025 · Citations: 0

Automatic Metrics General

These constraints have led to isolated, task-specific developments in the field, with models and benchmarks focused on single prediction tasks.
SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT
Guan-Yan Yang, Wei-Ling Wen, Shu-Yuan Ku, Farn Wang, Kuo-Hui Yeh · Apr 7, 2026 · Citations: 0

Automatic Metrics General

Our evaluation demonstrates that SemLink achieves a Recall of 96.00%, comparable to state-of-the-art LLMs (GPT-5.2), while operating approximately 47.5 times faster and requiring significantly fewer computational resources.
Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation
Soufiane Jhilal, Martina Galletti · Mar 25, 2026 · Citations: 0

Automatic Metrics Medicine, Multilingual

Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages.
AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering
N. D. Tantaroudas, A. J. McCracken, I. Karachalios, E. Papatheou · Apr 7, 2026 · Citations: 0

Automatic Metrics Multilingual

Validation comprised technical benchmarking of each AI component, including comparative assessments of speech synthesis providers and multilingual translation models (NLLB 200 and EuroLLM 1.7B variants).
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0

LLM as Judge, Automatic Metrics General

We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
Screening Is Enough
Ken M. Nakanishi · Apr 1, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Guan-Ting Lin, Chen Chen, Zhehuai Chen, Hung-yi Lee · Apr 6, 2026 · Citations: 0

Automatic Metrics General

We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use.
Voxtral Realtime
Mistral-AI, Alexander H. Liu, Andy Ehrenberg, Andy Lo · Feb 11, 2026 · Citations: 0

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval
Chun Chet Ng, Jia Yu Lim, Wei Zeng Low · Nov 18, 2025 · Citations: 0

Automatic Metrics Coding

We present PRISM, a training-free framework that integrates refined system prompting, in-context learning (ICL), and lightweight multi-agent coordination for document and chunk ranking tasks.
Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing
Lisan Al Amin, Md Ismail Hossain, Rupak Kumar Das, Mahbubul Islam, Abdulaziz Tabbakh · Mar 21, 2026 · Citations: 0

Automatic Metrics General

This study benchmarks four deep learning models (Conv6, VGG16, ResNet18, CycleGAN) using TensorFlow and PyTorch on Intel Xeon CPUs and NVIDIA Tesla T4 GPUs.
100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier · Mar 16, 2026 · Citations: 0

Automatic Metrics General

This paper provides an extensive evaluation of a recent AI query approximation approach that enables low cost analytics and database applications to benefit from AI queries.
SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage · Apr 2, 2026 · Citations: 0

Automatic Metrics, Simulation Env General

In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL)…
DeDelayed: Deleting Remote Inference Delay via On-Device Correction
Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar · Oct 15, 2025 · Citations: 0

Automatic Metrics Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
APEX: Agent Payment Execution with Policy for Autonomous Agent API Access
Mohd Safwan Uddin, Mohammed Mouzam, Mohammed Imran, Syed Badar Uddin Faizan · Apr 2, 2026 · Citations: 0

Automatic Metrics General

Autonomous agents are moving beyond simple retrieval tasks to become economic actors that invoke APIs, sequence workflows, and make real-time decisions.
NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov · Mar 13, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Adaptive Stopping for Multi-Turn LLM Reasoning
Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng · Apr 1, 2026 · Citations: 0

Automatic Metrics General

Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions.
OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion
Sai Koneru, Matthias Huck, Jan Niehues · Nov 28, 2025 · Citations: 0

Coding, Multilingual

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models
Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong · Apr 1, 2026 · Citations: 0

Automatic Metrics Math, Coding

Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive…
Execution-Verified Reinforcement Learning for Optimization Modeling
Runda Guan, Xiangqing Shen, Jiajun Zhang, Yifan Zhang, Jian Cheng · Apr 1, 2026 · Citations: 0

Math, Coding

Automating optimization modeling with LLMs is a promising path toward scalable decision intelligence, but existing approaches either rely on agentic pipelines built on closed-source LLMs with high inference latency, or fine-tune smaller…
Large Language Models in the Abuse Detection Pipeline
Suraj Kath, Sanket Badhe, Preet Shah, Ashwin Sampathkumar, Shivani Gupta · Mar 31, 2026 · Citations: 0

General

Large Language Models introduce new capabilities for contextual reasoning, policy interpretation, explanation generation, and cross-modal understanding, enabling them to support multiple stages of modern safety systems.
FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval
Antonín Jarolím, Martin Fajčík · Mar 31, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ParetoBandit: Budget-Paced Adaptive Routing for Non-Stationary LLM Serving
Annette Taberner-Miller · Mar 31, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation
Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence · Mar 31, 2026 · Citations: 0

Automatic Metrics Coding

Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues.
DeepCoT: Deep Continual Transformers for Real-Time Inference on Data Streams
Ginés Carreto Picón, Peng Yuan Zhou, Qi Zhang, Alexandros Iosifidis · Nov 21, 2025 · Citations: 0

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EventChat: Implementation and user-centric evaluation of a large language model-driven conversational recommender system for exploring leisure events in an SME context
Hannes Kunstmann, Joseph Ollier, Joel Persson, Florian von Wangenheim · Jul 5, 2024 · Citations: 0

Automatic Metrics General

Yet to date, research has predominantly focused upon technical frameworks to implement LLM-driven CRS, rather than end-user evaluations or strategic implications for firms, particularly from the perspective of a small to medium enterprises…
CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato · Sep 25, 2025 · Citations: 0

Automatic Metrics General

We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to…
ShishuLM: Achieving Optimal and Efficient Parameterization with Low Attention Transformer Models
Shivanshu Kumar, Gopalakrishnan Srinivasan · Oct 13, 2025 · Citations: 0

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Accelerating Diffusion Large Language Models with SlowFast Sampling: The Three Golden Principles
Qingyan Wei, Yaojie Zhang, Zhiyuan Liu, Puyu Zeng, Yuxuan Wang · Jun 12, 2025 · Citations: 0

Automatic Metrics General

Extensive experiments across benchmarks and models show that SlowFast Sampling achieves up to 15.63× speedup on LLaDA with minimal accuracy drop, and up to 34.22× when combined with caching.
Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs
Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang · Mar 31, 2026 · Citations: 0

Automatic Metrics Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OneComp: One-Line Revolution for Generative AI Model Compression
Yuma Ichikawa, Keiji Kimura, Akihiro Yoshida, Yudai Fujimoto, Hiroki Tokura · Mar 30, 2026 · Citations: 0

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LingoLoop Attack: Trapping MLLMs via Linguistic Context and State Entrapment into Endless Loops
Jiyuan Fu, Kaixun Jiang, Lingyi Hong, Jinglun Li, Haijing Guo · Jun 17, 2025 · Citations: 0

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
X-OPD: Cross-Modal On-Policy Distillation for Capability Alignment in Speech LLMs
Di Cao, Dongjie Fu, Hai Yu, Siqi Zheng, Xu Tan · Mar 6, 2026 · Citations: 0

Automatic Metrics General

Extensive experiments across multiple benchmarks demonstrate that X-OPD significantly narrows the gap in complex tasks while preserving the model's inherent capabilities.
Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong · Oct 6, 2025 · Citations: 0

Automatic Metrics General

We introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation.
LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
Alexandre Cristovão Maiorano · Mar 28, 2026 · Citations: 0

Automatic Metrics General

We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow.
SCOPE: Tree-based Self-Correcting Online Log Parsing via Syntactic-Semantic Collaboration
Dongyi Fan, Suqiong Zhang, Lili He, Ming Liu, Yifan Huo · Mar 28, 2026 · Citations: 0

Automatic Metrics General

Extensive evaluations on diverse benchmark datasets show that SCOPE outperforms state-of-the-art methods in both accuracy and efficiency.
Towards Hyper-Efficient RAG Systems in VecDBs: Distributed Parallel Multi-Resolution Vector Search
Dong Liu, Yanxuan Yu · Nov 12, 2025 · Citations: 0

Automatic Metrics Coding

We implement SPI as a plugin for both FAISS and Qdrant backends and evaluate it across multiple RAG tasks including MS MARCO, Natural Questions, and multimodal retrieval benchmarks.
PHONOS: PHOnetic Neutralization for Online Streaming Applications
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna · Mar 27, 2026 · Citations: 0

Automatic Metrics General

Our evaluations show an 81% reduction in non-native accent confidence, with listening-test ratings consistent with this shift, and reduced speaker linkability as accent-neutralized utterances move away from the original speaker in embedding…
FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu · Mar 27, 2026 · Citations: 0

Automatic Metrics Math

We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level.
JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems
Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai · Mar 27, 2026 · Citations: 0

Automatic Metrics Multilingual

Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments.
TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling
Nisharg Nargund, Priyesh Shukla · Feb 7, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
LLM4AD: Large Language Models for Autonomous Driving -- Concept, Review, Benchmark, Experiments, and Future Trends
Can Cui, Yunsheng Ma, Sung-Yeon Park, Zichong Yang, Yupeng Zhou · Oct 20, 2024 · Citations: 0

Simulation Env General

Then, a comprehensive benchmark is proposed for evaluating the instruction-following and reasoning abilities of LLM4AD systems, which includes LaMPilot-Bench, CARLA Leaderboard 1.0 Benchmark in simulation and NuPlanQA for multi-view visual…
Characterizing Linear Alignment Across Language Models
Matt Gorbett, Suman Jana · Mar 19, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs
Selim An, Il hong Suh, Yeseong Kim · Mar 26, 2026 · Citations: 0

Automatic Metrics General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models
Hieu Xuan Le, Benjamin Goh, Quy Anh Tang · Mar 26, 2026 · Citations: 0

LLM as Judge General

In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while…
Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models
Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang · Mar 26, 2026 · Citations: 0

Automatic Metrics General

On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrate strong generalization across diverse decoders and benchmarks.
GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation
Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng · Mar 26, 2026 · Citations: 0

Automatic Metrics General

Prior approaches to this problem include agentic retrieval strategies, which expand the semantic search space by generating additional queries.
