Metric Hub

Latency + General Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 19 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: BrowseComp. Common metric signal: latency. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 19 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 19 papers for Latency + General Metric Papers. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are BrowseComp and Retrieval, and the most frequent metrics are latency and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

5.3% of papers (1/19) report explicit human-feedback signals, led by pairwise preferences.

Evidence: Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning, Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Ruyi2 Technical Report

Automatic metrics appear in 94.7% of papers in this hub (18/19).

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Ruyi2 Technical Report, LiCQA : A Lightweight Complex Question Answering System

BrowseComp is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Towards Efficient Agents: A Co-Design of Inference Architecture and System, Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Ruyi2 Technical Report

Protocol Takeaways

The most common quality-control signal is rater calibration (5.3% of papers).

Evidence: Discrete Stochastic Localization for Non-autoregressive Generation, Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Ruyi2 Technical Report

Rater pools are mostly unspecified, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Ruyi2 Technical Report, LiCQA : A Lightweight Complex Question Answering System

Stratify by benchmark (BrowseComp vs Retrieval) before comparing methods.

Evidence: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Ruyi2 Technical Report, LiCQA : A Lightweight Complex Question Answering System
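
The stratification advice above can be sketched in a few lines of Python: group results by benchmark first, then rank methods only inside each group. This is a minimal illustration, not the hub's own tooling. The record layout, the method names, and the scores are hypothetical placeholders and are not taken from any paper listed here.

```python
from collections import defaultdict

# Hypothetical records: (method, benchmark, score). Names and values are
# placeholders for illustration only, not results from any paper in this hub.
records = [
    ("method_a", "BrowseComp", 0.40),
    ("method_b", "BrowseComp", 0.50),
    ("method_a", "Retrieval", 0.70),
    ("method_c", "Retrieval", 0.60),
]


def rank_within_benchmark(rows):
    """Group rows by benchmark, then rank methods inside each group only."""
    groups = defaultdict(list)
    for method, benchmark, score in rows:
        groups[benchmark].append((method, score))
    # Higher score ranks first; scores are never compared across benchmarks.
    return {
        benchmark: [m for m, _ in sorted(items, key=lambda x: x[1], reverse=True)]
        for benchmark, items in groups.items()
    }


rankings = rank_within_benchmark(records)
for benchmark, order in rankings.items():
    print(benchmark, order)
```

Keeping one ranking per benchmark avoids treating a BrowseComp score and a Retrieval score as if they were on the same scale.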

Benchmark Interpretation

BrowseComp appears in 10.5% of hub papers (2/19); use this cohort for benchmark-matched comparisons.
Retrieval appears in 10.5% of hub papers (2/19); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Latency is reported in 100% of hub papers (19/19); compare with a secondary metric before ranking methods.
Accuracy is reported in 47.4% of hub papers (9/19); compare with a secondary metric before ranking methods.
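
One way to read "compare with a secondary metric before ranking methods" is to keep only the systems that no other system beats on both latency and accuracy (a Pareto front). The sketch below assumes lower latency and higher accuracy are better. The system names and numbers are hypothetical placeholders, not values reported by any paper in this hub.

```python
def pareto_front(points):
    """Return names of systems not dominated on (lower latency, higher accuracy).

    A system is dominated if another one is at least as good on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, latency, accuracy in points:
        dominated = any(
            (other_lat <= latency and other_acc >= accuracy)
            and (other_lat < latency or other_acc > accuracy)
            for _, other_lat, other_acc in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, latency in ms, accuracy) triples for illustration only.
systems = [
    ("sys_a", 120.0, 0.80),
    ("sys_b", 200.0, 0.85),
    ("sys_c", 250.0, 0.78),
]

front = pareto_front(systems)
print(front)
```

Here sys_c is dropped because sys_a is both faster and more accurate, while sys_a and sys_b remain because each wins on one metric.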

Abstract Evidence Highlights

Direct snippets from paper abstracts to ground protocol and benchmark interpretation.

Protocol: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Human-eval abstract signal: Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems.

Protocol: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Human-eval abstract signal: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.

Benchmark: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

BrowseComp benchmark signal: We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state of the art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%).

Benchmark: Towards Efficient Agents: A Co-Design of Inference Architecture and System

BrowseComp benchmark signal: Experiments on the BrowseComp-zh and DeepDiver benchmarks demonstrate that through the synergistic collaboration of these methods, AgentInfer reduces ineffective token consumption by over 50%, achieving an overall 1.8-2.5 times speedup with preserved accuracy.

Metric: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Latency metric signal: To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task appropriate evaluation metrics.

Quality Control: Discrete Stochastic Localization for Non-autoregressive Generation

Rater calibration quality-control signal: Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.

Protocol: Ruyi2 Technical Report

Protocol abstract signal: Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies.

Protocol: LiCQA : A Lightweight Complex Question Answering System

Protocol abstract signal: Over the last twenty years, significant progress has been made in designing and implementing Question Answering (QA) systems.

Researcher Checklist

Close gap on papers with explicit human feedback: coverage is a replication risk (5.3% vs 45% target).
Close gap on papers reporting quality controls: coverage is a replication risk (5.3% vs 30% target).
Tighten coverage on papers naming benchmarks/datasets: coverage is usable but incomplete (26.3% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close gap on papers with known rater population: coverage is a replication risk (0% vs 35% target).
Close gap on papers with known annotation unit: coverage is a replication risk (5.3% vs 35% target).
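
The checklist rows above compare a coverage percentage against a target and attach one of three labels. A rough sketch of that comparison follows. The paper counts are inferred from the stated percentages of 19 papers (for example, 5.3% corresponds to 1 of 19). The page does not say where the boundary between "usable but incomplete" and "replication risk" lies, so the `usable_ratio` cutoff of half the target is purely an assumption that happens to reproduce the labels shown.

```python
def coverage_status(count, total, target_pct, usable_ratio=0.5):
    """Return (coverage %, status label) for one checklist row.

    usable_ratio is an ASSUMED cutoff (not stated on the page): coverage at
    or above this fraction of the target counts as "usable but incomplete".
    """
    pct = round(100 * count / total, 1)
    if pct >= target_pct:
        status = "strong"
    elif pct >= usable_ratio * target_pct:
        status = "usable but incomplete"
    else:
        status = "replication risk"
    return pct, status


TOTAL_PAPERS = 19

# (paper count inferred from the page's percentages, target % from the page)
rows = {
    "explicit human feedback": (1, 45),
    "quality controls": (1, 30),
    "benchmarks/datasets": (5, 35),
    "evaluation metrics": (19, 35),
    "known rater population": (0, 35),
    "known annotation unit": (1, 35),
}

report = {
    name: coverage_status(count, TOTAL_PAPERS, target)
    for name, (count, target) in rows.items()
}
for name, (pct, status) in report.items():
    print(f"{name}: {pct}% -> {status}")
```

Changing `usable_ratio` moves the boundary between the two lower labels, so treat the labels from this sketch as illustrative rather than as the hub's definition.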

Known Limitations

Only 5.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: BrowseComp - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: latency - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Automatic Metrics vs Simulation Env

Both: 1 paper · Automatic Metrics only: 17 papers · Simulation Env only: 1 paper

1 paper uses both Automatic Metrics and Simulation Env.
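
The overlap counts above are plain set arithmetic over the papers tagged with each evaluation mode. A minimal sketch follows. The paper IDs are hypothetical placeholders chosen only so that the set sizes match the counts on this page (18 papers with automatic metrics, 2 with a simulation environment, 1 with both).

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical paper IDs; only the set sizes mirror the page.
automatic_metrics = {f"paper_{i}" for i in range(1, 19)}  # paper_1 .. paper_18
simulation_env = {"paper_18", "paper_19"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)
```

The three counts sum to the 19 papers in the hub, which is a quick sanity check that every paper carries at least one of the two tags.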

Benchmark Brief: BrowseComp

Coverage: 2 papers (10.5%) mention BrowseComp.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Towards Efficient Agents: A Co-Design of Inference Architecture and System

Benchmark Brief: Retrieval

Coverage: 2 papers (10.5%) mention Retrieval.

Examples: HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG, RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA

Benchmark Brief: DocVQA

Coverage: 1 paper (5.3%) mentions DocVQA.

Examples: Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

Metric Brief: latency

Coverage: 19 papers (100%) mention latency.

Examples: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Ruyi2 Technical Report

Metric Brief: accuracy

Coverage: 9 papers (47.4%) mention accuracy.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG, Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Metric Brief: cost

Coverage: 3 papers (15.8%) mention cost.

Examples: Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers, vCache: Verified Semantic Prompt Caching

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems, Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization, Ruyi2 Technical Report

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao · Feb 26, 2026
Automatic Metrics · General
Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems.

Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026
Automatic Metrics · General
Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.

Ruyi2 Technical Report
Huan Song, Shuyu Tian, Junyi Hao, Minxiu Xu, Hongjun An · Feb 26, 2026
Automatic Metrics · General
Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies.

LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026
Automatic Metrics · General
The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.

Generative Pseudo-Labeling for Pre-Ranking with LLMs
Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang · Feb 24, 2026
Automatic Metrics · General
Pre-ranking is a critical stage in industrial recommendation systems, tasked with efficiently scoring thousands of recalled items for downstream ranking.

HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG
Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu · Feb 24, 2026
Automatic Metrics · General
Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8× speedup over leading Graph-based RAG baselines.

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026
Automatic Metrics · General
Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.

TabAgent: A Framework for Replacing Agentic Generative Components with Tabular-Textual Classifiers
Ido Levy, Eilam Shapira, Yinon Goldshtein, Avi Yaeli, Nir Mashkif · Feb 18, 2026
Automatic Metrics · General
Agentic systems, AI architectures that autonomously execute multi-step workflows to achieve complex goals, are often built using repeated large language model (LLM) calls for closed-set decision tasks such as routing, shortlisting, gating,

Discrete Stochastic Localization for Non-autoregressive Generation
Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis · Feb 18, 2026
Automatic Metrics · General
On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with ~4× fewer denoiser evaluations, and matches autoregressive quality at high budgets.

Overthinking Loops in Agents: A Structural Risk via MCP Tools
Yohan Lee, Jisoo Jang, Seoyeon Choi, Sangyeop Kim, Seungtaek Choi · Feb 16, 2026
Automatic Metrics · General
Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages.

Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring
Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang · Jan 20, 2026
Automatic Metrics · General
While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints.

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026
Simulation Env · General
Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie

Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025
Automatic Metrics · General
The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya · Nov 11, 2025
Automatic Metrics · General
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure.

RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025
Automatic Metrics · General
A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can

FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni · Oct 18, 2025
Automatic Metrics · General
Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.

On the Inference (In-)Security of Vertical Federated Learning: Efficient Auditing against Inference Tampering Attack
Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei · Jul 3, 2025
Automatic Metrics, Simulation Env · General
Vertical Federated Learning (VFL) is an emerging distributed learning paradigm for cross-silo collaboration without accessing participants' data.

vCache: Verified Semantic Prompt Caching
Luis Gaspar Schroeder, Aditya Desai, Alejandro Cuadron, Kyle Chu, Shu Liu · Feb 6, 2025
Automatic Metrics · General
We release the vCache implementation and four benchmarks to support future research.

Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu, Weihao Ye, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji · Mar 22, 2024
Automatic Metrics · General
To validate EAS, we apply it to a recently proposed MLLM called LaVIN and a classic VL pre-trained model called METER, and conduct extensive experiments on a set of benchmarks.
