HFEPX Metric Hub

Throughput Metric Papers

Updated from current HFEPX corpus (Apr 12, 2026). 48 papers are grouped in this metric page.

Read Full Context

Updated from current HFEPX corpus (Apr 12, 2026). 48 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequently cited benchmark: AIME. Common metric signal: throughput. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 23, 2026.

Papers: 48 Last published: Mar 23, 2026 Global RSS

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Medium .

Metric Coverage

22.9%

11 sampled papers include metric names.

Benchmark Anchoring

4.2%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

48 papers are not low-signal flagged in this sample.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

4.2% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 16.7% of papers in this hub.
AIME is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

throughput is reported in 100% of hub papers (48/48); compare with a secondary metric before ranking methods.
accuracy is reported in 35.4% of hub papers (17/48); compare with a secondary metric before ranking methods.

Benchmark Context

AIME appears in 2.1% of hub papers (1/48); use this cohort for benchmark-matched comparisons.
BrowseComp appears in 2.1% of hub papers (1/48); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Apr 7, 2026 · Citations: 0 · Score: 8.0

Metrics: F1, Latency · Eval: Llm As Judge, Automatic Metrics
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Apr 6, 2026 · Citations: 0 · Score: 7.5

Metrics: Accuracy, Throughput · Eval: Automatic Metrics
Towards Efficient Agents: A Co-Design of Inference Architecture and System
Dec 20, 2025 · Citations: 0 · Score: 7.0

Metrics: Accuracy, Latency · Eval: Automatic Metrics
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Mar 23, 2026 · Citations: 0 · Score: 6.5

Metrics: Throughput · Eval: Simulation Env
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
Apr 3, 2026 · Citations: 0 · Score: 6.5

Metrics: Throughput · Eval: Not reported
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
Mar 18, 2026 · Citations: 0 · Score: 6.5

Metrics: Throughput · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Metrics	Benchmarks	Eval Modes	Quality Controls
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations Apr 7, 2026	F1, Latency	SQuAD	Llm As Judge, Automatic Metrics	Not reported
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression Apr 6, 2026	Accuracy, Throughput	Not reported	Automatic Metrics	Not reported
Towards Efficient Agents: A Co-Design of Inference Architecture and System Dec 20, 2025	Accuracy, Latency	BrowseComp	Automatic Metrics	Not reported
Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications Mar 23, 2026	Throughput	Not reported	Simulation Env	Not reported
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency Apr 3, 2026	Throughput	Not reported	Not reported	Not reported
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing Mar 18, 2026	Throughput	Not reported	Automatic Metrics	Not reported
Learning When to Attend: Conditional Memory Access for Long-Context LLMs Mar 18, 2026	Throughput, Context length	Not reported	Automatic Metrics	Not reported
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination Feb 25, 2026	Success rate, Throughput	Not reported	Simulation Env	Not reported
Luna-2: Scalable Single-Token Evaluation with Small Language Models Feb 20, 2026	Accuracy, Latency	Not reported	Llm As Judge, Automatic Metrics	Not reported
The Headless Firm: How AI Reshapes Enterprise Boundaries Feb 24, 2026	Throughput, Cost	Not reported	Automatic Metrics	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (4.2% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (16.7% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (2.1% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (4.2% vs 35% target).

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (2.1% coverage).
Annotation unit is under-specified (4.2% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (AIME vs BrowseComp) before comparing methods.
Track metric sensitivity by reporting both throughput and accuracy.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: AIME Metric Slice: throughput Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (2.1% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

Throughput (48)
Accuracy (17)
Cost (15)
Latency (13)

Evaluation Modes

Automatic Metrics (8)
Llm As Judge (2)
Simulation Env (2)

Top Benchmarks

AIME (1)
BrowseComp (1)
GPQA (1)
HR Bench (1)

Agentic Mix

Long Horizon (6)
Multi Agent (2)

Top Papers Reporting This Metric

Spatio-Temporal Attention Enhanced Multi-Agent DRL for UAV-Assisted Wireless Networks with Limited Communications
Che Chen, Lanhua Li, Shimin Gong, Yu Zhao, Yuming Fang · Mar 23, 2026 · Citations: 0

Simulation Env General

To maximize the overall throughput, we first propose a delay-tolerant multi-agent deep reinforcement learning (MADRL) algorithm that integrates a delay-penalized reward to encourage information sharing among UAVs, while jointly optimizing…
Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0

Simulation Env Coding

Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai · Apr 3, 2026 · Citations: 0

General

JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale…
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu · Apr 6, 2026 · Citations: 0

Automatic Metrics Law

Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation.
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025 · Citations: 0

Automatic Metrics General

The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott · Mar 18, 2026 · Citations: 0

Automatic Metrics General

Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12\% on LLaMA3 and 8--12\% on Qwen3, and achieving throughput gains of up to 15--19\%.
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager · Mar 18, 2026 · Citations: 0

Automatic Metrics General

Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention.
The Headless Firm: How AI Reshapes Enterprise Boundaries
Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026 · Citations: 0

Automatic Metrics General

We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, inte
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026 · Citations: 0

Automatic Metrics General

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation
Andi Gu, J. Pablo Bonilla Ataides, Mikhail D. Lukin, Susanne F. Yelin · Apr 9, 2026 · Citations: 0
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun · Apr 9, 2026 · Citations: 0
GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning
Kaiyuan Tian, Yu Tang, Gongqingjian Jiang, Baihui Liu, Yifu Gao · Apr 9, 2026 · Citations: 0
Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang · Apr 9, 2026 · Citations: 0
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra · Apr 6, 2026 · Citations: 0
SM-Net: Learning a Continuous Spectral Manifold from Multiple Stellar Libraries
Omar Anwar, Aaron S. G. Robotham, Luca Cortese, Kevin Vinsen · Mar 25, 2026 · Citations: 0
Learning-guided Prioritized Planning for Lifelong Multi-Agent Path Finding in Warehouse Automation
Han Zheng, Yining Ma, Brandon Araki, Jingkai Chen, Cathy Wu · Mar 25, 2026 · Citations: 0
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji · Mar 24, 2026 · Citations: 0
Sparser, Faster, Lighter Transformer Language Models
Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami · Mar 24, 2026 · Citations: 0
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
Yixuan Wang, Shiyu Ji, Yijun Liu, Qingfu Zhu, Wanxiang Che · Mar 24, 2026 · Citations: 0
Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies
Siddhant Kulkarni, Yukta Kulkarni · Mar 24, 2026 · Citations: 0
Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison
Caio Vicentino · Mar 23, 2026 · Citations: 0
A Blueprint for Self-Evolving Coding Agents in Vehicle Aerodynamic Drag Prediction
Jinhui Ren, Huaiming Li, Yabin Liu, Tao Li, Zhaokun Liu · Mar 23, 2026 · Citations: 0
MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
Dong Liu, Yanxuan Yu, Ben Lengerich, Ying Nian Wu · Mar 21, 2026 · Citations: 0
cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization
Yuyang Liu · Mar 19, 2026 · Citations: 0
Behavioral Fingerprints for LLM Endpoint Stability and Identity
Jonah Leshin, Manish Shah, Ian Timmis, Daniel Kang · Mar 19, 2026 · Citations: 0
Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
Quy-Anh Dang, Chris Ngo · Mar 17, 2026 · Citations: 0
SRAM-Based Compute-in-Memory Accelerator for Linear-decay Spiking Neural Networks
Hongyang Shang, Shuai Dong, Yahan Yang, Junyi Yang, Peng Zhou · Mar 13, 2026 · Citations: 0
Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity
Donglin Yu · Mar 13, 2026 · Citations: 0
FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control
Jun Xue, Junze Wang, Xinming Zhang, Shanze Wang, Yanjun Chen · Mar 13, 2026 · Citations: 0
Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials
Abhinaba Basu, Pavan Chakraborty · Mar 12, 2026 · Citations: 0
Automatic Generation of High-Performance RL Environments
Seth Karten, Rahul Dev Appapogu, Chi Jin · Mar 12, 2026 · Citations: 0
Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability
Xingyu Xie, Zhaochen Yu, Yue Liao, Tao Wang, Kim-Chuan Toh · Mar 12, 2026 · Citations: 0
GLM-OCR Technical Report
Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu · Mar 11, 2026 · Citations: 0
A Platform-Agnostic Multimodal Digital Human Modelling Framework: Neurophysiological Sensing in Game-Based Interaction
Daniel J. Buxton, Mufti Mahmud, Jordan J. Bird, Thomas Hughes-Roberts, David J. Brown · Mar 11, 2026 · Citations: 0
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar · Mar 5, 2026 · Citations: 0
Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
Ofir Ben Shoham · Mar 5, 2026 · Citations: 0
VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications
Hung Vu Nguyen, Loan Do, Thanh Ngoc Nguyen, Ushik Shrestha Khwakhali, Thanh Pham · Mar 4, 2026 · Citations: 0
Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu · Feb 27, 2026 · Citations: 0
IMMACULATE: A Practical LLM Auditing Framework via Verifiable Computation
Yanpei Guo, Wenjie Qu, Linyu Wu, Shengfang Zhai, Lionel Z. Wang · Feb 26, 2026 · Citations: 0
Predicting LLM Output Length via Entropy-Guided Representations
Huanyi Xie, Yubin Chen, Liangyu Wang, Lijie Hu, Di Wang · Feb 12, 2026 · Citations: 0
Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning
Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Wenlei Shi · Feb 3, 2026 · Citations: 0
Toward Closed-loop Molecular Discovery via Language Model, Property Alignment and Strategic Search
Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li · Dec 10, 2025 · Citations: 0
ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han · Dec 10, 2025 · Citations: 0
Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
Mengjie Liu, Jiahui Peng, Wenchang Ning, Pei Chu, Jiantao Qiu · Nov 28, 2025 · Citations: 0
Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning
Jian Lu · Nov 24, 2025 · Citations: 0
Unicorn: A Universal and Collaborative Reinforcement Learning Approach Towards Generalizable Network-Wide Traffic Signal Control
Yifeng Zhang, Yilin Liu, Ping Gong, Peizhuo Li, Mingfeng Fan · Mar 14, 2025 · Citations: 0

Related Metric Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote