HFEPX Metric Hub

Throughput In CS.CL Papers

Updated from current HFEPX corpus (Apr 9, 2026). 20 papers are grouped in this metric page.

Read Full Context

Updated from current HFEPX corpus (Apr 9, 2026). 20 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Llm As Judge. Most common rater population: Domain Experts. Frequently cited benchmark: BrowseComp. Common metric signal: throughput. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Apr 3, 2026.

Papers: 20 Last published: Apr 3, 2026 Global RSS

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Medium .

Metric Coverage

40.0%

8 sampled papers include metric names.

Benchmark Anchoring

10.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

0 papers report calibration/adjudication/IAA controls.

20 papers are not low-signal flagged in this sample.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Why This Matters (Expanded)

Why This Matters For Eval Research

25% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 35% of papers in this hub.
BrowseComp is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly mixed annotation units; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Metric Interpretation

throughput is reported in 100% of hub papers (8/20); compare with a secondary metric before ranking methods.
accuracy is reported in 50% of hub papers (4/20); compare with a secondary metric before ranking methods.

Benchmark Context

BrowseComp appears in 12.5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.
SQuAD appears in 12.5% of hub papers (1/20); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Apr 7, 2026 · Citations: 0 · Score: 8.0

Metrics: F1, Latency · Eval: Llm As Judge, Automatic Metrics
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Apr 6, 2026 · Citations: 0 · Score: 7.5

Metrics: Accuracy, Throughput · Eval: Automatic Metrics
Towards Efficient Agents: A Co-Design of Inference Architecture and System
Dec 20, 2025 · Citations: 0 · Score: 7.0

Metrics: Accuracy, Latency · Eval: Automatic Metrics
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
Apr 3, 2026 · Citations: 0 · Score: 6.5

Metrics: Throughput · Eval: Not reported
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
Mar 18, 2026 · Citations: 0 · Score: 6.5

Metrics: Throughput · Eval: Automatic Metrics
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
Mar 18, 2026 · Citations: 0 · Score: 6.5

Metrics: Throughput, Context length · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Metrics	Benchmarks	Eval Modes	Quality Controls
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations Apr 7, 2026	F1, Latency	SQuAD	Llm As Judge, Automatic Metrics	Not reported
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression Apr 6, 2026	Accuracy, Throughput	Not reported	Automatic Metrics	Not reported
Towards Efficient Agents: A Co-Design of Inference Architecture and System Dec 20, 2025	Accuracy, Latency	BrowseComp	Automatic Metrics	Not reported
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency Apr 3, 2026	Throughput	Not reported	Not reported	Not reported
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing Mar 18, 2026	Throughput	Not reported	Automatic Metrics	Not reported
Learning When to Attend: Conditional Memory Access for Long-Context LLMs Mar 18, 2026	Throughput, Context length	Not reported	Automatic Metrics	Not reported
Luna-2: Scalable Single-Token Evaluation with Small Language Models Feb 20, 2026	Accuracy, Latency	Not reported	Llm As Judge, Automatic Metrics	Not reported
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations Feb 22, 2026	Accuracy, Latency	Not reported	Automatic Metrics	Not reported
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning Mar 24, 2026	Not reported	Not reported	Not reported	Not reported
Sparser, Faster, Lighter Transformer Language Models Mar 24, 2026	Not reported	Not reported	Not reported	Not reported

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback

Coverage is a replication risk (25% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (25% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (12.5% vs 35% target).
Gap: Papers with known annotation unit

Coverage is a replication risk (0% vs 35% target).

Strengths

Agentic evaluation appears in 50% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BrowseComp vs SQuAD) before comparing methods.
Track metric sensitivity by reporting both throughput and accuracy.

Recommended Queries

LLM-as-Judge Protocols Benchmark Slice: BrowseComp Metric Slice: throughput Recent High-Signal Papers

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (12.5% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

Throughput (8)
Accuracy (4)
Latency (4)
Context length (1)

Evaluation Modes

Automatic Metrics (7)
Llm As Judge (2)

Top Benchmarks

BrowseComp (1)
SQuAD (1)

Agentic Mix

Long Horizon (4)

Top Papers Reporting This Metric

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
Aichen Cai, Anmeng Zhang, Anyu Li, Bo Zhang, Bohua Cai · Apr 3, 2026 · Citations: 0

General

JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale…
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu · Apr 6, 2026 · Citations: 0

Automatic Metrics Law

Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation.
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations
Shoaib Sadiq Salehmohamed, Jinal Prashant Thakkar, Hansika Aredla, Shaik Mohammed Omar, Shalmali Ayachit · Apr 7, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

We introduce a weak supervision framework that combines three complementary grounding signals: substring matching, sentence embedding similarity, and an LLM as a judge verdict to label generated responses as grounded or hallucinated without…
Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025 · Citations: 0

Automatic Metrics General

The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g.
Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing
Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott · Mar 18, 2026 · Citations: 0

Automatic Metrics General

Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12\% on LLaMA3 and 8--12\% on Qwen3, and achieving throughput gains of up to 15--19\%.
Learning When to Attend: Conditional Memory Access for Long-Context LLMs
Sakshi Choudhary, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Matthew Trager · Mar 18, 2026 · Citations: 0

Automatic Metrics General

Based on this, we propose L2A (Learning To Attend), a layer that enables conditional (token-wise) long-range memory access by deciding when to invoke global attention.
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026 · Citations: 0

Automatic Metrics General

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji · Mar 24, 2026 · Citations: 0
Sparser, Faster, Lighter Transformer Language Models
Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami · Mar 24, 2026 · Citations: 0
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
Yixuan Wang, Shiyu Ji, Yijun Liu, Qingfu Zhu, Wanxiang Che · Mar 24, 2026 · Citations: 0
Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies
Siddhant Kulkarni, Yukta Kulkarni · Mar 24, 2026 · Citations: 0
Autoregressive vs. Masked Diffusion Language Models: A Controlled Comparison
Caio Vicentino · Mar 23, 2026 · Citations: 0
Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
Quy-Anh Dang, Chris Ngo · Mar 17, 2026 · Citations: 0
GLM-OCR Technical Report
Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu · Mar 11, 2026 · Citations: 0
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar · Mar 5, 2026 · Citations: 0
Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
Ofir Ben Shoham · Mar 5, 2026 · Citations: 0
VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications
Hung Vu Nguyen, Loan Do, Thanh Ngoc Nguyen, Ushik Shrestha Khwakhali, Thanh Pham · Mar 4, 2026 · Citations: 0
Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving
Ferran Agullo, Joan Oliveras, Chen Wang, Alberto Gutierrez-Torre, Olivier Tardieu · Feb 27, 2026 · Citations: 0
Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
Mengjie Liu, Jiahui Peng, Wenchang Ning, Pei Chu, Jiantao Qiu · Nov 28, 2025 · Citations: 0

Related Metric Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote