
HFEPX Metric Hub

Throughput in cs.LG Papers


Updated from the current HFEPX corpus (Apr 9, 2026). 17 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most frequently cited benchmark: HumanoidBench. Common metric signal: throughput. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Apr 7, 2026.

Papers: 17 | Last published: Apr 7, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage

17.6%

3 of the 17 sampled papers include explicit metric names.

Benchmark Anchoring

5.9%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

0.0%

No papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.

  • None of the 17 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as a directional signal only; metric reporting is present, but benchmark anchoring is still thin.

Why This Matters For Eval Research

  • automatic metrics appear in 17.6% of papers in this hub.
  • HumanoidBench is a recurring benchmark anchor for cross-paper comparisons on this page.
  • long-horizon tasks appear in 5.9% of papers, indicating demand for agentic evaluation.

Metric-Driven Protocol Takeaways

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (HumanoidBench vs SQuAD) before comparing methods.

Metric Interpretation

  • throughput is reported in 100% of hub papers (17/17); compare with a secondary metric before ranking methods.
  • cost is reported in 47.1% of hub papers (8/17); compare with a secondary metric before ranking methods.
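
The percentages above are simple ratios over the 17 papers in this hub. A minimal sketch of that arithmetic, using the counts listed in the Top Metrics section further down this page (Python is assumed here for illustration; it is not part of the hub tooling):

```python
# Minimal sketch: coverage percentages from the Top Metrics counts on this page.
TOTAL_PAPERS = 17
metric_counts = {"Throughput": 17, "Cost": 8, "Latency": 7, "Accuracy": 5}

for metric, count in metric_counts.items():
    coverage = 100.0 * count / TOTAL_PAPERS
    print(f"{metric}: {count}/{TOTAL_PAPERS} papers ({coverage:.1f}%)")
# Throughput: 17/17 papers (100.0%)
# Cost: 8/17 papers (47.1%)
```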

Benchmark Context

  • HumanoidBench appears in 5.9% of hub papers (1/17); use this cohort for benchmark-matched comparisons.
  • SQuAD appears in 5.9% of hub papers (1/17); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

  • Weakly Supervised Distillation of Hallucination Signals into Transformer Representations (Apr 7, 2026)
    Metrics: F1, Latency | Benchmarks: SQuAD | Eval Modes: LLM-as-Judge, Automatic Metrics | Quality Controls: Not reported
  • Learning When to Attend: Conditional Memory Access for Long-Context LLMs (Mar 18, 2026)
    Metrics: Throughput, Context length | Benchmarks: Not reported | Eval Modes: Automatic Metrics | Quality Controls: Not reported
  • Luna-2: Scalable Single-Token Evaluation with Small Language Models (Feb 20, 2026)
    Metrics: Accuracy, Latency | Benchmarks: Not reported | Eval Modes: LLM-as-Judge, Automatic Metrics | Quality Controls: Not reported
  • FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control (Apr 6, 2026)
    Metrics: Not reported | Benchmarks: Not reported | Eval Modes: Not reported | Quality Controls: Not reported
  • Sparser, Faster, Lighter Transformer Language Models (Mar 24, 2026)
    Metrics: Not reported | Benchmarks: Not reported | Eval Modes: Not reported | Quality Controls: Not reported
  • Benchmarking Multi-Agent LLM Architectures for Financial Document Processing: A Comparative Study of Orchestration Patterns, Cost-Accuracy Tradeoffs and Production Scaling Strategies (Mar 24, 2026)
    Metrics: Not reported | Benchmarks: Not reported | Eval Modes: Not reported | Quality Controls: Not reported
  • MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning (Mar 21, 2026)
    Metrics: Not reported | Benchmarks: Not reported | Eval Modes: Not reported | Quality Controls: Not reported
  • Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity (Mar 13, 2026)
    Metrics: Not reported | Benchmarks: Not reported | Eval Modes: Not reported | Quality Controls: Not reported
  • FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control (Mar 13, 2026)
    Metrics: Not reported | Benchmarks: Not reported | Eval Modes: Not reported | Quality Controls: Not reported
  • Proof-Carrying Materials: Falsifiable Safety Certificates for Machine-Learned Interatomic Potentials (Mar 12, 2026)
    Metrics: Not reported | Benchmarks: Not reported | Eval Modes: Not reported | Quality Controls: Not reported
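
To act on the triage guidance above and avoid comparing metrics across incompatible eval setups, one minimal sketch is to transcribe the matrix rows into simple records and group papers by benchmark and eval modes before any cross-paper comparison. The record fields and helper below are illustrative, not an HFEPX API; only the first three rows are transcribed.

```python
from collections import defaultdict

# Illustrative records transcribed from the first three matrix rows above;
# None stands in for "Not reported".
papers = [
    {"title": "Weakly Supervised Distillation of Hallucination Signals into Transformer Representations",
     "metrics": ["F1", "Latency"], "benchmark": "SQuAD",
     "eval_modes": ["LLM-as-Judge", "Automatic Metrics"]},
    {"title": "Learning When to Attend: Conditional Memory Access for Long-Context LLMs",
     "metrics": ["Throughput", "Context length"], "benchmark": None,
     "eval_modes": ["Automatic Metrics"]},
    {"title": "Luna-2: Scalable Single-Token Evaluation with Small Language Models",
     "metrics": ["Accuracy", "Latency"], "benchmark": None,
     "eval_modes": ["LLM-as-Judge", "Automatic Metrics"]},
]

def comparable_cohorts(papers):
    """Group papers by (benchmark, eval modes); only papers within the same
    cohort share a protocol setup that supports direct metric comparison."""
    cohorts = defaultdict(list)
    for p in papers:
        key = (p["benchmark"], tuple(sorted(p["eval_modes"])))
        cohorts[key].append(p["title"])
    return cohorts

for (benchmark, modes), titles in comparable_cohorts(papers).items():
    print(benchmark, modes, "->", len(titles), "paper(s)")
```

Papers whose benchmark or eval mode is "Not reported" fall into their own cohorts, which is the conservative behavior this hub recommends.
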
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (0% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (11.8% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (100% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (0% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (0% vs 35% target).
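
The gap/strength flags in the checklist above reduce to a coverage-versus-target comparison. A minimal sketch using the figures quoted there (field names are illustrative):

```python
# Coverage vs. target values copied from the checklist above (percentages).
coverage_vs_target = {
    "human_feedback":   (0.0, 45.0),
    "quality_controls": (0.0, 30.0),
    "benchmark_naming": (11.8, 35.0),
    "metric_naming":    (100.0, 35.0),
    "rater_population": (0.0, 35.0),
    "annotation_unit":  (0.0, 35.0),
}

for field, (coverage, target) in coverage_vs_target.items():
    status = "Strong" if coverage >= target else "Gap: replication risk"
    print(f"{field}: {coverage:.1f}% vs {target:.1f}% target -> {status}")
```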

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers (0%) report quality controls; prioritize papers with explicit calibration or adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (0% coverage).

Suggested Next Analyses

  • Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
  • Stratify by benchmark (HumanoidBench vs SQuAD) before comparing methods.
  • Track metric sensitivity by reporting both throughput and cost.
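
For the last point, one way to track metric sensitivity is to rank candidate methods under throughput and under cost separately and check whether the orderings agree. A minimal sketch with placeholder numbers (not results from the papers in this hub):

```python
# Hypothetical per-method results: higher throughput is better, lower cost is better.
results = {
    "method_a": {"throughput": 1200.0, "cost": 0.80},
    "method_b": {"throughput": 950.0, "cost": 0.50},
    "method_c": {"throughput": 1100.0, "cost": 1.10},
}

def ranking(results, metric, higher_is_better=True):
    """Method names ordered best-to-worst under a single metric."""
    return sorted(results, key=lambda m: results[m][metric], reverse=higher_is_better)

by_throughput = ranking(results, "throughput", higher_is_better=True)
by_cost = ranking(results, "cost", higher_is_better=False)

# Disagreement between the two orderings means a single-metric ranking is
# sensitive to metric choice and should not be reported alone.
print("throughput ranking:", by_throughput)
print("cost ranking:", by_cost)
print("rankings agree:", by_throughput == by_cost)
```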


Known Limitations

  • No papers (0%) report quality controls; prioritize papers with explicit calibration or adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Top Metrics

  • Throughput (17)
  • Cost (8)
  • Latency (7)
  • Accuracy (5)

Evaluation Modes

  • Automatic Metrics (3)
  • LLM-as-Judge (2)

Top Benchmarks

  • HumanoidBench (1)
  • SQuAD (1)

Agentic Mix

  • Long Horizon (1)
