Metric Hub

Throughput In CS.AI Papers

Updated from current HFEPX corpus (Feb 27, 2026). 10 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Common metric signal: throughput. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 25, 2026.

Papers: 10 Last published: Feb 25, 2026 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for Throughput In CS.AI Papers. Dominant protocol signals include automatic metrics, simulation environments, with frequent benchmark focus on multiple benchmark families and metric focus on throughput, latency. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

automatic metrics appears in 90% of papers in this hub.

Evidence: The Headless Firm: How AI Reshapes Enterprise Boundaries , Towards single-shot coherent imaging via overlap-free ptychography , CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference , Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
long-horizon tasks appears in 20% of papers, indicating agentic evaluation demand.

Evidence: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination , Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations , The Headless Firm: How AI Reshapes Enterprise Boundaries , Towards single-shot coherent imaging via overlap-free ptychography

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.

Evidence: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination , The Headless Firm: How AI Reshapes Enterprise Boundaries , Towards single-shot coherent imaging via overlap-free ptychography , CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling , Self-Correcting VLA: Online Action Refinement via Sparse World Imagination , The Headless Firm: How AI Reshapes Enterprise Boundaries , Towards single-shot coherent imaging via overlap-free ptychography
Track metric sensitivity by reporting both throughput and latency.

Evidence: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination , The Headless Firm: How AI Reshapes Enterprise Boundaries , Towards single-shot coherent imaging via overlap-free ptychography , CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Metric Interpretation

throughput is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
latency is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (0% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (10% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (10% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (0% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).

Papers naming benchmarks/datasets

Coverage is a replication risk (0% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).

Papers with known rater population

Coverage is a replication risk (10% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (10% vs 35% target).

Known Limitations

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Metric Slice: throughput - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=0, left_only=9, right_only=1

0 papers use both Automatic Metrics and Simulation Env.

Metric Brief

throughput

Coverage: 10 papers (100%)

10 papers (100%) mention throughput.

Examples: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination , The Headless Firm: How AI Reshapes Enterprise Boundaries , Towards single-shot coherent imaging via overlap-free ptychography

Metric Brief

latency

Coverage: 3 papers (30%)

3 papers (30%) mention latency.

Examples: CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference , Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations , Luna-2: Scalable Single-Token Evaluation with Small Language Models

Metric Brief

accuracy

Coverage: 2 papers (20%)

2 papers (20%) mention accuracy.

Examples: Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations , Luna-2: Scalable Single-Token Evaluation with Small Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Self-Correcting VLA: Online Action Refinement via Sparse World Imagination , The Headless Firm: How AI Reshapes Enterprise Boundaries , Towards single-shot coherent imaging via overlap-free ptychography

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026

Simulation Env Coding

Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
The Headless Firm: How AI Reshapes Enterprise Boundaries
Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026

Automatic Metrics General

We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n^2) in the number of components); in protocol-mediated agentic systems, inte
Towards single-shot coherent imaging via overlap-free ptychography
Oliver Hoidn, Aashwin Mishra, Steven Henke, Albert Vong, Matthew Seaberg · Feb 24, 2026

Automatic Metrics General

On synthetic benchmarks, reconstructions remain accurate at low counts ($\sim\!10^4$ photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches amplitude structural similarity (SSIM) 0.904, compared with
CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference
Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis · Feb 24, 2026

Automatic Metrics Coding

Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only \textbf{1\%} of the KV cache, delivers low-latency stable inference with up to \textbf{4.56$\times$} higher throughput, and consistently outperforms other str
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026

Automatic Metrics General

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

Automatic Metrics Coding

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
AI-Driven Structure Refinement of X-ray Diffraction
Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang · Feb 18, 2026

Automatic Metrics Law

We benchmark WPEM on standard reference patterns (PbSO$_4$ and Tb$_2$BaCoO$_5$), where it yields lower $R_p/R_{wp}$ than widely used packages (FullProf and TOPAS) under matched refinement conditions.
PREFER: An Ontology for the PREcision FERmentation Community
Txell Amigó, Shawn Zheng Kai Tan, Angel Luu Phanthanourak, Sebastian Schulz, Pasquale D. Colaianni · Feb 18, 2026

Automatic Metrics General

Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels.
CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
Weining Fu, Kai Shu, Kui Xu, Qiangfeng Cliff Zhang · Feb 2, 2026

Automatic Metrics General

Cryo-electron microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-level visualization of biomolecular assemblies.
Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling
Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025

Automatic Metrics General

Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models.

Throughput In CS.AI Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers Reporting This Metric

Other Metric Hubs