
Metric Hub

Throughput In CS.AI Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Common metric signal: throughput. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 25, 2026.

Papers: 10 · Last published: Feb 25, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 10 papers for Throughput In CS.AI Papers. Dominant protocol signals include automatic metrics and simulation environments, with frequent benchmark focus on multiple benchmark families and metric focus on throughput and latency. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Metric Interpretation

  • throughput is reported in 100% of hub papers (10/10); compare with a secondary metric before ranking methods.
  • latency is reported in 30% of hub papers (3/10); compare with a secondary metric before ranking methods.
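The coverage percentages above reduce to a simple count over per-paper metadata. A minimal sketch, assuming a hypothetical record layout with a `metrics` set per paper (the field name is an assumption, not the hub's actual schema); the 10/10 and 3/10 split mirrors the bullets:

```python
# Hypothetical paper records: 3 papers report both metrics, 7 report
# only throughput, mirroring the 100% / 30% coverage stated above.
papers = [{"metrics": {"throughput", "latency"}}] * 3 + [
    {"metrics": {"throughput"}}
] * 7


def coverage(papers, metric):
    """Percentage of papers whose metadata names `metric`."""
    hits = sum(1 for p in papers if metric in p["metrics"])
    return 100 * hits / len(papers)


print(f"throughput: {coverage(papers, 'throughput'):.0f}%")  # 100%
print(f"latency: {coverage(papers, 'latency'):.0f}%")        # 30%
```

Any secondary-metric comparison then amounts to restricting the ranking to the subset where both metrics are present.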

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (0% vs 45% target).
  • Close gap on Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (0% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (10% vs 35% target).
  • Close gap on Papers with known annotation unit. Coverage is a replication risk (10% vs 35% target).
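The checklist above is a threshold check: any coverage below its target is flagged as a replication risk. A hedged sketch with the values copied from the bullets (the dict layout itself is an assumption):

```python
# (coverage %, target %) pairs taken from the checklist bullets above.
targets = {
    "explicit human feedback": (0, 45),
    "quality controls": (0, 30),
    "named benchmarks/datasets": (0, 35),
    "named evaluation metrics": (100, 35),
    "known rater population": (10, 35),
    "known annotation unit": (10, 35),
}

for name, (coverage, target) in targets.items():
    status = "replication risk" if coverage < target else "strength"
    print(f"{name}: {coverage}% vs {target}% target -> {status}")
```

Run as written, this flags five of the six items as replication risks, with only metric naming as a strength.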


Suggested Reading Order

  1. Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. The Headless Firm: How AI Reshapes Enterprise Boundaries

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  3. Towards single-shot coherent imaging via overlap-free ptychography

    Also offers detailed protocol reporting, including rater and quality-control evidence.

  4. CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

    Adds automatic metrics for broader coverage within this hub.

  5. Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

    Adds automatic metrics for broader coverage within this hub.

  6. Luna-2: Scalable Single-Token Evaluation with Small Language Models

    Adds automatic metrics for broader coverage within this hub.

  7. AI-Driven Structure Refinement of X-ray Diffraction

    Adds automatic metrics for broader coverage within this hub.

  8. PREFER: An Ontology for the PREcision FERmentation Community

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • No papers in this hub report quality controls (0% coverage); prioritize calibration/adjudication evidence.
  • Rater population is under-specified (10% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

automatic_metrics vs simulation_env

both = 0, left_only = 9, right_only = 1

No papers use both Automatic Metrics and Simulation Env.
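The both/left_only/right_only breakdown is plain set arithmetic over the two protocol groups. A minimal sketch with placeholder paper IDs; only the counts (0/9/1) mirror the hub's numbers:

```python
# Placeholder IDs: 9 papers tagged automatic_metrics, 1 tagged
# simulation_env, with no overlap, matching the breakdown above.
automatic_metrics = {f"paper_{i}" for i in range(9)}
simulation_env = {"paper_vla"}

both = automatic_metrics & simulation_env        # intersection
left_only = automatic_metrics - simulation_env   # automatic only
right_only = simulation_env - automatic_metrics  # simulation only

print(len(both), len(left_only), len(right_only))  # 0 9 1
```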

Top Papers Reporting This Metric

  • Self-Correcting VLA: Online Action Refinement via Sparse World Imagination

    Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026

    Simulation Env Coding

    Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.

  • The Headless Firm: How AI Reshapes Enterprise Boundaries

    Tassilo Klein, Sebastian Wieczorek · Feb 24, 2026

    Automatic Metrics General

    We argue that agentic AI induces a structural change in how coordination costs scale: in prior modular systems, integration cost grew with interaction topology (O(n²) in the number of components); in protocol-mediated agentic systems, inte…

  • Towards single-shot coherent imaging via overlap-free ptychography

    Oliver Hoidn, Aashwin Mishra, Steven Henke, Albert Vong, Matthew Seaberg · Feb 24, 2026

    Automatic Metrics General

    On synthetic benchmarks, reconstructions remain accurate at low counts (~10⁴ photons/frame), and overlap-free single-shot reconstruction with an experimental probe reaches amplitude structural similarity (SSIM) 0.904, compared with…

  • CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

    Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis · Feb 24, 2026

    Automatic Metrics Coding

    Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other str…

  • Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

    Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026

    Automatic Metrics General

    Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.

  • Luna-2: Scalable Single-Token Evaluation with Small Language Models

    Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

    Automatic Metrics Coding

    Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.

  • AI-Driven Structure Refinement of X-ray Diffraction

    Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang · Feb 18, 2026

    Automatic Metrics Law

    We benchmark WPEM on standard reference patterns (PbSO₄ and Tb₂BaCoO₅), where it yields lower R_p/R_wp than widely used packages (FullProf and TOPAS) under matched refinement conditions.

  • PREFER: An Ontology for the PREcision FERmentation Community

    Txell Amigó, Shawn Zheng Kai Tan, Angel Luu Phanthanourak, Sebastian Schulz, Pasquale D. Colaianni · Feb 18, 2026

    Automatic Metrics General

    Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels.

  • CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models

    Weining Fu, Kai Shu, Kui Xu, Qiangfeng Cliff Zhang · Feb 2, 2026

    Automatic Metrics General

    Cryo-electron microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-level visualization of biomolecular assemblies.

  • Semantic Parallelism: Redefining Efficient MoE Inference via Model-Data Co-Scheduling

    Yan Li, Zhenyu Zhang, Zhengang Wang, Pengfei Chen, Pengfei Zheng · Mar 6, 2025

    Automatic Metrics General

    Prevailing LLM serving engines employ expert parallelism (EP) to implement multi-device inference of massive MoE models.

Other Metric Hubs