HFEPX Metric Hub

Latency Metric Papers

Updated from current HFEPX corpus (2026-04-13). This page tracks 60 papers for Latency. Use it to compare how latency is measured across human feedback and evaluation studies.

Papers: 60 · Last published: Apr 9, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: High.

Metric Coverage

100.0%

60 sampled papers include metric names.

Benchmark Anchoring

25.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

5.0%

3 papers report calibration/adjudication/IAA controls.

  • None of the 60 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible evaluation setups.

Primary action: Use the top metric-reliable papers first, then compare benchmark context in the matrix before drawing conclusions.
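The triage step above can be sketched programmatically. This is a minimal, hypothetical example: the record fields and the completeness scoring rule are assumptions for illustration, not the hub's actual schema.

```python
# Rank papers by metric-reporting completeness, breaking ties by recency.
# Record fields below are illustrative assumptions, not the hub's schema.
from datetime import date

papers = [
    {"title": "PASK", "metrics": ["Precision", "Latency"],
     "benchmarks": ["Latentneeds Bench"], "quality_controls": [],
     "published": date(2026, 4, 9)},
    {"title": "SemLink", "metrics": ["Recall", "Latency"],
     "benchmarks": [], "quality_controls": ["Calibration"],
     "published": date(2026, 4, 7)},
]

def completeness(paper):
    """One point each for named metrics, benchmark anchors, and quality controls."""
    return (bool(paper["metrics"])
            + bool(paper["benchmarks"])
            + bool(paper["quality_controls"]))

# Most complete (and, among ties, most recent) papers first.
ranked = sorted(papers, key=lambda p: (completeness(p), p["published"]), reverse=True)
for p in ranked:
    print(p["title"], completeness(p))
```

Any weighting would work here; the point is to filter on reporting completeness before comparing benchmark context.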

Why This Matters (Expanded)

Why This Matters For Eval Research

  • Use this page to compare how latency is operationalized across benchmarks and rater setups.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

  • Latency is most often paired with automatic metrics; a small number of papers also pair it with LLM-as-judge evaluation.

Metric Interpretation

  • latency: 60 papers
  • accuracy: 20 papers
  • cost: 15 papers
  • throughput: 7 papers

Benchmark Context

  • DROP: 2 papers
  • MS MARCO: 2 papers
  • ARC-Challenge: 1 paper

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

| Paper | Published | Metrics | Benchmarks | Eval Modes | Quality Controls |
| --- | --- | --- | --- | --- | --- |
| PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory | Apr 9, 2026 | Precision, Latency | Latentneeds Bench | Automatic Metrics | Not reported |
| Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models | Apr 8, 2026 | Accuracy, Latency | GSM8K, TruthfulQA | Automatic Metrics | Not reported |
| SemLink: A Semantic-Aware Automated Test Oracle for Hyperlink Verification using Siamese Sentence-BERT | Apr 7, 2026 | Recall, Latency | Not reported | Automatic Metrics | Calibration |
| Weakly Supervised Distillation of Hallucination Signals into Transformer Representations | Apr 7, 2026 | F1, Latency | SQuAD | LLM-as-Judge, Automatic Metrics | Not reported |
| Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency | Apr 6, 2026 | Accuracy, Pass@1 | Full Duplex Bench | Automatic Metrics | Not reported |
| SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks | Apr 2, 2026 | Accuracy, Latency | Not reported | Automatic Metrics, Simulation Env | Calibration |
| FGR-ColBERT: Identifying Fine-Grained Relevance Tokens During Retrieval | Mar 31, 2026 | F1, Recall | MS MARCO | Automatic Metrics | Not reported |
| LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications | Mar 28, 2026 | Latency, Latency p95 | BEIR | Automatic Metrics | Not reported |
| FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified? | Mar 27, 2026 | Accuracy, Latency | FormalProofBench | Automatic Metrics | Not reported |
| MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control | Apr 7, 2026 | Latency | Not reported | Automatic Metrics | Not reported |
Researcher Workflow (Detailed)

Checklist

  • Gap: Human feedback

    Human feedback is present in 2 of 60 papers.

  • Gap: Quality controls

    Quality controls are present in 3 of 60 papers.

  • Gap: Benchmarks

    Benchmarks are present in 15 of 60 papers.

  • Strong: Metrics

    Metrics are present in 60 of 60 papers.

  • Gap: Known rater population

    A known rater population is present in 5 of 60 papers.

  • Gap: Known annotation unit

    A known annotation unit is present in 12 of 60 papers.
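The "present in N of M papers" fractions in the checklist above can be recomputed from raw extraction records with a short sketch. The field names here are assumed for illustration, not the hub's actual extraction schema.

```python
# Compute per-field coverage ("present in N of M papers") over extraction records.
def coverage(records, field):
    """Return (hits, total), counting records where `field` is non-empty/truthy."""
    hits = sum(1 for r in records if r.get(field))
    return hits, len(records)

# Toy records with assumed field names.
records = [
    {"metrics": ["Latency"], "human_feedback": False},
    {"metrics": ["Accuracy", "Latency"], "human_feedback": True},
    {"metrics": ["Latency"], "human_feedback": False},
]

hits, total = coverage(records, "metrics")
print(f"Metrics present in {hits} of {total} papers")  # Metrics present in 3 of 3 papers
```

Running the same function per field yields the gap/strength split: any field covered in nearly all records is a strength, the rest are gaps.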

Strengths

  • Metrics are present in 60 of 60 papers.

Known Gaps

  • Human feedback is present in 2 of 60 papers.
  • Quality controls are present in 3 of 60 papers.
  • Benchmarks are present in 15 of 60 papers.

Suggested Next Analyses

  • Review the most recent latency papers first, then compare benchmark context before reusing the metric.

Recommended Queries

Known Limitations

  • This page was generated synthetically from extraction data because the cached metric payload for latency was missing.

Research Utility Snapshot (Detailed)

Top Metrics

  • Latency (60)
  • Accuracy (20)
  • Cost (15)
  • Throughput (7)

Evaluation Modes

  • Automatic Metrics (45)
  • LLM-as-Judge (2)
  • Simulation Env (2)

Top Benchmarks

  • DROP (2)
  • MS MARCO (2)
  • ARC Challenge (1)
  • BEIR (1)

Agentic Mix

  • None (53)
  • Long Horizon (4)
  • Multi Agent (2)
  • Tool Use (2)

Top Papers Reporting This Metric

Related Metric Hubs
