

HFEPX Daily Archive: 2025-10-21


Updated from the current HFEPX corpus (Mar 10, 2026). This daily page groups 5 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: CAPArena. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Oct 21, 2025.

Papers: 5 · Last published: Oct 21, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Developing.

High-Signal Coverage: 100.0%

5 of 5 papers are not flagged as low-signal.

Benchmark Anchors: 60.0%

Papers with benchmark/dataset mentions in the extraction output.

Metric Anchors: 60.0%

Papers with reported metric mentions in the extraction output.

  • No papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
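For concreteness, here is a minimal sketch of how the triage percentages above fall out of per-paper extraction records, using the Protocol Matrix rows listed further down. The record shape is an assumption for illustration, not the actual HFEPX pipeline format.

```python
# Minimal records mirroring the Protocol Matrix below (schema is assumed).
papers = [
    {"benchmarks": ["CAPArena"],    "metrics": ["Spearman"],  "low_signal": False},  # PoSh
    {"benchmarks": ["LongMemEval"], "metrics": ["Accuracy"],  "low_signal": False},  # LightMem
    {"benchmarks": [],              "metrics": ["Relevance"], "low_signal": False},  # KrishokBondhu
    {"benchmarks": [],              "metrics": [],            "low_signal": False},  # MoMaGen
    {"benchmarks": ["GAR-Bench", "DLC-Bench"], "metrics": [], "low_signal": False},  # Grasp Any Region
]

def coverage(pred):
    """Percentage of papers in the slice satisfying `pred`."""
    return 100.0 * sum(1 for p in papers if pred(p)) / len(papers)

print(coverage(lambda p: not p["low_signal"]))    # 100.0 -> high-signal coverage
print(coverage(lambda p: bool(p["benchmarks"])))  # 60.0  -> benchmark anchors
print(coverage(lambda p: bool(p["metrics"])))     # 60.0  -> metric anchors
```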


Why This Time Slice Matters

  • 40% of papers report explicit human-feedback signals, led by demonstration data.
  • Automatic metrics appear in 40% of papers in this hub.
  • CAPArena recurs as a benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

  • One sampled paper (PoSh) reports both human evaluation and LLM-as-judge, supporting direct agreement checks (see the sketch after this list).
  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Raters are mostly domain experts and annotation is commonly freeform; use this to scope replication staffing.
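Where a paper reports both signals, the agreement check can be as simple as correlating per-item human ratings with judge ratings. A minimal sketch follows; the scores are illustrative placeholders, not values from PoSh or any other paper in this slice.

```python
# Hedged sketch: rank agreement between human raters and an LLM judge.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2]  # hypothetical per-item human ratings
judge_scores = [5, 2, 4, 3, 2, 4, 1]  # hypothetical LLM-as-judge ratings

rho, p = spearmanr(human_scores, judge_scores)
print(f"judge-human Spearman rho = {rho:.2f} (p = {p:.3f})")
# Tracking rho across archive slices surfaces the agreement drift flagged
# under Suggested Next Analyses below.
```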

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (All 5 Papers)

Quickly compare method ingredients across this archive slice.

| Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls |
| --- | --- | --- | --- | --- | --- |
| PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions | Oct 21, 2025 | Human Eval, LLM-as-Judge | CAPArena | Spearman | Not reported |
| LightMem: Lightweight and Efficient Memory-Augmented Generation | Oct 21, 2025 | Automatic Metrics | LongMemEval | Accuracy | Not reported |
| KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers | Oct 21, 2025 | Automatic Metrics | Not reported | Relevance | Not reported |
| MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation | Oct 21, 2025 | Simulation Env | Not reported | Not reported | Not reported |
| Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs | Oct 21, 2025 | Not reported | GAR-Bench, DLC-Bench | Not reported | Not reported |
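The matrix rows lend themselves to structured records for the period-over-period review this page targets. Below is a minimal sketch, assuming a hypothetical record shape (the HFEPX export schema is not specified here), of diffing one slice's benchmark anchors against the previous period's:

```python
from dataclasses import dataclass

@dataclass
class PaperRow:
    """One Protocol Matrix row (field names are assumptions)."""
    title: str
    eval_modes: frozenset[str] = frozenset()
    benchmarks: frozenset[str] = frozenset()
    metrics: frozenset[str] = frozenset()
    quality_controls: bool = False

def benchmark_anchors(slice_rows: list[PaperRow]) -> frozenset[str]:
    """All benchmarks named anywhere in a slice."""
    return frozenset().union(*(r.benchmarks for r in slice_rows))

this_slice = [
    PaperRow("PoSh", frozenset({"human_eval", "llm_as_judge"}),
             frozenset({"CAPArena"}), frozenset({"Spearman"})),
    PaperRow("LightMem", frozenset({"automatic_metrics"}),
             frozenset({"LongMemEval"}), frozenset({"Accuracy"})),
    # ... remaining three rows elided
]

# New benchmark anchors relative to a hypothetical previous slice:
prev_anchors = frozenset({"CAPArena"})
print(benchmark_anchors(this_slice) - prev_anchors)  # frozenset({'LongMemEval'})
```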
Researcher Workflow (Detailed)

Checklist

  • Moderate: Papers with explicit human feedback

    Coverage is usable but incomplete (40% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (60% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (60% vs 35% target).

  • Strong: Papers with known rater population

    Coverage is strong (40% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (40% vs 35% target).

Strengths

  • Most papers provide measurable evaluation context (60% with benchmark anchors, 60% with metric anchors).
  • Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
  • Agentic evaluation appears in 40% of papers.

Known Gaps

  • No papers (0%) report quality controls; prioritize any with calibration or adjudication evidence.
  • LLM-as-judge is used without sufficient inter-annotator agreement reporting.

Suggested Next Analyses

  • Compare papers that report both human evaluation and LLM-as-judge to quantify judge-human agreement drift.
  • Stratify by benchmark (CAPArena vs LongMemEval) before comparing methods (see the sketch after this list).
  • Track metric sensitivity by reporting both accuracy and Spearman correlation.
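A minimal sketch of the stratified comparison suggested above, using hypothetical (benchmark, gold label, system score) triples; the point is the shape of the analysis, not the numbers.

```python
# Hedged sketch: per-benchmark stratification with dual metric reporting.
from collections import defaultdict
from scipy.stats import spearmanr

# (benchmark, gold_label, system_score) triples -- illustrative data only
results = [
    ("CAPArena", 1, 0.9), ("CAPArena", 0, 0.2), ("CAPArena", 1, 0.4),
    ("LongMemEval", 1, 0.8), ("LongMemEval", 0, 0.3), ("LongMemEval", 0, 0.6),
]

by_bench: dict[str, list[tuple[int, float]]] = defaultdict(list)
for bench, gold, score in results:
    by_bench[bench].append((gold, score))

for bench, pairs in by_bench.items():
    golds = [g for g, _ in pairs]
    scores = [s for _, s in pairs]
    acc = sum(g == (s >= 0.5) for g, s in pairs) / len(pairs)  # 0.5 threshold assumed
    rho, _ = spearmanr(golds, scores)
    print(f"{bench}: accuracy={acc:.2f}, spearman={rho:.2f}")
```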

Known Limitations

  • The quality-control and inter-annotator agreement gaps listed under Known Gaps above apply here as well.
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (2)
  • Human Eval (1)
  • LLM-as-Judge (1)
  • Simulation Env (1)

Top Metrics

  • Accuracy (1)
  • Relevance (1)
  • Spearman (1)

Top Benchmarks

  • CAPArena (1)
  • LongMemEval (1)

Quality Controls

  • None reported in this slice.