

HFEPX Daily Archive: 2025-10-21


Updated from the current HFEPX corpus (Mar 10, 2026). This daily page groups 5 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequently cited benchmark: CAPArena. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Oct 21, 2025.

Papers: 5 · Last published: Oct 21, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Developing.

High-Signal Coverage: 100.0%

5 of 5 papers are not flagged as low-signal.

Benchmark Anchors: 60.0%

Papers with benchmark/dataset mentions in the extraction output.

Metric Anchors: 60.0%

Papers with reported metric mentions in the extraction output.

  • No papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
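For concreteness, here is a minimal sketch of how the triage percentages above fall out of per-paper extraction records, using the Protocol Matrix rows listed further down. The record shape is an assumption for illustration, not the actual HFEPX pipeline format.

```python
# Minimal records mirroring the Protocol Matrix below (schema is assumed).
papers = [
    {"benchmarks": ["CAPArena"],    "metrics": ["Spearman"],  "low_signal": False},  # PoSh
    {"benchmarks": ["LongMemEval"], "metrics": ["Accuracy"],  "low_signal": False},  # LightMem
    {"benchmarks": [],              "metrics": ["Relevance"], "low_signal": False},  # KrishokBondhu
    {"benchmarks": [],              "metrics": [],            "low_signal": False},  # MoMaGen
    {"benchmarks": ["GAR-Bench", "DLC-Bench"], "metrics": [], "low_signal": False},  # Grasp Any Region
]

def coverage(pred):
    """Percentage of papers in the slice satisfying `pred`."""
    return 100.0 * sum(1 for p in papers if pred(p)) / len(papers)

print(coverage(lambda p: not p["low_signal"]))    # 100.0 -> high-signal coverage
print(coverage(lambda p: bool(p["benchmarks"])))  # 60.0  -> benchmark anchors
print(coverage(lambda p: bool(p["metrics"])))     # 60.0  -> metric anchors
```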


Why This Time Slice Matters

  • 40% of papers report explicit human-feedback signals, led by demonstration data.
  • Automatic metrics appear in 40% of papers in this hub.
  • CAPArena recurs as a benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

  • One sampled paper (PoSh) reports both human evaluation and LLM-as-judge, supporting direct agreement checks (see the sketch after this list).
  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Raters are mostly domain experts and annotation is commonly freeform; use this to scope replication staffing.
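Where a paper reports both signals, the agreement check can be as simple as correlating per-item human ratings with judge ratings. A minimal sketch follows; the scores are illustrative placeholders, not values from PoSh or any other paper in this slice.

```python
# Hedged sketch: rank agreement between human raters and an LLM judge.
from scipy.stats import spearmanr

human_scores = [4, 2, 5, 3, 1, 4, 2]  # hypothetical per-item human ratings
judge_scores = [5, 2, 4, 3, 2, 4, 1]  # hypothetical LLM-as-judge ratings

rho, p = spearmanr(human_scores, judge_scores)
print(f"judge-human Spearman rho = {rho:.2f} (p = {p:.3f})")
# Tracking rho across archive slices surfaces the agreement drift flagged
# under Suggested Next Analyses below.
```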

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (All 5 Papers)

Quickly compare method ingredients across this archive slice.

| Paper | Date | Eval Modes | Benchmarks | Metrics | Quality Controls |
| --- | --- | --- | --- | --- | --- |
| PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions | Oct 21, 2025 | Human Eval, LLM-as-Judge | CAPArena | Spearman | Not reported |
| LightMem: Lightweight and Efficient Memory-Augmented Generation | Oct 21, 2025 | Automatic Metrics | LongMemEval | Accuracy | Not reported |
| KrishokBondhu: A Retrieval-Augmented Voice-Based Agricultural Advisory Call Center for Bengali Farmers | Oct 21, 2025 | Automatic Metrics | Not reported | Relevance | Not reported |
| MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation | Oct 21, 2025 | Simulation Env | Not reported | Not reported | Not reported |
| Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs | Oct 21, 2025 | Not reported | GAR-Bench, DLC-Bench | Not reported | Not reported |
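The matrix rows lend themselves to structured records for the period-over-period review this page targets. Below is a minimal sketch, assuming a hypothetical record shape (the HFEPX export schema is not specified here), of diffing one slice's benchmark anchors against the previous period's:

```python
from dataclasses import dataclass

@dataclass
class PaperRow:
    """One Protocol Matrix row (field names are assumptions)."""
    title: str
    eval_modes: frozenset[str] = frozenset()
    benchmarks: frozenset[str] = frozenset()
    metrics: frozenset[str] = frozenset()
    quality_controls: bool = False

def benchmark_anchors(slice_rows: list[PaperRow]) -> frozenset[str]:
    """All benchmarks named anywhere in a slice."""
    return frozenset().union(*(r.benchmarks for r in slice_rows))

this_slice = [
    PaperRow("PoSh", frozenset({"human_eval", "llm_as_judge"}),
             frozenset({"CAPArena"}), frozenset({"Spearman"})),
    PaperRow("LightMem", frozenset({"automatic_metrics"}),
             frozenset({"LongMemEval"}), frozenset({"Accuracy"})),
    # ... remaining three rows elided
]

# New benchmark anchors relative to a hypothetical previous slice:
prev_anchors = frozenset({"CAPArena"})
print(benchmark_anchors(this_slice) - prev_anchors)  # frozenset({'LongMemEval'})
```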
Researcher Workflow (Detailed)

Checklist

  • Moderate: Papers with explicit human feedback

    Coverage is usable but incomplete (40% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Strong: Papers naming benchmarks/datasets

    Coverage is strong (60% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (60% vs 35% target).

  • Strong: Papers with known rater population

    Coverage is strong (40% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (40% vs 35% target).

Strengths

  • Most papers provide measurable evaluation context (60% with benchmark anchors, 60% with metric anchors).
  • Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.
  • Agentic evaluation appears in 40% of papers.

Known Gaps

  • No papers (0%) report quality controls; prioritize any with calibration or adjudication evidence.
  • LLM-as-judge is used without sufficient inter-annotator agreement reporting.

Suggested Next Analyses

  • Compare papers that report both human evaluation and LLM-as-judge to quantify judge-human agreement drift.
  • Stratify by benchmark (CAPArena vs LongMemEval) before comparing methods (see the sketch after this list).
  • Track metric sensitivity by reporting both accuracy and Spearman correlation.
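A minimal sketch of the stratified comparison suggested above, using hypothetical (benchmark, gold label, system score) triples; the point is the shape of the analysis, not the numbers.

```python
# Hedged sketch: per-benchmark stratification with dual metric reporting.
from collections import defaultdict
from scipy.stats import spearmanr

# (benchmark, gold_label, system_score) triples -- illustrative data only
results = [
    ("CAPArena", 1, 0.9), ("CAPArena", 0, 0.2), ("CAPArena", 1, 0.4),
    ("LongMemEval", 1, 0.8), ("LongMemEval", 0, 0.3), ("LongMemEval", 0, 0.6),
]

by_bench: dict[str, list[tuple[int, float]]] = defaultdict(list)
for bench, gold, score in results:
    by_bench[bench].append((gold, score))

for bench, pairs in by_bench.items():
    golds = [g for g, _ in pairs]
    scores = [s for _, s in pairs]
    acc = sum(g == (s >= 0.5) for g, s in pairs) / len(pairs)  # 0.5 threshold assumed
    rho, _ = spearmanr(golds, scores)
    print(f"{bench}: accuracy={acc:.2f}, spearman={rho:.2f}")
```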

Known Limitations

  • The quality-control and inter-annotator agreement gaps listed under Known Gaps above apply here as well.
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (2)
  • Human Eval (1)
  • LLM-as-Judge (1)
  • Simulation Env (1)

Top Metrics

  • Accuracy (1)
  • Relevance (1)
  • Spearman (1)

Top Benchmarks

  • CAPArena (1)
  • LongMemEval (1)

Quality Controls

  • None reported in this slice.