HFEPX Archive Slice

HFEPX Daily Archive: 2026-02-03

Updated from the current HFEPX corpus (Mar 8, 2026). This daily page groups 6 papers. Most common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Most common annotation unit: Trajectory. Frequently cited benchmark: DROP. Most common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 3, 2026.

Papers: 6 · Last published: Feb 3, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring: what changed in evaluation methods, metrics, and protocol quality this period. Quality band: Developing.

High-Signal Coverage

100.0%

6 / 6 papers are not flagged as low-signal.

Benchmark Anchors

50.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

16.7%

Papers with reported metric mentions in extraction output.

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons (a filter sketch follows below).

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.
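
As a concrete version of the anchor-based triage above, the sketch below filters a slice for papers that report both a benchmark and a metric. The PaperRecord schema and its field names are illustrative assumptions, not the HFEPX extraction format.

```python
# Minimal sketch (hypothetical schema): keep only papers whose extraction
# output names at least one benchmark AND at least one metric.
from dataclasses import dataclass, field

@dataclass
class PaperRecord:
    title: str
    benchmarks: list = field(default_factory=list)  # e.g. ["DROP"]
    metrics: list = field(default_factory=list)     # e.g. ["accuracy"]

def anchored(papers):
    """Papers with both benchmark and metric anchors."""
    return [p for p in papers if p.benchmarks and p.metrics]

slice_papers = [
    PaperRecord("SpatiaLab", benchmarks=["DROP"], metrics=["accuracy"]),
    PaperRecord("FASA"),  # no anchors reported
]
print([p.title for p in anchored(slice_papers)])  # -> ['SpatiaLab']
```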

Why This Time Slice Matters

  • Automatic Metrics appears as an evaluation mode in 33.3% of papers in this hub.
  • DROP is a recurring benchmark anchor for cross-paper comparisons on this page.
  • Long-horizon tasks appear in 33.3% of papers, indicating demand for agentic evaluation.

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Raters are mostly domain experts, and annotation is commonly at the trajectory level; use this to scope replication staffing.
  • Stratify by benchmark (DROP vs LongBench) before comparing methods; a grouping sketch follows this list.
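
To make the stratification step concrete, here is a minimal sketch: group reported scores by benchmark first, then compare methods only within each stratum. The result records and scores are illustrative, not taken from the papers in this slice.

```python
# Sketch (hypothetical result records): never mix DROP and LongBench
# numbers; compare methods only within one benchmark stratum.
from collections import defaultdict

results = [
    {"method": "A", "benchmark": "DROP", "accuracy": 0.71},
    {"method": "B", "benchmark": "DROP", "accuracy": 0.68},
    {"method": "A", "benchmark": "LongBench", "accuracy": 0.44},
]

by_benchmark = defaultdict(list)
for r in results:
    by_benchmark[r["benchmark"]].append((r["method"], r["accuracy"]))

for bench, rows in by_benchmark.items():
    best = max(rows, key=lambda mr: mr[1])  # best method in this stratum
    print(f"{bench}: best = {best[0]} ({best[1]:.2f})")
```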

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review; one possible scoring rule is sketched below.
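
The ranking formula itself is not published here; one plausible reading of "protocol completeness" is the count of protocol fields a paper actually reports. The sketch below ranks papers that way, with field names assumed for illustration.

```python
# Sketch: score completeness as the number of reported protocol fields.
# Field names and the "Not reported" sentinel are assumptions.
FIELDS = ("eval_modes", "benchmarks", "metrics", "quality_controls")

def completeness(paper: dict) -> int:
    return sum(1 for f in FIELDS
               if paper.get(f) not in (None, [], "Not reported"))

papers = [
    {"title": "SpatiaLab", "eval_modes": ["Automatic Metrics"],
     "benchmarks": ["DROP"], "metrics": ["Accuracy"]},
    {"title": "FASA"},  # nothing reported
]
ranked = sorted(papers, key=completeness, reverse=True)
print([p["title"] for p in ranked])  # -> ['SpatiaLab', 'FASA']
```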

Protocol Matrix (All 6 Papers)

Quickly compare method ingredients across this archive slice.

Paper | Eval Modes | Benchmarks | Metrics | Quality Controls
SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild? (Feb 3, 2026) | Automatic Metrics | DROP | Accuracy | Not reported
OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering (Feb 3, 2026) | Not reported | OmniVideoBench | Not reported | Not reported
SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training (Feb 3, 2026) | Not reported | SWE-bench, SWE-bench Verified | Not reported | Not reported
STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models (Feb 3, 2026) | Automatic Metrics | Not reported | Not reported | Not reported
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques (Feb 3, 2026) | Not reported | Not reported | Not reported | Not reported
FASA: Frequency-aware Sparse Attention (Feb 3, 2026) | Not reported | Not reported | Not reported | Not reported

Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback

    Coverage is a replication risk (0% vs 45% target).

  • Gap: Papers reporting quality controls

    Coverage is a replication risk (0% vs 30% target).

  • Moderate: Papers naming benchmarks/datasets

    Coverage is usable but incomplete (33.3% vs 35% target).

  • Moderate: Papers naming evaluation metrics

    Coverage is usable but incomplete (33.3% vs 35% target).

  • Gap: Papers with known rater population

    Coverage is a replication risk (16.7% vs 35% target).

  • Gap: Papers with known annotation unit

    Coverage is a replication risk (16.7% vs 35% target).
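
The Gap/Moderate labels above band coverage against a target. The exact banding rule is not documented on this page; the sketch below assumes a half-of-target cutoff, which happens to reproduce every label in this checklist.

```python
# Sketch: assumed banding rule -- "Gap" below half the target,
# "Moderate" between half and full target, "OK" at or above target.
def band(coverage: float, target: float) -> str:
    if coverage >= target:
        return "OK"
    if coverage >= target / 2:
        return "Moderate"
    return "Gap"

checks = [
    ("human feedback",    0.0, 45.0),
    ("quality controls",  0.0, 30.0),
    ("benchmarks named", 33.3, 35.0),
    ("rater population", 16.7, 35.0),
]
for name, cov, tgt in checks:
    print(f"{band(cov, tgt):8s} {name}: {cov}% vs {tgt}% target")
```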

Strengths

  • Agentic evaluation appears in 66.7% of papers.

Known Gaps

  • No papers (0%) report quality controls; prioritize calibration/adjudication evidence.
  • Rater population is under-specified (16.7% coverage).
  • Annotation unit is under-specified (16.7% coverage).

Suggested Next Analyses

  • Stratify by benchmark (DROP vs LongBench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and agreement; a worked sketch follows this list.
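
To report "both accuracy and agreement" side by side, the sketch below computes raw agreement (accuracy between two label sets) alongside Cohen's kappa, one common chance-corrected agreement statistic. The labels are illustrative, not drawn from any paper here.

```python
# Sketch: raw accuracy can look high while chance-corrected agreement
# (Cohen's kappa) stays modest; reporting both exposes the gap.
def accuracy(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    po = accuracy(a, b)  # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / len(a)) * (b.count(l) / len(b)) for l in labels)
    return (po - pe) / (1 - pe)  # chance-corrected

rater1 = [1, 1, 0, 1, 0, 0, 1, 0]
rater2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(f"accuracy={accuracy(rater1, rater2):.2f}  "
      f"kappa={cohens_kappa(rater1, rater2):.2f}")
```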

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (2)

Top Metrics

  • Accuracy (2)
  • Agreement (1)

Top Benchmarks

  • DROP (1)
  • LongBench (1)

Quality Controls

  • None reported (0 of 6 papers in this slice).
