
HFEPX Archive Slice

HFEPX Daily Archive: 2026-01-16


Updated from the current HFEPX corpus (Apr 9, 2026). This daily page groups 10 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Common annotation unit: Trajectory. Frequently cited benchmark: BFCL. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jan 16, 2026.

Papers: 10 · Last published: Jan 16, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: Medium.

High-Signal Coverage: 100.0% (10 / 10 papers are not flagged as low-signal)

Benchmark Anchors: 20.0% (papers with benchmark/dataset mentions in extraction output)

Metric Anchors: 20.0% (papers with reported metric mentions in extraction output)

  • 0 papers report explicit quality controls for this archive period.
  • Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as an early signal only; benchmark/metric anchoring is too limited to support rigorous period-over-period claims.
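One way to act on that triage rule quickly is sketched below in Python, assuming the protocol-matrix rows further down this page have been exported as a list of dicts; the `papers` structure and its field names are hypothetical illustrations, not an HFEPX export format.

```python
# Hypothetical export of the protocol matrix below; the structure and
# field names are assumptions, not an HFEPX output format.
papers = [
    {"title": "AJAR: Adaptive Jailbreak Architecture for Red-teaming",
     "benchmarks": ["Harmbench"],
     "metrics": ["Success rate", "Jailbreak success rate"]},
    {"title": "Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning",
     "benchmarks": ["Blenderbench", "Slidebench"],
     "metrics": ["Accuracy"]},
    {"title": "The unreasonable effectiveness of pattern matching",
     "benchmarks": [],  # "Not reported" rows become empty lists
     "metrics": []},
]

# Keep only papers with BOTH anchors, per the triage guidance above.
anchored = [p for p in papers if p["benchmarks"] and p["metrics"]]
for p in anchored:
    print(p["title"])
```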


Why This Time Slice Matters

  • 10% of papers report explicit human-feedback signals, led by red-team protocols.
  • Automatic-metrics evaluation appears in 10% of papers in this hub.
  • BFCL appears as a benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

  • Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
  • Rater pools are mostly unspecified, and annotation is most often at the trajectory level; use this to scope replication staffing.
  • Stratify by benchmark (BFCL vs Blenderbench) before comparing methods.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

AJAR: Adaptive Jailbreak Architecture for Red-teaming (Jan 16, 2026)
  Eval modes: Simulation Env · Benchmarks: Harmbench · Metrics: Success rate, Jailbreak success rate · Quality controls: Not reported

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning (Jan 16, 2026)
  Eval modes: Automatic Metrics · Benchmarks: Blenderbench, Slidebench · Metrics: Accuracy · Quality controls: Not reported

The unreasonable effectiveness of pattern matching (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported

F-Actor: Controllable Conversational Behaviour in Full-Duplex Models (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported

T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported

Generating metamers of human scene understanding (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported

Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported

A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported

The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents (Jan 16, 2026)
  Eval modes, benchmarks, metrics, quality controls: Not reported
Researcher Workflow (Detailed)

Checklist

  • Gap: Papers with explicit human feedback. Coverage is a replication risk (10% vs 45% target).
  • Gap: Papers reporting quality controls. Coverage is a replication risk (0% vs 30% target).
  • Moderate: Papers naming benchmarks/datasets. Coverage is usable but incomplete (30% vs 35% target).
  • Moderate: Papers naming evaluation metrics. Coverage is usable but incomplete (30% vs 35% target).
  • Gap: Papers with known rater population. Coverage is a replication risk (0% vs 35% target).
  • Gap: Papers with known annotation unit. Coverage is a replication risk (10% vs 35% target).
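The Gap/Moderate bands above follow from comparing coverage to its target, but the exact banding rule is not documented on this page; the 50%-of-target cutoff in the sketch below is an assumption, chosen only because it reproduces the labels shown.

```python
# Checklist coverage vs. target, in percent, as reported above.
checks = {
    "Explicit human feedback": (10, 45),
    "Quality controls": (0, 30),
    "Benchmarks/datasets named": (30, 35),
    "Evaluation metrics named": (30, 35),
    "Known rater population": (0, 35),
    "Known annotation unit": (10, 35),
}

def band(coverage: float, target: float) -> str:
    """Hypothetical banding rule; the real HFEPX thresholds are not published."""
    if coverage >= target:
        return "OK"
    if coverage >= 0.5 * target:  # assumed cutoff; reproduces the labels above
        return "Moderate"
    return "Gap"

for name, (cov, tgt) in checks.items():
    print(f"{band(cov, tgt):<8} {name}: {cov}% vs {tgt}% target")
```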

Strengths

  • This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

  • No papers report quality controls (0% coverage); prioritize calibration/adjudication evidence.
  • Rater population is under-specified (0% coverage).
  • Annotation unit is under-specified (10% coverage).

Suggested Next Analyses

  • Stratify by benchmark (BFCL vs Blenderbench) before comparing methods.
  • Track metric sensitivity by reporting both accuracy and cost (both suggestions are illustrated in the sketch below).
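A minimal pandas sketch of both suggestions; the `runs` DataFrame, its column names, and all values are illustrative assumptions, not an HFEPX schema or real results.

```python
import pandas as pd

# Illustrative per-run eval results; columns and values are assumptions
# for demonstration, not an HFEPX schema or real results.
runs = pd.DataFrame({
    "method":    ["A", "B", "A", "B"],
    "benchmark": ["BFCL", "BFCL", "Blenderbench", "Blenderbench"],
    "accuracy":  [0.71, 0.68, 0.55, 0.60],
    "cost_usd":  [1.20, 0.90, 2.10, 1.75],
})

# Stratify by benchmark before comparing methods, and report accuracy
# alongside cost so metric sensitivity stays visible.
summary = runs.groupby(["benchmark", "method"])[["accuracy", "cost_usd"]].mean()
print(summary)
```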

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Research Utility Snapshot (Detailed)

Evaluation Modes

  • Automatic Metrics (1)
  • Simulation Env (1)

Top Metrics

  • Accuracy (1)
  • Cost (1)
  • Jailbreak success rate (1)
  • Success rate (1)

Top Benchmarks

  • BFCL (1)
  • Blenderbench (1)
  • Harmbench (1)
  • Slidebench (1)

Quality Controls

  • None reported (0 papers in this slice).

