HFEPX Archive Slice

HFEPX Weekly Archive: 2026-W03

Updated from current HFEPX corpus (Apr 12, 2026). 61 papers are grouped in this weekly page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: BFCL. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Jan 18, 2026.

Papers: 61 · Last published: Jan 18, 2026

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 61 papers).

High-Signal Coverage

100.0%

60 of 60 papers are not flagged as low-signal.

Benchmark Anchors

15.0%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

31.7%

Papers with reported metric mentions in extraction output.

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.

Why This Time Slice Matters

13.1% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 27.9% of papers in this hub.
BFCL is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (1.6% of papers).
Raters are mostly domain experts, and annotation is commonly trajectory-level; use this to scope replication staffing.
Pair this hub with a human_eval-heavy hub to validate judge-model calibration.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Jan 17, 2026 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: Error rate

AJAR: Adaptive Jailbreak Architecture for Red-teaming
Jan 16, 2026 · Citations: 0 · Score: 7.0
Eval: Simulation Env · Metrics: Success rate, Jailbreak success rate

Legal Experts Disagree With Rationale Extraction Techniques for Explaining ECtHR Case Outcome Classification
Jan 18, 2026 · Citations: 0 · Score: 6.0
Eval: LLM-as-Judge · Metrics: Faithfulness

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Jan 16, 2026 · Citations: 0 · Score: 6.0
Eval: Automatic Metrics · Metrics: Accuracy

Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Jan 14, 2026 · Citations: 0 · Score: 5.5
Eval: Simulation Env · Metrics: Latency

DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs
Jan 12, 2026 · Citations: 0 · Score: 5.5
Eval: Automatic Metrics · Metrics: Latency, Cost

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning	Jan 17, 2026	Automatic Metrics	Calconflictbench	Error rate	Not reported
AJAR: Adaptive Jailbreak Architecture for Red-teaming	Jan 16, 2026	Simulation Env	Harmbench	Success rate, Jailbreak success rate	Not reported
Legal Experts Disagree With Rationale Extraction Techniques for Explaining ECtHR Case Outcome Classification	Jan 18, 2026	LLM-as-Judge	Inteval	Faithfulness	Not reported
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning	Jan 16, 2026	Automatic Metrics	Blenderbench, Slidebench	Accuracy	Not reported
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning	Jan 14, 2026	Simulation Env	Not reported	Latency	Not reported
DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs	Jan 12, 2026	Automatic Metrics	MT Bench	Latency, Cost	Not reported
PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark	Jan 13, 2026	Automatic Metrics	Not reported	Relevance	Not reported
Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning	Jan 18, 2026	Automatic Metrics	Not reported	BLEU	Not reported
Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs	Jan 18, 2026	Automatic Metrics	Not reported	Jailbreak success rate	Not reported
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space	Jan 18, 2026	Automatic Metrics	MATH	Not reported	Not reported
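
The triage advice to prioritize papers that carry both a benchmark anchor and a metric anchor can be applied to the matrix above mechanically. The following is a minimal Python sketch, assuming the ten rows are transcribed by hand; the shortened titles, the tuple layout, and the helper name `has_both_anchors` are illustrative choices and not part of HFEPX.

```python
# Rows transcribed from the Protocol Matrix: (short title, benchmarks, metrics).
# None stands for "Not reported". Short titles are illustrative abbreviations.
MATRIX = [
    ("PEARL", "Calconflictbench", "Error rate"),
    ("AJAR", "Harmbench", "Success rate, Jailbreak success rate"),
    ("Legal Experts Disagree (ECtHR)", "Inteval", "Faithfulness"),
    ("Vision-as-Inverse-Graphics Agent", "Blenderbench, Slidebench", "Accuracy"),
    ("Fast-ThinkAct", None, "Latency"),
    ("DYCP", "MT Bench", "Latency, Cost"),
    ("PosIR", None, "Relevance"),
    ("Round-Trip RL for Low-Resource MT", None, "BLEU"),
    ("Data Scheduling for Arabic AudioLLMs", None, "Jailbreak success rate"),
    ("Orthogonalized Policy Optimization", "MATH", None),
]


def has_both_anchors(row):
    """True when a row reports at least one benchmark and at least one metric."""
    _title, benchmarks, metrics = row
    return benchmarks is not None and metrics is not None


anchored = [row[0] for row in MATRIX if has_both_anchors(row)]
for title in anchored:
    print(title)
```

On the ten rows shown, this keeps five papers (PEARL, AJAR, the ECtHR rationale study, the Vision-as-Inverse-Graphics Agent, and DYCP), which are the natural starting set for longitudinal comparison.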

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback
Coverage is a replication risk (13.1% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (1.6% vs 30% target).

Gap: Papers naming benchmarks/datasets
Coverage is a replication risk (13.1% vs 35% target).

Moderate: Papers naming evaluation metrics
Coverage is usable but incomplete (23% vs 35% target).

Gap: Papers with known rater population
Coverage is a replication risk (11.5% vs 35% target).

Gap: Papers with known annotation unit
Coverage is a replication risk (11.5% vs 35% target).
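
The checklist compares observed coverage with a target for each protocol field. Below is a minimal Python sketch of that comparison using the percentages listed above. The page does not say how "Gap" and "Moderate" are assigned, so the cut-off at half the target (`MODERATE_RATIO`) is an assumption, chosen only because it happens to reproduce the six labels shown; the "Met" label is likewise hypothetical.

```python
# Coverage vs. target per checklist item, in percent of papers (from the page).
CHECKLIST = {
    "explicit human feedback": (13.1, 45.0),
    "quality controls": (1.6, 30.0),
    "benchmarks/datasets": (13.1, 35.0),
    "evaluation metrics": (23.0, 35.0),
    "known rater population": (11.5, 35.0),
    "known annotation unit": (11.5, 35.0),
}

# ASSUMPTION: the page does not document this threshold.
MODERATE_RATIO = 0.5


def classify(coverage, target):
    """Label a field by how close its coverage is to the target."""
    if coverage >= target:
        return "Met"  # hypothetical label; no item on the page reaches its target
    if coverage / target >= MODERATE_RATIO:
        return "Moderate"
    return "Gap"


labels = {field: classify(cov, tgt) for field, (cov, tgt) in CHECKLIST.items()}
for field, label in labels.items():
    cov, tgt = CHECKLIST[field]
    print(f"{label:8s} {field}: {cov}% vs {tgt}% target ({tgt - cov:.1f} pt shortfall)")
```

Sorting by shortfall instead of by label may be more useful when deciding which protocol field to chase first in a replication.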

Strengths

This hub still surfaces a concentrated paper set for protocol triage and replication planning.

Known Gaps

Only 1.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.5% coverage).
Annotation unit is under-specified (11.5% coverage).

Suggested Next Analyses

Pair this hub with a human_eval-heavy hub to validate judge-model calibration.
Stratify by benchmark (BFCL vs Blenderbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
Add inter-annotator agreement checks when reproducing these protocols.
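
For the inter-annotator agreement check suggested above, Cohen's kappa is one common choice when two raters assign categorical labels to the same items. The sketch below is a plain-Python implementation; the two label lists are invented toy data (pass/fail judgments on ten trajectories) and are not drawn from any paper in this slice.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention, not a derived value.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: two raters judging ten trajectories.
a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "fail", "pass"]
kappa = cohens_kappa(a, b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 items while chance agreement is 0.52, so kappa comes out at about 0.58. For more than two raters or ordinal scales, a different statistic (for example Krippendorff's alpha) would be the better fit.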

Recommended Queries

LLM-as-Judge Protocols
Benchmark Slice: BFCL
Metric Slice: accuracy
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (17)
Simulation Env (4)
LLM-as-Judge (2)

Top Metrics

Accuracy (7)
Cost (2)
Relevance (2)
Coherence (1)

Top Benchmarks

BFCL (1)
Blenderbench (1)
Calconflictbench (1)
Harmbench (1)

Quality Controls

Calibration (1)
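
The percentages quoted elsewhere on this page can be cross-checked against the counts in this snapshot. A small Python sketch follows, assuming the denominator is the full 61-paper slice rather than the 60-paper loaded sample; the page does not state the denominator, but 61 reproduces both the 27.9% figure for automatic metrics and the 1.6% figure for calibration.

```python
# ASSUMPTION: shares are taken over all 61 papers, not the 60-paper loaded sample.
TOTAL_PAPERS = 61

# Counts copied from the snapshot lists.
EVAL_MODE_COUNTS = {"Automatic Metrics": 17, "Simulation Env": 4, "LLM-as-Judge": 2}
QUALITY_CONTROL_COUNTS = {"Calibration": 1}


def share(count, total=TOTAL_PAPERS):
    """Percentage of papers, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


all_counts = {**EVAL_MODE_COUNTS, **QUALITY_CONTROL_COUNTS}
shares = {name: share(n) for name, n in all_counts.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}% ({all_counts[name]} of {TOTAL_PAPERS})")
```

Reporting the raw count next to each percentage makes it clear when a share rests on one or two papers, as it does for calibration here.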

Papers In This Archive Slice

Improving Low-Resource Machine Translation via Round-Trip Reinforcement Learning
Ahmed Attia, Alham Fikri Aji · Jan 18, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multi-Task Instruction Tuning via Data Scheduling for Low-Resource Arabic AudioLLMs
Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury · Jan 18, 2026 · Citations: 0

To support end-to-end Arabic speech summarization, we introduce AraMega-SSum, a first speech summarization resource for training and benchmarking Arabic-centric Audio-LLMs.
Legal Experts Disagree With Rationale Extraction Techniques for Explaining ECtHR Case Outcome Classification
Mahammad Namazov, Tomáš Koref, Ivan Habernal · Jan 18, 2026 · Citations: 0

We study this task on decisions from the European Court of Human Rights (ECtHR), introducing a new ECtHR dataset with carefully curated positive (violation) and negative (non-violation) cases.
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents
Raffi Khatchadourian · Jan 17, 2026 · Citations: 0

Long Horizon

We introduce the Determinism-Faithfulness Assurance Harness (DFAH), a framework for measuring trajectory determinism, decision determinism, and evidence-conditioned faithfulness in tool-using agents deployed in financial services.
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning
Bingxuan Li, Jeonghwan Kim, Cheng Qian, Xiusi Chen, Eitan Anzenberg · Jan 17, 2026 · Citations: 0

Pairwise Preference · Long Horizon

To enable a systematic study of this question, we introduce CalConflictBench, a benchmark for long-horizon calendar conflict resolution.
Event Detection with a Context-Aware Encoder and LoRA for Improved Performance on Long-Tailed Classes
Abdullah Al Monsur, Nitesh Vamshi Bommisetty, Gene Louis Kim · Jan 17, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The unreasonable effectiveness of pattern matching
Gary Lupyan, Blaise Agüera y Arcas · Jan 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch · Jan 16, 2026 · Citations: 0

Spoken conversational systems require more than accurate speech generation to have human-like conversations: to feel natural and engaging, they must produce conversational behaviour that adapts dynamically to the context.
T$^\star$: Progressive Block Scaling for Masked Diffusion Language Models Through Trajectory Aware Reinforcement Learning
Hanchen Xia, Baoyou Chen, Yutang Ge, Guojiang Zhao, Siyu Zhu · Jan 16, 2026 · Citations: 0

Long Horizon

Starting from an AR-initialized small-block MDM, T$^\star$ transitions smoothly to larger blocks, enabling higher-parallelism decoding with minimal performance degradation on math reasoning benchmarks.
The Growing Gains and Pains of Iterative Web Corpora Crawling: Insights from South Slavic CLASSLA-web 2.0 Corpora
Taja Kuzman Pungeršek, Peter Rupnik, Vít Suchomel, Nikola Ljubešić · Jan 16, 2026 · Citations: 0
Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning
Shaofeng Yin, Jiaxin Ge, Zora Zhiruo Wang, Chenyang Wang, Xiuyu Li · Jan 16, 2026 · Citations: 0

Long Horizon

To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other.
Generating metamers of human scene understanding
Ritik Raina, Abe Leite, Alexandros Graikos, Seoyoung Ahn, Dimitris Samaras · Jan 16, 2026 · Citations: 0

Human vision combines low-resolution "gist" information from the visual periphery with sparse but high-resolution information from fixated locations to construct a coherent understanding of a visual scene.
Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
Fenglin Zhang, Jie Wang · Jan 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AJAR: Adaptive Jailbreak Architecture for Red-teaming
Yipu Dou, Wang Yang · Jan 16, 2026 · Citations: 0

Red Team

Large language model (LLM) safety evaluation is moving from content moderation to action security as modern systems gain persistent state, tool access, and autonomous control loops.
A Confidence-Variance Theory for Pseudo-Label Selection in Semi-Supervised Learning
Jinshi Liu, Pan Liu, Lei He · Jan 16, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents
Kaiyu Zhou, Yongsen Zheng, Yicheng He, Meng Xue, Xueluan Gong · Jan 16, 2026 · Citations: 0
Unified Optimization of Source Weights and Transfer Quantities in Multi-Source Transfer Learning: An Asymptotic Framework
Qingyue Zhang, Chang Chu, Haohao Fu, Tianren Peng, Yanru Wu · Jan 15, 2026 · Citations: 0
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Zirui Ren, Ziming Liu · Jan 15, 2026 · Citations: 0
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi · Jan 15, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman, Jareen Tasneem Khondaker, Md. Sameer Sakib · Jan 15, 2026 · Citations: 0

Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath…
Development of Ontological Knowledge Bases by Leveraging Large Language Models
Le Ngoc Luyen, Marie-Hélène Abel, Philippe Gouspillou · Jan 15, 2026 · Citations: 0
Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0

Long Horizon

The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni…
DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
Hengyu Shen, Tiancheng Gu, Bin Qin, Lan Wu, Yuling Wu · Jan 15, 2026 · Citations: 0
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang, Jian Yang, Weiyuan Li, Rui Xie, Jen-tse Huang · Jan 15, 2026 · Citations: 0

We present HumanLLM, a framework treating psychological patterns as interacting causal forces.
AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik, Ashish Anand · Jan 15, 2026 · Citations: 0

We introduce AWED-FiNER, an open-source collection of agentic tool, web application, and 53 state-of-the-art expert models that provide Fine-grained Named Entity Recognition (FgNER) solutions across 36 languages spoken by more than 6.6…
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa · Jan 15, 2026 · Citations: 0

We share our models, data, and evaluations at AlignmentPretraining.ai.
Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
Sijia Luo, Xiaokang Zhang, Yuxuan Hu, Bohan Zhang, Ke Wang · Jan 15, 2026 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan, Raphaël Merx, Jey Han Lau · Jan 15, 2026 · Citations: 0

Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0

Pairwise Preference · Long Horizon

Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie…
Creating a Hybrid Rule and Neural Network Based Semantic Tagger using Silver Standard Data: the PyMUSAS framework for Multilingual Semantic Annotation
Andrew Moore, Paul Rayson, Dawn Archer, Tim Czerniak, Dawn Knight · Jan 14, 2026 · Citations: 0

However, for the UCREL Semantic Analysis System (USAS) framework, no open extensive evaluation has been performed beyond lexical coverage or single language evaluation.
Information Access of the Oppressed: A Problem-Posing Framework for Envisioning Emancipatory Information Access Platforms
Bhaskar Mitra, Nicola Neophytou, Sireesh Gururaja · Jan 14, 2026 · Citations: 0

Freire's theories provide a radically different lens for exploring IA's sociotechnical concerns relative to the current dominating frames of fairness, accountability, confidentiality, transparency, and safety.
MVSS: A Unified Framework for Multi-View Structured Survey Generation
Yinqi Liu, Yueqi Zhu, Yongkang Zhang, Feiran Liu, Yutong Shen · Jan 14, 2026 · Citations: 0

In addition, we introduce a dedicated evaluation framework that systematically assesses generated surveys from multiple dimensions, including structural quality, comparative completeness, and citation fidelity.
CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Galactic Archaeology and Scientific Discovery
Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo · Jan 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs
Biswesh Mohapatra, Théo Charlot, Giovanni Duca, Mayank Palan, Laurent Romary · Jan 14, 2026 · Citations: 0

With the increasing presence of embodied conversational agents and social robots, the ability to correctly ground this kind of conversational content in order to refer back later also becomes important for dialog systems.
GIFT: Reconciling Post-Training Objectives via Finite-Temperature Gibbs Initialization
Zhengyang Zhao, Lu Ma, Yizhen Jiang, Xiaochen Ma, Zimo Meng · Jan 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ProFit: Leveraging High-Value Signals in SFT via Probability-Guided Token Selection
Tao Liu, Taiqiang Wu, Runming Yang, Shaoning Sun, Junjie Wang · Jan 14, 2026 · Citations: 0

Supervised fine-tuning (SFT) is a fundamental post-training strategy to align Large Language Models (LLMs) with human intent.
CAST: Character-and-Scene Episodic Memory for Agents
Kexin Ma, Bojun Li, Yuhua Tang, Liting Sun, Ruochun Jin · Jan 14, 2026 · Citations: 0

Episodic memory is a central component of human memory, which refers to the ability to recall coherent events grounded in who, when, and where.
Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models
Youwei Liu, Jian Wang, Hanlin Wang, Beichen Guo, Wenjie Li · Jan 13, 2026 · Citations: 0
ConvoLearn: A Dataset for Fine-Tuning Dialogic AI Tutors
Mayank Sharma, Roy Pea, Hari Subramonyam · Jan 13, 2026 · Citations: 0
APEX-SWE
Abhi Kottamasu, Chirag Mahapatra, Sam Lee, Ben Pan, Aakash Barthwal · Jan 13, 2026 · Citations: 0

We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work.
A Geolocation-Aware Multimodal Approach for Ecological Prediction
Valerie Zermatten, Chiara Vanalli, Gencer Sumbul, Diego Marcos, Devis Tuia · Jan 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Auditing Student-AI Collaboration: A Case Study of Online Graduate CS Students
Nifu Dan · Jan 13, 2026 · Citations: 0
A Parallel Cross-Lingual Benchmark for Multimodal Idiomaticity Understanding
Dilara Torunoğlu-Selamet, Dogukan Arslan, Rodrigo Wilkens, Wei He, Doruk Eryiğit · Jan 13, 2026 · Citations: 0

Pairwise Preference

The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects.
Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation
Saumitra Yadav, Manish Shrivastava · Jan 13, 2026 · Citations: 0

To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation.
Rewriting Video: Text-Driven Reauthoring of Video Footage
Sitong Wang, Anh Truong, Lydia B. Chilton, Dingzeyu Li · Jan 13, 2026 · Citations: 0

A technical evaluation of the algorithm reveals a critical human-AI perceptual gap.
PosIR: Position-Aware Heterogeneous Information Retrieval Benchmark
Ziyang Zeng, Dun Zhang, Yu Yan, Xu Sun, Cuiqiaoshu Pan · Jan 13, 2026 · Citations: 0

Pairwise Preference

To address these limitations, we introduce PosIR (Position-Aware Information Retrieval), the first standardized benchmark designed to systematically diagnose position bias in diverse retrieval scenarios.
High-Fidelity Modeling of Stochastic Chemical Dynamics on Complex Manifolds: A Multi-Scale SIREN-PINN Framework for the Curvature-Perturbed Ginzburg-Landau Equation
Julian Evan Chrisnanto, Salsabila Rahma Alia, Nurfauzi Fadillah, Yulison Herry Chrisnanto · Jan 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
FigEx2: Visual-Conditioned Panel Detection and Captioning for Scientific Compound Figures
Jifeng Song, Arun Das, Pan Wang, Hui Ji, Kun Zhao · Jan 12, 2026 · Citations: 0

To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry.
Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors
Laurits Lyngbaek, Pascale Feldkamp, Yuri Bizzoni, Kristoffer L. Nielbo, Kenneth Enevoldsen · Jan 12, 2026 · Citations: 0

Use cases of sentiment analysis in the humanities often require contextualized, continuous scores.
DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs
Nayoung Choi, Jonathan Zhang, Jinho D. Choi · Jan 12, 2026 · Citations: 0

Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.
VULCA-Bench: A Multicultural Vision-Language Benchmark for Evaluating Cultural Understanding
Haorui Yu, Diji Yang, Hang He, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0

Critique Edit

We introduce VULCA-Bench, a multicultural art-critique benchmark for evaluating Vision-Language Models' (VLMs) cultural understanding beyond surface-level visual perception.
Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset
Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler · Jan 12, 2026 · Citations: 0

Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and…
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026 · Citations: 0

Rubric Rating · Critique Edit

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
Learning Through Dialogue: Engagement and Efficacy Matter More Than Explanations
Shaz Furniturewala, Gerard Christopher Yeo, Kokil Jaidka · Jan 12, 2026 · Citations: 0

We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in…
GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models
Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang · Jan 12, 2026 · Citations: 0

Extensive experiments show that our framework improves the aggregated Average by 22.4% over the strongest baseline on HumanML3D and by 14.4% on KIT-ML, while ablations confirm the effectiveness of the tokenizer, projection, and…
Reward Modeling from Natural Language Human Feedback
Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang · Jan 12, 2026 · Citations: 0

Pairwise Preference · Critique Edit

To address this issue, we propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals, thereby mitigating the problem of limited solution space inherent…
VLM-CAD: VLM-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing
Guanyuan Pan, Shuai Wang, Yugui Lin, Tiansheng Zhou, Pietro Liò · Jan 12, 2026 · Citations: 0
NRR-Phi: Text-to-State Mapping for Ambiguity Preservation in LLM Inference
Kei Saito · Jan 12, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Measuring Iterative Temporal Reasoning with Time Puzzles
Zhengxiang Wang, Zeyu Dong · Jan 12, 2026 · Citations: 0
