HFEPX Archive Slice

HFEPX Fortnight Archive: 2025-F22

Updated from current HFEPX corpus (Apr 12, 2026). 113 papers are grouped in this archive page.


Common evaluation modes: Automatic Metrics, LLM-as-Judge. Most common rater population: Domain Experts. Common annotation unit: Freeform. Frequent quality control: Calibration. Frequently cited benchmark: AIME. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Nov 2, 2025.

Papers: 113 · Last published: Nov 2, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 113 papers).

High-Signal Coverage

100.0%

60 / 60 papers are not low-signal flagged.

Benchmark Anchors

13.3%

Papers with benchmark/dataset mentions in extraction output.

Metric Anchors

43.3%

Papers with reported metric mentions in extraction output.

3 papers report explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.
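
Because the analysis blocks are computed from the 60-paper loaded sample, each percentage above can be converted back into an approximate paper count. The Python sketch below shows only that arithmetic. The helper name `papers_from_share` is hypothetical, and rounding to the nearest whole paper is an assumption, since the page does not say how it rounds.

```python
SAMPLE_SIZE = 60  # loaded sample size stated on this page (60 of 113 papers)


def papers_from_share(share_pct: float, total: int = SAMPLE_SIZE) -> int:
    """Convert a reported percentage share into an approximate paper count.

    Assumption: the page rounds to the nearest whole paper.
    """
    return round(share_pct / 100 * total)


# Shares reported in the coverage blocks above.
high_signal = papers_from_share(100.0)        # High-Signal Coverage
benchmark_anchored = papers_from_share(13.3)  # Benchmark Anchors
metric_anchored = papers_from_share(43.3)     # Metric Anchors

print(high_signal, benchmark_anchored, metric_anchored)
```

Under these assumptions, roughly 8 sampled papers carry a benchmark anchor and roughly 26 carry a metric anchor, so the set with both anchors is small.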

Primary action: Use this slice for trend comparison: review top papers first, then validate shifts in the protocol matrix.


Why This Time Slice Matters

15% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 37.2% of papers in this hub.
AIME is a recurring benchmark anchor for cross-paper comparisons on this page.

Protocol Takeaways For This Period

1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.
Most common quality-control signal is rater calibration (3.5% of papers).
Rater context is mostly domain experts, and annotation is commonly Freeform; use this to scope replication staffing.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Oct 27, 2025 · Citations: 0 · Score: 6.5

Eval: Automatic Metrics · Metrics: MSE
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Oct 30, 2025 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Oct 29, 2025 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Success rate
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Oct 25, 2025 · Citations: 0 · Score: 5.5

Eval: Automatic Metrics · Metrics: Accuracy, MAE
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Oct 31, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Accuracy
Reasoning Up the Instruction Ladder for Controllable Language Models
Oct 30, 2025 · Citations: 0 · Score: 5.0

Eval: Automatic Metrics · Metrics: Success rate, Jailbreak success rate

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA	Oct 27, 2025	Automatic Metrics	LMSYS Chatbot Arena, GSM8K	MSE	Not reported
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models	Oct 30, 2025	Automatic Metrics	AoT-PsyPhyBENCH	Accuracy	Not reported
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution	Oct 29, 2025	Automatic Metrics	APPS	Success rate	Not reported
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations	Oct 25, 2025	Automatic Metrics	VisJudge-Bench	Accuracy, MAE	Not reported
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning	Oct 31, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Reasoning Up the Instruction Ladder for Controllable Language Models	Oct 30, 2025	Automatic Metrics	Not reported	Success rate, Jailbreak success rate	Not reported
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration	Oct 30, 2025	Automatic Metrics	Not reported	Accuracy, Coherence	Not reported
RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline	Oct 29, 2025	Automatic Metrics	Not reported	ROUGE	Not reported
Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters	Oct 29, 2025	Automatic Metrics	Not reported	Agreement	Calibration
LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data	Oct 28, 2025	LLM-as-Judge, Automatic Metrics	Not reported	Accuracy, F1	Calibration
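
The triage advice on this page is to prioritize papers with both benchmark and metric anchors. One way to apply that to the ten matrix rows is sketched below in Python. The `MatrixRow` type and `with_both_anchors` helper are hypothetical names, paper titles are shortened for readability, and an empty tuple stands in for "Not reported".

```python
from typing import NamedTuple


class MatrixRow(NamedTuple):
    paper: str                      # shortened title, for illustration only
    benchmarks: tuple               # empty tuple means "Not reported"
    metrics: tuple                  # empty tuple means "Not reported"


# Values transcribed from the Protocol Matrix (Top 10).
ROWS = [
    MatrixRow("GIFT", ("LMSYS Chatbot Arena", "GSM8K"), ("MSE",)),
    MatrixRow("Which Way Does Time Flow?", ("AoT-PsyPhyBENCH",), ("Accuracy",)),
    MatrixRow("The Tool Decathlon", ("APPS",), ("Success rate",)),
    MatrixRow("VisJudge-Bench", ("VisJudge-Bench",), ("Accuracy", "MAE")),
    MatrixRow("BEAT", (), ("Accuracy",)),
    MatrixRow("Reasoning Up the Instruction Ladder", (),
              ("Success rate", "Jailbreak success rate")),
    MatrixRow("The Geometry of Dialogue", (), ("Accuracy", "Coherence")),
    MatrixRow("RECAP", (), ("ROUGE",)),
    MatrixRow("Through the Judge's Eyes", (), ("Agreement",)),
    MatrixRow("LuxIT", (), ("Accuracy", "F1")),
]


def with_both_anchors(rows):
    """Return the papers that report at least one benchmark and one metric."""
    return [row.paper for row in rows if row.benchmarks and row.metrics]


both = with_both_anchors(ROWS)
print(both)
```

On the matrix as transcribed here, four of the ten rows qualify: GIFT, Which Way Does Time Flow?, The Tool Decathlon, and VisJudge-Bench.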

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback. Coverage is a replication risk (15% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (4.4% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (7.1% vs 35% target).
Moderate: Papers naming evaluation metrics. Coverage is usable but incomplete (24.8% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (7.1% vs 35% target).
Gap: Papers with known annotation unit. Coverage is a replication risk (8.8% vs 35% target).
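
The checklist labels compare observed coverage against a target. The page does not state the cutoff between "Gap" and "Moderate", so the Python sketch below assumes a hypothetical rule: coverage at or above half the target counts as "Moderate", anything lower as "Gap". The `classify` helper, the `MODERATE_RATIO` constant, and the "Met" label are illustrative assumptions, not the site's actual logic.

```python
# (coverage %, target %) pairs copied from the checklist above.
CHECKLIST = {
    "explicit human feedback": (15.0, 45.0),
    "quality controls": (4.4, 30.0),
    "benchmarks/datasets": (7.1, 35.0),
    "evaluation metrics": (24.8, 35.0),
    "rater population": (7.1, 35.0),
    "annotation unit": (8.8, 35.0),
}

MODERATE_RATIO = 0.5  # hypothetical cutoff; not stated on the page


def classify(coverage: float, target: float,
             moderate_ratio: float = MODERATE_RATIO) -> str:
    """Label coverage relative to its target under the assumed cutoff."""
    if coverage >= target:
        return "Met"  # hypothetical label; no checklist item reaches its target
    if coverage / target >= moderate_ratio:
        return "Moderate"
    return "Gap"


labels = {name: classify(cov, tgt) for name, (cov, tgt) in CHECKLIST.items()}
for name, label in labels.items():
    print(f"{label}: {name}")
```

With this assumed cutoff the sketch happens to reproduce the labels shown on the page: only evaluation metrics (24.8% against a 35% target) is "Moderate", and the other five items are "Gap".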

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 4.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (7.1% coverage).
Annotation unit is under-specified (8.8% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (AIME vs AlpacaEval) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
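
For the first suggested analysis, one common way to quantify judge-human agreement is Cohen's kappa, which corrects raw agreement for chance. The Python sketch below is a minimal illustration under stated assumptions: the function name `cohen_kappa` and the example labels are hypothetical, and none of the papers in this slice are claimed to use this exact statistic. Computing it separately per archive period would give a simple view of drift.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pairwise-preference verdicts for six items.
human = ["A", "B", "A", "A", "B", "A"]
judge = ["A", "B", "B", "A", "B", "A"]

kappa = cohen_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

In this toy example the raters agree on 5 of 6 items (observed agreement about 0.833) while chance agreement is 0.5, giving a kappa of about 0.667.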

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: AIME
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (42)
LLM-as-Judge (7)
Human Eval (6)
Simulation Env (4)

Top Metrics

Accuracy (11)
Cost (3)
F1 (3)
Agreement (2)

Top Benchmarks

AIME (1)
AlpacaEval (1)
APPS (1)
Arena Hard (1)

Quality Controls

Calibration (4)
Inter-Annotator Agreement Reported (1)

Papers In This Archive Slice

Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen · Nov 2, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman · Nov 1, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Addressing Longstanding Challenges in Cognitive Science with Language Models
Dirk U. Wulff, Rui Mata · Oct 31, 2025 · Citations: 0
Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Hiba Ahsan, Byron C. Wallace · Oct 31, 2025 · Citations: 0
BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0

Pairwise Preference · Long Horizon

We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environments as triggers.
When Distributions Shifts: Causal Generalization for Low-Resource Languages
Mahi Aliyu Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye · Oct 31, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Analysing Environmental Efficiency in AI for X-Ray Diagnosis
Liam Kearns · Oct 31, 2025 · Citations: 0

This provides a benchmark study of 14 different model configurations for comparison of diagnostic accuracy and environmental impact.
DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi · Oct 31, 2025 · Citations: 0

Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025 · Citations: 0

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative…
Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu · Oct 31, 2025 · Citations: 0

Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups.
Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler · Oct 31, 2025 · Citations: 0

Multi Agent

Can AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts?
Probability Distributions Computed by Autoregressive Transformers
Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski · Oct 31, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
VISTA: Verification In Sequential Turn-based Assessment
Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White · Oct 30, 2025 · Citations: 0

Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines.
Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025 · Citations: 0

Red Team

Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence
Lívia Dutra, Arthur Lorenzi, Laís Berno, Franciany Campos, Karoline Biscardi · Oct 30, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon · Oct 30, 2025 · Citations: 0

Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability.
LLMs Process Lists With General Filter Heads
Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau · Oct 30, 2025 · Citations: 0

Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming…
Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Mingchen Tu, Zhiqiang Liu, Juan Li, Liangyurui Liu, Junjie Wang · Oct 30, 2025 · Citations: 0

Extensive evaluations on medical QA benchmarks using Llama3-8B-Instruct and Med42-V2 demonstrate the effectiveness of Evontree, which outperforms both the base models and strong baselines, achieving up to a 3.7% improvement in accuracy.
Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation
Zihao Guo, Qingyun Sun, Ziwei Zhang, Haonan Yuan, Huiping Zhuang · Oct 30, 2025 · Citations: 0
Co-Evolving Latent Action World Models
Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang · Oct 30, 2025 · Citations: 0
The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0

Pairwise Preference · Multi Agent

While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection
Arefeh Kazemi, Hamza Qadeer, Joachim Wagner, Hossein Hosseini, Sri Balaaji Natarajan Kalaivendan · Oct 30, 2025 · Citations: 0

SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions.
Are Language Models Borrowing-Blind? A Multilingual Evaluation of Loanword Identification across 10 Languages
Mérilin Sousa Silva, Sina Ahmadi · Oct 30, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa · Oct 30, 2025 · Citations: 0

We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0

Demonstrations · Long Horizon

Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li · Oct 29, 2025 · Citations: 0

Red Team

As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs.
Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025 · Citations: 0

Large language models (LLMs) are increasingly used as raters for evaluation tasks.
TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling
He Hu, Chiyuan Ma, Qianning Wang, Lin Liu, Yucheng Zhou · Oct 29, 2025 · Citations: 0
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025 · Citations: 0

Long Horizon

To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation.
From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025 · Citations: 0

Multi Agent

To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem · Oct 29, 2025 · Citations: 0

We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), which is a comprehensive and…
World Simulation with Video Foundation Models for Physical AI
NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0

Long Horizon

These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu · Oct 28, 2025 · Citations: 0

In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages.
Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou · Oct 28, 2025 · Citations: 0
Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang · Oct 28, 2025 · Citations: 0

Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy.
Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin · Oct 28, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI: Bowen Ma, Cheng Zou, ChengKun Du · Oct 28, 2025 · Citations: 0
LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot · Oct 28, 2025 · Citations: 0

Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs.
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu · Oct 28, 2025 · Citations: 0
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu · Oct 28, 2025 · Citations: 0

Expert Verification

To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project-level tasks.
Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren · Oct 28, 2025 · Citations: 0

Long Horizon

To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct…
MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Aaron Scott, Maike Züfle, Jan Niehues · Oct 28, 2025 · Citations: 0
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0

Pairwise Preference

This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025 · Citations: 0

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural…
A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li · Oct 27, 2025 · Citations: 0

The rapid advancement of large language models (LLMs) has spurred the emergence of data agents, autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks.
RobotArena ∞: Scalable Robot Benchmarking via Real-to-Sim Translation
Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang · Oct 27, 2025 · Citations: 0
An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning
Xingtu Liu · Oct 27, 2025 · Citations: 0
Quantifying Systemic Vulnerability in the Foundation Model Industry
Claudio Pirrone, Stefano Fricano, Gioacchino Fazio · Oct 27, 2025 · Citations: 0
SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications
Edouard Lansiaux, Antoine Simonet, Eric Wiel · Oct 27, 2025 · Citations: 0

Evaluation demonstrates exceptional duplicate detection performance (90.1% AP) and strong semantic similarity (76.1% Spearman correlation).
Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan · Oct 27, 2025 · Citations: 0

Pairwise Preference

Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures
Shenran Wang, Timothy Tin-Long Tse, Jian Zhu · Oct 27, 2025 · Citations: 0
Batch Speculative Decoding Done Right
Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li · Oct 26, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao · Oct 26, 2025 · Citations: 0

Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation.
Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim · Oct 26, 2025 · Citations: 0
REVISION: Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization
Yiwen Tang, Qiuyu Zhao, Zenghui Sun, Jinsong Lan, Xiaoyong Zhu · Oct 26, 2025 · Citations: 0

Critique Edit

To alleviate the issue, we propose a novel framework REVISION.
Rule-Based Explanations for Retrieval-Augmented LLM Systems
Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta · Oct 26, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Towards Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama · Oct 26, 2025 · Citations: 0

As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.
VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang · Oct 25, 2025 · Citations: 0

To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality.
WAON: Large-Scale Japanese Image-Text Pair Dataset for Improving Model Performance on Japanese Cultural Tasks
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe · Oct 25, 2025 · Citations: 0

To improve the quality and reliability of evaluation on Japanese cultural tasks, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification comprising 374 classes, which addresses issues in the…
