HFEPX Archive Slice

HFEPX Quarterly Archive: 2025-Q3

Updated from current HFEPX corpus (Apr 12, 2026). 389 papers are grouped in this quarterly archive page.


Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: DROP. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Sep 30, 2025.

Papers: 389 · Last published: Sep 30, 2025

Researcher Quick Triage

Use this archive page for time-slice monitoring (what changed in evaluation methods, metrics, and protocol quality this period). Quality band: High.

Analysis blocks are computed from the loaded sample (60 of 389 papers).

High-Signal Coverage: 100.0% (60 / 60 papers are not low-signal flagged).
Benchmark Anchors: 5.0% (papers with benchmark/dataset mentions in extraction output).
Metric Anchors: 23.3% (papers with reported metric mentions in extraction output).

1 paper reports explicit quality controls for this archive period.
Prioritize papers with both benchmark and metric anchors for reliable longitudinal comparisons.

Primary action: Use this slice as early signal only; benchmark/metric anchoring is limited for rigorous period-over-period claims.


Why This Time Slice Matters

13.4% of papers report explicit human-feedback signals, led by pairwise preferences.
Automatic metrics appear in 29.3% of papers in this hub.
DROP is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways For This Period

Most common quality-control signal is rater calibration (2.6% of papers).
Raters are mostly domain experts, and annotation is commonly done at the trajectory level; use this to scope replication staffing.
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Start Here (Highest-Signal Papers In This Slice)

Ranked by protocol completeness and evidence density for faster period-over-period review.

MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Sep 30, 2025 · Citations: 0 · Score: 5.5
Eval: Automatic Metrics · Metrics: Agreement

PrefDisco: Benchmarking Proactive Personalized Reasoning
Sep 30, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Accuracy

EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Sep 30, 2025 · Citations: 0 · Score: 4.5
Eval: Llm As Judge · Metrics: Not reported

DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Sep 29, 2025 · Citations: 0 · Score: 4.5
Eval: Automatic Metrics · Metrics: Success rate, Jailbreak success rate

DRBench: A Realistic Benchmark for Enterprise Deep Research
Sep 30, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Accuracy, Recall

Generative Value Conflicts Reveal LLM Priorities
Sep 29, 2025 · Citations: 0 · Score: 3.5
Eval: Automatic Metrics · Metrics: Harmlessness

Protocol Matrix (Top 10)

Quickly compare method ingredients across this archive slice.

Paper	Date	Eval Modes	Benchmarks	Metrics	Quality Controls
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages	Sep 30, 2025	Automatic Metrics	Not reported	Agreement	Inter Annotator Agreement Reported
PrefDisco: Benchmarking Proactive Personalized Reasoning	Sep 30, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing	Sep 30, 2025	Llm As Judge	Genai Bench, Aurora Bench	Not reported	Not reported
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models	Sep 29, 2025	Automatic Metrics	Not reported	Success rate, Jailbreak success rate	Not reported
DRBench: A Realistic Benchmark for Enterprise Deep Research	Sep 30, 2025	Automatic Metrics	Not reported	Accuracy, Recall	Not reported
Generative Value Conflicts Reveal LLM Priorities	Sep 29, 2025	Automatic Metrics	Not reported	Harmlessness	Not reported
Incentive-Aligned Multi-Source LLM Summaries	Sep 29, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct	Sep 29, 2025	Automatic Metrics	Not reported	Perplexity	Not reported
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models	Sep 29, 2025	Automatic Metrics	Not reported	Accuracy	Not reported
ProxyAttn: Guided Sparse Attention via Representative Heads	Sep 29, 2025	Automatic Metrics	Not reported	Cost	Not reported
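
The matrix rows can be screened for the "both benchmark and metric anchors" criterion this page recommends. The following is a minimal sketch, not part of the HFEPX tooling: the row data is transcribed by hand from the table above with shortened paper names, "Not reported" is represented as an empty list, and the names `MATRIX`, `anchor_status`, and `group_by_anchor` are hypothetical.

```python
# Rows transcribed from the Protocol Matrix above (paper names shortened).
# "Not reported" is represented as an empty list; this encoding is an assumption.
MATRIX = [
    {"paper": "MENLO", "benchmarks": [], "metrics": ["Agreement"]},
    {"paper": "PrefDisco", "benchmarks": [], "metrics": ["Accuracy"]},
    {"paper": "EditReward", "benchmarks": ["Genai Bench", "Aurora Bench"], "metrics": []},
    {"paper": "DiffuGuard", "benchmarks": [], "metrics": ["Success rate", "Jailbreak success rate"]},
    {"paper": "DRBench", "benchmarks": [], "metrics": ["Accuracy", "Recall"]},
    {"paper": "Generative Value Conflicts", "benchmarks": [], "metrics": ["Harmlessness"]},
    {"paper": "Incentive-Aligned Summaries", "benchmarks": [], "metrics": ["Accuracy"]},
    {"paper": "DiDi-Instruct", "benchmarks": [], "metrics": ["Perplexity"]},
    {"paper": "TimeOmni-1", "benchmarks": [], "metrics": ["Accuracy"]},
    {"paper": "ProxyAttn", "benchmarks": [], "metrics": ["Cost"]},
]


def anchor_status(row):
    """Classify a row by which anchors it reports."""
    has_benchmark = bool(row["benchmarks"])
    has_metric = bool(row["metrics"])
    if has_benchmark and has_metric:
        return "both"
    if has_benchmark:
        return "benchmark only"
    if has_metric:
        return "metric only"
    return "none"


def group_by_anchor(rows):
    """Return a dict mapping each anchor status to the list of paper names."""
    groups = {"both": [], "benchmark only": [], "metric only": [], "none": []}
    for row in rows:
        groups[anchor_status(row)].append(row["paper"])
    return groups


groups = group_by_anchor(MATRIX)
for status, papers in groups.items():
    print(f"{status}: {len(papers)} paper(s)")
```

On these ten rows no paper lands in the "both" group, which is consistent with the page's advice to treat this slice as early signal only.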

Researcher Workflow (Detailed)

Checklist

Gap: Papers with explicit human feedback. Coverage is a replication risk (13.4% vs 45% target).
Gap: Papers reporting quality controls. Coverage is a replication risk (3.6% vs 30% target).
Gap: Papers naming benchmarks/datasets. Coverage is a replication risk (6.9% vs 35% target).
Gap: Papers naming evaluation metrics. Coverage is a replication risk (20.3% vs 35% target).
Gap: Papers with known rater population. Coverage is a replication risk (6.9% vs 35% target).
Gap: Papers with known annotation unit. Coverage is a replication risk (8.7% vs 35% target).
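
To see which checklist gaps are furthest from target, the coverage and target figures above can be compared directly. This is an illustrative sketch only: the page does not say how it ranks gaps, so the shortfall in percentage points and the coverage-to-target ratio used here are assumptions, as are the names `CHECKLIST` and `shortfalls`.

```python
# (label, coverage %, target %) copied from the checklist above.
CHECKLIST = [
    ("explicit human feedback", 13.4, 45.0),
    ("quality controls", 3.6, 30.0),
    ("benchmarks/datasets", 6.9, 35.0),
    ("evaluation metrics", 20.3, 35.0),
    ("known rater population", 6.9, 35.0),
    ("known annotation unit", 8.7, 35.0),
]


def shortfalls(items):
    """Return (label, gap in percentage points, coverage/target ratio),
    sorted so the items furthest from target (lowest ratio) come first."""
    rows = []
    for label, coverage, target in items:
        gap_points = round(target - coverage, 1)
        ratio = round(coverage / target, 2)
        rows.append((label, gap_points, ratio))
    return sorted(rows, key=lambda row: row[2])


rows = shortfalls(CHECKLIST)
for label, gap_points, ratio in rows:
    print(f"{label}: {gap_points} points short, {ratio:.0%} of target")
```

By ratio, quality-control reporting is the weakest item and metric naming is the closest to target.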

Strengths

Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Only 3.6% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.9% coverage).
Annotation unit is under-specified (8.7% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (DROP vs HotpotQA) before comparing methods.
Track metric sensitivity by reporting both accuracy and cost.
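
One way to quantify judge-human agreement for the first analysis is Cohen's kappa, computed per period and then compared across periods. The page does not name a statistic, so kappa is an assumption here, and the label lists below are made-up illustrative data, not results from any paper in this slice.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels on the same eight items.
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

Running the same computation on each archive period and differencing the results would give a simple drift estimate, provided the label scheme is the same in both periods.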

Recommended Queries

Judge vs Human Agreement
Benchmark Slice: DROP
Metric Slice: accuracy
IAA-Reported Evaluations
Recent High-Signal Papers

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (114)
Simulation Env (12)
Llm As Judge (10)
Human Eval (3)

Top Metrics

Accuracy (41)
Cost (7)
F1 (7)
Precision (5)

Top Benchmarks

DROP (2)
HotpotQA (2)
HumanEval+ (2)
LMSYS Chatbot Arena (2)

Quality Controls

Calibration (10)
Adjudication (2)
Inter Annotator Agreement Reported (2)
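
The raw counts in this snapshot can be turned back into the percentage shares quoted elsewhere on the page. This is a small sketch that assumes the denominator is the full 389-paper corpus rather than the 60-paper loaded sample; that assumption reproduces the 29.3% automatic-metrics share and the 2.6% calibration share stated above. The names `TOTAL_PAPERS`, `EVAL_MODE_COUNTS`, and `share` are hypothetical.

```python
TOTAL_PAPERS = 389  # corpus size stated at the top of the page

# Counts copied from the Evaluation Modes list above.
EVAL_MODE_COUNTS = {
    "Automatic Metrics": 114,
    "Simulation Env": 12,
    "Llm As Judge": 10,
    "Human Eval": 3,
}


def share(count, total=TOTAL_PAPERS):
    """Percentage of the corpus, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {mode: share(count) for mode, count in EVAL_MODE_COUNTS.items()}
for mode, pct in shares.items():
    print(f"{mode}: {pct}%")

# Calibration is listed with 10 papers under Quality Controls.
calibration_share = share(10)
print(f"Calibration: {calibration_share}%")
```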

Papers In This Archive Slice

BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley · Sep 30, 2025 · Citations: 0

Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading…
PrefDisco: Benchmarking Proactive Personalized Reasoning
Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh · Sep 30, 2025 · Citations: 0

Pairwise Preference, Rubric Rating

We introduce PrefDisco, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PrefAlign as a…
DRBench: A Realistic Benchmark for Enterprise Deep Research
Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh · Sep 30, 2025 · Citations: 0

Long Horizon

We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings.
MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi · Sep 30, 2025 · Citations: 0

Pairwise Preference, Rubric Rating

To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
Jingdi Lei, Varun Gumma, Rishabh Bhardwaj, Seok Min Lim, Chuan Li · Sep 30, 2025 · Citations: 0
On Deepfake Voice Detection -- It's All in the Presentation
Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro · Sep 30, 2025 · Citations: 0
Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo · Sep 30, 2025 · Citations: 0

Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities.
EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu · Sep 30, 2025 · Citations: 0

Pairwise Preference

To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset, meticulously annotated by trained experts following a rigorous protocol containing over 200K preference pairs.
Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
Hanwen Du, Yuxin Dong, Xia Ning · Sep 30, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi, Jacopo Staiano, Antonio Liotta · Sep 30, 2025 · Citations: 0

Critique Edit

ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores.
SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
Christoph Timmermann, Hyunse Lee, Woojin Lee · Sep 30, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Bringing Emerging Architectures to Sequence Labeling in NLP
Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares · Sep 30, 2025 · Citations: 0

We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages.
Vector sketch animation generation with differentiable motion trajectories
Xinding Zhu, Xinye Yang, Shuyang Zheng, Zhexin Zhang, Fei Gao · Sep 30, 2025 · Citations: 0
Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang · Sep 30, 2025 · Citations: 0

Long Horizon

Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50% across seven benchmarks while simultaneously maintaining or even improving performance.
v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui · Sep 30, 2025 · Citations: 0

AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions.
LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji · Sep 30, 2025 · Citations: 0

Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines, across a diverse set of benchmarks.
Calibrating Verbalized Confidence with Self-Generated Distractors
Victor Wang, Elias Stengel-Eskin · Sep 29, 2025 · Citations: 0
The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis
Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse · Sep 29, 2025 · Citations: 0

We quantitatively examine two decades (2005 - 2025) of contributions to AfricaNLP research, using a dataset of 2.2K NLP papers, 4.9K contributing authors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions), along with…
Polychromic Objectives for Reinforcement Learning
Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh · Sep 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
Shane Bergsma, Nolan Dey, Joel Hestness · Sep 29, 2025 · Citations: 0

We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*.
Generative Value Conflicts Reveal LLM Priorities
Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh · Sep 29, 2025 · Citations: 0

Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended…
Pretraining with hierarchical memories: separating long-tail and common knowledge
Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel · Sep 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Incentive-Aligned Multi-Source LLM Summaries
Yanchen Jiang, Zhe Feng, Aranyak Mehta · Sep 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang · Sep 29, 2025 · Citations: 0

Demonstrations

We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM.
Pretraining Large Language Models with NVFP4
NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben · Sep 29, 2025 · Citations: 0
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang · Sep 29, 2025 · Citations: 0
Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao · Sep 29, 2025 · Citations: 0

The prevalent deployment of Large Language Model agents such as OpenClaw unlocks potential in real-world applications, while amplifying safety concerns.
Towards Personalized Deep Research: Benchmarks and Evaluations
Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian · Sep 29, 2025 · Citations: 0
Scaling with Collapse: Efficient and Predictable Training of LLM Families
Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal · Sep 29, 2025 · Citations: 0
Scaling Generalist Data-Analytic Agents
Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang · Sep 29, 2025 · Citations: 0

Long Horizon

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI.
Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct
Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu · Sep 29, 2025 · Citations: 0

On the OpenWebText benchmark, DiDi-Instruct achieves perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline.
Agentic Exploration of Physics Models
Maximilian Nägele, Florian Marquardt · Sep 29, 2025 · Citations: 0

Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the…
MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen · Sep 29, 2025 · Citations: 0
Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball · Sep 29, 2025 · Citations: 0

Rubric Rating

Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards.
TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang · Sep 29, 2025 · Citations: 0

TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs.
VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan · Sep 29, 2025 · Citations: 0

Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the critical potential of unified generative…
ProxyAttn: Guided Sparse Attention via Representative Heads
Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang · Sep 29, 2025 · Citations: 0

By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost.
Stop Before You Fail: Operational Capability Boundaries for Mitigating Unproductive Reasoning in Large Reasoning Models
Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei · Sep 29, 2025 · Citations: 0

In white-box settings, we show that the hidden states of the last input token contain information that is predictive of whether a question will not be solved correctly under our evaluation setup.
Inducing Dyslexia in Vision Language Models
Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf · Sep 29, 2025 · Citations: 0

Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that they predict human VWFA neural responses.
Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings
Hamna Hamna, Gayatri Bhat, Sourabrata Mukherjee, Faisal Lalani, Evan Hadfield · Sep 29, 2025 · Citations: 0

Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users.
SUIT: Knowledge Editing with Subspace-Aware Key-Value Mappings
Haewon Park, Sangwoo Kim, Yohan Jo · Sep 29, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li · Sep 29, 2025 · Citations: 0

Pairwise Preference, Red Team

Motivated by these, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong…
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Langqi Yang, Tianhang Zheng, Yixuan Chen, Kedong Xiu, Hao Zhou · Sep 29, 2025 · Citations: 0

To address this gap, we present HarmMetric Eval, a systematic benchmark for assessing the quality of harmfulness metrics and judges with varying formats and scales.
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang · Sep 29, 2025 · Citations: 0

Red Team

Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final…
SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents
Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee · Sep 29, 2025 · Citations: 0

We introduce SimuHome, a high-fidelity smart home simulator and a benchmark of 600 episodes for LLM-based smart home agents.
G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong · Sep 29, 2025 · Citations: 0
Prompt and Parameter Co-Optimization for Large Language Models
Xiaohe Bo, Rui Li, Zexu Sun, Quanyu Dai, Zeyu Zhang · Sep 29, 2025 · Citations: 0

Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.
BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre · Sep 29, 2025 · Citations: 0
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models
Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, Ting Wang · Sep 29, 2025 · Citations: 0
Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links
Guangliang Liu, Xi Chen, Bocheng Chen, Xitong Zhang, Kristen Johnson · Sep 28, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian · Sep 28, 2025 · Citations: 0

Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding.
VoiceBridge: General Speech Restoration with One-step Latent Bridge Models
Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu · Sep 28, 2025 · Citations: 0
SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models
Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan · Sep 28, 2025 · Citations: 0

This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals.
From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin · Sep 28, 2025 · Citations: 0

Multi Agent

In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task.
Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025 · Citations: 0

Pairwise Preference

These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li · Sep 28, 2025 · Citations: 0
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
Junyou Wang, Zehua Chen, Binjie Yuan, Kaiwen Zheng, Chang Li · Sep 28, 2025 · Citations: 0
Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li · Sep 28, 2025 · Citations: 0

Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings.
Internal Planning in Language Models: Characterizing Horizon and Branch Awareness
Muhammed Ustaomeroglu, Baris Askin, Gauri Joshi, Carlee Joe-Wong, Guannan Qu · Sep 28, 2025 · Citations: 0

Long Horizon

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra · Sep 27, 2025 · Citations: 0

However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized.
