- $OneMillion-Bench: How Far are Language Agents from Human Experts?
Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0
Automatic Metrics Law
To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
- Document Reconstruction Unlocks Scalable Long-Context RLVR
Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin · Feb 9, 2026 · Citations: 0
Automatic Metrics Coding
However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming.
- SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie · Mar 22, 2026 · Citations: 0
Automatic Metrics Medicine
Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
- HyperMem: Hypergraph Memory for Long-Term Conversations
Juwei Yue, Chuanrui Hu, Jiawei Sheng, Zuyi Zhou, Wenyuan Zhang · Apr 9, 2026 · Citations: 0
LLM As Judge Automatic Metrics General
Long-term memory is essential for conversational agents to maintain coherence, track persistent tasks, and provide personalized interactions across extended dialogues.
- VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng · Mar 5, 2026 · Citations: 0
Human Eval General
Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on…
- The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0
Automatic Metrics General
While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0
Simulation Env General
We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
- From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0
Simulation Env Coding
We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0
Simulation Env General
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
- Towards Reward Modeling for AI Tutors in Math Mistake Remediation
Kseniia Petukhova, Ekaterina Kochmar · Mar 25, 2026 · Citations: 0
Automatic Metrics Math
We develop and release Bradley-Terry preference models trained on weighted-sum rankings that we automatically create from MRBench, synthetic pairs, and data combinations.
- PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses
Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim · Mar 11, 2026 · Citations: 0
Automatic Metrics Medicine
We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
- Reasoning or Rhetoric? An Empirical Analysis of Moral Reasoning Explanations in Large Language Models
Aryan Kasat, Smriti Singh, Aman Chadha, Vinija Jain · Mar 23, 2026 · Citations: 0
LLM As Judge General
Using an LLM-as-judge scoring pipeline validated across three judge models, we classify more than 600 responses from 13 LLMs spanning a range of architectures, parameter scales, and training regimes across six classical moral dilemmas, and…
- Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao · Apr 7, 2025 · Citations: 0
Automatic Metrics Math Coding
We organize existing benchmarks and datasets into coherent categories reflecting the evolving landscape of multi-turn dialogue evaluation, and review a broad spectrum of enhancement methodologies, including model-centric strategies…
- PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang · Apr 2, 2026 · Citations: 0
Automatic Metrics General
Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global…
- BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents
Praveen Kumar Myakala, Manan Agrawal, Rahul Manche · Mar 25, 2026 · Citations: 0
Automatic Metrics General
LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved.
- StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models
Zehao Chen, Rong Pan, Haoran Li · Oct 13, 2025 · Citations: 0
Simulation Env General
Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment.
- YC-Bench: Benchmarking AI Agents for Long-Term Planning and Consistent Execution
Muyu He, Adit Jain, Anand Kumar, Vincent Tu, Soumyadeep Bakshi · Apr 1, 2026 · Citations: 0
Automatic Metrics General
As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound.
- QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate
Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin · Mar 12, 2026 · Citations: 0
Automatic Metrics General
Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge…
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0
Automatic Metrics Math Law
Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
- Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026 · Citations: 0
General
Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation.
- Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement
Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang · Mar 9, 2026 · Citations: 0
Automatic Metrics General
To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty.
- LayerT2V: A Unified Multi-Layer Video Generation Framework
Guangzhao Li, Kangrui Cen, Baixuan Zhao, Yi Xin, Siqi Luo · Aug 6, 2025 · Citations: 0
Automatic Metrics General
Text-to-video generation has advanced rapidly, but existing methods typically output only the final composited video and lack editable layered representations, limiting their use in professional workflows.
- Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025 · Citations: 0
Automatic Metrics Coding
On the Episodic Memory Benchmark (EpBench), comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%.
- Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
Justin Lovelace, Christian Belardi, Sofian Zalouk, Adhitya Polavaram, Srivatsa Kundurthy · Feb 24, 2026 · Citations: 0
LLM As Judge Automatic Metrics General
Evaluations show STAR-LDM significantly outperforms similar-sized models on language understanding benchmarks and achieves >70% win rates in LLM-as-judge comparisons for narrative coherence and commonsense reasoning.
- Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi · Mar 31, 2026 · Citations: 0
Automatic Metrics Math Coding
Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9%…
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0
Automatic Metrics Math
We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
- Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato · Apr 9, 2026 · Citations: 0
- Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs
Tunazzina Islam · Apr 8, 2026 · Citations: 0
- PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li · Mar 26, 2026 · Citations: 0
- Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models
Kaijin Chen, Dingkang Liang, Xin Zhou, Yikang Ding, Xiaoqiang Liu · Mar 26, 2026 · Citations: 0
- Pixelis: Reasoning in Pixels, from Seeing to Acting
Yunpeng Zhou · Mar 26, 2026 · Citations: 0
- Anti-I2V: Safeguarding your photos from malicious image-to-video generation
Duc Vu, Anh Nguyen, Chi Tran, Anh Tran · Mar 25, 2026 · Citations: 0
- CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition
Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy · Mar 25, 2026 · Citations: 0
- Invisible Threats from Model Context Protocol: Generating Stealthy Injection Payload via Tree-based Adaptive Search
Yulin Shen, Xudong Pan, Geng Hong, Min Yang · Mar 25, 2026 · Citations: 0
- Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval
Junkai Yang, Qirui Wang, Yaoqing Jin, Shuai Ma, Minghan Xu · Mar 25, 2026 · Citations: 0
- MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage
Ufaq Khan, Umair Nawaz, L D M S S Teja, Numaan Saeed, Muhammad Bilal · Mar 24, 2026 · Citations: 0
- InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting
Duc Vu, Kien Nguyen, Trong-Tung Nguyen, Ngan Nguyen, Phong Nguyen · Mar 24, 2026 · Citations: 0
- Is AI Catching Up to Human Expression? Exploring Emotion, Personality, Authorship, and Linguistic Style in English and Arabic with Six Large Language Models
Nasser A Alsadhan · Mar 24, 2026 · Citations: 0
- From the AI Act to a European AI Agency: Completing the Union's Regulatory Architecture
Georgios Pavlidis · Mar 24, 2026 · Citations: 0
- Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation
Ireh Kim, Tesia Sker, Chanwoo Kim · Mar 23, 2026 · Citations: 0
- Improving Coherence and Persistence in Agentic AI for System Optimization
Pantea Karimi, Kimia Noorbakhsh, Mohammad Alizadeh, Hari Balakrishnan · Mar 22, 2026 · Citations: 0
- Fusing Memory and Attention: A study on LSTM, Transformer and Hybrid Architectures for Symbolic Music Generation
Soudeep Ghoshal, Sandipan Chakraborty, Pradipto Chowdhury, Himanshu Buckchash · Mar 22, 2026 · Citations: 0
- Differential Attention-Augmented BiomedCLIP with Asymmetric Focal Optimization for Imbalanced Multi-Label Video Capsule Endoscopy Classification
Podakanti Satyajith Chary, Nagarajan Ganapathy · Mar 18, 2026 · Citations: 0
- FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion
Hugo Caselles-Dupré, Mathis Koroglu, Guillaume Jeanneret, Arnaud Dapogny, Matthieu Cord · Mar 18, 2026 · Citations: 0
- VirPro: Visual-referred Probabilistic Prompt Learning for Weakly-Supervised Monocular 3D Detection
Chupeng Liu, Jiyong Rao, Shangquan Sun, Runkai Zhao, Weidong Cai · Mar 18, 2026 · Citations: 0
- AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents
Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu · Mar 17, 2026 · Citations: 0
- Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability
Fan Huang, Haewoon Kwak, Jisun An · Mar 16, 2026 · Citations: 0
- Prompt Engineering for Scale Development in Generative Psychometrics
Lara Lee Russell-Lasalandra, Hudson Golino · Mar 16, 2026 · Citations: 0
- Spatio-Semantic Expert Routing Architecture with Mixture-of-Experts for Referring Image Segmentation
Alaa Dalaq, Muzammil Behzad · Mar 13, 2026 · Citations: 0
- LoV3D: Grounding Cognitive Prognosis Reasoning in Longitudinal 3D Brain MRI via Regional Volume Assessments
Zhaoyang Jiang, Zhizhong Fu, David McAllister, Yunsoo Kim, Honghan Wu · Mar 12, 2026 · Citations: 0
- COMPASS: The explainable agentic framework for Sovereignty, Sustainability, Compliance, and Ethics
Jean-Sébastien Dessureault, Alain-Thierry Iliho Manzi, Soukaina Alaoui Ismaili, Khadim Lo, Mireille Lalancette · Mar 11, 2026 · Citations: 0
- LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing
Dongfang Li, Zixuan Liu, Gang Lin, Baotian Hu, Min Zhang · Mar 9, 2026 · Citations: 0
- How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms
JV Roig · Mar 9, 2026 · Citations: 0
- HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation
Yifan Zhu, Guanting Chen, Bing Wei, Haoran Luo · Mar 5, 2026 · Citations: 0
- Assessing the Effectiveness of LLMs in Delivering Cognitive Behavioral Therapy
Navdeep Singh Bedi, Ana-Maria Bucur, Noriko Kando, Fabio Crestani · Mar 4, 2026 · Citations: 0
- Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning
Zhengjian Yao, Yongzhi Li, Xinyuan Gao, Quan Chen, Peng Jiang · Mar 4, 2026 · Citations: 0
- A Neural Topic Method Using a Large-Language-Model-in-the-Loop for Business Research
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang · Mar 4, 2026 · Citations: 0
- MT-PingEval: Evaluating Multi-Turn Collaboration with Private Information Games
Jacob Eisenstein, Fantine Huot, Adam Fisch, Jonathan Berant, Mirella Lapata · Feb 27, 2026 · Citations: 0
- LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
Yu Zhu, Kai Yang · Feb 27, 2026 · Citations: 0
- MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang · Feb 26, 2026 · Citations: 0