- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0
Pairwise Preference · Rubric Rating
Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
- Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou · Apr 8, 2026 · Citations: 0
We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations.
- Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction
Jackson Petty, Jaulie Goe, Tal Linzen · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence
Jianhui Liu, Haoze Sun, Wenbo Li, Yanbing Zhang, Rui Yang · Apr 8, 2026 · Citations: 0
Spatial understanding is a fundamental cornerstone of human-level intelligence.
- Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation
Songhee Han · Apr 8, 2026 · Citations: 0
More fundamentally, teaching and learning are shaped by human cognition, behavior, motivation, and social interaction in ways that cannot be fully specified, predicted, or exhaustively modeled.
- A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering
Nusrat Sultana, Abdullah Muhammad Moosa, Kazi Afzalur Rahman, Sajal Chandra Banik · Apr 8, 2026 · Citations: 0
This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus.
- ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection
Chhavi Dhiman, Naman Chawla, Riya Dhami, Gaurav Kumar, Ganesh Naik · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent
Bingxuan Li, Simo Du, Yue Guo · Apr 8, 2026 · Citations: 0
Long Horizon
We propose SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module.
- Efficient Learned Data Compression via Dual-Stream Feature Decoupling
Huidong Ma, Xinyan Shi, Hui Sun, Xiaofei Yue, Xiaoguang Liu · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- On the Price of Privacy for Language Identification and Generation
Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- How Much LLM Does a Self-Revising Agent Actually Need?
Seongwoo Jeong, Seonil Son · Apr 8, 2026 · Citations: 0
Critique Edit
Recent LLM-based agents often place world modeling, planning, and reflection inside a single language model loop.
- TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Yen-Shan Chen, Sian-Yao Huang, Cheng-Lin Yang, Yun-Nung Chen · Apr 8, 2026 · Citations: 0
Red Team · Long Horizon
As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces.
- LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics
Kosmas Pinitas, Ilias Maglogiannis · Apr 8, 2026 · Citations: 0
Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI.
- Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery
Jia Yu, Weiwei Yu, Pengfei Xiao, Fukun Xing · Apr 8, 2026 · Citations: 0
We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the…
- Dynamic Context Evolution for Scalable Synthetic Data Generation
Ryan Lingo, Rajeev Chhajer · Apr 8, 2026 · Citations: 0
Tool Use
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Language Bias under Conflicting Information in Multilingual LLMs
Robert Östling, Murathan Kurfalı · Apr 8, 2026 · Citations: 0
To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of…
- Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews
Ehsan Barkhordar, Abdulfattah Safa, Verena Blaschke, Erika Lombart, Marie-Catherine de Marneffe · Apr 8, 2026 · Citations: 0
We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37…
- Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering
Elyas Irankhah, Samah Fodeh · Apr 8, 2026 · Citations: 0
Expert Verification
Third, results on the development set show that alignment accuracy is mainly limited by reasoning.
- The Impact of Steering Large Language Models with Persona Vectors in Educational Applications
Yongchao Wu, Aron Henriksson · Apr 8, 2026 · Citations: 0
We study persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures.
- STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu · Apr 8, 2026 · Citations: 0
To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
- Selective Neuron Amplification for Training-Free Task Enhancement
Ryyan Akhtar · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multilingual Embedding Probes Fail to Generalize Across Learner Corpora
Laurits Lyngbaek, Ross Deans Kristensen-McLachlan · Apr 8, 2026 · Citations: 0
Under in-distribution evaluation, probes achieve strong performance (QWK ≈ 0.7), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions.
- Is Cross-Lingual Transfer in Bilingual Models Human-Like? A Study with Overlapping Word Forms in Dutch and English
Iza Škrjanec, Irene Elisabeth Winther, Vera Demberg, Stefan L. Frank · Apr 8, 2026 · Citations: 0
However, their alignment with human processing seems to critically depend on how lexical overlap is encoded, possibly limiting their explanatory adequacy as models of bilingual reading.
- SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)
Liang-Chih Yu, Jonas Becker, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Lung-Hao Lee · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text
Muhammad Apriandito Arya Saputra, Andry Alamsyah, Dian Puteri Ramadhani, Thomhert Suprapto Siadari, Hanif Fakhrurroja · Apr 8, 2026 · Citations: 0
In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points.
- Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Le Zhan, Yanqi Yang, Leo Huang · Apr 8, 2026 · Citations: 0
We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable…
- ReDAct: Uncertainty-Aware Deferral for LLM Agents
Dzianis Piatrashyn, Nikita Kotelevskii, Kirill Grishchenkov, Nikita Glazkov, Ivan Nasonov · Apr 8, 2026 · Citations: 0
Long Horizon
Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems.
- Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
Md Motaleb Hossen Manik, Ge Wang · Apr 8, 2026 · Citations: 0
We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and…
- Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
Philipp D. Siedler · Apr 8, 2026 · Citations: 0
Multi Agent
We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation.
- MARS: Enabling Autoregressive Models Multi-Token Generation
Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun · Apr 8, 2026 · Citations: 0
When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks.
- Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl
Juan-José Guzman-Landa, Juan-Manuel Torres-Moreno, Graham Ranger, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DTCRS: Dynamic Tree Construction for Recursive Summarization
Guanran Luo, Zhongquan Jian, Wentao Qiu, Meihong Wang, Qingqiang Wu · Apr 8, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Continuous Interpretive Steering for Scalar Diversity
Ye-eun Cho · Apr 8, 2026 · Citations: 0
However, evaluations of pragmatic inference in large language models (LLMs) often rely on prompt-based manipulations.
- ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals
Yihao Wang, Zijian He, Jie Ren, Keze Wang · Apr 8, 2026 · Citations: 0
We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition.
- Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026 · Citations: 0
Pairwise Preference · Rubric Rating
We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
- The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era
Rudra Jadhav, Janhavi Danve · Apr 8, 2026 · Citations: 0
We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S.
- Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus
Aidan Mannion, Cécile Macaire, Armand Violle, Stéphane Ohayon, Xavier Tannier · Apr 8, 2026 · Citations: 0
Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and conducting extensive comparative evaluations.
- iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations
Wenshuo Wang, Boyu Cao, Nan Zhuang, Wei Li · Apr 8, 2026 · Citations: 0
This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.
- Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han · Apr 8, 2026 · Citations: 0
Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy.
- Digital Skin, Digital Bias: Uncovering Tone-Based Biases in LLMs and Emoji Embeddings
Mingchen Li, Wajdi Aljedaani, Yingjie Liu, Navyasri Meka, Xuan Lu · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models
Ane G. Domingo-Aldama, Iker De La Iglesia, Maitane Urruela, Aitziber Atutxa, Ander Barrena · Apr 8, 2026 · Citations: 0
BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical…
- MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors
Xiaotian Luo, Xun Jiang, Jiangcheng Wu · Apr 8, 2026 · Citations: 0
Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or…
- HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues
Yijie Zhong, Yunfan Gao, Haofen Wang · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- On the Step Length Confounding in LLM Reasoning Data Selection
Bing Wang, Rui Miao, Chen Shen, Shaotian Yan, Kaiyuan Liu · Apr 8, 2026 · Citations: 0
Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
- Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang · Apr 8, 2026 · Citations: 0
Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality.
- WRAP++: Web discoveRy Amplified Pretraining
Jiang Zhou, Yunhao Wang, Xing Wu, Tinghao Yu, Feng Zhang · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models
Paula Dodig, Boshko Koloski, Katarina Sitar Šuštar, Senja Pollak, Matthew Purver · Apr 8, 2026 · Citations: 0
The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content.
- SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Usman Naseem, Robert Geislinger, Juan Ren, Sarah Kohail, Rudy Garrido Veliz · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation
Guanran Luo, Wentao Qiu, Wanru Zhao, Wenhan Lv, Zhongquan Jian · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong · Apr 8, 2026 · Citations: 0
Long Horizon
Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer.
- Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions
Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal · Apr 8, 2026 · Citations: 0
Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause.
- GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering
Guanran Luo, Wentao Qiu, Zhongquan Jian, Meihong Wang, Qingqiang Wu · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Video-guided Machine Translation with Global Video Context
Jian Chen, JinZe Lv, Zi Long, XiangHua Fu · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- From Perception to Autonomous Computational Modeling: A Multi-Agent Approach
Daniel N. Wilke · Apr 8, 2026 · Citations: 0
Multi Agent
We present a solver-agnostic framework in which coordinated large language model (LLM) agents autonomously execute the complete computational mechanics workflow, from perceptual data of an engineering component through geometry extraction,…
- When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning
Yang Xiang, Yixin Ji, Ruotao Xu, Dan Qiao, Zheming Yang · Apr 8, 2026 · Citations: 0
Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT…
- Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation
Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026 · Citations: 0
Pairwise Preference
Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation.
- Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search
Zhiyu Cao, Peifeng Li, Qiaoming Zhu · Apr 8, 2026 · Citations: 0
Pairwise Preference
To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR).
- Geometric Properties of the Voronoi Tessellation in Latent Semantic Manifolds of Large Language Models
Marshall Brett · Apr 8, 2026 · Citations: 0
Fisher damage remains constant at ~5,300 positions across the validated range (λ = 0.15-0.6), achieving +28% median margin improvement at λ = 0.6 with invariant downstream benchmarks - a geometric reorganization that compresses the…
- TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks
Xiangyu Wang, Jin Wu, Haoran Shi, Wei Xia, Jiarui Yu · Apr 8, 2026 · Citations: 0
Long Horizon
To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework.
- Multilingual Cognitive Impairment Detection in the Era of Foundation Models
Damar Hoogland, Boshko Koloski, Jaya Caporusso, Tine Kolenik, Ana Zwitter Vitez · Apr 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.