- Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks
Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik · Mar 6, 2026 · Citations: 0
Pairwise Preference Expert Verification
This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods.
- Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping
Zhenyu Lei, Qiong Wu, Jianxiong Dong, Yinhan He, Emily Dodwell · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Dynamic Self-Evolving Extraction System
Moin Amin-Naseri, Hannah Kim, Estevam Hruschka · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Language Shapes Mental Health Evaluations in Large Language Models
Jiayi Xu, Xiyang Hu · Mar 6, 2026 · Citations: 0
This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations.
- MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour · Mar 6, 2026 · Citations: 0
Expert Verification
Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity.
- LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
Matthew Lyle Olson, Neale Ratzlaff, Musashi Hinck, Tri Nguyen, Vasudev Lal · Mar 6, 2026 · Citations: 0
Multi Agent
Large Language Models (LLMs) exhibit impressive general-purpose capabilities but also introduce serious safety risks, particularly the potential for deception as models acquire increased agency and human oversight diminishes.
- Symmetry-Constrained Language-Guided Program Synthesis for Discovering Governing Equations from Noisy and Partial Observations
Mirza Samad Ahmed Baig, Syeda Anshrah Gillani · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Counting on Consensus: Selecting the Right Inter-annotator Agreement Metric for NLP Annotation and Evaluation
Joseph James · Mar 6, 2026 · Citations: 0
Human annotation remains the foundation of reliable and interpretable data in Natural Language Processing (NLP).
- Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers
David Heye, Karl Kindermann, Robin Decker, Johannes Lohmöller, Anastasiia Belova · Mar 6, 2026 · Citations: 0
Artifact Evaluation (AE) is essential for ensuring the transparency and reliability of research; closing the gap between exploratory work and real-world deployment is especially important in cybersecurity, particularly in IoT and CPSs,…
- Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records
Brian E. Perron, Dragan Stoll, Bryan G. Victor, Zia Qia, Andreas Jud · Mar 6, 2026 · Citations: 0
Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa).
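The entry above reports inter-method reliability via Cohen's kappa, the standard chance-corrected agreement statistic. A minimal sketch of that computation (the raters, labels, and data below are illustrative, not from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), agreement corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical raters over 10 items with three labels.
a = ["x", "x", "y", "y", "z", "x", "y", "z", "z", "x"]
b = ["x", "x", "y", "z", "z", "x", "y", "z", "y", "x"]
print(round(cohens_kappa(a, b), 3))  # → 0.697
```

Raw agreement here is 0.8, but kappa discounts the 0.34 agreement expected by chance, which is why kappa rather than raw accuracy is reported for inter-method reliability.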
- "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior
Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan · Mar 6, 2026 · Citations: 0
Pairwise Preference
The alignment problem concerns ensuring that powerful AI systems remain compatible with human preferences and values as their capabilities increase.
- KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection
Archie Sage, Salvatore Greco · Mar 6, 2026 · Citations: 0
Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
- Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning
Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar · Mar 6, 2026 · Citations: 0
Evaluations on over 1,500 hours of real-world conversational speech across 11 languages and 5 English dialects show that contextual input consistently improves recognition quality.
- Beyond Rows to Reasoning: Agentic Retrieval for Multimodal Spreadsheet Understanding and Editing
Anmol Gulati, Sahil Sen, Waqar Sarguroh, Kevin Paul · Mar 6, 2026 · Citations: 0
Long Horizon
We introduce Beyond Rows to Reasoning (BRTR), a multimodal agentic framework for spreadsheet understanding that replaces single-pass retrieval with an iterative tool-calling loop, supporting end-to-end Excel workflows from complex analysis…
- COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics
Kartik Sharma, Rakshit S. Trivedi · Mar 6, 2026 · Citations: 0
Pairwise Preference Demonstrations
Experiments across a variety of steering tasks and benchmarks demonstrate that COLD-Steer achieves up to 95% steering effectiveness while using 50 times fewer samples than the best baseline.
- NOBLE: Accelerating Transformers with Nonlinear Low-Rank Branches
Ethan Smith · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- PONTE: Personalized Orchestration for Natural Language Trustworthy Explanations
Vittoria Vineis, Matteo Silvestri, Lorenzo Antonelli, Filippo Betello, Gabriele Tolomei · Mar 6, 2026 · Citations: 0
Pairwise Preference
To address these challenges, we present PONTE (Personalized Orchestration for Natural language Trustworthy Explanations), a human-in-the-loop framework for adaptive and reliable XAI narratives.
- Abductive Reasoning with Syllogistic Forms in Large Language Models
Hirohiko Abe, Risako Ando, Takanobu Morishita, Kentaro Ozeki, Koji Mineshima, Mitsuhiro Okada · Mar 6, 2026 · Citations: 0
Research in AI using Large-Language Models (LLMs) is rapidly evolving, and the comparison of their performance with human reasoning has become a key concern.
- From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring
Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen · Mar 6, 2026 · Citations: 0
Pairwise Preference
On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning…
- Evaluation of Deontic Conditional Reasoning in Large Language Models: The Case of Wason's Selection Task
Hirohiko Abe, Kentaro Ozeki, Risako Ando, Takanobu Morishita, Koji Mineshima · Mar 6, 2026 · Citations: 0
In humans, reasoning often performs well in domain-specific settings, particularly in normative rather than purely formal contexts.
- Transparent AI for Mathematics: Transformer-Based Large Language Models for Mathematical Entity Relationship Extraction with XAI
Tanjim Taharat Aurpa · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SAHOO: Safeguarded Alignment for High-Order Optimization Objectives in Recursive Self-Improvement
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary · Mar 6, 2026 · Citations: 0
Critique Edit
We introduce SAHOO, a practical framework to monitor and control drift through three safeguards: (i) the Goal Drift Index (GDI), a learned multi-signal detector combining semantic, lexical, structural, and distributional measures; (ii)…
- The Art That Poses Back: Assessing AI Pastiches after Contemporary Artworks
Anca Dinu, Andreiana Mihail, Andra-Maria Florescu, Claudiu Creanga · Mar 6, 2026 · Citations: 0
The analysis combines human evaluation with computational methods aimed at detecting visual and stylistic similarities or divergences between the original works and their AI-produced renditions.
- Continual Adaptation for Pacific Indigenous Speech Recognition
Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- The EpisTwin: A Knowledge Graph-Grounded Neuro-Symbolic Architecture for Personal AI
Giovanni Servedio, Potito Aghilar, Alessio Mattiace, Gianni Carmosino, Francesco Musicco · Mar 6, 2026 · Citations: 0
At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities…
- Mind the Gap: Pitfalls of LLM Alignment with Asian Public Opinion
Hari Shankar, Vedanta S P, Sriharini Margapuri, Debjani Mazumder, Ponnurangam Kumaraguru · Mar 6, 2026 · Citations: 0
We further show that downstream evaluations on bias benchmarks (such as CrowS-Pairs, IndiBias, ThaiCLI, KoBBQ) reveal persistent harms and under-representation in sensitive contexts.
- SPOT: Span-level Pause-of-Thought for Efficient and Interpretable Latent Reasoning in Large Language Models
Yunlong Chu, Minglai Shao, Yuhang Liu, Bing Hao, Yumeng Lin · Mar 6, 2026 · Citations: 0
Experiments on reasoning benchmarks demonstrate that SPOT improves accuracy by 2.3 points on average while reducing generated tokens by 37.5% and provides faithful semantic interpretations of the latent reasoning process.
- FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling
Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang · Mar 6, 2026 · Citations: 0
Extensive evaluations demonstrate that FlashPrefill achieves a substantial leap in efficiency, delivering an unprecedented 27.78x speedup on 256K sequences.
- LIT-RAGBench: Benchmarking Generator Capabilities of Large Language Models in Retrieval-Augmented Generation
Koki Itai, Shunichi Hasegawa, Yuta Yamamoto, Gouki Minegishi, Masaki Otsuki · Mar 6, 2026 · Citations: 0
Long Horizon
To bridge the gap between existing evaluations and practical use, we introduce LIT-RAGBench (the Logic, Integration, Table, Reasoning, and Abstention RAG Generator Benchmark), which defines five categories: Integration, Reasoning, Logic,…
- Wisdom of the AI Crowd (AI-CROWD) for Ground Truth Approximation in Content Analysis: A Research Protocol & Validation Using Eleven Large Language Models
Luis de-Marcos, Manuel Goyanes, Adrián Domínguez-Díaz · Mar 6, 2026 · Citations: 0
Large-scale content analysis is increasingly limited by the absence of observable ground truth or gold-standard labels, as creating such benchmarks through extensive human coding becomes impractical for massive datasets due to high time,…
- MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue
Naifan Zhang, Ruihan Sun, Jinwei Su, Hengjie Yang, Zhengyuan Pan · Mar 6, 2026 · Citations: 0
Long Horizon
We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns.
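The MAPO entry above propagates long-horizon effects through Monte Carlo returns over dense judge-model rewards. A minimal sketch of that standard computation, not the paper's actual algorithm (the discount factor and per-turn rewards are illustrative):

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return at each step: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    # Sweep backwards so each step's return folds in all future rewards.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Hypothetical dense per-turn judge rewards for a 4-turn dialogue.
print(monte_carlo_returns([0.0, 0.5, 0.0, 1.0], gamma=0.5))
# → [0.375, 0.75, 0.5, 1.0]
```

Because each return sums all discounted future rewards, early turns receive credit for a payoff many turns later, which is how a critic-free method can still capture long-horizon effects.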
- CRIMSON: A Clinically-Grounded LLM-Based Metric for Generative Radiology Report Evaluation
Mohammed Baharoon, Thibault Heintz, Siavash Raissi, Mahmoud Alabbad, Mona Alhammad · Mar 6, 2026 · Citations: 0
Pairwise Preference
We introduce CRIMSON, a clinically grounded evaluation framework for chest X-ray report generation that assesses reports based on diagnostic correctness, contextual relevance, and patient safety.
- Contrastive-to-Self-Supervised: A Two-Stage Framework for Script Similarity Learning
Claire Roman, Philippe Meyer · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Do Compact SSL Backbones Matter for Audio Deepfake Detection? A Controlled Study with RAPTOR
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni, Tanel Alumäe, Mathew Magimai Doss · Mar 6, 2026 · Citations: 0
Pairwise Preference Long Horizon
We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14…
- A Causal Graph Approach to Oppositional Narrative Analysis
Diego Revilla, Martin Fernandez-de-Retana, Lingfeng Chen, Aritz Bilbao-Jayo, Miguel Fernandez-de-Retana · Mar 6, 2026 · Citations: 0
Current methods for textual analysis rely on data annotated within predefined ontologies, often embedding human bias within black-box models.
- Diffusion Language Models Are Natively Length-Aware
Vittorio Rossi, Giacomo Cirò, Davide Beltrame, Luca Gandolfi, Paul Röttger · Mar 6, 2026 · Citations: 0
We evaluate our approach on four benchmarks with diverse tasks -- GSM8K (reasoning), HumanEval (code generation), IFEval (instruction following), and LongFormQA (question answering) -- revealing massive efficiency gains at minimal…
- Making Implicit Premises Explicit in Logical Understanding of Enthymemes
Xuyao Feng, Anthony Hunter · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin · Mar 6, 2026 · Citations: 0
To evaluate the performance of our model, we develop a comprehensive depth question answer benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios.
- Experiences Build Characters: The Linguistic Origins and Functional Impact of LLM Personality
Xi Wang, Mengdie Zhuang, Jiqun Liu · Mar 6, 2026 · Citations: 0
Human problem-solving is enriched by a diversity of styles and personality traits, yet the development of Large Language Models (LLMs) has largely prioritized uniform performance benchmarks that favour specific behavioural tendencies such…
- Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring
Jonas Kubesch, Lena Huber, Clemens Havas · Mar 6, 2026 · Citations: 0
Rubric Rating
This paper investigates the application of state-of-the-art open-weight LLMs for the grading of Austrian A-level German texts, with a particular focus on rubric-based evaluation.
- ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning
Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing
Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu · Mar 6, 2026 · Citations: 0
Multi Agent
Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration.
- Track-SQL: Enhancing Generative Language Models with Dual-Extractive Modules for Schema and Context Tracking in Multi-turn Text-to-SQL
Bingfeng Chen, Shaobin Shi, Yongqi Luo, Boyan Xu, Ruichu Cai · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Imagine How To Change: Explicit Procedure Modeling for Change Captioning
Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models
Nikita Soni, August Håkan Nilsson, Syeda Mahwish, Vasudha Varadarajan, H. Andrew Schwartz · Mar 6, 2026 · Citations: 0
These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.
- Implicit Style Conditioning: A Structured Style-Rewrite Framework for Low-Resource Character Modeling
Chanhui Zhu · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Addressing the Ecological Fallacy in Larger LMs with Human Context
Nikita Soni, Dhruv Vijay Kunjadiya, Pratham Piyush Shah, Dikshya Mohanty, H. Andrew Schwartz · Mar 6, 2026 · Citations: 0
We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT:Human-aware Fine-Tuning).
- Learning Next Action Predictors from Human-Computer Interaction
Omar Shaikh, Valentin Teutschbein, Kanishk Gandhi, Yikun Chi, Nick Haber · Mar 6, 2026 · Citations: 0
Using an LLM-as-judge evaluation metric (0-1 similarity to ground truth), LongNAP significantly outperforms supervised finetuning and prompted baselines on held-out data (by 79% and 39% respectively).
- InfoGatherer: Principled Information Seeking via Evidence Retrieval and Strategic Questioning
Maksym Taranukhin, Shuyue Stella Li, Evangelos Milios, Geoff Pleiss, Yulia Tsvetkov · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Building an Ensemble LLM Semantic Tagger for UN Security Council Resolutions
Hussein Ghaly · Mar 6, 2026 · Citations: 0
We introduce two evaluation metrics: Content Preservation Ratio (CPR) and Tag Well-Formedness (TWF), in order to avoid hallucinations and unnecessary additions or omissions to the input text beyond the task requirement.
- Lost in Stories: Consistency Bugs in Long Story Generation by LLMs
Junjie Li, Xinrui Guo, Yuhao Wu, Roy Ka-Wei Lee, Hongzhi Li · Mar 6, 2026 · Citations: 0
Existing story generation benchmarks focus mainly on plot quality and fluency, leaving consistency errors largely unexplored.
- VerChol -- Grammar-First Tokenization for Agglutinative Languages
Prabhu Raja · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Confidence Before Answering: A Paradigm Shift for Efficient LLM Uncertainty Estimation
Changcheng Li, Jiancan Wu, Hengheng Zhang, Zhengsu Chen, Guo An · Mar 6, 2026 · Citations: 0
Experiments across math, code, and factual QA benchmarks show improved calibration and uncertainty discrimination while preserving answer quality, thereby enabling a broader range of downstream applications.
- ROSE: Reordered SparseGPT for More Accurate One-Shot Large Language Models Pruning
Mingluo Su, Huan Wang · Mar 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning
Juyong Jiang, Jiasi Shen, Sunghun Kim, Kang Min Yoo, Jeonghoon Kim · Mar 6, 2026 · Citations: 0
Long Horizon
Extensive experiments across seven benchmarks demonstrate that our ReflexiCoder-8B establishes a new state-of-the-art (SOTA) among leading open-source models in the 1.5B-14B range, achieving 94.51% (87.20%) on HumanEval (Plus), 81.80%…
- Orion: Characterizing and Programming Apple's Neural Engine for LLM Training and Inference
Ramchand Kumaresan · Mar 6, 2026 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.