- Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing
Dennis Ulmer, Alexandra Lorson, Ivan Titov, Christian Hardmeier · Jul 11, 2025 · Citations: 0
Human users increasingly communicate with large language models (LLMs), but LLMs are frequently overconfident in their output, even when its accuracy is questionable, which undermines their trustworthiness and perceived legitimacy.
- A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0
Pairwise Preference
There are currently two main paradigms for evaluating large language models (LLMs): reference-based evaluation and preference-based evaluation.
- Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang · Jul 10, 2025 · Citations: 0
To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box…
- From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park · Jul 10, 2025 · Citations: 0
- FrugalRAG: Less is More in RL Finetuning for Multi-Hop Question Answering
Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma · Jul 10, 2025 · Citations: 0
- SpatialViz-Bench: A Cognitively-Grounded Benchmark for Diagnosing Spatial Visualization in MLLMs
Siting Wang, Minnan Pei, Luoyang Sun, Cheng Deng, Yuchen Li · Jul 10, 2025 · Citations: 0
- Psychometric Item Validation Using Virtual Respondents with Trait-Response Mediators
Sungjib Lim, Woojung Song, Eun-Ju Lee, Yohan Jo · Jul 8, 2025 · Citations: 0
Traditionally, this requires costly, large-scale human data collection.
- Mechanistic Indicators of Understanding in Large Language Models
Pierre Beckmann, Matthieu Queloz · Jul 7, 2025 · Citations: 0
However, these also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms.
- The Generalization Ridge: Information Flow in Natural Language Generation
Ruidi Chang, Chunyuan Deng, Hanjie Chen · Jul 7, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt · Jul 7, 2025 · Citations: 0
Pairwise Preference
To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning.
- Agentic Vehicles for Human-Centered Mobility
Jiangbo Yu, Raphael Frank, Luis Miranda-Moreno, Sasan Jafarnejad, Jonatas Augusto Manzolli · Jul 7, 2025 · Citations: 0