- Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop
Lifeng Han, David Lindevelt, Sander Puts, Erik van Mulligen, Suzan Verberne · Nov 9, 2025 · Citations: 0
- Q$^2$: Quantization-Aware Gradient Balancing and Attention Alignment for Low-Bit Quantization
Zhaoyang Wang, Dong Wang · Nov 8, 2025 · Citations: 0
Quantization-aware training (QAT) has achieved remarkable success in low-bit ($\leq$4-bit) quantization for classification networks.
- OckBench: Measuring the Efficiency of LLM Reasoning
Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu · Nov 7, 2025 · Citations: 0
Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage.
- Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0
We introduce a framework that synthesizes vision-centric problems spanning diverse levels of complexity, and release the resulting dataset of over 1M high-quality problems, including reasoning traces, preference data, and instruction prompts…
- Steering Language Models with Weight Arithmetic
Constanza Fierro, Fabien Roger · Nov 7, 2025 · Citations: 0
- Are We Asking the Right Questions? On Ambiguity in Natural Language Queries for Tabular Data Analysis
Daniel Gomm, Cornelius Wolff, Madelon Hulsebos · Nov 6, 2025 · Citations: 0
Applying the framework to evaluations for tabular question answering and analysis, we analyze queries in 15 datasets, and observe an uncontrolled mixing of query types neither adequate for evaluating a system's accuracy nor for evaluating…
- Modeling Clinical Uncertainty in Radiology Reports: from Explicit Uncertainty Markers to Implicit Reasoning Pathways
Paloma Rabaey, Jong Hak Moon, Jung-Oh Lee, Min Gwan Kim, Hangyul Yoon · Nov 6, 2025 · Citations: 0
- Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Saurabh Srivastava, Janit Bidhan, Hao Yan, Abhishek Dey, Tanu Kansal · Nov 6, 2025 · Citations: 0
Across 13 diverse benchmarks with DeepSeek-R1 and OpenAI-o1, batch prompting reduces reasoning tokens by 76% on average (2,950 → 710) while preserving or improving accuracy.
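The entry above describes batch prompting: packing several questions into one request so the model amortizes its reasoning across them instead of overthinking each one separately. A minimal sketch of the idea follows; the numbered-prompt template and the `A1:`/`A2:` answer format are illustrative assumptions, not the paper's exact protocol:

```python
def build_batch_prompt(questions):
    """Combine several questions into one numbered prompt so a reasoning
    model answers them all in a single pass instead of one call each."""
    header = "Answer each question concisely. Number your answers as A1:, A2:, ...\n\n"
    body = "\n".join(f"Q{i + 1}: {q}" for i, q in enumerate(questions))
    return header + body

def parse_batch_answers(response, n):
    """Pull the numbered answers (A1:, A2:, ...) back out of the reply."""
    answers = {}
    for line in response.splitlines():
        line = line.strip()
        for i in range(1, n + 1):
            prefix = f"A{i}:"
            if line.startswith(prefix):
                answers[i] = line[len(prefix):].strip()
    return [answers.get(i, "") for i in range(1, n + 1)]

prompt = build_batch_prompt(["What is 2+2?", "What is the capital of France?"])
reply = "A1: 4\nA2: Paris"  # a hypothetical model reply
print(parse_batch_answers(reply, 2))  # -> ['4', 'Paris']
```

The token savings reported above come from the model sharing one chain of thought across the batch rather than producing a long reasoning trace per question.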
- STARS: Synchronous Token Alignment for Robust Supervision in Large Language Models
Mohammad Atif Quamar, Mohammad Areeb, Mikhail Kuznetsov, Muslum Ozgur Ozmen, Z. Berkay Celik · Nov 5, 2025 · Citations: 0
Aligning large language models (LLMs) with human values is crucial for safe deployment.
- GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Stergios Chatzikyriakidis, Dimitris Papadakis, Sevasti-Ioanna Papaioannou, Erofili Psaltaki · Nov 5, 2025 · Citations: 0
- CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre · Nov 5, 2025 · Citations: 0
- Error-Aware Knowledge Distillation via Targeted Revision for Customer-Service Summarization
Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi · Nov 4, 2025 · Citations: 0
We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source large language models (LLMs) to surpass substantially larger proprietary models in customer-service summarization tasks.
- A Proof of Learning Rate Transfer under $μ$P
Soufiane Hayou · Nov 3, 2025 · Citations: 0
The abstract gives limited detail on direct human feedback or evaluation protocol; use as adjacent methodological context.
- Self-Harmony: Learning to Harmonize Self-Supervision and Self-Play in Test-Time Reinforcement Learning
Ru Wang, Wei Huang, Qi Cao, Yusuke Iwasawa, Yutaka Matsuo · Nov 3, 2025 · Citations: 0
Crucially, this requires no human supervision or auxiliary models.