- Task Arithmetic with Support Languages for Low-Resource ASR
Emma Rafkin, Dan DeGenaro, Xiulin Yang · Jan 11, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Symphonym: Universal Phonetic Embeddings for Cross-Script Name Matching
Stephen Gadd · Jan 11, 2026 · Citations: 0
Expert Verification
Trained on 32.7 million triplet samples drawn from 67 million toponyms spanning GeoNames, Wikidata, and the Getty Thesaurus of Geographic Names, the Student achieves the highest Recall@1 (85.2%) and Mean Reciprocal Rank (90.8%) on the…
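Recall@1 and Mean Reciprocal Rank are standard ranking metrics; as a quick reference (a generic sketch, not Symphonym's evaluation code), they can be computed from per-query ranked candidate lists like so:

```python
def recall_at_1(ranked_lists, gold):
    """Fraction of queries whose top-ranked candidate is the gold match."""
    hits = sum(1 for ranks, g in zip(ranked_lists, gold) if ranks and ranks[0] == g)
    return hits / len(gold)

def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the first gold match (0 if the gold item is absent)."""
    total = 0.0
    for ranks, g in zip(ranked_lists, gold):
        for i, cand in enumerate(ranks, start=1):
            if cand == g:
                total += 1.0 / i
                break
    return total / len(gold)

# Toy example: three name-matching queries with ranked candidates.
ranked = [["Köln", "Cologne"], ["Paris", "Parigi"], ["Wien", "Vienna", "Vienne"]]
gold = ["Köln", "Parigi", "Vienna"]
# recall_at_1 → 1/3 (only the first query's top candidate is gold)
# mean_reciprocal_rank → (1 + 1/2 + 1/2) / 3 = 2/3
```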
- †DAGGER: Distractor-Aware Graph Generation for Executable Reasoning in Math Problems
Zabir Al Nazi, Shubhashis Roy Dipta, Sudipta Kar · Jan 11, 2026 · Citations: 0
To systematically study this challenge, we introduce DISTRACTMATH-BN, a Bangla benchmark that augments MGSM and MSVAMP with semantically coherent but computationally irrelevant information.
- A Mind Cannot Be Smeared Across Time
Michael Timothy Bennett · Jan 11, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Burn-After-Use for Preventing Data Leakage through a Secure Multi-Tenant Architecture in Enterprise LLM
Qiang Zhang, Elena Emma Wang, Jiaming Li, Xichun Wang · Jan 10, 2026 · Citations: 0
- EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation
Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang · Jan 10, 2026 · Citations: 0
Long Horizon
Existing evaluations often overlook execution accuracy and safety.
- LLMTrack: Semantic Multi-Object Tracking with Multi-modal Large Language Models
Pan Liao, Feng Yang, Di Wu, Jinwen Yu, Yuhua Zhu · Jan 10, 2026 · Citations: 0
To address this, we introduce Grand-SMOT, a large-scale, open-world benchmark providing high-density, dual-stream narratives that comprehensively decouple individual behaviors from environmental contexts.
- NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
Robert J. Moore, Sungeun An, Farhan Ahmed, Jay Pankaj Gala · Jan 10, 2026 · Citations: 0
The Natural Conversation Benchmark (NC-Bench) introduces a new approach to evaluating the general conversational competence of large language models (LLMs).
- Mixture-of-Experts as Soft Clustering: A Dual Jacobian-PCA Spectral Geometry Perspective
Feilong Liu · Jan 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Distilling Feedback into Memory-as-a-Tool
Víctor Gallego · Jan 9, 2026 · Citations: 0
Rubric Rating · Critique Edit
We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls.
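The snippet describes the mechanism only at a high level; a minimal file-based memory-as-a-tool could look like the following sketch (the file name, JSON format, and keyword-overlap retrieval are illustrative assumptions, not the paper's design):

```python
import json
import re
from pathlib import Path

# Hypothetical on-disk store; the paper's actual file format is not specified.
MEMORY_FILE = Path("guidelines.json")

def distill(critique: str, guideline: str, tags: list[str]) -> None:
    """Persist a transient critique as a retrievable guideline."""
    entries = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else []
    entries.append({"critique": critique, "guideline": guideline, "tags": tags})
    MEMORY_FILE.write_text(json.dumps(entries, indent=2))

def retrieve(query: str) -> list[str]:
    """Return guidelines whose tags overlap with words in the query."""
    if not MEMORY_FILE.exists():
        return []
    words = set(re.findall(r"\w+", query.lower()))
    return [e["guideline"]
            for e in json.loads(MEMORY_FILE.read_text())
            if words & set(e["tags"])]
```

An agent would call `distill` once per critique at reasoning time, then call `retrieve` as a tool on later queries, amortizing the original inference cost.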
- Pantagruel: Unified Self-Supervised Encoders for French Text and Speech
Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai · Jan 9, 2026 · Citations: 0
Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech.
- FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG
Maxime Dassen, Rebecca Kotula, Kenton Murray, Andrew Yates, Dawn Lawrie · Jan 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer
Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu · Jan 9, 2026 · Citations: 0
Demonstrations
Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code.
- HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation
Rongxin Chen, Tianyu Wu, Bingbing Xu, Jiatang Luo, Xiucheng Xu · Jan 9, 2026 · Citations: 0
High-fidelity agent initialization is crucial for credible Agent-Based Modeling across diverse domains.
- Classroom AI: Large Language Models as Grade-Specific Teachers
Jio Oh, Steven Euijong Whang, James Evans, Jindong Wang · Jan 9, 2026 · Citations: 0
Evaluations across multiple datasets with 208 human participants demonstrate substantial improvements in grade-level alignment, achieving a 35.64 percentage point increase compared to prompt-based methods while maintaining response…
- HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026 · Citations: 0
Pairwise Preference · Rubric Rating
Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
- Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong · Jan 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Over-Searching in Search-Augmented Large Language Models
Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun · Jan 9, 2026 · Citations: 0
- The Illusion of AI Expertise Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm
Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal · Jan 9, 2026 · Citations: 0
- A Two-Stage Multitask Vision-Language Framework for Explainable Crop Disease Visual Question Answering
Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam Ansary · Jan 8, 2026 · Citations: 0
Without fine-tuning, the model further generalizes well to the external PlantVillageVQA benchmark, achieving 83.18% micro accuracy in the VQA task.
- Token-Level LLM Collaboration via FusionRoute
Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang · Jan 8, 2026 · Citations: 0
- Key-Value Pair-Free Continual Learner via Task-Specific Prompt-Prototype
Haihua Luo, Xuming Ran, Zhengji Li, Huiyan Xue, Tingting Jiang · Jan 8, 2026 · Citations: 0
- Projected Autoregression: Autoregressive Language Generation in Continuous State Space
Oshri Naparstek · Jan 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models
Guanzhi Deng, Bo Li, Ronghao Chen, Xiujin Liu, Zhuo Han · Jan 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models
Yifan Le, Yunliang Li · Jan 8, 2026 · Citations: 0
Pairwise Preference
Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance.
- LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence
Hyeongkeun Lee, Jongmin Choi, KiHyun Nam, Joon Son Chung · Jan 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Neurosymbolic Retrievers for Retrieval-augmented Generation
Yash Saxena, Manas Gaur · Jan 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Political Alignment in Large Language Models: A Multidimensional Audit of Psychometric Identity and Behavioral Bias
Adib Sakhawat, Tahsin Islam, Takia Farhin, Syed Rifat Raiyan, Hasan Mahmud · Jan 8, 2026 · Citations: 0
These findings suggest that single-axis evaluations are insufficient and that multidimensional auditing frameworks are important to characterize alignment behavior in deployed LLMs.
- Identifying Good and Bad Neurons for Task-Level Controllable LLMs
Wenjie Li, Guansong Pang, Hezhe Qiao, Debin Gao, David Lo · Jan 8, 2026 · Citations: 0
- CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts
Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik · Jan 8, 2026 · Citations: 0
Multi Agent
To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics.
- Vision-Language Agents for Interactive Forest Change Analysis
James Brock, Ce Zhang, Nantheera Anantrasirichai · Jan 8, 2026 · Citations: 0
To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks.
- Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models
San Kim, Gary Geunbae Lee · Jan 7, 2026 · Citations: 0
However, their reliance on large-scale datasets, often collected from human or web sources, makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors.
- Interpreting Transformers Through Attention Head Intervention
Mason Kadem, Rong Zheng · Jan 7, 2026 · Citations: 0
- RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang · Jan 7, 2026 · Citations: 0
Demonstrations · Critique Edit
On a fake news detection benchmark, RADAR consistently outperforms strong retrieval-augmented trainable baselines, as well as general-purpose LLMs with retrieval.
- What Matters For Safety Alignment?
Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong · Jan 7, 2026 · Citations: 0
Red Team · Tool Use
This paper presents a comprehensive empirical study of safety alignment capabilities.
- IDESplat: Iterative Depth Probability Estimation for Generalizable 3D Gaussian Splatting
Wei Long, Haifeng Wu, Shiyin Jiang, Jinhua Zhang, Xinchun Ji · Jan 7, 2026 · Citations: 0
- Compact Example-Based Explanations for Language Models
Loris Schoenegger, Benjamin Roth · Jan 7, 2026 · Citations: 0
As humans cannot interpret thousands of documents, only a small subset of the training data can be presented as an explanation.
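Selecting a small, human-readable subset from thousands of candidate documents is itself an optimization problem; one generic approach (a greedy facility-location heuristic, shown here as an illustration rather than the paper's selection method) picks examples that together "cover" the rest under a similarity matrix:

```python
def select_compact_subset(sim, k):
    """Greedy facility-location: pick k examples so that every example
    is well covered by its most similar selected example.
    sim[i][j] is the similarity between examples i and j."""
    n = len(sim)
    chosen, best = [], [0.0] * n  # best[i] = coverage of i by current picks
    for _ in range(k):
        gains = []
        for j in range(n):
            if j in chosen:
                gains.append(-1.0)  # never re-pick
                continue
            gains.append(sum(max(best[i], sim[i][j]) - best[i] for i in range(n)))
        j = max(range(n), key=gains.__getitem__)
        chosen.append(j)
        best = [max(best[i], sim[i][j]) for i in range(n)]
    return chosen
```

Because near-duplicates add little marginal coverage, the greedy loop naturally favors diverse, representative examples.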
- EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning
Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu · Jan 6, 2026 · Citations: 0
Long Horizon
Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference.
- Prompting Underestimates LLM Capability for Time Series Classification
Dan Schumacher, Erfan Nourbakhsh, Rocky Slavin, Anthony Rios · Jan 6, 2026 · Citations: 0
Prompt-based evaluations suggest that large language models (LLMs) perform poorly on time series classification, raising doubts about whether they encode meaningful temporal structure.
- AnatomiX, an Anatomy-Aware Grounded Multimodal Large Language Model for Chest X-Ray Interpretation
Anees Ur Rehman Hashmi, Numan Saeed, Christoph Lippert · Jan 6, 2026 · Citations: 0
- One Sample to Rule Them All: Extreme Data Efficiency in Multidiscipline Reasoning with Reinforcement Learning
Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li · Jan 6, 2026 · Citations: 0
Across various reasoning benchmarks, polymath learning achieves stronger performance than larger datasets, demonstrating that reasoning structure and skills in samples, rather than quantity, may be the key to unlock enhanced reasoning…
- Enhancing Moral Diagnosis and Correction in Large Language Models
Bocheng Chen, Xi Chen, Han Zi, Haitao Mao, Zimo Qi · Jan 6, 2026 · Citations: 0
Red Team
Identifying specific moral errors in an input and generating appropriate corrections require moral sensitivity in large language models (LLMs), a capability that is fundamental to their moral performance yet remains challenging.
- SentGraph: Hierarchical Sentence Graph for Multi-hop Retrieval-Augmented Question Answering
Junli Liang, Pengfei Zhou, Wangqiu Zhou, Wenjie Qing, Qi Zhao · Jan 6, 2026 · Citations: 0
Extensive experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of SentGraph, validating the importance of explicitly modeling sentence-level logical dependencies for multi-hop reasoning.
- Towards Faithful Reasoning in Comics for Small MLLMs
Chengcheng Feng, Haojie Yin, Yucheng Jin, Kaizhu Huang · Jan 6, 2026 · Citations: 0
Extensive experiments on five benchmarks spanning comic understanding and broader humor-centric and abstract visual reasoning tasks demonstrate that our framework achieves strong results in the ≤ 4B regime, surpasses several 7B…
- LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation
Fabian Lukassen, Christoph Weisser, Michael Schlee, Manish Kumar, Anton Thielmann · Jan 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion
Jeonghyun Park, Byeongjeong Kim, Seojin Hwang, Hwanhee Lee · Jan 6, 2026 · Citations: 0
Pairwise Preference
To address these biases, we propose DeLP (Debiased Language Preference), a calibrated metric designed to explicitly factor out these structural confounds.
- From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text
Shinwoo Park, Yo-Sub Han · Jan 6, 2026 · Citations: 0
Rubric Rating
Distinguishing human-written Korean text from fluent LLM outputs remains difficult even for trained readers, who can over-trust surface well-formedness.
- Beyond the Black Box: A Survey on the Theory and Mechanism of Large Language Models
Zeyu Gan, Ruifeng Ren, Wei Yao, Xiaolin Hu, Gengze Xu · Jan 6, 2026 · Citations: 0
To address this theoretical fragmentation, this survey proposes a unified lifecycle-based taxonomy that organizes the research landscape into six distinct stages: Data Preparation, Model Preparation, Training, Alignment, Inference, and…
- Stratified Hazard Sampling: Minimal-Variance Event Scheduling for CTMC/DTMC Discrete Diffusion and Flow Models
Seunghwan Jang, SooJean Han · Jan 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SYNAPSE: Empowering LLM Agents with Episodic-Semantic Memory via Spreading Activation
Hanqi Jiang, Junhao Chen, Yi Pan, Ling Chen, Weihang You · Jan 6, 2026 · Citations: 0
While Large Language Models (LLMs) excel at generalized reasoning, standard retrieval-augmented approaches fail to address the disconnected nature of long-term agentic memory.
- When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark
Subha Ghoshal, Ali Al-Bustami · Jan 6, 2026 · Citations: 0
Tool Use
We benchmark this behavior on two real-world settings: event-centric question answering over graph-structured knowledge (Event-QA) and persuasive response generation in Reddit ChangeMyView (CMV).
- Embedding Retrofitting: Data Engineering for better RAG
Anantha Sharma · Jan 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Improved Evidence Extraction and Metrics for Document Inconsistency Detection with LLMs
Nelvin Tan, Yaowen Zhang, James Asikin Cheung, Fusheng Liu, Yu-Ching Shih · Jan 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
Hyeong Kyu Choi, Sharon Li · Jan 5, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Estimating Text Temperature with Language Models
Nikolay Mikhaylovskiy · Jan 5, 2026 · Citations: 0
Following it, we propose a procedure to estimate the temperature of any text, including ones written by humans, with respect to a given language model.
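The snippet does not spell out the estimator; one natural reading is maximum-likelihood fitting of a softmax temperature to the model's per-token logits for the observed text. A sketch under that assumption (grid search, toy logits; not necessarily the paper's procedure):

```python
import math

def nll(logits_seq, tokens, T):
    """Negative log-likelihood of the observed tokens under softmax at temperature T.
    logits_seq[t] is the logit vector at position t; tokens[t] is the observed id."""
    total = 0.0
    for logits, tok in zip(logits_seq, tokens):
        scaled = [z / T for z in logits]
        m = max(scaled)  # log-sum-exp with max-shift for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_z - scaled[tok]
    return total

def estimate_temperature(logits_seq, tokens, grid=None):
    """Return the grid temperature that best explains the observed text."""
    grid = grid or [0.1 * i for i in range(1, 31)]  # 0.1 .. 3.0
    return min(grid, key=lambda T: nll(logits_seq, tokens, T))
```

Intuitively, text that always follows the model's top logit is best explained by a low temperature, while text that keeps choosing low-probability tokens pushes the estimate upward.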
- From XAI to Stories: A Factorial Study of LLM-Generated Explanation Quality
Fabian Lukassen, Jan Herrmann, Christoph Weisser, Benjamin Saefken, Thomas Kneib · Jan 5, 2026 · Citations: 0
Using G-Eval, an LLM-as-a-judge evaluation method, with dual LLM judges and four evaluation criteria, we evaluate 660 explanations for time-series forecasting.
- FormationEval, an open multiple-choice benchmark for petroleum geoscience
Almaz Ermilov · Jan 5, 2026 · Citations: 0
This paper presents FormationEval, an open multiple-choice question benchmark for evaluating language models on petroleum geoscience and subsurface disciplines.
- DeCode: Decoupling Content and Delivery for Medical QA
Po-Jen Ko, Chen-Han Tsai, Yu-Shao Peng · Jan 5, 2026 · Citations: 0
We evaluate DeCode on OpenAI HealthBench, a comprehensive and challenging benchmark designed to assess clinical relevance and validity of LLM responses.
- Agentic Retoucher for Text-To-Image Generation
Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan · Jan 5, 2026 · Citations: 0
Pairwise Preference
To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop.
- Output Embedding Centering for Stable LLM Pretraining
Felix Stollenwerk, Anna Lokrantz, Niclas Hertzberg · Jan 5, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.