- OGD4All: A Framework for Accessible Interaction with Geospatial Open Government Data Based on Large Language Models
Michael Siebenmann, Javier Argota Sánchez-Vaquerizo, Stefan Arisona, Krystian Samp, Luis Gisler · Nov 30, 2025
The system combines semantic data retrieval, agentic reasoning for iterative code generation, and secure sandboxed execution that produces verifiable multimodal outputs.
- PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova · Nov 26, 2025
Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce.
- The Metaphysics We Train: A Heideggerian Reading of Machine Learning
Heman Shakeri · Nov 25, 2025
Third, AI's lack of existential structure, specifically the absence of Care (Sorge), is genuinely explanatory: it illuminates why AI systems have no internal resources for questioning their own optimization imperatives, and why they optimiz
- Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization
Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Siliang Zeng · Nov 25, 2025
Long Horizon
Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks.
- CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025
The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
- Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models
Wangjiaxuan Xin · Nov 24, 2025
This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models.
- MUCH: A Multilingual Claim Hallucination Benchmark
Jérémie Dentan, Alexi Canesse, Davide Buscaldi, Aymen Shabou, Sonia Vanier · Nov 21, 2025
We introduce MUCH, the first claim-level UQ benchmark designed for fair and reproducible evaluation of future methods under realistic conditions.
- Bridging Symbolic Control and Neural Reasoning in LLM Agents: Structured Cognitive Loop with a Governance Layer
Myung Ho Kim · Nov 21, 2025
Long Horizon
Large language model agents suffer from fundamental architectural problems: entangled reasoning and execution, memory volatility, and uncontrolled action sequences.
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping
Yushi Huang, Zining Wang, Zhihang Yuan, Yifu Ding, Ruihao Gong · Nov 19, 2025
Extensive experiments for 3 model series across 13 benchmarks demonstrate that MoDES far outperforms previous approaches.
- From Competition to Coordination: Market Making as a Scalable Framework for Safe and Aligned Multi-Agent LLM Systems
Brendan Gho, Suman Muppavarapu, Afnan Shaik, Tyson Tsay, Atharva Mohan · Nov 18, 2025
Multi Agent
As foundation models are increasingly deployed as interacting agents in multi-agent systems, their collective behavior raises new challenges for trustworthiness, transparency, and accountability.
- EARL: Entropy-Aware RL Alignment of LLMs for Reliable RTL Code Generation
Jiahe Shi, Zhengqi Gao, Ching-Yun Ko, Duane Boning · Nov 15, 2025
Recent advances in large language models (LLMs) have demonstrated significant potential in hardware design automation, particularly in using natural language to synthesize Register-Transfer Level (RTL) code.
- CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang · Nov 14, 2025
Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist in reducing perceived quality: accent bias, where models default t
- Multimodal Peer Review Simulation with Actionable To-Do Recommendations for Community-Aware Manuscript Revisions
Mengze Hong, Di Jiang, Weiwei Zhao, Yawen Li, Yihang Wang · Nov 14, 2025
Critique Edit
Experimental results highlight the effectiveness of the proposed system in generating more comprehensive and useful reviews aligned with expert standards, surpassing ablated baselines and advancing transparent, human-centered scholarly assi
- Mastering Olympiad-Level Physics with Artificial Intelligence
Dong-Shan Jian, Xiang Li, Chen-Xu Yan, Hui-Wen Zheng, Zhi-Zhang Bian · Nov 13, 2025
Olympiad-level physics problem-solving significantly challenges both humans and artificial intelligence (AI), as it requires integrating appropriate modeling, application of physical principles, and precise calculation within long reasoning
- Chain of Summaries: Summarization Through Iterative Questioning
William Brach, Kristián Košťál, Lukas Galke Poech · Nov 12, 2025
CoS thus offers an appealing option for website maintainers to make their content more accessible to LLMs, while retaining possibilities for human oversight.
- State of the Art in Text Classification for South Slavic Languages: Fine-Tuning or Prompting?
Taja Kuzman Pungeršek, Peter Rupnik, Ivan Porupski, Vuk Dinić, Nikola Ljubešić · Nov 11, 2025
Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks.
- Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya · Nov 11, 2025
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure.
- Beyond Fact Retrieval: Episodic Memory for RAG with Generative Semantic Workspaces
Shreyas Rajesh, Pavan Holur, Chenda Duan, David Chong, Vwani Roychowdhury · Nov 10, 2025
Long Horizon
On the Episodic Memory Benchmark (EpBench) [huet_episodic_2025], comprising corpora ranging from 100k to 1M tokens in length, GSW outperforms existing RAG-based baselines by up to 20%.
- Graph Representation-based Model Poisoning on the Heterogeneous Internet of Agents
Hanlin Cai, Houtianfu Wang, Haofan Dong, Kai Li, Sai Zou · Nov 10, 2025
Internet of Agents (IoA) envisions a unified, agent-centric paradigm where heterogeneous large language model (LLM) agents can interconnect and collaborate at scale.
- RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang, Yu Zhang · Nov 10, 2025
Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks.
- OckBench: Measuring the Efficiency of LLM Reasoning
Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu · Nov 7, 2025
Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token usage.
- Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025
Pairwise Preference
We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts su
- Batch Prompting Suppresses Overthinking Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models
Saurabh Srivastava, Janit Bidhan, Hao Yan, Abhishek Dey, Tanu Kansal · Nov 6, 2025
Across 13 diverse benchmarks with DeepSeek-R1 and OpenAI-o1, batch prompting reduces reasoning tokens by 76% (2,950 → 710) on average, while preserving or improving accuracy.
- Error-Aware Knowledge Distillation via Targeted Revision for Customer-Service Summarization
Hee-Jin Lee, Zhen Guo, Luchao Jin, Morteza Moazami Goudarzi · Nov 4, 2025
Critique Edit
We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks.
- A Proof of Learning Rate Transfer under μP
Soufiane Hayou · Nov 3, 2025
We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with μP, a neural network parameterization designed to "maximize" feature learning in the infinite-width limit.