- VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu · Mar 25, 2026 · Citations: 0
Pairwise Preference Simulation Env Tool Use
With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0
Pairwise Preference Tool Use
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- Sabiá-4 Technical Report
Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás · Mar 10, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Tool Use
The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal…
- The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026 · Citations: 0
Critique Edit Tool Use
Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
- Tucano 2 Cool: Better Open Source LLMs for Portuguese
Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf · Mar 3, 2026 · Citations: 0
Pairwise Preference Tool Use
Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling missing gaps in GigaVerbo-v2, and two…
- ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering
Hussein Jawad, Nicolas J-B Brunel · Mar 14, 2026 · Citations: 0
Automatic Metrics Tool Use
Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning.
- Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception
Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai · Feb 12, 2026 · Citations: 0
Automatic Metrics Tool Use
To address this, we propose Region-to-Image Distillation, which transforms zooming from an inference-time tool into a training-time primitive, thereby internalizing the benefits of agentic zooming into a single forward pass of an MLLM.
- Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation
Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson · Feb 11, 2026 · Citations: 0
Automatic Metrics Tool Use
We present Agent-Diff, a novel benchmarking framework for evaluating agentic Large Language Models (LLMs) on real-world productivity software API tasks via code execution.
- The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
Haoyuan Xu, Chang Li, Xinyan Ma, Xianhao Ou, Zihan Zhang · Mar 24, 2026 · Citations: 0
Automatic Metrics Tool Use
As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such…
- REDSearcher: A Scalable and Cost-Efficient Framework for Long-Horizon Search Agents
Zheng Chu, Xiao Wang, Jack Hong, Huiming Fan, Yuqi Huang · Feb 15, 2026 · Citations: 0
Automatic Metrics Tool Use
To address these challenges, we propose REDSearcher, a unified framework that co-designs complex task synthesis, mid-training, and post-training for scalable search-agent optimization.
- A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026 · Citations: 0
Automatic Metrics Tool Use
To address this, we introduce DEEPSYNTH, a novel benchmark designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning to produce insights.
- The Detection-Extraction Gap: Models Know the Answer Before They Can Say It
Hanyang Wang, Mingxuan Zhu · Apr 8, 2026 · Citations: 0
Automatic Metrics Tool Use
Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix.
- AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning
Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan · Apr 7, 2026 · Citations: 0
Automatic Metrics Tool Use
To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference.