- Transforming Science Learning Materials in the Era of Artificial Intelligence
Xiaoming Zhai, Kent Crippen · Feb 8, 2026
However, these innovations also raise critical ethical and pedagogical concerns, including issues of algorithmic bias, data privacy, transparency, and the need for human oversight.
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026
Long Horizon
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.
- LLMs Know More About Numbers than They Can Say
Fengting Yuchi, Li Du, Jason Eisner · Feb 8, 2026
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big a number is?
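The abstract's example comparison is easy to work out numerically; a plain-Python check (not from the paper) makes the mixed-notation trap explicit:

```python
# Convert both notations to plain numbers before comparing.
a = 5.7 * 10**2   # scientific notation: 5.7 x 10^2 = 570.0
b = 580           # plain decimal notation
larger = max(a, b)  # 580 is the larger value, despite 5.7e2 "looking" bigger
```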
- Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs
Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng · Feb 7, 2026
Reinforcement learning (RL), particularly RL from verifiable reward (RLVR), has become a crucial phase of training large language models (LLMs) and a key focus of current scaling efforts.
- How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026
A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
- RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution
Isaac Picov, Ritesh Goru · Feb 6, 2026
Tool Use
Explaining closed-source Large Language Model (LLM) outputs is challenging because API-only access precludes gradient-based attribution, while perturbation methods are costly and noisy when they depend on regenerated text.
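For context, the baseline the abstract refers to is LIME-style perturbation attribution: mask input tokens at random, query the model, and fit a linear surrogate over the masks. A generic sketch (illustrative only; not the RoPE-LIME method, and `predict` is a stand-in for a model-scoring call):

```python
import numpy as np

def lime_attribution(predict, tokens, n_samples=200, seed=0):
    """Generic LIME-style token attribution by random masking.

    predict: callable mapping a list of tokens to a scalar score.
    Returns one surrogate weight per input token; each model query
    is one API call, which is why this approach is costly for LLMs.
    """
    rng = np.random.default_rng(seed)
    d = len(tokens)
    masks = rng.integers(0, 2, size=(n_samples, d))  # random keep/drop patterns
    ys = np.array([predict([t for t, m in zip(tokens, row) if m])
                   for row in masks])
    # Fit a linear surrogate: score ~ masks @ w + b
    X = np.hstack([masks, np.ones((n_samples, 1))])
    coeffs, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return coeffs[:-1]  # drop the intercept
```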
- Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering
Amir H. Ashouri, Shayan Shirahmad Gale Bagi, Kavin Satheeskumar, Tejas Srikanth, Jonathan Zhao · Feb 5, 2026
Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes.
- Transport and Merge: Cross-Architecture Merging for Large Language Models
Chenhang Cui, Binyun Yang, Fei Shen, Yuxin Chen, Jingnan Zheng · Feb 5, 2026
Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data.
- Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions
Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang · Feb 5, 2026
In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks.
- The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu · Feb 5, 2026
Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models, at the cost of loading multiple LMs.
- EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li · Feb 5, 2026
Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench.
- VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
Jaeyoon Jung, Yejun Yoon, Kunwoo Park · Feb 4, 2026
Multi Agent
This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration.
- Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026
Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
- OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong · Feb 3, 2026
Tool Use
To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le · Feb 3, 2026
Long Horizon
In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.
- STAR: Similarity-guided Teacher-Assisted Refinement for Super-Tiny Function Calling Models
Jiliang Ni, Jiachen Pu, Zhongyi Yang, Jingfeng Luo, Conggang Hu · Feb 3, 2026
Tool Use
The proliferation of Large Language Models (LLMs) in function calling is pivotal for creating advanced AI agents, yet their large scale hinders widespread adoption, necessitating the transfer of their capabilities to smaller models.
- Proof-RM: A Scalable and Generalizable Reward Model for Math Proof
Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu · Feb 2, 2026
In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "**question-proof-check**" triplet data.
- DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations
Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen · Feb 2, 2026
However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC.
- Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts
Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao · Feb 2, 2026
Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time.
- CryoLVM: Self-supervised Learning from Cryo-EM Density Maps with Large Vision Models
Weining Fu, Kai Shu, Kui Xu, Qiangfeng Cliff Zhang · Feb 2, 2026
Cryo-electron microscopy (cryo-EM) has revolutionized structural biology by enabling near-atomic-level visualization of biomolecular assemblies.
- Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, Lin Gui · Feb 2, 2026
Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting.
- Mechanistic Indicators of Steering Effectiveness in Large Language Models
Mehdi Jafari, Hao Xue, Flora Salim · Feb 2, 2026
Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges.
- Argument Rarity-based Originality Assessment for AI-Assisted Writing
Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji · Feb 2, 2026
Experiments using 1,375 human essays and 1,000 AI-generated essays on two argumentative topics revealed three key findings.