- Transforming Science Learning Materials in the Era of Artificial Intelligence
Xiaoming Zhai, Kent Crippen · Feb 8, 2026 · Citations: 0
However, these innovations also raise critical ethical and pedagogical concerns, including issues of algorithmic bias, data privacy, transparency, and the need for human oversight.
- The Landscape of AI in Science Education: What is Changing and How to Respond
Xiaoming Zhai, Kent Crippen · Feb 8, 2026 · Citations: 0
At the same time, this chapter examines the ethical, social, and pedagogical challenges that arise, particularly issues of fairness, transparency, accountability, privacy, and human oversight.
- IV Co-Scientist: Multi-Agent LLM Framework for Causal Instrumental Variable Discovery
Ivaxi Sheth, Zhijing Jin, Bryan Wilder, Dominik Janzing, Mario Fritz · Feb 8, 2026 · Citations: 0
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen · Feb 8, 2026 · Citations: 0
Long Horizon
Autonomous Machine Learning Engineering (MLE) requires agents to perform sustained, iterative optimization over long horizons.
- Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu · Feb 8, 2026 · Citations: 0
- LLMs Know More About Numbers than They Can Say
Fengting Yuchi, Li Du, Jason Eisner · Feb 8, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SoulX-Singer: Towards High-Quality Zero-Shot Singing Voice Synthesis
Jiale Qian, Hao Meng, Tian Zheng, Pengcheng Zhu, Haopeng Lin · Feb 8, 2026 · Citations: 0
Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot…
- VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos
Wenqi Liu, Yunxiao Wang, Shijie Ma, Meng Liu, Qile Su · Feb 8, 2026 · Citations: 0
- PAND: Prompt-Aware Neighborhood Distillation for Lightweight Fine-Grained Visual Classification
Qiuming Luo, Yuebing Li, Feng Li, Chang Kong · Feb 8, 2026 · Citations: 0
- Do We Need Adam? Surprisingly Strong and Sparse Reinforcement Learning with SGD in LLMs
Sagnik Mukherjee, Lifan Yuan, Pavan Jayasinha, Dilek Hakkani-Tür, Hao Peng · Feb 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- AngelSlim: A more accessible, comprehensive, and efficient toolkit for large model compression
Rui Cen, QiangQiang Hu, Hong Huang, Hong Liu, Song Liu · Feb 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TernaryLM: Memory-Efficient Language Modeling via Native 1.5-Bit Quantization with Adaptive Layer-wise Scaling
Nisharg Nargund, Priyesh Shukla · Feb 7, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice
Savan Doshi · Feb 7, 2026 · Citations: 0
- KRONE: Hierarchical and Modular Log Anomaly Detection
Lei Ma, Jinyang Liu, Tieying Zhang, Peter M. VanNostrand, Dennis M. Hofmann · Feb 7, 2026 · Citations: 0
- How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026 · Citations: 0
A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
- Measuring Complexity at the Requirements Stage: Spectral Metrics as Development Effort Predictors
Maximilian Vierlboeck, Antonio Pugliese, Roshanak Rose Nilchian, Paul T. Grogan, Rashika Sugganahalli Natesh Babu · Feb 6, 2026 · Citations: 0
Expert Verification
Complexity in engineered systems presents one of the most persistent challenges in modern development, as it drives cost overruns, schedule delays, and outright project failures.
- PACIFIC: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs
Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki · Feb 6, 2026 · Citations: 0
Pairwise Preference
Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g.,…
- On Randomness in Agentic Evals
Bjarni Haukur Bjarnason, André Silva, Martin Monperrus · Feb 6, 2026 · Citations: 0
Agentic systems are evaluated on benchmarks where agents interact with environments to solve tasks.
- Daily and Weekly Periodicity in Large Language Model Performance and Its Implications for Research
Paul Tschisgale, Peter Wulff · Feb 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough
Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz · Feb 6, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Personality as Relational Infrastructure: User Perceptions of Personality-Trait-Infused LLM Messaging
Dominik P. Hofer, David Haag, Rania Islambouli, Jan D. Smeddinck · Feb 6, 2026 · Citations: 0
- Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory
Sanyam Singh, Naga Ganesh, Vineet Singh, Lakshmi Pedapudi, Ritesh Kumar · Feb 6, 2026 · Citations: 0
We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall,…
- Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study
Yi Liu, Zhihao Chen, Yanjun Zhang, Gelei Deng, Yuekang Li · Feb 6, 2026 · Citations: 0
Third-party agent skills extend LLM-based agents with instruction files and executable code that run on users' machines.
- LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
Brian Rabern, Philipp Mondorf, Barbara Plank · Feb 6, 2026 · Citations: 0
Large language models perform well on many logical reasoning benchmarks, but it remains unclear which core logical skills they truly master.
- From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG
Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan · Feb 6, 2026 · Citations: 0
In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic…
- Stopping Computation for Converged Tokens in Masked Diffusion-LM Decoding
Daisuke Oba, Danushka Bollegala, Masahiro Kaneko, Naoaki Okazaki · Feb 6, 2026 · Citations: 0
- LLM-Enhanced Rumor Detection via Virtual Node Induced Edge Prediction
Jiran Tao, Cheng Wang, Binyan Jiang · Feb 6, 2026 · Citations: 0
- CALM: Class-Conditional Sparse Attention Vectors for Large Audio-Language Models
Videet Mehta, Liming Wang, Hilde Kuehne, Rogerio Feris, James R. Glass · Feb 6, 2026 · Citations: 0
Experiments on multiple few-shot audio and audiovisual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches by up to 14.52%, 1.53%, and 8.35% absolute gains…
- LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao · Feb 6, 2026 · Citations: 0
Long Horizon
Across diverse chemical reasoning benchmarks, LatentChem achieves a 59.88% non-tie win rate over strong CoT-based baselines on ChemCoTBench, while delivering a 10.84× average reduction in reasoning overhead.
- RoPE-LIME: RoPE-Space Locality + Sparse-K Sampling for Efficient LLM Attribution
Isaac Picov, Ritesh Goru · Feb 6, 2026 · Citations: 0
Tool Use
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Protean Compiler: An Agile Framework to Drive Fine-grain Phase Ordering
Amir H. Ashouri, Shayan Shirahmad Gale Bagi, Kavin Satheeskumar, Tejas Srikanth, Jonathan Zhao · Feb 5, 2026 · Citations: 0
Traditionally, such locally optimized decisions are made by hand-coded algorithms tuned for a small number of benchmarks, often requiring significant effort to be retuned when the benchmark suite changes.
- Self-Improving World Modelling with Latent Actions
Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen · Feb 5, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang · Feb 5, 2026 · Citations: 0
Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs.
- Rewards as Labels: Revisiting RLVR from a Classification Perspective
Zepeng Zhai, Meilin Chen, Jiaxuan Zhao, Junlang Qian, Lei Shen · Feb 5, 2026 · Citations: 0
Extensive experiments on mathematical reasoning benchmarks show that REAL improves training stability and consistently outperforms GRPO and strong variants such as DAPO.
- Transport and Merge: Cross-Architecture Merging for Large Language Models
Chenhang Cui, Binyun Yang, Fei Shen, Yuxin Chen, Jingnan Zheng · Feb 5, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LLM-driven Multimodal Recommendation
Yicheng Di · Feb 5, 2026 · Citations: 0
- Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai · Feb 5, 2026 · Citations: 0
Pairwise Preference
We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences.
- Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions
Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang · Feb 5, 2026 · Citations: 0
In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks.
- The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems
Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu · Feb 5, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- EBPO: Empirical Bayes Shrinkage for Stabilizing Group-Relative Policy Optimization
Kevin Han, Yuhang Zhou, Mingze Gao, Gedi Zhou, Serena Li · Feb 5, 2026 · Citations: 0
Empirically, EBPO consistently outperforms GRPO and other established baselines across diverse benchmarks, including AIME and OlympiadBench.
- GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek
Yang Zhang, Mersin Konomi, Christos Xypolopoulos, Konstantinos Divriotis, Konstantinos Skianis · Feb 5, 2026 · Citations: 0
Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited.
- Cross-talk based multi-task learning for fault classification of machine system influenced by multiple variables
Wonjun Yi, Rismaya Kumar Mishra, Yong-Hwa Park · Feb 5, 2026 · Citations: 0
We build on our previously introduced residual neural dimension reductor model and extend its application to two benchmarks where the system is influenced by multiple variables.
- SPARE: Self-distillation for PARameter-Efficient Removal
Natnael Mola, Leonardo S. B. Pereira, Carolina R. Kelsch, Luis H. Arribas, Juan C. S. M. Avedillo · Feb 4, 2026 · Citations: 0
- CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation
Zhao Tong, Chunlin Gong, Yiping Zhang, Haichao Shi, Qiang Liu · Feb 4, 2026 · Citations: 0
From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process.
- When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?
Xinyu Zhou, Chang Jin, Carsten Eickhoff, Zhijiang Guo, Seyed Ali Bahrainian · Feb 4, 2026 · Citations: 0
- Investigating Disability Representations in Text-to-Image Models
Yang Tian, Yu Fan, Liudmila Zavolokina, Sarah Ebling · Feb 4, 2026 · Citations: 0
- A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel · Feb 4, 2026 · Citations: 0
Red Team
Automated LLM-as-a-Judge frameworks have become the de facto standard for scalable evaluation across natural language processing.
- WideSeek-R1: Exploring Width Scaling for Broad Information Seeking via Multi-Agent Reinforcement Learning
Zelai Xu, Zhexuan Xu, Ruize Zhang, Chunyang Zhu, Shi Yu · Feb 4, 2026 · Citations: 0
Tool Use
To bridge this gap, we propose WideSeek-R1, a lead-agent-subagent framework trained via multi-agent reinforcement learning (MARL) to synergize scalable orchestration and parallel execution.
- VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
Jaeyoon Jung, Yejun Yoon, Kunwoo Park · Feb 4, 2026 · Citations: 0
Multi Agent
This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration.
- Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models
Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim · Feb 4, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Forecasting Future Language: Context Design for Mention Markets
Sumin Kim, Jihoon Kwon, Yoon Kim, Nicole Kagan, Raffi Khatchadourian · Feb 4, 2026 · Citations: 0
- Contextual Drag: How Errors in the Context Affect LLM Reasoning
Yun Cheng, Xingyu Zhu, Haoyu Zhao, Sanjeev Arora · Feb 4, 2026 · Citations: 0
Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration.
- Expert Selections In MoE Models Reveal (Almost) As Much As Text
Amir Nuriyev, Gabriel Kulp · Feb 4, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026 · Citations: 0
Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
- Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries
Natalie Perez, Sreyoshi Bhaduri, Aman Chadha · Feb 3, 2026 · Citations: 0
Meaning in human language is relational, context dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings.
- SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0
Web Browsing
To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
- OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering
Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong · Feb 3, 2026 · Citations: 0
Tool Use
To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning.
- $V_0$: A Generalist Value Model for Any Policy at State Zero
Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu · Feb 3, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue
Yuqin Dai, Ning Gao, Wei Zhang, Jie Wang, Zichen Luo · Feb 3, 2026 · Citations: 0
However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data.
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training
Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le · Feb 3, 2026 · Citations: 0
Long Horizon
In this technical report, we present SWE-Master, an open-source and fully reproducible post-training framework for building effective software engineering agents.