- The Conundrum of Trustworthy Research on Attacking Personally Identifiable Information Removal Techniques
Sebastian Ochs, Ivan Habernal · Mar 9, 2026 · Citations: 0
We critically analyze the evaluation of existing attacks and find that data leakage and data contamination are not properly mitigated, leaving open the question of whether PII removal techniques truly protect privacy in real-world scenarios…
- Supporting Workflow Reproducibility by Linking Bioinformatics Tools across Papers and Executable Code
Clémence Sebe, Olivier Ferret, Aurélie Névéol, Mahdi Esmailoghli, Ulf Leser · Mar 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation
Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska · Mar 9, 2026 · Citations: 0
Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages.
- Is continuous CoT better suited for multi-lingual reasoning?
Ali Hamza Bashir, Behzad Shomali, Markus Frey, Mehdi Ali, Rafet Sifa · Mar 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs
Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu · Mar 9, 2026 · Citations: 0
Multi Agent
First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning.
- Gender Bias in MT for a Genderless Language: New Benchmarks for Basque
Amaia Murillo, Olatz Perez-de-Viñaspre, Naiara Perez · Mar 9, 2026 · Citations: 0
Pairwise Preference
WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French.
- Gradually Excavating External Knowledge for Implicit Complex Question Answering
Chang Liu, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu · Mar 9, 2026 · Citations: 0
Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential.
- EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery
Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo · Mar 9, 2026 · Citations: 0
Multi Agent
To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution.
- Ramsa: A Large Sociolinguistically Rich Emirati Arabic Speech Corpus for ASR and TTS
Rania Al-Sabbagh · Mar 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning
Chi-Min Chan, Ehsan Hajiramezanali, Xiner Li, Edward De Brouwer, Carl Edwards · Mar 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization
Hongli Zhou, Hui Huang, Rui Zhang, Kehai Chen, Bing Xu · Mar 9, 2026 · Citations: 0
To bridge this gap, we propose JudgeBiasBench, a benchmark for systematically quantifying biases in LLM-based judges.
- High-Fidelity Pruning for Large Language Models
Yijun Zhu, Jianxin Wang, Chengchao Shen · Mar 9, 2026 · Citations: 0
An intuitive solution to address this is to employ a self-distillation criterion for importance evaluation.
- Deterministic Differentiable Structured Pruning for Large Language Models
Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu · Mar 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Examining the Role of YouTube Production and Consumption Dynamics on the Formation of Extreme Ideologies
Sarmad Chandio, Rishab Nithyanand · Mar 9, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
Younjoo Lee, Junghoo Lee, Seungkyun Dan, Jaiyoung Park, Jung Ho Ahn · Mar 9, 2026 · Citations: 0
Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
- ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments
Weixiang Zhao, Haozhen Li, Yanyan Zhao, Xuda Zhi, Yongbo Huang · Mar 9, 2026 · Citations: 0
As large language models (LLMs) evolve into autonomous agents capable of acting in open-ended environments, ensuring behavioral alignment with human values becomes a critical safety concern.
- SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning
Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang · Mar 9, 2026 · Citations: 0
Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25.
- $OneMillion-Bench: How Far are Language Agents from Human Experts?
Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0
Rubric Rating · Tool Use
To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
- Emergence is Overrated: AGI as an Archipelago of Experts
Daniel Kilov · Mar 9, 2026 · Citations: 0
This paper examines whether their framework accurately characterizes human intelligence and its implications for conceptualizing artificial general intelligence.
- BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence
Biao Xiang, Soyeon Caren Han, Yihao Ding · Mar 9, 2026 · Citations: 0
We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that requires integrating evidence across text, tables, and figures.
- Reject, Resample, Repeat: Understanding Parallel Reasoning in Language Model Inference
Noah Golowich, Fan Chen, Dhruv Rohatgi, Raghav Singhal, Carles Domingo-Enrich · Mar 9, 2026 · Citations: 0
Given a base language model and a *process reward model* estimating expected terminal rewards, we ask: *how accurately can we sample from a target distribution given some number of process reward evaluations?* Theoretically, we identify (1)…
- CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao · Mar 9, 2026 · Citations: 0
However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of…
- What Do AI Agents Talk About? Emergent Communication Structure in the First AI-Only Social Network
Taksch Dube, Jianfeng Zhu, NhatHai Phan, Ruoming Jin · Mar 9, 2026 · Citations: 0
When autonomous AI agents communicate with one another at scale, what kind of discourse system emerges?
- SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans
Hansi Zeng, Zoey Li, Yifan Gao, Chenwei Zhang, Xiaoman Pan · Mar 9, 2026 · Citations: 0
Tool Use
Research Agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use.