- Optimizing Language Models for Crosslingual Knowledge Consistency
Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández · Mar 4, 2026 · Citations: 0
- Using Vision + Language Models to Predict Item Difficulty
Samin Khan · Mar 4, 2026 · Citations: 0
- Stan: An LLM-based thermodynamics course assistant
Eric M. Furst, Vasudevan Venkateshwaran · Mar 4, 2026 · Citations: 0
- iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics
Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, Chirag Shah · Mar 4, 2026 · Citations: 0
- Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models
Xin Chen, Saili Uday Gadgil, Jiarong Qiu · Mar 4, 2026 · Citations: 0
- Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu · Mar 4, 2026 · Citations: 0
Pairwise Preference · Web Browsing
We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous…
- Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang · Mar 4, 2026 · Citations: 0
- From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models
Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu · Mar 4, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Adaptive Memory Admission Control for LLM Agents
Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao · Mar 4, 2026 · Citations: 0
- Still Fresh? Evaluating Temporal Drift in Retrieval Benchmarks
Nathan Kuissi, Suraj Subrahmanyan, Nandan Thakur, Jimmy Lin · Mar 4, 2026 · Citations: 0
- AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai · Mar 4, 2026 · Citations: 0
To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research…
- TaxonRL: Reinforcement Learning with Intermediate Rewards for Interpretable Fine-Grained Visual Reasoning
Maximilian von Klinski, Maximilian Schall · Mar 4, 2026 · Citations: 0
- $τ$-Knowledge: Evaluating Conversational Agents over Unstructured Knowledge
Quan Shi, Alexandra Zytek, Pedram Razavi, Karthik Narasimhan, Victor Barres · Mar 4, 2026 · Citations: 0
- Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks
Haoyu Liu, Dingcheng Li, Lukas Rutishauser, Zeyu Zheng · Mar 4, 2026 · Citations: 0
- Pointer-CAD: Unifying B-Rep and Command Sequences via Pointer-based Edges & Faces Selection
Dacheng Qi, Chenyu Wang, Jingwei Xu, Tianzhe Chu, Zibo Zhao · Mar 4, 2026 · Citations: 0
- AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou · Mar 4, 2026 · Citations: 0
Pairwise Preference
We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc…
- World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings
Elan Barenholtz · Mar 4, 2026 · Citations: 0
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, $V_1$-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- The Company You Keep: How LLMs Respond to Dark Triad Traits
Zeyi Lu, Angelica Henestrosa, Pavel Chizhov, Ivan P. Yamshchikov · Mar 4, 2026 · Citations: 0
- Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models
Liangwei Yang, Shiyu Wang, Haolin Chen, Rithesh Murthy, Ming Zhu · Mar 4, 2026 · Citations: 0
- Causality Elicitation from Large Language Models
Takashi Kameyama, Masahiro Kato, Yasuko Hio, Yasushi Takano, Naoto Minakawa · Mar 4, 2026 · Citations: 0
- Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory
Zhenting Wang, Huancheng Chen, Jiayun Wang, Wei Wei · Mar 4, 2026 · Citations: 0
- Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG
Martin Asenov, Kenza Benkirane, Dan Goldwater, Aneiss Ghodsi · Mar 4, 2026 · Citations: 0
- When Do Language Models Endorse Limitations on Human Rights Principles?
Keenan Samway, Nicole Miu Takagi, Rada Mihalcea, Bernhard Schölkopf, Ilias Chalkidis · Mar 4, 2026 · Citations: 0
- Code Fingerprints: Disentangled Attribution of LLM-Generated Code
Jiaxun Guo, Ziyuan Yang, Mengyu Sun, Hui Wang, Jingfeng Lu · Mar 4, 2026 · Citations: 0
- Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model
Jakub Prejzner · Mar 4, 2026 · Citations: 0
We present Bielik-Q2-Sharp, the first systematic academic evaluation of extreme 2-bit quantization applied to a Polish large language model.
- Traces of Social Competence in Large Language Models
Tom Kouwenhoven, Michiel van der Meer, Max van Duijn · Mar 4, 2026 · Citations: 0
- VietNormalizer: An Open-Source, Dependency-Free Python Library for Vietnamese Text Normalization in TTS and NLP Applications
Hung Vu Nguyen, Loan Do, Thanh Ngoc Nguyen, Ushik Shrestha Khwakhali, Thanh Pham · Mar 4, 2026 · Citations: 0
- BeamPERL: Parameter-Efficient RL with Verifiable Rewards Specializes Compact LLMs for Structured Beam Mechanics Reasoning
Tarjei Paule Hage, Markus J. Buehler · Mar 4, 2026 · Citations: 0
- FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation
Juhyun Oh, Nayeon Lee, Chani Jung, Jiho Jin, Junho Myung · Mar 4, 2026 · Citations: 0
- Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation
Malik Marmonier, Benoît Sagot, Rachel Bawden · Mar 4, 2026 · Citations: 0
- Monitoring Emergent Reward Hacking During Generation via Internal Activations
Patrick Wilhelm, Thorsten Wittkopp, Odej Kao · Mar 4, 2026 · Citations: 0
- Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre · Mar 4, 2026 · Citations: 0
- Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects
Ji-Lun Peng, Yun-Nung Chen · Mar 4, 2026 · Citations: 0
- From Threat Intelligence to Firewall Rules: Semantic Relations in Hybrid AI Agent and Expert System Architectures
Chiara Bonfanti, Davide Colaiacomo, Luca Cagliero, Cataldo Basile · Mar 4, 2026 · Citations: 0
- IROSA: Interactive Robot Skill Adaptation using Natural Language
Markus Knauer, Samuel Bustamante, Thomas Eiband, Alin Albu-Schäffer, Freek Stulp · Mar 4, 2026 · Citations: 0
- CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents
Martin Kostelník, Michal Hradiš, Martin Dočekal · Mar 4, 2026 · Citations: 0
- On the Suitability of LLM-Driven Agents for Dark Pattern Audits
Chen Sun, Yash Vekaria, Rishab Nithyanand · Mar 4, 2026 · Citations: 0
- Assessing the Effectiveness of LLMs in Delivering Cognitive Behavioral Therapy
Navdeep Singh Bedi, Ana-Maria Bucur, Noriko Kando, Fabio Crestani · Mar 4, 2026 · Citations: 0
- Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling
Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Mary Catherine Lavissiere · Mar 4, 2026 · Citations: 0
- Benchmarking Motivational Interviewing Competence of Large Language Models
Aishwariya Jha, Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand · Mar 4, 2026 · Citations: 0
- Semantic Bridging Domains: Pseudo-Source as Test-Time Connector
Xizhong Yang, Huiming Wang, Ning Xu, Mofei Song · Mar 4, 2026 · Citations: 0
- In-Context Environments Induce Evaluation-Awareness in Language Models
Maheep Chaudhary · Mar 4, 2026 · Citations: 0
- SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Jialong Chen, Xander Xu, Hu Wei, Chuan Chen, Bing Zhao · Mar 4, 2026 · Citations: 0
- T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning
Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang · Mar 4, 2026 · Citations: 0
- MOOSE-Star: Unlocking Tractable Training for Scientific Discovery by Breaking the Complexity Barrier
Zonglin Yang, Lidong Bing · Mar 4, 2026 · Citations: 0
- Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
- ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement
Zijin Hong, Hao Chen, Zheng Yuan, Qinggang Zhang, Luyao Zhuang · Mar 4, 2026 · Citations: 0
- Order Is Not Layout: Order-to-Space Bias in Image Generation
Yongkang Zhang, Zonglin Zhao, Yuechen Zhang, Fei Ding, Pei Li · Mar 4, 2026 · Citations: 0
- CONCUR: Benchmarking LLMs for Concurrent Code Generation
Jue Huang, Tarek Mahmud, Corina Pasareanu, Guowei Yang · Mar 4, 2026 · Citations: 0
- MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation
Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen · Mar 4, 2026 · Citations: 0
- Linguistically Informed Graph Model and Semantic Contrastive Learning for Korean Short Text Classification
JaeGeon Yoo, Byoungwook Kim, Yeongwook Yang, Hong-Jun Jang · Mar 4, 2026 · Citations: 0
- A Neural Topic Method Using a Large-Language-Model-in-the-Loop for Business Research
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang · Mar 4, 2026 · Citations: 0
- Why Are Linear RNNs More Parallelizable?
William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal · Mar 4, 2026 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.