- VehicleMemBench: An Executable Benchmark for Multi-User Long-Term Memory in In-Vehicle Agents
Yuhao Chen, Yi Xu, Xinyun Ding, Xiang Fang, Shuochen Liu · Mar 25, 2026 · Citations: 0
Pairwise Preference Simulation Env Tool Use
With the growing demand for intelligent in-vehicle experiences, vehicle-based agents are evolving from simple assistants to long-term companions.
- RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models
Daniel Yang, Samuel Stante, Florian Redhardt, Lena Libon, Parnian Kassraie · Feb 27, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Reward models are central to aligning large language models (LLMs) with human preferences.
- Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development
Hung Tran, Langston Nashold, Rayan Krishnan, Antoine Bigeard, Alex Gu · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Web Browsing
We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 public validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous…
- Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness
Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan · Mar 16, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps.
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- Do Phone-Use Agents Respect Your Privacy?
Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We study whether phone-use agents respect privacy while completing benign mobile tasks.
- CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu · Mar 19, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly…
- Sabiá-4 Technical Report
Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás · Mar 10, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Tool Use
The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal…
- Surgical Post-Training: Cutting Errors, Keeping Knowledge
Wenye Lin, Kai Han · Mar 2, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
- The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He · Feb 17, 2026 · Citations: 0
Pairwise Preference Multi Agent
Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and…
- ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé · Feb 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences.
- FEAST: Fully Connected Expressive Attention for Spatial Transcriptomics
Taejin Jeong, Joohyeok Kim, Jinyeong Kim, Chanyoung Kim, Seong Jae Hwang · Mar 26, 2026 · Citations: 0
Pairwise Preference
To address this, we propose FEAST (Fully connected Expressive Attention for Spatial Transcriptomics), an attention-based framework that models the tissue as a fully connected graph, enabling the consideration of all pairwise interactions.
- IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge
Ali Abdelaal, Mohammed Nader Al Haffar, Mahmoud Fawzi, Walid Magdy · Mar 24, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions).
- Truth as a Compression Artifact in Language Model Training
Konstantin Krestnikov · Mar 12, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus.
- From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring
Minh Hoang Nguyen, Vu Hoang Pham, Xuan Thanh Huynh, Phuc Hong Mai, Vinh The Nguyen · Mar 6, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning…
- Tucano 2 Cool: Better Open Source LLMs for Portuguese
Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf · Mar 3, 2026 · Citations: 0
Pairwise Preference Tool Use
Following our previous works, we now extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling missing gaps in GigaVerbo-v2, and two…
- Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek, Haim Dubossarsky · Feb 17, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and…
- From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs
Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang · Mar 25, 2026 · Citations: 0
Pairwise Preference
We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history,…
- Comparing Developer and LLM Biases in Code Evaluation
Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu · Mar 25, 2026 · Citations: 0
Pairwise Preference Rubric Rating
We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh…
- From Isolated Scoring to Collaborative Ranking: A Comparison-Native Framework for LLM-Based Paper Evaluation
Pujun Zheng, Jiacheng Yao, Jinquan Zheng, Chenyang Gu, Guoxiu He · Mar 18, 2026 · Citations: 0
Pairwise Preference
Large language models (LLMs) are currently applied to scientific paper evaluation by assigning an absolute score to each paper independently.
- PrivMedChat: End-to-End Differentially Private RLHF for Medical Dialogue Systems
Sudip Bhujel · Mar 3, 2026 · Citations: 0
Pairwise Preference Expert Verification
To avoid costly clinician labeling, we introduce an annotation-free preference construction strategy that pairs physician responses with filtered non-expert generations.
- You Didn't Have to Say It like That: Subliminal Learning from Faithful Paraphrases
Isaia Gisler, Zhonghao He, Tianyi Qiu · Mar 10, 2026 · Citations: 0
Pairwise Preference
We investigate whether transmission occurs through natural language paraphrases with fixed semantic content, and whether content explicitly contradicting the teacher's preference can block it.
- Bioalignment: Measuring and Improving LLM Disposition Toward Biological Systems for AI Safety
Trent R Northen, Mingxun Wang · Mar 10, 2026 · Citations: 0
Pairwise Preference
A sample of 5 frontier and 5 open-weight models were measured using 50 curated Bioalignment prompts with a Kelly criterion-inspired evaluation framework.
- EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training
Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu · Mar 2, 2026 · Citations: 0
Pairwise Preference
We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior.
- gencat: Generative computerized adaptive testing
Wanyong Feng, Andrew Lan · Feb 23, 2026 · Citations: 0
Pairwise Preference
We train the model in a two-step process, first via Supervised Fine-Tuning and then via preference optimization for knowledge-response alignment.