- SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment
Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song · Feb 14, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability.
- FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0
Expert Verification Automatic Metrics
Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
- WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie · Jan 5, 2026 · Citations: 0
Pairwise Preference Llm As Judge
However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations Simulation Env
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
- GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
- Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0
Red Team Automatic Metrics
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
Llm As JudgeAutomatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0
Automatic Metrics
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
- PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi · Oct 8, 2025 · Citations: 0
Pairwise Preference
Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
- Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025 · Citations: 0
Pairwise Preference
Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
- Evaluation of Large Language Models via Coupled Token Generation
Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco · Feb 3, 2025 · Citations: 0
Pairwise Preference
In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
- SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0
Automatic Metrics
To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
- TARo: Token-level Adaptive Routing for LLM Test-time Alignment
Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli · Mar 19, 2026 · Citations: 0
Pairwise Preference
Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning.
- Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang · Mar 12, 2026 · Citations: 0
Pairwise Preference
Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked.
- ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays
Aishik Sanyal · Feb 26, 2026 · Citations: 0
Pairwise Preference
Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting…
- Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng · Sep 27, 2025 · Citations: 0
Pairwise Preference
To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training.
- A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0
Pairwise Preference
There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
- Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms
Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang · Jun 11, 2025 · Citations: 0
Pairwise Preference
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning…
- Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan · Jun 5, 2025 · Citations: 0
Pairwise Preference
In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs.
- Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar · Apr 16, 2026 · Citations: 0
- An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
Ryan Lail · Apr 15, 2026 · Citations: 0
- IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar · Apr 15, 2026 · Citations: 0
- C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
Akira Kawabata, Saku Sugawara · Apr 15, 2026 · Citations: 0
- One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram · Apr 14, 2026 · Citations: 0
- Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu · Apr 11, 2026 · Citations: 0
- Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions
Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee · Apr 11, 2026 · Citations: 0
- Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato · Apr 9, 2026 · Citations: 0
- IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras · Apr 9, 2026 · Citations: 0
- Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao · Apr 2, 2026 · Citations: 0
- Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li · Mar 26, 2026 · Citations: 0
- Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das · Mar 26, 2026 · Citations: 0
- RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Long Mai · Mar 24, 2026 · Citations: 0
- Edge Radar Material Classification Under Geometry Shifts
Jannik Hohmann, Dong Wang, Andreas Nüchter · Mar 24, 2026 · Citations: 0
- Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna · Mar 18, 2026 · Citations: 0
- Mediocrity is the key for LLM as a Judge Anchor Selection
Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend · Mar 17, 2026 · Citations: 0
- Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Giuseppe Samo, Paola Merlo · Mar 16, 2026 · Citations: 0
- Attention Residuals
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu · Mar 16, 2026 · Citations: 0
- MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi · Mar 13, 2026 · Citations: 0
- When LLM Judge Scores Look Good but Best-of-N Decisions Fail
Eddie Landesberg · Mar 12, 2026 · Citations: 0
- LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge · Mar 12, 2026 · Citations: 0
- UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou · Mar 9, 2026 · Citations: 0
- Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
- Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
- Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen · Feb 28, 2026 · Citations: 0
- Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs
Jie Cao, Tianwei Lin, Zhenxuan Fan, Bo Yuan, Ziyuan Zhao · Feb 28, 2026 · Citations: 0
- Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang · Feb 27, 2026 · Citations: 0
- SC-Arena: A Natural Language Benchmark for Single-Cell Reasoning with Knowledge-Augmented Evaluation
Jiahao Zhao, Feng Jiang, Shaowei Qin, Zhonghui Zhang, Junhao Liu · Feb 26, 2026 · Citations: 0
- Agentic Explainable Artificial Intelligence (Agentic XAI) Approach To Explore Better Explanation
Tomoaki Yamaguchi, Yutong Zhou, Masahiro Ryo, Keisuke Katsura · Dec 24, 2025 · Citations: 0
- Revisiting the Reliability of Language Models in Instruction-Following
Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei · Dec 15, 2025 · Citations: 0
- Pay Less Attention to Function Words for Free Robustness of Vision-Language Models
Qiwei Tian, Chenhao Lin, Zhengyu Zhao, Chao Shen · Dec 8, 2025 · Citations: 0
- Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
Tomas Figliolia, Nicholas Alonso, Rishi Iyer, Quentin Anthony, Beren Millidge · Oct 6, 2025 · Citations: 0
- SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM-based Multi-Agent Communication
Ruijia Zhang, Xinyan Zhao, Ruixiang Wang, Sigen Chen, Guibin Zhang · Aug 15, 2025 · Citations: 0
- An Automated Survey of Generative Artificial Intelligence: Large Language Models, Architectures, Protocols, and Applications
Eduardo C. Garrido-Merchán, Álvaro López López · Jun 5, 2023 · Citations: 0