- SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Sher Badshah, Ali Emami, Hassan Sajjad · Feb 13, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
- CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
Vaibhav Devraj, Dhruv Kumar, Jagat Sesh Challa, Parth Agarwal, Navya Kommuri · Dec 26, 2025 · Citations: 0
Expert Verification Automatic Metrics
To investigate this potential capability gap, we present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data.
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026 · Citations: 0
Pairwise Preference
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- FairMed-XGB: A Bayesian-Optimised Multi-Metric Framework with Explainability for Demographic Equity in Critical Healthcare Data
Mitul Goswami, Romit Chatterjee, Arif Ahmed Sekh · Mar 16, 2026 · Citations: 0
Expert Verification Automatic Metrics
Post-mitigation evaluation on seven clinically distinct cohorts derived from the MIMIC-IV-ED and eICU databases demonstrates substantial bias reduction: Statistical Parity Difference decreases by 40 to 51 percent on MIMIC-IV-ED and 10 to 19…
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, $V_1$-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
- Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026 · Citations: 0
Automatic Metrics
Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures.
- WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie · Jan 5, 2026 · Citations: 0
Pairwise Preference Llm As Judge
However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations Simulation Env
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
- Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin · Aug 28, 2025 · Citations: 0
Red Team Automatic Metrics
These models serve as suitable drop-in replacements for a broad range of safety-critical applications, where the risks of text-based manipulation outweigh the utility of text recognition.
- LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar · Apr 2, 2026 · Citations: 0
Llm As Judge Automatic Metrics
Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional…
- AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song · Jun 17, 2025 · Citations: 0
Automatic Metrics
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents.
- PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi · Oct 8, 2025 · Citations: 0
Pairwise Preference
Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
- Revisiting Self-Play Preference Optimization: On the Role of Prompt Difficulty
Yao Xiao, Jung-jae Kim, Roy Ka-wei Lee, Lidong Bing · Oct 7, 2025 · Citations: 0
Pairwise Preference
Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
- Evaluation of Large Language Models via Coupled Token Generation
Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco · Feb 3, 2025 · Citations: 0
Pairwise Preference
In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
- SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?
Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar · Feb 3, 2026 · Citations: 0
Automatic Metrics
To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts.
- Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
Yixin Liu, Yue Yu, DiJia Su, Sid Wang, Xuewei Wang · Mar 12, 2026 · Citations: 0
Pairwise Preference
Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked.
- ReCoN-Ipsundrum: An Inspectable Recurrent Persistence Loop Agent with Affect-Coupled Control and Mechanism-Linked Consciousness Indicator Assays
Aishik Sanyal · Feb 26, 2026 · Citations: 0
Pairwise Preference
Inspired by Humphrey's ipsundrum hypothesis, we implement ReCoN-Ipsundrum, an inspectable agent that extends a ReCoN state machine with a recurrent persistence loop over sensory salience Ns and an optional affect proxy reporting…
- Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng · Sep 27, 2025 · Citations: 0
Pairwise Preference
To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training.
- Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning
Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen · Sep 26, 2025 · Citations: 0
Critique Edit
We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models.
- A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
David Schlangen, Sherzod Hakimov, Chalamalasetti Kranti, Jonathan Jordan, Philipp Sadler · Jul 11, 2025 · Citations: 0
Pairwise Preference
There are currently two main paradigms for evaluating large language models (LLMs), reference-based evaluation and preference-based evaluation.
- Search Arena: Analyzing Search-Augmented LLMs
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan · Jun 5, 2025 · Citations: 0
Pairwise Preference
In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs.
- MathDuels: Evaluating LLMs as Problem Posers and Solvers
Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik · Apr 23, 2026 · Citations: 0
- Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Hao-Yuan Chen · Apr 23, 2026 · Citations: 0
- Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts
Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong · Apr 23, 2026 · Citations: 0
- MOMO: A framework for seamless physical, verbal, and graphical robot skill learning and adaptation
Markus Knauer, Edoardo Fiorini, Maximilian Mühlbauer, Stefan Schneyer, Promwat Angsuratanawech · Apr 22, 2026 · Citations: 0
- HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu · Apr 22, 2026 · Citations: 0
- Detoxification for LLM: From Dataset Itself
Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu · Apr 21, 2026 · Citations: 0
- Cat-DPO: Category-Adaptive Safety Alignment
Tiankai Yang, Yi Nian, Xinyuan Li, Ruiyao Xu, Kaize Ding · Apr 19, 2026 · Citations: 0
- Neurosymbolic Repo-level Code Localization
Xiufeng Xu, Xiufeng Wu, Zejun Zhang, Yi Li · Apr 17, 2026 · Citations: 0
- Target-Oriented Pretraining Data Selection via Neuron-Activated Graph
Zijun Wang, Haoqin Tu, Weidong Zhou, Yiyang Zhou, Xiaohuan Zhou · Apr 17, 2026 · Citations: 0
- Context Over Content: Exposing Evaluation Faking in Automated Judges
Manan Gupta, Inderjeet Nair, Lu Wang, Dhruv Kumar · Apr 16, 2026 · Citations: 0
- StoryCoder: Narrative Reformulation for Structured Reasoning in LLM Code Generation
Geonhui Jang, Dongyoon Han, YoungJoon Yoo · Apr 16, 2026 · Citations: 0
- CollabCoder: Plan-Code Co-Evolution via Collaborative Decision-Making for Efficient Code Generation
Duy Tung Doan, Quang Huy Phung, Dzung Nguyen, Khac-Hoai Nam Bui · Apr 15, 2026 · Citations: 0
- An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2
Ryan Lail · Apr 15, 2026 · Citations: 0
- IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar · Apr 15, 2026 · Citations: 0
- One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness
Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram · Apr 14, 2026 · Citations: 0
- A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina · Apr 13, 2026 · Citations: 0
- Think in Sentences: Explicit Sentence Boundaries Enhance Language Model's Capabilities
Zhichen Liu, Yongyuan Li, Yang Xu · Apr 11, 2026 · Citations: 0
- Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions
Utshab Kumar Ghosh, Ashish David, Shubham Chatterjee · Apr 11, 2026 · Citations: 0
- Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
Monishwaran Maheswaran, Leon Lakhani, Zhongzhu Zhou, Shijia Yang, Junxiong Wang · Apr 9, 2026 · Citations: 0
- IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras · Apr 9, 2026 · Citations: 0
- Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao · Apr 2, 2026 · Citations: 0
- Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models
Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li · Mar 26, 2026 · Citations: 0
- Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das · Mar 26, 2026 · Citations: 0
- RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue
Long Mai · Mar 24, 2026 · Citations: 0
- Edge Radar Material Classification Under Geometry Shifts
Jannik Hohmann, Dong Wang, Andreas Nüchter · Mar 24, 2026 · Citations: 0
- How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge · Mar 23, 2026 · Citations: 0
- Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna · Mar 18, 2026 · Citations: 0
- Mediocrity is the key for LLM as a Judge Anchor Selection
Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend · Mar 17, 2026 · Citations: 0
- Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies
Giuseppe Samo, Paola Merlo · Mar 16, 2026 · Citations: 0
- Attention Residuals
Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu · Mar 16, 2026 · Citations: 0
- MXNorm: Reusing MXFP block scales for efficient tensor normalisation
Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi · Mar 13, 2026 · Citations: 0
- When LLM Judge Scores Look Good but Best-of-N Decisions Fail
Eddie Landesberg · Mar 12, 2026 · Citations: 0
- LABSHIELD: A Multimodal Benchmark for Safety-Critical Reasoning and Planning in Scientific Laboratories
Qianpu Sun, Xiaowei Chi, Yuhan Rui, Ying Li, Kuangzhi Ge · Mar 12, 2026 · Citations: 0
- UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking
Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou · Mar 9, 2026 · Citations: 0
- Dynamic Vehicle Routing Problem with Prompt Confirmation of Advance Requests
Amutheezan Sivagnanam, Ayan Mukhopadhyay, Samitha Samaranayake, Abhishek Dubey, Aron Laszka · Mar 8, 2026 · Citations: 0
- Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning
Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu · Mar 4, 2026 · Citations: 0
- Polynomial Mixing for Efficient Self-supervised Speech Encoders
Eva Feillet, Ryan Whetten, David Picard, Alexandre Allauzen · Feb 28, 2026 · Citations: 0