A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy.
To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas.
Safety benchmarks are routinely treated as evidence about how a language model will behave once deployed, but this inference is fragile if behavior depends on whether a prompt looks like an evaluation.
Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls.
In contrast, fully unstructured teams enable adaptability and exploration but suffer from inefficiencies such as error propagation, inter-agent conflicts, and wasted resources (measured in time, tokens, or file operations).
Human label variation has been established as a central phenomenon in NLP: the perspectives different annotators have on the same item need to be embraced.
Laughter is a social non-verbal vocalization that is universal across cultures and languages, and is crucial for human communication, including social bonding and signaling.
Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with…
In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in conversational Arabic, focusing on dialectal discourse.
Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of…
Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do.
The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into…
LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made.
Using Chatbot Arena data, we show naive 95% CI coverage drops as n grows while TEE-corrected coverage holds at 95%, and TEE-guided pipelines restrict the benchmark gaming surface from 56 to 32 Elo (K=27), below the human-leaderboard…
Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality…
Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve.
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation.
To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to…
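The excerpt above is truncated, so the exact definition of BPE is not visible here. As a rough illustration only, the sketch below assumes the two position-swapped judge probabilities are aggregated by a simple average and that the uncertainty signal is the binary entropy of that average. The function name and both parameter names are hypothetical and are not taken from the paper.

```python
import math


def bidirectional_preference_entropy(p_a_first: float, p_a_second: float) -> float:
    """Binary entropy (in bits) of a position-averaged preference.

    p_a_first:  judge's probability that response A wins when A is shown first.
    p_a_second: judge's probability that response A wins when A is shown second.
    Averaging the two is an ASSUMPTION made for this sketch.
    """
    # Aggregate over both presentation orders so a pure position bias cancels out.
    p = 0.5 * (p_a_first + p_a_second)
    # Clamp away from 0 and 1 to keep the logarithms finite.
    p = min(max(p, 1e-12), 1.0 - 1e-12)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# A judge that always favors the first slot looks maximally uncertain once averaged.
position_biased = bidirectional_preference_entropy(0.9, 0.1)
# A judge that favors A in both orders looks confident.
consistent = bidirectional_preference_entropy(0.9, 0.9)
```

Under these assumptions, a judge whose verdict flips with the response order receives high entropy, while a judge that prefers the same response in both orders receives low entropy.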
Large Language Model (LLM)-based Collective Intelligence (CI) presents a promising approach to overcoming the data wall and continuously boosting the capabilities of LLM agents.
To address this gap, we introduce OpenHospital, an interactive arena where physician agents can evolve CI through interactions with patient agents.
However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable…
To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation.
This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
Results show that GIFT converges faster, generalizes better with reduced overfitting, and outperforms GRPO on mathematical reasoning benchmarks (GSM8K, MATH, AIME) as well as on generation-task evaluations (AlpacaEval and Arena-Hard).
Experimental evaluations show that our LongWriter-Zero model, trained from Qwen2.5-32B, consistently outperforms traditional SFT methods on long-form writing tasks, achieving state-of-the-art results across all metrics on WritingBench and…
Reasoning LLMs-as-Judges, which can benefit from inference-time scaling, provide a promising path for extending the success of reasoning models to non-verifiable domains where the output correctness/quality cannot be directly checked.
However, while reasoning judges have shown better performance on static evaluation benchmarks, their effectiveness in actual policy training has not been systematically examined.
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows.
To support these interactive evaluations, we introduce a retrieval-augmented code generation (RACG) module that efficiently handles long-context challenges by providing syntactically and semantically relevant code snippets from large…
Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
Additionally, we provide 30k high-quality preference optimization examples to further enhance alignment.
Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs).
It typically involves a language model that generates on-policy responses for prompts and a reward model (RM) that guides the selection of chosen and rejected responses, on which the model can be further trained with direct preference optimization (DPO).
To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training.
Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%.
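For readers triaging this entry, the standard per-pair DPO objective that such self-play pipelines train with can be sketched as follows. This is the generic DPO loss, not MetaAPO's meta-weighted variant, whose details are not given in the excerpt. The argument names and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default, not a value from the source
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained (logp_*) and under the frozen reference model (ref_logp_*).
    """
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss depends only on the difference between policy and reference log-ratios, which is why the reward model is needed only to pick the chosen and rejected responses, not at training time.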
There are currently two main paradigms for evaluating large language models (LLMs): reference-based evaluation and preference-based evaluation.
The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available.
In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs.
The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes.
In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning.
We find that, across multiple benchmark datasets, coupled autoregressive generation requires up to 75% fewer samples to reach the same conclusions as vanilla autoregressive generation.
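The excerpt does not say how the generations are coupled. One common way to share randomness between two samplers is the Gumbel-max trick with a single noise vector per decoding step, sketched below for one token. Treating this as the paper's construction is an assumption, as are the function names and the premise that both models use the same vocabulary.

```python
import math
import random
from typing import List, Tuple


def gumbel_noise(rng: random.Random, size: int) -> List[float]:
    """Draw i.i.d. standard Gumbel noise, one value per vocabulary entry."""
    # Clamp the uniform draw away from 0 so both logarithms stay finite.
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(size)]


def sample_token(logits: List[float], noise: List[float]) -> int:
    """Gumbel-max sampling: argmax of logits plus noise is a softmax sample."""
    return max(range(len(logits)), key=lambda i: logits[i] + noise[i])


def coupled_samples(
    logits_a: List[float], logits_b: List[float], seed: int
) -> Tuple[int, int]:
    """Sample one token from each model using the SAME noise vector."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must share a vocabulary in this sketch")
    noise = gumbel_noise(random.Random(seed), len(logits_a))
    return sample_token(logits_a, noise), sample_token(logits_b, noise)


# Two models with similar next-token distributions usually pick the same
# token under shared noise, so differences reflect the models, not the dice.
token_a, token_b = coupled_samples([2.0, 1.0, 0.1], [1.9, 1.1, 0.1], seed=0)
```

Because both samplers see the same noise, models with identical logits always emit the same token, and models with close logits disagree only occasionally. That shared variance is the intuition for why a coupled comparison can need fewer samples than independent sampling.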