Research & Insights
September 8, 2025
8 min read

What Drives Agreement In LLM Evaluations

Findings on rubric clarity, domain expertise, and training prompts that raise agreement.

When two evaluators look at the same model output and reach different conclusions, at least one of them is wrong, or your rubric failed to define what "right" means. Inter-annotator agreement (IAA) is the fundamental measure of whether your evaluation program is producing reliable signal or expensive noise. This article examines the factors that drive agreement in LLM evaluations, drawing on practical experience across domains and task types.

Why Agreement Matters

Low agreement is not just an academic inconvenience. It has direct consequences for every downstream use of your evaluation data.

For reward model training: If your preference labels are inconsistent, your reward model learns a blurred objective. It cannot distinguish between genuine quality differences and annotator noise. The result is a reward model that either collapses to trivial preferences (preferring longer responses, for example) or produces unpredictable rankings.

For benchmarking and regression testing: If your evaluators disagree 40% of the time, small model improvements are invisible in the data. You cannot detect a 5% quality improvement when your measurement noise is 20%. High agreement gives you the statistical power to make confident claims about model performance.

For safety audits: Inconsistent safety labels mean some harmful outputs slip through while benign outputs are incorrectly flagged. Both failure modes are costly: the first creates real-world risk, the second wastes engineering time on false positives.

For cost efficiency: Low agreement forces you to use higher redundancy (more annotators per item) to get reliable labels through majority voting. If agreement is high, you can use lower redundancy and allocate budget to covering more evaluation examples instead.

Measuring Agreement: Choosing the Right Metric

The choice of agreement metric depends on your task format and scale. Using the wrong metric can make your agreement look better or worse than it actually is.

Raw Agreement Rate

The simplest metric: what percentage of items do annotators agree on? It is intuitive but misleading. If your task has two options and both are equally likely, random annotators would agree 50% of the time. An 80% raw agreement rate sounds good but may represent only modest improvement over chance.
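To make the chance baseline concrete, here is a minimal sketch of raw agreement; the labels are invented for illustration:

```python
def raw_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators gave the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Two hypothetical annotators on five items: 4 of 5 labels match.
a = ["good", "bad", "good", "good", "bad"]
b = ["good", "bad", "bad", "good", "bad"]
print(raw_agreement(a, b))  # 0.8
```

Note that this 0.8 carries no information about how much of it is chance; that is what the chance-corrected metrics below add.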

Cohen's Kappa

For pairwise comparison between two annotators, Cohen's kappa adjusts raw agreement for chance. It ranges from -1 (systematic disagreement) to 1 (perfect agreement), with 0 representing chance-level agreement. Interpretation guidelines vary by field, but for LLM evaluation work, kappa above 0.6 is typically acceptable for subjective tasks and above 0.75 for more objective ones. Values below 0.4 almost always indicate a rubric or calibration problem.
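A minimal implementation of the chance correction, assuming two annotators labeled the same items in the same order; the binary labels below are illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    # Observed agreement.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical binary preference labels: raw agreement is 0.875,
# chance agreement is 0.5, so kappa is 0.75.
a = [1, 1, 0, 1, 0, 0, 1, 1]
b = [1, 1, 0, 0, 0, 0, 1, 1]
print(cohens_kappa(a, b))  # 0.75
```

The same raw agreement of 0.875 would yield a lower kappa if the label distribution were more skewed, which is exactly the correction the metric exists to make.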

Fleiss' Kappa

When you have more than two annotators rating each item, Fleiss' kappa extends Cohen's kappa to the multi-rater case. It is particularly useful for identifying whether disagreement is distributed evenly across annotators (suggesting a rubric problem) or concentrated in specific individuals (suggesting a training problem).
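A sketch of the multi-rater case, assuming every item was rated by the same number of raters (the standard Fleiss formulation); the input format is a per-item count of ratings in each category:

```python
def fleiss_kappa(counts):
    """Multi-rater chance-corrected agreement.
    counts[i][k] = number of raters who put item i in category k.
    Assumes every item was rated by the same number of raters."""
    N = len(counts)        # items
    n = sum(counts[0])     # raters per item
    K = len(counts[0])     # categories
    # Mean per-item agreement: fraction of rater pairs that agree on each item.
    P_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    # Chance agreement from overall category proportions.
    p = [sum(row[k] for row in counts) / (N * n) for k in range(K)]
    P_e = sum(x * x for x in p)
    return (P_bar - P_e) / (1 - P_e)

# Four items, three raters, two categories; counts are illustrative.
print(fleiss_kappa([[3, 0], [2, 1], [1, 2], [0, 3]]))
```

To localize disagreement to individuals rather than the rubric, complement this with each annotator's pairwise kappa against the rest of the panel.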

Krippendorff's Alpha

For ordinal scales (1-5 quality ratings), Krippendorff's alpha is often the best choice. It accounts for the magnitude of disagreement, treating a 1-versus-5 disagreement as more severe than a 3-versus-4 disagreement. It also handles missing data and varying numbers of annotators per item, making it practical for production annotation settings where overlap is incomplete.
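Krippendorff's alpha supports several difference functions; the sketch below uses the interval function delta(c, k) = (c - k)^2, a common simplification for numeric rating scales (a true ordinal metric uses cumulative rank frequencies). It handles items with varying numbers of ratings, as the prose describes:

```python
from collections import Counter

def krippendorff_alpha_interval(ratings_by_item):
    """Krippendorff's alpha with the interval difference function.
    ratings_by_item[i] is the list of numeric ratings item i received;
    items with fewer than two ratings carry no agreement information."""
    # Coincidence matrix: o[(c, k)] = weighted count of c-k pairings within items.
    o = Counter()
    for ratings in ratings_by_item:
        m = len(ratings)
        if m < 2:
            continue
        for c in ratings:
            for k in ratings:
                o[(c, k)] += 1 / (m - 1)
        for c in ratings:
            o[(c, c)] -= 1 / (m - 1)  # remove self-pairings
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    delta = lambda c, k: (c - k) ** 2
    d_obs = sum(w * delta(c, k) for (c, k), w in o.items())
    d_exp = sum(n_c[c] * n_c[k] * delta(c, k)
                for c in n_c for k in n_c) / (n - 1)
    return 1 - d_obs / d_exp
```

With this difference function, a 1-versus-5 disagreement (delta = 16) is penalized sixteen times as heavily as a 3-versus-4 disagreement (delta = 1), which is precisely the magnitude sensitivity described above.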

Conditional Agreement by Difficulty

Beyond global metrics, compute agreement stratified by task difficulty. You should expect high agreement on easy items (clear quality differences) and lower agreement on hard items (close calls). If agreement is low even on easy items, something is fundamentally broken. If agreement is high even on hard items, your difficulty calibration may be off, or annotators are converging on superficial heuristics rather than applying the rubric thoughtfully.
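A stratified computation is a small extension of raw agreement; the stratum labels ("easy", "hard") and input format here are assumptions for illustration:

```python
from collections import defaultdict

def agreement_by_stratum(rated_items):
    """rated_items: iterable of (stratum, label_a, label_b) triples,
    e.g. stratum in {"easy", "hard"}. Returns raw agreement per stratum."""
    totals, matches = defaultdict(int), defaultdict(int)
    for stratum, a, b in rated_items:
        totals[stratum] += 1
        matches[stratum] += int(a == b)
    return {s: matches[s] / totals[s] for s in totals}
```

The same stratification works with any of the chance-corrected metrics above; raw agreement is used here only to keep the sketch short.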

Rubric Design: The Foundation of Agreement

The rubric is the contract between you and your evaluators. Every point of ambiguity in the rubric is a potential source of disagreement. Investing time in rubric design pays dividends throughout the entire evaluation program.

Decompose Quality Into Independent Dimensions

A single "overall quality" rating invites each evaluator to weight different factors differently. One evaluator might prioritize factual accuracy; another might weight conciseness. By breaking quality into explicit dimensions (accuracy, completeness, relevance, clarity, safety), you force evaluators to assess each factor independently and reduce the space for implicit disagreement.

Anchor Every Scale Point

For each dimension, define what each rating level means with concrete examples. Do not write abstract descriptions like "mostly accurate with minor errors." Instead, show an actual model output that represents a 3 on your accuracy scale, explain what the minor errors are, and show how a 4 would differ. Provide at least two examples per anchor point so evaluators can generalize the pattern rather than memorizing a single case.

Define the Hard Cases Explicitly

Most disagreement clusters around a small number of recurring edge cases. Anticipate these and document them in the rubric:

  • What if the response is correct but uses an outdated source?
  • What if the response answers a slightly different question than what was asked, but the answer is still useful?
  • What if the response includes a minor factual error embedded in an otherwise excellent explanation?
  • What if both responses are equally good (or equally bad)?

For each scenario, specify the expected judgment and the reasoning behind it. A rubric that addresses 20 common edge cases will produce higher agreement than one that perfectly defines the easy cases but is silent on the hard ones.

Keep the Rubric Living

Your first rubric draft will be wrong. Plan for at least two revision cycles during the pilot phase. After each calibration session, note the disagreements that the rubric failed to resolve and add clarifications. A good rubric is a document that grows organically from real disagreements.

Calibration Sessions: Building Shared Mental Models

A rubric on paper is not the same as a rubric in practice. Calibration sessions bridge the gap by building shared mental models across your evaluator team.

Structure of an Effective Calibration Session

  1. Independent annotation. Give evaluators a set of 20-30 items to rate independently before the session. Choose items that span the difficulty distribution, including several known edge cases.
  2. Reveal and discuss. Show the group how everyone rated each item. Focus discussion on disagreements, not agreements. For each disagreement, ask evaluators to explain their reasoning by referencing specific rubric criteria.
  3. Establish consensus. After discussion, establish the "correct" rating for each item and document the reasoning. These become part of your gold set.
  4. Identify rubric gaps. If a disagreement cannot be resolved by referencing the rubric, the rubric needs an update. Note the gap and revise before the next session.

Frequency and Duration

During the pilot phase, run calibration sessions daily or every other day. Once evaluators are calibrated, shift to weekly 30-minute sessions focused on the previous week's hardest cases. For long-running programs, monthly recalibration sessions prevent gradual drift.

Calibration Across Time Zones

For distributed teams, synchronous calibration sessions may not be feasible. Asynchronous alternatives include recorded video walkthroughs of disagreements, shared annotation of a calibration set with a discussion thread for each item, and periodic one-on-one reviews between team leads and individual evaluators.

The Role of Domain Expertise

Domain expertise affects agreement in two distinct ways. First, experts agree more with each other on factual judgments because they share a common knowledge base. A panel of physicians evaluating medical advice will agree more than a panel of generalists because they can verify claims against shared training. Second, experts disagree more productively: their disagreements tend to be about genuine ambiguities in the domain rather than about misunderstanding the rubric.

However, domain expertise is not a substitute for calibration. Experts who have not been calibrated on your specific rubric will apply their own implicit standards, which may differ from your evaluation goals. A physician might rate a response as "correct" because it is clinically accurate, even though your rubric penalizes responses that use jargon a patient would not understand. Expertise sets the floor for quality; calibration ensures alignment with your specific objectives.

Common Sources of Disagreement

Understanding where disagreement comes from helps you target your interventions.

Rubric Ambiguity

The most common source. When the rubric does not cover a case, evaluators improvise. Different evaluators improvise differently. Solution: audit disagreements weekly and update the rubric.

Dimension Weighting

Even with multi-dimensional rubrics, evaluators may implicitly weight dimensions differently when forced to make an overall judgment. One evaluator treats a minor factual error as disqualifying; another treats it as a small deduction. Solution: either avoid overall ratings entirely (use dimension-specific ratings only) or provide explicit weighting formulas.

Threshold Calibration

Evaluators may agree on the relative ordering of responses but disagree on where to draw thresholds. Everyone agrees Response A is better than Response B, but some rate A as 5 and B as 3, while others rate A as 4 and B as 2. Solution: more anchor examples at boundary points, and consider using pairwise comparison instead of absolute scales.

Fatigue and Session Effects

Agreement degrades within sessions as evaluators tire. It also varies across sessions: Monday morning ratings may differ systematically from Friday afternoon ratings. Solution: monitor agreement by position-within-session and time-of-day. Limit session length to 60-90 minutes for cognitively demanding tasks.

Cultural and Linguistic Differences

For multilingual evaluation, cultural norms affect judgments about politeness, directness, humor, and formality. What reads as appropriately concise in German may read as rudely terse in Japanese. Solution: calibrate within language groups and define language-specific rubric addenda for culturally sensitive dimensions.

Practical Tips for Improving Agreement

These interventions are ordered roughly by impact-to-effort ratio, from highest leverage to lowest.

  1. Rewrite your rubric based on actual disagreements. Do not guess what will be confusing. Run a pilot, collect disagreements, and address every single one in the rubric.
  2. Add more anchor examples. For every rubric revision, add at least two new examples. A rubric with 50 worked examples produces dramatically higher agreement than one with 10.
  3. Run calibration sessions with disagreement review. Discussing why evaluators disagree, with specific rubric references, is more effective than simply telling them the right answer.
  4. Embed gold tasks in production. Continuous monitoring catches drift early. A 5% gold rate is sufficient for ongoing programs; use 10% during onboarding.
  5. Shorten sessions. If agreement drops after 45 minutes, end sessions at 45 minutes. More shorter sessions beat fewer longer ones.
  6. Match evaluators to tasks by expertise. Do not ask generalists to evaluate domain-specific content. The resulting disagreements are not informative; they are just noise.
  7. Use pairwise comparison for subjective tasks. Binary choices ("which is better?") produce higher agreement than multi-point scales for inherently subjective dimensions like helpfulness or naturalness.
  8. Track and remove outlier evaluators. If one evaluator's agreement with the group is consistently two standard deviations below the mean, they need recalibration or replacement.
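The outlier screen in point 8 can be automated with a simple z-score check; the input (each evaluator's mean agreement with the group) and the threshold are illustrative assumptions:

```python
from statistics import mean, stdev

def flag_low_agreement_evaluators(agreement_scores, z_threshold=2.0):
    """agreement_scores: dict mapping evaluator id -> mean agreement with
    the rest of the group. Flags evaluators sitting z_threshold standard
    deviations below the group mean as candidates for recalibration."""
    values = list(agreement_scores.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # everyone agrees equally; nothing to flag
    return [e for e, s in agreement_scores.items()
            if (mu - s) / sigma >= z_threshold]
```

In practice, run this on a rolling window (for example, the last two weeks of overlapping items) so that a single bad session does not permanently flag an otherwise reliable evaluator.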

Leveraging Pre-Vetted Evaluators

Building a team of well-calibrated evaluators from scratch takes weeks or months. Platforms like OpenTrain can accelerate this by providing access to evaluators who have already passed screening for domain expertise and annotation aptitude, reducing the time from project kickoff to reliable data collection. But even pre-vetted evaluators need project-specific calibration. No amount of general training replaces a rubric walkthrough with your specific examples and edge cases.

Agreement Is a Means, Not an End

High inter-annotator agreement is necessary for trustworthy evaluation data, but it is not sufficient. Perfect agreement on a flawed rubric produces consistently wrong labels. Agreement should be pursued alongside rubric validity: periodically check that your rubric's criteria actually predict the outcomes you care about, whether that is user satisfaction, safety, or downstream model performance.

The teams that achieve consistently high agreement share a common pattern: they treat the rubric as a living document, calibration as an ongoing discipline, and disagreements as valuable diagnostic information rather than problems to be suppressed. When your evaluators disagree, they are telling you something about your task definition. Listen.
