
HFEPX Hub

General + LLM-as-Judge Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 8 papers. Common evaluation modes: LLM-as-judge, automatic metrics. Most common rater population: domain experts. Common annotation unit: multi-dimensional rubric. Frequent quality control: inter-annotator agreement reported. Frequently cited benchmark: CapArena. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new evaluation experiments. The newest paper in this set is from Feb 24, 2026.

Papers: 8 | Last published: Feb 24, 2026

Research Narrative

Grounded narrative (model: deterministic-grounded; source: persisted)

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 8 papers under General + LLM-as-Judge. Dominant protocol signals include LLM-as-judge, automatic metrics, and human evaluation, with frequent benchmark focus on CapArena and VisualWebArena and metric focus on agreement and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Protocol Takeaways

Benchmark Interpretation

  • CapArena appears in 12.5% of hub papers (1/8); use this cohort for benchmark-matched comparisons.
  • VisualWebArena appears in 12.5% of hub papers (1/8); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • agreement is reported in 25% of hub papers (2/8); compare with a secondary metric before ranking methods.
  • accuracy is reported in 12.5% of hub papers (1/8); compare with a secondary metric before ranking methods.

Researcher Checklist

  • Tighten coverage of papers with explicit human feedback: coverage is usable but incomplete (37.5% vs 45% target).
  • Close the gap on papers reporting quality controls: coverage is a replication risk (12.5% vs 30% target).
  • Tighten coverage of papers naming benchmarks/datasets: coverage is usable but incomplete (25% vs 35% target).
  • Maintain strength in papers naming evaluation metrics: coverage is strong (75% vs 35% target).
  • Maintain strength in papers with a known rater population: coverage is strong (37.5% vs 35% target).
  • Tighten coverage of papers with a known annotation unit: coverage is usable but incomplete (25% vs 35% target).
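The checklist's three status labels can be reproduced from the reported percentages with a simple banding rule. This is a sketch under an assumption: the "replication risk" cutoff at two-thirds of the target is inferred from the reported bands, not stated by the hub.

```python
# Sketch of the coverage-status rule behind the checklist labels.
# ASSUMPTION: "replication risk" applies below 2/3 of the target
# percentage; this threshold is inferred, not documented.
def coverage_status(covered: int, total: int, target_pct: float) -> str:
    pct = 100.0 * covered / total
    if pct >= target_pct:
        return "strong"
    if pct >= (2.0 / 3.0) * target_pct:
        return "usable but incomplete"
    return "replication risk"

# Checklist rows as (covered papers, total papers, target %).
rows = {
    "explicit human feedback": (3, 8, 45.0),
    "quality controls": (1, 8, 30.0),
    "benchmarks/datasets named": (2, 8, 35.0),
    "evaluation metrics named": (6, 8, 35.0),
    "known rater population": (3, 8, 35.0),
    "known annotation unit": (2, 8, 35.0),
}
for name, (covered, total, target) in rows.items():
    print(f"{name}: {coverage_status(covered, total, target)}")
```

With these inputs the rule reproduces every label in the checklist above (e.g. 75% vs a 35% target prints "strong", 12.5% vs 30% prints "replication risk").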


Suggested Reading Order

  1. Overton Pluralistic Reinforcement Learning for Large Language Models

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning

    Continue here for similarly detailed protocol reporting with rater and quality-control evidence.

  3. World-Model-Augmented Web Agents with Action Correction

    A third paper with detailed rater and quality-control reporting.

  4. HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Include a human-eval paper to anchor calibration against automated judge settings.

  5. PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

    A second human-eval paper for the same calibration anchor.

  6. EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis

    Adds LLM-as-judge with expert verification for broader coverage within this hub.

  7. Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

    Adds LLM-as-judge for broader coverage within this hub.

  8. Human-like Affective Cognition in Foundation Models

    Adds LLM-as-judge for broader coverage within this hub.

Known Limitations

  • Only 12.5% of papers report quality controls; prioritize calibration/adjudication evidence.
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
  • Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.

Research Utility Links

human_eval vs llm_as_judge

2 papers use both Human Eval and LLM-as-Judge (both = 2, left only = 0, right only = 6).

human_eval vs automatic_metrics

0 papers use both Human Eval and Automatic Metrics (both = 0, left only = 2, right only = 2).

llm_as_judge vs automatic_metrics

2 papers use both LLM-as-Judge and Automatic Metrics (both = 2, left only = 6, right only = 0).
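The both/left-only/right-only counts above are plain set overlaps over per-paper protocol tags. A minimal sketch, assuming hypothetical paper IDs chosen to be consistent with the reported counts (the real tag assignments are not listed on this page):

```python
# Derive both / left_only / right_only counts from two tag sets.
def overlap(left: set, right: set) -> dict:
    return {
        "both": len(left & right),        # papers tagged with both
        "left_only": len(left - right),   # only the left tag
        "right_only": len(right - left),  # only the right tag
    }

# ASSUMED tag sets (hypothetical IDs) matching the reported counts:
# llm_as_judge covers all 8 papers; human_eval and automatic_metrics
# each cover 2 papers, with no overlap between those two.
llm_as_judge = {f"p{i}" for i in range(1, 9)}
human_eval = {"p4", "p5"}
automatic_metrics = {"p6", "p7"}

print(overlap(human_eval, llm_as_judge))        # both=2, left_only=0, right_only=6
print(overlap(human_eval, automatic_metrics))   # both=0, left_only=2, right_only=2
print(overlap(llm_as_judge, automatic_metrics)) # both=2, left_only=6, right_only=0
```

Note the internal consistency check this enables: left + both for one comparison must equal that tag's total coverage in the other comparisons.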

Benchmark Brief

CapArena

Coverage: 1 paper (12.5%). Example: PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions

VisualWebArena

Coverage: 1 paper (12.5%). Example: World-Model-Augmented Web Agents with Action Correction

Metric Brief

agreement

Coverage: 2 papers (25%). Examples: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue; Human-like Affective Cognition in Foundation Models

accuracy

Coverage: 1 paper (12.5%). Example: Overton Pluralistic Reinforcement Learning for Large Language Models

coherence

Coverage: 1 paper (12.5%). Example: Stop-Think-AutoRegress: Language Modeling with Latent Diffusion Planning
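The "agreement" signal tracked above is usually reported either as raw percent agreement or as a chance-corrected statistic such as Cohen's kappa. A minimal two-rater sketch with illustrative labels (not data from any hub paper):

```python
# Cohen's kappa for two raters labeling the same items:
# kappa = (p_observed - p_expected) / (1 - p_expected).
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent label marginals.
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative ratings only.
rater_a = ["good", "good", "bad", "good", "bad", "good"]
rater_b = ["good", "bad", "bad", "good", "bad", "good"]
print(round(cohens_kappa(rater_a, rater_b), 3))  # → 0.667
```

Raw percent agreement here is 5/6 (about 0.833), but correcting for chance agreement drops the statistic to 0.667, which is why chance-corrected figures are preferred when comparing judge-vs-human agreement across papers.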
