HFEPX Hub

Law Papers (Last 120 Days)

Updated from current HFEPX corpus (Mar 1, 2026). 10 papers are grouped in this hub page.

Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Adjudication. Frequently cited benchmark: CoW-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 15, 2026.

Papers: 10 · Last published: Feb 15, 2026

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing.

High-Signal Coverage

100.0%

10 / 10 sampled papers are not flagged as low-signal.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 paper is replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
1 paper reports explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.
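
The triage counts above can be reproduced from the Protocol Matrix rows further down this page. The following is a minimal sketch, assuming each paper is encoded as a small record. The `Paper` class, its field names, and the snake_case mode labels are hypothetical and are not part of any HFEPX API; an empty tuple stands in for "Not Reported".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    """Hypothetical record for one Protocol Matrix row (empty tuple = Not Reported)."""
    title: str
    eval_modes: tuple = ()
    benchmarks: tuple = ()
    metrics: tuple = ()
    quality_controls: tuple = ()


# Values transcribed from the Protocol Matrix on this page (titles shortened).
PAPERS = [
    Paper("HLE-Verified", ("automatic_metrics",), ("HLE",), ("accuracy",), ("adjudication",)),
    Paper("APEX-Agents", ("automatic_metrics",), (), ("pass@1",)),
    Paper("Trinity of Consistency", ("simulation_env",), ("CoW-Bench",)),
    Paper("Subjectivity of Respect"),
    Paper("Helpful to a Fault"),
    Paper("Multimodal Multi-Agent LJP", ("simulation_env",), ("Lawbench",)),
    Paper("Vichara", ("human_eval", "automatic_metrics"), (), ("F1",)),
    Paper("Orthogonalized Policy Optimization", ("automatic_metrics",), ("MATH",)),
    Paper("Whisper: Courtside Edition", ("automatic_metrics",), (), ("error_rate", "WER")),
    Paper("Conflict-Aware Fusion", ("automatic_metrics",), (), ("accuracy",)),
]


def is_replication_ready(paper):
    """Benchmark, metric, and evaluation mode must all be explicitly present."""
    return bool(paper.benchmarks and paper.metrics and paper.eval_modes)


def supports_judge_vs_human(paper):
    """Both human_eval and llm_as_judge must appear among the evaluation modes."""
    return {"human_eval", "llm_as_judge"} <= set(paper.eval_modes)


replication_ready = [p.title for p in PAPERS if is_replication_ready(p)]
judge_vs_human = [p.title for p in PAPERS if supports_judge_vs_human(p)]
with_quality_controls = [p.title for p in PAPERS if p.quality_controls]

print("replication-ready:", replication_ready)
print("judge-vs-human:", judge_vs_human)
print("quality controls:", with_quality_controls)
```

Under these assumptions the filter selects a single paper as replication-ready and none for judge-vs-human analysis, which is why the page recommends scouting only.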

Why This Matters For Eval Research

40% of papers report explicit human-feedback signals, led by expert verification.
Automatic metrics appear in 60% of papers in this hub.
CoW-Bench is a candidate benchmark anchor for cross-paper comparisons on this page, although it appears in only 1 of 10 papers.

Protocol Notes

Protocol Takeaways

Most common quality-control signal is adjudication (10% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

CoW-Bench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
HLE appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 20% of hub papers (2/10); compare with a secondary metric before ranking methods.
Error rate is reported in 10% of hub papers (1/10); compare with a secondary metric before ranking methods.

Researcher Checklist

Moderate: Papers with explicit human feedback
Coverage is usable but incomplete (40% vs 45% target).

Gap: Papers reporting quality controls
Coverage is a replication risk (10% vs 30% target).

Strong: Papers naming benchmarks/datasets
Coverage is strong (40% vs 35% target).

Strong: Papers naming evaluation metrics
Coverage is strong (50% vs 35% target).

Moderate: Papers with known rater population
Coverage is usable but incomplete (30% vs 35% target).

Moderate: Papers with known annotation unit
Coverage is usable but incomplete (30% vs 35% target).
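
The checklist labels compare observed coverage against a target, but the page does not state the banding rule. The sketch below assumes one plausible rule: "Strong" at or above target, "Moderate" at or above half the target, and "Gap" otherwise. The function name `coverage_band` and the 0.5 cutoff are assumptions chosen only because they happen to reproduce the six labels above.

```python
def coverage_band(coverage_pct, target_pct, moderate_ratio=0.5):
    """Assumed banding rule (not documented on the page).

    Strong:   coverage >= target
    Moderate: coverage >= moderate_ratio * target
    Gap:      anything lower
    """
    if coverage_pct >= target_pct:
        return "Strong"
    if coverage_pct >= moderate_ratio * target_pct:
        return "Moderate"
    return "Gap"


# (checklist item, observed coverage %, target %) as listed in the checklist.
CHECKLIST = [
    ("explicit human feedback", 40, 45),
    ("quality controls", 10, 30),
    ("benchmarks/datasets", 40, 35),
    ("evaluation metrics", 50, 35),
    ("known rater population", 30, 35),
    ("known annotation unit", 30, 35),
]

bands = {name: coverage_band(cov, target) for name, cov, target in CHECKLIST}

for name, band in bands.items():
    print(f"{band}: {name}")
```

Any cutoff between one third and roughly 0.85 of the target would give the same labels for this hub, so the exact threshold cannot be recovered from this page alone.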

Strengths

About half of the papers provide measurable evaluation context (40% name benchmarks, 50% name metrics).
Agentic evaluation appears in 60% of papers.

Known Gaps

Only 10% of papers report quality controls; prioritize calibration/adjudication evidence.

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (CoW-Bench vs HLE) before comparing methods.
Track metric sensitivity by reporting both accuracy and error rate.
Add inter-annotator agreement checks when reproducing these protocols.
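
For the inter-annotator agreement check suggested above, one common statistic for two raters is Cohen's kappa. The sketch below is a plain-Python illustration; the rater labels are invented for the example and do not come from any paper in this hub.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treated here as
        # perfect agreement (conventions differ for this degenerate case).
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two raters on four items.
rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "fail"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5. Protocols with more than two raters or ordinal rubric scores would need a different statistic.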

Recommended Queries

Human Eval Protocols
Benchmark Slice: CoW-Bench
Metric Slice: accuracy
Recent High-Signal Papers

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Feb 15, 2026 · Citations: 0 · Score: 9.5
HF: Expert Verification, Critique Edit · Eval: Automatic Metrics · Benchmark: HLE · Metric: Accuracy

APEX-Agents
Jan 20, 2026 · Citations: 0 · Score: 5.5
HF: Rubric Rating, Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Pass@1

The Trinity of Consistency as a Defining Principle for General World Models
Feb 26, 2026 · Citations: 0 · Score: 4.5
HF: Not Reported · Eval: Simulation Env · Benchmark: CoW-Bench · Metric: Not Reported

The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Feb 10, 2026 · Citations: 0 · Score: 4.5
HF: Pairwise Preference, Rubric Rating · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Feb 18, 2026 · Citations: 0 · Score: 4.5
HF: Red Team · Eval: Not Reported · Benchmark: Not Reported · Metric: Not Reported

Multimodal Multi-Agent Empowered Legal Judgment Prediction
Jan 19, 2026 · Citations: 0 · Score: 4.0
HF: Not Reported · Eval: Simulation Env · Benchmark: Lawbench · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	Date	HF Signal	Eval Modes	Benchmarks	Metrics	QC
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam	Feb 15, 2026	Yes: Expert Verification, Critique Edit	Automatic Metrics	HLE	Accuracy	Adjudication
APEX-Agents	Jan 20, 2026	Yes: Rubric Rating, Expert Verification	Automatic Metrics	Not Reported	Pass@1	Not Reported
The Trinity of Consistency as a Defining Principle for General World Models	Feb 26, 2026	No	Simulation Env	CoW-Bench	Not Reported	Not Reported
The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage	Feb 10, 2026	Yes: Pairwise Preference, Rubric Rating	Not Reported	Not Reported	Not Reported	Not Reported
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents	Feb 18, 2026	Yes: Red Team	Not Reported	Not Reported	Not Reported	Not Reported
Multimodal Multi-Agent Empowered Legal Judgment Prediction	Jan 19, 2026	No	Simulation Env	Lawbench	Not Reported	Not Reported
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System	Feb 20, 2026	No	Human Eval, Automatic Metrics	Not Reported	F1	Not Reported
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space	Jan 18, 2026	No	Automatic Metrics	MATH	Not Reported	Not Reported
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation	Feb 21, 2026	No	Automatic Metrics	Not Reported	Error rate, WER	Not Reported
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors	Dec 6, 2025	No	Automatic Metrics	Not Reported	Accuracy	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	HLE-Verified: A Systematic Verification and Structu…	APEX-Agents	The Trinity of Consistency as a Defining Principle…
Human Feedback	Expert Verification, Critique Edit	Rubric Rating, Expert Verification	Not reported
Evaluation Modes	Automatic Metrics	Automatic Metrics	Simulation Env
Benchmarks	HLE	Not reported	CoW-Bench
Metrics	Accuracy	Pass@1	Not reported
Quality Controls	Adjudication	Not reported	Not reported
Rater Population	Domain Experts	Domain Experts	Unknown
Annotation Unit	Unknown	Multi Dim Rubric	Trajectory

Research Utility Snapshot

Human Feedback Mix

Expert Verification (2)
Rubric Rating (2)
Critique Edit (1)
Pairwise Preference (1)

Evaluation Modes

Automatic Metrics (6)
Simulation Env (2)
Human Eval (1)

Top Benchmarks

CoW-Bench (1)
HLE (1)
Lawbench (1)
MATH (1)

Top Metrics

Accuracy (2)
Error rate (1)
F1 (1)
Jailbreak success rate (1)

Rater Population Mix

Domain Experts (3)

Quality Controls

Adjudication (1)

Coverage diagnostics (sample-based): human-feedback 40.0% · benchmarks 40.0% · metrics 50.0% · quality controls 10.0%.

Top Papers

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Expert Verification · Critique Edit · Automatic Metrics
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.

APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Rubric Rating · Expert Verification · Automatic Metrics · Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…

The Trinity of Consistency as a Defining Principle for General World Models
Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang · Feb 26, 2026 · Citations: 0
Simulation Env · Long Horizon
To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios.

Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0
Simulation Env · Multi Agent
Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.

Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0
Human Eval · Automatic Metrics
Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.

The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh · Feb 10, 2026 · Citations: 0
Pairwise Preference · Rubric Rating
By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities.

Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Automatic Metrics · Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.

Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0
Automatic Metrics · Multi Agent
We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.

Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0
Automatic Metrics · Long Horizon
We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0
Red Team
LLM-based agents execute real-world workflows via tools and memory.
