HFEPX Hub

CS.AI + Law Papers

Updated from current HFEPX corpus (Mar 10, 2026). 11 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Mar 10, 2026). 11 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequently cited benchmark: Cow-Bench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 9, 2026.

Papers: 11 Last published: Mar 9, 2026 Global RSS Tag RSS

Cs.AILaw

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (11) Replication-Ready Only (1)

High-Signal Coverage

100.0%

11 / 11 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

1 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
0 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

54.5% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 63.6% of papers in this hub.
Cow-Bench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Quality-control reporting is sparse in this slice; prioritize papers with explicit calibration or adjudication steps.
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

Cow-Bench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.
Lawbench appears in 9.1% of hub papers (1/11); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 18.2% of hub papers (2/11); compare with a secondary metric before ranking methods.
coherence is reported in 9.1% of hub papers (1/11); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Strong: Papers with explicit human feedback

Coverage is strong (54.5% vs 45% target).
Gap: Papers reporting quality controls

Coverage is a replication risk (0% vs 30% target).
Strong: Papers naming benchmarks/datasets

Coverage is strong (36.4% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (45.5% vs 35% target).
Moderate: Papers with known rater population

Coverage is usable but incomplete (27.3% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (54.5% vs 35% target).

Strengths

Strong human-feedback signal (54.5% of papers).
Most papers provide measurable evaluation context (36.4% benchmarks, 45.5% metrics).
Agentic evaluation appears in 81.8% of papers.

Known Gaps

Only 0% of papers report quality controls; prioritize calibration/adjudication evidence.

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (Cow-Bench vs Lawbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and coherence.

Recommended Queries (Expanded)

Recommended Queries

Human Eval Protocols Benchmark Slice: Cow-Bench Metric Slice: accuracy Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

\$OneMillion-Bench: How Far are Language Agents from Human Experts?

Highest protocol score with explicit human/eval signal plus Onemillion-Bench.

Strongest benchmark reference

APEX-Agents

Reported benchmark with pass@1 gives a fast comparison anchor.

Strongest recent paper

Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skil…

Useful for current practice scanning; published Dec 18, 2025.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

\$OneMillion-Bench: How Far are Language Agents from Human Experts?
Mar 9, 2026 · Citations: 0 · Score: 8.0

HF: Rubric Rating · Eval: Automatic Metrics · Benchmark: Onemillion Bench · Metric: Accuracy
APEX-Agents
Jan 20, 2026 · Citations: 0 · Score: 5.5

HF: Rubric Rating, Expert Verification · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Pass@1
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Dec 18, 2025 · Citations: 0 · Score: 5.5

HF: Pairwise Preference · Eval: Automatic Metrics · Benchmark: Not Reported · Metric: Cost
The Trinity of Consistency as a Defining Principle for General World Models
Feb 26, 2026 · Citations: 0 · Score: 4.5

HF: Not reported · Eval: Simulation Env · Benchmark: Cow Bench · Metric: Not Reported
RoboPocket: Improve Robot Policies Instantly with Your Phone
Mar 5, 2026 · Citations: 0 · Score: 4.5

HF: Demonstrations · Eval: Not reported · Benchmark: Not Reported · Metric: Not Reported
Multimodal Multi-Agent Empowered Legal Judgment Prediction
Jan 19, 2026 · Citations: 0 · Score: 4.0

HF: Not reported · Eval: Simulation Env · Benchmark: Lawbench · Metric: Not Reported

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
\$OneMillion-Bench: How Far are Language Agents from Human Experts? Mar 9, 2026	Yes Rubric Rating	Automatic Metrics	Onemillion Bench	Accuracy , Coherence	Not Reported
APEX-Agents Jan 20, 2026	Yes Rubric Rating , Expert Verification	Automatic Metrics	Not Reported	Pass@1	Not Reported
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills Dec 18, 2025	Yes Pairwise Preference	Automatic Metrics	Not Reported	Cost	Not Reported
The Trinity of Consistency as a Defining Principle for General World Models Feb 26, 2026	No Not Reported	Simulation Env	Cow Bench	Not Reported	Not Reported
RoboPocket: Improve Robot Policies Instantly with Your Phone Mar 5, 2026	Yes Demonstrations	Not Reported	Not Reported	Not Reported	Not Reported
Multimodal Multi-Agent Empowered Legal Judgment Prediction Jan 19, 2026	No Not Reported	Simulation Env	Lawbench	Not Reported	Not Reported
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System Feb 20, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	F1	Not Reported
Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space Jan 18, 2026	No Not Reported	Automatic Metrics	MATH	Not Reported	Not Reported
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures Aug 16, 2025	Yes Pairwise Preference	Not Reported	Not Reported	Not Reported	Not Reported
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors Dec 6, 2025	No Not Reported	Automatic Metrics	Not Reported	Accuracy	Not Reported
On the Complexity of Neural Computation in Superposition Sep 5, 2024	Yes Pairwise Preference	Automatic Metrics	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	\$OneMillion-Bench: How Far are Language Agents fro…	APEX-Agents	Adaptation of Agentic AI: A Survey of Post-Training…
Human Feedback	Rubric Rating	Rubric Rating, Expert Verification	Pairwise Preference
Evaluation Modes	Automatic Metrics	Automatic Metrics	Automatic Metrics
Benchmarks	Onemillion Bench	Not reported	Not reported
Metrics	Accuracy, Coherence	Pass@1	Cost
Quality Controls	Not reported	Not reported	Not reported
Rater Population	Domain Experts	Domain Experts	Unknown
Annotation Unit	Multi Dim Rubric	Multi Dim Rubric	Unknown

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (3)
Rubric Rating (2)
Demonstrations (1)
Expert Verification (1)

Evaluation Modes

Automatic Metrics (7)
Simulation Env (2)
Human Eval (1)

Top Benchmarks

Cow Bench (1)
Lawbench (1)
MATH (1)
Onemillion Bench (1)

Top Metrics

Accuracy (2)
Coherence (1)
Cost (1)
F1 (1)

Rater Population Mix

Domain Experts (3)

Quality Controls

Coverage diagnostics (sample-based): human-feedback 54.5% · benchmarks 36.4% · metrics 45.5% · quality controls 0.0%.

Top Papers

\$OneMillion-Bench: How Far are Language Agents from Human Experts?
Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0

Rubric Rating Automatic Metrics Tool Use

To this end, we introduce \OneMillion-Bench \OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0

Rubric RatingExpert Verification Automatic Metrics Long Horizon

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Tool Use

Large language model (LLM) agents are moving beyond prompting alone.
The Trinity of Consistency as a Defining Principle for General World Models
Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang · Feb 26, 2026 · Citations: 0

Simulation Env Long Horizon

To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios.
RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0

Demonstrations Long Horizon

To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Multi Agent

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0

Simulation Env Multi Agent

Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
On the Complexity of Neural Computation in Superposition
Micah Adler, Nir Shavit · Sep 5, 2024 · Citations: 0

Pairwise Preference Automatic Metrics

Superposition, the ability of neural networks to represent more features than neurons, is increasingly seen as key to the efficiency of large models.
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0

Human EvalAutomatic Metrics

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0

Automatic Metrics Long Horizon

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.

Related Hubs

Need human evaluators for your AI research? Scale annotation with expert AI Trainers.

Post a Job Get a Quote