- $OneMillion-Bench: How Far are Language Agents from Human Experts?
Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen · Mar 9, 2026 · Citations: 0
Rubric Rating Automatic Metrics Tool Use
To this end, we introduce $OneMillion-Bench, a benchmark of 400 expert-curated tasks spanning Law, Finance, Industry, Healthcare, and Natural Science, built to evaluate agents across economically consequential scenarios.
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Expert Verification Critique Edit Automatic Metrics
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Rubric Rating Expert Verification Automatic Metrics Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate…
- Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang · Mar 27, 2026 · Citations: 0
Rubric Rating Expert Verification Automatic Metrics
To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
- Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon · Oct 15, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general…
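The reported ORs come from the paper's own statistical models; as a minimal sketch of what an odds ratio expresses, here is how one is computed from a 2x2 table of pairwise-preference counts (the counts below are made up for illustration, not the paper's data):

```python
def odds_ratio(group1_yes, group1_no, group2_yes, group2_no):
    """Odds ratio from a 2x2 contingency table:
    odds of the outcome in group 1 divided by odds in group 2."""
    return (group1_yes / group1_no) / (group2_yes / group2_no)

# Hypothetical counts: times each reader group preferred the AI text
# over the human text in blind pairwise comparisons.
mfa_or = odds_ratio(20, 80, 60, 40)
print(mfa_or)  # odds of preferring AI text are far lower for group 1
```

An OR below 1 (like the paper's 0.16 and 0.13) means the first group's odds of favoring the AI text are a fraction of the comparison group's.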
- Stabilizing Iterative Self-Training with Verified Reasoning via Symbolic Recursive Self-Alignment
Xinyu Zhang · Mar 23, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We further demonstrate that constructing DPO preference pairs from NSRSA verification teaches the model to distinguish sound from flawed reasoning (reward accuracy 46% to 63%).
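The snippet describes building DPO preference pairs from verified vs. flawed reasoning traces. A minimal sketch of the standard DPO loss those pairs would be trained with (this is the generic DPO objective, not the paper's NSRSA-specific pipeline; the log-probabilities below are placeholders):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A pair where the policy already prefers the verified (chosen) trace
# yields a lower loss than one where it prefers the flawed trace.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))
```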
- Sabiá-4 Technical Report
Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás · Mar 10, 2026 · Citations: 0
Pairwise Preference Automatic Metrics Tool Use
The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal…
- Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0
Pairwise Preference Automatic Metrics Tool Use
Large language model (LLM) agents are moving beyond prompting alone.
- A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0
Red Team Automatic Metrics
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- The Trinity of Consistency as a Defining Principle for General World Models
Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang · Feb 26, 2026 · Citations: 0
Simulation Env Long Horizon
To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios.
- From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025 · Citations: 0
Expert Verification Automatic Metrics
Accurate domain-specific benchmarking of LLMs is essential, particularly in domains with direct implications for humans, such as law, healthcare, and education.
- LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer · May 19, 2025 · Citations: 0
Llm As Judge Automatic Metrics Long Horizon
To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.
- Generating and Evaluating Sustainable Procurement Criteria for the Swiss Public Sector using In-Context Prompting with Large Language Models
Yingqiang Gao, Veton Matoshi, Luca Rolshoven, Tilia Ellendorff, Judith Binder · Mar 23, 2026 · Citations: 0
Expert Verification
Swiss law requires the integration of ecological, social, and economic sustainability requirements into tender evaluations in the format of criteria that have to be fulfilled by a bidder.
- RoboPocket: Improve Robot Policies Instantly with Your Phone
Junjie Fang, Wendi Chen, Han Xue, Fangyuan Zhou, Tian Le · Mar 5, 2026 · Citations: 0
Demonstrations Long Horizon
To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using single consumer smartphones.
- CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0
Pairwise Preference Multi Agent
Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
- TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu · Apr 6, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation.
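TriAttention's exact scoring rule is defined in the paper; purely as an illustration of the idea described in the snippet, the sketch below scores each key by a cosine-shaped preference around a hypothetical center position, blended with the key's norm (the `center`, `period`, and `alpha` parameters are assumptions, not the paper's):

```python
import math

def key_importance(positions, center, period, key_norms, alpha=0.5):
    """Illustrative only: trigonometric distance preference around a
    center position, mixed with key-vector norms as a second signal."""
    scores = []
    for pos, norm in zip(positions, key_norms):
        tri = math.cos(2 * math.pi * (pos - center) / period)
        scores.append(alpha * tri + (1 - alpha) * norm)
    return scores

# With equal norms, the key at the center position scores highest.
print(key_importance(list(range(5)), center=2, period=10,
                     key_norms=[1.0] * 5))
```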
- Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0
Simulation Env Multi Agent
Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
- Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan · Mar 30, 2026 · Citations: 0
Automatic Metrics Multi Agent
We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation.
- Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation
Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang · Apr 1, 2026 · Citations: 0
Rubric Rating Critique Edit
However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to…
- On the Complexity of Neural Computation in Superposition
Micah Adler, Nir Shavit · Sep 5, 2024 · Citations: 0
Pairwise Preference Automatic Metrics
Superposition, the ability of neural networks to represent more features than neurons, is increasingly seen as key to the efficiency of large models.
- Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li · Apr 1, 2026 · Citations: 0
Automatic Metrics Multi Agent
In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
- Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Tae-Eun Song · Mar 23, 2026 · Citations: 0
Automatic Metrics Multi Agent
LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods (paraphrase consistency, n-gram overlap, perplexity analysis) never directly…
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang, Zhangyi Jiang, Zhenqi He, Shenyang Tong, Wenhan Yang · Mar 16, 2025 · Citations: 0
Automatic Metrics Long Horizon
Empirical results on the PRM800K dataset show that HRM, together with HNC, provides more stable and reliable evaluations than PRM.
- Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0
Human Eval Automatic Metrics
Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
- Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng · Apr 2, 2026 · Citations: 0
Automatic Metrics Long Horizon
Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO.
- TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning
Mingyue Cheng, Shuo Yu, Chuang Jiang, Xiaoyu Tao, Qingyang Mao · Mar 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM).
- A Survey of On-Policy Distillation for Large Language Models
Mingyang Song, Mao Zheng · Apr 1, 2026 · Citations: 0
Expert Verification Demonstrations
We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.
- ExpGuard: LLM Content Moderation in Specialized Domains
Minseok Choi, Dongjin Kim, Seungbin Yang, Subin Kim, Youngjun Kwak · Mar 3, 2026 · Citations: 0
Expert Verification
With the growing deployment of large language models (LLMs) in real-world applications, establishing robust safety guardrails to moderate their inputs and outputs has become essential to ensure adherence to safety policies.
- Learning Page Order in Shuffled WOO Releases
Efe Kahraman, Giulio Tosato · Feb 11, 2026 · Citations: 0
Pairwise Preference
We observe two unexpected failures: seq2seq transformers fail to generalize on long documents (Kendall's tau drops from 0.918 on 2-5 pages to 0.014 on 21-25 pages), and curriculum learning underperforms direct training by 39% on long…
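Kendall's tau, the metric cited for the page-ordering failures, is the normalized difference between concordant and discordant page pairs. A self-contained sketch (a quadratic-time version of the standard statistic, assuming no ties):

```python
def kendall_tau(pred_order, true_order):
    """Kendall's tau between a predicted page order and the true order:
    (concordant pairs - discordant pairs) / total pairs."""
    rank = {page: i for i, page in enumerate(true_order)}
    n = len(pred_order)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if rank[pred_order[i]] < rank[pred_order[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 (perfect order)
print(kendall_tau([4, 3, 2, 1], [1, 2, 3, 4]))  # -1.0 (fully reversed)
```

A drop from 0.918 to 0.014, as reported, means predictions on long documents are barely better than a random shuffle.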
- The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh · Feb 10, 2026 · Citations: 0
Pairwise Preference Rubric Rating
By sampling annotators from police-affiliated, justice-system-impacted, and non-affiliated Los Angeles residents, we enable the systematic study of perceptual differences across diverse communities.
- Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
Philipp D. Siedler · Apr 8, 2026 · Citations: 0
Simulation Env Multi Agent
We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation.
- CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts
Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik · Jan 8, 2026 · Citations: 0
Llm As Judge Multi Agent
To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics.
- Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation
Scott Merrill, Shashank Srivastava · Nov 21, 2025 · Citations: 0
Human Eval Simulation Env
Transcripts produced via automatic speech recognition (ASR) assign anonymous speaker labels (e.g., Speaker_1), preventing models from capturing consistent human behavior.
- Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval
Andrea Volpini, Elie Raad, Beatrice Gamba, David Riccitelli · Mar 11, 2026 · Citations: 0
Automatic Metrics Web Browsing
In this paper, we investigate whether structured linked data, specifically Schema.org markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic…
- MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib · Mar 8, 2026 · Citations: 0
Automatic Metrics Long Horizon
To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline.
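MIR-E's actual stage definitions and weights are given in the paper; as a minimal sketch of the general pattern of a weighted multi-stage metric, the stage names and weights below are hypothetical:

```python
def staged_score(stage_scores, weights):
    """Weighted multi-stage score: each reasoning stage is scored in
    [0, 1] and combined by stage weights that sum to 1. Hypothetical
    stages/weights, not the paper's MIR-E definition."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[s] * stage_scores.get(s, 0.0) for s in weights)

# Illustrative inheritance-reasoning stages: identifying heirs,
# computing shares, and stating the final distribution.
print(staged_score(
    {"heirs": 1.0, "shares": 0.5, "final": 0.0},
    {"heirs": 0.3, "shares": 0.4, "final": 0.3},
))
```

Because each stage contributes separately, a model that identifies heirs correctly but miscomputes shares still earns partial credit, and an early mistake that zeroes later stages shows up as error propagation.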
- Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0
Automatic Metrics Multi Agent
We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
- Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0
Automatic Metrics Long Horizon
We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
- L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search
Ziqi Wang, Boqin Yuan · Aug 31, 2025 · Citations: 0
Automatic Metrics Multi Agent
We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence…
- Dual Optimal: Make Your LLM Peer-like with Dignity
Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang · Apr 1, 2026 · Citations: 0
Pairwise Preference
Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias.
- Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0
Red Team
LLM-based agents execute real-world workflows via tools and memory.
- Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names
Hayafumi Watanabe · Nov 6, 2025 · Citations: 0
Pairwise Preference
…inward (community) contact suggests that α can be interpreted as an index of the preference for outward-oriented communication.