HFEPX Hub

Automatic Metrics + Law Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This hub page groups 39 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Adjudication. Frequently cited benchmark: MATH. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 39 · Last published: Feb 26, 2026

Automatic Metrics Law

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 39 papers for Automatic Metrics + Law Papers. The dominant protocol signals are automatic metrics, human evaluation, and simulation environments. The most frequent benchmarks are MATH and Retrieval, and the most frequent metrics are accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

12.8% of papers report explicit human-feedback signals, led by expert verification.

Evidence: Frequency-Ordered Tokenization for Better Text Compression; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation; Scaling View Synthesis Transformers

Automatic metrics appear in 100% of papers in this hub.

Evidence: Frequency-Ordered Tokenization for Better Text Compression; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation; Scaling View Synthesis Transformers

MATH is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Frequency-Ordered Tokenization for Better Text Compression; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation; Scaling View Synthesis Transformers

Protocol Takeaways

The most common quality-control signal is adjudication (2.6% of papers).

Evidence: Frequency-Ordered Tokenization for Better Text Compression; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation; Scaling View Synthesis Transformers

Raters are mostly domain experts, and the most common annotation unit is pairwise; use this to scope replication staffing.

Evidence: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Airavat: An Agentic Framework for Internet Measurement; Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System; Agentic Adversarial QA for Improving Domain-Specific LLMs

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System; Frequency-Ordered Tokenization for Better Text Compression; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

Benchmark Interpretation

MATH appears in 5.1% of hub papers (2/39); use this cohort for benchmark-matched comparisons.
Retrieval appears in 5.1% of hub papers (2/39); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 20.5% of hub papers (8/39); compare with a secondary metric before ranking methods.
Cost is reported in 10.3% of hub papers (4/39); compare with a secondary metric before ranking methods.
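
One way to compare on a secondary metric before ranking is to keep only the methods that no other method beats on both metrics at once (a Pareto front). The sketch below assumes accuracy is higher-is-better and cost is lower-is-better; the method names and numbers are hypothetical placeholders, not results from any paper in this hub.

```python
from typing import NamedTuple


class Result(NamedTuple):
    method: str
    accuracy: float  # higher is better (assumption)
    cost: float      # lower is better (assumption)


def pareto_front(results: list[Result]) -> list[Result]:
    """Return results that are not dominated on (accuracy, cost)."""
    front = []
    for r in results:
        dominated = any(
            o.accuracy >= r.accuracy
            and o.cost <= r.cost
            and (o.accuracy > r.accuracy or o.cost < r.cost)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


# Hypothetical numbers, for illustration only.
RESULTS = [
    Result("method_a", accuracy=0.82, cost=10.0),
    Result("method_b", accuracy=0.80, cost=2.0),
    Result("method_c", accuracy=0.75, cost=5.0),
]

for r in pareto_front(RESULTS):
    print(r.method, r.accuracy, r.cost)
```

Here method_c drops out because method_b is both more accurate and cheaper, while method_a and method_b remain as a genuine tradeoff that accuracy alone would hide.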

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (12.8% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (5.1% vs 30% target).
Close the gap on papers naming benchmarks/datasets: coverage is a replication risk (20.5% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (48.7% vs 35% target).
Close the gap on papers with a known rater population: coverage is a replication risk (17.9% vs 35% target).
Close the gap on papers with a known annotation unit: coverage is a replication risk (2.6% vs 35% target).
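
The checklist percentages can be reproduced from paper counts over the 39-paper hub. The sketch below is illustrative only: the per-item counts are inferred from the reported percentages rather than stated on this page, and the rule that coverage at or above target is "strong" while anything below is a "replication risk" is an assumption about how the labels are assigned.

```python
from dataclasses import dataclass

TOTAL_PAPERS = 39  # hub size stated on this page


@dataclass(frozen=True)
class CoverageCheck:
    label: str
    papers: int        # inferred from the reported percentage (assumption)
    target_pct: float  # target stated in the checklist

    @property
    def coverage_pct(self) -> float:
        return round(100 * self.papers / TOTAL_PAPERS, 1)

    @property
    def status(self) -> str:
        # Assumed labeling rule: at or above target counts as strong.
        if self.coverage_pct >= self.target_pct:
            return "strong"
        return "replication risk"


CHECKS = [
    CoverageCheck("explicit human feedback", 5, 45.0),
    CoverageCheck("quality controls", 2, 30.0),
    CoverageCheck("benchmarks/datasets named", 8, 35.0),
    CoverageCheck("evaluation metrics named", 19, 35.0),
    CoverageCheck("known rater population", 7, 35.0),
    CoverageCheck("known annotation unit", 1, 35.0),
]

for check in CHECKS:
    gap = round(check.target_pct - check.coverage_pct, 1)
    print(f"{check.label}: {check.coverage_pct}% vs {check.target_pct}% "
          f"target ({check.status}, gap {gap} points)")
```

Sorting the checks by gap gives a rough priority order for which protocol details to look for first when replicating.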

Known Limitations

Only 5.1% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (17.9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: MATH - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=38

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=38, right_only=0

1 paper uses both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

0 papers use both Human Eval and Simulation Env.
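
The both / left_only / right_only counts above are plain set arithmetic over the papers tagged with each evaluation mode. A minimal sketch, assuming hypothetical paper IDs (p00 to p38) and a tag assignment chosen only to match the counts reported on this page:

```python
def overlap(left: set[str], right: set[str]) -> dict[str, int]:
    """Count papers in both sets, only the left set, and only the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Hypothetical IDs; the page reports 39 papers, all tagged Automatic Metrics.
papers = [f"p{i:02d}" for i in range(39)]
automatic_metrics = set(papers)
human_eval = {"p00"}       # one paper also uses Human Eval
simulation_env = {"p01"}   # a different paper also uses Simulation Env

print(overlap(human_eval, automatic_metrics))
print(overlap(automatic_metrics, simulation_env))
print(overlap(human_eval, simulation_env))
```

Because every paper in the hub carries the Automatic Metrics tag, the other two modes can only appear as "both" against it, which is why their left_only or right_only counts are zero.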

Benchmark Brief

MATH

Coverage: 2 papers (5.1%)

2 papers (5.1%) mention MATH.

Examples: Prescriptive Scaling Reveals the Evolution of Language Model Capabilities; Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space

Benchmark Brief

Retrieval

Coverage: 2 papers (5.1%)

2 papers (5.1%) mention Retrieval.

Examples: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Benchmark Brief

AdvBench

Coverage: 1 paper (2.6%)

1 paper (2.6%) mentions AdvBench.

Examples: A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Metric Brief

accuracy

Coverage: 8 papers (20.5%)

8 papers (20.5%) mention accuracy.

Examples: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; Agentic Adversarial QA for Improving Domain-Specific LLMs; Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Metric Brief

cost

Coverage: 4 papers (10.3%)

4 papers (10.3%) mention cost.

Examples: ReIn: Conversational Error Recovery with Reasoning Inception; Group Representational Position Encoding; From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

Metric Brief

f1

Coverage: 2 papers (5.1%)

2 papers (5.1%) mention f1.

Examples: Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning; Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Frequency-Ordered Tokenization for Better Text Compression; Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

Frequency-Ordered Tokenization for Better Text Compression
Maximilian Kalcher · Feb 26, 2026 · Citations: 0

Automatic Metrics

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law).
Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026 · Citations: 0

Automatic Metrics

We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco · Feb 24, 2026 · Citations: 0

Automatic Metrics

We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code.
Scaling View Synthesis Transformers
Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann · Feb 24, 2026 · Citations: 0

Automatic Metrics

Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto fronti
Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe, Deep Shah · Feb 24, 2026 · Citations: 0

Automatic Metrics

These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-v
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee · Feb 24, 2026 · Citations: 0

Automatic Metrics

Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets.
Airavat: An Agentic Framework for Internet Measurement
Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi · Feb 24, 2026 · Citations: 0

Automatic Metrics

We present Airavat, the first agentic framework for Internet measurement workflow generation with systematic verification and validation.
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0

Automatic Metrics Multi Agent

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0

Human Eval Automatic Metrics

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026 · Citations: 0

Automatic Metrics

Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp · Feb 19, 2026 · Citations: 0

Automatic Metrics

Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants.
ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma · Feb 19, 2026 · Citations: 0

Automatic Metrics

Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors.
Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held · Feb 18, 2026 · Citations: 0

Automatic Metrics

Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling.
Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026 · Citations: 0

Automatic Metrics Simulation Env

The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter
AI-Driven Structure Refinement of X-ray Diffraction
Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang · Feb 18, 2026 · Citations: 0

Automatic Metrics

We benchmark WPEM on standard reference patterns (PbSO$_4$ and Tb$_2$BaCoO$_5$), where it yields lower $R_p/R_{wp}$ than widely used packages (FullProf and TOPAS) under matched refinement conditions.
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0

Red Team Automatic Metrics

LLM-based agents execute real-world workflows via tools and memory.
Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications
Sanket Badhe, Deep Shah, Nehal Kathrotia · Feb 18, 2026 · Citations: 0

Automatic Metrics

We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures.
DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain · Feb 17, 2026 · Citations: 0

Automatic Metrics

We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models.
Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026 · Citations: 0

Automatic Metrics

Using large scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre training FLOPs, via
Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026 · Citations: 0

Automatic Metrics

Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0

Expert Verification Critique Edit Automatic Metrics

Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026 · Citations: 0

Automatic Metrics

Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman · Jan 27, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) exhibit failure modes on seemingly trivial tasks.
Between Search and Platform: ChatGPT Under the DSA
Toni Lorente, Kathrin Gardhouse · Jan 22, 2026 · Citations: 0

Automatic Metrics Web Browsing

This article examines the applicability of the Digital Services Act (DSA) to ChatGPT, arguing that it should be classified as a hybrid of the two types of hosting services: online search engines and platforms.
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025 · Citations: 0

Automatic Metrics

We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0

Automatic Metrics Long Horizon

We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang, Yu Zhang · Nov 10, 2025 · Citations: 0

Automatic Metrics

Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks.
Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang · Oct 28, 2025 · Citations: 0

Automatic Metrics

LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks.
ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell · Oct 24, 2025 · Citations: 0

Automatic Metrics

In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages.
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0

Red Team Automatic Metrics

This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Multi Agent

Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
When Algorithms Meet Artists: Semantic Compression of Artists' Concerns in the Public AI-Art Debate
Ariya Mukherjee-Gandhi, Oliver Muellerklein · Aug 5, 2025 · Citations: 0

Automatic Metrics

Artists occupy a paradoxical position in generative AI: their work trains the models reshaping creative labor.
From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025 · Citations: 0

Expert Verification Automatic Metrics

Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.
Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Jinghua Wang, Qifan Wang · Jun 6, 2025 · Citations: 0

Automatic Metrics

These findings highlight concrete privacy risks in FedLLMs and establish a benchmark and evaluation framework for future research on privacy-preserving federated learning.
How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025 · Citations: 0

Automatic Metrics

English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
Compressing Language Models for Specialized Domains
Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras · Feb 25, 2025 · Citations: 0

Automatic Metrics

Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks.
Using the Path of Least Resistance to Explain Deep Networks
Sina Salek, Joseph Enguehard · Feb 17, 2025 · Citations: 0

Automatic Metrics

Through experiments on both synthetic and real-world image classification data, we provide empirical evidence supporting our theoretical analysis and showing that GIG produces more faithful attributions than existing methods, including IG,
The Dark Side of ChatGPT: Legal and Ethical Challenges from Stochastic Parrots and Hallucination
Zihao Li · Apr 21, 2023 · Citations: 0

Automatic Metrics

With the launch of ChatGPT, Large Language Models (LLMs) are shaking up our whole society, rapidly altering the way we think, create and live.
