
Tag: Law

Involves legal-domain expertise in annotation or evaluation.

Papers in tag: 43

Research Utility Snapshot

Evaluation Modes

  • Automatic Metrics (11)
  • Simulation Env (4)
  • Human Eval (2)

Human Feedback Types

  • Expert Verification (3)
  • Pairwise Preference (2)
  • Red Team (2)

Required Expertise

  • Law (15)
  • Coding (4)
  • Math (1)
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang · Feb 24, 2026 · Citations: 0

Simulation Env Law Coding
  • Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably.
  • This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies (a rough illustrative sketch follows below).
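As a rough illustration only (our sketch, not the paper's schema), a minimal Python data model for one reusable skill, touching the lifecycle stages listed above, might look like this:

```python
# Illustrative data model only; field names are our assumptions, not the paper's
# schema. It loosely mirrors the lifecycle stages the survey names: discovery,
# practice, distillation, storage, composition, evaluation, and update.
from dataclasses import dataclass, field


@dataclass
class AgenticSkill:
    name: str                                  # identifier used for storage/lookup
    procedure: str                             # distilled, reusable recipe
    discovered_from: list[str] = field(default_factory=list)  # source trajectories
    practice_runs: int = 0                     # times the skill has been exercised
    composed_of: list[str] = field(default_factory=list)      # sub-skills it composes
    eval_score: float | None = None            # latest evaluation result, if any
    version: int = 1                           # bumped on every update


skill = AgenticSkill(
    name="extract_contract_dates",
    procedure="1) locate the Term clause 2) parse dates 3) normalize to ISO-8601",
    discovered_from=["trajectory_0042"],
)
print(skill.name, skill.version)
```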
Whisper: Courtside Edition: Enhancing ASR Performance Through LLM-Driven Context Generation

Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0

Automatic Metrics Law Coding
  • We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
  • The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder (a rough sketch of this two-pass flow follows below).
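The bullet above describes a two-pass flow: Whisper produces a draft transcript, LLM agents distill domain context, names, and jargon from it, and the result is fed back as a compact prompt for a second decoding pass. A minimal sketch of that flow with the open-source whisper package is below; context_agent() and the audio path are placeholder assumptions, not the paper's implementation (Whisper's transcribe() does accept an initial_prompt that conditions the decoder):

```python
# Minimal sketch (our assumption, not the authors' code) of the two-pass flow:
# draft transcription -> LLM-derived context -> prompt-conditioned re-transcription.
import whisper


def context_agent(draft: str) -> str:
    """Placeholder for the paper's LLM agents (domain context identification,
    named entity recognition, jargon detection). A real implementation would
    call an LLM; here we just return a fixed example prompt."""
    return "Basketball broadcast. Key terms: pick-and-roll, backcourt violation."


model = whisper.load_model("base")

# Pass 1: Whisper's initial transcript of a (hypothetical) audio file.
draft = model.transcribe("courtside_clip.wav")["text"]

# Pass 2: the compact LLM-generated prompt is passed as initial_prompt,
# which conditions Whisper's decoder toward the supplied vocabulary.
prompt = context_agent(draft)
final = model.transcribe("courtside_clip.wav", initial_prompt=prompt)["text"]
print(final)
```

Conditioning on initial_prompt biases decoding toward in-domain vocabulary without touching model weights, which is consistent with the "without retraining" claim in the entry above.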
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System

Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0

Human Eval Automatic Metrics Law
  • Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
  • Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xiang Xu · Feb 15, 2026 · Citations: 0

Expert Verification Critique Edit Automatic Metrics Law
  • Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
  • However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.
The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage

Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis · Feb 10, 2026 · Citations: 0

Pairwise Preference Rubric Rating Human Eval Law
  • To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c
APEX-Agents

Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026 · Citations: 0

Rubric Rating Expert Verification Simulation Env Law
  • We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers.
  • APEX-Agents requires agents to navigate realistic work environments with files and tools.
Multimodal Multi-Agent Empowered Legal Judgment Prediction

Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu · Jan 19, 2026 · Citations: 0

Simulation Env Law
  • Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
  • Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness.
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness

Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu · Sep 17, 2025 · Citations: 0

Red Team Automatic Metrics Law
  • This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
  • In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures

Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0

Pairwise Preference Automatic Metrics Law Coding
  • Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
  • In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions.
From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise

Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025 · Citations: 0

Expert Verification Automatic Metrics Law
  • Accurate domain-specific benchmarking of LLMs is essential, especially in domains with direct implications for humans, such as law, healthcare, and education.
  • However, existing benchmarks are documented to be contaminated and are based on multiple-choice questions, which suffer from inherent biases.