Research Utility Snapshot
Evaluation Modes
- Automatic Metrics (11)
- Simulation Env (4)
- Human Eval (2)
Human Feedback Types
- Expert Verification (3)
- Pairwise Preference (2)
- Red Team (2)
Required Expertise
- Law (15)
- Coding (4)
- Math (1)
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang · Feb 24, 2026 · Citations: 0
Simulation Env · Law · Coding
- Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably.
- This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update) and introduces two complementary taxonomies.
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0
Automatic Metrics · Law · Coding
- We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
- The pipeline intercepts Whisper's initial transcript, applies specialized LLM agents for domain context identification, named entity recognition, and jargon detection, and generates compact prompts that guide Whisper's decoder.
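The pipeline described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the specialized agents are stubbed as plain functions (in the actual system they would be LLM calls), and `identify_domain`, `extract_entities`, `detect_jargon`, and `build_prompt` are invented names. The composed string would then be fed to Whisper's decoder, e.g. via the `initial_prompt` argument of openai-whisper's `transcribe()`.

```python
# Hypothetical sketch of a multi-agent context-generation pipeline.
# Each "agent" is stubbed as a plain function; in the described system
# these would be LLM calls operating on Whisper's initial transcript.

def identify_domain(transcript: str) -> str:
    # Stub for the domain-context agent.
    return "legal proceedings" if "objection" in transcript.lower() else "general"

def extract_entities(transcript: str) -> list[str]:
    # Stub for the NER agent: keep capitalized tokens as candidate names.
    return sorted({w.strip(".,") for w in transcript.split() if w[:1].isupper()})

def detect_jargon(transcript: str, lexicon: set[str]) -> list[str]:
    # Stub for the jargon agent: match phrases against a domain lexicon.
    return sorted(term for term in lexicon if term in transcript.lower())

def build_prompt(transcript: str, lexicon: set[str], max_len: int = 200) -> str:
    # Compose a compact prompt from all agents' outputs, truncated so it
    # fits within Whisper's limited prompt window.
    parts = [f"Domain: {identify_domain(transcript)}."]
    if ents := extract_entities(transcript):
        parts.append("Names: " + ", ".join(ents) + ".")
    if jargon := detect_jargon(transcript, lexicon):
        parts.append("Terms: " + ", ".join(jargon) + ".")
    return " ".join(parts)[:max_len]

draft = "Objection, Your Honor. Counsel cites habeas corpus precedent."
prompt = build_prompt(draft, lexicon={"habeas corpus", "precedent"})
print(prompt)
```

The key design point is that the prompt stays compact: Whisper conditions its decoder on only a short prompt window, so the agents' outputs must be distilled rather than concatenated wholesale.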
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0
Human Eval · Automatic Metrics · Law
- Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
- Human evaluation of the generated explanations across Clarity, Linking, and Usefulness metrics highlights GPT-4o mini's superior interpretability.
Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval Subrit Dikshit · Feb 18, 2026 · Citations: 0
Automatic Metrics · Simulation Env · Law · Coding
Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0
Red Team · Automatic Metrics · Law · Multilingual
- LLM-based agents execute real-world workflows via tools and memory.
- These same affordances allow ill-intentioned adversaries to direct agents toward complex misuse scenarios.
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xiang Xu · Feb 15, 2026 · Citations: 0
Expert Verification · Critique Edit · Automatic Metrics · Law
- Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons.
The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh, Georgios Chochlakis · Feb 10, 2026 · Citations: 0
Pairwise Preference · Rubric Rating · Human Eval · Law
- To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c
Between Search and Platform: ChatGPT Under the DSA Toni Lorente, Kathrin Gardhouse · Jan 22, 2026 · Citations: 0
Automatic Metrics · Law
APEX-Agents Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein · Jan 20, 2026 · Citations: 0
Rubric Rating · Expert Verification · Simulation Env · Law
- We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
- APEX-Agents requires agents to navigate realistic work environments with files and tools.
Multimodal Multi-Agent Empowered Legal Judgment Prediction Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu, Rong Fu · Jan 19, 2026 · Citations: 0
Simulation Env · Law
- Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
- Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness.
Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space Wang Zixian · Jan 18, 2026 · Citations: 0
Automatic Metrics · Math · Law
- Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
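The paper's exact construction is not reproduced in this excerpt. As a rough illustration of what "policy optimization as orthogonal projection" could refer to, the standard projection identity in a Hilbert space is shown below; the interpretation of $g$ as a raw policy-update direction and $u$ as a constraint direction is an assumption, not the paper's notation.

$$
g_{\perp} \;=\; g \;-\; \frac{\langle g, u \rangle}{\langle u, u \rangle}\, u,
\qquad \text{so that} \qquad \langle g_{\perp}, u \rangle = 0 .
$$

Updating along $g_{\perp}$ removes the component of the update that conflicts with the constraint direction, which is one way a projection-based formulation could avoid the gradient saturation seen in KL-constrained methods.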
Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0
Automatic Metrics · Law
- We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li, Ruifeng Xu · Sep 17, 2025 · Citations: 0
Red Team · Automatic Metrics · Law
- This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- In addition, the assessment of defenses on the constructed safe prompts reveals inherent limitations of LLMs' safety mechanisms and flaws in the defense methods.
CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0
Pairwise Preference · Automatic Metrics · Law · Coding
- Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
- In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions.
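CORE's definition is not given in this excerpt. Purely as an illustration of quantifying linguistic diversity in multi-agent dialogue, the sketch below computes a per-agent type-token ratio, a standard lexical-diversity measure; it is explicitly NOT the paper's metric, and the variable names are invented.

```python
# Minimal per-agent lexical-diversity sketch (type-token ratio).
# This is a generic stand-in, not the CORE metric from the paper.
from collections import defaultdict

def type_token_ratio(turns: list[tuple[str, str]]) -> dict[str, float]:
    """turns: (agent_id, utterance) pairs from one game-theoretic interaction."""
    tokens = defaultdict(list)
    for agent, text in turns:
        tokens[agent].extend(text.lower().split())
    # Ratio of distinct tokens to total tokens, per agent.
    return {a: len(set(ts)) / len(ts) for a, ts in tokens.items() if ts}

dialogue = [
    ("A", "I propose we split the payoff evenly"),
    ("B", "I accept the even split"),
    ("A", "Deal deal deal"),
]
ratios = type_token_ratio(dialogue)
print(ratios)  # agent A repeats itself, so its ratio is lower
```

A repetition-heavy negotiator scores lower than one that varies its language, which is the kind of behavioral signal a conversational-robustness metric would need to capture.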
From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025 · Citations: 0
Expert Verification · Automatic Metrics · Law
- Accurate domain-specific benchmarking of LLMs is essential, especially in domains with direct implications for humans, such as law, healthcare, and education.
- However, existing benchmarks are documented to be contaminated and are based on multiple choice questions, which suffer from inherent biases.