HFEPX Hub

Law Or Multilingual Papers

Updated from current HFEPX corpus (Feb 27, 2026). 139 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Adjudication. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 139 Last published: Feb 26, 2026 Global RSS Tag RSS

LawMultilingual

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 139 papers for Law Or Multilingual Papers. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on Retrieval, DROP and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

12.9% of papers report explicit human-feedback signals, led by expert verification.

Evidence: MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models , Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning , Frequency-Ordered Tokenization for Better Text Compression , Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
automatic metrics appears in 89.2% of papers in this hub.

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning , Frequency-Ordered Tokenization for Better Text Compression , Effective QA-driven Annotation of Predicate-Argument Relations Across Languages , Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA , Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads , Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning , Frequency-Ordered Tokenization for Better Text Compression

Protocol Takeaways

Most common quality-control signal is adjudication (0.7% of papers).

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning , Frequency-Ordered Tokenization for Better Text Compression , Effective QA-driven Annotation of Predicate-Argument Relations Across Languages , Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.

Evidence: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA , MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models , Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning , Frequency-Ordered Tokenization for Better Text Compression
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning , Frequency-Ordered Tokenization for Better Text Compression , Effective QA-driven Annotation of Predicate-Argument Relations Across Languages , Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks

Benchmark Interpretation

Retrieval appears in 7.9% of hub papers (11/139); use this cohort for benchmark-matched comparisons.
DROP appears in 2.2% of hub papers (3/139); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 20.1% of hub papers (28/139); compare with a secondary metric before ranking methods.
cost is reported in 8.6% of hub papers (12/139); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (12.9% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (2.9% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (21.6% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (39.6% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (13.7% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (6.5% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (12.9% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (2.9% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (21.6% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (39.6% vs 35% target).

Papers with known rater population

Coverage is a replication risk (13.7% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (6.5% vs 35% target).

Known Limitations

Only 2.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (13.7% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=2, left_only=7, right_only=122

2 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=3, left_only=121, right_only=8

3 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=11, right_only=9

0 papers use both Simulation Env and Human Eval.

Benchmark Brief

Retrieval

Coverage: 11 papers (7.9%)

11 papers (7.9%) mention Retrieval.

Examples: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA , Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads , Structured Prompt Language: Declarative Context Management for LLMs

Benchmark Brief

DROP

Coverage: 3 papers (2.2%)

3 papers (2.2%) mention DROP.

Examples: Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads , Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration , CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics

Benchmark Brief

MATH

Coverage: 3 papers (2.2%)

3 papers (2.2%) mention MATH.

Examples: Prescriptive Scaling Reveals the Evolution of Language Model Capabilities , Orthogonalized Policy Optimization:Policy Optimization as Orthogonal Projection in Hilbert Space , Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models

Metric Brief

accuracy

Coverage: 28 papers (20.1%)

28 papers (20.1%) mention accuracy.

Examples: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA , MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models , Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text

Metric Brief

cost

Coverage: 12 papers (8.6%)

12 papers (8.6%) mention cost.

Examples: Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"? , Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages , Structured Prompt Language: Declarative Context Management for LLMs

Metric Brief

precision

Coverage: 5 papers (3.6%)

5 papers (3.6%) mention precision.

Examples: Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages , Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning , Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning , Frequency-Ordered Tokenization for Better Text Compression , Effective QA-driven Annotation of Predicate-Argument Relations Across Languages

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang · Feb 26, 2026 · Citations: 0

Automatic Metrics

With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and t
Frequency-Ordered Tokenization for Better Text Compression
Maximilian Kalcher · Feb 26, 2026 · Citations: 0

Automatic Metrics

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law).
Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Jonathan Davidov, Aviv Slobodkin, Shmuel Tomi Klein, Reut Tsarfaty, Ido Dagan · Feb 26, 2026 · Citations: 0

Automatic Metrics

Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation.
Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
Jakub Šmíd, Pavel Přibáň, Pavel Král · Feb 26, 2026 · Citations: 0

Automatic Metrics

The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.
Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026 · Citations: 0

Automatic Metrics

We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for
Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi · Feb 25, 2026 · Citations: 0

Automatic Metrics

Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces bigger performance drop than masking Retrieval Heads (RH).
SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran · Feb 25, 2026 · Citations: 0

Automatic Metrics

Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage.
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0

Automatic Metrics

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan · Feb 25, 2026 · Citations: 0

Automatic Metrics

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers.
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang · Feb 25, 2026 · Citations: 0

Simulation Env

Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages.
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0

Automatic MetricsSimulation Env

Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged
Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
Tangsang Chongbang, Pranesh Pyara Shrestha, Amrit Sarki, Anku Jaiswal · Feb 25, 2026 · Citations: 0

Automatic Metrics

We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results.
Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Barah Fazili, Koustava Goswami · Feb 25, 2026 · Citations: 0

Automatic Metrics

This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models.
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas · Feb 25, 2026 · Citations: 0

Automatic Metrics

Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly.
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco · Feb 24, 2026 · Citations: 0

Automatic Metrics

We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code.
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026 · Citations: 0

Automatic Metrics

Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP).
Scaling View Synthesis Transformers
Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann · Feb 24, 2026 · Citations: 0

Automatic Metrics

Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto fronti
Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe, Deep Shah · Feb 24, 2026 · Citations: 0

Automatic Metrics

These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-v
Evaluating Proactive Risk Awareness of Large Language Models
Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu · Feb 24, 2026 · Citations: 0

Simulation Env

As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks.
See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee · Feb 24, 2026 · Citations: 0

Automatic Metrics

Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets.
Airavat: An Agentic Framework for Internet Measurement
Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi · Feb 24, 2026 · Citations: 0

Automatic Metrics

We present Airavat, the first agentic framework for Internet measurement workflow generation with systematic verification and validation.
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang · Feb 24, 2026 · Citations: 0

Simulation Env Tool Use

Agentic systems increasingly rely on reusable procedural capabilities, \textit{a.k.a., agentic skills}, to execute long-horizon workflows reliably.
BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop
Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen · Feb 23, 2026 · Citations: 0

Automatic Metrics

For the workshop, we call for papers related to the overall theme of BabyLM, which includes training efficiency, small-scale training datasets, cognitive modeling, model evaluation, and architecture innovation.
Multilingual Large Language Models do not comprehend all natural languages to equal degrees
Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi · Feb 23, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) play a critical role in how humans access information.
Structured Prompt Language: Declarative Context Management for LLMs
Wen G. Gong · Feb 23, 2026 · Citations: 0

Automatic Metrics

SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script.
Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026 · Citations: 0

Automatic Metrics

We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0

Automatic Metrics Multi Agent

This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations.
DEEP: Docker-based Execution and Evaluation Platform
Sergio Gómez González, Miguel Domingo, Francisco Casacuberta · Feb 23, 2026 · Citations: 0

Automatic Metrics

Comparative evaluation of several systems is a recurrent task in researching.
Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection
Raihan Tanvir, Md. Golam Rabiul Alam · Feb 22, 2026 · Citations: 0

Automatic Metrics

Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives.
TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov · Feb 22, 2026 · Citations: 0

Automatic Metrics

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources.
Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk · Feb 22, 2026 · Citations: 0

Simulation Env

The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps).
Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0

Automatic Metrics Multi Agent

We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026 · Citations: 0

Automatic Metrics

Personal AI agents incur substantial cost via repeated LLM calls.
BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat · Feb 21, 2026 · Citations: 0

Automatic Metrics

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG).
PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani · Feb 20, 2026 · Citations: 0

Automatic Metrics

Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings.
Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0

Human EvalAutomatic Metrics

Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026 · Citations: 0

Automatic Metrics

Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026 · Citations: 0

Automatic Metrics

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
What Language is This? Ask Your Tokenizer
Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel · Feb 19, 2026 · Citations: 0

Automatic Metrics

Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models.
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen · Feb 19, 2026 · Citations: 0

Automatic Metrics

Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries.
Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan · Feb 19, 2026 · Citations: 0

Human Eval

Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
Auditing Reciprocal Sentiment Alignment: Inversion Risk, Dialect Representation and Intent Misalignment in Transformers
Nusrat Jahan Lia, Shubhashis Roy Dipta · Feb 19, 2026 · Citations: 0

Automatic Metrics

The core theme of bidirectional alignment is ensuring that AI systems accurately understand human intent and that humans can trust AI behavior.
Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics
Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya · Feb 19, 2026 · Citations: 0

Automatic Metrics

This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings.
Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp · Feb 19, 2026 · Citations: 0

Automatic Metrics

Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants.
WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026 · Citations: 0

Automatic Metrics

We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages.
Representation Collapse in Machine Translation Through the Lens of Angular Dispersion
Evgeniia Tokarchuk, Maya K. Nachesa, Sergey Troshin, Vlad Niculae · Feb 19, 2026 · Citations: 0

Automatic Metrics

Modern neural translation models based on the Transformer architecture are known for their high performance, particularly when trained on high-resource datasets.
Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective
Yukun Chen, Xinyu Zhang, Jialong Tang, Yu Wan, Baosong Yang · Feb 19, 2026 · Citations: 0

Automatic Metrics

While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital
ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Hussein S. Al-Olimat, Ahmad Alshareef · Feb 19, 2026 · Citations: 0

Automatic Metrics

While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification.
Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data
Deepak Uniyal, Md Abul Bashar, Richi Nayak · Feb 19, 2026 · Citations: 0

Automatic Metrics

Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span across diverse languages.
ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma · Feb 19, 2026 · Citations: 0

Automatic Metrics

Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors.
When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik, Libby Barak, Jing Peng, Anna Feldman · Feb 18, 2026 · Citations: 0

Automatic Metrics

Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages.
BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam · Feb 18, 2026 · Citations: 0

Automatic Metrics

However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries.
IndicJR: A Judge-Free Benchmark of Jailbreak Robustness in South Asian Languages
Priyaranjan Pattnayak, Sanchari Chowdhuri · Feb 18, 2026 · Citations: 0

Red Team Automatic Metrics

Safety alignment of large language models (LLMs) is mostly evaluated in English and contract-bound, leaving multilingual vulnerabilities understudied.
Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark
Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis · Feb 18, 2026 · Citations: 0

Automatic Metrics

In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zei
Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held · Feb 18, 2026 · Citations: 0

Automatic Metrics

Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling.
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
Yuyan Bu, Xiaohao Liu, ZhaoXing Ren, Yaodong Yang, Juntao Dai · Feb 18, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

The widespread deployment of large language models (LLMs) across linguistic communities necessitates reliable multilingual safety alignment.
Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026 · Citations: 0

Automatic MetricsSimulation Env

The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter

Law Or Multilingual Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs