- MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco · Feb 24, 2026 · Citations: 0
Automatic Metrics
We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code.
- Scaling View Synthesis Transformers
Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann · Feb 24, 2026 · Citations: 0
Automatic Metrics
Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto fronti
- Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning
Sanket Badhe, Deep Shah · Feb 24, 2026 · Citations: 0
Automatic Metrics
These expressive instructions render the decision-making process transparent, allowing for full human verification of logic, making this approach ideal for regulated industries such as law, finance, and content moderation, as well as high-v
- See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis
Jaehyun Park, Minyoung Ahn, Minkyu Kim, Jonghyun Lee, Jae-Gil Lee · Feb 24, 2026 · Citations: 0
Automatic Metrics
Previous artifact-aware methodologies depend on human-labeled artifact datasets, which are costly and difficult to scale, underscoring the need for an automated approach to reliably acquire artifact-annotated datasets.
- Airavat: An Agentic Framework for Internet Measurement
Alagappan Ramanathan, Eunju Kang, Dongsu Han, Sangeetha Abdu Jyothi · Feb 24, 2026 · Citations: 0
Automatic Metrics
We present Airavat, the first agentic framework for Internet measurement workflow generation with systematic verification and validation.
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang · Feb 24, 2026 · Citations: 0
Simulation Env Tool Use
Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon workflows reliably.
- Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk · Feb 22, 2026 · Citations: 0
Simulation Env
The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps).
- Whisper: Courtside Edition Enhancing ASR Performance Through LLM-Driven Context Generation
Yonathan Ron, Shiri Gilboa, Tammuz Dubnov · Feb 21, 2026 · Citations: 0
Automatic Metrics Multi Agent
We introduce Whisper: Courtside Edition, a novel multi-agent large language model (LLM) pipeline that enhances Whisper transcriptions without retraining.
- Vichara: Appellate Judgment Prediction and Explanation for the Indian Judicial System
Pavithra PM Nair, Preethu Rose Anish · Feb 20, 2026 · Citations: 0
Human Eval Automatic Metrics
Vichara surpasses existing judgment prediction benchmarks on both datasets, with GPT-4o mini achieving the highest performance (F1: 81.5 on PredEx, 80.3 on ILDC_expert), followed by Llama-3.1-8B.
- Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026 · Citations: 0
Automatic Metrics
Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
- Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan · Feb 19, 2026 · Citations: 0
Human Eval
Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
- Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference
Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Norman Meuschke, Bela Gipp · Feb 19, 2026 · Citations: 0
Automatic Metrics
Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants.
- ReIn: Conversational Error Recovery with Reasoning Inception
Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma · Feb 19, 2026 · Citations: 0
Automatic Metrics
Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors.
- Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, Guangzhi Sun, William Held · Feb 18, 2026 · Citations: 0
Automatic Metrics
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling.
- Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval
Subrit Dikshit · Feb 18, 2026 · Citations: 0
Automatic Metrics Simulation Env
The rapid proliferation of Large Language Models (LLMs) has revolutionized Natural Language Processing (NLP) but has simultaneously created a "resource divide." State-of-the-art legal intelligence systems typically rely on massive parameter
- AI-Driven Structure Refinement of X-ray Diffraction
Bin Cao, Qian Zhang, Zhenjie Feng, Taolue Zhang, Jiaqiang Huang · Feb 18, 2026 · Citations: 0
Automatic Metrics
We benchmark WPEM on standard reference patterns (PbSO$_4$ and Tb$_2$BaCoO$_5$), where it yields lower $R_p/R_{wp}$ than widely used packages (FullProf and TOPAS) under matched refinement conditions.
- Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents
Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut · Feb 18, 2026 · Citations: 0
Red Team Automatic Metrics
LLM-based agents execute real-world workflows via tools and memory.
- Long-Tail Knowledge in Large Language Models: Taxonomy, Mechanisms, Interventions and Implications
Sanket Badhe, Deep Shah, Nehal Kathrotia · Feb 18, 2026 · Citations: 0
Automatic Metrics
We further examine how existing evaluation practices obscure tail behavior and complicate accountability for rare but consequential failures.
- DocSplit: A Comprehensive Benchmark Dataset and Evaluation Approach for Document Packet Recognition and Splitting
Md Mofijul Islam, Md Sirajus Salekin, Nivedha Balakrishnan, Vincil C. Bishop, Niharika Jain · Feb 17, 2026 · Citations: 0
Automatic Metrics
We present the first comprehensive benchmark dataset, DocSplit, along with novel evaluation metrics for assessing the document packet splitting capabilities of large language models.
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026 · Citations: 0
Automatic Metrics
Using large-scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via
- Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026 · Citations: 0
Automatic Metrics
Among discrete diffusion approaches, masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li · Feb 15, 2026 · Citations: 0
Expert Verification Critique Edit Automatic Metrics
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions.
- The Subjectivity of Respect in Police Traffic Stops: Modeling Community Perspectives in Body-Worn Camera Footage
Preni Golazizian, Elnaz Rahmati, Jackson Trager, Zhivar Sourati, Nona Ghazizadeh · Feb 10, 2026 · Citations: 0
Pairwise Preference Rubric Rating Human Eval
To this end, we (i) develop a domain-specific evaluation rubric grounded in procedural justice theory, LAPD training materials, and extensive fieldwork; (ii) introduce a rubric-driven preference data construction framework for perspective-c
- Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026 · Citations: 0
Automatic Metrics
Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
- Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman · Jan 27, 2026 · Citations: 0
Automatic Metrics
Large language models (LLMs) exhibit failure modes on seemingly trivial tasks.
- Between Search and Platform: ChatGPT Under the DSA
Toni Lorente, Kathrin Gardhouse · Jan 22, 2026 · Citations: 0
Automatic Metrics Web Browsing
This article examines the applicability of the Digital Services Act (DSA) to ChatGPT, arguing that it should be classified as a hybrid of the two types of hosting services: online search engines and platforms.
- APEX-Agents
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman · Jan 20, 2026 · Citations: 0
Rubric Rating Expert Verification Simulation Env Long Horizon
We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate law
- Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0
Simulation Env Multi Agent
Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
- Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025 · Citations: 0
Automatic Metrics
We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
- Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025 · Citations: 0
Automatic Metrics Long Horizon
We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
- RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang, Yu Zhang · Nov 10, 2025 · Citations: 0
Automatic Metrics
Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks.
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang · Oct 28, 2025 · Citations: 0
Automatic Metrics
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks.
- ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell · Oct 24, 2025 · Citations: 0
Automatic Metrics
In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages.
- A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025 · Citations: 0
Red Team Automatic Metrics
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025 · Citations: 0
Pairwise Preference Automatic Metrics Multi Agent
Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
- When Algorithms Meet Artists: Semantic Compression of Artists' Concerns in the Public AI-Art Debate
Ariya Mukherjee-Gandhi, Oliver Muellerklein · Aug 5, 2025 · Citations: 0
Automatic Metrics
Artists occupy a paradoxical position in generative AI: their work trains the models reshaping creative labor.
- From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025 · Citations: 0
Expert Verification Automatic Metrics
Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.
- Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Jinghua Wang, Qifan Wang · Jun 6, 2025 · Citations: 0
Automatic Metrics
These findings highlight concrete privacy risks in FedLLMs and establish a benchmark and evaluation framework for future research on privacy-preserving federated learning.
- How much does context affect the accuracy of AI health advice?
Prashant Garg, Thiemo Fetzer · Apr 25, 2025 · Citations: 0
Automatic Metrics
English-language performance does not reliably generalise across contexts, underscoring the need for multilingual, domain-specific evaluation before deployment in public-health communication.
- Compressing Language Models for Specialized Domains
Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras · Feb 25, 2025 · Citations: 0
Automatic Metrics
Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks.
- Using the Path of Least Resistance to Explain Deep Networks
Sina Salek, Joseph Enguehard · Feb 17, 2025 · Citations: 0
Automatic Metrics
Through experiments on both synthetic and real-world image classification data, we provide empirical evidence supporting our theoretical analysis and showing that GIG produces more faithful attributions than existing methods, including IG,
- The Dark Side of ChatGPT: Legal and Ethical Challenges from Stochastic Parrots and Hallucination
Zihao Li · Apr 21, 2023 · Citations: 0
Automatic Metrics
With the launch of ChatGPT, Large Language Models (LLMs) are shaking up our whole society, rapidly altering the way we think, create and live.