HFEPX Hub

Math Or Multilingual Papers

Updated from current HFEPX corpus (Feb 27, 2026). 200 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: MATH. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 200 · Last published: Feb 26, 2026

Math · Multilingual

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 200 papers for Math Or Multilingual Papers. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on MATH and Retrieval and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

9.5% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning, AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?, InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Automatic metrics appear in 92% of papers in this hub.

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning, AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?, InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

MATH is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning, AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?, InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Protocol Takeaways

The most common quality-control signal is rater calibration (3% of papers).

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning, AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?, InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models, Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning, AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning, AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?, InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Benchmark Interpretation

MATH appears in 10% of hub papers (20/200); use this cohort for benchmark-matched comparisons.
Retrieval appears in 8% of hub papers (16/200); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 27.5% of hub papers (55/200); compare with a secondary metric before ranking methods.
cost is reported in 10% of hub papers (20/200); compare with a secondary metric before ranking methods.
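
The shares quoted in the two interpretation sections above are plain count-over-total percentages. A minimal sketch of that arithmetic follows, assuming the hub size of 200 and the counts as stated on this page; the names `HUB_SIZE`, `mentions`, and `coverage_pct` are illustrative and not part of any HFEPX tooling.

```python
HUB_SIZE = 200  # papers in this hub, as stated in the page header

# Counts copied from the Benchmark and Metric Interpretation sections.
mentions = {"MATH": 20, "Retrieval": 16, "accuracy": 55, "cost": 20}


def coverage_pct(count: int, total: int = HUB_SIZE) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Map each benchmark or metric name to its share of hub papers.
shares = {name: coverage_pct(n) for name, n in mentions.items()}
print(shares)
```

Recomputing the shares this way is a quick sanity check before treating a cohort as large enough for benchmark-matched comparison.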

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (9.5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (4.5% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (35% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (45% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (9% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (8% vs 35% target).
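
The checklist labels can be reproduced with a simple threshold comparison. The sketch below assumes the rule is "coverage at or above target is strong, otherwise a replication risk", which is inferred from the six rows above rather than documented by the page; `checklist` and `classify` are hypothetical names.

```python
# (coverage %, target %) pairs copied from the checklist above.
checklist = {
    "explicit human feedback": (9.5, 45),
    "quality controls": (4.5, 30),
    "named benchmarks/datasets": (35, 35),
    "named evaluation metrics": (45, 35),
    "known rater population": (9, 35),
    "known annotation unit": (8, 35),
}


def classify(coverage: float, target: float) -> tuple[str, float]:
    """Return (status, gap in percentage points).

    Assumed rule: meeting or exceeding the target counts as "strong".
    """
    status = "strong" if coverage >= target else "replication risk"
    gap = max(0.0, target - coverage)
    return status, gap


report = {item: classify(cov, tgt) for item, (cov, tgt) in checklist.items()}

# Print the items furthest from target first.
for item, (status, gap) in sorted(report.items(), key=lambda kv: -kv[1][1]):
    print(f"{item}: {status} (gap {gap:.1f} pts)")
```

Sorting by gap puts explicit human feedback first (35.5 points short of target), which is one way to order the "close gap" items.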

Known Limitations

Only 4.5% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (9% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: MATH - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=6, right_only=183

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=3, left_only=181, right_only=10

3 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=13, right_only=7

0 papers use both Simulation Env and Human Eval.
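
The three pairwise tables above can be cross-checked against each other: a mode's total (`both` plus its own `*_only` count) should be the same in every pair it appears in. The sketch below assumes `left` and `right` follow the order in each heading; the helper names `mode_totals` and `jaccard` are illustrative only.

```python
# Pairwise counts copied from the three comparison tables above.
pairs = {
    ("human_eval", "automatic_metrics"): {"both": 1, "left_only": 6, "right_only": 183},
    ("automatic_metrics", "simulation_env"): {"both": 3, "left_only": 181, "right_only": 10},
    ("simulation_env", "human_eval"): {"both": 0, "left_only": 13, "right_only": 7},
}


def mode_totals(pairs: dict) -> dict:
    """Collect every total implied for each mode; one value per mode means consistent."""
    totals: dict[str, set[int]] = {}
    for (left, right), c in pairs.items():
        totals.setdefault(left, set()).add(c["both"] + c["left_only"])
        totals.setdefault(right, set()).add(c["both"] + c["right_only"])
    return totals


def jaccard(c: dict) -> float:
    """Overlap as |A and B| / |A or B|; 0.0 when the union is empty."""
    union = c["both"] + c["left_only"] + c["right_only"]
    return c["both"] / union if union else 0.0


totals = mode_totals(pairs)
for mode, values in totals.items():
    print(mode, sorted(values), "consistent" if len(values) == 1 else "MISMATCH")
```

With these counts, automatic metrics totals 184 papers in both tables it appears in, which agrees with the 92% share of 200 reported earlier on the page.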

Benchmark Brief

MATH

Coverage: 20 papers (10%) mention MATH.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning, Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards, RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Benchmark Brief

Retrieval

Coverage: 16 papers (8%) mention Retrieval.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance, Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

Benchmark Brief

GSM8K

Coverage: 13 papers (6.5%) mention GSM8K.

Examples: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models, Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration, Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Metric Brief

accuracy

Coverage: 55 papers (27.5%) mention accuracy.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models, Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Metric Brief

cost

Coverage: 20 papers (10%) mention cost.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching, Sparsity Induction for Accurate Post-Training Pruning of Large Language Models, Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?

Metric Brief

latency

Coverage: 11 papers (5.5%) mention latency.

Examples: Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?, InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models, Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning, AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning, Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Top Papers

Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang · Feb 26, 2026 · Citations: 0

Automatic Metrics

With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and t
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026 · Citations: 0

Automatic Metrics

Our evaluation experiments on Llama models show that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0

Automatic Metrics

Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026 · Citations: 0

Automatic Metrics

On the SlimOrca benchmark, NoRA breaks this linear barrier: NoRA remarkably at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Jonathan Davidov, Aviv Slobodkin, Shmuel Tomi Klein, Reut Tsarfaty, Ido Dagan · Feb 26, 2026 · Citations: 0

Automatic Metrics

Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation.
Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
Jakub Šmíd, Pavel Přibáň, Pavel Král · Feb 26, 2026 · Citations: 0

Automatic Metrics

The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0

Automatic Metrics

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Dynamic Level Sets
Michael Stephen Fiske · Feb 26, 2026 · Citations: 0

Automatic Metrics

A mathematical concept is identified and analyzed that is implicit in the 2012 paper Turing Incomputable Computation, presented at the Alan Turing Centenary Conference (Turing 100, Manchester).
Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi · Feb 25, 2026 · Citations: 0

Automatic Metrics

Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces bigger performance drop than masking Retrieval Heads (RH).
SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran · Feb 25, 2026 · Citations: 0

Automatic Metrics

Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage.
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0

Automatic Metrics

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026 · Citations: 0

Automatic Metrics

We study reasoning for accessing world knowledge stored in a language model's parameters.
IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan · Feb 25, 2026 · Citations: 0

Automatic Metrics

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers.
TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang · Feb 25, 2026 · Citations: 0

Simulation Env

Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages.
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0

Automatic Metrics Simulation Env

Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged
Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Minhao Jiang, Zhikai Li, Xuewen Liu, Jing Zhang, Mengjuan Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency.
Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
Tangsang Chongbang, Pranesh Pyara Shrestha, Amrit Sarki, Anku Jaiswal · Feb 25, 2026 · Citations: 0

Automatic Metrics

We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results.
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Barah Fazili, Koustava Goswami · Feb 25, 2026 · Citations: 0

Automatic Metrics

This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models.
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
Germán T. Eizaguirre, Lars Tissen, Marc Sánchez-Artigas · Feb 25, 2026 · Citations: 0

Automatic Metrics

Text-to-SQL and Big Data are both extensively benchmarked fields, yet there is limited research that evaluates them jointly.
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026 · Citations: 0

Automatic Metrics

Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco · Feb 24, 2026 · Citations: 0

Automatic Metrics

We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code.
Small Language Models for Privacy-Preserving Clinical Information Extraction in Low-Resource Languages
Mohammadreza Ghaffarzadeh-Esfahani, Nahid Yousefian, Ebrahim Heidari-Farsani, Ali Akbar Omidvarian, Sepehr Ghahraei · Feb 24, 2026 · Citations: 0

Automatic Metrics

Extracting clinical information from medical transcripts in low-resource languages remains a significant challenge in healthcare natural language processing (NLP).
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026 · Citations: 0

Automatic Metrics

We validate across five benchmarks, five models from three families, and both synthetic and real data.
Representation Theorems for Cumulative Propositional Dependence Logics
Juha Kontinen, Arne Meier, Kai Sauerwald · Feb 24, 2026 · Citations: 0

Automatic Metrics

This paper establishes and proves representation theorems for cumulative propositional dependence logic and for cumulative propositional logic with team semantics.
Equitable Evaluation via Elicitation
Elbert Du, Cynthia Dwork, Lunjia Hu, Reid McIlroy-Young, Han Shao · Feb 24, 2026 · Citations: 0

Automatic Metrics

To obtain sufficient training data, we train an LLM to act as synthetic humans.
Aletheia tackles FirstProof autonomously
Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov · Feb 24, 2026 · Citations: 0

Automatic Metrics

We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge.
Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026 · Citations: 0

Automatic Metrics

Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li · Feb 24, 2026 · Citations: 0

Automatic Metrics

Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof.
Evaluating Proactive Risk Awareness of Large Language Models
Xuan Luo, Yubin Chen, Zhiyu Hou, Linpu Yu, Geng Tu · Feb 24, 2026 · Citations: 0

Simulation Env

As large language models (LLMs) are increasingly embedded in everyday decision-making, their safety responsibilities extend beyond reacting to explicit harmful intent toward anticipating unintended but consequential risks.
Linear Reasoning vs. Proof by Cases: Obstacles for Large Language Models in FOL Problem Solving
Yuliang Ji, Fuchen Shen, Jian Wu, Qiujie Xie, Yue Zhang · Feb 24, 2026 · Citations: 0

Automatic Metrics

To comprehensively evaluate the mathematical reasoning capabilities of Large Language Models (LLMs), researchers have introduced abundant mathematical reasoning datasets.
Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Feb 24, 2026 · Citations: 0

Automatic Metrics

Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.
Pipeline for Verifying LLM-Generated Mathematical Solutions
Varvara Sazonova, Dmitri Shmelkin, Stanislav Kikot, Vasily Motolygin · Feb 24, 2026 · Citations: 0

Automatic Metrics

We introduce a pipeline for both automatic and interactive verification as a more accurate alternative to only checking the answer which is currently the most popular approach for benchmarks.
ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition
Xindian Ma, Rundong Kong, Peng Zhang, Ruoxiang Huang, Yongyu Jiang · Feb 24, 2026 · Citations: 0

Automatic Metrics

We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment.
Buffer Matters: Unleashing the Power of Off-Policy Reinforcement Learning in Large Language Model Reasoning
Xu Wan, Yansheng Wang, Wenqi Huang, Mingyang Sun · Feb 24, 2026 · Citations: 0

Automatic Metrics

Traditional on-policy Reinforcement Learning with Verifiable Rewards (RLVR) frameworks suffer from experience waste and reward homogeneity, which directly hinders learning efficiency on difficult samples during large language models post-tr
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0

Simulation Env Long Horizon

We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0

Automatic Metrics Long Horizon

Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026 · Citations: 0

Automatic Metrics

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
BabyLM Turns 4 and Goes Multilingual: Call for Papers for the 2026 BabyLM Workshop
Leshem Choshen, Ryan Cotterell, Mustafa Omer Gul, Jaap Jumelet, Tal Linzen · Feb 23, 2026 · Citations: 0

Automatic Metrics

For the workshop, we call for papers related to the overall theme of BabyLM, which includes training efficiency, small-scale training datasets, cognitive modeling, model evaluation, and architecture innovation.
Multilingual Large Language Models do not comprehend all natural languages to equal degrees
Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi · Feb 23, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) play a critical role in how humans access information.
Structured Prompt Language: Declarative Context Management for LLMs
Wen G. Gong · Feb 23, 2026 · Citations: 0

Automatic Metrics

SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script.
Position: General Alignment Has Hit a Ceiling; Edge Alignment Must Be Taken Seriously
Han Bao, Yue Huang, Xiaoda Wang, Zheyuan Zhang, Yujun Zhou · Feb 23, 2026 · Citations: 0

Automatic Metrics

We take the position that the dominant paradigm of General Alignment, which compresses diverse human values into a single scalar reward, reaches a structural ceiling in settings with conflicting values, plural stakeholders, and irreducible
Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026 · Citations: 0

Automatic Metrics

We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
SAMAS: A Spectrum-Guided Multi-Agent System for Achieving Style Fidelity in Literary Translation
Jingzhuo Wu, Jiajun Zhang, Keyan Jin, Dehua Ma, Junbo Wang · Feb 23, 2026 · Citations: 0

Automatic Metrics Multi Agent

This limitation stems from the inability of current single-model and static multi-agent systems to perceive and adapt to stylistic variations.
DEEP: Docker-based Execution and Evaluation Platform
Sergio Gómez González, Miguel Domingo, Francisco Casacuberta · Feb 23, 2026 · Citations: 0

Automatic Metrics

Comparative evaluation of several systems is a recurrent task in researching.
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026 · Citations: 0

Automatic Metrics

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection
Raihan Tanvir, Md. Golam Rabiul Alam · Feb 22, 2026 · Citations: 0

Automatic Metrics

Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives.
TurkicNLP: An NLP Toolkit for Turkic Languages
Sherzod Hakimov · Feb 22, 2026 · Citations: 0

Automatic Metrics

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources.
Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0

Automatic Metrics Long Horizon

Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026 · Citations: 0

Automatic Metrics

Personal AI agents incur substantial cost via repeated LLM calls.
Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026 · Citations: 0

Automatic Metrics

Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
Think$^{2}$: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026 · Citations: 0

Pairwise Preference Human Eval

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models
Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat · Feb 21, 2026 · Citations: 0

Automatic Metrics

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG).
Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026 · Citations: 0

Automatic Metrics Long Horizon

LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
