Benchmark Hub

Retrieval Or MATH Or GSM8K Benchmark Papers

Updated from current HFEPX corpus (Feb 27, 2026). 148 papers are grouped in this benchmark page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Ranking. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 148 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 148 papers for Retrieval Or MATH Or GSM8K Benchmark Papers. Dominant protocol signals include automatic metrics, simulation environments, and human evaluation, with frequent benchmark focus on Retrieval and MATH and metric focus on accuracy and cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

11.5% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations; MoDora: Tree-Based Semi-Structured Document Analysis System

Automatic metrics appear in 92.6% of papers in this hub.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations; MoDora: Tree-Based Semi-Structured Document Analysis System

Retrieval is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations; MoDora: Tree-Based Semi-Structured Document Analysis System; TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought

Protocol Takeaways

The most common quality-control signal is rater calibration (2.7% of papers).

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations; MoDora: Tree-Based Semi-Structured Document Analysis System

Raters are mostly domain experts, and the most common annotation unit is ranking; use this to scope replication staffing.

Evidence: Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA; AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations; MoDora: Tree-Based Semi-Structured Document Analysis System

Benchmark Interpretation

Retrieval appears in 77.7% of hub papers (115/148); use this cohort for benchmark-matched comparisons.
MATH appears in 13.5% of hub papers (20/148); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Accuracy is reported in 33.1% of hub papers (49/148); compare with a secondary metric before ranking methods.
Cost is reported in 8.8% of hub papers (13/148); compare with a secondary metric before ranking methods.
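
The coverage shares quoted in the benchmark and metric lists above appear to be plain ratios of paper counts to the 148-paper hub total, rounded to one decimal place. A minimal Python sketch, assuming that rounding rule (the names `coverage_pct` and `HUB_TOTAL` are illustrative and not part of HFEPX), recomputes them from the reported counts:

```python
HUB_TOTAL = 148  # papers grouped on this page


def coverage_pct(count: int, total: int = HUB_TOTAL) -> float:
    """Share of hub papers as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * count / total, 1)


# Paper counts as reported on this page.
reported = {"Retrieval": 115, "MATH": 20, "accuracy": 49, "cost": 13}
shares = {name: coverage_pct(n) for name, n in reported.items()}
print(shares)
```

Recomputing shares from raw counts like this is a cheap way to confirm that two cohorts are measured against the same denominator before comparing them.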

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (11.5% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (3.4% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (100% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (60.8% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (8.8% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (13.5% vs 35% target).
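
The checklist labels seem to follow a simple threshold: coverage at or above the target reads as "strong", anything below reads as a "replication risk". That rule is inferred from the six rows above and is not stated by the page. A hedged Python sketch under that assumption (all names here are hypothetical) ranks the items by how far they fall short:

```python
from typing import NamedTuple


class ChecklistItem(NamedTuple):
    label: str
    coverage: float  # percent of hub papers
    target: float    # percent target from the checklist


ITEMS = [
    ChecklistItem("explicit human feedback", 11.5, 45.0),
    ChecklistItem("quality controls", 3.4, 30.0),
    ChecklistItem("benchmarks/datasets named", 100.0, 35.0),
    ChecklistItem("evaluation metrics named", 60.8, 35.0),
    ChecklistItem("known rater population", 8.8, 35.0),
    ChecklistItem("known annotation unit", 13.5, 35.0),
]


def status(item: ChecklistItem) -> str:
    # Assumed rule: meeting or exceeding the target counts as strong.
    return "strong" if item.coverage >= item.target else "replication risk"


def gap(item: ChecklistItem) -> float:
    """Percentage points still needed to reach the target (0.0 if met)."""
    return round(max(0.0, item.target - item.coverage), 1)


# Largest shortfall first, so the weakest coverage areas surface at the top.
for item in sorted(ITEMS, key=gap, reverse=True):
    print(f"{item.label}: {status(item)}, gap {gap(item)} pts")
```

Sorting by gap puts explicit human feedback (33.5 points short) ahead of quality controls (26.6 points short), which may help decide where to look for supplementary protocol evidence first.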

Known Limitations

Only 3.4% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (8.8% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=2, right_only=136

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=2, left_only=135, right_only=9

2 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=11, right_only=3

0 papers use both Simulation Env and Human Eval.
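
Each comparison above gives three counts, where `left_only` refers to the first-named mode and `right_only` to the second. A small Python sketch (the `Overlap` class and helper names are illustrative assumptions, not an HFEPX API) derives per-mode totals and a Jaccard overlap from those counts, and checks that each mode's total agrees across the pairs it appears in:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlap:
    left: str
    right: str
    both: int
    left_only: int
    right_only: int

    def left_total(self) -> int:
        return self.both + self.left_only

    def right_total(self) -> int:
        return self.both + self.right_only

    def jaccard(self) -> float:
        """Papers using both modes divided by papers using either mode."""
        union = self.both + self.left_only + self.right_only
        return self.both / union if union else 0.0


# Counts as reported in the three comparisons on this page.
PAIRS = [
    Overlap("human_eval", "automatic_metrics", 1, 2, 136),
    Overlap("automatic_metrics", "simulation_env", 2, 135, 9),
    Overlap("simulation_env", "human_eval", 0, 11, 3),
]


def mode_totals(pairs):
    """Collect every total implied for each mode; one value means consistent."""
    totals = {}
    for p in pairs:
        for name, total in ((p.left, p.left_total()), (p.right, p.right_total())):
            totals.setdefault(name, set()).add(total)
    return totals


print(mode_totals(PAIRS))
```

Under these counts every mode resolves to a single total (3 for human_eval, 137 for automatic_metrics, 11 for simulation_env), and 137 of 148 papers matches the 92.6% automatic-metrics share quoted earlier on the page.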

Benchmark Brief

Retrieval

Coverage: 115 papers (77.7%)

115 papers (77.7%) mention Retrieval.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations; MoDora: Tree-Based Semi-Structured Document Analysis System

Benchmark Brief

MATH

Coverage: 20 papers (13.5%)

20 papers (13.5%) mention MATH.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning; Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards; RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Benchmark Brief

GSM8K

Coverage: 13 papers (8.8%)

13 papers (8.8%) mention GSM8K.

Examples: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Metric Brief

accuracy

Coverage: 49 papers (33.1%)

49 papers (33.1%) mention accuracy.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; MoDora: Tree-Based Semi-Structured Document Analysis System

Metric Brief

cost

Coverage: 13 papers (8.8%)

13 papers (8.8%) mention cost.

Examples: Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG; KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Structured Prompt Language: Declarative Context Management for LLMs

Metric Brief

recall

Coverage: 10 papers (6.8%)

10 papers (6.8%) mention recall.

Examples: Personalized Graph-Empowered Large Language Model for Proactive Information Access; E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications; RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning; InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models; MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations

Top Papers On This Benchmark

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026

Automatic Metrics Multi Agent

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026

Automatic Metrics

Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa · Feb 26, 2026

Automatic Metrics

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models.
MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026

Automatic Metrics

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang · Feb 26, 2026

Automatic Metrics

Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods.
Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026

Automatic Metrics

In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026

Automatic Metrics

We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026

Automatic Metrics

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026

Automatic Metrics Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi · Feb 25, 2026

Automatic Metrics

Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuaD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces bigger performance drop than masking Retrieval Heads (RH).
DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026

Automatic Metrics

Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modes
Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026

Automatic Metrics

Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential.
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026

Automatic Metrics

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026

Automatic Metrics

Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026

Automatic Metrics Long Horizon

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
Retrieval Challenges in Low-Resource Public Service Information: A Case Study on Food Pantry Access
Touseef Hasan, Laila Cure, Souvika Sarkar · Feb 25, 2026

Simulation Env

We conduct a pilot evaluation study using community-sourced queries to examine system behavior in realistic scenarios.
Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026

Automatic Metrics

Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
Revisiting Text Ranking in Deep Research
Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton · Feb 25, 2026

Automatic Metrics

To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
Adversarial Intent is a Latent Variable: Stateful Trust Inference for Securing Multimodal Agentic RAG
Inderjeet Singh, Vikas Pahuja, Aishvariya Priya Rathina Sabapathy, Chiara Picardi, Amit Giloni · Feb 24, 2026

Automatic Metrics

Current stateless defences for multimodal agentic RAG fail to detect adversarial strategies that distribute malicious semantics across retrieval, planning, and generation components.
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026

Automatic Metrics

Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026

Automatic Metrics

We validate across five benchmarks, five models from three families, and both synthetic and real data.
Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026

Automatic Metrics

We study efficient multi-vector retrieval for late interaction in any modality.
A Benchmark for Deep Information Synthesis
Debjit Paul, Daniel Murphy, Milan Gritta, Ronald Cardenas, Victor Prokhorov · Feb 24, 2026

Human Eval Automatic Metrics Tool Use

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis.
HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG
Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu · Feb 24, 2026

Automatic Metrics

Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8× speedup over leading Graph-based RAG baselines.
E-MMKGR: A Unified Multimodal Knowledge Graph Framework for E-commerce Applications
Jiwoo Kang, Yeon-Chang Lee · Feb 24, 2026

Automatic Metrics

Multimodal recommender systems (MMRSs) enhance collaborative filtering by leveraging item-side modalities, but their reliance on a fixed set of modalities and task-specific objectives limits both modality extensibility and task generalizati
RMIT-ADM+S at the MMU-RAG NeurIPS 2025 Competition
Kun Ran, Marwah Alaofi, Danula Hettiachchi, Chenglong Ma, Khoi Nguyen Dinh Anh · Feb 24, 2026

Automatic Metrics

R2RAG won the Best Dynamic Evaluation award in the Open Source category, demonstrating high effectiveness with careful design and efficient use of resources.
Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems
Mukul Chhabra, Luigi Medrano, Arush Verma · Feb 23, 2026

Automatic Metrics

Enterprise Retrieval-Augmented Generation (RAG) assistants operate in multi-turn, case-based workflows such as technical support and IT operations, where evaluation must reflect operational constraints, structured identifiers (e.g., error c
InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu · Feb 23, 2026

Simulation Env

Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said.
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026

Automatic Metrics

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
How Retrieved Context Shapes Internal Representations in RAG
Samuel Yeh, Sharon Li · Feb 23, 2026

Automatic Metrics

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by conditioning generation on retrieved external documents, but the effect of retrieved context is often non-trivial.
Structured Prompt Language: Declarative Context Management for LLMs
Wen G. Gong · Feb 23, 2026

Automatic Metrics

SPL-flow extends SPL into resilient agentic pipelines with a three-tier provider fallback strategy (Ollama -> OpenRouter -> self-healing retry) fully transparent to the .spl script.
Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina · Feb 23, 2026

Automatic Metrics

We introduce large-scale data curation pipelines and new benchmarks, compare modeling strategies, and show that modality fusion within a frozen text Matryoshka model performs best.
Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval
Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao · Feb 23, 2026

Automatic Metrics

We begin by examining the benchmark landscape, and subsequently dive into the methodological evolution, categorizing approaches into three primary aspects: multimodal embedding models, multimodal reranker models, and the integration of Retr
Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework
Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo · Feb 23, 2026

Automatic Metrics

Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications.
How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1
Yinuo Xu, Shuo Lu, Jianjie Cheng, Meng Wang, Qianlong Xie · Feb 23, 2026

Automatic Metrics

Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation.
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
Learning to Reason for Multi-Step Retrieval of Personal Context in Personalized Question Answering
Maryam Amirizaniani, Alireza Salemi, Hamed Zamani · Feb 22, 2026

Automatic Metrics Long Horizon

Personalization in Question Answering (QA) requires answers that are both accurate and aligned with users' background, preferences, and historical context.
Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection
Raihan Tanvir, Md. Golam Rabiul Alam · Feb 22, 2026

Automatic Metrics

Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives.
Facet-Level Persona Control by Trait-Activated Routing with Contrastive SAE for Role-Playing LLMs
Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide · Feb 22, 2026

Automatic Metrics

Personality control in Role-Playing Agents (RPAs) is commonly achieved via training-free methods that inject persona descriptions and memory through prompts or retrieval-augmented generation, or via supervised fine-tuning (SFT) on persona-s
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
Diogo Glória-Silva, David Semedo, João Magalhães · Feb 22, 2026

Automatic Metrics Long Horizon

Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.
AgenticRAGTracer: A Hop-Aware Benchmark for Diagnosing Multi-Step Retrieval Reasoning in Agentic RAG
Qijie You, Wenkai Yu, Wentao Zhang · Feb 22, 2026

Automatic Metrics Long Horizon

With the rapid advancement of agent-based methods in recent years, Agentic RAG has undoubtedly become an important research direction.
Think²: Grounded Metacognitive Reasoning in Large Language Models
Abraham Paul Elenjical, Vivek Hruday Kavuri, Vasudeva Varma · Feb 21, 2026

Human Eval

We introduce a psychologically grounded metacognitive framework that operationalizes Ann Brown's regulatory cycle (Planning, Monitoring, and Evaluation) as a structured prompting architecture, and study its integration within a lightweight
Rethinking Retrieval-Augmented Generation as a Cooperative Decision-Making Problem
Lichang Song, Ting Long, Yi Chang · Feb 21, 2026

Automatic Metrics Multi Agent

To overcome this limitation, we reformulate RAG as a cooperative multi-agent decision-making problem and propose Cooperative Retrieval-Augmented Generation (CoRAG), a framework in which the reranker and the generator act as peer decision-ma
Contradiction to Consensus: Dual Perspective, Multi Source Retrieval Based Claim Verification with Source Level Disagreement using LLM
Md Badsha Biswas, Ozlem Uzuner · Feb 21, 2026

Automatic Metrics

Through extensive evaluation on four benchmark datasets with five LLMs, we show that knowledge aggregation not only improves claim verification but also reveals differences in source-specific reasoning.
Diagnosing LLM Reranker Behavior Under Fixed Evidence Pools
Baris Arat, Emre Sefer · Feb 20, 2026

Automatic Metrics

Standard reranking evaluations study how a reranker orders candidates returned by an upstream retriever.
RVR: Retrieve-Verify-Retrieve for Comprehensive Question Answering
Deniz Qian, Hung-Ting Chen, Eunsol Choi · Feb 20, 2026

Automatic Metrics

Our method outperforms baselines, including agentic search approaches, achieving at least 10% relative and 3% absolute gain in complete recall percentage on a multi-answer retrieval dataset (QAMPARI).
SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026

Automatic Metrics Simulation Env

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026

Human Eval

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026

Automatic Metrics

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering
Amine Kobeissi, Philippe Langlais · Feb 20, 2026

Automatic Metrics

Retrieval-augmented generation is increasingly used for financial question answering over long regulatory filings, yet reliability depends on retrieving the exact context needed to justify answers in high stakes settings.
CUICurate: A GraphRAG-based Framework for Automated Clinical Concept Curation for NLP applications
Victoria Blake, Mathew Miller, Jamie Novak, Sze-yuan Ooi, Blanca Gallego · Feb 20, 2026

Automatic Metrics

The framework was evaluated on five lexically heterogeneous clinical concepts against a manually curated benchmark and gold-standard concept sets.
Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering
Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman · Feb 20, 2026

Automatic Metrics

Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context.
Improving Neural Topic Modeling with Semantically-Grounded Soft Label Distributions
Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini · Feb 20, 2026

Automatic Metrics

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity.
TFL: Targeted Bit-Flip Attack on Large Language Model
Jingkai Guo, Chaitali Chakrabarti, Deliang Fan · Feb 19, 2026

Automatic Metrics

Large language models (LLMs) are increasingly deployed in safety and security critical applications, raising concerns about their robustness to model parameter fault injection attacks.
QueryPlot: Generating Geological Evidence Layers using Natural Language Queries for Mineral Exploration
Meng Ye, Xiao Lin, Georgina Lukoczki, Graham W. Lederer, Yi Yao · Feb 19, 2026

Automatic Metrics

Mineral prospectivity mapping requires synthesizing heterogeneous geological knowledge, including textual deposit models and geospatial datasets, to identify regions likely to host specific mineral deposit types.
Unmasking the Factual-Conceptual Gap in Persian Language Models
Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat · Feb 19, 2026

Automatic Metrics

While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms.
PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions
Greta Damo, Stéphane Petiot, Elena Cabrio, Serena Villata · Feb 19, 2026

Automatic Metrics

The increasing volume of hate speech on online platforms poses significant societal challenges.
The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour
Leonidas Zotos, Hedderik van Rijn, Malvina Nissim · Feb 19, 2026

Automatic Metrics

When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice.
RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering
Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao · Feb 19, 2026

Automatic Metrics

We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories.
WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
Michael Dinzinger, Laura Caspari, Ali Salman, Irvin Topi, Jelena Mitrović · Feb 19, 2026

Automatic Metrics

We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages.
