HFEPX Benchmark Hub

AIME or HotpotQA or MMLU Benchmark Papers

Updated from the current HFEPX corpus (2026-07-24). This page tracks 60 papers that report AIME, HotpotQA, or MMLU benchmark evidence, with protocol and metric context for comparison.

Papers: 60 · Last published: Jul 2, 2026

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: High.

High-Signal Coverage

100.0%

60 of 60 sampled papers are not flagged as low-signal.

Replication-Ready Set

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

5.0%

3 papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.

60 papers explicitly name benchmark datasets in the sampled set.
41 papers report at least one metric term in metadata extraction.
Start with the ranked shortlist below before reading all papers.

Primary action: Start with the top 2 benchmark-matched papers, then compare evaluation modes in the protocol matrix.
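The coverage percentages in this triage block are plain ratios over the 60-paper sample. A minimal sketch of that arithmetic follows; the dictionary keys and the `coverage` helper are illustrative names, not fields from the HFEPX corpus, and only the counts are taken from this page.

```python
SAMPLE_SIZE = 60  # papers in the sampled set, as reported on this page

# Counts as reported on this page; the keys are illustrative labels only.
counts = {
    "high_signal": 60,        # papers not flagged as low-signal
    "quality_controls": 3,    # papers reporting calibration/adjudication/IAA
    "names_benchmark": 60,    # papers that explicitly name a benchmark
    "reports_metric": 41,     # papers with at least one metric term
}


def coverage(count: int, total: int = SAMPLE_SIZE) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


shares = {label: coverage(n) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

Under these counts, the quality-controls share comes out at 5.0% and the high-signal share at 100.0%, matching the figures shown above.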

Why This Matters (Expanded)

Why This Matters For Eval Research

Use this page to compare AIME, HotpotQA, or MMLU papers by evaluation mode, metric, and evidence quality before reusing reported results.

Protocol Notes (Expanded)

Protocol Takeaways

AIME, HotpotQA, and MMLU papers are most often paired with automatic metrics (38 of 60 papers) and LLM-as-judge evaluation (4 of 60 papers).

Benchmark Interpretation

MMLU: 29 papers
HotpotQA: 16 papers
AIME: 15 papers
GSM8K: 10 papers

Metric Interpretation

accuracy: 21 papers
cost: 12 papers
latency: 5 papers
f1: 4 papers

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

Will Scaling Improve Social Simulation with LLMs?
Jul 2, 2026 · Citations: 0 · Score: 8.5
Eval: Automatic Metrics, Simulation Env · Metrics: Accuracy

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Jun 24, 2026 · Citations: 0 · Score: 8.0
Eval: Automatic Metrics · Metrics: Accuracy

Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Apr 13, 2026 · Citations: 0 · Score: 7.5
Eval: LLM As Judge · Metrics: Precision

Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Apr 2, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Accuracy

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Mar 28, 2026 · Citations: 0 · Score: 7.5
Eval: LLM As Judge, Automatic Metrics · Metrics: Accuracy

PARTREP: Learning What to Repeat for Decoder-only LLMs
Jul 2, 2026 · Citations: 0 · Score: 7.0
Eval: Automatic Metrics · Metrics: NLL

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

Paper	Date	Eval Modes	Human Feedback	Metrics	Quality Controls
Will Scaling Improve Social Simulation with LLMs?	Jul 2, 2026	Automatic Metrics, Simulation Env	Not reported	Accuracy	Calibration
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning	Jun 24, 2026	Automatic Metrics	Pairwise Preference	Accuracy, Pass@64	Not reported
Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking	Apr 13, 2026	LLM As Judge	Demonstrations	Precision, Agreement	Not reported
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite	Apr 2, 2026	Automatic Metrics	Not reported	Accuracy	Calibration, Gold Questions
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering	Mar 28, 2026	LLM As Judge, Automatic Metrics	Expert Verification	Accuracy, Relevance	Not reported
PARTREP: Learning What to Repeat for Decoder-only LLMs	Jul 2, 2026	Automatic Metrics	Not reported	NLL, Cost	Not reported
What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It	Jul 1, 2026	Automatic Metrics	Not reported	Accuracy, F1	Not reported
Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense	Jun 28, 2026	Automatic Metrics	Not reported	AUROC, Cost	Not reported
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA	Oct 27, 2025	Automatic Metrics	Pairwise Preference	MSE	Not reported
NITP: Next Implicit Token Prediction for LLM Pre-training	May 24, 2026	Not reported	Not reported	Cost, Inference cost	Not reported
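One way to use the matrix is to keep only rows that share both a metric and an evaluation mode before comparing reported numbers. The sketch below hand-copies five rows from the matrix; the `Row` class and the `comparable` and `with_controls` helpers are hypothetical names introduced for illustration, not part of any HFEPX tooling.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Row:
    paper: str
    eval_modes: Tuple[str, ...]   # empty tuple means "Not reported"
    metrics: Tuple[str, ...]
    quality_controls: Tuple[str, ...]  # empty tuple means "Not reported"


# Five rows copied by hand from the matrix; titles are shortened here.
ROWS = [
    Row("Will Scaling Improve Social Simulation with LLMs?",
        ("Automatic Metrics", "Simulation Env"), ("Accuracy",), ("Calibration",)),
    Row("Cliff Tokens",
        ("Automatic Metrics",), ("Accuracy", "Pass@64"), ()),
    Row("Diagnosing Translated Benchmarks",
        ("Automatic Metrics",), ("Accuracy",), ("Calibration", "Gold Questions")),
    Row("What Survives Into Context",
        ("Automatic Metrics",), ("Accuracy", "F1"), ()),
    Row("NITP",
        (), ("Cost", "Inference cost"), ()),
]


def comparable(rows: List[Row], metric: str, eval_mode: str) -> List[Row]:
    """Rows that report the given metric under the given evaluation mode."""
    return [r for r in rows if metric in r.metrics and eval_mode in r.eval_modes]


def with_controls(rows: List[Row]) -> List[Row]:
    """Rows that report at least one quality control."""
    return [r for r in rows if r.quality_controls]


matched = comparable(ROWS, "Accuracy", "Automatic Metrics")
controlled = with_controls(matched)
print([r.paper for r in matched])
print([r.paper for r in controlled])
```

Sharing a metric name and evaluation mode is only a first filter. The papers may still differ in benchmark split, prompting, or scoring details, so check those in the papers themselves before comparing reported numbers.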

Researcher Workflow (Detailed)

Checklist

Gap: Human feedback
Human feedback is present in 6 of 60 papers.

Gap: Quality controls
Quality controls are present in 3 of 60 papers.

Strong: Benchmarks
Benchmarks are present in 60 of 60 papers.

Strong: Metrics
Metrics are present in 41 of 60 papers.

Gap: Known rater population
A known rater population is present in 5 of 60 papers.

Gap: Known annotation unit
A known annotation unit is present in 8 of 60 papers.

Strengths

Benchmarks are present in 60 of 60 papers.
Metrics are present in 41 of 60 papers.
Agentic evaluation is present in 11 of 60 papers.

Known Gaps

Human feedback is present in 6 of 60 papers.
Quality controls are present in 3 of 60 papers.
A known rater population is present in 5 of 60 papers.

Suggested Next Analyses

Review the most recent AIME, HotpotQA, or MMLU papers first, then compare reported metrics and quality-control context before treating results as comparable.
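A small sketch of the "most recent first" step, assuming dates are written as on this page (for example "Jul 2, 2026") and that the process runs under an English locale, since `%b` month parsing is locale-dependent. The `PAPERS` list is a hand-copied subset of entries from this page, and `newest_first` is an illustrative helper name.

```python
from datetime import datetime

# (title, date) pairs copied from this page; titles are shortened here.
PAPERS = [
    ("GIFT", "Oct 27, 2025"),
    ("Hidden Measurement Error in LLM Pipelines", "Apr 13, 2026"),
    ("Will Scaling Improve Social Simulation with LLMs?", "Jul 2, 2026"),
    ("Cliff Tokens", "Jun 24, 2026"),
    ("PARTREP", "Jul 2, 2026"),
]


def parse_date(text: str) -> datetime:
    """Parse dates in the 'Mon D, YYYY' style used on this page."""
    return datetime.strptime(text, "%b %d, %Y")


def newest_first(papers):
    """Sort (title, date) pairs by date, most recent first.

    sorted() is stable, so papers sharing a date keep their input order.
    """
    return sorted(papers, key=lambda p: parse_date(p[1]), reverse=True)


ordered = newest_first(PAPERS)
for title, date in ordered:
    print(f"{date}: {title}")
```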

Recommended Queries

Search AIME or HotpotQA or MMLU papers

Known Limitations

This page is a synthetic, persisted fallback generated from extraction data because the cached benchmark payload for either-aime-or-hotpotqa-or-mmlu was missing.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (38)
LLM As Judge (4)
Simulation Env (3)

Human Feedback Mix

None (54)
Pairwise Preference (3)
Demonstrations (2)
Expert Verification (1)

Top Benchmarks

MMLU (29)
HotpotQA (16)
AIME (15)
GSM8K (10)

Top Metrics

Accuracy (21)
Cost (12)
Latency (5)
F1 (4)

Top Papers On This Benchmark

Will Scaling Improve Social Simulation with LLMs?
Caleb Ziems, William Held, Su Doga Karaca, David Grusky, Tatsunori Hashimoto · Jul 2, 2026 · Citations: 0

Automatic Metrics · Simulation Env

We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal…
NITP: Next Implicit Token Prediction for LLM Pre-training
Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng · May 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PARTREP: Learning What to Repeat for Decoder-only LLMs
Andikawati P Widjaja, Yongjun Kim, Hyounghun Kim, Jaeho Lee · Jul 2, 2026 · Citations: 0

Automatic Metrics

Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4% of its KV cache and 79.0% of its prefill FLOPs.
What Survives Into Context: A Diagnostic for Budget-Constrained Multi-Hop RAG and When Submodular Evidence Packing Improves It
Ananto Nayan Bala · Jul 1, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
On Compositional Learning Behaviours in Formal Mathematics
Kevin Yandoka Denamganaï · May 27, 2026 · Citations: 0

Automatic Metrics

Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs) -- the capacity to ground and recombine novel symbolic structures in context, beyond mere…
Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense
Subhadip Mitra · Jun 28, 2026 · Citations: 0

Automatic Metrics

Inference-time safety methods for large language models have proliferated, yet no systematic comparison exists.
Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Steven Kolawole, Virginia Smith · Jun 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo · Apr 9, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management
Binqi Shen, Lier Jin, Hanyu Cai, Lan Hu, Yuting Xin · May 21, 2026 · Citations: 0

Automatic Metrics

Unlike existing evaluations that compare methods in isolation, the proposed framework enables decision-oriented analysis.
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Jaeyong Ko, Pilsung Kang, Yukyung Lee · Jun 24, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

Across seven models and three mathematical reasoning benchmarks (GSM1K, MATH500, AIME 2025), cliff tokens act as failure triggers; deleting the first cliff token and resampling recovers pass@64 to 1.0, while keeping it limits recovery to…
SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment
Tianyu Dong, Yangyang Liu, Jiang Zhou, Xinwei Wu, Xiaohu Zhao · Jun 24, 2026 · Citations: 0

We conduct experiments on 2 LLMs across 5 low-resource languages and 3 benchmarks.
Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning
Chenhao Dang, Jing Ma, Mingjie Liao · Jun 23, 2026 · Citations: 0

Automatic Metrics

On The Pile benchmark, HDS reaches the final validation perplexity of the next best method with 44% fewer training iterations.
RoPE-Aware Bit Allocation for KV-Cache Quantization
Fengfeng Liang, Yuechen Zhang, Jiaya Jia · Jun 23, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning
Ali Asgarov, Umid Suleymanov, Aadyant Khatri · Oct 31, 2025 · Citations: 0

Automatic Metrics

We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings…
Where Does Social Reasoning Come From? Capability Provenance in Language Models
Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla · Jun 17, 2026 · Citations: 0

Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has…
Resolution Diagnostics for Paired LLM Evaluation
Anany Kotawala · May 28, 2026 · Citations: 0

Pairwise Preference

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow · Mar 5, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Latent Performance Profiling of Large Language Models
Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti · May 28, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities.
Accelerated Test-Time Scaling with Model-Free Speculative Sampling
Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin · Jun 5, 2025 · Citations: 0

Automatic Metrics

Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining…
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman · May 8, 2026 · Citations: 0

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
Tianyi Huang, Samuel Xu, Jason Tansong Dang, Samuel Yan, Kimberley Yin · Apr 19, 2026 · Citations: 0

Agentic systems often fail not by being entirely wrong, but by being too precise: a response may be generally useful while particular claims exceed what the evidence supports.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun · May 8, 2026 · Citations: 0

To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence.
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
Tsuyoshi Okita · May 8, 2026 · Citations: 0

Simulation Env

With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7% on the contamination-free Omni-MATH-Rule benchmark and 64.0% overall, compared to 60.5% for o1-mini, and 97.2% on GSM8K, 46-50% on AIME 2024-2026, and…
Beyond Factual Accuracy: Evaluating Global Reasoning Integrity in RAG Systems with LogicScore
Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Xiaoli Li · Jan 21, 2026 · Citations: 0

Automatic Metrics

Current evaluation methods for Retrieval Augmented Generation (RAG) suffer from factual myopia: they relentlessly emphasize factual accuracy yet neglect global logical integrity in long-form answer generation.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang · May 7, 2026 · Citations: 0

Automatic Metrics

Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls.
Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Solomon Messing · Apr 13, 2026 · Citations: 0

Demonstrations · LLM As Judge

LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made.
FIT to Forget: Robust Continual Unlearning for Large Language Models
Xiaoyu Xu, Minxin Du, Kun Fang, Yaxin Xiao, Zhicong Huang · Jan 29, 2026 · Citations: 0

Furthermore, to facilitate rigorous evaluation, we introduce PCH, a unified benchmark encompassing Personal, Copyrighted, and Harmful content, alongside two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), to…
Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
Lin Yao · Apr 20, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning
Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang · Dec 17, 2025 · Citations: 0

We conduct experiments across various models and benchmarks; the results show that SCOPE consistently outperforms recent baselines.
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
Pere Martra · Dec 27, 2025 · Citations: 0

Automatic Metrics

We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness.
ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback
Matthew Penaroza · Apr 8, 2026 · Citations: 0

Automatic Metrics

Language model agents reason from scratch on every query, discarding their chain of thought after each run.
SURE-RAG: Sufficiency and Uncertainty-Aware Evidence Verification for Selective Retrieval-Augmented Generation
Jingxi Qiu, Zeyu Han, Cheng Huang · May 5, 2026 · Citations: 0

LLM As Judge · Automatic Metrics

We evaluate on HotpotQA-RAG v3, a controlled multi-hop benchmark, under an artifact-aware protocol (shortcut baselines, counterfactual swaps, no-oracle checks, GPT-4o audits).
UR^2: Unify RAG and Reasoning through Reinforcement Learning
Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma · Aug 8, 2025 · Citations: 0

Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR^2, built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to…
Cost-Effective Communication: An Auction-based Method for Language Agent Interaction
Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang · Nov 17, 2025 · Citations: 0

Automatic Metrics

To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource.
RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025 · Citations: 0

Automatic Metrics

A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); finally, the Head Agent composes…
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang · Oct 1, 2025 · Citations: 0

Automatic Metrics

We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process.
DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0

LLM As Judge · Automatic Metrics

Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Automatic Metrics

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0

Pairwise Preference · Automatic Metrics

This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue
Ziyi Liu · Sep 22, 2025 · Citations: 0

Our work offers an effective solution for optimizing LLMs in long-range interactions, providing new insights for developing more robust Agents.
MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents
Shu Wang, Edwin Yu, Oscar Love, Tom Zhang, Tom Wong · Apr 6, 2026 · Citations: 0

Automatic Metrics

Large Language Model (LLM) agents require persistent memory to maintain personalization, factual continuity, and long-horizon reasoning, yet standard context-window and retrieval-augmented generation (RAG) pipelines degrade over…
Search, Do not Guess: Teaching Small Language Models to Be Effective Search Agents
Yizhou Liu, Qi Sun, Yulin Chen, Siyue Zhang, Chen Zhao · Apr 6, 2026 · Citations: 0

Automatic Metrics

Agents equipped with search tools have emerged as effective solutions for knowledge-intensive tasks.
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Klaudia Thellmann, Bernhard Stadler, Michael Färber · Apr 2, 2026 · Citations: 0

Automatic Metrics

Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence.
OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta · Apr 2, 2026 · Citations: 0

Automatic Metrics

We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods.
Let the Model Distribute Its Doubt: Confidence Estimation through Verbalized Probability Distribution
Ante Wang, Weizhi Ma, Yang Liu · Nov 18, 2025 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong · Aug 11, 2025 · Citations: 0

Automatic Metrics

We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks.
CLAUSE: Agentic Neuro-Symbolic Knowledge Graph Reasoning via Dynamic Learnable Context Engineering
Yang Zhao, Chengxiao Dai, Wei Zhuo, Yue Xiu, Dusit Niyato · Sep 25, 2025 · Citations: 0

Automatic Metrics

We introduce CLAUSE, an agentic three-agent neuro-symbolic framework that treats context construction as a sequential decision process over knowledge graphs, deciding what to expand, which paths to follow or backtrack, what evidence to…
Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri · Aug 19, 2025 · Citations: 0

Automatic Metrics

We further validate practical utility with downstream sentence embedding benchmarks under a strict random initialization control to isolate tokenizer inductive bias.
EngGPT2: Sovereign, Efficient and Open Intelligence
G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico · Mar 17, 2026 · Citations: 0

EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3's 36T or Llama3's 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring…
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0

Expert Verification · LLM As Judge · Automatic Metrics

In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke · Mar 7, 2025 · Citations: 0

When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and…
Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0

Demonstrations · Simulation Env

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models
Richard J. Young · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma · Mar 27, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Large Language Model as Token Compressor and Decompressor
Wenbing Li, Zikai Song, Jielei Zhang, Tianhao Zhao, Junkai Lin · Mar 26, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PIDP-Attack: Combining Prompt Injection with Database Poisoning Attacks on Retrieval-Augmented Generation Systems
Haozhen Wang, Haoyue Liu, Jionghao Zhu, Zhichao Wang, Yongxin Guo · Mar 26, 2026 · Citations: 0

Automatic Metrics

Experimental evaluations across three benchmark datasets (Natural Questions, HotpotQA, MS-MARCO) and eight LLMs demonstrate that PIDP-Attack consistently outperforms the original PoisonedRAG.
Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval
Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob, Tamkeen Fatima · Mar 18, 2026 · Citations: 0

Automatic Metrics

The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA.
LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale
Muhammed Saeed, Simon Razniewski · Mar 25, 2026 · Citations: 0

Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%.
Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra · May 21, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye · Mar 19, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Related Benchmark Hubs

GPQA Or BFCL Benchmark Papers (23)
HumanEval+ Or BFCL Benchmark Papers (21)
MATH-500 Or BFCL Benchmark Papers (25)
GPQA Or HumanEval+ Benchmark Papers (22)
MATH-500 Or GPQA Benchmark Papers (25)
MATH-500 Or HumanEval+ Benchmark Papers (24)
MATH-500 Benchmark Papers (15)
MMLU Or AIME Or AlpacaEval Benchmark Papers (60)
MMLU Or AIME Or BFCL Benchmark Papers (59)
GSM8K Or MMLU Or AIME Benchmark Papers (72)
MMLU Or AIME Or HumanEval+ Benchmark Papers (57)