HFEPX Benchmark Hub

AIME or AlpacaEval or MMLU Benchmark Papers

Updated from the current HFEPX corpus (2026-07-16). This page tracks 60 papers reporting AIME, AlpacaEval, or MMLU benchmark evidence, with protocol and metric context for comparison.

Papers: 60 · Last published: Jul 2, 2026

Researcher Quick Triage

Use this page for benchmark-matched method comparisons and eval protocol selection. Quality band: High.

High-Signal Coverage

100.0%

60 of 60 sampled papers are not flagged as low-signal.

Replication-Ready Set

Papers with explicit benchmark + metric + eval mode fields.

Quality Controls

6.7%

4 papers report calibration, adjudication, or inter-annotator agreement (IAA) controls.

60 papers explicitly name benchmark datasets in the sampled set.
38 papers report at least one metric term in metadata extraction.
Start with the ranked shortlist below before reading all papers.

Primary action: Start with the top 2 benchmark-matched papers, then compare evaluation modes in the protocol matrix.
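The percentages in the triage summary above follow directly from the paper counts. As a minimal sketch (the helper name `coverage_pct` is illustrative and not part of HFEPX), they can be reproduced like this:

```python
def coverage_pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * count / total, 1)


TOTAL_PAPERS = 60  # sample size reported on this page

# Counts taken from the triage summary on this page.
high_signal = coverage_pct(60, TOTAL_PAPERS)       # papers not flagged low-signal
quality_controls = coverage_pct(4, TOTAL_PAPERS)   # calibration/adjudication/IAA
metrics_reported = coverage_pct(38, TOTAL_PAPERS)  # at least one metric term

print(high_signal, quality_controls, metrics_reported)  # prints: 100.0 6.7 63.3
```

The third figure (63.3%) is not shown on the page; it is derived here from the stated count of 38 papers with at least one metric term.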

Why This Matters (Expanded)

Why This Matters For Eval Research

Use this page to compare AIME, AlpacaEval, or MMLU papers by evaluation mode, metric, and evidence quality before reusing reported results.

Protocol Notes (Expanded)

Protocol Takeaways

AIME, AlpacaEval, and MMLU papers are often paired with automatic metrics and LLM-as-judge evaluation.

Benchmark Interpretation

MMLU: 39 papers
AIME: 16 papers
GSM8K: 12 papers
MMLU-Pro: 7 papers

Metric Interpretation

accuracy: 22 papers
cost: 9 papers
perplexity: 4 papers
latency: 3 papers

Start Here (Benchmark-Matched First 6)

Ranked by protocol completeness so you can quickly find papers suitable for comparison studies.

Will Scaling Improve Social Simulation with LLMs?
Jul 2, 2026 · Citations: 0 · Score: 8.5
Eval: Automatic Metrics, Simulation Env · Metrics: Accuracy

Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Jun 24, 2026 · Citations: 0 · Score: 8.5
Eval: Automatic Metrics · Metrics: Accuracy

Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Apr 13, 2026 · Citations: 0 · Score: 7.5
Eval: LLM as Judge · Metrics: Precision

Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Apr 2, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Accuracy

PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Mar 28, 2026 · Citations: 0 · Score: 7.5
Eval: LLM as Judge, Automatic Metrics · Metrics: Accuracy

DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
Mar 23, 2026 · Citations: 0 · Score: 7.5
Eval: Automatic Metrics · Metrics: Accuracy

Protocol Matrix (Top 10)

Compare protocol ingredients quickly before deep-reading full papers.

Paper	Date	Eval Modes	Human Feedback	Metrics	Quality Controls
Will Scaling Improve Social Simulation with LLMs?	Jul 2, 2026	Automatic Metrics, Simulation Env	Not reported	Accuracy	Calibration
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning	Jun 24, 2026	Automatic Metrics	Pairwise Preference	Accuracy, Pass@64	Not reported
Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking	Apr 13, 2026	LLM as Judge	Demonstrations	Precision, Agreement	Not reported
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite	Apr 2, 2026	Automatic Metrics	Not reported	Accuracy	Calibration, Gold Questions
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering	Mar 28, 2026	LLM as Judge, Automatic Metrics	Expert Verification	Accuracy, Relevance	Not reported
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment	Mar 23, 2026	Automatic Metrics	Pairwise Preference	Accuracy	Not reported
PARTREP: Learning What to Repeat for Decoder-only LLMs	Jul 2, 2026	Automatic Metrics	Not reported	NLL, Cost	Not reported
Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense	Jun 28, 2026	Automatic Metrics	Not reported	AUROC, Cost	Not reported
Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning	Jun 23, 2026	Automatic Metrics	Not reported	Perplexity	Not reported
RoPE-Aware Bit Allocation for KV-Cache Quantization	Jun 23, 2026	Automatic Metrics	Not reported	MAE, MSE	Not reported
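One way to use the matrix is to filter it programmatically before deep-reading. The sketch below transcribes four rows from the matrix and applies two checks. The `ProtocolRow` type and the `shares_protocol` rule (at least one shared eval mode and one shared metric) are assumptions made for illustration, not HFEPX's own comparison logic.

```python
from dataclasses import dataclass

NOT_REPORTED = "Not reported"


@dataclass(frozen=True)
class ProtocolRow:
    paper: str
    eval_modes: frozenset
    human_feedback: str
    metrics: frozenset
    quality_controls: str


# Four rows transcribed from the protocol matrix on this page.
ROWS = [
    ProtocolRow(
        "Will Scaling Improve Social Simulation with LLMs?",
        frozenset({"Automatic Metrics", "Simulation Env"}),
        NOT_REPORTED,
        frozenset({"Accuracy"}),
        "Calibration",
    ),
    ProtocolRow(
        "Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning",
        frozenset({"Automatic Metrics"}),
        "Pairwise Preference",
        frozenset({"Accuracy", "Pass@64"}),
        NOT_REPORTED,
    ),
    ProtocolRow(
        "Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite",
        frozenset({"Automatic Metrics"}),
        NOT_REPORTED,
        frozenset({"Accuracy"}),
        "Calibration, Gold Questions",
    ),
    ProtocolRow(
        "Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning",
        frozenset({"Automatic Metrics"}),
        NOT_REPORTED,
        frozenset({"Perplexity"}),
        NOT_REPORTED,
    ),
]


def reports_quality_controls(row: ProtocolRow) -> bool:
    """True when the row lists any quality control."""
    return row.quality_controls != NOT_REPORTED


def shares_protocol(a: ProtocolRow, b: ProtocolRow) -> bool:
    """Assumed rule: at least one shared eval mode and one shared metric."""
    return bool(a.eval_modes & b.eval_modes) and bool(a.metrics & b.metrics)


with_qc = [r.paper for r in ROWS if reports_quality_controls(r)]
candidate_pairs = [
    (a.paper, b.paper)
    for i, a in enumerate(ROWS)
    for b in ROWS[i + 1:]
    if shares_protocol(a, b)
]
```

A shared eval mode and metric is only a first filter. Two papers that both report accuracy may still use different benchmarks or splits, so any candidate pair needs a manual check before its results are treated as comparable.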

Researcher Workflow (Detailed)

Checklist

Gap: Human feedback
Human feedback is present in 12 of 60 papers.

Gap: Quality controls
Quality controls are present in 4 of 60 papers.

Strong: Benchmarks
Benchmarks are present in 60 of 60 papers.

Strong: Metrics
Metrics are present in 38 of 60 papers.

Gap: Known rater population
Known rater population is present in 6 of 60 papers.

Gap: Known annotation unit
Known annotation unit is present in 8 of 60 papers.
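The page does not state the rule that separates "Gap" from "Strong". As a rough sketch, a single coverage cutoff is enough to reproduce the labels in this checklist. The 50% value below is an assumption chosen for illustration; any cutoff between 20% (12 of 60) and about 63% (38 of 60) gives the same result.

```python
TOTAL_PAPERS = 60
ASSUMED_CUTOFF = 0.5  # hypothetical threshold; not documented on this page

# Paper counts per protocol field, copied from the checklist.
CHECKLIST = {
    "Human feedback": 12,
    "Quality controls": 4,
    "Benchmarks": 60,
    "Metrics": 38,
    "Known rater population": 6,
    "Known annotation unit": 8,
}


def label(count: int, total: int = TOTAL_PAPERS, cutoff: float = ASSUMED_CUTOFF) -> str:
    """Return 'Strong' when coverage meets the cutoff, otherwise 'Gap'."""
    return "Strong" if count / total >= cutoff else "Gap"


labels = {field: label(count) for field, count in CHECKLIST.items()}

for field, status in labels.items():
    print(f"{status}: {field} ({CHECKLIST[field]} of {TOTAL_PAPERS})")
```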

Strengths

Benchmarks are present in 60 of 60 papers.
Metrics are present in 38 of 60 papers.

Known Gaps

Human feedback is present in 12 of 60 papers.
Quality controls are present in 4 of 60 papers.
Known rater population is present in 6 of 60 papers.

Suggested Next Analyses

Review the most recent AIME, AlpacaEval, or MMLU papers first, then compare reported metrics and quality-control context before treating results as comparable.

Recommended Queries

Search AIME or AlpacaEval or MMLU papers

Known Limitations

This synthetic persisted page is generated from extraction data because the cached benchmark payload was missing for either-aime-or-alpacaeval-or-mmlu.

Research Utility Snapshot (Detailed)

Evaluation Modes

Automatic Metrics (34)
LLM as Judge (3)
Simulation Env (3)

Human Feedback Mix

None (48)
Pairwise Preference (9)
Demonstrations (1)
Expert Verification (1)

Top Benchmarks

MMLU (39)
AIME (16)
GSM8K (12)
MMLU-Pro (7)

Top Metrics

Accuracy (22)
Cost (9)
Perplexity (4)
Latency (3)

Top Papers On This Benchmark

Will Scaling Improve Social Simulation with LLMs?
Caleb Ziems, William Held, Su Doga Karaca, David Grusky, Tatsunori Hashimoto · Jul 2, 2026 · Citations: 0

Automatic Metrics · Simulation Env

We use scaling laws to study the relationship between LLMs' compute scale, general capability benchmarks, and the fidelity of social simulation in three representative sub-domains: opinion modeling, behavioral simulation, and longitudinal…
NITP: Next Implicit Token Prediction for LLM Pre-training
Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng · May 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
PARTREP: Learning What to Repeat for Decoder-only LLMs
Andikawati P Widjaja, Yongjun Kim, Hyounghun Kim, Jaeho Lee · Jul 2, 2026 · Citations: 0

Automatic Metrics

Across eight benchmarks (including MMLU, GSM8K, and RULER) and three model families (Qwen2.5, Llama3.2, Gemma4), PartRep retains most of the gains of full repetition while using only 59.4% of its KV cache and 79.0% of its prefill FLOPs.
On Compositional Learning Behaviours in Formal Mathematics
Kevin Yandoka Denamganaï · May 27, 2026 · Citations: 0

Automatic Metrics

Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs) -- the capacity to ground and recombine novel symbolic structures in context, beyond mere…
Closing the Activation-Cone Blind Spot: Response-Time Probing and Unified Defense
Subhadip Mitra · Jun 28, 2026 · Citations: 0

Automatic Metrics

Inference-time safety methods for large language models have proliferated, yet no systematic comparison exists.
Epiphany-Aware KV Cache Eviction Without the Attention Matrix
Steven Kolawole, Virginia Smith · Jun 25, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo · Apr 9, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Cliff Tokens: Identifying Single-Token Failure Triggers in LLM Mathematical Reasoning
Jaeyong Ko, Pilsung Kang, Yukyung Lee · Jun 24, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

Across seven models and three mathematical reasoning benchmarks (GSM1K, MATH500, AIME 2025), cliff tokens act as failure triggers; deleting the first cliff token and resampling recovers pass@64 to 1.0, while keeping it limits recovery to…
SARA: Unlocking Multilingual Knowledge in Mixture-of-Experts via Semantically Anchored Routing Alignment
Tianyu Dong, Yangyang Liu, Jiang Zhou, Xinwei Wu, Xiaohu Zhao · Jun 24, 2026 · Citations: 0

We conduct experiments on 2 LLMs across 5 low-resource languages and 3 benchmarks.
Holistic Data Scheduler for LLM Pre-training via Multi-Objective Reinforcement Learning
Chenhao Dang, Jing Ma, Mingjie Liao · Jun 23, 2026 · Citations: 0

Automatic Metrics

On The Pile benchmark, HDS reaches the final validation perplexity of the next best method with 44% fewer training iterations.
RoPE-Aware Bit Allocation for KV-Cache Quantization
Fengfeng Liang, Yuechen Zhang, Jiaya Jia · Jun 23, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning
Ali Asgarov, Umid Suleymanov, Aadyant Khatri · Oct 31, 2025 · Citations: 0

Automatic Metrics

We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings…
Where Does Social Reasoning Come From? Capability Provenance in Language Models
Glenn Matlin, Chandreyi Chakraborty, Saehee Eom, Mika Okamoto, Rayan Castilla · Jun 17, 2026 · Citations: 0

Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has…
Resolution Diagnostics for Paired LLM Evaluation
Anany Kotawala · May 28, 2026 · Citations: 0

Pairwise Preference

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9…
Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow · Mar 5, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Latent Performance Profiling of Large Language Models
Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti · May 28, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities.
Accelerated Test-Time Scaling with Model-Free Speculative Sampling
Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin · Jun 5, 2025 · Citations: 0

Automatic Metrics

Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining…
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman · May 8, 2026 · Citations: 0

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters.
SOD: Step-wise On-policy Distillation for Small Language Model Agents
Qiyong Zhong, Mao Zheng, Mingyang Song, Xin Lin, Jie Sun · May 8, 2026 · Citations: 0

To address this, we propose SOD, a step-wise on-policy distillation framework for small language model agents, which adaptively reweights distillation strength at each step based on step-level divergence.
Mathematical Reasoning via Intervention-Based Time-Series Causal Discovery Using LLMs as Concept Mastery Simulators
Tsuyoshi Okita · May 8, 2026 · Citations: 0

Simulation Env

With a 7B-parameter LLM whose weights are entirely frozen, CIKA achieves 69.7% on the contamination-free Omni-MATH-Rule benchmark and 64.0% overall, compared to 60.5% for o1-mini, and 97.2% on GSM8K, 46–50% on AIME 2024–2026, and…
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang · May 7, 2026 · Citations: 0

Automatic Metrics

Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls.
Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Solomon Messing · Apr 13, 2026 · Citations: 0

Demonstrations · LLM as Judge

LLM evaluations drive which models get deployed, what safety standards get adopted, which research conclusions get published, and how projections of AI's labor-market impact get made.
FIT to Forget: Robust Continual Unlearning for Large Language Models
Xiaoyu Xu, Minxin Du, Kun Fang, Yaxin Xiao, Zhicong Huang · Jan 29, 2026 · Citations: 0

Furthermore, to facilitate rigorous evaluation, we introduce PCH, a unified benchmark encompassing Personal, Copyrighted, and Harmful content, alongside two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), to…
Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models
Lin Yao · Apr 20, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning
Weiqin Wang, Yile Wang, Kehao Chen, Hui Huang · Dec 17, 2025 · Citations: 0

We conduct experiments across various models and benchmarks, experimental results show that SCOPE consistently outperforms recent baselines.
Fragile Knowledge, Robust Instruction-Following: The Width Pruning Dichotomy in Llama-3.2
Pere Martra · Dec 27, 2025 · Citations: 0

Automatic Metrics

We evaluated seven expansion ratio configurations using comprehensive benchmarks assessing factual knowledge, mathematical reasoning, language comprehension, instruction-following, and truthfulness.
UR²: Unify RAG and Reasoning through Reinforcement Learning
Weitao Li, Boran Xiang, Xiaolong Wang, Zhinan Gou, Weizhi Ma · Aug 8, 2025 · Citations: 0

Experiments on open-domain QA, MMLU-Pro, medical, and mathematical reasoning tasks show that UR², built on Qwen-2.5-3/7B and LLaMA-3.1-8B, consistently outperforms existing RAG and RL baselines, and achieves performance comparable to…
Cost-Effective Communication: An Auction-based Method for Language Agent Interaction
Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang · Nov 17, 2025 · Citations: 0

Automatic Metrics

To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource.
Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms
Zeguan Xiao, Yun Chen, Guanhua Chen, Ke Tang · Jun 11, 2025 · Citations: 0

Pairwise Preference

Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning…
DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li · Oct 9, 2025 · Citations: 0

LLM as Judge · Automatic Metrics

Our method features a specialized judge model trained with out-of-distribution data (AIME 2022, AIME 2023, and MATH 500) using oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072…
Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0

Automatic Metrics

Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi · Oct 8, 2025 · Citations: 0

Pairwise Preference

Despite its small size, fine-tuning Llama-3-8B-Base on PiKa-SFT even outperforms the official Llama-3-8B-Instruct model trained on over 10M proprietary examples on widely used benchmarks such as AlpacaEval 2.0 and Arena-Hard.
GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0

Pairwise Preference · Automatic Metrics

This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Klaudia Thellmann, Bernhard Stadler, Michael Färber · Apr 2, 2026 · Citations: 0

Automatic Metrics

Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence.
Let the Model Distribute Its Doubt: Confidence Estimation through Verbalized Probability Distribution
Ante Wang, Weizhi Ma, Yang Liu · Nov 18, 2025 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization
Zhenpeng Su, Leiyu Pan, Xue Bai, Dening Liu, Guanting Dong · Aug 11, 2025 · Citations: 0

Automatic Metrics

We present Klear-Reasoner, a model with long reasoning capabilities that demonstrates careful deliberation during problem solving, achieving outstanding performance across multiple benchmarks.
Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri · Aug 19, 2025 · Citations: 0

Automatic Metrics

We further validate practical utility with downstream sentence embedding benchmarks under a strict random initialization control to isolate tokenizer inductive bias.
EngGPT2: Sovereign, Efficient and Open Intelligence
G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico · Mar 17, 2026 · Citations: 0

EngGPT2 is trained on 2.5 trillion tokens - less than Qwen3's 36T or Llama3's 15T - and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring…
PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0

Expert Verification · LLM as Judge · Automatic Metrics

In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs
Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke · Mar 7, 2025 · Citations: 0

When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet, and show improvements over prompting and Self-Debias in all cases, and…
Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models
Richard J. Young · Mar 27, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models
Hao Liang, Zhengyang Zhao, Meiyi Qiang, Mingrui Chen, Lu Ma · Mar 27, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Mitigating LLM Hallucinations through Domain-Grounded Tiered Retrieval
Md. Asraful Haque, Aasar Mehdi, Maaz Mahboob, Tamkeen Fatima · Mar 18, 2026 · Citations: 0

Automatic Metrics

The system was evaluated across 650 queries from five diverse benchmarks: TimeQA v2, FreshQA v2, HaluEval General, MMLU Global Facts, and TruthfulQA.
LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale
Muhammed Saeed, Simon Razniewski · Mar 25, 2026 · Citations: 0

Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%.
Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra · May 21, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith · Mar 23, 2026 · Citations: 0

Pairwise Preference · Automatic Metrics

Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility.
Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval
Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye · Mar 19, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Learning to Self-Evolve
Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao · Mar 19, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
TARo: Token-level Adaptive Routing for LLM Test-time Alignment
Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli · Mar 19, 2026 · Citations: 0

Pairwise Preference

Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025 · Citations: 0

Rubric Rating · Automatic Metrics · Simulation Env

We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
Cem Uluoglakci, Tugba Taskaya Temizel · Mar 18, 2026 · Citations: 0

Pairwise Preference

We also release HypoTermQA-Enhanced, a benchmark for hallucination tendency strengthened through multiple validations.
Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0

Automatic Metrics

Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.
Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning
Xinran Zhang · Mar 16, 2026 · Citations: 0

How safety supervision is written may matter more than the explicit identity content it contains.
More Agents Improve Math Problem Solving but Adversarial Robustness Gap Persists
Khashayar Alavi, Zhastay Yeltay, Lucie Flek, Akbar Karimi · Nov 10, 2025 · Citations: 0

Automatic Metrics

These perturbations include punctuation noise with three intensities (10%, 30%, 50%), plus real-world and human-like typos (WikiTypo, R2ATA).
FLUX: Data Worth Training On
Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya · Mar 14, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
AdaBoN: Adaptive Best-of-N Alignment
Vinod Raman, Hilal Asi, Satyen Kale · May 17, 2025 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Multi-lingual Functional Evaluation for Large Language Models
Victor Ojewale, Inioluwa Deborah Raji, Suresh Venkatasubramanian · Jun 25, 2025 · Citations: 0

Multi-lingual competence in large language models is often evaluated via static data benchmarks such as Belebele, M-MMLU and M-GSM.
KV Cache Transform Coding for Compact Storage in LLM Inference
Konrad Staniszewski, Adrian Łańcucki · Nov 3, 2025 · Citations: 0

Automatic Metrics

We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, GSM8K, LiveCodeBench, LongBench, MATH-500, MMLU, Qasper and RULER.
Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models
Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang · Mar 10, 2026 · Citations: 0

Automatic Metrics

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0

Pairwise Preference · Automatic Metrics

We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.

Related Benchmark Hubs

GPQA Or BFCL Benchmark Papers (23)
HumanEval+ Or BFCL Benchmark Papers (21)
MATH-500 Or BFCL Benchmark Papers (25)
GPQA Or HumanEval+ Benchmark Papers (22)
MATH-500 Or GPQA Benchmark Papers (25)
MATH-500 Or HumanEval+ Benchmark Papers (24)
MATH-500 Benchmark Papers (15)
MMLU Or AIME Or BFCL Benchmark Papers (59)
GSM8K Or MMLU Or AIME Benchmark Papers (72)
MMLU Or AIME Or HotpotQA Benchmark Papers (62)
MMLU Or AIME Or HumanEval+ Benchmark Papers (57)