Metric Hub

Perplexity + Automatic Metrics Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 17 papers. Common evaluation modes: Automatic Metrics, Simulation Env. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: perplexity. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 26, 2026.

Papers: 17 · Last published: Feb 26, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 17 papers for Perplexity + Automatic Metrics Metric Papers. The dominant protocol signals are automatic metrics and simulation environments; the most frequent benchmarks are GSM8K and Retrieval, and the most frequent metrics are perplexity and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Automatic metrics appear in 100% of papers in this hub.

Evidence: NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion, SPQ: An Ensemble Technique for Large Language Model Compression, Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression, Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning

GSM8K is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: SPQ: An Ensemble Technique for Large Language Model Compression, Scaling Beyond Masked Diffusion Language Models, NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion, Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Protocol Takeaways

The most common quality-control signal is calibration (5.9% of papers, 1/17).

Evidence: CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning, NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion, SPQ: An Ensemble Technique for Large Language Model Compression, Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Stratify by benchmark (GSM8K vs Retrieval) before comparing methods.

Evidence: NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion, SPQ: An Ensemble Technique for Large Language Model Compression, Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression, Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning

Track metric sensitivity by reporting both perplexity and accuracy.

Evidence: NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion, SPQ: An Ensemble Technique for Large Language Model Compression, Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression, Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning

Benchmark Interpretation

GSM8K appears in 11.8% of hub papers (2/17); use this cohort for benchmark-matched comparisons.
Retrieval appears in 11.8% of hub papers (2/17); use this cohort for benchmark-matched comparisons.
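
The benchmark-matched comparison recommended here can be sketched as a simple grouping step: collect papers by the benchmark they name, then compare methods only within a cohort. This is a minimal sketch, not HFEPX tooling. The record layout and the helper names `cohorts_by_benchmark` and `coverage_pct` are assumptions made for illustration; the titles are the examples this page lists for each benchmark.

```python
from collections import defaultdict

# Hypothetical records: (paper title, benchmarks named on this page).
# The layout is an assumption for illustration, not an HFEPX schema.
PAPERS = [
    ("SPQ: An Ensemble Technique for Large Language Model Compression", ["GSM8K"]),
    ("Scaling Beyond Masked Diffusion Language Models", ["GSM8K"]),
    ("Fast-weight Product Key Memory", ["Retrieval"]),
    ("Bayesian Attention Mechanism: A Probabilistic Framework for Positional "
     "Encoding and Context Length Extrapolation", ["Retrieval"]),
]
HUB_SIZE = 17  # total papers in this hub


def cohorts_by_benchmark(papers):
    """Group paper titles under every benchmark they name."""
    cohorts = defaultdict(list)
    for title, benchmarks in papers:
        for name in benchmarks:
            cohorts[name].append(title)
    return dict(cohorts)


def coverage_pct(count, total=HUB_SIZE):
    """Share of hub papers in a cohort, as a percentage with one decimal."""
    return round(100 * count / total, 1)


cohorts = cohorts_by_benchmark(PAPERS)
for name, titles in cohorts.items():
    print(f"{name}: {len(titles)}/{HUB_SIZE} papers ({coverage_pct(len(titles))}%)")
```

Comparing a GSM8K paper against a Retrieval paper mixes task difficulty with method quality, which is why the grouping comes before any ranking.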

Metric Interpretation

Perplexity is reported in 100% of hub papers (17/17); compare with a secondary metric before ranking methods.
Accuracy is reported in 29.4% of hub papers (5/17); compare with a secondary metric before ranking methods.
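
For readers who want to report both metrics side by side, the sketch below uses the standard definition of perplexity (the exponential of the mean negative log-likelihood per token) together with exact-match accuracy. The function names and the toy inputs are assumptions for illustration and do not come from any paper in this hub.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy inputs (made up): a model that assigns probability 0.25 to every token,
# and three short answers of which two match the reference.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
acc = accuracy(["18", "7", "42"], ["18", "9", "42"])
print(f"perplexity={ppl:.2f} accuracy={acc:.2f}")
```

Perplexity depends on the tokenizer and the evaluation text, so two perplexity values are only directly comparable when both are fixed; a task metric such as accuracy gives a second view that does not share that dependence.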

Researcher Checklist

Close the gap on papers with explicit human feedback: coverage is a replication risk (0% vs 45% target).
Close the gap on papers reporting quality controls: coverage is a replication risk (5.9% vs 30% target).
Tighten coverage of papers naming benchmarks/datasets: coverage is usable but incomplete (29.4% vs 35% target).
Maintain strength on papers naming evaluation metrics: coverage is strong (100% vs 35% target).
Close the gap on papers with known rater population: coverage is a replication risk (0% vs 35% target).
Close the gap on papers with known annotation unit: coverage is a replication risk (0% vs 35% target).
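
The checklist compares each coverage figure against a target. The sketch below reproduces that comparison for the six items and prints the remaining gap in percentage points. The page does not state the rule that maps a gap to a label, so the thresholds in `status` (in particular `partial_ratio=0.5`) are an assumption chosen only because they reproduce the labels shown here; they should not be read as the HFEPX rule.

```python
# (label, coverage %, target %) copied from the checklist on this page.
CHECKLIST = [
    ("explicit human feedback", 0.0, 45.0),
    ("quality controls", 5.9, 30.0),
    ("benchmarks/datasets", 29.4, 35.0),
    ("evaluation metrics", 100.0, 35.0),
    ("known rater population", 0.0, 35.0),
    ("known annotation unit", 0.0, 35.0),
]


def status(coverage, target, partial_ratio=0.5):
    """ASSUMED labeling rule (hypothetical, not documented by the source).

    - at or above target                      -> "strong"
    - at least partial_ratio of the target    -> "usable but incomplete"
    - anything lower                          -> "replication risk"
    """
    if coverage >= target:
        return "strong"
    if coverage >= partial_ratio * target:
        return "usable but incomplete"
    return "replication risk"


for label, coverage, target in CHECKLIST:
    gap = max(0.0, target - coverage)  # percentage points still missing
    print(f"{label}: {status(coverage, target)} (gap {gap:.1f} points)")
```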

Known Limitations

Only 5.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (0% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Benchmark Slice: GSM8K - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: perplexity - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=16 (Automatic Metrics only), right_only=0 (Simulation Env only)

1 paper uses both Automatic Metrics and Simulation Env; the other 16 use Automatic Metrics only.
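
The three counts are plain set arithmetic over the papers tagged with each evaluation mode, with Automatic Metrics as the left set and Simulation Env as the right set. A minimal sketch follows; the paper IDs are placeholders invented for illustration, and only the set sizes match this page.

```python
def overlap_counts(left, right):
    """Count items in both sets, only in the left set, and only in the right set."""
    left, right = set(left), set(right)
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }


# Placeholder IDs (hypothetical): all 17 hub papers use automatic metrics,
# and one of them also uses a simulation environment.
automatic_metrics = {f"paper_{i:02d}" for i in range(1, 18)}
simulation_env = {"paper_02"}

counts = overlap_counts(automatic_metrics, simulation_env)
print(counts)
```

Because `right_only` is 0, every Simulation Env paper in this hub also reports automatic metrics, so the hub has no simulation-only cohort to compare against.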

Benchmark Brief: GSM8K

Coverage: 2 papers (11.8%) mention GSM8K.

Examples: SPQ: An Ensemble Technique for Large Language Model Compression, Scaling Beyond Masked Diffusion Language Models

Benchmark Brief: Retrieval

Coverage: 2 papers (11.8%) mention Retrieval.

Examples: Fast-weight Product Key Memory, Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Benchmark Brief: DROP

Coverage: 1 paper (5.9%) mentions DROP.

Examples: Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Metric Brief: perplexity

Coverage: 17 papers (100%) mention perplexity.

Examples: NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion, SPQ: An Ensemble Technique for Large Language Model Compression, Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Metric Brief: accuracy

Coverage: 5 papers (29.4%) mention accuracy.

Examples: SPQ: An Ensemble Technique for Large Language Model Compression, Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination, CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Metric Brief: cost

Coverage: 4 papers (23.5%) mention cost.

Examples: Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression, *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation, Fast-weight Product Key Memory

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion, SPQ: An Ensemble Technique for Large Language Model Compression, Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression

Top Papers Reporting This Metric

NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026
Automatic Metrics · Math
On the SlimOrca benchmark, NoRA breaks this linear barrier: NoRA remarkably at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.

SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026
Automatic Metrics, Simulation Env · Math, Coding
Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.

Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026
Automatic Metrics · General
Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck.

Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
Jenny Kunz · Feb 18, 2026
Automatic Metrics · Multilingual
Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce.

*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026
Automatic Metrics · General
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.

Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
Xiangyan Chen, Yujian Gan, Matthew Purver · Feb 17, 2026
Automatic Metrics · General
The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems.

Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026
Automatic Metrics · Math, Law
Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.

Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026
Automatic Metrics · General
Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.

Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025
Automatic Metrics · Coding
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.

Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025
Automatic Metrics · General
We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.

Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models
Wangjiaxuan Xin · Nov 24, 2025
Automatic Metrics · General
This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models.

CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025
Automatic Metrics · General
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.

Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis · Aug 6, 2025
Automatic Metrics · General
Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines and recent Repeat-all-over/Sequential sharing at comparable parameter budgets.

DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto · Jun 18, 2025
Automatic Metrics · Medicine
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations.

Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025
Automatic Metrics · General
In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.

Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025
Automatic Metrics · Math, Coding
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025
Automatic Metrics · General
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
