Metric Hub

Perplexity Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). 16 papers are grouped on this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Frequent quality control: Calibration. Frequently cited benchmark: GSM8K. Common metric signal: perplexity. The newest paper in this set is from Feb 20, 2026.

Papers: 16 · Last published: Feb 20, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded

Updated from the current HFEPX corpus (Feb 27, 2026). This page covers 16 papers that report perplexity. Common evaluation modes are Automatic Metrics and Simulation Env, and the recurring benchmark tags are GSM8K and Retrieval. Use the anchored takeaways below to compare protocol choices and identify papers with stronger evidence depth.

Why This Matters For Eval Research

Evaluation emphasis: Automatic Metrics dominates this slice (15 of 16 papers); Simulation Env appears in 2 papers.

Evidence: SPQ: An Ensemble Technique for Large Language Model Compression; Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression; Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning

Benchmark concentration: GSM8K and Retrieval are the recurring benchmark tags; comparing within a shared benchmark helps control cross-paper variance.

Evidence: Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression; Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning; *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

Metric concentration: perplexity and accuracy are repeatedly reported in this group.

Evidence: Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning; *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation; Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination

Protocol Takeaways

Stratify by benchmark (GSM8K vs Retrieval) before comparing methods.

Evidence: *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation; Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination; Scaling Beyond Masked Diffusion Language Models

Track metric sensitivity by reporting both perplexity and accuracy.

Evidence: Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination; Scaling Beyond Masked Diffusion Language Models; Fast-weight Product Key Memory

Explicit human feedback appears in approximately 0% of papers in this set.

Evidence: Scaling Beyond Masked Diffusion Language Models; Fast-weight Product Key Memory; Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Benchmark Interpretation

GSM8K appears as a recurring benchmark anchor on this page.
2 papers (12.5%) mention GSM8K.
Most common evaluation modes: Automatic Metrics, Simulation Env.

Metric Interpretation

Perplexity is a commonly reported metric and should be paired with protocol context before ranking methods.
16 papers (100%) mention perplexity.
Most common evaluation modes: Automatic Metrics, Simulation Env.
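
One reason protocol context matters: perplexity is the exponential of the mean negative log-likelihood per token, so the same text with the same total likelihood yields different values under different tokenizations. The following is a minimal sketch in Python, assuming natural-log token probabilities. The function name and all numbers are illustrative and are not taken from any paper on this page.

```python
import math


def perplexity(token_logprobs):
    """Return exp(mean negative log-likelihood) over a list of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical example: one text with a total log-probability of -12.0,
# split into 6 tokens by one tokenizer and 8 tokens by another.
total_logprob = -12.0
ppl_6_tokens = perplexity([total_logprob / 6] * 6)  # exp(2.0)
ppl_8_tokens = perplexity([total_logprob / 8] * 8)  # exp(1.5)

# The finer tokenization reports a lower perplexity for the same text and
# the same total likelihood, so the two numbers are not directly comparable.
print(ppl_6_tokens, ppl_8_tokens)
```

Under these assumptions, a ranking by raw perplexity is only meaningful when tokenizer, evaluation corpus, and context length are held fixed across the compared methods.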

Researcher Checklist

Papers with explicit human feedback: Coverage is a replication risk (0% vs 45% target).
Papers reporting quality controls: Coverage is a replication risk (6.3% vs 30% target).
Papers naming benchmarks/datasets: Coverage is usable but incomplete (31.3% vs 35% target).
Papers naming evaluation metrics: Coverage is strong (100% vs 35% target).
Papers with known rater population: Coverage is a replication risk (0% vs 35% target).
Papers with known annotation unit: Coverage is a replication risk (0% vs 35% target).
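
The page does not state how the coverage labels are assigned. The sketch below shows one rule that would reproduce the labels above, assuming the percentages come from counts of 0, 1, 5, and 16 out of 16 papers (inferred from the rounded figures) and assuming a hypothetical cutoff of 75% of the target for the middle label. Both the function name and the cutoff are assumptions, not the hub's actual logic.

```python
from decimal import Decimal, ROUND_HALF_UP


def coverage_status(count, total, target_pct, partial_ratio="0.75"):
    """Classify coverage against a target percentage.

    partial_ratio is a hypothetical cutoff (fraction of the target) for the
    middle label; the page does not document the real rule.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = Decimal(100 * count) / Decimal(total)
    # Round half up so 6.25 displays as 6.3, matching the figures on the page.
    shown = pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    target = Decimal(str(target_pct))
    if pct >= target:
        label = "strong"
    elif pct >= Decimal(partial_ratio) * target:
        label = "usable but incomplete"
    else:
        label = "a replication risk"
    return shown, label


# Counts below are inferred from the percentages on this page (assumption).
checklist = [
    ("explicit human feedback", 0, 45),
    ("quality controls", 1, 30),
    ("benchmarks/datasets", 5, 35),
    ("evaluation metrics", 16, 35),
]
for name, count, target in checklist:
    shown, label = coverage_status(count, 16, target)
    print(f"{name}: coverage is {label} ({shown}% vs {target}% target)")
```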

Known Limitations

Narrative synthesis is grounded in metadata and abstracts only; full-paper method details may be missing.
Extraction fields are conservative and can under-report implicit protocol details.
Cross-page comparisons should control for benchmark and metric mismatch.

Research Utility Links

Benchmark Slice: GSM8K - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: perplexity - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

automatic_metrics vs simulation_env

both=1, left_only=14, right_only=1

1 paper uses both Automatic Metrics and Simulation Env; 14 use only Automatic Metrics and 1 uses only Simulation Env.
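
The both / left_only / right_only counts can be derived from per-paper mode tags with a simple tally. The sketch below assumes each paper carries a set of mode tags. The two named entries follow the tags shown in the listing on this page; the remaining 14 identifiers are placeholders for the papers tagged only with Automatic Metrics, and the function name is hypothetical.

```python
def mode_overlap(paper_modes, left, right):
    """Count papers tagged with both modes, only the left one, only the right one, or neither."""
    counts = {"both": 0, "left_only": 0, "right_only": 0, "neither": 0}
    for modes in paper_modes.values():
        has_left = left in modes
        has_right = right in modes
        if has_left and has_right:
            counts["both"] += 1
        elif has_left:
            counts["left_only"] += 1
        elif has_right:
            counts["right_only"] += 1
        else:
            counts["neither"] += 1
    return counts


paper_modes = {
    "SPQ": {"automatic_metrics", "simulation_env"},
    "Assessing Web Search Credibility": {"simulation_env"},
}
# Placeholder ids standing in for the 14 papers tagged only with Automatic Metrics.
paper_modes.update({f"paper_{i:02d}": {"automatic_metrics"} for i in range(1, 15)})

counts = mode_overlap(paper_modes, "automatic_metrics", "simulation_env")
print(counts)
```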

Top Papers Reporting This Metric

SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026

Automatic Metrics, Simulation Env · Math, Coding

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026

Automatic Metrics General

Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck.
Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
Jenny Kunz · Feb 18, 2026

Automatic Metrics Multilingual

Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce.
*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026

Automatic Metrics General

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.
Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
Xiangyan Chen, Yujian Gan, Matthew Purver · Feb 17, 2026

Automatic Metrics General

The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems.
Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026

Automatic Metrics · Math, Law

Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026

Automatic Metrics General

Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025

Automatic Metrics Coding

Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025

Automatic Metrics General

We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and show the performance of RMFT vs Deduplication by Area Under The Response Curve (AURC) metric.
Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models
Wangjiaxuan Xin · Nov 24, 2025

Automatic Metrics General

This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models.
Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025

Simulation Env General

Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025

Automatic Metrics General

Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.
Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis · Aug 6, 2025

Automatic Metrics General

Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines and recent Repeat-all-over/Sequential sharing at comparable parameter budgets.
Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025

Automatic Metrics General

In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.
Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025

Automatic Metrics · Math, Coding

Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.
Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025

Automatic Metrics General

However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.

Other Metric Hubs

Cost Metric Papers (78)
Accuracy Metric Papers (218)
Latency Metric Papers (34)
Recall Metric Papers (33)
F1 Metric Papers (32)
Precision Metric Papers (31)
Agreement Metric Papers (24)
Calibration Metric Papers (24)
Throughput Metric Papers (15)
Success Rate Metric Papers (14)