- SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026
Automatic Metrics Simulation Env Math Coding
Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
- Sign Lock-In: Randomly Initialized Weight Signs Persist and Bottleneck Sub-Bit Model Compression
Akira Sakai, Yuma Ichikawa · Feb 19, 2026
Automatic Metrics General
Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck.
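A back-of-the-envelope illustration of why the sign bit becomes a fixed-cost bottleneck (our own sketch, not from the paper): if each weight's sign still costs one uncompressed bit, then as magnitude storage shrinks, signs dominate the total budget.

```python
# Illustrative arithmetic only: assume magnitudes are compressed to m bits
# per weight while each sign remains a fixed 1-bit cost.

def bits_per_weight(magnitude_bits: float, sign_bits: float = 1.0) -> float:
    """Total storage per weight = compressed magnitude + uncompressed sign."""
    return magnitude_bits + sign_bits

for m in [4.0, 1.0, 0.25, 0.05]:
    total = bits_per_weight(m)
    sign_share = 1.0 / total  # fraction of storage spent on signs
    print(f"magnitudes: {m:>5} b/w -> total: {total:.2f} b/w, "
          f"sign share: {sign_share:.0%}")
```

At 0.05 bits per magnitude, roughly 95% of storage goes to signs, so sub-bit rates are unreachable without also compressing the sign information.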
- Training Models on Dialects of Translationese Shows How Lexical Diversity and Source-Target Syntactic Similarity Shape Learning
Jenny Kunz · Feb 18, 2026
Automatic Metrics Multilingual
Machine-translated data is widely used in multilingual NLP, particularly when native text is scarce.
- *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu · Feb 17, 2026
Automatic Metrics General
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods.
- Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
Xiangyan Chen, Yujian Gan, Matthew Purver · Feb 17, 2026
Automatic Metrics General
The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems.
- Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026
Automatic Metrics Math Law
Among discrete diffusion approaches, masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
- Fast-weight Product Key Memory
Tianyu Zhao, Llion Jones · Jan 2, 2026
Automatic Metrics General
Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
- Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025
Automatic Metrics Coding
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
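To see why the KV cache becomes prohibitive, here is a back-of-the-envelope sizing sketch (our own illustration, not from the paper); the LLaMA-2-7B-like shapes are assumptions for the example.

```python
# KV-cache memory: per token, each layer stores a key and a value vector
# of size hidden_dim. Sizes below are illustrative assumptions.

def kv_cache_bytes(n_layers: int, hidden_dim: int, seq_len: int,
                   batch: int, bytes_per_elem: int = 2) -> int:
    # 2 tensors (K and V) per layer, fp16 by default
    return 2 * n_layers * hidden_dim * seq_len * batch * bytes_per_elem

# Roughly LLaMA-2-7B-like shapes: 32 layers, hidden size 4096, fp16
gib = kv_cache_bytes(32, 4096, seq_len=32_768, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # -> 16.0 GiB for these shapes
```

The cache grows linearly in sequence length and batch size, which is why cross-layer fusion or other cache-reduction schemes matter at long contexts.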
- Randomized Masked Finetuning: An Efficient Way to Mitigate Memorization of PIIs in LLMs
Kunj Joshi, David A. Smith · Dec 2, 2025
Automatic Metrics General
We present MaxTER, a Pareto-optimal evaluation framework for assessing privacy-utility tradeoffs, and compare RMFT against Deduplication using the Area Under the Response Curve (AURC) metric.
- Empathetic Cascading Networks: A Multi-Stage Prompting Technique for Reducing Social Biases in Large Language Models
Wangjiaxuan Xin · Nov 24, 2025
Automatic Metrics General
This report presents the Empathetic Cascading Networks (ECN) framework, a multi-stage prompting method designed to enhance the empathetic and inclusive capabilities of large language models.
- Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025
Simulation Env General
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources.
- CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025
Automatic Metrics General
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.
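A minimal sketch of the low-rank baseline the abstract refers to, in which every column of a weight matrix is represented in one shared r-dimensional subspace via truncated SVD; the matrix size and rank are illustrative assumptions, not values from the paper.

```python
# Truncated-SVD low-rank approximation: all columns of W share the same
# r-dimensional subspace spanned by the top-r left singular vectors.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))  # illustrative weight matrix

U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64                               # shared subspace dimension (assumed)
W_lr = (U[:, :r] * s[:r]) @ Vt[:r]   # rank-r reconstruction of W

# Storage: 512*512 values -> r*(512+512) values
orig, compressed = W.size, r * (W.shape[0] + W.shape[1])
print(compressed / orig)  # -> 0.25 of the original parameter count
```

Sparse dictionary learning, by contrast, replaces the single shared subspace with a learned dictionary plus sparse codes, which is the structural change CoSpaDi explores.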
- Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis · Aug 6, 2025
Automatic Metrics General
Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines, and recent Repeat-all-over/Sequential sharing at comparable parameter budgets.
- Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025
Automatic Metrics General
In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.
- Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025
Automatic Metrics Math Coding
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025
Automatic Metrics General
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.