OneComp: One-Line Revolution for Generative AI Model Compression
Yuma Ichikawa, Keiji Kimura, Akihiro Yoshida, Yudai Fujimoto, Hiroki Tokura +9 more
Abstract
Deploying foundation models is increasingly constrained by memory footprint, latency, and hardware costs. Post-training compression can mitigate these bottlenecks by reducing the precision of model parameters without significantly degrading performance; however, its practical implementation remains challenging as practitioners navigate a fragmented landscape of quantization algorithms, precision budgets, data-driven...
Summary
OneComp is an open-source post-training compression framework for large language models that automates model inspection, mixed-precision planning, and progressive multi-stage quantization. Given a model identifier and hardware target, it executes layer-wise compression, block-wise refinement, and global refinement stages, treating the first quantized checkpoint as a deployable pivot. Quality is measured via WikiText-2 perplexity and average zero-shot accuracy across bit-budget variants.
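The two quality metrics named above can be computed as in the minimal sketch below. It assumes perplexity is taken over per-token natural-log probabilities and that the zero-shot average is an unweighted mean over tasks; the function names and toy inputs are illustrative and are not taken from the paper or from the OneComp codebase.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def average_accuracy(task_accuracies):
    """Unweighted mean over tasks (assumed; dict of task name -> accuracy in [0, 1])."""
    if not task_accuracies:
        raise ValueError("need at least one task")
    return sum(task_accuracies.values()) / len(task_accuracies)


# Toy inputs for illustration only; these are not results from the paper.
toy_log_probs = [math.log(0.25)] * 8          # model gives p = 0.25 to every token
toy_tasks = {"task_a": 0.5, "task_b": 1.0}    # hypothetical task names

print(perplexity(toy_log_probs))
print(average_accuracy(toy_tasks))
```

Lower perplexity and higher average accuracy both indicate that a compressed checkpoint stays closer to the original model's quality.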
Key Contributions
- Fully automated pipeline: single model identifier plus hardware spec drives mixed-precision assignment and progressive quantization without manual configuration.
- Deployable pivot design: the first quantized checkpoint is production-ready; subsequent refinement stages are additive and monotonically improve quality with more compute.
- Multi-stage compression: layer-wise, block-wise, and global refinement stages are explicitly separated to allow compute-quality tradeoffs at each level.
- Unified evaluation protocol: WikiText-2 perplexity and average zero-shot accuracy used systematically across bit-budget and quantization variants.
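The "deployable pivot plus additive refinement" idea can be illustrated with a toy sketch. It is not OneComp's algorithm: round-to-nearest symmetric quantization stands in for the first stage, and a simple clipping-scale search stands in for a refinement stage. All function names are hypothetical. Because the refinement keeps the pivot's scale as a candidate, it can only match or lower the reconstruction error on the data it sees.

```python
def quantize(weights, bits, scale):
    """Symmetric uniform quantization: round to the grid, clamp, dequantize."""
    if bits < 2 or scale <= 0:
        raise ValueError("need bits >= 2 and scale > 0")
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pivot_scale(weights, bits):
    """Stage 1 (pivot): max-abs scale with no search; usable immediately."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in weights)
    return peak / qmax if peak > 0 else 1.0


def refine_scale(weights, bits, start_scale, steps=20):
    """Stage 2 (refinement): try smaller clipping ranges and keep the best.

    The starting scale is the initial best, so the error never increases.
    """
    best_scale = start_scale
    best_err = mse(weights, quantize(weights, bits, start_scale))
    for i in range(1, steps):
        cand = start_scale * (1 - i / (2 * steps))  # shrink toward about half
        err = mse(weights, quantize(weights, bits, cand))
        if err < best_err:
            best_scale, best_err = cand, err
    return best_scale


# Toy layer: many small weights and one outlier (illustrative data only).
layer = [0.01 * i for i in range(-10, 11)] + [1.5]
s0 = pivot_scale(layer, bits=3)
s1 = refine_scale(layer, bits=3, start_scale=s0)
print(mse(layer, quantize(layer, 3, s0)), mse(layer, quantize(layer, 3, s1)))
```

Spending more compute (a larger `steps` value here) only adds candidates, which mirrors the claim that later stages trade compute for quality without invalidating the pivot.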
Reproducibility Notes
- Benchmark value for WikiText-2 perplexity (reported as '2') appears to be a partial extraction artifact; verify exact figures from the paper tables before targeting them.
- Paper-only reproduction of a multi-stage progressive quantization pipeline typically requires days of implementation and compute, especially at LLM scale.
- Mixed-precision planning details (layer-to-bit-width assignment heuristics) are architecture-sensitive and may require careful reading of the experimental section to replicate.
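To make the layer-to-bit-width assignment problem concrete, here is a minimal greedy sketch under an average-bit budget. It is an assumption-laden stand-in, not the paper's heuristic: the per-layer sensitivity scores are hypothetical inputs, and the error model (quantization error shrinking by 4x per extra bit) is a textbook approximation for uniform quantizers.

```python
def assign_bits(layers, avg_bit_budget, choices=(2, 3, 4, 8)):
    """Greedy mixed-precision plan.

    layers: dict of layer name -> (num_params, sensitivity); sensitivity is a
    hypothetical score where larger means more quality loss when quantized.
    Returns dict of layer name -> bit-width, with the parameter-weighted
    average bit-width kept at or below avg_bit_budget.
    """
    choices = sorted(choices)
    total_params = sum(n for n, _ in layers.values())
    budget = avg_bit_budget * total_params
    used = choices[0] * total_params
    if used > budget:
        raise ValueError("budget is below the smallest bit-width")
    level = {name: 0 for name in layers}  # index into choices per layer

    while True:
        best, best_gain = None, 0.0
        for name, (n, sens) in layers.items():
            i = level[name]
            if i + 1 >= len(choices):
                continue  # already at the highest precision
            extra = (choices[i + 1] - choices[i]) * n
            if used + extra > budget:
                continue  # upgrade would exceed the budget
            # Assumed error model: error proportional to sens * 4^(-bits).
            gain = sens * (4.0 ** -choices[i] - 4.0 ** -choices[i + 1]) / extra
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break
        i = level[best]
        used += (choices[i + 1] - choices[i]) * layers[best][0]
        level[best] += 1

    return {name: choices[i] for name, i in level.items()}


# Hypothetical layer sizes and sensitivities, for illustration only.
toy_layers = {"attn": (100, 10.0), "mlp": (300, 1.0), "head": (50, 50.0)}
print(assign_bits(toy_layers, avg_bit_budget=4.0))
```

In a real reproduction, the sensitivity scores would have to come from whatever measurement the paper's experimental section specifies, which is the architecture-sensitive part flagged above.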
Results & Benchmarks
Benchmark data is not yet available for this paper.
Hardware Requirements
- No concrete hardware figures are listed here; expect multi-day setup and compute for a meaningful reproduction.
Best Implementation
No maintained implementation has been confirmed for this paper yet.
Use the Implementation Status and Reproduction Path sections below for the current action plan.
Reproduction Path
Follow this baseline workflow to decide if this paper is worth immediate prototyping.
1. Use the paper and benchmark evidence to scope a baseline reproduction plan.
2. Start from this likely method family: Quantization.
3. Track assumptions and missing details in an experiment log before coding.
Additional Implementations
No additional verified repositories beyond the primary recommendation.
Hugging Face Artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.