AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen +5 more
Abstract
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical model size and the limited hardware resource pose significant deployment challenges. We propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach fo...
Best Implementation
vllm-project/vllm: a high-throughput and memory-efficient inference and serving engine for LLMs.
- Selected vllm-project/vllm as the strongest maintained implementation for new work.
- CI workflows were detected in the repository.
- Dependency/environment manifests were detected in the repository.
- Repository activity is within the last 24 months.
Reproduction Path
1. Start with vllm-project/vllm and validate the setup instructions in its README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and the runtime environment for reproducibility.
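The logging step above can be sketched with the Python standard library alone. This is a minimal illustration, not part of vLLM or any AWQ tooling: the helper name `capture_environment` and the package list (`vllm`, `torch`, `transformers`) are assumptions chosen for the example, and a package that is not installed is recorded as `null` rather than raising an error.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Return a dict describing the interpreter, OS, and package versions.

    `packages` is a list of distribution names; this helper is hypothetical
    and exists only for this example.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # The package list is an assumption; adjust it to the stack actually in use.
    snapshot = capture_environment(["vllm", "torch", "transformers"])
    print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Saving this JSON next to each benchmark result makes it easier to tell later whether a changed number came from a code change or from a dependency upgrade. It does not capture GPU driver or CUDA versions, which would need to be recorded separately.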
Additional Implementations
Official
- internlm/lmdeploy (Confidence: low)
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Stars: 7.8k | Forks: 682 | Last push: Apr 2026 | License: Apache-2.0
- mit-han-lab/llm-awq (Confidence: low)
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Stars: 3.5k | Forks: 310 | Last push: Jul 2025 | License: MIT
Community
No additional community repositories detected yet.
Hugging Face Artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches: