Making LLMs Smaller Without Losing Intelligence

Distillation, quantization, pruning, and sparsity for efficient deployment.

Vijayakumar S
Jun 15, 2025 · 14 min read
[Figure: LLM compression techniques diagram]

The Compression Toolbox

Deploying LLMs in production means trading model quality against memory, latency, and cost. The compression techniques available in 2025 can shrink a model by roughly 10x with little quality loss, and they can be combined for larger reductions.

1. Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher":

student_loss = (
    alpha * KL_divergence(student_logits, teacher_logits)
    + (1 - alpha) * cross_entropy(student_logits, hard_labels)
)
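
To make the loss above concrete, here is a minimal pure-Python sketch for a single token position. The temperature parameter, the T² scaling of the soft term, and the helper names (`softmax`, `kl_divergence`, `distillation_loss`) are assumptions for illustration and are not taken from the formula above; a real training loop would use a tensor library and batch over the whole sequence.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(logits, label):
    """Negative log-probability of the correct token at temperature 1."""
    return -math.log(softmax(logits)[label])

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    # Assumed convention: T^2 keeps the soft-term gradient scale comparable
    # across temperatures.
    soft = kl_divergence(teacher_probs, student_probs) * temperature ** 2
    hard = cross_entropy(student_logits, hard_label)
    return alpha * soft + (1 - alpha) * hard

teacher = [4.0, 1.0, 0.5]
# A student that matches the teacher has a zero soft term.
print(distillation_loss([4.0, 1.0, 0.5], teacher, hard_label=0))
# A student that disagrees with the teacher is penalized by both terms.
print(distillation_loss([0.5, 1.0, 4.0], teacher, hard_label=0))
```

Setting `alpha` closer to 1 leans on the teacher's soft distribution; closer to 0 leans on the ground-truth labels.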

Reported results from Llama-4 distillation (approximate; retention varies by benchmark):

  • 400B teacher → 70B student: 95% performance retention
  • 70B teacher → 7B student: 88% performance retention
  • 7B teacher → 1B student: 75% performance retention

2. Quantization

Reduce precision from 16-bit to lower bit widths:

| Precision   | Memory Reduction | Quality Loss | Best For          |
|-------------|------------------|--------------|-------------------|
| FP16        | 1x (baseline)    | 0%           | Training          |
| INT8        | 2x               | <1%          | Inference         |
| INT4        | 4x               | 2-5%         | Edge deployment   |
| INT2        | 8x               | 10-15%       | Extreme edge      |
| NF4 (QLoRA) | 4x               | <2%          | Fine-tuned models |
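
The memory column follows directly from the bit width, and the quality column reflects rounding error. The sketch below illustrates both with plain round-to-nearest symmetric quantization using one scale per tensor. This is the simplest possible scheme, chosen for illustration; it is not GPTQ, AWQ, or NF4, and the sample weights and function names are made up.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    # Fall back to 1.0 so an all-zero tensor does not divide by zero.
    scale = (max(abs(w) for w in weights) or 1.0) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def weight_memory_gb(num_params, bits):
    """Weight storage only; ignores scales, activations, and the KV cache."""
    return num_params * bits / 8 / 1e9

weights = [0.82, -0.41, 0.07, -0.99, 0.33]  # hypothetical weights
for bits in (8, 4):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: codes={codes}, worst-case error={worst:.4f}")

for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

Halving the bit width halves weight storage, while the worst-case rounding error grows with the step size `scale`.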

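GPTQ and AWQ are two of the most widely used post-training quantization algorithms. The example below configures 4-bit GPTQ through Hugging Face Transformers: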

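from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Model ID as given in this article; substitute a checkpoint you can access.
model_id = "meta-llama/Llama-4-70b"

# GPTQ needs a tokenizer to prepare the calibration dataset.
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = GPTQConfig(
    bits=4,               # 4-bit weights
    group_size=128,       # one set of quantization parameters per 128 weights
    dataset="c4",         # calibration data
    tokenizer=tokenizer,
    desc_act=False,       # skip activation-order quantization for faster inference
)

# Passing a GPTQConfig quantizes the weights while the model loads.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)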
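
The `group_size` setting trades accuracy against storage: smaller groups track the weight distribution more closely but carry more metadata. The sketch below estimates the effective bits per weight, assuming each group stores one FP16 scale and one zero point at the same width as the weights. That layout is an assumption for illustration; actual on-disk formats vary by library.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_point_bits=None):
    """Average storage per weight when each group carries its own metadata."""
    if zero_point_bits is None:
        zero_point_bits = bits  # assumed: zero point stored at weight precision
    return bits + (scale_bits + zero_point_bits) / group_size

for group_size in (32, 128, 1024):
    eff = effective_bits_per_weight(4, group_size)
    print(f"group_size={group_size}: {eff:.3f} bits/weight, "
          f"{16 / eff:.2f}x smaller than FP16")
```

Under these assumptions, 4-bit quantization with a group size of 128 lands slightly below the nominal 4x reduction.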

3. Pruning

Remove unimportant weights or neurons:

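  • Magnitude pruning: Remove the weights with the smallest absolute values
  • SparseGPT: One-shot pruning to 50% sparsity while largely maintaining accuracy
  • Wanda: Score each weight by its magnitude times the norm of its input activation, then prune the lowest-scoring weights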
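
The difference between magnitude-based and activation-aware scoring is easiest to see on a toy example. The sketch below prunes one output neuron's weights both ways; the weights, the activation norms, and the helper names are hypothetical, and SparseGPT's weight-update step is not shown.

```python
def prune_lowest(scores, weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the lowest scores."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: scores[i])
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def magnitude_scores(weights):
    """Importance is the absolute value of the weight."""
    return [abs(w) for w in weights]

def activation_aware_scores(weights, activation_norms):
    """Importance is |weight| times the L2 norm of the matching input feature."""
    return [abs(w) * a for w, a in zip(weights, activation_norms)]

# One output neuron's weights and hypothetical per-input activation norms.
row = [0.9, -0.05, 0.4, 0.1]
act_norms = [0.1, 8.0, 1.0, 1.0]

by_magnitude = prune_lowest(magnitude_scores(row), row, 0.5)
by_activation = prune_lowest(activation_aware_scores(row, act_norms), row, 0.5)
print("magnitude:       ", by_magnitude)
print("activation-aware:", by_activation)
```

In this example the small weight `-0.05` survives activation-aware pruning because its input is consistently large, while the large weight `0.9` is dropped because its input is nearly always small.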

4. Sparsity (Mixture-of-Experts)

Only activate a subset of parameters per forward pass:

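  • Llama 4 Maverick: about 400B total parameters with 17B active per token
  • An MoE model can hold roughly 10x the parameters of a dense model for about 2x the compute cost
  • Well suited to workloads that need broad knowledge; all experts must still be held in memory, so sparsity reduces compute rather than weight storage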
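
The routing step that selects which parameters run can be sketched in a few lines. The example below uses top-k routing with renormalized softmax gates, a common design, though the exact router differs between models. The toy experts and router logits are hypothetical stand-ins for feed-forward blocks and a learned router.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    best = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - best) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    return sum(gate * experts[i](x) for i, gate in top_k_route(router_logits, k))

# Four toy "experts"; a real expert is a feed-forward block.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
router_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router output for one token

routes = top_k_route(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits)
print(routes)
print(output)
```

Only two of the four experts execute for this token, which is why compute scales with the active parameter count and not the total.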

Combined Pipeline

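A state-of-the-art compression pipeline in 2025 chains the techniques above:

  1. Distill from 400B → 70B (5.7x smaller)
  2. Prune 30% of weights (1.4x smaller, provided the pruned weights are stored in a sparse format)
  3. Quantize from FP16 to INT4 (4x smaller)
  4. Total: roughly 32x smaller at about 85% of the original quality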
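
The total comes from multiplying the three reduction factors. A short check of the arithmetic follows, assuming an FP16 starting point and that pruned weights take no storage; the function name is illustrative.

```python
def combined_reduction(teacher_params, student_params, pruned_fraction,
                       source_bits, target_bits):
    """Return each stage's size-reduction factor and their product."""
    distill = teacher_params / student_params
    prune = 1 / (1 - pruned_fraction)  # assumes pruned weights cost nothing to store
    quant = source_bits / target_bits
    return distill, prune, quant, distill * prune * quant

distill, prune, quant, total = combined_reduction(400e9, 70e9, 0.30, 16, 4)
print(f"distill {distill:.2f}x * prune {prune:.2f}x * quantize {quant:.0f}x "
      f"= {total:.1f}x")
```

The quality figure does not compose as neatly as the size factors: losses from each stage interact, so the combined retention has to be measured on the final model.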

Tools of the Trade

  • LLM Compressor (vLLM project, originally Neural Magic): Quantization and sparsity recipes for vLLM deployment
  • Optimum (Hugging Face): Integrated quantization
  • SparseML: Pruning and sparsity
  • llama.cpp: CPU-optimized quantized inference
Vijayakumar S
AI Engineer · ML Enthusiast

Passionate about building intelligent systems, speech synthesis, and LLM applications. Writing about the tools and ideas shaping the next decade of software.