Making LLMs Smaller Without Losing Intelligence

Distillation, quantization, pruning, and sparsity for efficient deployment.

Vijayakumar S

Jun 15, 2025 · 14 min read

The Compression Toolbox

Deploying LLMs in production means balancing model quality against cost and latency. As of 2025, several compression techniques can be combined to shrink a model by roughly 10x with little quality loss.

1. Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher":

student_loss = alpha * KL_divergence(student_logits, teacher_logits)
             + (1 - alpha) * cross_entropy(student_logits, hard_labels)
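As a rough illustration, the loss above can be sketched in plain Python for a single example. The `temperature` parameter and the `temperature ** 2` scaling are assumptions borrowed from the common soft-target formulation and are not part of the formula above; a real training loop would use tensor operations from a framework such as PyTorch.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(logits, label):
    """Standard cross-entropy against a single hard label."""
    return -math.log(softmax(logits)[label])

def distillation_loss(student_logits, teacher_logits, hard_label,
                      alpha=0.5, temperature=2.0):
    # Soft term: match the teacher's softened distribution (assumed formulation).
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    soft = kl_divergence(teacher_probs, student_probs) * temperature ** 2
    # Hard term: ordinary supervised loss on the ground-truth label.
    hard = cross_entropy(student_logits, hard_label)
    return alpha * soft + (1 - alpha) * hard

teacher = [2.0, 0.5, -1.0]
aligned = distillation_loss([2.0, 0.5, -1.0], teacher, hard_label=0)
misaligned = distillation_loss([-1.0, 0.5, 2.0], teacher, hard_label=0)
print(f"aligned student loss:    {aligned:.4f}")
print(f"misaligned student loss: {misaligned:.4f}")
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the weighted hard-label term remains, so `alpha` controls how much the student is pulled toward the teacher versus the labels.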

Approximate results reported for Llama-4 distillation (retention measured relative to the teacher):

400B teacher → 70B student: 95% performance retention
70B teacher → 7B student: 88% performance retention
7B teacher → 1B student: 75% performance retention

2. Quantization

Reduce precision from 16-bit to lower bit widths:

| Precision   | Memory Reduction | Quality Loss | Best For          |
|-------------|------------------|--------------|-------------------|
| FP16        | 1x (baseline)    | 0%           | Training          |
| INT8        | 2x               | <1%          | Inference         |
| INT4        | 4x               | 2-5%         | Edge deployment   |
| INT2        | 8x               | 10-15%       | Extreme edge      |
| NF4 (QLoRA) | 4x               | <2%          | Fine-tuned models |
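To make the memory column concrete, here is a minimal sketch of symmetric round-to-nearest (absmax) quantization together with the weight-memory arithmetic. It is a simplified illustration, not how GPTQ, AWQ, or NF4 work internally: the function names are hypothetical, a single scale is shared by all weights, and the memory estimate counts weights only (no scales, activations, or KV cache).

```python
def quantize_absmax(weights, bits):
    """Map floats to signed integer codes using one shared scale."""
    qmax = 2 ** (bits - 1) - 1               # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]

def weight_memory_gb(num_params, bits):
    """Weight storage in GB for a given parameter count and bit width."""
    return num_params * bits / 8 / 1e9

weights = [0.12, -0.53, 0.98, -0.07, 0.31]
for bits in (8, 4):
    codes, scale = quantize_absmax(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"INT{bits}: codes={codes}, max abs error={max_err:.4f}")

# Weight-only footprint of a 70B-parameter model at different precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")
```

Fewer bits means a coarser grid and a larger rounding error, which is the trade-off the table summarizes. Production methods reduce that error with per-group scales and calibration data.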

GPTQ and AWQ are two of the most widely used post-training quantization algorithms. A GPTQ example with Hugging Face Transformers:

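from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Model ID as written in this article; substitute a checkpoint you have access to.
model_id = "meta-llama/Llama-4-70b"

# GPTQ needs a tokenizer to prepare the calibration dataset.
tokenizer = AutoTokenizer.from_pretrained(model_id)

quant_config = GPTQConfig(
    bits=4,               # 4-bit weights
    group_size=128,       # each group of 128 weights shares quantization parameters
    dataset="c4",         # calibration dataset
    tokenizer=tokenizer,
    desc_act=False        # skip activation-order quantization for faster inference
)

# Quantizes while loading; requires the optimum package and a GPTQ backend.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto"
)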

3. Pruning

Remove unimportant weights or neurons:

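Magnitude pruning: Remove the weights with the smallest absolute values
SparseGPT: One-shot pruning to about 50% sparsity with little accuracy loss
Wanda: Score each weight by its magnitude times the norm of its input activation, then prune the lowest scores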
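The difference between magnitude scoring and a weight-times-activation score can be sketched on a toy weight matrix. This is a simplified illustration in plain Python: the helper names are hypothetical, both methods prune per output row here, and the activation values are made up for the example.

```python
import math

def prune_by_score(weights, scores, sparsity):
    """Zero the lowest-scoring fraction `sparsity` of weights in each row."""
    pruned = []
    for row_w, row_s in zip(weights, scores):
        k = int(len(row_w) * sparsity)
        drop = set(sorted(range(len(row_w)), key=lambda j: row_s[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row_w)])
    return pruned

def magnitude_scores(weights):
    """Importance = absolute weight value."""
    return [[abs(w) for w in row] for row in weights]

def weight_activation_scores(weights, activations):
    """Importance = |W_ij| * L2 norm of input feature j over calibration samples."""
    n_in = len(weights[0])
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations)) for j in range(n_in)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]

# One output neuron with four inputs; features 1 and 3 carry large activations.
W = [[0.9, -0.1, 0.4, 0.05]]
X = [[0.1, 5.0, 0.2, 8.0],
     [0.2, 4.0, 0.1, 9.0]]

by_magnitude = prune_by_score(W, magnitude_scores(W), sparsity=0.5)
by_activation = prune_by_score(W, weight_activation_scores(W, X), sparsity=0.5)
print("magnitude pruning:        ", by_magnitude)
print("weight*activation pruning:", by_activation)
```

In this toy case the two criteria keep opposite weights: small weights attached to strongly activated inputs survive under the activation-aware score, while pure magnitude pruning discards them.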

4. Sparsity (Mixture-of-Experts)

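Activate only a subset of the parameters for each token:

Llama 4 Maverick: about 400B total parameters, 17B active per token
Many times the parameter count of a dense model at a small multiple of its compute cost (roughly 10x the parameters for 2x the compute as a rule of thumb)
Well suited to workloads that need broad knowledge; all experts must still be held in memory, so MoE reduces compute per token rather than model size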
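A minimal sketch of top-k expert routing shows why only part of the model runs per token. The expert functions, router logits, and function names below are hypothetical stand-ins; real MoE layers use learned routers and feed-forward experts operating on tensors.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts are never called."""
    gates = top_k_route(router_logits, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())

# Four toy "experts", each just scaling its input by a different factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [0.1, 2.0, -1.0, 1.5]        # made-up router scores for one token

gates = top_k_route(router_logits, k=2)
output = moe_forward(1.0, experts, router_logits, k=2)
print("selected experts and gate weights:", gates)
print("layer output:", round(output, 3))
print("fraction of experts executed:", 2 / len(experts))
```

With two of four experts selected, half of the expert parameters take part in this token's computation, although all four must remain loaded.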

Combined Pipeline

A state-of-the-art pipeline in 2025 chains the techniques above:

Distill from 400B → 70B (5.7x smaller)
Prune 30% of weights (1.4x smaller)
Quantize to INT4 (4x smaller)
Total: roughly 32x smaller while retaining about 85% of the original quality
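The total follows from multiplying the per-stage factors, as this short calculation shows. The function name is hypothetical, and the pruning factor assumes the pruned weights are actually stored in a sparse format; otherwise zeroed weights still occupy memory.

```python
def combined_compression(teacher_params, student_params,
                         prune_fraction, orig_bits, quant_bits):
    """Return the per-stage size reduction factors and their product."""
    distill = teacher_params / student_params    # fewer parameters
    prune = 1 / (1 - prune_fraction)             # fraction of weights removed
    quant = orig_bits / quant_bits               # fewer bits per weight
    return distill, prune, quant, distill * prune * quant

distill, prune, quant, total = combined_compression(
    teacher_params=400e9, student_params=70e9,
    prune_fraction=0.30, orig_bits=16, quant_bits=4,
)
print(f"distillation: {distill:.1f}x")   # 400B / 70B
print(f"pruning:      {prune:.1f}x")     # 1 / 0.7
print(f"quantization: {quant:.1f}x")     # 16 / 4
print(f"total:        {total:.1f}x")
```

The unrounded product is about 32.7x, in line with the roughly 32x quoted above. The quality figure cannot be derived this way, since losses from each stage interact and have to be measured on a benchmark.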

Tools of the Trade

LLM Compressor (vLLM project, originally Neural Magic): Quantization and pruning recipes for vLLM deployment
Optimum (Hugging Face): Integrated quantization
SparseML: Pruning and sparsity
llama.cpp: CPU-optimized quantized inference

