Parameter-Efficient Fine-Tuning: LoRA, DoRA, and Beyond
Modern PEFT techniques that let you fine-tune LLMs by training roughly 1% of their parameters or fewer.
Vijayakumar S
May 1, 2025 · 11 min read
The Fine-Tuning Revolution
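Full fine-tuning of 70B+ models is impractical for most organizations, because the weights, gradients, and optimizer state for every parameter must fit in GPU memory. PEFT techniques make fine-tuning accessible by freezing the base model and training only small adapter modules.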
LoRA (Low-Rank Adaptation)
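The foundational technique that most later methods build on:
- Freezes the pretrained weights and injects trainable low-rank decomposition matrices into selected layers
- Trains only about 0.1-1% as many parameters as the original model contains
- Adds no inference latency once the adapter is merged into the base weights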
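from peft import LoraConfig, get_peft_model
# base_model is assumed to be an already-loaded transformers causal LM (a 7B model here)
lora_config = LoraConfig(
    r=16,  # rank of the low-rank update
    lora_alpha=32,  # scaling numerator; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Roughly 17M trainable params on a Llama-style 7B model (hidden size 4096, 32 layers)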
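To make the parameter count and the "no latency when merged" claim concrete, here is a minimal NumPy sketch. It assumes a Llama-style 7B shape (hidden size 4096, 32 layers, square attention projections); the helper `lora_param_count` and the toy layer sizes are illustrative, not part of any library.

```python
import numpy as np

def lora_param_count(d_in, d_out, r, n_modules, n_layers):
    """Trainable params when each adapted (d_out x d_in) matrix gets A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out) * n_modules * n_layers

# Assumed 7B shape: hidden size 4096, 32 layers, square q/k/v/o projections.
qkvo = lora_param_count(4096, 4096, r=16, n_modules=4, n_layers=32)
qv = lora_param_count(4096, 4096, r=16, n_modules=2, n_layers=32)
print(f"q,k,v,o: {qkvo:,} params | q,v only: {qv:,} params")

# Merging on a toy layer: the adapter path and the merged weight give the same output.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 4, 8
W0 = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = rng.normal(size=(d_out, r))           # trainable up-projection (zero at init in practice;
                                          # random here so the update is non-trivial)
x = rng.normal(size=d_in)

scale = alpha / r
y_adapter = W0 @ x + scale * (B @ (A @ x))  # separate adapter path (extra matmuls)
W_merged = W0 + scale * (B @ A)             # fold the update into the base weight
y_merged = W_merged @ x                     # single matmul, same cost as the base model
print("merged output matches adapter output:", np.allclose(y_adapter, y_merged))
```

With r=16, adapting all four attention projections gives about 16.8M trainable parameters under these shape assumptions, while adapting only q and v gives about 8.4M.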
DoRA: Weight-Decomposed LoRA
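A 2024 refinement of LoRA:
- Decomposes each pretrained weight matrix into a magnitude vector and a direction matrix
- Trains the magnitude directly and applies a LoRA-style low-rank update to the direction
- Typically outperforms LoRA at the same rank; the DoRA paper reports gains of a few accuracy points on commonsense-reasoning benchmarks for LLaMA-family models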
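The decomposition can be sketched in a few lines of NumPy. This is an illustrative reading of the DoRA formulation, assuming a per-column norm and a magnitude vector initialized from the pretrained weight; the function name `dora_weight` is hypothetical.

```python
import numpy as np

def dora_weight(W0, A, B, m, scale=1.0):
    """Recombine magnitude and direction: W' = m * V / ||V||_col, with V = W0 + scale * B @ A."""
    V = W0 + scale * (B @ A)                             # direction, updated LoRA-style
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # one norm per column, shape (1, d_in)
    return m * (V / col_norm)                            # m rescales each unit-norm column

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2
W0 = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(size=(r, d_in))                 # trainable
B = np.zeros((d_out, r))                       # trainable, zero at init
m = np.linalg.norm(W0, axis=0, keepdims=True)  # trainable magnitude, initialized to ||W0||_col

# At initialization the adapted weight equals the pretrained weight.
W_init = dora_weight(W0, A, B, m)
print("matches W0 at init:", np.allclose(W_init, W0))

# After a directional update, column norms are still set by m alone.
B_trained = rng.normal(size=(d_out, r)) * 0.1
W_new = dora_weight(W0, A, B_trained, m)
print("column norms equal m:", np.allclose(np.linalg.norm(W_new, axis=0, keepdims=True), m))
```

Because the low-rank update only changes direction and `m` only changes scale, the two can be learned separately. Recent PEFT releases expose this as `use_dora=True` on `LoraConfig`; check the version you have installed.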
QLoRA: Quantized LoRA
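Fine-tune 70B models on a single 48 GB GPU:
- 4-bit (NF4) quantization of the frozen base model
- LoRA adapters kept in higher precision (16-bit or 32-bit) and trained through the quantized weights
- Memory: 4-bit weights for a 70B model take about 35 GB, so the model can fit in 48 GB of VRAM with small batches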
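import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # the -hf repo holds the transformers-format weights
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # readies the quantized model for training
# lora_config is the LoraConfig defined in the LoRA section
peft_model = get_peft_model(model, lora_config)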
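The memory claim comes down to simple arithmetic on the weight storage. The sketch below is a back-of-the-envelope estimate only: `weight_memory_gb` is a hypothetical helper, and it ignores quantization constants, activations, adapter gradients, and optimizer state, all of which add overhead in practice.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 70e9
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit base model
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit quantized base model

print(f"16-bit weights: {fp16_gb:.0f} GB")
print(f"4-bit weights:  {nf4_gb:.0f} GB")
print(f"headroom on a 48 GB GPU: {48 - nf4_gb:.0f} GB")
```

The 4-bit weights leave roughly 13 GB of headroom on a 48 GB card under this estimate, which is why batch size and sequence length usually have to stay small.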
AdapterFusion
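Combine multiple task adapters without retraining the adapters or the base model; only a small fusion layer is trained on top:
# Pseudocode: finetune() and AdapterFusion() are illustrative placeholders, not a real library API
# Stage 1: train separate adapters on a frozen base model
adapter_coding = finetune(model, coding_data)
adapter_math = finetune(model, math_data)
adapter_creative = finetune(model, creative_data)
# Stage 2: learn attention weights that mix the frozen adapters' outputs
fused = AdapterFusion([adapter_coding, adapter_math, adapter_creative])
# The fused model can draw on the coding, math, and creative adapters for each input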
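What the fusion step computes can be sketched with NumPy. This is a simplified, single-token illustration that assumes the attention-style formulation from the AdapterFusion paper: the layer's hidden state forms the query and each adapter's output supplies a key and a value. The function `fuse_adapters` and all matrices are hypothetical stand-ins; in a real setup the three projection matrices are trained while the adapters and base model stay frozen.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_adapters(h, adapter_outputs, Wq, Wk, Wv):
    """Mix adapter outputs with attention weights computed from the hidden state h."""
    Z = np.stack(adapter_outputs)    # (n_adapters, d)
    query = Wq @ h                   # (d,)
    keys = Z @ Wk.T                  # (n_adapters, d)
    values = Z @ Wv.T                # (n_adapters, d)
    weights = softmax(keys @ query)  # one weight per adapter, sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=d)                                    # hidden state at one position
adapter_outputs = [rng.normal(size=d) for _ in range(3)]  # e.g. coding, math, creative
Wq = rng.normal(size=(d, d)) * 0.1                        # fusion parameters (learned in practice)
Wk = rng.normal(size=(d, d)) * 0.1
Wv = np.eye(d)                                            # identity keeps the example easy to read

fused_output, weights = fuse_adapters(h, adapter_outputs, Wq, Wk, Wv)
print("adapter weights:", weights.round(3))
print("fused output shape:", fused_output.shape)
```

Because the weights depend on `h`, the mix of adapters can change from token to token and layer to layer.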
Best Practices for 2025
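- Start with r=8-16 and increase the rank if the model underfits
- Target all attention projections (q, k, v, o) and the MLP layers
- Use rank-stabilized LoRA (rsLoRA), which scales the update by alpha/sqrt(r) instead of alpha/r, when training at higher ranks
- Combine with 4-bit quantization (QLoRA) to fit on consumer GPUs
- Save only the adapter weights (typically a few MB to tens of MB, versus multiple GB for the full model)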
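The rsLoRA recommendation is easiest to see by comparing the two scaling factors as the rank grows. A small sketch, with `lora_scale` and `rslora_scale` as illustrative helper names and alpha fixed at 16 as an arbitrary choice:

```python
import math

def lora_scale(alpha, r):
    """Standard LoRA scaling: shrinks linearly as rank grows."""
    return alpha / r

def rslora_scale(alpha, r):
    """Rank-stabilized scaling: shrinks with the square root of the rank."""
    return alpha / math.sqrt(r)

alpha = 16
for r in (8, 16, 64, 256):
    print(f"r={r:>3}  LoRA: {lora_scale(alpha, r):.4f}  rsLoRA: {rslora_scale(alpha, r):.4f}")
```

At r=256 the standard factor has dropped to 0.0625 while the rank-stabilized one is still 1.0, which is why higher ranks tend to help more under rsLoRA. In PEFT this is switched on with `use_rslora=True` in `LoraConfig`, assuming a reasonably recent release.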
When to Use Which
| Scenario                           | Best Method   |
|------------------------------------|---------------|
| Limited compute (1 GPU, 70B model) | QLoRA         |
| Best performance possible          | DoRA          |
| Multiple tasks sharing base        | AdapterFusion |
| Rapid experimentation              | LoRA (r=4)    |
| Production deployment              | LoRA merged   |
Vijayakumar S
AI Engineer · ML Enthusiast
Passionate about building intelligent systems, speech synthesis, and LLM applications. Writing about the tools and ideas shaping the next decade of software.