
Parameter-Efficient Fine-Tuning: LoRA, DoRA, and Beyond

Modern PEFT techniques let you fine-tune LLMs by training roughly 1% of their parameters, or fewer.

Vijayakumar S
May 1, 2025 · 11 min read
[Figure: Parameter-Efficient Fine-Tuning visualization]

The Fine-Tuning Revolution

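Full fine-tuning of 70B+ models is impractical for most organizations: every weight needs gradients and optimizer state, which multiplies the memory footprint well beyond the model itself. PEFT techniques make fine-tuning accessible by freezing the base model and training only small adapter modules.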

LoRA (Low-Rank Adaptation)

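The technique that most current PEFT methods build on:

  • Injects trainable low-rank decomposition matrices into selected layers while the pretrained weights stay frozen
  • Typically trains only 0.1-1% of the original parameter count
  • Adds no inference latency once the adapter is merged into the base weights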
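To make the "no latency when merged" point concrete, here is a minimal pure-Python sketch on toy matrices. It assumes the usual LoRA formulation, where the update is the product B·A scaled by alpha / r; the helper names (`matmul`, `lora_forward`, `merge_lora`) and all values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(w, a, b, x, alpha, r):
    """Unmerged forward pass: y = W x + (alpha / r) * B (A x)."""
    scale = alpha / r
    base = matmul(w, x)
    update = matmul(b, matmul(a, x))
    return [[p + scale * q for p, q in zip(br, ur)] for br, ur in zip(base, update)]


def merge_lora(w, a, b, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B A."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[p + scale * q for p, q in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy sizes: a frozen 3x3 weight with a rank-1 adapter.
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
A = [[1, 2, 0]]        # shape r x d_in, here 1 x 3
B = [[1], [0], [2]]    # shape d_out x r, here 3 x 1
x = [[1], [1], [1]]    # input as a column vector

y_unmerged = lora_forward(W, A, B, x, alpha=2, r=1)
y_merged = matmul(merge_lora(W, A, B, alpha=2, r=1), x)
print(y_unmerged == y_merged)
```

Both paths give the same output, so after merging the model runs a single matrix multiply per layer, exactly as the base model did.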
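
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Example 7B checkpoint; any causal LM with these projection names works
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,  # rank of the update matrices
    lora_alpha=32,  # the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# Trains ~16.8M params (about 0.25%) on a 7B Llama-style model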
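The trainable-parameter count for a configuration like this can be checked by hand: each adapted weight of shape d_out × d_in adds r × (d_in + d_out) parameters. The sketch below assumes Llama-2-7B-like shapes (32 layers, hidden size 4096, about 6.74B total parameters); those figures and the function name are assumptions for illustration, not values read from a loaded model.

```python
def lora_trainable_params(rank, layers, module_shapes):
    """Count LoRA parameters: each adapted weight adds rank * (d_in + d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in module_shapes)
    return layers * per_layer


# Assumed shapes: q_proj, k_proj, v_proj and o_proj are all 4096 x 4096.
attention_shapes = [(4096, 4096)] * 4
trainable = lora_trainable_params(rank=16, layers=32, module_shapes=attention_shapes)

base_params = 6.74e9                  # approximate size of a 7B Llama-style model
fraction = trainable / base_params    # share of parameters that are trained
adapter_mb = trainable * 2 / 1e6      # file size at 2 bytes per parameter (16-bit)

print(f"{trainable:,} trainable params, {fraction:.2%} of the base, ~{adapter_mb:.0f} MB")
```

With r=16 on the four attention projections this comes to about 16.8M parameters, or roughly 0.25% of the base model. Halving the rank, or targeting only two projections, halves that number.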

DoRA: Weight-Decomposed LoRA

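A 2024 refinement of LoRA:

  • Decomposes each pretrained weight matrix into a magnitude vector and a direction matrix
  • Trains the magnitude directly and updates the direction with a LoRA-style low-rank matrix
  • Reported to outperform LoRA at the same rank, typically by a few accuracy points; the gain varies by model and task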
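The magnitude/direction split can be sketched in a few lines of pure Python. This is a toy illustration under the assumption that magnitude means the per-column L2 norm of the weight matrix and that the direction update stands in for a low-rank product B·A; the function names and numbers are made up.

```python
import math


def column_norms(w):
    """L2 norm of each column of a matrix given as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def decompose(w):
    """Split W into a magnitude vector m and a direction matrix with unit-norm columns."""
    m = column_norms(w)
    direction = [[v / n for v, n in zip(row, m)] for row in w]
    return m, direction


def dora_weight(w0, delta_v, m):
    """Recompose W' = m * (W0 + delta_V) / ||W0 + delta_V||, column by column."""
    v = [[a + b for a, b in zip(r0, rd)] for r0, rd in zip(w0, delta_v)]
    norms = column_norms(v)
    return [[mag * val / n for val, n, mag in zip(row, norms, m)] for row in v]


W0 = [[3.0, 0.0], [4.0, 5.0]]           # frozen pretrained weight
m0, V0 = decompose(W0)                  # starting magnitude and direction
delta_V = [[0.0, 3.0], [0.0, -1.0]]     # rank-1 direction update, standing in for B @ A
m_new = [6.0, 10.0]                     # trained magnitudes (made-up values)
W_new = dora_weight(W0, delta_V, m_new)
print(column_norms(W_new))
```

Whatever the direction update does, the column norms of the recomposed weight equal the trained magnitudes, which is what lets the two components be learned separately. In the PEFT library, recent releases expose DoRA through a `use_dora` flag on `LoraConfig`; check that your installed version supports it.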

QLoRA: Quantized LoRA

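Fine-tune 70B-class models on a single GPU:

  • Quantizes the frozen base model to 4 bits (NF4)
  • Keeps the LoRA adapters in higher precision (typically bfloat16); gradients flow through the quantized weights into the adapters
  • Memory: the 4-bit weights of a 70B model take roughly 35 GB, so the model can fit in 48 GB of VRAM with small batch sizes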
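
import torch
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # Hugging Face-format checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# lora_config is the LoraConfig defined in the LoRA section
peft_model = get_peft_model(model, lora_config)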
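The memory claim follows from simple arithmetic on the weights alone. The sketch below is a back-of-the-envelope estimate, not a measurement: it ignores quantization constants, adapter weights, activations, and optimizer state, all of which add to the total. The function name is made up for illustration.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9
fp16_gb = weight_memory_gb(params_70b, 16)   # 16-bit weights
nf4_gb = weight_memory_gb(params_70b, 4)     # 4-bit quantized weights

print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {nf4_gb:.0f} GB")
```

At 16 bits the weights alone need 140 GB, far beyond a single GPU. At 4 bits they need 35 GB, which leaves some headroom on a 48 GB card for the adapters and activations.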

AdapterFusion

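Combine several independently trained adapters without retraining them. The adapters and the base model stay frozen; only a small fusion layer that learns how to weight their outputs is trained:

# Pseudocode: finetune() and AdapterFusion() are placeholders, not a library API
# Stage 1: train one adapter per task on the frozen base model
adapter_coding = finetune(model, coding_data)
adapter_math = finetune(model, math_data)
adapter_creative = finetune(model, creative_data)

# Stage 2: learn a fusion layer over the frozen adapters
fused = AdapterFusion([adapter_coding, adapter_math, adapter_creative])
# The fused model can draw on the coding, math, and creative-writing adapters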
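The fusion step can be pictured as attention over the adapters' outputs. The following is a deliberately simplified sketch: it uses the layer's hidden state as the query and each adapter output as both key and value, and it leaves out the learned projection matrices that an actual fusion layer would train. All names and numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(hidden, adapter_outputs):
    """Weight each adapter output by its dot-product similarity to the hidden state."""
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in adapter_outputs]
    weights = softmax(scores)
    fused = [
        sum(w * out[i] for w, out in zip(weights, adapter_outputs))
        for i in range(len(hidden))
    ]
    return fused, weights


hidden = [1.0, 0.0]
outputs = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]  # made-up outputs of three adapters
fused, weights = fuse(hidden, outputs)
print(weights)
```

The adapter whose output aligns best with the current hidden state gets the largest weight, so the mix can change from token to token rather than being fixed once for the whole model.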

Best Practices for 2025

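  • Start with r=8-16 and increase the rank only if the model underfits
  • Target all attention projections (q, k, v, o) and the MLP layers (gate_proj, up_proj, down_proj in Llama-style models)
  • Use rank-stabilized LoRA (rsLoRA) at higher ranks; it scales the update by alpha / sqrt(r) instead of alpha / r
  • Combine with 4-bit quantization to fit larger models on consumer GPUs
  • Save only the adapter weights (typically tens of MB, versus many GB for the full model)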
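The rsLoRA recommendation comes down to one scaling factor. Standard LoRA multiplies the low-rank update by alpha / r, so the update shrinks quickly as the rank grows; rsLoRA divides by the square root of r instead. The sketch below compares the two for a fixed alpha of 32; the function names are made up for illustration.

```python
import math


def lora_scale(alpha, r):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / r


def rslora_scale(alpha, r):
    """Rank-stabilized scaling: divide by sqrt(r) instead of r."""
    return alpha / math.sqrt(r)


scales = {r: (lora_scale(32, r), rslora_scale(32, r)) for r in (4, 16, 64)}
for r, (standard, stabilized) in scales.items():
    print(f"r={r:>2}  LoRA scale={standard:<5} rsLoRA scale={stabilized}")
```

Going from r=4 to r=64 cuts the standard factor by 16x but the rank-stabilized one by only 4x. In the PEFT library, recent releases expose this through a `use_rslora` flag on `LoraConfig`; check that your installed version supports it.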

When to Use Which

| Scenario                            | Best Method   |
|-------------------------------------|---------------|
| Limited compute (1 GPU, 70B model)  | QLoRA         |
| Best possible performance           | DoRA          |
| Multiple tasks sharing a base model | AdapterFusion |
| Rapid experimentation               | LoRA (r=4)    |
| Production deployment               | LoRA (merged) |

Vijayakumar S
AI Engineer · ML Enthusiast

Passionate about building intelligent systems, speech synthesis, and LLM applications. Writing about the tools and ideas shaping the next decade of software.