RLHF and Constitutional AI in 2025
How alignment techniques have evolved beyond human preferences.
Vijayakumar S
Jul 1, 2025 · 11 min read
The Evolution of Alignment
RLHF remains the dominant alignment technique, but by 2025 more efficient variants and alternatives, such as Constitutional AI and Direct Preference Optimization (DPO), have become widely used.
Standard RLHF Pipeline
- Supervised Fine-Tuning (SFT): Train on demonstrations
- Reward Modeling: Train model to predict human preferences
- Reinforcement Learning (PPO): Optimize policy against reward model
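# PPO training loop for LLMs (sketch: policy, reward_model, value_model,
# and policy.log_prob are placeholders, not a specific library's API)
import torch

for epoch in range(num_epochs):
    with torch.no_grad():
        # Generate responses and record log-probs under the pre-update policy
        responses = policy.generate(prompts)
        old_log_probs = policy.log_prob(prompts, responses)
        # Score with reward model
        rewards = reward_model(prompts, responses)
        # Advantage: reward minus the value baseline
        adv = rewards - value_model(prompts)
    # Compute the clipped PPO loss
    log_probs = policy.log_prob(prompts, responses)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    policy_loss = -torch.min(ratio * adv, clipped * adv).mean()
    # Update policy
    optimizer.zero_grad()
    policy_loss.backward()
    optimizer.step()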
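Step 2 of the pipeline, reward modeling, is commonly trained with a pairwise Bradley-Terry loss that pushes the reward of the preferred response above the reward of the rejected one. The following is a minimal sketch in plain Python, assuming scalar rewards have already been computed for each pair. The function name `pairwise_reward_loss` is illustrative and not taken from any library.

```python
import math

def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))."""
    if not chosen_rewards or len(chosen_rewards) != len(rejected_rewards):
        raise ValueError("need equal-length, non-empty reward lists")
    total = 0.0
    for r_c, r_r in zip(chosen_rewards, rejected_rewards):
        margin = r_c - r_r
        # -log(sigmoid(m)) equals log(1 + exp(-m)); this softplus form
        # avoids overflow when |m| is large.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)
```

The loss is log 2 when the two rewards are equal and shrinks toward zero as the chosen response's reward pulls ahead.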
DPO: Direct Preference Optimization
Introduced in 2023 and widely adopted through 2024, DPO eliminates the separate reward model and RL loop:
- Train directly on preference pairs
- Faster than PPO-based RLHF (roughly 4x in some setups; the gain depends on model size and hardware)
- More stable training
- Comparable or better results
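from trl import DPOTrainer

# training_args is assumed to be a trl.DPOConfig (recent TRL versions)
dpo_trainer = DPOTrainer(
    model=policy_model,
    ref_model=reference_model,  # frozen reference, typically the SFT model
    args=training_args,
    train_dataset=preference_dataset,  # (chosen, rejected) pairs
)
dpo_trainer.train()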
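To make "train directly on preference pairs" concrete, the per-pair DPO objective can be sketched in plain Python. It assumes the summed sequence log-probabilities of the chosen and rejected responses are already available from both the policy and the frozen reference model. The function name `dpo_loss` and the value `beta=0.1` are illustrative assumptions, not TRL's implementation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), written in a numerically stable softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy matches the reference, the loss is log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one, which is how the reward model becomes implicit in the policy itself.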
Constitutional AI
Anthropic's approach to alignment, which replaces most human harmlessness labels with AI feedback guided by written principles:
- Define a "constitution" of principles
- Model critiques and revises its own responses
- Reinforcement learning from AI feedback (RLAIF)
Example Constitution Principles
- "Choose the most helpful, honest, and harmless response"
- "Avoid perpetuating harmful stereotypes"
- "Respect user privacy and autonomy"
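The critique-and-revise step can be sketched as a short loop. This is a simplified illustration, assuming a hypothetical `llm` callable that maps a prompt string to a completion string. The prompt wording is invented for the example, and the sketch applies every principle in turn rather than sampling one per step.

```python
from typing import Callable, List

def critique_and_revise(prompt: str, principles: List[str],
                        llm: Callable[[str], str]) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = llm(prompt)
    for principle in principles:
        # Ask the model to find violations of this principle.
        critique = llm(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Point out any way the response violates the principle."
        )
        # Ask the model to rewrite its answer in light of the critique.
        response = llm(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```

The revised responses can then serve as supervised fine-tuning data, and AI-generated preference labels over pairs of responses feed the RLAIF stage.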
Best Practices 2025
- Data quality > quantity: 10k high-quality preference pairs can beat 100k noisy ones
- Diverse preference sources: Include multiple demographics, cultures
- Reward hacking detection: Monitor for exploitation of reward model
- Evaluation: Use LLM-as-judge with rubrics
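One simple reward hacking check is to watch whether reward tracks response length, since padding out answers is a common way for a policy to exploit a reward model. The sketch below is one possible monitor, not a standard tool. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def flags_length_exploit(rewards, response_lengths, threshold=0.8):
    """Flag a batch whose rewards are strongly correlated with length."""
    return pearson(rewards, response_lengths) > threshold
```

A flagged batch is a prompt for manual inspection, not proof of reward hacking: longer answers are sometimes better.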
Open Source Tools
- TRL (Transformer Reinforcement Learning): Hugging Face's RLHF library
- Axolotl: Fine-tuning with QLoRA + DPO
- OpenRLHF: Scalable RLHF with Ray
Vijayakumar S
AI Engineer · ML Enthusiast
Passionate about building intelligent systems, speech synthesis, and LLM applications. Writing about the tools and ideas shaping the next decade of software.