RL For LLMs - An Overview
Bhavishya Pandit
INTRODUCTION
Reinforcement Learning (RL) is reshaping industries by enabling machines to learn
from experience, not just static data. This post gives an overview of how RL is
being used in LLMs.
In this post we’ll cover:
WHAT IS RL?
HOW DEEPSEEK USES RL
BENEFITS
APPLICATIONS
LIMITATIONS
WHAT IS RL?
Reinforcement Learning is a learning paradigm in which an agent interacts with an
environment, takes actions, and receives reward signals. Rather than learning from
labeled examples, the agent learns a behavior that maximizes cumulative reward over time.
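To make the idea concrete, here is a minimal sketch of the learn-from-reward loop on a toy two-action problem; the environment, payoff probabilities, and epsilon-greedy rule are illustrative assumptions, not part of the original post.

```python
import random

# Toy environment: the agent picks one of two actions; action 1 pays off more often.
# Everything here is illustrative -- a minimal agent-environment loop, not a real RL library.
def step(action):
    """Return a reward for the chosen action."""
    pay_prob = 0.8 if action == 1 else 0.3
    return 1.0 if random.random() < pay_prob else 0.0

values = {0: 0.0, 1: 0.0}   # value estimate per action, learned from experience
counts = {0: 0, 1: 0}
epsilon = 0.1               # exploration rate

for t in range(1000):
    # Explore occasionally, otherwise exploit the action with the higher estimate.
    if random.random() < epsilon:
        action = random.choice([0, 1])
    else:
        action = max(values, key=values.get)
    reward = step(action)
    # Incremental average: learning comes from the reward signal, not from labeled data.
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)  # the estimate for action 1 should end up higher
```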
HOW LLMs USE RL
LLMs like ChatGPT use Reinforcement Learning from Human Feedback (RLHF) to fine-tune their
responses based on human preferences. Instead of relying solely on pre-training data, RLHF
lets a model learn from human feedback, making outputs safer, better aligned with user
intent, and less prone to harmful biases.
This technique makes LLMs more reliable for real-world applications by continuously
improving their ability to generate contextually appropriate and ethical responses.
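As a rough illustration of the RLHF idea (not the actual ChatGPT pipeline), the sketch below uses a hand-written stand-in for a learned reward model and a REINFORCE-style update over a few canned responses; all names and the reward rule are illustrative assumptions.

```python
import math
import random

responses = [
    "I can't help with that.",
    "Sure, here's a clear, step-by-step answer.",
    "lol figure it out yourself",
]
logits = [0.0, 0.0, 0.0]   # policy parameters: one logit per canned response
learning_rate = 0.5

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reward_model(text):
    """Stand-in for a learned reward model trained on human preference data."""
    score = 0.0
    if "step-by-step" in text:
        score += 1.0   # humans preferred helpful, structured answers
    if "lol" in text:
        score -= 1.0   # humans penalized dismissive tone
    return score

for step in range(200):
    probs = softmax(logits)
    # Sample a response from the current policy and score it with the reward model.
    i = random.choices(range(len(responses)), weights=probs)[0]
    r = reward_model(responses[i])
    # REINFORCE gradient for a softmax policy: (indicator - prob) * reward.
    for j in range(len(logits)):
        grad = ((1.0 if j == i else 0.0) - probs[j]) * r
        logits[j] += learning_rate * grad

print(softmax(logits))  # probability mass should concentrate on the helpful answer
```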
HOW DEEPSEEK USES RL
(Image credit: Predibase)
DeepSeek leverages RL through its Group Relative Policy Optimization (GRPO) method, achieving
state-of-the-art reasoning performance. Here's a streamlined breakdown (a minimal code sketch
follows the list):
RL-Centric Training:
GRPO uses verifiable outcomes (e.g., correct code/math solutions) as reward signals,
driving the model to self-improve through iterative self-reflection.
Eliminates dependency on supervised fine-tuning (SFT) for initial training phases.
Hybrid Training Pipeline:
Combines cold-start data (curated reasoning examples) with RL to stabilize
learning and avoid poor readability issues.
Distills the RL-trained 671B-parameter model into smaller versions (1.5B–70B) while
retaining 95%+ of its performance on tasks like code generation.
Performance Gains:
Achieved 97.3% on MATH-500 (vs. OpenAI o1-1217’s 96.4%) and 65.9% on
LiveCodeBench coding tasks (vs. GPT-4o’s 34.2%).
Outperforms non-RL models on faithfulness, with 59% of responses correct vs.
7% for standard models.
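The full GRPO objective also involves a clipped policy-gradient ratio and a KL penalty against a reference model; the sketch below only shows the two ingredients named above, a verifiable rule-based reward and group-relative advantage normalization. The prompt, answers, and function names are illustrative assumptions.

```python
import math

def verifiable_reward(answer, expected):
    """Rule-based reward: 1 if the final answer checks out, else 0."""
    return 1.0 if answer.strip() == expected else 0.0

def group_relative_advantages(rewards):
    """Normalize each reward against its own group's mean and std (GRPO-style),
    so no separate critic/value model is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0   # avoid divide-by-zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Pretend the model sampled a group of 4 answers to "What is 17 * 3?"
group = ["51", "54", "51", "I think it's 50"]
rewards = [verifiable_reward(a, "51") for a in group]
advantages = group_relative_advantages(rewards)

print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # correct answers get positive advantage, wrong ones negative
```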
BENEFITS
Improved Response Quality
Enhanced Adaptability
Efficient Knowledge Distillation
APPLICATIONS
Visual Search & Recommendations
Autonomous Vehicles
Cross-Modal Retrieval
Healthcare & Drug Discovery
Healthcare Diagnostics
Finance & Trading Applications
LIMITATIONS
Bias Reinforcement: RLHF can amplify biases in training data, leading to unfair or
skewed model outputs.
Exploration vs. Exploitation: LLMs struggle to balance generating diverse
responses (exploration) and refining known patterns (exploitation).
Reward Design Complexity: Defining clear, effective reward signals for language
tasks is challenging, often leading to suboptimal learning (see the sketch after this list).
High Computational Cost: RL-based fine-tuning demands extensive resources,
making it expensive and less accessible.
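As a concrete illustration of the reward-design problem, the sketch below shows how a naive hand-crafted reward can be gamed ("reward hacking"); the reward rule and example strings are deliberately simplistic assumptions, not a recommendation.

```python
def naive_reward(answer):
    """Reward longer answers that mention 'because' (a weak proxy for reasoning)."""
    return 0.01 * len(answer) + (1.0 if "because" in answer else 0.0)

honest = "42, because 6 * 7 = 42."
gamed = "because because because " * 50   # useless padding that maximizes the proxy

print(naive_reward(honest))  # modest score for a genuinely good answer
print(naive_reward(gamed))   # much higher score for meaningless repetition
```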
Follow Bhavishya Pandit to stay updated on Generative AI.