
RL FOR LLMs: AN OVERVIEW

Bhavishya Pandit
INTRODUCTION
Reinforcement Learning (RL) is reshaping industries by enabling machines to learn
from experience, not just data. This post is an overview to help you understand how
RL is being used in LLMs.
In this post we’ll cover:

WHAT IS RL?
HOW LLMs USE RL
HOW DEEPSEEK USES RL
BENEFITS
APPLICATIONS
LIMITATIONS

WHAT IS RL?

(Figure: Reinforcement Learning. Image credit: Database Town)

Many AI breakthroughs in robotics, gaming, and finance rely on Reinforcement Learning (RL).

But what is Reinforcement Learning?


Imagine a baby learning to crawl: it stumbles, adjusts, and improves through
experience. RL works the same way. Instead of following fixed rules, an AI agent
learns by interacting with its environment, receiving rewards for good actions.

Just as encouragement helps a baby walk, rewards refine an RL model’s decisions.


Over time, both master their tasks through trial and error, making RL a game-
changer for AI in dynamic environments.
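To make that loop concrete, here is a minimal sketch of tabular Q-learning on a made-up five-state "corridor" environment. Everything in it (the environment, the +1 reward for reaching the last state, the hyperparameters) is a toy assumption chosen for illustration, not a production setup.

```python
import random

# Minimal sketch of the RL loop: an agent acts, the environment returns a reward,
# and the agent updates its behaviour. The 5-state "corridor" environment and all
# hyperparameters below are toy assumptions.

NUM_STATES, NUM_ACTIONS = 5, 2          # states 0..4; actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate

q_table = [[0.0] * NUM_ACTIONS for _ in range(NUM_STATES)]

def step(state, action):
    """Move left or right; reaching the last state yields a reward of +1."""
    next_state = max(0, min(NUM_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == NUM_STATES - 1 else 0.0
    return next_state, reward, next_state == NUM_STATES - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # Explore occasionally; otherwise exploit the best-known action.
        if random.random() < EPSILON:
            action = random.randrange(NUM_ACTIONS)
        else:
            action = max(range(NUM_ACTIONS), key=lambda a: q_table[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward reward + discounted future value.
        best_next = max(q_table[next_state])
        q_table[state][action] += ALPHA * (reward + GAMMA * best_next - q_table[state][action])
        state = next_state

print(q_table)   # the learned values come to favour moving right in every state
```

The same act-observe-update cycle, implemented with neural networks instead of a lookup table, underlies the RL techniques applied to LLMs below.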

HOW LLMs USE RL

(Figure credit: Hugging Face)

LLMs like ChatGPT use RLHF to fine-tune their responses based on human preferences.
Instead of relying solely on pre-trained data, RLHF allows models to learn from human
feedback, making outputs safer, more aligned with user intent, and less prone to
harmful biases.

This process involves:

1. Training an initial model using supervised learning.
2. Collecting human feedback on model-generated responses.
3. Using RL to optimize responses by rewarding helpful and accurate outputs while penalizing undesirable ones.

This technique makes LLMs more reliable for real-world applications by continuously
improving their ability to generate contextually appropriate and ethical responses.
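The sketch below illustrates step 3 of that pipeline in miniature. It is not the production RLHF setup: the four canned "responses", their hand-assigned scores (standing in for a learned reward model), and the single learnable logit vector (standing in for an LLM policy) are assumptions made to keep the example runnable, and the update is a plain REINFORCE-style policy gradient rather than the PPO objective typically used.

```python
import torch
import torch.nn.functional as F

# Toy RLHF-style loop: (1) a "policy" over canned responses, (2) human-feedback
# scores standing in for a reward model, (3) an RL update that raises the
# probability of highly rated responses. All values are illustrative.

responses = ["helpful answer", "vague answer", "rude answer", "off-topic answer"]
human_scores = torch.tensor([1.0, 0.2, -1.0, -0.5])   # stand-in for a reward model

# Step 1: the "policy" is just a learnable preference over the canned responses
# (a real LLM's policy is its token-level distribution).
logits = torch.zeros(len(responses), requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    probs = F.softmax(logits, dim=0)
    # Step 2: sample a response and look up its (human-feedback) reward.
    idx = torch.multinomial(probs, 1).item()
    reward = human_scores[idx]
    # Step 3: REINFORCE update: raise the log-prob of rewarded responses,
    # lower it for penalized ones.
    loss = -reward * torch.log(probs[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final = F.softmax(logits, dim=0).tolist()
print({r: round(p, 3) for r, p in zip(responses, final)})   # "helpful answer" dominates
```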

HOW DEEPSEEK USES RL

(Figure credit: Predibase)

DeepSeek leverages RL through its Group Relative Policy Optimization (GRPO) method, achieving
state-of-the-art reasoning performance. Here's a streamlined breakdown:

RL-Centric Training:
- GRPO treats verifiable reasoning outcomes (e.g., correct code/math solutions) as rewards, driving the model to self-improve through iterative self-reflection.
- It removes the dependency on supervised fine-tuning (SFT) for the initial training phases.

Hybrid Training Pipeline:
- Combines cold-start data (curated reasoning examples) with RL to stabilize learning and avoid readability issues.
- Distills the RL-trained 671B-parameter model into smaller versions (1.5B–70B) that retain 95%+ of its performance on tasks like code generation.

Performance Gains:
- 97.3% on MATH-500 (vs. 96.4% for OpenAI o1-1217) and 65.9% on LiveCodeBench coding tasks (vs. 34.2% for GPT-4o).
- Outperforms non-RL models on faithfulness, with 59% of responses rated correct vs. 7% for standard models.
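The core idea that distinguishes GRPO from PPO-style RLHF is that the advantage of each sampled response is computed relative to the other responses in its group, rather than from a learned critic. The snippet below is a hedged sketch of just that normalization step; the sampled rewards are placeholders, and the full objective (clipped probability ratios plus a KL penalty) is omitted.

```python
import torch

# Group-relative advantage at the heart of GRPO: sample several responses for the
# same prompt, score each with a verifiable reward (e.g., "does the answer check
# out?"), and normalize rewards within the group instead of using a value network.

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled response."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four responses sampled for one math prompt; 1.0 means verified correct.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = group_relative_advantages(rewards)

# Each response's token log-probabilities would then be weighted by its advantage
# in the policy-gradient update (clipping and KL terms omitted here).
print(advantages)   # correct answers get positive weight, incorrect get negative
```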

BENEFITS

Improved Response Quality: RL fine-tunes LLMs to generate more accurate and context-aware outputs.
Bias Reduction: RLHF helps minimize biases by aligning models with diverse human feedback.
Enhanced Adaptability: RL enables LLMs to refine responses dynamically based on user interactions.
Optimized Long-Term Decision Making: RL helps LLMs prioritize meaningful responses over immediate rewards.
Efficient Knowledge Distillation: RL-trained models can be distilled into smaller, high-performing versions (a minimal sketch follows below).
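As a rough illustration of that last point, here is the classic soft-label distillation loss, in which a smaller student is trained to match a larger teacher's softened output distribution. This is a generic sketch with random placeholder tensors, not the specific distillation recipe used for the models discussed above.

```python
import torch
import torch.nn.functional as F

# Generic soft-label distillation: the smaller "student" learns to match the
# larger "teacher"'s softened output distribution. Tensors are random placeholders.

temperature = 2.0
teacher_logits = torch.randn(8, 32000)                      # (batch, vocab_size)
student_logits = torch.randn(8, 32000, requires_grad=True)  # student's predictions

teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

# KL divergence between teacher and student, scaled by T^2 as is standard practice.
distill_loss = F.kl_div(student_log_probs, teacher_probs,
                        reduction="batchmean") * temperature ** 2
distill_loss.backward()   # gradients would update the student's weights
print(distill_loss.item())
```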

APPLICATIONS

Autonomous Vehicles: Self-driving cars navigate traffic, avoid collisions, and optimize routes; AI controls traffic signals.
Healthcare & Drug Discovery: AI designs treatment plans and discovers drugs by analyzing molecular interactions.
Finance & Trading: RL optimizes high-frequency trading and dynamically balances investment portfolios.
Gaming & Simulation: AI masters strategy games like Go (AlphaGo) and trains in simulations before real-world tasks.
Industrial Optimization: Smart grids manage energy distribution, and AI enhances supply chain efficiency.

LIMITATIONS

Bias Reinforcement: RLHF can amplify biases in the training data, leading to unfair or skewed model outputs.
Exploration vs. Exploitation: LLMs struggle to balance generating diverse responses (exploration) with refining known patterns (exploitation); see the sketch below.
Reward Design Complexity: Defining clear, effective reward signals for language tasks is challenging, often leading to suboptimal learning.
High Computational Cost: RL-based fine-tuning demands extensive resources, making it expensive and less accessible.
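One concrete place the exploration/exploitation trade-off shows up is decoding temperature. The toy snippet below, with arbitrary placeholder logits rather than values from a real model, shows how a low temperature exploits the single most likely token while a high temperature spreads probability across alternatives.

```python
import torch
import torch.nn.functional as F

# Exploration vs. exploitation via sampling temperature over next-token logits.
# Low temperature exploits the top-scoring token; high temperature explores more.
# The logits are arbitrary placeholders, not from a real model.

logits = torch.tensor([3.0, 2.5, 1.0, 0.2])   # scores for 4 candidate tokens

for temperature in (0.1, 1.0, 2.0):
    probs = F.softmax(logits / temperature, dim=0)
    print(f"T={temperature}:", [round(p, 3) for p in probs.tolist()])
```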

Follow to stay updated on Generative AI.


