RL For LLMs - An Overview
Bhavishya Pandit
INTRODUCTION
Reinforcement Learning (RL) is reshaping industries by enabling machines to learn
from experience, not just static data. This post gives an overview of how RL is
being used in LLMs.
In this post we’ll cover:
WHAT IS RL?
HOW DEEPSEEK USES RL
BENEFITS
APPLICATIONS
LIMITATIONS
WHAT IS RL?
Reinforcement Learning is a learning paradigm in which an agent interacts with an
environment, takes actions, and receives reward signals. Rather than learning from
labeled examples, the agent learns a behavior that maximizes cumulative reward over time.
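To make the idea concrete, here is a minimal sketch of the learn-from-reward loop on a toy two-action problem; the environment, payoff probabilities, and epsilon-greedy rule are illustrative assumptions, not part of the original post.

```python
import random

# Toy environment: the agent picks one of two actions; action 1 pays off more often.
# Everything here is illustrative -- a minimal agent-environment loop, not a real RL library.
def step(action):
    """Return a reward for the chosen action."""
    pay_prob = 0.8 if action == 1 else 0.3
    return 1.0 if random.random() < pay_prob else 0.0

values = {0: 0.0, 1: 0.0}   # value estimate per action, learned from experience
counts = {0: 0, 1: 0}
epsilon = 0.1               # exploration rate

for t in range(1000):
    # Explore occasionally, otherwise exploit the action with the higher estimate.
    if random.random() < epsilon:
        action = random.choice([0, 1])
    else:
        action = max(values, key=values.get)
    reward = step(action)
    # Incremental average: learning comes from the reward signal, not from labeled data.
    counts[action] += 1
    values[action] += (reward - values[action]) / counts[action]

print(values)  # the estimate for action 1 should end up higher
```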
HOW LLMs USE RL
LLMs like ChatGPT use Reinforcement Learning from Human Feedback (RLHF) to fine-tune their
responses based on human preferences. Instead of relying solely on pre-training data, RLHF
lets a model learn from human feedback, making outputs safer, better aligned with user
intent, and less prone to harmful biases.
This technique makes LLMs more reliable for real-world applications by continuously
improving their ability to generate contextually appropriate and ethical responses.
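As a rough illustration of the RLHF idea (not the actual ChatGPT pipeline), the sketch below uses a hand-written stand-in for a learned reward model and a REINFORCE-style update over a few canned responses; all names and the reward rule are illustrative assumptions.

```python
import math
import random

responses = [
    "I can't help with that.",
    "Sure, here's a clear, step-by-step answer.",
    "lol figure it out yourself",
]
logits = [0.0, 0.0, 0.0]   # policy parameters: one logit per canned response
learning_rate = 0.5

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reward_model(text):
    """Stand-in for a learned reward model trained on human preference data."""
    score = 0.0
    if "step-by-step" in text:
        score += 1.0   # humans preferred helpful, structured answers
    if "lol" in text:
        score -= 1.0   # humans penalized dismissive tone
    return score

for step in range(200):
    probs = softmax(logits)
    # Sample a response from the current policy and score it with the reward model.
    i = random.choices(range(len(responses)), weights=probs)[0]
    r = reward_model(responses[i])
    # REINFORCE gradient for a softmax policy: (indicator - prob) * reward.
    for j in range(len(logits)):
        grad = ((1.0 if j == i else 0.0) - probs[j]) * r
        logits[j] += learning_rate * grad

print(softmax(logits))  # probability mass should concentrate on the helpful answer
```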
HOW DEEPSEEK USES RL
(Image credit: Predibase)
DeepSeek leverages RL through its Group Relative Policy Optimization (GRPO) method, achieving
state-of-the-art reasoning performance. Here's a streamlined breakdown (a minimal code sketch
follows the list):
RL-Centric Training:
GRPO uses verifiable outcomes (e.g., correct code/math solutions) as reward signals,
driving the model to self-improve through iterative self-reflection.
Eliminates dependency on supervised fine-tuning (SFT) for initial training phases.
Hybrid Training Pipeline:
Combines cold-start data (curated reasoning examples) with RL to stabilize
learning and avoid poor readability issues.
Distills the RL-trained 671B-parameter model into smaller versions (1.5B–70B) while
retaining 95%+ of its performance on tasks like code generation.
Performance Gains:
Achieved 97.3% on MATH-500 (vs. OpenAI o1-1217’s 96.4%) and 65.9% on
LiveCodeBench coding tasks (vs. GPT-4o’s 34.2%).
Outperforms non-RL models on faithfulness, with 59% of responses correct vs.
7% for standard models.
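The full GRPO objective also involves a clipped policy-gradient ratio and a KL penalty against a reference model; the sketch below only shows the two ingredients named above, a verifiable rule-based reward and group-relative advantage normalization. The prompt, answers, and function names are illustrative assumptions.

```python
import math

def verifiable_reward(answer, expected):
    """Rule-based reward: 1 if the final answer checks out, else 0."""
    return 1.0 if answer.strip() == expected else 0.0

def group_relative_advantages(rewards):
    """Normalize each reward against its own group's mean and std (GRPO-style),
    so no separate critic/value model is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0   # avoid divide-by-zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Pretend the model sampled a group of 4 answers to "What is 17 * 3?"
group = ["51", "54", "51", "I think it's 50"]
rewards = [verifiable_reward(a, "51") for a in group]
advantages = group_relative_advantages(rewards)

print(rewards)     # [1.0, 0.0, 1.0, 0.0]
print(advantages)  # correct answers get positive advantage, wrong ones negative
```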
BENEFITS
Improved Response Quality
Enhanced Adaptability
Efficient Knowledge Distillation
APPLICATIONS
Visual Search & Recommendations
Autonomous Vehicles
Cross-Modal Retrieval
Healthcare & Drug Discovery
Healthcare Diagnostics
Finance & Trading Applications
LIMITATIONS
Bias Reinforcement: RLHF can amplify biases in training data, leading to unfair or
skewed model outputs.
Exploration vs. Exploitation: LLMs struggle to balance generating diverse
responses (exploration) and refining known patterns (exploitation).
Reward Design Complexity: Defining clear, effective reward signals for language
tasks is challenging, often leading to suboptimal learning (see the sketch after this list).
High Computational Cost: RL-based fine-tuning demands extensive resources,
making it expensive and less accessible.
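As a concrete illustration of the reward-design problem, the sketch below shows how a naive hand-crafted reward can be gamed ("reward hacking"); the reward rule and example strings are deliberately simplistic assumptions, not a recommendation.

```python
def naive_reward(answer):
    """Reward longer answers that mention 'because' (a weak proxy for reasoning)."""
    return 0.01 * len(answer) + (1.0 if "because" in answer else 0.0)

honest = "42, because 6 * 7 = 42."
gamed = "because because because " * 50   # useless padding that maximizes the proxy

print(naive_reward(honest))  # modest score for a genuinely good answer
print(naive_reward(gamed))   # much higher score for meaningless repetition
```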
Follow Bhavishya Pandit to stay updated on Generative AI.