Introduction
In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively narrowing the gap toward Artificial General Intelligence (AGI).
Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to improve accuracy on reasoning tasks, align models with social values, and adapt them to user preferences, all while requiring relatively little computation compared to pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models
were the first to introduce inference-time scaling by increasing the length of the Chain-of-
Thought reasoning process. This approach has achieved significant improvements in various
reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge
of effective test-time scaling remains an open question for the research community. Several prior works have explored approaches including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). Yet none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models.
In this paper, we take the first step toward improving language model reasoning capabilities
using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop
reasoning capabilities without any supervised data, focusing on their self-evolution through
a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ
GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning.
During training, the resulting model, which we refer to as DeepSeek-R1-Zero, naturally developed numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero achieves superior performance on reasoning benchmarks. For instance, its pass@1 score on AIME 2024 increases from 15.6% to 71.0%, and with majority voting the score further improves to 86.7%, matching the performance of OpenAI-o1-0912.
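To make the group-relative idea behind GRPO concrete, the following is a minimal sketch of its advantage computation: for each prompt, a group of outputs is sampled from the current policy, each output is scored (here by an illustrative rule-based correctness reward, which is an assumption), and each output's advantage is its reward standardized against the group's mean and standard deviation, so no separate critic model is needed. Function name and reward scheme are illustrative, not the exact implementation.

```python
from statistics import mean, stdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize per-output rewards within one prompt's sampled group.

    The core idea of GRPO (Shao et al., 2024): each sampled output is scored
    relative to its group mates, advantage_i = (r_i - mean) / std, rather
    than against a learned critic.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one prompt, rewarded 1.0 if the final
# answer is correct and 0.0 otherwise (an illustrative rule-based reward).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```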
However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start examples to fine-tune the DeepSeek-V3-Base model. We then perform reasoning-oriented RL in the same way as for DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combine it with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process that takes into account prompts from all scenarios. After these steps, we obtain a checkpoint, referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.
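As a rough illustration of the rejection-sampling step in this pipeline, the sketch below samples several candidate responses per prompt from the RL checkpoint and keeps one that passes a filter as a supervised (prompt, response) pair. The generation interface, acceptance criteria, and tie-breaking rule are illustrative assumptions, not the exact procedure used.

```python
from typing import Callable, List, Tuple

def rejection_sample_sft(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],   # RL checkpoint: prompt -> n candidate responses
    is_acceptable: Callable[[str, str], bool],   # e.g., correct, readable, no language mixing
    n_candidates: int = 16,
) -> List[Tuple[str, str]]:
    """Build SFT pairs by keeping one accepted response per prompt."""
    sft_pairs: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        accepted = [c for c in candidates if is_acceptable(prompt, c)]
        if accepted:
            # Illustrative tie-break: keep the shortest accepted response.
            sft_pairs.append((prompt, min(accepted, key=len)))
    return sft_pairs
```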
We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL to it, demonstrating that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms the state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record among dense models on reasoning benchmarks.
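At this level of description, distillation amounts to supervised fine-tuning of the smaller dense model on reasoning traces sampled from DeepSeek-R1. The sketch below shows one way such an SFT loss could be computed with PyTorch, masking prompt tokens so the student only learns to reproduce the teacher-written responses; the tensor layout and masking convention are assumptions for illustration, not the exact training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_sft_loss(student_logits: torch.Tensor,   # (batch, seq_len, vocab)
                          target_ids: torch.Tensor,        # (batch, seq_len), teacher-generated sequences
                          prompt_lengths: torch.Tensor     # (batch,), tokens belonging to the prompt
                          ) -> torch.Tensor:
    """Next-token cross-entropy on the teacher's response tokens only."""
    # Shift for next-token prediction: logits at position t predict token t+1.
    logits = student_logits[:, :-1, :]
    labels = target_ids[:, 1:].clone()

    # Mask out prompt tokens so only the teacher's response contributes to the loss.
    positions = torch.arange(labels.size(1), device=labels.device).unsqueeze(0)
    labels[positions < (prompt_lengths.unsqueeze(1) - 1)] = -100  # ignored by cross_entropy

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=-100,
    )
```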