Deep Reinforcement Learning
Applications
Learning the human way
Source: https://www.youtube.com/watch?v=e2_hsjpTi4w&t=67s
Left hand? Right hand?
With respect to the type of feedback given to the learner:
• Supervised learning: task driven (classification)
• Unsupervised learning: data driven (clustering)
• Reinforcement learning: self learning (reward based)
Image credit: UCL Course on RL
Classes of Learning Problems: Supervised vs. Unsupervised vs. Reinforcement/Self Learning
[Figure: Input → Neural Network → Output, with supervision ("Good or Bad?") during training and "What is this?" at testing]
Deep learning is representation learning: the automated formation of useful representations from data.
Supervised Learning vs. Reinforcement Learning

• Supervised learning is "teach by example": here are some examples, now learn the patterns in these examples.
• Reinforcement learning is "teach by experience": here is a world, now learn patterns by exploring it.

Supervised Learning:
  Step 1: Teacher: Does picture 1 show a car or a flower? Learner: A flower. Teacher: No, it's a car.
  Step 2: Teacher: Does picture 2 show a car or a flower? Learner: A car. Teacher: Yes, it's a car.
  Step 3: ...

Reinforcement Learning:
  Step 1: World: You are in state 9. Choose action A or C. Learner: Action A. World: Your reward is 100.
  Step 2: World: You are in state 32. Choose action B or E. Learner: Action B. World: Your reward is 50.
  Step 3: ...
Left      2   4   8
Right     3   1   7
Straight  6  11  50
[Figure: the AI stack, repeated on the following slides: Environment → Sensors → Sensor Data → Feature Extraction → Representation → Machine Learning → Knowledge → Reasoning → Planning → Action → Effector]

Open question: what can be learned from data?

Sensors: GPS, camera (visible, infrared), lidar, radar, stereo camera, microphone, IMU, networking (wired, wireless).

Source: https://deeplearning.mit.edu
Image recognition: "if it looks like a duck". Audio recognition: "quacks like a duck".

Source: https://deeplearning.mit.edu
Reinforcement Learning Framework

The promise of Deep Reinforcement Learning
Optimal Policy for a Deterministic vs. a Stochastic World

• Reward: -0.04 for each step. Actions: UP, DOWN, LEFT, RIGHT. Terminal squares: +1 and -1.
• When actions are deterministic, the optimal policy is simply the shortest path to the +1 square.
• When actions are stochastic (the intended move, e.g. UP, succeeds 80% of the time, with a 10% slip to the LEFT and a 10% slip to the RIGHT), the optimal policy still prefers short paths but avoids choosing UP next to the -1 square, where a slip would be costly (see the value-iteration sketch below).
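To make the stochastic grid world concrete, here is a minimal value-iteration sketch in Python. The step reward, terminals, and the 80/10/10 transition model follow the slide; the 3x4 layout, the wall cell, and the coordinate convention are assumptions made for illustration.

```python
# Minimal value-iteration sketch for the grid world above
# (step reward -0.04, terminals +1 and -1, 80/10/10 stochastic moves).
import itertools

ROWS, COLS = 3, 4
WALL = {(1, 1)}                       # assumed blocked cell
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}
STEP_REWARD = -0.04
GAMMA = 1.0
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
# Perpendicular "slips" for each intended action (10% each).
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, action):
    """Deterministic move; bumping a wall or the border keeps you in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt in WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state
    return nxt

def value_iteration(theta=1e-6):
    states = [s for s in itertools.product(range(ROWS), range(COLS)) if s not in WALL]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINALS:
                V[s] = TERMINALS[s]
                continue
            # Bellman optimality backup over the 80/10/10 transition model
            best = max(
                sum(p * (STEP_REWARD + GAMMA * V[step(s, a)])
                    for p, a in [(0.8, intended),
                                 (0.1, SLIPS[intended][0]),
                                 (0.1, SLIPS[intended][1])])
                for intended in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V

if __name__ == "__main__":
    for s, v in sorted(value_iteration().items()):
        print(s, round(v, 3))
```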
Some Terms in Reinforcement Learning

• The agent learns a policy: the policy at step t, π_t, is a mapping from states to action probabilities.
• Agents change their policy with experience.
• Objective: get as much reward as possible over the long run.
• Goals and rewards: a goal should specify what we want to achieve, not how we want to achieve it.

Meaning of Life for an RL Agent: Maximize Reward

• Future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_T
• Discounted future reward: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(T-t) r_T (a tiny worked example follows below).
• A good strategy for an agent is to always choose an action that maximizes the (discounted) future reward.
• Why "discounted"?
  • It is a mathematical trick that helps analyze convergence.
  • It reflects uncertainty due to environment stochasticity, partial observability, or the fact that life can end at any moment:
    "If today were the last day of my life, would I want to do what I'm about to do today?" – Steve Jobs
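A tiny sketch of the future-reward and discounted-future-reward formulas above; the reward sequence is made up purely for illustration.

```python
# Discounted return R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
def discounted_return(rewards, gamma=1.0):
    G = 0.0
    for r in reversed(rewards):     # accumulate backwards: G <- r + gamma*G
        G = r + gamma * G
    return G

rewards = [1, 1, 1, 1]              # illustrative episode rewards
print(discounted_return(rewards, gamma=1.0))   # undiscounted: 4.0
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```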
Identify the Goal, States, Actions, and Rewards (G, S, A, R) in the following examples.
Cart-Pole Balancing
• Goal: balance the pole on top of a moving cart.
• State: pole angle, pole angular velocity, cart position, cart horizontal velocity.
• Actions: horizontal force applied to the cart.
• Reward: +1 at each time step if the pole is upright (see the environment-loop sketch below).
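A minimal sketch of the cart-pole interaction loop, assuming the `gymnasium` package is available (where the action is a discrete left/right push rather than a continuous force); a random policy is used only to show the state/action/reward cycle.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)          # obs = [cart pos, cart vel, pole angle, pole angular vel]
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample() # push cart left (0) or right (1) at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward             # +1 per step while the pole stays upright
    done = terminated or truncated
print("episode return:", total_reward)
env.close()
```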
Solving Methods for RL
1) Dynamic programming (model-based)
2) Monte Carlo methods (model-free)
3) Temporal-difference learning

Examples of Reinforcement Learning Problems
Grasping Objects with a Robotic Arm
• Goal: pick up objects of different shapes.
• State: raw pixels from a camera.
• Actions: move the arm; grasp.
• Reward: positive when the pickup is successful.
MC and DP Methods: Finding the Value of a State

• Monte Carlo: estimate the value from experience by averaging the returns observed after visits to that state.
• The more returns are observed, the closer the average converges to the expected value.
• Same steps as in DP:
  • Policy evaluation
  • Computation of the state value (V^π) and the action value
• No bootstrapping: bootstrapping in RL means updating a value based on other estimates rather than on exact values; Monte Carlo methods instead use complete observed returns (a small sketch follows below).
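A small sketch of the Monte Carlo idea above: estimate V(s) by averaging the returns observed after visiting s (the first-visit variant), with no bootstrapping. The episode format and the toy data are assumptions for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """episodes: list of [(state, reward), ...] trajectories; `reward` is the
    reward received after leaving `state` (illustrative convention)."""
    returns = defaultdict(list)
    for episode in episodes:
        # return G_t from each time step, computed backwards
        Gs, G = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            Gs[t] = G
        # record the return only at the FIRST visit of each state
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(Gs[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}

# Two toy episodes; the more returns we average, the closer the estimate
# converges to the expected value.
episodes = [[("A", 0), ("B", 0), ("C", 1)],
            [("A", 0), ("C", 1)]]
print(first_visit_mc(episodes))   # e.g. {'A': 0.855, 'B': 0.9, 'C': 1.0}
```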
State-Action Value Function

• Q(s, a): the expected return when starting in state s, performing action a, and then following the policy thereafter.
• Example Q-table (states S1–S4, actions A1–A4; a greedy action-selection sketch follows below):

        A1   A2   A3   A4
  S1    +1   +2   -1    0
  S2    +2    0   +1   -2
  S3    -1   +1    0   -2
  S4    -2    0   +1   +1
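A sketch of reading a policy off a Q-table like the one above: greedy and epsilon-greedy action selection. The dictionary simply repeats the illustrative table values.

```python
import random

Q = {
    "S1": {"A1": +1, "A2": +2, "A3": -1, "A4": 0},
    "S2": {"A1": +2, "A2": 0, "A3": +1, "A4": -2},
    "S3": {"A1": -1, "A2": +1, "A3": 0, "A4": -2},
    "S4": {"A1": -2, "A2": 0, "A3": +1, "A4": +1},
}

def greedy(state):
    return max(Q[state], key=Q[state].get)      # action with the highest Q(s, a)

def epsilon_greedy(state, eps=0.1):
    if random.random() < eps:                    # explore occasionally
        return random.choice(list(Q[state]))
    return greedy(state)

print(greedy("S1"))           # -> A2 (Q = +2)
print(epsilon_greedy("S3"))   # mostly A2, sometimes a random action
```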
Q-Learning Example: State Diagram, Reward Matrix R, and Q Matrix

• Reward table / matrix R: the immediate reward for each state-action pair.
• Q matrix (experience table): the "brain" of the agent; it represents the memory of what the agent has learned through experience.
• In the beginning the agent knows nothing, so Q is a zero matrix.
• The number of states is assumed known (6).
• After training, a greedy walk from state C reaches the goal via C -- D -- B -- F or C -- D -- E -- F (a tabular Q-learning sketch follows below).
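A rough tabular Q-learning sketch in the spirit of this example: six known states, Q initialised to a zero matrix, and updates from random exploration. The room connectivity and the reward matrix R below are placeholders (the slide's actual R matrix is not reproduced here), chosen so that the learned greedy walk from C reproduces the paths above.

```python
import random

states = ["A", "B", "C", "D", "E", "F"]
actions = {  # assumed connectivity; an action is "move to that room", F is the goal
    "A": ["E"], "B": ["D", "F"], "C": ["D"],
    "D": ["B", "C", "E"], "E": ["A", "D", "F"], "F": ["B", "E", "F"],
}
R = {(s, a): (100 if a == "F" else 0) for s in states for a in actions[s]}
Q = {(s, a): 0.0 for s in states for a in actions[s]}   # zero matrix at the start
gamma, alpha = 0.8, 1.0

for _ in range(500):                       # episodes of random exploration
    s = random.choice(states)
    while s != "F":
        a = random.choice(actions[s])      # next room chosen at random
        target = R[(s, a)] + gamma * max(Q[(a, a2)] for a2 in actions[a])
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = a                              # the chosen room becomes the new state

# Read off a greedy path from C, e.g. C -- D -- B -- F or C -- D -- E -- F.
s, path = "C", ["C"]
while s != "F":
    s = max(actions[s], key=lambda a: Q[(s, a)])
    path.append(s)
print(" -- ".join(path))
```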
Case Study: Automatic Summary Generation Using Information Gain (SAAR)

Dataset
• Article from "The Hindu" (June 2013)
• DUC'06 document sets: 12 document sets, 25 documents per set, 32 sentences per document on average.

Updated term-sentence matrix:

TSM_updated = [ IG(W_11)_updated  IG(W_12)_updated  ...  IG(W_1n)_updated
                IG(W_21)_updated  IG(W_22)_updated  ...  IG(W_2n)_updated
                ...               ...               ...  ...
                IG(W_m1)_updated  IG(W_m2)_updated  ...  IG(W_mn)_updated ]

Results
• SAAR based (user feedback): 90, 85, 87.42
• PS: 75, 60, 66.66
(column labels in the original slide: ET, SS, PS)

C. Prakash, A. Shukla (2010). "Automatic Summary Generation from Single Document using Information Gain." In Contemporary Computing (Chapter 15, pp. 152-159). Springer. doi:10.1007/978-3-642-14834-7_15
Q-Learning: Representation Matters

Deep Reinforcement Learning
• In practice, value iteration over a table is impractical: it handles only very limited state/action spaces and cannot generalize to unobserved states.
• Instead, represent the Q-function with a neural network and train it to move its prediction towards a target value (a minimal sketch follows below).
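A minimal sketch of replacing the Q-table with a function approximator, assuming PyTorch is available: a small network maps a state to Q-values for every action and is nudged towards the one-step target r + γ max_a' Q(s', a'). This illustrates the idea only; it is not a full DQN (no replay buffer or target network).

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99   # cart-pole sized, purely illustrative

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(s, a, r, s_next, done):
    """One gradient step on (prediction - target)^2 for a single transition."""
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    prediction = q_net(s)[a]                   # Q(s, a) from the network
    with torch.no_grad():                      # the target carries no gradient
        target = r + (0.0 if done else gamma * q_net(s_next).max().item())
    loss = (prediction - torch.tensor(target)) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One made-up transition, just to show the call:
print(td_update(s=[0.0, 0.1, -0.02, 0.0], a=1, r=1.0,
                s_next=[0.01, 0.2, -0.03, 0.1], done=False))
```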
Game of Go
The AlphaGo Story
Source: https://www.youtube.com/watch?time_continue=6&v=8tq1C8spV_g&feature=emb_title
https://www.youtube.com/watch?v=8dMFJpEGNLQ
A more general program, AlphaZero, beat the most powerful programs playing go, chess and shogi
(Japanese chess) after a few days of play against itself using reinforcement learning.
“In part because few real-world problems are as
constrained as the games on which DeepMind has
focused, DeepMind has yet to find any large-scale
commercial application of deep reinforcement learning.”
Krunal Javiya, Jainesh Machhi, Parth Sharma, Saurav Patel: Autonomous Gait and Balancing Approach Using Deep Reinforcement Learning
Average precision-recall score: 0.60
2. Improve the simulation
• Email: [email protected]
• https://Cprakash.in
[https://cprakash86.wordpress.com/]
Thank You