Deep Reinforcement Learning

The document provides an overview of Deep Reinforcement Learning (Deep RL) and its applications, explaining key concepts such as supervised, unsupervised, and reinforcement learning. It highlights the importance of learning through trial and error, rewards, and the role of neural networks in Deep RL. Additionally, it discusses the components of reinforcement learning agents, action selection methods, and the significance of maximizing rewards in learning processes.


Deep Reinforcement Learning and AI-ML Terminology

Applications
Learning the human way

Dr. Chandra Prakash

Slide credit : Dr. Partha Pratim Chakrabarti

Artificial Intelligence (AI): Introduction


Machine learning: Definition
• A scientific discipline concerned with the design and development of algorithms that allow computers to learn from data, such as sensor data or databases.

Major focus of machine learning research:
• To automatically learn to recognize complex patterns and make intelligent decisions based on data.

Source: https://www.youtube.com/watch?v=e2_hsjpTi4w&t=67s

Machine Learning Types
Left hand or right hand? With respect to the type of feedback given to the learner:
• Supervised learning: task driven (classification)
• Unsupervised learning: data driven (clustering)
• Reinforcement learning: self-learning (reward based)
Image credit: UCL Course on RL
Classes of Learning Problems: Supervised vs. Unsupervised vs. Reinforcement/Self Learning
Fig: unlabeled images (random internet images); testing: "What is this?"

DATA, Vision and Deep Learning

Deep Reinforcement Learning (Deep RL)
• What is it? A framework for learning to solve sequential decision-making problems.
• How? Trial and error in a world that provides occasional rewards.
• Deep? Deep RL = RL + Neural Networks

Deep Reinforcement Learning vs. Supervised Learning
• It's all "supervised" by a loss function!
• Input → Neural Network → Output, with supervision* judging whether the output is good or bad.
• *Someone has to say what's good and what's bad.

Reinforcement Learning Examples
• ATARI 2600
• AlphaGo
• Mnih, V. (2013). Playing Atari with deep reinforcement learning.
• Silver, D. (2016). Mastering the game of Go with deep neural networks and tree search.
• Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments.

Deep learning is representation learning: the automated formation of useful representations from data.

Supervised Learning vs. Reinforcement Learning
• Supervised learning is "teach by example": here are some examples, now learn the patterns in these examples.
• Reinforcement learning is "teach by experience": here is a world, now learn patterns by exploring it.

Supervised Learning:
Step 1. Teacher: Does picture 1 show a car or a flower? Learner: A flower. Teacher: No, it's a car.
Step 2. Teacher: Does picture 2 show a car or a flower? Learner: A car. Teacher: Yes, it's a car.
Step 3. ...

Reinforcement Learning:
Step 1. World: You are in state 9. Choose action A or C. Learner: Action A. World: Your reward is 100.
Step 2. World: You are in state 32. Choose action B or E. Learner: Action B. World: Your reward is 50.
Step 3. ...


Reinforcement
Dictionary meaning: the occurrence of an event, in the proper relation to a response, that tends to increase the probability that the response will occur again in the same situation.

Reinforcement Learning (RL)
"A way of programming agents by reward and punishment without needing to specify how the task is to be achieved." [Kaelbling, Littman, & Moore, 96]

Reinforcement Learning (cont.)
• Emphasizes learning from feedback that evaluates the learner's performance without providing standards of correctness in the form of behavioral targets.
• Some researchers consider RL a form of unsupervised learning.
• An orthogonal approach for a learning machine: RL is training by rewards and punishments (good vs. bad).
• An RL agent learns by receiving a reward or reinforcement through trial-and-error interactions with a dynamic environment to achieve a goal, without any form of supervision other than its own decision-making policy.
• Reinforcement Learning is learning how to act in order to maximize a numerical reward.

Reinforcement Learning in Humans
• Humans appear to learn to walk through "very few examples" of trial and error. How remains an open question...
• Possible answers:
  • Hardware: 230 million years of bipedal movement data.
  • Imitation learning: observation of other humans walking.
  • Algorithms: better than backpropagation and stochastic gradient descent.

Reinforcement Learning (RL)
• Close to human learning.
• The agent learns a policy of how to act in a given environment.
• Every action has some impact on the environment, and the environment provides rewards that guide the learning algorithm.

Study Time as a Self-Learning Model
            Left   Right   Straight
Left         2      4       8
Right        3      1       7
Straight     6     11      50
Fig: The perception-to-action stack: Environment → Sensors → Sensor Data → Feature Extraction → Representation → Machine Learning → Knowledge → Reasoning → Planning → Action → Effector. Open question: what can be learned from data? Example sensors and effectors: GPS, camera (visible, infrared), lidar, radar, stereo camera, microphone, IMU, networking (wired, wireless).
Source: https://deeplearning.mit.edu


Fig: The same stack, illustrating multimodal recognition: image recognition (if it looks like a duck), audio recognition (quacks like a duck), activity recognition (swims like a duck).
Source: https://deeplearning.mit.edu
Fig: On the same stack, the promise of Deep Learning covers the learning stages, and the promise of Deep Reinforcement Learning extends it through to action.

Reinforcement Learning Framework
At each step, the agent:
• Executes an action
• Observes the new state
• Receives a reward
Open questions:
• What cannot be modeled in this way?
• What are the challenges of learning in this framework?

Elements of Reinforcement Learning
• Agent: the intelligent program. It interacts with the Environment through State, Action and Reward.
• Environment: the external conditions.
• Policy: the agent's behavior at a given time; a mapping from states to actions; lookup tables or a simple function.
• Reward function: defines the goal in an RL problem; the policy is altered to achieve this goal.
• Value function: specifies what is good in the long run, while the reward function indicates what is good in an immediate sense. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
• Model of the environment: used for planning; if we know the current state and action, it predicts the resultant next state and next reward.

Environments and Actions
• Fully observable (Chess) vs. partially observable (Poker)
• Single agent (Atari) vs. multi-agent (DeepTraffic)
• Deterministic (Cart Pole) vs. stochastic (DeepTraffic)
  • Deterministic system: no randomness is involved in the development of future states of the system.
  • Stochastic system: has a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.
• Static (Chess) vs. dynamic (DeepTraffic)
• Discrete (Chess) vs. continuous (Cart Pole)
Note: a real-world environment might not technically be stochastic or partially observable, but might as well be treated as such due to its complexity.

Major Components of an RL Agent
An RL agent may be directly or indirectly trying to learn a:
• Policy: the agent's behavior function
• Value function: how good each state and/or action is
• Model: the agent's representation of the environment
The agent-environment interaction produces a trajectory of states, actions and rewards:
s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T, s_T (terminal state)

Robot in a Room
• Actions: UP, DOWN, LEFT, RIGHT
• (Stochastic) model of the world: the action UP moves UP 80% of the time, LEFT 10% of the time, RIGHT 10% of the time.
• Reward +1 at [4,3], -1 at [4,2], and -0.04 for each step; the agent begins at START.
• What's the strategy to achieve maximum reward?
  • We can learn the model and plan.
  • We can learn the value of (action, state) pairs and act greedily/non-greedily.
  • We can learn the policy directly while sampling from it.
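The stochastic world model above can be made concrete with a short sketch. This is a minimal illustration: the 4x3 grid size, a wall at (2,2) and the exact terminal handling are assumptions beyond what the bullets state; only the 80/10/10 slip model and the +1/-1/-0.04 rewards come from the slide.

```python
import random

# "Robot in a Room": 4x3 grid, coordinates (column, row), 1-indexed.
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # +1 at [4,3], -1 at [4,2]
WALLS = {(2, 2)}                           # assumed blocked cell
STEP_REWARD = -0.04

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# Intended move happens 80% of the time; it slips sideways 10% each way.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, action):
    """Apply one stochastic action; return (next_state, reward, done)."""
    roll = random.random()
    if roll < 0.8:
        move = action
    elif roll < 0.9:
        move = SLIPS[action][0]
    else:
        move = SLIPS[action][1]
    dx, dy = MOVES[move]
    nxt = (state[0] + dx, state[1] + dy)
    # Stay in place if the move would leave the grid or hit the wall.
    if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) or nxt in WALLS:
        nxt = state
    if nxt in TERMINALS:
        return nxt, STEP_REWARD + TERMINALS[nxt], True
    return nxt, STEP_REWARD, False
```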
Optimal Policy for a Deterministic World
Reward: -0.04 for each step. Actions: UP, DOWN, LEFT, RIGHT.
When actions are deterministic, UP moves UP 100% of the time (0% LEFT, 0% RIGHT).
Policy: shortest path.

Optimal Policy for a Stochastic World
Reward: -0.04 for each step. Actions: UP, DOWN, LEFT, RIGHT.
When actions are stochastic, UP moves UP 80% of the time, LEFT 10% of the time, RIGHT 10% of the time.
Policy: shortest path, but avoid taking UP next to the -1 square.

Optimal Policy for a Stochastic World (varying step rewards)
Actions: UP, DOWN, LEFT, RIGHT; when actions are stochastic, UP moves UP 80% of the time, LEFT 10%, RIGHT 10%.
• Reward -2 for each step: policy is the shortest path.
• Reward -0.1 for each step: more urgent. Reward -0.04 for each step: less urgent.

Optimal Policy for a Stochastic World
Reward: +0.01 for each step. Actions: UP, DOWN, LEFT, RIGHT; when actions are stochastic, UP moves UP 80% of the time, LEFT 10%, RIGHT 10%.
Policy: longest path.

Lessons from Robot in a Room
• The environment model has a big impact on the optimal policy.
• The reward structure has a big impact on the optimal policy, and as programmers we have more control here.


Reinforcement Learning (cont.)
Concepts used in reinforcement learning:
• Evaluative vs. instructive feedback
• Associative vs. non-associative
• Exploration and exploitation

Action Selection Methods: Exploration and Exploitation
A. Greedy action: the action chosen with the greatest estimated value; a case of exploitation.
   • Ɛ-greedy: most of the time the greediest action is chosen; every once in a while, with a small probability Ɛ, an action is selected at random.
   • Ɛ-soft: the best action is selected with probability (1 - Ɛ) and the rest of the time a random action is chosen uniformly.
B. Non-greedy action: a case of exploration, as it enables us to improve the estimate of the non-greedy action's value.

Ɛ-Greedy Action Selection Method: Exploration vs. Exploitation
Let a*_t be the greedy action at time t and Q_t(a) the estimated value of action a at time t.
• A deterministic/greedy policy won't explore all actions.
• We don't know anything about the environment at the beginning.
• We need to try all actions to find the optimal one.

Greedy action selection:
  a_t = a*_t = argmax_a Q_t(a)

Ɛ-greedy action selection:
  a_t = a*_t with probability 1 - Ɛ, and a random action with probability Ɛ.
• Slowly move towards the greedy policy: Ɛ -> 0.
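A minimal sketch of Ɛ-greedy selection over a table of action-value estimates Q_t(a); the array of estimates is illustrative.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit: a_t = argmax_a Q_t(a)

# Example: estimates for 4 actions, 10% exploration.
q = np.array([0.2, 1.5, 0.7, 1.1])
print(epsilon_greedy(q, epsilon=0.1))
```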

Action Selection Policies (cont.): Softmax
• Drawback of Ɛ-greedy and Ɛ-soft: they select among the exploratory actions uniformly, neglecting the action values.
• Softmax remedies this by assigning a weight to each action according to its action-value estimate; a random action is then selected with regard to the weight associated with each action, so the worst actions are unlikely to be chosen.
• This is a good approach to take where the worst actions are very unfavorable.

Softmax Action Selection (cont.)
• Gibbs (Boltzmann) action selection uses exponential weights: action a is selected with probability P(a) = exp(Q_t(a)/τ) / Σ_b exp(Q_t(b)/τ), where τ is the "computational temperature".
• As τ -> 0, softmax action selection becomes the same as greedy action selection.
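A minimal sketch of the softmax (Gibbs/Boltzmann) rule described above, with τ as the temperature; the value estimates are illustrative.

```python
import numpy as np

def softmax_action(q_values, tau, rng=np.random.default_rng()):
    """Select action a with probability exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                              # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.2, 1.5, 0.7, 1.1]
print(softmax_action(q, tau=0.5))   # small tau -> nearly greedy, large tau -> nearly uniform
```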
Some Terms in Reinforcement Learning
• The agent learns a policy: the policy at step t is a mapping from states to action probabilities.
• Agents change their policy with experience.
• Objective: get as much reward as possible over the long run.
• Goals and rewards: a goal should specify what we want to achieve, not how we want to achieve it.

Meaning of Life for an RL Agent: Maximize Reward
• Future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_T
• Discounted future reward: R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{T-t} r_T
• A good strategy for an agent is to always choose an action that maximizes the (discounted) future reward.
• Why "discounted"?
  • A math trick to help analyze convergence.
  • Uncertainty due to environment stochasticity, partial observability, or that life can end at any moment: "If today were the last day of my life, would I want to do what I'm about to do today?" – Steve Jobs
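A small worked example of the discounted future reward defined above, accumulated backwards so each step needs only one multiply and one add; the reward list is illustrative.

```python
def discounted_return(rewards, gamma):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```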

Some Terms in RL (cont.)
• Returns: rewards in the long term.
• Episodes: subsequences of interaction between agent and environment, e.g. plays of a game, trips through a maze.
• Discounted return: the geometrically discounted model of return, used to determine the present value of future rewards and to give more weight to earlier rewards.

Update Rule
• Common update rule form:
  NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
• The expression [Target - OldEstimate] is the error in the estimate; it is reduced by taking a step toward the target.
• When processing the (t+1)-st reward for an action, a sample-average step-size parameter would be 1/(t+1).
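A minimal sketch of this update rule; the reward sequence is illustrative, and the 1/(t+1) step size reproduces the running (sample-average) mean.

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

estimate, rewards = 0.0, [1.0, 0.0, 1.0, 1.0]
for t, r in enumerate(rewards):
    estimate = incremental_update(estimate, r, 1.0 / (t + 1))   # step size 1/(t+1)
print(estimate)   # 0.75, the mean of the observed rewards
```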

Value Function
• A function of states (or state-action pairs) that estimates how good it is for the agent to be in a given state.
• Types of value function:
  • State-value function
  • Action-value function

Examples of Reinforcement Learning
Identify: Goal, State, Action, Reward?
Cart-Pole Balancing
• Goal: balance the pole on top of a moving cart.
• State: pole angle and angular speed; cart position and horizontal velocity.
• Actions: the horizontal force applied to the cart.
• Reward: 1 at each time step if the pole is upright.
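A minimal random-agent loop for this cart-pole task, sketched with the Gymnasium API (this assumes the `gymnasium` package and its `CartPole-v1` environment; an actual RL agent would replace the random action choice).

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)          # state: cart position, cart velocity, pole angle, pole angular velocity
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # random policy, just to exercise the loop
    state, reward, terminated, truncated, info = env.step(action)   # reward is +1 per step the pole stays up
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)
env.close()
```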
Examples of Reinforcement Learning (cont.)
Grasping Objects with a Robotic Arm
• Goal: pick up objects of different shapes.
• State: raw pixels from a camera.
• Actions: move the arm; grasp.
• Reward: positive when the pickup is successful.

Problem-Solving Methods for RL
1) Dynamic programming (model-based)
2) Monte Carlo methods (no model)
3) Temporal-difference learning

3 Types of Reinforcement Learning
• Model-based: learn the model of the world, then plan using the model; update the model often and re-plan often.
• Value-based: learn the state or state-action value; act by choosing the best action in a state; exploration is a necessary add-on.
• Policy-based: learn the stochastic policy function that maps state to action; act by sampling the policy; exploration is baked in.

1. Dynamic Programming
• The classical solution method.
• Requires a complete and accurate model of the environment.
• Popular dynamic programming methods:
  • Policy evaluation: iterative computation of the value function for a given policy (the prediction problem),
    V(s_t) <- E_π { r_{t+1} + γ V(s_{t+1}) }
  • Policy improvement: computation of an improved policy for a given value function.
• This follows the common update form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate].
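A minimal sketch of iterative policy evaluation under a known model. The `states`, `policy` (state -> action) and `P(s, a)` (returning `(probability, next_state, reward)` triples) interfaces are assumptions for illustration.

```python
def policy_evaluation(states, policy, P, gamma=0.9, theta=1e-6):
    """Repeatedly apply V(s) <- E_pi[ r + gamma * V(s') ] until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]  # deterministic policy, for simplicity
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P(s, a))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```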

Generalized Policy Iteration (GPI)
GPI consists of two interacting processes:
• Policy evaluation: making the value function consistent with the current policy.
• Policy improvement: making the policy greedy with respect to the current value function.

2. Monte Carlo Methods
Features of Monte Carlo methods:
• No need for complete knowledge of the environment.
• Based on averaging the sample returns observed after visits to a state.
• Experience is divided into episodes.
• Value estimates and policies are changed only after an episode is complete.
• They do not require a model.
• They are not suited for step-by-step incremental computation.
MC and DP Methods
• Monte Carlo computes the same value functions as DP, with the same steps:
  • Policy evaluation: computation of the state value (V_π) and action value (Q_π) for a fixed arbitrary policy π.
  • Policy improvement.
  • Generalized policy iteration.

To Find the Value of a State
• Estimate it from experience: average the returns observed after visits to that state.
• The more returns are observed, the closer the average converges to the expected value.
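A minimal sketch of estimating state values by averaging the returns observed after the first visit to each state; the episode format `[(state, reward), ...]` is an assumption for illustration.

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=1.0):
    """First-visit Monte Carlo: V(s) = average of the returns that follow the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:                          # episode = [(s0, r1), (s1, r2), ...]
        g, G = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):     # accumulate discounted returns backwards
            g = episode[t][1] + gamma * g
            G[t] = g
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}

eps = [[("A", 0.0), ("B", 1.0)], [("A", 1.0)]]
print(mc_state_values(eps))   # {'A': 1.0, 'B': 1.0}
```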

Monte Carlo and Dynamic Programming
MC has several advantages over DP:
• It can learn from interaction with the environment.
• No need for full models.
• No need to learn about ALL states.
• No bootstrapping (bootstrapping in RL means updating a value based on other estimates rather than on exact values).

3. Temporal Difference (TD) Methods
• Learn from experience, like MC: they can learn directly from interaction with the environment, with no need for full models.
• Estimate values based on the estimated values of next states, like DP: bootstrapping.
• Issue to watch for: maintaining sufficient exploration.

Temporal Difference (TD) Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V_π.

Simple every-visit Monte Carlo method:
  V(s_t) <- V(s_t) + α [ R_t - V(s_t) ]                      (target: the actual return after time t)

The simplest TD method, TD(0):
  V(s_t) <- V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ]   (target: an estimate of the return)
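A minimal sketch of the TD(0) update above applied to one observed transition; the dictionary-based value table is illustrative.

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """TD(0): V(s) <- V(s) + alpha * [ r_{t+1} + gamma*V(s_{t+1}) - V(s) ]."""
    bootstrap = 0.0 if terminal else gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (r_next + bootstrap - V.get(s, 0.0))
    return V

V = {}
V = td0_update(V, s="A", r_next=1.0, s_next="B")   # one transition A -> B with reward 1
print(V)   # {'A': 0.1}
```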
Taxonomy of RL Methods

Q-Learning
• State-action value function Q(s, a): the expected return when starting in s, performing a, and following the policy thereafter (each transition yields a reward r and a next state s').
• Q-Learning: use any policy to estimate a Q that maximizes future reward.
  • Q directly approximates Q* (the Bellman optimality equation).
  • It is independent of the policy being followed.
  • The only requirement is to keep updating each (s, a) pair.

Q-Learning: Off-Policy TD Control
One-step Q-learning, with old state s_t, new state s_{t+1}, reward r_{t+1}, learning rate α and discount factor γ:
  Q(s_t, a_t) <- Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

Q-Learning: Value Iteration
Example Q-table:
       A1   A2   A3   A4
  S1   +1   +2   -1    0
  S2   +2    0   +1   -2
  S3   -1   +1    0   -2
  S4   -2    0   +1   +1
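A minimal sketch of the one-step Q-learning update, using the 4-state, 4-action table above as an illustrative starting point.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [ r + gamma*max_a' Q(s',a') - Q(s,a) ]."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.array([[ 1,  2, -1,  0],     # S1
              [ 2,  0,  1, -2],     # S2
              [-1,  1,  0, -2],     # S3
              [-2,  0,  1,  1]],    # S4
             dtype=float)
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)   # observed transition (S1, A2, r=1, S2)
print(Q[0, 1])   # 2 + 0.1*(1 + 0.9*2 - 2) = 2.08
```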

Sarsa: On-Policy TD Control
• SARSA: State, Action, Reward, State, Action.
• Turn TD prediction into a control method by always updating the policy to be greedy with respect to the current estimate (a code sketch follows the list below).

Advantages of Temporal Difference (TD) Learning
• TD methods do not require a model of the environment, only experience.
• TD methods, but not MC methods, can be fully incremental:
  • You can learn before knowing the final outcome: less memory and less peak computation.
  • You can learn without the final outcome, i.e. from incomplete sequences.
• Both MC and TD converge.
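For contrast with the off-policy Q-learning update above, here is a minimal sketch of the on-policy SARSA update, which uses the action a' actually chosen in the next state rather than the maximizing one.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: Q(s,a) <- Q(s,a) + alpha * [ r + gamma*Q(s',a') - Q(s,a) ]."""
    # Q is a NumPy table like the one in the Q-learning example above.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    return Q
```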


A Reinforcement Learning Example: PathFinder Bot using Reinforcement Learning
• Suppose we have 5 rooms, A to E, in a building, connected by certain doors.
• We can consider the outside of the building as one big room, F, that surrounds the building.
• Two doors lead into the building from F, through room B and room E.
• Which path should the agent choose?

Solution using RL
Step 1: Model the environment.
• Represent the rooms by a graph: each room is a vertex (node) and each door is an edge (link).
• The goal room is node F.

Step 1: Modelling the Environment (cont.)
• Goal: outside the building, node F.
• Assign a reward value to each room.
• State: each room (including outside the building).
• Action: the agent's movement from one room to the next.
• Initial state: C (chosen at random).
• Reward: the goal node gets the highest reward (100); the rest get 0.
• The state diagram and the reward table (matrix R) encode these values.

Q Matrix: Experience Table
• The Q matrix is the brain of the agent: it represents the memory of what the agent has learned through experience.
• In the beginning the agent knows nothing, so Q is a zero matrix.
• Here the number of states is known (6), so Q is a 6x6 zero matrix.
• In the more general case, start with a zero matrix of a single cell; it is a simple task to add more columns and rows to the Q matrix if a new state is found.

Q Matrix: Experience Table (cont.)
• To use the Q matrix, the agent traces the sequence of states from the initial state to the goal state. The algorithm is as simple as finding the action that gives the maximum Q value for the current state.

Algorithm to utilize the Q matrix
Input: Q matrix, initial state
1. Set current state = initial state.
2. From the current state, find the action that produces the maximum Q value.
3. Set current state = next state.
4. Go to 2 until current state = goal state.
The algorithm returns the sequence of states from the initial state to the goal state.

Q-Learning
• Given: a state diagram with a goal state (represented by matrix R).
• Find: the minimum path from any initial state to the goal state (represented by matrix Q).

Q-Learning algorithm
1. Set the parameter γ and the environment reward matrix R.
2. Initialize matrix Q as a zero matrix.
3. For each episode:
   • Select a random initial state.
   • Do while the goal state has not been reached:
     • Select one among all possible actions for the current state.
     • Using this possible action, consider going to the next state.
     • Get the maximum Q value of this next state, based on all its possible actions.
     • Compute Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)].
     • Set the next state as the current state.
   End do
End for

Step 2: Set Parameters
• Let us set the value of the learning parameter γ = 0.8 and the initial state as room B.
• Set matrix Q as a zero matrix, and use the reward matrix R.

Step 3: Update the Q Matrix / Experience Table (Episode 1)
• Randomly choose a state: let it select state B in the matrix, which has 2 possible actions, D and F.
• Consider now that we are in state F. It has 3 possible actions: go to state B, E or F.
• Update the Q matrix. F is the final state, so this is the end of one episode.

Repeat Again (Episode 2)
• Start with a random initial state, say state D, which has 3 possible actions: B, C and E.
• By random selection, let B be the next state. State B has 2 possible actions (D, F); compute the Q value.
• The inner loop continues, starting again from state B: there is no change in matrix Q (same value), and F is the goal state, which finishes episode 2.

Continue for More Episodes, Then Normalize
• As the agent learns more and more, gaining experience through many episodes, matrix Q reaches its convergence values.
• After normalization, following the highest Q value from each state gives the optimal paths, e.g.
C -- D -- B -- F or C -- D -- E -- F
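A minimal end-to-end sketch of this path-finding example. The reward matrix below assumes the usual room graph for this kind of tutorial (doors A-E, B-D, B-F, C-D, D-E, E-F, plus an F-F self-loop); since the figure itself is not reproduced here, that adjacency is an assumption.

```python
import numpy as np

rooms = ["A", "B", "C", "D", "E", "F"]          # F = outside the building (goal)
R = np.array([                                   # -1: no door, 0: door, 100: door into the goal
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma, goal = 0.8, 5
Q = np.zeros_like(R, dtype=float)
rng = np.random.default_rng(0)

for episode in range(1000):
    s = int(rng.integers(6))                           # random initial state
    while s != goal:
        a = int(rng.choice(np.where(R[s] >= 0)[0]))    # pick any available door at random
        Q[s, a] = R[s, a] + gamma * Q[a].max()         # Q(state, action) = R + gamma * max Q(next state)
        s = a

print(np.round(100 * Q / Q.max()))                     # normalized Q matrix
# Greedily following Q from C gives C -> D -> B -> F or C -> D -> E -> F.
```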

Introduction
Case Study: Text Summarization using Reinforcement Learning

Real-Time Problem
Imagine:
• You download 1000+ papers and now want to get the summary.
• You have a list of emails about a sports event and want a one-paragraph summary of those emails.
• You have to study lots of books for an exam, and the summarizer gives you the key concepts of the books as a few pages of notes.
• What could be the possible solution?
• Value for researchers: "Get me everything the papers say about 'Automatic Text Summarization'."

Problem Definition (cont.)
• Text summarization is not tailored to user specifications.
• Generic summary generation is not possible, as the desired summary changes as the user changes.
• Even two humans cannot generate the same summary from a given document.
• Internal factors (background, education, etc.) play a vital role in generating a summary.

Solution: Human-Aided Text Summarization
Benefits of summarization include:
• Saves reading time.
• Value for researchers.
• Abstracts for scientific and other articles.
• Facilitates fast literature searches.
• Facilitates classification of articles and other written data.
• Improves the indexing efficiency of search engines for web pages.
• Assists in storing the text in much less space.
• Headings for a given article/document.
• News summarization.
• Opinion mining and sentiment analysis.
• Enables cell phones to access web information.
• With human feedback: user-oriented summaries.

Methodology Proposed (FAS)
Chandra Prakash, Anupam Shukla, "Automated summary generation from single document using information gain," Springer, Contemporary Computing, Communications in Computer and Information Science, Volume 94, pp. 152-159, 2010.

Methodology Proposed (HAMS): Keyword Significant Factor


Solution Methodology
Approach to the problem:
• Input: a document with text is fed into the system.
• Preprocessing:
  • Tokenization: divides the character sequence into words; sentence splitting further divides sequences of words into sentences, and so on.
  • Stemming or lemmatization.
  • Stop-word filtering.
• Sentence ranking: machine learning.
• Human feedback.
• Output/result: the generated summary (an abstract).

Methodology Steps
The methodology for text summarization involves:
• Term selection using preprocessing: tokenization or segmentation, stop-word filtering, stemming or lemmatization.
• Feature extraction (term weighting):
  • Term Frequency (TF): W_i(T_j) = f_ij, where f_ij is the frequency of the j-th term in sentence i.
  • Inverse Sentence Frequency (ISF): W_i(T_j) = f_ij × log(N / n_j), where N is the number of sentences in the collection and n_j is the number of sentences in which term j appears.
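A minimal sketch of the TF-ISF weighting just defined, with naive whitespace tokenization standing in for the full preprocessing pipeline; the sample sentences are illustrative.

```python
import math
from collections import Counter

def tf_isf_weights(sentences):
    """W_i(T_j) = f_ij * log(N / n_j): term frequency in sentence i times inverse sentence frequency."""
    tokenized = [s.lower().split() for s in sentences]
    N = len(tokenized)
    n_j = Counter(term for sent in tokenized for term in set(sent))   # sentences containing each term
    weights = []
    for sent in tokenized:
        f_ij = Counter(sent)
        weights.append({t: f_ij[t] * math.log(N / n_j[t]) for t in f_ij})
    return weights

docs = ["RL learns from rewards", "Summarization selects key sentences", "RL can rank key sentences"]
print(tf_isf_weights(docs)[0])
```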


Methodology Steps (cont.)
• The weight of a term is calculated as (TW)_i,j = (ISF)_i,j, where (TW)_i,j is the term weight of the i-th sentence and j-th term.
• Sentence signature: sentences that indicate the key concepts in a document.
• Term-Sentence Matrix:
  TSM = [ W_11 W_12 ... W_1n ; W_21 W_22 ... W_2n ; ... ; W_m1 W_m2 ... W_mn ]
• Sentence Information Gain combines the Term Frequency Weight score, Inverse Sentence Frequency score, Normalized Sentence Length score, Sentence Position score and Numerical Data score:
  Sentence Information Gain (IG)_i = (TFW)_i + ISFS(T_j)_i + (NSL)_i + (SPS)_i + (PNS)_i, where i is the sentence and j is the term.
• Term-Sentence matrix after IG:
  TSM = [ IG(W_11) IG(W_12) ... IG(W_1n) ; IG(W_21) IG(W_22) ... IG(W_2n) ; ... ; IG(W_m1) IG(W_m2) ... IG(W_mn) ]
• Normalized Sentence Length score:
  (NSL)_i = (number of words occurring in the sentence) / (number of words occurring in the longest sentence in the document)
• Sentence Position score (for the i-th of n sentences):
  (SPS)_i = (n - i + 1) / n
• Numerical Data score:
  (PNS)_i = (number of numerical data items in the sentence) / (length of the sentence)

Elements of Reinforcement Learning (as used here)
• Agent: the intelligent program. It interacts with the Environment through State, Action and Reward.
• Environment: the external conditions.
• Policy: defines the agent's behavior at a given time; a mapping from states to actions; lookup tables or a simple function.
• The agent learns behavior through trial-and-error interactions with a dynamic environment.

Methodology Steps (cont.): Processing Step
• Sentence scoring using reinforcement learning.
• Selection policy: Ɛ-greedy, i.e. a_t = a*_t with probability 1 - Ɛ, and a random action with probability Ɛ.
• In our approach we consider:
  • State: the sentences.
  • Action: updating the term weight.
  • Policy: update the term to maximize the sentence rank.
  • Reward: the scalar value of the term (IG).
• Q-Learning.


Processing Step
• The term-sentence matrix of IG values (TSM) is the input.
• Matrix Q is the learning matrix.
• After learning, the term-sentence matrix holds the updated IG values:
  TSM_updated = [ IG(W_11)_updated IG(W_12)_updated ... IG(W_1n)_updated ; ... ; IG(W_m1)_updated IG(W_m2)_updated ... IG(W_mn)_updated ]

Summary Generation
• Dataset:
  • An article from "The Hindu" (June 2013).
  • The DUC'06 sets of documents: 12 document sets, 25 documents in each set, an average of 32 sentences per document, and 300 document summaries.

Result Comparison
The generated summaries (HAMS) were compared with some available automated text summarizers: Open Text Summarizer (OTS), Pertinence Summarizer (PS), and Extractor Text Summarizer Software (ETSS).

Comparison of Recall, Precision and F-score for HAMS:

Method                 Precision (P)   Recall (R)   F-score
SAAR (user feedback)   90              85           87.42
IG summary             75              65           70.57
OTS                    75              60           66.66
PS                     75              60           66.66
ETSS                   75              60           66.66

C. Prakash, A. Shukla (2010). Chapter 15, "Automatic Summary Generation from Single Document using Information Gain." In Springer (2010), Contemporary Computing (pp. 152-159). doi:10.1007/978-3-642-14834-7_15
Q-Learning: Representation Matters
• In practice, value iteration is impractical:
  • Very limited states/actions.
  • Cannot generalize to unobserved states.
• Think about the Breakout game:
  • State: screen pixels.
  • Image size: 84 × 84 (resized), 4 consecutive images, grayscale with 256 gray levels
    → 256^(84×84×4) ≈ 10^67,970 rows in the Q-table, vastly more than the ~10^82 atoms in the universe.

Deep Reinforcement Learning
• ATARI 2600
• AlphaGo
• Mnih, V. (2013). Playing Atari with deep reinforcement learning.
• Silver, D. (2016). Mastering the game of Go with deep neural networks and tree search.
• Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments.

Deep RL = RL + Neural Networks

Taxonomy of RL Methods

DQN: Deep Q-Learning
• Use a neural network to approximate the Q-function: the Deep Q-Network (DQN), e.g. for Atari.
• Loss function: the squared error between the TD target and the network's prediction.
• DQN: the same network is used for both Q terms (target and prediction).
• Double DQN: a separate network for each Q; this helps reduce the bias introduced by the inaccuracies of the Q network at the beginning of training.
Mnih et al. "Playing Atari with deep reinforcement learning." 2013.

Game of Go: the AlphaGo Story
Sources: https://www.youtube.com/watch?time_continue=6&v=8tq1C8spV_g&feature=emb_title and https://www.youtube.com/watch?v=8dMFJpEGNLQ

AlphaGo (2016) Beat a Top Human at Go
DeepMind, acquired by Google in 2014, made headlines in 2016 after its AlphaGo program beat the human professional Go player Lee Sedol, the world champion, in a five-game match.
A more general program, AlphaZero, beat the most powerful programs playing Go, chess and shogi (Japanese chess) after a few days of play against itself using reinforcement learning.
"In part because few real-world problems are as constrained as the games on which DeepMind has focused, DeepMind has yet to find any large-scale commercial application of deep reinforcement learning."
Aug 14, 2019, Wired: https://www.wired.com/story/deepminds-losses-future-artificial-intelligence/

Simulation and Automated Deep Learning

To date, for most successful robots operating in the real world, Deep RL is not involved.

But… that's slowly changing:
• Learning Control Dynamics
• Learning to Drive: Beyond Pure Imitation (Waymo)
But… that's slowly changing: Object Detection using Deep RL
• Deep Reinforcement Learning of Region Proposal Networks for Object Detection, 2018.
• Hierarchical Object Detection with Deep Reinforcement Learning.
• Efficient Object Detection in Large Images using Deep Reinforcement Learning [2020].
• Deep Reinforcement Learning for Active Human Pose Estimation [2020].

The outline of application domains of RL in healthcare
Source: Yu, C., Liu, J., & Nemati, S. (2019). Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796.

Computational Intelligence and Smart Motion Research (CISMR) Group @ SVNIT

CISMR: Motion Rehabilitation and Motion Robotics
• 3D printer
• Bipedal robot
• Foot pressure sensor
• IR camera

Projects @ CISMR: Agents / Approaches
• We have trained three agents: Straight Walker, Terrain Walker and Imitation Walker.
Krunal Javiya, Jainesh Machhi, Parth Sharma, Saurav Patel. Autonomous Gait and Balancing Approach Using Deep Reinforcement Learning.

Pilot Study for Walking-Person Detection: Object Detection using Reinforcement Learning
• What is object detection? A computer vision technique that identifies and locates objects within an image or video.
• Specifically, object detection draws bounding boxes around the detected objects, which allows us to locate where those objects are in a given scene.
• Why object detection? Its ability to locate objects within an image or video allows us to count and then track those objects.
• It is applied in:
  • Crowd counting
  • Self-driving cars
  • Video surveillance
  • Face detection
  • Anomaly detection
Fig: Image recognition vs. object detection
Hierarchical Object Detection
• In this method, we train an intelligent agent using Deep RL that detects an object by deforming bounding boxes until they fit the object's bounding box.
• We use a fixed hierarchical representation with an object localization method, to force a top-down search.
• Each action the agent applies to the bounding box can change its aspect ratio, scale or position.

Test Results on the Walking-Person Dataset
• The bounding box shape is not always correct, but it is observable that the model has grasped the idea of how to detect a person in an image.
• Sometimes it zooms in too much on the person.
• Precision-recall curves: average precision-recall score 0.60 (Fig: precision vs. recall graph).

Deep RL in a Call Centre
CRSRL: Customer Routing System Using Deep Reinforcement Learning [2019]

Deep RL in Financial Markets

Challenge: RL & Real-World Applications
Reminder: supervised learning is "teach by example"; reinforcement learning is "teach by experience".
Open challenges. Two options:
1. Real-world observation + one-shot trial & error.
2. Realistic simulation + transfer learning.
Making these work means: 1. improving transfer learning, and 2. improving simulation.
Key Takeaways for Real-World Impact
• Deep Learning:
  • Fun part: good algorithms that learn from data.
  • Hard part: good questions, huge amounts of representative data.
• Deep Reinforcement Learning:
  • Fun part: good algorithms that learn from data.
  • Hard part: defining a useful state space, action space, and reward.
  • Hardest part: getting meaningful data for the above formalization.

Advice for Researchers
• Background: fundamentals in probability, statistics, multivariate calculus; deep learning basics; deep RL basics; TensorFlow (or PyTorch).
• Learn by doing: implement core deep RL algorithms; look for tricks and details in papers that were key to getting them to work; iterate fast in simple environments.
• Research: improve on an existing approach; focus on an unsolved task / benchmark; create a new task / problem that hasn't been addressed with RL.
References
• MIT Deep Learning Basics: Introduction and Overview with TensorFlow
• Univ. of Alberta: http://www.cs.ualberta.ca/~sutton/book/ebook/node1.html and www.cs.ualberta.ca/~sutton/book/the-book.html
• Sutton and Barto, "Reinforcement Learning: An Introduction."
• Univ. of New South Wales: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html
• https://people.revoledu.com/kardi/
• http://mnemstudio.org/path-finding-q-learning-tutorial.htm
• MIT Deep Learning and Artificial Intelligence Lectures
• https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/
• https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
• https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

Hands on RL using Python
• Visit: https://cprakash86.wordpress.com/downloads/
• Using Python: RL_example.ipynb

In case of any query:

• Email: [email protected]

• https://Cprakash.in
[https://cprakash86.wordpress.com/]
Thank You
