Deep Reinforcement Learning

The document provides an overview of Deep Reinforcement Learning (Deep RL) and its applications, explaining key concepts such as supervised, unsupervised, and reinforcement learning. It highlights the importance of learning through trial and error, rewards, and the role of neural networks in Deep RL. Additionally, it discusses the components of reinforcement learning agents, action selection methods, and the significance of maximizing rewards in learning processes.


Deep Reinforcement Learning and AI-ML Terminology

Applications
Learning the human way

Dr. Chandra Prakash

Slide credit : Dr. Partha Pratim Chakrabarti

Artificial Intelligence (AI): Introduction


Machine learning: Definition
• A scientific discipline concerned with the design and development of algorithms that allow computers to learn from data, such as sensor data or databases.

Major focus of machine learning research:
• To automatically learn to recognize complex patterns and make intelligent decisions based on data.

Source: https://www.youtube.com/watch?v=e2_hsjpTi4w&t=67s

Machine Learning Types
Left hand or right hand? With respect to the type of feedback given to the learner:
• Supervised learning: task driven (classification)
• Unsupervised learning: data driven (clustering)
• Reinforcement learning: self-learning (reward based)
Image credit: UCL Course on RL
Classes of Learning Problems: Supervised vs. Unsupervised vs. Reinforcement/Self Learning
Fig: unlabeled images (random internet images); testing: "What is this?"

DATA, Vision and Deep Learning

Deep Reinforcement Learning (Deep RL)
• What is it? A framework for learning to solve sequential decision-making problems.
• How? Trial and error in a world that provides occasional rewards.
• Deep? Deep RL = RL + Neural Networks

Deep Reinforcement Learning vs. Supervised Learning
• It's all "supervised" by a loss function!
• Input → Neural Network → Output, with supervision* judging whether the output is good or bad.
• *Someone has to say what's good and what's bad.

Reinforcement Learning Examples
• ATARI 2600
• AlphaGo
• Mnih, V. (2013). Playing Atari with deep reinforcement learning.
• Silver, D. (2016). Mastering the game of Go with deep neural networks and tree search.
• Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments.

Deep learning is representation learning: the automated formation of useful representations from data.

Supervised Learning vs. Reinforcement Learning
• Supervised learning is "teach by example": here are some examples, now learn the patterns in these examples.
• Reinforcement learning is "teach by experience": here is a world, now learn patterns by exploring it.

Supervised Learning:
Step 1. Teacher: Does picture 1 show a car or a flower? Learner: A flower. Teacher: No, it's a car.
Step 2. Teacher: Does picture 2 show a car or a flower? Learner: A car. Teacher: Yes, it's a car.
Step 3. ...

Reinforcement Learning:
Step 1. World: You are in state 9. Choose action A or C. Learner: Action A. World: Your reward is 100.
Step 2. World: You are in state 32. Choose action B or E. Learner: Action B. World: Your reward is 50.
Step 3. ...


Reinforcement
Dictionary meaning: the occurrence of an event, in the proper relation to a response, that tends to increase the probability that the response will occur again in the same situation.

Reinforcement Learning (RL)
"A way of programming agents by reward and punishment without needing to specify how the task is to be achieved." [Kaelbling, Littman, & Moore, 96]

Reinforcement Learning (cont.)
• Emphasizes learning from feedback that evaluates the learner's performance without providing standards of correctness in the form of behavioral targets.
• Some researchers consider RL a form of unsupervised learning.
• An orthogonal approach for a learning machine: RL is training by rewards and punishments (good vs. bad).
• An RL agent learns by receiving a reward or reinforcement through trial-and-error interactions with a dynamic environment to achieve a goal, without any form of supervision other than its own decision-making policy.
• Reinforcement Learning is learning how to act in order to maximize a numerical reward.

Reinforcement Learning in Humans
• Humans appear to learn to walk through "very few examples" of trial and error. How remains an open question...
• Possible answers:
  • Hardware: 230 million years of bipedal movement data.
  • Imitation learning: observation of other humans walking.
  • Algorithms: better than backpropagation and stochastic gradient descent.

Reinforcement Learning (RL)
• Close to human learning.
• The agent learns a policy of how to act in a given environment.
• Every action has some impact on the environment, and the environment provides rewards that guide the learning algorithm.

Study Time as a Self-Learning Model
            Left   Right   Straight
Left         2      4       8
Right        3      1       7
Straight     6     11      50
Fig: The perception-to-action stack: Environment → Sensors → Sensor Data → Feature Extraction → Representation → Machine Learning → Knowledge → Reasoning → Planning → Action → Effector. Open question: what can be learned from data? Example sensors and effectors: GPS, camera (visible, infrared), lidar, radar, stereo camera, microphone, IMU, networking (wired, wireless).
Source: https://deeplearning.mit.edu


Fig: The same stack, illustrating multimodal recognition: image recognition (if it looks like a duck), audio recognition (quacks like a duck), activity recognition (swims like a duck).
Source: https://deeplearning.mit.edu
Fig: On the same stack, the promise of Deep Learning covers the learning stages, and the promise of Deep Reinforcement Learning extends it through to action.

Reinforcement Learning Framework
At each step, the agent:
• Executes an action
• Observes the new state
• Receives a reward
Open questions:
• What cannot be modeled in this way?
• What are the challenges of learning in this framework?

Elements of Reinforcement Learning
• Agent: the intelligent program. It interacts with the Environment through State, Action and Reward.
• Environment: the external conditions.
• Policy: the agent's behavior at a given time; a mapping from states to actions; lookup tables or a simple function.
• Reward function: defines the goal in an RL problem; the policy is altered to achieve this goal.
• Value function: specifies what is good in the long run, while the reward function indicates what is good in an immediate sense. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
• Model of the environment: used for planning; if we know the current state and action, it predicts the resultant next state and next reward.

Environments and Actions
• Fully observable (Chess) vs. partially observable (Poker)
• Single agent (Atari) vs. multi-agent (DeepTraffic)
• Deterministic (Cart Pole) vs. stochastic (DeepTraffic)
  • Deterministic system: no randomness is involved in the development of future states of the system.
  • Stochastic system: has a random probability distribution or pattern that may be analysed statistically but may not be predicted precisely.
• Static (Chess) vs. dynamic (DeepTraffic)
• Discrete (Chess) vs. continuous (Cart Pole)
Note: a real-world environment might not technically be stochastic or partially observable, but might as well be treated as such due to its complexity.

Major Components of an RL Agent
An RL agent may be directly or indirectly trying to learn a:
• Policy: the agent's behavior function
• Value function: how good each state and/or action is
• Model: the agent's representation of the environment
The agent-environment interaction produces a trajectory of states, actions and rewards:
s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T, s_T (terminal state)

Robot in a Room
• Actions: UP, DOWN, LEFT, RIGHT
• (Stochastic) model of the world: the action UP moves UP 80% of the time, LEFT 10% of the time, RIGHT 10% of the time.
• Reward +1 at [4,3], -1 at [4,2], and -0.04 for each step; the agent begins at START.
• What's the strategy to achieve maximum reward?
  • We can learn the model and plan.
  • We can learn the value of (action, state) pairs and act greedily/non-greedily.
  • We can learn the policy directly while sampling from it.
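The stochastic world model above can be made concrete with a short sketch. This is a minimal illustration: the 4x3 grid size, a wall at (2,2) and the exact terminal handling are assumptions beyond what the bullets state; only the 80/10/10 slip model and the +1/-1/-0.04 rewards come from the slide.

```python
import random

# "Robot in a Room": 4x3 grid, coordinates (column, row), 1-indexed.
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}   # +1 at [4,3], -1 at [4,2]
WALLS = {(2, 2)}                           # assumed blocked cell
STEP_REWARD = -0.04

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
# Intended move happens 80% of the time; it slips sideways 10% each way.
SLIPS = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, action):
    """Apply one stochastic action; return (next_state, reward, done)."""
    roll = random.random()
    if roll < 0.8:
        move = action
    elif roll < 0.9:
        move = SLIPS[action][0]
    else:
        move = SLIPS[action][1]
    dx, dy = MOVES[move]
    nxt = (state[0] + dx, state[1] + dy)
    # Stay in place if the move would leave the grid or hit the wall.
    if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) or nxt in WALLS:
        nxt = state
    if nxt in TERMINALS:
        return nxt, STEP_REWARD + TERMINALS[nxt], True
    return nxt, STEP_REWARD, False
```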
Optimal Policy for a Deterministic World
Reward: -0.04 for each step. Actions: UP, DOWN, LEFT, RIGHT.
When actions are deterministic, UP moves UP 100% of the time (0% LEFT, 0% RIGHT).
Policy: shortest path.

Optimal Policy for a Stochastic World
Reward: -0.04 for each step. Actions: UP, DOWN, LEFT, RIGHT.
When actions are stochastic, UP moves UP 80% of the time, LEFT 10% of the time, RIGHT 10% of the time.
Policy: shortest path, but avoid taking UP next to the -1 square.

Optimal Policy for a Stochastic World (varying step rewards)
Actions: UP, DOWN, LEFT, RIGHT; when actions are stochastic, UP moves UP 80% of the time, LEFT 10%, RIGHT 10%.
• Reward -2 for each step: policy is the shortest path.
• Reward -0.1 for each step: more urgent. Reward -0.04 for each step: less urgent.

Optimal Policy for a Stochastic World
Reward: +0.01 for each step. Actions: UP, DOWN, LEFT, RIGHT; when actions are stochastic, UP moves UP 80% of the time, LEFT 10%, RIGHT 10%.
Policy: longest path.

Lessons from Robot in a Room
• The environment model has a big impact on the optimal policy.
• The reward structure has a big impact on the optimal policy, and as programmers we have more control here.


Reinforcement Learning (cont.)
Concepts used in reinforcement learning:
• Evaluative vs. instructive feedback
• Associative vs. non-associative
• Exploration and exploitation

Action Selection Methods: Exploration and Exploitation
A. Greedy action: the action chosen with the greatest estimated value; a case of exploitation.
   • Ɛ-greedy: most of the time the greediest action is chosen; every once in a while, with a small probability Ɛ, an action is selected at random.
   • Ɛ-soft: the best action is selected with probability (1 - Ɛ) and the rest of the time a random action is chosen uniformly.
B. Non-greedy action: a case of exploration, as it enables us to improve the estimate of the non-greedy action's value.

Ɛ-Greedy Action Selection Method: Exploration vs. Exploitation
Let a*_t be the greedy action at time t and Q_t(a) the estimated value of action a at time t.
• A deterministic/greedy policy won't explore all actions.
• We don't know anything about the environment at the beginning.
• We need to try all actions to find the optimal one.

Greedy action selection:
  a_t = a*_t = argmax_a Q_t(a)

Ɛ-greedy action selection:
  a_t = a*_t with probability 1 - Ɛ, and a random action with probability Ɛ.
• Slowly move towards the greedy policy: Ɛ -> 0.
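A minimal sketch of Ɛ-greedy selection over a table of action-value estimates Q_t(a); the array of estimates is illustrative.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit: a_t = argmax_a Q_t(a)

# Example: estimates for 4 actions, 10% exploration.
q = np.array([0.2, 1.5, 0.7, 1.1])
print(epsilon_greedy(q, epsilon=0.1))
```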

Action Selection Policies (cont.): Softmax
• Drawback of Ɛ-greedy and Ɛ-soft: they select among the exploratory actions uniformly, neglecting the action values.
• Softmax remedies this by assigning a weight to each action according to its action-value estimate; a random action is then selected with regard to the weight associated with each action, so the worst actions are unlikely to be chosen.
• This is a good approach to take where the worst actions are very unfavorable.

Softmax Action Selection (cont.)
• Gibbs (Boltzmann) action selection uses exponential weights: action a is selected with probability P(a) = exp(Q_t(a)/τ) / Σ_b exp(Q_t(b)/τ), where τ is the "computational temperature".
• As τ -> 0, softmax action selection becomes the same as greedy action selection.
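A minimal sketch of the softmax (Gibbs/Boltzmann) rule described above, with τ as the temperature; the value estimates are illustrative.

```python
import numpy as np

def softmax_action(q_values, tau, rng=np.random.default_rng()):
    """Select action a with probability exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()                              # subtract the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.2, 1.5, 0.7, 1.1]
print(softmax_action(q, tau=0.5))   # small tau -> nearly greedy, large tau -> nearly uniform
```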
Some Terms in Reinforcement Learning
• The agent learns a policy: the policy at step t is a mapping from states to action probabilities.
• Agents change their policy with experience.
• Objective: get as much reward as possible over the long run.
• Goals and rewards: a goal should specify what we want to achieve, not how we want to achieve it.

Meaning of Life for an RL Agent: Maximize Reward
• Future reward: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_T
• Discounted future reward: R_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + ... + γ^{T-t} r_T
• A good strategy for an agent is to always choose an action that maximizes the (discounted) future reward.
• Why "discounted"?
  • A math trick to help analyze convergence.
  • Uncertainty due to environment stochasticity, partial observability, or that life can end at any moment: "If today were the last day of my life, would I want to do what I'm about to do today?" – Steve Jobs
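A small worked example of the discounted future reward defined above, accumulated backwards so each step needs only one multiply and one add; the reward list is illustrative.

```python
def discounted_return(rewards, gamma):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ..., accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1, 1], gamma=0.9))   # 1 + 0.9 + 0.81 + 0.729 = 3.439
```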

Some Terms in RL (cont.)
• Returns: rewards in the long term.
• Episodes: subsequences of interaction between agent and environment, e.g. plays of a game, trips through a maze.
• Discounted return: the geometrically discounted model of return, used to determine the present value of future rewards and to give more weight to earlier rewards.

Update Rule
• Common update rule form:
  NewEstimate = OldEstimate + StepSize [Target - OldEstimate]
• The expression [Target - OldEstimate] is the error in the estimate; it is reduced by taking a step toward the target.
• When processing the (t+1)-st reward for an action, a sample-average step-size parameter would be 1/(t+1).
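A minimal sketch of this update rule; the reward sequence is illustrative, and the 1/(t+1) step size reproduces the running (sample-average) mean.

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

estimate, rewards = 0.0, [1.0, 0.0, 1.0, 1.0]
for t, r in enumerate(rewards):
    estimate = incremental_update(estimate, r, 1.0 / (t + 1))   # step size 1/(t+1)
print(estimate)   # 0.75, the mean of the observed rewards
```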

Value Function
• A function of states (or state-action pairs) that estimates how good it is for the agent to be in a given state.
• Types of value function:
  • State-value function
  • Action-value function

Examples of Reinforcement Learning
Identify: Goal, State, Action, Reward?
Cart-Pole Balancing
• Goal: balance the pole on top of a moving cart.
• State: pole angle and angular speed; cart position and horizontal velocity.
• Actions: the horizontal force applied to the cart.
• Reward: 1 at each time step if the pole is upright.
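A minimal random-agent loop for this cart-pole task, sketched with the Gymnasium API (this assumes the `gymnasium` package and its `CartPole-v1` environment; an actual RL agent would replace the random action choice).

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)          # state: cart position, cart velocity, pole angle, pole angular velocity
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # random policy, just to exercise the loop
    state, reward, terminated, truncated, info = env.step(action)   # reward is +1 per step the pole stays up
    total_reward += reward
    done = terminated or truncated
print("episode return:", total_reward)
env.close()
```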
Examples of Reinforcement Learning (cont.)
Grasping Objects with a Robotic Arm
• Goal: pick up objects of different shapes.
• State: raw pixels from a camera.
• Actions: move the arm; grasp.
• Reward: positive when the pickup is successful.

Problem-Solving Methods for RL
1) Dynamic programming (model-based)
2) Monte Carlo methods (no model)
3) Temporal-difference learning

3 Types of Reinforcement Learning
• Model-based: learn the model of the world, then plan using the model; update the model often and re-plan often.
• Value-based: learn the state or state-action value; act by choosing the best action in a state; exploration is a necessary add-on.
• Policy-based: learn the stochastic policy function that maps state to action; act by sampling the policy; exploration is baked in.

1. Dynamic Programming
• The classical solution method.
• Requires a complete and accurate model of the environment.
• Popular dynamic programming methods:
  • Policy evaluation: iterative computation of the value function for a given policy (the prediction problem),
    V(s_t) <- E_π { r_{t+1} + γ V(s_{t+1}) }
  • Policy improvement: computation of an improved policy for a given value function.
• This follows the common update form: NewEstimate = OldEstimate + StepSize [Target - OldEstimate].
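A minimal sketch of iterative policy evaluation under a known model. The `states`, `policy` (state -> action) and `P(s, a)` (returning `(probability, next_state, reward)` triples) interfaces are assumptions for illustration.

```python
def policy_evaluation(states, policy, P, gamma=0.9, theta=1e-6):
    """Repeatedly apply V(s) <- E_pi[ r + gamma * V(s') ] until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]  # deterministic policy, for simplicity
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P(s, a))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```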

Generalized Policy Iteration (GPI)
GPI consists of two interacting processes:
• Policy evaluation: making the value function consistent with the current policy.
• Policy improvement: making the policy greedy with respect to the current value function.

2. Monte Carlo Methods
Features of Monte Carlo methods:
• No need for complete knowledge of the environment.
• Based on averaging the sample returns observed after visits to a state.
• Experience is divided into episodes.
• Value estimates and policies are changed only after an episode is complete.
• They do not require a model.
• They are not suited for step-by-step incremental computation.
MC and DP Methods
• Monte Carlo computes the same value functions as DP, with the same steps:
  • Policy evaluation: computation of the state value (V_π) and action value (Q_π) for a fixed arbitrary policy π.
  • Policy improvement.
  • Generalized policy iteration.

To Find the Value of a State
• Estimate it from experience: average the returns observed after visits to that state.
• The more returns are observed, the closer the average converges to the expected value.
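A minimal sketch of estimating state values by averaging the returns observed after the first visit to each state; the episode format `[(state, reward), ...]` is an assumption for illustration.

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=1.0):
    """First-visit Monte Carlo: V(s) = average of the returns that follow the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:                          # episode = [(s0, r1), (s1, r2), ...]
        g, G = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):     # accumulate discounted returns backwards
            g = episode[t][1] + gamma * g
            G[t] = g
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        for s, t in first_visit.items():
            returns[s].append(G[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}

eps = [[("A", 0.0), ("B", 1.0)], [("A", 1.0)]]
print(mc_state_values(eps))   # {'A': 1.0, 'B': 1.0}
```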

Monte Carlo and Dynamic Programming
MC has several advantages over DP:
• It can learn from interaction with the environment.
• No need for full models.
• No need to learn about ALL states.
• No bootstrapping (bootstrapping in RL means updating a value based on other estimates rather than on exact values).

3. Temporal Difference (TD) Methods
• Learn from experience, like MC: they can learn directly from interaction with the environment, with no need for full models.
• Estimate values based on the estimated values of next states, like DP: bootstrapping.
• Issue to watch for: maintaining sufficient exploration.

Temporal Difference (TD) Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V_π.

Simple every-visit Monte Carlo method:
  V(s_t) <- V(s_t) + α [ R_t - V(s_t) ]                      (target: the actual return after time t)

The simplest TD method, TD(0):
  V(s_t) <- V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) - V(s_t) ]   (target: an estimate of the return)
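A minimal sketch of the TD(0) update above applied to one observed transition; the dictionary-based value table is illustrative.

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """TD(0): V(s) <- V(s) + alpha * [ r_{t+1} + gamma*V(s_{t+1}) - V(s) ]."""
    bootstrap = 0.0 if terminal else gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (r_next + bootstrap - V.get(s, 0.0))
    return V

V = {}
V = td0_update(V, s="A", r_next=1.0, s_next="B")   # one transition A -> B with reward 1
print(V)   # {'A': 0.1}
```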
Taxonomy of RL Methods

Q-Learning
• State-action value function Q(s, a): the expected return when starting in s, performing a, and following the policy thereafter (each transition yields a reward r and a next state s').
• Q-Learning: use any policy to estimate a Q that maximizes future reward.
  • Q directly approximates Q* (the Bellman optimality equation).
  • It is independent of the policy being followed.
  • The only requirement is to keep updating each (s, a) pair.

Q-Learning: Off-Policy TD Control
One-step Q-learning, with old state s_t, new state s_{t+1}, reward r_{t+1}, learning rate α and discount factor γ:
  Q(s_t, a_t) <- Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

Q-Learning: Value Iteration
Example Q-table:
       A1   A2   A3   A4
  S1   +1   +2   -1    0
  S2   +2    0   +1   -2
  S3   -1   +1    0   -2
  S4   -2    0   +1   +1
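A minimal sketch of the one-step Q-learning update, using the 4-state, 4-action table above as an illustrative starting point.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * [ r + gamma*max_a' Q(s',a') - Q(s,a) ]."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.array([[ 1,  2, -1,  0],     # S1
              [ 2,  0,  1, -2],     # S2
              [-1,  1,  0, -2],     # S3
              [-2,  0,  1,  1]],    # S4
             dtype=float)
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)   # observed transition (S1, A2, r=1, S2)
print(Q[0, 1])   # 2 + 0.1*(1 + 0.9*2 - 2) = 2.08
```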

Sarsa: On-Policy TD Control
• SARSA: State, Action, Reward, State, Action.
• Turn TD prediction into a control method by always updating the policy to be greedy with respect to the current estimate (a code sketch follows the list below).

Advantages of Temporal Difference (TD) Learning
• TD methods do not require a model of the environment, only experience.
• TD methods, but not MC methods, can be fully incremental:
  • You can learn before knowing the final outcome: less memory and less peak computation.
  • You can learn without the final outcome, i.e. from incomplete sequences.
• Both MC and TD converge.
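For contrast with the off-policy Q-learning update above, here is a minimal sketch of the on-policy SARSA update, which uses the action a' actually chosen in the next state rather than the maximizing one.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: Q(s,a) <- Q(s,a) + alpha * [ r + gamma*Q(s',a') - Q(s,a) ]."""
    # Q is a NumPy table like the one in the Q-learning example above.
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
    return Q
```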


A Reinforcement Learning Example: PathFinder Bot using Reinforcement Learning
• Suppose we have 5 rooms, A to E, in a building, connected by certain doors.
• We can consider the outside of the building as one big room, F, that surrounds the building.
• Two doors lead into the building from F, through room B and room E.
• Which path should the agent choose?

Solution using RL
Step 1: Model the environment.
• Represent the rooms by a graph: each room is a vertex (node) and each door is an edge (link).
• The goal room is node F.

Step 1: Modelling the Environment (cont.)
• Goal: outside the building, node F.
• Assign a reward value to each room.
• State: each room (including outside the building).
• Action: the agent's movement from one room to the next.
• Initial state: C (chosen at random).
• Reward: the goal node gets the highest reward (100); the rest get 0.
• The state diagram and the reward table (matrix R) encode these values.

Q Matrix: Experience Table
• The Q matrix is the brain of the agent: it represents the memory of what the agent has learned through experience.
• In the beginning the agent knows nothing, so Q is a zero matrix.
• Here the number of states is known (6), so Q is a 6x6 zero matrix.
• In the more general case, start with a zero matrix of a single cell; it is a simple task to add more columns and rows to the Q matrix if a new state is found.

Q Matrix: Experience Table (cont.)
• To use the Q matrix, the agent traces the sequence of states from the initial state to the goal state. The algorithm is as simple as finding the action that gives the maximum Q value for the current state.

Algorithm to utilize the Q matrix
Input: Q matrix, initial state
1. Set current state = initial state.
2. From the current state, find the action that produces the maximum Q value.
3. Set current state = next state.
4. Go to 2 until current state = goal state.
The algorithm returns the sequence of states from the initial state to the goal state.

Q-Learning
• Given: a state diagram with a goal state (represented by matrix R).
• Find: the minimum path from any initial state to the goal state (represented by matrix Q).

Q-Learning algorithm
1. Set the parameter γ and the environment reward matrix R.
2. Initialize matrix Q as a zero matrix.
3. For each episode:
   • Select a random initial state.
   • Do while the goal state has not been reached:
     • Select one among all possible actions for the current state.
     • Using this possible action, consider going to the next state.
     • Get the maximum Q value of this next state, based on all its possible actions.
     • Compute Q(state, action) = R(state, action) + γ · max[Q(next state, all actions)].
     • Set the next state as the current state.
   End do
End for

Step 2: Set Parameters
• Let us set the value of the learning parameter γ = 0.8 and the initial state as room B.
• Set matrix Q as a zero matrix, and use the reward matrix R.

Step 3: Update the Q Matrix / Experience Table (Episode 1)
• Randomly choose a state: let it select state B in the matrix, which has 2 possible actions, D and F.
• Consider now that we are in state F. It has 3 possible actions: go to state B, E or F.
• Update the Q matrix. F is the final state, so this is the end of one episode.

Repeat Again (Episode 2)
• Start with a random initial state, say state D, which has 3 possible actions: B, C and E.
• By random selection, let B be the next state. State B has 2 possible actions (D, F); compute the Q value.
• The inner loop continues, starting again from state B: there is no change in matrix Q (same value), and F is the goal state, which finishes episode 2.

Continue for More Episodes, Then Normalize
• As the agent learns more and more, gaining experience through many episodes, matrix Q reaches its convergence values.
• After normalization, following the highest Q value from each state gives the optimal paths, e.g.
C -- D -- B -- F or C -- D -- E -- F
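A minimal end-to-end sketch of this path-finding example. The reward matrix below assumes the usual room graph for this kind of tutorial (doors A-E, B-D, B-F, C-D, D-E, E-F, plus an F-F self-loop); since the figure itself is not reproduced here, that adjacency is an assumption.

```python
import numpy as np

rooms = ["A", "B", "C", "D", "E", "F"]          # F = outside the building (goal)
R = np.array([                                   # -1: no door, 0: door, 100: door into the goal
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
])
gamma, goal = 0.8, 5
Q = np.zeros_like(R, dtype=float)
rng = np.random.default_rng(0)

for episode in range(1000):
    s = int(rng.integers(6))                           # random initial state
    while s != goal:
        a = int(rng.choice(np.where(R[s] >= 0)[0]))    # pick any available door at random
        Q[s, a] = R[s, a] + gamma * Q[a].max()         # Q(state, action) = R + gamma * max Q(next state)
        s = a

print(np.round(100 * Q / Q.max()))                     # normalized Q matrix
# Greedily following Q from C gives C -> D -> B -> F or C -> D -> E -> F.
```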

Introduction
Case Study: Text Summarization using Reinforcement Learning

Real-Time Problem
Imagine:
• You download 1000+ papers and now want to get the summary.
• You have a list of emails about a sports event and want a one-paragraph summary of those emails.
• You have to study lots of books for an exam, and the summarizer gives you the key concepts of the books as a few pages of notes.
• What could be the possible solution?
• Value for researchers: "Get me everything the papers say about 'Automatic Text Summarization'."

Problem Definition (cont.)
• Text summarization is not tailored to user specifications.
• Generic summary generation is not possible, as the desired summary changes as the user changes.
• Even two humans cannot generate the same summary from a given document.
• Internal factors (background, education, etc.) play a vital role in generating a summary.

Solution: Human-Aided Text Summarization
Benefits of summarization include:
• Saves reading time.
• Value for researchers.
• Abstracts for scientific and other articles.
• Facilitates fast literature searches.
• Facilitates classification of articles and other written data.
• Improves the indexing efficiency of search engines for web pages.
• Assists in storing the text in much less space.
• Headings for a given article/document.
• News summarization.
• Opinion mining and sentiment analysis.
• Enables cell phones to access web information.
• With human feedback: user-oriented summaries.

Methodology Proposed (FAS)
Chandra Prakash, Anupam Shukla, "Automated summary generation from single document using information gain," Springer, Contemporary Computing, Communications in Computer and Information Science, Volume 94, pp. 152-159, 2010.

Methodology Proposed (HAMS): Keyword Significant Factor


Solution Methodology
Approach to the problem:
• Input: a document with text is fed into the system.
• Preprocessing:
  • Tokenization: divides the character sequence into words; sentence splitting further divides sequences of words into sentences, and so on.
  • Stemming or lemmatization.
  • Stop-word filtering.
• Sentence ranking: machine learning.
• Human feedback.
• Output/result: the generated summary (an abstract).

Methodology Steps
The methodology for text summarization involves:
• Term selection using preprocessing: tokenization or segmentation, stop-word filtering, stemming or lemmatization.
• Feature extraction (term weighting):
  • Term Frequency (TF): W_i(T_j) = f_ij, where f_ij is the frequency of the j-th term in sentence i.
  • Inverse Sentence Frequency (ISF): W_i(T_j) = f_ij × log(N / n_j), where N is the number of sentences in the collection and n_j is the number of sentences in which term j appears.
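A minimal sketch of the TF-ISF weighting just defined, with naive whitespace tokenization standing in for the full preprocessing pipeline; the sample sentences are illustrative.

```python
import math
from collections import Counter

def tf_isf_weights(sentences):
    """W_i(T_j) = f_ij * log(N / n_j): term frequency in sentence i times inverse sentence frequency."""
    tokenized = [s.lower().split() for s in sentences]
    N = len(tokenized)
    n_j = Counter(term for sent in tokenized for term in set(sent))   # sentences containing each term
    weights = []
    for sent in tokenized:
        f_ij = Counter(sent)
        weights.append({t: f_ij[t] * math.log(N / n_j[t]) for t in f_ij})
    return weights

docs = ["RL learns from rewards", "Summarization selects key sentences", "RL can rank key sentences"]
print(tf_isf_weights(docs)[0])
```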


Methodology Steps (cont.)
• The weight of a term is calculated as (TW)_i,j = (ISF)_i,j, where (TW)_i,j is the term weight of the i-th sentence and j-th term.
• Sentence signature: sentences that indicate the key concepts in a document.
• Term-Sentence Matrix:
  TSM = [ W_11 W_12 ... W_1n ; W_21 W_22 ... W_2n ; ... ; W_m1 W_m2 ... W_mn ]
• Sentence Information Gain combines the Term Frequency Weight score, Inverse Sentence Frequency score, Normalized Sentence Length score, Sentence Position score and Numerical Data score:
  Sentence Information Gain (IG)_i = (TFW)_i + ISFS(T_j)_i + (NSL)_i + (SPS)_i + (PNS)_i, where i is the sentence and j is the term.
• Term-Sentence matrix after IG:
  TSM = [ IG(W_11) IG(W_12) ... IG(W_1n) ; IG(W_21) IG(W_22) ... IG(W_2n) ; ... ; IG(W_m1) IG(W_m2) ... IG(W_mn) ]
• Normalized Sentence Length score:
  (NSL)_i = (number of words occurring in the sentence) / (number of words occurring in the longest sentence in the document)
• Sentence Position score (for the i-th of n sentences):
  (SPS)_i = (n - i + 1) / n
• Numerical Data score:
  (PNS)_i = (number of numerical data items in the sentence) / (length of the sentence)

Elements of Reinforcement Learning (as used here)
• Agent: the intelligent program. It interacts with the Environment through State, Action and Reward.
• Environment: the external conditions.
• Policy: defines the agent's behavior at a given time; a mapping from states to actions; lookup tables or a simple function.
• The agent learns behavior through trial-and-error interactions with a dynamic environment.

Methodology Steps (cont.): Processing Step
• Sentence scoring using reinforcement learning.
• Selection policy: Ɛ-greedy, i.e. a_t = a*_t with probability 1 - Ɛ, and a random action with probability Ɛ.
• In our approach we consider:
  • State: the sentences.
  • Action: updating the term weight.
  • Policy: update the term to maximize the sentence rank.
  • Reward: the scalar value of the term (IG).
• Q-Learning.


Processing Step
• The term-sentence matrix of IG values (TSM) is the input.
• Matrix Q is the learning matrix.
• After learning, the term-sentence matrix holds the updated IG values:
  TSM_updated = [ IG(W_11)_updated IG(W_12)_updated ... IG(W_1n)_updated ; ... ; IG(W_m1)_updated IG(W_m2)_updated ... IG(W_mn)_updated ]

Summary Generation
• Dataset:
  • An article from "The Hindu" (June 2013).
  • The DUC'06 sets of documents: 12 document sets, 25 documents in each set, an average of 32 sentences per document, and 300 document summaries.

Result Comparison
The generated summaries (HAMS) were compared with some available automated text summarizers: Open Text Summarizer (OTS), Pertinence Summarizer (PS), and Extractor Text Summarizer Software (ETSS).

Comparison of Recall, Precision and F-score for HAMS:

Method                 Precision (P)   Recall (R)   F-score
SAAR (user feedback)   90              85           87.42
IG summary             75              65           70.57
OTS                    75              60           66.66
PS                     75              60           66.66
ETSS                   75              60           66.66

C. Prakash, A. Shukla (2010). Chapter 15, "Automatic Summary Generation from Single Document using Information Gain." In Springer (2010), Contemporary Computing (pp. 152-159). doi:10.1007/978-3-642-14834-7_15
Q-Learning: Representation Matters
• In practice, value iteration is impractical:
  • Very limited states/actions.
  • Cannot generalize to unobserved states.
• Think about the Breakout game:
  • State: screen pixels.
  • Image size: 84 × 84 (resized), 4 consecutive images, grayscale with 256 gray levels
    → 256^(84×84×4) ≈ 10^67,970 rows in the Q-table, vastly more than the ~10^82 atoms in the universe.

Deep Reinforcement Learning
• ATARI 2600
• AlphaGo
• Mnih, V. (2013). Playing Atari with deep reinforcement learning.
• Silver, D. (2016). Mastering the game of Go with deep neural networks and tree search.
• Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments.

Deep RL = RL + Neural Networks

Taxonomy of RL Methods

DQN: Deep Q-Learning
• Use a neural network to approximate the Q-function: the Deep Q-Network (DQN), e.g. for Atari.
• Loss function: the squared error between the TD target and the network's prediction.
• DQN: the same network is used for both Q terms (target and prediction).
• Double DQN: a separate network for each Q; this helps reduce the bias introduced by the inaccuracies of the Q network at the beginning of training.
Mnih et al. "Playing Atari with deep reinforcement learning." 2013.

Game of Go: the AlphaGo Story
Sources: https://www.youtube.com/watch?time_continue=6&v=8tq1C8spV_g&feature=emb_title and https://www.youtube.com/watch?v=8dMFJpEGNLQ

AlphaGo (2016) Beat a Top Human at Go
DeepMind, acquired by Google in 2014, made headlines in 2016 after its AlphaGo program beat the human professional Go player Lee Sedol, the world champion, in a five-game match.
A more general program, AlphaZero, beat the most powerful programs playing Go, chess and shogi (Japanese chess) after a few days of play against itself using reinforcement learning.
"In part because few real-world problems are as constrained as the games on which DeepMind has focused, DeepMind has yet to find any large-scale commercial application of deep reinforcement learning."
Aug 14, 2019, Wired: https://www.wired.com/story/deepminds-losses-future-artificial-intelligence/

Simulation and Automated Deep Learning

To date, for most successful robots operating in the real world, Deep RL is not involved.

But… that's slowly changing:
• Learning Control Dynamics
• Learning to Drive: Beyond Pure Imitation (Waymo)
But… that's slowly changing: Object Detection using Deep RL
• Deep Reinforcement Learning of Region Proposal Networks for Object Detection, 2018.
• Hierarchical Object Detection with Deep Reinforcement Learning.
• Efficient Object Detection in Large Images using Deep Reinforcement Learning [2020].
• Deep Reinforcement Learning for Active Human Pose Estimation [2020].

The outline of application domains of RL in healthcare
Source: Yu, C., Liu, J., & Nemati, S. (2019). Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796.

Computational Intelligence and Smart Motion Research (CISMR) Group @ SVNIT

CISMR: Motion Rehabilitation and Motion Robotics
• 3D printer
• Bipedal robot
• Foot pressure sensor
• IR camera

Projects @ CISMR: Agents / Approaches
• We have trained three agents: Straight Walker, Terrain Walker and Imitation Walker.
Krunal Javiya, Jainesh Machhi, Parth Sharma, Saurav Patel. Autonomous Gait and Balancing Approach Using Deep Reinforcement Learning.

Pilot Study for Walking-Person Detection: Object Detection using Reinforcement Learning
• What is object detection? A computer vision technique that identifies and locates objects within an image or video.
• Specifically, object detection draws bounding boxes around the detected objects, which allows us to locate where those objects are in a given scene.
• Why object detection? Its ability to locate objects within an image or video allows us to count and then track those objects.
• It is applied in:
  • Crowd counting
  • Self-driving cars
  • Video surveillance
  • Face detection
  • Anomaly detection
Fig: Image recognition vs. object detection
Hierarchical Object Detection
• In this method, we train an intelligent agent using Deep RL that detects an object by deforming bounding boxes until they fit the object's bounding box.
• We use a fixed hierarchical representation with an object localization method, to force a top-down search.
• Each action the agent applies to the bounding box can change its aspect ratio, scale or position.

Test Results on the Walking-Person Dataset
• The bounding box shape is not always correct, but it is observable that the model has grasped the idea of how to detect a person in an image.
• Sometimes it zooms in too much on the person.
• Precision-recall curves: average precision-recall score 0.60 (Fig: precision vs. recall graph).

Deep RL in a Call Centre
CRSRL: Customer Routing System Using Deep Reinforcement Learning [2019]

Deep RL in Financial Markets

Challenge: RL & Real-World Applications
Reminder: supervised learning is "teach by example"; reinforcement learning is "teach by experience".
Open challenges. Two options:
1. Real-world observation + one-shot trial & error.
2. Realistic simulation + transfer learning.
Making these work means: 1. improving transfer learning, and 2. improving simulation.
Key Takeaways for Real-World Impact
• Deep Learning:
  • Fun part: good algorithms that learn from data.
  • Hard part: good questions, huge amounts of representative data.
• Deep Reinforcement Learning:
  • Fun part: good algorithms that learn from data.
  • Hard part: defining a useful state space, action space, and reward.
  • Hardest part: getting meaningful data for the above formalization.

Advice for Researchers
• Background: fundamentals in probability, statistics, multivariate calculus; deep learning basics; deep RL basics; TensorFlow (or PyTorch).
• Learn by doing: implement core deep RL algorithms; look for tricks and details in papers that were key to getting them to work; iterate fast in simple environments.
• Research: improve on an existing approach; focus on an unsolved task / benchmark; create a new task / problem that hasn't been addressed with RL.
References
• MIT Deep Learning Basics: Introduction and Overview with TensorFlow
• Univ. of Alberta: http://www.cs.ualberta.ca/~sutton/book/ebook/node1.html and www.cs.ualberta.ca/~sutton/book/the-book.html
• Sutton and Barto, "Reinforcement Learning: An Introduction."
• Univ. of New South Wales: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html
• https://people.revoledu.com/kardi/
• http://mnemstudio.org/path-finding-q-learning-tutorial.htm
• MIT Deep Learning and Artificial Intelligence Lectures
• https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/
• https://www.analyticsvidhya.com/blog/2019/04/introduction-deep-q-learning-python/
• https://www.learndatasci.com/tutorials/reinforcement-q-learning-scratch-python-openai-gym/

Hands on RL using Python
• Visit: https://cprakash86.wordpress.com/downloads/
• Using Python: RL_example.ipynb

In case of any query:

• Email: [email protected]

• https://Cprakash.in
[https://cprakash86.wordpress.com/]
Thank You
