
Module 06

Q1. What are the elements of reinforcement learning?


Q2. What is Reinforcement Learning? Explain with the help of an example.
Q3. Explain reinforcement learning in detail along with the various elements
involved in forming the concept. Also define what is meant by positive RL.

[IOM | May16, Dec16, May17, Dec17 & May18]


Ans:
REINFORCEMENT LEARNING:

1. Reinforcement Learning is a type of machine learning algorithm.

2. It enables an agent to learn in an interactive environment by trial and error.

3. Agent uses feedback from its own actions and experiences.

4. The goal is to find a suitable action model that maximizes the total cumulative reward of the
agent.

5. Output depends on the state of the current input.

6. The next input depends on the output of the previous input.

7. Types of Reinforcement Learning:


a. Positive RL
b. Negative RL

8. The most common applications of reinforcement learning are:


a. PC Games: Reinforcement learning is widely used in PC games such as
Assassin's Creed, where it enables the enemies to change their moves and
approach based on the player's performance.
b. Robotics: Most of the robots that you see in the present world run on
reinforcement learning.

EXAMPLE:

1. We have an agent and a reward with many hurdles in between.

2. The agent is supposed to find the best possible path to reach the reward.
[Figure 8.1: Example of Reinforcement Learning - showing a grid with robot,
diamond and fire]

3. Figure 8.1 shows a robot, a diamond and fire.

4. The goal of the robot is to get the reward.

5. The reward is the diamond, and the hurdle to avoid is the fire.

6. The robot learns by trying all the possible paths.

7. After learning, the robot chooses the path that reaches the reward with the
fewest hurdles.

8. Each right step will give the robot a reward.

9. Each wrong step will subtract from the robot's reward.

10. The total reward is calculated when the robot reaches the final reward, that is,
the diamond.
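
A minimal Python sketch of this reward bookkeeping, assuming a made-up 3×3 grid; the cell layout and reward values are illustrative, not taken from Figure 8.1:

# Minimal sketch of the grid example: +1 for each safe step, -1 for stepping
# into fire, and a large bonus for reaching the diamond.
GRID = {
    (0, 0): "start", (0, 1): "empty", (0, 2): "fire",
    (1, 0): "empty", (1, 1): "fire",  (1, 2): "empty",
    (2, 0): "empty", (2, 1): "empty", (2, 2): "diamond",
}

def step_reward(cell_type):
    """Reward the robot collects for entering a cell."""
    return {"start": 0, "empty": +1, "fire": -1, "diamond": +10}[cell_type]

def total_reward(path):
    """Sum of rewards along a path of grid coordinates."""
    return sum(step_reward(GRID[cell]) for cell in path)

# A path hugging the left and bottom edges avoids both fire cells.
safe_path = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(total_reward(safe_path))  # 0 + 1 + 1 + 1 + 10 = 13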

REINFORCEMENT LEARNING ELEMENTS:


There are four main sub-elements of a reinforcement learning system (a small illustrative sketch of all four follows this section):

1. A policy

2. Reward function

3. A value function

4. A model of the environment (optional)

1. Policy:

The policy is the core of a reinforcement learning agent

Policy is sufficient to determine behaviour

In general, policies may be stochastic

A policy defines the learning agent's way of behaving at a given time

The simplest case of a policy may be a simple function or lookup table

2. Reward Function:

A reward function defines the goal in a reinforcement learning problem

It maps each perceived state of the environment to a single number, a reward

The reward function defines what the good and bad events are for the
agent

Reward functions must necessarily be unalterable by the agent

Reward functions serve as the basis for altering the policy

3. Value Function:

Whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run

The value of a state is the total amount of reward an agent can expect to
collect over the future, starting from that state

Rewards are in a sense primary, whereas values, as predictions of rewards, are secondary

Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward

4. Model:

Model is an optional element of reinforcement learning system

Model is something that mimics the behaviour of the environment

For example, given a state and action, it can predict the resultant next state
and next reward

Models are used for planning
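
To make the four elements concrete, here is a minimal Python sketch for a hypothetical two-state, two-action problem; all names and numbers are illustrative only, not part of any standard library or textbook example:

# Toy illustration of the four elements of a reinforcement learning system.
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]

# 1. Policy: in the simplest case, a lookup table from state to action.
policy = {"s0": "right", "s1": "left"}

# 2. Reward function: maps a (state, action) pair to a single number.
reward = {("s0", "right"): 1.0, ("s0", "left"): 0.0,
          ("s1", "right"): 0.0, ("s1", "left"): 2.0}

# 3. Value function: long-run expected reward of each state (estimates).
value = {"s0": 3.5, "s1": 4.0}

# 4. Model (optional): predicts the next state for a (state, action) pair.
model = {("s0", "right"): "s1", ("s0", "left"): "s0",
         ("s1", "right"): "s1", ("s1", "left"): "s0"}

state = "s0"
action = policy[state]                                   # behave according to the policy
print(reward[(state, action)], model[(state, action)])   # 1.0 s1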

Q4. Model Based Learning


[IOM | Dec17]
Ans:
MODEL BASED LEARNING:

1. Model-based machine learning refers to an approach in which a bespoke model is constructed for each new problem.

2. These models are parameterized with a fixed number of parameters, which does not
change as the size of the training data changes.

3. The goal is to provide a simple development framework which supports the creation of a wide range of bespoke models.

4. For example, if you assume that a given data set {(x₁, y₁), ..., (xₙ, yₙ)} follows a
linear model y = f(x; w, b), where w ∈ ℝᵈ, b ∈ ℝ, and d is the dimension
of each data point, then the number of parameters (d + 1) is fixed irrespective of n.

5. There are three steps to model-based machine learning (illustrated in the sketch after this list):


a. Describe the Model: Describe the process that generated the data using
factor graphs
b. Condition on Observed Data: Condition the variables to their known
quantities
c. Perform Inference: Perform Bayesian reasoning to update the prior
distribution over the latent variables or parameters

6. This framework emerged from an important convergence of three key ideas:
a. The adoption of a Bayesian viewpoint
b. The use of factor graphs (a type of probabilistic graphical model); and
c. The application of fast, deterministic, approximate inference algorithms
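
The three steps can be illustrated with a short Python sketch. It assumes the linear model from point 4 with Gaussian noise and a Gaussian prior over the parameters, and performs the Bayesian update in closed form rather than by message passing on a factor graph; the data and constants are made up:

import numpy as np

# Hypothetical data for the linear model y = f(x; w, b): 20 points in R^2.
rng = np.random.default_rng(0)
d, n = 2, 20
X = rng.normal(size=(n, d))
true_w, true_b = np.array([1.5, -2.0]), 0.5
y = X @ true_w + true_b + 0.1 * rng.normal(size=n)

# Step 1: describe the model -- y = [x, 1] . theta + Gaussian noise,
# with a Gaussian prior over theta = (w, b).
Phi = np.hstack([X, np.ones((n, 1))])   # design matrix with a bias column
noise_var, prior_var = 0.1 ** 2, 10.0

# Step 2: condition on the observed data (X, y).
# Step 3: perform Bayesian inference to update the prior into the
# posterior over theta (exact conjugate update in this simple case).
post_cov = np.linalg.inv(Phi.T @ Phi / noise_var + np.eye(d + 1) / prior_var)
post_mean = post_cov @ Phi.T @ y / noise_var

print(post_mean)   # close to [1.5, -2.0, 0.5]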

MODEL BASED LEARNING PROCESS:

1. We start with model-based learning, where we completely know the environment model parameters

2. The environment model parameters are p(sₜ₊₁ | sₜ, aₜ) and p(rₜ₊₁ | sₜ, aₜ)

3. In such a case, we do not need any exploration

4. We can directly solve for the optimal policy using dynamic programming

5. The optimal value function is obtained

6. When we have the optimal value function, the optimal policy is to choose
the action that maximizes value in next state as following:
π*(sₜ) = arg maxₐ [ E[rₜ₊₁ | sₜ, a] + γ Σₛₜ₊₁ P(sₜ₊₁ | sₜ, a) V*(sₜ₊₁) ]
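
A minimal Python sketch of this dynamic-programming solution (value iteration followed by policy extraction), assuming a hypothetical two-state, two-action MDP whose transition probabilities and rewards are invented for illustration:

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# P[s, a, s'] = known transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(n_states)
for _ in range(1000):                       # dynamic programming iterations
    Q = R + gamma * P @ V                   # Q[s, a] = r(s, a) + γ Σ P(s'|s,a) V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:    # stop once the value function converges
        break
    V = V_new

policy = Q.argmax(axis=1)                   # π*(s) = arg max_a Q*(s, a)
print(V, policy)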

BENEFITS:

1. Provides a systematic process of creating ML solutions

2. Allows for the incorporation of prior knowledge

3. Does not suffer from over-fitting

4. Separates model from inference/training code

Q5. Explain in detail Temporal Difference Learning


[IOM | Dec16, May17 & Dec18]

TEMPORAL DIFFERENCE LEARNING:


It is an approach to learning how to predict a quantity that depends on future
values of a given signal.
Various TD learning methods can be used to estimate value functions; their key characteristics are:

1. They can be used to predict returns - the total reward expected over the
future

2. These predictions are usually "bootstrapped": the current estimate is updated towards a more
accurate target return

3. TD algorithms are often used to predict a measure of the total amount of reward expected over the future

4. They can be used to predict other quantities as well

Different TD algorithms:

1. TD(0) algorithm

2. TD(λ) algorithm

3. TD(n) algorithm

4. The key to understanding temporal difference algorithms is the simplest case, the TD(0) algorithm
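
As a rough illustration, a TD(0) prediction sketch in Python; env_step is a hypothetical stand-in for the environment and the fixed policy being evaluated, not part of any particular library:

def td0(env_step, start_state, n_episodes=500, alpha=0.1, gamma=0.9):
    """Estimate V(s) for a fixed policy from sampled transitions.

    env_step(s) -> (reward, next_state, done) under the policy being evaluated.
    """
    V = {}                                              # value estimates, default 0
    for _ in range(n_episodes):
        s, done = start_state, False
        while not done:
            r, s_next, done = env_step(s)
            bootstrap = 0.0 if done else V.get(s_next, 0.0)
            target = r + gamma * bootstrap              # bootstrapped TD target
            V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
            s = s_next
    return V

# Tiny made-up chain: from "A" we move to "B" (reward 0), from "B" the episode
# ends with reward 1, so V(B) ≈ 1 and V(A) ≈ γ·V(B) ≈ 0.9.
def chain_step(s):
    return (0.0, "B", False) if s == "A" else (1.0, "terminal", True)

print(td0(chain_step, "A"))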

TEMPORAL DIFFERENCE LEARNING METHODS:


A) On-Policy Temporal Difference methods:

1. Learn the value of the policy that is used to make decisions

2. The value functions are related to specific actions determined by some policy

3. These policies are usually "soft" and non-deterministic

4. This means there is always an element of exploration to the policy

5. The policy is not so strict that it always chooses the action that gives the most
reward

6. On policy algorithms cannot separate exploration from control

B) Off-Policy Temporal Difference methods:

1. Can learn different policies for behaviour and estimation

2. Again, the behaviour policy is usually "soft" to ensure there is sufficient exploration going
on

3. Off-policy algorithms can update the estimated value functions using actions which have not actually been tried

4. Off policy algorithms can separate exploration from control

5. The agent may end up learning multiple policies of different behaviours during the learning phase

ACTION SELECTION POLICIES:


The aim of these policies is to balance the trade-off between exploitation and
exploration
I) ε-Greedy:

1. Most of the time the action with the highest estimated reward is chosen,
called the greedy action

2. Every once in a while, say with a small probability ε, an action is selected at random

3. The action is selected randomly independent of the action-value estimates

4. This method ensures that if enough trials are done, each action will be tried
an infinite number of times

5. Ensures optimal actions are discovered

II) Softmax:

1. Very similar to ε-greedy

2. It addresses a drawback of ε-greedy: when exploring, the worst action is just as likely to be selected as the second best

3. Softmax assigns a rank or weight to each of these actions according to their action-value estimate

4. An action is selected with respect to the weight associated with each action

5. It means that the worst actions are unlikely to be chosen

6. This is a good approach to take when the worst actions are very
unfavorable
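
A small Python sketch of both action selection policies, using illustrative Q-value estimates; the helper names are hypothetical:

import math
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick the greedy action most of the time, a random action with probability ε."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax_action(q_values, temperature=1.0):
    """Weight each action by exp(Q/τ), so the worst actions are rarely chosen."""
    weights = [math.exp(q / temperature) for q in q_values]
    total = sum(weights)
    return random.choices(range(len(q_values)), weights=[w / total for w in weights])[0]

# Example: action 2 has the highest estimate and is chosen most often.
print(epsilon_greedy([0.1, 0.5, 0.9]), softmax_action([0.1, 0.5, 0.9]))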

ADVANTAGES OF TD METHODS:

1. Don't need a model of the environment

2. They can be used online and incrementally

3. They don't need to wait till the end of the episode

4. They need less memory and computation

Q6. What is Q-learning? Explain algorithm for learning Q


Ans: [IOM | Dec18]
Q-LEARNING:

1. Q-learning is a reinforcement learning technique used in machine learning

2. Q-learning is a value-based learning algorithm

3. The goal of Q-learning is to learn a policy

4. Q-learning tells an agent what action to take under what circumstances

5. Q-learning uses temporal differences to estimate the value of Q(s, a)

6. In Q-learning, the agent maintains a table Q[S, A],
where S is the set of states and A is the set of actions;
Q[s, a] represents the current estimate of Q*(s, a)

7. An experience (s, a, r, s') provides one data point for the value of Q(s, a)

8. The data point is that the agent received the future value of r + γV(s'),
where V(s') = maxₐ' Q(s', a')

9. This is the actual current reward plus the discounted estimated future value

10. This new data point is called a return

11. The agent can use the temporal difference equation to update its estimate
for Q(s, a):
Q(s, a) ← Q(s, a) + α[r + γ maxₐ' Q(s', a') - Q(s, a)]

12. Or, equivalently,


Q(s, a) ← (1-α) Q(s, a) + α[r + γmaxₐ' Q(s', a')]

13. It can be proven that given sufficient training under any ε-soft policy, the
algorithm converges with probability 1 to a close approximation of the
action-value function for an arbitrary target policy

14. Q-Learning learns the optimal policy even when actions are selected
according to a more exploratory or even random policy

Q-TABLE:

1. Q-Table is just a simple lookup table

2. In the Q-Table, we calculate the maximum expected future rewards for the actions at each state

3. Basically, this table will guide us to the best action at each state

4. In the Q-Table, the columns are the actions and the rows are the states

5. Each Q-table score will be the maximum expected future reward that the
agent will get if it takes that action at that state

6. This is an iterative process, as we need to improve the Q-Table at each iteration
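
A minimal sketch of such a Q-Table in Python, with hypothetical numbers of states and actions:

import numpy as np

# Rows are states, columns are actions; entry Q[s, a] is the current estimate
# of the maximum expected future reward for taking action a in state s.
n_states, n_actions = 5, 4              # illustrative sizes
Q = np.zeros((n_states, n_actions))     # usually initialised to zeros

best_action_per_state = Q.argmax(axis=1)   # the action the table currently recommends
print(Q.shape, best_action_per_state)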

PARAMETERS USED IN THE Q-VALUE UPDATE PROCESS:


I) α - the learning rate:

1. Set between 0 and 1

2. Setting it to 0 means that the Q-values are never updated, hence nothing is
learned

3. Setting a high value such as 0.9 means that learning can occur quickly

II) γ - Discount factor:

1. Set between 0 and 1

2. This models the fact that future rewards are worth less than immediate
rewards

3. Mathematically, the discount factor needs to be set less than 1 for the
algorithm to converge

III) Maxₐ - the maximum reward:

1. This is attainable in the state following the current one

2. i.e. the reward for taking the optimal action thereafter

PROCEDURAL APPROACH:

1. Initialize the Q-values table, Q(s, a)

2. Observe the current state, s

3. Choose an action, a, for that state based on one of the action selection
policies (i.e. soft, ε-greedy or Softmax)

4. Take the action, and observe the reward, r, as well as the new state, s'

5. Update the Q-value for the state using the observed reward and the
maximum reward possible for the next state. (The updating is done
according to the formula and parameters described above.)

6. Set the state to the new state, and repeat the process until a terminal state
is reached
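
A compact Python sketch of this procedure; env_reset and env_step are hypothetical stand-ins for whatever environment is used:

import random

def q_learning(env_reset, env_step, n_states, n_actions,
               n_episodes=1000, alpha=0.5, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning following the procedure above.

    env_reset() -> initial state; env_step(s, a) -> (reward, next_state, done).
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]        # 1. initialise Q(s, a)
    for _ in range(n_episodes):
        s = env_reset()                                     # 2. observe the current state
        done = False
        while not done:
            if random.random() < epsilon:                   # 3. ε-greedy action selection
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            r, s_next, done = env_step(s, a)                # 4. act, observe r and s'
            best_next = 0.0 if done else max(Q[s_next])     # max_a' Q(s', a')
            # 5. Q(s,a) <- Q(s,a) + α[ r + γ max_a' Q(s',a') - Q(s,a) ]
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s_next                                      # 6. move to the new state
    return Q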

Q7. Explain the following terms with respect to Reinforcement Learning: delayed
rewards, exploration, and partially observable states. (10 Marks - Dec18)
Q8. Explain reinforcement learning in detail along with the various elements
involved in forming the concept. Also define what is meant by partially
observable state. (10 Marks - Dec17)
[IOM | Dec17 & Dec18]

EXPLORATION:

1. It means gathering more information about the problem

2. Reinforcement learning requires clever exploration mechanisms

3. Randomly selecting actions, without reference to an estimated probability distribution, gives poor performance

4. The case of (small) finite Markov decision processes is relatively well understood

5. However, due to the lack of algorithms that properly scale well with the
number of states, simple exploration methods are the most practical

6. One such method is ε-greedy, where the agent chooses the action that it
believes has the best long-term effect with probability 1-ε

7. With probability ε, the agent instead chooses an action uniformly at random

8. Here, 0 < ε < 1 is a tuning parameter, which is sometimes changed, either according to a fixed schedule or adaptively based on heuristics
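
One possible fixed schedule for ε, sketched in Python with illustrative constants:

# Start exploratory, then decay ε toward a small floor over episodes.
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.05, decay=0.995):
    """Exponentially decaying exploration rate, clipped at eps_min."""
    return max(eps_min, eps_start * decay ** episode)

print([round(epsilon_schedule(e), 3) for e in (0, 100, 500, 1000)])
# [1.0, 0.606, 0.082, 0.05]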

DELAYED REWARDS:

1. In the general case of the reinforcement learning problem, the agent's actions determine not only its immediate reward,

2. but also (at least probabilistically) the next state of the
environment

3. It may take a long sequence of actions, receiving insignificant reinforcement,

4. and then finally arrive at a state with high reinforcement

5. The agent has to learn from this delayed reinforcement; this is called the problem of delayed rewards

6. The agent must be able to learn which of its actions are desirable based on
reward that can take place arbitrarily far in the future

7. This can also be done with eligibility traces, which weight the previous action
a lot,

8. the action before that a little less, the action before that even less, and
so on. However, this takes a lot of computational time
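
A tiny Python sketch of how such a trace decays the credit given to earlier actions, with illustrative values of the discount γ and trace-decay parameter λ:

# Each visited (state, action) pair keeps a trace that decays by γλ per step,
# so the most recent action is weighted most, the one before it a little less.
def decayed_credit(n_steps_back, gamma=0.9, lam=0.8):
    """Weight given to an action taken n steps before the reward arrived."""
    return (gamma * lam) ** n_steps_back

print([round(decayed_credit(k), 3) for k in range(4)])
# [1.0, 0.72, 0.518, 0.373]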

PARTIALLY OBSERVABLE STATES:

1. In certain applications, the agent does not know the state exactly

2. It is equipped with sensors that return an observation using which the agent
should estimate the state

3. For example, we have a robot which navigates in a room

4. The robot may not know its exact location in the room, or what else is in the
room

5. The robot may have a camera with which sensory observations are
recorded

6. This does not tell the robot its state exactly but gives some indication as to its likely
state

7. For example, the robot may only know that there is a wall to its right

8. The setting is like a Markov decision process, except that after taking an
action aₜ in state sₜ,

9. the new state sₜ₊₁ is not known; the agent only receives an observation oₜ₊₁, which is a
stochastic function of sₜ and aₜ: p(oₜ₊₁ | sₜ, aₜ)

10. This is called a partially observable state.
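
As a rough illustration of how an agent can track its likely state, a minimal Python belief-update sketch; the transition and observation probabilities below are invented for illustration:

import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] = transition model
              [[0.6, 0.4], [0.1, 0.9]]])
O = np.array([[0.7, 0.3],                  # O[s', o] = P(o | s') = observation model
              [0.2, 0.8]])

def belief_update(b, a, o):
    """b'(s') ∝ P(o | s') Σ_s P(s' | s, a) b(s)."""
    predicted = b @ P[:, a, :]             # Σ_s b(s) P(s'|s,a)
    unnormalised = predicted * O[:, o]
    return unnormalised / unnormalised.sum()

b = np.array([0.5, 0.5])                   # initially unsure which state we are in
print(belief_update(b, a=1, o=1))          # sharper belief after one observation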

SUMS ON Q1

