MLT Unit-5 Notes
Reinforcement Learning
Reinforcement learning works on a feedback-based process in which an AI agent
(a software component) automatically explores its surroundings by trial and error:
it takes actions, learns from experience, and improves its performance. The agent
is rewarded for each good action and punished for each bad action; hence the
goal of a reinforcement learning agent is to maximize the total reward.
Reinforcement learning has four main elements:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
3) Value Function: The value function tells how good a situation or action is
and how much reward an agent can expect from it. Whereas the reward signal
indicates the immediate return of each good or bad action, the value function
specifies which states and actions are good in the long run. The value function
depends on the reward, as without reward there could be no value, and the whole
purpose of estimating values is to obtain more reward.
4) Model of the environment: The model is used for planning, which means it
provides a way to decide on a course of action by considering all possible future
situations before actually experiencing them. Approaches that solve RL problems
with the help of a model are termed model-based approaches; by contrast, an
approach that does not use a model is called a model-free approach.
There are two main types of reinforcement:
o Positive Reinforcement
o Negative Reinforcement
Markov Decision Process (MDP):
A Markov Decision Process (MDP) is used to describe the environment in RL, and
almost all RL problems can be formalized as an MDP.
MDP relies on the Markov property, so to understand MDP we first need to
understand this property.
Markov Property:
It says that "If the agent is present in the current state S1, performs
an action a1 and move to the state s2, then the state transition
from s1 to s2 only depends on the current state and future action
and states do not depend on past actions, rewards, or states."
Or, in other words, as per Markov Property, the current state transition does
not depend on any past action or state. Hence, MDP is an RL problem that
satisfies the Markov property. Such as in a Chess game, the players only
focus on the current state and do not need to remember past
actions or states.
Finite MDP:
A finite MDP is one in which the sets of states, actions, and rewards are all finite.
In RL, we consider only finite MDPs.
Markov Process:
A Markov process is a memoryless random process, i.e., a sequence of random
states S1, S2, ....., St that satisfies the Markov property. A Markov process is
also known as a Markov chain, which is a tuple (S, P) consisting of a set of
states S and a state-transition probability function P. These two components
(S and P) fully define the dynamics of the system.
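As a small illustration of the tuple (S, P), the sketch below simulates a toy Markov chain; the three weather-like states and the transition probabilities are made up purely for illustration.

```python
import numpy as np

# A Markov chain is the tuple (S, P): a set of states and a transition matrix.
# The states and probabilities below are invented purely for illustration.
S = ["Sunny", "Cloudy", "Rainy"]
P = np.array([
    [0.7, 0.2, 0.1],   # transition probabilities from "Sunny"
    [0.3, 0.4, 0.3],   # from "Cloudy"
    [0.2, 0.3, 0.5],   # from "Rainy"
])

rng = np.random.default_rng(0)
state = 0                      # start in "Sunny"
trajectory = [S[state]]
for _ in range(5):
    # Markov property: the next state depends only on the current state.
    state = rng.choice(len(S), p=P[state])
    trajectory.append(S[state])
print(" -> ".join(trajectory))
```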
o Q-Learning:
o Q-learning is an off-policy RL algorithm used for temporal difference (TD)
learning. Temporal difference learning methods compare temporally
successive predictions.
o It learns the action-value function Q(s, a), which measures how good it is
to take action "a" in a particular state "s".
o [Flowchart: working of Q-learning]
o State Action Reward State Action (SARSA):
o SARSA stands for State Action Reward State Action and is an on-policy
temporal difference learning method. An on-policy control method selects
the action for each state while learning, using a specific policy.
o The goal of SARSA is to estimate Q π(s, a) for the currently followed
policy π and all state-action pairs (s, a).
o The main difference between the Q-learning and SARSA algorithms is that,
unlike Q-learning, SARSA does not use the maximum Q-value of the next state
to update the Q-table; it uses the Q-value of the action that is actually taken next.
o In SARSA, the next action and reward are selected using the same policy
that determined the original action, as sketched in the example below.
o SARSA is so named because its update uses the quintuple (s, a, r, s', a').
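Below is a minimal sketch of the SARSA update on a tabular problem. The chain-walk environment, its step(state, action) function, and the hyperparameter values are assumptions made for illustration, not part of the notes above.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # arbitrary hyperparameters
rng = np.random.default_rng(0)

def epsilon_greedy(state):
    # Behaviour policy: mostly greedy, sometimes random (SARSA is on-policy,
    # so this same policy is also the one being evaluated).
    if rng.random() < epsilon:
        return rng.integers(n_actions)
    return int(np.argmax(Q[state]))

def step(state, action):
    # Hypothetical toy environment: move right on action 1, left on action 0;
    # reward 1 only when the right end of the chain is reached.
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(200):
    s = 0
    a = epsilon_greedy(s)
    for _ in range(20):
        s2, r = step(s, a)
        a2 = epsilon_greedy(s2)     # next action chosen by the SAME policy
        # SARSA update uses the quintuple (s, a, r, s', a'):
        Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
        s, a = s2, a2
        if s == n_states - 1:
            break
print(Q)
```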
Q-Learning Explanation:
To perform any action a in state s, the agent receives a reward R(s, a) and ends
up in a new state s', so the Q-value update equation is:
Q(s, a) ← Q(s, a) + α [ R(s, a) + γ max_a' Q(s', a') − Q(s, a) ]
where α is the learning rate and γ is the discount factor.
The Q in Q-learning stands for quality, which means it specifies the quality
of an action taken by the agent.
The RL agent uses this Q-table as a reference table to select the best action
based on the Q-values.
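A minimal sketch of this table-based update is shown below; it reuses the same assumed chain-walk environment as the SARSA sketch above, and all hyperparameter values are arbitrary.

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))       # the Q-table, initialised to zero
alpha, gamma, epsilon = 0.1, 0.9, 0.1     # arbitrary hyperparameters
rng = np.random.default_rng(0)

def step(state, action):
    # Hypothetical toy environment (same chain-walk idea as the SARSA sketch above).
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(200):
    s = 0
    for _ in range(20):
        # epsilon-greedy action selection from the Q-table
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # Off-policy update: bootstrap from the BEST next action, max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
        if s == n_states - 1:
            break
print(Q)   # the agent can now act greedily with respect to this table
```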
Deep Q-Learning:
Deep Q-learning takes the Q-learning idea one step further: instead of a Q-table,
a neural network takes the state as input and approximates the Q-values for each
action based on that state.
We do this because a classic Q-table is not very scalable. It might work for
simple games, but in a more complex game with dozens of possible actions and
game states the Q-table quickly becomes too large to build and maintain.
So we use a deep neural network that gets the state as input and produces a
Q-value for each action. Then, again, we choose the action with the highest
Q-value. The learning process is still the same iterative update approach, but
instead of updating the Q-table, we update the weights of the neural network so
that its outputs improve.
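The sketch below shows this idea using PyTorch; the state size, network shape, and hyperparameters are illustrative assumptions, and a practical Deep Q-Network would also add experience replay and a target network, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2      # illustrative sizes, not taken from the notes

# The network replaces the Q-table: state in, one Q-value per action out.
q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def select_action(state):
    # Greedy action: index of the largest predicted Q-value.
    with torch.no_grad():
        return int(q_net(state).argmax())

def update(state, action, reward, next_state, done):
    # Same iterative idea as tabular Q-learning, but instead of overwriting a
    # table cell we nudge the network weights toward the target value.
    q_pred = q_net(state)[action]
    with torch.no_grad():
        q_target = reward + gamma * q_net(next_state).max() * (1.0 - done)
    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example call with dummy tensors (a real agent would get these from the environment):
s = torch.randn(state_dim)
update(s, select_action(s), reward=0.0, next_state=torch.randn(state_dim), done=0.0)
```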
Genetic Algorithm:
A genetic algorithm is an adaptive heuristic search algorithm inspired by
"Darwin's theory of evolution in nature." It is used to solve optimization
problems in machine learning and is an important algorithm because it helps
solve complex problems that would otherwise take a long time to solve.
A genetic algorithm works through the following five phases:
o Initialization
o Fitness Assignment
o Selection
o Reproduction
o Termination
1. Initialization
The genetic algorithm starts by generating a set of individuals, called the
population. Each individual is a candidate solution to the given problem. An
individual is characterized by a set of parameters called genes; the genes are
joined into a string to form a chromosome, which represents a solution to the
problem. One of the most popular initialization techniques is the use of random
binary strings, as sketched below.
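A minimal sketch of random binary-string initialization; the population size and chromosome length are arbitrary choices for illustration.

```python
import random

POP_SIZE, CHROM_LENGTH = 10, 8     # arbitrary sizes for illustration

def init_population():
    # Each individual (chromosome) is a random binary string of genes.
    return [[random.randint(0, 1) for _ in range(CHROM_LENGTH)]
            for _ in range(POP_SIZE)]

population = init_population()
print(population[0])   # e.g. [1, 0, 0, 1, 1, 0, 1, 0]
```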
2. Fitness Assignment
The fitness function determines how fit an individual is, i.e., its ability to
compete with other individuals. In every iteration, individuals are evaluated
using the fitness function, which assigns a fitness score to each of them. This
score determines the probability of being selected for reproduction: the higher
the fitness score, the greater the chance of being selected (a sketch combining
fitness scoring and selection follows the next phase).
3. Selection
The selection phase chooses the individuals that will reproduce offspring. The
selected individuals are arranged in pairs, and each pair then passes its genes
on to the next generation.
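The sketch below combines fitness assignment and fitness-proportionate (roulette-wheel) selection; the "count of 1-bits" fitness function is only a stand-in for a real problem's objective.

```python
import random

def fitness(chromosome):
    # Stand-in fitness score: here simply the number of 1-bits ("one-max").
    return sum(chromosome)

def select_pair(population):
    # Roulette-wheel selection: the probability of being picked is proportional
    # to the fitness score, so fitter individuals reproduce more often.
    weights = [fitness(ind) + 1e-9 for ind in population]   # avoid all-zero weights
    return random.choices(population, weights=weights, k=2)

population = [[random.randint(0, 1) for _ in range(8)] for _ in range(10)]
parent1, parent2 = select_pair(population)
```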
4. Reproduction
After the selection process, the creation of children takes place in the
reproduction step. In this step, the genetic algorithm applies two variation
operators to the parent population. The two operators involved in the
reproduction phase are given below:
o Crossover
A crossover point is chosen at random within the genes, and the parents exchange
their genes up to that point. The newly generated offspring are added to the
population. This process is also called recombination or crossover (a sketch of
both operators appears after the mutation types below). Types of crossover
available:
o One point crossover
o Two-point crossover
o Uniform crossover
o Inheritable Algorithms crossover
o Mutation
The mutation operator inserts random genes into the offspring (new child) to
maintain diversity in the population. It can be done by flipping some bits of
the chromosome.
Mutation helps solve the problem of premature convergence and enhances
diversification.
Types of mutation available:
o Flip bit mutation
o Gaussian mutation
o Exchange/Swap mutation
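A minimal sketch of one-point crossover followed by flip-bit mutation on binary chromosomes; the mutation rate and example parents are arbitrary.

```python
import random

MUTATION_RATE = 0.05    # arbitrary probability of flipping each gene

def one_point_crossover(parent1, parent2):
    # Pick a random crossover point and exchange the tails of the parents.
    point = random.randint(1, len(parent1) - 1)
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def flip_bit_mutation(chromosome):
    # Flip each bit with a small probability to keep diversity in the population.
    return [1 - gene if random.random() < MUTATION_RATE else gene
            for gene in chromosome]

p1, p2 = [1, 1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1, 1, 1]
c1, c2 = one_point_crossover(p1, p2)
c1 = flip_bit_mutation(c1)
```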
5. Termination
After the reproduction phase, a stopping criterion is applied as the basis for
termination. The algorithm terminates once a solution reaches the threshold
fitness, and the best individual in the final population is identified as the
final solution.
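The compact driver below ties the five phases together on a toy "one-max" problem (maximize the number of 1-bits) and stops once a fitness threshold is reached or a generation limit expires; all sizes and rates are illustrative assumptions.

```python
import random

POP_SIZE, CHROM_LENGTH, MAX_GENERATIONS = 20, 12, 100
MUTATION_RATE, FITNESS_THRESHOLD = 0.05, 12   # stop once a perfect string appears

def fitness(ind):
    return sum(ind)                # toy "one-max" objective

def evolve():
    # 1. Initialization: random binary strings.
    population = [[random.randint(0, 1) for _ in range(CHROM_LENGTH)]
                  for _ in range(POP_SIZE)]
    for generation in range(MAX_GENERATIONS):
        # 2. Fitness assignment + 5. Termination check.
        best = max(population, key=fitness)
        if fitness(best) >= FITNESS_THRESHOLD:
            return best, generation
        next_generation = []
        while len(next_generation) < POP_SIZE:
            # 3. Selection: fitness-proportionate (roulette-wheel) choice of parents.
            p1, p2 = random.choices(population,
                                    weights=[fitness(i) + 1e-9 for i in population],
                                    k=2)
            # 4. Reproduction: one-point crossover followed by flip-bit mutation.
            point = random.randint(1, CHROM_LENGTH - 1)
            child = p1[:point] + p2[point:]
            child = [1 - g if random.random() < MUTATION_RATE else g for g in child]
            next_generation.append(child)
        population = next_generation
    return max(population, key=fitness), MAX_GENERATIONS

best, gens = evolve()
print(f"best individual {best} found after {gens} generations")
```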
Application areas of genetic algorithms include:
2. Mutation testing
3. Code breaking