
Maharana Pratap Group of Institutions, Mandhana, Kanpur

(Approved By AICTE, New Delhi And Affiliated To AKTU, Lucknow)

Digital Notes
[Department of Computer Science Engineering]

Course : B.TECH
Branch : CSE 3rd Yr
Subject Name: Machine Learning Techniques
(BCS055)
Prepared by: Mr. Abhishek Singh Sengar
UNIT 5
• REINFORCEMENT LEARNING–Introduction to
Reinforcement Learning, Learning Task, Example of
Reinforcement Learning in Practice, Learning Models
for Reinforcement – (Markov Decision process, Q
Learning - Q Learning function, Q Learning Algorithm)
• Application of Reinforcement Learning, Introduction
to Deep Q Learning. GENETIC ALGORITHMS:
Introduction, Components, GA cycle of reproduction,
Crossover, Mutation, Genetic Programming, Models
of Evolution and Learning, Applications
• Reinforcement Learning (RL) is a branch of
machine learning that focuses on how agents
can learn to make decisions through trial and
error to maximize cumulative rewards.
• RL allows machines to learn by interacting
with an environment and receiving feedback
based on their actions. This feedback comes in
the form of rewards or penalties.
• Reinforcement Learning revolves around the idea that an agent
(the learner or decision-maker) interacts with an environment
to achieve a goal. The agent performs actions and receives
feedback to optimize its decision-making over time.
• Agent: The decision-maker that performs actions.
• Environment: The world or system in which the agent operates.
• State: The situation or condition the agent is currently in.
• Action: The possible moves or decisions the agent can make.
• Reward: The feedback or result from the environment based
on the agent’s action.
How Does Reinforcement Learning Work?

• The RL process involves an agent performing actions in an environment, receiving rewards or penalties based on those actions, and adjusting its behavior accordingly. This loop helps the agent improve its decision-making over time to maximize the cumulative reward.
• Here’s a breakdown of RL components:
• Policy: A strategy that the agent uses to determine the next action
based on the current state.
• Reward Function: A function that provides feedback on the actions
taken, guiding the agent towards its goal.
• Value Function: Estimates the future cumulative rewards the agent will
receive from a given state.
• Model of the Environment: A representation of the environment that
predicts future states and rewards, aiding in planning.
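As an illustration only (not from the notes), the sketch below maps these four components onto plain Python; every name in it is a made-up placeholder.

# A minimal sketch (not from the notes) mapping the four RL components above
# onto plain Python. All names here are illustrative placeholders.

import random

states = ["s0", "s1", "s2"]          # environment states
actions = ["left", "right"]          # available actions

def policy(state, q_values, epsilon=0.1):
    """Policy: chooses the next action from the current state."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values[(state, a)])

def reward_function(state, action):
    """Reward function: feedback for taking `action` in `state` (toy values)."""
    return 1.0 if (state, action) == ("s1", "right") else 0.0

# Value function approximated by a table of state-action estimates.
q_values = {(s, a): 0.0 for s in states for a in actions}

# Model of the environment: predicts the next state for (state, action).
transition_model = {
    ("s0", "right"): "s1", ("s0", "left"): "s0",
    ("s1", "right"): "s2", ("s1", "left"): "s0",
    ("s2", "right"): "s2", ("s2", "left"): "s1",
}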
Reinforcement Learning Example: Navigating a Maze

• Imagine a robot navigating a maze to reach a diamond while avoiding fire hazards. The goal is to find the optimal path with the least number of hazards while maximizing the reward:
• Each time the robot moves correctly, it receives a reward.
• If the robot takes the wrong path, it loses points.
• The robot learns by exploring different paths in the maze.
By trying various moves, it evaluates the rewards and
penalties for each path. Over time, the robot determines
the best route by selecting the actions that lead to the
highest cumulative reward.
The robot’s learning process can be summarized as follows:

• Exploration: The robot starts by exploring all possible paths in the maze, taking different actions at each step (e.g., move left, right, up, or down).
• Feedback: After each move, the robot receives feedback from the
environment:
– A positive reward for moving closer to the diamond.
– A penalty for moving into a fire hazard.
• Adjusting Behavior: Based on this feedback, the robot adjusts its
behavior to maximize the cumulative reward, favoring paths that
avoid hazards and bring it closer to the diamond.
• Optimal Path: Eventually, the robot discovers the optimal path with
the least number of hazards and the highest reward by selecting the
right actions based on past experiences.
Types of Reinforcements in RL

• 1. Positive Reinforcement
• Positive reinforcement occurs when an event, triggered by a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
• Advantages: Maximizes performance, helps sustain change over time.
• Disadvantages: Overuse can lead to an excess of reinforcement, which may reduce its effectiveness.
• 2. Negative Reinforcement
• Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
• Advantages: Increases behavior frequency, ensures a minimum performance standard.
• Disadvantages: It may encourage only enough action to avoid penalties.
CartPole in OpenAI Gym

• One of the classic RL problems is the CartPole environment in OpenAI Gym, where the goal is to balance a pole on a cart. The agent can either push the cart left or right to prevent the pole from falling over.
• State space: Describes the four key variables (position,
velocity, angle, angular velocity) of the cart-pole system.
• Action space: Discrete actions—either move the cart left
or right.
• Reward: The agent earns 1 point for each step the pole
remains balanced.
import gym
import numpy as np
import warnings

# Suppress specific deprecation warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Load the environment with render mode specified
env = gym.make('CartPole-v1', render_mode="human")

# Initialize the environment to get the initial state
state = env.reset()

# Print the state space and action space
print("State space:", env.observation_space)
print("Action space:", env.action_space)

# Run a few steps in the environment with random actions
for _ in range(10):
    env.render()  # Render the environment for visualization
    action = env.action_space.sample()  # Take a random action

    # Take a step in the environment
    step_result = env.step(action)

    # Check the number of values returned and unpack accordingly
    if len(step_result) == 4:
        # Older Gym API: (next_state, reward, done, info)
        next_state, reward, done, info = step_result
        terminated = done
    else:
        # Newer Gym API: (next_state, reward, done, truncated, info)
        next_state, reward, done, truncated, info = step_result
        terminated = done or truncated

    print(f"Action: {action}, Reward: {reward}, Next State: {next_state}, Done: {done}, Info: {info}")

    if terminated:
        state = env.reset()  # Reset the environment if the episode is finished

env.close()  # Close the environment when done


Application of Reinforcement Learning
• Robotics: RL is used to automate tasks in structured environments such as manufacturing, where robots learn to optimize movements
and improve efficiency.
• Game Playing: Advanced RL algorithms have been used to develop strategies for complex games like chess, Go, and video games,
outperforming human players in many instances.
• Industrial Control: RL helps in real-time adjustments and optimization of industrial operations, such as refining processes in the oil
and gas industry.
• Personalized Training Systems: RL enables the customization of instructional content based on an individual’s learning patterns,
improving engagement and effectiveness.
Advantages of Reinforcement Learning

• Solving Complex Problems: RL is capable of solving highly complex problems that cannot be addressed by conventional techniques.
• Error Correction: The model continuously learns from its
environment and can correct errors that occur during the
training process.
• Direct Interaction with the Environment: RL agents learn from
real-time interactions with their environment, allowing
adaptive learning.
• Handling Non-Deterministic Environments: RL is effective in
environments where outcomes are uncertain or change over
time, making it highly useful for real-world applications.
Disadvantages of Reinforcement Learning

• Not Suitable for Simple Problems: RL is often an overkill for straightforward tasks where simpler algorithms would be more efficient.
• High Computational Requirements: Training RL models requires
a significant amount of data and computational power, making it
resource-intensive.
• Dependency on Reward Function: The effectiveness of RL
depends heavily on the design of the reward function. Poorly
designed rewards can lead to suboptimal or undesired behaviors.
• Difficulty in Debugging and Interpretation: Understanding why an RL agent makes certain decisions can be challenging, making debugging and troubleshooting complex.
Reinforcement Learning:
• Reinforcement Learning is a type of Machine Learning. It allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize their performance. Simple reward feedback is required for the agent to learn its behavior; this is known as the reinforcement signal.
• There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement Learning is defined by a specific type of problem, and all its solutions are classed as Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best action to select based on its current state. When this step is repeated, the problem is known as a Markov Decision Process.
• A Markov Decision Process (MDP) model contains:

• A set of possible world states S.
• A set of Models.
• A set of possible actions A.
• A real-valued reward function R(s,a).
• A policy, which is a solution to the Markov Decision Process.
What is a State?
• A State is a set of tokens that represent every state that the agent can be in.
• What is a Model?
• A Model (sometimes called Transition Model) gives an action’s effect in a state. In particular, T(S, a,
S’) defines a transition T where being in state S and taking an action ‘a’ takes us to state S’ (S and S’
may be the same). For stochastic actions (noisy, non-deterministic) we also define a probability P(S’|
S,a) which represents the probability of reaching a state S’ if action ‘a’ is taken in state S. Note
Markov property states that the effects of an action taken in a state depend only on that state and
not on the prior history.
• What are Actions?
• Action A is a set of all possible actions. A(s) defines the set of actions that can be taken being in state
S.
• What is a Reward?
• A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the state S.
R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’) indicates the
reward for being in a state S, taking an action ‘a’ and ending up in a state S’.
• What is a Policy?
• A Policy is a solution to the Markov Decision Process. A policy is a mapping from S to A. It indicates the action ‘a’ to be taken while in state S.
Let us take the example of a grid world:
• An agent lives in the grid. The above example is a 3*4 grid. The grid has a START state (grid no 1,1). The purpose of the agent is to wander around the grid to finally reach the Blue Diamond (grid no 4,3). Under all circumstances, the agent should avoid the Fire grid (orange color, grid no 4,2). Also, grid no 2,2 is a blocked grid; it acts as a wall, hence the agent cannot enter it.
• The agent can take any one of these actions: UP, DOWN, LEFT, RIGHT
• Walls block the agent’s path, i.e., if there is a wall in the direction the agent would have taken, the agent stays in the same place. So for example, if the agent says LEFT in the START grid it would stay put in the START grid.
• First Aim: To find the shortest sequence getting from START to the Diamond. Two such sequences can be found:
• RIGHT RIGHT UP UP RIGHT
• UP UP RIGHT RIGHT RIGHT
• Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy: 80% of the time the intended action works correctly; 20% of the time, the action the agent takes causes it to move at right angles to the intended direction. For example, if the agent says UP, the probability of going UP is 0.8, whereas the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP).
• The agent receives rewards for each time step:
• A small reward for each step (which can be negative, in which case it acts as a punishment; in the above example, entering the Fire grid can carry a reward of -1).
• Big rewards come at the end (good or bad).
• The goal is to maximize the sum of rewards.
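As a hedged illustration of the noisy grid world described above, the Python sketch below encodes the 0.8/0.1/0.1 transition model; the coordinate convention and helper names are assumptions, not part of the notes.

# Hedged sketch of the noisy transition model described above: the intended
# move succeeds with probability 0.8, and the two perpendicular moves happen
# with probability 0.1 each. Grid coordinates and helper names are assumed.

ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERPENDICULAR = {
    "UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
    "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN"),
}
BLOCKED = {(2, 2)}                      # the wall cell
COLS, ROWS = 4, 3                       # 3x4 grid, cells (1,1)..(4,3)

def move(state, action):
    """Apply one deterministic move, staying put if a wall or edge blocks it."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in BLOCKED or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return state
    return nxt

def transition_probabilities(state, action):
    """Return {next_state: probability} for the noisy action."""
    probs = {}
    for a, p in [(action, 0.8),
                 (PERPENDICULAR[action][0], 0.1),
                 (PERPENDICULAR[action][1], 0.1)]:
        nxt = move(state, a)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

# Example: the noisy outcome of saying UP from the START cell (1, 1).
print(transition_probabilities((1, 1), "UP"))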
Q-Learning in Reinforcement Learning

• Q-learning is a model-free reinforcement learning algorithm used to train agents (computer programs) to
make optimal decisions by interacting with an
environment. It helps the agent explore different actions
and learn which ones lead to better outcomes. The agent
uses trial and error to determine which actions result in
rewards (good outcomes) or penalties (bad outcomes).
• Over time, it improves its decision-making by updating
a Q-table, which stores Q-values representing the
expected rewards for taking particular actions in given
states.
Key Components of Q-learning

• 1. Q-Values or Action-Values
• Q-values represent the expected rewards for taking an
action in a specific state. These values are updated over
time using the Temporal Difference (TD) update rule.
• 2. Rewards and Episodes
• The agent moves through different states by taking actions
and receiving rewards. The process continues until the
agent reaches a terminal state, which ends the episode.
• 3. Temporal Difference or TD-Update
• The agent updates Q-values using the formula:

Q(S, A) ← Q(S, A) + α [ R + γ max_A′ Q(S′, A′) − Q(S, A) ]

Where,

• S is the current state.
• A is the action taken by the agent.
• S’ is the next state the agent moves to.
• A’ is the best next action in state S’.
• R is the reward received for taking action A in state S.
• γ (Gamma) is the discount factor, which balances
immediate rewards with future rewards.
• α (Alpha) is the learning rate, determining how much
new information affects the old Q-values.
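A minimal Python sketch of this TD update, assuming a dictionary-based Q-table; the function and argument names are illustrative.

# A minimal sketch of the TD update rule defined above, using a dictionary
# Q-table. The function and variable names are illustrative, not from the notes.

def td_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Q(S, A) <- Q(S, A) + alpha * (R + gamma * max_A' Q(S', A') - Q(S, A))."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - Q.get((state, action), 0.0)
    Q[(state, action)] = Q.get((state, action), 0.0) + alpha * td_error
    return Q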
4. ϵ-greedy Policy (Exploration vs. Exploitation)

• The ϵ-greedy policy helps the agent decide which action to take based on the current Q-value estimates:
• Exploitation: The agent picks the action with the highest Q-value with probability 1 − ϵ. This means the agent uses its current knowledge to maximize rewards.
• Exploration: With probability ϵ, the agent picks a
random action, exploring new possibilities to learn if
there are better ways to get rewards. This allows the
agent to discover new strategies and improve its
decision-making over time.
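A small Python sketch of the ϵ-greedy selection described above, assuming the same dictionary-based Q-table; the names are illustrative.

# A small sketch of the epsilon-greedy action selection described above.
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore; otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)                               # exploration
    return max(actions, key=lambda a: Q.get((state, a), 0.0))       # exploitation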
How Does Q-Learning Work?

• Q-learning models follow an iterative process, where different components work together to train the agent:
• Agent: The entity that makes decisions and takes actions within
the environment.
• States: The variables that define the agent’s current position in the
environment.
• Actions: The operations the agent performs when in a specific
state.
• Rewards: The feedback the agent receives after taking an action.
• Episodes: A sequence of actions that ends when the agent reaches
a terminal state.
• Q-values: The estimated rewards for each state-action pair.
Steps of Q-learning:

• Initialization: The agent starts with an initial Q-table, where Q-values are typically initialized to zero.
• Exploration: The agent chooses an action based on
the ϵ-greedy policy (either exploring or exploiting).
• Action and Update: The agent takes the action,
observes the next state, and receives a reward. The
Q-value for the state-action pair is updated using the
TD update rule.
• Iteration: The process repeats for multiple episodes
until the agent learns the optimal policy.
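Putting the four steps together, here is a hedged end-to-end sketch on Gym's discrete FrozenLake-v1 environment. It assumes a Gym version whose reset() returns (state, info) and whose step() returns five values; the hyperparameters are illustrative, not prescribed by the notes.

# A hedged end-to-end sketch of the four steps above, on Gym's discrete
# FrozenLake-v1 environment. It assumes a Gym version whose reset() returns
# (state, info) and whose step() returns 5 values; hyperparameters are illustrative.

import random
import numpy as np
import gym

env = gym.make("FrozenLake-v1")
n_states, n_actions = env.observation_space.n, env.action_space.n

Q = np.zeros((n_states, n_actions))       # Initialization: Q-table of zeros
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Exploration vs. exploitation (epsilon-greedy)
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))

        # Action and update (TD rule)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print("Learned Q-table:\n", Q)
env.close()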
Methods for Determining Q-values

• 1. Temporal Difference (TD):


• Temporal Difference is calculated by comparing the
current state and action values with the previous ones.
It provides a way to learn directly from experience,
without needing a model of the environment.
• 2. Bellman’s Equation:
• Bellman’s Equation is a recursive formula used to
calculate the value of a given state and determine the
optimal action. It is fundamental in the context of Q-
learning and is expressed as:

Q(s, a) = R(s, a) + γ max_a′ Q(s′, a′)

Where:

• Q(s, a) is the Q-value for a given state-action pair.
• R(s, a) is the immediate reward for taking
action a in state s.
• γ is the discount factor, representing the
importance of future rewards.
• max_a′ Q(s′, a′) is the maximum Q-value over all possible actions a′ in the next state s′.
What is a Q-table?

• The Q-table is essentially a memory structure where the agent stores information about which actions yield the best rewards in each state. It is
a table of Q-values representing the agent’s understanding of the
environment. As the agent explores and learns from its interactions with
the environment, it updates the Q-table. The Q-table helps the agent
make informed decisions by showing which actions are likely to lead to
better rewards.
• Structure of a Q-table:
• Rows represent the states.
• Columns represent the possible actions.
• Each entry in the table corresponds to the Q-value for a state-action pair.
• Over time, as the agent learns and refines its Q-values through exploration
and exploitation, the Q-table evolves to reflect the best actions for each
state, leading to optimal decision-making.
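A tiny NumPy sketch of this structure, where rows are states and columns are actions; the sizes are arbitrary.

# A tiny sketch of the Q-table structure described above: rows are states,
# columns are actions, entries are Q-values. Sizes here are arbitrary.

import numpy as np

n_states, n_actions = 5, 4                 # e.g. 5 states, 4 actions
q_table = np.zeros((n_states, n_actions))  # start with all Q-values at zero

# After some learning, an entry q_table[s, a] holds the expected reward
# for taking action a in state s; the best action for state s is:
s = 2
best_action = int(np.argmax(q_table[s]))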
Advantages of Q-learning

• Trial and Error Learning: Q-learning improves over time by trying different actions and learning from experience.
• Self-Improvement: Mistakes lead to learning,
helping the agent avoid repeating them.
• Better Decision-Making: Stores successful
actions to avoid bad choices in future situations.
• Autonomous Learning: It learns without
external supervision, purely through exploration.
Disadvantages of Q-learning

• Slow Learning: Requires many examples, making it time-consuming for complex problems.
• Expensive in Some Environments: In robotics, testing
actions can be costly due to physical limitations.
• Curse of Dimensionality: Large state and action spaces
make the Q-table too large to handle efficiently.
• Limited to Discrete Actions: It struggles with continuous
actions like adjusting speed, making it less suitable for
real-world applications involving continuous decisions.
Applications of Q-learning

• Applications for Q-learning, a reinforcement learning algorithm, can be found in many different
fields. Here are a few noteworthy instances:
• Atari Games: Classic Atari 2600 games can now be played with Q-learning. In games like Space
Invaders and Breakout, Deep Q Networks (DQN), an extension of Q-learning that makes use of
deep neural networks, has demonstrated superhuman performance.
• Robot Control: Q-learning is used in robotics to perform tasks like navigation and robot control.
With Q-learning algorithms, robots can learn to navigate through environments, avoid obstacles,
and maximise their movements.
• Traffic Management: Autonomous vehicle traffic management systems use Q-learning. It lessens
congestion and enhances traffic flow overall by optimising route planning and traffic signal timings.
• Algorithmic Trading: The use of Q-learning to make trading decisions has been investigated in
algorithmic trading. It makes it possible for automated agents to pick up the best strategies from
past market data and adjust to shifting market conditions.
• Personalized Treatment Plans: To make treatment plans more unique, Q-learning is used in the
medical field. Through the use of patient data, agents are able to recommend personalized
interventions that account for individual responses to various treatments.
Deep Q-Learning in Reinforcement Learning

• Deep Q-Learning integrates deep neural networks into the decision-making process. This combination allows agents to
handle high-dimensional state spaces, making it possible to
solve complex tasks such as playing video games or
controlling robots.
• Before diving into Deep Q-Learning, it’s important to
understand the foundational concept of Q-Learning. Q-
Learning is a model-free method that learns an optimal
policy by estimating the Q-value function, which represents
the expected cumulative reward for taking a specific action
in a given state and following the optimal policy thereafter.
• While Q-Learning works well for small state-action spaces, it
struggles with scalability when dealing with high-dimensional
environments like images or continuous states. This limitation led to
the development of Deep Q-Learning, which leverages deep neural
networks to approximate the Q-value function.
• Role of Deep Learning in Q-Learning
• To address the limitations of traditional Q-Learning, researchers
introduced Deep Q-Networks (DQNs), which combine Q-Learning
with deep neural networks. Instead of maintaining a table of Q-
values for each state-action pair, DQNs approximate the Q-value
function using a neural network parameterized by weights θ. The
network takes a state as input and outputs Q-values for all possible
actions.
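As a hedged sketch (PyTorch is my choice here, not specified in the notes), a Q-network of this kind can be written as a small fully connected model that maps a state vector to one Q-value per action; the layer sizes are assumptions.

# A hedged PyTorch sketch of a Q-network as described above: it takes a state
# vector as input and outputs one Q-value per action. Layer sizes are assumptions.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a CartPole-sized network (4 state variables, 2 actions).
q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.zeros(1, 4))         # shape: (1, 2)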
Applications of Deep Q-Learning

• Deep Q-Learning has been successfully applied to a wide range of domains, including:
• Atari Games: In 2013, DeepMind demonstrated that DQNs could achieve
superhuman performance on classic Atari games by learning directly from
raw pixel inputs.
• Robotics: DQNs have been used to train robots for tasks such as grasping
objects, navigating environments, and performing manipulation tasks.
• Autonomous Driving: Reinforcement learning with DQNs can optimize
decision-making for self-driving cars, such as lane-changing and obstacle
avoidance.
• Finance: DQNs are applied to portfolio optimization, algorithmic trading,
and risk management by learning optimal trading strategies.
• Healthcare: In medical applications, DQNs assist in treatment planning,
drug discovery, and personalized medicine.
Training Process of Deep Q-Learning

• The training process of a DQN involves the following steps:
• Initialization:
– Initialize the replay buffer, main network (θ), and target network (θ⁻).
– Set hyperparameters such as learning rate (α), discount factor (γ), and exploration rate (ϵ).
• Exploration vs. Exploitation: Use an ϵ-greedy policy to balance exploration and exploitation:
– With probability ϵ, select a random action to explore.
– Otherwise, choose the action with the highest Q-value according to the current network.
• Experience Collection: Interact with the environment, collect experiences (s, a, r, s′),
and store them in the replay buffer.
• Training Updates:
– Sample a mini-batch of experiences from the replay buffer.
– Compute the target Q-values using the target network.
– Update the main network by minimizing the loss function using gradient descent.
• Target Network Update: Periodically copy the weights of the main network to the target
network to ensure stability.
• Decay Exploration Rate: Gradually decrease ϵ over time to shift from exploration to
exploitation.
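A hedged PyTorch sketch of one training update following these steps. It assumes the QNetwork sketched earlier, a replay_buffer holding (state, action, reward, next_state, done) tuples, and illustrative hyperparameters; it is one possible implementation, not the only one.

# A hedged sketch of one DQN training update following the steps above. It
# assumes the QNetwork sketched earlier, a `replay_buffer` list of
# (state, action, reward, next_state, done) tuples, and illustrative hyperparameters.

import random
import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(main_net, target_net, optimizer, replay_buffer,
               batch_size=64, gamma=0.99):
    if len(replay_buffer) < batch_size:
        return  # not enough experience collected yet

    # Sample a mini-batch of experiences from the replay buffer
    batch = random.sample(replay_buffer, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)
    states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q-values predicted by the main network for the actions actually taken
    q_pred = main_net(states).gather(1, actions).squeeze(1)

    # Target Q-values computed with the frozen target network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next

    # Update the main network by minimizing the loss with gradient descent
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically: target_net.load_state_dict(main_net.state_dict())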
Key Challenges Addressed by Deep Q-Learning
• High-Dimensional State Spaces: Traditional Q-Learning requires storing a Q-table, which becomes infeasible for large state spaces. Neural networks can generalize across states, making them suitable for complex environments.
• Continuous Input Data: Many real-world problems involve continuous inputs, such as pixel data from video frames. Neural networks excel at processing such data.
• Scalability: By leveraging the representational power of deep learning, DQNs can scale to solve tasks that were previously unsolvable with tabular methods.
Genetic Algorithms(GAs)
• Genetic Algorithms (GAs) are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. Genetic algorithms are based on the ideas of natural selection and genetics. They are an intelligent exploitation of random search, provided with historical data to direct the search into regions of better performance in the solution space. They are commonly used to generate high-quality solutions for optimization problems and search problems.
• Genetic algorithms simulate the process of natural selection, which means those species that can adapt to changes in their environment can survive, reproduce, and go on to the next generation. In simple words, they simulate “survival of the fittest” among individuals of consecutive generations to solve a problem. Each generation consists of a population of individuals, and each individual represents a point in the search space and a possible solution. Each individual is represented as a string of characters/integers/floats/bits. This string is analogous to a chromosome.
Foundation of Genetic Algorithms

• Genetic algorithms are based on an analogy with the genetic structure and behavior of chromosomes of the population. Following is the foundation of GAs based on this analogy –
• Individuals in the population compete for resources and mate.
• Those individuals who are successful (fittest) then mate to create more offspring than others.
• Genes from the “fittest” parents propagate throughout the generation; that is, sometimes parents create offspring which are better than either parent.
• Thus each successive generation becomes more suited to its environment.
Search space
• The population of individuals is maintained within the search space. Each individual represents a solution in the search space for the given problem. Each individual is coded as a finite-length vector (analogous to a chromosome) of components.
• These variable components are analogous to genes. Thus a chromosome (individual) is composed of several genes (variable components).
Fitness Score

• A Fitness Score is given to each individual, which shows the ability of an individual to “compete”. Individuals having optimal (or near-optimal) fitness scores are sought.
• The GA maintains a population of n individuals (chromosomes/solutions) along with their fitness scores. The individuals having better fitness scores are given more chance to reproduce than others. These selected individuals mate and produce better offspring by combining the chromosomes of the parents. Because the population size is static, room has to be created for the new arrivals: some individuals die and are replaced by the new arrivals, eventually creating a new generation once all the mating opportunities of the old population are exhausted. It is hoped that over successive generations better solutions will arrive while the least fit die out.
• Each new generation has, on average, more “good genes” than the individuals (solutions) of previous generations, and thus better “partial solutions”. Once the offspring produced show no significant difference from the offspring produced by previous populations, the population has converged, and the algorithm is said to have converged to a set of solutions for the problem.
Operators of Genetic Algorithms

• Once the initial generation is created, the algorithm evolves the generation using the following operators –
1) Selection Operator: The idea is to give preference to
the individuals with good fitness scores and allow them
to pass their genes to successive generations.
2) Crossover Operator: This represents mating between
individuals. Two individuals are selected using selection
operator and crossover sites are chosen randomly. Then
the genes at these crossover sites are exchanged thus
creating a completely new individual (offspring). For
example –
• 3) Mutation Operator: The key idea is to
insert random genes in offspring to maintain
the diversity in the population to avoid
premature convergence. For example –
GA Cycle of Reproduction
• In machine learning, a cycle of reproduction, often used in the context of genetic algorithms, involves
selecting individuals, generating new offspring through techniques like crossover and mutation, and
then repeating this process to evolve a population towards a better solution. This process is a key
component of evolutionary algorithms in machine learning, allowing models to adapt and improve
over time.
• Here's a more detailed explanation:
• 1. Selection: Individuals in a population are selected based on their fitness, with fitter individuals
having a higher probability of being chosen for reproduction.
• 2. Reproduction (Offspring Generation):
• Crossover:
• This involves combining the genetic material (e.g., parameters of a model) of two selected parents to
create a new offspring.
• Mutation:
• Random changes are introduced to the offspring's genetic material to prevent stagnation and explore
new solutions.
• 3. Iteration: This selection and reproduction process is repeated iteratively, generating new generations
of individuals, and the best solutions are typically preserved and refined as the algorithm progresses.
• 4. Termination: The process continues until a stopping criterion is met, such as reaching a desired level
of fitness or a predefined number of iterations.
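A minimal Python sketch of this cycle on bit-string chromosomes, using a toy fitness function (the count of 1-bits); the population size, tournament selection, and mutation rate are all illustrative assumptions.

# A hedged sketch of the reproduction cycle above on bit-string chromosomes,
# with a toy fitness function (count of 1-bits). All parameters are illustrative.

import random

CHROMOSOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 50

def fitness(chromosome):
    return sum(chromosome)                     # toy objective: maximize 1-bits

def select(population):
    """Selection: fitter individuals are more likely to be chosen as parents."""
    return max(random.sample(population, 3), key=fitness)   # tournament of 3

def crossover(p1, p2):
    """Single-point crossover: swap tails of the two parents."""
    point = random.randint(1, CHROMOSOME_LEN - 1)
    return p1[:point] + p2[point:]

def mutate(chromosome, rate=0.01):
    """Mutation: flip each bit with a small probability."""
    return [1 - g if random.random() < rate else g for g in chromosome]

# Initial population of random bit strings
population = [[random.randint(0, 1) for _ in range(CHROMOSOME_LEN)]
              for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):                   # iterate until termination
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

print("Best individual:", max(population, key=fitness))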
Crossover

• Crossover is a genetic operator used to vary the programming of a chromosome or chromosomes from one generation to the next. Crossover is the analogue of sexual reproduction.
• Two strings are picked from the mating pool at
random to crossover in order to produce superior
offspring. The method chosen depends on the
Encoding Method.
Different types of crossover :
• Single Point Crossover: A crossover point on the parent organism string is
selected. All data beyond that point in the organism string is swapped
between the two parent organisms. Strings are characterized by Positional
Bias.

• Two-Point Crossover: This is a specific case of an N-point crossover technique. Two random points are chosen on the individual chromosomes (strings) and the genetic material is exchanged at these points.
• Uniform Crossover: Each gene (bit) is selected randomly from one of the corresponding genes of the parent chromosomes. Tossing a coin for each gene is an example technique.
• The crossover between two good solutions may not always yield a better or as good a
solution. Since parents are good, the probability of the child being good is high. If offspring is
not good (poor solution), it will be removed in the next iteration during “Selection”.
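A hedged Python sketch of the three crossover types above on equal-length bit strings; the function names are illustrative.

# A hedged sketch of the three crossover types described above, operating on
# equal-length bit strings (lists). Function names are illustrative.

import random

def single_point_crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point_crossover(p1, p2):
    a, b = sorted(random.sample(range(1, len(p1)), 2))
    return (p1[:a] + p2[a:b] + p1[b:],
            p2[:a] + p1[a:b] + p2[b:])

def uniform_crossover(p1, p2):
    """Each gene is taken from either parent with equal probability (coin toss)."""
    child1, child2 = [], []
    for g1, g2 in zip(p1, p2):
        if random.random() < 0.5:
            child1.append(g1); child2.append(g2)
        else:
            child1.append(g2); child2.append(g1)
    return child1, child2

# Example usage on two 8-bit parents
print(uniform_crossover([0] * 8, [1] * 8))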

• Problems with Crossover:
Depending on coding, simple crossovers can have a high chance to produce illegal offspring.
E.g. in TSP with simple binary or path coding, most offspring will be illegal because not all
cities will be in the offspring and some cities will be there more than once.
• Uniform crossover can often be modified to avoid this problem.
E.g. in TSP with simple path coding:
Where the mask is 1, copy cities from one parent.
Where the mask is 0, choose the remaining cities in the order of the other parent.
