
Artificial intelligence

Lab - 3

May 1st 2024

1 Introduction to OpenAI Gym


OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It provides a di-
verse suite of environments ranging from simple games to complex physics simulations, allowing researchers
and developers to test and benchmark their algorithms in various scenarios.

1.1 Basic Functionality


The core functionality of OpenAI Gym revolves around the concept of environments, which are defined
tasks or problems that an agent interacts with. Each environment in Gym is represented by a Python
class with methods to interact with it.

1.2 Methods
• reset(): This method resets the environment to its initial state and returns an initial observation.
It is typically called at the beginning of an episode or whenever the agent needs to start a new
interaction with the environment. The return value is the initial observation that the agent uses to
start its decision-making process.

Example:
# Python code
import gym

# Create an environment
env = gym.make('CartPole-v1')

# Reset the environment
initial_observation = env.reset()

Parameters: None.

Returns:

– observation: The initial observation representing the state of the environment.


• step(): This method takes an action as input and performs one timestep of the environment’s dy-
namics. It returns four values: the next observation, the reward obtained from taking the action, a
boolean indicating whether the episode has terminated, and additional information useful for debug-
ging or analysis.

Example:
# Take an action in the environment
action = 0  # Example action
observation, reward, done, info = env.step(action)

Parameters:

– action: The action to take in the environment

Returns:

– observation: The next observation after taking the action.


– reward: The reward obtained from taking the action.
– done: A boolean indicating whether the episode has terminated.
– info: Additional information for debugging or analysis.

• render(): The render() method renders the current state of the environment for visualization. Ren-
dering may be disabled or implemented differently depending on the environment.

Example:
env.render()

• close(): The close() method frees up the resources used by the environment. It is good practice to call
close() when you are finished using an environment.

Example:
env.close()

• action_space: Attribute representing the action space of the environment.

Example:
action_space = env.action_space

Attributes:

– action_space.n: Number of discrete actions for discrete action spaces.
– action_space.shape: Shape of the action space for continuous action spaces.

• observation_space: Attribute representing the observation space of the environment.

Example:
observation_space = env.observation_space

Attributes:

– observation_space.shape: Shape of the observation space.
– observation_space.high: Highest value for each dimension of the observation space, for continuous spaces.
– observation_space.low: Lowest value for each dimension of the observation space, for continuous spaces.
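
For illustration, the following sketch prints these attributes for a discrete environment (FrozenLake-v1) and for an environment with a continuous observation space (CartPole-v1). It is a minimal sketch assuming the standard Gym versions of these two environments.

# Sketch: inspecting action and observation spaces.
import gym

discrete_env = gym.make('FrozenLake-v1')
print(discrete_env.action_space.n)             # number of discrete actions (4)
print(discrete_env.observation_space.n)        # number of discrete states (16 on the 4x4 map)

continuous_env = gym.make('CartPole-v1')
print(continuous_env.action_space.n)           # CartPole also has discrete actions (2)
print(continuous_env.observation_space.shape)  # (4,) — continuous observation vector
print(continuous_env.observation_space.high)   # per-dimension upper bounds
print(continuous_env.observation_space.low)    # per-dimension lower bounds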

These methods are fundamental for interacting with OpenAI Gym environments. By repeatedly calling
reset() and step(), an agent can interact with the environment, observe its state, take actions, and receive
feedback in the form of rewards. This interaction forms the basis for training reinforcement learning agents
using Gym environments.
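
Putting these pieces together, the sketch below shows one possible interaction loop with a random agent. It assumes the classic Gym API used in the examples above, where reset() returns an observation and step() returns four values; newer Gym/Gymnasium releases return one extra value from each call.

# Sketch: a complete interaction loop with a random agent (classic 4-tuple API).
import gym

env = gym.make('CartPole-v1')
observation = env.reset()

for t in range(200):
    action = env.action_space.sample()              # random action for illustration
    observation, reward, done, info = env.step(action)
    if done:                                        # episode ended, start a new one
        observation = env.reset()

env.close()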

2 Exploring Markov Decision Processes (MDP)


2.1 FrozenLake-v1
FrozenLake-v1 is a grid-world environment in OpenAI Gym where an agent navigates icy terrain to reach
a goal while avoiding holes. The grid comprises safe (frozen) tiles and hazardous (hole) tiles. The agent
can move in four directions: up, down, left, and right. Upon reaching the goal, the agent receives a reward
of +1; falling into a hole yields a reward of 0. The environment introduces stochasticity, simulating the
slippery nature of the frozen lake, where actions may not always result in the intended movement. Through
reinforcement learning, agents aim to learn optimal policies to traverse the grid safely and reach the goal.

Frozen lake involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H)
by walking over the Frozen(F) lake. The agent may not always move in the intended direction due to the
slippery nature of the frozen lake.
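
As a quick illustration (a sketch assuming the classic Gym API used in Section 1), the environment can be created, rendered, and stepped as follows; because of the slippery dynamics, a single "move right" may not land on the tile to the right of the start.

# Sketch: creating FrozenLake-v1, rendering the grid, and taking one step.
import gym

env = gym.make('FrozenLake-v1', is_slippery=True)
state = env.reset()
env.render()                                   # prints the S/F/H/G layout

# Action encoding: 0 = left, 1 = down, 2 = right, 3 = up.
next_state, reward, done, info = env.step(2)   # attempt to move right
print(next_state, reward, done)                # may differ from the intended tile

env.close()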

• States: The environment consists of a grid of tiles, each representing a state.

– Frozen tiles (F): Safe tiles where the agent can move.
– Hole tiles (H): Tiles where the agent falls into a hole and fails.
– Start state (S): The initial state where the agent begins.
– Goal state (G): The state the agent aims to reach to succeed.

• Actions: At each state, the agent can take one of four actions. These actions determine the direction
in which the agent moves on the grid.

– Move Up

– Move Down
– Move Left
– Move Right

• Rewards: The primary objective of the agent is to maximize the cumulative reward by reaching the
goal state while avoiding falling into holes. The rewards in FrozenLake-v1 are structured as follows:

– When the agent reaches the goal state (G), it receives a reward of +1.
– When the agent falls into a hole (H), it receives a reward of 0 (failure).
– All other transitions yield a reward of 0.

• Transitions: In the non-slippery variant, transitions in FrozenLake-v1 are deterministic: if the agent
takes an action, it moves to the corresponding adjacent tile unless it hits a wall, in which case it stays
in the same tile. For example, if the agent is at position (i, j) and takes the action ”Move Right,”
it transitions to position (i, j+1) if possible. The default environment, however, is ”slippery”: the
agent moves in the intended direction with probability 1/3 and slips to each of the two perpendicular
directions with probability 1/3, so the executed move often differs from the intended one. (The sketch
after this list shows how to inspect these transition probabilities directly.)
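
The sketch below prints the transition distribution for one state-action pair, which makes the 1/3 split visible. It assumes the tabular model exposed by the toy-text environments through env.unwrapped.P.

# Sketch: inspecting FrozenLake's transition model.
# P[state][action] is a list of (probability, next_state, reward, done) tuples.
import gym

env = gym.make('FrozenLake-v1', is_slippery=True)
P = env.unwrapped.P

state, action = 0, 2                           # start state, action 2 = move right
for prob, next_state, reward, done in P[state][action]:
    print(f"p={prob:.2f} -> state {next_state}, reward {reward}, done={done}")

env.close()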

2.2 Markov Decision Process (MDP)


A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in sit-
uations where outcomes are influenced by both stochastic (random) elements and the actions taken by
a decision-maker. It’s widely used in reinforcement learning and operations research to solve problems
involving sequential decision-making under uncertainty.

2.2.1 The Markov Property


The Markov property is a fundamental concept in reinforcement learning and stochastic processes. It
states that the future state of a system, given the current state and action, depends only on that current
state and action and is independent of all past states and actions.

In the context of Markov Decision Processes (MDPs), this property implies that the transition proba-
bilities and rewards associated with state-action pairs fully capture the dynamics of the environment.
Mathematically, it can be expressed as:

P(s_{t+1}, r_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = P(s_{t+1}, r_{t+1} | s_t, a_t)

Where:

• s_t represents the current state.

• a_t represents the action taken at time step t.

• s_{t+1} represents the next state.

• r_{t+1} represents the reward obtained at time step t+1.



• P(s_{t+1}, r_{t+1} | s_t, a_t) denotes the joint transition probability distribution over next states and rewards.

The Markov property simplifies the learning problem by allowing us to focus solely on the current state
and action, making planning and decision-making more tractable.

2.2.2 Relevance to MDP:


• Simplification: MDPs are used to model decision-making in environments where outcomes are
partly random. The Markov Property vastly simplifies the calculations involved because you don’t
need to track an endless history of states to make decisions.

• Foundation: The Markov Property is the underlying principle upon which MDPs are built. It
allows for the modeling of these decision processes in a tractable way.

2.2.3 FrozenLake-v1 Environment as an MDP


FrozenLake-v1 can be modeled as an MDP, where states represent grid tiles, actions determine movements,
transitions dictate movement outcomes (with stochasticity introduced by the slippery nature), rewards pro-
vide feedback on the agent’s performance, and the policy guides the agent’s decision-making process. By
understanding FrozenLake-v1 as an MDP, we can apply reinforcement learning algorithms to learn optimal
policies for navigating the environment and achieving the goal.

1. States (S):

• Each tile on the grid represents a state in the MDP.


• The environment consists of a grid of tiles, including frozen tiles (F), hole tiles (H), a start state
(S), and a goal state (G).
• The state encapsulates all relevant information about the agent’s current position on the grid.

2. Actions (A):

• At each state, the agent can take one of four actions: move up, move down, move left, or move
right.
• These actions represent the possible decisions or movements the agent can make in the environ-
ment.

3. Transitions (P):

• In the non-slippery variant, transitions in FrozenLake-v1 are deterministic, meaning that the outcome
of an action is known with certainty.
• If the agent takes an action, it transitions to the corresponding adjacent tile, unless it hits a
wall, in which case it stays in the same tile.
• With the default ”slippery” setting, transitions are stochastic: the agent moves in the intended
direction with probability 1/3 and slips to each of the two perpendicular directions with probability 1/3.

4. Rewards (R):

• The agent receives rewards based on its actions.


• Reaching the goal state yields a reward of +1.
• Falling into a hole results in a reward of 0 (failure).
• All other transitions yield a reward of 0.

5. Policy (π):

• A policy specifies the agent’s behavior, i.e., which action to take in each state.
• The goal of the agent is to learn an optimal policy that maximizes the expected cumulative
reward over time.

3 Value Iteration
Value iteration is an algorithm used to compute the optimal value function for a Markov Decision Process
(MDP). The optimal value function represents the expected cumulative reward that an agent can achieve
from each state, following the optimal policy.

3.1 Basics of Value Iteration


The algorithm iteratively updates the value function until it converges to the optimal values. Here’s how
the value iteration algorithm works:

• Initialization: Initialize the value function V(s) arbitrarily for all states s.

• Iteration: For each state s, update the value function using the Bellman optimality equation:

V(s) = max_a Σ_{s′,r} P(s′, r | s, a) [r + γ V(s′)]

where:

– a represents the actions available in state s.

– P(s′, r | s, a) is the probability of transitioning from state s to state s′ with reward r after
taking action a.
– γ is the discount factor, which determines the importance of future rewards relative to immediate
rewards.

• Termination: Repeat the iteration process until the change in the value function between consec-
utive iterations falls below a specified threshold, indicating convergence.

• Policy extraction: Once the value function has converged, extract the optimal policy by selecting
the action that maximizes the value function in each state:

π*(s) = argmax_a Σ_{s′,r} P(s′, r | s, a) [r + γ V*(s′)]

where π*(s) is the optimal policy and V*(s) is the optimal value function.
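
As a concrete illustration, here is a minimal value-iteration sketch for FrozenLake-v1. It is not the lab's reference implementation: it assumes the tabular model exposed through env.unwrapped.P, and the discount factor and convergence threshold are illustrative choices.

# Sketch: value iteration on FrozenLake-v1 using the environment's tabular model.
import gym
import numpy as np

env = gym.make('FrozenLake-v1')
P = env.unwrapped.P                      # P[s][a] -> list of (prob, next_state, reward, done)
n_states = env.observation_space.n
n_actions = env.action_space.n

gamma, theta = 0.99, 1e-8                # discount factor and convergence threshold (assumed)
V = np.zeros(n_states)

def q_value(s, a):
    # Expected return of taking action a in state s and then following V.
    return sum(p * (r + gamma * V[s2]) for p, s2, r, done in P[s][a])

while True:
    delta = 0.0
    for s in range(n_states):
        best = max(q_value(s, a) for a in range(n_actions))   # Bellman optimality backup
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:                    # change below threshold: converged
        break

# Policy extraction: choose the action with the highest expected value in each state.
policy = np.array([int(np.argmax([q_value(s, a) for a in range(n_actions)]))
                   for s in range(n_states)])
print(V)
print(policy)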

Value iteration converges to the optimal value function and policy for finite-state and finite-action
MDPs. It is a foundational algorithm in reinforcement learning and dynamic programming, providing
a principled approach to solving sequential decision-making problems under uncertainty.

Figure 1: Convergence of State Values

The graph in Figure 1 illustrates the convergence of state values in ”FrozenLake-v1” under value iteration.
Each line represents the estimated value of one state over the iterations, showing how the algorithm refines
its state evaluations and stabilizes towards the optimal values used for decision-making in a stochastic,
slippery environment.

The policy derived from Value Iteration generally performs well in the ”FrozenLake-v1” environment.
By optimizing the expected reward from each state, the policy strategically navigates the lake’s tiles, aim-
ing to reach the goal while minimizing the risk of falling into holes. The success rate of the policy, as
tested in the environment, reflects its effectiveness, accounting for the inherent randomness and challenges
posed by the slippery nature of the lake. The iterative refinement of state values helps in making informed
decisions that improve the overall success rate over time.

3.2 Performance of the Derived Policy


The policy derived from Value Iteration typically performs very well in the ”FrozenLake-v1” environment.
Since it computes the maximum expected utility from each state, the derived policy is optimal with respect
to the defined reward structure and transition probabilities.

• Optimal Decision Making: The primary strength of Value Iteration lies in its ability to compute
the optimal policy by leveraging the maximum expected utility from each state. Given that the
utility values converge to their true values, the resulting policy ensures that the decisions made at
every state are optimal with respect to the environment’s dynamics and the reward structure. This
is critical in environments like ”FrozenLake-v1,” where every move can potentially lead to failure
(falling into a hole) or success (reaching the goal).

• Robustness to Environmental Stochasticity: ”FrozenLake-v1” is known for its stochastic na-


ture—actions taken by the agent do not always result in the intended outcomes due to the slippery
ice. The Value Iteration-derived policy robustly accounts for this randomness by calculating utilities
that incorporate the probability distributions of all possible state transitions. This allows the agent
to make decisions that optimize the expected outcome, factoring in the likelihood of slipping or mov-
ing as intended.

• Efficiency and Convergence: Value Iteration is computationally efficient in environments with a


manageable number of states and actions because it iteratively refines the utilities of states until the
changes fall below a pre-set threshold, signaling convergence. This efficiency is particularly advan-
tageous in deterministic or small stochastic environments where the state and action spaces are not
overly large.

• Evaluation and Adaptability: The effectiveness of the policy can be empirically evaluated through
simulations where the agent attempts to navigate from the start to the goal across numerous trials.
In ”FrozenLake-v1,” an optimal policy will consistently manage to reach the goal more often than
not, demonstrating a high success rate. Moreover, the adaptability of Value Iteration allows for
adjustments in the model’s parameters (like the discount factor γ) to experiment with short-term ver-
sus long-term gains, offering insights into the behavior of the policy under different strategic priorities.

• Limitations: Despite these strengths, the performance of the Value Iteration policy can sometimes be
limited by the granularity of the state space and the accuracy of the transition probabilities provided
by the environment’s model. Inaccuracies in the model or an overly simplified representation of the
state space can lead to suboptimal policies.

The effectiveness of the policy can be evaluated by simulating the environment with the derived policy and
measuring the success rate of reaching the goal without falling into any holes. This would typically show
high performance unless the environment setup or the reward structure inherently limits the achievable
success.
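
One simple way to run such an evaluation is sketched below; it assumes a policy array mapping each state index to an action (for example, the one produced by the value-iteration sketch in Section 3.1) and the classic 4-tuple step() API.

# Sketch: estimating a policy's success rate by rolling it out in the environment.
import gym
import numpy as np

def success_rate(env, policy, episodes=1000):
    successes = 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, info = env.step(int(policy[state]))
        successes += int(reward == 1.0)      # FrozenLake gives reward 1 only at the goal
    return successes / episodes

env = gym.make('FrozenLake-v1')
# policy = ...  (e.g. the array from the value-iteration sketch)
# print(success_rate(env, policy))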

4 Q-Learning
Q-Learning is a model-free reinforcement learning algorithm used to learn the value of an action in a
particular state. It doesn’t require a model of the environment and can handle problems with stochastic
transitions and rewards without needing adaptations.

4.1 Basics of Q-Learning


Q-Learning works by learning an action-value function that ultimately gives the expected utility of taking
a given action in a given state and following a fixed policy thereafter. Here’s a step-by-step explanation of
how Q-Learning works:

• Initialize the Q-values (Q-Table): Start with a table of Q-values for each state-action pair,
initialized to zero or some small random numbers.

• Policy Execution: At each state, select an action using a policy derived from the Q-values (commonly
an ε-greedy policy, where ε is the probability of choosing a random action and 1−ε is the probability
of choosing the action with the highest Q-value).

• Execution and Update:

– Execute the chosen action, and observe the reward and the next state.
– Update the Q-value for the state-action pair based on the reward received and the maximum
Q-value of the next state (Bellman equation):

Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]

Where:
∗ Q(s,a) is the current Q-value of state s and action a.
∗ α is the learning rate (0 < α ≤ 1).
∗ r is the reward received after executing action a in state s.
∗ γ is the discount factor (0 ≤ γ < 1); it represents the difference in importance between
future rewards and immediate rewards.
∗ max_{a′} Q(s′, a′) is the estimate of the optimal future value.

• Repeat: Repeat this process for each episode or until the Q-values converge.
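
The steps above translate into a short tabular implementation. The sketch below is illustrative rather than the lab's reference code: it uses the classic 4-tuple step() API, fixes ε at 0.1 (an assumed value), and sets alpha and gamma to the values quoted in the next paragraph.

# Sketch: tabular Q-learning on FrozenLake-v1.
import gym
import numpy as np

env = gym.make('FrozenLake-v1')
n_states = env.observation_space.n
n_actions = env.action_space.n

alpha, gamma, epsilon = 0.5, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, info = env.step(action)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

env.close()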

The plot in Figure 2 displays the success rate of a Q-learning agent on FrozenLake-v1 over 1000
episodes of training, using a learning rate (alpha) of 0.5 and a discount factor (gamma) of 0.99.

• Interpretation:

– The high alpha (0.5) implies that the algorithm puts significant weight on recent information,
which might contribute to the observed volatility as recent states and rewards can drastically
influence the policy updates.
– The high gamma (0.99) suggests that future rewards are nearly as important as immediate
rewards, which encourages the agent to think long-term but may also result in instability in
success rate if the environment has diverse or conflicting immediate and future rewards.

Figure 2: Q-Learning

4.2 Evaluation

Here, Value Iteration achieved a higher success rate than Q-Learning.

4.2.1 Value Iteration: Higher Success Rate


• Optimality and Efficiency:

– Value Iteration achieves a higher success rate, indicating it has successfully found a policy
closer to the optimal solution for the environment. This method systematically evaluates the
best action from each state based on known dynamics (transition probabilities and rewards),
which typically allows it to converge to the optimal policy.

– The higher success rate of 76% suggests that Value Iteration efficiently utilized the model of the
environment to compute the maximum expected utility for each state systematically, leading to
more consistently successful decision-making.

• Model Dependency: The effectiveness of Value Iteration is contingent upon having accurate and
complete knowledge of the environment’s dynamics. This reliance on a model makes it highly effective
in environments where such a model can be accurately defined and where the state and action spaces
are sufficiently manageable.

4.2.2 Q-Learning: Lower Success Rate


• Learning from Interactions:

– Q-learning’s success rate of 56% indicates that, while it has learned to navigate the environment
to a significant extent, it has not reached the level of performance of the Value Iteration policy.
This discrepancy can arise from Q-learning’s nature of learning solely from interactions with
the environment, without prior knowledge of its dynamics.
– This model-free approach means it discovers effective strategies through trial and error, which
can be less efficient than the model-based strategies employed by Value Iteration.

• Exploration and Exploitation Challenges:

– Q-learning involves a balance between exploring new actions and exploiting known rewarding
actions. If not managed correctly (e.g., setting of the exploration rate, epsilon in epsilon-greedy
strategy), Q-learning might not explore the environment sufficiently or might stick too quickly
to suboptimal policies.
– The convergence to an optimal policy in Q-learning is also highly dependent on parameters like
the learning rate (alpha) and the discount factor (gamma), along with the number of episodes
of learning allowed. Inadequate tuning or insufficient learning episodes can lead to poorer
performance.

4.2.3 Comparison
• Robustness vs. Flexibility: Value Iteration’s robustness is evident in environments with well-
understood dynamics. In contrast, Q-learning offers flexibility in unknown or complex environments
but may require more iterations and careful parameter tuning to achieve optimal results.

• Speed of Convergence: Value Iteration typically converges faster to the optimal policy since it
uses a complete model to update its estimates. Q-learning, being model-free, usually takes longer
and may require more episodes to converge, reflecting in a lower success rate in environments like
FrozenLake if not adequately trained.

• Applicability: Value Iteration is preferable in controlled settings or when the environmental model
is accurate. Meanwhile, Q-learning is better suited for environments where the model is unknown or
hard to estimate accurately.

4.2.4 Conclusion
The performance comparison highlights that while Value Iteration can leverage complete environmental
models to quickly achieve high success rates, Q-learning’s model-free approach provides valuable flexibility
and adaptability, albeit often at the cost of efficiency and higher performance variability. Each method has
its strengths and is best suited to different types of problems or stages within a broader machine learning
strategy.
