61 Report
Lab - 3
1.2 Methods
• reset(): This method resets the environment to its initial state and returns an initial observation.
It is typically called at the beginning of an episode or whenever the agent needs to start a new
interaction with the environment. The return value is the initial observation that the agent uses to
start its decision-making process.
Example:
# Python code
import gym

# Create an environment
env = gym.make('CartPole-v1')

# Reset the environment to obtain the initial observation
observation = env.reset()
Parameters: None.
Returns: the initial observation of the environment, which the agent uses to begin a new episode.
• step(): This method takes an action as input and performs one timestep of the environment’s dy-
namics. It returns four values: the next observation, the reward obtained from taking the action, a
boolean indicating whether the episode has terminated, and additional information useful for debug-
ging or analysis.
Example:
# Take an action in the environment
action = 0  # Example action
observation, reward, done, info = env.step(action)
Parameters: the action to take in the environment.
Returns: the next observation, the reward obtained from the action, a boolean done flag indicating whether the episode has terminated, and an info dictionary with additional diagnostic information.
• render(): The render() method renders the current state of the environment for visualization. Ren-
dering may be disabled or implemented differently depending on the environment.
Example:
env.render()
• close(): The close() method frees up resources used by the environment. It’s good practice to call
close() when done using the environment to clean up resources.
Example:
env.close()
• action_space: This attribute describes the space of valid actions the agent can take in the environment.
Example:
action_space = env.action_space
• observation_space: This attribute describes the space of valid observations returned by the environment.
Example:
observation_space = env.observation_space
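Both spaces can also be sampled to obtain random valid values, which is convenient for testing; a brief sketch using the standard Gym space API:

random_action = env.action_space.sample()            # a random valid action
random_observation = env.observation_space.sample()  # a random valid observation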
These methods are fundamental for interacting with OpenAI Gym environments. By repeatedly calling
reset() and step(), an agent can interact with the environment, observe its state, take actions, and receive
feedback in the form of rewards. This interaction forms the basis for training reinforcement learning agents
using Gym environments.
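To make this reset()/step() cycle concrete, the following sketch runs one episode with randomly sampled actions, assuming the classic Gym API in which step() returns the four values described above:

import gym

env = gym.make('CartPole-v1')
observation = env.reset()          # start a new episode
done = False
total_reward = 0

while not done:
    action = env.action_space.sample()                   # pick a random valid action
    observation, reward, done, info = env.step(action)   # advance one timestep
    total_reward += reward

env.close()
print('Episode finished with total reward:', total_reward)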
Frozen lake involves crossing a frozen lake from Start(S) to Goal(G) without falling into any Holes(H)
by walking over the Frozen(F) lake. The agent may not always move in the intended direction due to the
slippery nature of the frozen lake.
– Frozen tiles (F): Safe tiles where the agent can move.
– Hole tiles (H): Tiles where the agent falls into a hole and fails.
– Start state (S): The initial state where the agent begins.
– Goal state (G): The state the agent aims to reach to succeed.
• Actions: At each state, the agent can take one of four actions. These actions determine the direction
in which the agent moves on the grid.
– Move Up
– Move Down
– Move Left
– Move Right
• Rewards: The primary objective of the agent is to maximize the cumulative reward by reaching the
goal state while avoiding falling into holes. The rewards in FrozenLake-v1 are structured as follows:
– When the agent reaches the goal state (G), it receives a reward of +1.
– When the agent falls into a hole (H), it receives a reward of 0 (failure).
– All other transitions yield a reward of 0.
• Transitions: With the slippery setting disabled, transitions are deterministic: if the agent takes an action, it moves to the corresponding adjacent tile, unless the move would take it off the grid, in which case it stays in the same tile. For example, if the agent is at position (i, j) and takes the action "Move Right," it transitions to position (i, j+1) if possible. However, with the default "slippery" setting enabled, a stochastic element is introduced: the intended action is executed with probability 1/3, and the agent slips to each of the two perpendicular directions with probability 1/3.
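These transition probabilities can be inspected directly; a small sketch, assuming the underlying toy-text implementation exposes its transition table through the P attribute as a dictionary mapping state and action to a list of (probability, next_state, reward, done) tuples, and the conventional action encoding in which 2 corresponds to "Move Right":

import gym

env = gym.make('FrozenLake-v1')            # is_slippery=True by default
P = env.unwrapped.P                        # transition table of the underlying environment

state, action = 0, 2                       # start tile, "Move Right"
for prob, next_state, reward, done in P[state][action]:
    print(prob, next_state, reward, done)  # three possible outcomes, each with probability 1/3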
In the context of Markov Decision Processes (MDPs), this property implies that the transition proba-
bilities and rewards associated with state-action pairs fully capture the dynamics of the environment.
Mathematically, it can be expressed as:
P(s_{t+1} | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = P(s_{t+1} | s_t, a_t)
Where:
– s_t is the state at time t,
– a_t is the action taken at time t,
– s_{t+1} is the resulting next state.
The Markov Decision Property simplifies the learning problem by allowing us to focus solely on the current
state and action, making planning and decision-making more tractable.
• Foundation: The Markov Property is the underlying principle upon which MDPs are built. It
allows for the modeling of these decision processes in a tractable way.
1. States (S):
• Each state corresponds to a tile of the grid: the start tile (S), frozen tiles (F), hole tiles (H), and the goal tile (G).
• The agent’s state is its current position on the grid.
2. Actions (A):
• At each state, the agent can take one of four actions: move up, move down, move left, or move
right.
• These actions represent the possible decisions or movements the agent can make in the environ-
ment.
3. Transitions (P):
• With the slippery setting disabled, transitions are deterministic, meaning that the outcome of an action is known with certainty: taking an action moves the agent to the corresponding adjacent tile, unless the move would take it off the grid, in which case it stays in the same tile.
• However, with the default "slippery" setting enabled, a stochastic element is introduced: the intended action is executed with probability 1/3, and the agent slips to each of the two perpendicular directions with probability 1/3.
4. Rewards (R):
• Reaching the goal state (G) yields a reward of +1.
• Falling into a hole (H) or stepping onto a frozen tile yields a reward of 0.
5. Policy (π):
• A policy specifies the agent’s behavior, i.e., which action to take in each state.
• The goal of the agent is to learn an optimal policy that maximizes the expected cumulative
reward over time.
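A quick way to confirm the size of these components in code (a sketch assuming the default 4x4 map):

import gym

env = gym.make('FrozenLake-v1')
print(env.observation_space)   # Discrete(16): one state per tile of the 4x4 grid
print(env.action_space)        # Discrete(4): move left, down, right, or up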
3 Value iteration
Value iteration is an algorithm used to compute the optimal value function for a Markov Decision Process
(MDP). The optimal value function represents the expected cumulative reward that an agent can achieve
from each state, following the optimal policy.
• Initialization: Initialize the value function V(s) arbitrarily for all states s.
• Iteration: For each state s, update the value function using the Bellman optimality equation:
V(s) ← max_a Σ_s′ P(s′ | s, a) [ R(s, a, s′) + γ V(s′) ]
where:
– P(s′ | s, a) is the probability of transitioning to state s′ when taking action a in state s,
– R(s, a, s′) is the reward received for that transition,
– γ is the discount factor.
• Termination: Repeat the iteration process until the change in the value function between consec-
utive iterations falls below a specified threshold, indicating convergence.
• Policy extraction: Once the value function has converged, extract the optimal policy by selecting the action that maximizes the expected value in each state:
π∗(s) = argmax_a Σ_s′ P(s′ | s, a) [ R(s, a, s′) + γ V∗(s′) ]
where π∗(s) is the optimal policy and V∗(s) is the optimal value function.
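A compact sketch of this procedure for FrozenLake-v1, assuming the underlying environment exposes its transition model through the P attribute as (probability, next_state, reward, done) tuples (the discount factor and threshold values are illustrative):

import numpy as np
import gym

env = gym.make('FrozenLake-v1')
P = env.unwrapped.P                       # transition model: P[s][a] = [(prob, s', r, done), ...]
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma, theta = 0.99, 1e-8                 # discount factor and convergence threshold

V = np.zeros(n_states)                    # initialize the value function (here: all zeros)
while True:
    delta = 0.0
    for s in range(n_states):
        # Bellman optimality update: best expected return over all actions
        q_values = [sum(p * (r + gamma * V[s2]) for p, s2, r, done in P[s][a])
                    for a in range(n_actions)]
        best = max(q_values)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:                     # stop once the largest change falls below the threshold
        break

# Policy extraction: choose the action with the highest expected value in each state
policy = np.array([
    int(np.argmax([sum(p * (r + gamma * V[s2]) for p, s2, r, done in P[s][a])
                   for a in range(n_actions)]))
    for s in range(n_states)
])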
Value iteration converges to the optimal value function and policy for finite-state and finite-action
MDPs. It is a foundational algorithm in reinforcement learning and dynamic programming, providing
a principled approach to solving sequential decision-making problems under uncertainty. The graph il-
lustrates convergence of state values in FrozenLake-v1 using value iteration. Each line represents the
estimated value of each state over iterations, demonstrating how the algorithm refines state evaluations,
stabilizing towards optimal values for decision-making in a stochastic, slippery environment.
The policy derived from Value Iteration generally performs well in the FrozenLake-v1 environment.
By optimizing the expected reward from each state, the policy strategically navigates the lake’s tiles, aim-
ing to reach the goal while minimizing the risk of falling into holes. The success rate of the policy, as
tested in the environment, reflects its effectiveness, accounting for the inherent randomness and challenges
posed by the slippery nature of the lake. The iterative refinement of state values helps in making informed
decisions that improve the overall success rate over time.
• Optimal Decision Making: The primary strength of Value Iteration lies in its ability to compute
the optimal policy by leveraging the maximum expected utility from each state. Given that the
utility values converge to their true values, the resulting policy ensures that the decisions made at
every state are optimal with respect to the environment’s dynamics and the reward structure. This
is critical in environments like FrozenLake-v1, where every move can potentially lead to failure
(falling into a hole) or success (reaching the goal).
• Evaluation and Adaptability: The effectiveness of the policy can be empirically evaluated through simulations where the agent attempts to navigate from the start to the goal across numerous trials. In FrozenLake-v1, an optimal policy will consistently reach the goal more often than not, demonstrating a high success rate. Moreover, the adaptability of Value Iteration allows for adjustments in the model’s parameters (like the discount factor γ) to experiment with short-term versus long-term gains, offering insights into the behavior of the policy under different strategic priorities.
• Limitations: Despite these strengths, the performance of the Value Iteration policy can sometimes be
limited by the granularity of the state space and the accuracy of the transition probabilities provided
by the environment’s model. Inaccuracies in the model or an overly simplified representation of the
state space can lead to suboptimal policies.
The effectiveness of the policy can be evaluated by simulating the environment with the derived policy and
measuring the success rate of reaching the goal without falling into any holes. This would typically show
high performance unless the environment setup or the reward structure inherently limits the achievable
success.
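One way to carry out such an evaluation (a sketch, assuming a policy array indexed by state such as the one extracted above, and the classic Gym API):

def evaluate_policy(env, policy, episodes=1000):
    """Run the policy for a number of episodes and return the fraction that reach the goal."""
    successes = 0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, info = env.step(policy[state])
        successes += reward               # in FrozenLake the reward is 1 only when the goal is reached
    return successes / episodes

success_rate = evaluate_policy(env, policy)
print('Estimated success rate:', success_rate)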
4 Q-Learning
Q-Learning is a model-free reinforcement learning algorithm used to learn the value of an action in a
particular state. It doesn’t require a model of the environment and can handle problems with stochastic
transitions and rewards without needing adaptations.
• Initialize the Q-values (Q-Table): Start with a table of Q-values for each state-action pair,
initialized to zero or some small random numbers.
• Policy Execution: At each state, select an action using a policy derived from the Q-values (commonly an ε-greedy policy where ε is the probability of choosing a random action, and 1−ε is the probability of choosing the action with the highest Q-value).
– Execute the chosen action, and observe the reward and the next state.
– Update the Q-value for the state-action pair based on the reward received and the maximum Q-value of the next state (Bellman equation):
Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
Where:
∗ Q(s,a) is the current Q-value of state s and action a.
∗ α is the learning rate (0 < α ≤ 1).
∗ r is the reward received after executing action a in state s.
∗ γ is the discount factor (0 ≤ γ < 1); it represents the difference in importance between
future rewards and immediate rewards.
∗ max_a′ Q(s′, a′) is the estimate of the optimal future value.
• Repeat: Repeat this process for each episode or until the Q-values converge.
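A minimal sketch of this loop for FrozenLake-v1, assuming the classic Gym API and an ε-greedy policy with a fixed exploration rate (the value of epsilon here is illustrative; alpha and gamma match the values used below):

import numpy as np
import gym

env = gym.make('FrozenLake-v1')
n_states, n_actions = env.observation_space.n, env.action_space.n

alpha, gamma, epsilon = 0.5, 0.99, 0.1    # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))       # initialize the Q-table to zeros

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        # epsilon-greedy action selection
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, done, info = env.step(action)
        # Q-learning update (Bellman equation)
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

greedy_policy = np.argmax(Q, axis=1)      # extract the greedy policy from the learned Q-table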
The plot displays the success rate of a Q-learning algorithm applied to FrozenLake-v1 over 1000
episodes of training, using a learning rate (alpha) of 0.5 and a discount factor (gamma) of 0.99.
• Interpretation:
– The high alpha (0.5) implies that the algorithm puts significant weight on recent information,
which might contribute to the observed volatility as recent states and rewards can drastically
influence the policy updates.
– The high gamma (0.99) suggests that future rewards are nearly as important as immediate
rewards, which encourages the agent to think long-term but may also result in instability in
success rate if the environment has diverse or conflicting immediate and future rewards.
Figure 2: Q-Learning
4.2 Evaluation
Here, Value Iteration achieves a higher success rate than Q-Learning.
– Value Iteration achieves a higher success rate, indicating it has successfully found a policy
closer to the optimal solution for the environment. This method systematically evaluates the
best action from each state based on known dynamics (transition probabilities and rewards),
which typically allows it to converge to the optimal policy.
– The higher success rate of 76% suggests that Value Iteration efficiently utilized the model of the
environment to compute the maximum expected utility for each state systematically, leading to
more consistently successful decision-making.
• Model Dependency: The effectiveness of Value Iteration is contingent upon having accurate and
complete knowledge of the environment’s dynamics. This reliance on a model makes it highly effective
in environments where such a model can be accurately defined and where the state and action spaces
are sufficiently manageable.
– Q-learning’s success rate of 56% indicates that, while it has learned to navigate the environment
to a significant extent, it has not reached the level of performance of the Value Iteration policy.
This discrepancy can arise from Q-learning’s nature of learning solely from interactions with
the environment, without prior knowledge of its dynamics.
– This model-free approach means it discovers effective strategies through trial and error, which
can be less efficient than the model-based strategies employed by Value Iteration.
– Q-learning involves a balance between exploring new actions and exploiting known rewarding
actions. If this balance is not managed correctly (e.g., through the setting of the exploration rate ε in an ε-greedy strategy), Q-learning might not explore the environment sufficiently or might settle too quickly on suboptimal policies.
– The convergence to an optimal policy in Q-learning is also highly dependent on parameters like
the learning rate (alpha) and the discount factor (gamma), along with the number of episodes
of learning allowed. Inadequate tuning or insufficient learning episodes can lead to poorer
performance.
4.2.3 Comparison
• Robustness vs. Flexibility: Value Iteration’s robustness is evident in environments with well-
understood dynamics. In contrast, Q-learning offers flexibility in unknown or complex environments
but may require more iterations and careful parameter tuning to achieve optimal results.
• Speed of Convergence: Value Iteration typically converges faster to the optimal policy since it
uses a complete model to update its estimates. Q-learning, being model-free, usually takes longer
and may require more episodes to converge, which can be reflected in a lower success rate in environments like FrozenLake if it is not adequately trained.
• Applicability: Value Iteration is preferable in controlled settings or when the environmental model
is accurate. Meanwhile, Q-learning is better suited for environments where the model is unknown or
hard to estimate accurately.
4.2.4 Conclusion
The performance comparison highlights that while Value Iteration can leverage complete environmental
models to quickly achieve high success rates, Q-learning’s model-free approach provides valuable flexibility
and adaptability, albeit often at the cost of efficiency and higher performance variability. Each method has
its strengths and is best suited to different types of problems or stages within a broader machine learning
strategy.