Documentation
1- Navigate a mobile robot (Pioneer 3-DX) in a static environment, as shown below, using the Webots simulator. The agent should be designed as a self-learning entity that uses a suitable algorithm to plan its path efficiently, given the starting and destination points, so that it reaches the goal without colliding with obstacles.
First, we import the necessary libraries into the robot's controller file so that their functions can be used to solve the problem. We use the Supervisor class to access additional information about the agent's environment, which makes the training process easier. Next, we create an instance of the robot. To eliminate run-to-run randomness in the variables used in the code, we fix a specific seed for the random number generators, so that the same random sequence is produced on every run and any change in the agent's learning can be attributed solely to our own hyperparameter modifications. Finally, we define a device variable so that, if CUDA is available, the agent's computations are processed on the GPU; otherwise, the CPU is used.
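A minimal sketch of this set-up is shown below; the seed value and variable names are assumptions rather than the original code.

```python
import random

import numpy as np
import torch
from controller import Supervisor  # Webots controller API with supervisor access

SEED = 1  # assumed seed value, chosen only for illustration
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

# Create the robot instance (in the final code this role is played by the
# Environment class described below, which itself extends Supervisor).
robot = Supervisor()

# Run the agent's computations on the GPU if CUDA is available, else on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```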
The figure below shows the different components of a model, or agent, that learns through reinforcement learning:
In this section of the code, we aim to define and implement the agent's environment. Actions are
input to the environment, and the reward and next state are output from it. The class related to
the agent's environment or "Environment" must inherit from "Supervisor" because we need
access to simulation environment information. Initially, we need to initialize variables such as
maximum speed, destination threshold, obstacle distance threshold, environment size, etc., for
problem-solving. Then, we specify the left and right motors, considering their positions as
unlimited and their speeds as zero initially, so we can assign values to them when needed. Next,
we identify and initialize the GPS, Touch, and Distance sensors. Using the Gymnasium library, we
determine the number of discrete actions. There are three actions: moving straight, turning right,
and turning left. Using the Box function, we determine the environment and the dimensions of
the state vector. Based on the composition of the agent's state space, the agent's observation
space will have dimensions of (5, ). We also specify that the value of each element in the robot's
observation space should be between 0 and 1. Therefore, we need to normalize the robot's
observation space. Furthermore, we set the maximum number of steps the robot can take. In
general, this piece of code sets the action space, observation space, and maximum number of
stages for a reinforcement learning environment. The action space is discrete with three possible
actions, and the observation space is continuous with five dimensions.
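A hedged sketch of such a constructor is given below; the device names, thresholds, and speeds are placeholders, and inheriting from `gym.Env` in addition to `Supervisor` is an assumption made so that the environment can later be passed to Stable Baselines3.

```python
import gymnasium as gym
import numpy as np
from controller import Supervisor
from gymnasium import spaces


class Environment(Supervisor, gym.Env):
    def __init__(self, max_steps=500):
        super().__init__()
        self.max_speed = 6.28              # placeholder maximum wheel speed (rad/s)
        self.destination_threshold = 0.1   # placeholder goal-distance threshold
        self.obstacle_threshold = 0.1      # placeholder obstacle-distance threshold
        self.max_steps = max_steps
        self.timestep = int(self.getBasicTimeStep())

        # Left and right wheel motors: unlimited position (velocity control),
        # zero initial velocity so speeds can be assigned later.
        self.left_motor = self.getDevice("left wheel")
        self.right_motor = self.getDevice("right wheel")
        for motor in (self.left_motor, self.right_motor):
            motor.setPosition(float("inf"))
            motor.setVelocity(0.0)

        # GPS, touch and distance sensors (device names are assumptions).
        self.gps = self.getDevice("gps")
        self.gps.enable(self.timestep)
        self.touch = self.getDevice("touch sensor")
        self.touch.enable(self.timestep)
        self.distance_sensors = [self.getDevice(name) for name in ("ds_left", "ds_right")]
        for ds in self.distance_sensors:
            ds.enable(self.timestep)

        # Three discrete actions: move straight, turn right, turn left.
        self.action_space = spaces.Discrete(3)
        # Five-dimensional observation, each element normalized to [0, 1].
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(5,), dtype=np.float32)
```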
The `normalizer()` function takes an input value along with its minimum and maximum possible
values. It then normalizes the input value using the Min-Max normalization method to scale the
values between 0 and 1. The normalized value is returned as the output.
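As a short sketch, Min-Max normalization looks like this:

```python
def normalizer(value, min_value, max_value):
    """Scale value from [min_value, max_value] into [0, 1] (Min-Max normalization)."""
    return (value - min_value) / (max_value - min_value)
```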
The `get_distance_to_goal()` function calculates the current distance of the robot to the goal or
destination based on the values obtained from the GPS sensor. It uses the Euclidean distance and
returns the normalized value of the distance as the output.
The `get_distance_to_start()` function, similar to the previous function, calculates the current
distance of the robot to the starting point or origin. It normalizes the distance and returns the
normalized value as the output.
The `get_sensor_data()` function takes the distance information obtained from the distance
sensors, normalizes the values, and returns them as the output.
The `get_current_position()` function uses the GPS sensor information to obtain the current
position of the robot. It normalizes the position values and returns them as the output.
Please note that the specific implementation details of these functions are not provided. The
descriptions above provide a general understanding of the purpose and behavior of each
function.
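To make this concrete, one possible implementation of these helpers is sketched below, written as methods of the Environment class. The goal coordinates, arena bounds, sensor range, and the mapping of GPS axes to the arena plane are all assumptions, not values from the original code.

```python
import math

GOAL_POSITION = (1.0, 1.0)        # assumed goal (x, y) in world coordinates
START_POSITION = (-1.0, -1.0)     # assumed starting (x, y)
MAX_DISTANCE = 3.0                # assumed largest possible distance in the arena
ARENA_MIN, ARENA_MAX = -1.5, 1.5  # assumed arena bounds along each axis
MAX_SENSOR_VALUE = 1000.0         # assumed raw distance-sensor range


def get_distance_to_goal(self):
    # Euclidean distance from the current GPS position to the goal, normalized.
    x, y, _ = self.gps.getValues()
    d = math.hypot(GOAL_POSITION[0] - x, GOAL_POSITION[1] - y)
    return normalizer(d, 0.0, MAX_DISTANCE)


def get_distance_to_start(self):
    x, y, _ = self.gps.getValues()
    d = math.hypot(START_POSITION[0] - x, START_POSITION[1] - y)
    return normalizer(d, 0.0, MAX_DISTANCE)


def get_sensor_data(self):
    # Raw distance-sensor readings scaled to [0, 1]; how the raw value maps to
    # physical distance depends on each sensor's lookup table.
    return [normalizer(ds.getValue(), 0.0, MAX_SENSOR_VALUE) for ds in self.distance_sensors]


def get_current_position(self):
    x, y, _ = self.gps.getValues()
    return [normalizer(x, ARENA_MIN, ARENA_MAX), normalizer(y, ARENA_MIN, ARENA_MAX)]
```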
The `get_observations()` function is responsible for constructing the robot's observation and returning the state vector as its output. The observation consists of the distance-sensor readings (two dimensions), the distance from the robot to the destination (one dimension), and the robot's current position (two dimensions). These pieces of information are concatenated at the current moment into an observation vector that represents the robot's state at that moment. The more comprehensive and informative this observation is, the better for the robot.
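A sketch of this function, reusing the helpers above, could look as follows:

```python
import numpy as np


def get_observations(self):
    # 2 distance-sensor values + 1 distance-to-goal value + 2 position values = 5 dims.
    sensor_data = self.get_sensor_data()
    distance_to_goal = [self.get_distance_to_goal()]
    position = self.get_current_position()
    return np.concatenate([sensor_data, distance_to_goal, position]).astype(np.float32)
```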
By using the `reset()` function, with the help of the supervisor, we can reset the environment and restore everything to its initial conditions, preparing the environment for the robot to start again from its initial state. We also need to take the simulation's basic time step into account. Then we can call the `get_observations()` function to obtain the initial observation of the environment.
Using the `step()` function, we apply the action that the robot wants to perform, calculate the reward or penalty for that action, and then compute the robot's observations after the action to see how the environment has changed. Next, we check whether the robot has reached the maximum allowed number of steps; if so, the episode is ended.
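A condensed sketch of how `reset()` and `step()` might look under these descriptions is shown below; the Gymnasium-style return values and the helper names `apply_action` and `step_count` are assumptions.

```python
def reset(self, seed=None, options=None):
    # Ask the supervisor to restore the simulation to its initial state.
    self.simulationReset()
    self.simulationResetPhysics()
    super().step(self.timestep)        # advance one basic time step
    self.step_count = 0
    return self.get_observations(), {}


def step(self, action):
    self.apply_action(action)          # drive the motors (see the sketch below)
    super().step(self.timestep)        # let the simulator advance
    reward = self.reward()             # reward or penalty for this action
    observation = self.get_observations()
    self.step_count += 1
    # Episode ends when the goal is reached or the step budget is exhausted.
    terminated = self.get_distance_to_goal() < self.destination_threshold
    truncated = self.step_count >= self.max_steps
    return observation, reward, terminated, truncated, {}
```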
In the `reward()` function, we aim to adjust the rewards and punishments related to the robot's
actions according to our problem-solving objectives. The robot, based on the specified values in
this function, will be either punished or rewarded for the action it has taken to solve the problem.
Therefore, the values specified in this function should be adjusted as best as possible considering
the solution and the nature of the problem.
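One plausible shape for such a reward function, consistent with the rewards and penalties discussed later in this report, is sketched below; the coefficients, thresholds, and the direction of the sensor comparison are assumptions, not the tuned values of the project.

```python
def reward(self):
    r = 0.0

    # Reward getting closer to the goal; large bonus when the goal is reached.
    distance_to_goal = self.get_distance_to_goal()
    if distance_to_goal < self.destination_threshold:
        r += 100.0                     # assumed terminal bonus
    else:
        r += 1.0 - distance_to_goal    # denser shaping term

    # Penalize approaching obstacles (assuming a smaller normalized reading
    # means a closer obstacle) and penalize collisions via the touch sensor.
    if min(self.get_sensor_data()) < self.obstacle_threshold:
        r -= 5.0                       # assumed proximity penalty
    if self.touch.getValue() > 0.0:
        r -= 50.0                      # assumed collision penalty
    return r
```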
To make the robot move forward, both wheels turn at maximum speed in the same direction. To turn right, the left motor runs forward at maximum speed while the right motor runs backward at maximum speed. To turn left, the motor commands are the opposite of the previous case.
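As a sketch, these motor commands could be implemented like this (the 0/1/2 action encoding is an assumption):

```python
def apply_action(self, action):
    if action == 0:        # move straight: both wheels forward at max speed
        left, right = self.max_speed, self.max_speed
    elif action == 1:      # turn right: left wheel forward, right wheel backward
        left, right = self.max_speed, -self.max_speed
    else:                  # turn left: the mirror of turning right
        left, right = -self.max_speed, self.max_speed
    self.left_motor.setVelocity(left)
    self.right_motor.setVelocity(right)
```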
The agent class, `Agent_Function`:
This class implements the robot's decision-making based on reinforcement learning. In it, we specify the path for storing the learned policy along with the number of episodes used for training or testing the agent. First, we pass in the agent's environment. To train the robot from its experiences, using the rewards and penalties received during learning, we use the Proximal Policy Optimization (PPO) reinforcement learning algorithm.
The Proximal Policy Optimization (PPO) algorithm is a reinforcement learning algorithm used to
train artificial intelligence agents in dynamic environments. The main goal of this algorithm is to
provide an optimized policy optimization method that offers improvements over previous
algorithms such as Trust Region Policy Optimization (TRPO). PPO utilizes a clipping function to
limit large changes in the policy, allowing for more stable updates and preventing sudden
increases or oscillations in parameter changes.
In PPO, the agent tries to learn an optimal policy that maximizes the total cumulative reward for
each action taken in any state. To achieve this, the algorithm uses a specific objective function
called the "clipped surrogate objective" that allows policy updates to be performed in a way that
restricts significant changes in the policy. This enables the agent to improve its performance in
the environment while preventing excessive fluctuations during training.
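For reference, the clipped surrogate objective maximized by PPO can be written as

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)},$$

where $\hat{A}_t$ is the advantage estimate and $\epsilon$ is the clipping range.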
In the first stage, the agent evaluates the surrogate objective by considering the ratio of action probabilities between the new and old policies. In the second stage, using this objective function, the policy parameters are updated so that the agent achieves better performance in the given environment. PPO works with both discrete and continuous action spaces and handles them effectively.
Compared to the Deep Q-Network (DQN) algorithm, which is value-based, PPO is a policy-based algorithm that focuses on direct (approximate) policy optimization. Additionally, DQN is limited to problems with discrete action spaces, whereas PPO can be applied to both discrete and continuous action spaces.
For example, in a project like this, if a robot needs to learn how to navigate in a dynamic
environment to reach a specific goal, the PPO algorithm can be helpful. This algorithm uses the
robot's past experiences in the environment to determine better policies for the robot's
movement. These improvements can include changes in speed, direction, or how the robot
responds to obstacles and different environmental conditions.
In summary, PPO utilizes the robot's past experiences in the environment to improve operational
policies, enabling the robot to achieve optimal performance in dynamic environments and reach
its goal.
The Proximal Policy Optimization (PPO) algorithm has significant improvements compared to
TRPO and DQN algorithms. These improvements can include:
1. Efficiency: PPO has noticeable improvements in efficiency compared to TRPO. These
improvements often lead to faster learning and more optimal utilization of computational
resources.
2. Stability: PPO has greater stability in learning compared to TRPO and DQN. These
improvements result in a reduction in the occurrence of sudden shocks during the learning
process.
3. Experience usage: PPO reuses each batch of collected experience for several optimization epochs (a rollout buffer rather than DQN-style replay memory), which helps improve the efficiency and stability of the algorithm.
4. Compatibility with continuous action problems: PPO handles continuous action spaces well in addition to discrete ones, which demonstrates its flexibility and effectiveness in real-world problems.
In this class, we build the policy network using the Stable Baselines3 library and the PPO algorithm. At the same time, we specify the path where the charts generated during the agent's training are stored, so that after training we can access them to analyse and examine the training process.
We use the save() and load() functions to save the trained agent's policy function after training
and during agent testing.
We start training the policy network using the train() function, specifying the total number of time steps required for training. With this function, the agent gains experience in the environment and, through PPO updates, learns the best policy for the robot. Using this learned policy, the agent can achieve the best rewards during both the training and testing phases.
Using the test() function, the action for each step is predicted from the learned policy and applied to the environment. The reward obtained in each episode is then accumulated and reported.
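A hedged sketch of what this class might look like with Stable Baselines3 is given below; the paths, the time-step budget, and the episode bookkeeping are assumptions.

```python
from stable_baselines3 import PPO


class Agent_Function:
    def __init__(self, env, save_path="results/ppo_agent", log_path="results/logs"):
        self.env = env
        self.save_path = save_path
        # PPO policy network; training curves are written to TensorBoard.
        self.model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=log_path)

    def train(self, total_timesteps=100_000):  # placeholder training budget
        self.model.learn(total_timesteps=total_timesteps)
        self.model.save(self.save_path)

    def load(self):
        self.model = PPO.load(self.save_path, env=self.env)

    def test(self, num_episodes=10):
        for episode in range(num_episodes):
            obs, _ = self.env.reset()
            done, episode_reward = False, 0.0
            while not done:
                # Predict the action from the learned policy and apply it.
                action, _ = self.model.predict(obs, deterministic=True)
                obs, reward, terminated, truncated, _ = self.env.step(action)
                episode_reward += reward
                done = terminated or truncated
            print(f"episode {episode}: reward = {episode_reward:.2f}")
```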
Finally, we specify the path for saving the agent's policy along with the number of episodes required for training and testing. Then, to run either training or testing of the agent in the environment, we set the variable that selects between the two modes.
As you can see in the image above, after learning in the environment provided in the problem statement, the agent successfully reached the goal 10 times.
2- Change the starting point to a different point in the environment of your choice and perform
the test again (without retraining). In this case, the navigation may not necessarily be successful.
To investigate whether the robot can still reach the goal without retraining when the starting
point is changed, we modify the robot's initial position to different coordinates.
As mentioned in the observations section, the robot's observation space includes information
from distance sensors, distance to the goal, and the robot's current position. During the training
process, through exploration and receiving different rewards and penalties, the agent has
learned how rewards are related to observations and how it should behave to get closer to the
goal and ultimately reach it.
The information about the distance to the goal in the observation space and the rewards
obtained during training have enabled the agent to learn the relationship between rewards along
the trajectory and proximity to the goal.
The information about the distance to obstacles and penalties when approaching them during
training has taught the robot to avoid obstacles during the testing phase.
The information about the current position in the observation space and the rewards obtained
when reaching the goal during training allow the agent to learn the relationship between the
current position and the goal position and understand the path it needs to follow to reach the
goal.
Because of these learned relationships, when the starting position is changed, the robot can still use the observation-reward associations learned during training at every moment of the testing phase, enabling it to reach the goal coordinates.
The chart related to "ep_rew_mean" shows the average rewards received during the training
process. As observed, the agent starts off inexperienced and performs poorly, receiving penalties
until it learns from its experiences to gain significant rewards over its lifespan. As seen in the
chart, at the end of training, the agent has achieved high rewards while reaching its goal.
The "approx_kl" chart indicates the approximate average KL divergence between the old and new
policy functions (for PPO). In other words, it estimates the level of changes occurring during
updates. As seen in the corresponding chart, this value decreases over time during the training
process. This means that towards the end of training, the robot has reached its goal, and there is
no need for further policy updates.
The "clip_fraction" chart represents the control level of the clip feature on policy updates, which
decreases over time. This indicator shows that the agent hasn't experienced sudden changes
during policy updates.
The "entropy_loss" chart displays the difference between the learned policy function and a
completely random policy function at each step during the training process. As observed in the
chart, this value is increasing, indicating that the policy function has moved away from
randomness and is becoming more purposeful in its training.
The "explained_variance" chart represents a fraction of the explained variance by the value
function, as explained by the given function. The ideal value for this chart is one. As seen, we
have approached this value during the training process.
The "train/loss" chart represents the current total loss value at each step. The "value/loss" chart
usually represents the error between the output of the value function and the Monte Carlo
estimation. As observed, both charts initially increase due to the robot's exploration, but after
gaining experience and learning during training, the values of the two errors decrease over the
training process.
The "policy_gradient_loss" chart represents the current value of the policy gradient loss (its value
doesn't have significant meaning). As seen, its value decreases over time during training.
After performing the training process and examining the generated charts to explain the training
process, we can conclude that the training process has been conducted correctly. The agent has
learned to avoid obstacles and walls to reach its goal by gaining experience and receiving rewards
and penalties during the training process.
Extra Point: Perform obstacle avoidance path planning using only the provided GPS and Camera
sensors, similar to the instructions given. It should be noted that the Pioneer 3-DX robot does not
have a camera sensor, so you will need to manually add it. The camera's properties can be
adjusted after adding it to the robot in Webots.
The code related to this section has been explained in the previous section.
The only difference is that the robot's observation space has been reduced to 2D because we no
longer use distance sensors. Instead, we will use the information collected by the camera sensor.
In the function `get_camera_information()`, we first receive the current captured image from the
camera and convert it to the acceptable format of OpenCV. Then, for targeted processing, we
convert the image to grayscale. To simplify the use of the camera for problem-solving, we modify
the environment as follows:
We have made the obstacles black and the goal white. After receiving the camera image and converting it to grayscale, we analyse the intensity values corresponding to the obstacles and the goal, and then plot the histograms of these images.
The images of the goal and obstacles from the robot's perspective will be as follows:
By analyzing the above histograms, we observe that from the robot's perspective, the goal is
visible in the intensity range of 240 to 255, while obstacles are visible in the intensity range of 0
to 50. We return the number of pixels within these ranges as the output of the function. The
more pixels registered in these ranges, the closer we are to obstacles or goals, and vice versa.
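A sketch of how `get_camera_information()` could be implemented with OpenCV is shown below; the intensity bands follow the histogram analysis above, while the camera attribute name and the exact return format are assumptions.

```python
import cv2
import numpy as np


def get_camera_information(self):
    # self.camera is assumed to have been obtained with getDevice() and enabled.
    # Webots returns the frame as a BGRA byte buffer.
    image = np.frombuffer(self.camera.getImage(), dtype=np.uint8).reshape(
        (self.camera.getHeight(), self.camera.getWidth(), 4)
    )
    gray = cv2.cvtColor(image, cv2.COLOR_BGRA2GRAY)

    # Count pixels in the intensity bands found from the histograms:
    # the white goal in [240, 255], the black obstacles in [0, 50].
    goal_pixels = int(np.count_nonzero(gray >= 240))
    obstacle_pixels = int(np.count_nonzero(gray <= 50))
    return goal_pixels, obstacle_pixels
```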
The code related to this section has been explained in the previous section. The only difference
is that in the robot's observation space this time, only the robot's positional information is
recorded because we no longer have information from distance sensors.
The code related to this section has been explained in the previous section. The differences that
occurred in this section are as follows:
- Changes in the reward value related to proximity to the destination.
- Extraction of useful information from the camera sensor. In the camera information function, it
was explained that the number of pixels related to the range of observing the goal and obstacles
is returned as the function's output. The more pixels registered in these ranges, the closer we are
to obstacles or goals. Based on this, we can consider rewards in a way that if the number of pixels
related to the observation of the goal increases, it receives higher rewards. Conversely, if the
number of pixels related to the observation of obstacles increases, we consider penalties for the
agent. Accordingly, we have removed the rewards and penalties related to the distance sensors and replaced them with rewards and penalties based on the camera sensor, so that the robot learns to avoid obstacles and move towards the goal, as sketched below.
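A hedged sketch of this camera-based reward shaping is shown below; the coefficients are placeholders, and the attributes holding the previous frame's pixel counts are assumed to be initialized in `reset()`.

```python
def reward(self):
    r = 0.0
    goal_pixels, obstacle_pixels = self.get_camera_information()

    # Reward seeing more of the white goal and penalize seeing more black
    # obstacles, relative to the previous frame (coefficients are illustrative).
    r += 0.01 * (goal_pixels - self.prev_goal_pixels)
    r -= 0.01 * (obstacle_pixels - self.prev_obstacle_pixels)
    self.prev_goal_pixels = goal_pixels
    self.prev_obstacle_pixels = obstacle_pixels

    # Keep the reward related to proximity to the destination, as described above.
    distance_to_goal = self.get_distance_to_goal()
    if distance_to_goal < self.destination_threshold:
        r += 100.0                     # assumed terminal bonus
    else:
        r += 1.0 - distance_to_goal
    return r
```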
The code related to this section has been explained in the previous section, with the difference that the number of episodes has been set to 4000, which, at 1,000 time steps per episode, corresponds to a total of 4,000,000 time steps of training.
As you can see in the above image, the agent was able to reach the goal successfully by using the
camera sensor instead of distance sensors.