21BAI10450 MATLAB Report
Exp. 1: Train a biped robot to walk using reinforcement learning
Aim:
The reinforcement learning environment for this example is a biped robot. The
training goal is to make the robot walk in a straight line using minimal control
effort.
Theory:
• In the neutral 0 rad position, both legs are straight and the ankles are flat.
• Foot contact with the ground is modeled using the Spatial Contact Force block.
• The agent can control 3 individual joints (ankle, knee, and hip) on both legs of
the robot by applying torque signals from -3 to 3 N·m. The actual computed
action signals are normalized between -1 and 1.
• The episode terminates if the robot torso center of mass drops below 0.1 m in the
Z direction (the robot falls), moves more than 1 m in either Y direction (the robot
moves too far to the side), or if the absolute value of the roll, pitch, or yaw
exceeds 0.7854 rad (a sketch of this check follows this list).
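As a rough illustration, the termination conditions above could be checked as follows; the variable names are placeholders for the corresponding signals, not names taken from the example:
% Sketch of the episode-termination check described above (assumed variable names).
isDone = torsoZ < 0.1 ...                        % torso too low: the robot has fallen
      || abs(torsoY) > 1 ...                     % drifted too far to the side
      || any(abs([roll pitch yaw]) > 0.7854);    % tipped past ~45 degrees in any axis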
This reward function encourages the agent to move forward by providing a positive reward for
positive forward velocity. It also encourages the agent to avoid episode termination by providing
a constant reward at every time step. The remaining terms in the reward function are
penalties for substantial changes in lateral and vertical translation and for the use of excess
control effort.
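A minimal sketch of this reward structure is shown below; the weights w1 through w4 are illustrative placeholders, not the exact coefficients used in the example.
% Sketch of the reward described above (illustrative weights, assumed signal names).
% vx      - forward velocity of the torso
% dy, dz  - change in lateral and vertical translation
% Ts, Tf  - agent sample time and episode duration (constant per-step reward)
% u       - previous normalized joint-torque action
reward = vx ...                     % reward forward progress
       - w1*dy^2 - w2*dz^2 ...      % penalize lateral and vertical translation
       + w3*(Ts/Tf) ...             % constant reward at every time step
       - w4*sum(u.^2);              % penalize excess control effort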
Procedure:
The robot was trained in the following manner:
1. Use the following MATLAB command to open the walking biped robot example:
openExample('control_deeplearning/TrainBipedRobotToWalkUsingReinforcementLearningAgentsExample')
2. This command loads the MATLAB example files. Open the .mlx file, as it is the main
notebook.
3. We can execute the notebook and use the pretrained model of the robot by clicking
"Run All" in the toolbar.
4. The file allows us to explore two techniques for training the robot: DDPG or TD3.
5. DDPG:
a. Overview: DDPG (deep deterministic policy gradient) is a reinforcement learning
algorithm designed for environments with continuous action spaces, combining an
actor-critic architecture with deep neural networks.
b. Actor-critic structure: It comprises an actor network that learns a deterministic
policy and a critic network that estimates the action-value function (Q-function).
c. Experience replay and target networks: DDPG uses an experience replay buffer to
store past experiences and target networks to provide more stable target values
during training.
d. Off-policy learning with soft updates: It learns from past experiences, allowing for
efficient use of data, and employs soft updates to gradually blend the current and
target network weights, improving training stability. A hedged agent-creation
sketch follows this list.
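As a hedged sketch (not the example's exact setup), a default DDPG agent could be created with the Reinforcement Learning Toolbox as follows; the observation size, sample time, and option values are assumptions:
% Hedged sketch of DDPG agent creation (assumed dimensions and option values).
obsInfo = rlNumericSpec([29 1]);                              % assumed observation size
actInfo = rlNumericSpec([6 1], LowerLimit=-1, UpperLimit=1);  % 6 normalized joint torques
agentOpts = rlDDPGAgentOptions( ...
    SampleTime=0.025, ...                 % assumed agent sample time
    ExperienceBufferLength=1e6, ...       % experience replay buffer (item c)
    TargetSmoothFactor=1e-3);             % soft target-network updates (item d)
agent = rlDDPGAgent(obsInfo, actInfo, agentOpts);  % default actor and critic networks
% Training is then launched with train(agent, env, rlTrainingOptions(...)).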
Simulation:
Conclusion:
We trained the biped robot to walk in a straight line with minimal control effort using
reinforcement learning.
Exp. 2: Parking a car safely using RL
Aim:
Automatically parking a car that is left in front of a parking lot is a challenging
problem. The vehicle's automated systems are expected to take over control and
steer the vehicle to an available parking spot.
Theory:
On-board sensors are used to perceive the environment around the vehicle. For example:
• Front and side cameras for detecting lane markings, road signs (stop
signs, exit markings, etc.), other vehicles, and pedestrians
• Lidar and ultrasound sensors for detecting obstacles and calculating
accurate distance measurements
• IMU and wheel encoders for dead reckoning
The perceived environment includes an understanding of road markings to interpret
road rules and infer drivable regions, recognition of obstacles, and detection of
available parking spots.
As the vehicle sensors perceive the world, the vehicle must plan a path through
the environment towards a free parking spot and execute a sequence of control
actions needed to drive to it. While doing so, it must respond to dynamic
changes in the environment, such as pedestrians crossing its path, and readjust
its plan.
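At a high level, this perceive-plan-act cycle can be sketched as the loop below; all helper names (perceiveEnvironment, planPath, computeControls, applyControls) are hypothetical placeholders, not functions from the toolbox or the example:
% Hypothetical sketch of the perceive-plan-act cycle described above.
% All helper functions here are placeholder names, not real toolbox functions.
parked = false;
while ~parked
    scene    = perceiveEnvironment(sensors);             % cameras, lidar, ultrasound, IMU
    path     = planPath(scene, currentPose, goalSpot);   % replan when the scene changes
    controls = computeControls(path, currentPose, currentVel);
    [currentPose, currentVel, parked] = applyControls(vehicle, controls);
end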
Procedure:
1. Load the environment.
7. Plan a local trajectory starting at the current pose and closely following
the reference path using the controllerTEB object.
8. Now combine all the previous steps in the planning process and run the
simulation for the complete route plan. This process involves
incorporating the behavioral planner.
Final Code
% Set the vehicle pose back to the initial starting point.
currentPose = [4 12 0]; % [x, y, theta]
vehicleSim.setVehiclePose(currentPose);

% Reset velocity.
currentVel = 0; % meters/second
vehicleSim.setVehicleVelocity(currentVel);

isGoalReached = false;
while ~isGoalReached
    % Plan for the next path segment if near to the next path segment start pose.
    if planNextSegment(behavioralPlanner, currentPose, 2*maxLocalPlanDistance)
        % Request next maneuver from behavioral layer.
        [nextGoal, plannerConfig, speedConfig] = requestManeuver(behavioralPlanner, ...
            currentPose, currentVel);

        % Do local planning: update the local costmap around the current pose.
        localPlanner.Map = getLocalMap(costmap, currentPose, maxLocalPlanDistance);

        if hasNewPath
            localPlanner.ReferencePath = refPath;
            hasNewPath = false;
        end

        % (Excerpt ends here; in the full example the loop continues with local
        % trajectory planning via controllerTEB, vehicle control and simulation,
        % and updating isGoalReached before the if and while blocks are closed.)
Simulation:
Conclusion:
1. With this experiment, we understood the working of a hybrid RL system: RL is used
only for the parking maneuver, which is the difficult part, rather than also for
driving around the parking lot.
2. The car was able to park safely based on its distance from obstacles.
3. If we also implemented PPO for the task of moving around the lot and increased the
camera range so that a much larger area could be perceived, we could reduce the
time required to park. It would be as if the car had a bird's-eye view of the
parking lot and knew where the free spot was beforehand.