21BAI10450 MATLAB Report


Lab 1: Train Biped Robot to Walk Using Reinforcement Learning Agents

Aim:
The reinforcement learning environment for this example is a biped robot. The
training goal is to make the robot walk in a straight line using minimal control
effort.

Theory:

For this model:

• In the neutral 0 rad position, both of the legs are straight and the ankles are flat.
• The foot contact is modeled using the Spatial Contact Force block.
• The agent can control 3 individual joints (ankle, knee, and hip) on both legs of
the robot by applying torque signals from -3 to 3 N·m. The actual computed
action signals are normalized between -1 and 1.

The environment provides the following 29 observations to the agent.

• Y (lateral) and Z (vertical) translations of the torso center of mass. The translation in the Z direction is normalized to a similar range as the other observations.
• X (forward), Y (lateral), and Z (vertical) translation velocities.
• Yaw, pitch, and roll angles of the torso.
• Yaw, pitch, and roll angular velocities of the torso.
• Angular positions and velocities of the three joints (ankle, knee, hip) on both
legs.
• Action values from the previous time step.
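These observations and actions can be encoded for the agent as numeric specifications; a minimal sketch follows (assuming Reinforcement Learning Toolbox; the channel names follow the MathWorks example and are otherwise illustrative):

% Observation spec: the 29 continuous signals listed above.
obsInfo = rlNumericSpec([29 1]);
obsInfo.Name = "observations";
% Action spec: 6 joint torques, normalized to [-1, 1].
actInfo = rlNumericSpec([6 1], LowerLimit=-1, UpperLimit=1);
actInfo.Name = "foot_torque";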
The episode terminates if either of the following conditions occurs.

• The robot torso center of mass is less than 0.1 m in the Z direction (the robot falls) or more than 1 m in either Y direction (the robot moves too far to the side).
• The absolute value of the roll, pitch, or yaw angle is greater than 0.7854 rad.
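These termination conditions amount to a simple check; the variable names below (torsoZ, torsoY, roll, pitch, yaw) are illustrative stand-ins for the corresponding observations:

% Episode terminates when the robot falls, strays sideways, or tilts too far.
fallen  = torsoZ < 0.1;                          % torso center of mass too low
strayed = abs(torsoY) > 1;                       % more than 1 m to either side
tilted  = any(abs([roll pitch yaw]) > 0.7854);   % rotation beyond ~pi/4 rad
isDone  = fallen || strayed || tilted;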

The reward function encourages the agent to move forward by providing a positive reward for positive forward velocity. It also encourages the agent to avoid episode termination by providing a constant reward at every time step. The other terms in the reward function are penalties for substantial changes in lateral and vertical translations, and for the use of excess control effort.
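A representative form of this reward, following the MathWorks example on which this lab is based (the coefficients shown are indicative), is

r_t = v_x - 3y^2 - 50\hat{z}^2 + \frac{25\,T_s}{T_f} - 0.02\sum_i u_{t-1,i}^2

where v_x is the forward velocity, y the lateral translation, \hat{z} the normalized vertical translation of the torso center of mass, T_s and T_f the sample and final times of the episode, and u_{t-1,i} the torque applied to joint i at the previous time step.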

Procedure:
The robot was trained in the following manner:
1. Use the MATLAB command below to open the walking biped robot example:
openExample('control_deeplearning/TrainBipedRobotToWalkUsingReinforcementLearningAgentsExample')
2. This command loads the example files. Open the .mlx file, as it is the main live script.
3. We can execute and use the pretrained model of the robot by clicking "Run All" in the toolbar.
4. The file allows us to explore two techniques to train the robot: either DDPG or TD3.
5. DDPG:
   a. Overview: DDPG is a reinforcement learning algorithm designed for environments with continuous action spaces, combining an actor-critic architecture with deep neural networks.
   b. Actor-Critic Structure: It comprises an actor network that learns a deterministic policy and a critic network that estimates the action-value function (Q-function).
   c. Experience Replay and Target Networks: DDPG uses an experience replay buffer to store past experiences and target networks to provide more stable target values during training.
   d. Off-Policy Learning with Soft Updates: It learns from past experiences, allowing for efficient use of data, and employs soft updates to gradually blend the current and target network weights, improving stability in training. A minimal agent-configuration sketch follows this list.
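A minimal sketch of configuring such a DDPG agent in MATLAB (assuming Reinforcement Learning Toolbox; the actor, critic, and sample time Ts are built elsewhere in the example, and the option values shown are typical choices rather than the example's exact settings):

% DDPG agent options: experience replay plus soft target updates, as described above.
agentOptions = rlDDPGAgentOptions( ...
    SampleTime=Ts, ...                  % agent step time (defined in the example)
    DiscountFactor=0.99, ...            % reward discount factor
    MiniBatchSize=256, ...              % batch sampled from the replay buffer
    ExperienceBufferLength=1e6, ...     % capacity of the experience replay buffer
    TargetSmoothFactor=1e-3);           % soft-update blend factor for the target networks
agent = rlDDPGAgent(actor, critic, agentOptions);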

Simulation:

Conclusion:
We trained the biped robot to walk in a straight line with minimal control effort via reinforcement learning.
Exp. 2: Parking a Car Safely Using RL

Aim:
Automatically parking a car that is left in front of a parking lot is a challenging
problem. The vehicle's automated systems are expected to take over control and
steer the vehicle to an available parking spot.

Theory:

On-board sensors are used to perceive the environment around the vehicle. For example:
• Front and side cameras for detecting lane markings, road signs (stop signs, exit markings, etc.), other vehicles, and pedestrians
• Lidar and ultrasound sensors for detecting obstacles and calculating accurate distance measurements
• Ultrasound sensors for short-range obstacle detection
• IMU and wheel encoders for dead reckoning
The perceived environment includes an understanding of road markings to interpret road rules and infer drivable regions, recognition of obstacles, and detection of available parking spots.
As the vehicle sensors perceive the world, the vehicle must plan a path through
the environment towards a free parking spot and execute a sequence of control
actions needed to drive to it. While doing so, it must respond to dynamic
changes in the environment, such as pedestrians crossing its path, and readjust
its plan.

Procedure:
1. Load the environment (the parking lot costmap).
2. Create a vehicleDimensions object for storing the dimensions of the vehicle that will park automatically (see the sketch after this list).
3. The route plan table has three variables: StartPose, EndPose, and Attributes. StartPose and EndPose specify the start and end poses of the segment, expressed as [x, y, Θ], with Θ in degrees. Attributes specifies properties of the segment such as the speed limit.
4. Plot a vehicle at the current pose, and along each goal in the route plan.
5. Create a plannerHybridAStar (Navigation Toolbox) object to configure a path planner using a hybrid A* approach. A* planning algorithms find an optimal path between two points by constructing a tree of connected, collision-free vehicle poses.
6. Plan a local trajectory starting at the current pose and closely following the reference path using the controllerTEB object.
7. Now combine all the previous steps in the planning process and run the simulation for the complete route plan. This process involves incorporating the behavioral planner.
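A minimal sketch of the setup behind steps 2, 3, 5, and 6 (assuming Automated Driving Toolbox and Navigation Toolbox; the numeric values and route-plan poses are illustrative placeholders, while the variable names match those used in the final code below):

% Ego vehicle dimensions (step 2).
vehicleDims = vehicleDimensions(4.7, 1.8, Wheelbase=2.8);   % length, width, wheelbase in meters

% Route plan table (step 3): each row is one segment; poses are [x, y, theta] with theta in degrees.
StartPose  = [4 12 0; 45 22 90];
EndPose    = [45 22 90; 35 42 180];
Attributes = struct('SpeedLimit', {5; 5});                  % e.g. speed limit in m/s
routePlan  = table(StartPose, EndPose, Attributes);

% Hybrid A* path planner (step 5) validated against the parking lot costmap.
validator = validatorVehicleCostmap(stateSpaceSE2);
validator.Map = costmap;
motionPlanner = plannerHybridAStar(validator, ...
    MinTurningRadius=4, MotionPrimitiveLength=6);           % meters

% Local trajectory planner (step 6); its map and reference path are updated inside the simulation loop.
localPlanner = controllerTEB(zeros(1,3), MaxVelocity=[5 0.5]);   % [linear (m/s), angular (rad/s)]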

Final Code
% Set the vehicle pose back to the initial starting point.
currentPose = [4 12 0]; % [x, y, theta]
vehicleSim.setVehiclePose(currentPose);

% Reset velocity.
currentVel = 0; % meters/second
vehicleSim.setVehicleVelocity(currentVel);

% Initialize variables to store vehicle path.
refPath = [];
localPath = [];

% Setup pathAnalyzer to trigger update of local path every 1 second.
localPlanningFrequency = 1; % 1/seconds
pathAnalyzer.PlanningPeriod = 1/localPlanningFrequency/sampleTime; % timesteps

isGoalReached = false;

% Initialize count incremented each time the local planner is updated.
localPlanCount = 0; % Used for visualization only

showFigure(vehicleSim); % Show vehicle simulation figure.

while ~isGoalReached
    % Plan for the next path segment if near to the next path segment start
    % pose.
    if planNextSegment(behavioralPlanner, currentPose, 2*maxLocalPlanDistance)
        % Request next maneuver from behavioral layer.
        [nextGoal, plannerConfig, speedConfig] = requestManeuver(behavioralPlanner, ...
            currentPose, currentVel);

        % Plan a reference path using A* planner to the next goal pose.
        if isempty(refPath)
            nextStartRad = [currentPose(1:2) deg2rad(currentPose(3))];
        else
            nextStartRad = refPath(end,:);
        end
        nextGoalRad = [nextGoal(1:2) deg2rad(nextGoal(3))];
        newPath = plan(motionPlanner, nextStartRad, nextGoalRad, SearchMode="exhaustive");

        % Check if the path is valid. If the planner fails to compute a path,
        % or the path is not collision-free because of updates to the map, the
        % system needs to re-plan. This scenario uses a static map, so the path
        % will always be collision-free.
        isReplanNeeded = ~checkPathValidity(newPath.States, costmap);
        if isReplanNeeded
            warning("Unable to find a valid path. Attempting to re-plan.")

            % Request behavioral planner to re-plan.
            replanNeeded(behavioralPlanner);
        else
            % Append to refPath.
            refPath = [refPath; newPath.States];
            hasNewPath = true;

            vehicleSim.plotReferencePath(refPath); % Plot reference path
        end
    end

    % Update the local path at the frequency specified by
    % |localPlanningFrequency|.
    if pathUpdateNeeded(pathAnalyzer)
        currentPose = getVehiclePose(vehicleSim);
        currentPoseRad = [currentPose(1:2) deg2rad(currentPose(3))];
        currentVel = getVehicleVelocity(vehicleSim);
        currentAngVel = getVehicleAngularVelocity(vehicleSim);

        % Do local planning.
        localPlanner.Map = getLocalMap(costmap, currentPose, maxLocalPlanDistance);

        if hasNewPath
            localPlanner.ReferencePath = refPath;
            hasNewPath = false;
        end

        [localVel, ~, localPath, ~] = localPlanner(currentPoseRad, [currentVel currentAngVel]);

        vehicleSim.plotLocalPath(localPath); % Plot new local path

        % For visualization only
        if mod(localPlanCount, 20) == 0
            snapnow; % Capture state of the figures
        end
        localPlanCount = localPlanCount+1;

        % Configure path analyzer.
        pathAnalyzer.RefPoses = [localPath(:,1:2) rad2deg(localPath(:,3))];
        pathAnalyzer.Directions = sign(localVel(:,1));
        pathAnalyzer.VelocityProfile = localVel(:,1);
    end

    % Find the reference pose on the path and the corresponding velocity.
    [refPose, refVel, direction] = pathAnalyzer(currentPose, currentVel);

    % Update driving direction for the simulator.
    updateDrivingDirection(vehicleSim, direction);

    % Compute steering command.
    steeringAngle = lateralControllerStanley(refPose, currentPose, currentVel, ...
        Direction=direction, Wheelbase=vehicleDims.Wheelbase, PositionGain=4);

    % Compute acceleration and deceleration commands.
    lonController.Direction = direction;
    [accelCmd, decelCmd] = lonController(refVel, currentVel);

    % Simulate the vehicle using the controller outputs.
    drive(vehicleSim, accelCmd, decelCmd, steeringAngle);

    % Get current pose and velocity of the vehicle.
    currentPose = getVehiclePose(vehicleSim);
    currentVel = getVehicleVelocity(vehicleSim);

    % Check if the vehicle reaches the goal.
    isGoalReached = helperGoalChecker(nextGoal, currentPose, currentVel, ...
        speedConfig.EndSpeed, direction);

    % Wait for fixed-rate execution.
    waitfor(controlRate);
end

Simulation:
Conclusion:
1. With this experiment, we understood the working of a hybrid RL system: RL is used only for the parking maneuver, which is the difficult task, instead of also being used for moving around the parking lot.
2. The car was able to park safely based on its distance from obstacles.
3. If we also implemented PPO for the task of moving around the lot and increased the camera range so that a much larger area could be perceived, we could reduce the time required to park. It would be as if the car had a bird's-eye view of the parking lot and knew where the free spot is beforehand.
