Object Detection and Segmentation

Abstract

Self-driving is an emerging trend, and the work here is multi-task: object detection, tracking, segmentation, and bounding-box experiments are performed to reach the desired results. Many challenges explain why self-driving cars were still not on the market up to 2021. This research implements the You Only Look Once family YOLOv3 algorithm, which uses the Darknet-53 CNN, to detect pedestrians, cars, trucks, cyclists, and motorbikes. The model is trained on the KITTI image dataset, which was collected from public roads using a vehicle's front-facing camera. The algorithm is tested, and the detection results are presented. Road-object images captured with a mobile camera are also tested to check the results.

Keywords: self-driving, object detection, object tracking

Contents
Declaration.............................................................................................................................ii

1. Introduction......................................................................................................................i

1.1. Artificial Intelligence.......................................................................................................2

1.1.1. Digital Image processing.........................................................................................2

1.1.2. Machine learning....................................................................................................3

1.1.3. Deep Learning.........................................................................................................3

1.1.4. Self-Driving Car.......................................................................................................4

1.1.5. Automation.............................................................................................................6

1.1.5.1. Level of Automation........................................................................................6

1.1.5.1.1. Level 0 (No Driving Automation)..................................................................6

1.1.5.1.2. Level 1 (Driver Assistance)...........................................................................6

1.1.5.1.3. Level 2 (Partial Driving Automation)............................................................7

1.1.5.1.4. Level 3 (Conditional Driving Automation)....................................................7

1.1.5.1.5. Level 4 (High Driving Automation)...............................................................7

1.1.5.1.6. Level 5 (Full Driving Automation).................................................................8

1.1.6. Software architecture of ADS.................................................................................8

1.1.6.1. Perception.......................................................................................................8

1.1.6.2. Localization.....................................................................................................9

1.1.6.3. Connector.......................................................................................................9

1.1.6.4. Predictions......................................................................................................9

1.1.6.5. Communications.............................................................................................9

1.1.6.6. Hardware controllers....................................................................................10

1.1.6.7. Sensors..........................................................................................................10

1.1.6.8. Camera..........................................................................................................10

1.1.6.9. LIDAR............................................................................................................11

1.1.6.10. Controls.........................................................................................................11

1.1.7. Computer Vision...................................................................................................12

1.1.7.1. Object Detection...........................................................................................12

1.1.7.2. Object Tracking.............................................................................................12

1.1.7.3. Bounding Box................................................................................................13

1.1.7.4. Segmentation................................................................................................13

1.2. Statement of the Problem............................................................................................14

1.3. Limitation......................................................................................................................14

2. Literature Review...........................................................................................................16

3. Material and Methods/Model and Equations/Modeling.................................................40

3.1. Experimental Setup......................................................................................................40

3.2. Dataset.........................................................................................................................40

3.3. Kitti 3D Calib File...........................................................................................................40

3.4. U-Net (Convolutional Networks for Biomedical Image Segmentation)........................40

3.4.1. U net architecture.................................................................................................41

3.4.2. Object detection (u-net).......................................................................................42

3.4.2.1. Data from all directories, and retrieve a list of masks and images................43

3.4.3. Segmentation (unet).............................................................................................43

3.5. YOLO3NET.....................................................................................................................44

3.5.1. Anchor box............................................................................................................45

3.5.2. Object detection and tracking..............................................................................45

3.6. Measures......................................................................................................................46

3.6.1. True Positives (TP)................................................................................................46

3.6.2. True Negatives (TN)..............................................................................................46

3.6.3. False Positives (FP)................................................................................................46

3.6.4. False Negatives (FN)..............................................................................................46

3.6.5. Precision...............................................................................................................47

3.6.6. Recall....................................................................................................................47

3.6.7. F1 Score................................................................................................................47

3.6.8. Accuracy...............................................................................................................48

3.6.9. Average precision (AP)..........................................................................................48

3.6.10. Intersection over Union (IOU)...............................................................................48

4. Results and Discussions..................................................................................................50

4.1. Segmentation (u-net)...................................................................................................50

4.2. Object tracking using u-net...........................................................................................53

4.2.1. Draw 3D................................................................................................................53

4.2.2. Bounding box........................................................................................................54

4.2.3. Image Data Generator..........................................................................................55

4.3. Object detection and tracking......................................................................................58

4.3.1. Comparisons.........................................................................................................63

5. Conclusion.....................................................................................................................64

References............................................................................................................................65

List of Figures

Figure 1.1 Flow Diagram of Proposed Study..................................................................................2


Figure 1.2 self-driving car..............................................................................................................5
Figure 1.3 level of Automation......................................................................................................6
Figure 1.5 software architecture of self-driving.............................................................................8
Figure 1.6 original image 1..........................................................................................................13
Figure 1.7 object tracking.............................................................................................13
Figure 1.8 original image 2..........................................................................................................13
Figure 1.9 bounding box..............................................................................................................13
Figure 1.10 original image 3........................................................................................................14
Figure 1.11 segmentation............................................................................................................14
Figure 3.1 U-net original architecture..........................................................................................41
Figure 3.2 code Mask generation.................................................................................
Figure 3.3 IOU..............................................................................................................49
Figure 4.1 architecture of proposed model...................................................................51
Figure 4.2 result segmentation epoch 1......................................................................................52
Figure 4.3 segmentation result epoch 2......................................................................................52
Figure 4.4 segmentation result epoch 15....................................................................................52
Figure 4.5 segmentation result epoch 20....................................................................................53
Figure 4.6 original image with segmentation result....................................................................53
Figure 4.7 mask image..................................................................................................55
Figure 4.8 result mask generation...............................................................................................55
Figure 4.9 result with mask.........................................................................................................56
Figure 4.10 object detection result 1.............................................................................56
Figure 4.11 object detection result 2...........................................................................................56
Figure 4.12 Accuracy graph.........................................................................................................57
Figure 4.13 result yolo3 with text file..........................................................................................60

List of Tables

Table 1.1 Advantages and Disadvantage of self-driving (Dua et al., 2019)....................................5


Table 4.1 Detection results for KITTI dataset using YOLOv3.....................................60
Table 4.2 Precision and recall value for the class.........................................................................61

1. Introduction
The principal use of AI for self-driving cars goes back to the DARPA (Defense Advanced Research Projects Agency) self-driving car challenge (Ozguner, Stiller, & Redmill, 2007), which was won by Stanley, the Stanford University Racing Team's autonomous car. The winning team, led by Sebastian Thrun, an associate professor of computer science and director of the Stanford Artificial Intelligence Laboratory, attributed the victory to the use of machine learning (Bar-Cohen & Hanson, 2009; Chai, Nie, & Becker, 2020; Jahromi, 2019). Stanley was equipped with multiple sensors and backed by bespoke software, including ML algorithms, which helped the vehicle find a path, detect obstacles, and avoid them while staying on course (Lin, 2016). Thrun later led the 'Self-Driving Car Project' at Google, which eventually became Waymo in 2016 (National Academies of Sciences & Medicine, 2017; Owczarek, 2018).

Waymo has been extensively leveraging AI to make fully autonomous driving a reality (D. Feng et al., 2020). The company's engineers collaborated with the Google Brain team to apply deep neural networks (DNNs) in its pedestrian detection system (Topol, 2019). Artificial intelligence can respond rapidly to real-world data points produced by hundreds of diverse sensors (Sterne, 2017). The AI application and software in the car are connected to all of the sensors and gather input from Google Street View and video cameras inside the car (Guerrero-Ibáñez, Zeadally, & Contreras-Castillo, 2018). The AI simulates human perceptual and decision-making processes using deep learning and controls actions in driver control systems, such as steering and brakes (Hoel, Wolff, & Laine, 2018; X. Zhang, Zhou, Liu, & Hussain, 2020).


Figure 1.1 Flow Diagram of Proposed Study

1.1. Artificial Intelligence


Artificial intelligence is the science aimed at providing technologies with the capacity to perform functions such as logic, reasoning, planning, learning, and perception. Despite the reference to "machines" in this definition, it can be applied to any type of living intelligence (Shabbir & Anwer, 2018). Similarly, the notion of intelligence, as it is found in humans and other intelligent animals, can be extended to include an additional set of capacities, including emotional intelligence, creativity, and self-awareness (Hudson, 2019; Perez, Deligianni, Ravi, & Yang, 2018). It is an area of technology that focuses on how to build a machine that can work like a human: collecting and analysing data and deciding what to do, all by itself (Jarrahi, 2018; M. Wang, Cui, Wang, Xiao, & Jiang, 2017).

1.1.1. Digital Image Processing


Digital image processing is a technique for performing several operations on an image, in order to obtain an enhanced image or to extract specific useful information from it. It is a form of signal processing in which the input is an image and the output may be an image or characteristics associated with that image (Steer, 2001). Today, image processing is among the most rapidly growing technologies (Schowengerdt, 2006). Its methods are an essential research area within the engineering and computer science disciplines as well (Iivari, 2005). Image processing is a method of converting an image into digital form and performing some operations on it, in order to get an enhanced image or to extract certain valuable data from it (Bankman, 2008). Typically, an image processing system treats images as two-dimensional signals and applies established signal processing methods to them (Jalled & Voronkov, 2016). Image processing consists of the manipulation of images by means of computers. Its use has been growing exponentially over the last decades (Richards & Richards, 1999). Its applications range from entertainment to medicine, passing through geological processing and remote sensing. Multimedia systems, one of the pillars of the modern information society, rely greatly on digital image processing (Umbaugh, 2010).
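As a small, hedged illustration of these ideas (not part of the thesis pipeline), the sketch below uses the OpenCV library to read an image, convert it to grayscale, smooth it, and extract edges; the file names are placeholders.

```python
# Minimal image-processing sketch using OpenCV (opencv-python assumed installed).
# "road.png" is a placeholder file name, not part of the thesis dataset.
import cv2

image = cv2.imread("road.png")                  # load the image as a BGR array
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # treat the image as a 2D signal
blurred = cv2.GaussianBlur(gray, (5, 5), 0)     # suppress noise before edge extraction
edges = cv2.Canny(blurred, 50, 150)             # extract edge features from the image
cv2.imwrite("road_edges.png", edges)            # output is another image (characteristics)
```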

1.1.2. Machine learning


ML is a data analytics technique that teaches computers to do what comes naturally to humans and animals: learn from experience. Machine learning algorithms use computational methods to learn information directly from data without relying on a predetermined equation as a model (Moubayed, Injadat, Nassif, Lutfiyya, & Shami, 2018). The models adaptively improve their performance as the number of samples available for learning grows. It is an area of AI in which a computer system receives data from several sources, such as human voices and images, and then decides what to do with those data to perform one or several tasks. Deep learning is a subset of machine learning, and machine learning is a subset of artificial intelligence (Lemley, Bazrafkan, & Corcoran, 2017).

1.1.3. Deep Learning


Deep learning is a subclass of machine learning that uses multi-layered artificial neural networks (ANNs) to deliver state-of-the-art accuracy in tasks such as object detection and tracking, speech recognition, and language translation (Voulodimos, Doulamis, Doulamis, & Protopapadakis, 2018). Deep learning differs from traditional ML methods in that it can automatically learn representations from data such as images (Wei, Ding, Su, Tang, & Zou, 2018), text, and video, without introducing hand-coded rules or human domain knowledge (Hemberg, Kelly, & O'Reilly, 2019). Its highly flexible architectures can learn directly from raw data and can increase their predictive accuracy when provided with more data. Computer vision applications use deep learning to gain knowledge from digital images and video (Tan & Lim, 2018). Conversational AI applications help computers understand and communicate through natural language. Recommendation systems use images, language, and a user's interests to offer meaningful and relevant search results and services (Montenegro, da Costa, & da Rosa Righi, 2019).

1.1.4. Self-Driving Car


The demand for autonomous vehicles (AVs) is growing in both the public and the commercial sectors (Bennett, Vijaygopal, & Kottasz, 2019; Bissell, Birtchnell, Elliott, & Hsu, 2020; Nordhoff, Van Arem, & Happee, 2016). In the public sector, people demand safe and less time-consuming means of transportation and on-demand services that would be available within minutes (Parmar, Das, & Dave, 2020; Sieber, Ruch, Hörl, Axhausen, & Frazzoli, 2020). In the private sector, the motivation arises mainly from the potential to increase the consistency and utilization of transportation vehicles. For example, a driverless truck can be driven approximately 24 hours a day, while human-driven vehicles have to stop for the driver to rest (Badue et al., 2020; Hancock, Nourbakhsh, & Stewart, 2019; Mounce & Nelson, 2019).


Figure 1.2 self-driving car


By definition, "Autonomous vehicles are those in which operation of the vehicle occurs without direct driver input to control the steering, acceleration, and braking, and are designed so that the driver is not expected to constantly monitor the roadway while operating in self-driving mode" (U.S. Department of Transportation policy on automated vehicle development, 2013) (Biondi, Alvarez, & Jeong, 2019; Carsten & Martens, 2019; Klomp et al., 2019). Various types of automobile automation (for example, driver assistance, partial or full autonomy) can increase mobility, prevent accidents entirely, or reduce their severity (Gkartzonikas & Gkritza, 2019; van Wyk, Khojandi, & Masoud, 2020). However, it is not clear whether consumers fully understand the complexity of automobile automation (Bellet et al., 2019; Boelhouwer et al., 2020). It is also not clear if and how these developments can be used to meet the mobility needs of consumers. It should be evident that older people can experience very significant advantages from using self-driving cars (Ryan, 2019; Salonen & Haavisto, 2019). Nevertheless, it can be noticed that the willingness to use automation in cars differs: while younger people tend to be ready for autonomous vehicles, older consumers show little willingness to use AVs (Jing, Xu, Chen, Shi, & Zhan, 2020; S. Wang & Zhao, 2019; T. Zhang et al., 2019). Therefore, it is crucial to understand what kind of training for new automobile technologies consumers currently receive and which technologies are required in the future in order to satisfy consumers (Dua, White, & Lindland, 2019; Saeed, Burris, Labi, & Sinha, 2020).
Table 1.1 Advantages and disadvantages of self-driving (Dua et al., 2019).

Advantages                                    Disadvantages
1. Decreases the number of accidents          1. Expensive
2. Lessens traffic jams                       2. Safety and security concerns
3. Stress-free parking                        3. Prone to hacking
4. Time-saving vehicle                        4. Non-functional sensors
5. Accessibility to transportation            5. Fewer job opportunities for others


1.1.5. Automation
Automation is an area of technology in which a machine performs one or several tasks without human assistance.
1.1.5.1. Level of Automation
SAE International defines six levels of vehicle automation, depending upon the amount of attention required from a human driver. In levels zero through two, a human drives and monitors the traffic environment; in levels three through five, the automated system drives and monitors the environment (Dokic, Müller, & Meyer, 2015; Sousa, Almeida, Coutinho-Rodrigues, & Natividade-Jesus, 2018).

Figure 1.3 level of Automation

1.1.5.1.1. Level 0 (driver only)


Most automobiles on the road today are Level 0: manually controlled. An example would be the emergency braking system; since it does not technically drive the automobile, it does not qualify as automation (Muslim & Itoh, 2019).
1.1.5.1.2. Level 1 (driver Assistance)
The automobile features a single automated system for driver assistance, such as steering or adaptive cruise control (ACC). ACC, in which the automobile can be kept at a safe distance behind the car ahead, qualifies as Level 1 because the human driver monitors the other aspects of driving, such as steering and braking (Kukkala, Tunnell, Pasricha, & Bradley, 2018).


1.1.5.1.3. Level 2 (Partial Automation)

The advanced driver-assistance system (ADAS) in the automobile can control both steering and accelerating or decelerating. The automation still falls short of self-driving because a human sits in the driver's seat and can take control of the car at any time. Driver-assistance packages such as (General Motors') Super Cruise qualify as Level 2 (S. Feng & Haykin, 2020).
1.1.5.1.4. Level 3 (Conditional Automation)
The jump from Level 2 to Level 3 is substantial from a technological perspective: Level 3 vehicles have environmental detection capabilities and can make informed decisions for themselves, such as accelerating past a slow-moving car. But these automobiles still require human override. The driver must remain alert and ready to take control if the system is unable to execute the task (Di Palma, Galdi, Calderaro, & De Luca, 2020).

1.1.5.1.5. Level 4 (High Driving Automation)


The important change between Level 3 and Level 4 automation is that Level 4 vehicles can intervene (Young & Stanton, 2007) if things go wrong or a system failure occurs. In this sense, these cars do not require human interaction in most conditions. However, a human still has the option to manually override (Stanovich, 2018).

Level 4 vehicles can operate in self-driving mode. But until legislation and infrastructure evolve, they can only do so within a limited area (usually an urban environment where top speeds reach an average of 30 mph). This is known as geofencing. As such, most Level 4 vehicles in existence are geared toward ridesharing (Sagir, 2020). For example:

 NAVYA, a French company, is already building and selling Level 4 shuttles and taxis in the U.S.A. that run fully on electric power and can reach a maximum speed of 55 mph (Wrenn, 2017).
 Waymo recently unveiled a Level 4 self-driving taxi service in Arizona, where it had been testing driverless cars, without a safety driver in the seat, for more than a year and over 10 million miles (Singh & Saini, 2021; Stocker & Shaheen, 2019).


1.1.5.1.6. Level 5 (Full Automation)


Fully autonomous cars drive freely without any human interaction and are undergoing testing in several parts of the world, but none are yet available to the general public (Arvin, Kamrani, Khattak, & Rios-Torres, 2018; Samii & Zinner, 2018).

Figure 1.4 software architecture of self-driving

1.1.6. Software architecture of ADS


1.1.6.1. Perception
In autonomous driving, vehicles need to make decisions based on the
perception of the surrounding environment, including static and dynamic
obstacles (Artuñedo, Godoy, & Villagra, 2019; D. Feng et al., 2020; Lu et al.,
2019). The perception system performance will greatly affect the system’s
overall ability and robustness (Markolf, Hoehne, Fraser, Chester, &
Underwood, 2019; Rosique, Navarro, Fernández, & Padilla, 2019). There are
three main types of sensors used by autonomous vehicle platforms: radar,
LIDAR and cameras. Radar has been widely used in automotive applications
for decades. It has the ability to detect obstacles’ positions and speeds
directly. However, radar outputs are usually not informative enough to estimate obstacle shape (Y. Li & Ibanez-Guzman, 2020; S. Liu et al., 2019; Major et al., 2019; Marti, de Miguel, Garcia, & Perez, 2019).
1.1.6.2. Localization
The localization module is one of the crucial parts of a self-driving system. However, the proposed benchmark currently contains only a single node; the plan is to provide multiple nodes in the future (Ferranti et al., 2019; Tokunaga, Ota, Tange, Miura, & Azumi, 2019).
1.1.6.3. Connector
The connector node determines the velocity and pose of the vehicle used for subsequent processing. Autoware can calculate the vehicle velocity and pose in multiple ways, such as using Controller Area Network (CAN) information, obtaining them from the generated path, or estimating them from localization. These nodes output the computed velocity and pose as the current values (Olufowobi, Young, Zambreno, & Bloom, 2019).
1.1.6.4. Predictions
The prediction module examines the motion patterns of other traffic agents and forecasts the autonomous car's (AC's) future trajectories, which enables the AC to make appropriate navigation decisions (R. Fan et al., 2019). Recent prediction approaches can be grouped into two main categories: data-driven and model-based. The model-based approach computes the AC's future motion by propagating its kinematic state (position, speed, and acceleration) over time, based on the underlying physical system kinematics and dynamics (Z. Wang, 2018). Using map information as a constraint to compute the next AC location works well for short-term predictions, but its performance degrades over longer horizons, as it disregards nearby context such as roads and traffic rules (R. Fan et al., 2019). Furthermore, a pedestrian motion prediction model can be formed based on attractive and repulsive forces (Xue, Huynh, & Reynolds, 2019).
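To make the kinematic propagation idea concrete, here is a minimal sketch, assuming a 2D state and constant acceleration over the prediction horizon; the function name and numbers are invented for illustration and are not part of the thesis pipeline.

```python
import numpy as np

def propagate_state(position, velocity, acceleration, dt, steps):
    """Propagate a 2D kinematic state forward assuming constant acceleration."""
    trajectory = []
    p, v = np.asarray(position, float), np.asarray(velocity, float)
    a = np.asarray(acceleration, float)
    for _ in range(steps):
        p = p + v * dt + 0.5 * a * dt ** 2   # position update
        v = v + a * dt                        # velocity update
        trajectory.append(p.copy())
    return np.array(trajectory)

# Example: a car at the origin moving at 10 m/s along x, braking gently.
future = propagate_state([0.0, 0.0], [10.0, 0.0], [-1.0, 0.0], dt=0.1, steps=30)
print(future[-1])  # predicted position 3 seconds ahead
```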
1.1.6.5. Communications
Autonomous vehicles require various sensors such as LIDAR, radar, cameras, etc. to understand their surroundings; however, such sensors can easily be obstructed by nearby obstacles, preventing long-range detection (Schoettle, 2017). This can be improved by using sensor data from other vehicles equipped with a vehicular communication device (Eze, Zhang, Liu, & Eze, 2018). The main communication modes are (Gora & Rüb, 2016):

1. V2V (vehicle-to-vehicle): automobiles can "talk" to each other (Demba & Möller, 2018).
2. V2I (vehicle-to-infrastructure): automobiles can send information to the infrastructure (Xu, Li, & Xi, 2019).
3. I2V (infrastructure-to-vehicle): automobiles can receive information from the infrastructure (P. Liu & Fan, 2021).
4. V2P (vehicle-to-pedestrian): communications for pedestrian protection, conducted over Wi-Fi where applicable (de Almeida, Ribeiro Júnior, Campista, & Costa, 2020).
5. V2X (vehicle-to-everything): refers to the passing of information from a vehicle to any entity (Arena, Pau, & Severino, 2020).
6. V2N (vehicle-to-network): a vehicle accesses the network for cloud-based services (Elagin, Spirkina, Buinevich, & Vladyko, 2020).
1.1.6.6. Hardware controllers
AC hardware controllers act on components such as the steering motor and gear shifter (Yokoyama, Nishino, Matsubara, Chikuma, & Hashida, 2019). The vehicle states, such as wheel speed and steering angle, are sensed automatically and sent to the computer system via a CAN (Controller Area Network) bus. This enables either the human driver (HD) or the ADS to control the throttle, brake, and steering wheel (Goldfain et al., 2019).
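As a hedged illustration of reading such CAN frames in software, the sketch below uses the python-can package; the channel name, bus type, and arbitration ID are assumptions made for the example, not values from this study or any particular vehicle.

```python
# Hedged sketch: reading vehicle-state frames from a CAN bus with python-can.
# The channel, bustype, and WHEEL_SPEED_ID below are illustrative assumptions.
import can

bus = can.interface.Bus(channel="can0", bustype="socketcan")

WHEEL_SPEED_ID = 0x123  # hypothetical arbitration ID for a wheel-speed frame

for _ in range(100):
    msg = bus.recv(timeout=1.0)          # blocking read with a 1 s timeout
    if msg is None:
        continue
    if msg.arbitration_id == WHEEL_SPEED_ID:
        # Decoding the payload depends entirely on the vehicle's DBC definition;
        # here we only print the raw bytes.
        print(msg.timestamp, msg.data.hex())
```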
1.1.6.7. Sensors
Autonomous cars and current human-driven vehicles carry sensors that gather information about the vehicle's motion and its environment. For example, Google's driverless vehicles combine active (radar or LIDAR) and passive (camera) sensors (Campbell et al., 2018).
1.1.6.8. Camera
In the perception system of autonomous vehicles, from the point of view of the wavelength received by the device, cameras can be classified as visible (VIS) or infrared (Shahjalal, Hasan, Chowdhury, & Jang, 2019). The element used by the camera to capture a scene is known as an imaging sensor and has traditionally been implemented with two technologies: charge-coupled devices (CCD) and complementary metal oxide semiconductor (CMOS) devices (Edgar, Gibson, & Padgett, 2019; Gove, 2020). CCD image sensors are made with an expensive manufacturing process that confers unique properties such as high quantum efficiency and low noise. CMOS was developed to reduce the cost of manufacturing at the expense of reduced performance (Nobis; Stanley-Marbell et al., 2020). The design of the architecture for extracting the luminosity values allows the selection and processing of regions of interest (ROI); furthermore, a CMOS device has lower power consumption than a CCD. These characteristics make CMOS the most used technology for mobile devices (Albert Smet, 2018; Sargın Güçlü, 2019).
1.1.6.9. LIDAR
LIDAR refers to a light detection and ranging device (Colaço, Molin, Rosell-Polo, & Escolà, 2018), which emits millions of light pulses per second in a well-designed pattern (Milovanović, Kukolj, & Nemet). Through its rotating axis, it is capable of building a dynamic, three-dimensional map of the environment. In a real scene, the points returned by the LIDAR are never perfect (Pendleton et al., 2017). The difficulties in handling LIDAR points lie in scan point sparsity, missing points, and unorganized patterns. The surrounding environment also adds more challenges to perception, as the surfaces may be arbitrary and erratic. Sometimes it is even difficult for human beings to perceive useful information from a visualization of the scan points (Shaukat, Blacker, Spiteri, & Gao, 2016).
1.1.6.10. Controls
Automobile control essentially covers car speed and path control (Olivares-Mendez et al., 2016). Commonly, the inputs to car control are the automobile's perception of its situation and the state of the car's control method. To achieve the intended vehicle speed and direction, the environment information from perception, the vehicle status, the driving goal, traffic rules, and driving knowledge are fed into the perception component; the vehicle control algorithm then computes the controller targets, which are passed to the vehicle control system. Finally, the car control system executes these instructions to control the automobile's path, speed, lights, and horn (J. Zhao, Liang, & Chen, 2018).

1.1.7. Computer Vision


Computer vision works on enabling computers and machines to see, recognize, and process images in the same way that human vision does, and then to deliver appropriate output (Parker, 2010). It is akin to conveying human intelligence and instincts to a computer. In reality, though, it is a difficult task to enable machines and computers to recognize images of various objects. Computer vision is closely related to AI, as the computer must understand what it sees and then perform suitable learning or act accordingly (Tuomi, 2018).
1.1.7.1. Object Detection
Object detection is a necessary component in a self-driving car. Humans can sense and recognize the objects around them within a fraction of a second. The self-driving car should likewise detect and differentiate between objects such as humans, vehicles, traffic lights, roads, etc. (Z.-Q. Zhao, Zheng, Xu, & Wu, 2019).
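As a hedged sketch of what the detection step can look like in code (an illustration, not the training or evaluation code of this thesis), the example below runs a Darknet-format YOLOv3 model through OpenCV's DNN module; the configuration, weights, and image file names are placeholder assumptions.

```python
# Hedged sketch: object detection with a Darknet YOLOv3 model via OpenCV DNN.
# "yolov3.cfg", "yolov3.weights", and "road.png" are placeholder paths.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
layer_names = net.getLayerNames()
out_layers = [layer_names[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]

image = cv2.imread("road.png")
h, w = image.shape[:2]
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

boxes, scores, class_ids = [], [], []
for output in net.forward(out_layers):
    for det in output:                      # det = [cx, cy, bw, bh, objectness, class scores...]
        class_scores = det[5:]
        cid = int(np.argmax(class_scores))  # index into the class-name list of the model
        conf = float(class_scores[cid])
        if conf > 0.5:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
            class_ids.append(cid)

# Non-maximum suppression removes overlapping duplicate boxes.
keep = cv2.dnn.NMSBoxes(boxes, scores, score_threshold=0.5, nms_threshold=0.4)
print(len(keep), "objects detected")
```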
1.1.7.2. Object Tracking
Object tracking is one of the important tasks in computer vision; it tries to detect and track objects in image sequences (Habibi, Sulistyaningrum, & Setiyono, 2017).


Figure 1.5 original image 1

Figure 1.6 object tracking


1.1.7.3. Bounding Box
A bounding box is an imaginary rectangle that serves as a point of reference for object detection and creates a collision box for each object (Skubic et al., 2004). Data annotators draw these rectangles over images, outlining the object of interest within each image by defining its X and Y coordinates (Sultana, Sufian, & Dutta, 2019).
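A minimal sketch of drawing one such labelled rectangle is shown below, assuming OpenCV; the coordinates and the class label are invented for illustration (KITTI labels store boxes as left, top, right, bottom pixel coordinates in a similar spirit).

```python
# Hedged sketch: drawing one annotated bounding box. The coordinates and label
# below are invented; "road.png" is a placeholder image path.
import cv2

image = cv2.imread("road.png")
x1, y1, x2, y2 = 120, 200, 260, 330          # top-left and bottom-right corners
cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(image, "car", (x1, y1 - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cv2.imwrite("road_with_box.png", image)
```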

Figure 1.7 original image 2

Figure 1.8 bounding box


1.1.7.4. Segmentation
Image segmentation is a fundamental topic in image processing, with applications in, for example, medical image analysis, robotic perception, scene understanding, video surveillance, augmented reality, and image compression (Minaee et al., 2020), among many others. Image segmentation is a vital component in many visual understanding systems. It involves partitioning images and video frames into multiple object segments (Ahmed, Ahmad, Khan, & Asif, 2020).
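As a hedged sketch, the snippet below overlays a binary segmentation mask on its source image with OpenCV; the image and mask paths are placeholders, and the mask is assumed to be single-channel with non-zero pixels marking the segmented class.

```python
# Hedged sketch: visualizing a binary segmentation mask on the original image.
# "road.png" and "mask.png" are placeholder paths.
import cv2

image = cv2.imread("road.png")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

overlay = image.copy()
overlay[mask > 0] = (0, 255, 0)                          # paint segmented pixels green
blended = cv2.addWeighted(image, 0.6, overlay, 0.4, 0)   # semi-transparent overlay
cv2.imwrite("segmentation_overlay.png", blended)
```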

Figure 1.9 original image 3

Figure 1.10 segmentation

1.2. Statement of the Problem


Self-driving is a hot topic in the technology field. The body of work on this trend is large; many problems have been solved by state-of-the-art models, yet in some cases these models still fail. Some open questions related to this technology, on which we are working, are the following:
 Does the model predict every proposed class accurately?
 Can a low rate of false negatives and false positives be achieved, to enhance road safety and enable autonomous driving?

1.3. Limitation
This proposed study of object detection and tracking for self-driving does not build self-driving hardware and does not use any sensors. The study only uses an existing self-driving dataset for object detection and tracking. However, to test the proposed model, additional images were collected with a mobile camera.

1.4. Objective
Self-driving cars are the future. A car needs object detection to perceive its surroundings. The main objective is to improve road safety.
 Increase the true positive and true negative rates for each object class.


2. Literature Review
Previous approaches to this problem suffer either from an overly complex inference engine or from insufficient detection accuracy. To deal with these issues, the authors present SS3D, a single-stage monocular 3D object detector. The framework consists of a CNN, which outputs a redundant representation of each relevant object in the image with corresponding uncertainty estimates, and a 3D bounding box optimizer. The method achieves state-of-the-art accuracy on monocular 3D object detection while running at 20 fps in a straightforward implementation. The proposed methods are evaluated primarily on the KITTI object detection benchmark, a set of image sequences with annotated 2D and 3D bounding boxes for a small number of object categories (Jörgensen, Zach, & Kahl, 2019).

The new task of Multi-Object Tracking and Segmentation (MOTS) extends the bounding-box-level annotations of the KITTI tracking dataset. To facilitate training and evaluation, the 21 training sequences of the KITTI tracking dataset were divided, and the new annotations comprise 65,213 pixel masks for 977 distinct objects (cars and pedestrians) in 10,870 video frames. TrackR-CNN generates its own detections. It is also unclear how much other methods were tuned to perform well on the MOTSChallenge training set, for which masks were generated using a Mask R-CNN fine-tuned on MOTSChallenge (Voigtlaender et al., 2019).

The method can be used with any camera-based object detector, and the technique is illustrated on several sets of real-world data. The authors show that a state-of-the-art detector, a tracker, and their classifier trained only on synthetic data can identify valid errors on the KITTI tracking dataset with an Average Precision of 0.94. They also release a new tracking dataset with 104 sequences totaling 80,655 labeled pairs of stereo images, along with ground truth disparity from a game engine, to facilitate further research. With the proposed features, an off-the-shelf random forest classifier achieves an AP score of 0.93 on the GTA dataset for the RRC detector. Furthermore, the system (detector, tracker, and classifier) trained only on synthetic data can find errors made in the KITTI dataset with an AP score of 0.94 for the RRC detector (Ramanagopal, Anderson, Vasudevan, & Johnson-Roberson, 2018).

This work proposed a Detection Aware 3D Semantic Segmentation (DASS) network to tackle limitations of current architectures. It provides a pipeline that uses DASS to generate high-recall proposals for existing two-stage detectors and demonstrates that the added supervisory signal can be used to improve 3D orientation estimation capabilities. Extensive experiments on both the SemanticKITTI and KITTI object datasets show that DASS can improve the 3D semantic segmentation results of geometrically similar classes by up to 37.8% IoU in the image FOV while maintaining high-precision bird's-eye view (BEV) detection results (Unal, Van Gool, & Dai, 2021).

The objective of multi-object tracking (MOT) is to track multiple objects at the same time and estimate their current states, such as locations, velocities, and sizes, while maintaining their identities. The authors propose a multimodal MOT method that fuses the motion information and the deep appearance features of objects. The paper employs a 2D object detector, i.e., You Only Look Once (YOLOv3), and a 3D object detector, PointRCNN. A multimodal MOT method is proposed by fusing the motion information and the deep appearance features of the objects to achieve the MOT task. In addition, the proposed method obtains competitive qualitative and quantitative tracking results on the KITTI tracking benchmark (Shenoi et al., 2020).

With the objects to be detected becoming more complex, the problem of multi-scale object detection has attracted more and more attention, especially in the field of remote sensing. Early convolutional neural network detection algorithms are mostly based on artificially preset anchor boxes that divide different regions in the image and then obtain the prior position of the target. The authors build an AF-EMS detector using a ResNet101-FPN backbone and compare it with state-of-the-art detectors on the DOTA dataset. RPDet serves as the baseline, and further improvements are made to it to obtain better multi-scale detection performance. The new detector, based on the characteristics of the anchor-free detection framework, can effectively improve multi-scale object detection performance. On the remote sensing detection task, the performance degradation on GTF is due to the fact that the centers of some GTF targets are also the centers of SBF targets (Fu et al., 2020).
This paper introduced BDD100K, the largest driving video dataset, with 100K videos and 10 tasks to evaluate the progress of image recognition algorithms for autonomous driving. The paper's benchmarks comprise ten tasks: image tagging, lane detection, drivable area segmentation, road object detection, semantic segmentation, instance segmentation, multi-object detection tracking, multi-object segmentation tracking, domain adaptation, and imitation learning. The experiments provide extensive analysis of different multitask learning scenarios: homogeneous multi-task learning and cascaded multitask learning. The results present interesting findings about allocating the annotation budgets in multi-task learning (Yu et al., 2018).

The problem is to first predict salient features and then relate these features to driving decisions. The proposed model is a deep neural network that feeds features extracted from the input image to a recurrent neural network with an attention mechanism. The model is evaluated on the driver attention dataset BDD-A and the saliency dataset CAT2000. The proposed model produces promising results for explaining the relationship between saliency prediction and driving decisions. It also provides a holistic framework to be used as input for driving decisions. The predicted saliency map is also used in making a driving decision (braking). The proposed model has two main components: a Driver Attention Module and a Decision Module. The primary objective is to predict saliency for driving, so much of the effort goes into understanding saliency within the driving context (S. Zhao, Han, Zhao, & Wei, 2020).

This paper deals with the issue of estimating the confidence of a deep neural network in reaction to unexpected execution contexts, with the purpose of predicting possible safety-critical misbehaviours such as out-of-bound episodes or collisions. An unsupervised technique is implemented in a tool, SelfOracle, for probability distribution fitting and time series analysis, and the Udacity simulator is used to inject surprising driving circumstances. A dataset of 765 labeled simulation-based collision and out-of-bound episodes is provided. The promising results in online misbehaviour detection, combined with the availability of a labeled dataset of crashes and a simulation environment, advance novel approaches for online prediction and self-healing of automated driving systems (Stocco, Weiss, Calzana, & Tonella, 2020).

A novel approach is designed for tracking by detection, which exploits the power of structured prediction as well as deep neural networks. Towards this goal, the problem is formulated as inference in a deep structured model (DSM) on the KITTI dataset. The CNNs are initialized with the VGG16 weights pre-trained on ImageNet, and the fully connected layers that include the weights of the binary random variables (y) are initialized by sampling from a truncated normal distribution. Experimental evaluation on the challenging KITTI dataset shows that the approach is very competitive, outperforming the state of the art in MOTP (Frossard & Urtasun, 2018).

Self-driving has many challenges, but it attracts more and more attention, and many companies and universities have worked on self-driving cars. Self-driving involves many functions, such as sharing data with other autonomous cars. A threat to the automated system can disrupt all of its functionality and disturb the entire service system, causing crashes. This paper works on the Guardauto framework to protect against runtime failures. Guardauto proposes a model that separates the automated driving system and provides protection mechanisms for its components. Guardauto offers protection and cooperation among local systems, and the Guardauto prototype framework identifies the cause of failures in the system. Guardauto is implemented mostly in C++ and Python, and their range of libraries and tools is taken advantage of. Guardauto secures all the autonomous driving system components with both local and collaborative self-defense. Evaluation results show that the implemented prototype achieves the planned goals and can successfully detect and mitigate runtime threats.

This paper proposes a generic approach to learning a driving policy from demonstrated behaviors and formulates the problem as predicting future feasible actions. The Berkeley DeepDrive Video dataset (BDDV) is used to learn a novel FCN-LSTM architecture from driving behaviors. Given a video frame as input, a visual encoder can encode the visual information in a discriminative manner while maintaining the relevant spatial information. The ImageNet pre-trained AlexNet model is taken as a starting point, and dilated convolutions are used for conv3. The effectiveness of the Deep Generic Driving Networks driving model and of the learning is investigated by evaluating future ego-motion prediction on held-out sequences across diverse conditions (Gao, 2019).

The Cambridge-driving Labeled Video Database (CamVid) is a collection of videos with object-class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 classes; the classes chosen from the dataset are Sky, Building, Pole, Road, Pavement, Tree, Sign Symbol, Fence, Car, Pedestrian, and Bicyclist. The model is trained for 40 epochs and reaches a training mean pixel accuracy of 93% and a validation mean pixel accuracy of 88%. The authors also propose their own attention module to enlarge the receptive field and encode more contextual information. Multi-scale feature maps are merged using a concat operator to encode more contextual information. The loss function, optimization details, ablation studies, and evaluation metrics used are presented. The network achieves a mean IOU value of 74.12, which is better than the previous state of the art on semantic segmentation, while running at more than 100 FPS (Sagar & Soundrapandiyan, 2020).

This research addresses the important self-driving problem of forecasting multi-pedestrian motion and the shared scene occupancy map, which is critical for safe navigation, on two large-scale real-world datasets, nuScenes and ATG4D. The contributions are two-fold. The model captures the interaction among multiple pedestrians and takes the scene-level occupancy information into consideration so that it is aware of missed detections. The object detector is operated at high precision (90%), and scene occupancy forecasting guarantees high recall. Experimental results indicate that SA-GNN achieves state-of-the-art performance (Luo et al., 2021).
Multi-object tracking (MOT) is an essential component technology for many vision applications such as autonomous driving. To predict the object state in the next frame, the inter-frame displacement of objects is approximated using a constant-velocity model, independent of camera ego-motion. The Hungarian algorithm is then applied to an affinity matrix for the data association. The method takes second place on the official KITTI 2D MOT leaderboard among all published works while achieving the fastest speed, suggesting that simple and accurate 3D MOT can lead to very good results in 2D (Weng & Kitani, 2019).
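As a hedged illustration of this association step (not the cited paper's own code), the sketch below builds an IoU-based cost matrix between tracked boxes and new detections and solves the assignment with the Hungarian algorithm via SciPy's linear_sum_assignment; the box coordinates are invented for the example.

```python
# Hedged sketch: IoU-based track/detection association with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

tracks = np.array([[100, 100, 150, 180], [300, 120, 360, 200]])       # invented boxes
detections = np.array([[305, 118, 362, 202], [98, 104, 148, 182]])    # invented boxes

cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
row_idx, col_idx = linear_sum_assignment(cost)   # minimal total cost = maximal total IoU
for r, c in zip(row_idx, col_idx):
    print(f"track {r} -> detection {c} (IoU = {1.0 - cost[r, c]:.2f})")
```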

DNNs are used to detect objects in regions. In the proposed implementation, the YOLOv3 and YOLOv4 algorithms are used to predict class labels and detect objects. A multiple-object dataset (KITTI images and video) is used, which consists of object classes such as car, truck, person, and two-wheeler captured as RGB and grayscale images. The multiple-object detection algorithm was trained and tested, and the obtained results show that it effectively detects multiple objects with an accuracy of approximately 98% for the image dataset and 99% for the video dataset. In addition, the proposed YOLO model variants can be implemented on DSP and FPGA hardware for image and video datasets. An IOU of 0.5 is considered for the implementation (Tao, Wang, Zhang, Li, & Yang, 2017).

3D object detection is essential to obtain information about an object's extent and range in 3D space. YOLO4D is presented for spatio-temporal real-time 3D multi-object detection and classification from LiDAR point clouds. All the experiments are conducted on the publicly available KITTI raw dataset, which consists of sequenced frames, unlike the KITTI benchmark dataset. The dataset consists of 36 different annotated point cloud scenarios of variable lengths and a total of 12,919 frames. Automated driving dynamic scenarios are rich in temporal information. For training, a clip length m of 4 is used for all spatio-temporal based models. Most of the current 3D object detection approaches focus on processing the spatial sensory features, either in 2D or 3D space, while the temporal factor is not yet fully exploited, especially from 3D LiDAR point clouds. YOLO4D models outperform all other methods on all classes, achieving an 11.5% improvement over Mixed-YOLO3D and a 34.26% improvement over Tiny-YOLO3D. Frame stacking provides a 2.36% improvement on the Mixed-YOLO baseline model and a 14.56% improvement on the shallower Tiny-YOLO baseline model. Mixed-YOLO4D achieves the best mean F1 score, taking 20 ms on a TITAN XP GPU (El Sallab, Sobh, Zidan, Zahran, & Abdelkarim, 2018).

The KITTI MOTS dataset was introduced by adding instance segmentation masks for cars and pedestrians to a subset of 21 sequences from KITTI raw. To demonstrate the generality of the MOTS label generation process, the BDD100k tracking dataset is extended with segmentation masks to become a MOTS variant thereof, on which the tracking performance is best. The MOTSNet model is derived from experiments on KITTI, based on the ground-truth box-based tracking annotations available for BDD100k. The second major contribution is a deep-learning-based MOTSNet architecture to be trained on MOTS data, exploiting a novel mask-pooling layer that guides the association process for detections based on instance segmentation masks (Porzi et al., 2020).

3D object detection is a fundamental task in perception systems. The authors develop an efficient and effective single-stage detector that operates in bird's eye view (BEV) and fuses LiDAR information with rasterized maps. Bird's eye view is a good representation for 3D LiDAR, as it is amenable to efficient inference and retains the metric space. Two detector variants are compared on the validation set of the KITTI BEV object detection benchmark: the baseline PIXOR++ detector without a map, and the HDNET detector with online map estimation. The publicly available ground plane results from 3DOP, which were generated by road segmentation and RANSAC fitting, are used (Yang, Liang, & Urtasun, 2018).

The contribution of this paper is three-fold. First, a joint training dataset is established for electronic components that includes real PCB photos and virtual PCB photos based on circuit simulation software. Second, an improved YOLO (You Only Look Once) V3 algorithm is proposed that adds one YOLO output layer sensitive to small targets, and the effectiveness of the algorithm is validated in a real PCB picture and virtual PCB picture test including a large number of PCB electronic components. For the detection of PCB electronic components in 29 categories, the AP (average precision) of each category of components is used to characterize the performance of the four algorithms. After analyzing the feature distribution of the five dimensionality-reduced output layers of Darknet-53 and the size distribution of the detection targets, it is proposed to adjust the original three YOLO output layers to four YOLO output layers and to generate 12 anchor boxes for electronic component detection. The specific data show that YOLO V3 + 4 output layers + PCB dataset 12 anchors + bbox800 improves detection accuracy in all categories. The experimental results show that the mean average precision (mAP) of the improved YOLO V3 algorithm can reach 93.07% (J. Li, Gu, Huang, & Wen, 2019).

Left ventricle segmentation is an important medical imaging task, necessary to measure a patient's heart pumping efficiency. Recently, convolutional neural networks (CNNs) have shown great potential in achieving state-of-the-art segmentation for such applications. The two well-known approaches for hyper-parameter tuning are Grid Search and Random Search. In random search, instead of exhaustively searching through all the combinations of hyper-parameters, they are selected randomly. The image preprocessing step consists of cropping the image to size 192×192. Images with fewer than 192 rows or columns are padded to match 192, and there is no data augmentation in the pipeline. Although data augmentation would likely improve the performance of the network, it is outside the scope of that thesis, since the purpose is to show how U-Net can be optimized. The presented experiments and results indicate that the U-Net for left ventricle segmentation does not need to be as deep as suggested. This is shown by performing a gradient analysis in the deeper layers, which reveals that the gradient flow is very sparse in those layers. Hence, removing most of the bottom layers of the U-Net can decrease the number of free parameters and increase the training and inference speed significantly (Litjens et al., 2019).
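As a hedged sketch of the random search idea mentioned above (the search space, trial count, and scoring function are placeholders, not the cited work's actual setup):

```python
# Hedged sketch: random search over U-Net-style hyper-parameters, contrasted
# with exhaustive grid search. All values and the scoring function are invented.
import random

search_space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "depth": [3, 4, 5],              # number of encoder/decoder levels
    "base_filters": [16, 32, 64],
}

def evaluate(config):
    """Placeholder: train the model with `config` and return a validation score."""
    return random.random()

best_config, best_score = None, float("-inf")
for _ in range(10):                  # 10 random trials instead of all 27 grid points
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = evaluate(config)
    if score > best_score:
        best_config, best_score = config, score

print("best configuration:", best_config, "score:", round(best_score, 3))
```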

Tracking objects in videos is a significant problem in computer vision which has attracted great attention. It has various applications, such as video surveillance, autonomous driving, and human-computer interfaces. The goal of MOT is to estimate the locations of multiple objects in the video and maintain their identities consistently in order to yield their individual trajectories. MOT is still a challenging problem, especially in crowded scenes with frequent occlusion, interaction among targets, and so on. The proposed algorithm is implemented in MATLAB with Caffe. The implementation uses the first ten convolutional layers of the VGG-16 network trained on the ImageNet classification task. The online MOT algorithm is evaluated on the publicly available MOT15 and MOT16 benchmarks. The authors propose a dynamic CNN-based online MOT algorithm that efficiently utilizes the merits of single-object trackers using shared CNN features and ROI pooling. In addition, to alleviate the problem of drift caused by frequent occlusions and interactions among targets, a spatial-temporal attention mechanism is introduced. Besides, a simple motion model is integrated into the algorithm to utilize the motion information. Experimental results on challenging MOT benchmarks demonstrate the effectiveness of the proposed online MOT algorithm. The benchmarks contain 22 (11 training, 11 test) and 14 (7 training, 7 test) video sequences in unconstrained environments, respectively. The ground truth annotations of the training sequences are released (Chu et al., 2017).

Tracking objects over time, i.e., identity (ID) consistency, is important when dealing with multiple object tracking (MOT). This is especially challenging in complex scenes with occlusion and interaction of objects. Significant improvements in single object tracking (SOT) methods have inspired the introduction of SOT into MOT to improve robustness, that is, maintaining object identities as long as possible, as well as helping alleviate the limitations of imperfect detections. Multiple object tracking in video, a critical problem for many applications including robotics, video surveillance, and autonomous driving, remains one of the big challenges of computer vision. The goal is to locate all the objects of interest in a series of frames and form a reasonable trajectory for each one of them; recent progress on object detection has made tracking-by-detection the dominant approach. MOT17 is used to evaluate tracking performance. It consists of several challenging pedestrian tracking sequences, with a significant number of occlusions and crowded scenes, and variations in angle of view, object size, camera motion and frame rate; MOT17 has the same video sequences as the earlier MOT16. The most tracked (MT) and identity preserving (IDF1) metrics (IDF1 compares the ground truth trajectory with the computed trajectory via a bipartite graph and reflects how long an object has been correctly tracked) prove the effectiveness of the method for ID preservation, outperforming the other online trackers on MOT17. Different from online methods, offline methods have both future and past information to further optimize the status of each object, which usually translates into better overall performance. Despite this, the method still shows competitive ability in ID preservation, as reflected by the IDF1 and MT results. It is an SOT-based MOT method designed to increase the accuracy of tracking results with a focus on ID preservation (M. Li et al., 2019).

Robust tracking of objects is important for various computer vision applications, for instance human-computer interaction, video surveillance and intelligent navigation. The data association (DA) method is a favorite for multi-object tracking; often utilized techniques include the nearest neighbor method and joint probabilistic data association. The shortest path faster algorithm (SPFA) is used to relax the integer program within a network flow framework, and its average-case complexity is low. The global optimum found by the SPFA algorithm makes the tracking more reliable and more efficient; the network flow framework needs two particular properties to realize the SPFA algorithm. The authors propose a reliable tracker within a flow network framework. In the min-cost flow model established by integer programming theory, the SPFA algorithm is used to relax the integer assumption and successfully identify the global optimal solution. The resulting algorithm can better solve the problems of short-time false positives and false negatives in multi-object tracking and is more robust than state-of-the-art methods. The proposed method can quickly find the global optimal solution of the relaxed LP by using SPFA. Experimental results indicate that the proposed algorithm helps improve trajectory consistency, solves serious occlusion problems between multiple objects, and can satisfy real-time measurement requirements. Compared with other algorithms, SPFA has obvious advantages when tracking multiple types of targets with a dynamic background (L. Fan et al., 2016).

After a video classification, an image is separated into two complementary sets of pixels: the first set encompasses the pixels which correspond to foreground objects, which are usually moving objects like people, boats and cars, while the second set contains the background pixels. This result is often represented as a binary image or as a mask. Object tracking plays a vital role in several shape recognition and computer vision pattern recognition applications like autonomous robotic navigation, surveillance and vehicle navigation. Point tracking, particularly with frequent occlusions and false object detections, is a complex problem; once these points are identified, recognition can be achieved fairly quickly with point-based tracking approaches. Moving object detection and tracking has become an attractive and crucial research topic. There are many methods for object detection and tracking, each with its own advantages and disadvantages. For object tracking, a single method cannot give good accuracy for different kinds of videos under different conditions such as poor resolution or changing weather; one widely used approach is Gaussian Mixture Modeling (Drayer & Brox, 2016).

Tracking is a fundamental task in any video application requiring some degree of reasoning about objects of interest, as it allows object correspondences to be established between frames. To achieve this, the two variants of SiamMask are modified during inference so that, respectively, they report an axis-aligned bounding box from the score branch (SiamMask-2B-score) or from the box branch (SiamMask-box). SiamMask is a simple approach that enables fully-convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target object. The authors show that it can be applied with success to both visual object tracking and semi-supervised video object segmentation, with better accuracy than state-of-the-art trackers and, at the same time, the fastest speed among VOS methods. The two proposed variants of SiamMask are initialized with a simple bounding box, operate online, run in real time and do not require any adaptation to the test sequence. On a single NVIDIA Titan X GPU, an average speed of 55 and 60 frames per second was measured for the two-branch and three-branch variants, respectively. Note that the highest computational burden comes from the feature extractor (Q. Wang, Zhang, Bertinetto, Hu, & Torr, 2019).

Object Tracking (OT) with a moving camera, so-called Moving Object Tracking (MOT), is extremely important in computer vision. While conventional tracking methods based on a fixed camera can only track objects within its range, a moving camera can tackle this issue by following the objects. Moreover, a single tracker is widely used to track an object, but it is not effective with a moving camera because of challenges such as sudden movements, blurring and pose variation. The KITTI dataset is used to evaluate the accuracy of the proposed method. KITTI contains many different subsets; the "Raw Data" and "Object Tracking Evaluation 2012" sets are selected for the experiments. The accuracy of the camera position estimation algorithm is evaluated on the "Raw Data" set, where sequences 0091, 0060, 0095, 0113, 0106 and 0005 are selected. Tracked object position estimation is evaluated with the proposed tracking-by-detection method on the "Object Tracking Evaluation 2012" set; sequences 0000, 0004, 0005, 0010, 0011 and 0020 are used because they have specific objects to track. The camera location is estimated with and without removal of moving objects. The moving-feature removal algorithm is still limited and should be improved in the future, although it has shown an improvement in camera position accuracy when moving features are removed. Although an error between the estimated camera location and the ground truth camera location remains, the method is useful in environments such as indoors, with noisy GPS, and in cases where the input for the tracked object is an image. The experimental results show the important role of visual information in MOT (Mahmoudi, Ahadi, & Rahmati, 2019).

Robots operating in human-centered environments have to perform reliably in unanticipated situations, and deep neural networks (DNNs) offer great promise in enabling robots to learn from humans and their environments. This paper introduces a new method for end-to-end training of deep neural networks and evaluates it in the context of autonomous driving. DNN training has been shown to result in high accuracy for perception-to-action learning given sufficient training data. A society where robots are safely and reliably integrated into daily life demands agents that are aware of scenarios for which they are insufficiently trained. The contributions include detection of novel events for which the network has been insufficiently trained and cannot be trusted to produce reliable outputs, and automated debiasing of the neural network training pipeline, leading to faster training convergence and increased accuracy. The authors start by formulating the end-to-end control problem and then describe the model architecture for estimating the steering control of an autonomous vehicle. All models in the paper were trained on an Nvidia Volta V100 GPU. To evaluate the performance of the model for end-to-end autonomous vehicle steering control, a standard regression network is first trained which takes a single image as input and outputs steering curvature. The input image data are modeled by a set of underlying latent variables (one of which is the steering command taken by a human driver) with a VAE architecture (Amini et al., 2018).

End-to-end (perception-to-control) trained neural networks for autonomous vehicles have shown great promise for lane-stable driving. However, they lack methods to learn robust models at scale and require vast amounts of training data that are time-consuming and expensive to collect. Learned end-to-end driving policies and modular perception components in a driving pipeline require capturing training data from all necessary edge cases, such as recovery from off-orientation positions or even near collisions. Road boundaries are plotted in black to show the scale of deviations. IMIT-AUG yielded the highest performance of the three baselines, as it was trained directly with real-world data from the human driver. Of the two models trained with only CARLA control labels, S2R-AUG outperformed DR-AUG, requiring an intervention every 700 m compared to every 220 m. Simulation has emerged as a potential solution for training and evaluating autonomous systems on challenging situations that are often difficult to collect in the real world. However, successfully transferring learned policies from model-based simulation into the real world has been a long-standing problem in robot learning (Mehta, Subramanian, & Subramanian, 2018).

Accurate depth perception is critical in applications such as autonomous driving, robot navigation, and 3D reconstruction. It can be accomplished by estimating pixel correspondences and disparities between rectified image pairs, which is known as stereo matching. The KITTI 2015 stereo dataset contains images of natural scenes (city and rural areas and highways) collected in Karlsruhe, Germany. It contains 200 training stereo image pairs with sparse ground truth disparities, collected using a LiDAR sensor, and 200 testing image pairs without ground truth disparities. KITTI allows performance evaluation by submitting final results to its evaluation server. PSMNet is an effective 3D stereo matching network that is commonly used as the backbone for disparity estimation. The work provides an effective solution by including foreground- and background-specific depth-based loss functions (Saleh, Hardt, & Manoharan, 2020).

In computer vision, ground truth generation and performance analysis have received increasing attention in the past years. The goal in autonomous driving is to at least achieve the average reliability of humans. From around 200 sequences comprising 2.5 million image pairs at 200 Hz, 55 partial sequences were selected with a total of 3563 image pairs at 25 Hz; each sequence contains between 19 and 100 consecutive frames. In comparison to KITTI, KITTI uses a similar LIDAR-based ground truth acquisition strategy, but in contrast to this approach it scans continuously while driving with a car-mounted Velodyne device. The main advantage of their approach is that the extrinsic calibration with the LIDAR can be carried out once and the car can record ground truth without a prior scanning step. The authors designed and recorded a new stereo and flow dataset and extracted an initial benchmark subset comprising 28,504 stereo pairs with stereo and flow ground truth with uncertainties for static regions. Dynamic regions, covering around 6% of all pixels, are manually masked out and annotated with approximate ground truth on 3500 pairs. Half of the ground truth is made available as training data. New stereo metrics and interactive result visualizations are accessible through the benchmark website (Rezaei & Klette, 2017).

As one of the primary computer vision problems, object detection aims to find and locate semantic objects in digital images. Unlike object classification, which only recognizes an object as belonging to a certain class, object detection also needs to extract accurate object locations. In state-of-the-art object detection algorithms, bounding box regression plays a critical role in achieving high localization accuracy. In deep learning-based object detection algorithms, bounding boxes are usually generated through a deep convolutional neural network; VGG is a widely-used network architecture because of its simplicity and high performance. When building the proposed algorithm, the authors hoped to obtain an end-to-end trainable object detection model by directly optimizing Faster R-CNN. However, since Faster R-CNN is a well-designed model with elaborate hyper-parameter tuning, it is hard to modify and optimize the algorithm framework directly. In practice, an additional, independent regression fine-tuning model may be quite useful in scenarios where precision is required but speed is not critical (Han, Zhang, Cheng, Liu, & Xu, 2018).

Simply applying a single object tracker to MOT encounters problems with computational efficiency and drifted results caused by occlusion. The framework achieves computational efficiency by sharing features and using ROI pooling to obtain individual features for each target. Within the framework, a spatial-temporal attention mechanism (STAM) is introduced to handle the drift caused by occlusion and interaction among targets. Tracking objects in videos is a significant problem which has attracted great attention, with applications such as video surveillance, human-computer interaction and autonomous driving. The main goal of MOT is to estimate the locations of multiple objects in the video and maintain their identities consistently in order to yield their individual trajectories. The training sequences in the MOT15 benchmark are used for performance analysis of the proposed method. The ground truth annotations of the test sequences in both benchmarks are not released and the tracking results are automatically evaluated by the benchmark, so the test sequences in the two benchmarks are used for comparison with various state-of-the-art MOT methods. The overall tracking speed of the proposed method on the MOT15 test sequences is 0.5 fps using a 2.4 GHz CPU and a TITAN X GPU, while the algorithm without feature sharing runs at 0.1 fps in the same environment. The authors propose a dynamic CNN-based online MOT algorithm that efficiently utilizes the merits of single object trackers using shared CNN features and ROI pooling. In addition, to alleviate the problem of drift caused by frequent occlusions and interactions among targets, the spatial-temporal attention mechanism is introduced. Besides, a simple motion model is integrated into the algorithm to utilize motion information. Experimental results on challenging MOT benchmarks demonstrate the effectiveness of the proposed online MOT algorithm (Chu et al., 2017).

In video surveillance, person tracking is considered a challenging task. Numerous computer vision, machine learning and deep learning-based techniques have been developed in recent years, the majority based on frontal-view images or video sequences. The advancement of convolutional neural networks has reformed the way object tracking is done: CNN models trained on large numbers of images or video sequences improve the speed and accuracy of object tracking. Overhead-view person tracking is performed using CNN-based detection and tracking models; for overhead person tracking, the Generic Object Tracking Using Regression Networks (GOTURN) algorithm is combined with the Faster R-CNN detection model. For testing purposes, an overhead-view person dataset is used, containing video sequences with variation in person appearance (including a variety of poses, shapes, and scales) and different camera resolutions with indoor and outdoor backgrounds. To show the generalization performance of GOTURN and Faster R-CNN (pre-trained using normal or frontal-view data), testing is performed on a completely different, overhead-view person dataset. The importance of the CNN-based overhead-view person tracking model is explored in contrast with the conventional frontal view, particularly in the field of video surveillance, and an in-depth discussion is provided together with future guidelines. For person detection, Faster R-CNN is used, which achieves good detection results for overhead-view images; for person tracking, the Faster R-CNN detection model is combined with the GOTURN tracking algorithm. The experimental results demonstrate the robustness and efficiency of the CNN-based detection model and tracking algorithm, despite significant variation in the dataset in terms of appearance, visibility, shape, and size of the person compared with the normal frontal view. The results show the performance of the tracking algorithm with a success rate of 94%, and of the detection model with a TDR of 90% to 93% and an FDR of 0.5%. In the future, this work might be extended by further training (Ahmad, Ahmed, Khan, Qayum, & Aljuaid, 2020).

Common evaluation metrics are not designed specifically to illustrate how an algorithm handles open-set conditions or situations where some sensors are degraded or defective. Recent advancements in perception for autonomous driving are driven by deep learning, and many methods have been proposed for deep multi-modal perception problems. However, there is no general guideline for network architecture design, and the questions of "what to fuse", "when to fuse", and "how to fuse" remain open. This review paper attempts to address them. To this end, it first provides an overview of on-board sensors on test vehicles, open datasets, and background information for object detection and semantic segmentation in autonomous driving research. Many datasets have been published; most of them record data from RGB cameras, thermal cameras, and LiDAR. Correspondingly, most of the reviewed papers fuse RGB images either with thermal images or with LiDAR point clouds. Only recently has the fusion of radar data been investigated. This includes the nuScenes dataset, the Oxford Radar RobotCar Dataset, the Astyx HiRes2019 Dataset, and the seminal work that proposes to fuse RGB camera images with radar points for vehicle detection. In the future, more datasets and fusion methods concerning radar signals are expected (Qi, Liu, Wu, Su, & Guibas, 2018).

The authors propose a data-driven approach to online multi-object tracking (MOT) that uses a convolutional neural network (CNN) for data association in a tracking-by-detection framework. The problem of multi-target tracking aims to assign noisy detections to an a-priori unknown and time-varying number of tracked objects across a sequence of frames. A majority of the existing solutions focus either on tediously designing cost functions or on the problem formulation. To this end, the authors propose to learn a similarity function that combines cues from both image and spatial features of objects. Multi-object tracking (MOT) is a critical issue, for example for activity recognition; it is the problem of finding the optimal set of trajectories of objects of interest over a sequence of consecutive frames. Most successful computer vision approaches to MOT have focused on the tracking-by-detection principle. Functionally, SimNet computes a similarity score for every detection and target pair. It has two branches, a bounding box branch and an appearance branch, each of which uses a trainable Siamese network to learn object representations conditioned on whether two objects are similar or not. The outputs of these branches are vector representations of targets and detections, and their respective contributions towards the final similarity score are weighted using an importance branch. The appearance branch outputs a robust and invariant vector representation for 2D visual cues of targets and detections conditioned on whether they belong to similar or dissimilar objects. The paper presents a solution to the problem of data association in 3D online multi-object tracking using deep learning with multi-modal data. It shows that a learning-based data association framework helps combine different similarity cues in the data and provides more accurate associations than conventional approaches, which increases overall tracking performance. The effectiveness of the tracker built with this model is demonstrated through a multitude of experiments and evaluations, with competitive results on the KITTI tracking benchmark. In the future, the authors plan to integrate this solution more tightly with an object detection framework and perform end-to-end training (Emami, Pardalos, Elefteriadou, & Ranka, 2018).

This has led to the rapid evolution of autonomous driving systems over the last several decades, with the promise of preventing such accidents and improving the driving experience; the field has been very successful both in academia and in industry, which has led to autonomy being deployed on roads. Navigation in dense urban environments requires understanding complex multi-agent dynamics, including tracking multiple actors across scenes, predicting intent, and adjusting agent behavior conditioned on historical states. Since DRL is challenging to apply in the real world, primarily due to safety considerations and the poor sample complexity of state-of-the-art algorithms, most current research in the RL domain is increasingly carried out on simulators such as TORCS and CARLA, from which it can eventually be transferred to real-world settings. The authors present an approach to learning urban driving tasks that commonly involve subtasks such as lane-following, driving around intersections, and handling numerous interactions between dynamic actors, traffic signs and signals. They formulate a reinforcement learning-based approach built around three different variants that use waypoints and low-dimensional visual affordances, and demonstrate that using such low-dimensional representations makes the planning and control problem easier, as stable and robust policies can be learned, demonstrated by their results with this state representation (Agarwal, Arora, & Schneider, 2021).

The self-driving car is a high-potential field that is developing exponentially across computer vision and control systems. It has become a hot topic and grabs attention with all the hype that the big AI players and automakers have created around it, especially after the great steps made thanks to growing research efforts, fueled by industry, that have moved this technology from simple prototypes in showrooms and sci-fi movies to real-world cars with almost full autonomy. This comparative study adopts a methodology based on training and testing both algorithms in the same high-fidelity realistic driving environment, CARLA, with the same parameters and the same metrics, which allows an understanding of the strengths and weaknesses of both technologies in the context of the self-driving car. Throughout this benchmark, imitation learning has shown great results for déjà vu situations, while reinforcement learning has the capacity to deal with new situations. This conclusion, based on the training and testing results, leads to the deduction that combining both algorithms would be a good opportunity to leverage the strengths of each. Therefore, the learning process of an AI agent can begin with imitation learning to quickly acquire the expert's skills, and then learn to generalize and adapt to new environments using reinforcement learning algorithms. This approach can also help remove some weaknesses of reinforcement learning, like bad initialization, slow learning, and credit assignment; however, it cannot solve all the limitations, such as dealing with new complex tasks, partial observability, or generalization, which opens the door to opportunities for progress and great challenges to overcome through new ideas and breakthroughs. The goal is to make an ADS system more accurate and safer, and take the domain of transportation to another level (Youssef & Houda, 2020).

The authors present MADRaS, an open-source multi-agent driving simulator for use in the design and evaluation of motion planning algorithms for autonomous driving. MADRaS provides a platform for constructing a wide variety of highway and track driving scenarios where multiple driving agents can train for motion planning tasks using reinforcement learning and other machine learning algorithms. MADRaS is built on TORCS, an open-source car-racing simulator. TORCS offers a variety of cars with different dynamic properties and driving tracks with different geometries and surface properties. MADRaS inherits these functionalities from TORCS and introduces support for multi-agent training, inter-vehicular communication, noisy observations, stochastic actions, and custom traffic cars whose behaviors can be programmed to simulate challenging traffic conditions encountered in the real world. MADRaS can be used to create driving tasks whose complexity can be tuned along eight axes in well-defined steps. The paper describes the structure and organization of the MADRaS simulator, which constitutes its main contribution. The current version of MADRaS is focused on track driving, which is traditionally used in the automotive world to benchmark driver skill and car agility. A brief overview of the TORCS simulator and associated prior works that MADRaS builds upon is presented first. The most salient feature of MADRaS is its support for multi-agent training; the success of multi-agent learning is contingent on the ability of the agents to communicate among themselves and plan actions taking into account the states and actions of the other agents. The authors demonstrate how MADRaS can be used to create a wide variety of driving tasks that can be addressed by RL, using the Proximal Policy Optimization (PPO) algorithm. RL agents are trained to accomplish challenging tasks like generalizing across a wide range of track geometries and vehicular dynamics, driving under stochasticity and partial observability, navigating through static and moving obstacles, and negotiating with other agents to pass through a traffic bottleneck. These studies demonstrate the viability of MADRaS for simulating rich highway and track driving scenarios of high variance and complexity that are valuable for autonomous driving research (Santara et al., 2020).

In recent years, autonomous driving has drawn significant attention from both academia and industry, including Google, Baidu, Uber, and Nvidia. Traditional autonomous driving technologies employ various sensors, such as radar, LiDAR, odometry, computer vision, sonar, GPS, and inertial measurement units, to perceive the surroundings. With the booming of artificial intelligence (AI), deep learning technologies have been applied to autonomous driving to help vehicles better perceive the environment. Besides perceiving the environment, predictive driving is another prominent skill that human drivers use for smooth control and safe driving. In this work, the authors develop a DRL-based deep Monte Carlo Tree Search (deep-MCTS) control method for vision-based autonomous driving. Different from existing DRL-based autonomous driving controllers, deep-MCTS can predict driving maneuvers, a critical skill for humans to ensure safe driving and smooth control. Different from traditional autonomous driving algorithms, the driver-view images captured by the camera on board the autonomous vehicle are the only input, and the deep-MCTS algorithm can learn how to control the vehicle deprived of any human knowledge. The deep-MCTS algorithm can predict driving maneuvers by performing virtual driving simulations. Compared to existing methods, it shows 50%, 66.30% and 59.06% improvements in training efficiency, stability of steering control, and stability of driving trajectory, respectively. In future work, the authors will consider migrating the proposed method from the virtual world to real life (Chen, Zhang, Luo, Xie, & Wan, 2020).

To go beyond expert capacities, humans adopt exploration strategies to discover new areas, methods, and skills, improving their expertise and outperforming their current level. The objective of combining IL with RL is to outperform the expert skills learned through the demonstrations without any drop in the new model's performance with respect to the expert demonstrations, while ensuring a continuous accumulation of knowledge and skills. The new model must also deal with real-world environments, and consequently has to manage continuous action spaces. If it is not well managed and controlled, this exploration process can sometimes diverge and lead to forgetting. The next step starts by allowing the two A2CfDoC agents to interact with the Carracing-v0 and CARLA environments (with the same seed) and run for a set number of steps (respectively 200,000 and 450,000) after the expert demonstrations, which can lead to a suboptimal performance lower than the capacities learned from the experts. The use of expert gradient clipping limits the size of the network update that could provoke large changes from the previous policy, by keeping the update in a secure region and avoiding a dramatic decrease in performance. A next step could be a combination of other techniques to solve the many challenges that artificial intelligence faces in building 100% reliable ADAS systems, for example by using a hierarchical deep learning network architecture to form a single network that can deal with complex tasks and include other sub-functions like sensor fusion, occupancy grid mapping and path planning, or handle several macro features ranging from pedestrian detection, road-sign recognition and collision avoidance to more complex ones like self-parking, lane-keeping and cruise control. The Partially Observable Markov Decision Process (POMDP) principle can also be combined with RNN/LSTM prediction to give the deep learning network the ability to deal with limited spatial and temporal perception of the environment (Ding, Florensa, Phielipp, & Abbeel, 2019).

This paper presents SDVTracker, a technique for learning motion state estimation and multiclass object-detection association. SDVTracker is a practical tracking system that applies a deep learned model alongside classical association techniques. To work out the data association problem, incoming detections at the current timestamp need to be matched to existing objects from the preceding timestamp, with state estimation performed in combination with an Interacting Multiple Model (IMM) filter. The model jointly optimizes state estimation and association by means of a novel loss, together with an algorithm for determining ground-truth supervision during the training procedure (Sun, Chen, Liang, Ruan, & Mukherjee, 2020).

To design an explanation component for the driving decisions of an autonomous vehicle, an understanding of the expert and non-expert mental models of autonomous systems is needed first. Subsequently, a target mental model can be identified by adding evaluated key components of the user mental model to key components of the expert mental model. The authors identified a target mental model that enhances the user's mental model by adding key components from the experts' mental model. To construct this target mental model and to evaluate a prototype of an explanation visualization, they conducted interviews (N=8) and a user study (N=16). The explanation consists of abstract visualizations of different elements representing the autonomous system's components. Regarding the expert mental model, the interviewed experts defined three categories which describe an autonomous driving system: perception, deliberation and action. Perception includes all components concerning sensors, object detection, localization, tracking, maps and data fusion. Concerning the comprehension of autonomous systems, the experts stated that an environment model would be sufficient for most end users to understand the driving decision of the autonomous vehicle, and that trajectory planning is too complex for normal users. The study explores the relevance of the explanation's individual elements and their influence on the user's situation awareness. The results show that displaying the detected objects and their predicted motion was most important for understanding a situation (Talha, 2019).

The effective detection of curbs is fundamental and crucial for the navigation of a self-driving car. This paper presents a real-time curb detection method that automatically segments the road and detects its curbs using a 3D LiDAR sensor. The method captures the road curbs in various road scenarios including straight roads, curved roads, T-shape intersections, Y-shape intersections and +-shape intersections. The curb information forms the foundation of decision making and path planning for autonomous driving. Comprehensive offline and real-time experiments demonstrate that the proposed method achieves high curb detection accuracy in various scenarios while satisfying the stringent efficiency requirements of autonomous driving. The offline experiment demonstrates that the curbs can be robustly extracted: the average precision is 84.89%, the recall is 82.87%, and the average F1 score is 83.73%. Furthermore, the average processing time in the real-time experiments is around 12 ms per frame, which is fast enough for self-driving (Y. Zhang, Wang, Wang, & Dolan, 2018).

3. Material and Methods/Model and Equations/Modeling
3.1. Experimental Setup
Our solution was developed using Google Colaboratory. Colab notebooks let you combine executable code and rich text in a single document, along with images, HTML, LaTeX and more. Colab notebooks you create are stored in your Google Drive account. The performance study uses GPU-based results.

3.2. Dataset
The dataset is "The KITTI Vision Benchmark Suite". This kernel contains the object detection part of the different datasets published for autonomous driving. It contains a set of images with their bounding box labels. For more information, visit the website on which the data is published (linked above) and/or read the README file, which explains the label format. The download comprises about 12 GB of KITTI data, including the training images and the calibration files for the cameras placed in the car. We also collected driving-situation (road object) images at our university to test the proposed model.

3.3. Kitti 3D Calib File


This dataset contains the KITTI Object Detection Benchmark from The KITTI Vision Benchmark Suite. It contains a set of images with their bounding box labels. The calibration file contains information on how the image was captured from the car camera. The cameras in the car are located on the roof, and there are two cameras that are very close to each other. Besides the cameras, KITTI 3D has a LiDAR.

3.4. U-Net (Convolutional Networks for Biomedical Image Segmentation)
The U-Net is a convolutional network architecture for fast and precise segmentation of images.


3.4.1. U-Net architecture

In the original architecture, "32x32", for example, denotes the number of pixels at the lowest resolution. Each blue box corresponds to a multi-channel feature map; the number of channels is denoted on top of the box and the x-y size is provided at the lower left edge of the box. White boxes represent copied feature maps, and the arrows denote the different operations.
Figure 3.1 U-net original architecture
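As a minimal, illustrative Keras sketch of this encoder-decoder pattern with skip connections (layer counts, channel widths and names are ours, not the exact network trained in this thesis):

from tensorflow.keras import layers, Model

def build_unet(input_shape=(160, 256, 3), n_classes=9):
    """A compact U-Net-style encoder-decoder with skip connections."""
    inputs = layers.Input(input_shape)
    # Encoder: each block halves the spatial resolution and increases the channels.
    c1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(64, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D()(c2)
    # Bottleneck.
    b = layers.Conv2D(128, 3, activation="relu", padding="same")(p2)
    # Decoder: upsample and concatenate the copied encoder feature maps.
    u2 = layers.UpSampling2D()(b)
    u2 = layers.Concatenate()([u2, c2])
    c3 = layers.Conv2D(64, 3, activation="relu", padding="same")(u2)
    u1 = layers.UpSampling2D()(c3)
    u1 = layers.Concatenate()([u1, c1])
    c4 = layers.Conv2D(32, 3, activation="relu", padding="same")(u1)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c4)
    return Model(inputs, outputs)

The Concatenate calls correspond to the copied feature maps (white boxes) in Figure 3.1.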


YOLO is designed in Darknet, an open-source neural network framework written in C and CUDA, developed by the same author who created YOLO, Joseph Redmon. The latest iteration is YOLOv3, which is bigger and more accurate on small objects, but slightly worse on larger objects compared to the previous version. In YOLOv3, Darknet-53 (a 53-layer CNN with residual connections) is used, which is quite a leap from the Darknet-19 (19-layer CNN) used for YOLOv2.


3.4.2. Object detection (u-net)


Object detection is a computer vision task that involves identifying the presence, location, and type of one or more objects in a given photograph. It is a challenging problem that builds upon methods for object recognition, object localization (where are they and what is their extent?), and object classification (what are they?).

3.4.2.1. Data from all directories, and retrieving the lists of masks and images

The input picture size is very important when using convolutional U-Net-type networks.

IMAGE_WIDTH = 256
IMAGE_HEIGHT = 160

Because every time the image passes through a pooling layer its size is halved and rounded to an integer, problems arise if at some point the size becomes an odd number. For example, consider a network with the following layers:

conv2d_66 (Conv2D) (None, 100, 175, 32) 9248
max_pooling2d_26 (MaxPooling2D) (None, 50, 87, 32) 0

In the case of U-Net, when we combine one of the initial layers with a later layer that comes out of the upsampling path (multiplied by two), the layer we want to join will have a size of 86, and creating the model will return an error. It should be remembered that the sizes should be multiples of 8 in the case of three reduction operations, or 16 in the case of four reduction operations.
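A small illustrative helper (the name is ours) that pads an image up to the nearest valid multiple before it is fed to such a network:

import numpy as np

def pad_to_multiple(image, multiple=16):
    """Pad an (H, W, C) image with zeros so H and W are divisible by `multiple`."""
    h, w = image.shape[:2]
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    return np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")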
The get_labels step returns the necessary data for each image; each object is a separate line with a description. The most important parameters are:

3.4.3. Segmentation (unet)


The classes used in this segmentation are 'Car', 'Van', 'Truck', 'Pedestrian', 'Cyclist', 'Tram', 'Misc' and 'None'; 'Person sitting' and 'DontCare' are not selected. Each class is given a colour for the segmentation masks: Car (0, 0, 0), Van (244, 35, 232), Truck (70, 70, 70), Pedestrian (102, 102, 156), Cyclist (190, 153, 153), Tram (153, 153, 153), Misc (250, 170, 30), class 7 (220, 220, 0), None (107, 142, 35). Since U-Net is the basic architecture of this thesis, let us discuss this architecture in detail here and understand its building blocks. In the next chapter, we explain our version of U-Net, which is modified compared to what is explained here. The difference is that the convolutions we use for U-Net add padding to the input in such a way that, after applying the convolution, the image keeps its dimensions, whereas the architecture explained in the original U-Net does not account for padding, so after applying a 3×3 convolution to the input, the output image is 2 pixels shorter in width and height.
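As a minimal Keras illustration of this difference (layer sizes are arbitrary): with padding='same' the spatial size is preserved, while padding='valid', as in the original U-Net, loses 2 pixels per 3×3 convolution.

from tensorflow.keras import layers

x = layers.Input((160, 256, 3))
same = layers.Conv2D(32, 3, padding="same")(x)    # -> (160, 256, 32): size preserved
valid = layers.Conv2D(32, 3, padding="valid")(x)  # -> (158, 254, 32): 2 pixels shorter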

3.5. YOLO3NET
YOLOv3 is an improved version of YOLO and YOLOv2. The main change in its network structure is the introduction of residual blocks, which ensures that even if the YOLOv3 network becomes deeper, the model can still converge quickly. In order to better deal with the problem of overlap, the loss function uses binary cross-entropy loss; a multi-scale fusion method is adopted to merge the high-level semantics with the low-level features, which improves the sensitivity to small targets.

YOLO (You Only Look Once) is a family of convolutional neural networks that achieve near state-of-the-art results with a single end-to-end model that performs object detection in real time. This section explains the changes introduced in YOLOv3, the third and most recent version, and assumes familiarity with how YOLO v2 works; it is not an explanation of YOLO from the ground up. We use a best-of-breed open-source implementation of YOLOv3 for the Keras deep learning library. The official title of the YOLO v2 paper, "YOLO9000: Better, Faster, Stronger", sounded more like a health drink for kids than an object detection algorithm.

For its time, YOLO9000 was the fastest and also one of the most accurate algorithms. A couple of years down the line, however, it is no longer the most accurate, with algorithms like RetinaNet and SSD outperforming it in terms of accuracy, although it remained one of the fastest.


But that speed has been traded off for boosts in accuracy in YOLO v3, mainly because of the increased complexity of the underlying architecture, Darknet.

3.5.1. Anchor box


It might seem sensible to predict the width and the height of the bounding box directly, but in practice that leads to unstable gradients during training. Instead, most modern object detectors predict log-space transforms, i.e. offsets to predefined default bounding boxes called anchors. These transforms are applied to the anchor boxes to get the prediction. YOLO v3 has three anchors per scale, which results in the prediction of three bounding boxes for each cell of the prediction feature map. The question is which of these boxes will be assigned to the object. Instead of predicting the width and height of the box directly, YOLOv3 predicts offsets to the predefined anchor boxes. Anchor boxes are predefined shapes chosen to match typical ground-truth bounding boxes, because most of the objects in the training dataset have a typical width and height ratio. The offsets are applied to the anchor boxes to obtain the prediction, and the bounding box responsible for detecting the object is the one whose anchor has the highest IOU with the ground truth.
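A minimal sketch of the standard YOLOv3 decoding step that applies these offsets to an anchor (variable names are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_yolo_box(t_x, t_y, t_w, t_h, cell_x, cell_y,
                    anchor_w, anchor_h, grid_size, net_size):
    """Decode one raw prediction into a box centre and size in network-input pixels."""
    # Centre: the sigmoid keeps the offset inside the current grid cell.
    b_x = (cell_x + sigmoid(t_x)) / grid_size * net_size
    b_y = (cell_y + sigmoid(t_y)) / grid_size * net_size
    # Size: exponential offsets scale the predefined anchor dimensions.
    b_w = anchor_w * np.exp(t_w)
    b_h = anchor_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h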

3.5.2. Object detection and tracking


At the present time, the problem of classifying objects in an image is more or less solved, thanks to huge advances in computer vision and deep learning in general. Publicly available models trained on large amounts of data further simplify this task. Accordingly, the computer vision research community has shifted focus to other very interesting and challenging topics, such as adversarial image generation, neural style transfer, visual storytelling and, of course, object detection, segmentation and tracking. We start off by paying homage to the long-established methods, and afterwards explore the current state of the art. The Keras-based yolo3 project provides much of the capability needed for using YOLOv3 models, including object detection with transfer learning and training new models from scratch. The output of the model is, in fact, encoded candidate bounding boxes from three different grid sizes, and the boxes are defined in the context of anchor boxes carefully chosen based on an analysis of the size of objects in the MSCOCO dataset. A neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

The classes used in this experiment are car, motorbike, and pedestrian. The method involves a single deep CNN (originally a version of GoogLeNet, later updated and called DarkNet, loosely based on VGG) that splits the input into a grid of cells, and each cell directly predicts a bounding box and an object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction by a post-processing step. There are three core variations of the method at the time of writing: YOLOv1, YOLOv2, and YOLOv3. The first version proposed the general architecture, the second version refined the design and made use of predefined anchor boxes to improve the bounding box proposals, and the third version further refined the model architecture and training procedure.
3.6. Measures
3.6.1. True Positives (TP)
These are the correctly predicted positive values, meaning that the value of the actual class is yes and the value of the predicted class is also yes.

3.6.2. True Negatives (TN)


These are the correctly predicted negative values, meaning that the value of the actual class is no and the value of the predicted class is also no. False positives and false negatives occur when the actual class contradicts the predicted class.

3.6.3. False Positives (FP)


When the actual class is no but the predicted class is yes.

3.6.4. False Negatives (FN)


When the actual class is yes but the predicted class is no.


3.6.5. Precision
Precision is the fraction of the detected objects (TP + FP) that are detected correctly. It is the measure of the accurately identified positive cases out of all the predicted positive cases, i.e. the ratio of true positives to the total of true positives and false positives. Precision looks at how many junk positives got thrown into the mix: if there are no bad positives (FPs), the model has 100% precision; the more FPs that get into the mix, the worse the precision is going to look. To calculate a model's precision, we need the positive and negative counts from the confusion matrix. Precision is valuable when the cost of false positives is high, and it tells us how many false positive (FP) detections the detector produces. It is defined as follows.
Equation 1
Precision = TP / (TP + FP)

3.6.6. Recall
Recall is the measure of the correctly identified positive cases out of all the actual positive cases. It is important when the cost of false negatives is high, and it tells us the fraction of the ground truth objects (TP + FN) that are detected by the detector. It is defined as
Equation 2
Recall = TP / (TP + FN)
The recall rate is penalized whenever a false negative is predicted. Because the penalties in precision and recall are opposites, so too are the equations themselves; precision and recall are the yin and yang of assessing the confusion matrix.

3.6.7. F1 Score
The F-score, also called the F1 score, is a measure of a model's accuracy on a dataset. It combines the precision and recall of the model and is defined as their harmonic mean: F1 = 2 × ((precision × recall) / (precision + recall)). It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between precision and recall. As in many vision problems, the ground truth labeling may not be perfect. For autonomous navigation applications, it is not a serious problem if the estimated free space is smaller than the actual one; on the other hand, it is more critical not to have any obstacles inside the free-space curve. In this regard, the F1 score is proposed to measure the accuracy of classification of pixels under the curve, given by
Equation 3
F1 = 2 × P × R / (P + R)

3.6.8. Accuracy
One of the more obvious metrics, it is the measure of all the correctly identified
cases. It is most used when all the classes are equally important.

Equation 4
Accuracy = (TP + TN) / (TP + FP + FN + TN)
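A small illustrative sketch of these four measures as functions of the confusion-matrix counts:

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + fp + fn + tn)

# With the counts reported later in Table 4.2 (TP=635, FP=6, FN=111):
# f1_score(635, 6, 111) gives roughly 0.916, i.e. about the 91.56% reported in Section 4.3.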

3.6.9. Average precision (AP)


Average precision is used in the KITTI benchmark. It samples the precision-recall curve at different points, interpolates it, and computes the arithmetic mean. This measure therefore gives insight into how the precision/recall curve changes.

3.6.10. Intersection over Union (IOU)


Intersection over Union, also referred to as the Jaccard Index, is an evaluation
metric that quantifies the similarity between the ground truth bounding box (i.e.
Targets annotated with bounding boxes in the test dataset) and the predicted
bounding box to evaluate how good the predicted box is. The IOU score ranges
from 0 to 1, the closer the two boxes, the higher the IOU score.

48
3.6 Measures

Figure 3.2 IOU
Formally, the IOU measures the overlap between the ground truth box and the predicted box over their union. It can be written in terms of TP, FP, and FN as
Equation 5
IOU = TP / (TP + FP + FN)
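For bounding boxes, the same idea is usually computed from the box corners; a generic sketch (not the KITTI evaluation code) for two boxes in (x1, y1, x2, y2) format:

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)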

4. Results and Discussions
4.1. Segmentation (u-net)
Image segmentation is a common computer vision task. The architecture of the proposed segmentation U-Net model is shown in Figure 4.1. The data for training contains 30 512×512 images, which are far from enough to feed a deep learning neural network, so the ImageDataGenerator in keras.preprocessing.image is used to do data augmentation. After importing the libraries, we initialize the directory where the images are stored and create two lists, one for storing the masks and the other for storing the images. After storing the images and masks, we pick each image with its corresponding mask; a sketch of this step is shown below. We then install a pre-trained segmentation model package and load the useful libraries from it. Within Keras, various pre-trained models widely used for segmentation are available, and these models can be tested alongside the U-Net to determine their practicality. A fixed random seed keeps the same results and code behaviour, so no randomness creeps into our calculations and we get the same results every time.
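A minimal sketch of this preparation step (the directory layout, file pattern and seed value are assumptions, not the exact paths used in the experiments):

import glob
import os
import random
import numpy as np
import tensorflow as tf

DATA_DIR = "data/segmentation"  # assumed layout: images/ and masks/ subfolders

image_paths = sorted(glob.glob(os.path.join(DATA_DIR, "images", "*.png")))
mask_paths = sorted(glob.glob(os.path.join(DATA_DIR, "masks", "*.png")))
pairs = list(zip(image_paths, mask_paths))  # each image with its corresponding mask

# Fix every random seed so repeated runs give the same results.
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)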


Figure 4.1 Architecture of the proposed model


Our model was trained for 20 epochs and shows good results, as presented in the figures below.


Figure 4.2 Segmentation result, epoch 1

Figure 4.3 segmentation result epoch 2

Figure 4.4 segmentation result epoch 15


Figure 4.5 segmentation result epoch 20


The result in Figure 4.6 shows the original image on the left and the segmented image on the right, which shows that our model worked well and is properly trained.

Figure 4.6 original image with segmentation result


4.2. Object tracking using u-net
4.2.1. Draw 3D
A calibration file is also needed; it contains information about the camera settings needed to draw on the image. Additionally, representing the 3D bounding box requires a transformation that projects it onto the flat image, for which we need a few helper functions below. Flat DontCare objects are drawn as 2D bounding boxes, as these objects do not carry 3D information. The sensor calibration zip archive contains files storing matrices in row-aligned order, meaning that the first values correspond to the first row:

Draw a 3D bounding box in the image from an (8, 3) array of vertices for the 3D box, in the following order:

      1 -------- 0
     /|         /|
    2 -------- 3 .
    | |        | |
    . 5 -------- 4
    |/         |/
    6 -------- 7
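A sketch of the standard KITTI projection behind this drawing step (the function name and corner ordering are illustrative and may differ from the vertex order above; P2 is the 3x4 camera matrix from the calibration file):

import numpy as np

def compute_box_3d(h, w, l, x, y, z, ry, P2):
    """Return the 8 projected image corners of a KITTI 3D box as an (8, 2) array."""
    # 3D corners in the object frame (KITTI: origin at the bottom centre of the box).
    x_c = [l / 2, l / 2, -l / 2, -l / 2, l / 2, l / 2, -l / 2, -l / 2]
    y_c = [0, 0, 0, 0, -h, -h, -h, -h]
    z_c = [w / 2, -w / 2, -w / 2, w / 2, w / 2, -w / 2, -w / 2, w / 2]
    corners = np.array([x_c, y_c, z_c])  # (3, 8)
    # Rotation around the camera y-axis, then translation to the box location.
    R = np.array([[np.cos(ry), 0, np.sin(ry)],
                  [0, 1, 0],
                  [-np.sin(ry), 0, np.cos(ry)]])
    corners = R @ corners + np.array([[x], [y], [z]])
    # Project with the 3x4 camera matrix P2 and normalise by depth.
    pts = P2 @ np.vstack([corners, np.ones((1, 8))])  # (3, 8)
    return (pts[:2] / pts[2]).T                        # (8, 2) pixel coordinates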

4.2.2. Bounding box


The get_labels function returns the bounding boxes for every car. Thanks to it, we obtain a list with the values of each bounding box; the parameter label_filename is a filename like kitti_3d/{training,testing}/label_2/id.txt, and the function returns a Pandas DataFrame. The label files contain the following information, which can be read and written using the MATLAB tools (readLabels.m, writeLabels.m) provided within the devkit. All values (numerical or strings) are separated by spaces, and each row corresponds to one object.
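A minimal sketch of such a reader (the column names are ours, following the standard KITTI object label format):

import pandas as pd

# One object per row, space separated, following the KITTI label format.
KITTI_COLUMNS = ["type", "truncated", "occluded", "alpha",
                 "bbox_left", "bbox_top", "bbox_right", "bbox_bottom",
                 "height", "width", "length", "x", "y", "z", "rotation_y"]

def get_labels(label_filename):
    """Read one KITTI label file (e.g. kitti_3d/training/label_2/000123.txt)."""
    return pd.read_csv(label_filename, sep=" ", header=None, names=KITTI_COLUMNS)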


4.2.3. Image Data Generator


Instead of loading everything into memory, TensorFlow lets you use a data generator, which, every time a batch of data is requested, opens an image from the disk, processes it and yields it. The basic one is ImageDataGenerator, which allows you to read images from a DataFrame and a folder, but for the masks we have to create our own class that returns to our model the mask generated by the get_mask function. We therefore create a KittiDataGenerator class that has an image at the input and an image of the same size at the output.
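A minimal sketch of such a generator, assuming a get_mask callback like the one described above (paths, batch size and target size are illustrative):

import numpy as np
from tensorflow.keras.preprocessing.image import img_to_array, load_img
from tensorflow.keras.utils import Sequence

class KittiDataGenerator(Sequence):
    """Yields (image, mask) batches; the mask has the same spatial size as the image."""

    def __init__(self, image_paths, label_paths, get_mask, batch_size=8, size=(160, 256)):
        self.image_paths = image_paths
        self.label_paths = label_paths
        self.get_mask = get_mask      # callback that builds a mask array from one label file
        self.batch_size = batch_size
        self.size = size              # (height, width)

    def __len__(self):
        return int(np.ceil(len(self.image_paths) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        images, masks = [], []
        for img_path, lbl_path in zip(self.image_paths[sl], self.label_paths[sl]):
            img = img_to_array(load_img(img_path, target_size=self.size)) / 255.0
            images.append(img)
            masks.append(self.get_mask(lbl_path, self.size))
        return np.array(images), np.array(masks)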

Figure 4.7 Mask image


The generator produces a mask because, in the case of the U-Net model, we want a regular object mask at the output that fully indicates where the included objects (car, van, truck, and people) are, as shown in the mask figures. The KittiDataGenerator class has an image at the input and, at the output, an image of the same size.

Figure 4.8 result mask generation


Figure 4.9 result with mask


The model is the standard U-Net model, i.e. it tapers towards the middle and expands again from the middle; upsampling allows the layers to be connected with each other.

Figure 4.10 Object detection result 1

Figure 4.11 object detection result 2


Figure 4.12 Accuracy graph


The accuracy figure shows four graphs: calc_IOU, dice, fbeta and loss; the light grey line shows training and the orange line shows validation.
calc_IOU
Training (min: 0.079, max: 0.128, cur: 0.128)
Validation (min: 0.106, max: 0.130, cur: 0.130)
Dice
Training (min: 0.079, max: 0.128, cur: 0.128)
Validation (min: 0.106, max: 0.130, cur: 0.130)
Fbeta
Training (min: 0.152, max: 0.168, cur: 0.168)
Validation (min: 0.157, max: 0.169, cur: 0.168)
Loss
Training (min: 0.984, max: 1.086, cur: 0.984)
Validation (min: 0.990, max: 1.057, cur: 0.990)


The final epoch ran 175/175 steps in 208 s (1 s/step) with: loss: 0.9841 - calc_IOU: 0.1280 - dice: 0.1280 - fbeta: 0.1675 - val_loss: 0.9901 - val_calc_IOU: 0.1298 - val_dice: 0.1298 - val_fbeta: 0.1679.

4.3. Object detection and tracking


We use the YOLO-based convolutional neural network family of models for object detection, in particular the most recent variation, YOLOv3. YOLO v3 uses the concept of anchor boxes when doing bounding box prediction; an anchor box represents the most likely object width and height. We want to implement simple object detection with Keras on some JPEG images in our training set, using a pre-trained YOLOv3 model (KITTI dataset). The original weights were trained using the DarkNet code base on the MSCOCO dataset. Download the model weights and place them in your current working directory with the filename "yolov3.weights"; it is a large file and may take a moment to download depending on the speed of your internet connection. Then define a Keras model that has the right number and type of layers to match the downloaded model weights. The model architecture is called DarkNet and was originally loosely based on the VGG-16 model.

Table 4.1 YOLOv3 implementation parameters

Parameters                                        Value
Input image size                                  0.005
Input image size                                  416*416
Number of cells per image                         13*13
Number of bounding boxes per cell                 9
Classes                                           [Pedestrian, Truck, Car, Cyclist, Motorbike]
Classification threshold                          0.6
Non-maximum suppression overlapping threshold     0.5

Two concerns can be raised about the Kitti test dataset. The first is that the test images are not labeled, so the evaluation of the detection results cannot be automated and has to be done manually. The second is that cars are the dominant object in the dataset, which does not contain enough samples of pedestrians, cyclists and trucks. In our implementation, the dataset labels were reorganized to fit our classes: van, tram and car are merged into a single class called car; sitting person and pedestrian are merged into a class named pedestrian; trucks and cyclists are taken from the dataset without any change.
The new photograph is then loaded and arranged as a suitable input to the model. The model expects its inputs to be colour images with a square shape of 416×416 pixels. We can use the load_img() Keras function to load the image and its target_size argument to resize the image on loading, the img_to_array() function to convert the loaded PIL image object into a NumPy array, and then rescale the pixel values from 0-255 to 0-1 as 32-bit floating point values, as in the sketch below.
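A small sketch of this preparation step is given here; load_img() and img_to_array() are the Keras functions named above, while the helper name and the example file name are illustrative.

```python
# Sketch: load a photograph and prepare it as a 416x416 network input.
from numpy import expand_dims
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def load_image_pixels(filename, shape=(416, 416)):
    # remember the original size so the boxes can be mapped back later
    width, height = load_img(filename).size
    # load the image again, resized to the square network input size
    image = load_img(filename, target_size=shape)
    image = img_to_array(image).astype('float32') / 255.0  # rescale 0-255 to 0-1
    image = expand_dims(image, 0)                           # add the batch dimension
    return image, width, height

# example usage with a hypothetical file name
image, image_w, image_h = load_image_pixels('road_scene.jpg')
```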
Because the predictions are made relative to the 416×416 network input, the bounding box coordinates have to be translated back to the shape of the original photograph. The experiencor script provides the correct_yolo_boxes() function to perform this translation; it takes the list of bounding boxes, the original shape of the loaded photograph and the shape of the network input as arguments, and updates the coordinates of the bounding boxes directly.
The model predicts a large number of candidate bounding boxes, and most of them refer to the same objects. The list of bounding boxes can therefore be filtered, and boxes that overlap and refer to the same object can be merged. The script defines the allowed amount of overlap as a configuration parameter, in this case 50% (0.5). This filtering of bounding box regions is generally referred to as non-maximum suppression and is a required post-processing step. The experiencor script provides it via the do_nms() function, which takes the list of bounding boxes and a threshold parameter. Rather than purging the overlapping boxes, their predicted probability for the overlapping class is cleared; this permits the boxes to remain and to be used if they also detect another object type. Using this best-of-breed open-source implementation of YOLOv3 for the Keras deep learning library, the pre-trained YOLOv3 model performs object localization and detection on new photographs; the result is shown in Figure 4.13, and a sketch of the post-processing steps follows below.
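The sketch below strings these steps together. correct_yolo_boxes() and do_nms() are the functions named above; decode_netout() and the anchor values are assumed from the same experiencor script and the standard YOLOv3 configuration, and the thresholds follow Table 4.1.

```python
# Sketch of the YOLOv3 post-processing pipeline, assuming decode_netout(),
# correct_yolo_boxes() and do_nms() from the experiencor keras-yolo3 script.
anchors = [[116, 90, 156, 198, 373, 326],   # assumed standard YOLOv3 anchors
           [30, 61, 62, 45, 59, 119],
           [10, 13, 16, 30, 33, 23]]
class_threshold = 0.6   # classification threshold (Table 4.1)
nms_threshold = 0.5     # non-maximum suppression overlap threshold (Table 4.1)
input_w, input_h = 416, 416

def postprocess(yhat, image_w, image_h):
    """Turn the three raw YOLOv3 output tensors into final bounding boxes."""
    boxes = []
    for i in range(len(yhat)):
        # decode the raw output of each detection scale into candidate boxes
        boxes += decode_netout(yhat[i][0], anchors[i], class_threshold, input_h, input_w)
    # translate box coordinates back to the original image shape
    correct_yolo_boxes(boxes, image_h, image_w, input_h, input_w)
    # clear class probabilities of overlapping boxes (non-maximum suppression)
    do_nms(boxes, nms_threshold)
    return boxes

# example usage: yhat = model.predict(image); boxes = postprocess(yhat, image_w, image_h)
```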


Figure 4.13 YOLOv3 detection result with the accompanying text file


The object tracking results are in the same format as the object detection results: images with the bounding boxes for every detected object and a line of the same colour that shows its trajectory, together with text files holding, for every object ID, its progression across frames and its bounding box coordinates; these are the outputs of our model.
Table 4.2 Detection results for the Kitti dataset using YOLOv3
object        True positive   False positive   False negative
car           559             6                52
truck         3               0                3
pedestrian    38              0                32
cyclist       15              0                4
motorbike     20              0                20
total         635             6                111

Table 4.2 reviews the object detection results for all classes. The results show that 635 objects were detected correctly (true positives), 111 objects were missed (false negatives) and 6 detections were false positives. The IoU is 0.8, the accuracy is 85.94% and the F1 score is 91.56%. It is noticeable that false negative detections were high for pedestrians and cyclists. For a better understanding of the results, the precision and recall were calculated for each class. Precision and recall are given by the following equations:
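Here TP, FP and FN denote the true positive, false positive and false negative counts from Table 4.2; the standard definitions are:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad\qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$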
Table 4.3 Precision and recall values for each class

object Precision Recall


car 98.9% 88.59%
truck 100% 50%
pedestrian 100% 54.28%
cyclist 100% 78.9%
motorbike 100% 50%
total 99.0% 84.1%

Table 4.3 shows the precision and recall for each class. The precision is 100% for every class except car, which still reaches 98.9%, so we can conclude that the detection precision is very high for all classes. There is a significant drop in the recall values for pedestrians and cyclists because of their high numbers of false negatives. Due to the small number of truck samples in the test dataset, we cannot assess the algorithm's performance on truck detection. The summary we can draw from the above results is that the algorithm showed excellent detection accuracy for cars. It also showed high precision for pedestrians and motorbikes, but with low recall values due to the high number of false negatives; in other words, the algorithm has a higher miss rate for small objects such as pedestrians and cyclists than for larger objects such as cars. The Kitti dataset has enough labeled training images to achieve good performance. This work provided a quick review of the different approaches to road object detection and presented recent CNN architectures and methods for object detection. The Darknet-53 architecture of YOLOv3 was explained, the algorithm hyper-parameters were discussed, and the training steps and parameters such as epoch and batch size were presented. However, cars are the dominant object in the dataset and the test images are not labeled, which makes analysis of the detection results more difficult. Detection results are reported by counting true positives, false positives and false negatives for each class, and precision and recall are also calculated for each class. The algorithm showed very good detection accuracy for cars. Pedestrian and cyclist detection showed more false negatives than car detection. The test dataset includes only six trucks; therefore, no solid conclusion can be made about truck detection.

Figure 4.14 Detection result on a collected image

Table 4.4 Detection results for our collected images

class        True positive   False positive   False negative   Precision   Recall   F1      Accuracy   IoU
car          31              1                2                96.88       93.93    95.35   91.18      0.9
motorbike    12              2                2                85.71       85.71    85.71   75.00      0.7
pedestrian   40              1                2                97.56       95.23    96.39   94.12      0.9
Total        83              4                6                95.40       93.25    94.32   90.10      0.9

We collected images with a mobile phone camera and tested them on the proposed model, which gave good results, as shown in Table 4.4; the number of true negatives is 8. Example detection results are shown in Figure 4.14 and Figure 4.15.


Figure 4.15 Detection result on a collected image

4.3.1. Comparisons
Table 4.5 Comparison with previous studies
Model            Accuracy
Fast VP-RCNN     89.00
Multi-task CNN   86.12
Shift R-CNN      65.47
PVGNet           89.94
Proposed study   90.10


5. Conclusion
In this work, we presented experiments and results using U-Net for segmentation and for object detection, together with the Calib calibration files for the 3D bounding boxes. The segmentation results obtained with U-Net were good in both training and testing, and the predictions compare well with the ground-truth annotations. Object detection not only needs to identify the class of the object ('car' in this case) in the image, but also has to locate the object in the image accurately.
The experiments use the pre-trained YOLOv3 framework, together with Keras, NumPy, TensorFlow and OpenCV, to detect and track objects. Object detection and tracking involve two phases, offline and online processing. The pre-trained YOLOv3 model is trained with vehicle images and is tested using our own collected dataset.
With YOLOv3 our results are good, but when an object is far away the detection is not as good. For most results the 'car' accuracy reaches 98%, and the main causes of the lower results are distance, misplacement and objects that are not clearly visible. In future work we will address the problem of distant objects, which reduces accuracy and may cause the model to miss the object entirely.

References