Object Detection and Segmentation
Self-driving is an emerging trend, and multi-task learning combines several tasks; here, object detection, tracking, segmentation, and bounding box experiments are performed to reach the desired results. Many challenges explain why self-driving cars were not on the market as of 2021. This research implements the You Only Look Once (YOLO) family's YOLOv3 algorithm, which uses the Darknet-53 CNN to detect the classes pedestrian, car, truck, cyclist, and motorbike. The model is trained using the KITTI image dataset, which was collected from public roads using a vehicle's front-looking camera. The algorithm is tested, and detection results are presented. Road object images were also collected with a mobile camera to check the results.
Contents

Declaration
1. Introduction
1.1.5. Automation
1.1.6.1. Perception
1.1.6.2. Localization
1.1.6.3. Connector
1.1.6.4. Predictions
1.1.6.5. Communications
1.1.6.7. Sensors
1.1.6.8. Camera
1.1.6.9. LIDAR
1.1.6.10. Controls
1.1.7.4. Segmentation
1.3. Limitation
2. Literature Review
3.2. Dataset
3.4.2.1. Load data from all directories and retrieve a list of masks and images
3.5. YOLO3NET
3.6. Measures
3.6.2. True Negatives (TN)
3.6.5. Precision
3.6.6. Recall
3.6.7. F1 Score
3.6.8. Accuracy
4.3.1. Comparisons
5. Conclusion
References

List of Figures

List of Tables
1. Introduction
The principal practice of AI for self-driving cars goes back to the DARPA (Defense Advanced Research Projects Agency) self-driving car challenge (Ozguner, Stiller, & Redmill, 2007), which was won by Stanley, the Stanford University Racing Team's autonomous car. The winning team, led by Sebastian Thrun, an associate professor of computer science and director of the Stanford Artificial Intelligence Laboratory, attributed the victory to the use of machine learning (Bar-Cohen & Hanson, 2009; Chai, Nie, & Becker, 2020; Jahromi, 2019). Stanley was equipped with multiple sensors and backed by bespoke software, including ML algorithms, which helped the vehicle find its path, detect obstacles, and avoid them while staying on course (Lin, 2016). Thrun later led the 'Self-Driving Car Project' at Google, which eventually became Waymo in 2016 (National Academies of Sciences & Medicine, 2017; Owczarek, 2018).
1.1 Artificial Intelligence
Deep learning models (Kelly & O'Reilly, 2019) have highly flexible architectures that can learn directly from raw data and can increase their predictive accuracy when provided with more data. Computer vision applications use deep learning to gain knowledge from digital images and video (Tan & Lim, 2018). Conversational AI applications help computers understand and communicate through natural language. Recommendation systems use images, language, and a user's interests to offer meaningful and relevant search results and services (Montenegro, da Costa, & da Rosa Righi, 2019).
Advantage                              Disadvantage
1. Decreased the number of accidents   1. Expensive
1.1.5. Automation
Automation is an area of technology in which a machine performs one or several tasks without human assistance.
1.1.5.1. Level of Automation
SAE International defines vehicles as having six levels of automation depending upon the amount of attention required from a human driver. In levels zero through two, humans drive and monitor the traffic environment; in levels three through five, the automated systems drive and monitor the environment (Dokic, Müller, & Meyer, 2015; Sousa, Almeida, Coutinho-Rodrigues, & Natividade-Jesus, 2018).
Level 4 vehicles can operate in self-driving mode, but until legislation and infrastructure evolve, they can only do so within a limited area (usually an urban environment where top speeds reach an average of 30 mph). This is known as geofencing. As such, most Level 4 vehicles in existence are geared toward ridesharing (Sagir, 2020).
1.1.6.1. Perception
The perception module senses the surrounding environment and provides a state estimate (Y. Li & Ibanez-Guzman, 2020; S. Liu et al., 2019; Major et al., 2019; Marti, de Miguel, Garcia, & Perez, 2019).
1.1.6.2. Localization
The localization module is one of the crucial parts of a self-driving system. However, the proposed benchmark currently contains only one localization node; the plan is to provide multiple nodes in the future (Ferranti et al., 2019; Tokunaga, Ota, Tange, Miura, & Azumi, 2019).
1.1.6.3. Connector
The connector node determines the velocity and pose of the vehicle used for subsequent processing. Autoware can calculate the vehicle velocity and pose in multiple ways, such as using Controller Area Network (CAN) information, obtaining them from the generated path, or estimating them from localization. These nodes output the specified velocity and pose as the current values (Olufowobi, Young, Zambreno, & Bloom, 2019).
1.1.6.4. Predictions
The prediction module examines the motion patterns of other traffic agents and forecasts the autonomous car's (AC's) future trajectories, which enables the AC to make appropriate navigation decisions (R. Fan et al., 2019). Recent prediction approaches can be grouped into two main categories: data-driven and model-based. The model-based approach computes the AC's future motion by propagating its kinematic state (position, speed, and acceleration) over time, based on the underlying physical system's kinematics and dynamics (Z. Wang, 2018). Using map information as a constraint to compute the next AC location works well for short-term predictions, but its performance degrades over longer time horizons, as it disregards nearby context such as roads and traffic rules (R. Fan et al., 2019). Furthermore, a pedestrian motion prediction model can be formed based on attractive and repulsive forces (Xue, Huynh, & Reynolds, 2019).
1.1.6.5. Communications
Autonomous vehicles require various sensors such as LIDAR, radar, and cameras to understand their surroundings; however, such sensors can easily be obstructed by nearby obstacles, preventing long-range detection (Schoettle, 2017). This can be improved by using sensor data from other vehicles equipped with a vehicular communication device (Eze, Zhang, Liu, & Eze, 2018; Gora & Rüb, 2016).
1.3. Limitation
This proposed study addresses object detection and tracking for self-driving; it does not build self-driving hardware and does not use any sensors. The study only uses an existing self-driving dataset for object detection and tracking, but to test the proposed model, additional images were collected with a mobile camera.
1.4. Objective
Self-driving cars are the future, and cars need object detection to perceive their surroundings. The main objectives are to improve road safety and to increase the true positive and true negative values for each object class.
2. Literature Review
Previous approaches to this problem suffer either from an overly complex inference engine or from insufficient detection accuracy. To deal with these issues, SS3D is presented, a single-stage monocular 3D object detector. The framework consists of a CNN, which outputs a redundant representation of each relevant object in the image with corresponding uncertainty estimates, and a 3D bounding box optimizer. The method achieves state-of-the-art accuracy on monocular 3D object detection while running at 20 fps in a straightforward implementation. The proposed methods are evaluated primarily on the KITTI object detection benchmark, a set of image sequences with annotated 2D and 3D bounding boxes for a small number of object categories (Jörgensen, Zach, & Kahl, 2019).
The method can be used with any camera-based object detector, and the technique is illustrated on several sets of real-world data. A state-of-the-art detector, tracker, and classifier trained only on synthetic data can identify valid errors on the KITTI tracking dataset with an Average Precision (AP) of 0.94. A new tracking dataset is also released, with 104 sequences totaling 80,655 labeled pairs of stereo images along with ground-truth disparity from a game engine, to facilitate further research. Using the proposed features, an off-the-shelf random forest classifier achieves an AP score of 0.93 on the GTA dataset for the RRC detector. Furthermore, the system can find errors made in the KITTI dataset with an AP score of 0.94 for the RRC detector (Ramanagopal, Anderson, Vasudevan, & Johnson-Roberson, 2018).
In the detection task, the performance degradation on GTF is due to the fact that the centers of some GTF targets are also the centers of SBF targets (Fu et al., 2020).
This paper worked on BDD100K, the largest driving video dataset, with 100K videos and 10 tasks to evaluate the development of image recognition algorithms for autonomous driving. The benchmarks comprise ten tasks: image tagging, lane detection, drivable area segmentation, road object detection, semantic segmentation, instance segmentation, multi-object detection tracking, multi-object segmentation tracking, domain adaptation, and imitation learning. The experiments provided extensive analysis of different multitask learning scenarios: homogeneous multi-task learning and cascaded multi-task learning. The results presented interesting findings about allocating the annotation budget in multi-task learning (Yu et al., 2018).
The problem is to first predict salient features and then share these features with driving decisions. The proposed model is a deep neural network that feeds features extracted from the input image into a recurrent neural network with an attention mechanism. The model is evaluated on the driving dataset BDD-A and the saliency dataset CAT2000. The proposed model produces encouraging results explaining the relationship between saliency prediction and driving decisions, and it provides a holistic framework whose output can be used as input for driving decisions. The predicted saliency map is also used in making driving decisions (braking). The proposed model has two main components, including a Driver Attention Module.
This paper deals with the issue of estimating the confidence of a deep neural network in reaction to unexpected execution contexts, with the purpose of predicting possible safety-critical misbehaviors such as out-of-bound episodes.
A novel approach is designed for tracking by detection, which exploits the power of structured prediction as well as deep neural networks. Towards this goal, the problem is formulated as inference in a deep structured model (DSM) on the KITTI dataset. The CNNs are initialized with VGG16 weights pre-trained on ImageNet, and the fully connected layers that include the weights of the binary random variables (y) are initialized by sampling from a truncated normal distribution. Experimental evaluation on the challenging KITTI dataset shows that the approach is very competitive, outperforming the state of the art in MOTP (Frossard & Urtasun, 2018).
Given a video frame as input, a visual encoder can encode the visual information in a discriminative manner while maintaining the relevant spatial information. The ImageNet pre-trained AlexNet model is used, with dilated convolutions for conv3. The effectiveness of the Deep Generic Driving Networks driving model and its learning is investigated by evaluating future ego-motion prediction on held-out sequences across diverse conditions (Gao, 2019).
Multi-scale feature maps are merged using a concat operator to encode more contextual information. The loss function, optimization details, ablation studies, and evaluation metrics are presented. The network achieves a mean IOU value of 74.12, which is better than the previous state of the art on semantic segmentation, while running at more than 100 FPS (Sagar & Soundrapandiyan, 2020).
For autonomous driving, the object state in the next frame is predicted by approximating the inter-frame displacement of objects using a constant-velocity model, independent of camera ego-motion. The Hungarian algorithm is applied to the affinity matrix for data association. The method achieves second place on the official KITTI 2D MOT leaderboard among all published works while achieving the fastest speed, suggesting that simple and accurate 3D MOT can lead to very good results in 2D (Weng & Kitani, 2019).
3D object detection is essential to obtain information about objects' extent and range in 3D space. YOLO4D is presented for spatio-temporal real-time 3D multi-object detection and classification from LiDAR point clouds. All experiments are conducted on the publicly available KITTI raw dataset, which consists of sequenced frames, unlike the KITTI benchmark dataset; it contains 36 different annotated point cloud scenarios of variable lengths and a total of 12,919 frames. Automated driving dynamic scenarios are rich in temporal information, yet most current 3D object detection approaches focus on processing spatial sensory features, either in 2D or 3D space, while the temporal factor is not fully exploited, especially for 3D LiDAR point clouds. For training, all spatio-temporal models use a clip length m of 4. The YOLO4D models outperform all other methods on all classes, achieving an 11.5% improvement over Mixed-YOLO3D and a 34.26% improvement over Tiny-YOLO3D. Frame stacking provides a 2.36% improvement over the Mixed-YOLO baseline model and a 14.56% improvement.
Tracking objects over time, i.e., maintaining identity (ID) consistency, is important in multiple object tracking (MOT). This is especially challenging in complex scenes with occlusion and interaction between objects. Significant improvements in single object tracking (SOT) methods have inspired the introduction of SOT into MOT to improve robustness, that is, to maintain object identities as long as possible and to help alleviate the limitations of imperfect detections. MOT in video, a critical problem for many applications including robotics, video surveillance, and autonomous driving, remains one of the big challenges of computer vision. The goal is to locate all objects of interest in a series of frames and form a reasonable trajectory for each one. Since recent progress has been made in object detection, tracking-by-detection is widely used. MOT17 is used to evaluate the tracking performance. It consists of several challenging pedestrian tracking sequences, with a significant number of occlusions and crowded scenes, and variations in angle of view, object size, camera motion, and frame rate. MOT17 has the same video sequences as the earlier MOT16. Metrics include most tracked (MT) and identity preserving (IDF1), which compares the ground-truth trajectory with the computed trajectory via a bipartite graph and reflects how long an object's identity is correctly preserved.
One set contains the pixels of cars and everything else is background, while the second set contains the background pixels. This result is often represented as a binary image or as a mask. Object tracking plays a vital role in several shape-recognition and computer-vision pattern-recognition applications such as autonomous robotic navigation, surveillance, and vehicle navigation. Point tracking, particularly under frequent occlusions and false object detections, is a complex problem; once the points are identified, recognition can be achieved fairly quickly by point-based tracking approaches. Moving object detection and tracking has become an attractive and crucial research topic. There are many methods for object detection and tracking, each with its own advantages and disadvantages. For object tracking, a single method cannot give good accuracy for different kinds of videos with different situations, such as poor resolution or changing weather conditions, e.g., Gaussian Mixture Modeling (Drayer & Brox, 2016).
The contributions include detection of novel events for which the network has been insufficiently trained and cannot be trusted to produce reliable outputs, and automated debiasing of a neural network training pipeline, leading to faster training convergence and increased accuracy. The end-to-end control problem is first formulated, and then the model architecture for estimating steering control of an autonomous vehicle is described. All models in the paper were trained on an NVIDIA Volta V100 GPU. To evaluate the model's performance on end-to-end autonomous vehicle steering control, a standard regression network is first trained, which takes a single image as input and outputs steering curvature. The input image data is modeled by a set of underlying latent variables (one of which is the steering command taken by a human driver) with a VAE architecture (Amini et al., 2018).
The KITTI dataset contains images of natural scenes (city and rural areas and highways) collected in Karlsruhe, Germany. It contains 200 training stereo image pairs with sparse ground-truth disparities, collected using a LiDAR sensor, and 200 testing image pairs without ground-truth disparities. KITTI allows performance evaluation by submitting final results to its evaluation server. PSMNet is an effective 3D stereo matching network that is commonly used as a backbone for disparity estimation. An effective solution is provided by including foreground- and background-specific depth-based loss functions (Saleh, Hardt, & Manoharan, 2020).
As one of the primary computer vision problems, object detection aims to find and locate semantic objects in digital images. Unlike object classification, which only assigns an object to a certain class, object detection also needs to extract accurate object locations. In state-of-the-art object detection algorithms, bounding box regression plays a critical role.
Simply applying a single object tracker to MOT encounters problems of computational efficiency and drifted results caused by occlusion. The framework achieves computational efficiency by sharing features and using ROI-Pooling to obtain individual features for each target, and it introduces a spatial-temporal attention mechanism (STAM) to handle the drift caused by occlusion and interaction among targets. Tracking objects in videos is a significant problem that has attracted great attention, with applications such as video surveillance, human-computer interaction, and autonomous driving. The main goal of MOT is to estimate the locations of multiple objects in a video and maintain their identities consistently in order to yield their individual trajectories. The training sequences in the MOT15 benchmark are used for performance analysis of the proposed method. The ground-truth annotations of the test sequences in both benchmarks are not released, and the tracking results are automatically evaluated by the benchmark, so the test sequences in the two benchmarks are used for comparison with various state-of-the-art MOT methods. The overall tracking speed of the proposed method on MOT15 test sequences is 0.5 fps using a 2.4 GHz CPU and a TITAN X GPU, while the algorithm without feature sharing runs at 0.1 fps in the same environment. The proposed dynamic CNN-based online MOT algorithm efficiently utilizes the merits of single object trackers using shared CNN features and ROI-Pooling; in addition, the spatial-temporal attention mechanism alleviates the problem of drift caused by frequent occlusions and interactions among targets.
The results show the performance of the tracking algorithm, with a success rate of 94%, and of the detection model, which achieves a TDR of 90% to 93% with an FDR of 0.5%. In the future, this work might be extended by further training (Ahmad, Ahmed, Khan, Qayum, & Aljuaid, 2020).
This has led to rapid evolution of autonomous driving systems over the last several decades, with the promise of preventing such accidents and improving the driving experience; such systems have been very successful in the past, both in academia and industry, which has led to autonomy being deployed on roads. Navigation in dense urban environments requires understanding complex multi-agent dynamics, including tracking multiple actors across scenes, predicting intent, and adjusting agent behavior conditioned on historical states. Since DRL is challenging to apply in the real world, primarily due to safety considerations and the poor sample complexity of state-of-the-art algorithms, most current research in the RL domain is carried out on simulators, such as TORCS and CARLA, which can eventually be transferred to real-world settings; an approach along these lines is presented.
The A2CfDoC agents interact with the Carracing-v0 and CARLA environments (with the same seed) and run for a set number of steps (respectively 200,000 and 450,000) without expert demonstrations, leading to suboptimal performance that is lower than the capabilities learned from the experts. Gradient clipping limits the size of network updates that could provoke large changes from the previous policy, keeping the update in a secure region and avoiding dramatic decreases in performance. A next step could be a combination of other techniques to address the many challenges that artificial intelligence faces in building a 100% reliable ADAS system: using a hierarchical deep learning network architecture to form a single network that can deal with complex tasks and include other sub-functions like sensor fusion, occupancy grid mapping, and path planning, or handle several macro features ranging from pedestrian detection, road-sign recognition, and collision avoidance to more complex ones like self-parking, lane-keeping, and cruise control. The Partially Observable Markov Decision Process (POMDP) principle could also be combined to give the deep learning network the ability to deal with limited spatial and temporal perception of the environment by using an RNN/LSTM to predict (Ding, Florensa, Phielipp, & Abbeel, 2019).
This paper presented SDVTracker, a technique for learning motion state estimation and multiclass object-detection association. It is a practical tracking system that applies a deep learned model together with classical association techniques. To work out the data association problem, incoming detections at the current timestamp need to be matched to existing objects from the preceding timestamp; state estimation is performed in combination with an Interacting Multiple Model (IMM) filter. The model jointly optimizes state estimation and association by means of a novel loss, together with an algorithm for determining ground-truth supervision during training (Sun, Chen, Liang, Ruan, & Mukherjee, 2020).
The effective detection of curbs is fundamental and crucial for the navigation of a self-driving car. This paper presents a real-time curb detection method that automatically segments the road and detects its curbs using a 3D LiDAR sensor. The method captures road curbs in various road scenarios, including straight roads, curved roads, T-shape intersections, Y-shape intersections, and +-shape intersections. The curb information forms the foundation of decision making and path planning for autonomous driving. Comprehensive offline and real-time experiments demonstrate that the proposed method achieves high curb detection accuracy in various scenarios while satisfying the stringent efficiency requirements of autonomous driving. The offline experiment demonstrates that curbs can be robustly extracted: the average precision is 84.89%, the recall is 82.87%, and the average F1 score is 83.73%. Furthermore, the average processing time in the real-time experiments is around 12 ms per frame, which is fast enough for self-driving (Y. Zhang, Wang, Wang, & Dolan, 2018).
3. Materials and Methods
3.2. Dataset
The dataset is "The KITTI Vision Benchmark Suite". This kernel contains the object detection part of the different datasets published for autonomous driving: a set of images with their bounding box labels. For more information, visit the website where the data is published and/or read the README file, as it explains the label format. The collection comprises 12 GB of KITTI data, in addition to the calibration files for the cameras placed in the car used for training. We also collected driving-situation (road object) images at our university to test the proposed model.
3.4. U-Net (Convolutional Networks for Biomedical Image Segmentation)
Related computer vision tasks include object detection (where are the objects), object localization (what is their extent), and object classification (what are they).
3.4.2.1. Load data from all directories and retrieve a list of masks and images
IMAGE_WIDTH = 256
IMAGE_HEIGHT = 160
Because the image is reduced by a factor of 2 in each pooling layer, the size is rounded to an integer whenever a dimension is an odd number. In the case of U-Net, when we concatenate some of the initial layers with later layers that have been upsampled (multiplied by two), a layer we want to join may end up, for example, with a size of 86, and building the model will return an error. It should be remembered that the input sizes should be multiples of 8 in the case of 3 reduction operations, or 16 in the case of 4 reduction operations.
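To make this constraint concrete, the following minimal sketch (our own illustration with a hypothetical helper name, not code from the thesis) checks whether an input dimension survives a given number of 2× reductions without producing the odd sizes that break skip-connection concatenation:

# Hypothetical helper: a dimension must stay even through every downsampling
# step, i.e. be a multiple of 2**depth, or the upsampled feature map will not
# match its skip connection.
def is_unet_compatible(size: int, depth: int) -> bool:
    for _ in range(depth):
        if size % 2:      # an odd size would be rounded at pooling
            return False  # the concatenation would then fail
        size //= 2
    return True

# The 256x160 input above passes for 3 reduction operations (multiples of 8).
assert is_unet_compatible(256, 3) and is_unet_compatible(160, 3)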
Get labels: returns the necessary data for each image; each object is a separate line with a description. The most important parameters include the object class and its bounding box coordinates.
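As a hedged illustration (a hypothetical helper, not the thesis code), one line of a KITTI label file can be parsed as follows; the field layout is taken from the public KITTI object development kit README:

# Hypothetical helper: parse one line of a KITTI label file. Each line is:
# type truncated occluded alpha x1 y1 x2 y2 h w l X Y Z rotation_y [score]
def parse_kitti_label(line: str) -> dict:
    f = line.split()
    return {
        "type": f[0],                              # e.g. 'Car', 'Pedestrian', 'Cyclist'
        "bbox": tuple(map(float, f[4:8])),         # 2D box: left, top, right, bottom
        "dimensions": tuple(map(float, f[8:11])),  # 3D box: height, width, length
        "location": tuple(map(float, f[11:14])),   # 3D position in camera coordinates
        "rotation_y": float(f[14]),                # yaw around the camera Y axis
    }

label = parse_kitti_label("Car 0.00 0 1.85 387.63 181.54 423.81 203.12 1.67 1.87 3.69 -16.53 2.39 58.49 1.57")
print(label["type"], label["bbox"])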
Let us discuss this architecture in detail here and understand its building blocks. In the next chapter, we explain our version of U-Net, which is modified compared to what is explained here. The difference is that the convolutions we use for U-Net add padding to the input so that after applying the convolution the image keeps its dimensions, whereas the architecture explained in the original U-Net paper does not pad; hence after applying a 3×3 convolution to the input, the output image is 2 pixels short in width and height.
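The padding difference can be seen directly in Keras output shapes; this small sketch is illustrative only, with an arbitrary filter count:

import tensorflow as tf

x = tf.zeros((1, 160, 256, 3))                             # one 160x256 RGB image
valid = tf.keras.layers.Conv2D(16, 3, padding="valid")(x)  # original U-Net convolution
same = tf.keras.layers.Conv2D(16, 3, padding="same")(x)    # padded variant used in our model
print(valid.shape)  # (1, 158, 254, 16): 2 pixels shorter in each spatial dimension
print(same.shape)   # (1, 160, 256, 16): dimensions preserved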
3.5. YOLO3NET
YOLOv3 is an improved version of YOLO and YOLOv2. The main change in its network structure is the introduction of residual blocks, which ensure that even as the YOLOv3 network becomes deeper, the model can still converge quickly. In order to better deal with the problem of overlap, the loss function uses binary cross-entropy loss, and a multi-scale fusion method is adopted to merge high-level semantics with low-level features, which improves sensitivity to small targets.
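As a sketch of the residual idea (our own minimal Keras illustration, not the exact Darknet-53 definition), a Darknet-style residual block squeezes channels with a 1×1 convolution, restores them with a 3×3 convolution, and adds the input back so gradients flow through deep stacks:

from tensorflow.keras import layers

def darknet_residual(x, filters):
    """Darknet-style residual block: 1x1 squeeze, 3x3 expand, shortcut add.
    The input x must already have `filters` channels for the addition."""
    shortcut = x
    y = layers.Conv2D(filters // 2, 1, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(0.1)(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(0.1)(y)
    return layers.Add()([shortcut, y])  # identity shortcut keeps convergence fast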
YOLO (You Only Look Once) is a family of convolutional neural networks that achieves near state-of-the-art results with a single end-to-end model that can perform object detection in real time. This section explains the changes introduced in YOLO v3; it does not explain YOLO from the ground up and assumes familiarity with how YOLO v2 works. A best-of-breed open-source implementation of YOLOv3 is available for the Keras deep learning library. The official title of the YOLO v2 paper, "YOLO9000: Better, Faster, Stronger", sounded as if YOLO were a milk-based health drink for kids rather than an object detection algorithm. For its time, YOLO9000 was the fastest and also one of the most accurate algorithms. A couple of years down the line, however, it is no longer the most accurate, with algorithms like RetinaNet and SSD outperforming it in terms of accuracy, although it remained one of the fastest. That speed has been traded for boosts in accuracy in YOLO v3, which has to do with the increased complexity of the underlying architecture, called Darknet.
The classes used in this experiment are car, motorbike, and pedestrian. The method involves a single deep CNN (originally a version of GoogLeNet, later updated and called DarkNet, based on VGG) that splits the input into a grid of cells, where each cell directly predicts a bounding box and an object classification. The result is a large number of candidate bounding boxes that are consolidated into a final prediction by a post-processing step. There are three core variations of the method at the time of writing: YOLOv1, YOLOv2, and YOLOv3. The first version proposed the general architecture, the second version refined the design and made use of predefined anchor boxes to improve bounding box proposals, and the third version further refined the model architecture and training procedure.
3.6. Measures
3.6.1. True Positives (TP)
These are the correctly predicted positive values, meaning that the value of the actual class is true (yes) and the value of the predicted class is also true (yes).
3.6.5. Precision
Precision is the fraction of detected objects (TP + FP) that are detected correctly: the measure of accurately identified positive cases out of all predicted positive cases, i.e., the ratio of true positives to the sum of true positives and false positives. Precision looks at how many junk positives got thrown into the mix; if there are no bad positives (FPs), the model has 100% precision, and the more FPs that get into the mix, the worse precision looks. To calculate a model's precision, we need the positive and negative counts from the confusion matrix. Precision is valuable when the cost of false positives is high; it tells us how many false positive (FP) detections the detector produces. It is defined as follows.

Equation 1
Precision = TP / (TP + FP)
3.6.6. Recall
Recall is the measure of correctly identified positive cases out of all actual positive cases. It matters most when the cost of false negatives is high. It tells us the fraction of the ground-truth objects (TP + FN) that are detected by the detector. It is defined as

Equation 2
Recall = TP / (TP + FN)

The recall rate is penalized whenever a false negative is predicted. Because the penalties in precision and recall are opposites, so too are the equations themselves; precision and recall are the yin and yang of assessing the confusion matrix.
3.6.7. F1 Score
The F-score, also called the F1-score, is a measure of a model's accuracy on a dataset. It combines the precision and recall of the model and is defined as their harmonic mean: F1 = 2 × ((precision × recall) / (precision + recall)). It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between precision and recall. As in many vision problems, the ground-truth labeling may not be perfect. For autonomous navigation applications, it is not a serious problem if the estimated free space is smaller than the actual one; on the other hand, it is more critical not to have any obstacles inside the free-space curve. In this regard, the F1 score is proposed to measure the accuracy of classification of pixels under the curve, given by

Equation 3
F1 = 2 × P × R / (P + R)
3.6.8. Accuracy
One of the more obvious metrics, accuracy is the measure of all correctly identified cases. It is most used when all classes are equally important.

Equation 4
Accuracy = (TP + TN) / (TP + FP + FN + TN)
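The four measures above can be computed together from the raw confusion-matrix counts; the following helper is an illustrative sketch, not code from the thesis:

def detection_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts (Equations 1-4)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

print(detection_metrics(tp=90, fp=2, fn=8, tn=0))  # example counts, not thesis results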
Figure 3.2: IOU
Formally, the IOU measures the overlap between the ground-truth box and the predicted box over their union. It can be written in terms of TP, FP, TN, and FN as

Equation 5
IOU = TP / (TP + FP + FN)
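For detection boxes, the same quantity is usually computed geometrically; the sketch below (illustrative only, with corner-format boxes assumed) measures the overlap of two axis-aligned boxes over their union:

def box_iou(gt, pred):
    """IOU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(gt[0], pred[0]), max(gt[1], pred[1])  # intersection top-left
    ix2, iy2 = min(gt[2], pred[2]), min(gt[3], pred[3])  # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(gt) + area(pred) - inter
    return inter / union if union else 0.0

print(box_iou((50, 50, 150, 150), (75, 60, 170, 160)))  # ~0.53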
4. Results and Discussions
4.1. Segmentation (U-Net)
Image segmentation is a common computer vision task. Fig. 11 shows the architecture of the proposed segmentation U-Net model. The training data contains 30 images of 512×512 pixels, which is far from enough to feed a deep neural network, so keras.preprocessing.image is used for data augmentation. After importing the libraries, we initialize the directory where the images are stored and create two lists, one for storing masks and the other for storing images (mask & image). After storing the images and masks, we pair each image with its corresponding mask, using the code shown below. We then install the pre-trained segmentation models and load all the useful libraries from that segmentation package. Within Keras, various pre-trained models widely used for segmentation are available, and these models will be tested alongside the U-Net to determine their practicality. Fixing the random seeds keeps the same results and code behavior, so no randomness creeps into our calculations and we get the same results every time.
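A minimal sketch of this setup is shown below; the directory paths and augmentation parameters are illustrative assumptions, not the thesis settings:

import random
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# fix every seed so no randomness creeps into the calculations
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)

# identical augmentations for images and masks keep the pairs aligned
data_gen_args = dict(rotation_range=10, width_shift_range=0.05,
                     height_shift_range=0.05, zoom_range=0.1,
                     horizontal_flip=True)
image_gen = ImageDataGenerator(**data_gen_args)
mask_gen = ImageDataGenerator(**data_gen_args)

image_flow = image_gen.flow_from_directory("data/images", class_mode=None, seed=1)
mask_flow = mask_gen.flow_from_directory("data/masks", class_mode=None, seed=1)
train_flow = zip(image_flow, mask_flow)  # yields (image_batch, mask_batch) pairs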
4.2. Object Tracking Using U-Net
Draw a 3D bounding box in the image from an (8, 3) array of vertices for the 3D box, in the following order:

    1 -------- 0
   /|         /|
  2 -------- 3 .
  | |        | |
  . 5 -------- 4
  |/         |/
  6 -------- 7
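A hedged sketch of the drawing routine (our own OpenCV illustration; the function name and signature are assumptions, not the thesis code), taking the eight vertices already projected to pixel coordinates in the order of the diagram above:

import cv2

def draw_projected_box3d(image, qs, color=(0, 255, 0), thickness=2):
    """Draw the 12 edges of a 3D box from its 8 projected vertices."""
    p = lambda idx: tuple(map(int, qs[idx][:2]))  # pixel (x, y) of vertex idx
    for k in range(4):
        cv2.line(image, p(k), p((k + 1) % 4), color, thickness)          # top face 0-1-2-3
        cv2.line(image, p(k + 4), p((k + 1) % 4 + 4), color, thickness)  # bottom face 4-5-6-7
        cv2.line(image, p(k), p(k + 4), color, thickness)                # vertical edges
    return image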
175/175 step accuracy result: 208s 1s/step - loss: 0.9841 - calc_IOU: 0.1280 - dice: 0.1280 - fbeta: 0.1675 - val_loss: 0.9901 - val_calc_IOU: 0.1298 - val_dice: 0.1298 - val_fbeta: 0.1679
Parameters                                       Value
Input image size                                 0.005
Input image size                                 416×416
Number of cells per image                        13×13
Number of bounding boxes per cell                9
Classes                                          Pedestrian, Truck, Car, Cyclist, Motorbike
Classification threshold                         0.6
Non-maximum suppression overlapping threshold    0.5
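For reference, the unambiguous rows of the table can be written as a configuration sketch (the dictionary structure and key names are our own illustration):

YOLO_CONFIG = {
    "input_size": (416, 416),  # network input, pixels
    "grid": (13, 13),          # cells per image at the coarsest scale
    "boxes_per_cell": 9,
    "classes": ["Pedestrian", "Truck", "Car", "Cyclist", "Motorbike"],
    "class_threshold": 0.6,    # classification threshold
    "nms_threshold": 0.5,      # non-maximum suppression overlap threshold
}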
Two concerns can be raised about the KITTI test dataset. The first is that the test images are not labeled, so evaluating detection results cannot be automated and has to be done manually. The second is that cars are the dominant object in the dataset, which does not have enough samples of pedestrians, cyclists, and trucks. In our implementation, the dataset labels were reorganized to fit our four classes: van, tram, and car are considered one class, called cars; sitting person and pedestrian are merged into a class named pedestrian; trucks and cyclists are taken from the dataset without any change.
Load the new photograph and prepare it as suitable input to the model. The model expects inputs to be color images with a square shape of 416×416 pixels. We can use the load_img() Keras function to load the image, with the target_size argument to resize the image after loading. We can also use the img_to_array() function to convert the loaded PIL image object into a NumPy array, and then rescale the pixel values from 0-255 to 0-1 32-bit floating point values.
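A minimal sketch of this preprocessing (the helper name follows the experiencor tutorial convention; the filename is hypothetical):

import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

def load_image_pixels(filename, shape=(416, 416)):
    image = load_img(filename)                     # first load to record the original size
    width, height = image.size
    image = load_img(filename, target_size=shape)  # reload resized for the network
    pixels = img_to_array(image).astype("float32") / 255.0  # 0-255 -> 0-1 floats
    return np.expand_dims(pixels, 0), width, height          # add the batch dimension

image, image_w, image_h = load_image_pixels("road_scene.jpg")  # hypothetical test image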
The experiencor script provides the correct_yolo_boxes() function to perform this translation of bounding box coordinates, taking the list of bounding boxes, the original shape of the loaded photograph, and the shape of the input to the network as arguments. The coordinates of the bounding boxes are updated in place.
The model predicts a lot of candidate bounding boxes, and most of the boxes will refer to the same objects. The list of bounding boxes can be filtered, and boxes that overlap and refer to the same object can be merged. The script defines the amount of overlap as a configuration parameter, in this case 50% or 0.5. This filtering of bounding box regions is generally referred to as non-maximum suppression and is a required post-processing step.
The experiencor script delivers this via the do_nms() function, which takes the list of bounding boxes and a threshold parameter. Rather than purging the overlapping boxes, their predicted probability for the overlapping class is cleared. This permits the boxes to remain and be used if they also detect another object type. Using the best-of-breed open-source Keras implementation of YOLOv3, a pre-trained YOLOv3 model performs object localization and detection on new photographs; the results are shown in Fig. 18.
4.3 Object Detection and Tracking
Class          TP     FP    FN
car            559    6     52
truck          3      0     3
pedestrian     38     0     32
cyclist        15     0     4
motorbike      20     0     20
The table summarizes the object detection results for all classes. Results show that 635 objects are classified correctly, while 111 objects were misclassified and 88 objects were false positives. The IOU is 0.8, accuracy is 85.94, and the F1 score is 91.56. It is noticed that false negative detection was high for pedestrians and cyclists. For a better understanding of the results, the precision and recall were calculated for each class using the equations given above.

Table 4.3: Precision and recall values for each class
Table 4.3 shows the precision and recall for each class. The precision is higher than 98% for all classes except car, and we can conclude that detection accuracy is very high for all classes. There is a significant drop in the recall values for pedestrians and cyclists because of the high number of false negatives. Due to the small number of truck samples in the test dataset, we cannot assess the algorithm's performance on truck detection. The summary we can draw from the above results is that the algorithm showed excellent detection accuracy for cars. It also showed high precision for pedestrians and motorbikes, but with low recall values due to the high number of false negatives. That means the algorithm has a high miss-detection rate for small objects like pedestrians and cyclists compared to larger objects like cars. The KITTI dataset has enough labeled training images to achieve good performance. This work provided a quick review of the different approaches to road object detection and presented recent CNN architectures and methods for object detection. The Darknet-53 architecture of YOLOv3 was explained, and the algorithm's hyper-parameters were discussed. Algorithm training steps and parameters like epoch and batch were presented. However, cars are the dominant object in the dataset and the test images are not labeled, which makes detection result analysis more difficult. Detection results are displayed by counting true positives, false positives, and false negatives for each class. Precision and recall were also calculated for each class. The algorithm showed a very good detection accuracy for cars. Pedestrian and cyclist detection showed more false negatives than cars. The test dataset includes only six trucks; therefore, no solid conclusion can be made about truck detection.
Figure 4.14
We collected images with our mobile camera and tested them on the proposed model, which showed good results, as presented in Table 4.4. The true negative count is 8. Image results are shown in Fig. 4.14 and Fig. 4.15.
Figure 4.15
4.3.1. Comparisons

Table 4.5: Comparison with previous studies

Model             Accuracy
Fast VP-RCNN      89.00
Multi-task CNN    86.12
Shift R-CNN       65.47
PVGNet            89.94
Proposed study    90.10
5. Conclusion
In this work, we presented experiments and results using U-Net for segmentation and for object detection, with the Calib file used for the 3D bounding boxes. The segmentation results in both training and testing with U-Net were good, and the results agree well with the ground-truth dataset. Object detection not only needs to identify the class of the object ('car' in this case) in the image, but also has to locate the object in the image accurately.
The experiment uses the pre-trained YOLO v3 framework to detect and track objects with Keras, NumPy, TensorFlow, and OpenCV. For object detection and tracking, there are two phases: offline and online processing. The pre-trained YOLO v3 model is trained with some vehicle images and is tested using our own collected dataset.
With YOLOv3 our results are good, but when an object is far away the model does not give good results. In most results, 'car' accuracy is 98%; the main issues that degrade the results are distance, misplacement, and objects that cannot be seen properly. In future work, we will address the problem of objects at long distances, which affects accuracy and may cause the model to miss the object.
References