Research Paper


Drivable area detection for self-driving cars using deep learning

Benchekroun Ahmed Laksantini Khalid Radouane Joubair

Abstract

Every year, the autonomous vehicle industry grows by 16% globally, and around 9.1 driverless car crashes occur per million miles driven. This research paper presents a study on drivable area detection for self-driving cars using deep learning. The goal of this study was to develop a robust and accurate drivable area detection system for self-driving cars, which can be used to improve the safety and efficiency of autonomous vehicles. To achieve this goal, we proposed the use of convolutional neural networks (CNNs) for this task and evaluated the performance of four different CNN architectures: PSPNet, ResNet, U-Net, and STDC. The methods were evaluated using two datasets, ADE20K and Cityscapes, composed respectively of a variety of objects, scenes, and backgrounds, and of urban street scenes.

Introduction

Self-driving cars have the potential to revolutionize transportation by increasing safety and reducing human error. One of the key components of autonomous driving is the ability to accurately detect and segment drivable areas from images captured by on-board cameras. In this paper, we propose a deep learning-based approach for drivable area detection using several state-of-the-art architectures including PSPNet, ResNet, U-Net, and STDC. We evaluate our method on a publicly available dataset and compare it to other state-of-the-art methods. Our results demonstrate the effectiveness of our approach in accurately detecting drivable areas and its potential for use in self-driving cars.

Related work

In the field of image segmentation, many research papers address this subject; we chose some of them for reference. The first article [2] presents a review of recent advances in semantic segmentation using deep neural networks, with a focus on methods for skin lesion detection and FCN-based segmentation. The review is divided into three main categories: region-based semantic segmentation, FCN-based semantic segmentation, and weakly supervised segmentation. The authors report on the benefits and limitations of each method and discuss how they can be used to improve computational efficiency, accuracy, and generalizability.

Additionally, the article [1] presents MALUNet, a multi-attention and lightweight UNet for skin lesion segmentation. The method was trained and tested on the ISIC2017 and ISIC2018 datasets, which contain over 2000 dermoscopy images with segmentation mask labels. Results show an accuracy of 96.18% and 94.66% for the training and testing sets respectively.

The third article [3] proposes an attention mechanism based on the Swin U-Net model, with a two-level attention operation to enhance the performance on skin lesion segmentation. The model was evaluated on the ISIC2017, ISIC2018 and PH² datasets and achieved accuracies of 0.9656, 0.9668 and 0.9685, respectively.

The fourth article [4] proposes a translated skip connections method to expand the receptive fields of FCNs with a minimal number of parameters. The method was compared with U-Net, Dilation2, Dilation3 and B-Net using COVID-19, IILD, Landcover.ai, CARLA and VOC2012 as datasets.

The fifth article [6] introduces IDD, a novel dataset for autonomous navigation in unstructured environments collected from Indian roads. The dataset is more diverse and unstructured than structured datasets like Cityscapes. The study found that state-of-the-art methods for semantic segmentation achieved lower accuracy on the IDD dataset than on Cityscapes, demonstrating the need for new and more complex datasets to improve performance.

The last research paper introduces SimpleClick [5], a new method for click-based interactive image segmentation. The method uses a plain, non-hierarchical Vision Transformer (ViT) as the backbone architecture. It achieved state-of-the-art performance on the SBD dataset with a NoC@90 score of 4.15 and generalized to medical images. The authors also discussed computational efficiency, making the method suitable as a practical annotation tool. The article concludes by providing a table of models and metrics such as aAcc, mIoU, and mAcc, comparing the performance of different models on semantic segmentation tasks.

Methods

In this contribution, various methods have been utilized to perform image detection. Before delving into the details of these methods, it is important to understand the overall process of the detection. The process comprises several steps: specifying the dataset, loading the image from a file, applying multi-scale flip augmentation, resizing the image, applying random flip, normalizing the image, and converting the image to a tensor. Each time the detection is launched, these steps are executed in the specified order.

The MultiScaleFlipAug method combines multi-scale and flip augmentation techniques to increase the robustness and generalization of the model to new, unseen images. This is achieved by providing more variations of the same image. Additionally, normalization is applied to the image, which offers several benefits such as improved convergence, regularization, faster training, and better feature learning. After the image is converted to a tensor, features are extracted using an encoder-decoder architecture, which typically consists of a backbone, a decode_head and an auxiliary_head. It should be noted that the auxiliary_head is only used for deep supervision during training and is not necessary during inference.

The following section shows the structure of all the methods used in our solution.
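The preprocessing steps listed in the Methods section can be sketched as a configuration in the style of MMSegmentation 0.x, whose component names (MultiScaleFlipAug, decode_head, auxiliary_head) match those used in this paper. The choice of framework is an assumption, and so are the image scale, the normalization constants and the final Collect step, none of which are stated in the text. The helper `step_names` is hypothetical and only serves to show the execution order.

```python
# Hypothetical inference-time pipeline in MMSegmentation 0.x config style.
# The image scale and normalization statistics are illustrative assumptions,
# not the values used in the experiments.
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53],  # ImageNet RGB channel means
    std=[58.395, 57.12, 57.375],     # ImageNet RGB channel standard deviations
    to_rgb=True,
)

test_pipeline = [
    dict(type="LoadImageFromFile"),
    dict(
        type="MultiScaleFlipAug",
        img_scale=(2048, 1024),
        flip=False,
        transforms=[
            dict(type="Resize", keep_ratio=True),
            dict(type="RandomFlip"),
            dict(type="Normalize", **img_norm_cfg),
            dict(type="ImageToTensor", keys=["img"]),
            # Collect is not mentioned in the text; it gathers the tensors
            # that are handed to the model.
            dict(type="Collect", keys=["img"]),
        ],
    ),
]


def step_names(pipeline):
    """Flatten a pipeline into the ordered list of transform names."""
    names = []
    for step in pipeline:
        names.append(step["type"])
        names.extend(step_names(step.get("transforms", [])))
    return names


print(step_names(test_pipeline))
```

The nested `transforms` list runs once per scale and flip setting chosen by MultiScaleFlipAug, which is why resizing, flipping, normalization and tensor conversion sit inside it rather than at the top level.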
PSPNet

PSPNet is an image segmentation method based on a deep convolutional neural network (CNN).

It uses a pyramid pooling module to capture the context information of an image.

This architecture is a deep learning model for image segmentation tasks. It is composed of several components: an encoder-decoder structure, a ResNetV1c block and a PSPHead. The encoder is typically a CNN that reduces the spatial resolution of the input image while increasing the number of feature maps, which allows it to capture increasingly abstract features of the image. The decoder generates a segmentation mask that segments the image into different regions.

ResNetV1c is a specific version of the ResNet architecture. It has a similar architecture to the original ResNet model, but replaces the 7x7 convolution in the input stem with three 3x3 convolutions, while preserving the good performance of the original ResNet.

The PSPHead (Pyramid Scene Parsing Head) is a decoder head used in semantic segmentation. It captures multi-scale contextual information from an input image: it pools the feature maps at several scales in parallel, effectively increasing the receptive field of the model and generating a feature pyramid. The architecture also includes a CrossEntropyLoss layer, which is used to measure the dissimilarity between the predicted segmentation mask and the ground truth mask. Additionally, there are several ReLU activation function layers, which introduce non-linearity into the network, allowing it to learn more complex representations of the input data. Finally, the architecture includes an FCNHead, which is used to create a full-resolution segmentation mask from the feature maps generated by the encoder.

ResNeSt

ResNeSt is an image segmentation method based on a residual neural network.

It uses a multi-level feature fusion approach to capture the global context of an image.

This architecture is a deep learning model for image segmentation tasks. It is composed of several components: an encoder-decoder structure, a ResNeSt block, and a DepthwiseSeparableASPPHead. The encoder is typically a CNN that reduces the spatial resolution of the input image while increasing the number of feature maps, which allows it to capture increasingly abstract features of the image. The decoder generates a segmentation mask that segments the image into different regions.
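To make the encoder's trade-off concrete, the sketch below tabulates how the spatial resolution shrinks while the number of feature maps grows. It assumes the stage strides and channel counts of a standard ResNet-50-style encoder without dilation; the backbones used in this paper may differ, in particular because segmentation backbones often dilate their last stages to keep a higher resolution. The names `STAGES` and `encoder_shapes` are hypothetical.

```python
# Illustrative only: (stage name, cumulative stride, output channels) of a
# standard ResNet-50-style encoder without dilation. This layout is an
# assumption, not the configuration used in the paper.
STAGES = [
    ("stem", 4, 64),
    ("stage1", 4, 256),
    ("stage2", 8, 512),
    ("stage3", 16, 1024),
    ("stage4", 32, 2048),
]


def encoder_shapes(height, width):
    """Return (name, channels, feature height, feature width) per stage.

    Integer division is used, so the result is exact only for input sizes
    that are multiples of 32.
    """
    shapes = []
    for name, stride, channels in STAGES:
        shapes.append((name, channels, height // stride, width // stride))
    return shapes


for name, channels, feat_h, feat_w in encoder_shapes(512, 1024):
    print(f"{name}: {channels} channels at {feat_h}x{feat_w}")
```

For a 512x1024 input, the last stage under these assumptions holds 2048 feature maps of size 16x32, which is why a decoder is needed to recover a full-resolution mask.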
The ResNeSt block is designed to improve the performance of CNNs by introducing residual connections and split-attention blocks. It helps the network learn more effectively by making the flow of information through the network more efficient.

The DepthwiseSeparableASPPHead is a decoder head used in semantic segmentation. It captures multi-scale contextual information from an input image: it applies convolutions with different dilation rates in parallel to the feature maps, effectively increasing the receptive field of the model.

The architecture also includes a CrossEntropyLoss layer, which is used to measure the dissimilarity between the predicted segmentation mask and the ground truth mask. Additionally, there are several ReLU activation function layers, which introduce non-linearity into the network.

U-Net

U-Net is an image segmentation method based on a fully convolutional neural network (FCN).

It uses a contracting path and an expansive path to capture the local and global context of an image.

This architecture is a specific implementation of an encoder-decoder architecture for image segmentation tasks. The encoder part of the architecture is responsible for extracting features from the input image, while the decoder part is responsible for generating a segmentation mask that segments the image into different regions. The encoder is typically a CNN that reduces the spatial resolution of the image while increasing the number of feature maps, which allows it to capture increasingly abstract features of the image.

The decoder is made up of several layers including a ReLU activation function, FCNHead and CrossEntropyLoss layers. The ReLU activation function introduces non-linearity into the network, allowing it to learn more complex representations of the input data. The FCNHead is used to create a full-resolution segmentation mask from the feature maps generated by the encoder. The CrossEntropyLoss is used to measure the dissimilarity between the predicted segmentation mask and the ground truth mask.

SETR

SETR is an image segmentation method based on a transformer encoder.
It uses a Vision Transformer to capture the context information of an image.

The architecture is a complex neural network composed of several components: an encoder-decoder architecture built from a Vision Transformer, Dropout, the GELU activation function, MLANeck, ReLU, SETRMLAHead with a cross-entropy loss, and two FCNHead blocks, each paired with a CrossEntropyLoss. The encoder-decoder architecture is commonly used in image segmentation tasks: the encoder extracts features from the input image, while the decoder generates a segmentation mask that segments the image into different regions. The Vision Transformer is an attention-based model that allows the network to learn to selectively attend to different regions of the image. Dropout is used for regularization, to avoid overfitting. GELU is a non-linear activation function that introduces non-linearity into the network. MLANeck is the component responsible for performing feature fusion and attention operations on the multi-level features. ReLU is a non-linear activation function that is commonly used in neural networks. SETRMLAHead is a decoder head that allows the network to effectively use features from different levels of the network and selectively attend to the most informative features. Cross-entropy loss is a commonly used loss function for image segmentation tasks; it measures the dissimilarity between the predicted segmentation mask and the ground truth mask. FCNHead is used to create a full-resolution segmentation mask from the feature maps generated by the encoder.

STDC

STDC is an image segmentation method based on the Short-Term Dense Concatenate network.

It uses a set of deep convolutional neural networks to capture the context information of an image.

This architecture is a specific implementation for image segmentation tasks. The encoder part of the architecture is responsible for extracting features from the input image, while the decoder part is responsible for generating a segmentation mask that segments the image into different regions. The encoder is typically a CNN that reduces the spatial resolution of the image while increasing the number of feature maps, which allows it to capture increasingly abstract features of the image.

The architecture includes several specific components such as STDCContextPathNet and STDCNet, which provide the context path and the Short-Term Dense Concatenate backbone, respectively. The decoder includes several layers including a ReLU activation function, FCNHead, OHEMPixelSampler, and CrossEntropyLoss layers. The FCNHead is used to create a full-resolution segmentation mask from the feature maps generated by the encoder. The OHEMPixelSampler performs online hard example mining: it samples the hardest pixels of the predicted segmentation mask for the loss computation. The CrossEntropyLoss is used to measure the dissimilarity between the predicted segmentation mask and the ground truth mask.

Later in the architecture, there are further OHEMPixelSampler and ReLU layers. It concludes with a Dice loss function, which is used to measure the dissimilarity between the predicted segmentation mask and the ground truth mask.

Dataset

In our case we compared two datasets, ADE20K and Cityscapes.

In the ADE20K dataset, the images contain a variety of objects, scenes, and backgrounds, with a total of more than 20K images and 150 categories.

The Cityscapes dataset is composed of urban street scenes, with a total of 5K images and 30 classes.

Experiments

The results table compares the performance of the different models (PSPNet, ResNet, SETR, STDC, and U-Net) on three types of graphics cards: 3080, 3050 Ti, and 3050. The metrics used to evaluate the performance of the models are Average Accuracy (aAcc), Mean Intersection over Union (mIoU), and Mean Accuracy (mAcc). The table shows the performance of each model on each graphics card; for every metric, a higher value indicates better performance. It appears that overall, the 3080 graphics card has the best performance across all models and metrics, while the 3050 graphics card has the lowest performance.
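The three metrics named in the Experiments section can all be derived from a per-class confusion matrix. The following is a minimal sketch in plain Python, assuming that aAcc is the overall pixel accuracy, that mAcc averages the per-class recall, and that classes absent from both prediction and ground truth are skipped in the averages; the paper does not state these conventions, and the function names are hypothetical.

```python
def confusion_matrix(pred, target, num_classes):
    """Count pixels by (ground-truth class, predicted class)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        matrix[t][p] += 1
    return matrix


def segmentation_metrics(pred, target, num_classes):
    """Compute aAcc, mIoU and mAcc from flat lists of class indices."""
    cm = confusion_matrix(pred, target, num_classes)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(num_classes))
    ious, accs = [], []
    for c in range(num_classes):
        tp = cm[c][c]                                   # correctly labeled pixels
        gt = sum(cm[c])                                 # ground-truth pixels of class c
        predicted = sum(cm[r][c] for r in range(num_classes))
        union = gt + predicted - tp
        if gt > 0:
            accs.append(tp / gt)                        # per-class accuracy
        if union > 0:
            ious.append(tp / union)                     # per-class IoU
    return {
        "aAcc": correct / total,
        "mIoU": sum(ious) / len(ious),
        "mAcc": sum(accs) / len(accs),
    }


# Toy example: six pixels, two classes (0 = not drivable, 1 = drivable).
target = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
print(segmentation_metrics(pred, target, num_classes=2))
```

In the toy example four of six pixels are correct, so aAcc is 4/6; each class has two correct pixels out of a union of four, so mIoU is 0.5, and each class recovers two of its three ground-truth pixels, so mAcc is 2/3.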
Conclusion
In summary, this contribution presents
various methods for image detection,
including MultiScaleFlipAug, PSPNet and
ResNeSt. These methods use techniques
such as multi-scale and flip augmentation,
normalization, and an encoder-decoder
architecture to improve the robustness and
generalization of the model. PSPNet uses a
pyramid pooling module to capture context,
and ResNeSt uses multi-level feature fusion
to capture global context. These methods
are effective in image segmentation tasks
and can potentially improve performance
in other areas of image detection.
Additionally, it should be noted that each
method has its own specific architecture
designed to improve performance.
Drivable area detection for self-driving
cars can be achieved using deep learning
techniques such as a points-embedded
system. These methods can be used to
process streaming data from cameras to
accurately identify the drivable areas on a
road. An IoT architecture can also be
integrated to enable real-time
communication and data transfer between
the vehicle and a centralized system for
monitoring and control. This can allow the
vehicle to receive updates on road
conditions and traffic patterns, and also
allow remote monitoring of the vehicle's
performance. By using these techniques
together, it is possible to achieve accurate
and real-time drivable area detection for
self-driving cars.

References
[1] “MALUNet: A Multi-Attention and Light-weight UNet for Skin Lesion Segmentation” by Jiacheng Ruan, Suncheng Xiang, Mingye Xie, Ting Liu, Yuzhuo Fu

[2] “A review of semantic segmentation using deep neural networks” by Yanming Guo, Yu Liu, Theodoros Georgiou, Michael S. Lew

[3] “Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation” by Ehsan Khodapanah Aghdam, Reza Azad, Maral Zarvani, Dorit Merhof

[4] “Translated Skip Connections - Expanding the Receptive Fields of Fully Convolutional Neural Networks” by J. Bruton, H. Wang, University of the Witwatersrand

[5] “SimpleClick: Interactive Image Segmentation with Simple Vision Transformers” by Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer, University of North Carolina at Chapel Hill

[6] “IDD: A Dataset for Exploring Problems of Autonomous Navigation in Unconstrained Environments” by Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, C V Jawahar
