NITCAD - Developing An Object Detection Classification

Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 171 (2020) 207–216
www.elsevier.com/locate/procedia
Abstract

Autonomous vehicles with various levels of autonomy are becoming popular in developed countries due to their effectiveness in reducing the fatalities caused by road accidents. A developing country like India, with the second largest population in the world, creates unique road scenarios for an autonomous car, which require a lot of testing and fine tuning before implementation. This leads to the importance of datasets providing information about various traffic situations in India. For planning its path ahead, an autonomous vehicle has to detect, classify and estimate the depth of obstacles that it encounters on roads. The purpose of this paper is to provide a dataset for object classification, detection and stereo vision corresponding to Indian roads, which can serve as a platform for developing effective algorithms for autonomous cars on Indian roads. In this work, we benchmarked object classification using confusion matrices obtained from various deep learning models, evaluated detection using Faster R-CNN, and compared the depth estimation produced by a RealSense stereo camera with that of convolutional neural network based algorithms.
© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the Third International Conference on Computing and Network Communications (CoCoNet’19).
Keywords: Dataset, Object classification, Object detection, Stereo vision, Indian roads
1. Introduction
The future of transportation lies in the development of autonomous vehicles that can eliminate the factor of human error, thereby reducing the chance of road accidents. The technology will only be effective if implemented across all the vehicles on the roads, thereby reducing the uncertainties associated with drivers. For perfecting the technology to be implemented on a large scale, it should be tested across the various traffic situations that will be encountered by these vehicles. Several large datasets like Imagenet [1] and COCO [2] are available for image classification, but
1877-0509 © 2020 The Authors. Published by Elsevier B.V.
10.1016/j.procs.2020.04.022
Namburi GNVV Satya Sai Srinath et al. / Procedia Computer Science 171 (2020) 207–216
for localisation and segmentation, which are primary tasks for autonomous vehicles, only a few datasets like Pascal
VOC [3] can be used. But it has a wide variety of classes that are not necessary for autonomous navigation. There
are some datasets available exclusively for pedestrian detection such as Caltech Pedestrian Dataset [4], Citypersons
dataset [5], and Daimler Pedestrian [6]. Krause et al. [7] created a dataset which can be used for classification of cars. But
autonomous vehicles need to detect different classes of obstacles present in its path. For that, Oxford robotcar dataset
[8] has collected data in the city of Oxford, UK at various times in the same location. KITTI [9] is another dataset
which is collected on the urban roads of Karlsruhe, Germany. Cityscapes [10], and Mapillary Vistas [11] created
datasets for semantic understanding of urban streets. More recently Apolloscapes [12], BDD100k [13], nuScenes [14]
provided datasets that are collected across various weather conditions, times and places, along with labels which are
crowd sourced.
To test an autonomous vehicle in India, the above mentioned datasets cannot be used due to the lack of information
about classes like auto rickshaws that are exclusively found on Indian roads. Also, unstructured traffic scenarios can
be observed frequently on Indian roads. These factors make it essential to have a dataset exclusively for Indian roads
which can give information regarding the same. Varma et al. [15] have studied the situations on Indian roads
and created a dataset named IDD. For an autonomous vehicle to plan its path, it should be aware of distances to
other vehicles in its vicinity so that the velocity of other vehicles can be estimated. This can be achieved by using
a 3D LiDAR which can scan its surroundings and give distances to different obstacles but this is rather expensive.
A cost effective alternative is to use a stereo camera which can provide the depth information about the obstacles.
The performance of this method purely depends upon the algorithms for evaluating depth information. Due to the
introduction of better algorithms this method will be more suitable for a price conscious market like India.
In this context, there is a requirement for a new dataset, that can be used to develop autonomous navigation systems
for Indian roads. So, a dataset named National Institute of Technology, Calicut Autonomous Driving (NITCAD) is
presented in this paper. NITCAD primarily consists of NITCAD object dataset which can be used for classification,
detection and NITCAD stereo vision dataset for depth estimation on Indian roads, thereby leading to the development
of level 3 autonomous vehicles capable of handling Indian road scenarios.
2. Methodology
The autonomous vehicles that are being developed for Indian roads should be able to detect and classify different
vehicle classes that are exclusively found here. This will be advantageous for performing the path planning operation
since different vehicle classes behave differently on roads. NITCAD object dataset provided here was created under
this objective.
In order to keep track of the detected objects on the road, the autonomous vehicle needs to evaluate the velocity
of these objects. For developing stereo vision based velocity estimation algorithms, NITCAD stereo vision dataset
provides image data collected from synchronised left and right cameras having global shutter.
The NITCAD object dataset was evaluated with different deep learning architectures and the respective confusion
matrix, precision, recall values were found out. For the NITCAD stereo vision dataset, the relative difference between
the disparity maps obtained by different methods are evaluated and interpreted.
India has one of the highest numbers of road fatalities in the world. Autonomous navigation is one of the solutions to reduce the number of fatalities due to road accidents. To build an efficient and reliable system, the knowledge
of the traffic structure in India is highly essential. Traffic in India is highly heterogeneous and the models trained for
autonomous navigation in other countries may not be sufficient to characterize Indian traffic. India's roads are functionally classified as expressways, national highways, state highways, district roads, and rural roads. Of these, only
expressways and some of the national highways are four-lane or six-lane. An example of unlaned traffic junction can
be observed in Fig. 1(a). Most roads are unpaved with potholes and with ambiguous boundaries. Compared to other
countries, India has low road density per 1000 people. This has led to many problems like traffic congestion and
irregular traffic speeds. Indian roads carry almost 90 per cent of the country's passenger traffic. Passengers use a multitude of vehicles for daily transport, including cars, two-wheelers, and auto-rickshaws. Auto-rickshaws
(a) An unlaned traffic junction. (b) Unstructured environment prevalent in India.
Fig. 1: General traffic scenarios on Indian Roads.
are a class of vehicles that are truly unique to the Indian traffic system. Further, the frequency and variety of trucks and buses are also high as compared to other countries. Another huge bottleneck for autonomous navigation in India is that pedestrians and drivers are less likely to follow traffic rules. Pedestrians often cross the road at arbitrary locations and drivers sometimes overtake from the wrong side. An example of this unstructuredness in traffic can be observed in Fig. 1(b).
2.2. Collection of data
To collect data, one RGB camera and a stereo camera were mounted on a car, which was made to travel on the rural and urban roads of Kerala where various traffic situations arise. The route along which the data was collected is shown in Fig. 2.
Fig. 2: Route followed for data collection.
2.3. NITCAD Object dataset
For training the system to classify different objects that could be encountered on an Indian road, a dataset including a variety of classes needs to be created. Using a Noise Play 2 action camera, traffic in and around Kottayam district, Kerala was recorded in 720p at 30 fps along the route shown in Fig. 2. A set of images at a rate of 5 images per second was generated from this recorded video footage. These images were annotated using an online tool, Labelbox, and a text file was created for each scene with information about the locations of the various objects in that particular frame. These files can then be used for visualization of the dataset as well as for training a system to detect and
Fig. 3: (a) Number of objects per class. (b) Number of images having a class.
Fig. 4: An example of various classes present in NITCAD object dataset. From left to right: Car, Bus, Pedestrian, Two wheeler, Truck, Van and Auto rickshaw.
classify the objects on the road. There are seven classes in the dataset, namely car, pedestrian, auto rickshaw, truck, two wheeler, bus and van. A total of 11000 images were collected under different traffic conditions, out of which 4800 images were manually labelled.
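The exact layout of the exported per-scene annotation text files is not described here, so the sketch below assumes a hypothetical one-object-per-line format (`class,xmin,ymin,xmax,ymax`) purely to illustrate how such files can be loaded for visualization or training:

```python
from dataclasses import dataclass

# Hypothetical annotation format: one object per line as
# "class,xmin,ymin,xmax,ymax". The paper does not specify the
# actual layout of the exported files.
@dataclass
class Box:
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int

def parse_scene(text: str) -> list:
    """Parse one scene's annotation text into bounding boxes."""
    boxes = []
    for line in text.strip().splitlines():
        label, *coords = line.split(",")
        xmin, ymin, xmax, ymax = map(int, coords)
        boxes.append(Box(label, xmin, ymin, xmax, ymax))
    return boxes

sample = "car,100,200,260,330\nauto rickshaw,400,210,520,350"
for box in parse_scene(sample):
    print(box.label, box.xmax - box.xmin, box.ymax - box.ymin)
```

A real loader would substitute whatever field order and delimiter the actual export uses.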
Fig. 3(a) gives details about the frequency of each class. From Fig. 3(a), it can be observed that the number of cars present in the dataset is the maximum and the number of vans is the minimum. Fig. 3(b) gives details about the number of images or frames having a particular class, from which it can be inferred that cars, auto rickshaws and two wheelers are present in almost all the frames while vans occur only at rare instances on roads. Fig. 4 gives a typical example of all the classes present in our dataset.
The camera intrinsic matrix and distortion coefficients of the camera were computed according to Zhang [18] and were obtained as below:

camera intrinsic matrix =
    [ 791.6965      0       632.9851 ]
    [     0      791.4219   347.7182 ]
    [     0          0          1    ]

radial distortion coefficients = [ −0.3454  0.1593  −0.0344 ]
tangential distortion coefficients = [ 0.0021  0.0016 ]
With the camera intrinsic matrix and distortion coefficients obtained above, the images taken by the Noise Play action camera were undistorted and provided in the dataset. An example of a distorted and the corresponding undistorted image is shown in Fig. 5, where the distortion is clearly visible near the edges. Around 10,000 undistorted images (of which 3600 are labelled) are also provided.
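As a sanity check on these numbers, the calibration can be exercised with the standard Brown-Conrady projection model that Zhang-style calibration tools use; interpreting the coefficients in the usual (k1, k2, k3) radial and (p1, p2) tangential ordering is an assumption here, not something the paper states:

```python
import numpy as np

# Intrinsics and distortion coefficients as reported above; the
# projection below is the Brown-Conrady model that undistortion
# routines invert. Coefficient ordering is assumed, not stated.
K = np.array([[791.6965, 0.0,      632.9851],
              [0.0,      791.4219, 347.7182],
              [0.0,      0.0,      1.0]])
k1, k2, k3 = -0.3454, 0.1593, -0.0344   # radial coefficients
p1, p2 = 0.0021, 0.0016                 # tangential coefficients

def project(xn, yn):
    """Map a normalized camera coordinate (xn, yn) to a distorted pixel."""
    r2 = xn * xn + yn * yn
    radial = 1 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    xd = xn * radial + 2 * p1 * xn * yn + p2 * (r2 + 2 * xn * xn)
    yd = yn * radial + p1 * (r2 + 2 * yn * yn) + 2 * p2 * xn * yn
    u, v, _ = K @ np.array([xd, yd, 1.0])
    return float(u), float(v)

# The optical axis (0, 0) is unaffected by distortion and lands
# exactly on the principal point (cx, cy).
print(project(0.0, 0.0))   # -> (632.9851, 347.7182)
```

With the negative k1 reported here, off-axis points are pulled inward, which matches the barrel distortion visible near the image edges in Fig. 5.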
For performing the task of depth estimation, data was collected in and around Kottayam district, Kerala using Intel
RealSense Depth camera D435. This depth camera has 2 infrared cameras having global shutter that are triggered
simultaneously so that calculation of disparity for a particular scene is possible. By using its inbuilt vision processor
a disparity map can be generated which can be used for depth estimation. More efficient algorithms can be developed
to improve its accuracy so that it becomes a cost effective depth approximation technique using stereo vision. Depth
estimation obtained by the inbuilt vision processor can be used for validation of the results obtained after performing
stereo algorithms. An example image is shown in Fig. 6 where a small disparity can be observed.
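The depth recovery behind this is the standard rectified-stereo relation Z = f · B / d; the focal length and baseline below are illustrative placeholders, not the D435's calibrated values:

```python
# For a rectified stereo pair: Z = f * B / d, where f is the focal
# length in pixels, B the stereo baseline in metres, and d the
# disparity in pixels. Illustrative values only, not the D435's
# calibrated parameters.
F_PX = 640.0        # assumed focal length (pixels)
BASELINE_M = 0.05   # assumed baseline (metres)

def depth_m(disparity_px):
    """Convert a pixel disparity to metric depth."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return F_PX * BASELINE_M / disparity_px

print(depth_m(16.0))   # -> 2.0
```

The inverse relationship explains why distant objects, whose disparity shrinks toward zero, are the hardest to range accurately with stereo.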
Fig. 6: An example image pair from NITCAD stereo vision dataset.
Table 2: Evaluation of various architectures on NITCAD object dataset.

Architecture       Accuracy*   Precision   F1 score
DenseNet [19]      0.828       0.795       0.782
Inceptionv3 [20]   0.836       0.801       0.805
Mobilenet [21]     0.839       0.78        0.796
NASNet [22]        0.811       0.757       0.760
VGG16 [23]         0.789       0.779       0.750
Xception [24]      0.854       0.832       0.825

*Accuracy and Recall values are the same as micro-averaging is considered for the multiclass confusion matrix.
3. Evaluation
NITCAD can be considered challenging only if the classes present in it are difficult to classify. Thus the dataset is processed accordingly and the cropped images are fed to various classification algorithms to obtain confusion matrices. To evaluate detection algorithms on our dataset, Faster R-CNN is used. The stereo dataset is evaluated on the basis of the average of the relative difference between the disparity maps generated (R_diff), taking the output obtained from the Intel RealSense as the ground truth.
3.1. NITCAD object dataset - Evaluation for classification
To evaluate how well the classification algorithms perform on the dataset, the confusion matrix for different deep learning architectures was computed. Each image is cropped to extract all individual classes and a subset of the labelled data was considered for training, validation and testing. Models pre-trained on Imagenet were taken and their initial layers were frozen, as these layers learn simple features like edges and lines which are common to all the objects. Validation accuracy is used as a metric while training for 20 epochs and the best weights are used for testing. The confusion matrices for the various architectures are represented in Fig. 7.
It can be inferred that the class van is being confused with car/auto by these architectures. Also, the auto-rickshaw, which is the most common class on Indian roads, has been classified well as the dataset collected has enough autos to train the networks. The accuracy and precision values of the various classes and models can be seen in Table 2 and Table 3.
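The footnote of Table 2 states that accuracy and recall coincide because micro-averaging is used; a short sketch with a toy confusion matrix (not the paper's results) shows why every false positive of one class is a false negative of another, forcing micro-precision = micro-recall = accuracy:

```python
import numpy as np

# C[i, j] = number of samples of true class i predicted as class j.
# Toy 3-class matrix for illustration only.
C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 44]])

tp = np.diag(C).astype(float)
fp = C.sum(axis=0) - tp    # column sums minus diagonal
fn = C.sum(axis=1) - tp    # row sums minus diagonal

accuracy = tp.sum() / C.sum()
micro_precision = tp.sum() / (tp.sum() + fp.sum())
micro_recall = tp.sum() / (tp.sum() + fn.sum())

# Per-class values, analogous to the columns of Table 3.
per_class_precision = tp / C.sum(axis=0)
per_class_recall = tp / C.sum(axis=1)

print(accuracy, micro_precision, micro_recall)  # all three identical
```

Because fp.sum() and fn.sum() are both the total number of off-diagonal samples, the three micro-averaged quantities are always equal; the per-class values, as in Table 3, are where the architectures actually differ.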
Table 3: Precision, Recall values for different classes present in NITCAD object dataset.

                 Precision                                  Recall
Architecture     A    B    C    P    Tr   Tw   V           A    B    C    P    Tr   Tw   V
DenseNet        0.77 1.00 0.82 0.53 0.65 0.97 0.29        0.98 0.03 0.95 0.95 0.4  0.94 0.02
Inceptionv3     0.86 0.94 0.77 0.76 0.4  0.96 0.27        0.99 0.25 0.89 0.94 0.69 0.96 0.07
Mobilenet       0.9  0.48 0.72 0.82 0.39 0.96 0.2         0.99 0.25 0.94 0.97 0.17 0.96 0.02
NASNet          0.98 0.2  0.79 0.8  1.00 0.76 0.26        0.91 0.04 0.97 0.5  0.32 0.99 0.01
VGG16           0.66 0.73 0.87 0.51 0.42 0.94 0.4         0.98 0.17 0.9  0.96 0.59 0.85 0.01
Xception        0.88 0.88 0.92 0.84 0.14 0.94 0.21        0.99 0.25 0.94 0.99 0.69 0.99 0.009

A-auto rickshaw, B-bus, C-car, P-pedestrian, Tr-truck, Tw-two wheeler, V-van
3.2. NITCAD object dataset - Evaluation for detection
To evaluate the detection, Faster R-CNN is chosen. 1200 images are trained for 70 epochs, with each epoch having 200 iterations, on a 4GB GPU system. ResNet is used as the base architecture to train and extract features, and the metrics are tabulated in Table 4.
Table 4: Different metrics obtained after training Faster R-CNN.

Classifier Accuracy         0.894
Loss RPN Classifier         0.057
Loss RPN Regression         0.0846
Loss Detector Classifier    0.26
Loss Detector Regression    0.11
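Faster R-CNN optimises a multi-task objective, and the four losses in Table 4 are its components: a classification and a regression term each for the RPN and the detector head. Summing them with equal weights (an assumption, since the weighting used in training is not stated) gives a single unweighted total:

```python
# The four loss components reported in Table 4; equal weighting
# below is an assumption, not the paper's stated configuration.
losses = {
    "rpn_cls": 0.057,    # RPN objectness classification
    "rpn_reg": 0.0846,   # RPN anchor box regression
    "det_cls": 0.26,     # detector head classification
    "det_reg": 0.11,     # detector head box regression
}
total = sum(losses.values())
print(round(total, 4))   # -> 0.5116
```

The split makes it easy to see that, at convergence, most of the remaining loss sits in the detector's classification term rather than in the region proposals.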
3.3. Evaluation of NITCAD Stereo vision dataset
Intel RealSense stereo camera generates a depth map of the scene that is being recorded using the built-in vision processor. This was taken as the ground truth of depth. For improving the depth estimation, a neural network based approach, MC-CNN [16], was applied. For a pair of images corresponding to a scene, MC-CNN generates a disparity map which is used to obtain the depth information. A network pre-trained on KITTI was chosen to estimate the disparity maps. The disparity map for that particular scene was also obtained using inbuilt functions provided by the OpenCV library.
Let D_Intel(x, y) and D_method(x, y) correspond to the disparity maps generated by the Intel RealSense stereo camera and by one of the two above mentioned methods, respectively, for a particular scene. The average of the relative difference between the disparity maps generated (R_diff) can be obtained as

R_diff = (1 / (w × h)) Σ_{x ∈ w, y ∈ h} |D_Intel(x, y) − D_method(x, y)|        (1)

If the value of R_diff is small, then it can be implied that the disparity map obtained by the method is accurate.
Obtaining the depth information is useful in estimating the velocity of the objects in the scene. The output obtained from the Intel RealSense is taken as ground truth; the R_diff obtained with MC-CNN is 14 and with OpenCV is 18, averaged over about 100 image pairs.
Fig. 8: Disparity maps obtained from different algorithms for the image pair shown in Fig. 6. (Images thresholded for visual analysis)
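Eq. (1) above is a per-pixel mean absolute difference between disparity maps; a minimal numpy sketch, with toy 2x2 maps standing in for real RealSense output:

```python
import numpy as np

# Eq. (1): mean absolute difference between the RealSense disparity
# map (taken as ground truth) and a candidate method's disparity map.
def r_diff(d_intel, d_method):
    d_intel = np.asarray(d_intel, dtype=float)
    d_method = np.asarray(d_method, dtype=float)
    h, w = d_intel.shape
    return np.abs(d_intel - d_method).sum() / (w * h)

# Toy 2x2 disparity maps (hypothetical values, not dataset output).
gt = np.array([[10.0, 12.0], [14.0, 16.0]])
est = np.array([[11.0, 12.0], [13.0, 18.0]])
print(r_diff(gt, est))   # -> 1.0
```

Lower values mean the candidate map agrees more closely with the RealSense output, which is how the reported MC-CNN (14) versus OpenCV (18) averages are compared.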
4. Conclusion

A challenging dataset is presented which includes various test cases that are frequently encountered on Indian roads. To get information regarding velocity, a stereo dataset is presented which can be used to develop algorithms to obtain depth information. Various classes are labelled for object classification and are evaluated with the confusion matrices obtained from different architectures. For detection, Faster R-CNN is used. The stereo dataset is evaluated by the average absolute difference between the output obtained from the Intel camera and the methods described, i.e. MC-CNN and OpenCV. Our dataset can be further extended by collecting data which includes new classes like animals, lorries, sign boards, etc. Research on the development of novel architectures that can detect and classify in various conditions, including many edge cases, needs to be carried out.
5. Acknowledgements
We would like to thank TEQIP - III for providing funds to acquire the Intel RealSense D435 depth camera.
References
[1] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 248–255. IEEE (2009)
[2] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
[3] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)
[4] Dollar, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. pp. 304–311. IEEE (2009)
[5] Zhang, Shanshan, Rodrigo Benenson, and Bernt Schiele. “Citypersons: A diverse dataset for pedestrian detection.” Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2017.
[6] Keller, Christoph Gustav, Markus Enzweiler, and Dariu M. Gavrila. “A new benchmark for stereo-based pedestrian detection.” 2011 IEEE
Intelligent Vehicles Symposium (IV). IEEE, 2011.
[7] Krause, Jonathan, et al. “3d object representations for fine-grained categorization.” Proceedings of the IEEE International Conference on
Computer Vision Workshops. 2013.
[8] Maddern, W., Pascoe, G., Linegar, C., Newman, P.: 1 year, 1000 km: The oxford robotcar dataset. IJ Robotics Res. 36(1), 3–15 (2017)
[9] Geiger, Andreas, et al. “Vision meets robotics: The KITTI dataset.” The International Journal of Robotics Research 32.11 (2013): 1231-1237.
[10] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)
[11] Neuhold, G., Ollmann, T., Bulò, S.R., Kontschieder, P.: The mapillary vistas dataset for semantic understanding of street scenes. In: International
Conference on Computer Vision (ICCV) (2017)
[12] Huang, Xinyu, et al. “The apolloscape dataset for autonomous driving.” Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops. 2018.
[13] Yu, Fisher, et al. “BDD100K: A diverse driving video database with scalable annotation tooling.” arXiv preprint arXiv:1805.04687 (2018).
[14] Caesar, Holger, et al. “nuScenes: A multimodal dataset for autonomous driving.” arXiv preprint arXiv:1903.11027 (2019).
[15] Varma, Girish, et al. “IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments.” 2019 IEEE Winter
Conference on Applications of Computer Vision (WACV). IEEE, 2019.
[16] J. Zbontar and Y. LeCun, "Stereo matching by training a convolutional neural network to compare image patches," Journal of Machine Learning Research, vol. 17, pp. 1–32, 2016.
[17] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in European Conference on Computer Vision, vol. 1, May 2006, pp. 430–443. [Online]. Available: http://www.edwardrosten.com/work/rosten2006machine.pdf
[18] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, 2000.
[19] Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[20] Szegedy, Christian, et al. “Rethinking the inception architecture for computer vision.” Proceedings of the IEEE conference on computer vision
and pattern recognition. 2016.
[21] Howard, Andrew G., et al. “Mobilenets: Efficient convolutional neural networks for mobile vision applications.” arXiv preprint
arXiv:1704.04861 (2017).
[22] Zoph, Barret, et al. “Learning transferable architectures for scalable image recognition.” Proceedings of the IEEE conference on computer
vision and pattern recognition. 2018.
[23] Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint
arXiv:1409.1556 (2014).
[24] Chollet, François. "Xception: Deep learning with depthwise separable convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.