Multiple Object Recognition
Ömer Ünsalver
Independent Researcher, Istanbul, Turkiye
orcid.org/0000-0001-8466-0580
ABSTRACT - Recent advances in computer technology have made powerful hardware available at affordable cost. As a result, fields of application that require high processing power, such as image processing, artificial neural networks and deep learning, have expanded. Examples of these applications are autonomous vehicle navigation, robot guidance, object recognition, speech recognition, medical analysis, production quality control and safety systems.
The objective of this research was to develop computer vision software that performs the calibration of a stereo camera system, classifies multiple objects in video frames using convolutional neural networks, locates matching objects on image pairs, calculates their distances to the stereo camera and verifies the calculated values against measured values. Based on the results obtained, it is concluded that the proposed method can be adopted for industrial applications.
Keywords: Calibration, Convolutional neural network, Depth estimation, Image processing, Stereo vision
DOI: 10.5281/zenodo.6544603
CONTENTS
ABSTRACT
CONTENTS
1. INTRODUCTION
2. GENERAL INFORMATION
4. RESULTS
RESOURCES
APPENDIX
1. INTRODUCTION
Considering the period from the industrial revolution to the present, it can be said that the great leap in technology began with the invention of the transistor in 1947 by William Shockley, John Bardeen and Walter Brattain at Bell Laboratories. This unprecedented development paved the way for pioneers of computer science such as Alan Turing and John von Neumann and enabled the technologies we have today.
In parallel with developments in the electronics industry, in 1956 a young mathematician at Dartmouth College, John McCarthy, introduced the concept of a thinking machine and used the term artificial intelligence for the first time.
The first artificial intelligence approach to take the human brain as a model was the design of a single-layer neural network named the Perceptron, proposed by Frank Rosenblatt in 1958. Inspired by the way neurons work together, Rosenblatt conceived the Perceptron as a supervised learning algorithm for binary classification.
In neuroscience, synaptic plasticity is defined as the brain's tendency to change the nature of connections between
individual synapses in response to changing needs. A neuron produces an output signal (fires) if the sum of the signals it
receives from neighboring neurons exceeds a certain threshold. Since neurons receive stronger or weaker signals
depending on the nature of their synaptic connections, a neuron in the network performs a sort of weighted addition.
Neuroscientists argue that learning is possible through this weighted transmission model, in which the weights change over time.
Rosenblatt's Perceptron approach is considered the ancestor of deep learning, an important branch of today's artificial intelligence technology.
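As a minimal formalization of this weighted-sum-and-threshold behaviour (standard Perceptron notation, added here for illustration), a neuron with inputs $x_i$, weights $w_i$ and threshold $\theta$ produces

$$y = \begin{cases} 1, & \text{if } \sum_{i=1}^{n} w_i x_i \ge \theta \\ 0, & \text{otherwise.} \end{cases}$$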
Fig 3. Biological and artificial neuron: a) Biological neuron b) Artificial neuron c) Detailed workings of a neuron
(Samarasinghe, 2006)
During the period between 1960 and 2010, various neural network and machine learning concepts were brought to light, and from 2010 onwards, with the help of advances in computer hardware, deep learning has been accepted as the dominant artificial intelligence paradigm, to the point that the term deep learning has become almost synonymous with artificial intelligence.
Figure 4. Difference between machine learning and deep learning (Wolfewicz, 2021)
In supervised deep learning, accurate classification or prediction is possible by feeding a large amount of labeled data into a multilayer artificial neural network, without the need for task-specific feature extraction software. Feeding the inputs, calculating the outputs with the current weights, calculating the error using a loss function, and updating the weights by back-propagation constitute the basic stages of model training. The training process terminates when the error converges to an acceptable value.
Back-propagation takes place layer by layer, backwards from the output layer. The mean squared error (MSE), which is a differentiable function, is most often used as the loss function to obtain the error at the output-layer neurons:

$$\varepsilon_{total} = \sum_{k} \frac{1}{2}\left(t_k - o_k\right)^2$$

where $t_k$ is the target value and $o_k$ the produced output of output neuron $k$.
According to the chain rule, the gradient of the error with respect to a connection weight is expressed as the product of the partial derivatives taken at all nodes (including the neuron's own activation function) from the output layer back to that connection. The learning coefficient is then used to calculate the new value of the weight.
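As an illustration in standard back-propagation notation (not reproduced from the original), for a weight $w_{jk}$ feeding output neuron $k$ with net input $net_k$ and output $o_k$, the chain rule and the update with learning coefficient $\eta$ read

$$\frac{\partial \varepsilon_{total}}{\partial w_{jk}} = \frac{\partial \varepsilon_{total}}{\partial o_k}\cdot\frac{\partial o_k}{\partial net_k}\cdot\frac{\partial net_k}{\partial w_{jk}}, \qquad w_{jk}^{new} = w_{jk} - \eta\,\frac{\partial \varepsilon_{total}}{\partial w_{jk}}.$$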
2. GENERAL INFORMATION
A Convolutional Neural Network (CNN, ConvNet) is an artificial neural network architecture that makes use of the convolution operation for image classification in deep learning. It was developed by the French computer scientist Yann LeCun between 1989 and 1998, and it has been widely adopted by the software community after 2010.
The convolutional neural network model is quite similar to the ordinary neural network model. It consists of neurons whose weights and biases are trained. A neuron in the network multiplies the values of all its upstream neurons by the weights of their connections, sums these products, and passes the result through its activation function in order to introduce non-linearity. However, unlike a classical artificial neural network, a convolutional neural network consists of two basic blocks: a convolutional block and a fully connected block.
The convolution block consists of convolutional layers and pooling layers in ordered pairs; this is the part where feature extraction takes place. The fully connected block, on the other hand, follows the classical neural network architecture and consists of fully connected neurons; in this block, classification takes place according to the convolution block outputs.
The advantage of the CNN model is that it is much faster than the classical neural network. For example, for a 3-channel image of 416x416 pixels, the input layer of a classical artificial neural network would consist of 519,168 neurons. Since each of these neurons would be connected to hidden-layer neurons, the size of the matrices holding the connection weights would dramatically increase memory and processing power consumption. In the CNN model, however, the need for memory and processing power is reduced by orders of magnitude, because the consecutively applied convolution and pooling operations preserve the meaningful elements in the image while reducing the input size at every stage. As for the weight update, the elements trained by back-propagation in the convolution block are the convolution filters (kernels) themselves. Filters of size 3x3 or 5x5 are generally preferred; a minimal filtering example is sketched below.
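To make the role of the kernel concrete, the sketch below (illustrative only, not part of the application; the file name and kernel values are placeholders) applies a single fixed 3x3 filter with OpenCV's filter2D; in a CNN such kernel values are learned by back-propagation instead of being hard-coded:

#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    // Placeholder input image; in a CNN this would be a feature map.
    cv::Mat src = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE);
    if (src.empty()) return 1;

    // A fixed 3x3 edge-enhancing kernel; a convolutional layer learns such values.
    cv::Mat kernel = (cv::Mat_<float>(3, 3) <<
                      -1, -1, -1,
                      -1,  8, -1,
                      -1, -1, -1);

    // Convolve (correlate) the image with the kernel.
    cv::Mat response;
    cv::filter2D(src, response, CV_32F, kernel);

    // Crude stand-in for 2x2 pooling: halve the resolution of the response map.
    cv::Mat pooled;
    cv::resize(response, pooled, cv::Size(), 0.5, 0.5, cv::INTER_NEAREST);

    // Save a displayable 8-bit version of the filter response.
    cv::Mat vis;
    cv::convertScaleAbs(response, vis);
    cv::imwrite("response.png", vis);
    return 0;
}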
The output of the final pooling layer in the convolution block is transformed into a one-dimensional array (flattened) and fed to the fully connected block, where the extracted features are mapped to the network outputs. The last fully connected layer consists of neurons holding the probabilities of each class. For example, if an image is to be classified as piano, guitar or flute, there will be three neurons in the output layer. The output values of these three neurons, which represent the instruments, will be probabilities between 0 and 1. The neuron with the highest of these values determines the classification result for the image being analyzed.
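The mapping of the last layer's raw scores $z_k$ to probabilities between 0 and 1 is commonly obtained with the softmax function (standard notation, shown here for illustration):

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}, \qquad \text{predicted class} = \arg\max_k\, p_k.$$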
Basically, three different application approaches can be mentioned for image analysis with convolutional neural networks. In this study, the Yolo v3 model was chosen for the classification of images.
Fig 9. Types of object recognition algorithms (Amidi, Convolutional Neural Networks cheat sheet)
YOLO (You Only Look Once) is a very efficient real-time multi-object recognition algorithm that was publicly released in 2015 by Joseph Redmon et al. Unlike a traditional CNN pipeline, it is based on dividing the image into parts and predicting five candidate sub-regions per part, instead of searching the whole image for regions that may be meaningful. Thus, the number of bounding boxes that can be extracted from an image reaches as high as 1805. The output data structure of a bounding box contains the box center coordinates, its width and height, and a confidence score.
YOLO is simple by design, which allows classification of multiple objects simultaneously using a single convolution block.
YOLO is fast. The standard version can classify at 45 fps on a Titan X GPU. The simplified version reaches 150 fps, although its accuracy is slightly reduced (Redmon, 2016). Another trade-off for the high speed is a loss of performance in recognizing small objects.
The YOLO training and testing code is open source. Moreover, pre-trained models for a predefined set of object classes can be downloaded from the Internet.
In YOLO, the input image is processed by dividing it into parts. Assuming that the image is divided into S x S cells, the predictions are expressed as a tensor of size S x S x (5B + C), where B is the number of bounding boxes per cell and C is the number of classes.
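As a worked example with the values used in the original paper (S = 7, B = 2, C = 20), the output tensor has size

$$7 \times 7 \times (5 \cdot 2 + 20) = 7 \times 7 \times 30,$$

i.e. each of the 49 cells predicts two boxes of five values each plus twenty class probabilities (Redmon, 2016).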
Figure 12. Bounding boxes and class probability map (Redmon, 2016)
Almost everyone working in the field of image processing needs the open-source OpenCV library at some point. OpenCV is an image processing library developed by Intel in the C and C++ languages, available for the Linux, Windows and macOS operating systems. While developing this library, the Intel research group observed that different computer vision infrastructures were being created in the computer science faculties of many respected universities. Since image processing consumes a lot of resources, they decided to distribute the library free of charge starting from 1999 instead of marketing it, anticipating that powerful processors would increase their sales. Their decision allowed millions of scientists and students to work on a common ground.
OpenCV is an extremely useful library containing thousands of functions that make possible camera interface access, matrix operations, image viewing and manipulation, conversion between formats, sharpening, blurring, morphological operations, thresholding, various transformations (Canny, Laplace, convolution, DFT, histogram equalization), contour finding, contour matching, segmentation, motion tracking, camera calibration, machine learning, and deep learning. In this study, OpenCV version 4 has been used for development; a minimal usage example follows.
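As a small, self-contained illustration of the library's style (not taken from the project code; file names are placeholders), the following reads an image, blurs it and extracts Canny edges:

#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>

int main()
{
    // Load an image in grayscale; the file name is a placeholder.
    cv::Mat img = cv::imread("sample.jpg", cv::IMREAD_GRAYSCALE);
    if (img.empty()) return 1;

    // Reduce noise before edge detection.
    cv::Mat blurred;
    cv::GaussianBlur(img, blurred, cv::Size(5, 5), 1.5);

    // Canny edge detection with lower/upper hysteresis thresholds.
    cv::Mat edges;
    cv::Canny(blurred, edges, 50, 150);

    cv::imwrite("edges.png", edges);
    return 0;
}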
A camera, in its simplest definition, is a device that maps the 3D world onto a 2D plane, whereas a stereo camera is a set of two cameras mounted on the same fixture looking at a common scene, as shown in Fig 12. The purpose of the stereo camera is to obtain depth information about the objects in the scene.
In order to obtain accurate dimensions of objects in the scene, the translation and rotation relationship of the cameras with respect to each other must be known exactly. Even high-end commercial stereo camera models on the market may have some flaws. For instance, image pairs captured with the stereo camera used for this research were subject to visible rotational mismatch in both the pitch and yaw axes. Another problem is the distortion of the rays that pass through the lens and fall on the sensor, which depends on the manufacturing quality and focal length of the lens. These distortions also need to be corrected in software for a precise image analysis.
Therefore, the most critical step in stereo image processing is the individual and stereo calibration of the cameras. A calibration that is not done meticulously will undoubtedly lead to erroneous results.
Before detailing the calibration process, it is useful to state some basics about the camera matrix. A camera matrix is a 3x4 matrix that describes the mapping from 3D points in the world to 2D points in an image, as shown in Figure 15.
Figure 15. Camera model and camera matrix (Kitani, Camera Matrix)
The camera matrix can be expressed as a combination of intrinsic and extrinsic parameters. Intrinsic parameters include the camera's optical center, focal length, and lens distortion information. Extrinsic parameters, on the other hand, contain information about the location and orientation of the camera, more specifically the translation and rotation with respect to the world coordinate system in which the camera resides.
A camera with known intrinsic and extrinsic parameters is a camera whose calibration matrix is known. Images taken with that camera are corrected by the rectification process, which requires the calibration matrix. Figure 16 shows the projection matrix decomposed into intrinsic and extrinsic parameters; here P stands for the rectified projection matrix, K for the intrinsic parameters, and R and t for the rotation and translation parameters.
Figure 16. Projection matrix obtained from intrinsic and extrinsic parameters (Kitani, Camera Matrix)
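In the standard pinhole formulation (written out here in general notation, consistent with the element descriptions that follow), this decomposition reads

$$P = K\,[\,R \mid t\,], \qquad K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix}, \qquad [\,R \mid t\,] = \begin{bmatrix} r_1 & r_2 & r_3 & t_1 \\ r_4 & r_5 & r_6 & t_2 \\ r_7 & r_8 & r_9 & t_3 \end{bmatrix}.$$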
In the intrinsic parameters, f indicates the focal length, and px and py indicate the pixel coordinates of the camera's optical center on the image. In the extrinsic parameters, the elements t1, t2, t3 represent the translation of the camera along the x, y and z axes with respect to the world coordinate system. The elements r1, r2, ..., r9 define the rotation matrix that performs a rotation in Euclidean space and is the product of the rotation matrices of the individual axes; more clearly:
$$R = Z(\theta)\,X(\psi)\,Y(\varphi), \quad\text{with}\quad Z(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix},\; X(\psi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & -\sin\psi \\ 0 & \sin\psi & \cos\psi \end{bmatrix},\; Y(\varphi) = \begin{bmatrix} \cos\varphi & 0 & \sin\varphi \\ 0 & 1 & 0 \\ -\sin\varphi & 0 & \cos\varphi \end{bmatrix}.$$
In light of the above information, the calibration process of a stereo camera using the OpenCV library was conducted in the following steps (a minimal code sketch of the same workflow is given after the list):
a) A chessboard image consisting of black and white squares of known sizes is printed on paper at 1:1 scale and pasted on a rigid board to prevent it from bending.
b) At least 12 photos, preferably more, are taken while moving the board to different positions and orientations, in such a way that all chessboard squares remain inside the frame of both cameras.
c) From all images saved by the left camera, the intersection points of the black and white squares are found using OpenCV's findChessboardCorners function and added to a vector.
d) The camera matrix, distortion coefficients, and rotation and translation matrices of the left camera are obtained by passing the vector populated above, together with the vector of corresponding corner coordinates on the printed board (with the origin at the top-left corner), as arguments to the calibrateCamera function of the OpenCV library.
e) From all images saved by the right camera, the intersection points of the black and white squares are found using OpenCV's findChessboardCorners function and added to a vector.
f) The camera matrix, distortion coefficients, and rotation and translation matrices of the right camera are obtained by passing the vector populated above, together with the vector of corresponding corner coordinates on the printed board, as arguments to the calibrateCamera function of the OpenCV library, with the origin at the top-left corner.
g) The matrices and parameters obtained from the calibrateCamera calls are passed to the stereoCalibrate function, which outputs the transform between the left and right cameras: a rotation matrix and a translation vector. This function also returns the fundamental matrix (for uncalibrated cameras) and the essential matrix (for calibrated cameras), which are used to compute the corresponding point in the right image from a given point in the left image, and vice versa.
h) Finally, by passing the matrices obtained from stereoCalibrate to the stereoRectify function, the rectification matrices R1 and R2 are obtained, which ensure that objects lie on the same line along the vertical axis in images taken with the stereo camera.
j) All matrices obtained from the stereoCalibrate and stereoRectify functions are saved.
After this stage, for each image pair captured by the stereo camera within the application, the initUndistortRectifyMap and remap functions are called in order to obtain the corrected images.
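As a hedged illustration of steps a) through j), the following minimal sketch (not the project's actual source, which is excerpted in the appendix; file names, board geometry and image count are placeholders) chains the OpenCV calls named above:

#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <string>
#include <vector>

int main()
{
    const cv::Size boardSize(9, 6);   // inner corners of the printed chessboard
    const float squareSize = 2.0f;    // edge length of one square (board units)
    const int numPairs = 12;          // number of stereo image pairs

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePointsL, imagePointsR;

    // Reference corner coordinates on the printed board (z = 0 plane,
    // origin at the top-left corner), matching step d).
    std::vector<cv::Point3f> boardCorners;
    for (int i = 0; i < boardSize.height; ++i)
        for (int j = 0; j < boardSize.width; ++j)
            boardCorners.push_back(cv::Point3f(j * squareSize, i * squareSize, 0.0f));

    cv::Size imageSize;
    for (int n = 0; n < numPairs; ++n) {
        // Steps b), c) and e): load each saved pair and find the chessboard corners.
        cv::Mat left  = cv::imread("left_"  + std::to_string(n) + ".jpg", cv::IMREAD_GRAYSCALE);
        cv::Mat right = cv::imread("right_" + std::to_string(n) + ".jpg", cv::IMREAD_GRAYSCALE);
        if (left.empty() || right.empty()) continue;
        imageSize = left.size();

        std::vector<cv::Point2f> cornersL, cornersR;
        bool foundL = cv::findChessboardCorners(left,  boardSize, cornersL);
        bool foundR = cv::findChessboardCorners(right, boardSize, cornersR);
        if (!foundL || !foundR) continue;

        imagePointsL.push_back(cornersL);
        imagePointsR.push_back(cornersR);
        objectPoints.push_back(boardCorners);
    }

    // Steps d) and f): individual calibration of each camera.
    cv::Mat K1, D1, K2, D2;
    std::vector<cv::Mat> rvecs, tvecs;
    cv::calibrateCamera(objectPoints, imagePointsL, imageSize, K1, D1, rvecs, tvecs);
    cv::calibrateCamera(objectPoints, imagePointsR, imageSize, K2, D2, rvecs, tvecs);

    // Step g): rotation R and translation T between the two cameras.
    cv::Mat R, T, E, F;
    cv::stereoCalibrate(objectPoints, imagePointsL, imagePointsR,
                        K1, D1, K2, D2, imageSize, R, T, E, F,
                        cv::CALIB_FIX_INTRINSIC);

    // Step h): rectification and projection matrices.
    cv::Mat R1, R2, P1, P2, Q;
    cv::stereoRectify(K1, D1, K2, D2, imageSize, R, T, R1, R2, P1, P2, Q);

    // Remap tables later applied to every captured pair with cv::remap.
    cv::Mat map1L, map2L, map1R, map2R;
    cv::initUndistortRectifyMap(K1, D1, R1, P1, imageSize, CV_32FC1, map1L, map2L);
    cv::initUndistortRectifyMap(K2, D2, R2, P2, imageSize, CV_32FC1, map1R, map2R);
    return 0;
}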
Once the camera setup is properly calibrated and the captured images are undistorted and remapped as explained in section 2.2.1, depth values of scene objects can be extracted using basic triangulation, as long as the camera positions and orientations remain unchanged. Figure 18 shows the basic projection geometry in stereo imaging.
According to this representation, Z is the depth of the object to be calculated, B is the distance between the cameras, f is the focal length, and CxL, CyL, CxR, CyR are the pixel coordinates of the object's center point in the left and right image planes. The distance B between the cameras is given in the camera's technical specification. The focal length f is the lens focal length scaled by the ratio of image size to sensor size, and it is available in the intrinsic parameters obtained as a result of the calibration. The pixel coordinates of the object centers in the image plane are available in the bounding box structure estimated by the Yolo convolutional neural network model.
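The triangulation relation itself, reconstructed from the definitions above (and consistent with the depth computation in the appendix code, where depth = focal length × baseline / disparity), is

$$Z = \frac{f \cdot B}{C_{xL} - C_{xR}}.$$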
From the above equality, the depth Z of an object is defined in terms of the disparity (CxL - CxR).
The OpenCV library comes with a deep learning (dnn) module from version 3.3 onwards. Thanks to this module, different pre-trained artificial neural network models can be integrated into user software without the need for another library. In this project, the weight and configuration files of the YoloV3 neural network model were run through this module.
The neural network model is loaded with the readNet function under the cv::dnn namespace. This function takes the path to the weights file (.weights) and the path to the model's configuration file (.cfg) as arguments.
The YoloV3 pre-trained set is designed to recognize 80 classes (the classes listed in coco.names). In the order of the output layer, these are: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table, toilet, TV, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair drier and toothbrush.
The following steps were taken in order to classify objects with the OpenCV dnn module and the pre-trained neural network model (a minimal code sketch follows the list):
a) The paths to the Yolo weights and configuration files downloaded from the Yolo website are passed to the readNet function, which returns an object of type cv::dnn::Net.
b) The artificial neural network backend is set to OpenCV by calling the setPreferableBackend(DNN_BACKEND_OPENCV) method on the returned Net object.
c) By calling the setPreferableTarget(DNN_TARGET_CPU) method on the same Net object, CPU-based (or, alternatively, GPU-based) operation is selected.
d) An array of strings holding the class names is created and filled in the same order as the neural network output neurons. For this research, the coco.names text file was downloaded from the Yolo website and used.
e) Whenever a new image is captured, it is rectified according to the calibration matrices (as explained in the calibration section 2.2.1).
f) The image is converted to a blob with the blobFromImage function.
g) The setInput function is called on the Net object with the blob variable passed as argument.
h) A vector of type cv::Mat is created to hold the detected bounding boxes, classes, and confidences.
i) The forward method is called on the Net object, with the cv::Mat vector and getUnconnectedOutLayersNames() passed as arguments.
j) The cv::Mat vector is filled with the detected bounding boxes, classes and confidences when forward returns.
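The following minimal sketch (again not the project's actual source; file names and the confidence threshold are placeholders) shows how the steps above map onto the OpenCV dnn API:

#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/imgcodecs.hpp>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    // Steps a-c: load the model and select the OpenCV backend and CPU target.
    cv::dnn::Net net = cv::dnn::readNet("yolov3.weights", "yolov3.cfg");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

    // Step d: class names in the same order as the network output neurons.
    std::vector<std::string> classes;
    std::ifstream namesFile("coco.names");
    for (std::string line; std::getline(namesFile, line); )
        classes.push_back(line);

    // Step e would supply a rectified camera frame; a plain image stands in here.
    cv::Mat frame = cv::imread("left.jpg");
    if (frame.empty()) return 1;

    // Step f: convert the image to a normalized 4D blob; step g: feed it in.
    cv::Mat blob;
    cv::dnn::blobFromImage(frame, blob, 1.0 / 255.0, cv::Size(416, 416),
                           cv::Scalar(), true, false);
    net.setInput(blob);

    // Steps h-j: run the forward pass; each output row holds the box geometry,
    // an objectness score and the per-class confidences.
    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());
    for (const cv::Mat& out : outs)
        for (int r = 0; r < out.rows; ++r) {
            cv::Mat scores = out.row(r).colRange(5, out.cols);
            cv::Point classId;
            double confidence = 0.0;
            cv::minMaxLoc(scores, nullptr, &confidence, nullptr, &classId);
            if (confidence > 0.2 && classId.x < (int)classes.size())
                std::cout << classes[classId.x] << " " << confidence << "\n";
        }
    return 0;
}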
Within the scope of this work, a natively compiled Linux application was developed in the C++ language. The graphical user interface was made with the Qt library. The images captured from the stereo camera were processed with the OpenCV library and its deep neural network module. Detected objects and their distances to the stereo camera are reported in a data grid.
3.1. EQUIPMENT
The following equipment was used for the application:
3.2. METHOD
4. RESULTS
The calibration menu implemented in the application provides the complete calibration workflow as well as saving and retrieving calibration data for later use. The images in figure 28 were taken using the application.
The matrices obtained as a result of the calibration process are shown below:
K: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 5.8166088991359027e+02, 0., 3.3570997024361776e+02, 0.,
7.7387600203886643e+02, 2.8015027751270867e+02, 0., 0., 1. ]
D: !!opencv-matrix
rows: 1 cols: 5 dt: d
data: [ -1.6160937401860759e-01, 9.0319323291590248e-01,
-1.3781131824954437e-03, -4.5921517824351485e-04,
-1.9278208853746575e+00 ]
board_width: 9
board_height: 6
square_size: 2.
K: !!opencv-matrix
rows: 3 cols: 3 dt: d
STEREO CALIBRATION: K1, K2, D1, D2, R, T, E, F, R1, R2, P1, P2, Q MATRICES
K1: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 5.8166088991359027e+02, 0., 3.3570997024361776e+02, 0.,
7.7387600203886643e+02, 2.8015027751270867e+02, 0., 0., 1. ]
K2: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 5.8249657630589957e+02, 0., 3.3435659791968590e+02, 0.,
7.7591331524355746e+02, 2.4632547239438091e+02, 0., 0., 1. ]
D1: !!opencv-matrix
rows: 1 cols: 5 dt: d
data: [ -1.6160937401860759e-01, 9.0319323291590248e-01,
-1.3781131824954437e-03, -4.5921517824351485e-04,
-1.9278208853746575e+00 ]
D2: !!opencv-matrix
rows: 1 cols: 5 dt: d
data: [ -1.4842077372742965e-01, 6.4051178026583100e-01,
1.4737778247401337e-04, -9.1167971051625096e-04,
-9.5386051458102916e-01 ]
R: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.9992804293714921e-01, -2.4373835298358540e-03,
-1.1745982692441316e-02, 2.4212550560881551e-03,
9.9999610668890793e-01, -1.3871304840331348e-03,
1.1749317930672128e-02, 1.3585906502149082e-03,
9.9993005143340352e-01 ]
T: [ -6.0163964562455252e+00, -1.3622847988624018e-02,
-1.0461782337464186e-01 ]
E: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.3247261663228259e-05, 1.0459890819100304e-01,
-1.3767013661910412e-02, -3.3921740621952183e-02,
8.4288137330619066e-03, 6.0172044570800143e+00,
-9.4536261062024300e-04, -6.0164062366477848e+00,
8.1855131917909548e-03 ]
F: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.1465732738861130e-10, 7.7116616449344381e-07,
-2.9489689694408959e-04, -2.4979321189411534e-07,
4.6651696740234350e-08, 2.5843916854448444e-02,
5.5823112325540603e-05, -2.6106886873434109e-02, 1. ]
R1: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.9998409756045259e-01, -1.4945817662057851e-04,
5.6375782443052764e-03, 1.5334986050427035e-04,
9.9999975027098931e-01, -6.8988533769186189e-04,
-5.6374737274337934e-03, 6.9073888866935033e-04,
9.9998387075480388e-01 ]
R2: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.9984628703230816e-01, 2.2639389008504455e-03,
1.7386111939144289e-02, -2.2759407140120580e-03,
9.9999718522097703e-01, 6.7055498599785006e-04,
-1.7384544905563154e-02, -7.1002167302160191e-04,
9.9984862527667173e-01 ]
P1: !!opencv-matrix
rows: 3 cols: 4 dt: d
data: [ 7.7489465864121189e+02, 0., 3.2709444046020508e+02, 0., 0.,
7.7489465864121189e+02, 2.6338606071472168e+02, 0., 0., 0., 1.,
0. ]
P2: !!opencv-matrix
rows: 3 cols: 4 dt: d
data: [ 7.7489465864121189e+02, 0., 3.2709444046020508e+02,
-4.6627902095334048e+03, 0., 7.7489465864121189e+02,
2.6338606071472168e+02, 0., 0., 0., 1., 0. ]
Q: !!opencv-matrix
rows: 4 cols: 4 dt: d
data: [ 1., 0., 0., -3.2709444046020508e+02, 0., 1., 0.,
-2.6338606071472168e+02, 0., 0., 0., 7.7489465864121189e+02, 0.,
0., 1.6618690179474188e-01, 0. ]
A wall clock, a computer keyboard, a tennis racket and a bottle were used as objects to be detected in the experiments. Image capture was performed live in video format. While recording the measurements, the clock and the computer keyboard were moved from near to far, the soda bottle from far to near, and the tennis racket was moved back and forth by small distances. Screenshots of the application showing the measurement records are given below.
Table 1. Measurements taken while wall clock moves away from camera
Table 2. Measurements taken while keyboard moves away from stereo camera
During the experiment, the bottle was classified as a vase many times. However, because the bounding boxes were correctly located, the classification errors did not affect the measurement values.
Table 4. Measurements taken while tennis racket moves back and forth
In order to evaluate measurement accuracy, another experiment was performed keeping the wall clock and the tennis racket fixed. In this scene, the actual distance from the wall clock to the camera was 600 mm, and from the tennis racket to the camera 1200 mm.
Table 5. Measurements taken when tennis racket and wall clock stay steady
In this project, titled "Multiple Object Recognition and Depth Estimation from Stereo Images", the aim was to recognize multiple objects in the scene using a pre-trained convolutional neural network model and to calculate the perpendicular distances of these objects to the stereo camera setup. The steps taken in the software can be summarized as recognizing objects in the image of one of the cameras using the convolutional neural network model, finding the same contents inside their bounding boxes in the other camera's image by the template matching method, and calculating the depth using the disparity information.
In the experiments carried out without calibration, it was observed that objects in the right and left images were at different levels on the vertical axis and were slightly rotated relative to each other. This error is thought to be caused by minor differences between the cameras in the geometrical relationship between lens and sensor, or by the sensors not being mounted perfectly on the PCB. Looking at the screenshots in the measurement findings section, black bands with inclined inner edges can be noticed at the bottom of the left image and at the top of the right image. Based on this observation, it can be seen that, as a result of the calibration process, the left image has been shifted up, the right image has been shifted down, and the horizon lines have been equalized by rotating the two images in opposite directions relative to each other. As a result, it is possible to say that the calibration process was successful.
In the experiments in which the objects were moved, the depth value changed consistently with the direction of the movement. Measurement errors while the objects were stationary were below +/- 4%. At the time this research was completed, the OpenCV library pre-installed on the Nvidia Jetson Nano was not compiled with GPU support, therefore the neural network had to be run on the CPU. Even so, a frame rate of 2.5 fps was achieved in the program cycle consisting of image capture from the two cameras in live video mode, image rectification, classification with the Yolo neural network, image matching and user interface update.
RESOURCES
Abdelhamid, M. (2011), Extracting Depth Information From Stereo Vision System Using a
Correlation and a Feature Based Methods. Clemson University TigerPrints, Web address :
"https://tigerprints.clemson.edu/all_theses/1216", Access date : 18/4/2022
Amidi, A., Amidi, S., Convolutional Neural Networks cheatsheet. Web address :
"https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks", Access date :
19/4/2022.
Bhatt, D. (2021), A Comprehensive Guide for Camera Calibration in Computer Vision. Data Science Blogathon, Web address : "https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-for-camera-calibration-in-computer-vision", Access date : 18/4/2022
Bradski, G., Kaehler, A. (2018), Learning OpenCV. O’Reilly Media, Inc, ISBN: 978-0-596-51613-0, 370-454.
Eby, M. (2020), Kernelled Connections: The Perceptron as Diagram. Web address : "https://tripleampersand.org/kernelled-connections-perceptron-diagram", Access date : 18/4/2022
Maj, M. (2018) , Object Detection and Image Classification with YOLO. Appsilon Science, Web address :
"https://www.kdnuggets.com/2018/09/object-detection-image-classification-yolo.html", Access date : 19/4/2022.
Ortiz, L.E., Cabrera E.V., Gonçalvez L.M. (2018), Depth Data Error Modeling of the ZED 3D Vision Sensor from
Stereolabs. Electronic Letters on Computer Vision and Image Analysis, DOI:10.5565/rev/elcvia.1084 , 4-7
Redmon, J., Divvala, S., Girshick, R., Farhadi, A. (2016), You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition, DOI: 10.1109/CVPR.2016.91
Reynolds, A.H. (2019), "Convolutional Neural Networks", Web address : "https://anhreynolds.com/blogs/cnn.html", Access date : 18/4/2022
Samarasinghe, S. (2006), Neural Networks for Applied Sciences and Engineering. Auerbach Publications, ISBN:
978-0-8493-3375-0, 17
Santoro, M., Alregib, G., Altunbasak, Y. (2012), Misalignment correction for depth estimation using stereoscopic
3-D cameras. 2012 IEEE 14th International Workshop on Multimedia Signal Processing DOI:
10.1109/MMSP.2012.6343409
Steaward, J. (2021), Camera Modeling: Exploring Distortion and Distortion Models. Web address :
"https://www.tangramvision.com/blog/camera-modeling-exploring-distortion-and-distortion-models-part-i", Access
date : 19/4/2022
Verma, N.K., Nama, P., Kumar,G., Siddhant, A., Raj, A., Dhar, N.K., Salour, A. (2015), Vision based object
follower automated guided vehicle using compressive tracking and stereo-vision. 2015 IEEE Bombay Section
Symposium, DOI: 10.1109/IBSS.2015.7456637
Wolfewicz, A. (2021), Deep learning vs. machine learning – What’s the difference?. Web address :
"https://levity.ai/blog/difference-machine-learning-deep-learning", Access date : 19/4/2022
APPENDIX
Appx 1. Stereo camera calibration source code excerpts
#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <stdio.h>
#include <iostream>
#include <QDebug>
#include <sys/stat.h>
using namespace std;
using namespace cv;
double xfocalLength_Left=0;
double xfocalLength_Right=0;
double xprincipalPoint_Left=0;
double xprincipalPoint_Right=0;
bool calibrationCompleted=false;
int i, totalPoints = 0;
double totalErr = 0, err;
vector< float > perViewErrors;
perViewErrors.resize(objectPoints.size());
if(!found1 || !found2){
cout << "Chessboard find error!" << endl;
cout << "leftImg: " << left_img << " and rightImg: " << right_img <<endl;
continue;
}
if (found1)
{
cornerSubPix(gray1, corners1, cv::Size(5, 5), cv::Size(-1, -1), TermCriteria(TermCriteria::EPS | TermCriteria::MAX_ITER , 30, 0.1));
cv::drawChessboardCorners(gray1, board_size, corners1, found1);
}
if (found2)
{
cornerSubPix(gray2, corners2, cv::Size(5, 5), cv::Size(-1, -1), TermCriteria(TermCriteria::EPS | TermCriteria::MAX_ITER, 30, 0.1));
cv::drawChessboardCorners(gray2, board_size, corners2, found2);
}
vector< Point3f > obj;
for (int i = 0; i < board_height; i++)
for (int j = 0; j < board_width; j++)
obj.push_back(Point3f((float)j * square_size, (float)i * square_size, 0));
int calibrate(char* leftcalib_file, char* rightcalib_file, char* leftimg_dir, char* rightimg_dir, char* leftimg_filename, char* rightimg_filename, char*
extension, char* outfile_stereo, int num_imgs)
{
board_width=9;
board_height=6;
square_size=2.0f;
load_image_points(board_width, board_height, num_imgs, square_size,
leftimg_dir, rightimg_dir, leftimg_filename, rightimg_filename, extension);
qDebug() << "Calibration error Left: " << computeReprojectionErrors(object_points, imagePoints1, rvecs1, tvecs1, K1, D1) << endl;
qDebug() << "Calibration error Right: " << computeReprojectionErrors(object_points, imagePoints2, rvecs2, tvecs2, K2, D2) << endl;
printf("Done Calibration\n");
printf("Starting Rectification\n");
xfocalLength_Left= K1.at<double>(0,0);
xfocalLength_Right= K2.at<double>(0,0);
xprincipalPoint_Left=K1.at<double>(0,2);
xprincipalPoint_Right=K2.at<double>(0,2);
calibrationCompleted=true;
printf("Done Rectification\n");
return 0;
}
void doNothing() { }
xfocalLength_Left= K1.at<double>(0,0);
xfocalLength_Right= K2.at<double>(0,0);
xprincipalPoint_Left=K1.at<double>(0,2);
xprincipalPoint_Right=K2.at<double>(0,2);
calibrationCompleted=true;
Appx 2. Main loop source code (Object recognition on left image, template matching in right
image, depth calculation)
void MainWindow::on_single_shot_requested()
{
//SetCameraEnvironment();
cv::Mat blob;
cv::dnn::blobFromImage(leftImage,blob,OneDiv255,cv::Size(320,320),cv::Scalar(),true,false); //(416,416)
network.setInput(blob);
std::vector<cv::Mat> outs;
network.forward(outs,network.getUnconnectedOutLayersNames());
std::vector<int> classIds;
std::vector<float> confidences;
std::vector<cv::Rect> boxes;
std::vector<int> centersX; // for suppression of multiple detections of the same object
if(confidence>0.2)
{
qDebug()<< confidence << classIdPoint.x << classes[classIdPoint.x].data();
int centerX = (int)(data[0] * leftImage.cols);
int centerY = (int)(data[1] * leftImage.rows);
int width = (int)(data[2] * leftImage.cols);
int height = (int)(data[3] * leftImage.rows);
int left = centerX - (width / 2);
int top = centerY - (height / 2);
for(auto x : centersX)
{
float k=(float)centerX/(float)x;
Rect region(left,top,width,height);
Mat cropped=leftImage(region);
Mat outputMatch;
double minVal,maxVal;
Point minLoc,maxLoc;
matchTemplate(rightImage,cropped,outputMatch,TM_CCORR_NORMED);
normalize(outputMatch, outputMatch,0,1,NORM_MINMAX,-1,Mat());
minMaxLoc(outputMatch,&minVal,&maxVal,&minLoc,&maxLoc,Mat());
rectangle(rightImage,maxLoc,Point(maxLoc.x+width,maxLoc.y+height),cv::Scalar(0,255,0),2,8,0);
int centerX2=maxLoc.x+width/2;
classIds.push_back(classIdPoint.x);
confidences.push_back((float)confidence);
boxes.push_back(cv::Rect(left, top, width, height));
cv::rectangle(leftImage, cv::Rect(left, top, width, height), cv::Scalar(0, 255, 0), 2, 8, 0);
double disparity=0;
double depth=0;
if (ui->checkBox->checkState())
{
currentData=classes[classIdPoint.x].data();
currentData[0]=currentData[0].toUpper();
tableModel->setData(tableModel->index(currentRow,0),currentData);
currentData=QString::number(100*confidence) +"%";
tableModel->setData(tableModel->index(currentRow,1),currentData);
tableModel->setData(tableModel->index(currentRow,2),QString::number(centerX));
tableModel->setData(tableModel->index(currentRow,3),QString::number(centerX2));
disparity=(centerX-xprincipalPoint_Left)-(centerX2-xprincipalPoint_Right); // disparity corrected for each camera's principal point offset
tableModel->setData(tableModel->index(currentRow,4),QString::number(disparity));
depth= xfocalLength_Left*60/disparity; // Z = f*B/disparity; 60 is assumed to be the camera baseline B in mm, per the camera specification
tableModel->setData(tableModel->index(currentRow,5),QString::number(depth));
currentRow++;
}
QPixmap mapLeft=MatToPixmap(leftImage);
ui->label_6->setPixmap(mapLeft);
QPixmap mapRight=MatToPixmap(rightImage);
ui->label_7->setPixmap(mapRight);
QApplication::processEvents();
}
X001:
doNothing();
}
}
}