

DOI: 10.5281/zenodo.6544603

MULTIPLE OBJECT RECOGNITION AND DEPTH ESTIMATION FROM STEREO IMAGES

Omer Unsalver
Independent Researcher, Istanbul, Turkiye
orcid.org/0000-0001-8466-0580

ABSTRACT - Recent advances in computer technology have made powerful hardware available at affordable cost. As a result, fields of application that require high processing power, such as image processing, artificial neural networks and deep learning, have expanded. Examples include autonomous vehicle navigation, robot guidance, object recognition, speech recognition, medical analysis, production quality control and safety systems.

The objective of this research was to develop computer vision software that performs the calibration of a stereo camera system, classifies multiple objects in video frames using convolutional neural networks, locates matching objects on image pairs, calculates their distances to the stereo camera and verifies the calculated values against measured values. Based on the results obtained, it is concluded that the proposed method can be adopted for industrial applications.

Keywords: Calibration, Convolutional neural network, Depth estimation, Image processing, Stereo vision


CONTENTS

ABSTRACT
CONTENTS
1. INTRODUCTION
2. GENERAL INFORMATION
2.1. Convolutional neural networks
2.1.1. Yolo neural network model
2.2. Image processing
2.2.1. Stereo camera calibration and rectification
2.2.2. Depth estimation from stereo images
2.2.3. Integration of image processing and neural networks
3. MATERIAL AND METHOD
3.1. Equipment
3.1.1. Nvidia Jetson Nano AI Computer
3.1.2. Sony IMX219-83 Stereo camera
3.1.3. CSI Camera interface cable
3.1.4. Tripod and mounting plate
3.1.5. Experimental environment
3.2. Method
4. RESULTS
4.1. Calibration results
4.2. Measurement results
5. DISCUSSION AND CONCLUSION
RESOURCES
APPENDIX
Appx 1. Calibration source code
Appx 2. Main loop source code


1. INTRODUCTION

Considering the period from the industrial revolution to the present, it can be said that the great leap in technology began with the invention of the transistor in 1947 by William Shockley and his colleagues at Bell research laboratories. This unprecedented development paved the way for the pioneers of computer science such as Alan Turing and John von Neumann and enabled the technologies we have today.

In parallel with developments in the electronics industry, in 1956 a young mathematician at Dartmouth College, John McCarthy, introduced the concept of a thinking machine and used the term artificial intelligence for the first time.

The first artificial intelligence approach that takes the human brain as a model was the design of a single layer neural
network named Perceptron, proposed by Frank Rosenblatt in 1958. Inspired by the way neurons work together,
Rosenblatt conceived Perceptron as a supervised learning algorithm for binary classification.

Figure 1. A photo of John McCarthy (Professor-John-McCarthy-shows-off-computer-chess)
Figure 2. A photo of Frank Rosenblatt (Eby, 2020)

In neuroscience, synaptic plasticity is defined as the brain's tendency to change the nature of connections between
individual synapses in response to changing needs. A neuron produces an output signal (fires) if the sum of the signals it
receives from neighboring neurons exceeds a certain threshold. Since neurons receive stronger or weaker signals
depending on the nature of their synaptic connections, a neuron in the network performs a sort of weighted addition.
Neuroscientists argue that learning is possible by this weighted transmission model where the weight changes over time.
Rosenblatt's Perceptron approach is considered as the ancestor of deep learning, an important branch of today's artificial
intelligence technology.
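In generic notation (not taken from the original text), this weighted-addition-and-threshold behaviour is exactly the perceptron rule:

y = \begin{cases} 1 & \text{if } \sum_i w_i x_i + b > 0 \\ 0 & \text{otherwise} \end{cases}

where x_i are the input signals, w_i the synaptic weights and b a bias (threshold) term.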


Fig 3. Biological and artificial neuron: a) Biological neuron b) Artificial neuron c) Detailed workings of a neuron
(Samarasinghe, 2006)

During the period between 1960 and 2010, various neural network and machine learning concepts were brought to light. From 2010 onwards, with the help of advances in computer hardware, deep learning has been accepted as the dominant artificial intelligence paradigm, to the point that the term deep learning has almost come to mean artificial intelligence.

Figure 4. Difference between machine learning and deep learning (Wolfewicz, 2021)

In supervised deep learning, accurate classification or prediction is possible by feeding a large amount of labeled data into a multilayer artificial neural network, without the need for hand-crafted feature extraction software. Feeding inputs, calculating outputs with the current weights, calculating errors using a loss function, and updating the weights by back-propagation constitute the basic stages of model training. The training process terminates when the error converges to an acceptable value.


Figure 5. Multiple layer perceptron (Sonmez, 2018)

Back-propagation takes place layer by layer, backwards from the output layer. MSE (mean squared error), which is a differentiable function, is most often used as the loss function to obtain the error at the output layer neurons:

\varepsilon_{total} = \frac{1}{n}\sum_{i=1}^{n} (t_i - o_i)^2

where t_i is the target value and o_i the actual output of output neuron i. According to the chain rule, the gradient with respect to a link's weight is expressed as the product of the partial derivatives taken at all nodes (including the neuron's own activation function) from the output layer back to the link itself. The learning coefficient is then used to calculate the new value of the weight.
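As a brief illustration in generic notation (these symbols are not taken from the original text), the chain-rule gradient and the resulting update for a single weight w_{ij} feeding output neuron j are:

\frac{\partial \varepsilon_{total}}{\partial w_{ij}} = \frac{\partial \varepsilon_{total}}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}}, \qquad w_{ij}^{new} = w_{ij} - \eta \, \frac{\partial \varepsilon_{total}}{\partial w_{ij}}

where net_j is the weighted input sum of neuron j, o_j its output, and \eta the learning coefficient.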

2. GENERAL INFORMATION

2.1. CONVOLUTIONAL NEURAL NETWORKS

A Convolutional Neural Network (CNN, ConvNet) is an artificial neural network algorithm that makes use of the convolution operation for image classification in deep learning. It was developed by the French computer scientist Yann LeCun between 1989 and 1998, and it has been widely adopted by the software community since 2010.

Figure 6. A photo of Yann LeCun (Yann LeCun | Tandon School of Engineering)


The convolutional neural network model is quite similar to the ordinary neural network model. It consists of neurons whose weights and biases are trained. A neuron in the network multiplies the values of all its upstream neurons by the weights of their connections, sums these values, and passes the sum through its activation function in order to introduce non-linearity. However, unlike a classical artificial neural network, a convolutional neural network consists of two basic blocks: the convolutional block and the fully connected block.

Figure 7. Convolutional neural network model. (Convolutional Neural Network)

The convolution block consists of convolutional layers and pooling layers in ordered pairs; this is the part where feature extraction takes place. The fully connected block, on the other hand, follows the classical neural network architecture and consists of fully connected neurons; in this block, classification takes place according to the convolution block outputs.

The advantage of the CNN model is that it is much faster than the classical neural network. For example, for a 3-channel image of 416x416 pixels, the input layer of a classical artificial neural network would consist of 519,168 neurons. Since each of these neurons would be connected to every hidden layer neuron, the matrices holding the connection weights would dramatically increase memory and processing power consumption. In the CNN model, however, the need for memory and processing power is reduced by orders of magnitude, because the convolution and pooling operations applied consecutively preserve the meaningful elements of the image while reducing the input size at every stage. As for the weight update, the elements trained by back-propagation in the convolution block are the convolution filters (kernels) themselves. Filters of size 3x3 or 5x5 are generally preferred.
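To make the comparison concrete (the hidden-layer size below is an illustrative assumption, not a figure from the original text):

416 \times 416 \times 3 = 519{,}168 \text{ input neurons}

A fully connected hidden layer of only 1,000 neurons would then require 519{,}168 \times 1{,}000 \approx 5.2 \times 10^{8} connection weights, whereas a single 3 \times 3 \times 3 convolution filter has just 27 trainable weights (plus one bias), independent of the image size.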

Figure 8. Convolution operation (Reynolds, 2019)


The output of the final pooling layer in the convolution block is transformed into a one-dimensional array (flattened) and fed to the fully connected block, where the extracted features are mapped to the network outputs. The last fully connected layer consists of neurons holding the probability of each class. For example, if an image is to be classified as piano, guitar or flute, there will be three neurons in the output layer. The output values of these three neurons, which represent the instruments, will be probabilities between 0 and 1. The neuron with the highest value determines the classification result for the image being analyzed.
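In practice these class probabilities are commonly produced by a softmax function applied to the raw output scores z_k of the last layer (generic notation, not taken from the original text):

p_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}, \qquad \hat{c} = \arg\max_k \, p_k

so that the probabilities are positive and sum to one, and the predicted class \hat{c} is the one with the highest probability.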

Basically, three different application approaches can be distinguished in image analysis with convolutional neural networks (Figure 9). In this study, the YOLOv3 model was chosen for the classification of images.

Fig 9. Types of object recognition algorithms (Amidi, Convolutional Neural Networks cheat sheet)

2.1.1. YOLO NEURAL NETWORK MODEL

YOLO (You Only Look Once) is a very efficient real-time multi-object recognition algorithm that was publicly released in 2015 by Joseph Redmon et al. Unlike a traditional region-proposal CNN, it divides the image into cells and predicts five candidate bounding boxes per cell, instead of searching the whole image for regions that may be meaningful. Thus, the number of bounding boxes that can be extracted from an image reaches as high as 1805. The output data structure of a bounding box has the following elements (a minimal struct sketch is given after the list):

- Bounding box center (bx, by)
- Width of bounding box (bw)
- Height of bounding box (bh)
- Probability of an object found in the box (pc)
- Array of class probabilities for each class id (c)
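As an illustration, the per-box output described above could be held in a small structure such as the following (a minimal sketch; the field names are illustrative, and the raw YOLO output is simply a row of floating-point values per box):

#include <vector>

struct YoloBox {
    float bx, by;                   // bounding box center
    float bw, bh;                   // bounding box width and height
    float pc;                       // probability that the box contains an object
    std::vector<float> classProb;   // per-class probabilities, indexed by class id
};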


Figure 10. Yolo bounding box structure (Maj, 2018)

YOLO is simple by design, which allows classification of multiple objects simultaneously using a single convolution block.

YOLO is fast. The standard version can classify at 45 fps on a Titan X GPU, while the simplified version reaches 150 fps, although its accuracy is slightly reduced (Redmon, 2016). Another trade-off for high speed is reduced performance in recognizing small objects.

YOLO training and testing codes are open source. Moreover, pre-trained models including certain objects can be
downloaded from the Internet.

Figure 11. Yolo convolutional network architecture ( Redmon, 2016)

In YOLO, the input image is processed by dividing it into cells. Assuming the image is divided into S x S cells, the predictions are expressed as a tensor of size S x S x (5B + C), where B is the number of bounding boxes predicted per cell and C is the number of classes.
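For example, with the configuration used in the original YOLO paper (Redmon, 2016), S = 7, B = 2 and C = 20, so the output tensor has size

S \times S \times (5B + C) = 7 \times 7 \times (5 \cdot 2 + 20) = 7 \times 7 \times 30.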


Figure 12. Bounding boxes and class probability map (Redmon, 2016)

2.2. IMAGE PROCESSING

Almost everyone working in the field of image processing needs the open source OpenCV library at some point. OpenCV is an image processing library developed by Intel in the C and C++ languages, suitable for the Linux, Windows and macOS operating systems. While developing this library, Intel's research group saw that many respected universities' computer science faculties were each building their own computer vision infrastructure. Since image processing consumes a lot of resources, they decided to distribute the library free of charge starting from 1999 instead of marketing it, anticipating that demand for powerful processors would increase their sales. Their decision allowed millions of scientists and students to work on common ground.

OpenCV is an extremely useful library containing thousands of functions that make possible camera interface access, matrix operations, image viewing and manipulation, conversion between formats, sharpening, blurring, morphological operations, thresholding, various transforms (Canny, Laplace, convolution, DFT, histogram equalization), contour finding, contour matching, segmentation, motion tracking, camera calibration, machine learning, and deep learning. In this study, OpenCV version 4 was used for development.

2.2.1 STEREO CAMERA CALIBRATION AND RECTIFICATION

A camera, in its simplest definition, is a device that maps the 3D world onto a 2D plane, whereas a stereo camera is a set of two cameras mounted on the same fixture looking at a common scene, as shown in Figure 12. The purpose of the stereo camera is to obtain depth information for the objects in the scene.


Figure 12. Stereo camera arrangement (Santoro, 2012)

In order to obtain accurate dimensions of objects in the scene, the translation and rotation relationship of the cameras with respect to each other must be known exactly. Even high-end commercial stereo camera models on the market may have flaws. For instance, image pairs captured with the stereo camera used in this research showed a visible rotational mismatch in both the pitch and yaw axes. Another problem is the distortion of the rays that pass through the lens and fall on the sensor, which depends on the manufacturing quality and focal length of the lens. These distortions also need to be corrected in software for precise image analysis.

Figure 13. Radial distortions ( Sadekar, 2020)

Figure 14. Tangential distortion ( Steward, 2021)

Therefore, the most critical step in stereo image processing is individual and stereo calibration of the cameras. A
calibration that is not done meticulously will undoubtedly lead to erroneous results.


Before detailing the calibration process, it would be useful to state some basics about the camera matrix. A
camera matrix is a 3x4 matrix which describes the mapping from 3D points in the world to 2D points in an image as
shown in figure 15.

Figure 15. Camera model camera and matrix (Kitani, Camera Matrix)

The camera matrix can be expressed as a combination of intrinsic and extrinsic parameters. Intrinsic parameters include the camera's optical center, focal length and lens distortion information. Extrinsic parameters, on the other hand, describe the location and orientation of the camera, more specifically its translation and rotation with reference to the world coordinate system in which the camera resides.

A camera with known intrinsic and extrinsic parameters is a camera whose calibration is known. Images taken with that camera are corrected by the rectification process, which requires the calibration matrices. Figure 16 shows the projection matrix decomposed into intrinsic and extrinsic parameters. Here P stands for the projection matrix, K for the intrinsic parameters, and R and t for the rotation and translation parameters.

Figure 16. Projection matrix obtained from intrinsic and extrinsic parameters (Kitani, Camera Matrix)
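In standard notation consistent with the figure and with the parameter descriptions below, this decomposition can be written as

P = K \, [\, R \mid t \,], \qquad K = \begin{pmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{pmatrix}

where [R | t] is the 3x4 matrix formed from the rotation matrix R and the translation vector t.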


In the intrinsic parameters, f indicates the focal length, and px and py indicate the pixel coordinates of the camera's optical center in the image. In the extrinsic parameters, the elements t1, t2, t3 represent the translation of the camera along the x, y, z axes with reference to the world coordinate system. The elements r1, r2, ..., r9 define the rotation matrix, which performs a rotation in Euclidean space and is the product of the rotation matrices about each axis. More clearly:

R = Z(\theta)\,X(\psi)\,Y(\varphi), \quad\text{where}\quad
Z(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix},\;
X(\psi) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\psi & -\sin\psi \\ 0 & \sin\psi & \cos\psi \end{pmatrix},\;
Y(\varphi) = \begin{pmatrix} \cos\varphi & 0 & \sin\varphi \\ 0 & 1 & 0 \\ -\sin\varphi & 0 & \cos\varphi \end{pmatrix}

Figure 17. Rotation matrix (Abdelhamid, 2011)

In the light of the above information, the calibration of the stereo camera using the OpenCV library was conducted in the following steps:

a) A chessboard image consisting of black and white squares of known sizes is printed out on a paper at 1:1 scale, and
pasted on a rigid board in order to prevent it from bending.

b) At least 12 photos, preferably more, are taken randomly while moving the board to different locations and rotations,
in a way that all chessboard squares remain inside the frame of both cameras.

c) From all images saved by the left camera, the intersection points of the black and white squares are found using OpenCV's findChessboardCorners function, and these are added to a vector.

d) The camera matrix, distortion coefficients, and rotation and translation matrices of the left camera are obtained by passing the vector populated above, together with the vector of the corresponding corner coordinates on the printed board (with the origin at the top-left corner), as arguments to the calibrateCamera function of the OpenCV library.

e) From all images saved by the right camera, the intersection points of the black and white squares are found using OpenCV's findChessboardCorners function, and these are added to a vector.


f) The camera matrix, distortion coefficients, and rotation and translation matrices of the right camera are obtained by passing the vector populated above, together with the vector of the corresponding corner coordinates on the printed board (with the origin at the top-left corner), as arguments to the calibrateCamera function of the OpenCV library.

g) The matrices and parameters obtained from calibrateCamera are passed to the stereoCalibrate function, which outputs the transform between the left and right cameras: a rotation matrix and a translation vector. This function also returns the fundamental matrix (for uncalibrated cameras) and the essential matrix (for calibrated cameras), which are used to compute the corresponding point in the right image from a given point in the left image, and vice versa.

h) Finally, by passing the matrices obtained from stereoCalibrate to the stereoRectify function, the R1 and R2 rectification matrices are obtained, which ensure that corresponding objects lie on the same image row in the images taken with the stereo camera.

i) All matrices obtained from the stereoCalibrate and stereoRectify functions are saved.

After this stage, for each image pair captured by the stereo camera within the application, the initUndistortRectifyMap and remap functions are called in order to obtain corrected images.
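For clarity, the following condensed sketch (not the full program; variable names are illustrative and error handling is omitted) shows how the calibration steps above and the run-time rectification map onto OpenCV calls. The complete implementation used in this study is given in Appendix 1.

#include <opencv2/calib3d.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Condensed sketch of steps a)-i) above plus the run-time rectification.
void calibrateAndRectify(const std::vector<std::vector<cv::Point3f>>& objectPoints,
                         const std::vector<std::vector<cv::Point2f>>& leftPoints,
                         const std::vector<std::vector<cv::Point2f>>& rightPoints,
                         cv::Size imageSize,
                         const cv::Mat& rawLeft, const cv::Mat& rawRight,
                         cv::Mat& rectLeft, cv::Mat& rectRight)
{
    cv::Mat K1, D1, K2, D2, R, E, F, R1, R2, P1, P2, Q;
    cv::Vec3d T;
    std::vector<cv::Mat> rvecs, tvecs;

    // Individual calibration of each camera from the detected chessboard corners.
    cv::calibrateCamera(objectPoints, leftPoints,  imageSize, K1, D1, rvecs, tvecs);
    cv::calibrateCamera(objectPoints, rightPoints, imageSize, K2, D2, rvecs, tvecs);

    // Stereo calibration: rotation R and translation T between the two cameras.
    cv::stereoCalibrate(objectPoints, leftPoints, rightPoints,
                        K1, D1, K2, D2, imageSize, R, T, E, F, cv::CALIB_FIX_INTRINSIC);

    // Rectification transforms that align corresponding rows of the two images.
    cv::stereoRectify(K1, D1, K2, D2, imageSize, R, T, R1, R2, P1, P2, Q);

    // Undistort and rectify a captured pair.
    cv::Mat lmapx, lmapy, rmapx, rmapy;
    cv::initUndistortRectifyMap(K1, D1, R1, P1, imageSize, CV_32F, lmapx, lmapy);
    cv::initUndistortRectifyMap(K2, D2, R2, P2, imageSize, CV_32F, rmapx, rmapy);
    cv::remap(rawLeft,  rectLeft,  lmapx, lmapy, cv::INTER_LINEAR);
    cv::remap(rawRight, rectRight, rmapx, rmapy, cv::INTER_LINEAR);
}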

2.2.2. DEPTH ESTIMATION FROM STEREO IMAGES

Once camera setup is neatly calibrated and captured images are undistorted and remapped as explained in section
2.2.1, depth values of scene objects can be extracted using basic triangulation as long as camera positions and
orientations remain unchanged. Figure 18 shows basic projection geometry in stereo imaging.

Figure 18. Stereo imaging projection geometry (Ortiz, 2018)


According to this representation, Z is the depth of the object to be calculated, B is the distance between the cameras (the baseline), f is the focal length, and CxL, CyL, CxR, CyR are the pixel coordinates of the object's center point in the left and right image planes. The baseline B is given in the camera's technical specification. The focal length f is the lens focal length scaled by the image-size-to-sensor-size ratio, and it is available in the intrinsic parameters obtained as a result of the calibration. The pixel coordinates of the object centers in the image plane are available in the bounding box structure estimated by the Yolo convolutional neural network model.

The top view of Figure 18 is shown in Figure 19 ;

Figure 19. Top view of stereo imaging projection geometry

The following equality is inferred from similar triangles:

\frac{C_{xL} - C_{xR}}{f} = \frac{B}{Z}

From the above equality, the depth Z of an object is obtained in terms of the disparity (CxL - CxR):

Z = \frac{f \cdot B}{C_{xL} - C_{xR}}
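As an illustrative check using this study's own values (the rectified focal length of roughly 775 px from the P1 matrix in Section 4.1 and the 60 mm baseline from the camera specification), a disparity of 100 px would correspond to

Z \approx \frac{775 \times 60\ \text{mm}}{100} \approx 465\ \text{mm}.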


2.2.3. INTEGRATION OF IMAGE PROCESSING AND NEURAL NETWORK

The OpenCV library has come with a deep learning (dnn) module from version 3.3 onwards. Thanks to this module, pre-trained artificial neural network models can be integrated into user software without the need for another library. In this project, the weight and configuration files of the YoloV3 neural network model were run through this module.

The neural network model is loaded with the readNet function in the cv::dnn namespace. This function takes the path to the weight file (.weights) and the path to the model's configuration file (.cfg) as arguments.

The YoloV3 pre-trained set is designed to recognize 80 classes. In output-layer order (1 to 80), these are: person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table, toilet, TV, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair dryer and toothbrush.

The following steps were taken in order to classify objects with the OpenCV dnn module and the pre-trained neural network model (a condensed code sketch follows the list):

a) The paths to the Yolo weights and configuration files downloaded from the Yolo website are passed to the readNet function, which returns an object of type cv::dnn::Net.

b) The inference backend is set to OpenCV by calling the setPreferableBackend(DNN_BACKEND_OPENCV) method on the returned Net object.

c) CPU (or GPU) based operation is selected by calling the setPreferableTarget(DNN_TARGET_CPU) method on the same Net object.

d) An array of strings holding the class names is created and filled in the same order as the neural network output neurons. For this research, the coco.names text file downloaded from the Yolo website was used.

e) Whenever a new image is captured, it is rectified according to the calibration matrices (as explained in Section 2.2.1).

f) The rectified image is converted to a 4-dimensional tensor (NCHW) using the dnn::blobFromImage function.

g) The setInput method is called on the Net object with the blob passed as argument.

h) A vector of type cv::Mat is created to hold the detected bounding boxes, classes and confidences.

i) The forward method is called on the Net object, with the cv::Mat vector and getUnconnectedOutLayersNames() passed as arguments.

j) The cv::Mat vector is filled with the detected bounding boxes, classes and confidences when forward returns.

k) The cv::Mat vector is parsed and all necessary information is extracted.
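The following condensed sketch (with illustrative file paths and thresholds, not the exact values used in this study) shows how steps a) to k) map onto the OpenCV dnn API; the full main loop is given in Appendix 2.

#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    // Load the pre-trained model and select backend/target.
    cv::dnn::Net net = cv::dnn::readNet("yolov3.weights", "yolov3.cfg");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_OPENCV);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CPU);

    // Class names in output order (coco.names).
    std::vector<std::string> classes;
    std::ifstream namesFile("coco.names");
    for (std::string line; std::getline(namesFile, line); ) classes.push_back(line);

    // A rectified frame; convert it to a 4D NCHW blob and run inference.
    cv::Mat image = cv::imread("left_rectified.png");
    cv::Mat blob;
    cv::dnn::blobFromImage(image, blob, 1.0 / 255.0, cv::Size(416, 416), cv::Scalar(), true, false);
    net.setInput(blob);

    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());

    // Each output row is [bx, by, bw, bh, pc, class scores...] in relative coordinates.
    for (const cv::Mat& out : outs)
        for (int r = 0; r < out.rows; ++r) {
            cv::Mat scores = out.row(r).colRange(5, out.cols);
            cv::Point classId;
            double confidence;
            cv::minMaxLoc(scores, nullptr, &confidence, nullptr, &classId);
            if (confidence > 0.2)
                printf("%s (%.0f%%)\n", classes[classId.x].c_str(), 100 * confidence);
        }
    return 0;
}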

3. MATERIAL AND METHOD

Within the scope of this work, a natively compiled Linux application was developed in C++. The graphical user interface was made with the Qt library. The images captured from the stereo camera were processed with the OpenCV library and its deep neural network module. Detected objects and their distances to the stereo camera are reported in a data grid.

3.1. EQUIPMENT
The following equipment was used for the application:

1x Nvidia Jetson Nano AI computer
1x Sony IMX219-83 based stereo camera
2x CSI interface cable
1x Tripod
1x Mounting plate

3.1.1. Nvidia Jetson Nano Computer

GPU: 128-core Maxwell
CPU: 4-core ARM A57 @ 1.43 GHz
Operating system: Linux
RAM: 4 GB 64-bit LPDDR4, 25.6 GB/s
Video encode: 4K @ 30 | 4x 1080p @ 30 | 9x 720p @ 30 (H.264/H.265)
Video decode: 4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 18x 720p @ 30 (H.264/H.265)
Camera: 2x MIPI CSI-2 DPHY lanes
Connectivity: Gigabit Ethernet, M.2 Key E
USB: 4x USB 3.0, USB 2.0 Micro-B
Other: GPIO, I2C, I2S, SPI, UART
Dimensions: 100 mm x 80 mm x 29 mm

Figure 20. Jetson Nano Computer


3.1.2. Sony IMX219-83 based stereo camera

Figure 21. Stereo Camera

Sensor: Sony IMX219


Resolution: 3280 × 2464 (per camera)
Lens specifications:
CMOS size: 1/4inch
Focal Length: 2.6mm
Angle of View: 83/73/50 degree (diagonal/horizontal/vertical)
Distortion: <1%
Baseline Length: 60mm
ICM20948:
Accelerometer:
Resolution: 16-bit
Measuring Range (configurable): ±2, ±4, ±8, ±16g
Operating Current: 68.9uA
Gyroscope:
Resolution: 16-bit
Measuring Range (configurable): ±250, ±500, ±1000, ±2000°/sec
Operating Current: 1.23mA
Magnetometer:
Resolution: 16-bit
Measuring Range: ±4900μT
Operating Current: 90uA
Dimension: 24mm × 85mm


3.1.3. CSI Camera interface cable

Figure 22. Stereo Camera CSI interface cable

3.1.4. Tripod and mounting plate

Figure 23. Tripod and mounting plate


3.1.5. Experimental environment

Figure 24. Experimental environment

Figure 25. A scene from experiment


3.2. METHOD

The flowchart of the software is as shown in figure 26.

Figure 26. Software flow diagram



4. RESULTS

4.1. CALIBRATION RESULTS

The calibration menu implemented in the application provides the complete calibration workflow, as well as saving and retrieving calibration data for later use. The images in Figure 28 were taken using the application.

Figure 27. Calibration menu

Left camera calibration images Right camera calibration images

Figure 28. Images taken for calibration


The corners of the chessboard images are detected correctly, as shown in Figure 29.

Figure 29. Detected corners during calibration

The matrices obtained as a result of the calibration process are shown below:

LEFT CAMERA - CAMERA AND DISTORTION MATRICES

K: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 5.8166088991359027e+02, 0., 3.3570997024361776e+02, 0.,
7.7387600203886643e+02, 2.8015027751270867e+02, 0., 0., 1. ]
D: !!opencv-matrix
rows: 1 cols: 5 dt: d
data: [ -1.6160937401860759e-01, 9.0319323291590248e-01,
-1.3781131824954437e-03, -4.5921517824351485e-04,
-1.9278208853746575e+00 ]
board_width: 9
board_height: 6
square_size: 2.

RIGHT CAMERA - CAMERA AND DISTORTION MATRICES

K: !!opencv-matrix
rows: 3 cols: 3 dt: d


data: [ 5.8249657630589957e+02, 0., 3.3435659791968590e+02, 0.,
7.7591331524355746e+02, 2.4632547239438091e+02, 0., 0., 1. ]
D: !!opencv-matrix
rows: 1 cols: 5 dt: d
data: [ -1.4842077372742965e-01, 6.4051178026583100e-01,
1.4737778247401337e-04, -9.1167971051625096e-04,
-9.5386051458102916e-01 ]
board_width: 9
board_height: 6
square_size: 2.

STEREO CALIBRATION - K1, K2, D1, D2, R, T, E, F, R1, R2, P1, P2, Q MATRICES
K1: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 5.8166088991359027e+02, 0., 3.3570997024361776e+02, 0.,
7.7387600203886643e+02, 2.8015027751270867e+02, 0., 0., 1. ]
K2: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 5.8249657630589957e+02, 0., 3.3435659791968590e+02, 0.,
7.7591331524355746e+02, 2.4632547239438091e+02, 0., 0., 1. ]
D1: !!opencv-matrix
rows: 1 cols: 5 dt: d
data: [ -1.6160937401860759e-01, 9.0319323291590248e-01,
-1.3781131824954437e-03, -4.5921517824351485e-04,
-1.9278208853746575e+00 ]
D2: !!opencv-matrix
rows: 1 cols: 5 dt: d
data: [ -1.4842077372742965e-01, 6.4051178026583100e-01,
1.4737778247401337e-04, -9.1167971051625096e-04,
-9.5386051458102916e-01 ]
R: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.9992804293714921e-01, -2.4373835298358540e-03,
-1.1745982692441316e-02, 2.4212550560881551e-03,
9.9999610668890793e-01, -1.3871304840331348e-03,
1.1749317930672128e-02, 1.3585906502149082e-03,
9.9993005143340352e-01 ]
T: [ -6.0163964562455252e+00, -1.3622847988624018e-02,
-1.0461782337464186e-01 ]
E: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.3247261663228259e-05, 1.0459890819100304e-01,
-1.3767013661910412e-02, -3.3921740621952183e-02,
8.4288137330619066e-03, 6.0172044570800143e+00,
-9.4536261062024300e-04, -6.0164062366477848e+00,
8.1855131917909548e-03 ]
F: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.1465732738861130e-10, 7.7116616449344381e-07,
-2.9489689694408959e-04, -2.4979321189411534e-07,


4.6651696740234350e-08, 2.5843916854448444e-02,
5.5823112325540603e-05, -2.6106886873434109e-02, 1. ]
R1: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.9998409756045259e-01, -1.4945817662057851e-04,
5.6375782443052764e-03, 1.5334986050427035e-04,
9.9999975027098931e-01, -6.8988533769186189e-04,
-5.6374737274337934e-03, 6.9073888866935033e-04,
9.9998387075480388e-01 ]
R2: !!opencv-matrix
rows: 3 cols: 3 dt: d
data: [ 9.9984628703230816e-01, 2.2639389008504455e-03,
1.7386111939144289e-02, -2.2759407140120580e-03,
9.9999718522097703e-01, 6.7055498599785006e-04,
-1.7384544905563154e-02, -7.1002167302160191e-04,
9.9984862527667173e-01 ]
P1: !!opencv-matrix
rows: 3 cols: 4 dt: d
data: [ 7.7489465864121189e+02, 0., 3.2709444046020508e+02, 0., 0.,
7.7489465864121189e+02, 2.6338606071472168e+02, 0., 0., 0., 1.,
0. ]
P2: !!opencv-matrix
rows: 3 cols: 4 dt: d
data: [ 7.7489465864121189e+02, 0., 3.2709444046020508e+02,
-4.6627902095334048e+03, 0., 7.7489465864121189e+02,
2.6338606071472168e+02, 0., 0., 0., 1., 0. ]
Q: !!opencv-matrix
rows: 4 cols: 4 dt: d
data: [ 1., 0., 0., -3.2709444046020508e+02, 0., 1., 0.,
-2.6338606071472168e+02, 0., 0., 0., 7.7489465864121189e+02, 0.,
0., 1.6618690179474188e-01, 0. ]

4.2. MEASUREMENT RESULTS

A wall clock, a computer keyboard, a tennis racket and a bottle were used as the objects to be detected in the experiments. Image capture was performed live in video format. While recording the measurements, the clock and the computer keyboard were moved from near to far, the soda bottle from far to near, and the tennis racket was moved back and forth by small distances. Screenshots of the application showing the measurement records are given below.


Table 1. Measurements taken while wall clock moves away from camera

Table 2. Measurements taken while keyboard moves away from stereo camera


Table 3. Measurements taken while bottle moves towards stereo camera

During the experiment, the bottle was classified as a vase many times. However, because the bounding boxes were correctly located, these classification errors did not affect the measurement values.

Table 4. Measurements taken while tennis racket moves back and forth


In order to evaluate measurement accuracy, another experiment was performed with the wall clock and the tennis racket kept fixed. In this scene, the actual distance from the wall clock to the camera was 600 mm, and from the tennis racket to the camera 1200 mm.

Table 5. Measurements taken when tennis racket and wall clock stay steady


5. DISCUSSION AND CONCLUSION

In this project, titled "Multiple Object Recognition and Depth Estimation from Stereo Images", the aim was to recognize multiple objects in the scene using a pre-trained convolutional neural network model and to calculate the perpendicular distances of these objects to the stereo camera setup. The steps taken in the software can be summarized as recognizing objects in the image of one of the cameras using the convolutional neural network model, finding the same content inside their bounding boxes in the other camera's image by template matching, and calculating the depth using the disparity information.

In the experiments carried out without calibration, it was observed that objects in the right and left images were at different heights on the vertical axis and were slightly rotated relative to each other. This error is thought to be caused by minor differences between the cameras in the geometrical relationship between lens and sensor, or by the sensors not being mounted perfectly on the PCB. Looking at the screenshots in the measurement results section, black bands with inclined inner edges are noticeable at the bottom of the left image and at the top of the right image. Based on this observation, it can be seen that the calibration process shifted the left image up, shifted the right image down, and equalized the horizon lines by rotating the two images in opposite directions with respect to each other. As a result, it is possible to say that the calibration process was successful.

In the experiments in which the objects were moved, the depth value changed in accordance with the direction of the movement. Measurement errors while the objects were stationary were below +/- 4%. At the time this research was completed, the OpenCV library pre-installed on the Nvidia Jetson Nano was not compiled with GPU support, therefore the neural network had to be run on the CPU. Even so, a frame rate of 2.5 fps was achieved in a program cycle consisting of image capture from two cameras in live video mode, image rectification, classification with the Yolo neural network, image matching, and user interface update.


RESOURCES

Abdelhamid, M. (2011), Extracting Depth Information From Stereo Vision System Using a Correlation and a Feature Based Methods. Clemson University TigerPrints. Web address: "https://tigerprints.clemson.edu/all_theses/1216", Access date: 18/4/2022

Amidi, A., Amidi, S., Convolutional Neural Networks cheatsheet. Web address: "https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks", Access date: 19/4/2022

Bhatt, D. (2021), A Comprehensive Guide for Camera Calibration in Computer Vision. Data Science Blogathon. Web address: "https://www.analyticsvidhya.com/blog/2021/10/a-comprehensive-guide-for-camera-calibration-in-computer-vision", Access date: 18/4/2022

Bradski, G., Kaehler, A. (2018), Learning OpenCV. O'Reilly Media, Inc, ISBN: 978-0-596-51613-0, 370-454.

Convolutional Neural Network. Web address: "https://www.mathworks.com/discovery/convolutional-neural-network-matlab.html", Access date: 19/4/2022

Eby, M. (2020), Kernelled Connections: The Perceptron as Diagram. Web address: "https://tripleampersand.org/kernelled-connections-perceptron-diagram", Access date: 18/4/2022

Kitani, K., Camera Matrix. Web address: "http://www.cs.cmu.edu/~16385/s17/Slides/11.1_Camera_matrix.pdf", Access date: 18/4/2022

Maj, M. (2018), Object Detection and Image Classification with YOLO. Appsilon Science. Web address: "https://www.kdnuggets.com/2018/09/object-detection-image-classification-yolo.html", Access date: 19/4/2022

Ortiz, L.E., Cabrera, E.V., Gonçalves, L.M. (2018), Depth Data Error Modeling of the ZED 3D Vision Sensor from Stereolabs. Electronic Letters on Computer Vision and Image Analysis, DOI: 10.5565/rev/elcvia.1084, 4-7.

Professor-John-McCarthy-shows-off-computer-chess-in-1966-at-Stanford-University. Web address: "https://www.researchgate.net/figure/Professor-John-McCarthy-shows-off-computer-chess-in-1966-at-Stanford-University_fig1_354343066", Access date: 18/4/2022

Redmon, J., Divvala, S., Girshick, R., Farhadi, A. (2016), You Only Look Once: Unified, Real-Time Object Detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition, DOI: 10.1109/CVPR.2016.91.

Reynolds, A.H. (2019), Convolutional Neural Networks. Web address: "https://anhreynolds.com/blogs/cnn.html", Access date: 18/4/2022

Sadekar, K. (2020), Understanding Lens Distortion. Web address: "https://learnopencv.com/understanding-lens-distortion", Access date: 19/4/2022

Samarasinghe, S. (2006), Neural Networks for Applied Sciences and Engineering. Auerbach Publications, ISBN: 978-0-8493-3375-0, 17.

Santoro, M., Alregib, G., Altunbasak, Y. (2012), Misalignment correction for depth estimation using stereoscopic 3-D cameras. 2012 IEEE 14th International Workshop on Multimedia Signal Processing, DOI: 10.1109/MMSP.2012.6343409.

Sonmez, D. (2018), Geri Yayılım Algoritması'na Matematiksel Yaklaşım [A Mathematical Approach to the Back-Propagation Algorithm]. Web address: "https://www.derinogrenme.com/2018/06/28/geri-yayilim-algoritmasina-matematiksel-yaklasim", Access date: 18/4/2022

Steward, J. (2021), Camera Modeling: Exploring Distortion and Distortion Models. Web address: "https://www.tangramvision.com/blog/camera-modeling-exploring-distortion-and-distortion-models-part-i", Access date: 19/4/2022

Verma, N.K., Nama, P., Kumar, G., Siddhant, A., Raj, A., Dhar, N.K., Salour, A. (2015), Vision based object follower automated guided vehicle using compressive tracking and stereo-vision. 2015 IEEE Bombay Section Symposium, DOI: 10.1109/IBSS.2015.7456637.

Yann LeCun | Tandon School of Engineering. Web address: "https://engineering.nyu.edu/faculty/yann-lecun", Access date: 18/4/2022

Wolfewicz, A. (2021), Deep learning vs. machine learning – What's the difference?. Web address: "https://levity.ai/blog/difference-machine-learning-deep-learning", Access date: 19/4/2022


APPENDIX

Appx 1. Source code for calibration

#include <opencv2/core.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/highgui.hpp>
#include <opencv2/imgproc.hpp>
#include <stdio.h>
#include <iostream>
#include <QDebug>
#include <sys/stat.h>
using namespace std;
using namespace cv;

//CAMERA CALIBRATION VARIABLES


int board_width, board_height, num_imgs;
float square_size;
char* left_imgs_directory;
char* right_imgs_directory;
char* imgs_filename;
char* out_file_left;
char* out_file_right;
char* extension;
vector<vector<Point3f>> object_points;
vector<vector<Point2f>> imagePoints1, imagePoints2;
vector<Point2f >corners1, corners2;
vector<vector<Point2f>> left_img_points, right_img_points;
Mat img1, img2, gray1, gray2;

//RECTIFICATION & STEREO MAPPING VARIABLES


Mat R1, R2, P1, P2, Q;
Mat K1, K2, R;
Vec3d T;
Mat D1, D2;
Mat lmapx, lmapy, rmapx, rmapy;

double xfocalLength_Left=0;
double xfocalLength_Right=0;
double xprincipalPoint_Left=0;
double xprincipalPoint_Right=0;
bool calibrationCompleted=false;

double computeReprojectionErrors(const vector< vector< Point3f > >& objectPoints,


const vector< vector< Point2f > >& imagePoints,
const vector< Mat >& rvecs, const vector< Mat >& tvecs,
const Mat& cameraMatrix , const Mat& distCoeffs) {
vector< Point2f > imagePointsX;


int i, totalPoints = 0;
double totalErr = 0, err;
vector< float > perViewErrors;
perViewErrors.resize(objectPoints.size());

for (i = 0; i < (int)objectPoints.size(); ++i) {


projectPoints(Mat(objectPoints[i]), rvecs[i], tvecs[i], cameraMatrix,
distCoeffs, imagePointsX);
err = norm(Mat(imagePoints[i]), Mat(imagePointsX), cv::NORM_L2);
int n = (int)objectPoints[i].size();
perViewErrors[i] = (float) std::sqrt(err*err/n);
totalErr += err*err;
totalPoints += n;
}
return std::sqrt(totalErr/totalPoints);
}

void load_image_points(int board_width, int board_height, int num_imgs, float square_size,


char* leftimg_dir, char* rightimg_dir, char* leftimg_filename, char* rightimg_filename, char* extension) {

Size board_size = Size(board_width, board_height);

for (int i = 1; i <= num_imgs; i++) {


char left_img[100], right_img[100];
sprintf(left_img, "%s%s%d.%s", leftimg_dir, leftimg_filename, i, extension);
sprintf(right_img, "%s%s%d.%s", rightimg_dir, rightimg_filename, i, extension);
img1 = imread(left_img, IMREAD_COLOR);
img2 = imread(right_img, IMREAD_COLOR);
cvtColor(img1, gray1, COLOR_BGR2GRAY);
cvtColor(img2, gray2, COLOR_BGR2GRAY);

bool found1 = false, found2 = false;

found1 = cv::findChessboardCorners(img1, board_size, corners1, CALIB_CB_ADAPTIVE_THRESH | CALIB_CB_FILTER_QUADS);


found2 = cv::findChessboardCorners(img2, board_size, corners2, CALIB_CB_ADAPTIVE_THRESH | CALIB_CB_FILTER_QUADS);

if(!found1 || !found2){
cout << "Chessboard find error!" << endl;
cout << "leftImg: " << left_img << " and rightImg: " << right_img <<endl;
continue;
}

if (found1)
{
cornerSubPix(gray1, corners1, cv::Size(5, 5), cv::Size(-1, -1), TermCriteria(TermCriteria::EPS | TermCriteria::MAX_ITER , 30, 0.1));
cv::drawChessboardCorners(gray1, board_size, corners1, found1);
}
if (found2)
{


cornerSubPix(gray2, corners2, cv::Size(5, 5), cv::Size(-1, -1), TermCriteria(TermCriteria::EPS | TermCriteria::MAX_ITER, 30, 0.1));
cv::drawChessboardCorners(gray2, board_size, corners2, found2);
}
vector< Point3f > obj;
for (int i = 0; i < board_height; i++)
for (int j = 0; j < board_width; j++)
obj.push_back(Point3f((float)j * square_size, (float)i * square_size, 0));

if (found1 && found2) {


cout << i << ". Found corners!" << endl;
imagePoints1.push_back(corners1);
imagePoints2.push_back(corners2);
object_points.push_back(obj);
}
}
for (int i = 0; i < imagePoints1.size(); i++) {
vector< Point2f > v1, v2;
for (int j = 0; j < imagePoints1[i].size(); j++) {
v1.push_back(Point2f((double)imagePoints1[i][j].x, (double)imagePoints1[i][j].y));
v2.push_back(Point2f((double)imagePoints2[i][j].x, (double)imagePoints2[i][j].y));
}
left_img_points.push_back(v1);
right_img_points.push_back(v2);
}
}

int calibrate(char* leftcalib_file, char* rightcalib_file, char* leftimg_dir, char* rightimg_dir, char* leftimg_filename, char* rightimg_filename, char*
extension, char* outfile_stereo, int num_imgs)
{
board_width=9;
board_height=6;
square_size=2.0f;
load_image_points(board_width, board_height, num_imgs, square_size,
leftimg_dir, rightimg_dir, leftimg_filename, rightimg_filename, extension);

qDebug()<< "Left Camera indivitual calibration";

// Use the global K1, K2, D1, D2 declared at the top of this file (instead of local
// copies) so that the calibration results remain available to GetRectifiedImages().
vector<Mat> rvecs1, tvecs1 , rvecs2, tvecs2;
int flag = 0;
flag |= CALIB_FIX_K4;
flag |= CALIB_FIX_K5;

calibrateCamera(object_points, imagePoints1, img1.size(), K1, D1, rvecs1, tvecs1, flag);


calibrateCamera(object_points, imagePoints2, img2.size(), K2, D2, rvecs2, tvecs2, flag);

qDebug() << "Calibration error Left: " << computeReprojectionErrors(object_points, imagePoints1, rvecs1, tvecs1, K1, D1) << endl;
qDebug() << "Calibration error Right: " << computeReprojectionErrors(object_points, imagePoints2, rvecs2, tvecs2, K2, D2) << endl;


FileStorage fs1(leftcalib_file, FileStorage::WRITE);


fs1 << "K" << K1;
fs1 << "D" << D1;
fs1 << "board_width" << board_width;
fs1 << "board_height" << board_height;
fs1 << "square_size" << square_size;
printf("Done Calibration\n");

FileStorage fs2(rightcalib_file, FileStorage::WRITE);


fs2 << "K" << K2;
fs2 << "D" << D2;
fs2 << "board_width" << board_width;
fs2 << "board_height" << board_height;
fs2 << "square_size" << square_size;
printf("Done single camera Calibration\n");

printf("Starting stereo calibration\n");


Mat E, F; // R and T are the global variables declared above, so they stay available for rectification
flag = 0;
flag |= CALIB_FIX_INTRINSIC;

stereoCalibrate(object_points, left_img_points, right_img_points, K1, D1, K2, D2, img1.size(), R, T, E, F, flag);

cv::FileStorage fss(outfile_stereo, cv::FileStorage::WRITE);


fss << "K1" << K1;
fss << "K2" << K2;
fss << "D1" << D1;
fss << "D2" << D2;
fss << "R" << R;
fss << "T" << T;
fss << "E" << E;
fss << "F" << F;

printf("Done Calibration\n");
printf("Starting Rectification\n");

// R1, R2, P1, P2, Q are the global rectification matrices declared at the top of this file,
// reused later by GetRectifiedImages().


stereoRectify(K1, D1, K2, D2, img1.size(), R, T, R1, R2, P1, P2, Q);

fss << "R1" << R1;


fss << "R2" << R2;
fss << "P1" << P1;
fss << "P2" << P2;
fss << "Q" << Q;

xfocalLength_Left= K1.at<double>(0,0);
xfocalLength_Right= K2.at<double>(0,0);


xprincipalPoint_Left=K1.at<double>(0,2);
xprincipalPoint_Right=K2.at<double>(0,2);
calibrationCompleted=true;

printf("Done Rectification\n");

return 0;
}

void doNothing() { }

void GetRectifiedImages(Mat& inputLeft,Mat& inputRight,Mat& outputLeft,Mat& outputRight)


{
initUndistortRectifyMap(K1, D1, R1, P1, inputLeft.size(), CV_32F, lmapx, lmapy);
initUndistortRectifyMap(K2, D2, R2, P2, inputRight.size(), CV_32F, rmapx, rmapy);
remap(inputLeft, outputLeft, lmapx, lmapy, cv::INTER_LINEAR);
remap(inputRight,outputRight, rmapx, rmapy, cv::INTER_LINEAR);
}

void LoadCalibSettings(char* infile_stereo)


{
cv::FileStorage fs1(infile_stereo, cv::FileStorage::READ);
fs1["K1"] >> K1;
fs1["K2"] >> K2;
fs1["D1"] >> D1;
fs1["D2"] >> D2;
fs1["R"] >> R;
fs1["T"] >> T;

fs1["R1"] >> R1;


fs1["R2"] >> R2;
fs1["P1"] >> P1;
fs1["P2"] >> P2;
fs1["Q"] >> Q;

xfocalLength_Left= K1.at<double>(0,0);
xfocalLength_Right= K2.at<double>(0,0);
xprincipalPoint_Left=K1.at<double>(0,2);
xprincipalPoint_Right=K2.at<double>(0,2);

calibrationCompleted=true;

qDebug() << xfocalLength_Left << " " << xprincipalPoint_Left;


qDebug() << xfocalLength_Right << " " << xprincipalPoint_Right;
}


Appx 2. Main loop source code (Object recognition on left image, template matching in right
image, depth calculation)

void MainWindow::on_single_shot_requested()
{
//SetCameraEnvironment();

if(capLeft.read(leftImage) && capRight.read(rightImage))


{
if (calibrationCompleted) GetRectifiedImages(leftImage,rightImage,leftImage,rightImage);

cv::Mat blob;
cv::dnn::blobFromImage(leftImage,blob,OneDiv255,cv::Size(320,320),cv::Scalar(),true,false); //(416,416)
network.setInput(blob);
std::vector<cv::Mat> outs;

network.forward(outs,network.getUnconnectedOutLayersNames());

std::vector<int> classIds;
std::vector<float> confidences;
std::vector<cv::Rect> boxes;
std::vector<int> centersX; // for suppression of multiple detections of the same object

for (size_t i=0; i<outs.size();i++)


{
float* data=(float*)outs[i].data;
for (int j=0;j<outs[i].rows;j++ , data+=outs[i].cols)
{
cv::Mat scores=outs[i].row(j).colRange(5,outs[i].cols);
cv::Point classIdPoint;
double confidence;
cv::minMaxLoc(scores,0,&confidence,0,&classIdPoint);

if(confidence>0.2)
{
qDebug()<< confidence << classIdPoint.x << classes[classIdPoint.x].data();
int centerX = (int)(data[0] * leftImage.cols);
int centerY = (int)(data[1] * leftImage.rows);
int width = (int)(data[2] * leftImage.cols);
int height = (int)(data[3] * leftImage.rows);
int left = centerX - (width / 2);
int top = centerY - (height / 2);

left= left>=0 ? left : 0;


top= top>=0 ? top : 0;
width= (left+width<=leftImage.cols) ? width : leftImage.cols-left;
height= (top+height<=leftImage.rows) ? height : leftImage.rows-top;


for(auto x : centersX)
{
float k=(float)centerX/(float)x;

if (k>=0.95f && k<=1.05f) goto X001;


}
centersX.push_back(centerX);
qDebug() << data[2] << "x" << data[3];
qDebug() << "left="<< left << " top=" << top << " width=" << width << " height=" << height << " " << left+width << "x" << top+height << "
centerx="<< centerX << " centery=" << centerY;
qDebug() << leftImage.cols << "x" << leftImage.rows;

Rect region(left,top,width,height);

Mat cropped=leftImage(region);
Mat outputMatch;
double minVal,maxVal;
Point minLoc,maxLoc;

matchTemplate(rightImage,cropped,outputMatch,TM_CCORR_NORMED);
normalize(outputMatch, outputMatch,0,1,NORM_MINMAX,-1,Mat());
minMaxLoc(outputMatch,&minVal,&maxVal,&minLoc,&maxLoc,Mat());

rectangle(rightImage,maxLoc,Point(maxLoc.x+width,maxLoc.y+height),cv::Scalar(0,255,0),2,8,0);
int centerX2=maxLoc.x+width/2;

classIds.push_back(classIdPoint.x);
confidences.push_back((float)confidence);
boxes.push_back(cv::Rect(left, top, width, height));
cv::rectangle(leftImage, cv::Rect(left, top, width, height), cv::Scalar(0, 255, 0), 2, 8, 0);

double disparity=0;
double depth=0;

if (ui->checkBox->checkState())
{
currentData=classes[classIdPoint.x].data();
currentData[0]=currentData[0].toUpper();
tableModel->setData(tableModel->index(currentRow,0),currentData);

currentData=QString::number(100*confidence) +"%";
tableModel->setData(tableModel->index(currentRow,1),currentData);

tableModel->setData(tableModel->index(currentRow,2),QString::number(centerX));
tableModel->setData(tableModel->index(currentRow,3),QString::number(centerX2));
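// The disparity below is the horizontal offset between the matched object centers,
// compensated for the two principal points; the depth computed further down follows
// Z = f * B / disparity, with f = xfocalLength_Left (in pixels) and the 60 mm baseline B.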

disparity=(centerX-xprincipalPoint_Left)-(centerX2-xprincipalPoint_Right);
tableModel->setData(tableModel->index(currentRow,4),QString::number(disparity));


depth= xfocalLength_Left*60/disparity;
tableModel->setData(tableModel->index(currentRow,5),QString::number(depth));

currentRow++;
}

QPixmap mapLeft=MatToPixmap(leftImage);
ui->label_6->setPixmap(mapLeft);
QPixmap mapRight=MatToPixmap(rightImage);
ui->label_7->setPixmap(mapRight);
QApplication::processEvents();

}
X001:
doNothing();
}
}
}

