Deep Learning Based Monocular Depth Estimation For Object Distance Inference in 2D Images

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR1431

Deep Learning Based Monocular


Depth Estimation for Object Distance
Inference in 2D Images
G. Victor Daniel1 (Assistant Professor); Koneru Gnana Shritej2; Kosari Hemanth Sai3; Sunkara Namith4
1
Department of Artificial Intelligence, Anurag University, Hyderabad, India
2,3,4
U.G. Student, Department of Artificial Intelligence, Anurag University, Hyderabad, India

Abstract:- Monocular depth estimation, a process of transformer-based architectures, have demonstrated


predicting depth from a single 2D image, has seen remarkable capabilities in extracting intricate features from
significant advancements due to the proliferation of deep images, enabling significant advancements in monocular
learning techniques. This research focuses on leveraging depth estimation. These models can learn to infer depth by
deep learning for monocular depth estimation to infer recognizing patterns and contextual cues within the image,
object distances accurately in 2D images. We explore such as shading, texture gradients, and object relationships.
various convolutional neural network (CNN) The primary objective of this research is to investigate and
architectures and transformer models to analyze their compare the performance of various deep learning models in
efficacy in predicting depth information. Our approach the context of monocular depth estimation. We aim to
involves training these models on extensive datasets determine how different architectures and training strategies
annotated with depth information, followed by rigorous impact the accuracy and reliability of depth predictions. To
evaluation using standard metrics. The results achieve this, we utilize large-scale datasets annotated with
demonstrate substantial improvements in depth depth information, enabling the models to learn and
estimation accuracy, highlighting the potential of deep generalize effectively. This study also explores the
learning in enhancing computer vision tasks such as importance of training data diversity and augmentation
autonomous driving, augmented reality, and robotic techniques in enhancing model performance. By varying the
navigation. This study not only underscores the datasets and introducing different augmentation strategies,
importance of model architecture but also investigates the we seek to understand how these factors contribute to the
impact of training data diversity and augmentation robustness of depth estimation models. In the following
strategies. The findings provide a comprehensive sections, we provide a comprehensive review of related work,
understanding of the current state-of-the-art in detailing the evolution of monocular depth estimation
monocular depth estimation, paving the way for future techniques and the role of deep learning in this domain. We
innovations in object distance inference from 2D images. then describe our experimental setup, including the datasets
By providing a detailed analysis of various models and used, model architectures, and evaluation metrics. The results
their performance, this research contributes to a better section presents a detailed analysis of model performance,
understanding of monocular depth estimation and its highlighting key findings and insights. Finally, we discuss the
potential for real-world applications, paving the way for implications of our results for future research and practical
future advancements in object distance inference from 2D applications, and conclude with a summary of our
images. contributions and potential directions for further study.

Keywords:- Monocular Depth Estimation, Deep Learning, II. LITERATURE SURVEY


Convolutional Neural Network (CNN), Computer Vision,
Augmented Reality, Robotic Navigation. Masoumian et al. [1] conducted a comprehensive review
of monocular depth estimation using deep learning. The
I. INTRODUCTION authors discussed the advancements in this field and
highlighted the potential of deep learning models in
Monocular depth estimation, the task of determining accurately estimating depth from single images.
depth information from a single 2D image, is a fundamental
problem in computer vision with wide-ranging applications Höllein et al. [2] introduced Text2Room, a method for
in fields such as autonomous driving, augmented reality, and extracting textured 3D meshes from 2D text-to-image
robotics. Traditionally, depth estimation relied on stereo models. While not directly related to depth estimation, this
vision or multiple camera setups, which can be cost- work highlighted the significance of 3D representation for
prohibitive and complex to implement. However, the advent scene understanding, which is closely tied to monocular depth
of deep learning has opened new avenues for solving this estimation.
problem using a single camera, making it more feasible for a
variety of applications. Deep learning models, particularly
convolutional neural networks (CNNs) and more recently,

IJISRT24APR1431 www.ijisrt.com 3096


Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR1431

Wang et al. [3] proposed a monocular 3D object A. Existing Systems


detection framework with depth from motion. The study Monocular depth estimation has been a subject of
demonstrated the potential of leveraging motion cues to extensive research over the years, with various approaches
improve depth estimation, indicating a direction for future developed to tackle the challenge of inferring depth from a
research in incorporating dynamic information for depth single 2D image. The existing systems can be broadly
inference. categorized into traditional methods and deep learning-based
methods.
Lian et al. [4] proposed MonoJSG, a joint semantic and
geometric cost volume for monocular 3D object detection.  Traditional Methods
This work highlighted the synergy between semantic and
geometric information in depth estimation, suggesting a  Structure from Motion (SfM):
multi-modal approach for enhancing depth prediction SfM techniques reconstruct 3D structures by analyzing
accuracy. the motion of objects across multiple frames of a video. By
tracking feature points across these frames, the relative
Sharma et al. [5] conducted a review of deep learning- motion between the camera and the objects can be used to
based human activity recognition on benchmark video estimate depth. While effective, these methods require
datasets. Although the focus was on activity recognition, the multiple images and are computationally intensive.
review shed light on the potential of leveraging temporal
information for depth estimation, offering a direction for  Shape from Shading (SfS):
future research in spatiotemporal modeling. SfS methods infer depth by analyzing the shading
patterns in an image, assuming known lighting conditions.
Samant et al. [6] presented a framework for deep These methods rely on the reflectance properties of surfaces
learning-based language models using multi-task learning in and often require complex optimization techniques to resolve
natural language understanding. While seemingly unrelated, ambiguities in depth perception.
this work provided insights into multi-task learning
paradigms, which could be adapted for jointly learning depth  Stereo Vision:
estimation along with related vision tasks. Stereo vision involves using two or more cameras to
capture different perspectives of the same scene. The
Chen et al. [7] discussed representation learning in disparity between the images is then used to compute depth.
multi-view clustering. The study emphasized the importance Although stereo vision can provide accurate depth estimates,
of holistic scene understanding through multi-view it necessitates precise camera calibration and
information, advocating for the integration of multi-view synchronization, increasing system complexity and cost.
cues in monocular depth estimation for comprehensive spatial
perception.  Deep Learning-Based Methods

III. PROBLEM STATEMENT  Convolutional Neural Networks (CNNs):


CNNs have been widely used for monocular depth
Accurately estimating depth from single 2D images estimation due to their ability to capture spatial hierarchies
using deep learning techniques is a fundamental challenge in and learn complex features. Pioneering works like Eigen et
computer vision with significant implications for various al.'s multi-scale deep network laid the foundation by
real-world applications. Traditional depth estimation predicting depth at multiple scales to capture both global and
methods, often reliant on stereo vision or multi-camera local features.
setups, pose inherent limitations in terms of complexity, cost,
and scalability. These constraints hinder the widespread  Encoder-Decoder Networks:
adoption of depth estimation technology in domains such as These networks, such as U-Net and Fully Convolutional
autonomous navigation, augmented reality, and robotics. Networks (FCNs), encode the input image into a latent
Addressing these challenges requires the development of a representation and then decode it to produce a dense depth
deep learning-based monocular depth estimation system map. They have shown significant improvements in depth
capable of achieving high accuracy, real-time performance, estimation accuracy.
and robustness across diverse environmental conditions.
Additionally, there is a pressing need for resource-efficient  Vision Transformers (ViTs):
models suitable for deployment on resource-constrained ViTs have been applied to depth estimation tasks,
platforms, such as embedded systems or mobile devices. demonstrating their ability to achieve competitive
Furthermore, ensuring the generalization of trained models to performance by capturing both local and global features in
unseen data and their adaptability to novel environments is the image.
critical for practical deployment in real-world scenarios.
 Hybrid CNN-Transformer Models:
These models combine the strengths of CNNs (for local
feature extraction) and transformers (for global context
modeling), resulting in robust depth estimation performance.

IJISRT24APR1431 www.ijisrt.com 3097


Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR1431

B. Proposed System The proposed methodology involves several key steps


The proposed system leverages deep learning-based to achieve real-time object distance inference using deep
object detection for real-time monocular depth estimation to learning-based monocular depth estimation. The steps are as
infer object distances using a single 2D image from a follows:
webcam. The system integrates the YOLO (You Only Look
Once) object detection model with a web-based interface,  Model Initialization:
enabling real-time monitoring and interaction. The key The system initializes the YOLOv8 model, which has
components of the system include the YOLOv8 model for been pre-trained on a large dataset to recognize a variety of
object detection, a live video capture module using OpenCV, objects. The YOLOv8 model is selected for its balance
and a Flask-based web application for displaying the between speed and accuracy, making it suitable for real-time
processed video feed. The proposed system is designed to applications.
provide accurate and efficient object distance inference,
addressing the limitations of traditional depth estimation  Video Capture:
methods that often require stereo vision or multiple cameras. The system uses Open CV to access the default webcam
By using a single monocular camera, the system simplifies (denoted as 0) and capture live video frames. The video
the hardware requirements and broadens the range of capture runs continuously, providing a real-time feed to the
potential applications, including surveillance, autonomous object detection pipeline.
navigation, and augmented reality.
 Object Detection:
IV. PROPOSED METHODOLOGY Each frame captured from the webcam is processed by
the YOLO model to detect objects. The model outputs
bounding boxes, class labels, and confidence scores for the
detected objects. This step leverages the YOLOv8 model's
capability to perform rapid and accurate object detection.

 Distance Estimation:
For each detected object, the system calculates an
approximate distance based on the size of the bounding box
relative to the frame dimensions. The width of the bounding
box is used as an inverse indicator of the distance to the
camera. The approximate distance is computed using a
heuristic approach:

apx_distance=(1-width/frame_width)2

This approach assumes that larger objects in the frame


are closer to the camera, providing a simple yet effective
means of distance estimation.

 Frame Annotation:
The detected objects are annotated on the video frame
with bounding boxes, class labels, confidence scores, and
estimated distances. This information is overlaid on the video
feed, enabling real-time visualization of the detected objects
and their distances.

 Web Application:
A Flask web application serves the annotated video feed
to users. The web interface allows users to start and stop the
webcam feed and view the list of detected objects along with
their estimated distances. This interface provides an
accessible and interactive means of monitoring the system's
output.

 Thread Synchronization:
Threading is employed to handle video capture and
object detection concurrently, ensuring that the web interface
remains responsive. A threading lock is used to synchronize
access to shared resources, such as the list of detected objects,
to prevent race conditions and ensure data consistency.
Fig 1 Flowchart

IJISRT24APR1431 www.ijisrt.com 3098


Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR1431

 REST API Endpoints:


The web application includes several REST API
endpoints:

 /video_feed: Streams the processed video feed with object


annotations.

 /toggle_webcam: Toggles the webcam feed on or off.

 /show_results: Returns a JSON response containing the


list of detected objects and their estimated distances.

These endpoints enable dynamic interaction with the


system, allowing users to control the webcam feed and access
detection results programmatically. Fig 4 Sample Output 3

V. RESULTS

The developed system was rigorously tested to evaluate


its performance in real-time object detection and distance
estimation using a monocular camera. The results
demonstrate the system's capability to accurately detect
objects and infer their distances, which are displayed through
a user-friendly web interface. The primary interface of the
system displays the live video feed from the webcam with
real-time annotations for detected objects. Each detected
object is enclosed in a bounding box, and relevant
information such as the object class and confidence score are
displayed. The approximate distance to each object,
Fig 5 Sample Output 4
calculated based on the size of the bounding box relative to
the frame dimensions, is also overlaid on the video feed. This
VI. CONCLUSION
setup provides immediate visual feedback on object detection
and distance estimation, making it useful for applications like
In this research, we have developed a deep learning-
surveillance and autonomous navigation.
based system for monocular depth estimation, enabling
accurate object distance inference from single 2D images.
The proposed system utilizes the YOLOv8 model for real-
time object detection, integrated with a robust methodology
for estimating distances based on bounding box dimensions.
By leveraging a single monocular camera, the system offers
a cost-effective and scalable solution suitable for various
applications, including autonomous navigation, augmented
reality, and surveillance. The experimental results
demonstrate the system's effectiveness in real-time scenarios,
showcasing its ability to detect multiple objects and
accurately estimate their distances. The user-friendly web
interface enhances accessibility and usability, providing clear
Fig 2 Sample Output 1 visual and textual feedback on detected objects and their
distances. This dual-mode presentation ensures that users can
easily interpret and utilize the information for practical
applications. Several key challenges were addressed in the
development of this system, including the need for high
accuracy, real-time performance, robustness across different
environments, and resource efficiency. The system's ability
to generalize well to diverse datasets and conditions
highlights its potential for deployment in a wide range of real-
world scenarios. Future work will focus on refining the
distance estimation algorithms, exploring the integration of
additional sensors to enhance accuracy, and optimizing the
system for deployment on mobile and embedded devices.
Additionally, extending the system to handle more complex
Fig 3 Sample Output 2

IJISRT24APR1431 www.ijisrt.com 3099


Volume 9, Issue 4, April – 2024 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24APR1431

scenes and dynamic environments will be a critical area of


further research. In conclusion, this research advances the
state-of-the-art in monocular depth estimation, providing a
viable solution for real-time object distance inference from
2D images. The developed system has significant potential to
enhance various applications in computer vision, contributing
to the development of smarter, more responsive technologies
in numerous fields.

REFERENCES

[1]. Masoumian, Armin., Rashwan, Hatem A.., Cristiano,


Julián., Asif, M. Salman., & Puig, D.. (2022).
Monocular Depth Estimation Using Deep Learning:
A Review. Sensors (Basel, Switzerland), 22.
http://doi.org/10.3390/s22145353
[2]. Höllein, Lukas., Cao, Ang., Owens, Andrew.,
Johnson, Justin., & Nießner, M.. (2023). Text2Room:
Extracting Textured 3D Meshes from 2D Text-to-
Image Models. 2023 IEEE/CVF International
Conference on Computer Vision (ICCV), 7875-7886.
http://doi.org/10.1109/ICCV51070.2023.00727
[3]. Wang, Tai., Pang, Jiangmiao., & Lin, Dahua. (2022).
Monocular 3D Object Detection with Depth from
Motion. ArXiv, abs/2207.12988. http://doi.org/
10.48550/arXiv.2207.12988
[4]. Lian, Qing., Li, Peiliang., & Chen, Xiaozhi. (2022).
MonoJSG: Joint Semantic and Geometric Cost
Volume for Monocular 3D Object Detection. 2022
IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 1060-1069.
http://doi.org/10.1109/CVPR52688.2022.00114
[5]. Sharma, Vijeta., Gupta, Manjari., Pandey, A.., Mishra,
Deepti., & Kumar, Ajai. (2022). A Review of Deep
Learning-based Human Activity Recognition on
Benchmark Video Datasets. Applied Artificial
Intelligence, 36. http://doi.org/10.1080/08839514.
2022.2093705
[6]. Samant, R.., Bachute, M.., Gite, Shilpa., & Kotecha,
K.. (2022). Framework for Deep Learning-Based
Language Models Using Multi-Task Learning in
Natural Language Understanding: A Systematic
Literature Review and Future Directions. IEEE
Access, 10, 17078-17097. http://doi.org/10.1109/
ACCESS.2022.3149798
[7]. Chen, Mansheng., Lin, Jia-Qi., Li, Xiang-Long., Liu,
Bao-Yu., Wang, Changdong., Huang, Dong., & Lai,
J.. (2022). Representation Learning in Multi-view
Clustering: A Literature Review. Data Science and
Engineering, 7, 225-241. http://doi.org/10.1007/s
41019-022-00190-8

IJISRT24APR1431 www.ijisrt.com 3100

You might also like