2D to 3D Reconstruction
3D imaging is also widely known as stereoscopy. The technique creates or enhances the illusion of depth in a 2D image by exploiting binocular vision. Almost all stereoscopic methods are based on two images, one from a left view and one from a right view. These two images are then combined to give the impression of a 3D view that includes depth. 3D television is a major recent milestone of visual media, and in recent years researchers have focused on developing algorithms for acquiring images and converting them into 3D models using depth analysis. The third dimension can usually be perceived only through vision.
The eyes perceive depth, and the brain reconstructs the third dimension from the various views the eyes provide. Researchers have used this strategy to reconstruct 3D models from multiple views with the help of disparity and calibration parameters. Nowadays, special cameras can capture a 3D model of a scene directly; examples include stereoscopic dual cameras and depth-range (RGB-D) cameras. These cameras typically capture the RGB components of the image together with a corresponding depth map. The depth map is a function that gives the depth of the object at each point (i.e., at each pixel position); the pixel intensity is usually interpreted as the depth value.
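To make the depth-map definition concrete, the sketch below back-projects a depth map into a 3D point cloud under a pinhole camera model. It is a minimal illustration, not taken from any particular system; the intrinsics (fx, fy, cx, cy) and the sample depth values are hypothetical.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into a 3D point cloud.

    Each pixel (u, v) with depth z maps to the camera-space point
    (x, y, z) via the pinhole model: x = (u - cx) * z / fx, and
    y = (v - cy) * z / fy.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no depth reading

# Example: a synthetic 4x4 depth map with hypothetical intrinsics.
depth = np.full((4, 4), 2.0)            # every pixel 2 m away
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(cloud.shape)                       # (16, 3)
```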
3D reconstruction is one of the most complex problems tackled with deep learning systems. There has been a great deal of research in this field, drawing on computer vision, computer graphics and machine learning, but for a long time with little success. More recently, convolutional neural networks (CNNs) have entered the field and have yielded some success.
Background Study
Recovering the dimension lost during image acquisition with a normal camera has been a hot research area in computer vision for more than a decade. The literature shows that the research methodology has changed over time. More precisely, we can divide the conversion of 2D images into reconstructed 3D models into three generations. The first generation learns the 3D-to-2D image projection process from mathematical and geometrical information using an analytical or algorithmic solution. These solutions usually require multiple images captured with specially calibrated cameras. For example, using multiple views of an object taken at constant angular increments covering the full 360 degrees, we can compute geometrical points on the object and, using triangulation techniques, join these points into a 3D model. The second generation of 2D-to-3D conversion utilizes accurately segmented 2D silhouettes. This generation produces reasonable 3D models, but it requires specially designed, calibrated cameras to capture images of the same object from many different angles. Such techniques are not very practical because of the complex image-capturing setup they demand.
Humans can infer the shape of an object using prior knowledge about similar objects and can predict what an object will look like from another, unseen viewpoint. Computer-vision techniques inspired by human vision aim to convert 2D images to 3D models in the same way. With the availability of large-scale data sets, deep learning research has evolved toward 3D reconstruction from a single 2D image. A deep-belief-network-based model was proposed to learn a 3D model from a single 2D image; it is considered one of the earlier neural-network-based, data-driven models to reproduce a 3D model.
Analysis
Scene reconstruction and modelling are two major tasks of 2D and 3D computer vision. Reconstruction offers us an exact observation of the 2-dimensional and 3-dimensional world, whereas modelling allows us to perceive it accurately. Both tasks have always been active areas of research due to their wide range of potential applications, such as scene representation, scene understanding, and robot navigation.
For a moving 2D-3D camera setup, a 3D reconstruction of the scene can be obtained by registering a sequence of point clouds with the help of Visual Odometry (VO) measurements. However, VO-based registration is valid only for the static parts of the scene, so such a reconstruction suffers from several visual artifacts caused by the dynamic parts. In this regard, recent work by Jiang et al. [4–6] categorizes the scene into static and dynamic parts before performing VO. Their method focuses on improving the VO measurements, and their attempted dynamic-object reconstruction is rather preliminary and naive. In this work, we focus on high-quality reconstruction of the dynamic objects themselves.
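The VO-based registration described above can be sketched as follows: each frame's point cloud is transformed into the world frame by its VO pose, keeping only the points labelled static. This is a toy illustration under assumed inputs, not the method of Jiang et al.; the per-point static masks are hypothetical.

```python
import numpy as np

def register_sequence(clouds, poses, static_masks):
    """Fuse a sequence of per-frame point clouds into one world-frame cloud.

    clouds       : list of (N_i, 3) arrays in each camera's local frame.
    poses        : list of 4x4 camera-to-world transforms from VO.
    static_masks : list of (N_i,) booleans marking points on static scene
                   parts; dynamic points are skipped, since VO poses are
                   only valid for the static scene (as noted above).
    """
    fused = []
    for cloud, T, mask in zip(clouds, poses, static_masks):
        pts = cloud[mask]                                  # keep static points
        homo = np.hstack([pts, np.ones((len(pts), 1))])    # to homogeneous
        fused.append((homo @ T.T)[:, :3])                  # into world frame
    return np.vstack(fused)

# Toy example: two frames, the second camera shifted 1 m along x.
cloud0 = np.array([[0.0, 0.0, 2.0]])
cloud1 = np.array([[-1.0, 0.0, 2.0]])          # same world point, new view
T0 = np.eye(4)
T1 = np.eye(4); T1[0, 3] = 1.0                 # VO: camera moved +1 m in x
world = register_sequence([cloud0, cloud1], [T0, T1],
                          [np.array([True]), np.array([True])])
print(world)                                    # both rows ~ [0, 0, 2]
```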
Fig. 1: Moving-car reconstruction from a mobile platform. Top: selected frames of a moving car. Middle: the registered sparse point cloud, the smoothed point cloud, and the reconstructed mesh of the point cloud, respectively. Bottom: the fine reconstruction from different views.
3.2 Learning-Based Reconstruction
Learning-based reconstruction approaches utilize data-driven volumetric 3D model synthesis. The research community has leveraged improvements in deep learning to enable efficient modeling of a 3D model from a 2D image. With the availability of large-scale data sets such as ShapeNet, most researchers focus on developing a 3D voxelized model from a single 2D image, and various approaches have recently been proposed for this task. One study shows that a 3D morphable shape can be generated from an image of a human face, but it requires many manual interactions and high-quality 3D scanning of the face. Some methods suggest learning a 3D shape model from key points or silhouettes. In other studies, a depth map is first estimated from the single image using machine-learning techniques, and a 3D model is then constructed from the resulting RGB-D image.
A convolutional neural network (CNN) has recently become a popular way to predict geometry directly from a single image using an encoder–decoder architecture: the encoder extracts features from the single image, while the decoder generates the model from those features. In another study, deep-CNN-based models were learned in which a single input image is directly mapped to an output 3D representation in a single step. The authors of another study proposed a 3D recurrent-reconstruction-neural-network (RRNN) technique, in which the 3D model is generated in steps from a 2D image input. Some studies used a 2D image along with depth information as input to a 3D U-Net architecture. For 3D appearance rendering, Groueix et al. used a convolutional encoder–decoder architecture to generate a 3D scene from a single input image. Haoqiang et al. then further improved the quality of the generated 3D scene by incorporating a differentiable appearance-sampling mechanism.
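As a concrete (and deliberately simplified) example of such an encoder–decoder design, the PyTorch sketch below maps a single RGB image to a 32³ voxel occupancy grid. The layer sizes are assumptions made for illustration and do not reproduce any specific published network.

```python
import torch
import torch.nn as nn

class Image2Voxel(nn.Module):
    """Minimal encoder-decoder: one RGB image in, a 32^3 occupancy grid out."""

    def __init__(self, latent_dim=128):
        super().__init__()
        # 2D encoder: 64x64 RGB image -> latent feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )
        # 3D decoder: latent vector -> 32^3 voxel occupancy probabilities.
        self.fc = nn.Linear(latent_dim, 128 * 4 * 4 * 4)
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),               # 16 -> 32
            nn.Sigmoid(),
        )

    def forward(self, img):
        z = self.encoder(img)
        vol = self.fc(z).view(-1, 128, 4, 4, 4)
        return self.decoder(vol)

model = Image2Voxel()
voxels = model(torch.randn(1, 3, 64, 64))   # one dummy 64x64 RGB image
print(voxels.shape)                          # torch.Size([1, 1, 32, 32, 32])
```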
3.3 Object Reconstruction
In order to use the captured hand motion for 3D reconstruction, we have to infer the contact points with the object; this is described in Section 3.3.1. The reconstruction process based on the estimated hand poses and the inferred contact points is then described.
3.3.1 Contact Points Computation
In order to compute the contact points, we use the high-resolution mesh of the hand that was used for hand motion capture. To this end, we compute, for each vertex associated with each end-effector, the distance to the closest point of the object point cloud $D_o$. We first count, for each end-effector, the number of vertices whose closest distance is less than 1 mm. If an end-effector has more than 40 candidate contact vertices, it is labeled as a contact bone and all vertices of the bone are labeled as contact vertices. If fewer than 2 end-effectors are selected, we iteratively increase the distance threshold by 0.5 mm until we have at least two end-effectors. In our experiments, we observed that the threshold barely exceeds 2.5 mm. As a result, we obtain for each frame pair the set of contact correspondences $(X_{\text{hand}}, X'_{\text{hand}}) \in C_{\text{hand}}(\theta, D_h)$, where $(X_{\text{hand}}, X'_{\text{hand}})$ is a pair of contact vertices in the source and target frames, respectively.
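A minimal sketch of this contact-point search, assuming the hand-mesh vertices are already grouped by end-effector and using a k-d tree for nearest-neighbour distances, might look as follows. The function and variable names are hypothetical, and the toy data merely exercises the threshold-relaxation loop described above.

```python
import numpy as np
from scipy.spatial import cKDTree

def find_contact_bones(end_effector_vertices, object_cloud,
                       init_thresh=0.001, step=0.0005,
                       min_verts=40, min_effectors=2):
    """Label end-effectors in contact with the object.

    end_effector_vertices : dict mapping an end-effector name to its (N, 3)
                            hand-mesh vertices.
    object_cloud          : (M, 3) object point cloud D_o.
    Distances are in metres (1 mm = 0.001). The threshold grows by 0.5 mm
    until at least two end-effectors qualify as contact bones.
    """
    tree = cKDTree(object_cloud)
    # Per-effector distance of every vertex to its nearest object point.
    dists = {name: tree.query(verts)[0]
             for name, verts in end_effector_vertices.items()}
    thresh = init_thresh
    while True:
        contacts = [name for name, d in dists.items()
                    if np.sum(d < thresh) > min_verts]
        if len(contacts) >= min_effectors:
            return contacts, thresh
        thresh += step   # relax the threshold by 0.5 mm and retry

# Toy data: two fingertips touching a flat object patch, one finger far away.
rng = np.random.default_rng(0)
patch = rng.uniform(0, 0.05, size=(2000, 3)); patch[:, 2] = 0.0
fingers = {
    "thumb": rng.uniform(0, 0.05, size=(60, 3)) * [1, 1, 0.0005],
    "index": rng.uniform(0, 0.05, size=(60, 3)) * [1, 1, 0.0005],
    "pinky": rng.uniform(0, 0.05, size=(60, 3)) + [0, 0, 0.05],
}
contacts, used_thresh = find_contact_bones(fingers, patch)
print(contacts, used_thresh)   # typically ['thumb', 'index'] at ~1 mm
```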
3.4 The Main Objective of 3D Object Reconstruction
The aim of developing this deep learning technology is to infer the shape of 3D objects from 2D images. To conduct the experiment, you need the following:
Highly calibrated cameras that photograph the object from various angles.
Large training datasets from which the geometry of the object to be reconstructed can be predicted; these datasets can be collected from a database of images, or they can be sampled from a video.
Using this apparatus and these datasets, you can proceed with 3D reconstruction from 2D data.
The technology used for this purpose operates with the following parameters:
Input
Training uses one or multiple RGB images for which segmented 3D ground truth is available. The input could be one image, multiple images or even a video stream.
Testing is done on the same kinds of input, whose backgrounds may be uniform, cluttered, or both.
Output
The volumetric output can be produced at both high and low resolution, while surface output is generated through parameterisation, template deformation or point clouds. Both direct and intermediate outputs can be produced in this way.
Training used
The architecture used for training is 3D-VAE-GAN, which has an encoder and a decoder, together with TL-Net and a conditional GAN. The testing architecture is 3D-VAE, which has only an encoder and a decoder.
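To illustrate the encoder/decoder and latent-sampling structure, here is a minimal sketch of a generic 3D VAE over voxel grids, with assumed layer sizes. It is not the published 3D-VAE-GAN; the conditional GAN discriminator and the TL-Net component are omitted.

```python
import torch
import torch.nn as nn

class Voxel3DVAE(nn.Module):
    """Minimal 3D VAE: encode a 32^3 voxel grid, decode it back."""

    def __init__(self, latent_dim=64):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv3d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv3d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
        )
        self.mu = nn.Linear(64 * 8 * 8 * 8, latent_dim)
        self.logvar = nn.Linear(64 * 8 * 8 * 8, latent_dim)
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1), nn.Sigmoid()  # 16 -> 32
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        out = self.dec(self.fc(z).view(-1, 64, 8, 8, 8))
        return out, mu, logvar

vae = Voxel3DVAE()
recon, mu, logvar = vae(torch.rand(1, 1, 32, 32, 32))
print(recon.shape)   # torch.Size([1, 1, 32, 32, 32])
```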
3.5 Applications
Given below are some of the places where 3D object reconstruction deep learning systems are used:
They can be used for re-modelling ruins at ancient architectural sites: the rubble or debris of structures can be used to recreate the entire building and give an idea of how it looked in the past.
They can be used in plastic surgery, where organs, the face, limbs or other portions of the body have been damaged and need to be rebuilt.
They can be used in airport security, where concealed shapes can be used to guess whether a person is armed or carrying explosives.
They can also help in completing DNA sequences.
3.6 3D Models
Solid Model
Solid models deliver a 3D digital representation of an object with all the proper geometry. The geometry is correct in the other model types as well, but "solid" means the model represents the object as a whole instead of only its surface: the object cannot be hollow. Much like all the other types, solid models are built from three-dimensional shapes.
You can use a myriad of basic and complex shapes. Those shapes function like building blocks that combine to create a single object: you can add material to the blocks or subtract it from them. Some CAD programs instead use modifiers, starting with one big chunk of solid that is methodically carved out, as if you were physically milling the base material in a workshop.
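This add-and-subtract workflow can be illustrated with signed distance functions (SDFs), where each primitive is a function that is negative inside the solid: union is a pointwise minimum and subtraction a pointwise maximum against the negated shape. The toy sketch below voxelizes a box with a spherical pocket milled out; all shapes and grid sizes are arbitrary.

```python
import numpy as np

# Signed distance functions: negative inside the solid, positive outside.
def sphere(p, center, r):
    return np.linalg.norm(p - center, axis=-1) - r

def box(p, center, half):
    q = np.abs(p - center) - half
    return np.linalg.norm(np.maximum(q, 0), axis=-1) + np.minimum(q.max(axis=-1), 0)

union = lambda a, b: np.minimum(a, b)        # add material
subtract = lambda a, b: np.maximum(a, -b)    # carve material out

# Sample a 64^3 grid and build "box with a spherical pocket milled out".
t = np.linspace(-1, 1, 64)
grid = np.stack(np.meshgrid(t, t, t, indexing="ij"), axis=-1)
solid = subtract(box(grid, np.zeros(3), np.full(3, 0.6)),
                 sphere(grid, np.array([0.0, 0.0, 0.6]), 0.4))
inside = solid < 0                            # boolean occupancy of the part
print(inside.sum(), "of", inside.size, "voxels are solid")
```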
Wireframe Modeling
In cases where the object features many complex curves, wireframe modeling is often the method of choice. The basic shapes that serve as the building blocks of solid models are sometimes too difficult to modify into the desired configuration and dimensions. Wireframe modeling allows for a smoother transition between curved edges in intricate objects. As the complexity increases, however, some drawbacks become more apparent.
Surface Modeling
A step up in terms of detail is the surface model. When seamless integration among the edges and a smooth transition from one vertex to the next are required, you need higher computational power to run the software for building a surface model. Compared to the previous two approaches, surface modeling is more demanding, but only because it has the capability to create just about every shape that would be too difficult to attain with the solid or wireframe methods.
3.7 2D Models
Summary, Conclusion and Recommendation
4.1 Summary
3D object reconstruction is the process of capturing the shape and appearance of real objects. 2D reconstruction, by contrast, is used, for example, to recreate a face from a skull with the help of soft-tissue depth estimates.
4.2 Conclusion
4.3 Recommendation
References
Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1966–1974.
Häne, C.; Tulsiani, S.; Malik, J. Hierarchical surface prediction for 3D object reconstruction. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 412–420.
https://github.com/natowi/3D-Reconstruction-with-Deep-Learning-Methods
https://iaeme.com/MasterAdmin/Journal_uploads/IJCIET/VOLUME_8_ISSUE_12/IJCIET_
https://tongtianta.site/paper/68922