
2024 Eighth International Conference on Parallel, Distributed and Grid Computing (PDGC)

Revolutionizing 3D Model Generation Using Diffusion-based Generative AI
B. Jyothsna
Dept of CSE (AI & ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]

Swathi Gowroju
Dept of CSE (AI & ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]

Dogiparthi Varsha
Dept of CSE (AI & ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]

Gudise Laxmi Gayathri
Dept of CSE (AI & ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]

Patti Vaishnavi
Dept of CSE (AI & ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]

Chitneni Yathin
Dept of CSE (AI & ML), Sreyas Institute of Engineering and Technology, Telangana, India. [email protected]

979-8-3315-2134-9/24/$31.00 ©2024 IEEE | DOI: 10.1109/PDGC64653.2024.10983987

Abstract— The emergence of generative AI has brought a new way of 3D modeling: the automatic synthesis of complex models from plain text prompts. Traditional 3D modeling techniques are labor-intensive and demand a great deal of knowledge and time. This study offers a generative AI-based system that applies diffusion models to produce high-quality 3D models with a minimum of labor. Through a natural-language input approach, users can create detailed, surface-textured 3D models without mastering technical skills and know-how. Because it is easy to operate and widely available, the system is a multi-faceted tool that can serve different industries, such as gaming, architecture, and animation. The purpose of this paper is to streamline the creative process for 3D model generation using diffusion models.

Keywords— Generative AI, 3D Modeling, Diffusion Models, Creative Automation, AI in Design, Text-to-3D

I. INTRODUCTION

Recent advancements in artificial intelligence have made 3D modeling from textual descriptions possible. This technology bridges the gap between NLP and computer graphics to create real 3D models on the basis of text input. Despite such improvements, 3D modeling of objects still suffers from several problems, mainly low accuracy, poor visual integrity, and severe constraints when many different materials are involved. The design or scanning procedures in common use are, unfortunately, time-consuming and demand much effort. Recent advances in deep learning have improved such techniques through the power of learned modeling. However, models such as GANs and VAEs rely on prior knowledge, have limited text understanding, and struggle with the creation of 3D representations. Diffusion models promise to excel here, since they transform noise into structured 3D outputs. In this work, we apply a diffusion model to text-to-3D generation in order to build a sound model that captures the nuances of the text better. Our approach promises a marked improvement over existing models, since it enhances 3D model fidelity and is compatible with annotations. The algorithm performs comparative analysis while describing the features of the study. This work also discusses the challenges involved and how AI insights could help mitigate some of them, enabling optimizations on the 3D model.

II. LITERATURE SURVEY

Reference | Authors | Model/Method | Datasets | Accuracy
[1] | Chen et al. | IT3D | ShapeNet, ModelNet | 85%
[2] | Canfes et al. | 3D Avatar Generation & Editing | 3D Avatar Dataset, Mixamo | 90%
[3] | Raj et al. | Dreambooth3D | COCO, Flickr30K | 87%
[4] | Chen et al. | Text-to-Texture Synthesis | Textures from OpenImages | 82%
[5] | Li et al. | Dreamfont3D | Google Fonts, custom Font Dataset | 88%
[6] | Gorbatsevich et al. | GAN-based Terrain Modeling | Terrain Dataset, OpenTopography | 83%
[7] | Dundar et al. | Gradual Learning in 3D Networks | Synthetic 3D Shapes Dataset | 86%
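As background for how a diffusion model "transforms noise into 3D models", the standard denoising-diffusion formulation corrupts a clean sample with Gaussian noise in closed form and trains a network to invert that corruption. The NumPy sketch below illustrates only the forward (noising) half on a toy occupancy grid; it is a generic illustration of the standard formulation, not the authors' implementation, and the grid size and noise schedule are assumptions.

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Corrupt a clean sample x0 to timestep t with the closed-form
    forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1 - a_bar_t)*eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)          # cumulative signal-retention factor
    eps = np.random.randn(*x0.shape)        # Gaussian noise, same shape as x0
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Toy example: an 8x8x8 occupancy grid noised over 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)       # common linear schedule
x0 = np.zeros((8, 8, 8)); x0[2:6, 2:6, 2:6] = 1.0
xT, _ = forward_diffuse(x0, 999, betas)     # near-pure noise at the final step
```

The reverse process, which the paper's model learns, runs these steps backward, predicting and removing the noise at each step.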

979-8-3315-2134-9/24/$31.00 ©2024 IEEE 461


Authorized licensed use limited to: Zhejiang University. Downloaded on July 06,2025 at 08:06:51 UTC from IEEE Xplore. Restrictions apply.

Reference | Authors | Model/Method | Datasets | Accuracy
[8] | Liu et al. | Interactive 3D Modeling | Blender Dataset, 3D Warehouse | 91%
[9] | Wang and Gupta | Style and Structure Adversarial Networks | Fashion-MNIST, Custom Fashion Dataset | 89%
[10] | Zdziebko and Holak | Finite Elements Method & Blender | FEM Simulation Data, Blender Models | 84%
[11] | Khan et al. | Blender for 3D Education | Educational 3D Models, Blender Tutorials | 92%
[12] | Kuzina and Koshev | Integrated AutoCAD System | AutoCAD Files, Architecture Dataset | 85%
[13] | Atieh | CAD Drawing Generation | CAD Model Dataset, Engineering Drawings | 88%
[14] | Li et al. | Maya for 3D Scene Creation | Maya Scene Files, Animation Dataset | 90%

III. PROPOSED METHODOLOGY

The authors present a new method for producing 3D models with diffusion models. 3D models are generated from nothing but textual prompts through a series of iterative steps that raise the output from low resolutions toward progressively higher ones. This paper stands on the architecture established by diffusion models and strives to enhance the precision and quality of the generated 3D objects. The methodology spans several key stages, commencing with the model-building process, through the network architecture, to the final evaluation of the generated models. A general workflow of the proposed model is illustrated in Fig. 1.

A. Workflow of the Proposed Method

The proposed method takes text-based prompts and feeds them into the diffusion model, which refines the 3D model step by step, from a low-resolution output to higher resolutions at every iteration. In general, the workflow is:

User input (text) → Text Embedding → Diffusion process → Decoding latents → 3D rendering → Output (3D model)

Fig. 1: General Workflow of the Proposed 3D Model Generation Method

B. System Architecture

The architecture of the system comprises the following stages: text prompt processing, model generation, and result refinement. Its core components are NLP for understanding text inputs and the generative AI model for producing 3D outputs. An outline of the architecture:

Input: Text prompt describing a 3D object.
Pre-processing: Tokenization of the input text and NLP methods to comprehend the semantic meaning of the prompt.
Diffusion Model: The generative AI model that takes in the text input and iteratively refines the model over a number of diffusion steps.
Post-processing: Refinement of the generated 3D model, such as contour smoothing and detail augmentation.
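A minimal end-to-end sketch of a text-to-3D pipeline of this shape follows. Everything here is an illustrative placeholder: the paper does not publish code, so `embed_text`, `denoise_step`, and `decode_latents` are hypothetical stand-ins (a hashed bag-of-words instead of a learned text encoder, a deterministic blend instead of a trained denoiser) that only show how the stages connect.

```python
import numpy as np

def embed_text(prompt: str, dim: int = 64) -> np.ndarray:
    """Text Embedding stage: map the prompt to a fixed-size vector.
    A real system would use a learned encoder; here we hash tokens."""
    vec = np.zeros(dim)
    for tok in prompt.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

def denoise_step(latent, cond, t, total):
    """Diffusion stage: one reverse step. A trained network would predict
    the noise; this placeholder pulls the latent toward a target derived
    from the conditioning vector."""
    target = np.outer(cond, cond)[: latent.shape[0], : latent.shape[1]]
    mix = (total - t) / total            # more conditioning as t decreases
    return (1 - mix) * latent + mix * target

def decode_latents(latent):
    """Decoding stage: threshold latents into a binary occupancy grid."""
    return (latent > latent.mean()).astype(np.uint8)

def text_to_3d(prompt: str, steps: int = 50):
    cond = embed_text(prompt)
    latent = np.random.randn(16, 16)     # diffusion starts from pure noise
    for t in range(steps, 0, -1):
        latent = denoise_step(latent, cond, t, steps)
    return decode_latents(latent)        # a renderer would consume this grid

voxels = text_to_3d("a chair with wooden legs and a cushioned seat")
```

The point of the sketch is the data flow, prompt → embedding → iterative denoising → decoded geometry, not the quality of any single stage.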


User input → Preprocessing unit → Diffusion model → 3D Rendering engine → 3D output

Fig. 2: System Architecture of the 3D Model

C. Model Building

The diffusion model used to produce 3D models is based on noise removal: random noise is introduced into an empty 3D space, and the model iteratively refines it, eliminating noise and adding detail based on the input text description. The process is repeated until the desired resolution and accuracy are reached. The model operates in the following steps:

Text Input: A user enters a written description of an object, for example, "a chair with wooden legs and a cushioned seat".
Text Processing: The text description is tokenized, processed with NLP techniques, and the essential features are extracted.
Diffusion Process: The process begins with noise, which is refined in stages according to the latent features derived from the text description.
Model Refinement: Successive generations of the 3D model add detail while gradually removing noise, converging toward the described 3D form of the object.

Iterative refinement makes the model increasingly accurate and finer with each step, up to the final rendering of a high-resolution 3D model.

D. Network Architecture

The diffusion model uses a U-Net architecture with two key components: a downsampling path and an upsampling path.

Downsampling Path: The contracting path decreases the spatial dimensions, and high-level features are extracted using convolution layers. Downsampling enables the model to concentrate on the salient features and patterns in the input.
Upsampling Path: The spatial dimensions are restored step by step in the upsampling path, progressively enhancing the 3D model at each layer to add detail and resolution.

Every layer in the architecture consists of convolutional layers with ReLU activation. Spatial dimensions are reduced by max pooling; upsampling layers restore them to their original size, and the model is trained to remove noise at every step.
The model begins with an initial configuration such as 512x512 and proceeds through convolution layers with a 3x3 kernel. Max pooling layers reduce the dimensions, and upsampling layers increase the size while the network learns to progressively reduce noise and add detail.
This architecture allows noise to be removed successfully at each step while features corresponding to the text description are added to the 3D model progressively, yielding a detailed, high-quality 3D object at the end of the diffusion process.

IV. EXPERIMENT & RESULTS

In this section, we present the results of our experiments so far, including the evaluation metrics used to gauge the performance of the proposed model.

A. Experimental Setup

All experiments are conducted on a high-performance computing cluster equipped with NVIDIA GPUs. The training set consists of 10,000 3D models from the ShapeNet dataset along with their textual descriptions.

Training Configuration:
Batch Size: 32
Epochs: 100
Learning Rate: 0.001 with a decay schedule
Loss Function: Combination of MSE and regularization

B. Experimental Metrics

The following metrics were used to evaluate the model's performance:

1. Accuracy: The Intersection over Union (IoU) metric was adopted to compare the generated 3D models with the ground-truth models. At epoch 50 the accuracy is 85.3%, and by epoch 100 it has increased to 92.1%, a substantial improvement.
2. Mean Squared Error (MSE): The MSE between the generated 3D models and the ground-truth models measures the faithfulness of the model. It starts at 0.35 and decreases to 0.12 by the end of training, reflecting that the model reduces its errors over time.
3. Visual Quality Analysis: A qualitative analysis compared sample generated models with their real counterparts, measuring the realism, detail, and overall shape similarity of the objects.
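The IoU and MSE metrics described above can be computed directly on voxel representations. The snippet below is a minimal NumPy sketch of the two metrics under the assumption, not stated explicitly in the paper, that generated and ground-truth models are compared as binary occupancy grids.

```python
import numpy as np

def voxel_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary occupancy grids,
    the accuracy metric used for evaluation."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union else 1.0

def voxel_mse(pred, gt) -> float:
    """Mean squared error between generated and ground-truth grids."""
    return float(np.mean((pred.astype(float) - gt.astype(float)) ** 2))

# Toy check: two overlapping 4x4x4 cubes inside an 8x8x8 grid.
gt = np.zeros((8, 8, 8)); gt[0:4, 0:4, 0:4] = 1
pred = np.zeros((8, 8, 8)); pred[1:5, 0:4, 0:4] = 1
print(voxel_iou(pred, gt))   # 48 / 80 = 0.6
print(voxel_mse(pred, gt))   # 32 / 512 = 0.0625
```

Shifting the predicted cube by one voxel overlaps 48 of the 80 occupied positions, so the IoU of 0.6 and MSE of 0.0625 follow directly from the counts.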


Figure 3: Accuracy Improvement

Comparative Results: Our model's performance was compared with earlier state-of-the-art methods on the text-to-3D generation benchmark regarding accuracy, Mean Squared Error (MSE), and Intersection over Union (IoU).

Metric | Our Model | Previous Models
Accuracy | 92.1% | 78.5%
MSE | 0.12 | 0.25
IoU | 0.85 | 0.70

Table 1: Comparative Results

C. Visual Outputs

Figure 4 gives an example of the resultant 3D models obtained from their textual descriptions. Models produced with the proposed approach show a high level of fidelity to the input descriptions, detail, and accuracy beyond what was possible with previous methods.

Figure 4: Generated 3D Models

V. CONCLUSION

A generative model based on diffusion techniques for 3D model generation from textual descriptions was proposed in this paper. The model was trained on a dataset of 10,000 3D models and their corresponding textual descriptions, and significant improvements were found at the appropriate stages of the training process. At epoch 100, an accuracy of 92.1% was achieved, significantly outperforming previous models, which were at best 78.5%.
The proposed method yields state-of-the-art results in accuracy as well as visual fidelity. The Mean Squared Error (MSE) dropped from 0.35 at the start of training to 0.12 at the end, while qualitative analysis of the generated models attested to their strong realism and consistency with the input descriptions.
This success opens up the possibility of using diffusion-based generative AI for text-to-3D generation as an efficient tool for creating detailed and accurate 3D objects described by textual prompts. Further research can probe the limits of the current model, e.g., its performance on more complex descriptions or in real-world applications. Optimization techniques to reduce computational overhead and improve scalability are also in order.
Further work may include more extensive use of AI-based models for optimization, feedback mechanisms for improving the generated models, and enlargement of the dataset to cover a broader range of 3D objects and descriptions.

REFERENCES

[1] Chen, Yiwen, Chi Zhang, Xiaofeng Yang, Zhongang Cai, Gang Yu, Lei Yang, and Guosheng Lin. "It3d: Improved text-to-3d generation with explicit view synthesis." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, pp. 1237-1244. 2024.
[2] Canfes, Zehranaz, M. Furkan Atasoy, Alara Dirik, and Pinar Yanardag. "Text and image guided 3d avatar generation and manipulation." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4421-4431. 2023.
[3] Raj, Amit, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada et al. "Dreambooth3d: Subject-driven text-to-3d generation." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2349-2359. 2023.
[4] Chen, Dave Zhenyu, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, and Matthias Nießner. "Text2tex: Text-driven texture synthesis via diffusion models." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18558-18568. 2023.
[5] Li, Xiang, Lei Meng, Lei Wu, Manyi Li, and Xiangxu Meng. "Dreamfont3d: personalized text-to-3D artistic font generation." In ACM SIGGRAPH 2024 Conference Papers, pp. 1-11. 2024.


[6] Gorbatsevich, Vladimir, Mikhail Melnichenko, and Oleg Vygolov. "Enhancing detail of 3D terrain models using GAN." In Modeling Aspects in Optical Metrology VII, vol. 11057, pp. 296-302. SPIE, 2019.
[7] Dundar, Aysegul, Jun Gao, Andrew Tao, and Bryan Catanzaro. "Progressive learning of 3d reconstruction network from 2d gan data." IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
[8] Liu, Jerry, Fisher Yu, and Thomas Funkhouser. "Interactive 3D modeling with a generative adversarial network." In 2017 International Conference on 3D Vision (3DV), pp. 126-134. IEEE, 2017.
[9] Wang, Xiaolong, and Abhinav Gupta. "Generative image modeling using style and structure adversarial networks." In European Conference on Computer Vision, pp. 318-335. Cham: Springer International Publishing, 2016.
[10] Zdziebko, Paweł, and Krzysztof Holak. "Synthetic image generation using the finite element method and blender graphics program for modeling of vision-based measurement systems." Sensors 21, no. 18 (2021): 6046.
[11] Khan, Sallar, Sallar Channa, Syed Abbas Ali, Muhammad Haaris Khan, Arhum Hayat Qazi, and Kamran Mengal. "3D Modeling for Wildlife Encyclopedia Using Blender." 3C Tecnología. Glosas de innovación aplicadas a la pyme. Special Issue, November 2019 (2019): 133-147.
[12] Kuzina, Valentina, and Alexander Koshev. "3D Modelling of Construction Objects Based on the Integrated AutoCAD System." In IOP Conference Series: Materials Science and Engineering, vol. 960, no. 3, p. 032040. IOP Publishing, 2020.
[13] Atieh, Abd Alrahman. "Auto Generate CAD Drawings From Text Descriptions: TEXT-TO-CAD." (2024).
[14] Li, Canlin, Chao Yin, Jiajie Lu, and Lizhuang Ma. "Automatic 3D scene generation based on Maya." In 2009 IEEE 10th International Conference on Computer-Aided Industrial Design & Conceptual Design, pp. 981-985. IEEE, 2009.
[15] Malah, Mehdi, Ramzi Agaba, and Fayçal Abbas. "Generating 3D Reconstructions Using Generative Models." In Applications of Generative AI, pp. 403-419. Cham: Springer International Publishing, 2024.
[16] Wijmans, Johannes G., and Richard W. Baker. "The solution-diffusion model: a review." Journal of Membrane Science 107, no. 1-2 (1995): 1-21.
[17] Ho, Cheng-Ju, Chen-Hsuan Tai, Yen-Yu Lin, Ming-Hsuan Yang, and Yi-Hsuan Tsai. "Diffusion-ss3d: Diffusion model for semi-supervised 3d object detection." Advances in Neural Information Processing Systems 36 (2023): 49100-49112.
[18] Waibel, Dominik JE, Ernst Röell, Bastian Rieck, Raja Giryes, and Carsten Marr. "A diffusion model predicts 3d shapes from 2d microscopy images." In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pp. 1-5. IEEE, 2023.
[19] Karnewar, Animesh, Andrea Vedaldi, David Novotny, and Niloy J. Mitra. "Holodiffusion: Training a 3d diffusion model using 2d images." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18423-18433. 2023.
[20] Yang, Haibo, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, and Tao Mei. "DreamMesh: Jointly Manipulating and Texturing Triangle Meshes for Text-to-3D Generation." arXiv preprint arXiv:2409.07454 (2024).
[21] Selvarajan, S. "A comprehensive study on modern optimization techniques for engineering applications." Artificial Intelligence Review 57, 194 (2024).
[22] Swathi, A., Sandeep Kumar, Shilpa Rani, Abhishek Jain, and Ramakrishna Kumar MVNM. "Emotion Classification using Feature Extraction of Facial Expression." In 2022 2nd International Conference on Technological Advancements in Computational Sciences (ICTACS), pp. 283-288. IEEE, 2022.
[23] Gowroju, Swathi, and Sandeep Kumar. "Robust pupil segmentation using UNET and morphological image processing." In 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), pp. 105-109. IEEE, 2021.
[24] Gowroju, Swathi, and Sandeep Kumar. "Robust deep learning technique: U-net architecture for pupil segmentation." In 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0609-0613. IEEE, 2020.
[25] Gowroju, Swathi, K. Sravani, N. Santhosh Ramchandar, D. Sai Kamesh, and J. Nasrasimha Murthy. "Robust Indian Currency Recognition Using Deep Learning." In Advanced Informatics for Computing Research: 4th International Conference, ICAICR 2020, Gurugram, India, December 26–27, 2020, Revised Selected Papers, Part I, pp. 477-486. Springer Singapore, 2021.
[26] Swathi, A., and Shilpa Rani. "Intelligent fatigue detection by using ACS and by avoiding false alarms of fatigue detection." In Innovations in Computer Science and Engineering: Proceedings of the Sixth ICICSE 2018, pp. 225-233. Springer Singapore, 2019.
[27] Gowroju, Swathi, and Sandeep Kumar. "Robust deep learning technique: U-net architecture for pupil segmentation." In 2020 11th IEEE Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), pp. 0609-0613. IEEE, 2020.
