
Integrating Latent Stable Diffusion Models with Virtual Try-On

Anumula Chaitanya Sai Srikar Mukkamala


[email protected] [email protected]

Abstract

The advancement of latent diffusion models, especially in the realm of text-to-image generation, has opened new possibilities in fashion technology. This project aims to explore
the integration of Latent Stable Diffusion models with Virtual Try-On (VTO) systems to
create a sophisticated text-to-image generation model specifically tailored for the fashion
industry. The model will have a semantically rich understanding of textual prompts, enabling
it to generate realistic clothing images based on descriptions. It will be capable of synthesizing
intricate patterns, brand logos, and various garment attributes. The second phase of the
project will focus on merging the generated clothing images with user photos, allowing
a virtual try-on experience by effectively utilizing the robust generative capability of a
pre-trained diffusion model while preserving the clothing details after warping. By bridging
deep learning techniques in text-to-image synthesis and virtual try-on, this project aims to
advance personalized online shopping experiences.

1. Introduction and Problem Motivation


Motivation: The fashion industry has seen rapid growth in the use of AI for enhancing customer
experiences. One significant area is Virtual Try-On (VTO), which allows users to visualize clothing items
on themselves digitally. Simultaneously, text-to-image generation models have reached new heights with
latent diffusion techniques, enabling the creation of high-quality images from textual prompts. This project
seeks to combine these two advancements to create a model that generates fashion images based on textual
descriptions and integrates them into a VTO framework.

Objective: To develop a text-to-image generation model using Latent Stable Diffusion that can
understand and process detailed textual prompts to create realistic clothing images. The project will then
integrate these generated images into a VTO system, allowing users to try on virtually generated clothing
based on their descriptions.

2. Proposed Solution Approach


Text-to-Image Generation with Latent Diffusion Models for the fashion industry.

Model Design:

Develop a latent diffusion model tailored for fashion image generation, capable of understanding
complex clothing descriptions. Enhance the model’s semantic comprehension to handle intricate prompts,
including details like fabric type, patterns (e.g., stripes, polka dots), brand logos, and garment-specific
attributes.

Training Data:



Curate a dataset comprising fashion images labeled with detailed textual descriptions (e.g., "red
cotton dress with white floral patterns and Nike logo").

Preprocess data to align textual descriptions with corresponding fashion images, using CLIP (Contrastive
Language-Image Pre-training) for improved text-image alignment.
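As an illustration, this alignment check could be scripted with the Hugging Face CLIP implementation as in the sketch below; the checkpoint name and the idea of thresholding low-similarity pairs are our assumptions rather than a fixed part of the pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with a matching processor would work.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_alignment_scores(image_paths, captions):
    """Return the cosine similarity between each image and its own caption."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)  # one score per (image, caption) pair

# Pairs below a similarity threshold (the value is an assumption) could be
# re-captioned or dropped before training.
```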

Evaluation:

Measure the realism and accuracy of generated images using metrics such as Inception Score (IS)
and Fréchet Inception Distance (FID). Conduct human evaluations to assess how well the generated clothing
matches the given textual descriptions.
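A minimal sketch of how IS and FID could be computed with the torchmetrics implementations is given below; the image tensors are assumed to be uint8 in NCHW layout, and the random tensors are placeholders for real and generated samples.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Dummy stand-ins for real and generated fashion images (uint8, NCHW, [0, 255]).
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake_images)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```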

Virtual Try-On:

The proposed solution for the virtual try-on task involves using a U-Net architecture to perform
exemplar-based image inpainting, integrating a clothing-agnostic representation of the person with the
clothing image through concatenated features. By incorporating a zero cross-attention mechanism and a
spatial encoder, the model ensures accurate alignment and fine detail preservation of the clothing on the
person’s image during training and inference.
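The sketch below illustrates what such a zero cross-attention block could look like in PyTorch; the feature shapes, normalization, and residual placement are simplifying assumptions and not the exact StableVITON implementation.

```python
import torch
import torch.nn as nn

class ZeroCrossAttentionBlock(nn.Module):
    """Cross-attention from person features (queries) to clothing features
    (keys/values) whose output projection is zero-initialized, so the
    pretrained U-Net is undisturbed at the start of fine-tuning."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.proj_out = nn.Linear(dim, dim)
        # Zero init: the block initially contributes nothing to the U-Net features.
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, person_feat: torch.Tensor, cloth_feat: torch.Tensor) -> torch.Tensor:
        # person_feat: (B, N_person, dim) U-Net features of the person branch
        # cloth_feat:  (B, N_cloth, dim) features from the spatial (clothing) encoder
        q = self.norm_q(person_feat)
        kv = self.norm_kv(cloth_feat)
        attended, _ = self.attn(q, kv, kv)
        return person_feat + self.proj_out(attended)  # residual connection
```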

Evaluation and Metrics: We follow a single-dataset evaluation strategy in which the same dataset is used for both training and evaluation. We use FID and KID scores to quantitatively measure performance.

3. Datasets
To effectively train the models, the following datasets will be utilized:

Fashion Image Dataset:

Description: A large-scale dataset comprising high-resolution images of various clothing items, annotated with detailed textual descriptions. This dataset will include diverse clothing categories, styles,
colors, patterns, and brand logos.

Source: Potential sources include Fashion-MNIST, DeepFashion, and custom datasets compiled from fashion e-commerce platforms, such as the Meesho dataset discussed in class.

Virtual Try On:

We train our model on the VITON-HD dataset and the upper-body images of the DressCode dataset. For evaluation on SHHQ-1.0, we use the first 2,032 images and follow the preprocessing instructions of VITON-HD to obtain the input conditions such as the agnostic maps and the dense pose.
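For concreteness, a minimal PyTorch dataset sketch for loading the paired inputs is given below; the directory names mirror the public VITON-HD layout but should be treated as assumptions.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class VitonPairDataset(Dataset):
    """Loads (person, cloth, agnostic map, dense pose) tuples for try-on training.
    The subfolder names below are assumptions based on the VITON-HD release."""

    def __init__(self, root: str, pairs_file: str, size=(512, 384)):
        self.root = root
        self.to_tensor = transforms.Compose([
            transforms.Resize(size),
            transforms.ToTensor(),
            transforms.Normalize([0.5] * 3, [0.5] * 3),  # map to [-1, 1]
        ])
        with open(os.path.join(root, pairs_file)) as f:
            self.pairs = [line.split() for line in f if line.strip()]

    def __len__(self):
        return len(self.pairs)

    def _load(self, subdir, name):
        path = os.path.join(self.root, subdir, name)
        return self.to_tensor(Image.open(path).convert("RGB"))

    def __getitem__(self, idx):
        person_name, cloth_name = self.pairs[idx]
        return {
            "person": self._load("image", person_name),
            "cloth": self._load("cloth", cloth_name),
            "agnostic": self._load("agnostic-v3.2", person_name),
            "densepose": self._load("image-densepose", person_name),
        }
```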

4. Training of Model
Text-to-Image Generation Model:

Model Architecture: Implement the TokenCompose architecture, focusing on integrating token-level supervision during the finetuning stage to ensure better alignment of the generated images with the
complex textual prompts.

Integration with TokenCompose:

Training the TokenCompose Model: Implement the TokenCompose framework as an enhancement to the existing latent diffusion model. The key features will include:



• Token-Level Supervision: Introduce token-wise consistency terms between the generated image
content and object segmentation maps during the finetuning stage. This will enhance the model’s
ability to maintain consistency between multi-category clothing attributes specified in the textual
prompts.
• Fine-tuning on Stable Diffusion: Fine-tune the Stable Diffusion model using the curated dataset
to improve the quality and photorealism of the generated images. This process will involve:

Data Augmentation: Employ various augmentation techniques to enrich the dataset and
improve model robustness, such as random cropping, color jittering, and adding noise.

Loss Function: Implement a custom loss function that combines traditional pixel-wise losses (e.g., L2 loss) with the token-level consistency loss to optimize the model for both realism and textual adherence (see the sketch after this list).

Training Procedure: Train the model using a batch size of 16, employing an Adam optimizer with
a learning rate of 2e-5. Use mixed precision training to enhance speed and efficiency.

Fine-tuning: Use the curated fashion dataset to fine-tune the pre-trained Stable Diffusion
model. This process will incorporate token-level consistency terms that enhance the model’s ability
to handle multiple object categories effectively.
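The loss function and training procedure described above can be sketched as follows; the model interface (returning cross-attention maps alongside the noise prediction), the form of the token-level term, and its weighting factor are illustrative assumptions, not the TokenCompose reference implementation.

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler

def token_consistency_loss(attn_maps, seg_masks):
    """Hypothetical token-level term: push the cross-attention mass of each
    garment token inside that token's segmentation mask (TokenCompose-style)."""
    attn = attn_maps / (attn_maps.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    return (attn * (1.0 - seg_masks)).sum(dim=(-2, -1)).mean()

def training_step(model, batch, lambda_token=0.1):
    # Standard latent-diffusion objective: predict the added noise (L2 / MSE) ...
    noise_pred, attn_maps = model(batch["noisy_latents"], batch["timesteps"],
                                  batch["text_embeddings"])
    denoise_loss = F.mse_loss(noise_pred, batch["noise"])
    # ... plus the token-level consistency term on the cross-attention maps.
    token_loss = token_consistency_loss(attn_maps, batch["seg_masks"])
    return denoise_loss + lambda_token * token_loss

# Hyperparameters follow the text: batch size 16, Adam, lr 2e-5, mixed precision.
def train(model, loader, epochs=1, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    scaler = GradScaler()
    model.train().to(device)
    for _ in range(epochs):
        for batch in loader:  # loader assumed to yield batches of 16 on `device`
            optimizer.zero_grad(set_to_none=True)
            with autocast():
                loss = training_step(model, batch)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
```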

Virtual Try On:

Model Architecture: Implement a model based on the StableVITON architecture, utilizing a U-Net framework to handle the virtual try-on task as an exemplar-based image inpainting problem. The model
will take four key inputs: (1) a noisy image, (2) a latent agnostic map that removes clothing information from
the person image, (3) a resized clothing-agnostic mask, and (4) a latent dense pose condition to maintain the
person’s pose.
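A small sketch of how these four conditions could be assembled into the U-Net input is shown below; the channel counts and the latent resolution (assuming a VAE stride of 8) are assumptions.

```python
import torch
import torch.nn.functional as F

def build_unet_input(noisy_latent, agnostic_latent, cloth_agnostic_mask, densepose_latent):
    """Concatenate the four conditions along the channel axis, as described above.
    Channel counts and the mask resizing to latent resolution are assumptions."""
    # Resize the clothing-agnostic mask to the spatial size of the latents.
    mask = F.interpolate(cloth_agnostic_mask, size=noisy_latent.shape[-2:], mode="nearest")
    return torch.cat([noisy_latent, agnostic_latent, mask, densepose_latent], dim=1)

# Example shapes for a 512x384 image encoded into a 64x48 latent:
x = build_unet_input(
    torch.randn(1, 4, 64, 48),   # (1) noisy image latent
    torch.randn(1, 4, 64, 48),   # (2) latent agnostic map (clothing removed)
    torch.rand(1, 1, 512, 384),  # (3) clothing-agnostic mask, resized inside
    torch.randn(1, 4, 64, 48),   # (4) latent dense pose condition
)
print(x.shape)  # torch.Size([1, 13, 64, 48])
```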

Augmentation: To mitigate the issue of key tokens unrelated to the query tokens being attended to, we alter the feature map by applying augmentations, including random shifts of the input conditions.
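One possible implementation of such a random-shift augmentation is sketched below; the shift range and the zeroing of the wrapped-around borders are assumptions.

```python
import random
import torch

def random_shift(condition: torch.Tensor, max_shift: int = 16) -> torch.Tensor:
    """Randomly shift a conditioning map (e.g., agnostic map or dense pose)
    by a few pixels in x and y."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    shifted = torch.roll(condition, shifts=(dy, dx), dims=(-2, -1))
    # Zero out the wrapped-around borders so no content leaks from the other side.
    if dy > 0:
        shifted[..., :dy, :] = 0
    elif dy < 0:
        shifted[..., dy:, :] = 0
    if dx > 0:
        shifted[..., :, :dx] = 0
    elif dx < 0:
        shifted[..., :, dx:] = 0
    return shifted
```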

Loss Function:

\mathcal{L} = \mathcal{L}_{\mathrm{LDM}} + \lambda_{\mathrm{ATV}} \, \mathcal{L}_{\mathrm{ATV}} \quad (1)

The loss term \mathcal{L}_{\mathrm{LDM}} is the standard Stable Diffusion objective, which lets the model learn fine-grained semantic correspondence rather than merely injecting the clothes at a similar location, while \mathcal{L}_{\mathrm{ATV}} is employed to sharpen the attention regions on the clothing, ensuring the cross-attention layers focus on more localized and coherent areas, thereby improving the accuracy of fine details in the generated images.
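An illustrative sketch of how an attention total variation penalty of this kind could be written is given below; it penalizes abrupt changes of the attention center of mass between neighboring query locations, but the exact formulation used in StableVITON may differ.

```python
import torch

def attention_total_variation_loss(attn: torch.Tensor, h_k: int, w_k: int) -> torch.Tensor:
    """attn: (B, N_q, N_k) cross-attention from person queries to clothing keys,
    where N_k = h_k * w_k. Illustrative formulation, not the exact paper loss."""
    b, n_q, n_k = attn.shape
    # Normalized key-grid coordinates in [0, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h_k), torch.linspace(0, 1, w_k), indexing="ij")
    coords = torch.stack([ys.reshape(-1), xs.reshape(-1)], dim=-1).to(attn)  # (N_k, 2)
    # Attention-weighted center of mass for every query token: (B, N_q, 2).
    probs = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
    centers = probs @ coords
    # Reshape to the query grid (assumed square here) and take finite differences.
    h_q = int(n_q ** 0.5)
    centers = centers.reshape(b, h_q, -1, 2)
    tv = (centers[:, 1:] - centers[:, :-1]).abs().mean() + \
         (centers[:, :, 1:] - centers[:, :, :-1]).abs().mean()
    return tv
```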

Training Procedure:

• Freezing Pretrained Models: The weights of the pre-trained U-Net and the CLIP image encoder are copied and frozen during training. These networks provide pre-trained feature representations for the human body and clothing, which are crucial for aligning the input clothing image with the agnostic map of the person (see the sketch after this list).
• Training the Zero Cross-Attention Blocks: The zero cross-attention blocks are trained to align the clothing feature map with the human body feature map. These blocks enable flexible alignment through attention mechanisms, which help the model preserve details of the clothing during the try-on process by conditioning on the human and clothing features.
• Training the Spatial Encoder: The spatial encoder is trained to extract clothing-specific latent representations. These are used as the key and value inputs in the cross-attention block to ensure accurate alignment of the clothing with body parts.



• Enhancing Fine-grained Alignment with Attention Total Variation Loss: The loss is applied
during training to sharpen the attention regions on the clothing. This helps improve the accuracy of
small details in the generated images, such as color patterns and clothing textures.
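A short sketch of the parameter-freezing setup referenced in the first bullet above; the module names and the learning rate are placeholders, not values taken from the paper.

```python
import itertools
import torch

def configure_optimizer(unet, clip_image_encoder, zero_cross_attn_blocks,
                        spatial_encoder, lr=1e-4):
    """Freeze the pretrained U-Net and CLIP image encoder; train only the zero
    cross-attention blocks and the spatial encoder. `lr` is a placeholder."""
    for frozen in (unet, clip_image_encoder):
        frozen.requires_grad_(False)
        frozen.eval()
    trainable = itertools.chain(zero_cross_attn_blocks.parameters(),
                                spatial_encoder.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```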

5. Overall Pipeline of the Network


Below is the pipeline representing the data flow across the architecture.

6. Revisions made after Presentation

• We simulate different fittings of the cloth to the person by changing only the dense pose of the person, without modifying the clothing or the original person's image.

• After comparing the person's size with the clothing, we dilate the chest, waist, and hand regions by different proportions; the model then warps the clothing according to the modified dense pose (see the sketch after this list).

• Initially we dilated only the chest region, which gave a synthetic look; dilating different regions by different proportions yields a more natural look.

• The final results, shown in the image below, demonstrate that the shirt is not simply enlarged and imposed on the person's body; it is dilated so that its size is calibrated to match the person. We show three sizes: small, medium, and large. In the large size (third image), this is further confirmed by the wrinkles on the enlarged shirt.
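The region-wise dilation mentioned above can be sketched with simple morphological operations; the region labels and kernel sizes below are illustrative assumptions, not the exact proportions we used.

```python
import cv2
import numpy as np

# Per-region dilation kernel sizes (pixels); these values are illustrative.
REGION_KERNELS = {"chest": 25, "waist": 17, "hands": 9}

def dilate_region(label_map: np.ndarray, region_label: int, kernel_size: int) -> np.ndarray:
    """Grow one labeled body region of a dense-pose label map by morphological dilation."""
    mask = (label_map == region_label).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    grown = cv2.dilate(mask, kernel).astype(bool)
    out = label_map.copy()
    out[grown] = region_label
    return out

def simulate_larger_fit(label_map: np.ndarray, region_labels: dict) -> np.ndarray:
    """region_labels maps a region name to its integer label in the dense-pose map."""
    out = label_map
    for name, label in region_labels.items():
        out = dilate_region(out, label, REGION_KERNELS[name])
    return out
```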

7. Contribution of Members

• Srikar: Focused on the development and training of the text-to-image generation model using
TokenCompose techniques. Responsibilities include dataset curation, model architecture design, and
evaluation of generated images.

• Chaitanya: Responsible for the integration of the virtual try-on system, including pose estimation
and image alignment processes. Will lead the testing and user evaluation phases to ensure a seamless
virtual try-on experience.

References
1. TokenCompose: Text-to-Image Diffusion with Token-level Supervision (Zirui et al.)
2. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On.
3. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations (Liu et al.)
4. Text to Image Generation of Fashion Clothing (Jain et al.)
5. Stylized Text-to-Fashion Image Generation (Zhang et al.)
6. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions (Lee et al.)
