
Integrating Latent Stable Diffusion Models with Virtual Try-On

Anumula Chaitanya Sai Srikar Mukkamala


[email protected] [email protected]

Abstract

The advancement of latent diffusion models, especially in the realm of text-to-image generation, has opened new possibilities in fashion technology. This project aims to explore
the integration of Latent Stable Diffusion models with Virtual Try-On (VTO) systems to
create a sophisticated text-to-image generation model specifically tailored for the fashion
industry. The model will have a semantically rich understanding of textual prompts, enabling
it to generate realistic clothing images based on descriptions. It will be capable of synthesizing
intricate patterns, brand logos, and various garment attributes. The second phase of the
project will focus on merging the generated clothing images with user photos, allowing
a virtual try-on experience by effectively utilizing the robust generative capability of a
pre-trained diffusion model while preserving the clothing details after warping. By bridging
deep learning techniques in text-to-image synthesis and virtual try-on, this project aims to
advance personalized online shopping experiences.

1. Introduction and Problem Motivation


Motivation: The fashion industry has seen rapid growth in the use of AI for enhancing customer
experiences. One significant area is Virtual Try-On (VTO), which allows users to visualize clothing items
on themselves digitally. Simultaneously, text-to-image generation models have reached new heights with
latent diffusion techniques, enabling the creation of high-quality images from textual prompts. This project
seeks to combine these two advancements to create a model that generates fashion images based on textual
descriptions and integrates them into a VTO framework.

Objective: To develop a text-to-image generation model using Latent Stable Diffusion that can
understand and process detailed textual prompts to create realistic clothing images. The project will then
integrate these generated images into a VTO system, allowing users to try on virtually generated clothing
based on their descriptions.

2. Proposed Solution Approach


Text-to-Image Generation with Latent Diffusion Models for the fashion industry.

Model Design:

Develop a latent diffusion model tailored for fashion image generation, capable of understanding
complex clothing descriptions. Enhance the model’s semantic comprehension to handle intricate prompts,
including details like fabric type, patterns (e.g., stripes, polka dots), brand logos, and garment-specific
attributes.

Training Data:



Curate a dataset comprising fashion images labeled with detailed textual descriptions (e.g., "red
cotton dress with white floral patterns and Nike logo").

Preprocess data to align textual descriptions with corresponding fashion images, using CLIP (Contrastive
Language-Image Pre-training) for improved text-image alignment.
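As an illustration, this alignment check could be scripted with the Hugging Face CLIP implementation as in the sketch below; the checkpoint name and the idea of thresholding low-similarity pairs are our assumptions rather than a fixed part of the pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with a matching processor would work.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_alignment_scores(image_paths, captions):
    """Return the cosine similarity between each image and its own caption."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1)  # one score per (image, caption) pair

# Pairs below a similarity threshold (the value is an assumption) could be
# re-captioned or dropped before training.
```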

Evaluation:

Measure the realism and accuracy of generated images using metrics such as Inception Score (IS)
and Fréchet Inception Distance (FID). Conduct human evaluations to assess how well the generated clothing
matches the given textual descriptions.
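A minimal sketch of how IS and FID could be computed with the torchmetrics implementations is given below; the image tensors are assumed to be uint8 in NCHW layout, and the random tensors are placeholders for real and generated samples.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

# Dummy stand-ins for real and generated fashion images (uint8, NCHW, [0, 255]).
real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore()
inception.update(fake_images)
is_mean, is_std = inception.compute()
print(f"IS: {is_mean.item():.2f} +/- {is_std.item():.2f}")
```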

Virtual Try-On:

The proposed solution for the virtual try-on task involves using a U-Net architecture to perform
exemplar-based image inpainting, integrating a clothing-agnostic representation of the person with the
clothing image through concatenated features. By incorporating a zero cross-attention mechanism and a
spatial encoder, the model ensures accurate alignment and fine detail preservation of the clothing on the
person’s image during training and inference.
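The sketch below illustrates what such a zero cross-attention block could look like in PyTorch; the feature shapes, normalization, and residual placement are simplifying assumptions and not the exact StableVITON implementation.

```python
import torch
import torch.nn as nn

class ZeroCrossAttentionBlock(nn.Module):
    """Cross-attention from person features (queries) to clothing features
    (keys/values) whose output projection is zero-initialized, so the
    pretrained U-Net is undisturbed at the start of fine-tuning."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.proj_out = nn.Linear(dim, dim)
        # Zero init: the block initially contributes nothing to the U-Net features.
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, person_feat: torch.Tensor, cloth_feat: torch.Tensor) -> torch.Tensor:
        # person_feat: (B, N_person, dim) U-Net features of the person branch
        # cloth_feat:  (B, N_cloth, dim) features from the spatial (clothing) encoder
        q = self.norm_q(person_feat)
        kv = self.norm_kv(cloth_feat)
        attended, _ = self.attn(q, kv, kv)
        return person_feat + self.proj_out(attended)  # residual connection
```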

Evaluation and Metrics: We follow a single-dataset evaluation strategy in which the same dataset is used for both training and evaluation. We use FID and KID scores to quantitatively measure performance.

3. Datasets
To effectively train the models, the following datasets will be utilized:

Fashion Image Dataset:

Description: A large-scale dataset comprising high-resolution images of various clothing items, annotated with detailed textual descriptions. This dataset will include diverse clothing categories, styles,
colors, patterns, and brand logos.

Source: Potential sources include Fashion-MNIST, DeepFashion, and custom datasets compiled from fashion e-commerce platforms, such as the Meesho dataset discussed in class.

Virtual Try On:

We train our model on the VITON-HD dataset and the upper-body images of the DressCode dataset. For evaluation on SHHQ-1.0, we use the first 2,032 images and follow the preprocessing instructions of VITON-HD to obtain the input conditions such as the agnostic maps and the dense pose.
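For concreteness, a minimal PyTorch dataset sketch for loading the paired inputs is given below; the directory names mirror the public VITON-HD layout but should be treated as assumptions.

```python
import os
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class VitonPairDataset(Dataset):
    """Loads (person, cloth, agnostic map, dense pose) tuples for try-on training.
    The subfolder names below are assumptions based on the VITON-HD release."""

    def __init__(self, root: str, pairs_file: str, size=(512, 384)):
        self.root = root
        self.to_tensor = transforms.Compose([
            transforms.Resize(size),
            transforms.ToTensor(),
            transforms.Normalize([0.5] * 3, [0.5] * 3),  # map to [-1, 1]
        ])
        with open(os.path.join(root, pairs_file)) as f:
            self.pairs = [line.split() for line in f if line.strip()]

    def __len__(self):
        return len(self.pairs)

    def _load(self, subdir, name):
        path = os.path.join(self.root, subdir, name)
        return self.to_tensor(Image.open(path).convert("RGB"))

    def __getitem__(self, idx):
        person_name, cloth_name = self.pairs[idx]
        return {
            "person": self._load("image", person_name),
            "cloth": self._load("cloth", cloth_name),
            "agnostic": self._load("agnostic-v3.2", person_name),
            "densepose": self._load("image-densepose", person_name),
        }
```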

4. Training of Model
Text-to-Image Generation Model:

Model Architecture: Implement the TokenCompose architecture, focusing on integrating token-level supervision during the finetuning stage to ensure better alignment of the generated images with the
complex textual prompts.

Integration with TokenCompose:

Training the TokenCompose Model: Implement the TokenCompose framework as an enhancement to the existing latent diffusion model. The key features will include:



• Token-Level Supervision: Introduce token-wise consistency terms between the generated image
content and object segmentation maps during the finetuning stage. This will enhance the model’s
ability to maintain consistency between multi-category clothing attributes specified in the textual
prompts.
• Fine-tuning on Stable Diffusion: Fine-tune the Stable Diffusion model using the curated dataset
to improve the quality and photorealism of the generated images. This process will involve:

Data Augmentation: Employ various augmentation techniques to enrich the dataset and
improve model robustness, such as random cropping, color jittering, and adding noise.

Loss Function: Implement a custom loss function that combines traditional pixel-wise losses (e.g., L2 loss) with the token-level consistency loss to optimize the model for both realism and textual adherence (see the sketch after this list).

Training Procedure: Train the model using a batch size of 16, employing an Adam optimizer with
a learning rate of 2e-5. Use mixed precision training to enhance speed and efficiency.

Fine-tuning: Use the curated fashion dataset to fine-tune the pre-trained Stable Diffusion
model. This process will incorporate token-level consistency terms that enhance the model’s ability
to handle multiple object categories effectively.
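The loss function and training procedure described above can be sketched as follows; the model interface (returning cross-attention maps alongside the noise prediction), the form of the token-level term, and its weighting factor are illustrative assumptions, not the TokenCompose reference implementation.

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler

def token_consistency_loss(attn_maps, seg_masks):
    """Hypothetical token-level term: push the cross-attention mass of each
    garment token inside that token's segmentation mask (TokenCompose-style)."""
    attn = attn_maps / (attn_maps.sum(dim=(-2, -1), keepdim=True) + 1e-8)
    return (attn * (1.0 - seg_masks)).sum(dim=(-2, -1)).mean()

def training_step(model, batch, lambda_token=0.1):
    # Standard latent-diffusion objective: predict the added noise (L2 / MSE) ...
    noise_pred, attn_maps = model(batch["noisy_latents"], batch["timesteps"],
                                  batch["text_embeddings"])
    denoise_loss = F.mse_loss(noise_pred, batch["noise"])
    # ... plus the token-level consistency term on the cross-attention maps.
    token_loss = token_consistency_loss(attn_maps, batch["seg_masks"])
    return denoise_loss + lambda_token * token_loss

# Hyperparameters follow the text: batch size 16, Adam, lr 2e-5, mixed precision.
def train(model, loader, epochs=1, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    scaler = GradScaler()
    model.train().to(device)
    for _ in range(epochs):
        for batch in loader:  # loader assumed to yield batches of 16 on `device`
            optimizer.zero_grad(set_to_none=True)
            with autocast():
                loss = training_step(model, batch)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
```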

Virtual Try On:

Model Architecture: Implement a model based on the StableVITON architecture, utilizing a U-Net framework to handle the virtual try-on task as an exemplar-based image inpainting problem. The model
will take four key inputs: (1) a noisy image, (2) a latent agnostic map that removes clothing information from
the person image, (3) a resized clothing-agnostic mask, and (4) a latent dense pose condition to maintain the
person’s pose.
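A small sketch of how these four conditions could be assembled into the U-Net input is shown below; the channel counts and the latent resolution (assuming a VAE stride of 8) are assumptions.

```python
import torch
import torch.nn.functional as F

def build_unet_input(noisy_latent, agnostic_latent, cloth_agnostic_mask, densepose_latent):
    """Concatenate the four conditions along the channel axis, as described above.
    Channel counts and the mask resizing to latent resolution are assumptions."""
    # Resize the clothing-agnostic mask to the spatial size of the latents.
    mask = F.interpolate(cloth_agnostic_mask, size=noisy_latent.shape[-2:], mode="nearest")
    return torch.cat([noisy_latent, agnostic_latent, mask, densepose_latent], dim=1)

# Example shapes for a 512x384 image encoded into a 64x48 latent:
x = build_unet_input(
    torch.randn(1, 4, 64, 48),   # (1) noisy image latent
    torch.randn(1, 4, 64, 48),   # (2) latent agnostic map (clothing removed)
    torch.rand(1, 1, 512, 384),  # (3) clothing-agnostic mask, resized inside
    torch.randn(1, 4, 64, 48),   # (4) latent dense pose condition
)
print(x.shape)  # torch.Size([1, 13, 64, 48])
```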

Augmentation: To mitigate the issue of key tokens unrelated to the query tokens being attended to, we alter the feature map by applying augmentations, including random shifts of the input conditions.
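One possible implementation of such a random-shift augmentation is sketched below; the shift range and the zeroing of the wrapped-around borders are assumptions.

```python
import random
import torch

def random_shift(condition: torch.Tensor, max_shift: int = 16) -> torch.Tensor:
    """Randomly shift a conditioning map (e.g., agnostic map or dense pose)
    by a few pixels in x and y."""
    dx = random.randint(-max_shift, max_shift)
    dy = random.randint(-max_shift, max_shift)
    shifted = torch.roll(condition, shifts=(dy, dx), dims=(-2, -1))
    # Zero out the wrapped-around borders so no content leaks from the other side.
    if dy > 0:
        shifted[..., :dy, :] = 0
    elif dy < 0:
        shifted[..., dy:, :] = 0
    if dx > 0:
        shifted[..., :, :dx] = 0
    elif dx < 0:
        shifted[..., :, dx:] = 0
    return shifted
```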

Loss Function:

\mathcal{L} = \mathcal{L}_{\mathrm{LDM}} + \lambda_{\mathrm{ATV}} \, \mathcal{L}_{\mathrm{ATV}} \quad (1)

The loss term \mathcal{L}_{\mathrm{LDM}} is the standard Stable Diffusion objective, which lets the model learn fine-grained semantic correspondence rather than merely injecting the clothes at a similar location, while \mathcal{L}_{\mathrm{ATV}} is employed to sharpen the attention regions on the clothing, ensuring the cross-attention layers focus on more localized and coherent areas, thereby improving the accuracy of fine details in the generated images.
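An illustrative sketch of how an attention total variation penalty of this kind could be written is given below; it penalizes abrupt changes of the attention center of mass between neighboring query locations, but the exact formulation used in StableVITON may differ.

```python
import torch

def attention_total_variation_loss(attn: torch.Tensor, h_k: int, w_k: int) -> torch.Tensor:
    """attn: (B, N_q, N_k) cross-attention from person queries to clothing keys,
    where N_k = h_k * w_k. Illustrative formulation, not the exact paper loss."""
    b, n_q, n_k = attn.shape
    # Normalized key-grid coordinates in [0, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, h_k), torch.linspace(0, 1, w_k), indexing="ij")
    coords = torch.stack([ys.reshape(-1), xs.reshape(-1)], dim=-1).to(attn)  # (N_k, 2)
    # Attention-weighted center of mass for every query token: (B, N_q, 2).
    probs = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)
    centers = probs @ coords
    # Reshape to the query grid (assumed square here) and take finite differences.
    h_q = int(n_q ** 0.5)
    centers = centers.reshape(b, h_q, -1, 2)
    tv = (centers[:, 1:] - centers[:, :-1]).abs().mean() + \
         (centers[:, :, 1:] - centers[:, :, :-1]).abs().mean()
    return tv
```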

Training Procedure:

• Freezing Pretrained Models: The weights of the pre-trained U-Net and the CLIP image encoder are copied and frozen during training. These networks provide pre-trained feature representations for the human body and clothing, which are crucial for aligning the input clothing image with the agnostic map of the person (see the sketch after this list).
• Training the Zero Cross-Attention Blocks: The zero cross-attention blocks are trained to align the clothing feature map with the human body feature map. These blocks enable flexible alignment through attention mechanisms, which help the model preserve details of the clothing during the try-on process by conditioning on the human and clothing features.
• Training the Spatial Encoder: The spatial encoder is trained to extract clothing-specific latent representations. These are used as the key and value inputs in the cross-attention block to ensure accurate alignment of the clothing with body parts.



• Enhancing Fine-grained Alignment with Attention Total Variation Loss: The loss is applied
during training to sharpen the attention regions on the clothing. This helps improve the accuracy of
small details in the generated images, such as color patterns and clothing textures.
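A short sketch of the parameter-freezing setup referenced in the first bullet above; the module names and the learning rate are placeholders, not values taken from the paper.

```python
import itertools
import torch

def configure_optimizer(unet, clip_image_encoder, zero_cross_attn_blocks,
                        spatial_encoder, lr=1e-4):
    """Freeze the pretrained U-Net and CLIP image encoder; train only the zero
    cross-attention blocks and the spatial encoder. `lr` is a placeholder."""
    for frozen in (unet, clip_image_encoder):
        frozen.requires_grad_(False)
        frozen.eval()
    trainable = itertools.chain(zero_cross_attn_blocks.parameters(),
                                spatial_encoder.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```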

5. Overall Pipeline of the Network


Below is the pipeline representing the data flow across the architecture.

6. Revisions made after Presentation

• We simulate different fittings of the cloth to the person by changing only the dense pose of the person, without modifying the clothing or the original person's image.

• After comparing the person's size with the clothing, we dilate the chest, waist, and hand regions by different proportions; the model then warps the clothing according to the modified dense pose (see the sketch after this list).

• Initially we dilated only the chest region, which gave a synthetic look; dilating different regions by different proportions yields a more natural look.

• The final results, shown in the image below, demonstrate that the shirt is not simply enlarged and imposed on the person's body; it is dilated so that its size is calibrated to match the person. We show three sizes: small, medium, and large. In the large size (third image), this is further confirmed by the wrinkles on the enlarged shirt.
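The region-wise dilation mentioned above can be sketched with simple morphological operations; the region labels and kernel sizes below are illustrative assumptions, not the exact proportions we used.

```python
import cv2
import numpy as np

# Per-region dilation kernel sizes (pixels); these values are illustrative.
REGION_KERNELS = {"chest": 25, "waist": 17, "hands": 9}

def dilate_region(label_map: np.ndarray, region_label: int, kernel_size: int) -> np.ndarray:
    """Grow one labeled body region of a dense-pose label map by morphological dilation."""
    mask = (label_map == region_label).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    grown = cv2.dilate(mask, kernel).astype(bool)
    out = label_map.copy()
    out[grown] = region_label
    return out

def simulate_larger_fit(label_map: np.ndarray, region_labels: dict) -> np.ndarray:
    """region_labels maps a region name to its integer label in the dense-pose map."""
    out = label_map
    for name, label in region_labels.items():
        out = dilate_region(out, label, REGION_KERNELS[name])
    return out
```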

7. Contribution of Members

• Srikar: Focused on the development and training of the text-to-image generation model using
TokenCompose techniques. Responsibilities include dataset curation, model architecture design, and
evaluation of generated images.

• Chaitanya: Responsible for the integration of the virtual try-on system, including pose estimation
and image alignment processes. Will lead the testing and user evaluation phases to ensure a seamless
virtual try-on experience.

References
1. TokenCompose: Text-to-Image Diffusion with Token-level Supervision (Zirui et al.)
2. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On.
3. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations (Liu et al.)
4. Text to Image Generation of Fashion Clothing (Jain et al.)
5. Stylized Text-to-Fashion Image Generation (Zhang et al.)
6. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions (Lee et al.)
