Text To Video Generation Using Deep Learning
Abstract: Technology developments have resulted in the creation of techniques that can produce desired visual multimedia. In particular, deep learning-based image generation has been the subject of in-depth research across many disciplines. On the other hand, it is still challenging for generative models to produce videos from text, a topic that has received less attention. This research tries to fill this gap by training a model to generate a clip that matches a given written sentence. The field of conditional video creation is largely underdeveloped. With the help of a conditional generative adversarial network, which develops the output frame by frame and ultimately creates a full-length video, our project's goal is to transform text to image to video. The initial step focuses on creating a single high-quality video frame while learning how to connect text and visuals. As the stages progress, our model is gradually trained on an increasing count of continuous frames. This approach of learning in stages stabilizes the training and makes it easier to understand. High-definition videos may be created from conditional text descriptions. To demonstrate the efficacy of the recommended strategy, results from qualitative and quantitative trials on various datasets are required.

Keywords— Variational auto encoders; GAN; Video generation; Conditional GAN; Video GAN.

I. INTRODUCTION

In this era, generative models have been researched on a large scale. Variational auto encoders (VAEs), which combine a recurrent neural network with a prior and an appropriate noise distribution, and generative adversarial networks (GANs), in which two artificial neural networks are combined with one another in a machine learning (ML) model to make more accurate predictions, are extensively utilized in the creation of images, videos, and voices. VAE and GAN are two recent innovations that stand out as current examples of rapid, prolific, and high-quality growth. A model known as the Generative Adversarial Network (GAN) was put forth by Goodfellow et al. A generator and a discriminator that have been trained with conflicting goals make up a GAN in its initial configuration. The discriminator is improved in order to distinguish between real samples drawn from the actual data distribution and fraudulent samples produced by the generator. The generator, in order to trick the discriminator, is taught to create samples that reflect the real data distribution. For recreating complex data distributions, such as those of texts, images, and videos, GANs have lately shown a lot of potential.
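To make the adversarial objective above concrete, the following is a minimal PyTorch sketch of one conditional GAN training step. The tiny generator, discriminator, and embedding sizes are illustrative placeholders, not the architecture used in this work.

```python
import torch
import torch.nn as nn

# Toy conditional generator/discriminator; all sizes are illustrative only.
text_dim, noise_dim, img_dim = 128, 64, 784

G = nn.Sequential(nn.Linear(noise_dim + text_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + text_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_imgs, text_emb):
    b = real_imgs.size(0)
    z = torch.randn(b, noise_dim)
    fake_imgs = G(torch.cat([z, text_emb], dim=1))

    # Discriminator: real (image, text) pairs -> 1, generated pairs -> 0.
    d_real = D(torch.cat([real_imgs, text_emb], dim=1))
    d_fake = D(torch.cat([fake_imgs.detach(), text_emb], dim=1))
    d_loss = bce(d_real, torch.ones(b, 1)) + bce(d_fake, torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: try to fool the discriminator on generated pairs.
    g_loss = bce(D(torch.cat([fake_imgs, text_emb], dim=1)), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Example call with random stand-ins for a batch of images and text embeddings.
print(train_step(torch.randn(8, img_dim), torch.randn(8, text_dim)))
```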
Despite their success, GAN models are known to be challenging to train, and the selection of hyper-parameters frequently has an adverse effect on the stability of the training process. The deep convolutional GAN proved effective in producing realistic outputs, and other studies using the GAN framework have reported remarkable outcomes with deep networks. The ability to generate photorealistic images that are difficult for humans to distinguish from genuine pictures exists today. However, because creating videos is a far harder task than creating images, there are significantly fewer studies on video generation than on image generation. Videos must account for the continuity between frames, whereas images are judged only on the quality of a single frame. If the progression between neighbouring frames is not assured, well-crafted videos cannot be produced, even if each individual image is of high quality. Further complicating video creation is the fact that almost all publicly available video collections are exceedingly diverse and poorly aligned. In contrast to the extensive study that has been done in the field of image production, conditional video generation has not received much attention. A network may produce a more realistic image corresponding to a given text, for instance, and a manual one-hot encoding scheme can be used to change the properties of the produced image. However, investigations on text-to-video generation are few and are often undertaken at a lower resolution than text-to-image generation. So, in order to widen the scope of video generation, we concentrated on conditional video creation, which has not been widely studied in this sector.
In this paper, a new way of using GANs for text-to-video generation tasks is presented. This paper suggests a novel network that creates videos in accordance with provided descriptions. The network's learning structure is based on the fundamental idea that linked frames in a video contain a lot of continuity. If we can create one high-quality video frame, it will be simple to generate a connected frame because the frames are related. The GAN is first trained with respect to a single image and is subsequently extended to longer frame sequences. The GAN may learn to produce lengthy scenes by incrementally improving its ability to generate a large number of adjacent frames. Our extensive experimental findings demonstrate that, in addition to producing an appropriate video for a given text, our approach also generates outcomes that are sharper and better in both qualitative and quantitative terms than those seen in previous comparable works.
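The staged training described above can be summarised as a schedule in which the number of consecutive frames the generator must produce grows from stage to stage. The sketch below illustrates only that scheduling idea; the stage boundaries, frame counts, and the `train_on_clips` routine are hypothetical placeholders rather than the exact procedure of this work.

```python
# Illustrative stage schedule: each stage asks the generator for more
# consecutive frames than the last, so training starts from single images
# and gradually moves toward full clips.
stages = [1, 2, 4, 8, 16]          # hypothetical frame counts per stage
epochs_per_stage = 10              # hypothetical

def train_on_clips(num_frames, epochs):
    # Placeholder for the actual GAN training loop at this clip length.
    print(f"training for {epochs} epochs on clips of {num_frames} frame(s)")

for num_frames in stages:
    train_on_clips(num_frames, epochs_per_stage)
```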
II. PROBLEM STATEMENT

Video creation from natural language phrases will have a significant influence, since humans are able to listen to and read sentences in human language and to imagine or visualise the things being described. Video is more effective than written words or plain text. Young people these days often do not have much time to go through an entire article to understand its content, yet they want to know all of its important elements. Hence there is a requirement for a video generation system that can create interesting, engaging, concise and high-quality videos from text stories with little or no human intervention.

III. OBJECTIVE

IV. LITERATURE SURVEY

In [1], the authors suggest a deep generative method that can create videos from an image of a single face and a designated facial emotion, such as an unintended smile. A frame sequence generator and an image generator make up the two main components of the architecture, where the image generator makes use of a deep neural model that combines GAN and VAE. The framework's sequence generator uses a single face image and a label as inputs and generates a collection of hidden representations with smooth transitions that correspond to video frames. The actual face images are then created by decoding the hidden representations using the image generator.

In [2], the CKD approach has been developed in order to transfer hierarchical knowledge from multiple image semantic comprehension tasks to the text-to-image synthesis (T2IS) problem. Using its experience with image semantic comprehension challenges, the T2IS model can learn the mapping from semantic information to picture content. In T2IS, text descriptions and synthetic images are used. The distillation process is divided into several steps using a multi-stage knowledge distillation paradigm. As a result, the visual quality issue for T2IS can be resolved by approximating the genuine picture distributions. The authors carry out extensive tests on widely used datasets to confirm the efficacy of the suggested CKD strategy.
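Cross-task knowledge distillation of the kind summarised above is usually implemented by making a student network match a teacher network's softened outputs in addition to the ground-truth labels. The snippet below is a generic distillation-loss sketch, not the specific CKD formulation of [2]; the temperature and weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Generic knowledge-distillation objective (not the exact CKD loss of [2]).

    Combines a cross-entropy term on the hard labels with a KL term that
    pulls the student's softened distribution toward the teacher's.
    """
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Example with random stand-ins for a batch of 8 samples and 10 classes.
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
print(distillation_loss(s, t, y).item())
```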
In [3], a new Attention-Transfer Mechanism (AATM) and Semantic Distillation Mechanism (SDM) have been developed, along with a Knowledge-Transfer Generative Adversarial Network (KT-GAN) for Text-to-Image (T2I) synthesis. To better encode text features and synthesize photographic images, the SDM leverages Image-to-Image synthesis as guidance. The AATM helps the generator gradually recognize crucial words and refine the details of the synthesized image. The heterogeneous gap is effectively closed by the SDM and AATM, enabling the generator to synthesize excellent images. Extensive experimental findings and analysis showed that KT-GAN is effective.

... the correspondence between multiple pictures is determined using the DIRECT optimization technique.
... original material, it is then brought back to high resolution. TENet, or Text Editor, is used to alter the text area. The backdrop area of the original document is kept untouched when the parts are combined into a new original document image. After IDNet creation, the fresh pictures are converted into a captured document using GAN-based distortion simulation. The success of the approach in [8] is then assessed using CNN-based recapture detection on the output captured/recaptured images.

In [9], the paper suggests a fully trained generative adversarial network for the synthesis of face images from text. The research describes a network that has been trained, using both a text encoder and an image decoder, to produce high-quality images conditioned on the input phrases. In-depth tests on a publicly accessible dataset demonstrate the superiority of the suggested methodology. Additionally, the authors contribute a text-to-face creation dataset for this innovative challenge: locally produced photos and various publicly accessible datasets have been integrated, after which each photograph was categorized manually. The resemblance between the generated faces and the input ground-truth description words is also examined in the proposed work. According to the research, the recommended generative adversarial network generates realistic pictures of excellent quality, with faces that resemble the labels and faces from the real data. Using FID and FSD scores, the suggested method was compared to cutting-edge techniques. The proposed Fully Trained Generative Adversarial Network obtained an FID score of 42.62 on a benchmark, which is lower (better) than other algorithms. Furthermore, human evaluations of the created photos are also credible.
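Since [9] reports FID scores, a brief sketch of how the Frechet Inception Distance is computed may be helpful: FID is the Frechet distance between Gaussians fitted to real and generated feature sets, ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2)). The features would normally come from an Inception network; in the runnable sketch below, random arrays stand in for them.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    """FID between two feature sets of shape (n_samples, feat_dim).

    In practice the features are Inception-v3 activations; random arrays
    are used below purely to make the sketch runnable.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)

rng = np.random.default_rng(0)
print(frechet_inception_distance(rng.normal(size=(256, 64)), rng.normal(size=(256, 64))))
```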
In [10], the research presents the Bridge-GAN strategy to solve the content-consistency problem of text-to-image synthesis. A transitional space is created as a bridge in the Bridge-GAN approach, which guarantees the key visual information from the text descriptions and allows an interpretable representation to be learned. For optimizing the transitional space, a lower bound on a ternary mutual information objective is also derived; this objective was designed to maximize the mutual information among the given written contents, the interpretable representation, and the observations of synthetic pictures. Extensive research on two popular datasets, with the finest quantitative outcomes, serves as proof of the Bridge-GAN's efficacy.

In [11], the issue of interactive crowd video generation is proactively posed and addressed. The proposed model, CrowdGAN, combines two task-specific networks with a self-selective fusion module to combine the merits of adversarial learning and flow-based warping. Given the guidance map for the subsequent frame, this model can provide users with realistic crowd images. Extensive testing demonstrates that the approach can produce visually convincing and temporally continuous crowd videos, and other applications that the suggested approach can improve are also shown. Due to the constraints of the already-existing crowd video datasets discussed above, the lack of sufficient diversity in the crowd scenarios restricts the generalization of the approach to a certain extent. The performance of the system will likely be enhanced in the future as a larger and more generation-focused crowd video collection is annotated; this will be especially helpful for addressing denser and more complex crowd circumstances. The collection will offer extra data with particular people following curved or twisting motion in an effort to boost user engagement. If this is the case, CrowdGAN will contribute more to solving the shortage of crowd data and, if needed, to making more helpful crowd videos.

The study [12] raises the problem of detecting cursive text and proposes a segmentation-free method using a deep convolutional recurrent neural network, with an emphasis on Urdu writing in real-world settings. Urdu text is much more challenging to interpret than non-cursive scripts because of its various writing styles, variety of letter forms, linked text, continuous overlaying, and elongated, vertical, and compressed text. Instead of pre-segmenting the word picture into individual characters, the recommended model first converts a full word image into a continuous sequence of frame features. The model consists of a deep CNN with short connections for feature extraction and encoding, a recurrent neural network (RNN) for decoding the convolutional features, and a connectionist temporal classification (CTC) layer for converting predicted sequences into target labels. In order to extract more beneficial Urdu text characteristics, more sophisticated CNN architectures, including VGG-16, VGG-19, ResNet-18, and ResNet-34, are studied and their recognition results compared in an effort to further increase text recognition precision.
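The CNN + RNN + CTC pipeline summarised for [12] can be sketched as follows; the layer sizes, alphabet size, and sequence lengths are placeholder assumptions rather than the configuration used in that paper.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN sketch: CNN feature extractor -> BiLSTM -> per-frame class scores."""

    def __init__(self, num_classes=32):  # class 0 is reserved for the CTC blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(input_size=64 * 8, hidden_size=128,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):                        # x: (batch, 1, 32, width)
        f = self.cnn(x)                          # (batch, 64, 8, width/4)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (batch, time, 64*8)
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(2)       # (batch, time, classes)

model = TinyCRNN()
ctc = nn.CTCLoss(blank=0)

images = torch.randn(4, 1, 32, 128)              # toy batch of word images
log_probs = model(images).permute(1, 0, 2)       # CTC expects (time, batch, classes)
targets = torch.randint(1, 32, (4, 10))          # toy label sequences (no blanks)
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)
print(ctc(log_probs, targets, input_lengths, target_lengths).item())
```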
According to [13], a novel framework for producing face videos is suggested. The approach is a two-stage framework in which a source face image is first used to generate temporal 3D dynamics, which are then rendered using a proposed sparse texture mapping methodology that expresses face structural characteristics and image data. The produced sparse texture then serves as a trustworthy prior for the construction of faces. The method differs from most preceding approaches, which concentrate only on face video generation or only on face video prediction; here both face video generation and face video prediction are the main topics. To demonstrate the success of the approach, three difficult tasks (video restoration, video forecasting, and target-driven video prediction) have been completed.

In [14], to deal with the problem of limited information, an updated framework, RiFeGAN2, is presented to quickly enrich the input caption. In order to speed up retrieval and boost retrieval quality, RiFeGAN2 uses a domain-specific constrained model to filter prior knowledge. It then uses a domain scorer and a ranking scorer to improve the candidates that have been collected. Additionally, in order to emphasize the input caption and enhance semantic consistency, SAEM2s with a central attention layer are suggested. Compared to the competition, the models can create visuals that are more realistic, according to tests on frequently used datasets, and they also have better semantic consistency. The outcomes further show that the suggested models can use numerous captions to facilitate interactive operations and increasingly fulfil the given content or the given text.

[15] introduces Semantic Object Accuracy (SOA), a novel assessment metric that gauges how precisely a model can create individual objects in pictures. With the use of this new SOA assessment, text-to-image generative models can be reviewed more thoroughly, and failure and success traits for specific items and object classes may be identified. In contrast to other measures such as the Inception Score, the SOA score is similar to the ranking derived by human evaluation, according to a user survey involving 200 participants. None of the cutting-edge approaches that were examined can currently produce convincing objects for a sample of the 80 classes in the COCO data set.
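Conceptually, an SOA-style evaluation runs an object detector over images generated from captions that mention a given object class and reports how often the detector finds that class. The sketch below only illustrates this bookkeeping; the `generate_image` and `detect_labels` functions are hypothetical stand-ins for a text-to-image model and a pretrained detector, not part of the evaluation code of [15].

```python
import random

def generate_image(caption):
    # Hypothetical text-to-image model; returns an opaque image object.
    return {"caption": caption}

def detect_labels(image):
    # Hypothetical pretrained object detector; here it randomly "detects" caption words.
    return {w for w in image["caption"].split() if random.random() > 0.3}

def semantic_object_accuracy(captions_by_class, n_samples=8):
    """For each class, fraction of generated images in which the detector finds it."""
    scores = {}
    for cls, captions in captions_by_class.items():
        hits = 0
        for _ in range(n_samples):
            img = generate_image(random.choice(captions))
            hits += cls in detect_labels(img)
        scores[cls] = hits / n_samples
    return scores

print(semantic_object_accuracy({
    "dog": ["a dog playing in a park", "a small dog on a sofa"],
    "car": ["a red car on the street"],
}))
```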
This paper [16] illustrated how a text-to-image GAN model can convey all of the facial traits in an attached text description while still being aesthetically pleasing. The model combines visual-linguistic features from a well-disentangled latent space by using the hierarchical structure of the cutting-edge StyleGAN model. Based on this research, the authors conclude that adding CLIP elements to the framework encourages results with richer contextual meaning without compromising the overall uniqueness of the facial results. Furthermore, it demonstrates that adopting a linear-based attention mechanism makes it easier to produce reliable pictures.
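Text-guided manipulation of a StyleGAN-like latent space, as described for [16], is often done by optimising a latent code so that the generated image's embedding moves toward the text embedding of a joint vision-language model such as CLIP. The sketch below shows only that optimisation loop; the tiny generator and the stand-in image/text encoders are placeholders (a real setup would use pretrained StyleGAN and CLIP weights), and it is not the specific method of [16].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder modules: a real setup would load pretrained StyleGAN and CLIP weights.
generator = nn.Linear(512, 3 * 64 * 64)       # latent code -> flattened "image"
image_encoder = nn.Linear(3 * 64 * 64, 256)   # stand-in for a CLIP-style image encoder
text_encoder = nn.Linear(300, 256)            # stand-in for a CLIP-style text encoder

text_embedding = F.normalize(text_encoder(torch.randn(1, 300)), dim=-1).detach()

latent = torch.randn(1, 512, requires_grad=True)   # code being optimised
optimizer = torch.optim.Adam([latent], lr=0.05)

for step in range(50):
    image = generator(latent)
    image_embedding = F.normalize(image_encoder(image), dim=-1)
    # Maximise cosine similarity between generated image and target text.
    loss = 1.0 - (image_embedding * text_embedding).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print("final text-image similarity:", 1.0 - loss.item())
```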
According to [17], novel approaches are provided for developing a thorough framework for sign language (SL) recognition, translation, and generation operations under real-world conditions. To improve recognition accuracy, a hybrid Convolutional Neural Network (CNN) and Bi-directional Long Short-Term Memory (Bi-LSTM) model is used for posture feature extraction and text synthesis. On the other hand, sign gesture videos are created for given spoken utterances using a hybrid Neural Machine Translation (NMT), MediaPipe, and Dynamic Generative Adversarial Network (GAN) model. The suggested approach achieves above 95% classification accuracy while resolving various problems with the current techniques. The model's efficacy is also assessed at various phases of development, and the assessment metrics show actual improvements. Compared with previous multi-language reference sign datasets used in testing, the model performs better in terms of picture quality and achieves the highest accuracy.
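A hybrid CNN + Bi-LSTM classifier of the kind described for [17] typically applies a small CNN to each frame of a gesture clip and feeds the resulting per-frame features to a bidirectional LSTM whose final state is classified. The sketch below is a generic illustration with placeholder sizes, not the architecture of [17].

```python
import torch
import torch.nn as nn

class CnnBiLstmClassifier(nn.Module):
    """Per-frame CNN features -> BiLSTM over time -> gesture class scores."""

    def __init__(self, num_classes=20):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.LSTM(input_size=32, hidden_size=64,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(128, num_classes)

    def forward(self, clips):                       # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        frames = clips.flatten(0, 1)                # (batch*time, 3, H, W)
        feats = self.cnn(frames).flatten(1)         # (batch*time, 32)
        feats = feats.view(b, t, -1)                # (batch, time, 32)
        out, _ = self.rnn(feats)
        return self.fc(out[:, -1])                  # classify from the last time step

model = CnnBiLstmClassifier()
logits = model(torch.randn(2, 16, 3, 64, 64))        # 2 clips of 16 frames each
print(logits.shape)                                   # torch.Size([2, 20])
```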
In [18], the objective of the vision-and-language field known as text-to-image synthesis is to learn multimodal representations between the attributes of the picture and the text. It is thus necessary to comprehend how the various elements in the given text relate to one another and to create convincing graphics from that understanding. Translation from text to picture is referred to as neural network visual thinking. The algorithm extrapolates the complex relationships between the text's objects using its existing knowledge to produce the final image. A variety of innovative adversarial loss functions are defined, and it is shown which ones improve the reasoning capacity of text-to-image synthesis. Surprisingly, the majority of the models are capable of reasoning for themselves. The superiority of the strategy is shown by quantitative and qualitative comparisons with various methodologies.

In [19], they address the difficulty of learning from diverse data sources. The method is designed to learn a joint text-video embedding and can compensate for missing video modalities during training. To accomplish this, a Mixture-of-Embedding-Experts (MEE) framework is provided that compares text with various video modalities. The model takes into consideration the distinctive contribution of each modality and learns the expert weights in an end-to-end fashion. During training, datasets from both image captioning and video captioning are combined, with images regarded as a particular case of motionless, soundless videos. For instance, even if "banana" only appears in training photos and never in training videos, the technique can nevertheless learn an embedding for "eating banana." The MPII Movie Description and MSR-VTT datasets are used to assess the method on the video retrieval problem. On tasks that involve text-to-video and video-to-text retrieval, the recommended MEE model beats all previously published techniques.
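The core of a mixture-of-embedding-experts model like the one summarised for [19] is a text-video similarity that is a weighted sum of per-modality similarities, with the weights predicted from the text itself. The sketch below shows that computation with placeholder dimensions; it is an illustration of the idea, not the released MEE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfEmbeddingExperts(nn.Module):
    """Text-conditioned weighting of per-modality (expert) similarities."""

    def __init__(self, text_dim=300, modality_dims=(2048, 1024, 128), embed_dim=256):
        super().__init__()
        self.text_heads = nn.ModuleList(nn.Linear(text_dim, embed_dim) for _ in modality_dims)
        self.video_heads = nn.ModuleList(nn.Linear(d, embed_dim) for d in modality_dims)
        self.gate = nn.Linear(text_dim, len(modality_dims))  # expert weights from text

    def forward(self, text_feat, video_feats):
        # text_feat: (batch, text_dim); video_feats: list of (batch, modality_dim)
        weights = F.softmax(self.gate(text_feat), dim=-1)     # (batch, n_experts)
        sims = []
        for i, v in enumerate(video_feats):
            t_emb = F.normalize(self.text_heads[i](text_feat), dim=-1)
            v_emb = F.normalize(self.video_heads[i](v), dim=-1)
            sims.append((t_emb * v_emb).sum(-1))              # cosine similarity per pair
        sims = torch.stack(sims, dim=-1)                      # (batch, n_experts)
        return (weights * sims).sum(-1)                       # weighted overall similarity

mee = MixtureOfEmbeddingExperts()
text = torch.randn(4, 300)
video = [torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 128)]
print(mee(text, video).shape)     # torch.Size([4])
```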
In [20], a text-to-image deep learning GAN model, which has become one of the most exciting research topics of our period, is used to get around this problem and can create images from the descriptions that go with them. Scene retrieval performance is considerably improved by Query-is-GAN, which is based on the text-to-image GAN, in the upgraded retrieval system. In this novel method, queries for the scene retrieval problem are created from pictures generated by the text-to-image GAN. Additionally, it is shown that, unlike earlier work on text-to-picture GANs, which largely concentrated on producing high-quality images, the produced images, although not aesthetically pleasing, contain visual features appropriate for the query. Scene retrieval from actual video datasets is used to empirically assess the effectiveness of the suggested approach.

V. CONCLUSIONS

Deep learning-based image generation has drawn the attention of researchers across a variety of fields, particularly for conditional data; video creation, in contrast, is still a difficult and under-explored field. Conditional GANs have been studied, and datasets of human actions such as Kinetics, MUG, MSR-VTT, and Celeb that are used for generating videos have been explored further. This model gives higher-resolution video compared to the existing models. The study extends into the field of different GANs and explores the Conditional Generative Adversarial Network (CGAN) in more depth. However, video production remains less researched within deep learning, and thus the field continues to look into video generation and processing.

REFERENCES

[1]. W. Wang, X. Alameda-Pineda, D. Xu, E. Ricci, and N. Sebe, "Learning How to Smile: Expression Video Generation With Conditional Adversarial Recurrent Nets", IEEE Trans. Multimedia, vol. 22, no. 11, pp. 2808-2819, Nov 2020.
[2]. M. Yuan and Y. Peng, "CKD: Cross-Task Knowledge Distillation for Text-to-Image Synthesis", IEEE Trans. Multimedia, vol. 22, no. 8, Aug 2020.
[3]. H. Tan, X. Liu, M. Liu, B. Yin, and X. Li, "KT-GAN: Knowledge-transfer generative adversarial network for text-to-image synthesis", IEEE Trans. Image Process., vol. 30, pp. 1275-1290, 2021.
[4]. Q. Chen, Q. Wu, J. Chen, Q. Wu, A. van den Hengel, and M. Tan, "Scripted Video Generation With a Bottom-Up Generative Adversarial Network", IEEE Trans. Image Process., vol. 29, pp. 7454-7467, 2020.
[5]. J. Dong, Y. Wang, X. Chen, X. Qu, X. Li, Y. He, and X. Wang, "Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval", IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 8, pp. 5680-5694, Aug 2022.
[6]. D. Kim, D. Joo and J. Kim, "TiVGAN: Text to Image to Video Generation With Step-by-Step Evolutionary Generator", IEEE, vol. 8, pp. 153113-153122, 2020.
[7]. R. Deshpande, CH. Renu Madhavi and M. Ram Bhatt, "3D Image Generation From Single Image Using Color Filtered Aperture and 2.1D Sketch: A Computational 3D Imaging System and Qualitative Analysis", IEEE, vol. 9, pp. 93580-93592, Jul 2021.
[8]. G. Zhu, Y. Ding and L. Zhao, "A Document Image Generation Scheme Based on Face Swapping and Distortion Generation", IEEE, vol. 10, pp. 78827-78837, 2022.
[9]. M. Zeeshan Khani, S. Jabeen, M. Usman Ghani Khan, T. Saba, A. Rehmat, A. Rehman and Usman Tariq, "A Realistic Image Generation of Face From Text Description Using the Fully Trained Generative Adversarial Networks", IEEE, vol. 9, pp. 1250-1260, Aug 2020.
[10]. M. Yuan and Y. Peng, "Bridge-GAN: Interpretable representation learning for text-to-image synthesis", IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 11, pp. 4258-4268, Nov 2020.
[11]. L. Chai, Y. Liu, W. Liu, G. Han, and S. He, "CrowdGAN: Identity-Free Interactive Crowd Video Generation and Beyond", IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, Jun 2022.
[12]. A. Ali Chandio, MD. Asikuzzaman, R. Pickering and M. Leghari, "Cursive Text Recognition in Natural Scene Images Using Deep Convolutional Recurrent Neural Network", IEEE, vol. 29, 2020.
[13]. X. Tu, Y. Zou, J. Zhao, W. Ai, J. Dong, Y. Yao, Z. Wang, G. Guo, Z. Li, W. Liu and J. Feng, "Image-to-Video Generation via 3D Facial Dynamics", IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 4, Apr 2022.
[14]. J. Cheng, F. Wu, Y. Tian, L. Wang, and D. Tao, "RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge", IEEE Trans. Circuits Syst. Video Technol., vol. 32, no. 8, pp. 5187-5200, Aug 2022.
[15]. T. Hinz, S. Heinrich, and S. Wermter, "Semantic object accuracy for generative text-to-image synthesis", IEEE Trans. Pattern Anal. Mach. Intell., early access, Sep. 2, 2020, doi: 10.1109/TPAMI.2020.3021209.
[16]. U. Osahor and M. Nasrabadi, "Text-Guided Sketch-to-Photo Image Synthesis", IEEE, vol. 10, pp. 98278-98289, Nov 2022.
[17]. B. Natarajan, E. Rajalakshmi, R. Elakkiya, Ketan Kotecha, Ajith Abraham, Lubna Abdelkareim Gabralla and V. Subramaniyaswamy, "Development of an End-to-End Deep Learning Framework for Sign Language Recognition, Translation, and Video Generation", IEEE, vol. 10, pp. 104358-104374, Sep 2022.
[18]. H. Lee, G. Kim, Y. Hur and H. Seok Lim, "Visual Thinking of Neural Networks: Interactive Text to Image Synthesis", IEEE, vol. 9, pp. 64510-64523, Apr 2021.
[19]. A. Miech, I. Laptev, and J. Sivic, "Learning a text-video embedding from incomplete and heterogeneous data", 2018, arXiv:1804.02516.
[20]. R. Yanagi, R. Togo, T. Ogawa and M. Haseyama, "Query is GAN: Scene Retrieval With Attentional Text-to-Image Generative Adversarial Network", IEEE, vol. 7, pp. 153183-153193, Oct 2019.