

Text2Video: Automatic Video Generation Based on Text Scripts


Yipeng Yu, Zirui Tu, Longyu Lu, Xiao Chen, Hui Zhan, Zixun Sun
{ianyu,lokitu,wakalu,evelynxchen,huizhan,zixunsun}@tencent.com
Interactive Entertainment Group, Tencent
Shanghai, China
ABSTRACT
To make video creation simpler, in this paper we present Text2Video, a novel system to automatically produce videos using only text editing for novice users. Given an input text script, the director-like system can generate game-related engaging videos which illustrate the given narrative, provide diverse multi-modal content, and follow video editing guidelines. The system involves five modules: (1) A material manager extracts highlights from raw live game videos, and tags each video highlight, image and audio with labels. (2) A natural language processor extracts entities and semantics from the input text scripts. (3) A refined cross-modal retrieval searches for matching candidate shots from the material manager. (4) A text-to-speech speaker reads the processed text scripts with a synthesized human voice. (5) The selected material shots and synthesized speech are assembled artistically through appropriate video editing techniques.

[Figure 1: System diagram. Input text scripts pass through the NLP module; material videos from live-streamers and manual uploads feed a multi-modal material database handled by the material manager; cross-modal retrieval, text to speech (subtitles and speech), and video editing produce the output videos.]

CCS CONCEPTS
• Information systems → Multimedia content creation.

KEYWORDS
text2video, video generation, video editing, video dubbing


ACM Reference Format:
Yipeng Yu, Zirui Tu, Longyu Lu, Xiao Chen, Hui Zhan, Zixun Sun. 2021. Text2Video: Automatic Video Generation Based on Text Scripts. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3474085.3478548

[Figure 2: Highlight extractor. A game frame annotated with the broadcast region, a hero's blood bar, and the ROI under the blood bar used for hero recognition.]

1 SYSTEM ARCHITECTURE
The system diagram of Text2Video is illustrated in Figure 1. The whole system can be divided into five parts: material manager, natural language processing (NLP), cross-modal retrieval, text to speech, and video editing.

1.1 Material Manager
The material modalities range over text, images, audio, and video. The material manager organizes this material for game video production. All material is tagged with two low-level labels, namely "game-related" and "game-unrelated", and a series of high-level labels. The labels of each material item are provided by human annotation and algorithmic classification. The labels and text description of each item are used in cross-modal retrieval.

Notably, a large part of the video material is produced automatically by a highlight extractor. Highlights are video episodes in which target events of a game happen, such as "first blood", "double kill", "triple kill", "quadra kill" and "penta kill". Highlights are continuously extracted from raw long videos on game live-streaming websites. As shown in Figure 2, the timely broadcasts in game videos provide information about target events. First, each raw video is split into frame sequences at a frame rate of 2, and we adopt feature-point matching [4] and OCR [8] to process each frame and detect the target events. Moreover, each hero has a blood bar overhead; we use template matching [6] to locate the blood bar and a trained CNN model [3] to recognize each of the 101 heroes from images within the ROI (region of interest) under the blood bar. Each highlight and its labels are stored in our database for cross-modal retrieval.

Corresponding author: Yipeng Yu.
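To make the pipeline above concrete, the following is a minimal sketch of the highlight-extraction loop, assuming OpenCV for frame decoding and template matching; read_broadcast_text and classify_hero are hypothetical stand-ins for the OCR [8] and CNN [3] components, whose details are not given here.

```python
# Sketch of the highlight-extraction loop (assumptions: OpenCV; the OCR and
# hero-classification models are passed in as callables, not implemented here).
import cv2

TARGET_EVENTS = {"first blood", "double kill", "triple kill", "quadra kill", "penta kill"}

def extract_highlights(video_path, blood_bar_template, read_broadcast_text, classify_hero):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps / 2), 1)              # keep roughly 2 frames per second
    highlights, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            # 1) Detect target events from the broadcast banner via OCR.
            text = read_broadcast_text(frame)
            event = next((e for e in TARGET_EVENTS if e in text.lower()), None)
            if event is not None:
                # 2) Locate the blood bar by template matching [6].
                result = cv2.matchTemplate(frame, blood_bar_template, cv2.TM_CCOEFF_NORMED)
                _, score, _, (x, y) = cv2.minMaxLoc(result)
                if score > 0.8:
                    # 3) Crop the ROI under the blood bar and recognize the hero.
                    h, w = blood_bar_template.shape[:2]
                    roi = frame[y + h : y + 3 * h, x : x + w]
                    hero = classify_hero(roi)
                    highlights.append({"time": frame_idx / fps, "event": event, "hero": hero})
        frame_idx += 1
    cap.release()
    return highlights
```

Sampling at roughly 2 frames per second keeps the detector cheap enough to run continuously over long live-stream recordings.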


[Figure 3: An example of generated video. The video uses game material to retell a classic historical story.

Input script, titled 爱在三国 ("Love in the Three Kingdoms"), with its six sentences numbered 1–6:
小乔和大乔在郊外玩。遇到了帅气的周瑜。小乔爱上了周瑜, 虽然偶尔吵架打闹。直到一个叫做诸葛亮的男人出现。周瑜被他带领的大军打得溃败,郁郁而终。小乔开始追着诸葛亮为夫报仇。

English translation: Xiaoqiao and her sister Daqiao played in the countryside. Xiaoqiao met the handsome Yu Zhou. Xiaoqiao fell in love with Yu Zhou and flirted with him. Unfortunately, a man named Liang Zhuge appeared. Yu Zhou was defeated by his army and died in depression. Xiaoqiao started chasing Liang Zhuge to avenge the death of her lover.

Generated output: Episode 1 through Episode 6.]

1.2 NLP
This module understands the text scripts. It first splits the input text script into a sentence sequence, then extracts game entities, namely hero names in our case, from each sentence. Moreover, each sentence is transformed into a vector embedding by BERT [1]. The entities and vector embedding of each sentence are used in cross-modal retrieval, and each sentence is used as a subtitle.
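A minimal sketch of this module is given below, assuming the Hugging Face transformers implementation of BERT and a small illustrative hero list; the exact BERT checkpoint, pooling strategy, and sentence splitter used by the system are not specified here.

```python
# Sketch of the NLP module (assumptions: a Chinese BERT checkpoint from
# Hugging Face and mean pooling over the last hidden states).
import re
import torch
from transformers import BertModel, BertTokenizer

HERO_NAMES = {"周瑜", "小乔", "大乔", "诸葛亮"}   # in practice, all 101 hero names

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

def process_script(script: str):
    # Split the script into sentences on Chinese end-of-sentence punctuation.
    sentences = [s for s in re.split(r"[。！？]", script) if s.strip()]
    results = []
    for sentence in sentences:
        # Entity extraction: hero names mentioned in the sentence.
        entities = [h for h in HERO_NAMES if h in sentence]
        # Sentence embedding via BERT [1].
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state    # (1, seq_len, 768)
        embedding = hidden.mean(dim=1).squeeze(0)         # mean-pooled sentence vector
        results.append({"subtitle": sentence, "entities": entities, "embedding": embedding})
    return results
```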
1.3 Cross-Modal Retrieval
Cross-modal retrieval is essential to the Text2Video system: it selects the best-matching materials, namely video episodes and images, from the database using the entities and embedding produced by the NLP module for each sentence. It first recalls candidate material via Elasticsearch by entity: the video episodes and images that have been tagged with the entities are selected as candidates. Next, we rank the candidates by semantic matching through Faiss [2]. More specifically, we compute the embedding similarity between each sentence and the text description of each candidate material. Finally, the video episode and image with the highest similarity are selected for each sentence.
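The recall-then-rank procedure could be sketched as follows, assuming an Elasticsearch index named "materials" with an entities keyword field and precomputed description embeddings, and cosine similarity via a normalized inner-product Faiss index; the index name, fields and similarity setup are illustrative, not the authors' actual configuration.

```python
# Sketch of the two-stage retrieval: Elasticsearch recall by entity,
# then Faiss [2] ranking by embedding similarity.
import faiss
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve(sentence_embedding: np.ndarray, entities: list, k: int = 1):
    # Stage 1: recall candidates whose labels contain the hero entities.
    resp = es.search(index="materials", query={"terms": {"entities": entities}}, size=100)
    candidates = [hit["_source"] for hit in resp["hits"]["hits"]]
    if not candidates:
        return []

    # Stage 2: rank candidates by cosine similarity between the sentence
    # embedding and each candidate's description embedding.
    vectors = np.stack([c["description_embedding"] for c in candidates]).astype("float32")
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)

    query = sentence_embedding.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)
    _, ids = index.search(query, min(k, len(candidates)))
    return [candidates[i] for i in ids[0]]
```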
1.4 Text to Speech
Video dubbing for subtitles is indispensable for video production. This module generates speech for each sentence from the NLP module, and each piece of speech is used as subtitle dubbing. We implement the Style Token variant of the Tacotron 2 model to read each sentence [5, 7]. The model is trained on a Mandarin speech dataset named THCHS-30. The number of style tokens is set to 10, and training stops after 55,000 steps.
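The reported setup can be summarized in a small configuration sketch; the field names below are illustrative and do not reflect the authors' actual training code.

```python
# Illustrative summary of the TTS setup described above.
from dataclasses import dataclass

@dataclass
class GSTTacotron2Config:
    dataset: str = "THCHS-30"       # Mandarin speech corpus used for training
    num_style_tokens: int = 10      # size of the Global Style Token bank [7]
    training_steps: int = 55_000    # training stops after 55,000 steps
    vocoder: str = "WaveNet"        # Tacotron 2 conditions WaveNet on mel spectrograms [5]
```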
1.5 Video Editing
After the subtitles, dubbing, images and video episodes are prepared, the Text2Video system assembles them according to various predefined video editing templates. A template is a pipeline that organizes the multi-modal material sequences with different video editing techniques. For example, transitions are added to smooth the switch between shots, and video effects such as stickers, lighting and animation are used for decoration. We also take advantage of montage methods, including speed adjustment, clip looping and background music. The template can be selected by users via a configuration file or chosen automatically with a probability distribution. Note that, depending on the application, the video editing techniques used to generate videos can easily be updated.
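For illustration, a simplified editing template might assemble the per-sentence shots, subtitles and dubbing as in the sketch below, using moviepy; the actual templates, effects and transitions used by the system are not specified in detail here.

```python
# Sketch of a simplified editing template (assumptions: moviepy 1.x; clip
# looping, stickers and other decorative effects are omitted for brevity).
from moviepy.editor import (AudioFileClip, CompositeVideoClip, TextClip,
                            VideoFileClip, concatenate_videoclips)

def render(episodes, output_path="output.mp4"):
    """episodes: list of dicts with 'video' and 'speech' file paths and a 'subtitle' string."""
    shots = []
    for ep in episodes:
        speech = AudioFileClip(ep["speech"])
        # Trim the shot to the length of the dubbed sentence and attach the dubbing.
        shot = VideoFileClip(ep["video"]).subclip(0, speech.duration).set_audio(speech)
        # Overlay the sentence as a subtitle.
        caption = (TextClip(ep["subtitle"], fontsize=36, color="white")
                   .set_duration(shot.duration)
                   .set_position(("center", "bottom")))
        shots.append(CompositeVideoClip([shot, caption]))
    # Cross-fade transitions smooth the switch between episodes.
    shots = [shots[0]] + [s.crossfadein(0.5) for s in shots[1:]]
    final = concatenate_videoclips(shots, method="compose", padding=-0.5)
    final.write_videofile(output_path, fps=30)
```

In the full system each template would differ in its transitions, effects and montage methods, which is what makes the template-selection step meaningful.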
2 DEMONSTRATION AND SIGNIFICANCE
Figure 3 demonstrates an example of automatic text-to-video generation. The original sentences are marked with six colored, numbered underlines, which map to the six video episodes surrounded by boxes of the same color and episode number. The videos match the original text script from multiple views, such as hero labels and text semantics, and are narrated by a charming female voice. Furthermore, the transition between two video episodes is smoothed with related pictures, and the events and important heroes are highlighted with animation effects. The video precisely expresses all important points of the given script and embodies artistic value through video montage methods.

Text2Video bridges the gap between text scripts and multi-modal raw material and lowers the difficulty of video creation for novice users. Users only need to focus on the quality of their scripts, like a screenwriter, and Text2Video will generate appealing videos to portray the scripts, like a director. The system is also able to produce numerous UGC and PGC videos of acceptable quality, which meets the requirements of video feeds.

3 CONCLUSION
In this paper, we present Text2Video, an automatic system for game video generation given only text scripts. After the material for generation is prepared by the material manager, the entire process can be condensed as follows. First, it uses NLP techniques to obtain entities and semantics from the input text. Next, a multi-step cross-modal retrieval engine searches for the material that best matches the text, and a text-to-speech speaker synthesizes speech to read the text for video dubbing. Finally, the selected material and generated speech are put together sequentially and artistically via video editing. Game enthusiasts can use our system to produce appealing videos by simply writing their stories down.


REFERENCES
[1] J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[2] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019).
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI 39, 6 (2016), 1137–1149.
[4] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In ICCV. 2564–2571.
[5] Jonathan Shen, Ruoming Pang, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP. 4779–4783.
[6] Nguyen Duc Thanh, Wanqing Li, and Philip Ogunbona. 2009. An improved template matching method for object detection. In ACCV. 193–202.
[7] Yuxuan Wang, Daisy Stanton, et al. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In ICML. 5180–5189.
[8] Yaping Zhang, Shuai Nie, Wenju Liu, Xing Xu, Dongxiang Zhang, and Heng Tao Shen. 2019. Sequence-to-sequence domain adaptation network for robust text image recognition. In CVPR. 2740–2749.

