

Text2Video: Automatic Video Generation Based on Text Scripts


Yipeng Yu, Zirui Tu, Longyu Lu, Xiao Chen, Hui Zhan, Zixun Sun
{ianyu,lokitu,wakalu,evelynxchen,huizhan,zixunsun}@tencent.com
Interactive Entertainment Group, Tencent
Shanghai, China
ABSTRACT
To make video creation simpler, in this paper we present Text2Video, a novel system to automatically produce videos using only text editing for novice users. Given an input text script, the director-like system can generate game-related engaging videos which illustrate the given narrative, provide diverse multi-modal content, and follow video editing guidelines. The system involves five modules: (1) A material manager extracts highlights from raw live game videos, and tags each video highlight, image and audio with labels. (2) A natural language processor extracts entities and semantics from the input text scripts. (3) A refined cross-modal retrieval searches for matching candidate shots from the material manager. (4) A text-to-speech speaker reads the processed text scripts with a synthesized human voice. (5) The selected material shots and synthesized speech are assembled artistically through appropriate video editing techniques.

[Figure 1: System diagram. Input text scripts pass through the NLP module; material videos from live-streamers and manual uploads feed a multi-modal material database handled by the material manager; cross-modal retrieval, text to speech (subtitles and speech), and video editing produce the output videos.]

CCS CONCEPTS
• Information systems → Multimedia content creation.

KEYWORDS
text2video, video generation, video editing, video dubbing


ACM Reference Format:
Yipeng Yu, Zirui Tu, Longyu Lu, Xiao Chen, Hui Zhan, Zixun Sun. 2021. Text2Video: Automatic Video Generation Based on Text Scripts. In Proceedings of the 29th ACM International Conference on Multimedia (MM '21), October 20–24, 2021, Virtual Event, China. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3474085.3478548

[Figure 2: Highlight extractor. A game frame annotated with the broadcast region, a hero's blood bar, and the ROI under the blood bar used for hero recognition.]

1 SYSTEM ARCHITECTURE
The system diagram of Text2Video is illustrated in Figure 1. The whole system can be divided into five parts: material manager, natural language processing (NLP), cross-modal retrieval, text to speech, and video editing.

1.1 Material Manager
The material modalities range over text, images, audio, and video. The material manager organizes this material for game video production. All material is tagged with two low-level labels, namely "game-related" and "game-unrelated", and a series of high-level labels. The labels of each material item are provided by human annotation and algorithmic classification. The labels and text description of each item are used in cross-modal retrieval.

Notably, a large part of the video material is produced automatically by a highlight extractor. Highlights are video episodes in which target events of a game happen, such as "first blood", "double kill", "triple kill", "quadra kill" and "penta kill". Highlights are continuously extracted from raw long videos on game live-streaming websites. As shown in Figure 2, the timely broadcasts in game videos provide information about target events. First, each raw video is split into frame sequences at a frame rate of 2, and we adopt feature-point matching [4] and OCR [8] to process each frame and detect the target events. Moreover, each hero has a blood bar overhead; we use template matching [6] to locate the blood bar and a trained CNN model [3] to recognize each of the 101 heroes from images within the ROI (region of interest) under the blood bar. Each highlight and its labels are stored in our database for cross-modal retrieval.

Corresponding author: Yipeng Yu.
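To make the pipeline above concrete, the following is a minimal sketch of the highlight-extraction loop, assuming OpenCV for frame decoding and template matching; read_broadcast_text and classify_hero are hypothetical stand-ins for the OCR [8] and CNN [3] components, whose details are not given here.

```python
# Sketch of the highlight-extraction loop (assumptions: OpenCV; the OCR and
# hero-classification models are passed in as callables, not implemented here).
import cv2

TARGET_EVENTS = {"first blood", "double kill", "triple kill", "quadra kill", "penta kill"}

def extract_highlights(video_path, blood_bar_template, read_broadcast_text, classify_hero):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps / 2), 1)              # keep roughly 2 frames per second
    highlights, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            # 1) Detect target events from the broadcast banner via OCR.
            text = read_broadcast_text(frame)
            event = next((e for e in TARGET_EVENTS if e in text.lower()), None)
            if event is not None:
                # 2) Locate the blood bar by template matching [6].
                result = cv2.matchTemplate(frame, blood_bar_template, cv2.TM_CCOEFF_NORMED)
                _, score, _, (x, y) = cv2.minMaxLoc(result)
                if score > 0.8:
                    # 3) Crop the ROI under the blood bar and recognize the hero.
                    h, w = blood_bar_template.shape[:2]
                    roi = frame[y + h : y + 3 * h, x : x + w]
                    hero = classify_hero(roi)
                    highlights.append({"time": frame_idx / fps, "event": event, "hero": hero})
        frame_idx += 1
    cap.release()
    return highlights
```

Sampling at roughly 2 frames per second keeps the detector cheap enough to run continuously over long live-stream recordings.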


[Figure 3: An example of generated video. The video uses game material to retell a classic historical story.

Input script, titled 爱在三国 ("Love in the Three Kingdoms"), with its six sentences numbered 1–6:
小乔和大乔在郊外玩。遇到了帅气的周瑜。小乔爱上了周瑜, 虽然偶尔吵架打闹。直到一个叫做诸葛亮的男人出现。周瑜被他带领的大军打得溃败,郁郁而终。小乔开始追着诸葛亮为夫报仇。

English translation: Xiaoqiao and her sister Daqiao played in the countryside. Xiaoqiao met the handsome Yu Zhou. Xiaoqiao fell in love with Yu Zhou and flirted with him. Unfortunately, a man named Liang Zhuge appeared. Yu Zhou was defeated by his army and died in depression. Xiaoqiao started chasing Liang Zhuge to avenge the death of her lover.

Generated output: Episode 1 through Episode 6.]

1.2 NLP
This module understands the text scripts. It first splits the input text script into a sentence sequence, then extracts game entities, namely hero names in our case, from each sentence. Moreover, each sentence is transformed into a vector embedding by BERT [1]. The entities and vector embedding of each sentence are used in cross-modal retrieval, and each sentence is used as a subtitle.
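A minimal sketch of this module is given below, assuming the Hugging Face transformers implementation of BERT and a small illustrative hero list; the exact BERT checkpoint, pooling strategy, and sentence splitter used by the system are not specified here.

```python
# Sketch of the NLP module (assumptions: a Chinese BERT checkpoint from
# Hugging Face and mean pooling over the last hidden states).
import re
import torch
from transformers import BertModel, BertTokenizer

HERO_NAMES = {"周瑜", "小乔", "大乔", "诸葛亮"}   # in practice, all 101 hero names

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")

def process_script(script: str):
    # Split the script into sentences on Chinese end-of-sentence punctuation.
    sentences = [s for s in re.split(r"[。！？]", script) if s.strip()]
    results = []
    for sentence in sentences:
        # Entity extraction: hero names mentioned in the sentence.
        entities = [h for h in HERO_NAMES if h in sentence]
        # Sentence embedding via BERT [1].
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state    # (1, seq_len, 768)
        embedding = hidden.mean(dim=1).squeeze(0)         # mean-pooled sentence vector
        results.append({"subtitle": sentence, "entities": entities, "embedding": embedding})
    return results
```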
1.3 Cross-Modal Retrieval
Cross-modal retrieval is essential to the Text2Video system: it selects the best-matching materials, namely video episodes and images, from the database using the entities and embedding produced by the NLP module for each sentence. It first recalls candidate material via Elasticsearch by entity: the video episodes and images that have been tagged with the entities are selected as candidates. Next, we rank the candidates by semantic matching through Faiss [2]. More specifically, we compute the embedding similarity between each sentence and the text description of each candidate material. Finally, the video episode and image with the highest similarity are selected for each sentence.
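The recall-then-rank procedure could be sketched as follows, assuming an Elasticsearch index named "materials" with an entities keyword field and precomputed description embeddings, and cosine similarity via a normalized inner-product Faiss index; the index name, fields and similarity setup are illustrative, not the authors' actual configuration.

```python
# Sketch of the two-stage retrieval: Elasticsearch recall by entity,
# then Faiss [2] ranking by embedding similarity.
import faiss
import numpy as np
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def retrieve(sentence_embedding: np.ndarray, entities: list, k: int = 1):
    # Stage 1: recall candidates whose labels contain the hero entities.
    resp = es.search(index="materials", query={"terms": {"entities": entities}}, size=100)
    candidates = [hit["_source"] for hit in resp["hits"]["hits"]]
    if not candidates:
        return []

    # Stage 2: rank candidates by cosine similarity between the sentence
    # embedding and each candidate's description embedding.
    vectors = np.stack([c["description_embedding"] for c in candidates]).astype("float32")
    faiss.normalize_L2(vectors)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)

    query = sentence_embedding.astype("float32").reshape(1, -1)
    faiss.normalize_L2(query)
    _, ids = index.search(query, min(k, len(candidates)))
    return [candidates[i] for i in ids[0]]
```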
1.4 Text to Speech
Video dubbing for subtitles is indispensable for video production. This module generates speech for each sentence from the NLP module, and each piece of speech is used as subtitle dubbing. We implement the Style Token variant of the Tacotron 2 model to read each sentence [5, 7]. The model is trained on a Mandarin speech dataset named THCHS-30. The number of style tokens is set to 10, and training stops after 55,000 steps.
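The reported setup can be summarized in a small configuration sketch; the field names below are illustrative and do not reflect the authors' actual training code.

```python
# Illustrative summary of the TTS setup described above.
from dataclasses import dataclass

@dataclass
class GSTTacotron2Config:
    dataset: str = "THCHS-30"       # Mandarin speech corpus used for training
    num_style_tokens: int = 10      # size of the Global Style Token bank [7]
    training_steps: int = 55_000    # training stops after 55,000 steps
    vocoder: str = "WaveNet"        # Tacotron 2 conditions WaveNet on mel spectrograms [5]
```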
1.5 Video Editing
After the subtitles, dubbing, images and video episodes are prepared, the Text2Video system assembles them according to various predefined video editing templates. A template is a pipeline that organizes the multi-modal material sequences with different video editing techniques. For example, transitions are added to smooth the switch between shots, and video effects such as stickers, lighting and animation are used for decoration. We also take advantage of montage methods, including speed adjustment, clip looping and background music. The template can be selected by users via a configuration file or chosen automatically with a probability distribution. Note that, depending on the application, the video editing techniques used to generate videos can easily be updated.
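For illustration, a simplified editing template might assemble the per-sentence shots, subtitles and dubbing as in the sketch below, using moviepy; the actual templates, effects and transitions used by the system are not specified in detail here.

```python
# Sketch of a simplified editing template (assumptions: moviepy 1.x; clip
# looping, stickers and other decorative effects are omitted for brevity).
from moviepy.editor import (AudioFileClip, CompositeVideoClip, TextClip,
                            VideoFileClip, concatenate_videoclips)

def render(episodes, output_path="output.mp4"):
    """episodes: list of dicts with 'video' and 'speech' file paths and a 'subtitle' string."""
    shots = []
    for ep in episodes:
        speech = AudioFileClip(ep["speech"])
        # Trim the shot to the length of the dubbed sentence and attach the dubbing.
        shot = VideoFileClip(ep["video"]).subclip(0, speech.duration).set_audio(speech)
        # Overlay the sentence as a subtitle.
        caption = (TextClip(ep["subtitle"], fontsize=36, color="white")
                   .set_duration(shot.duration)
                   .set_position(("center", "bottom")))
        shots.append(CompositeVideoClip([shot, caption]))
    # Cross-fade transitions smooth the switch between episodes.
    shots = [shots[0]] + [s.crossfadein(0.5) for s in shots[1:]]
    final = concatenate_videoclips(shots, method="compose", padding=-0.5)
    final.write_videofile(output_path, fps=30)
```

In the full system each template would differ in its transitions, effects and montage methods, which is what makes the template-selection step meaningful.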
2 DEMONSTRATION AND SIGNIFICANCE
Figure 3 demonstrates an example of automatic text-to-video generation. The original sentences are marked with six colored, numbered underlines, which map to the six video episodes surrounded by boxes of the same color and episode number. The videos match the original text script from multiple views, such as hero labels and text semantics, and are narrated by a charming female voice. Furthermore, the transition between two video episodes is smoothed with related pictures, and the events and important heroes are highlighted with animation effects. The video precisely expresses all important points of the given script and embodies artistic value through video montage methods.

Text2Video bridges the gap between text scripts and multi-modal raw material and lowers the difficulty of video creation for novice users. Users only need to focus on the quality of their scripts, like a screenwriter, and Text2Video will generate appealing videos to portray the scripts, like a director. The system is also able to produce numerous UGC and PGC videos of acceptable quality, which meets the requirements of video feeds.

3 CONCLUSION
In this paper, we present Text2Video, an automatic system for game video generation given only text scripts. After the material for generation is prepared by the material manager, the entire process can be condensed as follows. First, it uses NLP techniques to obtain entities and semantics from the input text. Next, a multi-step cross-modal retrieval engine searches for the material that best matches the text, and a text-to-speech speaker synthesizes speech to read the text for video dubbing. Finally, the selected material and generated speech are put together sequentially and artistically via video editing. Game enthusiasts can use our system to produce appealing videos by simply writing their stories down.


REFERENCES
[1] J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
[2] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data (2019).
[3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2016. Faster R-CNN: towards real-time object detection with region proposal networks. TPAMI 39, 6 (2016), 1137–1149.
[4] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In ICCV. 2564–2571.
[5] Jonathan Shen, Ruoming Pang, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP. 4779–4783.
[6] Nguyen Duc Thanh, Wanqing Li, and Philip Ogunbona. 2009. An improved template matching method for object detection. In ACCV. 193–202.
[7] Yuxuan Wang, Daisy Stanton, et al. 2018. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In ICML. 5180–5189.
[8] Yaping Zhang, Shuai Nie, Wenju Liu, Xing Xu, Dongxiang Zhang, and Heng Tao Shen. 2019. Sequence-to-sequence domain adaptation network for robust text image recognition. In CVPR. 2740–2749.

