Controllable Video Generation With Text-Based Instructions

Abstract—Most of the existing studies on controllable video generation either transfer disentangled motion to an appearance without detailed control over the motion, or generate videos of simple actions, such as the movement of arbitrary objects, conditioned on a control signal from users. In this study, we introduce the Controllable Video Generation with text-based Instructions (CVGI) framework, which allows text-based control over the action performed in a video. CVGI generates videos where hands interact with objects to perform the desired action by generating hand motions with detailed control through text-based instructions from users. By incorporating a motion estimation layer, we divide the task into two sub-tasks: (1) control signal estimation and (2) action generation. In control signal estimation, an encoder models actions as a set of simple motions by estimating low-level control signals for text-based instructions with given initial frames. In action generation, generative adversarial networks (GANs) generate realistic hand-based action videos as a combination of hand motions conditioned on the estimated low-level control signals. Evaluations on several datasets (EPIC-Kitchens-55, BAIR robot pushing, and Atari Breakout) show the effectiveness of CVGI in generating realistic videos and in the control over actions.

Index Terms—Controllable video generation, video generation with textual instructions, motion generation, conditional generative models.

Manuscript received 22 July 2022; revised 30 October 2022, 2 January 2023, and 20 February 2023; accepted 14 March 2023. Date of publication 29 March 2023; date of current version 8 January 2024. The Associate Editor coordinating the review of this manuscript and approving it for publication was Dr. Jiebo Luo. (Corresponding author: Ali Koksal.)
Ali Köksal is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632, and also with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).
Kenan E. Ak is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632 (e-mail: [email protected]).
Ying Sun is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632, and also with the Centre for Frontier AI Research, A*STAR, Singapore 117602 (e-mail: [email protected]).
Deepu Rajan is with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).
Joo Hwee Lim is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632, with the Centre for Frontier AI Research, A*STAR, Singapore 117602, and also with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).
This article has supplementary downloadable material available at https://doi.org/10.1109/TMM.2023.3262972, provided by the authors.
Digital Object Identifier 10.1109/TMM.2023.3262972

I. INTRODUCTION

Deep architectural models such as convolutional neural networks (CNNs) and generative adversarial networks (GANs) enable the generation of high-dimensional data such as images [1], [2], [3], [4], [5], [6] and videos [7], [8], [9], [10], [11], [12]. These models can manipulate the given high-dimensional data conditioned on the desired manipulation. For example, image manipulation and editing architectures [13], [14], [15], [16] allow users to transfer the style from another image.

Motion manipulation according to text-based instructions on a video where a human interacts with objects in a complex scene is considerably more challenging, as there is no simple way to model the interaction. Besides, building a semantic association between instructions and motion is also challenging because text descriptions are often ambiguous for controllable video generation. In the literature, there exist video manipulation architectures such as [12], [17], [18], [19], [20] that allow users to manipulate the motion of objects in a video. They can be divided into two groups according to the source of manipulation. Most of the existing approaches in the first group use driving videos as the source of manipulation by extracting actions. They can disentangle motion and transfer it to another appearance, but they offer only limited detailed control over the motion during generation [9], [21], [22], [23], [24], [25]. In the second group, existing approaches use control signals received from an agent, such as mouse clicks [26], keystrokes [8], [27], or a joystick [28], but most of them can only generate videos of simple actions that can be defined as displacement-based actions, such as moving arbitrary objects. In contrast, the proposed framework allows detailed control over the motions of generated videos and can generate complex actions as a combination of simple motions.

With the motivation of building an association between text-based instructions and motions to manipulate the features of the generated motion, such as its direction, speed, and target, this paper introduces a novel framework, named CVGI, that allows users to manipulate simple human-object interactions, such as hand/s going toward the desired object, in videos with complex scenes by conditioning on text-based instructions. CVGI receives a text-based instruction from a user and takes an initial frame as input to generate a video sequence that corresponds well with the user input. For example, Fig. 1 shows that CVGI can reconstruct the ground truth video by using the same text-based instruction as that of the ground truth. It can also generate novel videos with different text-based instructions. As shown, the generated videos are photo-realistic and correspond well with the text-based instructions.
Fig. 1. The proposed framework, named CVGI, generates controllable videos conditioned on text-based instructions received from a user. CVGI generates novel photo-realistic videos from an initial frame and textual instructions. Generated frames are, from top: ground truth, duplication of the ground truth, and two novel videos. In the duplication of the ground truth, CVGI generates hands at positions similar to the ground truth. In the novel videos, CVGI generates videos with different hand movements based on the textual instruction. Note that the boundary of the hand masks of the initial frame is indicated in blue in the generated frames to highlight the difference in hand movements.
Fig. 2. CVGI divides the task into two sub-tasks: control signal estimation and action generation. In the first sub-task, an encoder estimates low-level control
signals with a given initial frame conditioned on text-based instructions. In the second sub-task, two GANs (M2M and M2F) are trained one after another in a
loop. M2M is trained to control motion and generates the next mask of the object of interest conditioned on the estimated control signal. M2F is trained to perform
motion-aware mask-to-frame translation and generates realistic frames with given masks.
CVGI divides the task into two sub-tasks by incorporating the motion estimation layer, as seen in Fig. 2. The first sub-task, control signal estimation, encodes high-level text-based instructions into low-level control signals as a form of motion representation. The control signal encoder takes an instruction and an initial frame and estimates a set of low-level control signals for the motions in the future frames. Low-level signals define the location change of the object of interest between two consecutive frames, such as the displacement of the center of mass of the hand masks or the displacement of the robot arm's gripper. The second sub-task, action generation, generates realistic videos frame by frame in a loop conditioned on the low-level signals. First, it generates the next frame from the initial frame and the first estimated low-level signal. Then it takes the generated frame and the second estimated low-level signal to generate the third frame, and so on, until n frames are generated for all estimated low-level signals; a minimal sketch of this loop is given below.
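The following minimal Python sketch illustrates this frame-by-frame loop. The function and module interfaces (encoder, m2m, m2f) and the centroid-based displacement helper are hypothetical stand-ins for the components in Fig. 2, not the actual CVGI implementation.

# Minimal sketch of CVGI's two-stage generation (hypothetical interfaces).
# encoder(F0, d)             -> low-level control signals, one per future frame
# m2m(mask, delta)           -> next mask of the object of interest
# m2f(mask, next_mask, frame)-> next RGB frame (motion-aware mask-to-frame translation)
from typing import Callable, List
import torch

def mask_centroid(mask: torch.Tensor) -> torch.Tensor:
    """Center of mass (x, y) of a binary mask of shape (H, W)."""
    ys, xs = torch.nonzero(mask > 0.5, as_tuple=True)
    if xs.numel() == 0:                      # empty mask: fall back to the image center
        h, w = mask.shape
        return torch.tensor([w / 2.0, h / 2.0])
    return torch.stack([xs.float().mean(), ys.float().mean()])

def centroid_displacement(prev_mask: torch.Tensor, next_mask: torch.Tensor) -> torch.Tensor:
    """Displacement-based low-level control signal between two consecutive masks."""
    return mask_centroid(next_mask) - mask_centroid(prev_mask)

def generate_video(frame0: torch.Tensor, mask0: torch.Tensor, instruction: str,
                   encoder: Callable, m2m: Callable, m2f: Callable) -> List[torch.Tensor]:
    """Autoregressive frame-by-frame generation conditioned on a text instruction."""
    deltas = encoder(frame0, instruction)    # (1) control signal estimation: n low-level signals
    frames, frame, mask = [], frame0, mask0
    for delta in deltas:                     # (2) action generation, one frame per signal
        next_mask = m2m(mask, delta)         # M2M: move the object-of-interest mask
        frame = m2f(mask, next_mask, frame)  # M2F: translate the mask pair to a realistic frame
        frames.append(frame)
        mask = next_mask                     # feed the generated mask/frame back into the loop
    return frames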
The EPIC-Kitchens-55 dataset [29] contains egocentric videos shot by a head-mounted camera. Egocentric videos capture useful visual input for understanding the person's activities because the scenes of egocentric videos approximate the vision of the person who wears the camera. Despite these useful scenes, wearable cameras introduce a lot of motion and dynamic scenes into egocentric videos. In dynamic scenes, it is hard to focus on the object of interest and to represent its meaningful movement and interaction with low-level control signals. Consequently, motion generation in egocentric videos is extremely challenging. Moreover, to the best of our knowledge, CVGI is the first study that can generate controllable egocentric videos where the motion of the object of interest is controlled in detail. Therefore, CVGI incorporates hand masks to differentiate the object of interest to control, and the action generator synthesizes video sequences by employing two GANs, mask-to-mask (M2M) and mask-to-frame (M2F), for egocentric videos.
M2M models the association between the motions of the masks of the object of interest, i.e., hand masks, and the low-level signals, and it is trained to synthesize masks that correspond with the low-level signals. It takes the masks of the input frames and generates masks for the next frames conditioned on the low-level signal. The M2F GAN maps masks to frames while being aware of the hands' motion to generate realistic frames. For the BAIR robot pushing [30] and Atari Breakout [8] datasets, we employ a single GAN similar to M2M.

The contributions of this work can be summarized in three aspects: 1) We propose CVGI, which generates novel human-object interaction videos where hand/s go toward the desired object by manipulating motions in complex scenes conditioned on text-based instructions from users. 2) We overcome the challenge with two innovations: control signal estimation models human-object interaction in terms of motions and builds the association between text-based instructions and low-level control signals, while action generation models motions with low-level signals and generates realistic videos. 3) With evaluations on three public datasets, EPIC-Kitchens-55 [29], BAIR robot pushing [30], and Atari Breakout [8], we demonstrate that CVGI generates photo-realistic videos that correspond well with instructions.
II. RELATED WORK rately learning short-term and long-term context. [17] generates
In the literature, many studies for novel video generation exist, and they can be grouped into two categories according to the source of manipulation. The first group learns a motion that is extracted from a video and transfers the motion to another object/subject. The second group receives user input to control the motion and generates novel videos that correspond well with the user input. Another important difference between the groups, other than the source of manipulation, is the ability to control the synthesized motion in detail, which can be defined as explicit control. Although studies in the first group can control the motion, they are limited in explicitly controlling the synthesized motion as they transfer the motion that is learned from a source video as it is. On the other hand, studies in the second group learn to associate motions with user input and can manipulate the motion according to the user input.

Most existing studies that transfer motions learn content, which is the object and its appearance, and motion, which is the dynamics of the content, and they generate videos of moving faces, human bodies, and arbitrary objects. [25] proposes a framework that generates controllable videos of human faces conditioned on a driving vector that can be extracted from a given video, audio, or pose vector. [24] builds separate latent spaces for the content and motion of videos to generate novel videos of faces, human bodies, and artificial objects by controlling the content and motion latent vectors. [31] and [32] predict future frames from a given input frame conditioned on estimated future human body poses. [19] introduces a framework that animates an arbitrary object in a given image conditioned on a motion that is derived from a driving video sequence by using sparse keypoint trajectories. In [33], a similar framework addresses the same problem without using any annotation. [23] distinguishes the appearance and pose of humans and generates images of a given appearance with different poses. Similar to [23], [22] generates controllable human behavior by transferring motion that is rendered over keypoints of the human body. [21] generates dance videos by transferring motion with a network that is trained to translate pose to appearance and vice versa. [9] introduces a framework that includes a generator and multiple discriminators and disentangles the actions and objects of a given video. It generates human-object interaction videos based on text descriptions by replacing objects to generate novel videos. The aforementioned studies disentangle the content and motion of videos and transfer motions to generate novel videos of the contents. On the other hand, our framework controls the motion of objects according to user input. It generates novel videos by manipulating a given frame conditioned on control signals that are estimated from the user input.

Most of the studies that receive user input as the source of manipulation generate novel videos by synthesizing simple motions that depict the desired motion in the generated frames. [34] introduces a framework that generates action-conditional videos in Atari games by predicting future frames from given previous frames and an action label for player actions. [35] generates action-conditioned videos for robotic arm actions by predicting future frames over a long range from previous frames. [18] introduces a framework that generates variable-length videos of artificial or arbitrary objects conditioned on captions by separately learning short-term and long-term context. [17] generates controllable videos from a given frame conditioned on sparse trajectories specified by users and improves the video quality by hallucinating pixels that cannot be copied based on flow. [28] extracts a character from a given video and generates videos of the extracted character performing motions that are controllable with low-level signals received from an agent, on any background. The framework has two modules: the first generates poses corresponding to signals from an agent such as a joystick, and the second translates poses to frames. [27] is trained to imitate game engines and renders the next screen conditioned on keystrokes from users. [8] learns actions in an unsupervised manner to cluster motions and then generates videos of discrete actions from a given initial frame conditioned on keystrokes from users. [26] proposes a framework that generates videos where the motion of specific objects can be controlled through mouse clicks. It receives an input frame, its segmentation map, and a mouse click, and incorporates a graph convolution network to model the motions of objects. As summarized above, most existing studies generate controllable videos with low-level signals received from an agent such as a keyboard, joystick, or mouse. On the other hand, our framework builds a semantic association between text-based instructions and motions. This association allows controlling the generated videos according to text-based input that can describe complex actions such as human-object interaction.

In addition to the above-mentioned video generation frameworks, there exist recent studies that generate videos based on text. [12] proposes a generic solution for various visual generation tasks. Its generic model can also generate videos based on text. [36] proposes a framework, CogVideo, for text-to-video generation. CogVideo includes a transformer-based architecture and uses a pretrained text-to-image model. Similarly, [37] employs a bi-directional masked transformer-based model. It is trained with a large amount of text-image pairs and a smaller amount of video-text pairs. [38] proposes an efficient video generation model. It is a GAN-based model and can control
L = \frac{1}{n} \sum_{i=1}^{n} \left\| \Delta_i - \hat{\Delta}_i \right\|_2, \quad \text{where } \hat{\Delta}_{1,2,\ldots,n} = E(F_0, d). \qquad (1)
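Read as a training objective for the control signal encoder E, (1) averages the L2 error between the ground-truth displacements Δ_i and the estimates Δ̂_i = E(F_0, d). A minimal sketch of this loss is given below; the tensor shapes (n time steps by c displacement components) are assumptions.

# Sketch of the control-signal estimation loss in (1): mean L2 distance between
# ground-truth displacements and the encoder's estimates Δ̂ = E(F0, d).
import torch

def control_signal_loss(deltas: torch.Tensor, deltas_hat: torch.Tensor) -> torch.Tensor:
    # ||Δ_i − Δ̂_i||_2 per time step, averaged over the n future steps
    return (deltas - deltas_hat).norm(p=2, dim=-1).mean()

# Example with n = 7 steps and 4 signal components (two hands, x/y each):
gt = torch.randn(7, 4)
pred = torch.randn(7, 4)
loss = control_signal_loss(gt, pred)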
Fig. 4. Forward training of M2M GAN that includes (a) a conditional generator G_mask and (b) a sequence discriminator D_mask. G_mask is trained to generate masks conditioned on low-level control signals which contain four floating-point numbers. The first two control the right hand and the next two control the left hand. D_mask is a sequence discriminator with two heads. The first head is trained to distinguish real and fake sequences. The second head is trained to estimate the displacement of objects of interest.

\ldots + \frac{1}{2} \left( D_{mask}^{adv}(M_i, \hat{M}_{i+1}) - 1 \right)^2 + \frac{1}{2} \left( D_{mask}^{adv}(\tilde{M}_i, M_{i+1}) - 1 \right)^2. \qquad (7)

In addition, D_mask learns to predict the displacement of the object by minimizing the following loss:

L_{cs}^{D} = \left\| D_{mask}^{reg}(M_i, M_{i+1}) - \Delta_{i+1} \right\|_2. \qquad (8)

The full objective function to optimize D_mask is defined as follows:
Fig. 6. Forward training of the discriminators of M2F GAN. D_frame (a) is trained to distinguish real and fake frames, D_fg (b) is trained to distinguish real and fake objects of interest, and D_bg (c) is trained to distinguish real and fake background. σ and φ denote the operations that compute hand frames and background frames, respectively.
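As a rough illustration of the two-headed sequence discriminator D_mask, the sketch below pairs an LSGAN-style adversarial head, in the spirit of the squared terms in (7), with the displacement-regression head of (8). The backbone architecture, the channel-wise stacking of consecutive masks, and the loss definitions below are illustrative assumptions, not the exact formulation.

# Rough sketch of a two-headed sequence discriminator in the spirit of D_mask.
import torch
import torch.nn as nn

class SeqMaskDiscriminator(nn.Module):
    def __init__(self, signal_dim: int = 4):
        super().__init__()
        # Consecutive masks are stacked on the channel axis: input shape (B, 2, H, W).
        self.backbone = nn.Sequential(
            nn.Conv2d(2, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.adv_head = nn.Linear(64, 1)           # real/fake score
        self.reg_head = nn.Linear(64, signal_dim)  # displacement estimate

    def forward(self, mask_i: torch.Tensor, mask_next: torch.Tensor):
        feat = self.backbone(torch.cat([mask_i, mask_next], dim=1))
        return self.adv_head(feat), self.reg_head(feat)

def lsgan_term(score: torch.Tensor, target: float) -> torch.Tensor:
    # Least-squares adversarial term, cf. the (D(..) - 1)^2 terms of (7).
    return 0.5 * (score - target).pow(2).mean()

def displacement_loss(pred_delta: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    # Regression head of D_mask, cf. (8): ||D_reg(M_i, M_{i+1}) - Δ_{i+1}||_2.
    return (pred_delta - delta).norm(p=2, dim=-1).mean()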
Fig. 7. Qualitative evaluation on the EPIC-Kitchens-55 dataset. Given an initial frame and mask and different instructions, the estimated low-level control signals, estimated masks, and frames generated by CVGI are shown for three different textual instructions. Δ̂rx, Δ̂ry, Δ̂lx, and Δ̂ly denote the estimated low-level control signals in 2D for the right and left hand, respectively.
the annotations of the dataset with the low-level control signals and text-based instructions. The displacement of the base is computed with respect to the location of the base's left-most pixel. Text-based instructions are prepared from the computed displacements. Although the variation of adverbs is three, as in the BAIR robot pushing dataset, the variation of verbs is three due to the one-dimensional motion. So, the action space of text-based instructions has 7 unique actions.

A. Training Details

CVGI's modules are trained separately with video sequences of the training set, where each sequence contains a set of frames, a set of masks, low-level control signals, and a text-based instruction, S : {{F_0, F_1, ..., F_n}, {M_0, M_1, ..., M_n}, {Δ_0, Δ_1, ..., Δ_n}, d}. In all experiments, modules are trained from scratch for 500k iterations, and we use the Adam optimizer [44] with a batch size of 16, a learning rate of 0.0002, β_1 = 0.5, and β_2 = 0.999. For EPIC-Kitchens-55, CVGI is trained to produce 7 future frames, i.e., the default value of the hyperparameter n is selected as 7 experimentally. Based on rigorous experimentation, we observe that 7 is an optimal number of future frames to generate while avoiding excessive accumulation of errors in the form of artifacts. On the other hand, for the BAIR robot pushing and Atari Breakout datasets, CVGI generates frames by producing only the next frame (the default value of n is 1) because motions in both datasets are simple motions that typically start in one frame and end in the next. To produce longer video sequences, generation can be re-initiated by using the last generated frame, its mask, and the text-based instruction for EPIC-Kitchens-55, and the last generated frame and the text-based instruction for the BAIR robot pushing and Atari Breakout datasets.
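For reference, the stated optimizer settings correspond to a standard Adam configuration; the sketch below uses a placeholder module and a dummy batch standing in for the actual CVGI modules and data.

import torch
import torch.nn as nn

# Placeholder module standing in for one CVGI component.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))

# Adam with the stated hyperparameters: lr = 0.0002, β1 = 0.5, β2 = 0.999.
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.5, 0.999))

BATCH_SIZE = 16             # stated batch size
TOTAL_ITERATIONS = 500_000  # each module is trained from scratch for 500k iterations

# One illustrative update step with a dummy batch and a dummy reconstruction loss.
batch = torch.randn(BATCH_SIZE, 3, 64, 64)
loss = (model(batch) - batch).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()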
B. Qualitative Results

Figs. 7, 8, and 9 show the results of the qualitative evaluation of CVGI on the EPIC-Kitchens-55, BAIR robot pushing, and Atari Breakout datasets, respectively. Fig. 7 shows the initial frame, the corresponding mask, and the text-based instructions along with the estimated low-level control signals, the estimated masks, and the frames of the generated sequences. In Figs. 8 and 9, the estimated low-level control signals and the next frames generated by CVGI conditioned on different instructions are shown for one sample initial frame due to space limitations.

In Fig. 7, CVGI is able to generate novel videos depicting different hand motions by using the same initial frame and hand mask according to the instructions. The estimated low-level control signals change according to the instructions, which enables the M2M GAN to produce different hand masks, which in turn allows the M2F GAN to generate videos with different hand movements. In addition to the differences among the generated videos, they are semantically consistent with the instructions. As seen in Fig. 7, the hands in the generated videos move towards the desired objects according to the instruction. Thus, CVGI comprehends instructions and controls the motion of both hands to synthesize videos depicting the desired hand-based action.
Fig. 8. Qualitative evaluation on the BAIR robot pushing dataset. Given an initial frame and different instructions, the estimated low-level control signals and the frames generated by CVGI are shown for different textual instructions. Δ̂x and Δ̂y denote the estimated low-level control signals in 2D for the robotic arm.
Fig. 9. Qualitative evaluation on the Atari Breakout dataset. Given an initial frame and different instructions, the estimated low-level control signals and the frames generated by CVGI are shown. Δ̂x denotes the estimated low-level control signal in 1D for the base of the Breakout game.
C. Quantitative Results

We follow the evaluation protocol proposed in [8] to evaluate the video generation quality of our framework. According to the protocol, models are used to generate the frames of the test set by starting from the initial frame. Then the quality of the generated frames is measured by three metrics: FID [45], FVD [46], and LPIPS [47]. They measure the similarity between two sets of samples, and a lower score means more similar sets. Fréchet Inception Distance (FID) [45] measures the similarity between two sets by comparing Gaussian distributions of deep features. Fréchet Video Distance (FVD) [46] is a variant of the FID metric designed specifically to evaluate the quality of video generation models. Learned Perceptual Image Patch Similarity (LPIPS) [47] measures the perceptual similarity between image patches.
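FID has a closed form once each set of samples is summarized by the Gaussian statistics of its deep features: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below computes it from pre-extracted feature matrices; the feature extraction step (e.g., Inception activations) is assumed to have been done separately. FVD applies the same recipe to features from a video network, while LPIPS instead compares deep features of individual image pairs.

# Fréchet distance between two sets of deep features, each summarized by (mu, Sigma).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu1, sigma1 = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu2, sigma2 = feats_fake.mean(axis=0), np.cov(feats_fake, rowvar=False)
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root of Sigma1 Sigma2
    if np.iscomplexobj(covmean):                            # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))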
[8] chooses MoCoGAN [24], SAVP [48], and SRVP [49] as baselines. As discussed in [8], SAVP and SRVP were originally proposed to address future frame prediction, which is the task of predicting future frames from given previous frames. However, they can be adapted to our task without requiring major adjustments since future frame prediction is closely related to controllable video generation. Besides, MoCoGAN, which generates controllable videos of moving faces, body parts, and artificial objects, requires adjustment to handle the actions.

With this protocol, we evaluate the video generation quality of CVGI's action generator on the video reconstruction task over the BAIR robot pushing and Atari Breakout datasets and compare it with models including MoCoGAN [24], SAVP [48], SRVP [49], and CADDY [8].

In Tables I and II, we report the evaluation results in two groups according to the resolution of the generated frames. The first group, reported at the top, shows the comparison of models that are trained with frames in low resolution (64 × 64 for BAIR robot pushing, 128 × 48 for Atari Breakout). After the frames are generated, they are rescaled to high resolution (256 × 256 for BAIR robot pushing and 160 × 210 for Atari Breakout). MoCoGAN, SAVP, and SRVP are originally proposed to generate frames
Fig. 10. Ablation results on the BAIR robot pushing and Atari Breakout datasets. Initial frames and the frames generated with different textual instructions by CVGI and by the ablation model that does not incorporate the motion estimation layer are shown. Gray vertical dotted lines show the position of the object of interest in the initial frame for clarity.

motion of the object of interest as motions are continuous as well.

V. CONCLUSION

In this work, we propose a controllable video generation framework that provides detailed control over the motion of the object of interest to generate novel videos with text-based instructions. It incorporates a motion estimation layer to divide the task into two sub-tasks: control signal estimation and action generation. Our model learns to plan the motion of the object of interest according to instructions in control signal estimation and to generate photo-realistic action videos in action generation. Experimental results on benchmark datasets demonstrate the effectiveness of our model. In the future, we plan to extend our model into an end-to-end model.

REFERENCES

[1] Y.-F. Zhou et al., "BranchGAN: Unsupervised mutual image-to-image transfer with a single encoder and dual decoders," IEEE Trans. Multimedia, vol. 21, no. 12, pp. 3136–3149, Dec. 2019.
[2] H.-Y. Lee et al., "DRIT++: Diverse image-to-image translation via disentangled representations," Int. J. Comput. Vis., vol. 128, pp. 2402–2417, 2020.
[3] B. Li, X. Qi, T. Lukasiewicz, and P. Torr, "Controllable text-to-image generation," in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32, pp. 2065–2075.
[4] J. Lin et al., "Exploring explicit domain supervision for latent space disentanglement in unpaired image-to-image translation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1254–1266, Apr. 2021.
[5] B. Zhu and C.-W. Ngo, "CookGAN: Causality based text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5519–5527.
[6] J. Huang, J. Liao, and S. Kwong, "Semantic example guided image-to-image translation," IEEE Trans. Multimedia, vol. 23, pp. 1654–1665, 2021.
[7] Y.-H. Kwon and M.-G. Park, "Predicting future frames using retrospective cycle GAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 1811–1820.
[8] W. Menapace, S. Lathuilière, S. Tulyakov, A. Siarohin, and E. Ricci, "Playable video generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10061–10070.
[9] M. Nawhal, M. Zhai, A. Lehrmann, L. Sigal, and G. Mori, "Generating videos of zero-shot compositions of actions and objects," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 382–401.
[10] W. Wang, X. Alameda-Pineda, D. Xu, E. Ricci, and N. Sebe, "Learning how to smile: Expression video generation with conditional adversarial recurrent nets," IEEE Trans. Multimedia, vol. 22, no. 11, pp. 2808–2819, Nov. 2020.
[11] R. Cui, Z. Cao, W. Pan, C. Zhang, and J. Wang, "Deep gesture video generation with learning on regions of interest," IEEE Trans. Multimedia, vol. 22, no. 10, pp. 2551–2563, Oct. 2020.
[12] C. Wu et al., "NÜWA: Visual synthesis pre-training for neural visual world creation," in Proc. 17th Eur. Conf. Comput. Vis., Oct. 23–27, 2022, pp. 720–736.
[13] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, "StarGAN v2: Diverse image synthesis for multiple domains," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8188–8197.
[14] A. Koksal and S. Lu, "RF-GAN: A light and reconfigurable network for unpaired image-to-image translation," in Proc. Asian Conf. Comput. Vis., 2020, pp. 1–18.
[15] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, "GauGAN: Semantic image synthesis with spatially adaptive normalization," in Proc. ACM SIGGRAPH Real-Time Live!, 2019, pp. 1–1.
[16] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2223–2232.
[17] Z. Hao, X. Huang, and S. Belongie, "Controllable video generation with sparse trajectories," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7854–7863.
[18] T. Marwah, G. Mittal, and V. N. Balasubramanian, "Attentive semantic video generation using captions," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1426–1434.
[19] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, "Animating arbitrary objects via deep motion transfer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2377–2386.
[20] Y. Yan, B. Ni, W. Zhang, J. Xu, and X. Yang, "Structure-constrained motion sequence generation," IEEE Trans. Multimedia, vol. 21, no. 7, pp. 1799–1812, Jul. 2019.
[21] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody dance now," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 5933–5942.
[22] P. Esser et al., "Towards learning a realistic rendering of human behavior," in Proc. Eur. Conf. Comput. Vis. Workshops, 2018, pp. 1–17.
[23] P. Esser, E. Sutter, and B. Ommer, "A variational U-Net for conditional appearance and shape generation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8857–8866.
[24] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: Decomposing motion and content for video generation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1526–1535.
[25] O. Wiles, A. Koepke, and A. Zisserman, "X2Face: A network for controlling face generation using images, audio, and pose codes," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 670–686.
[26] P. Ardino, M. D. Nadai, B. Lepri, E. Ricci, and S. Lathuilière, "Click to move: Controlling video generation with sparse motion," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 14749–14758.
[27] S. W. Kim, Y. Zhou, J. Philion, A. Torralba, and S. Fidler, "Learning to simulate dynamic environments with GameGAN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1231–1240.
[28] O. Gafni, L. Wolf, and Y. Taigman, "Vid2Game: Controllable characters extracted from real-world videos," 2019, arXiv:1904.08379.
[29] D. Damen et al., "Scaling egocentric vision: The EPIC-KITCHENS dataset," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 720–736.
[30] F. Ebert, C. Finn, A. X. Lee, and S. Levine, "Self-supervised visual planning with temporal skip connections," in Proc. Conf. Robot Learn., 2017, pp. 344–356.
[31] J. Walker, K. Marino, A. Gupta, and M. Hebert, "The pose knows: Video forecasting by generating pose futures," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3332–3341.
[32] R. Villegas et al., "Learning to generate long-term future via hierarchical prediction," in Proc. Int. Conf. Mach. Learn., 2017, pp. 3560–3569.
[33] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, "First order motion model for image animation," in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32, pp. 1–11.
[34] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, "Action-conditional video prediction using deep networks in Atari games," in Proc. Adv. Neural Inf. Process. Syst., 2015, vol. 28, pp. 1–9.
[35] C. Finn, I. Goodfellow, and S. Levine, "Unsupervised learning for physical interaction through video prediction," in Proc. Adv. Neural Inf. Process. Syst., 2016, vol. 29, pp. 1–9.
[36] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, "CogVideo: Large-scale pretraining for text-to-video generation via transformers," 2022, arXiv:2205.15868.
[37] R. Villegas et al., "Phenaki: Variable length video generation from open domain textual description," 2022, arXiv:2210.02399.
[38] M. Saito, S. Saito, M. Koyama, and S. Kobayashi, "Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN," Int. J. Comput. Vis., vol. 128, no. 10, pp. 2586–2606, 2020.
[39] J. Ho et al., "Video diffusion models," 2022, arXiv:2204.03458.
[40] U. Singer et al., "Make-A-Video: Text-to-video generation without text-video data," 2022, arXiv:2209.14792.
[41] X. Mao et al., "Least squares generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2794–2802.
[42] K. Sun, B. Xiao, D. Liu, and J. Wang, "Deep high-resolution representation learning for human pose estimation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5686–5696.
[43] Y. Li, M. Liu, and J. M. Rehg, "In the eye of beholder: Joint learning of gaze and actions in first person video," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 619–635.
[44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, 2014, arXiv:1412.6980.
[45] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6629–6640. [Online]. Available: http://dl.acm.org/citation.cfm?id=3295222.3295408
[46] T. Unterthiner et al., "Towards accurate generative models of video: A new metric & challenges," 2018, arXiv:1812.01717.
[47] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595.
[48] A. X. Lee et al., "Stochastic adversarial video prediction," 2018, arXiv:1804.01523.
[49] J.-Y. Franceschi, E. Delasalles, M. Chen, S. Lamprier, and P. Gallinari, "Stochastic latent residual video prediction," in Proc. Int. Conf. Mach. Learn., 2020, pp. 3233–3246.
[50] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[51] T. Salimans et al., "Improved techniques for training GANs," in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2234–2242. [Online]. Available: http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf
[52] E. Denton and R. Fergus, "Stochastic video generation with a learned prior," in Proc. Int. Conf. Mach. Learn., 2018, pp. 1174–1183.
[53] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, "Stochastic variational video prediction," 2017, arXiv:1710.11252.
[54] R. Rakhimov, D. Volkhonskiy, A. Artemov, D. Zorin, and E. Burnaev, "Latent video transformer," 2020, arXiv:2006.10704.
[55] A. Clark, J. Donahue, and K. Simonyan, "Adversarial video generation on complex datasets," 2019, arXiv:1907.06571.
[56] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, "VideoGPT: Video generation using VQ-VAE and transformers," 2021, arXiv:2104.10157.
[57] P. Luc et al., "Transformation-based adversarial video prediction on large-scale data," 2020, arXiv:2003.04035.
[58] C. Nash et al., "Transframer: Arbitrary frame prediction with generative models," 2022, arXiv:2203.09494.
[59] Y. Seo, K. Lee, F. Liu, S. James, and P. Abbeel, "HARP: Autoregressive latent video prediction with high-fidelity image generator," in Proc. IEEE Int. Conf. Image Process., 2022, pp. 3943–3947.
[60] G. Le Moing, J. Ponce, and C. Schmid, "CCVS: Context-aware controllable video synthesis," in Proc. Adv. Neural Inf. Process. Syst., 2021, vol. 34, pp. 14042–14055.
[61] D. Weissenborn, O. Täckström, and J. Uszkoreit, "Scaling autoregressive video models," 2019, arXiv:1906.02634.
[62] M. Babaeizadeh et al., "FitVid: Overfitting in pixel-level video prediction," 2021, arXiv:2106.13195.
[63] V. Voleti, A. Jolicoeur-Martineau, and C. Pal, "Masked conditional video diffusion for prediction, generation, and interpolation," 2022, arXiv:2205.09853.
[64] T. Höppe, A. Mehrjou, S. Bauer, D. Nielsen, and A. Dittadi, "Diffusion models for video prediction and infilling," 2022, arXiv:2206.07696.

Ali Köksal received the B.Sc. and M.Sc. degrees in computer engineering from the Izmir Institute of Technology, Izmir, Turkey, in 2014 and 2017. He is currently working toward the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, in 2022. He is also a Scientist with the Visual Intelligence Department, Institute for Infocomm Research, A*STAR, Singapore.

Kenan Emir Ak received the B.Sc. and M.Sc. degrees in electrical and electronics engineering from Isik University, Istanbul, Turkey, and Bogazici University, Istanbul, Turkey, in 2014 and 2016, and the Ph.D. degree in electrical and computer engineering from the National University of Singapore, Singapore, in 2020. He is currently a Scientist with the Visual Intelligence Department, Institute for Infocomm Research, A*STAR, Singapore.

Ying Sun (Member, IEEE) received the B.Eng. degree from Tsinghua University, Beijing, China, in 1998, the M.Phil. degree from the Hong Kong University of Science and Technology, Hong Kong, in 2000, and the Ph.D. degree from Carnegie Mellon University, Pittsburgh, PA, USA, in 2004. She is currently a Senior Scientist with the Visual Intelligence Department, Institute for Infocomm Research, and the Centre for Frontier AI Research, A*STAR, Singapore.

Deepu Rajan (Member, IEEE) received the Bachelor of Engineering degree in electronics and communication engineering from the Birla Institute of Technology, Ranchi, India, the M.S. degree in electrical engineering from Clemson University, Clemson, SC, USA, and the Ph.D. degree from the Indian Institute of Technology Bombay, Mumbai, India. He is currently an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. From 1992 to 2002, he was a Lecturer with the Department of Electronics, Cochin University of Science and Technology, Kochi, India. His research interests include image processing, computer vision, and multimedia signal processing.

Joo Hwee Lim (Senior Member, IEEE) received the B.Sc. (Hons. I) and M.Sc. degrees in computer science from the National University of Singapore, Singapore, and the Ph.D. degree in computer science and engineering from UNSW, Sydney, NSW, Australia. He is currently a Principal Scientist and the Department Head (Visual Intelligence) with the Institute for Infocomm Research, A*STAR, Singapore. He has authored or coauthored 290 international refereed journal and conference papers in connectionist expert systems, neural-fuzzy systems, content-based image retrieval, medical image analysis, and human robot collaboration.