
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 26, 2024

Controllable Video Generation With Text-Based Instructions

Ali Köksal, Kenan E. Ak, Ying Sun, Member, IEEE, Deepu Rajan, Member, IEEE, and Joo Hwee Lim, Senior Member, IEEE

Abstract—Most of the existing studies on controllable video generation either transfer disentangled motion to an appearance without detailed control over the motion, or generate videos of simple actions, such as the movement of arbitrary objects, conditioned on a control signal from users. In this study, we introduce the Controllable Video Generation with text-based Instructions (CVGI) framework, which allows text-based control over the action performed in a video. CVGI generates videos in which hands interact with objects to perform the desired action by generating hand motions with detailed control through text-based instructions from users. By incorporating the motion estimation layer, we divide the task into two sub-tasks: (1) control signal estimation and (2) action generation. In control signal estimation, an encoder models actions as a set of simple motions by estimating low-level control signals for text-based instructions with given initial frames. In action generation, generative adversarial networks (GANs) generate realistic hand-based action videos as a combination of hand motions conditioned on the estimated low-level control signals. Evaluations on several datasets (EPIC-Kitchens-55, BAIR robot pushing, and Atari Breakout) show the effectiveness of CVGI in generating realistic videos and in the control over actions.

Index Terms—Controllable video generation, video generation with textual instructions, motion generation, conditional generative models.

Manuscript received 22 July 2022; revised 30 October 2022, 2 January 2023, and 20 February 2023; accepted 14 March 2023. Date of publication 29 March 2023; date of current version 8 January 2024. The Associate Editor coordinating the review of this manuscript and approving it for publication was Dr. Jiebo Luo. (Corresponding author: Ali Koksal.)
Ali Köksal is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632, and also with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).
Kenan E. Ak is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632 (e-mail: kenan_emir_ak@i2r.a-star.edu.sg).
Ying Sun is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632, and also with the Centre for Frontier AI Research, A*STAR, Singapore 117602 (e-mail: [email protected]).
Deepu Rajan is with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).
Joo Hwee Lim is with the Department of Visual Intelligence, Institute for Infocomm Research, A*STAR, Singapore 138632, with the Centre for Frontier AI Research, A*STAR, Singapore 117602, and also with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore 639798 (e-mail: [email protected]).
This article has supplementary downloadable material available at https://doi.org/10.1109/TMM.2023.3262972, provided by the authors.
Digital Object Identifier 10.1109/TMM.2023.3262972

I. INTRODUCTION

DEEP architectural models such as convolutional neural networks (CNNs) and generative adversarial networks (GANs) enable the generation of high-dimensional data such as images [1], [2], [3], [4], [5], [6] and videos [7], [8], [9], [10], [11], [12]. These models can manipulate the given high-dimensional data conditioned on the desired manipulation. For example, image manipulation and editing architectures [13], [14], [15], [16] allow users to transfer the style from another image.

Motion manipulation according to text-based instructions on a video where a human interacts with objects in a complex scene is considerably more challenging, as there is no simple way to model the interaction. Besides, building a semantic association between instructions and motion is also challenging because text descriptions are often ambiguous for controllable video generation. In the literature, there exist video manipulation architectures such as [12], [17], [18], [19], [20] that allow users to manipulate the motion of objects in a video. They can be grouped into two groups according to the source of manipulation. Most of the existing approaches in the first group use driving videos as the source of manipulation by extracting actions. They can disentangle motion and transfer it to another appearance, but they lack detailed control over the motion during generation [9], [21], [22], [23], [24], [25]. In the second group, existing approaches use control signals received from an agent, such as mouse clicks [26], keystrokes [8], [27], or a joystick [28], but most of them can only generate videos of simple, displacement-based actions such as moving arbitrary objects. In contrast, the proposed framework allows detailed control over the motions of generated videos and can generate complex actions as a combination of simple motions.

With the motivation of building an association between text-based instructions and motions to manipulate the features of the generated motion, such as direction, speed, and target, this paper introduces a novel framework, named CVGI, that allows users to manipulate simple human-object interactions, such as hand/s going toward the desired object, in videos with complex scenes by conditioning on text-based instructions. CVGI receives a text-based instruction from a user and takes an initial frame as input to generate a video sequence that corresponds well with the user input. For example, Fig. 1 shows that CVGI can reconstruct the ground truth video by using the same text-based instruction as that of the ground truth. It can also generate novel videos with different text-based instructions. As shown, the generated videos are photo-realistic and correspond well with the text-based instructions.


Fig. 1. The proposed framework, named CVGI, generates controllable videos conditioned on text-based instructions received from a user. CVGI generates novel photo-realistic videos from an initial frame and textual instructions. Generated frames are, from top: ground truth, duplication of the ground truth, and two novel videos. In the duplication of the ground truth, CVGI generates hands at similar positions to the ground truth. In the novel videos, CVGI generates videos with different hand movements based on the textual instructions. Note that the boundary of the hand masks of the initial frame is indicated in blue in the generated frames to highlight the difference in hand movements.

Fig. 2. CVGI divides the task into two sub-tasks: control signal estimation and action generation. In the first sub-task, an encoder estimates low-level control
signals with a given initial frame conditioned on text-based instructions. In the second sub-task, two GANs (M2M and M2F) are trained one after another in a
loop. M2M is trained to control motion and generates the next mask of the object of interest conditioned on the estimated control signal. M2F is trained to perform
motion-aware mask-to-frame translation and generates realistic frames with given masks.

CVGI divides the task into two sub-tasks by incorporating the motion estimation layer, as seen in Fig. 2. The first sub-task, control signal estimation, encodes high-level text-based instructions into low-level control signals as a form of motion representation. The control signal encoder takes an instruction and an initial frame to estimate a set of low-level control signals for motions in the future frames. Low-level signals define the location change of the object of interest between two consecutive frames, such as the displacement of the center of mass of the hand masks or the displacement of the robot arm's gripper. The second sub-task, action generation, generates realistic videos frame-by-frame in a loop conditioned on the low-level signals. First, it generates the next frame with the initial frame and the first estimated low-level signal. It then takes the generated frame and the second estimated low-level signal to generate the third frame, and so on, until n frames are generated for all estimated low-level signals.
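The frame-by-frame rollout described above can be summarized with a short sketch. This is a minimal illustration only; the module names (estimate_control_signals, action_generator) and their signatures are assumptions for exposition, not the authors' released interface.

```python
# Sketch of CVGI's autoregressive generation loop (assumed interfaces).
# Control signals are estimated once from the initial frame and the instruction,
# then the action generator is applied frame by frame.
import torch

def generate_video(initial_frame, initial_mask, instruction,
                   estimate_control_signals, action_generator, n_frames=7):
    """Roll out n_frames conditioned on per-step low-level control signals."""
    with torch.no_grad():
        # (1) control signal estimation: one displacement per future frame
        deltas = estimate_control_signals(initial_frame, instruction)  # (n_frames, signal_dim)

        frames, masks = [], []
        frame, mask = initial_frame, initial_mask
        for i in range(n_frames):
            # (2) action generation: next mask from current mask + displacement,
            #     then motion-aware mask-to-frame translation
            next_mask = action_generator.m2m(mask, deltas[i])
            next_frame = action_generator.m2f(mask, frame, next_mask)
            frames.append(next_frame)
            masks.append(next_mask)
            frame, mask = next_frame, next_mask
    return frames, masks
```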

The EPIC-Kitchens-55 dataset [29] contains egocentric videos shot by a head-mounted camera. Egocentric videos provide useful visual input for understanding a person's activities because their scenes approximate the vision of the person wearing the camera. Despite these useful scenes, wearable cameras lead to a lot of motion and dynamic scenes in egocentric videos. In dynamic scenes, it is hard to focus on the object of interest and to represent its meaningful movement and interaction with low-level control signals. Consequently, motion generation in egocentric videos is extremely challenging. Moreover, to the best of our knowledge, CVGI is the first study that can generate controllable egocentric videos where the motion of the object of interest is controlled in detail. Therefore, CVGI incorporates hand masks to differentiate the object of interest to control, and the action generator synthesizes video sequences for egocentric videos by employing two GANs: mask-to-mask (M2M) and mask-to-frame (M2F). M2M models the association between the motions of the masks of the object of interest (i.e., the hand masks) and the low-level signals, and it is trained to synthesize masks that correspond with the low-level signals. It takes the masks of input frames and generates masks for the next frames conditioned on the low-level signal. M2F GAN maps masks to frames while being aware of the hands' motion to generate realistic frames. For the BAIR robot pushing [30] and Atari Breakout [8] datasets, we employ a single GAN similar to M2M.

The contributions of this work can be summarized in three aspects: 1) We propose CVGI, which generates novel human-object interaction videos where hand/s go toward the desired object by manipulating motions on complex scenes conditioned on text-based instructions from users. 2) We overcome the challenge with two innovations: control signal estimation models human-object interaction in terms of motions and builds the association between text-based instructions and low-level control signals; action generation models motions with low-level signals and generates realistic videos. 3) With evaluations on three public datasets, EPIC-Kitchens-55 [29], BAIR robot pushing [30], and Atari Breakout [8], we demonstrate that CVGI generates photo-realistic videos that correspond well with instructions.

II. RELATED WORK

In the literature, many studies for novel video generation exist, and they can be grouped into two according to the source of manipulation. The first group learns a motion that is extracted from a video and transfers the motion to another object/subject. The second group receives user input to control the motion and generates novel videos that correspond well with the user input. Another important difference between the groups, other than the source of manipulation, is the ability to control the synthesized motion in detail, which can be defined as explicit control. Although studies in the first group can control the motion, they are limited in explicitly controlling the synthesized motion as they transfer the motion learned from a source video as it is. On the other hand, studies in the second group learn to associate motions and user input and can manipulate the motion according to user input.

Most existing studies that transfer motions learn content, which is the object and its appearance, and motion, which is the dynamics of the content, and they generate videos of moving faces, human bodies, and arbitrary objects. [25] proposes a framework that generates controllable videos of human faces conditioned on a driving vector that can be extracted from a given video, audio, or pose vector. [24] builds separate latent spaces for the content and motion of videos for the generation of novel videos of faces, human bodies, and artificial objects by controlling content and motion latent vectors. [31] and [32] predict future frames with a given input frame conditioned on estimated future human body poses. [19] introduces a framework that animates an arbitrary object in a given image conditioned on a motion that is derived from a driving video sequence by using sparse keypoint trajectories. In [33], a similar framework addresses the same problem without using any annotation. [23] distinguishes the appearance and pose of humans and generates images of a given appearance with different poses. Similar to [23], [22] generates controllable human behavior by transferring motion that is rendered over keypoints of the human body. [21] generates dance videos by transferring motion with a network that is trained to translate pose to appearance and vice versa. [9] introduces a framework that includes a generator and multiple discriminators and disentangles the actions and objects of a given video. It generates human-object interaction videos based on a text description by replacing objects to generate novel videos. The aforementioned studies disentangle the content and motion of videos and transfer motions to generate novel videos of contents. In contrast, our framework controls the motion of objects according to user input. It generates novel videos by manipulating a given frame conditioned on control signals that are estimated from user input.

Most of the studies that receive user input as the source of manipulation generate novel videos by synthesizing simple motions that depict the desired motion on the generated frames. [34] introduces a framework that generates action-conditional videos in Atari games by predicting future frames with given previous frames and an action label for player actions. [35] generates action-conditioned videos of robotic arm actions by predicting future frames over long ranges from previous frames. [18] introduces a framework that generates variable-length videos of artificial or arbitrary objects conditioned on captions by separately learning short-term and long-term context. [17] generates controllable videos from a given frame conditioned on sparse trajectories specified by users and improves the video quality by hallucinating pixels that cannot be copied based on flow. [28] extracts a character from a given video and generates videos of the extracted character performing motions that are controllable with low-level signals received from an agent, on any background. The framework has two modules: the first generates poses corresponding with signals from an agent such as a joystick, and the second translates poses to frames. [27] is trained to imitate game engines and renders the next screen conditioned on keystrokes by users. [8] learns actions in an unsupervised manner to cluster motions and then generates videos of discrete actions from a given initial frame conditioned on keystrokes from users. [26] proposes a framework that generates videos where the motion of specific objects can be controlled through mouse clicks. It receives an input frame, its segmentation map, and a mouse click, and incorporates a graph convolution network to model the motions of objects. As summarized above, most existing studies generate controllable videos with low-level signals received from an agent such as a keyboard, joystick, or mouse. In contrast, our framework builds a semantic association between text-based instructions and motions. This association allows controlling generated videos according to text-based input that can describe complex actions such as human-object interaction.

In addition to the above-mentioned video generation frameworks, there exist recent studies that generate videos based on text. [12] proposes a generic solution for various visual generation tasks; its generic model can also generate videos based on text. [36] proposes a framework, CogVideo, for text-to-video generation. CogVideo includes a transformer-based architecture and uses a pretrained text-to-image model. Similarly, [37] employs a bi-directional masked transformer-based model. It is trained with a large amount of text-image pairs and a smaller amount of video-text pairs.

[38] proposes an efficient video generation model. It is a GAN-based model and can control the generated videos by conditioning on a discrete category label. Likewise, [39] is a conditional model, but it is based on a diffusion architecture instead of a GAN. Similarly, [40] employs a diffusion model; the proposed text-to-video model leverages a text-to-image model. The summarized video generation models can generate videos that are aligned with the given text or condition, but they are limited in explicit control over the motion. For example, [40] can generate a flying dog based on text, but it cannot manipulate the generated video by controlling the motion, such as changing its direction. Furthermore, they are similar to the second group in terms of the source of manipulation and similar to the first group in terms of the ability to explicitly control the generated motion. On the other hand, our framework can control the motion explicitly with text-based instructions.

III. APPROACH

CVGI learns to generate realistic and temporally consistent videos by manipulating motion on the complex scenes of egocentric videos according to the given text-based instructions. Fig. 2 shows the overall flow of the framework. CVGI takes an initial frame F0 as the context image, a mask of the object of interest M0 (such as the mask of the hands), and an instruction d, and it generates the n next frames F̂1, F̂2, ..., F̂n that correspond to the given instruction. We divide the task into two sub-tasks: control signal estimation and action generation. The control signal estimator builds the association between text-based instructions and low-level control signals Δ̂1, Δ̂2, ..., Δ̂n. For egocentric videos such as those in EPIC-Kitchens-55, the action generator, which consists of two GANs (M2M and M2F), generates frames according to the control signals in a loop. First, it generates future masks and then translates them to frames one by one. Note that, for the BAIR robot pushing and Atari Breakout datasets, the action generator consists of one GAN similar to M2M GAN, with two differences. First, it directly takes the initial frame and generates the next frame conditioned on the low-level signals. Second, its generator's image loss is changed to the L2 norm instead of the L1 norm to improve the visual quality.

Fig. 3. Control signal estimator includes an embedding layer and a CNN-based encoder. It takes the initial frame and text-based instructions and estimates low-level control signals for the object of interest.

A. Control Signal Estimator

As illustrated in Fig. 3, the control signal estimator E converts high-level text-based instructions that describe the actions into a set of low-level control signals. It takes initial frames as context images along with instructions to predict the motion for the next frames. It contains an embedding layer and a CNN-based encoder. The embedding layer takes a textual instruction d and computes a text embedding. The encoder, conditioned on the text embedding, predicts a set of low-level control signals (displacements) Δ̂1, Δ̂2, ..., Δ̂n for the object that is desired to perform the motion with the given initial frame F0. They are trained to minimize the mean square error (MSE) computed between the ground truth control signals Δ1, Δ2, ..., Δn and the estimated control signals Δ̂1, Δ̂2, ..., Δ̂n as follows:

L = \frac{1}{n} \sum_{i=1}^{n} \left\| \Delta_i - \hat{\Delta}_i \right\|_2, \quad \text{where } \hat{\Delta}_{1,2,\ldots,n} = E(F_0, d). \quad (1)
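The training objective in (1) reduces to a standard regression loss over the predicted displacement sequence. The sketch below illustrates one optimization step; the encoder architecture and the ControlSignalEstimator class are hypothetical stand-ins for exposition, not the authors' implementation.

```python
# Minimal sketch of the control signal estimator's training step (assumed model class).
import torch
import torch.nn as nn

class ControlSignalEstimator(nn.Module):
    """Hypothetical estimator: instruction embedding + CNN encoder over the initial frame."""
    def __init__(self, vocab_size, embed_dim=128, n_frames=7, signal_dim=4):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)        # text embedding
        self.cnn = nn.Sequential(                                  # frame encoder
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64 + embed_dim, n_frames * signal_dim)
        self.n_frames, self.signal_dim = n_frames, signal_dim

    def forward(self, frame, instruction_tokens):
        feat = torch.cat([self.cnn(frame), self.embed(instruction_tokens)], dim=1)
        return self.head(feat).view(-1, self.n_frames, self.signal_dim)

def training_step(model, optimizer, frame, instruction_tokens, gt_deltas):
    # Eq. (1): mean L2 error between ground-truth and estimated displacements.
    pred_deltas = model(frame, instruction_tokens)
    loss = torch.mean(torch.norm(gt_deltas - pred_deltas, dim=-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```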

B. Action Generator

The action generator aims to manipulate the motion of the object according to the low-level signals. It employs two GANs, mask-to-mask (M2M) and mask-to-frame (M2F), which are trained in a loop. M2M GAN synthesizes the motion over masks, and M2F GAN is responsible for motion-aware photo-realistic frame synthesis.

Fig. 4. Forward training of M2M GAN that includes (a) a conditional generator Gmask and (b) a sequence discriminator Dmask. Gmask is trained to generate masks conditioned on low-level control signals, which contain four floating-point numbers: the first two control the right hand and the next two control the left hand. Dmask is a sequence discriminator with two heads. The first head is trained to distinguish real and fake sequences. The second head is trained to estimate the displacement of the objects of interest.

Mask-to-mask GAN: employs a conditional generator Gmask and a sequence discriminator Dmask. M2M GAN uses a continuous signal as the condition in Gmask and as the ground truth in Dmask, unlike most existing conditional GANs, which usually use discrete signals such as category labels. Using continuous signals improves the ability to control motion, as elaborated in the ablation study. Gmask is an encoder-decoder-based generator that takes a single mask and is trained to generate another mask conditioned on the low-level control signal. During training, it is trained in both the forward and backward directions by changing the order of the frames to capture more variability. Training with both forward and backward passes thus increases the variation of training samples and makes it possible to learn a motion together with its reverse; it also improves the understanding of the motion's direction. Fig. 4 shows only the forward direction for the sake of visual simplicity. As seen in Fig. 4(a), Gmask is trained to generate the next mask from the initial mask conditioned on the corresponding low-level control signal, i.e., Gmask(Mi, Δi+1) → M̂i+1. For backward training, the initial frame is generated from the next frame conditioned on the inverse of the control signal (negative displacement), i.e., Gmask(Mi+1, (Δi+1)^{-1}) → M̃i. As seen in Fig. 4(b), Dmask takes two concatenated consecutive frames as input and is trained to distinguish real and fake sequences. On top of Dmask, we introduce an auxiliary regressor to estimate the motion of the objects of interest in the given consecutive frames. Thus, Dmask is a sequence discriminator with two heads, i.e., Dmask(Mi, Mi+1) → {D^adv_mask(Mi, Mi+1), D^reg_mask(Mi, Mi+1)}. The first is an adversarial head that distinguishes sequences of two frames as real or fake and provides adversarial training. The second is a regressor head that measures the displacement of the objects of interest in the given two frames.

During training, for each pair of consecutive masks and the low-level control signal (displacement) {{Mi, Mi+1}, Δi+1}, Gmask and Dmask are trained in an adversarial manner, where the masks generated toward the forward direction, M̂i+1, and the backward direction, M̃i, are denoted as follows:

\hat{M}_{i+1} = \mathbb{E}_{M_i, \Delta_{i+1}}\, G_{mask}(M_i, \Delta_{i+1}), \qquad \tilde{M}_i = \mathbb{E}_{M_{i+1}, \Delta_{i+1}}\, G_{mask}(M_{i+1}, (\Delta_{i+1})^{-1}). \quad (2)

Gmask learns to construct the next mask conditioned on the ground truth control signals and the given mask by minimizing the following image loss function:

L_{img} = \left\| M_{i+1} - \hat{M}_{i+1} \right\|_1 + \left\| M_i - \tilde{M}_i \right\|_1. \quad (3)

In addition, the least squares adversarial loss [41] computed by Dmask's adversarial head is employed as follows to make generated masks indistinguishable from real masks:

L^G_{adv} = (D^{adv}_{mask}(M_i, \hat{M}_{i+1}))^2 + (D^{adv}_{mask}(\tilde{M}_i, M_{i+1}))^2. \quad (4)

Besides, the following regression loss (MSE) is computed between the ground truth control signals and the control signals estimated by the auxiliary regressor, so as to enforce that generated masks correspond well with the low-level control signals:

L^G_{cs} = \left\| D^{reg}_{mask}(M_i, \hat{M}_{i+1}) - \Delta_{i+1} \right\|_2 + \left\| D^{reg}_{mask}(\tilde{M}_i, M_{i+1}) - \Delta_{i+1} \right\|_2. \quad (5)

The full objective function of Gmask is formulated as follows:

L^G_{mask} = \lambda_{img} L_{img} + \lambda_{adv} L^G_{adv} + \lambda_{cs} L^G_{cs}, \quad (6)

where λimg, λadv, and λcs are positive weights to balance the loss functions; their default values are 1.0, 1.0, and 10.0, respectively.

Dmask is trained to minimize the following least squares adversarial loss to distinguish real and fake sequences:

L^D_{adv} = (D^{adv}_{mask}(M_i, M_{i+1}))^2 + \frac{1}{2}(D^{adv}_{mask}(M_i, \hat{M}_{i+1}) - 1)^2 + \frac{1}{2}(D^{adv}_{mask}(\tilde{M}_i, M_{i+1}) - 1)^2. \quad (7)

In addition, Dmask learns to predict the displacement of the object by minimizing the following loss:

L^D_{cs} = \left\| D^{reg}_{mask}(M_i, M_{i+1}) - \Delta_{i+1} \right\|_2. \quad (8)

The full objective function to optimize Dmask is defined as follows:

L^D_{mask} = L^D_{adv} + L^D_{cs}. \quad (9)
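For concreteness, the generator and discriminator objectives in (3)-(9) can be assembled as below. This is an illustrative sketch under assumed module interfaces (G_mask, and D_mask returning an adversarial score together with a regressed displacement); it is not the authors' code.

```python
# Sketch of the M2M GAN losses in (3)-(9), assuming D_mask(m_a, m_b) -> (adv_score, pred_delta).
import torch
import torch.nn.functional as F

def m2m_generator_loss(G_mask, D_mask, m_i, m_next, delta,
                       lambda_img=1.0, lambda_adv=1.0, lambda_cs=10.0):
    m_hat = G_mask(m_i, delta)            # forward pass, Eq. (2)
    m_tilde = G_mask(m_next, -delta)      # backward pass with inverted displacement
    l_img = F.l1_loss(m_hat, m_next) + F.l1_loss(m_tilde, m_i)            # Eq. (3)
    adv_f, reg_f = D_mask(m_i, m_hat)
    adv_b, reg_b = D_mask(m_tilde, m_next)
    l_adv = (adv_f ** 2).mean() + (adv_b ** 2).mean()                     # Eq. (4)
    l_cs = torch.norm(reg_f - delta, dim=-1).mean() \
         + torch.norm(reg_b - delta, dim=-1).mean()                       # Eq. (5)
    return lambda_img * l_img + lambda_adv * l_adv + lambda_cs * l_cs     # Eq. (6)

def m2m_discriminator_loss(G_mask, D_mask, m_i, m_next, delta):
    with torch.no_grad():
        m_hat = G_mask(m_i, delta)
        m_tilde = G_mask(m_next, -delta)
    adv_real, reg_real = D_mask(m_i, m_next)
    adv_fake_f, _ = D_mask(m_i, m_hat)
    adv_fake_b, _ = D_mask(m_tilde, m_next)
    l_adv = (adv_real ** 2).mean() \
          + 0.5 * ((adv_fake_f - 1) ** 2).mean() \
          + 0.5 * ((adv_fake_b - 1) ** 2).mean()                          # Eq. (7)
    l_cs = torch.norm(reg_real - delta, dim=-1).mean()                    # Eq. (8)
    return l_adv + l_cs                                                   # Eq. (9)
```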

Fig. 5. Forward training of Gframe. Gframe is trained to generate the next frame by taking the initial frame, its mask, and the mask of the next frame.

Fig. 6. Forward training of the discriminators of M2F GAN. Dframe (a) is trained to distinguish real and fake frames, Dfg (b) is trained to distinguish real and fake objects of interest, and Dbg (c) is trained to distinguish real and fake backgrounds. σ and φ denote the operations to compute hand frames and background frames, respectively.

Mask-to-frame GAN: employs a generator Gframe and three frame-based discriminators Dframe, Dfg, and Dbg for motion-aware mask-to-frame translation. M2F GAN is only employed for EPIC-Kitchens-55 because the videos in this dataset are egocentric, shot by a head-mounted camera. In egocentric videos, every object appears to move in most of the frames due to the camera motion, which may cause the generator to confuse which object is the object of interest. Therefore, controllable video generation in egocentric videos requires indicating the object of interest, and for this reason CVGI uses hand masks. After we achieve control of the hand motions with masks, we employ M2F GAN to translate masks to frames. As shown in Fig. 5, Gframe takes a context frame (the initial frame), the corresponding hand mask, and the mask for the next frame that indicates the hands' new location. It is trained to hallucinate pixels at the hands' location in the context image to remove them and to create hands at the new location. Similar to Gmask, Gframe is trained in the forward and backward directions by changing the inputs. Figs. 5 and 6 show only forward training for simplicity. For forward training, Gframe is trained to generate the next frame from the initial frame, its mask, and the mask of the next frame, Gframe(Mi, Fi, Mi+1) → F̂i+1. For backward training, it is trained to generate the initial frame from the next frame, its mask, and the mask of the initial frame, Gframe(Mi+1, Fi+1, Mi) → F̃i. As seen in Fig. 6, three frame-based discriminators are employed: Dframe is trained to distinguish real and fake frames, Dfg takes frames where the background is masked to distinguish real and fake objects of interest, and Dbg takes frames where the foreground is masked to distinguish real and fake backgrounds.

Gframe and Dframe are trained with consecutive frames and their corresponding masks {{Fi, Fi+1}, {Mi, Mi+1}}, where the frames generated toward the forward direction, F̂i+1, and the backward direction, F̃i, are denoted as follows:

\hat{F}_{i+1} = \mathbb{E}_{M_i, F_i, M_{i+1}}\, G_{frame}(M_i, F_i, M_{i+1}), \qquad \tilde{F}_i = \mathbb{E}_{M_{i+1}, F_{i+1}, M_i}\, G_{frame}(M_{i+1}, F_{i+1}, M_i). \quad (10)

Gframe is trained with an image loss L_{img} and the least squares adversarial losses L^G_{frame}, L^G_{fg}, and L^G_{bg}, which are defined in (11). L^G_{frame}, computed by Dframe, leads to generating frames indistinguishable from real frames. L^G_{fg}, computed by Dfg, is trained with hand frames where the background is masked out and leads to producing realistic hands at the new position. L^G_{bg}, computed by Dbg, is trained with background frames where the hands are masked out and leads to hallucinating pixels at the hands' previous location.

L_{img} = \left\| F_{i+1} - \hat{F}_{i+1} \right\|_2 + \left\| F_i - \tilde{F}_i \right\|_2,
L^G_{frame} = (D_{frame}(\hat{F}_{i+1}))^2 + (D_{frame}(\tilde{F}_i))^2,
L^G_{fg} = (D_{fg}(\sigma(\hat{F}_{i+1}, M_{i+1})))^2 + (D_{fg}(\sigma(\tilde{F}_i, M_i)))^2,
L^G_{bg} = (D_{bg}(\phi(\hat{F}_{i+1}, M_{i+1})))^2 + (D_{bg}(\phi(\tilde{F}_i, M_i)))^2, \quad (11)

where σ and φ denote the operations to compute hand frames and background frames, respectively. Gframe is optimized with the following full objective function:

L^G_{frame} = \lambda_{img} L_{img} + \lambda_{frame} L_{frame} + \lambda_{fg} L^G_{fg} + \lambda_{bg} L^G_{bg}, \quad (12)

where λimg, λframe, λfg, and λbg are positive weights to balance the loss functions; their default values are 10.0, 1.0, 1.0, and 1.0, respectively. Furthermore, Dframe, Dfg, and Dbg are trained to minimize L^D_{frame}, L^D_{fg}, and L^D_{bg}, respectively:

L^D_{frame} = (D_{frame}(F_i))^2 + (D_{frame}(F_{i+1}))^2 + (D_{frame}(\hat{F}_{i+1}) - 1)^2 + (D_{frame}(\tilde{F}_i) - 1)^2,
L^D_{fg} = (D_{fg}(\sigma(F_i, M_i)))^2 + (D_{fg}(\sigma(F_{i+1}, M_{i+1})))^2 + (D_{fg}(\sigma(\hat{F}_{i+1}, M_{i+1})) - 1)^2 + (D_{fg}(\sigma(\tilde{F}_i, M_i)) - 1)^2,
L^D_{bg} = (D_{bg}(\phi(F_i, M_i)))^2 + (D_{bg}(\phi(F_{i+1}, M_{i+1})))^2 + (D_{bg}(\phi(\hat{F}_{i+1}, M_{i+1})) - 1)^2 + (D_{bg}(\phi(\tilde{F}_i, M_i)) - 1)^2. \quad (13)
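The masking operations σ and φ in (11)-(13) simply keep the hand region or the background before a frame is passed to the corresponding discriminator. A minimal sketch is given below, assuming binary hand masks in [0, 1] and the same assumed discriminator interfaces as in the earlier sketches; the exact masking used by the authors may differ.

```python
# Sketch of the foreground/background masking used by the M2F discriminators.
# sigma keeps only the hand pixels; phi keeps only the background pixels.
import torch

def sigma(frame, mask):
    """Hand frame: zero out the background (mask is 1 on hands, 0 elsewhere)."""
    return frame * mask

def phi(frame, mask):
    """Background frame: zero out the hand region."""
    return frame * (1.0 - mask)

def m2f_discriminator_losses(D_frame, D_fg, D_bg, f_i, f_next, m_i, m_next, f_hat, f_tilde):
    # Eq. (13): real samples pushed toward 0, generated samples toward 1.
    l_frame = (D_frame(f_i) ** 2).mean() + (D_frame(f_next) ** 2).mean() \
            + ((D_frame(f_hat) - 1) ** 2).mean() + ((D_frame(f_tilde) - 1) ** 2).mean()
    l_fg = (D_fg(sigma(f_i, m_i)) ** 2).mean() + (D_fg(sigma(f_next, m_next)) ** 2).mean() \
         + ((D_fg(sigma(f_hat, m_next)) - 1) ** 2).mean() + ((D_fg(sigma(f_tilde, m_i)) - 1) ** 2).mean()
    l_bg = (D_bg(phi(f_i, m_i)) ** 2).mean() + (D_bg(phi(f_next, m_next)) ** 2).mean() \
         + ((D_bg(phi(f_hat, m_next)) - 1) ** 2).mean() + ((D_bg(phi(f_tilde, m_i)) - 1) ** 2).mean()
    return l_frame, l_fg, l_bg
```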

Fig. 7. Qualitative evaluation on the EPIC-Kitchens-55 dataset. Given the initial frame and mask and different instructions, the estimated low-level control signals, estimated masks, and frames generated by CVGI are shown for three different textual instructions. Δ̂_rx, Δ̂_ry, Δ̂_lx, and Δ̂_ly denote the estimated low-level control signals in 2D for the right and left hand, respectively.

IV. EXPERIMENTS

We evaluate our approach on three public datasets: EPIC-Kitchens-55, where there are two objects of interest (the hands) whose motion is controlled in 2D; the BAIR robot pushing dataset, where there is a single object of interest (the robotic arm) whose motion is controlled in 2D; and Atari Breakout, where there is a single object of interest (the base of the Breakout game) whose motion is controlled in 1D.

EPIC-Kitchens-55 dataset [29]: contains approximately 40 k first-person videos where humans interact with objects during daily activities in the kitchen. Actions in the video clips are annotated with a text-based description composed of an action label (verb) and an object label (noun). In the evaluation with EPIC-Kitchens-55, we use the video clips where at least one hand is visible. CVGI is trained with the video clips of the first kitchen (P01).

Hand masks for the M2M and M2F GANs are extracted automatically by the pretrained hand-segmentation model introduced in [42]. The model is trained on the Extended GTEA Gaze+ dataset [43] for 100 epochs. The trained model is used to extract hand masks for each frame of EPIC-Kitchens-55. In addition to extending the dataset with hand masks, we also extend the annotations with the low-level control signals. As the ground truth control signals, we compute the displacements of the center of mass of the hand masks for every two consecutive frames. Furthermore, we augment the masks of consecutive frames by flipping them horizontally (reflection) and warping the masks with random translations in the x and y directions.

BAIR robot pushing dataset [30]: contains roughly 44 k video clips of a robotic arm pushing objects on a table. Each video clip consists of 30 frames at 256 × 256 resolution. Besides, the dataset provides the ground truth location of the robotic arm's gripper. In order to evaluate our approach, we extend the annotations of the dataset with low-level control signals and text-based instructions. Thus, we compute the displacement of the gripper for every two consecutive frames as the ground truth low-level signals. We prepare text-based instructions composed of a verb and an adverb over the computed displacements. Verbs depict the action with nine variations (8 for directions and 1 for stationary) and adverbs depict the speed of the motion with three variations (slowly, -, and quickly). The combination of verbs and adverbs composes 25 unique actions in the space of text-based instructions.

Atari Breakout dataset [8]: contains roughly 1400 video clips at 160 × 210 resolution from the Atari Breakout video game environment. Similar to the BAIR robot pushing dataset, we extend the annotations of the dataset with low-level control signals and text-based instructions. The displacement of the base is computed with respect to the location of the base's left-most pixel. Text-based instructions are prepared from the computed displacements. Although the variation of adverbs is three, as in the BAIR robot pushing dataset, the variation of verbs is three due to the one-dimensional motion, so the action space of text-based instructions has 7 unique actions.

A. Training Details

CVGI's modules are trained separately with video sequences of the training set that contain a set of frames, a set of masks, low-level control signals, and a text-based instruction, S : {{F0, F1, ..., Fn}, {M0, M1, ..., Mn}, {Δ0, Δ1, ..., Δn}, d}. In all experiments, the modules are trained from scratch for 500 k iterations, and we use the Adam optimizer [44] with a batch size of 16, a learning rate of 0.0002, β1 = 0.5, and β2 = 0.999. For EPIC-Kitchens-55, CVGI is trained to produce 7 future frames, i.e., the default value of the hyperparameter n is selected as 7 experimentally. Based on rigorous experimentation, we observe that 7 is an optimal number of future frames for avoiding excess accumulation of errors in terms of artifacts. On the other hand, for the BAIR robot pushing and Atari Breakout datasets, CVGI generates frames by producing only the next frame (the default value of n is 1) because motions in both datasets are simple motions that typically start in one frame and end in the next. To produce longer video sequences, generation can be re-initiated by using the last generated frame, its mask, and the text-based instruction for EPIC-Kitchens-55, and the last generated frame and the text-based instruction for the BAIR robot pushing and Atari Breakout datasets.
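As described above, the ground-truth control signals are derived directly from the annotations: center-of-mass displacements of the hand masks for EPIC-Kitchens-55 and gripper displacements mapped to verb/adverb instructions for BAIR robot pushing. The sketch below illustrates this derivation; the direction names and the speed/stationarity thresholds are illustrative assumptions, not the values used in the paper.

```python
# Sketch: deriving a ground-truth low-level control signal and a 'verb adverb'
# instruction from two consecutive binary masks (or gripper positions).
import numpy as np

def center_of_mass(mask):
    """Center of mass (x, y) of a binary mask, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return np.array([xs.mean(), ys.mean()])

def displacement(mask_t, mask_t1):
    """Ground-truth control signal: displacement of the mask's center of mass."""
    c0, c1 = center_of_mass(mask_t), center_of_mass(mask_t1)
    if c0 is None or c1 is None:
        return np.zeros(2)
    return c1 - c0

def to_instruction(delta, still_thresh=1.0, fast_thresh=8.0):
    """Map a 2D displacement to a 'verb adverb' instruction (illustrative thresholds)."""
    speed = np.linalg.norm(delta)
    if speed < still_thresh:
        return "stay"
    directions = ["right", "down-right", "down", "down-left",
                  "left", "up-left", "up", "up-right"]
    angle = np.arctan2(delta[1], delta[0])            # image y grows downward
    verb = "move " + directions[int(round(angle / (np.pi / 4))) % 8]
    adverb = " quickly" if speed > fast_thresh else ""
    return verb + adverb
```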

Fig. 8. Qualitative evaluation on the BAIR robot pushing dataset. Given the initial frame and different instructions, the estimated low-level control signals and the frames generated by CVGI are shown for different textual instructions. Δ̂_x and Δ̂_y denote the estimated low-level control signals in 2D for the robotic arm.

Fig. 9. Qualitative evaluation on the Atari Breakout dataset. Given the initial frame and different instructions, the estimated low-level control signals and the frames generated by CVGI are shown. Δ̂_x denotes the estimated low-level control signal in 1D for the base of the Breakout game.

B. Qualitative Results

Figs. 7, 8, and 9 show the results of the qualitative evaluation of CVGI on the EPIC-Kitchens-55, BAIR robot pushing, and Atari Breakout datasets, respectively. Fig. 7 shows the initial frame, the corresponding mask, and text-based instructions along with the estimated low-level control signals, estimated masks, and frames of the generated sequences. In Figs. 8 and 9, the estimated low-level control signals and the next frames generated by CVGI conditioned on different instructions are shown for one sample initial frame due to space limitations.

In Fig. 7, CVGI is able to generate novel videos depicting different hand motions from the same initial frame and hand mask according to the instructions. The estimated low-level control signals change according to the instructions, which enables M2M GAN to produce different hand masks, which in turn allows M2F GAN to generate videos with different hand movements. In addition to the differences between the generated videos, they are semantically consistent with the instructions. As seen in Fig. 7, the hands in the generated videos move toward the desired objects according to the instruction. Thus, CVGI comprehends instructions and controls the motion of both hands to synthesize videos depicting the desired hand-based action.

As shown in Figs. 8 and 9, CVGI can control generation based on instructions since, for the same initial frame, the estimated low-level control signals and generated next frames differ according to the instructions. In addition, the generated frames also depict the desired motion. Thus, the qualitative evaluation shows that CVGI is able to control the 2D and 1D motion of a single object of interest while producing realistic frames.

C. Quantitative Results

We follow the evaluation protocol proposed in [8] to evaluate the video generation quality of our framework. According to the protocol, models are used to generate the frames of the test set by starting from the initial frame. Then the quality of the generated frames is measured by three metrics: FID [45], FVD [46], and LPIPS [47]. They measure the similarity between two sets of samples, and a lower score means more similar sets. Fréchet Inception Distance (FID) [45] measures the similarity between two sets by comparing Gaussian distributions of deep features. Fréchet Video Distance (FVD) [46] is a variant of the FID metric designed specifically to evaluate the quality of video generation models. Learned Perceptual Image Patch Similarity (LPIPS) [47] measures the perceptual similarity between image patches.
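FID compares the Gaussian statistics of deep features extracted from the two sets. A sketch of the underlying Fréchet distance computation, given precomputed feature means and covariances, is shown below; the feature extractor itself (typically Inception-v3 activations) is omitted.

```python
# Sketch: Frechet distance between two Gaussians fitted to deep features, as used by FID.
# mu1/sigma1 and mu2/sigma2 are the feature means/covariances of the real and
# generated sets (feature extraction not shown).
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    diff = mu1 - mu2
    # matrix square root of the product of covariances
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if not np.isfinite(covmean).all():
        # numerical stabilization for near-singular covariances
        offset = np.eye(sigma1.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma1 + offset).dot(sigma2 + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff.dot(diff) + np.trace(sigma1) + np.trace(sigma2) - 2 * np.trace(covmean)
```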

[8] chooses MoCoGAN [24], SAVP [48], and SRVP [49] as baselines. As discussed in [8], SAVP and SRVP were originally proposed to address future frame prediction, which is the task of predicting future frames from given previous frames. However, they can be adapted to our task without requiring major adjustments since future frame prediction is closely related to controllable video generation. Besides, MoCoGAN, which generates controllable videos of moving faces, body parts, and artificial objects, requires adjustment to handle the actions.

With this protocol, we evaluate the video generation quality of CVGI's action generator on the video reconstruction task over the BAIR robot pushing and Atari Breakout datasets and compare it with models including MoCoGAN [24], SAVP [48], SRVP [49], and CADDY [8].

TABLE I: Comparison of video generation quality on the BAIR robot pushing dataset. Results are reported as two groups. In the first group (the first four rows), models are trained to generate frames in low resolution (64 × 64); after frames are generated to reconstruct the test videos, they are rescaled to high resolution (256 × 256). The second group (the last four rows) shows the results of models that are trained to generate frames in high resolution.

TABLE II: Comparison of video generation quality on the Atari Breakout dataset. Similar to Table I, results are reported in two resolution groups. Models in the first group (the first two rows) are trained to generate frames in low resolution and rescaled to high resolution. In the second group (the last four rows), models are trained to generate frames at 160 × 210 resolution and are compared without requiring rescaling.

In Tables I and II, we report the evaluation results as two groups according to the resolution of the generated frames. The first group, reported at the top, shows the comparison of models that are trained with frames in low resolution (64 × 64 for BAIR robot pushing, 128 × 48 for Atari Breakout). After the frames are generated, they are rescaled to high resolution (256 × 256 for BAIR robot pushing and 160 × 210 for Atari Breakout). MoCoGAN, SAVP, and SRVP were originally proposed to generate frames in low resolution, and adapting them to high-resolution generation requires improving the representation capacity of the networks, as discussed in [8]. The results of such improved models for MoCoGAN and SAVP are also reported in the second group, where the models are indicated with a + sign. The second group includes CADDY, which is proposed to generate frames in high resolution. Note that the scores of the other models in Tables I and II are reported from [8].

Table I shows the superior performance of CVGI in both resolution groups on the BAIR robot pushing dataset. This could be attributed to the use of the L2 norm as the reconstruction loss (Limg) to train our model, which leads to better visual quality scores, as discussed in [48]. On the other hand, the quality of the generated videos on the Atari Breakout dataset is comparable to the state-of-the-art video generation and prediction models, as shown in Table II. The reason for this could be the limited training set.

Moreover, the action generation module of CVGI is capable of controlling the motion of the object of interest conditioned on the displacements, which are continuous low-level control signals, even though the action space is discretized in the control signal estimator for the sake of simplicity of motion control. In contrast, the CADDY framework controls the motion of the object of interest conditioned on discrete action labels. Thus, although the quality of the videos generated by the CADDY framework on the Atari Breakout dataset is better, as shown in Table II, controlling the motion of the object of interest with continuous low-level signals instead of discrete action labels fits the motion better; it allows a better understanding of the motion and increases the flexibility of the motion control.

The same evaluation protocol cannot be used for EPIC-Kitchens-55 because the other models require major adaptations to handle the complex scenes and actions of EPIC-Kitchens-55. For this reason, we evaluate the future frame prediction of CVGI by comparing it with Retrospective CycleGAN [7], which is one of the state-of-the-art future frame prediction models. Future frame prediction is a closely related research area, and in the evaluation we use Retrospective CycleGAN without adapting it to controllable video generation because such an adaptation would require major modification of the networks and loss functions. Retrospective CycleGAN is originally trained to predict future frames conditioned on four previous consecutive frames. Moreover, similar to CVGI, we adapt Retrospective CycleGAN to predict the future frame conditioned on a single previous frame instead of four previous frames. In addition to Retrospective CycleGAN and the adapted Retrospective CycleGAN, we use Video Diffusion Models [39], as it is one of the most recent and powerful models in video synthesis. However, the Video Diffusion model uses textual conditions only, whereas CVGI uses visual conditions (a single frame) and textual conditions (a text-based instruction) together. Thus, we adapt Video Diffusion to take textual and visual conditions together for generating next frames. The adapted model is trained from scratch on the EPIC-Kitchens-55 dataset with the default hyper-parameters of Video Diffusion.

The generation quality is measured by three metrics, Mean-Squared Error (MSE), Peak Signal to Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM) [50], instead of FID [45], FVD [46], and LPIPS [47], as MSE, PSNR, and SSIM are the most widely used metrics in the evaluation of future frame prediction approaches. They measure the similarity between the generated frame and the ground truth frame directly rather than measuring the similarity of two sets of frames. Since the metrics are computed for frames one by one, we report the mean of the scores. MSE is a pixel-wise metric. PSNR is also a pixel-wise metric and is based on MSE. On the other hand, SSIM compares frames based on image patches instead of a pixel-wise comparison. Whereas a lower MSE score means more similar samples, as with FID, FVD, and LPIPS, a higher score means more similar samples for PSNR and SSIM. In addition to MSE, PSNR, and SSIM, the Inception Score (IS) [51] is used to measure the fidelity of the generated frames, and a higher score means better fidelity, as with PSNR and SSIM.
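The per-frame metrics above are straightforward to compute. A minimal sketch using NumPy and scikit-image (whose structural_similarity function implements SSIM; the channel_axis argument assumes a recent scikit-image release) is shown below for reference.

```python
# Sketch: per-frame MSE, PSNR, and SSIM between a generated frame and its ground truth.
# Frames are assumed to be uint8 RGB arrays of identical shape.
import numpy as np
from skimage.metrics import structural_similarity

def frame_metrics(gt, pred):
    gt_f = gt.astype(np.float64)
    pred_f = pred.astype(np.float64)
    mse = np.mean((gt_f - pred_f) ** 2)
    psnr = 10.0 * np.log10((255.0 ** 2) / mse) if mse > 0 else float("inf")
    ssim = structural_similarity(gt, pred, channel_axis=-1)
    return mse, psnr, ssim
```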

TABLE III: Comparison of CVGI with copy-last (a commonly used baseline), Retrospective CycleGAN, Retrospective CycleGAN* (an adapted Retrospective CycleGAN that predicts the next frame by using only the last previous frame), Video Diffusion* (an adapted Video Diffusion model that synthesizes next frames by using the last previous frame and text-based instructions), and CVGI w/o M2M GAN (an ablation model), in predicting next frames on the EPIC-Kitchens-55 dataset. MSE scores are multiplied by 10^3 to emphasize the difference. Note that the IS of copy-last is not included as it contains real frames. The best score and the second-best score are highlighted in bold and underlined, respectively.

With this evaluation, we compare the fidelity of the generated frames and the consistency between the ground truth and the frames predicted by CVGI's action generator on the video reconstruction task over the EPIC-Kitchens-55 dataset. As shown in Table III, CVGI is compared with the adapted Video Diffusion [39], indicated as Video Diffusion*, Retrospective CycleGAN, the adapted Retrospective CycleGAN, indicated as Retrospective CycleGAN*, and copy-last, a commonly used baseline in future frame prediction that copies the last previous frame as the prediction. In this evaluation, the models are trained from scratch for 500 k iterations with a batch size of 16, as in the training of CVGI, and are used to predict the fifth frame of every five consecutive frames of the videos in the test set. In addition, we also present an ablation study where CVGI without M2M GAN is used to reconstruct the frames of the test set by taking the ground truth hand masks instead of the hand masks generated by M2M GAN. Thus, the ablation model shows the performance of M2F GAN only.

Table III shows that CVGI is capable of generating frames that are consistent with the ground truth, and it also shows that the frames generated by CVGI are photo-realistic. In the comparison with Retrospective CycleGAN, CVGI has better yet close scores in PSNR, SSIM, and IS, whereas Retrospective CycleGAN has a better MSE score than CVGI. When we compare CVGI with Retrospective CycleGAN*, which is adapted to predict the next frame by using a single frame, CVGI consistently has better scores. This indicates that when the supervision (the number of previous frames used to predict the next frame) in Retrospective CycleGAN decreases, its performance also decreases. Consequently, while the performance of CVGI and Retrospective CycleGAN are close to each other, CVGI has superior performance to Retrospective CycleGAN*. Despite the scores of Retrospective CycleGAN, predicting the future frame conditioned on a single frame, as in CVGI and Retrospective CycleGAN*, is indeed more challenging. In addition, CVGI consistently outperforms the adapted Video Diffusion model that uses both textual and visual conditions. In its comparison with Retrospective CycleGAN, it is observed that Video Diffusion has a better score in IS only. Moreover, the performance of Video Diffusion and CVGI is relatively close in IS, whereas CVGI outperforms it in MSE, PSNR, and SSIM by larger margins. In other words, the fidelity of the frames generated by CVGI and Video Diffusion is relatively close, but the frames generated by CVGI are more consistent with the ground truth than those generated by Video Diffusion. The reason is that directly using text-based instructions might be ambiguous for consistent video synthesis.

Moreover, Table III includes an ablation study. Although the scores of the ablation model and the complete model are close to each other, the performance of the ablation model is slightly better than that of the complete model, because the complete model generates the hand masks as well and the error of M2M GAN is accumulated. We believe these are the reasons for the performance difference. On the other hand, although the ablation model has slightly better performance than the complete model, it requires masks to control the hands' motion, which is unreasonable to expect as user input. Thus, although M2M GAN causes error accumulation, it is essential for generating controllable videos.

TABLE IV: Comparison of CVGI with existing video generation models on a benchmark task on the BAIR robot pushing dataset. The benchmark task is video prediction, where 15 next frames are predicted conditioned on a given single frame. The best score and the second-best score are highlighted in bold and underlined, respectively.

Finally, to compare CVGI with a wider range of existing video generation models, we evaluated CVGI on the common benchmark task on the BAIR robot pushing dataset (in low resolution, 64 × 64). The task is video prediction: reconstructing the test frames by synthesizing the next 15 frames primed on a given single frame. To reconstruct the test frames from the given single frame, we employ the trained model of CVGI's action generator and reconstruct the test frames conditioned on the displacement of the gripper. The frames are generated in a loop, as CVGI is trained to generate a single next frame rather than a set of next frames. Then, following the evaluation protocol in [54], [62], the generation quality of CVGI is measured with the FVD score [46] between the generated videos and the ground truth videos. Note that the FVD scores of the other models are reported from [37], [39], [54], [62]. As shown in Table IV, CVGI has the second-best score in the benchmark task on the BAIR robot pushing dataset. Although the FVD score of Video Diffusion is better than that of CVGI, Table IV shows that CVGI can generate realistic videos. In addition to generating high-fidelity videos, CVGI is capable of controlling the action to generate novel videos.

D. Ablation Study

An ablation study is performed to analyze the effectiveness of the motion estimation layer over the BAIR robot pushing and Atari Breakout datasets. This ablation study, therefore, shows the effect of using a continuous signal rather than a discrete signal as the condition. A new ablation model is trained which includes a GAN. This model manipulates the motion with text-based instructions directly: the generator takes the label of the motion instead of control signals along with the frames, and the discriminator of the ablation model is a sequence discriminator with an auxiliary classifier, instead of a regressor, that predicts the action performed.

Fig. 10 shows the initial frames and textual instructions. In addition, it shows the different motions in the frames generated by our model and by the ablation model conditioned on the instructions. As shown in Fig. 10, in the frames generated by the ablation model, the positions of the gripper and the base are approximately the same, whereas in the frames generated by CVGI, which incorporates the motion estimation layer, the positions of the gripper and the base differ according to the given instruction. Thus, the motion estimation layer is especially essential to control the motion's speed. Consequently, using a continuous signal rather than a discrete signal such as action labels is better for representing the motion of the object of interest, as motions are continuous as well.

[7] Y.-H. Kwon and M.-G. Park, “Predicting future frames using retrospective
cycle GAN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.,
2019, pp. 1811–1820.
Fig. 10. Ablation results on BAIR robot pushing and Atari Breakout datasets. Initial frames and the frames generated from different textual instructions by CVGI and by the ablation model that does not incorporate the motion estimation layer are shown. Gray vertical dotted lines mark the position of the object of interest in the initial frame for clarity.

... motion of the object of interest as motions are continuous as well.

V. CONCLUSION

In this work, we propose a controllable video generation framework that provides detailed control over the motion of the object of interest to generate novel videos from text-based instructions. It incorporates a motion estimation layer that divides the task into two sub-tasks: control signal estimation and action generation. In control signal estimation, our model learns to plan the motion of the object of interest according to the given instruction; in action generation, it generates photo-realistic action videos that realize the planned motion. Experimental results on benchmark datasets demonstrate the effectiveness of our model. In the future, we plan to extend our model into an end-to-end framework.
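As an illustration of this two-stage design, the following minimal PyTorch sketch wires a control signal estimator to a conditional frame generator. All class names, layer choices, and tensor sizes below are assumptions made for exposition only; they are not the paper's actual implementation, and the GAN losses and training loop are omitted.

# Minimal sketch of the two sub-tasks (illustrative names and sizes only).
import torch
import torch.nn as nn


class ControlSignalEstimator(nn.Module):
    """Maps a tokenized instruction and the initial frame to a sequence of
    low-level control signals (e.g., per-step 2-D displacements)."""

    def __init__(self, vocab_size=1000, emb_dim=128, control_dim=2, horizon=16):
        super().__init__()
        self.horizon, self.control_dim = horizon, control_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.text_rnn = nn.GRU(emb_dim, emb_dim, batch_first=True)
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(emb_dim + 64, horizon * control_dim)

    def forward(self, tokens, first_frame):
        _, h = self.text_rnn(self.embed(tokens))             # h: (1, B, emb_dim)
        text_feat = h[-1]                                     # (B, emb_dim)
        frame_feat = self.frame_enc(first_frame)              # (B, 64)
        ctrl = self.head(torch.cat([text_feat, frame_feat], dim=1))
        return ctrl.view(-1, self.horizon, self.control_dim)  # (B, T, control_dim)


class ActionGenerator(nn.Module):
    """Conditional generator: renders one frame per control step from the
    initial frame and the estimated control signal (adversarial losses omitted)."""

    def __init__(self, control_dim=2):
        super().__init__()
        self.frame_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        self.ctrl_proj = nn.Linear(control_dim, 128)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, first_frame, control):
        feat = self.frame_enc(first_frame)                 # (B, 128, 16, 16)
        frames = []
        for t in range(control.size(1)):
            mod = self.ctrl_proj(control[:, t])            # (B, 128)
            cond = feat + mod.unsqueeze(-1).unsqueeze(-1)  # broadcast over space
            frames.append(self.decoder(cond))              # (B, 3, 64, 64)
        return torch.stack(frames, dim=1)                  # (B, T, 3, 64, 64)


if __name__ == "__main__":
    tokens = torch.randint(0, 1000, (4, 12))   # dummy instruction tokens
    frame0 = torch.randn(4, 3, 64, 64)         # dummy initial frames
    control = ControlSignalEstimator()(tokens, frame0)
    video = ActionGenerator()(frame0, control)
    print(control.shape, video.shape)          # (4, 16, 2) and (4, 16, 3, 64, 64)

Running the script only prints the control-signal shape (batch, horizon, control dimension) and the generated video shape, which makes the division of labor between the two sub-tasks explicit: the first module plans where the object of interest should move, and the second module renders frames that follow that plan.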
REFERENCES

[1] Y.-F. Zhou et al., “BranchGAN: Unsupervised mutual image-to-image transfer with a single encoder and dual decoders,” IEEE Trans. Multimedia, vol. 21, no. 12, pp. 3136–3149, Dec. 2019.
[2] H.-Y. Lee et al., “DRIT++: Diverse image-to-image translation via disentangled representations,” Int. J. Comput. Vis., vol. 128, pp. 2402–2417, 2020.
[3] B. Li, X. Qi, T. Lukasiewicz, and P. Torr, “Controllable text-to-image generation,” in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32, pp. 2065–2075.
[4] J. Lin et al., “Exploring explicit domain supervision for latent space disentanglement in unpaired image-to-image translation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 4, pp. 1254–1266, Apr. 2012.
[5] B. Zhu and C.-W. Ngo, “CookGAN: Causality based text-to-image synthesis,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5519–5527.
[6] J. Huang, J. Liao, and S. Kwong, “Semantic example guided image-to-image translation,” IEEE Trans. Multimedia, vol. 23, pp. 1654–1665, 2021.
[8] W. Menapace, S. Lathuilière, S. Tulyakov, A. Siarohin, and E. Ricci, “Playable video generation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 10061–10070.
[9] M. Nawhal, M. Zhai, A. Lehrmann, L. Sigal, and G. Mori, “Generating videos of zero-shot compositions of actions and objects,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 382–401.
[10] W. Wang, X. Alameda-Pineda, D. Xu, E. Ricci, and N. Sebe, “Learning how to smile: Expression video generation with conditional adversarial recurrent nets,” IEEE Trans. Multimedia, vol. 22, no. 11, pp. 2808–2819, Nov. 2020.
[11] R. Cui, Z. Cao, W. Pan, C. Zhang, and J. Wang, “Deep gesture video generation with learning on regions of interest,” IEEE Trans. Multimedia, vol. 22, no. 10, pp. 2551–2563, Oct. 2020.
[12] C. Wu et al., “NÜWA: Visual synthesis pre-training for neural visual world creation,” in Proc. 17th Eur. Conf. Comput. Vis., Oct. 23-27, 2022, pp. 720–736.
[13] Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha, “StarGAN v2: Diverse image synthesis for multiple domains,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8188–8197.
[14] A. Koksal and S. Lu, “RF-GAN: A light and reconfigurable network for unpaired image-to-image translation,” in Proc. Asian Conf. Comput. Vis., 2020, pp. 1–18.
[15] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, “GauGAN: Semantic image synthesis with spatially adaptive normalization,” in Proc. ACM SIGGRAPH Real-Time Live!, 2019, pp. 1–1.
[16] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2223–2232.
[17] Z. Hao, X. Huang, and S. Belongie, “Controllable video generation with sparse trajectories,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7854–7863.
[18] T. Marwah, G. Mittal, and V. N. Balasubramanian, “Attentive semantic video generation using captions,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1426–1434.
[19] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, “Animating arbitrary objects via deep motion transfer,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2377–2386.
[20] Y. Yan, B. Ni, W. Zhang, J. Xu, and X. Yang, “Structure-constrained motion sequence generation,” IEEE Trans. Multimedia, vol. 21, no. 7, pp. 1799–1812, Jul. 2019.
[21] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, “Everybody dance now,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 5933–5942.
[22] P. Esser et al., “Towards learning a realistic rendering of human behavior,” in Proc. Eur. Conf. Comput. Vis. Workshops, 2018, pp. 1–17.
[23] P. Esser, E. Sutter, and B. Ommer, “A variational U-Net for conditional appearance and shape generation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 8857–8866.
[24] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “MoCoGAN: Decomposing motion and content for video generation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1526–1535.
[25] O. Wiles, A. Koepke, and A. Zisserman, “X2Face: A network for controlling face generation using images, audio, and pose codes,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 670–686.
[26] P. Ardino, M. D. Nadai, B. Lepri, E. Ricci, and S. Lathuilière, “Click to move: Controlling video generation with sparse motion,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 14749–14758.
[27] S. W. Kim, Y. Zhou, J. Philion, A. Torralba, and S. Fidler, “Learning to simulate dynamic environments with gameGAN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1231–1240.
[28] O. Gafni, L. Wolf, and Y. Taigman, “Vid2game: Controllable characters extracted from real-world videos,” 2019, arXiv:1904.08379.
[29] D. Damen et al., “Scaling egocentric vision: The epic-kitchens dataset,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 720–736.
[30] F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised visual planning with temporal skip connections,” in Proc. Conf. Robot Learn., 2017, pp. 344–356.
[31] J. Walker, K. Marino, A. Gupta, and M. Hebert, “The pose knows: Video forecasting by generating pose futures,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 3332–3341.
[32] R. Villegas et al., “Learning to generate long-term future via hierarchical prediction,” in Proc. Int. Conf. Mach. Learn., 2017, pp. 3560–3569.
[33] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, “First order motion model for image animation,” in Proc. Adv. Neural Inf. Process. Syst., 2019, vol. 32, pp. 1–11.
[34] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action-conditional video prediction using deep networks in atari games,” in Proc. Adv. Neural Inf. Process. Syst., 2015, vol. 28, pp. 1–9.
[35] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in Proc. Adv. Neural Inf. Process. Syst., 2016, vol. 29, pp. 1–9.
[36] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “CogVideo: Large-scale pretraining for text-to-video generation via transformers,” 2022, arXiv:2205.15868.
[37] R. Villegas et al., “Phenaki: Variable length video generation from open domain textual description,” 2022, arXiv:2210.02399.
[38] M. Saito, S. Saito, M. Koyama, and S. Kobayashi, “Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN,” Int. J. Comput. Vis., vol. 128, no. 10, pp. 2586–2606, 2020.
[39] J. Ho et al., “Video diffusion models,” 2022, arXiv:2204.03458.
[40] U. Singer et al., “Make-a-video: Text-to-video generation without text-video data,” 2022, arXiv:2209.14792.
[41] X. Mao et al., “Least squares generative adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2794–2802.
[42] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 5686–5696.
[43] Y. Li, M. Liu, and J. M. Rehg, “In the eye of beholder: Joint learning of gaze and actions in first person video,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 619–635.
[44] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” 2014, arXiv:1412.6980.
[45] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” in Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6629–6640. [Online]. Available: http://dl.acm.org/citation.cfm?id=3295222.3295408
[46] T. Unterthiner et al., “Towards accurate generative models of video: A new metric & challenges,” 2018, arXiv:1812.01717.
[47] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595.
[48] A. X. Lee et al., “Stochastic adversarial video prediction,” 2018, arXiv:1804.01523.
[49] J.-Y. Franceschi, E. Delasalles, M. Chen, S. Lamprier, and P. Gallinari, “Stochastic latent residual video prediction,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 3233–3246.
[50] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[51] T. Salimans et al., “Improved techniques for training GANs,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 2234–2242. [Online]. Available: http://papers.nips.cc/paper/6125-improved-techniques-for-training-gans.pdf
[52] E. Denton and R. Fergus, “Stochastic video generation with a learned prior,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 1174–1183.
[53] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine, “Stochastic variational video prediction,” 2017, arXiv:1710.11252.
[54] R. Rakhimov, D. Volkhonskiy, A. Artemov, D. Zorin, and E. Burnaev, “Latent video transformer,” 2020, arXiv:2006.10704.
[55] A. Clark, J. Donahue, and K. Simonyan, “Adversarial video generation on complex datasets,” 2019, arXiv:1907.06571.
[56] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, “VideoGPT: Video generation using VQ-VAE and transformers,” 2021, arXiv:2104.10157.
[57] P. Luc et al., “Transformation-based adversarial video prediction on large-scale data,” 2020, arXiv:2003.04035.
[58] C. Nash et al., “Transframer: Arbitrary frame prediction with generative models,” 2022, arXiv:2203.09494.
[59] Y. Seo, K. Lee, F. Liu, S. James, and P. Abbeel, “HARP: Autoregressive latent video prediction with high-fidelity image generator,” in Proc. IEEE Int. Conf. Image Process., 2022, pp. 3943–3947.
[60] G. Le Moing, J. Ponce, and C. Schmid, “CCVS: Context-aware controllable video synthesis,” in Proc. Adv. Neural Inf. Process. Syst., 2021, vol. 34, pp. 14042–14055.
[61] D. Weissenborn, O. Täckström, and J. Uszkoreit, “Scaling autoregressive video models,” 2019, arXiv:1906.02634.
[62] M. Babaeizadeh et al., “FitVid: Overfitting in pixel-level video prediction,” 2021, arXiv:2106.13195.
[63] V. Voleti, A. Jolicoeur-Martineau, and C. Pal, “Masked conditional video diffusion for prediction, generation, and interpolation,” 2022, arXiv:2205.09853.
[64] T. Höppe, A. Mehrjou, S. Bauer, D. Nielsen, and A. Dittadi, “Diffusion models for video prediction and infilling,” 2022, arXiv:2206.07696.

Ali Köksal received the B.Sc. and M.Sc. degrees in computer engineering from the Izmir Institute of Technology, Izmir, Turkey, in 2014 and 2017. He is currently working toward the Ph.D. degree with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, in 2022. He is also a Scientist with the Visual Intelligence Department, Institute for Infocomm Research, A*STAR, Singapore.

Kenan Emir Ak received the B.Sc. and M.Sc. degrees in electrical and electronics engineering from Isik University, Istanbul, Turkey, and Bogazici University, Istanbul, Turkey, in 2014 and 2016, and the Ph.D. degree in electrical and computer engineering from the National University of Singapore, Singapore, in 2020. He is currently a Scientist with the Visual Intelligence Department, Institute for Infocomm Research, A*STAR, Singapore.

Ying Sun (Member, IEEE) received the B.Eng. degree from Tsinghua University, Beijing, China, in 1998, the M.Phil. degree from the Hong Kong University of Science and Technology, Hong Kong, in 2000, and the Ph.D. degree from Carnegie Mellon University, Pittsburgh, PA, USA, in 2004. She is currently a Senior Scientist with the Visual Intelligence Department, Institute for Infocomm Research, and Centre for Frontier AI Research, A*STAR, Singapore.

Deepu Rajan (Member, IEEE) received the Bachelor of Engineering degree in electronics and communication engineering from the Birla Institute of Technology, Ranchi, India, the M.S. degree in electrical engineering from Clemson University, Clemson, SC, USA, and the Ph.D. degree from the Indian Institute of Technology, Bombay, Mumbai, India. He is currently an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. From 1992 to 2002, he was a Lecturer with the Department of Electronics, Cochin University of Science and Technology, Kochi, India. His research interests include image processing, computer vision, and multimedia signal processing.

Joo Hwee Lim (Senior Member, IEEE) received the B.Sc. (Hons. I) and M.Sc. degrees in computer science from the National University of Singapore, Singapore, and the Ph.D. degree in computer science & engineering from UNSW, Sydney, NSW, Australia. He is currently a Principal Scientist and the Department Head (Visual Intelligence) with the Institute for Infocomm Research, A*STAR, Singapore. He has authored or coauthored 290 international refereed journal and conference papers on connectionist expert systems, neural-fuzzy systems, content-based image retrieval, medical image analysis, and human-robot collaboration.