
A Review on Deep Learning Techniques for Video Prediction
S. Oprea, P. Martinez-Gonzalez, A. Garcia-Garcia, J.A. Castro-Vargas, S. Orts-Escolano,
J. Garcia-Rodriguez, and A. Argyros

arXiv:2004.05214v2 [cs.CV] 15 Apr 2020

Abstract—The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it has demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review of the deep learning methods for prediction in video sequences. We first define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied by experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper ends by drawing some general conclusions, identifying open research challenges, and pointing out future research directions.

Index Terms—Video prediction, future frame prediction, deep learning, representation learning, self-supervised learning

1 INTRODUCTION

Will the car hit the pedestrian? That might be one of the questions that comes to our minds when we observe Figure 1. Answering this question might in principle be a hard task; however, if we take a careful look at the image sequence we may notice subtle clues that can help us predict into the future, e.g., the person's body indicates that he is running fast enough so he will be able to escape the car's trajectory. This example is just one situation among many others in which predicting future frames in video is useful.

Fig. 1: A pedestrian appeared from behind the white car with the intention of crossing the street. The autonomous car must make a call: trigger the emergency braking routine or not. This all comes down to predicting the next frames (Ŷt+1, . . . , Ŷt+m) given a sequence of context frames (Xt−n, . . . , Xt), where n and m denote the number of context and predicted frames, respectively. From these predictions at a representation level (RGB, high-level semantics, etc.) a decision-making system would make the car avoid the collision.

In general terms, the prediction and anticipation of future events is a key component of intelligent decision-making systems. Despite the fact that we, humans, solve this problem quite easily and effortlessly, it is extremely challenging from a machine's point of view. Some of the factors that contribute to such complexity are occlusions, camera movement, lighting conditions, clutter, or object deformations. Nevertheless, despite such challenging conditions, many predictive methods have been applied with a certain degree of success in a broad range of application domains such as autonomous driving, robot navigation and human-machine interaction. Some of the tasks in which future prediction has been applied successfully are: anticipating activities and events [1]–[4], long-term planning [5], future prediction of object locations [6], video interpolation [7], predicting instance/semantic segmentation maps [8]–[10], prediction of pedestrian trajectories in traffic [11], anomaly detection [12], precipitation nowcasting [13], [14], and autonomous driving [15].

• S. Oprea, P. Martinez-Gonzalez, A. Garcia-Garcia, J. A. Castro-Vargas, and J. Garcia-Rodriguez are with the 3D Perception Lab (3DPL), Department of Computer Technology, University of Alicante, Carrer de San Vicente del Raspeig s/n, E-03690 San Vicente del Raspeig, Spain. E-mail: {soprea, pmartinez, jacastro, jgarcia}@dtic.ua.es
• A. Garcia-Garcia is with the Institute of Space Sciences (ICE-CSIC), Campus UAB, Carrer de Can Magrans s/n, E-08193 Barcelona, Spain. E-mail: [email protected].
• S. Orts-Escolano is with the Department of Computer Science and Artificial Intelligence (DCCIA), University of Alicante, Carrer de San Vicente del Raspeig s/n, E-03690 San Vicente del Raspeig, Spain. E-mail: [email protected].
• A. Argyros is with the Institute of Computer Science, FORTH, Heraklion GR-700 13, Greece and with the Computer Science Department, University of Crete, Heraklion, Rethimno 741 00, Greece. E-mail: [email protected].
The great strides made by deep learning algorithms in a variety of research fields such as semantic segmentation [16], human action recognition and prediction [17], object pose estimation [18] and registration [19], to name a few, motivated authors to explore deep representation-learning models for future video frame prediction. What made the deep architectures take a leap over the traditional approaches is their ability to learn adequate representations from high-dimensional data in an end-to-end fashion without hand-engineered features [20]. Deep learning-based models fit perfectly into the learning-by-prediction paradigm, enabling the extraction of meaningful spatio-temporal correlations from video data in a self-supervised fashion.

In this review, we put our focus on deep learning techniques and how they have been extended or applied to future video prediction. We limit this review to future video prediction given the context of a sequence of previous frames, leaving aside methods that predict the future from a static image. In this context, the terms video prediction, future frame prediction, next video frame prediction, future frame forecasting, and future frame generation are used interchangeably. To the best of our knowledge, this is the first review in the literature that focuses on video prediction using deep learning techniques.

This review is organized as follows. First, Sections 2 and 3 lay down the terminology and explain important background concepts that will be necessary throughout the rest of the paper. Next, Section 4 surveys the datasets used by the video prediction methods that are carefully reviewed in Section 5, providing a comprehensive description as well as an analysis of their strengths and weaknesses. Section 6 analyzes typical metrics and evaluation protocols for the aforementioned methods and provides quantitative results for them on the reviewed datasets. Section 7 presents a brief discussion on the presented proposals and enumerates potential future research directions. Finally, Section 8 summarizes the paper and draws conclusions about this work.

2 VIDEO PREDICTION

The ability to predict, anticipate and reason about future events is the essence of intelligence [21] and one of the main goals of decision-making systems. This idea has biological roots, and also draws inspiration from the predictive coding paradigm [22] borrowed from the cognitive neuroscience field [23]. From a neuroscience perspective, the human brain builds complex mental representations of the physical and causal rules that govern the world, primarily through observation and interaction [24]–[26]. The common sense we have about the world arises from conceptual acquisition and the accumulation of background knowledge from early ages, e.g. biological motion and intuitive physics, to name a few. But how can the brain check and refine the learned mental representations from its raw sensory input? The brain is continuously learning through prediction, and refines the already understood world models from the mismatch between its predictions and what actually happened [27]. This is the essence of the predictive coding paradigm that early works tried to implement as computational models [22], [28]–[30].

The video prediction task closely captures the fundamentals of the predictive coding paradigm and is considered the intermediate step between raw video data and decision making. Its potential to extract meaningful representations of the underlying patterns in video data makes the video prediction task a promising avenue for self-supervised representation learning.

2.1 Problem Definition

We formally define the task of predicting future frames in videos, i.e. video prediction, as follows. Let Xt ∈ R^(w×h×c) be the t-th frame in the video sequence X = (Xt−n, . . . , Xt−1, Xt) with n frames, where w, h, and c denote width, height, and number of channels, respectively. The target is to predict the next frames Y = (Ŷt+1, Ŷt+2, . . . , Ŷt+m) from the input X.

Under the assumption that good predictions can only be the result of accurate representations, learning by prediction is a feasible approach to verify how accurately the system has learned the underlying patterns in the input data. In other words, it represents a suitable framework for representation learning [31], [32]. The essence of the predictive learning paradigm is the prediction of plausible future outcomes from a set of historical inputs. On this basis, the task of video prediction is defined as: given a sequence of video frames as context, predict the subsequent frames, i.e. generate the continuing video given a sequence of previous frames. Different from video generation, which is mostly unconditioned, video prediction is conditioned on a previously learned representation from a sequence of input frames. At a first glance, and in the context of learning paradigms, we could think of the future video frame prediction task as a supervised learning approach because the target frame acts as a label. However, as this information is already available in the input video sequence, no extra labels or human supervision is needed. Therefore, learning by prediction is a self-supervised task, filling the gap between supervised and unsupervised learning.
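To make this formulation concrete, the following minimal PyTorch-style sketch illustrates the input/output convention: a model consumes the n context frames and regresses the m future frames, with the ground-truth future frames of the same clip acting as self-supervised targets. The ConvPredictor module and its toy architecture are illustrative placeholders, not any of the surveyed models.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvPredictor(nn.Module):
        """Placeholder video prediction model: n context frames -> m future frames."""
        def __init__(self, channels=3, context=4, horizon=2, hidden=64):
            super().__init__()
            self.horizon = horizon
            self.net = nn.Sequential(
                nn.Conv2d(channels * context, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, channels * horizon, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):                       # x: (B, n, c, h, w) context frames
            b, n, c, h, w = x.shape
            y = self.net(x.reshape(b, n * c, h, w))
            return y.reshape(b, self.horizon, c, h, w)   # (B, m, c, h, w) predicted frames

    clip = torch.rand(8, 6, 3, 64, 64)              # a batch of 6-frame clips
    context, target = clip[:, :4], clip[:, 4:]      # X = (X_{t-3}, ..., X_t), Y = (Y_{t+1}, Y_{t+2})
    model = ConvPredictor()
    loss = F.mse_loss(model(context), target)       # self-supervised: targets come from the clip itself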
2.2 Exploiting the Time Dimension of Videos

Unlike static images, videos provide complex transformations and motion patterns in the time dimension. At a fine granularity, if we focus on a small patch at the same spatial location across consecutive time steps, we could identify a wide range of local, visually similar deformations due to temporal coherence. In contrast, by looking at the big picture, consecutive frames would be visually different but semantically coherent. This variability in the visual appearance of a video at different scales is mainly due to occlusions, changes in the lighting conditions, and camera motion, among other factors. From this source of temporally ordered visual cues, predictive models are able to extract representative spatio-temporal correlations depicting the dynamics in a video sequence. For instance, Agrawal et al. [33] established a direct link between vision and motion, attempting to reduce supervision efforts when training deep predictive models.

Recent works study how important the time dimension is for video understanding models [34]. The implicit temporal ordering in videos, also known as the arrow of time, indicates whether a video sequence is playing forward or backward. This temporal direction is also used in the literature as a supervisory signal [35]–[37]. This further encouraged predictive models to implicitly or explicitly model the spatio-temporal correlations of a video sequence to understand the dynamics of a scene. The time dimension of a video reduces the supervision effort and makes the prediction task self-supervised.

Fig. 2: At top, a deterministic environment where a geometric object, e.g. a black square, starts moving following a random direction. At bottom, the probabilistic outcome. Darker areas correspond to higher probability outcomes. As uncertainty is introduced, probabilities get blurry and averaged. Figure inspired by [38].

2.3 Dealing with Stochasticity

Predicting how a square is moving could be extremely challenging even in a deterministic environment such as the one represented in Figure 2. The lack of contextual information and the multiple equally probable outcomes hinder the prediction task. But what if we use two consecutive frames as context? Under this configuration, and assuming a physically perfect environment, the square will keep moving indefinitely in the same direction. This represents a deterministic outcome, an assumption that many authors made in order to deal with future uncertainty. Assuming a deterministic outcome would narrow the prediction space to a unique solution. However, this assumption is not suitable for natural videos. The future is by nature multimodal, since the probability distribution defining all the possible future outcomes in a context has multiple modes, i.e. there are multiple equally probable and valid outcomes. Furthermore, on the basis of a deterministic universe, we indirectly assume that all possible outcomes are reflected in the input data. These assumptions make prediction under uncertainty an extremely challenging task.

Most of the existing deep learning-based models in the literature are deterministic. Although the future is uncertain, a deterministic prediction would suffice for some easily predictable situations. For instance, most of the movement of a car is largely deterministic, while only a small part is uncertain. However, when multiple predictions are equally probable, a deterministic model will learn to average between all the possible outcomes. This unpredictability is visually represented in the predictions as blurriness, especially on long time horizons. As deterministic models are unable to handle real-world settings characterized by chaotic dynamics, authors considered that incorporating uncertainty into the model is a crucial aspect. Probabilistic approaches dealing with these issues are discussed in Section 5.6.

2.4 The Devil is in the Loss Function

The design and selection of the loss function for the video prediction task is of utmost importance. Pixel-wise losses, e.g. Cross Entropy (CE), ℓ2, ℓ1 and Mean-Squared Error (MSE), are widely used in both unstructured and structured predictions. Although leading to plausible predictions in deterministic scenarios, such as synthetic datasets and video games, they struggle with the inherent uncertainty of natural videos. In a probabilistic environment, with different equally probable outcomes, pixel-wise losses aim to accommodate uncertainty by blurring the prediction, as we can observe in Figure 2. In other words, deterministic loss functions average out multiple equally plausible outcomes into a single, blurred prediction. In the pixel space, these losses are unstable to slight deformations and fail to capture discriminative representations to efficiently regress the broad range of possible outcomes. This makes it difficult to draw predictions that maintain consistency with our notion of visual similarity. Besides video prediction, several studies analyzed the impact of different loss functions in image restoration [39], classification [40], camera pose regression [41] and structured prediction [42], among others. This fosters reasoning about the importance of the loss function, particularly when making long-term predictions in high-dimensional and multimodal natural videos.

Most distance-based loss functions, such as those based on the ℓp norm, come from the assumption that data is drawn from a Gaussian distribution. But how do these loss functions address multimodal distributions? Assuming that a pixel is drawn from a bimodal distribution with two equally likely modes Mo1 and Mo2, the mean value Mo = (Mo1 + Mo2)/2 would minimize the ℓp-based losses over the data, even if Mo has very low probability [43]. This suggests that the average of two equally probable outcomes would minimize distance-based losses such as the MSE loss. However, this applies to a lesser extent when using the ℓ1 norm, as the pixel values would then be the median of the two equally likely modes in the distribution. In contrast to the ℓ2 norm, which emphasizes outliers with the squaring term, the ℓ1 norm promotes sparsity, thus making it more suitable for prediction in high-dimensional data [43]. Based on the ℓ2 norm, the MSE is also commonly used in the training of video prediction models. However, it produces low reconstruction errors by merely averaging all the possible outcomes into a blurry prediction as uncertainty is introduced. In other words, the mean image would minimize the MSE error as it is the global optimum, thus losing finer details such as facial features and subtle movements, as they are noise for the model. Most of the video prediction approaches rely on pixel-wise loss functions, obtaining roughly accurate predictions in easily predictable datasets.

One of the ultimate goals of many video prediction approaches is to palliate the blurry predictions when it comes to uncertainty. For this purpose, authors broadly focused on: directly improving the loss functions; exploring adversarial training; alleviating the training process by reformulating the problem in a higher-level space; or exploring probabilistic alternatives. Some promising results were reported by combining the loss functions with sophisticated regularization terms, e.g. the Gradient Difference Loss (GDL) to enhance prediction sharpness [43] and the Total Variation (TV) regularization to reduce visual artifacts and enforce coherence [7]. Perceptual losses were also used to further improve the visual quality of the predictions [44]–[48]. However, in light of the success of the Generative Adversarial Networks (GANs), adversarial training emerged as a promising alternative to disambiguate between multiple equally probable modes. It was widely used in conjunction with different distance-based losses such as MSE [49], ℓ2 [50]–[52], or a combination of them [43], [53]–[57]. To alleviate the training process, many authors reformulated the optimization process in a higher-level space (see Section 5.5). While great strides have been made to mitigate blurriness, most of the existing approaches still rely on distance-based loss functions. As a consequence, the regress-to-the-mean problem remains an open issue. This has further encouraged authors to reformulate existing deterministic models in a probabilistic fashion.
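As a small numerical illustration of this regress-to-the-mean behaviour (an added example, not an experiment from the surveyed works), the following NumPy sketch fits a single constant prediction to pixel samples drawn from two equally likely modes: the ℓ2-optimal prediction is the low-probability mean of the modes, whereas the ℓ1-optimal prediction falls on the sample median, i.e. on one of the modes.

    import numpy as np

    rng = np.random.default_rng(0)
    # Bimodal pixel distribution: two equally likely modes Mo1 = 0.1 and Mo2 = 0.9
    samples = np.where(rng.random(10_000) < 0.5, 0.1, 0.9)

    candidates = np.linspace(0.0, 1.0, 1001)                 # constant predictions to evaluate
    l2_loss = [np.mean((samples - c) ** 2) for c in candidates]
    l1_loss = [np.mean(np.abs(samples - c)) for c in candidates]

    # ~0.5 = (Mo1 + Mo2)/2: the blurry average, even though it has near-zero probability
    print("l2-optimal prediction:", candidates[int(np.argmin(l2_loss))])
    # lands on one of the modes (the sample median), not on the blurry average
    print("l1-optimal prediction:", candidates[int(np.argmin(l1_loss))])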
3 BACKBONE DEEP LEARNING ARCHITECTURES

In this section, we briefly review the most common deep networks that are used as building blocks for the video prediction models discussed in this review: convolutional neural networks, recurrent networks, and generative models.

3.1 Convolutional Models

Convolutional layers are the basic building blocks of deep learning architectures designed for visual reasoning, since Convolutional Neural Networks (CNNs) efficiently model the spatial structure of images [58]. As we focus on visual prediction, CNNs represent the foundation of the predictive learning literature. However, their performance is limited by the intra-frame and inter-frame dependencies. Convolutional operations account for short-range intra-frame dependencies due to their limited receptive fields, determined by the kernel size. This is a well-addressed issue that many authors circumvented by (1) stacking more convolutional layers [59], (2) increasing the kernel size (although it becomes prohibitively expensive), (3) linearly combining multiple scales [43] as in the reconstruction process of a Laplacian pyramid [60], (4) using dilated convolutions to capture long-range spatial dependencies [61], (5) enlarging the receptive fields [62], [63], or by subsampling, i.e. using pooling operations, in exchange for losing resolution. The latter could be mitigated by using residual connections [64], [65] to preserve resolution while increasing the number of stacked convolutions. But even addressing these limitations, would CNNs be able to predict over a longer time horizon?

Vanilla CNNs lack explicit inter-frame modeling capabilities. To properly model inter-frame variability in a video sequence, 3D convolutions come into play as a promising alternative to recurrent modeling. Several video prediction approaches leveraged 3D convolutions to capture temporal consistency [66]–[70]. Also modeling the time dimension, Amersfoort et al. [71] replicated a purely convolutional approach in time to address multi-scale predictions in the transformation space. In this case, the learned affine transforms at each time step play the role of a recurrent state.
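The intra-frame versus inter-frame distinction can be made concrete with the short PyTorch sketch below (toy shapes assumed; it does not reproduce any surveyed architecture): a 2D convolution applied frame by frame only mixes information spatially, whereas a 3D convolution whose kernel also spans the temporal axis mixes information across neighbouring frames.

    import torch
    import torch.nn as nn

    video = torch.rand(2, 3, 8, 64, 64)   # (batch, channels, time, height, width)

    # 2D convolution: processes each frame independently, so it only models
    # intra-frame (spatial) structure within its receptive field.
    conv2d = nn.Conv2d(3, 16, kernel_size=3, padding=1)
    per_frame = torch.stack([conv2d(video[:, :, t]) for t in range(video.shape[2])], dim=2)

    # 3D convolution: the kernel spans 3 consecutive frames, so each output
    # activation also mixes information across time (inter-frame dependencies).
    conv3d = nn.Conv3d(3, 16, kernel_size=(3, 3, 3), padding=1)
    spatio_temporal = conv3d(video)

    print(per_frame.shape, spatio_temporal.shape)   # both: torch.Size([2, 16, 8, 64, 64])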
3.2 Recurrent Models

Recurrent models were specifically designed to model a spatio-temporal representation of sequential data such as video sequences. Among other sequence learning tasks, such as machine translation, speech recognition and video captioning, to name a few, Recurrent Neural Networks (RNNs) [72] demonstrated great success in the video prediction scenario [10], [13], [49], [50], [52], [53], [70], [73]–[85]. Vanilla RNNs have some important limitations when dealing with long-term representations due to the vanishing and exploding gradient issues, which make Backpropagation Through Time (BPTT) cumbersome. By extending classical RNNs to more sophisticated recurrent models, such as Long Short-Term Memory (LSTM) [86] and the Gated Recurrent Unit (GRU) [87], these problems were mitigated. Shi et al. extended the use of LSTM-based models to the image space [13]. While some authors explored multidimensional LSTMs (MD-LSTM) [88], others stacked recurrent layers to capture abstract spatio-temporal correlations [49], [89]. Zhang et al. addressed the duplicated representations along the same recurrent paths [90].
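Since the convolutional LSTM introduced by Shi et al. [13] recurs throughout the reviewed methods, a compact cell is sketched below as a generic, hedged re-implementation of the idea (gate computations via convolutions so that hidden and cell states keep their spatial layout); it omits refinements such as peephole connections and is not the authors' reference code.

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        """LSTM cell whose gates are computed with convolutions over spatial maps."""
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

        def forward(self, x, state):
            h, c = state                                   # hidden/cell maps: (B, hid_ch, H, W)
            i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
            c = f * c + i * g                              # update cell state
            h = o * torch.tanh(c)                          # update hidden state
            return h, (h, c)

    cell = ConvLSTMCell(in_ch=3, hid_ch=32)
    h = c = torch.zeros(4, 32, 64, 64)
    for t in range(10):                                    # unroll over a 10-frame clip
        frame = torch.rand(4, 3, 64, 64)
        out, (h, c) = cell(frame, (h, c))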
3.3 Generative Models

Whilst discriminative models learn the decision boundaries between classes, generative models learn the underlying distribution of individual classes. More formally, discriminative models capture the conditional probability p(y|x), while generative models capture the joint probability p(x, y), or p(x) in the absence of labels y. The goal of generative models is the following: given some training data, generate new samples from the same distribution. Let the input data ∼ pdata(x) and the generated samples ∼ pmodel(x), where pdata and pmodel are the underlying input data distribution and the model's probability distribution, respectively. The training process consists in learning a pmodel(x) similar to pdata(x). This is done by explicitly, e.g. VAEs, or implicitly, e.g. GANs, estimating a density function from the input data. In the context of video prediction, generative models are mainly used to cope with future uncertainty by generating a wide spectrum of feasible predictions rather than a single eventual outcome.

3.3.1 Explicit Density Modeling

These models explicitly define and solve for pmodel(x).
PixelRNNs and PixelCNNs [91]: These are a type of Fully Visible Belief Networks (FVBNs) [92], [93] that explicitly define a tractable density and estimate the joint distribution p(x) as a product of conditional distributions over the pixels. Informally, they turn pixel generation into a sequential modeling problem, where the next pixel values are determined by previously generated ones. In PixelRNNs, this conditional dependency on previous pixels is modeled using two-dimensional (2D) LSTMs. In PixelCNNs, dependencies are modeled using convolutional operations over a context region, thus making training faster. In a nutshell, these methods output a distribution over pixel values at each location in the image, aiming to maximize the likelihood of the training data being generated. Further improvements of the original architectures have been carried out to address different issues. The Gated PixelCNN [94] is computationally more efficient and improves the receptive fields of the original architecture. In the same work, the authors also explored conditional modeling of natural images, where the joint probability distribution is conditioned on a latent vector that represents a high-level image description. This further enabled the extension to video prediction [95].
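The tractable density referred to above is the standard autoregressive factorization over pixels, written here for completeness in its textbook form (not an equation reproduced from this survey):

    p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}),

where the pixels x_1, ..., x_n are visited in a fixed (raster-scan) order and each conditional is parameterized by the recurrent or masked-convolutional network.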
Variational Autoencoders (VAEs): These models are an extension of Autoencoders (AEs), which encode and reconstruct their own input data x in order to capture a low-dimensional representation z containing the most meaningful factors of variation in x. Extending this architecture to generation, VAEs aim to sample new images from a prior over the underlying latent representation z. VAEs represent a probabilistic spin on the deterministic latent space of AEs. Instead of directly optimizing the density function, which is intractable, they derive and optimize a lower bound on the likelihood. Data is generated from the learned distribution by perturbing the latent variables. In the video prediction context, VAEs are the foundation of many probabilistic models dealing with future uncertainty [9], [38], [55], [81], [85], [96], [97]. Although these variational approaches are able to generate various plausible outcomes, the predictions are blurrier and of lower quality compared to state-of-the-art GAN-based models. Many approaches were taken to leverage the advantages of variational inference: some combined adversarial training with VAEs [55], and others incorporated latent probabilistic variables into deterministic models, such as Variational Recurrent Neural Networks (VRNNs) [97], [98] and Variational Encoder-Decoders (VEDs) [99].
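The lower bound mentioned above is the usual evidence lower bound (ELBO) from the VAE literature, stated here for reference in its textbook form (not an equation taken from this survey):

    \log p_\theta(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right),

where q_\phi(z|x) is the encoder (approximate posterior), p_\theta(x|z) the decoder, and p(z) the latent prior; maximizing the right-hand side trades off reconstruction quality against closeness to the prior.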
3.3.2 Implicit Density Modeling

These models learn to sample from pmodel without explicitly defining it.

GANs [100]: These are the backbone of many video prediction approaches [43], [49]–[55], [57], [65], [67], [68], [78], [101]–[106]. Inspired by game theory, these networks consist of two models that are jointly trained as a minimax game to generate new fake samples that resemble the real data. On the one hand, we have the discriminator, which models a probability distribution function describing the real data. On the other hand, we have the generator, which tries to generate new samples that fool the discriminator. In their original formulation, GANs are unconditioned: the generator samples new data from random noise, e.g. Gaussian noise. Nevertheless, Mirza et al. [107] proposed the conditional Generative Adversarial Network (cGAN), a conditional version where the generator and discriminator are conditioned on some extra information, e.g. class labels, previous predictions, and multimodal data, among others. cGANs are suitable for video prediction, since the spatio-temporal coherence between the generated frames and the input sequence is guaranteed. The use of adversarial training for the video prediction task represented a leap over the previous state-of-the-art methods in terms of prediction quality and sharpness. However, adversarial training is unstable. Without an explicit latent variable interpretation, GANs are prone to mode collapse, i.e. the generator fails to cover the space of possible predictions by getting stuck in a single mode [99], [108]. Moreover, GANs often struggle to balance the adversarial and reconstruction losses, thus producing blurry predictions. Among the dense literature on adversarial networks, we find some other interesting works addressing GAN limitations [109], [110].
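For reference, the minimax game mentioned above is commonly written in the form of the original GAN objective [100] (standard formulation from the general literature, not reproduced from this survey):

    \min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right].

In the conditional (cGAN) setting used for video prediction, both G and D additionally receive the context frames, so the discriminator judges whether a predicted future is consistent with the observed past.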
4 DATASETS

As video prediction models are mostly self-supervised, they need video sequences as input data. However, some video prediction methods rely on extra supervisory signals, e.g. segmentation maps and human poses. This makes out-of-domain video datasets perfectly suitable for video prediction. This section describes the most relevant datasets, discussing their pros and cons. Datasets were organized according to their main purpose and are summarized in Table 1.

4.1 Action and Human Pose Recognition Datasets

KTH [111]: is an action recognition dataset which includes 2391 video sequences of 4 seconds mean duration, each of them containing an actor performing an action recorded with a static camera, over homogeneous backgrounds, at 25 frames per second (fps) and with its resolution downsampled to 160 × 120 pixels. Just 6 different actions are performed, but it was the biggest dataset of this kind at the time.

Weizmann [112]: is also an action recognition dataset, created for modelling actions as space-time shapes. For this reason, it was recorded at a higher frame rate (50 fps). It includes just 90 video sequences, but covering 10 different actions. It uses a static camera, homogeneous backgrounds and low resolution (180 × 144 pixels). KTH and Weizmann are usually used together, due to their similarities, in order to augment the amount of available data.

HMDB-51 [113]: is a large-scale database for human motion recognition. It claims to represent the richness of human motion by taking advantage of the huge amount of video available online. It is composed of 6766 normalized videos (with a mean duration of 3.15 seconds) where humans appear performing one of the 51 considered action categories. Moreover, a stabilized dataset version is provided, in which camera movement is removed by detecting static backgrounds and displacing the action as a window. It also provides interesting data for each sequence, such as the visible body parts, the point of view with respect to the human, and whether there is camera movement or not. A joint-annotated version called J-HMDB [114] also exists, in which the key points of joints were manually added for 21 of the HMDB actions.

UCF101 [115]: is an action recognition dataset of realistic action videos collected from YouTube. It has 101 different action categories, and it is an extension of UCF50, which has 50 action categories. All videos have a frame rate of 25 fps and a resolution of 320 × 240 pixels. Despite being the most used dataset among predictive models, it has the problem that only a few sequences really represent movement, i.e. they often show an action over a fixed background.
Penn Action Dataset [116]: is an action and human pose recognition dataset from the University of Pennsylvania. It contains 2326 video sequences of 15 different actions, and it also provides human joint and viewpoint (position of the camera with respect to the human) annotations for each sequence. Each action is balanced in terms of the different viewpoints represented.

Human3.6M [117]: is a human pose dataset in which 11 actors with marker-based suits were recorded performing 15 different types of actions. It features RGB images, depth maps (time-of-flight range data), poses and scanned 3D surface meshes of all actors. Silhouette masks and 2D bounding boxes are also provided. Moreover, the dataset was extended by inserting high-quality 3D rigged human models (animated with the previously recorded actions) in real videos, to create a realistic and complex background.

THUMOS-15 [118]: is an action recognition challenge that was held in 2015. It did not just focus on recognizing an action in a video, but also on determining the time span in which that action occurs. With that purpose, the challenge provided a dataset that extends UCF101 [115] (trimmed videos with one action) with 2100 untrimmed videos where one or more actions take place (with the corresponding temporal annotations) and almost 3000 relevant videos without any of the 101 proposed actions.

4.2 Driving and Urban Scene Understanding Datasets

CamVid [136]: the Cambridge-driving Labeled Video Database is a driving/urban scene understanding dataset which consists of 5 video sequences recorded with a 960 × 720 pixel resolution camera mounted on the dashboard of a car. Four of those sequences were sampled at 1 fps, and one at 15 fps, resulting in 701 frames which were manually per-pixel annotated for semantic segmentation (under 32 classes). It was the first video sequence dataset of this kind to incorporate semantic annotations.

CalTech Pedestrian Dataset [119]: is a driving dataset focused on detecting pedestrians, since its only annotations are pedestrian bounding boxes. It consists of approximately 10 hours of 640 × 480, 30 fps video taken from a vehicle driving through regular traffic in an urban environment, making a total of 250 000 annotated frames distributed in 137 approximately minute-long segments. The total number of pedestrian bounding boxes is 350 000, identifying 2300 unique pedestrians. Temporal correspondence between bounding boxes and detailed occlusion labels are also provided.

Kitti [120]: is one of the most popular datasets for mobile robotics and autonomous driving, as well as a benchmark for computer vision algorithms. It is composed of hours of traffic scenarios recorded with a variety of sensor modalities, including high-resolution RGB, gray-scale stereo cameras, and a 3D laser scanner. Despite its popularity, the original dataset did not contain ground truth for semantic segmentation. However, after various researchers manually annotated parts of the dataset to fit their necessities, in 2015 the Kitti dataset was updated with 200 frames annotated at the pixel level for both semantic and instance segmentation, following the format proposed by the Cityscapes [121] dataset.

Cityscapes [121]: is a large-scale database which focuses on semantic understanding of urban street scenes. It provides semantic, instance-wise, and dense pixel annotations for 30 classes grouped into 8 categories. The dataset consists of around 5000 finely annotated images (1 frame in 30) and 20 000 coarsely annotated ones (one frame every 20 seconds or 20 meters driven by the car). Data was captured in 50 cities during several months, at different daytimes, and under good weather conditions. All frames are provided as stereo pairs, and the dataset also includes vehicle odometry obtained from in-vehicle sensors, outside temperature, and GPS tracks.

Comma.ai steering angle [137]: is a driving dataset composed of 7.25 hours of largely highway routes. It was recorded as 360 × 180 camera images at 20 fps (divided in 11 different clips), and steering angles, among other vehicle data (speed, GPS, etc.).

Apolloscape [122]: is a driving/urban scene understanding dataset that focuses on 3D semantic reconstruction of the environment. It provides highly precise information about location and 6D camera pose, as well as a much bigger amount of dense per-pixel annotations than other datasets. Along with that, depth information is retrieved from a LIDAR sensor, which allows the scene to be semantically reconstructed in 3D as a point cloud. It also provides RGB stereo pairs as video sequences recorded under various weather conditions and daytimes. These video sequences and their per-pixel instance annotations make this dataset very interesting for a wide variety of predictive models.

4.3 Object and Video Classification Datasets

Sports1M [123]: is a video classification dataset that also consists of annotated YouTube videos. In this case, it is fully focused on sports: its 487 classes correspond to the sport label retrieved from the YouTube Topics API. Video resolution, duration and frame rate differ across the available videos, but they can be normalized when accessed from YouTube. It is much bigger than UCF101 (over 1 million videos), and movement is also much more frequent.

YouTube-8M [124]: the Sports1M [123] dataset is, since 2016, part of a bigger one called YouTube-8M, which follows the same philosophy, but with all kinds of videos, not just sports. Moreover, it has been updated in order to improve the quality and precision of its annotations. In 2019 YouTube-8M Segments was released with segment-level human-verified labels on about 237 000 video segments from 1000 different classes, which are collected from the validation set of the YouTube-8M dataset. Since YouTube is the biggest video source on the planet, having annotations for some of its videos at segment level is great for predictive models.

YFCC100M [125]: the Yahoo Flickr Creative Commons 100 Million Dataset is a collection of 100 million images and videos uploaded to Flickr between 2004 and 2014. All those media files were published on Flickr under a Creative Commons license, overcoming one of the biggest issues affecting existing multimedia datasets: licensing and volume. Although only 0.8% of the elements of the dataset are videos, it is still useful for predictive models due to the great variety of these, and therefore the challenge that it represents.
TABLE 1: Summary of the most widely used datasets for video prediction (S/R: Synthetic/Real, st: stereo, de: depth, ss: semantic segmentation, is: instance segmentation, sem: semantic, I/O: Indoor/Outdoor environment, bb: bounding box, Act: Action label, ann: annotated, env: environment, ToF: Time of Flight, vp: camera viewpoint with respect to the human). The columns RGB, st, de, ss, is and "other annotations" indicate the provided data and ground truth (✓ = provided, ✗ = not provided).

name¹ | year | S/R | #videos | #frames | #ann. frames | resolution | #classes | RGB | st | de | ss | is | other annotations | env.

Action and human pose recognition datasets
KTH [111] | 2004 | R | 2391 | 250 000² | 0 | 160 × 120 | 6 (action) | ✓ | ✗ | ✗ | ✗ | ✗ | Act. | O
Weizmann [112] | 2007 | R | 90 | 9000² | 0 | 180 × 144 | 10 (action) | ✓ | ✗ | ✗ | ✗ | ✗ | Act. | O
HMDB-51 [113] | 2011 | R | 6766 | 639 300 | 0 | var × 240 | 51 (action) | ✓ | ✗ | ✗ | ✗ | ✗ | Act., vp | I/O
UCF101 [115] | 2012 | R | 13 320 | 2 000 000² | 0 | 320 × 240 | 101 (action) | ✓ | ✗ | ✗ | ✗ | ✗ | Act. | I/O
Penn Action D. [116] | 2013 | R | 2326 | 163 841 | 0 | 480 × 270 | 15 (action) | ✓ | ✗ | ✗ | ✗ | ✗ | Act., Human poses, vp | I/O
Human3.6M [117] | 2014 | SR | 4000² | 3 600 000 | 0 | 1000 × 1000 | 15 (action) | ✓ | ✗ | ToF | ✗ | ✗ | Act., Human poses & meshes | I/O
THUMOS-15 [118] | 2017 | R | 18 404 | 3 000 000² | 0 | 320 × 240 | 101 (action) | ✓ | ✗ | ✗ | ✗ | ✗ | Act., Time span | I/O

Driving and urban scene understanding datasets
Camvid [77] | 2008 | R | 5 | 18 202 | 701 (ss) | 960 × 720 | 32 (sem) | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | O
CalTech Pedest. [119] | 2009 | R | 137 | 1 000 000² | 250 000 (bb) | 640 × 480 | - | ✓ | ✗ | ✗ | ✗ | ✗ | Pedestrian bb & occlusions | O
Kitti [120] | 2013 | R | 151 | 48 791 | 200 (ss) | 1392 × 512 | 30 (sem) | ✓ | ✓ | LiDAR | ✓ | ✓ | Odometry | O
Cityscapes [121] | 2016 | R | 50 | 7 000 000² | 25 000 (ss) | 2048 × 1024 | 30 (sem) | ✓ | ✓ | stereo | ✓ | ✓ | Odometry, temp, GPS | O
Comma.ai [75] | 2016 | R | 11 | 522 000² | 0 | 160 × 320 | - | ✓ | ✗ | ✗ | ✗ | ✗ | Steering angles & speed | O
Apolloscape [122] | 2018 | R | 4 | 200 000 | 146 997 (ss) | 3384 × 2710 | 25 (sem) | ✓ | ✓ | LiDAR | ✓ | ✓ | Odometry, GPS | O

Object and video classification datasets
Sports1m [123] | 2014 | R | 1 133 158 | n/a | 0 | 640 × 360 (var.) | 487 (sport) | ✓ | ✗ | ✗ | ✗ | ✗ | Sport label | I/O
YouTube8M [124] | 2016 | R | 8 200 000 | n/a | 0 | variable | 1000 (topic) | ✓ | ✗ | ✗ | ✗ | ✗ | Topic label, Segment info | I/O
YFCC100M [125] | 2016 | SR | 8000 | n/a | 0 | variable | - | ✓ | ✗ | ✗ | ✗ | ✗ | User tags, Localization | I/O

Video prediction datasets
Bouncing balls [126] | 2008 | S | 4000 | 20 000 | 0 | 150 × 150 | - | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | -
Van Hateren [127] | 2012 | R | 56 | 3584 | 0 | 128 × 128 | - | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | I/O
NORBvideos [128] | 2013 | R | 110 560 | 552 800 | All (is) | 640 × 480 | 5 (object) | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | I
Moving MNIST [74] | 2015 | SR | custom³ | custom³ | 0 | 64 × 64 | - | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | -
Robotic Pushing [89] | 2016 | R | 57 000 | 1 500 000² | 0 | 640 × 512 | - | ✓ | ✗ | ✗ | ✗ | ✗ | Arm pose | I
BAIR Robot [129] | 2017 | R | 45 000 | n/a | 0 | n/a | - | ✓ | ✗ | ✗ | ✗ | ✗ | Arm pose | I
RoboNet [130] | 2019 | R | 161 000 | 15 000 000 | 0 | variable | - | ✓ | ✗ | ✗ | ✗ | ✗ | Arm pose | I

Other-purpose and multi-purpose datasets
ViSOR [131] | 2010 | R | 1529 | 1 360 000² | 0 | variable | - | ✓ | ✗ | ✗ | ✗ | ✗ | User tags, human bb | I/O
PROST [132] | 2010 | R | 4 (10) | 4936 (9296) | All (bb) | variable | - | ✓ | ✗ | ✗ | ✗ | ✗ | Object bb | I
Arcade Learning [133] | 2013 | S | custom³ | custom³ | 0 | 210 × 160 | - | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | -
Inria 3DMovie v2 [134] | 2016 | R | 27 | 2476 | 235 (is) | 960 × 540 | - | ✓ | ✓ | ✗ | ✗ | ✓ | Human poses, bb | I/O
Robotrix [16] | 2018 | S | 67 | 3 039 252 | All (ss) | 1920 × 1080 | 39 (sem) | ✓ | ✗ | ✓ | ✓ | ✓ | Normal maps, 6D poses | I
UASOL [135] | 2019 | R | 33 | 165 365 | 0 | 2280 × 1282 | - | ✓ | ✓ | stereo | ✗ | ✗ | ✗ | O

¹ Some dataset names have been abbreviated to enhance the table's readability.
² Values estimated based on the framerate and the total number of frames or videos, as the original values are not provided by the authors.
³ custom indicates that as many frames as needed can be generated. This applies to datasets generated from a game, algorithm or simulation, involving interaction or randomness.

4.4 Video Prediction Datasets

Standard bouncing balls dataset [126]: is a common test set for models that generate high-dimensional sequences. It consists of simulations of three balls bouncing in a box. Its clips can be generated randomly with custom resolution, but the common structure is composed of 4000 training videos, 200 testing videos and 200 more for validation. This kind of dataset is purely focused on video prediction.

Van Hateren Dataset of natural videos (version [127]): is a very small dataset of 56 videos, each 64 frames long, that has been widely used in unsupervised learning. The original images were taken and given for scientific use by the photographer Hans van Hateren, and they feature moving animals in grasslands along rivers and streams. Its frame size is 128 × 128 pixels. The version we are reviewing is the one provided along with the work of Cadieu and Olshausen [127].

NORBvideos [128]: the NORB (NYU Object Recognition Benchmark) dataset [138] is a compilation of static stereo pairs of 50 homogeneously colored objects from various points of view and 6 lighting conditions. Those images were processed to obtain their object masks and even their cast shadows, allowing the authors to augment the dataset by introducing random backgrounds. Viewpoints are determined by rotating the camera through 9 elevations and 18 azimuths (every 20 degrees) around the object. The NORBvideos dataset was built by sequencing all these frames for each object.

Moving MNIST [74] (M-MNIST): is a video prediction dataset built from the composition of 20-frame video sequences where two handwritten digits from the MNIST database are combined inside a 64 × 64 patch and moved with some velocity and direction along frames, potentially overlapping with each other. This dataset is almost infinite (as new sequences can be generated on the fly), and it also has interesting behaviours due to occlusions and the dynamics of digits bouncing off the walls of the patch. For these reasons, this dataset is widely used by many predictive models. A stochastic variant of this dataset is also available. In the original M-MNIST the digits move with constant velocity and bounce off the walls in a deterministic manner. In contrast, in SM-MNIST digits move with a constant velocity along a trajectory until they hit a wall, at which point they bounce off with a random speed and direction. In this way, moments of uncertainty (each time a digit hits a wall) are interspersed with deterministic motion.
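Because M-MNIST sequences are generated on the fly, a small, hedged NumPy sketch of the generation loop is given below (bouncing logic simplified, digit crops assumed to be pre-loaded 28 × 28 arrays, and the moving_mnist_clip function name is a placeholder; this illustrates the dataset's construction rather than reproducing the reference generator):

    import numpy as np

    def moving_mnist_clip(digits, length=20, size=64, seed=None):
        """Compose a (length, size, size) clip from 28x28 digit crops bouncing in the patch."""
        rng = np.random.default_rng(seed)
        clip = np.zeros((length, size, size), dtype=np.float32)
        pos = rng.uniform(0, size - 28, (len(digits), 2))   # top-left corner of each digit
        vel = rng.uniform(-3, 3, (len(digits), 2))          # constant velocity per digit
        for t in range(length):
            for d, digit in enumerate(digits):
                for axis in range(2):
                    # Bounce deterministically off the walls (SM-MNIST resamples vel here instead).
                    if not 0 <= pos[d, axis] + vel[d, axis] <= size - 28:
                        vel[d, axis] = -vel[d, axis]
                pos[d] += vel[d]
                y, x = pos[d].astype(int)
                region = clip[t, y:y + 28, x:x + 28]
                np.maximum(region, digit, out=region)       # overlapping digits keep the brighter pixel
        return clip

    # Demo with random noise patches; real digits would come from the MNIST images.
    clip = moving_mnist_clip([np.random.rand(28, 28), np.random.rand(28, 28)])
    print(clip.shape)                                        # (20, 64, 64)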
Robotic Pushing Dataset [89]: is a dataset created for learning about physical object motion. It consists of 640 × 512 pixel image sequences of 10 different 7-degree-of-freedom robotic arms interacting with real-world physical objects. No additional labeling is given; the dataset was designed to model motion at the pixel level through deep learning algorithms based on the convolutional LSTM (ConvLSTM).

BAIR Robot Pushing Dataset (used in [129]): the BAIR (Berkeley Artificial Intelligence Research) group has been working on robots that can learn through unsupervised training (also known in this case as self-supervised), that is, learning the consequences that their actions (movement of the arm and grip) have on the data they can measure (images from two cameras). In this way, the robot assimilates the physics of the objects and can predict the effects that its actions will generate on the environment, allowing it to plan strategies to achieve more general goals. This was improved by showing the robot how it can grab tools to interact with other objects. The dataset is composed of hours of this self-supervised learning with the robotic arm Sawyer.

RoboNet [130]: is a dataset composed of the aggregation of various self-supervised training sequences of seven robotic arms from four different research laboratories. The previously described BAIR group is one of them, along with the Stanford AI Laboratory, the Grasp Lab of the University of Pennsylvania and Google Brain Robotics. It was created with the goal of becoming a standard for robotic self-supervised learning, in the same way as ImageNet is for images. Several experiments have been performed studying how transfer among robotic arms can be achieved.

4.5 Other-purpose and Multi-purpose Datasets

ViSOR [131]: ViSOR (Video Surveillance Online Repository) is a repository designed with the aim of establishing an open platform for collecting, annotating, retrieving, and sharing surveillance videos, as well as evaluating the performance of automatic surveillance systems. Its raw data could be very useful for video prediction due to its implicitly static camera.

PROST [132]: is a method for online tracking that used ten manually annotated videos to test its performance. Four of them were created by the PROST authors, and they conform the dataset with the same name. The remaining six sequences were borrowed from other authors, who released their annotated clips to test their tracking methods. We will consider both the 4-sequence PROST dataset and the 10-sequence aggregated dataset when providing statistics. In each video, different challenges are presented for tracking methods: occlusions, 3D motion, varying illumination, heavy appearance/scale changes, moving camera, and motion blur, among others. The provided annotations include bounding boxes for the object/element being tracked.

Arcade Learning Environment [133]: is a platform that enables machine learning algorithms to interact with the Atari 2600 open-source emulator Stella to play over 500 Atari games. The interface provides a single 2D frame of 210 × 160 pixel resolution at 60 fps in real time, and up to 6000 fps when it is running at full speed. It also offers the possibility of saving and restoring the state of a game. Although its obvious main application is reinforcement learning, it could also be profitable as a source of almost-infinite interactive video sequences from which prediction models can learn.

Inria 3DMovie Dataset v2 [134]: is a video dataset which extracted its data from the StreetDance 3D stereo movies. The dataset includes stereo pairs, and manually generated ground truth for human segmentation, poses and bounding boxes. The second version of this dataset, used in [134], is composed of 27 clips, which represent 2476 frames, of which just a sparse subset of 235 were annotated.

RobotriX [16]: is a synthetic dataset designed for assistance robotics, consisting of sequences where a humanoid robot is moving through various indoor scenes and interacting with objects, recorded from multiple points of view, including robot-mounted cameras. It provides a huge variety of ground-truth data generated synthetically from highly realistic environments deployed on the cutting-edge game engine UnrealEngine, through the also available tool UnrealROX [139]. RGB frames are provided at 1920 × 1080 pixel resolution and at 60 fps, along with pixel-precise instance masks, depth and normal maps, and 6D poses of objects, skeletons and cameras. Moreover, UnrealROX is an open-source tool for retrieving ground-truth data from any simulation running in UnrealEngine.

UASOL [135]: is a large-scale dataset consisting of high-resolution sequences of stereo pairs recorded outdoors from a pedestrian (egocentric) point of view. Along with them, precise depth maps are provided, computed offline from the stereo pairs by the same camera. This dataset is intended to be useful for depth estimation, both from single and stereo images, research fields where outdoor and pedestrian-point-of-view data is not abundant. Frames were taken at a resolution of 2280 × 1282 pixels at 15 fps.

5 VIDEO PREDICTION METHODS

In the video prediction literature we find a broad range of different methods and approaches. Early models focused on directly predicting raw pixel intensities, by implicitly modeling scene dynamics and low-level details (Section 5.1). However, extracting a meaningful and robust representation from raw videos is challenging, since the pixel space is highly dimensional and extremely variable. From this point, reducing the supervision effort and the representation dimensionality emerged as a natural evolution. On the one hand, the authors aimed to disentangle the factors of variation from the visual content, i.e. factorizing the prediction space. For this purpose, they: (1) formulated the prediction problem into an intermediate transformation space by explicitly modeling the source of variability as transformations between frames (Section 5.2); (2) separated motion from the visual content with a two-stream computation (Section 5.3). On the other hand, some models narrowed the output space by conditioning the predictions on extra variables (Section 5.4), or by reformulating the problem in a higher-level space (Section 5.5). High-level representation spaces are increasingly more attractive, since intelligent systems rarely
rely on raw pixel information for decision making. Besides simplifying the prediction task, some other works addressed the future uncertainty in predictions. As the vast majority of video prediction models are deterministic, they are unable to manage probabilistic environments. To address this issue, several authors proposed modeling future uncertainty with probabilistic models (Section 5.6).

So far in the literature, there is no specific taxonomy that classifies video prediction models. In this review, we have classified the existing methods according to the video prediction problem they addressed, following the classification illustrated in Figure 3. For simplicity, each subsection directly extends the last level in the taxonomy. Moreover, some methods in this review can be classified in more than one category, since they addressed multiple problems. For instance, [9], [54], [85] are probabilistic models making predictions in a high-level space, as they addressed both the future uncertainty and the high dimensionality of videos. The category of these models was specified according to their main contribution. The most relevant methods, ordered chronologically, are summarized in Table 2, containing low-level details. Prediction is a widely discussed topic in different fields and at different levels of abstraction. For instance, future prediction from a static image [3], [106], [140]–[143], vehicle behavior prediction [144] and human action prediction [17] are different but inspiring research fields. Although related, the aforementioned topics are outside the scope of this particular review, as it focuses purely on video prediction methods using a sequence of previous frames as context and is limited to 2D RGB data.

Fig. 3: Classification of video prediction models: through direct pixel synthesis (implicit modeling of scene dynamics); factorizing the prediction space (using explicit transformations; with explicit motion from content separation); narrowing the prediction space (by conditioning on extra variables; to a high-level feature space); and by incorporating uncertainty (using probabilistic approaches).

5.1 Direct Pixel Synthesis

Initial video prediction models attempted to directly predict future pixel intensities without any explicit modeling of the scene dynamics. Ranzato et al. [73] discretized video frames into patch clusters using k-means. They assumed that non-overlapping patches are equally different in a k-means discretized space, yet similarities can be found between patches. The method is a convolutional extension of a RNN-based model [145] making short-term predictions at the patch level. As the full-resolution frame is a composition of the predicted patches, some tiling effect can be noticed. Predictions of large and fast-moving objects are accurate; however, when it comes to small and slow-moving objects there is still room for improvement. These are common issues for most methods making predictions at the patch level. Addressing longer-term predictions, Srivastava et al. [74] proposed different AE-based approaches incorporating LSTM units to model the temporal coherence. Using convolutional [146] and flow [147] percepts alongside RGB image patches, the authors tested the models on multi-domain tasks and considered both unconditioned and conditioned decoder versions. The latter only marginally improved the prediction accuracy. Replacing the fully connected LSTMs with convolutional LSTMs, Shi et al. proposed an end-to-end model efficiently exploiting spatial correlations [13]. This enhanced prediction accuracy and reduced the number of parameters.

Inspired by adversarial training: Building on the recent success of the Laplacian Generative Adversarial Network (LAPGAN), Mathieu et al. proposed the first multi-scale architecture for video prediction that was trained in an adversarial fashion [43]. Their novel GDL regularization combined with ℓ1-based reconstruction and adversarial training represented a leap over the previous state-of-the-art models [73], [74] in terms of prediction sharpness. However, it was outperformed by the Predictive Coding Network (PredNet) [75], which stacked several ConvLSTMs vertically connected by a bottom-up propagation of the local ℓ1 error computed at each level. Prior to PredNet, the same authors proposed the Predictive Generative Network (PGN) [49], an end-to-end model trained with a weighted combination of adversarial loss and MSE on synthetic data. However, no tests on natural videos and no comparison with state-of-the-art predictive models were carried out. Using a similar training strategy as [43], Zhou et al. used a convolutional AE to learn long-term dependencies from time-lapse videos [103]. Building on Progressively Growing GANs (PGGANs) [148], Aigner et al. proposed FutureGAN [69], a three-dimensional (3D) convolutional encoder-decoder (ED)-based model. They used the Wasserstein GAN with gradient penalty (WGAN-GP) loss [149] and conducted experiments on increasingly complex datasets. Extending [13], Zhang et al. proposed a novel LSTM-based architecture where hidden states are updated along a z-order curve [70]. Dealing with distortion and temporal inconsistency in predictions and inspired by the Human Visual System (HVS), Jin et al. [150] first incorporated multi-frequency analysis into the video prediction task to decompose images into low and high frequency bands. This allowed high-fidelity and temporally consistent predictions with the ground truth, as the model better
leverages the spatial and temporal details. The proposed method outperformed the previous state of the art in all metrics except the Learned Perceptual Image Patch Similarity (LPIPS), where probabilistic models achieved better performance, since their predictions are clearer and more realistic but less consistent with the ground truth. Distortion and blurriness are further accentuated when it comes to predicting under fast camera motion. To this end, Shouno [151] implemented a hierarchical residual network with top-down connections. Leveraging parallel prediction at multiple scales, the authors reported finer details and textures under fast and large camera motion.

Bidirectional flow: Under the assumption that video sequences are symmetric in time, Kwon et al. [101] explored a retrospective prediction scheme, training a generator for both forward and backward prediction (reversing the input sequence to predict the past). Their cycle GAN-based approach ensures the consistency of bidirectional prediction through retrospective cycle constraints. Similarly, Hu et al. [57] proposed a novel cycle-consistency loss used to train a GAN-based approach (VPGAN). Future frames are generated from a sequence of context frames and their variation in time, denoted as Z. Under the assumption that Z is symmetric in the encoding space, it is manipulated by the model to generate desirable moving directions. In the same spirit, other works focused on both forward and backward predictions [37], [152]. Enabling state sharing between the encoder and decoder, Oliu et al. proposed the folded Recurrent Neural Network (fRNN) [153], a recurrent AE architecture featuring GRUs that implement a bidirectional flow of the information. The model demonstrated a stratified representation, which makes the topology more explainable, and it is also more efficient than regular AEs in terms of memory consumption and computational requirements.

Exploiting 3D convolutions: For modeling short-term features, Wang et al. [66] integrated 3D convolutions into a recurrent network, demonstrating promising results in both video prediction and early activity recognition. While 3D convolutions efficiently preserve local dynamics, RNNs enable long-range video reasoning. The eidetic 3D LSTM (E3d-LSTM) network, represented in Figure 4, features a gate-controlled self-attention module, i.e. eidetic 3D memory, that effectively manages historical memory records across multiple time steps. Outperforming previous works, Yu et al. proposed the Conditionally Reversible Network (CrevNet) [154], consisting of two modules, an invertible AE and a Reversible Predictive Model (RPM). While the bijective two-way AE ensures no information loss and reduces memory consumption, the RPM extends the reversibility from the spatial to the temporal domain. Some other works used 3D convolutional

Fig. 4: Representation of the 3D encoder-decoder architecture of E3d-LSTM [66]. After reducing T consecutive input frames to high-dimensional feature maps, these are directly fed into a novel eidetic module for modeling long-term spatiotemporal dependencies. Finally, a stacked 3D CNN decoder outputs the predicted video frames. For classification tasks the hidden states can be directly used as the learned video representation. Figure extracted from [66].

Fig. 5: Representation of transformation-based approaches. (a) Vector-based, using a bilinear interpolation: It+1(x, y) = f(It(x + u, y + v)). (b) Kernel-based, applying transformations as a convolutional operation: It+1(x, y) = K(x, y) ∗ P(x, y). Figure inspired by [155].

context-aware model that efficiently aggregates per-pixel contextual information at each layer and in multiple directions. The core of their proposal is a context-aware layer consisting of two blocks, one aggregating the information from multiple directions and the other blending them into a unified context. Extracting a robust representation from raw pixel values is an overly complicated task due to the high dimensionality of the pixel space. The per-pixel variability between consecutive frames causes an exponential growth in the prediction
operations to model the time dimension [69]. error on the long-term horizon.
Analyzing the previous works, Byeon et al. [76] identified
a lack of spatial-temporal context in the representations,
5.2 Using Explicit Transformations
leading to blurry results when it comes to the future uncer-
tainty. Although authors addressed this contextual limita- Let X = (Xt−n , . . . , Xt−1 , Xt ) be a video sequence of n
tion with dilated convolutions and multi-scale architectures, frames, where t denotes time. Instead of learning the vi-
the context representation progressively vanishes in long- sual appearance, transformation-based approaches assume
term predictions. To address this issue, they proposed a that visual information is already available in the input
11

sequence. To deal with the strong similarity and pixel redun-


dancy between successive frames, these methods explicitly
model the transformations that takes a frame at time t to the
frame at t+1. These models are formally defined as follows:

Yt+1 = T (G (Xt−n:t ) , Xt−n:t ) , (1)


where G is a learned function that outputs future trans-
formation parameters, which applied to the last observed
frame Xt using the function T , generates the future frame
prediction Yt+1 . According to the classification of Reda Fig. 6: A representation of the spatial transformer module
et al. [155], T function can be defined as a vector-based proposed by [160]. First, the localization network regresses
resampling such as bilinear sampling, or adaptive kernel- the transformation parameters, denoted as θ, from the input
based resampling, e.g. using convolutional operations. For feature map U . Then, the grid generator creates a sampling
instance, a bilinear sampling operation is defined as: grid from the predicted transformation parameters. Finally,
the sampler produces the output map by sampling the input
Yt+1 (x, y) = f (Xt (x + u, y + v)) , (2)
at the points defined in the sampling grid. Figure extracted
where f is a bilinear interpolator such as [7], [156], [157], from [160].
(u, v) is a motion vector predicted by G , and Xt (x, y)
is a pixel value at (x,y) in the last observed frame Xt .
Approaches following this formulation are categorized as Moreover, it can be incorporated at any part of the CNNs
vector-based resampling operations and are depicted in and it is fully differentiable. The ST module is the essence of
Figure 5a. vector-based resampling approaches for video prediction.
On the other side, in the kernel-based resampling, the As an extension, Patraucean et al. [77] modified the grid
G function predicts the kernel K(x, y) which is applied as a generator to consider per-pixel transformations instead of a
convolution operation using T , as depicted in Figure 5b and single dense transformation map for the entire image. They
is mathematically represented as follows: nested a LSTM-based temporal encoder into a spatial AE,
proposing the AE-convLSTM-flow architecture. The predic-
Yt+1 (x, y) = K(x, y) ∗ Pt (x, y), (3) tion is generated by resampling the current frame with the
where K(x, y) ∈ RN xN is the 2D kernel predicted by the flow-based predicted transformation. Using the components
function G and Pt (x, y) is an N × N patch centered at (x, y). of the AE-convLSTM-flow architecture, Lu et al. [78] assem-
Combining kernel and vector-based resampling into a bled an extrapolation module which is unfolded in time
hybrid solution, Reda et al. [155] proposed the Spatially for multi-step prediction. Their Flexible Spatio-semporal
Displaced Convolution (SDC) module that synthesizes high- Network (FSTN) features a novel loss function using the
resolution images applying a learned per-pixel motion vec- DeePSiM perceptual loss [44] in order to mitigate blurriness.
tor and kernel at a displaced location in the source image. An exhaustive experimentation and ablation study was
Their 3D CNN model trained on synthetic data and featur- carried out, testing multiple combinations of loss functions.
ing the SDC modules, reported promising predictions of a Also inspired on the ST module for the volume sampling
high-fidelity. layer, Liu et al. proposed the Deep Voxel Flow (DVF) archi-
tecture [7]. It consists of a multi-scale flow-based ED model
originally designed for the video frame interpolation task,
5.2.1 Vector-based Resampling
but also evaluated on a predictive basis reporting sharp
Bilinear models use multiplicative interactions to extract results. Liang et al. [55] use a flow-warping layer based on a
transformations from pairs of observations in order to relate bilinear interpolation. Finn et al. proposed the Spatial Trans-
images, such as Gated Autoencoders (GAEs) [158]. Inspired former Predictor (STP) motion-based model [89] producing
by these models, Michalski et al. proposed the Predictive 2D affine transformations for bilinear sampling. Pursuing
Gating Pyramid (PGP) [159] consisting on a recurrent pyra- efficiency, Amersfoort et al. [71] proposed a CNN designed
mid of stacked GAEs. To the best of our knowledge, this to predict local affine transformations of overlapping image
was the first attempt to predict future frames in the affine patches. Unlike the ST module, authors estimated transfor-
transform space. Multiple GAEs are stacked to represent a mations of input frames off-line and at patch level. As the
hierarchy of transformations and capture higher-order de- model is parameter-efficient, it was unfolded in time for
pendencies. From the experiments on predicting frequency multi-step prediction. This resembles RNNs as the param-
modulated sin-waves, authors stated that standard RNNs eters are shared over time and the local affine transforms
were outperformed in terms of accuracy. However, no per- play the role of recurrent states.
formance comparison was conducted on videos.

Based on the Spatial Transformer (ST) module [160]: 5.2.2 Kernel-based Resampling
To provide spatial transformation capabilities to existing As a promising alternative to the vector-based resampling,
CNNs, Jaderberg et al. [160] proposed the ST module rep- recent approaches synthesize pixels by convolving input
resented in Figure 6. It regresses different affine transforma- patches with a predicted kernel. However, convolutional
tion parameters for each input, to be applied as a single operations are limited in learning spatial invariant repre-
transformation to the whole feature map(s) or image(s). sentations of complex transformations. Moreover, due to
12

their local receptive fields, global spatial information is not


fully preserved. Using larger kernels would help to pre-
serve global features, but in exchange for a higher memory
consumption. Pooling layers are another alternative, but
loosing spatial resolution. Preserving spatial resolution at a
low computational cost is still an open challenge for future
video frame prediction task. Transformation layers used
in vector-based resampling [7], [77], [160] enabled CNNs
to be spatially invariant and also inspired kernel-based
architectures.

Inspired on the Convolutional Dynamic Neural Advec-


tion (CDNA) module [89]: In addition to the STP vector-
based model, Finn et al. [89] proposed two different kernel-
based motion prediction modules outperforming previous
Fig. 7: MCnet with Multi-scale Motion-Content Residuals.
approaches [43], [80], (1) the Dynamic Neural Advection
While the motion encoder captures the temporal dynamics
(DNA) module predicting different distributions for each
in a sequence of image differences, the content encoder
pixel and (2) the CDNA module that instead of predicting
extracts meaningful spatial features from the last observed
different distributions for each pixel, it predicts multiple
RGB frame. After that, the network computes motion-
discrete distributions applied convolutionally to the input.
content features that are fed into the decoder to predict the
While, CDNA and STP mask out objects that are moving
next frame. Figure extracted from [65].
in consistent directions, the DNA module produces per-
pixel motion. These modules inspired several kernel-based
approaches. Similar to the CDNA module, Klein et al. cally as in the DFN, to then apply them to the last patch
proposed the Dynamic Convolutional Layer (DCL) [161] containing an object. Although object-centered predictions
for short-range weather prediction. Likewise, Brabandere is novel, performance drops when dealing with multiple
et al. [162] proposed the Dynamic Filter Networks (DFN) objects and occlusions as the attention module fails to dis-
generating sample (for each image) and position-specific tinguish them correctly.
(for each pixel) kernels. This enabled sophisticated and
local filtering operations in comparison with the ST module,
that is limited to global spatial transformations. Different 5.3 Explicit Motion from Content Separation
to the CDNA model, the DFN uses a softmax layer to Drawing inspiration from two-stream architectures for ac-
filter values of greater magnitude, thus obtaining sharper tion recognition [166], video generation from a static im-
predictions. Moreover, temporal correlations are exploited age [67], and unconditioned video generation [68], authors
using a parameter-efficient recurrent layer, much simpler decided to factorize the video into content and motion to
than [13], [74]. Exploiting adversarial training, Vondrick et process each on a separate pathway. By decomposing the
al. proposed a cGAN-based model [102] consisting of a high-dimensional videos, the prediction is performed on
discriminator similar to [67] and a CNN generator featur- a lower-dimensional temporal dynamics separately from
ing a transformer module inspired on the CDNA model. the spatial layout. Although this makes end-to-end training
Different from the CDNA model, transformations are not difficult, factorizing the prediction task into more tractable
applied recurrently on a per-frame basis. To deal with in- problems demonstrated good results.
the-wild videos and make predictions invariant to camera The Motion-content Network (MCnet) [65], represented
motion, authors stabilized the input videos. However, no in Figure 7 was the first end-to-end model that disentan-
performance comparison with previous works has been gled scene dynamics from the visual appearance. Authors
conducted. performed an in-depth performance analysis ensuring the
Relying on kernel-based transformations and improv- motion and content separation through generalization capa-
ing [163], Luc et al. [164] proposed the Transformation-based bilities and stable long-term predictions compared to mod-
& TrIple Video Discriminator GAN (TrIVD-GAN-FP) fea- els that lack of explicit motion-content factorization [43],
turing a novel recurrent unit that computes the parameters [74]. In a similar fashion, yet working in a higher-level pose
of a transformation used to warp previous hidden states space, Denton et al. proposed Disentangled-representation
without any supervision. These Transformation-based Spa- Net (DRNET) [79] using a novel adversarial loss —it isolates
tial Recurrent Units (TSRUs) are generic modules and can the scene dynamics from the visual content, considered as
replace any traditional recurrent unit in currently existent the discriminative component— to completely disentangle
video prediction approaches. motion dynamics from content. Outperforming [43], [65],
the DRNET demonstrated a clean motion from content
Object-centric representation: Instead of focusing on the separation by reporting plausible long-term predictions on
whole input, Chen et al. [50] modeled individual motion of both synthetic and natural videos. To improve prediction
local objects, i.e. object-centered representations. Based on variability, Liang et al. [55] fused the future-frame and
the ST module and a pyramid-like sampling [165], authors future-flow prediction into a unified architecture with a
implemented an attention mechanism for object selection. shared probabilistic motion encoder. Aiming to mitigate
Moreover, transformation kernels were generated dynami- the ghosting effect in disoccluded regions, Gae et al. [167]
13

proposed a two-staged approach consisting of a separate conditioned on the robot state and robot-object interac-
computation of flow and pixel predictions. As they focused tions performed in a controlled scenario. These models
on inpainting occluded regions of the image using flow predict per-pixel transformations conditioned by the pre-
information, they improved results on disoccluded areas vious frame, to finally combine them using a composition
avoiding undesirable artifacts and enhancing sharpness. mask. They outperformed [43], [80] on both conditioned
Separating the moving objects and the static background, and unconditioned predictions, however the quality of
Wu et al. [168] proposed a two-staged architecture that long-term predictions degrades over time because of the
firstly predicts the static background to then, using this blurriness caused by the MSE loss function. Also, using
information, predict the moving objects in the foreground. high-dimensional sensory such as images, Dosovitskiy et
Final results are generated through composition and by al. [172] proposed a sensorimotor control model which en-
means of a video inpainting module. Reported predictions ables interaction in complex and dynamic 3d environments.
are quite accurate, yet performance was not contrasted with The approach is a reinforcement learning (RL)-based tech-
the latest video prediction models. niques, with the difference that instead of building upon a
Although previous approaches disentangled motion monolithic state and a scalar reward, the authors consider
from content, they have not performed an explicit de- high-dimensional input streams, such as raw visual input,
composition into low-dimensional components. Address- alongside a stream of measurements or player statistics.
ing this issue, Hsieh et al. proposed the Decompositional Although the outputs are future measurements instead of
Disentangled Predictive Autoencoder (DDPAE) [169] that visual predictions, it was proven that using multivariate
decomposes the high-dimensional video into components data benefits decision-making over conventional scalar re-
represented with low-dimensional temporal dynamics. On ward approaches.
the Moving MNIST dataset, DDPAE first decomposes im-
ages into individual digits (components) to then factorize
5.5 In the High-level Feature Space
each digit into its visual appearance and spatial location,
being the latter easier to predict. Although experiments Despite the vast work on video prediction models, there
were performed on synthetic data, this approach represents is still room for improvement in natural video prediction.
a promising baseline to disentangle and decompose natural To deal with the curse of dimensionality, authors reduced
videos. Moreover, it is applicable to other existing models to the prediction space to high-level representations, such
improve their predictions. as semantic and instance segmentation, and human pose.
Since the pixels are categorical, the semantic space greatly
5.4 Conditioned on Extra Variables simplifies the prediction task, yet unexpected deformations
Conditioning the prediction on extra variables such as ve- in semantic maps and disocclusions, i.e. initially occluded
hicle odometry or robot state, among others, would narrow scene entities become visible, induce uncertainty. However,
the prediction space. These variables have a direct influence high-level prediction spaces are more tractable and consti-
on the dynamics of the scene, providing valuable informa- tute good intermediate representations. By bypassing the
tion that facilitates the prediction task. For instance, the prediction in the pixel space, models become able to report
motion captured by a camera placed on the dashboard of longer-term and more accurate predictions.
an autonomous vehicle is directly influenced by the wheel-
steering and acceleration. Without explicitly exploiting this 5.5.1 Semantic Segmentation
information, we rely blindly on the model’s capabilities to In recent years, semantic and instance representations have
correlate the wheel-steering and acceleration with the per- gained increasing attention, emerging as a promising av-
ceived motion. However, the explicit use of these variables enue for complete scene understanding. By decomposing
would guide the prediction. the visual scene into semantic entities, such as pedestri-
Following this paradigm, Oh et al. first made long- ans, vehicles and obstacles, the output space is narrowed
term video predictions conditioned by control inputs from to high-level scene properties. This intermediate represen-
Atari games [80]. Although the proposed ED-based models tation represents a more tractable space as pixel values
reported very long-term predictions (+100), performance of a semantic map are categorical. In other words, scene
drops when dealing with small objects (e.g. bullets in dynamics are modeled at the semantic entity level instead
Space Invaders) and while handling stochasticity due to the of being modeled at the pixel level. This has encouraged
squared error. However, by simply minimizing `2 error can authors to (1) leverage future prediction to improve parsing
lead to accurate and long-term predictions for deterministic results [51] and (2) directly predict segmentation maps into
synthetic videos, such as those extracted from Atari video the future [8], [56], [173].
games. Building on [80], Chiappa et al. [170] proposed Exploring the scene parsing in future frames, Jin et al.
alternative architectures and training schemes alongside an proposed the Parsing with prEdictive feAtuRe Learning
in-depth performance analysis for both short and long-term (PEARL) framework [51] which was the first to explore the
prediction. Similar model-based control from visual inputs potential of a GAN-based frame prediction model to im-
performed well in restricted scenarios [171], but was inade- prove per-pixel segmentation. Specifically, this framework
quate for unconstrained environments. These deterministic conducts two complementary predictive learning tasks.
approaches are unable to deal with natural videos in the Firstly, it captures the temporal context from input data
absence of control variables. by using a single-frame prediction network. Then, these
To address this limitation, the models proposed by Finn temporal features are embedded into a frame parsing net-
et al. [89] successfully made predictions on natural images, work through a transform layer for generating per-pixel
14

Fig. 8: Two-staged method proposed by Chiu et al. [174]. In the upper half, the student network consists on an ED-based
architecture featuring a 3D convolutional forecasting module. It performs the forecasting task guided by an additional loss
generated by the teacher network (represented in the lower half). Figure extracted from [174].

future segmentations. Although the predictive net was not finally combined with a novel end-to-end warp layer. An
compared with existing approaches, PEARL outperforms improvement on short-term predictions were reported over
the traditional parsing methods by generating temporally previous works [56], [175], yet performing worse on mid-
consistent segmentations. In a similar fashion, Luc et al. [56] term predictions.
extended the msCNN model of [43] to the novel task of A different approach was proposed by Vora et al. [83]
predicting semantic segmentations of future frames, using which first incorporated structure information to predict
softmax pre-activations instead of raw pixels as input. The future 3D segmented point clouds. Their geometry-based
use of intermediate features or higher-level data as input model consists of several derivable sub-modules: (1) the
is a common practice in the video prediction performed pixel-wise segmentation and depth estimation modules
in the high-level feature space. Some authors refer to this which are jointly used to generate the 3d segmented point
type or input data as percepts. Luc et al. explored different cloud of the current RGB frame; and (2) an LSTM-based
combinations of loss functions, inputs (using RGB informa- module trained to predict future camera ego-motion trajec-
tion alongside percepts), and outputs (autoregressive and tories. The future 3d segmented point clouds are obtained
batch models). Results on short, medium and long-term by transforming the previous point clouds with the pre-
predictions are sound, however, the models are not end- dicted ego-motion. Their short-term predictions improved
to-end and they do not capture explicitly the temporal the results of [56], however, the use of structure information
continuity across frames. To address this limitation and for longer-term predictions is not clear.
extending [51], Jin et al. first proposed a model for jointly The main disadvantage of two-staged, i.e. not end-to-
predicting motion flow and scene parsing [175]. Flow-based end, approaches [10], [56], [82], [83], [175] is that their
representations implicitly draw temporal correlations from performance is constrained by external supervisory signals,
the input data, thus producing temporally coherent per- e.g. optical flow [178], segmentation [179] and intermedi-
pixel segmentations. As in [56], the authors tested different ate features or percepts [61]. Breaking this trend, Chiu et
network configurations, as using Res101-FCN percepts for al. [174] first solved jointly the semantic segmentation and
the prediction of semantic maps, and also performed multi- forecasting problems in a single end-to-end trainable model
step prediction up to 10 time-steps into the future. Per- by using raw pixels as input. This ED architecture is based
pixel accuracy improved when segmenting small objects, on two networks, with one performing the forecasting task
e.g. pedestrians and traffic signs, which are more likely to (student) and the other (teacher) guiding the student by
vanish in long-term predictions. Similarly, except that time means of a novel knowledge distillation loss. An in-depth
dimension is modeled with LSTMs instead of motion flow ablation study was performed, validating the performance
estimation, Nabavi et al. proposed a simple bidirectional ED- of the ED architectures as well as the 3D convolution used
LSTM [82] using segmentation masks as input. Although for capturing temporal scale instead of a LSTM or ConvL-
the literature on knowledge distillation [176], [177] stated STM, as in previous works.
that softmax pre-activations carry more information than Avoiding the flood of deterministic models, Bhat-
class labels, this model outperforms [56], [175] on short-term tacharyya et al. proposed a Bayesian formulation of the
predictions. ResNet model in a novel architecture to capture model
and observation uncertainty [9]. As main contribution, their
Another relevant idea is to use both motion flow esti- dropout-based Bayesian approach leverages synthetic likeli-
mation alongside LSTM-based temporal modeling. In this hoods [180] to encourage prediction diversity and deal with
direction, Terwilliger et al. [10] proposed a novel method multi-modal outcomes. Since Cityscapes sequences have
performing a LSTM-based feature-flow aggregation. Au- been recorded in the frame of reference of a moving vehicle,
thors also tried to further simplify the semantic space by authors conditioned the predictions on vehicle odometry.
disentangling motion from semantic entities [65], achiev-
ing low overhead and efficiency. The prediction problem 5.5.2 Instance Segmentation
was decomposed into two subtasks, that is, current frame While great strides have been made in predicting future
segmentation and future optical flow prediction, which are segmentation maps, the authors attempted to make predic-
15

tions at a semantically richer level, i.e. future prediction of tions, they have used additional inputs ensuring trajectory
semantic instances. Predicting future instance-level segmen- and behavior variability at a human pose level. To better
tations is a challenging and weakly unexplored task. This preserve the visual appearance in the predictions than [53],
is because instance labels are inconsistent and variable in [65], [108], Tang et al. [184] firstly predict human poses using
number across the frames in a video sequence. Since the a LSTM-based model to then synthesize pose-conditioned
representation of semantic segmentation prediction models future frames using a combination of different networks: a
is of fixed-size, they cannot directly address semantics at the global GAN modeling the time-invariant background and
instance level. a coarse human pose, a local GAN refining the coarse-
To overcome this limitation and introducing the novel predicted human pose, and a 3D-AE to ensure temporal
task of predicting instance segmentations, Luc et al. [8] consistency across frames.
predict fixed-sized feature pyramids, i.e. features at multiple
scales, used by the Mask R-CNN [181] network. The com- Keypoints-based representations: The keypoint coordinate
bination of dilated convolutions and multi-scale, efficiently space is a meaningful, tractable and structured represen-
preserve high-resolution details improving the results over tation for prediction, ensuring stable learning. It enforces
previous methods [56]. To further improve predictions, Sun model’s internal representation to contain object-level infor-
et al. [84] focused on modeling not only the spatio-temporal mation. This leads to better results on tasks requiring object-
correlations between the pyramids, but also the intrinsic level understanding such as, trajectory prediction, action
relations among the feature layers inside them. By enrich- recognition and reward prediction. As keypoints are a nat-
ing the contextual information using the proposed Context ural representation of dynamic objects, Minderer et al. [85]
Pyramid ConvLSTMs (CP-ConvLSTM), an improvement in reformulated the prediction task in the keypoint coordinate
the prediction was noticed. Although the authors have space. They proposed an AE architecture with a keypoint-
not shown any long-term predictions nor compared with based representational bottleneck, consisting of a VRNN
semantic segmentation models, their approach is currently that predicts dynamics in the keypoint space. Although
the state of the art in the task of predicting instance segmen- this model qualitatively outperforms the Stochastic Video
tations, outperforming [8]. Generation (SVG) [81], Stochastic Adversarial Video Predic-
tion (SAVP) [108] and EPVA [52] models, the quantitative
evaluation reported similar results.
5.5.3 Other High-level Spaces
Although semantic and instance segmentation spaces were 5.6 Incorporating Uncertainty
the most used in video prediction, other high-level spaces
Although high-level representations significantly reduce the
such as human pose and keypoints represent a promising
prediction space, the underlying distribution still has mul-
avenue.
tiple modes. In other words, different plausible outcomes
Human Pose: As the human pose is a low-dimensional and would be equally probable for the same input sequence.
interpretable structure, it represents a cheap supervisory Addressing multimodal distributions is not straightforward
signal for predictive models. This fostered pose-guided pre- for regression and classification approaches, as they regress
diction methods, where future frame regression in the pixel to the mean and aim to discretize a continuous high-
space is conditioned by intermediate prediction of human dimensional space, respectively. To deal with the inherent
poses. However, these methods are limited to videos with unpredictability of natural videos, some works introduced
human presence. As this review focuses on video prediction, latent variables into existing deterministic models or di-
we will briefly review some of the most relevant methods rectly relied on generative models such as GANs and VAEs.
predicting human poses as an intermediate representation. Inspired by the DVF, Xue et al. [202] proposed a cVAE-
From a supervised prediction of human poses, Villegas et based [222], [223] multi-scale model featuring a novel cross
al. [53] regress future frames through analogy making [182]. convolutional layer trained to regress the difference image
Although background is not considered in the prediction, or Eulerian motion [224]. Background on natural videos is
authors compared the model against [13], [43] reporting not uniform, however the model implicitly assumes that the
long-term results. To make the model unsupervised on the difference image would accurately capture the movement
human pose, Wichers et al. [52] adopted different training in foreground objects. Introducing latent variables into a
strategies: end-to-end prediction minimizing the `2 loss, convolutional AE, Goroshin et al. [211] proposed a proba-
and through analogy making, constraining the predicted bilistic model for learning linearized feature representations
features to be close to the outputs of the future encoder. to linearly extrapolate the predicted frame in a feature space.
Different from [53], in this work the predictions are made Uncertainty is introduced to the loss by using a cosine
in the feature space. As a probabilistic alternative, Walker et distance as an explicit curvature penalty. Authors focused
al. [54] fused a conditioned Variational Autoencoder (cVAE)- on evaluating the linearization properties, yet the model
based probabilistic pose predictor with a GAN. While the was not contrasted to previous works. Extending [141],
probabilistic predictor enhances the diversity in the pre- [202], Fragkiadaki et al. [96] proposed several architectural
dicted poses, the adversarial network ensures prediction changes and training schemes to handle marginalization
realism. As this model struggles with long-term predictions, over stochastic variables, such as sampling from the prior
Fushishita et al. [183] addressed long-term video predic- and variational inference. They proposed a stochastic ED
tion of multiple outcomes avoiding the error accumulation architecture that predicts future optical flow, i.e., dense pixel
and vanishing gradients by using a unidimensional CNN motion field, used to spatially transform the current frame
trained in an adversarial fashion. To enable multiple predic- into the next frame prediction. To introduce uncertainty
16

TABLE 2: Summary of video prediction models (c: convolutional; r: recurrent; v: variational; ms: multi-scale; st: stacked; bi:
bidirectional; P: Percepts; M: Motion; PL: Perceptual Loss; AL: Adversarial Loss; S/R: using Synthetic/Real datasets; SS:
Semantic Segmentation; D: Depth; S: State; Po: Pose; O: Odometry; IS: Instance Segmentation; ms: multi-step prediction;
pred-fr: number of predicted frames, ? 1-5 frames, ? ? 5-10 frames, ? ? ? 10-100 frames, ? ? ? ? over 100 frames; ood:
indicates if model was tested on out-of-domain tasks).

details evaluation
method year based on architecture datasets (train, valid, test) input output MS loss function S/R pred-fr ood code
Direct Pixel Synthesis
Ranzato et al. [73] 2014 [145], [185] rCNN [115], [127] RGB RGB 7 CE R ? 7 7
Srivastava et al. [74] 2015 [186] LSTM-AE [74], [113], [115], [123] RGB,P RGB X CE, `2 SR ??? X X
PGN [49] 2015 - LSTM-cED [126] RGB RGB 7 M SE, AL S ? 7 7
Shi et al. [13] 2015 [74] cLSTM [74] RGB RGB 7 CE S ??? X 7
BeyondMSE [43] 2016 [60], [187] msCNN [115], [123] RGB RGB X `1 , GDL, AL R ?? 7 X
PredNet [75] 2017 [22], [188] stLSTMs [117], [119], [120], [137] RGB RGB X `1 ,`2 SR ?? X X
ContextVP [76] 2018 [88], [189] MD-LSTM [115], [117], [119], [120] RGB RGB X `1 , GDL R ?? 7 7
fRNN [153] 2018 - cGRU-AE [74], [111], [115] RGB RGB X `1 SR ??? 7 X
E3d-LSTM [66] 2019 [13] r3D-CNN [74], [111], [190], [191] RGB RGB X `1 , `2 , CE SR ??? X X
Kwon et al. [101] 2019 [45], [192], [193] cycleGAN [115], [119], [120], [194], [195] RGB RGB X `1 , LoG, AL R ??? 7 7
Znet [70] 2019 [13] cLSTM [74], [111] RGB RGB X `2 , BCE, AL SR ??? 7 7
VPGAN [57] 2019 [79], [193] GAN [111], [129] RGB,Z RGB X `1 , Lcycle , AL R ??? 7 7
Jin et al. [150] 2020 - cED-GAN [111], [119], [120], [129] RGB RGB X `2 , GDL, AL R ??? 7 7
Shouno et al. [151] 2020 [75] GAN [119], [120] RGB RGB X Lp , AL, P L R ??? 7 7
CrevNet [154] 2020 [13], [196], [197] 3d-cED [74], [119], [120], [198] RGB RGB X M SE SR ??? X X
Using Explicit Transformations
PGP [159] 2014 [157] st-rGAEs [126], [128] RGB RGB X `2 SR ? 7 7
Patraucean et al. [77] 2015 [186] LSTM-cAE [74], [113], [131], [132] RGB RGB 7 `2 , `δ SR ? X X
DFN [162] 2016 [89], [161] r-cED [74], [115] RGB RGB X BCE SR ??? X X
Amersfoort et al. [71] 2017 [77] CNN [74], [115] RGB RGB X M SE SR ?? 7 7
FSTN [78] 2017 [44], [77] LSTM-cED [74], [115], [123], [131], [132] RGB RGB X `2 , `δ , P L SR ??? 7 7
Vondrick et al. [102] 2017 [67], [89] cGAN [125] RGB RGB X CE, AL R ??? X 7
Chen et al. [50] 2017 [71], [160], [162] rCNN-ED [74], [115] RGB RGB X CE, `2 , GDL, AL SR ?? 7 7
DVF [7] 2017 [160] ms-cED [115], [118] RGB RGB X `1 , T V R ? X X
SDC-Net [155] 2018 [199], [200] CNN [119], [124] RGB,M RGB X `1 , P L SR ?? X 7
TrIVD-GAN-FP [164] 2020 [142], [163], [167] DVD-GAN [115], [129], [201] RGB RGB X Lhinge [55] R ??? 7 7
Explicit Motion from Content Separation
MCnet [65] 2017 [13], [166], [202] LSTM-cED [111], [112], [115], [123] RGB RGB X `p , GDL, AL R ??? 7 X
Dual-GAN [55] 2017 [100] VAE-GAN [115], [118]–[120] RGB RGB X `1 , KL, AL R ?? 7 7
DRNET [79] 2017 [65] LSTM-ED [74], [111], [138], [203] RGB RGB X `2 , CE, AL SR ???? X X
DPG [167] 2019 [89], [142] cED [119], [204], [205] RGB RGB X `p , T V, P L, CE SR ?? 7 7
Conditioned on Extra Variables
Oh et al. [80] 2015 [13] rED [133] RGB,A RGB X `2 S ???? X X
Finn et al. [89] 2016 [13], [80] st-cLSTMs [89], [117] RGB,A,S RGB X `2 R ??? 7 X
In the High-level Feature Space
Villegas et al. [53] 2017 [182], [206], [207] LSTM-cED [116], [117] RGB,Po RGB,Po X `2 , P L, AL [44] R ???? X 7
PEARL [51] 2017 - cED [121], [136] RGB SS 7 `2 , AL R ? X 7
S2S [56] 2017 [43] msCNN [121], [136] P SS X `1 , GDL, AL R ??? 7 X
Walker et al. [54] 2017 [208] vED [115], [116] RGB,Po RGB X `2 , CE, KL, AL R ??? X 7
Jin et al. [175] 2017 [51], [56], [77] cED [121], [137] RGB,P SS,M X `1 , GDL, CE R ??? X 7
EPVA (EPVA) [52] 2018 [53] LSTM-ED [117] RGB RGB X `2 , AL SR ???? X X
Nabavi et al. [82] 2018 [56], [175] biLSTM-cED [121] P SS X CE R ?? 7 7
F2F et al. [8] 2018 [56], [181] st-msCNN [121] P P,SS,IS X `2 R ??? X X
Vora et al. [83] 2018 - LSTM [121] ego-M ego-M 7 `1 R ? X 7
Chiu et al. [174] 2019 - 3D-cED [121], [122] RGB SS 7 CE, M SE R ?? 7 7
Bayes-WD-SL [9] 2019 [56], [175] bayesResNet [121] SS,O SS X KL SR ??? X X
Sun et al. [84] 2019 [8] st-ms-cLSTM [121], [134] P P,IS X `2 , [181] R ?? 7 7
Terwilliger et al. [10] 2019 [65], [175] M-cLSTM [121] RGB,P SS X CE, `1 R ??? 7 X
Struct-VRNN [85] 2019 [209], [210] cVRNN [90], [117] RGB RGB X `2 , KL SR ?? X X
Incorporating Uncertainty
Goroshin et al. [211] 2015 [212] cAE [138], [213] RGB RGB 7 `2 , penalty SR ? 7 7
Fragkiadaki et al. [96] 2017 [141], [202] vED [117], [214] RGB RGB 7 KL, M Cbest R ? X 7
EEN [99] 2017 [22], [75], [215] vED [216]–[218] RGB RGB X `1 , `2 SR ?? 7 X
SV2P [38] 2018 [89] CDNA [89], [117], [129] RGB RGB X `p , KL SR ??? 7 X
SVG [81] 2018 [38] LSTM-cED [74], [111], [129] RGB RGB X `2 , KL SR ???? 7 X
Castrejon et al. [97] 2019 [81], [98] vRNN [74], [121], [129] RGB RGB X KL SR ??? 7 7
Hu et al. [15] 2020 [56], [163], [219] cED [121], [122], [220], [221] RGB SS,D,M X CE, `δ , Ld , Lc , Lp R ??? X 7
17

in predictions, the authors proposed the k-best-sample-loss restricted, predictable or simulated datasets, Hu et al. [15]
(MCbest) that draws K outcomes penalizing those similar jointly predict full-frame ego-motion, static scene, and object
to the ground-truth. dynamics on complex real-world urban driving. Featuring
Incorporating latent variables into the deterministic a novel spatio-temporal module, their five-component ar-
CDNA architecture for the first time, Babaeizadeh et chitecture learns rich representations that incorporate both
al. proposed the Stochastic Variational Video Prediction local and global spatio-temporal context. Authors validated
(SV2P) [38] model handling natural videos. Their time- the model on predicting semantic segmentation, depth and
invariant posterior distribution is approximated from the optical flow, two seconds in the future outperforming exist-
entire input video sequence. Authors demonstrated that, by ing spatio-temporal architectures. However, no performance
explicitly modeling uncertainty with latent variables, the comparison with [81], [108] has been carried out.
deterministic CDNA model is outperformed. By combin-
ing a standard deterministic architecture (LSTM-ED) with
stochastic latent variables, Denton et al. proposed the SVG 6 P ERFORMANCE E VALUATION
network [81]. Different from SV2P, the prior is sampled from This section presents the results of the previously analyzed
a time-varying posterior distribution, i.e. it is a learned-prior video prediction models on the most popular datasets on
instead of fixed-prior sampled from the same distribution. the basis of the metrics described below.
Most of the VAEs use a fixed Gaussian as a prior, sampling
randomly at each time step. Exploiting the temporal depen-
dencies, a learned-prior predicts high variance in uncertain 6.1 Metrics and Evaluation Protocols
situations, and a low variance when a deterministic predic- For a fair evaluation of video prediction systems, multiple
tion suffices. The SVG model is easier to train and reported aspects in the prediction have to be addressed such as
sharper predictions in contrast to [38]. Built upon SVG, whether the predicted sequences look realistic, are plau-
Villegas et al. [225] implemented a baseline to perform an sible and cover all possible outcomes. To the best of our
in-depth empirical study on the importance of the inductive knowledge, there are no evaluation protocols and metrics
bias, stochasticity, and model’s capacity in the video predic- that evaluate the predictions by fulfilling simultaneously all
tion task. Different from previous approaches, Henaff et al. these aspects.
proposed the Error Encoding Network (EEN) [99] that incor- The most widely used evaluation protocols for video
porates uncertainty by feeding back the residual error —the prediction rely on image similarity-based metrics such
difference between the ground truth and the deterministic as, Mean-Squared Error (MSE), Structural Similarity Index
prediction— encoded as a low-dimensional latent variable. Measure (SSIM) [229], and Peak Signal to Noise Ratio
In this way, the model implicitly separates the input video (PSNR). However, evaluating a prediction according to the
into deterministic and stochastic components. mismatch between its visual appearance and the ground
On the one hand, latent variable-based approaches cover truth is not always reliable. In practice, these metrics pe-
the space of possible outcomes, yet predictions lack of nalize all predictions that deviate from the ground truth. In
realism. On the other hand, GANs struggle with uncertainty, other words, they prefer blurry predictions nearly accom-
but predictions are more realistic. Searching for a trade- modating the exact ground truth than sharper and plausible
off between VAEs and GANs, Lee et al. [108] proposed but imperfect generations [97], [108], [230]. Pixel-wise met-
the SAVP model, being the first to combine latent variable rics do not always reflect how accurate a model captured
models with GANs to improve variability in video pre- video scene dynamics and their temporal variability. In
dictions, while maintaining realism. Under the assumption addition, the success of a metric is influenced by the loss
that blurry predictions of VAEs are a sign of underfit- function used to train the model. For instance, the models
ting, Castrejon et al. extended the VRNNs to leverage a trained with MSE loss function would obviously perform
hierarchy of latent variables and better approximate data well on MSE metric, but also on PSNR metric as it is based
likelihood [97]. Although the backpropagation through a on MSE. Suffering from similar problems, SSIM measures
hierarchy of conditioned latents is not straightforward, the similarity between two images, from −1 (very dissim-
several techniques alleviated this issue such as, KL beta ilar) to +1 (the same image). As a difference, it measures
warm-up, dense connectivity pattern between inputs and similarities on image patches instead of performing pixel-
latents, Ladder Variational Autoencoders (LVAEs) [226]. As wise comparison. These metrics are easily fooled by learning
most of the probabilistic approaches fail in approximating to match the background in predictions. To address this
the true distribution of future frames, Pottorff et al. [227] issue, Mathieu et al. [43] evaluated the predictions only on
reformulated the video prediction task without making any the dynamic parts of the sequence, avoiding background
assumption about the data distribution. They proposed the influence.
Invertible Linear Embedding (ILE) enabling exact maximum As the pixel space is multimodal and highly-
likelihood learning of video sequences, by combining an dimensional, it is challenging to evaluate how accurately
invertible neural network [228], also known as reversible a prediction sequence covers the full distribution of pos-
flows, and a linear time-invariant dynamic system. The sible outcomes. Addressing this issue, some probabilistic
ILE handles nonlinear motion in the pixel space and scales approaches [81], [97], [108] adopted a different evaluation
better to longer-term predictions compared to adversarial protocol to assess prediction coverage. Basically, they sam-
models [43]. ple multiple random predictions and then they search for
While previous variational approaches [81], [108] fo- the best match with the ground truth sequence. Finally,
cused on predicting a single frame of low resolution in they report the best match using common metrics. This
18

TABLE 3: Results on M-MNIST (Moving MNIST). Predicting TABLE 4: Results on KTH dataset. Predicting the next y
the next y frames from x context frames (x → y ). † results frames from x context frames (x → y ). † results reported
reported by Oliu et al. [153], ‡ results reported by Wang et al. by Oliu et al. [153], ‡ results reported by Wang et al. [66], ∗
[66], ∗ results reported by Wang et al. [197], / results reported results reported by Zhang et al. [70], / results reported by
by Wang et al. [235]. MSE represents per-pixel average MSE Jin et al. [150]. Per-pixel average MSE (10−3 ). Best results
(10−3 ). MSE represents per-frame error. are represented in bold.

M-MNIST M-MNIST KTH KTH KTH


(10 → 10) (10 → 30) (10 → 10) (10 → 20) (10 → 40)
method MSE MSE SSIM PSNR CE MSE SSIM
method MSE PSNR SSIM PSNR SSIM PSNR
BeyondMSE [43] 27.48† 122.6∗ 0.713∗ 15.969† - - -
Srivastava et al. [74] 17.37† 118.3∗ 0.690∗ 18.183† 341.2 180.1/ 0.583/ Srivastava et al. [74]† 9.95 21.22 - - - -
Shi et al. [13] - 96.5‡ 0.713‡ - 367.2∗ 156.2/ 0.597/ PredNet [75]† 3.09 28.42 - - - -
DFN [162] - 89.0‡ 0.726‡ - 285.2 149.5/ 0.601/ BeyondMSE [43]† 1.80 29.34 - - - -
CDNA [89] - 84.2‡ 0.728‡ - 346.6∗ 142.3/ 0.609/ fRNN [153] 1.75 29.299 0.771/ 26.12/ 0.678/ 23.77/
VLN [236] - - - - 187.7 MCnet [65] 1.65† 30.95† 0.804‡ 25.95‡ 0.73/ 23.89/
Patraucean et al. [77] 43.9 - - - 179.8 - - RLN [237]† 1.39 31.27 - - - -
MCnet [65]† 42.54 - - 13.857 - - -
Shi et al. [13]‡ - - 0.712 23.58 0.639 22.85
RLN [237]† 42.54 - - 13.857 - - -
PredNet [75]† 41.61 - - 13.968 - - -
SAVP [108]/ - - 0.746 25.38 0.701 23.97
fRNN [153] 9.47 68.4‡ 0.819‡ 21.386 - - - VPN [95]∗ - - 0.746 23.76 - -
PredRNN [197] - 56.8 0.867 - 97.0 - - DFN [162]‡ - - 0.794 27.26 0.652 23.01
VPN [95] - 64.1‡ 0.870‡ - 87.6 129.6/ 0.620/ fRNN [153]‡ - - 0.771 26.12 0.678 23.77
Znet [70] - 50.5 0.877 - - - - Znet [70] - - 0.817 27.58 - -
PredRNN++ [235] - 46.5 0.898 - - 91.1 0.733 SV2P invariant [38]/ - - 0.826 27.56 0.778 25.92
E3d-LSTM [66] - 41.3 0.910 - - - -
SV2P variant [38]/ - - 0.838 27.79 0.789 26.12
CrevNet [154] - 22.3 0.949 - - - -
PredRNN [197] - - 0.839 27.55 0.703‡ 24.16‡
VarNet [238]/ - - 0.843 28.48 0.739 25.37
SAVP-VAE [108]/ - - 0.852 27.77 0.811 26.18
represents the most common evaluation protocol for prob- PredRNN++ [235] - - 0.865 28.47 0.741‡ 25.21‡
abilistic video prediction. Other methods [97], [150], [151] MSNET [239] - - 0.876 27.08 - -
also reported results using: LPIPS [230] as a perceptual E3d-LSTM [66] - - 0.879 29.31 0.810 27.24
metric comparing CNN features, or Frchet Video Distance Jin et al. [150] - - 0.893 29.85 0.851 27.56
(FVD) [231] to measure sample realism by comparing under-
lying distributions of predictions and ground truth. More-
over, Lee et al. [108] used the VGG Cosine Similarity metric variation in the evaluation protocols of the video prediction
that performs cosine similarity to the features extracted with models.
the VGGnet [146] from the predictions. Many authors evaluated their methods on the Moving
Some other alternative metrics include the inception MNIST synthetic environment. Although it represents a
score [232] introduced to deal with GANs mode collapse restricted and quasi-deterministic scenario, long-term pre-
problem by measuring the diversity of generated samples; dictions are still challenging. The black and homogeneous
perceptual similarity metrics, such as DeePSiM [44]; mea- background induce methods to accurately extrapolate black
suring sharpness based on difference of gradients [43]; frames and vanish the predicted digits in the long-term hori-
Parzen window [233], yet deficient for high-dimensional zon. Under this configuration, the CrevNet model demon-
images; and the Laplacian of Gaussians (LoG) [60], [234] strated a leap over the previous state of the art. As the
used in [101]. In the semantic segmentation space, authors second best, the E3d-LSTM network reported stable errors
used the popular Intersection over Union (IoU) metric. in both short-term and longer-term predictions showing
Inception score was also widely used to report results on the advantages of their memory attention mechanism. It
different methods [54], [65], [67], [79]. Differently, on the also reported the second best results on the KTH dataset,
basis of the EPVA model [52] a quantitative evaluation was after [150] which achieved the best overall performance and
performed, based on the confidence of an external method demonstrated quality predictions on natural videos.
trained to identify whether the generated video contains Performing short-term predictions in the KTH dataset,
a recognizable person. While some authors [10], [43], [56] the Recurrent Ladder Network (RLN) outperformed MC-
evaluated the performance only on the dynamic parts of net and fRNN by a slight margin. The RLN architecture
the image, other directly opted for visual human evaluation draws similarities with fRNN, except that the former uses
through Amazon Mechanical Turk (AMT) workers, without bridge connections and the latter, state sharing that im-
a direct quantitative evaluation. proves memory consumption. On the Moving MNIST and
UCF101 datasets, fRNN outperformed RLN. Other interest-
ing methods to highlight are PredRNN and PredRNN++,
6.2 Results both providing close results to E3d-LSTM. State-of-the-art
In this section we report the quantitative results of the results using different metrics were reported on Caltech
most relevant methods reviewed in the previous sections. Pedestrian by Kwon et al. [101], CrevNet [154], and Jin et
To achieve a wide comparison, we limited the quantitative al. [150]. The former, by taking advantage of its retrospective
results to the most common metrics and datasets. We have prediction scheme, was also the overall winner on the UCF-
distributed the results in different tables, given the large 101 dataset meanwhile the latter outperformed previous
19

TABLE 5: Results on Caltech Pedestrian. Predicting the next TABLE 7: Results on SM-MNIST (Stochastic Moving
y frames from x context frames (x → y ). † reported by Kwon MNIST), BAIR Push and Cityscapes datasets. † results re-
et al. [101], ‡ reported by Reda et al. [155], ∗ reported by Gao ported by Castrejon et al. [97]. ‡ results reported by Jin et al.
et al. [167], / reported by Jin et al. [150]. Per-pixel average [150].
MSE (10−3 ). Best results are represented in bold.
SM-MNIST BAIR Push Cityscapes
(5 → 10) (2 → 28) (2 → 28)
Caltech Pedestrian
(10 → 1) method FVD SSIM FVD SSIM PSNR FVD SSIM
SVG [81] 90.81† 0.688† 256.62† 0.816† 17.72‡ 1300.26† 0.574†
method MSE SSIM PSNR LPIPS
SAVP [108] - - 143.43† 0.795† 18.42‡ - -
BeyondMSE [43]‡ 3.42 0.847 - - SAVP-VAE [108] - - - 0.815‡ 19.09‡ - -
MCnet [65]‡ 2.50 0.879 - - SV2P inv. [38]‡ - - - 0.817 20.36 - -
vRNN 1L [97] 63.81 0.763 149.22 0.829 - 682.08 0.609
DVF [7]∗ - 0.897 26.2 5.57/
vRNN 3L [97] 57.17 0.760 143.40 0.822 - 567.51 0.628
Dual-GAN [55] 2.41 0.899 - - Jin et al. [150] - - - 0.844 21.02 - -
CtrlGen [142]∗ - 0.900 26.5 6.38/
PredNet [75]† 2.42 0.905 27.6 7.47/
TABLE 8: Results on Cityscapes dataset. Predicting the next
ContextVP [76] 1.94 0.921 28.7 6.03/
y time-steps of semantic segmented frames from 4 context
GAN-VGG [151] - 0.916 - 3.61
frames (4 → y ). ‡ IoU results on eight moving objects
G-VGG [151] - 0.917 - 3.52
classes. † results reported by Chiu et al. [174]
SDC-Net [155] 1.62 0.918 - -
Kwon et al. [101] 1.61 0.919 29.2 -
Cityscapes
DPG [167] − 0.923 28.2 5.04/
(4 → 1) (4 → 3) (4 → 9) (4 → 10)
G-MAE [151] - 0.923 - 4.30
GAN-MAE [151] - 0.923 - 4.09 method IoU IoU IoU IoU
CrevNet [154] - 0.925 29.3 - S2S [56]‡ - 55.3 40.8 -
Jin et al. [150] - 0.927 29.1 5.89 S2S-maskRCNN [8]‡ - 55.4 42.4 -
S2S [56] 62.60‡ 59.4 47.8 -
TABLE 6: Results on UCF-101 dataset. Predicting the next x Nabavi et al. [82] 71.37 60.06 - -
frames from y context frames (x → y ). † results reported by F2F [8] - 61.2 41.2 -
Oliu et al. [153]. Per-pixel average MSE (10−3 ). Best results Vora et al. [83] - 61.47 45.4 -
are represented in bold. S2S-Res101-FCN [175] - 62.6 - 50.8
Terwilliger et al. [10]‡ - 65.1 46.3 -
Chiu et al. [174] 72.43 65.53 50.52
UCF-101 UCF-101
Jin et al. [175] - 66.1 - 53.9
(10 → 10) (4 → 1)
Bayes-WD-SL [9] 75.3 66.7 52.5 -
method MSE PSNR MSE SSIM PSNR Terwilliger et al. [10] 73.2 67.1 51.5 52.5
Srivastava et al. [74]† 148.66 10.02 - - -
PredNet [75]† 15.50 19.87 - - -
BeyondMSE [43]† 9.26 22.78 - - - MNIST dataset, the stochastic version includes uncertain
MCnet [65] 9.40† 23.46† - 0.91 31.0 digit trajectories, i.e. the digits bounce off the border with
RLN [237]† 9.18 23.56 - - - a random new direction. On this dataset, both versions of
fRNN [153] 9.08 23.87 - - - Castrejon et al. models (1L, without a hierarchy of latents,
BeyondMSE [43] - - - 0.92 32
and 3L with a 3-level hierarchy of latents) outperform SVG
Dual-GAN [55] - - - 0.94 30.5
by a large margin. On the Bair Push dataset, SAVP reported
DVF [7] - - - 0.94 33.4
ContextVP [76] - - - 0.92 34.9 sharper and more realistic-looking predictions than SVG
Kwon et al. [101] - - 1.37 0.94 35.0 which suffer of blurriness. However, both models were
outperformed by [97] as well on the Cityscapes dataset.
The model based on a 3-level hierarchy of latents [97]
methods on the BAIR Push dataset. outperform previous works on all three datasets, showing
On the one hand, some approaches have been evalu- the advantages of the extra expressiveness of this model.
ated on other datasets: SDC-Net [155] outperformed [43],
[65] on YouTube8M, TrIVD-GAN-FP outperformed [163], 6.2.2 Results on the High-level Prediction Space
[240] on Kinetics-600 test set [201], E3d-LSTM compared Most of the methods have chosen the semantic segmentation
their method with [95], [153], [197], [235] on the TaxiBJ space to make predictions. Although they relied on differ-
dataset [190], and CrevNet [154] on Traffic4cast [198]. On the ent datasets for training, performance results were mostly
other hand, some explored out-of-domain tasks [13], [66], reported on the Cityscapes dataset using the IoU metric.
[102], [154], [162] (see ood column in Table 2). Authors explored short-term (next-frame prediction), mid-
term (+3 time steps in the future) and long-term (up to
6.2.1 Results on Probabilistic Approaches +10 time step in the future) predictions. On the semantic
Video prediction probabilistic methods have been mainly segmentation prediction space, Bayes-WD-SL [9], the model
evaluated on the Stochastic Moving MNIST, Bair Push and proposed by Terwilliger et al. [10], and Jin et al. [51] reported
Cityscapes datasets. Different from the original Moving the best results. Among these methods, it is noteworthy
20

Among these methods, it is noteworthy that Bayes-WD-SL was the only one to explore prediction diversity on the basis of a Bayesian formulation.
In the instance segmentation space, the pioneering F2F method [8] was outperformed by Sun et al. [84] on short- and mid-term predictions using the AP50 and AP evaluation metrics. In the keypoint coordinate space, the seminal model of Minderer et al. [85] qualitatively outperforms SVG [81], SAVP [108] and EPVA [52], yet pixel-wise metrics reported similar results. In the human pose space, Tang et al. [184], by regressing future frames from human pose predictions, outperformed SAVP [108], MCnet [65] and [53] on the basis of the PSNR and SSIM metrics on the Penn Action and J-HMDB [114] datasets.
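The IoU numbers used in the Cityscapes comparison above are computed between forecast and ground-truth segmentation maps. The snippet below is a minimal sketch of such a mean-IoU computation; the class count and the ignore label are illustrative assumptions, not values prescribed by the reviewed methods.

import numpy as np

def mean_iou(pred_labels, gt_labels, num_classes, ignore_label=255):
    # Per-class intersection over union between two integer label maps,
    # averaged over the classes that are present in either map.
    valid = gt_labels != ignore_label
    ious = []
    for c in range(num_classes):
        pred_c = (pred_labels == c) & valid
        gt_c = (gt_labels == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy example: a forecast segmentation compared against the ground truth.
gt = np.random.randint(0, 19, size=(128, 256))            # 19 classes, Cityscapes-like
pred = gt.copy()
pred[:, :32] = np.random.randint(0, 19, size=(128, 32))   # corrupt a region of the forecast
print(f"mean IoU: {mean_iou(pred, gt, num_classes=19):.3f}")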
7 DISCUSSION

The video prediction literature ranges from the direct synthesis of future pixel intensities to complex probabilistic models that address prediction uncertainty. Between these two extremes lie methods that try to factorize or narrow the prediction space. Simplifying the prediction task has been a natural evolution of video prediction models, influenced by several open research challenges discussed below. Due to the curse of dimensionality and the inherent pixel variability, developing a robust prediction based on raw pixel intensities is overly complicated. This often leads to the regression-to-the-mean problem, visually manifested as blurriness. Making parametric models larger would improve the quality of predictions, yet this is currently incompatible with high-resolution predictions due to memory constraints. Transformation-based approaches propagate pixels from previous frames based on estimated flow maps; in this case, prediction quality is directly influenced by the accuracy of the estimated flow. Similarly, prediction in a high-level space is mostly conditioned by the quality of extra supervisory signals such as semantic maps and human poses, to name a few, and erroneous supervision signals would harm prediction quality.

Analyzing the impact of the inductive bias on the performance of a video prediction model, Villegas et al. [225] demonstrated that the performance of the SVG model [81] is maximized with minimal inductive bias (e.g. segmentation or instance maps, optical flow, adversarial losses, etc.) when the scale of computation is progressively increased. A common assumption when addressing the prediction task in a high-level feature space is that simplifying the prediction space directly improves long-term predictions. However, even if the complexity of the prediction space is reduced, it remains multimodal when dealing with natural videos. For instance, when it comes to long-term predictions in the semantic segmentation space, most of the models reported predictions only up to ten time steps into the future. This directly suggests that the choice of the prediction space is still an unsolved problem. Finding a trade-off between the complexity of the prediction space and the output quality is challenging: an overly simplified representation could limit the prediction on complex data such as natural videos. Although abstract predictions suffice for many decision-making systems based on visual reasoning, prediction in pixel space is still being addressed.

From the analysis performed in this review, and in line with the conclusions extracted from [225], we state that: (1) including recurrent connections and stochasticity in a video prediction model generally leads to improved performance; (2) increasing model capacity while maintaining a low inductive bias also improves prediction performance; (3) multi-step predictions conditioned on previously generated outputs are prone to accumulate errors, diverging from the ground truth when addressing long-term horizons; (4) authors predicted further into the future without relying on high-level feature spaces; and (5) combining pixel-wise losses with adversarial training somewhat mitigates the regression-to-the-mean issue.

7.1 Research Challenges

Despite the wealth of currently existing video prediction approaches and the significant progress made in this field, there is still room to improve state-of-the-art algorithms. To foster progress, open research challenges must be clearly identified and disentangled. So far in this review, we have already discussed: (1) the importance of spatio-temporal correlations as a self-supervisory signal for predictive models; (2) how to deal with future uncertainty and model the underlying multimodal distributions of natural videos; (3) the over-complicated task of learning meaningful representations while dealing with the curse of dimensionality; and (4) pixel-wise loss functions and the blurry results obtained when dealing with equally probable outcomes, i.e. probabilistic environments. These issues define the open research challenges in video prediction.

Currently existing methods are limited to short-term horizons. While frames in the immediate future are extrapolated with high accuracy, in the long-term horizon the prediction problem becomes multimodal by nature. Initial solutions consisted of conditioning the prediction on previously predicted frames. However, these autoregressive models tend to accumulate prediction errors that progressively drive the generated predictions away from the expected outcome. On the other hand, due to memory issues, there is a lack of resolution in predictions. Authors tried to address this issue by composing the full-resolution image from small predicted patches; however, as the results are not convincing because of the annoying tiling effect, most of the available models are still limited to low-resolution predictions. In addition to the lack of resolution and long-term predictions, models are still prone to the regression-to-the-mean problem, which consists of averaging the output frame to accommodate multiple equally probable outcomes. This is directly related to pixel-wise loss functions, which focus the learning process on visual appearance. The choice of the loss function is an open research problem with a direct influence on prediction quality. Finally, the lack of reliable and fair evaluation protocols makes the qualitative evaluation of video prediction challenging and represents another open problem.
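The regression-to-the-mean issue discussed above can be made concrete with a toy computation. The following sketch is an illustration rather than an experiment from the reviewed literature: it considers two equally probable sharp futures and shows that the single frame minimizing the expected MSE is their blurry pixel-wise average.

import numpy as np

def expected_mse(candidate, futures):
    # Expected per-pixel MSE of one candidate prediction against a set of
    # equally probable ground-truth futures.
    return np.mean([(candidate - f) ** 2 for f in futures])

# Two equally probable futures: a bright square moves either left or right.
frame = np.zeros((32, 32))
left, right = frame.copy(), frame.copy()
left[12:20, 4:12] = 1.0
right[12:20, 20:28] = 1.0
futures = [left, right]

blurry_mean = 0.5 * (left + right)  # what an MSE-trained model regresses to
candidates = [("sharp (left)", left), ("sharp (right)", right), ("blurry mean", blurry_mean)]
for name, cand in candidates:
    print(f"{name:13s} expected MSE = {expected_mse(cand, futures):.4f}")
# The blurry average attains the lowest expected MSE even though it is not a
# plausible future, which is why pixel-wise losses favour blurred frames.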
7.2 Future Directions

Based on the reviewed research and the identified state-of-the-art video prediction methods, we present some promising future research directions.

Consider alternative loss functions: Pixel-wise loss functions are widely used in the video prediction task, causing blurry predictions when dealing with uncontrolled environments or long-term horizons. In this regard, great efforts have been made in the literature to identify more suitable loss functions for the prediction task. However, despite the wide spectrum of existing loss functions, most models still rely blindly on deterministic loss functions.
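One concrete alternative explored in the literature is the gradient difference loss introduced in [43], which compares the spatial gradients of the predicted and ground-truth frames and is typically combined with an Lp reconstruction term. The following is a minimal NumPy sketch of that combination; the weighting factors, the use of a mean instead of a sum, and the function names are illustrative assumptions, not values recommended in [43].

import numpy as np

def gradient_difference_loss(pred, gt, alpha=1.0):
    # Penalize mismatches between horizontal and vertical intensity gradients,
    # which sharpens edges that a plain Lp loss tends to blur [43].
    dy_pred, dy_gt = np.abs(np.diff(pred, axis=0)), np.abs(np.diff(gt, axis=0))
    dx_pred, dx_gt = np.abs(np.diff(pred, axis=1)), np.abs(np.diff(gt, axis=1))
    return np.mean(np.abs(dy_pred - dy_gt) ** alpha) + \
           np.mean(np.abs(dx_pred - dx_gt) ** alpha)

def combined_loss(pred, gt, lambda_lp=1.0, lambda_gdl=1.0, p=1):
    # Weighted sum of an Lp reconstruction term and the gradient difference term.
    lp = np.mean(np.abs(pred - gt) ** p)
    return lambda_lp * lp + lambda_gdl * gradient_difference_loss(pred, gt)

# Toy usage on a pair of frames in [0, 1].
gt = np.random.rand(64, 64)
pred = np.clip(gt + 0.1 * np.random.randn(64, 64), 0.0, 1.0)
print(f"combined loss: {combined_loss(pred, gt):.4f}")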
Alternatives to RNNs: Currently, RNNs are still widely used in this field to model temporal dependencies, and they have achieved state-of-the-art results on different benchmarks [66], [153], [197], [235]. Nevertheless, some methods also relied on 3D convolutions to further enhance video prediction [66], [174], representing a promising avenue.

Use synthetically generated videos: Simplifying the prediction task is a current trend in the video prediction literature. A vast number of video prediction models explored higher-level feature spaces to reformulate the prediction task into a more tractable problem. However, this mostly conditions the prediction on the accuracy of an external source of supervision such as optical flow, human pose, or pre-activations (percepts) extracted from supervised networks, among others. This issue could be alleviated by taking advantage of existing fully-annotated and photorealistic synthetic datasets or by using data generation tools. Video prediction in photorealistic synthetic scenarios has not yet been explored in the literature.

Evaluation metrics: Since the most widely used evaluation protocols for video prediction rely on image similarity-based metrics, the need for fairer evaluation metrics is pressing. A fair metric should not penalize predictions that deviate from the ground truth at the pixel level if their content represents a plausible future prediction at a higher level, i.e. if the dynamics of the scene correspond to a plausible reality. In this regard, some methods evaluate the similarity between distributions or at a higher level. However, there is still room for improvement in the evaluation protocols for video prediction and generation [241].

8 CONCLUSION

In this review, after reformulating the predictive learning paradigm in the context of video prediction, we have closely reviewed the fundamentals on which it is based: exploiting the time dimension of videos, dealing with stochasticity, and the importance of the loss functions in the learning process. Moreover, an analysis of the backbone deep-learning-based architectures for this task was performed in order to provide the reader with the necessary background knowledge. The core of this study encompasses the analysis and classification of more than 50 methods and the datasets they have used. Methods were analyzed from three perspectives: method description, contribution over previous works, and performance results. They have also been classified according to a proposed taxonomy based on their main contribution. In addition, we have presented a comparative summary of the datasets and methods in tabular form so that the reader can identify low-level details at a glance. Finally, we have discussed the performance results on the most popular datasets and metrics to provide useful insights in the shape of future research directions and open problems. In conclusion, video prediction is a promising avenue for the self-supervised learning of rich spatio-temporal correlations that can provide prediction capabilities to existing intelligent decision-making systems. While great strides have been made, there is still room for improvement in video prediction using deep learning techniques.

ACKNOWLEDGMENTS

This work has been funded by the Spanish Government PID2019-104818RB-I00 grant for the MoDeaAS project. This work has also been supported by two Spanish national grants for PhD studies, FPU17/00166 and ACIF/2018/197, respectively. Experiments were made possible by a generous hardware donation from NVIDIA.

REFERENCES

[1] M. H. Nguyen and F. D. la Torre, “Max-margin early event detectors,” in CVPR, 2012.
[2] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert, “Activity Forecasting,” in ECCV, 2012.
[3] C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating Visual Representations from Unlabeled Video,” in CVPR, 2016.
[4] K. Zeng, W. B. Shen, D. Huang, M. Sun, and J. C. Niebles, “Visual Forecasting by Imitating Dynamics in Natural Sequences,” in ICCV, 2017.
[5] S. Shalev-Shwartz, N. Ben-Zrihem, A. Cohen, and A. Shashua, “Long-term planning by short-term prediction,” arXiv:1602.01580, 2016.
[6] O. Makansi, E. Ilg, O. Cicek, and T. Brox, “Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction,” in CVPR, 2019.
[7] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame synthesis using deep voxel flow,” in ICCV, 2017.
[8] P. Luc, C. Couprie, Y. LeCun, and J. Verbeek, “Predicting Future Instance Segmentation by Forecasting Convolutional Features,” in ECCV, 2018, pp. 593–608.
[9] A. Bhattacharyya, M. Fritz, and B. Schiele, “Bayesian prediction of future street scenes using synthetic likelihoods,” in ICLR, 2019.
[10] A. Terwilliger, G. Brazil, and X. Liu, “Recurrent flow-guided semantic forecasting,” in WACV, 2019.
[11] A. Bhattacharyya, M. Fritz, and B. Schiele, “Long-Term On-Board Prediction of People in Traffic Scenes Under Uncertainty,” in CVPR, 2018.
[12] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection - A new baseline,” in CVPR. IEEE, 2018.
[13] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in NeurIPS, 2015.
[14] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo, “Deep learning for precipitation nowcasting: A benchmark and a new model,” in NeurIPS, 2017.
[15] A. Hu, F. Cotter, N. Mohan, C. Gurau, and A. Kendall, “Probabilistic future prediction for video scene understanding,” arXiv:2003.06409, 2020.
[16] A. Garcia-Garcia, P. Martinez-Gonzalez, S. Oprea, J. A. Castro-Vargas, S. Orts-Escolano, J. Garcia-Rodriguez, and A. Jover-Alvarez, “The robotrix: An extremely photorealistic and very-large-scale indoor dataset of sequences with robot trajectories and interactions,” in IROS, 2018, pp. 6790–6797.
[17] Y. Kong and Y. Fu, “Human action recognition and prediction: A survey,” arXiv:1806.11230, 2018.
[18] C. Sahin, G. Garcia-Hernando, J. Sock, and T. Kim, “A review on object pose recovery: from 3d bounding box detectors to full 6d pose estimators,” arXiv:2001.10609, 2020.
[19] V. Villena-Martinez, S. Oprea, M. Saval-Calvo, J. A. López, A. F. Guilló, and R. B. Fisher, “When deep learning meets data alignment: A review on deep registration networks (DRNs),” arXiv:2003.03167, 2020.
[20] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, [49] W. Lotter, G. Kreiman, and D. D. Cox, “Unsupervised learn-
vol. 521, no. 7553, 2015. ing of visual structure using predictive generative networks,”
[21] J. Hawkins and S. Blakeslee, On Intelligence. Times Books, 2004. arXiv:1511.06380, 2015.
[22] R. P. N. Rao and D. H. Ballard, “Predictive coding in the vi- [50] X. Chen, W. Wang, J. Wang, and W. Li, “Learning object-centric
sual cortex: a functional interpretation of some extra-classical transformation for video prediction,” in ACM-MM, ser. MM ’17.
receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, 1999. New York, NY, USA: ACM, 2017.
[23] D. Mumford, “On the computational architecture of the neocor- [51] X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong,
tex,” Biological Cybernetics, vol. 66, no. 3, 1992. L. Liu, Z. Jie, J. Feng, and S. Yan, “Video Scene Parsing with
[24] A. Cleeremans and J. L. McClelland, “Learning the structure of Predictive Feature Learning,” in ICCV, 2017.
event sequences.” Journal of Experimental Psychology: General, vol. [52] N. Wichers, R. Villegas, D. Erhan, and H. Lee, “Hierarchical
120, no. 3, 1991. long-term video prediction without supervision,” in ICML, ser.
[25] A. Cleeremans and J. Elman, Mechanisms of implicit learning: Proceedings of Machine Learning Research, vol. 80, 2018.
Connectionist models of sequence processing. MIT press, 1993. [53] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, “Learn-
[26] R. Baker, M. Dexter, T. E. Hardwicke, A. Goldstone, and ing to generate long-term future via hierarchical prediction,” in
Z. Kourtzi, “Learning to predict: Exposure to temporal sequences ICML, 2017.
facilitates prediction of future events,” Vision Research, vol. 99, [54] J. Walker, K. Marino, A. Gupta, and M. Hebert, “The pose knows:
2014. Video forecasting by generating pose futures,” in ICCV, 2017.
[27] H. E. M. den Ouden, P. Kok, and F. P. de Lange, “How prediction [55] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion GAN for
errors shape perception, attention, and motivation,” in Front. future-flow embedded video prediction,” in ICCV, 2017.
Psychology, 2012. [56] P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun,
[28] W. R. Softky, “Unsupervised pixel-prediction,” in NeurIPS, 1995. “Predicting Deeper into the Future of Semantic Segmentation,”
[29] G. Deco and B. Schürmann, “Predictive coding in the visual in ICCV, 2017.
cortex by a recurrent network with gabor receptive fields,” Neural [57] Z. Hu and J. Wang, “A novel adversarial inference framework
Processing Letters, vol. 14, no. 2, 2001. for video prediction with action control,” in ICCV Workshops, Oct
[30] A. Hollingworth, “Constructing visual representations of natural 2019.
scenes: the roles of short- and long-term visual memory.” Journal [58] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
of experimental psychology. Human perception and performance, vol. learning applied to document recognition,” Proceedings of the
30 3, 2004. IEEE, vol. 86, no. 11, Nov 1998.
[31] Y. Bengio, A. C. Courville, and P. Vincent, “Representation learn- [59] V. Jain, J. F. Murray, F. Roth, S. C. Turaga, V. P. Zhigulin, K. L. Brig-
ing: A review and new perspectives,” Trans. on PAMI, vol. 35, gman, M. Helmstaedter, W. Denk, and H. S. Seung, “Supervised
no. 8, 2013. learning of image restoration with convolutional networks,” in
[32] X. Wang and A. Gupta, “Unsupervised Learning of Visual Rep- ICCV, 2007.
resentations Using Videos,” in ICCV, 2015. [60] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus, “Deep gen-
[33] P. Agrawal, J. Carreira, and J. Malik, “Learning to see by mov- erative image models using a laplacian pyramid of adversarial
ing,” in ICCV, 2015. networks,” in NeurIPS, 2015.
[34] D.-A. Huang, V. Ramanathan, D. Mahajan, L. Torresani, [61] F. Yu, V. Koltun, and T. A. Funkhouser, “Dilated residual net-
M. Paluri, L. Fei-Fei, and J. Carlos Niebles, “What makes a video works,” in CVPR. IEEE, 2017.
a video: Analyzing temporal information in video understanding [62] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
models and datasets,” in CVPR, June 2018. Yuille, “Deeplab: Semantic image segmentation with deep con-
[35] L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman, volutional nets, atrous convolution, and fully connected crfs,”
B. Schölkopf, and W. T. Freeman, “Seeing the arrow of time,” in TPAMI, vol. 40, no. 4, 2018.
CVPR, 2014. [63] W. Luo, Y. Li, R. Urtasun, and R. S. Zemel, “Understanding
[36] D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman, “Learning the Effective Receptive Field in Deep Convolutional Neural Net-
and using the arrow of time,” in CVPR, 2018. works,” in NeurIPS, 2016.
[37] I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and learn: Unsu- [64] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
pervised learning using temporal order verification,” in ECCV, image recognition,” in CVPR, 2016.
2016. [65] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing
[38] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and motion and content for natural video sequence prediction,” in
S. Levine, “Stochastic variational video prediction,” in ICLR, ICLR, 2017.
2018. [66] Y. Wang, L. Jiang, M.-H. Yang, L.-J. Li, M. Long, and L. Fei-Fei,
[39] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for im- “Eidetic 3d LSTM: A model for video prediction and beyond,” in
age restoration with neural networks,” IEEE Trans. Computational ICLR, 2019.
Imaging, vol. 3, no. 1, 2017. [67] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating Videos
[40] K. Janocha and W. M. Czarnecki, “On loss functions for deep with Scene Dynamics,” in NeurIPS, 2016.
neural networks in classification,” arXiv:1702.05659, 2017. [68] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, “MoCoGAN: De-
[41] A. Kendall and R. Cipolla, “Geometric loss functions for camera composing motion and content for video generation,” in CVPR,
pose regression with deep learning,” in CVPR, 2017. June 2018.
[42] J.-J. Hwang, T.-W. Ke, J. Shi, and S. X. Yu, “Adversarial structure [69] S. Aigner and M. Körner, “Futuregan: Anticipating the future
matching for structured prediction tasks,” in CVPR, 2019. frames of video sequences using spatio-temporal 3d convolutions
[43] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video in progressively growing autoencoder gans,” arXiv:1810.01325,
prediction beyond mean square error,” in ICLR (Poster), 2016. 2018.
[44] A. Dosovitskiy and T. Brox, “Generating images with perceptual [70] J. Zhang, Y. Wang, M. Long, W. Jianmin, and P. S. Yu, “Z-order
similarity metrics based on deep networks,” in NIPS, 2016. recurrent neural networks for video prediction,” in ICME, July
[45] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real- 2019.
time style transfer and super-resolution,” in ECCV, vol. 9906, [71] J. R. van Amersfoort, A. Kannan, M. Ranzato, A. Szlam, D. Tran,
2016. and S. Chintala, “Transformation-based models of video se-
[46] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, quences,” arXiv:1701.08435, 2017.
A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, [72] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning
“Photo-realistic single image super-resolution using a generative representations by back-propagating errors,” Nature, vol. 323, no.
adversarial network,” in CVPR, 2017. 6088, 1986.
[47] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, “Enhancenet: Single [73] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and
image super-resolution through automated texture synthesis,” in S. Chopra, “Video (language) modeling: a baseline for generative
ICCV, 2017. models of natural videos,” arXiv:1412.6604, 2014.
[48] J. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative [74] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsuper-
visual manipulation on the natural image manifold,” in ECCV, vised Learning of Video Representations using LSTMs,” in ICML,
ser. Lecture Notes in Computer Science, vol. 9909, 2016. 2015.
[75] W. Lotter, G. Kreiman, and D. Cox, “Deep Predictive Coding [104] P. Bhattacharjee and S. Das, “Temporal coherency based criteria
Networks for Video Prediction and Unsupervised Learning,” in for predicting video frames using deep multi-stage generative
ICLR (Poster), 2017. adversarial networks,” in NIPS, 2017.
[76] W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos, [105] M. Saito, E. Matsumoto, and S. Saito, “Temporal generative
“Contextvp: Fully context-aware video prediction,” in CVPR adversarial nets with singular value clipping,” in ICCV, 2017.
(Workshops), 2018. [106] B. Chen, W. Wang, and J. Wang, “Video imagination from a sin-
[77] V. Patraucean, A. Handa, and R. Cipolla, “Spatio-temporal video gle image with transformation generation,” in ACM Multimedia,
autoencoder with differentiable memory,” (ICLR) Workshop, 2015. 2017.
[78] C. Lu, M. Hirsch, and B. Schölkopf, “Flexible Spatio-Temporal [107] M. Mirza and S. Osindero, “Conditional generative adversarial
Networks for Video Prediction,” in CVPR, 2017. nets,” arXiv:1411.1784, 2014.
[79] E. L. Denton and V. Birodkar, “Unsupervised learning of disen- [108] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine,
tangled representations from video,” in NeurIPS, 2017. “Stochastic adversarial video prediction,” arXiv:1804.01523, 2018.
[80] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. P. Singh, “Action- [109] A. Radford, L. Metz, and S. Chintala, “Unsupervised represen-
Conditional Video Prediction using Deep Networks in Atari tation learning with deep convolutional generative adversarial
Games,” in NeurIPS, 2015. networks,” in ICLR, 2016.
[81] E. Denton and R. Fergus, “Stochastic video generation with a [110] M. Arjovsky and L. Bottou, “Towards principled methods for
learned prior,” in ICML, ser. Proceedings of Machine Learning training generative adversarial networks,” in ICLR, 2017.
Research, J. G. Dy and A. Krause, Eds., vol. 80, 2018. [111] C. Schüldt, I. Laptev, and B. Caputo, “Recognizing human ac-
tions: A local SVM approach,” in ICPR. IEEE, 2004.
[82] S. shahabeddin Nabavi, M. Rochan, and Y. Wang, “Future Seman-
[112] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri,
tic Segmentation with Convolutional LSTM,” in BMVC, 2018.
“Actions as space-time shapes,” Trans. on PAMI, vol. 29, no. 12,
[83] S. Vora, R. Mahjourian, S. Pirk, and A. Angelova, “Future seg- 2007.
mentation using 3d structure,” arXiv:1811.11358, 2018.
[113] H. Kuehne, H. Jhuang, E. Garrote, T. A. Poggio, and T. Serre,
[84] J. Sun, J. Xie, J. Hu, Z. Lin, J. Lai, W. Zeng, and W. Zheng, “Pre- “HMDB: A large video database for human motion recognition,”
dicting future instance segmentation with contextual pyramid in ICCV, 2011.
convLSTMs,” in ACM Multimedia. ACM, 2019. [114] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black, “Towards
[85] M. Minderer, C. Sun, R. Villegas, F. Cole, K. P. Murphy, and understanding action recognition,” in ICCV, 2013.
H. Lee, “Unsupervised learning of object structure and dynamics [115] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101
from videos,” in NeurIPS, 2019. human actions classes from videos in the wild,” arXiv:1212.0402,
[86] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” 2012.
Neural Computation, vol. 9, no. 8, 1997. [116] W. Zhang, M. Zhu, and K. G. Derpanis, “From actemes to
[87] K. Cho, B. van Merrienboer, Ç. Gülçehre, F. Bougares, action: A strongly-supervised representation for detailed action
H. Schwenk, and Y. Bengio, “Learning phrase representations understanding,” in ICCV, 2013.
using RNN encoder-decoder for statistical machine translation,” [117] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Hu-
EMNLP, pp. 1724–1734, 2014. man3.6m: Large scale datasets and predictive methods for 3d
[88] A. Graves, S. Fernández, and J. Schmidhuber, “Multi- human sensing in natural environments,” Trans. on PAMI, vol. 36,
dimensional recurrent neural networks,” in ICANN, vol. 4668, no. 7, 2014.
2007. [118] H. Idrees, A. R. Zamir, Y. Jiang, A. Gorban, I. Laptev, R. Suk-
[89] C. Finn, I. J. Goodfellow, and S. Levine, “Unsupervised Learning thankar, and M. Shah, “The THUMOS challenge on action recog-
for Physical Interaction through Video Prediction,” in NeurIPS, nition for videos ”in the wild”,” CVIU, vol. 155, 2017.
2016. [119] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detec-
[90] E. Zhan, S. Zheng, Y. Yue, L. Sha, and P. Lucey, “Generating tion: A benchmark,” in CVPR, 2009.
multi-agent trajectories using programmatic weak supervision,” [120] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets
in ICLR, 2019. robotics: The kitti dataset,” IJRR, vol. 32, no. 11, 2013.
[91] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel [121] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,
Recurrent Neural Networks,” in ICML, 2016. R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes
[92] R. M. Neal, “Connectionist learning of belief networks,” Artif. dataset for semantic urban scene understanding,” in CVPR, 2016.
Intell., vol. 56, no. 1, 1992. [122] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin,
[93] Y. Bengio and S. Bengio, “Modeling high-dimensional discrete and R. Yang, “The apolloscape dataset for autonomous driving,”
data with multi-layer neural networks,” in NeurIPS, 1999. arXiv: 1803.06184, 2018.
[94] A. van den Oord, N. Kalchbrenner, L. Espeholt, K. Kavukcuoglu, [123] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and
O. Vinyals, and A. Graves, “Conditional image generation with F. Li, “Large-scale video classification with convolutional neural
pixelcnn decoders,” in NIPS, 2016. networks,” in CVPR, 2014.
[95] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, [124] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici,
O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel net- B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A large-
works,” in ICML, 2017, pp. 1771–1779. scale video classification benchmark,” arXiv:1609.08675, 2016.
[125] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni,
[96] K. Fragkiadaki, J. Huang, A. Alemi, S. Vijayanarasimhan,
D. Poland, D. Borth, and L. Li, “YFCC100M: the new data in
S. Ricco, and R. Sukthankar, “Motion prediction un-
multimedia research,” Commun. ACM, vol. 59, no. 2, 2016.
der multimodality with conditional stochastic networks,”
[126] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The recurrent
arXiv:1705.02082, 2017.
temporal restricted boltzmann machine,” in NIPS, 2008.
[97] L. Castrejon, N. Ballas, and A. Courville, “Improved conditional
[127] C. F. Cadieu and B. A. Olshausen, “Learning intermediate-level
vrnns for video prediction,” in ICCV, 2019.
representations of form and motion from natural movies,” Neural
[98] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Computation, vol. 24, no. 4, 2012.
Y. Bengio, “A recurrent latent variable model for sequential data,” [128] R. Memisevic and G. Exarchakis, “Learning invariant features by
in NIPS, 2015. harnessing the aperture problem,” in ICML, vol. 28, 2013.
[99] M. Henaff, J. J. Zhao, and Y. LeCun, “Prediction under uncer- [129] F. Ebert, C. Finn, A. X. Lee, and S. Levine, “Self-supervised
tainty with error-encoding networks,” arXiv:1711.04994, 2017. visual planning with temporal skip connections,” in CoRL, ser.
[100] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde- Proceedings of Machine Learning Research, vol. 78, 2017.
Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adver- [130] S. Dasari, F. Ebert, S. Tian, S. Nair, B. Bucher, K. Schmeckpeper,
sarial nets,” in NIPS, 2014, pp. 2672–2680. S. Singh, S. Levine, and C. Finn, “Robonet: Large-scale multi-
[101] Y.-H. Kwon and M.-G. Park, “Predicting future frames using robot learning,” arXiv:1910.11215, 2019.
retrospective cycle gan,” in CVPR, 2019. [131] R. Vezzani and R. Cucchiara, “Video surveillance online repos-
[102] C. Vondrick and A. Torralba, “Generating the Future with Ad- itory (visor): an integrated framework,” Multimedia Tools Appl.,
versarial Transformers,” in CVPR, 2017. vol. 50, no. 2, 2010.
[103] Y. Zhou and T. L. Berg, “Learning Temporal Transformations [132] J. Santner, C. Leistner, A. Saffari, T. Pock, and H. Bischof, “PROST:
from Time-Lapse Videos,” in ECCV, 2016. parallel robust online simple tracking,” in CVPR, 2010.
[133] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The [161] B. Klein, L. Wolf, and Y. Afek, “A dynamic convolutional layer
arcade learning environment: An evaluation platform for general for short rangeweather prediction,” in CVPR, 2015.
agents,” J. Artif. Intell. Res., vol. 47, 2013. [162] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool, “Dynamic
[134] G. Seguin, P. Bojanowski, R. Lajugie, and I. Laptev, “Instance- filter networks,” in NeurIPS, 2016.
level video segmentation from object tracks,” in CVPR, 2016. [163] A. Clark, J. Donahue, and K. Simonyan, “Adversarial video
[135] Z. Bauer, F. Gomez-Donoso, E. Cruz, S. Orts-Escolano, and M. Ca- generation on complex datasets,” 2019.
zorla, “UASOL, a large-scale high-resolution outdoor stereo [164] P. Luc, A. Clark, S. Dieleman, D. de Las Casas, Y. Doron, A. Cas-
dataset,” Scientific Data, vol. 6, no. 1, 2019. sirer, and K. Simonyan, “Transformation-based adversarial video
[136] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla, “Segmenta- prediction on large-scale data,” arXiv:2003.04035, 2020.
tion and recognition using structure from motion point clouds,” [165] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling
in ECCV, vol. 5302, 2008. in deep convolutional networks for visual recognition,” Trans. on
[137] E. Santana and G. Hotz, “Learning a driving simulator,” PAMI, vol. 37, no. 9, 2015.
arXiv:1608.01230, 2016. [166] K. Simonyan and A. Zisserman, “Two-stream convolutional net-
[138] Y. LeCun, F. J. Huang, and L. Bottou, “Learning methods for works for action recognition in videos,” in NeurIPS, Z. Ghahra-
generic object recognition with invariance to pose and lighting,” mani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Wein-
in CVPR, 2004. berger, Eds., 2014.
[139] P. Martinez-Gonzalez, S. Oprea, A. Garcia-Garcia, A. Jover- [167] H. Gao, H. Xu, Q. Cai, R. Wang, F. Yu, and T. Darrell, “Disentan-
Alvarez, S. Orts-Escolano, and J. Garcia-Rodriguez, “Unreal- gling propagation and generation for video prediction,” in ICCV,
ROX: An extremely photorealistic virtual reality environment 2019.
for robotics simulations and synthetic data generation,” Virtual [168] Y. Wu, R. Gao, J. Park, and Q. Chen, “Future video synthesis with
Reality, 2019. object motion prediction,” 2020.
[140] D. Jayaraman and K. Grauman, “Look-ahead before you leap: [169] J. Hsieh, B. Liu, D. Huang, F. Li, and J. C. Niebles, “Learning
End-to-end active recognition by forecasting the effect of mo- to decompose and disentangle representations for video predic-
tion,” in ECCV, vol. 9909, 2016. tion,” in NeurIPS, 2018.
[141] J. Walker, C. Doersch, A. Gupta, and M. Hebert, “An Uncertain [170] S. Chiappa, S. Racanière, D. Wierstra, and S. Mohamed, “Recur-
Future: Forecasting from Static Images Using Variational Autoen- rent environment simulators,” in ICLR, 2017.
coders,” in ECCV, 2016. [171] K. Fragkiadaki, P. Agrawal, S. Levine, and J. Malik, “Learning
[142] Z. Hao, X. Huang, and S. J. Belongie, “Controllable video gener- visual predictive models of physics for playing billiards,” in ICLR
ation with sparse trajectories,” in CVPR, 2018. (Poster), 2016.
[143] Y. Ye, M. Singh, A. Gupta, and S. Tulsiani, “Compositional video [172] A. Dosovitskiy and V. Koltun, “Learning to Act by Predicting the
prediction,” in ICCV, October 2019. Future,” in ICLR, 2017.
[144] S. Mozaffari, O. Y. Al-Jarrah, M. Dianati, P. A. Jennings, [173] P. Luc, “Self-supervised learning of predictive segmentation
and A. Mouzakitis, “Deep learning-based vehicle behaviour models from video,” Theses, Université Grenoble Alpes, Jun.
prediction for autonomous driving applications: A review,” 2019. [Online]. Available: https://tel.archives-ouvertes.fr/tel-0
arXiv:1912.11676, 2019. 2196890
[145] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khu- [174] H.-k. Chiu, E. Adeli, and J. C. Niebles, “Segmenting the future,”
danpur, “Recurrent neural network based language model,” in arXiv:1904.10666, 2019.
INTERSPEECH, 2010. [175] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng,
[146] K. Simonyan and A. Zisserman, “Very deep convolutional net- and S. Yan, “Predicting Scene Parsing and Motion Dynamics in
works for large-scale image recognition,” in ICLR, 2015. the Future,” in NeurIPS, 2017.
[147] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy [176] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in
optical flow estimation based on a theory for warping,” in ECCV, NIPS, 2014.
T. Pajdla and J. Matas, Eds., vol. 3024, 2004. [177] G. E. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge
[148] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing in a neural network,” arXiv:1503.02531, 2015.
of gans for improved quality, stability, and variation,” in ICLR, [178] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid,
2018. “EpicFlow: Edge-preserving interpolation of correspondences for
[149] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. optical flow,” in CVPR, 2015.
Courville, “Improved training of wasserstein gans,” in NIPS, [179] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
2017. network,” in CVPR, 2017.
[150] B. Jin, Y. Hu, Q. Tang, J. Niu, Z. Shi, Y. Han, and X. Li, “Exploring [180] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mo-
spatial-temporal multi-frequency analysis for high-fidelity and hamed, “Variational approaches for auto-encoding generative
temporal-consistency video prediction,” arXiv:2002.09905, 2020. adversarial networks,” arXiv:1706.04987, 2017.
[151] O. Shouno, “Photo-realistic video prediction on natural videos of [181] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask R-CNN,”
largely changing frames,” arXiv:2003.08635, 2020. in ICCV, 2017.
[152] R. Hou, H. Chang, B. Ma, and X. Chen, “Video prediction with [182] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee, “Deep visual analogy-
bidirectional constraint network,” in FG, May 2019. making,” in NIPS, 2015.
[153] M. Oliu, J. Selva, and S. Escalera, “Folded recurrent neural [183] N. Fushishita, A. Tejero-de-Pablos, Y. Mukuta, and T. Harada,
networks for future video prediction,” in ECCV, 2018. “Long-term video generation of multiple futures using human
[154] W. Yu, Y. Lu, S. Easterbrook, and S. Fidler, “Efficient and poses,” arXiv:1904.07538, 2019.
information-preserving future frame prediction and beyond,” in [184] J. Tang, H. Hu, Q. Zhou, H. Shan, C. Tian, and T. Q. S. Quek,
ICLR, 2020. “Pose guided global and local gan for appearance preserving
[155] F. A. Reda, G. Liu, K. J. Shih, R. Kirby, J. Barker, D. Tarjan, A. Tao, human video prediction,” in ICIP, Sep. 2019.
and B. Catanzaro, “SDC-Net: Video prediction using spatially- [185] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural
displaced convolution,” in ECCV, 2018. probabilistic language model,” J. Mach. Learn. Res., vol. 3, no.
[156] R. Memisevic and G. E. Hinton, “Learning to represent spa- null, p. 11371155, Mar. 2003.
tial transformations with factored higher-order boltzmann ma- [186] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence
chines,” Neural Computation, vol. 22, no. 6, 2010. learning with neural networks,” in NeurIPS, 2014.
[157] R. Memisevic, “Gradient-based learning of higher-order image [187] A. Mahendran and A. Vedaldi, “Understanding deep image
features,” in ICCV, 2011. representations by inverting them,” in CVPR, 2015.
[158] ——, “Learning to relate images,” Trans. on PAMI, vol. 35, no. 8, [188] R. Chalasani and J. C. Prı́ncipe, “Deep predictive coding net-
2013. works,” in ICLR (Workshop Poster), 2013.
[159] V. Michalski, R. Memisevic, and K. Konda, “Modeling deep tem- [189] M. F. Stollenga, W. Byeon, M. Liwicki, and J. Schmidhuber, “Par-
poral dependencies with recurrent grammar cells,” in NeurIPS, allel multi-dimensional lstm, with application to fast biomedical
2014. volumetric image segmentation,” in NeurIPS, 2015.
[160] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, [190] J. Zhang, Y. Zheng, and D. Qi, “Deep spatio-temporal residual
“Spatial Transformer Networks,” in NeurIPS, 2015. networks for citywide crowd flows prediction,” in AAAI, 2017.
[191] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, [220] F. Yu, W. Xian, Y. Chen, F. Liu, M. Liao, V. Madhavan, and
H. Kim, V. Haenel, I. Fründ, P. Yianilos, M. Mueller-Freitag, T. Darrell, “BDD100K: A diverse driving video database with
F. Hoppe, C. Thurau, I. Bax, and R. Memisevic, “The ”something scalable annotation tooling,” arXiv:1805.04687, 2018.
something” video database for learning and evaluating visual [221] G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder, “The
common sense,” in ICCV, 2017. mapillary vistas dataset for semantic understanding of street
[192] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised scenes,” in ICCV, 2017.
dual learning for image-to-image translation,” in ICCV, 2017. [222] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”
[193] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to- in ICLR, 2014.
image translation using cycle-consistent adversarial networks,” [223] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Condi-
in ICCV, 2017. tional image generation from visual attributes,” in ECCV, vol.
[194] W. Luo, W. Liu, and S. Gao, “A revisit of sparse coding based 9908, 2016.
anomaly detection in stacked RNN framework,” in ICCV. IEEE, [224] H. Wu, M. Rubinstein, E. Shih, J. V. Guttag, F. Durand, and
2017. W. T. Freeman, “Eulerian video magnification for revealing subtle
[195] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S. changes in the world,” ToG, vol. 31, no. 4, 2012.
Regazzoni, and N. Sebe, “Abnormal event detection in videos [225] R. Villegas, A. Pathak, H. Kannan, D. Erhan, Q. V. Le, and H. Lee,
using generative adversarial nets,” in ICIP, 2017. “High fidelity video prediction with large stochastic recurrent
[196] L. Dinh, D. Krueger, and Y. Bengio, “NICE: non-linear indepen- neural networks,” in NeurIPS, 2019, pp. 81–91.
dent components estimation,” in ICLR (Workshop), 2015. [226] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and
[197] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, “Predrnn: O. Winther, “Ladder variational autoencoders,” in NIPS, 2016.
Recurrent neural networks for predictive learning using spa- [227] R. Pottorff, J. Nielsen, and D. Wingate, “Video extrapolation with
tiotemporal lstms,” in NeurIPS, 2017. an invertible linear embedding,” arXiv:1903.00133, 2019.
[228] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with
[198] “Traffic4cast: Traffic map movie forecasting,” https://www.iara
invertible 1x1 convolutions,” in NeurIPS, 2018.
i.ac.at/traffic4cast/, accessed: 2020-04-14.
[229] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image
[199] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via
quality assessment: from error visibility to structural similarity,”
adaptive separable convolution,” in ICCV. IEEE, 2017.
IEEE Trans. Image Processing, vol. 13, no. 4, 2004.
[200] ——, “Video frame interpolation via adaptive convolution,” in [230] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang,
CVPR. IEEE, 2017. “The unreasonable effectiveness of deep features as a perceptual
[201] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisser- metric,” in CVPR, 2018.
man, “A short note about kinetics-600,” arXiv:1808.01340, 2018. [231] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier,
[202] T. Xue, J. Wu, K. L. Bouman, and B. Freeman, “Visual Dynamics: M. Michalski, and S. Gelly, “Towards accurate generative models
Probabilistic Future Frame Synthesis via Cross Convolutional of video: A new metric & challenges,” arXiv:1812.01717, 2018.
Networks,” in NeurIPS, 2016. [232] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford,
[203] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A. and X. Chen, “Improved techniques for training gans,” in NIPS,
Funkhouser, “Semantic scene completion from a single depth 2016.
image,” in CVPR. IEEE, 2017. [233] O. Breuleux, Y. Bengio, and P. Vincent, “Quickly generating
[204] M. Menze and A. Geiger, “Object scene flow for autonomous representative samples from an rbm-derived process,” Neural
vehicles,” in CVPR, 2015. Computation, vol. 23, no. 8, 2011.
[205] J. Janai, F. Güney, A. Ranjan, M. J. Black, and A. Geiger, “Unsu- [234] E. Hildreth, “Theory of edge detection,” Proc. of Royal Society of
pervised learning of multi-frame optical flow with occlusions,” London, vol. 207, no. 187-217, 1980.
in ECCV, vol. 11220, 2018. [235] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, “Predrnn++:
[206] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, Towards A resolution of the deep-in-time dilemma in spatiotem-
“Learning what and where to draw,” in NIPS, 2016. poral predictive learning,” in ICML, ser. Proceedings of Machine
[207] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks Learning Research, vol. 80, 2018.
for human pose estimation,” in ECCV, vol. 9912, 2016. [236] F. Cricri, X. Ni, M. Honkala, E. Aksu, and M. Gabbouj, “Video
[208] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent ladder networks,” arXiv:1612.01756, 2016.
Network Models for Human Dynamics,” in ICCV, 2015. [237] I. Prémont-Schwarz, A. Ilin, T. Hao, A. Rasmus, R. Boney, and
[209] T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi, “Conditional H. Valpola, “Recurrent ladder networks,” in NIPS, 2017.
image generation for learning the structure of visual objects,” [238] B. Jin, Y. Hu, Y. Zeng, Q. Tang, S. Liu, and J. Ye, “Varnet: Exploring
arXiv:1806.07823, 2018. variations for unsupervised video prediction,” in IROS, 2018.
[210] Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, and H. Lee, “Unsupervised [239] J. Lee, J. Lee, S. Lee, and S. Yoon, “Mutual suppression
discovery of object landmarks as structural representations,” in network for video prediction using disentangled features,”
CVPR, 2018. arXiv:1804.04810, 2018.
[211] R. Goroshin, M. Mathieu, and Y. LeCun, “Learning to linearize [240] D. Weissenborn, O. Tckstrm, and J. Uszkoreit, “Scaling autore-
under uncertainty,” in NeurIPS, 2015. gressive video models,” in ICLR, 2020.
[241] L. Theis, A. van den Oord, and M. Bethge, “A note on the
[212] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-
evaluation of generative models,” in ICLR, 2016.
encoders,” in ICANN, vol. 6791. Springer, 2011.
[213] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun,
“Unsupervised learning of spatiotemporally coherent metrics,”
in ICCV, 2015.
[214] T. Brox and J. Malik, “Object segmentation by long term analysis
of point trajectories,” in ECCV, vol. 6315, 2010.
[215] J. Schmidhuber, “Learning complex, extended sequences using
the principle of history compression,” Neural Computation, vol. 4,
no. 2, 1992.
[216] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” in NeurIPS, 2016, pp. 5092–5100.
[217] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in ICML, vol. 48, 2016.
[218] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end simulated driving,” in AAAI, 2017.
[219] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. Maier-Hein, S. A. Eslami, D. J. Rezende, and O. Ronneberger, “A probabilistic u-net for segmentation of ambiguous images,” in NeurIPS, 2018.

Sergiu Oprea is a PhD student at the Department of Computer Technology (DTIC), University of Alicante. He received his MSc (Automation and Robotics) and BSc (Computer Science) from the same institution in 2017 and 2015 respectively. His main research interests include video prediction with deep learning, virtual reality, 3D computer vision, and parallel computing on GPUs.
Pablo Martinez Gonzalez is a PhD student at the Department of Computer Technology (DTIC), University of Alicante. He received his MSc (Computer Graphics, Games and Virtual Reality) and BSc (Computer Science) at the Rey Juan Carlos University and the University of Alicante, in 2017 and 2015, respectively. His main research interests include deep learning, virtual reality and parallel computing on GPUs.

Antonis Argyros is a professor of computer science at the Computer Science Department, University of Crete and a researcher at the Institute of Computer Science, FORTH, in Heraklion, Crete, Greece. His research interests fall in the areas of computer vision and pattern recognition, with emphasis on the analysis of humans in images and videos, human pose analysis, recognition of human activities and gestures, 3D computer vision, as well as image motion and tracking. He is also interested in applications of computer vision in the fields of robotics and smart environments.

Alberto Garcia Garcia is a Postdoctoral Re-


searcher at the Institute of Space Sciences (ICE-
CSIC, Barcelona) where he leads the efforts in
code optimization, machine learning, and par-
allel computing on the MAGNESIA ERC Con-
solidator project. He received his PhD (Machine
Learning and Computer Vision), MSc (Automa-
tion and Robotics) and BSc (Computer Science)
from the same institution in 2019, 2016 and
2015 respectively. Previously he was an intern
at NVIDIA Research/Engineering, Facebook Re-
ality Labs, and Oculus Core Tech. His main research interests include
deep learning (specially convolutional neural networks), virtual reality,
3D computer vision, and parallel computing on GPUs.

John Alejandro Castro Vargas is a PhD stu-


dent at the Department of Computer Technology
(DTIC), University of Alicante. He received his
MSc (Automation and Robotics) and BSc (Com-
puter Science) from the same institution in 2017
and 2016 respectively. His main research inter-
ests include human behavior recognition with
deep learning, virtual reality and parallel comput-
ing on GPUs.

Sergio Orts-Escolano received a BSc, MSc


and PhD in Computer Science from the Uni-
versity of Alicante in 2008, 2010 and 2014 re-
spectively. His research interests include com-
puter vision, assistive robotics, 3D sensors, GPU
computing, virtual/augmented reality and deep
learning. He has authored +50 publications in
top journals and conferences like CVPR, SIG-
GRAPH, 3DV, BMVC, CVIU, IROS, UIST, RAS,
etcetera. He is also a member of European Net-
works like HiPEAC and Eucog. He has experi-
ence as a professor in academia and industry, working as a research
scientist for companies such as Google and Microsoft Research.

Jose Garcia-Rodriguez received his Ph.D. de-


gree, with specialization in Computer Vision and
Neural Networks, from the University of Alicante
(Spain). He is currently Full Professor at the
Department of Computer Technology of the Uni-
versity of Alicante. His research areas of inter-
est include: computer vision, computational intel-
ligence, machine learning, pattern recognition,
robotics, man-machine interfaces, ambient in-
telligence, computational chemistry, and parallel
and multicore architectures.