
Who Let The Dogs Out?

Modeling Dog Behavior From Visual Data

Kiana Ehsani1, Hessam Bagherinezhad1, Joseph Redmon1, Roozbeh Mottaghi2, Ali Farhadi1,2
1University of Washington, 2Allen Institute for AI (AI2)

Abstract

We study the task of directly modelling a visually intelligent agent. Computer vision typically focuses on solving various subtasks related to visual intelligence. We depart from this standard approach to computer vision; instead we directly model a visually intelligent agent. Our model takes visual information as input and directly predicts the actions of the agent. Toward this end we introduce DECADE, a dataset of ego-centric videos from a dog's perspective as well as her corresponding movements. Using this data we model how the dog acts and how the dog plans her movements. We show under a variety of metrics that given just visual input we can successfully model this intelligent agent in many situations. Moreover, the representation learned by our model encodes distinct information compared to representations trained on image classification, and our learned representation can generalize to other domains. In particular, we show strong results on the task of walkable surface estimation and scene classification by using this dog modelling task as representation learning. Code is available at https://github.com/ehsanik/dogTorch.

Figure 1. We address three problems: (1) Acting like a dog: where the goal is to predict the future movements of the dog given a sequence of previously seen images. (2) Planning like a dog: where the goal is to find a sequence of actions that move the dog between the locations of the given pair of images. (3) Learning from a dog: where we use the learned representation for a third task (e.g., walkable surface estimation).

1. Introduction

Computer vision research typically focuses on a few well defined tasks including image classification, object recognition, object detection, image segmentation, etc. These tasks have organically emerged and evolved over time as proxies for the actual problem of visual intelligence. Visual intelligence spans a wide range of problems and is hard to formally define or evaluate. As a result, the proxy tasks have served the community as the main point of focus and indicators of progress.

We value the undeniable impact of these proxy tasks in computer vision research and advocate the continuation of research on these fundamental problems. There is, however, a gap between the ideal outcome of these proxy tasks and the expected functionality of visually intelligent systems. In this paper, we take a direct approach to the problem of visual intelligence. Inspired by recent work that explores the role of action and interaction in visual understanding [56, 3, 31], we define the problem of visual intelligence as understanding visual data to the extent that an agent can take actions and perform tasks in the visual world. Under this definition, we propose to learn to act like a visually intelligent agent in the visual world.
Learning to act like visually intelligent agents, in general, is an extremely challenging and hard-to-define problem. Actions correspond to a wide range of movements with complicated semantics. In this paper, we take a small step towards the problem of learning to directly act like intelligent agents by considering actions in their most basic and semantic-free form: simple movements.

We choose to model a dog as the visual agent. Dogs have a much simpler action space than, say, a human, making the task more tractable. However, they clearly demonstrate visual intelligence, recognizing food, obstacles, other humans and animals, and reacting to those inputs. Yet their goals and motivations are often unknown a priori. They simply exist as sovereign entities in our world. Thus we are modelling a black box where we only know the inputs and outputs of the system.

In this paper, we study the problem of learning to act and plan like a dog from visual input. We compile the Dataset of Ego-Centric Actions in a Dog Environment (DECADE), which includes ego-centric videos of a dog with her corresponding movements. To record movements we mount Inertial Measurement Units (IMUs) on the joints and the body of the dog. We record the absolute position and can calculate the relative angle of the dog's main limbs and body.

Using DECADE, we explore three main problems in this paper (Figure 1): (1) learning to act like a dog; (2) learning to plan like a dog; and (3) using the dog's movements as a supervisory signal for representation learning.

In learning to act like a dog, we study the problem of predicting the dog's future moves, in terms of all the joint movements, by observing what the dog has observed up to the current time. In learning to plan like a dog, we address the problem of estimating a sequence of movements that take the state of the dog's world from what is observed at a given time to a desired observed state. In using dogs as supervision, we explore the potential of using the dog's movements for representation learning.

Our evaluations show interesting and promising results. Our models can predict how the dog moves in various scenarios (act like a dog) and how she decides to move from one state to another (plan like a dog). In addition, we show that the representation our model learns on dog behavior generalizes to other tasks. In particular, we see accuracy improvements using our dog model as pretraining for walkable surface estimation and scene recognition.

2. Related Work

To the best of our knowledge there is little to no work that directly models dog behavior. We mention past work that is most relevant.

Visual prediction. [51, 30] predict the motion of objects in a static image using a large collection of videos. [29] infer the goals of people and their intended actions. [35] infer future activities from a stream of video. [9] improve tracking by considering multiple hypotheses for future plans of people. [11] recognize partial events, which enables early detection of events. [14] perform activity forecasting by integrating semantic scene understanding with optimal control theory. [16] use object affordances to predict the future activities of people. [49] localize functional objects by predicting people's intent. [42] propose an unsupervised approach to predict possible motions and appearance of objects in the future. [17] propose a hierarchical approach to predict a set of actions that happen in the future. [33] propose a method to generate the future frames of a video. [15] predict the future paths of pedestrians from a vehicle camera. [36] predict future trajectories of a person in an ego-centric setting. [22] predict the future trajectories of objects according to Newtonian physics. [41] predict visual representations for future images. [52] forecast future frames by learning a policy to reproduce natural video sequences. Our work is different from these works since our goal is to predict the behavior of a dog and the movement of the joints from an ego-centric camera that captures the viewpoint of the dog.

Sequence to sequence models. Sequence to sequence learning [38] has been used for different applications in computer vision such as representation learning [37], video captioning [40, 50], human pose estimation [44], motion prediction [20], or body pose labeling and forecasting [8, 44]. Our model fits into this paradigm since we map the frames in a video to joint movements of the dog.

Ego-centric vision. Our work is along the lines of ego-centric vision (e.g., [7, 32, 18, 19]) since we study the dog's behavior from the perspective of the dog. However, dogs have less complex actions compared to humans, which makes the problem more manageable. Prior work explores future prediction in the context of ego-centric vision. [55] infer the temporal ordering of two snippets of ego-centric videos and predict what will happen next. [26] predict plausible future trajectories of ego-motion in ego-centric stereo images. [13] estimate the 3D joint position of unseen body joints using ego-centric videos. [34] use online reinforcement learning to forecast the future goals of the person wearing the camera. In contrast, our work focuses on predicting future joint movements given a stream of video.

Ego-motion estimation. Our planning approach shares similarities with ego-motion learning. [54] propose an unsupervised approach for camera motion estimation. [45] propose a method based on a combination of CNNs and RNNs to perform ego-motion estimation for cars. [21] learn a network to estimate the relative pose of two cameras. [39] also train a CNN to learn the depth map and motion of the camera in two consecutive images. In contrast to these approaches that estimate translation and rotation of the camera, we predict a sequence of joint movements. Note that the joint movements are constrained by the structure of the dog's body, so the predictions are constrained.
proaches that estimate translation and rotation of the cam- timestep.
era, we predict a sequence of joint movements. Note that An Arduino on the dog’s back connects to the IMUs and
the joint movements are constrained by the structure of the records the positional information. It also collects audio
dog body so the predictions are constrained. data via a microphone mounted on the dog’s back. We syn-
Action inference & Planning. Our dog planning model chronize the GoPro with the IMU measurements using au-
infers the action sequence for the dog given a pair of images dio information. This allows us to synchronize the video
showing before and after action execution. [3] also learn stream with the IMU readings with microsecond precision.
the mapping between actions of a robot and changes in the We collect the data in various outdoor and indoor scenes:
visual state for the task of pushing objects. [27] optimize living room, stairs, balcony, street, and dog park are exam-
for actions that capture the state changes in an exploration ples of these scenes. The data is recorded in more than 50
setting. different locations. We recorded the behavior of the dog
Inverse Reinforcement Learning. Several works (e.g., while involved in certain activities such as walking, follow-
[1, 4, 34]) have used Inverse Reinforcement Learning (IRL) ing, fetching, interaction with other dogs, and tracking ob-
to infer the agent’s reward function from the observed be- jects. No annotations are provided for the video frames, we
havior. IRL is not directly applicable to our problem since use the raw data for our experiments.
our action space is large and we do not have multiple train-
ing examples for each goal. 4. Acting like a dog
Self-supervision. Various research explores representation
We predict how the dog acts in the visual world in re-
learning by different self-supervisory signals such as ego-
sponse to various situations. Specifically, we model the
motion [2, 12], spatial location [6], tracking in video [47],
future actions of the dog given a sequence of previously
colorization [53], physical robot interaction [31], inpainting
seen images. The input is a sequence of image frames
[28], sound [25], etc. As a side product, we show we learn a
(I1 , I2 , . . . , It ), and the output is the future actions (move-
useful representation using embeddings of joint movements
ments) of each joint j at each timestep t < t0 ≤ N :
and visual signals.
(ajt+1 , ajt+2 , . . . , ajt+N ). Timesteps are spaced evenly by
3. Dataset 0.2s in time. The action ajt is the movement of the joint
j, that along with the movements of other joints, takes us
We introduce DECADE, a dataset of ego-centric dog from image frame It to It+1 . For instance, a23 represents
video and joint movements. The dataset includes 380 video the movement of the second joint that takes place between
clips from a camera mounted on the dog’s head. It also in- image frames I3 and I4 . Each action is the change in the
cludes corresponding information about body position and orientation of the joints in the 3D space.
movement. Overall we have 24500 frames. We use 21000 We formulate the problem as classification, i.e. we quan-
of them for training, 1500 for validation, and 2000 for test- tize joint angular movements and label each joint movement
ing. Train, validation, and test splits consist of disjoint as a ground-truth action class. To obtain action classes,
video clips. we cluster changes in IMU readings (joint angular move-
We use a GoPro camera on the dog’s head to capture the ments) by K-means, and we use quaternion angular dis-
ego-centric videos. We sub-sample frames at the rate of 5 tances to represent angular distances between quaternions.
fps. The camera applies video stabilization to the captured Each cluster centroid represents a possible movement of
stream. We use inertial measurement units (IMUs) to mea- that joint.
sure body position and movement. Four IMUs measure the Our movement prediction model is based on an encoder-
position of the dog’s limbs, one measures the tail, and one decoder architecture, where the goal is to find a mapping
measures the body position. The IMUs enable us to capture between input images and future actions. For instance, if
the movements in terms of angular displacements. the dog sees her owner with a bag of treats, there is a high
For each frame, we have the absolute angular displace- probability that the dog will sit and wait for a treat, or if the
ment of the six IMUs. Each angular displacement is repre- dog sees her owner throwing a ball, the dog will likely track
sented as a 4 dimensional quaternion vector. More details the ball and run toward it.
about angular calculations in this domain and the method Figure 2 shows our model. The encoder part of the model
for quantizing the data is explained in detail in Section 7. consists of a CNN and an LSTM. At each timestep, the
The absolute angular displacements of the IMUs depend CNN receives a pair of consecutive images as input and
on what direction the dog is facing. For that reason, we provides an embedding, which is used as the input to the
compute the difference between angular displacements of LSTM. That is, the LSTM cell receives the features from
the joints, also in the quaternion space. The difference of frames t and t + 1 as the input in a timestep, and receives
the angular displacements between two consecutive frames frames t + 1 and t + 2 in the next timestep. Our experi-
(that is 0.2s in time) represents the action of the dog in that mental results show that observing the two frames in each
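As a rough illustration of the encoder described above, the sketch below (PyTorch-style; class and variable names are ours, not taken from the released dogTorch code) embeds each frame of a consecutive pair with a shared ResNet-18, concatenates the two 512-dimensional features, projects them with a linear layer, and runs the resulting sequence through an LSTM whose final hidden and cell states can initialize the decoder.

import torch
import torch.nn as nn
import torchvision.models as models

class ActingEncoder(nn.Module):
    # Sketch of the acting encoder: a shared ResNet-18 embeds each frame; each LSTM
    # step sees the concatenated features of two consecutive frames (I_t, I_{t+1}).
    def __init__(self, feat_dim=512, hidden_dim=512):
        super().__init__()
        resnet = models.resnet18(pretrained=True)                 # ImageNet initialization
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # drop the classifier head
        self.project = nn.Linear(2 * feat_dim, hidden_dim)        # embed the frame pair
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, frames):
        # frames: (batch, T, 3, 224, 224) observed images
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, -1)  # (b, T, 512)
        pairs = torch.cat([feats[:, :-1], feats[:, 1:]], dim=-1)          # (b, T-1, 1024)
        _, (h, c) = self.lstm(self.project(pairs))
        return h, c   # used to initialize the decoder LSTM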
Figure 2. Model architecture for acting. The model is an encoder-decoder style neural network. The encoder receives a stream of image
pairs, and the decoder outputs future actions for each joint. There is a fully connected layer (FC) between the encoder and decoder parts to
better capture the change in the domain (change from images to actions). In the decoder, the output probability of actions at each timestep
is passed to the next timestep. We share the weights between the two ResNet towers.

The decoder's goal is to predict the future joint movements of the dog given the embedding of the input frames. The decoder receives its initial hidden state and cell from the encoder. At each timestep, the decoder outputs the action class for each of the joints. The input to the decoder at the first timestep is all zeros; at all other timesteps, we feed in the prediction of the last timestep, embedded by a linear transformer. Since we train the model with fixed output length, no stop token is required and we always stop at a fixed number of steps. Note that there are a total of six joints; hence our model outputs six classes of actions at each timestep.

Each image is given to the ResNet tower individually and the features for the two images are concatenated. The combined features are embedded into a smaller space by a linear transformation. The embedded features are fed into the encoder LSTM. We use a ResNet pre-trained on ImageNet [5] and we fine-tune it under a Siamese setting to estimate the joint movements between two consecutive frames. We use the fine-tuned ResNet in our encoder-decoder model.

We use an average of weighted class entropy losses, one for each joint, to train our encoder-decoder. Our loss function can be formulated as follows:

L(o, g) = 1/(N × K) Σ_{t=1}^{N} Σ_{i=1}^{K} (1 / f^i_{g_i}) log o(t)^i_{g(t)_i},   (1)

where g(t)_i is the ground-truth class for the i-th joint at timestep t, o(t)^i_{g_i} is the predicted probability score for the g_i-th class of the i-th joint at timestep t, f^i_{g_i} is the number of data points whose i-th joint is labeled with g_i, K is the number of joints, and N is the number of timesteps. The 1/f^i_{g_i} factor helps the ground-truth labels that are underrepresented in the training data.
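A minimal sketch of the weighted loss in Equation 1, assuming the decoder returns one logit tensor per joint; the 1/f weights are precomputed from the per-joint class frequencies in the training set (function and variable names here are ours, not from the released code).

import torch
import torch.nn as nn

def make_joint_losses(class_counts):
    # class_counts: list of K tensors, each of length 8, holding the number of training
    # samples per action class for one joint; the 1/f_{g_i} weight up-weights rare classes.
    return [nn.CrossEntropyLoss(weight=1.0 / counts.float()) for counts in class_counts]

def acting_loss(joint_logits, targets, joint_losses):
    # joint_logits: list of K tensors of shape (batch, N, 8); targets: (batch, N, K) class ids.
    # Returns the average of the weighted cross entropies over the K joints.
    total = 0.0
    for i, loss_fn in enumerate(joint_losses):
        logits = joint_logits[i].flatten(0, 1)   # (batch*N, 8)
        labels = targets[:, :, i].flatten()      # (batch*N,)
        total = total + loss_fn(logits, labels)
    return total / len(joint_losses)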
5. Planning like a dog

Another goal is to model how dogs plan actions to accomplish a task. To achieve this, we design a task as follows: Given a pair of non-consecutive image frames, plan a sequence of joint movements that the dog would take to get from the first frame (starting state) to the second frame (ending state). Note that a traditional motion estimator would not work here. Motion estimators infer a translation and rotation for the camera that can take us from an image to another; in contrast, here we expect the model to plan for the actuator, with its set of feasible actions, to traverse from one state to another.

More formally, the task can be defined as follows. Given a pair of images (I_1, I_N), output an action sequence of length N − 1 for each joint that results in the movement of the dog from the starting point, where I_1 is observed, to the end point, where I_N is observed.

Each action that the dog takes changes the state of the world and therefore affects the plan for the next steps. Thus, we design a recurrent neural network, containing an LSTM that observes the actions taken by the model in previous timesteps to predict the action of the next timestep. Figure 3 shows the overview of our model. We feed-forward image frames I_1 and I_N to individual ResNet-18 towers, concatenate the features from the last layer and feed it to the LSTM. At each timestep, the LSTM cell outputs planned actions for all six joints. We pass the planned actions for a timestep as the input of the next timestep. This enables the network to plan the next movements conditioned on the previous actions. As opposed to making hard decisions about the previously taken actions, we pass the action probabilities as the input to the LSTM in the next timestep. A low probability action at the current timestep might result in a high probability trajectory further along in the sequence. Using action probabilities prevents early pruning and keeps all possibilities for future actions open.

Figure 3. Model architecture for planning. The model is a combination of CNNs and an LSTM. The inputs to the model are two images I_1 and I_N, which are N − 1 time steps apart in the video sequence. The LSTM receives the features from the CNNs and outputs a sequence of actions (joint movements) that move the dog from I_1 to I_N.
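The sketch below mirrors the planning model described above (our own naming, not the released code): ResNet-18 features of the start and goal images seed an LSTM cell that unrolls N − 1 steps, emits 6 × 8 action logits at every step, and feeds the action probabilities, rather than hard decisions, back in as the next input.

import torch
import torch.nn as nn
import torchvision.models as models

class PlanningModel(nn.Module):
    # Sketch: plan N-1 per-joint movements that take the dog from image I_1 to image I_N.
    def __init__(self, feat_dim=512, hidden_dim=512, num_joints=6, num_classes=8):
        super().__init__()
        resnet = models.resnet18(pretrained=True)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])     # shared feature tower
        self.embed_pair = nn.Linear(2 * feat_dim, hidden_dim)       # start + goal features
        self.cell = nn.LSTMCell(hidden_dim, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, num_joints * num_classes)  # 48 logits
        self.feedback = nn.Linear(num_joints * num_classes, hidden_dim)     # 48 -> hidden
        self.num_joints, self.num_classes = num_joints, num_classes

    def forward(self, img_start, img_goal, num_steps):
        f1 = self.cnn(img_start).flatten(1)
        f2 = self.cnn(img_goal).flatten(1)
        x = self.embed_pair(torch.cat([f1, f2], dim=-1))   # input at the first timestep
        h = torch.zeros_like(x)
        c = torch.zeros_like(x)
        plans = []
        for _ in range(num_steps):                          # N - 1 planned movements
            h, c = self.cell(x, (h, c))
            logits = self.action_head(h).view(-1, self.num_joints, self.num_classes)
            plans.append(logits)
            # soft feedback: pass action probabilities instead of pruning to the argmax
            x = self.feedback(torch.softmax(logits, dim=-1).flatten(1))
        return torch.stack(plans, dim=1)   # (batch, N-1, 6, 8)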
We train this recurrent neural network using a weighted
cross entropy loss over all time steps and joints as described
in Equation 1. Similar to the acting problem, we use a discretized action space, which is obtained using the procedure described in Section 7.

6. Learning from a dog

While learning to predict the movements of the dog's joints from the images that the dog observes, we obtain an image representation that encodes different types of information. To learn a representation, we train a ResNet-18 model to estimate the current dog movements (the change in the IMUs from time t − 1 to t) by looking at the images that the dog observes at times t − 1 and t. We then test this representation, and compare it with a ResNet-18 model trained on ImageNet, on a different task using separate data. For our experiments we choose the task of walkable surface estimation [23] and scene categorization using the SUN397 dataset [48]. Figure 4 depicts our model for estimating the walkable surfaces from an image. To showcase the effects of our representation, we replace the ResNet-18 part of the model shown in blue with a ResNet trained on ImageNet and compare it with a ResNet trained on DECADE.

Figure 4. Model architecture for walkable surface estimation. We augment the last four layers of ResNet with Deconvolution and Convolution layers to infer walkable surfaces.

7. Experiments

We evaluate our models on (1) how well they can predict the future movements of the dog (acting), (2) how accurately they can plan like a dog, and (3) how well the representation learned from dog data generalizes to other tasks.

7.1. Implementation details

We use inertial measurement units (IMUs) to obtain the angular displacements of the dog's joints. The IMUs that we use in this project process the angular displacements internally and provide the absolute orientation in quaternion form at an average rate of 20 readings per second. To synchronize all IMUs together, we connect all the IMUs to the same embedded system (Raspberry Pi 3.0). We use a GoPro on the dog's head to capture ego-centric video, and we sub-sample the images at a rate of 5 frames per second. To sync the GoPro and Raspberry Pi, we use audio that has been recorded on both instruments.

The rates of the joint movement readings and video frames are different. We perform interpolation and averaging to compute the absolute angular orientation for each frame. For each frame of the video, we compute the average of the IMU readings, in quaternion space, corresponding to a window of 0.1 second centered at the current frame.

To factor out the effects of global orientation change, we use the relative orientations rather than the absolute ones. We compute the difference of absolute angular orientations corresponding to consecutive frames after calculating the average of quaternions for each frame. The reason for using quaternions, instead of Euler angles, is that subtracting two quaternions is more well-defined and is easily obtained by: q_2 − q_1 = q_1^{-1} q_2.
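A sketch of this pre-processing under our own naming: IMU quaternions near each frame are averaged (a normalized mean is used here as a simple stand-in for proper quaternion averaging), consecutive frames are related by the relative rotation q_1^{-1} q_2, and the resulting sign-normalized 4-vectors can then be clustered into 8 classes per joint. Plain Euclidean K-means on unit quaternions is only an approximation of clustering under the angular distance used in the paper.

import numpy as np
from sklearn.cluster import KMeans

def quat_inverse(q):
    # For a unit quaternion (w, x, y, z) the inverse is the conjugate.
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_multiply(a, b):
    # Hamilton product of two quaternions given as (w, x, y, z).
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def frame_orientation(window_quats):
    # Stand-in for averaging the IMU readings inside the 0.1 s window around a frame.
    q = np.mean(window_quats, axis=0)
    return q / np.linalg.norm(q)

def relative_actions(frame_quats):
    # q_rel = q_t^{-1} q_{t+1}: the joint movement between consecutive frames (0.2 s apart).
    rels = []
    for q1, q2 in zip(frame_quats[:-1], frame_quats[1:]):
        r = quat_multiply(quat_inverse(q1), q2)
        rels.append(r if r[0] >= 0 else -r)   # q and -q describe the same rotation
    return np.stack(rels)

def quantize_actions(rel_quats, num_clusters=8):
    # Euclidean K-means on sign-normalized unit quaternions, approximating clustering
    # under the angular distance 2*arccos(<q1, q2>).
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(rel_quats)
    return km.labels_, km.cluster_centers_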
We use K-means clustering to quantize the action space (the space of relative joint movements). The distance function that we use for K-means clustering is defined by:

dist(q_1, q_2) = 2 arccos(⟨q_1, q_2⟩)

The reason for formulating the problem as classification rather than regression is that our experimental evaluations showed that CNNs obtain better results for classification (as opposed to regression to continuous values). The same observation has been made by [43, 46, 24], where they also formulate a continuous value estimation problem as classification.

We treat each joint separately during training. It is more natural to consider all joints together for classification to respect the kinematic constraints; however, it should be noted that: (1) it is hard to collect enough training data for all possible joint combinations (3910 different joint configurations appear in our training data); (2) by combining the losses for all joints, the model encodes an implicit model of kinematic possibilities; (3) our experiments showed that per-joint clustering is better than all-joint clustering.

To visualize the dog movements we use a 3D model of a dog from [57]. Their model is a 3D articulated model that can represent animals, such as dogs. For visualization, we apply the movement estimates from the model to the dog model.

Learning to Act. The inputs to the acting network, explained in Section 4, are pairs of frames of size 224 × 224, and the output is a sequence of movements predicted for future actions of the dog. The input images are fed into two ResNet-18 towers with shared weights. The outputs of the ResNets (the layer before the classification layer), which are of size 512, are concatenated into a vector of size 1024. The image vector is then used as the input to the encoder LSTM. The encoder LSTM has a hidden size of 512, and we set the initial hidden and cell states to all zeros.

The hidden and cell state of the last LSTM cell of the encoder are used as the initialization of the hidden and cell state of the LSTM for the decoder part. There is a fully connected layer before the input to the decoder LSTM to capture the domain change between the encoder and the decoder (the encoder is in the image domain, while the decoder is in the joint movement domain).

The output of each decoder LSTM cell is then fed into 6 fully connected layers, where each one estimates the action class for each joint. We consider 8 classes of actions for each joint. We visualized our training data for different numbers of clusters and observed that 8 clusters provide a reasonable separation, do not result in false clusters, and correspond to natural movements of the limbs.

We pre-train the ResNets by fine-tuning them for the task of estimating the joint actions, where we use a pair of consecutive frames as input and we have 6 different classification layers corresponding to different joints.

Learning to Plan. For the planning network, the input is obtained by concatenation of the ResNet-18 features for the source and destination input images (a vector of size 2048). A fully connected layer receives this vector as input and converts it to a 512 dimensional vector, which is then used as the first time step input for the LSTM. The LSTM output is 48 dimensional (6 joints × 8 action classes). The output is followed by a 48 × 512 fully connected layer. The output of the fully connected layer is used as the input of the LSTM at the next timestep.

Learning from a dog. To obtain the representation, we train a ResNet-18 model to estimate the dog movements from time t − 1 to time t by looking at the images at time t − 1 and t. We use a simple Siamese network with two ResNet-18 towers whose weights are shared. We concatenate the features of both frames into a 1024-dimensional vector and use a fully connected layer to predict the final 48 labels (6 IMUs, each having 8 classes of values). Table 1 shows our results on how well this network can predict the current (not future) movements of the dog. The evaluation metric is the class accuracy described below. We use this base network to obtain our image representation. We also use this network for initializing our acting and planning networks.

Model                      Test Accuracy
Nearest Neighbor           13.14
CNN - regression           12.73
Our Model - Single Tower   18.65
Our Model                  20.69
Table 1. Inferring the action between two consecutive frames.

7.2. Evaluation metrics

We use different evaluation metrics to compare our method with the baselines.

Class Accuracy: This is the standard metric for classification tasks. We report the average per-class accuracy rather than the overall accuracy for two reasons: 1) the dataset is not uniformly distributed among the clusters and we do not want to favor larger clusters over smaller ones, and 2) unlike the overall unbalanced accuracy, the mean class accuracy of a model that always predicts the mode class is not higher than chance.

Perplexity: Perplexity measures the likelihood of the ground-truth label. Perplexity is commonly used for sequence modeling. We report perplexity for all of the baselines and models that are probabilistic and predict a sequence. If our model assigns probability p to a sequence of length n, the perplexity measure is calculated as p^(1/n).
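The two metrics can be computed as in the sketch below (variable names are ours): mean class accuracy averages per-class recall so that always predicting the mode class gives no advantage, and the reported perplexity is p^(1/n), i.e., the geometric mean of the probabilities the model assigns to the ground-truth labels of a length-n sequence.

import numpy as np

def mean_class_accuracy(preds, labels, num_classes=8):
    # Average of per-class recalls; class imbalance does not favor the mode class.
    accs = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            accs.append((preds[mask] == c).mean())
    return float(np.mean(accs))

def sequence_perplexity(step_probs):
    # step_probs: probabilities assigned to the ground-truth action at each of n steps;
    # the reported measure is p^(1/n), where p is the whole-sequence probability.
    p = float(np.prod(step_probs))
    return p ** (1.0 / len(step_probs))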
Figure 5. Qualitative Results: Learning to Act. Our model sees 5 frames of a video where a man begins throwing a ball past the dog. In
the video the ball bounces past the dog and the dog turns to the right to chase the ball. Using just the first 5 frames our model correctly
predicts how the dog turns to the right as the ball flies past.

Model                          Test Accuracy   Perplexity
Nearest Neighbor               12.64           N/A
CNN                            19.84           0.2171
Our Model - 1 tower            18.04           0.2023
Our Model - 1 frame/timestep   19.28           0.242
Our Model                      21.62           0.2514
Table 2. Acting Results. We observe a video sequence of 5 frames and predict the next 5 actions.

Model              Test Accuracy   Perplexity
Nearest Neighbor   14.09           N/A
CNN                14.61           0.1419
Our Model          19.77           0.2362
Table 3. Planning Results. Planning between the start and end frame. We consider start and end images that are 5 steps apart.

7.3. Results

Learning to act like a dog. Table 2 summarizes our experimental results for predicting the future movements of the dog using the images the dog observes. We compare our method with two baselines in terms of both test set accuracy and perplexity. The Nearest Neighbors baseline uses the features obtained from a ResNet18 network trained on ImageNet. The CNN baseline concatenates all the images into an input tensor for a ResNet18 model that classifies the future actions. We also report two ablations of our model. The 1 tower ablation uses the model depicted in Figure 2 but only uses one ResNet-18 tower with both frames concatenated (6 channels) as the input. We also compare our model with an ablation that only uses one image at each timestep instead of looking at two. Our results show that our model outperforms the baselines. Our ablations also show the importance of different components in our model. Figure 5 shows an example where a man is throwing a ball past the dog. Our model correctly predicts the future dog movements (the bottom row) by only observing the images the dog observed in the previous time steps (the first row). The second row shows the set of images the dog actually observed. These images were not shown to the algorithm and are depicted here to better render the situation.

Learning to plan like a dog. Table 3 shows our experimental results for the task of planning.
The nearest neighbor baseline concatenates the image features obtained from a ResNet-18 trained on ImageNet and searches for a plan of the required size that corresponds to the closest feature. The CNN baseline concatenates the input source and destination images into an input tensor that feeds into a ResNet-18 that predicts a plan for the required size. Our results show that our model outperforms these baselines in the challenging task of planning like a dog both in terms of accuracy and perplexity.

To better understand the behavior of the model, for both acting and planning, we show the performance in terms of a continuous angular metric and also for all joints in Table 4. The angular metric compares the mean of the predicted cluster with the actual continuous joint movements in the ground truth, arccos(2(q_pred · q_gt)^2 − 1), where q_pred and q_gt are the predicted and ground-truth quaternion angles, respectively. The all-joint metric calculates the percentage of correct predictions, where we consider a prediction correct if all joints are predicted (classified) correctly.

Model                Angular metric   All joints
Random               131.70           4e-4
CNN-acting           63.42            8.67
Our model-acting     59.61            9.49
CNN-planning         76.18            0.14
Our model-planning   63.14            3.66
Table 4. Continuous evaluation and all-joint evaluation. Lower is better in the first column. Higher is better in the second column.

Learning from a dog. We test our hypothesis about the information encoded in our representation, learned from mimicking the dog's behaviour, by comparing our representation with a similar one trained for image classification on ImageNet on a third task. We chose the tasks of walkable surface estimation and scene classification.

1) Walkable surface estimation. The goal for this task is to label pixels that correspond to walkable regions in an image (e.g., floor, rug, and carpet regions). We use the dataset provided by [23]. In our dataset, we have some sequences of the dog walking in indoor and outdoor scenes. There are various types of obstacles (e.g., furniture, people, or walls) in the scenes that the dog avoids. We conjecture that the learned representation for our network should provide strong cues for estimating walkable surfaces.

The definition of walkable surfaces for humans and dogs is not the same. As an example, the area under tables is labeled as non-walkable for humans and walkable for dogs. However, since our dog is a large-sized dog, the definition of walkability is roughly the same for humans and the dog.

We trained ResNet-18 on ImageNet and then finetuned it on the walkable surface dataset as our baseline. We performed the same procedure for using our features (trained for the acting task). For finetuning both models, we just update the weights for the last convolutional layer (the green block in Figure 4).

Model       Pre-training task         IOU
ResNet-18   ImageNet Classification   42.88
ResNet-18   Acting like a dog         45.60
Table 5. Walkable surface estimation. We compare the result of the network that is trained on ImageNet with the network that is trained for our acting task. The evaluation metric is IOU.

Table 5 shows the results. Our features provide a significant improvement, 3%, over the ImageNet features. We use IOU as the evaluation metric. This indicates that our features have some information orthogonal to the ImageNet features.

2) Scene classification. We perform an additional scene recognition experiment using the SUN 397 dataset [48]. We used the same 5-instance training protocol used by [2]. The representation learned by us obtains an accuracy of 4.48 (as a point of reference, [12] achieves 1.58 and [2] achieves 0.5-6.4 from their representations, and chance is 0.251). This is interesting since our dataset does not include many of the scene types (gas station, store, etc.).

8. Conclusion

We study the task of directly modeling a visually intelligent agent. Our model learns from ego-centric video and movement information to act and plan like a dog would in the same situation. We see some success both in our quantitative and qualitative results. Our experiments show that our models can make predictions about future movements of a dog and can plan movements similar to the dog.

This is a first step towards end-to-end modelling of intelligent agents. This approach does not need manually labeled data or detailed semantic information about the task or goals of the agent. We can use this model on a wide variety of agents and scenarios and learn useful information despite the lack of semantic labels.

For this work, we limit ourselves to only considering visual data. However, intelligent agents use a variety of input modalities when interacting with the world, including sound, touch, smell, etc. We are interested in expanding our models to encompass more input modalities in a combined, end-to-end model. We also limit our work to modelling a single, specific dog. It would be interesting to collect data from multiple dogs and evaluate generalization across dogs. We hope this work paves the way towards better understanding of visual intelligence and of the other intelligent beings that inhabit our world.

Acknowledgements: We would like to thank Carlo C. del Mundo for his help with setting up the data collection system, Marc Milestone for his help in data collection, and Miss Kelp M. Redmon for being the dog in our data collection. This work is in part supported by ONR N00014-13-1-0720, NSF IIS-1338054, NSF-1652052, NRI-1637479, the Allen Distinguished Investigator Award, and the Allen Institute for Artificial Intelligence.
References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004. 3
[2] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, 2015. 3, 8
[3] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS, 2016. 1, 3
[4] C. L. Baker, R. Saxe, and J. B. Tenenbaum. Action understanding as inverse planning. Cognition, 2009. 3
[5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009. 4
[6] C. Doersch, A. Gupta, and A. A. Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015. 3
[7] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In ICCV, 2011. 2
[8] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent network models for human dynamics. In ICCV, 2015. 2
[9] H. Gong, J. Sim, M. Likhachev, and J. Shi. Multi-hypothesis motion planning for visual object tracking. In ICCV, 2011. 2
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. 4
[11] M. Hoai and F. De la Torre. Max-margin early event detectors. In CVPR, 2012. 2
[12] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, 2015. 3, 8
[13] H. Jiang and K. Grauman. Seeing invisible poses: Estimating 3d body pose from egocentric video. In CVPR, 2017. 2
[14] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012. 2
[15] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila. Context-based pedestrian path prediction. In ECCV, 2014. 2
[16] H. S. Koppula and A. Saxena. Anticipating human activities using object affordances for reactive robotic response. In RSS, 2013. 2
[17] T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In ECCV, 2014. 2
[18] Y. J. Lee, J. Ghosh, and K. Grauman. Discovering important people and objects for egocentric video summarization. In CVPR, 2012. 2
[19] C. Li and K. M. Kitani. Pixel-level hand detection in ego-centric videos. In CVPR, 2013. 2
[20] J. Martinez, M. J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In CVPR, 2017. 2
[21] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Relative camera pose estimation using convolutional neural networks. In arXiv, 2017. 2
[22] R. Mottaghi, H. Bagherinezhad, M. Rastegari, and A. Farhadi. Newtonian image understanding: Unfolding the dynamics of objects in static images. In CVPR, 2016. 2
[23] R. Mottaghi, H. Hajishirzi, and A. Farhadi. A task-oriented approach for cost-sensitive recognition. In CVPR, 2016. 5, 8
[24] R. Mottaghi, M. Rastegari, A. Gupta, and A. Farhadi. "What happens if..." learning to predict the effect of forces in images. In ECCV, 2016. 6
[25] A. Owens, P. Isola, J. H. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In CVPR, 2016. 3
[26] H. S. Park, J. Hwang, Y. Niu, and J. Shi. Egocentric future localization. In CVPR, 2016. 2
[27] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017. 3
[28] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016. 3
[29] M. Pei, Y. Jia, and S.-C. Zhu. Parsing video events with goal inference and intent prediction. In ICCV, 2011. 2
[30] S. L. Pintea, J. C. van Gemert, and A. W. M. Smeulders. Déjà vu: Motion prediction in static images. In ECCV, 2014. 2
[31] L. Pinto, D. Gandhi, Y. Han, Y. Park, and A. Gupta. The curious robot: Learning visual representations via physical interactions. In ECCV, 2016. 1, 3
[32] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In CVPR, 2012. 2
[33] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. In arXiv, 2014. 2
[34] N. Rhinehart and K. Kitani. First-person forecasting with online inverse reinforcement learning. In ICCV, 2017. 2, 3
[35] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011. 2
[36] K. K. Singh, K. Fatahalian, and A. A. Efros. Krishnacam: Using a longitudinal, single-person, egocentric dataset for scene understanding tasks. In WACV, 2016. 2
[37] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015. 2
[38] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014. 2
[39] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In CVPR, 2017. 2
[40] S. Venugopalan, M. Rohrbach, J. Donahue, R. J. Mooney, T. Darrell, and K. Saenko. Sequence to sequence - video to text. In ICCV, 2015. 2
[41] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations with unlabeled video. In CVPR, 2016. 2
[42] J. Walker, A. Gupta, and M. Hebert. Patch to the future: Unsupervised visual prediction. In CVPR, 2014. 2
[43] J. Walker, A. Gupta, and M. Hebert. Dense optical flow prediction from a static image. In ICCV, 2015. 6
[44] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In ICCV, 2017. 2
[45] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In ICRA, 2017. 2
[46] X. Wang, D. F. Fouhey, and A. Gupta. Designing deep net-
works for surface normal estimation. In CVPR, 2015. 6
[47] X. Wang and A. Gupta. Unsupervised learning of visual rep-
resentations using videos. In ICCV, 2015. 3
[48] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba.
SUN database: Large-scale scene recognition from abbey to
zoo. In CVPR, 2010. 5, 8
[49] D. Xie, S. Todorovic, and S. Zhu. Inferring ”dark matter”
and ”dark energy” from videos. In ICCV, 2013. 2
[50] L. Yao, A. Torabi, K. Cho, N. Ballas, C. J. Pal, H. Larochelle,
and A. C. Courville. Describing videos by exploiting tempo-
ral structure. In ICCV, 2015. 2
[51] J. Yuen and A. Torralba. A data-driven approach for event
prediction. In ECCV, 2010. 2
[52] K. Zeng, W. B. Shen, D. Huang, M. Sun, and J. C. Niebles.
Visual forecasting by imitating dynamics in natural se-
quences. In ICCV, 2017. 2
[53] R. Zhang, P. Isola, and A. A. Efros. Colorful image coloriza-
tion. In ECCV, 2016. 3
[54] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsu-
pervised learning of depth and ego-motion from video. In
CVPR, 2017. 2
[55] Y. Zhou and T. L. Berg. Temporal perception and prediction
in ego-centric video. In ICCV, 2015. 2
[56] Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta,
R. Mottaghi, and A. Farhadi. Visual semantic planning using
deep successor representations. In ICCV, 2017. 1
[57] S. Zuffi, A. Kanazawa, D. Jacobs, and M. J. Black. 3D
menagerie: Modeling the 3D shape and pose of animals. In
CVPR, 2017. 6
