Who Let The Dogs Out? Modeling Dog Behavior From Visual Data
Abstract
… estimation and scene classification by using this dog modelling task as representation learning. Code is available at https://github.com/ehsanik/dogTorch.

1. Introduction
Computer vision research typically focuses on a few well defined tasks including image classification, object recognition, object detection, image segmentation, etc. These tasks have organically emerged and evolved over time as proxies for the actual problem of visual intelligence. Visual intelligence spans a wide range of problems and is hard to formally define or evaluate. As a result, the proxy tasks have served the community as the main point of focus and indicators of progress.

Figure 1. We address three problems: (1) Acting like a dog: where the goal is to predict the future movements of the dog given a sequence of previously seen images. (2) Planning like a dog: where the goal is to find a sequence of actions that move the dog between the locations of the given pair of images. (3) Learning from a dog: where we use the learned representation for a third task (e.g., walkable surface estimation).

We value the undeniable impact of these proxy tasks in computer vision research and advocate the continuation of research on these fundamental problems. There is, however, a gap between the ideal outcome of these proxy tasks and the expected functionality of visually intelligent systems. In this paper, we take a direct approach to the problem of visual intelligence. Inspired by recent work that explores the role of action and interaction in visual understanding [56, 3, 31], we define the problem of visual intelligence as understanding visual data to the extent that an agent can take actions and perform tasks in the visual world. Under this definition, we propose to learn to act like a visually intelligent agent in the visual world.
Learning to act like visually intelligent agents, in general, is an extremely challenging and hard-to-define problem. Actions correspond to a wide range of movements with complicated semantics. In this paper, we take a small step towards the problem of learning to directly act like intelligent agents by considering actions in their most basic and semantic-free form: simple movements.

We choose to model a dog as the visual agent. Dogs have a much simpler action space than, say, a human, making the task more tractable. However, they clearly demonstrate visual intelligence, recognizing food, obstacles, other humans and animals, and reacting to those inputs. Yet their goals and motivations are often unknown a priori. They simply exist as sovereign entities in our world. Thus we are modelling a black box where we only know the inputs and outputs of the system.

In this paper, we study the problem of learning to act and plan like a dog from visual input. We compile the Dataset of Ego-Centric Actions in a Dog Environment (DECADE), which includes ego-centric videos of a dog with her corresponding movements. To record movements we mount Inertial Measurement Units (IMUs) on the joints and the body of the dog. We record the absolute position and can calculate the relative angle of the dog's main limbs and body.

Using DECADE, we explore three main problems in this paper (Figure 1): (1) learning to act like a dog; (2) learning to plan like a dog; and (3) using the dog's movements as a supervisory signal for representation learning.

In learning to act like a dog, we study the problem of predicting the dog's future moves, in terms of all the joint movements, by observing what the dog has observed up to the current time. In learning to plan like a dog, we address the problem of estimating a sequence of movements that take the state of the dog's world from what is observed at a given time to a desired observed state. In using dogs as supervision, we explore the potential of using the dog's movements for representation learning.

Our evaluations show interesting and promising results. Our models can predict how the dog moves in various scenarios (act like a dog) and how she decides to move from one state to another (plan like a dog). In addition, we show that the representation our model learns on dog behavior generalizes to other tasks. In particular, we see accuracy improvements using our dog model as pretraining for walkable surface estimation and scene recognition.
2. Related Work

To the best of our knowledge, there is little to no work that directly models dog behavior. We mention past work that is most relevant.

Visual prediction. [51, 30] predict the motion of objects in a static image using a large collection of videos. [29] infer the goals of people and their intended actions. [35] infer future activities from a stream of video. [9] improve tracking by considering multiple hypotheses for future plans of people. [11] recognize partial events, which enables early detection of events. [14] perform activity forecasting by integrating semantic scene understanding with optimal control theory. [16] use object affordances to predict the future activities of people. [49] localize functional objects by predicting people's intent. [42] propose an unsupervised approach to predict possible motions and appearance of objects in the future. [17] propose a hierarchical approach to predict a set of actions that happen in the future. [33] propose a method to generate the future frames of a video. [15] predict the future paths of pedestrians from a vehicle camera. [36] predict future trajectories of a person in an ego-centric setting. [22] predict the future trajectories of objects according to Newtonian physics. [41] predict visual representations for future images. [52] forecast future frames by learning a policy to reproduce natural video sequences. Our work is different from these works since our goal is to predict the behavior of a dog and the movement of the joints from an ego-centric camera that captures the viewpoint of the dog.

Sequence to sequence models. Sequence to sequence learning [38] has been used for different applications in computer vision such as representation learning [37], video captioning [40, 50], human pose estimation [44], motion prediction [20], and body pose labeling and forecasting [8, 44]. Our model fits into this paradigm since we map the frames in a video to joint movements of the dog.

Ego-centric vision. Our work is along the lines of ego-centric vision (e.g., [7, 32, 18, 19]) since we study the dog's behavior from the perspective of the dog. However, dogs have less complex actions compared to humans, which makes the problem more manageable. Prior work explores future prediction in the context of ego-centric vision. [55] infer the temporal ordering of two snippets of ego-centric videos and predict what will happen next. [26] predict plausible future trajectories of ego-motion in ego-centric stereo images. [13] estimate the 3D position of unseen body joints using ego-centric videos. [34] use online reinforcement learning to forecast the future goals of the person wearing the camera. In contrast, our work focuses on predicting future joint movements given a stream of video.

Ego-motion estimation. Our planning approach shares similarities with ego-motion learning. [54] propose an unsupervised approach for camera motion estimation. [45] propose a method based on a combination of CNNs and RNNs to perform ego-motion estimation for cars. [21] learn a network to estimate the relative pose of two cameras. [39] also train a CNN to learn the depth map and the motion of the camera from two consecutive images. In contrast to these approaches, which estimate the translation and rotation of the camera, we predict a sequence of joint movements; note that the joint movements are constrained by the structure of the dog's body, so the predictions are constrained as well.

Action inference & Planning. Our dog planning model infers the action sequence for the dog given a pair of images showing before and after action execution. [3] also learn the mapping between actions of a robot and changes in the visual state for the task of pushing objects. [27] optimize for actions that capture the state changes in an exploration setting.

Inverse Reinforcement Learning. Several works (e.g., [1, 4, 34]) have used Inverse Reinforcement Learning (IRL) to infer the agent's reward function from the observed behavior. IRL is not directly applicable to our problem since our action space is large and we do not have multiple training examples for each goal.

Self-supervision. Various research explores representation learning from different self-supervisory signals such as ego-motion [2, 12], spatial location [6], tracking in video [47], colorization [53], physical robot interaction [31], inpainting [28], sound [25], etc. As a side product, we show that we learn a useful representation using embeddings of joint movements and visual signals.
3. Dataset

We introduce DECADE, a dataset of ego-centric dog video and joint movements. The dataset includes 380 video clips from a camera mounted on the dog's head. It also includes corresponding information about body position and movement. Overall we have 24500 frames. We use 21000 of them for training, 1500 for validation, and 2000 for testing. Train, validation, and test splits consist of disjoint video clips.

We use a GoPro camera on the dog's head to capture the ego-centric videos. We sub-sample frames at the rate of 5 fps. The camera applies video stabilization to the captured stream. We use inertial measurement units (IMUs) to measure body position and movement. Four IMUs measure the position of the dog's limbs, one measures the tail, and one measures the body position. The IMUs enable us to capture the movements in terms of angular displacements.

For each frame, we have the absolute angular displacement of the six IMUs. Each angular displacement is represented as a 4-dimensional quaternion vector. The angular calculations in this domain and the method for quantizing the data are explained in detail in Section 7. The absolute angular displacements of the IMUs depend on what direction the dog is facing. For that reason, we compute the difference between angular displacements of the joints, also in the quaternion space. The difference of the angular displacements between two consecutive frames (that is, 0.2s in time) represents the action of the dog in that timestep.
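As a concrete illustration of this step, the sketch below computes the relative rotation between two consecutive absolute IMU readings in quaternion space, along with the quaternion angular distance used later for quantization. It is a minimal example under assumptions (a (w, x, y, z) component order and unit-norm readings), not the authors' code.

```python
import numpy as np

def quat_conjugate(q):
    """Conjugate of a unit quaternion given as (w, x, y, z)."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_multiply(q1, q2):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def relative_rotation(q_t, q_t1):
    """Quaternion that takes the absolute reading at frame t to the reading at t+1.

    This difference, rather than the absolute displacement, represents the dog's
    action for the 0.2s timestep, since it does not depend on which direction
    the dog is facing.
    """
    return quat_multiply(q_t1, quat_conjugate(q_t))

def quaternion_angle(q_a, q_b):
    """Angular distance (radians) between two unit quaternions."""
    dot = abs(np.dot(q_a, q_b))  # abs() handles the q / -q sign ambiguity
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))
```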
An Arduino on the dog's back connects to the IMUs and records the positional information. It also collects audio data via a microphone mounted on the dog's back. We synchronize the GoPro with the IMU measurements using audio information. This allows us to synchronize the video stream with the IMU readings with microsecond precision.
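Audio-based alignment of this kind can be approximated by cross-correlating the two audio tracks and shifting one stream by the estimated lag. The snippet below is only a rough sketch of that idea under assumed file names and matching sample rates; it is not the actual synchronization pipeline used for DECADE.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

def to_mono(x):
    """Convert a (possibly stereo) integer waveform to a float mono signal."""
    x = x.astype(np.float64)
    return x.mean(axis=1) if x.ndim > 1 else x

# Hypothetical inputs: audio extracted from the GoPro clip and audio recorded
# by the microphone attached to the IMU rig on the dog's back.
rate_video, audio_video = wavfile.read("gopro_audio.wav")
rate_imu, audio_imu = wavfile.read("imu_rig_audio.wav")
assert rate_video == rate_imu, "resample one track first if the rates differ"

a, b = to_mono(audio_video), to_mono(audio_imu)

# Lag (in samples) at which the two tracks line up best.
xcorr = correlate(a, b, mode="full")
lag_samples = int(np.argmax(xcorr)) - (len(b) - 1)
lag_seconds = lag_samples / rate_video

# The IMU timestamps can then be shifted by this offset so that video frames
# and IMU readings share a common clock.
print(f"estimated offset between the two streams: {lag_seconds:.6f} s")
```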
We collect the data in various outdoor and indoor scenes: living room, stairs, balcony, street, and dog park are examples of these scenes. The data is recorded in more than 50 different locations. We recorded the behavior of the dog while involved in certain activities such as walking, following, fetching, interacting with other dogs, and tracking objects. No annotations are provided for the video frames; we use the raw data for our experiments.
4. Acting like a dog

We predict how the dog acts in the visual world in response to various situations. Specifically, we model the future actions of the dog given a sequence of previously seen images. The input is a sequence of image frames (I_1, I_2, ..., I_t), and the output is the future actions (movements) of each joint j at each timestep t < t' <= N: (a^j_{t+1}, a^j_{t+2}, ..., a^j_N). Timesteps are spaced evenly by 0.2s in time. The action a^j_t is the movement of joint j that, along with the movements of the other joints, takes us from image frame I_t to I_{t+1}. For instance, a^2_3 represents the movement of the second joint that takes place between image frames I_3 and I_4. Each action is the change in the orientation of the joints in 3D space.

We formulate the problem as classification, i.e., we quantize joint angular movements and label each joint movement with a ground-truth action class. To obtain action classes, we cluster changes in IMU readings (joint angular movements) by K-means, and we use quaternion angular distances to represent the angular distances between quaternions. Each cluster centroid represents a possible movement of that joint.
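A minimal sketch of this quantization step is shown below: per-timestep quaternion differences for one joint are clustered, and each movement is labeled with the id of its nearest centroid. The number of clusters and the use of scikit-learn's standard K-means are assumptions for illustration; the paper itself only specifies K-means over quaternion angular distances.

```python
import numpy as np
from sklearn.cluster import KMeans

def canonicalize(quats):
    """Normalize quaternions and map q and -q (the same rotation) to one hemisphere."""
    quats = quats / np.linalg.norm(quats, axis=1, keepdims=True)
    signs = np.where(quats[:, :1] < 0, -1.0, 1.0)
    return quats * signs

def build_action_classes(diffs, num_clusters=8, seed=0):
    """Cluster one joint's movements (quaternion differences) into action classes.

    `diffs` has shape (num_movements, 4); see relative_rotation() above.
    For sign-aligned unit quaternions, Euclidean distance grows monotonically
    with the quaternion angular distance, so vanilla K-means is used here as a
    stand-in for clustering by angular distance.
    """
    data = canonicalize(diffs)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(data)      # one action class per movement
    centroids = kmeans.cluster_centers_    # each centroid is a prototype movement
    return labels, centroids
```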
Our movement prediction model is based on an encoder-decoder architecture, where the goal is to find a mapping between input images and future actions. For instance, if the dog sees her owner with a bag of treats, there is a high probability that the dog will sit and wait for a treat, or if the dog sees her owner throwing a ball, the dog will likely track the ball and run toward it.

Figure 2. Model architecture for acting. The model is an encoder-decoder style neural network. The encoder receives a stream of image pairs, and the decoder outputs future actions for each joint. There is a fully connected layer (FC) between the encoder and decoder parts to better capture the change in the domain (change from images to actions). In the decoder, the output probability of actions at each timestep is passed to the next timestep. We share the weights between the two ResNet towers.

Figure 2 shows our model. The encoder part of the model consists of a CNN and an LSTM. At each timestep, the CNN receives a pair of consecutive images as input and provides an embedding, which is used as the input to the LSTM. That is, the LSTM cell receives the features from frames t and t + 1 as the input in one timestep, and receives frames t + 1 and t + 2 in the next timestep. Our experimental results show that observing the two frames in each timestep of the LSTM improves the performance of the model. The CNN consists of two towers of ResNet-18 [10], one for each frame, whose weights are shared.

The decoder's goal is to predict the future joint movements of the dog given the embedding of the input frames. The decoder receives its initial hidden state and cell state from the encoder. At each timestep, the decoder outputs the action class for each of the joints. The input to the decoder at the first timestep is all zeros; at all other timesteps, we feed in the prediction of the previous timestep, embedded by a linear transformation. Since we train the model with a fixed output length, no stop token is required and we always stop at a fixed number of steps. Note that there are a total of six joints; hence our model outputs six classes of actions at each timestep.

Each image is given to a ResNet tower individually, and the features for the two images are concatenated. The combined features are embedded into a smaller space by a linear transformation. The embedded features are fed into the encoder LSTM. We use a ResNet pre-trained on ImageNet [5] and fine-tune it in a Siamese setting to estimate the joint movements between two consecutive frames. We use the fine-tuned ResNet in our encoder-decoder model.
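The sketch below mirrors this structure in PyTorch: a weight-shared ResNet-18 pair embeds each image pair, an encoder LSTM consumes the sequence of pair embeddings, and a decoder LSTM initialized with the encoder's final state emits one action class per joint at every future timestep, feeding its class probabilities back as the next input. Hidden sizes, the embedding dimension, the number of action classes, and the way the FC bridge is folded into the linear layers are illustrative assumptions, not values taken from the paper or the dogTorch code.

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_JOINTS = 6    # four limbs, tail, body (from the paper)
NUM_CLASSES = 8   # assumed number of quantized action classes per joint
HIDDEN = 512      # assumed LSTM hidden size
EMBED = 512       # assumed size of the fused image-pair embedding

class ActLikeADog(nn.Module):
    """Rough sketch of the acting model: Siamese ResNet-18 + encoder/decoder LSTMs."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights="IMAGENET1K_V1")         # ImageNet pre-training [5]
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # shared tower -> 512-d features
        self.pair_fc = nn.Linear(2 * 512, EMBED)                  # fuse the two frames of a pair
        self.encoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.prev_embed = nn.Linear(NUM_JOINTS * NUM_CLASSES, EMBED)
        self.decoder = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.action_fc = nn.Linear(HIDDEN, NUM_JOINTS * NUM_CLASSES)

    def embed_pairs(self, frames):
        # frames: (batch, T, 3, H, W); consecutive frames form the encoder's input pairs
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1).view(b, t, 512)
        pairs = torch.cat([feats[:, :-1], feats[:, 1:]], dim=-1)  # (b, T-1, 1024)
        return self.pair_fc(pairs)

    def forward(self, frames, future_steps):
        _, (h, c) = self.encoder(self.embed_pairs(frames))
        # First decoder input is all zeros; afterwards the previous prediction is fed back.
        prev = frames.new_zeros(frames.size(0), 1, NUM_JOINTS * NUM_CLASSES)
        outputs = []
        for _ in range(future_steps):
            out, (h, c) = self.decoder(self.prev_embed(prev), (h, c))
            logits = self.action_fc(out).view(-1, 1, NUM_JOINTS, NUM_CLASSES)
            outputs.append(logits)
            prev = logits.softmax(-1).flatten(2)                  # pass class probabilities forward
        return torch.cat(outputs, dim=1)                          # (b, future_steps, 6, NUM_CLASSES)
```

For example, a clip of six observed frames and five predicted future steps would be run as `ActLikeADog()(frames, future_steps=5)`, returning per-joint class scores for each future timestep.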
We use an average of weighted cross-entropy losses, one for each joint, to train our encoder-decoder. Our loss function can be formulated as follows:

L(o, g) = -\frac{1}{NK} \sum_{t=1}^{N} \sum_{i=1}^{K} \frac{1}{f^i_{g(t)^i}} \log o(t)^i_{g(t)^i},    (1)

where g(t)^i is the ground-truth class for the i-th joint at timestep t, o(t)^i_{g(t)^i} is the predicted probability score for that class of the i-th joint at timestep t, f^i_g is the number of data points whose i-th joint is labeled with class g, K is the number of joints, and N is the number of timesteps. The 1/f^i_{g(t)^i} factor up-weights ground-truth labels that are underrepresented in the training data.
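Equation (1) is essentially a per-joint cross-entropy with inverse-frequency class weights, averaged over joints and timesteps. A minimal PyTorch sketch under that reading is shown below; the tensor shapes and the class-count input are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dog_action_loss(logits, targets, class_counts):
    """Weighted cross-entropy over joints and timesteps, in the spirit of Eq. (1).

    logits:       (batch, timesteps, joints, classes) raw decoder outputs
    targets:      (batch, timesteps, joints) ground-truth action class ids
    class_counts: (joints, classes) number of training examples per class,
                  used to build the 1 / f^i_g weights
    """
    num_joints = logits.shape[2]
    per_joint_losses = []
    for i in range(num_joints):
        weights = 1.0 / class_counts[i].float().clamp(min=1)   # rare classes get larger weights
        loss_i = F.cross_entropy(
            logits[:, :, i].flatten(0, 1),                     # (batch*timesteps, classes)
            targets[:, :, i].flatten(),                        # (batch*timesteps,)
            weight=weights,
        )
        per_joint_losses.append(loss_i)
    return torch.stack(per_joint_losses).mean()                # average over the K joints
```

Note that PyTorch's weighted `cross_entropy` with the default mean reduction divides by the sum of the selected weights rather than by the raw count, which is a slightly different normalization than the plain 1/(NK) factor in Eq. (1).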
5. Planning like a dog

Another goal is to model how dogs plan actions to accomplish a task. To achieve this, we design a task as follows: given a pair of non-consecutive image frames, plan a sequence of joint movements that the dog would take to get from the first frame (starting state) to the second frame (ending state). Note that a traditional motion estimator would not work here. Motion estimators infer a translation and rotation for the camera that can take us from one image to another; in contrast, here we expect the model to plan for the actuator, with its set of feasible actions, to traverse from one state to another.

More formally, the task can be defined as follows. Given a pair of images (I_1, I_N), output an action sequence of length N - 1 for each joint that results in the movement of the dog from the starting point, where I_1 is observed, to the end point, where I_N is observed.

Each action that the dog takes changes the state of the world and therefore the planning for the next steps. Thus, we design a recurrent neural network containing an LSTM that observes the actions taken by the model in previous timesteps to predict the action for the next timestep. Figure 3 shows the overview of our model. We feed image frames I_1 and I_N to individual ResNet-18 towers, concatenate the features from the last layer, and feed them to the LSTM. At each timestep, the LSTM cell outputs planned actions for all six joints. We pass the planned actions for a timestep as the input of the next timestep. This enables the network to plan the next movements conditioned on the previous actions.

Figure 3. Model architecture for planning. The model is a combination of CNNs and an LSTM. The inputs to the model are two images I_1 and I_N, which are N - 1 time steps apart in the video sequence. The LSTM receives the features from the CNNs and outputs a sequence of actions (joint movements) that move the dog from I_1 to I_N.
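A compact PyTorch sketch of this planning network is given below, mirroring the description above: shared ResNet-18 features for I_1 and I_N are concatenated, and an LSTM unrolled for N - 1 steps emits per-joint action classes, with each step's prediction fed back as part of the next input. As with the earlier sketch, hidden sizes, the number of action classes, and re-feeding the fused image features at every step are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_JOINTS = 6
NUM_CLASSES = 8   # assumed number of quantized action classes per joint
HIDDEN = 512      # assumed LSTM hidden size

class PlanLikeADog(nn.Module):
    """Sketch of the planning model: two weight-shared ResNet-18 towers + an LSTM."""

    def __init__(self):
        super().__init__()
        resnet = models.resnet18(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # shared tower -> 512-d features
        self.fuse = nn.Linear(2 * 512, HIDDEN)                    # fuse start and goal features
        self.lstm = nn.LSTMCell(HIDDEN + NUM_JOINTS * NUM_CLASSES, HIDDEN)
        self.action_fc = nn.Linear(HIDDEN, NUM_JOINTS * NUM_CLASSES)

    def forward(self, img_start, img_goal, num_steps):
        # img_start, img_goal: (batch, 3, H, W); num_steps = N - 1 planned movements
        f_start = self.cnn(img_start).flatten(1)
        f_goal = self.cnn(img_goal).flatten(1)
        context = self.fuse(torch.cat([f_start, f_goal], dim=-1))  # (batch, HIDDEN)

        batch = img_start.size(0)
        h = context.new_zeros(batch, HIDDEN)
        c = context.new_zeros(batch, HIDDEN)
        prev = context.new_zeros(batch, NUM_JOINTS * NUM_CLASSES)

        plans = []
        for _ in range(num_steps):
            # Condition each step on the image pair and on the previously planned action.
            h, c = self.lstm(torch.cat([context, prev], dim=-1), (h, c))
            logits = self.action_fc(h).view(batch, NUM_JOINTS, NUM_CLASSES)
            plans.append(logits)
            prev = logits.softmax(-1).flatten(1)                   # feed prediction to next step
        return torch.stack(plans, dim=1)                           # (batch, num_steps, 6, NUM_CLASSES)
```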