ICML 2018 Notes: Stockholm, Sweden
David Abel∗
[email protected]
July 2018
Contents

1 Conference Highlights
3.4.1 PredRNN++: Towards a Resolution of the Deep-in-Time Dilemma [42]
3.4.2 Hierarchical Long-term Video Prediction without Supervision [43]
3.4.3 Evolving Convolutional Autoencoders for Image Restoration [40]
3.4.4 Model-Level Dual Learning [45]
3.5 Reinforcement Learning 3
3.5.1 Machine Theory of Mind [34]
3.5.2 Been There Done That: Meta-Learning with Episodic Recall [36]
3.5.3 Transfer in Deep RL using Successor Features in GPI [9]
3.5.4 Continual Reinforcement Learning with Complex Synapses [26]
This document contains notes I took during the events I managed to make it to at ICML in Stockholm, Sweden. Please feel free to distribute it and shoot me an email at [email protected] if you find any typos or other items that need correcting.
1 Conference Highlights
Some folks jokingly called it ICRL this year — the RL sessions were in the biggest room and apparently had the most papers. It’s pretty wild. A few of my friends in RL were reminiscing over the times when there were a dozen or so RL folks at a given big ML conference. My primary research area is in RL, so I tend to track the RL talks most closely (but I do care deeply about the broader community, too). All that being said, these notes are heavily biased toward the RL sessions. Also, I was spending quite a bit more time prepping for my talks/poster sessions, so I missed a bit more than usual.
Some takeaways:
• I’d like to see more explanatory papers in RL – that is, instead of focusing on introducing new algorithms that perform better on our benchmarks, we should reflect back on the techniques we’ve introduced and do a deep analysis (either theoretical or experimental) to uncover what, precisely, these methods do.
• I’m going to spend some time thinking about what it would look like to make foundational progress in RL without MDPs at the core of the result (there’s some nice work out there already [30]).
• Lots of tools are sophisticated and robust enough to make a huge impact, now. If you’re in AI for the long haul, Utopia-style vision of the future, now is a good time to start thinking deeply about how to help the world with the tools we’ve been developing. As a start, take a look at the AI for Wildlife Conservation workshop (and the computational sustainability community).
• Sanjeev Arora’s Deep Learning Theory tutorial and Ben Recht’s Optimization tutorial were
both excellent – I’d suggest taking a look at each if you get time. The main ideas for me were
(Sanjeev) we might want to think about doing unsupervised learning with more connection
to downstream tasks, and (Ben) RL and Control theory have loads in common, and the
communities should talk more.
2.1 Tutorial: Toward Theoretical Understanding of Deep Learning
Sanjeev Arora is speaking.
Some Terminology:
• Parameters of a deep net: θ
• Gradient Descent:
  θ_{t+1} ← θ_t − η ∇_θ E_i [ℓ(θ_t, x_i, y_i)]   (1)
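As a trivial anchor for Eq. (1), here is a minimal numpy sketch of a minibatch SGD step; the toy least-squares loss and all names here are my own illustration, not anything from the tutorial.

import numpy as np

def sgd_step(theta, minibatch, grad_loss, eta=0.1):
    # One step of Eq. (1): average per-example gradients over the minibatch and step downhill.
    grads = np.mean([grad_loss(theta, x, y) for x, y in minibatch], axis=0)
    return theta - eta * grads

# Toy example: least-squares loss l(theta, x, y) = 0.5 * (theta @ x - y)^2,
# whose gradient is (theta @ x - y) * x.
grad_loss = lambda theta, x, y: (theta @ x - y) * x
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])
data = [(x, x @ theta_true) for x in rng.normal(size=(100, 2))]

theta = np.zeros(2)
for _ in range(200):
    batch = [data[i] for i in rng.integers(len(data), size=16)]
    theta = sgd_step(theta, batch, grad_loss)
print(theta)  # approaches theta_true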
Goal of Theory: Theorems that sort through competing intuitions, lead to new insights and
concepts. A mathematical basis for new ideas.
Talk Overview:
1. Optimization: When/how can it find decent solutions. Highly nonconvex.
4. Unsupervised Learning/GANs
2.1.1 Optimization
Point: Optimization concepts have already helped shape deep learning.
Hurdle: Most optimization problems are non-convex. So, we don’t expect to have polynomial
time algorithms.
• Prove random initial points will converge.
Note: if the optimization is in R^d, then you want running time poly(d, 1/ε), where ε is the accuracy. The naive upper bound is exponential: exp(d/ε).
Curse of Dimensionality: In R^d, there exist exp(d) directions whose pairwise angle is > 60°. Thus, there exist exp(d/ε) special directions s.t. every direction has angle at most ε with one of these (an “ε-cover”).
Black-box analysis of Deep Learning. Why: we don’t really know the landscape, just the loss function. We have basically no mathematical characterization of (x, y), since y is usually a complicated function of x (think about classifying objects in images: x is an image, y is “dog”).
Instead, we can only query: θ → f → f(θ), ∇f(θ). Using just this black-box access, we can’t get global optima.
Gradient Descent:
• ∇f ≠ 0: so, there is a descent direction.
• Smoothness:
  ∇²f(θ) ⪯ βI   (2)

Claim 2.1. If η = 1/(2β), then we can achieve |∇f| < ε in a number of steps proportional to β/ε².

Proof.
  f(θ_t) − f(θ_{t+1}) ≥ ∇f(θ_t)·(θ_t − θ_{t+1}) − (β/2)|θ_t − θ_{t+1}|²
                     = η|∇_t|² − (β/2)η²|∇_t|²
                     ≥ (η/2)|∇_t|²   (since η ≤ 1/β),

so each step decreases f by a constant times |∇_t|²/β, and the claimed number of steps follows.
But, the solution here is just a critical point, which is a bit too weak. One idea to improve: avoid
saddle points, as in Perturbed SGD introduced by Ge et al. [18].
What about second-order optimization, like Newton’s method? We instead consider second-order conditions on the solution, which let us make stronger guarantees at the expense of additional computation.
Non-black-box analyses. Lots of ML problems are subclasses of depth-two neural networks:
Figure 1: Classical story of overfitting in Machine Learning.
Problem: Matrix completion. Suppose we’re given an n×n matrix M of rank r with some missing entries:
  M = U · V^T   (4)
The goal is to predict the missing entries. → This is a subclass of learning depth-two linear nets! Feed 1-hot inputs into an unknown net; set the output at one random output node. Then, learn the net! Recent work: all local minima for this problem are global minima, proven by [19] (for an arbitrary starting point).
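A toy numpy sketch of this view (mine, not from the talk): run plain gradient descent on the two factors of Eq. (4), fitting only the observed entries, and check that the missing entries get filled in. The sizes, step size, and iteration count are arbitrary choices that happen to work at this toy scale.

import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n)) / np.sqrt(n)  # rank-r ground truth
mask = rng.random((n, n)) < 0.5                                     # observed entries

U, V = 0.1 * rng.normal(size=(n, r)), 0.1 * rng.normal(size=(n, r))
eta = 0.05
for _ in range(3000):
    R = mask * (U @ V.T - M)                   # residual on observed entries only
    U, V = U - eta * R @ V, V - eta * R.T @ U  # gradient step on both factors
print(np.abs(U @ V.T - M).max())               # typically small: missing entries recovered too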
Theorems for learning multilayer nets? Yes! But usually only for linear nets. Overall net: a product of matrix transformations. Some budding theory:
• Connection to physics: natural gradient/Lagrangian methods.
1. Generate labeled data by feeding random input vectors into depth 2 net with hidden layer of
size n.
2. Difficult to train a new net using this labeled data with same # of hidden nodes.
3. But: much easier to train a new net with bigger hidden layer.
Figure 2: Excess capacity experiment from Zhang et al. [47]
Effective Capacity: roughly, log(# distinct a priori models). Generalization theory tells us:

  test loss − training loss ≤ √(N/m),   (5)

where m = # training samples and N is a capacity measure (# of parameters, VC dimension, Rademacher complexity).
Worry, though: for Deep Nets, N dominates so much that this is vacuous. Idea, via proof sketch:
• By concentration bounds, for a fixed net θ, we get our usual concentration inequalities.
• Thus, if the # of possible nets θ is W (the capacity), it suffices to let m > W/ε². But then this is the same for effectively all nets.
Current method of generalization theory: find a property Φ that only obtains in a few neural networks, and correlates well with generalization. Then, we can use Φ to compute upper bounds on the “very few” networks, which lowers the effective capacity.
Von Neumann: “Reliable Machines and unreliable components. We have, in human and animal
brains, examples of large and relatively reliable systems constructed from individual components,
the neurons, which would appear to be anything but reliable... In communication theory this can
be done by properly introduced redundancy”.
New Idea: Compression-based methods for generalization bounds, introduced at ICML this year
in Arora et al. [5]. The bound is roughly:
  capacity ≈ ( (depth × activation contraction) / (layer cushion × interlayer cushion) )²   (8)
• Quantitative bounds are still too weak to explain why a net with 20 million parameters generalizes from a 50k-example training dataset.
• Argument needs to involve more properties of training algorithm and/or data distribution.
• New Result! Arora et al. [4] show that increasing depth can sometimes accelerate the opti-
mization, including for classical convex problems.
Now, we’ll replace this with a depth-2 linear circuit – so, we replace w by w1 ·w2 (overparameterize!):
  L(w_1, w_2) = E_{(x,y)∼D} [ (1/p) (x^T w_1 w_2 − y)^p ]   (10)

Why do this? Well, the path that gradient descent takes might be easier. Gradient descent now amounts to:

  w_{t+1} = w_t − ρ_t ∇w_t − Σ_{τ=1}^{t−1} μ_(t,τ) ∇w_τ,   (11)

where ρ_t acts as an adaptive learning rate and the sum acts as a memory of past gradients.
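Here is a toy numpy sketch (my own, under the reading that w_1 is a vector and w_2 a scalar in Eq. (10)) of running gradient descent on the plain l_p regression objective next to its overparameterized version; per Arora et al. [4] the overparameterized path can sometimes be faster, though whether it is here depends on p, the step size, and the initialization.

import numpy as np

rng = np.random.default_rng(1)
d, n, p = 5, 200, 4
X = rng.normal(size=(n, d))
w_true = 0.3 * rng.normal(size=d)
y = X @ w_true

def g(w_end):
    # gradient of the l_p loss (1/p) * mean((Xw - y)^p) w.r.t. the end-to-end vector w
    r = X @ w_end - y
    return X.T @ (r ** (p - 1)) / n

eta = 0.01
w = np.zeros(d)               # (a) plain parameterization
w1, w2 = np.zeros(d), 1.0     # (b) overparameterized: predict (x @ w1) * w2
for _ in range(20000):
    w = w - eta * g(w)
    ge = g(w1 * w2)           # gradient w.r.t. the end-to-end vector w1 * w2
    w1, w2 = w1 - eta * w2 * ge, w2 - eta * (w1 @ ge)   # chain rule through the product
print(np.mean((X @ w - y) ** 2), np.mean((X @ (w1 * w2) - y) ** 2))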
Figure 3: Manifold learning
  min_{u∈U} max_{v∈V}  E_{x∼D_real}[D_v(x)] − E_h[D_v(G_u(h))]   (12)

The generator “wins” if the objective is ≈ 0 and further training of the discriminator doesn’t help (we’ve reached equilibrium).
Q: What spoils a GAN trainer’s day? A: Mode collapse! Idea: since the discriminator only learns from a few samples, it may be unable to teach the generator to produce a distribution D_synth with sufficiently large diversity.
New insights from theory: the problem is not the # of training samples but the size/capacity of the discriminator!
Theorem 2.2. (Arora et al. [3]) If the discriminator size is N, then ∃ a generator that generates a distribution supported on O(N log N) inputs and still wins against all possible discriminators.
Main Idea: small discriminators inherently incapable of detecting mode collapse. Theory suggests
GANs training objective not guaranteed to avoid mode-collapse. But, does this actually happen?
A: Yep! Recall the Birthday paradox: if you put ≥ 23 people in a room, the chance is > 0.5 that two of them share a birthday. Note that 23 ≈ √365.
Thus: if a distribution is supported on N images, then Pr(a sample of size √N has a duplicate image) ≥ 1/2.
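A small sketch (mine) of how one might run this birthday-paradox check in practice: draw roughly √N samples from a generator and look for near-duplicate pairs; generator below is a hypothetical stand-in for a trained GAN sampler.

import numpy as np

def birthday_test(samples, dup_threshold=1e-3):
    # If a distribution has support size ~N, a batch of ~sqrt(N) samples should
    # contain a (near-)duplicate with probability >= 1/2.
    flat = samples.reshape(len(samples), -1).astype(float)
    sq = (flat ** 2).sum(1)
    d2 = (sq[:, None] + sq[None, :] - 2 * flat @ flat.T) / flat.shape[1]  # mean sq. distance
    np.fill_diagonal(d2, np.inf)
    i, j = np.unravel_index(np.argmin(d2), d2.shape)
    return (i, j), d2[i, j] < dup_threshold

# Usage sketch (generator is hypothetical):
# batch = generator(n_samples=400)   # ~sqrt(N) for a suspected support size N ~ 160k
# (i, j), is_dup = birthday_test(batch)
# Inspecting the closest pairs by eye, over repeated batches, estimates the support size.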
Possible hole: For the learned code to be good, p(X, Z) needs to be learnt to very high numerical accuracy, since you’re going to use the code in a downstream task. But this doesn’t really happen!
• Maximizing log likelihood may lead to little (or unstable) insight into the data.
(Ben Recht?) Linearization Principle: “Before committing to a deep model, figure out what linear methods can do.” But, Sanjeev says, Ben doesn’t actually say this is his philosophy.
The point of learning a representation is that the true structure of the data emerges, and classification becomes easy. But: the downstream task isn’t always known ahead of time! So, maybe the representation should capture all or most of the information (like a bag of words).
Recovery algorithm:
  min ‖x‖₁  s.t.  Ax = b   (13)

But, Calderbank et al. [12] showed that linear classification on the compressed vector Ax is as good as on x.
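A small sketch (mine) of the l1 recovery problem in Eq. (13), cast as a linear program via the standard positive/negative split; the problem sizes below are arbitrary toy choices.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 80, 30, 4                       # signal dim, # measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)  # random measurement matrix
b = A @ x_true

# min ||x||_1 s.t. Ax = b, as an LP over the split x = xp - xn with xp, xn >= 0.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n), method="highs")
x_hat = res.x[:n] - res.x[n:]
print(np.abs(x_hat - x_true).max())       # typically ~0 when m is large enough relative to k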
Connection to RL: simple linear models in RL can beat the state of the art Deep RL on some
simple tasks. Linearization principle, applied! See Ben’s talk (next section).
2.1.6 Conclusion
What to work on:
2. Look at unsupervised learning! Yes everything is NP-Hard and new but that’s how we’ll
grow.
4. Going beyond (3.), design interesting models for interactive learning of language/skills. Both
theory and applied work are missing some basic ideas.
5. “Best theory will emerge from engaging with real data and real deep net training. (Noncon-
vexity and attendant complexity seems to make armchair theory less fruitful.)”
The games we’ve been successful on (Atari, Go, Chess, etc.) are too structured – what happens when we move out of games and into the real world? In particular, into settings where these systems interact with people in a way that actually has a major impact on the lives of lots of people.
Definition 1 (Reinforcement learning): RL (or control theory?) is the study of how to use
past data to enhance the future manipulation of a dynamical system?
Figure 5: RL vs. Control
If you come from a department with an “E” in it, then you study CT, and RL is a subset. If you
come from a CS department, then you study RL, and CT is a subset.
Today’s Talk: Try to unify these camps and point out how to merge their perspectives.
Main research challenge: what are the fundamental limits of learning systems that interact with
the environment?
Definition 3 (Reinforcement Learning): The study of discrete dynamical systems with inputs,
where the system is described as a Markov Decision Process (MDP).
Optimal Control:

  min  E_e [ Σ_{t=1}^T C_t(x_t, u_t) ]
  s.t. x_{t+1} = f_t(x_t, u_t, e_t),
       u_t = π_t(τ_t).

Where:
• e_t is a noise process
• f_t is the state transition function
• τ_t = (u_1, u_2, . . . , u_t) is a trajectory
• π_t(τ_t) is the policy
Example: Newton’s Laws define our model. So:

  z_{t+1} = z_t + v_t
  v_{t+1} = v_t + o_t
  m·o_t = u_t

With cost defined by reaching a particular location:

  minimize  Σ_{t=0}^T x_t² + r·u_t²   (14)

subject to some simple constraints (time, energy, etc.). The cost function isn’t given, typically; we assume designing it requires care.
Definition 4 (Linear Quadratic Regulator (LQR)): Minimize a quadratic cost subject to linear
dynamics. In some sense, the canonical, simple problem (similar to grid world in RL?)
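As a concrete companion to the double-integrator example and the LQR definition, here is a small numpy sketch (mine, not from the talk) of the standard finite-horizon LQR solution via the backward Riccati recursion; the particular costs Q, R and the horizon are arbitrary.

import numpy as np

# Double integrator (position, velocity) in discrete time, as in the example above.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.diag([1.0, 0.0])   # penalize position error (the x_t^2 term)
R = np.array([[0.1]])     # penalize control effort (the r*u_t^2 term)
T = 50

# Backward Riccati recursion: P_T = Q, then
#   K_t = (R + B'P_{t+1}B)^{-1} B'P_{t+1}A,   P_t = Q + A'P_{t+1}(A - B K_t).
P, gains = Q.copy(), []
for _ in range(T):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains = gains[::-1]        # gains[t] is the feedback gain at time t

# Roll out the optimal controller u_t = -K_t x_t from an initial condition.
x = np.array([[5.0], [0.0]])
for K in gains:
    x = A @ x + B @ (-K @ x)
print(x.ravel())           # driven near the origin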
Major Challenge: How do we perform optimal control when the system is unknown? (When f is unknown?)
Example: Consider the success story of cooling off data centers – here, the dynamics are unknown.
How could we solve this?
• Identify everything: PDE Control, High performance dynamics.
• Identify a coarse model: model predictive control.
• We don’t need no stinking model: RL, PID control.
PID control works: 95% of all industrial control applications are PID controllers.
Some Qs: How much needs to be modeled for more advanced control? Can we learn to
compensate for poor models or changing conditions?
Learning to control problem:
  minimize  E_e [ Σ_{t=1}^T C_t(x_t, u_t) ]
  s.t. x_{t+1} = f_t(x_t, u_t, e_t),
       u_t = π_t(τ_t).
Challenge: Build a controller with the smallest error given a fixed sampling budget (N × T). So, what is the optimal estimation/design scheme?
Big Question: How many samples are needed to solve the above challenge?
Definition 5 (The Linearization Principle): “If a machine learning algorithm does crazy things when restricted to linear models, it’s going to do crazy things on complex nonlinear models, too.”
Basically: would you believe someone had a good SAT solver if it couldn’t solve 2SAT problems?
• Then, solve approximate problem, same as LQR but use φ̂ as the model.
Dynamic Programming:
Let’s first suppose everything is known, and just consider the DP problem. Then, we can define our usual Q function as this expected cost (with x_1 = x, u_1 = u):

  Q_1(x, u) = E_e [ Σ_{t=1}^T C_t(x_t, u_t) ]   (16)

If we continue this process, we end up with the true recursive formulation of Q values:

  Q_t(x, u) = E_e [ C_t(x, u) + min_{u′} Q_{t+1}(f_t(x, u, e), u′) ].   (17)
Because quadratics are well behaved, we get a closed form for the optimal action. A couple of nice things:
• For a finite time horizon we could solve this with a variety of batch solvers.
Optimal Policy:
  π(x) = argmin_u Q(x, u)   (22)
New Problem:
  min_{z∈R^d} Φ(z).   (23)
Figure 6: Different approaches to directly finding a policy.
Then, we can use function approximators that might not capture the optimal distribution. We can build stochastic gradient estimates by sampling:
1. Sample: z_t ∼ p(z; θ_k)
REINFORCE is used at the heart of both policy gradient algorithms and random search algorithms. To do policy gradient, we replace our deterministic policy with a stochastic one:
• It necessarily becomes derivative-free, as you’re accessing the decision variable by sampling.
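A minimal numpy sketch (mine) of the score-function/REINFORCE estimator [44] behind both families: to minimize E_{z∼p(·;θ)}[Φ(z)] with a Gaussian sampling distribution over z, estimate the gradient as Φ(z)·∇_θ log p(z; θ) from samples, with a mean baseline to cut variance.

import numpy as np

def phi(z):
    # the objective we only get to evaluate (no gradients needed)
    return np.sum((z - 3.0) ** 2)

rng = np.random.default_rng(0)
d, sigma, eta, n_samples = 4, 0.5, 0.05, 32
theta = np.zeros(d)                        # mean of the sampling distribution p(z; theta)

for _ in range(500):
    z = theta + sigma * rng.normal(size=(n_samples, d))    # z ~ N(theta, sigma^2 I)
    f = np.array([phi(zi) for zi in z])
    f = f - f.mean()                                       # baseline to reduce variance
    # grad_theta log p(z; theta) = (z - theta) / sigma^2 for a Gaussian mean parameter
    grad = np.mean(f[:, None] * (z - theta) / sigma ** 2, axis=0)
    theta = theta - eta * grad
print(theta)   # drifts toward the minimizer of phi (the all-3s vector)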
2.2.3 Learning Theory
What can we say about sample complexity? In particular, what can we say about the sample com-
plexity of each of the three classes of methods we introduced in the previous section? (Approximate
DP, Model-based, and Policy Search).
Where the above “error” column is a super rough approximation based on # parameters alone. What about when we move to the continuous case?
Let’s return to LQR and think about sample complexity. Ben ran some experiments on a double
integrator task from each of the three algorithm classes, and after about 10 samples, ADP and
model-based solved the problem, whereas Policy Gradient did very poorly.
Lance Armstrong: “Extraordinary Claims Require Extraordinary Evidence” (“only if your prior is correct!” – Ben).
OpenAI quote on the trickiness of implementing RL algorithms: “RL results are tricky to reproduce: performance is very noisy, algorithms have many moving parts which allow for subtle bugs, and many papers don’t report all the required tricks.” Also see Joelle Pineau’s keynote talk at ICLR.
Ben’s Q: Is there a better way? Can we avoid these pitfalls? A: Yes! Let’s use models.
2. One idea is to fit the dynamics with supervised learning (see the sketch below):

  φ̂ = argmin_φ Σ_{t=0}^N |x_{t+1} − φ(x_t, u_t)|²   (28)

3. Then, solve the approximate problem, same as LQR but using φ̂ as the model.
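A toy numpy sketch (mine) of Eq. (28) in the linear case: collect transitions with random excitation, estimate (A, B) by least squares, and hand the estimates to the LQR machinery above. The true system and noise level here are made up.

import numpy as np

rng = np.random.default_rng(0)
A_true = np.array([[1.0, 1.0], [0.0, 1.0]])    # double integrator again
B_true = np.array([[0.0], [1.0]])
noise = 0.01

# Collect (x_t, u_t, x_{t+1}) transitions with random excitation.
X, U, Xn = [], [], []
x = np.zeros((2, 1))
for _ in range(500):
    u = rng.normal(size=(1, 1))
    xn = A_true @ x + B_true @ u + noise * rng.normal(size=(2, 1))
    X.append(x.ravel()); U.append(u.ravel()); Xn.append(xn.ravel())
    x = xn

# Least squares: x_{t+1} ~ [A B] [x_t; u_t], i.e. Eq. (28) with a linear model class.
Z = np.hstack([np.array(X), np.array(U)])          # shape (T, 3)
Theta, *_ = np.linalg.lstsq(Z, np.array(Xn), rcond=None)
A_hat, B_hat = Theta.T[:, :2], Theta.T[:, 2:]
print(np.abs(A_hat - A_true).max(), np.abs(B_hat - B_true).max())   # both small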
The hard part here is: what control problem do we solve? We know our model isn’t perfect. Thus we need something like Robust Control/Coarse-ID control.
In Coarse-ID control:
• Solve min_u x*Qx subject to x = Bu + x_0, with B unknown.
• Estimate B:

  B̂ = argmin_B Σ_{i=1}^n ‖(Bu)_i + x_0 − x_i‖²   (29)

• Then solve the control problem subject to x = B̂u + x_0. We can then relax this via the triangle inequality into a convex problem:

  ‖√Q x‖ + ελ‖u‖,   (31)

subject to the same constraint. They show how you can translate estimation error into control error in LQR systems – sort of like the simulation lemma from [11]. This yields robust model-based control: Ben showed some experimental results; it consistently does quite well (definitely better than model-free).
A return to the Linearization Principle: now, what happens when we remove linearity? (QR?). They tried running a random search algorithm on MuJoCo, and found it does better than (or at least as well as) natural gradient methods and TRPO.
Ben’s Proposed Way Forward: Use models. In particular, model-predictive control (MPC): plan over a short horizon and bootstrap with a terminal Q value, roughly

  Q_1(x, u) = E_e [ Σ_{t=1}^H C_t(x_t, u_t) + min_{u′} Q_{H+1}(f_H(x_H, u_H, e_H), u′) ].   (32)
• Can we get tight upper and lower sample complexity bounds for various control problems?
So, lots of exciting things to do! And it’s not just RL and not just control theory. Maybe we need
a new name that’s more inclusive, like “Actionable Intelligence”. So, to conclude:
Definition 6 (Actionable Intelligence): Actionable Intelligence is the study of how to use past
data to enhance the future manipulation of a dynamical system.
Actionable Intelligence interfaces with people, and needs to be trustable, scalable, and predictable.
Figure 7: A canonical adversarial example: guacamole cat!
3.1 Best Paper 1: Obfuscated Gradients Give a False Sense of Security [7]
The speaker is Nicholas Carlini, joint with Anish Athalye and David Wagner.
2. A2: Make ML better! (Even ignoring security, we should still want ML to not make these
mistakes)
In light of the presence of these examples, prior work has looked into defenses against these examples. At ICLR this year, there were 13 defense papers: 9 white box (no theory). In this talk: we show they’re broken.
This talk: How did we evade these defenses? Why were we able to evade them?
Definition 7 (Obfuscated Gradients): There are clear global gradients, but local gradients are
highly random and directionless.
So, new attack: “fixing” gradient descent. Idea: run the image through the network to obtain a probability distribution. Then, run it backward through a new network, almost identical to the original, but with the obfuscated-gradient pieces replaced. Using this we can still generate an adversarial image:
Figure 8: Using obfuscated gradients to generate adversarial examples: the orange layers are new layers that replace the ones with obfuscated gradients.
Why: what can we learn? What went wrong in the prior papers?
Instead: the purpose of a defense evaluation is to try, and fail, to show the defense is wrong.
A: Threat Model:
Definition 8 (Threat Model): A specific set of assumptions we place on an adversary.
Conclusion
1. A paper can only do so much evaluation.
2. We need more re-evaluation papers! Fewer new attacks.
3. “Anyone from the most clueless amateur to the best cryptographer can create an algorithm
that he[/she] can’t break” – Bruce Schneier
4. One Challenging Suggestion of a defense to break: Defense-GAN on MNIST [37].
3.2 Reinforcement Learning 1
Now for the first RL session (of many!). RL is in the biggest room this year!
The main idea: can we design algorithms that inherit better performance on easy MDPs?
• The agent doesn’t know this: the agent sees episodic MDP, still optimizing for a policy
for an H-step horizon.
Optimism for exploration: optimistic value function = empirical estimate + exploration bonus.
Want a smaller exploration bonus, which yields lower optimism.
Goal: Construct the smallest exploration bonus that is as tight as possible for the true underlying problem.
1. Mistakes in the Bandit-MDP are not very costly. Basically, agent can recover easily after one
mistake since the next state is unaffected.
Main result: shrink ∆ based on the structure of the underlying problem.
In conclusion:
• Bandits are not identified by a statistical test.
The setting: a platform interacts with a user to learn a user’s preferences over time. But! There’s
a risk: if the system upsets the user, then the user will leave the platform.
Definition 10 (Single Threshold Model): User has a threshold θ ∼ F , for F unknown. Then:
They prove which policies are optimal and near-optimal under a variety of variations of the setting.
New Setting:
• Users arrive sequentially, consider constant policy per user.
• Consider regret:

  regret(n) = n·p(x*) − Σ_{u=1}^n p(x_u)   (34)

Method: discretize the action space and run UCB/KL-UCB, achieving regret bounds inherited from UCB.
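A generic UCB1 sketch (mine, not the paper’s specific algorithm) of the “discretize and run UCB” recipe: treat each discretized action as an arm and play the arm with the highest upper confidence bound.

import numpy as np

def ucb1(pull, n_arms, horizon, c=2.0):
    # Generic UCB1: pull(a) returns a stochastic reward in [0, 1].
    counts, means = np.zeros(n_arms), np.zeros(n_arms)
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                                   # play every arm once first
        else:
            a = int(np.argmax(means + np.sqrt(c * np.log(t) / counts)))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]          # running mean
    return means, counts

# Usage sketch: discretize a 1-d action space into arms with a made-up payoff curve.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 20)
payoff = 0.9 - (grid - 0.6) ** 2
means, counts = ucb1(lambda a: float(rng.random() < payoff[a]), len(grid), 5000)
print(grid[int(np.argmax(counts))])                     # concentrates near the best action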
3.2.3 Lipschitz Continuity in Model-Based RL [6]
The speaker is Kavosh Asadi, joint with Dipendra Misra and Michael Littman.
  T̂(s′ | s, a) ≈ T(s′ | s, a)
  R̂(s, a) ≈ R(s, a)

1. Inaccurate models
The combination of these errors is deadly to model-based RL! And, critically, we’ll never have a perfect model.
Main takeaway: Lipschitz continuity plays a major role both in overcoming the compounding-error problem and, more generally, in model-based RL.
Also assume we have an approximately Lipschitz model, with constant K(T̂), and a true Lipschitz model. Then, the error of the n-step model will be:

  W(Tⁿ(· | s, a), T̂ⁿ(· | s, a)) ≤ ∆ Σ_{i=0}^{n−1} Kⁱ   (36)
Also introduced results about controlling the Lipschitz constant in neural nets, and regarding the
Lipschitz nature of the value function and models.
A: Our theory works for non-tabular cases and can be applied to models of arbitrary complexity.
Building on DQN: same basic architecture/learning setup. Here they introduce the Implicit Quan-
tile Network (IQN), that builds on C51 and QR-DQN by trying to relax the assumptions made
about discretizing the return output distribution.
Main Story: Move from DQN to IQN: make a slight change to the network, going from the mean (DQN) to samples from the return distribution (IQN); using these samples, you solve a quantile regression problem.
Q: How much data do you need? A: Well, the more samples you take, the better you do. If you increase the number of samples early you do quite a bit better – later in the learning problem, you don’t get much better by adding samples.
Results: They run it on the usual Atari benchmarks, and they find it halves the gap between DQN and Rainbow.
The goal, then, is to compute a good estimate of ρ^{π_e} given a set of T-step trajectories from data generated by π_b.
Usually for this problem we consider an estimator ρ̂^{π_e}; typically we do the MLE.
One method: do importance sampling on the data collected by the behavior policy when estimating ρ^{π_e}.
They introduce the More Robust Doubly Robust Estimator: an estimator for both contextual
bandits and RL. Prove new bounds and run experiments comparing their estimator to existing
estimators.
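A small numpy sketch (mine) of the basic per-trajectory importance-sampling estimator that the doubly-robust family builds on: reweight each observed return by the likelihood ratio between the evaluation and behavior policies.

import numpy as np

def is_estimate(trajectories, pi_e, pi_b, gamma=0.99):
    # Per-trajectory importance sampling for off-policy evaluation.
    # Each trajectory is a list of (s, a, r); pi_e(a, s) and pi_b(a, s) give action probabilities.
    estimates = []
    for traj in trajectories:
        rho, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)   # cumulative likelihood ratio
            ret += (gamma ** t) * r
        estimates.append(rho * ret)
    return np.mean(estimates)                # unbiased, but high variance for long horizons

# Toy usage: a 1-state, 2-action problem where action 1 pays 1 and action 0 pays 0.
rng = np.random.default_rng(0)
pi_b = lambda a, s: 0.5                                  # behavior: uniform
pi_e = lambda a, s: 0.9 if a == 1 else 0.1               # evaluation: mostly action 1
trajs = [[(0, a, float(a)) for a in rng.integers(2, size=5)] for _ in range(5000)]
print(is_estimate(trajs, pi_e, pi_b))                    # ~ 0.9 * (1 + 0.99 + ... + 0.99^4)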
3.3.1 Coordinating Exploration in Concurrent RL [15]
The speaker is Maria Dimakopoulou, joint with Ben Van Roy.
Idea: focus on concurrent learning – a case where lots of agents can be run in parallel simultaneously.
• Agents are uncertain about P and R, over which they share priors.
If we just run ε-greedy across all the concurrent agents, we don’t see much of a benefit in exploration due to the lack of coordination.
2. Commitment: maintain the intent to carry out action sequences that span multiple periods.
SeedSampling: extends PSRL by satisfying each of the above three properties. Each agent starts
by sampling a unique random seed. This seed maps to an MDP, thereby diversifying exploratory
efforts among agents.
Path Planning: find the shortest set of actions to reach a goal location from a starting state.
Other approaches: (1) A*, but not differentiable, and (2) Value Iteration Networks (VIN) → differentiable. So, VINs are becoming widespread.
Problem: VINs are difficult to optimize. So, goal here is to make them easier to optimize. In
particular, Non-gated RNNs are known to be difficult to optimize.
Proposal: replace non-gated RNN with gated-RNNs, and allow for a larger kernel size, yielding
Gated Path-Planning Networks.
Experiments: run in a maze like environment and 3D VizDoom, comparing to VINs, showing
consistent improvement. Really thorough experimental analysis, studying generalization, random
seed initialization, and stability.
Main problem: The previous model (PredRNN) uses a zigzag memory flow. But: capturing short-term dynamics requires deeper-in-time networks, where the gradient vanishes, yielding bad long-term modeling capability.
Main contribution: Causal LSTM, which uses a longer path for short-term dynamics.
They run experiments on the Moving MNIST dataset and the KTH action dataset, finding consistent, significant improvement over relevant baselines.
Task: given the first n frames of a video predict the next k frames.
Major problem with prior work: fail to predict far into the future.
They introduce an architecture that encodes current frames and determines loss based on both prediction of the encoding and the original frame (as far as I can tell – the architecture was relatively complex). Also an adversarial component in the mix – I think used to encourage future frames to be indistinguishable to a discriminator.
Experiments: (1) A shape video prediction problem, where theirs does extremely well, (2) a human pose prediction dataset, (3) human video prediction.
This work: shows that an evolutionary approach can evolve useful architectures for Convolutional Autoencoders, applied to image restoration.
Idea: Represent a CAE architecture as a Directed Acyclic Graph (the phenotype), encoded by a genotype. Then, optimize the genotype using typical evolutionary algorithms.
Also, symmetry is useful in AI (primal task → dual task).
This work: model-level duality. Duality exists not only in the data, but also at the level of the
model. For instance: neural machine translation.
Thus: because of this model level symmetry, we can share knowledge across models. Seems cool!
They evaluated one instance of it in a few different experiments, including machine translation,
sentiment analysis, and an asymmetric setting (closer to traditional classification).
A: Well, surely we do this with people all the time. We assign meaning to others’ actions regularly.
This is commonly called the “Theory of Mind” (by Cog. Psychology folks?).
Here is how it works: first you decide to treat the object whose behavior is to be
predicted as a rational agent; then you figure out what beliefs that agent ought
to have, given its place in the world and its purpose. Then you figure out what
desires it ought to have, on the same considerations, and finally you predict that
this rational agent will act to further its goals in the light of its beliefs. A little
practical reasoning from the chosen set of beliefs and desires will in most instances
yield a decision about what the agent ought to do; that is what you predict the agent
will do.
– Daniel Dennett, The Intentional Stance.
Two camps: (1) Theory of mind, (2) Theory theory (also called “simulation theory”).
Lots of work in modeling other agents in ML: imitation learning, inverse RL, opponent modeling,
multi-agent RL, and more.
This work: taking inspiration from human Theory of Mind – we learn how humans work during
our development. We build this strong prior over how to understand other agents.
Desiderata:
2. Does not simply assume others are noisy-rational utility maximizers with perfect planning
4. Goal: build structure that learns a prior which captures general properties of population.
3.5.2 Been There Done That: Meta-Learning with Episodic Recall [36]
The speaker is Samuel Ritter, joint with Jane Wang.
Consider the Lifelong Learning setting (they call it Meta learning) – interacting with m ∼ D, an
MDP sampled from some distribution.
Example: bandits! But actually, contextual bandits, since you also see c.
3.5.3 Transfer in Deep RL using Successor Features in GPI [9]
The speaker is Andre Barreto, joint with Diana Borsa, John Quan, Tom Schaul, David Silver,
Matteo Hessel, Daniel Mankowitz, Augustin Zidek, Remi Munos.
Look at a transfer setting: want to transfer knowledge from one task to another.
Their solution:
2. Successor Features
Generalized Policy Improvement takes as input a bunch of policies, π_1 . . . π_n, and combines them to yield π̃ such that:

  ∀i : V^{π̃} ≥ V^{π_i}   (39)

Successor Features: suppose:

  R_i = Σ_j w_j R_j,   Q_i = Σ_j w_j Q_j.   (40)

Thus, given a new task, we can apply the successor features and GPI to quickly yield a good policy for the new task.
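A tiny numpy sketch (mine) of the GPI step implied by Eqs. (39)–(40): given successor features ψ^{π_i}(s, a) for a few previously learned policies and a new task’s reward weights w, act greedily with respect to max_i ψ^{π_i}(s, a)·w. The shapes and numbers below are made up.

import numpy as np

def gpi_action(psi, w, s):
    # Generalized Policy Improvement with successor features.
    # psi[i, s, a] is a d-vector: the expected discounted sum of features under policy pi_i.
    # A new task has reward r(s, a) = phi(s, a) . w, so Q^{pi_i}(s, a) = psi[i, s, a] . w.
    q = psi[:, s, :, :] @ w                 # shape (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))    # greedy w.r.t. the max over old policies

# Toy usage: 2 old policies, 3 states, 2 actions, 4 reward features.
rng = np.random.default_rng(0)
psi = rng.random((2, 3, 2, 4))
w_new = np.array([1.0, 0.0, -0.5, 0.2])     # weights describing the new task's reward
print([gpi_action(psi, w_new, s) for s in range(3)])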
Synaptic Consolidation Model: Benna Fusi model introduced by Benna and Fusi [10]. Formally:
Summary:
4 Thursday July 12th
I made it to the keynote this morning!
• Work ∝ −∆F = −(∆E − ∆H). The entropy term is the part of the energy that cannot be converted into work.
E.T. Jaynes Information Theory and Statistical Mechanics [24]:
• The free energy is a subjective quantity.
Rissanen: “Modeling by shortest data description.” [35], with L(·) description length:
Then, Hinton drew from these ideas to study simple Neural Networks [21]. Formulation ended up
with a similar energy term and entropy term.
To summarize:
Figure 11: Some history of the Free Energy, Energy, and Entropy conversation.
Markov Chain Monte Carlo: Stochastic (sample error), Unbiased, Hard to mix between modes,
hard to assess convergence.
“The Big Data Test”: Any reasonable procedure should give you an answer in finite time.
2. AI power and thermal ceiling: as AI moves from cloud to end systems, we need lower energy
AI computing.
Main Claim: We should think about the amount of intelligence we get from an AI algorithm per kilowatt hour.
• Compression, quantization
• Regularization, generalization
• Confidence estimation
• Privacy and adversarial robustness
Showed a few other methods of compressing Neural Networks, such as differentiable quantization,
spiking neural networks.
Increasing trend in fairness research: lots of attention, increasing dramatically over time. In all we
now have 21 definitions of fairness.
Usual idea: come up with a definition of fairness such that this definition ensures protected groups
are better off. That is, if ML systems are fair, we assume that protected groups are better off.
This paper: is the above assumption correct? How do fair ML systems actually impact protected
groups?
Example, loans:
This work:
• Introduce the outcome curve, a tool for comparing the delayed impact of fairness criteria.
Individuals have scores, R(X), that denote some relevant value for a given domain. If one individual has a score, a group of individuals will have a distribution over scores.
Monotonicity assumption: higher scores imply more likely to repay (in the loan case).
Main idea behind the failure mode: scores of accepted individuals change depending on their
success, sometimes for the worse.
So: Equal opportunity and demographic parity may cause relative improvement, relative harm, or
active harm.
Theorem 4.3. Demographic Parity may cause active or relative harm by over-acceptance, equal
opportunity doesn’t.
Theorem 4.4. Equal opportunity may cause relative harm by under-acceptance; demographic parity never under-accepts.
They run experiments on FICO credit score data; the results corroborate their theory and show that the groups are affected in different ways depending on the metric of fairness.
Dave: I missed a lot of sessions today due to meetings and prepping for my talk. Also I tried
entering the Deep Learning Theory session but it was full!
This paper: generalized Amari’s gradient-like learning rule [2] to a “naturalized” learning rule, including TD-like algorithms, policy gradient algorithms, and accelerated gradient methods.
4.3.2 PIPPS: Flexible Model-Based Policy Search Robust to the Curse of Chaos [33]
Goal: sample efficient model-based RL based on PILCO.
Three steps:
5.1.1 Hierarchical Imitation and Reinforcement Learning [28]
The speaker is Hoang Le, joint with Nan Jiang, Alekh Agarwal, Miroslav Dudk, Yisong Yue, and
Hal Daume.
Well known: most RL methods have a hard time learning long horizon and sparse reward tasks
(like Montezuma’s Revenge).
Problem: The teacher’s feedback can be costly to obtain (lots of good demos are hard to provide).
Often depends on the horizon of the problem.
Main Question: How can we most effectively leverage limited teacher feedback?
• People like giving high level feedback with hierarchical structure built in.
This work:
Teacher provides high-level feedback and zooms in to the low level only when needed (savings in teaching effort). Describes a hybrid imitation and RL approach where the teacher provides only high-level feedback.
Motivating problem:
• Only label if subpolicy fails (that is, the low level execution of the macro-action fails).
Summary: Teacher labels the high-level trajectory with correct macro-actions. The key insight is that verifying that a low-level trajectory is successful is cheaper than labeling it.
Theorem 5.1. Labeling effort = high level horizon + low level horizon.
Experimental results on labyrinth tasks show that the labeling approach requires less data to do
well than flat imitation learning approaches.
They conclude by extending the algorithm to a hybrid IL/RL case, where they learn the meta-
controller by IL and the subpolicies by RL. They test this approach on the first room of Mon-
tezuma’s, showing that the approach can consistently do well in the first room relative to baselines.
Motivation: Reward functions are extremely difficult to structure in the right way.
Q: How can you exploit the reward function definition, if shown to the agent?
This paper:
• An initial state u0 ∈ U
5.2 Language to Action
The speaker is Joyce Y. Chai.
Starting with the Jetsons! Comparison to current technology – it seems like we can do most things in the Jetsons, but not Rosie! Why not? Seems like we’re still far from Rosie.
Lots of exciting progress: language communication with robots has come extremely far. The next
frontier: Interactive Task Learning.
Definition 19 (Interactive Task Learning): Teach robots new tasks through natural interaction, either through demonstration (visual demo, language, or kinesthetic guidance) or through specification (natural language specification, GUI-based specification).
Demo: person teaching a robot to make a smoothie. The goal is to seamlessly communicate to the
robot how to carry out a structured task. The end result is knowledge of a task structure (like a
Hierarchical Task Network).
• Q1: What kind of commonsense knowledge is essential for understanding and modeling action
verbs?
Task: given some commands from a person, we’d like to ground these commands into a semantic
representation of some kind (like a grounding graph, grammar – effectively doing semantic parsing
from language/sensor input).
For instance, given “She peels the cucumber”, we can ask: “what happens to the cucumber?” to
determine if the relevant semantics are captured in digesting the initial statement.
Physical causality of action verbs: “Linguistic studies have shown that concrete action verbs often
denote some change of state as the result of an action” [22].
A: Sure! They collect data in an MTurk like study to annotate verbs with causality knowledge. This
enables robotic systems to perceive the environment, observe some knowledge, and apply/extract
causal knowledge about the entities of relevance.
The ultimate question we care about, though: “Can the robot perform the action?” → no. Because planning is hard when we have high-dimensional inputs such as language/images.
2. Execution phase: issue an action command, retrieve best fit representation for action plan-
ning/execution, evaluate.
Use RL to learn an interaction policy – when should the robot ask which questions to maximize long
term reward? Implemented this learned interaction policy in a Baxter robot, showed a robot demo
where the robot first asks for a demo, clarifies the scene, and then asks to reset the environment
and performs the same action itself.
Claim: If robots are to become our collaborators, they must acquire this ability to do action-cause-
effect predictions.
Problem: Naive physical Action-Effect Prediction. For example, given an action description like “squeeze bottle”, predict a few images showing the consequences of applying that action (and try the reverse).
Their approach: aim to have a small number of annotated examples, then pair these high quality
data with simple web search images. Using this approach showed a really nice demo of someone
teaching a robot to make a smoothie.
Conclusions:
2. In the pursuit of AGI and Rosie, we still have a long way to go.
3. Upcoming challenges: lots of unknowns; requires a multidisciplinary, joint effort across vision, language, robotics, learning, and more.
4. Things on the wishlist:
Slide one: AI technologies! We have loads of them. But, we don’t have any “real AI”. We have machines that do things we thought only humans could do, not the kind of flexible general-purpose reasoners we’re after.
• Intelligence is not just about pattern recognition (which has been the focus recently).
If you want to hear more, check out the work by Lake et al. [27]
Fundamental for MIT’s Quest for Intelligence: “imagine if we could build a machine that grows into intelligence the way a person does, that starts like a baby, and learns like a child.”
Early influential/classical papers were published in psych/Cog Sci journals (boltzmann machines
paper, finding structure in time, perceptron, etc.).
The science of how children learn can now offer real engineering guidance to AI. In particular, basic questions:
1. What is the form and content of the starting state (inductive bias)?
Cog Sci paper studying how children start to acquire knowledge [39]: in a real sense, they are
already born knowing about object permanence and 3d space.
Child as Scientist view: children don’t learn by just copying things down. They learn via play
(experiments) to test hypotheses actively.
So, fundamental question: how do we grasp onto these ideas in machine learning and AI?
Goal: Reverse-engineering “Core Cognition”: intuitive physics, intuitive psychology. So, how do we do this?
• Probabilistic programs integrate our best ideas on intelligence: symbolic languages for knowl-
edge representation, composition, and abstraction. Examples: Church, Anglican, WebPPL,
Pyro, ProbTorch, etc.
• Probabilistic inference: causal reasoning under uncertainty and flexible inductive bias.
Dave: I had to take off for meetings the rest of the day.
Multitask RL with flexible task description: use natural language to provide a seamless way to generalize to unseen complex tasks with compositions. Prior task descriptions focus on a single sentence or a sequence of instructions.
Motivating Example: Household robot making a meal. One might break it down into subtasks,
like: pickup egg, stir egg, scramble egg, pickup bread, and so on.
Instead, one might give high level commands, like “make a meal”. But some tasks impose different
precondition relations between different subtasks.
This work: decompose subtasks into a graph, then do subtask graph execution problem.
Definition 20 (Multi-task RL): Let G be a task parameter drawn from a distribution P (G).
Here:
• The task is defined by G as an MDP tuple: ⟨S, A, R_G, T_G, γ_G⟩. Dave: and maybe one other component?
Main idea: construct a differentiable representation of the subtask graph. Achieved by replacing “AND” and “OR” operations with approximated-and and approximated-or nodes, which are differentiable.
Evaluate in a 2D Minecraft-like domain with lots of preconditions (get stone, then make stone pickaxe, then mine iron, etc.).
Definition 21 (Meta RL): Learn how to do fast RL from experience and incorporate prior experience for fast learning. The agent is given prior experience from some set of tasks (with each task an MDP).
This paper: remove this supervision. Yields a general recipe for Unsupervised Meta RL (UMRL).
Advantages of UMRL:
• Less overfitting on task distributions
• Another idea: use the “diversity is all you need” idea to choose tasks that have maximal log likelihood w.r.t. the state. Seems useful, as we generate a bunch of new diverse tasks.
A: We want: (1) continuous improvement, good extrapolation behavior, (2) reverts to standard RL
out of distribution.
MAML: Model-Agnostic Meta-Learning for RL. The key idea: learn a policy π_θ which can adapt to new tasks with one step of policy gradient:

  max_θ Σ_{i∈tasks} R_i(θ′_i),   (46)

where θ′_i is θ after one policy-gradient step on task i.
Explore MAML with their unsupervised task generation in Cheetah, Ant, and 2d navigation.
Main Q: Can we use our prior experience to learn better exploration strategies?
Problem: Given some prior experience on tasks (T0 , . . . , Tn ) ∼ p(T ), with each task an MDP. Then,
on some new test task Ttest , we’d like the agent to learn/make good decisions as quickly as possible.
2. Quickly adapt behavior to new tasks once rewards are experienced.
Q: How do we generate coherent exploration behavior?
Idea: Use structured stochasticity. In particular: noise in latent space generates directed, temporally coherent behaviors. Exploring with noise in latent space.
Then, do Meta-Training with “MAESN”. That is: train a latent policy πθ across multiple tasks,
each with some parameter. Then optimize the meta-objective while constraining pre-update latent
parameters against a prior.
At test time: do RL where you initialize the latent distribution based on the prior.
Formally speaking:

  ψ_π(s, s′) = E_{π,p} [ Σ_{t=0}^∞ γ^t 1{s_t = s′} | s_0 = s ]   (47)

This work: the stochastic successor representation. Compute an empirical model of the stochastic SR:

  P̃_π(s, s′) = n(s, s′) / (n(s) + 1).   (48)

Idea: this lets us count state visitations, which we can use for exploration.
Similar to a return:

  ψ̃_π(s_1) = 1/(n(s_1) + 1) + . . . + 1/(n(s_k) + 1).   (49)

This lets them introduce an exploration bonus using these state visitations.
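A toy tabular sketch (mine, following the count-based reading of Eqs. (48)–(49)): build the empirical model from transition counts, form the successor representation under it, and use it to propagate a per-state novelty term 1/(n(s)+1) into a bonus. The chain-walk usage below is made up.

import numpy as np

n_states, gamma = 6, 0.95
n_visits = np.zeros(n_states)              # n(s)
n_trans = np.zeros((n_states, n_states))   # n(s, s')

def update_counts(s, s_next):
    n_visits[s] += 1
    n_trans[s, s_next] += 1

def sr_bonus():
    # Empirical model P~(s'|s) = n(s,s') / (n(s)+1) as in Eq. (48), its SR,
    # and a bonus that propagates per-state novelty 1/(n(s)+1) as in Eq. (49).
    P = n_trans / (n_visits[:, None] + 1.0)             # rows need not sum to 1
    psi = np.linalg.inv(np.eye(n_states) - gamma * P)   # SR under the empirical model
    return psi @ (1.0 / (n_visits + 1.0))

# Toy usage: a leftward-biased random walk on a chain that rarely reaches the right end.
rng = np.random.default_rng(0)
s = 0
for _ in range(300):
    s_next = int(np.clip(s + rng.choice([-1, 1], p=[0.7, 0.3]), 0, n_states - 1))
    update_counts(s, s_next)
    s = s_next
print(np.round(sr_bonus(), 2))   # bonus grows toward the rarely visited right end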
They run some experiments in River Swim and compare to the usual PAC-MDP algorithms; it performs competitively with R-Max, E³, and so on.
Main reason to do this, though, is that the successor representation can easily generalize to function
approximators.
Main Q: Can model-free algorithms be made sample efficient? In particular, is Q-Learning provably
efficient?
First, what do we know about learning in tabular MDPs?
Main result shows that Q-Learning with UCB like exploration strategy has bounded regret:
  Õ(√(H⁴ S A T)).   (50)
• If you vary the learning rate, you can prioritize earlier vs. later updates, which also uncovers a bias-variance trade-off.
Goal: Discuss one direction for UCB on action-values in RL, and highlight some open questions and issues.
Problem setting:
• General state/action space.
In stochastic bandits, it is a bit more clear how to compute our UCB. Same story, roughly, in contextual bandits – we can still compute UCB-like estimates in this setting.
A1: Temporal connections. A2: Bootstrapping – we do not get a sample of the target, especially since the policy is changing.
Idea for UCB in RL: UCB for a fixed policy. Apply our usual concentration inequalities to obtain
the relevant upper bound w.r.t. the chosen fixed policy.
To extend this to an unfixed policy – using some ideas from Ian Osband and Ben Van Roy’s work
on stochastic optimism can enable the right sort of guarantees.
Empirically, algorithms that use this kind of idea seem to work quite well: (1) Bootstrapped DQN, (2) Bayesian DQN, (3) Double Uncertain Value Networks, (4) UCLS (the new algorithm in this work).
UCLS and Bayesian DQNs can both incorporate some of these ideas seamlessly in the neural case
by modifying the last layer of an NN.
Open Questions:
• Is it useful to use UCB derived for fixed policies, but with inflated estimates of variance to
get stochastic optimism?
1. Humans have increased the species extinction rate by as much as 1000 times over background
rates.
AI can help! Examples: predicting species ranges, migrations, poaching activity, planning ranger
patrolling and conservation investments, detecting species, and so on.
• Agriculture: to feed the world’s growing population, farmers must produce more food on less
arable land with less environmental impact.
• Water: In less than two decades demand for fresh water is projected to outpace supply.
• Biodiversity: Species going extinct beyond the natural rate by orders of magnitude.
• Climate Change.
Some examples:
(See www.microsoft.com/AIforEarth.)
1. Tagging: Orca killed by satellite tag leads to criticism of science practices.
2. Conservation: poachers are using data from wildlife scientists to target and kill rare species.
3. (and more)
The above summarize a few concerns around data. So, in this talk: how can we innovate our
attitude toward data? Five suggestions: (1) UAV/Drone imagery, (2) Camera trap, (3) Simulation,
(4) Crowd Sourcing, (5) Social Media.
UAV/Drone Imagery
Consider the FarmBeats challenge, issued by Microsoft. Goal is to provide farmers with access
to Microsoft Cloud and AI technologies, enabling data-driven decisions to help farmers improve
agricultural yield, lower costs, and reduce the environmental impact.
→ The challenge: by 2050, the demand for food is expected to outpace production by over 70%.
Solution: FarmBeats uses ML to integrate sensor data with aerial imagery to deliver actionable
insights to farmers, all at a fraction of the cost of existing solutions. Developed an app to help
farmers automate the tasks of drones to bolster the effectiveness of their farm.
Idea: TV white spaces – use unoccupied TV channels to send wireless data. TV uses lower frequencies, so they reach quite far (also part of the FarmBeats project).
Challenges:
Simulation
Example: used a simulation as part of a challenge problem in the workshop focused on flying UAVs
and drones. Lots of opportunities to use simulations!
Crowdsourcing
AI for Earth has been working with iNaturalist to crowdsource data on biodiversity.
Idea: when you’re out taking a nature walk, you can contribute data to a large publicly available dataset for folks working on biodiversity.
AI for Earth offers some pre-trained methods to help out with species identification; it uses an existing dataset of animals and also geographic information to help gather good data. This avoids the bottleneck of novice users needing to know different species.
The Wildbook system uses ML to find exact animals (not just species, but literally the same exact animal). Wildbook has an agent that scans social media for images of different animals to track the location of individual animals. Thus: they can track the migration of a specific whale. (The main ML idea is to use distinct SIFT-like markers that pick out specific animals.) Available here: https://www.whaleshark.org/
Next up are spotlight talks from papers.
Problem: Glossy buckthorn is responsible for both ecological and financial damage all over.
Solution: Optimize where, when, and how to target glossy buckthorn (burn? cut? etc.).
Important that recommendations are high confidence/correct. These actions are expensive and
have long term consequences. Previous approaches are heuristic and just make best guess.
Even further difficulty: Ecological data is often extremely sparse and biased. For instance: most
reports of glossy buckthorn appear next to roads—obviously a result of sampling bias!
Their approach: robust optimization, a method that can be used to incorporate confidence into predictions. Idea: take a point estimate and replace it with a set of plausible realizations, which yields a minimax problem:
Run simulations based on real data from EDDMaps and WorldClim.
Summary:
3. Robust optimization is a tractable approach that can account for uncertainty in predictions.
Focus: Bird Migration data. Huge dataset! In particular targeting bird roosts where birds com-
mune and fly in a specific location for long periods of time.
Data set: detailed biological phenomenon, 143 radar stations, more than 200 million entries.
Goal: Develop an automated system to detect and track bird information in this data set.
In the end, created an annotated dataset of tree swallow roosts and their movements in the US.
Camera Traps: motion or heat sensor cameras that take photos of wildlife, yielding both presence
and absence of animals (which can give better population estimates). Can even get a short sequence
of images that shed light on movement, depth.
Problems: (1) flash can affect animals, (2) cameras are often triggered by nothing (wind, people), (3) data is hand-sorted by experts (costly!), so even if cameras get cheaper, we can’t scale due to the labeling.
1. Illumination
2. Blur
3. ROI Size
4. Occlusion
5. Camouflage
6. Perspective
This work: organize the data set by location (see beerys.github.io), animal type, bounding boxes, and so on.
• Direction of travel
• (and more!)
Environmental monitoring is receiving a huge boost from mobile crowdsourcing (per the methods outlined in the previous few talks and the keynote).
This work: “SnowWatch”. Create novel and low-cost tools to monitor and predict water availability in the dry season in mountainous regions.
Mobile applications for crowd-sourcing already exist, but not for mountains.
Given the success of Pokemon Go, they use an Augmented Reality method.
Build an application called PeakLens for Android: an outdoor Augmented Reality app that iden-
tifies mountain peaks and overlays them in real-time view.
Main technical challenge is to identify the mountain range from the skyline – difficult due to oc-
clusions, compass/GPS error, and low resolution images. Achieve extremely high accuracy (90%+).
Dataset: the Kuzikus dataset, containing drone imagery of large animal species like Rhino, Ostrich, Kudu, Oryx, and so on. Around 1200 animals in 650 images.
Problem: the animals take up a very small percentage of pixels. Dataset is extremely heteroge-
neous, finding animals poses a needle-in-the-haystack problem.
Core contribution of the work: new insights to encourage a CNN to train effectively on a dataset presenting the above challenges (animals are very small, similar background across images, and so on).
Leverage: (1) Curriculum learning, (2) Border classes (can help identify rare animals using things
like animal shadows), and a few other techniques. Together, they complement one another to yield
a practically useful detection algorithm for this animal image dataset.
References
[1] David Abel, Dilip Arumugam, Lucas Lehnert, and Michael L. Littman. State abstractions for
lifelong reinforcement learning. In Proceedings of the International Conference on Machine
Learning, 2018.
[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):
251–276, 1998.
[3] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and
equilibrium in generative adversarial nets (gans). arXiv preprint arXiv:1703.00573, 2017.
[4] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit
acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.
[5] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds
for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
[6] Kavosh Asadi, Dipendra Misra, and Michael L Littman. Lipschitz continuity in model-based
reinforcement learning. arXiv preprint arXiv:1804.07193, 2018.
[7] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of
security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420,
2018.
[8] Simon Baron-Cohen, Alan M Leslie, and Uta Frith. Does the autistic child have a theory of
mind? Cognition, 21(1):37–46, 1985.
[9] Andre Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel
Mankowitz, Augustin Zidek, and Remi Munos. Transfer in deep reinforcement learning us-
ing successor features and generalised policy improvement. In International Conference on
Machine Learning, pages 510–519, 2018.
[10] Marcus K Benna and Stefano Fusi. Computational principles of synaptic memory consolida-
tion. Nature neuroscience, 19(12):1697, 2016.
[11] Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for
near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231,
2002.
[12] Robert Calderbank, Sina Jafarpour, and Robert Schapire. Compressed learning: Universal
sparse dimensionality reduction and learning in the measurement domain. preprint, 2009.
[13] Will Dabney, Georg Ostrovski, David Silver, and Rémi Munos. Implicit quantile networks for
distributional reinforcement learning. arXiv preprint arXiv:1806.06923, 2018.
[14] Peter Dayan. Improving generalization for temporal difference learning: The successor repre-
sentation. Neural Computation, 5(4):613–624, 1993.
[15] Maria Dimakopoulou and Benjamin Van Roy. Coordinated exploration in concurrent rein-
forcement learning. arXiv preprint arXiv:1802.01282, 2018.
[16] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In
Conference on Learning Theory, pages 907–940, 2016.
[17] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly
robust off-policy evaluation. arXiv preprint arXiv:1802.03493, 2018.
[18] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points: Online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
[19] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum.
In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
[20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in
neural information processing systems, pages 2672–2680, 2014.
[21] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimiz-
ing the description length of the weights. In Proceedings of the sixth annual conference on
Computational learning theory, pages 5–13. ACM, 1993.
[22] Malka Rappaport Hovav and Beth Levin. Reflections on manner/result complementarity.
Syntax, lexical semantics, and event structure, pages 21–38, 2010.
[23] Rodrigo Toro Icarte, Toryn Klassen, Richard Valenzano, and Sheila McIlraith. Using reward
machines for high-level task specification and decomposition in reinforcement learning. In
International Conference on Machine Learning, pages 2112–2121, 2018.
[24] Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620,
1957.
[25] Bingyi Kang, Zequn Jie, and Jiashi Feng. Policy optimization with demonstrations. In Inter-
national Conference on Machine Learning, pages 2474–2483, 2018.
[26] Christos Kaplanis, Murray Shanahan, and Claudia Clopath. Continual reinforcement learning
with complex synapses. arXiv preprint arXiv:1802.07239, 2018.
[27] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building
machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
[28] Hoang M Le, Nan Jiang, Alekh Agarwal, Miroslav Dudík, Yisong Yue, and Hal Daumé III. Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.
[29] Lisa Lee, Emilio Parisotto, Devendra Singh Chaplot, Eric Xing, and Ruslan Salakhutdinov.
Gated path planning networks. arXiv preprint arXiv:1806.06408, 2018.
[30] Jan Leike. Nonparametric general reinforcement learning. arXiv preprint arXiv:1611.08944,
2016.
[31] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training
neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
[32] Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning.
In Advances in Neural Information Processing Systems, pages 3288–3298, 2017.
[33] Paavo Parmas, Carl Edward Rasmussen, Jan Peters, and Kenji Doya. Pipps: Flexible model-
based policy search robust to the curse of chaos. In International Conference on Machine
Learning, pages 4062–4071, 2018.
[34] Neil C Rabinowitz, Frank Perbet, H Francis Song, Chiyuan Zhang, SM Eslami, and Matthew
Botvinick. Machine theory of mind. arXiv preprint arXiv:1802.07740, 2018.
[35] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
[36] Samuel Ritter, Jane X Wang, Zeb Kurth-Nelson, Siddhant M Jayakumar, Charles Blundell,
Razvan Pascanu, and Matthew Botvinick. Been there, done that: Meta-learning with episodic
recall. arXiv preprint arXiv:1805.09692, 2018.
[37] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers
against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
[38] Sven Schmit and Ramesh Johari. Learning with abandonment. In International Conference
on Machine Learning, pages 4516–4524, 2018.
[39] Elizabeth S Spelke, Karen Breinlinger, Janet Macomber, and Kristen Jacobson. Origins of
knowledge. Psychological review, 99(4):605, 1992.
[40] Masanori Suganuma, Mete Ozay, and Takayuki Okatani. Exploiting the potential of stan-
dard convolutional autoencoders for image restoration by evolutionary search. arXiv preprint
arXiv:1803.00370, 2018.
[41] Michael Tomasello. Beyond formalities: The case of language acquisition. The Linguistic
Review, 22(2-4):183–197, 2005.
[42] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. Predrnn++:
Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. arXiv
preprint arXiv:1804.06300, 2018.
[43] Nevan Wichers, Ruben Villegas, Dumitru Erhan, and Honglak Lee. Hierarchical long-term
video prediction without supervision. arXiv preprint arXiv:1806.04768, 2018.
[44] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine learning, 8(3-4):229–256, 1992.
[45] Yingce Xia, Xu Tan, Fei Tian, Tao Qin, Nenghai Yu, and Tie-Yan Liu. Model-level dual
learning. In International Conference on Machine Learning, pages 5379–5388, 2018.
[46] Andrea Zanette and Emma Brunskill. Problem dependent reinforcement learning bounds which
can identify bandit structure in mdps. In International Conference on Machine Learning, pages
5732–5740, 2018.
[47] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Under-
standing deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530,
2016.