Deep learning
Yann LeCun1,2, Yoshua Bengio3 & Geoffrey Hinton4,5
1Facebook AI Research, 770 Broadway, New York, New York 10003, USA. 2New York University, 715 Broadway, New York, New York 10003, USA. 3Department of Computer Science and Operations Research, Université de Montréal, Pavillon André-Aisenstadt, PO Box 6128 Centre-Ville STN, Montréal, Quebec H3C 3J7, Canada. 4Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, USA. 5Department of Computer Science, University of Toronto, 6 King's College Road, Toronto, Ontario M5S 3G4, Canada.
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users' interests, and select relevant search results. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition1–4 and speech recognition5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules8, analysing particle accelerator data9,10, reconstructing brain circuits11, and predicting the effects of mutations in non-coding DNA on gene expression and disease12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding14, particularly topic classification, sentiment analysis, question answering15 and language translation16,17.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

Supervised learning
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as 'knobs' that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
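The loop described above can be made concrete in a few lines. The following is a minimal sketch, not the paper's code, assuming a toy linear scoring model and a squared-error objective; the sizes, variable names and learning rate are all illustrative choices of mine.

```python
# A minimal sketch of supervised learning with gradient descent (illustrative only):
# scores = W @ x, objective = squared error to the desired pattern of scores,
# weights adjusted in the opposite direction to the gradient.
import numpy as np

rng = np.random.default_rng(0)
num_categories, num_features = 4, 8          # e.g. house, car, person, pet
W = 0.01 * rng.standard_normal((num_categories, num_features))  # adjustable weights ("knobs")

def objective(W, x, target):
    """Squared error between the score vector and the desired pattern of scores."""
    scores = W @ x
    return 0.5 * np.sum((scores - target) ** 2)

def gradient(W, x, target):
    """Gradient of the objective with respect to every weight."""
    scores = W @ x
    return np.outer(scores - target, x)

x = rng.standard_normal(num_features)        # stand-in for one training example's input
target = np.array([0.0, 1.0, 0.0, 0.0])      # desired pattern: the correct category scores 1

learning_rate = 0.1
for step in range(100):
    W -= learning_rate * gradient(W, x, target)   # move opposite to the gradient

print(objective(W, x, target))                # the error shrinks towards zero
```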
The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other.
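To make SGD concrete, here is a minimal sketch of the procedure for a two-class linear classifier of the kind just described, trained on synthetic data; the logistic loss, the batch size and every name below are my own illustrative choices rather than anything specified in the paper.

```python
# Stochastic gradient descent for a two-class linear classifier (illustrative sketch):
# the classifier computes a weighted sum of feature components and thresholds it at zero.
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 20
true_w = rng.standard_normal(d)
X = rng.standard_normal((n, d))                       # hand-engineered feature vectors
y = (X @ true_w > 0).astype(float)                    # two classes: 0 or 1

w = np.zeros(d)
learning_rate, batch_size = 0.5, 32

for step in range(2000):
    idx = rng.integers(0, n, size=batch_size)         # a small set of examples
    Xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))               # predicted probability of class 1
    grad = Xb.T @ (p - yb) / batch_size               # noisy estimate of the average gradient
    w -= learning_rate * grad                         # adjust the weights accordingly

predictions = (X @ w > 0).astype(float)               # weighted sum above threshold -> class 1
print("training accuracy:", (predictions == y).mean())
```

In practice the loop would stop once the objective, averaged over held-out batches, stops decreasing, and performance would then be measured on a separate test set.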
[Figure 1 graphic, panels a–d: a, a small network (two input units, two sigmoid hidden units, one sigmoid output unit) distorting a regular grid in input space so that two classes become linearly separable; b, the chain rule of derivatives, Δy = (∂y/∂x)Δx and Δz = (∂z/∂y)Δy, so that Δz = (∂z/∂y)(∂y/∂x)Δx and ∂z/∂x = (∂z/∂y)(∂y/∂x); c, the equations of the forward pass; d, the equations of the backward pass. See the caption below.]
Figure 1 | Multilayer neural networks and backpropagation. a, A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). b, The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives — how Δx gets turned into Δz through multiplication by the product of ∂y/∂x and ∂z/∂y. It also works when x, y and z are vectors (and the derivatives are Jacobian matrices). c, The equations used for computing the forward pass in a neural net with two hidden layers and one output layer, each constituting a module through which one can backpropagate gradients. At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(.) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(0, z), commonly used in recent years, as well as the more conventional sigmoids, such as the hyperbolic tangent, f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)), and the logistic function, f(z) = 1/(1 + exp(−z)). d, The equations used for computing the backward pass. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the input by multiplying it by the gradient of f(z). At the output layer, the error derivative with respect to the output of a unit is computed by differentiating the cost function. This gives y_l − t_l if the cost function for unit l is 0.5(y_l − t_l)^2, where t_l is the target value. Once ∂E/∂z_k is known, the error derivative for the weight w_jk on the connection from unit j in the layer below is just y_j ∂E/∂z_k.
[Figure 2 graphic: the layer-by-layer outputs of a convolutional network with max-pooling stages, ending in class scores such as Samoyed (16); Papillon (5.7); Pomeranian (2.7); Arctic fox (1.0); Eskimo dog (0.6); white wolf (0.4); Siberian husky (0.4).]
Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.
A linear classifier, or any other 'shallow' classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal.

To make classifiers more powerful, one can use generic non-linear features, as with kernel methods20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples21. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

Backpropagation to train multilayer architectures
From the earliest days of pattern recognition22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s24–27.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder29,30. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
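The forward and backward passes described above fit in a few lines of code. The following is a minimal sketch, not the paper's implementation, for a network with one ReLU hidden layer, a linear output layer and the squared-error cost of Figure 1d; the sizes, learning rate and variable names are illustrative.

```python
# Forward pass and backpropagation for a tiny fully connected network (illustrative sketch).
# Bias terms are omitted, as in Figure 1.
import numpy as np

rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((16, 8))    # input (8 units) -> hidden (16 units)
W2 = 0.1 * rng.standard_normal((4, 16))    # hidden (16 units) -> output (4 units)

x = rng.standard_normal(8)
t = np.array([0.0, 1.0, 0.0, 0.0])         # target pattern of scores

for step in range(200):
    # Forward pass: z = weighted sum of the layer below, y = f(z) with f = ReLU.
    z1 = W1 @ x
    y1 = np.maximum(0.0, z1)               # ReLU: f(z) = max(0, z)
    y2 = W2 @ y1                           # linear output layer
    E = 0.5 * np.sum((y2 - t) ** 2)

    # Backward pass: propagate error derivatives from the output towards the input.
    dE_dy2 = y2 - t                        # derivative of 0.5 * (y - t)^2
    dE_dz1 = (W2.T @ dE_dy2) * (z1 > 0)    # weighted sum from above, times the gradient of ReLU
    dE_dW2 = np.outer(dE_dy2, y1)          # dE/dw_jk = y_j * dE/dz_k, written in matrix form
    dE_dW1 = np.outer(dE_dz1, x)

    W1 -= 0.05 * dE_dW1                    # gradient descent step
    W2 -= 0.05 * dE_dW2

print("final error:", E)
```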
Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By 'pre-training' several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation33–35. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited36.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program37 and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary38 and was quickly developed to give record-breaking results on a large vocabulary task39. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups6 and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting40, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some 'source' tasks but very few for some 'target' tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet)41,42. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

Convolutional neural networks
ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text, from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience43, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway44. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex45. ConvNets have their roots in the neocognitron46, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words47,48.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition47 and document reading42. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft49. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands50,51, and for face recognition52.

Image understanding with deep convolutional networks
Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition53, the segmentation of biological images54, particularly for connectomics55, and the detection of faces, text, pedestrians and human bodies in natural images36,50,51,56–58. A major recent practical success of ConvNets is face recognition59.
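As a concrete illustration of the two layer types, here is a minimal (and deliberately slow) sketch of a discrete 2D convolution with a shared filter bank, a ReLU non-linearity and 2 × 2 max pooling. The array sizes and names are my own, and a real ConvNet library would implement these operations far more efficiently.

```python
# Convolutional layer (shared filter bank) + ReLU + max pooling, written out naively.
import numpy as np

def conv2d(image, filters):
    """Valid 2D convolution: every unit applies the same small filter at its own location."""
    H, W = image.shape
    n_filters, kh, kw = filters.shape
    out = np.zeros((n_filters, H - kh + 1, W - kw + 1))
    for f in range(n_filters):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[f, i, j] = np.sum(image[i:i + kh, j:j + kw] * filters[f])
    return out

def max_pool(feature_maps, size=2):
    """Each pooling unit takes the maximum over a local patch, shifted by `size` positions."""
    n, H, W = feature_maps.shape
    out = np.zeros((n, H // size, W // size))
    for i in range(H // size):
        for j in range(W // size):
            patch = feature_maps[:, i * size:(i + 1) * size, j * size:(j + 1) * size]
            out[:, i, j] = patch.max(axis=(1, 2))
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))            # a single-channel input array
filters = rng.standard_normal((8, 3, 3))         # 8 shared 3x3 filters (a filter bank)

feature_maps = np.maximum(0.0, conv2d(image, filters))   # convolution + ReLU
pooled = max_pool(feature_maps)                  # coarse-grained, shift-tolerant maps
print(feature_maps.shape, pooled.shape)          # (8, 26, 26) (8, 13, 13)
```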
[Figure 3 graphic: a vision 'Deep CNN' feeding a caption-generating RNN; example generated captions include 'A group of people shopping at an outdoor market.', 'A woman is throwing a frisbee in a park.', 'A dog is standing on a hardwood floor.', 'A stop sign is on a road with a mountain in the background.', 'A little girl sitting on a bed with a teddy bear.', 'A group of people sitting on a boat in the water.' and 'A giraffe standing in a forest with trees in the background.']
Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolutional neural network (CNN) from a test image, with the RNN trained to 'translate' high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found86 that it exploits this to achieve better 'translation' of images into captions.
Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars60,61. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding14 and speech recognition7.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout62, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

Distributed representations and language processing
Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure40. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features)68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage70 (exponential in the depth).
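A tiny illustration of the first advantage (mine, not the paper's): n binary features index 2^n distinct combinations, whereas a purely non-distributed, one-of-N code needs a separate unit for every combination.

```python
# n binary features yield 2**n combinations; a one-of-N code would need 2**n units.
from itertools import product

n = 4
combinations = list(product([0, 1], repeat=n))
print(len(combinations))        # 2**n = 16 patterns from only n = 4 binary features
print(combinations[:3])         # e.g. (0, 0, 0, 0), (0, 0, 0, 1), (0, 0, 1, 0)
```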
The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words71. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated27 in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple 'micro-rules'. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable71. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications14,17,72–76.

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast 'intuitive' inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models71, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).

Recurrent neural networks
When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a 'state vector' that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish77,78.

Thanks to advances in their architecture79,80 and ways of training them81,82, RNNs have been found to be very good at predicting the next character in the text83 or the next word in a sequence75, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English 'encoder' network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French 'decoder' network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen17,72,76. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion84,85.
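A minimal sketch of the recurrent state update described above, with the hidden 'state vector' carrying information about all past elements of the sequence; the tanh recurrence, the dimensions and every name below are my own illustrative choices, and the weights are left untrained.

```python
# One-layer RNN: the state is updated from the previous state and the current input,
# and a probability distribution over the next symbol is read out at each step.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 50, 32
W_xh = 0.1 * rng.standard_normal((hidden_size, vocab_size))   # input -> hidden
W_hh = 0.1 * rng.standard_normal((hidden_size, hidden_size))  # hidden -> hidden (recurrence)
W_hy = 0.1 * rng.standard_normal((vocab_size, hidden_size))   # hidden -> output scores

def step(h_prev, x_onehot):
    """One time step: new state from previous state and current input, plus output probabilities."""
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)
    scores = W_hy @ h
    probs = np.exp(scores - scores.max())
    return h, probs / probs.sum()            # softmax over the next symbol

h = np.zeros(hidden_size)
sequence = [3, 17, 8, 42]                    # token indices of an input sequence
for token in sequence:
    x = np.zeros(vocab_size)
    x[token] = 1.0                           # one-of-N input vector
    h, next_probs = step(h, x)               # h implicitly summarizes the whole history

print(next_probs.argmax())                   # most probable next token under this untrained model
```

Unrolling the loop over time steps and treating each step as one layer of a deep network is what makes backpropagation applicable; in an encoder-decoder setup, the final value of h would serve as the 'thought vector' handed to the decoder.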
Figure 4 | Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm103. On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network75. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation)18,75.
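As a small illustration of the one-of-N coding and the word vectors that Figure 4 visualizes (the vocabulary, sizes and names below are mine, and the embedding matrix is random rather than learned):

```python
# A one-hot vector times an embedding matrix simply selects that word's row,
# i.e. its vector of real-valued features; training would make related words' rows similar.
import numpy as np

vocab = ["tuesday", "wednesday", "sweden", "norway", "frisbee"]
rng = np.random.default_rng(0)
embedding_dim = 8
E = rng.standard_normal((len(vocab), embedding_dim))   # rows would be learned by backpropagation

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0                          # one component is 1, the rest are 0
    return v

word_vector = one_hot("sweden") @ E                     # same as E[vocab.index("sweden")]
assert np.allclose(word_vector, E[vocab.index("sweden")])

# After training on real text, semantically related words (e.g. Sweden and Norway) end up
# with nearby vectors; here the embeddings are random, so this similarity is meaningless
# and shown only to illustrate the computation.
a, b = E[vocab.index("sweden")], E[vocab.index("norway")]
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```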
References
12. Leung, M. K., Xiong, H. Y., Lee, L. J. & Frey, B. J. Deep learning of the tissue-regulated splicing code. Bioinformatics 30, i121–i129 (2014).
13. Xiong, H. Y. et al. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 6218 (2015).
14. Collobert, R. et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011).
15. Bordes, A., Chopra, S. & Weston, J. Question answering with subgraph embeddings. In Proc. Empirical Methods in Natural Language Processing http://arxiv.org/abs/1406.3676v3 (2014).
16. Jean, S., Cho, K., Memisevic, R. & Bengio, Y. On using very large target vocabulary for neural machine translation. In Proc. ACL-IJCNLP http://arxiv.org/abs/1412.2007 (2015).
17. Sutskever, I., Vinyals, O. & Le, Q. V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems 27 3104–3112 (2014).
    This paper showed state-of-the-art machine translation results with the architecture introduced in ref. 72, with a recurrent network trained to read a sentence in one language, produce a semantic representation of its meaning, and generate a translation in another language.
18. Bottou, L. & Bousquet, O. The tradeoffs of large scale learning. In Proc. Advances in Neural Information Processing Systems 20 161–168 (2007).
19. Duda, R. O. & Hart, P. E. Pattern Classification and Scene Analysis (Wiley, 1973).
20. Schölkopf, B. & Smola, A. Learning with Kernels (MIT Press, 2002).
21. Bengio, Y., Delalleau, O. & Le Roux, N. The curse of highly variable functions for local kernel machines. In Proc. Advances in Neural Information Processing Systems 18 107–114 (2005).
22. Selfridge, O. G. Pandemonium: a paradigm for learning in mechanisation of thought processes. In Proc. Symposium on Mechanisation of Thought Processes 513–526 (1958).
23. Rosenblatt, F. The Perceptron — A Perceiving and Recognizing Automaton. Tech. Rep. 85-460-1 (Cornell Aeronautical Laboratory, 1957).
24. Werbos, P. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard Univ. (1974).
25. Parker, D. B. Learning Logic Report TR–47 (MIT Press, 1985).
26. LeCun, Y. Une procédure d'apprentissage pour Réseau à seuil assymétrique in Cognitiva 85: à la Frontière de l'Intelligence Artificielle, des Sciences de la Connaissance et des Neurosciences [in French] 599–604 (1985).
27. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nature 323, 533–536 (1986).
28. Glorot, X., Bordes, A. & Bengio, Y. Deep sparse rectifier neural networks. In Proc. 14th International Conference on Artificial Intelligence and Statistics 315–323 (2011).
    This paper showed that supervised training of very deep neural networks is much faster if the hidden layers are composed of ReLU.
29. Dauphin, Y. et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proc. Advances in Neural Information Processing Systems 27 2933–2941 (2014).
30. Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B. & LeCun, Y. The loss surface of multilayer networks. In Proc. Conference on AI and Statistics http://arxiv.org/abs/1412.0233 (2014).
31. Hinton, G. E. What kind of graphical model is the brain? In Proc. 19th International Joint Conference on Artificial Intelligence 1765–1775 (2005).
32. Hinton, G. E., Osindero, S. & Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comp. 18, 1527–1554 (2006).
    This paper introduced a novel and effective way of training very deep neural networks by pre-training one hidden layer at a time using the unsupervised learning procedure for restricted Boltzmann machines.
33. Bengio, Y., Lamblin, P., Popovici, D. & Larochelle, H. Greedy layer-wise training of deep networks. In Proc. Advances in Neural Information Processing Systems 19 153–160 (2006).
    This report demonstrated that the unsupervised pre-training method introduced in ref. 32 significantly improves performance on test data and generalizes the method to other unsupervised representation-learning techniques, such as auto-encoders.
34. Ranzato, M., Poultney, C., Chopra, S. & LeCun, Y. Efficient learning of sparse representations with an energy-based model. In Proc. Advances in Neural Information Processing Systems 19 1137–1144 (2006).
35. Hinton, G. E. & Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science 313, 504–507 (2006).
36. Sermanet, P., Kavukcuoglu, K., Chintala, S. & LeCun, Y. Pedestrian detection with unsupervised multi-stage feature learning. In Proc. International Conference on Computer Vision and Pattern Recognition http://arxiv.org/abs/1212.0142 (2013).
37. Raina, R., Madhavan, A. & Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In Proc. 26th Annual International Conference on Machine Learning 873–880 (2009).
38. Mohamed, A.-R., Dahl, G. E. & Hinton, G. Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20, 14–22 (2012).
39. Dahl, G. E., Yu, D., Deng, L. & Acero, A. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio Speech Lang. Process. 20, 33–42 (2012).
40. Bengio, Y., Courville, A. & Vincent, P. Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Machine Intell. 35, 1798–1828 (2013).
41. LeCun, Y. et al. Handwritten digit recognition with a back-propagation network. In Proc. Advances in Neural Information Processing Systems 396–404 (1990).
    This is the first paper on convolutional networks trained by backpropagation for the task of classifying low-resolution images of handwritten digits.
42. LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998).
    This overview paper on the principles of end-to-end training of modular systems such as deep neural networks using gradient-based optimization showed how neural networks (and in particular convolutional nets) can be combined with search or inference mechanisms to model complex outputs that are interdependent, such as sequences of characters associated with the content of a document.
43. Hubel, D. H. & Wiesel, T. N. Receptive fields, binocular interaction, and functional architecture in the cat's visual cortex. J. Physiol. 160, 106–154 (1962).
44. Felleman, D. J. & Essen, D. C. V. Distributed hierarchical processing in the primate cerebral cortex. Cereb. Cortex 1, 1–47 (1991).
45. Cadieu, C. F. et al. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comp. Biol. 10, e1003963 (2014).
46. Fukushima, K. & Miyake, S. Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition 15, 455–469 (1982).
47. Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K. & Lang, K. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics Speech Signal Process. 37, 328–339 (1989).
48. Bottou, L., Fogelman-Soulié, F., Blanchet, P. & Lienard, J. Experiments with time delay networks and dynamic time warping for speaker independent isolated digit recognition. In Proc. EuroSpeech 89 537–540 (1989).
49. Simard, D., Steinkraus, P. Y. & Platt, J. C. Best practices for convolutional neural networks. In Proc. Document Analysis and Recognition 958–963 (2003).
50. Vaillant, R., Monrocq, C. & LeCun, Y. Original approach for the localisation of objects in images. In Proc. Vision, Image, and Signal Processing 141, 245–250 (1994).
51. Nowlan, S. & Platt, J. in Neural Information Processing Systems 901–908 (1995).
52. Lawrence, S., Giles, C. L., Tsoi, A. C. & Back, A. D. Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Networks 8, 98–113 (1997).
53. Ciresan, D., Meier, U., Masci, J. & Schmidhuber, J. Multi-column deep neural network for traffic sign classification. Neural Networks 32, 333–338 (2012).
54. Ning, F. et al. Toward automatic phenotyping of developing embryos from videos. IEEE Trans. Image Process. 14, 1360–1371 (2005).
55. Turaga, S. C. et al. Convolutional networks can learn to generate affinity graphs for image segmentation. Neural Comput. 22, 511–538 (2010).
56. Garcia, C. & Delakis, M. Convolutional face finder: a neural architecture for fast and robust face detection. IEEE Trans. Pattern Anal. Machine Intell. 26, 1408–1423 (2004).
57. Osadchy, M., LeCun, Y. & Miller, M. Synergistic face detection and pose estimation with energy-based models. J. Mach. Learn. Res. 8, 1197–1215 (2007).
58. Tompson, J., Goroshin, R. R., Jain, A., LeCun, Y. Y. & Bregler, C. C. Efficient object localization using convolutional networks. In Proc. Conference on Computer Vision and Pattern Recognition http://arxiv.org/abs/1411.4280 (2014).
59. Taigman, Y., Yang, M., Ranzato, M. & Wolf, L. Deepface: closing the gap to human-level performance in face verification. In Proc. Conference on Computer Vision and Pattern Recognition 1701–1708 (2014).
60. Hadsell, R. et al. Learning long-range vision for autonomous off-road driving. J. Field Robot. 26, 120–144 (2009).
61. Farabet, C., Couprie, C., Najman, L. & LeCun, Y. Scene parsing with multiscale feature learning, purity trees, and optimal covers. In Proc. International Conference on Machine Learning http://arxiv.org/abs/1202.2160 (2012).
62. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Machine Learning Res. 15, 1929–1958 (2014).
63. Sermanet, P. et al. Overfeat: integrated recognition, localization and detection using convolutional networks. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1312.6229 (2014).
64. Girshick, R., Donahue, J., Darrell, T. & Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. Conference on Computer Vision and Pattern Recognition 580–587 (2014).
65. Simonyan, K. & Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.1556 (2014).
66. Boser, B., Sackinger, E., Bromley, J., LeCun, Y. & Jackel, L. An analog neural network processor with programmable topology. J. Solid State Circuits 26, 2017–2025 (1991).
67. Farabet, C. et al. Large-scale FPGA-based convolutional networks. In Scaling up Machine Learning: Parallel and Distributed Approaches (eds Bekkerman, R., Bilenko, M. & Langford, J.) 399–419 (Cambridge Univ. Press, 2011).
68. Bengio, Y. Learning Deep Architectures for AI (Now, 2009).
69. Montufar, G. & Morton, J. When does a mixture of products contain a product of mixtures? J. Discrete Math. 29, 321–347 (2014).
70. Montufar, G. F., Pascanu, R., Cho, K. & Bengio, Y. On the number of linear regions of deep neural networks. In Proc. Advances in Neural Information Processing Systems 27 2924–2932 (2014).
71. Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. In Proc. Advances in Neural Information Processing Systems 13 932–938 (2001).
    This paper introduced neural language models, which learn to convert a word symbol into a word vector or word embedding composed of learned semantic features in order to predict the next word in a sequence.
72. Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proc. Conference on Empirical Methods in Natural Language Processing 1724–1734 (2014).
73. Schwenk, H. Continuous space language models. Computer Speech Lang. 21, 492–518 (2007).
74. Socher, R., Lin, C. C-Y., Manning, C. & Ng, A. Y. Parsing natural scenes and natural language with recursive neural networks. In Proc. International Conference on Machine Learning 129–136 (2011).
75. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. & Dean, J. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Information Processing Systems 26 3111–3119 (2013).
76. Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1409.0473 (2015).
77. Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen [in German]. Diploma thesis, T.U. Münich (1991).
78. Bengio, Y., Simard, P. & Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks 5, 157–166 (1994).
79. Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
    This paper introduced LSTM recurrent networks, which have become a crucial ingredient in recent advances with recurrent networks because they are good at learning long-range dependencies.
80. ElHihi, S. & Bengio, Y. Hierarchical recurrent neural networks for long-term dependencies. In Proc. Advances in Neural Information Processing Systems 8 http://papers.nips.cc/paper/1102-hierarchical-recurrent-neural-networks-for-long-term-dependencies (1995).
81. Sutskever, I. Training Recurrent Neural Networks. PhD thesis, Univ. Toronto (2012).
82. Pascanu, R., Mikolov, T. & Bengio, Y. On the difficulty of training recurrent neural networks. In Proc. 30th International Conference on Machine Learning 1310–1318 (2013).
83. Sutskever, I., Martens, J. & Hinton, G. E. Generating text with recurrent neural networks. In Proc. 28th International Conference on Machine Learning 1017–1024 (2011).
84. Lakoff, G. & Johnson, M. Metaphors We Live By (Univ. Chicago Press, 2008).
85. Rogers, T. T. & McClelland, J. L. Semantic Cognition: A Parallel Distributed Processing Approach (MIT Press, 2004).
86. Xu, K. et al. Show, attend and tell: Neural image caption generation with visual attention. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1502.03044 (2015).
87. Graves, A., Mohamed, A.-R. & Hinton, G. Speech recognition with deep recurrent neural networks. In Proc. International Conference on Acoustics, Speech and Signal Processing 6645–6649 (2013).
88. Graves, A., Wayne, G. & Danihelka, I. Neural Turing machines. http://arxiv.org/abs/1410.5401 (2014).
89. Weston, J., Chopra, S. & Bordes, A. Memory networks. http://arxiv.org/abs/1410.3916 (2014).
90. Weston, J., Bordes, A., Chopra, S. & Mikolov, T. Towards AI-complete question answering: a set of prerequisite toy tasks. http://arxiv.org/abs/1502.05698 (2015).
91. Hinton, G. E., Dayan, P., Frey, B. J. & Neal, R. M. The wake-sleep algorithm for unsupervised neural networks. Science 268, 1158–1161 (1995).
92. Salakhutdinov, R. & Hinton, G. Deep Boltzmann machines. In Proc. International Conference on Artificial Intelligence and Statistics 448–455 (2009).
93. Vincent, P., Larochelle, H., Bengio, Y. & Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proc. 25th International Conference on Machine Learning 1096–1103 (2008).
94. Kavukcuoglu, K. et al. Learning convolutional feature hierarchies for visual recognition. In Proc. Advances in Neural Information Processing Systems 23 1090–1098 (2010).
95. Gregor, K. & LeCun, Y. Learning fast approximations of sparse coding. In Proc. International Conference on Machine Learning 399–406 (2010).
96. Ranzato, M., Mnih, V., Susskind, J. M. & Hinton, G. E. Modeling natural images using gated MRFs. IEEE Trans. Pattern Anal. Machine Intell. 35, 2206–2222 (2013).
97. Bengio, Y., Thibodeau-Laufer, E., Alain, G. & Yosinski, J. Deep generative stochastic networks trainable by backprop. In Proc. 31st International Conference on Machine Learning 226–234 (2014).
98. Kingma, D., Rezende, D., Mohamed, S. & Welling, M. Semi-supervised learning with deep generative models. In Proc. Advances in Neural Information Processing Systems 27 3581–3589 (2014).
99. Ba, J., Mnih, V. & Kavukcuoglu, K. Multiple object recognition with visual attention. In Proc. International Conference on Learning Representations http://arxiv.org/abs/1412.7755 (2014).
100. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
101. Bottou, L. From machine learning to machine reasoning. Mach. Learn. 94, 133–149 (2014).
102. Vinyals, O., Toshev, A., Bengio, S. & Erhan, D. Show and tell: a neural image caption generator. In Proc. International Conference on Machine Learning http://arxiv.org/abs/1502.03044 (2014).
103. van der Maaten, L. & Hinton, G. E. Visualizing data using t-SNE. J. Mach. Learn. Research 9, 2579–2605 (2008).

Acknowledgements The authors would like to thank the Natural Sciences and Engineering Research Council of Canada, the Canadian Institute For Advanced Research (CIFAR), the National Science Foundation and Office of Naval Research for support. Y.L. and Y.B. are CIFAR fellows.

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of this paper at go.nature.com/7cjbaa. Correspondence should be addressed to Y.L. ([email protected]).