Introduction to Artificial Neural Networks
Birds inspired us to fly, burdock plants inspired velcro, and nature has
inspired many other inventions. It seems only logical, then, to look at
the brain’s architecture for inspiration on how to build an intelligent
machine. This is the key idea that inspired artificial neural networks
(ANNs). However, although planes were inspired by birds, they don’t
have to flap their wings. Similarly, ANNs have gradually become quite
different from their biological cousins. Some researchers even argue
that we should drop the biological analogy altogether (e.g., by saying
“units” rather than “neurons”), lest we restrict our creativity to
biologically plausible systems.
ANNs are at the very core of Deep Learning. They are versatile,
powerful, and scalable, making them ideal to tackle large and highly
complex Machine Learning tasks, such as classifying billions of
images (e.g., Google Images), powering speech recognition services
(e.g., Apple’s Siri), recommending the best videos to watch to
hundreds of millions of users every day (e.g., YouTube), or learning to
beat the world champion at the game of Go by examining millions of
past games and then playing against itself (DeepMind’s AlphaGo).
Surprisingly, ANNs have been around for quite a while: they were first
introduced back in 1943 by the neurophysiologist Warren McCulloch
and the mathematician Walter Pitts.[2] In their landmark paper, “A Logical Calculus of Ideas
Immanent in Nervous Activity,” McCulloch and Pitts presented a
simplified computational model of how biological neurons might work
together in animal brains to perform complex computations using
propositional logic. This was the first artificial neural network
architecture. Since then many other architectures have been invented, as we will see.
The early successes of ANNs until the 1960s led to the widespread
belief that we would soon be conversing with truly intelligent
machines. When it became clear that this promise would go unfulfilled
(at least for quite a while), funding flew elsewhere and ANNs entered
a long dark era. In the early 1980s there was a revival of interest in
ANNs as new network architectures were invented and better training
techniques were developed. But by the 1990s, powerful alternative
Machine Learning techniques such as Support Vector Machines were
favored by most researchers, as they seemed to offer better results
and stronger theoretical foundations. Finally, we are now witnessing
yet another wave of interest in ANNs. Will this wave die out like the
previous ones did? There are a few good reasons to believe that this
one is different and will have a much more profound impact on our
lives:
Figure 1-1. Biological neuron[3]

Figure 1-2. Multiple layers in a biological neural network[5]

Logical Computations with Neurons
The Perceptron
A Perceptron[6] is simply composed of a single layer of LTUs, with each neuron connected to all the inputs. These connections are often represented using special passthrough neurons called input neurons: they just output whatever input they are fed. Moreover, an extra bias feature is generally added (x_0 = 1). This bias feature is typically represented using a special type of neuron called a bias neuron, which just outputs 1 all the time.
The Perceptron learning rule updates the connection weights as follows:

w_{i,j}(next step) = w_{i,j} + η (y_j − ŷ_j) x_i

where:

w_{i,j} is the connection weight between the i-th input neuron and the j-th output neuron.

x_i is the i-th input value of the current training instance.

ŷ_j is the output of the j-th output neuron for the current training instance.

y_j is the target output of the j-th output neuron for the current training instance.

η is the learning rate.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]                  # petal length, petal width
y = (iris.target == 0).astype(np.int)     # Iris Setosa?

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)
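Once trained, the classifier can be used like any other Scikit-Learn estimator; a quick usage sketch (the petal measurements are made-up values for illustration):

y_pred = per_clf.predict([[2, 0.5]])   # predict for a flower with petal length 2 cm, width 0.5 cm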
For many years researchers struggled to train multi-layer Perceptrons (MLPs) without success, until the backpropagation training algorithm was introduced in 1986.[8]
For each training instance, the algorithm feeds it to the network and
computes the output of every neuron in each consecutive layer (this is
the forward pass, just like when making predictions). Then it measures
the network’s output error (i.e., the difference between the desired
output and the actual output of the network), and it computes how
much each neuron in the last hidden layer contributed to each output
neuron’s error. It then proceeds to measure how much of these error
contributions came from each neuron in the previous hidden layer—
and so on until the algorithm reaches the input layer. This reverse
pass efficiently measures the error gradient across all the connection
weights in the network by propagating the error gradient backward in
the network (hence the name of the algorithm). If you check out the
reverse-mode autodiff algorithm, you will find that the forward and
reverse passes of backpropagation simply perform reverse-mode
autodiff. The last step of the backpropagation algorithm is a Gradient
Descent step on all the connection weights in the network, using the
error gradients measured earlier.
Let’s make this even shorter: for each training instance the backpropagation algorithm first makes a prediction (forward pass),
measures the error, then goes through each layer in reverse to
measure the error contribution from each connection (reverse pass),
and finally slightly tweaks the connection weights to reduce the error
(Gradient Descent step).
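To make these steps concrete, here is a minimal NumPy sketch of one backpropagation training loop for a tiny MLP with a single hidden layer; the sigmoid activation (introduced in the next paragraph), the layer sizes, the squared-error loss, the learning rate, and the random toy data are all illustrative assumptions:

import numpy as np

rng = np.random.RandomState(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.rand(4, 3)          # 4 toy instances with 3 features (made up)
y = rng.rand(4, 1)          # toy targets

W1, b1 = 0.1 * rng.randn(3, 5), np.zeros((1, 5))   # hidden layer parameters
W2, b2 = 0.1 * rng.randn(5, 1), np.zeros((1, 1))   # output layer parameters
eta = 0.1                   # learning rate

for step in range(1000):
    # Forward pass: compute every layer's output.
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)

    # Measure the output error (derivative of 0.5 * squared error).
    error = y_hat - y

    # Reverse pass: propagate the error gradient backward, layer by layer.
    grad_z2 = error * y_hat * (1 - y_hat)
    grad_W2, grad_b2 = a1.T @ grad_z2, grad_z2.sum(axis=0, keepdims=True)
    grad_z1 = (grad_z2 @ W2.T) * a1 * (1 - a1)
    grad_W1, grad_b1 = X.T @ grad_z1, grad_z1.sum(axis=0, keepdims=True)

    # Gradient Descent step on all connection weights and biases.
    W1 -= eta * grad_W1; b1 -= eta * grad_b1
    W2 -= eta * grad_W2; b2 -= eta * grad_b2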
In order for this algorithm to work properly, the authors made a key
change to the MLP’s architecture: they replaced the step function with
the logistic function, σ(z) = 1 / (1 + exp(–z)). This was essential
because the step function contains only flat segments, so there is no
gradient to work with (Gradient Descent cannot move on a flat
surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step. The backpropagation algorithm may be used with other activation functions, instead of the logistic function. Two other popular activation functions are:

The hyperbolic tangent function, tanh(z) = 2σ(2z) − 1. Like the logistic function it is S-shaped, continuous, and differentiable, but its output ranges from −1 to 1 instead of 0 to 1.

The ReLU function, ReLU(z) = max(0, z). It is continuous but not differentiable at z = 0; in practice it works very well and is fast to compute.
Figure 1-9. A modern MLP (including ReLU and softmax) for classification
Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time. But it turns out that the ReLU activation function generally works better in ANNs. This is one of the cases where the biological analogy was misleading.
The simplest way to train an MLP with TensorFlow is to use the high-level API TF.Learn, which offers a Scikit-Learn–compatible API. The DNNClassifier class makes it fairly easy to train a deep neural network with any number of hidden layers, and a softmax output layer to output estimated class probabilities. For example, the following code trains a DNN for classification with two hidden layers (one with 300 neurons, and the other with 100 neurons) and a softmax output layer with 10 neurons:
import tensorflow as tf

feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                         feature_columns=feature_cols)
dnn_clf = tf.contrib.learn.SKCompat(dnn_clf)  # if TensorFlow >= 1.1
dnn_clf.fit(X_train, y_train, batch_size=50, steps=40000)
The code first creates a set of real-valued columns from the training set (other types of columns, such as categorical columns, are available). Then we create the DNNClassifier, and we wrap it in a Scikit-Learn compatibility helper. Finally, we run 40,000 training iterations using batches of 50 instances.
If you run this code on the MNIST dataset (after scaling it, e.g., by using Scikit-Learn’s StandardScaler), you will actually get a model that achieves around 98.2% accuracy on the test set:
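The evaluation code that produces this figure is not shown in this extract; here is a sketch, assuming X_test and y_test hold the (scaled) MNIST test set and that the SKCompat wrapper’s predict() returns a dict with a "classes" key, as it did in TensorFlow 1.x:

from sklearn.metrics import accuracy_score

y_pred = dnn_clf.predict(X_test)
print(accuracy_score(y_test, y_pred["classes"]))   # roughly 0.982, as mentioned above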
Under the hood, the DNNClassifier class creates all the neuron
layers, based on the ReLU activation function (we can change this by
setting the activation_fn hyperparameter). The output layer relies
on the softmax function, and the cost function is cross entropy.
If you want more control over the architecture of the network, you may prefer to use TensorFlow’s lower-level Python API. In this section we will build the same model as before using this API, and we will implement Mini-batch Gradient Descent to train it on the MNIST dataset. The first step is the construction phase, building the TensorFlow graph. The second step is the execution phase, where you actually run the graph to train the model.
Construction Phase
import tensorflow as tf
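The layer-size constants used in the rest of the construction phase do not appear in this extract; based on the 28 × 28 input features and the 300/100/10 architecture described earlier, they would be defined as follows:

n_inputs = 28 * 28   # one feature per MNIST pixel
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10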
Next, you can use placeholder nodes to represent the training data
and targets. The shape of X is only partially defined. We know that it
will be a 2D tensor (i.e., a matrix), with instances along the first
dimension and features along the second dimension, and we know
that the number of features is going to be 28 x 28 (one feature per
pixel), but we don’t know yet how many instances each training batch
will contain. So the shape of X is (None, n_inputs) . Similarly,
we know that y will be a 1D tensor with one entry per instance, but
again we don’t know the size of the training batch at this point, so
the shape is (None) .
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")
Now let’s create the actual neural network. The placeholder X will act as the input layer; during the execution phase, it will be replaced with one training batch at a time (note that all the instances in a training batch will be processed simultaneously by the neural network). Now you need to create the two hidden layers and the output layer. The two hidden layers are almost identical: they differ only by the inputs they are connected to and by the number of neurons they contain. The output layer is also very similar, but it uses a softmax activation function instead of a ReLU activation function. So let’s create a neuron_layer() function that we will use to create one layer at a time. It will need parameters to specify the inputs, the number of neurons, the activation function, and the name of the layer:
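The body of neuron_layer() is not reproduced in this extract; here is a sketch consistent with the numbered explanation that follows (the 2/√n_inputs standard deviation is an assumption):

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):                        # 1. name scope for TensorBoard clarity
        n_inputs = int(X.get_shape()[1])             # 2. number of inputs (second dimension)
        stddev = 2 / (n_inputs ** 0.5)               # assumed standard deviation
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")         # 3. the layer's weight matrix (kernel)
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        return Z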
1. First we create a name scope using the name of the layer: it will
contain all the computation nodes for this neuron layer. This is
optional, but the graph will look much nicer in TensorBoard if its
nodes are well organized.
2. Next, we get the number of inputs by looking up the input matrix’s
shape and getting the size of the second dimension (the first
dimension is for instances).
3. The next three lines create a W variable that will hold the weights matrix (often called the layer’s kernel). It will be a 2D tensor containing all the connection weights between each input and each neuron; hence, its shape will be (n_inputs, n_neurons). It will be initialized randomly, using a truncated normal (Gaussian) distribution[10] with a standard deviation of 2/√n_inputs.
Okay, so now you have a nice function to create a neuron layer. Let’s use it to create the deep neural network! The first hidden layer takes X as its input. The second takes the output of the first hidden layer as its input. And finally, the output layer takes the output of the second hidden layer as its input.
with tf.name_scope("dnn"):
hidden1 = neuron_layer(X, n_hidden1,
name="hidden1", activation=tf.nn.relu)
hidden2 = neuron_layer(hidden1, n_hidden2,
name="hidden2", activation=tf.nn.relu)
logits = neuron_layer(hidden2, n_outputs, name="outputs")
Notice that once again we used a name scope for clarity. Also note
that logits is the output of the neural network before going through
the softmax activation function: for optimization reasons, we will
handle the softmax computation later.
As you might expect, TensorFlow comes with many handy
functions to create standard neural network layers, so there’s
often no need to define your own neuron_layer() function
like we just did. For example, TensorFlow’s
tf.layers.dense() function (previously called
tf.contrib.layers.fully_connected() ) creates a fully
connected layer, where all the inputs are connected to all the
neurons in the layer. It takes care of creating the weights and biases
variables, named kernel and bias respectively, using the appropriate initialization strategy, and you can set the activation function using the activation argument. As we will see in
Chapter 2, it also supports regularization parameters. Let’s tweak the
preceding code to use the
dense() function instead of our neuron_layer() function.
Simply replace the dnn construction section with the following code:
with tf.name_scope("dnn"):
hidden1 = tf.layers.dense(X, n_hidden1,
name="hidden1", activation=tf.nn.relu)
hidden2 = tf.layers.dense(hidden1, n_hidden2,
name="hidden2", activation=tf.nn.relu)
logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
Now that we have the neural network model ready to go, we need to define the cost function that we will use to train it. We will use cross entropy. As we discussed earlier, cross entropy will penalize models that estimate a low probability for the target class. TensorFlow provides several functions to compute cross entropy. We will use sparse_softmax_cross_entropy_with_logits(): it computes the cross entropy based on the “logits” (i.e., the output of the network before going through the softmax activation function), and it expects labels in the form of integers ranging from 0 to the number of classes minus 1 (in our case, from 0 to 9). This will give us a 1D tensor containing the cross entropy for each instance. We can then use TensorFlow’s reduce_mean() function to compute the mean cross entropy over all instances.
with tf.name_scope("loss"):
xentropy =
tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits loss = tf.reduce_mean(xentropy, name="loss")
The sparse_softmax_cross_entropy_with_logits() function is equivalent to applying the softmax activation function and then computing the cross entropy, but it is more efficient, and it properly takes care of corner cases: when logits are large, floating-point rounding errors may cause the softmax output to be exactly equal to 0 or 1, and in this case the cross entropy equation would contain a log(0) term, equal to negative infinity. The sparse_softmax_cross_entropy_with_logits() function solves this problem by computing log(ε) instead, where ε is a tiny positive number. This is why we did not apply the softmax activation function earlier. There is also another function called softmax_cross_entropy_with_logits(), which takes labels in the form of one-hot vectors (instead of ints from 0 to the number of classes minus 1).
We have the neural network model, we have the cost function, and
now we need to define a GradientDescentOptimizer that will
tweak the model parameters to minimize the cost function:
learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
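To actually run this graph and restore it later, the construction phase also needs an initializer node and a Saver; a minimal sketch (the execution and prediction sketches below assume these two nodes exist):

init = tf.global_variables_initializer()
saver = tf.train.Saver()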
Execution Phase
This part is much shorter and simpler. First, let’s load MNIST. We could use Scikit-Learn for that, but TensorFlow offers its own helper that fetches the data, scales it (between 0 and 1), shuffles it, and provides a simple function to load one mini-batch at a time. Moreover, the data is already split into a training set (55,000 instances), a validation set (5,000 instances), and a test set (10,000 instances). So let’s use this helper:
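The helper call itself is not shown in this extract; in TensorFlow 1.x it would look something like this (the data directory is an arbitrary choice):

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")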
n_epochs = 40
batch_size = 50
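The training loop itself is also missing here; a sketch of what the execution phase could look like, assuming the init, saver, and mnist objects from the sketches above (the checkpoint filename is an arbitrary choice):

with tf.Session() as sess:
    init.run()                                             # initialize all variables
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        # Evaluate on the last batch and on the validation set after each epoch.
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_val = accuracy.eval(feed_dict={X: mnist.validation.images,
                                           y: mnist.validation.labels})
        print(epoch, "Train accuracy:", acc_train, "Validation accuracy:", acc_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")  # checkpoint name is arbitrary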
Now that the neural network is trained, you can use it to make
predictions. To do that, you can reuse the same construction phase,
but change the execution phase like this:
First the code loads the model parameters from disk. Then it loads some new images that you want to classify. Remember to apply the same feature scaling as for the training data (in this case, scale it from 0 to 1). Then the code evaluates the
logits node. If you wanted to know all the estimated class
probabilities, you would need to apply the softmax() function to the
logits, but if you just want to predict a class, you can simply pick the
class that has the highest logit value (using the argmax() function
does the trick).
Of course, you can use grid search with cross-validation to find the right hyperparameters, but since there are many hyperparameters to tune, and since training a neural network on a large dataset takes a lot of time, you will only be able to explore a tiny part of the hyperparameter space in a reasonable amount of time. It is much better to use randomized search. Another option is to use a tool such as Oscar, which implements more complex algorithms to help you find a good set of hyperparameters quickly.
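As a rough illustration of randomized search, here is a minimal sketch; the parameter names, the ranges, and the build_and_evaluate() helper are all hypothetical stand-ins for however you build and score a model (e.g., the DNNClassifier code above):

import numpy as np

rng = np.random.RandomState(42)
best_score, best_params = float("-inf"), None

for trial in range(20):
    # Sample a random combination of hyperparameters (illustrative ranges).
    params = {
        "n_hidden1": int(rng.choice([100, 200, 300])),
        "n_hidden2": int(rng.choice([50, 100, 150])),
        "learning_rate": float(10 ** rng.uniform(-4, -1)),
    }
    score = build_and_evaluate(**params)  # hypothetical helper: train a model, return validation accuracy
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)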
For many problems, you can just begin with a single hidden layer and
you will get reasonable results. It has actually been shown that an
MLP with just one hidden layer can model even the most complex
functions provided it has enough neurons. For a long time, these
facts convinced researchers that there was no need to investigate any
deeper neural networks. But they overlooked the fact that deep
networks have a much higher parameter efficiency than shallow ones:
they can model complex functions using exponentially fewer neurons
than shallow nets, making them much faster to train.
In summary, for many problems you can start with just one or two
hidden layers and it will work just fine (e.g., you can easily reach
above 97% accuracy on the MNIST dataset using just one hidden
layer with a few hundred neurons, and above 98% accuracy using
two hidden layers with the same total amount of neurons, in roughly the same amount of training time). For more complex problems, you can gradually ramp up the number of hidden layers, until you start overfitting the training set. Very complex tasks, such as large image
classification or speech recognition, typically require networks with
dozens of layers (or even hundreds, but not fully connected ones, as
we will see in Chapter 3), and they need a huge amount of training
data. However, you will rarely have to train such networks from
scratch: it is much more common to reuse parts of a pretrained state-
of-the-art network that performs a similar task. Training will be a lot
faster and require much less data (we will discuss this in Chapter 2).
Activation Functions
In most cases you can use the ReLU activation function in the
hidden layers (or one of its variants, as we will see in Chapter 2). It
is a bit faster to compute than other activation functions, and
Gradient Descent does not get stuck as much on plateaus, thanks to
the fact that it does not saturate for large input values (as opposed to the logistic function or the hyperbolic tangent function, which saturate at 1).
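A tiny numerical illustration of this saturation effect (the input values are arbitrary):

import numpy as np

z = np.array([-10.0, 0.0, 10.0])
sigmoid = 1 / (1 + np.exp(-z))
sigmoid_grad = sigmoid * (1 - sigmoid)   # ~4.5e-05 at |z| = 10: Gradient Descent barely moves
relu_grad = (z > 0).astype(float)        # stays at 1 for large positive inputs

print(sigmoid_grad)   # approx. [4.5e-05, 0.25, 4.5e-05]
print(relu_grad)      # [0. 0. 1.]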
Exercises
1. Draw an ANN using the original artificial neurons (like the ones in
Figure 1-3) that computes A ⊕ B (where ⊕ represents the
XOR operation). Hint: A ⊕ B = (A ∧ ¬ B) ∨ (¬ A ∧ B).
2. Why is it generally preferable to use a Logistic Regression
classifier rather than a classical Perceptron (i.e., a single layer of
linear threshold units trained
using the Perceptron training algorithm)? How can you tweak a
Perceptron to make it equivalent to a Logistic Regression
classifier?
3. Why was the logistic activation function a key ingredient in
training the first MLPs?
4. Name three popular activation functions. Can you draw them?
5. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.
1. What is the shape of the input matrix X?
2. What is the shape of the hidden layer’s weight vector W_h, and the shape of its bias vector b_h?
3. What is the shape of the output layer’s weight vector W_o, and of its bias vector b_o?
6. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function?
7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?
8. Can you list all the hyperparameters you can tweak in an MLP? If
the MLP overfits the training data, how could you tweak these
hyperparameters to try to solve the problem?
9. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).
[1] You can get the best of both worlds by being open to biological inspirations without being afraid to create biologically unrealistic models, as long as they work well.

[2] “A Logical Calculus of Ideas Immanent in Nervous Activity,” W. McCulloch and W. Pitts (1943).

[3] Image by Bruce Blaus (Creative Commons 3.0). Reproduced from https://en.wikipedia.org/wiki/Neuron.

[4] In the context of Machine Learning, the phrase “neural networks” generally refers to ANNs, not BNNs.

[5] Drawing of a cortical lamination by S. Ramon y Cajal (public domain). Reproduced from https://en.wikipedia.org/wiki/Cerebral_cortex.

[6] The name Perceptron is sometimes used to mean a tiny network with a single LTU.

[7] Note that this solution is generally not unique: in general when the data are linearly separable, there is an infinity of hyperplanes that can separate them.

[8] “Learning Internal Representations by Error Propagation,” D. Rumelhart, G. Hinton, R. Williams (1986).

[9] This algorithm was actually invented several times by various researchers in different fields, starting with P. Werbos in 1974.

[10] Using a truncated normal distribution rather than a regular normal distribution ensures that there won’t be any large weights, which could slow down training.

[11] For example, if you set all the weights to 0, then all neurons will output 0, and the error gradient will be the same for all neurons in a given hidden layer. The Gradient Descent step will then update all the weights in exactly the same way in each layer, so they will all remain equal. In other words, despite having hundreds of neurons per layer, your model will act as if there were only one neuron per layer. It is not going to fly.

[12] By Vincent Vanhoucke in his Deep Learning class on Udacity.com.