
Chapter 10: Introduction to Artificial

Neural Networks with Keras

Tsz-Chiu Au
[email protected]

Ulsan National Institute of Science and Technology (UNIST)


South Korea
Introduction to ANNs
• Artificial neural networks (ANNs) are machine learning
models inspired by the networks of biological neurons found
in our brains.
• ANNs are at the very core of Deep Learning, being used in
» Google Images
» Apple’s Siri
» YouTube
» DeepMind’s AlphaGo
• We will discuss what ANNs are and how to implement them in Keras.
From Biological to Artificial Neurons
• ANNs were first introduced back in 1943 by the neurophysiologist
Warren McCulloch and the mathematician Walter Pitts.
» A simplified computational model of how biological neurons might work together
in animal brains to perform complex computations using propositional logic.
• The early successes of ANNs led to the widespread belief that we
would soon be conversing with truly intelligent machines.
• Sadly, this promise went unfulfilled, triggering the first AI winter in
the 1970s.
• In the mid-1980s, new architectures such as Multilayer Perceptrons
(MLPs) and better training techniques such as the backpropagation
algorithm revived interest in connectionism (i.e., the study of
neural networks).
• However, by the 1990s, other powerful Machine Learning
techniques such as Support Vector Machines and Random Forests
overtook ANNs.
From Biological to Artificial Neurons (cont.)
• Since the early 2010s, thanks to the success of deep learning in computer
vision, there has been a huge wave of interest in ANNs.
• Reasons for this AI spring:
» There is now a huge quantity of data available to train neural networks
§ ANNs frequently outperform other ML techniques on very large and complex problems.
» The tremendous increase in computing power since the 1990s now makes
it possible to train large neural networks in a reasonable amount of time.
§ The availability of powerful GPU cards and cloud computing platforms.
» The training algorithms have been improved.
§ We can train very large networks now.
» Some theoretical limitations of ANNs have turned out to be benign in
practice.
» ANNs seem to have entered a virtuous circle of funding and progress.
A Biological Neuron
Multiple layers in a biological neural
network (human cortex)
Logical Computations with Neurons
• McCulloch and Pitts proposed a very simple model of the
biological neuron, which later became known as an artificial
neuron: it has one or more binary (on/off) inputs and one
binary output.
» Even with such a simplified model, it is possible to build a network of
artificial neurons that computes any logical proposition you want.
• These networks can be combined to compute complex logical expressions.
The Perceptron
• The Perceptron is one of the simplest ANN architectures,
invented in 1957 by Frank Rosenblatt.
» Based on a slightly different artificial neuron called a threshold logic
unit (TLU), or sometimes a linear threshold unit (LTU).
§ It computes a weighted sum of its inputs, then applies a step function.
The Perceptron (cont.)
• A single TLU can be used for simple linear binary
classification.
» It computes a linear combination of the inputs, and if the
result exceeds a threshold, it outputs the positive class.
• A Perceptron is simply composed of a single layer of
TLUs.
» When all the neurons in a layer are connected to every
neuron in the previous layer (i.e., its input neurons), the
layer is called a fully connected layer, or a dense layer.
The Perceptron (cont.)
• Architecture of a Perceptron with two input neurons, one bias
neuron, and three output neurons:
How is a Perceptron trained?
• The Perceptron training algorithm proposed by Rosenblatt was
largely inspired by Hebb’s rule.
» when a biological neuron triggers another neuron often, the connection
between these two neurons grows stronger.
» “Cells that fire together, wire together”
• Perceptron learning rule (sketched below):
» For every output neuron that produced a wrong prediction, it reinforces
the connection weights from the inputs that would have contributed to
the correct prediction.
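As a reference point (the slide's equation is not reproduced), the update rule commonly stated for the Perceptron is, with η the learning rate:

```latex
% Perceptron learning rule (sketch): weight connecting input i to output neuron j
% x_i: i-th input value, \hat{y}_j: neuron output, y_j: target output, \eta: learning rate
w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \,\bigl(y_j - \hat{y}_j\bigr)\, x_i
```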

• Perceptron convergence theorem:
» If the training instances are linearly separable, Rosenblatt demonstrated
that this algorithm would converge to a solution.
Perceptron in Scikit-Learn
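The code on this slide is not reproduced; a minimal sketch using Scikit-Learn's Perceptron class on the iris dataset (the choice of two petal features and the "is it Iris setosa?" target are illustrative) could look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]                  # petal length, petal width
y = (iris.target == 0).astype(np.int64)   # 1 if Iris setosa, else 0

per_clf = Perceptron()                    # a single-TLU linear classifier
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])      # predict the class of a new flower
```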
Limitations of Perceptrons
• In their 1969 monograph Perceptrons, Marvin Minsky and
Seymour Papert highlighted a number of serious weaknesses
of Perceptrons
» Exclusive OR (XOR) classification problem
• But some of the limitations can be eliminated by Multilayer
Perceptron (MLP)
Architecture of a Multilayer Perceptron
Backpropagation Algorithm
• The field of Deep Learning studies deep neural networks
(DNNs), i.e., ANNs containing a deep stack of hidden layers, and
more generally models containing deep stacks of computations.
• For many years researchers struggled to find a way to train
MLPs, without success.
• In 1986, David Rumelhart, Geoffrey Hinton, and Ronald
Williams introduced the backpropagation training
algorithm.
» Two passes through the network (one forward, one backward)
» Compute the gradient of the network’s error with regard to
every single model parameter.
» Adjust the parameters by using gradient descent.
Backpropagation Algorithm (cont.)
• The algorithm handles one mini-batch at a time (e.g., each containing 32
instances from the training set).
• It goes through the full training set multiple times. Each pass is called an
epoch.
• Forward pass: Each mini-batch is passed from the network’s input layer to
the output layer through the hidden layers. All intermediate results are
preserved.
• Measures the network’s output error by a loss function that compares the
desired output and the actual output of the network.
• Computes how much each output connection contributed to the error by
the chain rule.
• Reverse pass: Measures how much of these error contributions came
from each connection in the layer below, again using the chain rule,
working backward until the algorithm reaches the input layer.
• Performs a Gradient Descent step to tweak all the connection weights in
the network, using the error gradients it just computed.
Activation Functions
• The backpropagation algorithm cannot work with the step function,
which provides no gradient for Gradient Descent. Common alternatives
(sketched after this list):
• Logistic (sigmoid) function: σ(z) = 1 / (1 + exp(–z)).
• Hyperbolic tangent function: tanh(z) = 2σ(2z) – 1
• Rectified Linear Unit function: ReLU(z) = max(0, z)
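As an illustration (not from the slides), the three activation functions above can be written directly with NumPy:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: squashes z into the (0, 1) range
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # Hyperbolic tangent: equal to 2*sigmoid(2z) - 1, output in (-1, 1)
    return np.tanh(z)

def relu(z):
    # Rectified Linear Unit: 0 for negative inputs, identity otherwise
    return np.maximum(0, z)
```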
Output Neurons for Regression MLPs
• For regression tasks, an ANN predicts a single value only. Thus,
one output neuron is sufficient.
• For multivariate regression, there is one output neuron per
output dimension.
• In general, output neurons use no activation function, so they are free to output any range of values. To constrain the outputs, you can use:
» ReLU function: ReLU(z) = max(0, z)
» Softplus function: softplus(z) = log(1 + exp(z))
» logistic function or hyperbolic tangent with scaling factors.
• Loss functions:
» Mean Squared Error
» Mean Absolute Error
» Huber Loss
Typical regression MLP architecture
Output Neurons for Classification MLPs
• For binary classification tasks, there is one output neuron with
the logistic activation function
» The output can be interpreted as the estimated probability of the
positive class.
• For multilabel binary classification tasks, you need multiple
output neurons.
• For multiclass classification tasks, there is one output neuron
per class, and the softmax activation function should be used
for the whole output layer.
Output Neurons for Classification MLPs (cont.)
• The predicted class is the one with the highest estimated probability.
• Loss function: cross-entropy loss (a.k.a. the log loss), sketched below:
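The equation shown on the slide is not reproduced; the cross-entropy loss is commonly written as follows, for m instances and K classes, where y_k^(i) is 1 if instance i belongs to class k (else 0) and p̂_k^(i) is the predicted probability:

```latex
J(\boldsymbol{\Theta}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
    y_k^{(i)} \log\!\left(\hat{p}_k^{(i)}\right)
```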
A modern MLP for classification
Typical classification MLP architecture
Implementing MLPs with Keras
• Keras is a high-level Deep Learning API that
allows you to easily build, train, evaluate, and
execute all sorts of neural networks.
» https://keras.io
» Computation backend: TensorFlow, Microsoft
Cognitive Toolkit (CNTK), Theano, Apache MXNet,
Apple’s Core ML, JavaScript or TypeScript, and
PlaidML.
• tf.keras: Extended Keras implementation based
on TensorFlow with TensorFlow-specific features.
• PyTorch is also quite popular.
Multibackend Keras vs. tf.keras
Installing TensorFlow 2
• If you are using Google Colab only, you can skip this step.
• If you plan to run your code on your own computer, please install Jupyter,
Scikit-Learn, etc. according to the instructions in Chapter 2.
• Activate the virtual environment and then use pip to install TensorFlow 2:

• Open a Python shell or a Jupyter notebook and print the versions of
TensorFlow and tf.keras (see the sketch below).
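The commands on the slide are not reproduced; a minimal sketch, assuming TensorFlow 2 was installed in the activated environment with pip, is:

```python
# Install first, inside the activated environment, e.g.:
#   python3 -m pip install --upgrade tensorflow
import tensorflow as tf
from tensorflow import keras

print(tf.__version__)       # e.g. "2.x.y"
print(keras.__version__)    # version of the bundled tf.keras implementation
```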
Fashion MNIST Dataset
• 70,000 grayscale images of 28 × 28 pixels each, with 10 classes

• Drop-in replacement of MNIST in Chapter 2.


» But the images represent fashion items rather than handwritten digits.
» More challenging than MNIST: a simple linear model reaches about 92%
accuracy on MNIST, but only about 83% on Fashion MNIST.
Using Keras to Load the Dataset
• Keras provides some utility functions to fetch and load common datasets.

• Loading data from Keras is different from Scikit-Learn:


» Every image is represented as a 28 × 28 array rather than a 1D array of size 784.

» The pixel intensities are represented as integers (from 0 to 255) rather than floats (from
0.0 to 255.0)
• Since we are going to train the neural network using Gradient Descent, we
must scale the input features down to the 0–1 range by dividing them by
255.0 (see the sketch below):
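The loading code on the slide is not reproduced; a sketch using the standard tf.keras dataset utility (the 5,000-instance validation split is an illustrative choice):

```python
from tensorflow import keras

# Fetch Fashion MNIST via Keras (downloaded and cached on first use)
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# X_train_full has shape (60000, 28, 28) with uint8 pixel values in 0-255.
# Hold out a validation set and scale pixel intensities to the 0-1 range.
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0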
Naming the Labels
• Unlike MNIST, Fashion MNIST needs a list of class names for the labels
to know what each image represents:

• For example, the first image in the training set represents a coat:
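The code on the slide is not reproduced; a sketch using the standard Fashion MNIST class names:

```python
# Human-readable names for the 10 Fashion MNIST labels (standard ordering)
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

class_names[y_train[0]]   # name of the first training image's class
```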
Creating the model using the Sequential API

• The first way to build a neural network in tf.keras is to use the
Sequential API.
» It works only for neural networks composed of a single stack of layers
connected sequentially.
• The tf.keras code for building a classification MLP with two hidden layers:
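The code itself is not reproduced on this extracted slide; a sketch of such a classification MLP (the layer sizes 300/100 are the ones typically used for this example) is:

```python
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))    # 28x28 image -> 1D array of 784 values
model.add(keras.layers.Dense(300, activation="relu"))    # first hidden layer
model.add(keras.layers.Dense(100, activation="relu"))    # second hidden layer
model.add(keras.layers.Dense(10, activation="softmax"))  # one output neuron per class
```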

» The Flatten layer converts each input image into a 1D array.


» Each Dense layer manages its own weight matrix, containing all the connection weights
between the neurons and their inputs, as well as the bias terms.
Creating the model using the Sequential API
(cont.)
• Alternatively, you can add the layers when the Sequential model is created.

• The model’s summary() method displays the information of the model’s layers:
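The slide's code is not reproduced; a sketch of the alternative constructor and of the summary call:

```python
# Alternative: pass the list of layers directly to the Sequential constructor
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.summary()   # prints each layer's name, output shape, and parameter count
```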
Accessing the Information of a Model
• Directly get a model’s list of layers:

• All the parameters of a layer can be accessed using its get_weights() and set_weights() methods:
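A sketch of inspecting the model (the layer index and names are illustrative):

```python
model.layers                  # list of the model's layers
hidden1 = model.layers[1]     # the first Dense layer
hidden1.name                  # the layer's auto-generated name

weights, biases = hidden1.get_weights()
weights.shape                 # (784, 300): one weight per input/neuron connection
biases.shape                  # (300,): one bias per neuron, initialized to zeros
```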
Compiling the Model
• Before training the model, you must compile the model:
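A sketch of the compile call, using the loss, optimizer, and metric discussed below:

```python
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])
```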

• Use the "sparse_categorical_cross entropy" loss when we have


sparse labels (i.e., for each instance, there is just a tar- get class
index, from 0 to 9 in this case) and the classes are exclusive.
» Otherwise, the "categorical_crossentropy" loss if one-hot vectors is used
(i.e., [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.] represents class 3).
» Otherwise, use the "binary_crossentropy" loss if the "sigmoid" (i.e.,
logistic) activation function in the output layer is used for binary
classification tasks.
• Use “sgd” for Stochastic Gradient Descent (i.e., reverse-mode
autodiff plus Gradient Descent)
• Use “accuracy” because our model is a classifier.
Training the Model
• After compiling the model, call fit() to train the model with the training and
validation datasets.
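A sketch of the training call (the epoch count is illustrative):

```python
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))
```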

• You should check whether overfitting occurs (i.e., accuracy >> val_accuracy)
• Consider passing the class_weight argument if the training set is skewed.
Drawing the Learning Curves
• fit() returns a History object, which contains:
» The training parameters (history.params)
» The list of epochs it went through (history.epoch)
» The loss and extra metrics at the end of each epoch on the training set
and on the validation set (history.history).
• You can draw the learning curves using matplotlib:
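A sketch of plotting the learning curves from the History object (using pandas for convenience):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Plots loss, accuracy, val_loss and val_accuracy versus the epoch number
pd.DataFrame(history.history).plot(figsize=(8, 5))
plt.grid(True)
plt.gca().set_ylim(0, 1)   # keep the vertical axis in the 0-1 range
plt.show()
```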
Drawing the Learning Curves (cont.)
• The learning curve shows the mean training loss and accuracy
measured over each epoch, and the mean validation loss and
accuracy measured at the end of each epoch:

• When reading the learning curves, you should mentally shift the training
curve in the above graph by half an epoch to the left, since the training
metrics are computed during each epoch while the validation metrics are
computed at the end of each epoch.
Continue the Training
• If the model has not converged yet, call fit() again to continue the
training.
• If you are not satisfied with the performance of your model, you
should go back and tune the hyperparameters.
» Tune the learning rate
» Try another optimizer
» Adjust the number of layers, the number of neurons per layer, and the
types of activation functions to use for each hidden layer
» Change the batch size
• Finally, estimate the generalization error using the test set before
you deploy the model to production.

• Don’t tweak the hyperparameters to improve the accuracy on the test set.
Using the Model to Make Predictions
• After training the model, you can use the model’s predict() method to
make predictions on new instances:

• If you want to know the class with the highest estimated probability only,
use the predict_classes() method instead:
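A sketch of both prediction calls (taking a few test images as stand-ins for new instances; predict_classes() was available in the tf.keras versions the slides target and is deprecated in recent releases):

```python
import numpy as np

X_new = X_test[:3]                 # pretend these are new instances
y_proba = model.predict(X_new)
y_proba.round(2)                   # one estimated probability per class, per instance

y_pred = model.predict_classes(X_new)          # class with the highest probability
# Equivalent in newer TensorFlow releases:
y_pred = np.argmax(model.predict(X_new), axis=-1)
np.array(class_names)[y_pred]      # map class indices back to class names
```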

• The predictions should be correct (otherwise, the model needs more training).


California Housing with the Sequential API
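The slide's code is not reproduced; a plausible sketch of a regression MLP on the California housing dataset (layer size, epoch count, and split choices are illustrative):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(housing.data, housing.target)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

# Scale the features (important for Gradient Descent)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1)              # single output neuron, no activation
])
model.compile(loss="mean_squared_error", optimizer="sgd")
history = model.fit(X_train, y_train, epochs=20,
                    validation_data=(X_valid, y_valid))
mse_test = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test[:3])     # predictions for a few "new" instances
```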
Building Complex Models Using the
Functional API
• You cannot use the Sequential API to build nonsequential
neural networks.
• For example, consider the Wide & Deep neural network:
» can learn both deep patterns (using the deep path) and simple rules
(through the short path)
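A sketch of a Wide & Deep model built with the Functional API (reusing the California housing training data from the previous sketch):

```python
input_ = keras.layers.Input(shape=X_train.shape[1:])
hidden1 = keras.layers.Dense(30, activation="relu")(input_)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_, hidden2])   # wide path joins the deep path
output = keras.layers.Dense(1)(concat)
model = keras.models.Model(inputs=[input_], outputs=[output])
```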
Using the Functional API
• How about sending a subset of the features through the wide
path and a different subset (possibly overlapping) through the
deep path:
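A sketch with two inputs (the particular feature subsets, 0-4 for the wide path and 2-7 for the deep path, are illustrative):

```python
input_A = keras.layers.Input(shape=[5], name="wide_input")
input_B = keras.layers.Input(shape=[6], name="deep_input")
hidden1 = keras.layers.Dense(30, activation="relu")(input_B)
hidden2 = keras.layers.Dense(30, activation="relu")(hidden1)
concat = keras.layers.concatenate([input_A, hidden2])
output = keras.layers.Dense(1, name="output")(concat)
model = keras.models.Model(inputs=[input_A, input_B], outputs=[output])
```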
Using the Functional API (cont.)
• Compile, train, and evaluate the model, and then make
predictions:
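A sketch, splitting the features into the two subsets assumed above and passing one array per input:

```python
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

X_train_A, X_train_B = X_train[:, :5], X_train[:, 2:]
X_valid_A, X_valid_B = X_valid[:, :5], X_valid[:, 2:]
X_test_A, X_test_B = X_test[:, :5], X_test[:, 2:]
X_new_A, X_new_B = X_test_A[:3], X_test_B[:3]

history = model.fit((X_train_A, X_train_B), y_train, epochs=20,
                    validation_data=((X_valid_A, X_valid_B), y_valid))
mse_test = model.evaluate((X_test_A, X_test_B), y_test)
y_pred = model.predict((X_new_A, X_new_B))
```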
Models with Multiple Outputs
• Reasons for having multiple outputs:
» The task may demand it.
» You have multiple independent tasks
based on the same data---multitask
classification.
» Add some auxiliary outputs for
regularization.
Models with Multiple Outputs (cont.)
• Each output will need its own loss function:

• Train the model, providing labels for each output (see the combined sketch after this list):

• Evaluate the outputs separately:

• Likewise, make predictions separately:
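A combined sketch for a model with a main output and an auxiliary output (attached to hidden2 for regularization); loss weights and epoch count are illustrative:

```python
output = keras.layers.Dense(1, name="main_output")(concat)
aux_output = keras.layers.Dense(1, name="aux_output")(hidden2)
model = keras.models.Model(inputs=[input_A, input_B],
                           outputs=[output, aux_output])

# One loss per output; weight the auxiliary loss less than the main one
model.compile(loss=["mse", "mse"], loss_weights=[0.9, 0.1], optimizer="sgd")

# Provide one set of labels per output (here, the same labels for both)
history = model.fit([X_train_A, X_train_B], [y_train, y_train], epochs=20,
                    validation_data=([X_valid_A, X_valid_B], [y_valid, y_valid]))

# evaluate() returns the total loss plus the individual losses
total_loss, main_loss, aux_loss = model.evaluate(
    [X_test_A, X_test_B], [y_test, y_test])

# predict() returns one array of predictions per output
y_pred_main, y_pred_aux = model.predict([X_new_A, X_new_B])
```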


Using the Subclassing API to Build
Dynamic Models
• Both the Sequential API and the Functional API
are declarative
» Advantages:
§ The model can easily be saved, cloned, and shared
§ its structure can be displayed and analyzed
§ the framework can infer shapes and check types, so errors
can be caught early
§ It’s also fairly easy to debug, since the whole model is a
static graph of layers.
» Disadvantage:
§ The models are static---you cannot build models that involve
loops, varying shapes, conditional branching, and other
dynamic behaviors.
Using the Subclassing API to Build
Dynamic Models (cont.)
• The Subclassing API: subclass the Model class, create the layers you need in
the constructor, and use them to perform the computations you want in the
call() method.
» Advantage: Imperative programming style---you can use for loops, if
statements, and low-level TensorFlow operations in call().
» Disadvantage: Keras cannot inspect the model's architecture, and it is harder to debug.
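A sketch of a Wide & Deep model written with the Subclassing API (names and layer sizes are illustrative):

```python
class WideAndDeepModel(keras.models.Model):
    def __init__(self, units=30, activation="relu", **kwargs):
        super().__init__(**kwargs)          # handles standard args such as name
        self.hidden1 = keras.layers.Dense(units, activation=activation)
        self.hidden2 = keras.layers.Dense(units, activation=activation)
        self.main_output = keras.layers.Dense(1)
        self.aux_output = keras.layers.Dense(1)

    def call(self, inputs):
        # Any imperative logic (loops, conditionals, etc.) could go here
        input_A, input_B = inputs
        hidden1 = self.hidden1(input_B)
        hidden2 = self.hidden2(hidden1)
        concat = keras.layers.concatenate([input_A, hidden2])
        return self.main_output(concat), self.aux_output(hidden2)

model = WideAndDeepModel()
```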
Saving and Restoring a Model
• When using the Sequential API or the Functional API, you can save a
trained Keras model:

• Keras will use the HDF5 format to save:
» The model’s architecture (including every layer’s hyperparameters)
» The values of all the model parameters for every layer (e.g., connection
weights and biases)
» The optimizer (including its hyperparameters and any state it may have)
» etc.
• To load the model:
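A sketch of saving and loading (the file name is illustrative):

```python
model.save("my_keras_model.h5")    # saves architecture, weights, and optimizer state

model = keras.models.load_model("my_keras_model.h5")
```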
Using Callbacks to Save Intermediate
Models during Training
• Remember to save models at regular intervals during a long training
session to avoid losing everything if your computer crashes.
• The fit() method accepts a callbacks argument that lets you specify a list of
objects that Keras will call at the start and end of training, at the start and
end of each epoch, and even before and after processing each batch.

• If you use a validation set during training, you can set
save_best_only=True when creating the ModelCheckpoint to implement
early stopping:
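A sketch of both uses of the ModelCheckpoint callback (file name and epoch counts are illustrative):

```python
# Save a checkpoint of the model at the end of every epoch
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5")
history = model.fit(X_train, y_train, epochs=10, callbacks=[checkpoint_cb])

# Keep only the model that performs best on the validation set
checkpoint_cb = keras.callbacks.ModelCheckpoint("my_keras_model.h5",
                                                save_best_only=True)
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb])
model = keras.models.load_model("my_keras_model.h5")   # roll back to the best model
```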
Using Callbacks to Implement Early
Stopping and custom callbacks
• Another way to implement early stopping is to simply use the
EarlyStopping callback.

• If you need extra control, you can easily write your own
custom callbacks. For example,
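A sketch of the EarlyStopping callback and of a small custom callback (the patience value is illustrative; checkpoint_cb is the callback from the previous sketch):

```python
# Stop training when the validation loss has not improved for 10 epochs,
# and roll back to the best weights seen so far
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb])

# A custom callback: print the ratio of validation loss to training loss
class PrintValTrainRatioCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        print("\nval/train: {:.2f}".format(logs["val_loss"] / logs["loss"]))
```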
Using TensorBoard for Visualization
• TensorBoard is a great interactive visualization
tool that you can use to
» view the learning curves during training
» compare learning curves between multiple runs
» visualize the computation graph
» analyze training statistics
» view images generated by your model
» visualize complex multidimensional data projected
down to 3D and automatically clustered for you
» etc.
Visualizing Learning Curves with TensorBoard
Using TensorBoard
• To use TensorBoard, you must modify your program so that it outputs the
data you want to visualize to special binary log files called event files.
• Each binary data record is called a summary.
• The TensorBoard server will monitor the log directory, and it will
automatically pick up the changes and update the visualizations.
• In general, you want to point the TensorBoard server to a root log
directory and configure your program so that it writes to a different
subdirectory every time it runs.
• Define the root log directory for TensorBoard logs
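A sketch of defining the root log directory and a per-run subdirectory (directory names are illustrative):

```python
import os
import time

root_logdir = os.path.join(os.curdir, "my_logs")

def get_run_logdir():
    # A different subdirectory per run, named with the current date and time
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()   # e.g. ./my_logs/run_<timestamp>
```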
Using TensorBoard (cont.)
• Keras provides the TensorBoard() callback:

• The callback automatically creates the log directory, generates
event files, and writes summaries to them during training.
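A sketch of using the callback (reusing run_logdir from the previous sketch; the epoch count is illustrative):

```python
tensorboard_cb = keras.callbacks.TensorBoard(run_logdir)
history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid),
                    callbacks=[tensorboard_cb])
```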
• The directory structure contains one subdirectory per run, each with separate training and validation event files.
Using TensorBoard (cont.)
• Start the TensorBoard server by running a command in a
terminal:

• Once the server is up, you can open a web browser and go to
http://localhost:6006
• To use TensorBoard directly within Jupyter:
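The exact commands on the slide are not shown; typical commands, assuming the ./my_logs directory used above, are sketched here as comments:

```python
# In a terminal (with the virtual environment activated), start the server:
#   tensorboard --logdir=./my_logs --port=6006
# Then open http://localhost:6006 in a browser.
#
# Inside a Jupyter notebook, use the TensorBoard magics instead:
#   %load_ext tensorboard
#   %tensorboard --logdir=./my_logs
```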
Using TensorBoard (cont.)
• TensorFlow offers a lower-level API in the tf.summary package.
» E.g., you can create a SummaryWriter using the create_file_writer() function
and use this writer as a context to log scalars, histograms, images, audio,
and text, all of which can then be visualized using TensorBoard.
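A sketch of the lower-level API (reusing the run-directory helper from the earlier sketch; the logged values are arbitrary examples):

```python
import numpy as np
import tensorflow as tf

test_logdir = get_run_logdir()
writer = tf.summary.create_file_writer(test_logdir)
with writer.as_default():
    for step in range(1, 1000 + 1):
        # Log a scalar and a histogram at each step
        tf.summary.scalar("my_scalar", np.sin(step / 10), step=step)
        data = (np.random.randn(100) + 2) * step / 100   # some random data
        tf.summary.histogram("my_hist", data, buckets=50, step=step)
```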
What we’ve learned so far
• We learnt about the history of neural nets research.
• What an MLP is and how you can use it for
classification and regression
• How to use tf.keras’s Sequential API to build MLPs
• How to use the Functional API or the Subclassing API to
build more complex model architectures
• How to save and restore a model
• How to use callbacks for checkpointing, early stopping,
and more.
• How to use TensorBoard for visualization.
Fine-Tuning Neural Network Hyperparameters
• There are many hyperparameters to tweak:
» E.g., the number of layers, the number of neurons per layer, the type
of activation function to use in each layer, the weight initialization
logic
• How do you know what combination of hyperparameters is
the best?
• One option is to try many combinations of hyperparameters
and see which one works best on the validation set (or use K-fold
cross-validation).
» E.g., use GridSearchCV or RandomizedSearchCV to explore the
hyperparameter space.
Fine-Tuning Hyperparameters (cont.)

• Step 1: Wrap our Keras models in objects that mimic regular
Scikit-Learn regressors.
• Step 2: Create a KerasRegressor based on this build_model()
function:
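A sketch of both steps (the hyperparameter defaults and input_shape=[8] for California housing are illustrative; keras.wrappers.scikit_learn ships with the tf.keras versions the slides target, while newer code typically uses the scikeras package instead):

```python
def build_model(n_hidden=1, n_neurons=30, learning_rate=3e-3, input_shape=[8]):
    # Build and compile a simple regression MLP from a few hyperparameters
    model = keras.models.Sequential()
    model.add(keras.layers.InputLayer(input_shape=input_shape))
    for layer in range(n_hidden):
        model.add(keras.layers.Dense(n_neurons, activation="relu"))
    model.add(keras.layers.Dense(1))
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    model.compile(loss="mse", optimizer=optimizer)
    return model

keras_reg = keras.wrappers.scikit_learn.KerasRegressor(build_model)
```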
Fine-Tuning Hyperparameters (cont.)

• Step 3: Use the KerasRegressor object like a regular Scikit-Learn
regressor for training, evaluation, and prediction.
» Any extra parameter you pass to the fit() method will get passed to
the underlying Keras model.
» The score will be the opposite of the MSE because Scikit-Learn wants
scores, not losses (i.e., higher should be better).

• Then use GridSearchCV() to perform a grid search (a sketch follows):
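A sketch of Step 3 and of the grid search (the parameter grid values are illustrative; extra fit() arguments are forwarded to the underlying Keras model):

```python
keras_reg.fit(X_train, y_train, epochs=100,
              validation_data=(X_valid, y_valid),
              callbacks=[keras.callbacks.EarlyStopping(patience=10)])
mse_test = -keras_reg.score(X_test, y_test)   # score is the negative of the MSE
y_pred = keras_reg.predict(X_test[:3])

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_hidden": [0, 1, 2, 3],
    "n_neurons": [10, 30, 50],
    "learning_rate": [3e-4, 3e-3, 3e-2],
}
grid_search_cv = GridSearchCV(keras_reg, param_grid, cv=3)
grid_search_cv.fit(X_train, y_train, epochs=100,
                   validation_data=(X_valid, y_valid),
                   callbacks=[keras.callbacks.EarlyStopping(patience=10)])
grid_search_cv.best_params_   # the best hyperparameter combination found
```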


Fine-Tuning Hyperparameters (cont.)
• Use a randomized search rather than grid search if there are
too many combinations of hyperparameters.

• After exploring for many hours, the search returns the best combination of
hyperparameters it found on the validation set.
Beyond Randomized search
• When training with randomized search is slow, first run a quick
random search using wide ranges of hyperparameter values, then
run another search using smaller ranges of values centered on the
best ones found during the first run, and so on.
• Some Python libraries you can use to optimize hyperparameters:
» e.g., Hyperopt, Hyperas, kopt, Talos, Keras Tuner, Scikit-Optimize (skopt), Spearmint,
Hyperband, Sklearn-Deap.
• Some companies offer services for hyperparameter optimization:
» e.g., Google Cloud AI Platform’s hyperparameter tuning service, Arimo and SigOpt,
and CallDesk’s Oscar.
• Google's AutoML does not just search for hyperparameters but also looks for the
best neural network architecture for the problem.
Number of Hidden Layers
• Theoretically, you can use a shallow neural network to model even the
most complex functions, provided it has enough neurons.
• But deep networks have a much higher parameter efficiency than shallow
ones for complex problems.
» Real-world data is often structured in such a hierarchical way, and deep neural networks
automatically take advantage of this fact.
• Not only does this hierarchical architecture help DNNs converge faster to a
good solution, but it also improves their ability to generalize to new
datasets (i.e., transfer learning)
• Very complex tasks, such as large image classification or speech
recognition, typically require networks with hundreds of layers and they
need a huge amount of training data.
» It is more common to reuse parts of a pretrained state-of-the-art network that performs
these tasks.
Number of Neurons per Hidden Layer
• The number of neurons in the input and output layers is
determined by the type of input and output your task requires.
» e.g., the MNIST task requires 28 × 28 = 784 input neurons and 10 output
neurons.
• As for the hidden layers, it used to be common to size them to form
a pyramid, with fewer and fewer neurons at each layer.
• You can try increasing the number of neurons gradually until the
network starts overfitting.
• The “stretch pants” approach: pick a model with more layers and
neurons than you actually need, then use early stopping and other
regularization techniques to prevent it from overfitting.
» Avoid bottleneck layers that could ruin your model.
• In general you will get more bang for your buck by increasing the
number of layers instead of the number of neurons per layer.
Tuning the Learning Rate
• Learning rate is arguably the most important hyperparameter.
• One way to find a good learning rate is to train the model for a few
hundred iterations, starting with a very low learning rate (e.g., 10⁻⁵)
and gradually increasing it up to a very large value (e.g., 10).
» This is done by multiplying the learning rate by a constant factor at each
iteration (e.g., by exp(log(10⁶)/500) to go from 10⁻⁵ to 10 in 500 iterations).
• If you plot the loss as a function of the learning rate (using a log
scale for the learning rate), you should see it dropping at first.
» But after a while, the learning rate will be too large, so the loss will shoot
back up
• The optimal learning rate will be a bit lower than the point at which
the loss starts to climb (typically about 10 times lower than the
turning point).
• You can then reinitialize your model and train it normally using this
good learning rate.
Tuning Optimizer, Batch Size, Activation
Functions, and Number of Iterations
• Choosing a better optimizer than plain old Mini-batch
Gradient Descent is quite important.
• The main benefit of using large batch sizes is that hardware
accelerators like GPUs can process them efficiently, so the
training algorithm will see more instances per second.
» But some researchers have reported that large batch sizes often lead
to training instabilities, and the resulting model may not generalize
as well as a model trained with a small batch size.
• There are activation functions better than ReLU
• In most cases, the number of training iterations does not
actually need to be tweaked: just use early stopping
instead.
Conclusions
• This concludes our introduction to artificial neural networks
and their implementation with Keras.
• In the next lecture, we will discuss techniques to train very
deep nets.
» But we will not talk about customizing models using TensorFlow's
lower-level API, nor about loading and preprocessing data efficiently
using the Data API.
• After that we will dive into other popular neural network
architectures: convolutional neural networks for image
processing and recurrent neural networks for sequential
data.
» But we will skip autoencoders for representation learning, and
generative adversarial networks to model and generate data.
