
IT549: Deep Learning

Lecture 07-08

Neural Networks Examples using Keras


(Slides are created from the lecture notes of Dr. Derek Bridge, UCC, Ireland)

Arpit Rana
16th / 17th January 2025
Deep Learning

● The word 'deep' in 'deep learning' does not mean profound.

● In deep learning, we have 'lots' of layers — tens or even hundreds.


Representations

One way of thinking about Machine Learning:

● It uses guidance from a feedback signal to automatically find transformations that turn
input data into more useful representations.

For example,

○ in the case of supervised learning, the feedback comes from the loss function and
the algorithm seeks a representation that is closer to the target outputs.
Representations

Deep learning is about jointly finding successive layers of representations, usually in the form
of the layers of a neural network.

● The network takes in vectors (examples).

● The first layer in some sense transforms the input vectors into new vectors — a different
representation of the input examples.

● The second layer transforms again into new vectors — another representation.

● Since each layer produces a new representation, one way of thinking about this is that, for
the kinds of tasks on which it is successful, deep learning automates feature engineering.
Drivers of Deep Learning

Hardware:
● Faster CPUs but then highly-parallel Graphics Processing Units (GPUs) and now
specially-designed Tensor Processing Units (TPUs).
Data:
● Sensors and the Internet have made vast datasets available: text, images, video, …
Algorithmic advances:
● The core ideas have been around a long time: Perceptrons (1950s), backpropagation
(1980s or earlier), convolutional networks (1980s), LSTMs (1990s), …
● But new ideas from 2010 onwards: better weight initialization, batch normalization,
different activation functions, variants of SGD, numerous ways to avoid overfitting, new
architectures,…
Freeware:
● Toolkits/APIs; Educational resources.
Money!
Applications of Deep Learning

It is excelling at 'perceptual' tasks, e.g.


● image classification;
● image segmentation;
● speech recognition;
● handwriting transcription.

But it is finding ever wider application:


● video recommendation;
● machine translation;
● text-to-speech;
● question-answering;
● autonomous driving;
● the protein folding problem (AlphaFold);
● superhuman game playing (e.g. AlphaGo).
Implementation

In this lecture:

● We will use layered, dense, feedforward neural networks for regression, binary
classification and multi-class classification:

○ We'll use our two small datasets that contain structured data (sometimes called
tabular data): not necessarily ideal for deep learning.

○ We'll see one example that uses images.

● This will illustrate some of the different activation functions we can use:

○ in the output layer: linear, sigmoid or softmax; and

○ in the hidden layers: sigmoid or ReLU.

● This will also introduce the Keras library.


The Keras Library

scikit-learn has very limited support for neural networks.

TensorFlow and PyTorch are the two main libraries that do support tensor computation, neural
networks and deep learning in Python.

We will use Keras, which is a high-level API for TensorFlow, first released in 2015 by François
Chollet of Google (https://keras.io):

● It is very high-level, making it easy to construct networks, fit models and make predictions.

● The downside is it gives less fine-grained control than TensorFlow itself. When
fine-grained control is needed, you can mix in TensorFlow functions, methods and
classes.

● This seems a suitable trade-off for us: our module is about AI, not the intricacies of
TensorFlow.
Keras Concepts

Network Architecture: Number of Hidden Layers

● A neural network with no hidden layers is just a linear model.

● Hidden layers are needed when data is not linearly separable.

○ Try to avoid more than two hidden layers, as more layers increase the model
complexity.

○ For very large datasets, gradually ramp up the number of hidden layers until you
start overfitting the training set.
Keras Concepts

Network Architecture: Number of Neurons in Hidden Layers

● The number of hidden neurons should be between the size of the input layer and the size
of the output layer.

● The number of hidden neurons should be 2/3 the size of the input layer, plus the size of
the output layer.

● The number of hidden neurons should be less than twice the size of the input layer.

Source: An Introduction to Neural Networks for Java, Second Edition by Jeff Heaton
Keras Concepts

Layers are the building blocks.


● To begin with, we will use dense (fully connected) layers.

The activation functions of hidden layers are open for you to choose, e.g. sigmoid or ReLU.

● But the activation functions of output layers are determined by the task:
○ Regression: linear activation function (default);
○ Binary classification: sigmoid activation function; and
○ Multiclass classification: softmax activation function.

Layers are combined into networks:


● Consecutive layers must be compatible: the shape of the input to one layer is the shape
of the output of the preceding layer.
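
As an illustration (not taken from the slides), here is a minimal sketch of stacking compatible Dense layers into a Sequential model in Keras; the layer sizes and the three input features are assumptions chosen for illustration only:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Build a small fully-connected network; each Dense layer's output shape
# matches the input shape expected by the next layer.
model = keras.Sequential([
    layers.Input(shape=(3,)),              # 3 input features (assumed)
    layers.Dense(32, activation="relu"),   # hidden layer, ReLU activation
    layers.Dense(32, activation="relu"),   # hidden layer, ReLU activation
    layers.Dense(1)                        # output layer, linear activation (regression)
])
model.summary()                            # prints layer shapes and parameter counts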
Keras Concepts

Once the network is built, we compile it, specifying:

A loss function:
● Regression, e.g. mean-squared-error (mse);
● Binary classification, e.g. (binary) cross-entropy (binary_crossentropy);
● Multiclass classification, e.g. (categorical) cross-entropy
(sparse_categorical_crossentropy if the labels are encoded as integers, or
categorical_crossentropy if they are one-hot encoded).

An optimizer, such as SGD — but see below.

A list of metrics to monitor during training and testing:


● Regression, e.g. mean-absolute-error (mae);
● Classification, e.g. accuracy (acc).
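
As a hedged sketch (not from the slides), the compile step for each of the three tasks might look like this, using the loss functions and metrics listed above and assuming a model has already been built:

# Regression:
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])

# Binary classification:
# model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])

# Multi-class classification with integer labels:
# model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["acc"])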
Keras Optimizers

We know about Gradient Descent: Batch, Mini-Batch, Stochastic.

Without going into details, many other variants of Gradient Descent have been devised (e.g.
RMSprop, Adam, Nadam, Adagrad, …):

● some may have better convergence behaviour in the case of local minima;

● some may converge more quickly.

although a disadvantage is that they typically introduce further hyperparameters (e.g.
momentum) in addition to the learning rate.
Keras Optimizers

We will use RMSprop below.

● Be aware, its default learning rate is 0.001. This is usually OK, but in some cases you may
need to change it.

● Be aware too that the fit method has an argument called batch_size. If we set its value to
somewhere between 1 and the size of the training set, we are using Mini-Batch
Gradient Descent.
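
A minimal sketch of both points, assuming a model and training arrays X_train and y_train already exist (the epoch count and batch size are illustrative):

from tensorflow.keras.optimizers import RMSprop

# RMSprop with its default learning rate of 0.001, set explicitly here so it
# is easy to change.
model.compile(optimizer=RMSprop(learning_rate=0.001), loss="mse", metrics=["mae"])

# A batch_size between 1 and the size of the training set gives Mini-Batch
# Gradient Descent; batch_size=1 would give Stochastic Gradient Descent.
history = model.fit(X_train, y_train, epochs=20, batch_size=32)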
A Neural Network for Regression

For regression on structured/tabular data, we might use a network with the following
architecture:

● Input layer: one input per feature.

● Hidden layers: one or more hidden layers.


○ Activation function for neurons in hidden layers can be the sigmoid function or
ReLU.

● Output layer: just one output neuron (assuming we're predicting a single number).
○ Activation function for the output neuron should be the linear function: g(z) = z

There are also biases in each layer except the output layer — Keras will give us these 'for free'.
Example: House Rent Prediction

We don't want too many hidden layers, nor too many neurons in each hidden layer. Why?

Let's start with this:


● An input layer with three inputs (BHK, Size, Bathrooms);
● Two hidden layers, with 32 neurons in each, and ReLU activation function;
● An output layer with a single neuron and linear activation function.

We need to scale the features. But, since we are now not using scikit-learn's
ColumnTransformer to create a preprocessor, we need to take care of the scaling ourselves.
Example: House Rent Prediction
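
A hedged reconstruction of this network (the slides' own code is not reproduced here). The DataFrame df and the column names BHK, Size, Bathroom and Rent are assumptions, and scaling is done with a Normalization layer adapted on the training data:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# Assumed: df is a pandas DataFrame holding the house rent data.
X = df[["BHK", "Size", "Bathroom"]].to_numpy().astype("float32")
y = df["Rent"].to_numpy().astype("float32")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the three features; adapt() computes their means and variances from the training data.
normalizer = layers.Normalization()
normalizer.adapt(X_train)

model = keras.Sequential([
    layers.Input(shape=(3,)),
    normalizer,
    layers.Dense(32, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(1)                        # linear activation for regression
])
model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
model.fit(X_train, y_train, epochs=30, batch_size=32, verbose=0)
print(model.evaluate(X_test, y_test, verbose=0))   # [test mse, test mae]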
A Neural Network for Binary Classification

For binary classification, we might use a network with the following architecture:

● Input layer: one input per feature.

● Hidden layers: one or more hidden layers.


○ Activation function for neurons in hidden layers can be sigmoid or ReLU.

● Output layer: just one output neuron (for binary classification).


○ Activation function for the output neuron should be the sigmoid function.
Why?
Example: Class Performance Dataset

Let's start with this:


● An input layer with 3 inputs (lec, lab, cao).
● Two hidden layers, with 64 neurons in each, and ReLU activation function.
● An output layer with a single neuron and sigmoid activation function.
We'll scale using a Normalization layer.
Example: Class Performance Dataset
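
A hedged sketch of this network (not the slides' own code); X_train, y_train, X_test and y_test are assumed to hold the lec, lab and cao features and the 0/1 labels:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Scale the three features with a Normalization layer adapted on the training data.
normalizer = layers.Normalization()
normalizer.adapt(X_train)

model = keras.Sequential([
    layers.Input(shape=(3,)),
    normalizer,
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid")  # sigmoid output for binary classification
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["acc"])
model.fit(X_train, y_train, epochs=50, batch_size=32, verbose=0)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(acc)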

// 0.6666666865348816

Feel free to edit the code, e.g. add or remove hidden layers, change the number of neurons in
the hidden layers, change ReLU to sigmoid, change from RMSprop to another optimizer,
change the learning rate, change the number of epochs, or change the batch size.
A Neural Network for Multiclass Classification

For multi-class classification, we might use a network with the following architecture:

● Input layer: one input per feature.

● Hidden layers: one or more hidden layers.


○ Activation function for neurons in hidden layers can be sigmoid or ReLU.

● Output layer: one output neuron per class.


○ Activation function for the output neurons should be the softmax function.
Example: Iris Dataset

Let's start with this:

● An input layer with 4 inputs (petal width and length, and sepal width and length).

● Two hidden layers, with 64 neurons in each, and ReLU activation function.

● An output layer with three neurons (one each for Setosa, Versicolor and Virginica) and
softmax activation function.
Example: Iris Dataset
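
A hedged sketch of this network (not the slides' own code), loading Iris from scikit-learn; the Normalization layer, the train/test split, the epoch count and the batch size are assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data.astype("float32"), iris.target, test_size=0.2, random_state=42)

# Scale the four features with a Normalization layer adapted on the training data.
normalizer = layers.Normalization()
normalizer.adapt(X_train)

model = keras.Sequential([
    layers.Input(shape=(4,)),
    normalizer,
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax")  # one output neuron per class
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy", metrics=["acc"])
model.fit(X_train, y_train, epochs=50, batch_size=16, verbose=0)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(acc)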

Note the loss function above:


● sparse_categorical_crossentropy for multi-class classification when the classes
are integers, e.g. 0 = one kind of Iris, 1 = another kind, 2 = a third kind (which is what we
have in the Iris dataset).
● categorical_crossentropy for multi-class classification when the classes have been
one-hot encoded.
● As we've seen, binary_crossentropy for binary classification, where the classes will
be 0 or 1.

Below, as an alternative, is code that illustrates one-hot encoding the target values using the
Keras function to_categorical, and then using categorical_crossentropy as the
loss function.
Example: Iris Dataset
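
A short hedged sketch of this one-hot alternative, reusing the model and data splits from the previous sketch:

from tensorflow.keras.utils import to_categorical

# One-hot encode the integer class labels: 0 -> [1,0,0], 1 -> [0,1,0], 2 -> [0,0,1].
y_train_onehot = to_categorical(y_train, num_classes=3)
y_test_onehot = to_categorical(y_test, num_classes=3)

# The only other change is the loss function.
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy", metrics=["acc"])
model.fit(X_train, y_train_onehot, epochs=50, batch_size=16, verbose=0)
loss, acc = model.evaluate(X_test, y_test_onehot, verbose=0)
print(acc)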

// 0.8999999761581421
Example: Iris Dataset

Observations:

● Neural networks are often not the best-performing approach for structured data.

● And, sure enough, the results here are not great. Of course, there is a lot we can tweak to
see if we can improve the results.

● But, instead, let's switch to an image processing example.


Example: Fashion MNIST Dataset

Fashion MNIST is also a classic dataset for multi-class classification.


● The task is classification of fashion items.
○ Features: 28 pixel by 28 pixel grayscale images of fashion items. The values are
integers in [0, 255].
○ Classes: ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt",
"Sneaker", "Bag", "Ankle boot"].

● Dataset: 70,000 images, so we can safely use holdout, and it is already partitioned:
○ 60,000 training images; 10,000 test images.
Example: Fashion MNIST Dataset

We don't really need scikit-learn pipelines this time.

But we do need to reshape:

● Our training data is in a 3D array of shape (60000, 28, 28).

● We change it to a 2D array of shape (60000, 28 * 28).

○ This 'flattens' the images.

○ When working with images, it is often better not to do this.

● Similarly, the test data.


Example: Fashion MNIST Dataset

We will do a three-layer network:

● One hidden layer with 300 neurons, using the ReLU activation function.

● Second hidden layer with 100 neurons, using the ReLU activation function.

● The output layer will have 10 neurons, one per class, and will use the softmax activation
function.

The features (pixel values) are all in the same range [0, 255], so we do not need to standardize
using a Normalization layer.

But it is a bad idea to feed into a neural network values that are much larger than the initial
weights, so we will rescale to [0, 1] by dividing by 255. We can do this using a Rescaling layer.
Example: Fashion MNIST Dataset
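
A hedged sketch of this network (not the slides' own code), using the Fashion MNIST loader built into Keras; the epoch count and batch size are assumptions:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

(X_train, y_train), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()

# Flatten each 28x28 image into a vector of 28 * 28 = 784 pixel values.
X_train = X_train.reshape((60000, 28 * 28)).astype("float32")
X_test = X_test.reshape((10000, 28 * 28)).astype("float32")

model = keras.Sequential([
    layers.Input(shape=(28 * 28,)),
    layers.Rescaling(1.0 / 255),            # rescale pixel values from [0, 255] to [0, 1]
    layers.Dense(300, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(10, activation="softmax")  # one output neuron per class
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy", metrics=["acc"])
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(acc)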
Remarks on Computer Vision Problems

In the 1960s, 70s, 80s and to some extent 90s, the typical pipeline for a computer vision (or
image processing) system was as follows:

● There would be a module that would extract features from the images.

○ These features would have been carefully hand-designed.

○ They might include edges detected by some edge detection algorithm, for example.
(If you are interested, look up SIFT or SURF or HOG.)

● Then these features would be fed into a typical learning algorithm, e.g. logistic
regression.
Remarks on Computer Vision Problems

Notice how different life is now — when using neural networks.

● There's no extraction of hand-crafted features. We feed in the raw pixel values (or,
lightly-processed pixel values, e.g. scaled values).

● It is the layers of the neural network that automatically discover the features, and the
final layer that makes the classification.

● The dense layers are only one possibility.

○ Computer vision (image processing) more commonly uses convolutional layers,
pooling layers, batch normalization layers, and so on. We may study these in
coming lectures.
Concluding Remarks

● A few decisions are constrained: number of inputs; number of output neurons; activation
function of output neurons; and (to some extent) loss function.

● But there are numerous hyperparameters (and even more to come!)

○ Even making a good guess at them is more art than science, although this is
changing.
○ On the other hand, grid search or randomized search will make things even slower
than they already are — and we still have to specify some sensible values for them
to search through.

● There is a considerable risk of overfitting.


Next lecture: Training Deep Neural Network
17th January 2025
