Lecture 07-08
Arpit Rana
16th / 17th January 2025
Deep Learning
● It uses guidance from a feedback signal to automatically find transformations that turn
input data into more useful representations.
For example,
○ in the case of supervised learning, the feedback comes from the loss function and
the algorithm seeks a representation that is closer to the target outputs.
Representations
Deep learning is about jointly finding successive layers of representations, usually in the form
of the layers of a neural network.
● The first layer in some sense transforms the input vectors into new vectors — a different
representation of the input examples.
● The second layer transforms again into new vectors — another representation.
● Since each layer produces a new representation, one way of thinking about this is that, for
the kinds of tasks on which it is successful, deep learning automates feature engineering.
Drivers of Deep Learning
Hardware:
● Faster CPUs, but then highly-parallel Graphics Processing Units (GPUs) and now
specially-designed Tensor Processing Units (TPUs).
Data:
● Sensors and the Internet have made vast datasets available: text, images, video, …
Algorithmic advances:
● The core ideas have been around a long time: Perceptrons (1950s), backpropagation
(1980s or earlier), convolutional networks (1980s), LSTMs (1990s), …
● But new ideas from 2010 onwards: better weight initialization, batch normalization,
different activation functions, variants of SGD, numerous ways to avoid overfitting, new
architectures,…
Freeware:
● Toolkits/APIs; Educational resources.
Money!
Applications of Deep Learning
In this lecture:
● We will use layered, dense, feedforward neural networks for regression, binary
classification and multi-class classification:
○ We'll use our two small datasets that contain structured data (sometimes called
tabular data): not necessarily ideal for deep learning.
● This will illustrate some of the different activation functions we can use:
TensorFlow and PyTorch are the two main libraries that support tensor computation, neural
networks and deep learning in Python.
We will use Keras, which is a high-level API for TensorFlow, first released in 2015 by François
Chollet of Google (https://keras.io):
● The downside is it gives less fine-grained control than TensorFlow itself. When
fine-grained control is needed, you can mix in TensorFlow functions, methods and
classes.
● This seems a suitable trade-off for us: our module is about AI, not the intricacies of
TensorFlow.
Keras Concepts
○ Try to avoid more than two hidden layers; otherwise the model complexity increases.
○ For very large datasets, gradually ramp up the number of hidden layers until you
start overfitting the training set.
● The number of hidden neurons should be between the size of the input layer and the size
of the output layer.
● The number of hidden neurons should be 2/3 the size of the input layer, plus the size of
the output layer.
● The number of hidden neurons should be less than twice the size of the input layer.
Source: An Introduction to Neural Networks for Java, Second Edition by Jeff Heaton
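As an illustrative calculation only (using the Iris setup that appears later in this lecture: 4 inputs, 3 outputs): the first rule suggests between 3 and 4 hidden neurons; the second gives 2/3 × 4 + 3 ≈ 6; and the third allows anything below 2 × 4 = 8. These are only rules of thumb; the examples later in this lecture use wider layers (64 neurons).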
The activation functions of hidden layers are open for you to choose, e.g. sigmoid or ReLU.
● But the activation functions of output layers are determined by the task:
● Regression: linear activation function (default);
● Binary classification: sigmoid activation function; and
● Multiclass classification: softmax activation function.
A loss function:
● Regression, e.g. mean-squared-error (mse);
● Binary classification, e.g. (binary) cross-entropy (binary_crossentropy );
● Multiclass classification, e.g. (categorical) cross-entropy
(sparse_categorical_crossentropy if the labels are encoded as integers, or
categorical_crossentropy if the labels are one-hot encoded).
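The original slides do not reproduce code here; as a hedged sketch (the layer sizes, the Adam optimizer and the three-class output are illustrative assumptions), the pairings above look like this in Keras:

import tensorflow as tf
from tensorflow.keras import layers

# Regression: one linear output neuron, mean-squared-error loss.
regressor = tf.keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(1)                                   # linear activation by default
])
regressor.compile(optimizer="adam", loss="mse")

# Binary classification: one sigmoid output neuron, binary cross-entropy.
binary_clf = tf.keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid")
])
binary_clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Multiclass classification with integer labels: softmax outputs and
# sparse categorical cross-entropy (categorical_crossentropy for one-hot labels).
multi_clf = tf.keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax")
])
multi_clf.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])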
Without going into details, many other variants of Gradient Descent have been devised (e.g.
RMSprop, Adam, Nadam, Adagrad, …):
● some may have better convergence behaviour in the case of local minima;
● Be aware that their default learning rate in Keras is 0.001. This is usually OK, but in some
cases you may need to change it.
● Be aware too that fit() has an argument called batch_size. If we set its value to
somewhere between 1 and the size of the training set, then we are getting Mini-Batch
Gradient Descent.
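A minimal sketch of overriding the default learning rate and choosing a batch size (the toy data, layer sizes and values below are illustrative assumptions, not the lecture's code):

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy stand-in data so the snippet runs end to end.
X_train = np.random.rand(200, 8).astype("float32")
y_train = np.random.rand(200).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1)
])

# Overriding Adam's default learning rate of 0.001.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005), loss="mse")

# batch_size between 1 and the training-set size gives Mini-Batch Gradient Descent;
# batch_size=1 would be Stochastic GD, batch_size=len(X_train) would be Batch GD.
model.fit(X_train, y_train, epochs=5, batch_size=32, verbose=0)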
A Neural Network for Regression
For regression on structured/tabular data, we might use a network with the following
architecture:
● Output layer: just one output neuron (assuming we're predicting a single number).
○ Activation function for the output neuron should be the linear function: g(z) = z
There are also biases in each layer except the output layer — Keras will give us these 'for free'.
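A hedged sketch of such a regression network in Keras (the number of input features and the two 64-neuron hidden layers are illustrative assumptions; only the output layer is dictated by the task):

import tensorflow as tf
from tensorflow.keras import layers

n_features = 10    # assumed number of input features in the tabular data

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),     # hidden layer 1 (size is a design choice)
    layers.Dense(64, activation="relu"),     # hidden layer 2
    layers.Dense(1)                          # one output neuron, linear activation g(z) = z
])
model.compile(optimizer="adam", loss="mse")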
Example: House Rent Prediction
We don't want too many hidden layers, nor too many neurons in each hidden layer. Why?
We need to scale the features. But, since we are now not using scikit-learn's
ColumnTransformer to create a preprocessor, we need to take care of the scaling ourselves.
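One option, sketched below under the assumption that the encoded numeric features are already assembled into an array X_train, is Keras's Normalization layer, whose adapt() method learns the scaling statistics from the training data:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for the already-encoded numeric feature matrix of the rent dataset.
X_train = np.random.rand(500, 12).astype("float32")

normalizer = layers.Normalization()    # standardizes each feature (mean 0, variance 1)
normalizer.adapt(X_train)              # learn the statistics from the training data only

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X_train.shape[1],)),
    normalizer,
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")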
A Neural Network for Binary Classification
For binary classification, we might use a network with the following architecture:
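The slide's diagram is not reproduced here; as a minimal illustrative sketch (the number of inputs and the hidden-layer sizes are assumptions), such a network ends in a single sigmoid neuron and is trained with binary cross-entropy:

import tensorflow as tf
from tensorflow.keras import layers

n_features = 10    # assumed number of input features

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid")    # outputs an estimate of P(class = 1)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])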
For multi-class classification, we might use a network with the following architecture:
● An input layer with 4 inputs (petal width and length, and sepal width and length).
● Two hidden layers, with 64 neurons in each, and ReLU activation function.
● An output layer with three neurons (one for Setosa, Versicolor and Virginica) and
softmax activation function.
Example: Iris Dataset
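The lecture's own code for this example is not reproduced above; a hedged sketch of the described architecture, using integer labels with sparse_categorical_crossentropy (the optimizer, the number of epochs, the train/test split and the use of a Normalization layer are assumptions), might look like:

import tensorflow as tf
from tensorflow.keras import layers
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)    # 4 features, integer class labels 0-2
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

normalizer = layers.Normalization()  # scale the four features
normalizer.adapt(X_train)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    normalizer,
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(3, activation="softmax")    # one neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, verbose=0)
print(model.evaluate(X_test, y_test, verbose=0))    # [loss, accuracy]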
Below, as an alternative, is code that illustrates one-hot encoding the target values using the
Keras function to_categorical, and then using categorical_crossentropy as the
loss function.
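A sketch of that alternative, reusing the model and data names from the sketch above (the number of epochs is an assumption):

from tensorflow.keras.utils import to_categorical

# One-hot encode the integer labels: 0 -> [1,0,0], 1 -> [0,1,0], 2 -> [0,0,1].
y_train_onehot = to_categorical(y_train, num_classes=3)
y_test_onehot = to_categorical(y_test, num_classes=3)

# Same architecture as above, but the loss now expects one-hot targets.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train_onehot, epochs=50, verbose=0)
print(model.evaluate(X_test, y_test_onehot, verbose=0))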
Example: Iris Dataset
Reported output: 0.8999999761581421
Observations:
● Neural networks are often not the best-performing approaches for structured data.
● And, sure enough, the results here are not great. Of course, there is a lot we can tweak to
see if we can improve the results.
Example: Fashion MNIST Dataset
● Dataset: 70,000 images, so we can safely use holdout, and it is already partitioned:
○ 60,000 training images; 10,000 test images.
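Loading the dataset (this is the standard Keras loader for Fashion MNIST):

import tensorflow as tf

# Pre-partitioned: 60,000 training and 10,000 test images, each 28 x 28 grayscale pixels.
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print(X_train.shape, X_test.shape)    # (60000, 28, 28) (10000, 28, 28)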
● One hidden layer with 300 neurons, using the ReLU activation function.
● Second hidden layer with 100 neurons, using the ReLU activation function.
● The output layer will have 10 neurons, one per class, and will use the softmax activation
function.
The features (pixel values) are all in the same range [0, 255], so we do not need to standardize
using a Normalization layer.
But it is a bad idea to feed into a neural network values that are much larger than the initial
weights, so we will rescale them to [0, 1] by dividing by 255. We can do this using a Rescaling layer.
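A hedged sketch of this architecture, reusing X_train, y_train, X_test and y_test from the loading snippet above (the optimizer, epoch count and validation split are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28)),
    layers.Rescaling(1.0 / 255),              # map pixel values from [0, 255] to [0, 1]
    layers.Flatten(),                         # 28 x 28 image -> vector of 784 inputs
    layers.Dense(300, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(10, activation="softmax")    # one neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, validation_split=0.1)
print(model.evaluate(X_test, y_test))         # [loss, accuracy]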
Remarks on Computer Vision Problems
In the 1960s, 70s, 80s and to some extent 90s, the typical pipeline for a computer vision (or
image processing) system was as follows:
● There would be a module that would extract features from the images.
○ They might include edges detected by some edge detection algorithm, for example.
(If you are interested, look up SIFT or SURF or HOG.)
● Then these features would be fed into a typical learning algorithm, e.g. logistic
regression.
Remarks on Computer Vision Problems
With deep learning, by contrast:
● There's no extraction of hand-crafted features. We feed in the raw pixel values (or
lightly-processed pixel values, e.g. scaled values).
● It is the layers of the neural network that automatically discover the features, and the
final layer that makes the classification.
○ In practice, computer vision (image processing) more often uses additional layer
types: convolutional layers, pooling layers, batch normalization layers, and so on.
We may study these in coming lectures.
Concluding Remarks
● A few decisions are constrained: number of inputs; number of output neurons; activation
function of output neurons; and (to some extent) loss function.
○ For the remaining choices (e.g. the number of hidden layers and neurons, the
hidden-layer activation functions, the optimizer, the learning rate and the batch
size), even making a good guess is more art than science, although this is changing.
○ On the other hand, grid search or randomized search will make things even slower
than they already are — and we still have to specify some sensible values for them
to search through.