L10 Learning II Gradient Based Learning
Semester I, 2024-25
Rohan Paul
Outline
• Last Class
• Basics of machine learning
• This Class
• Neural Networks
• Reference Material
• Please follow the notes as the primary reference on this topic. Additional
reading from AIMA book Ch. 18 (18.2, 18.6 and 18.7) and DL book Ch 6
sections 6.1 – 6.5 (except 6.4).
Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.
Learning a Best Fit Hypothesis: Linear Regression Task
Linear regression (no bias): y^ = w^T x
Here,
• w is a vector of weights (parameters) that we are trying to learn; T in superscript stands for the transpose of a matrix or vector.
• x is the input vector (e.g., features like visibility).
• y^ is the predicted output based on the model.
• ||.||_2 represents the Euclidean (L2) distance; m is the number of test samples.
Learning: optimizing the loss estimates the model parameters w and b.
Online: given the trained model, we can predict a value for a new input.
Examples:
- Predicting pollution levels from visibility.
- Predicting reactivity of a molecule from structural data.
Material from https://www.deeplearningbook.org/, Chapter 5
Linear Regression Example
Optimal w and the implied linear model.
• y^(train) is the model's prediction for the training data, and y(train) is the actual target values in the training set.
Learning/Training: optimize the error w.r.t. the model parameters w.
Inference: use the trained model, i.e. w, to perform predictions.
Material from https://www.deeplearningbook.org/, Chapter 5
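A minimal sketch of this train/infer loop in NumPy, using the closed-form least-squares solution; the training data below is a made-up placeholder:
```python
import numpy as np

# Assumed toy data: rows of X are feature vectors x, y holds targets.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1])

# Training: minimize the squared L2 error ||X w - y||_2^2.
# The least-squares solution is w = (X^T X)^{-1} X^T y; lstsq solves it stably.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Inference: predict with the trained w on new inputs.
X_test = np.array([[5.0], [6.0]])
y_hat = X_test @ w          # y_hat = w^T x for each test row

# Mean squared error over m test samples (if true targets y_test were available):
# mse = np.mean((y_hat - y_test) ** 2)
```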
Classification Task
1. Overview
Classification is about predicting categories rather than numerical values. For example, given an image, we might predict which species it contains (dolphin, cat, grizzly bear, etc.).
2. Model and Notation
Softmax Function: the softmax function is applied at the last stage to convert the model's output (a vector of raw scores, z) into probabilities, e.g., for classifying an image as containing certain species.
Image courtesy: https://mitpress.mit.edu/9780262048972/foundations-of-computer-vision/
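A small sketch of the softmax computation described above, in a numerically stable form that subtracts the maximum score; the scores z are made up for illustration:
```python
import numpy as np

def softmax(z):
    """Convert a vector of raw scores z into a probability distribution."""
    z = z - np.max(z)              # shift for numerical stability (does not change the result)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Raw scores for three classes, e.g., dolphin / cat / grizzly bear.
z = np.array([2.0, 1.0, -1.0])
p = softmax(z)                     # approx. [0.705, 0.259, 0.035]; sums to 1
```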
How much to fit the data?
Figure shows polynomial curve fitting with increasing degree k.
Overfitting and underfitting with polynomial functions.
• Weight penalty term
• Adding a term to the loss function that prefers a smaller squared sum of weights: a prior over the parameters.
• Penalizes a very complex model for explaining the data.
• Lambda parameter
• Selected ahead of time; controls the strength of our preference for smaller weights.
• It is a "hyper"-parameter that needs tuning, expressing our trade-off.
Material from https://www.deeplearningbook.org/, Chapter 5
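A hedged sketch of the regularized objective described above, written for the linear-regression case (the value of lambda and the data are placeholders):
```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """Squared error plus a penalty on the squared sum of weights."""
    residual = X @ w - y
    return residual @ residual + lam * (w @ w)

# Closed form for the regularized minimizer (ridge regression):
# w = (X^T X + lam * I)^{-1} X^T y
def fit_ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```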
Model capacity refers to the complexity or flexibility of the model. A model with low capacity (like a simple linear model) may not be able to
capture complex patterns, while a model with high capacity (like a deep neural network) can capture more complex relationships.
High capacity means the model can potentially fit complex patterns, but it might also fit noise in the data, leading to overfitting.
• Underfitting regime
• Training error and generalization
error are both high.
• Increase capacity
• Training error decreases
• Gap between training and
generalization error increases.
• Overfitting
• The size of the gap between training and generalization error outweighs the decrease in training error.
• Capacity is too large, above the optimal capacity.
We can train a model only on the training set. The test set is not available during training.
How do we know the generalization error when we cannot use the test set?
Material from https://www.deeplearningbook.org/, Chapter 5
Parameters (like weights w in linear regression) are learned directly from the data during training.
Hyperparameters are not learned from the data; they are set by the model designer. Examples include the learning rate, regularization strength (lambda),
and the number of layers in a neural network. Hyperparameters control the training process and the capacity of the model.
Gradient Descent:
Gradient: ∇J(θ) represents the gradient of the loss function J with respect to the parameters θ. It points in the direction of the steepest increase in J.
Update Rule: gradient descent updates the parameters in the opposite direction of the gradient to minimize the loss:
θ ← θ − α ∇J(θ)
Here, α is the learning rate, controlling the size of each step.
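A minimal gradient-descent loop implementing this update rule, with the mean-squared-error of a linear model standing in for J (learning rate and step count are arbitrary choices):
```python
import numpy as np

def grad_J(theta, X, y):
    """Gradient of J(theta) = (1/m) ||X theta - y||^2 with respect to theta."""
    m = X.shape[0]
    return (2.0 / m) * X.T @ (X @ theta - y)

def gradient_descent(X, y, alpha=0.01, num_steps=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(num_steps):
        theta = theta - alpha * grad_J(theta, X, y)   # step against the gradient
    return theta
```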
Gradient Descent
• Compute the gradient and take the step in that direction.
• Gradient computation
• Analytic
• Numerical
Figure: contours in (W_1, W_2) space; from the original W, a step is taken in the negative gradient direction.
Learning rate schedules
Gradient descent with adaptive learning rate.
Mini-batch Gradient Descent
Problem: large data sets. It is difficult to compute the full loss function over the entire training set in order to perform only a single parameter update.
Figure: contours in (W_1, W_2) space; from the original W, a noisy gradient is computed from a minibatch.
Stochastic Gradient Descent
Setting the mini-batch to contain only a single example.
Gradients are noisy but still make good progress on average.
Figure: contours in (W_1, W_2) space starting at the original W; true gradients in blue, minibatch gradients in red.
Gradient vs Mini-batch Gradient Descent
Cost function decomposes as a sum over training examples of a per-example loss function.
SGD: sample the batch randomly in each iteration. Random sampling adds some robustness to optimization.
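A sketch of the minibatch variant: each iteration samples a random batch and uses the gradient of the per-example losses in that batch in place of the full gradient (grad_J from the earlier sketch is reused; the batch size is an arbitrary choice):
```python
import numpy as np

def minibatch_sgd(X, y, grad_fn, alpha=0.01, batch_size=32, num_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    m = X.shape[0]
    for _ in range(num_steps):
        idx = rng.choice(m, size=min(batch_size, m), replace=False)   # random minibatch
        theta = theta - alpha * grad_fn(theta, X[idx], y[idx])        # noisy gradient step
    return theta
```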
Gradient Descent with Momentum
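The slide content here is a figure; as a sketch, assuming the standard (heavy-ball) form of momentum with coefficient beta:
```python
def momentum_step(theta, velocity, grad, alpha=0.01, beta=0.9):
    """One momentum update: accumulate an exponentially decaying average of past gradients."""
    velocity = beta * velocity - alpha * grad
    return theta + velocity, velocity
```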
Gradient Clipping
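Again only a figure in the original; a common form is clipping by global norm, sketched below (the threshold max_norm is a hyperparameter):
```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm; small gradients pass unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```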
Other Aspects
• Vanishing gradients
• Multiple local minima
Neural Networks: Biological
Neurons in the brain
• Activations and inhibitions
• Parallelism
• Connected networks
Modeling a Neuron
• Setup where there is a function that connects the inputs to the output.
• Problem: learning this function that maps the input to the output.
Perceptron: Model of a Single Neuron
• Introduced in the late 50s (Rosenblatt); limitations analyzed by Minsky and Papert.
• Perceptron convergence theorem (Rosenblatt, 1962): the perceptron will learn to classify any linearly separable set of inputs.
• Note: the earlier class talked about
model-based classification. Here,
we do not build a model. Operate
directly on feature weights.
Feature Space
• Extract features from data.
Example: an email beginning "Hello," with extracted features
# free : 0
YOUR_NAME : 1
MISSPELLED : 1
FROM_FRIEND : 1
Classification with a Perceptron
• A decision boundary is a hyperplane orthogonal to the weight vector.
Figure: a perceptron unit computes w1·f1 + w2·f2 + w3·f3 and tests whether the sum is > 0; output +1 = Class A, -1 = Class B. The resulting decision boundary is shown in (Feature 1, Feature 2) space.
Perceptron
One side of the decision boundary is class A and the other is class B. A threshold is introduced.
How to arrive at the separator?
Learning rule
• Classify with the current weights.
• If correct, no change.
• If the classification is wrong: adjust the weight vector by adding or subtracting the feature vector.
Figure: in (x1, x2) space, a misclassified point f leads to subtracting f from the weight vector.
https://en.wikipedia.org/wiki/Perceptron#:~:text=The%20first%20hardware%20implementation%20was,demonstrated%20on%2023%20June%201960.
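A minimal sketch of this learning rule for labels in {+1, -1} (the data is assumed to include a bias feature; the epoch count is arbitrary):
```python
import numpy as np

def train_perceptron(X, y, num_epochs=10):
    """X: (m, d) feature vectors; y: (m,) labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for f, label in zip(X, y):
            pred = 1 if w @ f >= 0 else -1        # classify with current weights
            if pred != label:                     # wrong: add or subtract the feature vector
                w = w + label * f
    return w
```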
Perceptron
Figure panels: Case: Linearly-separable; Case: Not Linearly-separable.
Perceptron learning rule converges to a perfect linear separator when the data points
are linearly separable. Problem when there is non-separable data.
Threshold Functions
Figure: a threshold/logistic function mapping scores to complementary class probabilities (0.9/0.1, 0.7/0.3, 0.5/0.5, 0.3/0.7, 0.1/0.9).
Logistic Output
• Logistic Function
• Very positive values: probability → 1.
• Very negative values: probability → 0.
• Makes the prediction by converting the score to a probability.
• Softens the decision boundary.
• Logistic Regression
• Fitting the weights of this model to
minimize loss on a data set is called
logistic regression.
Example (red or blue classes): the normalizer ensures the class probabilities sum to one.
Logistic Regression: estimating the weights using MLE, i.e., maximize the log-likelihood.
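A sketch of maximizing the log-likelihood by gradient ascent for binary labels y in {0, 1} (equivalently, minimizing the cross-entropy loss; the step size and iteration count are placeholders):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, num_steps=1000):
    """Gradient ascent on the log-likelihood sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        p = sigmoid(X @ w)                    # predicted probabilities
        grad = X.T @ (y - p)                  # gradient of the log-likelihood w.r.t. w
        w = w + alpha * grad / X.shape[0]
    return w
```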
Softmax Output
Prediction of the unnormalized probabilities.
• Multi-class setting
• A probability distribution over a
discrete variable with n possible
values.
• Generalization of the sigmoid
function to multiple outputs.
Exponentiate and normalize the values.
• Output of a classifier
• Distribution over n different
classes. The individual outputs
must sum to one.
Softmax Example
Multi-class Setting
Can a perceptron learn XOR?
Non-separability and Non-linear Functions
• Kernel map: the original feature space is mapped to some higher-dimensional feature space where the training set is separable.
Figure: a data point, given as a vector of features f1(x), f2(x), f3(x), ..., fK(x), is mapped to scores z1, z2, z3 and passed through a softmax. The output is class probabilities; in general the output depends on the machine learning task we are performing.
• We are designing features that represent a data point. We hope that the feature representation enables good classification performance (requires hand-crafting).
• As we will see, a neural network “learns” features to represent a data point for a machine learning task.
Multi-layer Perceptrons (MLPs)
• Connect input data to output. Structure these models by composing many units.
Figure: inputs x1, x2, x3, ..., xL pass through several layers of units, each applying g = a nonlinear activation function, ending in a softmax output.
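A minimal forward pass for such a composition of units, with a ReLU nonlinearity g and a softmax output (layer sizes and the random weights are illustrative only):
```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def mlp_forward(x, params):
    """params: list of (W, b) pairs; hidden layers use ReLU, the last layer feeds a softmax."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)            # hidden layer: affine map followed by nonlinearity g
    W_out, b_out = params[-1]
    return softmax(W_out @ h + b_out)  # class probabilities

# Example: 3 inputs -> 4 hidden units -> 2 classes (random weights just to run the code).
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
probs = mlp_forward(np.array([1.0, -0.5, 0.2]), params)
```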
Figure panels: mapping plots for simple functions that can be used in neural layers; mapping plot for a linear-ReLU composition.
Mapping diagrams for a data distribution
2D mapping diagram for several neural layers.
Target output moves the red and blue points closer to the ground truth data.
As training progresses the network gradually achieves this separation.
The outputs of the final layers before classification can
be considered as learned features for the data.
https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.67214&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
Learning XOR
XOR is not linearly separable.
Figure: network diagram with rectified linear activation; a more compact representation is also shown.
https://xnought.github.io/backprop-explainer/
Material from Ch 6, DL Book
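The network in the figure corresponds to the small ReLU solution worked out in the DL book, Sec. 6.1; the specific weights below reproduce that solution:
```python
import numpy as np

# f(x) = w^T max(0, W^T x + c) + b solves XOR with these parameters (DL book, Sec. 6.1).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def xor_net(x):
    h = np.maximum(0.0, W.T @ x + c)   # rectified linear hidden layer
    return w @ h + b

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # prints 0, 1, 1, 0
```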
Backpropagation and Computation Graphs
• Computation Graphs
• A way to organize the computation in
a neural network.
• Also enables identification and
caching of repeated sub-expressions.
Backpropagation: Toy Example
Backpropagation
Backpropagation: Example
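The worked example on the slide is a figure; as a stand-in, here is a tiny hand-rolled computation graph for L = (sigmoid(w*x + b) - y)^2 with gradients obtained by the chain rule (the input values are arbitrary):
```python
import math

# Forward pass through the graph: z = w*x + b, a = sigmoid(z), L = (a - y)^2.
x, y = 2.0, 1.0          # input and target
w, b = 0.5, -1.0         # parameters
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))
L = (a - y) ** 2

# Backward pass: apply the chain rule node by node, reusing cached forward values.
dL_da = 2.0 * (a - y)
da_dz = a * (1.0 - a)
dL_dz = dL_da * da_dz
dL_dw = dL_dz * x        # gradient w.r.t. w
dL_db = dL_dz            # gradient w.r.t. b
```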
Deep Neural Networks: Highly expressive
• Last layer
• Logistic regression
• Several Hidden Layers
• Computing the features. The features are learned rather than hand-designed.
Core Idea:
- Encourage smaller weights so that the output function is smoother.
- Output is the weighted sum of the previous layer.
- If the input magnitudes are small then the output
variation is small.
Preventing Over-fitting: Early Stopping
• Stop the training procedure before convergence.
• Reduces overfitting if the model has already captured the coarse shape of
the function before it starts to fit to noise.
• Single hyper-parameter – number of steps before learning is terminated.
• Chosen empirically using a validation set (without the need to train multiple models)
• The model is trained once, the performance on the validation set is monitored every T
iterations, and the associated parameters are stored. The stored parameters where the
validation performance was best are selected.
• Intuition
• Weights are often initialized to small values. With early stopping they do not get a chance to
become large.
• Activations remain in the linear range without saturating.
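A sketch of the procedure described above, checking validation performance every T iterations and keeping the best parameters seen so far (train_step, val_loss, and the model.params attribute are assumed helpers, not defined here):
```python
import copy

def train_with_early_stopping(model, train_step, val_loss, max_steps, T=100):
    """Train once; every T steps, store the parameters with the best validation loss so far."""
    best_loss = float("inf")
    best_params = copy.deepcopy(model.params)
    for step in range(1, max_steps + 1):
        train_step(model)                       # one optimization step (assumed helper)
        if step % T == 0:
            loss = val_loss(model)              # monitor the validation set (assumed helper)
            if loss < best_loss:
                best_loss = loss
                best_params = copy.deepcopy(model.params)
    model.params = best_params                  # restore the stored best parameters
    return model
```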
Preventing Over-fitting: Dropouts
• A way to reduce the test set error at the cost of
making it hard to fit on the training data.
• Each step of training
• Dropout applies one step of backprop to a new version of the network.
• Created by deactivating a randomly chosen subset
of units.
• During training, dropout samples from an
exponential number of different “thinned”
networks.
• At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.
Preventing Over-fitting: Dropouts
https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
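A sketch of the training-time masking described above, in the inverted-dropout variant: instead of shrinking weights at test time, surviving activations are rescaled during training, which has the same averaging effect (the keep probability p_keep is a hyperparameter):
```python
import numpy as np

def dropout(h, p_keep=0.5, train=True, rng=None):
    """Randomly deactivate units during training; rescale so expected activations match test time."""
    if not train:
        return h                                   # test time: use the single unthinned network
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) < p_keep            # keep each unit with probability p_keep
    return h * mask / p_keep                       # inverted dropout: rescale surviving activations
```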
Improving SGD Convergence
• Batch-normalization
• Standardizes each neural activation with respect
to its mean and variance.
• Over the mini-batch of data points.
• In effect adjusts the distribution observed by
the deeper layers from the outputs coming from
the earlier layers.
• A problem called internal covariate shift.
• L2- Normalization
• Projects the inputs onto a unit hypersphere.
• Useful for bounding activations to unit vectors.
• The above two can be considered as non-linear layers in the neural network.
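A minimal sketch of the training-time standardization over a mini-batch, including the learnable scale gamma and shift beta of the standard formulation (eps avoids division by zero):
```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """H: (batch, features). Standardize each activation over the mini-batch, then rescale."""
    mu = H.mean(axis=0)                      # per-feature mean over the mini-batch
    var = H.var(axis=0)                      # per-feature variance over the mini-batch
    H_hat = (H - mu) / np.sqrt(var + eps)    # standardized activations
    return gamma * H_hat + beta              # learnable scale and shift
```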
Neural Networks: Successes