L10 Learning II Gradient Based Learning


COL333/671: Introduction to AI

Semester I, 2024-25

Learning – II: Gradient-based Learning and Neural Networks

Rohan Paul

Outline
• Last Class
• Basics of machine learning
• This Class
• Neural Networks
• Reference Material
• Please follow the notes as the primary reference on this topic. Additional
reading from AIMA book Ch. 18 (18.2, 18.6 and 18.7) and DL book Ch 6
sections 6.1 – 6.5 (except 6.4).

Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.

Learning a Best Fit Hypothesis: Linear Regression Task

Linear regression (no bias): $\hat{y} = w^T x$
Linear regression (with bias): $\hat{y} = w^T x + b$

Here,
• $w$ is a vector of weights (parameters) that we are trying to learn. The superscript T stands for the transpose of a matrix or vector.
• $x$ is the input vector (e.g., features like visibility).
• $\hat{y}$ is the predicted output based on the model.

Error term/Loss term: $\frac{1}{m} \| \hat{y} - y \|_2^2$, where $\| \cdot \|_2$ represents the Euclidean (L2) distance and m is the number of samples (here, test samples).

Learning: Optimizing the loss will estimate the model parameters w and b.
Online: Given the trained model, we can predict a value.

Examples:
- Predicting pollution levels from visibility.
- Predicting reactivity of a molecule from structural data.

Material from https://www.deeplearningbook.org/ Chapter 5
Linear Regression Example
Optimal w and the implied linear model
$\hat{y}^{(train)}$ is the model’s prediction for the training data, and $y^{(train)}$ is the actual target values in the training set.

Learning/Training: Optimize the error w.r.t. the model parameters, w.
Inference: Use the trained model, i.e. w, to perform predictions.

[Figure panels: Function/model space; Parameter space.]

Material from https://www.deeplearningbook.org/ Chapter 5
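
The training and inference steps above can be sketched for the simplest case of a single scalar feature with a bias. This is a minimal illustration, not the slide's own example: the data values and the function names `fit_line` and `mse` are assumptions made here, and the closed-form solution is the standard least-squares one.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y_hat = w * x + b for scalar inputs."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    # w = cov(x, y) / var(x); b makes the line pass through the means.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def mse(xs, ys, w, b):
    """Mean squared error (1/m) * sum((y_hat - y)^2)."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training data generated from y = 2x + 1 (an assumption).
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]

w, b = fit_line(train_x, train_y)   # learning/training
prediction = w * 4.0 + b            # inference on a new input
train_error = mse(train_x, train_y, w, b)
```

With more than one feature the same idea applies, but the optimum is usually found with linear algebra routines or with the gradient-based methods discussed later.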
1. Overview
Classification is about predicting categories rather than numerical values. For example, given an image, we might predict which species it contains (dolphin, cat, grizzly bear, etc.).
2. Model and Notation
Softmax Function: The softmax function is applied to convert the model’s output (a vector of raw scores, z) into probabilities.

Classification Task
Softmax applied at the last stage Classification of an image as containing certain species.

Image courtesy: https://mitpress.mit.edu/9780262048972/foundations-of-computer-vision/
How much to fit the data?
Overfitting and underfitting with polynomial functions
[Figure: polynomial curve fitting with increasing k.]

We seek a reasonable model that is neither underfitting nor overfitting.

That means we prefer certain types of models.

How do we “regularize” our solution, i.e., incorporate certain preferences?

Core problem in machine learning: How to fit the data just enough?
Regularization
• Adding a preference for one solution in the hypothesis space over another.
• Incorporate that preference in the objective function we are optimizing.

• Weight term
• Adding a term to the loss function that prefers a smaller sum of squared weights. A prior over the parameters.
• Penalize a very complex model used to explain the data.

• Lambda parameter
• Selected ahead of time; controls the strength of our preference for smaller weights.
• It is a “hyper”-parameter that needs tuning, expressing our trade-off.

Material from https://www.deeplearningbook.org/ Chapter 5
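
As a rough sketch of the weight term described above, the function below adds lambda times the sum of squared weights to a mean squared error. The function name `regularized_loss`, the choice of MSE as the data term, and the decision to leave the bias unpenalized are assumptions made for illustration.

```python
def regularized_loss(w, b, xs, ys, lam):
    """MSE plus lam * (sum of squared weights).

    xs is a list of feature vectors, ys a list of targets,
    lam is the regularization strength (a hyper-parameter).
    """
    m = len(xs)
    data_term = 0.0
    for x, y in zip(xs, ys):
        y_hat = sum(wi * xi for wi, xi in zip(w, x)) + b
        data_term += (y_hat - y) ** 2
    data_term /= m
    # Preference for smaller weights; the bias is left unpenalized here.
    penalty = lam * sum(wi ** 2 for wi in w)
    return data_term + penalty


# Hypothetical data that w = [1, 2], b = 0 fits exactly.
example_xs = [[1.0, 0.0], [0.0, 1.0]]
example_ys = [1.0, 2.0]
loss_no_reg = regularized_loss([1.0, 2.0], 0.0, example_xs, example_ys, 0.0)
loss_with_reg = regularized_loss([1.0, 2.0], 0.0, example_xs, example_ys, 0.1)
```

Setting `lam` to zero recovers the unregularized loss; larger values trade data fit for smaller weights.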
Model capacity refers to the complexity or flexibility of the model. A model with low capacity (like a simple linear model) may not be able to
capture complex patterns, while a model with high capacity (like a deep neural network) can capture more complex relationships.
High capacity means the model can potentially fit complex patterns, but it might also fit noise in the data, leading to overfitting.

Relation between model capacity and error


Generalization error (often measured as test error): The error the model makes on unseen data (test data). It
reflects how well the model generalizes beyond the training set.

• Underfitting regime
• Training error and generalization
error are both high.
• Increase capacity
• Training error decreases
• Gap between training and
generalization error increases.
• Overfitting
• The size of this gap outweighs the
decrease in training error,
• capacity is too large, above the
optimal capacity.

We can train a model only on the training set. The test set is not available during training.
How do we know the generalization error when we cannot use the test set?

Material from https://www.deeplearningbook.org/ Chapter 5
Parameters (like weights w in linear regression) are learned directly from the data during training.
Hyperparameters are not learned from the data; they are set by the model designer. Examples include the learning rate, regularization strength (lambda),
and the number of layers in a neural network. Hyperparameters control the training process and the capacity of the model.

Hyper-parameters and Validation Sets


• Parameters: P(X), P(Y|X)
• Hyper-parameters: k, lambda etc.
• Selecting hyper-parameters
  • For each value of the hyperparameters, train on the training data and test on the held-out (validation) data set.
  • Choose the best value and do a final test on the test data.

Training Data: Used to learn the model parameters.
Validation Data: Used to evaluate different hyperparameter settings and choose the best model.
Test Data: Used only once, after hyperparameters are chosen, to estimate the generalization error.

[Figure: the data set split into Training Data, Validation Data and Test Data.]
Gradient-based Learning
• The loss term is composed of the predictions under the weights and the regularization term.
• The goal is to optimize the loss function with respect to the parameters.

Gradient Descent:
Gradient: $\nabla_\theta J(\theta)$ represents the gradient of the loss function J with respect to the parameters $\theta$. It points in the direction of the steepest increase in J.
Update Rule: The gradient descent algorithm updates the parameters in the opposite direction of the gradient to minimize the loss:
$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$
Here, $\alpha$ is the learning rate, controlling the size of each step.
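
A minimal sketch of the update rule on a one-dimensional problem may help. The loss J(theta) = (theta - 3)^2, the starting point and the step count are all assumptions chosen so that the minimum (theta = 3) is known in advance.

```python
def grad_J(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, alpha, steps):
    """Repeatedly apply theta <- theta - alpha * grad J(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad_J(theta)
    return theta


theta_star = gradient_descent(theta0=0.0, alpha=0.1, steps=100)
```

With this learning rate the distance to the minimum shrinks by a constant factor each step; a learning rate that is too large (here, above 1.0) would make the iterates diverge instead.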

Gradient Descent
• Compute the gradient and take a step in the negative gradient direction.
• Gradient computation
  • Analytic
  • Numerical

[Figure: loss contours over weights W_1 and W_2, showing the original W and the negative gradient direction.]
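
The two ways of computing a gradient can be compared directly. The sketch below uses a small hypothetical function f(w) = w1^2 + 3 w1 w2 (an assumption, not from the slides), derives its gradient by hand, and checks it against a central finite-difference estimate.

```python
def f(w):
    """Toy scalar function of two weights."""
    return w[0] ** 2 + 3.0 * w[0] * w[1]


def analytic_grad(w):
    """Gradient of f derived by hand: [2*w1 + 3*w2, 3*w1]."""
    return [2.0 * w[0] + 3.0 * w[1], 3.0 * w[0]]


def numerical_grad(func, w, h=1e-5):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((func(w_plus) - func(w_minus)) / (2.0 * h))
    return grad


point = [1.0, 2.0]
g_analytic = analytic_grad(point)
g_numerical = numerical_grad(f, point)
```

The numerical version needs two function evaluations per parameter, which is why it is typically used only to check an analytic gradient rather than for training.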

Learning rate schedules
Gradient descent with adaptive learning rate.

Types of learning rate schedules.

Mini-batch Gradient Descent
Problem: Large data sets. It is difficult to compute the full loss function over the entire training set in order to perform only a single parameter update.

Solution: Use only a small portion of the training set to compute the gradient.

[Figure: loss contours over weights W_1 and W_2, showing the original W and the noisy gradient from a minibatch.]
Stochastic Gradient Descent
Setting the mini-batch to contain only a single example.
Gradients are noisy but still make good progress on average.

[Figure: loss contours over weights W_1 and W_2 starting from the original W; true gradients in blue, minibatch gradients in red.]
Gradient vs Mini-batch Gradient Descent
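Cost function decomposes as a sum over training examples of a per-example loss function:
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)$

Gradient for an additive cost function:
$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$

Mini-batch of size m’, where m’ is kept constant while m is increasing.
Update the parameters with the estimated gradient from the mini-batch.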
Theta denotes the parameters. Same as w used in the previous slides.
Stochastic Gradient Descent

SGD – sample the batch randomly in each iteration. Random sampling adds some robustness to
optimization.
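
As an illustration of sampling a random batch in each iteration, here is a small sketch of mini-batch SGD for a scalar linear model with squared error. The data, batch size, learning rate and the name `sgd_linear_regression` are assumptions; the fixed seed only makes the run repeatable.

```python
import random


def sgd_linear_regression(xs, ys, batch_size=4, alpha=0.05, steps=5000, seed=0):
    """Fit y_hat = w * x + b with mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # Sample m' = batch_size example indices at random.
        batch = rng.sample(range(m), batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            err = w * xs[i] + b - ys[i]
            grad_w += 2.0 * err * xs[i]
            grad_b += 2.0 * err
        # Average the per-example gradients, then take one step.
        w -= alpha * grad_w / batch_size
        b -= alpha * grad_b / batch_size
    return w, b


# Hypothetical noiseless data from y = 2x + 1.
data_x = [i / 10.0 for i in range(20)]
data_y = [2.0 * x + 1.0 for x in data_x]
w_sgd, b_sgd = sgd_linear_regression(data_x, data_y)
```

Each update costs the same regardless of how large the full training set is, since only `batch_size` examples are touched per step.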

Gradient Descent with Momentum

Gradient Clipping

Without clipping – instabilities With clipping – stable convergence
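
The slide does not say which form of clipping is meant. One common variant, assumed here, rescales the gradient whenever its L2 norm exceeds a threshold, so the direction is kept but the step size is bounded. The name `clip_by_norm` and the threshold values are illustrative.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]


clipped_large = clip_by_norm([3.0, 4.0], max_norm=1.0)   # norm 5 -> norm 1
clipped_small = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5, untouched
```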

Other Aspects
Vanishing gradients Multiple local minima

Neural Networks: Biological
Neurons in the brain
• Activations and inhibitions
• Parallelism
• Connected networks
Modeling a Neuron

Main processing unit

• Setup where there is a function that connects the inputs to the output.

• Problem: learning this function that maps the input to the output.
Perceptron: Model of a Single Neuron
• Introduced in the late 50s by Rosenblatt.
• Later analysed by Minsky and Papert.
• Perceptron convergence theorem (Rosenblatt 1962):
• Perceptron will learn to classify any linearly separable set of inputs.
• Note: the earlier class talked about
model-based classification. Here,
we do not build a model. Operate
directly on feature weights.
Feature Space
• Extract features from data.
• Learn a model with these features.
• Data can be viewed as a point in the feature space.

Example (email): “Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ...”
  # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...  ->  SPAM or +

Example (digit image):
  PIXEL-7,12 : 1, PIXEL-7,13 : 0, ..., NUM_LOOPS : 1, ...  ->  “2”

Other points in the email feature space:
  # free : 1, YOUR_NAME : 0, MISSPELLED : 1, FROM_FRIEND : 0, ...
  # free : 0, YOUR_NAME : 1, MISSPELLED : 1, FROM_FRIEND : 1
Classification with a Perceptron
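• A decision boundary is a hyperplane orthogonal to the weight vector.

[Figure: features f1, f2, f3 are multiplied by weights w1, w2, w3 and summed; the sum is tested against > 0. A plot of Feature 1 against Feature 2 shows the two outputs: +1 = Class A, -1 = Class B.]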
Perceptron

Decision rule (Binary case)

Binary classification task

One side of the decision boundary is class A and the other is class B. A threshold is introduced.
How to arrive at the separator?
Learning rule
• Classify with the current weights.
• If correct, no change.
• If the classification is wrong: adjust the weight vector by adding or subtracting the feature vector.

[Figure: binary classification task in the (x1, x2) plane; a misclassified point is corrected by subtracting f.]


Animation: https://adrianstoll.com/post/perceptron-animation/
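
The learning rule can be sketched in a few lines. The labels are taken as +1 (Class A) and -1 (Class B), so "adding or subtracting the feature vector" becomes adding y * f. The toy data, the constant first feature acting as a bias, and the epoch limit are assumptions for illustration.

```python
def predict(w, f):
    """Decision rule: +1 if the weighted sum is positive, else -1."""
    activation = sum(wi * fi for wi, fi in zip(w, f))
    return 1 if activation > 0 else -1


def train_perceptron(data, epochs=100):
    """data: list of (feature_vector, label) pairs with label in {+1, -1}."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        mistakes = 0
        for f, y in data:
            if predict(w, f) != y:
                # Wrong: add f for a missed +1, subtract f for a missed -1.
                w = [wi + y * fi for wi, fi in zip(w, f)]
                mistakes += 1
        if mistakes == 0:  # every point classified correctly
            break
    return w


# Hypothetical linearly separable data; the first feature is a constant bias.
toy_data = [
    ([1.0, 2.0, 2.0], 1),
    ([1.0, 0.0, 0.0], -1),
    ([1.0, 3.0, 1.0], 1),
    ([1.0, 1.0, 0.0], -1),
]
w_perceptron = train_perceptron(toy_data)
```

On separable data such as this the loop stops once a full pass makes no mistakes; on non-separable data it would simply run out of epochs.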
Some History
A physical realization of the perceptron machine.

https://en.wikipedia.org/wiki/Perceptron#:~:text=The%20first%20hardware%20implementation%20was,demonstrated%20on%2023%20June%201960.
Perceptron
Case: Linearly-separable Case: Not Linearly-separable

Perceptron learning rule converges to a perfect linear separator when the data points
are linearly separable. Problem when there is non-separable data.
Threshold Functions

• Till now, threshold functions were linear.
• Can we modify the threshold function to handle the non-separable case?
• Can we “soften” the outputs?
Boolean Functions and Perceptron
Non-separable case
Deterministic Decisions Probabilistic Decisions

0.9 | 0.1
0.7 | 0.3
0.5 | 0.5
0.3 | 0.7
0.1 | 0.9
Logistic Output
• Logistic Function
• Very positive values. Probability -> 1
• Very negative values. Probability -> 0.
• Makes the prediction. Converts to a
probability
• Softens the decision boundary.
• Logistic Regression
• Fitting the weights of this model to
minimize loss on a data set is called
logistic regression.
Example (red or blue classes)

definitely blue not sure definitely red

probability increases exponentially as we move away from boundary

Normalizer
Estimating weights using MLE
Logistic Regression
Maximize the log-likelihood
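
A small sketch of estimating the weights by maximizing the log-likelihood follows. It assumes a single scalar feature, labels in {0, 1}, and plain gradient ascent with a fixed step size; the data and the names `fit_logistic` and `log_likelihood` are illustrative choices, not the slide's notation.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(w, b, xs, ys):
    """Sum over examples of log P(y | x) under the logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(xs, ys, alpha=0.1, steps=200):
    """Gradient ascent on the log-likelihood."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # d/dw = sum (y - p) * x, d/db = sum (y - p)
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(xs, ys))
        grad_b = sum((y - sigmoid(w * x + b)) for x, y in zip(xs, ys))
        w += alpha * grad_w
        b += alpha * grad_b
    return w, b


# Hypothetical data with some overlap between the two classes.
lr_x = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
lr_y = [0, 0, 1, 0, 1, 1]
w_lr, b_lr = fit_logistic(lr_x, lr_y)
```

Maximizing the log-likelihood is the same as minimizing its negative, which is the loss usually reported for logistic regression.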
Softmax Output
Prediction of the unnormalized probabilities.
• Multi-class setting
• A probability distribution over a
discrete variable with n possible
values.
• Generalization of the sigmoid
function to multiple outputs.
Exponentiate and normalize the values.
• Output of a classifier
• Distribution over n different
classes. The individual outputs
must sum to one.
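
"Exponentiate and normalize" can be written directly. The sketch below also subtracts the largest score before exponentiating, a standard numerical safeguard that does not change the result; the example scores are made up.

```python
import math


def softmax(z):
    """Turn a list of raw scores into probabilities that sum to one."""
    shift = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])    # hypothetical scores for three classes
uniform = softmax([0.0, 0.0])       # equal scores give equal probabilities
large = softmax([1000.0, 1000.0])   # still finite thanks to the shift
```

With two classes, the softmax of the scores (z, 0) equals the sigmoid of z, which is the sense in which softmax generalizes the sigmoid.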
Softmax Example
Multi-class Setting
Can a perceptron learn XOR?
Non-separability and Non-linear Functions
• The original feature space is mapped to some higher-dimensional feature space where the training set is separable.
• Need a non-linear function to describe the features.
• Applying a non-linear kernel map. Affine transformation.

[Figure from Ray Mooney: a kernel map takes data that is non-separable to a space where it is separable in a higher dimension.]

Example: Kernel Map
Features “represent” an input data point
The network is learning the decision boundary, the function that separates the data.

[Figure: a data point enters as a vector of features f1(x), f2(x), f3(x), …, fK(x); these feed z1, z2, z3 and a softmax. Output: class probabilities; in general the output depends on the machine learning task we are performing.]

• We are designing features that represent a data point. We hope that the feature representation enables good classification performance (requires hand-crafting).
• As we will see, a neural network “learns” features to represent a data point for a machine learning task.
“Deep” Neural Networks
• Connect input data to output. Structure these models by composing many units.
• Feature learning is implicit in the network. The paradigm is called deep learning.

[Figure: inputs x1, x2, x3, …, xL pass through several layers of units, with g = nonlinear activation function, ending in a softmax.]
Multi-layer Perceptrons (MLPs)

Common for networks to be “deep” -> Leads to expressive models.
Activation Functions

Note: Activations introduce non-linearity. Each unit “knows” its derivative. Used later to compute gradients automatically.

[source: MIT 6.S191 introtodeeplearning.com]
Common Activation Functions
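
The slide's figure is not reproduced here, so the selection below (sigmoid, tanh and ReLU) is an assumption about which activations it lists; these three are widely used. Each function is paired with its derivative, which is what a unit needs to "know" for gradient computation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh(z):
    return math.tanh(z)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    # The derivative at exactly 0 is undefined; 0 is used here by convention.
    return 1.0 if z > 0 else 0.0
```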
Deep Neural Networks: In essence
• During training one learns the parameters of the network.
• At inference time, directly apply the weights to form the prediction.
• Deep nets consist of linear layers interleaved with non-linearities.
Neural Networks are Distribution Transformers
What does a long sequence of linear and non-linear function compositions lead to?

Mapping plots for simple functions that can be used in neural layers Mapping plot for a linear-relu composition
Mapping diagrams for a data distribution
2D mapping diagram for several neural layers.

• The linear layer mapping will shift, stretch and rotate depending on its weights and biases.
• The ReLU will map many points to the axes of the positive quadrant. Density will build up along the axes.
Binary Classification Example
An MLP with three linear layers and two outputs, suitable for performing binary softmax regression.
Linear-relu-linear-relu-linear architecture

Goal of the neural network classifier is to arrange the input data distribution to match the target label distribution (two points in the right image).

Left shows original data distribution. Right shows the target output (one-hot codes).
Remapping input data layer by layer

Target output moves the red and blue points closer to the ground truth data.
As training progresses the network gradually achieves this separation.
The outputs of the final layers before classification can be considered as learned features for the data.

Representing complex functions


Key success of neural networks is in learning complex functions. The ability to learn complex functions is crucial for performing complex classification, regression and other machine learning tasks.

https://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.67214&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
Learning XOR
XOR is not linearly separable.

[Figure panels: Rectified Linear Activation; Network Diagram; More compact representation.]

Example from Ch 6, DL Book


Learning XOR
[Figure panels: Network Diagram; Model; XOR is separable in the transformed space.]

Takeaway: Applying ReLU to the output of a linear transformation yields a non-linear transformation. The problem can be solved in the transformed space.
Example from Ch 6, DL Book
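
The takeaway can be checked numerically with a one-hidden-layer ReLU network. The weights below are the hand-constructed solution given in Ch 6 of the DL book; that the slide's figure shows exactly these numbers is an assumption, and the variable names are chosen here for illustration.

```python
def relu_vec(v):
    return [max(0.0, a) for a in v]


# Hidden layer: h = relu(W^T x + c); output: y = w^T h + b.
W = [[1.0, 1.0],
     [1.0, 1.0]]
c = [0.0, -1.0]
w_out = [1.0, -2.0]
b_out = 0.0


def xor_net(x):
    pre = [sum(W[i][j] * x[i] for i in range(2)) + c[j] for j in range(2)]
    h = relu_vec(pre)  # non-linear transformation of the input
    return sum(wj * hj for wj, hj in zip(w_out, h)) + b_out


inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
outputs = [xor_net(x) for x in inputs]  # XOR of the two input bits
```

In the hidden space the inputs (0,1) and (1,0) collapse to the same point, which is what makes a single linear output unit sufficient.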
Backpropagation and Computation Graphs
• Training NNs
• Backpropagation
• In a NN, need a way to optimize the
output loss with respect to the inputs.
• Core question: assess how the output changes if the input changes by a certain amount.
• Apply the chain rule to obtain the
gradient.
• How to manage this computation
efficiently?

https://xnought.github.io/backprop-explainer/
Material from Ch 6, DL Book
Backpropagation and Computation Graphs
• Computation Graphs
• A way to organize the computation in
a neural network.
• Also enables identification and
caching of repeated sub-expressions.

Material from Ch 6, DL Book


Backpropagation: Toy Example

Example from: http://cs231n.stanford.edu/slides/2019/cs231n_2019_lecture04.pdf

Once this gradient is computed, the downstream gradients can “share” this computation.
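
Since the slide figures are not reproduced, the sketch below assumes a small function f = (x + y) * z to show a forward pass followed by a backward pass through a computation graph. The point to notice is that the gradient with respect to the intermediate node q is computed once and then reused for both x and y.

```python
def forward_backward(x, y, z):
    # Forward pass: evaluate each node of the graph.
    q = x + y
    f = q * z

    # Backward pass: apply the chain rule from the output back to the inputs.
    df_df = 1.0
    df_dq = z * df_df      # f = q * z, so df/dq = z
    df_dz = q * df_df      # and df/dz = q
    df_dx = 1.0 * df_dq    # q = x + y, so dq/dx = 1; reuses df_dq
    df_dy = 1.0 * df_dq    # dq/dy = 1; reuses the same df_dq
    return f, (df_dx, df_dy, df_dz)


value, grads = forward_backward(-2.0, 5.0, -4.0)
```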
Backpropagation
Backpropagation: Example
Deep Neural Networks: Highly expressive
• Last layer
• Logistic regression
• Several Hidden Layers
• Computing the features. The features are learned rather than hand-designed.

• Universal function approximation theorem
• If the neural net is large enough
• Then the neural net can represent any continuous mapping from input to output with arbitrary accuracy
• Note: overfitting is a challenge.
• In essence, hyper-parametric function approximation.
Preventing Over-fitting: Weight Decay
L2-regularisation

Core Idea:
- Encourage smaller weights so that the output function is smoother.
- Output is the weighted sum of the previous layer.
- If the input magnitudes are small then the output
variation is small.
Preventing Over-fitting: Early Stopping
• Stop the training procedure before convergence.
• Reduces overfitting if the model has already captured the coarse shape of
the function before it starts to fit to noise.
• Single hyper-parameter – number of steps before learning is terminated.
• Chosen empirically using a validation set (without the need to train multiple models)
• The model is trained once, the performance on the validation set is monitored every T
iterations, and the associated parameters are stored. The stored parameters where the
validation performance was best are selected.
• Intuition
• Weights are often initialized to small values. With early stopping they do not get a chance to
become large.
• Activations remain in the linear range without saturating.
Preventing Over-fitting: Dropouts
• A way to reduce the test set error at the cost of
making it hard to fit on the training data.
• Each step of training
• Dropout applies one step of backprop to a new version of the network.
• Created by deactivating a randomly chosen subset of units.
• During training, dropout samples from an exponential number of different “thinned” networks.
• At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.
Preventing Over-fitting: Dropouts

• If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time. This ensures that for any hidden unit the expected output during training is the same as the actual output at test time.
• 2^n networks with shared weights can be combined into a single neural network to be used at test time.

Intuition
- Forces neurons to be useful as a whole, paying attention to what others are learning.

https://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf
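
The train-time and test-time behaviour can be sketched on a list of activations. Scaling the activations by p at test time has the same effect on the next layer as scaling the unit's outgoing weights by p. The function names and the use of a seeded `random.Random` are assumptions for illustration.

```python
import random


def dropout_train(activations, p, rng):
    """Training: keep each unit with probability p, otherwise output 0."""
    return [a if rng.random() < p else 0.0 for a in activations]


def dropout_test(activations, p):
    """Test: keep every unit but scale by p to match the training-time mean."""
    return [a * p for a in activations]


rng = random.Random(0)
thinned = dropout_train([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng)  # one "thinned" sample
scaled = dropout_test([2.0, 4.0], p=0.5)
```

Averaging `dropout_train` over many random draws approaches the `dropout_test` output, which is the sense in which the single scaled network approximates the ensemble of thinned networks.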
Improving SGD Convergence
• Batch-normalization
• Standardizes each neural activation with respect
to its mean and variance.
• Over the mini-batch of data points.
• In effect adjusts the distribution observed by
the deeper layers from the outputs coming from
the earlier layers.
• A problem called internal covariate shift.

• L2-Normalization
• Projects the inputs onto a unit hypersphere.
• Useful for bounding activations to unit vectors.
• The above two can be considered as non-linear layers in the neural network.
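
Both normalizations are short to write down. The sketch below standardizes a single activation across a mini-batch and, separately, projects a vector onto the unit hypersphere. It is a simplification: the learned scale and shift that batch normalization layers usually carry are omitted, and the small `eps` constants are assumed safeguards against division by zero.

```python
import math


def batch_norm(values, eps=1e-5):
    """Standardize one activation over a mini-batch (no learned scale/shift)."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / m
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def l2_normalize(vector, eps=1e-12):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / (norm + eps) for v in vector]


normalized_batch = batch_norm([1.0, 2.0, 3.0, 4.0])  # hypothetical activations
unit_vector = l2_normalize([3.0, 4.0])
```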
Neural Networks: Successes
