Gradient-Based Learning & Neural Networks

Uploaded by

Abhijeet Choudhary

COL333/671: Introduction to AI

Semester I, 2024-25

Learning – II: Gradient-based Learning and Neural Networks

Rohan Paul

Outline
• Last Class
• Basics of machine learning
• This Class
• Neural Networks
• Reference Material
• Please follow the notes as the primary reference on this topic. Additional
reading from AIMA book Ch. 18 (18.2, 18.6 and 18.7) and DL book Ch 6
sections 6.1 – 6.5 (except 6.4).

Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.

Learning a Best Fit Hypothesis: Linear Regression Task
Linear regression (no bias): $\hat{y} = w^\top x$
Here, w is the vector of weights (parameters) that we are trying to learn, x is the input vector (e.g., features like visibility), and $\hat{y}$ is the output predicted by the model. The superscript $\top$ denotes the transpose of a matrix or vector.

Linear regression (with bias): $\hat{y} = w^\top x + b$

Error term/Loss term: $\mathrm{MSE}_{\text{test}} = \frac{1}{m} \lVert \hat{y}^{(\text{test})} - y^{(\text{test})} \rVert_2^2$
Here, $\lVert \cdot \rVert_2$ is the Euclidean (L2) norm and m is the number of test samples.

Learning: Optimizing the loss estimates the model parameters w and b.
Online: Given the trained model, we can predict a value.

Examples:
- Predicting pollution levels from visibility.
- Predicting reactivity of a molecule from structural data.

Material from [Link], Chapter 5
Linear Regression Example
Optimal w and the implied linear model
y^(train) is the model’s prediction for the training data, and
y(train) is the actual target values in the training set.

Learning/Training:
Optimize the error w.r.t. the model parameters, w

Function/model space Parameter space

Inference:
Use the trained model, i.e., w, to perform predictions.

Material from [Link], Chapter 5
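The learning and inference steps for linear regression can be sketched in a few lines of NumPy. This is a minimal sketch, not the course's reference code: the synthetic data, the helper names (`fit_linear_regression`, `predict`, `mse`) and the use of `np.linalg.lstsq` as the optimizer are all assumptions made for illustration.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit of y ≈ X @ w + b; returns (w, b)."""
    # Append a column of ones so the bias b is learned as an extra weight.
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return theta[:-1], theta[-1]

def predict(X, w, b):
    """Inference: apply the trained parameters to new inputs."""
    return X @ w + b

def mse(y_hat, y):
    """Mean squared error between predictions and targets."""
    return float(np.mean((y_hat - y) ** 2))

# Hypothetical training data: y = 3x + 0.5 plus a little noise.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(50, 1))
y_train = 3.0 * X_train[:, 0] + 0.5 + 0.1 * rng.normal(size=50)

w, b = fit_linear_regression(X_train, y_train)
train_mse = mse(predict(X_train, w, b), y_train)
```

The recovered `w` and `b` should be close to the values used to generate the data, and `train_mse` should be close to the noise variance.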
Classification Task
1. Overview
Classification is about predicting categories rather than numerical values. For example, given an image, we might predict which species it contains (dolphin, cat, grizzly bear, etc.).
2. Model and Notation
Softmax Function: The softmax function converts the model’s output (a vector of raw scores, z) into probabilities. It is applied at the last stage.

Example: classification of an image as containing certain species.
Image courtesy: [Link]
How much to fit the data?
[Figure: overfitting and underfitting with polynomial functions, shown as polynomial curve fitting with increasing degree k]

We seek a reasonable model that is neither underfitting nor overfitting. That means we prefer certain types of models.
How do we “regularize” our solution, i.e., incorporate certain preferences?

Core problem in machine learning: how to fit the data just enough?
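The effect of increasing the polynomial degree k can be reproduced with a small experiment. This is a sketch under assumed settings: the sine target, the noise level, the 10 training points and the degrees 1, 3 and 9 are illustrative choices, not taken from the slide's figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    """Hypothetical ground-truth function used to generate data."""
    return np.sin(2 * np.pi * x)

# Ten noisy training points and a dense noise-free test grid.
x_train = np.linspace(0, 1, 10)
y_train = target(x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = target(x_test)

errors = {}
for k in (1, 3, 9):
    # Least-squares polynomial fit of degree k.
    coeffs = np.polyfit(x_train, y_train, deg=k)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    errors[k] = (train_mse, test_mse)
```

The training error shrinks as k grows (the degree-9 polynomial passes through all 10 points), while the test error in `errors[k][1]` indicates how well each fit generalizes.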
Regularization
• Expressing a preference for one solution in the hypothesis space over another.
• Incorporate that preference in the objective function we are optimizing.

• Weight term
• Adding a term to the loss function that prefers a smaller squared sum of weights. A prior over the parameters.
• Penalizes a very complex model used to explain the data.

• Lambda parameter
• Selected ahead of time; controls the strength of our preference for smaller weights.
• It is a “hyper”-parameter that needs tuning, expressing our trade-off.

Material from [Link], Chapter 5
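A regularized linear regression makes the weight term concrete. The sketch below assumes the objective $\lVert Xw - y \rVert_2^2 + \lambda\, w^\top w$ (the exact scaling on the slide is not recoverable), for which the minimizer has the closed form $w = (X^\top X + \lambda I)^{-1} X^\top y$. The data and the name `ridge_fit` are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X w - y||^2 + lam * ||w||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical data with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.3 * rng.normal(size=30)

w_unreg = ridge_fit(X, y, lam=0.0)   # no preference for small weights
w_reg = ridge_fit(X, y, lam=10.0)    # strong preference for small weights
```

A larger lambda pulls the solution towards smaller weights, so `w_reg` has a smaller norm than `w_unreg`.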
Model capacity refers to the complexity or flexibility of the model. A model with low capacity (like a simple linear model) may not be able to
capture complex patterns, while a model with high capacity (like a deep neural network) can capture more complex relationships.
High capacity means the model can potentially fit complex patterns, but it might also fit noise in the data, leading to overfitting.

Relation between model capacity and error

Generalization error (often measured as test error): The error the model makes on unseen data (test data). It
reflects how well the model generalizes beyond the training set.

• Underfitting regime
• Training error and generalization error are both high.
• Increase capacity
• Training error decreases.
• Gap between training and generalization error increases.
• Overfitting
• The size of this gap outweighs the decrease in training error.
• Capacity is too large, above the optimal capacity.

We can train a model only on the training set. The test set is not available during training.
How do we know the generalization error when we cannot use the test set?

Material from [Link], Chapter 5
Parameters (like weights w in linear regression) are learned directly from the data during training.
Hyperparameters are not learned from the data; they are set by the model designer. Examples include the learning rate, regularization strength (lambda),
and the number of layers in a neural network. Hyperparameters control the training process and the capacity of the model.

Hyper-parameters and Validation Sets

• Parameters: P(X), P(Y|X)
• Hyper-parameters: k, lambda etc.
• Selecting hyper-parameters
• For each value of the hyperparameters, train on the training data and evaluate on the held-out (validation) data set.
• Choose the best value and do a final test on the test data.

[Figure: the data set split into Training Data, Validation Data and Test Data]

Training Data: Used to learn the model parameters.
Validation Data: Used to evaluate different hyperparameter settings and choose the best model.
Test Data: Used only once, after hyperparameters are chosen, to estimate the generalization error.
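The selection procedure can be sketched as a loop over candidate values of lambda. Everything concrete here is an assumption for illustration: the regularized least-squares model, the 60/30/30 split, the candidate grid and the helper names.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized least squares: minimize ||X w - y||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Hypothetical data set, split into training, validation and test parts.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=120)
X_train, y_train = X[:60], y[:60]
X_val, y_val = X[60:90], y[60:90]
X_test, y_test = X[90:], y[90:]

# For each hyper-parameter value: train on training data, score on validation data.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
val_errors = {lam: mse(X_val, y_val, ridge_fit(X_train, y_train, lam))
              for lam in candidates}
best_lam = min(val_errors, key=val_errors.get)

# The test set is touched only once, after the hyper-parameter is chosen.
w_final = ridge_fit(X_train, y_train, best_lam)
test_error = mse(X_test, y_test, w_final)
```

Because the test data plays no role in choosing `best_lam`, `test_error` remains an honest estimate of the generalization error.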
Gradient-based Learning
• The loss term is composed of the predictions under the weights and the regularization term.
• The goal is to optimize the loss function with respect to the parameters.

Gradient Descent:
Gradient: $\nabla_\theta J(\theta)$ is the gradient of the loss function J with respect to the parameters $\theta$. It points in the direction of steepest increase in J.
Update Rule: The gradient descent algorithm updates the parameters in the opposite direction of the gradient to minimize the loss:
$\theta \leftarrow \theta - \alpha \nabla_\theta J(\theta)$
Here, $\alpha$ is the learning rate, controlling the size of each step.
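The update rule can be sketched for a mean-squared-error loss. This is a minimal sketch: the loss, the learning rate 0.1, the 500 steps and the synthetic data are assumptions chosen so that the iteration converges.

```python
import numpy as np

def loss(theta, X, y):
    """J(theta): mean squared error of a linear model."""
    return float(np.mean((X @ theta - y) ** 2))

def grad(theta, X, y):
    """Analytic gradient of J with respect to theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def gradient_descent(X, y, alpha=0.1, steps=500):
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(steps):
        history.append(loss(theta, X, y))
        # Step against the gradient, scaled by the learning rate alpha.
        theta = theta - alpha * grad(theta, X, y)
    return theta, history

# Hypothetical noise-free data generated by theta = (2, -1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])

theta, history = gradient_descent(X, y)
```

`history` records the loss before each step; with a suitably small learning rate it decreases towards zero on this problem.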
Gradient Descent
• Compute the gradient and take a step in the negative gradient direction.
• Gradient computation
• Analytic
• Numerical

[Figure: contour plot over weights W_1 and W_2 showing the original W and the negative gradient direction]
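The two ways of computing a gradient can be compared directly. The sketch below uses a hypothetical loss `J` and a central-difference approximation for the numerical gradient; the step size `h` is an assumed value.

```python
import numpy as np

def J(w):
    """Hypothetical loss used only for this comparison."""
    return np.sum(w ** 2) + np.sin(w[0])

def analytic_grad(w):
    """Gradient of J derived by hand."""
    g = 2.0 * w
    g[0] += np.cos(w[0])
    return g

def numerical_grad(f, w, h=1e-5):
    """Central-difference approximation, one coordinate at a time."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (f(w + step) - f(w - step)) / (2.0 * h)
    return g

w = np.array([0.5, -1.5])
max_diff = np.max(np.abs(analytic_grad(w) - numerical_grad(J, w)))
```

The numerical gradient needs two loss evaluations per parameter, so it is mainly useful as a check on an analytic implementation.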
Learning rate schedules
Gradient descent with adaptive learning rate.

Types of learning rate schedules.

Mini-batch Gradient Descent
Problem: Large data sets. It is difficult to compute the full loss function over the entire training set in order to perform only a single parameter update.

Solution: Only use a small portion of the training set to compute the gradient.

[Figure: contour plot over W_1 and W_2 showing the original W and the noisy gradient from a minibatch]
Stochastic Gradient Descent
Setting the mini-batch to contain only a single example.
Gradients are noisy but still make good progress on average.

[Figure: contour plot over W_1 and W_2 showing the original W, with true gradients in blue and minibatch gradients in red]
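Gradient vs Mini-batch Gradient Descent
Theta denotes the parameters. Same as w used in the previous slides.

Cost function decomposes as a sum over training examples of a per-example loss function:
$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(x^{(i)}, y^{(i)}, \theta)$

Gradient for an additive cost function:
$\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$

Mini-batch of size m’, where m’ is kept constant while m is increasing:
$g = \frac{1}{m'} \sum_{i=1}^{m'} \nabla_\theta L(x^{(i)}, y^{(i)}, \theta)$

Update the parameters with the estimated gradient from the mini-batch, where $\alpha$ is the learning rate:
$\theta \leftarrow \theta - \alpha g$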
Stochastic Gradient Descent

SGD – sample the batch randomly in each iteration. Random sampling adds some robustness to
optimization.
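Random mini-batch sampling can be sketched as follows. This is an illustrative sketch only: the squared-error loss, the batch size, the learning rate, the number of steps and the noise-free synthetic data are all assumptions.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=16, alpha=0.05, steps=1000, seed=0):
    """SGD on a mean-squared-error loss with a fresh random batch per step."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        # Sample the mini-batch randomly in each iteration.
        idx = rng.choice(m, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Gradient estimated from the mini-batch only.
        g = 2.0 * Xb.T @ (Xb @ theta - yb) / batch_size
        theta = theta - alpha * g
    return theta

# Hypothetical noise-free data generated by a known parameter vector.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

theta = minibatch_sgd(X, y)
```

Each step costs the same regardless of the training set size m, because only `batch_size` examples are touched. Setting `batch_size=1` gives the single-example variant.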

Gradient Descent with Momentum

Gradient Clipping

Without clipping – instabilities With clipping – stable convergence

Other Aspects
Vanishing gradients Multiple local minima

Neural Networks: Biological
Neurons in the brain
• Activations and inhibitions
• Parallelism
• Connected networks
Modeling a Neuron

Main processing unit

• Setup where there is a function that connects the inputs to the output.

• Problem: learning this function that maps the input to the output.
Perceptron: Model of a Single Neuron
• Introduced in the late 50s by Rosenblatt.
• Later analysed by Minsky and Papert (1969).
• Perceptron convergence theorem (Rosenblatt 1962):
• Perceptron will learn to classify any linearly separable set of inputs.
• Note: the earlier class talked about model-based classification. Here, we do not build a model; we operate directly on feature weights.
Feature Space
• Extract features from data
• Learn a model with these features
• Data can be viewed as a point in the feature space.

[Figure: feature extraction examples]
Email “Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ...” → # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ... → SPAM or +
Digit image → PIXEL-7,12 : 1, PIXEL-7,13 : 0, ..., NUM_LOOPS : 1, ... → “2”
Further feature vectors: (# free : 1, YOUR_NAME : 0, MISSPELLED : 1, FROM_FRIEND : 0, ...) and (# free : 0, YOUR_NAME : 1, MISSPELLED : 1, FROM_FRIEND : 1)
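Classification with a Perceptron
• A decision boundary is a hyperplane orthogonal to the weight vector.

[Figure: perceptron unit with inputs f1, f2, f3 weighted by w1, w2, w3, summed and tested against “>0?”; plot over Feature 1 and Feature 2 with +1 = Class A on one side and -1 = Class B on the other]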
Perceptron

Decision rule (Binary case)

Binary classification task

One side of the decision boundary is class A and the other is class B. A threshold is introduced.
How to arrive at the separator?
Learning rule
• Classify with the current weights.
• If correct, no change.
• If the classification is wrong, adjust the weight vector by adding or subtracting the feature vector.

[Figure: binary classification task, updating the weight vector by subtracting f]

Animation: [Link]
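The decision rule and the learning rule can be sketched together. This is a minimal sketch under stated assumptions: labels are +1 (class A) and -1 (class B), the bias is folded in as a constant first feature, and the four training points are hypothetical.

```python
import numpy as np

def predict(w, f):
    """Decision rule: +1 if the weighted sum of features is positive, else -1."""
    return 1 if w @ f > 0 else -1

def train_perceptron(F, y, max_epochs=100):
    w = np.zeros(F.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for f, label in zip(F, y):
            if predict(w, f) != label:
                # Wrong: add f when the true label is +1, subtract f when it is -1.
                w = w + label * f
                mistakes += 1
        if mistakes == 0:
            break  # every training point is classified correctly
    return w

# Hypothetical linearly separable data; the first feature is a constant 1 (bias).
F = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = train_perceptron(F, y)
predictions = [predict(w, f) for f in F]
```

On this data a single update is enough: after the first mistake the weights become `[1, 2, 1]` and all four points are classified correctly.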
Some History
A physical realization of the perceptron machine.

[Link]
%2023%20June%201960.
Perceptron
Case: Linearly-separable Case: Not Linearly-separable

Perceptron learning rule converges to a perfect linear separator when the data points
are linearly separable. Problem when there is non-separable data.
Threshold Functions
• Till now, threshold functions were linear.
• Can we modify the threshold function to handle the non-separable case?
• Can we “soften” the outputs?
Boolean Functions and Perceptron
Non-separable case
Deterministic Decisions Probabilistic Decisions

0.9 | 0.1
0.7 | 0.3
0.5 | 0.5
0.3 | 0.7
0.1 | 0.9
Logistic Output
• Logistic Function
• Very positive values. Probability -> 1
• Very negative values. Probability -> 0.
• Makes the prediction. Converts to a
probability
• Softens the decision boundary.
• Logistic Regression
• Fitting the weights of this model to
minimize loss on a data set is called
logistic regression.
Example (red or blue classes)
[Figure: regions labelled definitely blue, not sure and definitely red]
Probability increases exponentially as we move away from the boundary.
Normalizer
Estimating weights using MLE
Logistic Regression
Maximize the log-likelihood
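Maximizing the log-likelihood can be sketched with plain gradient ascent. The details below are assumptions for illustration: labels are coded as 0/1, the bias is a constant last feature, and the data, step size and helper names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    eps = 1e-12  # guards against log(0)
    return float(np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))

def fit_logistic(X, y, alpha=0.1, steps=1000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of the log-likelihood: X^T (y - p).
        grad = X.T @ (y - sigmoid(X @ w))
        w = w + alpha * grad / len(y)  # ascent step (we are maximizing)
    return w

# Hypothetical data sampled from a logistic model.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
true_w = np.array([2.0, -1.0, 0.3])
y = (rng.uniform(size=200) < sigmoid(X @ true_w)).astype(float)

w = fit_logistic(X, y)
ll_start = log_likelihood(np.zeros(3), X, y)
ll_end = log_likelihood(w, X, y)
```

The fitted weights give a higher log-likelihood than the all-zero starting point, which predicts probability 0.5 for every example.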
Softmax Output
• Multi-class setting
• A probability distribution over a discrete variable with n possible values.
• Generalization of the sigmoid function to multiple outputs.
• Output of a classifier
• Distribution over n different classes. The individual outputs must sum to one.

Step 1: Prediction of the unnormalized probabilities (raw scores) z.
Step 2: Exponentiate and normalize the values: $\mathrm{softmax}(z)_i = \exp(z_i) / \sum_{j} \exp(z_j)$
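The exponentiate-and-normalize step can be sketched as below. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result; it is an addition here, not something stated on the slide, and the example scores are hypothetical.

```python
import numpy as np

def softmax(z):
    """Turn a vector of raw scores into a probability distribution."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)   # avoids overflow in exp; the result is unchanged
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

probs = softmax([2.0, 1.0, 0.1])       # largest score gets the largest probability
large = softmax([1000.0, 1000.0])      # equal scores give equal probabilities
```

The outputs are non-negative and sum to one, so they can be read as a distribution over the n classes.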
Softmax Example
Multi-class Setting
Can a perceptron learn XOR?
Non-separability and Non-linear Functions
• The original feature space is mapped to some higher-dimensional feature space where the training set is separable.
• Need a non-linear function to describe the features.
• Applying a non-linear kernel map. Affine transformation.

[Figure from Ray Mooney: a kernel map takes data that is non-separable to data that is separable in a higher dimension]

Example: Kernel Map
Features “represent” an input data point
The network is learning the decision boundary, the function that separates the data.

[Figure: a data point as a vector of features f1(x), f2(x), f3(x), …, fK(x) is given as input; scores z1, z2, z3 pass through a softmax. Output: class probabilities; in general the output depends on the machine learning task we are performing.]

• We are designing features that represent a data point. We hope that the feature representation enables good classification performance (requires hand-crafting).
• As we will see, a neural network “learns” features to represent a data point for a machine learning task.
“Deep” Neural Networks
• Connect input data to output. Structure these models by composing many units.
• Feature learning is implicit in the network.
• Paradigm is called deep learning.

[Figure: inputs x1, x2, x3, … pass through several layers of units into a softmax; g = nonlinear activation function]
Multi-layer Perceptrons (MLPs)

Common for networks to be “deep”, which leads to expressive models.
Activation Functions

Note: Activations introduce non-linearity. Each unit “knows” its derivative, which is used later to compute gradients automatically.
[source: MIT 6.S191 [Link]]
Common Activation Functions
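A few widely used activation functions, each paired with its derivative, are sketched below. The slide's figure is not available, so the choice of sigmoid, tanh and ReLU is an assumption; the function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 for positive inputs and 0 otherwise (0 is used at z = 0).
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 3.0])
relu_values = relu(z)
relu_slopes = relu_grad(z)
```

Pairing each activation with its derivative is what lets a framework apply the chain rule through the unit without further input from the user.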
Deep Neural Networks: In essence
• During training one learns the parameters of the network.
• At inference time, directly apply the weights to form the prediction.
• Deep nets consist of linear layers interleaved with non-linearities.
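Inference in such a network is just a forward pass through alternating linear layers and non-linearities. The sketch below assumes a linear-relu-linear-relu-linear stack with a softmax output, randomly initialized (untrained) weights, and hypothetical layer sizes.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    """Row-wise softmax for a batch of score vectors."""
    shifted = z - z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def init_mlp(sizes, seed=0):
    """One (W, b) pair per linear layer; weights are random, not trained."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.5, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)        # linear layer followed by a non-linearity
    W, b = params[-1]
    return softmax(h @ W + b)      # final linear layer, then class probabilities

params = init_mlp([2, 8, 8, 2])    # 2 inputs, two hidden layers of 8 units, 2 outputs
x = np.array([[0.5, -1.0], [1.5, 2.0]])
probs = forward(params, x)
```

Each row of `probs` is a distribution over the two classes for the corresponding input row.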
Neural Networks are Distribution Transformers
What does a long sequence of linear and non-linear function compositions lead to?

[Figure: mapping plots for simple functions that can be used in neural layers; mapping plot for a linear-relu composition]
Mapping diagrams for a data distribution
[Figure: 2D mapping diagram for several neural layers]

• The linear layer mapping will shift, stretch and rotate depending on its weights and biases.
• The ReLU will map many points to the axes of the positive quadrant. Density will build up along the axes.
Binary Classification Example
An MLP with three linear layers and two outputs, suitable for performing binary softmax regression (linear-relu-linear-relu-linear architecture).

Goal of the neural network classifier is to arrange the input data distribution to match the target label distribution (two points in the right image).

[Figure: left shows the original data distribution; right shows the target output (one-hot codes)]
Remapping input data layer by layer

Matching the target output moves the red and blue points closer to their ground-truth labels.
As training progresses, the network gradually achieves this separation.
The outputs of the final layers before classification can be considered as learned features for the data.
Representing complex functions

A key success of neural networks is in learning complex functions. The ability to learn complex functions is crucial for performing complex classification, regression and other machine learning tasks.

[Link]
plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.67214&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=fal
se&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false
Learning XOR
XOR is not linearly separable.
[Figure: rectified linear activation; network diagram and its more compact representation]

Example from Ch 6, DL Book

Learning XOR
[Figure: network diagram and model; XOR is separable in the transformed space]

Takeaway: Applying ReLU to the output of a linear transformation yields a non-linear transformation. The problem can be solved in the transformed space.
Example from Ch 6, DL Book
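The takeaway can be checked numerically with a two-unit ReLU network. The weights below are a hand-specified solution of the form $f(x) = w^\top \max(0, W^\top x + c) + b$; they are believed to match the one given in Ch 6 of the DL book, but since the slide's figures are not available, treat the specific values as an assumption.

```python
import numpy as np

# Hand-specified parameters (not learned).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])    # input-to-hidden weights
c = np.array([0.0, -1.0])     # hidden biases
w = np.array([1.0, -2.0])     # hidden-to-output weights
b = 0.0                       # output bias

def hidden(X):
    """Transformed space: ReLU applied to a linear transformation."""
    return np.maximum(0.0, X @ W + c)

def xor_net(X):
    """Linear read-out on top of the transformed features."""
    return hidden(X) @ w + b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
H = hidden(X)         # rows: [0, 0], [1, 0], [1, 0], [2, 1]
outputs = xor_net(X)  # [0, 1, 1, 0]
```

In the transformed space the inputs (0, 1) and (1, 0) collapse onto the same point, so a single linear read-out can separate the two classes.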
Backpropagation and Computation Graphs
• Training NNs
• Backpropagation
• In a NN, we need a way to optimize the output loss with respect to the network's parameters (the inputs to the computation graph).
• Core question: how does the output change if an input changes by a certain amount?
• Apply the chain rule to obtain the gradient.
• How to manage this computation efficiently?

[Link]
Material from Ch 6, DL Book
Backpropagation and Computation Graphs
• Computation Graphs
• A way to organize the computation in
a neural network.
• Also enables identification and
caching of repeated sub-expressions.

Material from Ch 6, DL Book

Backpropagation: Toy Example

Example from: [Link]
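A tiny computation graph shows how the chain rule is applied node by node. The function f(x, y, z) = (x + y) * z and the input values are hypothetical stand-ins; they are not necessarily the example shown on the slide.

```python
def forward_backward(x, y, z):
    # Forward pass: evaluate and cache each intermediate node.
    q = x + y          # node 1
    f = q * z          # node 2 (output)

    # Backward pass: apply the chain rule from the output towards the inputs.
    df_dq = z          # d(q*z)/dq
    df_dz = q          # d(q*z)/dz
    # df_dq is computed once and shared by both upstream gradients.
    df_dx = df_dq * 1.0   # dq/dx = 1
    df_dy = df_dq * 1.0   # dq/dy = 1
    return f, (df_dx, df_dy, df_dz)

f, grads = forward_backward(-2.0, 5.0, -4.0)
```

With these inputs the output is -12 and the gradients with respect to x, y and z are -4, -4 and 3.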

Once this gradient is computed, the downstream gradients can “share” this computation.
Backpropagation
Backpropagation: Example
Deep Neural Networks: Highly expressive
• Last layer
• Logistic regression
• Several Hidden Layers
• Computing the features. The features are learned rather than hand-designed.

• Universal function approximation theorem

• If neural net is large enough
• Then neural net can represent any continuous mapping from input to output with arbitrary accuracy
• Note: overfitting is a challenge.
• In essence, hyper-parametric function approximation.
Preventing Over-fitting: Weight Decay
L2-regularisation

Core Idea:
- Encourage smaller weights so that the output function is smoother.
- Output is the weighted sum of the previous layer.
- If the input magnitudes are small then the output
variation is small.
Preventing Over-fitting: Early Stopping
• Stop the training procedure before convergence.
• Reduces overfitting if the model has already captured the coarse shape of
the function before it starts to fit to noise.
• Single hyper-parameter – number of steps before learning is terminated.
• Chosen empirically using a validation set (without the need to train multiple models)
• The model is trained once, the performance on the validation set is monitored every T
iterations, and the associated parameters are stored. The stored parameters where the
validation performance was best are selected.
• Intuition
• Weights are often initialized to small values. With early stopping they do not get a chance to
become large.
• Activations remain in the linear range without saturating.
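The monitoring procedure can be sketched for a simple model trained by gradient descent. All specifics are assumptions for illustration: the linear model, the squared-error loss, the check interval T = 50 (`check_every`) and the synthetic data.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              alpha=0.01, max_steps=2000, check_every=50):
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_step = w.copy(), np.inf, 0
    history = []
    for step in range(1, max_steps + 1):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - alpha * grad
        if step % check_every == 0:
            # Monitor validation performance every T iterations.
            val = float(np.mean((X_val @ w - y_val) ** 2))
            history.append((step, val))
            if val < best_val:
                # Store the parameters with the best validation error so far.
                best_w, best_val, best_step = w.copy(), val, step
    return best_w, best_val, best_step, history

# Hypothetical small, noisy data set with many features relative to samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 15))
y = X @ rng.normal(size=15) + rng.normal(size=40)

best_w, best_val, best_step, history = train_with_early_stopping(
    X[:20], y[:20], X[20:], y[20:])
```

The model is trained once; `best_step` plays the role of the single hyper-parameter, chosen from the validation curve in `history` rather than by training several models.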
Preventing Over-fitting: Dropouts
• A way to reduce the test set error at the cost of making it harder to fit the training data.
• Each step of training
• Dropout applies one step of backprop to a new version of the network.
• Created by deactivating a randomly chosen subset of units.
• During training, dropout samples from an exponential number of different “thinned” networks.
• At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.
Preventing Over-fitting: Dropouts

• If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time. This ensures that for any hidden unit the expected output during training is the same as the actual output at test time.
• 2^n networks with shared weights can be combined into a single neural network to be used at test time.

Intuition
- Forces neurons to be useful as a whole, paying attention to what others are learning.

[Link]
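The train-time masking and test-time scaling can be sketched on a vector of activations. Scaling a unit's output by p is used here in place of scaling its outgoing weights; the two are equivalent for the following linear layer. The retain probability 0.8 and the all-ones activations are hypothetical.

```python
import numpy as np

def dropout_train(h, p, rng):
    """Training: keep each unit with probability p, deactivate the rest."""
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p):
    """Test: keep all units but scale by p to match the training-time expectation."""
    return h * p

rng = np.random.default_rng(0)
h = np.ones(100_000)   # hypothetical hidden activations
p = 0.8

train_mean = float(dropout_train(h, p, rng).mean())
test_mean = float(dropout_test(h, p).mean())
```

Averaged over many units, the training-time output is close to p times the original activation, which is exactly what the scaled test-time network produces.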
Improving SGD Convergence
• Batch-normalization
• Standardizes each neural activation with respect to its mean and variance.
• Over the mini-batch of data points.
• In effect adjusts the distribution observed by the deeper layers from the outputs coming from the earlier layers.
• Addresses a problem called internal covariate shift.

• L2-Normalization
• Projects the inputs onto a unit hypersphere.
• Useful for bounding activations to unit vectors.
• The above two can be considered as non-linear layers in the neural network.
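Both normalizations can be sketched on a mini-batch of activations. This is a simplified sketch: the learned scale and shift parameters that batch normalization usually carries are omitted, and the epsilon values and batch shape are assumptions.

```python
import numpy as np

def batch_norm(A, eps=1e-5):
    """Standardize each activation (column) over the mini-batch (rows)."""
    mean = A.mean(axis=0)
    var = A.var(axis=0)
    return (A - mean) / np.sqrt(var + eps)

def l2_normalize(A, eps=1e-12):
    """Project each row vector onto the unit hypersphere."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A / np.maximum(norms, eps)

# Hypothetical mini-batch: 64 data points, 4 activations each.
rng = np.random.default_rng(0)
A = rng.normal(loc=3.0, scale=2.0, size=(64, 4))

A_bn = batch_norm(A)     # each column has mean ~0 and standard deviation ~1
A_l2 = l2_normalize(A)   # each row has Euclidean norm 1
```

Note the different axes: batch normalization works per activation across the batch, while L2 normalization works per data point across its activations.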
Neural Networks: Successes
