AI701: Foundations of Artificial Intelligence
Week 2
Agenda
• Introduction to linear regression
• Logistic regression fundamentals
• Gradient descent (batch/stochastic)
• Underfitting and overfitting
• Introduction to regularization
Linear Regression
The discovery of Ceres
1801: Astronomer Piazzi discovered Ceres
Made 19 observations of location before it was obscured by the sun
Time                 Right ascension   Declination
Jan 01, 20:43:17.8   50.91             15.24
Jan 02, 20:39:04.6   50.84             15.30
...                  ...               ...
Feb 11, 18:11:58.2   53.51             18.43
Where and when will it be observed again?
Gauss's triumph
September 1801: Gauss took Piazzi's data and created a model of Ceres's orbit
Made a prediction of where Ceres would reappear
December 7, 1801: Ceres located within 1/2 degree of Gauss's prediction, much more accurate than other astronomers' predictions
Method: Least squares linear regression
Linear regression framework
Design decisions:
Which predictors are possible? hypothesis class
How good is a predictor? Loss function
How do we compute the best predictor? Optimization algorithm
Hypothesis class: which predictors?
[Plot: training points with two candidate predictors, f(x) = 1 + 0.57x and f(x) = 2 + 0.2x]
General form: f(x) = w1 + w2·x
Vector notation:
weight vector w = [w1, w2]
feature extractor φ(x) = [1, x]  (feature vector)
score f_w(x) = w · φ(x)
f_w(3) = [1, 0.57] · [1, 3] = 2.71
Hypothesis class:
F = {f_w : w ∈ R²}
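To make the vector notation concrete, here is a minimal Python sketch (the helper names phi and predict are illustrative, not from the lecture) that evaluates the score f_w(x) = w · φ(x):

def phi(x):
    # Feature extractor: phi(x) = [1, x]
    return [1, x]

def predict(w, x):
    # Score f_w(x) = w · phi(x)
    return sum(wi * fi for wi, fi in zip(w, phi(x)))

w = [1, 0.57]
print(round(predict(w, 3), 2))   # 1 + 0.57 * 3 = 2.71, matching the slide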
Loss function: how good is a predictor?
training data Dtrain:
x   y
1   1
2   3
4   3

f_w(x) = w · φ(x)
w = [1, 0.57]
φ(x) = [1, x]
[Plot: the three training points and the line f_w(x); the vertical gap between a point and the line is its residual f_w(x) − y]
Loss(x, y, w) = (f_w(x) − y)²   (squared loss)
Loss(1, 1, [1, 0.57]) = ([1, 0.57] · [1, 1] − 1)² = 0.32
Loss(2, 3, [1, 0.57]) = ([1, 0.57] · [1, 2] − 3)² = 0.74
Loss(4, 3, [1, 0.57]) = ([1, 0.57] · [1, 4] − 3)² = 0.08
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} Loss(x, y, w)
TrainLoss([1, 0.57]) = 0.38
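A small sketch of the squared loss and the average training loss for this example, reusing the phi and predict helpers from the previous sketch (names are illustrative):

# Squared loss and average training loss for the slide's example.
D_train = [(1, 1), (2, 3), (4, 3)]
w = [1, 0.57]

def squared_loss(x, y, w):
    return (predict(w, x) - y) ** 2

train_loss = sum(squared_loss(x, y, w) for x, y in D_train) / len(D_train)
print(round(train_loss, 2))  # 0.38, matching TrainLoss([1, 0.57]) above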
Visualizing Loss Function
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} (f_w(x) − y)²
Objective: min_w TrainLoss(w)
[Plot: TrainLoss(w) as a surface over the weight space (w1, w2)]
Gradient Descent
Optimization algorithm: how to compute best?
Goal: min_w TrainLoss(w)
Definition: gradient
The gradient ∇w TrainLoss(w) is the direction that increases the
training loss the most.
Computing the gradient
Objective function:
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} (w · φ(x) − y)²
Gradient (use the chain rule):
∇_w TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} 2(w · φ(x) − y) φ(x)
Gradient descent example
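As a rough illustration, here is a sketch that runs batch gradient descent on the three-point dataset above, using the chain-rule gradient 2(w · φ(x) − y)φ(x); the step size, iteration count, and helper names are illustrative choices, not values from the lecture.

# Batch gradient descent on the squared loss:
# grad Loss(x, y, w) = 2 * (w · phi(x) - y) * phi(x)
def phi(x):
    return [1.0, x]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

D_train = [(1, 1), (2, 3), (4, 3)]
w = [0.0, 0.0]
eta, T = 0.05, 500          # illustrative step size and iteration count

for t in range(T):
    # Average gradient over the whole training set.
    grad = [0.0, 0.0]
    for x, y in D_train:
        err = dot(w, phi(x)) - y
        for j, fj in enumerate(phi(x)):
            grad[j] += 2 * err * fj / len(D_train)
    w = [wj - eta * gj for wj, gj in zip(w, grad)]

print([round(wj, 2) for wj in w])  # approximately [1.0, 0.57], the least-squares fit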
Linear Classification
Linear classification framework
training data Dtrain:
x1   x2    y
 0    2    1
−2    0    1
 1   −1   −1

Each example pairs an input x = [x1, x2] with a label y ∈ {+1, −1}.
Dtrain → learning algorithm → classifier f
[Plot: the three training points in the (x1, x2) plane with a decision boundary]
Design decisions:
Which classifiers are possible? hypothesis class
How good is a classifier? loss function
How do we compute the best classifier? optimization algorithm
An example linear classifier
[Plot: decision boundary of the classifier f(x) = sign([−0.6, 0.6] · φ(x)) in the (x1, x2) plane]
f ([0, 2]) = sign([−0.6, 0.6] ·[0, 2]) = sign(1.2) = 1
f ([−2, 0]) = sign([−0.6, 0.6] ·[−2, 0]) = sign(1.2) = 1
f ([1, −1]) = sign([−0.6, 0.6] ·[1, −1]) = sign(−1.2) = −1
Decision boundary: x such that w ·φ(x) = 0
Hypothesis class: which classifiers?
φ(x) = [x1, x2]
[Plot: two example decision boundaries, f(x) = sign([−0.6, 0.6] · φ(x)) and f(x) = sign([0.5, 1] · φ(x))]
General binary classifier:
f_w(x) = sign(w · φ(x))
Hypothesis class:
F = {f_w : w ∈ R²}
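A minimal sketch of the general binary classifier f_w(x) = sign(w · φ(x)), reproducing the three predictions above (helper names are illustrative):

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sign(z):
    return 1 if z >= 0 else -1

def classify(w, phi_x):
    # f_w(x) = sign(w · phi(x))
    return sign(dot(w, phi_x))

w = [-0.6, 0.6]
print(classify(w, [0, 2]))    # 1
print(classify(w, [-2, 0]))   # 1
print(classify(w, [1, -1]))   # -1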
Loss function: how good is a classifier?
training data Dtrain:
x1   x2    y
 0    2    1
−2    0    1
 1   −1   −1

f_w(x) = sign(w · φ(x))
w = [0.5, 1]
φ(x) = [x1, x2]
[Plot: the training points and the decision boundary of w = [0.5, 1]]
Loss0-1(x, y, w) = 1[f_w(x) ≠ y]   (zero-one loss)
Loss([0, 2], 1, [0.5, 1]) = 1[sign([0.5, 1] · [0, 2]) ≠ 1] = 0
Loss([−2, 0], 1, [0.5, 1]) = 1[sign([0.5, 1] · [−2, 0]) ≠ 1] = 1
Loss([1, −1], −1, [0.5, 1]) = 1[sign([0.5, 1] · [1, −1]) ≠ −1] = 0
TrainLoss([0.5, 1]) = 0.33
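A sketch of the zero-one loss averaged over this training set, reusing the classify helper from the previous sketch (names are illustrative):

# Zero-one loss and its average over the training data for w = [0.5, 1].
D_train = [([0, 2], 1), ([-2, 0], 1), ([1, -1], -1)]
w = [0.5, 1]

def zero_one_loss(phi_x, y, w):
    return 1 if classify(w, phi_x) != y else 0

train_loss = sum(zero_one_loss(phi_x, y, w) for phi_x, y in D_train) / len(D_train)
print(round(train_loss, 2))  # 0.33: only [-2, 0] is misclassified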
Score and margin
Predicted label: f_w(x) = sign(w · φ(x))
Target label: y
[Plot: an example point in the (x1, x2) plane together with the decision boundary]
Definition: score
The score on an example (x, y) is w·φ(x), how confident we are in
predicting +1.
Definition: margin
The margin on an example (x, y) is (w ·φ(x))y, how correct we are.
Zero-one loss rewritten
Loss0-1(x, y, w) = 1[(w · φ(x))y ≤ 0]
[Plot: the zero-one loss as a function of the margin (w · φ(x))y; it is 1 for margin ≤ 0 and 0 otherwise]
Optimization algorithm: how to compute best?
Goal: min_w TrainLoss(w)
To run gradient descent, compute the gradient:
∇_w TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} ∇_w Loss0-1(x, y, w)
∇_w Loss0-1(x, y, w) = ∇_w 1[(w · φ(x))y ≤ 0]
Gradient is zero almost everywhere!
[Plot: the zero-one loss vs. the margin (w · φ(x))y; the loss is flat except for the jump at margin 0]
Hinge loss
[Plot: Loss0-1 and Losshinge as functions of the margin (w · φ(x))y]
Losshinge(x, y, w) = max{1 − (w ·φ(x))y, 0}
Another: logistic regression
Losslogistic(x, y, w) = log(1 + e^(−(w·φ(x))y))
[Plot: Loss0-1, Losshinge, and Losslogistic as functions of the margin (w · φ(x))y]
Intuition: the logistic loss tries to increase the margin even when it already exceeds 1.
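A small sketch comparing the hinge and logistic losses as functions of the margin; the function names are illustrative and the printed values are easy to check by hand:

import math

def hinge_loss(margin):
    return max(1 - margin, 0)

def logistic_loss(margin):
    return math.log(1 + math.exp(-margin))

for m in [-2, -1, 0, 1, 2]:
    print(m, hinge_loss(m), round(logistic_loss(m), 3))
# Hinge is exactly zero once the margin reaches 1; the logistic loss keeps
# decreasing (and so keeps pushing the margin up) but never hits zero.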
Gradient of the hinge loss
∇_w Losshinge(x, y, w) = −φ(x)y if (w · φ(x))y < 1, and 0 otherwise
[Plot: the hinge loss vs. the margin (w · φ(x))y; its slope is −1 to the left of margin 1 and 0 to the right]
Hinge loss on training data
training data Dtrain:
x1   x2    y
 0    2    1
−2    0    1
 1   −1   −1

f_w(x) = sign(w · φ(x))
w = [0.5, 1]
φ(x) = [x1, x2]
[Plot: the training points and the decision boundary of w = [0.5, 1]]
Losshinge(x, y, w) = max{1 − (w ·φ(x))y, 0}
Loss([0, 2], 1, [0.5, 1]) = max{1 − [0.5, 1] ·[0, 2](1), 0} = 0 ∇Loss([0, 2], 1, [0.5, 1]) = [0, 0]
Loss([−2, 0], 1, [0.5, 1]) = max{1 − [0.5, 1] ·[−2, 0](1), 0} = 2 ∇Loss([−2, 0], 1, [0.5, 1]) = [2, 0]
Loss([1, −1], −1, [0.5, 1]) = max{1 − [0.5, 1] ·[1, −1](−1), 0} = 0.5 ∇Loss([1, −1], −1, [0.5, 1]) = [1, −1]
TrainLoss([0.5, 1]) = 0.83 ∇TrainLoss([0.5, 1]) = [1, −0.33]
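A sketch that reproduces the hinge-loss computation above, including the (sub)gradient; the names dot and hinge_loss_and_grad are illustrative:

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

D_train = [([0, 2], 1), ([-2, 0], 1), ([1, -1], -1)]
w = [0.5, 1]

def hinge_loss_and_grad(phi_x, y, w):
    # Loss = max{1 - margin, 0}; gradient is -phi(x)*y when the margin is below 1.
    margin = dot(w, phi_x) * y
    if margin < 1:
        return 1 - margin, [-fj * y for fj in phi_x]
    return 0.0, [0.0 for _ in phi_x]

total_loss, total_grad = 0.0, [0.0, 0.0]
for phi_x, y in D_train:
    loss, grad = hinge_loss_and_grad(phi_x, y, w)
    total_loss += loss / len(D_train)
    total_grad = [g + gi / len(D_train) for g, gi in zip(total_grad, grad)]

print(round(total_loss, 2), [round(g, 2) for g in total_grad])  # 0.83 [1.0, -0.33]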
Regression vs Classification
                      Regression               Classification
Prediction f_w(x)     score                    sign(score)
Relate to target y    residual (score − y)     margin (score · y)
Loss functions        squared,                 zero-one, hinge,
                      absolute deviation       logistic
Algorithm             gradient descent         gradient descent
Stochastic Gradient Descent
Gradient descent is slow
Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η∇w TrainLoss(w)
Problem: each iteration requires going over all the training examples, which is expensive when we have lots of data!
Stochastic gradient descent
Algorithm: stochastic gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
For (x, y) ∈ Dtrain:
w ← w − η∇wLoss(x, y, w)
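A sketch contrasting the two update rules; grad_train_loss (the gradient averaged over Dtrain) and grad_loss (the gradient on a single example) are assumed to be supplied by the caller, and eta and T are illustrative defaults:

def gradient_descent(grad_train_loss, d, eta=0.1, T=100):
    w = [0.0] * d
    for t in range(T):
        g = grad_train_loss(w)              # one pass over ALL examples per update
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

def stochastic_gradient_descent(grad_loss, D_train, d, eta=0.1, T=100):
    w = [0.0] * d
    for t in range(T):
        for x, y in D_train:                # one cheap update PER example
            g = grad_loss(x, y, w)
            w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w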
Step size
Question: what should η be?
[Diagram: the step-size spectrum for η, from small (conservative, more stable) to large (aggressive, faster)]
Strategies:
• Constant: η = 0.1
• Decreasing: e.g. η = 1/√(number of updates made so far)
GD vs SGD
[Side-by-side plots: the optimization paths of gradient descent and stochastic gradient descent]
Key idea: stochastic updates
It’s not about quality, it’s about quantity.
Overfitting and Regularization
Minimizing the training loss
Hypothesis class:
f_w(x) = w · φ(x)
Training objective (loss function):
TrainLoss(w) = (1/|Dtrain|) Σ_{(x,y)∈Dtrain} Loss(x, y, w)
Optimization algorithm:
stochastic gradient descent
Is the training loss a good objective to optimize?
Rote Learning
Algorithm: rote learning
Training: just store Dtrain .
Predictor f (x):
If (x, y) ∈ Dtrain : return y.
Else: segfault.
Minimizes the objective perfectly (zero), but clearly bad...
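A minimal Python sketch of the rote learner (function names are illustrative); it achieves zero training loss but cannot handle anything outside Dtrain:

def rote_train(D_train):
    # "Training" just stores the examples, keyed by input x.
    return dict(D_train)

def rote_predict(memory, x):
    if x in memory:
        return memory[x]              # training loss is exactly zero...
    raise RuntimeError("segfault")    # ...but any unseen input fails

memory = rote_train([(1, 1), (2, 3), (4, 3)])
print(rote_predict(memory, 2))        # 3
# rote_predict(memory, 3)             # would raise: no generalization at all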
Overfitting scenarios
[Figures: examples of overfitting in classification and in regression]
Overfitting – Possible reasons
• Too few training examples
• Noise in the data
• The hypothesis space is too large
• The input space is high-dimensional
Overfitting vs. model complexity
• We talk of overfitting when decreasing E_in leads to increasing E_out
• Overfitting is a major source of failure for machine learning systems
• Overfitting leads to bad generalization
• A model can exhibit bad generalization even if it does not overfit
[Plot: error vs. model complexity; the in-sample error E_in keeps decreasing as complexity grows (from high bias/low variance to low bias/high variance), while the out-of-sample error E_out eventually rises; that region is overfitting]
Evaluation
Dtrain → learning algorithm → f
How good is the predictor f ?
Key idea: the real learning objective
Our goal is to minimize error on unseen future examples.
Don’t have unseen examples; next best thing:
Definition: test set
Test set Dtest contains examples not used for training.
Generalization
When will a learning algorithm generalize well?
[Diagram: the training set Dtrain alongside the test set Dtest]
Approximation & estimation error
[Diagram: the set of all predictors, with the hypothesis class F inside it; f* is the best predictor overall, g the best predictor in F, and f̂ the predictor found by learning. The gap from f* to g is the approximation error; the gap from g to f̂ is the estimation error.]
• Approximation error: how good is the hypothesis class?
• Estimation error: how good is the learned predictor relative to the potential of the
hypothesis class?
Effect of hypothesis class
[Diagram: same picture as above (all predictors, hypothesis class F, f*, g, f̂, approximation and estimation error)]
As the hypothesis class size increases...
Approximation error decreases because:
taking min over larger set
Estimation error increases because:
harder to estimate something more complex
How do we control the hypothesis class size?
Cure 1: Dimensionality
w ∈ Rd
Reduce the dimensionality d (number of features):
Controlling the dimensionality
Manual feature (template) selection:
• Add feature templates if they help
• Remove feature templates if they don't help
Automatic feature selection (beyond the scope of this class):
• Forward selection
• Boosting
• L1 regularization
It’s the number of features that matters
Cure 2: Norm
Controlling the Norm
Regularized objective:
min_w TrainLoss(w) + (λ/2)‖w‖²
Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η(∇w TrainLoss(w)+λw)
Same as gradient descent, except shrink the weights towards zero by λ.
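A sketch of the regularized update above, assuming a grad_train_loss function as in the earlier sketch; eta, lam, and T are illustrative defaults:

def regularized_gradient_descent(grad_train_loss, d, eta=0.1, lam=0.01, T=100):
    # Update: w <- w - eta * (grad TrainLoss(w) + lam * w),
    # i.e. ordinary gradient descent plus a shrink-towards-zero term.
    w = [0.0] * d
    for t in range(T):
        g = grad_train_loss(w)
        w = [wj - eta * (gj + lam * wj) for wj, gj in zip(w, g)]
    return w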
Controlling the Norm: Early Stopping
Algorithm: gradient descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η∇w TrainLoss(w)
Idea: simply make T smaller
Intuition: if we have fewer updates, then ‖w‖ can't get too big.
Lesson: try to minimize the training error, but don’t try too hard.
Regularization and bias-variance
The effects of the regularization procedure can be observed in the bias and variance
terms
• Regularization trades some bias for a considerable decrease in the variance of the model
• Regularization strives for smoother hypotheses, reducing the opportunities to overfit
• The amount of regularization λ has to be chosen specifically for each type of regularizer
• Usually λ is chosen by cross-validation
How overfitting affects predictions
[Plot: predictive error vs. model complexity; the error on the training data keeps decreasing as complexity grows, while the error on the test data is U-shaped, with underfitting at low complexity, overfitting at high complexity, and an ideal range in between]
Regularization
• A method for automatically controlling the complexity of the learned
hypothesis
• Idea: penalize large values of the parameters θj
o Incorporate the penalty into the cost function
o Works well when we have a lot of features, each of which contributes a bit to predicting the label
• Can also address overfitting by eliminating features (either manually or via model
selection)
L2 Regularization
• Regularized linear regression objective function:
min_θ TrainLoss(θ) + (λ/2) Σ_{j=1}^{d} θj²   (first term: model fit to data; second term: regularization)
o λ is the regularization parameter
o No regularization on θ0!
Slide Credits
Percy Liang
Dorsa Sadigh
Mirko Mazzoleni
Ryan P. Adams
Thank You
Mohamed bin Zayed
University of Artificial Intelligence
Masdar City
Abu Dhabi
United Arab Emirates
mbzuai.ac.ae