Linear - Regression - SGD

The document provides an overview of linear regression and gradient descent as fundamental concepts in machine learning. It discusses the definition of machine learning, the components of supervised learning, and the process of training models using regression techniques. Key topics include hypothesis selection, cost functions, and optimization methods for improving model accuracy.

Uploaded by

aidin.zaeim

Linear regression, Gradient Descent

Iran University of Science and Technology


By: M. S. Tahaei, PhD.
Winter 2025
Outline

§ Introduction to Learning
§ Linear Regression
§ Gradient Descent
§ Generalized Linear Regression
A Definition of ML

} Tom Mitchell (1998): Well-posed learning problem

} “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

} Using the observed data to make better decisions


} Generalizing from the observed data
ML Definition: Example

} Consider an email program that learns how to filter spam according to emails you do or do not mark as spam.

} T: Classifying emails as spam or not spam.
} E: Watching you label emails as spam or not spam.
} P: The number (or fraction) of emails correctly classified as spam/not spam.
The essence of machine learning

§ A pattern exists

§ We do not know it mathematically

§ We have data on it
Example: Home Price
} Housing price prediction

[Figure: scatter plot of house price ($, in 1000's) against size in feet². Figure adopted from slides of Andrew Ng, Machine Learning course, Stanford.]
Example: Bank loan
} Applicant form as the input:

} Output: approving or denying the request


Components of (Supervised) Learning

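§ Unknown target function: $f: \mathcal{X} \to \mathcal{Y}$
§ Input space: $\mathcal{X}$
§ Output space: $\mathcal{Y}$
§ Training data: $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \dots, (\mathbf{x}_N, y_N)$

§ Pick a formula $g: \mathcal{X} \to \mathcal{Y}$ that approximates the target function $f$
§ selected from a set of hypotheses $\mathcal{H}$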
Training data: Example
Training data

x1     x2     y
0.9    2.3     1
3.5    2.6     1
2.6    3.3     1
2.7    4.1     1
1.8    3.9     1
6.5    6.8    -1
7.2    7.5    -1
7.9    8.3    -1
6.9    8.3    -1
8.8    7.9    -1
9.1    6.2    -1

[Figure: scatter plot of the training points in the (x1, x2) plane.]
Solution Components
} Learning model composed of:
} Learning algorithm
} Hypothesis set

} Perceptron example
Perceptron classifier
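} Input $\mathbf{x} = (x_1, \dots, x_d)$

} Classifier:
} If $\sum_{i=1}^{d} w_i x_i > \text{threshold}$ then output $1$
} else output $-1$

} The linear formula $g \in \mathcal{H}$ can be written (with $w_0 = -\text{threshold}$):
$$g(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{d} w_i x_i + w_0\right)$$

If we add a coordinate $x_0 = 1$ to the input:
$$g(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right)$$

Vector form:
$$g(\mathbf{x}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$$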
Perceptron learning algorithm: linearly separable data
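} Given the training data $(\mathbf{x}^{(1)}, y^{(1)}), \dots, (\mathbf{x}^{(N)}, y^{(N)})$

} Misclassified data point $(\mathbf{x}^{(n)}, y^{(n)})$:
$$\operatorname{sign}(\mathbf{w}^T \mathbf{x}^{(n)}) \neq y^{(n)}$$

Repeat
    Pick a misclassified data point $(\mathbf{x}^{(n)}, y^{(n)})$ from the training data and update $\mathbf{w}$:
    $$\mathbf{w} = \mathbf{w} + y^{(n)} \mathbf{x}^{(n)}$$
Until all training data points are correctly classified by $g$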
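The loop above can be sketched in NumPy as follows. This is a minimal illustration, not the lecturer's implementation: the function names `perceptron_train` and `perceptron_predict`, the zero initialization, the rule of always picking the first misclassified point, and the `max_updates` safeguard are all assumptions made here. The data are the eleven points from the "Training data: Example" slide.

```python
import numpy as np

# Training data from the "Training data: Example" slide: columns x1, x2 and labels y.
X_raw = np.array([
    [0.9, 2.3], [3.5, 2.6], [2.6, 3.3], [2.7, 4.1], [1.8, 3.9],
    [6.5, 6.8], [7.2, 7.5], [7.9, 8.3], [6.9, 8.3], [8.8, 7.9], [9.1, 6.2],
])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])


def add_bias_column(X_raw):
    """Prepend the constant coordinate x0 = 1 so that w0 acts as the bias."""
    return np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])


def perceptron_train(X_raw, y, max_updates=100_000):
    """Perceptron learning algorithm; returns the weights [w0, w1, ..., wd]."""
    X = add_bias_column(X_raw)
    w = np.zeros(X.shape[1])  # assumed starting point
    for _ in range(max_updates):
        misclassified = np.flatnonzero(np.sign(X @ w) != y)
        if misclassified.size == 0:
            return w  # every training point is classified correctly
        n = misclassified[0]  # pick one misclassified point (here: the first)
        w = w + y[n] * X[n]   # update rule: w <- w + y_n * x_n
    raise RuntimeError("no separating hyperplane found within max_updates")


def perceptron_predict(w, X_raw):
    return np.sign(add_bias_column(X_raw) @ w)


w = perceptron_train(X_raw, y)
print("weights:", w)
print("all correct:", np.array_equal(perceptron_predict(w, X_raw), y))
```

Termination is only guaranteed when the data are linearly separable, which is why the sketch raises an error after `max_updates` instead of looping forever.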
Perceptron learning algorithm: Example of weight update
[Figure: two panels in the (x1, x2) plane, each captioned "Correct Label", illustrating one weight update.]
Experience (E) in ML

} Basic premise of learning:
} “Using a set of observations to uncover an underlying process”

} We have different types of (getting) observations in different types or paradigms of ML methods
Main Steps of Learning Tasks
} Selection of hypothesis set (or model specification)
} Which class of models (mappings) should we use for our data?

} Learning: find mapping $\hat{f}$ (from hypothesis set) based on the training data
} Which notion of error should we use? (loss functions)
} Optimization of loss function to find mapping $\hat{f}$

} Evaluation: how well $\hat{f}$ generalizes to yet unseen examples
} How do we ensure that the error on future data is minimized? (generalization)
Components of (Supervised) Learning

[Figure: diagram of the supervised learning components; the only surviving label is "Learning model".]
Linear regression, Cost Function and Gradient Descent
Regression problem
} The goal is to make (real-valued) predictions given features

} Example: predicting house price from 3 attributes

Size (m²)   Age (years)   Region   Price (10⁶ T)
100         2             5        500
80          25            3        250
…           …             …        …
Learning problem
} Selecting a hypothesis space
} Hypothesis space: a set of mappings from feature vector to target

} Learning (estimation): optimization of a cost function
} Based on the training set $D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ and a cost function we find (an estimate) $\hat{f} \in F$ of the target function

} Evaluation: we measure how well $\hat{f}$ generalizes to unseen examples
Linear regression: hypothesis space
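} Univariate
$$f: \mathbb{R} \to \mathbb{R}, \qquad f(x; \mathbf{w}) = w_0 + w_1 x$$

} Multivariate
$$f: \mathbb{R}^d \to \mathbb{R}, \qquad f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$$

$\mathbf{w} = [w_0, w_1, \dots, w_d]^T$ are the parameters we need to set.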


Learning algorithm and distance metrics
} Select how to measure the error (i.e., prediction loss)
} Find the minimum of the resulting error or cost function
Learning algorithm

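[Figure: the training set $D$ feeds the learning algorithm, which outputs the parameters $w_0, w_1$; the resulting $f(x) = f(x; \mathbf{w})$ maps the size of a house $x$ to an estimated house price.]

We need to
(1) measure how well $f(x; \mathbf{w})$ approximates the target
(2) choose $\mathbf{w}$ to minimize the error measure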
How to measure the error

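[Figure: scatter plot of the data against $x$ with a fitted line; the vertical gap at each point is the error $y^{(i)} - f(x^{(i)}; \mathbf{w})$.]

Squared error: $\left(y^{(i)} - f(x^{(i)}; \mathbf{w})\right)^2$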
Linear regression: univariate example
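[Figure: scatter plot of the data against $x$ with a fitted line and the errors $y^{(i)} - f(x^{(i)}; \mathbf{w})$.]

Cost function:
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(x^{(i)}; \mathbf{w})\right)^2$$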
Regression: squared loss
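} In the SSE cost function, we used squared error as the prediction loss:
$$Loss(y, \hat{y}) = (y - \hat{y})^2, \qquad \hat{y} = f(\mathbf{x}; \mathbf{w})$$

} Cost function (based on the training set):
$$J(\mathbf{w}) = \sum_{i=1}^{n} Loss\left(y^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})\right) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$$

} Minimizing the sum (or mean) of squared errors is a common approach in curve fitting, neural networks, etc.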
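Sum of Squares Error (SSE) cost function
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$$

} $J(\mathbf{w})$: sum of the squares of the prediction errors on the training set

} We want to find the best regression function $f(\mathbf{x}; \mathbf{w})$
} equivalently, the best $\mathbf{w}$

} Minimize $J(\mathbf{w})$
} Find optimal $\hat{f}(\mathbf{x}) = f(\mathbf{x}; \hat{\mathbf{w}})$ where $\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} J(\mathbf{w})$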
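As a small numeric check of the SSE definition, the sketch below evaluates $J(\mathbf{w})$ for two candidate weight vectors. The helper names `predict` and `sse_cost` and the three-point toy data set are hypothetical, chosen here only so the sums can be verified by hand.

```python
import numpy as np


def predict(X_raw, w):
    """f(x; w) = w0 + w1*x1 + ... + wd*xd for every row of X_raw."""
    return w[0] + X_raw @ w[1:]


def sse_cost(X_raw, y, w):
    """J(w): sum of squared prediction errors on the training set."""
    residuals = y - predict(X_raw, w)
    return float(residuals @ residuals)


# Hypothetical toy data (not from the slides): one feature, three examples.
X_raw = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

print(sse_cost(X_raw, y, np.array([0.0, 2.0])))  # perfect fit y = 2x -> 0.0
print(sse_cost(X_raw, y, np.array([0.0, 1.0])))  # residuals 1, 2, 3 -> 14.0
```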
Cost function: univariate example
[Figure: left, training data with price ($, in 1000's) against size in feet² ($x$); right, surface plot of $J(\mathbf{w})$ as a function of the parameters $w_0, w_1$.]
[Figure sequence (four slides): left, the line $f(x; w_0, w_1) = w_0 + w_1 x$ (for fixed $w_0, w_1$, this is a function of $x$); right, $J(w_0, w_1)$ (function of the parameters $w_0, w_1$), plotted over the $w_0$ and $w_1$ axes.]
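Cost function optimization: univariate
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$$

} Necessary conditions for the “optimal” parameter values:
$$\frac{\partial J(\mathbf{w})}{\partial w_0} = 0, \qquad \frac{\partial J(\mathbf{w})}{\partial w_1} = 0$$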
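Optimality conditions: univariate
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - w_0 - w_1 x^{(i)}\right)^2$$

$$\frac{\partial J(\mathbf{w})}{\partial w_1} = \sum_{i=1}^{n} 2\left(y^{(i)} - w_0 - w_1 x^{(i)}\right)\left(-x^{(i)}\right) = 0$$

$$\frac{\partial J(\mathbf{w})}{\partial w_0} = \sum_{i=1}^{n} 2\left(y^{(i)} - w_0 - w_1 x^{(i)}\right)(-1) = 0$$

Exercise!
} A system of 2 linear equations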
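One way to see the two optimality conditions as a linear system is to collect the sums: $n w_0 + (\sum_i x^{(i)}) w_1 = \sum_i y^{(i)}$ and $(\sum_i x^{(i)}) w_0 + (\sum_i (x^{(i)})^2) w_1 = \sum_i x^{(i)} y^{(i)}$. The sketch below solves that system numerically; the function name `fit_univariate` and the noise-free data are assumptions made for illustration.

```python
import numpy as np


def fit_univariate(x, y):
    """Solve the two optimality conditions for (w0, w1)."""
    n = len(x)
    # dJ/dw0 = 0  ->  n*w0      + sum(x)*w1    = sum(y)
    # dJ/dw1 = 0  ->  sum(x)*w0 + sum(x**2)*w1 = sum(x*y)
    A = np.array([[n, x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    b = np.array([y.sum(), (x * y).sum()])
    w0, w1 = np.linalg.solve(A, b)
    return w0, w1


# Hypothetical noise-free data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
w0, w1 = fit_univariate(x, y)
print(w0, w1)  # approximately 1.0 and 2.0
```

The system has a unique solution as long as the inputs are not all identical, since the matrix `A` is then invertible.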
Cost function: multivariate for various features
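} We have to minimize the empirical squared loss:
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$$
$$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$$
$$\mathbf{w} = [w_0, w_1, \dots, w_d]^T$$
$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^{d+1}} J(\mathbf{w})$$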
Cost function and optimal linear model

} Necessary conditions for the “optimal” parameter values:
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0}$$
} A system of $d + 1$ linear equations
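Cost function: matrix notation
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)^2$$

$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad
\mathbf{X} = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_d^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_d^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(n)} & \cdots & x_d^{(n)} \end{bmatrix}, \qquad
\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}$$

$$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$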
Minimizing cost function
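Optimal linear weight vector (for SSE cost function):
$$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$

$$\nabla_{\mathbf{w}} J(\mathbf{w}) = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w})$$
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$
$$\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

$$\hat{\mathbf{w}} = \mathbf{X}^{\dagger}\mathbf{y}, \qquad \mathbf{X}^{\dagger} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$$

$\mathbf{X}^{\dagger}$ is the pseudo-inverse of $\mathbf{X}$ (this form requires $\mathbf{X}^T\mathbf{X}$ to be invertible).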
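The closed-form solution can be checked numerically. The sketch below is an illustration on synthetic data (the data, the seed, and the variable names are assumptions, not part of the slides); it solves the normal equations directly and compares the result with NumPy's pseudo-inverse and least-squares routines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: n = 50 examples, d = 2 features, small noise.
n, d = 50, 2
X_raw = rng.normal(size=(n, d))
w_true = np.array([1.0, 2.0, -3.0])        # [w0, w1, w2]
X = np.hstack([np.ones((n, 1)), X_raw])    # design matrix with a leading column of ones
y = X @ w_true + 0.01 * rng.normal(size=n)

# Normal equations: solve (X^T X) w = X^T y without forming the inverse explicitly.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same solution through the pseudo-inverse and through a least-squares solver.
w_pinv = np.linalg.pinv(X) @ y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(np.allclose(w_normal, w_pinv), np.allclose(w_normal, w_lstsq))
```

Solving the linear system (or calling `np.linalg.lstsq`) is generally preferred in practice over computing the inverse of $\mathbf{X}^T\mathbf{X}$, because it is numerically more stable.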
Another approach for optimizing the sum squared error
} Iterative approach for solving the following optimization problem:
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$$
Gradient descent
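} Cost function: $J(\mathbf{w})$
} Optimization problem: $\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} J(\mathbf{w})$

} Steps:
} Start from $\mathbf{w}^0$
} Repeat
} Update $\mathbf{w}^t$ to $\mathbf{w}^{t+1}$ in order to reduce $J$
} $t \leftarrow t + 1$
} until we hopefully end up at a minimum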


Gradient descent
} First-order optimization algorithm to find $\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} J(\mathbf{w})$
} Also known as “steepest descent”

} Each iteration takes a step proportional to the negative of the gradient vector of the function at the current point $\mathbf{w}^t$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \gamma_t \nabla J(\mathbf{w}^t)$$
} $J(\mathbf{w})$ decreases fastest if one goes from $\mathbf{w}^t$ in the direction of $-\nabla J(\mathbf{w}^t)$

} Assumption: $J(\mathbf{w})$ is defined and differentiable in a neighborhood of a point $\mathbf{w}^t$

Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
Gradient descent

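} Minimize $J(\mathbf{w})$
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$$
where $\eta$ is the step size (learning rate parameter) and
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left[\frac{\partial J(\mathbf{w})}{\partial w_1}, \frac{\partial J(\mathbf{w})}{\partial w_2}, \dots, \frac{\partial J(\mathbf{w})}{\partial w_d}\right]$$

} If $\eta$ is small enough, then $J(\mathbf{w}^{t+1}) \le J(\mathbf{w}^t)$.
} $\eta$ can be allowed to change at every iteration as $\eta_t$.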
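The update rule can be written as a short generic loop. The sketch below is illustrative only: the function name `gradient_descent`, the fixed number of steps used as a stopping rule, and the quadratic test function are assumptions made here.

```python
import numpy as np


def gradient_descent(grad, w0, eta, num_steps):
    """Repeat w <- w - eta * grad(w) for a fixed number of steps."""
    w = np.asarray(w0, dtype=float)
    for _ in range(num_steps):
        w = w - eta * grad(w)
    return w


# Hypothetical convex example: J(w) = (w1 - 3)^2 + (w2 + 1)^2, minimized at (3, -1).
def J(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def grad_J(w):
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])


w_start = np.array([0.0, 0.0])
w_end = gradient_descent(grad_J, w_start, eta=0.1, num_steps=200)
print(w_end, J(w_end))  # w_end is close to [3, -1]
```

For this quadratic, each step with $\eta = 0.1$ shrinks the distance to the minimizer by a factor of 0.8, so the cost decreases at every iteration.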
Gradient descent

} Local minima problem

} However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
Problem of gradient descent with non-convex cost functions

[Figure (two slides): surface plot of a non-convex $J(w_0, w_1)$ over the $w_0$ and $w_1$ axes.]
Gradient descent for SSE cost function
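} Minimize $J(\mathbf{w})$
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$$

} $J(\mathbf{w})$: Sum of squares error
$$J(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$$
(the factor $\frac{1}{2}$ cancels the 2 produced by differentiating the square)

} Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x}$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left(y^{(i)} - (\mathbf{w}^t)^T\mathbf{x}^{(i)}\right)\mathbf{x}^{(i)}$$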
Gradient descent for SSE cost function
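} Weight update rule for $f(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T\mathbf{x}$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left(y^{(i)} - (\mathbf{w}^t)^T\mathbf{x}^{(i)}\right)\mathbf{x}^{(i)}$$
Batch mode: each step considers all training data

} $\eta$ too small → gradient descent can be slow.
} $\eta$ too large → gradient descent can overshoot the minimum. It may fail to converge, or even diverge.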
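The effect of the learning rate can be observed on a small problem. The following sketch applies the batch update rule with three values of $\eta$; the function name `batch_gd`, the synthetic data, the zero initialization, and the specific learning rates are assumptions picked so that one run converges, one is slow, and one diverges on this particular data set.

```python
import numpy as np


def batch_gd(X, y, eta, num_steps):
    """Batch gradient descent for J(w) = 1/2 * sum_i (y_i - w^T x_i)^2.

    Returns the final weights and the cost after every step."""
    w = np.zeros(X.shape[1])
    costs = []
    for _ in range(num_steps):
        residuals = y - X @ w              # uses all training data
        w = w + eta * X.T @ residuals      # w <- w + eta * sum_i (y_i - w^T x_i) x_i
        costs.append(0.5 * np.sum((y - X @ w) ** 2))
    return w, costs


# Hypothetical data from y = 1 + 2x; the first column of X is the constant x0 = 1.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

w_good, costs_good = batch_gd(X, y, eta=0.05, num_steps=2000)
w_slow, costs_slow = batch_gd(X, y, eta=0.0005, num_steps=2000)
w_big, costs_big = batch_gd(X, y, eta=0.2, num_steps=50)

print("eta=0.05  :", w_good, costs_good[-1])   # converges to about [1, 2]
print("eta=0.0005:", w_slow, costs_slow[-1])   # still far from the minimum
print("eta=0.2   :", costs_big[0], costs_big[-1])  # cost grows: divergence
```

For a quadratic cost like this one, the batch update is stable only when $\eta$ is below 2 divided by the largest eigenvalue of $\mathbf{X}^T\mathbf{X}$ (roughly 0.079 for this data), which is why $\eta = 0.2$ diverges.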
[Figure sequence (nine slides): left, the line $f(x; w_0, w_1) = w_0 + w_1 x$; right, $J(w_0, w_1)$ (function of the parameters $w_0, w_1$) over the $w_0$ and $w_1$ axes, shown over successive slides.]

This example has been adopted from: Prof. Ng’s slides (ML Online Course, Stanford)
Stochastic gradient descent
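} Example: Linear regression with SSE cost function
$$J^{(i)}(\mathbf{w}) = \left(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)}\right)^2$$

$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J^{(i)}(\mathbf{w}^t)$$

$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \left(y^{(i)} - (\mathbf{w}^t)^T\mathbf{x}^{(i)}\right)\mathbf{x}^{(i)}$$
(the constant factor 2 from the derivative is absorbed into $\eta$)
Least Mean Squares (LMS)

It is suited to sequential or online learning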


Stochastic gradient descent: online learning
} Sequential learning is also appropriate for real-time applications
} Data observations arrive in a continuous stream
} and predictions must be made before seeing all of the data
} The value of $\eta$ needs to be chosen with care to ensure that the algorithm converges
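A minimal sketch of the LMS update, processing one example at a time, is given below. The function name `lms_sgd`, the random visiting order, the fixed learning rate, and the noise-free data are assumptions made for illustration; in a true streaming setting each example would be seen once as it arrives rather than revisited over epochs.

```python
import numpy as np


def lms_sgd(X, y, eta, num_epochs, seed=0):
    """Stochastic gradient descent (LMS rule) for linear regression.

    Each update uses a single example (x_i, y_i)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for i in rng.permutation(len(y)):   # visit the examples in random order
            error = y[i] - w @ X[i]
            w = w + eta * error * X[i]      # w <- w + eta * (y_i - w^T x_i) x_i
    return w


# Hypothetical noise-free data from y = 1 + 2x, with x0 = 1 as the first column.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

w = lms_sgd(X, y, eta=0.1, num_epochs=500)
print(w)  # close to [1, 2] for this noise-free data
```

With noisy data and a fixed $\eta$, the iterates typically keep fluctuating around the minimizer, which is one reason a decreasing schedule $\eta_t$ is often used.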
Evaluation and generalization

} Why do we minimize the cost function (based only on training data) when we are interested in the performance on new examples?

$$\min_{\boldsymbol{\theta}} \sum_{i=1}^{n} Loss\left(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right) \qquad \text{(empirical loss)}$$

} Evaluation: After training, we need to measure how well the learned prediction function predicts the target for unseen examples
Training and test performance
} Assumption: training and test examples are drawn independently at random from the same but unknown distribution.
} Each training/test example $(\mathbf{x}, y)$ is a sample from the joint probability distribution $P(\mathbf{x}, y)$, i.e., $(\mathbf{x}, y) \sim P$

$$\text{Empirical (training) loss} = \frac{1}{n}\sum_{i=1}^{n} Loss\left(y^{(i)}, f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\right)$$

$$\text{Expected (test) loss} = E_{\mathbf{x},y}\left[Loss\left(y, f(\mathbf{x}; \boldsymbol{\theta})\right)\right]$$

} We minimize empirical loss (on the training data) and expect to also find an acceptable expected loss
} Empirical loss as a proxy for the performance over the whole distribution.
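The two quantities can be compared on simulated data, where the distribution $P$ is known. In the sketch below, the data-generating process, the sample sizes, and the helper names `sample` and `mean_squared_loss` are all assumptions; the expected loss is only approximated by averaging over a large held-out sample.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(n):
    """Draw n i.i.d. examples from a hypothetical P(x, y): y = 1 + 2x + noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=n)
    return np.column_stack([np.ones(n), x]), y


def mean_squared_loss(X, y, w):
    return float(np.mean((y - X @ w) ** 2))


X_train, y_train = sample(30)
X_test, y_test = sample(100_000)   # large held-out sample approximates the expected loss

# Minimize the empirical (training) loss.
w_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_loss = mean_squared_loss(X_train, y_train, w_hat)
test_loss = mean_squared_loss(X_test, y_test, w_hat)
print(f"empirical (training) loss:      {train_loss:.4f}")
print(f"estimated expected (test) loss: {test_loss:.4f}")
```

Because the noise variance here is $0.3^2 = 0.09$, no predictor can achieve an expected squared loss below about 0.09, whatever its training loss.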
Linear regression: number of training data

[Figure: three panels showing linear regression fits for n = 10, n = 20, and n = 50 training examples.]
Linear regression: generalization
} By increasing the number of training examples, will the solution be better?
} Why does the mean squared error stop decreasing after reaching a certain level?
Linear regression: types of errors
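} Structural error: the error introduced by the limited function class (infinite training data):

$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T\mathbf{x}\right)^2\right]$$

$$\text{Structural error: } E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^{*T}\mathbf{x}\right)^2\right]$$

where $\mathbf{w}^* = (w_0^*, \dots, w_d^*)$ are the optimal linear regression parameters (infinite training data)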
Linear regression: types of errors
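} Approximation error measures how close we can get to the optimal linear predictions with limited training data:

$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} E_{\mathbf{x},y}\left[\left(y - \mathbf{w}^T\mathbf{x}\right)^2\right]$$

$$\hat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T\mathbf{x}^{(i)}\right)^2$$

$$\text{Approximation error: } E_{\mathbf{x}}\left[\left(\mathbf{w}^{*T}\mathbf{x} - \hat{\mathbf{w}}^T\mathbf{x}\right)^2\right]$$

where $\hat{\mathbf{w}}$ are the parameter estimates based on a small training set (so they are themselves random variables).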
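Both error types can be estimated by simulation when the data distribution is known. The sketch below is a rough illustration under assumptions made here: the target is $y = x^2$ plus noise (so a linear model has a non-zero structural error), a sample of 200,000 points stands in for "infinite training data", and the approximation error is averaged over 200 random training sets of each size.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(n):
    """Hypothetical non-linear target: y = x^2 + noise, x uniform on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x ** 2 + rng.normal(scale=0.1, size=n)
    return np.column_stack([np.ones(n), x]), y


def fit(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


# A very large sample stands in for "infinite training data".
X_big, y_big = sample(200_000)
w_star = fit(X_big, y_big)
structural_error = float(np.mean((y_big - X_big @ w_star) ** 2))


def approximation_error(n, num_trials=200):
    """Average of E_x[(w*^T x - w_hat^T x)^2] over training sets of size n."""
    errors = []
    for _ in range(num_trials):
        w_hat = fit(*sample(n))
        errors.append(np.mean((X_big @ w_star - X_big @ w_hat) ** 2))
    return float(np.mean(errors))


err_small = approximation_error(n=10)
err_large = approximation_error(n=1000)
print(f"structural error    ~ {structural_error:.4f}")
print(f"approximation error ~ {err_small:.4f} (n=10), {err_large:.6f} (n=1000)")
```

In this setup the structural error stays near 0.1 regardless of the training set size, while the approximation error shrinks as $n$ grows, which matches the plateau in the mean squared error discussed above.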
Recall: Linear regression (squared loss)
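} Linear regression functions
$$f: \mathbb{R} \to \mathbb{R}, \qquad f(x; \mathbf{w}) = w_0 + w_1 x$$
$$f: \mathbb{R}^d \to \mathbb{R}, \qquad f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$$
$\mathbf{w} = [w_0, w_1, \dots, w_d]^T$ are the parameters we need to set.

} Minimizing the squared loss for linear regression
$$J(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$$

} We obtain $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$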

Beyond linear regression

} How to extend linear regression to non-linear functions?
} Transform the data using basis functions
} Learn a linear regression on the new feature vectors (obtained by basis functions)
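Beyond linear regression
} $m$-th order polynomial regression (univariate $f: \mathbb{R} \to \mathbb{R}$)
$$f(x; \mathbf{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$$

} Solution: $\hat{\mathbf{w}} = (\mathbf{X}'^T\mathbf{X}')^{-1}\mathbf{X}'^T\mathbf{y}$

$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad
\mathbf{X}' = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & (x^{(2)})^2 & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(n)} & (x^{(n)})^2 & \cdots & (x^{(n)})^m \end{bmatrix}, \qquad
\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$$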
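The polynomial design matrix $\mathbf{X}'$ is a Vandermonde matrix, which NumPy can build directly. The sketch below is illustrative: the function names and the noise-free cubic data are assumptions, and a least-squares solver is used in place of the explicit inverse for numerical stability.

```python
import numpy as np


def polynomial_design_matrix(x, m):
    """Rows [1, x, x^2, ..., x^m] for each scalar input in x."""
    return np.vander(x, N=m + 1, increasing=True)


def fit_polynomial(x, y, m):
    """Least-squares weights [w0, ..., wm] for an m-th order polynomial."""
    X_prime = polynomial_design_matrix(x, m)
    w, *_ = np.linalg.lstsq(X_prime, y, rcond=None)
    return w


# Hypothetical noise-free data from the cubic y = 1 - 2x + 0.5x^3.
x = np.linspace(-2.0, 2.0, 25)
y = 1.0 - 2.0 * x + 0.5 * x ** 3

w = fit_polynomial(x, y, m=3)
print(w)  # approximately [1, -2, 0, 0.5]
```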
Polynomial regression: example

[Figure: four panels showing polynomial fits of order m = 1, m = 3, m = 5, and m = 7.]
Generalized linear
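} Linear combination of fixed non-linear functions of the input vector

$$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1\phi_1(\mathbf{x}) + \dots + w_m\phi_m(\mathbf{x})$$

$\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features)
$$\phi_i(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}$$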
Basis functions: examples
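} Linear: $\phi_i(\mathbf{x}) = x_i$ for $i = 1, \dots, d$ (so $m = d$)

} Polynomial (univariate): $\phi_i(x) = x^i$ for $i = 1, \dots, m$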
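Generalized linear: optimization
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 = \sum_{i=1}^{n} \left(y^{(i)} - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)})\right)^2$$

$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad
\boldsymbol{\Phi} = \begin{bmatrix} 1 & \phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)}) \\ 1 & \phi_1(\mathbf{x}^{(2)}) & \cdots & \phi_m(\mathbf{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\mathbf{x}^{(n)}) & \cdots & \phi_m(\mathbf{x}^{(n)}) \end{bmatrix}, \qquad
\mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$$

$$\hat{\mathbf{w}} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{y}$$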
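The same least-squares machinery applies once the inputs are mapped through the basis functions. In the sketch below, the helper names `design_matrix` and `fit_generalized_linear`, the three basis functions (including a product term $x_1 x_2$), and the synthetic data are hypothetical choices for illustration and do not come from the slides.

```python
import numpy as np


def design_matrix(X_raw, basis_functions):
    """Phi with rows [1, phi_1(x), ..., phi_m(x)] for each input row x."""
    columns = [np.ones(X_raw.shape[0])]
    columns += [np.array([phi(x) for x in X_raw]) for phi in basis_functions]
    return np.column_stack(columns)


def fit_generalized_linear(X_raw, y, basis_functions):
    Phi = design_matrix(X_raw, basis_functions)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # minimizes ||y - Phi w||^2
    return w


# Hypothetical basis functions on 2-D inputs (illustrative choices only).
basis_functions = [
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[0] * x[1],
]

rng = np.random.default_rng(0)
X_raw = rng.uniform(-1.0, 1.0, size=(40, 2))
y = 0.5 + 1.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + 3.0 * X_raw[:, 0] * X_raw[:, 1]

w = fit_generalized_linear(X_raw, y, basis_functions)
print(w)  # approximately [0.5, 1, -2, 3]
```

The model remains linear in $\mathbf{w}$ even though it is non-linear in $\mathbf{x}$, which is why the closed-form solution carries over unchanged.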
Resources
1. C. M. Bishop, Pattern Recognition and Machine Learning.
2. Y. S. Abu-Mostafa, “Machine learning.” California Institute of Technology, 2012.
3. Machine Learning, Dr. Soleymani, Sharif University.
