Linear Regression

Overview:
• Linear regression aims to predict a target value from an input value using a linear model.
• Nonlinear basis functions can be applied to the inputs to capture nonlinear relationships while still using a linear model.
• Maximum likelihood estimation and least squares are commonly used to fit the model parameters by minimizing the sum of squared errors.
• Regularization can be added to the least-squares objective to prevent overfitting, and multiple outputs can be modeled with a matrix of weights.


Introduction to Machine Learning

Linear Models for Regression


Yen-Yu Lin (林彥宇), Professor
Department of Computer Science, National Yang Ming Chiao Tung University (國立陽明交通大學 資訊工程學系)

Some slides are modified from Prof. Sheng-Jyh Wang and Prof. Hwang-Tzong Chen

Regression

• Given a training data set comprising 𝑁 observations {𝐱𝑛}, 𝑛 = 1, … , 𝑁,
  and the corresponding target values {𝑡𝑛}, the goal of regression is to
  predict the value of 𝑡 for a new value of 𝐱

https://www.scribbr.com/statistics/linear-regression-in-r/
2
A simple regression model

• A simple linear model: 𝑦(𝐱, 𝐰) = 𝑤0 + 𝑤1𝑥1 + ⋯ + 𝑤𝐷𝑥𝐷

➢ Each observation is in a 𝐷-dimensional space: 𝐱 = (𝑥1, … , 𝑥𝐷)T
➢ 𝑦 is a regression model parametrized by 𝐰 = (𝑤0, … , 𝑤𝐷)T
➢ The output is a linear combination of the input variables
➢ It is a linear function of the parameters
➢ Its fitting power is quite limited, so we seek a nonlinear extension
  for the input variables

3
An example

• A regressor in the form of 𝑦(𝑥, 𝐰) = 𝑤0 + 𝑤1𝑥

➢ A straight line in this case → insufficient fitting power
➢ Apply nonlinear feature transforms before linear regression

4
Linear regression with nonlinear basis functions

• Simple linear model: 𝑦(𝐱, 𝐰) = 𝑤0 + 𝑤1𝑥1 + ⋯ + 𝑤𝐷𝑥𝐷

• A linear model with nonlinear basis functions:

  𝑦(𝐱, 𝐰) = 𝑤0 + ∑𝑗 𝑤𝑗 𝜙𝑗(𝐱)  (𝑗 = 1, … , 𝑀 − 1)

  where {𝜙𝑗}: nonlinear basis functions
        𝑀: the number of parameters
        𝑤0: the bias parameter allowing a fixed offset

• The regression output is a linear combination of nonlinear basis
  functions of the inputs
5
Linear regression with nonlinear basis functions

• A linear model with nonlinear basis functions:

  𝑦(𝐱, 𝐰) = 𝑤0 + ∑𝑗 𝑤𝑗 𝜙𝑗(𝐱)  (𝑗 = 1, … , 𝑀 − 1)

• Let 𝜙0(𝐱) = 1 be a dummy basis function. The regression function
  is equivalently expressed as

  𝑦(𝐱, 𝐰) = ∑𝑗 𝑤𝑗 𝜙𝑗(𝐱) = 𝐰T𝝓(𝐱)  (𝑗 = 0, … , 𝑀 − 1)

  where 𝐰 = (𝑤0, … , 𝑤𝑀−1)T and 𝝓(𝐱) = (𝜙0(𝐱), … , 𝜙𝑀−1(𝐱))T

6
Examples of basis functions
• Polynomial basis function: taking the form of powers of 𝑥

  𝜙𝑗(𝑥) = 𝑥^𝑗

• Gaussian basis function: governed by a location 𝜇𝑗 and a scale 𝑠

  𝜙𝑗(𝑥) = exp(−(𝑥 − 𝜇𝑗)² / (2𝑠²))

➢ 𝜇𝑗 governs the location while 𝑠 governs the scale

• Sigmoidal basis function: governed by 𝜇𝑗 and 𝑠

  𝜙𝑗(𝑥) = 𝜎((𝑥 − 𝜇𝑗) / 𝑠), where 𝜎(𝑎) = 1 / (1 + exp(−𝑎))

7
How basis functions work

• Take Gaussian basis functions as an example

y = w0 + w11 ( x ) + w22 ( x ) + ... + wM −1M −1 ( x )

1(x) 2(x) 3(x) 4(x) 5(x) 6(x) 7(x) 8(x)

8
Maximum likelihood and least squares

• Assume each observation is sampled from a deterministic function
  with added Gaussian noise:

  𝑡 = 𝑦(𝐱, 𝐰) + 𝜀

  where 𝜀 is a zero-mean Gaussian noise whose precision (inverse
  variance) is 𝛽

• Thus, we have the conditional probability

  𝑝(𝑡 | 𝐱, 𝐰, 𝛽) = 𝒩(𝑡 | 𝑦(𝐱, 𝐰), 𝛽⁻¹)

9
Maximum likelihood and least squares

• Given a data set of inputs X = {𝐱1, … , 𝐱𝑁} with corresponding
  target values 𝑡1, … , 𝑡𝑁, we have the likelihood function

  𝑝(𝐭 | X, 𝐰, 𝛽) = ∏𝑛 𝒩(𝑡𝑛 | 𝐰T𝝓(𝐱𝑛), 𝛽⁻¹)

• The log likelihood function is

  ln 𝑝(𝐭 | 𝐰, 𝛽) = (𝑁/2) ln 𝛽 − (𝑁/2) ln(2π) − 𝛽 𝐸𝐷(𝐰)

  where 𝐸𝐷(𝐰) = (1/2) ∑𝑛 {𝑡𝑛 − 𝐰T𝝓(𝐱𝑛)}²

10
Maximum likelihood and least squares

• Given a data set of inputs X = {𝐱1, … , 𝐱𝑁} with corresponding
  target values 𝑡1, … , 𝑡𝑁, we have the likelihood function

  𝑝(𝐭 | X, 𝐰, 𝛽) = ∏𝑛 𝒩(𝑡𝑛 | 𝐰T𝝓(𝐱𝑛), 𝛽⁻¹)

• The log likelihood function is

  ln 𝑝(𝐭 | 𝐰, 𝛽) = (𝑁/2) ln 𝛽 − (𝑁/2) ln(2π) − 𝛽 𝐸𝐷(𝐰)   (How?)

  where 𝐸𝐷(𝐰) = (1/2) ∑𝑛 {𝑡𝑛 − 𝐰T𝝓(𝐱𝑛)}²
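One way to answer the "How?" above, reconstructed here as a worked step following the PRML presentation that these slides cite: take the logarithm of the product of Gaussians and use ln 𝒩(𝑡 | 𝜇, 𝛽⁻¹) = (1/2) ln 𝛽 − (1/2) ln(2π) − (𝛽/2)(𝑡 − 𝜇)².

\ln p(\mathbf{t}\mid\mathbf{w},\beta)
  = \sum_{n=1}^{N} \ln \mathcal{N}\!\left(t_n \mid \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1}\right)
  = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}),
\qquad
E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\bigl\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\bigr\}^2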

11
Maximum likelihood and least squares

• Gaussian noise likelihood ⟺ sum-of-squares error function

• Maximum likelihood solution: optimize 𝐰 by maximizing the
  log likelihood function

• Step 1: Compute the gradient of the log likelihood w.r.t. 𝐰

  ∇𝐰 ln 𝑝(𝐭 | 𝐰, 𝛽) = 𝛽 ∑𝑛 {𝑡𝑛 − 𝐰T𝝓(𝐱𝑛)} 𝝓(𝐱𝑛)T

• Step 2: Set the gradient to zero, which gives

  0 = ∑𝑛 𝑡𝑛 𝝓(𝐱𝑛)T − 𝐰T (∑𝑛 𝝓(𝐱𝑛) 𝝓(𝐱𝑛)T)

12
Maximum likelihood and least squares

• Define the design matrix 𝚽 in this task, with elements 𝚽𝑛𝑗 = 𝜙𝑗(𝐱𝑛)

➢ It has 𝑁 rows, one for each training sample
➢ It has 𝑀 columns, one for each basis function
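For reference, the full design matrix described by the two bullets above, written out in the form used in PRML (the source these slides cite):

\boldsymbol{\Phi} =
\begin{pmatrix}
\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots               & \vdots               & \ddots & \vdots                   \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)
\end{pmatrix},
\qquad \Phi_{nj} = \phi_j(\mathbf{x}_n)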

13
Maximum likelihood and least squares

• Setting the gradient to zero

  0 = ∑𝑛 𝑡𝑛 𝝓(𝐱𝑛)T − 𝐰T (∑𝑛 𝝓(𝐱𝑛) 𝝓(𝐱𝑛)T)

  we have the normal equations

  𝐰ML = (𝚽T𝚽)⁻¹𝚽T𝐭

• How to derive?
➢ Hint 1:
➢ Hint 2:

14
Maximum likelihood and least squares

• The ML solution

  𝐰ML = (𝚽T𝚽)⁻¹𝚽T𝐭 = 𝚽†𝐭

• 𝚽† = (𝚽T𝚽)⁻¹𝚽T is known as the Moore-Penrose pseudo-inverse of
  the design matrix

• 𝚽 has linearly independent columns. Why is 𝚽T𝚽 invertible?
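A minimal NumPy sketch, not part of the original slides, of the maximum-likelihood solution via the pseudo-inverse; the design matrix Phi and target vector t are assumed to be given, and beta_ml follows the formula on the next slide.

import numpy as np

def fit_ml(Phi, t):
    # w_ML = pinv(Phi) @ t, i.e., (Phi^T Phi)^(-1) Phi^T t when Phi has full column rank
    w_ml = np.linalg.pinv(Phi) @ t
    residuals = t - Phi @ w_ml
    beta_ml = len(t) / np.sum(residuals ** 2)   # 1/beta_ML = average squared error
    return w_ml, beta_ml

# np.linalg.lstsq(Phi, t, rcond=None) is a numerically safer alternative in practice.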

15
Maximum likelihood and least squares

• Similarly, 𝛽 is optimized by maximizing the log likelihood

  ln 𝑝(𝐭 | 𝐰, 𝛽) = (𝑁/2) ln 𝛽 − (𝑁/2) ln(2π) − 𝛽 𝐸𝐷(𝐰)

  where 𝐸𝐷(𝐰) = (1/2) ∑𝑛 {𝑡𝑛 − 𝐰T𝝓(𝐱𝑛)}²

• We get

  1/𝛽ML = (1/𝑁) ∑𝑛 {𝑡𝑛 − (𝐰ML)T𝝓(𝐱𝑛)}²

16
Regression for a new data point

• The conditional probability (likelihood function)

  𝑝(𝑡 | 𝐱, 𝐰, 𝛽) = 𝒩(𝑡 | 𝑦(𝐱, 𝐰), 𝛽⁻¹)

• After learning, we set 𝐰 ← 𝐰ML and 𝛽 ← 𝛽ML

• The prediction for a new data point 𝐱 is then specified as a Gaussian
  distribution with mean 𝑦(𝐱, 𝐰ML) and variance 𝛽ML⁻¹

17
Regularized least squares

• Adding a regularization term helps alleviate over-fitting

• The simplest form of the regularization term: 𝐸𝑊(𝐰) = (1/2) 𝐰T𝐰

• The total error function becomes

  (1/2) ∑𝑛 {𝑡𝑛 − 𝐰T𝝓(𝐱𝑛)}² + (λ/2) 𝐰T𝐰

• Setting the gradient of this function w.r.t. 𝐰 to 0, we have

  𝐰 = (λ𝐈 + 𝚽T𝚽)⁻¹𝚽T𝐭
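A minimal NumPy sketch, not part of the original slides, of the regularized (ridge) least-squares solution above; Phi, t, and the regularization coefficient lam are assumed to be given.

import numpy as np

def fit_regularized(Phi, t, lam):
    # w = (lambda I + Phi^T Phi)^(-1) Phi^T t, solved without forming an explicit inverse
    M = Phi.shape[1]
    A = lam * np.eye(M) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)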

18
Regularized least squares

• A more general regularizer

  (1/2) ∑𝑛 {𝑡𝑛 − 𝐰T𝝓(𝐱𝑛)}² + (λ/2) ∑𝑗 |𝑤𝑗|^𝑞

• 𝑞 = 2 → quadratic regularizer
• 𝑞 = 1 → the lasso in the statistics literature
• [Figure: contours of the regularization term for different values of 𝑞]

19
Multiple outputs

• In some applications, we wish to predict 𝐾 > 1 target values

➢ One target value: Income → Happiness
➢ Multiple target values: Income → Happiness, Hours of duty, Health

• Recall the one-dimensional case: 𝑦(𝐱, 𝐰) = 𝐰T𝝓(𝐱)

• With the same basis functions, the regression approach becomes

  𝐲(𝐱, 𝐖) = 𝐖T𝝓(𝐱)

  where 𝐖 is an 𝑀 × 𝐾 matrix, 𝑀 is the number of basis functions,
  and 𝐾 is the number of target values

20
Multiple outputs

• The conditional probability of a single observation is

  𝑝(𝐭 | 𝐱, 𝐖, 𝛽) = 𝒩(𝐭 | 𝐖T𝝓(𝐱), 𝛽⁻¹𝐈)

➢ An isotropic Gaussian, i.e., its covariance matrix is a scalar
  multiple of the identity
➢ Each pair of target variables is independent

• The log likelihood function is

  ln 𝑝(𝐓 | 𝐗, 𝐖, 𝛽) = (𝑁𝐾/2) ln(𝛽/2π) − (𝛽/2) ∑𝑛 ‖𝐭𝑛 − 𝐖T𝝓(𝐱𝑛)‖²

21
Multiple outputs: Maximum likelihood solution

• Setting the gradient of the log likelihood function w.r.t. 𝐖 to 0,
  we have

  𝐖ML = (𝚽T𝚽)⁻¹𝚽T𝐓

• Consider the 𝑘th column of 𝐖ML

  𝐰𝑘 = (𝚽T𝚽)⁻¹𝚽T𝐭𝑘

  where 𝐭𝑘 is an 𝑁-dimensional vector with components 𝑡𝑛𝑘, 𝑛 = 1, … , 𝑁

• It leads to 𝐾 independent regression problems

22
Sequential learning

• The maximum likelihood derivation is a batch technique


➢ It takes all training data into account at the same time
➢ Case 1: The training data set is sufficiently large
➢ Case 2: Data points are arriving in a continuous stream

• For the two cases, it is worthwhile to use sequential algorithms, or
  on-line algorithms, in which the data points are considered one by
  one, and the model parameters are updated incrementally

23
Sequential learning

• Stochastic gradient descent

➢ The error function comprises a sum over data points: 𝐸 = ∑𝑛 𝐸𝑛

➢ Given data point 𝐱𝑛, the parameter vector 𝐰 is updated by

  𝐰(𝜏+1) = 𝐰(𝜏) − η ∇𝐸𝑛

  where 𝜏 is the iteration number and η is the learning rate

➢ In the case of the sum-of-squares error, it is

  𝐰(𝜏+1) = 𝐰(𝜏) + η (𝑡𝑛 − (𝐰(𝜏))T𝝓(𝐱𝑛)) 𝝓(𝐱𝑛)
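A minimal NumPy sketch, not part of the original slides, of one least-mean-squares (LMS) update for the sum-of-squares error; phi_n is the basis-function vector of the incoming sample and eta is an assumed learning rate.

import numpy as np

def lms_update(w, phi_n, t_n, eta=0.01):
    # w <- w + eta * (t_n - w^T phi_n) * phi_n
    return w + eta * (t_n - w @ phi_n) * phi_n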

24
Maximum a posteriori

• Likelihood function

  𝑝(𝐭 | 𝐰, 𝛽) = ∏𝑛 𝒩(𝑡𝑛 | 𝐰T𝝓(𝐱𝑛), 𝛽⁻¹)

• Let us consider a prior function, which is a Gaussian

  𝑝(𝐰) = 𝒩(𝐰 | 𝐦0, 𝐒0)

  where 𝐦0 is the mean and 𝐒0 is the covariance matrix

• The posterior function is also a Gaussian, 𝑝(𝐰 | 𝐭) = 𝒩(𝐰 | 𝐦𝑁, 𝐒𝑁),
  where 𝐦𝑁 is the mean and 𝐒𝑁 is the covariance (see the
  reconstruction below)
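The posterior mean and covariance referred to above, reconstructed from the PRML treatment that these slides cite:

\mathbf{m}_N = \mathbf{S}_N\!\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}\right),
\qquad
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}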
25
How to derive the mean and covariance in posterior

• According to the results for marginal and conditional Gaussians on
  page 93 of the PRML textbook

26
A zero-mean isotropic Gaussian prior

• A general Gaussian prior function

  𝑝(𝐰) = 𝒩(𝐰 | 𝐦0, 𝐒0)

  where 𝐦0 is the mean and 𝐒0 is the covariance matrix

• A widely used Gaussian prior: zero-mean and isotropic

  𝑝(𝐰 | α) = 𝒩(𝐰 | 𝟎, α⁻¹𝐈)

• Mean and covariance of the resulting posterior function

  𝐦𝑁 = 𝛽 𝐒𝑁 𝚽T𝐭,   𝐒𝑁⁻¹ = α𝐈 + 𝛽 𝚽T𝚽
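A minimal NumPy sketch, not part of the original slides, of computing this posterior for the zero-mean isotropic prior; Phi, t, alpha, and beta are assumed to be given.

import numpy as np

def posterior(Phi, t, alpha, beta):
    # S_N^(-1) = alpha I + beta Phi^T Phi,  m_N = beta S_N Phi^T t
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N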

27
Sequential Bayesian learning: An example

• Data, including observations and target values, are given one by one

• Data are in a one-dimensional space

• Data are sampled from a linear function 𝑓(𝑥, 𝐚) = 𝑎0 + 𝑎1𝑥 with fixed
  coefficients 𝑎0 and 𝑎1 (in PRML's version of this example, 𝑎0 = −0.3
  and 𝑎1 = 0.5), and a Gaussian noise is added

➢ Note that the function is unknown to the learner
➢ We have just the observations and the target values

28
An example

• Regression function: 𝑦(𝑥, 𝐰) = 𝑤0 + 𝑤1𝑥

29
An example

• Regression function

• In the beginning, no data are available

• Constant likelihood

• Prior = posterior

• Sample 6 curves for the function according to the posterior
  distribution

30
An example

• Regression function

• One data sample (blue circle) is given

• Likelihood for this sample

• White cross: the true parameter values

• Posterior proportional to likelihood × prior

• Sample 6 curves according to the posterior
31
An example

• Regression function

• A second data sample (blue circle) is given

• Likelihood for the second sample

• White cross: the true parameter values

• Posterior proportional to likelihood × prior

• Sample 6 curves according to the posterior
32
An example

• Regression function

• 20 data samples (blue circles) are given

• Likelihood for the 20th sample

• White cross: the true parameter values

• Posterior proportional to likelihood × prior

• Sample 6 curves according to the posterior
33
Predictive distribution
• Recall the posterior function

  𝑝(𝐰 | 𝐭) = 𝒩(𝐰 | 𝐦𝑁, 𝐒𝑁)

  where 𝐦𝑁 = 𝛽 𝐒𝑁 𝚽T𝐭 and 𝐒𝑁⁻¹ = α𝐈 + 𝛽 𝚽T𝚽

• Given 𝐰, we regress a data sample via 𝑦(𝐱, 𝐰) = 𝐰T𝝓(𝐱)

• In the Bayesian treatment, the predictive distribution is

  𝑝(𝑡 | 𝐭, α, 𝛽) = ∫ 𝑝(𝑡 | 𝐰, 𝛽) 𝑝(𝐰 | 𝐭, α, 𝛽) d𝐰

• Then we have

  𝑝(𝑡 | 𝐱, 𝐭, α, 𝛽) = 𝒩(𝑡 | 𝐦𝑁T𝝓(𝐱), σ𝑁²(𝐱))

• where σ𝑁²(𝐱) = 1/𝛽 + 𝝓(𝐱)T𝐒𝑁𝝓(𝐱)
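A minimal NumPy sketch, not part of the original slides, of evaluating this predictive distribution at a new input; phi_x is the basis-function vector of the new input, and m_N, S_N come from the posterior() sketch given earlier.

import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    # mean = m_N^T phi(x),  variance = 1/beta + phi(x)^T S_N phi(x)
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var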
34
• Green curve: the function used to sample data; it is unknown to the learner
• Blue circles: the sampled data points
• After learning, the predictive distribution 𝑝(𝑡 | 𝑥, 𝐭) is obtained
• Red curve: the mean of the Gaussian predictive distribution above
• Red shaded region: one standard deviation on either side of the mean
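A minimal sketch, not part of the original slides, of how such a plot can be produced with Matplotlib; xs is a grid of test inputs, and design_row(x) is an assumed helper that returns the basis-function vector 𝝓(𝑥) used in the earlier sketches.

import numpy as np
import matplotlib.pyplot as plt

def plot_predictive(xs, design_row, m_N, S_N, beta):
    means, stds = [], []
    for x in xs:
        phi_x = design_row(x)
        mean = m_N @ phi_x                              # predictive mean
        var = 1.0 / beta + phi_x @ S_N @ phi_x          # predictive variance
        means.append(mean)
        stds.append(np.sqrt(var))
    means, stds = np.array(means), np.array(stds)
    plt.plot(xs, means, 'r-')                           # red curve: predictive mean
    plt.fill_between(xs, means - stds, means + stds,    # red shaded region: +/- 1 std
                     color='r', alpha=0.2)
    plt.show()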
35
36
• Sample 5 points of 𝐰 according to the posterior function
• Plot the corresponding regression functions

37
References

• Sections 3.1 and 3.3 of C. M. Bishop, Pattern Recognition and Machine Learning (PRML), Springer, 2006

38
Thank You for Your Attention!

Yen-Yu Lin (林彥宇)


Email: [email protected]
URL: https://www.cs.nycu.edu.tw/members/detail/lin

39
