
Lecture 2

Linear Regression

Dr. Le Huu Ton

Hanoi, 09/2016
Outline

Linear Regression

Gradient Descent

Multi Feature Representation

Questions and Answers

Review of Machine Learning

[Diagram] Training data is fed into a system that trains the hypothesis h; given an input X, the trained hypothesis produces the output Y.

For each input x, the output is y = h(x).
Linear Regression

Size (m²)    Price (billion VND)
30           2.5
43           3.4        <- (x(2), y(2))
25           1.8
51           4.5
40           3.2        <- (x(5), y(5))
20           1.6

Table 1: Training data of housing prices in Hanoi

In supervised learning, each training example consists of two elements: the input x (features) and the output y (response).

Notation:
(x, y): one training example
(x(i), y(i)): the i-th training example
Linear Regression

Plotting of the training data:

[Figure: scatter plot, "Price of House in Hanoi" — x-axis: Size (m²), y-axis: Price (billion VND)]

Linear Regression: assume that the output y is a linear function of the input x:

y = h(x) = a·x + b
Linear Regression

Objective:
Learn the function y = h(x) = a·x + b such that it minimizes the error (cost function) over the training data (an optimization problem).

Find the coefficients a, b that minimize the cost function.

Linear Regression

Cost Function:
The error for each training example (data from Table 1):

e^{(1)} = \frac{1}{2}(h(x^{(1)}) - y^{(1)})^2 = \frac{1}{2}(30a + b - 2.5)^2

e^{(2)} = \frac{1}{2}(h(x^{(2)}) - y^{(2)})^2 = \frac{1}{2}(43a + b - 3.4)^2

...

e^{(m)} = \frac{1}{2}(h(x^{(m)}) - y^{(m)})^2 = \frac{1}{2}(x^{(m)}a + b - y^{(m)})^2

The cost function is defined as:

E = \frac{1}{m}(e^{(1)} + e^{(2)} + \dots + e^{(m)}) = \frac{1}{m}\sum_{i=1}^{m} e^{(i)} = \frac{1}{2m}\sum_{i=1}^{m}(h(x^{(i)}) - y^{(i)})^2

E = \frac{1}{2m}\sum_{i=1}^{m}(ax^{(i)} + b - y^{(i)})^2
Gradient Descent

Objective:
Use the gradient to find a minimum of the cost function:

E = \frac{1}{2m}\sum_{i=1}^{m}(ax^{(i)} + b - y^{(i)})^2

Note that E is a function of a and b; we have only two variables, a and b.

Idea:
Choose random values for a and b. The algorithm runs in many steps; at each step, modify a and b such that the cost function is reduced:

a := a - \alpha \frac{\partial}{\partial a} E(a), \qquad b := b - \alpha \frac{\partial}{\partial b} E(b)

Gradient Descent

A demonstration and explanation of the Gradient Descent algorithm can be found at the following website:
http://www.onmyphd.com/?p=gradient.descent

Gradient Descent

Suppose that (x0, y0) is a local minimum of the cost function. What will one iteration of gradient descent do?

1. Leave x0 unchanged
2. Change x0 in a random direction
3. Move x0 toward the global minimum
4. Decrease x0

Gradient Descent

Calculate the derivatives of the cost function, \frac{\partial}{\partial a}E(a) and \frac{\partial}{\partial b}E(b), where:

E = \frac{1}{2m}\sum_{i=1}^{m}(ax^{(i)} + b - y^{(i)})^2

Given the following formulas:

\frac{\partial}{\partial x}(x^2) = 2x, \qquad \frac{\partial}{\partial x}\big(f(x)^2\big) = 2 f(x) \cdot \frac{\partial}{\partial x} f(x)

Gradient Descent

\frac{\partial}{\partial a}E(a) = \frac{1}{m}\sum_{i=1}^{m}\big(ax^{(i)} + b - y^{(i)}\big)\cdot x^{(i)}

\frac{\partial}{\partial b}E(b) = \frac{1}{m}\sum_{i=1}^{m}\big(ax^{(i)} + b - y^{(i)}\big)

Gradient Descent

Exercise:
Starting at a = 0 and b = 0, with α = 0.01: what is the value of the cost function? Calculate the values of a and b after the first iteration (first step). Confirm whether the cost function is reduced or not.

Size (m²)    Price (billion VND)
30           2.5
43           3.4
25           1.8
51           4.5
40           3.2
20           1.6
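The following C++ program is one way to check the exercise (a sketch added here, not part of the original slides): it evaluates the cost at (a, b) = (0, 0), performs one batch update with α = 0.01, and reports whether the cost went down.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Training data from Table 1: size (m^2) and price (billion VND).
    std::vector<double> x = {30, 43, 25, 51, 40, 20};
    std::vector<double> y = {2.5, 3.4, 1.8, 4.5, 3.2, 1.6};
    const int m = (int)x.size();
    double a = 0.0, b = 0.0;
    const double alpha = 0.01;

    // E(a, b) = (1/2m) * sum_i (a*x_i + b - y_i)^2
    auto cost = [&](double a_, double b_) {
        double e = 0.0;
        for (int i = 0; i < m; i++) {
            double r = a_ * x[i] + b_ - y[i];
            e += r * r;
        }
        return e / (2.0 * m);
    };

    double e0 = cost(a, b);  // at (0,0): (1/12) * sum y_i^2 ~= 4.51

    // Gradients from the previous slide, evaluated at (a, b) = (0, 0).
    double da = 0.0, db = 0.0;
    for (int i = 0; i < m; i++) {
        double r = a * x[i] + b - y[i];
        da += r * x[i];
        db += r;
    }
    da /= m;  // ~= -109.28
    db /= m;  // ~= -2.83

    // One simultaneous update of both coefficients.
    a -= alpha * da;  // a ~= 1.0928
    b -= alpha * db;  // b ~= 0.0283

    double e1 = cost(a, b);
    std::printf("E before: %.4f  E after: %.4f  (%s)\n",
                e0, e1, e1 < e0 ? "reduced" : "NOT reduced");
    // With alpha = 0.01 on unscaled sizes (20-51 m^2) this step
    // overshoots, so the cost grows; see the later slides on the
    // learning rate and on feature rescaling.
    return 0;
}
```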

Gradient Descent

Batch and Stochastic Gradient Descent

Batch Gradient Descent:
Compute the gradient using the whole data set.

Stochastic Gradient Descent:
Compute the gradient using one training example at a time (see the sketch below):
- Randomly reorder the training data
- Use (x(1), y(1)) to calculate the gradient and update a, b
- Use (x(2), y(2)) to update
- ...
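A minimal C++ sketch of one stochastic epoch for the single-feature model (the helper name and signature are illustrative, not from the slides):

```cpp
#include <algorithm>
#include <random>
#include <vector>

// One stochastic epoch for h(x) = a*x + b: shuffle the data, then
// update (a, b) from one training example at a time.
void sgd_epoch(const std::vector<double>& x, const std::vector<double>& y,
               double& a, double& b, double alpha, std::mt19937& rng) {
    // Randomly reorder the training data (same permutation for x and y).
    std::vector<size_t> idx(x.size());
    for (size_t i = 0; i < idx.size(); i++) idx[i] = i;
    std::shuffle(idx.begin(), idx.end(), rng);

    for (size_t k : idx) {
        // Gradient of the single-example error (1/2)(a*x + b - y)^2.
        // The residual is computed once so a and b update consistently.
        double r = a * x[k] + b - y[k];
        a -= alpha * r * x[k];
        b -= alpha * r;
    }
}
```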

Gradient Descent

Mini-Batch Gradient Descent

Compute the gradient for t examples at a time (1 < t < m), as sketched below.
Example: we have 1000 training examples.
Step 1: update the coefficients using examples 1-10
Step 2: update the coefficients using examples 11-20
...
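A sketch of one mini-batch pass with batch size t (again an illustrative helper, assuming the single-feature model):

```cpp
#include <algorithm>
#include <vector>

// One pass over the data in mini-batches of size t: average the gradient
// over each batch of examples, then update (a, b) once per batch.
void minibatch_pass(const std::vector<double>& x, const std::vector<double>& y,
                    double& a, double& b, double alpha, size_t t) {
    const size_t m = x.size();
    for (size_t start = 0; start < m; start += t) {
        size_t end = std::min(start + t, m);
        double da = 0.0, db = 0.0;
        for (size_t i = start; i < end; i++) {
            double r = a * x[i] + b - y[i];
            da += r * x[i];
            db += r;
        }
        size_t n = end - start;  // the last batch may be smaller
        a -= alpha * da / n;
        b -= alpha * db / n;
    }
}
```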

Gradient Descent

A big value of α may lead to an increase of the cost function and a failure to converge.

Gradient Descent works if the cost function decreases at each step.

Gradient Descent

Convergence:
How do we know whether the algorithm has converged?
- The cost function is smaller than a predefined threshold
- After a large enough number of steps
- The cost function decreased by less than a predefined threshold

Gradient Descent

Summarization:
1. Define the cost function
2. Select random values for the coefficients a, b
3. Step by step, modify a and b such that the cost function decreases:

while (not converged) do
    a := a - \alpha \frac{\partial}{\partial a} E(a), \qquad b := b - \alpha \frac{\partial}{\partial b} E(b)

Multiple Input Representation

Example:
Consider the same example, but with more inputs.

Size (m²)   No. of floors   No. of rooms   Price (billion VND)
30          3               6              2.5
43          4               8              3.4
25          2               3              1.8
51          4               9              4.5
40          3               5              3.2
20          1               2              1.6

x^{(i)}: the input of the i-th training example
x_j^{(i)}: the component j of the i-th training example
Multiple Input Representation

Matrix representation:

y = h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n

With x_0 = 1, stack the inputs and the coefficients into vectors:

x = [1, x_1, \dots, x_n]^T, \qquad \theta = [\theta_0, \theta_1, \dots, \theta_n]^T

h(x) = [\theta_0\ \ \theta_1\ \ \theta_2\ \cdots\ \theta_n] \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x
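In code, h_\theta(x) = \theta^T x is a single dot product; a minimal sketch, assuming x[0] has already been set to 1:

```cpp
#include <cstddef>
#include <vector>

// h(x) = theta^T x, where x[0] = 1 carries the bias term theta_0.
double hypothesis(const std::vector<double>& theta,
                  const std::vector<double>& x) {
    double h = 0.0;
    for (std::size_t j = 0; j < theta.size(); j++)
        h += theta[j] * x[j];
    return h;
}
```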
Multiple Input Representation

Cost Function:

E(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{2m}\sum_{i=1}^{m}\big(\theta^T x^{(i)} - y^{(i)}\big)^2

Multiple Input Representation

Gradient Descent
Start with random values of θ; step by step, modify θ in order to decrease the cost function:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} E(\theta)

\frac{\partial}{\partial \theta_j} E(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\frac{\partial}{\partial \theta_j}\big(\theta^T x^{(i)} - y^{(i)}\big)^2 = \frac{1}{m}\sum_{i=1}^{m}\big(\theta^T x^{(i)} - y^{(i)}\big)\,x_j^{(i)}

Normal Equations

Linear Regression:
Minimize the value of the cost function:

E(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2 = \frac{1}{2m}\sum_{i=1}^{m}\big(\theta^T x^{(i)} - y^{(i)}\big)^2

Normal Equations:
Solve the following equation to find the optimal value of θ:

\frac{\partial}{\partial \theta} E(\theta) = 0 \iff \forall j \in \{0, 1, \dots, n\}: \frac{\partial}{\partial \theta_j} E(\theta) = 0
Normal Equations

Solution:
Given a training set of m training examples, each containing n inputs, we build the matrix X of inputs, of size (m, n+1), and the vector of outputs Y:

X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & \cdots & x_n^{(2)} \\ & \vdots & \\ x_0^{(m)} & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix} = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}, \qquad Y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}

The solution of the normal equations is:

\theta = (X^T X)^{-1} X^T Y
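A sketch of the closed-form solution on the multi-feature table; rather than forming (XᵀX)⁻¹ explicitly, it solves the equivalent system (XᵀX)θ = XᵀY by Gaussian elimination (the solver is an illustrative helper, not from the slides):

```cpp
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Solve the linear system A * t = c by Gaussian elimination with
// partial pivoting.
std::vector<double> solve(std::vector<std::vector<double>> A,
                          std::vector<double> c) {
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < n; k++) {
        std::size_t p = k;  // pick the largest pivot in column k
        for (std::size_t i = k + 1; i < n; i++)
            if (std::abs(A[i][k]) > std::abs(A[p][k])) p = i;
        std::swap(A[k], A[p]);
        std::swap(c[k], c[p]);
        for (std::size_t i = k + 1; i < n; i++) {
            double f = A[i][k] / A[k][k];
            for (std::size_t j = k; j < n; j++) A[i][j] -= f * A[k][j];
            c[i] -= f * c[k];
        }
    }
    std::vector<double> t(n);
    for (std::size_t k = n; k-- > 0;) {  // back substitution
        t[k] = c[k];
        for (std::size_t j = k + 1; j < n; j++) t[k] -= A[k][j] * t[j];
        t[k] /= A[k][k];
    }
    return t;
}

int main() {
    // Multi-feature table: x0 = 1 (bias), size, floors, rooms -> price.
    std::vector<std::vector<double>> X = {
        {1, 30, 3, 6}, {1, 43, 4, 8}, {1, 25, 2, 3},
        {1, 51, 4, 9}, {1, 40, 3, 5}, {1, 20, 1, 2}};
    std::vector<double> Y = {2.5, 3.4, 1.8, 4.5, 3.2, 1.6};
    const std::size_t m = X.size(), p = X[0].size();

    // Form X^T X and X^T Y.
    std::vector<std::vector<double>> XtX(p, std::vector<double>(p, 0.0));
    std::vector<double> XtY(p, 0.0);
    for (std::size_t i = 0; i < m; i++)
        for (std::size_t j = 0; j < p; j++) {
            XtY[j] += X[i][j] * Y[i];
            for (std::size_t k = 0; k < p; k++)
                XtX[j][k] += X[i][j] * X[i][k];
        }

    std::vector<double> theta = solve(XtX, XtY);
    for (std::size_t j = 0; j < p; j++)
        std::printf("theta_%zu = %.4f\n", j, theta[j]);
    return 0;
}
```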
Homework

Write a program (in Matlab/C++) to implement the gradient descent algorithm for the training data above with the different learning methods: batch learning, stochastic, and mini-batch, plus the normal equations. Send your code, and a report on what you observe from the results, to my email before 20 October 2016.

Polynomial Regression

The output is a polynomial function of the input. For example:

h(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_n x^n

Assume x_1 = x, x_2 = x^2, ..., x_n = x^n; then

h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n

⇒ Linear Regression
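A small sketch of this substitution: expand a scalar x into the feature vector [1, x, x², ..., xⁿ], after which any of the linear-regression routines sketched above applies unchanged:

```cpp
#include <vector>

// Map a scalar x to the polynomial feature vector [1, x, x^2, ..., x^n],
// turning polynomial regression into linear regression on these features.
std::vector<double> poly_features(double x, int n) {
    std::vector<double> f(n + 1);
    f[0] = 1.0;
    for (int j = 1; j <= n; j++) f[j] = f[j - 1] * x;
    return f;
}
```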
References

http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=02.4-LinearRegressionI-GradientDescent&speed=100

Feature Rescale

Objective: scale all features to the same range, to make computation easier.
Popular ranges: [0, 1], [-0.5, 0.5]

x := x / max(x)
c := mean(x), then x := (x - c) / max(x)
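A sketch of both rescalings for one feature column, following the slide's formulas (many texts divide by max − min instead; here max(x) is used as written):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Scale a feature column to roughly [0, 1]: x := x / max(x).
void scale_max(std::vector<double>& x) {
    double mx = *std::max_element(x.begin(), x.end());
    for (double& v : x) v /= mx;
}

// Mean-centred variant from the slide: x := (x - mean(x)) / max(x),
// putting the feature roughly in [-0.5, 0.5].
void scale_mean(std::vector<double>& x) {
    double mx = *std::max_element(x.begin(), x.end());
    double c = std::accumulate(x.begin(), x.end(), 0.0) / x.size();
    for (double& v : x) v = (v - c) / mx;
}
```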

