Machine Learning: Dr. Shahid Mahmood Awan

The document discusses linear regression for predicting a numeric output variable from one or more input features. It explains that linear regression finds the best-fitting straight line through the data points by minimizing the sum of squared residuals. This best-fitting line is defined by the hypothesis function hθ(x) = θ0 + θ1x, where θ0 is the y-intercept and θ1 is the slope. The goal of linear regression is to choose the θ values that make the hypothesis function closest to the actual output values y for each input x in the training data.


Machine Learning

Dr. Shahid Mahmood Awan


Assistant Professor
School of Systems and Technology, University of Management and Technology
[email protected]

Umer Saeed (MS Data Science, BSc Telecommunication Engineering)


Sr. RF Optimization & Planning Engineer
[email protected]
Regression
What is Regression?
Regression
 From features to predictions;



Regression
Examples
Salary after ML Subject
 How much will your salary be? (y = $$)
 Depends on x =
 performance in courses,
 quality of project,
 # of forum responses, …

Dr. Shahid Awan Umer Saeed


Stock Prediction
 Predict the price of a stock (y)
 Depends on x =
 Recent history of stock price
 News events
 Related commodities, …

Dr. Shahid Awan Umer Saeed


Tweet Popularity
 How many people will retweet your tweet? (y)
 Depends on x =
 # followers
 # of followers of followers
 features of text tweeted
 popularity of hashtag
 # of past retweets, ….

Dr. Shahid Awan Umer Saeed


Reading your mind

Dr. Shahid Awan Umer Saeed


Regression
Case Study
Predicting House Prices

Dr. Shahid Awan Umer Saeed


Predicting House Prices

Dr. Shahid Awan Umer Saeed


Predicting House Prices
 How much is my house worth?

Dr. Shahid Awan Umer Saeed


Predicting House Prices
 Look at recent sales in my neighborhood: How much did they sell
for?

Dr. Shahid Awan Umer Saeed




Linear Regression
Uni-variate Linear Regression
Linear Regression with One Variable
 Model Representation

 In regression problems, we are taking input variables and trying to fit the output onto
a continuous expected result function.

 Linear regression with one variable is also known as "univariate linear regression."

 Univariate linear regression is used when you want to predict a single output value y from
a single input value x.

 We're doing supervised learning here, so that means we already have an idea about what
the input/output cause and effect should be.

Dr. Shahid Awan Umer Saeed


Predicting House Prices

Dr. Shahid Awan Umer Saeed


Linear Regression
with One Variable
 Training set of housing prices:

   Plot Size   Price
   1650        45
   1700        27
   1836        32.5
   1905        62
   1950        32
   1950        49
   …           …

 Supervised Learning
  Given the "right answer" for each example in the data.

 Regression Problem
  Predict real-valued output.

 Notation
  m = number of training examples = 546
  X's = "input" variable / features / independent variable = Plot Size
  y's = "output" variable / "target" variable / dependent variable = Price
  (X, y) - a single training example
  (X(i), y(i)) - the ith training example; the superscript i is an index, not an exponent.
  Xj(i) = value of feature j in the ith training example
  n = |X(i)| (the number of features)

Dr. Shahid Awan Umer Saeed


Linear Regression with One Variable
 Example
 Consider the training set shown below. (x(i),y(i)) is the ith training example. What is y(3)?

   Size in feet² (X)   Price ($) in 1000's (y)
   2104                460
   1416                232
   1534                315
   852                 178

 (a) 1416
 (b) 1534
 (c) 315 (Answer)
 (d) 0

Dr. Shahid Awan Umer Saeed


Linear Regression with one variable

The Hypothesis Function


The Hypothesis Function
 To describe the supervised learning problem slightly more formally, our goal is, given a
training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the
corresponding value of y.

 For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is
therefore like this:

 h maps from X’s to y’s

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
hθ(x) is the hypothesis, defined by the parameters θ0 and θ1.

Fit a line through the data.

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Our hypothesis function has the general form: hθ(x) = θ0 + θ1x, where the θi's are the parameters of the model.

 Note that this is like the equation of a straight line.

 We choose values for θ0 and θ1 to get our estimated output hθ(x). In other words, we are trying to create a function hθ that maps our input data (the x's) to our output data (the y's).

 θ0 = intercept; θ1 = slope (also called the regression coefficient, or the weight on feature x).

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 How to choose θi’s ?

x h(x)
0 1.5
1 1.5
2 1.5
3 1.5

Umer Saeed
Dr. Shahid Awan
The Hypothesis Function

x h(x)
0 0
1 0.5
2 1
3 1.5

Dr. Shahid Awan Umer Saeed


The Hypothesis Function

x h(x)
0 1
1 1.5
2 2
3 2.5

Dr. Shahid Awan Umer Saeed


The Hypothesis Function

 Consider the plot below of h(θ)(x)= θ0+ θ1x. What are θ0 and θ1?

 (a) θ0=0 θ1=1


 (b) θ0=0.5 θ1=1 (answer)
 (c) θ0=1 θ1=0.5
 (d) θ0=1 θ1=1

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Example:
 Suppose we have the following set of training data:

input x output y
0 4
1 7
2 7
3 8

 Now we can make a random guess about our hθ function: θ0=2 and θ1=2.
 Our hypothesis function has the general form: hθ(x) = θ0 + θ1x
 The hypothesis function becomes: hθ(x) = 2 + 2x
 So for an input of 1 to our hypothesis, the prediction will be 4. This is off by 3, since the actual y is 7. Note that we will be trying out various values of θ0 and θ1 to find the values which provide the best possible "fit", or the most representative "straight line", through the data points mapped on the x-y plane.
 Idea: Choose θ0 and θ1 so that hθ(x) is close to y for our training examples (x, y).
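As a quick illustrative sketch (not from the slides; the data and the guessed parameters are the ones in the example above), the hypothesis can be evaluated and compared against y in a few lines of Python:

```python
# Evaluate the guessed hypothesis h(x) = 2 + 2x on the toy training set above
# and inspect how far each prediction is from the actual output y.
xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]
theta0, theta1 = 2.0, 2.0  # the "random guess" from the example

for x, y in zip(xs, ys):
    h = theta0 + theta1 * x
    print(f"x={x}  h(x)={h}  y={y}  off by {y - h}")
```

For x = 1 this prints h(x) = 4 against y = 7, the "off by 3" noted above.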

Dr. Shahid Awan Umer Saeed


The Hypothesis Function

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Residual sum of squares (RSS) (also called the squared error of the regression line)

 Suppose we have n points on a coordinate plane.

 What we want to do is find the line that minimizes the squared distance to these points.

 Visualize that line for a moment; we do not yet know what it looks like.

 We want to minimize the squared error from each of these points to the line.

 The equation of the line is y = mx + b, where m = slope of the line and b = y-intercept.
 We want to find an m and a b that minimize the squared error.

 For each point, the error between it and the line is the vertical distance.
Dr. Shahid Awan Umer Saeed


The Hypothesis Function

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Error1 = y1 - (mx1 + b),  Error2 = y2 - (mx2 + b),  …,  Errorn = yn - (mxn + b)

 SE_line = (Error1)² + (Error2)² + … + (Errorn)²

 SE_line = (y1 - (mx1 + b))² + (y2 - (mx2 + b))² + … + (yn - (mxn + b))²

 Expanding each term with (a - b)² = a² - 2ab + b²:

 SE_line = [y1² - 2y1(mx1 + b) + (mx1 + b)²] + [y2² - 2y2(mx2 + b) + (mx2 + b)²] + … + [yn² - 2yn(mxn + b) + (mxn + b)²]

 Expanding (mxi + b)² = m²xi² + 2mxib + b²:

 SE_line = (y1² + y2² + … + yn²) - 2m(x1y1 + x2y2 + … + xnyn) - 2b(y1 + y2 + … + yn) + m²(x1² + x2² + … + xn²) + 2mb(x1 + x2 + … + xn) + nb²   ------ (Equation A)

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 As we know (writing x̄, ȳ for the means, and mean(xy), mean(x²), mean(y²) for the means of the products and squares over the n points):

 (y1² + y2² + … + yn²)/n = mean(y²)   ⟹   y1² + y2² + … + yn² = n·mean(y²)

 Similarly;

 (x1y1 + x2y2 + … + xnyn)/n = mean(xy)   ⟹   x1y1 + x2y2 + … + xnyn = n·mean(xy)

 (y1 + y2 + … + yn)/n = ȳ   ⟹   y1 + y2 + … + yn = n·ȳ

 (x1² + x2² + … + xn²)/n = mean(x²)   ⟹   x1² + x2² + … + xn² = n·mean(x²)

 (x1 + x2 + … + xn)/n = x̄   ⟹   x1 + x2 + … + xn = n·x̄

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Substituting these into Equation A gives

 SE_line = n·mean(y²) - 2mn·mean(xy) - 2bn·ȳ + m²·n·mean(x²) + 2mbn·x̄ + nb²   ------- (Equation B)

 Taking the partial derivative of Equation B with respect to m:

 ∂SE_line/∂m = ∂/∂m [ n·mean(y²) - 2mn·mean(xy) - 2bn·ȳ + m²·n·mean(x²) + 2mbn·x̄ + nb² ]

 ∂SE_line/∂m = ∂/∂m(n·mean(y²)) - ∂/∂m(2mn·mean(xy)) - ∂/∂m(2bn·ȳ) + ∂/∂m(m²·n·mean(x²)) + ∂/∂m(2mbn·x̄) + ∂/∂m(nb²)

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 ∂SE_line/∂m = 0 - 2n·mean(xy) - 0 + 2mn·mean(x²) + 2bn·x̄ + 0

 ∂SE_line/∂m = -2n·mean(xy) + 2mn·mean(x²) + 2bn·x̄   ------- (Equation C)

 Set ∂SE_line/∂m = 0.

 Then Equation C becomes

 -2n·mean(xy) + 2mn·mean(x²) + 2bn·x̄ = 0

 2n(-mean(xy) + m·mean(x²) + b·x̄) = 0

 -mean(xy) + m·mean(x²) + b·x̄ = 0

 m·mean(x²) + b·x̄ = mean(xy)

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Dividing through by x̄:

 m·mean(x²)/x̄ + b·x̄/x̄ = mean(xy)/x̄

 m·mean(x²)/x̄ + b = mean(xy)/x̄   ------- (Equation D)

 So the point (mean(x²)/x̄, mean(xy)/x̄) lies on the best-fit line.

Umer Saeed
Dr. Shahid Awan
The Hypothesis Function
 Returning to Equation B:

 SE_line = n·mean(y²) - 2mn·mean(xy) - 2bn·ȳ + m²·n·mean(x²) + 2mbn·x̄ + nb²

 Taking the partial derivative of Equation B with respect to b:

 ∂SE_line/∂b = ∂/∂b [ n·mean(y²) - 2mn·mean(xy) - 2bn·ȳ + m²·n·mean(x²) + 2mbn·x̄ + nb² ]

 ∂SE_line/∂b = 0 - 0 - 2n·ȳ + 0 + 2mn·x̄ + 2nb

 ∂SE_line/∂b = -2n·ȳ + 2mn·x̄ + 2nb   ------ (Equation E)

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Set ∂SE_line/∂b = 0.

 Then Equation E becomes

 -2n·ȳ + 2mn·x̄ + 2nb = 0

 2n(-ȳ + m·x̄ + b) = 0

 -ȳ + m·x̄ + b = 0

 m·x̄ + b = ȳ   ------- (Equation F)

 So the point (x̄, ȳ) lies on the line.

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 Subtracting Equation D from Equation F gives

 (m·x̄ + b) - (m·mean(x²)/x̄ + b) = ȳ - mean(xy)/x̄

 m·x̄ + b - m·mean(x²)/x̄ - b = ȳ - mean(xy)/x̄

 m·x̄ - m·mean(x²)/x̄ = ȳ - mean(xy)/x̄

 m(x̄ - mean(x²)/x̄) = ȳ - mean(xy)/x̄

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 m = (ȳ - mean(xy)/x̄) / (x̄ - mean(x²)/x̄)

 Multiplying numerator and denominator by x̄:

 m = (x̄·ȳ - mean(xy)) / ((x̄)² - mean(x²))

 Substituting back into Equation F:

 m·x̄ + b = ȳ

 b = ȳ - m·x̄

Dr. Shahid Awan Umer Saeed


The Hypothesis Function
 By putting in the value of m, we have

 b = ȳ - [ (x̄·ȳ - mean(xy)) / ((x̄)² - mean(x²)) ] · x̄
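A minimal Python sketch (not part of the slides) of these closed-form formulas, computing m and b directly from the sample means:

```python
# Illustrative sketch: least-squares slope m and intercept b from sample means,
# using the formulas derived above:
#   m = (x̄·ȳ - mean(xy)) / (x̄² - mean(x²)),   b = ȳ - m·x̄
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x2 = sum(x * x for x in xs) / n
    m = (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x2)
    b = mean_y - m * mean_x
    return m, b

# Example: a noiseless line y = 2x + 1 is recovered exactly.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # -> (2.0, 1.0)
```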



Linear Regression with one variable

Cost Function
Cost Function
 We can measure the accuracy of our hypothesis function by using a cost function.

 This takes an average (actually a slightly fancier version of an average) of all the results of the hypothesis with inputs from the x's compared to the actual output y's:

 J(θ0, θ1) = (1/2m) · Σ from i=1 to m of (hθ(x(i)) − y(i))²

 To break it apart, it is ½ times the mean of the squares of hθ(x(i)) − y(i), i.e. of the differences between the predicted values and the actual values.

 This function is otherwise called the "squared error function" or "mean squared error".

 The mean is halved (1/2m) as a convenience for the computation of gradient descent, since the derivative of the square term cancels the ½ factor.

Dr. Shahid Awan Umer Saeed


Cost Function

 hθ(x), for fixed θ1, is a function of x (the size of the house).
 J(θ1) is a function of the parameter θ1 (which controls the slope of the straight line).

Dr. Shahid Awan Umer Saeed


Cost Function
 If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We
are trying to make a straight line (defined by h(x)) which passes through these scattered
data points.

 Our objective is to get the best possible line.

 The best possible line will be the one for which the average squared vertical distance of the scattered points from the line is the least.

 Ideally, the line should pass through all the points of our training data set.

 In such a case, the value of J(θ0, θ1) will be 0.

Dr. Shahid Awan Umer Saeed


Cost Function
The following example shows the ideal situation
where we have a cost function of 0.

IF θ0 =0 and θ1 =1
h(x)=θ0+ θ1x
h(x)=0+1(x)
h(x)=0+x
h(x)=x

   x     y     h(x)=x    h(x)-y    (h(x)-y)²
   0     0     0         0         0
   1     1     1         0         0
   2     2     2         0         0
   3     3     3         0         0
   Sum   6     6         6         0         0
   m=3

J(0, 1) = (1/(2×3)) × 0 = 0

When θ1=1, we get a slope of 1 which goes through every single data point in our model.
Umer Saeed
Dr. Shahid Awan
Cost Function
IF θ0 =0 and θ1 =0.5
h(x)=θ0+ θ1x
h(x)=0+0.5(x)
h(x)=0.5x

   x     y     h(x)=0.5x    h(x)-y    (h(x)-y)²
   0     0     0            0         0
   1     1     0.5          -0.5      0.25
   2     2     1            -1        1
   3     3     1.5          -1.5      2.25
   Sum   6     6            3         -3        3.5
   m=3

J(0, 0.5) = (1/(2×3)) × 3.5 = 0.58

When θ1=0.5, we see the vertical distance from our fit to the data points increase.

Dr. Shahid Awan Umer Saeed


Cost Function
IF θ0 =0 and θ1 =0
h(x)=θ0+ θ1x
h(x)=0

   x     y     h(x)=0    h(x)-y    (h(x)-y)²
   0     0     0         0         0
   1     1     0         -1        1
   2     2     0         -2        4
   3     3     0         -3        9
   Sum   6     6         0         -6        14
   m=3

J(0, 0) = (1/(2×3)) × 14 = 2.3
Dr. Shahid Awan Umer Saeed
Cost Function
IF θ0 =0 and θ1 =-0.5
h(x)=θ0+ θ1x
h(x)=0+(-0.5)x
h(x)=-0.5x

   x     y     h(x)=-0.5x    h(x)-y    (h(x)-y)²
   0     0     0             0         0
   1     1     -0.5          -1.5      2.25
   2     2     -1            -3        9
   3     3     -1.5          -4.5      20.25
   Sum   6     6             -3        -9        31.5
   m=3

J(0, -0.5) = (1/(2×3)) × 31.5 = 5.25
Dr. Shahid Awan Umer Saeed
Cost Function
IF θ0 =0 and θ1 =1.5
h(x)=θ0+ θ1x
h(x)=0+(1.5)x
h(x)=1.5x

   x     y     h(x)=1.5x    h(x)-y    (h(x)-y)²
   0     0     0            0         0
   1     1     1.5          0.5       0.25
   2     2     3            1         1
   3     3     4.5          1.5       2.25
   Sum   6     6            9         3         3.5
   m=3

J(0, 1.5) = (1/(2×3)) × 3.5 = 0.58
Dr. Shahid Awan Umer Saeed
Cost Function
IF θ0 =0 and θ1 =2
h(x)=θ0+ θ1x
h(x)=0+(2)x
h(x)=2x

   x     y     h(x)=2x    h(x)-y    (h(x)-y)²
   0     0     0          0         0
   1     1     2          1         1
   2     2     4          2         4
   3     3     6          3         9
   Sum   6     6          12        6         14
   m=3

J(0, 2) = (1/(2×3)) × 14 = 2.33
Dr. Shahid Awan Umer Saeed
Cost Function
IF θ0 =0 and θ1 =2.5
h(x)=θ0+ θ1x
h(x)=0+(2.5)x
h(x)=2.5x

   x     y     h(x)=2.5x    h(x)-y    (h(x)-y)²
   0     0     0            0         0
   1     1     2.5          1.5       2.25
   2     2     5            3         9
   3     3     7.5          4.5       20.25
   Sum   6     6            15        9         31.5
   m=3

J(0, 2.5) = (1/(2×3)) × 31.5 = 5.25
Dr. Shahid Awan Umer Saeed
Cost Function

Thus, as a goal, we should try to minimize the cost function. In this case, θ1=1 is our global minimum.

θ1 J(θ1)
1 0
0.5 0.58
0 2.3
-0.5 5.25
1.5 0.58
2 2.33
2.5 5.25
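As an illustrative sketch (assuming the three training points (1,1), (2,2), (3,3) from the plot, with m = 3 as used on these slides), the table above can be reproduced in Python:

```python
# Compute J(θ0, θ1) = (1/2m)·Σ (hθ(x)−y)² and reproduce the table above.
xs = [1, 2, 3]
ys = [1, 2, 3]

def cost(theta0, theta1):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for t1 in [1, 0.5, 0, -0.5, 1.5, 2, 2.5]:
    print(f"theta1={t1:5}  J(0, theta1)={cost(0, t1):.2f}")
```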

Dr. Shahid Awan Umer Saeed


Cost Function

Dr. Shahid Awan Umer Saeed


Cost Function

Dr. Shahid Awan Umer Saeed


Cost Function
 A contour plot is a graph that contains many contour lines.

 A contour line of a two variable function has a constant value at all points of the same line.
An example of such a graph is the one to the right below.

Dr. Shahid Awan Umer Saeed


Cost Function
 Taking any color and going along the 'circle', one would expect to get the same value of the
cost function.
 For example, the three green points found on the green line above have the same value
for J(θ0, θ1) and as a result, they are found along the same line.

 The circled x displays the value of the cost function for the graph on the left when θ0 = 800
and θ1 = -0.15.

Dr. Shahid Awan Umer Saeed


Cost Function
Taking another h(x) and plotting its contour plot, one gets the following graphs:

 When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot gets closer to the
center thus reducing the cost function error.

 Now giving our hypothesis function a slightly positive slope results in a better fit of the data.

Dr. Shahid Awan Umer Saeed


Cost Function

 The graph above minimizes the cost function as much as possible and, consequently, the resulting values of θ1 and θ0 are around 0.12 and 250 respectively.

 Plotting those values on our graph to the right seems to put our point in the center of the
inner most 'circle'.

Dr. Shahid Awan Umer Saeed


Linear Regression with one variable

Gradient Descent For Linear Regression


Gradient Descent
 Have some function J(θ0, θ1)
 Want to minimize J(θ0, θ1)

 Outline:

 Start with some θ0, θ1 (say θ0 = 0, θ1 = 0)

 Keep changing θ0, θ1 to reduce J(θ0, θ1) until we hopefully end up at a minimum.

Dr. Shahid Awan Umer Saeed


Gradient Descent
 We have our hypothesis function and we have a way of measuring how well it fits into the
data. Now we need to estimate the parameters in the hypothesis function. That's where
gradient descent comes in.

 Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we
are graphing the cost function as a function of the parameter estimates). We are not
graphing x and y itself, but the parameter range of our hypothesis function and the cost
resulting from selecting a particular set of parameters.

 We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis.
The points on our graph will be the result of the cost function using our hypothesis with
those specific theta parameters. The graph below depicts such a setup.

Dr. Shahid Awan Umer Saeed


Gradient Descent

 We will know that we have succeeded when our cost function is at the very bottom of the
pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum
points in the graph.

Dr. Shahid Awan Umer Saeed


Gradient Descent
 The way we do this is by taking the derivative (the tangential line to a function) of our cost
function.

 The slope of the tangent is the derivative at that point and it will give us a direction to move
towards.

 We make steps down the cost function in the direction with the steepest descent. The size
of each step is determined by the parameter α, which is called the learning rate.

 For example, the distance between each 'star' in the graph above represents a step
determined by our parameter α.

 A smaller α would result in a smaller step and a larger α results in a larger step.

 The direction in which the step is taken is determined by the partial derivative of J(θ0, θ1).
Depending on where one starts on the graph, one could end up at different points.

 The image above shows us two different starting points that end up in two different places.
Dr. Shahid Awan Umer Saeed
Gradient Descent
 We explored the scenario where we used one parameter θ1 and plotted its cost function to
implement a gradient descent. Our formula for a single parameter was :

 Repeat until convergence:

 Regardless of the slope's sign for d/d θ1 J(θ1), θ1 eventually converges to its minimum
value.

 The following graph shows that when the slope is negative, the value of θ1 increases and
when it is positive, the value of θ1 decreases.

Dr. Shahid Awan Umer Saeed


Gradient Descent

θ1:= θ1- (positive number)

Dr. Shahid Awan Umer Saeed


Gradient Descent

θ1:= θ1- (negative number)

Dr. Shahid Awan Umer Saeed


Gradient Descent

Dr. Shahid Awan Umer Saeed


Gradient Descent
 The intuition behind the convergence is that d/dθ1J(θ1) approaches 0 as we approach the
bottom of our convex function.

 At the minimum, the derivative will always be 0 and thus we get:

Dr. Shahid Awan Umer Saeed


Gradient Descent
 Gradient descent can converge to a local minimum, even with the learning rate α fixed.

 As we approach a local minimum, gradient descent automatically takes smaller steps, so there is no need to decrease α over time.

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
Repeat until convergence:
{
    θj := θj − α · ∂/∂θj J(θ0, θ1)     for j = 0 and j = 1
}   ---------- (Equation A)

J(θ0, θ1) = (1/2m) · Σ from i=1 to m of (hθ(x(i)) − y(i))²

J(θ0, θ1) = (1/2m) · Σ from i=1 to m of (θ0 + θ1x(i) − y(i))²

∂/∂θj J(θ0, θ1) = ∂/∂θj [ (1/2m) · Σ from i=1 to m of (θ0 + θ1x(i) − y(i))² ]   ------ (Equation B)

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
 Case 1: j = 0
 By putting j = 0 into Equation B (all sums run over i = 1 to m), we have

 ∂/∂θ0 J(θ0, θ1) = ∂/∂θ0 [ (1/2m) · Σ (θ0 + θ1x(i) − y(i))² ]

 ∂/∂θ0 J(θ0, θ1) = (2/2m) · Σ (θ0 + θ1x(i) − y(i)) · ∂/∂θ0 (θ0 + θ1x(i) − y(i))

 ∂/∂θ0 J(θ0, θ1) = (1/m) · Σ (θ0 + θ1x(i) − y(i)) · [ ∂/∂θ0(θ0) + ∂/∂θ0(θ1x(i)) − ∂/∂θ0(y(i)) ]

 ∂/∂θ0 J(θ0, θ1) = (1/m) · Σ (θ0 + θ1x(i) − y(i)) · [1 + 0 − 0]

 ∂/∂θ0 J(θ0, θ1) = (1/m) · Σ (θ0 + θ1x(i) − y(i)) · [1]

 ∂/∂θ0 J(θ0, θ1) = (1/m) · Σ (θ0 + θ1x(i) − y(i)) · x0(i)        (where x0(i) = 1)

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
 Case 2: j = 1
 By putting j = 1 into Equation B (all sums run over i = 1 to m), we have

 ∂/∂θ1 J(θ0, θ1) = ∂/∂θ1 [ (1/2m) · Σ (θ0 + θ1x(i) − y(i))² ]

 ∂/∂θ1 J(θ0, θ1) = (2/2m) · Σ (θ0 + θ1x(i) − y(i)) · ∂/∂θ1 (θ0 + θ1x(i) − y(i))

 ∂/∂θ1 J(θ0, θ1) = (1/m) · Σ (θ0 + θ1x(i) − y(i)) · [ ∂/∂θ1(θ0) + ∂/∂θ1(θ1x(i)) − ∂/∂θ1(y(i)) ]

 ∂/∂θ1 J(θ0, θ1) = (1/m) · Σ (θ0 + θ1x(i) − y(i)) · [0 + x(i) − 0]

 ∂/∂θ1 J(θ0, θ1) = (1/m) · Σ (θ0 + θ1x(i) − y(i)) · x(i)
Dr. Shahid Awan Umer Saeed
Gradient Descent Algorithm
By putting these derivatives into Equation A, we have

Repeat until convergence:
{
    θ0 := θ0 − (α/m) · Σ from i=1 to m of (θ0 + θ1x(i) − y(i)) · x0(i)
    θ1 := θ1 − (α/m) · Σ from i=1 to m of (θ0 + θ1x(i) − y(i)) · x(i)
}
Update θ0 and θ1 simultaneously.
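A rough Python sketch of this batch gradient descent loop (the data, learning rate and iteration count below are illustrative, not from the slides):

```python
# Batch gradient descent for univariate linear regression,
# implementing the simultaneous updates derived above.
def gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0  # start from (0, 0) as in the outline
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # (1/m)·Σ error·x0, x0 = 1
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # (1/m)·Σ error·x
        # Simultaneous update: compute both gradients before changing either theta.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# On the toy data y = 2x + 1 this converges close to (1.0, 2.0).
print(gradient_descent([0, 1, 2, 3], [1, 3, 5, 7]))
```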

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
 The point of all this is that if we start with a guess for our hypothesis and then repeatedly
apply these gradient descent equations, our hypothesis will become more and more
accurate.

 Batch Gradient Descent:

 So, this is simply gradient descent on the original cost function J. This method looks at
every example in the entire training set on every step, and is called batch gradient
descent.

 Note that, while gradient descent can be susceptible to local minima in general, the
optimization problem we have posed here for linear regression has only one global, and no
other local, optima; thus gradient descent always converges (assuming the learning rate α
is not too large) to the global minimum. Indeed, J is a convex quadratic function.

Umer Saeed
Dr. Shahid Awan
Gradient Descent Algorithm
The cost function for linear regression is always a convex ("bowl-shaped") function: it has no local minima other than the global minimum.

Dr. Shahid Awan Umer Saeed




Gradient Descent Algorithm
 Making sure gradient descent is working correctly

 Debugging gradient descent: make a plot with the number of iterations on the x-axis and the cost function J(θ) on the y-axis. J(θ) should decrease after every iteration. If J(θ) ever increases, you probably need to decrease α.

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
 Automatic convergence test: declare convergence if J(θ) decreases by less than ε in one iteration, where ε is some small value such as 10⁻³. However, in practice it is difficult to choose this threshold value.

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
 Gradient descent not working?

 The value of α is probably too large.

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
 Use smaller α

 Note:
 For sufficiently small α, J(θ) should decrease on every iteration.
 But if α is too small, gradient descent can be slow to converge.

Dr. Shahid Awan Umer Saeed


Gradient Descent Algorithm
 Question:
 Suppose a friend ran gradient descent three times, with α=0.01, α=0.1, and α=1, and got the
following three plots (labeled A, B, and C):

 Which plots corresponds to which values of α?


 (a) A is α=0.01, B is α=0.1, C is α=1.
 (b) A is α=0.1, B is α=0.01, C is α=1. (Answer)
 (c ) A is α=1, B is α=0.01, C is α=0.1.
 (d) A is α=1, B is α=0.1, C is α=0.01.

Dr. Shahid Awan Umer Saeed


Linear Algebra Review

Matrices and Vectors


Matrices and Vectors
 Matrices are 2-dimensional arrays:

 The above matrix has four rows and three columns, so it is a 4 x 3 matrix.

 A vector is a matrix with one column and many rows:

 So vectors are a subset of matrices. The above vector is a 4 x 1 matrix.

Dr. Shahid Awan Umer Saeed


Matrices and Vectors
 Notation and terms:

 Aij refers to the element in the ith row and jth column of matrix A.

 A vector with 'n' rows is referred to as an 'n'-dimensional vector.

 vi refers to the element in the ith row of the vector.

 In general, all our vectors and matrices will be 1-indexed. Note that for some programming
languages, the arrays are 0-indexed.

 Matrices are usually denoted by uppercase names while vectors are lowercase.

 "Scalar" means that an object is a single value, not a vector or matrix.

 ℝ refers to the set of scalar real numbers.

 ℝⁿ refers to the set of n-dimensional vectors of real numbers.

Dr. Shahid Awan Umer Saeed


Addition
 Addition is element-wise, so you simply add each corresponding element:

 Example:  A = [1 2; 3 4],  B = [5 6; 7 8]

A + B = [1 2; 3 4] + [5 6; 7 8]

A + B = [1+5  2+6; 3+7  4+8]

A + B = [6 8; 10 12]
 To add two matrices, their dimensions must be the same.

Dr. Shahid Awan Umer Saeed


Subtraction
 Subtraction is element-wise, so you simply subtract each corresponding element:

 Example:  A = [1 2; 3 4],  B = [5 6; 7 8]

A − B = [1 2; 3 4] − [5 6; 7 8]

A − B = [1−5  2−6; 3−7  4−8]

A − B = [−4 −4; −4 −4]

 To subtract two matrices, their dimensions must be the same.

Dr. Shahid Awan Umer Saeed


Scalar Multiplication
 In scalar multiplication, we simply multiply every element by the scalar value:

 Example:  A = [1 2; 3 4]

2A = 2·[1 2; 3 4] = [1×2  2×2; 3×2  4×2] = [2 4; 6 8]

Dr. Shahid Awan Umer Saeed


Scalar Division
 In scalar division, we simply divide every element by the scalar value:

 Example:  A = [1 2; 3 4]

(1/2)A = (1/2)·[1 2; 3 4] = [1×½  2×½; 3×½  4×½] = [1/2  1; 3/2  2]
Dr. Shahid Awan Umer Saeed
Matrix-Vector Multiplication
 We map the column of the vector onto each row of the matrix, multiplying each element and
summing the result.

 The result is a vector.

 The number of columns of the matrix must equal the number of rows of the vector.

 An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.

Dr. Shahid Awan Umer Saeed


Matrix-Vector Multiplication
 Example
 A = [1 2; 3 4; 5 6],  B = [7; 8]

AB = [1×7 + 2×8; 3×7 + 4×8; 5×7 + 6×8] = [7+16; 21+32; 35+48] = [23; 53; 83]
Dr. Shahid Awan Umer Saeed
Matrix Multiplication
 We multiply two matrices by breaking it into several vector multiplications and concatenating
the result.

 An m x n matrix multiplied by an n x o matrix results in an m x o matrix.

 In the above example, a 3 x 2 matrix times a 2 x 2 matrix resulted in a 3 x 2 matrix.

 To multiply two matrices, the number of columns of the first matrix must equal the number
of rows of the second matrix.

Dr. Shahid Awan Umer Saeed


Matrix Multiplication
 Example

 A = [1 2; 3 4],  B = [5 6; 7 8]

AB = [1×5+2×7  1×6+2×8; 3×5+4×7  3×6+4×8] = [5+14  6+16; 15+28  18+32] = [19 22; 43 50]
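As a quick sketch (assuming NumPy is available), the operations above can be checked numerically:

```python
# The matrix operations above, checked with NumPy.
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
v = np.array([[7], [8]])
C = np.array([[1, 2], [3, 4], [5, 6]])

print(A + B)       # element-wise addition -> [[ 6  8] [10 12]]
print(A - B)       # element-wise subtraction -> [[-4 -4] [-4 -4]]
print(2 * A)       # scalar multiplication -> [[2 4] [6 8]]
print(C @ v)       # matrix-vector product (3x2 · 2x1) -> [[23] [53] [83]]
print(A @ B)       # matrix-matrix product -> [[19 22] [43 50]]
```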

Dr. Shahid Awan Umer Saeed


Matrix Multiplication Properties
 Matrices are not commutative: A∗B≠B∗A

 Example
 A = [1 2; 3 4],  B = [5 6; 7 8]

AB = [1×5+2×7  1×6+2×8; 3×5+4×7  3×6+4×8] = [19 22; 43 50]

BA = [5×1+6×3  5×2+6×4; 7×1+8×3  7×2+8×4] = [23 34; 31 46]

 Since AB ≠ BA, matrix multiplication is not commutative.

Dr. Shahid Awan Umer Saeed


Matrix Multiplication Properties
 Matrices are associative: (A∗B)∗C=A∗(B∗C)

 A = [1 2; 3 4],  B = [5 6; 7 8],  C = [9 10; 11 12]

A·B = [19 22; 43 50],    (A·B)·C = [19 22; 43 50]·[9 10; 11 12] = [413 454; 937 1030]

B·C = [111 122; 151 166],    A·(B·C) = [413 454; 937 1030]

 Hence (A·B)·C = A·(B·C): matrix multiplication is associative.

Dr. Shahid Awan Umer Saeed


Matrix Multiplication Properties
 Matrices are Distributive: A*(B+C)=A*B+A*C
A = [1 2; 3 4],  B = [5 6; 7 8],  C = [9 10; 11 12]

B + C = [14 16; 18 20],    A·(B + C) = [50 56; 114 128]

A·B = [19 22; 43 50],    A·C = [31 34; 71 78],    A·B + A·C = [50 56; 114 128]

Hence A·(B + C) = A·B + A·C: matrix multiplication distributes over addition.

Dr. Shahid Awan Umer Saeed


Matrix Multiplication Properties
Identity Matrix
 The identity matrix simply has 1's on the diagonal (upper left to lower right diagonal) and 0's
elsewhere.

Dr. Shahid Awan Umer Saeed


Matrix Multiplication Properties
 The identity matrix, when multiplied by any matrix of the same dimensions, results in the
original matrix. It's just like multiplying numbers by 1.

 When multiplying the identity matrix after some matrix (A∗I), the square identity matrix's
dimension should match the other matrix's columns. When multiplying the identity matrix
before some other matrix (I∗A), the square identity matrix's dimension should match the
other matrix's rows.
A·I = I·A = A

 A = [1 2; 3 4],  I = [1 0; 0 1]

A·I = [1 2; 3 4]·[1 0; 0 1] = [1 2; 3 4]

I·A = [1 0; 0 1]·[1 2; 3 4] = [1 2; 3 4]

 Hence A·I = I·A = A.

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 The inverse of a matrix A is denoted A-1

 Multiplying by the inverse results in the identity matrix.

 A non square matrix does not have an inverse matrix.

 Matrices that don't have an inverse are singular or degenerate.

 The formula for the inverse of a matrix A is:

A⁻¹ = adj(A) / |A|

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 Calculate Determinants of a 2 X 2 Matrix:
 Formula: for A = [a b; c d],  det(A) = ad − bc

 Example:

A = [4 −2; 1 −3]

det(A) = 4×(−3) − (−2)×1 = −12 − (−2) = −12 + 2 = −10

 |A| ≠ 0, so A⁻¹ exists.

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 Calculate Adjoint of a 2 X 2 Matrix:
 Formula: for A = [a b; c d],  adj(A) = [d −b; −c a]

 Example:

A = [4 −2; 1 −3]

adj(A) = [−3 2; −1 4]

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 Calculate Inverse of a 2 X 2 Matrix:

A⁻¹ = adj(A) / |A|

 By putting in the values, we have

A⁻¹ = [−3 2; −1 4] / (−10) = [−3/−10  2/−10; −1/−10  4/−10] = [3/10  −1/5; 1/10  −2/5]
Umer Saeed
Dr. Shahid Awan
Inverse and Transpose
 Calculate Determinants of a 3 X 3 Matrix:
 Formula: for A = [a b c; d e f; g h i]

|A| = a·det([e f; h i]) − b·det([d f; g i]) + c·det([d e; g h])

|A| = a(ei − fh) − b(di − fg) + c(dh − eg)

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 Calculate Determinants of a 3 X 3 Matrix:

A = [1 0 1; 0 2 3; 1 2 1]

|A| = 1·det([2 3; 2 1]) − 0·det([0 3; 1 1]) + 1·det([0 2; 1 2])

|A| = 1(2×1 − 2×3) − 0(0×1 − 3×1) + 1(0×2 − 2×1)

|A| = 1(2 − 6) − 0(0 − 3) + 1(0 − 2) = −4 + 0 − 2 = −6

 |A| ≠ 0, so A⁻¹ exists.

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 The transposition of a matrix is like rotating the matrix 90° clockwise and then reversing it.
 In other words: (Aᵀ)ij = Aji

 Example:

A = [1 −2 −5; 4 3 0; −1 −1 −1]

Aᵀ = [1 4 −1; −2 3 −1; −5 0 −1]
Dr. Shahid Awan Umer Saeed
Inverse and Transpose
 Calculate the adjoint of a 3 × 3 matrix:
 Formula: for A = [a b c; d e f; g h i]

 Step 1 (calculate cof(A), the matrix of cofactors):

cof(A) = [ +det([e f; h i])   −det([d f; g i])   +det([d e; g h]);
           −det([b c; h i])   +det([a c; g i])   −det([a b; g h]);
           +det([b c; e f])   −det([a c; d f])   +det([a b; d e]) ]

cof(A) = [ +(ei−fh)   −(di−fg)   +(dh−eg);
           −(bi−ch)   +(ai−cg)   −(ah−bg);
           +(bf−ce)   −(af−cd)   +(ae−bd) ]

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 Step 2 (adj(A)):
 Formula: adj(A) = (cof(A))ᵀ

 By putting in the values, we have

adj(A) = [ +(ei−fh)   −(bi−ch)   +(bf−ce);
           −(di−fg)   +(ai−cg)   −(af−cd);
           +(dh−eg)   −(ah−bg)   +(ae−bd) ]

Dr. Shahid Awan Umer Saeed


Inverse and Transpose
 Calculate the adjoint of a 3 × 3 matrix:
 Example:  A = [1 0 1; 0 2 3; 1 2 1]

 Step 1 (calculate cof(A)):

cof(A) = [ +det([2 3; 2 1])   −det([0 3; 1 1])   +det([0 2; 1 2]);
           −det([0 1; 2 1])   +det([1 1; 1 1])   −det([1 0; 1 2]);
           +det([0 1; 2 3])   −det([1 1; 0 3])   +det([1 0; 0 2]) ]

cof(A) = [ +(2−6)   −(0−3)   +(0−2);
           −(0−2)   +(1−1)   −(2−0);
           +(0−2)   −(3−0)   +(2−0) ]

cof(A) = [−4 3 −2; 2 0 −2; −2 −3 2]
Dr. Shahid Awan Umer Saeed
Inverse and Transpose
 Step 2 (adj(A)):
 adj(A) = (cof(A))ᵀ

 By putting in the values, we have

adj(A) = [−4 3 −2; 2 0 −2; −2 −3 2]ᵀ = [−4 2 −2; 3 0 −3; −2 −2 2]

Umer Saeed
Dr. Shahid Awan
Inverse and Transpose
 Calculate the inverse of a 3 × 3 matrix:

A⁻¹ = adj(A) / |A|

 By putting in the values (|A| = −6), we have

A⁻¹ = [−4 2 −2; 3 0 −3; −2 −2 2] / (−6)

A⁻¹ = [2/3  −1/3  1/3;  −1/2  0  1/2;  1/3  1/3  −1/3]
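As a small sketch (assuming NumPy), the hand computation above can be verified numerically:

```python
# Determinant, transpose and inverse checked with NumPy.
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 3.0],
              [1.0, 2.0, 1.0]])

print(np.linalg.det(A))   # ~ -6.0
print(A.T)                # transpose
print(np.linalg.inv(A))   # matches adj(A)/|A| computed above
```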

Dr. Shahid Awan Umer Saeed


Linear Regression
Linear Regression with Multiple Variables
Linear Regression with
Multiple Variables

   Plot Size (x)   Price (y)
   1650            45
   1700            27
   1836            32.5
   1905            62
   1950            32
   1950            49
   …               …

 hθ(x) = θ0 + θ1x

Umer Saeed
Dr. Shahid Awan
Linear Regression with
Multiple Variables
Sr no Price Plotsize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea
1 42000 5850 3 1 2 yes no yes no no 1 no
2 38500 4000 2 1 1 yes no no no no 0 no
3 49500 3060 3 1 1 yes no no no no 0 no
4 60500 6650 3 1 2 yes yes no no no 0 no
5 61000 6360 2 1 1 yes no no no no 0 no
6 66000 4160 3 1 1 yes yes yes no yes 0 no
7 66000 3880 3 2 2 yes no yes no no 2 no
8 69000 4160 3 1 3 yes no no no no 0 no
9 83800 4800 3 1 1 yes yes yes no no 0 no
10 88500 5500 3 2 4 yes yes no no yes 1 no

Notation
m = number of training examples = 546
X's = "input" variables / features (e.g. plot size, bedrooms, …)
y's = "output" variable / "target" variable = Price
(X, y) - a single training example
(X(i), y(i)) - the ith training example; the superscript i is an index, not an exponent.
Xj(i) = value of feature j in the ith training example
n = |X(i)| (the number of features)

Dr. Shahid Awan Umer Saeed


Linear Regression with
Multiple Variables
 Example:
   Size (feet²) (X1)   Bedrooms (X2)   Floors (X3)   Age of home, years (X4)   Price ($1000) (y)
   2104                5               1             45                         460
   1416                3               2             40                         232
   1534                3               2             30                         315
   852                 2               1             36                         178
   ...                 ...             ...           ...                        ...

 In the training set above, what is x1(4) ?


 (a) The size (in feet^2) of the 1st home in the training set.
 (b) The age (in years) of the 1st home in the training set.
 (c) The size (in feet^2) of the 4th home in the training set. (answer)
 (d) The age (in years) of the 4th home in the training set

Umer Saeed
Dr. Shahid Awan
Linear Regression with
Multiple Variables
 Example 2:
 In the training set above, what is x(2)?
 Answer: x(2) = [1416; 3; 2; 40]

Dr. Shahid Awan Umer Saeed


Linear Regression with
Multiple Variables

 The hypothesis function (for one variable): hθ(x) = θ0 + θ1x

 The hypothesis function (for multiple variables): hθ(x) = θ0 + θ1x1 + θ2x2 + … + θnxn
 For convenience of notation, define x0 = 1.

 X = [x0; x1; …; xn],   θ = [θ0; θ1; …; θn];   X and θ are (n+1)-dimensional vectors.

 The hypothesis function (for multiple variables): hθ(x) = θ0x0 + θ1x1 + θ2x2 + … + θnxn

 hθ(x) = θᵀX

 This is also called multivariate linear regression.
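A minimal sketch of the vectorized hypothesis (the θ values and the example x below are illustrative, not from the slides):

```python
# The vectorized hypothesis hθ(x) = θᵀx, with x0 = 1 prepended to the features.
import numpy as np

theta = np.array([50.0, 0.1, 20.0])   # [θ0, θ1, θ2] — illustrative values
x = np.array([1.0, 1650.0, 3.0])      # [x0 = 1, plot size, bedrooms]

h = theta @ x                         # θᵀx
print(h)                              # 50 + 0.1*1650 + 20*3 = 275.0
```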

Umer Saeed
Dr. Shahid Awan
Linear Regression with Multiple Variables

Gradient Descent for multiple variables


Gradient Descent for
Multiple Variables

 Hypothesis: hθ(x) = θᵀX = θ0x0 + θ1x1 + … + θnxn

 Parameters: θ, where θ is an (n+1)-dimensional vector.

 Cost Function: J(θ) = (1/2m) · Σ from i=1 to m of (hθ(x(i)) − y(i))²

Dr. Shahid Awan Umer Saeed


Gradient Descent for
Multiple Variables

 The gradient descent equation itself has generally the same form; we just have to repeat it for all n+1 parameters (a code sketch follows):

 Repeat until convergence:
 {
     θj := θj − (α/m) · Σ from i=1 to m of (hθ(x(i)) − y(i)) · xj(i)     (simultaneously update θj for j = 0, …, n)
 }
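A rough sketch (assuming NumPy; the data, learning rate and iteration count are illustrative) of this update in vectorized form, θ := θ − (α/m)·Xᵀ(Xθ − y):

```python
# Vectorized batch gradient descent for n features.
# Shapes: X is (m, n+1) with the x0 = 1 column included, y is (m,).
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m   # all partial derivatives at once
        theta -= alpha * gradient              # simultaneous update of every θj
    return theta

# Tiny example with x0 = 1 already included as the first column.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])
print(gradient_descent(X, y))   # approaches [1.0, 2.0]
```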

Dr. Shahid Awan Umer Saeed


Gradient Descent for
Multiple Variables

 The following image compares gradient descent with one variable to gradient descent with
multiple variables:

Umer Saeed
Dr. Shahid Awan
Gradient Descent for
Multiple Variables

Dr. Shahid Awan Umer Saeed


Gradient Descent for
Multiple Variables

 Question:
 When there are n features, we define the cost function as;
 For linear regression, which of the following are also equivalent and correct definitions of
J(𝜃)?

Dr. Shahid Awan Umer Saeed


Linear Regression with Multiple Variables

Gradient Descent in Practice


Feature Scaling
 Idea: Make sure features are on a similar scale. If the features are on a similar scale, gradient descent converges more quickly.

 E.g. x1 = size (0–2000 feet²), x2 = number of bedrooms (1–5)

Umer Saeed
Dr. Shahid Awan
Feature Scaling
x1 = size (0–2000 feet²),  x2 = number of bedrooms (1–5)

x1 = size(feet²) / 2000,   x2 = (number of bedrooms) / 5

0 ≤ x1 ≤ 1,   0 ≤ x2 ≤ 1

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ descends quickly on small ranges and slowly on large ranges, and so oscillates inefficiently down to the optimum when the variables are very uneven.

Dr. Shahid Awan Umer Saeed


Feature Scaling
 The way to prevent this is to modify the ranges of our input variables so that they are all
roughly the same. Ideally:

−1 ≤ 𝑥𝑖 ≤ 1 or −0.5 ≤ 𝑥𝑖 ≤ 0.5
 These aren't exact requirements; we are only trying to speed things up. The goal is to get all
input variables into roughly one of these ranges, give or take a few

 Feature scaling involves dividing the input values by the range (i.e. the maximum value
minus the minimum value) of the input variable, resulting in a new range of just 1.

Dr. Shahid Awan Umer Saeed


Mean Normalization
 Replace xi with xi − μi to make the features have approximately zero mean (do not apply this to x0 = 1).

 Mean normalization involves subtracting the average value of an input variable from the values of that variable, giving the variable a new average of zero. To implement both techniques (scaling and mean normalization), adjust your input values as in this formula:

 xi := (xi − μi) / range,   where range = max − min

 or

 xi := (xi − μi) / σ

 μi = average of xi in the training set
 σ = standard deviation
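As an illustrative sketch (assuming NumPy; the feature matrix reuses the sizes/bedrooms values shown earlier), mean normalization with the standard deviation can be applied column-wise:

```python
# Feature scaling with mean normalization: xi := (xi − μ) / σ, per column.
import numpy as np

X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])   # columns: size (feet²), bedrooms

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_scaled = (X - mu) / sigma

print(X_scaled)   # each column now has mean ~0 and standard deviation ~1
```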

Dr. Shahid Awan Umer Saeed


Mean Normalization
 The formula for the standard deviation (SD) is:

 σ = sqrt( Σ (x − μ)² / N )

 Example (using the range):

 x1 = size (0–2000 feet²),  x2 = number of bedrooms (1–5)

 range1 = 2000,  range2 = 4

 μ1 = 1000,  μ2 = 3

 −0.5 ≤ x1 ≤ 0.5,   −0.5 ≤ x2 ≤ 0.5

Dr. Shahid Awan Umer Saeed


Mean Normalization
 Example (using the standard deviation):

 x1 = size (0–2000 feet²),  x2 = number of bedrooms (1–5)

 μ1 = 1000,  μ2 = 3

 σ1 = 577.63,  σ2 = 1.41

 −1.73 ≤ x1 ≤ 1.73,   −1.41 ≤ x2 ≤ 1.41

Umer Saeed
Dr. Shahid Awan
Mean Normalization
 Suppose you are using a learning algorithm to estimate the price of houses in a city. You
want one of your features xi to capture the age of the house. In your training set, all of your
houses have an age between 30 and 50 years, with an average age of 38 years. Which of
the following would you use as features, assuming you use feature scaling and mean
normalization?

 (a) xi = age of the house

 (b) xi = (age of the house) / 50

 (c) xi = (age of the house − 38) / 50

 (d) xi = (age of the house − 38) / 20

 Answer: (d)

Dr. Shahid Awan Umer Saeed


Linear Regression with Multiple Variables

Features Polynomial Regression


Features
 We can improve our features and the form of our hypothesis function in a couple different
ways.
 We can combine multiple features into one. For example, we can combine 𝑥1 and 𝑥2 into a
new feature 𝑥3 by taking 𝑥1 * 𝑥2

 Example (Housing Prices Prediction):


 Suppose we have two features: x1 = frontage, x2 = depth.

 hθ(x) = θ0 + θ1x1 + θ2x2

 hθ(x) = θ0 + θ1·frontage + θ2·depth

 Area = frontage × depth = x3

 hθ(x) = θ0 + θ1·Area

 hθ(x) = θ0 + θ1·x3
Umer Saeed
Dr. Shahid Awan
Polynomial Regression
 Our hypothesis function need not be linear (a straight line) if that does not fit the data well.

 We can change the behavior or curve of our hypothesis function by making it a quadratic,
cubic or square root function (or any other form).

 Quadratic Function Shape:

Dr. Shahid Awan Umer Saeed


Polynomial Regression
 Cubic Function Shape:

Dr. Shahid Awan Umer Saeed


Polynomial Regression
 square root function Shape:

Dr. Shahid Awan Umer Saeed


Polynomial Regression
 For example, if our hypothesis function is

 hθ(x) = θ0 + θ1x1

 then we can create additional features based on x1 to get a quadratic function:

 hθ(x) = θ0 + θ1x1 + θ2x1²

 or a cubic function:

 hθ(x) = θ0 + θ1x1 + θ2x1² + θ3x1³

 To make it a square-root function, we could use:

 hθ(x) = θ0 + θ1x1 + θ2√x1

Dr. Shahid Awan Umer Saeed


Polynomial Regression
 One important thing to keep in mind is, if you choose your features this way then feature
scaling becomes very important.

 e.g. if x1 has range 1–1000, then the range of x1² is 1–10⁶ and that of x1³ is 1–10⁹ (see the sketch below).
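A short sketch (assuming NumPy) of how quickly the ranges diverge when polynomial features are added, which is what makes scaling necessary here:

```python
# Construct polynomial features from x1 and check how their ranges blow up.
import numpy as np

x1 = np.linspace(1, 1000, 1000)              # size ranging from 1 to 1000
X = np.column_stack([x1, x1**2, x1**3])      # features x1, x1², x1³

print(X.max(axis=0))   # [1.e+03  1.e+06  1.e+09] — wildly different scales
# Scaling each column (e.g. dividing by its range) brings them back to ~[0, 1].
```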

Dr. Shahid Awan Umer Saeed


Polynomial Regression
 Question
 Suppose you want to predict a house's price as a function of its size. Your model is

 hθ(x) = θ0 + θ1·(Size) + θ2·√(Size)

 Suppose size ranges from 1 to 1000 (feet²). You will implement this by fitting a model

 hθ(x) = θ0 + θ1x1 + θ2x2

 Finally, suppose you want to use feature scaling (without mean normalization).
 Which of the following choices for x1 and x2 should you use? (Note: √1000 ≈ 32)

Dr. Shahid Awan Umer Saeed


Linear Regression with Multiple Variables

Computing Parameters Analytically


Normal Equation
 Gradient descent gives one way of minimizing J.

 Let’s discuss a second way of doing so, this time performing the minimization explicitly and
without resorting to an iterative algorithm.

 "Normal equation Method" to solve for 𝜃 analytically.

 In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with
respect to the θj ’s, and setting them to zero.

 This allows us to find the optimum theta without iteration.

 Formula for Normal equation is;

𝜃 = (𝑋 𝑇 𝑋)−1 𝑋 𝑇 𝑦

Dr. Shahid Awan Umer Saeed


Normal Equation
   Size (feet²) (X1)   Bedrooms (X2)   Floors (X3)   Age of home, years (X4)   Price ($1000) (y)
   2104                5               1             45                         460
   1416                3               2             40                         232
   1534                3               2             30                         315
   852                 2               1             36                         178

 Adding an extra first column x0 = 1:

   (X0)   Size (feet²) (X1)   Bedrooms (X2)   Floors (X3)   Age of home, years (X4)   Price ($1000) (y)
   1      2104                5               1             45                         460
   1      1416                3               2             40                         232
   1      1534                3               2             30                         315
   1      852                 2               1             36                         178

Dr. Shahid Awan Umer Saeed


Normal Equation
X = [1 2104 5 1 45; 1 1416 3 2 40; 1 1534 3 2 30; 1 852 2 1 36],    y = [460; 232; 315; 178]

 Now we can solve the normal equation using above matrices.

𝜃 = (𝑋 𝑇 𝑋)−1 𝑋 𝑇 𝑦

 There is no need to do feature scaling with the normal equation.
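A minimal sketch (assuming NumPy) of the normal equation applied to the matrices above. Note that with only four examples and five parameters, XᵀX is singular here, so the pseudo-inverse is used instead of a plain inverse (this connects to the non-invertibility discussion later in these slides):

```python
# Solve for θ with the normal equation θ = (XᵀX)⁻¹Xᵀy.
import numpy as np

X = np.array([[1, 2104, 5, 1, 45],
              [1, 1416, 3, 2, 40],
              [1, 1534, 3, 2, 30],
              [1,  852, 2, 1, 36]], dtype=float)
y = np.array([460.0, 232.0, 315.0, 178.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # pinv is robust when XᵀX is singular
print(theta)                                # [θ0, θ1, θ2, θ3, θ4]
```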

Dr. Shahid Awan Umer Saeed


Gradient Descent-Vs- Normal Equation
 The following is a comparison of gradient descent and the normal equation:

 With the normal equation, computing the inversion has complexity O(n³).

 So if we have a very large number of features, the normal equation will be slow.

 In practice, when n exceeds 10,000 it might be a good time to go from the normal-equation solution to an iterative process.

Dr. Shahid Awan Umer Saeed


Normal Equation
 Question
 Suppose you have the training examples in the table below:

age (x1) height in cm (x2) weight in kg (y)


4 89 16
9 124 28
5 103 20

 You would like to predict a child's weight as a function of his age and height with the model;

hθ(x) = θ0 + θ1×age + θ2×height

 What are X and y?

Dr. Shahid Awan Umer Saeed


Normal Equation
 Answer: X = [1 4 89; 1 9 124; 1 5 103],   y = [16; 28; 20]


Linear Regression with Multiple Variables

Normal Equation and non-invertibility


Normal Equation Non-invertibility:
 The formula for the normal equation is θ = (XᵀX)⁻¹Xᵀy.

 What if XᵀX is non-invertible (a singular or degenerate matrix)?

 A square matrix that is not invertible is called singular or degenerate.

 A square matrix is singular if and only if its determinant is 0.

Dr. Shahid Awan Umer Saeed


Normal Equation Non-invertibility:
 If XᵀX is non-invertible, the common causes are:

 Redundant features, where two features are very closely related (i.e. they are linearly
dependent)

 e.g. x1 = size in feet², x2 = size in m²

 Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization“

 Solutions to the above problems include deleting a feature that is linearly dependent with
another or deleting one or more features when there are too many features.
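As a small sketch (assuming NumPy; the second feature below is deliberately a constant multiple of the first), the rank of XᵀX exposes a redundant feature:

```python
# A redundant feature (x2 is x1 converted from feet² to m², i.e. a constant
# multiple of x1) makes XᵀX singular; the matrix rank reveals it.
import numpy as np

x1 = np.array([2104.0, 1416.0, 1534.0, 852.0])   # size in feet²
x2 = x1 * 0.092903                                # same size in m² (linearly dependent)
X = np.column_stack([np.ones_like(x1), x1, x2])

print(np.linalg.matrix_rank(X.T @ X))   # 2 (not full rank 3), so XᵀX is singular
# np.linalg.inv(X.T @ X) would be unusable here; drop one of the redundant
# features, use regularization, or fall back to np.linalg.pinv.
```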

Dr. Shahid Awan Umer Saeed


Reference
 https://www.coursera.org/

Dr. Shahid Awan Umer Saeed
