Machine Learning: Dr. Shahid Mahmood Awan
In regression problems, we are taking input variables and trying to fit the output onto
a continuous expected result function.
Linear regression with one variable is also known as "univariate linear regression."
Univariate linear regression is used when you want to predict a single output value y from
a single input value x.
We're doing supervised learning here, so that means we already have an idea about what
the input/output cause and effect should be.
For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is
therefore like this:
We give hθ(x) values for θ0 and θ1 to get our estimated output hθ(x). In other words, we are
trying to create a function hθ that maps our input data (the x's) to our output data (the y's).
x h(x)
0 1.5
1 1.5
2 1.5
3 1.5
The Hypothesis Function
x h(x)
0 0
1 0.5
2 1
3 1.5
x h(x)
0 1
1 1.5
2 2
3 2.5
Consider the plot below of hθ(x) = θ0 + θ1x. What are θ0 and θ1?
input x output y
0 4
1 7
2 7
3 8
Now we can make a random guess about our hθ function: θ0 = 2 and θ1 = 2.
Our hypothesis function has the general form: hθ(x) = θ0 + θ1x.
The hypothesis function becomes: hθ(x) = 2 + 2x.
So for an input of 1 to our hypothesis, y will be 4. This is off by 3. Note that we will be trying out
various values of θ0 and θ1 to find the values which provide the best possible "fit", or the
most representative "straight line", through the data points mapped on the x-y plane.
Idea: choose θ0 and θ1 so that hθ(x) is close to y for our training examples (x, y).
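The guess above can be checked with a short script; the data and θ values are taken from the table and discussion above:

```python
# Check the guess θ0 = 2, θ1 = 2 against the training set above.
xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]

def h(x, theta0=2, theta1=2):
    """Hypothesis h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

for x, y in zip(xs, ys):
    print(f"x={x}  h(x)={h(x)}  y={y}  error={h(x) - y}")
# For x = 1 the hypothesis gives 4 while y = 7: off by 3, as noted above.
```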
What we want to do is find a line that minimizes the squared distance to these points.
Let's visualize that line for a second; we don't yet know what it looks like.
What we want to do is minimize this squared error from each of these points to the line.
The equation of the line is y = mx + b, where m is the slope of the line and b is the y-intercept.
We want to find an m and a b that minimize the squared error.
For each of these points, the error between it and the line is the vertical distance.
SE_line = (Error1)^2 + (Error2)^2 + ... + (Errorn)^2
SE_line = (y1 - (mx1 + b))^2 + (y2 - (mx2 + b))^2 + ... + (yn - (mxn + b))^2
SE_line = [y1^2 - 2y1(mx1 + b) + (mx1 + b)^2] + [y2^2 - 2y2(mx2 + b) + (mx2 + b)^2] + ...
          + [yn^2 - 2yn(mxn + b) + (mxn + b)^2]        (because (a - b)^2 = a^2 - 2ab + b^2)
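The squared error of a candidate line can be computed directly from this definition. A minimal sketch, reusing the toy training set from the earlier hypothesis example:

```python
def squared_error(m, b, xs, ys):
    """SE_line = sum of squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

xs = [0, 1, 2, 3]
ys = [4, 7, 7, 8]
print(squared_error(2.0, 2.0, xs, ys))  # SE of the earlier guess m = 2, b = 2
```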
Introduce mean notation: let x̄ be the mean of the x's, ȳ the mean of the y's, mean(xy) the
mean of the products, and mean(x²) the mean of the squares. Then:
(x1 + x2 + ... + xn)/n = x̄,    so  x1 + x2 + ... + xn = n·x̄
(y1 + y2 + ... + yn)/n = ȳ,    so  y1 + y2 + ... + yn = n·ȳ
(x1y1 + x2y2 + ... + xnyn)/n = mean(xy),    so  x1y1 + x2y2 + ... + xnyn = n·mean(xy)
Expanding the sum with this notation gives
SE_line = n·mean(y²) - 2mn·mean(xy) - 2bn·ȳ + m²·n·mean(x²) + 2mbn·x̄ + n·b²

∂SE_line/∂m = ∂/∂m [n·mean(y²) - 2mn·mean(xy) - 2bn·ȳ + m²·n·mean(x²) + 2mbn·x̄ + n·b²]
∂SE_line/∂m = 0 - 2n·mean(xy) - 0 + 2mn·mean(x²) + 2bn·x̄ + 0
∂SE_line/∂m = 2n(-mean(xy) + m·mean(x²) + b·x̄)

Let ∂SE_line/∂m = 0:
2n(-mean(xy) + m·mean(x²) + b·x̄) = 0
-mean(xy) + m·mean(x²) + b·x̄ = 0
m·mean(x²) + b·x̄ = mean(xy)
m·(mean(x²)/x̄) + b = mean(xy)/x̄        ------- (Equation D)

That is, the point (mean(x²)/x̄, mean(xy)/x̄) lies on the best-fit line.
Similarly, take the partial derivative with respect to b:

∂SE_line/∂b = ∂/∂b [n·mean(y²) - 2mn·mean(xy) - 2bn·ȳ + m²·n·mean(x²) + 2mbn·x̄ + n·b²]
∂SE_line/∂b = 0 - 0 - 2n·ȳ + 0 + 2mn·x̄ + 2nb
∂SE_line/∂b = -2n·ȳ + 2mn·x̄ + 2nb        ------ (Equation E)
Setting Equation E to zero:
2n(-ȳ + m·x̄ + b) = 0
-ȳ + m·x̄ + b = 0
m·x̄ + b = ȳ

So the point (x̄, ȳ) also lies on the best-fit line. Subtracting this from Equation D:
m·(mean(x²)/x̄) + b - (m·x̄ + b) = mean(xy)/x̄ - ȳ
m·(mean(x²)/x̄) - m·x̄ = mean(xy)/x̄ - ȳ
m·(mean(x²)/x̄ - x̄) = mean(xy)/x̄ - ȳ

Multiplying both sides by x̄ and solving for m:
m = (x̄·ȳ - mean(xy)) / (x̄² - mean(x²))

And from m·x̄ + b = ȳ:
b = ȳ - m·x̄ = ȳ - x̄·(x̄·ȳ - mean(xy)) / (x̄² - mean(x²))
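These closed-form mean formulas translate directly into code. A minimal sketch, applied to the toy training set used earlier in these notes:

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept via the mean formulas derived above:
    m = (x̄·ȳ - mean(xy)) / (x̄² - mean(x²)),  b = ȳ - m·x̄."""
    n = len(xs)
    mx = sum(xs) / n                                  # x̄
    my = sum(ys) / n                                  # ȳ
    mxy = sum(x * y for x, y in zip(xs, ys)) / n      # mean(xy)
    mx2 = sum(x * x for x in xs) / n                  # mean(x²)
    m = (mx * my - mxy) / (mx * mx - mx2)
    b = my - m * mx
    return m, b

m, b = fit_line([0, 1, 2, 3], [4, 7, 7, 8])
print(m, b)
```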
Cost Function
We can measure the accuracy of our hypothesis function by using a cost function.
This takes an average (actually a fancier version of an average) of all the results of the
hypothesis with inputs from the x's compared to the actual output y's:
J(θ0, θ1) = (1/2m) · Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))^2
To break it apart, it is ½·x̄, where x̄ is the mean of the squares of hθ(x^(i)) - y^(i), or the
difference between the predicted value and the actual value.
This function is otherwise called the "Squared error function", or "Mean squared error".
The mean is halved (1/2m) as a convenience for the computation of the gradient descent,
as the derivative term of the square function will cancel out the (1/2) term.
hθ(x): for fixed θ1, this is a function of x (the size of the house).
J(θ1): a function of the parameter θ1 (which controls the slope of the straight line).
The best possible line will be such so that the average squared vertical distances of the
scattered points from the line will be the least.
Ideally, the line should pass through all the points of our training data set.
If θ0 = 0 and θ1 = 1:
h(x) = θ0 + θ1x = 0 + 1·x = x
θ1      J(θ1)
1       0
0.5     0.58
0       2.33
-0.5    5.25
1.5     0.58
2       2.33
2.5     5.25
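The values in this table can be reproduced with a short script. The training set is not restated on the slide; the sketch below assumes the three-point set (1,1), (2,2), (3,3), which is consistent with every J(θ1) value in the table:

```python
def J(theta1, xs, ys):
    """Cost J(θ1) = 1/(2m) · Σ (θ1·x⁽ⁱ⁾ − y⁽ⁱ⁾)², with θ0 fixed at 0."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]   # assumed training set consistent with the table
for t in [1, 0.5, 0, -0.5, 1.5, 2, 2.5]:
    print(t, round(J(t, xs, ys), 2))
```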
A contour line of a two variable function has a constant value at all points of the same line.
An example of such a graph is the one to the right below.
The circled x displays the value of the cost function for the graph on the left when θ0 = 800
and θ1 = -0.15.
When θ0 = 360 and θ1 = 0, the value of J(θ0, θ1) in the contour plot gets closer to the
center thus reducing the cost function error.
Now giving our hypothesis function a slightly positive slope results in a better fit of the data.
The graph above minimizes the cost function as much as possible; consequently, the
results for θ1 and θ0 tend to be around 0.12 and 250 respectively.
Plotting those values on our graph to the right seems to put our point in the center of the
inner most 'circle'.
Imagine that we graph our hypothesis function based on its fields θ0 and θ1 (actually we
are graphing the cost function as a function of the parameter estimates). We are not
graphing x and y itself, but the parameter range of our hypothesis function and the cost
resulting from selecting a particular set of parameters.
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis.
The points on our graph will be the result of the cost function using our hypothesis with
those specific theta parameters. The graph below depicts such a setup.
We will know that we have succeeded when our cost function is at the very bottom of the
pits in our graph, i.e. when its value is the minimum. The red arrows show the minimum
points in the graph.
The slope of the tangent is the derivative at that point and it will give us a direction to move
towards.
We make steps down the cost function in the direction with the steepest descent. The size
of each step is determined by the parameter α, which is called the learning rate.
For example, the distance between each 'star' in the graph above represents a step
determined by our parameter α.
A smaller α would result in a smaller step and a larger α results in a larger step.
The direction in which the step is taken is determined by the partial derivative of J(θ0, θ1).
Depending on where one starts on the graph, one could end up at different points.
The image above shows us two different starting points that end up in two different places.
Gradient Descent
We explored the scenario where we used one parameter θ1 and plotted its cost function to
implement a gradient descent. Our formula for a single parameter was :
Regardless of the slope's sign for d/d θ1 J(θ1), θ1 eventually converges to its minimum
value.
The following graph shows that when the slope is negative, the value of θ1 increases and
when it is positive, the value of θ1 decreases.
As we approach a local minimum, gradient descent will automatically take smaller steps,
so there is no need to decrease α over time.
J(θ0, θ1) = (1/2m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i))^2

∂/∂θj J(θ0, θ1) = ∂/∂θj [(1/2m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i))^2]   ------ (Equation B)

For θ0:
∂/∂θ0 J(θ0, θ1) = (2/2m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · ∂/∂θ0 (θ0 + θ1·x^(i) - y^(i))
∂/∂θ0 J(θ0, θ1) = (1/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [∂/∂θ0(θ0) + ∂/∂θ0(θ1·x^(i)) - ∂/∂θ0(y^(i))]
∂/∂θ0 J(θ0, θ1) = (1/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [1 + 0 - 0]
∂/∂θ0 J(θ0, θ1) = (1/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [x0^(i)]        (where x0^(i) = 1)

For θ1:
∂/∂θ1 J(θ0, θ1) = (2/2m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · ∂/∂θ1 (θ0 + θ1·x^(i) - y^(i))
∂/∂θ1 J(θ0, θ1) = (1/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [∂/∂θ1(θ0) + ∂/∂θ1(θ1·x^(i)) - ∂/∂θ1(y^(i))]
∂/∂θ1 J(θ0, θ1) = (1/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [0 + x^(i) - 0]
∂/∂θ1 J(θ0, θ1) = (1/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [x^(i)]
Gradient Descent Algorithm
By putting the values in equation (A), we have:

Repeat until convergence {
    θ0 := θ0 - (α/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [x0^(i)]
    θ1 := θ1 - (α/m) · Σ_{i=1}^{m} (θ0 + θ1·x^(i) - y^(i)) · [x^(i)]
}
Update θ0 and θ1 simultaneously.
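The update rule above can be sketched in a few lines. The training set and learning rate below are assumptions for illustration (the toy data from the single-variable example, α = 0.1):

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent for h(x) = θ0 + θ1·x, with simultaneous updates."""
    m = len(xs)
    t0, t1 = 0.0, 0.0
    for _ in range(iters):
        err = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(err) / m                             # ∂J/∂θ0
        g1 = sum(e * x for e, x in zip(err, xs)) / m  # ∂J/∂θ1
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1     # simultaneous update
    return t0, t1

t0, t1 = gradient_descent([0, 1, 2, 3], [4, 7, 7, 8])
print(t0, t1)   # converges toward the closed-form least-squares solution
```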
So, this is simply gradient descent on the original cost function J. This method looks at
every example in the entire training set on every step, and is called batch gradient
descent.
Note that, while gradient descent can be susceptible to local minima in general, the
optimization problem we have posed here for linear regression has only one global, and no
other local, optima; thus gradient descent always converges (assuming the learning rate α
is not too large) to the global minimum. Indeed, J is a convex quadratic function.
Debugging gradient descent: make a plot with the number of iterations on the x-axis, and
plot the cost function J(θ) over the number of iterations of gradient descent. J(θ) should
decrease after every iteration. If J(θ) ever increases, the value of α is likely too large, and
you probably need to decrease α.
Note:
For sufficiently small α, J(θ) should decrease on every iteration.
But if α is too small, gradient descent can be slow to converge.
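This debugging check can be automated by recording J(θ) after every iteration and verifying it decreases. A sketch, again assuming the toy training set and α = 0.1:

```python
def J(t0, t1, xs, ys):
    """Cost J(θ0, θ1) = 1/(2m) · Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)²."""
    m = len(xs)
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def descend(xs, ys, alpha, iters=50):
    """Run gradient descent and record J(θ) after every iteration."""
    m = len(xs)
    t0 = t1 = 0.0
    history = []
    for _ in range(iters):
        err = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        t0, t1 = (t0 - alpha * sum(err) / m,
                  t1 - alpha * sum(e * x for e, x in zip(err, xs)) / m)
        history.append(J(t0, t1, xs, ys))
    return history

hist = descend([0, 1, 2, 3], [4, 7, 7, 8], alpha=0.1)
# With a sufficiently small α, the recorded costs decrease on every iteration.
```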
The above matrix has four rows and three columns, so it is a 4 x 3 matrix.
Aij refers to the element in the ith row and jth column of matrix A.
In general, all our vectors and matrices will be 1-indexed. Note that for some programming
languages, the arrays are 0-indexed.
Matrices are usually denoted by uppercase names while vectors are lowercase.
Example (rows separated by semicolons):
A = [1 2; 3 4],  B = [5 6; 7 8]
A + B = [1+5 2+6; 3+7 4+8] = [6 8; 10 12]
To add two matrices, their dimensions must be the same.
Example:
A = [1 2; 3 4],  B = [5 6; 7 8]
A - B = [1-5 2-6; 3-7 4-8] = [-4 -4; -4 -4]
Example:
A = [1 2; 3 4]
2A = [1×2 2×2; 3×2 4×2] = [2 4; 6 8]
Example:
A = [1 2; 3 4]
(1/2)A = [1×½ 2×½; 3×½ 4×½] = [1/2 1; 3/2 2]
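These elementwise operations can be checked with NumPy (a sketch; NumPy is assumed available):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A + B)      # A + B  = [6 8; 10 12]
print(A - B)      # A - B  = [-4 -4; -4 -4]
print(2 * A)      # 2A     = [2 4; 6 8]
print(0.5 * A)    # (1/2)A = [1/2 1; 3/2 2]
```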
Matrix-Vector Multiplication
We map the column of the vector onto each row of the matrix, multiplying each element and
summing the result.
The number of columns of the matrix must equal the number of rows of the vector.
A = [1 2; 3 4; 5 6],  B = [7; 8]
AB = [1×7 + 2×8; 3×7 + 4×8; 5×7 + 6×8]
AB = [7 + 16; 21 + 32; 35 + 48]
AB = [23; 53; 83]
Matrix Multiplication
We multiply two matrices by breaking it into several vector multiplications and concatenating
the result.
To multiply two matrices, the number of columns of the first matrix must equal the number
of rows of the second matrix.
Example:
A = [1 2; 3 4],  B = [5 6; 7 8]
AB = [1×5+2×7 1×6+2×8; 3×5+4×7 3×6+4×8] = [5+14 6+16; 15+28 18+32] = [19 22; 43 50]
BA = [5×1+6×3 5×2+6×4; 7×1+8×3 7×2+8×4] = [5+18 10+24; 7+24 14+32] = [23 34; 31 46]
Note that AB ≠ BA: matrix multiplication is not commutative.

With C = [9 10; 11 12]:
(A·B)·C = [19 22; 43 50] · [9 10; 11 12] = [413 454; 937 1030]
A·B = [19 22; 43 50],  A·C = [31 34; 71 78],  A·B + A·C = [50 56; 114 128]
When multiplying the identity matrix after some matrix (A∗I), the square identity matrix's
dimension should match the other matrix's columns. When multiplying the identity matrix
before some other matrix (I∗A), the square identity matrix's dimension should match the
other matrix's rows.
𝐴∗𝐼 =𝐼∗𝐴=𝐴
A = [1 2; 3 4],  I = [1 0; 0 1]
AI = [1 2; 3 4] · [1 0; 0 1] = [1 2; 3 4]
IA = [1 0; 0 1] · [1 2; 3 4] = [1 2; 3 4]
Hence A·I = I·A = A.
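The multiplication properties above (non-commutativity, associativity, and the identity) can all be verified with NumPy:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = np.array([[9, 10], [11, 12]])
I = np.eye(2, dtype=int)          # 2 x 2 identity matrix

print(A @ B)            # [19 22; 43 50]
print(B @ A)            # [23 34; 31 46] -- not equal to A @ B
print((A @ B) @ C)      # [413 454; 937 1030] -- equals A @ (B @ C)
print(A @ I)            # identity leaves A unchanged
```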
A⁻¹ = adj(A) / |A|

For a 2×2 matrix:
det [a b; c d] = ad - bc

Example:
A = [4 -2; 1 -3]
det A = 4×(-3) - (-2)×1 = -12 + 2 = -10
adj A = adj [4 -2; 1 -3] = [-3 2; -1 4]

A⁻¹ = adj(A) / |A|
By putting the values, we have
A⁻¹ = (1/-10) · [-3 2; -1 4] = [-3/-10 2/-10; -1/-10 4/-10] = [3/10 -1/5; 1/10 -2/5]
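The same 2×2 inverse, computed step by step with NumPy and checked against A·A⁻¹ = I:

```python
import numpy as np

A = np.array([[4, -2], [1, -3]])

det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]                  # ad - bc = -10
adj = np.array([[A[1, 1], -A[0, 1]],
                [-A[1, 0], A[0, 0]]])                        # [-3 2; -1 4]
A_inv = adj / det

print(A_inv)                                # matches the hand computation above
print(np.allclose(A @ A_inv, np.eye(2)))    # A·A⁻¹ = I
```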
Inverse and Transpose
Calculate Determinants of a 3 X 3 Matrix:
Formula;
A = [a b c; d e f; g h i]
|A| = a·det[e f; h i] - b·det[d f; g i] + c·det[d e; g h]
|A| = a(ei - fh) - b(di - fg) + c(dh - eg)

Example:
A = [1 0 1; 0 2 3; 1 2 1]
|A| = 1·det[2 3; 2 1] - 0·det[0 3; 1 1] + 1·det[0 2; 1 2]
|A| = 1(2×1 - 3×2) - 0(0×1 - 3×1) + 1(0×2 - 2×1)
|A| = 1(-4) - 0 + 1(-2) = -6
Example:
A = [1 -2 -5; 4 3 0; -1 -1 -1]
Aᵀ = [1 4 -1; -2 3 -1; -5 0 -1]
Calculate Adjoint of a 3 X 3 Matrix:
Formula;
A = [a b c; d e f; g h i]

Step-1 (calculate Cof A):
Cof A = [ +det[e f; h i]  -det[d f; g i]  +det[d e; g h]
          -det[b c; h i]  +det[a c; g i]  -det[a b; g h]
          +det[b c; e f]  -det[a c; d f]  +det[a b; d e] ]

so that adj A = (Cof A)ᵀ
      = [ +(ei-fh) -(di-fg) +(dh-eg); -(bi-ch) +(ai-cg) -(ah-bg); +(bf-ce) -(af-cd) +(ae-bd) ]ᵀ

For the example A = [1 0 1; 0 2 3; 1 2 1]:
Cof A = [-4 3 -2; 2 0 -2; -2 -3 2]
Step-2 (adj(A)):
Formula: adj A = (Cof A)ᵀ
By putting the values, we have
adj A = [-4 3 -2; 2 0 -2; -2 -3 2]ᵀ = [-4 2 -2; 3 0 -3; -2 -2 2]
Calculate Inverse of a 3 X 3 Matrix:
A⁻¹ = adj(A) / |A|
By putting the values, we have
A⁻¹ = (1/-6) · [-4 2 -2; 3 0 -3; -2 -2 2]
A⁻¹ = [-4/-6 2/-6 -2/-6; 3/-6 0/-6 -3/-6; -2/-6 -2/-6 2/-6]
A⁻¹ = [2/3 -1/3 1/3; -1/2 0 1/2; 1/3 1/3 -1/3]
Linear Regression with Multiple Variables
Sr no Price Plotsize bedrooms bathrms stories driveway recroom fullbase gashw airco garagepl prefarea
1 42000 5850 3 1 2 yes no yes no no 1 no
2 38500 4000 2 1 1 yes no no no no 0 no
3 49500 3060 3 1 1 yes no no no no 0 no
4 60500 6650 3 1 2 yes yes no no no 0 no
5 61000 6360 2 1 1 yes no no no no 0 no
6 66000 4160 3 1 1 yes yes yes no yes 0 no
7 66000 3880 3 2 2 yes no yes no no 2 no
8 69000 4160 3 1 3 yes no no no no 0 no
9 83800 4800 3 1 1 yes yes yes no no 0 no
10 88500 5500 3 2 4 yes yes no no yes 1 no
Notation
m = number of training examples = 546
x's = "input" variables/features, e.g. plot size
y's = "output" variable/"target" variable = price
(x, y) = a single training example
(x^(i), y^(i)) = the i-th training example; the superscript i is not an exponent, it's an index.
xj^(i) = value of feature j in the i-th training example
n = |x^(i)| (the number of features)
Example 2:
In the training set above, what is x^(2)?
Answer: x^(2) is the feature vector of the 2nd training example:
x^(2) = [1416, 3, 2, 40]ᵀ
X = [x0; x1; ...; xn],  θ = [θ0; θ1; ...; θn];  X and θ are (n+1)-dimensional vectors.
The Hypothesis Function (for multiple variables): hθ(x) = θ0x0 + θ1x1 + θ2x2 + ... + θnxn,
where x0 = 1. In vector form:
hθ(x) = θᵀX
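The vectorized form θᵀX is a single dot product. A minimal sketch with hypothetical θ values and one feature vector (x0 = 1 prepended, as above):

```python
import numpy as np

theta = np.array([80.0, 0.1, 10.0])   # hypothetical θ0, θ1, θ2
x = np.array([1.0, 2104.0, 3.0])      # x0 = 1 prepended to two features

h = theta @ x                         # hθ(x) = θᵀX
print(h)
```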
Hypothesis: hθ(x) = θ0x0 + θ1x1 + ... + θnxn = θᵀX
Cost Function: J(θ) = (1/2m) · Σ_{i=1}^{m} (hθ(x^(i)) - y^(i))^2
The gradient descent equation itself is generally the same form; we just have to repeat it for
our 'n' features. In other words:
θj := θj - (α/m) · Σ_{i=1}^{m} (hθ(x^(i)) - y^(i)) · xj^(i)    (simultaneously for j = 0, ..., n)
The following image compares gradient descent with one variable to gradient descent with
multiple variables:
Gradient Descent for Multiple Variables
Question:
When there are n features, we define the cost function as;
For linear regression, which of the following are also equivalent and correct definitions of
J(𝜃)?
Feature Scaling
x1 = size (0 - 2000 feet²),  x2 = number of bedrooms (1 - 5)
After scaling: 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1
More generally, aim for -1 ≤ xi ≤ 1 or -0.5 ≤ xi ≤ 0.5
These aren't exact requirements; we are only trying to speed things up. The goal is to get all
input variables into roughly one of these ranges, give or take a few
Feature scaling involves dividing the input values by the range (i.e. the maximum value
minus the minimum value) of the input variable, resulting in a new range of just 1.
Mean normalization involves subtracting the average value for an input variable from the
values for that input variable resulting in a new average value for the input variable of just
zero. To implement both of these techniques, adjust your input values as shown in this
formula:
xi := (xi - μi) / range_i
σ = standard deviation:
σ = √( Σ(x - μ)² / N )
For the example features: μ1 = 1000, μ2 = 3 and σ1 = 577.63, σ2 = 1.41.
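Feature scaling and mean normalization combine into one formula, xi := (xi − μ)/range. A minimal sketch with hypothetical plot sizes:

```python
def mean_normalize(values):
    """x_i := (x_i - μ) / range, giving features roughly in [-0.5, 0.5]."""
    mu = sum(values) / len(values)          # μ: average of the feature
    rng = max(values) - min(values)         # range: max - min
    return [(v - mu) / rng for v in values]

sizes = [1000, 2000, 3000]      # hypothetical plot sizes
print(mean_normalize(sizes))    # centered at 0, spread over one unit
```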
Mean Normalization
Suppose you are using a learning algorithm to estimate the price of houses in a city. You
want one of your features xi to capture the age of the house. In your training set, all of your
houses have an age between 30 and 50 years, with an average age of 38 years. Which of
the following would you use as features, assuming you use feature scaling and mean
normalization?
ℎ𝜃 𝑥 = 𝜃0 + 𝜃1 𝑥1 + 𝜃2 𝑥2
ℎ𝜃 𝑥 = 𝜃0 + 𝜃1 × 𝑓𝑟𝑜𝑛𝑡𝑎𝑔𝑒 + 𝜃2 × 𝑑𝑒𝑝𝑡ℎ
ℎ𝜃 𝑥 = 𝜃0 + 𝜃1 × 𝐴𝑟𝑒𝑎
ℎ𝜃 𝑥 = 𝜃0 + 𝜃1 × 𝑥3
Polynomial Regression
Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
We can change the behavior or curve of our hypothesis function by making it a quadratic,
cubic or square root function (or any other form).
hθ(x) = θ0 + θ1x1
then we can create additional features based on x1, to get the quadratic function
hθ(x) = θ0 + θ1x1 + θ2x1²
or the square-root function
hθ(x) = θ0 + θ1x1 + θ2√x1
If you choose your features this way, then feature scaling becomes very important:
e.g. if x1 has range 1-1000, then the range of x1² is 1-10⁶ and that of x1³ is 1-10⁹.
hθ(x) = θ0 + θ1·(Size) + θ2·√(Size)
Suppose size ranges from 1 to 1000 (feet²). You will implement this by fitting a model
hθ(x) = θ0 + θ1x1 + θ2x2
Finally, suppose you want to use feature scaling (without mean normalization).
Which of the following choices for x1 and x2 should you use? (Note: √1000 ≈ 32)
Let’s discuss a second way of doing so, this time performing the minimization explicitly and
without resorting to an iterative algorithm.
In the "Normal Equation" method, we will minimize J by explicitly taking its derivatives with
respect to the θj ’s, and setting them to zero.
𝜃 = (𝑋 𝑇 𝑋)−1 𝑋 𝑇 𝑦
With the normal equation, computing the inversion has complexity O(n3).
So if we have a very large number of features, the normal equation will be slow.
In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to
an iterative process.
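The normal equation θ = (XᵀX)⁻¹Xᵀy is a one-liner with NumPy. The sketch below reuses the toy training set from the single-variable example, with a column of ones for θ0:

```python
import numpy as np

# Design matrix: first column is all ones (the x0 = 1 feature for θ0).
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
y = np.array([4, 7, 7, 8], dtype=float)

theta = np.linalg.inv(X.T @ X) @ X.T @ y   # θ = (XᵀX)⁻¹ Xᵀ y
print(theta)   # same line as the least-squares mean formulas gave earlier
```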
You would like to predict a child's weight as a function of his age and height with the model
hθ(x) = θ0 + θ1×age + θ2×height
The formula for the normal equation is:
θ = (XᵀX)⁻¹Xᵀy
Common causes of XᵀX being noninvertible:
Redundant features, where two features are very closely related (i.e. they are linearly
dependent).
Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization".
Solutions to the above problems include deleting a feature that is linearly dependent with
another, or deleting one or more features when there are too many features.