
Linear Regression

Some of the most common machine learning algorithms are regressions. A regression predicts a number from infinitely many possible outputs. Linear regressions fit a straight line to a dataset, while non-linear regressions fit a curved line. Using only one input variable is called univariate linear regression, but models can use many more than one. A practical example of linear regression is predicting home prices based on their square footage.

Defining the Model


The function defining a linear regression model is represented as,

$$f_{w,b}(x) = wx + b$$

where $w$ represents the slope and $b$ the y-intercept of the line.


Once the line is fitted to the data, the same function definition is applied. The only difference is the output variable, which is now estimated based on the regression line. This predicted output and function model are defined as,

$$\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$$
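As a minimal sketch of applying this model in Python (the NumPy usage and the example values of w, b, and x are illustrative assumptions, not taken from the text):

import numpy as np

# Hypothetical parameters: slope w (price per square foot) and y-intercept b (base price)
w = 200.0
b = 100.0

# Square footage of a few example homes (also assumed)
x = np.array([1.0, 1.5, 2.0])

# Predicted prices: y_hat^(i) = f_{w,b}(x^(i)) = w * x^(i) + b
y_hat = w * x + b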

The Cost Function


Finding the Cost Function
Since the slope ($w$) and y-intercept ($b$) dramatically influence the position and direction of the regression line, making sure it fits the data as accurately and precisely as possible is very important. Measuring how well a line fits the data is the job of the cost function, which is defined below. The error itself is the difference between the predicted output and the actual output, $\hat{y}^{(i)} - y^{(i)}$.

$$\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

This error term is computed for each training example $i$ in the training set and summed up to the total number of training examples $m$. As the number of training examples increases, it is more useful to work with the average squared error rather than the total squared error. The cost function $J$, also called the squared error cost function, is defined as,

$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

The cost function most engineers use actually divides by $2m$ rather than $m$, because the extra factor of two cancels out when taking derivatives later and keeps the calculations neater. Omitting this extra two is perfectly acceptable. While different applications use different types of cost functions, the squared error cost function is easily the most common choice for regression problems.

To get a better sense of the parameters that are adjusted in the cost function, it
can be written as,

$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

The only difference here is that the predicted output $\hat{y}^{(i)}$ is represented as the function model $f_{w,b}(x^{(i)})$. Because the function model contains the only parameters that can be adjusted (the slope and the y-intercept), this is the complete cost function definition.
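As a minimal sketch (assuming NumPy arrays x and y hold the training inputs and targets; the function name compute_cost is illustrative, not from the original), the squared error cost can be computed as:

import numpy as np

def compute_cost(x, y, w, b):
    # Squared error cost J(w, b): average of the squared errors, divided by 2
    m = x.shape[0]
    y_hat = w * x + b                      # predictions f_{w,b}(x^(i)) for every example
    return np.sum((y_hat - y) ** 2) / (2 * m)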

Optimizing the Cost Function


The goal of a linear regression algorithm is to minimize the cost function as much as possible. This is done by searching for the values of the two parameters that affect its accuracy, $w$ and $b$.

$$\min_{w,b} J(w, b)$$

$J$ can be plotted against both parameters individually, which results in 2D graphs for each. However, to get the full picture, they can also be plotted together in both 2D (contour) and 3D (surface) graphs.

[Figure: Contour graph of the cost function, plotted against $w$ and $b$.]
[Figure: Surface graph of the cost function, plotted against $w$ and $b$.]
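As an illustrative sketch of how such plots could be generated (assuming Matplotlib and a small toy dataset; all values and variable names here are made up for the example):

import numpy as np
import matplotlib.pyplot as plt

# Toy dataset, assumed purely for illustration
x = np.array([1.0, 2.0, 3.0])
y = np.array([300.0, 500.0, 700.0])

# Evaluate J(w, b) over a grid of candidate parameter values
w_vals = np.linspace(0, 400, 100)
b_vals = np.linspace(-200, 400, 100)
W, B = np.meshgrid(w_vals, b_vals)
J = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        J[i, j] = np.sum((W[i, j] * x + B[i, j] - y) ** 2) / (2 * len(x))

fig = plt.figure(figsize=(10, 4))
ax1 = fig.add_subplot(1, 2, 1)
ax1.contour(W, B, J, levels=30)             # 2D contour view of the cost
ax2 = fig.add_subplot(1, 2, 2, projection="3d")
ax2.plot_surface(W, B, J, cmap="viridis")   # 3D surface view of the cost
plt.show()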

Gradient Descent for Linear Regression


Given all the equations and definitions above, almost everything is in place to programmatically compute gradient descent for a linear regression. The one missing piece is the derivative of the cost function, $\frac{\partial}{\partial w}J(w, b)$ and $\frac{\partial}{\partial b}J(w, b)$.
Since the cost function and the linear regression model have already been expressed in terms of the input variables $x$ and $y$, they are substituted into the gradient descent algorithm and the derivatives are then taken. Consider the following equations, where $m$ is the number of training examples in the dataset.

Linear regression model:

$$f_{w,b}(x) = wx + b$$

Cost function:

$$J(w, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)^2$$

Pre-derived gradient descent algorithm


$$w = w - \alpha \frac{\partial}{\partial w} J(w, b) \;\Rightarrow\; w = w - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

$$b = b - \alpha \frac{\partial}{\partial b} J(w, b) \;\Rightarrow\; b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$

Final gradient descent algorithm

$$w = w - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)x^{(i)}$$

$$b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
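Putting these update rules together, a minimal sketch of the gradient descent loop for one input variable might look like this (the learning rate alpha, iteration count, and function name are assumptions for illustration, not values from the original):

import numpy as np

def gradient_descent(x, y, alpha=0.01, num_iters=1000):
    # Batch gradient descent for single-variable linear regression
    m = x.shape[0]
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        err = (w * x + b) - y              # f_{w,b}(x^(i)) - y^(i) for all i
        dj_dw = np.sum(err * x) / m        # derivative of J with respect to w
        dj_db = np.sum(err) / m            # derivative of J with respect to b
        w = w - alpha * dj_dw              # simultaneous update of w and b
        b = b - alpha * dj_db
    return w, b

Both parameters are updated from the same error term in each pass, which matches the simultaneous update described by the equations above.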

Multiple Feature Linear Regression


Adding more input features is a critical component of achieving accurate machine
learning models. For example, consider the previous example of determining
home prices by square footage. While this data can provide insight, it is probably
not enough for practical use. More variables can be added, such as number of
bedrooms and bathrooms, location, etc.

Notation
When adding more input features, slight modifications to the notation are
necessary,

$x_j$ = $j^{\text{th}}$ feature
$n$ = number of features
$\vec{x}^{(i)}$ = features of the $i^{\text{th}}$ training example
$x_j^{(i)}$ = value of feature $j$ in the $i^{\text{th}}$ training example

Model
With the addition of multiple variables comes a necessary update to the model function. Consider the following changes, where $n$ refers to the number of input features.

Previous single-variable model:

$$f_{w,b}(x) = wx + b$$

Updated multivariable model:

$$f_{w,b}(x) = w_1x_1 + \dots + w_nx_n + b$$

To simplify the notation, the multivariable model can be written with vectors. Notice how $b$ is not included in the vectors, as it is a single number rather than a vector of parameters.

$$\vec{w} = [w_1 \; w_2 \; w_3 \; \dots \; w_n]$$

$$\vec{x} = [x_1 \; x_2 \; x_3 \; \dots \; x_n]$$

b, however, is included in the complete model, which is described below.

$$f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$$

Vectorization
Writing vectorized code is essential for turning these linear algebra equations into something a computer can read and process efficiently. It also happens that GPUs are very efficient at running vectorized code. Consider the following example, where the parameters and features are translated into code.
Notice how the linear algebra notation is 1-indexed, while the Python code is 0-indexed. The Python code uses the widely adopted NumPy package, which implements linear algebra tools, among many others. Without NumPy, the dot product between $\vec{w}$ and $\vec{x}$ would have to be hardcoded, which becomes especially problematic for large values of $n$.

Vector notation alongside the equivalent Python code:

Defining parameters and features,

$$\vec{w} = [w_1 \; w_2 \; w_3] \qquad b \text{ is a number} \qquad \vec{x} = [x_1 \; x_2 \; x_3]$$

w = np.array([1.0, 2.5, -3.3])
b = 4
x = np.array([10, 20, 30])
n = 3

Without vectorization,

$$f_{\vec{w},b}(\vec{x}) = \left(\sum_{j=1}^{n} w_j x_j\right) + b$$

f = 0
for j in range(n):  # 0 to n-1
    f += w[j] * x[j]
f += b

With vectorization (NumPy),

$$f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$$

f = np.dot(w, x) + b

The benefits of vectorization are not only cleaner and simpler code; the performance is also much faster, because NumPy takes advantage of parallel hardware in the computer.
Consider the example above. Without vectorization, the for loop runs sequentially, adding the product of w and x for each parameter $j$, step after step: f += w[j] * x[j]
However, with vectorization, the products of w and x for each parameter $j$ are computed in parallel. Using specialized hardware, these products are then added together to give the final result for f.
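A rough way to see the speedup for yourself (the dataset size and the timing approach here are assumptions for illustration, not from the original) is to time the loop version against np.dot:

import time
import numpy as np

# Larger vectors than the example above, assumed just to make the timing visible
n = 1_000_000
w = np.random.rand(n)
x = np.random.rand(n)
b = 4.0

# Without vectorization: an explicit Python loop over every element
start = time.perf_counter()
f = 0.0
for j in range(n):
    f += w[j] * x[j]
f += b
loop_time = time.perf_counter() - start

# With vectorization: a single NumPy dot product
start = time.perf_counter()
f_vec = np.dot(w, x) + b
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.4f} s, vectorized: {vec_time:.6f} s")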

Gradient Descent for Multiple Linear Regression

Utilizing gradient descent with multiple features follows a process very similar to multiple-feature linear regression: the updates are written with vector notation. Consider the following pre-derived gradient descent algorithm, where the weight vector $\vec{w}$ is used across the $m$ training examples in the dataset,

Pre-derived gradient descent algorithm



$$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b)$$

$$b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$

Final gradient descent algorithm

$$w_j = w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$

simultaneously update $w_j$ (for $j = 1, \dots, n$) and $b$

Notice how the update for $b$ does not involve the $n$ features, because $b$ is a single value that does not change based on which feature is being updated. Also, the $x_j^{(i)}$ term in the equation for $w_j$ does not use vector notation, because it refers to the value of a single feature $j$ in the $i^{\text{th}}$ training example.
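A minimal sketch of one vectorized gradient descent step for multiple features (assuming X is an m-by-n NumPy matrix of training examples; the function name gradient_step is illustrative, not from the original):

import numpy as np

def gradient_step(X, y, w, b, alpha):
    # X: (m, n) matrix of training examples, y: (m,) targets, w: (n,) weights, b: scalar
    m = X.shape[0]
    err = X @ w + b - y                    # f_{w,b}(x^(i)) - y^(i) for all i at once
    dj_dw = (X.T @ err) / m                # one partial derivative per feature j
    dj_db = np.sum(err) / m
    w = w - alpha * dj_dw                  # simultaneous update of every w_j and b
    b = b - alpha * dj_db
    return w, b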

Regularized Linear Regression


Regularization adds a penalty term to the cost function, so the gradient descent algorithm needs to be modified too; specifically, the derivative of the cost function changes. Given the regularized cost function and gradient descent update rules,

$$J(\vec{w}, b) = \frac{1}{2m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$

$$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w}, b) \qquad b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$

Fortunately, the necessary changes require only a small modification to the $w_j$ update in gradient descent. Note that $b$ is unaffected, because it is not a weighted parameter and does not appear in the regularization term.

$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} w_j$$

$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$

Combining the equations above, the complete gradient descent is formed,

Complete gradient descent of regularized linear regression

$$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m} w_j\right]$$

$$b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$
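A minimal sketch of computing these regularized gradients in NumPy (with the regularization strength written as lambda_, since lambda is a Python keyword; the function name is assumed, not from the original):

import numpy as np

def regularized_gradients(X, y, w, b, lambda_):
    # Gradients of the regularized cost; only dj_dw gains the (lambda / m) * w term
    m = X.shape[0]
    err = X @ w + b - y
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w
    dj_db = np.sum(err) / m
    return dj_dw, dj_db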
