
Linear Regression∗

February 15, 2023

This course note has two parts. The first part is on linear regression, and the second
part is on the gradient descent optimization technique.

1 Introduction
1.1 Definition
In machine learning, a regression problem is the problem of determining a relation between one or more independent variables and a real-valued (continuous) output variable, given a set of observed values of the independent variables and the corresponding values of the output variable.

1.2 Examples
• Consider the data on car prices given in Table 1. Suppose we are required to
estimate the price of a car aged 25 years with distance 53240 km, and weight 1200
pounds. This is an example of a regression problem because we have to predict the
value of the numeric variable “Price”.

• Consider the navigation of a mobile robot, say an autonomous car. The output is the angle by which the steering wheel should be turned at each time step, so that the car advances without hitting obstacles or deviating from the route. The inputs are provided by sensors on the car, such as a video camera, GPS, and so forth.

• In finance, the capital asset pricing model uses regression for analyzing and quan-
tifying the systematic risk of an investment.

• In economics, regression is the predominant empirical tool. For example, it is used to predict consumption spending, inventory investment, purchases of a country’s exports, spending on imports, labor demand, and labor supply.

References
- Ethem Alpaydin. Introduction to Machine Learning. MIT Press, 2004.
- Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- V. N. Krishnachandran. Lecture Notes in Machine Learning. 2018.
- Robert Kwiatkowski. Gradient Descent Algorithm - a deep dive. May 22, 2021.


Table 1: Prices of used cars: example data for regression.

Price (US$)   Age (years)   Distance (km)   Weight (pounds)
13500         23            46986           1165
13750         23            72937           1165
13950         24            41711           1165
14950         26            48000           1165
13750         30            38500           1170
12950         32            61000           1170
16900         27            94612           1245
18600         30            75889           1245
21500         27            19700           1185
12950         23            71138           1105

1.3 General approach


Let x denote the set of input variables and y the output variable. In machine learning,
the general approach to regression is to assume a model, that is, some mathematical
relation between x and y, involving some parameters θ, in the following form:

y = f (x, θ). (1)

The function y = f (x, θ) is called the regression function. The machine learning
algorithm optimizes the parameters in the set θ such that the approximation error is
minimized; that is, the estimates of the values of the dependent variable y are as close as
possible to the correct values given in the training set.
Example. If the input variables are “Age”, “Distance” and “Weight” and the output
variable is “Price”, the model y = f (x, θ) may be

Price = a0 + a1 · Age + a2 · Distance + a3 · Weight,

where x = {Age, Distance, Weight} denotes the set of input variables and θ =
{a0 , a1 , a2 , a3 } denotes the set of parameters of the model. The goal of regression is
to determine the unknown parameters of the model and to use the model for prediction.
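
As an illustration, here is a minimal sketch (not part of the original notes) that fits this price model to the data in Table 1 with NumPy's least-squares solver; the variable names are our own.

# A sketch: fitting Price = a0 + a1*Age + a2*Distance + a3*Weight by
# ordinary least squares, using the ten observations from Table 1.
import numpy as np

# Columns: Age (years), Distance (km), Weight (pounds)
X_raw = np.array([
    [23, 46986, 1165], [23, 72937, 1165], [24, 41711, 1165],
    [26, 48000, 1165], [30, 38500, 1170], [32, 61000, 1170],
    [27, 94612, 1245], [30, 75889, 1245], [27, 19700, 1185],
    [23, 71138, 1105],
], dtype=float)
price = np.array([13500, 13750, 13950, 14950, 13750,
                  12950, 16900, 18600, 21500, 12950], dtype=float)

# Prepend a column of ones so that a0 acts as the intercept.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Least-squares estimate of theta = (a0, a1, a2, a3).
theta, *_ = np.linalg.lstsq(X, price, rcond=None)
print("a0, a1, a2, a3 =", theta)

# Predict the price of a car aged 25 years, 53240 km, 1200 pounds.
print("predicted price:", theta @ np.array([1, 25, 53240, 1200]))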

1.4 Different regression models [1]


There are various types of regression techniques available to make predictions. These techniques mostly differ in three aspects, namely the number and type of independent variables, the type of dependent variable, and the shape of the regression line. Some of these are listed below.

• Simple linear regression. There is only one continuous independent variable x, and the assumed relation between the independent variable x and the dependent variable y is

y = a + bx. (2)
[1] Note: logistic regression has a binary dependent variable (i.e., a variable which takes only the values 0 and 1) and is therefore referred to as a classification model.


• Multivariate linear regression. There is more than one independent variable, say x1, ..., xn, and the assumed relation between the independent variables and the dependent variable is

y = a0 + a1 x1 + ... + an xn. (3)

• Polynomial regression. There is only one continuous independent variable x and the assumed model is

y = a0 + a1 x + a2 x^2 + ... + an x^n. (4)

1.5 Criterion for minimization of error


In regression, we would like to write the numeric output y, called the dependent variable, as a function of the input x, called the independent variable. We assume that the output is the sum of a function of the input, f(x), and some random error denoted by ε, i.e.,

y = f(x) + ε. (5)

Here the function f(x) is unknown and we would like to approximate it by some estimator g(x, θ) containing a set of parameters θ. We assume that the random error ε follows a normal distribution with mean 0.
Let x1 , ..., xn be a random sample of observations of the input variable x and y1 , ..., yn
the corresponding observed values of the output variable y. The value of desired θ mini-
mizes the following sum of squares:

E(θ) = (y1 − g(x1, θ))^2 + ... + (yn − g(xn, θ))^2. (6)

The method of finding the value of θ that minimizes E(θ) is known as the ordinary
least squares method.

2 Simple linear regression


Let x be the independent predictor variable and y the dependent variable. Assume that
we have a set of observed values of x and y as shown in Table 2.

Table 2: Data set for simple linear regression.

x x1 x2 ... xn
y y1 y2 ... yn

A simple linear regression model defines the relationship between x and y using a line
defined by an equation in the following form:

y = α + βx. (7)

To determine the optimal estimates of α and β, an estimation method known as Ordinary Least Squares (OLS) is used.

The OLS method. In the OLS method, the values of the y-intercept and the slope are chosen such that they minimize the sum of the squared errors; that is, the sum of the


squares of the vertical distances between the predicted y-values and the actual y-values (see Figure 1). Let ŷi be the predicted value of yi; the sum of squares of errors is given by

E = Σ_{i=1}^{n} (yi − ŷi)^2 = Σ_{i=1}^{n} [yi − (α + βxi)]^2. (8)

So we are required to find the values of α and β, such that E is minimum. Using
methods of calculus, we can show that the values of α and β can be obtained by solving
the following equations:
Σ_{i=1}^{n} yi = nα + β Σ_{i=1}^{n} xi

Σ_{i=1}^{n} xi yi = α Σ_{i=1}^{n} xi + β Σ_{i=1}^{n} xi^2 (9)

Figure 1: Errors in observed values.

Formulas to find α and β. Recall that the means of x and y are given by

x̄ = (1/n) Σ_{i=1}^{n} xi,    ȳ = (1/n) Σ_{i=1}^{n} yi, (10)

and also that the variance of x is given by

Var(x) = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)^2. (11)

The covariance of x and y, denoted by Cov(x, y), is defined as

Cov(x, y) = (1/(n−1)) Σ_{i=1}^{n} (xi − x̄)(yi − ȳ). (12)

It can be shown that the values of α and β can be computed using the following formulas:

β = Cov(x, y) / Var(x),    α = ȳ − β x̄. (13)


Remarks. It is interesting to note why the least squares method discussed above is called the “ordinary” least squares method. Several different variants of the least squares method have been developed over the years. For example, in the weighted least squares method, the coefficients α and β are estimated such that the weighted sum of squares of errors

E = Σ_{i=1}^{n} wi [yi − (α + βxi)]^2, (14)

for some positive constants w1 , ..., wn , is minimum. There are also methods known by the
names generalized least squares method, partial least squares method, total least squares
method, etc. The reader may refer to Wikipedia to obtain further information about
these methods. The OLS method has a long history. The method is usually credited to
Carl Friedrich Gauss (1795), but it was first published by Adrien-Marie Legendre (1805).
Example. Obtain a linear regression model for the data in Table 3, assuming that x is the independent variable and y the dependent variable.

Table 3: Example data for simple linear regression.

x 1.0 2.0 3.0 4.0 5.0
y 1.00 2.00 1.30 3.75 2.25

Figure 2: Regression model for Table 3.

Solution. In the usual notations of simple linear regression, we have


n = 5

x̄ = (1/5)(1.0 + 2.0 + 3.0 + 4.0 + 5.0) = 3.0

ȳ = (1/5)(1.00 + 2.00 + 1.30 + 3.75 + 2.25) = 2.06

Cov(x, y) = (1/4)[(1.0 − 3.0)(1.00 − 2.06) + · · · + (5.0 − 3.0)(2.25 − 2.06)] = 1.0625

Var(x) = (1/4)[(1.0 − 3.0)^2 + · · · + (5.0 − 3.0)^2] = 2.5

β = 1.0625 / 2.5 = 0.425

α = 2.06 − 0.425 × 3.0 = 0.785

Therefore, the linear regression model for the data is

y = 0.785 + 0.425x.
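
As a quick check (not part of the original notes), the short NumPy sketch below reproduces these values; np.polyfit gives the same coefficients directly.

# A sketch verifying the simple linear regression example for Table 3.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.00, 2.00, 1.30, 3.75, 2.25])

# Sample covariance and variance (divisor n - 1), as in Equations 11 and 12.
beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)          # expected: 0.785 and 0.425

# Equivalent one-liner: np.polyfit returns [slope, intercept] for degree 1.
print(np.polyfit(x, y, 1))  # expected: [0.425, 0.785]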

3 Polynomial regression
Let x be the independent predictor variable and y the dependent variable. Assume that
we have a set of observed values of x and y as in Table 2. A polynomial regression model
defines the relationship between x and y by an equation in the following form:

y = α0 + α1 x + α2 x^2 + · · · + αk x^k. (15)

To determine the optimal values of the parameters α0 , α1 , ..., αk , the method of ordi-
nary least squares can be used. The desired values of the parameters are those values
that minimize the sum of squared errors, i.e.,
E = Σ_{i=1}^{n} [yi − (α0 + α1 xi + α2 xi^2 + · · · + αk xi^k)]^2. (16)

The optimal values of the parameters are obtained by solving the following system of
equations
∂E/∂αi = 0, for all i = 0, 1, . . . , k. (17)
Simplifying and transforming Equation 17, we get a system of k + 1 linear equations
Σ yi = α0 n + α1 (Σ xi) + · · · + αk (Σ xi^k)

Σ yi xi = α0 (Σ xi) + α1 (Σ xi^2) + · · · + αk (Σ xi^{k+1})

Σ yi xi^2 = α0 (Σ xi^2) + α1 (Σ xi^3) + · · · + αk (Σ xi^{k+2})

...

Σ yi xi^k = α0 (Σ xi^k) + α1 (Σ xi^{k+1}) + · · · + αk (Σ xi^{2k}) (18)

where every sum runs over i = 1, . . . , n.

We can see that the values of αi can be obtained by solving the above linear system.


Remarks. The linear system of equations to find the αi has a compact matrix representation, y = Dα, where

y = (y1, y2, . . . , yn)^T,    α = (α0, α1, . . . , αk)^T,

and D is the n × (k + 1) matrix whose i-th row is (1, xi, xi^2, . . . , xi^k). (19)

Then we have

α = (D^T D)^{-1} D^T y, (20)

where the superscript T denotes the transpose of a matrix.
Example. Find a quadratic regression model for the data shown in Table 4.

Table 4: Example data for quadratic polynomial regression.

x 3.0 4.0 5.0 6.0 7.0
y 2.5 3.2 3.8 6.5 11.5

Solution. Let the quadratic regression model be

y = α0 + α1 x + α2 x^2.

The values of α0 , α1 and α2 that minimize the sum of squares of errors satisfy the following
system of equations
Σ yi = n α0 + α1 (Σ xi) + α2 (Σ xi^2)

Σ yi xi = α0 (Σ xi) + α1 (Σ xi^2) + α2 (Σ xi^3)

Σ yi xi^2 = α0 (Σ xi^2) + α1 (Σ xi^3) + α2 (Σ xi^4)

Using the given data (shown in Table 4), we have

27.5 = 5α0 + 25α1 + 135α2
158.8 = 25α0 + 135α1 + 775α2
966.2 = 135α0 + 775α1 + 4659α2

Solving this system of equations we get

α0 = 12.4285714
α1 = −5.5128571
α2 = 0.7642857

Thus, the required quadratic polynomial model is

y = 12.4285714 − 5.5128571x + 0.7642857x^2.

Figure 3 shows plots of the data and the quadratic polynomial model.


Figure 3: Plot of quadratic polynomial model.
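
The following minimal sketch (ours, not part of the original notes) reproduces these coefficients by building the matrix D of Equation 19 and applying Equation 20.

# A sketch: quadratic regression for Table 4 via the normal equations.
import numpy as np

x = np.array([3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.5, 3.2, 3.8, 6.5, 11.5])

# D has rows (1, x_i, x_i^2); np.vander with increasing=True builds it.
D = np.vander(x, 3, increasing=True)

# Equation 20: alpha = (D^T D)^{-1} D^T y. Solving the linear system is
# numerically preferable to forming the inverse explicitly.
alpha = np.linalg.solve(D.T @ D, D.T @ y)
print(alpha)  # expected: [12.4285714, -5.5128571, 0.7642857]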

4 Multiple linear regression


We assume that there are N independent variables x1 , x2 , ..., xN . Let the dependent
variable be y. Let there also be n observed values of these variables, as shown in Table 5.

Table 5: Data set for multiple linear regression (n observations of N independent variables and the outcome y).

Variable       Example 1   Example 2   ...   Example n
x1             x11         x12         ...   x1n
x2             x21         x22         ...   x2n
...
xN             xN1         xN2         ...   xNn
y (outcomes)   y1          y2          ...   yn

The multiple linear regression model defines the relationship between the N indepen-
dent variables and the dependent variable by an equation of the following form:

y = β0 + β1 x1 + · · · + βN xN (21)
As in simple linear regression, here also we use the ordinary least squares method to
obtain the optimal estimates of β0 , β1 , ..., βN . The method yields the following procedure
for the computation of these optimal estimates. Let
Y = (y1, y2, . . . , yn)^T,    B = (β0, β1, . . . , βN)^T,

and X is the n × (N + 1) matrix whose i-th row is (1, x1i, x2i, . . . , xNi). (22)

It can be shown that the regression coefficients are given by


B = (X^T X)^{-1} X^T Y. (23)

Example. Fit a multiple linear regression model to the data shown in Table 6.


Table 6: Example data for multi-linear regression.

x1 1 1 2 0
x2 1 2 2 1
y 3.25 6.5 3.5 5.0

Solution. In this problem, there are two independent variables and four sets of values of the variables. Thus, in the notation used above, we have N = 2 and n = 4. The

y = β0 + β1 x1 + β2 x2 .

The computations are shown below.

Y = (3.25, 6.5, 3.5, 5.0)^T,    B = (β0, β1, β2)^T,

X = [ 1  1  1
      1  1  2
      1  2  2
      1  0  1 ]

X^T X = [ 4  4   6
          4  6   7
          6  7  10 ]

(X^T X)^{-1} = [ 11/4   1/2   −2
                  1/2    1    −1
                 −2     −1     2 ]

B = (X^T X)^{-1} X^T Y = (2.0625, −2.3750, 3.2500)^T
Thus the required model is

y = 2.0625 − 2.3750x1 + 3.2500x2 .

Figure 4 shows plots of the data and the multiple linear regression model (a plane in
this example).
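
Here is a short sketch (not part of the original notes) reproducing these coefficients with NumPy.

# A sketch: multiple linear regression for Table 6 via Equation 23.
import numpy as np

x1 = np.array([1.0, 1.0, 2.0, 0.0])
x2 = np.array([1.0, 2.0, 2.0, 1.0])
y = np.array([3.25, 6.5, 3.5, 5.0])

# Design matrix X with a leading column of ones for the intercept beta_0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# B = (X^T X)^{-1} X^T Y, solved without forming the inverse explicitly.
B = np.linalg.solve(X.T @ X, X.T @ y)
print(B)  # expected: [ 2.0625, -2.375, 3.25 ]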

5 Gradient descent
In the previous sections, we have provided closed-form solutions to the linear regression problems (see Equations 13, 20, and 23). In this section, we show how linear regression can be solved as an optimization problem. We will introduce an optimization algorithm called gradient descent [2]. Gradient descent is an iterative first-order optimization algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning and deep learning to minimize a cost/loss function. It also has wide applications in areas such as control engineering (robotics, chemical, etc.), computer games, and mechanical engineering.

[2] The gradient descent method was proposed before the era of modern computers, and intensive development since then has led to numerous improved versions of it. In this course note, we focus on the basic version/idea of the algorithm.


Figure 4: The regression plane for the data in Table 6.

5.1 Function requirements


The gradient descent algorithm does not work for all functions. There are two specific requirements. A function has to be:

• differentiable. This requires that the function has a derivative at each point in its domain. See Figure 5 for a few examples of differentiable functions and Figure 6 for a few examples of non-differentiable functions.

• convex. For a univariate function, this means that the line segment connecting any two points on the function's curve lies on or above the curve (i.e., it does not cross it). A line crossing the curve means that the function has a local minimum (which may not be the global one). Mathematically, for two points (x1, f(x1)) and (x2, f(x2)) on the function's curve, this condition can be expressed as

f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2),

where λ determines a point's location on the section line and its value is between 0 (left point) and 1 (right point). For example, λ = 0.5 means a location in the middle. See Figure 7 for two functions with exemplary section lines. Another way to check mathematically whether a univariate function is convex is to calculate the second derivative and check that its value is always bigger than 0, i.e., d²f(x)/dx² > 0.

Example. Let's investigate a simple quadratic function given by f(x) = x^2 − x + 3. Its first and second derivatives are

df(x)/dx = 2x − 1,    d²f(x)/dx² = 2.
Because the second derivative is always bigger than 0, f(x) is strictly convex.

Figure 5: Examples of differentiable functions.

Figure 6: Examples of non-differentiable functions: (a) cusp, (b) jump discontinuity, (c) infinite discontinuity. Typical non-differentiable functions have a step, a cusp, or a discontinuity.

It is also possible to find extreme values of quasi-convex functions using a gradient descent algorithm. However, they often have so-called saddle points (also called minimax points)
where the algorithm can get stuck. An example of a quasi-convex function is
f(x) = x^4 − 2x^3 + 2,

whose first-order derivative is

df(x)/dx = 4x^3 − 6x^2 = x^2(4x − 6),

and its second-order derivative is

d²f(x)/dx² = 12x^2 − 12x = 12x(x − 1).
The value of the second-order derivative is zero for x = 0 and x = 1. These locations are called inflection points, i.e., places where the curvature changes sign; in other words, the function changes from convex to concave or vice versa. By analyzing this equation we can conclude that

• for x < 0: the function is convex
• for 0 < x < 1: the function is concave
• for x > 1: the function is convex again

Now we see that the point x = 0 has both first and second derivatives equal to zero, meaning it is a saddle point, while the point x = 1.5 is a global minimum. The graph of this function, with the saddle point at x = 0 and the minimum at x = 1.5, is shown in Figure 8.


Figure 7: Exemplary convex and non-convex functions: (a) convex, (b) non-convex.

Figure 8: Example of a semi-convex function with a saddle point.

5.2 Gradient
Before jumping into the code of the method, one more thing has to be explained: what is a gradient? Intuitively, it is the slope of a curve at a given point in a specified direction. In the case of a univariate function, it is simply the first derivative at a selected point. In the case of a multivariate function, it is the vector of derivatives in each main direction (along the variable axes). Because for each variable we are interested only in the slope along its axis and we don't care about the other variables, these derivatives are called partial derivatives.
A gradient of an n-dimensional function f(x) at a given point p is defined as the vector of partial derivatives

∇f(p) = ( ∂f/∂x1(p), . . . , ∂f/∂xn(p) )^T. (24)
The upside-down triangle ∇ is the so-called nabla symbol and you read it "del". To better understand how to calculate it, let's do a hand calculation for an exemplary 2-dimensional function below (the function is plotted in Figure 9),

f(x, y) = 0.5x^2 + y^2.

Its gradient can be derived as

∇f(x, y) = ( ∂f/∂x(x, y), ∂f/∂y(x, y) )^T = ( x, 2y )^T.


Let's assume we are interested in the gradient at the point p = (10, 10), which is

∇f(10, 10) = (10, 20)^T.

By looking at these values we can see that the slope is twice as steep along the y axis.

Figure 9: The 3D plot of the 2-dimensional function f(x, y) = 0.5x^2 + y^2.
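
A small sketch (ours, not from the notes) that checks this hand calculation numerically with central differences:

# A sketch: analytic vs. numerical gradient of f(x, y) = 0.5*x^2 + y^2.
import numpy as np

def f(x, y):
    return 0.5 * x**2 + y**2

def grad_f(x, y):
    # Analytic partial derivatives: df/dx = x, df/dy = 2y.
    return np.array([x, 2.0 * y])

def numerical_grad(func, x, y, h=1e-6):
    # Central differences along each axis.
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

print(grad_f(10.0, 10.0))             # expected: [10. 20.]
print(numerical_grad(f, 10.0, 10.0))  # close to [10, 20]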

5.3 Gradient descent algorithm


The gradient descent (GD) algorithm iteratively calculates the next point using the gradient at the current position, scales it (by the learning rate), and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimize the function (to maximize it, we would add). This process can be written as:

p_{n+1} = p_n − η ∇f(p_n). (25)

There's an important parameter η that scales the gradient and thus controls the step size. In machine learning, it is called the learning rate and it has a strong influence on performance.

• The smaller the learning rate, the longer GD takes to converge, or it may reach the maximum number of iterations before reaching the optimum point.

• If the learning rate is too big, the algorithm may not converge to the optimal point (it may jump around it) or may even diverge completely.
In summary, the main steps of the gradient descent method are:
1) choose a starting point (initialization).

2) calculate gradient at this point.

3) make a scaled step in the opposite direction to the gradient (objective: minimize).

4) repeat steps 2) and 3) until one of the criteria is met:


• maximum number of iterations reached;

• step size is smaller than a given tolerance.

Below is an exemplary implementation of the gradient descent algorithm (with steps tracking).
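
The original listing is not reproduced in this copy of the notes, so the following is a minimal sketch consistent with the parameter list described next; the function name and implementation details are our assumptions.

# A sketch of a gradient descent routine with steps tracking.
import numpy as np

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    """Minimize a function given its gradient; return the visited points."""
    steps = [start]                        # history of points (for plotting)
    x = start
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)    # scaled gradient, Equation 25
        if np.all(np.abs(diff) < tol):     # stop when the step becomes tiny
            break
        x = x - diff                       # step against the gradient
        steps.append(x)
    return steps, x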

This function takes 5 parameters:

1) starting point: in our case, we define it manually, but in practice it is often an initial guess or even a random initialization;

2) gradient function: a function that computes the gradient of the original function, which has to be specified beforehand;

3) learning rate: the scaling factor for the step sizes;

4) maximum number of iterations;

5) tolerance, to conditionally stop the algorithm (in this case, the default value is 0.01).

5.4 Examples
Example 1: a quadratic function. Let’s take a simple univariate quadratic function

f(x) = x^2 − 4x + 1.

Its gradient function is

df(x)/dx = 2x − 4.
Let’s write these functions in Python:
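
The listing in the source is an image; a minimal reconstruction consistent with the gradient descent sketch above would be:

# A sketch of the two functions used in Example 1.
def func1(x):
    return x**2 - 4 * x + 1

def gradient1(x):
    return 2 * x - 4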


For this function, by taking a learning rate of 0.1 and starting point at x = 9 we can
easily calculate each step by hand using Equation 25. Let’s do it for the first 3 steps:
x0 = 9, f (9) = 46
x1 = 9 − 0.1 × (2 × 9 − 4) = 7.6, f (7.6) = 28.36
x2 = 7.6 − 0.1 × (2 × 7.6 − 4) = 6.48, f (6.48) = 17.07
x3 = 6.48 − 0.1 × (2 × 6.48 − 4) = 5.584, f (5.584) = 9.845
Continuing this process, you will get
...
x21 = 2.065, f (2.065) = −2.996
x22 = 2.052, f (2.052) = −2.997
In the last step, the change in x is 0.007, which is smaller than the specified tolerance
0.01, and thus the process terminated. Please also note that the change in the function
value is 0.001.
The Python function is called by:
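
The exact call is likewise shown as an image in the source; a sketch consistent with the function defined above would be:

# Run gradient descent on Example 1 with learning rate 0.1, starting at x = 9.
steps, minimum = gradient_descent(start=9.0, gradient=gradient1,
                                  learn_rate=0.1, max_iter=100, tol=0.01)
print(minimum)  # approaches x = 2, the minimizer of x^2 - 4x + 1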

In Figure 10, the trajectories, number of iterations, and the final converged result
(within tolerance) for various learning rates are shown.

Figure 10: The gradient descent steps for minimizing the function f(x) = x^2 − 4x + 1 with different learning rates.

Example 2: a function with a saddle point. Now let's see how the algorithm copes with the semi-convex function we investigated mathematically before,

f(x) = x^4 − 2x^3 + 2.

Figure 11 shows the results for two learning rates and two different starting points. We can see that the existence of a saddle point poses a real challenge for the gradient descent algorithm, and obtaining the global minimum is not guaranteed. Second-order algorithms (e.g., Newton's method [3]) deal with these situations better. The investigation of saddle points and how to escape from them is a subject of ongoing study, and various solutions have been proposed. For example, Jin et al. proposed a perturbed gradient descent algorithm [4]. A sketch of such a run is given below.
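
The sketch below (ours) runs the routine from Section 5.3 on this function; the starting points and learning rate are illustrative assumptions, not the exact settings used for Figure 11.

# A sketch: gradient descent on f(x) = x^4 - 2x^3 + 2 from two starting points.
def func2(x):
    return x**4 - 2 * x**3 + 2

def gradient2(x):
    return 4 * x**3 - 6 * x**2

for start in (-0.5, 2.5):
    steps, x_final = gradient_descent(start, gradient2,
                                      learn_rate=0.01, max_iter=1000, tol=1e-4)
    print(start, "->", x_final)
# Starting left of the saddle point, the iterates stall near x = 0; starting
# to its right, they converge to the global minimum at x = 1.5.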
[3] Newton's method in optimization. Wikipedia.
[4] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan. How to Escape Saddle Points Efficiently. PMLR 70:1724-1732, 2017.


Figure 11: The gradient descent steps for minimizing the function f(x) = x^4 − 2x^3 + 2 with two different learning rates and two different starting points.

5.5 Summary
We have learned how the gradient descent algorithm works, when it can be used, and the common challenges when using it. This is a good starting point to explore more advanced gradient-based optimization methods such as Momentum, Nesterov (Accelerated) Gradient Descent, and Adam, or second-order ones like Newton's method.

