Linear Regression
This course note has two parts. The first part is on linear regression, and the second
part is on the gradient descent optimization technique.
1 Introduction
1.1 Definition
In machine learning, a regression problem is the problem of determining a relation between one or more independent variables and an output variable, which is a real continuous variable, given a set of observed values of the independent variables and the corresponding values of the output variable.
1.2 Examples
• Consider the data on car prices given in Table 1. Suppose we are required to
estimate the price of a car aged 25 years with distance 53240 km, and weight 1200
pounds. This is an example of a regression problem because we have to predict the
value of the numeric variable “Price”.
• Consider the navigation of a mobile robot, say an autonomous car. The output is the angle by which the steering wheel should be turned at each time step, so as to advance without hitting obstacles or deviating from the route. Inputs are provided by sensors on the car, such as a video camera, GPS, and so forth.
• In finance, the capital asset pricing model uses regression for analyzing and quantifying the systematic risk of an investment.
The function y = f (x, θ) is called the regression function. The machine learning
algorithm optimizes the parameters in the set θ such that the approximation error is
minimized; that is, the estimates of the values of the dependent variable y are as close as
possible to the correct values given in the training set.
Example. If the input variables are “Age”, “Distance” and “Weight” and the output variable is “Price”, the model y = f(x, θ) may be

y = a_0 + a_1 · Age + a_2 · Distance + a_3 · Weight,

where x = {Age, Distance, Weight} denotes the set of input variables and θ = {a_0, a_1, a_2, a_3} denotes the set of parameters of the model. The goal of regression is to determine the unknown parameters of the model and to use the model for prediction.
• Multivariate linear regression. There is more than one independent variable, say x_1, ..., x_n, and the assumed relation between the independent variables and the dependent variable is

y = a_0 + a_1 x_1 + \cdots + a_n x_n.   (3)
• Polynomial regression. There is a single independent variable x, and the assumed relation between x and the dependent variable is

y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n.   (4)
Consider a model of the form y = f(x) + ε. Here the function f(x) is unknown and we would like to approximate it by some estimator g(x, θ) containing a set of parameters θ. We assume that the random error ε follows a normal distribution with mean 0.
Let x_1, ..., x_n be a random sample of observations of the input variable x and y_1, ..., y_n the corresponding observed values of the output variable y. The desired value of θ minimizes the following sum of squares:

E(θ) = \sum_{i=1}^{n} [y_i − g(x_i, θ)]^2.
The method of finding the value of θ that minimizes E(θ) is known as the ordinary
least squares method.
Table 2: Observed values of x and y.

x    x_1    x_2    ...    x_n
y    y_1    y_2    ...    y_n
A simple linear regression model defines the relationship between x and y using a line
defined by an equation in the following form:
y = α + βx. (7)
The ordinary least squares estimates of α and β are the values that minimize the sum of squares of the vertical distances between the predicted y-values and the actual y-values (see Figure 1). Let ŷ_i be the predicted value of y_i; the sum of squares of errors is then given by
E = \sum_{i=1}^{n} (y_i − ŷ_i)^2 = \sum_{i=1}^{n} [y_i − (α + β x_i)]^2.   (8)
So we are required to find the values of α and β such that E is minimized. Using methods of calculus, we can show that the values of α and β can be obtained by solving the following equations:
\sum_{i=1}^{n} y_i = nα + β \sum_{i=1}^{n} x_i
\sum_{i=1}^{n} x_i y_i = α \sum_{i=1}^{n} x_i + β \sum_{i=1}^{n} x_i^2   (9)
Formulas to find α and β. Recall that the means of x and y are given by
x̄ = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad ȳ = \frac{1}{n} \sum_{i=1}^{n} y_i,   (10)
and also that the variance of x is given by
Var(x) = \frac{1}{n − 1} \sum_{i=1}^{n} (x_i − x̄)^2.   (11)
The covariance of x and y, denoted by Cov(x, y), is defined as

Cov(x, y) = \frac{1}{n − 1} \sum_{i=1}^{n} (x_i − x̄)(y_i − ȳ).   (12)
It can be shown that the values of α and β can be computed using the following
formulas:
β = \frac{Cov(x, y)}{Var(x)}, \qquad α = ȳ − β x̄.   (13)
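In practice, Equation 13 translates directly into a few lines of NumPy. The following is a minimal sketch; the function name is illustrative and not part of any library:

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Ordinary least squares estimates of α and β in y = α + βx (Equation 13)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # β = Cov(x, y) / Var(x)
    alpha = y.mean() - beta * x.mean()                      # α = ȳ − β x̄
    return alpha, beta
```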
Remarks. It is interesting to note why the least squares method discussed above is called the “ordinary” least squares method. Several different variants of the least
squares method have been developed over the years. For example, in the weighted least
squares method, the coefficients α and β are estimated such that the weighted sum of
squares of errors
E = \sum_{i=1}^{n} w_i [y_i − (α + β x_i)]^2,   (14)

for some positive constants w_1, ..., w_n, is minimized. There are also methods known by the
names generalized least squares method, partial least squares method, total least squares
method, etc. The reader may refer to Wikipedia to obtain further information about
these methods. The OLS method has a long history. The method is usually credited to
Carl Friedrich Gauss (1795), but it was first published by Adrien-Marie Legendre (1805).
Example. Obtain a linear regression model for the data in Table 3, assuming that y is the dependent variable.
Solution. Here n = 5, and

x̄ = (1.0 + 2.0 + 3.0 + 4.0 + 5.0)/5 = 3.0
ȳ = (1.00 + 2.00 + 1.30 + 3.75 + 2.25)/5 = 2.06
Cov(x, y) = [(1.0 − 3.0)(1.00 − 2.06) + · · · + (5.0 − 3.0)(2.25 − 2.06)]/4 = 1.0625
Var(x) = [(1.0 − 3.0)^2 + · · · + (5.0 − 3.0)^2]/4 = 2.5

so that, by Equation 13,

β = 1.0625/2.5 = 0.425
α = 2.06 − 0.425 × 3.0 = 0.785.

The fitted simple linear regression model is therefore

y = 0.785 + 0.425x.
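Assuming the fit_simple_linear_regression sketch given after Equation 13, this hand calculation can be reproduced as follows:

```python
# Data from Table 3
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.00, 2.00, 1.30, 3.75, 2.25]

alpha, beta = fit_simple_linear_regression(x, y)
print(alpha, beta)   # approximately 0.785 and 0.425
```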
3 Polynomial regression
Let x be the independent predictor variable and y the dependent variable. Assume that
we have a set of observed values of x and y as in Table 2. A polynomial regression model
defines the relationship between x and y by an equation in the following form:
y = α_0 + α_1 x + α_2 x^2 + \cdots + α_k x^k.   (15)
To determine the optimal values of the parameters α0 , α1 , ..., αk , the method of ordi-
nary least squares can be used. The desired values of the parameters are those values
that minimize the sum of squared errors, i.e.,
E = \sum_{i=1}^{n} [y_i − (α_0 + α_1 x_i + α_2 x_i^2 + \cdots + α_k x_i^k)]^2.   (16)
The optimal values of the parameters are obtained by solving the following system of
equations
\frac{∂E}{∂α_i} = 0, \qquad ∀ i = 0, 1, ..., k.   (17)
Simplifying and transforming Equation 17, we get a system of k + 1 linear equations
\sum y_i = α_0 n + α_1 (\sum x_i) + \cdots + α_k (\sum x_i^k)
\sum y_i x_i = α_0 (\sum x_i) + α_1 (\sum x_i^2) + \cdots + α_k (\sum x_i^{k+1})
\sum y_i x_i^2 = α_0 (\sum x_i^2) + α_1 (\sum x_i^3) + \cdots + α_k (\sum x_i^{k+2})
⋮
\sum y_i x_i^k = α_0 (\sum x_i^k) + α_1 (\sum x_i^{k+1}) + \cdots + α_k (\sum x_i^{2k})   (18)
We can see that the values of αi can be obtained by solving the above linear system.
Remarks. The linear system of equations to find α_i has a compact matrix representation, \vec{y} = D\vec{α}, where

\vec{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
D = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^k \\ 1 & x_2 & x_2^2 & \cdots & x_2^k \\ \vdots & & & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^k \end{pmatrix}, \quad
\vec{α} = \begin{pmatrix} α_0 \\ α_1 \\ \vdots \\ α_k \end{pmatrix}.   (19)

The ordinary least squares solution is then given by

\vec{α} = (D^T D)^{-1} D^T \vec{y}.   (20)
Example. Suppose we fit a quadratic polynomial model

y = α_0 + α_1 x + α_2 x^2

to a set of observed data. The values of α_0, α_1 and α_2 that minimize the sum of squares of errors satisfy the following system of equations:
\sum y_i = n α_0 + α_1 (\sum x_i) + α_2 (\sum x_i^2)
\sum y_i x_i = α_0 (\sum x_i) + α_1 (\sum x_i^2) + α_2 (\sum x_i^3)
\sum y_i x_i^2 = α_0 (\sum x_i^2) + α_1 (\sum x_i^3) + α_2 (\sum x_i^4)
Solving this system for the given data yields α_0 = 12.4285714, α_1 = −5.5128571 and α_2 = 0.7642857.
Figure 3 shows plots of the data and the quadratic polynomial model.
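In practice, the system of Equation 18 need not be assembled by hand: the design matrix D of Equation 19 can be built and the least squares problem solved directly. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def fit_polynomial_regression(x, y, k):
    """Least squares fit of y = α_0 + α_1 x + ... + α_k x^k; returns (α_0, ..., α_k)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.vander(x, k + 1, increasing=True)       # design matrix of Equation 19
    alpha, *_ = np.linalg.lstsq(D, y, rcond=None)  # numerically stable alternative to Equation 20
    return alpha
```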
4 Multiple linear regression

Let x_1, ..., x_N be N independent variables and y the dependent variable, and assume that we have n observed examples of their values:

Variables       Example 1   Example 2   ...   Example n
x_1             x_11        x_12        ...   x_1n
x_2             x_21        x_22        ...   x_2n
...
x_N             x_N1        x_N2        ...   x_Nn
y (outcomes)    y_1         y_2         ...   y_n
The multiple linear regression model defines the relationship between the N indepen-
dent variables and the dependent variable by an equation of the following form:
y = β0 + β1 x1 + · · · + βN xN (21)
As in simple linear regression, here also we use the ordinary least squares method to
obtain the optimal estimates of β0 , β1 , ..., βN . The method yields the following procedure
for the computation of these optimal estimates. Let
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
X = \begin{pmatrix} 1 & x_{11} & x_{21} & \cdots & x_{N1} \\ 1 & x_{12} & x_{22} & \cdots & x_{N2} \\ \vdots & & & & \vdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{Nn} \end{pmatrix}, \quad
B = \begin{pmatrix} β_0 \\ β_1 \\ \vdots \\ β_N \end{pmatrix}.   (22)

The ordinary least squares estimates of the parameters are then given by

B = (X^T X)^{-1} X^T Y.   (23)
Example. Fit a multiple linear regression model to the data shown in Table 6.
Table 6: Data for the multiple linear regression example.

x_1    1      1     2     0
x_2    1      2     2     1
y      3.25   6.5   3.5   5.0
Solution. In this problem, there are two independent variables and four sets of values of the variables. Thus, in the notation used above, we have N = 2 and n = 4. The multiple linear regression model for this problem has the form

y = β_0 + β_1 x_1 + β_2 x_2.
Figure 4 shows plots of the data and the multiple linear regression model (a plane in
this example).
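The same least squares machinery applies here. A small sketch for the data of Table 6, solving the problem of Equation 23 with a numerically stable solver; it should yield approximately β_0 = 2.0625, β_1 = −2.375 and β_2 = 3.25:

```python
import numpy as np

# Data from Table 6 (columns are the four examples)
x1 = np.array([1.0, 1.0, 2.0, 0.0])
x2 = np.array([1.0, 2.0, 2.0, 1.0])
y = np.array([3.25, 6.5, 3.5, 5.0])

# Design matrix X of Equation 22: a column of ones, then x1 and x2
X = np.column_stack([np.ones_like(y), x1, x2])

# Least squares estimate of B = (β0, β1, β2), equivalent to Equation 23
B, *_ = np.linalg.lstsq(X, y, rcond=None)
print(B)   # approximately [2.0625, -2.375, 3.25]
```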
5 Gradient descent
In the previous sections, we have provided closed-form solutions to the linear regression problems (see Equations 13, 20, and 23). In this section, we show how linear regression can be solved as an optimization problem. We will introduce an optimization algorithm called gradient descent². Gradient descent is an iterative first-order optimization algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning and deep learning to minimize a cost/loss function. It also has wide applications in areas such as control engineering (robotics, chemical processes, etc.), computer games, and mechanical engineering.
² The gradient descent method was proposed before the era of modern computers, and intensive development since then has led to numerous improved versions of it. In this course note, we focus on the basic version/idea of the algorithm.
5.1 Function requirements

Gradient descent works for functions that are:

• differentiable. This requires that the function has a derivative at each point in its domain. See Figure 5 for a few examples of differentiable functions and Figure 6 for a few examples of non-differentiable functions.
• convex. For a univariate function, this means that the line segment connecting two points on the function’s curve lies on or above the curve (i.e., it does not cross it). A line crossing the curve means that the function has a local minimum (which may not be the global one). Mathematically, for two points (x_1, f(x_1)) and (x_2, f(x_2)) on the curve, this condition can be expressed as

f(λ x_1 + (1 − λ) x_2) ≤ λ f(x_1) + (1 − λ) f(x_2),

where λ determines a point’s location on the section line and its value is between 0 (left point) and 1 (right point). For example, λ = 0.5 means a location in the middle. See Figure 7 for two functions with exemplary section lines. Another way to check mathematically whether a univariate function is convex is to calculate the second derivative and check that it is never negative, i.e., d^2 f(x)/dx^2 ≥ 0.
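The convexity condition can also be checked numerically by sampling pairs of points and section-line positions. The following is a rough sketch (the function name and the sampling scheme are illustrative), applied to the quadratic and quartic functions that appear later in this note:

```python
import numpy as np

def is_convex_1d(f, a, b, n_pairs=200, n_lambdas=25, tol=1e-9):
    """Sample pairs (x1, x2) in [a, b] and check the convexity inequality
    f(lam*x1 + (1-lam)*x2) <= lam*f(x1) + (1-lam)*f(x2) for several lam."""
    rng = np.random.default_rng(0)
    x1 = rng.uniform(a, b, n_pairs)
    x2 = rng.uniform(a, b, n_pairs)
    for lam in np.linspace(0.0, 1.0, n_lambdas):
        lhs = f(lam * x1 + (1 - lam) * x2)
        rhs = lam * f(x1) + (1 - lam) * f(x2)
        if np.any(lhs > rhs + tol):
            return False   # inequality violated somewhere: not convex on [a, b]
    return True

print(is_convex_1d(lambda x: x**2 - 4*x + 1, -10.0, 10.0))   # True: a parabola is convex
print(is_convex_1d(lambda x: x**4 - 2*x**3 + 2, -1.0, 2.0))  # False: concave on (0, 1)
```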
A quasi-convex function can also be optimized with the gradient descent algorithm. However, such functions often have so-called saddle points (also called minimax points) where the algorithm can get stuck. An example of a quasi-convex function is
f(x) = x^4 − 2x^3 + 2,

whose first order derivative is

df(x)/dx = 4x^3 − 6x^2 = x^2 (4x − 6),

and its second order derivative is

d^2 f(x)/dx^2 = 12x^2 − 12x = 12x(x − 1).
The value of the second order derivative is zero at x = 0 and x = 1. Such locations are called inflection points, i.e., places where the curvature changes sign; in other words, the function changes from convex to concave or vice versa. By analyzing the second derivative we can conclude that
• for x < 0: function is convex
• for 0 < x < 1: function is concave
• for x > 1: function is convex again
Now we see that the point x = 0 has both first and second derivatives equal to zero, meaning it is a saddle point, while the point x = 1.5 (the other root of the first derivative) is the global minimum. The graph of this function is shown in Figure 8, with the saddle point at x = 0 and the minimum at x = 1.5.
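These conclusions are easy to verify numerically. A small sketch with the function and its derivatives hard-coded:

```python
def f(x):
    return x**4 - 2*x**3 + 2

def df(x):
    return 4*x**3 - 6*x**2    # first derivative

def d2f(x):
    return 12*x**2 - 12*x     # second derivative

for x in (0.0, 1.0, 1.5):
    print(f"x = {x}: f'(x) = {df(x)}, f''(x) = {d2f(x)}")
# x = 0.0: f'(x) = 0.0,  f''(x) = 0.0  -> saddle point
# x = 1.0: f'(x) = -2.0, f''(x) = 0.0  -> inflection point
# x = 1.5: f'(x) = 0.0,  f''(x) = 9.0  -> (global) minimum
```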
5.2 Gradient
Before jumping into the code of the method, one more thing has to be explained: what a gradient is. Intuitively, it is the slope of a curve at a given point in a specified direction. In the case of a univariate function, it is simply the first derivative at a selected point. In the case of a multivariate function, it is a vector of derivatives in each main direction (along the variable axes). Because for each variable we are interested only in the slope along its axis and we do not care about the other variables, these derivatives are called partial derivatives. The gradient of an n-dimensional function f(\vec{x}) at a given point \vec{p} is defined as
∇f(\vec{p}) = \begin{pmatrix} \frac{∂f}{∂x_1}(\vec{p}) \\ \vdots \\ \frac{∂f}{∂x_n}(\vec{p}) \end{pmatrix}.   (24)
The upside-down triangle ∇ is the so-called nabla symbol and is read “del”. To better understand how to calculate it, let’s do a hand calculation for the exemplary 2-dimensional function below (the function is plotted in Figure 9),

f(x, y) = 0.5x^2 + y^2.
Its gradient can be derived as
∇f(x, y) = \begin{pmatrix} \frac{∂f}{∂x}(x, y) \\ \frac{∂f}{∂y}(x, y) \end{pmatrix} = \begin{pmatrix} x \\ 2y \end{pmatrix}.
By looking at these values, we can see that the slope is twice as steep along the y axis.
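As a sanity check, the analytic gradient can be compared against a finite-difference approximation. A minimal sketch (function names are illustrative):

```python
import numpy as np

def f(x, y):
    return 0.5 * x**2 + y**2

def grad_f(x, y):
    return np.array([x, 2 * y])   # the analytic gradient derived above

def numeric_grad(f, x, y, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    return np.array([
        (f(x + h, y) - f(x - h, y)) / (2 * h),
        (f(x, y + h) - f(x, y - h)) / (2 * h),
    ])

print(grad_f(1.0, 1.0))            # [1. 2.]
print(numeric_grad(f, 1.0, 1.0))   # approximately [1. 2.]
```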
There’s an important parameter η that scales the gradient and thus controls the
step size. In machine learning, it is called learning rate and has a strong influence on
performance.
• The smaller the learning rate, the longer gradient descent takes to converge, and it may reach the maximum number of iterations before reaching the optimum point.
• If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.
In summary, the main steps of the gradient descent method are:
1) choose a starting point (initialization).
2) calculate the gradient of the function at the current point.
3) make a scaled step in the opposite direction to the gradient (objective: minimize), as in Equation 25.
4) repeat steps 2) and 3) until a stopping criterion is met, e.g., the maximum number of iterations is reached or the step size falls below a specified tolerance.
The implementation of the gradient descent method requires the following inputs:
1) starting point - the initial value of x.
2) gradient function - a function that computes the gradient of the original function, which has to be specified beforehand.
3) learning rate - the factor η that scales the gradient at each step.
4) maximum number of iterations - to unconditionally stop the algorithm.
5) tolerance - to conditionally stop the algorithm (in this case a default value is 0.01).
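A minimal sketch of such a function for univariate problems, assuming the inputs listed above (the parameter names are illustrative):

```python
def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    """Basic gradient descent: repeatedly step against the gradient until the
    step size falls below tol, or until max_iter iterations have been made."""
    x = start
    steps = [start]                       # history of visited points
    for _ in range(max_iter):
        diff = learn_rate * gradient(x)   # scaled step (Equation 25)
        if abs(diff) < tol:               # conditional stop
            break
        x = x - diff                      # move against the gradient
        steps.append(x)
    return steps, x
```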
5.4 Examples
Example 1: a quadratic function. Let’s take a simple univariate quadratic function,

f(x) = x^2 − 4x + 1,

whose first derivative is df(x)/dx = 2x − 4.
For this function, by taking a learning rate of 0.1 and starting point at x = 9 we can
easily calculate each step by hand using Equation 25. Let’s do it for the first 3 steps:
x0 = 9, f (9) = 46
x1 = 9 − 0.1 × (2 × 9 − 4) = 7.6, f (7.6) = 28.36
x2 = 7.6 − 0.1 × (2 × 7.6 − 4) = 6.48, f (6.48) = 17.07
x3 = 6.48 − 0.1 × (2 × 6.48 − 4) = 5.584, f (5.584) = 9.845
Continuing this process, you will get
...
x21 = 2.065, f (2.065) = −2.996
x22 = 2.052, f (2.052) = −2.997
In the last step, the change in x is 0.007, which is smaller than the specified tolerance
0.01, and thus the process terminated. Please also note that the change in the function
value is 0.001.
The python function is called by:
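(A sketch of such a call, assuming the gradient_descent function outlined above and the settings of this example; the maximum number of iterations is an arbitrary, sufficiently large value.)

```python
steps, x_min = gradient_descent(
    start=9.0,
    gradient=lambda x: 2 * x - 4,   # derivative of f(x) = x^2 - 4x + 1
    learn_rate=0.1,
    max_iter=100,                   # assumed; any sufficiently large value works
    tol=0.01,
)
print(x_min)                        # close to the true minimum at x = 2
```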
In Figure 10, the trajectories, number of iterations, and the final converged result
(within tolerance) for various learning rates are shown.
Figure 10: The gradient descent steps for minimizing the function f(x) = x^2 − 4x + 1 with different learning rates.
Example 2: a function with a saddle point. Now let’s see how the algorithm copes with the quasi-convex function we investigated mathematically before,

f(x) = x^4 − 2x^3 + 2.
Figure 11 shows the results for two learning rates and two different starting points. We can see that the existence of a saddle point poses a real challenge for the gradient descent algorithm, and obtaining the global minimum is not guaranteed. Second-order algorithms (e.g., Newton’s method³) deal with these situations better. The investigation of saddle points, and how to escape from them, is a subject of ongoing research, and various solutions have been proposed. For example, Jin et al. proposed a perturbed gradient descent algorithm⁴.
³ Newton’s method in optimization. Wikipedia.
⁴ Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan. How to Escape Saddle Points Efficiently. PMLR 70:1724–1732, 2017.
Figure 11: The gradient descent steps for minimizing the function f(x) = x^4 − 2x^3 + 2 with two different learning rates and two different starting points.
5.5 Summary
We have learned how the gradient descent algorithm works, when it can be used, and the common challenges when using it. This is a good starting point for exploring more advanced gradient-based optimization methods such as Momentum, Nesterov (Accelerated) Gradient Descent and Adam, or second-order methods such as Newton’s method.