NumPy / SciPy Recipes for Data Science: Ordinary Least Squares Optimization
Christian Bauckhage
University of Bonn
I. INTRODUCTION

… what it does, and how it is dealt with in practice. We therefore provide an introduction to least squares and discuss related NumPy and SciPy functions. By means of the example of a simple linear regression problem, we introduce mathematical concepts and terminology. In particular, we show that the ordinary least squares problem can be formulated as an optimization problem involving matrices and vectors. This is good to know because NumPy and SciPy allow for efficient matrix-vector computations. Accordingly, we discuss how the mathematics can be translated into NumPy or SciPy code. Note that our discussion assumes that readers are passably familiar with multivariate calculus and linear algebra as well as with the basics of NumPy, SciPy, and Matplotlib [1], [2].

Fig. 1: Example of linear regression; the two-dimensional data points (x_i, y_i) in this sample were generated using

  y_i = 1.1 · x_i + 2.0 + ε

where ε denotes Gaussian noise of zero mean and variance 1/2. The linear approximation on the right resulted from ordinary least squares which determined the model parameters as 1.11 and 1.99, respectively. [Panels: (a) the data; (b) the data together with the OLS model.]
II. THEORY

In this section, we briefly review the theory behind ordinary least squares optimization. Readers familiar with the topic and its terminology might want to skip this section.

A. Linear Regression

The term linear regression refers to the problem of fitting a linear model to a given set of data. The simple example in Fig. 1 illustrates the general idea. It shows a sample of two-dimensional data points together with a line that approximates the data and thus characterizes its behavior. In what follows, we will adhere to this example and consider a set {(x_i, y_i)}_{i=1}^{n} of 2D data points where x_i, y_i ∈ ℝ.

Whenever we apply linear regression in data analysis, we tacitly assume a linear relationship between variables and/or measurements (x and y in our case). This will often be an oversimplification, because real data seldom result from linear processes. Nevertheless, linear models may provide valuable first insights into the nature of the data at hand. Yet, even in simple cases such as ours, where the supposed linear relation between variables is justified, the observed data may suffer from noise, thus making it impossible to determine a line that passes through each of the given data points exactly.

Considering the example of noisy 2D data in Fig. 1, the goal is therefore to determine a slope parameter a and an offset parameter b such that the line equation

  y(x) = a x + b    (1)

provides a "good" fit to the data. In other words, we try to estimate a specific linear model that captures the gist of the data and ignores the noise.

B. Ordinary Least Squares

Given our example, the basic idea behind least squares for parameter estimation is easily explained!

Imagine we already had estimates of the parameters a and b of the model in (1). For each observed x_i, we could then compute an estimate ŷ_i = a x_i + b of the corresponding y_i. However, since ŷ_i is but an estimate, the squared difference (y_i − ŷ_i)^2 between an observed y_i and its estimate ŷ_i will likely exceed zero.
Nevertheless, we are left with an idea for how to find "good" model parameters: in order to determine those values of a and b that would yield the best possible linear model for our data, we may analyze the overall difference between observed values y_i and their estimates ŷ_i. That is, we may consider the sum of squared errors or the so called residual sum of squares

  E(a, b) = \sum_i (y_i − ŷ_i)^2 = \sum_i (y_i − (a x_i + b))^2    (2)

because our intuition tells us that an optimal choice of a and b would minimize this functional. We are therefore dealing with a minimization problem. If we could solve

  \min_{a,b} \sum_i (y_i − (a x_i + b))^2    (3)

for a and b, we might obtain a good linear model.
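To make the objective in (2) concrete, here is a minimal NumPy sketch of ours (not one of this note's listings), assuming x and y are arrays holding the observed data:

import numpy as np

def residual_sum_of_squares(a, b, x, y):
    # E(a, b) = sum_i (y_i - (a*x_i + b))^2, cf. equation (2)
    return np.sum((y - (a * x + b))**2)

Minimizing this function over a and b is exactly problem (3).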
C. Matrix-Vector Formulation

Note that the objective function in (2), i.e. the function we agreed to minimize, can be written in terms of matrices and vectors. In order to see how, we define a coefficient vector

  w = [a, b]^T ∈ ℝ^2    (4)

as well as a set of n data vectors

  x_i = [x_i, 1]^T ∈ ℝ^2.    (5)

Given that these vectors seem somewhat abstract, we note that they allow for writing the estimates ŷ_i in terms of inner products. In other words, we have ŷ_i = a x_i + b = x_i^T w and realize that the error functional in (2) can thus be cast as

  E(w) = \sum_i (y_i − x_i^T w)^2.    (6)

If we continue along this path and introduce a whole data matrix

  X = [x_1, x_2, …, x_n]^T ∈ ℝ^{n×2}    (7)

whose i-th row is x_i^T, along with the so called target vector

  y = [y_1, y_2, …, y_n]^T ∈ ℝ^n,    (8)

we arrive at

  E(w) = ‖Xw − y‖^2    (9)

and recognize that our problem is to minimize the squared Euclidean distance between two vectors Xw and y. For reasons that will become apparent in a later note, we refer to equation (9) as the primal form of the least squares problem.
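As a quick plausibility check (our own sketch, not part of the original listings), the element-wise form (6) and the matrix-vector form (9) can be evaluated side by side; the toy data below merely mimic the example of Fig. 1:

import numpy as np
import numpy.random as rnd

x = rnd.random(25) * 14 - 2                # 25 inputs in [-2, 12)
y = 1.1 * x + 2.0 + rnd.randn(25) * 0.5    # noisy linear responses
X = np.vstack((x, np.ones_like(x))).T      # data matrix as in (7)
w = np.array([1.1, 2.0])                   # some candidate parameters [a, b]

E6 = np.sum((y - X @ w)**2)                # equation (6)
E9 = np.linalg.norm(X @ w - y)**2          # equation (9)
# E6 and E9 agree up to floating point rounding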
D. Solving Ordinary Least Squares

Observe that the objective function E(w) in (9) is convex in its argument w and therefore has a unique global minimum. Whenever we encounter a convex optimization problem like this, we can adhere to the mnemonic

  "optimize ⇔ derive and equate to zero"

which provides us with a recipe for how to proceed.

Given this recipe, we will now derive the solution of the primal form of the least squares problem. In fact, we will derive it twice. The first method is elementary but cumbersome since it requires us to compute many partial derivatives. The second method is more abstract as we resort to vector calculus. Also, in order not to lose generality, we abstract from our introductory 2D example and consider linear regression in an m-dimensional space. In this more general setting, the ingredients of (9) become X ∈ ℝ^{n×m}, w ∈ ℝ^m, and y ∈ ℝ^n.

1) Method 1: Our first idea for how to solve (9) for the optimal vector of parameters w = [w_1 … w_m]^T is to expand the expression as follows

  E(w) = ‖Xw − y‖^2 = \sum_{i=1}^{n} ((Xw − y)_i)^2 = \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} X_{ij} w_j − y_i \Big)^2.    (10)

This form of the objective function is easy to differentiate w.r.t. the individual parameters w_k. So, we proceed and equate the resulting expressions to zero. This results in k = 1, …, m equations

  \frac{∂}{∂w_k} E(w) = 2 \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} X_{ij} w_j − y_i \Big) X_{ik} \overset{!}{=} 0.    (11)

Rearranging the terms in each of these expressions yields the m normal equations which are given by

  \sum_{i=1}^{n} X_{ik} \sum_{j=1}^{m} X_{ij} w_j = \sum_{i=1}^{n} X_{ik} y_i
  ⇔ \sum_{i=1}^{n} \sum_{j=1}^{m} X_{ik} X_{ij} w_j = \sum_{i=1}^{n} X_{ik} y_i
  ⇔ \sum_{j=1}^{m} \sum_{i=1}^{n} X_{ik} X_{ij} w_j = \sum_{i=1}^{n} X_{ik} y_i.    (12)

We emphasize that (12) represents a system of m different equations; the index k ranges from 1 to m and reflects the fact that we are dealing with m unknown parameters w_k.

In order to continue towards the overall solution, we note two additional identities, namely

  (X^T X)_{kj} = \sum_{i=1}^{n} (X^T)_{ki} X_{ij} = \sum_{i=1}^{n} X_{ik} X_{ij}    (13)

and

  \sum_{i=1}^{n} X_{ik} y_i = \sum_{i=1}^{n} (X^T)_{ki} y_i    (14)

which suggest that we may collect the k individual equations in (12) into a single, more compact matrix-vector equation

  X^T X w = X^T y.    (15)

Interestingly, the matrix X^T X ∈ ℝ^{m×m} is square. Should it be non-degenerate, we can invert it and left-multiplication of (15) by this inverse then yields the famous ordinary least squares solution

  w = (X^T X)^{−1} X^T y.    (16)

Having obtained this general result, we can return to our specific linear regression task. Once w = [a b]^T has been determined according to (16), we can compute y(x) = x^T w for any x = [x, 1]^T and plot the results. Figure 1(b) shows such a plot and how the linear model we determined this way fits our data.
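Equations (15) and (16) translate almost literally into NumPy. The following sketch (again ours, reusing the arrays X and y from the snippet above) computes the parameter vector both via the explicit inverse and, numerically preferable, by solving the normal equations directly:

import numpy.linalg as la

w_inv = la.inv(X.T @ X) @ (X.T @ y)    # literal transcription of (16)
w_ols = la.solve(X.T @ X, X.T @ y)     # solve the normal equations (15) instead

Both vectors should come out close to the true parameters [1.1, 2.0] used to generate the toy data.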
2) Method 2: Above, we expanded (9) in terms of elements of vectors and matrices and computed partial derivatives with respect to individual parameters w_k. Here, we explore a more direct route to the solution. If we expand (9) in terms of vectors and matrices

  E(w) = ‖Xw − y‖^2
       = (Xw − y)^T (Xw − y)
       = w^T X^T X w − w^T X^T y − y^T X w + y^T y
       = w^T X^T X w − 2 w^T X^T y + y^T y,    (17)

we recognize E(w) as a quadratic form in w.

In order to determine the minimizer of this quadratic form, we recall two differentiation rules from vector calculus, namely

  \frac{∂}{∂x} x^T a = a    (18)

and

  \frac{∂}{∂x} x^T A x = 2 A x    (19)

where A is a symmetric matrix (as is X^T X) and x and a are arbitrary vectors of corresponding dimensions.¹

Hence, if we compute \frac{∂}{∂w} E(w) and equate the result to the vector of all zeros, we arrive at

  \frac{∂}{∂w} E(w) = 2 X^T X w − 2 X^T y \overset{!}{=} 0.    (20)

But this is equivalent to the matrix-vector equation we found in (15). Accordingly, the solution is again given by the expression we know from (16). The recipe of deriving and equating to zero therefore also applies to the quadratic form in (17). Of course this stood to be expected, but it is always a good exercise to hone our skills in multivariate calculus.

¹ An excellent reference for elementary identities like these is [3]; another truly comprehensive compendium is [4].
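Before turning to the practical part, here is a small numerical sanity check on (20), again a sketch of ours that reuses X, y, and w_ols from the snippets above: at the least squares solution, the gradient should vanish up to rounding error.

grad = 2 * X.T @ (X @ w_ols) - 2 * X.T @ y   # left-hand side of (20)
print(np.allclose(grad, 0))                  # should print True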
III. PRACTICE

In this section, we look at how to use NumPy and Matplotlib in order to create plots like the ones in Fig. 1. Note, however, that we will not elaborate on fancy visualization but focus on the question of how to write NumPy code that implements the matrix-vector mathematics we discussed above.

Readers who would like to experiment with the examples we provide should import the following NumPy and Matplotlib modules:

# import numpy
import numpy as np
# import numpy's linear algebra module
import numpy.linalg as la
# import numpy's random module
import numpy.random as rnd
# import matplotlib (pyplot)
import matplotlib.pyplot as plt
First of all, to create a didactic 2D data sample {(x_i, y_i)}_{i=1}^{n} such as shown in Fig. 1, we may randomly sample n numbers x_min ≤ x_i ≤ x_max, compute corresponding values y_i using a linear model, i.e. y_i = a x_i + b, and randomly distort the latter to make the task of model fitting more interesting.

For the practical examples in this note, we accomplish this using the function create_data in Listing 1. Its arguments are the number n of data points that are to be created as well as values for x_min, x_max, a, and b. In order to produce an n-dimensional array (or vector) x of rather arbitrary elements x_i, we resort to the function random which is available in NumPy's random module; in order to add Gaussian noise to the elements y_i of the corresponding vector y, we apply randn from the same module.

Listing 1: data generation for linear regression

def create_data(n, xmin=-2, xmax=12, a=1.1, b=2.0):
    x = rnd.random(n) * (xmax - xmin) + xmin
    y = a * x + b + rnd.randn(n) * 0.5
    return x, y

Data points can then be generated, for instance, by calling

x, y = create_data(25)

Once these NumPy arrays are available, we may plot them against each other using

plt.scatter(x, y)
plt.show()

which will produce a result similar to Fig. 1(a).
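To also obtain a plot like Fig. 1(b), one can overlay the fitted line on the scatter plot; a possible sketch (ours, not one of the original listings) is:

# estimate [a, b] as in (16) and plot the resulting line over the data
X = np.vstack((x, np.ones_like(x))).T
a, b = la.solve(X.T @ X, X.T @ y)
xs = np.linspace(x.min(), x.max(), 100)
plt.scatter(x, y)
plt.plot(xs, a * xs + b)
plt.show()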
Second of all, given the array (or vector) x, we need to compute the data matrix X we defined in (7). As always, there are several ways to accomplish this; Listing 2 presents three corresponding, increasingly compact NumPy snippets.

In function data_matrix_V1, we use len to determine the number of elements in x, apply ones to create a corresponding vector of all ones, and finally use vstack and transposition to return an n × 2 matrix.
Listing 2: computing a data matrix for linear regression

def data_matrix_V1(x):
    n = len(x)
    return np.vstack((x, np.ones(n))).T

def data_matrix_V2(x):
    return np.vstack((x, np.ones_like(x))).T

def data_matrix_V3(x):
    return np.vander(x, 2)

TABLE I: Average run times (in CPU milliseconds) of the three implementations of the least squares solution in Listing 3

        n      V1      V2      V3
      100    0.00    0.01    0.02
    1,000    0.02    0.03    0.02
   10,000    0.05    0.10    0.06
  100,000    1.17    2.16    0.46
1,000,000    4.72   12.47    5.56
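All three variants in Listing 2 return the same n × 2 matrix (np.vander(x, 2) stacks x next to a column of ones, just like the vstack versions); a quick check of ours:

x, y = create_data(25)
X1, X2, X3 = data_matrix_V1(x), data_matrix_V2(x), data_matrix_V3(x)
print(np.allclose(X1, X2) and np.allclose(X2, X3))   # should print True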