NumPy / SciPy Recipes for Data Science:
Ordinary Least Squares Optimization
Christian Bauckhage
B-IT, University of Bonn, Germany
Fraunhofer IAIS, Sankt Augustin, Germany

Abstract—In this note, we study least squares optimization for parameter estimation. By means of the basic example of a linear regression task, we explore different formulations of the ordinary least squares problem, show how to solve it using NumPy or SciPy, and provide suggestions for practical applications.

I. INTRODUCTION

The method of least squares is one of the most fundamental tools in data science. Since numerous algorithms for data analysis, data mining, pattern recognition, clustering, or classification involve (variants of) least squares optimization, it seems a good idea to look at where the method comes from, what it does, and how it is dealt with in practice.

We therefore provide an introduction to least squares and discuss related NumPy and SciPy functions. By means of the example of a simple linear regression problem, we introduce mathematical concepts and terminology. In particular, we show that the ordinary least squares problem can be formulated as an optimization problem involving matrices and vectors. This is good to know because NumPy and SciPy allow for efficient matrix-vector computations. Accordingly, we discuss how the mathematics can be translated into NumPy or SciPy code.

Note that our discussion assumes that readers are passably familiar with multivariate calculus and linear algebra as well as with the basics of NumPy, SciPy, and Matplotlib [1], [2].

[Fig. 1: two scatter plots over x ∈ [−5, 15] and y ∈ [−5, 20]; panel (a) shows the raw data, panel (b) the data together with the fitted line.]
Fig. 1: Example of linear regression; the two-dimensional data points (x_i, y_i) in this sample were generated using y_i = 1.1 · x_i + 2.0 + ε, where ε denotes Gaussian noise of zero mean and variance 1/2. The linear approximation on the right resulted from ordinary least squares which determined the model parameters as 1.11 and 1.99, respectively.

II. THEORY

In this section, we briefly review the theory behind ordinary least squares optimization. Readers familiar with the topic and its terminology might want to skip this section.

A. Linear Regression

The term linear regression refers to the problem of fitting a linear model to a given set of data. The simple example in Fig. 1 illustrates the general idea. It shows a sample of two-dimensional data points together with a line that approximates the data and thus characterizes its behavior.

In what follows, we will adhere to this example and consider a set {(x_i, y_i)}_{i=1}^{n} of 2D data points where x_i, y_i ∈ R.

Whenever we apply linear regression in data analysis, we tacitly assume a linear relationship between variables and/or measurements (x and y in our case). This will often be an oversimplification, because real data seldom result from linear processes. Nevertheless, linear models may provide valuable first insights into the nature of the data at hand. Yet, even in simple cases such as ours, where the supposed linear relation between variables is justified, the observed data may suffer from noise, thus making it impossible to determine a line that passes through each of the given data points exactly.

Considering the example of noisy 2D data in Fig. 1, the goal is therefore to determine a slope parameter a and an offset parameter b such that the line equation

    y(x) = a x + b    (1)

provides a "good" fit to the data. In other words, we try to estimate a specific linear model that captures the gist of the data and ignores the noise.

B. Ordinary Least Squares

Given our example, the basic idea behind least squares for parameter estimation is easily explained!

Imagine we already had estimates of the parameters a and b of the model in (1). For each observed x_i, we could then compute an estimate ŷ_i = a x_i + b of the corresponding y_i. However, since ŷ_i is but an estimate, the squared difference (y_i − ŷ_i)² between an observed y_i and its estimate ŷ_i will likely exceed zero.
Nevertheless, we are left with an idea for how to find "good" model parameters: in order to determine those values of a and b that would yield the best possible linear model for our data, we may analyze the overall difference between observed values y_i and their estimates ŷ_i. That is, we may consider the sum of squared errors or the so-called residual sum of squares

    E(a, b) = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \bigl( y_i - (a x_i + b) \bigr)^2    (2)

because our intuition tells us that an optimal choice of a and b would minimize this functional. We are therefore dealing with a minimization problem. If we could solve

    \min_{a, b} \sum_i \bigl( y_i - (a x_i + b) \bigr)^2    (3)

for a and b, we might obtain a good linear model.

C. Matrix-Vector Formulation

Note that the objective function in (2), i.e. the function we agreed to minimize, can be written in terms of matrices and vectors. In order to see how, we define a coefficient vector

    w = [a, b]^T \in R^2    (4)

as well as a set of n data vectors

    x_i = [x_i, 1]^T \in R^2.    (5)

Given that these vectors seem somewhat abstract, we note that they allow for writing the estimates ŷ_i in terms of inner products. In other words, we have ŷ_i = a x_i + b = x_i^T w and realize that the error functional in (2) can thus be cast as

    E(w) = \sum_i \bigl( y_i - x_i^T w \bigr)^2.    (6)

If we continue along this path and introduce a whole data matrix

    X = [x_1, \ldots, x_n]^T \in R^{n \times 2}    (7)

along with the so-called target vector

    y = [y_1, \ldots, y_n]^T \in R^n,    (8)

we arrive at

    E(w) = \| Xw - y \|^2    (9)

and recognize that our problem is to minimize the squared Euclidean distance between two vectors Xw and y. For reasons that will become apparent in a later note, we refer to equation (9) as the primal form of the least squares problem.

D. Solving Ordinary Least Squares

Observe that the objective function E(w) in (9) is convex in its argument w and therefore has a unique global minimum. Whenever we encounter a convex optimization problem like this, we can adhere to the mnemonic

    "optimize ⇔ derive and equate to zero"

which provides us with a recipe for how to proceed.

Given this recipe, we will now derive the solution of the primal form of the least squares problem. In fact, we will derive it twice. The first method is elementary but cumbersome since it requires us to compute many partial derivatives. The second method is more abstract as we resort to vector calculus. Also, in order not to lose generality, we abstract from our introductory 2D example and consider linear regression in an m-dimensional space. In this more general setting, the ingredients of (9) become X ∈ R^{n×m}, w ∈ R^m, and y ∈ R^n.

1) Method 1: Our first idea for how to solve (9) for the optimal vector of parameters w = [w_1, \ldots, w_m]^T is to expand the expression as follows

    E(w) = \| Xw - y \|^2 = \sum_{i=1}^{n} \bigl( (Xw - y)_i \bigr)^2 = \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{m} X_{ij} w_j - y_i \Bigr)^2.    (10)

This form of the objective function is easy to differentiate w.r.t. the individual parameters w_k. So, we proceed and equate the resulting expressions to zero. This results in k = 1, \ldots, m equations

    \frac{\partial}{\partial w_k} E(w) = 2 \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{m} X_{ij} w_j - y_i \Bigr) X_{ik} \overset{!}{=} 0.    (11)

Rearranging the terms in each of these expressions yields the m normal equations which are given by

    \sum_{i=1}^{n} X_{ik} \Bigl( \sum_{j=1}^{m} X_{ij} w_j \Bigr) = \sum_{i=1}^{n} X_{ik} y_i
    \;\Leftrightarrow\; \sum_{j=1}^{m} \Bigl( \sum_{i=1}^{n} X_{ik} X_{ij} \Bigr) w_j = \sum_{i=1}^{n} X_{ik} y_i    (12)

We emphasize that (12) represents a system of m different equations; the index k ranges from 1 to m and reflects the fact that we are dealing with m unknown parameters w_k.

In order to continue towards the overall solution, we note two additional identities, namely

    \bigl( X^T X \bigr)_{kj} = \sum_{i=1}^{n} X^T_{ki} X_{ij} = \sum_{i=1}^{n} X_{ik} X_{ij}    (13)

and

    \bigl( X^T y \bigr)_k = \sum_{i=1}^{n} X^T_{ki} y_i = \sum_{i=1}^{n} X_{ik} y_i    (14)

which suggest that we may collect the k individual equations in (12) into a single, more compact matrix-vector equation

    X^T X w = X^T y.    (15)

Interestingly, the matrix X^T X ∈ R^{m×m} is square. Should it be non-degenerate, we can invert it and left-multiplication of (15) by this inverse then yields the famous ordinary least squares solution

    w = \bigl( X^T X \bigr)^{-1} X^T y.    (16)

Having obtained this general result, we can return to our specific linear regression task. Once w = [a, b]^T has been determined according to (16), we can compute y(x) = x^T w for any x = [x, 1]^T and plot the results. Figure 1(b) shows such a plot and how the linear model we determined this way fits our data.
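Before moving on, a minimal numerical sanity check of (16) may be helpful. The following sketch is not part of the original derivation; it uses NumPy functions that are only introduced in Section III, and the three data points are made up purely for illustration.

import numpy as np
import numpy.linalg as la

# three made-up data points that lie roughly on y = 1.1*x + 2.0
x = np.array([0.0, 5.0, 10.0])
y = np.array([2.1, 7.4, 13.0])

# data matrix as defined in (7): a column of x values and a column of ones
X = np.vstack((x, np.ones_like(x))).T

# closed-form ordinary least squares solution according to (16)
w = np.dot(la.inv(np.dot(X.T, X)), np.dot(X.T, y))

# cross-check against NumPy's built-in least squares solver
w_check = la.lstsq(X, y)[0]
print(np.allclose(w, w_check))   # prints: True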
2) Method 2: Above, we expanded (9) in terms of elements of vectors and matrices and computed partial derivatives with respect to individual parameters w_k. Here, we explore a more direct route to the solution. If we expand (9) in terms of vectors and matrices

    E(w) = \| Xw - y \|^2
         = (Xw - y)^T (Xw - y)
         = w^T X^T X w - w^T X^T y - y^T X w + y^T y
         = w^T X^T X w - 2 w^T X^T y + y^T y,    (17)

we recognize E(w) as a quadratic form in w.

In order to determine the minimizer of this quadratic form, we recall two differentiation rules from vector calculus, namely

    \frac{\partial}{\partial x} x^T a = a    (18)

and

    \frac{\partial}{\partial x} x^T A x = 2 A x    (19)

where A is a symmetric square matrix and x and a are arbitrary vectors of corresponding dimensions.¹

Hence, if we compute \frac{\partial}{\partial w} E(w) and equate the result to the vector of all zeros, we arrive at

    \frac{\partial}{\partial w} E(w) = 2 X^T X w - 2 X^T y \overset{!}{=} 0.    (20)

But this is equivalent to the matrix-vector equation we found in (15). Accordingly, the solution is again given by the expression we know from (16). The recipe of deriving and equating to zero therefore also applies to the quadratic form in (17). Of course this stood to be expected but it is always a good exercise to hone our skills in multivariate calculus.

¹ An excellent reference for elementary identities like these is [3]; another truly comprehensive compendium is [4].

III. PRACTICE

In this section, we look at how to use NumPy and Matplotlib in order to create plots like the ones in Fig. 1. Note, however, that we will not elaborate on fancy visualization but focus on the question of how to write NumPy code that implements the matrix-vector mathematics we discussed above.

Readers who would like to experiment with the examples we provide should import the following NumPy and Matplotlib modules

# import numpy
import numpy as np
# import numpy's linear algebra module
import numpy.linalg as la
# import numpy's random module
import numpy.random as rnd
# import matplotlib (pyplot)
import matplotlib.pyplot as plt

First of all, to create a didactic 2D data sample {(x_i, y_i)}_{i=1}^{n} such as shown in Fig. 1, we may randomly sample n numbers x_min ≤ x_i ≤ x_max, compute corresponding values y_i using a linear model, i.e. y_i = a x_i + b, and randomly distort the latter to make the task of model fitting more interesting.

For the practical examples in this note, we accomplish this using the function create_data in Listing 1. Its arguments are the number n of data points that are to be created as well as values for x_min, x_max, a, and b. In order to produce an n-dimensional array (or vector) x of rather arbitrary elements x_i, we resort to the function random which is available in NumPy's random module; in order to add Gaussian noise to the elements y_i of the corresponding vector y, we apply randn from the same module.

Listing 1: data generation for linear regression

def create_data(n, xmin=-2, xmax=12, a=1.1, b=2.0):
    x = rnd.random(n) * (xmax - xmin) + xmin
    y = a * x + b + rnd.randn(n) * 0.5
    return x, y

Data points can then be generated, for instance, by calling

x, y = create_data(25)

Once these NumPy arrays are available, we may plot them against each other using

plt.scatter(x, y)
plt.show()

which will produce a result similar to Fig. 1(a).

Second of all, given the array (or vector) x, we need to compute the data matrix X we defined in (7). As always, there are several ways of how we may accomplish this and Listing 2 presents three corresponding, increasingly compact NumPy snippets.
Listing 2: computing a data matrix for linear regression

def data_matrix_V1(x):
    n = len(x)
    return np.vstack((x, np.ones(n))).T

def data_matrix_V2(x):
    return np.vstack((x, np.ones_like(x))).T

def data_matrix_V3(x):
    return np.vander(x, 2)

In function data_matrix_V1, we use len to determine the number of elements in x, apply ones to create a corresponding vector of all ones, and finally use vstack and transposition to return an n × 2 matrix.

Function data_matrix_V2 applies the rather obscure NumPy method ones_like. Given an array, it creates a new array of all ones of corresponding shape. To produce the final result, we again make use of vstack and transposition.

In function data_matrix_V3, we produce the matrix X simply by calling vander. This is foreshadowing. Why would NumPy provide such a function? And why would it be called "vander"? For now, however, we leave the answers to an upcoming note in which we study non-linear least squares.

Now that we are given data matrix X and target vector y, we are left with determining the optimal coefficient vector w. According to equation (16), it can be computed from a series of matrix products, inversions, and transpositions. In NumPy, we can thus determine w using dot and inv which would lead to an implementation such as realized by the function lsq_solution_V1 in Listing 3.

Listing 3: solving least squares for linear regression

def lsq_solution_V1(X, y):
    w = np.dot(np.dot(la.inv(np.dot(X.T, X)), X.T), y)
    return w

def lsq_solution_V2(X, y):
    w = np.dot(la.pinv(X), y)
    return w

def lsq_solution_V3(X, y):
    w, residual, rank, svalues = la.lstsq(X, y)
    return w

But yet again there are different ways of how to implement a solution. In particular, we note that the matrix on the right hand side of (16) is often written using the shorthand

    X^\dagger = \bigl( X^T X \bigr)^{-1} X^T    (21)

and it is referred to as the generalized inverse, pseudo inverse, or Moore-Penrose inverse of X. Since the pseudo inverse is of general importance, it is no surprise that NumPy provides a function pinv to compute it. Its use in our context is shown in lsq_solution_V2 in Listing 3.

Finally, we note that the regression problem we are dealing with is itself so common and fundamental that NumPy has a dedicated function to solve it. It is called lstsq and Listing 3 exemplifies its use in function lsq_solution_V3. Just from looking at the returned values, it seems that lstsq performs more involved computations than the other methods considered here. This is crucial and we will discuss it below.

All in all, we may therefore solve our regression problem and produce a plot similar to that in Fig. 1(b) as easily as

n = 25; x, y = create_data(n)

X = np.vander(x, 2)
w = la.lstsq(X, y)[0]

yhat = np.dot(X, w)

plt.plot(x, yhat, '-')
plt.plot(x, y, 'o')
plt.show()
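As a quick sanity check that is not part of the original listings, one may verify that the three implementations in Listing 3 agree on the same data; the sketch below assumes that create_data from Listing 1 and the lsq_solution_* functions from Listing 3 have already been defined.

# all three solvers should return (numerically) the same coefficient vector
x, y = create_data(100)
X = np.vander(x, 2)

w1 = lsq_solution_V1(X, y)
w2 = lsq_solution_V2(X, y)
w3 = lsq_solution_V3(X, y)

print(np.allclose(w1, w2), np.allclose(w1, w3))   # prints: True True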

IV. DISCUSSION AND SUGGESTIONS

Having seen that there are at least three different NumPy recipes for computing the ordinary least squares solution, we should ask which one to use?

Theoretically, all our solutions are identical. Yet, in practice, their performance may differ and they may even yield different results.

To illustrate the former, we performed run time experiments in which we determined w given data matrices X ∈ R^{n×2} and target vectors y ∈ R^n which we obtained for increasingly large sample sizes n. All experiments were carried out under Python 2.7 and NumPy 1.8.0 installed on a desktop PC with an Intel Core i5-4440 CPU (3.10 GHz) running Ubuntu 12.04. We applied timeit to measure average CPU times for each of the methods in Listing 3 and obtained the results in Tab. I.

TABLE I: Average run times (in CPU milliseconds) of the three implementations of the least squares solution in Listing 3

          n      V1      V2      V3
        100    0.00    0.01    0.02
      1,000    0.02    0.03    0.02
     10,000    0.05    0.10    0.06
    100,000    1.17    2.16    0.46
  1,000,000    4.72   12.47    5.56

Looking at these figures, it is clear that our practical problem in this note is trivial! Even for a sample of n = 1,000,000 data points, each of our NumPy snippets determined the least squares solution in mere fractions of a second.

Yet, the solution based on pinv appears to be considerably slower than the other two. This is because pinv implicitly applies the NumPy function svd to first compute the singular value decomposition of X which is then used to determine the pseudo inverse.
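To illustrate this point, the following sketch reconstructs the pseudo inverse from the singular value decomposition by hand and compares it to the result of pinv; it is not part of the original note and assumes that X is the well-conditioned n × 2 data matrix from above, so that no singular value is zero.

# rebuild the pseudo inverse from the SVD of X (sketch; assumes full column rank)
U, s, Vt = la.svd(X, full_matrices=False)
X_pinv = np.dot(Vt.T, np.dot(np.diag(1.0 / s), U.T))

print(np.allclose(X_pinv, la.pinv(X)))   # prints: True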
The function lstsq, on the other hand, does not rely on other NumPy functions but directly accesses the underlying LAPACK library for fast numerical computations [5]. For real valued matrices, it calls the function dgelsd which solves the least squares problem by means of a divide-and-conquer singular value decomposition. In addition to the solution w, it also computes the residuals (y_i − ŷ_i)², the rank of matrix X, and its singular values. Often, practitioners do not need all this information but LAPACK intentionally operates like this and, accordingly, returns all these results to NumPy.

While lstsq in lsq_solution_V3 seems to be about as fast as the direct computation in lsq_solution_V1, the previous paragraphs beg the question why to consider pinv or lstsq at all, if they are so involved?

Because of numerical stability! In practice, we are often dealing with high dimensional data and very large matrices X. For these it may be difficult (if not impossible) to invert the matrix X^T X in (16), for instance, because the required computations lead to under- or overflows of NumPy floats. In such cases, it may help to consider decompositions of X since these can facilitate matrix inversion; details as to how they do this will be discussed in a later note.
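As an aside, a middle ground between the explicit inversion in lsq_solution_V1 and the decomposition-based routines is to solve the normal equations (15) directly with la.solve, which avoids forming the inverse explicitly. The helper below is a sketch of ours, not one of the variants benchmarked above, and for seriously ill-conditioned problems lstsq remains the safer choice.

# solve the normal equations (15) without explicitly inverting X^T X
# (sketch; the name lsq_solution_solve is not part of the original listings)
def lsq_solution_solve(X, y):
    return la.solve(np.dot(X.T, X), np.dot(X.T, y))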
Our discussion up to this point therefore leads to the following suggestions:

• direct computations as in lsq_solution_V1 are generally discouraged; should the data dimensionality m exceed the number n of samples or should m and n both be large, this method may be too slow or may suffer too much from numerical imprecision to be useful
• the solution exemplified in lsq_solution_V2 is also discouraged; although pinv is numerically stable, the implementation of this method appears to be too slow to deal with even moderately sized problems
• the solution exemplified in lsq_solution_V3 appears to be the method of choice; lstsq is fast and robust (see the brief sketch after this list).
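For reference, here is a minimal sketch of the recommended call in both its NumPy and its SciPy flavour; it assumes that SciPy is installed and that X and y are defined as above. Like its NumPy counterpart, scipy.linalg.lstsq returns the solution together with residues, rank, and singular values.

import numpy.linalg as la
import scipy.linalg as sla   # assumes SciPy is installed

w_numpy = la.lstsq(X, y)[0]    # NumPy's LAPACK-backed solver
w_scipy = sla.lstsq(X, y)[0]   # SciPy's LAPACK-backed solver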
Still, there is an open issue, namely the question of whether to use the NumPy or the SciPy version of lstsq? Alas, the answer is not straightforward but depends on the installation on the analyst's computer because the way NumPy and SciPy interface with LAPACK is different. While NumPy uses a wrapper written in C, SciPy uses f2py. While SciPy always links to LAPACK, NumPy compiles a stripped down version of LAPACK in case an external library cannot be found. Which of these is faster again depends on how LAPACK has been installed, i.e. on whether or not it was compiled from scratch or obtained as a precompiled package.

V. NOTES AND REFERENCES

Throughout this note, we used the term ordinary least squares to distinguish least squares for simple linear regression from variants such as non-linear least squares, regularized least squares, kernel least squares, weighted least squares, or non-negative least squares. Many of these techniques will be discussed in detail in upcoming notes.

Gauss is credited with inventing least squares analysis in 1795 when he was eighteen [6]. The technique was independently developed by Legendre in 1805 (who was the first to publish it) and by Adrain in 1808.

Even though Gauss did not publish his method until 1809, he publicly demonstrated its merit when he applied it to predict the future location of the newly discovered asteroid Ceres in 1801. In January 1801, Giuseppe Piazzi had discovered Ceres and tracked its path for forty days before it disappeared behind the sun. Based on this data, astronomers tried to predict where Ceres would reemerge from behind the sun without having to solve Kepler's complicated nonlinear equations of planetary motion. The only predictions that allowed for successfully relocating the asteroid were those performed by Gauss using least squares.

There are numerous textbooks dealing with least squares; the following suggestions for further reading primarily address least squares techniques for application in pattern recognition and data mining: Bishop's book [7] provides an accessible overview of the mathematical foundations of machine learning and pattern recognition. In particular, it comes with an extended presentation of probabilistic formulations of the least squares approach. The book of Hastie, Tibshirani, and Friedman [8] is less accessible to the novice but offers a detailed and insightful discussion of data analysis and model fitting. Finally, the mathematically more inclined may also be interested in Gallier's book [9]. It provides a rigorous yet accessible account and, in particular, discusses least squares in connection with the singular value decomposition.

REFERENCES

[1] T. Oliphant, "Python for Scientific Computing," Computing in Science & Engineering, vol. 9, no. 3, 2007.
[2] J. Hunter, "Matplotlib: A 2D Graphics Environment," Computing in Science & Engineering, vol. 9, no. 3, 2007.
[3] K. B. Petersen and M. S. Pedersen, The Matrix Cookbook, Technical University of Denmark, Nov. 2012.
[4] D. Bernstein, Matrix Mathematics: Theory, Facts, and Formulas, 2nd ed. Princeton University Press, 2009.
[5] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide, 3rd ed. SIAM, 1999.
[6] S. Stigler, "Gauss and the Invention of Least Squares," Annals of Statistics, vol. 9, no. 3, 1984.
[7] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 3rd ed. Springer, 2008.
[9] J. Gallier, Geometric Methods and Applications for Computer Science and Engineering. Springer, 2001.
