Python Introduction PDF
Kevin Sheppard
University of Oxford
Contents
I Completed 11
1 Introduction 13
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.2 Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 Required Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Testing the Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Python Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5 Basic Math 55
5.1 Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2 Broadcasting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.3 Array and Matrix Addition (+) and Subtraction (-) . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.4 Array Multiplication (*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5 Matrix Multiplication (*) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.6 Array and Matrix Division (/) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.7 Array Exponentiation (**) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.8 Matrix Exponentiation (**) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.9 Parentheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.10 Transpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.11 Operator Precedence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 Basic Functions 61
6.1 Generating Arrays and Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.2 Rounding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.3 Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.4 Complex Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.5 Set Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.6 Sorting and Extreme Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.7 Nan Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7 Special Matrices 71
7.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8 Matrix Functions 73
8.1 Views . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.2 Shape Information and Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3 Linear Algebra Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
13 Loops 105
13.1 for . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
13.2 while . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
13.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
17 Optimization 147
17.1 Unconstrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
17.2 Derivative-free Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
17.3 Constrained Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
17.4 Scalar Function Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
17.5 Nonlinear Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
19 Graphics 161
19.1 2D Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
19.2 Advanced 2D Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
19.3 3D Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
19.4 General Plotting Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
19.5 Exporting Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
II Incomplete 203
23 Parallel 205
23.1 map and related functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
23.2 Multiprocess module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
23.3 Python Parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
26 Examples 211
26.1 Estimating the Parameters of a GARCH Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
26.2 Estimating the Risk Premia using Fama-MacBeth Regressions . . . . . . . . . . . . . . . . . . . 211
26.3 Estimating the Risk Premia using GMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
26.4 Computing Realized Covariance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Part I
Completed
Chapter 1
Introduction
1.1 Background
These notes are designed for someone new to statistical computing wishing to develop a set of skills neces-
sary to perform original research using Python.
Python is a popular language which is well suited to a wide range of problems. Recent progress has
extended Python’s range of applicability to econometrics, statistics and numerical analysis. Python – with
the right set of add-ons – is comparable to MATLAB and R, among other languages. If you are wondering
whether you should bother with Python (or another language), a very incomplete list of considerations
includes:
You might want to consider R if:
1. You want to apply statistics. The statistics library of R is second to none, and R is clearly at the fore-
front in new statistical algorithm development – meaning you are most likely to find that new(ish)
procedure in R.
3. Free is important.
You might want to consider MATLAB if:
1. Commercial support, and a clean channel to report issues, is important.
2. Documentation and organization of modules is more important than raw routine availability.
3. Performance is more important than scope of available packages. MATLAB has optimizations, such
as JIT compiling of loops, which are not available in most (possibly all) other packages.
Having read the reasons to choose another package, you may wonder why you should consider Python.
1. You need a language which can act as an end-to-end solution, so that everything from accessing web-
based services and database servers to data management, processing and statistical computation
can be accomplished in a single language.
3. Free is an important consideration. Python can be freely deployed, even to 100s of servers in a com-
pute cluster.
1.2 Conventions
"""A docstring
"""
2. When a code block contains >>>, the commands are being run in an interactive IPython session.
Output will often appear after the console command, and will not be preceded by a command
indicator.
>>> x = 1.0
>>> x + 2
3.0
If the code block does not contain the console session indicator, the code contained in the block is
designed to be in a standalone Python file.
import numpy as np

x = np.array([1,2,3,4])
y = np.sum(x)
print(x)
print(y)
1.3.1 Python
Python 2.7.2 (or later, but Python 2.7.x) is required. It provides the core Python interpreter.
1.3.2 NumPy
NumPy provides a set of array and matrix data types which are essential for econometrics and data analysis.
1.3.3 SciPy
SciPy contains a large number of routines needed for analysis of data. The most important include a wide
range of random number generators, linear algebra and optimizers. SciPy depends on NumPy.
1.3.4 IPython
IPython provides an interactive Python environment. It is the main environment for entering commands
and getting instant results, and is a very useful tool when learning Python.
1.3.5 Distribute
Distribute provides a variety of tools which make installing other packages easy.
1.3.6 PyQt4
PyQt4 provides a set of libraries used in the Qt console mode of IPython. This component is optional, but
recommended.
1.3.7 matplotlib
matplotlib provides a plotting environment for 2D plots, with limited support for 3D plotting.
1.4 Setup
Setup of the required packages is straightforward. A video demonstration of the setup on Windows 7 and
Fedora 16 is available on the site for these notes.
1.4.1 Windows
Begin by installing Python, NumPy, SciPy, Pyreadline, Distribute, IPython and matplotlib. These are all
standard Windows installers (msi or exe), and the order is not important aside from installing Python first.
You should create a shortcut containing c:\Python27\Scripts\ipython.exe --pylab. The icon will
be generic; if you want a nicer icon, select the properties of the shortcut, then Change Icon, and
navigate to c:\Python27\DLLs and select pyc.ico.
Opening the icon should produce a command window similar to that in figure
The Windows command interpreter (cmd.exe) is very limited compared to other platforms. Fortunately,
cmd.exe can be replaced with Console2. To use Console2, extract the contents of the zip file Console-
2.00b148-Beta_64bit.zip (assumed to be C:\Python27\Console2\). Launch Console.exe, and select Edit
> Settings > Tabs. Click on Add, and input the following:
Title IPython(Pylab)
c:\Python27\Console2\Console.exe -t IPython(Pylab)
IPython comes with its own environment built using the Qt toolkit. To use this version, it is necessary
to install PyQt, PyZMQ and Pygments. Both PyQt and PyZMQ come with installers and so installation is
simple.
Pygments must be manually installed. Begin by extracting Pygments-1.4.tar.gz to c:\Python27\. Open a
command prompt (cmd.exe), and enter the following two commands:
cd c:\Python27\Pygments-1.4
c:\Python27\Python.exe setup.py install
One final command line switch which may be useful is to add =inline to --pylab (so the command
has --pylab=inline). This will produce graphics which appear inside the QtConsole, rather than in their
own window.
1.4.2 Linux
Installing in Linux is very simple. These instructions assume that the base Python (or later) is available
through the preferred distribution. At the time of writing, this was true for both Fedora and Ubuntu. If
available, you should retrieve the following packages from your distribution's maintained repositories:
• python-devel
• python-ipython
• python-scipy
• python-matplotlib
• python-PyQt4
• python-zmq
• python-pygments
• python-tk
Figure 1.4: An example of the IPython QtConsole using the command line switch --pylab=inline, which produces plots
inside the console.
If a component is badly outdated, you should manually install the current version (after uninstalling the
old one using the package manager in your distribution).
IPython, PyZMQ and Pygments IPython, PyZMQ and Pygments can all be installed using easy_install.
Run the following commands in a terminal window, omitting any which have maintained versions for your
distribution of Linux:
sudo easy_install ipython
sudo easy_install pyzmq
sudo easy_install pygments
If you have followed the instructions, these should all complete without issue.
Notes:
• If the install of PyZMQ fails, you may need to install or build zeromq and zeromq-devel (see below).
matplotlib Begin by heading to the matplotlib github repository in your browser. There you will find a link
which says zip. Click on the link and download the file. Extract the contents of the file, and navigate in the
terminal to the directory which contains the extracted files. Build and install matplotlib by running
unzip matplotlib-matplotlib-v.1.1.0.411-glcd07a6.zip
cd matplotlib-matplotlib-v.1.1.0.411-glcd07a6
python setup.py build
sudo python setup.py install
Note: The file name for the matplotlib source will change as it is further developed.
1.4.3 OSX
OS X is similar to Linux. I do not have access to an OS X computer for testing the installation procedure, and
so no instructions are included. Instructions for installing Python (or Python 3) on OS X are readily available
on the internet, and, once available, the remainder of the install should be similar to that of Linux.
To make sure that you have successfully installed the required components, run IPython using the shortcut
previously created on Windows, or by running ipython --pylab or ipython-qtconsole --pylab in a
Linux terminal window. Enter the following commands, one at a time (don't worry about what they mean).
Figure 1.5: A successful test that matplotlib, IPython, NumPy and SciPy were all correctly installed.
>>> x = randn(100,100)
>>> y = mean(x,0)
>>> plot(y)
>>> import scipy as sp
If everything was successfully installed, you should see something similar to figure 1.5.
Python can be programmed using an interactive session, preferably using IPython, or by executing Python
scripts, which are simply text files which normally end with the extension .py.
Most of this introduction focuses on interactive programming, which has some distinct advantages when
learning a language. Interactive Python can be initiated using the Python interpreter directly, by
launching python.exe (Windows) or python (Linux). The standard Python interactive console is very basic,
and does not support useful features such as tab completion. IPython, and especially the QtConsole
version of IPython, transforms the console into a highly productive environment which supports a number
of useful features:
• Tab completion - After entering 1 or more characters, pressing the tab key will bring up a list of
functions, packages and variables which have the same beginning. If the list of matches is
long, a pager is used. Press 'q' to exit the pager.
• "Magic" functions which make tasks such as navigating the local file system (using %cd ~/directory/)
or running other Python programs (using %run program.py) simple. Entering %magic inside an
IPython session will provide a detailed description of the available functions. Alternatively, %lsmagic
provides a succinct list of available magic commands.
• Integrated help - When using the QtConsole, calling a function brings up a view of the top of that
function's help. For example, mean computes the mean of an array of data. When using the QtConsole,
entering mean( will produce a view of the top 15 lines or so of the help available for mean.
• Inline figures - The QtConsole can also display figures inline (when started with the --pylab=inline
switch), which produces a neat environment. In some cases this may be desirable.
• The special variable _ contains the last result in the console. This result can be saved to a new variable
(in this case, named x) using x = _.
Help is available in IPython sessions using help(function). Some functions (and modules) have very long
help files. These can be paged using the command ?function or function?; the text can be scrolled using
page up and down, and q quits the pager. ??function or function?? can be used to display the source of the
function in the interactive console.
The IPython environment can be configured using standard Python scripts located in a configuration direc-
tory. On Windows, the start-up directory is located at C:\users\username\.ipython\profile_default\startup, and
on Linux it is located at ~/.config/ipython/profile_default/startup. In this directory, create a file named startup.py,
containing:
# __future__ imports
# division and print_function
import IPython.core.ipapi
ip = IPython.core.ipapi.get()
ip.ex('ip.compile("from __future__ import division", "<input>", "single") in ip.user_ns')
ip.ex('ip.compile("from __future__ import print_function", "<input>", "single") in ip.user_ns')
# Startup directory
import os
# Replace with actual directory
os.chdir('c:\\dir\\to\\start\\in')
# Linux: os.chdir('/dir/to/start/in/')
This code does two things. First, it imports two "future" features, the print function and division, which are
useful for numerical programming.
• In Python 2.7, print is not a standard function and is used like print 'string to print'. Python 3.x
changes this behavior to a standard function call, print('string to print'). I prefer the latter
since it will make the move to 3.x easier, and is more coherent.
• In Python 2.7, division of integers always produces an integer, and the result is truncated, so 9/5=1. In
Python 3.x, division of integers does not produce an integer if the integers are not even multiples, so
9/5=1.8. Additionally, Python 3.x uses the syntax 9//5 to force integer division with truncation (e.g.
11/5=2.2, while 11//5=2).
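The effect of the two imports can be checked directly. The block below shows the Python 3.x behavior, which Python 2.7 adopts once the __future__ imports are in effect (under Python 3 the imports are harmless no-ops):

```python
from __future__ import division, print_function  # no-ops in Python 3

print(9 / 5)     # true division: 1.8
print(9 // 5)    # floor division: 1
print(11 / 5)    # 2.2
print(11 // 5)   # 2
```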
While interactive programming is useful for learning a language or quickly developing some simple code,
more complex projects require complete programs. Programs can be run either using the
IPython magic command %run program.py or by directly launching the program with the standard
interpreter using python program.py. The advantage of using the IPython environment is that
the variables used in the program can be inspected after the program has finished, while directly calling
python will run the program and then terminate, and so it is necessary to output any important results to
a file so that they can be viewed later.
To test that you can successfully execute a Python program, input the code in the block below into a
text file and save it as firstprogram.py.
# First Python program
from __future__ import print_function
from __future__ import division
import time

# Illustrative body: print a message and pause briefly before exiting
print('Welcome to your first Python program.')
time.sleep(2)
Once you have saved this file, open the console, navigate to the directory where you saved the file and run
python firstprogram.py. If the program does not run on Windows, with an error stating that Python
cannot be found, you need to add the Python root directory to your path. The path can be located in the
Control Panel, under Environment Variables. Finally, run the program in IPython by first launching IPython,
then using %cd to change to the location of the program, and finally executing the program using %run
firstprogram.py.
As you progress in Python, and begin writing more sophisticated programs, you will find that using an In-
tegrated Development Environment (IDE) will increase your productivity. Most contain productivity en-
hancements such as built-in consoles, intellisense (for completing function names) and integrated de-
bugging. Discussion of IDEs is beyond the scope of this text, although I recommend Spyder (free, cross-
platform).
Note: Programs can also be run in the standard Python interpreter using the command:
exec(compile(open('filename.py').read(),'filename.py','exec'))
Chapter 2
Python comes in a number of varieties which may be suitable for econometrics, statistics and numerical
analysis. This chapter explains why, ultimately, 2.7 was chosen for these notes, and highlights some alterna-
tives.
Python 2.7 is the final version of the Python 2.x line; all future development work will focus on Python 3.2.
It may seem strange to learn an "old" language. The reasons for using 2.7 are:
• There are more modules available for Python 2.7. While all of the core Python modules are available
for both Python 2.7 and 3.2, some relevant modules are only available in 2.7, for example, modules
which allow Excel files to be read and written. Over time, many of these modules will become available
for Python 3.2+, but they aren't today.
• The language changes relevant for numerical computing are very small, and these notes explicitly
minimize them so that there should be few changes needed to run against Python 3.2+ in the future
(ideally none).
2.2 Intel Math Kernel Library and AMD Core Math Library
Intel's MKL and AMD's CML provide optimized linear algebra routines. They are much faster than simple
implementations and are, by default, multithreaded, so that a matrix inversion can use all of the processors
on your system. They are used by NumPy, although most precompiled code does not use them. The ex-
ception on Windows is the pre-built NumPy binaries made available by Christoph Gohlke. Directions for
building NumPy on Linux with Intel's MKL are available online. It is strongly recommended that you use a
NumPy built using these highly tuned linear algebra routines if matrix performance is important. Alterna-
tively, EPD (see below) is built with MKL and is available for all Intel platforms.
2.3 Other Variants
2.3.1 Enthought Python Distribution
EPD (Enthought Python Distribution) is a collection of a large number of modules for scientific
computing. It is available for Windows, Linux and OS X. EPD is regularly updated and is available for free
to members of academic institutions. EPD is also built using MKL, and so matrix performance on Intel
processors is very fast.
2.3.2 IronPython
IronPython is a variant which runs on the CLR (Windows .NET). The core modules, NumPy and SciPy,
are available for IronPython, and so it is a viable alternative for numerical computing, especially if you are
already familiar with C# or another .NET language. Other libraries, for example matplotlib (plotting), are
not available, so there are some important caveats.
2.3.3 PyPy
PyPy is a new implementation of Python which uses just-in-time compilation to accelerate code, especially
loops (which are common in numerical computing). It may be anywhere between 2 and 5 times faster than
standard Python. Unfortunately, at the time of writing, the core library, NumPy, is only partially imple-
mented, and so it is not ready for use. Current plans are to have a version ready by the end of 2012, and if
so, PyPy may quickly become the preferred version of Python for numerical computing.
Most significant differences between Python 2.7 and 3.2 are not important for using Python in econometrics,
statistics and numerical analysis. I will make three common assumptions which allow 2.7 and 3.2 to be
used interchangeably. These differences are important in stand-alone Python programs; the configuration
instructions for IPython will produce similar behavior when run interactively.
2.A.1 print
print is a function used to display text in the console when running programs. In Python 2.7, print is a
keyword which behaves differently from other functions. In Python 3.2, print behaves like most functions.
The standard use in Python 2.7 is
print 'String to Print'
which resembles calling a function. Python 2.7 contains a version of the Python 3.2 print, which can be
used in any program by including
from __future__ import print_function
at the top of the file. I prefer the 3.2 version of print, and so I assume that all programs will include this
statement.
2.A.2 division
Python 3.2 changes the way integers are divided. In Python 2.7, the ratio of two integers was always an
integer, with any fractional part discarded; for example, in Python 2.7, 9/5 is 1. Python
3.2 gracefully converts the result to a floating point number, and so in Python 3.2, 9/5 is 1.8. When working
with numerical data, automatically converting ratios avoids some rare errors. Python 2.7 can use the 3.2
behavior by including
from __future__ import division
at the top of the program. I assume that all programs will include this statement.
2.A.3 range
It is often useful to generate a sequence of numbers for use when iterating over some data. In Python
2.7, the best practice is to use the keyword xrange to do this, while in Python 3.2, this keyword has been
renamed range. Fortunately, Python 2.7 contains a function range which is less efficient but compatible with
the range function in Python 3.2, and so I will always use range, even where best practice indicates that
xrange should be used. No changes are needed to use range in both Python 2.7 and 3.2.
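As a quick check that code written with range runs identically under both versions, consider this small example, which sums the integers 0 through 9:

```python
# Sum the integers 0 through 9; works unchanged in Python 2.7 and 3.x
# (in 2.7, xrange(10) would avoid building the intermediate list)
total = 0
for i in range(10):
    total += i
print(total)  # 45
```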
Chapter 3
Before diving into Python for anything from analyzing data to running Monte Carlos, it is necessary to understand some
basic concepts about the available data types in Python and NumPy. This description is
necessary since Python is a general purpose programming language which is also well suited to data anal-
ysis, econometrics and statistics. This differs from environments such as MATLAB and R, which are statisti-
cal/numerical packages first, and general purpose programming languages second. For example, the basic
numeric type in MATLAB is an array (using double precision, which is useful for floating point mathemat-
ics), while the basic numeric data type in Python is a scalar which may be either an integer or a
double-precision floating point, depending on how the number is formatted when entered.
Variable names can take many forms, although they can only contain numbers, letters (both upper and
lower), and underscores (_). They must begin with a letter or an underscore and are CaSe SeNsItIve. Addi-
tionally, some words are reserved in Python and so cannot be used for variable names (e.g. import or for).
For example,
x = 1.0
X = 1.0
X1 = 1.0
x1 = 1.0
dell = 1.0
dellreturns = 1.0
dellReturns = 1.0
_x = 1.0
x_ = 1.0
are all legal and distinct variable names. Note that names which begin or end with an underscore convey
special meaning in Python, and so should be avoided in general. Names which do not follow these rules are
illegal, for example:
# Not allowed
x: = 1.0
1X = 1
X-1 = 1
for = 1
3.2.1 Numeric
Simple numbers in Python can be either integers, floats or complex. Integers correspond to either 32-bit or
64-bit integers, depending on whether the Python interpreter was compiled as 32-bit or 64-bit, and floats are
always 64-bit (corresponding to doubles in C/C++). Long integers, on the other hand, do not have a fixed
size and so can accommodate numbers which are larger than the maximum the basic integer type can handle.
Note: This chapter does not cover all Python data types, only those which are most relevant for numerical
analysis, econometrics and statistics. The following built-in data types are not described: bytes, bytearray
and memoryview.
The most important (scalar) data type for numerical analysis is the float. Unfortunately, not all non-complex
numeric values are floats by default: to enter a floating point value, it is necessary to include a . (period, dot) in the
expression. This example uses the function type() to determine the data type of a variable.
>>> x = 1
>>> type(x)
int
>>> x = 1.0
>>> type(x)
float
>>> x = float(1)
>>> type(x)
float
This example shows that the expression x = 1 produces an integer while x = 1.0 produces a
float. Using integers can produce unexpected results and so it is important to ensure values entered manu-
ally are floats (e.g. include ".0" when needed).
Complex numbers are also important for numerical analysis. Complex numbers are created in Python using
j or the function complex().
>>> x = 1.0
>>> type(x)
float
Note: Programs which contain from __future__ import division will automatically convert integers to floats when dividing.
>>> x = 1j
>>> type(x)
complex
>>> x = 2 + 3j
>>> x
(2+3j)
>>> x = complex(1)
>>> x
(1+0j)
Note that a+bj is the same as complex(a,b), while complex(a) is the same as a+0j.
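These identities can be verified directly:

```python
a, b = 2.0, 3.0

# a + b*1j constructs the same value as complex(a, b)
print(a + b * 1j == complex(a, b))  # True

# complex(a) fills in a zero imaginary part
print(complex(a) == a + 0j)         # True
```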
Floats use an approximation to represent numbers which may contain a decimal portion. The integer data
type stores numbers using an exact representation, so that no approximation is needed. The cost of the
exact representation is that the integer data type cannot (naturally) express anything that isn’t an integer.
This renders integers of limited use in most numerical analysis work.
Basic integers can be entered either by excluding the decimal (see float above), or explicitly using the int()
function. The int() function can also be used to convert a floating point number to an integer by discarding
the fractional part, producing the nearest integer towards zero.
>>> x = 1
>>> type(x)
int
>>> x = 1.0
>>> type(x)
float
>>> x = int(x)
>>> type(x)
int
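A brief check of the conversion behavior; note that negative values are also truncated towards zero:

```python
# int() drops the fractional part, truncating towards zero
print(int(3.7))    # 3
print(int(-3.7))   # -3
print(int(-0.5))   # 0
```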
Integers can range from −2^31 to 2^31 − 1. Python contains another type of integer, known as a long
integer, which has essentially no range. Long integers are entered using the syntax x = 1L or by calling
long(). Additionally, Python will automatically convert integers outside of the standard integer range to
long integers.
>>> x = 1
>>> x
1
>>> type(x)
int
>>> x = 1L
>>> x
1L
>>> type(x)
long
>>> x = long(2)
>>> type(x)
long
>>> x = 2 ** 64
>>> x
18446744073709551616L
The trailing L after the number indicates that it is a long integer, rather than a standard integer.
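For reference, Python 3 merges the long type into int, so a single integer type has unlimited range and no trailing L is displayed:

```python
# Python 3: a single int type with arbitrary precision
x = 2 ** 64
print(x)                 # 18446744073709551616
print(type(x).__name__)  # int
```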
The Boolean data type is used to represent true and false, using the reserved keywords True and False.
Booleans are important for program flow control (see Chapter 12) and are typically created as a result of
logical operations (see Chapter 11), although they can be entered directly.
>>> x = True
>>> type(x)
bool
>>> x = bool(1)
>>> x
True
>>> x = bool(0)
>>> x
False
Non-zero values, in general, evaluate to true when evaluated by bool(), although bool(0), bool(0.0), and
bool(None) are all false.
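The same rule extends beyond the numeric cases listed above: empty strings and empty containers are also false, a standard Python convention not covered in the text but worth knowing. A brief check:

```python
# Non-zero numbers and non-empty values are truthy
print(bool(1), bool(-3.5), bool('text'), bool([0]))

# Zero, None, and empty containers are falsy
print(bool(0), bool(0.0), bool(None), bool(''), bool([]))
```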
Strings are not usually important for numerical analysis, although they are frequently encountered when
dealing with data files, especially when importing, or when formatting output for human readability (e.g.
nice, readable tables of results). Strings are delimited using '' or "". While either single or double quotes
are valid for declaring strings, they cannot be mixed in a single string (e.g. do not try '"), except when one
is used to quote the other.
>>> x = ’abc’
>>> type(x)
str
Substrings within a string can be accessed using slicing. Slicing uses [] to contain the indices of the characters
in a string, where the first index is 0, and the last (assuming the string has n letters) is n − 1. The most
useful slices are str[i], which returns the character in position i, str[:i], which returns the characters at the
beginning of the string from positions 0 to i − 1, and str[i:], which returns the characters at the end of the
string from positions i to n − 1. The table below lists the available types of slices; note that slicing can also
use negative indices, which essentially index the string backward.
str[i] - the character in position i (str[-i] returns the character in position n − i)
str[:i] - the characters in positions 0 to i − 1
str[i:] - the characters in positions i to n − 1
str[i:j] - the characters in positions i to j − 1
str[i:j:m] - every m-th character in positions i to j − 1
>>> text = 'Python strings are sliceable.'
>>> text[10]
'i'
>>> L = len(text)
>>> text[L] # Error
IndexError: string index out of range
>>> text[:10]
’Python str’
>>> text[10:]
’ings are sliceable.’
Lists are a built-in data type which requires the other data types to be useful. A list is essentially a collection
of other objects – floats, integers, complex numbers, strings or even other lists. Lists are essential to Python
programming since they are used to store collections of other values. For example, a list of floats can be used
to express a vector (although the NumPy data types array and matrix are better suited). Lists also support
slicing to retrieve one or more elements. Basic lists are constructed using square braces, [], and values are
separated using commas, ,.
>>> x = []
>>> type(x)
builtins.list
>>> x=[1,2,3,4]
>>> x
[1,2,3,4]
These examples show that lists can be regular, nested and can contain any mix of data types. x = [[1,2,3,4],
[5,6,7,8]] is a 2-dimensional list, where the main elements of x are lists, and the elements of these lists
are integers.
Lists, like strings, can be sliced. Slicing is similar, although lists can be sliced in more ways than strings.
The difference arises since lists can be multi-dimensional while strings are always 1 × n. Basic list slicing
is identical to strings, and operations such as x[:], x[1:], x[:1] and x[-3:] can all be used. To understand
slicing, assume x is a 1-dimensional list with n elements and i ≥ 0, j > 0, i < j. Python uses 0-based
indices, and so the n elements of x can be thought of as x_0, x_1, . . . , x_(n-1).
Examples of accessing elements of 1-dimensional lists are presented in the code block below.
>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> x[0]
0
>>> x[5]
5
>>> x[10] # Error
IndexError: list index out of range
>>> x[4:]
[4, 5, 6, 7, 8, 9]
>>> x[:4]
[0, 1, 2, 3]
>>> x[1:4]
[1, 2, 3]
>>> x[-0]
0
>>> x[-1]
9
>>> x[-10:-1]
[0, 1, 2, 3, 4, 5, 6, 7, 8]
Lists can be multidimensional, and slicing can be done directly in higher dimensions. For simplicity, consider slicing a 2D list x = [[1,2,3,4], [5,6,7,8]]. If single indexing is used, x[0] will return the first (inner) list, and x[1] will return the second (inner) list. Since the value returned by x[0] is sliceable, the inner list can be directly sliced using x[0][0] or x[0][1:4].
>>> x = [[1,2,3,4], [5,6,7,8]]
>>> x[0]
[1, 2, 3, 4]
>>> x[1]
[5, 6, 7, 8]
>>> x[0][0]
1
>>> x[0][1:4]
[2, 3, 4]
>>> x[1][-4:-1]
[5, 6, 7]
A number of functions and methods are available for manipulating lists. The most useful are

Function/Method        Description
list.append(x)         Appends x to the end of the list
len(list)              Returns the number of elements in the list
list.extend(list2)     Appends the elements of list2 to the end of the list
list.pop(i)            Removes and returns the element in position i
list.remove(x)         Removes the first occurrence of x from the list
list.count(x)          Counts the number of occurrences of x in the list
>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> x.append(0)
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0]
>>> len(x)
11
>>> x.extend([11,12,13])
>>> x
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]
>>> x.pop(1)
1
>>> x
[0, 2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]
>>> x.remove(0)
>>> x
[2, 3, 4, 5, 6, 7, 8, 9, 0, 11, 12, 13]
3.2.4.3 del
Elements can also be deleted from lists using the keyword del in combination with a slice.
>>> x = [0,1,2,3,4,5,6,7,8,9]
>>> del x[0]
>>> x
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x[:3]
[1, 2, 3]
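del also accepts a slice, which removes several elements at once. A short sketch:

```python
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
del x[1:4]      # removes the elements in positions 1, 2 and 3
# x is now [0, 4, 5, 6, 7, 8, 9]
del x[-2:]      # removes the final two elements
# x is now [0, 4, 5, 6, 7]
```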
A tuple is in many ways like a list. A tuple contains multiple pieces of data which may be comprised of a variety of data types. Aside from using a different syntax to construct a tuple, they are close enough to lists to ignore the difference except that tuples are immutable. Immutability means that the elements of a tuple cannot change, and so once a tuple is constructed, it is not possible to change an element without constructing a new tuple.
Tuples are constructed using parentheses (()), rather than the square braces ([]) of lists. Tuples can be sliced in an identical manner to lists. A list can be converted into a tuple using tuple() (similarly, a tuple can be converted to a list using list()).
>>> x =(0,1,2,3,4,5,6,7,8,9)
>>> type(x)
tuple
>>> x[0]
0
>>> x[-10:-5]
(0, 1, 2, 3, 4)
>>> x = list(x)
>>> type(x)
list
>>> x = tuple(x)
>>> type(x)
tuple
Note that tuples must have a comma when created with a single element, so that x = (2,) assigns a tuple to x, while x = (2) will assign 2 to x. The latter interprets the parentheses as if they are part of a mathematical formula, rather than being used to construct a tuple. x = tuple([2]) can also be used to create a single-element tuple. Lists do not have this issue since square brackets are reserved.
>>> x =(2)
>>> type(x)
int
>>> x = (2,)
>>> type(x)
tuple
>>> x = tuple([2])
>>> type(x)
tuple
Tuples are immutable, and so only have the methods index and count, which behave in an identical manner to their list counterparts.
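A brief illustration of the two tuple methods:

```python
x = (1, 2, 2, 3, 2)
n_twos = x.count(2)    # number of times 2 appears in the tuple
pos = x.index(3)       # position of the first occurrence of 3
```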
An xrange is a useful data type which is most commonly encountered when using a for loop. Ranges are essentially lists of numbers. xrange(a,b,i) creates the sequence which follows the pattern a, a + i, a + 2i, . . . , a + (m − 1)i where m = ⌈(b − a)/i⌉. In other words, it finds all integers x starting with a such that a ≤ x < b and where two consecutive values are separated by i. xrange can also be called with one or two parameters. xrange(a,b) is the same as xrange(a,b,1) and xrange(b) is the same as xrange(0,b,1).
>>> x = xrange(10)
>>> type(x)
xrange
>>> print(x)
xrange(0, 10)
>>> list(x)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> x = xrange(3,10)
>>> list(x)
[3, 4, 5, 6, 7, 8, 9]
>>> x = xrange(3,10,3)
>>> list(x)
[3, 6, 9]
>>> y = range(10)
>>> type(y)
list
>>> y
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
xrange is not technically a list, which is why the statement print(x) returns xrange(0, 10). Explicitly converting an xrange to a list using list() produces a list which allows the values to be printed. Technically xrange is an iterable which does not actually require the storage space of a list. This is a performance optimization, and is not usually important in numerical applications.
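In Python 3, the built-in range plays the role that xrange plays here, with the same lazy behavior. A sketch (written for Python 3):

```python
r = range(1000000)               # no million-element list is ever stored
total = sum(r)                   # values are generated one at a time as needed
values = list(range(3, 10, 3))   # explicit conversion produces the full list
```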
Dictionaries are encountered far less frequently than any of the previously described data types in numerical Python. They are, however, commonly used to pass options into other functions such as optimizers, and so familiarity with dictionaries is essential. Dictionaries in Python are similar to a traditional dictionary in that they are composed of keys (words) and values (definitions). In Python dictionaries, keys must be unique and immutable (typically strings), and values can contain any valid Python data type. Unlike values in lists, which are accessed by their position, values in dictionaries are accessed using keys.
>>> data = {’key1’: 1234, ’key2’ : array([1,2])}
>>> type(data)
builtins.dict
>>> data[’key1’]
1234
Values associated with an existing key can be updated by making an assignment to the key in the dictionary.
>>> data[’key1’] = ’xyz’
>>> data[’key1’]
’xyz’
New key-value pairs can be added by defining a new key and assigning a value to it.
>>> data[’key3’] = ’abc’
>>> data
{’key1’: 1234, ’key2’: array([1, 2]), ’key3’: ’abc’}
Key-value pairs can be deleted using the reserved keyword del.
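For example, continuing with a dictionary like the one above:

```python
data = {'key1': 1234, 'key2': 'xyz', 'key3': 'abc'}
del data['key1']              # removes the key and its associated value
still_there = 'key1' in data  # False once the pair has been deleted
```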
Sets are collections which contain only the unique elements of a collection. set and frozenset only differ in that the latter is immutable (and so has higher performance). While sets are generally not important in
numerical analysis, they can be very useful when working with messy data – for example, finding the set of
unique tickers in a long list of tickers.
A number of methods are available for manipulating sets. The most useful are

Method                        Description
set.add(x)                    Adds x to the set if it is not already present
set.difference(set2)          Returns the elements in the set which are not in set2
set.difference_update(set2)   Removes the elements of set2 from the set
set.intersection(set2)        Returns the elements which appear in both sets
set.intersection_update(set2) Keeps only the elements which appear in both sets
set.union(set2)               Returns the elements which appear in either set
set.remove(x)                 Removes x from the set
>>> x = set([’MSFT’,’GOOG’,’AAPL’,’HPQ’])
>>> x
set([’GOOG’, ’AAPL’, ’HPQ’, ’MSFT’])
>>> x.add(’CSCO’)
>>> x
set([’GOOG’, ’AAPL’, ’CSCO’, ’HPQ’, ’MSFT’])
>>> y = set([’XOM’,’GOOG’])
>>> x = x.union(y)
>>> x
set([’GOOG’, ’AAPL’, ’XOM’, ’CSCO’, ’HPQ’, ’MSFT’])
>>> x.remove(’XOM’)
>>> x
set([’GOOG’, ’AAPL’, ’CSCO’, ’HPQ’, ’MSFT’])
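The ticker example can be sketched directly: constructing a set from a list with duplicates keeps only the unique elements.

```python
tickers = ['MSFT', 'GOOG', 'AAPL', 'MSFT', 'HPQ', 'GOOG', 'AAPL']
unique_tickers = set(tickers)   # duplicates are dropped automatically
n_unique = len(unique_tickers)  # 4 distinct tickers remain
```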
3.3 Python and Memory Management
Python uses a highly optimized memory allocation system which attempts to avoid allocating unnecessary
memory. As a result, when one variable is assigned to another (e.g. to y = x), these will actually point to the
same data in the computer’s memory. To verify this, id() can be used to determine the unique identification
number of a piece of data.2
>>> x = 1
>>> y = x
>>> id(x)
82970264L
>>> id(y)
82970264L
>>> x = 2.0
>>> id(x)
82970144L
>>> id(y)
82970264L
In the above example, the initial assignment of y = x produced two variables with the same ID. However,
once x was changed, its ID changed while the ID of y did not, indicating that the data in each variable was
stored in different locations. This behavior is very safe yet very efficient, and is common to the basic Python
types: int, long, float, complex, string, xrange and tuple.
Lists are mutable and so assignment does not create a copy – changes to either variable affect both.
>>> x = [1, 2, 3]
>>> y = x
>>> y[0] = -10
>>> y
[-10, 2, 3]
>>> x
[-10, 2, 3]
Slicing a list creates a copy of the list and any immutable types in the list – but not mutable elements in the
list.
>>> x = [1, 2, 3]
>>> y = x[:]
>>> id(x)
86245960L
>>> id(y)
86240776L
² The ID numbers on your system will likely differ from those in the code listing.
For example, consider slicing a list of lists.
>>> x=[[0,1],[2,3]]
>>> y = x[:]
>>> y
[[0, 1], [2, 3]]
>>> id(x[0])
117011656L
>>> id(y[0])
117011656L
>>> y[0][0] = -10.0
>>> x[0][0]
-10.0
>>> x
[[-10.0, 1], [2, 3]]
When lists are nested or contain other mutable objects (which do not copy), slicing copies the outermost list
to a new ID, but the inner lists (or other objects) are still linked. In order to copy nested lists, it is necessary
to explicitly call deepcopy(), which is in the module copy.
>>> import copy as cp
>>> x=[[0,1],[2,3]]
>>> y = cp.deepcopy(x)
>>> y[0][0] = -10.0
>>> y
[[-10.0, 1], [2, 3]]
>>> x
[[0, 1], [2, 3]]
3.4 Exercises
Chapter 4
NumPy provides the most important data types for econometrics, statistics and numerical analysis. The
two data types provided by NumPy are the arrays and matrices. Arrays and matrices are closely related,
and matrices are essentially a special case of arrays – 2 (and only 2)-dimensional arrays. The differences
between arrays and matrices can be summarized as:
• Arrays can have 1, 2, 3 or more dimensions. Matrices always have 2 dimensions. This means that a 1 by n vector stored as an array has 1 dimension and n elements, while the same vector stored as a matrix has 2 dimensions where the sizes of the dimensions are 1 and n (in either order).
• Standard mathematical operators on arrays operate element-by-element. This is not the case for ma-
trices, where multiplication (*) follows the rules of linear algebra. 2-dimensional arrays can be multi-
plied using the rules of linear algebra using dot(). Similarly, the function multiply() can be used on
two matrices for element-by-element multiplication.
• Arrays are more common than matrices, and so all functions work and are tested with arrays (they
should also work with matrices, but an occasional strange result may be encountered).
• Arrays can be quickly treated as a matrix using either asmatrix() or mat() without copying the un-
derlying data.
4.1 Array
Arrays are the base data type in NumPy, and are the most important data type for numerical analysis in Python. In many ways, arrays are similar to lists in that they can be used to hold collections of elements.
The focus of this section is on arrays which only hold 1 type of data – whether it is float or int – and so all
elements must have the same type (See Chapter 22). Additionally, arrays are always rectangular – in other
words, if the first row has 10 elements, all other rows must have 10 elements.
Arrays are initialized using lists (or tuples), and calling array(). 2-dimensional arrays are initialized
using lists of lists (or tuples of tuples, or lists of tuples, etc.), and higher dimensional arrays can be initialized
by further nesting lists or tuples.
>>> x = [0.0, 1, 2, 3, 4]
>>> y = array(x)
>>> y
array([ 0., 1., 2., 3., 4.])
>>> type(y)
numpy.ndarray
>>> y = array([[0.0, 1, 2, 3, 4],[5, 6, 7, 8, 9]])
>>> y
array([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.]])
>>> shape(y)
(2L, 5L)
>>> y = array([[[1,2],[3,4]],[[5,6],[7,8]]])
>>> y
array([[[1, 2],
[3, 4]],
[[5, 6],
[7, 8]]])
>>> shape(y)
(2L, 2L, 2L)
Arrays can contain a variety of data types. The most useful is ’float64’, which corresponds to the Python built-in data type float (and the C/C++ double). By default, calls to array() will preserve the type of the input, if possible. If an input contains all integers, it will have a dtype of ’int32’ (the built-in data type int). If an input contains floats, or a mix of integers and floats, the array’s dtype will be float64. If it contains a mix of integers, floats and complex types, the array will be complex.
>>> x = [0, 1, 2, 3, 4] # Integers
>>> y = array(x)
>>> y.dtype
dtype(’int32’)
>>> x = [0.0, 1, 2, 3, 4] # 0.0 makes the input a mix of integers and floats
>>> y = array(x)
>>> y.dtype
dtype(’float64’)
>>> x = [0.0 + 1j, 1, 2, 3, 4] # Complex
>>> y = array(x)
>>> y.dtype
dtype(’complex128’)
NumPy attempts to find the smallest data type which can represent the data when constructing an array. It is possible to force NumPy to use a particular dtype by passing the keyword argument dtype=datatype to array().
Important: If an array has an integer dtype, trying to place a float into the array results in the float being
truncated and stored as an integer. This is dangerous, and so in most cases, arrays should be initialized to
contain floats unless a conscious decision is taken to have them contain a different data type.
4.2 Matrix
Matrices are essentially a subset of arrays, and behave in a virtually identical manner. The two important differences are that matrices always have 2 dimensions, and that matrix multiplication (*) follows the rules of linear algebra. 1- and 2-dimensional arrays can be copied to a matrix by calling matrix() on an array. Alternatively, calling mat() or asmatrix() provides a faster method where an array can behave like a matrix (without being explicitly converted).
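A minimal sketch of the conversion (assuming NumPy has been imported as np):

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
m = np.asmatrix(x)     # a matrix view of the same underlying data
p = m * m              # * is now matrix multiplication, not element-by-element
m[0, 0] = -1.0         # changing the view also changes the original array
```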
4.3 Arrays, Matrices and Memory Management
Arrays and matrices do not behave like lists – slicing an array does not create a copy. In general, when
an array, matrix or list is sliced, the slice will refer to the same memory as original variable – this means
changing an element in the slice also changes an element in the original variable.
>>> x = array([0.0, 1.0, 2.0])
>>> y = x
>>> x
array([ 0., 1., 2.])
>>> y
array([ 0., 1., 2.])
>>> id(x)
130165568L
>>> id(y)
130165568L
>>> y[0] = -1.0
>>> x
array([-1., 1., 2.])
y = x sets x and y to the same data, and so changing one changes the other. Next, consider what happens
when y is a slice of x.
>>> x = array([[0.0, 1.0],[2.0,3.0]])
>>> y = x[0]
>>> y
array([ 0., 1.])
>>> y[0] = -10.0
>>> x
array([[-10., 1.],
[ 2., 3.]])
In order to get a new variable when slicing or assigning an array or a matrix, it is necessary to explicitly copy the data. Arrays or matrices can be copied by calling copy. Alternatively, they can also be copied by calling array() on arrays, or matrix() on matrices.
>>> x = array([[0.0, 1.0],[2.0,3.0]])
>>> y = copy(x)
>>> id(x)
130166048L
>>> id(y)
130165952L
>>> y[0,0] = -10.0
>>> x # No change in x
array([[ 0., 1.],
[ 2., 3.]])
>>> z = x.copy()
>>> id(z)
130166432L
>>> w = array(x)
>>> id(w)
130166144L
w, x, y and z all have unique IDs and are distinct. Changes to one will not affect any of the others.
Finally, assignments from functions which change the value automatically create a copy.
>>> id(y)
130166816L
>>> y = x + 1.0
>>> y
array([[ 1., 2.],
[ 3., 4.]])
>>> id(y)
130167008L
>>> y = exp(x)
>>> y
array([[ 1. , 2.71828183],
[ 7.3890561 , 20.08553692]])
>>> id(y)
130166912L
Even trivial functions such as y = x + 0.0 create a copy of x, and so the only cases where explicit copying is required are when y is directly assigned a slice of x and y is changed, but x should not be.
4.4 Entering Data
Almost all of the data used in econometrics are matrices by construction, even if they are 1 by 1 (scalar), K by 1 or 1 by K (vectors). Vectors, both row (1 by K) and column (K by 1), can be entered directly into the command window. The mathematical notation
x = [1 2 3 4 5]
is entered as
>>> x=array([1.0,2.0,3.0,4.0,5.0])
>>> x
array([ 1., 2., 3., 4., 5.])
1-dimensional arrays do not have row or column forms, but matrices do. The column vector

x = [1 2 3 4 5]′

is entered as a 2-dimensional matrix, e.g. matrix([[1.0],[2.0],[3.0],[4.0],[5.0]]). Converting a column matrix to an array eliminates any notion of row and column. To enter a matrix (2-dimensional array), enter the matrix one row at a time, each in a list, and then surround the row lists with another list.
>>> x = array([[1.0,2.0,3.0],[4.0,5.0,6.0],[7.0,8.0,9.0]])
>>> x
array([[ 1., 2., 3.],
[ 4., 5., 6.],
[ 7., 8., 9.]])
Multi-dimensional (N-dimensional) arrays are available for N up to about 30, depending on the size of each dimension. Manually initializing higher-dimension arrays is tedious and error prone, and so it is better to use functions such as zeros((2, 2, 2)) or empty((2, 2, 2)). Higher-dimensional arrays are useful, e.g. when tracking matrix values through time, such as a time-varying covariance matrix.
4.7 Concatenation
Concatenation is the process by which one vector or matrix is appended to another. Arrays and matrices can be concatenated horizontally or vertically. For instance, suppose

x = [1 2; 3 4] and y = [5 6; 7 8]

and the block matrix

z = [x; y]

needs to be constructed. This can be accomplished by treating x and y as elements of a new matrix and
using the function concatenate using the named parameter axis to determine whether the matrices are
vertically (axis = 0) or horizontally (axis = 1) concatenated.
>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> y = array([[5.0,6.0],[7.0,8.0]])
>>> z = concatenate((x,y),axis = 0)
>>> z
array([[ 1., 2.],
[ 3., 4.],
[ 5., 6.],
[ 7., 8.]])
>>> z = concatenate((x,y),axis = 1)
>>> z
array([[ 1., 2., 5., 6.],
[ 3., 4., 7., 8.]])
Concatenating is the code equivalent of block-matrix forms in standard matrix algebra. Alternatively, the functions vstack and hstack can be used to vertically or horizontally stack arrays, respectively.
>>> z = vstack((x,y)) # Same as z = concatenate((x,y),axis = 0)
>>> z = hstack((x,y)) # Same as z = concatenate((x,y),axis = 1)
4.8 Accessing Elements of Array (Slicing)
Arrays, like lists and tuples, can be sliced. Slicing in arrays is virtually identical to slicing in lists, except
that since arrays are explicitly multidimensional and rectangular, slicing in more than 1-dimension is im-
plemented using a different syntax. 1-dimensional arrays can be sliced in an identical manner as lists or
tuples. 2 (or higher)-dimensional arrays are sliced using the syntax [:,:,. . .,:] (where the number of di-
mensions of the arrays determines the size of the slice). The 2-dimensions, first dimension is always the
row, and the second is the column.
>>> y = array([[[1.0,2],[3,4]],[[5,6],[7,8]]])
>>> y[0,:,:] # Panel 0 of 3D y
array([[ 1., 2.],
[ 3., 4.]])
k-dimensional arrays can be sliced using the [:,:,. . .,:] syntax, or they can be linear sliced. Linear slicing assigns an index to each element of the array, starting with the first (0), the second (1), and so on up to the last (n − 1). In 2 dimensions, linear slicing works by first counting across rows, and then down columns. To use linear slicing, the method or attribute flat must first be used.
>>> y = reshape(arange(25.0),(5,5))
>>> y
array([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.],
[ 10., 11., 12., 13., 14.],
[ 15., 16., 17., 18., 19.],
[ 20., 21., 22., 23., 24.]])
>>> y[0]
array([ 0., 1., 2., 3., 4.])
>>> y.flat[0]
0.0
>>> y.flat[6]
6.0
>>> y.flat[:]
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21.,
22., 23., 24.])
arange and reshape are useful functions which are described in later chapters.
Once a vector or matrix has been constructed, it is important to be able to access the elements indi-
vidually. Data in matrices is stored in row-major order. This means elements are indexed by first counting
across rows and then down columns. For instance, in the matrix
x = [1 2 3; 4 5 6; 7 8 9]
the first element of x is 1, the second element is 2, the third is 3, the fourth is 4, and so on.
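This ordering can be checked with flat, which indexes elements in exactly this across-then-down order. A small sketch (assuming NumPy imported as np):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
fourth = x.flat[3]        # counting across rows then down, element 4 is 4.0
in_order = list(x.flat)   # all 9 elements in row-major order
```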
Python, by default, only has access to a small number of built-in types and functions. The vast majority of
functions are located in modules, and before a function can be accessed, the module which contains the
function must be imported. For example, when using ipython --pylab (or any variants), a large number of
modules are automatically imported, including NumPy and matplotlib. This is useful for learning, but care
is needed to make sure that the correct module is imported when working in stand-alone python.
import can be used in a variety of ways. The simplest is to use from module import *. This will import all functions in module and make them immediately available. This method of using import can be dangerous since if you use it more than once, it is possible for functions to be hidden by later imports. For example,
from pylab import *
from numpy import *
creates a conflict for load which is first imported by pylab (from matplotlib.pylab.load), and then im-
ported by NumPy (from numpy.lib.npyio.load). A better method is to just import the required functions.
This still places functions at the top level of the namespace, but can be used to avoid conflicts.
from pylab import load
from numpy import array, matrix
The functions load, array and matrix can be directly called. An alternative, and more common, method is
to use import in the form
import pylab
import scipy
import numpy
import pylab as pl
import scipy as sp
import numpy as np
The only difference between these two blocks is that import numpy is equivalent to import numpy as numpy. When this form of import is used, functions will be located below the “as” name. For example, the load provided by NumPy is located at np.load, while the pylab load is pl.load – and both can be used where appropriate. While this method is the most general, it does require slightly more typing.
Function calls have different conventions than other expressions. The most important difference is that functions can take more than one input and return more than one output. The generic structure of a function call is out1, out2, out3, . . . = functionname(in1, in2, in3, . . .). The important aspects of this structure are:
• If multiple outputs are returned, but only one output variable is provided, the output will (generally)
be a tuple.
• The number of output variables determines how many outputs will be returned. Asking for more
outputs than the function provides will result in an error.
• Inputs can be the result of other functions as long as only one output is returned. For example, the following are equivalent,
>>> y = var(x)
>>> mean(y)
and
>>> mean(var(x))
Required Arguments Most functions have required arguments. For example, consider the definition of array from help(array),

array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)

Array has 1 required input, object, which is usually the list or tuple which contains values to use when creating the array. Required arguments can be determined by inspecting the function signature since all of the inputs follow the pattern keyword=default except object – required arguments will not have a default value provided. The other arguments can be called in order (array accepts at most 2 non-keyword arguments).
>>> array([[1.0,2.0],[3.0,4.0]])
array([[ 1., 2.],
[ 3., 4.]])
Keyword Arguments All of the arguments to array can be called by their keyword, which is listed in the help file definition.
>>> array(object=[[1.0,2.0],[3.0,4.0]])
array([[ 1., 2.],
[ 3., 4.]])
>>> array([[1.0,2.0],[3.0,4.0]], dtype=None, copy=True, order=None, subok=False, ndmin=0)
array([[ 1., 2.],
[ 3., 4.]])
The real advantage of keyword arguments is that they do not have to appear in any order (Note: randomly
ordering arguments is not good practice, and this is only an example).
>>> array(dtype=’complex64’, object = [[1.0,2.0],[3.0,4.0]], copy=True)
array([[ 1.+0.j, 2.+0.j],
[ 3.+0.j, 4.+0.j]], dtype=complex64)
Default Arguments Functions have defaults for optional arguments. These are listed in the function defini-
tion and appear in the help in keyword=default pairs. Returning to array, all inputs have default arguments
except object, which is the only required input.
Multiple Outputs Some functions can have more than 1 output. These functions can be used in a single
output mode or in multiple output mode. For example, shape can be used on an array to determine the size
of each dimension.
>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> s = shape(x)
>>> s
(2L, 2L)
Since shape will return as many outputs as there are dimensions, it can be called with 2 outputs when the input is a 2-dimensional array.
>>> x = array([[1.0,2.0],[3.0,4.0]])
>>> M,N = shape(x)
>>> M
2L
>>> N
2L
Similarly, providing too few outputs can also produce an error. Consider the case where the argument to shape is a 3-dimensional array.
>>> x = randn(10,10,10)
>>> shape(x)
(10L, 10L, 10L)
>>> M,N = shape(x) # Error
ValueError: too many values to unpack
4.11 Exercises

1. Enter the following vectors and matrices:

u = [1 1 2 3 5 8]

v = [1 1 2 3 5 8]′

x = [1 0; 0 1]

y = [1 2; 3 4]

z = [1 2 1 2; 3 4 3 4; 1 2 1 2]

w = [x x; y y]
Chapter 5
Basic Math
5.1 Operators
When x and y are scalars, the behavior of these operators is obvious. The only exception occurs for division when both x and y are integers, where x/y returns the largest integer less than or equal to the ratio (i.e. ⌊x/y⌋). The simplest method to avoid this problem is to explicitly avoid integers by using 5.0 rather than 5. Alternatively, integers can be explicitly cast to floats before the division.
>>> x = 9
>>> y = 5
>>> (type(x), type(y))
(int, int)
>>> x/y
1
>>> float(x)/y
1.8
When x and y are arrays or matrices, things are a bit more complex. The examples usually refer to arrays,
and except where explicit differences are noted, it is safe to assume that the behavior is identical for 2-
dimensional arrays and matrices.
I recommend using the import command from __future__ import division in all programs
and IPython. The “future” division avoids this issue by always casting division to floating point.
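The cast and the floor-division operator can be compared directly; in Python 3 (and with the __future__ import in Python 2), plain x / y would also give 1.8:

```python
x = 9
y = 5
true_ratio = float(x) / y   # casting one operand avoids integer division: 1.8
floored = x // y            # // always performs floor (integer) division: 1
```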
5.2 Broadcasting
Under the normal rules of array mathematics, addition and subtraction are only defined for arrays with the
same shape or between an array and a scalar. For example, there is no obvious method to add a 5-element
vector and a 5 by 4 matrix. NumPy uses a technique called broadcasting to allow mathematical operations
on arrays (and matrices) which would not be compatible under the normal rules of array mathematics.
Arrays can be used in element-by-element mathematics if x is broadcastable to y.
Suppose x is an m-dimensional array with dimensions d = [d_1, d_2, . . . , d_m], and y is an n-dimensional array with dimensions f = [f_1, f_2, . . . , f_n] where m ≥ n. Formally, the rules of broadcasting are:

1. If m > n, then treat y as an m-dimensional array with size g = [1, 1, . . . , 1, f_1, f_2, . . . , f_n] where the number of 1s prepended is m − n. The elements are g_i = 1 for i = 1, . . . , m − n and g_i = f_{i−m+n} for i > m − n.

2. For each i = 1, . . . , m, the arrays are broadcastable along axis i only if d_i = g_i, d_i = 1 or g_i = 1.

The first rule simply states that if one array has fewer dimensions, it is treated as having the same number of dimensions as the larger array by prepending 1s. The second rule states that arrays will only be broadcastable if either (a) they have the same dimension along axis i or (b) one has dimension 1 along axis i. When 2 arrays are broadcastable, the dimension of the output array is max(d_i, g_i) for i = 1, . . . , m.
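The two rules can be applied mechanically. The function below is an illustrative sketch (it is not part of NumPy) which computes the output shape of broadcasting two shapes, or fails if they are not broadcastable:

```python
def broadcast_shape(d, f):
    """Return the shape produced by broadcasting shapes d and f."""
    if len(d) < len(f):                       # ensure d has at least as many dims
        d, f = f, d
    g = (1,) * (len(d) - len(f)) + tuple(f)   # rule 1: prepend 1s to the smaller shape
    out = []
    for di, gi in zip(d, g):                  # rule 2: dimensions must match or be 1
        if di != gi and di != 1 and gi != 1:
            raise ValueError("shapes are not broadcastable")
        out.append(max(di, gi))
    return tuple(out)
```

For example, broadcast_shape((3, 5), (5,)) returns (3, 5), while broadcast_shape((3, 5), (3,)) raises an error.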
One simple method to visualize broadcasting is to use an add and subtract operation where the addition causes the smaller array to be broadcast, and then the subtraction removes the values in the larger array. In this example, x is 3 by 5, so y must be either a scalar or a 5-element array to be broadcastable. When y is a 3-element array (which matches the leading dimension rather than the trailing one), an error occurs.
>>> x = reshape(arange(15),(3,5))
>>> x
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
>>> y = 5
>>> x + y - x
array([[5, 5, 5, 5, 5],
[5, 5, 5, 5, 5],
[5, 5, 5, 5, 5]])
>>> y = arange(5)
>>> y
array([0, 1, 2, 3, 4])
>>> x + y - x
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
>>> y = arange(3)
>>> y
array([0, 1, 2])
>>> x + y - x # Error
ValueError: operands could not be broadcast together with shapes (3,5) (3)
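The failing 3-element y can still be combined with x by giving it an explicit second dimension, so that it lines up with the rows. A sketch (assuming NumPy imported as np):

```python
import numpy as np

x = np.reshape(np.arange(15), (3, 5))
y = np.arange(3)
z = x + y.reshape(3, 1) - x   # a (3,1) array broadcasts across the 5 columns
# every element in row i of z equals y[i]
```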
Subject to broadcasting restrictions, addition and subtraction works in the standard way element-by-element.
The standard multiplication operator differs for variables with type array and matrix. For arrays * is element-
by-element multiplication and arrays must be broadcastable. For matrices, * is matrix multiplication as
defined by linear algebra, and there is no broadcasting.
Conformable arrays can be multiplied according to the rules of matrix algebra using the function dot(). For simplicity, assume x is N by M and y is M by L. dot(x,y) will produce the N by L array z where z[i,j] = dot(x[i,:], y[:,j]) and dot on 1-dimensional arrays is the usual vector dot-product. The behavior of dot() is described as:
                         y
            Scalar                  Array
x  Scalar   Any: z = xy             Any: z_ij = x y_ij
   Array    Any: z_ij = y x_ij      Inside dimensions match: z_ij = Σ_{k=1}^{M} x_ik y_kj
These rules conform to the standard rules of matrix multiplication. dot() can also be used on higher dimensional arrays, and is useful if x is T by M by N and y is N by P, producing a T by M by P output where each of the T submatrices (each M by P) has the form dot(x[i],y).
5.5 Matrix Multiplication (*)
If x is N by M and y is K by L and both are non-scalar matrices, x*y requires M = K. Similarly, y*x requires L = N. If x is a scalar and y is a matrix, then z=x*y produces z(i,j)=x*y(i,j).
Suppose z=x*y where both x and y are matrices:

                         y
            Scalar                  Matrix
x  Scalar   Any: z = xy             Any: z_ij = x y_ij
   Matrix   Any: z_ij = y x_ij      Inside dimensions match: z_ij = Σ_{k=1}^{M} x_ik y_kj

Suppose z=multiply(x,y) where x and y are matrices:

                         y
            Scalar                  Array
x  Scalar   Any: z = xy             Any: z_ij = x y_ij
   Array    Any: z_ij = y x_ij      Both dimensions match: z_ij = x_ij y_ij

multiply will use broadcasting if necessary, and so matrices are effectively treated as 2-dimensional arrays.
5.7 Array Exponentiation (**)
Array exponentiation operates element-by-element, and the rules of broadcasting are used.
5.8 Matrix Exponentiation (**)
Matrix exponentiation differs from array exponentiation, and can only be used on square matrices. When x is a square matrix and y is an integer, z=x**y is z=x*x*...*x (y times). Python does not support non-integer values for y, although x^p can be defined (in linear algebra) using eigenvalues and eigenvectors for a subset of all matrices.
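For arrays, the same operation is available through matrix_power in numpy.linalg. A sketch:

```python
import numpy as np
from numpy.linalg import matrix_power

x = np.array([[1.0, 2.0], [3.0, 4.0]])
z = matrix_power(x, 2)   # x multiplied by itself using matrix multiplication
same = np.dot(x, x)      # identical result computed directly with dot
```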
5.9 Parentheses
Parentheses can be used in the usual way to control the order in which mathematical expressions are eval-
uated, and can be nested to create complex expressions. See section 5.11 on Operator Precedence for more
information on the order mathematical expressions are evaluated.
5.10 Transpose
Matrix transpose is expressed using either the transpose() function, or the shortcut .T. For instance, if x is
an M by N matrix, transpose(x), x.transpose() and x.T are all its transpose with dimensions N by M . In
practice, using the .T will improve readability of code. Consider
>>> x = randn(2,2)
>>> xpx1 = x.T * x
>>> xpx2 = x.transpose() * x
>>> xpx3 = transpose(x) * x
Transpose has no effect on 1-dimensional arrays. In 2 dimensions, transpose switches indices so that if z=x.T, z[j,i] is the same as x[i,j]. In higher dimensions, transpose reverses the order of the indices. For example, if x has 3 dimensions and z=x.T, then x[i,j,k] is the same as z[k,j,i]. Transpose takes an optional second argument, which can be used to manually determine the order of the axes after the transposition.
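A sketch of the axes argument, which takes a tuple giving the new order of the axes:

```python
import numpy as np

x = np.zeros((2, 3, 4))
z = x.T                          # reverses all axes: shape becomes (4, 3, 2)
w = np.transpose(x, (1, 0, 2))   # swaps only the first two axes: (3, 2, 4)
```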
5.11 Operator Precedence
Computer math, like standard math, has operator precedence which determines how mathematical expressions such as
2**3+3**2/7*13
are evaluated. Best practice is to always use parentheses to avoid ambiguity in the order of operations.
The order of evaluation is:
In the case of a tie, most operations are executed left-to-right. Exponentiation is the exception: x**y**z is interpreted as x**(y**z) since ** associates right-to-left.
This table has omitted some operators available in Python (bitwise) which are not useful (in general) in
numerical analysis.
Note: Unary operators are + or - operations that apply to a single element. For example, consider the expression (-4). This is an instance of a unary - since there is only 1 operand. (-4)**2 produces 16. -4**2 produces -16 since ** has higher precedence than unary negation and so is interpreted as -(4**2). -4 * -4 produces 16 since it is interpreted as (-4) * (-4), because unary negation has higher precedence than multiplication.
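These precedence rules can be verified directly:

```python
a = (-4) ** 2     # parentheses force unary minus first: 16
b = -4 ** 2       # ** binds tighter than unary minus: -(4 ** 2) = -16
c = -4 * -4       # parsed as (-4) * (-4) = 16
d = 2 ** 3 ** 2   # ** associates right-to-left: 2 ** (3 ** 2) = 512
```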
5.12 Exercises
1. Using the matrices entered in exercise 1 of chapter 4, compute the values of u + v′, v + u′, vu, uv and xy.
2. Is x/1 legal? If not, why not. What about 1/x?
3. Compute the values (x+y)**2 and x**2+x*y+y*x+y**2. Are they the same?
4. Is x**2+2*x*y+y**2 the same as either above?
5. When will x**y for matrices be the same as x**y for vectors?
6. Is a*b+a*c the same as a*(b+c)? If so, show it, if not, how can the second be changed so they are equal?
7. Suppose a command x**y*w+z was entered. What restrictions on the dimensions of w, x, y and z must be
true for this to be a valid statement?
8. What is the value of -2**4? What about (-2)**4?
Chapter 6
Basic Functions
6.1.1 linspace
linspace(l,u,n) generates a set of n points uniformly spaced between l, a lower bound (inclusive) and u,
an upper bound (inclusive).
>>> x = linspace(0, 10, 11)
>>> x
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
6.1.2 logspace
logspace(l,u,n) produces a set of logarithmically spaced points between 10^l and 10^u. It is identical to 10**linspace(l,u,n).
6.1.3 arange
arange(l,u,s) generates a set of points spaced by s between l, a lower bound (inclusive), and u, an upper bound (exclusive). arange can be used with a single parameter, so that arange(n) is equivalent to arange(0,n,1). arange will return an integer data type if all inputs are integers.
>>> x = arange(11)
>>> x
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> x = arange(11.0)
>>> x
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
6.1.4 meshgrid
meshgrid is a useful function for broadcasting two vectors into grids when plotting functions in 3 dimen-
sions.
61
>>> x = arange(5)
>>> y = arange(3)
>>> X,Y = meshgrid(x,y)
>>> X
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
>>> Y
array([[0, 0, 0, 0, 0],
[1, 1, 1, 1, 1],
[2, 2, 2, 2, 2]])
6.2 Rounding
around rounds to the nearest integer, or to a particular decimal place when called with two arguments.
>>> x = randn(3)
>>> x
array([ 0.60675173, -0.3361189 , -0.56688485])
>>> around(x)
array([ 1., 0., -1.])
>>> around(x, 2)
array([ 0.61, -0.34, -0.57])
around can also be used as a method on an ndarray – except that the method is named round. For example,
x.round(2) is identical to around(x, 2). The change of names is needed since there is a built-in function
round which is not aware of arrays.
6.2.2 floor
floor rounds to the next smallest integer (negative values are rounded away from 0).
>>> x = randn(3)
>>> x
array([ 0.60675173, -0.3361189 , -0.56688485])
>>> floor(x)
array([ 0., -1., -1.])
6.2.3 ceil
ceil rounds to the next largest integer (negative values are rounded towards 0).
>>> x = randn(3)
>>> x
array([ 0.60675173, -0.3361189 , -0.56688485])
>>> ceil(x)
array([ 1., -0., -0.])
Note that the values returned are still floating points and so -0. is the same as 0..
6.3 Mathematics
sum sums all elements in an array. By default, it will sum all elements, and so the second argument is
normally used to provide the axis to use (e.g. 0 to sum down columns, 1 to sum across rows). cumsum
produces the cumulative sum of the values in the array, and is also usually used with the second argument
to indicate the axis to use.
>>> x= randn(3,4)
>>> x
array([[-0.08542071, -2.05598312, 2.1114733 , 0.7986635 ],
[-0.17576066, 0.83327885, -0.64064119, -0.25631728],
[-0.38226593, -1.09519101, 0.29416551, 0.03059909]])
sum and cumsum can both be used as functions or as methods. When used as methods, the first input is the axis,
so that sum(x,0) is the same as x.sum(0).
prod and cumprod work identically to sum and cumsum, except that the product and cumulative product are
returned. prod and cumprod can be called as functions or methods.
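A minimal sketch of prod and cumprod (assuming NumPy imported as np):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.prod(x)       # 24.0, the product of all elements
cp = np.cumprod(x)   # array([ 1.,  2.,  6., 24.])
cp2 = x.cumprod()    # identical when called as a method
```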
6.3.3 diff
diff computes the finite difference of a vector (or array), and so returns n − 1 elements when used on an
n-element vector. diff operates on the last axis by default, and so diff(x) operates across columns and
returns x[:,1:size(x,1)]-x[:,:size(x,1)-1] for a 2-dimensional array. diff takes an optional keyword
argument axis, so that diff(x, axis=0) will operate across rows. diff can also be used to produce higher
order differences (e.g. the double difference).
>>> x= randn(3,4)
>>> x
array([[-0.08542071, -2.05598312, 2.1114733 , 0.7986635 ],
[-0.17576066, 0.83327885, -0.64064119, -0.25631728],
[-0.38226593, -1.09519101, 0.29416551, 0.03059909]])
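Since the listing above shows only x, a small concrete sketch of diff on a simple array (assuming NumPy imported as np):

```python
import numpy as np

x = np.array([[1.0, 3.0, 6.0],
              [2.0, 4.0, 8.0]])
d_cols = np.diff(x)           # across columns (the last axis): array([[2., 3.], [2., 4.]])
d_rows = np.diff(x, axis=0)   # across rows: array([[1., 1., 2.]])
d2 = np.diff(x, 2)            # double difference across columns: array([[1.], [2.]])
```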
6.3.4 exp
6.3.5 log
6.3.6 log10
6.3.7 sqrt
sqrt returns the element-by-element square root (√x) for an array.
6.3.8 square
6.3.9 absolute
absolute returns the element-by-element absolute value for an array. For complex-valued inputs,
|a + bi| = √(a² + b²).
6.3.10 sign
sign returns the element-by-element sign function, which is defined as 0 if x = 0, and x/|x| otherwise.
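A one-line sketch of sign (assuming NumPy imported as np):

```python
import numpy as np

x = np.array([-2.5, 0.0, 3.0])
s = np.sign(x)   # array([-1.,  0.,  1.])
```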
6.4.1 real
real returns the real elements of a complex array. real can be called either as a function real(x) or as a
property x.real.
6.4.2 imag
imag returns the imaginary elements of a complex array. imag can be called either as a function imag(x) or as
a property x.imag.
6.4.3 conj, conjugate
conj returns the element-by-element complex conjugate for a complex array. conj can be called either as a
function conj(x) or as a method x.conj(). conjugate is identical to conj.
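A minimal sketch covering real, imag and conj together (assuming NumPy imported as np):

```python
import numpy as np

x = np.array([1 + 2j, 3 - 4j])
re = x.real      # array([1., 3.])
im = x.imag      # array([ 2., -4.])
c = x.conj()     # array([1.-2.j, 3.+4.j])
```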
6.5.1 unique
unique returns the unique elements in an array. It only operates on the entire array. An optional second
output, containing the indices of the first occurrence of each unique element, is returned when the second
argument is True.
>>> x = repeat(randn(3),(2))
>>> x
array([ 0.11335982, 0.11335982, 0.26617443, 0.26617443, 1.34424621,
1.34424621])
>>> unique(x)
array([ 0.11335982, 0.26617443, 1.34424621])
>>> y, ind = unique(x, True)
>>> ind
array([0, 2, 4])
>>> x.flat[ind]
array([ 0.11335982, 0.26617443, 1.34424621])
6.5.2 in1d
in1d returns a boolean array with the same size as the first input array indicating the elements which are
also in a second array.
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> in1d(x,y)
array([False, False, False, False, False, True, True, True, True, True], dtype=bool)
6.5.3 intersect1d
intersect1d is similar to in1d, except that it returns the elements rather than a boolean array, and only
unique elements are returned. It is equivalent to unique(x.flat[in1d(x,y)]).
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> intersect1d(x,y)
array([ 5., 6., 7., 8., 9.])
6.5.4 union1d
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> union1d(x,y)
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.,
11., 12., 13., 14.])
BUG: union1d does not work as described in the help. Arrays are not flattened, so that using arrays with
different number of dims produces an error. The solution is to use union1d(x.flat,y.flat). (1.6.1)
6.5.5 setdiff1d
setdiff1d returns the set of elements which are in the first array but not in the second array.
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> setdiff1d(x,y)
array([ 0., 1., 2., 3., 4.])
6.5.6 setxor1d
setxor1d returns the set of elements which are in one (and only one) of two arrays.
>>> x = arange(10.0)
>>> y = arange(5.0,15.0)
>>> setxor1d(x,y)
array([ 0., 1., 2., 3., 4., 10., 11., 12., 13., 14.])
6.6.1 sort
sort sorts the elements of an array. By default, it sorts using the last axis of x. It takes an optional second
argument to indicate the axis to use for sorting (i.e. 0 for column-by-column, None for sorting all elements).
sort does not alter the input when called as a function, unlike the method version of sort.
>>> x = randn(4,2)
>>> x
array([[ 1.29185667, 0.28150618],
[ 0.15985346, -0.93551769],
[ 0.12670061, 0.6705467 ],
[ 2.77186969, -0.85239722]])
>>> sort(x)
array([[ 0.28150618, 1.29185667],
[-0.93551769, 0.15985346],
[ 0.12670061, 0.6705467 ],
[-0.85239722, 2.77186969]])
>>> sort(x, 0)
array([[ 0.12670061, -0.93551769],
[ 0.15985346, -0.85239722],
[ 1.29185667, 0.28150618],
[ 2.77186969, 0.6705467 ]])
ndarray.sort is a method for ndarrays which performs an in-place sort. It economizes on memory use, al-
though x is changed after calling x.sort(), unlike after a call to sort(x). x.sort() sorts along the last
axis by default, and takes the same optional arguments as sort(x). argsort returns the indices necessary
to produce a sorted array, but does not actually sort the data. It is otherwise identical to sort, and can be
used either as a function or a method.
>>> x= randn(3)
>>> x
array([ 2.70362768, -0.80380223, -0.10376901])
>>> sort(x)
array([-0.80380223, -0.10376901, 2.70362768])
>>> x
array([ 2.70362768, -0.80380223, -0.10376901])
>>> x.sort()
>>> x
array([-0.80380223, -0.10376901, 2.70362768])
max and min return the maximum and minimum values from an array. They take an optional second argu-
ment which indicates the axis to use.
>>> x= randn(3,4)
>>> x
array([[-0.71604847, 0.35276614, -0.95762144, 0.48490885],
[-0.47737217, 1.57781686, -0.36853876, 2.42351936],
[ 0.44921571, -0.03030771, 1.28081091, -0.97422539]])
>>> amax(x)
2.4235193583347918
>>> x.max()
2.4235193583347918
>>> x.max(0)
array([ 0.44921571, 1.57781686, 1.28081091, 2.42351936])
>>> x.max(1)
array([ 0.48490885, 2.42351936, 1.28081091])
max and min can only be used on arrays as methods. When used as a function, amax and amin must be used
to avoid conflicts with the built-in functions max and min. This behavior is also seen in around and round.
argmax and argmin return the index or indices of the maximum or minimum element(s). They are used in
an identical manner to max and min, and can be used either as a function or method.
maximum and minimum can be used to compute the maximum and minimum of two arrays which are broad-
castable.
>>> x = randn(4)
>>> x
array([-0.00672734, 0.16735647, 0.00154181, -0.98676201])
>>> y = randn(4)
>>> y
array([-0.69137963, -2.03640622, 0.71255975, -0.60003157])
>>> maximum(x,y)
array([-0.00672734, 0.16735647, 0.71255975, -0.60003157])
NaN functions are convenience functions which act similarly to their non-NaN versions, only ignoring NaN
values (rather than propagating them) when computing the function.
6.7.1 nansum
nansum is identical to sum, except that NaNs are ignored. nansum can be used to easily generate other NaN-
functions, such as nanstd (standard deviation, ignoring NaNs), since the variance can be implemented using
2 sums.
>>> x = randn(4)
>>> x[1] = np.nan
>>> x
array([-0.00672734, nan, 0.00154181, -0.98676201])
>>> sum(x)
nan
>>> nansum(x)
-0.99194753275859726
nanmax, nanmin, nanargmax and nanargmin are identical to their non-NaN counterparts, except that NaNs are
ignored.
Chapter 7
Special Matrices
Commands are available to produce a number of useful arrays. These all return arrays by default.
ones
ones generates an array of 1s and is generally called with one argument, a tuple which contains the size of
each dimension. ones takes an optional second argument (dtype) which specifies the data type. If omitted,
the data type is float.
M, N = 5, 5
# Produces a M by N array of 1s
x = ones((M,N))
# Produces a M by M by N 3D array of 1s
x = ones((M,M,N))
# Produces a M by N array of 1s using 32 bit integers
x = ones((M,N), dtype='int32')
Note: To use the function call above, N and M must have been previously defined (e.g. N,M=10,7). ones_like
creates an array with the same size and shape as the input. Calling ones_like(x) is equivalent to calling
ones(shape(x),x.dtype)
zeros
zeros produces an array of 0s in the same way ones produces an array of 1s, and is useful for initializing an
array to hold values generated by another procedure. zeros takes an optional second argument (dtype)
which specifies the data type. If omitted, the data type is float.
# Produces a M by N array of 0s
x = zeros((M,N))
# Produces a M by M by N 3D array of 0s
x = zeros((M,M,N))
# Produces a M by N array of 0s using 64 bit integers
x = zeros((M,N), dtype='int64')
zeros_like creates an array with the same size and shape as the input. Calling zeros_like(x) is equivalent
to calling zeros(shape(x),x.dtype).
71
empty
empty produces an empty (uninitialized) array to hold values generated by another procedure. empty takes
an optional second argument (dtype) which specifies the data type. If omitted, the data type is float.
# Produces a M by N empty array
x = empty((M,N))
# Produces a 4D empty array
x = empty((N,N,N,N))
# Produces a M by N empty array using 32-bit floats (single precision)
x = empty((M,N), dtype='float32')
Using empty is slightly faster than calling zeros since it does not assign 0 to all elements of the array – the
“empty” array created will be populated with (essentially random) values. empty_like creates an array with
the same size and shape as the input. Calling empty_like(x) is equivalent to calling empty(shape(x),x.dtype).
eye, identity
eye generates an identity matrix (an array with ones on the diagonal, zeros everywhere else). An identity
matrix is square and so usually only 1 input is needed.
In = eye(N)
7.1 Exercises
1. Produce two matrices, one containing all zeros and one containing only ones, of size 10 × 5.
2. Multiply these two matrices in both possible ways.
3. Produce an identity matrix of size 5. Take the exponential of this matrix, element-by-element.
4. How could ones and zeros be replaced with tile?
Chapter 8
Matrix Functions
Some functions operate exclusively on array inputs. Some are mathematical in nature, for instance comput-
ing the eigenvalues and eigenvectors, while others are functions for manipulating the elements of a matrix.
8.1 Views
Views are computationally efficient methods to produce objects which behave like other objects without
copying data. For example, an array x can always be converted to a matrix using matrix(x), which will copy
the elements in x. A view “fakes” the call to matrix and only inserts a thin layer so that x, viewed as a matrix,
behaves like a matrix.
view
view can be used to produce a representation of an array, matrix or recarray as another type without copying
the data. Using view is faster than copying data into a new class.
>>> x = arange(5)
>>> type(x)
numpy.ndarray
>>> x.view(np.matrix)
matrix([[0, 1, 2, 3, 4]])
>>> x.view(np.recarray)
rec.array([0, 1, 2, 3, 4])
asmatrix, mat
asmatrix and mat can be used to view an array as a matrix. This view is useful since matrix views will use
matrix multiplication by default.
>>> x = array([[1,2],[3,4]])
>>> x * x # Element-by-element
array([[ 1, 4],
[ 9, 16]])
73
>>> mat(x) * mat(x) # Matrix multiplication
matrix([[ 7, 10],
[15, 22]])
asarray
asarray works in a similar manner as asmatrix, only that the view produced is that of np.ndarray.
ravel
ravel returns a flattened view (1-dimensional) of an array or matrix. ravel does not copy the underlying
data, and so it is very fast.
>>> x = array([[1,2],[3,4]])
>>> x
array([[ 1, 2],
[ 3, 4]])
>>> x.ravel()
array([1, 2, 3, 4])
>>> x.T.ravel()
array([1, 3, 2, 4])
shape
shape returns the size of all dimensions of an array or matrix as a tuple. shape can be called as a function or a
property. shape can also be used to reshape an array by assigning a tuple of sizes. Additionally, the new shape
can contain -1, which indicates to expand along this dimension to satisfy the constraint that the number of
elements cannot change.
>>> x = randn(4,3)
>>> x.shape
(4L, 3L)
>>> shape(x)
(4L, 3L)
reshape transforms an array with one set of dimensions into one with a different set, preserving the number
of elements. reshape can transform an M by N array x into a K by L array y as long as MN = KL. Note
that the number of elements cannot change. The most useful call to reshape converts an array into a vector
or vice versa. For example
>>> x = array([[1,2],[3,4]])
>>> y = reshape(x,(4,1))
>>> y
array([[1],
[2],
[3],
[4]])
>>> z=reshape(y,(1,4))
>>> z
array([[1, 2, 3, 4]])
>>> w = reshape(z,(2,2))
>>> w
array([[1, 2],
[3, 4]])
The crucial implementation detail of reshape is that arrays are stored using row-major ordering: elements
are counted first across a row and then down to the next row. reshape will place elements of the old
array into the same position in the new array, and so after calling reshape, x(1) = y(1), x(2) = y(2), and so
on.
size
size returns the total number of elements in an array or matrix. size can be used as a function or a property.
>>> x = randn(4,3)
>>> size(x)
12
>>> x.size
12
ndim
ndim returns the number of dimensions of an array or matrix. ndim can be used as a function or a
property.
>>> x = randn(4,3)
>>> ndim(x)
2
>>> x.ndim
2
tile
tile, along with reshape, are two of the most useful non-mathematical functions. tile replicates an ar-
ray according to a specified size vector. To understand how tile functions, imagine forming an array
composed of blocks. The generic form of tile is tile(X, (M, N)) where X is the array to be replicated,
M is the number of rows in the new block matrix, and N is the number of columns in the new block matrix.
For example, suppose X was a matrix
" #
1 2
X =
3 4
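With X as above, tile(X,(2,3)) forms a block array with 2 block rows and 3 block columns, each block a copy of X (a minimal sketch, assuming NumPy imported as np):

```python
import numpy as np

X = np.array([[1, 2],
              [3, 4]])
Y = np.tile(X, (2, 3))  # 4 by 6: 2 block rows and 3 block columns of X
# Y[0] is array([1, 2, 1, 2, 1, 2])
```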
tile has two clear advantages over manual allocation: First, tile can be executed using parameters deter-
mined at run-time, such as the number of explanatory variables in a model and second tile can be used for
arbitrary dimensions. Manual matrix construction becomes tedious and error prone with as few as 3 rows
and columns. repeat is a related function which copies data in a less useful manner.
flatten
flatten works much like ravel, except that it copies the array when producing the flattened version.
flat
flat produces a numpy.flatiter object which is an iterator over a flattened view of an array. Because it is
an iterator, it is especially fast.
>>> x = array([[1,2],[3,4]])
>>> x.flat
<numpy.flatiter at 0x6f569d0>
>>> x.flat[2]
3
>>> x.flat[1:4] = -1
>>> x
array([[ 1, -1],
[-1, -1]])
broadcast, broadcast_arrays
broadcast can be used to broadcast two broadcastable arrays without actually copying any data. It returns
a broadcast object, which works like an iterator.
>>> x = array([[1,2,3,4]])
>>> y = reshape(x,(4,1))
>>> b = broadcast(x,y)
>>> b.shape
(4L, 4L)
broadcast_arrays works similarly to broadcast, except that it copies the broadcast matrices into new ar-
rays. broadcast_arrays is generally slower than broadcast, and should be avoided if possible.
>>> x = array([[1,2,3,4]])
>>> y = reshape(x,(4,1))
>>> b = broadcast_arrays(x,y)
>>> b[0]
array([[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4],
[1, 2, 3, 4]])
>>> b[1]
array([[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4]])
vstack, hstack
vstack, and hstack stack compatible arrays and matrices vertically and horizontally, respectively. Any num-
ber of matrices can be stacked by placing the input matrices in a tuple, e.g. (x,y,z).
>>> x = reshape(arange(6),(2,3))
>>> y = x
>>> vstack((x,y))
array([[0, 1, 2],
[3, 4, 5],
[0, 1, 2],
[3, 4, 5]])
>>> hstack((x,y))
array([[0, 1, 2, 0, 1, 2],
[3, 4, 5, 3, 4, 5]])
concatenate
concatenate generalizes vstack and hstack to allow concatenation along any axis.
vsplit, hsplit
vsplit and hsplit split arrays and matrices vertically and horizontally, respectively. Both can be used to
split an array into n equal parts or into arbitrary segments, depending on the second argument. If scalar,
the matrix is split into n equal sized parts. If a 1 dimensional array, the matrix is split using the elements
of the array as break points. For example, if the array was [2,5,8], the matrix would be split into 4 pieces
using [:2] , [2:5], [5:8] and [8:]. Both vsplit and hsplit are special cases of split.
>>> x = reshape(arange(20),(4,5))
>>> y = vsplit(x,2)
>>> len(y)
2
>>> y[0]
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
>>> y = hsplit(x,[1,3])
>>> len(y)
3
>>> y[0]
array([[ 0],
[ 5],
[10],
[15]])
>>> y[1]
array([[ 1, 2],
[ 6, 7],
[11, 12],
[16, 17]])
delete
delete removes values from an array, and is similar to splitting an array and then concatenating the values
which are not deleted. The form of delete is delete(x,rc,axis) where rc are the row or column indices to
delete, and axis is the axis to use (0 or 1 for a 2-dimensional array). If axis is omitted, delete operates on
the flattened array.
>>> x = reshape(arange(20),(4,5))
>>> delete(x,1,0) # Same as x[[0,2,3]]
array([[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19]])
squeeze
squeeze removes singleton dimensions from an array. squeeze can be called as a function or a method.
>>> x = ones((5,1,5,1))
>>> shape(x)
(5L, 1L, 5L, 1L)
>>> y = x.squeeze()
>>> shape(y)
(5L, 5L)
>>> y = squeeze(x)
fliplr, flipud
fliplr and flipud flip arrays in a left-to-right and up-to-down directions, respectively. Since 1-dimensional
arrays are neither column nor row vectors, these two functions are only applicable on 2-dimensional (or
higher) arrays.
>>> x = reshape(arange(4),(2,2))
>>> x
array([[0, 1],
[2, 3]])
>>> fliplr(x)
array([[1, 0],
[3, 2]])
>>> flipud(x)
array([[2, 3],
[0, 1]])
diag
diag can produce one of two results depending on the form of the input. If the input is a square matrix, it
will return a vector containing the elements of the diagonal. If the input is a vector, it will return a diagonal
matrix containing the elements of the vector along its diagonal. Consider the following example:
>>> x = matrix([[1,2],[3,4]])
>>> x
matrix([[1, 2],
[3, 4]])
>>> y = diag(x)
>>> y
array([1, 4])
>>> z = diag(y)
>>> z
array([[1, 0],
[0, 4]])
triu, tril
triu and tril return the upper and lower triangular portions of a matrix or array, respectively, setting the
remaining elements to 0.
>>> x = matrix([[1,2],[3,4]])
>>> triu(x)
matrix([[1, 2],
[0, 4]])
>>> tril(x)
matrix([[1, 0],
[3, 4]])
matrix_power
matrix_power raises a square array or matrix to an integer power, and matrix_power(x,n) is identical to
x**n when x is a matrix (for arrays, ** is element-by-element).
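A minimal sketch, verifying that matrix_power agrees with repeated matrix multiplication (assuming NumPy imported as np):

```python
import numpy as np
from numpy.linalg import matrix_power

x = np.array([[1.0, 0.5],
              [0.5, 1.0]])
y = matrix_power(x, 3)   # the matrix product x.dot(x).dot(x)
z = x.dot(x).dot(x)
```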
svd
svd computes the singular value decomposition of a matrix. A singular value decomposition of a matrix X
is

X = UΣV′

where Σ is a diagonal matrix, and U and V are unitary matrices (orthonormal if real valued). SVDs are
closely related to eigenvalue decompositions when X is a real, positive definite matrix.
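NumPy's svd returns the diagonal of Σ as a vector and V′ directly; a minimal sketch of the reconstruction:

```python
import numpy as np
from numpy.linalg import svd

X = np.array([[1.0, 0.5],
              [0.5, 1.0]])
U, s, Vt = svd(X)               # s holds the singular values (the diagonal of Sigma)
Xr = U.dot(np.diag(s)).dot(Vt)  # reconstructs X
```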
cond
cond computes the condition number of a matrix, which measures how close to singular a matrix is. Lower
numbers are better conditioned (and further from singular).
>>> x = matrix([[1.0,0.5],[.5,1]])
>>> cond(x)
3
>>> x = matrix([[1.0,2.0],[1.0,2.0]]) # Singular
>>> cond(x)
inf
slogdet
slogdet computes the sign and log of the absolute value of the determinant. slogdet is useful for computing
determinants which may be very large or small to avoid overflow or underflow.
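A minimal sketch of slogdet, showing how the determinant is recovered from the two outputs:

```python
import numpy as np
from numpy.linalg import slogdet

x = np.array([[1.0, 0.5],
              [0.5, 1.0]])
sgn, logdet = slogdet(x)
d = sgn * np.exp(logdet)   # recovers det(x) = 0.75 without forming a huge or tiny product
```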
solve
solve solves the system X β = y when X is square and invertible so that the solution is exact.
>>> X = array([[1.0,2.0,3.0],[3.0,3.0,4.0],[1.0,1.0,4.0]])
>>> y = array([[1.0],[2.0],[3.0]])
>>> solve(X,y)
array([[ 0.625],
[-1.125],
[ 0.875]])
lstsq
lstsq solves the system X β = y when X is n by k , n > k by finding the least squares solution. lstsq returns
a 4-element tuple where the first element is β and the second element is the sum of squared residuals. The
final two outputs are diagnostic – the third is the rank of X and the fourth contains the singular values of X .
>>> X = randn(100,2)
>>> y = randn(100)
>>> lstsq(X,y)
(array([ 0.03414346, 0.02881763]),
array([ 3.59331858]),
2,
array([ 3.045516 , 1.99327863]))
cholesky
cholesky computes the Cholesky factor of a positive definite matrix or array. The Cholesky factor is a lower
triangular matrix and is defined as C in

CC′ = Σ

where Σ is a positive definite matrix.
>>> x = matrix([[1,.5],[.5,1]])
>>> y = cholesky(x)
>>> y*y.T
matrix([[ 1. , 0.5],
[ 0.5, 1. ]])
det
det computes the determinant of a square matrix or array.
>>> x = matrix([[1,.5],[.5,1]])
>>> det(x)
0.75
eig
eig computes the eigenvalues and eigenvectors of a square matrix. Two output arguments are required in
order to compute both the eigenvalues and eigenvectors, val,vec = eig(R).
>>> x = matrix([[1,.5],[.5,1]])
>>> val,vec = eig(x)
>>> vec*diag(val)*vec.T
matrix([[ 1. , 0.5],
[ 0.5, 1. ]])
eigh
eigh computes the eigenvalues and eigenvectors of a square, symmetric matrix. Two output arguments are
required in order to compute both the eigenvalues and eigenvectors, val,vec = eigh(R). eigh is faster than
eig since it exploits the symmetry of the input. eigvalsh can be used if only the eigenvalues are needed from a
square, symmetric matrix.
inv
inv computes the inverse of a matrix. inv(x) can alternatively be computed using x**(-1) when x is a matrix.
>>> x = matrix([[1,.5],[.5,1]])
>>> xInv = inv(x)
>>> x*xInv
matrix([[ 1., 0.],
[ 0., 1.]])
kron
kron computes the Kronecker product of two arrays,

z = x ⊗ y

and is written as z = kron(x,y).
trace
trace computes the trace of a square matrix (sum of diagonal elements) and so trace(x) equals sum(diag(x)).
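A one-line sketch of trace (assuming NumPy imported as np):

```python
import numpy as np

x = np.array([[1.0, 0.5],
              [0.5, 2.0]])
t = np.trace(x)           # 3.0
s = np.sum(np.diag(x))    # identical by definition
```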
Chapter 9
Importing and Exporting Data
Importing data ranges from easy, for files which contain only numbers, to difficult, depending on the data size
and format. A few principles can simplify this task:
• The file imported should contain numbers only, with the exception of the first row which may contain
the variable names.
• Use another program, such as Microsoft Excel, to manipulate data before importing.
• Dates should be converted to YYYYMMDD, a numeric format, before importing. This can be done in
Excel using the formula:
=10000*YEAR(A1)+100*MONTH(A1)+DAY(A1)+(A1-FLOOR(A1,1))
A number of importers are available for regular (e.g. all rows have the same number of columns) comma-
separated value (CSV) data. The choice of which importer to use depends on the complexity and size of the
file. Purely numeric files are the simplest to import, although most files which have a repeated structure can
be imported (unless they are very large).
9.2.1 loadtxt
loadtxt (numpy.lib.npyio.loadtxt) is a simple, but fast, text importer. The basic use is loadtxt(filename),
which will attempt to load the data in filename as floats. Other useful named arguments include delimiter,
which allows the file delimiter to be specified, and skiprows, which allows one or more rows to be skipped.
loadtxt requires the data to be numeric and so is only useful for the simplest files.
85
For example, attempting to load a file whose first row contains variable names fails with an error such as:
ValueError: could not convert string to float: Date
9.2.2 genfromtxt
genfromtxt (numpy.lib.npyio.genfromtxt) is a slower, but more robust, importer than loadtxt. genfromtxt
is called in an identical manner as loadtxt, but will not fail if a non-numeric type is encountered. Instead,
genfromtxt will return a NaN (not-a-number) for fields in the file it cannot read.
9.2.2.1 csv2rec
csv2rec (matplotlib.mlab.csv2rec) is an even more robust – and slower – CSV importer which allows for
non-numeric data such as dates. It also attempts to find the best data type for each column.
Unlike loadtxt and genfromtxt, which both return an array, csv2rec returns a recarray (numpy.core.records
.recarray, see Chapter 22) which is, in many ways, like a list. csv2rec converts each row of the
input file into a datetime (see Chapter 18), followed by 4 floats for open, high, low and close, then a long
integer for volume, and finally a float for the adjusted close.
Because the values returned are not an array, it is normally necessary to create an array to store the values.
Reading Excel files in Python is more involved, so unless essential, it is probably simpler to convert the xls
to CSV. Reading 97-2003 Excel files requires a Python package which is not in the core, xlutils, which can be
installed using easy_install xlutils.
from __future__ import print_function
import xlrd

wb = xlrd.open_workbook('FTSE_1984_2012.xls')
sheetNames = wb.sheet_names()
# Assumes 1 sheet name
sheet = wb.sheet_by_name(sheetNames[0])
excelData = []
for i in xrange(sheet.nrows):
    excelData.append(sheet.row_values(i))
The listing does a few things. First, it opens the workbook for reading (xlrd.open_workbook('FTSE_1984_2012.xls')),
then it gets the sheet names (wb.sheet_names()) and opens a sheet (wb.sheet_by_name(sheetNames[0])).
From the sheet, it gets the number of rows (sheet.nrows) and fills a list with the values, row-by-row. Once
the data has been read in, the final block fills an array with the opening prices from the list. This
is substantially more complicated than converting to a CSV file, although reading Excel files is useful for
automated work (e.g. when you have no choice but to import from an Excel file since it is produced by some
other software).
xlrd only reads 97-2003 files, and so a different package, openpyxl, is needed to read xlsx files created in Office
2007 or later. Unfortunately openpyxl has a different syntax to xlrd, and so a modified reader is needed for
xlsx files.
from __future__ import print_function
import openpyxl

wb = openpyxl.load_workbook('FTSE_1984_2012.xlsx')
sheetNames = wb.get_sheet_names()
# Assumes 1 sheet name
sheet = wb.get_sheet_by_name(sheetNames[0])
excelData = []
rows = sheet.rows
Scipy enables MATLAB data files to be read. The native file format is the MATLAB data file, or mat file.
Data from a mat file can be loaded using scipy.io.loadmat. The data is loaded into a dictionary, and so
individual variables can be accessed using the keys of the dictionary.
from __future__ import print_function
import scipy.io as io

matData = io.loadmat('FTSE_1984_2012.mat')
open = matData['open']
Python can be programmed to read virtually any text format since it contains functions for parsing and
interpreting arbitrary text containing numeric data. Reading poorly formatted data files is an advanced
technique and should be avoided if possible. However, some data is only available in formats where reading
in data line-by-line is the only option. For instance, the standard import method fails if the raw data is very
large (too large for Excel) and is poorly formatted. In this case, the only possibility is to write a program to
read the file line-by-line and to process each line separately.
The file IBM_TAQ.txt contains a simple example of data that is difficult to import. This file was down-
loaded from WRDS and contains all prices for IBM from the TAQ database in the interval January 1, 2001
through January 31, 2001. It is too large to use in Excel and has numbers, dates and text on each line.
The following code block shows how the data in this file can be parsed.
f = file('IBM_TAQ.txt', 'r')
line = f.readline()
# Burn the first line as a header
line = f.readline()
date = []
time = []
price = []
volume = []
while line:
    data = line.split(',')
    date.append(int(data[1]))
    price.append(float(data[3]))
    volume.append(int(data[4]))
    t = data[2]
    time.append(int(t.replace(':','')))
    line = f.readline()

allData = array([date,price,volume,time])
f.close()
The while loop parses each line by the location of the commas, using split(',') to split the line at each
comma into a list, before converting the fields to numeric types.
StatTransfer is available on the servers and is capable of reading and writing approximately 20 different
formats, including MATLAB, GAUSS, Stata, SAS, Excel, CSV and text files. It allows users to load data in
one format and output some or all of the data in another. StatTransfer can make some hard-to-manage
situations (e.g. poorly formatted data) substantially easier. StatTransfer has a comprehensive help file to
provide assistance.
A number of options are available for saving data. These include using native npz data files, MATLAB data
files, csv or plain text. Multiple numpy arrays can be saved using savez (numpy.savez).
x = arange(10)
y = zeros((100,100))
savez('test',x,y)
data = load('test.npz')
# If no names are given, arrays get the generic names arr_0, arr_1, etc.
x = data['arr_0']
savez('test',x=x,otherData=y)
data = load('test.npz')
# x=x provides the name x for the data in x
x = data['x']
# otherData=y saves the data in y as otherData
y = data['otherData']
A version which compresses data but is otherwise identical is savez_compressed. Compression is very help-
ful for arrays which have repeated values or are very large.
x = arange(10)
y = zeros((100,100))
savez_compressed('test',x=x,otherData=y)
data = load('test.npz')
# x=x provides the name x for the data in x
x = data['x']
# otherData=y saves the data in y as otherData
y = data['otherData']
Scipy enables MATLAB data files to be written. Data can be written using scipy.io.savemat, which takes
two inputs, a filename and a dictionary containing data, in its simplest form.
from __future__ import print_function
import scipy.io as io

x = array([1.0,2.0,3.0])
y = zeros((10,10))
# Set up the dictionary
saveData = {'x':x, 'y':y}
io.savemat('test',saveData,do_compression=True)
# Read the data back
matData = io.loadmat('test.mat')
savemat uses the optional argument do_compression = True, which compresses the data, and is generally
a good idea on modern computers and/or for large datasets.
Data can be exported to delimited text files using savetxt. By default, savetxt separates values with a single
space, although this can be changed using the named argument delimiter.
x = randn(10,10)
# Save using the default (space) delimiter
savetxt('tabs.txt',x)
# Save to CSV
savetxt('commas.csv',x,delimiter=',')
# Reread the data
xData = loadtxt('commas.csv',delimiter=',')
9.9 Exercises
1. The file exercise3.xls contains three columns of data, the date, the return on the S&P 500, and the
return on XOM (ExxonMobil). Using Excel, convert the date to YYYYMMDD format and save the file.
2. Save the file as both CSV and tab delimited. Use the three CSV readers to read the file, and parse the
loaded data into three variables, dates, SP500 and XOM.
3. Save NumPy, compressed NumPy and MATLAB data files with all three variables. Which file is the
smallest?
4. Construct a new variable, sumreturns as the sum of SP500 and XOM. Create another new variable,
outputdata as a horizontal concatenation of dates and sumreturns.
>>> x = 1.0
>>> eps = finfo(float).eps
>>> x = x + eps/2
>>> x == 1
True
>>> x - 1
0.0
>>> x = 1 + 2*eps
>>> x == 1
False
>>> x - 1
4.4408920985006262e-16
93
To understand what is meant by relative range, examine the following output
>>> x = 10
>>> x + 2*eps
10.0
>>> x - 10
0.0
In the first example, eps/2 is below the representable gap near 1, so adding it has no effect, while 2*eps is above the gap and so the result differs from 1. In the second example, the perturbation relative to 10 is 2*eps/10 < eps, so it has no effect when added. This is a tricky concept to understand, but failure to understand numeric limits can result in errors in code that appears to be otherwise correct.
The practical lesson is to think about data scaling. Many variables have natural scales which are vastly different, and so rescaling is often necessary to avoid numeric limits.
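The effect of scale can be made concrete using numpy.spacing, which reports the gap between a float and the next representable float (a small illustration; the use of spacing here is my addition, it does not appear in the text above):

```python
import numpy as np

eps = np.finfo(float).eps
# The gap between adjacent floats grows with magnitude
print(np.spacing(1.0))   # equal to eps
print(np.spacing(10.0))  # 8 times larger
# The same perturbation is lost at a larger scale
print(1.0 + 2 * eps == 1.0)    # False: 2*eps exceeds the gap near 1
print(10.0 + 2 * eps == 10.0)  # True: 2*eps is below half the gap near 10
```

This is why rescaling data to comparable magnitudes avoids silently losing small differences.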
10.1 Exercises
Logical operators are useful when writing batch files or custom functions. Logical operators, when combined with flow control, allow complex choices to be expressed compactly.
Logical operators can be used on scalars, arrays or matrices. All comparisons are done element-by-element and return either True or False. For instance, if x and y are matrices, z = x < y will be a matrix of the same size as x and y composed of True and False values. Alternatively, if one is a scalar, say y, then the elements of z are z[i,j] = x[i,j] < y. Logical operators can also be used to access elements of a vector or matrix. Suppose z = x L y, where L is one of the logical operators above, such as < or ==. The following table examines the behavior when x and/or y are scalars or matrices. Suppose z = x < y:
                 y: Scalar               y: Matrix
x: Scalar        z = x < y               z[i,j] = x < y[i,j]
x: Matrix        z[i,j] = x[i,j] < y     z[i,j] = x[i,j] < y[i,j]

Any combination of sizes is allowed, except that when both x and y are matrices their dimensions must be the same.
Logical operators are used in portions of programs known as flow control (e.g. if ... else ... blocks)
which will be discussed later. It is important to remember that vector or matrix logical operations return
vector or matrix output and that flow control blocks require scalar logical expressions.
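As a quick check of the element-by-element rule, compare an array against a scalar and against a same-shaped array (a small sketch using array rather than matrix):

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])
# Scalar comparison: the scalar is compared against every element
z1 = x < 2.5
# Same-dimension comparison: corresponding elements are compared
y = np.array([[2.0, 1.0], [4.0, 3.0]])
z2 = x < y
```

z1 is True where elements of x are below 2.5; z2 is True where x[i,j] < y[i,j].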
11.2 and, or, not and xor
and and logical_and() both return true if both arguments are true. The keyword and can only be used
on scalars, and so is called a short-circuit operator. logical_and() can be used on matrices. The same is
true
>>> x=matrix([[1,2],[-3,-4]])
>>> y = x > 0
>>> z = x < 0
>>> logical_and(y, z)
matrix([[False, False],
[False, False]], dtype=bool)
These operators follow the same rules as other logical operators. If used on two matrices, the dimensions must be the same. If used on a scalar and a matrix, the effect is the same as calling the logical function on the scalar and each element of the matrix.
Suppose x and y are logical variables (1s or 0s), and define z = logical_and(x,y):
                 y: Scalar               y: Matrix
x: Scalar        z = x & y               z[i,j] = x & y[i,j]
x: Matrix        z[i,j] = x[i,j] & y     z[i,j] = x[i,j] & y[i,j]

Any combination of sizes is allowed, except that when both x and y are matrices their dimensions must be the same.
The commands all and any take logical input and are self-descriptive. all returns True if all logical elements in an array are true. If all is called without any additional arguments on an array, it returns True if all elements of the array are logically true and False otherwise. any returns True if any element of an array is True. Both all and any can also be used along the dimensions of the array using a second argument (or the named argument axis) which indicates the axis of operation, where 0 operates column-by-column (i.e. it examines all elements in a single column), 1 operates row-by-row, and so on. When used column- or row-wise, the output is an array with one less dimension than the input, where each element of the output contains the truth value of the operation on the column or row.
>>> x = matrix([[1,2],[3,4]])
>>> y = x <= 2
>>> y
matrix([[ True, True],
[False, False]], dtype=bool)
>>> any(y)
True
>>> any(y,0)
matrix([[ True, True]], dtype=bool)
>>> any(y,1)
matrix([[ True],
[False]], dtype=bool)
allclose
allclose can be used to compare two arrays, while allowing for a tolerance. This type of function is impor-
tant when comparing floating point values which may be effectively the same, but not identical.
>>> eps = np.finfo(np.float64).eps
>>> eps
2.2204460492503131e-16
>>> x = randn(2)
>>> y = x + eps
>>> x == y
array([False, False], dtype=bool)
>>> allclose(x,y)
True
array_equal
array_equal tests if two arrays have the same shape and elements. It is safer than comparing arrays directly
since comparing arrays which are not broadcastable produces an error.
array_equiv
array_equiv tests if two arrays are equivalent, even if they do not have the exact same shape. Equivalence
is defined as one array being broadcastable to produce the other.
>>> x = randn(10,1)
>>> y = tile(x,2)
>>> array_equal(x,y)
False
>>> array_equiv(x,y)
True
11.4 Logical Indexing
find
find is a useful function for working with multiple data series. find is not logical itself, but it takes logical inputs and returns the indices where the logical statement is true. Calling indices = find(x <= 2) will return indices (0, 1, . . .) so that the elements which are true can be accessed using the slice x.flat[indices]. Note that the flat view is needed since slicing x directly (x[indices]) will operate along the first dimension, and so will return rows of a 2-dimensional matrix.
>>> x = matrix([[1,2],[3,4]])
>>> y = x <= 2
>>> indices = find(y)
>>> indices
array([0, 1], dtype=int64)
>>> x.flat[indices]
matrix([[1, 2]])
# Wrong output
>>> x[indices]
>>> x = matrix([[1,2],[3,4]]);
>>> y = x <= 4
>>> indices = find(y)
>>> x.flat[indices]
matrix([[1, 2, 3, 4]])
argwhere
argwhere takes a logical input and returns an array of indices, with one row for each element where the condition is true.
>>> x = randn(3)
>>> x
array([-0.5910316 , 0.51475905, 0.68231135])
>>> argwhere(x<0)
array([[0]], dtype=int64)
>>> x = randn(3,2)
>>> x
array([[ 0.72945913, 1.2135989 ],
[ 0.74005449, -1.60231553],
[ 0.16862077, 1.0589899 ]])
>>> argwhere(x<0)
array([[1, 1]], dtype=int64)
>>> x = randn(3,2,4)
>>> argwhere(x<0)
array([[0, 0, 1],
[0, 0, 2],
[0, 1, 2],
[0, 1, 3],
[1, 0, 2],
[1, 1, 0],
[2, 0, 1],
[2, 1, 0],
[2, 1, 1],
[2, 1, 3]], dtype=int64)
extract
extract is similar to argwhere except that it returns the values where the condition is true rather than the indices.
>>> x = randn(3)
>>> x
array([-0.5910316 , 0.51475905, 0.68231135])
>>> extract(x<0, x)
array([-0.5910316])
>>> x = randn(3,2)
>>> x
array([[ 0.72945913, 1.2135989 ],
[ 0.74005449, -1.60231553],
[ 0.16862077, 1.0589899 ]])
>>> extract(x<0,x)
array([-1.60231553])
11.5 is*
A number of special-purpose logical tests are provided to determine if a matrix has special characteristics. Some operate element-by-element and produce a matrix of the same dimension as the input matrix, while others produce only scalars. These functions all begin with is.
isnan               1 if nan                            element-by-element
isinf               1 if inf                            element-by-element
isfinite            1 if not inf and not nan            element-by-element
isposinf, isneginf  1 for positive or negative inf      element-by-element
isreal              1 if not complex valued             element-by-element
iscomplex           1 if complex valued                 element-by-element
is_string_like      1 if argument is a string           scalar
is_numlike          1 if a numeric type                 scalar
isscalar            1 if scalar                         scalar
isvector            1 if input is a vector              scalar
There are a number of other special purpose is* expressions. For more details, search for is* in help.
x=matrix([4,pi,inf,inf/inf])
isnan(x)
isinf(x)
isfinite(x)
Note: isnan(x) | isinf(x) | isfinite(x) always equals True for every element, implying any element falls into one (and only one) of these categories.
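This partition can be checked directly (a small sketch using array instead of matrix):

```python
import numpy as np

x = np.array([4.0, np.pi, np.inf, np.nan])
# Every element is nan, inf or finite
combined = np.isnan(x) | np.isinf(x) | np.isfinite(x)
# Each element falls into exactly one category
counts = (np.isnan(x).astype(int) + np.isinf(x).astype(int)
          + np.isfinite(x).astype(int))
```

combined is True everywhere and counts is 1 for every element.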
11.6 Exercises
1. Using the data file created in Chapter 9, count the number of negative returns in both the S&P 500 and
ExxonMobil.
2. For both series, create an indicator variable that takes the value 1 if the return is larger than 2 standard
deviations or smaller than -2 standard deviations. What is the average return conditional on falling into this
range for both returns?
3. Construct an indicator variable that takes the value of 1 when both returns are negative. Compute the
correlation of the returns conditional on this indicator variable. How does this compare to the correlation
of all returns?
4. What is the correlation when at least 1 of the returns is negative?
5. What is the relationship between all and any? Write down a logical expression that allows one or the
other to be avoided (i.e. write def myany(x) and def myall(y)).
Chapter 12
The previous chapter explored one use of logical variables, selecting elements from a matrix. Logical vari-
ables have another important use: flow control. Flow control allows different code to be executed depend-
ing on whether certain conditions are met.
if . . . elif . . . else blocks always begin with an if statement immediately followed by a scalar logical ex-
pression. elif and else are optional and can always be replicated using nested if statements at the expense
of more complex logic and deeper nesting. The generic form of an if . . . elif . . . else block is
if logical_1:
Code to run if logical_1
elif logical_2:
Code to run if logical_2
elif logical_3:
Code to run if logical_3
...
...
else:
Code to run if all previous logicals are false
or
if logical:
Code to run if logical true
else:
Code to run if logical false
>>> x = 5
>>> if x<5:
... x += 1
... else:
... x -= 1
>>> x
4
and
>>> x = 5;
>>> if x<5:
... x = x + 1
... elif x>5:
... x = x - 1
... else:
... x = x * 2
>>> x
10
These examples have all used simple logical expressions. However, any scalar logical expression, such as (x<0 or x>1) and (y<0 or y>1) or isinf(x) or isnan(x), can be used in if . . . elif . . . else blocks.
Exception handling is an advanced programming technique which can be used to make code more resilient (often at the cost of speed). try . . . except blocks are useful for running code which may be dangerous. In most numerical applications, code should be deterministic and so dangerous code can usually be avoided. When it can't, for example when reading data from a source which isn't always available (e.g. a website), then try . . . except can be used to attempt to execute the code, and then to do something helpful if the code fails to execute. The generic structure of a try . . . except block is
try:
Dangerous Code
except ExceptionType1:
Code to run if ExceptionType1 is raised
except ExceptionType2:
Code to run if ExceptionType2 is raised
...
...
except:
Code to run if an unlisted exception type is raised
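A concrete sketch of this pattern (the function name and the nan fallback are assumptions, not from the text): converting strings which may not be numeric.

```python
def safe_float(s):
    """Attempt a conversion that may fail; fall back to nan on failure."""
    try:
        return float(s)
    except ValueError:
        # The dangerous code raised ValueError; do something helpful instead
        return float('nan')

print(safe_float('1.5'))  # 1.5
print(safe_float('bad'))  # nan
```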
12.3 List Comprehensions
List comprehensions are a form of syntatic sugar which may simplify code when an iterable object is looped
across and the results are saved to a list, possibly conditional on some logical test. Simple list can be used
to convert a for loop which includes an append into a single line statement.
>>> x = arange(5.0)
>>> y = []
>>> for i in xrange(len(x)):
... y.append(exp(x[i]))
>>> y
[1.0,
2.7182818284590451,
7.3890560989306504,
20.085536923187668,
54.598150033144236]
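The loop above collapses to a single line (written with range so it also runs under Python 3):

```python
from numpy import arange, exp

x = arange(5.0)
y = [exp(x[i]) for i in range(len(x))]
# Or, more idiomatically, iterate directly over the elements
y2 = [exp(v) for v in x]
```

Both produce the same list of exponentials as the loop-and-append version.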
This simple list comprehension saves 2 lines of typing. List comprehensions can also be extended to include a logical test.
>>> x = arange(5.0)
>>> y = []
>>> for i in xrange(len(x)):
... if floor(i/2)==i/2:
... y.append(x[i]**2)
>>> y
[0.0, 4.0, 16.0]
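Equivalently, as a one-line comprehension with a condition (written with range and true division, as in Python 3):

```python
from numpy import arange, floor

x = arange(5.0)
# Keep only even indices: floor(i/2) == i/2 holds when i is even
y = [x[i] ** 2 for i in range(len(x)) if floor(i / 2) == i / 2]
```

y contains the squares of the even-indexed elements of x.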
List comprehensions can also be used to loop over multiple iterables.
>>> x1 = arange(5.0)
>>> x2 = arange(3.0)
>>> y = []
>>> for i in xrange(len(x1)):
... for j in xrange(len(x2)):
... y.append(x1[i]*x2[j])
>>> y
[0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 2.0, 4.0, 0.0, 3.0, 6.0, 0.0, 4.0, 8.0]
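The double loop above can be written with two for clauses inside one comprehension; the clauses nest left-to-right:

```python
from numpy import arange

x1 = arange(5.0)
x2 = arange(3.0)
# i is the outer loop, j the inner loop, exactly as in the nested version
y = [x1[i] * x2[j] for i in range(len(x1)) for j in range(len(x2))]
```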
While list comprehensions are powerful methods to compactly express complex operations, they are never
essential to Python programming.
12.4 Exercises
1. Write a code block that would take a different path depending on whether the returns on two series are
simultaneously positive, both are negative, or they have different signs using an if . . . elif . . . else block.
Chapter 13
Loops
Loops make many problems simple and, in many cases, possible, particularly when combined with flow control blocks. Two types of loop blocks are available: for and while. for blocks iterate over a predetermined set of values and while blocks loop as long as some logical expression is satisfied. All for loops can be expressed as while loops although the opposite is not quite true. They are nearly equivalent when break is used, although it is generally preferable to use a while loop rather than a for loop with a break statement.
13.1 for
for loops begin with for item in iterable:, followed by an indented block of code to run on each iteration. item is an element from iterable, and iterable can be anything that is iterable in Python. The most common examples are xrange or range, lists, tuples, arrays or matrices. The for loop will iterate across all items in iterable, beginning with item 0 and continuing until the final item.
count = 0
for i in range(100):
count += i
count = 0
x = linspace(0,500,50)
for i in x:
count += i
count = 0
x = list(arange(-20,21))
for i in x:
count += i
The first loop will iterate over i = 0, 1, 2, . . . , 99. The second loops over the values produced by linspace, which returns an array of 50 uniformly spaced points between 0 and 500, inclusive. The final loop iterates over x, a list constructed from a call to list(arange(-20,21)), which produces the series −20, −19, . . . , 0, . . . , 19, 20. All three – ranges, arrays, and lists – are iterable. The key to understanding for loop behavior is that for always iterates over the elements of the iterable in the order they are presented (i.e. iterable[0], iterable[1], . . .).
Python 2.7 vs. 3.2 Note: This chapter exclusively uses range in loops (instead of xrange). This
is a simplification used so that the same code will run in Python 2.7 and 3.2, although the best
practice is to use xrange in Python 2.7 loops.
A loop over an iterable can be equivalently expressed using range, by using len to get the number of items in the iterable:
returns = randn(100)
count = 0
for i in range(len(returns)):
if returns[i]<0:
count += 1
Finally, these ideas can be combined to produce nested loops with flow control.
x = zeros((10,10))
for i in range(size(x,0)):
for j in range(size(x,1)):
if i<j:
x[i,j]=i+j;
else:
x[i,j]=i-j
or loops containing nested loops that are executed based on a flow control statement.
x = zeros((10,10))
for i in range(size(x,0)):
if (i % 2) == 1:
for j in range(size(x,1)):
x[i,j] = i+j
else:
for j in range(int(i/2)):
x[i,j] = i-j
Note: Reassigning the iterable variable inside the loop has no effect on the iteration, which continues over the original iterable. Consider, for example,

x = range(10)
for i in x:
    print(i)
    print('Length of x:', len(x))
    x = range(5)
Note that it is not safe to modify the sequence of the iterable when looping over it. This means that the iterable should not change size, which can occur when using a list and the functions pop(), insert() or append(), or the keyword del. The loop below would never terminate (except for the break) since L is extended on each iteration.
L = [1, 2]
for i in L:
print(i)
L.append(i+2)
if i>5:
break
Finally, for loops can be used with 2 items when the iterable is wrapped in enumerate, which allows the
elements of the iterable to be directly accessed, as well as their index in the iterable.
x = linspace(0,100,11)
for i,y in enumerate(x):
    print('i is :', i)
    print('y is :', y)
13.1.1 Whitespace
Like if . . . elif . . . else flow control blocks, for loops are whitespace sensitive. The indentation of the
line immediately below the for statement determines the indentation that all statements in the block must
have. The convention is 4 spaces.
13.1.2 break
A loop can be terminated early using break. break is usually used after an if statement to terminate the
loop prematurely if some condition has been met.
x = randn(1000)
for i in x:
print(i)
if i > 2:
break
13.1.3 continue
continue can be used to skip an iteration of a loop, immediately returning to the top of the loop using
the next item in iterable. continue is usually used to avoid a level of nesting, such as in the following two
examples.
x = randn(10)
for i in x:
if i < 0:
print(i)
for i in x:
if i >= 0:
continue
print(i)
Avoiding excessive levels of indentation is essential in Python programming – 4 is usually considered the maximum – and continue can be used in a for loop to avoid one level of indentation.
13.2 while
while loops are useful when the number of iterations needed depends on the outcome of the loop contents.
while loops are commonly used when a loop should only stop if a certain condition is met, such as the
change in some parameter is small. The generic structure of a while loop is
while logical:
Code to run
Two things are crucial when using a while loop: first, the logical expression should evaluate to true
when the loop begins (or the loop will be ignored) and second the inputs to the logical expression must
be updated inside the loop. If they are not, the loop will continue indefinitely (hit CTRL+C to break an
interminable loop). The simplest while loops are (verbose) drop-in replacements for for loops:
count = 0
i = 1
while i<=10:
count += i
i += 1
When the loop condition depends on quantities computed inside the loop, the number of iterations required is not known in advance. For example, since randn produces standard normal pseudo-random numbers, a loop that draws until some criterion is met may take many iterations, and no finite for loop can be guaranteed to meet the criterion.
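A sketch of such a loop (the threshold of 2 and the seed are assumptions, not from the text): draw standard normals until one exceeds 2.

```python
import numpy as np

np.random.seed(0)  # seeded for reproducibility; an assumption, not in the text
draws = 0
x = 0.0
while x <= 2:
    x = np.random.randn()
    draws += 1
# The number of iterations is random and unknown in advance
```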
13.2.1 break
break can be used in a while loop to immediately terminate execution. In general, break should not be used
in a while loop – instead the logical condition should be set to False to terminate the loop.
condition = True
i = 0
x = randn(1000)
while condition:
print(x[i])
i += 1
if x[i] > 2:
break
It is better to update the logical statement which determines whether the while loop should execute.
i = 0
while x[i] <= 2:
print(x[i])
i += 1
13.2.2 continue
continue can be used in a while loop to skip an iteration of a loop, immediately returning to the top of the
loop, which then checks the while condition, and executes the loop if it still true. Use of continue in while
loops is also rare.
13.3 Exercises
1. Simulate 1000 observations from an ARMA(2,2) where ε_t are independent standard normal innovations. The process of an ARMA(2,2) is given by

y_t = φ1 y_{t-1} + φ2 y_{t-2} + θ1 ε_{t-1} + θ2 ε_{t-2} + ε_t

Use the values φ1 = 1.4, φ2 = −.8, θ1 = .4 and θ2 = .8. Note: When simulating a process, always simulate more data than needed and throw away the first block of observations to avoid start-up biases. This process is fairly persistent; at least 100 extra observations should be computed.
2. Simulate a GARCH(1,1) process where ε_t are independent standard normal innovations. A GARCH(1,1) process is given by

y_t = ε_t √h_t
h_t = ω + α ε²_{t-1} + β h_{t-1}
3. Simulate a GJR-GARCH(1,1,1) process where ε_t are independent standard normal innovations. A GJR-GARCH(1,1,1) process is given by

y_t = ε_t √h_t
h_t = ω + α ε²_{t-1} + γ ε²_{t-1} I[ε_{t-1}<0] + β h_{t-1}

Use the values ω = 0.05, α = 0.02, γ = 0.07 and β = 0.9 and set h_0 = ω/(1 − α − ½γ − β). Note that some form of logical expression is needed in the loop. I[ε_{t-1}<0] is an indicator variable that takes the value 1 if the expression inside the [ ] is true.
4. Use the values from Exercise 3 for the GJR-GARCH model, and use φ1 = −0.1, θ1 = 0.4 and
λ = 0.03.
5. Find two different methods to use a for loop to fill a 5 × 5 array with i × j where i is the row index,
and j is the column index. One will use range as the iterable, and the other should directly iterate on
the rows, and then the columns of the matrix.
6. Using a while loop, write a bit of code that will do a bisection search to invert a normal CDF. A bisec-
tion search cuts the interval in half repeatedly, only keeping the sub interval with the target in it. Hint:
keep track of the upper and lower bounds of the random variable value and use flow control. This
problem requires stats.norm.cdf.
7. Test the loop by finding the inverse CDF of 0, -3 and pi. Verify it is working by taking the absolute value of the difference between the final value and the value produced by stats.norm.ppf.
Chapter 14
Python supports a wide range of programming styles including procedural (imperative), object-oriented and functional. While object-oriented programming and functional programming are powerful paradigms, especially in large, complex software, procedural programming is often much easier to understand and is often a direct representation of a mathematical formula. The basic idea of procedural programming is to produce a function or set of functions (generically) of the form

y = f(x).

That is, functions take inputs and produce outputs – there can be more than one of either.
14.1 Functions
Python functions are very simple to declare and can occur in a variety of locations, including in the same
file as the main program or in a standalone module. Functions are declared using the def keyword, and the
value produced is returned using the return keyword. Consider a simple function which returns the square
of the input,
y = x².
def square(x):
    return x**2

x = 2.0
y = square(x)
print(x, y)

In this example, the same Python file contains the main program – the bottom 3 lines – as well as the function. More complex functions can be crafted with multiple inputs.
from __future__ import print_function
from __future__ import division
def l2distance(x,y):
return (x-y)**2
import numpy as np
def l2_norm(x,y):
d = x - y
return np.sqrt(np.dot(d,d))
import numpy as np
def l1_l2_norm(x,y):
d = x - y
return sum(np.abs(d)),np.sqrt(np.dot(d,d))
Input values in functions are automatically keyword arguments, so that the function can be called either by placing the inputs in the order they appear in the function signature (positional arguments), or by passing inputs by name using keyword=value.
from __future__ import print_function
from __future__ import division
import numpy as np
def lp_norm(x,y,p):
d = x - y
return sum(abs(d)**p)**(1/p)
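A small usage sketch of the function just defined, showing the same call made positionally and by keyword (the data values are arbitrary):

```python
from __future__ import division
import numpy as np

def lp_norm(x, y, p):
    d = x - y
    return sum(abs(d) ** p) ** (1 / p)

x = np.array([0.0, 3.0, 4.0])
y = np.zeros(3)
positional = lp_norm(x, y, 2)     # inputs in order
keyword = lp_norm(p=2, y=y, x=x)  # inputs by name, in any order
# Both calls compute the L2 norm, 5.0
```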
Because input names are automatically keywords, it is important to use meaningful variable names when possible, rather than generic names such as a, b, c or x, y and z. In some cases, x may be a reasonable default, but in the previous example which computed the Lp norm, calling the third input z would be a bad idea.
Default values are set in the function declaration using the syntax input=default.
from __future__ import print_function
from __future__ import division
import numpy as np
Default values should not normally be mutable (e.g. lists or arrays) since they are only initialized the first time the function is called. Subsequent calls reuse the same object, which means that the default value can change across calls to the function.
from __future__ import print_function
from __future__ import division
import numpy as np
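The listing referenced below can be sketched as follows (a reconstruction; the exact body of bad_function is an assumption):

```python
import numpy as np

def bad_function(x=np.zeros(1)):
    """Mutable default: the array is created once, at definition time,
    and the same object is reused on every subsequent call."""
    print(x)
    x[0] = np.random.randn()

bad_function()  # prints [0.]
bad_function()  # prints whatever the previous call stored
```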
Each call to bad_function() shows that x has a different value – despite the default being 0. The solution to this problem is to initialize mutable objects to None, and then use an if statement to check and initialize.
from __future__ import print_function
from __future__ import division
import numpy as np
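The fix can be sketched as follows (a hypothetical good_function; the name and body are assumptions):

```python
import numpy as np

def good_function(x=None):
    if x is None:
        x = np.zeros(1)  # a fresh array is created on every call
    x[0] = np.random.randn()
    return x

a = good_function()
b = good_function()
# a and b are distinct arrays, so calls no longer interfere
```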
Most functions written for an “end user” have a deterministic number of inputs. However, functions which evaluate other functions often must accept variable numbers of inputs. Variable inputs can be handled using the *arguments or **keywords syntax. The *arguments syntax will generate a tuple containing all inputs past the specified input list. For example, consider extending the Lp function so that it can accept a set of p values as extra inputs (Note: in practice it would make more sense to accept an array for p).
from __future__ import print_function
from __future__ import division
import numpy as np

def lp_norm(x, y, p=2, *arguments):
    d = x - y
    out = [sum(abs(d)**p)**(1/p)]
    # Each additional positional input is treated as another value of p
    for p in arguments:
        print('The L' + str(p) + ' distance is :', sum(abs(d)**p)**(1/p))
        out.append(sum(abs(d)**p)**(1/p))
    return tuple(out)
The alternative syntax, **keywords, generates a dictionary with all keyword inputs which are not in the function signature. One reason for using **keywords is to allow a long list of optional inputs without an excessively long function definition; this is how this input mechanism is often encountered when using other code, for example plot().

import numpy as np

def lp_norm(x, y, **keywords):
    d = x - y
    # Optional inputs are read from the keywords dictionary
    p = keywords.get('p', 2)
    return sum(abs(d)**p)
It is possible to use both *arguments and **keywords in a function definition and their roles do not change – *arguments is a tuple which contains all extraneous non-keyword inputs, and **keywords is a dictionary which contains all extra keyword arguments. Functions with both often have the simple signature y = f(*arguments, **keywords), which allows for a wide range of configurations.
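A minimal sketch of this generic signature (a hypothetical function, not from the text):

```python
def generic(*arguments, **keywords):
    # arguments collects extra positional inputs as a tuple,
    # keywords collects extra keyword inputs as a dictionary
    return len(arguments), sorted(keywords.keys())

n_args, keys = generic(1, 2, 3, p=2, tol=1e-6)
```

Here n_args is 3 and keys lists the keyword names that were passed.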
The docstring is one of the most important elements of any function – especially a function written for consumption by others. The docstring is a special string, enclosed in triple quotation marks, either ''' or """, which is available using help(). When help(fun) is called, Python looks for the docstring which is placed immediately below the function definition.
from __future__ import print_function
from __future__ import division
import numpy as np

def lp_norm(x, y, p=2):
    """lp_norm(x, y, p=2)

    The docstring contains any available help for
    the function. A good docstring should explain the
    inputs and the outputs, provide an example and a list
    of any other related function.
    """
    d = x - y
    return sum(abs(d)**p)**(1/p)
Note that this docstring is not a good example. I suggest following the NumPy guidelines, currently available in the NumPy source repository (or search for numpy docstring); also see NumPy's example.py. These differ from, and are more specialized than, the standard Python docstring guidelines, and so are more appropriate for numerical code. A better docstring for lp_norm would be
from __future__ import print_function
from __future__ import division
import numpy as np
Parameters
----------
x : ndarray
First argument
y : ndarray
Second argument
p : float, optional
Power used in distance calculation, >=0
Returns
-------
output : scalar
Returns the Lp normed distance between x and y
Notes
-----
Examples
--------
>>> x=[0,1,2]
>>> y=[1,2,3]
>>> lp_norm(x,y)
>>> lp_norm(x,y,1)
"""
if p<0: p=0
d = x - y
dist = sum(abs(d)**p)
if p<1:
return dist
else:
return dist**(1/p)
Convention is to use triple double-quotes in docstrings, with r""" used to indicate “raw” strings, which ignore backslashes rather than treating them as escape characters (use u""" if the docstring contains unicode text, which is not usually necessary). A complete docstring may contain, in order, a short summary, an extended description, and sections such as Parameters, Returns, Notes and Examples, as in the example above.
Variable scope determines which functions can access, and possibly modify, a variable. Python determines variable scope using two principles: where the variable appears in the file, and whether the variable is inside a function or in the main program. Variables declared inside a function are local variables and are only available to that function. Variables declared outside a function are global variables, and can be accessed but normally not modified inside functions. Consider the following example, which shows that variables declared at the root of the program before a function is called can be printed by that function.
from __future__ import print_function
from __future__ import division
import numpy as np
a, b, c = 1, 3.1415, 'Python'

def scope():
    print(a)
    print(b)
    print(c)
    # print(d) # Error, d has not been declared yet

scope()
d = np.array(1)

def scope2():
    print(a)
    print(b)
    print(c)
    print(d) # Ok now

scope2()
def scope3():
    a = 'Not a number' # Local variable
    print('Inside scope3, a is ', a)

print('a is ', a)
scope3()
print('a is now ', a)
Using the name of a global variable inside a function does not cause any issues outside of the function. In
scope3, a is given a different value. That value is specific to the function scope3 and outside of the function,
a will have its global value. Generally, global variables can be accessed, but not modified inside a function.
The only exception is when a variable is first declared using the keyword global.
from __future__ import print_function
from __future__ import division
import numpy as np
a = 1

def scope_local():
    a = -1
    print('Inside scope_local, a is ', a)

def scope_global():
    global a
    a = -10
    print('Inside scope_global, a is ', a)

print('a is ', a)
scope_local()
print('a is now ', a)
scope_global()
print('a is now ', a)
One word of caution: a variable name cannot be used as both a local and a global variable in the same function. Attempting to access the variable as a global (e.g. for printing) and then locally assigning it produces an error.
Estimating cross-sectional regressions using time-series data is common practice. When regressors are persistent and errors may not be white noise, standard inference, including White standard errors, is no longer consistent. The most common solution is to use a long-run covariance estimator, and the most common long-run covariance estimator is the Newey-West covariance estimator, which applies a Bartlett kernel to the autocovariances of the scores. This example produces a function which returns parameter estimates, the estimated asymptotic covariance matrix of the parameters, the variance of the regression error, the R², the adjusted R² and the fitted values (or errors, since actual is equal to fit plus errors). These are computed using a T-vector for the regressand (dependent variable), a T by k matrix for the regressors, an indicator for whether to include a constant in the model (default True), and the number of lags to include in the long-run covariance (the default behavior is to determine the lag length automatically based on the sample size). The steps required to produce the function are outlined below.
The function definition is simple and allows for up to 4 inputs, where 2 have default values: def olsnw(y, X, constant=True, lags=None):. The size of the variables is then determined using size and, if needed, the constant is prepended to the regressors using hstack. The regression coefficients are computed using lstsq, and then the Newey-West covariance is computed for both the errors and the scores. The covariance of the parameters is then computed using the NW covariance of the scores. Finally the R² and adjusted R² are computed. A complete code listing is presented in the appendix to this chapter.
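The core steps, omitting the Newey-West covariance, can be sketched as follows (a simplified function; the name ols_fit and the omission of the long-run covariance are my assumptions, not the text's olsnw):

```python
import numpy as np

def ols_fit(y, X, constant=True):
    """Simplified OLS: coefficients, errors and R-square only."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    if constant:
        # Prepend a column of ones using hstack, as described in the text
        X = np.hstack((np.ones((X.shape[0], 1)), X))
    # Coefficients via least squares
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    tss = ((y - y.mean()) ** 2).sum()
    R2 = 1.0 - (e ** 2).sum() / tss
    return b, e, R2
```

For exact data y = 1 + 2x the fit is perfect, so the R-square is 1.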
14.4 Modules
The previous examples all included the function inside the Python file that contained the main program. While this is convenient, especially when writing the function, it hinders use in other code. Modules allow multiple functions to be combined in a single Python file and accessed using import module and then module.function syntax. Suppose a file named core.py contains the following code:
r"""Demonstration module
"""
def square(x):
r"""Returns the square of a scalar input
"""
return x*x
def cube(x):
r"""Returns the cube of a scalar input
"""
return x*x*x
The functions square and cube can be accessed by other files in the same directory using
from __future__ import division
from __future__ import print_function
import core
y = -3
print(core.square(y))
print(core.cube(y))
The functions in core.py can be imported using any of the standard import methods such as
from core import square, cube
or
from core import *
14.4.1 __main__
Normally a module should contain only the code required for the module itself, with other code residing in different files. However, it is possible for a module to be both directly importable and directly runnable. If this is the case, it is important that the directly runnable code is not executed when the module is imported by other code. This can be accomplished using the special construct if __name__=="__main__": before any code that should execute only when run as a standalone program. Consider the following simple example in a module named test.py.
from __future__ import division
from __future__ import print_function
def square(x):
return x**2
if __name__=="__main__":
    print('Program called directly.')
else:
    print('Program called indirectly using name: ', __name__)
14.5 PYTHONPATH
While it is simple to reference files in the same current working directory, this behavior is undesirable for code shared between multiple projects. Fortunately the PYTHONPATH environment variable allows other directories to be added so that they are automatically searched if a matching module cannot be found in the current directory. The current path can be checked by running

>>> import sys
>>> sys.path

On Windows, directories are added to PYTHONPATH using ; as a separator, for example

c:\dir1;c:\dir2;c:\dir2\dir3;

which will add 3 directories to the path. On Linux, PYTHONPATH is stored in .bash_profile, and it should
resemble
PYTHONPATH="${PYTHONPATH}:/dir1/:/dir2/:/dir2/dir3/"
export PYTHONPATH
after three directories have been added, using : as a separator between directories.
14.6 Packages
Packages are the next level beyond modules, and allow, for example, nested module names (e.g. numpy.random
which contains randn). Packages are also installed in the local package library, and can be compiled into
optimized Python byte code, which makes loading modules faster (but does not make code run faster).
Building a package is beyond the scope of these notes, but there are many resources on the internet with
instructions for building packages.
There are a number of common practices which can be adopted to produce Python code which looks more
like code found in other modules:
1. Use 4 spaces to indent blocks – avoid using tab, except when an editor automatically converts tabs to
4 spaces
3. Limit lines to 79 characters. The \ symbol can be used to break long lines
7. Avoid from module import * (for any module). Use either from module import func1, func2 or
import module as shortname.
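A short sketch illustrating these conventions (the function and the values are purely illustrative):

```python
from math import sqrt  # explicit import rather than "from math import *"

def norm2(values):
    # Blocks are indented with 4 spaces, and a long line can be broken
    # with the \ continuation character.
    total = 0.0
    for v in values:
        total = total + \
            v * v
    return sqrt(total)

print(norm2([3.0, 4.0]))  # 5.0
```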
The complete code listing of econometrics, which contains the function olsnw, is presented below.
from numpy import dot, mat, asarray, mean, size, shape, hstack, ones, ceil, \
    zeros, arange, copy
from numpy.linalg import inv, lstsq
def olsnw(y, X, constant=True, lags=None):
    """
    Estimation of a linear regression with Newey-West covariance

    Parameters
    ----------
    y : array_like
        The dependent variable (regressand). 1-dimensional with T elements.
    X : array_like
        The independent variables (regressors). 2-dimensional with sizes T
        and K. Should not include a constant.
    constant : bool, optional
        If true (default) the model includes a constant.
    lags : int or None, optional
        If None, the number of lags is set to 1.2*T**(1/3), otherwise the
        number of lags used in the covariance estimation is set to the value
        provided.

    Returns
    -------
    b : ndarray, shape (K,) or (K+1,)
        Parameter estimates. If constant=True, the first value is the
        intercept.
    vcv : ndarray, shape (K,K) or (K+1,K+1)
        Asymptotic covariance matrix of estimated parameters
    s2 : float
        Asymptotic variance of residuals, computed using the Newey-West
        variance estimator.
    R2 : float
        Model R-square
    R2bar : float
        Adjusted R-square
    e : ndarray, shape (T,)
        Array containing the model errors

    Notes
    -----
    The Newey-West covariance estimator applies a Bartlett kernel to estimate
    the long-run covariance of the scores. Setting lags=0 produces White's
    Heteroskedasticity Robust covariance matrix.

    See also
    --------
    np.linalg.lstsq

    Example
    -------
    >>> X = randn(1000,3)
    >>> y = randn(1000,1)
    >>> b,vcv,s2,R2,R2bar,e = olsnw(y, X)

    Exclude constant:
    """
    T = y.size
    if size(X, 0) != T:
        X = X.T
    T,K = shape(X)
    if constant:
        X = copy(X)
        X = hstack((ones((T,1)),X))
        K = size(X,1)
    if lags is None:
        lags = int(ceil(1.2 * float(T)**(1.0/3)))
    out = lstsq(X,y)
    b = out[0]
    e = y - dot(X,b)
    # Covariance of errors
    gamma = zeros((lags+1))
    for lag in xrange(lags+1):
        gamma[lag] = dot(e[:T-lag],e[lag:]) / T
    w = 1 - arange(0,lags+1)/(lags+1)
    w[0] = 0.5
    s2 = dot(gamma,2*w)
    # Covariance of parameters
    Xe = mat(zeros(shape(X)))
    for i in xrange(T):
        Xe[i] = X[i] * float(e[i])
    Gamma = zeros((lags+1,K,K))
    for lag in xrange(lags+1):
        Gamma[lag] = Xe[lag:].T*Xe[:T-lag]
    Gamma = Gamma/T
    S = Gamma[0].copy()
    for i in xrange(1,lags+1):
        S = S + w[i]*(Gamma[i]+Gamma[i].T)
    XpX = dot(X.T,X)/T
    XpXi = inv(XpX)
    vcv = mat(XpXi)*S*mat(XpXi)/T
    vcv = asarray(vcv)
    R2 = dot(e,e)/dot(y-mean(y),y-mean(y))
    R2bar = 1-R2*(T-1)/(T-K)
    R2 = 1 - R2
    return b,vcv,s2,R2,R2bar,e
Chapter 15
This chapter is divided into two main parts, one for NumPy and one for SciPy. Both packages contain
important functions for simulation, probability distributions and statistics.
NumPy
NumPy random number generators are all stored in the module numpy.random. These can be imported
using import numpy as np and then calling np.random.rand(), for example, or by using import
numpy.random as rnd and calling rnd.rand().1
rand, random_sample
rand and random_sample are uniform random number generators which are identical except that rand takes
a variable number of integer inputs – one for each dimension – while random_sample takes a n -element
tuple. rand is a convenience function for random_sample.
>>> x = rand(3,4,5)
>>> y = random_sample((3,4,5))
randn, standard_normal
randn and standard_normal are standard normal random number generators. randn, like rand, takes a vari-
able number of integer inputs, and standard_normal takes an n -element tuple. Both can be called with
no arguments to generate a single standard normal (e.g. randn()). randn is a convenience function for
standard_normal.
>>> x = randn(3,4,5)
>>> y = standard_normal((3,4,5))
1
Other import methods can also be used, such as from numpy.random import rand and then calling rand()
randint, random_integers
randint and random_integers are uniform integer random number generators which take 3 inputs, low,
high and size. low is the lower bound of the integers generated, high is the upper bound and size is an
n-element tuple. randint and random_integers differ in that randint generates integers exclusive of the value
in high (as do most Python functions), while random_integers includes the value in high.
>>> x = randint(0,10,(100))
>>> x.max() # Is 9 since range is [0,10)
9
>>> y = random_integers(0,10,(100))
>>> y.max() # Is 10 since range is [0,10]
10
shuffle
shuffle randomly reorders the elements of an array in place.
>>> x = arange(10)
>>> shuffle(x)
>>> x
array([4, 6, 3, 7, 9, 0, 2, 1, 8, 5])
permutation
permutation returns a randomly reordered copy of the elements of an array, leaving the original array
unchanged.
>>> x = arange(10)
>>> permutation(x)
array([2, 5, 3, 0, 6, 1, 9, 8, 4, 7])
>>> x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
NumPy provides a large selection of random number generators for specific distributions. All take between 0
and 2 required inputs which are parameters of the distribution, plus a tuple containing the size of the output.
All random number generators are in the module numpy.random.
Bernoulli
There is no Bernoulli generator. Instead use 1 - (rand()>p) to generate a single draw or 1 - (rand(10,10)>p)
to generate an array.
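A short sketch of this construction using a seeded generator (the value of p and the seed are illustrative):

```python
import numpy as np

p = 0.3                          # success probability (illustrative)
rng = np.random.RandomState(0)   # seeded so the example is reproducible

# 1 - (u > p) equals 1 with probability p and 0 otherwise.
single = 1 - (rng.rand() > p)
draws = 1 - (rng.rand(10, 10) > p)
print(single, draws.mean())
```

The sample mean of draws should be close to p for a large array.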
beta
beta(a,b) generates a draw from the Beta(a , b ) distribution. beta(a,b,(10,10)) generates a 10 by 10 array
of draws from a Beta(a , b ) distribution.
binomial
binomial(n, p) generates a draw from the Binomial(n, p) distribution. binomial(n, p, (10,10)) generates
a 10 by 10 array of draws from the Binomial(n, p) distribution.
chisquare
chisquare(nu) generates a draw from the χ²ν distribution, where ν is the degree of freedom. chisquare(nu,(10,10))
generates a 10 by 10 array of draws from the χ²ν distribution.
exponential
exponential() generates a draw from the Exponential distribution with scale parameter λ = 1. exponential(
lambda, (10,10)) generates a 10 by 10 array of draws from the Exponential distribution with scale parame-
ter λ.
f
f(v1,v2) generates a draw from the Fν1,ν2 distribution, where ν1 is the numerator degree of freedom and
ν2 is the denominator degree of freedom. f(v1,v2,(10,10)) generates a 10 by 10 array of draws from the
Fν1,ν2 distribution.
gamma
gamma(a) generates a draw from the Gamma(α, 1) distribution, where α is the shape parameter. gamma(a,
theta, (10,10)) generates a 10 by 10 array of draws from the Gamma(α, θ ) distribution where θ is the scale
parameter.
laplace
laplace() generates a draw from the Laplace (Double Exponential) distribution centered at 0 with unit
scale. laplace(loc, scale, (10,10)) generates a 10 by 10 array of Laplace distributed data with location
loc and scale scale. Using laplace(loc, scale) is equivalent to calling loc + scale*laplace().
lognormal
lognormal() generates a draw from a Log-Normal distribution where the underlying Normal has µ = 0 and
σ = 1. lognormal(mu, sigma, (10,10)) generates a 10 by 10 array of Log-Normally distributed data where
the underlying Normal has mean µ and standard deviation σ.
multinomial
multinomial(n, p) generates a draw from a multinomial distribution using n trials where each outcome
has probability given by p, a k-element array whose elements sum to 1. The output is a k-element array
containing the number of successes in each category. multinomial(n, p, (10,10)) generates a 10 by 10 by
k array of multinomially distributed data with n trials and probabilities p.
multivariate_normal
multivariate_normal(mu, Sigma) generates a draw from a multivariate Normal distribution with mean µ
(k -element array) and covariance Σ (k by k array). multivariate_normal(mu, Sigma, (10,10)) generates
a 10 by 10 by k array of draws from a multivariate Normal distribution with mean µ and covariance Σ.
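A minimal sketch (the mean and covariance values are illustrative):

```python
import numpy as np

mu = np.array([1.0, -1.0])              # illustrative mean
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])           # illustrative covariance

rng = np.random.RandomState(42)
# 10000 draws produce a 10000 by 2 array.
draws = rng.multivariate_normal(mu, Sigma, 10000)
print(draws.mean(0))    # close to mu
print(np.cov(draws.T))  # close to Sigma
```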
negative_binomial
negative_binomial(n, p) generates a draw from the Negative Binomial distribution where n is the number
of failures before stopping and p is the success rate. negative_binomial(n, p, (10, 10)) generates a 10 by
10 array of draws from the Negative Binomial distribution where n is the number of failures before stopping
and p is the success rate.
normal
normal() generates draws from a standard Normal (Gaussian). normal(mu, sigma) generates draws from
a Normal with mean µ and standard deviation σ. normal(mu, sigma, (10,10)) generates a 10 by 10 array
of draws from a Normal with mean µ and standard deviation σ. normal(mu, sigma) is equivalent to
mu + sigma * randn() or mu + sigma * standard_normal().
poisson
poisson() generates a draw from a Poisson distribution with λ = 1. poisson(lambda) generates a draw
from a Poisson distribution with expectation λ. poisson(lambda, (10,10)) generates a 10 by 10 array of
draws from a Poisson distribution with expectation λ.
standard_t
standard_t(nu) generates a draw from a Student’s-t with shape parameter ν . standard_t(nu, (10,10))
generates a 10 by 10 array of draws from a Student -t with shape parameter ν .
uniform
uniform() generates a uniform random variable on (0, 1). uniform(low, high) generates a uniform on
(l , h). uniform(low, high, (10,10)) generates a 10 by 10 array of uniforms on (l , h).
The random number generator can be seeded and its state saved and restored, which allows a sequence of
(pseudo) random numbers to be repeated. See Chapter 16 for more about pseudo-random number generation.
RandomState
RandomState is the class used to control the random number generators. Multiple generators can be initial-
ized by RandomState.
>>> gen1 = np.random.RandomState()
>>> gen2 = np.random.RandomState()
>>> gen1.uniform() # Generate a uniform
0.6767614077579269
>>> state1 = gen1.get_state()
>>> gen2.set_state(state1)
>>> gen2.uniform() # Same uniform as gen1 would produce next, after assigning the state
0.6046087317893271
seed
seed(value) uses value to seed the random number generator. seed() takes actual random data from the
operating system (e.g. /dev/random on Linux, or CryptGenRandom on Windows).
get_state
get_state() gets the current state of the random number generator, which is a 5-element tuple. It can be
called as a function, in which case it gets the state of the default random number generator, or as a method
on a particular instance of RandomState().
set_state
set_state(state) sets the state of the random number generator. It can be called as a function, in which
case it sets the state of the default random number generator, or as a method on a particular instance of
RandomState(). set_state should generally only be called using a state tuple returned by get_state.
mean
mean computes the average of an array. An optional second argument provides the axis to use (default is to
use entire array). mean can be used either as a function or as a method on an array.
>>> x = arange(10.0)
>>> x.mean()
4.5
>>> mean(x)
4.5
>>> x= reshape(arange(20.0),(4,5))
>>> mean(x,0)
array([ 7.5, 8.5, 9.5, 10.5, 11.5])
>>> x.mean(1)
array([ 2., 7., 12., 17.])
median
median computes the median value of an array. An optional second argument provides the axis to use
(default is to use entire array).
>>> x= randn(4,5)
>>> x
array([[-0.74448693, -0.63673031, -0.40608815, 0.40529852, -0.93803737],
[ 0.77746525, 0.33487689, 0.78147524, -0.5050722 , 0.58048329],
[-0.51451403, -0.79600763, 0.92590814, -0.53996231, -0.24834136],
[-0.83610656, 0.29678017, -0.66112691, 0.10792584, -1.23180865]])
>>> median(x)
-0.45558017286810903
>>> median(x, 0)
array([-0.62950048, -0.16997507, 0.18769355, -0.19857318, -0.59318936])
Note that when an array or axis dimension contains an even number of elements (n), median returns the
average of the 2 inner elements.
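For example, with four elements the median is the average of the two middle values:

```python
import numpy as np

# Four elements: the median averages the two inner values, (2 + 3) / 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.median(x))  # 2.5
```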
std
std computes the standard deviation of an array. An optional second argument provides the axis to use
(default is to use entire array). std can be used either as a function or as a method on an array.
var
var computes the variance of an array. An optional second argument provides the axis to use (default is to
use entire array). var can be used either as a function or as a method on an array.
corrcoef
corrcoef(x) computes the correlation between the rows of a 2-dimensional array x. corrcoef(x, y) computes
the correlation between two 1-dimensional vectors. An optional keyword argument rowvar can be
used to compute the correlation between the columns of the input – that is, corrcoef(x, rowvar=False)
and corrcoef(x.T) are identical.
>>> x= randn(3,4)
>>> corrcoef(x)
array([[ 1. , 0.36780596, 0.08159501],
[ 0.36780596, 1. , 0.66841624],
[ 0.08159501, 0.66841624, 1. ]])
>>> corrcoef(x[0],x[1])
array([[ 1. , 0.36780596],
[ 0.36780596, 1. ]])
>>> corrcoef(x, rowvar=False)
array([[ 1. , -0.98221501, -0.19209871, -0.81622298],
[-0.98221501, 1. , 0.37294497, 0.91018215],
[-0.19209871, 0.37294497, 1. , 0.72377239],
[-0.81622298, 0.91018215, 0.72377239, 1. ]])
>>> corrcoef(x.T)
array([[ 1. , -0.98221501, -0.19209871, -0.81622298],
[-0.98221501, 1. , 0.37294497, 0.91018215],
[-0.19209871, 0.37294497, 1. , 0.72377239],
[-0.81622298, 0.91018215, 0.72377239, 1. ]])
cov
cov(x) computes the covariance of an array x. cov(x,y) computes the covariance between two 1-dimensional
vectors. An optional keyword argument rowvar can be used to compute the covariance between the columns
of the input – that is, cov(x, rowvar=False) and cov(x.T) are identical.
histogram
histogram can be used to compute the histogram (empirical frequency, using k bins) of a set of data. An
optional second argument provides the number of bins. If omitted, k =10 bins are used. histogram returns
two outputs, the first with a k -element vector containing the number of observations in each bin, and the
second with the k + 1 endpoints of the k bins.
>>> x = randn(1000)
>>> count, binends = histogram(x)
>>> count
array([ 7, 27, 68, 158, 237, 218, 163, 79, 36, 7])
>>> binends
array([-3.06828057, -2.46725067, -1.86622077, -1.26519086, -0.66416096,
-0.06313105, 0.53789885, 1.13892875, 1.73995866, 2.34098856,
2.94201846])
histogram2d
histogram2d(x, y) computes a 2-dimensional histogram from two 1-dimensional arrays. An optional
keyword argument bins provides the number of bins to use.
SciPy
SciPy provides an extended range of random number generators, probability distributions and statistical
tests.
import scipy
import scipy.stats as stats
SciPy contains a large number of functions for working with continuous random variables. Each function
resides in its own class (e.g. norm for Normal or gamma for Gamma), and classes expose methods for random
number generation, computing the PDF, CDF and inverse CDF, fitting parameters using MLE, and comput-
ing various moments. The methods are listed below, where dist is a generic placeholder for the distribution
name in SciPy. While the functions available for continuous random variables vary in their inputs, all take 3
generic arguments:
1. *args a set of distribution-specific non-keyword arguments. These must be entered in the order listed
in the class docstring. For example, when using an F-distribution, two arguments are needed, one for
the numerator degree of freedom, and one for the denominator degree of freedom.
2. loc a location parameter, which determines the center of the distribution. For example, if z is a
standard normal, then z + l is a normal with location l.
3. scale a scale parameter, which determines the scaling of the distribution. For example, if z is a
standard normal, then s z is a scaled standard normal.
dist.rvs
Pseudo-random number generation. Generically, rvs is called using dist.rvs(*args, loc=0, scale=1, size=size)
where size is an n-element tuple containing the size of the array to be generated.
dist.pdf
Probability density function evaluation for an array of data (element-by-element). Generically, pdf is called
using dist.pdf(x, *args, loc=0, scale=1) where x is an array that contains the values to use when evaluating
the PDF.
dist.logpdf
Log probability density function evaluation for an array of data (element-by-element). Generically, logpdf
is called using dist.logpdf(x, *args, loc=0, scale=1) where x is an array that contains the values to use
when evaluating the log PDF.
dist.cdf
Cumulative distribution function evaluation for an array of data (element-by-element). Generically, cdf is
called using dist.cdf(x, *args, loc=0, scale=1) where x is an array that contains the values to use when
evaluating the CDF.
dist.ppf
Inverse CDF evaluation (also known as the percent point function) for an array of values between 0 and 1.
Generically, ppf is called using dist.ppf(p, *args, loc=0, scale=1) where p is an array with all elements
between 0 and 1 that contains the values to use when evaluating the inverse CDF.
dist.fit
Estimate shape, location, and scale parameters from data by maximum likelihood using an array of data.
Generically, fit is called using dist.fit(data, *args, floc=0, fscale=1) where data is a data array used
to estimate the parameters. floc forces the location to a particular value (e.g. floc=0). fscale similarly
forces the scale to a particular value (e.g. fscale=1). It is necessary to use floc and/or fscale when
computing MLEs if the distribution does not have a location and/or scale. For example, the gamma
distribution is defined using 2 parameters, often referred to as shape and scale. In order to use ML to
estimate parameters from a gamma, floc=0 must be used.
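A sketch of fitting a gamma with the location fixed at 0 (the simulated parameters are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# Simulated Gamma(2, 1) data; the parameters are illustrative.
data = rng.gamma(2.0, 1.0, 10000)

# floc=0 fixes the location so only shape and scale are estimated by ML.
shape, loc, scale = stats.gamma.fit(data, floc=0)
print(shape, loc, scale)  # shape near 2, loc exactly 0, scale near 1
```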
dist.median
Returns the median of the distribution. Generically, median is called using dist.median(*args, loc=0, scale=1).
dist.mean
Returns the mean of the distribution. Generically, mean is called using dist.mean(*args, loc=0, scale=1).
dist.moment
nth non-central moment evaluation of the distribution. Generically, moment is called using
dist.moment(r, *args, loc=0, scale=1) where r is the order of the moment to compute.
dist.var
Returns the variance of the distribution. Generically, var is called using dist.var(*args, loc=0, scale=1).
dist.std
Returns the standard deviation of the distribution. Generically, std is called using dist.std(*args, loc=0, scale=1).
The gamma distribution is used as an example. The gamma distribution takes 1 shape parameter a (a is
the only element of *args), which is set to 2 in all examples.
>>> gamma = stats.gamma
>>> gamma.mean(2), gamma.median(2), gamma.std(2), gamma.var(2)
(2.0, 1.6783469900166608, 1.4142135623730951, 2.0)
>>> gamma.ppf(.95957232, 2)
5.0000000592023914
SciPy provides classes for a large number of distributions. The most important in econometrics are listed
in the table below, along with any required arguments (shape parameters). All classes can be used with
the keyword arguments loc and scale to set the location and scale, respectively. The default location is 0
and the default scale is 1. Setting loc to something other than 0 is equivalent to adding loc to the random
variable. Similarly, setting scale to something other than 1 is equivalent to multiplying the variable by scale.
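A quick check of the loc/scale convention (the values of µ and σ are illustrative): a Normal with loc=µ and scale=σ describes µ + σz for a standard normal z.

```python
from scipy import stats

mu, sigma = 2.0, 3.0
# P(X <= 5) for X ~ N(2, 3**2) equals P(Z <= 1) for standard normal Z.
p1 = stats.norm.cdf(5.0, loc=mu, scale=sigma)
p2 = stats.norm.cdf(1.0)
print(p1, p2)
```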
Distribution Name SciPy Name Required Arguments Notes
Normal norm Use loc to set mean (µ), scale to set std. dev. (σ)
Beta(a , b ) beta a : a, b : b
Cauchy cauchy
χν2 chi2 ν : df
Exponential(λ) expon Use scale to set shape parameter (λ)
Exponential Power exponpow shape: b Nests normal when b=2, Laplace when b=1
F(ν1 , ν2 ) f ν1 : dfn, ν2 : dfd
Gamma(a , b ) gamma a: a Use scale to set scale parameter (b )
Laplace, Double Exponential laplace Use loc to set mean (µ), scale to set std. dev. (σ)
Log Normal(µ, σ2 ) lognorm σ: s µ is always 0.
Student’s-t ν t ν : df
1. Calling the class along with any shape, location and scale parameters, simultaneously with the method.
For example gamma(1, scale=2).cdf(1).
2. Initializing the class with any shape, location and scale arguments and assigning a variable name.
Using the assigned variable name with the method. For example:
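A minimal sketch of this second pattern (the parameter values are illustrative):

```python
from scipy import stats

# Initialize ("freeze") the distribution once with its shape and scale ...
g = stats.gamma(1, scale=2)

# ... then call methods without repeating the parameters.
print(g.cdf(1))
print(g.mean())  # 2.0
```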
mode
mode computes the mode of an array. An optional second argument provides the axis to use (default is to
use entire array). Returns two outputs: the first contains the values of the mode, the second contains the
number of occurrences.
>>> x=randint(1,11,1000)
>>> stats.mode(x)
(array([ 4.]), array([ 112.]))
moment
moment computes the rth central moment of an array. An optional second argument provides the axis to
use (default is to use entire array).
>>> x = randn(1000)
>>> moment = stats.moment
>>> moment(x,2) - moment(x,1)**2
0.94668836546169166
>>> var(x)
0.94668836546169166
>>> x = randn(1000,2)
>>> moment(x,2,0) # axis 0
array([ 0.97029259, 1.03384203])
skew
skew computes the skewness of an array. An optional second argument provides the axis to use (default is
to use entire array).
>>> x = randn(1000)
>>> skew = stats.skew
>>> skew(x)
0.027187705042705772
>>> x = randn(1000,2)
>>> skew(x,0)
array([ 0.05790773, -0.00482564])
kurtosis
kurtosis computes the excess kurtosis (actual kurtosis minus 3) of an array. An optional second argument
provides the axis to use (default is to use entire array). Setting the keyword argument fisher=False will
compute the actual kurtosis.
>>> x = randn(1000)
>>> kurtosis = stats.kurtosis
>>> kurtosis(x)
-0.2112381820194531
>>> x = randn(1000,2)
>>> kurtosis(x,0)
array([-0.13813704, -0.08395426])
pearsonr
pearsonr computes the Pearson correlation between two 1-dimensional vectors. It also returns the 2-tailed
p-value for the null hypothesis that the correlation is 0.
>>> x = randn(10)
>>> y = x + randn(10)
>>> pearsonr = stats.pearsonr
>>> corr, pval = pearsonr(x, y)
>>> corr
0.40806165708698366
>>> pval
0.24174029858660467
spearmanr
spearmanr computes the Spearman correlation (rank correlation). It can be used with a single 2-dimensional
array input, or with 2 1-dimensional arrays. It takes an optional keyword argument axis indicating whether to
treat columns (0) or rows (1) as variables. If the input array has more than 2 variables, it returns the
correlation matrix. If the input array has 2 variables, it returns only the correlation between the variables.
>>> x = randn(10,3)
>>> spearmanr = stats.spearmanr
>>> rho, pval = spearmanr(x)
>>> rho
array([[ 1. , -0.02087009, -0.05867387],
[-0.02087009, 1. , 0.21258926],
[-0.05867387, 0.21258926, 1. ]])
>>> pval
array([[ 0. , 0.83671325, 0.56200781],
[ 0.83671325, 0. , 0.03371181],
[ 0.56200781, 0.03371181, 0. ]])
kendalltau
kendalltau computes Kendall's τ, a measure of rank correlation, between two 1-dimensional arrays, and
returns the 2-tailed p-value for the null hypothesis that τ is 0.
>>> x = randn(10)
>>> y = x + randn(10)
>>> kendalltau = stats.kendalltau
>>> tau, pval = kendalltau(x,y)
>>> tau
0.46666666666666673
>>> pval
0.06034053974834707
linregress
linregress estimates a linear regression between 2 1-dimensional arrays. It takes two inputs, the indepen-
dent variables (regressors) and the dependent variable (regressand). Models always include a constant.
>>> x = randn(10)
>>> y = x + randn(10)
>>> linregress = stats.linregress
>>> slope, intercept, rvalue, pvalue, stderr = linregress(x,y)
>>> slope
1.6976690163576993
normaltest
normaltest tests for normality in an array of data. An optional second argument provides the axis to use
(default is to use entire array). It returns the test statistic and the p-value of the test. This test is a
small-sample modified version of the Jarque-Bera test statistic.
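A short sketch of its use on simulated normal data (the sample size and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.standard_normal(1000)

# Null hypothesis: x is drawn from a normal distribution.
stat, pval = stats.normaltest(x)
print(stat, pval)  # a large p-value is consistent with normality
```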
kstest
kstest implements the Kolmogorov-Smirnov test. Requires two inputs, the data to use in the test and the
distribution, which can be a string or a frozen random variable object. If the distribution is provided as a
string, and if it requires shape parameters, these are passed in the third argument using a tuple containing
all parameters, in order.
>>> x = randn(100)
>>> kstest = stats.kstest
>>> stat, pval = kstest(x, ’norm’)
>>> stat
0.11526423481470172
>>> pval
0.12963296757465059
ks_2samp
ks_2samp implements a 2-sample version of the Kolmogorov-Smirnov test. It is called ks_2samp(x,y) where
both inputs are 1-dimensional arrays, and returns the test statistic and p-value for the null that the
distribution of x is the same as that of y.
shapiro
shapiro implements the Shapiro-Wilk test for normality on a 1-dimensional array of data. It returns the test
statistic and p-value for the null of normality.
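A minimal sketch (the sample size and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.standard_normal(500)

# Null hypothesis: the data are normally distributed.
stat, pval = stats.shapiro(x)
print(stat, pval)  # stat is near 1 when the data look normal
```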
Chapter 16
Computer simulated random numbers are usually constructed from very complex, but ultimately deterministic,
functions. Because they are not actually random, simulated random numbers are generally described
as pseudo-random. All pseudo-random numbers in NumPy use one core random number generator
based on the "Mersenne Twister", a generator which can produce a very long series of pseudo-random
data before repeating (up to 2^19937 − 1 non-repeating values).
16.2 State
Pseudo-random number generators track a set of values known as the state. The state is usually a vector
which has the property that if two instances of the same pseudo-random number generator have the same
state, the sequence of pseudo-random numbers generated will be identical. The state in NumPy can be read
using numpy.random.get_state and can be restored using numpy.random.set_state (Both are available in
IPython).
>>> st = get_state()
>>> randn(4)
array([ 0.37283499, 0.63661908, -1.51588209, -1.36540624])
>>> set_state(st)
>>> randn(4)
array([ 0.37283499, 0.63661908, -1.51588209, -1.36540624])
The two sequences are identical since the state is the same when randn is called. The state is a 5-element
tuple where the second element is a 625 by 1 vector of unsigned 32-bit integers. In practice the
state should only be stored using get_state and restored using set_state.
16.3 Seed
numpy.random.seed is a more useful function for initializing the random number generator, and can be used
in one of two ways. seed() will initialize (or reinitialize) the random number generator using some actual
random data provided by the operating system.1 seed(s) takes a vector of values (which can be scalar) to
initialize the random number generator at a particular state. seed(s) is particularly useful for producing
simulation studies which are reproducible. In the following example, calls to seed() produce different
random numbers, since these reinitialize using random data from the computer, while calls to seed(0)
produce the same sequence of random numbers.
>>> seed()
>>> randn(1)
array([ 0.62968838])
>>> seed()
>>> randn(1)
array([ 2.230155])
>>> seed(0)
>>> randn(1)
array([ 1.76405235])
>>> seed(0)
>>> randn(1)
array([ 1.76405235])
NumPy always calls seed() when the first random number is generated. As a result, calling randn(1) across
two “fresh” sessions will not produce the same random number.
It is important to have reproducible results when conducting a simulation study. There are two methods to
accomplish this:
1. Call seed() and then st = get_state(), and save st to a file which can then be loaded in the future
when running the simulation study.
2. Call seed(s) with a fixed value s at the start of the program.
Either of these will allow the same sequence of random numbers to be used.
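The first method can be sketched as follows (the file name is a placeholder):

```python
import pickle
import numpy as np

rng = np.random.RandomState()
rng.seed()                   # initialize from OS-provided random data
state = rng.get_state()

# Save the state so the study can be re-run identically later.
with open("sim_state.pkl", "wb") as f:
    pickle.dump(state, f)

first_run = rng.standard_normal(5)

# Later (or in a new session): restore the state and repeat the draws.
with open("sim_state.pkl", "rb") as f:
    saved = pickle.load(f)
rng2 = np.random.RandomState()
rng2.set_state(saved)
second_run = rng2.standard_normal(5)
print(np.array_equal(first_run, second_run))  # True
```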
Warning: Do not over-initialize the pseudo-random number generators. The generators should be initial-
ized once per session and then allowed to produce the pseudo-random sequence. Repeatedly re-initializing
the pseudo-random number generators will produce a sequence that is decidedly less random than the gen-
erator was designed to provide.
Simulation studies are ideally suited to parallelization, although parallel code makes reproducibility more
difficult. There are 2 methods which can ensure that a parallel study is reproducible.
1
All modern operating systems collect data that is effectively random by collecting noise from device drivers and other system
monitors.
1. Have a single process produce all of the random numbers, where this process has been initialized us-
ing one of the two methods discussed in the previous section. Formally this can be accomplished by
pre-generating all random numbers, and then passing these into the simulation code as a parameter,
or equivalently by pre-generating the data and passing the state into the function. Inside the simula-
tion function, the random number generator will be set to the state which was passed as a parameter.
The latter is a better option if the amount of data per simulation is large.
2. Seed each parallel worker independently, and save the state inside the simulation function. The state
should be returned and saved along with the simulation results. Since the state is saved for each
simulation, it is possible to use the same state if repeating the simulation using, for example, a different
estimator.
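The second method can be sketched as follows (the worker count and the simulation itself are illustrative):

```python
import numpy as np

def simulate(worker_id):
    # Each worker has its own generator, seeded independently; the state
    # is captured so the simulation can be repeated exactly.
    rng = np.random.RandomState(worker_id)
    state = rng.get_state()
    result = rng.standard_normal(100).mean()
    return result, state

results = [simulate(i) for i in range(4)]

# Restoring a saved state reproduces that simulation exactly.
result0, state0 = results[0]
rng = np.random.RandomState()
rng.set_state(state0)
repeat = rng.standard_normal(100).mean()
print(repeat == result0)  # True
```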
Chapter 17
Optimization
The optimization toolbox contains a number of routines to find the extremum of a user-supplied objective
function. Most of these implement a form of the Newton-Raphson algorithm which uses the gradient to
find the minimum of a function. Note: The optimization routines can only find minima. However, if f is a
function to be maximized, −f is a function with its minimum located at the same point as the maximum
of f.
A custom function that returns the function value at a set of parameters – for example a log-likelihood
or a GMM quadratic form – must be constructed in order to use one of the optimizers. All optimization
targets must have the parameters as the first argument. For example, consider finding the minimum of x².
A function which allows the optimizer to work correctly has the form
def optim_target1(x):
    return x**2
When multiple parameters (a parameter vector) are used, the objective function must take the form
def optim_target2(params):
    x, y = params
    return x**2-3*x+3+y*x-3*y+y**2
Optimization targets can have additional inputs that are not parameters of interest such as data or hyper-
parameters.
def optim_target3(params,hyperparams):
    x, y = params
    c1, c2, c3 = hyperparams
    return x**2+c1*x+c2+y*x+c3*y+y**2
This form is useful when optimization targets require at least two inputs: parameters and data. Once an
optimization target has been specified, the next step is to use one of the optimizers to find the minimum.
SciPy contains a large number of optimizers.
17.1 Unconstrained Optimization
A number of functions are available for unconstrained optimization using derivative information. Each uses
a different algorithm to determine the best direction to move and the best step size to take in that direction.
The basic structure of all of the unconstrained optimizers is
optimizer(f, x0)
where optimizer is one of fmin_bfgs, fmin_cg, fmin_ncg or fmin_powell, f is a callable function and x0 is
an initial value used to start the algorithm. All of the unconstrained optimizers take the following keyword
arguments, except where noted:
Keyword Description Note
fmin_bfgs
fmin_bfgs is a classic optimizer which uses information in the first derivative to estimate the second
derivative, an approach known as the BFGS algorithm (after the initials of its creators). This is probably
the first choice when trying an optimization problem. A function which returns the first derivative of the
problem can be provided; if not provided, it is numerically approximated. The basic use of fmin_bfgs for
optimizing optim_target1 is shown below.
>>> opt.fmin_bfgs(optim_target1, 2)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 2
Function evaluations: 12
Gradient evaluations: 4
array([ -7.45132576e-09])
This is a very simple function to minimize and the solution is accurate to 8 decimal places. fmin_bfgs can
also use first derivative information, which is provided using a function which must have the same inputs
as the optimization target. In this simple example, f′(x) = 2x.
def optim_target1_grad(x):
    return 2*x
The derivative information is used through the keyword argument fprime. Using analytic derivatives may
improve the accuracy of the solution and will require fewer function evaluations to find the solution.
>>> opt.fmin_bfgs(optim_target1, 2, fprime = optim_target1_grad)
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 2
Function evaluations: 4
Gradient evaluations: 4
array([ 2.71050543e-20])
Multivariate optimization problems are defined using an array for the starting values, but are otherwise
identical.
>>> opt.fmin_bfgs(optim_target2, array([1.0,2.0]))
Optimization terminated successfully.
Current function value: 0.000000
Iterations: 3
Function evaluations: 20
Gradient evaluations: 5
array([ 1. , 0.99999999])
Additional inputs are passed through to the optimization target using the keyword argument args and a
tuple containing the input arguments in the correct order. Note that since there is a single additional input,
the comma is necessary in (hyperp,) to let Python know that this is a tuple.
>>> hyperp = array([1.0,2.0,3.0])
>>> opt.fmin_bfgs(optim_target3, array([1.0,2.0]), args=(hyperp ,))
Optimization terminated successfully.
Current function value: -0.333333
Iterations: 3
Function evaluations: 20
Gradient evaluations: 5
array([ 0.33333332, -1.66666667])
Derivative functions can be produced in a similar manner, although the derivative of a scalar function with respect to an n-element vector is an n-element vector. It is important that the derivative (or gradient) returned has the same order as the input parameters. Note that both inputs must be present, even if not needed, and in the same order.
def optim_target3_grad(params,hyperparams):
x, y = params
c1, c2, c3=hyperparams
return array([2*x+c1+y,x+c3+2*y])
Using the analytical derivative reduces the number of function evaluations and produces the same result.
>>> opt.fmin_bfgs(optim_target3, array([1.0,2.0]), fprime=optim_target3_grad, args=(hyperp ,))
Optimization terminated successfully.
Current function value: -0.333333
Iterations: 3
Function evaluations: 5
Gradient evaluations: 5
array([ 0.33333333, -1.66666667])
fmin_cg
fmin_cg uses a nonlinear conjugate gradient method to minimize a function. A function which returns the first derivative of the problem can be provided; if not provided, it is numerically approximated.
>>> opt.fmin_cg(optim_target3, array([1.0,2.0]), args=(hyperp ,))
Optimization terminated successfully.
Current function value: -0.333333
Iterations: 7
Function evaluations: 59
Gradient evaluations: 12
array([ 0.33333334, -1.66666666])
fmin_ncg
fmin_ncg use a Newton conjugate gradient method. fmin_ncg also requires a function which can compute
the first derivative of the optimization target, and can also take a function which returns the second deriva-
tive of the optimization target. It not provided, the hessian will be numerically approximated.
>>> opt.fmin_ncg(optim_target3, array([1.0,2.0]), optim_target3_grad, args=(hyperp,))
Optimization terminated successfully.
Current function value: -0.333333
Iterations: 5
Function evaluations: 6
Gradient evaluations: 21
Hessian evaluations: 0
array([ 0.33333333, -1.66666666])
The Hessian can optionally be provided to fmin_ncg using the keyword argument fhess. The Hessian function returns ∂²f/∂x∂x′, which is an n by n array of derivatives. In this simple problem, the Hessian does not depend on the hyperparameters, although the Hessian function must take the same inputs as the optimization target.
def optim_target3_hess(params,hyperparams):
    x, y = params
    c1, c2, c3 = hyperparams
    return array([[2, 1], [1, 2]])
Using an analytical Hessian can reduce the number of function evaluations. While in theory an analytical Hessian should produce better results, it may not improve convergence, especially if for some parameter values the Hessian is nearly singular (for example, near a saddle point which is not a minimum).
>>> opt.fmin_ncg(optim_target3, array([1.0,2.0]), optim_target3_grad, \
... fhess = optim_target3_hess, args=(hyperp ,))
Optimization terminated successfully.
Current function value: -0.333333
Iterations: 5
Function evaluations: 6
Gradient evaluations: 5
Hessian evaluations: 5
array([ 0.33333333, -1.66666667])
In addition to the keyword arguments outlined in the main table, fmin_ncg can take additional arguments; see its docstring for the complete list.
Derivative free optimizers do not use derivative information and so can be used in a wider variety of problems, such as functions which are not continuously differentiable. Derivative free optimizers can also be used for functions which are continuously differentiable as an alternative to the derivative-based methods, although they are likely to be slower. Derivative free optimizers take some alternative keyword arguments; see each function's docstring for details.
fmin
fmin uses a simplex algorithm to minimize a function. The optimization in a simplex algorithm is often described as an amoeba which crawls around on the function surface, expanding and contracting while looking for lower points. The method is derivative free, and so the optimization target need not be continuously differentiable, for example the “tick” loss function used in the estimation of quantile regression.
def tick_loss(quantile, data, alpha):
    e = data - quantile
    return dot((alpha - (e < 0)), e)
The tick loss function can be used to estimate the median by setting α = 0.5. This loss function is not continuously differentiable, and so derivative-based optimizers may fail to converge.
>>> data = randn(1000)
>>> opt.fmin(tick_loss, 0, args=(data, 0.5))
Optimization terminated successfully.
Iterations: 48
Function evaluations: 91
array([-0.00539])
>>> median(data)
-0.0053901030307567602
fmin_powell
fmin_powell uses Powell’s method, which is derivative free, to minimize a function. It is an alternative to fmin which uses a different algorithm.
Constrained optimization is frequently encountered in economic problems where parameters are only meaningful in some particular range – for example, a variance must be weakly positive. The relevant class of constrained optimization problems can be formulated as
min_θ f(θ)  subject to  g(θ) = 0       (equality)
                        h(θ) ≥ 0       (inequality)
                        θ_L ≤ θ ≤ θ_H  (bounds)
where the bounds constraints are redundant if the optimizer allows for general inequality constraints, since if a scalar x satisfies x_L ≤ x ≤ x_H, then x − x_L ≥ 0 and x_H − x ≥ 0. The optimizers in SciPy allow for different subsets of these configurations.
fmin_slsqp
fmin_slsqp is the most general constrained optimizer and allows for equality, inequality and bounds constraints. While bounds are redundant, constraints which take the form of bounds should be implemented using bounds since this provides more information directly to the optimizer. Constraints are provided either as a list of callable functions or as a single function which returns an array. The latter is simpler if there are multiple constraints, especially if the constraints can be easily calculated using linear algebra. Functions which compute the derivative of the optimization target, the derivative of the equality constraints, and the derivative of the inequality constraints can be optionally provided. If not provided, these are numerically approximated.
As an example, consider the problem of optimizing a CRS Cobb-Douglas utility function of the form U(x_1, x_2) = x_1^λ x_2^(1−λ) subject to a budget constraint p_1 x_1 + p_2 x_2 ≤ 1. This is a nonlinear function subject to a linear constraint (note that it must also be the case that x_1 ≥ 0 and x_2 ≥ 0). First, specify the optimization target
def utility(x, p, alpha):
# Minimization, not maximization so -1 needed
return -1.0 * (x[0]**alpha)*(x[1]**(1-alpha))
There are three constraints, x_1 ≥ 0, x_2 ≥ 0 and the budget line. All constraints must take the form of a ≥ 0 constraint, so the budget line can be reformulated as 1 − p_1 x_1 − p_2 x_2 ≥ 0. Note that the arguments in the constraint must be identical to those of the optimization target, which is why in this case the utility function takes prices as an input, which are not needed, and the constraint takes α, which does not affect the budget line.
def utility_constraints(x, p, alpha):
return array([x[0], x[1], 1 - p[0]*x[0] - p[1]*x[1]])
The optimal combination of goods can be computed using fmin_slsqp once the starting values and other inputs for the utility function and budget constraint are constructed.
>>> p = array([1.0,1.0])
>>> alpha = 1.0/3
>>> x0 = array([.4,.4])
>>> opt.fmin_slsqp(utility, x0, f_ieqcons=utility_constraints, args=(p, alpha))
Optimization terminated successfully. (Exit mode 0)
Current function value: -0.529133683989
Iterations: 2
Function evaluations: 8
Gradient evaluations: 2
array([ 0.33333333, 0.66666667])
fmin_slsqp can also take functions which compute the gradient of the optimization target, as well as the gradients of the constraint functions (both inequality and equality). The gradient of the optimization target should return an n-element vector, one element for each parameter of the problem.
def utility_grad(x, p, alpha):
grad = zeros(2)
grad[0] = -1.0 * alpha * (x[0]**(alpha-1))*(x[1]**(1-alpha))
grad[1] = -1.0 * (1-alpha) * (x[0]**(alpha))*(x[1]**(-alpha))
return grad
The gradient of the constraint function returns an m by n array where m is the number of constraints. When both equality and inequality constraints are used, the numbers of constraints will be m_eq and m_in, which will generally not be the same.
def utility_constraint_grad(x, p, alpha):
grad = zeros((3,2)) # 3 constraints, 2 variables
grad[0,0] = 1.0
grad[0,1] = 0.0
grad[1,0] = 0.0
grad[1,1] = 1.0
grad[2,0] = -p[0]
grad[2,1] = -p[1]
return grad
Like in other problems, gradient information reduces the number of iterations and/or function evaluations
needed to find the optimum.
fmin_slsqp also accepts bounds constraints. Since two of the three constraints were simply x_1 ≥ 0 and x_2 ≥ 0, these can be easily specified as bounds. Bounds are given as a list of tuples, with one tuple for each variable containing its lower and upper bound. It is not always possible to use np.inf as the upper bound, even if there is no natural upper bound, since this may produce a nan. In this example, 2 was used as the upper bound since it was outside of the possible range given the constraint. Using bounds also requires reformulating the budget constraint to only include the budget line.
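The code for this bounds-based formulation appears to have been lost in extraction. A sketch, assuming the same utility function and prices as above and a constraint function containing only the budget line (the function name budget_line is an assumption), would be:

```python
import numpy as np
import scipy.optimize as opt

def utility(x, p, alpha):
    # Negated utility, since fmin_slsqp minimizes
    return -1.0 * (x[0] ** alpha) * (x[1] ** (1 - alpha))

def budget_line(x, p, alpha):
    # Only the budget constraint; non-negativity is handled by bounds
    return np.array([1 - p[0] * x[0] - p[1] * x[1]])

p = np.array([1.0, 1.0])
alpha = 1.0 / 3
x0 = np.array([0.4, 0.4])
xstar = opt.fmin_slsqp(utility, x0, f_ieqcons=budget_line,
                       bounds=[(0.0, 2.0), (0.0, 2.0)], args=(p, alpha))
```

The solution is the same as with the three-function constraint, approximately (1/3, 2/3).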
The use of non-linear constraints can be demonstrated by formulating the dual problem, that of cost
minimization subject to achieving a minimal amount of utility. In this alternative formulation, the opti-
mization problems becomes
min_{x_1, x_2}  p_1 x_1 + p_2 x_2  subject to  U(x_1, x_2) ≥ Ū
def total_expenditure(x,p,alpha,Ubar):
return dot(x,p)
def min_utility_constraint(x,p,alpha,Ubar):
x1,x2 = x
u=x1**(alpha)*x2**(1-alpha)
return array([u - Ubar]) # >= constraint, must be array, even if scalar
The objective and the constraint are used along with a bounds constraint to solve the constrained opti-
mization problem.
>>> x0 = array([1.0,1.0])
>>> p = array([1.0,1.0])
>>> alpha = 1.0/3
>>> Ubar = 0.529133683989
>>> opt.fmin_slsqp(total_expenditure, x0, f_ieqcons=min_utility_constraint, \
... args=(p, alpha, Ubar), bounds =[(0.0,2.0),(0.0,2.0)])
Optimization terminated successfully. (Exit mode 0)
Current function value: 0.999999999981
Iterations: 6
Function evaluations: 26
Gradient evaluations: 6
array([ 0.33333333, 0.66666667])
fmin_tnc
fmin_tnc uses a truncated Newton method and supports only bounds constraints.
fmin_l_bfgs_b
fmin_l_bfgs_b is a limited-memory BFGS optimizer which supports only bounds constraints.
fmin_cobyla
fmin_cobyla supports only inequality constraints, which must be provided as a list of functions. Since it
supports general inequality constraints, bounds constraints are included as a special case, although these
must be included in the list of constraint functions.
def utility_constraints1(x, p, alpha):
    return x[0]

def utility_constraints2(x, p, alpha):
    return x[1]

def utility_constraints3(x, p, alpha):
    return 1 - p[0]*x[0] - p[1]*x[1]
Note that fmin_cobyla takes a list rather than an array for the starting values. Using an array produces a
warning, but otherwise works.
>>> p = array([1.0,1.0])
>>> alpha = 1.0/3
>>> x0 = array([.4,.4])
>>> cons = [utility_constraints1, utility_constraints2, utility_constraints3]
>>> opt.fmin_cobyla(utility, x0, cons, args=(p, alpha), rhoend=1e-7)
array([ 0.33333326, 0.66666674])
17.3.1 Reparameterization
Many constrained optimization problems can be converted into an unconstrained program by reparameterizing from the space of unconstrained variables into the space where the parameters must reside. For example, the constraints in the utility function optimization problem require 0 ≤ x_1 ≤ 1/p_1 and 0 ≤ x_2 ≤ 1/p_2. Additionally, the budget constraint must be satisfied, so that if x_1 ∈ [0, 1/p_1], then x_2 ∈ [0, (1 − p_1 x_1)/p_2]. These constraints can be implemented using a “squasher” function which maps x_1 into its domain, and x_2 into its domain, and is one-to-one and onto (i.e. a bijection). For example,
x_1 = (1/p_1) · e^{z_1}/(1 + e^{z_1}),   x_2 = ((1 − p_1 x_1)/p_2) · e^{z_2}/(1 + e^{z_2})
will always satisfy the constraints, and so the constrained utility function can be mapped to an uncon-
strained problem, which can then be optimized using an unconstrained optimizer.
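The definition of reparam_utility, used below, appears to have been lost in extraction. A sketch implementing the squasher mapping above (the printX keyword matches its use below; exact details are an assumption):

```python
import numpy as np

def reparam_utility(z, p, alpha, printX=False):
    # Map unconstrained z into the feasible region via logistic "squashers"
    x = np.exp(z) / (1 + np.exp(z))
    x[0] = (1.0 / p[0]) * x[0]
    x[1] = (1 - p[0] * x[0]) / p[1] * x[1]
    if printX:
        print(x)
    # Negated Cobb-Douglas utility, as in the constrained problem
    return -1.0 * (x[0] ** alpha) * (x[1] ** (1 - alpha))
```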
The unconstrained utility function can be minimized using fmin_bfgs. Note that the solution returned
is in the transformed space, and so a special call to reparam_utility is used to print the actual values of x
at the solution (which are virtually identical to those found using the constrained optimizer).
>>> x0 = array([.4,.4])
>>> optX = opt.fmin_bfgs(reparam_utility, x0, args=(p,alpha))
Optimization terminated successfully.
Current function value: -0.529134
Iterations: 24
Function evaluations: 104
Gradient evaluations: 26
>>> reparam_utility(optX, p, alpha, printX=True)
[ 0.33334741 0.66665244]
SciPy provides a number of scalar function minimizers. These are very fast since additional techniques are
available for solving scalar problems which are not applicable when the parameter vector has more than 1
element. A simple quadratic function will be used to illustrate the scalar solvers. Scalar function minimizers
do not require starting values, but may require bounds for the search.
golden
golden uses a golden section search algorithm to find the minimum of a scalar function. It can optionally
be provided with bracketing information which can speed up the solution.
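optim_target5, used below, is defined earlier in the chapter. A quadratic consistent with hyperp = [1.0, −2.0, 3] and the reported minimizer of approximately 1 would be (a reconstruction, not necessarily the original code):

```python
def optim_target5(x, hyperparams):
    # c1*x**2 + c2*x + c3 has its minimum at x = -c2/(2*c1),
    # which is 1 for hyperparams [1.0, -2.0, 3]
    c1, c2, c3 = hyperparams
    return c1 * x ** 2 + c2 * x + c3
```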
>>> hyperp = array([1.0, -2.0, 3])
>>> opt.golden(optim_target5, args=(hyperp,))
0.999999992928981
>>> opt.golden(optim_target5, args=(hyperp,), brack=[-10.0,10.0])
0.9999999942734483
brent
Non-linear least squares is similar to general function minimization. In fact, a generic function minimizer
can (attempt to) minimize a NLLS problem. The main difference is that the optimization target returns a
vector of errors rather than the sum of squared errors.
def nlls_objective(beta, y, X):
b0 = beta[0]
b1 = beta[1]
b2 = beta[2]
return y - b0 - b1 * (X**b2)
A simple non-linear model is used to demonstrate leastsq, the NLLS optimizer in SciPy:
y_i = 10 + 2x_i^{1.5} + e_i
leastsq returns a tuple containing the solution, which is very close to the true values, as well as a flag indicating whether convergence was achieved. leastsq takes many of the same additional keyword arguments as other optimizers, including full_output, ftol, xtol, gtol and maxfev (same as maxfun), along with some additional keyword arguments described in its docstring.
Chapter 18
Dates and Times
Date and time manipulation is provided by the built-in Python module datetime. This chapter assumes that datetime has been imported using import datetime.
Dates are created using date using years, months and days and times are created using time using hours,
minutes, seconds and microseconds.
>>> import datetime as dt
>>> yr = 2012; mo = 12; dd = 21
>>> dt.date(yr, mo, dd)
datetime.date(2012, 12, 21)
Dates created using date do not allow times, and dates which require a time stamp can be created using datetime, which borrows the inputs from date and time, in order.
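The combined inputs can be sketched as follows, with the date components first and the time components after:

```python
import datetime as dt

# year, month, day, then hour, minute, second, microsecond
d = dt.datetime(2012, 12, 21, 12, 21, 12, 21)
```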
Date-times and dates (but not times, and only within the same type) can be subtracted to produce a timedelta, which consists of three values: days, seconds and microseconds. Time deltas can also be added to dates and times to compute different dates – although date types will ignore any information in the time delta’s hour or millisecond fields.
>>> hr = 12; mm = 21; ss = 12; ms = 20
>>> d1 = dt.datetime(yr, mo, dd, hr, mm, ss, ms)
>>> d2 = dt.datetime(yr + 1, mo, dd, hr, mm, ss, ms)
>>> d2-d1
datetime.timedelta(365)
>>> d2 + dt.timedelta(30,0,0)
datetime.datetime(2014, 1, 20, 12, 21, 12, 20)
If accurate time stamps are important, date types can be promoted to datetime using combine.
>>> d3 = dt.date(2012,12,21)
>>> dt.datetime.combine(d3, dt.time(0))
datetime.datetime(2012, 12, 21, 0, 0)
Values in dates, times and datetimes can be modified using replace, which takes keyword arguments.
>>> d3 = dt.datetime(2012,12,21,12,21,12,21)
>>> d3.replace(month=11,day=10,hour=9,minute=8,second=7,microsecond=6)
datetime.datetime(2012, 11, 10, 9, 8, 7, 6)
Chapter 19
Graphics
Matplotlib contains a complete graphics library for producing high-quality graphics using Python. Matplotlib contains both a number of high level functions which produce particular types of figures, for example a simple line plot or a bar chart, as well as a low level set of functions for creating new types of charts. Matplotlib is primarily a 2D plotting library, although it also supports 3D plotting, which is sufficient for most applications. This chapter covers the basics of producing plots using Python and matplotlib. It only scratches the surface of the capabilities of matplotlib; more information is available on the matplotlib website or in books dedicated to producing print quality graphics using matplotlib.
19.1 2D Plotting
All examples in this chapter assume that matplotlib.pyplot has been imported as plt and NumPy as np. Other modules will be included only when needed for a specific graphic.
The most basic, and often most useful 2D graphic is a line plot. Line plots are produced using plot, which
in its simplest form, takes a single input containing a 1-dimensional array.
>>> y = np.random.randn(100)
>>> plt.plot(y)
The output of this command is presented in panel (a) of figure 19.1. A more complex use of plot includes a format string which has 1 to 3 elements: a color, represented using a letter (e.g. g for green), a marker symbol, which is either a letter or a symbol (e.g. s for square, ^ for triangle up), and a line style, which is always a symbol or series of symbols. In the next example, ’g--’ indicates green (g) and dashed line (--).
>>> plt.plot(y,’g--’)
Color: b (blue), g (green), r (red), c (cyan), m (magenta), y (yellow), k (black), w (white)
Marker: . (point), o (circle), s (square), ^ (triangle up), v (triangle down), x (cross), * (star), H (hexagon), D (diamond)
Line Style: - (solid), -- (dashed), -. (dash-dot), : (dotted)
The default behavior is to use a blue solid line with no marker (unless there is more than one line, in which case the colors will alternate, in order, through those in the Color list, skipping white). Format strings can contain 1 or more of the three categories of formatting information. For example, kx-- would produce a black dashed line with crosses marking the points, *: would produce a dotted line with the default color using stars to mark points, and yH would produce a solid yellow line with a hexagon marker.
When only one array is provided, the default x-axis values 0, 1, . . . are used. plot(x,y) can be used to plot specific x values against y values. Panel (c) shows the results of running the following code.
>>> x = np.cumsum(np.random.rand(100))
>>> plt.plot(x,y,’r-’)
While format strings are useful for quickly adding meaningful colors or line styles to a plot, they only expose
a small number of the customizations available. The next example shows how keyword arguments can be
used to add many useful customizations to a plot. Panel (d) contains the plot produced by the following
code.
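The code block for this example appears to have been lost in extraction. A sketch of a plot call using common keyword arguments (the specific colors, labels and sizes are illustrative assumptions; matplotlib.use("Agg") is included only so the script runs without a display) might be:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import numpy as np

x = np.cumsum(np.random.rand(100))
y = np.random.randn(100)
# Customize color, transparency, line style and markers via keywords
line, = plt.plot(x, y, alpha=0.5, color="#FF7F00", label="Line Label",
                 linestyle="-.", linewidth=3, marker="o",
                 markeredgecolor="#000000", markerfacecolor="#FF7F00",
                 markersize=8)
```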
Note that in long plot commands, \ is used to indicate to the Python interpreter that a statement spans multiple lines.
Many more keyword arguments are available for a plot. The full list can be found in the docstring or by
running the following code. The functions getp and setp can be used to get the list of properties for a line
(or any matplotlib object). setp can also be used to set a particular property.
>>> h = plot(randn(10))
>>> matplotlib.artist.getp(h)
agg_filter = None
alpha = None
animated = False
antialiased or aa = True
axes = Axes(0.125,0.1;0.775x0.8)
children = []
clip_box = TransformedBbox(Bbox(array([[ 0., 0.], [ 1...
clip_on = True
clip_path = None
color or c = b
contains = None
dash_capstyle = butt
dash_joinstyle = round
data = (array([ 0., 1., 2., 3., 4., 5., 6., 7., 8...
drawstyle = default
figure = Figure(652x492)
fillstyle = full
gid = None
label = _line0
linestyle or ls = -
linewidth or lw = 1.0
marker = None
markeredgecolor or mec = b
markeredgewidth or mew = 0.5
markerfacecolor or mfc = b
markerfacecoloralt or mfcalt = none
markersize or ms = 6
markevery = None
path = Path([[ 0. -0.27752688] [ 1. 0.3...
picker = None
pickradius = 5
rasterized = None
snap = None
solid_capstyle = projecting
solid_joinstyle = round
transform = CompositeGenericTransform(TransformWrapper(Blended...
transformed_clip_path_and_affine = (None, None)
url = None
visible = True
xdata = [ 0. 1. 2. 3. 4. 5.]...
xydata = [[ 0. -0.27752688] [ 1. 0.376091...
ydata = [-0.27752688 0.37609185 -0.24595304 0.28643729 ...
zorder = 2
Scatter plots are little more than a line plot without the line and with markers. scatter produces a scatter plot between two 1-dimensional arrays. All examples use a set of simulated normal data with unit variance and correlation of 50%. The output of the basic scatter command is presented in figure 19.2, panel (a).
>>> z = np.random.randn(100,2)
>>> z[:,1] = 0.5*z[:,0] + np.sqrt(0.5)*z[:,1]
>>> x=z[:,0]
>>> y=z[:,1]
>>> plt.scatter(x,y)
Scatter plots can also be modified using keyword arguments. The most important are included in the next
example, and have identical meaning to those used in the line plot examples. The effect of these keyword
arguments can be see in panel (b).
>>> plt.scatter(x,y, s = 60, c = ’#FF7F00’, marker=’s’, \
... alpha = .5, label = ’Scatter Data’)
One interesting use of scatter is to add a third dimension to the plot by including an array of size data, which allows the size of the markers to convey extra information. The use of variable size data is illustrated in the code below, which produced the scatter plot in panel (c).
>>> s = np.exp(np.exp(np.exp(np.random.rand(100))))
>>> s = 200 * s/np.max(s)
>>> plt.scatter(x,y, s = s)
In some scenarios it is advantageous to have multiple plots or charts in a single figure. Implementing this is simple using figure and then add_subplot. figure is used to initialize the figure window. Subplots can then be added to the figure using a grid notation with m rows and n columns, where 1 is the upper left, 2 is to the right of 1, and so on until the end of a row, where the next element is below 1. For example, the plots in a 3 by 2 subplot have indices
1 2
3 4
5 6
add_subplot is called using the notation add_subplot(mni) or add_subplot(m,n,i) where m is the number of rows, n is the number of columns and i is the index of the subplot. Note that subplots require the subplot axes to be called as a method from figure. Also note that the next code block is sufficiently long that it isn’t practical to run interactively, and that plt.show() is used to force an update to the window to ensure that all plots and charts are visible. Figure 19.6 contains the result of running the code below.
fig = plt.figure()
# Add the subplot to the figure
# Panel 1
ax = fig.add_subplot(2,2,1)
y = np.random.randn(100)
plt.plot(y)
ax.set_title(’1’)
# Panel 2
y = np.random.rand(5)
x = np.arange(5)
Figure 19.8: Figures with titles and legend produced using title and legend.
>>> plt.plot(x[:,0],’b-’,label = ’Series 1’)
>>> plt.plot(x[:,1],’g-.’,label = ’Series 2’)
>>> plt.plot(x[:,2],’r:’,label = ’Series 3’)
>>> plt.legend()
>>> plt.title(’Basic Legend’)
legend takes keyword arguments which can be used to change its location (loc and an integer, see the
docstring), remove the frame (frameon) and add a title to the legend box (title). The output of a simple
example using these options is presented in panel (b).
>>> plt.plot(x[:,0],’b-’,label = ’Series 1’)
>>> plt.hold(True)
>>> plt.plot(x[:,1],’g-.’,label = ’Series 2’)
>>> plt.plot(x[:,2],’r:’,label = ’Series 3’)
>>> plt.legend(loc = 0, frameon = False, title = ’Data’)
>>> plt.title(’Improved Legend’)
Plots with date x-values on the x-axis are important when using time series data. Producing basic plots with
dates is as simple as plot(x,y) where x is a list or array of dates. This first block of code simulates a random
walk and constructs 2000 datetime values beginning with March 1, 2012 in a list.
import numpy as np
import numpy.random as rnd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
# Simulate data
T = 2000
x = []
for i in xrange(T):
x.append(dt.datetime(2012,3,1)+dt.timedelta(i,0,0))
y = np.cumsum(rnd.randn(T))
A basic plot with dates only requires calling plot(x,y) on the x and y data. The output of this code is in
panel (a) of figure 19.9.
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x,y)
plt.draw()
Once the plot has been produced autofmt_xdate() is usually called to rotate and format the labels on the
x-axis. The figure produced by running this command on the existing figure is in panel (b).
fig.autofmt_xdate()
plt.draw()
Sometimes, depending on the length of the sample plotted, automatic labels will not be adequate. To show a case where this issue arises, a shorter sample with only 100 values is simulated.
T = 100
x = []
for i in xrange(T):
x.append(dt.datetime(2012,3,1)+dt.timedelta(i,0,0))
y = np.cumsum(rnd.randn(T))
A basic plot is produced in the same manner, and is depicted in panel (c). Note the labels overlap and so
this figure is not acceptable.
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(x,y)
plt.draw()
A call to autofmt_xdate() can be used to address the issue of overlapping labels. This is shown in panel (d).
fig.autofmt_xdate()
plt.draw()
While the formatted x dates are an improvement, they are still unsatisfactory in that the date labels have too much information (month, day and year) and are not at the start of the month. The next piece of code shows how markers can be placed at the start of the month using MonthLocator, which is in the matplotlib.dates module. The idea is to construct a MonthLocator instance (which is a class), and then to pass it to the axes using xaxis.set_major_locator, which determines the location of major tick marks (minor tick marks can be set using xaxis.set_minor_locator). This will automatically place ticks on the 1st of every month. Other locators are available, including YearLocator and WeekdayLocator, which place ticks on the first day of the year and on week days, respectively. The second change is to format the labels on the x-axis to have the short month name and year. This is done using DateFormatter, which takes a custom format string containing the desired text format. Options for formatting include:
• %m - Numeric month
• %d - Numeric day
• %b - Abbreviated month name
• %Y - Four-digit year
• %H - Hour
• %M - Minute
• %a - Abbreviated day name
These can be combined along with other characters to produce format strings. For example, %b %d, %Y
would produce a string with the format Mar 1, 2012. The formatter is used by calling DateFormatter. Finally
autofmt_xdate is used to rotate the labels. The result of running this code is in panel (e).
months = mdates.MonthLocator()
ax.xaxis.set_major_locator(months)
fmt = mdates.DateFormatter(’%b %Y’)
ax.xaxis.set_major_formatter(fmt)
fig.autofmt_xdate()
plt.draw()
Note that March 1 is not present in the figure in panel (e). This is because the plot doesn’t actually include the date March 1 12:00:00 AM, but starts slightly later. To address this, simply change the axis limits by first calling get_xlim to get the 2-element tuple containing the limits, then changing it to include March 1 12:00:00 AM using set_xlim. The line between these calls constructs the correctly formatted date. Internally, matplotlib uses serial dates which are simply the number of days past some initial date. For example, March 1, 2012 12:00:00 AM is 734563.0, March 2, 2012 12:00:00 AM is 734564.0 and March 2, 2012 12:00:00 PM is 734564.5. The function date2num can be used to convert datetimes to serial dates. The output of running this final piece of code on the existing figure is presented in panel (f).
xlim = list(ax.get_xlim())
xlim[0] = mdates.date2num(dt.datetime(2012,3,1))
ax.set_xlim(xlim)
plt.draw()
For a simple demonstration of the range of matplotlib and Python graphics, consider the problem of producing a plot of a macroeconomic time series which has business cycle fluctuations. Capacity utilization data from FRED has been used to illustrate the steps needed to produce a plot with the time series, dates and shaded regions representing periods which the NBER has classified as recessions.
The code has been split into two parts. The first is the code needed to read the data, find the common
dates, and finally format the data so that only the common sample is retained.
# Reading the data
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
# csv2rec for simplicity
recessionDates = mlab.csv2rec(’USREC.csv’,skiprows=0)
capacityUtilization = mlab.csv2rec(’TCU.csv’)
d1 = set(recessionDates[’date’])
d2 = set(capacityUtilization[’date’])
# Find the common dates
commonDates = d1.intersection(d2)
Figure 19.9: Panels (a)–(f) show the evolution of the date-labeled plots: the raw date axis, the axis after autofmt_xdate, the shorter sample with overlapping labels, the same after autofmt_xdate, the axis with MonthLocator and DateFormatter, and finally the axis with corrected limits.
The second part of the code produces the plot. Most of the code is very simple. It begins by constructing a figure, then adding a subplot to the figure using add_subplot, and then plotting the data using plot. fill_between is only one of many useful functions in matplotlib – it fills an area whenever a variable is 1, which is the structure of the recession indicator. The final part of the code adds a title with a custom font (set using a dictionary), and then changes the font and rotation of the axis labels. The output of this code is figure 19.10.
Matplotlib supports using TEX in plots. The only steps needed are the first three lines in the code below, which configure some settings. The labels use raw mode to avoid needing to escape the \ in the TEX string. The final plot with TEX in the labels is presented in figure 19.11.
>>> from matplotlib import rc
>>> rc(’text’, usetex=True)
Figure 19.10: A plot of capacity utilization (US) with shaded regions indicating NBER recession dates.
>>> rc(’font’, family=’serif’)
>>> y = 50*np.exp(.0004 + np.cumsum(.01*np.random.randn(100)))
>>> plt.plot(y)
>>> plt.xlabel(r’\textbf{time ($\tau$)}’)
>>> plt.ylabel(r’\textit{Price}’,fontsize=16)
>>> plt.title(r’Geometric Random Walk: $d\ln p_t = \mu dt + \sigma dW_t$’,fontsize=16)
>>> rc(’text’, usetex=False)
19.3 3D Plotting
The 3D plotting capabilities of matplotlib are decidedly weaker than the 2D plotting facilities. Despite this
warning, the 3D capabilities are still more than adequate for most application – especially since 3D graphics
are rarely necessary, and often not even useful when used.
Line plots in 3D are virtually identical to plotting in 2D, except that three 1-dimensional vectors are needed: x, y and z (height). This simple example demonstrates how plot can be used with the keyword argument zs to construct a 3D line plot. The line that sets up the axes using Axes3D(fig) is essential when producing 3D graphics. The other new command, view_init, is used to rotate the view using code (the view can be interactively rotated in the figure window). The result of running the code below is presented in figure 19.12.
>>> from mpl_toolkits.mplot3d import Axes3D
>>> x = np.linspace(0,6*np.pi,600)
>>> z = x.copy()
>>> y = np.sin(x)
>>> x= np.cos(x)
>>> fig = plt.figure()
>>> ax = Axes3D(fig) # Different usage
>>> ax.plot(x, y, zs=z)
>>> ax.view_init(15, 45) # elevation, azimuth (illustrative values)
Figure 19.11: A plot of a geometric random walk with the TEX title d ln p_t = µ dt + σ dW_t, a Price y-axis label and a time (τ) x-axis label.
Surface and mesh or wireframe plots are occasionally useful for visualizing functions with 2 inputs, such as a
bivariate distribution. This example produces both for the bivariate normal PDF with mean 0, unit variances
and correlation of 50%. The first block of code generates the points to use in the plot with meshgrid and
evaluates the PDF for all combinations of x and y .
x = np.linspace(-3,3,100)
y = np.linspace(-3,3,100)
x,y = np.meshgrid(x,y)
z = np.mat(np.zeros(2))
p = np.zeros(np.shape(x))
R = np.matrix([[1,.5],[.5,1]])
Rinv = np.linalg.inv(R)
for i in xrange(len(x)):
    for j in xrange(len(y)):
        z[0,0] = x[i,j]
        z[0,1] = y[i,j]
        p[i,j] = 1.0/(2*np.pi*np.sqrt(np.linalg.det(R)))*np.exp(-(z*Rinv*z.T)/2)
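The double loop can also be avoided entirely. A vectorized sketch of the same computation (not from the original text) expands the quadratic form z′R⁻¹z by hand for the 2-by-2 case, so the PDF is evaluated on the whole grid at once:

```python
import numpy as np

x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
x, y = np.meshgrid(x, y)
R = np.array([[1.0, 0.5], [0.5, 1.0]])
Rinv = np.linalg.inv(R)
# z'Rinv z expanded element-by-element for the 2-by-2 case
quad = Rinv[0, 0]*x**2 + (Rinv[0, 1] + Rinv[1, 0])*x*y + Rinv[1, 1]*y**2
# Bivariate normal PDF evaluated at every grid point simultaneously
p = np.exp(-quad/2) / (2*np.pi*np.sqrt(np.linalg.det(R)))
```

The result is identical to the loop version, but typically orders of magnitude faster since all work happens inside NumPy.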
The next code segment produces a mesh (wireframe) plot using plot_wireframe. The setup
is identical to that of the 3D line plot, and the call to add_subplot(111, projection='3d') is again essential.
The figure is drawn using the 2-dimensional arrays x , y and p . The output of this code is presented in panel
(a) of figure 19.13.
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> ax.plot_wireframe(x, y, p)
[Figure 19.13, the wireframe plot of the bivariate normal PDF, appears here.]
Contour plots are not technically 3D, although they are used as a 2D representation of 3D data. Since they
are ultimately 2D, little setup is needed, aside from a call to contour using the same inputs as plot_surface
and plot_wireframe. The output of the code below is in figure 19.14.
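The contour code itself did not survive conversion; a minimal sketch consistent with the description (using a standard bivariate normal surface in place of the p computed above) is:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Grid and a bivariate normal-like surface standing in for x, y, p above
x = np.linspace(-3, 3, 100)
y = np.linspace(-3, 3, 100)
x, y = np.meshgrid(x, y)
p = np.exp(-(x**2 + y**2)/2) / (2*np.pi)

fig = plt.figure()
ax = fig.add_subplot(111)   # no 3D projection is needed for a contour plot
cs = ax.contour(x, y, p)    # same inputs as plot_surface and plot_wireframe
```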
figure
figure is used to open a figure window, and can be used to generate axes. fig = figure(n) produces a
figure object with id n , and assigns the object to fig.
add_subplot
add_subplot is used to add axes to a figure. ax = fig.add_subplot(111) can be used to add a basic axes to
a figure. ax = fig.add_subplot(m,n,i) can be used to add an axes to a non-trivial figure with an m by n grid
of plots.
close
close closes figures. close(n) closes the figure with id n , and close('all') closes all figure windows.
show
show is used to force an update to a figure, and to pause execution if not used in an interactive console. show
should not be used in stand-alone Python programs. Instead draw should be used.
draw
draw forces the current figure to be redrawn. Unlike show, it does not pause execution, and so is safe to use in stand-alone programs.
savefig
Exporting plots is simple using savefig('filename.ext ') where ext determines the type of exported file to
produce. ext can be one of png, pdf, ps, eps and svg.
>>> plt.plot(randn(10,2))
>>> savefig('figure.pdf') # PDF export
>>> savefig('figure.png') # PNG export
>>> savefig('figure.svg') # Scalable Vector Graphics export
savefig has a number of useful keyword arguments. In particular, dpi is useful when exporting png files.
The default dpi is 100.
>>> plt.plot(randn(10,2))
>>> savefig('figure.png', dpi = 600) # High resolution PNG export
Chapter 20
String Manipulation
Strings are usually less interesting than numerical values in econometrics and statistics. There are, however,
some important uses for strings.
Recall that strings are sliceable, but unlike arrays, are immutable, and so it is not possible to replace part of
a string.
While + is a simple method to join strings, the modern method is to use join. join is a string method
which joins a list of strings (the input), using the string on which the method is called as the separator.
Alternatively, the same output can be constructed using an empty string ''.
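A minimal illustration of both forms:

```python
words = ['Python', 'is', 'a', 'rewarding', 'language.']
# The string calling join is used as the separator
joined = ' '.join(words)
# An empty separator simply concatenates the list elements
merged = ''.join(words)
```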
20.2.1 split
split splits a string into a list based on a character, for example a comma. It takes an optional second argument
maxsplit which limits the number of splits. rsplit works identically to split, only scanning from
the end of the string – split and rsplit only differ when maxsplit is used.
>>> s = 'Python is a rewarding language.'
>>> s.split(' ')
['Python', 'is', 'a', 'rewarding', 'language.']
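maxsplit, and the difference between split and rsplit, can be seen directly:

```python
s = 'Python is a rewarding language.'
# maxsplit limits the number of splits, counted from the left...
left = s.split(' ', 3)
# ...while rsplit counts from the right
right = s.rsplit(' ', 3)
```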
20.2.2 join
join concatenates a list or tuple of strings, using an optional argument sep which specifies a separator
(default is space).
>>> import string
>>> a = 'Python is'
>>> b = 'a rewarding language.'
>>> string.join((a,b))
'Python is a rewarding language.'
>>> string.join((a,b),':')
'Python is:a rewarding language.'
strip removes leading and trailing whitespace from a string. An optional input char removes leading and
trailing occurrences of the input value (instead of space). lstrip and rstrip work identically, only stripping
from the left and right, respectively.
>>> s = ' Python is a rewarding language. '
>>> s = s.strip()
>>> s
'Python is a rewarding language.'
>>> s.strip('P')
'ython is a rewarding language.'
find locates the lowest index of a substring, and returns -1 if the substring is not found. Optional arguments limit the range
of the search, so that s.find('i',10,20) searches only within s[10:20] (although the index returned is
relative to the start of s). rfind works identically, only returning the highest index of the substring.
>>> s.find('i',10,20)
18
>>> s.rfind('i')
18
index returns the lowest index of a substring, and is identical to find except that an error is raised if the
substring does not exist. As a result, index is only safe to use in a try . . . except block.
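For example, a simple pattern that falls back to the find-style return value:

```python
s = 'Python is a rewarding language.'
first = s.index('is')   # found, so index behaves like find
try:
    idx = s.index('q')
except ValueError:
    # index raises ValueError where find would return -1
    idx = -1
```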
20.2.6 count
count counts the number of occurrences of a substring. It takes optional arguments which limit the search
range.
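For example (the optional range arguments behave like those of find):

```python
s = 'Python is a rewarding language.'
n_all = s.count('i')              # every occurrence in the string
n_limited = s.count('i', 10, 20)  # only occurrences within s[10:20]
```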
lower and upper convert strings to lower and upper case, respectively. They are useful to remove case
ambiguity when comparing strings to known constants.
>>> s = 'Python is a rewarding language.'
>>> s.upper()
'PYTHON IS A REWARDING LANGUAGE.'
>>> s.lower()
'python is a rewarding language.'
ljust, rjust and center left justify, right justify and center, respectively, a string while expanding its size to
a given length. If the desired length is smaller than the string, the unchanged string is returned.
>>> s = 'Python is a rewarding language.'
>>> s.ljust(40)
'Python is a rewarding language.         '
>>> s.rjust(40)
'         Python is a rewarding language.'
>>> s.center(40)
'    Python is a rewarding language.     '
20.2.9 replace
replace replaces a substring with an alternative string, which can have a different size. An optional argument
limits the number of replacements.
>>> s = 'Python is a rewarding language.'
>>> s.replace('g','Q')
'Python is a rewardinQ lanQuaQe.'
>>> s.replace('is','Q')
'Python Q a rewarding language.'
>>> s.replace('g','Q',2)
'Python is a rewardinQ lanQuage.'
20.2.10 textwrap.wrap
The module textwrap contains a function wrap which reformats a long string into a fixed width paragraph,
stored line-by-line in a list. An optional argument changes the width of the output paragraph from the
default of 70 characters.
>>> import textwrap
>>> s = 'Python is a rewarding language. '
>>> s = 10*s
>>> textwrap.wrap(s)
['Python is a rewarding language. Python is a rewarding language. Python',
 'is a rewarding language. Python is a rewarding language. Python is a',
 'rewarding language. Python is a rewarding language. Python is a',
 'rewarding language. Python is a rewarding language. Python is a',
 'rewarding language. Python is a rewarding language.']
>>> textwrap.wrap(s,50)
['Python is a rewarding language. Python is a',
 'rewarding language. Python is a rewarding',
 'language. Python is a rewarding language. Python',
 'is a rewarding language. Python is a rewarding',
 'language. Python is a rewarding language. Python',
 'is a rewarding language. Python is a rewarding',
 'language. Python is a rewarding language.']
Formatting numbers when converting to strings allows for automatic generation of tables and well formatted
print statements. Numbers are formatted using the format method of a string, which is used in conjunction with a
format specifier. For example, consider these examples which format π.
>>> pi
3.141592653589793
>>> '{:12.5f}'.format(pi)
'     3.14159'
>>> '{:12.5g}'.format(pi)
'      3.1416'
>>> '{:12.5e}'.format(pi)
' 3.14159e+00'
These all provide alternative formats and the difference is determined by the letter in the format string. The
generic form of a format string is {n : f a s m c .p t }. To understand the various choices, consider the
output produced by the basic format string '{0:}'
>>> '{0:}'.format(pi)
'3.14159265359'
• n is a number 0,1,. . . indicating which value to take from the format function
• f a are fill and alignment characters, typically a 2 character string. Fill can be any character except }.
Alignment can be < (left), > (right), ^ (center) or = (pad to the right of the sign). Simple left 0-fills can omit the
alignment character so that f a = 0.
>>> '{0:0<20}'.format(pi) # Left-aligned, 0-filled, width 20
'3.141592653590000000'
• s indicates whether a sign should be included. + indicates always include a sign, - indicates only
include a sign if needed, and a blank space indicates to use a blank space for positive numbers and a − sign
for negative numbers – this format is useful for producing aligned tables.
>>> '{0:+}'.format(pi)
'+3.14159265359'
>>> '{0:-}'.format(pi)
'3.14159265359'
• m is the minimum total size of the formatted string. If the formatted string is shorter than m , the fill
character f is prepended.
>>> '{0:10}'.format(pi)
'3.14159265359'
>>> '{0:20}'.format(pi)
'       3.14159265359'
>>> '{0:30}'.format(pi)
'                 3.14159265359'
• c can be , or omitted. , produces numbers with 1000s separated using a ,. In order to use c it is
necessary to include the . before the precision.
>>> '{0:.2}'.format(pi)
'3.1'
>>> '{0:.5}'.format(pi)
'3.1416'
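For example, combining the 1000s separator with a precision (the value here is illustrative):

```python
big = 1234567.891
# The , must appear before the . and the precision in the format specifier
formatted = '{0:,.2f}'.format(big)
```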
>>> '{0:.5e}'.format(pi)
'3.14159e+00'
>>> '{0:.5g}'.format(pi)
'3.1416'
>>> '{0:.5f}'.format(pi)
'3.14159'
>>> '{0:.5%}'.format(pi)
'314.15927%'
All of these features can be combined in a single format string to produce complex formatted output.
>>> '{0: > 20.4f}, {1: > 20.4f}'.format(pi,-pi)
'              3.1416,              -3.1416'
In the first example, reading from left to right after the colon, the format string consists of:
1. A blank space fill character
2. Right alignment (>)
3. Use no sign for positive numbers, − sign for negative numbers (the blank space after >)
4. Minimum 20 digits
5. Precision of 4
6. Fixed point (f) display
The second is virtually identical to the first, except that it includes a , to show the 1000s separator.
>>> '{0: > 20,.4f}'.format(1000000 * pi)
'      3,141,592.6536'
format can be used to output formatted strings using a similar syntax to number formatting, although some
options (precision, sign, comma and type) are not relevant.
>>> s = 'Python'
>>> '{0:}'.format(s)
'Python'
>>> '{0:!>20}'.format(s)
'!!!!!!!!!!!!!!Python'
format can be used to format multiple objects in the same string output. There are three methods to do this:
• No positional arguments, in which case the objects are matched to format strings in order
• Numeric positional arguments, in which case the first object is mapped to '{0:}', the second to
'{1:}', and so on.
• Named arguments such as '{price:}' and '{volume:}', which match keyword arguments
inside format.
>>> price = 100.32
>>> volume = 132000
>>> 'The price yesterday was {0:} and the volume was {1:}'.format(price,volume)
'The price yesterday was 100.32 and the volume was 132000'
>>> 'The price yesterday was {1:} and the volume was {0:}'.format(volume,price)
'The price yesterday was 100.32 and the volume was 132000'
>>> 'The price yesterday was {price:} and the volume was {volume:}'.format(price=price,volume=volume)
'The price yesterday was 100.32 and the volume was 132000'
Some Python code still uses an older style format string. Old style format strings have the form
%(key)flags width.precision type, where:
• (key) is an optional mapping key, used to match keyword values
• flags can be one or more of:
– 0: Zero pad
– (blank space): Use a blank space for positive numbers
– -: Left adjust output
– +: Include sign character
• width is the minimum output width, .precision sets the number of digits displayed, and type is the
conversion type (e.g. f for floating point or d for integer).
In general, the old format strings should only be used when required by other code (e.g. matplotlib). Below
are some examples of their use in strings.
>>> price = 100.32
>>> volume = 132000
>>> 'The price yesterday was %0.2f with volume %d' % (price, volume)
'The price yesterday was 100.32 with volume 132000'
>>> 'The price yesterday was %+0.3f and the volume was %010d' % (price, volume)
'The price yesterday was +100.320 and the volume was 0000132000'
Regular expressions are powerful tools for matching patterns in strings. While teaching regular expressions
is beyond the scope of these notes – there are 500 page books dedicated to regular expressions – they are
sufficiently useful to warrant coverage. Fortunately there are a large number of online regular expression
generators which can assist in finding the pattern to use, and so they are useful to anyone working with
unformatted text.
Using regular expressions requires the re module. The most useful functions for regular expression matching
are findall, finditer and sub. findall and finditer work in a similar manner, except that findall
returns a list while finditer returns an iterable. finditer is preferred if a large number of matches is possible.
Both search through a string and find all non-overlapping matches of a regular expression.
>>> import re
>>> s = 'Find all numbers in this string: 32.43, 1234.98, and 123.8.'
>>> re.findall('[\s][0-9]+\.\d*',s)
[' 32.43', ' 1234.98', ' 123.8']
finditer returns MatchObjects which contain the method span. span returns a 2 element tuple which con-
tains the start and end position of the match.
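A short sketch of finditer and span using the same pattern as above:

```python
import re

s = 'Find all numbers in this string: 32.43, 1234.98, and 123.8.'
spans = []
for m in re.finditer(r'[\s][0-9]+\.\d*', s):
    spans.append(m.span())  # (start, stop) position of each match
# Slicing the original string with each span recovers the matched text
matched = [s[start:stop] for start, stop in spans]
```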
sub replaces all matched text with another text string (or a function which takes a MatchObject). The
function reverse used below is one possible definition that produces the output shown – it reverses the
digits of each match.
>>> def reverse(m):
...     return ' ' + m.group(0).strip()[::-1]
>>> s = 'Find all numbers in this string: 32.43, 1234.98, and 123.8.'
>>> re.sub('[\s][0-9]+\.\d*',' NUMBER',s)
'Find all numbers in this string: NUMBER, NUMBER, and NUMBER.'
>>> re.sub('[\s][0-9]+\.\d*',reverse,s)
'Find all numbers in this string: 34.23, 89.4321, and 8.321.'
When repeatedly using a regular expression, for example running it on all lines in a file, it is better to compile
the regular expression, and then to use the resulting RegexObject.
>>> import re
>>> s = 'Find all numbers in this string: 32.43, 1234.98, and 123.8.'
>>> numbers = re.compile('[\s][0-9]+\.\d*')
>>> numbers.findall(s)
[' 32.43', ' 1234.98', ' 123.8']
Parsing the regular expression text is relatively expensive, and compiling the expression avoids this cost.
When reading data into Python using a mixed format, blindly converting text to integers or floats is
dangerous. For example, float('a') raises a ValueError since Python doesn't know how to convert 'a' to a
float. The simplest method to safely convert potentially non-numeric data is to use a try . . . except block.
from __future__ import print_function
from __future__ import division

S = ['1234','1234.567','a','1234.a34','1.0','a123']
for s in S:
    try:
        int(s)
        print(s, 'is an integer.')
    except ValueError:
        try:
            float(s)
            print(s, 'is a float.')
        except ValueError:
            print('Unable to convert', s)
Chapter 21
Manipulating the file system is surprisingly useful when working with data. The most important file system
commands are located in the modules os and shutil. This chapter assumes that
import os
import shutil
The working directory is where files can be created and accessed without any path information. os.getcwd()
can be used to determine the current working directory, and os.chdir(path) can be used to change the
working directory, where path is a directory, such as /temp or c:\\temp.1 Alternatively, path can be .. to
move up the directory tree.
pwd = os.getcwd()
os.chdir('c:\\temp')
os.chdir('c:/temp') # Identical
os.chdir('..')
os.getcwd() # Now in 'c:\\'
Directories can be created using os.mkdir(dirname), although it must be the case that the higher level
directories exist (e.g. to create /home/username/Python/temp, it must be the case that /home/username/Python
already exists). os.makedirs(dirname) works similarly to os.mkdir(dirname), except that it will create any
higher level directories needed to create the target directory.
Empty directories can be deleted using os.rmdir(dirname) – if the directory is not empty, an error
occurs. shutil.rmtree(dirname) works similarly to os.rmdir(dirname), except that it will delete the
directory, and any files or other directories contained in the directory.
1
On Windows, directories use the backslash, which is used to escape characters in Python, and so an escaped backslash – \\ – is
needed when writing Windows’ paths. Alternatively, the forward slash can be substituted, so that c:\\temp and c:/temp are equivalent.
os.mkdir('c:\\temp\\test')
os.makedirs('c:/temp/test/level2/level3') # mkdir will fail
os.rmdir('c:\\temp\\test\\level2\\level3')
shutil.rmtree('c:\\temp\\test') # rmdir fails, since not empty
The contents of a directory can be retrieved in a list using os.listdir(dirname), or simply os.listdir('.')
to list the current working directory. The list returned will contain all files and directories. os.path.isdir(name)
can be used to determine whether a value in the list is a directory, and os.path.isfile(name) can
be used to determine if it is a file. os.path contains other useful functions for working with directory listings
and file attributes.
os.chdir('c:\\temp')
files = os.listdir('.')
for f in files:
    if os.path.isdir(f):
        print(f, ' is a directory.')
    elif os.path.isfile(f):
        print(f, ' is a file.')
    else:
        print(f, ' is something else.')
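A few of the other os.path helpers referred to above (a brief illustration):

```python
import os

# splitext separates the extension from the rest of the name
base, ext = os.path.splitext('file.csv')
# join builds a path using the separator appropriate for the platform
full = os.path.join('temp', 'file.csv')
```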
A more sophisticated listing which accepts wildcards and is similar to dir (Windows) and ls (Linux) can
be constructed using the glob module.
import glob
files = glob.glob('c:\\temp\\*.txt')
File contents can be copied using shutil.copy( src , dest ), shutil.copy2( src , dest ) or shutil.copyfile(
src , dest ). These functions are all similar, and the differences are:
• shutil.copy will accept either a filename or a directory as dest. If a directory is given, a file is
created in the directory with the same name as the original file
• shutil.copy2 is identical to shutil.copy, except that metadata, such as access times, is also copied
• shutil.copyfile requires dest to be a filename, not a directory
Finally, shutil.copytree( src , dest ) will copy an entire directory tree, starting from the directory src to
the directory dest, which must not exist. shutil.move( src , dest ) is similar to shutil.copytree, except that
it moves a file or directory tree to a new location. If preserving file metadata (such as permissions or file
streams) is important, it is better to use system commands (copy or move on Windows, cp or mv on Linux) as
an external program.
os.chdir('c:\\temp')
# Copies file.ext to 'c:\\'
shutil.copy('file.ext','c:\\')
# Copies file.ext to 'c:\\temp\\file2.ext'
shutil.copy('file.ext','file2.ext')
# Copies file.ext to 'c:\\temp\\file3.ext', plus metadata
shutil.copy2('file.ext','file3.ext')
shutil.copytree('c:\\temp\\','c:\\newtemp\\')
shutil.move('c:\\newtemp\\','c:\\newtemp2\\')
Occasionally it is necessary to call other programs, for example to decompress a file compressed in an
unusual format or to call system copy commands to preserve metadata and file ownership. Both os.system
and subprocess.call (which requires import subprocess) can be used to execute commands as if they
were executed directly in the shell.
import subprocess
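A minimal sketch of calling an external program; the original example commands did not survive conversion, so sys.executable (the running Python interpreter) is used here because it exists on any system:

```python
import subprocess
import sys

# subprocess.call runs the command and returns its exit code (0 = success);
# os.system('...') would behave similarly for a shell command string
retcode = subprocess.call([sys.executable, '--version'])
```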
Creating and extracting files from archives often allows for further automation in data processing. Python
has native support for zip, tar, gzip and bz2 file formats using shutil.make_archive( archivename , format ,
root ) where archivename is the name of the archive to create, without the extension, format is one of the
supported formats (e.g. 'zip' for a zip archive or 'gztar' for a gzipped tar file) and root is the root directory,
which can be '.' for the current working directory.
# Creates files.zip
shutil.make_archive('files','zip','c:\\temp\\folder_to_archive')
# Creates files.tar.gz
shutil.make_archive('files','gztar','c:\\temp\\folder_to_archive')
Creating a standard gzip from an existing file is slightly more complicated, and requires using the gzip
module.2
import gzip
# Copy the contents of an existing file into a new gzip file
csvin = open('file.csv', 'rb')
gz = gzip.GzipFile('file.csv.gz', 'wb')
gz.writelines(csvin)
gz.close()
csvin.close()
2
A gzip can only contain 1 file, and is usually used with a tar file to compress a directory or set of files.
Zip files can be extracted using the module zipfile, gzip files using the module gzip, and
gzipped tar files using tarfile.
import zipfile
import gzip
import tarfile
# Extract zip
zip = zipfile.ZipFile('files.zip')
zip.extractall('c:\\temp\\zip\\')
zip.close()
Occasionally it may be necessary to read or write a file, for example to output a formatted LaTeX table. Python
contains low level file access tools which can be used to generate files with any structure. Custom files all
begin by using file to create a new or open an existing file. Files can be opened in different modes: 'r' for
reading, 'w' for writing, and 'a' for appending ('w' will overwrite an existing file). An additional modifier 'b'
can be used if the file is binary (not text), so that 'rb', 'wb' and 'ab' allow reading, writing and appending
binary files.
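A short sketch of the three modes; the file name is hypothetical, and open is used, which is a synonym for file:

```python
# 'w' creates a new file (or overwrites an existing one)
f = open('modes_demo.txt', 'w')
f.write('first line\n')
f.close()
# 'a' appends to the existing file
f = open('modes_demo.txt', 'a')
f.write('second line\n')
f.close()
# 'r' opens the file for reading
f = open('modes_demo.txt', 'r')
contents = f.read()
f.close()
```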
Reading text files is usually implemented using readline() to read a single line, readlines(n) to read
at most n lines, or readlines() to read all lines in a file. readline() and readlines(n) are usually used
inside a while loop which terminates when the value returned is an empty string ('') or an empty list ([]),
respectively.
# Read all lines using readlines()
f = file('file.csv','r')
lines = f.readlines()
for line in lines:
    print(line)
f.close()

# Using readline()
f = file('file.csv','r')
line = f.readline()
while line != '':
    print(line)
    line = f.readline()
f.close()

# Using readlines(n)
f = file('file.csv','r')
lines = f.readlines(2)
while lines != []:
    for line in lines:
        print(line)
    lines = f.readlines(2)
f.close()
In practice, the information from the file is usually transformed in a more meaningful way than using print.
Writing text files is similar, and begins by using file to create a file, and then write to output information.
write is conceptually similar to using print, except that the output will be written to a file rather than
printed on screen. The next example shows how to create a LaTeX table from an array.
import numpy as np
import scipy.stats as stats

x = np.random.randn(100,4)
mu = np.mean(x,0)
sig = np.std(x,0)
sk = stats.skew(x,0)
ku = stats.kurtosis(x,0)
summaryStats = np.vstack((mu,sig,sk,ku))
rowHeadings = ['Var 1','Var 2','Var 3','Var 4']
colHeadings = ['Mean','Std Dev','Skewness','Kurtosis']

# Build the table line-by-line in a list, starting with the header row
# (the column specification and output file name here are illustrative)
latex = []
latex.append('\\begin{tabular}{lcccc}')
line = ' '
for c in colHeadings:
    line += ' & ' + c
line += ' \\\\ \\hline'
latex.append(line)
for i in xrange(np.size(summaryStats,0)):
    line = rowHeadings[i]
    for j in xrange(np.size(summaryStats,1)):
        line += ' & ' + str(summaryStats[i,j])
    latex.append(line)
latex.append('\\end{tabular}')
# Write the assembled lines to file
f = file('latex_table.tex','w')
for line in latex:
    f.write(line + '\n')
f.close()
21.8 Exercises
3. Create a new file named tobedeleted.py using a text editor in this new directory (it can be empty).
6. Delete the newly created file, and then delete this directory.
Chapter 22
Structured Arrays
The arrays and matrices used in most of these notes are highly optimized data structures where all elements
have the same datatype (e.g. float), and elements can be accessed using slicing. They are essential for high-
performance numerical computing, such as computing inverses of large matrices. Unfortunately, actual
data often have meaningful names – not just “column 0” – or may have different types – dates, strings,
integers and floats – that cannot be combined in a uniform NumPy array. NumPy supports mixed arrays
which solve both of these issues and so are useful data structures for managing data prior to statistical
analysis. Conceptually, a mixed array with named columns is similar to a spreadsheet where each column
can have its own name and data type.
A mixed NumPy array can be initialized using array or zeros, among other functions. Mixed arrays are in
many ways similar to standard NumPy arrays, except that the dtype input to the function is specified either
using tuples of the form (name,type), or using a dictionary.
>>> x = zeros(4,[('date','int'),('ret','float')])
>>> x = zeros(4,{'names': ('date','ret'), 'formats': ('int', 'float')})
>>> x
array([(0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0)],
      dtype=[('date', '<i4'), ('ret', '<f8')])
These two commands are identical, and illustrate the two methods to create an array which contains a named
column "date", for integer data, and a named column "ret" for floats. These columns can be accessed by
name.
>>> x['date']
array([0, 0, 0, 0])
>>> x['ret']
array([0.0, 0.0, 0.0, 0.0])
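Columns can also be assigned to by name, which is the usual way to fill a mixed array (a small sketch):

```python
import numpy as np

x = np.zeros(4, dtype=[('date', 'int'), ('ret', 'float')])
# Assign an entire column at once...
x['date'] = [20100101, 20100102, 20100103, 20100104]
# ...or broadcast a scalar across the column
x['ret'] = 0.01
```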
Type               Syntax       Description
Boolean            b            True/False
Integers           i1,i2,i4,i8  1 to 8 byte signed integers (−2^(B−1), . . . , 2^(B−1) − 1, where B is the number of bits)
Unsigned Integers  u1,u2,u4,u8  1 to 8 byte unsigned integers (0, . . . , 2^B − 1)
Floating Point     f4,f8        Single (4) and double (8) precision float
Complex            c8,c16       Single (8) and double (16) precision complex
Object             On           Generic n-byte object
String             Sn, an       n-letter string
Unicode String     Un           n-letter unicode string
The majority of data types are for numeric data, and are simple to understand. The n in the string data
type indicates the maximum length of a string. Attempting to insert a string with more than n characters
will truncate the string. The object data type is somewhat abstract, but allows for storing Python objects
such as datetimes.
Custom data types can be built using dtype. The constructed data type can then be used in the construction
of a mixed array.
t = dtype([('var1','f8'), ('var2','i8'), ('var3','u8')])
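The constructed type behaves like any other dtype (a brief sketch):

```python
import numpy as np

t = np.dtype([('var1', 'f8'), ('var2', 'i8'), ('var3', 'u8')])
# Pass the custom dtype to zeros as usual
data = np.zeros(2, dtype=t)
names = data.dtype.names
```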
Data types can even be nested to create a structured environment where one of the “variables” has multiple
values. Consider this example which uses a nested data type to contain the bid and ask price of a stock,
along with the time of the transaction.
ba = dtype([('bid','f8'), ('ask','f8')])
t = dtype([('date', 'O8'), ('prices', ba)])
data = zeros(2,t)
In this example, data is an array where each item has 2 elements, the date and the price. Price is also an array
with 2 elements. Names can also be used to access values in nested arrays (e.g. data[’prices’][’bid’]
returns an array containing all bid prices). In practice nested arrays can almost always be expressed as a
non-nested array without loss of fidelity.
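A sketch of nested access; note 'O' is used for the object column here, since recent NumPy versions reject the older 'O8' spelling:

```python
import numpy as np

ba = np.dtype([('bid', 'f8'), ('ask', 'f8')])
t = np.dtype([('date', 'O'), ('prices', ba)])
data = np.zeros(2, dtype=t)
# Fill the nested fields by name, then read all bids at once
data['prices']['bid'] = [10.0, 11.0]
data['prices']['ask'] = [10.5, 11.5]
bids = data['prices']['bid']
```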
Determining the size of objects
NumPy arrays can store objects, which are anything falling outside of the usual datatypes. One example
of a useful, but abstract, datatype is datetime. One method to determine
the size of an object is to create a plain array containing the object – which will automatically determine the
data type – and then to query the size from the array.
import datetime as dt
x = array([dt.datetime.now()])
# The size in bytes
print(x.dtype.itemsize)
# The name and description
print(x.dtype.descr)
TAQ is the NYSE Trade and Quote database which contains all trades and quotes of US listed equities which
trade on major US markets (not just the NYSE). A record from a trade contains a number of fields:
• Date - The Date in YYYYMMDD format stored as a 4-byte unsigned integer
First consider a data type which stores the data in an identical format.
t = dtype([('date', 'u4'), ('time', 'u4'),
           ('size', 'u4'), ('price', 'f8'),
           ('g127', 'u2'), ('corr', 'u2'),
           ('cond', 'S2'), ('ex', 'S2')])
taqData = zeros(10, dtype=t)
taqData[0] = (20120201,120139,1,53.21,0,0,'','N')
An alternative is to store the date and time as a datetime, which is an 8-byte object.
import datetime as dt
Record arrays are closely related to mixed arrays with names. The primary difference is that elements of record
arrays can be accessed using variable.name format.
>>> x = zeros((4,1),[('date','int'),('ret','float')])
>>> y = rec.array(x)
>>> y.date
array([[0],
[0],
[0],
[0]])
>>> y.date[0]
array([0])
In practice record arrays may be slower than standard arrays, and unless the variable.name syntax is really
important, record arrays are not compelling.
Part II
Incomplete
Chapter 23
Parallel
To be completed
Chapter 24
To be completed
We should forget about small efficiencies, say about 97% of the time: premature optimization
is the root of all evil.
Donald Knuth
24.2 Vectorize
24.4 Cython
Chapter 25
To be completed
25.1 scikits.statsmodels
25.2 pandas
Chapter 26
Examples
To Be Completed
Chapter 27
Quick Reference
To be completed
27.1 Numpy
27.2 SciPy
27.3 Matplotlib
27.4 IPython
Index
absolute, 64
all, 96
and, 96
any, 96
arange, 61
argmax, 67
argmin, 67
argsort, 67
asarray, 74
asmatrix, 73
brent, 157
broadcast, 77
broadcast_arrays, 77
cholesky, 81
close, 89
Complex Values, 64–65
concatenate, 78
cond, 81
conj, 65
cumprod, 63
cumsum, 63
del, 36
delete, 78
det, 82
diag, 80
diff, 63
dsplit, 78
dstack, 77
eig, 82
eigh, 82
elif, 101
else, 101
empty, 72
empty_like, 72
exp, 64
file, 89
flat, 76
flatten, 76
fliplr, 79
flipud, 79
float, 89
fmin, 151
fmin_bfgs, 148
fmin_cg, 150
fmin_cobyla, 155
fmin_l_bfgs_b, 155
fmin_ncg, 150
fmin_powell, 152
fmin_slsqp, 152
fmin_tnc, 155
fminbound, 157
for, 105
Generating Arrays, 61–62
get_state, 143
golden, 157
hsplit, 78
hstack, 77
if, 101
imag, 65
Importing Data, 85–89
in1d, 65
int, 89
intersect1d, 65
inv, 82
kron, 82
leastsq, 157
linspace, 61, 62
loadtxt, 85
log, 64
log10, 64
logical_and, 96
logical_not, 96
logical_or, 96
logspace, 61
lstsq, 81
mat, 73
Mathematical Functions, 63–64
matrix_power, 80
max, 67
maximum, 68
meshgrid, 61
min, 67
minimum, 68
nanargmax, 69
nanargmin, 69
nanmax, 69
nanmin, 69
nansum, 68
ndarray
  argmax, 67
  argmin, 67
  argsort, 67
  conj, 65
  cumprod, 63
  cumsum, 63
  flat, 76
  flatten, 76
  imag, 65
  max, 67
  min, 67
  ndim, 75
  prod, 63
  ravel, 74
  real, 64
  reshape, 75
  round, 62
  shape, 74
  size, 75
  sort, 67
  squeeze, 79
  sum, 63
  view, 73
ndim, 75
not, 96
ones_like, 71
Optimization
  Constrained, 152–156
  Least Squares, 157–158
  Scalar, 156–157
  Unconstrained, 147–152
or, 96
prod, 63
ravel, 74
readline, 89
replace, 89
reshape, 75
Rounding, 62–63
savetxt, 90
seed, 143
Set Functions, 65–66
set_state, 143
setdiff1d, 66
setxor1d, 66
shape, 74
sign, 64
Simulation, 143–145
size, 75
slogdet, 81
solve, 81
sort, 66, 67
Sorting and Extreme Values, 66–68
split, 89
sqrt, 64
square, 64
squeeze, 79
sum, 63
svd, 80
tile, 76
trace, 83
union1d, 66
unique, 65
view, 73
vsplit, 78
vstack, 77
while, 108
zeros, 71
zeros_like, 71