Python for Unified Research in Econometrics and Statistics
Roseline Bilina, Steve Lawford
Abstract
Python is a powerful high-level open source programming language that is available for
multiple platforms. It supports object-oriented programming, and has recently become a
serious alternative to low-level compiled languages such as C++. It is easy to learn and use,
and is recognized for very fast development times, which makes it suitable for rapid software
prototyping as well as teaching purposes. We motivate the use of Python and its free extension
modules for high performance stand-alone applications in econometrics and statistics, and as a
tool for gluing different applications together. (It is in this sense that Python forms a ‘unified’
environment for statistical research). We give details on the core language features, which will
enable a user to immediately begin work, and then provide practical examples of advanced uses
of Python. Finally, we compare the run-time performance of extended Python against a number
of commonly-used statistical packages and programming environments.
∗
JEL classification: C6 (Mathematical methods and programming), C87 (Econometric software), C88 (Other
computer software). Keywords: Object-Oriented Programming, Open Source Software, Programming Language,
Python, Rapid Prototyping.
†
Roseline Bilina, School of Operations Research and Information Engineering, Cornell University, Ithaca, NY,
14853, USA. Email: rb537 (at) cornell.edu.
‡
Corresponding author. Steve Lawford, Department of Economics and Econometrics (LH/ECO), ENAC, 7 avenue
Edouard Belin, BP 54005, 31055, Toulouse, Cedex 4, France. Email: steve lawford (at) yahoo.co.uk.
1 Introduction
Python was designed in the early 1990s by Guido van Rossum, then a programmer at the Dutch National
Research Institute for Mathematics and Computer Science (CWI) in Amsterdam. The core Python
distribution is open source and is available for multiple platforms, including Windows, Linux/Unix
and Mac OS X. The default CPython implementation, as well as the standard libraries and
documentation, are available free of charge from www.python.org, and are managed by the Python
Software Foundation, a non-profit body.1 van Rossum still oversees the language development,
which has ensured a strong continuity of features, design, and philosophy. Python is easy to
learn and use, and is recognized for its very clear, concise, and logical syntax. This feature alone
makes it particularly suitable for rapid software prototyping, and greatly eases subsequent program maintenance.
In software development, there is often a trade-off between computational efficiency and final
performance, and programming efficiency, productivity, and readability. For both applied and
theoretical econometricians and statisticians, this frequently leads to a choice between low-level
languages such as C++, and high-level languages or software such as PcGive, GAUSS, or Matlab
(e.g. [23] and [27]). A typical academic study might involve development of asymptotic theory
for a new procedure with use of symbolic manipulation software such as Mathematica, assessment
of the finite-sample properties through Monte Carlo simulation using C++ or Ox, treatment of a
very large microeconometric database in MySQL, preliminary data analysis in EViews or Stata,
production of high quality graphics in R, and finally creation of a written report using LaTeX.
An industrial application will often add to this some degree of automation (of data treatment or
reporting, for example).
1
For brevity, we will omit the prefix http:// from internet URL references throughout the paper.
We will motivate the use of Python as a particularly appropriate language for high performance
stand-alone research applications in econometrics and statistics, as well as its more commonly
known purpose as a scripting language for gluing different applications together. In industry and
academia, Python has become an alternative to low-level compiled languages such as C++. Recent
examples in large-scale computational applications include [4], [16], [17], [19] and [20, who explicitly
refers to faster development times], and indicate comparable run times with C++ implementations
in some situations (although we would generally expect some overhead from using an interpreted
language). The Python Wiki lists Google, Industrial Light and Magic, and Yahoo! among major
organizations with applications written in Python.2 Furthermore, Python can be rapidly mastered,
which also makes it suitable for training purposes ([3] discusses physics teaching).
The paper is organized as follows. Section 2 explains how Python and various important
additional components can be installed on a Windows machine. Section 3 introduces the core
features of Python, which are straightforward, even for users with little programming experience.
While we do not attempt to replace the excellent book-length introductions to Python such as
[2, a comprehensive treatment of standard modules], [8, with good case studies and exercises],
[11, with detailed examples], and [14, more oriented towards computational science], we provide
enough detail to enable a user to immediately start serious work. Shorter introductions to the core
language include the Python tutorial [32], which is regularly updated.3 Python comes into its own
as a general programming language when some of the many free external modules are imported.
In Section 4, we detail some more advanced uses of Python, and show how it can be extended with
scientific modules (in particular, NumPy and SciPy), and used to link different parts of a research
project. We illustrate practical issues such as data input and output, access to the graphical
capabilities of R (through the rpy module), automatic creation of LaTeX code (a program segment)
2
See wiki.python.org/moin/OrganizationsUsingPython.
3
For additional documentation, see [25], and references in, e.g. [14] and www.python.org/doc/.
from within Python (with numerical results computed using scipy), object-oriented programming
in a rapid prototype of a copula estimation, incorporation of C++ code within a Python program,
and numerical optimization and plotting (using matplotlib). In Section 5, we compare the speed of
extended Python to a number of other scientific software packages and programming environments,
in a series of tests. We find comparable performance to languages such as Ox, for a variety of
mathematical operations. Section 6 concludes the paper, and mentions additional useful modules.
2 Installation of packages
Here, we describe the procedure for installation of Python and some important additional packages,
on a Windows machine. We use Windows for illustration only, and discussion of installation for
other operating systems is found in [11] and [33], and is generally straightforward. We use Python
2.6.5 (March 2010), which is the most recent production version for which compatible versions of the
scientific packages NumPy and SciPy are available. Python 2.6.5 includes the core language and
the standard libraries, together with download instructions.4 After installation, the Python Integrated Development Environment
(IDLE), or ‘shell window’ (interactive interpreter) becomes available (see Figure 1). The compatible
Pywin32 should also be installed. This provides extension modules for access to many Windows APIs.
Two open source packages provide advanced functionality for scientific computing. The first of
these, NumPy (numpy.scipy.org), enables Matlab-like multidimensional arrays and array methods,
linear algebra, Fourier transforms, random number generation, and tools for integrating C++ and
Fortran code into Python programs (see [21] for a comprehensive manual, and [14, Chapter 4] for
applications). NumPy version 1.4.1 is stable with Python 2.6.5 and Pywin32 (build 210), and is
available from sourceforge.net (follow the link from numpy.scipy.org). The second package,
SciPy (www.scipy.org), which requires NumPy, provides further mathematical libraries, including
statistics, numerical integration and optimization, genetic algorithms, and special functions. SciPy
version 0.8.0 is stable with the above packages, and is available from sourceforge.net (follow the
link from www.scipy.org); see [29] for a reference. We use NumPy and SciPy extensively below.
For statistical computing and an excellent graphical interface, Python can be linked to the R
language (www.r-project.org), through the RPy module. Python and
R are discussed at length in [6], [12], [26] and [34]. An R journal exists (journal.r-project.org).
Release R 2.9.1 (June 2009; follow the download links on rpy.sourceforge.net, and choose ‘full
installation’) and RPy 1.0.3, available from sourceforge.net, are stable with Python 2.6.5.
An additional third-party module that is useful for data-processing is MDP 2.6: Modular
Toolkit for Data Processing (see [35] for details). It contains a number of learning algorithms
and, in particular, user-friendly routines for principal components analysis. It is available from
mdp-toolkit.sourceforge.net. Publication-quality graphics are provided by Matplotlib 0.99.3
([7]), and examples can be found at matplotlib.sourceforge.net. A C++ compiler is also needed
to run Python programs that contain C++ code segments, and a good free option is a full MinGW
5.1.6 (Minimalist GNU for Windows) installation. An automatic Windows installer is available
from www.mingw.org (which links to sourceforge.net), and contains the GCC (GNU Compiler
Collection), which supports C++. We refer to the above installation as ‘extended Python’, and use it
throughout the paper, and especially in Sections 4 and 5. We have installed the individual packages
for illustration, but bundled scientific distributions of Python and additional packages are available.
These include pythonxy for Windows (code.google.com/p/pythonxy/) and the Enthought Python Distribution (www.enthought.com).
Python is well supported by a dynamic community, with helpful online mailing lists, discussion
forums, and archives. A number of Python-related conferences are held annually.5
The core Python 2.6.5 implementation is made much more powerful by standard ([31]) and third-
party modules (such as RPy and SciPy). A module is easily imported using the import command
(this does not automatically run the module code). For clarity, this is usually performed at the
start of a program. For instance (see Example 0 below), import scipy (Python is case-sensitive)
loads the main scipy module and methods, which are then called by, e.g. scipy.pi (this gives
π). The available scipy packages (within the main module) can be viewed by help(scipy). If a
single scipy package is of interest, e.g. the stats package, then this can be imported by import
scipy.stats (in which case methods are accessed as, e.g. scipy.stats.kurtosis(), which gives
the excess kurtosis), or from scipy import stats (in which case methods are accessed by, e.g.
stats.kurtosis()). It is often preferable to use the former, since it leads to more readable
programs, while the latter will also overwrite any current packages called stats. Another way to
overcome this problem is to rename the packages upon import, e.g. from scipy import stats
as NewStatsPackage. If all scipy packages are of interest, then these can be imported by from
scipy import *, although this will also overwrite any existing packages with the same names.
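For illustration, these import variants can be combined as follows (a minimal sketch, using the scipy.pi and scipy.stats.kurtosis() examples from the text):
import scipy # qualified access to the main module
print scipy.pi # 3.14159...
import scipy.stats # qualified access to a single package
print scipy.stats.kurtosis([1.0,2.0,3.0,4.0]) # excess kurtosis
from scipy import stats as NewStatsPackage # renamed import, avoiding any name clash
print NewStatsPackage.kurtosis([1.0,2.0,3.0,4.0]) # same result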
Longer Python programs can be split into multiple short modules, for convenience and re-
usability. For instance, in Example 3 below, it is suggested that a user-defined LaTeX table function
tex_table be saved in a file tex_functions.py (the module name is then tex_functions). As
above, the function can be made available by import tex_functions, and then used directly
as tex_functions.tex_table(). In the examples, we will use both forms of import.6
5
mail.scipy.org/mailman/listinfo/scipy-user and mail.scipy.org/mailman/listinfo/numpy-discussion. An
RPy mailing list is at lists.sourceforge.net/lists/listinfo/rpy-list. Conference announcements are posted
at www.python.org/community/workshops.
6
When a script is run that imports tex_functions, a compiled Python file tex_functions.pyc will usually be
created automatically in the same directory as tex_functions.py. This serves to speed up subsequent start-up
(module load) times of the program, as long as the file tex_functions.py is not modified. A list of all functions
defined within a module (and of all functions defined within all imported modules) is given by dir(), e.g. import
tex_functions and dir(tex_functions) would give a list including ‘tex_table’.
3 Language basics
The IDLE can be used to test short commands in real-time (input is entered after the prompt
>>>). Groups of commands can be written in a new IDLE window, saved with a .py suffix, and
executed as a regular program in IDLE, or in a DOS window by double-clicking on the file. Single-
line comments are preceded by a hash #, and multi-line comments are enclosed within multiple
quotes """. Multiple commands can be included on one line if separated by a semi-colon, and long
commands can be enclosed within parentheses () and split over several lines (a back-slash can also
be used at the end of each line to be split). A general Python object a (e.g. a variable, function,
class instance) can be used as a function argument f(a), or can have methods (functions) applied to
it, with dot syntax a.f(). There is no need to declare variables in Python since they are created
at the moment of initialization. Objects can be printed to the screen by entering the object name at the interactive prompt, or explicitly with print.
The operators +, -, * and / work for the three most commonly-used numeric types: integers, floats,
and complex numbers (a real and an imaginary float). Division is written x/y, and returns the
floor for integer arguments (in Python 2). The modulus x%y returns the remainder of x divided by y, and powers x^y are
given by x**y. Variable assignment is performed using =, and must take place before variable use.
Python is a dynamic language, and variable types are checked at run-time. It is also strongly-typed,
7
Some care must be taken with variable assignment, which manipulates ‘references’, e.g. a=3; b=a does not
make a copy of a, and so setting a=2 will leave b=3 (the old reference is deleted by re-assignment). Some standard
mathematical functions are available in the standard math module. Note that variables may be assigned to functions,
e.g. b=math.cos is allowed, and b(0) gives 1.0. As for many other mathematical packages, Python also supports
long integer arithmetic, e.g. (2682440**4)+(15365639**4)+(18796760**4) and (20615673**4) both give the 30-digit
number 180630077292169281088848499041L (this verifies the counterexample in [9] to Euler’s (incorrect) conjectured
generalization of Fermat’s Last Theorem: here, that three fourth powers never sum to a fourth power.)
>>> x=y=2; z=(2+3j)*(2-3j); print x, y, z, z.real, type(z), 3/2, 3.0/2.0, 3%2, 3**2 \
# simultaneous assignment, complex numbers, type, division, modulus, power
2 2 (13+0j) 13.0 <type 'complex'> 1 1.5 1 9
>>> y=float(x); print x, type(x), y, type(y) # variable type re-assignment
2 <type 'int'> 2.0 <type 'float'>
Python is particularly good at manipulating strings, which are immutable unless re-assigned.
Strings can be written using single (or double) quotes, concatenated using +, repeated by *, and
‘sliced’ with the slicing operator [r:s], where element s is not included in the slice (indexation
starts at 0). Negative indices correspond to position relative to the right-hand-side. Numbers are
converted to strings with str(), and strings to floats or integers by float() or int().8 When
parsing data, it is often useful to remove all start and end whitespace, with strip().
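For illustration (a minimal sketch of these operations):
>>> s=' 3.14 is pi '; print s.strip() # remove start and end whitespace
3.14 is pi
>>> t='abc'+'def'*2; print t[0:3], t[-3:] # concatenation, repetition, and slicing
abc def
>>> print float('3.14'), str(42) # conversions between strings and numbers
3.14 42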
Python has a number of useful built-in data structures. A list is a mutable ordered set of arbitrary
comma-separated objects, such as numbers, strings, and other lists. Lists (like strings) can be
manipulated using +, *, and the slicing operator [r:s]. A list copy can be created using [:]. Nested
list elements are indexed by [r][s][· · · ]. Lists can be sorted in ascending order (numerically
8
The raw_input() command can be used to prompt user input, e.g. data=raw_input('enter:') will prompt with
'enter:' and creates a string data from the user input. Example 3 below shows that double and triple backslashes
in a string will return a single and a double backslash when the string is printed (this is useful when automatically
generating LaTeX code). An alternative is to use a raw string r'string', which will almost always be printed as
entered. It is also possible to search across strings, e.g. a in 'string', while len(a) gives the number of characters
in the string. Python also provides some support for Unicode strings (www.unicode.org).
and then alphabetically) with the method sort(), or reversed using reverse(). Lists can also
be sorted according to complicated metrics, and this can be very useful in scientific computing.
For example, if A, B, and C are
three matrices, contained in a list x=[A,B,C], then x can be ordered by the determinant (say) using
x.sort(key=det), which will give x=[A,C,B], and where the det function has been imported by
from scipy.linalg import det.
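A minimal sketch, with matrices chosen by us so that the determinants give the ordering described above:
>>> from scipy import array
>>> from scipy.linalg import det
>>> A=array([[2.,0.],[0.,2.]]); B=array([[3.,0.],[0.,3.]]); C=array([[1.,0.],[0.,5.]]) # det 4, 9, 5
>>> x=[A,B,C]; x.sort(key=det); print [det(i) for i in x] # ascending determinant: x=[A,C,B]
[4.0, 5.0, 9.0]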
Strings can be split into a list of elements using the split() method, which is very useful
when parsing databases. Note that methods can be combined, as in strip().split(), which
are executed from left to right. New lists can be constructed easily by list comprehension, which
loops over existing lists, may be combined with conditions, and can return nested lists.9 The
enumerate() function loops over the elements of a list, and returns their position (index) and
value, and the zip() function can be used for pairwise-combination of lists.
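For illustration (a minimal sketch):
>>> a=[1,2,3,4]
>>> print [x**2 for x in a if x%2==0] # list comprehension with a condition
[4, 16]
>>> print [(i,val) for i,val in enumerate(['x','y'])] # position (index) and value
[(0, 'x'), (1, 'y')]
>>> print zip([1,2,3],['a','b','c']) # pairwise combination of lists
[(1, 'a'), (2, 'b'), (3, 'c')]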
9
The slicing operator can also take a step-length argument, i.e. [r:s:step]. An empty list of length n is given
by [None]*n, and len() gives the number of elements. The append() and extend() methods can also be used
instead of +=, for lists, or elements of lists, respectively. List elements can be assigned to with a[i]=b, and a list
slice can be replaced with a slice of a different length. Items can be added at a specific index with insert(), and
removed with remove() or pop(). The index of a given element can be found using index(), and its number of
occurrences by count(). Slices can be deleted (and the list dimension changed) by del a[r:s]. A Python tuple
is essentially an immutable list that can be created from a list using the tuple() function. They behave like lists,
although list methods that change the list cannot be applied to them. Tuples are often useful as dictionary keys
(see below). The default of split() is to split on all runs of whitespace, and the inverse of e.g. a.split(';')
is ';'.join(a.split(';')). Also, range(n) is equivalent to [0,1,2,...,n-1]. Another useful string method is
a.replace('string1','string2'), which replaces all occurrences of 'string1' in a with 'string2'.
Python dictionaries (also known as ‘hash tables’ or ‘associative arrays’) are flexible mappings,
and contain values that are indexed by unique keys. The keys can be any immutable data type,
such as strings, numbers and tuples. An empty dictionary is given by a={}, a list of keys is
extracted using a.keys(), and elements are accessed by a[key ] or a.get(key ). Elements can be
added to (or re-assigned) and removed from dictionaries.10 Dictionaries can also be constructed
>>> results={'betahat':[-1.23,0.57],'loglik':-7.6245,'R2':0.18,'convergence':'yes'} \
# dictionary construction
>>> print results.keys(), results['betahat'][1], results['R2']>0.50 # dictionary manipulation
['convergence', 'loglik', 'R2', 'betahat'] 0.57 False
>>> print dict([(x,[x**2,x**3]) for x in range(5)]) # dictionary build from list
{0: [0, 0], 1: [1, 1], 2: [4, 8], 3: [9, 27], 4: [16, 64]}
10
As for lists, dictionary elements can be re-assigned and deleted, and membership is tested by in. For a full list
of dictionary methods, see [11, Chapter 4].
Commands can be executed subject to conditions by using the if, elif (else if) and else statements,
and can contain combinations of == (equal to), <, >, <=, >= and != (not equal to), and the usual
Boolean operators and, or and not. Python while statements repeat commands if some condition
is satisfied, and for loops over the items of an iterable sequence or list, e.g. for i in a:.11
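A minimal sketch of these statements:
total=0
for i in range(5): # loop over an iterable sequence
    if i%2==0 and i>0: total+=i # Boolean condition: adds 2 and 4
    elif i==3: continue # skip to the next iteration (see footnote 11)
    else: pass # placeholder (no action)
print total # 6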
A function is defined by def and return statements, and can be documented by including a
""" comment within the body, which can be viewed using help(function name ). In Example
0, the function pdfnorm returns the density function f(x) of the normal N(mu, sigma**2), and
specifies default values for the arguments mu=0 and sigma=1.12 The function can be called without
these arguments, or they can be reset, or can be called using keyword arguments. Functions can
also be called by ‘unpacking’ arguments from a list using *[]. Example 0 gives four ways in
which the function can be called to give f(2) ≈ 0.0540. The function can be evaluated at multiple
arguments using list comprehension, e.g. print [pdfnorm(x) for x in range(10)].13 However,
it is much faster to call the function with a vector array argument by from scipy import arange
and print pdfnorm(arange(10)).14 Python can also deal naturally with composite functions, e.g.
11
The break and continue statements are respectively used to break out of the smallest enclosing for / while
loop, and to continue with the next iteration. The pass statement performs no action, but is useful as a placeholder.
12
This is for illustration only: in practice, the SciPy stats module can be used, e.g. from scipy.stats import
norm, and then norm.pdf(x) gives the same result as pdfnorm(x), for the standard normal N (0, 1).
13
This is also achieved by passing pdfnorm to the map() function, i.e. print map(pdfnorm,range(10)).
14
Array computations are very efficient. In a simple speed test (using the machine described in Section 5), we
compute pdfnorm(x) once only, where x ∈ {0, 1, . . . , 1000000}, by (a) [pdfnorm(x) for x in range(1000000)], (b)
map(pdfnorm,range(1000000)), (c) from scipy import arange and pdfnorm(arange(1000000)), and also (d) from
from scipy import pi,sqrt,exp # import scipy module pi, sqrt, exp
def pdfnorm(x,mu=0,sigma=1): # function definition and default arguments
"""Normal N(mu,sigma**2) density function. # function documentation string
Default values mu=0, sigma=1.""" # function name, arguments, string returned by help(pdfnorm)
return (1/(sqrt(2*pi)*sigma))*exp(-0.5*((x-mu)**2)/(sigma**2)) # return function result
print pdfnorm(2), pdfnorm(2,0,1), pdfnorm(2,sigma=1), pdfnorm(*[2,0,1]) # call pdfnorm
Python uses call by assignment, which enables implementation of function calls by value and by
reference. Essentially, the call by value effect can be obtained by appropriate use of immutable
objects (such as numbers, strings or tuples), or by manipulating but not re-assigning mutable
objects (such as lists, dictionaries or class instances). The call by reference effect can be obtained
by re-assigning mutable objects; see e.g. [14, Sections 3.2.10, 3.3.4] for discussion.
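A minimal sketch of the two effects:
def modify(a): a.append(99) # manipulates a mutable object: visible to the caller
def rebind(a): a=[99] # re-assigns the local name only: invisible to the caller
x=[1,2]
modify(x); print x # [1, 2, 99]
rebind(x); print x # [1, 2, 99] (unchanged)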
4 Longer examples
It is straightforward to import data of various forms into Python objects. We illustrate using 100
cross-sectional observations on income and expenditure data in a text file, from [10, Table F.8.1].16
scipy.stats import norm and norm.pdf(arange(1000000)). Tests (a) and (b) both run in roughly 58 seconds,
while (c) and (d) take about 0.5 seconds, i.e. the array function is more than 100 times faster. It is encouraging that
the user-defined pdfnorm() is comparable in speed terms to the SciPy stats.norm.pdf().
15
Given two functions f : X → Y and g : Y → Z, where the range of f is the same set as the domain of g (otherwise
the composition is undefined), then the composite function g ◦ f : X → Z is defined as (g ◦ f)(x) := g(f(x)).
16
The dataset is freely available at www.stern.nyu.edu/~wgreene/Text/Edition6/TableF8-1.txt. The variable
names (‘MDR’, ‘Acc’, ‘Age’, ‘Income’, ‘Avgexp’, ‘Ownrent’, and ‘Selfempl’) are given in a single header line. Data is
reported in space-separated columns, which contain integers or floats. See Examples 2, 3, and 6 for analysis.
In Example 1a, a file object f is opened with open(). A list of variable names, header, is created
from the first line of the file: readline() leaves a new line character at the end of the string,
which is removed by strip(), and the string is split into list elements by split(). A dictionary
data_dict is initialized by list comprehension with keys taken from header, and corresponding
values [] (an empty list). The dictionary is then filled with data (by iterating across the remaining
lines of f), after which the file object is closed by close().17 The command eval() ‘evaluates’
the data into Python expressions, and the dictionary elements corresponding to the variables ‘Acc’,
‘Age’, ‘MDR’, ‘Ownrent’ and ‘Selfempl’ are automatically created as integers, while ‘Avgexp’ and
‘Income’ are floats. The formatted data dictionary is then ready for use in applied work.
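A minimal sketch of this parse (the authors' Example 1a listing is not reproduced here; the local filename is ours):
f=open('TableF8-1.txt','r') # open a file object
header=f.readline().strip().split() # list of variable names
data_dict=dict([(key,[]) for key in header]) # keys from header, empty-list values
for j in f: # iterate across the remaining lines
    for key,value in zip(header,j.strip().split()):
        data_dict[key].append(eval(value)) # integers or floats, via eval()
f.close() # close the file object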
The cPickle module provides a powerful means of saving arbitrary Python objects to file,
and for retrieving them.18 The ‘pickling’ (save) can be applied to, e.g. numbers and strings,
lists and dictionaries, top-level module functions and classes, and creates a byte stream (string
representation) without losing the original object structure. The original object can be reconstructed
17
Valid arguments for open() are the filename (including the path if this is not the current location) and the mode
of use of the file: useful are ‘r’ read only (default) and ‘w’ write only (‘r+’ is read and write). While f.readlines()
reads f into a list of strings, f.read() would read f into a single string. The iteration for j in f.readlines(): in
this example could also be replaced by for j in f:.
18
cPickle is an optimized C implementation of the standard pickle module, and is reported to be faster for
data save and load ([2] and [31, Section 12.2]), although some of the pickle functionality is missing. The default
cPickle save ‘serializes’ the Python object into a printable ASCII format (other protocols are available). See
docs.python.org/library/pickle.html for further details. In Example 1b, the raw data file is 3.37k, and the
Python .bin, which contains additional structural information, is 5.77k. The present authors have made frequent use
of cPickle in parsing and manipulating the U.S. Department of Transportation Origin and Destination databases.
by ‘unpickling’ (load). The technique is very useful when storing the results of a large dataset parse
to file, for later use, avoiding the need to parse the data more than once. It encourages short modular
code, since objects can easily be passed from one code (or user) to another, or sent across a network.
The speed of cPickle also makes Python a natural choice for application checkpointing, a technique
which stores the current state of an application, and that is used to restart the execution should the
application fail (e.g. following a crash). Checkpointing is especially useful during long
and intensive Monte Carlo simulation (e.g. a bootstrap simulation, or one with a heavy numerical
component).
In Example 1b, a file object g is created, and the dictionary data_dict is pickled to the file
python_data.bin, before being immediately unpickled to a new dictionary data_dict2.
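A minimal sketch of Example 1b:
import cPickle
g=open('python_data.bin','w') # file object, write mode
cPickle.dump(data_dict,g) # pickle (save) the dictionary
g.close()
g=open('python_data.bin','r')
data_dict2=cPickle.load(g) # unpickle (load) into a new dictionary
g.close()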
We build on Example 1, and show how Python can be linked to the free statistical language R,
and in particular to its graphical features, through the rpy module. Example 2 creates an empty
R pdf object (by r.pdf), called descriptive.pdf, of dimension 1 × 2 (by r.par), to be filled with
two generated plots: the first r.plot creates a scatter of the ‘Income’ and ‘Avgexp’ variables from
data_dict, while the second creates a kernel density (r.density) plot of the ‘Income’ variable.
The .pdf Figure 2 is automatically generated, and can be imported into a LaTeX file (as here!).
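A minimal sketch of these rpy calls (plot options here are ours, for illustration):
from rpy import r
r.pdf('descriptive.pdf',width=10,height=5) # empty R pdf object
r.par(mfrow=[1,2]) # 1 x 2 plot dimension
r.plot(data_dict['Income'],data_dict['Avgexp'],xlab='Income',ylab='Expenditure') # scatter
r.plot(r.density(data_dict['Income']),xlab='Income',ylab='frequency',main='') # kernel density
r.dev_off() # close the pdf device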
[Left panel: scatter of Income vs. Expenditure. Right panel: kernel density of Income (frequency).]
Figure 2: Python/R .pdf output plot, generated using rpy (see Example 2).
This example shows how to design a simple function that will create LaTeX code for a table of
descriptive statistics (see [13] for related discussion of ‘literate econometric practice’, where models,
data, and programs are dynamically linked to a written report of the research, and R’s ‘Sweave’
for a complementary approach). The user-defined function tex_table takes arguments filename
(the name of the .tex file), data_dict (a data dictionary, from Example 1), and cap (the table
caption). The list variables contains the sorted keys from data_dict. The output string holds
table header information (and where a double backslash corresponds to a single backslash in the
string, and a triple backslash corresponds to a double backslash). For each variable name i, the
output string is augmented with the mean, standard deviation, minimum and maximum of the
corresponding variable from data_dict, computed using the scipy.stats functions. The string
formatting symbol % separates a string from values to format. Here, each %0.2f (formats a value
to a 2 decimal place float) acts as a placeholder within the string, and is replaced from left to right
by the values given after the string.19 The output string also contains table footer information,
including the caption and a LaTeX label reference. Once output has been created, it is written
(using the write method) to filename. The function return ends the procedure.
19
For additional conversion specifiers, see [11, Table 3-1].
(Example 3. LaTeX table from a data dictionary; code saved in tex_functions.py)
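A minimal sketch of tex_table (the authors' full listing is not reproduced here; the exact table layout, and our use of scipy's mean and std, are illustrative):
from scipy import mean, std

def tex_table(filename,data_dict,cap):
    variables=sorted(data_dict.keys()) # sorted keys from the data dictionary
    output='\\begin{table}\\begin{center}\\begin{tabular}{lrrrr}\n' # '\\' prints as '\'
    output+='Variable & Mean & Std. dev. & Min & Max \\\\\n' # '\\\\' prints as the LaTeX row end
    for i in variables:
        x=data_dict[i]
        output+='%s & %0.2f & %0.2f & %0.2f & %0.2f \\\\\n' % (i,mean(x),std(x),min(x),max(x))
    output+='\\end{tabular}\\caption{%s}\\label{descriptive}\\end{center}\\end{table}' % cap
    f=open(filename,'w'); f.write(output); f.close() # write output to filename
    return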
The tex_table function can be saved in a separate ‘module’ tex_functions.py, and then called from any program that imports it.
It is straightforward to import the resulting LaTeX output into a .tex file (as in Table 1 in this
paper) using a LaTeX command, e.g. \input{./table.tex}.20 The generic code can be used with
any data dictionary of the same general form as data_dict, and is easily modified.
The following example illustrates rapid prototyping and object-oriented Python, with a simple
bivariate copula estimation. Appendix A.1 contains a short discussion of the relevant copula theory.
We use 5042 time series observations on the daily closing prices of the Dow Jones Industrial Average
20
It is also possible to automatically compile a valid filename.tex file (that includes all relevant preamble and
document wrapper material) to a .pdf by import os followed by os.system("pdflatex filename.tex ").
and the S&P500 over 9 June 1989 to 9 June 2009.21 The raw data is parsed in Python, and log
returns are created, as x and y (not shown). There are two classes: a base class Copula, and
a derived class NormalCopula. The methods available in each class, and the class inheritance
structure, can be viewed by, e.g. help(Copula). A Copula instance (conventionally referred to
by self) is initialized by a=Copula(x,y), where the initialization takes the data as arguments
(and init is the ‘constructor’). The instance variables a.x and a.y (SciPy data arrays) and
a.n (sample size) become available. The method a.empirical cdf(), with no argument, returns
an empirical distribution function Fb(x) = (n + 1)−1 ni=1 1Xi ≤x , for both a.x and a.y, evaluated
P
at the observed datapoints (sum returns the sum). The Copula method a.rhat() will return an
‘in development’ message, and illustrates how code segments can be clearly reserved for future
types). The Copula method a.invert cdf() is again reserved for future development, and will
return a user-defined error message, since this operation requires the estimated copula parameters,
which have not yet been computed (and so a does not have a simulate attribute; this is tested
with hasattr).22
21
The dataset is available from the authors as djia sp500.csv. The data was downloaded from finance.yahoo.com
(under tickers ^DJI for Dow Jones and ^GSPC for S&P500). It is easy to iterate over the lines of a .csv file with
f=open(’filename.csv’,’r’) and for i in f:, and it is not necessary to use the csv module for this. This example
is not intended to represent a serious copula model (there is no dynamic aspect, for instance).
22
Python has flexible built-in exception handling features, which we do not explore here (e.g. [2, Chapter 5]).
NormalCopula(Copula) inherits the methods of Copula (i.e. __init__, empirical_cdf as well as
invert_cdf), replacing them by methods defined in the NormalCopula class if necessary (i.e. rhat),
and adding any new methods (i.e. simulate). The first time that the method b.rhat() is called, it
will compute the estimated copula parameters $\hat{R} = n^{-1} \sum_{i=1}^{n} \upsilon_i \upsilon_i'$, where $\upsilon_i = (\Phi^{-1}(u_{i1}), \Phi^{-1}(u_{i2}))'$
(b.test), $\Phi^{-1}$ is the inverse standard normal distribution (norm.ppf from scipy.stats), u1 and
u2 are the empirical distributions of b.x and b.y respectively, and i indexes the observation (the
NumPy matrix type mat is also used here, and has transpose method .T). The routine further
corrects $\hat{R}$ (b.u) by:
$$(\hat{R})_{ij} \mapsto \frac{(\hat{R})_{ij}}{\sqrt{(\hat{R})_{ii}\,(\hat{R})_{jj}}},$$
and stores the result as b.result. All subsequent calls of b.rhat() will immediately return
the stored b.result. Once the copula parameter has
been estimated, it can be used for simulation with b.simulate() (which would have automatically
called b.rhat() if this had not yet been done). This method computes the Cholesky decomposition
$\hat{R} = AA'$ (cholesky, from scipy.linalg, where the lower-triangular A is stored in b.chol), which
is used to scale independent bivariate standard normal variates $x = (x_1, x_2)' = A\,N(0,1)^2 = N(0, \hat{R})^2$,
generated using the rpy r.rnorm function. Bivariate uniforms $u = (u_1, u_2)'$ are then
computed by passing x through $\Phi(\cdot)$ (norm.cdf), where $\Phi$ is the univariate standard normal
distribution. This gives $(u_1, u_2)' = (\hat{F}_1(x_1), \hat{F}_2(x_2))'$, where $\hat{F}_j$ are the empirical marginals. We
plot 1000 simulated pairs in Figure 3.
In a full application, we may be interested in converting the uniform marginals back to the
original scale. This requires numerical inversion of the empirical distribution functions, which could
be tricky. In this example, the possibility is left open, and b.invert_cdf() will now return an
‘in development’ message, as required. We could imagine extending the code to include additional
classes, as the empirical study becomes deeper. For instance, a hierarchy Copula, EllipticalCopula(Copula),
NormalCopula(EllipticalCopula), and StudentCopula(EllipticalCopula) would be natural: the base
class Copula could contain general likelihood-based methods, in addition to computation of empirical
or parametric marginals, and methods for graphical or descriptive data analysis; the derived class
EllipticalCopula could contain the ‘semi-closed form’ estimation routines shared by the normal
and the Student copulas (but not by non-elliptical copulas, such as
Archimedean copulas, which could have a separate class); and the derived classes NormalCopula
and StudentCopula could then return the relevant parameter estimates
($\hat{R}$ directly for the normal copula; and $\hat{R}$ and an estimated degrees-of-freedom parameter for the
Student copula).
(Example 4. Rapid prototype of bivariate copula estimation using classes; comments in main text)
# Imports assumed for this listing (a sketch; the paper's full program defines these):
#   from scipy import array, mat, sqrt
#   from scipy.stats import norm
#   from scipy.linalg import cholesky
#   from rpy import r
class NormalCopula(Copula): # derived from the base class Copula (defined earlier; not shown)
    def rhat(self):
        if not hasattr(self,'result'): # estimate only on the first call
            self.test=(array(zip(norm.ppf(self.empirical_cdf())[0],
                norm.ppf(self.empirical_cdf())[1])))
            self.u=array(mat(self.test).T*mat(self.test)/self.n)
            self.result=(array([[self.u[i][j]/(sqrt(self.u[i][i])*sqrt(self.u[j][j]))
                for i in range(2)] for j in range(2)]))
        return self.result # stored result is returned on subsequent calls
    def simulate(self):
        if not hasattr(self,'result'): self.result=self.rhat() # estimate first, if needed
        self.chol=cholesky(self.result,lower=1) # lower-triangular Cholesky factor
        return norm.cdf(array(mat(self.chol)*mat(r.rnorm(2,0,1)).T)) # simulated pair (u1,u2)
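A brief usage sketch (x and y are the log-return arrays of the main text):
b=NormalCopula(x,y) # initialized via the inherited __init__
print b.rhat() # corrected correlation matrix estimate (2 x 2 array)
print b.simulate() # one simulated pair, passed through norm.cdf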
[Scatter plot: ‘Copula simulation’, with axes Uniform (X) vs. Uniform (Y).]
Figure 3: 1000 simulations from a simple bivariate copula model (see Example 4).
4.5 (Example 5.) Using C++ from Python for intensive computations
This example shows how to include C++ code directly within a Python program, using the
scipy.weave module. We are motivated by the nested speed test result in Section 5, which shows
that Python nested loops are quite inefficient compared to some software packages. Specifically,
a 5000 × 5000 nested loop that only keeps track of the cumulative sum of the loop indexes
runs in about 9 seconds in Python 2.5.4, compared to 5 seconds in Ox Professional 5.00, for
instance (see Table 2). Generally, we would not advise use of Python nested loops for numerical
computations, and the problem worsens rapidly as the number of loops increases. However, it is
easy to optimize Python programs by writing the heaviest computations in C++ (or Fortran).
To illustrate, Example 5 solves the problem of the slow nested loop in the speed test. The C++
code that will perform the computation is simply included in the Python program as a raw string
r"""string """, exactly as it would be written in a C++ editor (but without the C++ preamble
statements). The scipy.weave.inline module is called with the string that contains the C++
commands (code), the variables that are to be passed to and (although not in this example) from
the C++ code (dimension), the method of variable type conversion (performed automatically using
the scipy.weave.converters.blitz module), and optionally the compiler to be used (here, gcc,
the GNU Compiler Collection). There will be a short compile time, when the Python program is
first run, and we would expect some small overhead compared to the same code running directly
in C++. However, we find over 1000 replications that the loop test now runs in a mean time of
0.02 seconds, or about 600 times faster than in Python (and roughly 300 times faster than Ox).
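A minimal sketch of this approach (variable names are ours; the authors' Example 5 listing is not reproduced here):
from scipy import weave
from scipy.weave import converters
dimension=5000
code=r"""
double total=0.0;
for (int i=0; i<dimension; i++) {
    for (int j=0; j<dimension; j++) {
        total+=i+j; // cumulative sum of the loop indexes
    }
}
return_val=total; // pass the result back to Python
"""
result=weave.inline(code,['dimension'],type_converters=converters.blitz,compiler='gcc')
print result # 124975000000.0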
This example illustrates the numerical optimization functionality of SciPy, and uses Matplotlib to
create publication-quality graphics (see also [16] for an application). The code segment is included
in Appendix A.3. We use the income and expenditure data that was formatted in Example 1, and
analyzed in Examples 2 and 3.23 The data dictionary is loaded using cPickle. Two SciPy arrays
are created: y (100×1) contains the variable ‘Acc’, and X (100×6) contains a constant, ‘Age’,
‘Income’, ‘MDR’, ‘Ownrent’ and ‘Selfempl’ (‘Avgexp’ is dropped, since it perfectly predicts ‘Acc’).
We estimate a probit model $\mathrm{Prob}(y_i = 1) = \Phi(x_i'\beta) + u_i$, $i = 1, 2, \ldots, 100$, where $x_i'$ is the $i$th
row of X, and $\beta = (\beta_0, \ldots, \beta_5)'$. Two user-defined functions specify the negative log-likelihood
$$-\ln L(\beta) = -\sum_{i=1}^{100} \left\{ y_i \ln \Phi(x_i'\beta) + (1 - y_i) \ln(1 - \Phi(x_i'\beta)) \right\}$$
and its analytical gradient
$$\frac{\partial(-\ln L(\beta))}{\partial \beta} = -\sum_{i=1}^{100} \frac{\phi(x_i'\beta)\,(y_i - \Phi(x_i'\beta))}{\Phi(x_i'\beta)\,(1 - \Phi(x_i'\beta))}\, x_i,$$
where $\phi(\cdot)$ is the density function of the standard normal (scipy.stats.norm, scipy.log, and
numpy.dot are used in the expressions). The unconstrained optimization $\hat{\beta} = \arg\min_\beta(-\ln L(\beta))$ is
solved using the SciPy Newton-conjugate-gradient (scipy.optimize.fmin_ncg) method, with the
least squares estimate of β used as starting value (scipy.linalg.inv is used in the calculation),
and making use of the analytical gradient. The method converges rapidly, and the accuracy of the
23
‘Acc’ is a dummy variable taking value 1 if a credit card application is accepted, and 0 otherwise. ‘Age’ is age
in years. ‘Avgexp’ is average monthly credit card expenditure. ‘Income’ is scaled (/10000) income. ‘MDR’ is the
number of derogatory reports. ‘Ownrent’ is a dummy taking value 1 if the individual owns his home, and 0 if he
rents. ‘Selfempl’ is a dummy taking value 1 if the individual is self-employed, and 0 otherwise.
maximum-likelihood estimate $\hat{\beta}$ was checked using EViews 6.0 (which uses different start values).24
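A minimal sketch of this estimation (the full program is in the authors' Appendix A.3; names here are ours, and y and X are the arrays defined above):
import scipy
from scipy import log, dot
from scipy.stats import norm
from scipy.linalg import inv
from scipy.optimize import fmin_ncg

def negloglik(b,y,X): # negative probit log-likelihood
    F=norm.cdf(dot(X,b))
    return -scipy.sum(y*log(F)+(1-y)*log(1-F))

def grad(b,y,X): # analytical gradient of negloglik
    xb=dot(X,b); F=norm.cdf(xb); f=norm.pdf(xb)
    return -dot(f*(y-F)/(F*(1-F)),X)

b0=dot(inv(dot(X.T,X)),dot(X.T,y)) # least squares starting value
bhat=fmin_ncg(negloglik,b0,fprime=grad,args=(y,X)) # Newton-conjugate-gradient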
It could be useful in a teaching environment to explain the estimation procedure in more detail.
Here, we use Matplotlib to design a figure for this purpose (Figure 4). We create a contour plot of
− ln L(β) in the (β1 , β2 )-space, using numpy.meshgrid, as well as matplotlib.pyplot. The plot
settings can all be user-defined (e.g. line widths, colours, axis limits, labels, grid-lines, contour
labels). We use LaTeX commands directly within Matplotlib to give mathematical axis labels, add
a text box with information on the optimization, and annotate the figure with the position of the
least squares starting values (in (β1 , β2 )-space), and the maximum-likelihood estimate. Matplotlib
creates a .png graphic, which can be saved in various formats (here, as a .pdf file).
5 Speed comparisons
In this section, we compare the speed performance of extended Python 2.6.5 with Gauss 8.0,
Mathematica 6.0, Ox Professional 5.00, R 2.11.1, and Scilab 5.1.1. We run 15 mathematical
benchmark tests on a 1.66GHz Centrino Duo machine with 1GB RAM running Windows XP. The
algorithms are adapted from [30, Section 8], and are described in Appendix A.2. They include a
series of general mathematical, statistical and linear algebra operations, that occur frequently in
applied work, as well as routines for nested loops and data import and analysis. The tests are
generally performed on large dimension random vectors or matrices, which are implemented as
SciPy arrays.25 We summarize the tests, and report the extended Python functions that are used:
24
Other numerical optimization routines are available. For instance, BFGS, with numerical or analytical gradient,
is available from scipy.optimize, as fmin_bfgs(f,x0,fprime=fp), where f is the function to be minimized, x0 is the
starting value, and fp is the derivative function (if fprime=None, a numerical derivative is used instead). Optional
arguments control step-size, tolerance, display and execution parameters. Other optimization routines include a
Nelder-Mead simplex algorithm (fmin), a modified direction set method due to Powell (fmin_powell), a Polak-Ribière
conjugate gradient algorithm (fmin_cg), constrained optimizers, and global optimizers including simulated annealing.
25
All of the Python speed tests discussed in Section 5 and Appendix A.2 that require pseudo-random uniform
numbers (13 of the 15 tests) use the well-known Mersenne Twister (MT19937). See Section 10.3 in [21] and Section
9.6 in [31] for details. Python supports random number generation from many discrete and continuous distributions.
For instance, the continuous generators include the beta, Cauchy, χ2 , exponential, Fisher’s F , gamma, Gumbel,
Laplace, logistic, lognormal, noncentral χ2 , normal, Pareto, Student’s t, and Weibull.
[Contour plot of −ln L(β) in (β1, β2)-space, with labelled contours and the annotated MINIMUM.]
Figure 4: Annotated contour plot from probit estimation (see Example 6).
• Two nested loops (core Python loops; fast Python/C++ routine implemented).
For each software environment, the 15 tests were run over 10 (sometimes 5) replications. The code
for the benchmark tests, implemented in Gauss, Mathematica, Ox, R, and Scilab, and the dataset
that is required for the data import test, are available from www.scientificweb.com/ncrunch.
We have made minor modifications to the timing (using the time module) and dimensions of some
of the tests, but have not attempted to further optimize the code developed by [30], although we
have looked for the fastest Python and R implementations in each case. Our results cannot be
directly compared to those in [30], or [15], who run a previous version of these speed tests.
Full results are reported in Table 2, which gives the mean time (in seconds) across all replications
for each of the tests. The tests have been ordered by increasing run-time for the extended Python
implementation. The ‘overall performance’ of each software is calculated following [30], as:
$$\left( n^{-1} \sum_i \frac{\min_j(t_{ij})}{t_{ij}} \right) \times 100\%,$$
where i = 1, 2, . . . , n are the tests, j is the software, and tij is the speed (seconds) of test i with
software j. Higher overall performance values correspond to higher overall speed (maximum 100%).
Overall, extended Python is competitive with the econometric and statistical programming environments Ox and Scilab. For the first 12 tests,
the Python implementation is either the fastest, or close to this, and displays some substantial
speed gains over GAUSS, Mathematica, and R. While the data test imports directly into a NumPy
array, Python is also able to parse complicated and heterogeneous data structures (see Example
1 for a simple illustration). The loop test takes almost twice as long as in GAUSS and Ox, but
is considerably faster than Mathematica, Scilab, and R. It is well-known that Python loops are
inefficient, and most such computations can usually be made much faster either by using vectorized
algorithms ([14, Section 4.2]), or by optimizing one of the loops (often the inner loop). We would
certainly not suggest that Python nested loops be used for heavy numerical work. In Section 4,
Example 5, we show that it is straightforward to write the loop test as a Python/C++ routine,
and that this implementation runs about 600 times faster than core Python. Code optimization
is generally advisable, and not just for Python (see, e.g. www.scipy.org/PerformancePython).
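For instance, the cumulative index sum of the nested loop test can be vectorized directly (a sketch: by symmetry, the double sum of i+j equals 2n times the sum 0+1+...+(n-1)):
from scipy import arange, sum
n=5000
total=2*float(n)*sum(arange(n)) # no explicit loops
print total # 124975000000.0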
The principal components routine is the fastest implementation. The speed of the eigenvalue test is less impressive: extended Python is slower here than Mathematica, Ox, R, and Scilab (see Table 2).
Given the limited number of replications, and the difficulty of suppressing background processes,
the values in Table 2 are only indicative (and especially for the heavier tests, which can sometimes
be observed to have mean run-times that increase in the number of replications), although we do
not expect the qualitative results to change dramatically with increased replications. In any given
application, the choice of software will depend on more than raw speed,
and any slow time-critical code components can always be optimized by using C++ or Fortran.
Test                                    Python   GAUSS   Mathematica   Ox     R      Scilab
Fast Fourier Transform over vector      0.2      2.2     0.2           0.2    0.6    0.7
Linear solve of Xw = y for w            0.2      2.4     0.2           0.7    0.8    0.2
Vector numerical sort                   0.2      0.9     0.5           0.2    0.4    0.3
Gaussian error function over matrix     0.3      0.9     3.6           0.1    1.0    0.3
Random Fibonacci numbers                0.3      0.4     2.3           0.3    0.6    0.5
Cholesky decomposition                  0.4      1.6     0.3           0.6    1.3    0.2
Data import and statistics              0.4      0.2     0.5           0.3    0.8    0.3
Gamma function over matrix              0.5      0.7     3.3           0.2    0.7    0.2
Matrix element-wise exponentiation      0.5      0.7     0.2           0.2    0.8    0.6
Matrix determinant                      0.7      7.3     0.5           3.4    2.1    0.4
Matrix dot product                      1.4      8.9     1.0           1.7    7.8    1.0
Matrix inverse                          2.0      7.3     1.9           6.4    9.0    1.4
Two nested loops*                       8.1      4.3     84.7          4.8    58.0   295.9
Principal components analysis           11.1     359.0   141.7         n/a    55.9   88.3
Computation of eigenvalues              32.3     90.2    24.2          21.7   13.6   17.3
Table 2: Speed test results. The mean time (seconds) across 10 replications is reported to 1
decimal place, for each of the 15 tests detailed in Appendix A.2. The GAUSS and R nested
loops and the GAUSS, R, and Scilab principal components tests were run over 5 replications. The
Scilab principal components test code ([30]) uses a third-party routine. The tests were performed
in ‘Python’ (extended Python 2.6.5), ‘GAUSS’ (GAUSS 8.0), ‘Mathematica’ (Mathematica 6.0),
‘Ox’ (Ox Professional 5.00), ‘R’ (R 2.11.1) and ‘Scilab’ (Scilab 5.1.1), on a 1.66GHz Centrino Duo
machine with 1GB RAM running Windows XP. The fastest implementation of each individual
test is highlighted. ‘Overall performance’ is calculated as in [30]: $(n^{-1}\sum_i \min_j(t_{ij})/t_{ij}) \times 100\%$,
where i = 1, 2, . . . , n are the tests, j is the software used, and $t_{ij}$ is the speed (seconds) of test i
with software j. The speed test codes are python_speed_tests.py, benchga5.e, benchmath5.nb,
benchox5.ox, r_speed_tests.r, and benchsc5.sce, and are available from the authors (code for
the Python and R tests was written by the authors). There is no principal components test in the
[30] Ox code, and that result is not considered in the overall performance for Ox. *For a much
faster (Python/C++) implementation of the Python nested loop test, see Section 4, Example 5,
and the discussion in Section 5.
6 Concluding remarks
Knowledge of computer programming is indispensable for much applied and theoretical research.
Although Python is now used in other scientific fields (e.g. physics), and as a teaching tool,
it is much less well-known to econometricians and statisticians (exceptions are [5], which briefly
introduces Python, and contains some nice examples; and [1]). We have tried to motivate Python
as a powerful alternative for advanced econometric and statistical project work, and in particular
as a ‘unified’ environment for research.
Python is easy to learn, use, and extend, and has a large standard library and extensive third-
party modules. The language has a supportive community, and excellent tutorials, resources, and
references. The ‘pythonic’ language structure leads to readable (and manageable) programs, fast
development times, and facilitates reproducible research. We agree with [13] that reproducibility
is important (they also note the desirability of a single environment that can be used to manage
multiple parts of a research project; consider also the survey paper [22]). Extended Python offers
the possibility of direct programming of large-scale applications or, for users with high-performance
software written in other languages, it can be useful as a strong ‘glue’ between different applications.
We have used the following packages here: (1) cPickle for data import and export, (2)
matplotlib (pyplot) for graphics, (3) mdp for principal components analysis, (4) numpy for efficient
array operations, (5) rpy for graphics and random number generation, (6) scipy for scientific
computation (especially arange for array sequences, constants for built-in mathematical constants,
fftpack for Fast Fourier transforms, linalg for matrix operations, math for standard mathematical
functions, optimize for numerical optimization, special for special functions, stats for statistical
functions, and weave for linking C++ and Python), and (7) time for timing the speed tests.
Many other Python modules can be useful in econometrics (see [2] and [31] for coverage of
standard modules). These include csv (.csv file import and export), os (common operating-
system tools), random (pseudo-random number generation), sets (set-theoretic operations), sys
(system-specific parameters and functions),
urllib and urllib2 (for internet support, e.g. automated parsing of data from websites and
creation of internet bots and web spiders), and zipfile (for manipulation of .zip compressed data
files).
Also useful are the ‘Sage’ mathematics system (www.sagemath.org), the statsmodels Python
statistics package (statsmodels.sourceforge.net), and the ‘SciPy Stats Project’, a blog that tracks statistical developments in SciPy.
Of additional interest are the Psyco just-in-time compiler, which is reported to give substantial
speed gains in some applications (see [14, Section 8], and [28] for a manual), and ScientificPython
(not to be confused with SciPy), which provides further open source scientific tools for Python. F2PY can be
used to link Fortran and Python (cens.ioc.ee/projects/f2py2e), and the SWIG (Simplified
Wrapper and Interface Generator) interface compiler (www.swig.org) provides advanced linking
for C++ and Python. The ‘Cython’ language can be used to write C extensions for Python
(www.cython.org). For completeness, we note that some commercial Python libraries are also
available, e.g. the PyIMSL wrapper (to the IMSL C Numeric library). Python is also appropriate
for network applications, animations, and application front-end management (e.g. it can be linked
to GUI toolkits such as Tkinter). Threading in the default CPython is
implemented so that only one thread can interact with the interpreter at a time (the Global
Interpreter Lock: GIL). However, NumPy can release the GIL, leading to significant speed gains
when arrays are used. Unlike threads, full processes each have their own GIL, and do not interfere
with one another. In general, achieving optimal use of a multiprocessor machine or cluster is
non-trivial. However, Python tools are also available for sophisticated parallelization.
We hope that this paper will encourage the reader to join the Python community!
Acknowledgements R.B. and S.L. thank the editor, Esfandiar Maasoumi, and two anonymous
referees, for comments that improved the paper; and are grateful to Christophe Bontemps, Christine
Choirat, David Joyner, Marko Loparic, Sébastien de Menten, Marius Ooms, Skipper Seabold and
Michalis Stamatogiannis for helpful suggestions, Mehrdad Farzinpour for providing access to a
Mac OS X machine, and John Muckstadt and Eric Johnson for providing access to a DualCore
machine. R.B. thanks ENAC and Farid Zizi for providing partial research funding. R.B. was
affiliated to ENAC when the first draft of this paper was completed. S.L. thanks Nathalie Lenoir
for supporting this project, and Sébastien de Menten, who had the foresight to promote Python at
Electrabel (which involved S.L. in the first place). This paper was typed by the authors in MiKTeX
2.8 and WinEdt 5, and numerical results were derived using extended Python 2.6.5 (described in
Section 2), as well as C++, EViews 6.0, Gauss 8.0, Mathematica 6.0, Ox Professional 5.00, R 2.9.1
and R 2.11.1, and Scilab 5.1.1. The results in Table 2 depend upon our machine configuration,
the number of replications, and our modification of the original [30] benchmark routines. They
are not intended to be a definitive statement on the speed of the other software, most of which we
have used productively at some time in our work. The authors are solely responsible for any views
made in this paper, and for any errors that remain. All extended Python, C++, and R code for
the examples and speed tests was written by the authors. Code and data are available on request.
References
[1] Almiron, M., Almeida, E., and Miranda, M. The reliability of statistical functions in
four software packages freely used in numerical computation. Brazilian Journal of Probability and Statistics.
[2] Beazley, D. Python Essential Reference, 2nd ed. New Riders, 2001.
[4] Bröker, O., Chinellato, O., and Geus, R. Using Python for large scale linear algebra
[5] Choirat, C., and Seri, R. Econometrics with Python. Journal of Applied Econometrics 24
(2009), 698–704.
[6] Cribari-Neto, F., and Zarkos, S. R: Yet another econometric programming environment.
[7] Dale, D., Droettboom, M., Firing, E., and Hunter, J. Matplotlib Release 0.99.3.
matplotlib.sf.net/Matplotlib.pdf, 2010.
[8] Downey, A. Think Python: How to think like a computer scientist - Version 1.1.22.
www.greenteapress.com/thinkpython/thinkpython.pdf, 2008.
[12] Kleiber, C., and Zeileis, A. Applied Econometrics with R. Springer, 2008.
[13] Koenker, R., and Zeileis, A. On reproducible econometric research. Journal of Applied
[14] Langtangen, H. Python Scripting for Computational Science, 2nd ed. Springer, 2005.
[15] Laurent, S., and Urbain, J.-P. Bridging the gap between Gauss and Ox using OXGAUSS.
[17] Meinke, J., Mohanty, S., Eisenmenger, F., and Hansmann, U. SMMP v. 3.0 –
Simulating proteins and protein interactions in Python and Fortran. Computer Physics
[19] Nilsen, J. MontePython: Implementing Quantum Monte Carlo using Python. Computer
Handbook of Econometrics, T. Mills and K. Patterson, Eds., vol. 2. Palgrave MacMillan, 2009.
[23] Ooms, M., and Doornik, J. Econometric software development: past, present and future.
[24] Patton, A. Copula-based models for financial time series. In Handbook of Financial Time
Series, T. Andersen, R. Davis, J.-P. Kreiß, and T. Mikosch, Eds. Springer, 2009.
-pdf-5.4.zip, 2004.
[26] Racine, J., and Hyndman, R. Using R to teach econometrics. Journal of Applied
psycoguide.ps.gz, 2007.
docs.scipy.org/doc/scipy/scipy-ref.pdf, 2010.
[30] Steinhaus, S. Comparison of mathematical programs for data analysis (Edition 5.04).
www.scientificweb.com/ncrunch/ncrunch5.pdf, 2008.
[31] van Rossum, G. The Python Library Reference: Release 2.6.2. docs.python.org/archives/
[34] Zeileis, A., and Koenker, R. Econometrics in R: Past, present, and future. Journal of
Statistical Software 27, 1 (2008).
[35] Zito, T., Wilbert, N., Wiskott, L., and Berkes, P. Modular toolkit for Data Processing
(MDP): a Python data processing framework. Frontiers in Neuroinformatics 2 (2008).
A Appendix
Excellent references to copula theory and applications include [18] and [24]. Let $X$ and $Y$ be
random variables with
$$X \sim F, \qquad Y \sim G, \qquad (X, Y) \sim H,$$
where $F$ and $G$ are marginal distribution functions, and $H$ is a joint distribution function. Sklar's
theorem states that $H(x, y) = C_\Theta(F(x), G(y))$, where $C_\Theta(u, v)$ is (in this paper) a parametric
copula function $C_\Theta : [0,1]^2 \mapsto [0,1]$ that describes the dependence between $u := F(x)$ and
$v := G(y)$, and ‘binds’ the marginals $F$ and $G$ together, to give a valid joint distribution $H$; and
$\Theta$ is a set of parameters that characterize the copula. The probability integral transform
$X \sim F \Longrightarrow F(X) \sim U[0,1]$, where $U[0,1]$ is the standard uniform distribution,
underlies both estimation of and simulation from the copula.
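Since the probability integral transform underpins everything that follows, a short numerical check may be helpful. The following is a minimal sketch of ours (the sample size is arbitrary), verifying that standard Normal draws passed through Φ are approximately uniform:

from scipy import random
from scipy.stats import norm

x = random.standard_normal(100000)   # X ~ N(0,1), so F = Phi
u = norm.cdf(x)                      # F(X) should be approximately U[0,1]
print(u.mean(), u.min(), u.max())    # mean close to 0.5, support within (0,1)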
Elliptical copulas (notably Normal and Student's t) are derived from elliptical distributions.
They model symmetric dependence, and are relatively easy to estimate and simulate. Copulas
are generally estimated using maximum likelihood, although for elliptical copulas some of the
parameters are available in closed form. The copula arguments are computed from the data as
$u_i := (\widehat{F}(x_i), \widehat{G}(y_i))$, where $\widehat{F}(x)$ and $\widehat{G}(y)$
are empirical marginal distributions, e.g. $\widehat{F}(x) := (n+1)^{-1} \sum_{i=1}^{n} \mathbf{1}_{X_i \leq x}$,
and $\Phi^{-1}$ ($\Phi$) is the univariate inverse (standard Normal) distribution function. We further define a bivariate
copula density $\partial^2 C_\Theta(u) / \partial u_1 \partial u_2$, which for the Normal copula gives
$c_R(\upsilon) := |R|^{-1/2} \exp(-\upsilon'(R^{-1} - I_2)\upsilon/2)$, where
$\upsilon = (\Phi^{-1}(u_1), \Phi^{-1}(u_2))'$. Maximum-likelihood estimation of $R$ solves
$$\widehat{R} = \arg\max_{R} \; n^{-1} \sum_{i=1}^{n} \ln c_R(u_i).$$
Numerical optimization of the log-likelihood surface is usually slow and can lead to numerical errors.
For the Normal copula, however, the maximizer is available in closed form:
$$\widehat{R} = n^{-1} \sum_{i=1}^{n} \upsilon_i \upsilon_i',$$
where $\upsilon_i = (\Phi^{-1}(u_{i1}), \Phi^{-1}(u_{i2}))'$. Due to numerical errors, we correct to a valid correlation matrix:
$$(\widehat{R})_{ij} \mapsto \frac{(\widehat{R})_{ij}}{\sqrt{(\widehat{R})_{ii}\,(\widehat{R})_{jj}}}.$$
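For concreteness, the closed-form estimator and the correlation correction can be sketched in a few lines of extended Python. This is our illustration only: the uniforms u below are simulated placeholders for the transformed data $(\widehat{F}(x_i), \widehat{G}(y_i))$.

import numpy
from scipy import random
from scipy.stats import norm

n = 1000
u = random.rand(n, 2)              # placeholder for (F(x_i), G(y_i)) pairs
v = norm.ppf(u)                    # rows are upsilon_i' = (Phi^{-1}(u_i1), Phi^{-1}(u_i2))
R_hat = numpy.dot(v.T, v) / n      # closed form: R_hat = n^{-1} sum_i upsilon_i upsilon_i'
d = numpy.sqrt(numpy.diag(R_hat))
R_hat = R_hat / numpy.outer(d, d)  # correct to a valid correlation matrix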
Once we have estimated the copula parameters, we have, in the bivariate case,
$H(x, y) = C_{\widehat{\Theta}}(\widehat{F}(x), \widehat{G}(y)) := C_{\widehat{\Theta}}(u, v)$.
Simulation from a copula involves generation of $(u_r, v_r)$, which can subsequently be transformed
to the original data scale (not considered here). For the bivariate Normal, we simulate by:
(1) compute the Cholesky decomposition $\widehat{R} = AA'$; (2) simulate a 2-vector of standard
Normals $z \sim N(0,1)^2$; (3) scale the standard Normals: $x = Az \sim N(0, \widehat{R})$;
(4) simulate the uniforms $u_j$ by $u_j = \Phi(x_j)$, $j = 1, 2$: this gives
$(u_1, u_2) = (\widehat{F}_1(x_1), \widehat{F}_2(x_2))$.
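A minimal sketch of these four steps follows (our illustration; $\widehat{R}$ is hard-coded here, whereas in practice it comes from the estimation step above):

import numpy
from scipy import array, random
from scipy.stats import norm
from scipy.linalg import cholesky

R_hat = array([[1.0, 0.5], [0.5, 1.0]])  # illustrative estimated correlation matrix
A = cholesky(R_hat, lower=True)          # (1) Cholesky factor: R_hat = A A'
z = random.standard_normal((2, 10000))   # (2) standard Normal 2-vectors
x = numpy.dot(A, z)                      # (3) x = Az ~ N(0, R_hat)
u = norm.cdf(x)                          # (4) uniforms u_j = Phi(x_j)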
We detail the algorithms that are used for the speed tests discussed in Section 5 (adapted from
[30]), and reported in Table 2. We assume that all required Python modules have been imported.
Matrices and vectors have representative elements X = (X_{rs}) and x = (x_r), respectively. Further,
we let I = {1, 2, . . . , 10}. We give brief details on the extended Python implementations.
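Each test below follows the same replication and timing pattern. As a minimal sketch of that pattern (ours; the timer choice and the reported statistics are assumptions, not the exact harness of [30]), the sorting test might be coded as:

import time
from scipy import random, sort

times = []
for i in range(10):                 # replications i in I
    x = random.rand(1000000)        # data generated outside the timed block
    t0 = time.time()                # [Start timer]
    y = sort(x)                     # operation under test
    times.append(time.time() - t0)  # [End timer]
print(min(times), sum(times) / len(times))

Generating the data outside the timed block ensures that only the operation under test is measured.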
• Fast Fourier transform
[Replication i] Generate a 2^20×1 random uniform vector x = (U(0,1)). [Start timer] Compute
the fast Fourier transform of x. [End timer] [End replication i] Repeat for i ∈ I.
• Solve a linear system
[Replication i] Generate a 1000×1000 random uniform matrix X = (U(0,1)) and the
1000×1 vector y = (j), j = 1, 2, . . . , 1000. [Start timer] Solve Xw = y for w. [End timer]
[End replication i] Repeat for i ∈ I. Random numbers are generated using the command
scipy.random.rand throughout.
• Sorting
[Replication i] Generate a 1000000×1 random uniform vector x = (U(0,1)). [Start timer]
Sort the elements of x in ascending order. [End timer] [End replication i] Repeat for i ∈ I.
The sorting method is scipy.sort(x).
• Gaussian error function
[Replication i] Generate a 1500×1500 random uniform matrix X = (U(0,1)). [Start timer]
Compute the Gaussian error function erf(X) element-wise. [End timer] [End replication i]
Repeat for i ∈ I. The error function is computed by scipy.special.erf(x).
• Fibonacci numbers
[Replication i] Generate a 1000000×1 random uniform vector x = (⌊1000 × U(0,1)⌋), where
⌊·⌋ returns the integer part. The Fibonacci numbers y_n are defined by the recurrence relation
y_n = y_{n−1} + y_{n−2}, where y_0 = 0 and y_1 = 1 initialize the sequence. [Start timer] Compute
the Fibonacci numbers y_n for x = (n) (i.e. n ∈ {0, 1, . . . , 999} will give 1 million random
drawings from the first 1000 Fibonacci numbers), using the closed-form Binet formula
$$y_n = \frac{\phi^n - (-\phi)^{-n}}{\sqrt{5}},$$
where $\phi = (1 + \sqrt{5})/2$ is the golden ratio.²⁶ [End timer] [End replication i] Repeat
for i ∈ I. (A sketch of the vectorized Binet computation is given after this list.)
• Cholesky decomposition
[Replication i] Generate a random uniform matrix X = (U(0,1)) and form the dot product
X′X. [Start timer] Compute the upper-triangular Cholesky decomposition X′X = U′U, i.e.
solve for square U. [End timer] [End replication i] Repeat for i ∈ I. The decomposition is
computed by the command scipy.linalg.cholesky(y,lower=False), where y stores the dot
product.
²⁶ A faster closed-form formula is $y_n = \lfloor (\phi^n/\sqrt{5}) + (1/2) \rfloor$, although we do not use it here. For extended Python,
the faster formula takes roughly 0.2 seconds across 10 replications. We also correct a typo in the [30] Mathematica
code: RandomInteger[{100,1000},. . .] is replaced by RandomInteger[{0,999},. . .].
• Data import and manipulation
[Replication i] [Start timer] Import the datafile Currency2.txt. This contains 3160 rows and
38 columns of data, on 34 daily exchange rates over the period 2 January 1990 to 8 February
2002. There is a single header line. For each currency-year, compute the mean, minimum and
maximum of the data, and the percentage change over the year.²⁷ [End timer] [End replication
i] Repeat for i ∈ I.
• Gamma function
[Replication i] Generate a 1500×1500 random uniform matrix X = (U(0,1)). [Start timer]
Compute the gamma function Γ(X) element-wise. [End timer] [End replication i] Repeat for
i ∈ I. The gamma function is computed by scipy.special.gamma(x).²⁸
• Exponentiation
[Replication i] Generate a random uniform matrix X = (U(0,1)). [Start timer] Compute the
element-wise power X^1000. [End timer] [End replication i] Repeat for i ∈ I. The
exponentiation is x**1000.
• Matrix determinant
[Replication i] Generate a 1500×1500 random uniform matrix X = (U(0,1)). [Start timer]
Compute the determinant det(X). [End timer] [End replication i] Repeat for i ∈ I. The
determinant is computed by scipy.linalg.det(x).
²⁷ The file Currency2.txt is available from www.scientificweb.com/ncrunch. We modify the [30] GAUSS code to
import data into a matrix using load data[]=^"filename"; and datmat=reshape(data,3159,38);. For the GAUSS
and Python implementations, we first removed the header line from the data file, before importing the data.
²⁸ We correct a typo in the [30] Mathematica code: we use RandomReal[{0,1},{1000,1000}] instead of
RandomReal[NormalDistribution[],{1000,1000}].
• Dot product
[Replication i] Generate a 1500×1500 random uniform matrix X = (U(0,1)). [Start timer]
Compute the dot product X′X. [End timer] [End replication i] Repeat for i ∈ I. The dot
product is computed by numpy.dot(x.T,x).
• Matrix inverse
[Replication i] Generate a 1500×1500 random uniform matrix X = (U(0,1)). [Start timer]
Compute the inverse X⁻¹. [End timer] [End replication i] Repeat for i ∈ I. The inverse is
computed by scipy.linalg.inv(x).
• Double loop
[Replication i] Set a = 0. [Start timer] (Open outer loop over l = 1, 2, . . . , 5000). (Open
inner loop over m = 1, 2, . . . , 5000). Set a = a + l + m. (Close inner loop). (Close outer
loop). [End timer] [End replication i] Repeat for i ∈ I. Routine written in core Python (see
the corresponding example in Section 4; a sketch is also given after this list).
• Principal components analysis
[Replication i] Generate a 10000×1000 random uniform matrix X = (U(0,1)). [Start timer]
Transform X into principal components using the covariance method. [End timer] [End
replication i] Repeat for i ∈ I.
• Computation of eigenvalues
[Replication i] Generate a 1200×1200 random uniform matrix X = (U(0,1)). [Start timer]
Compute the eigenvalues of X. [End timer] [End replication i] Repeat for i ∈ I. The
eigenvalues are computed by the command scipy.linalg.eigvals(x).
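To make the list concrete, we close with minimal sketches (ours, under the stated assumptions) of two of the routines above: the vectorized Binet computation, where the draws are cast to integers so that the negative base (−φ) is raised only to integer powers, and the core-Python double loop.

from scipy import random, sqrt

# Fibonacci numbers via the Binet formula, vectorized over 10^6 draws.
phi = (1.0 + sqrt(5.0)) / 2.0
n = (1000 * random.rand(1000000)).astype(int)  # draws from {0, 1, ..., 999}
y = (phi**n - (-phi)**(-n)) / sqrt(5.0)        # Binet formula, element-wise

# Core-Python double loop.
a = 0
for l in range(1, 5001):      # outer loop
    for m in range(1, 5001):  # inner loop
        a = a + l + m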
The following extended Python code estimates a probit model by maximum likelihood (Newton
conjugate-gradient, with an analytical gradient) and plots a cross-section of the (negative)
log-likelihood surface:
import cPickle
from matplotlib import pyplot
from scipy import array,shape,arange,log,ones,random
from scipy.stats import norm
from scipy.optimize import fmin_ncg
from scipy.linalg import inv
from numpy import dot,meshgrid

# The regressor matrix x (k x n) and binary response vector y (n x 1) are
# assumed to have been loaded already (e.g. unpickled with cPickle).

# Probit negative log-likelihood.
def nll(beta,x,y):
    return (dot(-((y*log(norm.cdf(dot(x.T,beta))))+
        ((1-y)*log(1-norm.cdf(dot(x.T,beta))))),ones(len(y))))

# Analytical gradient of the negative log-likelihood.
def nllprime(beta,x,y):
    return (-dot(x,(norm.pdf(dot(x.T,beta))*(y-norm.cdf(dot(x.T,beta))))/
        (norm.cdf(dot(x.T,beta))*(1-norm.cdf(dot(x.T,beta))))))

# OLS starting values, then Newton-CG maximum-likelihood optimization.
beta_hat_ols=dot(inv(dot(x,x.T)),dot(x,y))
a=fmin_ncg(nll,beta_hat_ols,args=(x,y),fprime=nllprime,disp=1)

# Grid of (beta_1, beta_2) values for the contour cross-section.
b1=arange(-0.05,-0.03,0.0003);b2=arange(0.095,0.325,0.0003)
X1,X2=meshgrid(b1,b2); Z=ones((len(b1),len(b2))); beta_test=array(list(a)[:])
# Evaluate the negative log-likelihood over the (beta_1, beta_2) grid.
for i in range(len(b1)):
    for j in range(len(b2)):
        beta_test[1]=b1[i]; beta_test[2]=b2[j]
        Z[i][j]=nll(beta_test,x,y)

# Contour plot of the cross-section, with start point and minimum annotated.
pyplot.figure()
CS=pyplot.contour(b2,b1,Z,linewidths=(4,4,4,4,4,4),\
    colors=('green','blue','red','black','pink','yellow'))
pyplot.clabel(CS,inline=1,fontsize=12)
pyplot.xlabel(r'$\beta_{2}$',fontsize=16)
pyplot.ylabel(r'$\beta_{1}$',fontsize=16)
pyplot.title('(Negative) log-likelihood contour cross-section',fontsize=20)
pyplot.plot([0.19866354],[-0.04009722],'ro')
pyplot.plot([0.042544232],[-0.01230058],'ro')
pyplot.annotate('MINIMUM',xy=(0.19866354,-0.04009722),\
    xytext=(0.21,-0.045),arrowprops=dict(facecolor='black',\
    shrink=0.01))
pyplot.annotate('START',xy=(0.042544232,-0.01230058),\
    xytext=(0.10,-0.02),arrowprops=dict(facecolor='black',\
    shrink=0.01))
pyplot.text(0.16,-0.02,'Optimization successful after 12\niterations, '
    '13 function evaluations,\nand 104 gradient evaluations',
    bbox={'facecolor':'red','alpha':1,'pad':10},fontsize=14)
pyplot.ylim(-0.05,-0.01)
pyplot.xlim(0.04,0.35)
pyplot.grid(True)
pyplot.show()