Data Structures For Statistical Computing in Python
(SCIPY 2010)
Abstract—In this paper we are concerned with the practical issues of working
with data sets common to finance, statistics, and other related fields. pandas
is a new library which aims to facilitate working with these data sets and to
provide a set of fundamental building blocks for implementing statistical
models. We will discuss specific design issues encountered in the course of
developing pandas with relevant examples and some comparisons with the R
language. We conclude by discussing possible future directions for statistical
computing and data analysis using Python.

Index Terms—data structure, statistics, R

* Corresponding author: wesmckinn@gmail.com
‡ AQR Capital Management, LLC

Copyright © 2010 Wes McKinney. This is an open-access article distributed
under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the
original author and source are credited.

Introduction
Python is being used increasingly in scientific applications traditionally
dominated by [R], [MATLAB], [Stata], [SAS], and other commercial or
open-source research environments. The maturity and stability of the
fundamental numerical libraries ([NumPy], [SciPy], and others), quality of
documentation, and availability of "kitchen-sink" distributions ([EPD],
[Pythonxy]) have gone a long way toward making Python accessible and
convenient for a broad audience. Additionally, [matplotlib] integrated with
[IPython] provides an interactive research and development environment with
data visualization suitable for most users. However, adoption of Python for
applied statistical modeling has been relatively slow compared with other
areas of computational science.

A major issue for would-be statistical Python programmers in the past has
been the lack of libraries implementing standard models and a cohesive
framework for specifying models. However, in recent years there have been
significant new developments in econometrics ([StaM]), Bayesian statistics
([PyMC]), and machine learning ([SciL]), among other fields. However, it is
still difficult for many statisticians to choose Python over R given the
domain-specific nature of the R language and breadth of well-vetted
open-source libraries available to R users ([CRAN]). In spite of this
obstacle, we believe that the Python language and the libraries and tools
currently available can be leveraged to make Python a superior environment
for data analysis and statistical computing.

In this paper we are concerned with data structures and tools for working
with data sets in-memory, as these are fundamental building blocks for
constructing statistical models. pandas is a new Python library of data
structures and statistical tools initially developed for quantitative finance
applications. Most of our examples here stem from time series and
cross-sectional data arising in financial modeling. The package's name
derives from panel data, which is a term for 3-dimensional data sets
encountered in statistics and econometrics. We hope that pandas will help
make scientific Python a more attractive and practical statistical computing
environment for academic and industry practitioners alike.

Statistical data sets
Statistical data sets commonly arrive in tabular format, i.e. as a
two-dimensional list of observations and names for the fields of each
observation. Usually an observation can be uniquely identified by one or more
values or labels. We show an example data set for a pair of stocks over the
course of several days. The NumPy ndarray with structured dtype can be used
to hold this data:

>>> data
array([('GOOG', '2009-12-28', 622.87, 1697900.0),
       ('GOOG', '2009-12-29', 619.40, 1424800.0),
       ('GOOG', '2009-12-30', 622.73, 1465600.0),
       ('GOOG', '2009-12-31', 619.98, 1219800.0),
       ('AAPL', '2009-12-28', 211.61, 23003100.0),
       ('AAPL', '2009-12-29', 209.10, 15868400.0),
       ('AAPL', '2009-12-30', 211.64, 14696800.0),
       ('AAPL', '2009-12-31', 210.73, 12571000.0)],
      dtype=[('item', '|S4'), ('date', '|S10'),
             ('price', '<f8'), ('volume', '<f8')])
>>> data['price']
array([622.87, 619.4, 622.73, 619.98, 211.61, 209.1,
       211.64, 210.73])

Structured (or record) arrays such as this can be effective in many
applications, but in our experience they do not provide the same level of
flexibility and ease of use as other statistical environments. One major
issue is that they do not integrate well with the rest of NumPy, which is
mainly intended for working with arrays of homogeneous dtype.

R provides the data.frame class which can similarly store mixed-type data.
The core R language and its 3rd-party libraries were built with the
data.frame object in mind, so most operations on such a data set are very
natural. A data.frame is also flexible in size, an important feature when
assembling a collection of data. The following code fragment loads the data
stored in the CSV file data into the variable df and adds a new column of
boolean values:

> df <- read.csv('data')
  item       date  price   volume
1 GOOG 2009-12-28 622.87  1697900
2 GOOG 2009-12-29 619.40  1424800
3 GOOG 2009-12-30 622.73  1465600
4 GOOG 2009-12-31 619.98  1219800
5 AAPL 2009-12-28 211.61 23003100
6 AAPL 2009-12-29 209.10 15868400
7 AAPL 2009-12-30 211.64 14696800
8 AAPL 2009-12-31 210.73 12571000
> df$ind <- df$item == "GOOG"
> df
  item       date  value   volume   ind
1 GOOG 2009-12-28 622.87  1697900  TRUE
2 GOOG 2009-12-29 619.40  1424800  TRUE
3 GOOG 2009-12-30 622.73  1465600  TRUE
4 GOOG 2009-12-31 619.98  1219800  TRUE
5 AAPL 2009-12-28 211.61 23003100 FALSE
6 AAPL 2009-12-29 209.10 15868400 FALSE
7 AAPL 2009-12-30 211.64 14696800 FALSE
8 AAPL 2009-12-31 210.73 12571000 FALSE

pandas provides a similarly-named DataFrame class which implements much of
the functionality of its R counterpart, though with some important
enhancements (namely, built-in data alignment) which we will discuss. Here we
load the same CSV file as above into a DataFrame object using the fromcsv
function and similarly add the above column:

>>> data = DataFrame.fromcsv('data', index_col=None)
         date item  value     volume
0  2009-12-28 GOOG  622.9  1.698e+06
1  2009-12-29 GOOG  619.4  1.425e+06
2  2009-12-30 GOOG  622.7  1.466e+06
3  2009-12-31 GOOG  620    1.22e+06
4  2009-12-28 AAPL  211.6  2.3e+07
5  2009-12-29 AAPL  209.1  1.587e+07
6  2009-12-30 AAPL  211.6  1.47e+07
7  2009-12-31 AAPL  210.7  1.257e+07
>>> data['ind'] = data['item'] == 'GOOG'

This data can be reshaped into a different form for future examples by means
of the DataFrame method pivot:

>>> df = data.pivot('date', 'item', 'value')
>>> df
             AAPL   GOOG
2009-12-28  211.6  622.9
2009-12-29  209.1  619.4
2009-12-30  211.6  622.7
2009-12-31  210.7  620

Beyond observational data, one will also frequently encounter categorical
data, which can be used to partition identifiers into broader groupings. For
example, stock tickers might be categorized by their industry or country of
incorporation. Here we have created a DataFrame object cats storing country
and industry classifications for a group of stocks:

>>> cats
       country  industry
AAPL   US       TECH
IBM    US       TECH
SAP    DE       TECH
GOOG   US       TECH
C      US       FIN
SCGLY  FR       FIN
BAR    UK       FIN
DB     DE       FIN
VW     DE       AUTO
RNO    FR       AUTO
F      US       AUTO
TM     JP       AUTO

We will use these objects above to illustrate features of interest.

pandas data model
The pandas data structures internally link the axes of an ndarray with arrays
of unique labels. These labels are stored in instances of the Index class,
which is a 1D ndarray subclass implementing an ordered set. In the stock data
above, the row labels are simply sequential observation numbers, while the
columns are the field names.

An Index stores the labels in two ways: as an ndarray and as a dict mapping
the values (which must therefore be unique and hashable) to the integer
indices:

>>> index = Index(['a', 'b', 'c', 'd', 'e'])
>>> index
Index([a, b, c, d, e], dtype=object)
>>> index.indexMap
{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}

Creating this dict allows the objects to perform lookups and determine
membership in constant time.

>>> 'a' in index
True

These labels are used to provide alignment when performing data manipulations
using differently-labeled objects. There are specialized data structures,
representing 1-, 2-, and 3-dimensional data, which incorporate useful data
handling semantics to facilitate both interactive research and system
building. A general n-dimensional data structure would be useful in some
cases, but data sets of dimension higher than 3 are very uncommon in most
statistical and econometric applications, with 2-dimensional being the most
prevalent. We took a pragmatic approach, driven by application needs, to
designing the data structures in order to make them as easy-to-use as
possible. Also, we wanted the objects to be idiomatically similar to those
present in other statistical environments, such as R.

Data alignment
Operations between related, but differently-sized data sets can pose a
problem as the user must first ensure that the data points are properly
aligned. As an example, consider time series over different date ranges or
economic data series over varying sets of entities:

>>> s1            >>> s2
AAPL   0.044      AAPL   0.025
IBM    0.050      BAR    0.158
SAP    0.101      C      0.028
GOOG   0.113      DB     0.087
C      0.138      F      0.004
SCGLY  0.037      GOOG   0.154
BAR    0.200      IBM    0.034
DB     0.281
VW     0.040

One might choose to explicitly align (or reindex) one of these 1D Series
objects with the other before adding them, using the reindex method:

>>> s1.reindex(s2.index)
AAPL  0.0440877763224
BAR   0.199741007422
C     0.137747485628
DB    0.281070058049
F     NaN
GOOG  0.112861123629
IBM   0.0496445829129

However, we often find it preferable to simply ignore the state of data
alignment:

>>> s1 + s2
AAPL  0.0686791008184
BAR   0.358165479807
C     0.16586702944
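The label-based lookup and alignment described above can be illustrated with a minimal pure-Python sketch. This is not the pandas implementation: the helper names make_index_map, reindex, and add_aligned are hypothetical, and taking the sorted union of the two label sets for the sum is an assumption made for illustration. The values are the rounded ones from the s1 and s2 listing.

```python
import math


def make_index_map(labels):
    """Map each unique, hashable label to its integer position."""
    index_map = {label: pos for pos, label in enumerate(labels)}
    if len(index_map) != len(labels):
        raise ValueError("labels must be unique")
    return index_map


def reindex(labels, values, new_labels):
    """Conform values to new_labels; labels with no match become NaN."""
    index_map = make_index_map(labels)  # dict gives constant-time lookups
    return [values[index_map[lab]] if lab in index_map else math.nan
            for lab in new_labels]


def add_aligned(labels1, values1, labels2, values2):
    """Add two labeled vectors over the sorted union of their labels.

    Assumption: the result is indexed by the sorted union of labels.
    """
    union = sorted(set(labels1) | set(labels2))
    left = reindex(labels1, values1, union)
    right = reindex(labels2, values2, union)
    # NaN propagates through addition, marking labels missing on either side.
    return union, [a + b for a, b in zip(left, right)]


s1_labels = ["AAPL", "IBM", "SAP", "GOOG", "C", "SCGLY", "BAR", "DB", "VW"]
s1_values = [0.044, 0.050, 0.101, 0.113, 0.138, 0.037, 0.200, 0.281, 0.040]
s2_labels = ["AAPL", "BAR", "C", "DB", "F", "GOOG", "IBM"]
s2_values = [0.025, 0.158, 0.028, 0.087, 0.004, 0.154, 0.034]

union, total = add_aligned(s1_labels, s1_values, s2_labels, s2_values)
for label, value in zip(union, total):
    print(f"{label:6s} {value:.3f}")
```

Labels present in only one of the two inputs (F, SAP, SCGLY, and VW here) come out as NaN, so the caller never has to align the inputs by hand before combining them.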
58 PROC. OF THE 9th PYTHON IN SCIENCE CONF. (SCIPY 2010)
dates or entities) for linear regressions. Especially for unbalanced panel
data, this can be a difficult task. Since we have all of the necessary
labeling data here, we can easily implement such an operation as an instance
method.

Implementing statistical models
When applying a statistical model, data preparation and cleaning can be one
of the most tedious or time-consuming tasks. Ideally the majority of this
work would be taken care of by the model class itself. In R, while NA data
can be automatically excluded from a linear regression, one must either align
the data and put it into a data.frame or otherwise prepare a collection of 1D
arrays which are all the same length.

Using pandas, the user can avoid much of this data preparation work. As an
exemplary model leveraging the pandas data model, we implemented ordinary
least squares regression in both the standard case (making no assumptions
about the content of the regressors) and the panel case, which has additional
options to allow for entity and time dummy variables. Facing the user is a
single function, ols, which infers the type of model to estimate based on the
inputs:

>>> model = ols(y=Y, x=X)
>>> model.beta
AAPL       0.187984100742
GOOG       0.264882582521
MSFT       0.207564901899
intercept  -0.000896535166817

If the response variable Y is a DataFrame (2D) or dict of 1D Series, a panel
regression will be run on stacked (pooled) data. The x would then need to be
either a WidePanel, LongPanel, or a dict of DataFrame objects. Since these
objects contain all of the necessary information to construct the design
matrices for the regression, there is nothing for the user to worry about
(except the formulation of the model).

The ols function is also capable of estimating a moving window linear
regression for time series data. This can be useful for estimating
statistical relationships that change through time:

>>> model = ols(y=Y, x=X, window_type='rolling',
                window=250)
>>> model.beta
<class 'pandas.core.matrix.DataFrame'>
Index: 1103 entries, 2005-08-16 to 2009-12-31
Data columns:
AAPL       1103 non-null values
GOOG       1103 non-null values
MSFT       1103 non-null values
intercept  1103 non-null values
dtype: float64(4)

Here we have estimated a moving window regression with a window size of 250
time periods. The resulting regression coefficients stored in model.beta are
now a DataFrame of time series.

Date/time handling
In applications involving time series data, manipulations on dates and times
can be quite tedious and inefficient. Tools for working with dates in MATLAB,
R, and many other languages are clumsy or underdeveloped. Since Python has a
built-in datetime type easily accessible at both the Python and C / Cython
level, we aim to craft easy-to-use and efficient date and time functionality.
When the NumPy datetime64 dtype has matured, we will, of course, reevaluate
our date handling strategy where appropriate.

For a number of years scikits.timeseries [SciTS] has been available to
scientific Python users. It is built on top of MaskedArray and is intended
for fixed-frequency time series. While forcing data to be fixed frequency can
enable better performance in some areas, in general we have found that
criterion to be quite rigid in practice. The user of scikits.timeseries must
also explicitly align data; operations involving unaligned data yield
unintuitive results.

In designing pandas we hoped to make working with time series data intuitive
without adding too much overhead to the underlying data model. The pandas
data structures are datetime-aware but make no assumptions about the dates.
Instead, when frequency or regularity matters, the user has the ability to
generate date ranges or conform a set of time series to a particular
frequency. To do this, we have the DateRange class (which is also a subclass
of Index, so no conversion is necessary) and the DateOffset class, whose
subclasses implement various general purpose and domain-specific time
increments. Here we generate a date range between 1/1/2000 and 1/1/2010 at
the "business month end" frequency BMonthEnd:

>>> DateRange('1/1/2000', '1/1/2010',
              offset=BMonthEnd())
<class 'pandas.core.daterange.DateRange'>
offset: <1 BusinessMonthEnd>
[2000-01-31 00:00:00, ..., 2009-12-31 00:00:00]
length: 120

A DateOffset instance can be used to convert an object containing time series
data, such as a DataFrame as in our earlier example, to a different frequency
using the asfreq function:

>>> monthly = df.asfreq(BMonthEnd())
             AAPL   GOOG   MSFT   YHOO
2009-08-31  168.2  461.7  24.54  14.61
2009-09-30  185.3  495.9  25.61  17.81
2009-10-30  188.5  536.1  27.61  15.9
2009-11-30  199.9  583    29.41  14.97
2009-12-31  210.7  620    30.48  16.78

Some things which are not easily accomplished in scikits.timeseries can be
done using the DateOffset model, like deriving custom offsets on the fly or
shifting monthly data forward by a number of business days using the shift
function:

>>> offset = Minute(12)
>>> DateRange('6/18/2010 8:00:00',
              '6/18/2010 12:00:00',
              offset=offset)
<class 'pandas.core.daterange.DateRange'>
offset: <12 Minutes>
[2010-06-18 08:00:00, ..., 2010-06-18 12:00:00]
length: 21
>>> monthly.shift(5, offset=BDay())
             AAPL   GOOG   MSFT   YHOO
2009-09-07  168.2  461.7  24.54  14.61
2009-10-07  185.3  495.9  25.61  17.81
2009-11-06  188.5  536.1  27.61  15.9
2009-12-07  199.9  583    29.41  14.97
2010-01-07  210.7  620    30.48  16.78

Since pandas uses the built-in Python datetime object, one could foresee
performance issues with very large or high frequency time series data sets.
For most general financial or econometric applications we cannot justify
complicating datetime handling in order to solve these issues; specialized
tools will need to be created in such cases. This may indeed be a fruitful
avenue for future development work.
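To make the offset idea concrete, here is a minimal sketch of the two date rules used in this section, written with only the standard library datetime and calendar modules. It is not the pandas DateRange or DateOffset code: the function names are hypothetical, and a business day is assumed to mean Monday through Friday, with holidays ignored.

```python
import calendar
from datetime import date, timedelta


def business_month_end(year, month):
    """Last weekday (Mon-Fri) of the given month; holidays are ignored."""
    last = date(year, month, calendar.monthrange(year, month)[1])
    while last.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        last -= timedelta(days=1)
    return last


def business_month_ends(start, end):
    """All business month-ends d with start <= d <= end."""
    result = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        d = business_month_end(year, month)
        if start <= d <= end:
            result.append(d)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return result


def shift_business_days(d, n):
    """Move d forward by n weekdays (n >= 0), skipping weekends."""
    while n > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:
            n -= 1
    return d


dates = business_month_ends(date(2000, 1, 1), date(2010, 1, 1))
print(len(dates), dates[0], dates[-1])
print(shift_business_days(date(2009, 8, 31), 5))
```

Under these assumptions the generated range has 120 entries running from 2000-01-31 to 2009-12-31, and shifting 2009-08-31 forward by five business days gives 2009-09-07, which agrees with the listings in this section.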
Related packages
A number of other Python packages have appeared recently which provide some
similar functionality to pandas. Among these, la ([Larry]) is the most
similar, as it implements a labeled ndarray object intending to closely mimic
NumPy arrays. This stands in contrast to our approach, which is driven by the
practical considerations of time series and cross-sectional data found in
finance, econometrics, and statistics. The references include a couple of
other packages of interest ([Tab], [pydataframe]).

While pandas provides some useful linear regression models, it is not
intended to be comprehensive. We plan to work closely with the developers of
scikits.statsmodels ([StaM]) to generally improve the cohesiveness of
statistical modeling tools in Python. It is likely that pandas will soon
become a "lite" dependency of scikits.statsmodels; the eventual creation of a
superpackage for statistical modeling including pandas, scikits.statsmodels,
and some other libraries is also not out of the question.

Conclusions
We believe that in the coming years there will be great opportunity to
attract users in need of statistical data analysis tools to Python who might
have previously chosen R, MATLAB, or another research environment. By
designing robust, easy-to-use data structures that cohere with the rest of
the scientific Python stack, we can make Python a compelling choice for data
analysis applications. In our opinion, pandas represents a solid step in the
right direction.

REFERENCES
[pandas] W. McKinney, AQR Capital Management, pandas: a python data analysis
library, http://pandas.sourceforge.net
[Larry] K. Goodman, la / larry: ndarray with labeled axes,
http://larry.sourceforge.net/
[SciTS] M. Knox, P. Gerard-Marchant, scikits.timeseries: python time series
analysis, http://pytseries.sourceforge.net/
[StaM] S. Seabold, J. Perktold, J. Taylor, scikits.statsmodels: statistical
modeling in Python, http://statsmodels.sourceforge.net
[SciL] D. Cournapeau, et al., scikits.learn: machine learning in Python,
http://scikit-learn.sourceforge.net
[PyMC] C. Fonnesbeck, A. Patil, D. Huard, PyMC: Markov Chain Monte Carlo for
Python, http://code.google.com/p/pymc/
[Tab] D. Yamins, E. Angelino, tabular: tabarray data structure for 2D data,
http://parsemydata.com/tabular/
[NumPy] T. Oliphant, http://numpy.scipy.org
[SciPy] E. Jones, T. Oliphant, P. Peterson, http://scipy.org
[matplotlib] J. Hunter, et al., matplotlib: Python plotting,
http://matplotlib.sourceforge.net/
[EPD] Enthought, Inc., EPD: Enthought Python Distribution,
http://www.enthought.com/products/epd.php
[Pythonxy] P. Raybaut, Python(x,y): Scientific-oriented Python distribution,
http://www.pythonxy.com/
[CRAN] The R Project for Statistical Computing, http://cran.r-project.org/
[Cython] G. Ewing, R. W. Bradshaw, S. Behnel, D. S. Seljebotn, et al., The
Cython compiler, http://cython.org
[IPython] F. Perez, et al., IPython: an interactive computing environment,
http://ipython.scipy.org
[Grun] Baltagi, Grunfeld data set,
http://www.wiley.com/legacy/wileychi/baltagi/
[nipy] J. Taylor, F. Perez, et al., nipy: Neuroimaging in Python,
http://nipy.sourceforge.net
[pydataframe] A. Straw, F. Finkernagel, pydataframe,
http://code.google.com/p/pydataframe/
[R] R Development Core Team. 2010, R: A Language and Environment for
Statistical Computing, http://www.R-project.org
[MATLAB] The MathWorks Inc. 2010, MATLAB, http://www.mathworks.com
[Stata] StataCorp. 2010, Stata Statistical Software: Release 11,
http://www.stata.com
[SAS] SAS Institute Inc., SAS System, http://www.sas.com