.. _whatsnew_080:

v0.8.0 (June 29, 2012)
----------------------

This is a major release from 0.7.3 and includes extensive work on the time
series handling and processing infrastructure as well as a great deal of new
functionality throughout the library. It includes over 700 commits from more
than 20 distinct authors. Most pandas 0.7.3 and earlier users should not
experience any issues upgrading, but due to the migration to the NumPy
datetime64 dtype, there may be a number of bugs and incompatibilities
lurking. Lingering incompatibilities will be fixed ASAP in a 0.8.1 release if
necessary. See the :ref:`full release notes
<release>` or issue tracker
on GitHub for a complete list.
Support for non-unique indexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All objects can now work with non-unique indexes. Data alignment / join
operations work according to SQL join semantics (including, if applicable,
index duplication in many-to-many joins), as sketched below.
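A minimal sketch of the new behaviour (the data below is purely illustrative):
duplicate index labels are accepted, and alignment repeats values according to
join semantics.

.. code-block:: python

   import pandas as pd

   s1 = pd.Series([1, 2, 3], index=['a', 'a', 'b'])   # 'a' appears twice
   s2 = pd.Series([10, 20], index=['a', 'b'])

   s1['a']    # label-based selection returns both rows labelled 'a'
   s1 + s2    # each 'a' in s1 aligns against the single 'a' in s2
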
NumPy datetime64 dtype and 1.6 dependency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Time series data are now represented using NumPy's datetime64 dtype; thus,
pandas 0.8.0 now requires at least NumPy 1.6. It has been tested and verified
to work with the development version (1.7+) of NumPy as well, which includes
some significant user-facing API changes. NumPy 1.6 also has a number of bugs
having to do with nanosecond resolution data, so I recommend that you steer
clear of NumPy 1.6's datetime64 API functions (limited as they are) and only
interact with this data using the interface that pandas provides.
See the end of the 0.8.0 section for a "porting" guide listing potential issues
for users migrating legacy codebases from pandas 0.7 or earlier to 0.8.0.
Bug fixes to the 0.7.x series for legacy NumPy < 1.6 users will be provided as
they arise. There will be no further development in 0.7.x beyond bug fixes.
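As a small illustration of what the new representation looks like (the dates
below are arbitrary examples), the values backing a DatetimeIndex are now
datetime64[ns]:

.. code-block:: python

   import pandas as pd

   idx = pd.date_range('2012-01-01', periods=3)      # arbitrary example dates
   idx.dtype                                          # datetime64[ns]
   pd.Series([1, 2, 3], index=idx).index.dtype        # same dtype via a Series
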
Time series changes and improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. note::

   With this release, legacy scikits.timeseries users should be able to port
   their code to use pandas.

.. note::

   See the :ref:`documentation <timeseries>` for an overview of the pandas
   time series API.
- New datetime64 representation **speeds up join operations and data
  alignment**, **reduces memory usage**, and improves serialization /
  deserialization performance significantly over datetime.datetime
- High performance and flexible **resample** method for converting from
  high-to-low and low-to-high frequency. Supports interpolation, user-defined
  aggregation functions, and control over how the intervals and result labeling
  are defined. A suite of high performance Cython/C-based resampling functions
  (including Open-High-Low-Close) has also been implemented (see the sketch
  after this list)
- Revamp of :ref:`frequency aliases <timeseries.alias>` and support for
  **frequency shortcuts** like '15min' or '1h30min'
- New :ref:`DatetimeIndex class <timeseries.datetimeindex>` supports both fixed
  frequency and irregular time series. Replaces the now-deprecated DateRange
  class
- New ``PeriodIndex`` and ``Period`` classes for representing
  :ref:`time spans <timeseries.periods>` and performing **calendar logic**,
  including the :ref:`12 fiscal quarterly frequencies <timeseries.quarterly>`.
  This is a partial port of, and a substantial enhancement to, elements of the
  scikits.timeseries codebase. Support for conversion between PeriodIndex and
  DatetimeIndex is included
- New ``Timestamp`` data type subclasses ``datetime.datetime``, providing the
  same interface while enabling work with nanosecond-resolution data. Also
  provides :ref:`easy time zone conversions <timeseries.timezone>`.
- Enhanced support for :ref:`time zones <timeseries.timezone>`. Add
  ``tz_convert`` and ``tz_localize`` methods to TimeSeries and DataFrame. All
  timestamps are stored as UTC; timestamps from DatetimeIndex objects with a
  time zone set will be localized to local time. Time zone conversions are
  therefore essentially free. The user needs to know very little about the pytz
  library now; only time zone names as strings are required. Time zone-aware
  timestamps are equal if and only if their UTC timestamps match. Operations
  between time zone-aware time series with different time zones will result in
  a UTC-indexed time series
- Time series **string indexing conveniences** / shortcuts: slice years, year
and month, and index values with strings
- Enhanced time series **plotting**; adaptation of scikits.timeseries
matplotlib-based plotting code
- New ``date_range``, ``bdate_range``, and ``period_range`` :ref:`factory
functions <timeseries.daterange>`
- Robust **frequency inference** function ``infer_freq`` and ``inferred_freq``
  property of DatetimeIndex, with option to infer frequency on construction of
  DatetimeIndex
- New ``to_datetime`` function efficiently **parses arrays of strings** to
  DatetimeIndex. DatetimeIndex will parse an array or list of strings to
  datetime64
- **Optimized** support for datetime64-dtype data in Series and DataFrame
columns
- New NaT (Not-a-Time) type to represent **NA** in timestamp arrays
- Optimize Series.asof for looking up **"as of" values** for arrays of
timestamps
- Milli, Micro, Nano date offset objects
- Can index time series with datetime.time objects to select all data at
particular **time of day** (``TimeSeries.at_time``) or **between two times**
(``TimeSeries.between_time``)
- Add :ref:`tshift <timeseries.advanced_datetime>` method for leading/lagging
using the frequency (if any) of the index, as opposed to a naive lead/lag
using shift
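The following is a hedged, illustrative sketch of a few of the conveniences
listed above; the data and frequencies are arbitrary examples, not taken from
the release notes, and ``how='mean'`` is the 0.8-era resample signature (later
pandas versions spell this ``ts.resample('D').mean()``):

.. code-block:: python

   import numpy as np
   import pandas as pd

   # '1h30min' uses the new frequency-shortcut syntax
   rng = pd.date_range('2012-01-01', periods=1000, freq='1h30min')
   ts = pd.Series(np.random.randn(len(rng)), index=rng)

   ts.resample('D', how='mean')   # downsample to daily means (0.8-era API)
   ts['2012-02']                  # string indexing: all of February 2012
   ts.at_time('12:00')            # values at a particular time of day
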
Other new features
~~~~~~~~~~~~~~~~~~
- New :ref:`cut <reshaping.tile.cut>` and ``qcut`` functions (like R's cut
  function) for computing a categorical variable from a continuous variable by
  binning values either into value-based (``cut``) or quantile-based (``qcut``)
  bins (see the sketch after this list)
- Rename ``Factor`` to ``Categorical`` and add a number of usability features
- Add :ref:`limit <missing_data.fillna.limit>` argument to fillna/reindex
- More flexible multiple function application in GroupBy; can pass a list of
  (name, function) tuples to get results in a particular order with given names
- Add flexible :ref:`replace <missing_data.replace>` method for efficiently
substituting values
- Enhanced :ref:`read_csv/read_table <io.parse_dates>` for reading time series
data and converting multiple columns to dates
- Add :ref:`comments <io.comments>` option to parser functions: read_csv, etc.
- Add :ref:`dayfirst <io.dayfirst>` option to parser functions for parsing
  international DD/MM/YYYY dates
- Allow the user to specify the CSV reader :ref:`dialect <io.dialect>` to
control quoting etc.
- Handle :ref:`thousands <io.thousands>` separators in read_csv to improve
  integer parsing
- Enable unstacking of multiple levels in one shot. Alleviate ``pivot_table``
bugs (empty columns being introduced)
- Move to klib-based hash tables for indexing; better performance and less
memory usage than Python's dict
- Add first, last, min, max, and prod optimized GroupBy functions
- New :ref:`ordered_merge <merging.ordered_merge>` function
- Add flexible :ref:`comparison <basics.binop>` instance methods eq, ne, lt,
gt, etc. to DataFrame, Series
- Improve :ref:`scatter_matrix <visualization.scatter_matrix>` plotting
function and add histogram or kernel density estimates to diagonal
- Add :ref:`'kde' <visualization.kde>` plot option for density plots
- Support for converting DataFrame to R data.frame through rpy2
- Improved support for complex numbers in Series and DataFrame
- Add :ref:`pct_change <computation.pct_change>` method to all data structures
- Add max_colwidth configuration option for DataFrame console output
- :ref:`Interpolate <missing_data.interpolate>` Series values using index values
- Can select multiple columns from GroupBy
- Add :ref:`update <merging.combine_first.update>` methods to Series/DataFrame
for updating values in place
- Add ``any`` and ``all`` methods to DataFrame
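A brief sketch of the new binning functions (the bin edges and data below are
illustrative only):

.. code-block:: python

   import numpy as np
   import pandas as pd

   ages = np.array([6, 12, 20, 28, 35, 50, 67, 73])

   pd.cut(ages, bins=[0, 18, 35, 65, 100])   # value-based bins
   pd.qcut(ages, 4)                          # quantile-based bins (quartiles)
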
New plotting methods
~~~~~~~~~~~~~~~~~~~~
.. ipython:: python
   :suppress:

   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt

   fx = pd.read_pickle('data/fx_prices')
``Series.plot`` now supports a ``secondary_y`` option:

.. ipython:: python

   plt.figure()
   fx['FR'].plot(style='g')

   @savefig whatsnew_secondary_y.png
   fx['IT'].plot(style='k--', secondary_y=True)
Vytautas Jancauskas, the 2012 GSOC participant, has added many new plot
types. For example, ``'kde'`` is a new option:

.. ipython:: python

   s = pd.Series(np.concatenate((np.random.randn(1000),
                                 np.random.randn(1000) * 0.5 + 3)))
   plt.figure()
   s.hist(normed=True, alpha=0.2)

   @savefig whatsnew_kde.png
   s.plot(kind='kde')
See :ref:`the plotting page <visualization.other>` for much more.
Other API changes
~~~~~~~~~~~~~~~~~
- Deprecation of ``offset``, ``time_rule``, and ``timeRule`` argument names in
  time series functions. Warnings will be printed until pandas 0.9 or 1.0
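For example, a hedged sketch of the preferred spelling going forward (the
``freq`` keyword replaces the deprecated argument names in functions such as
``shift``; the data here is illustrative):

.. code-block:: python

   import pandas as pd

   ts = pd.Series(range(5), index=pd.date_range('2012-01-01', periods=5))

   ts.shift(1, freq='D')        # preferred keyword
   # ts.shift(1, offset=...)    # deprecated argument name; warns until 0.9/1.0
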
Potential porting issues for pandas <= 0.7.3 users
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The major change that may affect you in pandas 0.8.0 is that time series
indexes use NumPy's ``datetime64`` data type instead of ``dtype=object`` arrays
of Python's built-in ``datetime.datetime`` objects. ``DateRange`` has been
replaced by ``DatetimeIndex``, but otherwise behaves identically. However, if
you have code that converts ``DateRange`` or ``Index`` objects that used to
contain ``datetime.datetime`` values to plain NumPy arrays, you may have bugs
lurking with code using scalar values, because you are handing control over to
NumPy:
.. ipython:: python

   import datetime
   import numpy as np
   import pandas as pd

   rng = pd.date_range('1/1/2000', periods=10)
   rng[5]
   isinstance(rng[5], datetime.datetime)
   rng_asarray = np.asarray(rng)
   scalar_val = rng_asarray[5]
   type(scalar_val)
pandas's ``Timestamp`` object is a subclass of ``datetime.datetime`` that has
nanosecond support (the ``nanosecond`` field stores the nanosecond value
between 0 and 999). It should substitute directly into any code that used
``datetime.datetime`` values before. Thus, I recommend not casting
``DatetimeIndex`` to regular NumPy arrays.
If you have code that requires an array of ``datetime.datetime`` objects, you
have a couple of options. First, the ``asobject`` property of ``DatetimeIndex``
produces an array of ``Timestamp`` objects:

.. ipython:: python

   stamp_array = rng.asobject
   stamp_array
   stamp_array[5]
To get an array of proper ``datetime.datetime`` objects, use the
``to_pydatetime`` method:

.. ipython:: python

   dt_array = rng.to_pydatetime()
   dt_array
   dt_array[5]
matplotlib knows how to handle ``datetime.datetime`` but not ``Timestamp``
objects. While I recommend that you plot time series using ``TimeSeries.plot``,
you can either use ``to_pydatetime`` or register a converter for the
``Timestamp`` type. See the `matplotlib documentation
<http://matplotlib.sourceforge.net/api/units_api.html>`__ for more on this.
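A minimal sketch of the ``to_pydatetime`` route (the series values below are
arbitrary):

.. code-block:: python

   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt

   ts = pd.Series(np.random.randn(10),
                  index=pd.date_range('1/1/2000', periods=10))

   # Convert the index to datetime.datetime objects matplotlib understands
   plt.plot(ts.index.to_pydatetime(), ts.values)
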
.. warning::

   There are bugs in the user-facing API with the nanosecond datetime64 unit
   in NumPy 1.6. In particular, the string version of the array shows garbage
   values, and conversion to ``dtype=object`` is similarly broken.

.. ipython:: python

   rng = pd.date_range('1/1/2000', periods=10)
   rng
   np.asarray(rng)
   converted = np.asarray(rng, dtype=object)
   converted[5]
**Trust me: don't panic**. If you are using NumPy 1.6 and restrict your
interaction with ``datetime64`` values to pandas's API, you will be just
fine. There is nothing wrong with the data type (a 64-bit integer
internally); all of the important data processing happens in pandas and is
heavily tested. I strongly recommend that you **do not work directly with
datetime64 arrays in NumPy 1.6** and only use the pandas API.
**Support for non-unique indexes**: you may have code inside a
``try:... except:`` block that relied on an operation failing when the index
was not unique. In many cases it will no longer fail (some methods like
``append`` still check for uniqueness unless disabled). However, all is not
lost: you can inspect ``index.is_unique`` and raise an exception explicitly if
it is ``False``, or go to a different code branch, as sketched below.
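A minimal sketch of that explicit check (the helper name is illustrative, not
part of the pandas API):

.. code-block:: python

   import pandas as pd

   def require_unique_index(obj):
       # Hypothetical helper: raise explicitly instead of relying on pandas
       # to reject duplicate labels for you
       if not obj.index.is_unique:
           raise ValueError("index contains duplicate labels")
       return obj

   require_unique_index(pd.Series([1, 2], index=['a', 'b']))   # passes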