.. currentmodule:: pandas

.. ipython:: python
   :suppress:

   import numpy as np
   import random
   import os
   np.random.seed(123456)
   from pandas import *
   options.display.max_rows=15
   import pandas as pd
   randn = np.random.randn
   randint = np.random.randint
   np.set_printoptions(precision=4, suppress=True)

Cookbook

This is a repository of short and sweet examples and links to useful pandas recipes. We encourage users to add to this documentation.

Adding interesting links, or putting short code inline for existing links, makes a great first pull request.

Idioms

These are some neat pandas idioms.

How to do if-then-else?
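
A minimal sketch of the if-then-else idiom, assuming a small hypothetical frame with columns ``AAA`` and ``BBB`` (these names are illustrative and not taken from the linked recipe). It shows a conditional assignment with ``loc`` and a two-way choice with ``np.where``.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; the column names are illustrative only.
df = pd.DataFrame({"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40]})

# "if-then": where AAA >= 5, overwrite BBB with -1.
df.loc[df["AAA"] >= 5, "BBB"] = -1

# "if-then-else": pick one of two values per row based on a condition.
df["logic"] = np.where(df["AAA"] > 5, "high", "low")
```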

How to do if-then-else #2

How to split a frame with a boolean criterion?
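
One possible way to split a frame, sketched on hypothetical data: build a boolean mask once, then select the rows that match it and the rows that do not (using ``~`` for the complement).

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40]})

# Boolean criterion evaluated once per row.
mask = df["AAA"] <= 5

low = df[mask]    # rows where the criterion holds
high = df[~mask]  # rows where it does not
```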

How to select from a frame with complex criteria?

Select rows closest to a user-defined number
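
A short sketch of this idea, assuming a hypothetical column ``CCC`` and a hypothetical target value: compute the absolute distance to the target, then reorder the rows by that distance.

```python
import pandas as pd

# Hypothetical example frame and target value.
df = pd.DataFrame({"AAA": [4, 5, 6, 7], "CCC": [2.0, 50.0, -30.0, -50.0]})
target = 43.0

# Absolute distance of each row's CCC value from the target.
distance = (df["CCC"] - target).abs()

# Reorder the rows so the closest value comes first.
closest_first = df.loc[distance.sort_values().index]
```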

Selection

The :ref:`indexing <indexing>` docs.

Indexing using both row labels and conditionals, see here

Use loc for label-oriented slicing and iloc for positional slicing, see here

Extend a panel frame by transposing, adding a new dimension, and transposing back to the original dimensions, see here

Mask a panel by using np.where and then reconstructing the panel with the new masked values here

Using ~ to take the complement of a boolean array, see here

Efficiently creating columns using applymap

MultiIndexing

The :ref:`multiindexing <indexing.hierarchical>` docs.

Creating a multi-index from a labeled frame
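
As an illustration, assuming the column labels encode two levels separated by an underscore (a hypothetical naming scheme), the labels can be split into tuples and turned into a ``MultiIndex``.

```python
import pandas as pd

# Hypothetical frame whose column labels encode two levels as "<outer>_<inner>".
df = pd.DataFrame({
    "row": [0, 1, 2],
    "One_X": [1.1, 1.1, 1.1],
    "One_Y": [1.2, 1.2, 1.2],
    "Two_X": [1.11, 1.11, 1.11],
    "Two_Y": [1.22, 1.22, 1.22],
})

# Use the "row" column as the index, then split each label into a tuple.
df = df.set_index("row")
df.columns = pd.MultiIndex.from_tuples(
    [tuple(name.split("_")) for name in df.columns]
)
```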

Arithmetic

Performing arithmetic with a multi-index that needs broadcasting

Slicing

Slicing a multi-index with xs
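
A minimal sketch of a cross-section, using hypothetical level names ``outer`` and ``inner``: ``xs`` selects every row with a given label on one level and drops that level from the result.

```python
import pandas as pd

# Hypothetical two-level index.
index = pd.MultiIndex.from_product(
    [["a", "b"], ["x", "y"]], names=["outer", "inner"]
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=index)

# All rows whose "inner" label is "x"; the "inner" level is dropped.
inner_x = df.xs("x", level="inner")
```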

Slicing a multi-index with xs #2

Setting portions of a multi-index with xs

Sorting

Multi-index sorting

Partial Selection, the need for sortedness

Levels

Prepending a level to a multiindex

Flatten Hierarchical columns
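
One common way to flatten hierarchical columns, sketched on hypothetical labels, is to join the parts of each column tuple into a single string. The underscore separator is an arbitrary choice.

```python
import pandas as pd

# Hypothetical frame with two-level columns.
columns = pd.MultiIndex.from_tuples([("One", "X"), ("One", "Y"), ("Two", "X")])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns)

# Join each column tuple into one flat label.
flat = df.copy()
flat.columns = ["_".join(col) for col in flat.columns]
```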

panelnd

The :ref:`panelnd<dsintro.panelnd>` docs.

Construct a 5D panelnd

Missing Data

The :ref:`missing data<missing_data>` docs.

Fill forward a reversed timeseries

.. ipython:: python

   df = pd.DataFrame(np.random.randn(6,1), index=pd.date_range('2013-08-01', periods=6, freq='B'), columns=list('A'))
   df.loc[df.index[3], 'A'] = np.nan
   df
   df.reindex(df.index[::-1]).ffill()

cumsum reset at NaN values
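
A possible approach, shown on a hypothetical series: count the NaN values seen so far to label each block, then take the cumulative sum within each block. This is a sketch and not necessarily the method used in the linked recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical series with NaN values acting as reset markers.
s = pd.Series([1.0, 2.0, np.nan, 3.0, 4.0, np.nan, 5.0])

# Each NaN starts a new block: the running count of NaNs labels the blocks.
block = s.isna().cumsum()

# Cumulative sum restarts within each block; the NaN positions stay NaN.
result = s.groupby(block).cumsum()
```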

Replace

Using replace with backrefs
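
A small sketch of a regex replacement with backreferences, here using ``Series.str.replace`` on hypothetical date strings. The capture groups are reordered in the replacement via ``\1``, ``\2`` and ``\3``.

```python
import pandas as pd

# Hypothetical strings in year-month-day order.
s = pd.Series(["2013-08-01", "2013-12-25"])

# Capture year, month and day, then emit them as day/month/year.
swapped = s.str.replace(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", regex=True)
```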

Grouping

The :ref:`grouping <groupby>` docs.

Basic grouping with apply

Using get_group
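
For illustration, with a hypothetical frame: ``get_group`` returns the rows belonging to one group key without iterating over the whole groupby object.

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "fish"],
    "weight": [8, 12, 10, 1],
})

grouped = df.groupby("animal")

# Pull out the sub-frame for a single key.
cats = grouped.get_group("cat")
```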

Apply to different items in a group

Expanding Apply

Replacing values with groupby means
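
One interpretation of this recipe, sketched on hypothetical data: use ``transform("mean")`` to broadcast each group's mean back to the original rows, then use it to fill the missing values. The linked recipe may replace a different set of values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in group "a".
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 5.0],
})

# Per-row group mean, aligned with the original index.
group_mean = df.groupby("key")["value"].transform("mean")

# Replace missing values with the mean of their own group.
df["value"] = df["value"].fillna(group_mean)
```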

Sort by group with aggregation

Create multiple aggregated columns

Create a value counts column and reassign back to the DataFrame
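
A minimal sketch with a hypothetical ``Color`` column: ``transform("count")`` returns one count per original row, so the result can be assigned straight back to the frame.

```python
import pandas as pd

# Hypothetical example frame.
df = pd.DataFrame({"Color": ["Red", "Red", "Blue", "Red"]})

# Count of rows sharing each row's Color, aligned with the original index.
df["Counts"] = df.groupby("Color")["Color"].transform("count")
```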

Expanding Data

Alignment and to-date

Rolling Computation window based on values instead of counts

Rolling Mean by Time Interval

Splitting

Splitting a frame

Pivot

The :ref:`Pivot <reshaping.pivot>` docs.

Partial sums and subtotals
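
One way to get group sums plus a grand total, sketched on hypothetical sales data, is ``pivot_table`` with ``margins=True``. The ``"Total"`` label is an arbitrary choice passed through ``margins_name``.

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "Province": ["ON", "ON", "QC", "QC"],
    "City": ["Toronto", "Ottawa", "Montreal", "Quebec"],
    "Sales": [13, 6, 16, 8],
})

# Sum per province, with an extra "Total" row holding the grand total.
table = pd.pivot_table(
    df,
    values="Sales",
    index="Province",
    aggfunc="sum",
    margins=True,
    margins_name="Total",
)
```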

Frequency table like plyr in R

Apply

Turning embedded lists into a multi-index frame

Timeseries

Between times
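
A short sketch using a hypothetical hourly series: ``between_time`` keeps the rows whose time of day falls inside the given window, regardless of the date.

```python
import pandas as pd

# Hypothetical hourly series starting at 08:00.
idx = pd.date_range("2013-01-01 08:00", periods=6, freq=pd.Timedelta(hours=1))
ts = pd.Series(range(6), index=idx)

# Keep observations between 09:00 and 11:00 (both ends included by default).
morning = ts.between_time("09:00", "11:00")
```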

Using indexer between time

Vectorized Lookup

Turn a matrix with hours in columns and days in rows into a continuous row sequence in the form of a time series. How to rearrange a Python pandas DataFrame?

Resampling

The :ref:`Resample <timeseries.resampling>` docs.

TimeGrouping of values grouped across time

TimeGrouping #2

Using TimeGrouper and another grouping to create subgroups, then apply a custom function

Resampling with custom periods

Resample intraday frame without adding new days

Resample minute data

Resample with groupby
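
As a sketch, assuming a hypothetical ``key`` column and a 12-hourly index: group first, then resample each group to daily frequency and aggregate.

```python
import pandas as pd

# Hypothetical 12-hourly observations for two keys.
idx = pd.date_range("2013-01-01", periods=4, freq=pd.Timedelta(hours=12))
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]}, index=idx)

# Daily sum within each key; the result is indexed by (key, day).
daily = df.groupby("key")["value"].resample("D").sum()
```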

Merge

The :ref:`Concat <merging.concatenation>` docs. The :ref:`Join <merging.join>` docs.

Emulate R rbind
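
A minimal sketch of row-binding two hypothetical frames with the same columns: ``concat`` stacks them, and ``ignore_index=True`` renumbers the rows.

```python
import pandas as pd

# Two hypothetical frames with identical columns.
df1 = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
df2 = pd.DataFrame({"A": [3], "B": ["z"]})

# Stack the rows and build a fresh 0..n-1 index.
stacked = pd.concat([df1, df2], ignore_index=True)
```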

Self Join

How to set the index and join

KDB-like asof join

Join with a criterion based on the values
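
In recent pandas versions an asof join can be sketched with ``merge_asof``, which matches each left row to the most recent right row at or before its key. The trade and quote frames below are hypothetical, and the linked recipe may use a different technique.

```python
import pandas as pd

# Hypothetical trades and quotes, both sorted by time.
trades = pd.DataFrame({
    "time": pd.to_datetime(["2013-01-01 09:00:01", "2013-01-01 09:00:05"]),
    "qty": [100, 200],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime([
        "2013-01-01 09:00:00",
        "2013-01-01 09:00:03",
        "2013-01-01 09:00:06",
    ]),
    "bid": [10.0, 10.5, 11.0],
})

# For each trade, take the latest quote at or before the trade time.
joined = pd.merge_asof(trades, quotes, on="time")
```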

Plotting

The :ref:`Plotting <visualization>` docs.

Make Matplotlib look like R

Setting x-axis major and minor labels

Plotting multiple charts in an ipython notebook

Creating a multi-line plot

Plotting a heatmap

Annotate a time-series plot

Annotate a time-series plot #2

Data In/Out

Performance comparison of SQL vs HDF5

CSV

The :ref:`CSV <io.read_csv_table>` docs

read_csv in action

Appending to a CSV

Reading a csv chunk-by-chunk
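
A small sketch of chunked reading, using an in-memory CSV as a stand-in for a large file: passing ``chunksize`` makes ``read_csv`` return an iterator of frames, so only one chunk is held in memory at a time.

```python
import io

import pandas as pd

# In-memory stand-in for a large file (hypothetical data).
csv_text = "a,b\n1,2\n3,4\n5,6\n7,8\n9,10\n"

total = 0
n_chunks = 0

# Each iteration yields a DataFrame with at most two rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += int(chunk["b"].sum())
    n_chunks += 1
```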

Reading only certain rows of a csv chunk-by-chunk

Reading the first few lines of a frame

Reading a file that is compressed but not by gzip/bz2 (the native compressed formats that read_csv understands). This example shows a WinZipped file, but it is a general application of opening the file within a context manager and using that handle to read. See here

Inferring dtypes from a file

Dealing with bad lines

Dealing with bad lines II

Reading CSV with Unix timestamps and converting to local timezone
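
One way to handle this, sketched on hypothetical data: parse the integer column as seconds since the epoch in UTC, then convert to a local zone. ``Europe/Berlin`` is only an example zone.

```python
import io

import pandas as pd

# Hypothetical CSV with Unix timestamps in seconds.
csv_text = "ts,value\n1375315200,1\n1375318800,2\n"
df = pd.read_csv(io.StringIO(csv_text))

# Interpret as UTC seconds since the epoch, then convert to local time.
df["ts"] = pd.to_datetime(df["ts"], unit="s", utc=True).dt.tz_convert("Europe/Berlin")
```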

Write a multi-row index CSV without writing duplicates

SQL

The :ref:`SQL <io.sql>` docs

Reading from databases with SQL
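
A self-contained sketch using an in-memory SQLite database from the standard library. The table and column names are hypothetical. The frame is written with ``to_sql`` and a filtered subset is read back with ``read_sql_query``.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]}).to_sql(
    "items", conn, index=False
)

# Run a query and get the result back as a DataFrame.
result = pd.read_sql_query("SELECT id, name FROM items WHERE id > 1", conn)
conn.close()
```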

Excel

The :ref:`Excel <io.excel>` docs

Reading from a filelike handle

Reading HTML tables from a server that cannot handle the default request header

HDFStore

The :ref:`HDFStores <io.hdf5>` docs

Simple Queries with a Timestamp Index

Managing heterogeneous data using a linked multiple table hierarchy

Merging on-disk tables with millions of rows

Deduplicating a large store by chunks, essentially a recursive reduction operation. Shows a function for taking in data from a CSV file and creating a store by chunks, with date parsing as well. See here

Appending to a store, while creating a unique index

Large Data work flows

Reading in a sequence of files, then providing a global unique index to a store while appending

Groupby on a HDFStore

Troubleshoot HDFStore exceptions

Setting min_itemsize with strings

Using ptrepack to create a completely-sorted-index on a store

Storing Attributes to a group node

.. ipython:: python

    df = DataFrame(np.random.randn(8,3))
    store = HDFStore('test.h5')
    store.put('df',df)

    # you can store an arbitrary python object via pickle
    store.get_storer('df').attrs.my_attribute = dict(A = 10)
    store.get_storer('df').attrs.my_attribute

.. ipython:: python
   :suppress:

    store.close()
    os.remove('test.h5')

Computation

Numerical integration (sample-based) of a time series
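
A sketch of sample-based integration using the trapezoidal rule, written out by hand on a hypothetical one-second series. The linked recipe may use a different integration scheme.

```python
import pandas as pd

# Hypothetical rate sampled once per second.
idx = pd.date_range("2013-01-01", periods=5, freq=pd.Timedelta(seconds=1))
rate = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0], index=idx)

# Width of each interval in seconds (the first diff is NaT, so skip it).
dt = rate.index.to_series().diff().dt.total_seconds().to_numpy()[1:]

# Average of the two endpoints of each interval.
values = rate.to_numpy()
mid = (values[1:] + values[:-1]) / 2.0

# Trapezoidal rule: sum of (average height * width).
integral = float((mid * dt).sum())
```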

Miscellaneous

The :ref:`Timedeltas <timeseries.timedeltas>` docs.

Operating with timedeltas

Create timedeltas with date differences
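
For illustration, with hypothetical ``start`` and ``end`` columns: subtracting two datetime columns yields a timedelta column, and ``.dt.days`` extracts the whole days.

```python
import pandas as pd

# Hypothetical start and end dates.
df = pd.DataFrame({
    "start": pd.to_datetime(["2013-01-01", "2013-01-10"]),
    "end": pd.to_datetime(["2013-01-05", "2013-02-01"]),
})

# Datetime minus datetime gives a timedelta column.
df["elapsed"] = df["end"] - df["start"]
df["elapsed_days"] = df["elapsed"].dt.days
```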

Adding days to dates in a dataframe
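
A minimal sketch, assuming a hypothetical integer ``days`` column: convert the day counts to timedeltas with ``to_timedelta`` and add them to the date column row by row.

```python
import pandas as pd

# Hypothetical dates and per-row day offsets.
df = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-30", "2013-02-27"]),
    "days": [2, 3],
})

# Convert the integer offsets to timedeltas and add them to the dates.
df["shifted"] = df["date"] + pd.to_timedelta(df["days"], unit="D")
```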

Aliasing Axis Names

To globally provide aliases for axis names, one can define these two functions:

.. ipython:: python

   def set_axis_alias(cls, axis, alias):
       if axis not in cls._AXIS_NUMBERS:
           raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
       cls._AXIS_ALIASES[alias] = axis

   def clear_axis_alias(cls, axis, alias):
       if axis not in cls._AXIS_NUMBERS:
           raise Exception("invalid axis [%s] for alias [%s]" % (axis, alias))
       cls._AXIS_ALIASES.pop(alias, None)


   set_axis_alias(DataFrame,'columns', 'myaxis2')
   df2 = DataFrame(randn(3,2),columns=['c1','c2'],index=['i1','i2','i3'])
   df2.sum(axis='myaxis2')
   clear_axis_alias(DataFrame,'columns', 'myaxis2')