100% found this document useful (1 vote)

467 views10 pages

Pandas DataFrame Notes

This document provides a cheat sheet on the pandas DataFrame object. It defines a DataFrame as a two-dimensional table with column and row indexes made up of Series objects. It describes Series as ordered arrays that make up the columns of a DataFrame. It then provides examples of how to load data from various sources into DataFrames and Series, and how to manipulate and combine DataFrames and Series.

Uploaded by

Sarath Ramineni

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

100% found this document useful (1 vote)

467 views10 pages

Pandas DataFrame Notes

Uploaded by

Sarath Ramineni

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Cheat Sheet: The pandas DataFrame Object

Preliminaries Get your data into a DataFrame

Start by importing these Python modules Load a DataFrame from a CSV file
import numpy as np df = pd.read_csv('[Link]')# often works
import [Link] as plt df = pd.read_csv(‘[Link]’, header=0,
import pandas as pd index_col=0, quotechar=’”’,sep=’:’,
from pandas import DataFrame, Series na_values = [‘na’, ‘-‘, ‘.’, ‘’])
Note: these are the recommended import aliases Note: refer to pandas docs for all arguments

From inline CSV text to a DataFrame

from StringIO import StringIO # python2.7
The conceptual model #from io import StringIO # python 3
data = """, Animal, Cuteness, Desirable
row-1, dog, 8.7, True
DataFrame object: The pandas DataFrame is a two- row-2, bat, 2.6, False"""
dimensional table of data with column and row indexes. df = pd.read_csv(StringIO(data),
The columns are made up of pandas Series objects. header=0, index_col=0,
skipinitialspace=True)
Column'index'([Link])' Note: skipinitialspace=True allows a pretty layout

Load DataFrames from a Microsoft Excel file

Series'of'data'

# Each Excel sheet in a Python dictionary

([Link])'
Row'index'

workbook = [Link]('[Link]')
dictionary = {}
for sheet_name in workbook.sheet_names:
df = [Link](sheet_name)
dictionary[sheet_name] = df
Note: the parse() method takes many arguments like
read_csv() above. Refer to the pandas documentation.

Series object: an ordered, one-dimensional array of Load a DataFrame from a MySQL database
data with an index. All the data in a Series is of the import pymysql
same data type. Series arithmetic is vectorised after first from sqlalchemy import create_engine
aligning the Series index for each of the operands. engine = create_engine('mysql+pymysql://'
s1 = Series(range(0,4)) # -> 0, 1, 2, 3 +'USER:PASSWORD@localhost/DATABASE')
s2 = Series(range(1,5)) # -> 1, 2, 3, 4 df = pd.read_sql_table('table', engine)
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb' Data in Series then combine into a DataFrame
# Example 1 ...
The index object: The pandas Index provides the axis s1 = Series(range(6))
labels for the Series and DataFrame objects. It can only s2 = s1 * s1
contain hashable objects. A pandas Series has one [Link] = [Link] + 2# misalign indexes
Index; and a DataFrame has two Indexes. df = [Link]([s1, s2], axis=1)
# --- get Index from Series and DataFrame
idx = [Link] # Example 2 ...
idx = [Link] # the column index s3 = Series({'Tom':1, 'Dick':4, 'Har':9})
idx = [Link] # the row index s4 = Series({'Tom':3, 'Dick':2, 'Mar':5})
df = [Link]({'A':s3, 'B':s4 }, axis=1)
# --- some Index attributes Note: 1st method has in integer column labels
b = idx.is_monotonic_decreasing Note: 2nd method does not guarantee col order
b = idx.is_monotonic_increasing Note: index alignment on DataFrame creation
b = idx.has_duplicates
i = [Link] # multi-level indexes Get a DataFrame from data in a Python dictionary
# default --- assume data is in columns
# --- some Index methods df = DataFrame({
a = [Link]() # get as numpy array 'col0' : [1.0, 2.0, 3.0, 4.0],
l = [Link]() # get as a python list 'col1' : [100, 200, 300, 400]
idx = [Link](dtype)# change data type })
b = [Link](o) # check for equality
idx = [Link](o) # union of two indexes
i = [Link]() # number unique labels
label = [Link]() # minimum label
label = [Link]() # maximum label

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
1
Get a DataFrame from data in a Python dictionary
# --- use helper method for data in rows Working with the whole DataFrame
df = DataFrame.from_dict({ # data by row
'row0' : {'col0':0, 'col1':'A'}, Peek at the DataFrame contents
'row1' : {'col0':1, 'col1':'B'}
[Link]() # index & data types
}, orient='index')
n = 4
dfh = [Link](n) # get first n rows
df = DataFrame.from_dict({ # data by row
dft = [Link](n) # get last n rows
'row0' : [1, 1+1j, 'A'],
dfs = [Link]() # summary stats cols
'row1' : [2, 2+2j, 'B']
top_left_corner_df = [Link][:5, :5]
}, orient='index')
DataFrame non-indexing attributes
Create play/fake data (useful for testing)
dfT = df.T # transpose rows and cols
# --- simple
l = [Link] # list row and col indexes
df = DataFrame([Link](50,5))
(r, c) = [Link] # from above
s = [Link] # Series column data types
# --- with a time-stamp row index:
b = [Link] # True for empty DataFrame
df = DataFrame([Link](500,5))
i = [Link] # number of axes (2)
[Link] = pd.date_range('1/1/2006',
t = [Link] # (row-count, column-count)
periods=len(df), freq='M')
(r, c) = [Link] # from above
i = [Link] # row-count * column-count
# --- with alphabetic row and col indexes
a = [Link] # get a numpy array for df
import string
import random
r = 52 # note: min r is 1; max r is 52 DataFrame utility methods
c = 5 dfc = [Link]() # copy a DataFrame
df = DataFrame([Link](r, c), dfr = [Link]() # rank each col (default)
columns = ['col'+str(i) for i in dfs = [Link]() # sort each col (default)
range(c)], dfc = [Link](dtype) # type conversion
index = list(([Link] +
[Link])[0:r])) DataFrame iteration methods
df['group'] = list( [Link]()# (col-index, Series) pairs
''.join([Link]('abcd') [Link]() # (row-index, Series) pairs
for _ in range(r))
) # example ... iterating over columns
for (name, series) in [Link]():
print('Col name: ' + str(name))
print('First value: ' +
Saving a DataFrame str([Link][0]) + '\n')

Saving a DataFrame to a CSV file Maths on the whole DataFrame (not a complete list)
df.to_csv('[Link]', encoding='utf-8') df = [Link]() # absolute values
df = [Link](o) # add df, Series or value
s = [Link]() # non NA/null values
Saving DataFrames to an Excel Workbook
df = [Link]() # (cols default axis)
from pandas import ExcelWriter df = [Link]() # (cols default axis)
writer = ExcelWriter('[Link]') df = [Link]() # (cols default axis)
df1.to_excel(writer,'Sheet1') df = [Link]() # (cols default axis)
df2.to_excel(writer,'Sheet2') df = [Link]() # 1st diff (col def axis)
[Link]() df = [Link](o) # div by df, Series, value
df = [Link](o) # matrix dot product
Saving a DataFrame to MySQL s = [Link]() # max of axis (col def)
import pymysql s = [Link]() # mean (col default axis)
from sqlalchemy import create_engine s = [Link]()# median (col default)
e = create_engine('mysql+pymysql://' + s = [Link]() # min of axis (col def)
'USER:PASSWORD@localhost/DATABASE') df = [Link](o) # mul by df Series val
df.to_sql('TABLE',e, if_exists='replace') s = [Link]() # sum axis (cols default)
Note: if_exists ! 'fail', 'replace', 'append' Note: The methods that return a series default to
working on columns.
Saving a DataFrame to a Python dictionary
dictionary = df.to_dict() DataFrame filter/select rows or cols on label info
df = [Link](items=['a', 'b']) # by col
Saving a DataFrame to a Python string df = [Link](items=[5], axis=0) #by row
string = df.to_string() df = [Link](like='x') # keep x in col
df = [Link](regex='x') # regex in col
Note: sometimes may be useful for debugging
df = [Link](crit=(lambda x:not x%5))#r
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows
Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
2
Columns value set based on criteria
Working with Columns df['b']=df['a'].where(df['a']>0,other=0)
df['d']=df['a'].where(df.b!=0,other=df.c)
A DataFrame column is a pandas Series object Note: where other can be a Series or a scalar

Data type conversions

Get column index and labels
s = df['col'].astype(str) # Series dtype
idx = [Link] # get col index
na = df['col'].values # numpy array
label = [Link][0] # 1st col label
pl = df['col'].tolist() # python list
lst = [Link]() # get as a list
Note: useful dtypes for Series conversion: int, float, str
Trap: index lost in conversion from Series to array or list
Change column labels
[Link](columns={'old':'new'},
Common column-wide methods/attributes
inplace=True)
df = [Link](columns={'a':1,'b':'x'}) value = df['col'].dtype # type of data
value = df['col'].size # col dimensions
value = df['col'].count()# non-NA count
Selecting columns value = df['col'].sum()
s = df['colName'] # select col to Series value = df['col'].prod()
df = df[['colName']] # select col to df value = df['col'].min()
df = df[['a','b']] # select 2 or more value = df['col'].max()
df = df[['c','a','b']]# change order value = df['col'].mean()
s = df[[Link][0]] # select by number value = df['col'].median()
df = df[[Link][[0, 3, 4]] # by number value = df['col'].cov(df['col2'])
s = [Link]('c') # get col & drop from df s = df['col'].describe()
s = df['col'].value_counts()
Selecting columns with Python attributes
s = df.a # same as s = df['a'] Find index label for min/max values in column
# cannot create new columns by attribute label = df['col1'].idxmin()
df.existing_col = df.a / df.b label = df['col1'].idxmax()
df['new_col'] = df.a / df.b
Trap: column names must be valid identifiers. Common column element-wise methods
s = df['col'].isnull()
Adding new columns to a DataFrame s = df['col'].notnull() # not isnull()
df['new_col'] = range(len(df)) s = df['col'].astype(float)
df['new_col'] = [Link]([Link],len(df)) s = df['col'].round(decimals=0)
df['random'] = [Link](len(df)) s = df['col'].diff(periods=1)
df['index_as_col'] = [Link] s = df['col'].shift(periods=1)
df1[['b','c']] = df2[['e','f']] s = df['col'].to_datetime()
df3 = [Link](other=df2) s = df['col'].fillna(0) # replace NaN w 0
Trap: When adding an indexed pandas object as a new s = df['col'].cumsum()
column, only items from the new series that have a s = df['col'].cumprod()
corresponding index in the DataFrame will be added. s = df['col'].pct_change(periods=4)
The receiving DataFrame is not extended to s = df['col'].rolling_sum(periods=4,
accommodate the new series. To merge, see below. window=4)
Trap: when adding a python list or numpy array, the Note: also rolling_min(), rolling_max(), and many more.
column will be added by integer position.
Append a column of row sums to a DataFrame
Swap column contents – change column order df['Total'] = [Link](axis=1)
df[['B', 'A']] = df[['A', 'B']] Note: also means, mins, maxs, etc.

Dropping columns (mostly by label) Multiply every column in DataFrame by Series

df = [Link]('col1', axis=1) df = [Link](s, axis=0) # on matched rows
[Link]('col1', axis=1, inplace=True) Note: also add, sub, div, etc.
df = [Link](['col1','col2'], axis=1)
s = [Link]('col') # drops from frame Selecting columns with .loc, .iloc and .ix
del df['col'] # even classic python works df = [Link][:, 'col1':'col2'] # inclusive
[Link]([Link][0], inplace=True) df = [Link][:, 0:2] # exclusive

Vectorised arithmetic on columns Get the integer position of a column index label
df['proportion']=df['count']/df['total'] j = [Link].get_loc('col_name')
df['percent'] = df['proportion'] * 100.0
Test if column index values are unique/monotonic
Apply numpy mathematical functions to columns
if [Link].is_unique: pass # ...
df['log_data'] = [Link](df['col1']) b = [Link].is_monotonic_increasing
df['rounded'] = [Link](df['col2'], 2) b = [Link].is_monotonic_decreasing
Note: Many more mathematical functions

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
3
Select a slice of rows by label/index
Working with rows [inclusive-from : inclusive–to [ : step]]
df = df['a':'c'] # rows 'a' through 'c'
Get the row index and labels Trap: doesn't work on integer labelled rows
idx = [Link] # get row index
label = [Link][0] # 1st row label Append a row of column totals to a DataFrame
lst = [Link]() # get as a list # Option 1: use dictionary comprehension
sums = {col: df[col].sum() for col in df}
Change the (row) index sums_df = DataFrame(sums,index=['Total'])
[Link] = idx # new ad hoc index df = [Link](sums_df)
[Link] = range(len(df)) # set with list
df = df.reset_index() # replace old w new # Option 2: All done with pandas
# note: old index stored as a col in df df = [Link](DataFrame([Link](),
df = [Link](index=range(len(df))) columns=['Total']).T)
df = df.set_index(keys=['r1','r2','etc'])
[Link](index={'old':'new'}, Iterating over DataFrame rows
inplace=True) for (index, row) in [Link](): # pass
Trap: row data type may be coerced.
Adding rows
df = original_df.append(more_rows_in_df) Sorting DataFrame rows values
Hint: convert to a DataFrame and then append. Both df = [Link]([Link][0],
DataFrames should have same column labels. ascending=False)
[Link](['col1', 'col2'], inplace=True)
Dropping rows (by name)
df = [Link]('row_label') Random selection of rows
df = [Link](['row1','row2']) # multi-row import random as r
k = 20 # pick a number
Boolean row selection by values in a column selection = [Link](range(len(df)), k)
df_sample = [Link][selection, :]
df = df[df['col2'] >= 0.0]
df = df[(df['col3']>=1.0) | Note: this sample is not sorted
(df['col1']<0.0)]
df = df[df['col'].isin([1,2,5,7,11])] Sort DataFrame by its row index
df = df[~df['col'].isin([1,2,5,7,11])] df.sort_index(inplace=True) # sort by row
df = df[df['col'].[Link]('hello')] df = df.sort_index(ascending=False)
Trap: bitwise "or", "and" “not” (ie. | & ~) co-opted to be
Boolean operators on a Series of Boolean Drop duplicates in the row index
Trap: need parentheses around comparisons. df['index'] = [Link] # 1 create new col
df = df.drop_duplicates(cols='index',
Selecting rows using isin over multiple columns take_last=True)# 2 use new col
# fake up some data del df['index'] # 3 del the col
data = {1:[1,2,3], 2:[1,4,9], 3:[1,8,27]} df.sort_index(inplace=True)# 4 tidy up
df = [Link](data)
Test if two DataFrames have same row index
# multi-column isin len(a)==len(b) and all([Link]==[Link])
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] Get the integer position of a row or col index label
i = [Link].get_loc('row_label')
Selecting rows using an index
Trap: index.get_loc() returns an integer for a unique
idx = df[df['col'] >= 2].index match. If not a unique match, may return a slice or
print([Link][idx]) mask.

Select a slice of rows by integer position Get integer position of rows that meet condition
[inclusive-from : exclusive-to [: step]] a = [Link](df['col'] >= 2) #numpy array
default start is 0; default end is len(df)
df = df[:] # copy DataFrame Test if the row index values are unique/monotonic
df = df[0:2] # rows 0 and 1
df = df[-1:] # the last row if [Link].is_unique: pass # ...
df = df[2:3] # row 2 (the third row) b = [Link].is_monotonic_increasing
df = df[:-1] # all but the last row b = [Link].is_monotonic_decreasing
df = df[::2] # every 2nd row (0 2 ..)
Trap: a single integer without a colon is a column label
for integer numbered columns.

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
4
Working with cells In summary: indexes and addresses

Selecting a cell by row and column labels In the main, these notes focus on the simple, single
value = [Link]['row', 'col'] level Indexes. Pandas also has a hierarchical or
value = [Link]['row', 'col'] multi-level Indexes (aka the MultiIndex).
value = df['col'].at['row'] # tricky
Note: .at[] fastest label based scalar lookup A DataFrame has two Indexes
• Typically, the column index ([Link]) is a list of
Setting a cell by row and column labels strings (observed variable names) or (less
[Link]['row, 'col'] = value commonly) integers (the default is numbered from 0
[Link]['row, 'col'] = value to length-1)
df['col'].at['row'] = value # tricky • Typically, the row index ([Link]) might be:
o Integers - for case or row numbers (default is
Selecting and slicing on labels numbered from 0 to length-1);
df = [Link]['row1':'row3', 'col1':'col3'] o Strings – for case names; or
Note: the "to" on this slice is inclusive. o DatetimeIndex or PeriodIndex – for time series
data (more below)
Setting a cross-section by labels
[Link]['A':'C', 'col1':'col3'] = [Link] Indexing
[Link][1:2,'col1':'col2']=[Link]((2,2)) # --- selecting columns
[Link][1:2,'A':'C']=[Link][1:2,'A':'C'] s = df['col_label'] # scalar
Remember: inclusive "to" in the slice df = df[['col_label']] # one item list
df = df[['L1', 'L2']] # many item list
Selecting a cell by integer position df = df[index] # pandas Index
df = df[s] # pandas Series
value = [Link][9, 3] # [row, col]
value = [Link][0, 0] # [row, col]
# --- selecting rows
value = [Link][len(df)-1,
df = df['from':'inc_to']# label slice
len([Link])-1]
df = df[3:7] # integer slice
df = df[df['col'] > 0.5]# Boolean Series
Selecting a range of cells by int position
df = [Link][2:4, 2:4] # subset of the df df = [Link]['label'] # single label
df = [Link][:5, :5] # top left corner df = [Link][container] # lab list/Series
s = [Link][5, :] # returns row as Series df = [Link]['from':'to']# inclusive slice
df = [Link][5:6, :] # returns row as row df = [Link][bs] # Boolean Series
Note: exclusive "to" – same as python list slicing. df = [Link][0] # single integer
df = [Link][container] # int list/Series
Setting cell by integer position df = [Link][0:5] # exclusive slice
[Link][0, 0] = value # [row, col] df = [Link][x] # loc then iloc
[Link][7, 8] = value
# --- select DataFrame cross-section
# r and c can be scalar, list, slice
Setting cell range by integer position
[Link][r, c] # label accessor (row, col)
[Link][0:3, 0:5] = value [Link][r, c]# integer accessor
[Link][1:3, 1:4] = [Link]((2, 3)) [Link][r, c] # label access int fallback
[Link][1:3, 1:4] = [Link]((2, 3)) df[c].iloc[r]# chained – also for .loc
[Link][1:3, 1:4] = [Link]([[1, 1, 1],
[2, 2, 2]]) # --- select cell
Remember: exclusive-to in the slice # r and c must be label or integer
[Link][r, c] # fast scalar label accessor
.ix for mixed label and integer position indexing [Link][r, c] # fast scalar int accessor
value = [Link][5, 'col1'] df[c].iat[r] # chained – also for .at
df = [Link][1:5, 'col1':'col3']
# --- indexing methods
Views and copies v = df.get_value(r, c) # get by row, col
From the manual: Setting a copy can cause subtle df = df.set_value(r,c,v)# set by row, col
errors. The rules about when a view on the data is df = [Link](key, axis) # get cross-section
df = [Link](items, like, regex, axis)
returned are dependent on NumPy. Whenever an array
df = [Link](crit, axis)
of labels or a Boolean vector are involved in the indexing
operation, the result will be a copy. Note: the indexing attributes (.loc, .iloc, .ix, .at .iat) can
be used to get and set values in the DataFrame.
Note: the .loc, iloc and .ix indexing attributes can accept
python slice objects. But .at and .iat do not.
Note: .loc can also accept Boolean Series arguments
Avoid: chaining in the form df[col_indexer][row_indexer]
Trap: label slices are inclusive, integer slices exclusive.

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
5
Joining/Combining DataFrames Groupby: Split-Apply-Combine

Three ways to join two DataFrames: The pandas "groupby" mechanism allows us to split the
• merge (a database/SQL-like join operation) data into groups, apply a function to each group
• concat (stack side by side or one on top of the other) independently and then combine the results.
• combine_first (splice the two together, choosing
values from one over the other) Grouping
gb = [Link]('cat') # by one columns
Merge on indexes gb = [Link](['c1','c2']) # by 2 cols
df_new = [Link](left=df1, right=df2, gb = [Link](level=0) # multi-index gb
how='outer', left_index=True, gb = [Link](level=['a','b']) # mi gb
right_index=True) print([Link])
How: 'left', 'right', 'outer', 'inner' Note: groupby() returns a pandas groupby object
How: outer=union/all; inner=intersection Note: the groupby object attribute .groups contains a
dictionary mapping of the groups.
Merge on columns Trap: NaN values in the group key are automatically
df_new = [Link](left=df1, right=df2, dropped – there will never be a NA group.
how='left', left_on='col1',
right_on='col2') Iterating groups – usually not needed
Trap: When joining on columns, the indexes on the for name, group in gb:
passed DataFrames are ignored. print (name)
Trap: many-to-many merges on a column can result in print (group)
an explosion of associated data.
Selecting a group
Join on indexes (another way of merging) dfa = [Link]('cat').get_group('a')
df_new = [Link](other=df2, on='col1', dfb = [Link]('cat').get_group('b')
how='outer')
df_new = [Link](other=df2,on=['a','b'], Applying an aggregating function
how='outer') # apply to a column ...
Note: [Link]() joins on indexes by default. s = [Link]('cat')['col1'].sum()
[Link]() joins on common columns by s = [Link]('cat')['col1'].agg([Link])
default. # apply to the every column in DataFrame
s = [Link]('cat').agg([Link])
Simple concatenation is often the best df_summary = [Link]('cat').describe()
df=[Link]([df1,df2],axis=0)#top/bottom df_row_1s = [Link]('cat').head(1)
df = [Link]([df2, df3]) #top/bottom Note: aggregating functions reduce the dimension by
df=[Link]([df1,df2],axis=1)#left/right one – they include: mean, sum, size, count, std, var,
Trap: can end up with duplicate rows or cols sem, describe, first, last, min, max
Note: concat has an ignore_index parameter
Applying multiple aggregating functions
Combine_first gb = [Link]('cat')
df = df1.combine_first(other=df2)
# apply multiple functions to one column
# multi-combine with python reduce() dfx = gb['col2'].agg([[Link], [Link]])
df = reduce(lambda x, y: # apply to multiple fns to multiple cols
x.combine_first(y), dfy = [Link]({
[df1, df2, df3, df4, df5]) 'cat': np.count_nonzero,
'col1': [[Link], [Link], [Link]],
Uses the non-null values from df1. The index of the 'col2': [[Link], [Link]]
combined DataFrame will be the union of the indexes })
from df1 and df2.
Note: gb['col2'] above is shorthand for
[Link]('cat')['col2'], without the need for regrouping.

Transforming functions
# transform to group z-scores, which have
# a group mean of 0, and a std dev of 1.
zscore = lambda x: ([Link]())/[Link]()
dfz = [Link]('cat').transform(zscore)

# replace missing data with group mean

mean_r = lambda x: [Link]([Link]())
dfm = [Link]('cat').transform(mean_r)
Note: can apply multiple transforming functions in a
manner similar to multiple aggregating functions above,

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
6
Applying filtering functions
Filtering functions allow you to make selections based Working with dates, times and their indexes
on whether each group meets specified criteria
# select groups with more than 10 members Dates and time – points and spans
eleven = lambda x: (len(x['col1']) >= 11) With its focus on time-series data, pandas has a suite of
df11 = [Link]('cat').filter(eleven) tools for managing dates and time: either as a point in
time (a Timestamp) or as a span of time (a Period).
Group by a row index (non-hierarchical index) t = [Link]('2013-01-01')
df = df.set_index(keys='cat') t = [Link]('2013-01-01 [Link]')
s = [Link](level=0)['col1'].sum() t = [Link]('2013-01-01 [Link].7')
dfg = [Link](level=0).sum() p = [Link]('2013-01-01', freq='M')
Note: Timestamps should be in range 1678 and 2261
years. (Check [Link] and [Link]).
Pivot Tables A Series of Timestamps or Periods
ts = ['2015-04-01 [Link]',
Pivot '2014-04-02 [Link]']
Pivot tables move from long format to wide format data
df = DataFrame([Link](100,1)) # Series of Timestamps (good)
[Link] = ['data'] # rename col s = pd.to_datetime([Link](ts))
[Link] = pd.period_range('3/3/2014',
periods=len(df), freq='M') # Series of Periods (often not so good)
df['year'] = [Link] s = [Link]( [[Link](x, freq='M')
df['month'] = [Link] for x in ts] )
s = [Link](
# pivot to wide format [Link](ts,freq='S'))
df = [Link](index='year', Note: While Periods make a very useful index; they may
columns='month', values='data') be less useful in a Series.

# melt to long format From non-standard strings to Timestamps

dfm = df t = ['[Link].7654-JAN092002',
dfm['year'] = [Link] '[Link].6589-FEB082016']
dfm = [Link](df, id_vars=['year'], s = [Link](pd.to_datetime(t,
var_name='month', value_name='data') format="%H:%M:%S.%f-%b%d%Y"))
Also: %B = full month name; %m = numeric month;
# unstack to long format %y = year without century; and more …
# reset index to remove multi-level index
dfu=[Link]().reset_index(name='data') Dates and time – stamps and spans as indexes
An index of Timestamps is a DatetimeIndex.
Value counts An index of Periods is a PeriodIndex.
s = df['col1'].value_counts() date_strs = ['2014-01-01', '2014-04-01',
'2014-07-01', '2014-10-01']

dti = [Link](date_strs)

pid = [Link](date_strs, freq='D')

pim = [Link](date_strs, freq='M')
piq = [Link](date_strs, freq='Q')

print (pid[1] - pid[0]) # 90 days

print (pim[1] - pim[0]) # 3 months
print (piq[1] - piq[0]) # 1 quarter

time_strs = ['2015-01-01 [Link].12345',

'2015-01-01 [Link].67890']
pis = [Link](time_strs, freq='U')

[Link] = pd.period_range('2015-01',
periods=len(df), freq='M')

dti = pd.to_datetime(['04-01-2012'],
dayfirst=True) # Australian date format
pi = pd.period_range('1960-01-01',
'2015-12-31', freq='M')
Hint: unless you are working in less than seconds,
prefer PeriodIndex over DateTimeImdex.

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
7
Period frequency constants (not a complete list) Upsampling and downsampling
Name Description # upsample from quarterly to monthly
U Microsecond pi = pd.period_range('1960Q1',
L Millisecond periods=220, freq='Q')
S Second df = DataFrame([Link](len(pi),5),
index=pi)
T Minute
dfm = [Link]('M', convention='end')
H Hour # use ffill or bfill to fill with values
D Calendar day
B Business day # downsample from monthly to quarterly
W-{MON, TUE, …} Week ending on … dfq = [Link]('Q', how='sum')
MS Calendar start of month
M Calendar end of month Time zones
QS-{JAN, FEB, …} Quarter start with year starting t = ['2015-06-30 [Link]',
(QS – December) '2015-12-31 [Link]']
Q-{JAN, FEB, …} Quarter end with year ending (Q dti = pd.to_datetime(t
– December) ).tz_localize('Australia/Canberra')
AS-{JAN, FEB, …} Year start (AS - December) dti = dti.tz_convert('UTC')
ts = [Link]('now',
A-{JAN, FEB, …} Year end (A - December) tz='Europe/London')
From DatetimeIndex to Python datetime objects # get a list of all time zones
dti = [Link](pd.date_range( import pyzt
start='1/1/2011', periods=4, freq='M')) for tz in pytz.all_timezones:
s = Series([1,2,3,4], index=dti) print tz
na = dti.to_pydatetime() #numpy array Note: by default, Timestamps are created without time
na = [Link].to_pydatetime() #numpy array zone information.

Frome Timestamps to Python dates or times Row selection with a time-series index
df['date'] = [[Link]() for x in df['TS']] # start with the play data above
df['time'] = [[Link]() for x in df['TS']] idx = pd.period_range('2015-01',
Note: converts to [Link] or [Link]. But periods=len(df), freq='M')
does not convert to [Link]. [Link] = idx

From DatetimeIndex to PeriodIndex and back february_selector = ([Link] == 2)

df = DataFrame([Link](20,3)) february_data = df[february_selector]
[Link] = pd.date_range('2015-01-01',
periods=len(df), freq='M') q1_data = df[([Link] >= 1) &
dfp = df.to_period(freq='M') ([Link] <= 3)]
dft = dfp.to_timestamp()
Note: from period to timestamp defaults to the point in mayornov_data = df[([Link] == 5)
| ([Link] == 11)]
time at the start of the period.
totals = [Link]([Link]).sum()
Working with a PeriodIndex
Also: year, month, day [of month], hour, minute, second,
pi = pd.period_range('1960-01','2015-12',
dayofweek [Mon=0 .. Sun=6], weekofmonth, weekofyear
freq='M')
na = [Link] # numpy array of integers [numbered from 1], week starts on Monday], dayofyear
lp = [Link]() # python list of Periods [from 1], …
sp = Series(pi)# pandas Series of Periods
ss = Series(pi).astype(str) # S of strs The [Link] accessor attribute
ls = Series(pi).astype(str).tolist() DataFrame columns that contain datetime-like objects
can be manipulated with the .dt accessor attribute
Get a range of Timestamps t = ['2012-04-14 [Link].307000',
dr = pd.date_range('2013-01-01', '2011-05-14 [Link].457000',
'2013-12-31', freq='D') '2010-06-14 [Link].520000']

# a Series of time stamps

Error handling with dates s = [Link](pd.to_datetime(t))
# 1st example returns string not Timestamp print([Link]) # datetime64[ns]
t = pd.to_datetime('2014-02-30') print([Link]) # 56, 24, 7
# 2nd example returns NaT (not a time) print([Link]) # 4, 5, 6
t = pd.to_datetime('2014-02-30',
coerce=True) # a Series of time periods
# NaT like NaN tests True for isnull() s = [Link]([Link](t,freq='Q'))
b = [Link](t) # --> True print([Link]) # datetime64[ns]
print([Link]) # 2, 2, 2
The tail of a time-series DataFrame print([Link]) # 2012, 2011, 2010
df = [Link]("5M") # the last five months
Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
8
Working with missing and non-finite data Working with Categorical Data

Working with missing data Categorical data

Pandas uses the not-a-number construct ([Link] and The pandas Series has an R factors-like data type for
float('nan')) to indicate missing data. The Python None encoding categorical data.
can arise in data as well. It is also treated as missing s = Series(['a','b','a','c','b','d','a'],
data; as is the pandas not-a-time construct dtype='category')
([Link]). df['B'] = df['A'].astype('category')
Note: the key here is to specify the "category" data type.
Missing data in a Series Note: categories will be ordered on creation if they are
s = Series( [8,None,float('nan'),[Link]]) sortable. This can be turned off. See ordering below.
#[8, NaN, NaN, NaN]
[Link]() #[False, True, True, True] Convert back to the original data type
[Link]()#[True, False, False, False] s = Series(['a','b','a','c','b','d','a'],
[Link](0)#[8, 0, 0, 0] dtype='category')
s = [Link]('string')
Missing data in a DataFrame
df = [Link]() # drop all rows with NaN Ordering, reordering and sorting
df = [Link](axis=1) # same for cols s = Series(list('abc'), dtype='category')
df=[Link](how='all') #drop all NaN row print ([Link])
df=[Link](thresh=2) # drop 2+ NaN in r s=[Link].reorder_categories(['b','c','a'])
# only drop row if NaN in a specified col s = [Link]()
df = [Link](df['col'].notnull()) [Link] = False
Trap: category must be ordered for it to be sorted
Recoding missing data
[Link](0, inplace=True) # [Link] ! 0 Renaming categories
s = df['col'].fillna(0) # [Link] ! 0 s = Series(list('abc'), dtype='category')
df = [Link](r'\s+', [Link], [Link] = [1, 2, 3] # in place
regex=True) # white space ! [Link] s = [Link].rename_categories([4,5,6])
# using a comprehension ...
Non-finite numbers [Link] = ['Group ' + str(i)
With floating point numbers, pandas provides for for i in [Link]]
positive and negative infinity. Trap: categories must be uniquely named
s = Series([float('inf'), float('-inf'),
[Link], -[Link]]) Adding new categories
Pandas treats integer comparisons with plus or minus s = [Link].add_categories([4])
infinity as expected.
Removing categories
Testing for finite numbers s = [Link].remove_categories([4])
(using the data from the previous example) [Link].remove_unused_categories() #inplace
b = [Link](s)

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
9
Working with strings Basic Statistics

Working with strings Summary statistics

# assume that df['col'] is series of s = df['col1'].describe()
strings df1 = [Link]()
s = df['col'].[Link]()
s = df['col'].[Link]() DataFrame – key stats methods
s = df['col'].[Link]() [Link]() # pairwise correlation cols
[Link]() # pairwise covariance cols
# the next set work like Python [Link]() # kurtosis over cols (def)
df['col'] += 'suffix' # append [Link]() # mean absolute deviation
df['col'] *= 2 # duplicate [Link]() # standard error of mean
s = df['col1'] + df['col2'] # concatenate [Link]() # variance over cols (def)
Most python string functions are replicated in the pandas
DataFrame and Series objects. Value counts
s = df['col1'].value_counts()
Regular expressions
s = df['col'].[Link]('regex')
Cross-tabulation (frequency count)
s = df['col'].[Link]('regex')
s = df['col'].[Link]('regex') ct = [Link](index=df['a'],
s = df['col'].[Link]('old', 'new') cols=df['b'])
df['b'] = [Link]('(pattern)')
Note: pandas has many more regex methods. Quantiles and ranking
quants = [0.05, 0.25, 0.5, 0.75, 0.95]
q = [Link](quants)
r = [Link]()

Histogram binning
count, bins = [Link](df['col1'])
count, bins = [Link](df['col'],
bins=5)
count, bins = [Link](df['col1'],
bins=[-3,-2,-1,0,1,2,3,4])

Regression
import [Link] as sm
result = [Link](formula="col1 ~ col2 +
col3", data=df).fit()
print ([Link])
print ([Link]())

Smoothing example using rolling_apply

k3x5 = [Link]([1,2,3,3,3,2,1]) / 15.0
s = pd.rolling_apply(df['col1'],
window=7,
func=lambda x: (x * k3x5).sum(),
min_periods=7, center=True)

I
Cautionary note

This cheat sheet was cobbled together by bots roaming

the dark recesses of the Internet seeking ursine and
pythonic myths. There is no guarantee the narratives
were captured and transcribed accurately. You use
these notes at your own risk. You have been warned.

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
10

Pandas Cheat Sheet
100% (1)
Pandas Cheat Sheet
2 pages
Python Pandas Cheat Sheet Guide
No ratings yet
Python Pandas Cheat Sheet Guide
11 pages
Pandas Handbook
No ratings yet
Pandas Handbook
33 pages
Dataframe in Pandas - Cheatsheet
No ratings yet
Dataframe in Pandas - Cheatsheet
8 pages
SQL Zero To Hero
No ratings yet
SQL Zero To Hero
14 pages
Python Pandas: 12 Data Manipulation Techniques
100% (2)
Python Pandas: 12 Data Manipulation Techniques
19 pages
Python & Pandas Coding Quiz
No ratings yet
Python & Pandas Coding Quiz
2 pages
Python Basics for Data Science
100% (3)
Python Basics for Data Science
15 pages
Python Variables and Operations Guide
No ratings yet
Python Variables and Operations Guide
105 pages
Data Ingestion and Reshaping Guide
100% (1)
Data Ingestion and Reshaping Guide
2 pages
Python Complex Data Types Cheat Sheet
No ratings yet
Python Complex Data Types Cheat Sheet
1 page
Pandas Series Practice Questions
0% (1)
Pandas Series Practice Questions
42 pages
Numpy Python Cheat Sheet
100% (1)
Numpy Python Cheat Sheet
1 page
Python Interview Questions & Answers
No ratings yet
Python Interview Questions & Answers
62 pages
Python For Data Science
100% (1)
Python For Data Science
4 pages
International Indian School, Riyadh WORKSHEET (2020-2021) Grade - Xii - Informatics Practices - Second Term
No ratings yet
International Indian School, Riyadh WORKSHEET (2020-2021) Grade - Xii - Informatics Practices - Second Term
9 pages
Mat Plot Lib
No ratings yet
Mat Plot Lib
44 pages
Pandas vs PySpark: Data Operations
No ratings yet
Pandas vs PySpark: Data Operations
3 pages
SQL With Python Guide
No ratings yet
SQL With Python Guide
17 pages
Python DSA Course Overview
No ratings yet
Python DSA Course Overview
4 pages
B13!05!06 - Python - Collection Data Types and String
No ratings yet
B13!05!06 - Python - Collection Data Types and String
83 pages
DAX Cheat Sheet for Power BI
No ratings yet
DAX Cheat Sheet for Power BI
10 pages
SQL Basics for Beginners
No ratings yet
SQL Basics for Beginners
34 pages
Python 3.x Lists Cheatsheet
No ratings yet
Python 3.x Lists Cheatsheet
5 pages
Python Pandas Beginner's Guide
No ratings yet
Python Pandas Beginner's Guide
45 pages
Python Pandas
No ratings yet
Python Pandas
177 pages
Customer Data Analysis & Feature Engineering
No ratings yet
Customer Data Analysis & Feature Engineering
35 pages
Feature Scaling Techniques: Machine Learning
No ratings yet
Feature Scaling Techniques: Machine Learning
27 pages
12 Ip
No ratings yet
12 Ip
5 pages
Pandas - Basics - Practice: Consider The Following Python Dictionary Data and Python List Labels
No ratings yet
Pandas - Basics - Practice: Consider The Following Python Dictionary Data and Python List Labels
6 pages
Numpy ML - AI
No ratings yet
Numpy ML - AI
135 pages
Python Interview Questions 1653100147
No ratings yet
Python Interview Questions 1653100147
24 pages
Python Generators: How To Create A Generator in Python?
No ratings yet
Python Generators: How To Create A Generator in Python?
8 pages
Python Tips & Tricks - 50 Basic & Intermediate Tips & Tricks PDF
100% (1)
Python Tips & Tricks - 50 Basic & Intermediate Tips & Tricks PDF
61 pages
Supervised Learning
No ratings yet
Supervised Learning
3 pages
Data Visualization
No ratings yet
Data Visualization
9 pages
Pandas Interview Prep Guide
No ratings yet
Pandas Interview Prep Guide
5 pages
Pandas
100% (1)
Pandas
1,131 pages
Python Basics for Data Science
100% (1)
Python Basics for Data Science
8 pages
Python Interview Questions
No ratings yet
Python Interview Questions
61 pages
Data Handling Python NCERT
No ratings yet
Data Handling Python NCERT
36 pages
Numpy
No ratings yet
Numpy
19 pages
Data Structure in C
No ratings yet
Data Structure in C
13 pages
Python For Data Science
No ratings yet
Python For Data Science
45 pages
Python Assignment
33% (3)
Python Assignment
53 pages
Python Pandas
100% (1)
Python Pandas
35 pages
Python Data Science: Pandas & ML Basics
100% (1)
Python Data Science: Pandas & ML Basics
41 pages
Memento Python3 en Latest
No ratings yet
Memento Python3 en Latest
2 pages
Python Cheet Sheet
No ratings yet
Python Cheet Sheet
2 pages
ML/DS Interview Cheat Sheets
No ratings yet
ML/DS Interview Cheat Sheets
16 pages
Pandas DataFrame Notes
No ratings yet
Pandas DataFrame Notes
10 pages
Pandas DataFrame Cheat Sheet
100% (1)
Pandas DataFrame Cheat Sheet
10 pages
Pandas DataFrame Cheat Sheet
No ratings yet
Pandas DataFrame Cheat Sheet
4 pages
Pandas DataFrame Cheat Sheet Guide
No ratings yet
Pandas DataFrame Cheat Sheet Guide
10 pages
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
100% (1)
Cheat Sheet: The Pandas Dataframe Object: Preliminaries Get Your Data Into A Dataframe
12 pages
Pandas DataFrame Cheat Sheet Guide
No ratings yet
Pandas DataFrame Cheat Sheet Guide
12 pages
Pandas DataFrame Notes
67% (3)
Pandas DataFrame Notes
13 pages
Pandas DataFrame Notes
No ratings yet
Pandas DataFrame Notes
13 pages
Cheat Sheet - Pandas
No ratings yet
Cheat Sheet - Pandas
12 pages
Pandas DataFrame Cheat Sheet
No ratings yet
Pandas DataFrame Cheat Sheet
6 pages
Bharata Ratnakaramu Swami Vidya Prakashananda Giri
100% (1)
Bharata Ratnakaramu Swami Vidya Prakashananda Giri
797 pages
Python Cheat Sheet Overview
No ratings yet
Python Cheat Sheet Overview
7 pages
Lam IntroductionToR LHL
No ratings yet
Lam IntroductionToR LHL
212 pages
Guide To IBM PowerHA System
No ratings yet
Guide To IBM PowerHA System
518 pages
Db2 DBA Check List
No ratings yet
Db2 DBA Check List
20 pages
Safety1st Car Seat
No ratings yet
Safety1st Car Seat
30 pages
User's Manual: Digital Camera
100% (1)
User's Manual: Digital Camera
256 pages
VBA Project Ideas For Beginners To Advanced 1
No ratings yet
VBA Project Ideas For Beginners To Advanced 1
13 pages
Pixel Art With Excel-1
No ratings yet
Pixel Art With Excel-1
7 pages
Product Information DIGSI V4 87-SP1 PDF
No ratings yet
Product Information DIGSI V4 87-SP1 PDF
68 pages
AccuMark E
No ratings yet
AccuMark E
2 pages
Lesson Plan Class Ix
No ratings yet
Lesson Plan Class Ix
3 pages
Introduction To Microsoft Power Bi
100% (10)
Introduction To Microsoft Power Bi
127 pages
ITDS LessonPlans Steps 1-4-202303
No ratings yet
ITDS LessonPlans Steps 1-4-202303
5 pages
Lead Tracker Manual - Vtiger CRM
No ratings yet
Lead Tracker Manual - Vtiger CRM
29 pages
Microsoft Office
No ratings yet
Microsoft Office
5 pages
SAS Statistical Programmer Resume
100% (1)
SAS Statistical Programmer Resume
2 pages
Excel - How To Solve Complicated Incentive Calculation - Dear All I Need..
100% (1)
Excel - How To Solve Complicated Incentive Calculation - Dear All I Need..
35 pages
Dynamics NAV Keyboard Shortcuts
No ratings yet
Dynamics NAV Keyboard Shortcuts
2 pages
Modbus Specifications v025-EN-20230626
No ratings yet
Modbus Specifications v025-EN-20230626
67 pages
Recitation Excel
No ratings yet
Recitation Excel
2 pages
Icecat CSV Documentation Ver 1.9 PDF
No ratings yet
Icecat CSV Documentation Ver 1.9 PDF
13 pages
Excel 2016 Beginner's Guide
No ratings yet
Excel 2016 Beginner's Guide
10 pages
Excel Macros / VBA (Fixed Asset Program)
100% (7)
Excel Macros / VBA (Fixed Asset Program)
147 pages
FMO ESG Toolkit Manual
No ratings yet
FMO ESG Toolkit Manual
32 pages
2024 Year End Revision Paper Y5 ICT
No ratings yet
2024 Year End Revision Paper Y5 ICT
9 pages
E Class Record
100% (1)
E Class Record
18 pages
MS Excel
No ratings yet
MS Excel
2 pages
Screenshot 2024-09-06 at 4.12.05 PM
No ratings yet
Screenshot 2024-09-06 at 4.12.05 PM
14 pages
User Guide Elkem Materials Mixture Analyser - EMMA
100% (1)
User Guide Elkem Materials Mixture Analyser - EMMA
35 pages
RRB Technician Grade 1 Official Paper (Held On - 20 Dec, 2024 Shift 1)
No ratings yet
RRB Technician Grade 1 Official Paper (Held On - 20 Dec, 2024 Shift 1)
19 pages
Computer Concepts and Applications MCQ PDF
No ratings yet
Computer Concepts and Applications MCQ PDF
20 pages
Lesson4 Methods of Organizing Data
No ratings yet
Lesson4 Methods of Organizing Data
33 pages
Pipeline Studio Excel Add-In User Documentation
No ratings yet
Pipeline Studio Excel Add-In User Documentation
32 pages
Nazmul Huda F
No ratings yet
Nazmul Huda F
34 pages
23 Things You Should Know About Excel Pivot Tables - Exceljet PDF
0% (1)
23 Things You Should Know About Excel Pivot Tables - Exceljet PDF
21 pages
BallSim_Direct Spreadsheet Overview
No ratings yet
BallSim_Direct Spreadsheet Overview
58 pages

Pandas DataFrame Notes

Uploaded by

Pandas DataFrame Notes

Uploaded by

Cheat Sheet: The pandas DataFrame Object

Preliminaries Get your data into a DataFrame

From inline CSV text to a DataFrame

Load DataFrames from a Microsoft Excel file

# Each Excel sheet in a Python dictionary

Data type conversions

Dropping columns (mostly by label) Multiply every column in DataFrame by Series

# replace missing data with group mean

# melt to long format From non-standard strings to Timestamps

pid = [Link](date_strs, freq='D')

print (pid[1] - pid[0]) # 90 days

time_strs = ['2015-01-01 [Link].12345',

From DatetimeIndex to PeriodIndex and back february_selector = ([Link] == 2)

# a Series of time stamps

Working with missing data Categorical data

Working with strings Summary statistics

Smoothing example using rolling_apply

This cheat sheet was cobbled together by bots roaming

You might also like