Pandas DataFrame Notes


Cheat Sheet: The pandas DataFrame

Preliminaries

Start by importing these Python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # for charts

Check which version of pandas you are using
print(pd.__version__)
This cheat sheet was written for pandas version 0.25. It assumes you are using Python 3.

The conceptual model

Pandas provides two important data types: the DataFrame and the Series.

A DataFrame is a two-dimensional table of data with column and row indexes (something like a spreadsheet). The columns are made up of Series objects.

[Figure: a DataFrame pictured as a column index (df.columns) across the top, a row index (df.index) down the side, and a Series of data in each column.]

A DataFrame has two Indexes:
• Typically, the column index (df.columns) is a list of strings (variable names) or (less commonly) integers
• Typically, the row index (df.index) might be:
  o Integers – for case or row numbers;
  o Strings – for case names; or
  o DatetimeIndex or PeriodIndex – for time series

A Series is an ordered, one-dimensional array of data with an index. All the data is of the same data type. Series arithmetic is vectorised after first aligning the Series index for each of the operands.

Examples of Series Arithmetic
s1 = pd.Series(range(0, 4))  # 0, 1, 2, 3
s2 = pd.Series(range(1, 5))  # 1, 2, 3, 4
s3 = s1 + s2                 # 1, 3, 5, 7

s4 = pd.Series([1, 2, 3], index=[0, 1, 2])
s5 = pd.Series([1, 2, 3], index=[2, 1, 0])
s6 = s4 + s5                 # 4, 4, 4

s7 = pd.Series([1, 2, 3], index=[1, 2, 3])
s8 = pd.Series([1, 2, 3], index=[0, 1, 2])
s9 = s7 + s8                 # NaN, 3, 5, NaN
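To make the model concrete, here is a minimal sketch (with invented column names and data) that builds a small DataFrame and looks at its two Indexes and one of its column Series:
df = pd.DataFrame({'Cuteness': [8.7, 9.5],
    'Desirable': [True, False]},
    index=['dog', 'cat'])
print(df.columns)   # the column index
print(df.index)     # the row index
s = df['Cuteness']  # each column is a Series
print(s.dtype)      # float64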


Get your data into a DataFrame

Instantiate a DataFrame
df = pd.DataFrame()  # the empty DataFrame
df = pd.DataFrame(python_dictionary)
df = pd.DataFrame(numpy_matrix)

Load a DataFrame from a CSV file
df = pd.read_csv('file.csv', header=0,
    index_col=0, quotechar='"', sep=':',
    na_values=['na', '-', '.', ''])
Note: refer to the pandas docs for all arguments

Get your data from inline python CSV text
from io import StringIO
data = """, Animal, Cuteness, Desirable
A, dog, 8.7, True
B, cat, 9.5, False"""
df = pd.read_csv(StringIO(data), header=0,
    index_col=0, skipinitialspace=True)
Note: skipinitialspace=True allows a pretty layout

Also, among many other options …
df = pd.read_html(url/html_string)
df = pd.read_json(path/JSON_string)
df = pd.read_sql(query, connection)
df = pd.read_excel('filename.xlsx')
df = pd.read_clipboard()  # eg from Excel copy
Note: see the pandas documentation for arguments.

Fake up some random data – useful for testing
df = (pd.DataFrame(np.random.rand(1100, 6),
    columns=list('ABCDEF')) - 0.5).cumsum()
df['Group'] = [np.random.choice(list('abcd'))
    for _ in range(len(df))]
df['Date'] = pd.date_range('1/1/2017',
    periods=len(df), freq='D')
Hint: leave off the Group and/or Date cols if not needed

Saving a DataFrame

Saving a DataFrame to a CSV file
df.to_csv('filename.csv', encoding='utf-8')

Saving a DataFrame to an Excel Workbook
writer = pd.ExcelWriter('filename.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()

Saving a DataFrame to a Python object
d = df.to_dict()  # to dictionary
m = df.values     # to a numpy matrix

Also, among many other options …
html = df.to_html()
df.to_json()
df.to_sql()
df.to_clipboard()  # then paste into Excel
Working with the whole DataFrame

Peek at the DataFrame contents/structure
df.info()            # print cols & data types
dfh = df.head(i)     # get first i rows
dft = df.tail(i)     # get last i rows
dfs = df.describe()  # summary stats for cols
top_left_corner_df = df.iloc[:4, :4]

DataFrame non-indexing attributes
df = df.T           # transpose rows and cols
l = df.axes         # list of row & col indexes
(ri, ci) = df.axes  # from above
s = df.dtypes       # Series of column data types
b = df.empty        # True for empty DataFrame
i = df.ndim         # number of axes (it is 2)
t = df.shape        # (row-count, column-count)
i = df.size         # row-count * column-count
a = df.values       # get a numpy matrix for df

DataFrame utility methods
df = df.copy()  # copy a DataFrame
df = df.sort_values(by=col)
df = df.sort_values(by=[col1, col2])
df = df.sort_values(by=row, axis=1)
df = df.sort_index()   # axis=1 to sort cols
df = df.astype(dtype)  # type conversion

DataFrame iteration methods
df.iteritems()  # (col-index, Series) pairs
df.iterrows()   # (row-index, Series) pairs
# example ... iterating over columns ...
for (name, series) in df.iteritems():
    print('\nCol name: ' + str(name))
    print('1st value: ' + str(series.iat[0]))

Maths on the whole DataFrame (not a complete list)
df = df.abs()     # absolute values
df = df.add(o)    # add df, Series or value
s = df.count()    # non NA/null values
df = df.cummax()  # (cols default axis)
df = df.cummin()  # (cols default axis)
df = df.cumsum()  # (cols default axis)
df = df.diff()    # 1st diff (col def axis)
df = df.div(o)    # div by df, Series, value
df = df.dot(o)    # matrix dot product
s = df.max()      # max of axis (col def)
s = df.mean()     # mean (col default axis)
s = df.median()   # median (col default)
s = df.min()      # min of axis (col def)
df = df.mul(o)    # mul by df, Series, value
s = df.sum()      # sum axis (cols default)
df = df.where(df > 0.5, other=np.nan)
Note: methods returning a Series default to work on cols

Select/filter rows/cols based on index label values
df = df.filter(items=['a', 'b'])   # by col
df = df.filter(items=[5], axis=0)  # by row
df = df.filter(like='x')   # keep x in col
df = df.filter(regex='x')  # regex in col
df = df.select(lambda x: not x % 5)  # 5th rows
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows
Caution: select() is deprecated; prefer filter() or ordinary Boolean row selection (see the sketch below).
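Since select() is deprecated, a brief sketch of equivalent Boolean row selection (it assumes an integer row index, as in the select() example above):
df = df.loc[df.index % 5 == 0]  # every 5th row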
Working with Columns (and pandas Series)

Peek at the column/Series structure/contents
s = df[col].head(i)     # get first i elements
s = df[col].tail(i)     # get last i elements
s = df[col].describe()  # summary stats

Get column index and labels
idx = df.columns         # get col index
label = df.columns[0]    # first col label
l = df.columns.tolist()  # list of col labels
a = df.columns.values    # array of col labels

Change column labels
df = df.rename(columns={'old':'new', 'a':'1'})
df.columns = ['new1', 'new2', 'new3']  # etc.

Selecting columns
s = df[col]            # select col to Series
df = df[[col]]         # select col to df
df = df[[a, b]]        # select 2-plus cols
df = df[[c, a, b]]     # change col order
s = df[df.columns[0]]  # select by number
df = df[df.columns[[0, 3, 4]]]  # by numbers
df = df[df.columns[:-1]]        # all but last col
s = df.pop(col)        # get & drop from df

Selecting columns with Python attributes
s = df.a  # same as s = df['a']
df.existing_column = df.a / df.b
df['new_column'] = df.a / df.b
Trap: column names must be valid Python identifiers, but not a DataFrame method or attribute name.
Trap: cannot create new columns with attribute notation (see the sketch below).
Hint: Don't be lazy: for clearer code avoid dot notation.
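A minimal sketch of the dot-notation trap just noted: assigning to a new attribute name creates an instance attribute, not a column (recent pandas versions also emit a UserWarning):
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.c = df.a / df.b          # trap: no new column created
print(df.columns.tolist())  # ['a', 'b'] – no 'c'
df['c'] = df.a / df.b       # correct: creates column 'c'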
Adding new columns to a DataFrame
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan, len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df1[['b', 'c']] = df2[['e', 'f']]
Trap: when adding a new column, only items from the new column series that have a corresponding index in the DataFrame will be added. The index of the receiving DataFrame is not extended to accommodate all of the new series.
Trap: when adding a python list or numpy array, the column will be added by integer position.

Add a mismatched column with an extended index
df = pd.DataFrame([1, 2, 3], index=[1, 2, 3])
s = pd.Series([2, 3, 4], index=[2, 3, 4])
df = df.reindex(df.index.union(s.index))
df['s'] = s  # with NaNs where no data
Note: assumes unique index values

Dropping (deleting) columns (mostly by label)
df = df.drop(col1, axis=1)
df = df.drop([col1, col2], axis=1)
del df[col]  # even classic python works
df = df.drop(df.columns[0], axis=1)    # first
df = df.drop(df.columns[-1:], axis=1)  # last
Swap column contents
df[['B', 'A']] = df[['A', 'B']]

Vectorised arithmetic on columns
df['proportion'] = df['count'] / df['total']
df['percent'] = df['proportion'] * 100.0

Apply numpy mathematical functions to columns
df['log_data'] = np.log(df[col])
Note: there are many, many more numpy mathematical functions.
Hint: prefer pandas math over numpy where you can.

Set column values based on criteria
df[b] = df[a].where(df[a] > 0, other=0)
df[d] = df[a].where(df.b != 0, other=df.c)
Note: where other can be a Series or a scalar
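For row-wise choices over multiple criteria, the numpy where() and select() functions complement Series.where(); a minimal sketch (column names invented for illustration):
df['sign'] = np.where(df['a'] > 0, 'pos', 'neg')
conditions = [df['a'] > 0, df['a'] < 0]
df['sign3'] = np.select(conditions,
    ['pos', 'neg'], default='zero')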
Data type conversions
s = df[col].astype('float')
s = df[col].astype('int')
s = pd.to_numeric(df[col])
s = df[col].astype('str')
a = df[col].values   # numpy array
l = df[col].tolist() # python list
Trap: index lost in conversion from Series to array or list

Common column-wide methods/attributes
value = df[col].dtype    # type of data
value = df[col].size     # col dimensions
value = df[col].count()  # non-NA count
value = df[col].sum()
value = df[col].prod()
value = df[col].min()
value = df[col].max()
value = df[col].mean()   # also median()
value = df[col].cov(df[other_col])
s = df[col].describe()
s = df[col].value_counts()

Find first row index label for min/max val in column
label = df[col].idxmin()
label = df[col].idxmax()

Common column element-wise methods
s = df[col].isna()
s = df[col].notna()  # not isna()
s = df[col].astype('float')
s = df[col].abs()
s = df[col].round(decimals=0)
s = df[col].diff(periods=1)
s = df[col].shift(periods=1)
s = pd.to_datetime(df[col])
s = df[col].fillna(0)  # replace NaN w 0
s = df[col].cumsum()
s = df[col].cumprod()
s = df[col].pct_change(periods=4)
s = df[col].rolling(window=4,
    min_periods=4, center=False).sum()

Append a column of row sums to a DataFrame
df['Row Total'] = df.sum(axis=1)
Note: also means, mins, maxs, etc.

Multiply every column in DataFrame by a Series
df = df.mul(s, axis=0)  # on matched rows
Note: also add, sub, div, etc.

Selecting columns with .loc, .iloc
df = df.loc[:, 'col1':'col2']  # inclusive
df = df.iloc[:, 0:2]           # exclusive

Get the integer position of a column index label
i = df.columns.get_loc('col_name')

Test if column index values are unique/monotonic
if df.columns.is_unique: pass  # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing

Mapping a DataFrame column or Series
mapper = pd.Series(['red', 'green', 'blue'],
    index=['r', 'g', 'b'])
s = pd.Series(['r', 'g', 'r', 'b']).map(mapper)
# s contains: ['red', 'green', 'red', 'blue']

m = pd.Series([True, False], index=['Y', 'N'])
df = pd.DataFrame(np.random.choice(list('YN'),
    500, replace=True), columns=[col])
df[col] = df[col].map(m)
Note: useful for decoding data before plotting
Note: sometimes referred to as a lookup function
Note: indexes can also be mapped if needed.

Find the largest and smallest values in a column
s = df[col].nlargest(n)
s = df[col].nsmallest(n)

Sorting the columns of a DataFrame
df = df.sort_index(axis=1, ascending=False)
Note: the column labels need to be comparable

Working with rows

Get the row index and labels
idx = df.index         # get row index
label = df.index[0]    # first row label
label = df.index[-1]   # last row label
l = df.index.tolist()  # get as a python list
a = df.index.values    # get as numpy array

Change the (row) index
df.index = idx                 # new ad hoc index
df = df.set_index('A')         # index set to col A
df = df.set_index(['A', 'B'])  # MultiIndex
df = df.reset_index()  # replace old w new
# note: old index stored as a col in df
df.index = range(len(df))  # set with list
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1', 'r2', 'etc'])

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert row(s) to a DataFrame and then append. Both DataFrames must have the same column labels.

Append a row of column totals to a DataFrame
df.loc['Total'] = df.sum()
Note: best if all columns are numeric

Iterating over DataFrame rows
for (index, row) in df.iterrows():
    pass
Trap: row data may be coerced to the same data type

Dropping rows (by label)
df = df.drop(row)
df = df.drop([row1, row2])  # multi-row drop

Row selection by Boolean series
df = df.loc[df[col] >= 0.0]
df = df.loc[(df[col] >= 1.0) | (df[col] < 0.0)]
df = df.loc[df[col].isin([1, 2, 5, 7, 11])]
df = df.loc[~df[col].isin([1, 2, 5, 7, 11])]
df = df.loc[df[col].str.contains('a')]
Hint: while the .loc[] accessor may not be needed above, its use makes for more readable code.
Trap: the bitwise "or", "and" and "not" operators (ie. | & ~) are co-opted to be Boolean operators on a Series of Booleans; therefore, you need parentheses around comparisons.

Selecting rows using isin over multiple columns
# fake up some data
data = {1: [1, 2, 3], 2: [1, 4, 9], 3: [1, 8, 27]}
df = pd.DataFrame(data)

# multi-column isin
lf = {1: [1, 3], 3: [8, 27]}  # look for
f = df.loc[df[list(lf)].isin(lf).all(axis=1)]

Selecting rows using an index
idx = df[df[col] >= 2].index
print(df.loc[idx])

Select a slice of rows by integer position
[inclusive-from : exclusive-to [: step]]; start is 0; end is len(df)
df = df.iloc[:]    # copy entire DataFrame
df = df.iloc[0:2]  # rows 0 and 1
df = df.iloc[2:3]  # row 2 (the third row)
df = df.iloc[-1:]  # the last row
df = df.iloc[:-1]  # all but the last row
df = df.iloc[::2]  # every 2nd row (0 2 ..)
Hint: while the .iloc[] accessor may not be needed above, its use makes for more readable code.

Select a slice of rows by label/index
df = df.loc['a':'c']  # rows 'a' through 'c'
Note: [inclusive-from : inclusive-to [: step]]
Hint: while the .loc[] accessor may not be needed above, its use makes for more readable code.

Sorting the rows of a DataFrame by the row index
df = df.sort_index(ascending=False)

Sorting DataFrame rows based on column values
df = df.sort_values(by=df.columns[0],
    ascending=False)
df = df.sort_values(by=[col1, col2])

Random selection of rows
import random
k = 20  # pick a number
selection = random.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]  # get copy
Note: this randomly selected sample is not sorted
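Hint: pandas has a built-in sampler that does the same job in one line; a brief sketch (random_state is optional, for reproducibility):
df_sample = df.sample(n=20, random_state=42)
df_sample = df.sample(frac=0.1)  # or sample 10% of rows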
Drop duplicates in the row index
df['index'] = df.index  # 1 create new col
df = df.drop_duplicates(subset='index',
    keep='last')        # 2 use new col
del df['index']         # 3 del the col
df = df.sort_index()    # 4 tidy up
Hint: df = df[~df.index.duplicated(keep='last')] achieves the same in one step.

Test if two DataFrames have same row index
len(a) == len(b) and all(a.index == b.index)
Note: you may want to sort indexes first.

Get the integer position of a row or col index label
i = df.index.get_loc(row_label)
Trap: index.get_loc() returns an integer for a unique match. If not a unique match, it may return a slice or a Boolean mask (see the sketch below).
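A minimal sketch of the get_loc() trap just noted (labels invented for illustration):
idx = pd.Index(['a', 'b', 'b', 'c'])
i = idx.get_loc('a')  # 0 – unique match: an integer
s = idx.get_loc('b')  # slice(1, 3, None) – monotonic duplicates
m = pd.Index(['b', 'a', 'b']).get_loc('b')
                      # Boolean mask – non-monotonic duplicates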

Get integer position of rows that meet condition
a = np.where(df[col] >= 2)  # numpy array

Test if the row index values are unique/monotonic
if df.index.is_unique: pass  # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing

Find row index duplicates
if df.index.has_duplicates:
    print(df.index.duplicated())
Note: also similar for column label duplicates.

Working with cells

Getting a cell by row and column labels
value = df.at[row, col]
value = df.loc[row, col]
Note: .at[] is the fastest label-based scalar lookup
Note: .at[] does not take slices as an argument

Setting a cell by row and column labels
df.at[row, col] = value
df.loc[row, col] = value
Avoid: chaining in the form df[col][row]
Avoid: chaining in the form df[col].at[row]

Getting and slicing on labels
df = df.loc['row1':'row3', 'col1':'col3']
Note: the "to" on this slice is inclusive.

Setting a cross-section by labels
df.loc['A':'C', 'col1':'col3'] = np.nan
df.loc[1:2, 'col1':'col2'] = np.zeros((2, 2))
df.loc[1:2, 'A':'C'] = other.loc[1:2, 'A':'C']
Remember: inclusive "to" in the slice

Getting a cell by integer position
value = df.iat[9, 3]  # [row, col]
value = df.iat[len(df)-1, len(df.columns)-1]

Getting a range of cells by int position
df = df.iloc[2:4, 2:4]  # subset of the df
df = df.iloc[:5, :5]    # top left corner
s = df.iloc[5, :]       # return row as Series
df = df.iloc[5:6, :]    # returns row as row
Note: exclusive "to" – same as python list slicing.

Setting cell by integer position
df.iat[7, 8] = value

Setting cell range by integer position
df.iloc[0:3, 0:5] = value
df.iloc[1:3, 1:4] = np.ones((2, 3))
df.iloc[1:3, 1:4] = np.zeros((2, 3))
df.iloc[1:3, 1:4] = np.array([[1, 1, 1],
                              [2, 2, 2]])
Remember: exclusive-to in the slice

Summary: selection using the DataFrame index

Select columns with []
s = df[col]            # returns Series
df = df[[col]]         # returns DataFrame
df = df[[col1, col2]]  # select cols with list
df = df[idx]           # select cols with an index
df = df[s]             # select with col label Series
Note: scalar returns Series; list &c returns a DataFrame
Trap: with [] indexing, a label Series gets/sets columns, but a Boolean Series gets/sets rows

Select rows with .loc[] or .iloc[]
df = df.loc[df[col] > 0.5]  # Boolean Series
df = df.loc['label']        # single label
df = df.loc[container]      # lab list/Series
df = df.loc['from':'to']    # inclusive slice
df = df.loc[bs]             # Boolean Series
df = df.iloc[0]             # single integer
df = df.iloc[container]     # int list/Series
df = df.iloc[0:5]           # exclusive slice
Hint: always use .loc[] or .iloc[] when selecting rows
Hint: never use the deprecated .ix[] indexer

Select individual cells with .at[,] or .iat[,]
v = df.at[r, c]   # fast scalar label accessor
v = df.iat[r, c]  # fast scalar int accessor

Select a cross-section with .loc[,] or .iloc[,]
# r and c can be scalar, list, slice
xs = df.loc[r, c]   # label accessor (row, col)
xs = df.iloc[r, c]  # integer accessor

DataFrame indexing methods
v = df.get_value(r, c)      # get by row, col
df = df.set_value(r, c, v)  # set by row, col
df = df.xs(key, axis)       # get cross-section
df = df.filter(items, like, regex, axis)
df = df.select(crit, axis)
Caution: get_value(), set_value() and select() are deprecated; prefer .at[]/.iat[] and filter().
Note: the indexing attributes (.loc[], .iloc[], .at[], .iat[]) can be used to get and set values in the DataFrame
Note: the .loc[] and .iloc[] indexing attributes can accept python slice objects, but .at[] and .iat[] do not
Note: .loc[] can also accept Boolean Series arguments
Avoid: chaining in the form df[col_indexer][row_indexer]
Trap: label slices are inclusive, integer slices exclusive
Some index attributes and methods
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels         # num of index levels
idx = idx.astype(dtype) # change data type
b = idx.equals(o)       # check for equality
idx = idx.union(o)      # union of two indexes
i = idx.nunique()       # number unique labels
label = idx.min()       # minimum label
label = idx.max()       # maximum label

Views and copies
From the manual: Setting a copy can cause subtle errors. The rules about when a view on the data is returned are dependent on NumPy. Whenever an array of labels or a Boolean vector are involved in the indexing operation, the result will be a copy.
Hint: pandas will usually warn you if you are trying to set a copy. Take these warnings seriously.
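A minimal sketch of the copy problem described above: chained indexing may write to a temporary copy, while a single .loc[] call sets the original (data invented for illustration):
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': [4, 5, 6]})
df[df['a'] > 1]['b'] = 0      # trap: may only set a copy
df.loc[df['a'] > 1, 'b'] = 0  # correct: sets the original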

Joining/Combining DataFrames

Three ways to join DataFrames:
• concat – concatenate (or stack) two DataFrames side by side, or one on top of the other
• merge – using a database-like join operation on side-by-side DataFrames
• combine_first – splice two DataFrames together, choosing values from one over the other

Simple concatenation is often what you want
df = pd.concat([df1, df2], axis=0)  # top/bottom
df = pd.concat([df1, df2]).sort_index()  # t/b
df = pd.concat([df1, df2], axis=1)  # left/right
Trap: can end up with duplicate rows or cols
Note: concat has an ignore_index parameter
Note: if no axis is specified, defaults to top/bottom.

Append (another way of doing a top/bottom concat)
df = df1.append(df2)         # top/bottom
df = df1.append([df2, df3])  # top/bottom
Note: append also has an ignore_index parameter

Merge
df_new = pd.merge(left=df1, right=df2,
    how='outer', left_index=True,
    right_index=True)  # on indexes
df_new = pd.merge(left=df1, right=df2,
    how='left', left_on='col1',
    right_on='col2')   # on columns
df_new = df.merge(right=dfg, how='left',
    left_on='Group', right_index=True)
How: 'left', 'right', 'outer', 'inner' (where outer=union/all; inner=intersection)
Note: merge is both a pandas helper function and a DataFrame method
Note: DataFrame.merge() joins on common columns by default (if left and right are not specified)
Trap: when joining on column values, the indexes on the passed DataFrames are ignored.
Trap: many-to-many merges can result in an explosion of associated data (see the sketch at the end of this section).

Join on row indexes (another way of merging)
df = df1.join(other=df2, how='outer')
df = df1.join(other=df2, on=['a', 'b'],
    how='outer')
Note: DataFrame.join() joins on indexes by default.

Combine_first
df = df1.combine_first(other=df2)

# multi-combine with python reduce()
from functools import reduce
df = reduce(lambda x, y:
    x.combine_first(other=y),
    [df1, df2, df3, df4, df5])
combine_first uses the non-null values from df1. Null values in df1 are filled with values from the same location in df2. The index of the combined DataFrame will be the union of the indexes from df1 and df2.
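A minimal sketch of the many-to-many merge trap flagged above (data invented for illustration): two rows with the key 'x' on each side merge into four rows.
left = pd.DataFrame({'key': ['x', 'x'], 'l': [1, 2]})
right = pd.DataFrame({'key': ['x', 'x'], 'r': [3, 4]})
both = pd.merge(left, right, on='key')
print(len(both))  # 4 – the 2x2 cartesian product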
Groupby: Split-Apply-Combine

Grouping
gb = df.groupby(col)           # by one column
gb = df.groupby([col1, col2])  # by 2 cols
gb = df.groupby(level=0)       # row index groupby
gb = df.groupby(level=['a', 'b'])  # multi-index gb
print(gb.groups)
Note: groupby() returns a pandas groupby object
Note: the groupby object attribute .groups contains a dictionary mapping of the groups.
Trap: NaN values in the group key are automatically dropped – there will never be a NA group (see the sketch at the end of this section).

Applying an aggregating function
# apply to a single column ...
s = gb[col].sum()
s = gb[col].agg(np.sum)
# apply to every column in DataFrame ...
s = gb.count()
df_summary = gb.describe()
df_row_1s = gb.first()
Note: aggregating functions include mean, sum, size, count, std, var, sem (standard error of the mean), describe, first, last, min, max

Applying multiple aggregating functions
# apply multiple functions to one column
dfx = gb['col2'].agg([np.sum, np.mean])
# apply multiple fns to multiple cols
dfy = gb.agg({
    'cat': np.count_nonzero,
    'col1': [np.sum, np.mean, np.std],
    'col2': [np.min, np.max]
})
Note: gb['col2'] above is shorthand for df.groupby('cat')['col2'], without the need for regrouping.

Applying transform functions
# transform to group z-scores, which have
# a group mean of 0, and a std dev of 1.
zscore = lambda x: (x - x.mean()) / x.std()
dfz = gb.transform(zscore)

# replace missing data with the group mean
mean_r = lambda x: x.fillna(x.mean())
df = gb.transform(mean_r)  # entire DataFrame
df[col] = gb[col].transform(mean_r)  # one col
Note: can apply multiple transforming functions in a manner similar to multiple aggregating functions above.

Applying filtering functions
Filtering functions allow you to make selections based on whether each group meets specified criteria.
# select groups with more than 10 members
eleven = lambda x: (len(x['col1']) >= 11)
df11 = gb.filter(eleven)

Group by a row index (non-hierarchical index)
df = df.set_index(keys='cat')
s = df.groupby(level=0)[col].sum()
dfg = df.groupby(level=0).sum()
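A minimal sketch of the NaN-group-key trap noted above (data invented for illustration):
df = pd.DataFrame({'g': ['a', np.nan, 'a'],
                   'v': [1, 2, 3]})
print(df.groupby('g')['v'].sum())
# only group 'a' (sum = 4); the NaN-keyed row is silently dropped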

Pivot Tables: working with long and wide data

These features work with, and often create, hierarchical or multi-level Indexes (the pandas MultiIndex is powerful and complex).

Pivot, unstack, stack and melt
Pivot tables move from long format to wide format data.
# Let's start with data in long format
from io import StringIO
data = """Date,Pollster,State,Party,Est
13/03/2014, Newspoll, NSW, red, 25
13/03/2014, Newspoll, NSW, blue, 28
13/03/2014, Newspoll, Vic, red, 24
13/03/2014, Newspoll, Vic, blue, 23
13/03/2014, Galaxy, NSW, red, 23
13/03/2014, Galaxy, NSW, blue, 24
13/03/2014, Galaxy, Vic, red, 26
13/03/2014, Galaxy, Vic, blue, 25
13/03/2014, Galaxy, Qld, red, 21
13/03/2014, Galaxy, Qld, blue, 27"""
df = pd.read_csv(StringIO(data),
    header=0, skipinitialspace=True)

# pivot to wide format on 'Party' column
# 1st: set up a MultiIndex for other cols
df1 = df.set_index(['Date', 'Pollster',
    'State'])
# 2nd: do the pivot
wide1 = df1.pivot(columns='Party')

# unstack to wide format on State / Party
# 1st: MultiIndex all but the Values col
df2 = df.set_index(['Date', 'Pollster',
    'State', 'Party'])
# 2nd: unstack a column to go wide on it
wide2 = df2.unstack('State')
wide3 = df2.unstack()  # pop last index

# Use stack() to get back to long format
long1 = wide1.stack()
# Then use reset_index() to remove the
# MultiIndex.
long2 = long1.reset_index()

# Or melt() back to long format
# 1st: flatten the column index
wide1.columns = ['_'.join(col).strip()
    for col in wide1.columns.values]
# 2nd: remove the MultiIndex
wdf = wide1.reset_index()
# 3rd: melt away
long3 = pd.melt(wdf, value_vars=
    ['Est_blue', 'Est_red'],
    var_name='Party', id_vars=['Date',
    'Pollster', 'State'])
Note: see the documentation; there are many arguments to these methods.

Working with dates, times and their indexes

Dates and time – points, spans, deltas and offsets
Pandas has four date-time like objects that can be used for data in a Series or in an Index:

Concept  Data        Index
Point    Timestamp   DatetimeIndex
Span     Period      PeriodIndex
Delta    Timedelta   TimedeltaIndex
Offset   DateOffset  None

Timestamps
Timestamps represent a point in time.
t = pd.Timestamp('2019-01-01')
t = pd.Timestamp('2019-01-01 21:15:06')
t = pd.Timestamp('2019-01-01 21:15:06.7')
t = pd.Timestamp(year=2019, month=1,
    day=1, hour=21, minute=15, second=6,
    microsecond=700000,
    tz='Australia/Sydney')
# handles daylight savings time
Note: Timestamps can range from 1678 to 2261. (Check out pd.Timestamp.max and pd.Timestamp.min).
Note: the dtype is datetime64[ns] or datetime64[ns, tz]

DatetimeIndex – an Index of Timestamps
l = ['2019-04-01', '2019-04-02']
dti = pd.to_datetime(l)
l2 = ['01-01-2019', '01-02-2019']
dti2 = pd.to_datetime(l2, dayfirst=True)

A Series of Timestamps
l = ['2019-04-01', '2019-04-02']
s = pd.to_datetime(pd.Series(l))
Note: if we pass the pd.to_datetime() helper function a pandas Series, it returns a Series of Timestamps.

From non-standard strings to DatetimeIndex
t = ['09:08:55.7654-JAN092002',
     '15:42:02.6589-FEB082016']
s = pd.to_datetime(t,
    format="%H:%M:%S.%f-%b%d%Y")
Also: %B = full month name; %m = numeric month; %y = year without century; and more …
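By default, unparseable strings make pd.to_datetime() raise a ValueError; a brief sketch of the errors argument:
s = pd.to_datetime(['2019-04-01', 'not-a-date'],
    errors='coerce')  # bad input becomes NaT rather than raising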
A range of Timestamps in a DatetimeIndex
dti = pd.date_range('2015-01',
    periods=len(df), freq='M')  # end of month
dti = pd.date_range('2019-01-01',
    periods=365, freq='D')

Timestamps and DatetimeIndex from columns
# fake up a DataFrame
y = [2019, 2019, 2019]
m = [2, 3, 4]
d = [1, 2, 2]
df = pd.DataFrame({'year': y, 'month': m,
    'day': d})

# do the magic
cols = ['year', 'month', 'day']
df.index = pd.to_datetime(df[cols])
df['TS'] = pd.to_datetime(df[cols])
Note: to assemble dates from columns, pd.to_datetime() requires the columns to be named year, month and day (hour, minute and second may be added).

From DatetimeIndex to Python datetime objects
dti = pd.DatetimeIndex(pd.date_range(
    start='1/1/2011', periods=4, freq='M'))
s = pd.Series([1, 2, 3, 4], index=dti)
a = dti.to_pydatetime()      # numpy array
a = s.index.to_pydatetime()  # numpy array

From Timestamps to Python dates or times
df['py_date'] = [x.date() for x in df['TS']]
df['py_time'] = [x.time() for x in df['TS']]
Note: converts to datetime.date or datetime.time, but does not convert to datetime.datetime.

Periods
Periods represent a time-span.
p = pd.Period('2019', freq='Y')
p = pd.Period('2019-01', freq='M')
p = pd.Period('2019-01-01', freq='D')
p = pd.Period('2019-01-01 21:15:06', freq='S')

From Timestamps to Periods in a Series
l = ['2019-04-01', '2019-04-02']
ts = pd.to_datetime(pd.Series(l))
ps = ts.dt.to_period(freq='D')
Note: the .dt accessor in the last line

From a DatetimeIndex to a PeriodIndex
l = ['2019-04-01', '2019-04-02']
dti = pd.to_datetime(l)
pi = dti.to_period(freq='D')
Hint: unless you are working in less than seconds, prefer PeriodIndex over DatetimeIndex.

A range of Periods in a PeriodIndex
pi = pd.period_range('2015-01',
    periods=len(df), freq='M')
pi = pd.period_range('2019-01-01',
    periods=365, freq='D')

Working with a PeriodIndex
pi = pd.period_range('1960-01', '2015-12',
    freq='M')
a = pi.values       # numpy array of integers
p = pi.tolist()     # python list of Periods
sp = pd.Series(pi)  # pandas Series of Periods
s = pd.Series(pi).astype('str')
l = pd.Series(pi).astype('str').tolist()

From DatetimeIndex to PeriodIndex and back
df = pd.DataFrame(np.random.randn(20, 3))
df.index = pd.date_range('2015-01-01',
    periods=len(df), freq='M')
dfp = df.to_period(freq='M')
dft = dfp.to_timestamp()
Note: from period to timestamp defaults to the point in time at the start of the period.

The tail of a time-series DataFrame
df = df.last("5M")  # the last five months

Period frequency constants (not a complete list)
Name              Description
U                 Microsecond
L                 Millisecond
S                 Second
T                 Minute
H                 Hour
D                 Calendar day
B                 Business day
W-{MON, TUE, …}   Week ending on …
MS                Calendar start of month
M                 Calendar end of month
QS-{JAN, FEB, …}  Quarter start with year starting (QS – December)
Q-{JAN, FEB, …}   Quarter end with year ending (Q – December)
AS-{JAN, FEB, …}  Year start (AS – December)
A-{JAN, FEB, …}   Year end (A – December)

Deltas
When we subtract a Timestamp from another Timestamp, we get a Timedelta object in pandas.
ts = pd.Series(pd.date_range('2019-01-01',
    periods=31, freq='D'))
delta_series = ts.diff(1)

Converting a Timedelta to a numeric
l = ['2019-04-01', '2019-09-03']
s = pd.to_datetime(pd.Series(l))
delta = s[1] - s[0]

day = pd.Timedelta(days=1)
delta_num = delta / day
minute = pd.Timedelta(minutes=1)
delta_num2 = delta / minute
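A Timedelta also exposes direct numeric conversions; a brief sketch continuing the example above:
secs = delta.total_seconds()  # float number of seconds
days = delta.days             # integer day component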
Offsets
Subtracting a Period from a Period gives an offset.
offset = pd.DateOffset(days=4)
s = pd.Series(pd.period_range('2019-01-01',
    periods=365, freq='D'))
offset2 = s[4] - s[0]
s = s.diff(1)  # s is now a series of offsets

Converting an Offset to a numeric
x = offset.n  # an individual offset
t = s.apply(lambda z: np.nan if z is np.nan
    else z.n)  # convert a Series

Upsampling
# fake up some quarterly count data
pi = pd.period_range('1960Q1',
    periods=220, freq='Q')
df = pd.DataFrame(np.random.randint(low=0,
    high=999, size=(len(pi), 5)), index=pi)

# which we can upsample to monthly count data
dfm = df.resample('M').asfreq()  # with NAs!
dfm2 = (df.resample('M').asfreq().fillna(0)
    .rolling(window=3, min_periods=3).mean()
    .bfill(limit=2))  # assuming no NA data
Note: df.resample(arguments).aggregating_function(). There are lots of options here. See the manual.

Downsampling
# downsample from monthly to quarterly counts
dfq = dfm.resample('Q').sum()
Note: df.resample(arguments).aggregating_function().

Time zones
t = ['2015-06-30 00:00:00',
     '2015-12-31 00:00:00']
dti = pd.to_datetime(t
    ).tz_localize('Australia/Canberra')
dti = dti.tz_convert('UTC')
ts = pd.Timestamp('now',
    tz='Europe/London')
Note: by default, Timestamps are created without time zone information.

Row selection with a time-series index
# start with some play data
n = 48
df = pd.DataFrame(np.random.randint(low=0,
    high=999, size=(n, 5)),
    index=pd.period_range('2015-01',
    periods=n, freq='M'))

february_selector = (df.index.month == 2)
february_data = df[february_selector]

q1_data = df[(df.index.month >= 1) &
    (df.index.month <= 3)]

mayornov_data = df[(df.index.month == 5) |
    (df.index.month == 11)]

year_totals = df.groupby(df.index.year).sum()
Also: year, month, day [of month], hour, minute, second, dayofweek [numbered from 0, where the week starts on Monday], weekofyear [numbered from 1], dayofyear [from 1], …

The Series.dt accessor attribute
DataFrame columns that contain datetime-like objects can be manipulated with the .dt accessor attribute
t = ['2012-04-14 04:06:56.307000',
     '2011-05-14 06:14:24.457000',
     '2010-06-14 08:23:07.520000']

# a Series of time stamps
s = pd.Series(pd.to_datetime(t))
print(s.dtype)      # datetime64[ns]
print(s.dt.second)  # 56, 24, 7
print(s.dt.month)   # 4, 5, 6

# a Series of time periods
s = pd.Series(pd.PeriodIndex(t, freq='Q'))
print(s.dtype)       # period[Q-DEC]
print(s.dt.quarter)  # 2, 2, 2
print(s.dt.year)     # 2012, 2011, 2010

Plotting from the DataFrame

Import matplotlib, choose a matplotlib style
import matplotlib.pyplot as plt
print(plt.style.available)
plt.style.use('ggplot')

Fake up some data (which we reuse repeatedly)
a = np.random.normal(0, 1, 999)
b = np.random.normal(1, 2, 999)
c = np.random.normal(2, 3, 999)
df = pd.DataFrame([a, b, c]).T
df.columns = ['A', 'B', 'C']

Line plot
df1 = df.cumsum()
ax = df1.plot()

# from here down – standard plot output
ax.set_title('Title')
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')

fig = ax.figure
fig.set_size_inches(8, 3)
fig.tight_layout(pad=1)
fig.savefig('filename.png', dpi=125)

plt.close()

Box plot
ax = df.plot.box(vert=False)
# followed by the standard plot code as above

ax = df.boxplot(column='c1', by='c2')  # grouped box plot

Histogram
ax = df['A'].plot.hist(bins=20)
# followed by the standard plot code as above

Multiple histograms (overlapping or stacked)
ax = df.plot.hist(bins=25, alpha=0.5)  # or ...
ax = df.plot.hist(bins=25, stacked=True)
# followed by the standard plot code as above

Bar plots
bins = np.linspace(-10, 15, 26)
binned = pd.DataFrame()
for x in df.columns:
    y = pd.cut(df[x], bins, labels=bins[:-1])
    y = y.value_counts().sort_index()
    binned = pd.concat([binned, y], axis=1)
binned.index = binned.index.astype('float')
binned.index += (np.diff(bins) / 2.0)
ax = binned.plot.bar(stacked=False,
    width=0.8)  # for bar width
# followed by the standard plot code as above

Horizontal bars
ax = binned['A'][(binned.index >= -4) &
    (binned.index <= 4)].plot.barh()
# followed by the standard plot code as above

Density plot
ax = df.plot.kde()
# followed by the standard plot code as above

Scatter plot
ax = df.plot.scatter(x='A', y='C')
# followed by the standard plot code as above

Pie chart
s = pd.Series(data=[10, 20, 30],
    index=['dogs', 'cats', 'birds'])
ax = s.plot.pie(autopct='%.1f')

# followed by the standard plot output ...
ax.set_title('Pie Chart')
ax.set_aspect(1)   # make it round
ax.set_ylabel('')  # remove default

fig = ax.figure
fig.set_size_inches(8, 3)
fig.savefig('filename.png', dpi=125)

plt.close(fig)

Change the range plotted
ax.set_xlim([-5, 5])

# for some white space on the chart ...
lower, upper = ax.get_ylim()
ax.set_ylim([lower - 1, upper + 1])

Add a footnote to the chart
# after the fig.tight_layout(pad=1) above
fig.text(0.99, 0.01, 'Footnote',
    ha='right', va='bottom',
    fontsize='x-small',
    fontstyle='italic', color='#999999')

A line and bar on the same chart
In matplotlib, bar charts visualise categorical or discrete data. Line charts visualise continuous data. This makes it hard to get bars and lines on the same chart. Typically, combined charts either have too many labels, and/or the lines and bars are misaligned or missing. You need to trick matplotlib a bit … pandas makes this tricking easier.

# start with fake percentage growth data
s = pd.Series(np.random.normal(
    1.02, 0.015, 40))
s = s.cumprod()
dfg = (pd.concat([s / s.shift(1),
    s / s.shift(4)], axis=1) * 100) - 100
dfg.columns = ['Quarter', 'Annual']
dfg.index = pd.period_range('2010-Q1',
    periods=len(dfg), freq='Q')

# reindex with integers from 0; keep old
old = dfg.index
dfg.index = range(len(dfg))

# plot the line from pandas
ax = dfg['Annual'].plot(color='blue',
    label='Year/Year Growth')

# plot the bars from pandas
dfg['Quarter'].plot.bar(ax=ax,
    label='Q/Q Growth', width=0.8)

# relabel the x-axis more appropriately
ticks = dfg.index[(dfg.index % 4) == 0]
labs = pd.Series(old[ticks]).astype('str')
ax.set_xticks(ticks)
ax.set_xticklabels(labs.str.replace('Q',
    '\nQ'), rotation=0)

# fix the range of the x-axis … skip 1st
ax.set_xlim([0.5, len(dfg) - 0.5])

# add the legend
l = ax.legend(loc='best', fontsize='small')

# finish off and plot in the usual manner
ax.set_title('Fake Growth Data')
ax.set_xlabel('Quarter')
ax.set_ylabel('Per cent')

fig = ax.figure
fig.set_size_inches(8, 3)
fig.tight_layout(pad=1)
fig.savefig('filename.png', dpi=125)

plt.close()

Working with missing and non-finite data

Working with missing data
Pandas uses the not-a-number construct (np.nan and float('nan')) to indicate missing data. The Python None can arise in data as well. It is also treated as missing data, as is the pandas not-a-time construct (pandas.NaT).

Missing data in a Series
s = pd.Series([8, None, float('nan'), np.nan])
           # [8, NaN, NaN, NaN]
s.isna()    # [False, True, True, True]
s.notna()   # [True, False, False, False]
s.fillna(0) # [8, 0, 0, 0]

Missing data in a DataFrame
df = df.dropna()           # drop all rows with NaN
df = df.dropna(axis=1)     # same for cols
df = df.dropna(how='all')  # drop all-NaN rows
df = df.dropna(thresh=2)   # keep rows with 2+ non-NA values
# only drop row if NaN in a specified col
df = df.dropna(subset=['col'])

Recoding missing data
df = df.fillna(0)      # np.nan → 0
s = df[col].fillna(0)  # np.nan → 0
df = df.replace(r'\s+', np.nan,
    regex=True)        # white space → np.nan

Non-finite numbers
With floating point numbers, pandas provides for positive and negative infinity.
s = pd.Series([float('inf'), float('-inf'),
    np.inf, -np.inf])
Pandas treats integer comparisons with plus or minus infinity as expected.

Testing for finite numbers
(using the data from the previous example)
b = np.isfinite(s)

Working with Categorical Data

Categorical data
The pandas Series has an R factors-like data type for encoding categorical data.
s = pd.Series(['a', 'b', 'a', 'c', 'b', 'd', 'a'],
    dtype='category')
df['Cat'] = df['Group'].astype('category')
Note: the key here is to specify the "category" data type.
Note: categories will be ordered on creation if they are sortable. This can be turned off. See ordering below.
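For explicit control over the category set and its ordering, a minimal sketch using CategoricalDtype (the category values are invented for illustration):
from pandas.api.types import CategoricalDtype
sizes = CategoricalDtype(['small', 'medium', 'large'],
    ordered=True)
s = pd.Series(['small', 'large', 'medium']).astype(sizes)
print(s.min())  # 'small' – ordered categories support min/max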
Convert back to the original data type
s = pd.Series(['a', 'b', 'a', 'c', 'b', 'd', 'a'],
    dtype='category')
s = s.astype('str')

Ordering, reordering and sorting
s = pd.Series(list('abc'), dtype='category')
print(s.cat.ordered)
s = s.cat.reorder_categories(['b', 'c', 'a'])
s = s.sort_values()
s = s.cat.as_unordered()
Trap: category must be ordered for it to be sorted

Renaming categories
s = pd.Series(list('abc'), dtype='category')
s.cat.categories = [1, 2, 3]  # in place
s = s.cat.rename_categories([4, 5, 6])
# using a comprehension ...
s.cat.categories = ['Group ' + str(i)
    for i in s.cat.categories]
Trap: categories must be uniquely named

Adding new categories
s = s.cat.add_categories([7, 8, 9])

Removing categories
s = s.cat.remove_categories([7, 9])
s = s.cat.remove_unused_categories()

Working with strings

Working with strings
# quickly let's fake up some text data
df = pd.DataFrame(("Lorem ipsum dolor sit "
    "amet, consectetur adipiscing elit, sed do "
    "eiusmod tempor incididunt ut labore et "
    "dolore magna aliqua").split(), columns=['t'])

# assume that df[col] is a series of strings
s1 = df['t'].str.lower()
s2 = df['t'].str.upper()
s3 = df['t'].str.len()
df2 = df['t'].str.split('t', expand=True)

# pandas strings are just like Python strings
s4 = df['t'] + '-suffix'  # concatenate
s5 = df['t'] * 5          # duplicate
Most python string functions are replicated in the pandas DataFrame and Series objects.

Text matching and regular expressions (regex)
s6 = df['t'].str.match('[sedo]+')
s7 = df['t'].str.contains('[em]')
s8 = df['t'].str.startswith('do')  # no regex
s9 = df['t'].str.endswith('.')     # no regex
s10 = df['t'].str.replace('old', 'new')
s11 = df['t'].str.extract('(pattern)')
Note: pandas has many more methods.

Basic Statistics

Summary statistics
s = df[col].describe()
df1 = df.describe()

DataFrame – key stats methods
df.corr()  # pairwise correlation cols
df.cov()   # pairwise covariance cols
df.kurt()  # kurtosis over cols (def)
df.mad()   # mean absolute deviation
df.sem()   # standard error of mean
df.var()   # variance over cols (def)

Value counts
s = df[col].value_counts()

Cross-tabulation (frequency count)
ct = pd.crosstab(index=df['a'],
    columns=df['b'])

Quantiles and ranking
quants = [0.05, 0.25, 0.5, 0.75, 0.95]
q = df.quantile(quants)
r = df.rank()

Histogram binning
count, bins = np.histogram(df[col])
count, bins = np.histogram(df[col], bins=5)
count, bins = np.histogram(df[col],
    bins=[-3, -2, -1, 0, 1, 2, 3, 4])

Regression
import statsmodels.formula.api as sm
result = sm.ols(formula="col1 ~ col2 + col3",
    data=df).fit()
print(result.params)
print(result.summary())

Simple smoothing example using a rolling apply
k3x5 = np.array([1, 2, 3, 3, 3, 2, 1]) / 15.0
s = df['A'].rolling(window=len(k3x5),
    min_periods=len(k3x5),
    center=True).apply(
    func=lambda x: (x * k3x5).sum())
# fix the missing end data ... unsmoothed
s = df['A'].where(s.isna(), other=s)

Cautionary note

This cheat sheet was cobbled together by tireless bots roaming the dark recesses of the Internet seeking ursine and anguine myths from a fabled land of milk and honey where it is rumoured pandas and pythons gambol together. There is no guarantee the narratives were captured and transcribed accurately. You use these notes at your own risk. You have been warned. I will not be held responsible for whatever happens to you and those you love once your eyes begin to see what is written here.

Errors: If you find any errors, please email me at [email protected]; (but please do not correct my use of Australian-English spelling conventions).

Version 14 December 2019 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
