Pandas DataFrame Notes
Check which version of pandas you are using
print(pd.__version__)
This cheat sheet was written for pandas version 0.25.
It assumes you are using Python 3 and the usual import
conventions: import numpy as np; import pandas as pd.

Load a DataFrame from a CSV file
df = pd.read_csv('file.csv', header=0,
    index_col=0, quotechar='"', sep=':',
    na_values=['na', '-', '.', ''])
Note: refer to the pandas docs for all arguments
[Figure: the conceptual model – a DataFrame is a table
of columns, each column a Series of data, all sharing a
common row index]
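A minimal sketch of that model (standard pandas
behaviour): each column of a DataFrame is a Series,
and all columns align on the shared row index.
s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([4.0, 5.0, 6.0])
df = pd.DataFrame({'a': s1, 'b': s2})
s = df['a'] # a single column is a Series
idx = df.index # the shared row index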
Maths on the whole DataFrame (not a complete list)
df = df.abs() # absolute values
df = df.add(o) # add df, Series or value
s = df.count() # non NA/null values
df = df.cummax() # (cols default axis)
df = df.cummin() # (cols default axis)
df = df.cumsum() # (cols default axis)
df = df.diff() # 1st diff (col def axis)
df = df.div(o) # div by df, Series, value
df = df.dot(o) # matrix dot product
s = df.max() # max of axis (col def)
s = df.mean() # mean (col default axis)
s = df.median() # median (col default)
s = df.min() # min of axis (col def)
df = df.mul(o) # mul by df Series val
s = df.sum() # sum axis (cols default)
df = df.where(df > 0.5, other=np.nan)
Note: methods returning a Series default to work on cols

Adding new columns to a DataFrame
df['new_col'] = range(len(df))
df['new_col'] = np.repeat(np.nan, len(df))
df['random'] = np.random.rand(len(df))
df['index_as_col'] = df.index
df1[['b', 'c']] = df2[['e', 'f']]
Trap: when adding a new column, only items from the
new column Series that have a corresponding index in
the DataFrame will be added. The index of the receiving
DataFrame is not extended to accommodate all of the
new Series (see the sketch below).
Trap: when adding a python list or numpy array, the
column will be added by integer position.

Add a mismatched column with an extended index
df = pd.DataFrame([1, 2, 3], index=[1, 2, 3])
s = pd.Series([2, 3, 4], index=[2, 3, 4])
df = df.reindex(df.index.union(s.index))
df['s'] = s # with NaNs where no data
Note: assumes unique index values
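A minimal sketch of the alignment trap above
(made-up data):
df = pd.DataFrame({'a': [1, 2, 3]}) # index 0, 1, 2
s = pd.Series([9, 9, 9], index=[2, 3, 4])
df['b'] = s # only index 2 aligns; rows 0 and 1
            # get NaN; the values of s at index
            # labels 3 and 4 are silently dropped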
Select/filter rows/cols based on index label values
df = df.filter(items=['a', 'b']) # by col
df = df.filter(items=[5], axis=0) # by row
df = df.filter(like='x') # keep x in col
df = df.filter(regex='x') # regex in col
df = df.select(lambda x: not x%5) # 5th rows
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows
Note: select is deprecated in pandas 0.25; prefer
df.loc[] with a Boolean array.

Dropping (deleting) columns (mostly by label)
df = df.drop(col1, axis=1)
df = df.drop([col1, col2], axis=1)
del df[col] # even classic python works
df = df.drop(df.columns[0], axis=1) # first
df = df.drop(df.columns[-1:], axis=1) # last
Swap column contents
df[['B', 'A']] = df[['A', 'B']]

Vectorised arithmetic on columns
df['proportion'] = df['count'] / df['total']
df['percent'] = df['proportion'] * 100.0

Apply numpy mathematical functions to columns
df['log_data'] = np.log(df[col])
Note: many more numpy mathematical functions exist
Hint: prefer pandas maths over numpy where you can.

Set column values based on criteria
df[b] = df[a].where(df[a]>0, other=0)
df[d] = df[a].where(df.b!=0, other=df.c)
Note: where other can be a Series or a scalar

Data type conversions
s = df[col].astype('float')
s = df[col].astype('int')
s = pd.to_numeric(df[col])
s = df[col].astype('str')
a = df[col].values # numpy array
l = df[col].tolist() # python list
Trap: index lost in conversion from Series to array or list

Common column-wide methods/attributes
value = df[col].dtype # type of data
value = df[col].size # col dimensions
value = df[col].count() # non-NA count
value = df[col].sum()
value = df[col].prod()
value = df[col].min()
value = df[col].max()
value = df[col].mean() # also median()
value = df[col].cov(df[other_col])
s = df[col].describe()
s = df[col].value_counts()

Multiply every column in DataFrame by a Series
df = df.mul(s, axis=0) # on matched rows
Note: also add, sub, div, etc.

Selecting columns with .loc, .iloc
df = df.loc[:, 'col1':'col2'] # inclusive
df = df.iloc[:, 0:2] # exclusive

Get the integer position of a column index label
i = df.columns.get_loc('col_name')

Test if column index values are unique/monotonic
if df.columns.is_unique: pass # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing

Mapping a DataFrame column or Series
mapping = pd.Series(['red', 'green', 'blue'],
    index=['r', 'g', 'b'])
s = pd.Series(['r', 'g', 'r', 'b']).map(mapping)
# s contains: ['red', 'green', 'red', 'blue']

m = pd.Series([True, False], index=['Y', 'N'])
df = pd.DataFrame(np.random.choice(list('YN'),
    500, replace=True), columns=[col])
df[col] = df[col].map(m)
Note: useful for decoding data before plotting
Note: sometimes referred to as a lookup function
Note: indexes can also be mapped if needed.

Find the largest and smallest values in a column
s = df[col].nlargest(n)
s = df[col].nsmallest(n)

Sorting the columns of a DataFrame
df = df.sort_index(axis=1, ascending=False)
Note: the column labels need to be comparable
Working with rows

Get the row index and labels
idx = df.index # get row index
label = df.index[0] # first row label
label = df.index[-1] # last row label
l = df.index.tolist() # get as a python list
a = df.index.values # get as numpy array

Change the (row) index
df.index = idx # new ad hoc index
df = df.set_index('A') # index set to col A
df = df.set_index(['A', 'B']) # MultiIndex
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df.index = range(len(df)) # set with list
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1', 'r2', 'etc'])

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert row(s) to a DataFrame and then append.
Both DataFrames must have the same column labels.

Append a row of column totals to a DataFrame
df.loc['Total'] = df.sum()
Note: best if all columns are numeric

Select a slice of rows by integer position
[inclusive-from : exclusive-to [: step]]
start is 0; end is len(df)
df = df.iloc[:] # copy entire DataFrame
df = df.iloc[0:2] # rows 0 and 1
df = df.iloc[2:3] # row 2 (the third row)
df = df.iloc[-1:] # the last row
df = df.iloc[:-1] # all but the last row
df = df.iloc[::2] # every 2nd row (0 2 ..)
Hint: while the .iloc[] accessor may not be needed
above, its use makes for more readable code.

Select a slice of rows by label/index
df = df.loc['a':'c'] # rows 'a' through 'c'
Note: [inclusive-from : inclusive-to [: step]]
Hint: while the .loc[] accessor may not be needed
above, its use makes for more readable code.

Sorting the rows of a DataFrame by the row index
df = df.sort_index(ascending=False)

Sorting DataFrame rows based on column values
df = df.sort_values(by=df.columns[0],
    ascending=False)
df = df.sort_values(by=[col1, col2])

Random selection of rows
import random
k = 20 # pick a number
selection = random.sample(range(len(df)), k)
df_sample = df.iloc[selection, :] # get copy
Note: this randomly selected sample is not sorted

Iterating over DataFrame rows
for (index, row) in df.iterrows(): # pass
Trap: row data may be coerced to the same data type
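A sketch of a common alternative that avoids the
coercion trap (itertuples keeps the column dtypes):
for row in df.itertuples():
    pass # row is a namedtuple: access fields as
         # row.Index, then one field per column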
Selecting rows using isin over multiple columns
# fake up some data
data = {1: [1, 2, 3], 2: [1, 4, 9], 3: [1, 8, 27]}
df = pd.DataFrame(data)

# multi-column isin
lf = {1: [1, 3], 3: [8, 27]} # look for
f = df.loc[df[list(lf)].isin(lf).all(axis=1)]

Selecting rows using an index
idx = df[df[col] >= 2].index
print(df.loc[idx])

Get integer position of rows that meet condition
a = np.where(df[col] >= 2) # numpy array

Test if the row index values are unique/monotonic
if df.index.is_unique: pass # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing

Find row index duplicates
if df.index.has_duplicates:
    print(df.index.duplicated())
Note: also similar for column label duplicates.
Working with cells

Summary: selection using the DataFrame index
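A minimal sketch for the two sections above (standard
pandas accessors; made-up labels):
# single cells – fast scalar access
value = df.at['row_label', 'col_label'] # by label
value = df.iat[0, 0] # by integer position
df.at['row_label', 'col_label'] = 99 # set a cell
# selection using the index, in outline
s = df['col'] # one column, as a Series
df2 = df[['col1', 'col2']] # columns by label
df2 = df.loc['r1':'r2', 'c1':'c2'] # labels (inclusive)
df2 = df.iloc[0:2, 0:2] # positions (exclusive-to)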
Joining/Combining DataFrames

Append (another way of doing a top/bottom concat)
df = df1.append(df2) # top/bottom
df = df1.append([df2, df3]) # top/bottom
Note: append also has an ignore_index parameter

Merge
df_new = pd.merge(left=df1, right=df2,
    how='outer', left_index=True,
    right_index=True) # on indexes
df_new = pd.merge(left=df1, right=df2,
    how='left', left_on='col1',
    right_on='col2') # on columns

df_new = df.merge(right=dfg, how='left',
    left_on='Group', right_index=True)
How: 'left', 'right', 'outer', 'inner' (where outer=union/all;
inner=intersection)
Note: merge is both a pandas helper function and a
DataFrame method
Note: DataFrame.merge() joins on common columns by
default (if left_on and right_on are not specified)
Trap: when joining on column values, the indexes on
the passed DataFrames are ignored.
Trap: many-to-many merges can result in an explosion
of associated data (see the sketch at the end of this
section).

Join on row indexes (another way of merging)
df = df1.join(other=df2, how='outer')
df = df1.join(other=df2, on=['a', 'b'],
    how='outer')
Note: DataFrame.join() joins on indexes by default.

Combine_first
df = df1.combine_first(other=df2)

# multi-combine with python reduce()
from functools import reduce # needed in Python 3
df = reduce(lambda x, y:
    x.combine_first(other=y),
    [df1, df2, df3, df4, df5])
combine_first() uses the non-null values from df1. Null
values in df1 are filled with values from the same
location in df2. The index of the combined DataFrame
will be the union of the indexes from df1 and df2.

Groupby: Split-Apply-Combine
gb = df.groupby('cat') # the grouping assumed below

# apply to every column in DataFrame ...
s = gb.count()
df_summary = gb.describe()
df_row_1s = gb.first()
Note: aggregating functions include mean, sum, size,
count, std, var, sem (standard error of the mean),
describe, first, last, min, max

Applying multiple aggregating functions
# apply multiple functions to one column
dfx = gb['col2'].agg([np.sum, np.mean])
# apply multiple fns to multiple cols
dfy = gb.agg({
    'cat': np.count_nonzero,
    'col1': [np.sum, np.mean, np.std],
    'col2': [np.min, np.max]
})
Note: gb['col2'] above is shorthand for
df.groupby('cat')['col2'], without the need for regrouping.

Applying transform functions
# transform to group z-scores, which have
# a group mean of 0, and a std dev of 1.
zscore = lambda x: (x - x.mean()) / x.std()
dfz = gb.transform(zscore)

# replace missing data with the group mean
mean_r = lambda x: x.fillna(x.mean())
df = gb.transform(mean_r) # entire DataFrame
df[col] = gb[col].transform(mean_r) # one col
Note: can apply multiple transforming functions in a
manner similar to multiple aggregating functions above.

Applying filtering functions
Filtering functions allow you to make selections based
on whether each group meets specified criteria.
# select groups with more than 10 members
eleven = lambda x: (len(x['col1']) >= 11)
df11 = gb.filter(eleven)

Group by a row index (non-hierarchical index)
df = df.set_index(keys='cat')
s = df.groupby(level=0)[col].sum()
dfg = df.groupby(level=0).sum()
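A minimal sketch of the many-to-many trap noted
under Merge above (made-up keys):
left = pd.DataFrame({'k': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'k': ['a', 'a'], 'y': [3, 4]})
both = pd.merge(left, right, on='k')
# every left 'a' pairs with every right 'a':
# 2 x 2 matching rows -> 4 rows in the result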
Pivot Tables: working with long and wide data
These features work with, and often create, hierarchical
or multi-level indexes (the pandas MultiIndex is powerful
and complex); a minimal sketch follows the dates
introduction below.

Working with dates, times and their indexes

Dates and time – points, spans, deltas and offsets
Pandas has four date-time like objects that can be used
for data in a Series or in an Index: Timestamp (a point in
time), Period (a time-span), Timedelta (a duration) and
DateOffset (a calendar offset).

# do the magic: assemble datetimes from columns
cols = ['year', 'month', 'day']
df.index = pd.to_datetime(df[cols])
df['TS'] = pd.to_datetime(df[cols])
Note: when assembling datetimes from a DataFrame,
pd.to_datetime() needs the component columns to be
named year, month and day (hour, minute, etc. optional).
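A minimal long-to-wide sketch for the pivot-table
section above, using made-up column names:
long = pd.DataFrame({
    'region': ['N', 'N', 'S', 'S'],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'sales': [10, 11, 20, 21]})
wide = long.pivot_table(index='region',
    columns='month', values='sales',
    aggfunc='sum') # one row per region,
                   # one column per month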
From DatetimeIndex to Python datetime objects
dti = pd.DatetimeIndex(pd.date_range(
    start='1/1/2011', periods=4, freq='M'))
s = pd.Series([1, 2, 3, 4], index=dti)
a = dti.to_pydatetime() # numpy array
a = s.index.to_pydatetime() # numpy array

From Timestamps to Python dates or times
df['py_date'] = [x.date() for x in df['TS']]
df['py_time'] = [x.time() for x in df['TS']]
Note: converts to datetime.date or datetime.time, but
does not convert to datetime.datetime.

Periods
Periods represent a time-span.
p = pd.Period('2019', freq='Y')
p = pd.Period('2019-01', freq='M')
p = pd.Period('2019-01-01', freq='D')
p = pd.Period('2019-01-01 21:15:06', freq='S')

From Timestamps to Periods in a Series
l = ['2019-04-01', '2019-04-02']
ts = pd.to_datetime(pd.Series(l))
ps = ts.dt.to_period(freq='D')
Note: the .dt accessor does the conversion in the last line

From a DatetimeIndex to a PeriodIndex
l = ['2019-04-01', '2019-04-02']
dti = pd.to_datetime(l)
pi = dti.to_period(freq='D')
Hint: unless you are working in less than seconds,
prefer PeriodIndex over DatetimeIndex.

A range of Periods in a PeriodIndex
pi = pd.period_range('2015-01',
    periods=len(df), freq='M')
pi = pd.period_range('2019-01-01',
    periods=365, freq='D')

Working with a PeriodIndex
pi = pd.period_range('1960-01', '2015-12',
    freq='M')
a = pi.values # numpy array (of Periods)
p = pi.tolist() # python list of Periods
sp = pd.Series(pi) # pandas Series of Periods
s = pd.Series(pi).astype('str')
l = pd.Series(pi).astype('str').tolist()

From DatetimeIndex to PeriodIndex and back
df = pd.DataFrame(np.random.randn(20, 3))
df.index = pd.date_range('2015-01-01',
    periods=len(df), freq='M')
dfp = df.to_period(freq='M')
dft = dfp.to_timestamp()
Note: from period to timestamp defaults to the point in
time at the start of the period.

The tail of a time-series DataFrame
df = df.last("5M") # the last five months

Period frequency constants (not a complete list)
Name              Description
U                 Microsecond
L                 Millisecond
S                 Second
T                 Minute
H                 Hour
D                 Calendar day
B                 Business day
W-{MON, TUE, …}   Week ending on …
MS                Calendar start of month
M                 Calendar end of month
QS-{JAN, FEB, …}  Quarter start with year starting (QS – December)
Q-{JAN, FEB, …}   Quarter end with year ending (Q – December)
AS-{JAN, FEB, …}  Year start (AS – December)
A-{JAN, FEB, …}   Year end (A – December)

Deltas
When we subtract a Timestamp from another
Timestamp, we get a Timedelta object in pandas.
ts = pd.Series(pd.date_range('2019-01-01',
    periods=31, freq='D'))
delta_series = ts.diff(1)

Converting a Timedelta to a numeric
l = ['2019-04-01', '2019-09-03']
s = pd.to_datetime(pd.Series(l))
delta = s[1] - s[0]

day = pd.Timedelta(days=1)
delta_num = delta / day
minute = pd.Timedelta(minutes=1)
delta_num2 = delta / minute

Offsets
Subtracting a Period from a Period gives an offset.
offset = pd.DateOffset(days=4)
s = pd.Series(pd.period_range('2019-01-01',
    periods=365, freq='D'))
offset2 = s[4] - s[0]
s = s.diff(1) # s is now a series of offsets

Converting an Offset to a numeric
x = offset.n # an individual offset
t = s.apply(lambda z: np.nan if z is np.nan
    else z.n) # convert a Series

Upsampling
# fake up some quarterly count data
pi = pd.period_range('1960Q1',
    periods=220, freq='Q')
df = pd.DataFrame(np.random.randint(low=0,
    high=999, size=(len(pi), 5)), index=pi)

# which we can upsample to monthly count data
dfm = df.resample('M').asfreq() # with NAs!
dfm2 = (df.resample('M').asfreq().fillna(0)
    .rolling(window=3, min_periods=3).mean()
    .bfill(limit=2)) # assuming no NA data
Note: df.resample(arguments).aggregating_function().
There are lots of options here. See the manual.
Downsampling
# downsample from monthly to quarterly counts
dfq = dfm.resample('Q').sum()
Note: df.resample(arguments).aggregating_function().

Time zones
t = ['2015-06-30 00:00:00',
    '2015-12-31 00:00:00']
dti = pd.to_datetime(t
    ).tz_localize('Australia/Canberra')
dti = dti.tz_convert('UTC')
ts = pd.Timestamp('now',
    tz='Europe/London')
Note: by default, Timestamps are created without time
zone information.

Plotting from the DataFrame

Import matplotlib, choose a matplotlib style
import matplotlib.pyplot as plt
print(plt.style.available)
plt.style.use('ggplot')

Fake up some data (which we reuse repeatedly)
a = np.random.normal(0, 1, 999)
b = np.random.normal(1, 2, 999)
c = np.random.normal(2, 3, 999)
df = pd.DataFrame([a, b, c]).T
df.columns = ['A', 'B', 'C']

Line plot
df1 = df.cumsum()
ax = df1.plot()

# from here down – standard plot output
ax.set_title('Title')
ax.set_xlabel('X Axis')
ax.set_ylabel('Y Axis')
fig = ax.figure
fig.set_size_inches(8, 3)
fig.tight_layout(pad=1)
fig.savefig('filename.png', dpi=125)
plt.close()

Row selection with a time-series index
# start with some play data
n = 48
df = pd.DataFrame(np.random.randint(low=0,
    high=999, size=(n, 5)),
    index=pd.period_range('2015-01',
        periods=n, freq='M'))

february_selector = (df.index.month == 2)
february_data = df[february_selector]

q1_data = df[(df.index.month >= 1) &
    (df.index.month <= 3)]
mayornov_data = df[(df.index.month == 5) |
    (df.index.month == 11)]

year_totals = df.groupby(df.index.year).sum()
Also: year, month, day [of month], hour, minute, second,
dayofweek [numbered from 0; week starts on Monday],
weekofmonth, weekofyear [numbered from 1],
dayofyear [numbered from 1], …
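Partial-string selection is another common idiom with
date-like indexes (a sketch, assuming the play data
above; label slices are inclusive):
rows_2015 = df.loc['2015'] # a whole year
h1 = df.loc['2015-01':'2015-06'] # six months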
Multiple histograms (overlapping or stacked)
ax = df.plot.hist(bins=25, alpha=0.5) # or...
ax = df.plot.hist(bins=25, stacked=True)
# followed by the standard plot code as above

Scatter plot
ax = df.plot.scatter(x='A', y='C')
# followed by the standard plot code as above

Pie chart
s = pd.Series(data=[10, 20, 30],
    index=['dogs', 'cats', 'birds'])
ax = s.plot.pie(autopct='%.1f')

Density plot
ax = df.plot.kde()
# followed by the standard plot code as above
A line and bar on the same chart
In matplotlib, bar charts visualise categorical or discrete
data, while line charts visualise continuous data. This
makes it hard to get bars and lines on the same chart.
Typically, combined charts either have too many labels,
and/or the lines and bars are misaligned or missing.
You need to trick matplotlib a bit … pandas makes this
tricking easier.

# start with fake percentage growth data
s = pd.Series(np.random.normal(
    1.02, 0.015, 40))
s = s.cumprod()
dfg = (pd.concat([s / s.shift(1),
    s / s.shift(4)], axis=1) * 100) - 100
dfg.columns = ['Quarter', 'Annual']
dfg.index = pd.period_range('2010-Q1',
    periods=len(dfg), freq='Q')

# reindex with integers from 0; keep old
old = dfg.index
dfg.index = range(len(dfg))

# plot the line from pandas
ax = dfg['Annual'].plot(color='blue',
    label='Year/Year Growth')
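# a sketch of one way to overlay the bars (an
# assumption, not the original code): draw bars at
# the same integer x positions, then restore
# readable period labels from the saved index
ax.bar(dfg.index, dfg['Quarter'],
    color='#cccccc', label='Q/Q Growth')
ticks = list(range(0, len(dfg), 8))
ax.set_xticks(ticks)
ax.set_xticklabels([str(old[i]) for i in ticks])
ax.legend()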
fig = ax.figure
fig.set_size_inches(8, 3)
fig.tight_layout(pad=1)
fig.savefig('filename.png', dpi=125)
plt.close()

Working with missing and non-finite data

Working with missing data
Pandas uses the not-a-number construct (np.nan and
float('nan')) to indicate missing data. The Python None
can arise in data as well. It is also treated as missing
data, as is the pandas not-a-time construct
(pandas.NaT).

Missing data in a Series
s = pd.Series([8, None, float('nan'), np.nan])
# [8, NaN, NaN, NaN]
s.isna() # [False, True, True, True]
s.notna() # [True, False, False, False]
s.fillna(0) # [8, 0, 0, 0]

Missing data in a DataFrame
df = df.dropna() # drop all rows with NaN
df = df.dropna(axis=1) # same for cols
df = df.dropna(how='all') # drop all-NaN rows
df = df.dropna(thresh=2) # keep rows with 2+
                         # non-NaN values
# only drop row if NaN in a specified col
df = df.dropna(subset=['col'])
Working with Categorical Data
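A minimal sketch of creating categorical data (standard
pandas usage; made-up values):
s = pd.Series([1, 7, 9, 7], dtype='category')
cats = s.cat.categories # the distinct categories
df['cat'] = df['cat'].astype('category')
# (a hypothetical column converted in place)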
Removing categories
s = s.cat.remove_categories([7, 9])
s = s.cat.remove_unused_categories()

Working with strings
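A minimal sketch of the vectorised string methods on
the .str accessor (standard pandas API; made-up data):
s = pd.Series(['a cat', 'a dog', None])
s2 = s.str.upper() # ['A CAT', 'A DOG', NaN]
b = s.str.contains('dog') # [False, True, NaN]
s3 = s.str.replace('a ', 'the ')
l = s.str.split(' ') # lists of tokens
Note: missing values propagate through .str methods.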
Basic Statistics

Summary statistics
s = df[col].describe()
df1 = df.describe()

DataFrame – key stats methods
df.corr() # pairwise correlation cols
df.cov() # pairwise covariance cols
df.kurt() # kurtosis over cols (def)
df.mad() # mean absolute deviation
df.sem() # standard error of mean
df.var() # variance over cols (def)

Value counts
s = df[col].value_counts()

Histogram binning
count, bins = np.histogram(df[col])
count, bins = np.histogram(df[col], bins=5)
count, bins = np.histogram(df[col],
    bins=[-3, -2, -1, 0, 1, 2, 3, 4])

Regression
import statsmodels.formula.api as sm
result = sm.ols(formula="col1 ~ col2 + col3",
    data=df).fit()
print(result.params)
print(result.summary())

Cautionary note
This cheat sheet was cobbled together by tireless bots
roaming the dark recesses of the Internet seeking ursine
and anguine myths from a fabled land of milk and honey
where it is rumoured pandas and pythons gambol
together. There is no guarantee the narratives were
captured and transcribed accurately. You use these
notes at your own risk. You have been warned. I will not
be held responsible for whatever happens to you and
those you love once your eyes begin to see what is
written here.

Errors: if you find any errors, please email me at
[email protected]; (but please do not correct
my use of Australian-English spelling conventions).
Version 14 December 2019 – Draft – Mark Graph – @Mark_Graph on twitter