Pandas DataFrame Object

Start by importing these Python modules
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame, Series
Note: these are the recommended import aliases

Series object: an ordered, one-dimensional array of data with an index. All the data in a Series is of the same data type. Series arithmetic is vectorised after first aligning the Series index for each of the operands.
s1 = Series(range(0,4)) # -> 0, 1, 2, 3
s2 = Series(range(1,5)) # -> 1, 2, 3, 4
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb'
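
Example (a minimal sketch; the Series names are illustrative): the operands are aligned on their index labels first, and labels present in only one operand give NaN.
sa = Series([1, 2, 3], index=['a','b','c'])
sb = Series([10, 20, 30], index=['b','c','d'])
sa + sb # -> a NaN, b 12.0, c 23.0, d NaN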

The index object: The pandas Index provides the axis labels for the Series and DataFrame objects. It can only contain hashable objects. A pandas Series has one Index; a DataFrame has two Indexes.
# --- get Index from Series and DataFrame
idx = s.index # the Series index
idx = df.columns # the column index
idx = df.index # the row index

# --- some Index attributes
b = idx.is_monotonic_decreasing
b = idx.is_monotonic_increasing
b = idx.has_duplicates
i = idx.nlevels # multi-level indexes

# --- some Index methods and properties
a = idx.values # get as a numpy array
l = idx.tolist() # get as a python list
idx = idx.astype(dtype) # change data type
b = idx.equals(o) # check for equality
idx = idx.union(o) # union of two indexes
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

Load a DataFrame from a CSV file
df = pd.read_csv('file.csv') # often works
df = pd.read_csv('file.csv', header=0,
    index_col=0, quotechar='"', sep=':',
    na_values=['na', '-', '.', ''])
Note: refer to pandas docs for all arguments

Load DataFrames from an Excel file
# each sheet becomes a DataFrame in a dict
workbook = pd.ExcelFile('file.xlsx')
dictionary = {}
for sheet_name in workbook.sheet_names:
    df = workbook.parse(sheet_name)
    dictionary[sheet_name] = df
Note: the parse() method takes many arguments like read_csv() above. Refer to the pandas documentation.
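
Note: if only one sheet is needed, pd.read_excel() can load it directly (a minimal sketch; the sheet name is illustrative).
df = pd.read_excel('file.xlsx', sheet_name='Sheet1')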

Load a DataFrame from a MySQL database
import pymysql
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://'
    + 'USER:PASSWORD@localhost/DATABASE')
df = pd.read_sql_table('table', engine)

Data in Series then combine into a DataFrame
# Example 1 ...
s1 = Series(range(6))
s2 = s1 * s1
s2.index = s2.index + 2 # misalign indexes
df = pd.concat([s1, s2], axis=1)

# Example 2 ...
s3 = Series({'Tom':1, 'Dick':4, 'Har':9})
s4 = Series({'Tom':3, 'Dick':2, 'Mar':5})
df = pd.concat({'A':s3, 'B':s4}, axis=1)
Note: 1st method has integer column labels
Note: 2nd method does not guarantee col order
Note: index alignment on DataFrame creation
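
Example (a small sketch building on Example 1 above): passing the keys argument to concat() names the columns instead of leaving integer labels; the rows are still aligned on the union of the two Series indexes, with NaN where a label is missing from one operand.
df = pd.concat([s1, s2], axis=1, keys=['A','B'])
# 8 rows (index 0..7): df['A'] is NaN for rows 6-7,
# df['B'] is NaN for rows 0-1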

Get a DataFrame from data in a Python dictionary
# default --- assume data is in columns
df = DataFrame({
    'col0' : [1.0, 2.0, 3.0, 4.0],
    'col1' : [100, 200, 300, 400]
})

# --- use helper method for data in rows
df = DataFrame.from_dict({ # data by row
    'row0' : {'col0':0, 'col1':'A'},
    'row1' : {'col0':1, 'col1':'B'}
}, orient='index')

df = DataFrame.from_dict({ # data by row
    'row0' : [1, 1+1j, 'A'],
    'row1' : [2, 2+2j, 'B']
}, orient='index')

Create play/fake data (useful for testing)
# --- simple
df = DataFrame(np.random.rand(50,5))

# --- with a time-stamp row index:
df = DataFrame(np.random.rand(500,5))
df.index = pd.date_range('1/1/2006',
    periods=len(df), freq='M')

# --- with alphabetic row and col indexes
import string
import random
r = 52 # note: min r is 1; max r is 52
c = 5
df = DataFrame(np.random.randn(r, c),
    columns = ['col'+str(i) for i in range(c)],
    index = list((string.ascii_uppercase +
        string.ascii_lowercase)[0:r]))
df['group'] = list(
    ''.join(random.choice('abcd')
        for _ in range(r))
)

Working with the whole DataFrame

Peek at the DataFrame contents
df.info() # index & data types
n = 4
dfh = df.head(n) # get first n rows
dft = df.tail(n) # get last n rows
dfs = df.describe() # summary stats cols
top_left_corner_df = df.iloc[:5, :5]

DataFrame non-indexing attributes
dfT = df.T # transpose rows and cols
l = df.axes # list row and col indexes
(r, c) = df.axes # from above
s = df.dtypes # Series column data types
b = df.empty # True for empty DataFrame
i = df.ndim # number of axes (2)
t = df.shape # (row-count, column-count)
(r, c) = df.shape # from above
i = df.size # row-count * column-count
a = df.values # get a numpy array for df

DataFrame utility methods
dfc = df.copy() # copy a DataFrame
dfr = df.rank() # rank each col (default)
dfs = df.sort() # sort each col (default)
dfc = df.astype(dtype) # type conversion

DataFrame iteration methods
df.iteritems() # (col-index, Series) pairs
df.iterrows() # (row-index, Series) pairs

# example ... iterating over columns
for (name, series) in df.iteritems():
    print('Col name: ' + str(name))
    print('First value: ' +
        str(series.iat[0]) + '\n')

Maths on the whole DataFrame (not a complete list)
df = df.abs() # absolute values
df = df.add(o) # add df, Series or value
s = df.count() # non NA/null values
df = df.cummax() # (cols default axis)
df = df.cummin() # (cols default axis)
df = df.cumsum() # (cols default axis)
df = df.cumprod() # (cols default axis)
df = df.diff() # 1st diff (col def axis)
df = df.div(o) # div by df, Series, value
df = df.dot(o) # matrix dot product
s = df.max() # max of axis (col def)
s = df.mean() # mean (col default axis)
s = df.median() # median (col default)
s = df.min() # min of axis (col def)
df = df.mul(o) # mul by df, Series, value
s = df.sum() # sum axis (cols default)
Note: The methods that return a Series default to working on columns.
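
Example (a minimal sketch of the default-axis behaviour): the axis argument switches between per-column and per-row results.
s = df.sum() # one value per column (axis=0, default)
s = df.sum(axis=1) # one value per row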

DataFrame filter/select rows or cols on label info
df = df.filter(items=['a', 'b']) # by col
df = df.filter(items=[5], axis=0) # by row
df = df.filter(like='x') # keep x in col
df = df.filter(regex='x') # regex in col
df = df.select(crit=(lambda x: not x%5)) # rows
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows

Saving a DataFrame

Saving a DataFrame to a CSV file
df.to_csv('name.csv', encoding='utf-8')

Saving DataFrames to an Excel Workbook
from pandas import ExcelWriter
writer = ExcelWriter('filename.xlsx')
df1.to_excel(writer, 'Sheet1')
df2.to_excel(writer, 'Sheet2')
writer.save()

Saving a DataFrame to MySQL
import pymysql
from sqlalchemy import create_engine
e = create_engine('mysql+pymysql://' +
    'USER:PASSWORD@localhost/DATABASE')
df.to_sql('TABLE', e, if_exists='replace')
Note: if_exists can be 'fail', 'replace' or 'append'

Saving a DataFrame to a Python dictionary
dictionary = df.to_dict()

Saving a DataFrame to a Python string
string = df.to_string()
Note: sometimes may be useful for debugging

Working with Columns

A DataFrame column is a pandas Series object.
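
Example (the column name is illustrative): selecting a single column returns a Series, so the Series operations above apply to it.
s = df['col0'] # a single column as a Series
print(type(s)) # <class 'pandas.core.series.Series'>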

Vectorised arithmetic on columns
df['proportion'] = df['count'] / df['total']
df['percent'] = df['proportion'] * 100.0

Apply numpy mathematical functions to columns
df['log_data'] = np.log(df['col1'])
df['rounded'] = np.round(df['col2'], 2)
Note: many more numpy mathematical functions are available

Columns value set based on criteria
df['b'] = df['a'].where(df['a']>0, other=0)
df['d'] = df['a'].where(df.b!=0, other=df.c)
Note: where other can be a Series or a scalar

Test if column index values are unique/monotonic
if df.columns.is_unique: pass # ...
b = df.columns.is_monotonic_increasing
b = df.columns.is_monotonic_decreasing

Get the integer position of a column index label
j = df.columns.get_loc('col_name')
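
The integer position can then be used with positional indexing (a small sketch continuing the line above):
s = df.iloc[:, j] # same column, selected by position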

Working with rows

Get the row index and labels
idx = df.index # get row index
label = df.index[0] # 1st row label
lst = df.index.tolist() # get as a list

Change the (row) index
df.index = idx # new ad hoc index
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new
# note: old index stored as a col in df
df = df.reindex(index=range(len(df)))
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'}, inplace=True)

Adding rows
df = original_df.append(more_rows_in_df)
Hint: convert to a DataFrame and then append. Both DataFrames should have same column labels.
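
Example (a minimal sketch using the append() approach above; the row and column labels are illustrative): wrap the new row in a one-row DataFrame with matching column labels before appending.
more_rows_in_df = DataFrame(
    {'col0': [9], 'col1': [10]}, index=['new_row'])
df = original_df.append(more_rows_in_df)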

Dropping rows (by name)
df = df.drop('row_label')
df = df.drop(['row1','row2']) # multi-row

Boolean row selection by values in a column
df = df[df['col2'] >= 0.0]
df = df[(df['col3']>=1.0) | (df['col1']<0.0)]
df = df[df['col'].isin([1,2,5,7,11])]
df = df[~df['col'].isin([1,2,5,7,11])]
df = df[df['col'].str.contains('hello')]
Trap: the bitwise "or", "and" and "not" operators (i.e. |, & and ~) are co-opted as Boolean operators on a Series of Booleans.
Trap: need parentheses around comparisons.

Selecting rows using isin over multiple columns
# fake up some data
data = {1:[1,2,3], 2:[1,4,9], 3:[1,8,27]}
df = pd.DataFrame(data)

# multi-column isin
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)]

Selecting rows using an index
idx = df[df['col'] >= 2].index
print(df.ix[idx])

Select a slice of rows by integer position
[inclusive-from : exclusive-to [: step]]
default start is 0; default end is len(df)
df = df[:] # copy DataFrame
df = df[0:2] # rows 0 and 1
df = df[-1:] # the last row
df = df[2:3] # row 2 (the third row)
df = df[:-1] # all but the last row
df = df[::2] # every 2nd row (0 2 ..)
Trap: a single integer without a colon is a column label for integer numbered columns.

Select a slice of rows by label/index
[inclusive-from : inclusive-to [ : step]]
df = df['a':'c'] # rows 'a' through 'c'
Trap: doesn't work on integer labelled rows
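
Example (a small sketch of the two traps above; the DataFrame is illustrative): with integer column labels a bare integer selects a column, while a slice always selects rows by position.
df = DataFrame(np.random.rand(5, 3)) # int col labels 0,1,2
s = df[1] # the column labelled 1 (a Series), not row 1
dfr = df[1:2] # the second row (positions 1 up to 2)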

Append a row of column totals to a DataFrame
# Option 1: use dictionary comprehension
sums = {col: df[col].sum() for col in df}
sums_df = DataFrame(sums, index=['Total'])
df = df.append(sums_df)

# Option 2: All done with pandas
df = df.append(DataFrame(df.sum(),
    columns=['Total']).T)

Iterating over DataFrame rows
for (index, row) in df.iterrows():
    pass # process each row here
Trap: row data type may be coerced.
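
Example (a small sketch of the coercion trap; column names are illustrative): each row comes back as a Series, so mixed column dtypes are upcast to a common type.
df = DataFrame({'a': [1], 'b': [1.5]}) # int and float cols
for (index, row) in df.iterrows():
    print(type(row['a'])) # numpy.float64, not int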

Sorting DataFrame rows values
df = df.sort(df.columns[0], ascending=False)
df.sort(['col1', 'col2'], inplace=True)

Random selection of rows
import random as r
k = 20 # pick a number
selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]
Note: this sample is not sorted

Sort DataFrame by its row index
df.sort_index(inplace=True) # sort by row
df = df.sort_index(ascending=False)

Drop duplicates in the row index
df['index'] = df.index # 1 create new col
df = df.drop_duplicates(cols='index',
    take_last=True) # 2 use new col
del df['index'] # 3 del the col
df.sort_index(inplace=True) # 4 tidy up

Test if two DataFrames have same row index
len(a)==len(b) and all(a.index==b.index)

Get the integer position of a row or col index label
i = df.index.get_loc('row_label')
Trap: index.get_loc() returns an integer for a unique match. If not a unique match, may return a slice or mask.
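
Example (a small sketch of the non-unique case; labels are illustrative): with repeated labels, get_loc() returns a slice or a Boolean mask rather than a single integer.
idx = pd.Index(['a', 'a', 'b'])
i = idx.get_loc('b') # 2 (unique label -> integer)
i = idx.get_loc('a') # slice(0, 2, None) (contiguous duplicates)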

Get integer position of rows that meet condition
a = np.where(df['col'] >= 2) # tuple of numpy arrays

Test if the row index values are unique/monotonic
if df.index.is_unique: pass # ...
b = df.index.is_monotonic_increasing
b = df.index.is_monotonic_decreasing

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]