0% found this document useful (0 votes)
43 views

Pandas DataFrameObject

This document provides an overview of the pandas DataFrame object and how to load data into a DataFrame from various sources. It describes the DataFrame as a two-dimensional table with column and row indexes made up of Series objects. It also briefly describes the Series object and pandas Index. Finally, it provides examples of loading DataFrames from CSV files, Excel files, SQL databases, and combining Series into DataFrames.

Uploaded by

Evelyn Ortiz
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
43 views

Pandas DataFrameObject

This document provides an overview of the pandas DataFrame object and how to load data into a DataFrame from various sources. It describes the DataFrame as a two-dimensional table with column and row indexes made up of Series objects. It also briefly describes the Series object and pandas Index. Finally, it provides examples of loading DataFrames from CSV files, Excel files, SQL databases, and combining Series into DataFrames.

Uploaded by

Evelyn Ortiz
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 4

Cheat Sheet: The pandas DataFrame Object

Preliminaries Get your data into a DataFrame

Start by importing these Python modules Load a DataFrame from a CSV file
import numpy as np df = pd.read_csv('file.csv')# often works
import matplotlib.pyplot as plt df = pd.read_csv(‘file.csv’, header=0,
import pandas as pd index_col=0, quotechar=’”’,sep=’:’,
from pandas import DataFrame, Series na_values = [‘na’, ‘-‘, ‘.’, ‘’])
Note: these are the recommended import aliases Note: refer to pandas docs for all arguments

From inline CSV text to a DataFrame


from StringIO import StringIO # python2.7
The conceptual model #from io import StringIO # python 3
data = """, Animal, Cuteness, Desirable
row-1, dog, 8.7, True
DataFrame object: The pandas DataFrame is a two- row-2, bat, 2.6, False"""
dimensional table of data with column and row indexes. df = pd.read_csv(StringIO(data),
The columns are made up of pandas Series objects. header=0, index_col=0,
skipinitialspace=True)
Column  index  (df.columns)   Note: skipinitialspace=True allows a pretty layout

Load DataFrames from a Microsoft Excel file


Series  of  data  

Series  of  data  

Series  of  data  

Series  of  data  

Series  of  data  

Series  of  data  

Series  of  data  

# Each Excel sheet in a Python dictionary


(df.index)  
Row  index  

workbook = pd.ExcelFile('file.xlsx')
dictionary = {}
for sheet_name in workbook.sheet_names:
df = workbook.parse(sheet_name)
dictionary[sheet_name] = df
Note: the parse() method takes many arguments like
read_csv() above. Refer to the pandas documentation.

Series object: an ordered, one-dimensional array of Load a DataFrame from a MySQL database
data with an index. All the data in a Series is of the import pymysql
same data type. Series arithmetic is vectorised after first from sqlalchemy import create_engine
aligning the Series index for each of the operands. engine = create_engine('mysql+pymysql://'
s1 = Series(range(0,4)) # -> 0, 1, 2, 3 +'USER:PASSWORD@localhost/DATABASE')
s2 = Series(range(1,5)) # -> 1, 2, 3, 4 df = pd.read_sql_table('table', engine)
s3 = s1 + s2 # -> 1, 3, 5, 7
s4 = Series(['a','b'])*3 # -> 'aaa','bbb' Data in Series then combine into a DataFrame
# Example 1 ...
The index object: The pandas Index provides the axis s1 = Series(range(6))
labels for the Series and DataFrame objects. It can only s2 = s1 * s1
contain hashable objects. A pandas Series has one s2.index = s2.index + 2# misalign indexes
Index; and a DataFrame has two Indexes. df = pd.concat([s1, s2], axis=1)
# --- get Index from Series and DataFrame
idx = s.index # Example 2 ...
idx = df.columns # the column index s3 = Series({'Tom':1, 'Dick':4, 'Har':9})
idx = df.index # the row index s4 = Series({'Tom':3, 'Dick':2, 'Mar':5})
df = pd.concat({'A':s3, 'B':s4 }, axis=1)
# --- some Index attributes Note: 1st method has in integer column labels
b = idx.is_monotonic_decreasing Note: 2nd method does not guarantee col order
b = idx.is_monotonic_increasing Note: index alignment on DataFrame creation
b = idx.has_duplicates
i = idx.nlevels # multi-level indexes Get a DataFrame from data in a Python dictionary
# default --- assume data is in columns
# --- some Index methods df = DataFrame({
a = idx.values() # get as numpy array 'col0' : [1.0, 2.0, 3.0, 4.0],
l = idx.tolist() # get as a python list 'col1' : [100, 200, 300, 400]
idx = idx.astype(dtype)# change data type })
b = idx.equals(o) # check for equality
idx = idx.union(o) # union of two indexes
i = idx.nunique() # number unique labels
label = idx.min() # minimum label
label = idx.max() # maximum label

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
1
Get a DataFrame from data in a Python dictionary
# --- use helper method for data in rows Working with the whole DataFrame
df = DataFrame.from_dict({ # data by row
'row0' : {'col0':0, 'col1':'A'}, Peek at the DataFrame contents
'row1' : {'col0':1, 'col1':'B'}
df.info() # index & data types
}, orient='index')
n = 4
dfh = df.head(n) # get first n rows
df = DataFrame.from_dict({ # data by row
dft = df.tail(n) # get last n rows
'row0' : [1, 1+1j, 'A'],
dfs = df.describe() # summary stats cols
'row1' : [2, 2+2j, 'B']
top_left_corner_df = df.iloc[:5, :5]
}, orient='index')
DataFrame non-indexing attributes
Create play/fake data (useful for testing)
dfT = df.T # transpose rows and cols
# --- simple
l = df.axes # list row and col indexes
df = DataFrame(np.random.rand(50,5))
(r, c) = df.axes # from above
s = df.dtypes # Series column data types
# --- with a time-stamp row index:
b = df.empty # True for empty DataFrame
df = DataFrame(np.random.rand(500,5))
i = df.ndim # number of axes (2)
df.index = pd.date_range('1/1/2006',
t = df.shape # (row-count, column-count)
periods=len(df), freq='M')
(r, c) = df.shape # from above
i = df.size # row-count * column-count
# --- with alphabetic row and col indexes
a = df.values # get a numpy array for df
import string
import random
r = 52 # note: min r is 1; max r is 52 DataFrame utility methods
c = 5 dfc = df.copy() # copy a DataFrame
df = DataFrame(np.random.randn(r, c), dfr = df.rank() # rank each col (default)
columns = ['col'+str(i) for i in dfs = df.sort() # sort each col (default)
range(c)], dfc = df.astype(dtype) # type conversion
index = list((string.uppercase +
string.lowercase)[0:r])) DataFrame iteration methods
df['group'] = list( df.iteritems()# (col-index, Series) pairs
''.join(random.choice('abcd') df.iterrows() # (row-index, Series) pairs
for _ in range(r))
) # example ... iterating over columns
for (name, series) in df.iteritems():
print('Col name: ' + str(name))
print('First value: ' +
Saving a DataFrame str(series.iat[0]) + '\n')

Saving a DataFrame to a CSV file Maths on the whole DataFrame (not a complete list)
df.to_csv('name.csv', encoding='utf-8') df = df.abs() # absolute values
df = df.add(o) # add df, Series or value
s = df.count() # non NA/null values
Saving DataFrames to an Excel Workbook
df = df.cummax() # (cols default axis)
from pandas import ExcelWriter df = df.cummin() # (cols default axis)
writer = ExcelWriter('filename.xlsx') df = df.cumsum() # (cols default axis)
df1.to_excel(writer,'Sheet1') df = df.cumprod() # (cols default axis)
df2.to_excel(writer,'Sheet2') df = df.diff() # 1st diff (col def axis)
writer.save() df = df.div(o) # div by df, Series, value
df = df.dot(o) # matrix dot product
Saving a DataFrame to MySQL s = df.max() # max of axis (col def)
import pymysql s = df.mean() # mean (col default axis)
from sqlalchemy import create_engine s = df.median()# median (col default)
e = create_engine('mysql+pymysql://' + s = df.min() # min of axis (col def)
'USER:PASSWORD@localhost/DATABASE') df = df.mul(o) # mul by df Series val
df.to_sql('TABLE',e, if_exists='replace') s = df.sum() # sum axis (cols default)
Note: if_exists ! 'fail', 'replace', 'append' Note: The methods that return a series default to
working on columns.
Saving a DataFrame to a Python dictionary
dictionary = df.to_dict() DataFrame filter/select rows or cols on label info
df = df.filter(items=['a', 'b']) # by col
Saving a DataFrame to a Python string df = df.filter(items=[5], axis=0) #by row
string = df.to_string() df = df.filter(like='x') # keep x in col
df = df.filter(regex='x') # regex in col
Note: sometimes may be useful for debugging
df = df.select(crit=(lambda x:not x%5))#r
Note: select takes a Boolean function, for cols: axis=1
Note: filter defaults to cols; select defaults to rows
Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
2
Columns value set based on criteria
Working with Columns df['b']=df['a'].where(df['a']>0,other=0)
df['d']=df['a'].where(df.b!=0,other=df.c)
A DataFrame column is a pandas Series object Note: where other can be a Series or a scalar

Data type conversions


Get column index and labels
s = df['col'].astype(str) # Series dtype
idx = df.columns # get col index
na = df['col'].values # numpy array
label = df.columns[0] # 1st col label
pl = df['col'].tolist() # python list
lst = df.columns.tolist() # get as a list
Note: useful dtypes for Series conversion: int, float, str
Trap: index lost in conversion from Series to array or list
Change column labels
df.rename(columns={'old':'new'},
Common column-wide methods/attributes
inplace=True)
df = df.rename(columns={'a':1,'b':'x'}) value = df['col'].dtype # type of data
value = df['col'].size # col dimensions
value = df['col'].count()# non-NA count
Selecting columns value = df['col'].sum()
s = df['colName'] # select col to Series value = df['col'].prod()
df = df[['colName']] # select col to df value = df['col'].min()
df = df[['a','b']] # select 2 or more value = df['col'].max()
df = df[['c','a','b']]# change order value = df['col'].mean()
s = df[df.columns[0]] # select by number value = df['col'].median()
df = df[df.columns[[0, 3, 4]] # by number value = df['col'].cov(df['col2'])
s = df.pop('c') # get col & drop from df s = df['col'].describe()
s = df['col'].value_counts()
Selecting columns with Python attributes
s = df.a # same as s = df['a'] Find index label for min/max values in column
# cannot create new columns by attribute label = df['col1'].idxmin()
df.existing_col = df.a / df.b label = df['col1'].idxmax()
df['new_col'] = df.a / df.b
Trap: column names must be valid identifiers. Common column element-wise methods
s = df['col'].isnull()
Adding new columns to a DataFrame s = df['col'].notnull() # not isnull()
df['new_col'] = range(len(df)) s = df['col'].astype(float)
df['new_col'] = np.repeat(np.nan,len(df)) s = df['col'].round(decimals=0)
df['random'] = np.random.rand(len(df)) s = df['col'].diff(periods=1)
df['index_as_col'] = df.index s = df['col'].shift(periods=1)
df1[['b','c']] = df2[['e','f']] s = df['col'].to_datetime()
df3 = df1.append(other=df2) s = df['col'].fillna(0) # replace NaN w 0
Trap: When adding an indexed pandas object as a new s = df['col'].cumsum()
column, only items from the new series that have a s = df['col'].cumprod()
corresponding index in the DataFrame will be added. s = df['col'].pct_change(periods=4)
The receiving DataFrame is not extended to s = df['col'].rolling_sum(periods=4,
accommodate the new series. To merge, see below. window=4)
Trap: when adding a python list or numpy array, the Note: also rolling_min(), rolling_max(), and many more.
column will be added by integer position.
Append a column of row sums to a DataFrame
Swap column contents – change column order df['Total'] = df.sum(axis=1)
df[['B', 'A']] = df[['A', 'B']] Note: also means, mins, maxs, etc.

Dropping columns (mostly by label) Multiply every column in DataFrame by Series


df = df.drop('col1', axis=1) df = df.mul(s, axis=0) # on matched rows
df.drop('col1', axis=1, inplace=True) Note: also add, sub, div, etc.
df = df.drop(['col1','col2'], axis=1)
s = df.pop('col') # drops from frame Selecting columns with .loc, .iloc and .ix
del df['col'] # even classic python works df = df.loc[:, 'col1':'col2'] # inclusive
df.drop(df.columns[0], inplace=True) df = df.iloc[:, 0:2] # exclusive

Vectorised arithmetic on columns Get the integer position of a column index label
df['proportion']=df['count']/df['total'] j = df.columns.get_loc('col_name')
df['percent'] = df['proportion'] * 100.0
Test if column index values are unique/monotonic
Apply numpy mathematical functions to columns
if df.columns.is_unique: pass # ...
df['log_data'] = np.log(df['col1']) b = df.columns.is_monotonic_increasing
df['rounded'] = np.round(df['col2'], 2) b = df.columns.is_monotonic_decreasing
Note: Many more mathematical functions

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
3
Select a slice of rows by label/index
Working with rows [inclusive-from : inclusive–to [ : step]]
df = df['a':'c'] # rows 'a' through 'c'
Get the row index and labels Trap: doesn't work on integer labelled rows
idx = df.index # get row index
label = df.index[0] # 1st row label Append a row of column totals to a DataFrame
lst = df.index.tolist() # get as a list # Option 1: use dictionary comprehension
sums = {col: df[col].sum() for col in df}
Change the (row) index sums_df = DataFrame(sums,index=['Total'])
df.index = idx # new ad hoc index df = df.append(sums_df)
df.index = range(len(df)) # set with list
df = df.reset_index() # replace old w new # Option 2: All done with pandas
# note: old index stored as a col in df df = df.append(DataFrame(df.sum(),
df = df.reindex(index=range(len(df))) columns=['Total']).T)
df = df.set_index(keys=['r1','r2','etc'])
df.rename(index={'old':'new'}, Iterating over DataFrame rows
inplace=True) for (index, row) in df.iterrows(): # pass
Trap: row data type may be coerced.
Adding rows
df = original_df.append(more_rows_in_df) Sorting DataFrame rows values
Hint: convert to a DataFrame and then append. Both df = df.sort(df.columns[0],
DataFrames should have same column labels. ascending=False)
df.sort(['col1', 'col2'], inplace=True)
Dropping rows (by name)
df = df.drop('row_label') Random selection of rows
df = df.drop(['row1','row2']) # multi-row import random as r
k = 20 # pick a number
Boolean row selection by values in a column selection = r.sample(range(len(df)), k)
df_sample = df.iloc[selection, :]
df = df[df['col2'] >= 0.0]
df = df[(df['col3']>=1.0) | Note: this sample is not sorted
(df['col1']<0.0)]
df = df[df['col'].isin([1,2,5,7,11])] Sort DataFrame by its row index
df = df[~df['col'].isin([1,2,5,7,11])] df.sort_index(inplace=True) # sort by row
df = df[df['col'].str.contains('hello')] df = df.sort_index(ascending=False)
Trap: bitwise "or", "and" “not” (ie. | & ~) co-opted to be
Boolean operators on a Series of Boolean Drop duplicates in the row index
Trap: need parentheses around comparisons. df['index'] = df.index # 1 create new col
df = df.drop_duplicates(cols='index',
Selecting rows using isin over multiple columns take_last=True)# 2 use new col
# fake up some data del df['index'] # 3 del the col
data = {1:[1,2,3], 2:[1,4,9], 3:[1,8,27]} df.sort_index(inplace=True)# 4 tidy up
df = pd.DataFrame(data)
Test if two DataFrames have same row index
# multi-column isin len(a)==len(b) and all(a.index==b.index)
lf = {1:[1, 3], 3:[8, 27]} # look for
f = df[df[list(lf)].isin(lf).all(axis=1)] Get the integer position of a row or col index label
i = df.index.get_loc('row_label')
Selecting rows using an index
Trap: index.get_loc() returns an integer for a unique
idx = df[df['col'] >= 2].index match. If not a unique match, may return a slice or
print(df.ix[idx]) mask.

Select a slice of rows by integer position Get integer position of rows that meet condition
[inclusive-from : exclusive-to [: step]] a = np.where(df['col'] >= 2) #numpy array
default start is 0; default end is len(df)
df = df[:] # copy DataFrame Test if the row index values are unique/monotonic
df = df[0:2] # rows 0 and 1
df = df[-1:] # the last row if df.index.is_unique: pass # ...
df = df[2:3] # row 2 (the third row) b = df.index.is_monotonic_increasing
df = df[:-1] # all but the last row b = df.index.is_monotonic_decreasing
df = df[::2] # every 2nd row (0 2 ..)
Trap: a single integer without a colon is a column label
for integer numbered columns.

Version 2 May 2015 - [Draft – Mark Graph – mark dot the dot graph at gmail dot com – @Mark_Graph on twitter]
4

You might also like