Pandas
• Conversion
• Indexing, iteration
• Binary operator functions
• Function application, GroupBy & window
• Computations / descriptive statistics
• Reindexing / selection / label manipulation
• Missing data handling
• Reshaping, sorting
• Combining / comparing / joining / merging
• Time Series-related
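>>> # assumed definition of ser, consistent with the outputs shown
>>> ser = pd.Series([1, 2, 3, np.nan, 4], index=list('abcde'))
>>> ser.sum()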
10.0
>>> ser.cumsum()
a 1.0
b 3.0
c 6.0
d NaN
e 10.0
dtype: float64
>>> ser.isna()
a False
b False
c False
d True
e False
dtype: bool
DATAFRAME
• By index
• By series
• By selectors:
• loc – label based
• iloc – integer based
• ix – both (deprecated in 0.20, removed in 1.0; use loc / iloc instead)
>>> d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}
>>> df = pd.DataFrame(d)
>>> df
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
>>> df[2:4]
one two three
c 3.0 3 30.0
d NaN 4 NaN
>>> df['one']
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
>>> df['one']['a']
1.0
>>> df.loc['a']
one 1.0
two 1.0
three 10.0
Name: a, dtype: float64
>>> df.iloc[:2,1:]
two three
a 1 10.0
b 2 20.0
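Since ix is no longer available, mixed label/position selection has to go through loc or iloc. A minimal sketch, reusing the frame from the example above; the variable names are only illustrative:

```python
import pandas as pd

# Same frame as in the example above.
df = pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
    'three': pd.Series([10, 20, 30], index=['a', 'b', 'c']),
})

# Rows by position, column by label: turn the positions into labels for loc.
first_two_by_label = df.loc[df.index[:2], 'two']

# Same selection the other way round: turn the column label into a position for iloc.
first_two_by_pos = df.iloc[:2, df.columns.get_loc('two')]

# Scalar access: at is label based, iat is integer based.
value_by_label = df.at['a', 'one']
value_by_pos = df.iat[0, 0]
```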
PANEL
• 3D container of data
• Deprecated in version 0.20, removed in version 0.25
• Use a DataFrame with a MultiIndex instead
• Used terms:
• items: axis 0; each item corresponds to a DataFrame contained inside
• major_axis: axis 1; the index (rows) of each of the DataFrames
• minor_axis: axis 2; the columns of each of the DataFrames
INDEX OBJECTS
• Index
• Numeric Index
• CategoricalIndex
• IntervalIndex
• MultiIndex
• DatetimeIndex
• TimedeltaIndex
• PeriodIndex
>>> data = np.random.randint(1, 10, (5, 3, 2))
>>> data
array([[[1, 7],
[7, 3],
[6, 7]],
[[7, 6],
[8, 7],
[4, 2]],
[[9, 7],
[6, 4],
[5, 5]],
[[4, 1],
[6, 3],
[6, 9]],
[[7, 1],
[3, 7],
[5, 9]]])
>>> data = data.reshape(5, 6).T
>>> data
array([[1, 7, 9, 4, 7],
[7, 6, 7, 1, 1],
[7, 8, 6, 6, 3],
[3, 7, 4, 3, 7],
[6, 4, 5, 6, 5],
[7, 2, 5, 9, 9]])
>>> df = pd.DataFrame(
data=data,
index=pd.MultiIndex.from_product([[2015, 2016, 2017], ['US', 'UK']]),
columns=['item {}'.format(i) for i in range(1, 6)]
)
>>> df
item 1 item 2 item 3 item 4 item 5
2015 US 1 7 9 4 7
UK 7 6 7 1 1
2016 US 7 8 6 6 3
UK 3 7 4 3 7
2017 US 6 4 5 6 5
UK 7 2 5 9 9
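Selecting from such a frame could look like the sketch below. It uses placeholder values instead of the random data above, and the level names 'year' and 'country' are an assumption added for readability:

```python
import numpy as np
import pandas as pd

# Placeholder values 0..29, not the random data from the example above.
data = np.arange(30).reshape(6, 5)
df = pd.DataFrame(
    data=data,
    index=pd.MultiIndex.from_product(
        [[2015, 2016, 2017], ['US', 'UK']],
        names=['year', 'country'],  # assumed level names
    ),
    columns=['item {}'.format(i) for i in range(1, 6)],
)

# All rows of one outer-level key: a DataFrame indexed by country.
year_2015 = df.loc[2015]

# Cross-section on the inner level: a DataFrame indexed by year.
us_rows = df.xs('US', level='country')

# A single cell, addressed by the full (year, country) key and a column label.
one_cell = df.loc[(2016, 'UK'), 'item 1']
```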
BASIC FUNCTIONALITY
Attribute or method   Description
T                     Transposes rows and columns.
axes                  Returns a list with the row axis labels and column axis labels as its only members.
dtypes                Returns the dtypes in this object.
empty                 True if the object is entirely empty (no items), i.e. any of its axes has length 0.
ndim                  Number of axes / array dimensions.
shape                 Returns a tuple representing the dimensionality of the DataFrame.
size                  Number of elements in the object.
values                NumPy representation of the object.
head()                Returns the first n rows.
tail()                Returns the last n rows.
>>> df.tail(2)
item 1 item 2 item 3 item 4 item 5
2017 US 6 4 5 6 5
UK 7 2 5 9 9
>>> df.head(2)
item 1 item 2 item 3 item 4 item 5
2015 US 1 7 9 4 7
UK 7 6 7 1 1
>>> df.axes
[MultiIndex([(2015, 'US'),
(2015, 'UK'),
(2016, 'US'),
(2016, 'UK'),
(2017, 'US'),
(2017, 'UK')],
), Index(['item 1', 'item 2', 'item 3', 'item 4', 'item 5'], dtype='object')]
>>> df.T
2015 2016 2017
US UK US UK US UK
item 1 1 7 7 3 6 7
item 2 7 6 8 7 4 2
item 3 9 7 6 4 5 5
item 4 4 1 6 3 6 9
item 5 7 1 3 7 5 9
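The remaining attributes from the table can be checked the same way. A small sketch on a frame with the same layout but placeholder values:

```python
import numpy as np
import pandas as pd

# Same layout as the frame above, filled with placeholder values.
df = pd.DataFrame(
    data=np.arange(30).reshape(6, 5),
    index=pd.MultiIndex.from_product([[2015, 2016, 2017], ['US', 'UK']]),
    columns=['item {}'.format(i) for i in range(1, 6)],
)

print(df.shape)            # (6, 5)
print(df.ndim)             # 2
print(df.size)             # 30
print(df.empty)            # False
print(df.T.shape)          # (5, 6)
print(df.iloc[0:0].empty)  # True: the row axis has length 0
```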
DESCRIPTIVE STATISTICS
Function    Description
count()     Number of non-null observations
sum()       Sum of values
mean()      Mean of values
median()    Median of values
mode()      Mode of values
std()       Standard deviation of values
min()       Minimum value
max()       Maximum value
abs()       Absolute value
prod()      Product of values
cumsum()    Cumulative sum
cumprod()   Cumulative product
>>> d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
>>> df = pd.DataFrame(d)
>>> df.describe()
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
>>> df.sum()
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age 382
Rating 44.92
dtype: object
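As the output above shows, sum() concatenates the strings in the Name column. One way to avoid that is to select the numeric columns first. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'James', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
             'Lee', 'David', 'Gasper', 'Betina', 'Andres'],
    'Age': [25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46],
    'Rating': [4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65],
})

# Restrict the statistics to the numeric columns.
numeric = df[['Age', 'Rating']]
totals = numeric.sum()
means = numeric.mean()

# mode() returns every value that shares the highest frequency.
age_mode = df['Age'].mode()

# cumsum() keeps the original length and accumulates from the top.
running_age = df['Age'].cumsum()
```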
FUNCTION APPLICATION
>>> df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
>>> df.apply(np.mean, axis=1)
0 0.356482
1 0.221546
2 -0.376605
3 -0.463950
4 0.132884
dtype: float64
>>> df.apply(np.mean)
col1 0.479030
col2 -0.809850
col3 0.253035
dtype: float64
APPLYMAP EXAMPLE
>>> df.applymap(lambda x: x * 100)  # pandas >= 2.1: prefer df.map(lambda x: x * 100)
col1 col2 col3
0 48.147452 -39.275584 98.072821
1 216.419288 -152.239495 2.284114
2 -107.420440 24.806028 -30.366953
3 16.792875 -136.146811 -19.831032
4 65.575667 -102.068915 76.358390
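The difference between apply (one call per column or row) and element-wise application can be seen on fixed values. This is a sketch; the fallback to applymap is only there for pandas versions before 2.1:

```python
import numpy as np
import pandas as pd

# Fixed values instead of random data, so the results are reproducible.
df = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': [10.0, 20.0, 30.0]})

# apply passes a whole column (default) or a whole row (axis=1) to the function.
col_means = df.apply(np.mean)
row_means = df.apply(np.mean, axis=1)

# Element-wise application: DataFrame.map in pandas >= 2.1, applymap before that.
elementwise = df.map if hasattr(pd.DataFrame, 'map') else df.applymap
scaled = elementwise(lambda x: x * 100)
```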
REINDEXING
>>> df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
>>> df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
>>> df2
col1 col2 col3
0 -1.011192 -0.614197 0.351578
1 0.633386 -2.339780 -1.041833
>>> df2.reindex_like(df1,method='ffill',limit=1)
col1 col2 col3
0 -1.011192 -0.614197 0.351578
1 0.633386 -2.339780 -1.041833
2 0.633386 -2.339780 -1.041833
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
>>> df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'})
c1 c2 col3
apple 0.048334 0.531656 1.419512
banana 0.877269 0.242278 -0.214751
durian -0.651795 -0.520254 0.184121
3 -0.404872 1.386826 -1.151902
4 -0.695822 0.657571 0.764508
5 0.164209 0.947984 0.724488
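The same forward-fill behaviour can be reproduced with reindex on fixed values. A small sketch; the frame and the new index are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.0, 2.0], 'col2': [3.0, 4.0]})

# Conform to a longer index; forward-fill at most one consecutive missing row.
filled = df.reindex(range(4), method='ffill', limit=1)

# rename returns a new frame; labels not mentioned in the mapping stay unchanged.
renamed = df.rename(columns={'col1': 'c1'}, index={0: 'apple'})
```

Row 2 is filled from row 1, while row 3 stays NaN because of limit=1.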
ITERATION
>>> for column_name in df:  # iterating a DataFrame yields its column labels
...     print(column_name)
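Besides plain iteration over column labels, a DataFrame offers items(), iterrows() and itertuples(). A minimal sketch on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3.5, 4.5]})

# Plain iteration yields the column labels.
column_names = [name for name in df]

# items() yields (column label, Series) pairs.
column_sums = {name: col.sum() for name, col in df.items()}

# iterrows() yields (index label, row as Series); values may be upcast per row.
row_labels = [idx for idx, row in df.iterrows()]

# itertuples() yields namedtuples, usually the faster way to walk rows.
first_cells = [row.col1 for row in df.itertuples()]
```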
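STATISTICAL FUNCTIONS
• pct_change() - compares every element with its prior element and computes the percentage change.
• cov() - computes the covariance between two Series (or pairwise between the columns of a DataFrame).
• corr() - computes the correlation between two Series (or pairwise between the columns of a DataFrame).
• rank() - produces a ranking for each element in the array of elements.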
>>> s = pd.Series(np.random.randn(5),
index=list('abcde'))
>>> s
a 0.021429
b -0.501898
c -2.342914
d 0.808404
e 0.926918
dtype: float64
>>> s.rank()
a 3.0
b 2.0
c 1.0
d 4.0
e 5.0
dtype: float64
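The other functions from the list can be tried on fixed values. A sketch with two made-up series, where s2 is exactly twice s1:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 4.0, 8.0])
s2 = pd.Series([2.0, 4.0, 8.0, 16.0])  # exactly 2 * s1

# Relative change to the previous element; the first element has no predecessor.
change = s1.pct_change()

# Sample covariance and Pearson correlation between the two series.
covariance = s1.cov(s2)
correlation = s1.corr(s2)

# Ties share the average of the ranks they span (default method='average').
ranks = pd.Series([10, 30, 20, 30]).rank()
```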
WINDOW FUNCTIONS
MISSING DATA HANDLING
• isnull()
• notnull()
• fillna(value_to_replace)
• dropna()
• replace()
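The functions listed above can be sketched on a small frame with two gaps; the data is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0], 'two': [4.0, 5.0, np.nan]})

# Boolean masks marking missing / present values.
missing = df.isnull()
present = df.notnull()

# Replace every NaN with a fixed value.
filled = df.fillna(0)

# Drop every row that contains at least one NaN; only row 0 is complete.
dropped = df.dropna()

# replace() substitutes arbitrary values, not only missing ones.
replaced = df.replace({1.0: 100.0})
```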
GROUPBY
• Any groupby operation involves one of the following operations on the original object:
• Splitting the Object
• Applying a function
• Combining the results
• Functionality to apply on each data subset:
• Aggregation − computing a summary statistic
• Transformation − perform some group-specific operation
• Filtration − discarding the data with some condition
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'Kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> df
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
2 Devils 2 2014 863
3 Devils 3 2015 673
4 Kings 3 2014 741
5 Kings 4 2015 812
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
9 Royals 4 2014 701
10 Royals 1 2015 804
11 Riders 2 2017 690
>>> grouped = df.groupby('Team')
>>> grouped.groups
{'Devils': [2, 3], 'Kings': [4, 5, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals':
[9, 10]}
>>> df.groupby(['Team','Year']).groups
>>> grouped['Points'].agg(['sum', 'mean', 'std'])
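Aggregation, transformation and filtration could be sketched on the Team and Points columns as follows. The threshold of three rows in the filter is an arbitrary choice for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'Kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690],
})
grouped = df.groupby('Team')

# Aggregation: one summary row per group.
summary = grouped['Points'].agg(['sum', 'mean', 'std'])

# Transformation: same length as the input, here points relative to the team mean.
centered = grouped['Points'].transform(lambda x: x - x.mean())

# Filtration: keep only the rows of groups that satisfy a condition.
big_teams = grouped.filter(lambda g: len(g) >= 3)
```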
MERGE
CATEGORICAL DATA
• Categorical variables can take on only a limited, and usually fixed, number of possible values.
• Categorical data may have an order, but numerical operations cannot be performed on it.
>>> s = pd.Series(["a","b","c","a"], dtype="category")
>>> s
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
>>> cat
['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
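The point about order can be illustrated with an ordered categorical. A minimal sketch; the category names are made up:

```python
import pandas as pd

levels = pd.Series(
    pd.Categorical(
        ['low', 'high', 'medium', 'low'],
        categories=['low', 'medium', 'high'],
        ordered=True,
    )
)

# Order-based operations work on ordered categories.
smallest = levels.min()
largest = levels.max()
above_low = (levels > 'low').tolist()

# Each value is stored as an integer code into the categories.
codes = levels.cat.codes.tolist()

# Numerical operations are not defined on categorical data.
try:
    levels + 1
    numeric_ok = True
except TypeError:
    numeric_ok = False
```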
VISUALIZATION