Pandas

WHAT IS PANDAS?

• Derived from “panel data” or “panel data analysis”


• Multidimensional structured data sets
• Python Data Analysis Library
HOW TO WORK WITH PANEL DATA?

• All variables can be acquired as vectors.


• But vectors can be of different data types.
• And it would be nice to work with them as with one entity.
• Put ‘em in a frame… a DataFrame.
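A minimal sketch of the idea above: several vectors of different data types combined into one entity. The column names and values here are purely illustrative:

```python
import pandas as pd

# Three "vectors" of different types describing the same observations
names = ['Anna', 'Boris', 'Carl']        # strings
ages = [34, 28, 45]                      # integers
incomes = [52000.0, 47500.0, 61300.0]    # floats

# Put 'em in a frame: one entity, heterogeneous columns
panel = pd.DataFrame({'name': names, 'age': ages, 'income': incomes})
print(panel.dtypes)  # each column keeps its own dtype
```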
WHAT’S INSIDE?

• DataFrame object for data manipulation with integrated indexing.


• Tools for reading and writing data between in-memory data structures and different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, fancy indexing, and subsetting of large data sets.
• Data structure column insertion and deletion.
• Group by engine allowing split-apply-combine operations on data sets.
• Data set merging and joining.
• Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
• Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions,
date shifting and lagging.
• Data filtration.
SERIES

• The same as a list or array, only with labels.


• pandas.Series(data, index, dtype, copy)
• For data, it is possible to use lists, dictionaries, or NumPy arrays
EXAMPLE

>>> import math
>>> import pandas as pd
>>> ser = pd.Series([1, 2, 3, math.nan, 4], index=['a', 'b', 'c', 'd', 'e'])


>>> ser
a 1.0
b 2.0
c 3.0
d NaN
e 4.0
dtype: float64
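The constructor also accepts a dictionary for data, in which case the keys become the index labels. A small illustrative sketch:

```python
import pandas as pd

# Dictionary keys become the index labels
ser = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(ser['b'])  # 2
```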
WHAT CAN WE DO WITH SERIES?

• Conversion
• Indexing, iteration
• Binary operator functions
• Function application, GroupBy & window
• Computations / descriptive statistics
• Reindexing / selection / label manipulation
• Missing data handling
• Reshaping, sorting
• Combining / comparing / joining / merging
• Time Series-related
>>> ser.sum()
10.0
>>> ser.cumsum()
a 1.0
b 3.0
c 6.0
d NaN
e 10.0
dtype: float64
>>> ser.isna()
a False
b False
c False
d True
e False
dtype: bool
DATAFRAME

• A way to store data in a rectangular grid or table.


• Each row corresponds to the measurements or values of one instance.
• Each column is a vector (or Series) containing data for a specific variable.
• Each vector can be of a different type (numeric, character, etc.).
• A DataFrame can be defined as a two-dimensional labeled data structure with
columns of potentially different types.
• pandas.DataFrame( data, index, columns, dtype, copy)
SELECTING DATA

• By index
• By series
• By selectors:
• loc – label based
• iloc – integer based
• ix – both (deprecated; removed in pandas 1.0)
>>> d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
'three' : pd.Series([10,20,30], index=['a','b','c'])}
>>> df = pd.DataFrame(d)
>>> df
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
>>> df[2:4]
one two three
c 3.0 3 30.0
d NaN 4 NaN
>>> df['one']
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
>>> df['one']['a'] # chained indexing; df.loc['a', 'one'] is preferred
1.0
>>> df.loc['a']
one 1.0
two 1.0
three 10.0
Name: a, dtype: float64
>>> df.iloc[:2,1:]
two three
a 1 10.0
b 2 20.0
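Selecting "by series" presumably refers to boolean masks: a Series of booleans picks out the matching rows. A sketch on the same df as above:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)

# Boolean mask: rows where column 'two' is greater than 2
mask = df['two'] > 2
print(df[mask])  # rows 'c' and 'd'
```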
PANEL

• 3D container of data
• Deprecated in version 0.20 and removed in version 0.25
• Use MultiIndex instead
• Used terms:
• items: axis 0; each item corresponds to a DataFrame contained inside
• major_axis: axis 1; the index (rows) of each of the DataFrames
• minor_axis: axis 2; the columns of each of the DataFrames
INDEX OBJECTS

• Index
• Numeric Index
• CategoricalIndex
• IntervalIndex
• MultiIndex
• DatetimeIndex
• TimedeltaIndex
• PeriodIndex
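A few of these index types in action (a quick illustrative sketch):

```python
import pandas as pd

# DatetimeIndex: generated by date_range
dti = pd.date_range('2020-01-01', periods=3, freq='D')

# CategoricalIndex: a limited set of repeating labels
ci = pd.CategoricalIndex(['low', 'high', 'low'])

# MultiIndex: hierarchical labels built from tuples
mi = pd.MultiIndex.from_tuples([(2020, 'US'), (2020, 'UK')])

print(type(dti).__name__, type(ci).__name__, mi.nlevels)
```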
>>> import numpy as np
>>> import pandas as pd
>>> data = np.random.randint(1, 10, (5, 3, 2))
>>> data
array([[[1, 7],
[7, 3],
[6, 7]],

[[7, 6],
[8, 7],
[4, 2]],

[[9, 7],
[6, 4],
[5, 5]],

[[4, 1],
[6, 3],
[6, 9]],

[[7, 1],
[3, 7],
[5, 9]]])
>>> data = data.reshape(5, 6).T
>>> data
array([[1, 7, 9, 4, 7],
[7, 6, 7, 1, 1],
[7, 8, 6, 6, 3],
[3, 7, 4, 3, 7],
[6, 4, 5, 6, 5],
[7, 2, 5, 9, 9]])
>>> df = pd.DataFrame(
data=data,
index=pd.MultiIndex.from_product([[2015, 2016, 2017], ['US', 'UK']]),
columns=['item {}'.format(i) for i in range(1, 6)]
)
>>> df
item 1 item 2 item 3 item 4 item 5
2015 US 1 7 9 4 7
UK 7 6 7 1 1
2016 US 7 8 6 6 3
UK 3 7 4 3 7
2017 US 6 4 5 6 5
UK 7 2 5 9 9
BASIC FUNCTIONALITY
Attribute or method Description
T Transposes rows and columns.
axes Returns a list with the row axis labels and column axis labels as the only members.
dtypes Returns the dtypes in this object.
empty True if the NDFrame is entirely empty, i.e. any of the axes is of length 0.
ndim Number of axes / array dimensions.
shape Returns a tuple representing the dimensionality of the DataFrame.
size Number of elements in the NDFrame.
values Numpy representation of NDFrame.
head() Returns the first n rows.
tail() Returns last n rows.
>>> df.tail(2)
item 1 item 2 item 3 item 4 item 5
2017 US 6 4 5 6 5
UK 7 2 5 9 9
>>> df.head(2)
item 1 item 2 item 3 item 4 item 5
2015 US 1 7 9 4 7
UK 7 6 7 1 1
>>> df.axes
[MultiIndex([(2015, 'US'),
(2015, 'UK'),
(2016, 'US'),
(2016, 'UK'),
(2017, 'US'),
(2017, 'UK')],
), Index(['item 1', 'item 2', 'item 3', 'item 4', 'item 5'], dtype='object')]
>>> df.T
2015 2016 2017
US UK US UK US UK
item 1 1 7 7 3 6 7
item 2 7 6 8 7 4 2
item 3 9 7 6 4 5 5
item 4 4 1 6 3 6 9
item 5 7 1 3 7 5 9
DESCRIPTIVE STATISTICS
Function Description
count() Number of non-null observations
sum() Sum of values
mean() Mean of Values
median() Median of Values
mode() Mode of values
std() Standard Deviation of the Values
min() Minimum Value
max() Maximum Value
abs() Absolute Value
prod() Product of Values
cumsum() Cumulative Sum
cumprod() Cumulative Product
>>> d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
>>> df = pd.DataFrame(d)
>>> df.describe()
Age Rating
count 12.000000 12.000000
mean 31.833333 3.743333
std 9.232682 0.661628
min 23.000000 2.560000
25% 25.000000 3.230000
50% 29.500000 3.790000
75% 35.500000 4.132500
max 51.000000 4.800000
>>> df.sum()
Name TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age 382
Rating 44.92
dtype: object
FUNCTION APPLICATION

• Table-wise function application: pipe()


• Row- or column-wise function application: apply()
• Element-wise function application: applymap() (deprecated since pandas 2.1 in favour of DataFrame.map())
PIPE EXAMPLE
>>> def adder(ele1, ele2):
    return ele1 + ele2
>>> df = pd.DataFrame(np.random.randn(5,3), columns=['col1','col2','col3'])
>>> df
col1 col2 col3
0 -0.565990 0.042208 0.632644
1 -0.090935 -0.520348 -0.094119
2 0.217500 -0.373585 1.279152
3 1.619577 1.533433 1.059039
4 -0.739193 -2.124682 1.802189
>>> df.pipe(adder,2)
col1 col2 col3
0 1.434010 2.042208 2.632644
1 1.909065 1.479652 1.905881
2 2.217500 1.626415 3.279152
3 3.619577 3.533433 3.059039
4 1.260807 -0.124682 3.802189
APPLY EXAMPLE

>>> df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
>>> df.apply(np.mean, axis=1)
0 0.356482
1 0.221546
2 -0.376605
3 -0.463950
4 0.132884
dtype: float64
>>> df.apply(np.mean)
col1 0.479030
col2 -0.809850
col3 0.253035
dtype: float64
APPLYMAP EXAMPLE
>>> df.applymap(lambda x:x*100)
col1 col2 col3
0 48.147452 -39.275584 98.072821
1 216.419288 -152.239495 2.284114
2 -107.420440 24.806028 -30.366953
3 16.792875 -136.146811 -19.831032
4 65.575667 -102.068915 76.358390
REINDEXING
>>> df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
>>> df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
>>> df2
col1 col2 col3
0 -1.011192 -0.614197 0.351578
1 0.633386 -2.339780 -1.041833
>>> df2.reindex_like(df1,method='ffill',limit=1)
col1 col2 col3
0 -1.011192 -0.614197 0.351578
1 0.633386 -2.339780 -1.041833
2 0.633386 -2.339780 -1.041833
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
>>> df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'durian'})
c1 c2 col3
apple 0.048334 0.531656 1.419512
banana 0.877269 0.242278 -0.214751
durian -0.651795 -0.520254 0.184121
3 -0.404872 1.386826 -1.151902
4 -0.695822 0.657571 0.764508
5 0.164209 0.947984 0.724488
ITERATION
>>> for column_name in df:
    print(column_name)

>>> for column_name, column_as_series in df.items():  # iteritems() was removed in pandas 2.0
    print(column_name)

>>> for row_index, row_as_series in df.iterrows():
    print(row_index)

>>> for row_as_named_tuple in df.itertuples():
    print(row_as_named_tuple)


SORTING
>>> unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],
columns = ['col2','col1'])
>>> unsorted_df.sort_index() #axis, ascending , etc.
col2 col1
0 1.337929 0.025269
1 0.437463 0.124547
2 -0.227216 0.985426
3 -1.111511 -0.624232
4 1.095845 0.823082
5 -0.488366 0.160159
6 -1.304389 0.647291
7 -0.521915 -0.148872
8 -0.073079 1.104150
9 -0.317896 1.547743
>>> unsorted_df.sort_values(by=['col1'])
col2 col1
3 -1.111511 -0.624232
7 -0.521915 -0.148872
0 1.337929 0.025269
1 0.437463 0.124547
5 -0.488366 0.160159
6 -1.304389 0.647291
4 1.095845 0.823082
2 -0.227216 0.985426
8 -0.073079 1.104150
9 -0.317896 1.547743
WORKING WITH TEXT
>>> s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234', 'SteveSmith'])
>>> s
0 Tom
1 William Rick
2 John
3 Alber@t
4 NaN
5 1234
6 SteveSmith
dtype: object
>>> s.str.lower()
0 tom
1 william rick
2 john
3 alber@t
4 NaN
5 1234
6 stevesmith
dtype: object
STATISTICAL FUNCTIONS

• pct_change() – compares every element with its prior element and computes the percentage change.
• cov() – covariance, applied on series data.
• corr() – correlation, applied on series data.
• rank() – produces a ranking for each element in the array of elements.
>>> s = pd.Series(np.random.randn(5),
index=list('abcde'))
>>> s
a 0.021429
b -0.501898
c -2.342914
d 0.808404
e 0.926918
dtype: float64
>>> s.rank()
a 3.0
b 2.0
c 1.0
d 4.0
e 5.0
dtype: float64
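pct_change() and corr() from the list above, shown on small deterministic data (an illustrative sketch):

```python
import pandas as pd

# Percentage change relative to the previous element
s = pd.Series([100.0, 110.0, 99.0])
print(s.pct_change())   # NaN, 0.10, -0.10

# Correlation between two perfectly related series
a = pd.Series([1, 2, 3, 4])
b = pd.Series([2, 4, 6, 8])
print(a.corr(b))        # 1.0
```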
WINDOW FUNCTIONS

• Statistics used for finding trends by smoothing the curve:


• rolling() – rolling window calculations
• expanding() – expanding window transformations
• ewm() – exponentially weighted moving window calculations
>>> df = pd.DataFrame(np.random.randn(10, 4),
index = pd.date_range('1/1/2000', periods=10),
columns = ['A', 'B', 'C', 'D'])
>>> df
A B C D
2000-01-01 -0.303304 -2.077986 0.694436 0.491008
2000-01-02 -0.185362 2.821360 0.393097 0.565165
2000-01-03 0.156716 1.260999 0.531221 1.610406
2000-01-04 -0.750791 0.578499 0.520296 -0.491102
2000-01-05 0.076014 2.101193 -0.530172 0.433667
2000-01-06 0.272712 0.771861 0.756701 -0.949046
2000-01-07 0.045145 -0.867820 0.248871 0.242362
2000-01-08 0.736053 -0.135037 0.290101 0.179423
2000-01-09 -0.523680 -1.571565 0.157508 0.588099
2000-01-10 0.235188 0.571272 0.227741 -0.657029
>>> df.rolling(window=3).mean()
A B C D
2000-01-01 NaN NaN NaN NaN
2000-01-02 NaN NaN NaN NaN
2000-01-03 -0.110650 0.668124 0.539585 0.888860
2000-01-04 -0.259812 1.553620 0.481538 0.561490
2000-01-05 -0.172687 1.313564 0.173782 0.517657
2000-01-06 -0.134022 1.150518 0.248942 -0.335494
2000-01-07 0.131290 0.668411 0.158467 -0.091006
2000-01-08 0.351303 -0.076999 0.431891 -0.175754
2000-01-09 0.085839 -0.858141 0.232160 0.336628
2000-01-10 0.149187 -0.378444 0.225117 0.036831
AGGREGATION
>>> r = df.rolling(window=3,min_periods=1)
>>> r.aggregate({'A' : np.sum,'B' : np.mean})
A B
2000-01-01 -0.303304 -2.077986
2000-01-02 -0.488666 0.371687
2000-01-03 -0.331949 0.668124
2000-01-04 -0.779437 1.553620
2000-01-05 -0.518061 1.313564
2000-01-06 -0.402066 1.150518
2000-01-07 0.393871 0.668411
2000-01-08 1.053910 -0.076999
2000-01-09 0.257518 -0.858141
2000-01-10 0.447562 -0.378444

• aggregate() can also be applied to a DataFrame directly.
• The alias agg() can be used instead of aggregate().
MISSING DATA

• isnull()
• notnull()
• fillna(value_to_replace)
• dropna()
• replace()
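The methods listed above in one short sketch on a Series with a single missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

print(s.isnull())            # False, True, False
print(s.notnull())           # True, False, True
print(s.fillna(0))           # NaN replaced by 0
print(s.dropna())            # row with NaN removed
print(s.replace(3.0, 30.0))  # value substitution
```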
GROUPBY

• Any groupby operation involves one of the following operations on the original object:
• Splitting the Object
• Applying a function
• Combining the results
• Functionality to apply on each data subset:
• Aggregation − computing a summary statistic
• Transformation − perform some group-specific operation
• Filtration − discarding the data with some condition
>>> ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
'Kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
>>> df = pd.DataFrame(ipl_data)
>>> df
Team Rank Year Points
0 Riders 1 2014 876
1 Riders 2 2015 789
2 Devils 2 2014 863
3 Devils 3 2015 673
4 Kings 3 2014 741
5 Kings 4 2015 812
6 Kings 1 2016 756
7 Kings 1 2017 788
8 Riders 2 2016 694
9 Royals 4 2014 701
10 Royals 1 2015 804
11 Riders 2 2017 690
>>> grouped = df.groupby('Team')
>>> grouped.groups
{'Devils': [2, 3], 'Kings': [4, 5, 6, 7], 'Riders': [0, 1, 8, 11], 'Royals':
[9, 10]}
>>> df.groupby(['Team','Year']).groups
>>> grouped['Points'].agg([np.sum, np.mean, np.std])
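Transformation and filtration, the other two operations listed above, sketched on a small subset of the ipl_data frame (the 1600-point threshold is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['Riders', 'Riders', 'Devils', 'Devils'],
    'Points': [876, 789, 863, 673]})

grouped = df.groupby('Team')

# Transformation: broadcast a group statistic back to each row
df['TeamMean'] = grouped['Points'].transform('mean')

# Filtration: keep only teams whose total exceeds 1600 points
strong = grouped.filter(lambda g: g['Points'].sum() > 1600)
print(strong['Team'].unique())  # ['Riders']
```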
MERGE

• pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False)


• left − A DataFrame object.
• right − Another DataFrame object.
• on − Columns (names) to join on. Must be found in both the left and right DataFrame objects.
• left_on − Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the
DataFrame.
• right_on − Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the
DataFrame.
• left_index − If True, use the index (row labels) from the left DataFrame as its join key(s).
• right_index − Same usage as left_index for the right DataFrame.
• how − One of 'left', 'right', 'outer', 'inner'. Defaults to inner
• sort − Sort the result DataFrame by the join keys in lexicographical order. Defaults to False in current pandas; sorting can degrade performance.
>>> left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
>>> right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
>>> pd.merge(left, right, on=['id','subject_id'],
how='left')
id Name_x subject_id Name_y
0 1 Alex sub1 NaN
1 2 Amy sub2 NaN
2 3 Allen sub4 NaN
3 4 Alice sub6 Bryce
4 5 Ayoung sub5 Betty
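With how='outer', every key pair from both sides survives. A reduced sketch; indicator=True adds a _merge column showing where each row came from:

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2], 'subject_id': ['sub1', 'sub2']})
right = pd.DataFrame({'id': [2, 3], 'subject_id': ['sub2', 'sub3']})

out = pd.merge(left, right, on=['id', 'subject_id'],
               how='outer', indicator=True)
print(out)
# (1, sub1) -> left_only, (2, sub2) -> both, (3, sub3) -> right_only
```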
CONCATENATION

• pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False)


• objs − This is a sequence or mapping of Series or DataFrame objects.
• axis − {0, 1, ...}, default 0. This is the axis to concatenate along.
• join − {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer
for union and inner for intersection.
• ignore_index − boolean, default False. If True, do not use the index values on the
concatenation axis. The resulting axis will be labeled 0, ..., n - 1.
• join_axes − This is the list of Index objects. Specific indexes to use for the other (n-1) axes
instead of performing inner/outer set logic. (Removed in pandas 1.0; reindex the result instead.)
>>> one = pd.DataFrame({
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5'],
'Marks_scored':[98,90,87,69,78]},
index=[1,2,3,4,5])
>>> two = pd.DataFrame({
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5'],
'Marks_scored':[89,80,79,97,88]},
index=[1,2,3,4,5])
>>> pd.concat([one,two],keys=['x','y'])
Name subject_id Marks_scored
x 1 Alex sub1 98
2 Amy sub2 90
3 Allen sub4 87
4 Alice sub6 69
5 Ayoung sub5 78
y 1 Billy sub2 89
2 Brian sub4 80
3 Bran sub3 79
4 Bryce sub6 97
5 Betty sub5 88

• ‘df1.append(df2)’ can also be used (removed in pandas 2.0 in favour of concat()).
CATEGORICAL DATA

• Categorical variables can take on only a limited, and usually fixed, number of
possible values.
• Categorical data might have an order, but numerical operations cannot be
performed on it.
>>> s = pd.Series(["a","b","c","a"], dtype="category")
>>> s
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> cat = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
>>> cat
['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
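Categories can also carry an order, which enables comparisons and sorting while still forbidding arithmetic (an illustrative sketch):

```python
import pandas as pd

# ordered=True gives the categories a defined ranking
size = pd.Categorical(['small', 'large', 'medium'],
                      categories=['small', 'medium', 'large'],
                      ordered=True)
s = pd.Series(size)

print(s.min(), s.max())          # small large
print(s.sort_values().tolist())  # ['small', 'medium', 'large']
```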
VISUALIZATION

• Based on the matplotlib library.


• A few wrappers:
• plot()
• plot.bar()
• plot.barh()
• plot.hist()
• plot.box()
• plot.area()
• plot.scatter()
• plot.pie()
INPUT/OUTPUT

• A set of ‘read_xxx’ functions (most have a matching ‘to_xxx’ writer):


• pickle – pickled pandas object from file
• table, csv, fwf – flat files
• clipboard – from os clipboard
• excel
• json
• html – html style table into a list or DataFrame
• hdf – pandas object in HDF5 file (PyTables)
• feather – feather format object from file
• parquet – Apache parquet format object from file
• orc – ORC object from file
• sas
• spss
• sql
• gbq – data from Google BigQuery
• stata
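A round trip through one of these formats, CSV, using an in-memory buffer instead of a file (a small sketch):

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# Write to an in-memory "file", then read it back
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)
print(df2.equals(df))  # True
```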
