Notes For Python Part III
Notes For Python Part III
Index Object
We saw that both the Series and DataFrame objects contain an explicit index
which is useful for slicing.
We can use the Python style indexing scheme or the explicit index associated
with the Series and DataFrame objects.
loc attribute allows indexing and slicing that always uses the explicit index.
iloc attribute allows indexing and slicing that always uses the Pythonic index.
1 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Missing Data
Pandas handles missing values using NumPy package, which does not have a
built-in notion of NA values for non-floating-point data types.
Pandas utilizes sentinels for missing data by using two already existing Python
null values: the special floating-point NaN value, and the Python None object.
If you perform aggregations like sum() or min() across an array with a None
value, you will generally get an error.
The special floating-point NaN value is recognized as a number but behaves like
a data virus: it infects any other object it interacts.
If you perform aggregations like sum() or min() across an array with a NaN
value, you will get NaN values.
2 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Hierarchical Indexing
We saw that the Series and DataFrame objects store one-dimensional and
two-dimensional data, respectively.
3 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Hierarchical Indexing
import pandas as pd
import numpy as np
# the bad way
index = [( ' California ' , 2000) , ( ' California ' , 2010) ,
( ' New York ' , 2000) , ( ' New York ' ,2010) ,
( ' Texas ' , 2000) , ( ' Texas ' , 2010)]
populations = [33871648 , 37253956 ,
18976457 , 19378102 ,
20851820 , 25145561]
pop = pd . Series ( populations , index = index )
pop
Out [4]:
( California , 2000) 33871648
( California , 2010) 37253956
( New York , 2000) 18976457
( New York , 2010) 19378102
( Texas , 2000) 20851820
( Texas , 2010) 25145561
dtype : int64
index = pd . MultiIndex . from_tuples ( index )
index
pop = pop . reindex ( index )
pop
Out [5]:
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype : int64
4 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Hierarchical Indexing
Note that the first two columns show the multiple index values, while the third
column shows the data.
Suppose you need to access all data for which the second index is 2010. You
can use Pythonic notation.
pop [: , 2010]
Out [6]:
California 37253956
New York 19378102
Texas 25145561
dtype : int64
5 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Hierarchical Indexing
6 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Index resetting
Index labels can be turned into data columns using reset_index method.
pop_reset = pop . reset_index ( name = ' population ' )
pop_reset
Out [6]:
level_0 level_1 population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561
Often raw data will look like this. You can use reset_index method to build a
MultiIndex.
pop_reset = pop . reset_index ( name = ' population ' )
pop_reset = pop_reset . rename ( columns = { ' level_0 ' : ' state ' , ' level_1 ' : ' year ' })
pop_reset . set_index ([ ' state ' , ' year ' ])
pop_reset
Out [19]:
state year population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561
7 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Empirical analysis generally involves some form of concat, merge and join
operations.
Pandas has functions and methods that make this operations straightforward.
8 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
one-to-one join
Pandas merge and join operations use a set of rules known as relational
algebra to combine data.
The relational algebra proposes several primitive operations which allows for
handling more complicated operations.
9 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
many-to-one join
many-to-one joins refer to cases where the key columns for merge contain
duplicate entries.
df4 = pd . DataFrame ({ ' group ' :[ ' accounting ' , ' engineering ' , ' hr ' ] ,
' supervisor ' :[ ' Carly ' , ' Guido ' , ' Steve ' ]})
print ( pd . merge ( df3 , df4 ))
We see that the resulting dataframe has the supervisor column where the
information is repeated in one or more locations.
10 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
many-to-many join
many-to-many joins involve cases where the left and right dataframes contain
key columns with duplicate entries.
df5 = pd . DataFrame ({ ' group ' :[ ' accounting ' , ' accounting ' , ' engineering ' ,
' engineering ' , ' hr ' , ' hr ' ] ,
' skills ' :[ ' math ' , ' spreadsheets ' , ' coding ' , ' linux ' ,
' spreadsheets ' , ' organization ' ]})
print ( pd . merge ( df1 , df5 ))
11 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
on keyword
In the previous examples we have not specified the key columns in pd.merge(),
the function automatically detected to intersection of columns common to the
left and right data frames to join.
Often it will be the case that the key columns are named differently in the left
and right dataframes, and therefore we need to specify them in pd.merge().
print ( df1 ); print ( df2 );
print ( pd . merge ( df1 , df2 , on = ' employee ' ))
# df1
employee group
0 Bob accounting
1 Jake engineering
2 Lisa engineering
3 Sue hr
# df2
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
# merge ( df1 , df2 , on = ' employee ')
employee group hire_date
0 Bob accounting 2008
1 Jake engineering 2012
2 Lisa engineering 2004
3 Sue hr 2014
12 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
We can specify the key columns for pd.merge() using left_on and right_on
arguments corresponding to left and right dataframes.
df3 = pd . DataFrame ({ ' name ' :[ ' Bob ' , ' Jake ' , ' Lisa ' , ' Sue ' ] ,
' salary ' :[70000 , 80000 , 120000 , 90000]})
print ( df1 ); print ( df3 );
print ( pd . merge ( df1 , df3 , left_on = ' employee ' , right_on = ' name ' ))
# df1
employee group
0 Bob accounting
1 Jake engineering
2 Lisa engineering
3 Sue hr
# df3
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
# merge ( df1 , df3 , left_on = ' employee ' , right_on = ' name ')
employee group name salary
0 Bob accounting Bob 70000
1 Jake engineering Jake 80000
2 Lisa engineering Lisa 120000
3 Sue hr Sue 90000
Similarly if left and right dataframes have indices set, you can use left_index
and right_index arguments.
13 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
You may be asking what happens when we try to join two dataframes on a key
column, but both dataframes also have conflicting columns.
pd.merge() works and you will notice that the column will be displayed twice
and the columns names will have suffixes _x and _y appended to them,
respectively.
df6 = pd . DataFrame ({ ' name ' :[ ' Bob ' , ' Jake ' , ' Lisa ' , ' Sue ' ] ,
' rank ' :[1 , 2 , 2 , 4]})
df7 = pd . DataFrame ({ ' name ' :[ ' Bob ' , ' Jake ' , ' Lisa ' , ' Sue ' ] ,
' rank ' :[3 , 1 , 4 , 2]})
print ( pd . merge ( df6 , df7 , on = ' name ' ))
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 2 4
3 Sue 4 2
14 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
group by
Often we will have to aggregate and/or obtain summary statistics data
conditionally on some label or index.
Group-by object in pandas allows for splitting data, applying functions to splits
and combining the results from the apply step.
group by
Let’s take a look at a table of descriptive statistics for the planets data.
Note that describe() can only be applied to numeric columns. method column
is string data and we can calculate a frequency table of values it takes on.
16 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
group by
Suppose you’d like to slice the data by the method column and calculate the
median value of orbital_period variable across the slices.
planets . groupby ( ' method ' )[ ' orbital_period ' ]. median ()
Out [20]:
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000
GroupBy objects allows for iteration over the slices. Suppose you need to see
the shape of each group when grouped by method column.
for ( method , group ) in planets . groupby ( ' method ' ):
print ( " {0:30 s } shape = {1} " . format ( method , group . shape ))
Astrometry shape = (2 , 6)
Eclipse Timing Variations shape = (9 , 6)
Imaging shape = (38 , 6)
Microlensing shape = (23 , 6)
Orbital Brightness Modulation shape = (3 , 6)
Pulsar Timing shape = (5 , 6)
Pulsation Timing Variations shape = (1 , 6)
Radial Velocity shape = (553 , 6)
Transit shape = (397 , 6)
Transit Timing Variations shape = (4 , 6)
17 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
group by
Suppose you’d like to slice the data by the method column and obtain the
descriptive statistics for the distance variable across groups.
planets . groupby ( ' method ' )[ ' distance ' ]. describe ()
Out [26]:
count mean ... 75% max
method ...
Astrometry 2.0 17.875000 ... 19.3225 20.77
Eclipse Timing Variations 4.0 315.360000 ... 500.0000 500.00
Imaging 32.0 67.715937 ... 132.6975 165.00
Microlensing 10.0 4144.000000 ... 4747.5000 7720.00
Orbital Brightness Modulation 2.0 1180.000000 ... 1180.0000 1180.00
Pulsar Timing 1.0 1200.000000 ... 1200.0000 1200.00
Pulsation Timing Variations 0.0 NaN ... NaN NaN
Radial Velocity 530.0 51.600208 ... 59.2175 354.00
Transit 224.0 599.298080 ... 650.0000 8500.00
Transit Timing Variations 3.0 1104.333333 ... 1487.0000 2119.00
[10 rows x 8 columns ]
# also try
# planets . groupby ( ' method ' ) [ ' distance ']. describe (). unstack ()
18 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Let’s calculate min, median, max for orbital_period, mass and distance by
method.
planets . iloc [: ,2:5]. groupby ( planets . iloc [: ,0]). aggregate ([ ' min ' , np . median , max ])
Out [47]:
orbital_period ... distance
min median ... median max
method ...
Astrometry 246.360000 631.180000 ... 17.875 20.77
Eclipse Timing Variations 1916.250000 4343.500000 ... 315.360 500.00
Imaging 4639.150000 27500.000000 ... 40.395 165.00
Microlensing 1825.000000 3300.000000 ... 3840.000 7720.00
Orbital Brightness Modulation 0.240104 0.342887 ... 1180.000 1180.00
Pulsar Timing 0.090706 66.541900 ... 1200.000 1200.00
Pulsation Timing Variations 1170.000000 1170.000000 ... NaN NaN
Radial Velocity 0.736540 360.200000 ... 40.445 354.00
Transit 0.355000 5.714932 ... 341.000 8500.00
Transit Timing Variations 22.339500 57.011000 ... 855.000 2119.00
[10 rows x 9 columns ]
19 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Suppose you’d like to find the groups (based on method) such that there is no
variation in orbital_period.
tmp = pd . DataFrame ([ planets . iloc [: , i ]. fillna ( planets . iloc [: , i ]. dropna (). mean ()) \
for i in range (2 ,6)]). T
planets = pd . concat ([ planets . iloc [: ,:2] , tmp ] , axis = 1)
def my_filter ( x ):
return np . isnan ( x [ ' orbital_period ' ]. std ())
Out [164]:
method number ... distance year
958 Pulsation Timing Variations 1 ... 264.069282 2007.0
[1 rows x 6 columns ]
You can confirm this again from the frequency table of method.
planets . method . value_counts ()
Out [166]:
Radial Velocity 553
Transit 397
.
.
.
Pulsation Timing Variations 1
20 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
transform() allows for transforming full data. Therefore, the output will be
the same shape as the input.
Let’s calculate the deviations from the mean for orbital_period, mass and
distance after grouping by method.
planets . iloc [: ,2:5]. groupby ( planets . iloc [: ,0]). transform ( lambda \
x : x - x . mean ()). head ()
Out [10]:
orbital_period mass distance
0 -554.05468 4.468721 16.962923
1 51.41932 -0.421279 -3.487077
2 -60.35468 -0.031279 -40.597077
3 -497.32468 16.768721 50.182923
4 -307.13468 7.868721 59.032923
21 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
22 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
pivot table
Both the split and combine happen across not a one-dimensional index, but
across a two-dimensional grid.
23 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
pivot table
Approximately, three of every four females on board survived, while only one in
five males survived.
Let’s go one step further and look at survival by both sex and class.
We group by class and gender, select survival, and apply a mean aggregate.
titanic . groupby ([ ' sex ' , ' class ' ])[ ' survived ' ]. aggregate ( ' mean ' ). unstack ()
Out [21]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447
24 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
pivot table
We can easily add another dimension to our survival analysis, say, age.
First, we will generate a discrete age variable using the original age variable.
age = pd . cut ( titanic [ ' age ' ] ,[0 , 18 , 80])
titanic . pivot_table ( ' survived ' , index =[ ' sex ' , age ] , columns = ' class ' )
Out [27]:
class First Second Third
sex age
female (0 , 18] 0.909091 1.000000 0.511628
(18 , 80] 0.972973 0.900000 0.423729
male (0 , 18] 0.800000 0.600000 0.215686
(18 , 80] 0.375000 0.071429 0.133663
25 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
Pandas Series and DataFrame can have both columns and indices with data
types describing timestamps and time spans.
26 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
It takes the starting point as a date and time string (or, alternatively, a datetime
object from the Python standard library) as a first argument, and the number
of elements in the range can be set using the periods keyword argument:
pd . date_range ( " 2020 -3 -1 " , periods = 21)
Out :
DatetimeIndex ([ ' 2020 -03 -01 ' , ' 2020 -03 -02 ' , ' 2020 -03 -03 ' , ' 2020 -03 -04 ' ,
' 2020 -03 -05 ' , ' 2020 -03 -06 ' , ' 2020 -03 -07 ' , ' 2020 -03 -08 ' ,
' 2020 -03 -09 ' , ' 2020 -03 -10 ' ,
...
' 2020 -07 -20 ' , ' 2020 -07 -21 ' , ' 2020 -07 -22 ' , ' 2020 -07 -23 ' ,
' 2020 -07 -24 ' , ' 2020 -07 -25 ' , ' 2020 -07 -26 ' , ' 2020 -07 -27 ' ,
' 2020 -07 -28 ' , ' 2020 -07 -29 ' ] ,
dtype = ' datetime64 [ ns ] ' , length =151 , freq = ' D ' )
27 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
To specify the frequency of the timestamps (default is one day), we can use the
freq keyword argument.
Instead of using periods to specify the number of points, we can give both
starting and ending points as date and time strings (or datetime objects) as
the first and second arguments.
28 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
Timestamp class has, like the datetime class, attributes for accessing time
fields such as year, month, day, hour, minute, and so on.
29 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
30 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
Data that are defined for sequences of time spans can be represented using
Series and DataFrame objects that are indexed using the PeriodIndex class.
31 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
Next, we look at the manipulation of two time series that contain sequences of
temperature measurements at given timestamps.
We have one dataset for an indoor temperature sensor and one dataset for an
outdoor temperature sensor.
df1 = pd . read_csv ( ' outdoor . csv ' , names = [ " time " , " outdoor " ])
df2 = pd . read_csv ( ' indoor . csv ' , names = [ " time " , " indoor " ])
Once we have created DataFrame objects for the time-series data, inspect the
data by displaying the first few lines.
df1 . head ()
Out :
time outdoor
0 1388530986 4.38
1 1388531586 4.25
2 1388532187 4.19
32 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
To represent the data as a meaningful time-series data, first convert the UNIX
timestamps to date and time objects using to_datetime with the unit = "s"
argument.
df1 . head ()
Out :
outdoor
time
2014 -01 -01 00:03:06+01:00 4.38
2014 -01 -01 00:13:06+01:00 4.25
2014 -01 -01 00:23:07+01:00 4.19
2014 -01 -01 00:33:07+01:00 4.06
2014 -01 -01 00:43:08+01:00 4.06
33 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
The index now indeed is a date and time object.
Having the index of a time series represented as proper date and time objects
(in contrast to using, e.g., integers representing the UNIX timestamps) allows
us to easily perform many time-oriented operations.
Before we proceed to explore the data in more detail, we first plot the two time
series to obtain an idea of how the data looks like.
fig , ax = plt . subplots (1 , 1 , figsize = (12 , 4))
df1 . plot ( ax = ax )
df2 . plot ( ax = ax )
34 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
A common operation on time series is to select and extract parts of the data.
For example, from the full dataset that contains data for all of 2014, we may be
interested in selecting out and analyzing only the data for the month of January.
35 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
Like the datetime class in Python’s standard library, the Timestamp class that
is used in pandas to represent time values has attributes for accessing fields
such as year, month, day, hour, minute, and so on.
36 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
Say, we wish to calculate the average temperature for each month of the year.
We begin by creating a new column month, which we assign to the month field
of the Timestamp values of the DatetimeIndex indexer.
To extract the month field from each Timestamp value, we first call
reset_index to convert the index to a column in the data frame (the new
DataFrame object falls back to using an integer index).
Then, we can use the apply function on the newly created time column.
37 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
38 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
39 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
A very useful feature of the pandas time-series objects is the ability to up- and
down-sample the time series using the resample method.
For up-sampling, we need to choose a method for filling in the missing values.
Resample method expects as first argument a string, such as ’H’, ’D’, ’M’,
that specifies the new frequency of data.
It returns a resampler object for which we can invoke aggregation methods such
as mean and sum, in order to obtain the resampled data.
40 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
For plotting purposes, it is often necessary to down- sample the original data to
obtain less busy graphs and regularly spaced time series that can be readily
compared to each other.
Let’s resample the outdoor temperature time series to four different sampling
frequencies and plot the resulting time series.
df1_hour = df1 . resample ( " H " ). mean ()
df1_hour . columns = [ " outdoor ( hourly avg .) " ]
df1_day = df1 . resample ( " D " ). mean ()
df1_day . columns = [ " outdoor ( daily avg .) " ]
df1_week = df1 . resample ( " 7 D " ). mean ()
df1_week . columns = [ " outdoor ( weekly avg .) " ]
df1_month = df1 . resample ( " M " ). mean ()
df1_month . columns = [ " outdoor ( monthly avg .) " ]
df_diff = ( df1 . resample ( " D " ). mean (). outdoor - df2 . resample ( " D " ).
mean (). indoor )
41 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
42 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
For up-sampling, consider the following example where we resample the data
frame df1 to a sampling frequency of 5 minutes using three different
aggregation methods
mean,
The result is three new data frames that we will combine into a single
DataFrame object.
pd . concat (
[ df1 . resample ( " 5 min " ). mean (). rename ( columns ={ " outdoor " : ' None ' }) ,
df1 . resample ( " 5 min " ). ffill (). rename ( columns ={ " outdoor " : ' ffill ' }) ,
df1 . resample ( " 5 min " ). bfill (). rename ( columns ={ " outdoor " : ' bfill ' })
],
axis = 1). head ()
43 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data
Time series
Out :
None ffill bfill
time
2014 -01 -01 00:00:00+01:00 4.38 NaN 4.38
2014 -01 -01 00:05:00+01:00 NaN 4.38 4.25
2014 -01 -01 00:10:00+01:00 4.25 4.38 4.25
2014 -01 -01 00:15:00+01:00 NaN 4.25 4.19
2014 -01 -01 00:20:00+01:00 4.19 4.25 4.19
Depending on the value of the aggregation method, those values are filled (or
not) according to the specified strategies.
44 / 44