Notes For Python Part III

The document discusses pandas indexing, missing data handling, hierarchical indexing, index resetting, concatenation and merging. It provides examples of creating multi-indexed DataFrames and using pandas functions like reset_index, concat and merge for one-to-one and many-to-one joins.

More on Pandas: group by (split-apply-combine), Pivot Tables, Time Series Data

Index Object

We saw that both the Series and DataFrame objects contain an explicit index
which is useful for slicing.

You can think of an index object as an immutable array.

We saw different ways to slice Series/DataFrame objects.

We can use either the implicit Python-style (positional) indexing scheme or the explicit index associated with the Series and DataFrame objects.

The loc attribute allows indexing and slicing that always uses the explicit index.

The iloc attribute allows indexing and slicing that always uses the implicit Pythonic (integer-position) index.

A third indexing attribute, ix, was a hybrid of the two; it has since been deprecated and removed from pandas.
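The difference between loc and iloc is easiest to see on a Series whose explicit integer index differs from the implicit positions (a minimal sketch with made-up data):

```python
import pandas as pd

# Explicit index labels 1, 3, 5 deliberately differ from the positions 0, 1, 2.
s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

s.loc[1]     # 'a' -- explicit label 1
s.iloc[1]    # 'b' -- implicit position 1
s.loc[1:3]   # label-based slice, inclusive of the stop label: 'a', 'b'
s.iloc[1:3]  # position-based slice, exclusive of the stop: 'b', 'c'
```

Note that label-based slices include the stop label, while positional slices follow the usual Python convention of excluding it.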


Missing Data

Pandas handles missing values using the NumPy package, which has no built-in notion of NA values for non-floating-point data types.

Pandas utilizes sentinels for missing data by using two already existing Python
null values: the special floating-point NaN value, and the Python None object.

If you perform aggregations like sum() or min() across an array with a None
value, you will generally get an error.

The special floating-point NaN value is recognized as a number, but it behaves like a data virus: it infects any other object it interacts with.

If you perform aggregations like sum() or min() across an array with a NaN
value, you will get NaN values.
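A quick sketch of both behaviours (made-up data); note that pandas aggregations, unlike plain NumPy ones, skip NaN by default:

```python
import numpy as np
import pandas as pd

vals1 = np.array([1, None, 3, 4])    # dtype=object; vals1.sum() raises TypeError
vals2 = np.array([1, np.nan, 3, 4])  # dtype=float64

vals2.sum()             # nan -- NaN propagates through NumPy aggregations
np.nansum(vals2)        # 8.0 -- NaN-ignoring NumPy variant
pd.Series(vals2).sum()  # 8.0 -- pandas skips NaN by default (skipna=True)
```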


Hierarchical Indexing

We saw that the Series and DataFrame objects store one-dimensional and
two-dimensional data, respectively.

Higher dimensional data can be stored using hierarchical indexing


(multi-indexing).

It incorporates multiple index levels within a single index.

These objects are MultiIndex objects.


Hierarchical Indexing
import pandas as pd
import numpy as np
# the bad way: index by tuples
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
Out[4]:
(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64
index = pd.MultiIndex.from_tuples(index)
index
pop = pop.reindex(index)
pop
Out[5]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64


Hierarchical Indexing

Note that the first two columns show the multiple index values, while the third
column shows the data.

Suppose you need to access all data for which the second index is 2010. You
can use Pythonic notation.
pop[:, 2010]
Out[6]:
California    37253956
New York      19378102
Texas         25145561
dtype: int64


Hierarchical Indexing

We can add more columns.


pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df
Out[10]:
                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014

Let’s compute the fraction of people under 18 by year and present it in a wide format.


fr_u18 = pop_df['under18'] / pop_df['total']
fr_u18.unstack()
Out[12]:
                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


Index resetting

Index labels can be turned into data columns using the reset_index method.
pop_reset = pop.reset_index(name='population')
pop_reset
Out[6]:
      level_0  level_1  population
0  California     2000    33871648
1  California     2010    37253956
2    New York     2000    18976457
3    New York     2010    19378102
4       Texas     2000    20851820
5       Texas     2010    25145561

Often raw data will look like this. You can use the set_index method to build a MultiIndex from existing data columns.
pop_reset = pop.reset_index(name='population')
pop_reset = pop_reset.rename(columns={'level_0': 'state', 'level_1': 'year'})
pop_reset.set_index(['state', 'year'])  # returns a new DataFrame; pop_reset itself is unchanged
pop_reset
Out[19]:
        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561


Concat and Append

Empirical analysis generally involves some form of concat, merge and join
operations.

Pandas has functions and methods that make these operations straightforward.

def make_dataframe(cols, ind):
    """quick DataFrame composer"""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df1 = make_dataframe('AB', [1, 2])
df2 = make_dataframe('AB', [3, 4])
pd.concat([df1, df2])
Out[21]:
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


one-to-one join

Pandas merge and join operations use a set of rules known as relational
algebra to combine data.

Relational algebra proposes several primitive operations that can be combined to handle more complicated operations.

The merge() function implements several types of joins: one-to-one, many-to-one, and many-to-many.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['accounting', 'engineering', 'engineering', 'hr']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.merge(df1, df2)
df3
Out [23]:
employee group hire_date
0 Bob accounting 2008
1 Jake engineering 2012
2 Lisa engineering 2004
3 Sue hr 2014


many-to-one join

many-to-one joins refer to cases in which the key column of one of the two dataframes contains duplicate entries.
df4 = pd.DataFrame({'group': ['accounting', 'engineering', 'hr'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
print(pd.merge(df3, df4))

employee group hire_date supervisor


0 Bob accounting 2008 Carly
1 Jake engineering 2012 Guido
2 Lisa engineering 2004 Guido
3 Sue hr 2014 Steve

We see that the resulting dataframe has an additional supervisor column, whose information is repeated in one or more rows as required.


many-to-many join

many-to-many joins involve cases where the left and right dataframes contain
key columns with duplicate entries.
df5 = pd.DataFrame({'group': ['accounting', 'accounting', 'engineering',
                              'engineering', 'hr', 'hr'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
print(pd.merge(df1, df5))

employee group skills


0 Bob accounting math
1 Bob accounting spreadsheets
2 Jake engineering coding
3 Jake engineering linux
4 Lisa engineering coding
5 Lisa engineering linux
6 Sue hr spreadsheets
7 Sue hr organization


on keyword

In the previous examples we did not specify the key columns in pd.merge(); the function automatically joined on the intersection of the column names common to the left and right dataframes.

When the key column has the same name in both dataframes, we can name it explicitly with the on keyword; when the names differ, we need the left_on and right_on keywords described next.
print(df1); print(df2)
print(pd.merge(df1, df2, on='employee'))
# df1
employee group
0 Bob accounting
1 Jake engineering
2 Lisa engineering
3 Sue hr
# df2
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
# merge(df1, df2, on='employee')
employee group hire_date
0 Bob accounting 2008
1 Jake engineering 2012
2 Lisa engineering 2004
3 Sue hr 2014


left on and right on keywords

We can specify the key columns for pd.merge() using the left_on and right_on arguments, corresponding to the left and right dataframes.
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
print(df1); print(df3)
print(pd.merge(df1, df3, left_on='employee', right_on='name'))
# df1
employee group
0 Bob accounting
1 Jake engineering
2 Lisa engineering
3 Sue hr
# df3
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
# merge(df1, df3, left_on='employee', right_on='name')
employee group name salary
0 Bob accounting Bob 70000
1 Jake engineering Jake 80000
2 Lisa engineering Lisa 120000
3 Sue hr Sue 90000

Similarly, if the left and right dataframes have their indices set, you can merge on them using the left_index and right_index arguments.
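As a sketch of index-based merging (reusing df1 and df3 from this slide), we can move the key columns into the indices and then merge with left_index and right_index:

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['accounting', 'engineering', 'engineering', 'hr']})
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})

# Move the key columns into the indices.
df1a = df1.set_index('employee')
df3a = df3.set_index('name')

# Merge on the indices of both dataframes rather than on columns.
merged = pd.merge(df1a, df3a, left_index=True, right_index=True)
```

The join() method performs the same index-based merge with a shorter syntax: df1a.join(df3a).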


overlapping column names

You may be asking what happens when we try to join two dataframes on a key column while both dataframes also contain other columns with conflicting names.

pd.merge() still works: the conflicting column appears twice, and its two copies get the suffixes _x and _y appended to their names, respectively.
df6 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 2, 4]})
df7 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
print(pd.merge(df6, df7, on='name'))
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 2 4
3 Sue 4 2

It is possible to customize the suffixes using the suffixes argument.
print(pd.merge(df6, df7, on='name', suffixes=['_L', '_R']))
name rank_L rank_R
0 Bob 1 3
1 Jake 2 1
2 Lisa 2 4
3 Sue 4 2


group by
Often we will have to aggregate and/or obtain summary statistics conditionally on some label or index.

The GroupBy object in pandas allows for splitting the data, applying functions to the splits, and combining the results from the apply step.

The most important operations performed by GroupBy objects are aggregate, filter, transform, and apply.

To understand these functionalities of a GroupBy object, we will use the planets dataset available in the seaborn module.
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
Out[15]: (1035, 6)
planets.head()
Out[16]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009

group by

Let’s take a look at a table of descriptive statistics for the planets data.

planets.dropna().describe()
Out[18]:
number orbital_period mass distance year
count 498.00000 498.000000 498.000000 498.000000 498.000000
mean 1.73494 835.778671 2.509320 52.068213 2007.377510
std 1.17572 1469.128259 3.636274 46.596041 4.167284
min 1.00000 1.328300 0.003600 1.350000 1989.000000
25% 1.00000 38.272250 0.212500 24.497500 2005.000000
50% 1.00000 357.000000 1.245000 39.940000 2009.000000
75% 2.00000 999.600000 2.867500 59.332500 2011.000000
max 6.00000 17337.500000 25.000000 354.000000 2014.000000

Note that describe() applies only to numeric columns. The method column holds string data, so we instead compute a frequency table of the values it takes on.

planets.method.value_counts()
Out[19]:
Radial Velocity 553
Transit 397
Imaging 38
Microlensing 23
Eclipse Timing Variations 9
Pulsar Timing 5
Transit Timing Variations 4
Orbital Brightness Modulation 3
Astrometry 2
Pulsation Timing Variations 1


group by
Suppose you’d like to slice the data by the method column and calculate the median value of the orbital_period variable across the slices.
planets.groupby('method')['orbital_period'].median()
Out [20]:
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000

GroupBy objects allow for iteration over the groups. Suppose you need to see the shape of each group when grouping by the method column.
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape = {1}".format(method, group.shape))
Astrometry                     shape = (2, 6)
Eclipse Timing Variations      shape = (9, 6)
Imaging                        shape = (38, 6)
Microlensing                   shape = (23, 6)
Orbital Brightness Modulation  shape = (3, 6)
Pulsar Timing                  shape = (5, 6)
Pulsation Timing Variations    shape = (1, 6)
Radial Velocity                shape = (553, 6)
Transit                        shape = (397, 6)
Transit Timing Variations      shape = (4, 6)

group by

Suppose you’d like to slice the data by the method column and obtain the
descriptive statistics for the distance variable across groups.
planets.groupby('method')['distance'].describe()
Out [26]:
count mean ... 75% max
method ...
Astrometry 2.0 17.875000 ... 19.3225 20.77
Eclipse Timing Variations 4.0 315.360000 ... 500.0000 500.00
Imaging 32.0 67.715937 ... 132.6975 165.00
Microlensing 10.0 4144.000000 ... 4747.5000 7720.00
Orbital Brightness Modulation 2.0 1180.000000 ... 1180.0000 1180.00
Pulsar Timing 1.0 1200.000000 ... 1200.0000 1200.00
Pulsation Timing Variations 0.0 NaN ... NaN NaN
Radial Velocity 530.0 51.600208 ... 59.2175 354.00
Transit 224.0 599.298080 ... 650.0000 8500.00
Transit Timing Variations 3.0 1104.333333 ... 1487.0000 2119.00
[10 rows x 8 columns]
# also try
# planets.groupby('method')['distance'].describe().unstack()


group by: aggregate

The aggregate() method can be combined with groupby(). It can take a string, a function, or a list of these, and compute all the aggregates at once.

Let's calculate the min, median, and max of orbital_period, mass, and distance by method.
planets.iloc[:, 2:5].groupby(planets.iloc[:, 0]).aggregate(['min', np.median, max])
Out [47]:
orbital_period ... distance
min median ... median max
method ...
Astrometry 246.360000 631.180000 ... 17.875 20.77
Eclipse Timing Variations 1916.250000 4343.500000 ... 315.360 500.00
Imaging 4639.150000 27500.000000 ... 40.395 165.00
Microlensing 1825.000000 3300.000000 ... 3840.000 7720.00
Orbital Brightness Modulation 0.240104 0.342887 ... 1180.000 1180.00
Pulsar Timing 0.090706 66.541900 ... 1200.000 1200.00
Pulsation Timing Variations 1170.000000 1170.000000 ... NaN NaN
Radial Velocity 0.736540 360.200000 ... 40.445 354.00
Transit 0.355000 5.714932 ... 341.000 8500.00
Transit Timing Variations 22.339500 57.011000 ... 855.000 2119.00
[10 rows x 9 columns ]


group by: filter


The filter() method allows for dropping groups based on a user-provided function. You need to define the function first and feed it into filter().

Suppose you’d like to find the groups (based on method) such that there is no
variation in orbital_period.
tmp = pd.DataFrame([planets.iloc[:, i].fillna(planets.iloc[:, i].dropna().mean())
                    for i in range(2, 6)]).T
planets = pd.concat([planets.iloc[:, :2], tmp], axis=1)

def my_filter(x):
    # std() of a single observation is NaN, i.e. no variation to measure
    return np.isnan(x['orbital_period'].std())

planets.groupby('method').filter(my_filter)
Out[164]:
                          method  number  ...    distance    year
958  Pulsation Timing Variations       1  ...  264.069282  2007.0
[1 rows x 6 columns]

You can confirm this again from the frequency table of method.
planets.method.value_counts()
Out [166]:
Radial Velocity 553
Transit 397
.
.
.
Pulsation Timing Variations 1


group by: transform

transform() returns a transformed version of the full data; therefore, the output has the same shape as the input.

Let’s calculate the deviations from the mean for orbital_period, mass and
distance after grouping by method.
planets.iloc[:, 2:5].groupby(planets.iloc[:, 0]).transform(lambda x: x - x.mean()).head()
Out [10]:
orbital_period mass distance
0 -554.05468 4.468721 16.962923
1 51.41932 -0.421279 -3.487077
2 -60.35468 -0.031279 -40.597077
3 -497.32468 16.768721 50.182923
4 -307.13468 7.868721 59.032923


group by: apply

The apply() method allows for applying an arbitrary function to the group results.

Let’s normalize orbital_period by subtracting its mean and then dividing by its standard deviation, after grouping by method.
def my_normalize(x):
    tmp1 = x['orbital_period'].mean()
    tmp2 = x['orbital_period'].std()
    x['orbital_period'] -= tmp1
    x['orbital_period'] /= tmp2
    return x

planets.groupby('method').apply(my_normalize)
Out[12]:
method number orbital_period mass distance year
0 Radial Velocity 1 0.751004 7.100000 77.40 2006.0
1 Radial Velocity 1 1.167158 2.210000 56.95 2008.0
2 Radial Velocity 1 1.090333 2.600000 19.84 2011.0
3 Radial Velocity 1 0.789995 19.400000 110.62 2007.0
4 Radial Velocity 1 0.920717 10.500000 119.47 2009.0
5 Radial Velocity 1 0.693640 4.800000 76.39 2008.0
6 Radial Velocity 1 1.784802 4.640000 18.15 2002.0
7 Radial Velocity 1 1.114733 2.638161 21.41 1996.0
8 Radial Velocity 1 1.248623 10.300000 73.10 2008.0
.
.
.


pivot table

A pivot-table operation is similar to a GroupBy aggregation, but the result is laid out as a two-dimensional table.

It is easiest to think of it as a multidimensional version of GroupBy aggregation.

Both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.

We will use the titanic dataset available in the seaborn module.

import pandas as pd
import numpy as np
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.columns
Out[18]:
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')


pivot table

Let’s take a look at the survival rate by gender.

titanic.groupby('sex')['survived'].mean()
Out[20]:
sex
female    0.742038
male      0.188908

Approximately three of every four females on board survived, while only about one in five males survived.

Let’s go one step further and look at survival by both sex and class.

We group by class and gender, select survival, and apply a mean aggregate.
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
Out[21]:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447


pivot table

This is an example of a two-dimensional group-by, and the code is starting to look a bit cluttered.

Pandas offers a convenient tool, pivot_table, which succinctly handles this type of multidimensional aggregation.
titanic.pivot_table('survived', index='sex', columns='class')
Out[22]:
class      First    Second     Third
sex
female  0.968085  0.921053  0.500000
male    0.368852  0.157407  0.135447

We can easily add another dimension to our survival analysis, say, age.

First, we will generate a discrete age variable using the original age variable.
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', index=['sex', age], columns='class')
Out [27]:
class First Second Third
sex age
female (0 , 18] 0.909091 1.000000 0.511628
(18 , 80] 0.972973 0.900000 0.423729
male (0 , 18] 0.800000 0.600000 0.215686
(18 , 80] 0.375000 0.071429 0.133663


Time series

Time series refer to data in which quantities are observed at regularly or irregularly spaced timestamps, or for fixed or variable time spans (periods).

Pandas Series and DataFrame can have both columns and indices with data
types describing timestamps and time spans.

When dealing with temporal data, it is particularly useful to be able to index the data with time data types.

Using the pandas time-series indexers, DatetimeIndex and PeriodIndex, we can carry out many common date, time, period, and calendar operations, such as selecting time ranges and shifting and resampling the data points in a time series.
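For instance, a toy Series indexed by a DatetimeIndex (made-up values) already supports date-string selection and shifting:

```python
import pandas as pd
import numpy as np

# Five daily observations with made-up values 0..4.
ts = pd.Series(np.arange(5), index=pd.date_range("2020-03-01", periods=5))

ts["2020-03-02":"2020-03-04"]  # date-string slicing; both endpoints inclusive
ts.shift(1)                    # shift values one step along the index; the first entry becomes NaN
```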


Time series

To generate a sequence of dates that can be used as an index in a pandas Series or DataFrame object, we can, for example, use date_range().

It takes the starting point as a date and time string (or, alternatively, a datetime object from the Python standard library) as its first argument, and the number of elements in the range can be set using the periods keyword argument:
pd.date_range("2020-3-1", periods=151)
Out:
DatetimeIndex(['2020-03-01', '2020-03-02', '2020-03-03', '2020-03-04',
               '2020-03-05', '2020-03-06', '2020-03-07', '2020-03-08',
               '2020-03-09', '2020-03-10',
               ...
               '2020-07-20', '2020-07-21', '2020-07-22', '2020-07-23',
               '2020-07-24', '2020-07-25', '2020-07-26', '2020-07-27',
               '2020-07-28', '2020-07-29'],
              dtype='datetime64[ns]', length=151, freq='D')


Time series

To specify the frequency of the timestamps (default is one day), we can use the
freq keyword argument.

Instead of using periods to specify the number of points, we can give both
starting and ending points as date and time strings (or datetime objects) as
the first and second arguments.

For example, to generate hourly timestamps between 00:00 and 12:00 on 2020-01-01, we can use:
pd.date_range("2020-1-1 00:00", "2020-1-1 12:00", freq="H")
Out:
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 01:00:00',
               '2020-01-01 02:00:00', '2020-01-01 03:00:00',
               '2020-01-01 04:00:00', '2020-01-01 05:00:00',
               '2020-01-01 06:00:00', '2020-01-01 07:00:00',
               '2020-01-01 08:00:00', '2020-01-01 09:00:00',
               '2020-01-01 10:00:00', '2020-01-01 11:00:00',
               '2020-01-01 12:00:00'],
              dtype='datetime64[ns]', freq='H')

date_range returns an instance of DatetimeIndex, which can be used, for example, as the index of a Series or DataFrame object.
ts1 = pd.Series(np.arange(31), index=pd.date_range("2020-1-1", periods=31))


Time series

The elements of a DatetimeIndex object can, for example, be accessed using indexing with date and time strings.

An element in a DatetimeIndex is of type Timestamp, a pandas object that extends the standard Python datetime object (see the datetime module in the Python standard library).
ts1["2020-1-3"]
Out: 2
ts1.index[2]
Out: Timestamp('2020-01-03 00:00:00', freq='D')

The Timestamp class has, like the datetime class, attributes for accessing time fields such as year, month, day, hour, minute, and so on.

The main difference between Timestamp and datetime: Timestamp stores a timestamp with nanosecond resolution, while a datetime object only uses microsecond resolution.
ts1.index[2].year, ts1.index[2].month, ts1.index[2].day
Out: (2020, 1, 3)
ts1.index[2].nanosecond
Out: 0


Time series

We can convert a Timestamp object to a standard Python datetime object using the to_pydatetime method.
ts1.index[2].to_pydatetime()
Out: datetime.datetime(2020, 1, 3, 0, 0)

We can use a list of datetime objects to create a pandas time series.
import datetime as dt
ts2 = pd.Series(np.random.rand(2),
                index=[dt.datetime(2020, 1, 1),
                       dt.datetime(2020, 2, 1)])
ts2
Out:
2020-01-01    0.371747
2020-02-01    0.845133
dtype: float64


Time series

Data that are defined for sequences of time spans can be represented using Series and DataFrame objects that are indexed using the PeriodIndex class.

We can construct an instance of the PeriodIndex class explicitly by passing a list of Period objects, and then specify it as the index when creating a Series or DataFrame object.
periods = pd.PeriodIndex([pd.Period('2020-01'),
                          pd.Period('2020-02'),
                          pd.Period('2020-03')])
ts3 = pd.Series(np.random.rand(3), index=periods)
ts3
Out:
2020-01    0.086880
2020-02    0.010098
2020-03    0.159337
Freq: M, dtype: float64

We can also convert a Series or DataFrame object indexed by a DatetimeIndex to a PeriodIndex using the to_period method (which takes an argument specifying the period frequency, here 'M' for month).
ts2.to_period('M')
Out:
2020-01    0.371747
2020-02    0.845133
Freq: M, dtype: float64


Time series
Next, we look at the manipulation of two time series that contain sequences of
temperature measurements at given timestamps.

We have one dataset for an indoor temperature sensor and one dataset for an
outdoor temperature sensor.

Observations were taken approximately every 10 minutes during most of 2014.


The two data files, indoor.csv and outdoor.csv, are comma-separated-values files with two columns:
 the first column contains UNIX timestamps (seconds since Jan 1, 1970),
 the second column contains the measured temperature in degrees Celsius.

df1 = pd.read_csv('outdoor.csv', names=["time", "outdoor"])
df2 = pd.read_csv('indoor.csv', names=["time", "indoor"])

Once we have created DataFrame objects for the time-series data, we inspect the data by displaying the first few lines.
df1.head()
Out :
time outdoor
0 1388530986 4.38
1 1388531586 4.25
2 1388532187 4.19


Time series

To represent the data as meaningful time series, we first convert the UNIX timestamps to date and time objects using to_datetime with the unit="s" argument.

Then, we localize the timestamps (assigning a time zone) using tz_localize


and convert the time zone attribute to the Europe/Stockholm time zone using
tz_convert.

We also set the time column as index using set_index.


df1.time = (pd.to_datetime(df1.time.values, unit="s")
            .tz_localize('UTC').tz_convert('Europe/Stockholm'))
df1 = df1.set_index("time")
df2.time = (pd.to_datetime(df2.time.values, unit="s")
            .tz_localize('UTC').tz_convert('Europe/Stockholm'))
df2 = df2.set_index("time")

df1.head()
Out:
                           outdoor
time
2014-01-01 00:03:06+01:00     4.38
2014-01-01 00:13:06+01:00     4.25
2014-01-01 00:23:07+01:00     4.19
2014-01-01 00:33:07+01:00     4.06
2014-01-01 00:43:08+01:00     4.06


Time series
The index is now indeed a date-and-time object.

Having the index of a time series represented as proper date and time objects
(in contrast to using, e.g., integers representing the UNIX timestamps) allows
us to easily perform many time-oriented operations.

Before we proceed to explore the data in more detail, we first plot the two time series to get an idea of what the data look like.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(12, 4))
df1.plot(ax=ax)
df2.plot(ax=ax)


Time series

A common operation on time series is to select and extract parts of the data.

For example, from the full dataset that contains data for all of 2014, we may be
interested in selecting out and analyzing only the data for the month of January.

In pandas, we can accomplish this in a number of ways.


mask_jan = (df1.index >= "2014-1-1") & (df1.index < "2014-2-1")
df1_jan = df1[mask_jan]
df1_jan.info()
Out:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4452 entries, 2014-01-01 00:03:06+01:00 to 2014-01-31 23:56:58+01:00
Data columns (total 1 columns):
outdoor    4452 non-null float64
dtypes: float64(1)
memory usage: 69.6 KB
df2_jan = df2["2014-1-1":"2014-1-31"]
fig, ax = plt.subplots(1, 1, figsize=(12, 4))
df1_jan.plot(ax=ax)
df2_jan.plot(ax=ax)


Time series

Like the datetime class in Python’s standard library, the Timestamp class that
is used in pandas to represent time values has attributes for accessing fields
such as year, month, day, hour, minute, and so on.

Time series

These fields are particularly useful when processing time series.

Say, we wish to calculate the average temperature for each month of the year.

We begin by creating a new column month, to which we assign the month field of the Timestamp values of the DatetimeIndex.

To extract the month field from each Timestamp value, we first call reset_index to convert the index into a column of the dataframe (the new DataFrame object falls back to using an integer index).

Then, we can use the apply function on the newly created time column.


Time series

df1_month = df1.reset_index()
df1_month["month"] = df1_month.time.apply(lambda x: x.month)
df1_month.head()
Out:
                       time  outdoor  month
0 2014-01-01 00:03:06+01:00     4.38      1
1 2014-01-01 00:13:06+01:00     4.25      1
2 2014-01-01 00:23:07+01:00     4.19      1
3 2014-01-01 00:33:07+01:00     4.06      1
4 2014-01-01 00:43:08+01:00     4.06      1
df1_month = df1_month.groupby("month").aggregate(np.mean)

df2_month = df2.reset_index()
df2_month["month"] = df2_month.time.apply(lambda x: x.month)
df2_month = df2_month.groupby("month").aggregate(np.mean)
df_month = df1_month.join(df2_month)
# alternatively
# df_month = pd.concat([df.to_period("M").groupby(level=0).mean()
#                       for df in [df1, df2]], axis=1)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df_month.plot(kind='bar', ax=axes[0])
df_month.plot(kind='box', ax=axes[1])


A very useful feature of the pandas time-series objects is the ability to up- and
down-sample the time series using the resample method.

Resampling means that the number of data points is either increased
(up-sampling) or decreased (down-sampling).

For up-sampling, we need to choose a method for filling in the missing values.

For down-sampling, we need to choose a method for aggregating the multiple
sample points that fall between each new sample point.

The resample method expects as its first argument a string, such as "H",
"D", or "M", that specifies the new frequency of the data.

It returns a resampler object on which we can invoke aggregation methods such
as mean and sum in order to obtain the resampled data.
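As a small self-contained illustration (with made-up values), down-sampling six hourly readings to daily frequency:

```python
import pandas as pd

# six hourly samples spanning two days
s = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
              index=pd.date_range("2014-01-01 22:00", periods=6, freq="h"))

# down-sample to daily frequency, aggregating with the mean
daily = s.resample("D").mean()
# day 1: mean(1, 2) = 1.5; day 2: mean(3, 10, 20, 30) = 15.75
```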

Consider the previous two time series with temperature data.

The original sampling frequency is about one sample every 10 minutes, which
amounts to a large number of data points over a year.

For plotting purposes, it is often necessary to down-sample the original data
to obtain less busy graphs and regularly spaced time series that can readily
be compared to each other.

Let’s resample the outdoor temperature time series to four different sampling
frequencies and plot the resulting time series.
df1_hour = df1.resample("H").mean()
df1_hour.columns = ["outdoor (hourly avg.)"]
df1_day = df1.resample("D").mean()
df1_day.columns = ["outdoor (daily avg.)"]
df1_week = df1.resample("7D").mean()
df1_week.columns = ["outdoor (weekly avg.)"]
df1_month = df1.resample("M").mean()
df1_month.columns = ["outdoor (monthly avg.)"]
df_diff = (df1.resample("D").mean().outdoor
           - df2.resample("D").mean().indoor)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6))
df1_hour.plot(ax=ax1, alpha=0.25)
df1_day.plot(ax=ax1)
df1_week.plot(ax=ax1)
df1_month.plot(ax=ax1)
df_diff.plot(ax=ax2)
ax2.set_title("temperature difference between outdoor and indoor")
fig.tight_layout()

For up-sampling, consider the following example, where we resample the data
frame df1 to a sampling frequency of 5 minutes using three different methods:

- mean,
- ffill (forward-fill),
- bfill (back-fill).

The result is three new data frames that we combine into a single
DataFrame object.
pd.concat(
    [df1.resample("5min").mean().rename(columns={"outdoor": 'None'}),
     df1.resample("5min").ffill().rename(columns={"outdoor": 'ffill'}),
     df1.resample("5min").bfill().rename(columns={"outdoor": 'bfill'})],
    axis=1).head()

Out:
                            None  ffill  bfill
time
2014-01-01 00:00:00+01:00   4.38    NaN   4.38
2014-01-01 00:05:00+01:00    NaN   4.38   4.25
2014-01-01 00:10:00+01:00   4.25   4.38   4.25
2014-01-01 00:15:00+01:00    NaN   4.25   4.19
2014-01-01 00:20:00+01:00   4.19   4.25   4.19

Note that every second data point is a new sample point, since the new
5-minute frequency is twice the original 10-minute sampling rate.

Depending on the chosen method, those new values are filled (or not)
according to the corresponding strategy.

When no fill strategy is selected (as with mean), the new values are marked
as missing using the NaN value.
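The table above can be reproduced on a tiny synthetic series of just two readings ten minutes apart (the values are taken from the slide's output, but the series itself is a made-up stand-in for df1):

```python
import pandas as pd

# two samples ten minutes apart
s = pd.Series([4.38, 4.25],
              index=pd.date_range("2014-01-01 00:00", periods=2, freq="10min"))

# up-sample to 5 minutes with three strategies; the 00:05 row is new
up = pd.concat([s.resample("5min").mean().rename("None"),
                s.resample("5min").ffill().rename("ffill"),
                s.resample("5min").bfill().rename("bfill")], axis=1)
```

The new 00:05 row is NaN under mean, 4.38 under ffill, and 4.25 under bfill.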
