0% found this document useful (0 votes)

14 views44 pages

Notes For Python Part III

The document discusses pandas indexing, missing data handling, hierarchical indexing, index resetting, concatenation and merging. It provides examples of creating multi-indexed DataFrames and using pandas functions like reset_index, concat and merge for one-to-one and many-to-one joins.

Uploaded by

Erick Solis

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views44 pages

Notes For Python Part III

Uploaded by

Erick Solis

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 44

More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Index Object

We saw that both the Series and DataFrame objects contain an explicit index
which is useful for slicing.

You can think of an index object as an immutable array.

We saw different ways to slice Series/DataFrame objects.

We can use the Python style indexing scheme or the explicit index associated
with the Series and DataFrame objects.

loc attribute allows indexing and slicing that always uses the explicit index.

iloc attribute allows indexing and slicing that always uses the Pythonic index.

A third indexing attribute is ix which is a hybrid of the two.

1 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Missing Data

Pandas handles missing values using NumPy package, which does not have a
built-in notion of NA values for non-floating-point data types.

Pandas utilizes sentinels for missing data by using two already existing Python
null values: the special floating-point NaN value, and the Python None object.

If you perform aggregations like sum() or min() across an array with a None
value, you will generally get an error.

The special floating-point NaN value is recognized as a number but behaves like
a data virus: it infects any other object it interacts.

If you perform aggregations like sum() or min() across an array with a NaN
value, you will get NaN values.

2 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Hierarchical Indexing

We saw that the Series and DataFrame objects store one-dimensional and
two-dimensional data, respectively.

Higher dimensional data can be stored using hierarchical indexing

(multi-indexing).

It incorporates multiple index levels within a single index.

These objects are MultiIndex objects.

3 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Hierarchical Indexing
import pandas as pd
import numpy as np
# the bad way
index = [( ' California ' , 2000) , ( ' California ' , 2010) ,
( ' New York ' , 2000) , ( ' New York ' ,2010) ,
( ' Texas ' , 2000) , ( ' Texas ' , 2010)]
populations = [33871648 , 37253956 ,
18976457 , 19378102 ,
20851820 , 25145561]
pop = pd . Series ( populations , index = index )
pop
Out [4]:
( California , 2000) 33871648
( California , 2010) 37253956
( New York , 2000) 18976457
( New York , 2010) 19378102
( Texas , 2000) 20851820
( Texas , 2010) 25145561
dtype : int64
index = pd . MultiIndex . from_tuples ( index )
index
pop = pop . reindex ( index )
pop
Out [5]:
California 2000 33871648
2010 37253956
New York 2000 18976457
2010 19378102
Texas 2000 20851820
2010 25145561
dtype : int64

4 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Hierarchical Indexing

Note that the first two columns show the multiple index values, while the third
column shows the data.

Suppose you need to access all data for which the second index is 2010. You
can use Pythonic notation.
pop [: , 2010]
Out [6]:
California 37253956
New York 19378102
Texas 25145561
dtype : int64

5 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Hierarchical Indexing

We can add more columns.

pop_df = pd . DataFrame ({ ' total ' : pop ,
' under18 ' : [9267089 , 9284094 ,
4687374 , 4318033 ,
5906301 , 6879014]})
pop_df
Out [10]:
total under18
California 2000 33871648 9267089
2010 37253956 9284094
New York 2000 18976457 4687374
2010 19378102 4318033
Texas 2000 20851820 5906301
2010 25145561 6879014

Let’s compute fraction of people by year and present in a wide format.

fr_u18 = pop_df [ ' under18 ' ]/ pop_df [ ' total ' ]
fr_u18 . unstack ()
Out [12]:
2000 2010
California 0.273594 0.249211
New York 0.247010 0.222831
Texas 0.283251 0.273568

6 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Index resetting

Index labels can be turned into data columns using reset_index method.
pop_reset = pop . reset_index ( name = ' population ' )
pop_reset
Out [6]:
level_0 level_1 population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561

Often raw data will look like this. You can use reset_index method to build a
MultiIndex.
pop_reset = pop . reset_index ( name = ' population ' )
pop_reset = pop_reset . rename ( columns = { ' level_0 ' : ' state ' , ' level_1 ' : ' year ' })
pop_reset . set_index ([ ' state ' , ' year ' ])
pop_reset
Out [19]:
state year population
0 California 2000 33871648
1 California 2010 37253956
2 New York 2000 18976457
3 New York 2010 19378102
4 Texas 2000 20851820
5 Texas 2010 25145561

7 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Concat and Append

Empirical analysis generally involves some form of concat, merge and join
operations.

Pandas has functions and methods that make this operations straightforward.

def make_dataframe ( cols , ind ):

""" quick DataFrame composer """
data = { c : [ str ( c ) + str ( i ) for i in ind ] for c in cols }
return pd . DataFrame ( data , ind )

df1 = make_dataframe ( ' AB ' , [1 ,2])

df2 = make_dataframe ( ' AB ' , [3 ,4])
pd . concat ([ df1 , df2 ])
Out [21]:
A B
1 A1 B1
2 A2 B2
3 A3 B3
4 A4 B4

8 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

one-to-one join

Pandas merge and join operations use a set of rules known as relational
algebra to combine data.

The relational algebra proposes several primitive operations which allows for
handling more complicated operations.

merge() functions implements several type of joins: the one-to-one,

many-to-one, and many-to-many.
df1 = pd . DataFrame ({ ' employee ' :[ ' Bob ' , ' Jake ' , ' Lisa ' , ' Sue ' ] ,
' group ' :[ ' accounting ' , ' engineering ' , ' engineering ' , ' hr ' ]})
df2 = pd . DataFrame ({ ' employee ' :[ ' Lisa ' , ' Bob ' , ' Jake ' , ' Sue ' ] ,
' hire_date ' :[2004 , 2008 , 2012 , 2014]})
df3 = pd . merge ( df1 , df2 )
df3
Out [23]:
employee group hire_date
0 Bob accounting 2008
1 Jake engineering 2012
2 Lisa engineering 2004
3 Sue hr 2014

9 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

many-to-one join

many-to-one joins refer to cases where the key columns for merge contain
duplicate entries.
df4 = pd . DataFrame ({ ' group ' :[ ' accounting ' , ' engineering ' , ' hr ' ] ,
' supervisor ' :[ ' Carly ' , ' Guido ' , ' Steve ' ]})
print ( pd . merge ( df3 , df4 ))

employee group hire_date supervisor

0 Bob accounting 2008 Carly
1 Jake engineering 2012 Guido
2 Lisa engineering 2004 Guido
3 Sue hr 2014 Steve

We see that the resulting dataframe has the supervisor column where the
information is repeated in one or more locations.

10 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

many-to-many join

many-to-many joins involve cases where the left and right dataframes contain
key columns with duplicate entries.
df5 = pd . DataFrame ({ ' group ' :[ ' accounting ' , ' accounting ' , ' engineering ' ,
' engineering ' , ' hr ' , ' hr ' ] ,
' skills ' :[ ' math ' , ' spreadsheets ' , ' coding ' , ' linux ' ,
' spreadsheets ' , ' organization ' ]})
print ( pd . merge ( df1 , df5 ))

employee group skills

0 Bob accounting math
1 Bob accounting spreadsheets
2 Jake engineering coding
3 Jake engineering linux
4 Lisa engineering coding
5 Lisa engineering linux
6 Sue hr spreadsheets
7 Sue hr organization

11 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

on keyword

In the previous examples we have not specified the key columns in pd.merge(),
the function automatically detected to intersection of columns common to the
left and right data frames to join.

Often it will be the case that the key columns are named differently in the left
and right dataframes, and therefore we need to specify them in pd.merge().
print ( df1 ); print ( df2 );
print ( pd . merge ( df1 , df2 , on = ' employee ' ))
# df1
employee group
0 Bob accounting
1 Jake engineering
2 Lisa engineering
3 Sue hr
# df2
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
# merge ( df1 , df2 , on = ' employee ')
employee group hire_date
0 Bob accounting 2008
1 Jake engineering 2012
2 Lisa engineering 2004
3 Sue hr 2014

12 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

left on and right on keywords

We can specify the key columns for pd.merge() using left_on and right_on
arguments corresponding to left and right dataframes.
df3 = pd . DataFrame ({ ' name ' :[ ' Bob ' , ' Jake ' , ' Lisa ' , ' Sue ' ] ,
' salary ' :[70000 , 80000 , 120000 , 90000]})
print ( df1 ); print ( df3 );
print ( pd . merge ( df1 , df3 , left_on = ' employee ' , right_on = ' name ' ))
# df1
employee group
0 Bob accounting
1 Jake engineering
2 Lisa engineering
3 Sue hr
# df3
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
# merge ( df1 , df3 , left_on = ' employee ' , right_on = ' name ')
employee group name salary
0 Bob accounting Bob 70000
1 Jake engineering Jake 80000
2 Lisa engineering Lisa 120000
3 Sue hr Sue 90000

Similarly if left and right dataframes have indices set, you can use left_index
and right_index arguments.

13 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

overlapping column names

You may be asking what happens when we try to join two dataframes on a key
column, but both dataframes also have conflicting columns.

pd.merge() works and you will notice that the column will be displayed twice
and the columns names will have suffixes _x and _y appended to them,
respectively.
df6 = pd . DataFrame ({ ' name ' :[ ' Bob ' , ' Jake ' , ' Lisa ' , ' Sue ' ] ,
' rank ' :[1 , 2 , 2 , 4]})
df7 = pd . DataFrame ({ ' name ' :[ ' Bob ' , ' Jake ' , ' Lisa ' , ' Sue ' ] ,
' rank ' :[3 , 1 , 4 , 2]})
print ( pd . merge ( df6 , df7 , on = ' name ' ))
name rank_x rank_y
0 Bob 1 3
1 Jake 2 1
2 Lisa 2 4
3 Sue 4 2

It is possible customize suffixes using suffixes argument.

print ( pd . merge ( df6 , df7 , on = ' name ' , suffixes =[ ' _L ' , ' _R ' ]))
name rank_L rank_R
0 Bob 1 3
1 Jake 2 1
2 Lisa 2 4
3 Sue 4 2

14 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by
Often we will have to aggregate and/or obtain summary statistics data
conditionally on some label or index.

Group-by object in pandas allows for splitting data, applying functions to splits
and combining the results from the apply step.

The most important operations performed by Group-by objects are aggregate,

filter, transform and apply.

To understand these functionalities of a Group-by object, we will use the

planets dataset available in seaborn module.
import seaborn as sns
planets = sns . load_dataset ( ' planets ' )
planets . shape
Out [15]: (1035 , 6)
planets . head ()
Out [16]:
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009
15 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by

Let’s take a look at a table of descriptive statistics for the planets data.

planets . dropna (). describe ()

Out [18]:
number orbital_period mass distance year
count 498.00000 498.000000 498.000000 498.000000 498.000000
mean 1.73494 835.778671 2.509320 52.068213 2007.377510
std 1.17572 1469.128259 3.636274 46.596041 4.167284
min 1.00000 1.328300 0.003600 1.350000 1989.000000
25% 1.00000 38.272250 0.212500 24.497500 2005.000000
50% 1.00000 357.000000 1.245000 39.940000 2009.000000
75% 2.00000 999.600000 2.867500 59.332500 2011.000000
max 6.00000 17337.500000 25.000000 354.000000 2014.000000

Note that describe() can only be applied to numeric columns. method column
is string data and we can calculate a frequency table of values it takes on.

planets . method . value_counts ()

Out [19]:
Radial Velocity 553
Transit 397
Imaging 38
Microlensing 23
Eclipse Timing Variations 9
Pulsar Timing 5
Transit Timing Variations 4
Orbital Brightness Modulation 3
Astrometry 2
Pulsation Timing Variations 1

16 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by
Suppose you’d like to slice the data by the method column and calculate the
median value of orbital_period variable across the slices.
planets . groupby ( ' method ' )[ ' orbital_period ' ]. median ()
Out [20]:
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000

GroupBy objects allows for iteration over the slices. Suppose you need to see
the shape of each group when grouped by method column.
for ( method , group ) in planets . groupby ( ' method ' ):
print ( " {0:30 s } shape = {1} " . format ( method , group . shape ))
Astrometry shape = (2 , 6)
Eclipse Timing Variations shape = (9 , 6)
Imaging shape = (38 , 6)
Microlensing shape = (23 , 6)
Orbital Brightness Modulation shape = (3 , 6)
Pulsar Timing shape = (5 , 6)
Pulsation Timing Variations shape = (1 , 6)
Radial Velocity shape = (553 , 6)
Transit shape = (397 , 6)
Transit Timing Variations shape = (4 , 6)
17 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by

Suppose you’d like to slice the data by the method column and obtain the
descriptive statistics for the distance variable across groups.
planets . groupby ( ' method ' )[ ' distance ' ]. describe ()
Out [26]:
count mean ... 75% max
method ...
Astrometry 2.0 17.875000 ... 19.3225 20.77
Eclipse Timing Variations 4.0 315.360000 ... 500.0000 500.00
Imaging 32.0 67.715937 ... 132.6975 165.00
Microlensing 10.0 4144.000000 ... 4747.5000 7720.00
Orbital Brightness Modulation 2.0 1180.000000 ... 1180.0000 1180.00
Pulsar Timing 1.0 1200.000000 ... 1200.0000 1200.00
Pulsation Timing Variations 0.0 NaN ... NaN NaN
Radial Velocity 530.0 51.600208 ... 59.2175 354.00
Transit 224.0 599.298080 ... 650.0000 8500.00
Transit Timing Variations 3.0 1104.333333 ... 1487.0000 2119.00
[10 rows x 8 columns ]
# also try
# planets . groupby ( ' method ' ) [ ' distance ']. describe (). unstack ()

18 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by: aggregate

aggregate() method can be combine with groupby(). It can take a string, a

function, or a list, and compute all the aggregates at once.

Let’s calculate min, median, max for orbital_period, mass and distance by
method.
planets . iloc [: ,2:5]. groupby ( planets . iloc [: ,0]). aggregate ([ ' min ' , np . median , max ])
Out [47]:
orbital_period ... distance
min median ... median max
method ...
Astrometry 246.360000 631.180000 ... 17.875 20.77
Eclipse Timing Variations 1916.250000 4343.500000 ... 315.360 500.00
Imaging 4639.150000 27500.000000 ... 40.395 165.00
Microlensing 1825.000000 3300.000000 ... 3840.000 7720.00
Orbital Brightness Modulation 0.240104 0.342887 ... 1180.000 1180.00
Pulsar Timing 0.090706 66.541900 ... 1200.000 1200.00
Pulsation Timing Variations 1170.000000 1170.000000 ... NaN NaN
Radial Velocity 0.736540 360.200000 ... 40.445 354.00
Transit 0.355000 5.714932 ... 341.000 8500.00
Transit Timing Variations 22.339500 57.011000 ... 855.000 2119.00
[10 rows x 9 columns ]

19 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by: filter

filter() method allows for dropping data based on a user provided function.
You need to define the function first and feed it into filter.

Suppose you’d like to find the groups (based on method) such that there is no
variation in orbital_period.
tmp = pd . DataFrame ([ planets . iloc [: , i ]. fillna ( planets . iloc [: , i ]. dropna (). mean ()) \
for i in range (2 ,6)]). T
planets = pd . concat ([ planets . iloc [: ,:2] , tmp ] , axis = 1)
def my_filter ( x ):
return np . isnan ( x [ ' orbital_period ' ]. std ())

planets . groupby ( ' method ' ). filter ( my_filter )

Out [164]:
method number ... distance year
958 Pulsation Timing Variations 1 ... 264.069282 2007.0
[1 rows x 6 columns ]

You can confirm this again from the frequency table of method.
planets . method . value_counts ()
Out [166]:
Radial Velocity 553
Transit 397
.
.
.
Pulsation Timing Variations 1

20 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by: transform

transform() allows for transforming full data. Therefore, the output will be
the same shape as the input.

Let’s calculate the deviations from the mean for orbital_period, mass and
distance after grouping by method.
planets . iloc [: ,2:5]. groupby ( planets . iloc [: ,0]). transform ( lambda \
x : x - x . mean ()). head ()
Out [10]:
orbital_period mass distance
0 -554.05468 4.468721 16.962923
1 51.41932 -0.421279 -3.487077
2 -60.35468 -0.031279 -40.597077
3 -497.32468 16.768721 50.182923
4 -307.13468 7.868721 59.032923

21 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

group by: apply

apply() method allows for applying an arbitrary function to group results.

Let’s normalize orbital_period by subtracting its mean and then dividing by

its standard deviation, after grouping by method.
def my_normalize ( x ):
tmp1 = x [ ' orbital_period ' ]. mean ()
tmp2 = x [ ' orbital_period ' ]. std ()
x [ ' orbital_period ' ] -= tmp1
x [ ' orbital_period ' ] /= tmp2
return x

planets . groupby ( ' method ' ). apply ( my_normalize )

Out [12]:
method number orbital_period mass distance year
0 Radial Velocity 1 0.751004 7.100000 77.40 2006.0
1 Radial Velocity 1 1.167158 2.210000 56.95 2008.0
2 Radial Velocity 1 1.090333 2.600000 19.84 2011.0
3 Radial Velocity 1 0.789995 19.400000 110.62 2007.0
4 Radial Velocity 1 0.920717 10.500000 119.47 2009.0
5 Radial Velocity 1 0.693640 4.800000 76.39 2008.0
6 Radial Velocity 1 1.784802 4.640000 18.15 2002.0
7 Radial Velocity 1 1.114733 2.638161 21.41 1996.0
8 Radial Velocity 1 1.248623 10.300000 73.10 2008.0
.
.
.

22 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

pivot table

A pivot table operation is similar to GroupBy but operates on tabular data.

It is easier to think about it as a multidimensional version of GroupBy

aggregation.

Both the split and combine happen across not a one-dimensional index, but
across a two-dimensional grid.

We will use the titanic dataset available in seaborn module.

import pandas as pd
import numpy as np
import seaborn as sns
titanic = sns . load_dataset ( ' titanic ' )
titanic . columns
Out [18]:
Index ([ ' survived ' , ' pclass ' , ' sex ' , ' age ' , ' sibsp ' , ' parch ' , ' fare ' ,
' embarked ' , ' class ' , ' who ' , ' adult_male ' , ' deck ' , ' embark_town ' ,
' alive ' , ' alone ' ] ,
dtype = ' object ' )

23 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

pivot table

Let’s take a look at survival rate by gender.

titanic . groupby ( ' sex ' )[ ' survived ' ]. mean ()

Out [20]:
sex
female 0.742038
male 0.188908

Approximately, three of every four females on board survived, while only one in
five males survived.

Let’s go one step further and look at survival by both sex and class.

We group by class and gender, select survival, and apply a mean aggregate.
titanic . groupby ([ ' sex ' , ' class ' ])[ ' survived ' ]. aggregate ( ' mean ' ). unstack ()
Out [21]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447

24 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

pivot table

This is an example of two-dimensional group by and the code is starting to look

a bit cluttered.

Pandas offers a convenient tool, pivot_table, which succintly handles this

type of multi-dimensional aggreagtion.
titanic . pivot_table ( ' survived ' , index = ' sex ' , columns = ' class ' )
Out [22]:
class First Second Third
sex
female 0.968085 0.921053 0.500000
male 0.368852 0.157407 0.135447

We can easily add another dimension to our survival analysis, say, age.

First, we will generate a discrete age variable using the original age variable.
age = pd . cut ( titanic [ ' age ' ] ,[0 , 18 , 80])
titanic . pivot_table ( ' survived ' , index =[ ' sex ' , age ] , columns = ' class ' )
Out [27]:
class First Second Third
sex age
female (0 , 18] 0.909091 1.000000 0.511628
(18 , 80] 0.972973 0.900000 0.423729
male (0 , 18] 0.800000 0.600000 0.215686
(18 , 80] 0.375000 0.071429 0.133663

25 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

Time series refer to data in which quantities are observed at regularly or

irregularly spaced timestamps or for fixed or variable time spans (periods).

Pandas Series and DataFrame can have both columns and indices with data
types describing timestamps and time spans.

When dealing with temporal data, it is particularly useful to be able to index

the data with time data types.

Using pandas time-series indexers, DatetimeIndex and PeriodIndex, we can

carry out many common date, time, period, and calendar operations, such as
selecting time ranges and shifting and resampling of the data points in a time
series.

26 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

To generate a sequence of dates that can be used as an index in a pandas

Series or DataFrame objects, we can, for example, use the date_range().

It takes the starting point as a date and time string (or, alternatively, a datetime
object from the Python standard library) as a first argument, and the number
of elements in the range can be set using the periods keyword argument:
pd . date_range ( " 2020 -3 -1 " , periods = 21)
Out :
DatetimeIndex ([ ' 2020 -03 -01 ' , ' 2020 -03 -02 ' , ' 2020 -03 -03 ' , ' 2020 -03 -04 ' ,
' 2020 -03 -05 ' , ' 2020 -03 -06 ' , ' 2020 -03 -07 ' , ' 2020 -03 -08 ' ,
' 2020 -03 -09 ' , ' 2020 -03 -10 ' ,
...
' 2020 -07 -20 ' , ' 2020 -07 -21 ' , ' 2020 -07 -22 ' , ' 2020 -07 -23 ' ,
' 2020 -07 -24 ' , ' 2020 -07 -25 ' , ' 2020 -07 -26 ' , ' 2020 -07 -27 ' ,
' 2020 -07 -28 ' , ' 2020 -07 -29 ' ] ,
dtype = ' datetime64 [ ns ] ' , length =151 , freq = ' D ' )

27 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

To specify the frequency of the timestamps (default is one day), we can use the
freq keyword argument.

Instead of using periods to specify the number of points, we can give both
starting and ending points as date and time strings (or datetime objects) as
the first and second arguments.

For example, to generate hourly timestamps between 00:00 and 12:00 on

2020-01-01, we can use:
pd . date_range ( " 2020 -1 -1 00:00 " , " 2020 -1 -1 12:00 " , freq = " H " )
Out :
DatetimeIndex ([ ' 2020 -01 -01 00:00:00 ' , ' 2020 -01 -01 01:00:00 ' ,
' 2020 -01 -01 02:00:00 ' , ' 2020 -01 -01 03:00:00 ' ,
' 2020 -01 -01 04:00:00 ' , ' 2020 -01 -01 05:00:00 ' ,
' 2020 -01 -01 06:00:00 ' , ' 2020 -01 -01 07:00:00 ' ,
' 2020 -01 -01 08:00:00 ' , ' 2020 -01 -01 09:00:00 ' ,
' 2020 -01 -01 10:00:00 ' , ' 2020 -01 -01 11:00:00 ' ,
' 2020 -01 -01 12:00:00 ' ] ,
dtype = ' datetime64 [ ns ] ' , freq = ' H ' )

The date_range returns an instance of DatetimeIndex, which can be used,

for example, as an index for a Series or DataFrame object.
ts1 = pd . Series ( np . arange (31) , index = pd . date_range ( " 2020 -1 -1 " ,
periods = 31))

28 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

The elements of a DatetimeIndex object can, for example, be accessed using

indexing with date and time strings.

An element in a DatetimeIndex is of the type Timestamp, which is a pandas

object that extends the standard Python datetime object (see the datetime
module in the Python standard library).
ts1 [ " 2020 -1 -3 " ]
Out : 2
ts1 . index [2]
Out : Timestamp ( ' 2020 -01 -03 00:00:00 ' , freq = ' D ' )

Timestamp class has, like the datetime class, attributes for accessing time
fields such as year, month, day, hour, minute, and so on.

The main difference between Timestamp and datetime: Timestamp stores a

timestamp with nanosecond resolution, while a datetime object only uses
microsecond resolution.
ts1 . index [2]. year , ts1 . index [2]. month , ts1 . index [2]. day
Out : (2020 , 1 , 3)
ts1 . index [2]. nanosecond
Out : 0

29 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

We can convert a Timestamp object to a standard Python datetime object

using the to_pydatetime method.
ts1 . index [2]. to_pydatetime ()
Out : datetime . datetime (2020 , 1 , 3 , 0 , 0)

We can use a list of datetime objects to create pandas time series.

import datetime as dt
ts2 = pd . Series ( np . random . rand (2) ,
index = [ dt . datetime (2020 , 1 , 1) ,
dt . datetime (2020 , 2 , 1)])
ts2
Out :
2020 -01 -01 0.371747
2020 -02 -01 0.845133
dtype : float64

30 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

Data that are defined for sequences of time spans can be represented using
Series and DataFrame objects that are indexed using the PeriodIndex class.

We can construct an instance of the PeriodIndex class explicitly by passing a

list of Period objects and then specify it as index when creating a Series or
DataFrame object.
periods = pd . PeriodIndex ([ pd . Period ( ' 2020 -01 ' ) ,
pd . Period ( ' 2020 -02 ' ) ,
pd . Period ( ' 2020 -03 ' )])
ts3 = pd . Series ( np . random . rand (3) , index = periods )
ts3
Out :
2020 -01 0.086880
2020 -02 0.010098
2020 -03 0.159337
Freq : M , dtype : float64

We can also convert a Series or DataFrame object indexed by a

DatetimeIndex object to a PeriodIndex using the to_period method (which
takes an argument that specifies the period frequency, here ’M’ for month).
ts2 . to_period ( ' M ' )
Out :
2020 -01 0.371747
2020 -02 0.845133
Freq : M , dtype : float64

31 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series
Next, we look at the manipulation of two time series that contain sequences of
temperature measurements at given timestamps.

We have one dataset for an indoor temperature sensor and one dataset for an
outdoor temperature sensor.

Observations approximately every 10 minutes during most of 2014.

The two data files, indoor.csv and outdoor.csv, are comma-separated values
format files with two columns:
the first column contains UNIX timestamps: seconds since Jan 1, 1970,

the second column is the measured temperature in degree Celsius.

df1 = pd . read_csv ( ' outdoor . csv ' , names = [ " time " , " outdoor " ])
df2 = pd . read_csv ( ' indoor . csv ' , names = [ " time " , " indoor " ])

Once we have created DataFrame objects for the time-series data, inspect the
data by displaying the first few lines.
df1 . head ()
Out :
time outdoor
0 1388530986 4.38
1 1388531586 4.25
2 1388532187 4.19

32 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

To represent the data as a meaningful time-series data, first convert the UNIX
timestamps to date and time objects using to_datetime with the unit = "s"
argument.

Then, we localize the timestamps (assigning a time zone) using tz_localize

and convert the time zone attribute to the Europe/Stockholm time zone using
tz_convert.

We also set the time column as index using set_index.

df1 . time = ( pd . to_datetime ( df1 . time . values , unit = " s " ).
tz_localize ( ' UTC ' ). tz_convert ( ' Europe / Stockholm ' ))
df1 = df1 . set_index ( " time " )
df2 . time = ( pd . to_datetime ( df2 . time . values , unit = " s " ).
tz_localize ( ' UTC ' ). tz_convert ( ' Europe / Stockholm ' ))
df2 = df2 . set_index ( " time " )

df1 . head ()
Out :
outdoor
time
2014 -01 -01 00:03:06+01:00 4.38
2014 -01 -01 00:13:06+01:00 4.25
2014 -01 -01 00:23:07+01:00 4.19
2014 -01 -01 00:33:07+01:00 4.06
2014 -01 -01 00:43:08+01:00 4.06

33 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series
The index now indeed is a date and time object.

Having the index of a time series represented as proper date and time objects
(in contrast to using, e.g., integers representing the UNIX timestamps) allows
us to easily perform many time-oriented operations.

Before we proceed to explore the data in more detail, we first plot the two time
series to obtain an idea of how the data looks like.
fig , ax = plt . subplots (1 , 1 , figsize = (12 , 4))
df1 . plot ( ax = ax )
df2 . plot ( ax = ax )

34 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

A common operation on time series is to select and extract parts of the data.

For example, from the full dataset that contains data for all of 2014, we may be
interested in selecting out and analyzing only the data for the month of January.

In pandas, we can accomplish this in a number of ways.

mask_jan = ( df1 . index >= " 2014 -1 -1 " ) & ( df1 . index < " 2014 -2 -1 " )
df1_jan = df1 [ mask_jan ]
df1_jan . info ()
Out :
< class ' pandas . core . frame . DataFrame ' >
DatetimeIndex : 4452 entries , 2014 -01 -01 00:03:06+01:00 to 2014 -01 -31 23:56:58+01:00
Data columns ( total 1 columns ):
outdoor 4452 non - null float64
dtypes : float64 (1)
memory usage : 69.6 KB
df2_jan = df2 [ " 2014 -1 -1 " : " 2014 -1 -31 " ]
fig , ax = plt . subplots (1 , 1 , figsize = (12 , 4))
df1_jan . plot ( ax = ax )
df2_jan . plot ( ax = ax )

35 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

Like the datetime class in Python’s standard library, the Timestamp class that
is used in pandas to represent time values has attributes for accessing fields
such as year, month, day, hour, minute, and so on.
36 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

These fields are particularly useful when processing time series.

Say, we wish to calculate the average temperature for each month of the year.

We begin by creating a new column month, which we assign to the month field
of the Timestamp values of the DatetimeIndex indexer.

To extract the month field from each Timestamp value, we first call
reset_index to convert the index to a column in the data frame (the new
DataFrame object falls back to using an integer index).

Then, we can use the apply function on the newly created time column.

37 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

df1_month = df1 . reset_index ()

df1_month [ " month " ] = df1_month . time . apply ( lambda x : x . month )
df1_month . head ()
Out :
time outdoor month
0 2014 -01 -01 00:03:06+01:00 4.38 1
1 2014 -01 -01 00:13:06+01:00 4.25 1
2 2014 -01 -01 00:23:07+01:00 4.19 1
3 2014 -01 -01 00:33:07+01:00 4.06 1
4 2014 -01 -01 00:43:08+01:00 4.06 1
df1_month = df1_month . groupby ( " month " ). aggregate ( np . mean )

df2_month = df2 . reset_index ()

df2_month [ " month " ] = df2_month . time . apply ( lambda x : x . month )
df2_month = df2_month . groupby ( " month " ). aggregate ( np . mean )
df_month = df1_month . join ( df2_month )
# alternatively
# df_month = pd . concat ([ df . to_period (" M "). groupby ( level =0). mean ()
# for df in [ df1 , df2 ]] , axis =1)

fig , axes = plt . subplots (1 , 2 , figsize = (12 , 4))

df_month . plot ( kind = ' bar ' , ax = axes [0])
df_month . plot ( kind = ' box ' , ax = axes [1])

38 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

39 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

A very useful feature of the pandas time-series objects is the ability to up- and
down-sample the time series using the resample method.

Resampling means that the number of data points is either increased

(up-sampling) or decreased (down-sampling).

For up-sampling, we need to choose a method for filling in the missing values.

For down-sampling we need to choose a method for aggregating multiple

sample points between each new sample point.

Resample method expects as first argument a string, such as ’H’, ’D’, ’M’,
that specifies the new frequency of data.

It returns a resampler object for which we can invoke aggregation methods such
as mean and sum, in order to obtain the resampled data.

40 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

Consider the previous two time series with temperature data.

The original sampling frequency is about 10 minutes, which amounts to a lot of

data points over a year.

For plotting purposes, it is often necessary to down- sample the original data to
obtain less busy graphs and regularly spaced time series that can be readily
compared to each other.

Let’s resample the outdoor temperature time series to four different sampling
frequencies and plot the resulting time series.
df1_hour = df1 . resample ( " H " ). mean ()
df1_hour . columns = [ " outdoor ( hourly avg .) " ]
df1_day = df1 . resample ( " D " ). mean ()
df1_day . columns = [ " outdoor ( daily avg .) " ]
df1_week = df1 . resample ( " 7 D " ). mean ()
df1_week . columns = [ " outdoor ( weekly avg .) " ]
df1_month = df1 . resample ( " M " ). mean ()
df1_month . columns = [ " outdoor ( monthly avg .) " ]
df_diff = ( df1 . resample ( " D " ). mean (). outdoor - df2 . resample ( " D " ).
mean (). indoor )

41 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

fig , ( ax1 , ax2 ) = plt . subplots (2 , 1 , figsize = (12 , 6))

df1_hour . plot ( ax = ax1 , alpha = 0.25)
df1_day . plot ( ax = ax1 )
df1_week . plot ( ax = ax1 )
df1_month . plot ( ax = ax1 )
df_diff . plot ( ax = ax2 )
ax2 . set_title ( " temperature difference between outdoor and indoor " )
fig . tight_layout ()

42 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

For up-sampling, consider the following example where we resample the data
frame df1 to a sampling frequency of 5 minutes using three different
aggregation methods
mean,

ffill for forward-fill,

bfill for back-fill.

The result is three new data frames that we will combine into a single
DataFrame object.
pd . concat (
[ df1 . resample ( " 5 min " ). mean (). rename ( columns ={ " outdoor " : ' None ' }) ,
df1 . resample ( " 5 min " ). ffill (). rename ( columns ={ " outdoor " : ' ffill ' }) ,
df1 . resample ( " 5 min " ). bfill (). rename ( columns ={ " outdoor " : ' bfill ' })
],
axis = 1). head ()

43 / 44
More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

Time series

Out :
None ffill bfill
time
2014 -01 -01 00:00:00+01:00 4.38 NaN 4.38
2014 -01 -01 00:05:00+01:00 NaN 4.38 4.25
2014 -01 -01 00:10:00+01:00 4.25 4.38 4.25
2014 -01 -01 00:15:00+01:00 NaN 4.25 4.19
2014 -01 -01 00:20:00+01:00 4.19 4.25 4.19

Note that every second data point is a new sample point.

Depending on the value of the aggregation method, those values are filled (or
not) according to the specified strategies.

When no fill strategy is selected, the corresponding values are marked as

missing using the NaN value.

44 / 44

Learning Pandas PDF
No ratings yet
Learning Pandas PDF
171 pages
EG502A (FINAL) Well Test - Analysis and Design
No ratings yet
EG502A (FINAL) Well Test - Analysis and Design
5 pages
(S4) Strategic Management Accounting: Part - A
No ratings yet
(S4) Strategic Management Accounting: Part - A
8 pages
F - Test
No ratings yet
F - Test
9 pages
Data Science Data Manipulation With Pandas
No ratings yet
Data Science Data Manipulation With Pandas
77 pages
Dsp Unit-5 Updated
No ratings yet
Dsp Unit-5 Updated
23 pages
Pandas
No ratings yet
Pandas
94 pages
07 Data Wrangling
No ratings yet
07 Data Wrangling
51 pages
IV Unit Fds
No ratings yet
IV Unit Fds
16 pages
Module_4
No ratings yet
Module_4
38 pages
Unit 4 DSE
No ratings yet
Unit 4 DSE
9 pages
Pandas
No ratings yet
Pandas
26 pages
Combining Datasets
No ratings yet
Combining Datasets
36 pages
Introduction To Pandas in Data Analytics
No ratings yet
Introduction To Pandas in Data Analytics
12 pages
Pandas Data Wrangling Cheatsheet Datacamp PDF
No ratings yet
Pandas Data Wrangling Cheatsheet Datacamp PDF
1 page
Chapter 4
No ratings yet
Chapter 4
40 pages
Ch8 Data Wrangling Join, Combine, and Reshape
No ratings yet
Ch8 Data Wrangling Join, Combine, and Reshape
13 pages
Joining Data 4
No ratings yet
Joining Data 4
40 pages
Chapter 4
No ratings yet
Chapter 4
40 pages
UNIT IV Material
No ratings yet
UNIT IV Material
23 pages
pandas_merged
No ratings yet
pandas_merged
2 pages
Pandas Cheat Sheet Final
No ratings yet
Pandas Cheat Sheet Final
1 page
Pandas - Powerful Python Data Analysis Toolkit
No ratings yet
Pandas - Powerful Python Data Analysis Toolkit
95 pages
Lecture 8 - Data Wrangling Using Pandas
No ratings yet
Lecture 8 - Data Wrangling Using Pandas
31 pages
Python Lecture 5 (2025)
No ratings yet
Python Lecture 5 (2025)
29 pages
Chapter 2 Python Pandas - II
No ratings yet
Chapter 2 Python Pandas - II
19 pages
python 2.1.2 (2)
No ratings yet
python 2.1.2 (2)
7 pages
Loki Temp PPT Pandas 2
No ratings yet
Loki Temp PPT Pandas 2
31 pages
Pandas Cheat Sheet
100% (1)
Pandas Cheat Sheet
2 pages
99c949c0-5910-425f-9ac5-155882800fa5
No ratings yet
99c949c0-5910-425f-9ac5-155882800fa5
36 pages
Merge, Join, and Concatenate: Concatenating Objects
No ratings yet
Merge, Join, and Concatenate: Concatenating Objects
62 pages
Python For DS Unit4
No ratings yet
Python For DS Unit4
11 pages
Notes - EDA-Unit2 (1)
No ratings yet
Notes - EDA-Unit2 (1)
43 pages
Lecture 14
No ratings yet
Lecture 14
33 pages
Pandas
No ratings yet
Pandas
44 pages
Pandas
No ratings yet
Pandas
13 pages
Python Modules
No ratings yet
Python Modules
14 pages
Python Pandas
No ratings yet
Python Pandas
2 pages
01-Numpy & Pandas
No ratings yet
01-Numpy & Pandas
69 pages
Praveen PPT
No ratings yet
Praveen PPT
9 pages
Pandas Moderate
No ratings yet
Pandas Moderate
15 pages
Python Libraries Cheat Sheets
No ratings yet
Python Libraries Cheat Sheets
6 pages
Lab-3 Pandas Library
No ratings yet
Lab-3 Pandas Library
14 pages
EXP-3
No ratings yet
EXP-3
10 pages
Python Programming Pandas Across Examples
No ratings yet
Python Programming Pandas Across Examples
350 pages
Unit-4Introduction To Pandas
No ratings yet
Unit-4Introduction To Pandas
44 pages
Pandas PDF
No ratings yet
Pandas PDF
171 pages
10 Minutes to Pandas
No ratings yet
10 Minutes to Pandas
26 pages
Pandas_Data_Analytics
No ratings yet
Pandas_Data_Analytics
61 pages
3Y3Z2Xzqn7 U Y%K : 2. How To Create A Data Frame Using A Dictionary of Pre-Existing Columns or Numpy 2D Arrays?
No ratings yet
3Y3Z2Xzqn7 U Y%K : 2. How To Create A Data Frame Using A Dictionary of Pre-Existing Columns or Numpy 2D Arrays?
8 pages
DevOps Session 3 Pandas.pptx
No ratings yet
DevOps Session 3 Pandas.pptx
33 pages
Data Wrangling and Analysis
100% (1)
Data Wrangling and Analysis
36 pages
EDP-3[1]
No ratings yet
EDP-3[1]
16 pages
Pandas
No ratings yet
Pandas
25 pages
Pandas 1705297450
No ratings yet
Pandas 1705297450
21 pages
Pandas
No ratings yet
Pandas
4 pages
Data Analysis With Python
No ratings yet
Data Analysis With Python
60 pages
Data Mining_Week - 4
No ratings yet
Data Mining_Week - 4
8 pages
Python for Data Science: A Hands-On Introduction
From Everand
Python for Data Science: A Hands-On Introduction
Yuli Vasiliev
No ratings yet
Data Structures in C / C ++: Exercises and Solved Problems
From Everand
Data Structures in C / C ++: Exercises and Solved Problems
Fulbia Torres
No ratings yet
Python for Data Science For Dummies
From Everand
Python for Data Science For Dummies
John Paul Mueller
No ratings yet
Learning Pandas 2.0: A Comprehensive Guide to Data Manipulation and Analysis for Data Scientists and Machine Learning Professionals
From Everand
Learning Pandas 2.0: A Comprehensive Guide to Data Manipulation and Analysis for Data Scientists and Machine Learning Professionals
Matthew Rosch
No ratings yet
Mastering Pandas in Python: Course Book
From Everand
Mastering Pandas in Python: Course Book
Pedro Martins
No ratings yet
Frequency Ratio List
No ratings yet
Frequency Ratio List
5 pages
Example Design of Circular Beam ACI 1999
100% (4)
Example Design of Circular Beam ACI 1999
5 pages
Non Newtonian Fluids
No ratings yet
Non Newtonian Fluids
2 pages
Motion Class 9 Extra Questions Science Chapter 8 - Learn CBSE
No ratings yet
Motion Class 9 Extra Questions Science Chapter 8 - Learn CBSE
28 pages
Data Structure Note by Bhupendra Saud
No ratings yet
Data Structure Note by Bhupendra Saud
135 pages
Scilab Textbook Companion For Antenna and Wave Propagation
No ratings yet
Scilab Textbook Companion For Antenna and Wave Propagation
176 pages
Early-Age Thermal Cracking in Concrete Structures - The Role of Zero-Stress Temperature?
No ratings yet
Early-Age Thermal Cracking in Concrete Structures - The Role of Zero-Stress Temperature?
8 pages
Properties of Triangle - Exercise Module-2
No ratings yet
Properties of Triangle - Exercise Module-2
13 pages
Relational Algebra To SQ L
100% (1)
Relational Algebra To SQ L
82 pages
Experiment 19
No ratings yet
Experiment 19
4 pages
Example of A Finite Element Analysis of A Beam: Lin DEF LC 1
No ratings yet
Example of A Finite Element Analysis of A Beam: Lin DEF LC 1
3 pages
Cart-Pole Optimal Control - Dymos
No ratings yet
Cart-Pole Optimal Control - Dymos
6 pages
Causation, Prediction, and Search: Second Edition
100% (1)
Causation, Prediction, and Search: Second Edition
567 pages
First-Order Logic: A Concise Introduction John Heil - Download the full ebook set with all chapters in PDF format
100% (5)
First-Order Logic: A Concise Introduction John Heil - Download the full ebook set with all chapters in PDF format
72 pages
EE 402: Electrical and Electronics Measurements Contact Hours: 4L +0T 40hours Credits: 4 Lecture: 40hours
No ratings yet
EE 402: Electrical and Electronics Measurements Contact Hours: 4L +0T 40hours Credits: 4 Lecture: 40hours
65 pages
Diffusion Models & Representation Learning
No ratings yet
Diffusion Models & Representation Learning
21 pages
HEAT Transfer Modelling in Exhaust Systems of High-Performance Two-Stroke Engines
No ratings yet
HEAT Transfer Modelling in Exhaust Systems of High-Performance Two-Stroke Engines
34 pages
Department of Mechanical Engineering
No ratings yet
Department of Mechanical Engineering
10 pages
CSE330 Quiz Solutions
No ratings yet
CSE330 Quiz Solutions
5 pages
Very Difficult Sudoku - 50 Printable Puzzles With Answers
100% (1)
Very Difficult Sudoku - 50 Printable Puzzles With Answers
55 pages
PH101 v1-0 Course Overview 2021-0421
No ratings yet
PH101 v1-0 Course Overview 2021-0421
2 pages
Calculations in FSW
No ratings yet
Calculations in FSW
3 pages
Modular Multilevel Converters Gijon
No ratings yet
Modular Multilevel Converters Gijon
61 pages
BLUEPRINT 2025 10
No ratings yet
BLUEPRINT 2025 10
25 pages
Russian Math Homework Wellesley
100% (1)
Russian Math Homework Wellesley
8 pages
The Effect of NPL, Car, LDR, Oer and Nim To Banking Return On Asset
No ratings yet
The Effect of NPL, Car, LDR, Oer and Nim To Banking Return On Asset
16 pages
Viscoelastic Properties of Bamboo: JOURNALOFMATERIALSSCIENCE32 (1997) 2693-2697
No ratings yet
Viscoelastic Properties of Bamboo: JOURNALOFMATERIALSSCIENCE32 (1997) 2693-2697
5 pages

Notes For Python Part III

Uploaded by

Notes For Python Part III

Uploaded by

More on Pandas group by: split - apply - combine Pivot Tables Time Series Data

You can think of an index object as an immutable array.

We saw different ways to slice Series/DataFrame objects.

A third indexing attribute is ix which is a hybrid of the two.

Higher dimensional data can be stored using hierarchical indexing

It incorporates multiple index levels within a single index.

These objects are MultiIndex objects.

We can add more columns.

Let’s compute fraction of people by year and present in a wide format.

Concat and Append

def make_dataframe ( cols , ind ):

df1 = make_dataframe ( ' AB ' , [1 ,2])

merge() functions implements several type of joins: the one-to-one,

employee group hire_date supervisor

employee group skills

left on and right on keywords

overlapping column names

It is possible customize suffixes using suffixes argument.

The most important operations performed by Group-by objects are aggregate,

To understand these functionalities of a Group-by object, we will use the

planets . dropna (). describe ()

planets . method . value_counts ()

group by: aggregate

aggregate() method can be combine with groupby(). It can take a string, a

group by: filter

planets . groupby ( ' method ' ). filter ( my_filter )

group by: transform

group by: apply

apply() method allows for applying an arbitrary function to group results.

Let’s normalize orbital_period by subtracting its mean and then dividing by

planets . groupby ( ' method ' ). apply ( my_normalize )

A pivot table operation is similar to GroupBy but operates on tabular data.

It is easier to think about it as a multidimensional version of GroupBy

We will use the titanic dataset available in seaborn module.

Let’s take a look at survival rate by gender.

titanic . groupby ( ' sex ' )[ ' survived ' ]. mean ()

This is an example of two-dimensional group by and the code is starting to look

Pandas offers a convenient tool, pivot_table, which succintly handles this

Time series refer to data in which quantities are observed at regularly or

When dealing with temporal data, it is particularly useful to be able to index

Using pandas time-series indexers, DatetimeIndex and PeriodIndex, we can

To generate a sequence of dates that can be used as an index in a pandas

For example, to generate hourly timestamps between 00:00 and 12:00 on

The date_range returns an instance of DatetimeIndex, which can be used,

The elements of a DatetimeIndex object can, for example, be accessed using

An element in a DatetimeIndex is of the type Timestamp, which is a pandas

The main difference between Timestamp and datetime: Timestamp stores a

We can convert a Timestamp object to a standard Python datetime object

We can use a list of datetime objects to create pandas time series.

We can construct an instance of the PeriodIndex class explicitly by passing a

We can also convert a Series or DataFrame object indexed by a

Observations approximately every 10 minutes during most of 2014.

the second column is the measured temperature in degree Celsius.

Then, we localize the timestamps (assigning a time zone) using tz_localize

We also set the time column as index using set_index.

In pandas, we can accomplish this in a number of ways.

These fields are particularly useful when processing time series.

df1_month = df1 . reset_index ()

df2_month = df2 . reset_index ()

fig , axes = plt . subplots (1 , 2 , figsize = (12 , 4))

Resampling means that the number of data points is either increased

For down-sampling we need to choose a method for aggregating multiple

Consider the previous two time series with temperature data.

The original sampling frequency is about 10 minutes, which amounts to a lot of

fig , ( ax1 , ax2 ) = plt . subplots (2 , 1 , figsize = (12 , 6))

ffill for forward-fill,

bfill for back-fill.

Note that every second data point is a new sample point.

When no fill strategy is selected, the corresponding values are marked as

You might also like