Data Aggregation
▸ GroupBy Mechanics
▸ Data Aggregation
▸ Time Series
Why Data Aggregation and Group Operations?
▸ Majors
▸ GPAs
▸ Years
Split-Apply-Combine
▸ Method: <Series/DataFrame>.groupby(<keys>)
▸ R
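As a minimal sketch of split-apply-combine with groupby, the example below splits a small table by key, applies a mean to each piece, and combines the results. The data and the column names (major, year, gpa) are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({
    "major": ["CS", "CS", "Math", "Math", "Math"],
    "year": [1, 2, 1, 1, 2],
    "gpa": [3.2, 3.8, 3.0, 3.4, 3.8],
})

# Split by major, apply mean to the gpa column, combine into a Series.
mean_gpa = df.groupby("major")["gpa"].mean()

# Several keys produce a result with a hierarchical (MultiIndex) index.
mean_gpa_by_year = df.groupby(["major", "year"])["gpa"].mean()

print(mean_gpa)
print(mean_gpa_by_year)
```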
Iterating Over Groups
▸ Group by multiple keys
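A GroupBy object can be iterated directly. The sketch below, on a made-up DataFrame with hypothetical columns key1, key2, and data, shows that each iteration yields a group name and the matching rows, and that the name becomes a tuple when grouping by multiple keys.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data": [1, 2, 3, 4, 5],
})

# Single key: each iteration yields (group name, sub-DataFrame).
sizes = {}
for name, group in df.groupby("key1"):
    sizes[name] = len(group)

# Multiple keys: the group name is a tuple of key values.
totals = {}
for (k1, k2), group in df.groupby(["key1", "key2"]):
    totals[(k1, k2)] = int(group["data"].sum())

print(sizes)
print(totals)
```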
Data Aggregation
▸ Can aggregate a slice of the dataset: select columns with square brackets and column names
▸ Can define your own aggregation functions and pass them to the agg method
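The two points above can be sketched as follows. The DataFrame and the helper peak_to_peak are hypothetical and exist only to illustrate column selection and a user-defined aggregation.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, 4.0, 2.0, 8.0],
    "y": [10.0, 20.0, 30.0, 50.0],
})

grouped = df.groupby("key")

# Aggregate only column "x" by selecting it with square brackets.
x_mean = grouped["x"].mean()


def peak_to_peak(values):
    """Custom aggregation: range (max minus min) of the values in a group."""
    return values.max() - values.min()


# Pass the custom function to agg; it is applied to each column of each group.
ranges = grouped[["x", "y"]].agg(peak_to_peak)

print(x_mean)
print(ranges)
```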
Groupby Aggregation Operations
Data Aggregation Examples
Data Aggregation Examples
▸ Consider the DataFrame tips with columns: total_bill, tip, smoker, day, time,
size, tip_pct
▸ 1. Create a Series of the average tip percentage for each size on each day.
▸ 2. Create a DataFrame of the sum of the total bill for each time on each day.
Exercise 1 Solution
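One possible solution is sketched below. The handful of rows is invented so the code runs on its own, and tip_pct is assumed here to be tip divided by total_bill; the real tips dataset may define it differently.

```python
import pandas as pd

# A few made-up rows standing in for the real tips dataset.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Sun", "Sat"],
    "time": ["Dinner", "Dinner", "Dinner", "Lunch", "Lunch", "Dinner"],
    "size": [2, 3, 3, 2, 2, 4],
})
# Assumption: tip percentage is tip / total_bill.
tips["tip_pct"] = tips["tip"] / tips["total_bill"]

# 1. Series of the mean tip_pct, indexed by (day, size).
avg_tip_pct = tips.groupby(["day", "size"])["tip_pct"].mean()

# 2. DataFrame of summed total_bill by (day, time);
#    double brackets keep the result a DataFrame instead of a Series.
bill_sum = tips.groupby(["day", "time"])[["total_bill"]].sum()

print(avg_tip_pct)
print(bill_sum)
```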
Apply: General split-apply-combine
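As a rough illustration of apply, the sketch below passes a function that returns more than one value per group, something a plain aggregation cannot do. The data and the helper top are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun"],
    "tip_pct": [0.10, 0.25, 0.18, 0.30, 0.12],
})


def top(group, n=2):
    """Return the n largest values of one group's Series."""
    return group.nlargest(n)


# apply calls top on each group's Series and concatenates the pieces;
# extra keyword arguments (here n) are forwarded to the function.
top_tips = df.groupby("day")["tip_pct"].apply(top, n=1)

print(top_tips)
```

The result is indexed by the group key plus the original row label, so it stays clear which row each value came from.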
Pivot Tables
▸ Aggregate a table of data by one or more keys with some keys along the rows
(index), and some along the columns (columns)
▸ List the average values of each day for both smokers and
non-smokers.
Average value of
aggregated data
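A minimal pivot table sketch, using invented rows in place of the tips dataset: day goes along the rows, smoker along the columns, and each cell holds a mean, which is the default aggregation of pivot_table.

```python
import pandas as pd

# Made-up rows standing in for the tips dataset.
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "smoker": ["No", "Yes", "No", "No", "Yes", "Yes"],
    "total_bill": [20.0, 30.0, 40.0, 10.0, 50.0, 70.0],
    "tip": [2.0, 3.0, 6.0, 1.0, 5.0, 9.0],
})

# Rows: day; columns: smoker; cells: mean of each value column.
table = tips.pivot_table(values=["total_bill", "tip"],
                         index="day", columns="smoker")

print(table)
```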
Pivot Table Options
Pivot Tables: Examples with More Parameters
Pivot Tables: More Examples
Sources: http://pbpython.com/pandas-pivot-table-explained.html
Cross Table
Index Column
Cross Table: More Examples
▸ List the number of smokers on each day at different times. Include the row/column
subtotals and the grand total as well.
Equivalent to
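One way to approach this task is sketched below with pd.crosstab on invented rows. The margins=True option adds the "All" subtotals and grand total. The groupby version that follows is just one alternative that reproduces the inner counts without totals; it is not necessarily the equivalent form shown on the original slide.

```python
import pandas as pd

# Made-up rows standing in for the tips dataset.
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun", "Thur"],
    "time": ["Dinner", "Dinner", "Dinner", "Lunch", "Dinner", "Lunch"],
    "smoker": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
})

# Rows: (time, day); columns: smoker; cells: frequency counts.
# margins=True appends an "All" row and column with the totals.
counts = pd.crosstab([tips["time"], tips["day"]], tips["smoker"], margins=True)

# The same inner counts (no totals) built by hand with groupby.
by_hand = tips.groupby(["time", "day", "smoker"]).size().unstack(fill_value=0)

print(counts)
print(by_hand)
```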
▸ Time Series
▸ Intro to Date&Time
▸ Time Zone
▸ Resampling
▸ Window Functions
Dates and Times
▸ To a computer, time is the number of seconds elapsed since the Unix epoch (1 Jan.
1970 00:00:00 UTC)
▸ Usually broken down into years, months, days, hours, minutes, seconds, etc.
▸ Time zones
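The epoch-based representation can be seen with the Python standard library. The sketch below converts second counts into calendar fields; the timestamp 1,000,000,000 is an arbitrary value chosen for illustration.

```python
import time
from datetime import datetime, timezone

# Seconds elapsed since the Unix epoch, as a float.
now_seconds = time.time()

# Timestamp 0 is the epoch itself.
epoch = datetime.fromtimestamp(0, tz=timezone.utc)

# Break a timestamp down into calendar fields (interpreted in UTC).
moment = datetime.fromtimestamp(1_000_000_000, tz=timezone.utc)
fields = (moment.year, moment.month, moment.day,
          moment.hour, moment.minute, moment.second)

print(epoch.isoformat())
print(fields)
```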
Time Series
▸ Important form of structured data in many fields, such as finance, ecology, and
economics
Sources: http://evafengeva.blogspot.com/2016/01/fang-stock-correlation-analysis.html
Date and Time in Python
Datetime in pandas
▸ The following two lists indicate the single home prices in New York City from
2016 to 2018.
time = ['2016-06', '2016-12', '2017-03',
        '2017-06', '2017-09', '2018-03']
prices = [379100, 388000, 393500, 403100, 409700, 423500]
▸ 1. Create a Series with the times as a datetime index and the prices as
values.
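A possible solution sketch: pd.to_datetime turns the year-month strings into timestamps, which then serve as the index. The explicit format string and the variable name home_prices are choices made here, not part of the exercise.

```python
import pandas as pd

time = ['2016-06', '2016-12', '2017-03',
        '2017-06', '2017-09', '2018-03']
prices = [379100, 388000, 393500, 403100, 409700, 423500]

# Parse the year-month strings; each becomes the first day of that month.
index = pd.to_datetime(time, format="%Y-%m")

# The DatetimeIndex enables date-based selection such as .loc["2017"].
home_prices = pd.Series(prices, index=index)

print(home_prices)
print(home_prices.loc["2017"])
```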
Generating Date Ranges and Frequency
▸ Range:
▸ Create a DatetimeIndex which shows all the business days from 01/22/2019
to 01/31/2019.
Exercise 3 Solution
▸ Create a DatetimeIndex which shows all the business days from 01/22/2019 to
01/31/2019.
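One way to do this is pd.date_range with the business-day frequency "B", sketched below. The dates are written in ISO form to avoid month/day ambiguity. Note that "B" skips weekends only and knows nothing about public holidays.

```python
import pandas as pd

# Business days (Monday to Friday) between the two dates, inclusive.
days = pd.date_range(start="2019-01-22", end="2019-01-31", freq="B")

print(days)
```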
Date Shifting
▸ Date shifting refers to moving data backward and forward through time
▸ Both Series and DataFrame have a shift method for naive shifts, which leaves the
index unmodified
▸ Shifting by time: pass a freq argument to shift so that the timestamps move instead of the data
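The difference between the two kinds of shift can be sketched on a small made-up daily series:

```python
import pandas as pd

# Hypothetical daily series for illustration.
ts = pd.Series([1.0, 2.0, 3.0, 4.0],
               index=pd.date_range("2019-01-01", periods=4, freq="D"))

# Naive shift: values move one position later, the index stays the same,
# and the first slot becomes NaN.
lagged = ts.shift(1)

# Shift by time: with freq, the index moves one day and no data is lost.
moved = ts.shift(1, freq="D")

print(lagged)
print(moved)
```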
Time Zone Handling
▸ Time zones are expressed as offsets from UTC (Coordinated Universal Time), ranging from UTC-12 to UTC+14
▸ For example, New York time: UTC-5 (UTC-4 in daylight saving time)
▸ By default, time series in pandas are time zone naive (no time zone setting).
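A short sketch of moving from a naive series to a time zone aware one: tz_localize attaches a zone, and tz_convert translates to another. The two timestamps are made up, with one in winter and one in summer so that the daylight saving offset shows up.

```python
import pandas as pd

# Naive timestamps: no time zone attached.
ts = pd.Series([1, 2],
               index=pd.to_datetime(["2019-01-15 12:00", "2019-07-15 12:00"]))
is_naive = ts.index.tz is None

# Declare that the naive times are in UTC.
utc = ts.tz_localize("UTC")

# Convert to New York time: UTC-5 in winter, UTC-4 in summer.
ny = utc.tz_convert("America/New_York")

print(ny)
```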
Resampling
▸ Refers to the process of converting a time series from one frequency to another
▸ In pandas, can call resample to group the data and then call an aggregation
function
▸ e.g.,
ts.resample('D').mean()
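To make the call above concrete, here is a small sketch with made-up readings taken twice a day, resampled to one mean per day:

```python
import pandas as pd

# Hypothetical readings taken every 12 hours.
ts = pd.Series(
    [1.0, 3.0, 5.0, 9.0],
    index=pd.to_datetime(["2019-01-01 00:00", "2019-01-01 12:00",
                          "2019-01-02 00:00", "2019-01-02 12:00"]),
)

# Group the points into daily bins, then aggregate each bin with mean.
daily = ts.resample("D").mean()

print(daily)
```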
Resampling Example
Downsampling
▸ The desired frequency defines bin edges to slice the time series into intervals
to aggregate
▸ Each interval is half-open (only one side is included) so that every data point
belongs to exactly one interval
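The half-open intervals can be sketched with the closed and label arguments of resample, which select the included side of each bin and the edge used as its label. The one-minute series below is invented for illustration.

```python
import pandas as pd

# Hypothetical one-minute data: values 0..5 at 00:00..00:05.
ts = pd.Series(range(6),
               index=pd.date_range("2019-01-01 00:00", periods=6, freq="min"))

# Left-closed bins [00:00, 00:03) and [00:03, 00:06), labeled by left edge.
left = ts.resample("3min", closed="left", label="left").sum()

# Right-closed bins (23:57, 00:00], (00:00, 00:03], (00:03, 00:06],
# labeled by right edge.
right = ts.resample("3min", closed="right", label="right").sum()

print(left)
print(right)
```

Either way, each point falls in exactly one bin; only the side that owns the boundary point changes.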
Upsampling
▸ No aggregation is needed; just consider how to fill the missing values that appear in the gaps
▸ By default, the gaps are filled with NaN
Upsampling (Cont.)
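A minimal upsampling sketch on a made-up two-point series: asfreq leaves the new slots as NaN, while ffill carries the last known value forward, optionally limited to a number of steps.

```python
import pandas as pd

# Hypothetical series with a three-day gap between observations.
ts = pd.Series([1.0, 2.0],
               index=pd.to_datetime(["2019-01-01", "2019-01-04"]))

# Upsample to daily frequency; new slots are NaN.
up = ts.resample("D").asfreq()

# Forward-fill the gaps with the previous observation.
filled = ts.resample("D").ffill()

# Forward-fill at most one step; the remaining gap stays NaN.
limited = ts.resample("D").ffill(limit=1)

print(up)
print(filled)
print(limited)
```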
Window Functions
▸ The basic idea is to apply functions over a window of time, get the results, and
then slide the window ahead, and continue
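The sliding-window idea can be sketched with rolling on a small made-up daily series. A three-point window is an arbitrary choice here.

```python
import pandas as pd

# Hypothetical daily series for illustration.
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
               index=pd.date_range("2019-01-01", periods=5, freq="D"))

# Mean over a sliding window of 3 points; the first two results are NaN
# because the window is not yet full.
rolling_mean = ts.rolling(window=3).mean()

# min_periods=1 allows partial windows at the start.
partial = ts.rolling(window=3, min_periods=1).mean()

# An expanding window grows from the start of the series instead of sliding.
expanding_mean = ts.expanding().mean()

print(rolling_mean)
print(partial)
print(expanding_mean)
```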