Data Aggregation

This document provides an outline and introduction to the topics of data aggregation, group operations, pivot tables, and time series analysis in exploratory data analysis and visualization. It discusses splitting data into groups, calculating statistics for each group, and rearranging data using pivot tables. It also introduces working with dates and times in Python and pandas, including converting between string and datetime formats.

Uploaded by

Gaurav

1

CSIT 553 EXPLORATORY DATA ANALYSIS AND VISUALIZATION

Instructor: Jiayin Wang


Class 5: Data Aggregation & Time Series
Outline 2

▸ Introduction to Data Aggregation and Group Operation

▸ GroupBy Mechanics

▸ Data Aggregation

▸ Apply: General split-apply-combine

▸ Pivot Tables and Crosstab

▸ Time Series
Why Data Aggregation and Group Operation? 3

▸ Sometimes we want to analyze data by groups. E.g., group students by

▸ Majors

▸ GPAs

▸ Years

▸ Calculate statistics based on these groups

▸ Perform our own type of transformation

▸ Create visualizations for each group


Techniques in Data Aggregation and Group Operation 4

▸ Split data into pieces using one or more keys

▸ Calculate group summary statistics (count, mean, standard deviation, or a user-defined function, etc.)

▸ Compute pivot tables and cross-tabulations


Split-Apply-Combine 6

▸ Coined by Hadley Wickham, 2011

▸ Split: split into groups based on one or more keys

▸ Apply: apply one function to each group, producing a new value

▸ Combine: the per-group results are combined into a single result object


Split-Apply-Combine 7
Example: Splitting 8

Sources: Hadley Wickham, 2011


Example: Apply (Counting) & Combine 9

Sources: Hadley Wickham, 2011


GroupBy in pandas 10

▸ Method: <Series/DataFrame>.groupby(<keys>)

▸ <keys> is the group key (one or more columns)

▸ Create a GroupBy object

▸ The groupby method doesn't compute anything until you apply an aggregation operation to the groups
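The lazy behaviour described above can be sketched as follows. The small `tips` table here is made-up sample data that only mimics the column names used in these slides; it is not the real dataset.

```python
import pandas as pd

# Made-up sample that mimics the `tips` columns (an assumption, not the real data)
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Thur"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "total_bill": [20.0, 30.0, 10.0, 40.0, 25.0],
    "tip": [3.0, 4.5, 2.0, 6.0, 5.0],
})

# groupby only records how to split the data; no statistics are computed yet
grouped = tips.groupby("day")
kind = type(grouped).__name__
print(kind)

# Calling an aggregation triggers the actual computation
mean_tip = grouped["tip"].mean()
print(mean_tip)
```

Here `mean_tip` is a Series indexed by the group key `day`.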
GroupBy in pandas 11

▸ R
Iterating Over Groups 12

▸ The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data

for name, group in tips.groupby('day'):
    print(name)    # the group key
    print(group)   # the rows belonging to that group

Group by multiple keys
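When grouping by more than one key, the group name in each 2-tuple is itself a tuple of key values. A minimal sketch on made-up sample data (the table below is an assumption, not the real `tips` dataset):

```python
import pandas as pd

# Made-up sample that mimics the `tips` columns
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Thur"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "tip": [3.0, 4.5, 2.0, 6.0, 5.0],
})

# With two keys, each group name is a (day, smoker) tuple
for (day, smoker), group in tips.groupby(["day", "smoker"]):
    print(day, smoker, len(group))

# The same iteration can build a dict from group name to group size
sizes = {key: len(group) for key, group in tips.groupby(["day", "smoker"])}
```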
Data Aggregation 13

▸ Refers to any data process in which information is gathered and expressed in a summary form

▸ Examples: count(), sum(), mean(), median()

▸ Can aggregate a slice of the dataset: use square brackets with column names

▸ Can define your own aggregation functions and pass them to the agg method
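The points above can be sketched on made-up sample data. The function name `peak_to_peak` is a hypothetical helper chosen for illustration.

```python
import pandas as pd

# Made-up sample that mimics the `tips` columns
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Thur"],
    "total_bill": [20.0, 30.0, 10.0, 40.0, 25.0],
    "tip": [3.0, 4.5, 2.0, 6.0, 5.0],
})
grouped = tips.groupby("day")

# Aggregate a slice: select columns with square brackets, then aggregate
mean_tip = grouped["tip"].mean()

# A user-defined aggregation: range (max minus min) of each group
def peak_to_peak(values):
    return values.max() - values.min()

spread = grouped[["tip", "total_bill"]].agg(peak_to_peak)

# Several aggregations at once; built-in ones can be named by string
summary = grouped["tip"].agg(["count", "mean", peak_to_peak])
print(summary)
```

In `summary`, the column for the custom function is labeled with the function's name.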
Groupby Aggregation Operations 14
Data Aggregation Examples 15
Data Aggregation Examples 16

▸ Can also apply .describe() to the groups


Exercise 1 17

▸ Consider the DataFrame tips with columns: total_bill, tip, smoker, day, time,
size, tip_pct

▸ 1. Create a Series of the average tip percentage for each size on each day.

▸ 2. Create a DataFrame of the sum of the total bill for each time on each day.
Exercise 1 Solution 18
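The solution on this slide was an image and is not preserved in the text. One possible solution is sketched below on a made-up four-row sample; the definition `tip_pct = tip / total_bill` is an assumption.

```python
import pandas as pd

# Made-up sample with the columns named in the exercise
tips = pd.DataFrame({
    "total_bill": [20.0, 10.0, 40.0, 25.0],
    "tip": [2.0, 2.0, 4.0, 5.0],
    "smoker": ["No", "Yes", "No", "No"],
    "day": ["Sat", "Sat", "Sun", "Thur"],
    "time": ["Dinner", "Dinner", "Dinner", "Lunch"],
    "size": [2, 2, 4, 2],
})
tips["tip_pct"] = tips["tip"] / tips["total_bill"]  # assumed definition

# 1. Series: average tip percentage for each size on each day
avg_pct = tips.groupby(["day", "size"])["tip_pct"].mean()

# 2. DataFrame: sum of the total bill for each time on each day
#    (selecting with a list of columns keeps the result a DataFrame)
bill_sum = tips.groupby(["day", "time"])[["total_bill"]].sum()

print(avg_pct)
print(bill_sum)
```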
Apply: General split-apply-combine 20

▸ During apply, functions are invoked on each group (piece)

▸ Then, all groups (pieces) are concatenated together

▸ You can pass your own function by .apply(<function>)
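A minimal sketch of passing your own function to apply, on made-up sample data. The helper `top` and its parameters are hypothetical names used for illustration. The columns are selected before calling apply so that the function receives only those columns of each group.

```python
import pandas as pd

# Made-up sample that mimics the `tips` columns
tips = pd.DataFrame({
    "smoker": ["No", "No", "Yes", "Yes"],
    "total_bill": [20.0, 10.0, 40.0, 25.0],
    "tip_pct": [0.10, 0.20, 0.15, 0.05],
})

def top(df, n=2, column="tip_pct"):
    """Return the n rows of df with the largest values in `column`."""
    return df.sort_values(column, ascending=False).head(n)

# The function is invoked on each group; the pieces are then concatenated
top_by_smoker = tips.groupby("smoker")[["total_bill", "tip_pct"]].apply(top, n=1)
print(top_by_smoker)
```

The result has a hierarchical index: the group key on the outer level and the original row labels on the inner level.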


Apply: General split-apply-combine 21
Pivot Tables 23

▸ A data summarization tool commonly found in spreadsheet programs

▸ Aggregates a table of data by one or more keys, with some keys along the rows (index) and some along the columns (columns)

▸ A combination of a groupby operation and a reshape operation utilizing hierarchical indexing

▸ pandas provides a top-level pandas.pivot_table function

▸ DataFrame has a pivot_table method


Pivot Tables: Simple Syntax 24

▸ Aggregate the table by one or more keys (columns)


<DataFrame>.pivot_table(index=<column(s)>)

▸ Creates a new DataFrame with the given index; by default, the values are the group means

▸ For example, list the average values of each day


Pivot Tables: More Examples 25

▸ List the average values of each day for both smokers and
non-smokers.

Two levels of index: day & smoker

Average value of aggregated data
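Both pivot tables (by day, then by day and smoker) can be sketched on made-up sample data. Passing `values` explicitly is a choice made here so that only numeric columns are averaged; the slides may have relied on pandas selecting them automatically.

```python
import pandas as pd

# Made-up sample that mimics the `tips` columns
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Thur"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "total_bill": [20.0, 30.0, 10.0, 40.0, 25.0],
    "tip": [3.0, 4.5, 2.0, 6.0, 5.0],
})

# One key on the rows: average values for each day (mean is the default)
by_day = tips.pivot_table(values=["total_bill", "tip"], index="day")

# Two keys on the rows: a two-level index of day and smoker
by_day_smoker = tips.pivot_table(values=["total_bill", "tip"],
                                 index=["day", "smoker"])
print(by_day)
print(by_day_smoker)
```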
Pivot Table Options 26
Pivot Tables: Examples with More Parameters 27
Pivot Tables: More Examples 28

Number of rows in each group

Can also set as:

len, np.mean, np.sum, np.max, np.min

Or:

'count', 'mean', 'sum', 'max', 'min'
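A sketch of the extra pivot_table parameters on made-up sample data, using the string form of the aggregation function. Which parameters the original slide examples used is not recoverable from the text, so the combination below is an assumption.

```python
import pandas as pd

# Made-up sample that mimics the `tips` columns
tips = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Thur"],
    "smoker": ["No", "Yes", "No", "No", "Yes"],
    "total_bill": [20.0, 30.0, 10.0, 40.0, 25.0],
    "tip": [3.0, 4.5, 2.0, 6.0, 5.0],
})

# Number of rows in each (day, smoker) group; empty combinations become 0
counts = tips.pivot_table(values="total_bill", index="day", columns="smoker",
                          aggfunc="count", fill_value=0)

# Sum of tips with row/column subtotals (labeled "All") added by margins=True
totals = tips.pivot_table(values="tip", index="day", columns="smoker",
                          aggfunc="sum", margins=True, fill_value=0)
print(counts)
print(totals)
```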


Pivot Tables: Cheat Sheet 29

Sources: http://pbpython.com/pandas-pivot-table-explained.html
Cross Table 30

▸ A special case of a pivot table to compute group frequencies

▸ Similar to pivot_table with aggfunc='count'

Index Column
Cross Table: More Example 31

▸ List the number of smokers and non-smokers for each day at different times. Include the row/column subtotals and the grand total as well.

Equivalent to

tips.pivot_table('total_bill', index=['time', 'day'],
                 columns=['smoker'], aggfunc='count',
                 margins=True, fill_value=0)
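The cross table itself was shown as an image. A sketch of a `pd.crosstab` call that should correspond to the pivot_table call above, on made-up sample data:

```python
import pandas as pd

# Made-up sample that mimics the `tips` columns
tips = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Dinner", "Lunch", "Lunch"],
    "day": ["Sat", "Sat", "Sun", "Thur", "Thur"],
    "smoker": ["No", "Yes", "Yes", "No", "No"],
    "total_bill": [20.0, 30.0, 10.0, 40.0, 25.0],
})

# Group frequencies: rows keyed by (time, day), columns keyed by smoker
ct = pd.crosstab([tips["time"], tips["day"]], tips["smoker"], margins=True)

# The pivot_table form: count the rows in each group
pt = tips.pivot_table("total_bill", index=["time", "day"],
                      columns=["smoker"], aggfunc="count",
                      margins=True, fill_value=0)
print(ct)
```

The last row and the last column hold the subtotals, and the bottom-right cell is the grand total.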
Outline 32

▸ Introduction to Data Aggregation and Group Operation

▸ Time Series

▸ Intro to Date&Time

▸ Date & Time in Python

▸ Date & Time in pandas

▸ Date Ranges, Frequencies, and Shifting

▸ Time Zone

▸ Resampling

▸ Window Functions
Dates and Times 33

▸ To a computer, time is the number of seconds elapsed since the Unix epoch (1 Jan. 1970 00:00:00 UTC)

▸ Usually broken down into years, months, days, hours, minutes, seconds, etc.

▸ There are lots of formats to express time:

▸ '2018-11-28' vs. '11/28/2018'

▸ 12-hour clock vs. 24-hour clock

▸ Time zones
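The seconds-since-epoch representation can be illustrated with the standard library. The example instant below is arbitrary.

```python
from datetime import datetime, timezone

# Zero seconds after the epoch is 1 Jan. 1970 00:00:00 UTC
epoch = datetime.fromtimestamp(0, tz=timezone.utc)
print(epoch)

# Any instant can be converted to seconds since the epoch and back
stamp = datetime(2018, 11, 28, 15, 30, tzinfo=timezone.utc)
seconds = stamp.timestamp()
round_trip = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(seconds, round_trip)
```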
Time Series 34

▸ Anything that is observed or measured at many points in time forms a time series

▸ Important form of structured data in many fields, such as finance, ecology, and economics

▸ Time series data may be referenced by:

▸ Timestamps, specific instants in time

▸ Fixed periods, such as the full year 2018

▸ Intervals of time, indicated by a start and end timestamp


Why Time Series? 35

▸ To identify trends, cycles, and seasonal variances to aid in the forecasting of a future event

Sources: http://evafengeva.blogspot.com/2016/01/fang-stock-correlation-analysis.html
Date And Time in Python 37

▸ Main module for date and time data: datetime


from datetime import datetime

▸ Stores both the date and the time, down to the microsecond

▸ datetime.now() returns the current date and time

▸ Can access the year, month, day, hour, etc. as attributes
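A short sketch of these points; the fixed date is arbitrary.

```python
from datetime import datetime

now = datetime.now()          # current local date and time
print(now)

# A fixed datetime, specified down to the microsecond
stamp = datetime(2018, 11, 28, 15, 30, 45, 123456)
print(stamp.year, stamp.month, stamp.day)
print(stamp.hour, stamp.minute, stamp.second, stamp.microsecond)
```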


Date And Time in Python 38

▸ Data types in the datetime module

▸ timedelta represents the difference between two datetime objects

▸ Can get the difference in days and seconds
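For instance, subtracting two datetime objects yields a timedelta whose `days` and `seconds` attributes hold the difference (the dates are arbitrary):

```python
from datetime import datetime, timedelta

delta = datetime(2018, 11, 28, 15, 30) - datetime(2018, 11, 1, 9, 0)
print(delta.days)      # whole days
print(delta.seconds)   # remaining seconds within the last partial day

# A timedelta can also be added to a datetime to move it in time
later = datetime(2018, 11, 1) + timedelta(days=12)
print(later)
```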


Converting Datetime to String 39

▸ str(<datetime>) converts a datetime to a string of the form “YYYY-MM-DD hh:mm:ss”

▸ .strftime(<format>) formats a datetime as a string using a given format specification
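A brief illustration of both conversions, using standard format codes such as `%Y`, `%m`, `%d`, `%H`, and `%M`:

```python
from datetime import datetime

stamp = datetime(2018, 11, 28, 15, 30, 45)

default_text = str(stamp)                      # ISO-like default form
us_style = stamp.strftime("%m/%d/%Y")          # month/day/year
clock_24 = stamp.strftime("%Y-%m-%d %H:%M")    # 24-hour clock
print(default_text, us_style, clock_24)
```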


Converting String to Datetime 40

▸ datetime.strptime: for known format

▸ dateutil.parser.parse: for unknown format
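A sketch of both parsing routes. `dateutil` is a third-party package (it is installed alongside pandas); the input strings are arbitrary examples.

```python
from datetime import datetime
from dateutil.parser import parse

# Known format: spell it out with format codes
known = datetime.strptime("11/28/2018", "%m/%d/%Y")

# Unknown format: let dateutil work it out
guessed = parse("Nov 28, 2018 3:30 PM")

# Ambiguous day/month order can be resolved with dayfirst=True
day_first = parse("28/11/2018", dayfirst=True)
print(known, guessed, day_first)
```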


Datetime in pandas 42

▸ pd.to_datetime(<arg>) function: converts date-like values (for example, an entire column) to datetimes

▸ arg can be a list, tuple, 1-D array, Series, or DataFrame

▸ Datetimes are stored in NumPy's datetime64 format

▸ A NaT is used to indicate a missing time value (similar to NaN)

▸ pd.Timestamp is the pandas equivalent of Python's datetime.datetime

▸ Can use time as the index
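These points can be sketched as follows. The date strings and price values are made up for illustration.

```python
import pandas as pd

raw = pd.Series(["2018-05-01", "2018-06-01", None])
dates = pd.to_datetime(raw)        # converts the whole column at once
print(dates.dtype)                 # a datetime64 dtype
print(dates.isna())                # the missing entry became NaT

first = dates.iloc[0]              # a single element is a pd.Timestamp

# Datetimes as the index of a Series (made-up values)
prices = pd.Series([100.0, 101.5],
                   index=pd.to_datetime(["2018-05-01", "2018-06-01"]))
print(prices)
```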


Datetime in pandas Examples 1 43

▸ Data: Average single home prices in New Jersey from 05/01/2018 to 10/01/2018 (source: Zillow Research)
Datetime in pandas Examples 2 44
Indexing, Selection, Subsetting 45

▸ Slicing also works

▸ Can select one date, or only a year (or a year and a month)

▸ Can select a range of dates
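The slide examples were images; a sketch of these selections on a made-up monthly series (not the Zillow data):

```python
import pandas as pd

# Six month-start dates from 2017-11-01 to 2018-04-01, with made-up values
idx = pd.date_range("2017-11-01", periods=6, freq="MS")
ts = pd.Series([100, 101, 102, 103, 104, 105], index=idx)

one_date = ts.loc["2018-02-01"]                 # a single date
one_year = ts.loc["2018"]                       # only a year
one_month = ts.loc["2017-12"]                   # a year and a month
date_range = ts.loc["2018-01-01":"2018-03-01"]  # a range, both ends included
print(one_date, one_year, one_month, date_range, sep="\n")
```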


Exercise 2 46

▸ The following two lists give the single home prices in New York City from 2016 to 2018.
time = ['2016-06', '2016-12', '2017-03',
        '2017-06', '2017-09', '2018-03']
prices = [379100, 388000, 393500, 403100, 409700, 423500]

▸ 1. Create a Series with the times as a datetime index and the prices as values.

▸ 2. Write the command to list the prices in 2017.


Exercise 2 Solution 47
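The solution on this slide was an image and is not preserved in the text. One possible solution, using the two lists from the exercise:

```python
import pandas as pd

time = ['2016-06', '2016-12', '2017-03',
        '2017-06', '2017-09', '2018-03']
prices = [379100, 388000, 393500, 403100, 409700, 423500]

# 1. Series with a datetime index and the prices as values
home_prices = pd.Series(prices, index=pd.to_datetime(time))

# 2. Prices in 2017, selected by passing only the year
prices_2017 = home_prices.loc["2017"]
print(prices_2017)
```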
Generating Date Ranges and Frequency 49

▸ pandas.date_range can generate a DatetimeIndex with a time range and frequency

▸ Range:

▸ by setting the start/end date time

▸ by setting the periods

▸ Frequency defines how the range is divided. Can be set as:

▸ year, month, week, day, hour, etc.

▸ and combinations of them (such as 1h30min)
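A sketch of the ways to define a range, with arbitrary dates. The 1h30min step is written as "90min" here, an equivalent spelling chosen because it is accepted across pandas versions.

```python
import pandas as pd

# Range given by start and end; the default frequency is daily
daily = pd.date_range(start="2019-01-01", end="2019-01-05")

# Range given by start and a number of periods, weekly (anchored on Sundays)
weekly = pd.date_range(start="2019-01-01", periods=4, freq="W")

# A combined frequency: every 1 hour 30 minutes
ninety_min = pd.date_range(start="2019-01-01 09:00", periods=3, freq="90min")
print(daily, weekly, ninety_min, sep="\n")
```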


Generating Date Ranges and Frequency Example 50
Date Frequencies 51
Exercise 3 52

▸ Create a DatetimeIndex which shows all the business days from 01/22/2019
to 01/31/2019.
Exercise 3 Solution 53

▸ Create a DatetimeIndex which shows all the business days from 01/22/2019 to
01/31/2019.
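The solution on this slide was an image and is not preserved in the text. One possible solution uses the business-day frequency "B", which skips weekends but knows nothing about public holidays:

```python
import pandas as pd

business_days = pd.date_range(start="2019-01-22", end="2019-01-31", freq="B")
print(business_days)

# pd.bdate_range uses the business-day frequency by default
same_days = pd.bdate_range(start="2019-01-22", end="2019-01-31")
```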
Date Shifting 54

▸ Shifting date refers to moving data backward and forward through time

▸ Both Series and DataFrame have a shift method for naive shifts, leaving the
index unmodified:

▸ Shifting by Time:
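The slide examples were images; a sketch of both kinds of shift on a made-up daily series:

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0, 4.0],
               index=pd.date_range("2019-01-01", periods=4, freq="D"))

# Naive shift: values move one step forward, the index is unchanged,
# and the vacated first position becomes NaN
naive = ts.shift(1)

# Shifting by time: passing freq moves the index and keeps every value
by_time = ts.shift(1, freq="D")

# A common use of the naive shift: change relative to the previous period
pct_change = ts / ts.shift(1) - 1
print(naive, by_time, pct_change, sep="\n")
```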
Time Zone Handling 56

▸ We often need to handle times in different time zones

▸ The current international standard is Coordinated Universal Time (UTC), the successor to Greenwich Mean Time

▸ All the other time zones are expressed as offsets from UTC (from UTC-12:00 to UTC+14:00)

▸ For example, New York time: UTC-5 (UTC-4 in daylight saving time)

▸ Time zones in Python: use the pytz library


Time Zone in Time Series 57

▸ By default, time series in pandas are time zone naive (no time zone setting).

▸ Can attach a time zone to a naive time series with the tz_localize method

▸ Can set the time zone when generating a date range by passing tz

▸ Can convert from one time zone to another with tz_convert()

▸ Operations between series in different time zones: the result is in UTC
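The slide example was an image; a sketch of these operations on a made-up series. The zone names are standard tz database names, and the dates are arbitrary winter dates chosen to stay clear of daylight saving changes.

```python
import pandas as pd

ts = pd.Series([0, 1, 2],
               index=pd.date_range("2019-01-15 09:30", periods=3, freq="D"))
naive_tz = ts.index.tz                      # None: time zone naive by default

# Attach a time zone to the naive series, then convert to another zone
new_york = ts.tz_localize("America/New_York")
berlin = new_york.tz_convert("Europe/Berlin")

# Set the time zone while generating the range
aware = pd.date_range("2019-01-15 09:30", periods=3, freq="D", tz="UTC")

# Combining series in different zones aligns on instants; result is in UTC
combined = new_york + berlin
print(combined)
```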


Time Zone in Time Series Example 58
Resampling 60

▸ Refers to the process of converting a time series from one frequency to another

▸ Downsampling: higher frequency to lower frequency

▸ Upsampling: lower frequency to higher frequency

▸ Neither: e.g., converting from every Wednesday to every Friday

▸ In pandas, call resample to group the data and then call an aggregation function

▸ e.g.,
ts.resample('D').mean()
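A self-contained version of the call above, on a made-up hourly series covering two days:

```python
import pandas as pd

# 48 hourly observations with values 0..47 (made-up data)
idx = pd.date_range("2019-01-01", periods=48, freq="60min")
ts = pd.Series(range(48), index=idx)

# Downsample from hourly to daily, aggregating each day with the mean
daily = ts.resample("D").mean()
print(daily)
```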
Resampling Example 61
Downsampling 62

▸ The desired frequency defines bin edges to slice the time series into intervals
to aggregate

▸ Each interval is half-open (only one side is included) so that each data point belongs to exactly one interval

▸ While downsampling data, think about

▸ Which side of each interval is closed

▸ How to label each aggregation bin (with the start or the end of the interval)
Downsampling Example 63

By default, closed and label are set to 'left' for most frequencies (month-end, quarter-end, year-end, and weekly frequencies default to 'right')
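The effect of closed and label can be sketched by downsampling twelve one-minute observations into five-minute bins (made-up data):

```python
import pandas as pd

idx = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(range(12), index=idx)

# Default for minute bins: left edge closed, bins labeled by their left edge
left = ts.resample("5min").sum()

# Right edge closed, bins labeled by their right edge
right = ts.resample("5min", closed="right", label="right").sum()
print(left, right, sep="\n")
```

With the right edge closed, the observation at 00:00 falls into a bin of its own that ends at 00:00, so the result has one more bin than the default.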


Upsampling 64

▸ No aggregation is needed in upsampling

▸ Just need to decide how to fill the missing values that appear in the gaps
Filled with NaN
Upsampling (Cont.) 65

Filled forward

Filled forward for a certain number of periods
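The slide examples were images; the three fill strategies can be sketched by upsampling a made-up weekly series to daily frequency:

```python
import pandas as pd

# Two weekly observations on Wednesdays (made-up data)
weekly = pd.Series([1.0, 2.0],
                   index=pd.date_range("2019-01-02", periods=2, freq="W-WED"))

filled_nan = weekly.resample("D").asfreq()          # gaps left as NaN
filled_forward = weekly.resample("D").ffill()       # carry last value forward
filled_limit = weekly.resample("D").ffill(limit=2)  # carry at most 2 periods
print(filled_nan, filled_forward, filled_limit, sep="\n")
```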
Window Functions 67

▸ Apply statistics over a sliding window of time

▸ The basic idea is to apply functions over a window of time, get the results, and
then slide the window ahead, and continue

▸ Used for smoothing noisy or gappy data

▸ The rolling operator is called on a Series/DataFrame with a window (a number of periods or a time span), followed by an aggregation function

▸ The result is labeled at the right edge of the window (can be changed with center=True)
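The slide example was an image; a sketch of the rolling operator on a made-up daily series:

```python
import pandas as pd

ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0],
               index=pd.date_range("2019-01-01", periods=5, freq="D"))

# Window of 3 periods; each result sits at the right edge of its window,
# so the first two positions have no complete window and are NaN
trailing = ts.rolling(3).mean()

# Same window, but each result sits at the center of its window
centered = ts.rolling(3, center=True).mean()

# Window given as a time span: the current day and the day before
two_day_sum = ts.rolling("2D").sum()
print(trailing, centered, two_day_sum, sep="\n")
```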
Window Functions Example 68
