
Chapter 5 - Case Study: Analyzing Flight Delays

This document discusses preparing flight delay data and weather data for parallel processing with Dask in Python. It covers reading in data from multiple files into Dask DataFrames, cleaning and preprocessing the data, and merging the flight delays and weather DataFrames for further analysis. It also discusses techniques for improving performance such as caching data with persistence after initial preprocessing.

Preparing Flight Delay Data
Parallel Programming with Dask in Python

Dhavide Aruliah
Director of Training, Anaconda
Case study: Analyzing flight delays


Limitations of Dask DataFrames
Reading data into Dask DataFrames:
A single file
Using glob on many files

Limitations:
Unsupported file formats
Cleaning files independently
Nested subdirectories tricky with glob


Sample account data
accounts/Alice.csv :

date,amount
2016-01-31,103.15
2016-02-25,114.17
2016-03-06,4.03
2016-05-20,150.48

accounts/Bob.csv :

date,amount
2016-01-04,99.68
2016-02-09,146.41
2016-02-21,-42.94
2016-03-14,0.26


Reading/cleaning in a function
import pandas as pd
from dask import delayed

@delayed
def pipeline(filename, account_name):
    # Read one account file and tag every row with the account name.
    df = pd.read_csv(filename)
    df['account_name'] = account_name
    return df


Using dd.from_delayed()
import dask.dataframe as dd

delayed_dfs = []
for account in ['Bob', 'Alice', 'Dave']:
    fname = 'accounts/{}.csv'.format(account)
    delayed_dfs.append(pipeline(fname, account))

# Combine the delayed pandas DataFrames into one Dask DataFrame.
dask_df = dd.from_delayed(delayed_dfs)
dask_df['amount'].mean().compute()

10.56476


Flight delays and weather
Cleaning flight delays
Use .replace() : 0 → NaN

Cleaning weather data
'PrecipitationIn' : text → numeric
Add column for airport code
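
The cleaning steps listed above could be written as two small pandas functions. This is a sketch under assumptions: the function names, the choice of `WEATHER_DELAY` as the column to clean, the `Airport` column name, and the `'T'` marker in the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd

def clean_flights(df):
    """Return a copy with 0 in WEATHER_DELAY replaced by NaN."""
    out = df.copy()
    out["WEATHER_DELAY"] = out["WEATHER_DELAY"].replace(0, np.nan)
    return out

def clean_weather(df, airport):
    """Return a copy with numeric precipitation and an airport code column."""
    out = df.copy()
    # Text that cannot be parsed as a number becomes NaN.
    out["PrecipitationIn"] = pd.to_numeric(out["PrecipitationIn"],
                                           errors="coerce")
    out["Airport"] = airport  # hypothetical column name
    return out

# Toy inputs, made up for illustration.
flights = pd.DataFrame({"WEATHER_DELAY": [0.0, 12.0, np.nan, 0.0]})
weather = pd.DataFrame({"PrecipitationIn": ["0.00", "T", "0.04"]})

clean_f = clean_flights(flights)
clean_w = clean_weather(weather, "DEN")
print(clean_f)
print(clean_w)
```

Since pandas skips NaN in `mean()` and `count()`, replacing zeros with NaN changes what those statistics describe: only rows with a recorded nonzero delay are counted.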


Flight delays data
df = pd.read_csv('flightdelays-2016-1.csv')
df.columns

Index(['FL_DATE', 'UNIQUE_CARRIER', 'FL_NUM', 'ORIGIN',
       'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM',
       'DEST', 'DEST_CITY_NAME', 'DEST_STATE_ABR',
       'DEST_STATE_NM', 'CRS_DEP_TIME', 'DEP_DELAY',
       'CRS_ARR_TIME', 'ARR_DELAY', 'CANCELLED', 'DIVERTED',
       'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY',
       'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY',
       'Unnamed: 22'],
      dtype='object')


Flight delays data
df['WEATHER_DELAY'].tail()

89160 NaN
89161 0.0
89162 NaN
89163 NaN
89164 NaN
Name: WEATHER_DELAY, dtype: float64


Replacing values
series

0    6
1    0
2    6
3    5
4    7
dtype: int64

new_series = series.replace(6, np.nan)
new_series

0    NaN
1    0.0
2    NaN
3    5.0
4    7.0
dtype: float64


Let's practice!
Preparing Weather Data

Dhavide Aruliah
Director of Training, Anaconda
Daily weather data
import pandas as pd
df = pd.read_csv('DEN.csv', parse_dates=True, index_col='Date')
df.columns

Index(['Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
       'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn',
       'Mean Sea Level PressureIn', 'Min Sea Level PressureIn',
       'Max VisibilityMiles', 'Mean VisibilityMiles',
       'Min VisibilityMiles',
       'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH',
       'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees'],
      dtype='object')


Daily weather data
df.loc['March 2016', ['PrecipitationIn','Events']].tail()

           PrecipitationIn             Events
Date
2016-03-27            0.00                NaN
2016-03-28            0.00                NaN
2016-03-29            0.04  Rain-Thunderstorm
2016-03-30            0.04          Rain-Snow
2016-03-31            0.01               Snow


Examining PrecipitationIn & Events columns
df['PrecipitationIn'][0]
type(df['PrecipitationIn'][0])

'0.00'
str

df[['PrecipitationIn', 'Events']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 2 columns):
PrecipitationIn 366 non-null object
Events 115 non-null object
dtypes: object(2)
memory usage: 5.8+ KB


Converting to numeric values
series

0      0
1      M
2      2
3    1.5
4      E
dtype: object

new_series = pd.to_numeric(series, errors='coerce')
new_series

0    0.0
1    NaN
2    2.0
3    1.5
4    NaN
dtype: float64


Let's practice!
Merging & Persisting DataFrames

Dhavide Aruliah
Director of Training, Anaconda
Merging DataFrames
Pandas: pd.merge()

Pandas: pd.DataFrame.merge()

Dask: dask.dataframe.merge()


Merging example
left_df

  cat_left  value_left
0        d           4
1        d           9
2        b           1
3        d           7
4        c           3

right_df

  cat_right  value_right
0         b            9
1         c            2
2         f            0
3         d            8
4         a            8


Merging example
left_df.merge(right_df,
              left_on=['cat_left'],
              right_on=['cat_right'],
              how='inner')

  cat_left  value_left cat_right  value_right
0        d           4         d            8
1        d           9         d            8
2        d           7         d            8
3        b           1         b            9
4        c           3         c            2
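
The merge above can be reproduced in plain pandas with the two small frames from the previous slide. This sketch also calls `pd.merge()` to show that the function and method forms give the same result. Row order in the output may vary between pandas versions, so the check below looks at content, not order.

```python
import pandas as pd

left_df = pd.DataFrame({"cat_left": list("ddbdc"),
                        "value_left": [4, 9, 1, 7, 3]})
right_df = pd.DataFrame({"cat_right": list("bcfda"),
                         "value_right": [9, 2, 0, 8, 8]})

# Method form, as on the slide.
merged = left_df.merge(right_df,
                       left_on=["cat_left"],
                       right_on=["cat_right"],
                       how="inner")

# Function form with the same arguments.
same = pd.merge(left_df, right_df,
                left_on=["cat_left"],
                right_on=["cat_right"],
                how="inner")

# Keys 'f' and 'a' exist only on the right, so an inner merge drops them.
kept_keys = set(merged["cat_left"])
print(merged)
print(kept_keys)
```

An inner merge keeps only keys present on both sides: each of the three `d` rows on the left pairs with the single `d` row on the right, giving five rows in total.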


Dask DataFrame pipelines
Flight delays & weather set up
1. Read & clean 12 months of flight delay data
2. Make flight_delay dataframe with dd.from_delayed
3. Read & clean weather daily data from 5 airports
4. Make weather dataframe with dd.from_delayed
5. Merge the two dataframes


Repeated reads & performance
import dask.dataframe as dd
df = dd.read_csv('flightdelays-2016-*.csv')
%time print(df.WEATHER_DELAY.mean().compute())

2.701183508773752
CPU times: user 3.35 s, sys: 719 ms, total: 4.07 s
Wall time: 1.64 s

%time print(df.WEATHER_DELAY.std().compute())

21.230502105
CPU times: user 3.33 s, sys: 706 ms, total: 4.04 s
Wall time: 1.61 s


Repeated reads & performance
%time print(df.WEATHER_DELAY.count().compute())

192563
CPU times: user 3.36 s, sys: 695 ms, total: 4.06 s
Wall time: 1.66 s


Using persistence
%time persisted_df = df.persist()

CPU times: user 3.32 s, sys: 688 ms, total: 4.01 s
Wall time: 1.59 s

%time print(persisted_df.WEATHER_DELAY.mean().compute())

2.701183508773752
CPU times: user 15.1 ms, sys: 9.24 ms, total: 24.3 ms
Wall time: 18.5 ms


Using persistence
%time print(persisted_df.WEATHER_DELAY.std().compute())

21.230502105
CPU times: user 29.6 ms, sys: 12.5 ms, total: 42.1 ms
Wall time: 29.5 ms

%time print(persisted_df.WEATHER_DELAY.count().compute())

192563
CPU times: user 9.88 ms, sys: 2.98 ms, total: 12.9 ms
Wall time: 9.43 ms


Let's practice!
Final thoughts

Matthew Rocklin & Dhavide Aruli…
Instructors, Anaconda
What you've learned
How to:
Use Dask data structures and delayed functions

Set up data analysis pipelines with deferred computation

... while working with real-world data!


Next steps
Deploying Dask on your own cluster

Integrating with other Python libraries

Dynamic task scheduling and data management

https://dask.org/


Congratulations!