Chapter5-Case Study Analyzing Flight Delays

This document discusses preparing flight delay data and weather data for parallel processing with Dask in Python. It covers reading in data from multiple files into Dask DataFrames, cleaning and preprocessing the data, and merging the flight delays and weather DataFrames for further analysis. It also discusses techniques for improving performance such as caching data with persistence after initial preprocessing.

Uploaded by

Komi David ABOTSITSE


Preparing Flight Delay Data
PARALLEL PROGRAMMING WITH DASK IN PYTHON

Dhavide Aruliah
Director of Training, Anaconda
Case study: Analyzing flight delays

Limitations of Dask DataFrames
Reading data into Dask DataFrames:
  A single file
  Using glob on many files

Limitations:
  Unsupported file formats
  Cleaning files independently
  Nested subdirectories tricky with glob
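
One possible way around the nested-subdirectory limitation is to build the file list yourself and clean each file separately. The sketch below is only an illustration: the directory layout (accounts/<year>/<name>.csv) and the values are hypothetical, and it uses the standard-library glob module with recursive=True together with plain pandas rather than dd.read_csv.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical layout: accounts/<year>/<name>.csv, created in a temp directory
root = tempfile.mkdtemp()
for year, name, amount in [('2016', 'Alice', 103.15), ('2017', 'Bob', 99.68)]:
    os.makedirs(os.path.join(root, 'accounts', year), exist_ok=True)
    path = os.path.join(root, 'accounts', year, name + '.csv')
    pd.DataFrame({'date': [year + '-01-31'], 'amount': [amount]}).to_csv(path, index=False)

# '**' matches any depth of subdirectories when recursive=True
pattern = os.path.join(root, 'accounts', '**', '*.csv')
filenames = sorted(glob.glob(pattern, recursive=True))

# Clean each file independently before combining
frames = []
for fname in filenames:
    df = pd.read_csv(fname)
    df['account_name'] = os.path.splitext(os.path.basename(fname))[0]
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

The same file list could be fed to a delayed reading function instead of the plain loop, which is the approach the following slides take.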

Sample account data
accounts/Alice.csv :

date,amount
2016-01-31,103.15
2016-02-25,114.17
2016-03-06,4.03
2016-05-20,150.48

accounts/Bob.csv :

date,amount
2016-01-04,99.68
2016-02-09,146.41
2016-02-21,-42.94
2016-03-14,0.26

Reading/cleaning in a function
import pandas as pd
from dask import delayed

@delayed
def pipeline(filename, account_name):
    # Read one account file and tag every row with the account owner
    df = pd.read_csv(filename)
    df['account_name'] = account_name
    return df

Using dd.from_delayed()
import dask.dataframe as dd

delayed_dfs = []
for account in ['Bob', 'Alice', 'Dave']:
    fname = 'accounts/{}.csv'.format(account)
    delayed_dfs.append(pipeline(fname, account))

# Combine the delayed pandas DataFrames into a single Dask DataFrame
dask_df = dd.from_delayed(delayed_dfs)
dask_df['amount'].mean().compute()

10.56476

Flight delays and weather
Cleaning flight delays
  Use .replace(): 0 → NaN

Cleaning weather data
  'PrecipitationIn': text → numeric
  Add column for airport code

Flight delays data
df = pd.read_csv('flightdelays-2016-1.csv')
df.columns

Index(['FL_DATE', 'UNIQUE_CARRIER', 'FL_NUM', 'ORIGIN',
       'ORIGIN_CITY_NAME', 'ORIGIN_STATE_ABR', 'ORIGIN_STATE_NM',
       'DEST', 'DEST_CITY_NAME', 'DEST_STATE_ABR',
       'DEST_STATE_NM', 'CRS_DEP_TIME', 'DEP_DELAY',
       'CRS_ARR_TIME', 'ARR_DELAY', 'CANCELLED', 'DIVERTED',
       'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY',
       'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY',
       'Unnamed: 22'],
      dtype='object')

Flight delays data
df['WEATHER_DELAY'].tail()

89160 NaN
89161 0.0
89162 NaN
89163 NaN
89164 NaN
Name: WEATHER_DELAY, dtype: float64

Replacing values
series

0    6
1    0
2    6
3    5
4    7
dtype: int64

new_series = series.replace(6, np.nan)
new_series

0    NaN
1    0.0
2    NaN
3    5.0
4    7.0
dtype: float64
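
Applied to the flight data, the cleaning step replaces 0 with NaN in the delay column. The snippet below is a small sketch with made-up delay values (not taken from the flight files) showing why that matters: NaN entries are skipped by mean(), while zeros are not.

```python
import numpy as np
import pandas as pd

# Made-up delay values; only the column name comes from the flight data
delays = pd.Series([0.0, 12.0, 0.0, 35.0, np.nan], name='WEATHER_DELAY')

# Treat a recorded 0 as "no weather delay" by turning it into a missing value
cleaned = delays.replace(0, np.nan)

mean_with_zeros = delays.mean()      # zeros count toward the average
mean_without_zeros = cleaned.mean()  # only non-zero delays are averaged
print(mean_with_zeros, mean_without_zeros)
```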

Let's practice!

Preparing Weather Data

Dhavide Aruliah
Director of Training, Anaconda
Daily weather data
import pandas as pd
df = pd.read_csv('DEN.csv', parse_dates=True, index_col='Date')
df.columns

Index(['Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
       'Mean Humidity', 'Min Humidity', 'Max Sea Level PressureIn',
       'Mean Sea Level PressureIn', 'Min Sea Level PressureIn',
       'Max VisibilityMiles', 'Mean VisibilityMiles',
       'Min VisibilityMiles',
       'Max Wind SpeedMPH', 'Mean Wind SpeedMPH', 'Max Gust SpeedMPH',
       'PrecipitationIn', 'CloudCover', 'Events', 'WindDirDegrees'],
      dtype='object')

Daily weather data
df.loc['March 2016', ['PrecipitationIn','Events']].tail()

           PrecipitationIn             Events
Date
2016-03-27            0.00                NaN
2016-03-28            0.00                NaN
2016-03-29            0.04  Rain-Thunderstorm
2016-03-30            0.04          Rain-Snow
2016-03-31            0.01               Snow

Examining PrecipitationIn & Events columns
df['PrecipitationIn'][0]
type(df['PrecipitationIn'][0])

'0.00'
str

df[['PrecipitationIn', 'Events']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 2 columns):
PrecipitationIn 366 non-null object
Events 115 non-null object
dtypes: object(2)
memory usage: 5.8+ KB

Converting to numeric values
series

0      0
1      M
2      2
3    1.5
4      E
dtype: object

new_series = pd.to_numeric(series, errors='coerce')
new_series

0    0.0
1    NaN
2    2.0
3    1.5
4    NaN
dtype: float64
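
Putting the two weather cleaning steps together might look like the sketch below. The three rows, the non-numeric marker 'M' in the middle row, and the new column name 'Airport' are illustrative assumptions, not values taken from DEN.csv.

```python
import pandas as pd

# Hypothetical excerpt of one airport's daily weather file
weather = pd.DataFrame({
    'Date': ['2016-03-29', '2016-03-30', '2016-03-31'],
    'PrecipitationIn': ['0.04', 'M', '0.01'],  # stored as text
})

# Text -> numeric; entries that cannot be parsed become NaN
weather['PrecipitationIn'] = pd.to_numeric(weather['PrecipitationIn'],
                                           errors='coerce')

# Record which airport the rows belong to (column name is an assumption)
weather['Airport'] = 'DEN'

print(weather.dtypes)
```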

Let's practice!

Merging & Persisting DataFrames

Dhavide Aruliah
Director of Training, Anaconda
Merging DataFrames
Pandas: pd.merge()

Pandas: pd.DataFrame.merge()

Dask: dask.dataframe.merge()

Merging example
left_df

  cat_left  value_left
0        d           4
1        d           9
2        b           1
3        d           7
4        c           3

right_df

  cat_right  value_right
0         b            9
1         c            2
2         f            0
3         d            8
4         a            8

Merging example
left_df.merge(right_df,
              left_on=['cat_left'],
              right_on=['cat_right'],
              how='inner')

  cat_left  value_left cat_right  value_right
0        d           4         d            8
1        d           9         d            8
2        d           7         d            8
3        b           1         b            9
4        c           3         c            2
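
The merge above can be reproduced with the two small frames from the previous slide. This is a pandas-only sketch; the final sort is added here because the row order of an inner merge can differ between pandas versions.

```python
import pandas as pd

# The two example frames from the slides
left_df = pd.DataFrame({'cat_left': list('ddbdc'),
                        'value_left': [4, 9, 1, 7, 3]})
right_df = pd.DataFrame({'cat_right': list('bcfda'),
                         'value_right': [9, 2, 0, 8, 8]})

# Inner join: keep only rows whose category appears in both frames
merged = left_df.merge(right_df,
                       left_on=['cat_left'],
                       right_on=['cat_right'],
                       how='inner')

# Sort so the displayed order does not depend on the pandas version
merged = merged.sort_values(['cat_left', 'value_left']).reset_index(drop=True)
print(merged)
```

Categories 'f' and 'a' from right_df have no partner in left_df, so they do not appear in the result.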

Dask DataFrame pipelines
Flight delays & weather set up
1. Read & clean 12 months of flight delay data
2. Make flight_delay dataframe with dd.from_delayed
3. Read & clean weather daily data from 5 airports
4. Make weather dataframe with dd.from_delayed
5. Merge the two dataframes

Repeated reads & performance
import dask.dataframe as dd
df = dd.read_csv('flightdelays-2016-*.csv')
%time print(df.WEATHER_DELAY.mean().compute())

2.701183508773752
CPU times: user 3.35 s, sys: 719 ms, total: 4.07 s
Wall time: 1.64 s

%time print(df.WEATHER_DELAY.std().compute())

21.230502105
CPU times: user 3.33 s, sys: 706 ms, total: 4.04 s
Wall time: 1.61 s

Repeated reads & performance
%time print(df.WEATHER_DELAY.count().compute())

192563
CPU times: user 3.36 s, sys: 695 ms, total: 4.06 s
Wall time: 1.66 s

Using persistence
%time persisted_df = df.persist()

CPU times: user 3.32 s, sys: 688 ms, total: 4.01 s
Wall time: 1.59 s

%time print(persisted_df.WEATHER_DELAY.mean().compute())

2.701183508773752
CPU times: user 15.1 ms, sys: 9.24 ms, total: 24.3 ms
Wall time: 18.5 ms

Using persistence
%time print(persisted_df.WEATHER_DELAY.std().compute())

21.230502105
CPU times: user 29.6 ms, sys: 12.5 ms, total: 42.1 ms
Wall time: 29.5 ms

%time print(persisted_df.WEATHER_DELAY.count().compute())

192563
CPU times: user 9.88 ms, sys: 2.98 ms, total: 12.9 ms
Wall time: 9.43 ms

Let's practice!

Final thoughts

Matthew Rocklin & Dhavide Aruli…
Instructors, Anaconda
What you've learned
How to:
Use Dask data structures and delayed functions

Set up data analysis pipelines with deferred computation

... while working with real-world data!

Next steps
Deploying Dask on your own cluster

Integrating with other Python libraries

Dynamic task scheduling and data management

https://dask.org/

Congratulations!