Assignment 3: Flights of New York

Raymond Guo
2020-02-03

Exercise 1
There are 336,776 rows and 19 columns.
Each row represents one flight.
sched_dep_time is the scheduled departure time and dep_time is the actual departure time.
The columns that represent a date or time are year, month, day, dep_time, sched_dep_time,
arr_time, sched_arr_time, hour, minute, and time_hour; dep_delay, arr_delay, and air_time are
durations in minutes.
No single column is a unique key for a flight: tailnum identifies the aircraft, and the same
aircraft flies many flights. A combination of columns such as time_hour, carrier, and flight is a
better candidate.
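
Whether a column (or set of columns) works as a unique key can be checked by counting how often each value occurs and keeping only the repeats. The sketch below uses a small made-up tibble, toy_flights, rather than the real flights data, so its values are assumptions for illustration only.

```r
library(dplyr)

# Toy stand-in for flights; all values here are made up for illustration.
toy_flights <- tibble(
  time_hour = c("2013-01-01 05:00", "2013-01-01 05:00", "2013-01-02 05:00"),
  carrier   = c("UA", "AA", "UA"),
  flight    = c(1545L, 1141L, 1545L),
  tailnum   = c("N14228", "N619AA", "N14228")
)

# A candidate key is unique only if no value (or combination) repeats.
tailnum_dupes <- toy_flights %>%
  count(tailnum) %>%
  filter(n > 1)

compound_dupes <- toy_flights %>%
  count(time_hour, carrier, flight) %>%
  filter(n > 1)

nrow(tailnum_dupes)   # 1: one tailnum repeats, so it is not a key here
nrow(compound_dupes)  # 0: the three columns together identify each toy row
```

Running the same count() and filter() pattern on the full dataset would show which candidate keys actually hold there.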

Exercise 2

flights %>%
  select(year, month)

## # A tibble: 336,776 x 2
## year month
## <int> <int>
## 1 2013 1
## 2 2013 1
## 3 2013 1
## 4 2013 1
## 5 2013 1
## 6 2013 1
## 7 2013 1
## 8 2013 1
## 9 2013 1
## 10 2013 1
## # ... with 336,766 more rows
It keeps only the two named columns, year and month, and drops all others.

Exercise 3

flights %>%
  select(year:day)

## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
The colon selects every column from year through day, inclusive, which is why month is included.
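
The difference between listing columns with a comma and spanning them with a colon is easy to see on a one-row example. This is a minimal sketch on a made-up tibble named toy, not the flights data.

```r
library(dplyr)

# One-row toy tibble whose columns are in the same order as in flights.
toy <- tibble(year = 2013L, month = 1L, day = 1L, dep_time = 517L)

# Comma: exactly the columns that are named.
comma_cols <- names(select(toy, year, day))

# Colon: every column from the first name through the second, inclusive.
colon_cols <- names(select(toy, year:day))

comma_cols  # "year" "day"
colon_cols  # "year" "month" "day"
```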

Exercise 4

flights %>%
  arrange(sched_dep_time, sched_arr_time)

## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 3 31 456 500 -4 635
## 3 2013 4 1 454 500 -6 636
## 4 2013 4 2 453 500 -7 657
## 5 2013 4 3 453 500 -7 642
## 6 2013 4 4 454 500 -6 636
## 7 2013 4 5 456 500 -4 716
## 8 2013 4 6 453 500 -7 619
## 9 2013 4 7 455 500 -5 639
## 10 2013 4 8 454 500 -6 645
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange() applies its arguments from left to right: rows are sorted by sched_dep_time first, and
sched_arr_time only breaks ties among rows with the same sched_dep_time. Reversing the arguments
sorts by sched_arr_time first and uses sched_dep_time to break ties. Sorting is ascending by
default.
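
The effect of argument order in arrange() can be sketched on three made-up rows. The tibble toy and its values are assumptions chosen so that the two orderings differ.

```r
library(dplyr)

toy <- tibble(
  sched_dep_time = c(600L, 500L, 500L),
  sched_arr_time = c(650L, 800L, 700L)
)

# Sort by departure time; arrival time only breaks the tie between the 500s.
by_dep <- arrange(toy, sched_dep_time, sched_arr_time)

# Sort by arrival time; departure time would only break ties in arrival time.
by_arr <- arrange(toy, sched_arr_time, sched_dep_time)

by_dep$sched_arr_time  # 700 800 650
by_arr$sched_dep_time  # 600 500 500
```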

Exercise 5

flights %>%
  arrange(desc(dep_delay))

## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Departure delay is recorded in dep_delay. The most delayed flight, on January 9, 2013, left 1,301
minutes late; it was flight 51 with tailnum N384HA.

Exercise 6

flights %>%
  mutate(
    # air_time is in minutes, so divide by 60 to get hours (result in miles per hour)
    average_speed = distance / (air_time / 60)
  )

## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, average_speed <dbl>
mutate() appends the new column at the far right of the data frame. The name on the left of the
= sign in the mutate() call, average_speed, becomes the column name.
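
Since distance is in miles and air_time is in minutes, a speed in miles per hour needs air_time converted to hours, that is, divided by 60. The short base R sketch below uses made-up values to compare that conversion with multiplying by 60 instead.

```r
# Made-up values: 120 miles in 30 minutes should come out as 240 mph.
distance <- c(120, 1089)  # miles
air_time <- c(30, 160)    # minutes

# Convert minutes to hours before dividing.
speed_mph <- distance / (air_time / 60)

# Multiplying by 60 instead shrinks the result by a factor of 3600.
speed_wrong <- distance / (air_time * 60)

speed_mph                # 240.000 408.375
speed_mph / speed_wrong  # 3600 3600
```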

Exercise 7

flights %>%
  mutate(
    dep_time_hour = dep_time %/% 100,
    dep_time_minute = dep_time %% 100,
    dep_time_minutes_midnight = dep_time_hour * 60 + dep_time_minute
  )

## # A tibble: 336,776 x 22
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 15 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, dep_time_hour <dbl>,
## # dep_time_minute <dbl>, dep_time_minutes_midnight <dbl>
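
The conversion relies on dep_time being stored as an HHMM integer: integer division by 100 gives the hour and the remainder gives the minute. The base R sketch below works through three made-up values. The final wrap with %% 1440 is an optional extra that is not in the code above, and it assumes a recorded value of 2400 means midnight.

```r
dep_time <- c(517L, 1004L, 2400L)  # HHMM integers (made-up examples)

dep_hour   <- dep_time %/% 100  # 5 10 24
dep_minute <- dep_time %% 100   # 17 4 0

minutes_since_midnight <- dep_hour * 60 + dep_minute  # 317 604 1440

# Optional: map 1440 back to 0, assuming 2400 denotes midnight.
wrapped <- minutes_since_midnight %% 1440  # 317 604 0
```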

Exercise 8

flights %>%
  filter(
    arr_delay < 0,
    carrier == "UA"
  )

## # A tibble: 34,642 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 559 600 -1 854
## 3 2013 1 1 607 607 0 858
## 4 2013 1 1 643 646 -3 922
## 5 2013 1 1 644 636 8 931
## 6 2013 1 1 646 645 1 910
## 7 2013 1 1 646 645 1 1023
## 8 2013 1 1 656 700 -4 948
## 9 2013 1 1 659 700 -1 959
## 10 2013 1 1 701 700 1 1123
## # ... with 34,632 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>

Exercise 9

flights %>%
  group_by(carrier) %>%
  summarize(
    average_arr_delay = mean(arr_delay, na.rm = TRUE)
  )

## # A tibble: 16 x 2
## carrier average_arr_delay
## <chr> <dbl>
## 1 9E 7.38
## 2 AA 0.364
## 3 AS -9.93
## 4 B6 9.46
## 5 DL 1.64
## 6 EV 15.8
## 7 F9 21.9
## 8 FL 20.1
## 9 HA -6.92
## 10 MQ 10.8
## 11 OO 11.9
## 12 UA 3.56
## 13 US 2.13
## 14 VX 1.76
## 15 WN 9.65
## 16 YV 15.6
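The carrier with the longest average arrival delay is F9 (21.9 minutes) and the carrier with the
shortest is AS (-9.93 minutes, meaning its flights arrive early on average).
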
flights %>%
  group_by(carrier) %>%
  summarize(
    average_arr_delay = mean(arr_delay, na.rm = TRUE),
    average_dep_delay = mean(dep_delay, na.rm = TRUE)
  )

## # A tibble: 16 x 3
## carrier average_arr_delay average_dep_delay
## <chr> <dbl> <dbl>
## 1 9E 7.38 16.7
## 2 AA 0.364 8.59
## 3 AS -9.93 5.80
## 4 B6 9.46 13.0
## 5 DL 1.64 9.26
## 6 EV 15.8 20.0
## 7 F9 21.9 20.2
## 8 FL 20.1 18.7
## 9 HA -6.92 4.90
## 10 MQ 10.8 10.6
## 11 OO 11.9 12.6
## 12 UA 3.56 12.1
## 13 US 2.13 3.78
## 14 VX 1.76 12.9
## 15 WN 9.65 17.7
## 16 YV 15.6 19.0
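
The grouped summary can also be sorted so that the carriers with the longest and shortest average delays sit at the top and bottom rather than being read off by eye. This is a minimal sketch on a made-up tibble named toy. The delay values are invented, and the NA stands in for a flight with no recorded arrival delay.

```r
library(dplyr)

toy <- tibble(
  carrier   = c("AA", "AA", "F9", "F9", "AS", "AS"),
  arr_delay = c(5, NA, 30, 10, -12, -8)
)

delays <- toy %>%
  group_by(carrier) %>%
  # na.rm = TRUE drops the missing delay; without it the AA mean would be NA.
  summarize(average_arr_delay = mean(arr_delay, na.rm = TRUE)) %>%
  arrange(desc(average_arr_delay))

delays$carrier            # "F9" "AA" "AS"
delays$average_arr_delay  # 20 5 -10
```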
