Assignment 3: Flights of New York

Raymond Guo
2020-02-03

Exercise 1
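There are 336,776 rows and 19 columns.
Each row represents a single flight.
sched_dep_time is the scheduled departure time and dep_time is the actual departure time. Both are
stored as integers in HHMM format (for example, 517 means 5:17).
The columns related to date or time are year, month, day, dep_time, sched_dep_time, dep_delay,
arr_time, sched_arr_time, arr_delay, air_time, hour, minute, and time_hour. Of these, dep_delay,
arr_delay, and air_time are durations in minutes rather than clock times.
No single column uniquely identifies a flight. tailnum identifies the aircraft, and the same aircraft
flies many times in a year, so its values repeat. A combination of columns, such as carrier, flight,
and time_hour, is a better candidate for a unique key.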
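
Whether a column (or set of columns) works as a key can be checked by looking for duplicated values. The following is a minimal base R sketch on a made-up data frame; the toy values and the helper name is_unique_key are assumptions for illustration, not part of the assignment or of the flights data.

```r
# Toy stand-in for flights; all values are invented for illustration.
toy <- data.frame(
  tailnum   = c("N101", "N101", "N202", "N303"),
  carrier   = c("UA", "UA", "AA", "UA"),
  flight    = c(10L, 20L, 10L, 10L),
  time_hour = c("2013-01-01 05", "2013-01-01 09",
                "2013-01-01 05", "2013-01-02 05"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when no two rows share the same values
# in the given columns.
is_unique_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0
}

# The same aircraft appears twice, so tailnum alone is not a key.
tailnum_is_key <- is_unique_key(toy, "tailnum")

# The combination of columns has no repeated rows in this toy data.
composite_is_key <- is_unique_key(toy, c("carrier", "flight", "time_hour"))
```

The same check could be run on the real data to confirm or reject a candidate key.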

Exercise 2

flights %>%
  select(year, month)

## # A tibble: 336,776 x 2
## year month
## <int> <int>
## 1 2013 1
## 2 2013 1
## 3 2013 1
## 4 2013 1
## 5 2013 1
## 6 2013 1
## 7 2013 1
## 8 2013 1
## 9 2013 1
## 10 2013 1
## # ... with 336,766 more rows
select(year, month) keeps only the two columns named in the call, year and month, and drops the
other 17 columns.

Exercise 3

flights %>%
  select(year:day)

## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
The colon selects the inclusive range of columns from year through day, in the order they appear in
the data set, which is why month is included.
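
The difference between listing columns and giving a range can be sketched in base R by working with column positions. This is only an illustration on a one-row toy data frame (the values are invented), not how select() is implemented.

```r
# One-row toy data frame with the first few flights column names.
toy <- data.frame(year = 2013L, month = 1L, day = 1L,
                  dep_time = 517L, sched_dep_time = 515L)

cols <- names(toy)

# A range is resolved by position: from the first name through the last,
# including every column in between.
from <- match("year", cols)
to   <- match("day", cols)
range_cols <- cols[from:to]

# A comma-separated list keeps exactly the names given, nothing more.
listed_cols <- c("year", "month")

range_result  <- toy[, range_cols, drop = FALSE]
listed_result <- toy[, listed_cols, drop = FALSE]
```

Here range_result has the three columns year, month, and day, while listed_result has only year and month.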

Exercise 4

flights %>%
  arrange(sched_dep_time, sched_arr_time)

## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 7 27 NA 106 NA NA
## 2 2013 3 31 456 500 -4 635
## 3 2013 4 1 454 500 -6 636
## 4 2013 4 2 453 500 -7 657
## 5 2013 4 3 453 500 -7 642
## 6 2013 4 4 454 500 -6 636
## 7 2013 4 5 456 500 -4 716
## 8 2013 4 6 453 500 -7 619
## 9 2013 4 7 455 500 -5 639
## 10 2013 4 8 454 500 -6 645
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
arrange() sorts by its arguments from left to right: rows are ordered by sched_dep_time first, and
sched_arr_time only breaks ties between rows that share the same sched_dep_time. Reversing the
arguments makes sched_arr_time the primary sort column and sched_dep_time the tie-breaker. Both
columns are sorted in ascending order, which is the default.
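
The tie-breaking behaviour can be seen on a small example. The sketch below uses base R's order(), which also sorts by its arguments from left to right; the four rows are invented for illustration.

```r
# Toy data: two distinct scheduled departure times, each appearing twice.
toy <- data.frame(
  sched_dep_time = c(600L, 500L, 600L, 500L),
  sched_arr_time = c(900L, 800L, 700L, 640L)
)

# Primary sort on sched_dep_time; sched_arr_time only orders the ties.
by_dep_then_arr <- toy[order(toy$sched_dep_time, toy$sched_arr_time), ]

# Swapping the arguments makes sched_arr_time the primary sort column.
by_arr_then_dep <- toy[order(toy$sched_arr_time, toy$sched_dep_time), ]
```

In by_dep_then_arr the two 500 rows come first, ordered 640 then 800 by arrival time. In by_arr_then_dep the arrival times run 640, 700, 800, 900 regardless of departure time.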

Exercise 5

flights %>%
  arrange(desc(dep_delay))

## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 9 641 900 1301 1242
## 2 2013 6 15 1432 1935 1137 1607
## 3 2013 1 10 1121 1635 1126 1239
## 4 2013 9 20 1139 1845 1014 1457
## 5 2013 7 22 845 1600 1005 1044
## 6 2013 4 10 1100 1900 960 1342
## 7 2013 3 17 2321 810 911 135
## 8 2013 6 27 959 1900 899 1236
## 9 2013 7 22 2257 759 898 121
## 10 2013 12 5 756 1700 896 1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
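The column that records the departure delay is dep_delay, measured in minutes. The flight with the
longest delay was flight 51 with tailnum N384HA: it was scheduled to depart at 9:00 on January 9,
2013 and left 1301 minutes (about 21.7 hours) late.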

Exercise 6

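flights %>%
  mutate(
    # distance is in miles and air_time is in minutes,
    # so air_time / 60 gives hours and the result is in miles per hour
    average_speed = distance / (air_time / 60)
  )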

## # A tibble: 336,776 x 20
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 13 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, average_speed <dbl>
mutate() appends the new column at the far right of the data set. The column is called average_speed
because that is the name given on the left-hand side of the = inside mutate(), so the code
determines the name of the column.
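
For a speed in miles per hour, the units matter: in the flights data, distance is recorded in miles and air_time in minutes, so air_time has to be converted to hours before dividing. The sketch below works through one pair of values chosen for illustration.

```r
# Illustrative values: a 1400-mile flight that spends 227 minutes in the air.
distance_miles <- 1400
air_time_min   <- 227

# Convert minutes to hours, then divide distance by time.
air_time_hours <- air_time_min / 60
speed_mph      <- distance_miles / air_time_hours
```

The result is roughly 370 miles per hour, which is a plausible cruising speed averaged over a whole flight.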

Exercise 7

flights %>%
  mutate(
    dep_time_hour = dep_time %/% 100,
    dep_time_minute = dep_time %% 100,
    dep_time_minutes_midnight = dep_time_hour * 60 + dep_time_minute
  )

## # A tibble: 336,776 x 22
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # ... with 336,766 more rows, and 15 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, dep_time_hour <dbl>,
## # dep_time_minute <dbl>, dep_time_minutes_midnight <dbl>
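
The arithmetic in the mutate() call above can be checked by hand on a few HHMM values. The sketch below wraps it in a helper; the function name to_minutes_since_midnight and the final wrap-around step are assumptions added for illustration and are not part of the original answer.

```r
# Hypothetical helper: convert an HHMM integer to minutes since midnight.
to_minutes_since_midnight <- function(hhmm) {
  hours   <- hhmm %/% 100   # integer division drops the last two digits
  minutes <- hhmm %% 100    # remainder keeps the last two digits
  hours * 60 + minutes
}

examples <- c(517, 1004, 2400)

# 517 is 5 h 17 min, 1004 is 10 h 4 min, 2400 is 24 h 0 min.
raw <- to_minutes_since_midnight(examples)

# If a time of 2400 is meant as midnight, taking the result modulo 1440
# (the number of minutes in a day) maps it to 0 and leaves other values alone.
wrapped <- raw %% 1440
```

Here raw is 317, 604, 1440 and wrapped is 317, 604, 0.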

Exercise 8

flights %>%
  filter(
    arr_delay < 0,
    carrier == "UA"
  )

## # A tibble: 34,642 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 558 600 -2 923
## 2 2013 1 1 559 600 -1 854
## 3 2013 1 1 607 607 0 858
## 4 2013 1 1 643 646 -3 922
## 5 2013 1 1 644 636 8 931
## 6 2013 1 1 646 645 1 910
## 7 2013 1 1 646 645 1 1023
## 8 2013 1 1 656 700 -4 948
## 9 2013 1 1 659 700 -1 959
## 10 2013 1 1 701 700 1 1123
## # ... with 34,632 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>

Exercise 9

flights %>%
  group_by(carrier) %>%
  summarize(
    average_arr_delay = mean(arr_delay, na.rm = TRUE)
  )

## # A tibble: 16 x 2
## carrier average_arr_delay
## <chr> <dbl>
## 1 9E 7.38
## 2 AA 0.364
## 3 AS -9.93
## 4 B6 9.46
## 5 DL 1.64
## 6 EV 15.8
## 7 F9 21.9
## 8 FL 20.1
## 9 HA -6.92
## 10 MQ 10.8
## 11 OO 11.9
## 12 UA 3.56
## 13 US 2.13
## 14 VX 1.76
## 15 WN 9.65
## 16 YV 15.6
The carrier with the longest average arrival delay is F9 (21.9 minutes) and the carrier with the
shortest is AS (-9.93 minutes, meaning its flights arrive early on average).

flights %>%
  group_by(carrier) %>%
  summarize(
    average_arr_delay = mean(arr_delay, na.rm = TRUE),
    average_dep_delay = mean(dep_delay, na.rm = TRUE)
  )

## # A tibble: 16 x 3
## carrier average_arr_delay average_dep_delay
## <chr> <dbl> <dbl>
## 1 9E 7.38 16.7
## 2 AA 0.364 8.59
## 3 AS -9.93 5.80
## 4 B6 9.46 13.0
## 5 DL 1.64 9.26
## 6 EV 15.8 20.0
## 7 F9 21.9 20.2
## 8 FL 20.1 18.7
## 9 HA -6.92 4.90
## 10 MQ 10.8 10.6
## 11 OO 11.9 12.6
## 12 UA 3.56 12.1
## 13 US 2.13 3.78
## 14 VX 1.76 12.9
## 15 WN 9.65 17.7
## 16 YV 15.6 19.0
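
The per-carrier averages above depend on na.rm = TRUE, because some delay values are missing (shown as NA in the output). The base R sketch below uses tapply() on invented values to show what happens to a group mean with and without that argument; it is an illustration of the idea, not a replacement for the group_by() and summarize() pipeline.

```r
# Toy data: two carriers, one missing arrival delay. Values are invented.
toy <- data.frame(
  carrier   = c("AS", "AS", "F9", "F9", "F9"),
  arr_delay = c(-12, -8, 30, NA, 10)
)

# Without na.rm, any group that contains an NA gets an NA mean.
with_na <- tapply(toy$arr_delay, toy$carrier, mean)

# With na.rm = TRUE, missing values are dropped before averaging.
without_na <- tapply(toy$arr_delay, toy$carrier, mean, na.rm = TRUE)
```

In this toy data the F9 mean is NA in with_na and 20 in without_na, while the AS mean is -10 in both.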
