
BDM - Mining Over Datasets

The dataset contains information about 100,161 commercial flights in the US, including details like departure/arrival times, airports, carriers, and delays. Some key findings: - United Airlines dominates with 63.6% of flights, while the smallest carrier PA has just 0.3%. - One airport, IAD, is by far the most common origin and destination, accounting for 50.1% and 49.9% respectively. - Most flights arrive within an hour of schedule, but some extreme delays exceed 11 hours. - Saturday and Sunday see about 1% fewer flights on average. - The least popular carriers PA and EA experience the longest average delays, over 16 and

Uploaded by

base94

MSc in Data Science

Big Data Mining


Assignment 3: Mining over Datasets

Group 3:

Eleni Neti
Dimitris Tsakonas
Petros-Fotis Kamberi

Table of Contents
Dataset 1: Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska
Overview of the dataset size, features, and distribution of feature values
Average delays per airport/airline
Most prominent rules of association between delays and point of origin AND/OR point of arrival
An attempt to predict the delay given all other features
Patterns/rules identification regarding delays. When delays should be expected, based on these patterns?

Dataset 2: Religion data
Dataset Overview
Inconsistencies / ambiguous data
The overall picture of religious groups over the US
Counties with the highest per-person ratio of Orthodox Christian members
Extreme counties with respect to the distribution of their churches across religions
Considering the location to build a cross-religion center

Dataset 1: Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska

Overview of the dataset size, features, and distribution of feature values

The dataset consists of:

Instances (rows): 100161
Features (columns): 7

The table below lists the feature headers along with their corresponding data types.

#  Feature        Data type
1  DayofWeek      float64
2  CRSDepTime     float64
3  UniqueCarrier  object
4  FlightNum      float64
5  Origin         object
6  Dest           object
7  ArrDelay       float64

A check for missing values showed that the data set contains none, so we moved on to some basic descriptive statistics for the numerical features.

By examining the values obtained, we can confirm that the column DayofWeek contains the numbers 1 to 7, each representing a day of the week. It is unclear at this point whether 1 corresponds to Monday, so this has to be kept in mind. Next, the column CRSDepTime represents the scheduled departure time of a flight in HHMM format, ranging from 0 (midnight) up to 2359. With regard to the FlightNum column, we would expect it to be a unique identifier for each flight, but its minimum and maximum values do not reflect the size of our data set, meaning that either there are duplicate flights or the flight number is simply not unique across flights. Lastly, ArrDelay represents the delay with which the flight arrived at the destination airport. The units are unclear, but the delay is most likely measured in minutes, with an average delay of about 4.6 minutes. Notably, the delay column has negative values, which we assume represent flights arriving earlier than expected, and it also contains some extreme values, with a maximum of 667 minutes of delay (more than 11 hours).

CRSDepTime

Looking further into the departure time feature, we notice that there are no flights in the very early hours of the day, presumably because the airports are closed. Overall, the distribution shows that flights depart throughout the remaining hours of the day, with peaks at roughly 08:00, 12:00 and 17:00.

UniqueCarriers

The next feature contains the carriers that operate the flights in our data set. Among the 9 different carriers, UA clearly dominates, as it operates 63.6 % of the total flights. The remaining flights are spread among the other 8 carriers, the second most frequent being CO with 9.2 %, while the least frequent is PA(1) with only 0.3 % of the total flights.

#  Carrier  Number of flights  % of total flights
1  UA       63706              63.6 %
2  CO       9219               9.2 %
3  AA       8620               8.6 %
4  NW       5521               5.5 %
5  DL       4793               4.8 %
6  US       3513               3.5 %
7  TW       3056               3.1 %
8  EA       1320               1.3 %
9  PA(1)    313                0.3 %
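
A share table of this kind can be derived by counting the values of the carrier column. The following is a minimal sketch in plain Python; the function name carrier_shares and the toy values are made up for illustration and are not taken from the accompanying notebook.

```python
from collections import Counter


def carrier_shares(values):
    """Return (carrier, count, percent of total) tuples, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(name, n, round(100 * n / total, 1)) for name, n in counts.most_common()]


# Toy stand-in for the UniqueCarrier column (illustrative values only).
toy_carriers = ["UA", "UA", "UA", "CO", "AA", "UA", "CO", "UA"]
shares = carrier_shares(toy_carriers)

for name, n, pct in shares:
    print(f"{name:<8}{n:>6}{pct:>8.1f} %")
```

The same counting applies to the Origin and Dest columns.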



Origin & Dest

By examining the origin and destination airports we notice that there are:

Origin: 58 unique airports
Destination: 59 unique airports

Looking further into each feature, we discover that the distribution is again dominated by one value. The airport IAD is part of every flight, as each flight either departs from or arrives at IAD. Namely, it appears 50229 times as origin (50.1% of total) and 49932 times as destination (49.9% of total).

The distribution of the remaining airports appears to be exponential, as shown in the graph below. The second most popular origin airport is DEN with 2911 flights, and the least common is MDT with only 1 flight leaving from that airport.

The remaining destination airports follow the same distribution, as shown in the following graph, where the second most popular destination is again DEN with 3005 flights and the least popular is MDT with only 1 flight.

Finally, it is worth mentioning that airport CRW appears only in the destination column, with a flight count of 109. This is interesting because, overall, airports appear with roughly the same frequency as origin and as destination. It is possible that the corresponding rows were omitted during data collection.
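
Airports that occur in only one of the two columns can be found with a set difference. The sketch below uses plain Python and a handful of made-up (origin, destination) pairs; only the airport codes themselves appear in the report.

```python
# Toy (Origin, Dest) pairs; the combinations are illustrative, not real rows.
toy_flights = [("IAD", "DEN"), ("DEN", "IAD"), ("IAD", "CRW"), ("BOS", "IAD")]

origins = {origin for origin, _ in toy_flights}
dests = {dest for _, dest in toy_flights}

# Airports seen as destination but never as origin, and vice versa.
only_dest = sorted(dests - origins)
only_origin = sorted(origins - dests)

print("destination only:", only_dest)
print("origin only:", only_origin)
```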

ArrDelay

As mentioned earlier, the ArrDelay feature most likely records the delay, in minutes, with which the airplane arrived at its destination. Since we noticed that there might be certain extreme values, we plotted various graphs in order to visualize the distribution effectively.

In the graph below we notice that most flights arrive within about +/- 1 hour of the expected time, with some cases having up to 200 minutes of delay. Plotting the frequencies on a logarithmic scale reveals certain extreme cases, with around 400 and 600 minutes of delay, whose frequency is too low to be visible on the linear graph.

This is also well captured in the scatter plot below, where we can see that the highest flight density is between -80 and 100 minutes. At around the 100-minute mark the number of flights decreases drastically, until there are only a few outliers with a delay above 300 minutes.

DayofWeek

While we do not have any information about the label encoding, by plotting the frequency of each day as a percentage of the total flights in our dataset we notice that days 6 and 7 have about 1% fewer flights than the rest. This could indicate that days 6 and 7 represent Saturday and Sunday; however, it still remains unclear.

Average delays per airport/airline

The following table summarizes the average delay per airline. Notably, the two airlines with the highest average delay are also the two with the fewest flights, PA(1) and EA, which might help explain why their popularity is so low.

#  Carrier  Total flights (%)  Average Delay (mins)
1  PA(1)    0.3 %              16.2
2  EA       1.3 %              11.7
3  US       3.5 %              6.2
4  TW       3.1 %              5.1
5  DL       4.8 %              5.1
6  NW       5.5 %              5.0
7  CO       9.2 %              4.9
8  UA       63.6 %             4.5
9  AA       8.6 %              1.9
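
For reference, a per-group average of this kind can be sketched in plain Python as below. The helper name mean_delay_by_group and the toy delays are assumptions made for illustration. The flight count is returned next to each mean because an average over very few flights is not very informative.

```python
from collections import defaultdict


def mean_delay_by_group(rows):
    """rows: iterable of (group key, delay). Returns (key, n, mean), highest mean first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for key, delay in rows:
        totals[key] += delay
        counts[key] += 1
    summary = [(key, counts[key], totals[key] / counts[key]) for key in totals]
    return sorted(summary, key=lambda item: item[2], reverse=True)


# Toy (UniqueCarrier, ArrDelay) pairs; the delays are made up.
toy_rows = [("UA", 4.0), ("UA", -2.0), ("EA", 12.0), ("EA", 10.0), ("AA", 2.0)]
ranking = mean_delay_by_group(toy_rows)

for key, n, mean in ranking:
    print(f"{key:<6}{n:>4}{mean:>8.1f}")
```

Passing (Origin, ArrDelay) or (Dest, ArrDelay) pairs instead gives the per-airport averages.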

Next, the table below displays the origin airports with the highest average delay. It is worth noting that for certain airports, such as RIC and MDT, we do not have a lot of data, as there are very few flights from those airports. Overall, certain airports have an average delay significantly higher than the 4.6-minute average across all flights in the data set.

#   Airport (Origin)  Total flights  Average Delay (mins)
1   SJU               277            18.1
2   RIC               4              13.0
3   PVD               296            12.3
4   HPN               228            12.1
5   MHT               447            11.1
6   MDT               1              11.0
7   ORD               2753           10.6
8   BOS               2210           9.5
9   PWM               306            8.4
10  STL               902            7.8



Finally, repeating the process for the destination airports we find that at the very top is the
MDT airport again. We suspect the delay value for this airport is an outlier as it is
significantly higher than the rest and might be a “spike” caused by the low volume of flights.

#   Airport (Destination)  Total flights  Average Delay (mins)
1   MDT                    1              21
2   FLL                    747            14.6
3   MDW                    63             11.8
4   CHS                    524            11.8
5   PBI                    700            11
6   JFK                    165            9.6
7   MCO                    1676           9.5
8   HOU                    201            8.9
9   BOS                    2209           8.5
10  EWR                    1953           8.5



Overall we see from the two graphs that most of the airports have a positive average delay, while there are 6-9 airports whose flights arrive earlier on average. Those airports are summarized in the table below.

#   Airport (Origin)  Average Delay (mins)  Airport (Dest)  Average Delay (mins)
2   MDW               -0.5                  BNA             -0.1
3   SDF               -1.0                  ISP             -2.1
4   IAH               -1.6                  SRQ             -2.7
5   PHX               -2.6                  BWI             -6.0
6   IND               -3.0                  GSO             -6.0
7   JAX               -3.7                  CRW             -6.2
8   HOU               -4.2
9   BNA               -4.2
10  GSO               -4.4

Most prominent rules of association between delays and point of origin AND/OR point of arrival

First of all, we create a dataframe called itemsets that comprises three columns of the original dataframe, namely Origin, Dest, and ArrDelay. Then we proceed with feature engineering. In more detail, the delay attribute has a wide range of values, so mining association rules directly on the raw delay times would not yield meaningful rules. To tackle this issue, we add a new column named delay status, derived from the delay times as follows:

● ArrDelay = 0.0 => delay status: on-time
● ArrDelay > 0.0 => delay status: delayed
● ArrDelay < 0.0 => delay status: early

Furthermore, in order to differentiate between the airport of origin and the airport of destination when both are included in the generation of association rules, a suffix is added to each airport code. In particular, ‘_O’ is appended to each code in column Origin and ‘_D’ to each code in column Dest, designating the origin and destination airports respectively.
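
The two preprocessing steps described above can be sketched as follows. This is a minimal plain-Python illustration; the function names delay_status and to_transaction are hypothetical and the notebook may implement the steps differently.

```python
def delay_status(arr_delay):
    """Map a raw ArrDelay value to one of the three delay status labels."""
    if arr_delay > 0.0:
        return "delayed"
    if arr_delay < 0.0:
        return "early"
    return "on-time"


def to_transaction(origin, dest, arr_delay):
    """Build one itemset: suffixed origin, suffixed destination, delay status."""
    return frozenset({f"{origin}_O", f"{dest}_D", delay_status(arr_delay)})


# Example flight from IAD to DEN arriving 12 minutes late (made-up values).
example = to_transaction("IAD", "DEN", 12.0)
print(sorted(example))
```

The suffixes keep IAD as origin (IAD_O) and IAD as destination (IAD_D) apart, so they are counted as different items.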

For each kind of association rule (i.e. between delays and point of origin, between delays and point of destination, and between delays and both points of origin and destination) we employ the apriori algorithm to extract the frequent itemsets. We used a very small minimum support (0.00001) in order to capture all the rules. This was necessary because, given the size of the dataset, the support values are relatively low (< 0.03). Next, for each kind of rule, we only kept the rules that satisfied a minimum confidence threshold of 0.7 and had a lift score greater than 1, in order to obtain the most prominent ones.
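
To make the three filtering metrics concrete, the sketch below computes support, confidence and lift for a single rule directly from their definitions. It is not the apriori algorithm itself, only the scoring of one candidate rule; the function name rule_metrics and the toy transactions are assumptions made for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    c = frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # contain the antecedent
    n_c = sum(1 for t in transactions if c <= t)         # contain the consequent
    n_ac = sum(1 for t in transactions if (a | c) <= t)  # contain both
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Eight made-up transactions of the form {origin item, delay status}.
toy_transactions = [
    frozenset({"SJU_O", "delayed"}),
    frozenset({"SJU_O", "delayed"}),
    frozenset({"SJU_O", "delayed"}),
    frozenset({"SJU_O", "early"}),
    frozenset({"IAD_O", "delayed"}),
    frozenset({"IAD_O", "early"}),
    frozenset({"IAD_O", "early"}),
    frozenset({"IAD_O", "early"}),
]

support, confidence, lift = rule_metrics(toy_transactions, {"SJU_O"}, {"delayed"})
print(support, confidence, lift)
```

In this toy example the rule would pass both filters, since its confidence is at least 0.7 and its lift is above 1.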

The most prominent rules of each kind can be found in the respective Python notebook accompanying this report.

An attempt to predict the delay given all other features

In order for our models to be able to process the categorical data, one-hot encoding has been applied. This avoids imposing an artificial ordering on the categories, something a label encoder would do by assigning a numerical value to each category.

The data set has been scaled using a standard scaler in order to bring the features to a common scale, and has then been split into a training and a test set at a 77 - 33 ratio for a preliminary evaluation based on the R² score.
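
As a reminder of what the R² score measures, the sketch below computes it from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The function name r_squared and the toy values are illustrative only. A score near 0 means the model does little better than always predicting the mean delay.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up arrival delays in minutes.
toy_delays = [-5.0, 0.0, 10.0, 35.0]
mean_delay = sum(toy_delays) / len(toy_delays)

perfect = r_squared(toy_delays, toy_delays)             # exact predictions
baseline = r_squared(toy_delays, [mean_delay] * 4)      # always predict the mean
print(perfect, baseline)
```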

Model                R² score   R² score (%)
Ridge Regression     0.03196    3.196 %
XGBoost Regression   0.0513     5.13 %
Bayesian Regression  0.03246    3.246 %

As shown in the table above, the performance of all models is very poor; nevertheless, we selected Ridge and XGBoost for hyperparameter tuning in order to assess whether their performance would improve. The optuna library (a hyperparameter optimization framework) was used to speed up the search compared to a grid search, as the tuning of XGBoost took approximately 1 hour.

Model               R² score (previous)  R² score (tuned, new)
Ridge Regression    3.196 %              3.73 %
XGBoost Regression  5.13 %               8.08 %

The performance of both models improved after hyperparameter tuning but remained very poor overall. This could be due to our feature space, as we tried to predict the delay using all of the features without considering their usefulness. For example, it is likely that the flight number does not contribute to estimating the delay. Hence, the correlation between the target and the features, as well as among the features themselves, should be examined. Furthermore, the fact that the airports and the carriers had to be one-hot encoded created a very sparse feature space, which might have had an impact on the models' predictive power.

Another option for improving the models would be to convert the task into multi-class classification, where the model would be tasked with predicting a range of delay instead of the exact time. However, this is outside the scope of this assignment and hence we have not examined it further.

Patterns/rules identification regarding delays. When delays should be expected, based on these patterns?

Aiming to identify rules of association that take more features into account, we first perform some feature engineering, starting with discretization. Firstly, we replace the number indicating the day of the week on which each flight happened with the day's actual name. Then, we replace the departure time with its corresponding part of the day according to this source: (https://www.britannica.com/dictionary/eb/qa/parts-of-the-day-early-morning-late-morning-etc). Finally, we apply the same two preprocessing steps we used while extracting rules of association between delays and point of origin AND/OR point of arrival (i.e. we use the delay status instead of the delay time and add a suffix to the airport codes in order to designate the airport of origin and the airport of arrival for each flight). It should be noted that the columns used for association rule mining in this task are UniqueCarrier, Origin and Dest from the original data, and PartOfDay, DayofWeek and DelayStatus, which were engineered from the original data.
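
The discretization of the departure time can be sketched as below. The hour boundaries are an assumption based on the cited source as we understand it (morning from 5:00, afternoon from 12:00, evening from 17:00, night from 21:00) and may differ from the ones used in the notebook; the function name part_of_day is hypothetical.

```python
def part_of_day(crs_dep_time):
    """Map an HHMM departure time (e.g. 1730.0) to a part of the day."""
    hour = int(crs_dep_time) // 100  # drop the minutes: 1730 -> 17
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"  # assumed to cover 21:00 up to 04:59


for t in (800.0, 1200.0, 1730.0, 2359.0):
    print(int(t), part_of_day(t))
```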

Afterwards, we again employed the apriori algorithm, this time with a minimum support of 0.1, in order to extract the frequent itemsets, and then obtained the association rules. This minimum support value was used in order to capture all the association rules possible. From those rules we only kept the ones that satisfied a minimum confidence threshold of 0.1 and had a delay status item, namely delayed, early or on-time, as the consequent or as part of the consequents. Finally, we grouped the rules by delay status; the results of this process are presented in the respective notebook for the airlines dataset accompanying this report. It is worth noting that no association rules could be extracted for the on-time delay status that satisfied both the minimum support and the minimum confidence thresholds.

Based on these rules, as far as the part of the day is concerned, delays should be expected in the morning, in the afternoon and in the evening. Also, delays can be expected when the air carrier is UA. Finally, regarding the airports, delays should be expected when IAD is either the origin or the arrival airport. The occurrence of the UA carrier and the IAD airport in the most prominent rules regarding delay was expected, given the vast number of instances in the provided dataset that have these entities as their carrier and as their airport of origin or destination respectively. It is worth noting that there are no significant rules regarding on-time arrival. This might stem from the nature of the problem, as it is not always easy to be on schedule because of unforeseen circumstances such as weather conditions, air traffic or even emergencies.

Dataset 2: Religion data


This section focuses on data related to the religious groups present in the United States during the year 1952. The data for this section was obtained from the following GitHub repository:
https://github.com/aaronpenne/data_visualization/blob/master/religion/data/1952.xls

Dataset Overview

The dataset provided includes 3075 records representing each county in the United States in
1952.

In summary, the original dataset includes the following attributes:

● Case ID (unknown - probably some unique identifier in a Database system)
● County Name (CNAME)
● State Census Code (STCODE)
● County Census Code (CCODE)
● Total Population in 1950 (TOTPOP)
● Total number of members in 1952 (TOTMEMB)
● Total Number of churches in 1952 (TOTCHUR)

Additionally, 228 columns contain data on the presence of religious groups within each county. For each individual religion, the number of members is denoted with the suffix “_M” and the number of churches with the suffix “_C”.

Given the lack of clarity in defining membership, we consider the notion through the
established literature in which membership is primarily understood as an individual's formal
institutional affiliation and social allegiance (Finner, 1970).

A peek at the data:



The above excerpt of the dataset, which comprises only 5 rows and 9 columns, provides a brief overview of the data. As you can see, for a religious group such as the Seventh-day Adventists (SDA), the members are denoted by the letter "M", while the churches are denoted by the letter "C".

Finally, an extra column ‘STNAME’ was manually added to the dataset to indicate the name of the state in which the respective county is located.

Inconsistencies / ambiguous data

By calculating the ratio of total members (TOTMEMB) to total population (TOTPOP), we can determine the percentage of the population in each county that is affiliated with a religion (no matter which one). This metric gives us a rough estimate of the size of the religious population relative to the entire population of the county. Through this analysis, we discovered that in 15 counties the religious population exceeds the total population. Ten of these 15 cases are in Texas.

It should be noted that the population data used in this analysis is from the year 1950, while the data on religious affiliation was collected in 1952, according to the source information. This discrepancy may account for the inconsistencies observed. Additionally, it is possible that individuals are registered members of religious communities outside their county of residence, which could also contribute to the mismatch.
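
The consistency check described above amounts to flagging every county whose member-to-population ratio is greater than 1. A minimal sketch in plain Python follows; the county names and figures are made up, and only the column names TOTPOP, TOTMEMB, CNAME and STNAME come from the dataset description.

```python
def over_reported(counties):
    """Return the counties where reported members exceed the total population."""
    return [
        c for c in counties
        if c["TOTPOP"] > 0 and c["TOTMEMB"] / c["TOTPOP"] > 1
    ]


# Hypothetical records, not real counties.
toy_counties = [
    {"CNAME": "County A", "STNAME": "Texas", "TOTPOP": 1000, "TOTMEMB": 1200},
    {"CNAME": "County B", "STNAME": "Ohio", "TOTPOP": 5000, "TOTMEMB": 2500},
    {"CNAME": "County C", "STNAME": "Texas", "TOTPOP": 800, "TOTMEMB": 800},
]

flagged = over_reported(toy_counties)
print([c["CNAME"] for c in flagged])
```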

The overall picture of religious groups over the US


In order to obtain better insight into the religious profile of the US in that year, we summarized our data per state. The code provided with this report generates an interactive map, which you can consult. For this purpose, we used the spatial coordinates of the US states, available online, in order to position them accurately on the map. It should be noted that the creation of the map may take a considerable amount of time, between 40 and 60 minutes, depending on the system it is run on.

The map should look like:

When interacting with the circle markers, a histogram displaying the average percentage of a
given religious group within a state will be displayed.

For instance:

By selecting the North Dakota marker, you can view the proportion of different religions
present in that state.

Counties with the highest per-person ratio of Orthodox Christian members


Now suppose our goal is to identify the counties with the highest ratio of Orthodox Christian members per person. We consider the members of the following religious groups, which identify themselves as Orthodox:

ARAPO: Armenian Apostolic Orthodox Church of America
GRKAD: Greek Orthodox Archdiocese of North and South America
BEOC: Bulgarian Eastern Orthodox Church
ACROC: American Carpatho-Russian Orthodox Greek Catholic Church

By calculating the ratio of the total members of the Orthodox denominations to the total population of each county, we determined the per-capita ratio of Orthodox Christians per county. The results of this analysis are presented below for the top 5 counties.
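
The per-capita computation can be sketched as follows. The member column names (ARAPO_M and so on) are an assumption derived from the “_M” suffix convention described earlier, and the counties and figures are made up for illustration.

```python
# Assumed member columns for the four Orthodox groups listed above.
ORTHODOX_MEMBER_COLUMNS = ["ARAPO_M", "GRKAD_M", "BEOC_M", "ACROC_M"]


def orthodox_ratio(county):
    """Orthodox members per person in one county record."""
    members = sum(county.get(col, 0) for col in ORTHODOX_MEMBER_COLUMNS)
    return members / county["TOTPOP"]


def top_orthodox_counties(counties, k=5):
    """Return the k counties with the highest Orthodox per-capita ratio."""
    valid = [c for c in counties if c["TOTPOP"] > 0]
    return sorted(valid, key=orthodox_ratio, reverse=True)[:k]


# Hypothetical records, not real counties.
toy_counties = [
    {"CNAME": "County A", "TOTPOP": 1000, "GRKAD_M": 50},
    {"CNAME": "County B", "TOTPOP": 2000, "ARAPO_M": 20, "GRKAD_M": 20},
    {"CNAME": "County C", "TOTPOP": 500},
]

top2 = top_orthodox_counties(toy_counties, k=2)
print([(c["CNAME"], orthodox_ratio(c)) for c in top2])
```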

As the graph indicates, the overall presence of Orthodox Christians is relatively low. Among the top five counties, Cherokee stands out as having the highest density of Orthodox Christian members.

Extreme counties with respect to the distribution of their churches across religions

The presence of religious institutions, such as churches, serves as an indication of a religion's local presence and reflects how well a religious group is established in that area.

One simple way to determine the three counties with the greatest diversity in worship centers is to count the number of unique churches per county, i.e. the number of distinct religious groups with at least one church there. This is a straightforward index that can be used to roughly quantify the variety of religious institutions within a county.

The analysis produced the following results:

County           Total # of churches  # of unique churches
Los Angeles, CA  1939                 70
Cook, IL         1702                 66
Wayne, MI        881                  61
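
Under the interpretation that the number of unique churches is the number of religious groups with at least one church in the county, the index can be sketched as below. The column names other than TOTCHUR and CNAME are assumed from the “_C” and “_M” suffix convention, and the record is made up.

```python
def unique_churches(county):
    """Count the church columns (suffix _C) with at least one church."""
    return sum(
        1 for col, value in county.items()
        if col.endswith("_C") and value > 0
    )


# Hypothetical county record.
toy_county = {
    "CNAME": "County A",
    "TOTCHUR": 12,
    "SDA_C": 2,
    "GRKAD_C": 0,
    "BEOC_C": 10,
    "SDA_M": 150,
}

print(unique_churches(toy_county))
```

TOTCHUR is not counted because it does not carry the “_C” suffix.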

Considering the location to build a cross-religion center

In the next step, we focus only on the geographical location (spatial coordinates) of each county and the corresponding number of unique churches. These three elements combined form a data point. The data points are then grouped using the k-means clustering method, which separates them into distinct clusters. The optimal number of clusters is determined using the elbow method, which allows us to select the most suitable value of the parameter k. It is important to note that we limit the value of k to between 1 and 100 (for approximately 3000 counties, k = 100 corresponds to an average of about 30 counties per cluster). For k=18, the clusters are shown below.

The Silhouette score is approximately 0.25. This suggests that the clusters are only moderately well separated and that there are overlaps between them. We proceed by setting an extremity threshold, defined as the ceiling of the average of the per-cluster mean numbers of unique churches. We then consider only the counties whose number of unique churches is greater than this threshold. In this way, we aim to find the cluster that includes the largest number of extreme (i.e. religiously diverse) counties, regardless of the total number of counties in that cluster, as there is probably a higher chance of “conflict” between the different denominations there, due to lower religious homogeneity.

Next, we group these extreme counties, in other words the counties that exceeded the extremity threshold, by the cluster assigned to each county and count the number of extreme counties in each cluster. The cluster with the highest number of extreme counties can be considered the most suitable one in which to build a cross-religion center for discussion between religions, so as to maximize the center's impact.
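
The threshold and the cluster selection described above can be sketched in plain Python as follows. The clustering itself is assumed to have been done already, so the input is just a list of (cluster label, number of unique churches) pairs; the function name and the toy values are illustrative.

```python
import math
from collections import Counter, defaultdict


def most_extreme_cluster(points):
    """points: list of (cluster label, number of unique churches) per county.

    Returns (threshold, extreme counties per cluster, label of the best cluster).
    """
    by_cluster = defaultdict(list)
    for label, n_unique in points:
        by_cluster[label].append(n_unique)

    # Extremity threshold: ceiling of the average of the per-cluster means.
    cluster_means = [sum(v) / len(v) for v in by_cluster.values()]
    threshold = math.ceil(sum(cluster_means) / len(cluster_means))

    # Count the counties above the threshold in each cluster.
    extreme_counts = Counter(label for label, n in points if n > threshold)
    if not extreme_counts:
        return threshold, {}, None
    best_label, _ = extreme_counts.most_common(1)[0]
    return threshold, dict(extreme_counts), best_label


# Made-up cluster assignments and unique-church counts.
toy_points = [(0, 5), (0, 6), (0, 30), (1, 20), (1, 25), (1, 22), (1, 3), (2, 4), (2, 4)]
threshold, extreme_counts, best = most_extreme_cluster(toy_points)
print(threshold, extreme_counts, best)
```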

The previous analysis revealed that the 2nd cluster is the most extreme one, so we proceed by locating its centroid on the map (see picture below). Note that k-means centroids are not part of the input dataset. We considered using k-medoids or DBSCAN for clustering instead of k-means, as k-medoids only uses points from the input data as cluster centers; however, we did not find an implementation of k-medoids in sklearn and, in addition, we obtained poor clustering results using DBSCAN. Consequently, we eventually used k-means for clustering the data as discussed.

In order to tackle the issue arising from the fact that the centroid of the best cluster is not an actual data point but just a point on the map, we find the nearest neighbor of this centroid in the original dataset and propose this county as the most appropriate one in which to build the interfaith center. In particular, the county nearest to the most extreme cluster's centroid is Tucker County in the state of West Virginia.
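
The nearest-neighbor step can be sketched as below. It assumes that each county has latitude and longitude fields (here called lat and lon, hypothetical names) and uses plain Euclidean distance on the coordinates, which is only an approximation of the distance on the map; the coordinates are made up.

```python
import math


def nearest_county(centroid, counties):
    """Return the county whose (lat, lon) is closest to the given centroid."""
    return min(
        counties,
        key=lambda c: math.dist(centroid, (c["lat"], c["lon"])),
    )


# Hypothetical counties and centroid.
toy_counties = [
    {"name": "County A", "lat": 39.0, "lon": -79.5},
    {"name": "County B", "lat": 35.0, "lon": -90.0},
    {"name": "County C", "lat": 41.0, "lon": -75.0},
]
toy_centroid = (39.2, -79.9)

closest = nearest_county(toy_centroid, toy_counties)
print(closest["name"])
```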

References

Finner, S. L. (Ed.). (1970). Religious Membership and Religious Preference: Equal Indicators of Religiosity? Journal for the Scientific Study of Religion, 9(4), 273–279. JSTOR. https://doi.org/10.2307/1384571
