BDM - Mining Over Datasets
Group 3:
Eleni Neti
Dimitris Tsakonas
Petros-Fotis Kamberi
Table of Contents
Dataset 1: Airlines Dataset Inspired in the regression dataset from Elena Ikonomovska
Overview of the dataset size, features, and distribution of feature values
Average delays per airport/airline
Most prominent rules of association between delays and point of origin AND/OR point of arrival
An attempt to predict the delay given all other features
Patterns/rules identification regarding delays. When delays should be expected, based on these patterns?
Instances: 100161
Features: 7
The table below demonstrates the features headers along with the corresponding data
types.
Feature data-type
1 DayofWeek float64
2 CRSDepTime float64
3 UniqueCarrier object
4 FlightNum float64
5 Origin object
6 Dest object
7 ArrDelay float64
A check for missing values showed that the data set contains none, so we moved on to basic descriptive statistics for the numerical features.
By examining the values obtained, we can confirm that the column DayofWeek contains the numbers 1 to 7, each representing a day of the week. It is unclear at this point whether 1 corresponds to Monday, so this has to be kept in mind. Next, the column CRSDepTime holds the scheduled departure time of a flight in hhmm format, from 0 (midnight) up to 2359. With regard to the FlightNum column, we would expect it to be a unique identifier for each flight, but its min and max values do not reflect the size of our data set, meaning that either there are duplicate flights or the flight number is simply not unique across flights. Lastly, ArrDelay represents the delay with which the flight arrived at the destination airport. The units are unclear, but the delay is most likely measured in minutes, with an average delay of about 4.6 mins. Notably, the delay column has negative values, which we assume represent flights arriving earlier than expected, and it also contains some extreme values, with a maximum of 667 mins of delay (11+ hours).
CRSDepTime
Looking further into the departure time feature, we notice that there are no flights in the very early hours of the day, presumably because the airports are closed. Overall, the distribution shows that flights depart throughout the remaining hours of the day, with peaks roughly around 08:00, 12:00 and 17:00.
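The hourly distribution described above can be obtained by extracting the hour from the hhmm-encoded departure time. The following is a minimal sketch, assuming CRSDepTime is stored as a number such as 1730 for 17:30; the helper name and the sample values are hypothetical and not taken from the data set.

```python
from collections import Counter


def hhmm_to_hour(hhmm):
    """Return the hour (0-23) of a time encoded as hhmm, e.g. 1730 -> 17."""
    return int(hhmm) // 100


# Hypothetical sample of CRSDepTime values (illustration only).
sample_dep_times = [800.0, 815.0, 1200.0, 1705.0, 1730.0, 2359.0]

# Count how many flights depart in each hour of the day.
flights_per_hour = Counter(hhmm_to_hour(t) for t in sample_dep_times)

for hour in sorted(flights_per_hour):
    print(f"{hour:02d}:00  {flights_per_hour[hour]} flight(s)")
```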
UniqueCarriers
The next feature contains the carriers that operate the flights in our data set. Among the 9 different carriers, UA clearly dominates, as it accounts for 63.6 % of the total flights. The remaining flights are spread among the other 8 carriers, with the second most common being CO with 9.2 %, while the least common is PA(1) with only 0.3 % of the total flights.
1 UA 63706 63.6 %
2 CO 9219 9.2 %
3 AA 8620 8.6 %
4 NW 5521 5.5 %
5 DL 4793 4.8 %
6 US 3513 3.5 %
7 TW 3056 3.1 %
8 EA 1320 1.3 %
By examining the origin and destination airports we notice that there are:
Looking further into each feature, we discover that the distribution is again dominated by one value. The airport IAD is part of every flight, as all flights either depart from or arrive at IAD. Namely, it appears 50229 times as origin (50.1 % of total) and 49932 times as destination airport (49.9 % of total).
The distribution of the remaining airports appears to be exponential, as shown in the graph below. The second most popular origin airport is DEN with 2911 flights, and the least common is MDT with only 1 flight leaving from that airport.
The remaining destination airports follow the same distribution, as shown in the following graph, where the second most popular destination is again DEN with 3005 flights and the least popular is MDT with only 1 flight.
Finally, it is worth mentioning that airport CRW appears only in the destination column, with a flight count of 109. This is interesting because the other airports follow a pattern where they appear with roughly the same frequency as origin and as destination. It is possible that the rows with CRW as origin were omitted during data collection.
ArrDelay
As mentioned earlier the ArrDelay feature most likely demonstrates the delay in minutes
with which the airplane arrived at its destination. As we noticed that there might be certain
extreme values we have plotted various graphs in order to visualize the distribution
effectively.
In the graph below we notice that most of the flights arrive around +/- 1 hour of the expected
time with some cases having up to 200 minutes of delay. Plotting the logarithmic values
reveals certain extreme cases with around 400 and 600 minutes of delay that have too low
frequency to be visible on the normal graph.
This is also well captured in the scatter plot below, where we can see that the highest flight density is between -80 and 100 minutes. At around the 100 minute mark the number of flights decreases drastically, until there are only a few outliers with a delay above 300 minutes.
DayofWeek
While we do not have any information about the label encoding, by plotting the frequency of each day as a percentage of the total flights in our dataset, we notice that days 6 and 7 have about 1 % fewer flights than the rest. This could indicate that days 6 and 7 represent Saturday and Sunday; however, it still remains unclear.
The following table summarizes the average delay per airline. Notably, the 2 airlines with the highest average delay are also the least common ones, PA(1) and EA, which might be related to their low share of flights.
2 EA 1.3 % 11.7
3 US 3.5 % 6.2
4 TW 3.1 % 5.1
5 DL 4.8 % 5.1
6 NW 5.5 % 5.0
7 CO 9.2 % 4.9
8 UA 63.6 % 4.5
9 AA 8.6 % 1.9
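The per-airline averages above amount to a group-by followed by a mean. The following is a minimal sketch in plain Python; the function name and the three sample rows are hypothetical, and only the column names (UniqueCarrier, Origin, Dest, ArrDelay) come from the data set.

```python
from collections import defaultdict


def mean_delay_by(records, key):
    """Average ArrDelay for each distinct value of `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in records:
        totals[row[key]] += row["ArrDelay"]
        counts[row[key]] += 1
    return {k: totals[k] / counts[k] for k in totals}


# Hypothetical rows (illustration only, not taken from the data set).
flights = [
    {"UniqueCarrier": "UA", "Origin": "IAD", "Dest": "DEN", "ArrDelay": 10.0},
    {"UniqueCarrier": "UA", "Origin": "DEN", "Dest": "IAD", "ArrDelay": -2.0},
    {"UniqueCarrier": "AA", "Origin": "IAD", "Dest": "ORD", "ArrDelay": 3.0},
]

avg_by_carrier = mean_delay_by(flights, "UniqueCarrier")

# Rank carriers from highest to lowest average delay.
ranked = sorted(avg_by_carrier.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

The same helper can be called with "Origin" or "Dest" as the key to rank airports instead of carriers.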
Next, the table below displays the origin airports with the highest average delay. It is worth noting that for certain airports, such as RIC and MDT, we do not have a lot of data, as there are very few flights from those airports. Overall, certain airports seem to have a significantly higher average delay than the 4.6 min average over all flights in the data set.
2 RIC 4 13.0
6 MDT 1 11.0
Finally, repeating the process for the destination airports we find that at the very top is the
MDT airport again. We suspect the delay value for this airport is an outlier as it is
significantly higher than the rest and might be a “spike” caused by the low volume of flights.
1 MDT 1 21
3 MDW 63 11.8
5 PBI 700 11
Overall, we see from the two graphs that flights at most airports are delayed on average, while there are 6-9 airports whose flights arrive early on average. Those airports are summarized in the table below.
8 HOU -4.2
9 BNA -4.2
10 GSO -4.4
Most prominent rules of association between delays and point of origin AND/OR point of arrival
First of all, we create a dataframe called itemsets that comprises three columns from the original dataframe, namely Origin, Dest, and ArrDelay. Then we proceed with feature engineering. In more detail, the delay attribute has a wide range of distinct values, so almost every exact delay value would be too rare to appear in a meaningful association rule. Therefore, in order to tackle this issue, we add a new column named delay status based on the delay times as follows:
Furthermore, in order to differentiate between the airport of origin and airport of destination
when including both airports in the process of generating association rules, a suffix is added
at each airport code. In particular, ‘_O’ is added at each code in column Origin and ‘_D’ is
added at each code in column Dest, designating in this way the origin and destination
airports respectively.
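The two preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions: the function names and the `tolerance` parameter are hypothetical, and the cut-off of 0 minutes is a placeholder rather than the threshold used in the report; only the status labels (delayed, early, on-time) and the ‘_O’/‘_D’ suffixes come from the text.

```python
def delay_status(arr_delay, tolerance=0.0):
    """Map a delay in minutes to 'early', 'on-time' or 'delayed'.

    `tolerance` is a hypothetical parameter; the report's actual
    cut-offs are not reproduced here.
    """
    if arr_delay > tolerance:
        return "delayed"
    if arr_delay < -tolerance:
        return "early"
    return "on-time"


def to_transaction(row):
    """Turn one flight into a list of items for association rule mining."""
    return [
        row["Origin"] + "_O",   # suffix marks the origin airport
        row["Dest"] + "_D",     # suffix marks the destination airport
        delay_status(row["ArrDelay"]),
    ]


# Hypothetical flight (illustration only).
row = {"Origin": "IAD", "Dest": "DEN", "ArrDelay": 12.0}
transaction = to_transaction(row)
print(transaction)
```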
For each kind of association rule (i.e. between delays and point of origin, between delays and point of destination, and between delays and both points of origin and destination) we employ the apriori algorithm to extract the frequent itemsets. We used a very small minimum support (i.e. 0.00001) in order to capture practically all the rules. This was done because the size of the dataset leads to relatively low support values (< 0.03). Next, for each kind of rule, we only kept the ones that satisfied a minimum confidence threshold of 0.7 and had a lift score greater than 1, in order to obtain the most prominent ones.
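To make the filtering criteria concrete, the sketch below computes support, confidence and lift for a single rule directly from their definitions. It does not use the apriori implementation from the notebook; the function name and the four transactions are hypothetical, while the thresholds (confidence of at least 0.7, lift above 1) are the ones stated above.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(transactions)
    antecedent, consequent = set(antecedent), set(consequent)
    n_a = sum(antecedent <= t for t in transactions)
    n_c = sum(consequent <= t for t in transactions)
    n_ac = sum((antecedent | consequent) <= t for t in transactions)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Hypothetical transactions (illustration only).
transactions = [
    {"IAD_O", "DEN_D", "delayed"},
    {"IAD_O", "DEN_D", "delayed"},
    {"IAD_O", "ORD_D", "early"},
    {"DEN_O", "IAD_D", "early"},
]

support, confidence, lift = rule_metrics(transactions, {"DEN_D"}, {"delayed"})

# Keep the rule only if it passes both thresholds from the text.
keep = confidence >= 0.7 and lift > 1
print(support, confidence, lift, keep)
```

A lift above 1 means the consequent occurs more often together with the antecedent than it does in the data set overall, which is why it is used alongside confidence.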
The most prominent rules for each kind of association rules (e.g. between delays and point
of origin, between delays and point of destination, between delays and points of origin and
destination) can be found in the respective python notebook accompanying this report.
In order for our models to be able to process the categorical data, one-hot encoding has been applied. This avoids imposing an artificial ordering on the categories, which a label encoder would introduce by assigning an integer to each category.
The data set has been scaled using a standard scaler in order to bring the features to a common scale, and has then been separated into training and test sets at a 77 - 33 ratio for a preliminary evaluation based on the R² score.
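As an illustration of the encoding and of the evaluation metric, the sketch below implements one-hot encoding and the R² score by hand. The helper names and the sample values are hypothetical, and this is not the library code used in the notebook; it is meant only to show what each step computes.

```python
def one_hot(values):
    """One-hot encode a list of categories; columns follow sorted category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical carrier column (illustration only).
carriers = ["UA", "CO", "UA", "AA"]
categories, encoded = one_hot(carriers)
print(categories, encoded)

# Hypothetical delays in minutes and a baseline that always predicts the mean.
y_true = [10.0, -5.0, 0.0, 15.0]
baseline = [sum(y_true) / len(y_true)] * len(y_true)
print(r2_score(y_true, baseline))
```

A model that always predicts the mean delay scores exactly 0, so an R² close to 0 means the model explains almost none of the variance in the delay.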
Model    R² score
As shown in the table above, the performance of all models appeared to be very poor. Nevertheless, we selected Ridge and XGBoost for hyperparameter tuning in order to assess whether their performance would improve. The optuna library (a hyperparameter optimization framework) was used to speed up the search compared to a grid search, as the tuning of XGBoost took approximately 1 hr.
Model (Tuned)    R² score (previous)    R² score (new)
The performance of both models improved after hyperparameter tuning, but remained very poor overall. This could be due to our feature space, as we tried to predict the delay using all of the features without considering their usefulness. For example, it is likely that the Flight Number does not contribute to estimating the delay. Hence, the correlation should be examined between the target and the features as well as between the features themselves. Furthermore, the fact that the airports and the carriers had to be one-hot encoded created a very sparse feature space, which might have had an impact on the models' predictive power.
Another option for improving the accuracy of the models would be to convert the task into a multi-class classification problem, where the model would be tasked with predicting a range of delay instead of the exact time. However, this would be outside the scope of this assignment and hence we have not examined it further.
Aiming to identify rules of association by taking more features into account, we first performed some feature engineering, starting with discretization. Firstly, we replaced the number indicating the day of the week on which each flight happened with the name of the day. Then, we replaced the departure time with its corresponding part of the day according to this source: (https://www.britannica.com/dictionary/eb/qa/parts-of-the-day-early-morning-late-morning-etc). Finally, we applied the same two preprocessing steps we used while extracting rules of association between delays and point of origin AND/OR point of arrival (i.e. use the delay status instead of the delay time, and add a suffix to the airport codes in order to designate the airport of origin and the airport of arrival for each flight). At this point, it should be noted that the columns used for the association rule mining in this task are UniqueCarrier, Origin and Dest from the original data, and PartOfDay, DayofWeek and DelayStatus, which were feature engineered from the original data.
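The two discretization steps can be sketched as below. Both mappings are assumptions made for illustration: the report notes elsewhere that it is unclear whether day 1 is Monday, and the hour boundaries are one plausible reading of the cited source rather than the exact values used in the notebook. The names DAY_NAMES and part_of_day are hypothetical.

```python
# Assumption: day 1 is Monday. The data set does not document its encoding.
DAY_NAMES = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}


def part_of_day(hhmm):
    """Map an hhmm-encoded time to a part of the day.

    The boundaries below are assumed for illustration only.
    """
    hour = int(hhmm) // 100
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 21:
        return "evening"
    return "night"


# Hypothetical flight (illustration only).
flight = {"DayofWeek": 6.0, "CRSDepTime": 1705.0}
items = [DAY_NAMES[int(flight["DayofWeek"])], part_of_day(flight["CRSDepTime"])]
print(items)
```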
Afterwards, we again employed the apriori algorithm, this time with a minimum support of 0.1, in order to extract the frequent itemsets, and then we obtained the association rules. This relatively low minimum support value was chosen in order to capture as many association rules as possible. From those rules we only kept the ones that satisfied a minimum confidence threshold of 0.1 and had a delay status item (namely delayed, early or on-time) as the consequent or as part of the consequents. Finally, we grouped the rules based on the delay status; the results of this process are presented in the respective notebook for the airlines dataset accompanying this report. It is worth noting that no association rules could be extracted for the on-time delay status that satisfied both the minimum support threshold and the minimum confidence threshold.
Based on these rules, as far as the part of the day is concerned, delays should be expected in the morning, in the afternoon and in the evening. Also, delays can be expected when the air carrier is UA. Finally, regarding the airports, delays should be expected when IAD is either the origin or the arrival airport. The occurrence of the UA carrier and the IAD airport in the most prominent rules regarding delays was expected, due to the vast number of data instances in the provided dataset that have these entities as their carrier and as their airport of origin or destination respectively. It is worth noting that there are not any significant rules regarding on-time arrival. This might stem from the nature of the problem, where there
Dataset Overview
The dataset provided includes 3075 records, one for each county in the United States in 1952.
Additionally, 228 columns pertain to data regarding the presence of religious groups within each county. The number of members for each individual religion is denoted with the suffix “_M” and the number of churches with “_C”, respectively.
Given the lack of clarity in defining membership, we consider the notion through the
established literature in which membership is primarily understood as an individual's formal
institutional affiliation and social allegiance (Finner, 1970).
The above sample of the dataset, which comprises only 5 rows and 9 columns, provides a brief overview of the dataset. As you can see, for each religious group, such as the Seventh-day Adventists (SDA), the number of members is marked with the letter "M", while the number of churches is marked with the letter "C".
Finally, an extra column ‘STNAME’ was manually added to the dataset to indicate the name of the state in which the respective county is located.
By calculating the ratio of total members (TOTMEMB) to total population (TOTPOP), we can
determine the percentage of the population that is affiliated with a particular religion (no
matter which one) in each county. This metric gives us a rough estimate of the relative
importance of the religious population compared to the entire population of the county.
Through this analysis, we discovered that in 15 counties the reported religious membership exceeds the total population. Ten of these 15 cases are in Texas.
It should be noted that population data used in this analysis is from the year 1950, while the
data on religious affiliation was collected in 1952, according to source information. This
discrepancy may account for any inconsistencies observed. Additionally, it is also possible
that individuals from neighboring counties may be registered members of religious
communities outside their county of residence, which could also contribute to the
mismatching of data.
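The membership ratio and the check for counties where it exceeds 100 % can be sketched as follows. The column names TOTMEMB, TOTPOP and STNAME come from the dataset; the function name, the "name" key and the two sample counties are hypothetical.

```python
def membership_ratio(county):
    """Share of the county population that is a member of any religious group."""
    return county["TOTMEMB"] / county["TOTPOP"]


# Hypothetical counties (illustration only, not taken from the dataset).
counties = [
    {"name": "County A", "STNAME": "Texas", "TOTPOP": 1000, "TOTMEMB": 1200},
    {"name": "County B", "STNAME": "Ohio", "TOTPOP": 5000, "TOTMEMB": 2500},
]

# Counties where the reported membership exceeds the total population.
over_100 = [c["name"] for c in counties if membership_ratio(c) > 1.0]
print(over_100)
```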
When interacting with the circle markers, a histogram displaying the average percentage of a
given religious group within a state will be displayed.
For instance:
By selecting the North Dakota marker, you can view the proportion of different religions
present in that state.
By calculating the ratio of the total members of the Orthodox denominations, to the total
population of each county, we were able to determine the per-capita ratio of Orthodox
Christians per county. The results of this analysis are presented below for the top 5 counties.
As the graph indicates, the overall presence of Orthodox Christians is relatively low. Among the five counties with the highest concentration of Orthodox Christian members, Cherokee stands out as having the highest share of Orthodox Christians.
Extreme counties with respect to the distribution of their churches across religions
One effective way to determine the three counties with the greatest diversity in worship
centers is to identify the number of unique churches per county. This is a straightforward
index that can be used to roughly quantify the variety of religious institutions within a county.
Cook, IL 1702 66
Wayne, MI 881 61
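One way to compute such an index is sketched below, under the assumption that "unique churches" means the number of religious groups with at least one church in the county, i.e. the number of "_C" columns with a positive value. The function name and all column names other than SDA_M and SDA_C are hypothetical, and any aggregate column that also ends in "_C" would have to be excluded first.

```python
def unique_churches(county):
    """Count the religious groups that have at least one church in the county."""
    return sum(
        1
        for column, value in county.items()
        if column.endswith("_C") and value > 0
    )


# Hypothetical county record (illustration only).
county = {
    "SDA_M": 120, "SDA_C": 2,
    "XYZ_M": 0, "XYZ_C": 0,
    "ABC_M": 300, "ABC_C": 5,
}

print(unique_churches(county))
```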
In the next step, we focus only on the geographical location (spatial coordinates) of each county and the corresponding number of unique churches. These three elements combined form a data point. The data points are then grouped using the k-means clustering method, which separates them into distinct clusters. The optimal number of clusters is determined using the elbow method, which allows us to select the most suitable value of the parameter k. It's important to note that we limit the value of k to between 1 and 100 (for approximately 3000 counties, k = 100 corresponds to an average of about 30 counties per cluster). For k=18, the clusters are shown below.
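The elbow method relies on the inertia (within-cluster sum of squared distances) obtained for each candidate k. The sketch below is a plain-Python version of Lloyd's k-means algorithm, written only to show what is being computed; it is not the sklearn implementation used in the analysis, and the function names and the nine data points (latitude, longitude, unique churches) are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct points
    for _ in range(n_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = assign(points, centroids)
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical (latitude, longitude, unique churches) points.
points = [
    (0, 0, 1), (0, 1, 1), (1, 0, 1),
    (10, 10, 5), (10, 11, 5), (11, 10, 5),
    (20, 0, 9), (20, 1, 9), (21, 0, 9),
]

# Inertia for each candidate k; the "elbow" is where the decrease flattens.
inertia_by_k = {k: kmeans(points, k)[2] for k in range(1, 5)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia, 2))
```

Because the starting centroids are sampled at random, the result depends on the seed, which is one reason library implementations typically run several initialisations.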
The Silhouette score is approximately 0.25. This suggests that the clusters are only weakly separated and that there are overlaps between them. We proceed by setting an extremity threshold, defined as the ceiling of the average, over all clusters, of the mean number of unique churches per cluster. We then consider only the counties where the number of unique churches is greater than this threshold. In this way, we aim to find the cluster that includes the largest number of extreme (i.e. religiously diverse) counties, regardless of the total number of counties in that cluster, as there is probably a higher chance of “conflict” between the different denominations, due to less religious homogeneity.
Next, we group these extreme counties (i.e. the counties that exceed the extremity threshold) by the cluster assigned to each county and count the number of extreme counties in each cluster. The cluster with the highest number of extreme counties can be considered the most suitable location for a cross-religion discussion center, since building the center within this cluster's boundaries would maximize its impact.
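The threshold and the per-cluster count of extreme counties can be sketched as follows. The function name, the cluster labels and the church counts are hypothetical; the logic follows the description above (ceiling of the average of the per-cluster means, then a count of the counties above it).

```python
import math
from collections import Counter, defaultdict


def most_extreme_cluster(labels, unique_churches):
    """Return (threshold, extreme counties per cluster, cluster with the most)."""
    per_cluster = defaultdict(list)
    for label, n in zip(labels, unique_churches):
        per_cluster[label].append(n)

    # Ceiling of the average of the per-cluster mean number of unique churches.
    cluster_means = [sum(v) / len(v) for v in per_cluster.values()]
    threshold = math.ceil(sum(cluster_means) / len(cluster_means))

    # Count the counties above the threshold in each cluster.
    extreme_counts = Counter(
        label for label, n in zip(labels, unique_churches) if n > threshold
    )
    best = extreme_counts.most_common(1)[0][0] if extreme_counts else None
    return threshold, extreme_counts, best


# Hypothetical cluster labels and unique-church counts for eight counties.
labels = [0, 0, 0, 1, 1, 1, 2, 2]
unique = [2, 3, 4, 10, 13, 20, 30, 5]

threshold, extreme_counts, best = most_extreme_cluster(labels, unique)
print(threshold, dict(extreme_counts), best)
```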
The previous analysis revealed that the 2nd cluster is the most extreme one, so we proceed by locating its centroid on the map (see picture below). Note that k-means centroids are generally not points of the input dataset. We considered using k-medoids or DBSCAN for clustering instead of k-means, as k-medoids uses only points from the input data as cluster centers. However, we did not find an implementation of k-medoids in sklearn and, in addition, we obtained poor clustering results using DBSCAN. Consequently, we used k-means for clustering the data as discussed.
References
Finner, S. L. (Ed.). (1970). Religious Membership and Religious Preference: Equal Indicators of Religiosity? Journal for the Scientific Study of Religion, 9(4), 273–279. JSTOR. https://doi.org/10.2307/1384571