GROUP 07 CLASS CC02
FINAL PROJECT
ACKNOWLEDGEMENT
We would like to extend our sincere gratitude to Dr. Phan Thi Huong, the
lecturer for the Statistical Probability course and our project supervisor. Her
wholehearted guidance enabled the team to complete the assignment on schedule and
to address the challenges we encountered.
Table 1:
1.2.Data management:
Air traffic passenger statistics can be a useful tool for understanding the aviation
industry and planning travel. This dataset from Open Flight contains information on air
traffic passenger statistics by airline for 2017. The data includes passenger numbers,
operating airlines, published airlines, geographic regions, class of operations codes, fare
category codes, terminals, boarding areas, year and month of flight.
1.3.Research ideas:
• One-way ANOVA (Determining Differences Between Geographic Regions):
The main objective is to see if passenger counts differ significantly between geographic
regions (GEO Region). For example, do passenger counts in Europe, North America,
and Asia differ significantly? If there are differences, this may reflect the level of
economic development in each region, differences in air transport demand, and
differences in cultural, social, or transportation planning factors.
2. Theoretical basis
2.1.Theory of P-value:
The p-value principle is a widely employed statistical method for assessing the
reliability of research findings. Its primary function is to determine whether observed
results are attributable to chance or are influenced by a specific factor.
Application of P-value
The choice of test statistic depends on the type of data and the research question. Common
test statistics include:
● For a z-test, use the formula $z = \dfrac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where $\bar{x}$ is the
sample mean, $\mu$ is the population mean, $\sigma$ is the population standard deviation,
and $n$ is the sample size.
To find the p-value for a z-test, use a z-table of cumulative probabilities. For a
left-tailed test, the p-value is read directly from the table. For a right-tailed test,
the p-value is 1 minus the value in the table. For a two-tailed test, the p-value is
twice the smaller of the left-tail and right-tail probabilities.
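The table lookups described above can be reproduced in R with pnorm(), which returns the cumulative standard normal probability. The sketch below is only an illustration: the value of z is hypothetical and does not come from the project data.

```r
# Hypothetical z statistic, chosen only for illustration
z <- 1.96

p_left  <- pnorm(z)                  # left-tailed test:  P(Z <= z), the table value
p_right <- 1 - pnorm(z)              # right-tailed test: 1 minus the table value
p_two   <- 2 * min(p_left, p_right)  # two-tailed test:   twice the smaller tail

print(c(left = p_left, right = p_right, two_sided = p_two))
```

Taking the smaller tail before doubling keeps the two-tailed p-value at or below 1 whatever the sign of z.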
Example:
A researcher wants to test whether the average height of a population is
different from 170 cm. They collect a sample of 30 individuals and find the following:
$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$
There are two primary types of variance analysis models: one-factor and two-
factor ANOVA. The term "factor" in this context refers to the number of independent
variables influencing the dependent variable under study.
One factor in this context means that we are only examining the influence
of one variable (an independent variable) on the dependent variable. For example:
Education: Comparing the average test scores of students from three different schools.
Healthcare: Comparing the effectiveness of three different drugs in reducing pain.
Suppose we want to compare the means of k populations (in our example, k = 3) based
on independent random samples of sizes n₁, n₂, n₃, ..., nk drawn from each population.
● Normality: The populations from which the samples are drawn should be
normally distributed.
● Homogeneity of variance: The populations should have equal variances.
● Independence: The observations within each sample should be independent of
each other.
Given these assumptions, we can formulate the null and alternative hypotheses
for the ANOVA test:
Null Hypothesis (H₀): The means of all k populations are equal (μ₁ = μ₂ = ... = μₖ).
Alternative Hypothesis (H₁): At least one pair of population means differs.
In simpler terms, the null hypothesis states that there is no significant difference
between the means of the k populations, while the alternative hypothesis suggests that
at least one population mean is different from the others.
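Steps to follow:
Step 1: Calculate the sample mean of each group, $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$, according to the formula:
$$\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij} \quad (i = 1, 2, \ldots, k)$$
and the grand mean of the k samples (the mean of all observations):
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{n}, \quad n = \sum_{i=1}^{k} n_i$$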
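Step 2: Calculate the sums of squares: the within-group sum of squares (SSW) and the
between-group sum of squares (SSG).
+ The sum of the squared differences within each group is calculated by the formula:
Group 1: $SS_1 = \sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2$
Group 2: $SS_2 = \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2$
Similarly, we calculate up to the k-th group, $SS_k$. The sum of the squared differences
within the groups is then:
$$SSW = SS_1 + SS_2 + \cdots + SS_k = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$
+ The sum of the squared differences between the groups measures how far each group mean
lies from the grand mean $\bar{x}$, weighted by the group size:
$$SSG = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$$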
+ The total sum of squares (SST) is calculated by summing the squared differences between
each observed value of the entire sample ($x_{ij}$) and the grand mean ($\bar{x}$). SST
reflects the variability of the resulting factor due to the influence of all causes.
$$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$$
It can be easily proved that the sum of the total squared differences is equal to the
sum of the squared differences within the groups and the sum of the squared differences
between the groups.
SST = SSW + SSG
Thus, the above formula shows that SST, the entire variation of the resulting factor,
is split into two parts: the variation generated by the factor under study (SSG) and
the variation produced by other factors not studied here (SSW). The larger the
variation produced by the factor under study is relative to the variation produced by
the other factors, the more grounds we have to reject $H_0$ and conclude that the
factor under study significantly affects the resulting factor.
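Calculate the within-group variance (MSW) by dividing the sum of the squared differences
within groups by the corresponding degrees of freedom, $n - k$.
$$MSW = \frac{SSW}{n - k}$$
Calculate the between-group variance (MSG) by dividing the sum of the squared differences
between groups by the corresponding degrees of freedom, $k - 1$. MSG is an estimate of
the variability of the resulting factor caused by the factor under study.
$$MSG = \frac{SSG}{k - 1}$$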
The ratio of MSG to MSW is called the F-ratio because, when $H_0$ is true, it follows the
Fisher-Snedecor (F) distribution with $k - 1$ and $n - k$ degrees of freedom:
$$F = \frac{MSG}{MSW}$$
We reject the $H_0$ hypothesis, which holds that the k population means are equal, when:
$$F > F_{k-1,\, n-k;\, \alpha}$$
Here $F_{k-1,\, n-k;\, \alpha}$ is the critical value of the F distribution. To find it in
an F-table, look up $k - 1$ along the first row and $n - k$ down the first column, and
remember to select the table with the appropriate significance level.
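The steps above can be traced numerically. The sketch below works through the group means, SSW, SSG, SST, the mean squares, and the F-ratio by hand on a small hypothetical data set (three groups of four made-up values, not the project data). It uses qf() instead of a printed F-table to obtain the critical value.

```r
# Hypothetical data: k = 3 groups with 4 made-up observations each
x <- c(5, 7, 6, 8,   9, 11, 10, 12,   4, 5, 6, 5)
g <- factor(rep(c("A", "B", "C"), each = 4))

n <- length(x)   # total sample size
k <- nlevels(g)  # number of groups

# Step 1: group means and grand mean
group_means <- tapply(x, g, mean)
group_sizes <- tapply(x, g, length)
grand_mean  <- mean(x)

# Step 2: sums of squares
ssw <- sum((x - group_means[as.character(g)])^2)        # within groups
ssg <- sum(group_sizes * (group_means - grand_mean)^2)  # between groups
sst <- sum((x - grand_mean)^2)                          # total

# Mean squares and F-ratio
msw    <- ssw / (n - k)
msg    <- ssg / (k - 1)
f_stat <- msg / msw

# Critical value at alpha = 0.05 and the corresponding p-value
f_crit <- qf(0.95, df1 = k - 1, df2 = n - k)
p_val  <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

print(c(SSW = ssw, SSG = ssg, SST = sst, F = f_stat, F_crit = f_crit, p = p_val))
```

On this toy data the identity SST = SSW + SSG holds exactly, and the hand-computed F-ratio agrees with the value reported by summary(aov(x ~ g)).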
3. Data pre-processing:
3.1.Data importing:
We use the read.csv command to read the data file “Air_Traffic_Passenger_Statistics.csv”,
which was downloaded from
https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics, and
then use the head command to print the first 4 rows of the data, as shown
in table 1:
## A tibble: 4 × 17
3.2.Inspecting data:
Moving forward, we utilize the str() command to thoroughly examine the
structure of the dataset, as demonstrated in table 2. This command allows us to gain a
comprehensive understanding of the dataset by providing detailed information about
its structure, including the number of rows and columns, the data types of each
variable, and a preview of sample values for each column. Such an inspection is
crucial for ensuring that we are familiar with the data and can prepare it appropriately
for further analysis.
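The import and inspection steps can be sketched as follows. To keep the example self-contained, it writes a tiny hypothetical CSV to a temporary file. The three columns reuse names from the dataset, but the rows are invented; in the project the path would point to the downloaded file instead.

```r
# Hypothetical stand-in for the downloaded CSV (made-up rows, 3 of the 17 columns)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Operating_Airline,GEO_Region,Passenger_Count",
  "Airline A,US,27271",
  "Airline A,Asia,29131",
  "Airline B,Europe,5415",
  "Airline B,Canada,35156",
  "Airline C,Mexico,34090"
), csv_path)

# Read the file, print the first 4 rows, then inspect the structure
air <- read.csv(csv_path, stringsAsFactors = FALSE)
print(head(air, 4))
str(air)
```

str() reports the number of observations and variables, the type of each column, and the first few values, which is the information described in the paragraph above.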
## index Activity_Period
## 0.0000000 0.0000000
## Operating_Airline Operating_Airline_IATA_Code
## 0.0000000 0.3598321
## Published_Airline Published_Airline_IATA_Code
## 0.0000000 0.3598321
## GEO_Summary GEO_Region
## 0.0000000 0.0000000
## Activity_Type_Code Price_Category_Code
## 0.0000000 0.0000000
## Terminal Boarding_Area
## 0.0000000 0.0000000
## Passenger_Count Adjusted_Activity_Type_Code
## 0.0000000 0.0000000
## Adjusted_Passenger_Count Year
## 0.0000000 0.0000000
## Month
## 0.0000000
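A per-column summary of missing values like the one above can be produced with is.na() and colMeans(). The text does not show the command that was used, nor whether the figures are proportions or percentages, so the sketch below is an assumption: it computes percentages on a small hypothetical data frame and then keeps only the complete rows.

```r
# Hypothetical data frame with one missing IATA code (values are made up)
df <- data.frame(
  Operating_Airline           = c("Airline A", "Airline B", "Airline C", "Airline D"),
  Operating_Airline_IATA_Code = c("AA", NA, "CC", "DD"),
  Passenger_Count             = c(100, 250, 80, 40)
)

# Percentage of missing values in each column (assumed to be the quantity reported)
na_pct <- colMeans(is.na(df)) * 100
print(na_pct)

# One possible treatment: keep only rows without missing values
df_complete <- df[complete.cases(df), ]
print(nrow(df_complete))
```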
3.4.Outlier detection
Outliers are unusual points in the dataset that have the potential to affect the
analysis. We treat a value as an outlier if it falls outside the range:
[ Q1 – 1.5IQR ; Q3 + 1.5IQR ]
where Q1 and Q3 are the first and third quartiles and IQR = Q3 – Q1 is the
interquartile range.
We use the boxplot() command to identify outliers, and from there, we obtain
the results in Table 6.
## [1] 2426
Table 7: Outlier
The results show that there are 2,426 outliers out of 14,953 data points, accounting
for more than 16% of the values in the dataset, as shown in table 7. Because removing
such a large share of the data would discard substantial information, we retain
these outlier values, as they may provide meaningful insights into the problem
at hand.
## [1] 0.1622417
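The outlier count and its share can be sketched directly from the range given above. The data below are simulated (a hypothetical right-skewed sample plus one extreme value), so the numbers will not match the project results. Note also that boxplot() computes its quartiles from hinges, so its count may differ slightly from the quantile() version shown here.

```r
set.seed(1)
# Hypothetical right-skewed passenger counts plus one extreme value
counts <- c(rlnorm(200, meanlog = 8, sdlog = 1), 5e5)

q1  <- unname(quantile(counts, 0.25))
q3  <- unname(quantile(counts, 0.75))
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

is_out <- counts < lower | counts > upper  # TRUE for values outside the range
n_out  <- sum(is_out)                      # number of outliers
share  <- mean(is_out)                     # proportion of outliers

print(c(outliers = n_out, share = share))
```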
With the outliers retained, Figure 1 shows that the passenger count data does not
follow any recognizable distribution pattern and is far from normal. Therefore,
we apply a natural logarithm transformation to bring the data closer to a normal
distribution.
We use the hist() command to redraw the passenger count histogram after
applying the log(x + 1) transformation, as shown in Figure 2:
Figure 3: Histogram of Log Passenger Count
After applying the log transformation, the data appears to approximate a normal
distribution, which is favorable for the analysis results.
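A minimal sketch of the transformation and histogram, using simulated counts rather than the project data. log1p(x) computes log(1 + x), which stays finite even when a count is 0. The variable names are hypothetical.

```r
set.seed(42)
# Hypothetical right-skewed passenger counts (simulated)
passenger_count <- round(rlnorm(1000, meanlog = 9, sdlog = 1.5))

# log(1 + x) transformation
log_count <- log1p(passenger_count)

hist(log_count,
     main = "Histogram of Log Passenger Count",
     xlab = "log(1 + Passenger Count)")

# The transformation is reversible with expm1()
recovered <- expm1(log_count)
```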
4. Descriptive statistics:
4.1.Perform sample statistics for the number of passengers.
We will use commands to calculate sample statistics for the passenger count
variable and obtain the results as Table 8.
##   Statistic                    Value
## 1 Mean                  2.934562e+04
## 2 Standard Deviation    5.839845e+04
## 3 Min                   1.000000e+00
## 4 Max                   6.598370e+05
## 5 Median                9.260000e+03
## 6 Variance              3.410379e+09
## 7 25% Quantile          1.000000e+00
## 8 50% Quantile (Median) 5.409000e+03
## 9 75% Quantile          9.260000e+03
Observation:
The mean value of passenger counts is 29,345.62. This represents the total number
of passengers divided by the total number of observations.
The minimum value in the dataset is 1. This means at least one instance
recorded a very low passenger count.
The maximum value is 659,837, showing that one observation recorded an
extremely high passenger count, far exceeding the mean.
Right-skewed distribution: The mean is significantly higher than the
median, indicating a few extremely large values are pulling the mean upward.
High variability: The large standard deviation and the wide range between
the minimum (1) and maximum (659,837) indicate substantial variability in the data.
Low quantile values: With the 25% quantile at 1, a significant portion of the
data represents very low passenger counts.
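A table of sample statistics of this kind can be assembled with base R functions. The sketch below uses simulated counts, so its values are unrelated to Table 8. The object names are hypothetical.

```r
set.seed(7)
# Hypothetical passenger counts (simulated, strictly positive)
passenger_count <- round(rlnorm(500, meanlog = 9, sdlog = 1.5)) + 1

stats <- data.frame(
  Statistic = c("Mean", "Standard Deviation", "Min", "Max",
                "Median", "Variance", "25% Quantile", "75% Quantile"),
  Value = c(mean(passenger_count),
            sd(passenger_count),
            min(passenger_count),
            max(passenger_count),
            median(passenger_count),
            var(passenger_count),
            unname(quantile(passenger_count, 0.25)),
            unname(quantile(passenger_count, 0.75)))
)
print(stats)
```

Comparing the Mean and Median rows gives a quick check for right skew: in a right-skewed sample the mean exceeds the median.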
##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 694 1369 1409 1433 1393 1377 1375 1362 1346 1364 1460 371
Observation:
This table gives the number of records in the dataset for each year. The counts are
stable at between 1,346 and 1,460 records per year from 2006 to 2015, while 2005 (694)
and 2016 (371) contain far fewer records.
##
##     April    August  December  February   January      July      June     March
##      1141      1303      1255      1254      1264      1299      1177      1248
##       May  November   October September
##      1168      1261      1287      1296
Observation:
##
##       Deplaned       Enplaned Thru / Transit
##           7043           6991            919
Table 12: The result for the variable “Activity type code”
Observation:
Deplaned (7043) and Enplaned (6991) have the highest counts, indicating that most
records concern passengers who are either arriving or departing.
Thru / Transit (919) has the fewest observations, indicating that far fewer records
concern passengers who are in transit without leaving the airport.
##
## Low Fare    Other
##     1884    13069
Table 13: The result for the variable “Price Category Code”
Observation:
Other: 13,069. This is the highest count, meaning that most observations fall into
this category.
Low Fare: 1,884. This is the lower count, indicating that fewer observations involve
the low-fare options.
This indicates that most of the recorded traffic is not low-fare travel.
##
##                Asia Australia / Oceania              Canada     Central America
##                3272                 737                1418                 272
##              Europe              Mexico         Middle East       South America
##                2078                1115                 214                  90
##                  US
##                5757
Table 14: The result for the variable “GEO Region”
Observation:
US has 5757 observations, the most of any region.
Asia (3272) and Europe (2078) have the next-highest counts.
South America (90) and Middle East (214) have the fewest observations.
Overall, most records are for the US, followed by Asia and Europe, with fewer
from the other regions.
Observation: Based on the graph, we can see that the US region is the region with the
most flights and has a larger number of passengers than other regions.
Plot a boxplot for Passenger Count by Price Category Code.
Figure 5: Boxplot of Passenger Count by Price Category Code
Observation: Overall, there is not much difference in passenger numbers for this
variable.
Observation:
● This chart highlights the differences in passenger numbers across the three
Activity Type Code groups.
● The Deplaned and Enplaned groups have similar sizes and distributions, whereas
the Thru/Transit group is significantly smaller.
5. Inferential Statistics:
5.1.One-factor ANOVA Building:
The primary goal of this section is to determine whether there are significant
differences in the average passenger counts among different geographic regions.
By using one-factor ANOVA, we aim to identify whether regional factors influence
passenger traffic and to highlight any regions with notably higher or lower passenger
counts, which provides a clearer understanding of air traffic patterns across various
parts of the world.
To conduct the ANOVA test, we used the aov() function, which compares the means
of different groups and calculates the F-statistic. The F-statistic indicates whether
the variability between group means is significantly greater than the variability
within groups. In addition, we structured the dataset to include passenger numbers as
the dependent variable and geographic regions as the independent variable.
HYPOTHESES
Hypothesis (H0): There is no difference in the mean passenger counts across
the GEO regions.
Hypothesis (H1): At least one GEO region has a mean passenger count that differs
from the others.
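The test of these hypotheses with aov() can be sketched as follows. The data are simulated: three hypothetical regions with 50 log-scale counts each. The column name log_count is an assumption, not a name taken from the project code.

```r
set.seed(123)
# Hypothetical data: log passenger counts for three regions (simulated)
d <- data.frame(
  GEO_Region = factor(rep(c("Asia", "Europe", "US"), each = 50)),
  log_count  = c(rnorm(50, mean = 9.0), rnorm(50, mean = 9.1), rnorm(50, mean = 10.0))
)

# One-factor ANOVA: response ~ grouping factor
fit <- aov(log_count ~ GEO_Region, data = d)
tab <- summary(fit)[[1]]
print(tab)

# p-value of the F-test; reject H0 at the 5% level when it is below 0.05
p_value <- tab[["Pr(>F)"]][1]
print(p_value < 0.05)
```

The first row of the summary table refers to the factor (k - 1 degrees of freedom) and the second to the residuals (n - k degrees of freedom).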
The Tukey HSD test compares the mean differences in passenger counts across
different GEO regions. The results are as follows:
+ Hypothesis H0: The average number of passengers on flights from the US and South
America is equal.
+ Hypothesis H1: The average number of passengers on flights from the US and South
America is different.
Based on the Tukey HSD test, the adjusted p-value is approximately 0, which is below
the 5% significance level. Therefore, we reject H0, indicating that there is a
statistically significant difference in the average number of passengers between the
US and South America.
In addition, we can use the confidence interval for the difference in the average
number of passengers between the two regions, which is (1.361, 2.351). Since the
interval does not include 0, this confirms that there is a meaningful difference
between the two regions.
Similarly for the remaining pairs, we can make the conclusion:
Significant Differences:
-Asia has significantly higher passenger counts than several regions. The mean
differences below are reported as (other region − Asia), so a negative value means
that the other region is lower than Asia:
+Australia / Oceania: Mean difference = -0.592, p<0.0001
+Canada: Mean difference = -0.554, p<0.0001
+Mexico: Mean difference = -0.745, p<0.0001
+South America: Mean difference = -1.211, p<0.0001
-US has significantly higher passenger counts compared to several regions, such as:
+South America: Mean difference = 1.856, p<0.0001
+Mexico: Mean difference = 1.391, p<0.0001
+Middle East: Mean difference = 0.809, p<0.0001
Non-Significant Differences:
Europe and Asia: P-value = 1.00, no significant difference.
Middle East and Asia: P-value = 0.835, no significant difference.
South America and Mexico: P-value = 0.108, no significant difference.
This analysis provides valuable insights into the differences in passenger numbers
across geographic regions, highlighting that the US has a significantly higher number
of passengers compared to several other regions, while areas such as South America and
Mexico exhibit notably lower numbers. These findings can be used to inform strategic
decisions regarding resource allocation and marketing efforts, enabling airlines to
prioritize regions with higher passenger demand while exploring opportunities to
enhance connectivity and engagement in underperforming areas.
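A minimal sketch of the pairwise comparison with TukeyHSD(), on simulated data for three hypothetical regions. Each row of the result is named "B-A" and its diff column is mean(B) minus mean(A), which is why the sign of a mean difference depends on the order of the pair.

```r
set.seed(123)
# Hypothetical data: log passenger counts for three regions (simulated)
d <- data.frame(
  GEO_Region = factor(rep(c("Asia", "Europe", "US"), each = 50)),
  log_count  = c(rnorm(50, mean = 9.0), rnorm(50, mean = 9.1), rnorm(50, mean = 10.0))
)

fit <- aov(log_count ~ GEO_Region, data = d)
tk  <- TukeyHSD(fit, "GEO_Region", conf.level = 0.95)

# Matrix with columns: diff, lwr, upr, p adj
res <- tk$GEO_Region
print(round(res, 3))

# Pairs whose adjusted p-value is below 5%
significant <- rownames(res)[res[, "p adj"] < 0.05]
print(significant)
```

A pair is significant at the 5% level exactly when its interval (lwr, upr) excludes 0, so the p-value and the confidence interval lead to the same decision.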
Figure 7:
- As part of our analysis, we examined four key residual plots:
Residuals vs Fitted plot:
The red line is curved, so the assumption of a linear relationship between Y and
the independent variables X is not completely satisfied.
The red line is close to the y=0 line, so the assumption that the expected errors
are 0 is satisfied.
The errors are randomly scattered along the red line, so this plot on its own
suggests constant variance.
Q-Q Residuals:
The errors deviate from the expected normal distribution line, so the normal
distribution assumption of the errors is not satisfied.
Scale-Location plot:
The error values are not randomly scattered along the red line, so the
assumption that the error variance is constant is not satisfied.
Residuals vs Leverage plot:
The graph shows that the two flagged observations, 9044 and 10560, do not lie
beyond the Cook's distance line. So, there is no need to remove them from the dataset.
- The diagnostic plots indicate some violations of key assumptions for ANOVA,
specifically the linearity and normality of residuals, as well as constant variance.
While the influential points are not extreme enough to warrant removal, these
assumption violations may limit the reliability of the ANOVA results. Adjustments,
such as transforming the dependent variable or employing a more robust statistical
model, might be necessary to address these issues.
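The adjustments mentioned above are not specified further in the text. As one possible illustration (an assumption, not the method used in the project), the sketch below runs two formal checks of the assumptions, then two base R alternatives to classical ANOVA: Welch's ANOVA, which does not assume equal variances, and the rank-based Kruskal-Wallis test. The data are simulated with deliberately unequal variances.

```r
set.seed(2024)
# Hypothetical data with unequal group variances (simulated)
d <- data.frame(
  GEO_Region = factor(rep(c("Asia", "Europe", "US"), each = 40)),
  log_count  = c(rnorm(40, 9.0, 0.5), rnorm(40, 9.2, 1.5), rnorm(40, 10.0, 1.0))
)
fit <- aov(log_count ~ GEO_Region, data = d)

# Formal checks of the assumptions
# (shapiro.test() accepts between 3 and 5000 values only)
sw <- shapiro.test(residuals(fit))                      # normality of residuals
bt <- bartlett.test(log_count ~ GEO_Region, data = d)   # equality of variances

# Alternatives that relax the assumptions
welch <- oneway.test(log_count ~ GEO_Region, data = d, var.equal = FALSE)
kw    <- kruskal.test(log_count ~ GEO_Region, data = d)

print(c(shapiro = sw$p.value, bartlett = bt$p.value,
        welch = welch$p.value, kruskal = kw$p.value))
```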
We use the aov function to fit a two-factor ANOVA model, where the dependent
variable is Log Passenger_Count, and the independent variables are GEO_Region and
Operating Airline. The interaction term, GEO_Region:Operating_Airline, is included
to test whether the effect of one factor (e.g., GEO_Region) depends on the level of the
other factor (e.g., Operating_Airline). This allows us to assess both the individual and
combined effects of these factors on passenger counts.
##                                 Df Sum Sq Mean Sq F value Pr(>F)
## GEO_Region                       8   3840   480.1  324.91 <2e-16 ***
## Operating_Airline               69  10821   156.8  106.15 <2e-16 ***
## GEO_Region:Operating_Airline    16    945    59.1   39.99 <2e-16 ***
## Residuals                    14859  21954     1.5
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 17:
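A two-factor model with an interaction term can be sketched as below. The data are simulated with two hypothetical regions and two hypothetical airlines, and the column name log_count is an assumption. In an R formula, A * B is shorthand for A + B + A:B, so the interaction row appears automatically.

```r
set.seed(99)
# Hypothetical balanced design: 2 regions x 2 airlines x 20 replicates (simulated)
d <- expand.grid(
  rep               = 1:20,
  GEO_Region        = c("Asia", "US"),
  Operating_Airline = c("Airline A", "Airline B")
)
is_us <- d$GEO_Region == "US"
is_b  <- d$Operating_Airline == "Airline B"

# Region effect plus an extra effect that exists only for the US / Airline B cell
d$log_count <- rnorm(nrow(d), mean = 9) + 1 * is_us + 1 * (is_us & is_b)

# Main effects and interaction
fit2 <- aov(log_count ~ GEO_Region * Operating_Airline, data = d)
tab2 <- summary(fit2)[[1]]
print(tab2)
```

Each Mean Sq entry is the Sum Sq divided by its Df, and each F value is the term's Mean Sq divided by the residual Mean Sq.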
The Residuals row shows a mean square of 1.5 (21,954 / 14,859). This value is the
estimated error variance, not an F-value, and it is small compared with the mean
squares of the three model terms (480.1, 156.8, and 59.1). Together, the three terms
account for about 42% of the total sum of squares (15,606 out of 37,560), so the
model captures a substantial part of the variability in passenger counts. The
residual mean square alone does not show whether the assumptions of normality,
independence, and equal variance are satisfied. Those assumptions have to be checked
with residual diagnostic plots.
The analysis highlights that both the geographical region and the airline play
significant roles in influencing passenger counts. The significant interaction term
shows that the effect of region on passenger counts depends on the operating
airline. This finding highlights the importance of considering both factors jointly
when analyzing passenger trends. By understanding these factors, airlines and
policymakers can make more informed decisions about resource allocation, marketing
strategies, and route planning. Tailoring approaches to specific regions and airlines
allows for more targeted and effective strategies, potentially improving passenger
engagement and operational efficiency. Additionally, the model offers a foundation for
further research into other factors that could further optimize strategies for increasing
passenger numbers.
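We set up a regression model to determine which variables influence Passenger_Count.
General equation:
$$\widehat{\text{Passenger\_Count}} = \hat{\beta}_0 + \hat{\beta}_1 \times \text{TerminalOther} + \hat{\beta}_2 \times \text{TerminalTerminal 1} + \hat{\beta}_3 \times \text{TerminalTerminal 2} + \hat{\beta}_4 \times \text{TerminalTerminal 3} + \hat{\beta}_5 \times \text{Price\_Category\_CodeOther} + \hat{\beta}_6 \times \text{Year}$$
As the result of the code, we get: $\hat{\beta}_0 = -83.975324$, $\hat{\beta}_1 = -7.416359$, $\hat{\beta}_2 = 0.290825$,
$\hat{\beta}_3 = 2.149910$, $\hat{\beta}_4 = 1.152311$, $\hat{\beta}_5 = 0.077896$, $\hat{\beta}_6 = 0.046361$.
Thus, the regression line is estimated by the following equation:
$$\widehat{\text{Passenger\_Count}} = -83.975324 - 7.416359 \times \text{TerminalOther} + 0.290825 \times \text{TerminalTerminal 1} + 2.149910 \times \text{TerminalTerminal 2} + 1.152311 \times \text{TerminalTerminal 3} + 0.077896 \times \text{Price\_Category\_CodeOther} + 0.046361 \times \text{Year}$$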
We see that the p-value corresponding to the F-statistic is less than 2.2e-16, which is
highly significant. This shows that at least one predictor variable in the model has a
very high explanatory significance for the variable "Passenger_Count".
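Coefficient names such as TerminalOther arise because lm() expands each factor into indicator (dummy) variables, one per level except the baseline. The sketch below shows this on simulated data. The baseline levels ("International" and "Low Fare") and the response name log_count are assumptions made for the illustration, and the fitted values are unrelated to the estimates above.

```r
set.seed(5)
n <- 200
# Hypothetical data (simulated); level names are chosen for illustration
d <- data.frame(
  Terminal = factor(sample(c("International", "Other", "Terminal 1",
                             "Terminal 2", "Terminal 3"), n, replace = TRUE)),
  Price_Category_Code = factor(sample(c("Low Fare", "Other"), n, replace = TRUE)),
  Year = sample(2005:2016, n, replace = TRUE)
)
d$log_count <- 9 + 0.05 * (d$Year - 2005) + rnorm(n)

model <- lm(log_count ~ Terminal + Price_Category_Code + Year, data = d)

# One intercept, 4 Terminal dummies, 1 price dummy, and the Year slope
print(coef(model))

# p-value of the overall F-test, computed from the F statistic and its df
fs <- summary(model)$fstatistic
p_overall <- unname(pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE))
print(p_overall)
```

Each dummy coefficient is the estimated difference from the baseline level, holding the other predictors fixed.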
Check the regression coefficients:
The first graph (Residuals vs Fitted) checks the assumption that the errors have an
expected value of 0. If the assumption holds, the red line should be straight and lie
close to the line y = 0. Toward the end of the graph, where the fitted values are
around 9 to 11, the red line is curved rather than straight and horizontal. So we
don’t have enough evidence to accept the assumption that the expected value of the
errors is 0.
The second graph (Normal Q-Q) checks the assumption that the errors are normally
distributed. The assumption is met only if the residual points lie close to the
reference line. On the graph, the residual points between -4 and -2 depart clearly
from the line, unlike the residual points at the top of the graph, so the normality
assumption is not satisfied.
The third graph (Scale-Location) checks the assumption that the error variance is
constant. The error values are not randomly scattered along the red line, and the
red line itself is sloping (or curved) rather than horizontal, so the assumption is
not satisfied.
The fourth graph (Residuals vs Leverage) identifies highly influential points, which
may be outliers and can have the most influence on the analysis. Points that cross
the dashed red line (Cook's distance) are highly influential. If the Cook's distance
line is visible only at the corner of the graph and no point crosses it, no point
has high influence. In this data set, observations 5674 and 5376 are flagged as
potentially influential points. However, neither of them lies beyond the Cook’s
distance line. Therefore, we don’t have to remove these two observations, because
their influence is not significant.
Overall, the first, second, and third graphs give us the same result: we
don’t have enough evidence to accept all of the assumptions.
This analysis provides insight into the differences in passenger traffic between
geographic regions, showing that the US has the highest number of passengers, while
other regions such as South America and Mexico have lower numbers. These results can
be used to make strategic decisions about allocating resources or marketing to the
respective regions.
Overall: The US region has the highest number of passengers compared to most
other regions. The South America region has the lowest number of passengers
compared to many other regions, especially compared to the US and Asia. Regions such
as Europe and the Middle East do not have significant differences compared to Asia and
Canada.
Pairs of groups with significant differences: Asia has a higher number of passengers
than Australia/Oceania, Canada, Mexico, and South America, and the US has a higher
number of passengers than South America, Mexico, and the Middle East.
Pairs of groups with no significant differences: Europe and Asia, Middle East and
Asia, South America and Mexico
The analysis of the data set is limited to the Mexico geographical area; it can be
extended through analysis of other geographical areas.
7. Conclusion
Factors affecting passenger numbers in the Mexico geographic area:
Operating_Airline_IATA_Code, Activity_Type_Code, Year.
We, as the authors of this project, hope that our solutions satisfy the given problems.
Working together on the project has improved the skills and the sense of
responsibility of each member.
We look forward to receiving comments and suggestions from the lecturer
so that we can carry out more accurate and professional work in the future.
9. References
1. Phan Thi Huong, Lecture on Statistical Probability
2. Nguyen Tien Dung (editor), Nguyen Dinh Huy, Probability - Statistics & Data
Analysis, 2019
3. Nguyen Dinh Huy (editor), Nguyen Ba Thi, Probability and Statistics
Textbook, 2018
4. Introductory Statistics with R, J Jambers – D. Hand – W. Hardle
5. Applied Statistics with R, 2020
6. Lecture on Quantitative Economics, PhD. Nguyen Canh Huy
7. Sample example of multiple regression, Hoang Van Ha
8. Data: https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics