
GROUP 07 CLASS CC02

The document outlines a final project on airline traffic passenger statistics conducted by a group of students at Ho Chi Minh City University of Technology. It includes sections on data introduction, theoretical basis, data pre-processing, descriptive statistics, and inferential statistics, with specific tasks assigned to each group member. The project aims to analyze passenger traffic data using statistical methods such as ANOVA and regression to draw insights about the aviation industry.

Uploaded by

phucthai0816

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY

FACULTY OF APPLIED SCIENCES


FACULTY OF TRANSPORTATION ENGINEERING
·▪•🙢🙞🕮🙜🙠•▪·

PROBABILITY AND STATISTICS (MT2013)

FINAL PROJECT

TOPIC: AIRLINE TRAFFIC PASSENGER STATISTICS

Instructor: PhD. PHAN THI HUONG


Class: CC02- Group: 7- Semester: 241

Ho Chi Minh City 12/2024


Group Information:

Full name               Student ID   Percentage of work   Assigned task
Nguyen Đinh Đang        2352251      16.66%               Inferential Statistics: ANOVA, code R
To Tan Đat              2352243      16.66%               Theoretical basis: Theory of P-value, Theory of ANOVA
Huynh Phuc Thai         2353090      16.66%               Data pre-processing, code R
Lương Hoang Khang       2352473      16.66%               Descriptive statistics, code R, final checking
Tran Nguyen Gia Thinh   2353143      16.66%               Inferential Statistics: Multivariate regression, code R
Nguyen Phan Dung        2352209      16.66%               Data introduction, Discussion and expansion, Conclusion, Summary and format of reports

ACKNOWLEDGEMENT

We would like to extend our sincere gratitude to PhD. Phan Thi Huong, the
lecturer for the Probability and Statistics course and our project supervisor. Her
wholehearted guidance enabled the team to complete the assignment on schedule and
to address the challenges we encountered.

Probability and statistics are important branches of mathematics. They help represent
complicated data in a straightforward and understandable way, and professionals in
many different fields and departments rely on statistics and the predictions drawn
from them. Studying this subject has therefore improved and sharpened our skills, not
only in data science but also in teamwork and problem solving.
Table of Contents
1. Data introduction
1.1. Introduction to data
1.2. Data management
1.3. Research ideas
2. Theoretical basis
2.1. Theory of P-value
2.2. Theory of ANOVA (Analysis of Variance)
3. Data pre-processing
3.1. Data importing
3.2. Inspecting data
3.3. Inspect and handle missing data
3.4. Outlier detection
4. Descriptive statistics
4.1. Perform sample statistics for the number of passengers
4.2. Perform count statistics for categorical variables
4.3. Plot distribution charts for selected categorical variables
5. Inferential Statistics
5.1. One-factor ANOVA Building
5.2. Two-factor ANOVA problem
5.3. Multivariate regression model
5.3.1. Perform data extraction in the US region
5.3.2. Perform regression model
6. Discussion and expansion
7. Conclusion
8. Code and data source
8.1. Code source
8.2. Data source
9. References
1. Data introduction
1.1.Introduction to data.
This dataset contains airline passenger traffic statistics. It includes
information on airlines, airports, and the regions where flights depart and arrive, as
well as the type of operation, price category, terminal, boarding area, and
number of passengers.
The data is taken from
"https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics".
Variables in the dataset:
Variable                      Abbreviation  Cont/cat    Unit    Description
Activity Period               AP            Discrete    None    Date of the activity
Operating Airline             OA            Discrete    None    The airline operating the flight
Operating Airline IATA Code   OAI           Discrete    None    IATA code of the airline operating the flight
Published Airline             PA            Discrete    None    The airline that publishes the ticket price for the flight
Published Airline IATA Code   PAI           Discrete    None    The IATA code of the airline that published the ticket price for the flight
GEO Summary                   GS            Discrete    None    Summary of geographical area
GEO Region                    GR            Discrete    None    The geographic area
Activity Type Code            ATC           Discrete    None    The activity type
Price Category Code           PCC           Discrete    None    The price category of ticket prices
Terminal                      Ter           Discrete    None    The terminal of the flight
Boarding Area                 BA            Discrete    None    Flight terminal
Passenger Count               PC            Continuous  Person  The number of passengers on the flight
Adjusted Activity Type Code   AATC          Discrete    None    The activity type, adjusted for missing data
Adjusted Passenger Count      APC           Continuous  Person  Number of passengers on the flight, adjusted for missing data
Year                                        Continuous  None    Year of the activity
Month                                       Continuous  None    Month of the activity

Table 1:
1.2.Data management:
Air traffic passenger statistics can be a useful tool for understanding the aviation
industry and planning travel. This dataset from Open Flight contains information on air
traffic passenger statistics by airline for 2017. The data includes passenger numbers,
operating airlines, published airlines, geographic regions, activity type codes, price
category codes, terminals, boarding areas, and the year and month of the activity.

1.3.Research ideas:
• One-way ANOVA (Determining Differences Between Geographic Regions):
The main objective is to see whether passenger counts differ significantly between
geographic regions (GEO Region). For example, do passenger counts in Europe, North
America, and Asia differ significantly? If there are differences, they may reflect the
level of economic development in each region, differences in air transport demand, and
differences in cultural, social, or transportation planning factors.

• Two-way ANOVA (Interaction between GEO Region and Operating Airline):
Does the effect of Operating Airline on passenger volume differ depending on the GEO
Region? For example, a large airline may serve many passengers in North America but
fewer passengers in Africa. Similarly, a domestic airline may have a high passenger
volume in one region but a lower volume in another.

2. Theoretical basis
2.1.Theory of P-value:
The p-value is a widely used statistical measure for assessing the reliability of
research findings. Its primary function is to help judge whether an observed result
could plausibly have arisen by chance alone or reflects the influence of a specific
factor.

Application of P-value

In a statistical test, sample results are compared to possible population
conditions by way of two competing hypotheses. The null hypothesis (H₀) is a neutral
or "uninteresting" statement about a population, such as "no change" in the value of a
parameter from a previous known value or "no difference" between two groups. The
alternative (or research) hypothesis (H₁) is the "interesting" statement that
the person performing the test would like to conclude if the data allow it. The
p-value is the probability of obtaining the observed sample results (or a more extreme
result) when the null hypothesis is actually true. If the p-value is very small, usually
less than or equal to a previously chosen threshold called the significance level
(traditionally 5% or 1%), the observed data are inconsistent with the assumption that
the null hypothesis is true, so the null hypothesis is rejected in favor of the
alternative.
An informal interpretation of a p-value, based on a significance level of
about 10%, might be:

𝑝 ≤ 0.01: Very strong presumption against the null hypothesis

0.01 < 𝑝 ≤ 0.05: Strong presumption against the null hypothesis

0.05 < 𝑝 ≤ 0.1: Low presumption against the null hypothesis

𝑝 > 0.1: No presumption against the null hypothesis




Steps to calculate the P-value

1. State the Null and Alternative Hypotheses:

Null Hypothesis (H₀): A statement of no effect or no difference.

Alternative Hypothesis (H₁): A statement that contradicts the null hypothesis.

2. Choose the Appropriate Test Statistic:

The choice of test statistic depends on the type of data and the research question. Common
test statistics include:

● Z-test: For large samples and known population standard deviation.


● T-test: For small samples or unknown population standard deviation.

3. Calculate the Test Statistic:

● For a z-test, use the formula $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, where
$\bar{x}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the population
standard deviation (or the sample standard deviation if the population standard
deviation is unknown) and $n$ is the sample size.

● For a t-test, use the formula $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$, where
$\bar{x}$ is the sample mean, $\mu$ is the population mean, $s$ is the sample standard
deviation, and $n$ is the sample size.

4. Find the p-value for the test statistic.

For a z-test, use a z-table of cumulative (left-tail) probabilities. For a left-tailed
test, the p-value is read directly from the table. For a right-tailed test, the p-value
is 1 minus the table value. For a two-tailed test, the p-value is twice the smaller of
these two tail probabilities.

For a t-test, use a t-table with $n - 1$ degrees of freedom.
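
As an alternative to a printed z-table, the three lookups described above can be reproduced with the base R function pnorm(), which returns the left-tail probability of the standard normal distribution. The following is a minimal sketch; the test statistic z is a hypothetical value chosen only for illustration.

```r
# Hypothetical z statistic (illustration only, not from the project data)
z <- -1.5

# Left-tailed test: P(Z <= z), the value a cumulative z-table reports
p_left <- pnorm(z)

# Right-tailed test: P(Z >= z), i.e. 1 minus the table value
p_right <- pnorm(z, lower.tail = FALSE)

# Two-tailed test: twice the smaller tail probability
p_two <- 2 * pnorm(-abs(z))

print(c(left = p_left, right = p_right, two_sided = p_two))
```

For a t-test, pt() with the appropriate degrees of freedom plays the same role as pnorm().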

Example:
A researcher wants to test whether the average height of a population is
different from 170 cm. They collect a sample of 30 individuals and find the following:

- Sample mean (x̄ ) = 168 cm


- Sample standard deviation (s) = 5 cm
- Sample size (n) = 30

Step 1: State the Hypotheses:

Null Hypothesis (H₀): μ = 170 cm (The population mean is 170 cm)

Alternative Hypothesis (H₁): μ ≠ 170 cm (The population mean is not 170 cm)

Step 2: Choose the Test Statistic:

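Because we do not know the population standard deviation in this problem, we use
the t-test.

Step 3: Calculate the t-statistic:

The formula for the t-statistic is:

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

Plugging in the values:

t = (168 - 170) / (5 / √30) ≈ -2.19

Step 4: Find the P-Value:

At the 5% significance level, the two-sided rejection region from the normal
approximation is (-∞; -1.96) ∪ (1.96; ∞); with n - 1 = 29 degrees of freedom, the
exact t critical value is about 2.045. Since |t| ≈ 2.19 exceeds both critical values,
the two-sided p-value is below 0.05 and H₀ is rejected.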
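
The calculation in this example can be checked in R from the summary statistics alone. This is a minimal sketch: pt() and qt() are the base R distribution functions for Student's t, and the variable names are chosen here for illustration.

```r
# Summary statistics from the worked example above
x_bar <- 168   # sample mean (cm)
mu0   <- 170   # hypothesized population mean (cm)
s     <- 5     # sample standard deviation (cm)
n     <- 30    # sample size

t_stat  <- (x_bar - mu0) / (s / sqrt(n))   # test statistic
df      <- n - 1                           # degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)        # two-sided p-value
t_crit  <- qt(0.975, df)                   # two-sided critical value at alpha = 0.05

print(round(c(t = t_stat, df = df, p = p_value, t_crit = t_crit), 3))
```

Comparing |t_stat| with t_crit, or p_value with 0.05, leads to the same decision about H₀.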

2.2.Theory of ANOVA (Analysis of Variance):


The objective of the Analysis of Variance (ANOVA) is to compare the means of
multiple population groups. This is accomplished by analyzing the mean values of
observed samples drawn from these groups and conducting hypothesis tests to
determine if the overall population means are equal.

There are two primary types of variance analysis models: one-factor and two-factor
ANOVA. A "factor" in this context is an independent variable that influences the
dependent variable under study, so the name indicates how many such variables are
examined.
One-factor ANOVA examines the influence of a single independent variable on the
dependent variable. For example:

Education: Comparing the average test scores of students from three different schools.
Healthcare: Comparing the effectiveness of three different drugs in reducing pain.

Analysis of single-factor variance

Case: the k populations are normally distributed with equal variances

Suppose we want to compare the means of k populations (in the examples above, k = 3)
based on independent random samples of sizes n₁, n₂, ..., nₖ drawn from each population.

To conduct an ANOVA analysis, three key assumptions must be met:

● Normality: The populations from which the samples are drawn should be
normally distributed.
● Homogeneity of variance: The populations should have equal variances.
● Independence: The observations within each sample should be independent of
each other.

Given these assumptions, we can formulate the null and alternative hypotheses
for the ANOVA test:

Null Hypothesis (H₀): The means of all k populations are equal (μ₁ = μ₂ = ... = μₖ).

Alternative Hypothesis (H₁): At least one pair of population means differs.

In simpler terms, the null hypothesis states that there is no significant difference
between the means of the k populations, while the alternative hypothesis suggests that
at least one population mean is different from the others.

Steps to follow:

Step 1: Calculate the sample mean of each group (as representatives of the populations)
and the grand mean of the k samples.

First, we arrange the observations $x_{ij}$ (observation $j$ in group $i$) from the
k independent random samples as follows:

Table below: General data table for the analysis of variance.

Figure 1:

The sample mean of each group, $\bar{x}_1, \bar{x}_2, \dots, \bar{x}_k$, is calculated
by the formula:

$\bar{x}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij} \quad (i = 1, 2, \dots, k)$

and the grand mean of the k samples (the mean of the entire survey sample) is:

$\bar{x} = \frac{\sum_{i=1}^{k} n_i \bar{x}_i}{\sum_{i=1}^{k} n_i}$

Step 2: Calculate the sums of squared differences (sums of squares): the within-group
sum of squares (SSW) and the between-group sum of squares (SSG).

+ The within-group sum of squares (SSW) is calculated by adding the squared
differences between the observed values and the sample mean of each group, and
then summing the results over all groups. SSW reflects the variability of the resulting
factor due to the influence of other factors, rather than the causal factor under study
(which is used to distinguish the populations/groups being compared).

+ The sum of the squared differences of each group is calculated by the formula:

Group 1: $SS_1 = \sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2$

Group 2: $SS_2 = \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2$

Similarly, we calculate up to the k-th group, $SS_k$. The sum of the squared
differences within the groups is then:

$SSW = SS_1 + SS_2 + \dots + SS_k$


+ The sum of the squared differences between groups (SSG) is calculated by
adding the squared differences between the sample mean of each group and the
grand mean of the k groups, each multiplied by the number of observations in that
group. SSG reflects the variability of the resulting factor due to the influence of
the causal factor under study.

$SSG = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2$

+ The total sum of squares (SST) is calculated by adding the squared differences
between each observed value of the entire sample ($x_{ij}$) and the grand mean
($\bar{x}$). SST reflects the variability of the resulting factor due to the influence
of all causes.

$SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$

It can be easily proved that the sum of the total squared differences is equal to the
sum of the squared differences within the groups and the sum of the squared differences
between the groups.
SST = SSW + SSG
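
The identity can be verified by splitting each deviation from the grand mean into a within-group part and a between-group part. The derivation below is a short sketch using the symbols of this section: $x_{ij}$ is observation $j$ in group $i$, $\bar{x}_i$ is the mean of group $i$, $\bar{x}$ is the grand mean, and $n_i$ is the size of group $i$.

```latex
\begin{align*}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2
     = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \bigl[(x_{ij} - \bar{x}_i) + (\bar{x}_i - \bar{x})\bigr]^2 \\
    &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2
     + 2\sum_{i=1}^{k} (\bar{x}_i - \bar{x}) \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)
     + \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 \\
    &= SSW + 0 + SSG
\end{align*}
```

The cross term vanishes because the deviations within each group sum to zero: $\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i) = 0$ for every group $i$.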

Thus, the above formula shows that SST, the entire variation of the resulting
factor, is decomposed into two parts: the variation generated by the factor under
study (SSG) and the variation produced by other factors not studied here (SSW). The
more "significant" the variation produced by the causal factor under consideration is
relative to the variation produced by the other factors, the more grounds we have to
reject H₀ and conclude that the causal factor under study significantly influences the
resulting factor.


Step 3: Calculate the variances (the averages of the squared differences).
Variances are calculated by dividing the sums of squared differences by the
corresponding degrees of freedom.

The within-group variance (MSW) is calculated by dividing the within-group sum of
squares (SSW) by the corresponding degrees of freedom, $n - k$ ($n$ is the total number
of observations, $k$ is the number of groups compared). MSW is an estimate of the
variation of the resulting factor caused by other factors.

$MSW = \frac{SSW}{n - k}$

The between-group variance (MSG) is calculated by dividing the between-group sum of
squares (SSG) by the corresponding degrees of freedom, $k - 1$. MSG is an estimate of
the variability of the resulting factor caused by the causal factor under study.

$MSG = \frac{SSG}{k - 1}$

Step 4: Test the hypothesis:

The hypothesis that the k population means are equal is decided based on the ratio
of two variances: the between-group variance (MSG) and the within-group variance
(MSW). This ratio is called the F-ratio because, under H₀, it follows the
Fisher-Snedecor (F) distribution with $k - 1$ degrees of freedom in the numerator and
$n - k$ in the denominator.

$F = \frac{MSG}{MSW}$

We reject the hypothesis H₀, which holds that the k population means are equal, when:

$F > F_{k-1,\,n-k;\,\alpha}$

Here $F_{k-1,\,n-k;\,\alpha}$ is the critical value from the F table: look up $k - 1$
degrees of freedom along the first row and $n - k$ along the first column, using the
table for the appropriate significance level.
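
The four steps above can be sketched in R and compared against the built-in aov() function. The data here are a hypothetical toy example with three groups, not the project dataset, and the variable names are illustrative.

```r
# Hypothetical toy data: three groups of three observations each
scores <- c(82, 85, 88, 75, 78, 80, 90, 92, 95)
group  <- factor(rep(c("A", "B", "C"), each = 3))

n <- length(scores)   # total number of observations
k <- nlevels(group)   # number of groups

# Step 1: group means and grand mean
grand_mean  <- mean(scores)
group_means <- tapply(scores, group, mean)
group_sizes <- tapply(scores, group, length)

# Step 2: sums of squares (ave() returns each observation's group mean)
ssw <- sum((scores - ave(scores, group))^2)
ssg <- sum(group_sizes * (group_means - grand_mean)^2)
sst <- sum((scores - grand_mean)^2)

# Step 3: mean squares
msw <- ssw / (n - k)
msg <- ssg / (k - 1)

# Step 4: F-ratio, critical value at alpha = 0.05, and p-value
f_stat  <- msg / msw
f_crit  <- qf(0.95, k - 1, n - k)
p_value <- pf(f_stat, k - 1, n - k, lower.tail = FALSE)

# Cross-check against the built-in one-way ANOVA
fit <- summary(aov(scores ~ group))[[1]]
print(c(manual_F = f_stat, aov_F = fit[["F value"]][1]))
```

Using qf() replaces the printed F table: H₀ is rejected when f_stat exceeds f_crit, which is equivalent to p_value falling below 0.05.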

3. Data pre-processing:
3.1.Data importing:
We use the read.csv command to read the data file
“Air_Traffic_Passenger_Statistics.csv”, downloaded from
https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics,
and then use the head command to print the first 4 rows of the data, as shown
in Table 2:
## # A tibble: 4 × 17
##   index `Activity Period` `Operating Airline` `Operating Airline IATA Code`
##   <dbl>             <dbl> <chr>               <chr>
## 1     0            200507 ATA Airlines        TZ
## 2     1            200507 ATA Airlines        TZ
## 3     2            200507 ATA Airlines        TZ
## 4     3            200507 Air Canada          AC
## # ℹ 13 more variables: `Published Airline` <chr>,
## #   `Published Airline IATA Code` <chr>, `GEO Summary` <chr>,
## #   `GEO Region` <chr>, `Activity Type Code` <chr>,
## #   `Price Category Code` <chr>, Terminal <chr>, `Boarding Area` <chr>,
## #   `Passenger Count` <dbl>, `Adjusted Activity Type Code` <chr>,
## #   `Adjusted Passenger Count` <dbl>, Year <dbl>, Month <chr>

Table 2: Data table
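
The import step can be sketched as follows. To keep the example self-contained it reads a few rows from an in-memory string instead of the downloaded file; with the real data, the text argument would be replaced by the path to the CSV file. The column subset and the renaming of spaces to underscores are assumptions made for illustration (the underscore names match those shown later in the structure listing). The printed tibble above suggests that readr's read_csv may have been used in practice, whereas this sketch uses base read.csv as named in the text.

```r
# A few rows copied from the listing above (subset of columns only)
csv_text <- "index,Activity Period,Operating Airline,GEO Region,Passenger Count
0,200507,ATA Airlines,US,27271
1,200507,ATA Airlines,US,29131
2,200507,ATA Airlines,US,5415
3,200507,Air Canada,Canada,35156"

# check.names = FALSE keeps the original column names with spaces
air <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

# Assumed cleaning step: replace spaces with underscores in column names
names(air) <- gsub(" ", "_", names(air))

# Print the first 4 rows
print(head(air, 4))
```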

3.2.Inspecting data:
Next, we use the str() command to examine the structure of the dataset, as shown in
Table 3. This command reports the number of rows and columns, the data type of each
variable, and a preview of sample values for each column. Such an inspection ensures
that we are familiar with the data and can prepare it appropriately for further
analysis.

## spc_tbl_ [15,007 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ index                      : num [1:15007] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Activity_Period            : num [1:15007] 200507 200507 200507 200507 200507 ...
##  $ Operating_Airline          : chr [1:15007] "ATA Airlines" "ATA Airlines" "ATA Airlines" "Air Canada" ...
##  $ Operating_Airline_IATA_Code: chr [1:15007] "TZ" "TZ" "TZ" "AC" ...
##  $ Published_Airline          : chr [1:15007] "ATA Airlines" "ATA Airlines" "ATA Airlines" "Air Canada" ...
##  $ Published_Airline_IATA_Code: chr [1:15007] "TZ" "TZ" "TZ" "AC" ...
##  $ GEO_Summary                : chr [1:15007] "Domestic" "Domestic" "Domestic" "International" ...
##  $ GEO_Region                 : chr [1:15007] "US" "US" "US" "Canada" ...
##  $ Activity_Type_Code         : chr [1:15007] "Deplaned" "Enplaned" "Thru / Transit" "Deplaned" ...
##  $ Price_Category_Code        : chr [1:15007] "Low Fare" "Low Fare" "Low Fare" "Other" ...
##  $ Terminal                   : chr [1:15007] "Terminal 1" "Terminal 1" "Terminal 1" "Terminal 1" ...
##  $ Boarding_Area              : chr [1:15007] "B" "B" "B" "B" ...
##  $ Passenger_Count            : num [1:15007] 27271 29131 5415 35156 34090 ...
##  $ Adjusted_Activity_Type_Code: chr [1:15007] "Deplaned" "Enplaned" "Thru / Transit * 2" "Deplaned" ...
##  $ Adjusted_Passenger_Count   : num [1:15007] 27271 29131 10830 35156 34090 ...
##  $ Year                       : num [1:15007] 2005 2005 2005 2005 2005 ...
##  $ Month                      : chr [1:15007] "July" "July" "July" "July" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   index = col_double(),
##   ..   `Activity Period` = col_double(),
##   ..   `Operating Airline` = col_character(),
##   ..   `Operating Airline IATA Code` = col_character(),
##   ..   `Published Airline` = col_character(),
##   ..   `Published Airline IATA Code` = col_character(),
##   ..   `GEO Summary` = col_character(),
##   ..   `GEO Region` = col_character(),
##   ..   `Activity Type Code` = col_character(),
##   ..   `Price Category Code` = col_character(),
##   ..   Terminal = col_character(),
##   ..   `Boarding Area` = col_character(),
##   ..   `Passenger Count` = col_double(),
##   ..   `Adjusted Activity Type Code` = col_character(),
##   ..   `Adjusted Passenger Count` = col_double(),
##   ..   Year = col_double(),
##   ..   Month = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Table 3: Structure of the dataset


Table 3 shows that the dataset contains 15,007 rows and 17 columns. It also displays
the data type of each column (numeric or character) and sample values from each
column.

3.3.Inspect and handle missing data:


Next, we use apply() together with is.na() to check whether there are any missing
values in the dataset, as shown in Table 4. This step is essential for understanding
the completeness of the data and identifying any potential issues that may need to be
addressed before conducting further analysis.
## index Activity_Period
## 0 0
## Operating_Airline Operating_Airline_IATA_Code
## 0 54
## Published_Airline Published_Airline_IATA_Code
## 0 54
## GEO_Summary GEO_Region
## 0 0
## Activity_Type_Code Price_Category_Code
## 0 0
## Terminal Boarding_Area
## 0 0
## Passenger_Count Adjusted_Activity_Type_Code
## 0 0
## Adjusted_Passenger_Count Year
## 0 0
## Month
## 0

Table 4: The count of missing values for each variable.


It is noted that the variables Operating_Airline_IATA_Code and
Published_Airline_IATA_Code both have 54 missing data points.

We continue by checking the percentage of missing data in order to handle it
accordingly. The resulting missing rate (in percent) for each variable in the dataset
is shown in Table 5:

## index Activity_Period
## 0.0000000 0.0000000
## Operating_Airline Operating_Airline_IATA_Code
## 0.0000000 0.3598321
## Published_Airline Published_Airline_IATA_Code
## 0.0000000 0.3598321
## GEO_Summary GEO_Region
## 0.0000000 0.0000000
## Activity_Type_Code Price_Category_Code
## 0.0000000 0.0000000
## Terminal Boarding_Area
## 0.0000000 0.0000000
## Passenger_Count Adjusted_Activity_Type_Code
## 0.0000000 0.0000000
## Adjusted_Passenger_Count Year
## 0.0000000 0.0000000
## Month
## 0.0000000

Table 5: Percentage of missing for each variable in the dataset


These two variables each contain only about 0.36% missing data, a small portion of
the dataset. Therefore, the handling approach is to remove the rows with missing
data.
After removing the missing data, we use apply() with is.na() once more, as shown in
Table 6, to check that all missing values in the dataset have been resolved.
## index Activity_Period
## 0 0
## Operating_Airline Operating_Airline_IATA_Code
## 0 0
## Published_Airline Published_Airline_IATA_Code
## 0 0
## GEO_Summary GEO_Region
## 0 0
## Activity_Type_Code Price_Category_Code
## 0 0
## Terminal Boarding_Area
## 0 0
## Passenger_Count Adjusted_Activity_Type_Code
## 0 0
## Adjusted_Passenger_Count Year
## 0 0
## Month
## 0

Table 6: All the missing values in the dataset
The missing data has been successfully handled.
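
The missing-data workflow of this section (count, percentage, removal, re-check) can be sketched as below. The toy data frame is hypothetical and only mimics the situation of missing IATA codes. The text uses apply() with is.na(); colSums() and colMeans() are used here as an equivalent base R idiom, and na.omit() is one possible way to drop incomplete rows, not necessarily the call used in the project.

```r
# Hypothetical toy data with one missing IATA code (not the project data)
air <- data.frame(
  Operating_Airline           = c("ATA Airlines", "Air Canada", "Airline X", "Airline Y"),
  Operating_Airline_IATA_Code = c("TZ", "AC", NA, "YY"),
  Passenger_Count             = c(27271, 35156, 18, 6263),
  stringsAsFactors = FALSE
)

na_count   <- colSums(is.na(air))          # missing values per column
na_percent <- 100 * colMeans(is.na(air))   # missing percentage per column

air_clean <- na.omit(air)                  # drop rows containing any NA
na_after  <- colSums(is.na(air_clean))     # re-check after removal

# Consistency check on the figures reported in the text:
# 54 missing values out of 15,007 rows
reported_percent <- 100 * 54 / 15007

print(na_count)
print(round(reported_percent, 7))
```

The last line reproduces the missing rate reported for the two IATA code variables, and 15,007 - 54 = 14,953 matches the number of data points used in the outlier analysis.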
We use the hist() function to draw a frequency histogram to analyze the
distribution of passenger numbers, as shown in Figure 2:

Figure 2: Histogram of Passenger count


The majority of passenger count values are concentrated at low levels (represented
by the large column at the beginning of the histogram). The data contains some
extremely large values (outliers), indicated by the smaller columns further to the
right. Therefore, it is necessary to check these outliers to ensure accurate data
processing.

3.4.Outlier detection
Outliers are unusual points in the dataset that can affect the analysis. We define
an outlier as a value that falls outside the range:

[ Q1 - 1.5 × IQR ; Q3 + 1.5 × IQR ]

where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. We use the
boxplot() command to identify outliers and obtain the result in Table 7.

## [1] 2426

Table 7: Outlier

The results show that there are 2,426 outliers out of 14,953 data points, accounting
for more than 16% of the values in the dataset, as shown in Table 8. Because they make
up such a large share of the data and may provide meaningful insights into the problem
at hand, we retain these outlier values.

## [1] 0.1622417

Table 8: Percentage of outliers
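
The outlier rule above can be sketched in R on a small hypothetical vector. Two variants are shown: explicit fences computed with quantile(), and boxplot.stats(), the helper that boxplot() relies on. The two can differ slightly on some data because boxplot.stats() uses hinges rather than the default quantile definition. The last line recomputes the reported share from the counts given in the text.

```r
# Hypothetical passenger counts (illustration only)
x <- c(1, 5, 9, 12, 15, 18, 22, 30, 250, 900)

# Fences from the quartiles
q     <- unname(quantile(x, c(0.25, 0.75)))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

is_out <- x < lower | x > upper   # TRUE for values outside the fences
n_out  <- sum(is_out)             # number of outliers
share  <- mean(is_out)            # proportion of outliers

# Boxplot-based alternative: values drawn as points beyond the whiskers
out_box <- boxplot.stats(x)$out

# Share reported in the text: 2,426 outliers out of 14,953 data points
reported_share <- 2426 / 14953

print(c(n_out = n_out, share = share, reported_share = reported_share))
```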

With the outliers retained, Figure 2 shows that the passenger count data is
strongly right-skewed and does not resemble a normal distribution. Therefore,
we bring the data closer to a normal distribution by applying a natural
logarithm transformation.

We use the hist() command to redraw the passenger count histogram after
applying a log(x + 1) transformation, as shown in Figure 3:
After applying the log transformation, the data appears to approximate a normal
distribution, which is favorable for the analysis results.
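
A minimal sketch of the transformation, assuming a hypothetical vector of passenger counts: log1p() computes log(x + 1) in base R, and expm1() inverts it. The call to hist() uses plot = FALSE so that the example only computes the bins; dropping that argument draws the histogram.

```r
# Hypothetical passenger counts (illustration only)
passenger_count <- c(1, 18, 5409, 9260, 29131, 659837)

log_count <- log1p(passenger_count)    # natural log of (x + 1)
h <- hist(log_count, plot = FALSE)     # histogram bins without drawing

back <- expm1(log_count)               # inverse transform recovers the counts

print(round(log_count, 3))
print(h$counts)
```

Adding 1 before taking the logarithm keeps the transformation defined for a count of zero.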

4. Descriptive statistics:
4.1.Perform sample statistics for the number of passengers.
We use R commands to calculate sample statistics for the passenger count
variable and obtain the results in Table 9.

##               Statistic        Value
## 1                  Mean 2.934562e+04
## 2    Standard Deviation 5.839845e+04
## 3                   Min 1.000000e+00
## 4                   Max 6.598370e+05
## 5                Median 9.260000e+03
## 6              Variance 3.410379e+09
## 7          25% Quantile 1.000000e+00
## 8 50% Quantile (Median) 5.409000e+03
## 9          75% Quantile 9.260000e+03

Table 9: Result for the passenger count variable
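
A summary table of this shape can be built as sketched below, using a hypothetical vector. One detail is worth checking against Table 9: the median and the 50% quantile are the same quantity and should agree, yet the table lists 9,260 and 5,409. A possible cause (an assumption, since the project code is not shown here) is that the quantiles were taken by position from the default quantile() output, whose first element is the 0% quantile (the minimum). Requesting the probabilities explicitly avoids that shift.

```r
# Hypothetical passenger counts (illustration only)
x <- c(1, 12, 5409, 9260, 21000, 659837)

q_default <- quantile(x)                              # 0%, 25%, 50%, 75%, 100%
q_named   <- quantile(x, probs = c(0.25, 0.5, 0.75))  # only the quartiles

stats <- data.frame(
  Statistic = c("Mean", "Standard Deviation", "Min", "Max", "Median",
                "Variance", "25% Quantile", "50% Quantile (Median)",
                "75% Quantile"),
  Value = c(mean(x), sd(x), min(x), max(x), median(x), var(x),
            q_named[["25%"]], q_named[["50%"]], q_named[["75%"]])
)

print(stats)
```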

Observation:
The mean value of passenger counts is 29,345.62. This represents the total number
of passengers divided by the total number of observations.
The minimum value in the dataset is 1. This means at least one instance
recorded a very low passenger count.
The maximum value is 659,837, showing that one observation recorded an
extremely high passenger count, far exceeding the mean.
Right-skewed distribution: The mean is significantly higher than the
median, indicating a few extremely large values are pulling the mean upward.
High variability: The large standard deviation and the wide range between
the minimum (1) and maximum (659,837) indicate substantial variability in the data.
Low quantile values: With the 25% quantile at 1, a significant portion of the
data represents very low passenger counts.

4.2.Perform count statistics for categorical variables.


We use the table() function to compute frequency statistics for the categorical
variables and obtain the results in Tables 10 to 14.

##
## 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 694 1369 1409 1433 1393 1377 1375 1362 1346 1364 1460 371

Table 10: The result for the variable “Year”
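
As a minimal sketch of how such a frequency table is produced, table() counts how many rows take each value of a variable, and prop.table() converts the counts to shares. The vector below is hypothetical.

```r
# Hypothetical Year column (illustration only)
year <- c(2005, 2006, 2006, 2007, 2007, 2007)

year_counts <- table(year)               # number of records per year
year_share  <- prop.table(year_counts)   # relative frequencies

print(year_counts)
print(round(year_share, 3))
```

Each count is a number of records (rows) with that value, not a number of passengers.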

Observation:

These counts are the number of records in the dataset for each year, not passenger
totals. Observations indicate:

● From 2005 to 2007 the number of records rises from 694 to 1409.

● From 2008 to 2015 the counts are relatively stable, fluctuating between 1300
and 1500.
● In 2016 the count drops sharply to 371. Like the low 2005 figure, this may simply
reflect partial-year coverage (the first records in the data are dated July 2005)
rather than a major event or disruption such as an economic recession, natural
disaster, or pandemic.

##
##     April    August  December  February   January      July      June     March
##      1141      1303      1255      1254      1264      1299      1177      1248
##       May  November   October September
##      1168      1261      1287      1296

Table 11: The result for the variable “Month”

Observation:

● High counts: August (1303), July (1299), and September (1296) have the highest
numbers of observations and are therefore the months with the most recorded activity.
● Low counts: April (1141), May (1168), and June (1177) have the fewest
observations and are the quietest months.
● Moderate counts: The other months range between 1248 and 1287, with steady
activity overall.
● This distribution suggests a possible seasonal pattern, with higher activity from
July to September.

##
##       Deplaned       Enplaned Thru / Transit
##           7043           6991            919
Table 12: The result for the variable “Activity type code”

Observation:

● Deplaned (7043) and Enplaned (6991) have the highest counts, indicating that most
records concern arriving or departing passengers.
● Thru / Transit (919) has the fewest observations, indicating that far fewer records
concern passengers in transit who do not leave the airport.
● Overall, most recorded activity involves arrivals and departures, with only a small
share in transit.

##
## Low Fare    Other
##     1884    13069

Table 13: The result for the variable “Price Category Code”

Observation:

● Other: 13,069. This is the highest count, meaning that most records fall into
this category.
● Low Fare: 1,884. This is the lower count, indicating that fewer records involve
low-fare options.
● This suggests that most recorded traffic is not low-fare travel.

##
##                Asia Australia / Oceania              Canada     Central America
##                3272                 737                1418                 272
##              Europe              Mexico         Middle East       South America
##                2078                1115                 214                  90
##                  US
##                5757

Table 14: The result for the variable “GEO Region”

Observation:

● US has 5757 observations, the most of any region.
● Asia (3272) and Europe (2078) are the next largest regions.
● South America (90) and Middle East (214) have the fewest observations.
● Overall, most records are for the US, followed by Asia and Europe, with fewer
from other regions.

4.3. Plot distribution charts for selected categorical variables.


- Use the boxplot() command to plot Passenger Count against each categorical variable.
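As a minimal sketch of the boxplot() command, the three plots in this subsection could be drawn along the following lines. The data frame here is a small synthetic stand-in with made-up values, not the project data. Only the column names (Passenger_Count, GEO_Region, Price_Category_Code, Activity_Type_Code) are assumed to match the project's DF.

```r
# Synthetic stand-in for the project's data frame (all values are made up).
set.seed(1)
n <- 300
DF <- data.frame(
  Passenger_Count     = round(rlnorm(n, meanlog = 8, sdlog = 1.5)),
  GEO_Region          = factor(sample(c("US", "Asia", "Europe"), n, replace = TRUE)),
  Price_Category_Code = factor(sample(c("Low Fare", "Other"), n, replace = TRUE)),
  Activity_Type_Code  = factor(sample(c("Deplaned", "Enplaned", "Thru / Transit"),
                                      n, replace = TRUE))
)

# Draw to a null device so the sketch also runs without a display.
pdf(NULL)

# One boxplot per categorical variable; the formula reads "y by group".
bp_region   <- boxplot(Passenger_Count ~ GEO_Region, data = DF,
                       main = "Passenger Count by GEO Region")
bp_price    <- boxplot(Passenger_Count ~ Price_Category_Code, data = DF,
                       main = "Passenger Count by Price Category Code")
bp_activity <- boxplot(Passenger_Count ~ Activity_Type_Code, data = DF,
                       main = "Passenger Count by Activity Type Code")

dev.off()
```

In an interactive session, the pdf(NULL) and dev.off() lines can be dropped so the plots appear on screen. Each call also returns the five-number summary per group in its stats component.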

Figure 4: Boxplot of Passenger Count by GEO Region

Observation: Based on the graph, the US region has the most observations and generally
higher passenger counts than the other regions.

Plot a boxplot for Passenger Count by Price Category Code.

Figure 5: Boxplot of Passenger Count by Price Category Code

Observation: Overall, passenger counts differ little between the two price
categories.

Plot a boxplot for Passenger Count by Activity Type Code.

Figure 6: Boxplot of Passenger Count by Activity Type

Observation:
● This chart highlights the differences in passenger numbers across the three
Activity Type Code groups.
● The Deplaned and Enplaned groups have similar sizes and distributions, whereas
the Thru/Transit group is significantly smaller.
5. Inferential Statistics:
5.1. One-factor ANOVA Building:
The primary goal of this section is to determine whether there are significant
differences in the average passenger counts among different geographic regions.
By using one-factor ANOVA, we aim to identify whether regional factors influence
passenger traffic and to highlight any regions with notably higher or lower passenger
counts, which provides a clearer understanding of air traffic patterns across various
parts of the world.
To conduct the ANOVA test, we used the aov() function to compare the group means.
It calculates the F-statistic, which indicates whether the variability between group
means is significantly greater than the variability within groups. In addition, we
structured the dataset so that the log-transformed passenger count
(Log_Passenger_Count) is the dependent variable and the geographic region
(GEO_Region) is the independent variable.
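The call described above can be sketched as follows. This is an illustration on synthetic data, not the project's dataset: the three regions, the group sizes, and the group means are hypothetical. Only the formula Log_Passenger_Count ~ GEO_Region follows the report.

```r
# Synthetic data: three regions with hypothetical mean log passenger counts.
set.seed(42)
regions <- c("Asia", "Europe", "US")
DF <- data.frame(GEO_Region = factor(rep(regions, times = c(80, 60, 120))))
group_means <- c(Asia = 9.0, Europe = 9.0, US = 9.6)  # made-up values
DF$Log_Passenger_Count <- rnorm(nrow(DF),
                                mean = group_means[as.character(DF$GEO_Region)],
                                sd = 1.5)

# One-factor ANOVA: response on the left, grouping factor on the right.
anova_model <- aov(Log_Passenger_Count ~ GEO_Region, data = DF)

# summary() returns a list; its first element is the ANOVA table
# with columns Df, Sum Sq, Mean Sq, F value and Pr(>F).
anova_table <- summary(anova_model)[[1]]
print(anova_table)
```

The first row of the table holds the between-group terms, and the Residuals row holds the within-group terms.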

##                Df Sum Sq Mean Sq F value Pr(>F)
## GEO_Region      8   3840   480.1   212.7 <2e-16 ***
## Residuals   14944  33721     2.3
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Table 15: One-factor ANOVA of Log_Passenger_Count by GEO_Region

HYPOTHESES
Hypothesis (H0): There is no difference in the mean passenger counts across
the GEO regions.

Hypothesis (H1): At least one pair of GEO regions has a statistically
significant difference in mean passenger counts.

Based on the ANOVA results:

SSB = 3840, degrees of freedom df = k-1 = 8 (k = 9)
SSW = 33721, degrees of freedom df = N-k = 14944 (N = 14953, k = 9), where k is
the number of groups (GEO regions) and N is the total number of observations.
Because the P-value < 2e-16 is below the 5% significance level, we reject the
hypothesis (H0). This means there is a statistically significant difference in mean
(log) passenger counts across the GEO regions. In other words, the average passenger
count varies significantly depending on the geographic region.
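As a quick check of the arithmetic behind the F-statistic, the table can be reproduced from the sums of squares and degrees of freedom quoted above. This is only a sketch. Because the printed sums of squares are rounded, the mean square between groups comes out as 480.0 rather than the 480.1 shown in the table.

```r
# Values quoted in the ANOVA results above.
ssb <- 3840    # between-group sum of squares
ssw <- 33721   # within-group (residual) sum of squares
k   <- 9       # number of GEO regions
n   <- 14953   # total number of observations

df_between <- k - 1
df_within  <- n - k

msb <- ssb / df_between   # mean square between groups
msw <- ssw / df_within    # mean square within groups
f_value <- msb / msw

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(c(MSB = msb, MSW = msw, F = f_value), 2))
```

The resulting F-value agrees with the reported 212.7 up to rounding, and the p-value is far below 2e-16.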

Multiple comparisons after ANOVA: Tukey HSD method:

We performed a post hoc test using TukeyHSD(). This approach allows us to
identify which specific pairs of regions differ significantly. For instance, it provides
insight into whether regions such as the US and Asia or South America and Europe
exhibit meaningful disparities in passenger averages.
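A minimal sketch of the post hoc step, again on synthetic data with hypothetical regions and means, shows the shape of the TukeyHSD() result. Each row is named "A-B" and its diff column is the mean of A minus the mean of B.

```r
# Synthetic data with four regions (means are made up for illustration).
set.seed(7)
regions <- c("Asia", "Europe", "South America", "US")
DF <- data.frame(GEO_Region = factor(rep(regions, each = 50)))
true_means <- c(9.0, 9.0, 7.8, 9.6)
names(true_means) <- regions
DF$Log_Passenger_Count <- rnorm(nrow(DF),
                                mean = true_means[as.character(DF$GEO_Region)],
                                sd = 1)

fit   <- aov(Log_Passenger_Count ~ GEO_Region, data = DF)
tukey <- TukeyHSD(fit, "GEO_Region", conf.level = 0.95)

# The result for the factor is a matrix with columns diff, lwr, upr, "p adj".
pairs <- tukey$GEO_Region
significant <- pairs[, "p adj"] < 0.05
print(round(cbind(pairs, significant), 4))
```

With four regions there are six pairwise rows. The project's nine regions give 36 rows.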

##   Tukey multiple comparisons of means
##
## Fit: aov(formula = Log_Passenger_Count ~ GEO_Region, data = DF)
##
## $GEO_Region
##                                            diff         lwr          upr     p adj
## Australia / Oceania-Asia            -0.59218968 -0.78219346 -0.402185897 0.0000000
## Canada-Asia                         -0.55418556 -0.70234396 -0.406027169 0.0000000
## Central America-Asia                -0.64686967 -0.94093265 -0.352806681 0.0000000
## Europe-Asia                          0.00260222 -0.12811478  0.133319221 1.0000000
## Mexico-Asia                         -0.74535659 -0.90695025 -0.583762917 0.0000000
## Middle East-Asia                    -0.16364218 -0.49244442  0.165160055 0.8346903
## South America-Asia                  -1.21056631 -1.70848171 -0.712650912 0.0000000
## US-Asia                              0.64525459  0.54323122  0.747277961 0.0000000
## Canada-Australia / Oceania           0.03800411 -0.17360598  0.249614209 0.9997773
## Central America-Australia / Oceania -0.05467999 -0.38528695  0.275926971 0.9998803
## Europe-Australia / Oceania           0.59479190  0.39500487  0.794578926 0.0000000
## Mexico-Australia / Oceania          -0.15316691 -0.37439176  0.068057945 0.4404667
## Middle East-Australia / Oceania      0.42854750  0.06669278  0.790402209 0.0073911
## South America-Australia / Oceania   -0.61837663 -1.13871082 -0.098042446 0.0070496
## US-Australia / Oceania               1.23744427  1.05513494  1.419753594 0.0000000
## Central America-Canada              -0.09268410 -0.40114874  0.215780539 0.9911763
## Europe-Canada                        0.55678778  0.39627507  0.717300494 0.0000000
## Mexico-Canada                       -0.19117102 -0.37769156 -0.004650489 0.0395676
## Middle East-Canada                   0.39054338  0.04880034  0.732286424 0.0117823
## South America-Canada                -0.65638075 -1.16293493 -0.149826564 0.0019177
## US-Canada                            1.19944015  1.06128747  1.337592838 0.0000000
## Europe-Central America               0.64947189  0.34899483  0.949948940 0.0000000
## Mexico-Central America              -0.09848692 -0.41362505  0.216651207 0.9885562
## Middle East-Central America          0.48322749  0.05742185  0.909033117 0.0128525
## South America-Central America       -0.56369664 -1.13037051  0.002977221 0.0524735
## US-Central America                   1.29212426  1.00297327  1.581275244 0.0000000
## Mexico-Europe                       -0.74795881 -0.92094995 -0.574967667 0.0000000
## Middle East-Europe                  -0.16624440 -0.50079534  0.168306538 0.8358960
## South America-Europe                -1.21316853 -1.71489870 -0.711438359 0.0000000
## US-Europe                            0.64265237  0.52339555  0.761909189 0.0000000
## Middle East-Mexico                   0.58171441  0.23393587  0.929492941 0.0000076
## South America-Mexico                -0.46520972 -0.97585514  0.045435695 0.1077970
## US-Mexico                            1.39061118  1.23813899  1.543083364 0.0000000
## South America-Middle East           -1.04692413 -1.63237877 -0.461469485 0.0000011
## US-Middle East                       0.80889677  0.48448012  1.133313423 0.0000000
## US-South America                     1.85582090  1.36079060  2.350851200 0.0000000

Table 16: Tukey HSD multiple comparisons of Log_Passenger_Count between GEO regions

The Tukey HSD test compares the mean differences in (log) passenger counts between
each pair of GEO regions. Each row "A-B" reports the mean of region A minus the mean
of region B. As an example, consider the pair US-South America:
+ Hypothesis H0: The average number of passengers on flights from the US and South
America is equal.
+ Hypothesis H1: The average number of passengers on flights from the US and South
America is different.
Based on the Tukey HSD test, the adjusted p-value is reported as 0.0000000, which is
below the 5% significance level. Therefore, we reject H0, indicating that there is a
statistically significant difference in the average number of passengers between the
US and South America.
In addition, we can rely on the confidence interval for the difference in the average
number of passengers between the two regions. The interval is (1.361, 2.351). Since
the interval does not include 0, this confirms that there is a meaningful difference
between the two regions, with the US being higher.
Similarly, for the remaining pairs we can conclude:

Significant Differences:
-Asia has significantly higher passenger counts than several regions (the differences
below are the other region minus Asia, so a negative value means Asia is higher):
+Australia / Oceania: Mean difference = -0.592, p<0.0001
+Canada: Mean difference = -0.554, p<0.0001
+Mexico: Mean difference = -0.745, p<0.0001
+South America: Mean difference = -1.211, p<0.0001
-US has significantly higher passenger counts compared to several regions, such as:
+South America: Mean difference = 1.856, p<0.0001
+Mexico: Mean difference = 1.391, p<0.0001
+Middle East: Mean difference = 0.809, p<0.0001
Non-Significant Differences:
Europe and Asia: P-value = 1.00, no significant difference.
Middle East and Asia: P-value = 0.835, no significant difference.
South America and Mexico: P-value = 0.108, no significant difference.
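The reading of these rows can be sketched in a few lines of R. The four rows below are copied from Table 16. The helper columns (ci_excludes_zero, significant, higher_region) are names introduced here for illustration only.

```r
# Four rows copied from the Tukey HSD output (Table 16).
tukey_rows <- data.frame(
  pair  = c("US-South America", "Europe-Asia",
            "Australia / Oceania-Asia", "South America-Mexico"),
  diff  = c(1.85582090, 0.00260222, -0.59218968, -0.46520972),
  lwr   = c(1.36079060, -0.12811478, -0.78219346, -0.97585514),
  upr   = c(2.350851200, 0.133319221, -0.402185897, 0.045435695),
  p_adj = c(0.0000000, 1.0000000, 0.0000000, 0.1077970),
  stringsAsFactors = FALSE
)

# Two equivalent checks at the 5% level: the interval excludes 0,
# or the adjusted p-value is below 0.05.
tukey_rows$ci_excludes_zero <- tukey_rows$lwr > 0 | tukey_rows$upr < 0
tukey_rows$significant      <- tukey_rows$p_adj < 0.05

# A row "A-B" reports mean(A) - mean(B): a positive diff means A is higher.
parts  <- strsplit(tukey_rows$pair, "-", fixed = TRUE)
first  <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, `[`, character(1), 2)
tukey_rows$higher_region <- ifelse(!tukey_rows$significant, NA_character_,
                                   ifelse(tukey_rows$diff > 0, first, second))

print(tukey_rows[, c("pair", "diff", "significant", "higher_region")])
```

For the row "Australia / Oceania-Asia" the difference is negative, so the region named second (Asia) has the higher mean.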
This analysis provides valuable insights into the differences in passenger numbers
across geographic regions, highlighting that the US has a significantly higher number
of passengers compared to other regions, while areas such as South America and Mexico
exhibit notably lower numbers. These findings can be used to inform strategic
decisions regarding resource allocation and marketing efforts, enabling airlines to
prioritize regions with higher passenger demand while exploring opportunities to
enhance connectivity and engagement in underperforming areas.

Checking assumptions through residual plots:

We created residual diagnostic plots to check whether the data met the necessary
assumptions for ANOVA. These plots are used to check the conditions of normality,
linearity, and homoscedasticity (constant variance).
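As a sketch of how such diagnostic plots are typically produced, calling plot() on a fitted aov object draws the four standard panels. The data below are synthetic, with hypothetical regions and means, so the panels will not look like the project's figure.

```r
# Synthetic one-factor data (unbalanced groups, made-up means).
set.seed(3)
sizes <- c(50, 30, 40)
DF <- data.frame(GEO_Region = factor(rep(c("Asia", "Europe", "US"), times = sizes)))
DF$Log_Passenger_Count <- rnorm(sum(sizes),
                                mean = rep(c(9, 9, 9.6), times = sizes),
                                sd = 1.2)

fit <- aov(Log_Passenger_Count ~ GEO_Region, data = DF)

pdf(NULL)              # null device so the sketch runs without a display
par(mfrow = c(2, 2))   # arrange the four panels in a 2 x 2 grid
plot(fit)              # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage
dev.off()

# The residuals behind the plots can also be inspected directly.
res <- residuals(fit)
print(summary(res))
```

In an interactive session, dropping pdf(NULL) and dev.off() shows the panels on screen.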

Figure 7: Residual diagnostic plots for the one-factor ANOVA model

- As part of our analysis, we examined four key residual plots:

Residuals vs Fitted plot:

● The red line is curved, so the assumption of a linear relationship between Y and the
independent variable X is not completely satisfied.
● The red line is close to the y=0 line, so the assumption that the errors have mean 0
is satisfied.
● The errors are randomly scattered along the red line, so, judged from this plot
alone, the constant-variance assumption appears satisfied.

Q-Q Residuals:

● The residuals deviate from the reference line of the normal distribution, so the
normal distribution assumption of the errors is not satisfied.

Scale-Location plot:

● The error values are not randomly scattered along the red line, so the
assumption that the error variance is constant is not satisfied.

Residuals vs Leverage plot:

● The graph flags observations 9044 and 10560, but neither lies beyond the Cook's
distance lines. So, there is no need to remove them from the dataset.

- The diagnostic plots indicate some violations of key assumptions for ANOVA,
specifically the linearity and normality of residuals, as well as constant variance.
While the influential points are not extreme enough to warrant removal, these
assumption violations may limit the reliability of the ANOVA results. Adjustments,
such as transforming the dependent variable or employing a more robust statistical
model, might be necessary to address these issues.
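Two commonly used alternatives when normality or equal variance is doubtful are Welch's one-way test and the Kruskal-Wallis rank test. The sketch below runs both on synthetic, deliberately skewed data. These tests are suggestions added here for illustration and were not part of the project's analysis.

```r
# Synthetic skewed data with unequal spread across three hypothetical regions.
set.seed(11)
sizes <- c(60, 40, 80)
DF <- data.frame(GEO_Region = factor(rep(c("Asia", "Europe", "US"), times = sizes)))
DF$Passenger_Count <- rlnorm(sum(sizes),
                             meanlog = rep(c(8, 8, 9), times = sizes),
                             sdlog   = rep(c(1, 1.5, 2), times = sizes))

# Welch's one-way test: compares means without assuming equal variances.
welch <- oneway.test(Passenger_Count ~ GEO_Region, data = DF, var.equal = FALSE)

# Kruskal-Wallis test: rank-based, does not assume normal errors.
kw <- kruskal.test(Passenger_Count ~ GEO_Region, data = DF)

print(welch)
print(kw)
```

Both functions are in base R's stats package and accept the same formula interface as aov().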

5.2. Two-factor ANOVA problem:

The goal of this model is to examine the effects of two independent factors,
GEO_Region (geographical region) and Operating_Airline, on the dependent variable
Log_Passenger_Count. Additionally, we assess their interaction to understand whether
the combined influence of these factors significantly impacts passenger counts.

We use the aov() function to fit a two-factor ANOVA model, where the dependent
variable is Log_Passenger_Count, and the independent variables are GEO_Region and
Operating_Airline. The interaction term, GEO_Region:Operating_Airline, is included
to test whether the effect of one factor (e.g., GEO_Region) depends on the level of the
other factor (e.g., Operating_Airline). This allows us to assess both the individual and
combined effects of these factors on passenger counts.
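A minimal sketch of such a two-factor fit is given here, on synthetic data. The airline names and effect sizes are hypothetical. Only the model formula follows the report.

```r
# Synthetic data: 3 regions x 3 airlines with a built-in interaction effect.
set.seed(5)
n <- 400
DF <- data.frame(
  GEO_Region        = factor(sample(c("Asia", "Europe", "US"), n, replace = TRUE)),
  Operating_Airline = factor(sample(c("Airline A", "Airline B", "Airline C"),
                                    n, replace = TRUE))
)
DF$Log_Passenger_Count <- 9 +
  0.5 * (DF$GEO_Region == "US") -
  0.4 * (DF$Operating_Airline == "Airline C") +
  0.8 * (DF$GEO_Region == "US" & DF$Operating_Airline == "Airline A") +
  rnorm(n)

# A * B expands to A + B + A:B (two main effects plus the interaction).
fit2 <- aov(Log_Passenger_Count ~ GEO_Region * Operating_Airline, data = DF)
tab2 <- summary(fit2)[[1]]
print(tab2)
```

Note that aov() reports sequential sums of squares, so with unbalanced data the main-effect rows depend on the order in which the factors are written in the formula.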
##                                 Df Sum Sq Mean Sq F value Pr(>F)
## GEO_Region                       8   3840   480.1  324.91 <2e-16 ***
## Operating_Airline               69  10821   156.8  106.15 <2e-16 ***
## GEO_Region:Operating_Airline    16    945    59.1   39.99 <2e-16 ***
## Residuals                    14859  21954     1.5
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 17: Two-factor ANOVA of Log_Passenger_Count by GEO_Region and Operating_Airline

The analysis of the GEO_Region factor reveals a significant effect on passenger
counts, with an F-value = 324.91 and p-value < 2e-16. This result suggests that the
geographical region plays a substantial role in determining passenger volumes. The
variations in passenger counts across different regions are statistically significant,
indicating that geographical location is an important factor in passenger demand.

Similarly, the Operating_Airline factor has a significant impact on passenger
counts, with an F-value = 106.15 and p-value < 2e-16. This finding highlights that
the airline operating within a given region significantly influences passenger
numbers. Different airlines have varying abilities to attract passengers, which may be
related to factors such as service quality, routes, pricing, and brand loyalty.

The interaction between GEO_Region and Operating_Airline is also highly
significant, with an F-value = 39.99 and p-value < 2e-16. This result indicates that
the effect of the geographical region on passenger counts depends on the specific
airline operating within that region. In other words, the influence of a region on
passenger numbers is not uniform and varies with the airline serving that region.
Airlines may perform differently depending on the location, suggesting that regional
strategies could be beneficial for improving passenger numbers.

The Residuals row reports a mean square of 1.5; it has no F-value of its own. This is
lower than the residual mean square of 2.3 in the one-factor model, so adding
Operating_Airline and the interaction term reduces the unexplained variability. The
three model terms together account for about 42% of the total sum of squares (15606
of 37560), compared with about 10% for GEO_Region alone. A small residual mean square
does not by itself show that the assumptions of normality, independence, and equal
variance are satisfied; these should be checked with residual diagnostic plots, as in
Section 5.1.
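The share of variability attributed to each term can be worked out directly from the sums of squares in Table 17. The sketch below does this arithmetic; the variable names are introduced here for illustration.

```r
# Sums of squares and degrees of freedom copied from Table 17.
ss <- c("GEO_Region" = 3840,
        "Operating_Airline" = 10821,
        "GEO_Region:Operating_Airline" = 945,
        "Residuals" = 21954)
df <- c(8, 69, 16, 14859)

ss_total <- sum(ss)

# Proportion of the total sum of squares attributed to each row.
share <- ss / ss_total

# Proportion explained by the three model terms together.
r_squared <- 1 - ss[["Residuals"]] / ss_total

# Residual mean square (the 1.5 shown in the Mean Sq column).
mse <- ss[["Residuals"]] / df[4]

print(round(share, 3))
print(round(c(R2 = r_squared, MSE = mse), 3))
```

Because aov() uses sequential sums of squares, the split between GEO_Region and Operating_Airline depends on their order in the formula, while the total explained share does not.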

The analysis highlights that both the geographical region and the airline play
significant roles in influencing passenger counts, and their interaction shows that
the two factors should be considered jointly when analyzing passenger trends. By
understanding these factors, airlines and policymakers can make more informed
decisions about resource allocation, marketing strategies, and route planning.
Tailoring approaches to specific regions and airlines allows for more targeted
strategies, potentially improving passenger engagement and operational efficiency.
Additionally, the model offers a foundation for further research into other factors
that could help increase passenger numbers.

5.3. Multiple linear regression model:

5.3.1 Perform data extraction in the US region:
To evaluate the influence of factors on the dependent variable
"Passenger_Count", we consider the variables in the data. The independent variables
include both quantitative and categorical variables. Therefore, the analysis method
we choose here is a multiple linear regression model.

We set the following hypotheses to decide which variables influence Passenger_Count:

H0: the variable does not influence Passenger_Count ($\beta_i = 0$)
H1: the variable influences Passenger_Count ($\beta_i \neq 0$)
Based on the P-value: if the P-value < 5%, we reject H0 and accept H1; if the
P-value > 5%, we do not have enough evidence to reject H0.
To analyze the factors affecting the number of passengers, we first filter the
dataset for rows where GEO_Region is "US" and then inspect the structure of the
resulting subset, which is stored in the US_data variable.

# Keep only the rows for the US region
US_data <- subset(DF, GEO_Region == "US")

# Inspect the structure of the subset
str(US_data)

5.3.2 Perform regression model

We build a multiple linear regression model based on the independent
variables (Terminal, Price_Category_Code, Year) to see how they affect the number of
passengers in the US.
Dependent variable: Log_Passenger_Count (the log-transformed Passenger_Count)
Independent variables: Terminal, Price_Category_Code, Year.
We use the lm() function to build the multiple linear regression model:
model_lm2 <- lm(Log_Passenger_Count ~ Terminal + Price_Category_Code + Year,
                data = US_data)
summary(model_lm2)
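To illustrate what lm() does with the categorical predictors, the sketch below fits the same formula to synthetic data. The terminal level "International", the year range, and all effect sizes are assumptions made for the example, not facts from the project data.

```r
# Synthetic stand-in for US_data (levels and values are hypothetical).
set.seed(2)
n <- 500
US_data <- data.frame(
  Terminal = factor(sample(c("International", "Other", "Terminal 1",
                             "Terminal 2", "Terminal 3"), n, replace = TRUE)),
  Price_Category_Code = factor(sample(c("Low Fare", "Other"), n, replace = TRUE)),
  Year = sample(2005:2016, n, replace = TRUE)
)
US_data$Log_Passenger_Count <- 9 +
  0.05 * (US_data$Year - 2005) +
  1.0 * (US_data$Terminal == "Terminal 2") +
  rnorm(n, sd = 1)

model_lm2 <- lm(Log_Passenger_Count ~ Terminal + Price_Category_Code + Year,
                data = US_data)

# Each factor is expanded into dummy variables, one per non-reference level;
# the first level (alphabetically) is absorbed into the intercept.
coef_table <- summary(model_lm2)$coefficients
print(round(coef_table, 4))
```

The coefficient names follow the pattern variable name plus level, which is why the report's output contains names such as TerminalTerminal 1 and Price_Category_CodeOther.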

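General equation (the response is Log_Passenger_Count, as in the lm() call):
$$
\begin{aligned}
\widehat{\text{Log\_Passenger\_Count}} ={}& \hat\beta_0
 + \hat\beta_1 \times \text{TerminalOther}
 + \hat\beta_2 \times \text{TerminalTerminal 1}
 + \hat\beta_3 \times \text{TerminalTerminal 2} \\
&+ \hat\beta_4 \times \text{TerminalTerminal 3}
 + \hat\beta_5 \times \text{Price\_Category\_CodeOther}
 + \hat\beta_6 \times \text{Year}
\end{aligned}
$$
As the result of the code, we get: $\hat\beta_0 = -83.975324$, $\hat\beta_1 = -7.416359$,
$\hat\beta_2 = 0.290825$, $\hat\beta_3 = 2.149910$, $\hat\beta_4 = 1.152311$,
$\hat\beta_5 = 0.077896$, $\hat\beta_6 = 0.046361$.
Thus, the regression line is estimated by the following equation:
$$
\begin{aligned}
\widehat{\text{Log\_Passenger\_Count}} ={}& -83.975324
 - 7.416359 \times \text{TerminalOther}
 + 0.290825 \times \text{TerminalTerminal 1} \\
&+ 2.149910 \times \text{TerminalTerminal 2}
 + 1.152311 \times \text{TerminalTerminal 3} \\
&+ 0.077896 \times \text{Price\_Category\_CodeOther}
 + 0.046361 \times \text{Year}
\end{aligned}
$$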
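As a worked example of using the estimated equation, the sketch below plugs in one hypothetical observation (Terminal 1, price category Other, year 2015). The coefficients are the estimates reported above. The chosen observation is an assumption, and so is the use of the natural logarithm for the back-transformation.

```r
# Estimated coefficients as reported above.
beta <- c("(Intercept)"              = -83.975324,
          "TerminalOther"            = -7.416359,
          "TerminalTerminal 1"       =  0.290825,
          "TerminalTerminal 2"       =  2.149910,
          "TerminalTerminal 3"       =  1.152311,
          "Price_Category_CodeOther" =  0.077896,
          "Year"                     =  0.046361)

# Hypothetical observation: Terminal 1, price category "Other", year 2015.
# Dummy variables are 1 when the level applies and 0 otherwise.
x <- c(1, 0, 1, 0, 0, 1, 2015)

# Predicted value on the log scale: the sum of coefficient * predictor.
log_hat <- sum(beta * x)

# Back-transform, assuming Log_Passenger_Count is a natural logarithm.
count_hat <- exp(log_hat)

print(c(log_scale = log_hat, count_scale = count_hat))
```

The large negative intercept is offset by the Year term (0.046361 × 2015 ≈ 93.4), so the prediction lands in the range of fitted values seen in the residual plots.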

We see that the p-value corresponding to the F-statistic is less than 2.2e-16, which is
highly significant. This shows that at least one predictor variable in the model is
significant in explaining the variable "Log_Passenger_Count".
Check the regression coefficients:

- Hypothesis H0: $\beta_i = 0$ ⇔ The regression coefficient is not statistically significant
- Hypothesis H1: $\beta_i \neq 0$ ⇔ The regression coefficient is statistically significant
Because the p-values corresponding to the variables TerminalOther, TerminalTerminal 1,
TerminalTerminal 2, TerminalTerminal 3, and Year are all below the 5% significance
level, we reject H0 and accept H1. Therefore, the regression coefficients of these
variables are statistically significant.

Because the p-value corresponding to the variable Price_Category_CodeOther is
0.172 > 5% significance level, we do not reject H0. Therefore, the regression
coefficient corresponding to this variable is not statistically significant, so we remove
it from the model.
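Removing a non-significant term and refitting can be sketched as follows. The data are synthetic, with hypothetical levels and no true price-category effect built in, so the printed numbers have no bearing on the project's results.

```r
# Synthetic data in which Price_Category_Code has no real effect.
set.seed(9)
n <- 400
US_data <- data.frame(
  Terminal = factor(sample(c("Terminal 1", "Terminal 2", "Terminal 3"),
                           n, replace = TRUE)),
  Price_Category_Code = factor(sample(c("Low Fare", "Other"), n, replace = TRUE)),
  Year = sample(2005:2016, n, replace = TRUE)
)
US_data$Log_Passenger_Count <- 9 +
  0.05 * (US_data$Year - 2005) +
  0.8 * (US_data$Terminal == "Terminal 2") +
  rnorm(n)

full <- lm(Log_Passenger_Count ~ Terminal + Price_Category_Code + Year,
           data = US_data)

# p-values of the individual coefficients (t-tests of beta_i = 0).
p_values <- summary(full)$coefficients[, "Pr(>|t|)"]
print(round(p_values, 4))

# Refit without the price category term.
reduced <- update(full, . ~ . - Price_Category_Code)

# Partial F-test comparing the reduced and the full model.
comparison <- anova(reduced, full)
print(comparison)
```

update() keeps the data and the remaining terms unchanged, so the two models differ only in the dropped variable.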

RESULT AND COMMENT OF THE GRAPH:

After seeing how the independent variables (Terminal, Price_Category_Code,
Year) affect the number of passengers in the US based on the regression model, we
analyze the residuals of the model to check three assumptions:
- Graph 1: the errors have mean 0
- Graph 2: the errors are normally distributed
- Graph 3: the error variance is constant.

The first graph (Residuals vs Fitted): Checking the mean of the errors.
The assumption is that the errors have mean 0. For it to hold, the red line should be
straight and close to the line y = 0. But toward the right end of the graph, for
fitted values around 9 - 11, the red line is not a straight horizontal line; instead
it is curved. So the assumption that the errors have mean 0 is not fully satisfied.
The second graph (Normal Q - Q): This graph allows the user to check the
assumption of the normal distribution of errors. The condition is met only if the
residual points lie approximately on the reference line. As we can see on the graph,
the residual points from -4 to -2 deviate clearly from the line, unlike the residual
points at the upper end of the graph, so the normality assumption is not satisfied.
The third graph (Scale - Location): The assumption is that the error
variance is constant. The error values are not randomly scattered along the red line,
and the red line is not horizontal but sloping (or curved), so the assumption is not
satisfied.
The fourth graph (Residuals vs Leverage): This graph shows highly influential
points, which can be outliers and are the points with the most influence on the
fitted model. If some points lie beyond the dashed red lines (Cook's distance), those
points are highly influential. If the Cook's distance lines appear only at the corner
of the graph and no point crosses them, no point has really high influence. As we can
see, the graph flags observations 5674 and 5376 as possibly influential points.

However, these two observations have not gone beyond the Cook's distance lines.
Therefore, we do not have to remove them, because their influence is not high enough
to matter.
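The visual check against the Cook's distance lines can also be done numerically. The sketch below uses synthetic data with one injected unusual observation. The 0.5 cut-off mirrors the first dashed contour that plot() draws for lm objects. Treating it as a decision threshold is a rule of thumb, not a formal test.

```r
# Synthetic regression data with one injected unusual observation.
set.seed(4)
n <- 200
d <- data.frame(Year = sample(2005:2016, n, replace = TRUE))
d$Log_Passenger_Count <- 9 + 0.05 * (d$Year - 2005) + rnorm(n)
d$Year[1] <- 2016                 # place observation 1 at the edge of the range
d$Log_Passenger_Count[1] <- 2     # and give it an unusually low response

fit <- lm(Log_Passenger_Count ~ Year, data = d)

# Cook's distance for every observation.
cd <- cooks.distance(fit)

# The two largest values (analogous to the points labelled in the plot).
top2 <- sort(cd, decreasing = TRUE)[1:2]
print(round(top2, 4))

# Observations beyond the 0.5 contour, if any.
flagged <- which(cd > 0.5)
print(flagged)
```

An observation can be the most influential one in the data and still stay below the contour, which is the situation described for the two labelled points.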

Overall, the first, second, and third graphs lead to the same result: none of
the three assumptions is fully satisfied.

6. Discussion and expansion:

The data set gave our team a more intuitive view of how to extract, process, and
analyze raw data, and produced the following results:

This analysis provides insight into the differences in passenger traffic between
geographic regions, showing that the US has the highest number of passengers, while
other regions such as South America and Mexico have lower numbers. These results can
be used to make strategic decisions about allocating resources or marketing to the
respective regions.

Overall: The US region has a significantly higher number of passengers than every
other region. The South America region has the lowest number of passengers compared
to many other regions, especially compared to the US, Asia, and Europe. Europe and
the Middle East do not differ significantly from Asia.

Pairs of groups with significant differences: Asia has a higher number of passengers
than Australia/Oceania, Canada, Mexico, and South America.

Pairs of groups with no significant differences: Europe and Asia, Middle East and
Asia, South America and Mexico.

The regression analysis of the data set is limited to the US geographical area; it can
be expanded further through analysis of other geographical areas.

7. Conclusion
Factors affecting passenger numbers in the US geographic area: Terminal and Year.
Price_Category_Code was not statistically significant (p-value = 0.172).

We, as the authors of this project, hope that our solutions satisfy the given problems.

Cooperating on this project has improved the ability and the sense of responsibility
at work of each of the members.

We look forward to receiving comments and suggestions from the lecturer so that we can
carry out more accurate and professional projects in the future.

8. Code and data source


8.1 Code source
Link
8.2 Data source
Link

9. References
1. Phan Thi Huong, Lecture on Statistical Probability
2. Nguyen Tien Dung (editor), Nguyen Dinh Huy, Probability - Statistics & Data
Analysis, 2019
3. Nguyen Dinh Huy (editor), Nguyen Ba Thi, Probability and Statistics
Textbook, 2018
4. Introductory Statistics with R, J Jambers – D. Hand – W. Hardle
5. Applied Statistics with R, 2020
6. Lecture on Quantitative Economics, PhD. Nguyen Canh Huy
7. Sample example of multiple regression, Hoang Van Ha
8. Data: https://www.kaggle.com/datasets/thedevastator/airlines-traffic-passenger-statistics
