
University of Science – Vietnam National University

Faculty of Information Technology

STAT452 - Applied Statistics for Engineers and Scientists II

FINAL PROJECT

Teacher: Nguyễn Thị Mộng Ngọc


Teaching Assistant: Nguyễn Hữu Toàn
Class: 22TT1
Group 9:
Phạm Quốc Bửu ID: 22125012
Huỳnh Hữu Hậu ID: 22125025
Trần Nhật Thanh ID: 22125093
Trang Đặng Đức Tin ID: 22125106

Ho Chi Minh City, August 20th, 2024


TASK ASSIGNMENT
Dataset and report tasks, assigned members, and % completion as evaluated by the team (the column for the lecturer's evaluation is left blank).

Auto_mpg (Activity 1):
• Import and clean data: Quoc Buu (100%)
• Perform descriptive statistics: Nhat Thanh (100%)
• Visualize data: Nhat Thanh, Duc Tin (100%)
• Split data into training and validation sets: Quoc Buu (100%)
• Select the best model: Nhat Thanh, Quoc Buu (100%)
• Explain the chosen model's significance: Nhat Thanh (100%)
• Test model assumptions: Nhat Thanh (100%)
• Prediction and comparison of results: Nhat Thanh (100%)
• Provide conclusions and suggestions: Nhat Thanh (100%)

Admission Predict (Activity 2):
• Find dataset: Huu Hau (100%)
• Import and clean data: Huu Hau (100%)
• Perform descriptive statistics: Huu Hau (100%)
• Visualize data: Huu Hau, Duc Tin (100%)
• Split data into training and validation sets: Huu Hau (100%)
• Select the best model: Huu Hau, Quoc Buu (100%)
• Explain the chosen model's significance: Huu Hau (100%)
• Test model assumptions: Huu Hau (100%)
• Prediction and comparison of results: Huu Hau (100%)
• Provide conclusions and suggestions: Huu Hau (100%)

Car data (Activity 2):
• Find dataset: Quoc Buu (100%)
• Import and clean data: Quoc Buu (100%)
• Perform descriptive statistics: Duc Tin (100%)
• Visualize data: Duc Tin (100%)
• Split data into training and validation sets: Quoc Buu (100%)
• Select the best model: Quoc Buu (100%)
• Explain the chosen model's significance: Duc Tin (100%)
• Test model assumptions: Duc Tin (100%)
• Prediction and comparison of results: Quoc Buu (100%)
• Provide conclusions and suggestions: Duc Tin (100%)

Report tasks:
• Report writing and typing: Duc Tin, Nhat Thanh, Huu Hau (100%)
• Formatting, figures, and tables: Duc Tin, Nhat Thanh, Quoc Buu (100%)
• Edit appendix: Nhat Thanh (100%)
• Review, proofreading, and final edits: Duc Tin, Nhat Thanh, Huu Hau (100%)

Contents
ACTIVITY 1: Dataset ‘Auto_mpg’
  1. Description
  2. Data Analysis
  3. Data preprocessing
  4. Estimate the model
  5. Assumption Testing
  6. Model meaning
  7. Testing the model
ACTIVITY 2
  I. Dataset ‘Admission_Predict’
    1. Description
    2. Data Analysis
    3. Data preprocessing
    4. Estimate the model
    5. Assumption Testing
    6. Model Meaning
    7. Testing the model
  II. Dataset: Car data
    1. Description
    2. Data Analysis
    3. Data preprocessing
    4. Estimate the model
    5. Assumption Testing
    6. Model meaning
    7. Testing the model
APPENDIX
  References
  Dataset source
  Code Listing
    Activity 1
    Activity 2
      Graduation Admission
      Car Data
  Visualization of data
    Activity 1
    Activity 2
      Graduation Admission
      Car Data

ACTIVITY 1: Dataset ‘Auto_mpg’
1. Description
This dataset concerns city-cycle fuel consumption. The data is sourced from the UCI Machine
Learning Repository (https://archive.ics.uci.edu/ml/datasets/Auto+MPG). The dataset
consists of 398 observations on the following 9 variables:
• "mpg": (continuous) fuel consumption in miles per gallon.
• "cylinders": (multi-valued discrete) number of cylinders.
• "displacement": (continuous) engine size.
• "horsepower": (continuous) engine power.
• "weight": (continuous) vehicle weight.
• "acceleration": (continuous) vehicle acceleration.
• "model year": (multi-valued discrete) model year (last 2 digits).
• "origin": (multi-valued discrete) place of origin:
• 1 - North America,
• 2 - Europe,
• 3 – Asia.
• "car name": (multi-valued discrete) car name.

2. Data Analysis
We first view the first few rows of the data.

We can see that "car name" typically doesn't have direct statistical significance for predictive
modeling. It's a categorical variable with many unique values, making it challenging to
include in a model unless it's aggregated or encoded in a meaningful way. Moreover, in its
raw form, it is not suitable for linear modeling unless it is transformed into dummy variables,
which could significantly increase the complexity of the model without necessarily adding
value. Hence, we can remove “car name” from the prediction model.
Then, we check for missing and duplicated values

We find 6 NA values in “horsepower”, which account for less than 2% of
the data, so we remove these 6 rows. There are no duplicated rows.
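A minimal sketch of these checks in R (the data-frame name df is an assumption; in the raw UCI file, missing horsepower values are encoded as "?", which become NA on numeric conversion):

```r
# Assumed setup: df holds the raw Auto MPG data.
df$horsepower <- as.numeric(df$horsepower)  # "?" entries become NA (with a warning)

colSums(is.na(df))   # per-column NA counts; 6 expected in horsepower
sum(duplicated(df))  # number of duplicated rows; 0 expected

df <- na.omit(df)    # drop the 6 incomplete rows (< 2% of the data)
```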
We will convert the qualitative variables into dummy variables (a sketch of this encoding follows the list) and obtain:
• origin1 represents Europe, origin2 represents Asia, and when these two are 0, it
represents North America.
• model_year_group1 represents group “75-79”, model_year_group2 represents group
“80-82”, and when these two are 0, it represents group “70-74”.
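A sketch of one way to produce this coding (the column names df$origin and df$model_year are assumptions; lm() later expands the factors into the dummy variables automatically, with North America and "70-74" as the reference levels):

```r
df$origin <- factor(df$origin, levels = c(1, 2, 3),
                    labels = c("NorthAmerica", "Europe", "Asia"))
df$model_year_group <- cut(df$model_year,
                           breaks = c(69, 74, 79, 82),   # (69,74], (74,79], (79,82]
                           labels = c("70-74", "75-79", "80-82"))
```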

Now we will visualize the data:

Figure 1: Bar plot of origin

Fig. 1 shows that the majority of cars are from North America, while the number of European
and Asian cars is nearly the same, with a slightly higher representation of Asian cars.

Figure 2: Bar plot of model_year_group

Fig. 2 indicates that the majority of cars are made in the 1970s with roughly 150 cars from
each group, while the number of cars made in the early 1980s is only half of them, about 75.

Figure 3: Histogram of mpg

The right-skewed histogram of mpg in Fig. 3 indicates that the distribution of fuel efficiency
(measured in miles per gallon) is asymmetric, with a longer tail on the right side. This
suggests that while most cars in the dataset have lower to moderate fuel efficiency, there are
fewer cars with very high fuel efficiency.

A right-skewed distribution might lead to issues in regression models, particularly if the
skewness is extreme. Transforming the data (e.g., taking the logarithm of mpg) can help
normalize the distribution, making it more symmetric and potentially improving model
performance. We will do this later.

As “origin” and “model_year_group” are qualitative variables, we will not include them in the
boxplots or the correlation matrix.
(See the appendix for the remaining boxplots.)

Figure 4: Boxplot of acceleration

The acceleration data ranges from around 13 to 18, with several outliers on both sides. The
bulk of the data lies within a narrow range, suggesting a relatively consistent acceleration
distribution with some extreme values.

From all the boxplots, we can see there are a few outliers in “horsepower” and “acceleration”.
We will clean this using Cook’s distance after we create the models.

Now, we will consider the relationships between the qualitative variables and the
quantitative ones (see the appendix for the remaining boxplots).

Figure 5: Boxplot of displacement by origin

North American cars exhibit the greatest variability in displacement, with a wider range and
a median displacement around 250. European cars have less variability, a few visible
outliers, and a median around 100. Asian cars have moderate variability and a median
displacement a little lower than European cars.

(See the appendix for the remaining boxplots.)

Figure 6: Boxplot of displacement by model_year_group

We can see a decreasing pattern in displacement over the years: early-1970s cars range
from 100 to 350, late-1970s cars from 100 to 250, while early-1980s cars range only from 100
to 150, with a few outliers.

Moreover, we also consider the relationship between the two qualitative variables “origin”
and “model_year_group”:

Figure 7: Grouped barplot of origin and model_year_group

We can see from Fig. 7 that:


• North America produced a large number of cars in the earlier model years ("70-74"),
and fewer cars in the later model years.
• Europe had a more evenly distributed production of cars across the different model
year groups.
• Asia saw an increase in car production in the later model years, particularly in the
"80-82" group.

Then we will examine the relationships between the quantitative variables using a
correlation matrix and scatter plots.

Figure 8: Correlation matrix and scatter plots of quantitative variables

From Fig. 8, we have the following comments:


• Displacement, Horsepower, and Weight:

o There seems to be a negative correlation between “mpg” and these three
variables.
o As displacement, horsepower, or weight increases, “mpg” tends to decrease.
o This suggests that cars with larger engines or heavier weights tend to have
lower fuel efficiency.
• Cylinders:
o The number of cylinders shows clusters at specific counts (4, 6, 8).
o Cars with more cylinders (e.g., 8 cylinders) tend to have lower “mpg.”
o Fewer cylinders (e.g., 4 cylinders) are associated with better fuel efficiency.
• Acceleration:
o Acceleration does not exhibit a clear linear pattern with “mpg.”
o However, there might be a slight positive correlation.
o Further analysis is needed to confirm its significance.

After visualizing all the variables, now we will split the data and estimate the model.

3. Data preprocessing

We split the data into two parts (a sketch follows the list):


• Training Set (train_data): Contains 80% of the original dataset, used to train the
model.
• Testing Set (test_data): Contains the remaining 20%, used to test the model's
performance on unseen data.
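A sketch of the split in base R (the seed value and the use of base sample() are assumptions; any fixed seed makes the split reproducible):

```r
set.seed(123)                                  # assumed seed for reproducibility
n          <- nrow(df)
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- df[train_idx, ]                  # 80% for training
test_data  <- df[-train_idx, ]                 # 20% for testing
```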

4. Estimate the model


We will estimate an initial model with all the base predictors first.

The scatter plots from Fig. 8 reveal that "displacement," "horsepower," and "weight" exhibit
a slight curved pattern, suggesting that we should incorporate quadratic predictors for these
variables. The discrete pattern of “cylinders” also suggests that we should eliminate it from
the model.
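A sketch of the resulting quadratic model (the exact formula is an assumption reconstructed from the text; I() protects the squared terms inside the formula):

```r
model2 <- lm(mpg ~ displacement + I(displacement^2) +
               horsepower + I(horsepower^2) +
               weight + I(weight^2) +
               acceleration + origin + model_year_group,
             data = train_data)
summary(model2)$adj.r.squared  # compare against the base model
```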

Adjusted R^2 indeed increases in this case, by approximately 4%.
Next, we consider interactions between the predictors. Based on the box plots, we include
the interactions of all pairs that introduce an increasing or decreasing pattern, obtaining
model 3. We also add the interaction of “origin” and “model_year_group”, as there is a
relationship between them, as described in Fig. 7.

Adjusted R^2 increases by approximately 2%.

Now, we will use the VIF function to eliminate, one by one, the predictors that cause
multicollinearity in our model, obtaining model 4.

Although adjusted R^2 decreases slightly, by roughly 4%, we have excluded all the
predictors that cause multicollinearity.

Now, we will use the step function to remove insignificant predictors and obtain the
step_model.
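A sketch of both steps, assuming the car package for vif(); for models containing factors, vif() reports GVIF values:

```r
library(car)

vif(model3)   # inspect (G)VIF values; refit without the worst offender
              # and repeat until all remaining values are acceptable
step_model <- step(model4, direction = "both", trace = FALSE)  # AIC-based selection
summary(step_model)
```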

Recall from Fig. 3 that the “mpg” variable is right-skewed. To address this issue and improve
the model's accuracy, we will apply a log transformation to “mpg.” The final model is:

We also use Cook’s distance to remove outliers and check the final model again; this gives
the same result.
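A sketch of the log transform and the outlier screen; the 4/n cutoff for Cook's distance is a common convention and an assumption here:

```r
final_model <- update(step_model, log(mpg) ~ .)          # same predictors, log response
cd          <- cooks.distance(final_model)
train_clean <- train_data[cd <= 4 / nrow(train_data), ]  # drop influential points
final_model <- update(final_model, data = train_clean)   # refit on the cleaned data
summary(final_model)
```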

5. Assumption Testing
a) Linearity

Figure 9: Linearity Assumption

The plot indicates that while the model may reasonably capture the linear relationship, there
might be some minor deviations from linearity and potential non-linear patterns. However,
the residuals do not show a strong, clear pattern, so the linearity assumption is not severely
violated.

b) Normality of Residuals

Figure 10: Q-Q Plot of Residuals for Normality assumption

From the Q-Q Plot, the residuals of our model appear to be normally distributed, which
validates one of the key assumptions of linear regression. This suggests that the results of
hypothesis tests and confidence intervals based on this model should be reliable.

Shapiro-Wilk Test

Although the p-value in the final model is still less than 0.05, it shows a significant
improvement over the previous model.

c) No Perfect Multicollinearity

All the GVIF values are smaller than 5, suggesting that there is no serious multicollinearity here.

d) No Autocorrelation of Residuals

The Durbin-Watson statistic is close to 2, and the p-value is not significant. Therefore, we do
not have evidence of autocorrelation in the residuals, which means that the assumption of
no autocorrelation is likely satisfied.

e) Homoscedasticity

The low p-value indicates that there is significant evidence of heteroscedasticity in the
residuals, meaning that the variance of the errors is not constant across observations. This
violation can lead to inefficiency in the estimates and may affect the validity of hypothesis
tests.
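A sketch of the full test battery used in this section, assuming the car and lmtest packages:

```r
library(car)     # durbinWatsonTest(), vif()
library(lmtest)  # bptest()

plot(final_model, which = 1)          # residuals vs fitted: linearity check
plot(final_model, which = 2)          # Q-Q plot: normality check
shapiro.test(residuals(final_model))  # Shapiro-Wilk normality test
vif(final_model)                      # GVIF < 5 suggests no serious multicollinearity
durbinWatsonTest(final_model)         # D-W statistic near 2: no autocorrelation
bptest(final_model)                   # Breusch-Pagan: small p -> heteroscedasticity
```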

6. Model meaning
Interpretation of Coefficients
• Intercept (3.951): This is the estimated log(mpg) when all predictors are zero.
Although not directly interpretable (since it might not make practical sense for all
predictors to be zero), it provides a baseline for comparison.
• Acceleration (-0.01427): For each unit increase in acceleration (assuming origin and
model year group are at their reference levels), the log(mpg) decreases by
approximately 0.01427. This suggests that, in general, cars with higher acceleration
have lower fuel efficiency.
• Weight (three coefficients: -0.0002116, -0.0003439, -0.0003254): These coefficients
indicate how the relationship between weight and log(mpg) varies by origin. For
instance, for North American cars (the reference origin level), the log(mpg) decreases by approximately
0.0002116 for each unit increase in weight. The effect is stronger for cars from
Europe and Asia, as indicated by the larger negative coefficients.
• Acceleration (two coefficients: 0.02260, 0.01823): These coefficients suggest that
the effect of acceleration on log(mpg) varies by origin. For European and Asian cars,
acceleration positively affects fuel efficiency, which is the opposite of the general
trend observed for the whole dataset.
• Displacement (three coefficients: -0.0007836, -0.0007364, -0.0003890): These
coefficients show how the effect of engine displacement on log(mpg) varies across
model year groups. For older models (70-74 and 75-79), the displacement has a
stronger negative impact on log(mpg) than for the newer models (80-82).
• Acceleration (two coefficients: 0.008873, 0.01663): These coefficients show that for
cars produced between 1975-1979 and 1980-1982, acceleration positively impacts
log(mpg) more significantly than for older models (1970-1974).

Model Fit and Significance


• Multiple R-squared (0.8887): Indicates that approximately 88.87% of the variability
in log(mpg) is explained by the model. This is a high value, suggesting the model fits
the data well.
• Adjusted R-squared (0.8847): This is slightly lower than the R-squared value,
accounting for the number of predictors in the model. It suggests that the model is
still a good fit after adjusting for the complexity of the model.
• F-statistic (218.6) and p-value (< 2.2e-16): The model as a whole is highly
statistically significant, indicating that at least one of the predictors is significantly
related to log(mpg).

Conclusion
This model suggests that log(mpg) (and hence, fuel efficiency) is influenced by a combination
of direct effects (like acceleration) and interactions between variables (such as weight with
origin, or displacement with model year). The interactions highlight that the impact of these
predictors is not uniform across different groups (e.g., different origins or model years).
The significant coefficients and high R-squared values suggest that this model captures a
substantial portion of the variability in fuel efficiency. However, the interpretation of
individual coefficients must consider the interaction terms, as the effect of a variable like
acceleration or weight may differ depending on the car's origin or model year.

7. Testing the model


We obtain the following result:

Interpretation:
• High Accuracy: With an accuracy rate of 98.73%, the model predicts mpg values
within 30% of the actual values for nearly all cases in the validation dataset. This
suggests that the model is very effective at capturing the relationship between the
predictors (e.g., acceleration, weight, displacement) and the fuel efficiency of the cars.
• Model Generalization: The high accuracy rate also implies that the model
generalizes well to unseen data (i.e., the test dataset), indicating that it is not
overfitted to the training data. This is a good sign that the model can be trusted to
make reliable predictions on new data.

Conclusion:
The model's performance is excellent, with only a small percentage (around 1.27%)
of predictions falling outside the 30% error tolerance. This suggests that the model is
well-constructed and suitable for predicting fuel consumption (mpg) in similar
datasets.
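A sketch of this evaluation (exp() inverts the log transform of the response; the 30% tolerance follows the text):

```r
pred_mpg <- exp(predict(final_model, newdata = test_data))  # back to the mpg scale
rel_err  <- abs(pred_mpg - test_data$mpg) / test_data$mpg
accuracy <- mean(rel_err <= 0.30)  # fraction within 30% (~0.9873 reported above)
accuracy
```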

ACTIVITY 2
I. Dataset ‘Admission_Predict’
1. Description
This dataset, designed for predicting Graduate Admissions from an Indian perspective,
consists of 400 observations and focuses on several key parameters that influence the
application process for Master's programs. The parameters include:
• GRE Scores: Ranging up to 340, this score reflects a student's aptitude for graduate-
level education.
• TOEFL Scores: Out of 120, this measures English language proficiency.
• University Rating: A scale of 1 to 5, indicating the reputation of the institution.
• Statement of Purpose and Letter of Recommendation Strength: Rated on a scale
of 1 to 5, these factors assess the quality of the applicant's personal statement and
recommendations.
• Undergraduate GPA: Scored on a 10-point scale, this represents academic
performance in undergraduate studies.
• Research Experience: A binary indicator (0 or 1) that shows whether the applicant
has research experience.
• Chance of Admit: A probability value between 0 and 1, representing the likelihood
of admission.
The dataset is inspired by the UCLA Graduate Dataset and is designed to assist students in
estimating their chances of admission to various universities based on their profiles. The
dataset is attributed to Mohan S Acharya and is referenced in a study comparing regression
models for predicting graduate admissions, presented at the IEEE International Conference
on Computational Intelligence in Data Science 2019.

2. Data Analysis

Let’s take a first look at the top rows of the dataset:

The Serial No. variable should be eliminated because it simply represents the order of
observations in the dataset and does not carry any meaningful information about the factors
influencing graduate admissions. Including it in the analysis could introduce noise or
irrelevant data, potentially leading to misleading results or weakening the model's
predictive accuracy. Since it has no statistical significance, it does not contribute to
understanding or predicting the Chance of Admit and should therefore be excluded from
any analysis.

Then, we check for missing and duplicated values:

There are no missing or duplicated values in the data.

In this dataset, the Research and University Rating variables represent categorical
information. Research indicates whether a student has research experience, making it a
binary categorical variable. University Rating, while ordinal with a scale from 1 to 5, can be
categorized into broader groups (e.g., "Low" for 1-2 and "High" for 3-5) to simplify analysis.
Converting these variables into factors allows for clearer interpretation and more
appropriate use in regression models and other categorical analyses.

Figure 11: Correlation matrix and scatter plots

(See the appendix for the remaining boxplots.)

Figure 12: Box Plot of GRE Scores

• 50% of GRE scores are greater than 315.
• No outliers.

(See the appendix for the remaining bar plots.)

Figure 13: Bar Plot of University Ratings


• Most applications target mid-range universities, with fewer applications to high- and low-rated universities.

Figure 14: Histogram of Chance of Admit

The histogram of "Chance of Admit" is skewed to the left, indicating a non-normal
distribution. This suggests that a transformation such as squaring the variable might be
considered to normalize the data.

Figure 15: Squared Histogram of Chance of Admit

Hence, we should:
• Transform "Chance of Admit" to its squared form.
• Remove the outliers from the LOR and CGPA variables.

Figure 16: Box Plot of LOR after removing outlier
Figure 17: Box Plot of CGPA after removing outlier

Figure 18: Box Plots of numerical variables by categorical variable University Rating

Figure 19: Box Plots of numerical variables by categorical variable Research

We can see that both categorical variables, University.Rating and Research, show a positive
association with the numeric variables.

3. Data preprocessing

a) Converting categorical variables to factors:

We convert Research to a binary factor (0, 1) and University.Rating to a categorical factor
(Low, High); a sketch follows.
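A sketch of these conversions (the data-frame name adm and the reference-level ordering are assumptions; column names follow the CSV, with dots replacing spaces):

```r
adm$Research <- factor(adm$Research, levels = c(0, 1))
adm$University.Rating <- factor(
  ifelse(adm$University.Rating <= 2, "Low", "High"),
  levels = c("High", "Low")  # assumed reference level: "High"
)
```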

b) Splitting data:

We split the data into two parts:


• Training Set (train_data): Contains 80% of the original dataset, used to train the
model.
• Testing Set (test_data): Contains the remaining 20%, used to test the model's
performance on unseen data.

4. Estimate the model

We will estimate an initial model with all the base predictors first:

We transform Chance.of.Admit to its square, Chance.of.Admit2, based on the conclusion from Fig. 15.
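A sketch of the transformed fit; squaring the left-hand side inside I() keeps the same predictors:

```r
model2 <- lm(I(Chance.of.Admit^2) ~ ., data = train_data)  # squared response
summary(model2)  # compare adjusted R^2 with the untransformed model
```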

• The Adjusted R-squared also improved from 0.7706 to 0.8065, further confirming
the enhanced model performance.
• The transformation to the square of Chance.of.Admit provided a better-fitting model,
with more significant predictors and improved explanatory power, though some
variables remained non-significant.

Next, we will consider the interaction of the predictors. Based on the boxplots, we will
include the interactions of all the pairs that introduce an increasing or decreasing pattern.

Now, we use the step function to remove insignificant predictors and obtain model 3.

Since the p-value of SOP is much larger than 0.05, we drop this variable from the model and
obtain model 4:

We check the VIF values of model 4 and drop the CGPA:Research variable:

Then, we obtain model 5:

We check the VIF values of model 5:

Now there is no multicollinearity in the model.

We use Cook’s distance to measure the influence of each data point on the regression model.

We obtain the final model:

5. Assumption Testing
a) Linearity

Figure 20: Linearity Assumption for Final model and First model

The residuals vs. fitted plots for the final and first models show differences in linearity:

First Model
• Residual Patterns: The first model shows a noticeable pattern, with residuals
displaying curvature.
• Heteroscedasticity: There's a tendency for the spread of residuals to decrease as
fitted values increase, indicating potential heteroscedasticity.
Final Model
• Improved Randomness: The residuals appear more randomly scattered around the
horizontal line, suggesting improved linearity.
• Reduced Patterns: There's less curvature, indicating a better fit compared to the first
model.
• Consistent Spread: The spread of residuals seems more consistent across fitted
values, suggesting better homoscedasticity.
Overall, the final model shows an improvement in satisfying the linearity assumption, with
more random residuals and less evident patterns.

b) Normality of Residuals

Figure 21: Q-Q Plot for Normality Assumption of Final model and First model
The Shapiro-Wilk test and Q-Q plots help assess the normality of residuals for both
models:
Shapiro-Wilk Test Results
• First Model
o W = 0.9173, p-value = 3.098e-12
o Suggests significant deviation from normality.
• Final Model
o W = 0.97574, p-value = 6.748e-05 (both W and the p-value are greatly
improved)
o Indicates some deviation from normality, but it's less severe.
Q-Q Plots
• Final Model: The points mostly follow the diagonal line, indicating a closer fit to
normality, although there are some deviations at the tails.

• First Model: More pronounced deviations from the diagonal, particularly at the tails,
indicating a poorer fit to normal distribution.
The final model shows improved normality of residuals compared to the first model, as
evidenced by both the Shapiro-Wilk test and the Q-Q plot. However, there are still some
deviations that might need addressing.

c) No Autocorrelation of Residuals
The Durbin-Watson test checks for autocorrelation in the residuals of a regression
model:
First Model
• D-W Statistic: 1.934
• Autocorrelation: 0.033
• p-value: 0.568
Final Model
• D-W Statistic: 1.962
• Autocorrelation: 0.0176
• p-value: 0.696
Interpretation
• D-W Statistic around 2: Indicates little to no autocorrelation.
• p-value > 0.05: Suggests that there is no significant autocorrelation.

Both models show very little autocorrelation in their residuals. The final model shows
slightly better performance with a D-W statistic closer to 2 and a lower
autocorrelation value.

d) Homoscedasticity
The Breusch-Pagan test assesses heteroscedasticity in a regression model:

Figure 22: Plot for Homoscedasticity assumption of Final model and First model
Both plots show a random scatter of residuals around the horizontal line at zero, with
no clear pattern of increasing or decreasing variance as the fitted values change.
Final Model
• BP Statistic: 15.649
• p-value: 0.04769
First Model
• BP Statistic: 22.899
• p-value: 0.001775
Interpretation
• p-value < 0.05: Suggests evidence of heteroscedasticity in both models.
• Overall, the final model is an improvement in terms of addressing
heteroscedasticity, but further model adjustments might be needed to fully
resolve this issue.

e) Conclusion
The final model shows notable improvements over the first model in terms of
linearity and normality of residuals. The residuals of the final model are more
randomly distributed and less patterned, with a closer fit to normality as indicated by
the Shapiro-Wilk test and Q-Q plot. However, both models still exhibit some degree
of heteroscedasticity, as evidenced by the Breusch-Pagan test. Overall, the final model
is more robust with respect to the assumptions of linear regression but may still
benefit from further refinement to address residual variance issues.

6. Model Meaning
Interpretation of Coefficients
• Intercept (-2.4678821, p < 2e-16): The intercept represents the expected value of
the squared Chance.of.Admit when all other predictors are set to zero. However, since
many variables in the model (e.g., GRE Score, TOEFL Score) are not realistically zero,
the intercept alone doesn't have a practical interpretation but serves as the baseline
level of the model.
• GRE.Score (0.0032913, p < 0.0001): For each additional point in GRE Score, the
squared Chance.of.Admit increases by approximately 0.0033, holding all other
variables constant. The very low p-value indicates this effect is statistically significant.
• TOEFL.Score (0.0040357, p = 0.0014): Each additional point in TOEFL Score
increases the squared Chance.of.Admit by approximately 0.004, holding all other
variables constant. This effect is statistically significant.
• University.RatingLow (0.1437695, p < 0.0001): Being in a university with a low
rating (compared to the reference category, likely medium or high) increases the

squared Chance.of.Admit by approximately 0.144, holding all other variables
constant. This effect is statistically significant.
• LOR (0.0269528, p < 0.0001): For each additional point in the LOR (Letter of
Recommendation) rating, the squared Chance.of.Admit increases by approximately
0.027, holding all other variables constant. This effect is statistically significant.
• CGPA (0.1527704, p < 2e-16): Each additional point in CGPA increases the squared
Chance.of.Admit by approximately 0.153, holding all other variables constant. This is
a highly significant effect, reflecting the strong impact of CGPA on the admission
chances.
• Research1 (0.0437895, p < 0.0001): Having research experience increases the
squared Chance.of.Admit by approximately 0.044, holding all other variables
constant. This effect is statistically significant.
• University.RatingHigh:SOP (0.0292400, p = 0.0003): The interaction term between
high University Rating and SOP shows that for each additional point in SOP, the
squared Chance.of.Admit increases by approximately 0.029 when the university
rating is high, holding all other variables constant. This effect is statistically
significant.
• University.RatingLow:SOP (-0.0116992, p = 0.2147): The interaction term between
low University Rating and SOP suggests that for each additional point in SOP, the
squared Chance.of.Admit decreases by approximately 0.012 when the university
rating is low, but this effect is not statistically significant (p > 0.05).

Model Fit and Significance


• Multiple R-squared (0.898): Approximately 89.8% of the variability in the squared
Chance.of.Admit can be explained by the model, which indicates a strong fit.
• Adjusted R-squared (0.8951): Adjusted for the number of predictors, this value is still
very high, indicating that the model explains a significant portion of the variance in
the squared Chance.of.Admit.
• F-statistic (314.7, p < 2.2e-16): The overall model is highly significant, indicating that
at least one predictor variable significantly contributes to predicting the squared
Chance.of.Admit.

This model captures the relationship between the squared Chance.of.Admit and various
factors such as GRE Score, TOEFL Score, CGPA, research experience, and interactions
between University Rating and SOP. The model is highly significant, with a strong fit
indicated by the R-squared values.

7. Testing the model

Using the final selected model, predictions were made on the test dataset for the dependent
variable "Chance.of.Admit." The R-squared value for these predictions was calculated as
follows:

The R-squared value of approximately 0.88 indicates that the final model explains about 88%
of the variability in the "Chance.of.Admit" variable. This is a strong indicator of the model's
goodness of fit.
The accuracy of the predictions was evaluated by comparing the predicted values (after
transforming them back to the original scale) with the actual values of "Chance.of.Admit" in
the test dataset. The accuracy was computed as follows:
• Accuracy Calculation: If the absolute relative error between the predicted and actual
values is less than or equal to 30%, the prediction is considered accurate (a sketch of this
computation follows the list).
• Accuracy Rate: 100%
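A sketch of this evaluation (sqrt() inverts the square transform of the response; the pmax() guard against tiny negative predictions is an added assumption):

```r
pred_sq <- predict(final_model, newdata = test_data)
pred    <- sqrt(pmax(pred_sq, 0))             # back-transform to the original scale
actual  <- test_data$Chance.of.Admit
r2      <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
acc     <- mean(abs(pred - actual) / actual <= 0.30)
c(R2 = r2, Accuracy = acc)                    # ~0.88 and ~1.00 reported above
```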

The final model has demonstrated a high level of predictive accuracy, with an R-squared
value of 0.88 indicating a strong fit to the data. The accuracy rate of approximately 100%
suggests that the model performs exceptionally well in predicting "Chance.of.Admit," with
most predictions falling within 30% of the actual values. This indicates that the final model
is highly reliable and effective for forecasting "Chance.of.Admit" based on the provided
features.

II. Dataset: Car data
1. Description
The dataset contains information about used cars listed on various websites. These data
can be used for many purposes, such as price prediction, exemplifying the use of linear
regression in machine learning. The data consist of 301 cars.
Variables
• Car_name: Name of the model of the car
• Year: The year in which the car model was made
• Present_Price: The current showroom price of the car (in units of 100,000 Indian
rupees, INR)
• Kms_Driven: The total number of kms that the car has driven
• Fuel_Type: Type of fuel that the car uses
• Seller_Type: Defines the type of the seller (Individual or Dealer)
• Transmission: Type of transmission (manual or automatic)
• Owner: Number of previous owners
Target variable
• Selling_Price: The price at which the owner of the car wants to sell it (in units of
100,000 Indian rupees, INR)

2. Data Analysis
a. Categorical variables
(See the appendix for the remaining bar plots.)

Figure 23: Bar plot of Fuel type

Petrol is the most frequent fuel type and CNG the least frequent, indicating that most of the
cars in the sample use petrol as their source of energy.

b. Selling Price histogram

Figure 24: Histogram of Selling Price

The distribution is right-skewed, meaning the cars in the dataset were quite cheap: most
were sold in the range of 0 to 5 (100,000 INR), and mostly in the range of 0 to 1
(100,000 INR).

c. Selling_Price vs numerical variables

(See the appendix for the remaining scatter plots.)

Figure 25: Scatter Plot of Selling Price and Present Price

According to the scatter plot, as present price increases, the selling price increases as
well, indicating that the selling price is directly proportional to the present price.

d. Selling price and categorical variables:

(See the appendix for the remaining boxplots.)

Selling Price vs Fuel Type:

Figure 26: Box plot of Selling price by Fuel type

According to the boxplot:
• CNG: the selling price occupies a very narrow range, reflecting that the dataset contains
very few CNG cars.
• Diesel:
  o The broadest range of selling prices among the three fuel types.
  o The median selling price is around 7.5 (100,000 INR).
  o The first quartile is around 4 and the third quartile around 12.5, so 50% of the selling
prices of diesel cars fall within this range.
  o The whiskers extend from around 3 to 24, covering the prices within 1.5 times the IQR
from the lower and upper quartiles.
  o There are 2 outliers.
• Petrol:
  o A broad range, but narrower than diesel.
  o The median selling price is around 2.5 (100,000 INR).
  o The first quartile is around 1 and the third quartile around 5, so 50% of the selling
prices of petrol cars fall within this range.
  o The whiskers extend from around 0.5 to 11.
  o There are 3 outliers.
In terms of selling price: Diesel cars > CNG cars > Petrol cars.

e. Linearity investigation:

Figure 27: Scatter plots of quantitative variables

3. Data preprocessing
a) Feature subset selection

Since the number of distinct car names is high relative to the number of samples, it is
better to drop this variable.

b) Feature transformation
We replace the Year column with an age index: the car's model year minus the minimum
year in the sample (a sketch follows).
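A sketch of this re-coding (the data-frame name cars is an assumption):

```r
cars$Year <- cars$Year - min(cars$Year)  # 0 for the oldest model year in the sample
```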

c) Detect and remove outliers

We remove only the outliers with selling price > 30.

d) Splitting Data

We split the data into two parts:


• Training Set (train_data): Contains 80% of the original dataset, used to train the
model.
• Testing Set (test_data): Contains the remaining 20%, used to test the model's
performance on unseen data.

4. Estimate the model


First, we will estimate the initial model with base predictors (called model)

The base model satisfies only 2 of the 5 assumptions (see Part 5), and its residuals are not
normal, so we apply a Box-Cox transformation to the response (a sketch follows).
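A sketch of the Box-Cox step with MASS::boxcox(); lambda is taken at the maximum of the profile likelihood (its exact value depends on the data):

```r
library(MASS)

bc     <- boxcox(Selling_Price ~ ., data = train_data,
                 lambda = seq(-2, 2, 0.05))         # profile likelihood plot
lambda <- bc$x[which.max(bc$y)]                     # lambda at the maximum
train_data$Selling_Price_bc <-
  (train_data$Selling_Price^lambda - 1) / lambda    # transformed response (lambda != 0)
```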

Figure 28: Profile likelihood plot and histogram of new_x_exact

Figure 29: Scatter plot of box cox transformed selling price and Present price

Looking at the curve, we can see a quadratic relationship between the transformed
Selling_Price and Present_Price, indicating that we should incorporate a quadratic
predictor for this variable.

Initial Model1:

Model 1 is the initial model, including all predictors and a squared term for Present_Price.
Stepwise Regression:
We use backward elimination to remove the least significant predictor of model 1, based on
the p-value, until only significant predictors remain, and use AIC to compare models,
balancing model fit and complexity. A lower AIC value indicates a better model.
Initial AIC:

• Step 1:

Removing the Transmission predictor results in an AIC of -502.34, which is lower than the
initial AIC.
• Step 2:

Removing the Owner predictor results in an AIC of -502.68, which is also lower than the
first step's AIC.

At the end of the stepwise regression, we have model 2 with 6 predictors and an AIC of
-502.68.

Summary for model2:

• R^2 rises from 0.8858 to 0.9503.
• The p-value of the normality test for model 2 rises from 6.372e-08 to 0.001769.

Figure 30: Box plots of numerical variables by categorical variable Seller_type

• As the boxplots of the three numerical predictors against the categorical predictor
Seller_Type show, Present_Price has the clearest relationship with Seller_Type; hence we
choose the interaction Present_Price:Seller_Type for model 3.

Figure 31: Box plots of numerical variables by categorical variable Year

• As the boxplots of the three predictors against Year show, Present_Price has the clearest
relationship with Year; hence we choose the interaction Present_Price:Year for model 3.

By adding the two interactions Present_Price:Seller_Type and Year:Present_Price, we obtain
model 3 (a sketch follows):
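A sketch of model 3 (the formula is reconstructed from the coefficient list in Section 6; the transformed response name Selling_Price_bc is an assumption):

```r
model3 <- lm(Selling_Price_bc ~ Year + Present_Price + I(Present_Price^2) +
               Kms_Driven + Fuel_Type + Seller_Type + Owner +
               Present_Price:Seller_Type + Year:Present_Price,
             data = train_data)
summary(model3)
```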

• R^2 rises from 0.9503 to 0.9654.
• The normality-test p-value rises significantly, to 0.3746.

5. Assumption Testing
5.1 Base model (model with base predictors)
a) Linearity

Figure 32: Linearity assumption for base model

The plot shows a clear pattern in the residuals, so the linear model is not a good fit for
the data. The residuals are not randomly scattered around zero, but instead show a curved
pattern. This suggests that the relationship between the response variable and the predictor
variables is not linear.

b) Homoscedasticity:

Figure 33: Plot for Homoscedasticity assumption of base model

The residuals are not randomly scattered around zero, but instead show a funnel shape. This
suggests that the variance of the errors is not constant across all levels of the predictor
variables since funnel shape indicates that the variance of the errors increases as the fitted
values increase. This violates the assumption of homoscedasticity.

c) No Autocorrelation of Residuals

The D-W statistic = 1.993545 is very close to 2, which means there is no significant
autocorrelation in the residuals.
The p-value = 0.966 is high, so we fail to reject the null hypothesis, which states that there
is no autocorrelation in the residuals.
=> The model does not exhibit autocorrelation.

d) No perfect multicollinearity

There is no perfect multicollinearity in the model. The predictor variables are not highly
correlated with each other, meaning they each provide independent information to the
model.

e) Normality

Figure 34: Q-Q plot for Normality Assumption of base model

The p-value is much less than 0.05, indicating that the residuals do not follow a normal
distribution.

5.2 Model2
a) Linearity

Figure 35: Linearity assumption of model2

Although the residuals appear to be randomly scattered around the horizontal line at zero,
there are a few outliers and slight indications of patterns, indicating that the linearity
assumption is not fully met and suggesting potential issues with the model.

b) Homoscedasticity

Figure 36: Plot for Homoscedasticity of model2

According to the plot, the residuals appear to be scattered around the horizontal line at y =
0, but they are not evenly dispersed, suggesting that there might be heteroscedasticity in the
model. The variance of the residuals does not seem to be constant across all fitted values.

c) No autocorrelation residuals:

Autocorrelation = 0.129725 indicates a low positive autocorrelation at lag 1.
The D-W statistic = 1.734 suggests mild positive autocorrelation.
The p-value = 0.052 is slightly above 0.05, suggesting marginal evidence against the null
hypothesis.

d) No perfect multicollinearity

According to the GVIF, we can see that all GVIF^(1/(2Df)) values are below 2, which suggests
that there is no severe multicollinearity present in the model.
e) Normality

Figure 37: Q-Q plot for normality assumption of model2

The p-value is much less than 0.05, indicating that the residuals do not follow a normal
distribution.
Conclusion: model 2 compared to the base model:
• Box-Cox transformation of Selling_Price.
• Added the Present_Price^2 predictor.
• The number of satisfied assumptions is unchanged.

5.3 Model3
a) Linearity

Figure 38: Linearity Assumption of model3

The residuals appear to be randomly scattered around the horizontal line (y = 0), which
generally supports the linearity assumption. There is a minor curve in the red trend line,
indicating some potential deviation from linearity, but it is not strongly pronounced.
=> The linearity assumption is mostly satisfied.

b) Homoscedasticity

Figure 39: Plot for Homoscedasticity of model3

The residuals seem to be evenly scattered around the horizontal line (y = 0) without any
clear pattern or funnel shape; there is no obvious increase or decrease in spread as the
fitted values change.
=> The assumption of homoscedasticity appears to be satisfied.

c) No autocorrelation residuals

Autocorrelation = 0.0060935 is a very low value, indicating negligible autocorrelation at
lag 1.
The D-W statistic = 1.979 suggests no significant autocorrelation.
The p-value = 0.862 is far above 0.05, so there is no significant evidence to reject the null
hypothesis.
=> The model satisfies the assumption of no autocorrelation.

d) No perfect multicollinearity

Most predictors show low multicollinearity; the exception is Seller_Type, but no severe
multicollinearity is present in the model.

e) Normality

Figure 40: Q-Q plot for Normality Assumption of model3


The p-value = 0.3747 is much larger than 0.05, suggesting that the residuals follow a normal
distribution.

Conclusion:
• Add 2 interaction predictors: Seller_Type:Present_Price and Year:Present_Price
• The model satisfies all the 5 assumptions.
• R^2 for test set = 0.9624215.
• Accuracy = 0.85.
6. Model meaning
Interpretation of coefficients
• Intercept (-0.2755): This represents the expected transformed selling price when
all other predictors are zero. It's not highly significant (p = 0.07108), indicating that
its contribution might not be substantial.
• Year (0.0830): The coefficient for Year is positive and highly significant (p < 0.001),
suggesting that newer cars (higher Year value) tend to have a higher selling price, as
expected.

• Present_Price (0.1319): Present price is also highly significant (p < 0.001),
indicating a positive relationship between the current price of the car and the selling
price.
• Kms_Driven (-1.405e-06): The negative coefficient (p = 0.00972) suggests that as
the number of kilometers driven increases, the selling price slightly decreases, which
aligns with the general depreciation of cars with usage.
• Fuel_TypePetrol (-0.2276): This negative and highly significant coefficient (p <
0.001) indicates that petrol cars tend to have a lower selling price compared to the
reference category (likely diesel).
• Seller_TypeIndividual (-1.169): Individual sellers tend to have significantly lower
selling prices compared to dealers, as reflected by this highly significant and large
negative coefficient (p < 0.001).
• Owner (-0.2221): The number of previous owners negatively impacts the selling
price (p = 0.00448), meaning that cars with more owners tend to sell for less.
• I(Present_Price^2) (-0.0034): The negative coefficient suggests a quadratic effect
of present price, indicating that the effect of price diminishes as the price increases.
• Present_Price:Seller_TypeIndividual (0.0697): This positive interaction term (p <
0.001) suggests that the effect of present price on selling price is stronger for
individual sellers.
• Year:Present_Price (0.0076): This positive and significant interaction term (p <
0.001) indicates that the effect of the present price on selling price is stronger for
newer cars.
Model Fit:
• Multiple R-squared (0.9654) and Adjusted R-squared (0.964): These values
indicate that the model explains approximately 96.5% of the variance in the
transformed selling price, which suggests an excellent fit.
• Residual Standard Error (0.2867): This represents the standard deviation of the
residuals (errors), indicating the average amount by which the predicted
transformed selling prices differ from the actual values.
• F-statistic (704): The model’s overall F-statistic is highly significant (p < 2.2e-16),
indicating that the model is significantly better at predicting the selling price than
using the mean selling price as a predictor.
Significance of Predictors:
• Most predictors are highly significant, with p-values less than 0.001, indicating they
are strongly associated with the selling price.
• The inclusion of interaction terms (Present_Price:Seller_Type and Year:Present_Price) and
a quadratic term (I(Present_Price^2)) suggests that the relationship between the selling
price and the predictors is not strictly linear and that the effects of some variables vary
depending on the levels of others.
Residuals:
• The residuals seem reasonably well-behaved, with the smallest and largest residuals
being within a range of about -0.8 to 0.9. This suggests that the model predictions are
generally close to the observed values, though there might be some outliers.

7. Testing the model


Using model 3 to make predictions on the test part of the dataset, we obtain:

This indicates that model 3 explains about 94.1% of the variability in the "Selling_Price"
variable. This is a strong indicator of the model's goodness of fit.
The accuracy of the model is calculated as follows:

Model 3 has demonstrated a high level of predictive accuracy, with a test-set R-squared of
about 0.94 indicating a strong fit to the data. This calculation includes the error
introduced by the Box-Cox and inverse Box-Cox transformations.
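A sketch of this evaluation with the inverse Box-Cox transform (lambda is the value estimated earlier; the helper function is an assumption):

```r
inv_boxcox <- function(y, lambda) {       # inverse of the Box-Cox transform
  if (abs(lambda) < 1e-8) exp(y) else (lambda * y + 1)^(1 / lambda)
}

pred_bc <- predict(model3, newdata = test_data)
pred    <- inv_boxcox(pred_bc, lambda)    # back to the original price scale
actual  <- test_data$Selling_Price
r2      <- 1 - sum((actual - pred)^2) / sum((actual - mean(actual))^2)
acc     <- mean(abs(pred - actual) / actual <= 0.30)
c(R2 = r2, Accuracy = acc)                # ~0.94 and 0.85 reported above
```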

APPENDIX
References
Sheather, S. (2009). A Modern Approach to Regression with R. Springer Texts in Statistics.
https://doi.org/10.1007/978-0-387-09608-7

Dalpiaz, D. (2021). Applied Statistics with R.
https://sadil.ws/handle/123456789/3733

Rencher, A. C., & Schaalje, G. B. (2007). Linear Models in Statistics.
https://doi.org/10.1002/9780470192610

Probability and Statistics for Engineering and the Sciences + Enhanced WebAssign Access.
(2017).

Acharya, M. S., Armaan, A., & Antony, A. S. (2019). A Comparison of Regression Models for
Prediction of Graduate Admissions. IEEE International Conference on Computational
Intelligence in Data Science 2019.
https://www.kaggle.com/datasets/mohansacharya/graduate-admissions/data

Ahmedabbas. (2023, November 11). Student Performance Prediction || EDA || ML. Kaggle.
https://www.kaggle.com/code/ahmedabbas757/student-performance-prediction-eda-ml

Vehicle dataset. (2023, January 14). Kaggle.
https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho?select=car+data.csv

Farzadnekouei. (2022, December 29). Polynomial Regression | Regularization | Assumptions.
Kaggle.
https://www.kaggle.com/code/farzadnekouei/polynomial-regression-regularization-assumptions

Dataset source
Graduate Admission
Car Data

Code Listing

Activity 1
a) Load necessary libraries

b) Import data and overview

c) Remove unnecessary, missing and duplicated data

d) Create dummy variables

e) Visualize data

f) Split data for training and testing

g) Create base model, then add more predictors

h) Use vif and step to clean model

i) Taking logarithm base e of targeted variable to obtain final model

j) Clean outliers and test final model again

k) Assumption testing of final model

l) Test final model with data

Activity 2

Graduation Admission
a) Load necessary libraries

b) Import data and overview

c) Remove unnecessary, missing and duplicated data

d) Visualize data

e) Create dummy variables

f) Split data for training and testing

g) Create base model, then add others

h) Clean outliers and test final model again

i) Assumption testing of final model

j) Test final model with data

Car Data
a) Load necessary libraries

b) Import data and overview

c) Reprocess data, create dummy variables and clean outliers

d) Visualize data

e) Split data for training and testing

f) Box-Cox function

g) Create base model, transform and test

h) Assumption testing

i) Test final model with data

Visualization of data

Activity 1
Boxplots of individual variables

The distribution of cylinder counts is broad, with most cars having 4 to 8 cylinders.

Engine displacement is mostly between 100-300 cubic inches.

Horsepower ranges from around 75 to 125, with a few outliers above 200. The distribution
is fairly even across this range.

The weight data ranges from approximately 2200 to 3600, showing a symmetric distribution
without any apparent outliers. The interquartile range is wide, indicating a diverse range of
weights.

Boxplots vs origin

The boxplot for North American cars shows more variability in cylinder counts, with a
median around 6, while European and Asian cars all have 4 cylinders and a few outliers.

North American cars exhibit more variability in horsepower, with a median around 100,
while European cars have less variability and a median of 80; Asian cars show variability
similar to North American cars, with a comparable median.

North American cars have higher variability in weight with a median between 3000 and
4000 pounds, European cars show less variability and a median around 2000 pounds, and
Asian cars have moderate variability with a median slightly lower than European cars.

The boxplot for North American cars shows moderate variability in acceleration with a
median around 15 units, while European cars exhibit greater variability, with a median
slightly above 15, and Asian cars have moderate variability with the highest median.

Boxplots vs model_year_group

Cars from the early 1970s have 4 to 8 cylinders, while late-1970s cars have only 4 to 6.
Moreover, despite some outliers, early-1980s cars typically have 4 cylinders.

Horsepower also shows a decreasing pattern over the years: early-1970s cars range from 80 to
150, late-1970s cars from 75 to 125, while early-1980s cars range only from 60 to 80.

Cars made in the 1970s are much heavier than those from the early 1980s: from 2,400 to more
than 4,000 pounds for the former group and only 2,100 to less than 3,000 for the latter.

Considering acceleration, we see a slight increasing pattern over the years, with all three
groups ranging from around 12 to 18.

Activity 2

Graduation Admission
Boxplots of individual variables

• 50% of TOEFL scores are greater than 115.
• No outliers.

• 50% of SOP ratings are greater than 3.5.
• No outliers.

• 50% of LOR ratings are greater than 3.5.
• There is one outlier, which is the minimum value of the LOR variable.

• 50% of CGPA values are greater than 8.5.
• There is one outlier, which is the minimum value of the CGPA variable.

Bar plot

• More individuals have research experience (1) than do not (0).

Car Data
Bar plot of individual variables

There are two types of sellers: Dealer and Individual. According to the bar plot, the dealer
type has the highest frequency and the individual type the lowest, meaning that most of the
car transactions in the sample were made through dealers.

There are two types of transmission: automatic and manual. According to the bar plot, the
manual frequency is much higher than the automatic frequency, which means that most of the
cars in the sample have manual transmission.

Scatter plot of individual variables

From the scatter plot, as the kilometers driven increase, the selling price goes down: the
more a car has traveled, the lower its price becomes, which means that the selling price is
inversely related to the kilometers driven.

According to the scatter plot, the more recent the car (i.e., the closer its model year is to
the current date), the higher the price.

According to the scatter plot, as the number of previous owners increases, the selling price
decreases.

Boxplots vs selling price


Selling Price vs Seller Type:

According to the boxplot:


• Dealers tend to sell cars at higher prices compared to individuals. The median
selling price for dealer-sold cars is significantly higher than that for individual-
sold cars.
• Price Variability: There is greater variability in the selling prices of cars sold
by dealers, as evidenced by the wider IQR and the presence of numerous high-
price outliers.
• Distribution: The selling price distribution for cars sold by individuals is
more concentrated around lower prices, while the distribution for dealer-sold
cars is more spread out with higher price points.
Selling Price vs Transmission:

According to the boxplot:
• Automatic Transmission cars tend to have higher selling prices compared to Manual
Transmission cars. The median selling price for automatic cars is almost twice that of
manual cars.
• Price Variability: There is greater variability in the selling prices of cars with
automatic transmission, as evidenced by the wider IQR and longer whiskers,
indicating that automatic cars are available in a wider range of prices.
