Linear Model
Statistical modelling can be defined as the method of using different statistical techniques to describe, analyse, and make predictions about the relationships within data. It mainly involves creating mathematical representations, or models, that capture the underlying patterns, structures, and associations in the data.
Statistical modelling techniques are methods used to analyse data and uncover relationships,
patterns, and insights within it. These techniques involve the application of statistical principles to
create models that represent the underlying structure of the data.
1. Linear Models
At the core of statistical modelling, linear models form a cornerstone. They establish relationships
between a dependent variable and one or more independent variables, assuming a linear
connection. These models offer simplicity, interpretability, and a strong theoretical basis, making
them invaluable for understanding data patterns and making predictions.
1. Real Estate Pricing: Predicting the selling price of houses based on features like square
footage, number of bedrooms, location, and age of the property. In this scenario, the
response variable (house price) is continuous, and a linear relationship is typically assumed
between the house prices and the features.
Linear Regression is employed to predict a continuous numerical outcome based on one or more
predictors. Its simplicity and interpretability make it a popular choice.
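In R, such a model can be fit with the lm() function. A minimal sketch, assuming the mtcars data and the predictors hp, drat, and wt that are interpreted later in this section (the exact formula used to produce the output below was not shown):

model <- lm(mpg ~ hp + drat + wt, data = mtcars)   # multiple linear regression of mpg on three predictors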
summary(model)
Part 1
Residual
In linear regression, a residual is the difference between the actual value and the value predicted
by the model. It is calculated as the observed value minus the predicted value. A least-squares
regression model minimizes the sum of the squared residuals. If the observed value is larger than
the predicted value, the residual is positive, and if the predicted value is larger than the observed
value, the residual is negative.
Ex: In the Residuals section of the summary output, the minimum residual was -3.3598, the median residual was -0.5099, and the maximum residual was 5.7078.
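These residuals can also be inspected directly from the fitted model; a brief sketch, assuming the model object above:

residuals(model)            # observed minus fitted values for each car
summary(residuals(model))   # min, quartiles, median, and max of the residuals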
Part 2
Coefficient
The linear regression coefficients describe the mathematical relationship between each
independent variable and the dependent variable. The p values for the coefficients indicate
whether these relationships are statistically significant.
Coefficients:
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We can use these coefficients to form the estimated regression equation, predicted mpg = b0 + b1(hp) + b2(drat) + b3(wt), where the b values are the estimates shown in the Coefficients table.
Variables of interest in an experiment (those that are measured or observed) are called response
or dependent variables. Other variables in the experiment that affect the response and can be set
or measured by the experimenter are called predictor, explanatory, or independent variables.
Estimate: The estimated coefficient. This tells us the average increase in the response variable
associated with a one unit increase in the predictor variable, assuming all other predictor variables
are held constant.
Std. Error: This is the standard error of the coefficient. This is a measure of the uncertainty in our
estimate of the coefficient.
t value: This is the t-statistic for the predictor variable, calculated as (Estimate) / (Std. Error).
Pr(>|t|): This is the p-value that corresponds to the t-statistic. If this value is less than some alpha level (e.g. 0.05), then the predictor variable is said to be statistically significant.
If we used an alpha level of α = .05 to determine which predictors were significant in this
regression model, we’d say that hp and wt are statistically significant predictors while drat is not.
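A brief sketch of how these coefficient statistics can be pulled out of the fitted model programmatically, continuing the assumed model above:

coef(summary(model))                  # Estimate, Std. Error, t value, Pr(>|t|) for each term
coef(summary(model))[, "Pr(>|t|)"]    # just the p-values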
Part 3
Residual standard error: This tells us the average distance that the observed values fall from the
regression line. The smaller the value, the better the regression model is able to fit the data.
The degrees of freedom are calculated as n - k - 1, where n = total observations and k = number of predictors. In this example, mtcars has 32 observations and we used 3 predictors in the regression model, so the degrees of freedom are 32 - 3 - 1 = 28.
Multiple R-Squared: This is known as the coefficient of determination. It tells us the proportion of
the variance in the response variable that can be explained by the predictor variables.
This value ranges from 0 to 1. The closer it is to 1, the better the predictor variables are able to
predict the value of the response variable.
Adjusted R-squared: This is a modified version of R-squared that has been adjusted for the number of predictors in the model. Because it penalizes the addition of predictors, it is never higher than the R-squared.
The adjusted R-squared can be useful for comparing the fit of different regression models that use
different numbers of predictor variables.
F-statistic: This indicates whether the regression model provides a better fit to the data than a
model that contains no independent variables. In essence, it tests if the regression model as a
whole is useful.
p-value: This is the p-value that corresponds to the F-statistic. If this value is less than some
significance level (e.g. 0.05), then the regression model fits the data better than a model with no
predictors.
“When building regression models, we hope that this p-value is less than some significance level
because it indicates that the predictor variables are actually useful for predicting the value of the
response variable.”
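These overall fit statistics can also be extracted programmatically; a minimal sketch, again assuming the model above:

s <- summary(model)
s$sigma            # residual standard error
s$r.squared        # multiple R-squared
s$adj.r.squared    # adjusted R-squared
s$fstatistic       # F-statistic with its numerator and denominator degrees of freedom
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)   # p-value for the F-statistic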
II. ANOVA (Analysis of Variance) compares means across different groups, which is particularly useful for experimental designs.
In R, an ANOVA can be run with the built-in aov function or the anova_test function from the rstatix package, which provides a user-friendly framework for performing ANOVA tests; a brief fitting sketch appears before the output below.
One-way ANOVA: When there is a single categorical independent variable (also known as a factor) and a single continuous dependent variable, a one-way ANOVA is employed. It seeks to ascertain whether there are any notable variations in the dependent variable’s means across the levels of the independent variable.
[A one-way ANOVA is used to determine whether or not there is a statistically significant
difference between the means of three or more independent groups.]
Two-way ANOVA: When there are two categorical independent variables (factors) and one continuous dependent variable, a two-way ANOVA is used as an extension of the one-way ANOVA. It lets you evaluate both the main effect of each independent variable and their interaction effect on the dependent variable.
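The output below comes from a one-way ANOVA comparing weight loss across three workout programs. A minimal sketch of the fit, where the data frame name df and the column name weight_loss are assumptions (the factor program appears in the output):

model <- aov(weight_loss ~ program, data = df)   # one-way ANOVA (assumed variable names)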
summary(model)
             Df Sum Sq Mean Sq F value   Pr(>F)
program       2  98.93   49.46   30.83 7.55e-11 ***
Residuals    87 139.57    1.60
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Df program: The degrees of freedom for the variable program. This is calculated as #groups - 1. In this case, there were 3 different workout programs, so this value is: 3 - 1 = 2.
Df Residuals: The degrees of freedom for the residuals. This is calculated as #total observations - #groups. In this case, there were 90 observations and 3 groups, so this value is: 90 - 3 = 87.
Sum Sq program: The sum of squares associated with the variable program. This value is 98.93.
Sum Sq Residuals: The sum of squares associated with the residuals or “errors.” This value is 139.57.
Mean Sq. Program: The mean sum of squares associated with program. This is calculated as Sum Sq.
program / Df program. In this case, this is calculated as: 98.93 / 2 = 49.46.
Mean Sq. Residuals: The mean sum of squares associated with the residuals. This is calculated as
Sum Sq. residuals / Df residuals. In this case, this is calculated as: 139.57 / 87 = 1.60.
F Value: The overall F-statistic of the ANOVA model. This is calculated as Mean Sq. program / Mean
sq. Residuals. In this case, it is calculated as: 49.46 / 1.60 = 30.83.
Pr(>F): The p-value associated with the F-statistic with numerator df = 2 and denominator df = 87. In this case, the p-value is 7.552e-11, which is extremely small.
The most important value in the entire output is the p-value because this tells us whether there is a
significant difference in the mean values between the three groups.
Since the p-value in our ANOVA table (7.552e-11) is less than .05, we have sufficient evidence to reject the null hypothesis.
Following up with pairwise post-hoc comparisons (such as Tukey’s HSD, sketched below), since each of the adjusted p-values is less than .05, we can conclude that there is a significant difference in mean weight loss between each pair of groups.
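Those adjusted p-values typically come from a post-hoc pairwise comparison; a minimal sketch using Tukey’s HSD on the aov fit above:

TukeyHSD(model)   # pairwise differences in mean weight loss with adjusted p-values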
2. Generalised Linear Models (GLMs)
Generalised Linear Models (GLMs) expand the capabilities of linear models by accommodating a
wider range of response variable types. Traditional linear regression assumes a normal distribution
for the outcome, whereas GLMs can handle response variables that follow different probability
distributions.
1. Disease Diagnosis: Predicting the probability of a patient having a particular disease (say,
diabetes) based on various factors like age, body mass index, family history, and blood
pressure. This is a binary outcome (disease: yes/no), making logistic regression, a type of
GLM, appropriate.
2. Traffic Accident Count Analysis: Modeling the number of traffic accidents occurring at an
intersection based on factors like traffic volume, day of the week, and weather conditions.
Since the response variable (number of accidents) is a count, a Poisson regression, which is a
GLM suitable for count data, would be appropriate.
Logistic Regression is tailored for predicting binary outcomes, making it invaluable for classification
tasks.
Poisson Regression is suitable for count data, modelling phenomena like the number of occurrences within a specific time period.
In R, a GLM can be fit with the glm() function, which takes the form glm(formula, family, data), where:
formula: A symbolic description of the model to be fitted.
family: The statistical family to use to fit the model. The default is gaussian, but other options include binomial, Gamma, and poisson, among others.
data: The name of the data frame that contains the data.
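A brief sketch of how the family argument selects the model type; the data frame df and the variable names here are hypothetical:

logit_model <- glm(outcome ~ x1 + x2, data = df, family = binomial)   # logistic regression for a binary outcome
pois_model  <- glm(counts ~ x1 + x2, data = df, family = poisson)     # Poisson regression for a count outcome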
head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We will use the variables disp and hp to predict the probability that a given car takes on a value of 1
for the am variable.
data(mtcars)
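# Fit the logistic regression model described above. The original fit call was
# not preserved, so this line is a sketch based on that description.
model <- glm(am ~ disp + hp, data = mtcars, family = binomial)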
summary(model)
Call:
Part 1
Deviance Residuals:
Part 2
Coefficients:
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For example, a one unit increase in the predictor variable disp is associated with an average change
of -0.09518 in the log odds of the response variable am taking on a value of 1. This means that higher
values of disp are associated with a lower likelihood of the am variable taking on a value of 1.
The standard error gives us an idea of the variability associated with the coefficient estimate. We
then divide the coefficient estimate by the standard error to obtain a z value.
For example, the z value for the predictor variable disp is calculated as -.09518 / .048 = -1.983.
The p-value Pr(>|z|) tells us the probability associated with a particular z value. This essentially tells
us how well each predictor variable is able to predict the value of the response variable in the model.
For example, the p-value associated with the z value for the disp variable is .0474. Since this value is
less than .05, we would say that disp is a statistically significant predictor variable in the model.
The null deviance in the output tells us how well the response variable can be predicted by a model
with only an intercept term.
Part 3
AIC: 22.713
The residual deviance tells us how well the response variable can be predicted by the specific model
that we fit with p predictor variables. The lower the value, the better the model is able to predict the
value of the response variable.
To judge overall fit, we can compute a Chi-Square statistic as the null deviance minus the residual deviance, with degrees of freedom equal to the number of predictors, and then find the p-value associated with this Chi-Square statistic. The lower the p-value, the better the model is able to fit the dataset compared to a model with just an intercept term.
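A minimal sketch of that comparison, assuming the glm fit above (the degrees of freedom equal the number of predictors, here 2):

chisq_stat <- model$null.deviance - model$deviance   # improvement over the intercept-only model
pchisq(chisq_stat, df = 2, lower.tail = FALSE)       # p-value for the overall model fit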
3. Nonlinear Models
Nonlinear models represent complex relationships between variables that straight lines cannot adequately capture. These models offer greater flexibility to fit data exhibiting curves, peaks, or other non-linear patterns.
By accommodating a wider range of functional forms, nonlinear models often provide more accurate
and informative insights in comparison to their linear counterparts. We employ Nonlinear Least
Squares to fit models with complex, non-linear patterns in the data.
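A minimal sketch of nonlinear least squares using R’s nls() function; the data, model form, and starting values below are hypothetical:

set.seed(1)
df <- data.frame(x = 1:25)
df$y <- 5 * exp(0.15 * df$x) + rnorm(25)     # simulated exponential growth with noise
fit <- nls(y ~ a * exp(b * x), data = df,
           start = list(a = 1, b = 0.1))     # starting values for the parameters a and b
summary(fit)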
A likelihood ratio test compares the goodness of fit of two nested regression models.
A nested model is simply one that contains a subset of the predictor variables in the overall
regression model.
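The specific models behind the output below were not shown; purely as an illustration, a pair of nested linear models on mtcars might be defined like this before calling lrtest() (the formulas are assumptions, so the log-likelihoods will not necessarily match the values below):

model_full    <- lm(mpg ~ disp + hp + drat + wt, data = mtcars)   # full model
model_reduced <- lm(mpg ~ disp + wt, data = mtcars)               # nested model with a subset of the predictors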
library(lmtest)
lrtest(model_full, model_reduced)
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   6 -77.558

From the output we can see that the Chi-Squared test statistic is 2.0902 and the corresponding p-value is 0.3517. Since this p-value is not less than .05, we fail to reject the null hypothesis; the reduced model fits the data about as well as the full model.
A second likelihood ratio test, comparing a different pair of nested models (not shown), produced the following output:

lrtest(model_full, model_reduced)

  #Df  LogLik Df  Chisq Pr(>Chisq)
1   4 -78.603
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the output we can see that the p-value of the likelihood ratio test is 0.008136. Since this is less than .05, we would reject the null hypothesis and conclude that the full model offers a significantly better fit than the reduced model.