
What is Statistical Modelling?

Statistical modelling is the use of statistical techniques to describe, analyse, and make predictions about relationships within data. It mainly involves building mathematical representations, or models, that capture the underlying patterns, structures and associations in the data.

Statistical Modelling Techniques

Statistical modelling techniques are methods used to analyse data and uncover relationships,
patterns, and insights within it. These techniques involve the application of statistical principles to
create models that represent the underlying structure of the data.

Types of Statistical Models in R

1. Linear Models

Linear models are a cornerstone of statistical modelling. They establish relationships between a dependent variable and one or more independent variables, assuming a linear connection. These models offer simplicity, interpretability, and a strong theoretical basis, making them invaluable for understanding data patterns and making predictions.

Linear Regression Examples:

1. Real Estate Pricing: Predicting the selling price of houses based on features like square
footage, number of bedrooms, location, and age of the property. In this scenario, the
response variable (house price) is continuous, and a linear relationship is typically assumed
between the house prices and the features.

2. Academic Performance: Estimating a student’s final grade based on continuous predictors such as hours spent studying, attendance rate, and scores in previous exams. The final grade is a continuous outcome expected to have a linear relationship with these predictors.

Linear Regression is employed to predict a continuous numerical outcome based on one or more
predictors. Its simplicity and interpretability make it a popular choice.

#fit multiple linear regression model
model <- lm(mpg ~ hp + drat + wt, data = mtcars)

#view model summary
summary(model)

Call:
lm(formula = mpg ~ hp + drat + wt, data = mtcars)

Part 1

Residual

In linear regression, a residual is the difference between the actual value and the value predicted
by the model. It is calculated as the observed value minus the predicted value. A least-squares
regression model minimizes the sum of the squared residuals. If the observed value is larger than
the predicted value, the residual is positive, and if the predicted value is larger than the observed
value, the residual is negative.
Ex:

Residuals:

Min 1Q Median 3Q Max

-3.3598 -1.8374 -0.5099 0.9681 5.7078

The minimum residual was -3.3598, the median residual was -0.5099, and the maximum residual was 5.7078.
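These residuals can also be pulled directly from the fitted model object; a minimal sketch using the model fitted above:

#extract residuals from the fitted model
res <- resid(model)

#five-number summary (matches the Residuals block shown above)
summary(res)

#residuals are observed minus predicted values
head(mtcars$mpg - predict(model))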

Part 2

Coefficient

The linear regression coefficients describe the mathematical relationship between each
independent variable and the dependent variable. The p values for the coefficients indicate
whether these relationships are statistically significant.

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 29.394934 6.156303 4.775 5.13e-05 ***

hp -0.032230 0.008925 -3.611 0.001178 **

drat 1.615049 1.226983 1.316 0.198755

wt -3.227954 0.796398 -4.053 0.000364 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can use these coefficients to form the following estimated regression equation:

mpg = 29.39 – .03*hp + 1.62*drat – 3.23*wt

Variables of interest in an experiment (those that are measured or observed) are called response or dependent variables. Other variables in the experiment that affect the response and can be set or measured by the experimenter are called predictor, explanatory, or independent variables.

For each predictor variable, we’re given the following values:

Estimate: The estimated coefficient. This tells us the average increase in the response variable
associated with a one unit increase in the predictor variable, assuming all other predictor variables
are held constant.

Std. Error: This is the standard error of the coefficient. This is a measure of the uncertainty in our
estimate of the coefficient.

t value: This is the t-statistic for the predictor variable, calculated as (Estimate) / (Std. Error).

Pr(>|t|): This is the p-value that corresponds to the t-statistic. If this value is less than some alpha level (e.g. 0.05), then the predictor variable is said to be statistically significant.
If we used an alpha level of α = .05 to determine which predictors were significant in this
regression model, we’d say that hp and wt are statistically significant predictors while drat is not.
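To see the estimated equation in action, we can plug values into predict(). The predictor values below are made up purely for illustration:

#predicted mpg for a hypothetical car with hp = 110, drat = 3.9, wt = 2.6
new_car <- data.frame(hp = 110, drat = 3.9, wt = 2.6)
predict(model, newdata = new_car)

Using the full-precision coefficients, this works out to roughly 23.8 mpg.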

Part 3

Assessing Model Fit

Residual standard error: 2.561 on 28 degrees of freedom

Multiple R-squared: 0.8369, Adjusted R-squared: 0.8194

F-statistic: 47.88 on 3 and 28 DF, p-value: 3.768e-11

Residual standard error: This tells us the average distance that the observed values fall from the
regression line. The smaller the value, the better the regression model is able to fit the data.

The degrees of freedom is calculated as n-k-1 where n = total observations and k = number of
predictors. In this example, mtcars has 32 observations and we used 3 predictors in the regression
model, thus the degrees of freedom is 32 – 3 – 1 = 28.

Multiple R-Squared: This is known as the coefficient of determination. It tells us the proportion of
the variance in the response variable that can be explained by the predictor variables.

This value ranges from 0 to 1. The closer it is to 1, the better the predictor variables are able to
predict the value of the response variable.

Adjusted R-squared: This is a modified version of R-squared that has been adjusted for the number
of predictors in the model. It is always lower than the R-squared.

The adjusted R-squared can be useful for comparing the fit of different regression models that use
different numbers of predictor variables.

F-statistic: This indicates whether the regression model provides a better fit to the data than a
model that contains no independent variables. In essence, it tests if the regression model as a
whole is useful.

p-value: This is the p-value that corresponds to the F-statistic. If this value is less than some
significance level (e.g. 0.05), then the regression model fits the data better than a model with no
predictors.

“When building regression models, we hope that this p-value is less than some significance level
because it indicates that the predictor variables are actually useful for predicting the value of the
response variable.”
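All of these fit statistics can also be extracted programmatically from the summary object; a minimal sketch using the model fitted above:

#extract the fit statistics reported by summary()
s <- summary(model)
s$sigma          #residual standard error
s$r.squared      #multiple R-squared
s$adj.r.squared  #adjusted R-squared
s$fstatistic     #F-statistic with its numerator and denominator degrees of freedom

#p-value for the F-statistic
f <- s$fstatistic
pf(f[1], f[2], f[3], lower.tail = FALSE)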
II. ANOVA (Analysis of Variance) compares means across different groups, which is particularly useful for experimental designs.

ANCOVA (Analysis of Covariance) extends ANOVA by incorporating continuous covariates to account for their influence on the response variable.

To perform an ANOVA test in R, you can use the aov function or the anova_test function from the rstatix package, which provides a user-friendly framework to perform ANOVA tests.

ANOVA tests are of two types:

 One-way ANOVA: When there is a single categorical independent variable (also known as a factor) and a single continuous dependent variable, a one-way ANOVA is employed. It seeks to ascertain whether there are any notable variations in the dependent variable’s means across the levels of the independent variable.
[A one-way ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups.]

 Two-way ANOVA: When there are two categorical independent variables (factors) and one continuous dependent variable, a two-way ANOVA is used as an extension of one-way ANOVA. You can evaluate both the direct effects of each independent variable and their interaction on the dependent variable (a minimal code sketch follows this list).
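As a quick illustration of the two-way case, the sketch below fits an ANOVA with two made-up factors (program and gender) on simulated data; the variable names and values are purely hypothetical:

#hypothetical data with two factors
data2 <- data.frame(program = rep(c('A', 'B', 'C'), each = 30),
                    gender = rep(c('M', 'F'), times = 45),
                    weight_loss = runif(90, 0, 6))

#two-way ANOVA with an interaction term
model2 <- aov(weight_loss ~ program * gender, data = data2)
summary(model2)

The one-way case is worked through step by step below.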

Step 1: Create the Data


 Suppose we want to determine if three different workout
programs lead to different average weight loss in
individuals.
 To test this, we recruit 90 people to participate in an
experiment in which we randomly assign 30 people to
follow either program A, program B, or program C for
one month.
#create the data frame (runif draws random values, so your numbers will differ)
data <- data.frame(program = rep(c('A', 'B', 'C'), each = 30),
                   weight_loss = c(runif(30, 0, 3),
                                   runif(30, 0, 5),
                                   runif(30, 1, 7)))

#view first six rows of data frame
head(data)

  program weight_loss
1       A   2.6900916
2       A   0.7965260
3       A   1.1163717
4       A   1.7185601
5       A   2.7246234
6       A   0.6050458

#fit the one-way ANOVA model
model <- aov(weight_loss ~ program, data = data)

#view summary of one-way ANOVA model
summary(model)

            Df Sum Sq Mean Sq F value   Pr(>F)
program      2  98.93   49.46   30.83 7.55e-11 ***
Residuals   87 139.57    1.60
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here’s how to interpret every value in the output:

Df program: The degrees of freedom for the variable program. This is calculated as #groups -1. In this
case, there were 3 different workout programs, so this value is: 3-1 = 2.

Df Residuals: The degrees of freedom for the residuals. This is calculated as #total observations – #groups. In this case, there were 90 observations and 3 groups, so this value is: 90 – 3 = 87.

Sum Sq program: The sum of squares associated with the variable program. This value is 98.93.

Sum Sq Residuals: The sum of squares associated with the residuals or “errors.” This value is 139.57.

Mean Sq. Program: The mean sum of squares associated with program. This is calculated as Sum Sq.
program / Df program. In this case, this is calculated as: 98.93 / 2 = 49.46.

Mean Sq. Residuals: The mean sum of squares associated with the residuals. This is calculated as
Sum Sq. residuals / Df residuals. In this case, this is calculated as: 139.57 / 87 = 1.60.

F Value: The overall F-statistic of the ANOVA model. This is calculated as Mean Sq. program / Mean
sq. Residuals. In this case, it is calculated as: 49.46 / 1.60 = 30.83.

Pr(>F): The p-value associated with the F-statistic with numerator df = 2 and denominator df = 87. In this case, the p-value is 7.55e-11, which is an extremely tiny number.

The most important value in the entire output is the p-value because this tells us whether there is a
significant difference in the mean values between the three groups.

Since the p-value in our ANOVA table (7.55e-11) is less than .05, we have sufficient evidence to reject the null hypothesis.

Following up with a post-hoc test (such as Tukey’s HSD, sketched below), since each of the adjusted p-values is less than .05, we can conclude that there is a significant difference in mean weight loss between each group.
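A minimal sketch of that post-hoc step, using Tukey’s HSD on the aov model fitted above:

#post-hoc pairwise comparisons between programs with Tukey’s HSD
TukeyHSD(model, conf.level = 0.95)

Each row of the output compares two programs and reports the difference in means, a confidence interval, and an adjusted p-value.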
2. Generalised Linear Models (GLMs)

Generalised Linear Models (GLMs) expand the capabilities of linear models by accommodating a
wider range of response variable types. Traditional linear regression assumes a normal distribution
for the outcome, whereas GLMs can handle response variables that follow different probability
distributions.

1. Disease Diagnosis: Predicting the probability of a patient having a particular disease (say,
diabetes) based on various factors like age, body mass index, family history, and blood
pressure. This is a binary outcome (disease: yes/no), making logistic regression, a type of
GLM, appropriate.

2. Traffic Accident Count Analysis: Modeling the number of traffic accidents occurring at an
intersection based on factors like traffic volume, day of the week, and weather conditions.
Since the response variable (number of accidents) is a count, a Poisson regression, which is a
GLM suitable for count data, would be appropriate.

Logistic Regression is tailored for predicting binary outcomes, making it invaluable for classification
tasks.

Poisson Regression is suitable for count data, modelling phenomena like the number of occurrences within a specific time period.
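The worked example below uses logistic regression; for completeness, a Poisson fit using the glm() function (introduced just below) looks very similar. A minimal sketch on the built-in warpbreaks dataset, chosen here only for illustration:

#Poisson regression: breaks is a count of yarn breaks per loom
pois_model <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

summary(pois_model)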

The glm() function is used to fit generalised linear models in R.

This function uses the following syntax:

glm(formula, family=gaussian, data, …)

where:

 formula: The formula for the linear model (e.g. y ~ x1 + x2)

 family: The statistical family to use to fit the model. Default is gaussian but other options
include binomial, Gamma, and poisson among others.

 data: The name of the data frame that contains the data
For example, we can fit a logistic regression model to the built-in mtcars dataset:

head(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We will use the variables disp and hp to predict the probability that a given car takes on a value of 1
for the am variable.

# Load the dataset

data(mtcars)

# Fit a logistic regression model

model <- glm(am ~ disp + hp, data = mtcars, family = binomial)

# View the model summary

summary(model)

Call:

glm(formula = am ~ disp + hp, family = binomial, data = mtcars)

Part 1

Deviance Residuals:

Min 1Q Median 3Q Max

-1.9665 -0.3090 -0.0017 0.3934 1.3682

Part 2

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) 1.40342 1.36757 1.026 0.3048

disp -0.09518 0.04800 -1.983 0.0474 *

hp 0.12170 0.06777 1.796 0.0725 .


---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

For example, a one unit increase in the predictor variable disp is associated with an average change
of -0.09518 in the log odds of the response variable am taking on a value of 1. This means that higher
values of disp are associated with a lower likelihood of the am variable taking on a value of 1.
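Log-odds coefficients can be hard to read directly, so it is common to exponentiate them into odds ratios; a minimal sketch using the logistic model fitted above:

#convert log-odds coefficients to odds ratios
exp(coef(model))

#for disp: exp(-0.09518) is roughly 0.91, i.e. each one unit increase in disp
#multiplies the odds of am = 1 by about 0.91, holding hp constant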

The standard error gives us an idea of the variability associated with the coefficient estimate. We
then divide the coefficient estimate by the standard error to obtain a z value.

For example, the z value for the predictor variable disp is calculated as -.09518 / .048 = -1.983.

The p-value Pr(>|z|) tells us the probability associated with a particular z value. This essentially tells
us how well each predictor variable is able to predict the value of the response variable in the model.

For example, the p-value associated with the z value for the disp variable is .0474. Since this value is
less than .05, we would say that disp is a statistically significant predictor variable in the model.

The null deviance in the output tells us how well the response variable can be predicted by a model
with only an intercept term.

Part 3

Null deviance: 43.230 on 31 degrees of freedom

Residual deviance: 16.713 on 29 degrees of freedom

AIC: 22.713

Number of Fisher Scoring iterations: 8

The residual deviance tells us how well the response variable can be predicted by the specific model
that we fit with p predictor variables. The lower the value, the better the model is able to predict the
value of the response variable.

To determine if a model is “useful” we can compute the Chi-Square statistic as:

X2 = Null deviance – Residual deviance

with p degrees of freedom.

We can then find the p-value associated with this Chi-Square statistic. The lower the p-value, the
better the model is able to fit the dataset compared to a model with just an intercept term.
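In R, both deviances are stored on the fitted glm object, so this test can be computed directly; a minimal sketch using the model fitted above:

#Chi-Square statistic: null deviance minus residual deviance
chisq <- model$null.deviance - model$deviance   #43.230 - 16.713 = 26.517

#degrees of freedom: difference in residual degrees of freedom (here p = 2 predictors)
df <- model$df.null - model$df.residual         #31 - 29 = 2

#p-value for the Chi-Square statistic
pchisq(chisq, df = df, lower.tail = FALSE)

The resulting p-value is far below .05, indicating that the model with disp and hp fits the data much better than an intercept-only model.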


3. Nonlinear Models

Nonlinear models represent complex relationships between variables that straight lines cannot adequately capture. These models offer greater flexibility to fit data exhibiting curves, peaks, or other non-linear patterns.

By accommodating a wider range of functional forms, nonlinear models often provide more accurate
and informative insights in comparison to their linear counterparts. We employ Nonlinear Least
Squares to fit models with complex, non-linear patterns in the data.
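The document does not include a worked nonlinear example, so here is a minimal Nonlinear Least Squares sketch on simulated exponential-decay data; the variable names, true parameter values, and starting values are made up for illustration:

#simulate data that follows an exponential decay pattern
x <- 1:50
y <- 5 * exp(-0.1 * x) + rnorm(50, sd = 0.2)
df_decay <- data.frame(x, y)

#fit a nonlinear model with nls(); starting values for the parameters must be supplied
nls_model <- nls(y ~ a * exp(-b * x), data = df_decay, start = list(a = 4, b = 0.2))

#view parameter estimates and residual standard error
summary(nls_model)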

Maximum Likelihood Models and the Likelihood Ratio Test

A likelihood ratio test compares the goodness of fit of two nested regression models.

A nested model is simply one that contains a subset of the predictor variables in the overall
regression model.

library(lmtest)

#fit full model

model_full <- lm(mpg ~ disp + carb + hp + cyl, data = mtcars)

#fit reduced model

model_reduced <- lm(mpg ~ disp + carb, data = mtcars)

#perform likelihood ratio test for differences in models

lrtest(model_full, model_reduced)

Likelihood ratio test


Model 1: mpg ~ disp + carb + hp + cyl

Model 2: mpg ~ disp + carb

#Df LogLik Df Chisq Pr(>Chisq)

1 6 -77.558

2 4 -78.603 -2 2.0902 0.3517

From the output we can see that the Chi-Squared test-statistic is 2.0902 and the corresponding p-
value is 0.3517.

Since this p-value is not less than .05, we fail to reject the null hypothesis. This means the full model and the reduced model fit the data equally well, so there is no benefit to including hp and cyl in the model.

As a second example, we compare a full model containing disp and carb to a reduced model containing only disp:

library(lmtest)

#fit full model

model_full <- lm(mpg ~ disp + carb, data = mtcars)

#fit reduced model

model_reduced <- lm(mpg ~ disp, data = mtcars)

#perform likelihood ratio test for differences in models

lrtest(model_full, model_reduced)

Likelihood ratio test

Model 1: mpg ~ disp + carb

Model 2: mpg ~ disp

#Df LogLik Df Chisq Pr(>Chisq)

1 4 -78.603

2 3 -82.105 -1 7.0034 0.008136 **

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

From the output we can see that the p-value of the likelihood ratio test is 0.008136. Since this is less than .05, we reject the null hypothesis: the full model with disp and carb fits the data significantly better than the reduced model with only disp.
