MIT 302 - Statistical Computing II - Tutorial 03
TUTORIAL 3: Advanced Statistical Modelling
1 Overview
Advanced Statistical Modelling in R takes your data analysis skills to the next level by
introducing you to powerful techniques for modelling complex relationships and extracting
meaningful insights from your data. This course provides a comprehensive overview of
advanced statistical modelling concepts and their practical application using the R
programming language.
The course begins with an introduction to the importance of statistical modelling in data
analysis. You will gain an understanding of how advanced modelling techniques can unravel
complex patterns and uncover hidden relationships within your data.
One of the key topics covered is linear regression, building upon your existing knowledge. You
will dive deeper into multiple linear regression, exploring how to model relationships between
multiple predictors and a continuous outcome variable. You will learn how to assess model
assumptions and diagnostics, as well as how to handle interactions and non-linear relationships.
Another crucial area of focus is generalized linear models (GLMs). You will discover the
versatility of GLMs in modelling a variety of response variables, including binary, count, and
categorical outcomes. Through hands-on exercises, you will gain proficiency in fitting GLMs,
interpreting model output, and assessing model fit.
The course also explores advanced topics such as mixed-effects models, time series analysis,
and machine learning algorithms. You will learn how to handle correlated and nested data using
mixed-effects models, analyse temporal patterns and forecast future values with time series
models, and harness the predictive power of machine learning algorithms for complex
modelling tasks.
Throughout the course, you will work extensively with real-world datasets, applying the
advanced modelling techniques using R. You will gain practical experience in implementing
these models, interpreting results, and effectively communicating your findings.
By the end of the course, you will have a solid foundation in advanced statistical modelling
techniques in R, enabling you to tackle complex data analysis problems and derive meaningful
insights from your data. Whether you're working in academia, industry, or research, this course
equips you with the skills to make informed decisions and drive impactful outcomes through
advanced statistical modelling in R.
The logistic regression model estimates the probability of the outcome using the logistic (sigmoid) function:
𝑃(𝑌 = 1 ∣ 𝑋) = 1 / (1 + 𝑒^−(𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝))
Where:
• 𝑃(𝑌 = 1 ∣ 𝑋) is the probability of the dependent variable being 1 given the values of
the independent variables.
• 𝛽0 , 𝛽1 , 𝛽2 , … , 𝛽𝑝 are the coefficients or parameters estimated by the model.
• 𝑋1 , 𝑋2 , … , 𝑋𝑝 are the values of the independent variables.
3.1.3 Logistic Regression Implementation in R
R provides the "glm" function (generalized linear model) for implementing logistic regression.
Here's an example code snippet to illustrate the implementation:
# Load the dataset
data <- read.csv("customer_data.csv")
# Fit the logistic regression model
model <- glm(churn ~ age + gender + usage, data = data, family = binomial)
# Examine the model summary
summary(model)
# Interpretation:
# The model summary provides information about the estimated coefficients and their significance.
# Coefficients with small p-values are considered statistically significant.
# Predict probabilities for new data (illustrative values)
new_data <- data.frame(age = 35, gender = "Male", usage = 100)
predicted_probs <- predict(model, newdata = new_data, type = "response")
predicted_probs
# Interpretation:
# The predicted probabilities represent the probability of churn (1) for the new
# data points based on the fitted logistic regression model.
In this example, the "customer_data.csv" file contains a dataset with the dependent variable
"churn" indicating whether a customer churned (1) or not (0), and independent variables such
as "age," "gender," and "usage." The logistic regression model is fitted using the "glm"
function, specifying the formula, dataset, and the family as "binomial" to indicate logistic
regression. The model summary provides information about the estimated coefficients and their
significance. Coefficients with small p-values are considered statistically significant, indicating
a significant relationship between the independent variable and the probability of the outcome.
The "predict" function is used to predict probabilities for new data provided in the "new_data"
dataframe.
Interpretation of the logistic regression model involves examining the estimated coefficients
and their significance. Positive coefficients indicate a positive relationship with the probability
of the outcome, while negative coefficients indicate a negative relationship. The magnitude of
the coefficients reflects the strength of the relationship.
By using logistic regression in R, you can analyse and predict binary outcomes based on the
relationship between independent variables and the probability of the outcome occurring.
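Coefficient interpretation is often easier on the odds-ratio scale. The sketch below is an illustrative example (it uses R's built-in mtcars dataset rather than the tutorial's customer data, and the predictor choices are arbitrary):

```r
# Fit a logistic regression on the built-in mtcars dataset.
# "am" (transmission: 0 = automatic, 1 = manual) serves as a stand-in
# binary outcome; the predictors wt and hp are chosen only for illustration.
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Exponentiating a coefficient converts it to an odds ratio: the
# multiplicative change in the odds of the outcome per one-unit
# increase in that predictor, holding the others fixed.
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio above 1 corresponds to a positive coefficient (higher odds of the outcome), while an odds ratio below 1 corresponds to a negative coefficient.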
3.2 Poisson Regression
Poisson regression is a statistical modelling technique used to analyse count data, where the
dependent variable represents the number of occurrences of an event within a fixed interval. It
is commonly used when the dependent variable follows a Poisson distribution. Poisson
regression models the relationship between the independent variables and the expected counts
of the event.
3.2.1 Components of Poisson Regression
• Dependent Variable: The dependent variable in Poisson regression is a count variable
representing the number of occurrences of an event. For example, it could be the
number of accidents at a particular intersection in a day.
• Independent Variables: These are the predictor variables that are used to explain the
variation in the count variable. They can be continuous or categorical variables. For
example, the independent variables could include variables such as the time of day,
weather conditions, and road type.
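As a quick illustration of what count data looks like, the sketch below simulates accident-like counts whose expected value depends on a predictor (the variable names, coefficients, and sample size are made up for illustration):

```r
set.seed(42)
# Hypothetical continuous predictor (e.g. standardized traffic volume)
traffic <- runif(100, min = 0, max = 2)

# Draw counts from a Poisson distribution whose mean grows with traffic;
# the log link is built in: mean = exp(0.5 + 0.8 * traffic)
accidents <- rpois(100, lambda = exp(0.5 + 0.8 * traffic))

# The response consists of non-negative integer counts
summary(accidents)
```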
3.2.2 Poisson Regression Formula
The Poisson regression model estimates the expected counts of the event (Y) given the values
of the independent variables (X). The model assumes that the expected counts follow a Poisson
distribution. The formula for Poisson regression is:
𝐸(𝑌 ∣ 𝑋) = 𝑒^(𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + ⋯ + 𝛽𝑝𝑋𝑝)
Where:
• 𝐸(𝑌 ∣ 𝑋) is the expected count of the event given the values of the independent
variables.
• 𝛽0 , 𝛽1 , 𝛽2 , … , 𝛽𝑝 are the coefficients or parameters estimated by the model.
• 𝑋1 , 𝑋2 , … , 𝑋𝑝 are the values of the independent variables.
3.2.3 Poisson Regression Implementation in R
R provides the "glm" function with the family set to "poisson" for implementing Poisson regression. Here's an example code snippet to illustrate the implementation:
# Load the dataset
data <- read.csv("accident_data.csv")
# Fit the Poisson regression model
model <- glm(accidents ~ time_of_day + weather_condition + road_type,
             data = data, family = poisson)
# Examine the model summary
summary(model)
# Interpretation:
# The model summary provides information about the estimated coefficients and their significance.
# Coefficients with small p-values are considered statistically significant.
# Predict expected counts for new data (illustrative values)
new_data <- data.frame(time_of_day = "Evening", weather_condition = "Rainy",
                       road_type = "Highway")
predicted_counts <- predict(model, newdata = new_data, type = "response")
# Interpretation:
# The predicted expected counts represent the expected counts of accidents for the
# new data points based on the fitted Poisson regression model.
In this example, the "accident_data.csv" file contains a dataset with the dependent variable
"accidents" representing the number of accidents, and independent variables such as
"time_of_day," "weather_condition," and "road_type." The Poisson regression model is fitted
using the "glm" function, specifying the formula, dataset, and the family as "poisson" to
indicate Poisson regression. The model summary provides information about the estimated
coefficients and their significance. Coefficients with small p-values are considered statistically
significant, indicating a significant relationship between the independent variable and the
expected counts of the event. The "predict" function is used to predict the expected counts for
new data provided in the "new_data" dataframe.
Interpretation of the Poisson regression model involves examining the estimated coefficients
and their significance. Positive coefficients indicate a positive relationship with the expected
counts of the event, while negative coefficients indicate a negative relationship. The magnitude
of the coefficients reflects the strength of the relationship.
By using Poisson regression in R, you can analyse count data and understand the relationship
between independent variables and the expected counts of an event.
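For Poisson models, the exponentiated coefficients are rate ratios: the multiplicative change in the expected count per one-unit increase in a predictor. Here is a minimal sketch using simulated data (all names and values are illustrative):

```r
set.seed(1)
# Simulated predictor and Poisson counts with true slope 0.7 on the log scale
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(0.3 + 0.7 * x))

# Fit the Poisson regression
fit <- glm(y ~ x, family = poisson)

# exp(coef) gives rate ratios: values above 1 indicate that expected
# counts increase with the predictor
exp(coef(fit))
```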
3.3 Multinomial Logistic Regression
Multinomial logistic regression is a statistical modelling technique used to analyse categorical
dependent variables with more than two categories. It allows us to model the relationship
between multiple independent variables and the probabilities of each category of the dependent
variable. Multinomial logistic regression is an extension of binary logistic regression, which is
used when the dependent variable has only two categories. Let's explore the components,
formula, code, and interpretation of multinomial logistic regression in R:
3.3.1 Components of Multinomial Logistic Regression
• Dependent Variable: The dependent variable in multinomial logistic regression is a
categorical variable with more than two categories. For example, it could be a variable
representing flower species, such as "setosa," "versicolor," and "virginica."
• Independent Variables: These are the predictor variables that are used to explain the
variation in the categories of the dependent variable. They can be continuous or
categorical variables. For example, the independent variables could include flower
measurements such as sepal length, sepal width, petal length, and petal width.
3.3.2 Multinomial Logistic Regression Formula
Multinomial logistic regression models the relationship between the independent variables and
the probabilities of each category of the dependent variable. It uses the softmax function to
estimate the probabilities for each category. The formula for multinomial logistic regression is:
𝑃(𝑌 = 𝑗 ∣ 𝑋) = 𝑒^(𝛽0𝑗 + 𝛽1𝑗𝑋1 + ⋯ + 𝛽𝑝𝑗𝑋𝑝) / ∑𝑘=1…𝐽 𝑒^(𝛽0𝑘 + 𝛽1𝑘𝑋1 + ⋯ + 𝛽𝑝𝑘𝑋𝑝)
Where:
• 𝑃(𝑌 = 𝑗|𝑋) is the probability of category j given the values of the independent
variables.
• 𝛽0𝑗 , 𝛽1𝑗 , 𝛽2𝑗 , … , 𝛽𝑝𝑗 are the coefficients or parameters estimated for category j by
the model.
• 𝑋1 , 𝑋2 , … , 𝑋𝑝 are the values of the independent variables.
• J represents the total number of categories.
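To make the softmax step concrete, the short sketch below converts a vector of linear predictors (arbitrary illustrative numbers) into category probabilities:

```r
# Hypothetical linear predictors for J = 3 categories of one observation
eta <- c(2.0, 0.5, -1.0)

# Softmax: exponentiate each linear predictor and normalise by their sum
probs <- exp(eta) / sum(exp(eta))
probs

# The resulting probabilities are non-negative and sum to 1
sum(probs)
```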
3.3.3 Multinomial Logistic Regression Implementation in R
R provides the "multinom" function from the "nnet" package for implementing multinomial
logistic regression. Here's an example code snippet to illustrate the implementation using the
iris dataset:
# Load the required library
library(nnet)
# Fit the multinomial logistic regression model using the built-in iris dataset
model <- multinom(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                  data = iris)
# Examine the model summary
summary(model)
# Interpretation:
# The model summary provides information about the estimated coefficients and their significance.
# Coefficients with small p-values are considered statistically significant.
# Predict probabilities for new data (illustrative values)
new_data <- data.frame(Sepal.Length = 5.1, Sepal.Width = 3.5,
                       Petal.Length = 1.4, Petal.Width = 0.2)
predicted_probs <- predict(model, newdata = new_data, type = "probs")
predicted_probs
# Interpretation:
# The predicted probabilities represent the probabilities of each category of the
# dependent variable for the new data points based on the fitted multinomial
# logistic regression model.
In this example, we use the "iris" dataset, which is a built-in dataset in R that contains
measurements of flower species. The dependent variable is "Species," which represents the
flower species, and the independent variables are "Sepal.Length," "Sepal.Width,"
"Petal.Length," and "Petal.Width." The multinomial logistic regression model is fitted using
the "multinom" function, specifying the formula and the dataset. The model summary provides
information about the estimated coefficients and their significance. Coefficients with small p-
values are considered statistically significant, indicating a significant relationship between the
independent variables and the probabilities of each category of the dependent variable. The
"predict" function is used to predict the probabilities for new data provided in the "new_data"
dataframe.
Interpretation of the multinomial logistic regression model involves examining the estimated
coefficients and their significance for each category of the dependent variable. Positive
coefficients indicate a positive relationship with the corresponding category, while negative
coefficients indicate a negative relationship. The magnitude of the coefficients reflects the
strength of the relationship.
By using multinomial logistic regression in R, you can analyse categorical dependent variables
with more than two categories, model the relationship between independent variables and the
probabilities of each category, and make predictions for new observations. The example using
the iris dataset demonstrates how to implement multinomial logistic regression in R, interpret
the model summary, and predict probabilities for new data points.
The ARIMA(p, d, q) model expresses the (differenced) series as a combination of past values and past forecast errors:
𝑋𝑡 = 𝑐 + 𝜙1𝑋𝑡−1 + 𝜙2𝑋𝑡−2 + ⋯ + 𝜙𝑝𝑋𝑡−𝑝 + 𝜃1𝜖𝑡−1 + 𝜃2𝜖𝑡−2 + ⋯ + 𝜃𝑞𝜖𝑡−𝑞 + 𝜖𝑡
Where:
• 𝑋𝑡 is the time series at time t.
• c is a constant term.
• 𝜙1 , 𝜙2 , … , 𝜙𝑝 are the AR coefficients.
• 𝜖𝑡 represents the error term at time t.
• 𝜃1 , 𝜃2 , … , 𝜃𝑞 are the MA coefficients.
• p, d, and q represent the order of the AR, I, and MA components, respectively.
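Before turning to a real dataset, it can help to see the AR component in action. The sketch below simulates an AR(1) process with base R's arima.sim and recovers its coefficient (the coefficient value 0.7 and the series length are arbitrary choices for this illustration):

```r
set.seed(123)
# Simulate 500 observations from an AR(1) process with coefficient 0.7
x <- arima.sim(model = list(ar = 0.7), n = 500)

# Fit an ARIMA(1, 0, 0) model, i.e. a pure AR(1)
fit <- arima(x, order = c(1, 0, 0))

# The estimated AR coefficient should be close to the true value 0.7
coef(fit)
```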
4.3.3 ARIMA Implementation in R
R provides the "forecast" package, which includes functions such as "auto.arima" for implementing
ARIMA models. Here's an example code snippet to illustrate the implementation using the
"AirPassengers" dataset, a built-in dataset in R that contains the monthly number of airline
passengers from 1949 to 1960:
# Load the required library
library(forecast)
# Load the dataset and convert it to a time series object (monthly data)
data("AirPassengers")
ts_data <- ts(AirPassengers, frequency = 12)
# Plot the original time series
plot(ts_data)
# Fit an ARIMA model, letting auto.arima select the order automatically
model <- auto.arima(ts_data)
summary(model)
# Interpretation:
# The model summary provides information about the estimated coefficients and their significance.
# Coefficients with small p-values are considered statistically significant.
# Forecast future values
forecast_values <- forecast(model, h = 12)
# Interpretation:
# The forecasted values represent the predicted future values based on the fitted
# ARIMA model.
# The "h" parameter specifies the number of future periods to forecast.
# Plot the forecast
plot(forecast_values)
# Interpretation:
# The plot shows the original time series and the forecasted values for future
# periods.
In this updated example, we load the "AirPassengers" dataset, which contains the monthly
number of airline passengers from 1949 to 1960. We convert the dataset into a time series object
using the "ts" function, specifying the frequency as 12 since the data is monthly. We then plot
the original time series to visualize the data.
Next, we fit an ARIMA model to the time series data using the "auto.arima" function from the
"forecast" package. The "auto.arima" function automatically selects the best ARIMA model
based on various criteria. The model summary provides information about the estimated
coefficients and their significance.
We then use the "forecast" function to forecast future values based on the fitted ARIMA model.
The "h" parameter specifies the number of future periods to forecast, in this case, 12 months.
The forecasted values represent the predicted future values based on the ARIMA model.
Finally, we use the "plot" function to visualize the time series data along with the forecasted
values.
Interpretation of the ARIMA model involves examining the estimated coefficients, their
significance, and the forecasted values. Positive AR coefficients indicate a positive relationship
with past observations, while negative MA coefficients indicate a dependency on past forecast
errors. The forecasted values provide insights into the future behaviour of the time series,
helping with forecasting and decision-making.
7 Data
7.1 Customer_data
The following code creates the "customer_data.csv" file with randomly generated customer
data, including the age, gender, usage, and churn variables.
# Load the required library
library(dplyr)