
PST2 Main

This document provides instructions for a problem solving task that involves analyzing the relationship between the Southern Oscillation Index (SOI) and Australian rainfall patterns. Students are asked to download two datasets, combine them, and perform exploratory data analysis and linear regression modeling to determine which seasons have a significant linear relationship between seasonal SOI values and total seasonal precipitation. For the winter season, which shows the strongest relationship, students must specify and fit a linear regression model and report the parameter estimates, confidence intervals, and amount of variability explained.

Uploaded by

Dirty Rajan

MXN500: Problem Solving Task 2

Lecturer: Dr Aiden Price

Due: Friday 02-June-2023 11:59 pm

Submission Information
Submit a document containing your answers, any R files containing your code and the
datasets you used to blackboard.
The tutors will need to see your code in order to determine if you deserve part marks for a
given question. So please make your code easy for the tutors to read and navigate. Note, the
tutors will not be running your code to reproduce your answers, so you need to provide
written answers with appropriate explanations to all the questions. Additional marks are
awarded based on whether your code is readable.

Introduction
Background
Large scale climate drivers such as the El Niño Southern Oscillation (ENSO) are known to
have an impact on Australian rainfall patterns. ENSO has three phases: El Niño, Neutral and
La Niña. Generally, in the La Niña phase of ENSO, conditions are cooler and wetter along the
Eastern Australian coast. In comparison, during the El Niño phase conditions are hotter and
drier. In this problem solving task, you will be using your knowledge of regression to
explore the relationship between ENSO and Australian rainfall.
Note that it is not possible to measure the strength of ENSO directly, so in regression
models the Southern Oscillation Index (SOI) is commonly used to represent the strength of
ENSO. The SOI is a climate index that measures the normalised pressure difference
between Tahiti and Darwin. You do not need to provide units when displaying SOI on a
plot axis as it is an index. ENSO is considered to be in the La Niña phase when there are
sustained SOI values above 7, the El Niño phase when there are sustained SOI values below
-7, and the Neutral phase otherwise. A csv file containing the monthly SOI values can be
obtained from blackboard. More details about ENSO and SOI can be found on the Bureau of
Meteorology website (http://www.bom.gov.au/climate/enso).
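The threshold rule described above can be sketched in R as follows. This is only an illustration: the helper name classify_phase and the example values are hypothetical, the labels "ElNino", "Neutral" and "LaNina" are assumed to match the Phase column, and the sketch ignores the requirement that values be sustained over time.

```r
# Hypothetical helper: map SOI values to ENSO phase labels using the
# thresholds quoted above (> 7 La Nina, < -7 El Nino, otherwise Neutral).
# It classifies each value on its own and does not check for "sustained" values.
classify_phase <- function(soi) {
  phase <- ifelse(soi > 7, "LaNina",
                  ifelse(soi < -7, "ElNino", "Neutral"))
  factor(phase, levels = c("ElNino", "Neutral", "LaNina"))
}

# Made-up SOI values for illustration only
example_soi <- c(-12.3, -7, 0.5, 7, 9.8)
example_phase <- classify_phase(example_soi)
print(data.frame(soi = example_soi, phase = example_phase))
```

Values of exactly 7 or -7 fall in the Neutral phase here, because the text says "above 7" and "below -7".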
Datasets
Two datasets are provided for this assignment. Similarly to the data used in problem
solving task 1, there is a precipitation dataset, total_seasonal_rainfall.csv. This dataset
contains the seasonal rainfall totals recorded at BRISBANE AERO station and includes
variables:
• The GHCN Daily station id (character)
• The GHCN Daily station name (character)
• The Year of the observation (numeric)
• The Season of the observation (ordinal, categorical)
• The total_seas_prcp is the total amount of rainfall received during that Season in
mm (numeric).
In the dataset, seasonal_soi_data.csv, the variables included are:
• The Season of the observation (ordinal, categorical)
• The Year of the observation (numeric)
• The SeasonalSOI is the mean seasonal SOI value (numeric)
• The Phase of the ENSO (ordinal, categorical).

Preprocessing
Question 1.1 (1 mark) Download the files total_seasonal_rainfall.csv and
seasonal_soi_data.csv from blackboard. Combine all the variables in these two
datasets into the single dataset total_seasonal_rainfall. Print the first three rows to
show the form of your new dataset.
# Load the packages used below: tidyverse for read_csv(), write_csv() and dplyr verbs,
# rnoaa for ghcnd_stations() and meteo_pull_monitors()
library(tidyverse)
library(rnoaa)

# Download the meta data if it doesn't exist
ghcnd_meta_data_csv <- "ghcnd_meta_data.csv"

if (!file.exists(ghcnd_meta_data_csv)) {
  ghcnd_meta_data <- ghcnd_stations()
  write_csv(ghcnd_meta_data, ghcnd_meta_data_csv)
} else {
  ghcnd_meta_data <- read_csv(ghcnd_meta_data_csv)
}

# Define the station IDs
station_ids <- c("ASN00046037", "ASN00048014", "ASN00056207",
                 "ASN00058063", "ASN00017028", "ASN00044004",
                 "ASN00040096", "ASN00040214")

# Download the station data if it doesn't exist
station_data_csv <- "station_data.csv"

if (!file.exists(station_data_csv)) {
  station_data <- meteo_pull_monitors(monitors = station_ids,
                                      keep_flags = TRUE,
                                      var = "PRCP")
  write_csv(station_data, station_data_csv)
} else {
  station_data <- read_csv(station_data_csv)
}

# Load the additional datasets
total_seasonal_rainfall <- read_csv("total_seasonal_rainfall.csv")
seasonal_soi_data <- read_csv("seasonal_soi_data.csv")

# Join datasets based on "Year" and "Season" columns
combined_data <- total_seasonal_rainfall %>%
  inner_join(seasonal_soi_data, by = c("Year", "Season"))

# Convert relevant variables to factors
combined_data <- combined_data %>%
  mutate(across(where(is.character), as.factor))

# Print the first three rows of the combined dataset
head(combined_data, 3)
Question 1.2 (3 marks) Convert the relevant variables to factors in your dataset. Be sure to
set the factor levels appropriately for later analysis. Show your code and show the factor
levels.
# Convert relevant variables to factors
combined_data <- combined_data %>%
  mutate(across(where(is.character), as.factor))

# Show factor levels for "Phase" variable
factor_levels <- levels(combined_data$Phase)
print(factor_levels)
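Note that as.factor() orders levels alphabetically, which is rarely the order wanted for seasons or ENSO phases. A minimal sketch of setting the levels explicitly is given below, using a toy data frame rather than the real data. The season labels are an assumption (check unique(combined_data$Season) first); the phase labels follow those used in equation (2) and the model output.

```r
# Toy data standing in for the combined dataset (values are made up)
toy <- data.frame(
  Season = c("Winter", "Summer", "Spring", "Autumn"),
  Phase  = c("Neutral", "LaNina", "ElNino", "Neutral"),
  stringsAsFactors = FALSE
)

# Assumed labels, listed in the order wanted for plots and model baselines
season_levels <- c("Summer", "Autumn", "Winter", "Spring")
phase_levels  <- c("ElNino", "Neutral", "LaNina")

# factor() with explicit levels keeps this order instead of alphabetical order
toy$Season <- factor(toy$Season, levels = season_levels)
toy$Phase  <- factor(toy$Phase, levels = phase_levels)

print(levels(toy$Season))
print(levels(toy$Phase))
```

The factors are left unordered on purpose: lm() uses polynomial contrasts for ordered factors, which would not give the indicator-variable form of a categorical regression. With "ElNino" as the first level, it becomes the baseline category.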

Exploratory Visualisation
For each season, we are interested in whether a simple linear regression

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2) \qquad (1) $$

could be used to model the relationship between the mean seasonal SOI value and total
seasonal precipitation.
Question 2.1 (5 marks) Create a visualisation that explores the relationship between
SeasonalSOI and total_seas_prcp for each season. Use geom_smooth() to add the
null model and the linear model from equation (1).
# ggplot2 provides the plotting functions used below
library(ggplot2)

# Fit the linear regression model (note: this fit pools all seasons)
model <- lm(total_seas_prcp ~ SeasonalSOI, data = combined_data)

# Scatter plot per season with the linear model (blue) and the null model (red, dashed)
ggplot(combined_data, aes(x = SeasonalSOI, y = total_seas_prcp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x) +
  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed",
              formula = y ~ 1) +
  facet_wrap(~Season, ncol = 2) +
  labs(x = "Seasonal SOI", y = "Total Seasonal Precipitation (mm)") +
  theme_minimal()

Question 2.2 (3 marks) Using your visualisation, for which seasons would you expect there
to be a significant linear relationship between total seasonal precipitation and the mean
seasonal SOI value? Give detailed reasoning.
The scatter plot illustrates the association between the variables "Seasonal SOI" (x-axis)
and "Total Seasonal Precipitation" (y-axis). It depicts the distribution of the data points and
the connection between the variables graphically.
The fitted linear regression model is represented by the blue line. Based on the regression
analysis, it depicts the predicted linear relationship between "Seasonal SOI" and "Total
Seasonal Precipitation". It indicates the relationship's general trend and direction. The
slope of the blue line represents the rate of change in precipitation for every unit change in
the "Seasonal SOI" variable.
The red dashed line is the null model, fitted with the formula y ~ 1, and is horizontal. Because
the formula contains only an intercept, the predicted value of "Total Seasonal
Precipitation" is constant for any value of "Seasonal SOI". It indicates the average
precipitation level expected if "Seasonal SOI" had no effect, and so acts as a baseline for
comparison.
In the winter panel the points are grouped more tightly around the blue line than in the
other panels, indicating that the linear relationship between "Seasonal SOI" and "Total
Seasonal Precipitation" is strongest during the winter season.
Because the red dashed line does not depend on "Seasonal SOI", the distance of the points
from it in the winter panel does not by itself reveal a seasonal relationship. What matters
is how far the blue line departs from this baseline.

Simple Linear Regression


For the season with the strongest relationship as determined in the Exploratory Analysis
section, fit the regression model from equation (1) and answer the following questions.
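Fitting equation (1) for a single season means restricting the data to that season before calling lm(). A minimal sketch on synthetic data follows; the data frame toy, its values, and the label "Winter" are assumptions for illustration and are not the assignment data.

```r
set.seed(1)

# Synthetic stand-in for the combined dataset (all values are made up)
toy <- data.frame(
  Season = rep(c("Summer", "Autumn", "Winter", "Spring"), each = 30),
  SeasonalSOI = rnorm(120, mean = 0, sd = 8)
)
toy$total_seas_prcp <- 200 + 3 * toy$SeasonalSOI + rnorm(120, sd = 60)

# Keep only the season of interest ("Winter" is an assumed label)
winter_data <- subset(toy, Season == "Winter")

# Fit the simple linear regression from equation (1) on that season only
winter_model <- lm(total_seas_prcp ~ SeasonalSOI, data = winter_data)

print(coef(winter_model))   # estimates of beta_0 and beta_1
print(nobs(winter_model))   # number of observations used in the fit
```

Checking nobs() is a quick way to confirm that the model was fitted on one season rather than on the pooled data.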
Question 3.1 (2 marks) Fill in the blanks in the following sentence so that it refers to the
terms in the regression model.
For the BRISBANE AERO Station and the winter season, a linear model was specified to model
how the total seasonal precipitation, denoted as $y_i$, is related to the
mean seasonal SOI value, represented by $x_i$. The parameter $\beta_1$ (the slope, or
regression coefficient) describes the rate of change in the total seasonal precipitation
with an increase in mean seasonal SOI value. The parameter $\beta_0$ (the intercept, or
regression constant) represents the total seasonal precipitation when the mean seasonal SOI value is 0.
Question 3.2 (2 marks) Write down your linear model substituting the parameter values
into the equation.
The linear model equation with parameter values substituted in is:
total_seas_prcp = [β0] + [β1] * SeasonalSOI
Question 3.3 (2 marks) Provide a 95% confidence interval for the parameter estimates.
A 95% confidence interval for the parameter estimates can be obtained using the confint
function in R.
# Confidence interval
conf_interval <- confint(model)
cat("Confidence Interval (95%) for the parameter estimates:\n")
print(conf_interval)

Question 3.4 (1 mark) How much variability in the data is explained by this model?
The linear regression model used to estimate seasonal rainfall totals based on the mean
seasonal SOI value explains only a small amount of the data variability. The mean seasonal
SOI accounts for roughly 5.679% of the variability in total seasonal precipitation, with an R-
squared value of 0.05679. This implies that there are additional variables not accounted for
in the model that significantly contribute to the observed seasonal rainfall changes. As a
result, while the model gives some insight, it may not be a good predictor of seasonal
rainfall totals, and other factors should be taken into account when studying and
forecasting such patterns.
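The proportion of variability explained is R-squared = 1 - RSS/TSS, where RSS is the residual sum of squares and TSS is the total sum of squares about the mean. The sketch below checks this against summary() on synthetic data; the variables x and y are made up and do not reproduce the value reported above.

```r
set.seed(2)

# Synthetic predictor and response (made-up values, for illustration only)
x <- rnorm(50, sd = 8)
y <- 200 + 2 * x + rnorm(50, sd = 50)

fit <- lm(y ~ x)

# Residual sum of squares: variation left unexplained by the model
rss <- sum(residuals(fit)^2)

# Total sum of squares: variation of y about its mean (the null model)
tss <- sum((y - mean(y))^2)

# Proportion of variability explained by the model
r_squared <- 1 - rss / tss

print(r_squared)
print(summary(fit)$r.squared)  # should match the hand calculation
```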
Question 3.5 (4 marks) Visualise the fitted values compared with the residuals, and
visualise the standardised quantiles of the residuals compared with the theoretical
quantiles. Discuss the validity of the underlying assumptions of linear regression.
# Fitted values vs residuals
plot(fitted(model), residuals(model),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Fitted Values vs Residuals")
abline(h = 0, lty = 2)  # reference line at zero

# QQ plot of standardised residuals against theoretical normal quantiles
std_resid <- rstandard(model)
qqnorm(std_resid, main = "Normal Q-Q Plot of Standardised Residuals")
qqline(std_resid)

Question 3.6 (4 marks) Print out a summary of your linear model and interpret the results.
As part of this you must discuss the physical meaning of the model, which parameters are
significant and whether the linear model is significantly different compared to the null
model.
According to the model summary, the linear regression analysis demonstrates a significant
relationship between the mean seasonal SOI value and total seasonal precipitation. The
estimated intercept is the expected precipitation when the mean seasonal SOI value is 0, and
the slope is the rate at which precipitation changes per unit change in the mean seasonal SOI value.
Nonetheless, the low R-squared value shows that the model accounts for only part of the
variation in the data, suggesting that other variables may affect total seasonal precipitation.
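The comparison with the null model asked for in the question is the overall F-test printed at the bottom of summary(). A sketch on synthetic data (the variables x and y are made up) shows that it matches an explicit anova() comparison, and that in simple linear regression its p-value equals the p-value of the slope's t-test.

```r
set.seed(3)

# Synthetic data for illustration only
x <- rnorm(60, sd = 8)
y <- 200 + 2 * x + rnorm(60, sd = 50)

null_fit <- lm(y ~ 1)  # null model: intercept only
lin_fit  <- lm(y ~ x)  # linear model from equation (1)

# Explicit F-test of the linear model against the null model
comparison <- anova(null_fit, lin_fit)
f_from_anova <- comparison$F[2]
p_from_anova <- comparison$`Pr(>F)`[2]

# The same F-statistic as reported by summary()
f_from_summary <- unname(summary(lin_fit)$fstatistic["value"])

# p-value of the t-test on the slope
p_slope <- coef(summary(lin_fit))["x", "Pr(>|t|)"]

print(c(F_anova = f_from_anova, F_summary = f_from_summary))
print(c(p_anova = p_from_anova, p_slope = p_slope))
```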
Question 3.7 (2 marks) Discuss whether your fitted model is a good model to use to
predict seasonal rainfall totals using the mean seasonal SOI.
A significant connection exists between the mean seasonal SOI value and total seasonal
precipitation, according to the fitted linear regression model. The coefficients associated
with the intercept and the mean seasonal SOI value give useful information about the
baseline precipitation and the rate of change in precipitation as a function of fluctuations in
the mean seasonal SOI value.
However, the model's R-squared value is quite low, indicating that the model only explains
a tiny amount of the variability in the data. This shows that variables other than the mean
seasonal SOI impact seasonal rainfall totals. As a result, using the mean seasonal SOI as a
predictor alone may not reflect the entire complexity of precipitation patterns.
Therefore, while the model can estimate seasonal rainfall totals using the mean seasonal
SOI, it should be used with caution. It may be advantageous to investigate additional
variables or other models capable of capturing the effect of other factors on seasonal
precipitation. Furthermore, continual evaluation and validation of the model's performance
against real-world data is required to determine its efficacy in actual applications.
Polynomial Lines of Best Fit
Many climate scientists hypothesise that, when it comes to rainfall, wet can get wetter
but dry can't get drier. In other words, how Australian rainfall responds to the different
phases of ENSO may not be equal in both La Niña and El Niño phases. For this reason, one
might want to check if polynomial regression better suits the data.
Question 4.1 (2 marks) For BRISBANE AERO and the season you chose earlier, fit a linear
regression using polynomial explanatory variables of up to order 2 and the SeasonalSOI.
Write down the equation with your estimated parameter values.
# Fit a polynomial regression model
model_poly <- lm(total_seas_prcp ~ poly(SeasonalSOI, 2), data = combined_data)
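By default poly() builds orthogonal polynomials, so its coefficients do not multiply SeasonalSOI and SeasonalSOI^2 directly. To write down the equation in terms of those variables, one option is poly(..., raw = TRUE), which gives the same fitted values. The sketch below uses synthetic data; the variables soi and prcp are made up for illustration.

```r
set.seed(4)

# Synthetic data for illustration only
soi  <- rnorm(80, sd = 8)
prcp <- 200 + 2 * soi + 0.1 * soi^2 + rnorm(80, sd = 40)

# Orthogonal polynomial terms (the default used by poly())
fit_orth <- lm(prcp ~ poly(soi, 2))

# Raw polynomial terms: coefficients multiply soi and soi^2 directly
fit_raw <- lm(prcp ~ poly(soi, 2, raw = TRUE))

# Coefficients to substitute into y = b0 + b1 * soi + b2 * soi^2
print(coef(fit_raw))

# Both parameterisations describe the same fitted curve
print(all.equal(fitted(fit_orth), fitted(fit_raw)))
```

The orthogonal form remains useful for testing, because its terms are uncorrelated; only the written-out equation needs the raw coefficients.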

Question 4.2 (3 marks) Print out a summary of your fitted model, interpret the
significance of the results and explain the related physical meaning.
# Print the model summary
summary(model_poly)

According to the polynomial regression model, the SeasonalSOI variable and its quadratic
term have some explanatory value for predicting total seasonal precipitation. The low R-
squared value, on the other hand, indicates that the model only explains a tiny amount of
the variability, suggesting that additional factors not included in the model may impact
total seasonal precipitation.
Question 4.3 (3 marks) Create a prediction interval for a mean seasonal SOI value of 25
and a mean seasonal SOI value of -25. Comment on using this model for prediction in
relation to your physical understanding of rainfall and your understanding of
extrapolation.
# Create new data points
new_data_poly <- data.frame(SeasonalSOI = c(25, -25))

# Make predictions with 95% prediction intervals
predictions_poly <- predict(model_poly, newdata = new_data_poly, interval = "prediction")

# Print the predictions with prediction intervals
print(predictions_poly)
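Whether SOI values of 25 and -25 are extrapolations depends on the range of SOI values in the fitted data. The sketch below, on synthetic data, flags new points that fall outside the observed range and compares prediction interval widths with a point near the centre of the data. The SOI range of -15 to 15 and all values are assumptions for illustration.

```r
set.seed(5)

# Synthetic data with SOI confined to an assumed range of -15 to 15
toy <- data.frame(soi = runif(80, min = -15, max = 15))
toy$prcp <- 200 + 2 * toy$soi + 0.1 * toy$soi^2 + rnorm(80, sd = 40)

fit <- lm(prcp ~ poly(soi, 2), data = toy)

# Two extreme points and one central point for comparison
new_points <- data.frame(soi = c(25, -25, 0))

# Flag points outside the range of the data used to fit the model
outside_range <- new_points$soi < min(toy$soi) | new_points$soi > max(toy$soi)

# 95% prediction intervals and their widths
pred <- predict(fit, newdata = new_points, interval = "prediction")
interval_width <- pred[, "upr"] - pred[, "lwr"]

print(cbind(new_points, outside_range, pred, interval_width))
```

The lower limit is also worth checking against physical constraints, since a prediction interval for rainfall that extends below 0 mm is not meaningful.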

Question 4.4 (3 marks) Decide whether a linear or polynomial regression is preferred
using a statistical test. Be sure to describe your statistical test in detail.
We used an F-test to compare the fit of two nested models, a linear regression model and a
polynomial regression model, to decide whether linear or polynomial regression is
preferable. The F-test determines if the polynomial model significantly improves fit over the
simpler linear model.
The test compares the residual sum of squares (RSS) of the two models, which measures
the squared difference between observed and predicted values. The F-statistic is the
reduction in RSS per additional parameter, divided by the residual mean square of the
polynomial model (its RSS divided by its residual degrees of freedom). The p-value is the
probability of observing an F-statistic at least this extreme if the linear model is
sufficient.
The polynomial regression model has a lower RSS (10389018) in our investigation than the
linear regression model (10554075). The F-statistic computed was 4.5915, with a p-value
of 0.03297. Because the p-value is less than 0.05, we reject the hypothesis that the linear
model is sufficient and infer that the polynomial regression model fits significantly better.

As a result, we choose the polynomial regression model based on the statistical test
because it better explains the variability in total seasonal precipitation by including the
quadratic term.
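The F-test for nested models can be computed by hand from the two residual sums of squares and checked against anova(). The sketch below uses synthetic data (the variables soi and prcp are made up), so its numbers will not match those reported above.

```r
set.seed(6)

# Synthetic data for illustration only
soi  <- rnorm(100, sd = 8)
prcp <- 200 + 2 * soi + 0.15 * soi^2 + rnorm(100, sd = 40)

lin_fit  <- lm(prcp ~ soi)           # simpler (reduced) model
quad_fit <- lm(prcp ~ poly(soi, 2))  # polynomial (full) model

# Residual sums of squares and residual degrees of freedom
rss_lin  <- deviance(lin_fit)
rss_quad <- deviance(quad_fit)
df_lin   <- df.residual(lin_fit)
df_quad  <- df.residual(quad_fit)

# F = (drop in RSS per extra parameter) / (residual mean square of full model)
f_stat <- ((rss_lin - rss_quad) / (df_lin - df_quad)) / (rss_quad / df_quad)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, df1 = df_lin - df_quad, df2 = df_quad, lower.tail = FALSE)

# The same test as produced by anova()
test_table <- anova(lin_fit, quad_fit)
print(test_table)
print(c(F = f_stat, p = p_value))
```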
Linear Regression with Categorical Explanatory Variables
Within the analysis so far the role of ENSO phases has been modelled solely using the SOI.
For model simplicity and for physical understanding, it may be useful to consider only the
phases.
Question 5.1 (3 marks) Consider again BRISBANE AERO and your chosen season. To better
understand the role of different Phases of ENSO, fit a categorical regression model of the
form

$$ y_i = \beta_0 + \beta_1 I(\text{Phase} = \text{Neutral}) + \beta_2 I(\text{Phase} = \text{LaNina}) + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2). \qquad (2) $$

Print the model summary.

# Fit the categorical regression model
model_categorical <- lm(total_seas_prcp ~ as.factor(Phase), data = combined_data)

# Print the model summary
summary(model_categorical)

Question 5.2 (1 mark) Write down the linear model substituting the parameter values into
the equation.
The linear model with the parameter values from the model summary substituted in is:
total_seas_prcp = [β0] + 30.20 * I(Phase = Neutral) + 126.73 * I(Phase = LaNina)
Question 5.3 (3 marks) Interpret the significance of the results of your fitted model and
explain the related physical meaning.
The fitted categorical regression model findings show the importance of the Phase levels
(Neutral and LaNina) in comparison to the baseline category (ElNino). The coefficient
estimations for each Phase level show the predicted change in total seasonal precipitation
vs the baseline category.
The intercept coefficient is the average total seasonal precipitation during the El Nino
phase. The coefficient for the LaNina phase (126.73) is statistically significant
(p < 0.001), indicating that total seasonal precipitation during LaNina is greater by 126.73 mm
on average than during ElNino. However, the Neutral phase coefficient (30.20) is not
statistically significant (p = 0.298), implying that there is no clear evidence of a substantial
difference in total seasonal precipitation between the Neutral and ElNino phases.
Overall, these findings reveal that the ENSO Phase has a considerable influence on total
seasonal precipitation, with LaNina having a major impact, while the effect of the Neutral
phase is unclear.
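With a single categorical predictor, the intercept equals the mean of the baseline group and each coefficient equals the difference between a group mean and the baseline mean. The sketch below illustrates this on synthetic data; the group sizes, means and variable names are made up, and "ElNino" is set as the baseline to match the interpretation above.

```r
set.seed(7)

# Synthetic phases and rainfall totals (made-up values, for illustration only)
phase <- factor(rep(c("ElNino", "Neutral", "LaNina"), times = c(15, 30, 20)),
                levels = c("ElNino", "Neutral", "LaNina"))
prcp <- c(rnorm(15, mean = 100, sd = 40),
          rnorm(30, mean = 130, sd = 40),
          rnorm(20, mean = 220, sd = 40))

# Categorical regression with ElNino as the baseline level
fit <- lm(prcp ~ phase)

# Mean rainfall within each phase
group_means <- tapply(prcp, phase, mean)

# Intercept = baseline mean; other coefficients = differences from baseline
print(coef(fit))
print(group_means)
print(group_means[c("Neutral", "LaNina")] - group_means["ElNino"])
```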

Question 5.4 (3 marks) Given the results of this categorical regression, is this in support of
your choice of linear or polynomial regression?
No, the categorical regression findings do not support the use of linear or polynomial
regression. According to the categorical regression model, the three ENSO phases (El Nino,
La Nina, and Neutral) have distinct effects on total seasonal precipitation.
This suggests that a linear or polynomial regression, which assumes a continuous
relationship between the predictor and the response variable, may fail to reflect the
complexities of the ENSO-precipitation relationship. As a result, the categorical regression
model is more suitable and better matched with the actual data, emphasizing the necessity
of addressing the ENSO phases' categorical character when examining their influence on
total seasonal precipitation.
The relevance of the Phase levels in this categorical regression analysis implies that a
categorical regression model that takes into account the distinct ENSO phases gives useful
insights into the association between ENSO and total seasonal precipitation. The linear or
polynomial regression models may not capture the unique effects of each ENSO phase on
precipitation patterns completely. As a result, the categorical regression model is more
suited to assessing the influence of distinct ENSO phases on total seasonal precipitation.

Code
Readability and clarity of code (5 marks) This assignment is primarily about your ability
to perform a statistical data analysis, but the tutors will award marks based on how clear
and readable your code is. To help the tutors with this, please make sure to comment your
code for each of the different questions.
