
# PANGANTIHON, Norman O.

# STT061-M14

# Activity on Regression Analysis

# Your task is to consider the Anscombe data set again

# These are your new problems

# Problem 1: Encode the Anscombe data set using Excel (Save the file)

# Problem 2: Export the xls file into a csv file (find the Export command

# in the File menu of Excel)

# csv means comma separated values

# Problem 3: Use R command to load the csv file in R

homedir <- "~/NORMAN/STT061/"

setwd(homedir)

Anscombe <- read.csv("Anscombe.csv")
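# Note: if Excel is not available, R also ships the quartet as the
# built-in data frame `anscombe` (with lowercase columns x1..y4).
# A minimal sketch that regenerates Anscombe.csv from it, renaming
# the columns to the X1..Y4 names used below:

Anscombe <- datasets::anscombe
names(Anscombe) <- toupper(names(Anscombe)) # x1..y4 -> X1..Y4
write.csv(Anscombe, "Anscombe.csv", row.names = FALSE)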

# Problem 4: Use R command to compute the mean of X1,X2,X3,X4

mean(Anscombe$X1)

mean(Anscombe$X2)

mean(Anscombe$X3)

mean(Anscombe$X4)
# Problem 5: Use R command to compute the mean of Y1,Y2,Y3,Y4

mean(Anscombe$Y1)

mean(Anscombe$Y2)

mean(Anscombe$Y3)

mean(Anscombe$Y4)

# Problem 6: Use R command to compute the variance Y1,Y2,Y3,Y4

var(Anscombe$Y1)

var(Anscombe$Y2)

var(Anscombe$Y3)

var(Anscombe$Y4)

# Problem 7: Use R command to compute the variance X1,X2,X3,X4

var(Anscombe$X1)

var(Anscombe$X2)

var(Anscombe$X3)

var(Anscombe$X4)

# Problem 8: Use R command to compute the sd Y1,Y2,Y3,Y4

sd(Anscombe$Y1)

sd(Anscombe$Y2)

sd(Anscombe$Y3)

sd(Anscombe$Y4)
# Problem 9: Use R command to compute the sd X1,X2,X3,X4

sd(Anscombe$X1)

sd(Anscombe$X2)

sd(Anscombe$X3)

sd(Anscombe$X4)

# Problem 10: Use R command to compute the correlation (X1,Y1),

# and also for (X2,Y2), (X3,Y3), (X4,Y4).

cor(Anscombe$X1,Anscombe$Y1)

cor(Anscombe$X2,Anscombe$Y2)

cor(Anscombe$X3,Anscombe$Y3)

cor(Anscombe$X4,Anscombe$Y4)
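# The four correlations can also be computed in one pass; a compact
# sketch (assumes the column names X1..X4 and Y1..Y4 used above):

sapply(1:4, function(i) cor(Anscombe[[paste0("X", i)]], Anscombe[[paste0("Y", i)]]))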

# Establish causal relationship by validating the four assumptions

# Problem 11: Build the Simple Linear Regression Model for (X1, Y1).

# Verify the assumptions of this model.

# Step1: Load data set

Anscombe <- read.csv("Anscombe.csv")

# Step2: Build the Simple Linear Regression Model

model <- lm(Y1 ~ X1, data = Anscombe) # create the linear regression model
plot(Anscombe$X1,Anscombe$Y1) # scatter plot

abline(model,col = "red",lwd = 3) # plot the regression line

model # display regression coefficients

cor(Anscombe$X1,Anscombe$Y1) # get correlation coefficient

# correlation = 0.8164205 - this value is high

# I can say that there is a strong linear relationship between X1 and Y1

# Findings for Assumption1:

# The linearity of data is satisfied

# because of the high value of correlation

# and the scatter plot shows that

# as X1 increases, Y1 also increases.

# Step 3: Verify Assumption2: Independence of Error terms

plot(model, 1) # residuals vs fitted, with a red trend line

plot(model$fitted.values, model$residuals) # the same plot, drawn manually

# Conclusion: Independence of error terms is satisfied.

# In the residuals versus fits plot,

# the points seem randomly scattered,

# and it does not appear that there is a pattern.

# Also, there is a red line which is

# approximately horizontal at 0.
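# An optional, more formal check is the Durbin-Watson test for serial
# correlation in the residuals (a sketch; assumes the lmtest package is
# installed, and is only meaningful if the rows have a natural order):

library(lmtest)
dwtest(model)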
# Step4: Verify Assumption3: Normality of Error terms.

# For simple linear regression

# we can check the normality of the response variable

# Can use the hist() to check whether the dependent variable

# follows a normal distribution. Use also Test for normality and the

# QQ plot and the Normality Rule

hist(Anscombe$Y1,probability=T, main="Histogram",xlab="Raw data")

lines(density(Anscombe$Y1),col=2,lwd = 3)

# Findings: it appears that the distribution

# is approximately normal

# We can confirm this by computing the skewness and kurtosis coefficients

library(datawizard)

skewness(Anscombe$Y1)

kurtosis(Anscombe$Y1)

# skew = -0.065 > -0.5 indicating nearly symmetrical data

# Kurtosis = -0.535 < 0 indicating that the curve is flatter than normal

# Findings: The distribution of the response variable is nearly symmetric

# Conclusion: The distribution of the response variable is normal

# Perform Shapiro-Wilk test for Normality


shapiro.test(Anscombe$Y1)

# W = 0.97693, p-value = 0.9467

# The p-value of the test turns out to be 0.9467 .

# Since this value is greater than .05, we fail to reject the null

# hypothesis, so there is no evidence against the assumption that the

# sample data come from a normally distributed population.

# Perform the Empirical Normality rule: (This requires that the curve is symmetric)

# Percentage of area covered under normality rule: 68/95/99.7

# Area covered within 1 SD is 68%

# Area covered within 2 SD is 95%

# Area covered within 3 SD is 99.7%

# Compute the actual percentage

data1 <- Anscombe$Y1

within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)

within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)

within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)

percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)

(normsfreq <- paste0("Actual Percentage: ", percentage))

# "Actual Percentage: 81.82/100/100 which satisfies the empirical rule

# and there is a validation that the curve is symmetric

# using the histogram skewness and kurtosis.

# Therefore the result about the actual empirical coverage


# should be considered

# because it has already been established that the curve is nearly symmetric.

# Findings: the Empirical Normality rule is satisfied

# Decision: The result of Empirical rule is valid

# since the conditions are satisfied and

# the shape of the distribution is nearly symmetric.
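# The three computations above can be wrapped in a reusable helper;
# a sketch (x is any numeric vector):

empirical_coverage <- function(x) {
  sapply(1:3, function(k) round(100 * mean(abs(x - mean(x)) < k * sd(x)), 2))
}
empirical_coverage(Anscombe$Y1) # should reproduce the percentages above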

# Here is another way to establish normality

# create Quantile-Quantile plot for residuals ( the plot

# should fall along a straight line)

qqnorm(model$residuals)

qqline(model$residuals)

# Findings: the residuals seem to fall along the straight line,

# which means the error terms are normally distributed.

# Final Conclusion about normality of error terms:

# Assumption is satisfied.
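# Strictly speaking, the assumption concerns the error terms rather than
# the response, so it is worth running the same test on the residuals
# directly (a quick sketch):

shapiro.test(model$residuals)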

# Step5: Verify Assumption4: Constant Variance of Y for each x

# The assumption can be checked by examining the scale-location

# plot which is also known as spread location-plot


plot(model,3)

# Findings: There is constant variance of Y for each X value.

# Another view on how to verify constancy of variance

# Use augment function to automatically insert fitted values

# and residuals

library(broom)

augment_model <- augment(model)

library(ggplot2)

ggplot(augment_model, aes(X1, Y1)) +
  geom_point() +
  stat_smooth(method = lm, se = TRUE) +
  geom_segment(aes(xend = X1, yend = .fitted), color = "red", size = 0.3)

# The shaded band shows the standard error of the fitted line.

# Its width stays roughly the same from the smallest to the largest

# values of x, which suggests that the variance of Y is constant

# along the values of x.

# Conclusion: Assumption 4 is satisfied
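# A formal complement to eyeballing the scale-location plot is the
# Breusch-Pagan test (a sketch; assumes the lmtest package is installed).
# A large p-value is consistent with constant variance:

library(lmtest)
bptest(model)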

# ---------------------------------------------------------

# Final Conclusion: All 4 assumptions are satisfied.

# Therefore the simple linear regression model is


# appropriate for the data. This means we do not need

# to look for other methods to improve the model.

# Validation of the assumptions showed that the Simple linear regression

# model is valid to represent the causal relationship

# between the X1 variable and the Y1 variable.
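# With all assumptions validated, summary() gives the full picture of
# the fitted model: coefficient estimates, p-values, and R-squared.

summary(model)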

# ---------------------------------------------------------

# Instead of evaluating the response variable, we can also

# investigate the residuals (actual value - fitted value):

hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")

lines(density(model$residuals),col=2,lwd = 3)

# ---------------------------------------------------------

# Problem 12: Build the Simple Linear Regression Model for (X2, Y2).

# Verify the assumptions of this model.

# Step1: Load data set

Anscombe <- read.csv("Anscombe.csv")

# Step2: Build the Simple Linear Regression Model

model <- lm(Y2 ~ X2, data = Anscombe) # create the linear regression model

plot(Anscombe$X2,Anscombe$Y2) # scatter plot

abline(model,col = "red",lwd = 3) # plot the regression line

model # display regression coefficients

cor(Anscombe$X2,Anscombe$Y2) # get correlation coefficient

# correlation = 0.8162365 - almost identical to the (X1, Y1) value

# Findings for Assumption1:

# Although the correlation is high, the scatter plot shows a clear

# curve (Y2 rises and then falls), so the linearity of data is

# NOT satisfied; the correlation coefficient alone is misleading here.

# Step 3: Verify Assumption2: Independence of Error terms

plot(model, 1) # residuals vs fitted

plot(model$fitted.values, model$residuals) # the same plot, drawn manually

# Conclusion: Independence of error terms is not satisfied.

# The residuals versus fits plot shows a strong arch-shaped pattern

# rather than a random scatter.

# Step4: Verify Assumption3: Normality of Error terms.

# For simple linear regression

# we can check the normality of the response variable

# Can use the hist() to check whether the dependent variable

# follows a normal distribution. Use also Test for normality and the
# QQ plot and the Normality Rule

hist(Anscombe$Y2,probability=T, main="Histogram",xlab="Raw data")

lines(density(Anscombe$Y2),col=2,lwd = 3)

# Findings: the distribution appears skewed to the left

# We can confirm this by computing the skewness and kurtosis coefficients

library(datawizard)

skewness(Anscombe$Y2) # skew = -1.316 < -1 indicating strong skewness

# on the left side

kurtosis(Anscombe$Y2) # kurtosis = 0.846 > 0 indicating that the curve

# is higher than normal

# Findings: The distribution of the response variable is strongly skewed

# Conclusion: The distribution of the response variable is not normal

# Perform Shapiro-Wilk test for Normality

shapiro.test(Anscombe$Y2)

# W = 0.82837, p-value = 0.02222

# The p-value of the test turns out to be 0.02222 .

# Since this value is less than .05, we reject the null hypothesis

# and conclude that Y2 does not come from a normally distributed population.

# Perform the Empirical Normality rule: (This requires that the curve is symmetric)

# Percentage of area covered under normality rule: 68/95/99.7


# Area covered within 1 SD is 68%

# Area covered within 2 SD is 95%

# Area covered within 3 SD is 99.7%

# Compute the actual percentage

data1 <- Anscombe$Y2

within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)

within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)

within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)

percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)

(normsfreq <- paste0("Actual Percentage: ", percentage))

# "Actual Percentage: 100/100/100" which satisfies the empirical rule

# Findings:

# Decision:

# Here is another way to establish normality

# create Quantile-Quantile plot for residuals ( the plot

# should fall along a straight line)

qqnorm(model$residuals)

qqline(model$residuals)

# Findings: the residuals bend away from the straight line

# Decision: the error terms do not appear normally distributed

# Final Conclusion about normality of error terms:

# Assumption is not satisfied (Shapiro-Wilk p = 0.02222 < .05).

# Step5: Verify Assumption4: Constant Variance of Y for each x

# The assumption can be checked by examining the scale-location

# plot which is also known as spread location-plot

plot(model,3)

# Findings: No constant variance of Y for each

# x value.

# Another view on how to verify constancy of variance

# Use augment function to automatically insert fitted values

# and residuals

library(broom)

augment_model <- augment(model)

library(ggplot2)

ggplot(augment_model, aes(X2, Y2)) +
  geom_point() +
  stat_smooth(method = lm, se = TRUE) +
  geom_segment(aes(xend = X2, yend = .fitted), color = "red", size = 0.3)

# The variance is represented by the shaded band.


# Conclusion: Assumption 4 is not satisfied, consistent with the

# scale-location plot above.

# ---------------------------------------------------------

# Final Conclusion: Not all 4 assumptions are satisfied, so the simple

# linear regression model is not appropriate for (X2, Y2); the

# relationship is curved rather than linear.
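# Since the scatter plot is clearly curved, a quadratic term is the
# natural fix; a sketch (not part of the original activity):

model_quad <- lm(Y2 ~ X2 + I(X2^2), data = Anscombe)
summary(model_quad) # fits the second Anscombe set almost perfectly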

# ---------------------------------------------------------

# Instead of evaluating the response variable, we can also

# investigate the residuals (actual value - fitted value):

hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")

lines(density(model$residuals),col=2,lwd = 3)

# ---------------------------------------------------------

# Problem 13: Build the Simple Linear Regression Model for (X3, Y3).

# Verify the assumptions of this model.

# Step1: Load data set

Anscombe <- read.csv("Anscombe.csv")

# Step2: Build the Simple Linear Regression Model


model <- lm(Y3 ~ X3, data = Anscombe) # create the linear regression model

plot(Anscombe$X3,Anscombe$Y3) # scatter plot

abline(model,col = "red",lwd = 3) # plot the regression line

model # display regression coefficients

cor(Anscombe$X3,Anscombe$Y3) # get correlation coefficient

# correlation = 0.8162867 - again nearly the same value

# Findings for Assumption1:

# The scatter plot shows the points lying on a straight line except

# for a single outlier; linearity holds for most of the data, but

# the outlier distorts the fitted line.

# Step 3: Verify Assumption2: Independence of Error terms

plot(model, 1) # residuals vs fitted

plot(model$fitted.values, model$residuals) # the same plot, drawn manually

# Conclusion: The residuals versus fits plot is dominated by one

# extreme residual; apart from that point the residuals are not

# randomly scattered, so the assumption is questionable.

# Step4: Verify Assumption3: Normality of Error terms.

# For simple linear regression

# we can check the normality of the response variable


# Can use the hist() to check whether the dependent variable

# follows a normal distribution. Use also Test for normality and the

# QQ plot and the Normality Rule

hist(Anscombe$Y3,probability=T, main="Histogram",xlab="Raw data")

lines(density(Anscombe$Y3),col=2,lwd = 3)

# Findings: the distribution appears skewed to the right,

# with one extreme value

# We can confirm this by computing the skewness and kurtosis coefficients

library(datawizard)

skewness(Anscombe$Y3) # skew = 1.855 > 1 indicating strong skewness

# on the right side

kurtosis(Anscombe$Y3) # kurtosis = 4.384 > 0 indicating that the curve

# is higher than normal

# Findings: The distribution of the response variable is strongly right-skewed

# Conclusion: The distribution of the response variable is not normal

# Perform Shapiro-Wilk test for Normality

shapiro.test(Anscombe$Y3)

# W = 0.83361, p-value = 0.02604

# The p-value of the test turns out to be 0.02604 .

# Since this value is less than .05, we reject the null hypothesis

# and conclude that Y3 does not come from a normally distributed population.
# Perform the Empirical Normality rule: (This requires that the curve is symmetric)

# Percentage of area covered under normality rule: 68/95/99.7

# Area covered within 1 SD is 68%

# Area covered within 2 SD is 95%

# Area covered within 3 SD is 99.7%

# Compute the actual percentage

data1 <- Anscombe$Y3

within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)

within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)

within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)

percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)

(normsfreq <- paste0("Actual Percentage: ", percentage))

# "Actual Percentage: 90.91/90.91/100" which satisfies the empirical rule

# Findings:

# Decision:

# Here is another way to establish normality

# create Quantile-Quantile plot for residuals ( the plot

# should fall along a straight line)


qqnorm(model$residuals)

qqline(model$residuals)

# Findings: most residuals follow the straight line, but the outlier

# sits far above it

# Decision: the error terms do not appear normally distributed

# Final Conclusion about normality of error terms:

# Assumption is not satisfied (Shapiro-Wilk p = 0.02604 < .05).

# Step5: Verify Assumption4: Constant Variance of Y for each x

# The assumption can be checked by examining the scale-location

# plot which is also known as spread location-plot

plot(model,3)

# Findings: No constant variance of Y for each

# x value.

# Another view on how to verify constancy of variance

# Use augment function to automatically insert fitted values

# and residuals

library(broom)

augment_model <- augment(model)

library(ggplot2)

ggplot(augment_model, aes(X3, Y3)) +
  geom_point() +
  stat_smooth(method = lm, se = TRUE) +
  geom_segment(aes(xend = X3, yend = .fitted), color = "red", size = 0.3)

# The variance is represented by the shaded band.

# Conclusion: Assumption 4 is not satisfied, consistent with the

# scale-location plot above.

# ---------------------------------------------------------

# Final Conclusion: Not all 4 assumptions are satisfied, so the simple

# linear regression model is not appropriate for (X3, Y3) as it stands;

# the failures trace back to a single outlying observation.
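# Cook's distance makes the influence of that single outlier explicit;
# a quick sketch:

plot(model, 4) # Cook's distance plot
which.max(cooks.distance(model)) # index of the most influential point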

# ---------------------------------------------------------

# Instead of evaluating the response variable, we can also

# investigate the residuals (actual value - fitted value):

hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")

lines(density(model$residuals),col=2,lwd = 3)

# ---------------------------------------------------------

# Problem 14: Build the Simple Linear Regression Model for (X4, Y4).

# Verify the assumptions of this model.

# Step1: Load data set

Anscombe <- read.csv("Anscombe.csv")


# Step2: Build the Simple Linear Regression Model

model <- lm(Y4 ~ X4, data = Anscombe) # create the linear regression model

plot(Anscombe$X4,Anscombe$Y4) # scatter plot

abline(model,col = "red",lwd = 3) # plot the regression line

model # display regression coefficients

cor(Anscombe$X4,Anscombe$Y4) # get correlation coefficient

# correlation = 0.8165214 - again nearly the same value

# Findings for Assumption1:

# The scatter plot shows that all X4 values are identical except one;

# the apparent linear relationship is produced entirely by that single

# point, so linearity cannot genuinely be assessed.

# Step 3: Verify Assumption2: Independence of Error terms

plot(model, 1) # residuals vs fitted

plot(model$fitted.values, model$residuals) # the same plot, drawn manually

# Conclusion: With only two distinct fitted values, the residuals

# versus fits plot is uninformative, so independence of error terms

# cannot be meaningfully verified.

# Step4: Verify Assumption3: Normality of Error terms.


# For simple linear regression

# we can check the normality of the response variable

# Can use the hist() to check whether the dependent variable

# follows a normal distribution. Use also Test for normality and the

# QQ plot and the Normality Rule

hist(Anscombe$Y4,probability=T, main="Histogram",xlab="Raw data")

lines(density(Anscombe$Y4),col=2,lwd = 3)

# Findings: the distribution appears skewed to the right,

# with one extreme value

# We can confirm this by computing the skewness and kurtosis coefficients

library(datawizard)

skewness(Anscombe$Y4) # skew = 1.507 > 1 indicating strong skewness

# on the right side

kurtosis(Anscombe$Y4) # kurtosis = 3.151 > 0 indicating that the curve

# is higher than normal

# Findings: The distribution of the response variable is strongly right-skewed

# Conclusion: The distribution of the response variable does not look normal

# Perform Shapiro-Wilk test for Normality

shapiro.test(Anscombe$Y4)
# W = 0.87536, p-value = 0.09081

# The p-value of the test turns out to be 0.09081 .

# Since this value is greater than .05, we fail to reject normality,

# although the strong skewness suggests interpreting this with caution

# (the test has little power with only 11 observations).

# Perform the Empirical Normality rule: (This requires that the curve is symmetric)

# Percentage of area covered under normality rule: 68/95/99.7

# Area covered within 1 SD is 68%

# Area covered within 2 SD is 95%

# Area covered within 3 SD is 99.7%

# Compute the actual percentage

data1 <- Anscombe$Y4

within1sd <- round(sum(abs(data1 - mean(data1)) < 1*sd(data1))*(100/length(data1)),2)

within2sd <- round(sum(abs(data1 - mean(data1)) < 2*sd(data1))*(100/length(data1)),2)

within3sd <- round(sum(abs(data1 - mean(data1)) < 3*sd(data1))*(100/length(data1)),2)

percentage <- paste0(within1sd,'/',within2sd,'/',within3sd)

(normsfreq <- paste0("Actual Percentage: ", percentage))

# "Actual Percentage: 90.91/90.91/100" which satisfies the empirical rule

# Findings:

# Decision:

# Here is another way to establish normality

# create Quantile-Quantile plot for residuals ( the plot


# should fall along a straight line)

qqnorm(model$residuals)

qqline(model$residuals)

# Findings: the residuals deviate from the line at the upper tail

# Decision: normality of the error terms is doubtful

# Final Conclusion about normality of error terms:

# The evidence is mixed; the assumption is at best weakly supported.

# Step5: Verify Assumption4: Constant Variance of Y for each x

# The assumption can be checked by examining the scale-location

# plot which is also known as spread location-plot

plot(model,3)

# Findings: With all but one X4 value identical, the spread of Y

# cannot be assessed across x, so constant variance cannot be verified.

# Another view on how to verify constancy of variance

# Use augment function to automatically insert fitted values

# and residuals

library(broom)

augment_model <- augment(model)

library(ggplot2)
ggplot(augment_model, aes(X4, Y4)) +
  geom_point() +
  stat_smooth(method = lm, se = TRUE) +
  geom_segment(aes(xend = X4, yend = .fitted), color = "red", size = 0.3)

# The variance is represented by the shaded band.

# Conclusion: Assumption 4 is not satisfied

# ---------------------------------------------------------

# Final Conclusion: The assumptions are not satisfied, so the simple

# linear regression model is not appropriate for (X4, Y4); the entire

# fit hinges on a single high-leverage observation.
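# The root problem is visible in the predictor itself: every X4 value is
# identical except one, which gives that observation a leverage of 1.
# A quick check:

table(Anscombe$X4) # ten identical values and a single large one
hatvalues(model) # the lone distinct X4 observation has leverage 1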

# ---------------------------------------------------------

# Instead of evaluating the response variable, we can also

# investigate the residuals (actual value - fitted value):

hist(model$residuals,probability=T, main="Histogram",xlab="Raw data")

lines(density(model$residuals),col=2,lwd = 3)

# ---------------------------------------------------------

# Finally which pair is suitable for Simple Linear Regression Analysis?

# Short Answer: Only (X1, Y1). It is the only pair for which all four

# assumptions of the simple linear regression model are satisfied.
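# A compact wrap-up sketch: all four pairs produce nearly identical
# regression coefficients, which is exactly Anscombe's point; only the
# diagnostics above reveal that (X1, Y1) alone is suitable.

for (i in 1:4) {
  f <- as.formula(paste0("Y", i, " ~ X", i))
  print(coef(lm(f, data = Anscombe)))
}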

# ---------------------------------------------------------
