
Statistical Laboratory Manual for Regression Analysis using R-programming software

College of Natural Science

Department of Statistics
Laboratory manual

Prepared by:

Wosenie Gebireamanuel

Dessie, Ethiopia
March, 2013 E.C
Table of contents and list of figures
Table of contents and list of figures ............................................................................................................... i
List of figures ................................................................................................................................................ ii
Introduction ................................................................................................................................................... 1
Part –I ............................................................................................................................................................ 2
1. Introduction to R and its feature ........................................................................................................... 2
1.1. What is R....................................................................................................................................... 2
1.2. Features of R ................................................................................................................................. 2
1.3. Installing R .................................................................................................................................... 2
1.4. Opening R ...................................................................................................................................... 3
1.5. Data entry ...................................................................................................................................... 4
1.6. The Menu Bar ............................................................................................................................... 5
1.7. Basic data types in R ..................................................................................................................... 6
1.7.1. Vectors .................................................................................................................................. 7
1.7.2. Matrices................................................................................................................................. 8
1.7.3. Array ................................................................................................................................... 10
1.7.4. Lists ..................................................................................................................................... 10
1.7.5. Data frame ........................................................................................................................... 10
2. Part II statistical regression analysis ................................................................................................... 11
2.1. Regression analysis ..................................................................................................................... 11
2.1.1. Simple linear regression ...................................................................................................... 12
2.1.2. Multiple linear regressions .................................................................................................. 17
2.2. ANOVA Models ......................................................................................................................... 19
2.3. Generalized linear model ............................................................................................................ 22
2.3.1. Binary logistic regression.................................................................................................... 24
2.4. MODEL DIAGNOSTIC ............................................................................................................. 26
2.4.1. Scatter ................................................................................................................................. 26
2.4.2. Normality of Residuals ....................................................................................................... 30
2.4.3. Outliers checking ................................................................................................................ 31
2.4.4. Influential Observations ...................................................................................................... 32
2.4.5. Non-constant Error Variance/heteroscedasticity/................................................................ 33
2.4.6. Multi-collinearity ................................................................................................................ 34
2.4.7. Evaluate Nonlinearity.......................................................................................................... 35
Reference .................................................................................................................................................... 36

WU, Department of Statistics Page i


List of figures
Figure 1.1 Windows Graphical User Interface .............................................................................................. 4
Figure 1.2 Matrix operations ........................................................................................................................ 9
Figure 2.1 interaction plot .......................................................................................................................... 21



Introduction
These notes are a practical guide and some important introduction to using the statistical
software package R for Regression analysis of statistics course. They are meant to accompany
Applied Linear Regression book such as Weisberg Applied linear regression third edition and
others. The goals are to show basic analysis of regression models using R software and some
features of R.

Regression analysis answers questions about the dependence of a response variable on one or
more predictors, including prediction of future values of a response, discovering which
predictors are important, and estimating the impact of changing a predictor or a treatment on the
value of the response.

Linear statistical models for regression, analysis of variance, and experimental design are widely
used today in business administration, economics, engineering, and the social, health, and
biological sciences. Successful applications of these models require a sound understanding of
both the underlying theory and the practical problems that are encountered in using the models in
real-life situations. While this module, is basically a practical guide to perform regression
analysis using R software.

This module has two parts. Basically it illustrates on part one an introductory about R and in
part two, there are some note and R-software syntax of statistical regression analysis those
elaborate in detail way about practical Regression analysis.



Part –I

1. Introduction to R and its features

1.1. What is R?
R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development Core
Team. The core of R is an interpreted computer language which allows branching and looping as
well as modular programming using functions.

1.2. Features of R
As stated earlier, R is a programming language and software environment for statistical analysis,
graphics representation and reporting. The following are the important features of R:
 R is a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output facilities.
 R has an effective data handling and storage facility.
 R provides graphical facilities for data analysis and display, either directly at the computer or
for printing on paper.

1.3. Installing R
The R system for statistical computing consists of two major parts: the base system and a
collection of user contributed add-on packages. The R language is implemented in the base
system. Implementations of statistical and graphical procedures are separated from the base
system and are organized in the form of packages. A package is a collection of functions,
examples and documentation. The functionality of a package is often focused on a special
statistical methodology. Both the base system and packages are distributed via the
Comprehensive R Archive Network (CRAN) accessible under
http://CRAN.R-project.org

You can download the Windows installer version of R from R-(recent version) for Windows
(32/64 bit) and save it in a local directory.



It is a Windows installer (.exe) with a name such as "R-version-win.exe". You can just double-click
and run the installer, accepting the default settings. If your Windows is a 32-bit version, it installs
the 32-bit version. But if your Windows is 64-bit, then it installs both the 32-bit and 64-bit
versions.

1.4. Opening R
1. After installation you can locate the icon to run the Program under the Windows Program
Files. Clicking this icon brings up the R-GUI which is the R console to do R
Programming.
2. Double-click on an existing R shortcut.
3. Double-click on the .RData file in the folder.
Then, the R console (command line) window will be automatically displayed

R comes without any frills and on startup shows simply a short introductory message including
the version number and a prompt '>' sign. The prompt indicates that R is ready for another
command: a way of saying "Go ahead... do something".
 The instructions you give R are called commands.
 Commands are separated either by a semicolon (;) or by a new line.
 Comments can be put almost anywhere, starting with a hash mark (#); everything after #
is a comment.
 If you see a "+" in place of the prompt, it means that your last command was not
completed.
Example:
>4*
+
 Don't forget that R is case sensitive!



Figure 1.1 Windows Graphical User Interface

1.5. Data entry


Entering data with c
The most useful R command for quickly entering in small data sets is the c function. This
function combines, or concatenates terms together. As an example, suppose we have the
following count of the number of typos per page of these notes:
2 3 0 3 1 0 0 1
To enter this into an R session we do so with
>typos = c(2,3,0,3,1,0,0,1)
> typos
[1] 2 3 0 3 1 0 0 1
Notice a few things
 We assigned the values to a variable called typos
 The assignment operator is a =.
 The value of typos doesn't automatically print out; it does when we type just the name,
as the last input line indicates



 The value of typos is prefaced with a funny looking [1]. This indicates that the value is a
vector.
Applying a function
R comes with many built in functions that one can apply to data such as typos. One of them is
the mean function for finding the mean or average of the data. To use it is easy
> mean(typos)
[1] 1.25
As well, we could call median or var to find the median or sample variance. The syntax is
the same: the function name followed by parentheses to contain the argument(s):
> median(typos)
[1] 1
> var(typos)
[1] 1.642857

1.6.The Menu Bar

The menu bar in R is very similar to that in most Windows based/menu based programs (SPSS,
MINITAB…).
It contains six pull down menus, which are briefly described below. Much of the functionality
provided by the menus is redundant with those available using standard windows commands
(CTRL+C to copy, for example) and with commands you can enter at the command line.
Nevertheless, it is handy to have the menu system for quick access to functionality.
File
Similar to other statistical packages, in R the file menu contains options for opening, saving, and
printing R documents, as well as the option for exiting the program (which can also be done
using the close button in the upper right hand corner of the main program window). The options
that begin with "load" ("Load Workspace" and "Load History") are options to open previously
saved work. The next chapter discusses the different save options available in some detail as well
as what a workspace and a history are in terms of R files. The option to print is standard and will
print the information selected.

Edit
The edit menu contains the standard functionality of cut, copy, paste, and select all. In
addition there is an option to "Clear console" (Ctrl+L) which creates a blank workspace with



only a command prompt (although objects are still in working memory), which can essentially
clean a messy desk. The last option on the edit menu is “GUI preferences” which pops up the
Rgui configuration editor, allowing you to set options controlling the GUI, such as font size and
background colors.
Misc
The Misc menu contains some functionality not categorized elsewhere. The most notable feature
of this menu is the first option which can also be accessed with the ESC key on your keyboard.
This is your panic button should you have the misfortune of getting R stuck, such as
programming it in a loop which has no end or encountering some other
unforeseeable snag. Selecting this option (or pressing ESC) should get the situation under control and
return the console to a new command line. Always try this before doing something more drastic
as it will often work. The other functionality provided by Misc is listing (Misc->list objects or
the command is ls()) and removing all objects (Misc->remove all objects or
rm(list=ls(all=TRUE))).
Packages
The packages menu is very important, as it is the easiest way to load and install packages to the
R system. Therefore the entire section following this is devoted to demonstrating how to use this
menu.
Windows
The Windows menu provides options for cascading and tiling windows. If there is more than
one window open (for example, the console and a help window) you can use the open Windows
list on the bottom of this menu to access the different open windows.
Help
The Help menu directs you to various sources of help and warrants some exploration.
The first option, called "Console", pops up a dialog box with a cheat-sheet of shortcut
keystrokes for scrolling and editing in the main console window.

1.7.Basic data types in R


The entities that R creates and manipulates are known as objects. These may be variables,
arrays of numbers, character strings, functions, or more general structures built from such
components. R saves any object you create.
 In short, the objects include vector, factor, array, matrix, data.frame, ts, list
To list the objects you have created in a session use either of the following commands:
> objects()
> ls()
To remove all the objects in R type:

> rm(list=ls(all=T))
To remove a specified number of objects use:

>rm(x,y) #only object x and y will be removed

1.7.1. Vectors
Vectors are the simplest type of object in R. They can easily be created with c, the combine
function.
There are 3 main types of vectors:
a) Numeric vectors
b) Character vectors
c) Logical vectors
A. Numeric Vector: is a single entity consisting of an ordered collection of numbers.
Example: To set up a numeric vector x consisting of 5 numbers, 10, 6, 3, 6, 22, we use any one of
the following commands:
>x<-c(10,6,3,6,22)#OR
>x=c(10,6,3,6,22)#OR
>assign("x",c(10,6,3,6,22))#OR
>c(10,6,3,6,22)->x
Functions that return a single value
>length(x)#the number of elements in x
>sum(x)#the sum of the values of x
>mean(x)#the mean of the values of x
>var(x)#the variance of the values of x
>sd(x)#the standard deviation of the values of x
>min(x)#the minimum value from the values of x
>max(x)#the maximum value from the values of x
>prod(x)#the product of the values of x
>range(x)#the range of the values of x(smallest and largest)



x=c(1,2,3,4) # Make vector x=(1,2,3,4)
y=c(3,4,7,8) # Make vector y=(3,4,7,8)
b=x*3 # Make a vector b by multiplying all elements of x by 3
k=x^3 # Make a vector k by raising all elements of x to the power 3
w=cbind(x,y) # Make matrix w with x and y as columns
o=rbind(x,y) # Make matrix o with x and y as rows
t=x+y # Make vector t=x+y (element-wise sum)
a=1:20 # Make vector a=(1,2,3,4,...,20)
z=seq(1,10,2) # Make vector z=(1,3,5,7,9)
I=matrix(c(1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1),nrow=4,ncol=4,byrow=TRUE) # Make the
# 4x4 identity matrix I

B. Character vectors
To setup a character/string vector z consisting of 3 place names use:
>z<-c("Canberra","Sydney","Newcastle")#OR
>z<-c('Canberra','Sydney','Newcastle')
Character strings are entered using either matching double ("") or single ('') quotes, but are
printed using double quotes (or sometimes without quotes).
C. Logical Vectors
A logical vector is a vector whose elements are TRUE, FALSE or NA.

Note: TRUE and FALSE are often abbreviated as T and F respectively; however, T and F are just
variables which are set to TRUE and FALSE by default, but are not reserved words and hence
can be overwritten by the user.
The comparison operators are <, <=, >, >=, == for exact equality and != for inequality.
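Comparison operators applied to a numeric vector return a logical vector, which can then be used for counting and subsetting. A small sketch reusing the vector x from above:

```r
x <- c(10, 6, 3, 6, 22)
big <- x > 5    # element-wise comparison returns a logical vector
big             # TRUE TRUE FALSE TRUE TRUE
sum(big)        # TRUE counts as 1 and FALSE as 0, so this counts elements > 5
x[big]          # a logical vector can index another vector: keeps 10, 6, 6, 22
```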

1.7.2. Matrices
A matrix is a rectangular table of data of the same type. As with vectors, all the elements of a matrix must
be of the same data type.
 We can Use the function matrix
X=matrix(c(1:8),2,4,byrow=F)
 An equivalent expression:
> x<-matrix(c(1:8),nrow=2,ncol=4)
Use the function cbind to create a matrix by binding two or more vectors as column vectors.
The function rbind is used to create a matrix by binding two or more vectors as row vectors.
Example:
> cbind(c(1,2,3),c(4,5,6))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> rbind(c(1,2,3),c(4,5,6))
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Matrix operations
R has a number of matrix-specific operations, for example:
Name Operation
dim() Dimension of the matrix (number of rows and columns)
as.matrix() Used to coerce an argument into a matrix object
%*% Matrix multiplication
t() Matrix transpose
det() Determinant of a square matrix
solve() Matrix inverse; also solves a system of linear equations
eigen() Computes eigenvalues and eigenvectors

Figure 1.2 Matrix operations
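A short sketch of these operations on a small, illustrative 2 by 2 matrix:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)  # filled column by column
dim(A)       # 2 2
t(A)         # transpose (A happens to be symmetric here, so t(A) equals A)
det(A)       # 2*3 - 1*1 = 5
Ainv <- solve(A)   # matrix inverse
A %*% Ainv         # matrix product; recovers the 2x2 identity matrix
```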



1.7.3. Array
Array can be considered as a multiply subscripted collection of data entries, for example
numeric. Arrays are generalizations of vectors and matrices.
Z<-array(data_vector, dim_vector)
For example, if the vector h contains 24 or fewer numbers, then the command
> Z<-array(h, dim=c(3,4,2))
would use h to set up a 3 by 4 by 2 array in Z. If the size of h is exactly 24, the result is the same as
> Z<-h ; dim(Z) <-c(3,4,2)
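For instance, filling the 3 by 4 by 2 array with the numbers 1 to 24 and indexing it (a sketch; arrays are filled column by column within each layer):

```r
h <- 1:24
Z <- array(h, dim = c(3, 4, 2))  # 3 rows, 4 columns, 2 layers
dim(Z)        # 3 4 2
Z[1, 2, 1]    # row 1, column 2 of the first layer: 4
Z[3, 4, 2]    # the very last element: 24
```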

1.7.4. Lists
Are collections of arbitrary objects. That is, the elements of a list can be objects of any type and
structure. Consequently, a list can contain another list and therefore it can be used to construct
arbitrary data structures. A list could consist of a numeric vector, a logical value, a matrix, a
complex vector, a character array, a function, and so on.
Lists are created with the list() command:
L<-list(object-1,object-2,…,object-m)
Example:
>L<-list(c(1,5,3),matrix(1:6,nrow=3),c("Hello","world"))
>L
[[1]]
[1] 1 5 3
[[2]]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
[[3]]
[1] "Hello" "world"

1.7.5. Data frame


Data frames can be regarded as an extension of matrices.
Data frames can have columns of different data types and are the most convenient data structure
for data analysis in R. Data frames are lists with the constraint that all elements are vectors of the
same length. The command data.frame() creates a data frame:
dat<-data.frame(object-1,object-2,…,object-m)
Example:
> name= c("Eden","Solomon","Zelalem","Kidist")
> age=c(18,22,25,27)
> sex=c("F","M","M","F")
> stud=data.frame(name,age,sex)
> stud
name age sex
1 Eden 18 F
2 Solomon 22 M
3 Zelalem 25 M
4 Kidist 27 F
 To display the column names:
> names(stud) #OR colnames(stud)
[1] "name" "age" "sex"
 To display the row names:
> rownames(stud)
[1] "1" "2" "3" "4"
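Columns of a data frame can be extracted with $, and rows can be selected with logical conditions. A sketch using the stud data frame just created:

```r
# Rebuild the stud data frame from the example above
name <- c("Eden", "Solomon", "Zelalem", "Kidist")
age <- c(18, 22, 25, 27)
sex <- c("F", "M", "M", "F")
stud <- data.frame(name, age, sex)

stud$age                     # a single column, as a vector
mean(stud$age)               # average age: 23
stud[stud$sex == "F", ]      # only the rows where sex is "F"
stud[stud$age > 20, "name"]  # names of the students older than 20
```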
After we have seen some important features of R, we can proceed to the objective. That means
REGRESSION!
2. Part II: Statistical regression analysis

2.1. Regression analysis

Regression analysis is a statistical tool for the investigation of relationships between variables.
The investigator typically wants to ascertain the causal effect of one variable on another, and
also assesses the "statistical significance" of the estimated relationship (the degree of
confidence that the true relationship is close to the estimated relationship).

Regression may be simple or multiple, linear or non-linear. The template for a statistical model is
a linear regression model with independent, homoscedastic errors.

The expected model is

Y = βo + β1X + ε



Where βo is the theoretical y-intercept and β1 is the theoretical slope. The goal of a linear regression
is to find the best estimates for βo and β1 by minimizing the residual error between the observed and
predicted Y. These parameters are usually called regression coefficients. The unobservable
error component ε accounts for the failure of the data to lie on the straight line and represents the
difference between the true and observed realization of Y. There can be several reasons for such a
difference, e.g., the effect of variables omitted from the model, variables that may be qualitative,
inherent randomness in the observations, etc. We assume that ε is an independent and
identically distributed random variable with mean zero and constant variance σ². Later, we will
additionally assume that ε is normally distributed.
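These assumptions can be illustrated by simulating data from a known straight line and checking that lm() recovers estimates close to the true coefficients (a sketch with made-up values, not one of the manual's data sets):

```r
set.seed(123)                       # make the simulation reproducible
x <- 1:50
eps <- rnorm(50, mean = 0, sd = 1)  # iid errors with mean zero, constant variance
y <- 2 + 0.5 * x + eps              # true beta0 = 2, true beta1 = 0.5
fit <- lm(y ~ x)
coef(fit)   # estimates should be close to 2 and 0.5
```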

Syntax | Model | Comments
Y ~ A | Y = βo + β1A | Straight line with an implicit y-intercept.
Y ~ -1 + A | Y = β1A | Straight line with no y-intercept; that is, a fit forced through (0,0).
Y ~ A + I(A^2) | Y = βo + β1A + β2A^2 | Polynomial model; the identity function I() allows terms in the model to include ordinary mathematical symbols.
Y ~ A + B | Y = βo + β1A + β2B | A first-order model in A and B without interaction terms.
Y ~ A:B | Y = βo + β1AB | A model containing only the first-order interaction between A and B.
Y ~ A*B | Y = βo + β1A + β2B + β3AB | A full first-order model with an interaction term; an equivalent code is Y ~ A + B + A:B.
Y ~ (A + B + C)^2 | Y = βo + β1A + β2B + β3C + β4AB + β5AC + β6BC | A model including all first-order effects and interactions up to the order given by the exponent in ( )^n; an equivalent code in this case is Y ~ A*B*C - A:B:C.
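The claimed equivalences can be checked directly: model.matrix() shows the design matrix a formula generates, so Y ~ A*B and Y ~ A + B + A:B should produce identical designs (a sketch with an illustrative data frame):

```r
d <- data.frame(A = c(1, 2, 3, 4), B = c(2, 1, 4, 3))  # illustrative data
m1 <- model.matrix(~ A * B, data = d)
m2 <- model.matrix(~ A + B + A:B, data = d)
colnames(m1)                            # "(Intercept)" "A" "B" "A:B"
identical(colnames(m1), colnames(m2))   # TRUE: the two formulas match
```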

2.1.1. Simple linear regression

We consider the modeling between the dependent and one independent variable. When there is
only one independent variable in the linear regression model, the model is generally termed a
simple linear regression model. When there is more than one independent variable in the
model, the linear model is termed a multiple linear regression model.
Response ~ expression

 y~x
 y~1+x
Both imply the same simple linear regression model of y on x. The first has an implicit intercept
term and the second an explicit one.
 y~0+x



 y ~ -1 + x
 y~x-1
Simple linear regression of y on x through the origin (that is, without an intercept term). The
operator ~ is used to define a model formula in R.
The basic syntax for a regression analysis in R is
lm(Y ~ model)
Where Y is the object containing the dependent variable to be predicted and model is the formula for
the chosen mathematical model. The command lm() provides the model's coefficients but no further
statistical information.

Also: fitted.model <- lm(formula, data = data.frame)

 Using the R sample data set "iris", we can run a simple linear regression analysis of Sepal.Length on
Sepal.Width

>linmodel=lm(Sepal.Length~Sepal.Width,data=iris)
> linmodel

Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)

Coefficients:
(Intercept) Sepal.Width
6.5262 -0.2234
> summary(linmodel)
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = iris)

Residuals:
Min 1Q Median 3Q Max
-1.5561 -0.6333 -0.1120 0.5579 2.2226

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.5262 0.4789 13.63 <2e-16 ***



Sepal.Width -0.2234 0.1551 -1.44 0.152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8251 on 148 degrees of freedom


Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
Some useful extractor functions for pulling output from a fitted model:

 coef(object)
Extract the regression coefficient (matrix).
Long form: coefficients(object).
 deviance(object)
Residual sum of squares, weighted if appropriate.
 formula(object)
Extract the model formula.
 plot(object)
Produce four plots, showing residuals, fitted values and some diagnostics.
 predict(object, newdata=data.frame)
The data frame supplied must have variables specified with the same labels as the original. The
value is a vector or matrix of predicted values corresponding to the determining variable values
in data.frame.
 print(object)
Print a concise version of the object. Often used implicitly.
 residuals(object)
Extract the (matrix of) residuals, weighted as appropriate.

Short form: resid(object).


 step(object)
Select a suitable model by adding or dropping terms and preserving hierarchies.

The model with the smallest value of AIC (Akaike's An Information Criterion) discovered in the
stepwise search is returned.
 summary(object)
Print a comprehensive summary of the results of the regression analysis.



 vcov(object)
Returns the variance-covariance matrix of the main parameters of a fitted model object.
 confint(model_Name, level=0.95)
Confidence interval formation for the model parameters
 fitted(model_Name)
Used to display predicted values depending on the fitted model
 influence(model_Name)
Used for regression diagnostics
 anova(model_Name)
It displays the analysis of variance table
 cor(Data.frame)
A quick way to look for relationships between variables in a data frame
 pairs(Data.frame)
To visualize these relationships
 weights(model)
The weights used in fitting, when weights were supplied
 rank(model)
The numeric rank of the fitted linear model
 getCall(model)
The matched call
 terms(model)
The terms object used
 model$contrasts
The contrasts used (only when relevant)
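A few of these extractors applied to the iris model fitted earlier (the new observation used with predict() is hypothetical):

```r
linmodel <- lm(Sepal.Length ~ Sepal.Width, data = iris)
coef(linmodel)                    # intercept and slope
confint(linmodel, level = 0.95)   # 95% confidence intervals for both parameters
deviance(linmodel)                # residual sum of squares
newflower <- data.frame(Sepal.Width = 3.0)  # hypothetical new flower
predict(linmodel, newdata = newflower)      # predicted Sepal.Length, about 5.86
```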
First explore the data structure as follows (remember that R is case sensitive: str, not Str):
> str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...



coef(linmodel)

(Intercept) Sepal.Width

6.5262226 -0.2233611

> resid(linmodel)

1 2 3 4 5 6 7 8
-0.64445884 -0.95613937 -1.11146716 -1.23380326 -0.72212273 -0.25511441 -1.16679494 -0.76679494
9 10 11 12 13 14 15 16
-1.47847547 -0.93380326 -0.29978662 -0.96679494 -1.05613937 -1.55613937 0.16722169 0.15656612
17 18 19 20 21 22 23 24
-0.25511441 -0.64445884 0.02254948 -0.57745052 -0.36679494 -0.59978662 -1.12212273 -0.68913105
25 26 27 28 29 30 31 32
-0.96679494 -0.85613937 -0.76679494 -0.54445884 -0.56679494 -1.11146716 -1.03380326 -0.36679494
33 34 35 36 37 38 39 40
-0.41044220 -0.08810609 -0.93380326 -0.81146716 -0.24445884 -0.82212273 -1.45613937 -0.66679494
41 42 43 44 45 46 47 48
-0.74445884 -1.51249211 -1.41146716 -0.74445884 -0.57745052 -1.05613937 -0.57745052 -1.21146716
49 50 51 52 53 54 55 56
-0.39978662 -0.78913105 1.18853284 0.58853284 1.06619674 -0.51249211 0.59918842 -0.20081158
57 58 59 60 61 62 63 64
0.51086895 -1.09015600 0.72152453 -0.72314769 -1.07950043 0.04386063 -0.03482822 0.22152453
65 66 67 68 69 70 71 72
-0.27847547 0.86619674 -0.25613937 -0.12314769 0.16517178 -0.36781990 0.08853284 0.19918842
73 74 75 76 77 78 79 80
0.33218010 0.19918842 0.52152453 0.74386063 0.89918842 0.84386063 0.12152453 -0.24548379
81 82 83 84 85 86 87 88
-0.49015600 -0.49015600 -0.12314769 0.07685231 -0.45613937 0.23320506 0.86619674 0.28750789
89 90 91 92 93 94 95 96
-0.25613937 -0.46781990 -0.44548379 0.24386063 -0.14548379 -1.01249211 -0.32314769 -0.15613937
97 98 99 100 101 102 103 104
-0.17847547 0.32152453 -0.86781990 -0.20081158 0.51086895 -0.12314769 1.24386063 0.42152453
105 106 107 108 109 110 111 112
0.64386063 1.74386063 -1.06781990 1.42152453 0.73218010 1.47787727 0.68853284 0.47685231
113 114 115 116 117 118 119 120
0.94386063 -0.26781990 -0.10081158 0.58853284 0.64386063 2.02254948 1.75451621 -0.03482822
121 122 123 124 125 126 127 128
1.08853284 -0.30081158 1.79918842 0.37685231 0.91086895 1.38853284 0.29918842 0.24386063
129 130 131 132 133 134 135 136
0.49918842 1.34386063 1.49918842 2.22254948 0.49918842 0.39918842 0.15451621 1.84386063
137 138 139 140 141 142 143 144
0.53320506 0.56619674 0.14386063 1.06619674 0.86619674 1.06619674 -0.12314769 0.98853284
145 146 147 148 149 150
0.91086895 0.84386063 0.33218010 0.64386063 0.43320506 0.04386063

As the data set has 150 observations, there are likewise 150 residuals for Sepal.Length.

fitted(linmodel)
1 2 3 4 5 6 7 8 9 10 11
5.744459 5.856139 5.811467 5.833803 5.722123 5.655114 5.766795 5.766795 5.878475 5.833803 5.699787
12 13 14 15 16 17 18 19 20 21 22
5.766795 5.856139 5.856139 5.632778 5.543434 5.655114 5.744459 5.677451 5.677451 5.766795 5.699787
23 24 25 26 27 28 29 30 31 32 33
5.722123 5.789131 5.766795 5.856139 5.766795 5.744459 5.766795 5.811467 5.833803 5.766795 5.610442
34 35 36 37 38 39 40 41 42 43 44
5.588106 5.833803 5.811467 5.744459 5.722123 5.856139 5.766795 5.744459 6.012492 5.811467 5.744459
45 46 47 48 49 50 51 52 53 54 55
5.677451 5.856139 5.677451 5.811467 5.699787 5.789131 5.811467 5.811467 5.833803 6.012492 5.900812
56 57 58 59 60 61 62 63 64 65 66
5.900812 5.789131 5.990156 5.878475 5.923148 6.079500 5.856139 6.034828 5.878475 5.878475 5.833803
67 68 69 70 71 72 73 74 75 76 77
5.856139 5.923148 6.034828 5.967820 5.811467 5.900812 5.967820 5.900812 5.878475 5.856139 5.900812
78 79 80 81 82 83 84 85 86 87 88
5.856139 5.878475 5.945484 5.990156 5.990156 5.923148 5.923148 5.856139 5.766795 5.833803 6.012492
89 90 91 92 93 94 95 96 97 98 99
5.856139 5.967820 5.945484 5.856139 5.945484 6.012492 5.923148 5.856139 5.878475 5.878475 5.967820
100 101 102 103 104 105 106 107 108 109 110
5.900812 5.789131 5.923148 5.856139 5.878475 5.856139 5.856139 5.967820 5.878475 5.967820 5.722123
111 112 113 114 115 116 117 118 119 120 121
5.811467 5.923148 5.856139 5.967820 5.900812 5.811467 5.856139 5.677451 5.945484 6.034828 5.811467
122 123 124 125 126 127 128 129 130 131 132
5.900812 5.900812 5.923148 5.789131 5.811467 5.900812 5.856139 5.900812 5.856139 5.900812 5.677451
133 134 135 136 137 138 139 140 141 142 143
5.900812 5.900812 5.945484 5.856139 5.766795 5.833803 5.856139 5.833803 5.833803 5.833803 5.923148
144 145 146 147 148 149 150
5.811467 5.789131 5.856139 5.967820 5.856139 5.766795 5.856139

These are the fitted (predicted) values of Y (Sepal.Length), one per observation.
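The fitted and residual components always reconstruct the observed response. A minimal sketch of this identity, assuming the model fitted earlier was the simple regression of Sepal.Length on Sepal.Width (which is consistent with the fitted values shown above):

```r
# Identity check: observed y = fitted(model) + residuals(model)
linmodel <- lm(Sepal.Length ~ Sepal.Width, data = iris)  # assumed re-fit of the earlier model
reconstructed <- fitted(linmodel) + residuals(linmodel)
all.equal(as.numeric(reconstructed), iris$Sepal.Length)  # TRUE
```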

2.1.2. Multiple linear regressions

Multiple linear regression examines the linear relationship between one dependent variable (Y) and two or more independent variables (Xi); that is, it uses more than one explanatory variable to explain the variability in the response variable.

Multiple Regression Model with k Independent Variables


Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

WU, Department of Statistics Page 17


where εi is the random error term.

The coefficients of the multiple regression model are estimated from sample data with k independent variables:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

where b0 is the estimated intercept and b1, b2, …, bk are the estimated slopes.

Interpretation of the slopes (referred to as net regression coefficients):

– b1 = the change in the mean of Y per unit change in X1, taking into account the effect of the other predictors (i.e., net of X2, …, Xk).

– b0 = the Y intercept, interpreted as in simple regression.

To fit a linear model, we can use the lm function.


> model1<-lm(len~supp+dose+supp*dose,data=ToothGrowth)
> model1
Call:
lm(formula = len ~ supp + dose + supp * dose, data = ToothGrowth)
Coefficients:
(Intercept) suppVC dose suppVC:dose
11.550 -8.255 7.811 3.904
A linear model with an interaction is fitted and the results are displayed as shown here. Note that supp*dose already expands to supp + dose + supp:dose, so the formula could be written more compactly as len ~ supp * dose.
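Once a model object is stored, its estimated pieces can be pulled out with standard accessor functions. A short sketch using the ToothGrowth model just fitted (the newdata values passed to predict() are hypothetical, chosen only for illustration):

```r
# Extracting the estimated components of a fitted lm object
model1 <- lm(len ~ supp * dose, data = ToothGrowth)  # same model as above
coef(model1)      # named vector of estimates: intercept, suppVC, dose, suppVC:dose
confint(model1)   # 95% confidence intervals for each coefficient
# Predicted mean tooth length at a hypothetical new setting:
predict(model1, newdata = data.frame(supp = "OJ", dose = 1.0))
```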
As a second example, fit a multiple regression of the savings rate on four predictors (the savings data come from the faraway package):

M1 <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=savings)

 ModelSummary <- summary(M1) # summary of estimated model

ModelSummary



Call:

lm(formula = sr ~ pop15 + pop75 + dpi + ddpi, data = savings)

Residuals:

Min 1Q Median 3Q Max

-8.2422 -2.6857 -0.2488 2.4280 9.7509

Coefficients:

Estimate Std. Error t value Pr(>|t|)


(Intercept) 28.5660865 7.3545161 3.884 0.000334 ***

pop15 -0.4611931 0.1446422 -3.189 0.002603 **

pop75 -1.6914977 1.0835989 -1.561 0.125530

dpi -0.0003369 0.0009311 -0.362 0.719173

ddpi 0.4096949 0.1961971 2.088 0.042471 *

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.803 on 45 degrees of freedom


Multiple R-squared: 0.3385, Adjusted R-squared: 0.2797

F-statistic: 5.756 on 4 and 45 DF, p-value: 0.0007904

2.2. ANOVA Models
One-Way ANOVA: Analysis of variance is used to test the hypothesis that several means are equal.
One-way ANOVA model: Yij = μ + αi + εij, i = 1, 2, …, I and j = 1, 2, …, Ji
The aov() function calculates an analysis of variance table, which can be used to evaluate the significance of the terms in a single model or to compare two nested models. A streamlined version of the call for fitting ordinary ANOVA models is as follows:



ANOVA.model <- aov(formula, data = dataname)
Multiple Comparisons for One-way Independent Groups ANOVA: After detecting some
difference in the levels of the factor, interest centers on which levels or combinations of levels
are different.
 Tukey's Honestly Significant Difference (HSD)
 TukeyHSD(a) # for a fitted aov object a; gives Tukey multiple comparisons of means with a 95% family-wise confidence level
To demonstrate ANOVA in R, let's start with a simple data set from the base datasets package called InsectSprays. The data set has two variables (count and spray type) and shows the effectiveness of six different insecticides.
> mod.1 <- aov(count ~ spray, data = InsectSprays)
> mod.1
Call:
aov(formula = count ~ spray, data = InsectSprays)
Terms:
spray Residuals
Sum of Squares 2668.833 1015.167
Deg. of Freedom 5 66

Residual standard error: 3.921902


Estimated effects may be unbalanced
To get more detailed output, we need to use the summary function.
> summary(mod.1)
Df Sum Sq Mean Sq F value Pr(>F)
spray 5 2669 533.8 34.7 <2e-16 ***
Residuals 66 1015 15.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>TukeyHSD(mod.1)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = count ~ spray, data = InsectSprays)
$spray
diff lwr upr p adj
B-A 0.8333333 -3.866075 5.532742 0.9951810
C-A -12.4166667 -17.116075 -7.717258 0.0000000
D-A -9.5833333 -14.282742 -4.883925 0.0000014
E-A -11.0000000 -15.699409 -6.300591 0.0000000
F-A 2.1666667 -2.532742 6.866075 0.7542147
C-B -13.2500000 -17.949409 -8.550591 0.0000000
D-B -10.4166667 -15.116075 -5.717258 0.0000002
E-B -11.8333333 -16.532742 -7.133925 0.0000000
F-B 1.3333333 -3.366075 6.032742 0.9603075
D-C 2.8333333 -1.866075 7.532742 0.4920707
E-C 1.4166667 -3.282742 6.116075 0.9488669
F-C 14.5833333 9.883925 19.282742 0.0000000
E-D -1.4166667 -6.116075 3.282742 0.9488669
F-D 11.7500000 7.050591 16.449409 0.0000000
F-E 13.1666667 8.467258 17.866075 0.0000000

Interaction plots can be produced as follows:

 interaction.plot(var1, var2, var3)

Example with the ToothGrowth data:
 interaction.plot(ToothGrowth$dose, ToothGrowth$supp, ToothGrowth$len)

Figure 2.1 interaction plot



2.3. Generalized linear models

Generalized linear models in R are an extension of linear regression models that allow the dependent variable to be far from normal. A general linear model makes three assumptions:

 Residuals are independent of each other.


 Residuals are distributed normally.
 Model parameters and y share a linear relationship.
A generalized linear model relaxes the last two assumptions: it generalizes the possible distributions of the residuals to a family of distributions known as the exponential family, for example the normal, Poisson, and binomial distributions.


In R, generalized linear models are fitted with the glm() function, whose usage is similar to that of lm(), which we used earlier for linear regression. The main difference is an extra argument, family, which specifies the error distribution and the link function to be used in the model. Generalized linear models (GLMs) are a very flexible class of statistical models.
 There are eight error distributions (families) available in glm, including binomial and poisson, each with a default link function.
 Arguments and default argument values can be found in the help file for glm:
 glm(formula, family = gaussian, data, weights, subset,
na.action, start = NULL, etastart, mustart, offset,
control = glm.control(...), model = TRUE, method = "glm.fit",
x = FALSE, y = TRUE, contrasts = NULL, ...)

The glm function is very flexible.


To demonstrate a very different application, let's return to the data on insect counts in response to insecticide spraying.
This data set was analyzed above using an ANOVA, but recall that it required a transformation
of the response.
In this case, we want to carry out an ANOVA, but the GLM lets us use an appropriate distribution for count data: the Poisson distribution. Note that the default link function for the Poisson distribution is the log.



> modelglm <- glm(count ~ spray, family = poisson, data = InsectSprays)
> summary(modelglm)

Call:

glm(formula = count ~ spray, family = poisson, data = InsectSprays)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.3852 -0.8876 -0.1482 0.6063 2.6922

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.67415 0.07581 35.274 < 2e-16 ***
sprayB 0.05588 0.10574 0.528 0.597
sprayC -1.94018 0.21389 -9.071 < 2e-16 ***
sprayD -1.08152 0.15065 -7.179 7.03e-13 ***
sprayE -1.42139 0.17192 -8.268 < 2e-16 ***
sprayF 0.13926 0.10367 1.343 0.179
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)


Null deviance: 409.041 on 71 degrees of freedom
Residual deviance: 98.329 on 66 degrees of freedom
AIC: 376.59
Number of Fisher Scoring iterations: 5
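Because the model uses a log link, the coefficients above are on the log scale. Exponentiating them turns the intercept into the baseline mean count (spray A) and the other terms into multiplicative rate ratios, which is usually how they are read. A sketch:

```r
# Interpreting Poisson GLM coefficients on the response scale
modelglm <- glm(count ~ spray, family = poisson, data = InsectSprays)
exp(coef(modelglm))
# exp(Intercept) ~ 14.5 is the mean count under the baseline spray A;
# exp(sprayC) ~ 0.14 means spray C yields about 14% of spray A's mean count.
```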



2.3.1. Binary logistic regression
Logistic regression is an example of a generalized linear model in which the response variable is categorical. Binary logistic regression is similar to a linear regression model but is suited to models where the dependent variable is dichotomous, such as the presence or absence of a characteristic of interest, often coded as 1 or 0. For instance, the R code/script to fit a binary logistic regression model with three predictors is:
logr <- glm(Y ~ X1 + X2 + X3, family = binomial("logit")) or
logr <- glm(Y ~ X1 + X2 + X3, family = binomial),
where Y is the binary response variable and Xi, i = 1, 2, 3, are the explanatory
variables considered for illustration.
Example 1: Consider the following data, collected to model the impact of age, gender, and educational level on the state of mind, classified as stressed or relaxed, for a random sample of 25 households from Arbaminch town. The main interest was to identify the determinants of being stressed. The data set consists of the dependent variable, status of mind (1 = stressed, 0 = not stressed), and the independent variables: age of the household head (in years), gender (0 = male, 1 = female), and educational level (0 = elementary or less, 1 = high school, 2 = above high school).

Id Status.mind(Y) Age(X1) Gender(X2) Edu.level(X3)


1 1 23 1 1
2 1 32 0 2
3 0 57 0 2
4 0 23 1 0
5 0 34 1 0
6 0 28 1 0
7 1 24 0 2
8 1 35 1 2
9 1 39 0 2
10 0 32 0 0
11 1 43 0 1
12 0 30 1 1
13 1 25 1 1
14 0 36 1 0
15 1 33 1 0
16 1 41 0 0
17 0 24 0 1
18 0 27 1 0
19 0 28 1 0
20 1 40 0 2
21 1 23 1 1
22 1 35 1 1
23 1 50 0 1
24 0 35 0 1
25 0 26 0 0



Analyze the data using an appropriate statistical model in order to achieve the objective stated above. The R code/script below enters the data and fits a binary logistic regression model, since the response is categorical with binary outcomes.
Let's assign the variables:
Status.mind=c(1,1,0,0,0,0,1,1,1,0,1,0,1,0,1,1,0,0,0,1,1,1,1,0,0)

Age=c(23,32,57,23,34,28,24,35,39,32,43,30,25,36,33,41,24,27,28,40,23,35,50,35,26)

Gender=c(1,0,0,1,1,1,0,1,0,0,0,1,1,1,1,0,0,1,1,0,1,1,0,0,0)

Edu.level=c(1,2,2,0,0,0,2,2,2,0,1,1,1,0,0,0,1,0,0,2,1,1,1,1,0)

dataBLRM=data.frame(Status.mind,Age,Gender,Edu.level)

fittedBLRM=glm(Status.mind~Age+as.factor(Gender)+as.factor(Edu.level),
family="binomial",data=dataBLRM)

summary(fittedBLRM)

Call:
glm(formula = Status.mind ~ Age + as.factor(Gender) + as.factor(Edu.level),
family = "binomial", data = dataBLRM)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9761 -0.7033 0.4872 0.8396 1.9189
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.24021 2.61303 -0.857 0.3913
Age 0.01395 0.06441 0.217 0.8286
as.factor(Gender)1 0.57867 1.17692 0.492 0.6229
as.factor(Edu.level)1 2.17438 1.09859 1.979 0.0478 *
as.factor(Edu.level)2 3.24459 1.52683 2.125 0.0336 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.617 on 24 degrees of freedom
Residual deviance: 26.622 on 20 degrees of freedom
AIC: 36.622
Number of Fisher Scoring iterations: 4
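The estimated coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret. A sketch, re-fitting the same model so the block is self-contained:

```r
# Odds ratios from the fitted binary logistic regression (data as entered above)
Status.mind <- c(1,1,0,0,0,0,1,1,1,0,1,0,1,0,1,1,0,0,0,1,1,1,1,0,0)
Age        <- c(23,32,57,23,34,28,24,35,39,32,43,30,25,36,33,41,24,27,28,40,23,35,50,35,26)
Gender     <- c(1,0,0,1,1,1,0,1,0,0,0,1,1,1,1,0,0,1,1,0,1,1,0,0,0)
Edu.level  <- c(1,2,2,0,0,0,2,2,2,0,1,1,1,0,0,0,1,0,0,2,1,1,1,1,0)
fittedBLRM <- glm(Status.mind ~ Age + as.factor(Gender) + as.factor(Edu.level),
                  family = binomial)
exp(coef(fittedBLRM))
# e.g. exp(2.17438) ~ 8.8: high-school education multiplies the odds of being
# stressed by about 8.8 relative to elementary or less, other variables held fixed.
```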
2.4. Model Diagnostics
As with all statistical procedures, linear regression analysis rests on basic assumptions about the population from which the data have been drawn, and the results of the analysis are reliable only when these assumptions are satisfied. The possible influence of outliers, and the checking of the assumptions made in fitting the linear regression model (constant variance and normality of the error terms), can both be undertaken using a variety of diagnostic tools. The simplest and best known of these are the estimated residuals, i.e., the differences between the observed and fitted values of the response; in essence these residuals estimate the error terms in the simple and multiple linear regression models.
So, after estimation, the next stage of the analysis should be an examination of the residuals from the chosen model, to check the normality and constant-variance assumptions and to identify outliers. To check whether the model assumptions are satisfied, the fitted values and residuals of the model can be extracted and saved using fitted(model) and residuals(model), respectively. The most useful plots of these residuals are:
• A plot of residuals against each explanatory variable in the model. The presence of a non-linear relationship, for example, may suggest that a higher-order term in the explanatory variable should be considered.
• A plot of residuals against fitted values. If the variance of the residuals appears to increase
with predicted value, a transformation of the response variable may be in order.
• A normal probability plot of the residuals. After all the systematic variation has been
removed from the data, the residuals should look like a sample from a standard normal
distribution. A plot of the ordered residuals against the expected order statistics from a
normal distribution provides a graphical check of this assumption.
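A sketch of the three residual plots just described, using the savings model fitted earlier. (The same data ship with base R as LifeCycleSavings, with the same variables, so no extra package is assumed here.)

```r
# The three standard residual checks for the savings regression
M1  <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
res <- residuals(M1)
fv  <- fitted(M1)
par(mfrow = c(1, 3))
plot(LifeCycleSavings$pop15, res,
     xlab = "pop15", ylab = "Residuals")        # residuals vs. an explanatory variable
plot(fv, res, xlab = "Fitted values",
     ylab = "Residuals"); abline(h = 0)         # funnel shape => non-constant variance
qqnorm(res); qqline(res)                        # normal probability plot
```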

2.4.1. Scatter
Scatter is measured by the size of the residuals. A common problem is that the scatter increases as the mean response increases, i.e., the large residuals occur where the fitted values are large; this is recognized by a "funnel" shape in the plot of residuals versus fitted values.

Example: New York air quality measurements

 Data (airquality): daily air quality measurements in New York, May to September
1973:

 library(car); data(airquality)
 Variables are:

 Ozone: Mean ozone in parts per billion from 1300 to 1500 hours at
Roosevelt Island
 Solar.R: Solar radiation in Langleys in the frequency band 4000–7700
Angstroms from 0800 to 1200 hours at Central Park
 Wind: Average wind speed in miles per hour at 0700 and 1000 hours at
LaGuardia Airport
 Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia
Airport.
 Month: numeric month (1-12; these data cover May to September)
 Day: numeric day of the month (1-31)

Fit the model: Ozone ~ Solar.R + Wind + Temp

> regair <- lm(Ozone ~ Solar.R + Wind + Temp, data = airquality)

> regair

Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)

Coefficients:
(Intercept) Solar.R Wind Temp
-64.34208 0.05982 -3.33359 1.65209

summary(regair)
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality)

Residuals:



Min 1Q Median 3Q Max
-40.485 -14.219 -3.551 10.097 95.619

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -64.34208 23.05472 -2.791 0.00623 **
Solar.R 0.05982 0.02319 2.580 0.01124 *
Wind -3.33359 0.65441 -5.094 1.52e-06 ***
Temp 1.65209 0.25353 6.516 2.42e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.18 on 107 degrees of freedom


(42 observations deleted due to missingness)
Multiple R-squared: 0.6059, Adjusted R-squared: 0.5948
F-statistic: 54.83 on 3 and 107 DF, p-value: < 2.2e-16

The R2 is 60.59%.



par(mfrow=c(2,2))

plot(regair)

The resulting diagnostic plots show that the 117th observation is an outlier.

Let's see the output after excluding observation 117:

>regair117=lm(Ozone~Solar.R+Wind+Temp,data=airquality,subset=-117)

>regair117



> summary(regair117)
Call:
lm(formula = Ozone ~ Solar.R + Wind + Temp, data = airquality,
subset = -117)

Residuals:
Min 1Q Median 3Q Max
-38.757 -13.274 -1.993 9.972 62.314

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -76.89396 20.86408 -3.685 0.000362 ***
Solar.R 0.05405 0.02087 2.590 0.010951 *
Wind -2.76110 0.59860 -4.613 1.12e-05 ***
Temp 1.74239 0.22854 7.624 1.11e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.04 on 106 degrees of freedom
(42 observations deleted due to missingness)
Multiple R-squared: 0.6369, Adjusted R-squared: 0.6267
F-statistic: 61.99 on 3 and 106 DF, p-value: < 2.2e-16

Now the R2 is 63.69%.

2.4.2. Normality of Residuals

2.4.2.1. Histogram
A histogram is a graphical way of checking whether the data are normally distributed; in general, the histogram is used for checking the normality assumption.
For example we use the airquality data



hist(airquality$Ozone) ## Draw a new plot
hist(airquality$Ozone, col = "green")

2.4.2.2. QQ plot for studentized residuals


qqPlot(fit, main="QQ Plot") # qqPlot() is in the car package; fit is a fitted lm object (see below)

2.4.3. Outliers checking

The car package provides advanced utilities for regression modeling.


# Assume that we are fitting a multiple linear regression
# on the MTCARS data
library(car)
fit <- lm(mpg~disp+hp+wt+drat,data=mtcars)
outlierTest(fit) # Bonferroni p-value for the most extreme observation
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p



Toyota Corolla 2.51597 0.01838 0.58816
layout(matrix(c(1,2,3,4,5,6),2,3)) # optional layout
leveragePlots(fit, ask=FALSE) # leverage plots

2.4.4. Influential Observations

2.4.4.1. Added variable plot


avPlots(fit, one.page=TRUE, ask=FALSE)

2.4.4.2. Cook's D plot


# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(mtcars)-length(fit$coefficients)-2))

plot(fit, which=4, cook.levels=cutoff)
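The distances behind this plot can also be pulled out numerically with cooks.distance(), using the same rule-of-thumb cutoff as above:

```r
# Inspecting Cook's distances directly rather than from the plot
fit    <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)  # same model as above
cd     <- cooks.distance(fit)
cutoff <- 4 / (nrow(mtcars) - length(fit$coefficients) - 2)
cd[cd > cutoff]  # observations whose influence exceeds the rule of thumb
```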



2.4.4.3. Influence plot
influencePlot(fit, main="Influence Plot", sub="Circle size is proportional to Cook's Distance")

2.4.5. Non-constant Error Variance (Heteroscedasticity)


# evaluate homoscedasticity
# non-constant error variance test
ncvTest(fit)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 1.429672, Df = 1, p = 0.23182
# plot studentized residuals vs. fitted values
spreadLevelPlot(fit)



2.4.6. Multi-collinearity
install.packages("car") # install the package required to get VIFs for multicollinearity checking
library(car) # load the package
# Evaluate collinearity
vif(fit) # variance inflation factors
disp hp wt drat
8.209402 2.894373 5.096601 2.279547
sqrt(vif(fit)) > 2 # problem?
disp hp wt drat

TRUE FALSE TRUE FALSE
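The VIF for a predictor j is 1 / (1 - R²j), where R²j comes from regressing that predictor on all the others. Checking this by hand for disp reproduces the value reported above:

```r
# Hand computation of the variance inflation factor for disp
fit     <- lm(mpg ~ disp + hp + wt + drat, data = mtcars)
r2.disp <- summary(lm(disp ~ hp + wt + drat, data = mtcars))$r.squared
1 / (1 - r2.disp)  # about 8.21, matching car::vif(fit)["disp"]
```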



2.4.7. Evaluate Nonlinearity

2.4.7.1. Component + residual plot


crPlots(fit, one.page=TRUE, ask=FALSE)

2.4.7.2. Ceres plot


ceresPlots(fit, one.page=TRUE, ask=FALSE)



Reference

1. Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models. Sage.

2. Applied Statistics with R (2020). University of Illinois at Urbana-Champaign.

3. Dalgaard, P. (2002). Introductory Statistics with R. Springer-Verlag, New York.

4. Fox, J. (2002). An R and S-PLUS Companion to Applied Regression. Sage.

5. The R Project for Statistical Computing (http://www.r-project.org/about.html).

