R Statistical Package
Dec 2024
Wolaita Sodo,
Ethiopia
Outline
Introduction
Important Commands in R
Data Modes & Types in R
Operators in R
Creating Vectors, Matrices, Lists & Data Frames in R
Importing Data into R
Descriptive Statistics & Graphics in R
Statistical Models in R
Introduction
R is a powerful computer program for
performing statistical analysis and
graphics.
It is free & easy to learn.
R is a platform for an object-oriented
statistical programming language.
R has an excellent built-in help system.
R has excellent graphing capabilities.
R is a computer programming
language.
R was initially written by Ross Ihaka and
Robert Gentleman at Dep. of Statistics of
It is "open-source" software (which for our
purposes means that it can be freely
downloaded);
Download R at http://cran.r-project.org/
It is available for a number of different
operating systems, including Windows, Linux,
and Macintosh;
By itself it is fairly powerful, and it is extensible
(meaning that procedures for analyzing data
Getting Started
Once you have installed R, there will be an
icon on your desktop. Double click it and R
will start up.
Type 'q()' to quit R.
R does have a few pull-down menus, but
mostly commands in R are entered on the
command line (>).
The > is a prompt symbol displayed by R, not
typed by you. This is R’s way of telling you it’s
ready for you to type a command.
To see the list of installed datasets, call the
data() function with no arguments:
> data()
Installing and using R packages
Installing R packages:
> install.packages("pkgname") # to install a package from CRAN ("pkgname" is a placeholder)
> library(pkgname) # to load an installed package
Getting help:
> help(var) or ?var
> help(t.test) or ?t.test, etc.
will give you all the information, with examples,
references, etc., on how to use a function
such as mean(), var() or t.test().
Data Modes
Logical - Binary data mode, with values
represented as TRUE or FALSE (abbreviated T or F).
Numeric - Numeric data mode includes
integer and double-precision (real)
representations of numeric values.
Complex - Complex numeric values (real
and imaginary parts).
Character - Character values represented
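The four modes can be checked directly with mode(). The values below are arbitrary illustrations, not taken from the slides; typeof() is added only to show how integers are stored within the numeric mode.

```r
# Query the mode of a few example values
mode(TRUE)      # "logical"
mode(3.7)       # "numeric"
mode(5L)        # "numeric" (the L suffix makes an integer)
typeof(5L)      # "integer" (storage type within the numeric mode)
mode(2 + 3i)    # "complex"
mode("Sodo")    # "character"
```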
Data Types
Vector : A set of elements in a specified
order.
Matrix : is a two-dimensional array of
elements of the same mode.
Factor : is a vector of categorical data.
Data frame : is a two-dimensional array
whose columns may represent data of
different modes.
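A minimal sketch of the four data types listed above. The object names (v, m, f, df) and values are made up for illustration.

```r
v  <- c(2.5, 4.1, 3.3)                   # vector: ordered set of elements
m  <- matrix(1:6, nrow = 2)              # matrix: 2 rows x 3 columns, one mode
f  <- factor(c("low", "high", "low"))    # factor: categorical data
df <- data.frame(id = 1:3,               # data frame: columns of
                 score = v,              # different modes
                 level = f)
str(df)                                  # compact summary of the structure
```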
R as a calculator
Arithmetic: R can function as a calculator for scalar arithmetic, performing
addition +, subtraction -, multiplication *, division /, exponentiation ^, taking
the modulus %%, and integer division %/%. Parentheses () specify the order
of operations.
Example
> (17*0.35)^(1/3); log(10); exp(1); 3^-1; 2+5
> (3+5/78)^3*7
[1] 201.3761
> 89%%13 # modulus
[1] 11
> 89%/%13 # integer division
[1] 6
Assigning Values to variables
Values are assigned to variables using '<-' or '='
> x<-12.6
>x
[1] 12.6
A variable can hold many values (a vector), created e.g. with the concatenate
function c():
> y<-c(3,7,9,11)
>y
[1] 3 7 9 11
Assigning Values to variables
Operator ‘:’ means “a series of integers between”:
> x<-1:6
> x
[1] 1 2 3 4 5 6
Object names cannot contain `strange' symbols like !, +, -, #.
A dot (.) and an underscore ( _) are allowed, also a name starting with a dot.
Object names can contain a number but cannot start with a number.
R is case sensitive, X and x are two different objects, as well as temp and temP.
> x = sin(9)/75
> y = log(x) + x^2
> x
> y
> m <- matrix(c(1,2,4,1), ncol=2)
> m
> solve(m)
To list the objects that you have in your current R session use the function ls or
the function objects.
> ls()
[1] "m" "x" "y"
Operators in R
I. Arithmetic Operators
* : Multiply
+ : Add
- : Subtract
/ : Divide
^ : Exponentiation
%% : Modulus
II. Comparison Operators
!= Not Equal To
< Less Than
<= Less Than or Equal to
== Equal
> Greater Than
>= Greater Than or Equal to
III. Logical Operators
! : Not
& : And (for vectors and arrays of logical values)
| : Or (for vectors and arrays of logical values)
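The comparison and logical operators work element by element on vectors. A small sketch with made-up vectors a and b:

```r
a <- c(2, 7, 10)
b <- c(5, 7, 3)

a == b               # FALSE  TRUE FALSE
a != b               #  TRUE FALSE  TRUE
a > b                # FALSE FALSE  TRUE
!(a > b)             #  TRUE  TRUE FALSE
(a > 5) | (b > 4)    #  TRUE  TRUE  TRUE
(a > 5) & (b > 4)    # FALSE  TRUE FALSE
```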
Subsetting:
Individual elements of a vector, matrix, array or data
frame are accessed with “[ ]” by specifying their index, or
their name
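A short sketch of subsetting by index, by name and by logical condition. All objects and values here are invented for illustration.

```r
y <- c(first = 3, second = 7, third = 9, fourth = 11)
y[2]            # second element (7), by position
y["third"]      # element named "third" (9)
y[c(1, 4)]      # first and fourth elements
y[y > 5]        # elements satisfying a condition: 7, 9, 11

m <- matrix(1:6, nrow = 2)
m[2, 3]         # row 2, column 3 (6)
m[, 1]          # whole first column

df <- data.frame(id = 1:3, score = c(10, 20, 30))
df$score[2]                   # second value of column score (20)
big <- df[df$score > 15, ]    # rows where score exceeds 15
```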
Useful Functions
>length(object) # number of elements or components
>names(object) # names
>c(object,object,...) # combine objects into a vector
>cbind(object, object, ...) # combine objects as columns
>rbind(object, object, ...) # combine objects as rows
>ls() # list current objects
>rm(object) # delete an object
>newobject <- edit(object) # edit copy and save as a new object
>fix(object) # edit in place
Data Import
From the keyboard one by one
c( )
dat1<-read.table("D:/Rtraining/datatry.txt", header=TRUE)
dat1
attach(dat1)
data
dat2
Data Import…
#How to import data from SPSS into R.
getwd()
dat3
By a spreadsheet
data.entry()
edit()
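Because the file paths on the slides are specific to one machine, here is a self-contained sketch that writes a small CSV file to a temporary location and reads it back. The data frame demo and its columns are invented for illustration. For SPSS files, one commonly used option is read.spss() from the foreign package (for example with to.data.frame = TRUE); that call is not shown on the slides and is mentioned only as a pointer.

```r
# Write a small example file, then import it
path <- tempfile(fileext = ".csv")            # temporary file, not a real data set
demo <- data.frame(id = 1:3, weight = c(60.5, 72.0, 58.3))
write.csv(demo, path, row.names = FALSE)

dat <- read.csv(path, header = TRUE)          # first line holds the column names
str(dat)                                      # check column types after import
head(dat)                                     # look at the first rows
```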
Value Labels
You can use the factor function to create your own value
labels.
# variable v1 is coded 1, 2 or 3
# we want to attach value labels 1=red, 2=blue,3=green
>mydata$v1 <- factor(mydata$v1,
levels = c(1,2,3),
labels = c("red", "blue", "green"))
# mydata$sex <- factor(mydata$sex, levels = c(1,2), labels = c("male", "female"))
# variable y is coded 1, 3 or 5
# we want to attach value labels 1=Low, 3=Medium, 5=High
>mydata$y <- ordered(mydata$y, levels = c(1,3, 5),
labels = c("Low", "Medium", "High"))
Note: factor and ordered are used the same way, with the
same arguments. The former creates factors and the latter
creates ordered factors.
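A runnable sketch of the value-label code above, using a small made-up mydata so the effect can be seen. The last line shows one practical difference: ordered factors support comparisons between levels.

```r
mydata <- data.frame(v1 = c(1, 3, 2, 1), y = c(5, 1, 3, 5))  # invented data

mydata$v1 <- factor(mydata$v1,
                    levels = c(1, 2, 3),
                    labels = c("red", "blue", "green"))
mydata$y <- ordered(mydata$y,
                    levels = c(1, 3, 5),
                    labels = c("Low", "Medium", "High"))

table(mydata$v1)      # counts per label
mydata$y < "High"     # FALSE TRUE TRUE FALSE: levels have an order
```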
Creating new variables
Use the assignment operator <- to create new variables.
A wide array of operators and functions are available
here.
# Three examples for doing the same computations
>mydata$sum <- mydata$x1 + mydata$x2
>mydata$mean <- (mydata$x1 + mydata$x2)/2
>attach(mydata)
>mydata$sum <- x1 + x2
>mydata$mean <- (x1 + x2)/2
>detach(mydata)
>mydata <- transform(mydata, sum = x1 + x2, mean = (x1 + x2)/2)
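The transform() variant can be tried on a small made-up data frame. It needs no attach()/detach() pair, since column names are looked up inside mydata.

```r
mydata <- data.frame(x1 = c(2, 4, 6), x2 = c(10, 20, 30))  # invented data

mydata <- transform(mydata,
                    sum  = x1 + x2,
                    mean = (x1 + x2) / 2)
mydata   # now has four columns: x1, x2, sum, mean
```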
Recoding variables
In order to recode data, you will probably use
one or more of R's control structures.
# create 2 age categories
>mydata$agecat <- ifelse(mydata$age > 70,
c("older"), c("younger"))
# another example: create 3 age categories
>attach(mydata)
>mydata$agecat[age > 75] <- "Elder"
>mydata$agecat[age > 45 & age <= 75] <-
"Middle Aged"
>mydata$agecat[age <= 45] <- "Young"
>detach(mydata)
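The three-category recode can be checked on made-up ages that sit on the boundaries. The cut() call is an alternative not shown on the slides; with its default right-closed intervals it gives the same grouping.

```r
mydata <- data.frame(age = c(30, 45, 46, 75, 76))   # invented ages

# Recode with logical indexing, without attach()
mydata$agecat <- NA_character_
mydata$agecat[mydata$age > 75] <- "Elder"
mydata$agecat[mydata$age > 45 & mydata$age <= 75] <- "Middle Aged"
mydata$agecat[mydata$age <= 45] <- "Young"

# Alternative: cut() with intervals (-Inf,45], (45,75], (75,Inf)
mydata$agecat2 <- cut(mydata$age,
                      breaks = c(-Inf, 45, 75, Inf),
                      labels = c("Young", "Middle Aged", "Elder"))
mydata
```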
Merging Data frames/files/data
To merge two data frames (datasets) horizontally, use
the merge function. In most cases, you join two data
frames by one or more common key variables (i.e., an
inner join).
# merge two data frames by ID
>total <- merge(dataframeA, dataframeB,by="ID")
# merge two dataframes by ID and Country
>total <- merge(dataframeA, dataframeB, by=c("ID","Country"))
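A small sketch of merge() with two invented data frames. By default only IDs present in both are kept (inner join); all = TRUE, a standard merge() argument not used on the slides, keeps every ID and fills the gaps with NA.

```r
dataframeA <- data.frame(ID = c(1, 2, 3), height = c(160, 172, 181))
dataframeB <- data.frame(ID = c(2, 3, 4), weight = c(65, 80, 70))

total <- merge(dataframeA, dataframeB, by = "ID")                  # IDs 2 and 3
total_all <- merge(dataframeA, dataframeB, by = "ID", all = TRUE)  # IDs 1 to 4
total
total_all
```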
ADDING ROWS
To join two data frames (datasets) vertically, use the
rbind function.
The two data frames must have the same variables, but
they do not have to be in the same order.
>total <- rbind(dataframeA, dataframeB)
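A sketch showing that rbind() on data frames matches columns by name, so the column order may differ. The data are invented for illustration.

```r
dataframeA <- data.frame(ID = 1:2, score = c(10, 20))
dataframeB <- data.frame(score = c(30, 40), ID = 3:4)   # same columns, other order

total <- rbind(dataframeA, dataframeB)   # column order follows dataframeA
total
```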
Statistical Hypothesis Tests
>t.test(): ->Student's t-tests (one and two
samples)
>var.test(): ->Fisher's F test (equality of
two variances)
>cor.test(): ->correlation tests
>chisq.test(): ->χ² (chi-squared) test
>prop.test(): ->proportion tests (one &
difference of two proportions)
>wilcox.test(): ->Wilcoxon tests (one and two
samples)
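A few of these tests applied to data sets that ship with R (sleep and cars). The choice of data and the counts passed to prop.test() are illustrative assumptions, not from the slides. Each call returns an object whose parts, such as p.value, can be extracted.

```r
# Two-sample t-test: extra sleep in two drug groups
tt <- t.test(extra ~ group, data = sleep)
tt$p.value

# Equality of two variances
vt <- var.test(extra ~ group, data = sleep)
vt$p.value

# Correlation between speed and stopping distance
ct <- cor.test(cars$speed, cars$dist)
ct$estimate

# Difference of two proportions: 15/50 versus 25/50 (made-up counts)
pt <- prop.test(x = c(15, 25), n = c(50, 50))
pt$p.value

# Wilcoxon rank-sum test; exact = FALSE because the data contain ties
wt <- wilcox.test(extra ~ group, data = sleep, exact = FALSE)
wt$p.value
```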
Graphical Procedures
> plot(x) # plot() is the generic plotting function
> plot(xvalues,yvalues)
Histograms
Histograms are a useful graphic for displaying univariate data
>hist(x)
>boxplot(x)#to produce box plot.
Q-Q Plot: to check normality
> qqnorm(resid(fit),main="Normal Q-Q plot") # fit is a fitted model, e.g. from lm()
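A minimal sketch of a histogram, a box plot and a normal Q-Q plot on a built-in variable. mtcars$mpg is used only for illustration and is not from the slides; qqline() adds a reference line to the Q-Q plot.

```r
x <- mtcars$mpg                    # 32 fuel-consumption values

h <- hist(x, main = "Histogram of mpg", xlab = "Miles per gallon")
b <- boxplot(x, ylab = "Miles per gallon")

qqnorm(x, main = "Normal Q-Q plot")
qqline(x)                          # points near the line suggest normality
```

Both hist() and boxplot() also return the numbers behind the picture (bin counts in h$counts, the five-number summary in b$stats).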
Changing the Look of Graphics
> plot(xvalues,yvalues, ylab = "Label for y axis", xlab = "Label for x axis", las = 1,
cex.lab = 1.5)
las : numeric in {0,1,2,3}; changes the orientation of the axis labels;
cex.lab : magnification to be used for x and y labels;
To get full range of changes about graphical parameters:
>?par
Cont’d
Each function has its own set of arguments.
The most common ones are
xlim,ylim: range of variable plotted on the x
and y axis respectively
pch, col, lty: plotting character, colour and
line type
xlab, ylab: labels of x and y axis respectively
main, sub: main title and sub-title of graph
type="l" (line), "p" (point), "h" (vertical line)…
Example:
## plot the graph of f(x)=x^2+2x+9 between x=-3 and x=3
> x<-seq(-3,3,0.01)
> f<-x^2+2*x+9
> plot(x,f,type="l",main="Graph of Quadratic",
+ xlab="Xvalue",ylab="funalvalue",col="red")
Cont’d
[Figure: "Graph of Quadratic", a line plot of funalvalue (y axis) against Xvalue (x axis, from -3 to 3).]
[Figure: "Histogram", Frequency (y axis) against xx (x axis).]
Example
The plot of x^3 −3x between x=−2 and x=2:
>curve(x^3-3*x, -2, 2)
Here is the more cumbersome code to do the
same thing using plot:
>x<-seq(-2,2,0.01)
>y<-x^3-3*x
>plot(x,y,type="l")
More Graphical Parameters
Color options and their descriptions
>col # Default plotting color. Some functions (e.g. lines) accept a vector of values that are recycled.
>col.axis # color for axis annotation
>col.lab # color for x and y labels
>col.main # color for titles
>col.sub # color for subtitles
>fg # plot foreground color (axes, boxes; also sets col= to the same value)
>bg # plot background color
Scatterplot Matrices
# Basic Scatterplot Matrix
>pairs(~mpg+disp+drat+wt,data=mtcars,
main="Simple Scatterplot Matrix")
Statistical Models in R
Regression Model in R
>fit1<-lm(y ~ x) : ->Simple regression
>lm(y ~ 1+x): -> Explicit intercept
>lm(y ~ -1 + x):-> Through the origin
>fit<-lm(y ~ x + I(x^2)):-> Quadratic regression
>fit<-lm(y ~ x1 + x2 + x3):-> Multiple Regression
>coef(fit)-> to find regression coefficients
>resid(fit) -> to find residuals
>fitted(fit) -> to find fitted values
>summary(fit) -> to find analysis summary
>predict(fit, newdata)-> predict for new data
>anova(fit) # to get anova table
>deviance(fit)-> residual sum of squares
>plot(fitted(fit), resid(fit)) #to check constant variance assumption
>qqnorm(resid(fit)) # to check normality assumption
>X <- model.matrix(~ y - 1, Data)
Cont’d
Fitting the Model
# Multiple Linear Regression Example
>fit <- lm(y ~ x1 + x2 + x3, data=mydata)
>summary(fit) # show results
# Other useful functions
>coefficients(fit) # model coefficients
>confint(fit, level=0.95) # CIs for model parameters
>fitted(fit) # predicted values
>residuals(fit) # residuals
>anova(fit) # anova table
>vcov(fit) # covariance matrix for model parameters
>influence(fit) # regression diagnostics
# diagnostic plots provide checks for heteroscedasticity, normality, and influential observations.
>plot(fit) #Diagnostic Plots.
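A runnable sketch of the multiple regression workflow on the built-in mtcars data. The choice of response and predictors (mpg on wt, hp, disp) and the values in newcars are assumptions made for illustration only.

```r
# Fit mpg on weight, horsepower and displacement
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
summary(fit)                      # coefficients, R-squared, F test
coef(fit)                         # intercept plus three slopes
confint(fit, level = 0.95)        # confidence intervals for the parameters

# Residuals against fitted values: look for a constant spread
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Prediction for new observations needs a data frame with the same column names
newcars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150), disp = c(150, 250))
pred <- predict(fit, newdata = newcars)
pred
```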
Example of Simple LRM from R
data set
>data()# to view data set available in R
>data(cars) # to load the built-in data frame named cars into our current working space
>names(cars)
[1] "speed" "dist"
> y<-cars$speed
> x<-cars$dist
> fit<-lm(y~x)
>fit
>plot(resid(fit),fitted(fit),main="CCVA",
ylab="fitted", xlab="resid")
>qqnorm(resid(fit),main="QQ plot")
Example of Multiple LRM from R
data set
>data()# to view data set available in R
>data(rock) # to load the built-in data frame named rock into our current working space
>names(rock)# names in data frame rock
[1] "area" "peri" "shape" "perm"
> Y<-rock$area
> X1<-rock$peri
> X2<-rock$shape
> X3<-rock$perm
> fit1<-lm(Y~X1+X2+X3) # fitting multiple linear regression model
> fit1
Tree data example
>data() # to view data set in R
>edit(trees) # to view data trees
> names(trees)
[1] "Girth" "Height" "Volume"
> Y<-trees$Girth
> x1<-trees$Height
> x2<-trees$Volume
> fit1<-lm(Y~x1+x2)
>fit1
>coef(fit1)
>anova(fit1)
Extracting Statistics from the Regression
The most important statistics and parameters of a
regression are stored in the lm object or the summary
object.
> output <- summary(result) # result is a fitted lm object
> SSR <- deviance(result)
> LL <- logLik(result)
> DegreesOfFreedom <- result$df.residual
> Yhat <- result$fitted.values
> Coef <- result$coefficients
> Resid <- result$residuals
> s <- output$sigma
> RSquared <- output$r.squared
> CovMatrix <- s^2*output$cov.unscaled
> aic <- AIC(result)
>vcov(result) #variance-covariance matrix of the coefficients
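A sketch that extracts these quantities from a model fitted to the built-in cars data and checks two of the relationships. The model dist ~ speed is an illustrative assumption.

```r
result <- lm(dist ~ speed, data = cars)
output <- summary(result)

SSR <- deviance(result)                    # residual sum of squares
all.equal(SSR, sum(resid(result)^2))       # same quantity computed by hand

s <- output$sigma                          # residual standard error
CovMatrix <- s^2 * output$cov.unscaled     # covariance matrix of the coefficients
all.equal(CovMatrix, vcov(result))         # agrees with vcov()

result$df.residual                         # 50 observations - 2 parameters = 48
output$r.squared
AIC(result)
```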
linear model
>lm(y~x) : ->To fit a regression model
>lm(y~x1+x2): ->To fit a multiple linear regression
model using two regressors x1 & x2
>aov(y~x): -> to fit a one way ANOVA model
>f=as.factor(f): ->transforms f into a factor
>lm(y~f) : ->one factor ANOVA
>lm(y~f1+f2) : ->two factors ANOVA
>lm(y~x+f): -> analysis of covariance
Families :
?family # to identify the family of model
Logistic regression
glm.out=glm(y~x, binomial)
Poisson regression
glm.out=glm(y~x, poisson)
Remark:
>lm(y~x) equivalent to > glm(y~x, gaussian)
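The remark can be verified numerically: on the same data, lm() and glm() with the gaussian family give the same coefficients. The logistic fit on mtcars is an added illustration; both data sets and formulas are assumptions, not from the slides.

```r
# Gaussian glm reproduces the ordinary linear model
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian, data = cars)
all.equal(coef(fit_lm), coef(fit_glm))

# Logistic regression: transmission type (am, coded 0/1) against weight
glm.out <- glm(am ~ wt, family = binomial, data = mtcars)
summary(glm.out)
exp(coef(glm.out))    # coefficients on the odds-ratio scale
```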
ANOVA MODEL
Partition of variation into
Between groups
Within groups
The model:(One Way ANOVA Model)
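$Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}$
Assumptions:
Normality
Independence
Homogeneity
$\mathrm{Var}(Y) = \mathrm{Var}(\mu) + \mathrm{Var}(\alpha) + \mathrm{Var}(\varepsilon) = \mathrm{Var}(\alpha) + \mathrm{Var}(\varepsilon)$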
Example: Perform one way ANOVA for the data given in the table
below:
A    B    C
43   41   39
40   47   34
35   54   37
>treat <- c(1,1,1,2,2,2,3,3,3)
>y <- c(43, 40, 35, 41, 47, 54, 39, 34, 37)
>treat <- as.factor(treat)
>fit <- aov(y ~ treat)
>summary(fit)
>anova(fit)
>treat <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
>yield <- c(13.25,25.61,37.9,38.65,2.14,14.05,23.8,37.5,38.93,1.91,
+ 14.21,24.76,36.44,37.8,1.78)
>treat <- as.factor(treat)
>fit <- aov(yield ~ treat)
>summary(fit)
>anova(fit)
ANOVA - Fit a Model
# One Way Anova (Completely Randomized Design)
>fit <- aov(y ~ group)
# Randomized Block Design (B is the blocking factor)
>fit <- aov(y ~ A + B)
# Two Way Factorial Design
>fit <- aov(y ~ A + B + A*B, data=mydataframe)
>fit <- aov(y ~ A*B, data=mydataframe) # same thing
# Analysis of Covariance
>fit <- aov(y ~ A + x, data=mydataframe)
For within subjects designs, the dataframe has to be rearranged
so that each measurement on a subject is a separate observation
# One Within Factor
>fit <- aov(y~A+Error(Subject/A),data=mydataframe)
# Two Within Factors W1 W2, Two Between Factors B1 B2
>fit <- aov(y~(W1*W2*B1*B2)+Error(Subject/(W1*W2))+(B1*B2),
data=mydataframe)
Factorial ANOVA
Example: Perform a factorial ANOVA for the
following data:
                 Pesticide
Variety     1     2     3     4   Total
B1         29    50    43    53     175
B2         41    58    42    73     214
B3         66    85    69    85     305
Total     136   193   154   211     694
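A sketch of how this table could be entered and analysed. The B3 / pesticide 3 cell is entered as 69, the value implied by the stated row and column totals. Because there is a single observation per cell, the sketch fits main effects only; the interaction cannot be separated from error without replication. The object names are assumptions.

```r
# Yields entered row by row: B1, B2, B3 across pesticides 1 to 4
y <- c(29, 50, 43, 53,
       41, 58, 42, 73,
       66, 85, 69, 85)
variety   <- gl(3, 4, labels = c("B1", "B2", "B3"))   # each variety 4 times
pesticide <- gl(4, 1, 12, labels = 1:4)               # 1,2,3,4 repeated

tapply(y, variety, sum)      # row totals: 175 214 305
tapply(y, pesticide, sum)    # column totals: 136 193 154 211

fit <- aov(y ~ variety + pesticide)
summary(fit)
```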
$\Pr(Y_i = 0) = \dfrac{\exp(\eta_i)}{1 + \exp(\eta_i)}$
$\eta_i = \sum_j x_{ij}\,\beta_j$ (linear predictor)
Model is investigated by
estimating the $\beta_j$'s by maximum likelihood
testing if the estimates are different from 0
Fitting the Model
>afit <- glm(y ~ additive(x), family = "binomial") # additive() is a user-defined function, not part of base R
Model Comparison
> afit <- glm(t$y ~ additive(t$m20))
> gfit <- glm(t$y ~ genotype(t$m20))
> anova(afit,gfit)
R> plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma,
+ family = binomial()) # simple logistic regression
R> data("womensrole", package = "HSAUR2")
R> fm1 <- cbind(agree, disagree) ~ gender + education
R> womensrole_glm_1 <- glm(fm1, data = womensrole,
+ family = binomial())
> no.yes <- c("No","Yes")
> smoking <- gl(2,1,8,no.yes)
> obesity <- gl(2,2,8,no.yes)
> snoring <- gl(2,4,8,no.yes)
> n.tot <- c(60,17,8,2,187,85,51,23)
> n.hyp <- c(5,2,1,0,35,13,15,8)
> data.frame(smoking,obesity,snoring,n.tot,n.hyp)
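The gl function is used to “generate levels”: gl(n, k, length, labels) creates
a factor with n levels, each repeated in blocks of k, up to the given length.
R is able to fit logistic regression analyses for
tabular data in two different ways.
One way is to give the response as a two-column matrix of
counts (diseased, healthy):
> hyp.tbl <- cbind(n.hyp,n.tot-n.hyp)
> glm(hyp.tbl~smoking+obesity+snoring,family=binomial("logit"))
The other way to specify a logistic regression model is to give the
proportion of diseased in each cell, weighted by the cell total:
> prop.hyp <- n.hyp/n.tot
> glm.hyp <- glm(prop.hyp~smoking+obesity+snoring,
+ binomial,weights=n.tot)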
> summary(glm.hyp)
> confint(glm.hyp)
> exp(confint(glm.hyp))
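The two specifications can be compared side by side. This self-contained sketch re-enters the hypertension table from the slides and checks that both fits give the same coefficients; the names fit_counts and fit_props are introduced here for illustration.

```r
no.yes  <- c("No", "Yes")
smoking <- gl(2, 1, 8, no.yes)
obesity <- gl(2, 2, 8, no.yes)
snoring <- gl(2, 4, 8, no.yes)
n.tot   <- c(60, 17, 8, 2, 187, 85, 51, 23)
n.hyp   <- c(5, 2, 1, 0, 35, 13, 15, 8)

# Way 1: two-column matrix of (diseased, healthy) counts
fit_counts <- glm(cbind(n.hyp, n.tot - n.hyp) ~ smoking + obesity + snoring,
                  family = binomial)

# Way 2: proportion diseased, weighted by the cell totals
fit_props <- glm(n.hyp / n.tot ~ smoking + obesity + snoring,
                 family = binomial, weights = n.tot)

all.equal(coef(fit_counts), coef(fit_props))   # same estimates either way
exp(coef(fit_counts))                          # odds ratios
```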
Thank You!