Module - 4 (R Training) - Basic Stats & Modeling
Program: KPO Training
Module: 4
Session: 7 and 8
Outline
Descriptive Statistics
Frequencies and Crosstabs
Correlations
Multiple Linear Regression
Logistic Regression
Time Series
Principal Component
Factor Analysis
Cluster Analysis
Descriptive Statistics
R provides a wide range of functions for obtaining summary statistics. One method of
obtaining descriptive statistics is to use the sapply( ) function with a specified summary
statistic.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
- Possible functions used in sapply include mean, sd, var, min, max, median, range, and
quantile.
# mean, median, 25th and 75th percentiles, min, max
summary(mydata)
# n, nmiss, unique, mean, 5, 10, 25, 50, 75, 90, 95th percentiles
# 5 lowest and 5 highest scores
library(Hmisc)
describe(mydata)
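For instance, on the built-in mtcars data frame (used here purely as an illustration):
# descriptive statistics on the built-in mtcars data
sapply(mtcars, median, na.rm=TRUE) # per-column medians
summary(mtcars) # mean, median, quartiles, min, max per column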
Frequencies
R provides many methods for creating frequency and contingency tables. Three are
described below. In the following examples, assume that A, B, and C represent categorical
variables.
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
table( ) can also generate multidimensional tables based on 3 or more categorical
variables. In this case, use the ftable( ) function to print the results more attractively.
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
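As a self-contained illustration on built-in data (mtcars, with cyl and gear playing the roles of A and B):
# 2-way table of cylinders by gears in mtcars
mytable <- table(mtcars$cyl, mtcars$gear)
prop.table(mytable, 1) # row percentages: gear mix within each cyl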
Crosstabs
xtabs : The xtabs( ) function allows you to create crosstabulations using formula-style input.
# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of independence
CrossTable :
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
Note: table( ) ignores missing values by default. To include NA as a category in counts, include the table
option exclude=NULL if the variable is a vector. If the variable is a factor, you have to create a new factor
using newfactor <- factor(oldfactor, exclude=NULL).
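A formal test of independence can also be run directly on any 2-way table with base R's chisq.test( ); a minimal sketch using the mytable object built above:
# chi-square test of independence on a 2-way table
chisq.test(mytable)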
Correlations
Use the cor( ) function to produce correlations and the cov( ) function to produce covariances.
A simplified format is cor(x, use=, method= ), where use= specifies the handling of missing data
("everything", "all.obs", "complete.obs", or "pairwise.complete.obs") and method= specifies the type
of correlation ("pearson", "kendall", or "spearman").
Example:
# Correlations/covariances among numeric variables in data frame mtcars,
# using listwise deletion of missing data
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
Use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the
columns of Y. This is similar to the VAR and WITH statements in SAS PROC CORR.
# Correlation matrix from mtcars with mpg, cyl, and disp as rows and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
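rcorr( ) comes from the Hmisc package and additionally reports significance levels; a minimal sketch (note that it expects a matrix):
# correlations with p-values via Hmisc
library(Hmisc)
rcorr(as.matrix(mtcars), type="pearson") # returns r, n, and P matrices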
Multiple Linear Regression
Diagnostic Plots provide checks for heteroscedasticity, normality, and influential observations.
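The plots below assume a fitted linear model; a minimal sketch (the formula and mydata are illustrative):
# fit a multiple linear regression (illustrative formula)
fit <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
summary(fit) # coefficients, R-squared, overall F test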
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
Comparing Models : We can compare nested models with the anova( ) function. The following code
provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
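For models that are not nested, an information criterion offers an alternative comparison; AIC( ) is base R (lower values are better):
# compare models by Akaike's information criterion
AIC(fit1, fit2)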
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables.
# Logistic Regression where F is a binary factor and x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3, data=mydata, family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals
Note : One can use anova(fit1,fit2, test="Chisq") to compare nested models. Additionally, cdplot(F~x,
data=mydata) will display the conditional density plot of the binary outcome F on the continuous x variable.
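As a self-contained illustration on built-in data (mtcars, where am is the binary transmission indicator):
# logistic regression of transmission type on weight and horsepower
fit <- glm(am ~ wt + hp, data=mtcars, family=binomial())
exp(coef(fit)) # odds ratios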
Time Series
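The examples below assume myts is a time series object; a minimal sketch of building one from a numeric vector (myvector, the start date, and the monthly frequency are all illustrative):
# turn a numeric vector of monthly observations into a time series
myts <- ts(myvector, start=c(2009, 1), frequency=12)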
# plot series
plot(myts)
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the
stl( ) function. Note that a series with multiplicative effects can often be transformed into a series with
additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)
# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)
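To project a decomposed series forward, base R's HoltWinters( ) is one option; a minimal sketch (the 12-period horizon is illustrative):
# exponential smoothing fit and a 12-period-ahead forecast
fit <- HoltWinters(myts)
predict(fit, n.ahead=12)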
Principal Component Analysis
The principal( ) function in the psych package can be used to extract and rotate principal
components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
Note: mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is
used. rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".
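Base R's prcomp( ) offers an unrotated alternative plus a scree plot for choosing the number of components; a minimal sketch (assuming mydata is numeric):
# unrotated principal components with base R
pc <- prcomp(mydata, scale.=TRUE) # scale.=TRUE standardizes variables
summary(pc) # proportion of variance explained
screeplot(pc, type="lines") # scree plot to pick components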
Factor Analysis
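Maximum-likelihood factor analysis is available in base R through factanal( ); a minimal sketch (the 3-factor choice and varimax rotation are illustrative):
# maximum-likelihood factor analysis with varimax rotation
fit <- factanal(mydata, factors=3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE) # show loadings above .3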
Cluster Analysis
Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
Partitioning
K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters.
# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
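Because kmeans( ) starts from random centers, results can vary from run to run; seeding the generator and using multiple starts makes the solution reproducible and more stable (nstart=25 is an illustrative choice):
# reproducible k-means with multiple random starts
set.seed(1234)
fit <- kmeans(mydata, 5, nstart=25)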
Cluster Analysis
(continued..)
Hierarchical Agglomerative
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # use method="ward.D" in newer versions of R
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward", method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)
Note: pvclust clusters the columns (variables), not the rows; cluster observations by transposing the
data first with t(mydata).
THANK YOU