Module - 4 (R Training) - Basic Stats & Modeling

IVY Professional School

Program: KPO Training
Module: Basic Statistics and Predictive Modeling
Session: 7 and 8

Copyright Ivy Professional School - 2009-10 (All Rights Reserved)

Outline

Descriptive Statistics
Frequencies and Crosstabs
Correlations
Multiple Linear Regression
Logistic Regression
Time Series
Principal Component Analysis
Factor Analysis
Cluster Analysis


Descriptive Statistics
R provides a wide range of functions for obtaining summary statistics. One approach is to call sapply( ) with the summary statistic you want.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
Possible functions to use with sapply( ) include mean, sd, var, min, max, median, range, and quantile; extra arguments are passed through to the function, as sketched below.
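A minimal sketch, using the same hypothetical mydata frame of numeric variables:
# 25th and 75th percentiles for each variable; extra arguments
# such as probs and na.rm are passed on to quantile()
sapply(mydata, quantile, probs=c(0.25, 0.75), na.rm=TRUE)
# standard deviations, skipping missing values
sapply(mydata, sd, na.rm=TRUE)
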
# mean, median, 25th and 75th percentiles, min, max
summary(mydata)

# n, nmiss, unique, mean, 5, 10, 25, 50, 75, 90, 95th percentiles,
# 5 lowest and 5 highest scores
library(Hmisc)
describe(mydata)


Frequencies
R provides many methods for creating frequency and contingency tables. Three are
described below. In the following examples, assume that A, B, and C represent categorical
variables.
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
table( ) can also generate multidimensional tables based on 3 or more categorical
variables. In this case, use the ftable( ) function to print the results more attractively.
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)

Crosstabs
xtabs: The xtabs( ) function allows you to create crosstabulations using formula-style input (again, A, B, and C are categorical variables).
# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of independence
CrossTable: The CrossTable( ) function in the gmodels package produces crosstabulations in a SAS- or SPSS-style layout.
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
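CrossTable( ) also takes options controlling the output; a sketch using arguments from the gmodels documentation:
# chi-square test, row proportions only, SPSS-style output
CrossTable(mydata$myrowvar, mydata$mycolvar,
           chisq=TRUE, prop.c=FALSE, prop.t=FALSE, format="SPSS")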

Note: table( ) ignores missing values by default. To include NA as a category in counts, add the option exclude=NULL if the variable is a vector. If the variable is a factor, you have to create a new factor using newfactor <- factor(oldfactor, exclude=NULL). A sketch follows.
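# minimal sketch, assuming A contains missing values
table(A, exclude=NULL)              # NA counted as its own category
newfactor <- factor(A, exclude=NULL) # for a factor, rebuild it first
table(newfactor)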


Correlations

Use the cor( ) function to produce correlations and the cov( ) function to produce covariances.
A simplified format is cor(x, use=, method= ), where use specifies the handling of missing data ("everything", "all.obs", "complete.obs", or "pairwise.complete.obs") and method specifies the type of correlation ("pearson", "kendall", or "spearman").
Example:
# correlations/covariances among numeric variables in data
# frame mtcars, using listwise deletion of missing data
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")

Use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the columns of Y. This is similar to the VAR and WITH statements in SAS PROC CORR.
# correlation matrix from mtcars with mpg, cyl, and disp
# as rows and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
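For significance levels alongside the correlations, rcorr( ) from the Hmisc package can be used; note that it expects matrix input, uses pairwise deletion, and (unlike cor(x, y)) returns the full matrix for the combined columns:
# sketch: correlations with p-values via Hmisc
library(Hmisc)
rc <- rcorr(as.matrix(x), as.matrix(y))
rc$r # correlation coefficients
rc$P # p-values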

Multiple (Linear) Regression

Fitting the Model

# Multiple Linear Regression Example
fit <- lm(y ~ x1 + x2 + x3, data=mydata)
summary(fit) # show results

# Other useful functions
coefficients(fit) # model coefficients
confint(fit, level=0.95) # CIs for model parameters
fitted(fit) # predicted values
residuals(fit) # residuals
anova(fit) # anova table
vcov(fit) # covariance matrix for model parameters
influence(fit) # regression diagnostics

Diagnostic Plots provide checks for heteroscedasticity, normality, and influential observations.
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
Comparing Models: We can compare nested models with the anova( ) function. The following code provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models (both must be fit to the same data)
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)

Logistic Regression

Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables.
# Logistic Regression where F is a binary factor and x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3, data=mydata, family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals

Note: One can use anova(fit1, fit2, test="Chisq") to compare nested models, as sketched below. Additionally, cdplot(F~x, data=mydata) will display the conditional density plot of the binary outcome F on the continuous x variable.
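A minimal sketch, assuming a hypothetical reduced model that drops x3 from the fit above:
# likelihood ratio (chi-square) test for the contribution of x3
fit2 <- glm(F~x1+x2, data=mydata, family=binomial())
anova(fit2, fit, test="Chisq")
# conditional density of F against x1
cdplot(F~x1, data=mydata)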


Time Series

Creating a time series


The ts() function will convert a numeric vector into an R time series object. The format is ts(vector, start=, end=, frequency=) where start and end are the times of the first and last observation and frequency is the number of observations per unit time (1=annual, 4=quarterly, 12=monthly, etc.).
# save a numeric vector containing 72 monthly observations
# from Jan 2009 to Dec 2014 as a time series object
myts <- ts(myvector, start=c(2009, 1), end=c(2014, 12), frequency=12)
# subset the time series (June 2014 to December 2014)
myts2 <- window(myts, start=c(2014, 6), end=c(2014, 12))

# plot series
plot(myts)
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the stl() function. Note that a series with multiplicative effects can often be transformed into a series with additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="periodic")
plot(fit)
# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)
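The decomposed components themselves can be inspected directly; in an stl fit they are stored in the time.series element:
# seasonal, trend, and remainder columns of the decomposition
head(fit$time.series)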

Principal Component Analysis

The princomp( ) function produces an unrotated principal component analysis.

# Principal Components Analysis
# entering raw data and extracting PCs from the correlation matrix
fit <- princomp(mydata, cor=TRUE)
summary(fit) # print variance accounted for
loadings(fit) # pc loadings
plot(fit, type="lines") # scree plot
fit$scores # the principal components
biplot(fit) # biplot of the first two components
Note: Use cor=FALSE to base the principal components on the covariance matrix. Use the covmat= option to enter a correlation or covariance matrix directly (a sketch follows). If entering a covariance matrix, include the option n.obs=.
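A minimal sketch of the covmat= route; cov.wt( ) is used here because it returns the covariance matrix together with n.obs (scores are not computed when no raw data are supplied):
# PCA from a covariance list rather than raw data
fit <- princomp(covmat=cov.wt(mydata))
summary(fit)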

The principal( ) function in the psych package can be used to extract and rotate principal components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
Note: mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is used. rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".

Factor Analysis

The factanal( ) function produces maximum likelihood factor analysis.

# Maximum Likelihood Factor Analysis
# entering raw data and extracting 3 factors, with varimax rotation
fit <- factanal(mydata, 3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)
# plot factor 1 by factor 2
load <- fit$loadings[,1:2]
plot(load, type="n") # set up empty plot
text(load, labels=names(mydata), cex=.7) # add variable names
Note: Use the covmat= option to enter a correlation or covariance matrix directly. If entering a covariance matrix, include the option n.obs=.
The psych package offers a number of factor analysis functions, including factor.pa( ) for principal axis factoring.
# Principal Axis Factor Analysis
library(psych)
fit <- factor.pa(mydata, nfactors=3, rotate="varimax")
fit # print results
Determining the Number of Factors to Extract
# Determine Number of Factors to Extract
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)

Cluster Analysis

Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables

Partitioning
K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters.
# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
aggregate(mydata, by=list(fit$cluster), FUN=mean) # get cluster means
mydata <- data.frame(mydata, fit$cluster) # append cluster assignment

Cluster Analysis

(continued)

Hierarchical Agglomerative
# Ward Hierarchical Clustering
d <- dist(mydata, method="euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")

# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward", method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)


Cluster Analysis

(continued)

Plotting Cluster Solutions


# K-Means Clustering with 5 clusters
fit <- kmeans(mydata, 5)
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)
# Centroid Plot against 1st 2 discriminant functions
library(fpc)
plotcluster(mydata, fit$cluster)


THANK YOU

