Module - 4 (R Training) - Basic Stats & Modeling
Program: KPO Training
Module: 4
Session: 7 and 8
Outline
Descriptive Statistics
Frequencies and Crosstabs
Correlations
Multiple Linear Regression
Logistic Regression
Time Series
Principal Component
Factor Analysis
Cluster Analysis
Descriptive Statistics
R provides a wide range of functions for obtaining summary statistics. One method of
obtaining descriptive statistics is to use the sapply( ) function with a specified summary
statistic.
# get means for variables in data frame mydata
# excluding missing values
sapply(mydata, mean, na.rm=TRUE)
- Possible functions used in sapply include mean, sd, var, min, max, median, range, and
quantile.
# mean, median, 25th and 75th percentiles, min, max
summary(mydata)
# n, nmiss, unique, mean, 5, 10, 25, 50, 75, 90, 95th percentiles
# 5 lowest and 5 highest scores
library(Hmisc)
describe(mydata)
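For instance, on the built-in mtcars data frame (used here purely as an illustration):
# descriptive statistics on the built-in mtcars data
sapply(mtcars, median, na.rm=TRUE) # per-column medians
summary(mtcars) # mean, median, quartiles, min, max per column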
Frequencies
R provides many methods for creating frequency and contingency tables. Three are
described below. In the following examples, assume that A, B, and C represent categorical
variables.
# 2-Way Frequency Table
attach(mydata)
mytable <- table(A,B) # A will be rows, B will be columns
mytable # print table
margin.table(mytable, 1) # A frequencies (summed over B)
margin.table(mytable, 2) # B frequencies (summed over A)
prop.table(mytable) # cell percentages
prop.table(mytable, 1) # row percentages
prop.table(mytable, 2) # column percentages
table( ) can also generate multidimensional tables based on 3 or more categorical
variables. In this case, use the ftable( ) function to print the results more attractively.
# 3-Way Frequency Table
mytable <- table(A, B, C)
ftable(mytable)
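As a self-contained illustration on built-in data (mtcars, with cyl and gear playing the roles of A and B):
# 2-way table of cylinders by gears in mtcars
mytable <- table(mtcars$cyl, mtcars$gear)
prop.table(mytable, 1) # row percentages: gear mix within each cyl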
Crosstabs
xtabs : The xtabs( ) function allows you to create crosstabulations using formula-style input.
# 3-Way Frequency Table
mytable <- xtabs(~A+B+C, data=mydata)
ftable(mytable) # print table
summary(mytable) # chi-square test of independence
CrossTable :
# 2-Way Cross Tabulation
library(gmodels)
CrossTable(mydata$myrowvar, mydata$mycolvar)
Note: table( ) ignores missing values by default. To include NA as a category in counts, include the table
option exclude=NULL if the variable is a vector. If the variable is a factor, you have to create a new factor
using newfactor <- factor(oldfactor, exclude=NULL).
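A formal test of independence can also be run directly on any 2-way table with base R's chisq.test( ); a minimal sketch using the mytable object built above:
# chi-square test of independence on a 2-way table
chisq.test(mytable)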
Correlations
Use the cor( ) function to produce correlations and the cov( ) function to produce covariances.
A simplified format is cor(x, use=, method= ), where use= specifies the handling of missing data
("everything", "all.obs", "complete.obs", or "pairwise.complete.obs") and method= specifies the type
of correlation ("pearson", "kendall", or "spearman").
Example:
# Correlations/covariances among numeric variables in data frame mtcars,
# using listwise deletion of missing data
cor(mtcars, use="complete.obs", method="kendall")
cov(mtcars, use="complete.obs")
Use the format cor(X, Y) or rcorr(X, Y) to generate correlations between the columns of X and the
columns of Y. This is similar to the VAR and WITH statements in SAS PROC CORR.
# Correlation matrix from mtcars with mpg, cyl, and disp as rows and hp, drat, and wt as columns
x <- mtcars[1:3]
y <- mtcars[4:6]
cor(x, y)
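rcorr( ) comes from the Hmisc package and additionally reports significance levels; a minimal sketch (note that it expects a matrix):
# correlations with p-values via Hmisc
library(Hmisc)
rcorr(as.matrix(mtcars), type="pearson") # returns r, n, and P matrices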
Multiple Linear Regression
Diagnostic Plots provide checks for heteroscedasticity, normality, and influential observations.
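The plots below assume a fitted linear model; a minimal sketch (the formula and mydata are illustrative):
# fit a multiple linear regression (illustrative formula)
fit <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
summary(fit) # coefficients, R-squared, overall F test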
# diagnostic plots
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit)
Comparing Models : We can compare nested models with the anova( ) function. The following code
provides a simultaneous test that x3 and x4 add to linear prediction above and beyond x1 and x2.
# compare models
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)
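For models that are not nested, an information criterion offers an alternative comparison; AIC( ) is base R (lower values are better):
# compare models by Akaike's information criterion
AIC(fit1, fit2)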
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome from a set of continuous
predictor variables.
# Logistic Regression where F is a binary factor and x1-x3 are continuous predictors
fit <- glm(F~x1+x2+x3, data=mydata, family=binomial())
summary(fit) # display results
confint(fit) # 95% CI for the coefficients
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals
Note : One can use anova(fit1,fit2, test="Chisq") to compare nested models. Additionally, cdplot(F~x,
data=mydata) will display the conditional density plot of the binary outcome F on the continuous x variable.
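As a self-contained illustration on built-in data (mtcars, where am is the binary transmission indicator):
# logistic regression of transmission type on weight and horsepower
fit <- glm(am ~ wt + hp, data=mtcars, family=binomial())
exp(coef(fit)) # odds ratios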
Time Series
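The examples below assume myts is a time series object; a minimal sketch of building one from a numeric vector (myvector, the start date, and the monthly frequency are all illustrative):
# turn a numeric vector of monthly observations into a time series
myts <- ts(myvector, start=c(2009, 1), frequency=12)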
# plot series
plot(myts)
Seasonal Decomposition
A time series with additive trend, seasonal, and irregular components can be decomposed using the
stl( ) function. Note that a series with multiplicative effects can often be transformed into a series with
additive effects through a log transformation (i.e., newts <- log(myts)).
# Seasonal decomposition
fit <- stl(myts, s.window="period")
plot(fit)
# additional plots
monthplot(myts)
library(forecast)
seasonplot(myts)
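To project a decomposed series forward, base R's HoltWinters( ) is one option; a minimal sketch (the 12-period horizon is illustrative):
# exponential smoothing fit and a 12-period-ahead forecast
fit <- HoltWinters(myts)
predict(fit, n.ahead=12)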
Principal Component Analysis
The principal( ) function in the psych package can be used to extract and rotate principal
components.
# Varimax Rotated Principal Components
# retaining 5 components
library(psych)
fit <- principal(mydata, nfactors=5, rotate="varimax")
fit # print results
Note: mydata can be a raw data matrix or a covariance matrix. Pairwise deletion of missing data is
used. rotate can be "none", "varimax", "quartimax", "promax", "oblimin", "simplimax", or "cluster".
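Base R's prcomp( ) offers an unrotated alternative plus a scree plot for choosing the number of components; a minimal sketch (assuming mydata is numeric):
# unrotated principal components with base R
pc <- prcomp(mydata, scale.=TRUE) # scale.=TRUE standardizes variables
summary(pc) # proportion of variance explained
screeplot(pc, type="lines") # scree plot to pick components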
Factor Analysis
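Maximum-likelihood factor analysis is available in base R through factanal( ); a minimal sketch (the 3-factor choice and varimax rotation are illustrative):
# maximum-likelihood factor analysis with varimax rotation
fit <- factanal(mydata, factors=3, rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE) # show loadings above .3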
Cluster Analysis
Data Preparation
Prior to clustering data, you may want to remove or estimate missing data and rescale variables for
comparability.
# Prepare Data
mydata <- na.omit(mydata) # listwise deletion of missing
mydata <- scale(mydata) # standardize variables
Partitioning
K-means clustering is the most popular partitioning method. It requires the analyst to specify the
number of clusters to extract. A plot of the within groups sum of squares by number of clusters
extracted can help determine the appropriate number of clusters.
# Determine number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(mydata, 5) # 5 cluster solution
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, fit$cluster)
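Because kmeans( ) starts from random centers, results can vary from run to run; seeding the generator and using multiple starts makes the solution reproducible and more stable (nstart=25 is an illustrative choice):
# reproducible k-means with multiple random starts
set.seed(1234)
fit <- kmeans(mydata, 5, nstart=25)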
Cluster Analysis
(continued..)
Hierarchical Agglomerative
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # use method="ward.D" in newer versions of R
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
# Ward Hierarchical Clustering with Bootstrapped p values
library(pvclust)
fit <- pvclust(mydata, method.hclust="ward", method.dist="euclidean")
plot(fit) # dendrogram with p values
# add rectangles around groups highly supported by the data
pvrect(fit, alpha=.95)
Note: pvclust clusters the columns (variables), not the rows; cluster observations by transposing the
data first with t(mydata).
THANK YOU