R Code For Discriminant and Cluster Analysis


DISCRIMINANT ANALYSIS

- Multivariate normality: since we are dealing with multiple features, the technique assumes that the features are normally distributed within each class. This also makes the technique sensitive to outliers and to group sizes: if the groups are badly imbalanced and one group is much smaller or larger than the others, the technique struggles to classify data points into that ‘outlier’ class.
- Homoscedasticity: the variance of the features is the same across all classes of the target variable.
- Random sampling: the features are assumed to be sampled randomly.
- Absence of multicollinearity: if the predictor variables are highly correlated with each other, predictive ability decreases. (A rough way to check some of these assumptions in R is sketched below.)
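
As a rough, illustrative sketch (not part of the original text), the checks below use base R on the iris data introduced in Step 2: pairwise correlations flag possible multicollinearity, a per-group Shapiro-Wilk test gives only a crude univariate proxy for multivariate normality, and the class counts reveal any group-size imbalance.

#check pairwise correlations between the predictors (high values suggest multicollinearity)
cor(iris[1:4])

#rough per-class univariate normality check for one feature (a crude proxy only)
tapply(iris$Sepal.Length, iris$Species, function(v) shapiro.test(v)$p.value)

#check whether the class sizes are badly imbalanced
table(iris$Species)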

library(MASS)
library(ggplot2)

Step 2: Load the Data


For this example, we’ll use the built-in iris dataset
in R. The following code shows how to load and
view this dataset:
#attach iris dataset to make it easy to work with
attach(iris)

#view structure of dataset


str(iris)

Step 3: Scale the Data


One of the key assumptions of linear discriminant analysis is that each of the predictor variables has the same variance. An easy way to ensure that this assumption is met is to scale each variable so that it has a mean of 0 and a standard deviation of 1.
We can quickly do so in R by using
the scale() function:
#scale each predictor variable (i.e. first 4 columns)
iris[1:4] <- scale(iris[1:4])

Step 4: Create Training and Test Samples


Next, we’ll split the dataset into a training set to
train the model on and a testing set to test the
model on:
#make this example reproducible
set.seed(1)

#Use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(iris), replace=TRUE, prob=c(0.7,0.3))
train <- iris[sample, ]
test <- iris[!sample, ]

Step 5: Fit the LDA Model


Next, we’ll use the lda() function from
the MASS package to fit the LDA model to our
data:
#fit LDA model
model <- lda(Species~., data=train)

#view model output


model

Call:
lda(Species ~ ., data = train)

Prior probabilities of groups:


setosa versicolor virginica
0.3207547 0.3207547 0.3584906

Group means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa -1.0397484 0.8131654 -1.2891006 -1.2570316
versicolor 0.1820921 -0.6038909 0.3403524 0.2208153
virginica 0.9582674 -0.1919146 1.0389776 1.1229172

Coefficients of linear discriminants:


LD1 LD2
Sepal.Length 0.7922820 0.5294210
Sepal.Width 0.5710586 0.7130743
Petal.Length -4.0762061 -2.7305131
Petal.Width -2.0602181 2.6326229

Proportion of trace:
LD1 LD2
0.9921 0.0079

Step 6: Use the Model to Make Predictions


Once we’ve fit the model using our training data,
we can use it to make predictions on our test data:
#use LDA model to make predictions on test data
predicted <- predict(model, test)

names(predicted)

[1] "class" "posterior" "x"

This returns a list with three variables:

- class: The predicted class
- posterior: The posterior probability that an observation belongs to each class
- x: The linear discriminants
We can quickly view each of these results for the first six observations in our test dataset:
#view predicted class for first six observations in test set
head(predicted$class)

[1] setosa setosa setosa setosa setosa setosa


Levels: setosa versicolor virginica

#view posterior probabilities for first six observations in test set


head(predicted$posterior)

setosa versicolor virginica


4 1 2.425563e-17 1.341984e-35
6 1 1.400976e-21 4.482684e-40
7 1 3.345770e-19 1.511748e-37
15 1 6.389105e-31 7.361660e-53
17 1 1.193282e-25 2.238696e-45
18 1 6.445594e-22 4.894053e-41

#view linear discriminants for first six observations in test set


head(predicted$x)

LD1 LD2
4 7.150360 -0.7177382
6 7.961538 1.4839408
7 7.504033 0.2731178
15 10.170378 1.9859027
17 8.885168 2.1026494
18 8.113443 0.7563902
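
A natural follow-up, sketched here as an illustration rather than as part of the original output, is to measure accuracy on the test set and plot the test observations on the two discriminants (ggplot2 was loaded at the top but not otherwise used):

#proportion of test observations classified correctly
mean(predicted$class == test$Species)

#confusion matrix of predicted vs. actual species
table(Predicted = predicted$class, Actual = test$Species)

#plot the test observations on the two linear discriminants, colored by species
lda_plot <- cbind(test, predicted$x)
ggplot(lda_plot, aes(LD1, LD2, color = Species)) +
  geom_point()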

https://www.statology.org/linear-discriminant-analysis-in-r/
K-Means Clustering in R: Step-by-Step Example

Clustering is a technique in machine learning that attempts to find clusters of observations within a dataset.

The goal is to find clusters such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.

Clustering is a form of unsupervised learning because we’re simply attempting to find structure within a dataset rather than predicting the value of some response variable.

Clustering is often used in marketing when companies have access to information like:

- Household income
- Household size
- Head of household occupation
- Distance from nearest urban area

When this information is available, clustering can be used to identify households that are similar and may be more likely to purchase certain products or respond better to a certain type of advertising.
One of the most common forms of clustering is
known as k-means clustering.
What is K-Means Clustering?
K-means clustering is a technique in which we
place each observation in a dataset into one
of K clusters.
The end goal is to have K clusters in which the
observations within each cluster are quite similar
to each other while the observations in different
clusters are quite different from each other.
In practice, we use the following steps to perform
K-means clustering:
1. Choose a value for K. First, we must decide how many clusters we’d like to identify in the data. Often we simply have to test several different values for K and analyze the results to see which number of clusters seems to make the most sense for a given problem.
2. Randomly assign each observation to an initial cluster, from 1 to K.
3. Perform the following procedure until the cluster assignments stop changing:
   - For each of the K clusters, compute the cluster centroid. This is simply the vector of the p feature means for the observations in the kth cluster.
   - Assign each observation to the cluster whose centroid is closest. Here, closest is defined using Euclidean distance.

A minimal manual sketch of steps 2 and 3 appears after this list.
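
To make the procedure concrete, here is a minimal, illustrative base R sketch of one assignment pass on a small random matrix. It is not part of the original tutorial; the object names (X, cluster_id, centroids) are arbitrary, and the code assumes both clusters receive at least one observation.

#toy data: 10 observations with 2 features
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)

#step 2: randomly assign each observation to cluster 1 or 2
cluster_id <- sample(1:2, nrow(X), replace = TRUE)

#step 3a: compute each cluster centroid (the vector of feature means per cluster)
centroids <- apply(X, 2, function(col) tapply(col, cluster_id, mean))

#step 3b: reassign each observation to the closest centroid (Euclidean distance)
d <- as.matrix(dist(rbind(centroids, X)))[-(1:2), 1:2]
new_cluster_id <- apply(d, 1, which.min)
new_cluster_id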
K-Means Clustering in R
The following tutorial provides a step-by-step
example of how to perform k-means clustering in R.
Step 1: Load the Necessary Packages
First, we’ll load two packages that contain several
useful functions for k-means clustering in R.
library(factoextra)
library(cluster)
Step 2: Load and Prep the Data
For this example we’ll use the USArrests dataset
built into R, which contains the number of arrests
per 100,000 residents in each U.S. state in 1973
for Murder, Assault, and Rape along with the
percentage of the population in each state living in
urban areas, UrbanPop.
The following code shows how to do the following:
- Load the USArrests dataset
- Remove any rows with missing values
- Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1

#load data
df <- USArrests

#remove rows with missing values


df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1


df <- scale(df)

#view first six rows of dataset


head(df)

Murder Assault UrbanPop Rape


Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
Arizona 0.07163341 1.4788032 0.9989801 1.042878388
Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144 1.7589234 2.067820292
Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Step 3: Find the Optimal Number of Clusters
To perform k-means clustering in R we can use the
built-in kmeans() function, which uses the following
syntax:
kmeans(data, centers, nstart)

where:

- data: Name of the dataset.
- centers: The number of clusters, denoted k.
- nstart: The number of initial configurations. Because different initial cluster assignments can lead to different results, it’s recommended to try several initial configurations; kmeans() keeps the run that leads to the smallest total within-cluster variation, as the sketch below illustrates.
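
As an illustrative sketch (assuming the scaled df created in Step 2, and using 4 clusters purely for demonstration), a single random start can land in a worse local optimum than 25 starts:

#compare one random start with 25 random starts
set.seed(2)
km_one   <- kmeans(df, centers = 4, nstart = 1)
km_multi <- kmeans(df, centers = 4, nstart = 25)

#total within-cluster sum of squares; the 25-start fit should be equal or lower
km_one$tot.withinss
km_multi$tot.withinss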
Since we don’t know the optimal number of clusters beforehand, we’ll create two different plots that can help us decide:
1. Number of Clusters vs. the Total Within Sum of Squares
First, we’ll use the fviz_nbclust() function to create a
plot of the number of clusters vs. the total within
sum of squares:
fviz_nbclust(df, kmeans, method = "wss")
Typically when we create this type of plot we look
for an “elbow” where the sum of squares begins to
“bend” or level off. This is typically the optimal
number of clusters.
For this plot it appears that there is a bit of an
elbow or “bend” at k = 4 clusters.
2. Number of Clusters vs. Gap Statistic
Another way to determine the optimal number of
clusters is to use a metric known as the gap statistic,
which compares the total intra-cluster variation for
different values of k with their expected values for
a distribution with no clustering.
We can calculate the gap statistic for each number
of clusters using the clusGap() function from
the cluster package along with a plot of clusters vs.
gap statistic using the fviz_gap_stat() function:
#calculate gap statistic based on number of clusters
gap_stat <- clusGap(df,
FUN = kmeans,
nstart = 25,
K.max = 10,
B = 50)

#plot number of clusters vs. gap statistic


fviz_gap_stat(gap_stat)

From the plot we can see that the gap statistic is highest at k = 4 clusters, which matches the elbow method we used earlier.
Step 4: Perform K-Means Clustering with Optimal K
Lastly, we can perform k-means clustering on the
dataset using the optimal value for k of 4:
#make this example reproducible
set.seed(1)

#perform k-means clustering with k = 4 clusters


km <- kmeans(df, centers = 4, nstart = 25)

#view results
km

K-means clustering with 4 clusters of sizes 16, 13, 13, 8

Cluster means:
Murder Assault UrbanPop Rape
1 -0.4894375 -0.3826001 0.5758298 -0.26165379
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3 0.6950701 1.0394414 0.7226370 1.27693964
4 1.4118898 0.8743346 -0.8145211 0.01927104

Clustering vector:
Alabama Alaska Arizona Arkansas California Colorado
4 3 3 4 3 3
Connecticut Delaware Florida Georgia Hawaii Idaho
1 1 3 4 1 2
Illinois Indiana Iowa Kansas Kentucky Louisiana
3 1 2 1 2 4
Maine Maryland Massachusetts Michigan Minnesota Mississippi
2 3 1 3 2 4
Missouri Montana Nebraska Nevada New Hampshire New Jersey
3 2 2 3 2 1
New Mexico New York North Carolina North Dakota Ohio Oklahoma
3 3 4 2 1 1
Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee
1 1 1 4 2 4
Texas Utah Vermont Virginia Washington West Virginia
3 1 2 1 1 2
Wisconsin Wyoming
2 1

Within cluster sum of squares by cluster:


[1] 16.212213 11.952463 19.922437 8.316061
(between_SS / total_SS = 71.2 %)

Available components:

[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"


[7] "size" "iter" "ifault"
From the results we can see that:

- 16 states were assigned to the first cluster
- 13 states were assigned to the second cluster
- 13 states were assigned to the third cluster
- 8 states were assigned to the fourth cluster

We can visualize the clusters on a scatterplot that displays the first two principal components on the axes using the fviz_cluster() function:
#plot results of final k-means model
fviz_cluster(km, data = df)

We can also use the aggregate() function to find the mean of the variables in each cluster:
#find means of each cluster
aggregate(USArrests, by=list(cluster=km$cluster), mean)

cluster Murder Assault UrbanPop Rape
1 3.60000 78.53846 52.07692 12.17692
2 10.81538 257.38462 76.00000 33.19231
3 5.65625 138.87500 73.87500 18.78125
4 13.93750 243.62500 53.75000 21.41250
We interpret this output as follows:

- The mean number of murders per 100,000 citizens among the states in cluster 1 is 3.6.
- The mean number of assaults per 100,000 citizens among the states in cluster 1 is 78.5.
- The mean percentage of residents living in an urban area among the states in cluster 1 is 52.1%.
- The mean number of rapes per 100,000 citizens among the states in cluster 1 is 12.2.

And so on.
We can also append the cluster assignments of
each state back to the original dataset:
#add cluster assignment to original data
final_data <- cbind(USArrests, cluster = km$cluster)

#view final data


head(final_data)

Murder Assault UrbanPop Rape cluster
Alabama 13.2 236 58 21.2 4
Alaska 10.0 263 48 44.5 2
Arizona 8.1 294 80 31.0 2
Arkansas 8.8 190 50 19.5 4
California 9.0 276 91 40.6 2
Colorado 7.9 204 78 38.7 2
Pros & Cons of K-Means Clustering
K-means clustering offers the following benefits:

- It is a fast algorithm.
- It can handle large datasets well.

However, it comes with the following potential drawbacks:

- It requires us to specify the number of clusters before running the algorithm.
- It’s sensitive to outliers.

Two alternatives to k-means clustering are k-medoids clustering and hierarchical clustering.
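
As a brief illustrative aside (not part of the original tutorial), k-medoids clustering is available through the pam() function in the cluster package loaded earlier; assuming the scaled df from Step 2, a minimal call might look like this:

#fit k-medoids (PAM) with 4 clusters on the scaled data
kmed <- pam(df, k = 4)

#number of states assigned to each medoid-based cluster
table(kmed$clustering)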
Hierarchical Clustering in R: Step-by-Step Example

Clustering is a technique in machine learning that attempts to find groups or clusters of observations within a dataset such that the observations within each cluster are quite similar to each other, while observations in different clusters are quite different from each other.

Clustering is a form of unsupervised learning because we’re simply attempting to find structure within a dataset rather than predicting the value of some response variable.

Clustering is often used in marketing when companies have access to information like:

- Household income
- Household size
- Head of household occupation
- Distance from nearest urban area

When this information is available, clustering can be used to identify households that are similar and may be more likely to purchase certain products or respond better to a certain type of advertising.
One of the most common forms of clustering is
known as k-means clustering. Unfortunately this
method requires us to pre-specify the number of
clusters K.
An alternative to this method is known
as hierarchical clustering, which does not require us to
pre-specify the number of clusters to be used and
is also able to produce a tree-based representation
of the observations known as a dendrogram.
What is Hierarchical Clustering?
Similar to k-means clustering, the goal of
hierarchical clustering is to produce clusters of
observations that are quite similar to each other
while the observations in different clusters are
quite different from each other.
In practice, we use the following steps to perform
hierarchical clustering:
1. Calculate the pairwise dissimilarity between each pair of observations in the dataset.
   - First, we must choose some distance metric – like the Euclidean distance – and use this metric to compute the dissimilarity between each pair of observations in the dataset.
   - For a dataset with n observations, there will be a total of n(n-1)/2 pairwise dissimilarities (see the short check after this list).
2. Fuse observations into clusters.
   - At each step in the algorithm, fuse together the two observations (or clusters) that are most similar into a single cluster.
   - Repeat this procedure until all observations are members of one large cluster. The end result is a tree, which can be plotted as a dendrogram.
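
As a quick illustrative check (not part of the original tutorial), dist() in base R returns exactly n(n-1)/2 pairwise dissimilarities; for the 50 states in the scaled USArrests data used below, that is 50*49/2 = 1225:

#pairwise Euclidean dissimilarities for the scaled USArrests data
d <- dist(scale(USArrests), method = "euclidean")

#number of pairwise dissimilarities: n(n-1)/2 = 50*49/2 = 1225
length(d)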
To determine how close together two clusters are, we can use a few different methods, including:

- Complete linkage clustering: Find the maximum distance between points belonging to two different clusters.
- Single linkage clustering: Find the minimum distance between points belonging to two different clusters.
- Mean linkage clustering: Find all pairwise distances between points belonging to two different clusters and then calculate the average.
- Centroid linkage clustering: Find the centroid of each cluster and calculate the distance between the centroids of two different clusters.
- Ward’s minimum variance method: Minimize the total within-cluster variance; at each step, fuse the pair of clusters whose merger increases it the least.

Depending on the structure of the dataset, one of these methods may tend to produce better (i.e. more compact) clusters than the others.
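
As a small illustrative sketch (not from the original tutorial), we can see how the linkage choice changes the result by cutting base R hclust() trees built with complete versus single linkage into four clusters on the scaled USArrests data:

#distance matrix for the scaled data (as in the previous sketch)
d <- dist(scale(USArrests), method = "euclidean")

#cluster sizes when the tree is cut into 4 clusters under each linkage
table(cutree(hclust(d, method = "complete"), k = 4))
table(cutree(hclust(d, method = "single"), k = 4))   #single linkage often chains observations into one large cluster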
Hierarchical Clustering in R
The following tutorial provides a step-by-step
example of how to perform hierarchical clustering
in R.
Step 1: Load the Necessary Packages
First, we’ll load two packages that contain several
useful functions for hierarchical clustering in R.
library(factoextra)
library(cluster)
Step 2: Load and Prep the Data
For this example we’ll use the USArrests dataset
built into R, which contains the number of arrests
per 100,000 residents in each U.S. state in 1973
for Murder, Assault, and Rape along with the
percentage of the population in each state living in
urban areas, UrbanPop.
The following code shows how to do the following:
- Load the USArrests dataset
- Remove any rows with missing values
- Scale each variable in the dataset to have a mean of 0 and a standard deviation of 1

#load data
df <- USArrests

#remove rows with missing values


df <- na.omit(df)

#scale each variable to have a mean of 0 and sd of 1


df <- scale(df)

#view first six rows of dataset


head(df)

Murder Assault UrbanPop Rape


Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
Arizona 0.07163341 1.4788032 0.9989801 1.042878388
Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144 1.7589234 2.067820292
Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Step 3: Find the Linkage Method to Use
To perform hierarchical clustering in R we can use
the agnes() function from the cluster package, which
uses the following syntax:
agnes(data, method)
where:
- data: Name of the dataset.
- method: The method to use to calculate dissimilarity between clusters.
Since we don’t know beforehand which method will
produce the best clusters, we can write a short
function to perform hierarchical clustering using
several different methods.
Note that this function calculates the agglomerative coefficient of each method, which is a metric that measures the strength of the clusters. The closer this value is to 1, the stronger the clusters.
#define linkage methods
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

#function to compute agglomerative coefficient


ac <- function(x) {
agnes(df, method = x)$ac
}

#calculate agglomerative coefficient for each clustering linkage method


sapply(m, ac)

average single complete ward


0.7379371 0.6276128 0.8531583 0.9346210
We can see that Ward’s minimum variance method
produces the highest agglomerative coefficient,
thus we’ll use that as the method for our final
hierarchical clustering:
#perform hierarchical clustering using Ward's minimum variance
clust <- agnes(df, method = "ward")

#produce dendrogram
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")
Each leaf at the bottom of the dendrogram
represents an observation in the original dataset.
As we move up the dendrogram from the bottom,
observations that are similar to each other are
fused together into a branch.
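
Optionally, and assuming the clust object fitted above, the clusters chosen later (k = 4 in Step 4) can be outlined directly on the dendrogram by converting the agnes fit to an hclust object; this is an illustrative addition, not part of the original tutorial:

#redraw the dendrogram and outline the 4 clusters chosen in Step 4
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")
rect.hclust(as.hclust(clust), k = 4, border = 2:5)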
Step 4: Determine the Optimal Number of Clusters
To determine how many clusters the observations
should be grouped in, we can use a metric known
as the gap statistic, which compares the total intra-
cluster variation for different values of k with their
expected values for a distribution with no
clustering.
We can calculate the gap statistic for each number
of clusters using the clusGap() function from
the cluster package along with a plot of clusters vs.
gap statistic using the fviz_gap_stat() function:
#calculate gap statistic for each number of clusters (up to 10 clusters)
gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)

#produce plot of clusters vs. gap statistic


fviz_gap_stat(gap_stat)

From the plot we can see that the gap statistic is highest at k = 4 clusters. Thus, we’ll choose to group our observations into 4 distinct clusters.
Step 5: Apply Cluster Labels to Original Dataset
To actually add cluster labels to each observation
in our dataset, we can use the cutree() method to
cut the dendrogram into 4 clusters:
#compute distance matrix
d <- dist(df, method = "euclidean")
#perform hierarchical clustering using Ward's method
final_clust <- hclust(d, method = "ward.D2" )

#cut the dendrogram into 4 clusters


groups <- cutree(final_clust, k=4)

#find number of observations in each cluster


table(groups)

1 2 3 4
7 12 19 12
We can then append the cluster labels of each
state back to the original dataset:
#append cluster labels to original data
final_data <- cbind(USArrests, cluster = groups)

#display first six rows of final data


head(final_data)

Murder Assault UrbanPop Rape cluster
Alabama 13.2 236 58 21.2 1
Alaska 10.0 263 48 44.5 2
Arizona 8.1 294 80 31.0 2
Arkansas 8.8 190 50 19.5 3
California 9.0 276 91 40.6 2
Colorado 7.9 204 78 38.7 2
Lastly, we can use the aggregate() function to find
the mean of the variables in each cluster:
#find mean values for each cluster
aggregate(final_data, by=list(cluster=final_data$cluster), mean)

cluster Murder Assault UrbanPop Rape cluster


1 1 14.671429 251.2857 54.28571 21.68571 1
2 2 10.966667 264.0000 76.50000 33.60833 2
3 3 6.210526 142.0526 71.26316 19.18421 3
4 4 3.091667 76.0000 52.08333 11.83333 4
We interpret this output as follows:

- The mean number of murders per 100,000 citizens among the states in cluster 1 is 14.67.
- The mean number of assaults per 100,000 citizens among the states in cluster 1 is 251.28.
- The mean percentage of residents living in an urban area among the states in cluster 1 is 54.28%.
- The mean number of rapes per 100,000 citizens among the states in cluster 1 is 21.68.
