R Code For Discriminant and Cluster Analysis
Since linear discriminant analysis (LDA) deals with multiple features, the first assumption
the technique makes is multivariate normality: the features are assumed to be
normally distributed within each class. This also implies that the technique is
susceptible to outliers and sensitive to group sizes. If the group sizes are
imbalanced and one group is much smaller or larger than the others, the
technique suffers when classifying data points into that 'outlier' class.
The second assumption is homoscedasticity: the variance of each feature is the
same across all classes of the grouping variable.
We also assume that the observations are sampled randomly.
The final assumption is the absence of multicollinearity: if the predictor
variables are highly correlated with each other, predictive ability
decreases. These assumptions can be screened informally in R, as sketched below.
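A minimal sketch of such checks on the iris data used below; these calls (shapiro.test(), bartlett.test(), cor()) are standard base-R checks added for illustration, not part of the original tutorial:

#check univariate normality of a feature within each class
#(a rough univariate proxy for multivariate normality)
with(iris, tapply(Sepal.Length, Species, function(x) shapiro.test(x)$p.value))

#check that the variance of a feature is similar across classes (homoscedasticity)
bartlett.test(Sepal.Length ~ Species, data = iris)

#check for multicollinearity among the predictors
cor(iris[, 1:4])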
#load necessary packages
library(MASS)
library(ggplot2)
#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and remaining 30% as testing set
sample <- sample(c(TRUE, FALSE), nrow(iris), replace=TRUE, prob=c(0.7,0.3))
train <- iris[sample, ]
test <- iris[!sample, ]
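The model-fitting call itself is not shown above; the standardized group means in the output below suggest the predictors were scaled first, e.g. with iris[1:4] <- scale(iris[1:4]) run before the train/test split. With that assumed, a minimal sketch of the fit:

#fit the LDA model on the training data
model <- lda(Species ~ ., data = train)

#view model output
model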
Call:
lda(Species ~ ., data = train)
Group means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa -1.0397484 0.8131654 -1.2891006 -1.2570316
versicolor 0.1820921 -0.6038909 0.3403524 0.2208153
virginica 0.9582674 -0.1919146 1.0389776 1.1229172
Proportion of trace:
LD1 LD2
0.9921 0.0079
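The proportion of trace shows that the first linear discriminant (LD1) explains 99.21% of the between-class variance in the training data, while LD2 explains the remaining 0.79%.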
#use the fitted model to make predictions on the test set
predicted <- predict(model, test)

names(predicted)
[1] "class"     "posterior" "x"

#view the linear discriminant scores for the first six test observations
head(predicted$x)
LD1 LD2
4 7.150360 -0.7177382
6 7.961538 1.4839408
7 7.504033 0.2731178
15 10.170378 1.9859027
17 8.885168 2.1026494
18 8.113443 0.7563902
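Since ggplot2 is loaded above but not otherwise used, a natural final step is to visualize the training observations in the space of the two linear discriminants. A sketch; the lda_plot name is illustrative:

#combine the training data with its linear discriminant scores
lda_plot <- cbind(train, predict(model)$x)

#plot the observations in LD space, colored by species
ggplot(lda_plot, aes(LD1, LD2)) +
  geom_point(aes(color = Species))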
https://www.statology.org/linear-discriminant-analysis-in-r/
K-Means Clustering in R: Step-by-Step Example
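The code that creates km is not shown; a minimal sketch consistent with the output below, assuming the USArrests data is scaled and k = 4 (nstart = 25 runs the algorithm from 25 random starting centroid sets and keeps the best result):

#make this example reproducible
set.seed(1)

#remove rows with missing values and scale each variable
df <- scale(na.omit(USArrests))

#perform k-means clustering with k = 4 clusters
km <- kmeans(df, centers = 4, nstart = 25)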
#view results
km
Cluster means:
Murder Assault UrbanPop Rape
1 -0.4894375 -0.3826001 0.5758298 -0.26165379
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3 0.6950701 1.0394414 0.7226370 1.27693964
4 1.4118898 0.8743346 -0.8145211 0.01927104
Clustering vector:
Alabama Alaska Arizona Arkansas California Colorado
4 3 3 4 3 3
Connecticut Delaware Florida Georgia Hawaii Idaho
1 1 3 4 1 2
Illinois Indiana Iowa Kansas Kentucky Louisiana
3 1 2 1 2 4
Maine Maryland Massachusetts Michigan Minnesota Mississippi
2 3 1 3 2 4
Missouri Montana Nebraska Nevada New Hampshire New Jersey
3 2 2 3 2 1
New Mexico New York North Carolina North Dakota Ohio Oklahoma
3 3 4 2 1 1
Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee
1 1 1 4 2 4
Texas Utah Vermont Virginia Washington West Virginia
3 1 2 1 1 2
Wisconsin Wyoming
2 1
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
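To interpret the clusters in their original units, we can average the unscaled variables within each cluster:

#find the mean of each variable by cluster, on the original scale
aggregate(USArrests, by = list(cluster = km$cluster), mean)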
Hierarchical clustering can measure the dissimilarity between clusters
with several linkage methods (e.g. complete, single, average, or Ward's
method). Depending on the structure of the dataset, one of these methods
may tend to produce better (i.e. more compact) clusters than the others.
Hierarchical Clustering in R
The following tutorial provides a step-by-step
example of how to perform hierarchical clustering
in R.
Step 1: Load the Necessary Packages
First, we’ll load two packages that contain several
useful functions for hierarchical clustering in R.
library(factoextra)
library(cluster)
Step 2: Load and Prep the Data
For this example we’ll use the USArrests dataset
built into R, which contains the number of arrests
per 100,000 residents in each U.S. state in 1973
for Murder, Assault, and Rape along with the
percentage of the population in each state living in
urban areas, UrbanPop.
The following code shows how to load the USArrests dataset and prep it
for clustering.
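The original prep code is not included here; a minimal sketch, assuming rows with missing values are dropped and each variable is standardized (standardization is consistent with the scaled data frame df used in the gap-statistic step below):

#load the dataset and remove rows with missing values
df <- na.omit(USArrests)

#scale each variable to have a mean of 0 and a standard deviation of 1
df <- scale(df)

Step 3: Perform Hierarchical Clustering
The clust object used in the dendrogram code below also needs to be created first; a sketch assuming Ward's method via the agnes() function from the cluster package:

#perform hierarchical clustering with Ward's method
clust <- agnes(df, method = "ward")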
#produce dendrogram
pltree(clust, cex = 0.6, hang = -1, main = "Dendrogram")
Each leaf at the bottom of the dendrogram
represents an observation in the original dataset.
As we move up the dendrogram from the bottom,
observations that are similar to each other are
fused together into a branch.
Step 4: Determine the Optimal Number of Clusters
To determine how many clusters the observations
should be grouped into, we can use a metric known
as the gap statistic, which compares the total intra-
cluster variation for different values of k with the
values expected under a reference distribution with
no clustering.
We can calculate the gap statistic for each number
of clusters using the clusGap() function from
the cluster package, then plot the number of clusters
vs. the gap statistic using the fviz_gap_stat() function:
#calculate gap statistic for each number of clusters (up to 10 clusters)
gap_stat <- clusGap(df, FUN = hcut, nstart = 25, K.max = 10, B = 50)
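The plot mentioned above can then be produced:

#produce plot of clusters vs. gap statistic
fviz_gap_stat(gap_stat)

The step that cuts the dendrogram into a chosen number of clusters is not shown; a minimal sketch consistent with the cluster counts below, assuming k = 4 and Ward linkage via hclust():

#compute the distance matrix
d <- dist(df, method = "euclidean")

#perform hierarchical clustering using Ward's method
final_clust <- hclust(d, method = "ward.D2")

#cut the dendrogram into 4 clusters
groups <- cutree(final_clust, k = 4)

#count the number of states in each cluster
table(groups)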
1 2 3 4
7 12 19 12
We can then append the cluster labels of each
state back to the original dataset:
#append cluster labels to original data
final_data <- cbind(USArrests, cluster = groups)
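We can then view the first few rows of the combined data:

#view first six rows of final data
head(final_data)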