LP2 - ETL Model
Assignment No. 2
R (2)   C (4)   V (2)   T (2)   Total (10)   Dated Sign
1.1 Title:
Consider a suitable dataset. For clustering of data instances into different groups, apply different clustering
techniques (minimum two). Visualize the clusters using a suitable tool.
1.3 Prerequisite:
Basic concepts of ETL.
Knowledge of the R tool.
Use of R functions to create K-means and hierarchical clustering models.
1.7 Outcomes:
Visualize the effectiveness of the K-means and hierarchical clustering algorithms
using the graphics capabilities of R.
K-Means Clustering
K-means clustering is a type of unsupervised learning, used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data,
with the number of groups represented by the variable K. The algorithm works iteratively to assign each
data point to one of K groups based on the features that are provided, so data points are clustered by
feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering lets you find and analyze the
groups that form organically. The number of groups K can be chosen, for example, with the elbow
heuristic sketched below. Each cluster centroid is a collection of feature values that defines the
resulting group, and examining these feature values helps to interpret qualitatively what kind of
group each cluster represents.
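As a rough illustration of the elbow heuristic (the heuristic itself is an assumption of this write-up, since the text above only names the problem of choosing K), the following R sketch plots the total within-cluster sum of squares for K = 1 to 10 on a toy matrix; the bend ("elbow") in the curve suggests a reasonable K:

    # Elbow heuristic for choosing K: total within-cluster sum of
    # squares (tot.withinss) for a range of K values on toy data.
    set.seed(1)                          # reproducible toy data
    x <- matrix(rnorm(200), ncol = 2)    # 100 points, 2 features
    wss <- sapply(1:10, function(k)
      kmeans(x, centers = k, nstart = 25)$tot.withinss)
    plot(1:10, wss, type = "b",
         xlab = "Number of clusters K",
         ylab = "Total within-cluster sum of squares")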
Steps to Perform K-Means Clustering
As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores of
two variables on each of seven individuals:
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A
and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial
cluster means. Individuals 1 and 4 are the furthest apart, giving:

             Individual   Mean Vector (centroid)
Cluster 1    1            (1.0, 1.0)
Cluster 2    4            (5.0, 7.0)
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new
member is added. This leads to the following series of steps:
Step   Cluster 1 Individuals   Cluster 1 Mean (centroid)   Cluster 2 Individuals   Cluster 2 Mean (centroid)
1      1                       (1.0, 1.0)                  4                       (5.0, 7.0)
2      1, 2                    (1.2, 1.5)                  4                       (5.0, 7.0)
3      1, 2, 3                 (1.8, 2.3)                  4                       (5.0, 7.0)
4      1, 2, 3                 (1.8, 2.3)                  4, 5                    (4.2, 6.0)
5      1, 2, 3                 (1.8, 2.3)                  4, 5, 6                 (4.3, 5.7)
6      1, 2, 3                 (1.8, 2.3)                  4, 5, 6, 7              (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

             Individuals   Mean Vector (centroid)
Cluster 1    1, 2, 3       (1.8, 2.3)
Cluster 2    4, 5, 6, 7    (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster, so we compare each
individual's distance to its own cluster mean and to the mean of the opposite cluster. We find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In other
words, each individual's distance to its own cluster mean should be smaller than the distance to the other
cluster's mean, which is not the case for individual 3. Thus, individual 3 is relocated to Cluster 2,
resulting in the new partition:

             Individuals      Mean Vector (centroid)
Cluster 1    1, 2             (1.3, 1.5)
Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)

The relocation step is then repeated from this new partition; here every individual is already nearest to its
own cluster mean, so the iteration stops and this is the final cluster solution.
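The seven-subject example can be reproduced in R. The following minimal sketch types in the data matrix from the table above; kmeans() with two centers should recover the final partition {1, 2} and {3, 4, 5, 6, 7}:

    # The seven-subject worked example, reproduced with kmeans()
    x <- matrix(c(1.0, 1.0,
                  1.5, 2.0,
                  3.0, 4.0,
                  5.0, 7.0,
                  3.5, 5.0,
                  4.5, 5.0,
                  3.5, 4.5),
                ncol = 2, byrow = TRUE,
                dimnames = list(1:7, c("A", "B")))
    km <- kmeans(x, centers = 2, nstart = 10)  # several random starts
    km$cluster   # cluster label for each subject
    km$centers   # the two cluster centroids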
R Implementation
In R, K-means clustering is performed with the kmeans() function. Its main arguments are:
x: a numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric
vector or a data frame with all-numeric columns).
centers: either the number of clusters or a set of initial (distinct) cluster centers. If a number, a
random set of (distinct) rows of x is chosen as the initial centers.
nstart: if centers is a number, nstart gives the number of random sets that should be chosen.
algorithm: the algorithm to be used, one of "Hartigan-Wong", "Lloyd", "Forgy", or "MacQueen".
If no algorithm is specified, the algorithm of Hartigan and Wong is used by default.
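A minimal call using these arguments might look as follows (illustrative values on a toy matrix; only x and centers are required):

    set.seed(42)                              # reproducible random starts
    x <- matrix(rnorm(200), ncol = 2)         # toy data: 100 points, 2 features
    cl <- kmeans(x, centers = 3, nstart = 20,
                 algorithm = "Hartigan-Wong") # the default algorithm
    cl$centers                                # the K centroids
    table(cl$cluster)                         # cluster sizes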
IRIS dataset
This is perhaps the best-known data set in the pattern recognition literature. It
contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Attribute Information:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
1 Iris Setosa
2 Iris Versicolour
3 Iris Virginica
Steps
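One possible sequence of steps in R is sketched below (not the only valid workflow; choosing 3 centers matches the three known species, and the petal measurements are used only for the plot):

    data(iris)
    x <- iris[, 1:4]                           # the four numeric attributes
    set.seed(20)
    km <- kmeans(x, centers = 3, nstart = 25)  # K = 3 clusters
    table(km$cluster, iris$Species)            # compare clusters with true classes
    plot(x$Petal.Length, x$Petal.Width,
         col = km$cluster, pch = 19,
         xlab = "Petal length (cm)", ylab = "Petal width (cm)")
    points(km$centers[, 3:4], col = 1:3,
           pch = 8, cex = 2)                   # mark the cluster centroids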
Hierarchical Clustering
Given a set of N items to be clustered and an N x N distance (or similarity) matrix, the basic process of
Johnson's (1967) hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters,
each containing just one item. Let the distances (similarities) between the clusters equal the distances
(similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now
have one less cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and
average-link clustering.
R Implementation
In R, agglomerative hierarchical clustering is performed with the hclust() function. Its main arguments are:
d: a dissimilarity structure, as produced by the dist() function.
method: the agglomeration (linkage) method to be used, such as "single", "complete" (the default),
or "average".
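A small sketch showing all three linkage choices on toy data (any numeric data works; dist() builds the distance matrix that hclust() expects):

    set.seed(7)
    x <- matrix(rnorm(20), ncol = 2)               # 10 points, 2 features
    d <- dist(x)                                   # Euclidean distance matrix
    hc_single   <- hclust(d, method = "single")
    hc_complete <- hclust(d, method = "complete")  # the default method
    hc_average  <- hclust(d, method = "average")
    plot(hc_complete)                              # dendrogram of one result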
Mtcars dataset
The data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10
aspects of automobile design and performance for 32 automobiles (1973–74 models). The result is
a data frame with 32 observations on 11 variables.
In general, there are many choices of cluster analysis methodology. The hclust function in R uses the
complete linkage method for hierarchical clustering by default. This method defines the distance
between two clusters to be the maximum distance between their individual components. At every stage
of the clustering process, the two nearest clusters are merged into a new cluster.
With a distance matrix computed from the data, we can use various techniques of cluster analysis for
relationship discovery. For example, for the data set mtcars we can pass the distance matrix to hclust and
plot a dendrogram that displays the hierarchical relationships among the vehicles.
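A minimal sketch of that dendrogram in R (scaling the variables first is an extra assumption here, so that no single variable dominates the distances):

    data(mtcars)
    d <- dist(scale(mtcars))    # distance matrix on standardized variables
    hc <- hclust(d)             # complete linkage by default
    plot(hc, main = "Hierarchical clustering of mtcars vehicles")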
References
1. http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
2. http://www.statmethods.net/advstats/cluster.html
3. http://people.revoledu.com/kardi/tutorial/Clustering/Numerical%20Example.htm
4. http://www.stat.berkeley.edu/~s133/Cluster2a.html
5. http://www.rdatamining.com/examples/kmeans-clustering
6. http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/