
Lab Practices-2                                      Fourth Year Computer Engineering

LP2-ETL MODEL
Assignment No. 2

R (2)   C (4)   V (2)   T (2)   Total (10)   Dated Sign

1.1 Title:

Consider a suitable dataset. For clustering of data instances into different groups, apply different
clustering techniques (minimum 2). Visualize the clusters using a suitable tool.

1.2 Problem Definition:

Visualize the clusters using a suitable tool.

1.3 Prerequisite:
• Basic concepts of ETL
• Knowledge of the R tool

1.4 Software Requirements:

• R tool

1.5 Hardware Requirement:

• PIV, 2 GB RAM, 500 GB HDD, Lenovo A13-4089 model.

1.6 Learning Objectives:

Use R functions to create K-means clustering models and hierarchical clustering models.

1.7 Outcomes:

Visualize the effectiveness of the K-means clustering algorithm and of hierarchical clustering
using the graphic capabilities of R.

1.8 Theory Concepts:

What is K-means clustering?

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data,
with the number of groups represented by the variable K. The algorithm works iteratively to assign each
data point to one of K groups based on the features that are provided. Data points are clustered based on
feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the
groups that have formed organically. (The number of groups K must be chosen in advance; a common
approach is to try several values of K and compare the within-cluster variation.) Each centroid of a
cluster is a collection of feature values which define the resulting group. Examining the centroid
feature values can be used to qualitatively interpret what kind of group each cluster represents.
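
Stated formally (a standard textbook formulation, added here for reference; the assignment text
describes it only in words), K-means seeks the partition S = {S_1, ..., S_K} of the data points that
minimizes the within-cluster sum of squares

    W(S) = \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2

where \mu_i is the mean (centroid) of the points in cluster S_i. Each iteration (reassign every
point to its nearest centroid, then recompute the centroids) can only decrease W(S), which is why
the algorithm converges.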
Steps to Perform K-Means Clustering

As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores of
two variables on each of seven individuals:

Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5

This data set is to be grouped into two clusters. As a first step in finding a sensible initial
partition, let the A and B values of the two individuals furthest apart (using the Euclidean distance
measure) define the initial cluster means. Here that pair is individuals 1 and 4, whose distance is
sqrt((5.0 - 1.0)^2 + (7.0 - 1.0)^2) = sqrt(52) ≈ 7.2. This gives:

          Individual   Mean Vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new
member is added. This leads to the following series of steps:

        Cluster 1                               Cluster 2
Step    Individuals   Mean Vector (centroid)    Individuals   Mean Vector (centroid)
1       1             (1.0, 1.0)                4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)                4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)                4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)                4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)                4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)                4, 5, 6, 7    (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

            Individuals   Mean Vector (centroid)
Cluster 1   1, 2, 3       (1.8, 2.3)
Cluster 2   4, 5, 6, 7    (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we compare each
individual’s distance to its own cluster mean and to that of the opposite cluster. And we find:

             Distance to mean          Distance to mean
Individual   (centroid) of Cluster 1   (centroid) of Cluster 2
1            1.5                       5.4
2            0.4                       4.3
3            2.1                       1.8
4            5.7                       1.8
5            3.2                       0.7
6            3.8                       0.6
7            2.8                       1.1

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own
(Cluster 1). In other words, each individual's distance to its own cluster mean should be smaller
than its distance to the other cluster's mean, which is not the case for individual 3. Thus,
individual 3 is relocated to Cluster 2, resulting in the new partition:

            Individuals      Mean Vector (centroid)
Cluster 1   1, 2             (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7    (3.9, 5.1)
The iterative relocation would now continue from this new partition until no more relocations
occur. In this example, however, each individual is now nearer to its own cluster mean than to that
of the other cluster, so the iteration stops, and the latest partitioning is taken as the final
cluster solution.

R implementation

The kmeans function, provided by base R's stats package (no extra package is needed), is used as
follows:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

where the arguments are:

x:          a numeric matrix of data, or an object that can be coerced to such a matrix (such as a
            numeric vector or a data frame with all numeric columns).

centers:    either the number of clusters or a set of initial (distinct) cluster centers. If a
            number, a random set of (distinct) rows in x is chosen as the initial centers.

iter.max:   the maximum number of iterations allowed.

nstart:     if centers is a number, nstart gives the number of random sets that should be chosen.

algorithm:  the algorithm to be used. It should be one of "Hartigan-Wong", "Lloyd", "Forgy" or
            "MacQueen". If no algorithm is specified, the algorithm of Hartigan and Wong is used by
            default.
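
As a short illustration (this code is not part of the original text), kmeans can reproduce the
seven-subject worked example above by passing the two hand-picked centroids as the centers argument:

# A and B scores of the seven subjects from the worked example.
x <- matrix(c(1.0, 1.0,
              1.5, 2.0,
              3.0, 4.0,
              5.0, 7.0,
              3.5, 5.0,
              4.5, 5.0,
              3.5, 4.5),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("A", "B")))

# Initial centers: subject 1 (1.0, 1.0) and subject 4 (5.0, 7.0).
km <- kmeans(x, centers = x[c(1, 4), ])

km$cluster   # cluster label of each subject
km$centers   # final centroids; expect about (1.3, 1.5) and (3.9, 5.1)

The final partition should match the hand calculation, though the intermediate relocations may
differ, since the default Hartigan-Wong algorithm updates assignments differently from the
one-point-at-a-time trace above.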

IRIS dataset

This is perhaps the best known database to be found in the pattern recognition literature. The data set
contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

Attribute Information:

• sepal length in cm
• sepal width in cm
• petal length in cm
• petal width in cm
• class:
  1. Iris Setosa
  2. Iris Versicolour
  3. Iris Virginica
Steps

1. Set working directory
2. Get data from datasets
3. Execute the model
4. View the output
5. Plot the results
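
A minimal sketch of these steps on the iris data follows (the choice of k = 3 and of the plotted
features are illustrative assumptions; step 1, setting the working directory with setwd(), is
environment-specific and omitted):

# Step 2: iris ships with R's built-in datasets package.
data(iris)

# Step 3: run k-means on the four numeric measurement columns with k = 3.
set.seed(42)                                   # k-means uses random starting centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Step 4: inspect the output; compare clusters against the known species.
print(km)
table(km$cluster, iris$Species)

# Step 5: plot petal measurements colored by cluster, with centroids marked.
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
points(km$centers[, c("Petal.Length", "Petal.Width")],
       col = 1:3, pch = 8, cex = 2)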

Hierarchical Clustering

What is Hierarchical clustering?

Given a set of N items to be clustered, and an N x N distance (or similarity) matrix, the basic
process of Johnson's (1967) hierarchical clustering is this:

1. Start by assigning each item to its own cluster, so that if you have N items, you now have N
   clusters, each containing just one item. Let the distances (similarities) between the clusters
   equal the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that
   now you have one less cluster.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
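
As a toy illustration (added here, not part of the original text), the N x N dissimilarity
structure that this procedure starts from can be produced in R with dist():

x <- matrix(c(1.0, 1.0,
              1.5, 2.0,
              3.0, 4.0), ncol = 2, byrow = TRUE)
d <- dist(x)     # Euclidean distances between the N = 3 items
as.matrix(d)     # view as a full symmetric 3 x 3 matrix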

R Implementation

hclust(d, method = "complete", members = NULL)

## S3 method for class 'hclust'
plot(x, labels = NULL, hang = 0.1, check = TRUE,
     axes = TRUE, frame.plot = FALSE, ann = TRUE,
     main = "Cluster Dendrogram",
     sub = NULL, xlab = NULL, ylab = "Height", ...)

Arguments

d         a dissimilarity structure as produced by dist.

method    the agglomeration method to be used. This should be (an unambiguous abbreviation of) one
          of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA),
          "median" (= WPGMC) or "centroid" (= UPGMC).

members   NULL or a vector with length size of d. See the 'Details' section.

x         an object of the type produced by hclust.

hang      the fraction of the plot height by which labels should hang below the rest of the plot. A
          negative value will cause the labels to hang down from 0.

check     logical indicating if the x object should be checked for validity. This check is not
          necessary when x is known to be valid, such as when it is the direct result of hclust().
          The default is check = TRUE, as invalid inputs may crash R due to memory violation in the
          internal C plotting code.

labels    a character vector of labels for the leaves of the tree. By default the row names or row
          numbers of the original data are used. If labels = FALSE, no labels at all are plotted.

axes, frame.plot, ann
          logical flags as in plot.default.

main, sub, xlab, ylab
          character strings for title. sub and xlab have a non-NULL default when there is a
          tree$call.

...       further graphical arguments. E.g., cex controls the size of the labels (if plotted) in
          the same way as text.

Step 3 can be done in different ways, and this is what distinguishes single-link from complete-link
and average-link clustering, as the sketch below illustrates.
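
To make the distinction concrete, here is a short sketch (added for illustration, not part of the
original text) that clusters the same dissimilarity matrix under three linkage rules; the built-in
USArrests data set is chosen purely for demonstration:

d <- dist(scale(USArrests))   # Euclidean distances on standardized data

hc_single   <- hclust(d, method = "single")    # single-link: minimum pairwise distance
hc_complete <- hclust(d, method = "complete")  # complete-link: maximum pairwise distance
hc_average  <- hclust(d, method = "average")   # average-link (UPGMA): mean pairwise distance

# The merge heights (and often the tree shapes) differ across linkage rules.
par(mfrow = c(1, 3))
plot(hc_single,   main = "Single linkage")
plot(hc_complete, main = "Complete linkage")
plot(hc_average,  main = "Average linkage")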

Mtcars dataset

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10
aspects of automobile design and performance for 32 automobiles (1973–74 models). It is a data frame
with 32 observations on 11 variables:

[, 1]  mpg   Miles/(US) gallon
[, 2]  cyl   Number of cylinders
[, 3]  disp  Displacement (cu. in.)
[, 4]  hp    Gross horsepower
[, 5]  drat  Rear axle ratio
[, 6]  wt    Weight (1000 lb)
[, 7]  qsec  1/4 mile time
[, 8]  vs    Engine shape (0 = V-shaped, 1 = straight)
[, 9]  am    Transmission (0 = automatic, 1 = manual)
[,10]  gear  Number of forward gears
[,11]  carb  Number of carburetors

In general, there are many choices of cluster analysis methodology. The hclust function in R uses the
complete linkage method for hierarchical clustering by default. This particular clustering method defines
the cluster distance between two clusters to be the maximum distance between their individual
components. At every stage of the clustering process, the two nearest clusters are merged into a new
cluster.

Once a distance matrix has been computed with dist, we can use various techniques of cluster
analysis for relationship discovery. For example, for the data set mtcars, we can pass the distance
matrix to hclust and plot a dendrogram that displays the hierarchical relationships among the
vehicles.

> d <- dist(as.matrix(mtcars))   # find distance matrix
> hc <- hclust(d)                # apply hierarchical clustering
> plot(hc)                       # plot the dendrogram
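
As an optional extension (not part of the original assignment text), the tree can be cut into a
chosen number of groups with cutree, and those groups outlined on the dendrogram with rect.hclust:

> groups <- cutree(hc, k = 3)             # assign each car to one of 3 clusters
> table(groups)                           # how many cars fall in each cluster
> rect.hclust(hc, k = 3, border = "red")  # outline the 3 clusters on the plot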

1.9 Assignment Questions

1. What is the difference between supervised and unsupervised learning?
2. What are the similarities between the K-means and KNN algorithms?
3. What is Euclidean distance? Explain with a suitable example.
4. What is Hamming distance? Explain with a suitable example.
5. What is chi-square distance? Explain with a suitable example.
6. What are the different types of clustering?
7. What is the Weka tool? Explain the steps to perform clustering on a sample data set.

References

1. www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
2. http://www.statmethods.net/advstats/cluster.html
3. http://people.revoledu.com/kardi/tutorial/Clustering/Numerical%20Example.htm
4. http://www.stat.berkeley.edu/~s133/Cluster2a.html
5. http://www.rdatamining.com/examples/kmeans-clustering
6. http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/
