LP2 - ETL Model
Assignment No. 2
R (2)   C (4)   V (2)   T (2)   Total (10)   Dated Sign
1.1 Title:
Consider a suitable dataset. For clustering of data instances into different groups, apply different clustering
techniques (minimum two). Visualize the clusters using a suitable tool.
1.3 Prerequisite:
Basic concepts of ETL.
Knowledge of the R tool.
Use of R functions to create K-means and hierarchical clustering models.
1.7 Outcomes:
Visualize the effectiveness of the K-means and hierarchical clustering algorithms
using the graphics capabilities of R.
K-Means Clustering
K-means clustering is a type of unsupervised learning, used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of the algorithm is to find groups in the data,
with the number of groups represented by the variable K. The algorithm works iteratively to assign each
data point to one of K groups based on the features that are provided, so data points are clustered by
feature similarity. The results of the K-means clustering algorithm are:
1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering lets you find and analyze the
groups that form organically. The number of groups K can be chosen, for example, with the elbow
heuristic sketched below. Each cluster centroid is a collection of feature values that defines the
resulting group, and examining these feature values helps to interpret qualitatively what kind of
group each cluster represents.
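As a rough illustration of the elbow heuristic (the heuristic itself is an assumption of this write-up, since the text above only names the problem of choosing K), the following R sketch plots the total within-cluster sum of squares for K = 1 to 10 on a toy matrix; the bend ("elbow") in the curve suggests a reasonable K:

    # Elbow heuristic for choosing K: total within-cluster sum of
    # squares (tot.withinss) for a range of K values on toy data.
    set.seed(1)                          # reproducible toy data
    x <- matrix(rnorm(200), ncol = 2)    # 100 points, 2 features
    wss <- sapply(1:10, function(k)
      kmeans(x, centers = k, nstart = 25)$tot.withinss)
    plot(1:10, wss, type = "b",
         xlab = "Number of clusters K",
         ylab = "Total within-cluster sum of squares")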
Steps to Perform K-Means Clustering
As a simple illustration of a k-means algorithm, consider the following data set consisting of the scores of
two variables on each of seven individuals:
Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5
This data set is to be grouped into two clusters. As a first step in finding a sensible initial partition, let the A
and B values of the two individuals furthest apart (using the Euclidean distance measure) define the initial
cluster means. Individuals 1 and 4 are the furthest apart, giving:

             Individual   Mean Vector (centroid)
Cluster 1    1            (1.0, 1.0)
Cluster 2    4            (5.0, 7.0)
The remaining individuals are now examined in sequence and allocated to the cluster to which they are
closest, in terms of Euclidean distance to the cluster mean. The mean vector is recalculated each time a new
member is added. This leads to the following series of steps:
Step   Cluster 1 Individuals   Cluster 1 Mean (centroid)   Cluster 2 Individuals   Cluster 2 Mean (centroid)
1      1                       (1.0, 1.0)                  4                       (5.0, 7.0)
2      1, 2                    (1.2, 1.5)                  4                       (5.0, 7.0)
3      1, 2, 3                 (1.8, 2.3)                  4                       (5.0, 7.0)
4      1, 2, 3                 (1.8, 2.3)                  4, 5                    (4.2, 6.0)
5      1, 2, 3                 (1.8, 2.3)                  4, 5, 6                 (4.3, 5.7)
6      1, 2, 3                 (1.8, 2.3)                  4, 5, 6, 7              (4.1, 5.4)
Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

             Individuals   Mean Vector (centroid)
Cluster 1    1, 2, 3       (1.8, 2.3)
Cluster 2    4, 5, 6, 7    (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster, so we compare each
individual's distance to its own cluster mean and to the mean of the opposite cluster. We find:
Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1
Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In other
words, each individual's distance to its own cluster mean should be smaller than the distance to the other
cluster's mean, which is not the case for individual 3. Thus, individual 3 is relocated to Cluster 2,
resulting in the new partition:

             Individuals      Mean Vector (centroid)
Cluster 1    1, 2             (1.3, 1.5)
Cluster 2    3, 4, 5, 6, 7    (3.9, 5.1)

The relocation step is then repeated from this new partition; here every individual is already nearest to its
own cluster mean, so the iteration stops and this is the final cluster solution.
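The seven-subject example can be reproduced in R. The following minimal sketch types in the data matrix from the table above; kmeans() with two centers should recover the final partition {1, 2} and {3, 4, 5, 6, 7}:

    # The seven-subject worked example, reproduced with kmeans()
    x <- matrix(c(1.0, 1.0,
                  1.5, 2.0,
                  3.0, 4.0,
                  5.0, 7.0,
                  3.5, 5.0,
                  4.5, 5.0,
                  3.5, 4.5),
                ncol = 2, byrow = TRUE,
                dimnames = list(1:7, c("A", "B")))
    km <- kmeans(x, centers = 2, nstart = 10)  # several random starts
    km$cluster   # cluster label for each subject
    km$centers   # the two cluster centroids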
R Implementation
In R, K-means clustering is performed with the kmeans() function. Its main arguments are:
x: a numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric
vector or a data frame with all-numeric columns).
centers: either the number of clusters or a set of initial (distinct) cluster centers. If a number, a
random set of (distinct) rows of x is chosen as the initial centers.
nstart: if centers is a number, nstart gives the number of random sets that should be chosen.
algorithm: the algorithm to be used, one of "Hartigan-Wong", "Lloyd", "Forgy", or "MacQueen".
If no algorithm is specified, the algorithm of Hartigan and Wong is used by default.
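A minimal call using these arguments might look as follows (illustrative values on a toy matrix; only x and centers are required):

    set.seed(42)                              # reproducible random starts
    x <- matrix(rnorm(200), ncol = 2)         # toy data: 100 points, 2 features
    cl <- kmeans(x, centers = 3, nstart = 20,
                 algorithm = "Hartigan-Wong") # the default algorithm
    cl$centers                                # the K centroids
    table(cl$cluster)                         # cluster sizes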
IRIS dataset
This is perhaps the best-known data set in the pattern recognition literature. It
contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
Attribute Information:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
class:
1 Iris Setosa
2 Iris Versicolour
3 Iris Virginica
Steps
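One possible sequence of steps in R is sketched below (not the only valid workflow; choosing 3 centers matches the three known species, and the petal measurements are used only for the plot):

    data(iris)
    x <- iris[, 1:4]                           # the four numeric attributes
    set.seed(20)
    km <- kmeans(x, centers = 3, nstart = 25)  # K = 3 clusters
    table(km$cluster, iris$Species)            # compare clusters with true classes
    plot(x$Petal.Length, x$Petal.Width,
         col = km$cluster, pch = 19,
         xlab = "Petal length (cm)", ylab = "Petal width (cm)")
    points(km$centers[, 3:4], col = 1:3,
           pch = 8, cex = 2)                   # mark the cluster centroids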
Hierarchical Clustering
Given a set of N items to be clustered and an N x N distance (or similarity) matrix, the basic process of
Johnson's (1967) hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters,
each containing just one item. Let the distances (similarities) between the clusters equal the distances
(similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now
have one less cluster.
3. Compute the distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-link from complete-link and
average-link clustering.
R Implementation
In R, agglomerative hierarchical clustering is performed with the hclust() function. Its main arguments are:
d: a dissimilarity structure, as produced by the dist() function.
method: the agglomeration (linkage) method to be used, such as "single", "complete" (the default),
or "average".
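A small sketch showing all three linkage choices on toy data (any numeric data works; dist() builds the distance matrix that hclust() expects):

    set.seed(7)
    x <- matrix(rnorm(20), ncol = 2)               # 10 points, 2 features
    d <- dist(x)                                   # Euclidean distance matrix
    hc_single   <- hclust(d, method = "single")
    hc_complete <- hclust(d, method = "complete")  # the default method
    hc_average  <- hclust(d, method = "average")
    plot(hc_complete)                              # dendrogram of one result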
Mtcars dataset
The data were extracted from the 1974 Motor Trend US magazine, and comprise fuel consumption and 10
aspects of automobile design and performance for 32 automobiles (1973–74 models). The result is
a data frame with 32 observations on 11 variables.
In general, there are many choices of cluster analysis methodology. The hclust function in R uses the
complete linkage method for hierarchical clustering by default. This method defines the distance
between two clusters to be the maximum distance between their individual components. At every stage
of the clustering process, the two nearest clusters are merged into a new cluster.
With a distance matrix computed from the data, we can use various techniques of cluster analysis for
relationship discovery. For example, for the data set mtcars we can pass the distance matrix to hclust and
plot a dendrogram that displays the hierarchical relationships among the vehicles.
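A minimal sketch of that dendrogram in R (scaling the variables first is an extra assumption here, so that no single variable dominates the distances):

    data(mtcars)
    d <- dist(scale(mtcars))    # distance matrix on standardized variables
    hc <- hclust(d)             # complete linkage by default
    plot(hc, main = "Hierarchical clustering of mtcars vehicles")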
References
1. http://www.r-tutor.com/gpu-computing/clustering/hierarchical-cluster-analysis
2. http://www.statmethods.net/advstats/cluster.html
3. http://people.revoledu.com/kardi/tutorial/Clustering/Numerical%20Example.htm
4. http://www.stat.berkeley.edu/~s133/Cluster2a.html
5. http://www.rdatamining.com/examples/kmeans-clustering
6. http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/