
UNIT-IV CLUSTERING AND APPLICATIONS

4.1 Cluster Analysis: Types of Data in Cluster Analysis

What is Cluster Analysis?


The process of grouping a set of physical objects into classes of similar objects is called
clustering.

Cluster – collection of data objects


– Objects within a cluster are similar and objects in different clusters are dissimilar.

Cluster applications – pattern recognition, image processing and market research.


- helps marketers to discover the characterization of customer groups based on
purchasing patterns
- Categorize genes in plant and animal taxonomies
- Identify groups of houses in a city according to house type, value and
geographical location
- Classify documents on WWW for information discovery

Clustering is a preprocessing step for other data mining steps like classification,
characterization.
Clustering – Unsupervised learning – does not rely on predefined classes with class labels.

Typical requirements of clustering in data mining:


1. Scalability – Clustering algorithms should work for huge databases
2. Ability to deal with different types of attributes – Clustering algorithms should work
not only for numeric data, but also for other data types.
3. Discovery of clusters with arbitrary shape – Clustering algorithms based on distance
measures tend to find spherical clusters of similar size; it is important to develop
algorithms that can detect clusters of arbitrary shape.
4. Minimal requirements for domain knowledge to determine input parameters –
Clustering results are sensitive to input parameters to a clustering algorithm
(example – number of desired clusters). Determining the value of these parameters is
difficult and requires some domain knowledge.
5. Ability to deal with noisy data – Databases often contain outliers and missing,
unknown or erroneous data; clustering algorithms that are sensitive to such data may
produce clusters of poor quality.
6. Insensitivity to the order of input records – Clustering algorithms should produce the
same results even if the order of the input records is changed.
7. High dimensionality – Data in high dimensional space can be sparse and highly
skewed, hence it is challenging for a clustering algorithm to cluster data objects in
high dimensional space.
8. Constraint-based clustering – In real-world applications, clustering may need to be
performed under various constraints. It is a challenging task to find groups of data with
good clustering behavior that also satisfy the constraints.
9. Interpretability and usability – Clustering results should be interpretable,
comprehensible and usable. So we should study how an application goal may
influence the selection of clustering methods.

4.2 Types of Data in Cluster Analysis:


1. Data Matrix: (object-by-variable structure)
Represents n objects (such as persons) with p variables or attributes (such as age,
height, weight, gender, race and so on). The structure is in the form of a relational table,
or an n x p matrix:

      x11  ...  x1f  ...  x1p
      ...  ...  ...  ...  ...
      xi1  ...  xif  ...  xip
      ...  ...  ...  ...  ...
      xn1  ...  xnf  ...  xnp

- Called a "two mode" matrix, since its rows and columns represent different entities (objects and variables).

2. Dissimilarity Matrix: (object-by-object structure)

This stores a collection of proximities (closeness or distance) that are available for all
pairs of n objects. It is represented by an n-by-n table:

      0
      d(2,1)   0
      d(3,1)   d(3,2)   0
      ...      ...      ...
      d(n,1)   d(n,2)   ...   0

- Called a "one mode" matrix, since its rows and columns represent the same entity (objects).

where d(i, j) is the dissimilarity between objects i and j; d(i, j) = d(j, i) and d(i, i) = 0.

Many clustering algorithms use Dissimilarity Matrix. So data represented using Data
Matrix are converted into Dissimilarity Matrix before applying such clustering
algorithms.

Clustering of objects done based on their similarities or dissimilarities.


Similarity coefficients or dissimilarity coefficients are derived from correlation
coefficients.
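
The sketch below illustrates how a data matrix can be converted into a dissimilarity matrix. It is a minimal illustration, assuming numeric attributes and Euclidean distance; the use of NumPy and the function name are illustrative and not part of the original notes.

```python
import numpy as np

def dissimilarity_matrix(data):
    """Convert an n x p data matrix (numeric attributes) into an
    n x n dissimilarity matrix using Euclidean distance."""
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.sqrt(np.sum((data[i] - data[j]) ** 2))
            d[i, j] = d[j, i] = dist   # d(i, j) = d(j, i), d(i, i) = 0
    return d

# Example: 4 objects described by 2 numeric variables
X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]]
print(dissimilarity_matrix(X))
```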
4.3 Categorization of Major Clustering Methods
The choice among the many available clustering algorithms depends on the type of data
available and on the application.

Major Categories are:

1. Partitioning Methods:
- Construct k-partitions of the n data objects, where each partition is a cluster and k
<= n.
- Each partition should contain at least one object & each object should belong
to exactly one partition.
- Iterative Relocation Technique – attempts to improve partitioning by moving
objects from one group to another.
- Good Partitioning – Objects in the same cluster are “close” / related and objects in
the different clusters are “far apart” / very different.
- Uses the Algorithms:
o K-means Algorithm: - Each cluster is represented by the mean value of the
objects in the cluster.
o K-medoids Algorithm: - Each cluster is represented by one of the
objects located near the center of the cluster.
o These work well on small to medium sized databases.

2. Hierarchical Methods:
- Creates hierarchical decomposition of the given set of data objects.
- Two types – Agglomerative and Divisive
- Agglomerative Approach: (Bottom-Up Approach):
o Each object forms a separate group
o Successively merges groups close to one another (based on
distance between clusters)
o Done until all the groups are merged to one or until a termination
condition holds. (Termination condition can be desired number of
clusters)
- Divisive Approach: (Top-Down Approach):
o Starts with all the objects in the same cluster
o Successively clusters are split into smaller clusters
o Done until each object is in one cluster or until a termination condition
holds (Termination condition can be desired number of clusters)
- Disadvantage – Once a merge or split is done, it cannot be undone.
- Advantage – Lower computational cost.
- Combining hierarchical clustering with other techniques gives better results.
- Clustering algorithms with this integrated approach are BIRCH and CURE.
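
A minimal sketch of the bottom-up (agglomerative) idea described above, assuming numeric data, Euclidean distance and single-link merging; the function name and parameters are illustrative, not from the notes.

```python
import numpy as np

def agglomerative(data, n_clusters, linkage=min):
    """Start with every object in its own cluster and repeatedly merge
    the two closest clusters until n_clusters groups remain.
    `linkage` defines the distance between clusters (min = single link)."""
    data = np.asarray(data, dtype=float)
    clusters = [[i] for i in range(len(data))]           # each object is its own group
    dist = lambda a, b: np.linalg.norm(data[a] - data[b])
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                        # merge the two closest groups
        del clusters[j]
    return clusters

print(agglomerative([[1, 1], [1.2, 1.1], [0.9, 1.3], [8, 8], [8.2, 8.1]], n_clusters=2))
```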

3. Density Based Methods:


- The above methods tend to produce spherical-shaped clusters.
- To discover clusters of arbitrary shape, clustering done based on the notion of
density.
- Used to filter out noise or outliers.
- Continue growing a given cluster as long as the density in the neighborhood exceeds
some threshold.
- Density = number of objects or data points.
- That is, for each data point within a given cluster, the neighborhood of a given
radius has to contain at least a minimum number of points.
- Uses the algorithms: DBSCAN and OPTICS
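
A minimal sketch of the density test behind such methods, assuming Euclidean distance; eps and min_pts stand for the neighborhood radius and minimum number of points mentioned above. The function name is illustrative and this is not the full DBSCAN algorithm.

```python
import numpy as np

def is_core_point(data, i, eps, min_pts):
    """A point is a core point if its eps-neighborhood contains
    at least min_pts points (including itself)."""
    data = np.asarray(data, dtype=float)
    dists = np.sqrt(((data - data[i]) ** 2).sum(axis=1))
    return (dists <= eps).sum() >= min_pts

points = [[1, 1], [1.2, 1.1], [0.9, 1.0], [8, 8]]
print(is_core_point(points, 0, eps=0.5, min_pts=3))   # True: dense region
print(is_core_point(points, 3, eps=0.5, min_pts=3))   # False: likely noise
```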

4. Grid-Based Methods:
- Divides the object space into a finite number of cells to form a grid structure.
- Performs clustering operations on the grid structure.
- Advantage – Fast processing time – independent of the number of data objects and
dependent only on the number of cells in the grid.
- STING – typical grid based method
- CLIQUE and Wave-Cluster – grid based and density based clustering algorithms.

5. Model-Based Methods:
- Hypothesizes a model for each of the clusters and finds a best fit of the data to the
model.
- Forms clusters by constructing a density function that reflects the spatial
distribution of the data points.
- Robust clustering methods
- Detects noise / outliers.

Many algorithms combine several clustering methods.

4.4 Partitioning Methods

A database has n objects and k partitions, where k <= n; each partition is a cluster.

Partitioning criterion = Similarity function:


Objects within a cluster are similar; objects of different clusters are dissimilar.

Classical Partitioning Methods: k-means and k-medoids:

(A) Centroid-based technique: The k-means method:


- Cluster similarity is measured using the mean value of the objects in the cluster
(the cluster's center of gravity)
- Randomly select k objects. Each object is a cluster mean or center.
- Each of the remaining objects is assigned to the most similar cluster – based on
the distance between the object and the cluster mean.
- Compute new mean for each cluster.
- This process iterates until the assignment of objects to clusters no longer changes,
i.e. the criterion function converges.
- This algorithm determines k partitions that minimize the squared error function.

- The square-error criterion is defined as:

      E = Σ (i = 1 to k) Σ (x ∈ Ci) |x − mi|²

- where E is the sum of squared error for all objects, x is the point representing an object,
and mi is the mean of cluster Ci.

- Algorithm:
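A minimal Python sketch of the k-means procedure described above (an illustration, not the original pseudocode from these notes; the function name and the use of NumPy are assumptions):

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Assign each object to the nearest cluster mean, recompute the
    means, and repeat until the assignment stops changing."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly select k objects as the initial cluster means
    means = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Assign each object to the most similar (closest) cluster mean
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # partition no longer changes
        labels = new_labels
        # Compute the new mean of each cluster
        for j in range(k):
            members = data[labels == j]
            if len(members):                       # keep the old mean if a cluster empties
                means[j] = members.mean(axis=0)
    return labels, means

# Toy example with two obvious groups
X = [[1, 1], [1.5, 2], [1, 0.6], [8, 8], [9, 11], [9, 9]]
labels, means = k_means(X, k=2)
print(labels, means)
```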

- Advantages: Scalable; efficient in large databases


- Computational Complexity of this algorithm:
o O(nkt), where n = number of objects, k = number of clusters, t = number of
iterations
o Normally, k << n and t << n
- Disadvantage:
o Cannot be applied for categorical data – as mean cannot be calculated.
o Need to specify the number of partitions – k
o Not suitable for discovering clusters of very different sizes or non-convex shapes.
o Sensitive to noise and outliers, since a small number of such points can
substantially distort the mean value.

(B) Representative Object-based technique: The k-medoids method:


- Medoid – the most centrally located object in a cluster – used as a reference point
- Partitioning is based on the principle of minimizing the sum of the
dissimilarities between each object and its corresponding reference point.

- PAM – Partitioning Around Medoids – a k-medoids type clustering algorithm.


- Finds k clusters in n objects by finding a medoid for each cluster.
- An initial set of k medoids is arbitrarily selected.
- Iteratively replaces one of the medoids with one of the non-medoids as long as the
total distance of the resulting clustering is improved.

- After the initial selection of k medoids, the algorithm repeatedly tries to make a
better choice of medoids by analyzing all possible pairs of objects such that
one object is a medoid and the other is not.
- The measure of clustering quality is calculated for each such combination.
- The best choice of points in one iteration is chosen as the medoids for the next
iteration.
- The cost of a single iteration is O(k(n − k)²).
- For large values of n and k, the cost of such computation could be high.

- Advantage: - The k-medoids method is more robust than k-means in the presence of noise and outliers.


- Disadvantage: - The k-medoids method is more costly than the k-means method.
- The user needs to specify k, the number of clusters, in both methods.
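
A minimal sketch of the PAM swap idea described above, assuming numeric data and Euclidean distance; the function names are illustrative, not from the notes.

```python
import numpy as np

def total_cost(data, medoid_idx):
    """Sum of distances from every object to its nearest medoid."""
    dists = np.linalg.norm(data[:, None, :] - data[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(data, k, seed=0):
    """Start with k arbitrary medoids, then keep swapping a medoid with a
    non-medoid whenever the swap lowers the total distance, until no
    improving swap remains."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(data), size=k, replace=False))
    best = total_cost(data, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                          # each current medoid position
            for h in range(len(data)):              # each non-medoid candidate
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_cost(data, trial)
                if cost < best:                     # keep the cheaper configuration
                    medoids, best, improved = trial, cost, True
    return medoids, best
```

Each pass over all (medoid, non-medoid) pairs corresponds to the O(k(n − k)²) per-iteration cost noted above.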

(C) Partitioning methods in large databases: from k-medoids to CLARANS:


- (i) CLARA – Clustering LARge Applications – a sampling-based method.
- In this method, only a sample of the data is considered from the whole data set and
the medoids are selected from this sample using PAM. The sample is selected randomly.
- CLARA draws multiple samples of the data set, applies PAM on each sample and
gives the best clustering as the output. Classifies the entire dataset to the resulting
clusters.
- Complexity of each iteration in this case is O(kS² + k(n − k)); S = size of the
sample; k = number of clusters; n = total number of objects.
- Effectiveness of CLARA depends on sample size.
- Good clustering of samples does not imply good clustering of the dataset if the
sample is biased.
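
A minimal CLARA sketch, re-using the pam() and total_cost() helpers from the PAM sketch above; the default sample size of 40 + 2k is one commonly cited choice, used here only for illustration.

```python
import numpy as np

def clara(data, k, n_samples=5, sample_size=None, seed=0):
    """Run PAM on several random samples and keep the medoid set that is
    cheapest over the ENTIRE data set."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    if sample_size is None:
        sample_size = min(len(data), 40 + 2 * k)          # illustrative sample size
    best_medoids, best_cost = None, np.inf
    for _ in range(n_samples):
        idx = rng.choice(len(data), size=sample_size, replace=False)
        sample_medoids, _ = pam(data[idx], k, seed=int(rng.integers(1 << 30)))
        medoids = [int(m) for m in idx[sample_medoids]]   # map back to the full data set
        cost = total_cost(data, medoids)
        if cost < best_cost:                              # best clustering over all samples
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost
```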

- (ii) CLARANS – Clustering LARge Applications based on RANdomized Search
  – proposed to improve the quality and scalability of CLARA.

- It is similar in spirit to PAM and CLARA.
- Unlike CLARA, it does not restrict the search to a fixed sample; unlike PAM, it does
not examine every possible swap over the entire database.
- Begins like PAM by selecting k medoids, and then applies a randomized iterative
optimization.
- It randomly selects up to "maxneighbor" pairs (medoid, non-medoid) as candidate
swaps.
- If a swap that lowers the total cost is found, the medoid set is updated and the
search continues.
- Otherwise, the current selection of medoids is taken as a local optimum set.
- The search then restarts with new randomly selected medoids to look for another
local optimum set.
- Stops after finding "numlocal" local optimum sets.
- Returns the best of the local optimum sets.
- CLARANS enables detection of outliers and is regarded as the best medoid-based method.
- Drawbacks – Assumes all objects fit into main memory; the result depends on the input
order.
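
A minimal CLARANS sketch, again re-using total_cost() from the PAM sketch above; the default values of numlocal and maxneighbor are illustrative.

```python
import numpy as np

def clarans(data, k, numlocal=2, maxneighbor=20, seed=0):
    """From a random medoid set, examine up to `maxneighbor` random
    single-medoid swaps; a cheaper swap restarts the count, `maxneighbor`
    failures mean a local optimum.  Repeat `numlocal` times and return
    the best local optimum found."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(numlocal):
        current = list(rng.choice(len(data), size=k, replace=False))
        current_cost = total_cost(data, current)
        tried = 0
        while tried < maxneighbor:
            i = int(rng.integers(k))                 # medoid position to replace
            h = int(rng.integers(len(data)))         # random replacement candidate
            if h in current:
                continue
            trial = current[:i] + [h] + current[i + 1:]
            cost = total_cost(data, trial)
            if cost < current_cost:                  # better neighbour: move there
                current, current_cost, tried = trial, cost, 0
            else:
                tried += 1
        if current_cost < best_cost:                 # keep the best local optimum
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost
```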

4.5 Hierarchical Methods

This works by grouping data objects into a tree of clusters. Two types – Agglomerative and
Divisive.
Clustering algorithms that integrate hierarchical clustering with other techniques include
BIRCH, CURE, ROCK and CHAMELEON.

BIRCH – Balanced Iterative Reducing and Clustering using Hierarchies:


- Integrated Hierarchical Clustering algorithm.
- Introduces two concepts – Clustering Feature and CF tree (Clustering Feature
Tree)
- CF Trees – Summarized Cluster Representation – Helps to achieve good speed &
clustering scalability
- Good for incremental and dynamical clustering of incoming data points.
- The Clustering Feature (CF) is the summary statistic for a cluster (or sub-cluster),
defined as the triple:

      CF = (N, LS, SS)

- where N is the number of points in the sub-cluster (each point represented as a vector Xi);
- LS is the linear sum of the N points, LS = ΣXi;
- SS is the square sum of the data points, SS = ΣXi².
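
A small sketch of a Clustering Feature and its additivity; the class name and methods are illustrative, and a full BIRCH implementation would also maintain the CF tree itself.

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): point count, linear sum and square sum of a
    sub-cluster.  CFs are additive, so a parent node can summarize its
    children simply by adding their CFs."""
    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)      # linear sum of the points
        self.ss = 0.0                # sum of squared norms of the points

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):          # CF additivity
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Average distance of member points from the centroid,
        # computable from (N, LS, SS) alone.
        c = self.ls / self.n
        return np.sqrt(max(self.ss / self.n - (c @ c), 0.0))

cf = ClusteringFeature(dim=2)
for p in [[1, 2], [2, 2], [3, 4]]:
    cf.add_point(p)
print(cf.n, cf.centroid(), round(cf.radius(), 3))
```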

- CF Tree – Height balanced tree that stores the Clustering Features.


- This has two parameters – Branching Factor B and threshold T
- Branching Factor specifies the maximum number of children.
- The threshold parameter T = the maximum diameter of the sub-clusters stored at the
leaf nodes.
- Change the threshold value => Changes the size of the tree.
- The non-leaf nodes store sums of their children’s CF’s – summarizes information
about their children.

- BIRCH algorithm has the following two phases:


o Phase 1: Scan database to build an initial in-memory CF tree – Multi-level
compression of the data – Preserves the inherent clustering structure of the
data.

 ▪ The CF tree is built dynamically as data points are inserted: each point is
inserted into the closest leaf entry.
 ▪ If the diameter of the sub-cluster in a leaf node after insertion becomes
larger than the threshold, then the leaf node (and possibly other nodes) is split.
 ▪ After a new point is inserted, the information about it is passed towards
the root of the tree.
 ▪ If the size of the memory needed to store the CF tree is larger than the size
of the main memory, then a larger threshold value is specified and the CF tree
is rebuilt.
 ▪ The rebuild process reuses the leaf entries of the old tree, so the data has to
be read from the database only once to build the tree.

o Phase 2: Apply a clustering algorithm to cluster the leaf nodes of the CF-
tree.
- Advantages:
o Produces best clusters with available resources.
o Minimizes the I/O time
- Computational complexity of this algorithm is – O(N) – N is the number of
objects to be clustered.
- Disadvantages:
o Since each node in a CF tree can hold only a limited number of entries, a
CF-tree node does not always correspond to a natural cluster.
o Does not perform well for non-spherical clusters, because it uses the notion
of diameter/radius to control the boundary of a cluster.

CURE – Clustering Using Representatives:


- Integrates hierarchical and partitioning algorithms.
- Handles clusters of different shapes and sizes; Handles outliers separately.
- Here a fixed number of representative points, spread across the cluster, are used to represent a cluster.
- These points are generated by first selecting well scattered points in a cluster and
shrinking them towards the center of the cluster by a specified fraction (shrinking
factor)
- Closest pair of clusters are merged at each step of the algorithm.

- Having more than one representative point per cluster allows CURE to
handle clusters of non-spherical shape.
- Shrinking helps to identify the outliers.
- To handle large databases – CURE employs a combination of random sampling
and partitioning.
- The resulting clusters from these samples are again merged to get the final cluster.

- CURE Algorithm:
o Draw a random sample s
o Partition sample s into p partitions each of size s/p
o Partially cluster partitions into s/pq clusters where q > 1
o Eliminate outliers – if a cluster grows too slowly, eliminate it.
o Cluster the partial clusters.
o Label the data with the corresponding cluster labels.
- Advantage:
o High quality clusters
o Removes outliers
o Produces clusters of different shapes & sizes
o Scales to large databases
- Disadvantage:
o Needs parameters – Size of the random sample; Number of Clusters and
Shrinking factor
o These parameter settings have significant effect on the results.
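
A minimal sketch of how CURE-style representative points can be chosen and shrunk toward the centroid; the greedy farthest-point selection and the default values of num_rep and alpha are illustrative choices, not prescribed by these notes.

```python
import numpy as np

def shrink_representatives(cluster_points, num_rep=4, alpha=0.3):
    """Pick well-scattered points of a cluster and shrink them toward the
    cluster centroid by the shrinking factor alpha, so the cluster is
    represented by several shrunk points instead of a single centroid."""
    pts = np.asarray(cluster_points, dtype=float)
    centroid = pts.mean(axis=0)
    # Greedily pick scattered points: start with the point farthest from the
    # centroid, then repeatedly add the point farthest from those already chosen.
    reps = [pts[np.argmax(np.linalg.norm(pts - centroid, axis=1))]]
    while len(reps) < min(num_rep, len(pts)):
        d = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[np.argmax(d)])
    # Shrink each representative toward the centroid by the fraction alpha
    return [r + alpha * (centroid - r) for r in reps]

print(shrink_representatives([[1, 1], [2, 3], [0, 2], [3, 1], [1.5, 2]], num_rep=3))
```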

ROCK:
- Agglomerative hierarchical clustering algorithm.
- Suitable for clustering categorical attributes.
- It measures the similarity of two clusters by comparing the aggregate inter-
connectivity of two clusters against a user specified static inter-connectivity
model.
- Inter-connectivity of two clusters C1 and C2 are defined by the number of cross
links between the two clusters.
- link(pi, pj) = number of common neighbors between two points pi and pj.

- Two steps:
o First construct a sparse graph from a given data similarity matrix using
a similarity threshold and the concept of shared neighbors.
o Then performs a hierarchical clustering algorithm on the sparse graph.
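
A minimal sketch of ROCK's link counts, assuming a precomputed similarity matrix; theta plays the role of the similarity threshold mentioned above, and the function name is illustrative.

```python
import numpy as np

def link_counts(similarity, theta):
    """ROCK's link(pi, pj): the number of common neighbours of pi and pj,
    where q is a neighbour of p if sim(p, q) >= theta."""
    sim = np.asarray(similarity, dtype=float)
    neighbours = sim >= theta               # boolean adjacency (sparse graph)
    np.fill_diagonal(neighbours, False)     # a point is not its own neighbour
    # Number of common neighbours = boolean matrix product
    return neighbours.astype(int) @ neighbours.astype(int)

# Toy similarity matrix for 4 points
S = np.array([[1.0, 0.9, 0.8, 0.1],
              [0.9, 1.0, 0.7, 0.2],
              [0.8, 0.7, 1.0, 0.1],
              [0.1, 0.2, 0.1, 1.0]])
print(link_counts(S, theta=0.6))   # e.g. points 0 and 1 share neighbour 2
```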

CHAMELEON – A hierarchical clustering algorithm using dynamic modeling:


- In this clustering process, two clusters are merged if the inter-connectivity and
closeness (proximity) between two clusters are highly related to the internal
interconnectivity and closeness of the objects within the clusters.
- This merge process produces natural and homogeneous clusters.
- Applies to all types of data as long as the similarity function is specified.

- This first uses a graph partitioning algorithm to cluster the data items into
large number of small sub clusters.
- Then it uses an agglomerative hierarchical clustering algorithm to find the genuine
clusters by repeatedly combining the sub clusters created by the graph partitioning
algorithm.
- To determine the pairs of most similar sub clusters, it considers the
interconnectivity as well as the closeness of the clusters.

- In CHAMELEON, objects are represented using a k-nearest-neighbor graph.

- Each vertex of this graph represents an object, and an edge exists between two
vertices (objects) if one object is among the k most similar objects of the other.

- Partition the graph by removing the edges in the sparse regions and keeping
the edges in the dense regions. Each of these partitioned subgraphs forms a cluster.
- Then form the final clusters by iteratively merging the clusters from the
previous cycle based on their interconnectivity and closeness.

- CHAMELEON determines the similarity between each pair of clusters Ci and Cj
according to their relative inter-connectivity RI(Ci, Cj) and their relative
closeness RC(Ci, Cj).

- Relative inter-connectivity:

      RI(Ci, Cj) = |EC{Ci,Cj}| / ( (|EC{Ci}| + |EC{Cj}|) / 2 )

- where EC{Ci,Cj} = the edge-cut, i.e. the edges that connect Ci and Cj in the cluster
containing both Ci and Cj;
- EC{Ci} = the min-cut bisector of Ci, i.e. the edges that bisect Ci into two roughly
equal parts.

- Relative closeness:

      RC(Ci, Cj) = S_EC{Ci,Cj} / ( (|Ci| / (|Ci| + |Cj|)) · S_EC{Ci} + (|Cj| / (|Ci| + |Cj|)) · S_EC{Cj} )

- where S_EC{Ci,Cj} = the average weight of the edges that connect vertices in Ci to
vertices in Cj;
- S_EC{Ci} = the average weight of the edges that belong to the min-cut bisector of
cluster Ci.
- Advantages:
o More powerful than BIRCH and CURE.
o Produces arbitrary shaped clusters
- Processing cost: O(n²) in the worst case, where n = number of objects.

Review Questions

1. Explain the decision tree induction algorithm with an example.

2. (i) Write notes on Bayes classification. (2)
   (ii) Define Bayes' theorem with an example. (4)
   (iii) Explain in detail Naïve Bayesian classifiers with a suitable example. (10)
3. (i) Describe the k-means classical partitioning algorithm. (8)
   (ii) Describe the k-medoids / Partitioning Around Medoids (PAM) algorithm. (8)
4. (i) Describe the BIRCH hierarchical algorithm. (8)
   (ii) Describe the CURE hierarchical algorithm. (8)
5. (i) Describe the ROCK hierarchical algorithm. (8)
   (ii) Describe the CHAMELEON hierarchical algorithm. (8)

Assignment Topic:
1. Write in detail about “Other Classification Methods”.
