Data Mining Unit-4
Lecture Notes
---------------------------------------------------------------------------------------------------------------
Clustering and Applications: Cluster Analysis – Types of data in cluster analysis –
Categorization of Major Clustering Methods – Partitioning Methods, Hierarchical Methods-
Density based Methods, Grid based Methods, Outlier Analysis.
What is Cluster Analysis?
Clustering splits data into several subsets, each containing data objects that are similar
to one another; these subsets are called clusters. Once the data from our customer base
is divided into clusters, we can make an informed decision about who we think is best
suited for this product.
Cluster analysis has been widely used in many applications such as business
intelligence, image pattern recognition, Web search, biology, and security.
In business intelligence, clustering can be used to organize a large number of
customers into groups, where customers within a group share strongly similar
characteristics.
We now look at the types of data that often occur in cluster analysis and how to
preprocess them for such analysis.
Suppose that a data set to be clustered contains n objects, which may represent persons,
houses, documents, countries, and so on.
Main memory-based clustering algorithms typically operate on either of the following two
data structures.
The two types of data structures in cluster analysis are:
Data Matrix (or object by variable structure)
Dissimilarity Matrix (or object by object structure)
Data Matrix
This represents n objects, such as persons, with p variables such as age, height, weight,
gender, race, and so on. The structure is in the form of a relational table, or n-by-p matrix
(n objects × p variables).
The data matrix is often called a two-mode matrix, since its rows and columns
represent different entities.
Dissimilarity Matrix
It is often represented by an n-by-n table, where d(i, j) is the measured difference or
dissimilarity between objects i and j. In general, d(i, j) is a non-negative number that is
close to 0 when objects i and j are highly similar or "near" each other, and becomes larger
the more they differ. Since d(i, j) = d(j, i) and d(i, i) = 0, the matrix is symmetric and only
the entries below the diagonal need to be stored.
This is also called a one-mode matrix, since its rows and columns represent the same
entity.
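As a small illustrative sketch (not part of the original notes), the following Python snippet builds a dissimilarity matrix from a data matrix using Euclidean distance; the data values are made up.

```python
import numpy as np

def dissimilarity_matrix(X):
    """Build an n-by-n dissimilarity matrix from an n-by-p data matrix X
    using Euclidean distance. Note d[i, j] = d[j, i] and d[i, i] = 0."""
    n = X.shape[0]
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = np.linalg.norm(X[i] - X[j])
    return d

# Example: 4 objects described by 2 variables (say, height in m and weight in kg)
X = np.array([[1.70, 65.0], [1.80, 80.0], [1.65, 60.0], [1.75, 72.0]])
print(dissimilarity_matrix(X))
```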
Interval-Scaled Variables
Interval-scaled variables are continuous measurements on a roughly linear scale.
Typical examples include weight and height, latitude and longitude coordinates (e.g.,
when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for
weight, may lead to a very different clustering structure.
To help avoid dependence on the choice of measurement units, the data should be
standardized. Standardizing measurements attempts to give all variables an equal
weight.
This is especially useful when given no prior knowledge of the data. However, in
some applications, users may intentionally want to give more weight to a certain set
of variables than to others.
For example, when clustering basketball player candidates, we may prefer to give
more weight to the variable height.
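One common way to standardize is sketched below (a sketch assuming the mean-absolute-deviation approach; the notes do not fix a specific formula, and a z-score variant using the standard deviation works similarly):

```python
import numpy as np

def standardize(X):
    """Standardize each column (variable) of the n-by-p data matrix X:
    z_if = (x_if - m_f) / s_f, where m_f is the mean of variable f and
    s_f is its mean absolute deviation (more robust to outliers than
    the standard deviation)."""
    m = X.mean(axis=0)              # per-variable mean m_f
    s = np.abs(X - m).mean(axis=0)  # per-variable mean absolute deviation s_f
    return (X - m) / s

heights_weights = np.array([[1.70, 65.0], [1.80, 80.0], [1.65, 60.0]])
print(standardize(heights_weights))  # unit-free values with equal weight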
Binary Variables
A binary variable is a variable that can take only two values.
For example, a gender variable can generally take the two values male and female.
Contingency Table for Binary Data
Let the two binary values be 0 and 1. For two objects i and j described by p binary
variables, the counts are summarized in the 2-by-2 contingency table:

                 object j: 1   object j: 0    sum
object i: 1          q             r         q + r
object i: 0          s             t         s + t
sum                q + s         r + t         p

Here q is the number of variables that equal 1 for both objects, t is the number that
equal 0 for both, and r and s count the variables on which the two objects disagree.
Sub-types of binary variables
Symmetric binary: both states are equally valuable and carry the same weight, e.g.,
male or female.
Asymmetric binary: the two states are not equally important, e.g., the positive and
negative outcomes of a COVID test, where the (rarer) positive outcome is the more
significant one.
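A brief sketch of how the contingency-table counts q, r, s, t translate into dissimilarities: the symmetric case counts all mismatches over all variables, while the asymmetric case ignores the negative matches t (the example symptom vectors are invented):

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two objects described by 0/1 variables.
    q = both 1, r = x=1 y=0, s = x=0 y=1, t = both 0."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    if symmetric:                    # symmetric: d = (r+s)/(q+r+s+t)
        return (r + s) / (q + r + s + t)
    return (r + s) / (q + r + s)     # asymmetric: negative matches t ignored

# e.g., two patients compared over five yes/no symptoms
print(binary_dissimilarity([1, 0, 1, 0, 0], [1, 1, 0, 0, 0], symmetric=False))
```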
Ordinal Variables
An ordinal variable has a meaningful order among its M_f states. Such variables can be
treated like interval-scaled variables by mapping the range of each variable onto
[0.0, 1.0], replacing the rank r_if of the i-th object in the f-th variable by

z_if = (r_if - 1) / (M_f - 1)
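A tiny sketch of this rank normalization (the three-state example is made up):

```python
def normalize_rank(rank, num_states):
    """Map rank r_if in {1, ..., M_f} onto [0.0, 1.0]:
    z_if = (r_if - 1) / (M_f - 1)."""
    return (rank - 1) / (num_states - 1)

# e.g., a variable with ordered states bronze < silver < gold (M_f = 3)
print([normalize_rank(r, 3) for r in (1, 2, 3)])  # [0.0, 0.5, 1.0]
```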
Partitioning Methods:
The simplest and most fundamental version of cluster analysis is partitioning, which
organizes the objects of a set into several exclusive groups or clusters. To keep the
problem specification concise, we can assume that the number of clusters is given as
background knowledge. This parameter is the starting point for partitioning methods.
Formally, given a data set, D, of n objects, and k, the number of clusters to form, a
partitioning algorithm organizes the objects into k partitions (k ≤ n), where each
partition represents a cluster.
k-Means: A Centroid-Based Technique
Suppose a data set, D, contains n objects in Euclidean space. Partitioning methods
distribute the objects in D into k clusters, C1, ..., Ck.
An objective function is used to assess the partitioning quality so that objects within a
cluster are similar to one another but dissimilar to objects in other clusters. That is, the
objective function aims for high intra-cluster similarity and low inter-cluster similarity.
A centroid-based partitioning technique uses the centroid of a cluster, Ci , to represent
that cluster. Conceptually, the centroid of a cluster is its center point. The centroid can
be defined in various ways such as by the mean or medoid of the objects (or points)
assigned to the cluster.
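As a hedged sketch of the centroid-based idea (not the notes' own pseudocode), here is a bare-bones k-means that alternates between assigning points to the nearest centroid and recomputing centroids as cluster means, thereby reducing the within-cluster sum of squared distances; the data and k are illustrative:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Bare-bones k-means: assign each point to the nearest centroid, then
    recompute each centroid as the mean of its assigned points. Each pass
    lowers the within-cluster sum of squared distances (the objective).
    Note: this sketch does not handle the empty-cluster corner case."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial centers
    for _ in range(max_iter):
        # distances[i, j] = Euclidean distance from point i to centroid j
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)                     # nearest centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):             # converged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [0.5, 1.5],   # one dense group
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.5]])  # another dense group
labels, centroids = k_means(X, k=2)
print(labels)     # cluster id per point
print(centroids)  # the two cluster centers
```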
Hierarchical Methods
Agglomerative Hierarchical Clustering
Agglomerative (bottom-up) clustering starts with every data point as its own cluster and
repeatedly merges the closest pair:
Step 1: Treat each data point as an individual cluster.
Step 2: Compute the distance (proximity) matrix between the clusters.
Step 3: Merge the two closest clusters.
Step 4: Update the distance matrix to reflect the merge.
Step 5: Repeat step 3 and step 4 until you get a single cluster.
Divisive Hierarchical Clustering
Divisive hierarchical clustering is exactly the opposite of Agglomerative Hierarchical
clustering.
In divisive hierarchical clustering, all the data points start out in one single cluster,
and in every iteration the data points that are not similar are separated from the
cluster.
The separated data points are treated as an individual cluster.
Finally, we are left with N clusters.
Example:
Now, to merge clusters, we take the least value in the distance matrix, i.e., 5, which
corresponds to objects p2 and p4.
We then need to calculate the distance between p1 and the merged cluster [p2, p4].
Again we take the least value in the updated matrix, i.e., 9, so we join p1 with the
cluster [p2, p4].
The final clustering is visualized as a dendrogram.
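To reproduce this kind of merge sequence programmatically, here is a sketch using SciPy's single-linkage clustering; the point coordinates are invented, so the merge distances will differ from the example's matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical coordinates for four objects p1..p4.
points = np.array([[0.0, 0.0], [4.0, 3.0], [9.0, 1.0], [4.5, 5.5]])

# Single linkage: at each step, merge the two clusters whose closest members
# are nearest -- exactly the "take the least value in the matrix" rule above.
Z = linkage(points, method="single")
print(Z)          # each row: (cluster a, cluster b, merge distance, new size)
# dendrogram(Z)   # with matplotlib available, draws the merge tree
```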
Advantages of Hierarchical clustering
It is simple to implement and gives the best output in some cases.
It produces a hierarchy, a structure that contains more information than a flat set of clusters.
It does not need us to pre-specify the number of clusters.
Disadvantages of hierarchical clustering
It breaks the large clusters.
It is Difficult to handle different sized clusters and convex shapes.
It is sensitive to noise and outliers.
Once a merge or split has been performed, it can never be undone or adjusted later.
Density-Based Methods
Partitioning and hierarchical methods are designed to find spherical-shaped clusters.
They have difficulty finding clusters of arbitrary shape, such as "S"-shaped and oval
clusters. Given such data, they would likely inaccurately identify convex regions, where
noise or outliers are included in the clusters.
To find clusters of arbitrary shape, we can alternatively model clusters as dense regions
in the data space, separated by sparse regions.
This is the main strategy behind density-based clustering methods, which can discover
clusters of nonspherical shape.
We study density-based clustering through DBSCAN (Density-Based Spatial Clustering
of Applications with Noise). DBSCAN grows clusters from core objects, that is, objects
that have at least a minimum number of neighbors (MinPts) within a given radius (ε);
objects not reachable from any core object are treated as noise.
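A minimal usage sketch with scikit-learn's DBSCAN implementation (assuming scikit-learn is available; the eps and min_samples values and the sample points are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point (noise)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],
              [4.0, 15.0]])

# eps: neighborhood radius; min_samples: minimum points within eps
# for a point to count as a "core" point of a dense region.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(labels)  # cluster ids per point; -1 marks noise/outlier points
```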
Grid-Based Methods
We study grid-based clustering using several interesting methods:
STING: explores statistical information stored in the grid cells.
CLIQUE: represents a grid- and density-based approach for subspace clustering in a
high-dimensional data space.
STING (STatistical INformation Grid)
STING is a grid-based multiresolution clustering technique in which the embedding
spatial area of the input objects is divided into rectangular cells. The space can be
divided in a hierarchical and recursive way.
Several levels of such rectangular cells correspond to different levels of resolution and
form a hierarchical structure:
Each cell at a high level is partitioned to form a number of cells at the next lower
level.
Statistical information regarding the attributes in each grid cell, such as the mean,
maximum, and minimum values, is precomputed and stored as statistical parameters.
The figure shows a hierarchical structure for STING clustering. The statistical parameters of
higher-level cells can easily be computed from the parameters of the lower-level cells.
These parameters include the following: the attribute-independent parameter, count, and
the attribute-dependent parameters mean, stdev (standard deviation), min (minimum),
max (maximum), and the type of distribution that the attribute value in the cell follows,
such as normal, uniform, exponential, or none.
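A hedged sketch of how a higher-level cell's parameters can be rolled up from its child cells (the Cell fields mirror the parameters named above; the numbers are invented):

```python
from dataclasses import dataclass

@dataclass
class Cell:
    count: int       # attribute-independent parameter
    mean: float      # attribute-dependent parameters follow
    minimum: float
    maximum: float

def roll_up(children):
    """Compute a higher-level cell's parameters from its lower-level cells."""
    total = sum(c.count for c in children)
    return Cell(
        count=total,
        mean=sum(c.mean * c.count for c in children) / total,  # weighted mean
        minimum=min(c.minimum for c in children),
        maximum=max(c.maximum for c in children),
    )

kids = [Cell(10, 20.0, 5.0, 40.0), Cell(30, 24.0, 3.0, 55.0)]
print(roll_up(kids))  # parent cell aggregated from its two children
```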
“How is this statistical information useful for query answering?” The statistical parameters
can be used in a top-down, grid-based manner as follows. First, a layer within the
hierarchical structure is determined from which the query-answering process is to start. This
layer typically contains a small number of cells. For each cell in the current layer, we
compute the confidence interval (or estimated probability range) reflecting the cell’s
relevancy to the given query. The irrelevant cells are removed from further consideration.
Processing of the next lower level examines only the remaining relevant cells. This process
is repeated until the bottom layer is reached. At this time, if the query specification is met,
the regions of relevant cells that satisfy the query are returned. Otherwise, the data that fall
into the relevant cells are retrieved and further processed until they meet the query’s
requirements.
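A schematic sketch of this top-down loop (the is_relevant test stands in for the real confidence-interval computation, and a uniform-depth grid is assumed):

```python
def answer_query(top_cells, is_relevant, children_of):
    """Top-down STING query processing over a uniform-depth grid hierarchy:
    at each level, keep only the relevant cells and examine only their
    children at the next lower level."""
    layer = [c for c in top_cells if is_relevant(c)]
    # Descend while the surviving cells still have children, i.e., while
    # the bottom layer of the hierarchy has not been reached.
    while layer and children_of(layer[0]):
        layer = [child for cell in layer
                 for child in children_of(cell) if is_relevant(child)]
    return layer  # regions of relevant bottom-level cells
```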
“What advantages does STING offer over other clustering methods?” STING offers
several advantages:
The grid-based computation is query-independent, because the statistical information
stored in each cell summarizes the data in the cell independently of any query.
The grid structure facilitates parallel processing and incremental updating.
The method is efficient: STING scans the database once to compute the statistical
parameters of the cells, so generating clusters takes O(n) time, where n is the number of
objects; after the hierarchy is built, query processing takes O(g) time, where g is the
number of grid cells at the lowest level (usually g << n).
Disadvantage:
All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is
detected.
CLIQUE
CLIQUE (CLustering In QUEst) is a grid- and density-based method for finding
clusters in subspaces of a high-dimensional data space.
For example, consider a health informatics application where patient records contain
attributes describing personal information, numerous symptoms, conditions, and
family history. For bird flu patients, for instance, the age, gender, and job attributes may
vary dramatically within a wide range of values.
Thus, it can be difficult to find such a cluster within the entire data space. Instead, by
searching in subspaces, we may find a cluster of similar patients in a lower-dimensional
space (e.g., patients who are similar to one another with respect to symptoms like high
fever and cough but no runny nose, and aged between 3 and 16).
Outlier Analysis
An outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
Outliers are different from noisy data. Noise is a random error or variance in a measured
variable. In general, noise is not interesting in data analysis, including outlier detection. For
example, in credit card fraud detection, a customer’s purchase behavior can be modeled as a
random variable. A customer may generate some “noise transactions” that may seem like
“random errors” or “variance,” such as by buying a bigger lunch one day, or having one
more cup of coffee than usual. Such transactions should not be treated as outliers; otherwise,
the credit card company would incur heavy costs from verifying that many transactions. The
company may also lose customers by bothering them with multiple false alarms. As in many
other data analysis and data mining tasks, noise should be removed before outlier detection.
Outliers are interesting because they are suspected of not being generated by the same
mechanisms as the rest of the data. Therefore, in outlier detection, it is important to
justify why the detected outliers are generated by some other mechanisms.
Outlier detection is also related to novelty detection in evolving data sets.
For example, by monitoring a social media web site where new content is incoming,
novelty detection may identify new topics and trends in a timely manner.
Novel topics may initially appear as outliers.
To this extent, outlier detection and novelty detection share some similarity in
modeling and detection methods.
In general, outliers can be classified into three categories, namely global outliers,
contextual (or conditional) outliers, and collective outliers.
Global Outliers
In a given data set, a data object is a global outlier if it deviates significantly from
the rest of the data set.
Global outliers are sometimes called point anomalies, and are the simplest type of
outliers.
Most outlier detection methods are aimed at finding global outliers.
Global outlier detection is important in many applications.
Consider intrusion detection in computer networks, for example.
If the communication behavior of a computer is very different from the normal
patterns (e.g., a large number of packets is broadcast in a short time), this behavior
may be considered a global outlier, and the corresponding computer is a suspected
victim of hacking.
As another example, in trading transaction auditing systems, transactions that do not
follow the regulations are considered as global outliers and should be held for further
examination.
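As a toy illustration of global outlier detection (one simple rule among many, not prescribed by the notes), the sketch below flags values whose z-score magnitude exceeds a threshold:

```python
import numpy as np

def global_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > threshold]

# Invented network traffic: steady rates plus one extreme burst
traffic = [100 + i % 7 - 3 for i in range(20)] + [850]
print(global_outliers(traffic))  # the burst of 850 stands out globally
```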
Contextual Outliers
“The temperature today is 35 °C. Is it exceptional (i.e., an outlier)?” It depends, for
example, on the time and location! If it is winter in Hyderabad, yes, it is an
outlier.
If it is a summer day in Hyderabad, then it is normal.
Unlike global outlier detection, in this case, whether or not today’s temperature
value is an outlier depends on the context—the date, the location, and possibly some
other factors.
Contextual outliers are a generalization of local outliers, objects whose density
significantly deviates from the local area in which they occur.
In credit card fraud detection, in addition to global outliers, an analyst may consider
outliers in different contexts.
Consider customers who use more than 90% of their credit limit.
If one such customer is viewed as belonging to a group of customers with low credit
limits, then such behavior may not be considered an outlier.
However, the similar behavior of customers from a high-income group may be
considered an outlier if their balance often exceeds their credit limit.
Such outliers may lead to business opportunities: raising credit limits for such
customers can bring in new revenue.
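A hedged sketch of one way to operationalize context: compare each reading against the statistics of its own context, here the month (the temperatures and the 1.5-sigma threshold are invented for illustration):

```python
import statistics
from collections import defaultdict

# (month, temperature in °C) readings; values are made up for illustration
readings = [("Jan", 21), ("Jan", 22), ("Jan", 20), ("Jan", 35),
            ("May", 36), ("May", 38), ("May", 35), ("May", 37)]

by_month = defaultdict(list)
for month, t in readings:
    by_month[month].append(t)

# A reading is a contextual outlier if it lies far from its month's mean.
for month, t in readings:
    mean = statistics.mean(by_month[month])
    sd = statistics.pstdev(by_month[month])
    if sd and abs(t - mean) > 1.5 * sd:
        print(f"{t} °C is a contextual outlier for {month}")
# 35 °C is flagged for Jan but not for May, matching the discussion above.
```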
Collective Outliers
Suppose you are a supply-chain manager of AllElectronics. You handle thousands of
orders and shipments every day. If the shipment of an order is delayed, it may not be
considered an outlier because, statistically, delays occur from time to time.
However, you have to pay attention if 100 orders are delayed on a single day.
Those 100 orders as a whole form an outlier, although each of them may not be
regarded as an outlier if considered individually.
You may have to take a close look at those orders collectively to understand the
shipment problem.
Collective outliers: in the figure, the black objects as a whole form a collective outlier,
because the density of those objects is much higher than that of the rest of the data set.
However, no black object individually is an outlier with respect to the whole data set.
Collective outlier detection has many important applications.
For example, in intrusion detection, a denial-of-service packet sent from one computer
to another is considered normal, and not an outlier at all.
However, if several computers keep sending denial-of-service packets to each
other, they as a whole should be considered a collective outlier.
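A toy sketch of the shipment example: no single delayed order is an outlier, but a day whose delay count far exceeds the typical daily count is flagged as a collective outlier (the counts and the 10x-median rule are invented):

```python
import statistics

# Delayed-order counts per day; each individual delay is unremarkable
delays_per_day = {"Mon": 4, "Tue": 6, "Wed": 5, "Thu": 100, "Fri": 5}

median = statistics.median(delays_per_day.values())

# Flag days whose delay count is an order of magnitude above the median:
# Thursday's 100 delays together form a collective outlier, even though
# no single delayed order is an outlier on its own.
for day, n in delays_per_day.items():
    if n > 10 * median:
        print(f"{day}: {n} delays form a collective outlier")
```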
Summary: General Characteristics of Clustering Methods

Partitioning methods:
- Find mutually exclusive clusters of spherical shape
- Distance-based
- May use mean or medoid (etc.) to represent cluster center
- Effective for small- to medium-size data sets

Hierarchical methods:
- Clustering is a hierarchical decomposition (i.e., multiple levels)
- Cannot correct erroneous merges or splits
- May incorporate other techniques like microclustering or consider object "linkages"

Density-based methods:
- Can find arbitrarily shaped clusters
- Clusters are dense regions of objects in space that are separated by low-density regions
- Cluster density: each point must have a minimum number of points within its "neighborhood"
- May filter out outliers

Grid-based methods:
- Use a multiresolution grid data structure
- Fast processing time (typically independent of the number of data objects, yet dependent on grid size)