The document provides an introduction to clustering algorithms, focusing on density-based methods like DENCLUE, which utilizes kernel density estimation to identify clusters of arbitrary shape while handling noise. It outlines the algorithm's two main steps: preprocessing and clustering, emphasizing its efficiency and ability to scale. Additionally, it discusses the challenges of kernel methods in high-dimensional spaces and suggests potential solutions for improving model interpretability.
Introduction to Some Complementary Algorithms for Clustering Data
Collected & Prepared by:
Morteza H. Chehreghani
Data Mining Course, Sharif University of Technology

Chapter 8. Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary

Density-Based Clustering Methods
• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD'96)
– OPTICS: Ankerst, et al. (SIGMOD'99)
– DBRS: Wang, et al. (2003)
– DENCLUE: Hinneburg & Keim (KDD'98)
– CLIQUE: Agrawal, et al. (SIGMOD'98)

DENCLUE: Using Density Functions
• DENsity-based CLUstEring, by Hinneburg & Keim (KDD'98)
• Major features:
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
– Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
– But needs a large number of parameters

DENCLUE
• Models the overall density of a set of points as the sum of 'influence' functions associated with each point.
• The resulting overall density function has local peaks, i.e., local density maxima, and these local peaks can be used to define clusters in a straightforward way.
• For each data point, a hill-climbing procedure finds the nearest peak associated with that point, and the set of all data points associated with a particular peak (called a local density attractor) becomes a (center-defined) cluster.
• If the density at a local peak is too low, then the points in the associated cluster are classified as noise and discarded.
• If a local peak can be connected to a second local peak by a path of data points, and the density at each point on the path is above a minimum density threshold, then the clusters associated with these local peaks are merged.
• Thus, clusters of any shape can be discovered.

DENCLUE
• DENCLUE is based on a well-developed area of statistics and pattern recognition known as 'kernel density estimation.'
• The goal of kernel density estimation (and of many other statistical techniques as well) is to describe the distribution of the data by a function.
• For kernel density estimation, the contribution of each point to the overall density function is expressed by an 'influence' (kernel) function.
• The overall density is then merely the sum of the influence functions associated with each point.

DENCLUE
• Typically, the influence or kernel function is symmetric (the same in all directions), and its value (contribution) decreases as the distance from the point increases.
• The Gaussian function is often used as a kernel function.
Influence Function
• Example (figure)
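To make the influence-function idea concrete, the following is a minimal sketch (not from the original slides) of the overall density at a query point computed as a sum of Gaussian influence functions; the function names, data, and choice of sigma are illustrative assumptions.

    import numpy as np

    def gaussian_influence(x, xi, sigma):
        # Influence of data point xi at location x: a Gaussian kernel
        # whose contribution decays with squared Euclidean distance.
        return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

    def overall_density(x, data, sigma):
        # DENCLUE-style overall density at x: the sum of the
        # influence functions of all data points.
        return sum(gaussian_influence(x, xi, sigma) for xi in data)

    data = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
    print(overall_density(np.array([1.1, 1.0]), data, sigma=0.5))  # high: near a dense region
    print(overall_density(np.array([3.0, 3.0]), data, sigma=0.5))  # low: far from all points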
Density Attractor (figure)
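The hill climbing toward a density attractor can be sketched as gradient ascent on the summed Gaussian influences above; the update rule, step size, and tolerance below are illustrative assumptions rather than the exact procedure of the original paper.

    import numpy as np

    def density_gradient(x, data, sigma):
        # Gradient of the overall density (sum of Gaussian influences) at x.
        grad = np.zeros_like(x)
        for xi in data:
            w = np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))
            grad += w * (xi - x) / sigma ** 2
        return grad

    def find_attractor(x, data, sigma, step=0.05, tol=1e-5, max_iter=500):
        # Move uphill along the density gradient; the point of
        # convergence is the density attractor of the starting point x.
        for _ in range(max_iter):
            x_new = x + step * density_gradient(x, data, sigma)
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        return x

    # Points whose hill climbs converge to (nearly) the same attractor
    # belong to the same center-defined cluster.
    data = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]])
    print(find_attractor(np.array([1.3, 1.1]), data, sigma=0.5))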
DENCLUE
• The DENCLUE algorithm has two steps:
– a preprocessing step
– a clustering step
• In the preprocessing step, a grid for the data is created by dividing the minimal bounding hyper-rectangle into d-dimensional hyper-rectangles with an edge length of 2σ. The hyper-rectangles that contain points are then determined. (Actually, only the occupied hyper-rectangles are constructed.) The hyper-rectangles are numbered with respect to a particular origin (at one edge of the bounding hyper-rectangle), and these keys are stored in a search tree to provide efficient access during later processing. For each stored cell, the number of points, the sum of the points in the cell, and the connections to neighboring populated cubes are also stored.

DENCLUE
• For the clustering step, DENCLUE considers only the highly populated cubes and the cubes that are connected to them.
• Starting with each of these cubes as a cluster, the algorithm proceeds as follows: for each point x, the local density function is calculated by considering only those points that
– a) are in clusters connected to the cluster containing x, and
– b) have cluster centroids within a distance of kσ of x, where k = 4.
• DENCLUE discards clusters associated with a density attractor whose density is less than ξ.
• Finally, DENCLUE merges density attractors that can be joined by a path of points, all of which have a density greater than ξ.
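A minimal sketch of the preprocessing step described above, assuming the data sit in a NumPy array; a plain dict keyed by integer cell coordinates stands in for the search tree, and only occupied cells are ever created.

    import numpy as np
    from collections import defaultdict

    def build_occupied_cells(data, sigma):
        # Divide the minimal bounding hyper-rectangle into cells of
        # edge length 2*sigma and record, for each occupied cell, the
        # number of points it contains and the (vector) sum of those points.
        origin = data.min(axis=0)
        edge = 2.0 * sigma
        cells = defaultdict(lambda: {"count": 0, "sum": 0.0})
        for x in data:
            key = tuple(((x - origin) // edge).astype(int))
            cells[key]["count"] += 1
            cells[key]["sum"] = cells[key]["sum"] + x
        return dict(cells)  # only occupied cells are materialized

    cells = build_occupied_cells(np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]]), sigma=0.5)
    print(cells)  # two occupied cells here: one with count 2, one with count 1

The highly populated cells (count above a threshold) would then be linked to their occupied neighbors before the clustering step.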
DENCLUE
• This provides a high level of generality: with appropriate choices of the influence function and its parameters, DENCLUE can reproduce the behavior of other methods, such as:
– DBSCAN
– k-means (center-defined) clusters
• DENCLUE scales well, since at its initial stage it builds a map of hyper-rectangular cubes with edge length 2σ.
• For this reason, the algorithm can also be classified as a grid-based method.
Kernel Density Estimation
• Kernel estimates smooth out the contribution of each observed data point over a local neighborhood of that point.
• The contribution of data point x(i) to the estimate at some point x* depends on how far apart x(i) and x* are.
• The extent of this contribution depends on the shape of the kernel function adopted and on the width accorded to it:

    f̂(x*) = (1/N) Σ_{i=1}^{N} K(x* − x(i)),   where ∫ K(t) dt = 1

• The quality of a kernel estimate depends less on the shape of K than on the value of the bandwidth h.
• A common form for K is the Normal (Gaussian) curve, with h as its spread parameter (standard deviation), i.e.,

    K(t) = C · exp(−t² / (2h²))

• where C is a normalization constant and t = x − x(i) is the distance of the query point x to data point x(i).
• The bandwidth h is equivalent to σ, the standard deviation (or width) of the Gaussian kernel function.

Kernel Density Estimation
• Kernel methods are closely related to nearest-neighbor methods.
• Naively, computing the kernel estimate at N points costs O(N²) kernel evaluations; for large data sets, fast algorithms are needed:
– Can use tricks learned from N-body problems, e.g., trees
– It is sufficient to bound the density and compute only until the bounds separate
• The choice of bandwidth is critical (analogy: the choice of histogram bin size):
– Small values of h lead to very spiky estimates (not much smoothing at all)
– Large values lead to oversmoothing
• Kernel density estimation is often described as non-parametric, because the model is largely data-driven, with no parameters in the conventional sense (except for the bandwidth h).
• Such data-driven smoothing techniques are useful for data interpretation, at least in one or two dimensions.
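To make the role of h concrete, here is a small one-dimensional sketch of the Gaussian kernel estimate defined above, with C = 1/(h·√(2π)) so that each kernel integrates to 1; the sample data and bandwidth values are illustrative only.

    import numpy as np

    def kde(x_star, data, h):
        # 1-D kernel estimate with a Gaussian kernel of bandwidth h:
        # f_hat(x*) = (1/N) * sum_i C * exp(-(x* - x_i)^2 / (2 h^2)).
        c = 1.0 / (h * np.sqrt(2.0 * np.pi))
        return np.mean(c * np.exp(-((x_star - data) ** 2) / (2.0 * h ** 2)))

    data = np.array([0.9, 1.0, 1.1, 3.0, 3.2])
    for h in (0.05, 0.3, 2.0):
        # Small h gives spiky estimates; large h oversmooths.
        print(h, [round(kde(x, data, h), 3) for x in (1.0, 2.0, 3.1)])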
Drawbacks
• In particular, as the number of variables in the predictor space increases, the number of data points required to obtain accurate estimates grows exponentially.
• This means that these "local neighborhood" models tend to scale poorly to high dimensions.
• Another drawback is the lack of interpretability of the model.
• Solutions:
– using a subset of relevant variables to construct the model
– transforming the original p variables into a new set of p' variables, where p' << p
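As a sketch of the second remedy, one common choice of transformation (not prescribed by the slides) is a principal-component projection of the p original variables onto p' derived variables before density estimation; the function name and dimensions are illustrative.

    import numpy as np

    def project_to_p_prime(data, p_prime):
        # Project p-dimensional data onto its top p_prime principal
        # components (via SVD of the centered data matrix), so that a
        # kernel estimate can be built in a much lower dimension.
        centered = data - data.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        return centered @ vt[:p_prime].T

    # Example: reduce 10-D data to 2-D before kernel density estimation.
    data = np.random.default_rng(0).normal(size=(100, 10))
    low_dim = project_to_p_prime(data, 2)
    print(low_dim.shape)  # (100, 2)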