
Introduction to Some Complementary Algorithms for
Clustering the Data

Collected & Prepared by:
Morteza H. Chehreghani

Data Mining Course, Sharif University of Technology
Chapter 8. Cluster Analysis
• What is Cluster Analysis?
• Types of Data in Cluster Analysis
• A Categorization of Major Clustering Methods
• Partitioning Methods
• Hierarchical Methods
• Density-Based Methods
• Grid-Based Methods
• Model-Based Clustering Methods
• Outlier Analysis
• Summary
Density-Based Clustering Methods
• Clustering based on density (local cluster
criterion), such as density-connected points
• Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
• Several interesting studies:
– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DBRS: Wang, et al. (2003).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98)
DENCLUE: Using density
functions
• DENsity-based CLUstEring by Hinneburg & Keim
(KDD’98)
• Major features
– Solid mathematical foundation
– Good for data sets with large amounts of noise
– Allows a compact mathematical description of arbitrarily
shaped clusters in high-dimensional data sets
– Significantly faster than existing algorithms (faster than
DBSCAN by a factor of up to 45)
– But needs a large number of parameters
DENCLUE
• Models the overall density of a set of points as the sum of
‘influence’ functions associated with each point.
• The resulting overall density function will have local peaks,
i.e., local density maxima, and these local peaks can be
used to define clusters in a straightforward way.
• For each data point, a hill climbing procedure finds the
nearest peak associated with that point, and the set of all
data points associated with a particular peak (called a local
density attractor) becomes a (center-defined) cluster.
• If the density at a local peak is too low, then the points in
the associated cluster are classified as noise and discarded.
• If a local peak can be connected to a second local peak by a
path of data points, and the density at each point on the
path is above a minimum density threshold, then the
clusters associated with these local peaks are merged.
• Thus, clusters of any shape can be discovered.
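The hill-climbing step described above can be sketched as follows (an illustrative Python sketch using Gaussian influence functions; the function names, the step size delta, and the convergence tolerance are assumptions, not part of the original DENCLUE implementation):

```python
import numpy as np

def density(x, points, sigma=1.0):
    """Overall density at x: the sum of Gaussian influence functions."""
    d2 = np.sum((points - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def hill_climb(x, points, sigma=1.0, delta=0.1, tol=1e-4, max_iter=100):
    """Gradient ascent from x toward its local density attractor."""
    for _ in range(max_iter):
        d2 = np.sum((points - x) ** 2, axis=1)
        w = np.exp(-d2 / (2 * sigma ** 2))        # influence of each data point
        grad = (w[:, None] * (points - x)).sum(axis=0) / sigma ** 2
        if np.linalg.norm(grad) < tol:            # reached a local density peak
            break
        x = x + delta * grad                      # step uphill
    return x
```

Each data point would be climbed in this way; points that end at the same attractor then form one center-defined cluster.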
DENCLUE
• DENCLUE is based on a well-developed area of
– statistics
– pattern recognition
– which is known as ‘kernel density estimation.’
• The goal of kernel density estimation (and many
other statistical techniques as well) is to describe
the distribution of the data by a function.
• For kernel density estimation, the contribution of
each point to the overall density function is
expressed by an ‘influence’ (kernel) function.
• The overall density is then merely the sum of the
influence functions associated with each point.
DENCLUE
• Typically the influence or kernel function is
symmetric (the same in all directions) and its
value (contribution) decreases as the
distance from the point increases.
• The Gaussian function is often used as a kernel
function.



Influence function
• Example



Density Attractor



DENCLUE
• The DENCLUE algorithm has two steps,
– preprocessing step
– clustering step
• In the preprocessing step, a grid for the data is created by
dividing the minimal bounding hyper-rectangle into d-
dimensional hyper-rectangles with an edge length of 2σ.
The rectangles that contain points are then determined.
(Actually, only the occupied hyper-rectangles are
constructed.) The hyper-rectangles are numbered with
respect to a particular origin (at one edge of the bounding
hyper-rectangle), and these keys are stored in a search tree
to provide efficient access during later processing. For each
stored cell, the number of points, the sum of the points in
the cell, and the connections to neighboring population
cubes are also stored.
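The preprocessing step above can be sketched as follows (a simplified illustration: only point counts and linear sums are kept per occupied cell, and a plain dictionary stands in for the search tree of the original algorithm; all names are assumptions):

```python
import numpy as np

def build_grid(points, sigma):
    """Map each point to a hyper-rectangular cell of edge length 2*sigma.

    Only occupied cells are materialised, keyed by integer coordinates
    relative to the corner of the minimal bounding hyper-rectangle.
    """
    origin = points.min(axis=0)                  # corner of the bounding box
    keys = np.floor((points - origin) / (2 * sigma)).astype(int)
    cells = {}
    for key, p in zip(map(tuple, keys), points):
        cell = cells.setdefault(key, {"count": 0,
                                      "sum": np.zeros(points.shape[1])})
        cell["count"] += 1                       # number of points in the cell
        cell["sum"] = cell["sum"] + p            # linear sum (gives the cell mean)
    return cells
```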
DENCLUE
• For the clustering step, DENCLUE considers only the highly
populated cubes and the cubes that are connected to them.
• Starting with each of these cubes as a cluster, the algorithm
proceeds as follows:
• For each point, x, the local density function is calculated
only by considering points from clusters that
– a) are connected to the cluster containing the point
– b) have cluster centroids within a distance of kσ of the point, where k = 4.

• DENCLUE discards clusters associated with a density
attractor whose density is less than ξ.
• Finally, DENCLUE merges density attractors that can be
joined by a path of points, all of which have a density
greater than ξ.

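The ξ-thresholding and merging just described might be sketched like this (illustrative only: proximity of the attractors stands in for the path-of-dense-points test of the real algorithm, and all names and the merge_dist parameter are assumptions):

```python
import numpy as np

def filter_and_merge(attractors, densities, xi, merge_dist):
    """Discard attractors with density below xi, then greedily group survivors."""
    kept = [a for a, d in zip(attractors, densities) if d >= xi]
    clusters = []
    for a in kept:
        for c in clusters:
            # Merge if a is close to any attractor already in the cluster.
            if any(np.linalg.norm(a - b) < merge_dist for b in c):
                c.append(a)
                break
        else:
            clusters.append([a])                 # start a new cluster
    return clusters
```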


DENCLUE



DENCLUE
• This provides a high level of generality: with suitable
parameter settings DENCLUE can reproduce the results of
– DBSCAN
– k-means clusters

• DENCLUE scales well.


• At its initial stage it builds a map of hyper-rectangle
cubes with edge length 2σ; for this reason, the algorithm
can also be classified as a grid-based method.



Kernel Density Estimation
• Kernel estimates smooth out the contribution of each observed data point over a
local neighborhood of that point
• The contribution of data point x(i) to the estimate at some point x* depends on
how far apart x(i) and x* are.
• The extent of this contribution depends on the shape of the kernel
function adopted and the width accorded to it.

f̂(x*) = (1/N) Σᵢ K(x* − x(i)),   where ∫ K(t) dt = 1

• The quality of a kernel estimate depends less on the shape of K than on the
value of h.
• A common form for K is the Normal (Gaussian) curve, with h as its spread
parameter (standard deviation), i.e.,

K(t) = C · exp(−t² / (2h²))

• where C is a normalization constant and t = x − x(i) is the distance of the query
point x to data point x(i).
• The bandwidth h is equivalent to σ, the standard deviation (or width) of the
Gaussian kernel function.
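The estimate above can be written directly in code (a minimal 1-D sketch with an explicitly normalised Gaussian kernel; the function name is an assumption):

```python
import numpy as np

def kde(x_star, data, h):
    """1-D kernel density estimate at x_star with Gaussian bandwidth h.

    Implements f_hat(x*) = (1/N) * sum_i K(x* - x(i)), where the kernel
    K(t) = C * exp(-t**2 / (2*h**2)) and C = 1 / (h * sqrt(2*pi)) makes
    K integrate to 1.
    """
    t = x_star - data
    K = np.exp(-t ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
    return K.sum() / len(data)
```

Because each kernel integrates to 1, the estimate itself integrates to 1, so it is a valid density.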
Kernel Density Estimation
• Kernel methods are closely related to nearest neighbor methods.
• Naively, computing N kernel estimates costs O(N²); for large data sets, fast
algorithms are needed
– Can use tricks learned from N-body, e.g., trees
– Sufficient to bound the density, and compute until the bounds separate
• Choice of bandwidth is critical (analogy: choice of histogram bin size)
– Small values of h lead to very spiky estimates (not much smoothing at all)
– large values lead to oversmoothing.

• KDE is often described as non-parametric
– because the model is largely data-driven, with no parameters in the
conventional sense (except for the bandwidth h).
• Such data-driven smoothing techniques are useful for data
interpretation, at least in one or two dimensions.



Drawbacks
• In particular, as the number of variables in the
predictor space increases, so the number of data
points required to obtain accurate estimates
increases exponentially.
• This means that these "local neighborhood" models
tend to scale poorly to high dimensions.
• Another drawback is the lack of interpretability of the model.
• Solutions:
– using a subset of relevant variables to construct the
model
– transforming the original p variables into a new set of p'
variables, where p' << p.
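The second solution, projecting the original p variables onto p' derived variables, is commonly done with principal component analysis; a minimal sketch (the function name is an assumption):

```python
import numpy as np

def project(X, p_prime):
    """Project p-dimensional rows of X onto their top p' principal components."""
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:p_prime].T                   # new p'-dimensional coordinates
```

Density estimation (and clustering) can then be run in the p'-dimensional space, where far fewer data points are needed for accurate estimates.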
