Clustering
Typical examples of interval-scaled variables include weight and height, latitude and longitude
coordinates (e.g., when clustering houses), and weather temperature.
The measurement unit used can affect the clustering analysis. For example, changing
measurement units from meters to inches for height, or from kilograms to pounds for weight,
may lead to a very different clustering structure.
In general, expressing a variable in smaller units will lead to a larger range for that variable, and
thus a larger effect on the resulting clustering structure.
To help avoid dependence on the choice of measurement units, the data should be standardized.
Standardizing measurements attempts to give all variables an equal weight.
This is especially useful when given no prior knowledge of the data. However, in some
applications, users may intentionally want to give more weight to a certain set of variables than
to others.
For example, when clustering basketball player candidates, we may prefer to give more weight to
the variable height.
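The effect of standardization on measurement units can be checked with a minimal sketch. It assumes z-score standardization (subtract the mean, divide by the population standard deviation); other measures of spread can be used instead, and the sample heights are made up for illustration.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant variable carries no information for clustering.
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]

# Hypothetical sample: the same three heights in meters and in inches.
heights_m = [1.60, 1.75, 1.90]
heights_in = [h * 39.3701 for h in heights_m]

z_m = standardize(heights_m)
z_in = standardize(heights_in)
# After standardization both lists hold the same unit-free values.
```

To give a variable such as height more weight on purpose, its standardized values could be multiplied by a chosen factor before clustering.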
Binary Variables
Let p=a+b+c+d
A nominal variable is a generalization of the binary variable in that it can take more than two
states, e.g., red, yellow, blue, green.
Method 1: Simple matching
The dissimilarity between two objects i and j can be computed by simple matching.
m: Let m be the number of matches (i.e., the number of variables for which i and j are in the
same state).
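A minimal sketch of simple matching follows. It assumes the usual form d(i, j) = (p - m) / p, where p is taken to be the total number of variables describing each object; the function name and the sample objects are illustrative only.

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """Fraction of variables on which two objects are in different states."""
    if len(obj_i) != len(obj_j) or not obj_i:
        raise ValueError("objects must have the same, non-zero number of variables")
    p = len(obj_i)                                      # total number of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# Hypothetical objects described by colour, size, and shape.
d = simple_matching_dissimilarity(["red", "small", "round"],
                                  ["red", "large", "round"])
```

Here two of the three variables match, so the dissimilarity is one third.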
Ordinal Variables
The range of each variable is mapped onto [0, 1] by replacing the rank of the i-th object in the
f-th variable by a normalized value.
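The expression for the normalized value is not given in these notes. A common choice, assumed here, is z_if = (r_if - 1) / (M_f - 1), where r_if is the rank of object i in variable f and M_f is the number of ordered states of that variable. A minimal sketch:

```python
def normalize_ranks(ranks, num_states):
    """Map ranks 1..num_states onto [0, 1] (assumed formula: (r - 1) / (M - 1))."""
    if num_states < 2:
        raise ValueError("an ordinal variable needs at least two states")
    return [(r - 1) / (num_states - 1) for r in ranks]

# Hypothetical ordinal variable with states bronze=1, silver=2, gold=3.
z = normalize_ranks([1, 2, 3], num_states=3)
```

The normalized values can then be handled like interval-scaled measurements.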
Ratio-Scaled Intervals
• Finally, treat them as continuous ordinal data and treat their ranks as interval-scaled.
1. K-means
Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current partition (the
centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when no assignment changes
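The four steps can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the round-robin initial partition, the Euclidean distance, the `max_iter` cap, and the sample points are all assumptions made for the example.

```python
from math import dist

def centroid(points):
    """Mean point of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, max_iter=100):
    if not 1 <= k <= len(points):
        raise ValueError("k must satisfy 1 <= k <= number of points")
    # Step 1: partition the objects into k non-empty subsets (round-robin here).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster of the current partition.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old seed point if a cluster became empty
                centroids[c] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2)
```

On this data the first three points end up in one cluster and the last three in the other. In general the result depends on the initial partition.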
2. Partitioning Method
Suppose we are given a database of n objects. The partitioning method constructs k partitions of
the data.
Each partition represents a cluster, and k ≤ n. That is, the method classifies the data into k
groups that satisfy the following requirements:
• Each group contains at least one object.
• Each object belongs to exactly one group.
Typical methods:
K-means, k-medoids, CLARANS
3. Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. There are two
approaches:
o Agglomerative Approach
o Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each object forming a
separate group. The method keeps merging the objects or groups that are close to one another,
until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. We start with all of the objects in the
same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues
until each object is in its own cluster or the termination condition holds.
Disadvantage
This method is rigid, i.e., once a merge or split is done, it can never be undone.
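The agglomerative approach can be sketched as follows. The notes do not say how the closeness of two groups is measured, so single linkage (the smallest pairwise Euclidean distance) is assumed here, and the termination condition is taken to be a target number of clusters.

```python
from math import dist

def agglomerative(points, num_clusters):
    """Bottom-up clustering with single linkage (an assumed choice of linkage)."""
    if not 1 <= num_clusters <= len(points):
        raise ValueError("num_clusters must be between 1 and the number of points")
    clusters = [[p] for p in points]        # start: every object is its own group
    while len(clusters) > num_clusters:     # termination condition
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])     # this merge is never undone later
        del clusters[b]
    return clusters

# Hypothetical 1-D data, written as 1-tuples.
groups = agglomerative([(0.0,), (0.5,), (5.0,), (5.4,), (10.0,)], num_clusters=3)
```

The sketch recomputes all pairwise distances on every pass, which is fine for illustration but too slow for large data sets.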
MinPts: MinPts refers to the minimum number of points required in the Eps-neighborhood of a point.
A point i is directly density reachable from a point k with respect to Eps and MinPts if
i belongs to NEps(k) and k is a core point, i.e., NEps(k) contains at least MinPts points.
A point i is density reachable from a point j with respect to Eps and MinPts if there is a
chain of points i_1, ..., i_n with i_1 = j and i_n = i such that each i_(m+1) is directly density
reachable from i_m.
Density connected:
A point i is density connected to a point j with respect to Eps and MinPts if there is a point o
such that both i and j are density reachable from o with respect to Eps and MinPts.
Working of Density-Based Clustering
Suppose a set of objects is denoted by D'. An object i is directly density reachable
from an object j only if it is located within the ε-neighborhood of j and j is a core object.
An object i is density reachable from an object j with respect to ε and MinPts in a given set of
objects D' only if there is a chain of objects i_1, ..., i_n with i_1 = j and i_n = i such that each
i_(m+1) is directly density reachable from i_m with respect to ε and MinPts.
An object i is density connected to an object j with respect to ε and MinPts in a given set of
objects D' only if there is an object o in D' such that both i and j are density reachable from o
with respect to ε and MinPts.
o It is a scan method.
o It requires density parameters as a termination condition.
o It is used to manage noise in data clusters.
o Density-based clustering is used to identify clusters of arbitrary shape.
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It relies on
a density-based notion of a cluster and identifies clusters of arbitrary shape in spatial databases
that contain outliers.
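The definitions of core point, density reachability, and noise can be tied together in a compact DBSCAN sketch. It assumes Euclidean distance and a brute-force neighborhood search; the names `region_query` and `NOISE` are illustrative, and a point counts as a member of its own Eps-neighborhood.

```python
from math import dist

NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (its Eps-neighborhood)."""
    return [j for j, q in enumerate(points) if dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)          # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE              # may later be claimed as a border point
            continue
        cluster_id += 1                    # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

# Hypothetical data: two dense squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
result = dbscan(data, eps=1.5, min_pts=3)
```

Each square forms one cluster, and the isolated point is labelled as noise because its neighborhood holds fewer than `min_pts` points.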
OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. It produces an
ordering of the database with respect to its density-based clustering structure. This cluster
ordering contains information equivalent to the density-based clusterings obtained over a broad
range of parameter settings. OPTICS is useful for both automatic and interactive cluster
analysis, including determining an intrinsic clustering structure.
DENCLUE
Algorithm
• Initially, assign k cluster centers randomly.
• Iteratively refine the clusters using the following two steps:
Expectation step: assign each data point Xi to cluster Ck with the following probability
\[ P(X_i \in C_k) = P(C_k \mid X_i) = \frac{P(C_k)\, P(X_i \mid C_k)}{P(X_i)} \]
Maximization step: use the probability estimates above to re-estimate the model parameters, for example
\[ m_k = \frac{1}{N} \sum_{i=1}^{N} \frac{X_i\, P(X_i \in C_k)}{\sum_{j} P(X_i \in C_j)} \]
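One iteration of the two steps can be sketched for one-dimensional data. Several things are assumed that the notes do not specify: each cluster is modelled as a Gaussian with a fixed standard deviation, and the mean update uses the standard mixture-model form, which divides the weighted sum by the total membership probability of cluster k rather than by N.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

def em_step(xs, means, priors, std=1.0):
    """One expectation step followed by one maximization step."""
    # Expectation: P(C_k | x_i) = P(C_k) * P(x_i | C_k) / P(x_i)
    resp = []
    for x in xs:
        joint = [p * gaussian_pdf(x, m, std) for p, m in zip(priors, means)]
        total = sum(joint)                 # P(x_i)
        resp.append([j / total for j in joint])
    # Maximization: re-estimate cluster means and priors from the memberships.
    new_means, new_priors = [], []
    for k in range(len(means)):
        weight = sum(r[k] for r in resp)   # total membership of cluster k
        new_means.append(sum(r[k] * x for r, x in zip(resp, xs)) / weight)
        new_priors.append(weight / len(xs))
    return new_means, new_priors

# Hypothetical data with two groups; the starting means are chosen by hand.
xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, priors = [0.0, 6.0], [0.5, 0.5]
for _ in range(20):
    means, priors = em_step(xs, means, priors)
```

After a few iterations the two means settle close to the centres of the two groups. A fuller implementation would also re-estimate the variances and guard against a cluster whose total membership drops to zero.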
Machine learning approach: Machine learning builds algorithms that process large volumes of
data and deliver results to users. These programs learn from experience and make
predictions.
The algorithms improve themselves through frequent input of training data. The
main objective of machine learning is to learn from data and build models that can be
understood and used by humans.
COBWEB is a well-known approach to incremental conceptual learning, which produces a
hierarchical clustering in the form of a classification tree. Each node defines a concept and
includes a probabilistic representation of that concept.
Limitations
• The assumption that the attributes are independent of each other is often too strong, because correlations can
exist.
• It is not suitable for clustering large databases: the classification tree can become skewed, and the probability
distributions are expensive to store and update.
Neural Network Approach: The neural network approach represents each cluster as an
exemplar, which acts as a prototype of the cluster. New objects are assigned to the cluster
whose exemplar is the most similar according to some distance measure.
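The assignment rule can be illustrated with a nearest-prototype sketch. Euclidean distance is assumed as the distance measure, and the prototype names and coordinates are made up; how the prototypes are learned is outside the scope of this sketch.

```python
from math import dist

def assign_to_prototype(obj, prototypes):
    """Return the name of the cluster whose prototype is closest to obj."""
    return min(prototypes, key=lambda name: dist(obj, prototypes[name]))

# Hypothetical prototypes for two clusters.
prototypes = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
cluster = assign_to_prototype((1.0, 2.0), prototypes)
```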
Scientific Analysis: Scientific simulations generate large volumes of data every day. This
includes data collected from nuclear laboratories, data about human psychology, etc. Data
mining techniques are capable of analyzing these data. We can now capture and store
new data faster than we can analyze the data already accumulated. Examples of
scientific analysis:
• Sequence analysis in bioinformatics
• Classification of astronomical objects
• Medical decision support.
Intrusion Detection: A network intrusion refers to any unauthorized activity on a
digital network. Network intrusions often involve stealing valuable network resources. Data
mining techniques play a vital role in detecting intrusions, network attacks, and
anomalies. These techniques help in selecting and refining useful and relevant information
from large data sets, and in classifying relevant data for an intrusion detection
system. An intrusion detection system raises alarms about foreign invasions found in the
network traffic. For example:
• Detect security violations
• Misuse Detection
• Anomaly Detection
An Intrusion Detection System is a device or an application that detects unusual indications,
monitors traffic, and reports its results to an administrator, but cannot take action to prevent
unusual activity. The system protects the confidentiality, integrity, and availability of data and
information systems from internet attacks. As networks expand dynamically, so do the risks
and the chances of malicious intrusions.
Major types of attacks detected by intrusion detection systems:
• Scanning attacks
• Denial of service (DOS) attacks
• Penetration attacks
Fig.2 Architecture of IDS
Multimedia data mining is an interdisciplinary field that integrates image processing and
understanding, computer vision, data mining, and pattern recognition. Multimedia data mining
discovers interesting patterns from multimedia databases that store and manage large collections
of multimedia objects, including image data, video data, audio data, sequence data and hypertext
data containing text, text markups, and linkages. Issues in multimedia data mining
include content-based retrieval and similarity search, generalization and multidimensional
analysis. Multimedia data cubes contain additional dimensions and measures for multimedia
information.
The framework that manages different types of multimedia data stored, delivered, and utilized in
different ways is known as a multimedia database management system. There are three classes of
multimedia databases: static, dynamic, and dimensional media. The content of the Multimedia
Database management system is as follows:
Over the last few years, the World Wide Web has become a significant source of information
and simultaneously a popular platform for business. Web mining can be defined as the method of
utilizing data mining techniques and algorithms to extract useful information directly from the
web, such as Web documents and services, hyperlinks, Web content, and server logs. The World
Wide Web contains a large amount of data that provides a rich source for data mining. The
objective of Web mining is to look for patterns in Web data by collecting and examining data in
order to gain insights.
Web mining can broadly be seen as the application of adapted data mining techniques to the web,
whereas data mining is defined as the application of algorithms to discover patterns in mostly
structured data, embedded in a knowledge discovery process. Web mining has the distinctive
property of dealing with a variety of data types. The web has multiple aspects that yield different
approaches to the mining process: web pages consist of text, web pages are linked via
hyperlinks, and user activity can be monitored via web server logs. These three features lead to
the distinction between three areas: web content mining, web structure mining, and web
usage mining.
Data mining is an innovative way of gaining new and valuable business insights by analyzing the
information held in your company database. These insights can enable you to identify market
niches, and they support and facilitate the making of well-informed business decisions.
Essentially, data mining is a ground-breaking way to leverage the information that your company
already has in order to plan a business strategy for the future.
Data mining uncovers this in-depth business intelligence by using advanced analytical and
modeling techniques. With data mining, you can ask far more sophisticated questions of your
data than you can with conventional querying methods. The information that data mining
provides can lead to an immense improvement in the quality and dependability of business
decision-making.
Conventional methods can tell a bank, for example, which of the bank account types that it
provides is the most profitable. However, only data mining enables the bank to create profiles of
the customers who already have this type of account. The bank can then use data mining to find
other customers who match that profile, so that it can accurately target a marketing campaign to
them.
Data mining can identify patterns in company data, for example, in records of supermarket
purchases. If, for example, customers buy product A and product B, which product C are they
most likely to buy as well? Accurate answers to questions like these are invaluable aids to
marketing strategies.
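The product A, product B, product C question can be sketched as a simple co-occurrence count over purchase records. This is only an illustration of the idea, not the algorithm used by any particular tool; the function name and the baskets are made up.

```python
from collections import Counter

def most_likely_companion(baskets, antecedent):
    """Most frequent extra item among baskets that contain all antecedent items.

    Returns (item, share of matching baskets containing it), or (None, 0.0).
    """
    antecedent = set(antecedent)
    counts = Counter()
    matching = 0
    for basket in baskets:
        items = set(basket)
        if antecedent <= items:            # basket contains product A and product B
            matching += 1
            counts.update(items - antecedent)
    if not counts:
        return None, 0.0
    item, n = counts.most_common(1)[0]
    return item, n / matching

# Hypothetical supermarket purchases.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter", "jam"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
]
item, share = most_likely_companion(baskets, {"bread", "butter"})
```

Of the three baskets that contain both bread and butter, two also contain jam, so jam is returned with a share of two thirds.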
Data mining can identify the characteristics of a known group of customers, for example, those
who have a proven record as poor credit risks. The company can then use these characteristics to
screen new customers and to predict if they also will be poor credit risks.
Data mining can automatically perform time series forecasts without requiring statistical skills.
You can use time series forecasts to optimize warehouse stock management.
Data mining tools ease and automate the process of discovering this kind of information from
large stores of data.
• Model training
A data mining analyst wants to obtain the rules contained in historic data. A specific mining
algorithm is selected, configured, and applied to a specified set of input data. The execution of
the mining algorithm is called the training phase. The result of the training phase is a data mining
model.
• The data mining process
The data mining process comprises different steps such as building, testing, or working with the
mining models.
• Introducing database objects and data mining in SQL
The database objects that are provided by Intelligent Miner reflect the current standard for data
mining in the context of SQL.
• Using interfaces
Intelligent Miner provides a set of user-defined methods, functions, and stored procedures
to Db2. You invoke these database objects by using SQL statements.
Complex algorithms form the basis for data mining: they segment the data to identify
trends and patterns, detect variations, and predict the probabilities of various events. The raw
data may come in both analog and digital formats, and its format depends on the source of the
data. Companies need to keep track of the latest data mining trends and stay updated to do well
in the industry and overcome challenging competition.
Corporations can use data mining to discover customers' choices, build good relationships with
customers, increase revenue, and reduce risks. Data mining is based on complex algorithms that
segment the data to discover trends and patterns, detect deviations, and
estimate the likelihood of certain events. Raw data can be in both analog and
digital formats, depending on the data's source. Companies must keep up
with the latest data mining trends and stay current to succeed in the industry and beat out the
competition.