Unit 4 Mining
Cluster Analysis
Cluster analysis is an unsupervised machine learning technique.
Cluster analysis is a method of data mining that groups similar data points
together. The data set is unlabelled.
Goal:- The goal of cluster analysis is to divide a data set into groups called
clusters such that the data points within each group are more similar to each
other than to data points in other groups.
Used:- Cluster analysis is used for data analysis and can help identify
patterns or relationships within the data.
Cluster:- A cluster is a subset of similar objects.
It is used by recommendation systems such as those of Amazon and Netflix to
provide recommendations based on past searches for products and movies,
respectively.
The key objective of a clustering algorithm is that inter-cluster distance is
maximized and intra-cluster distance is minimized.
Properties of Clustering
1. Clustering scalability Nowadays we deal with huge amounts of data, and
databases can be very large. To handle such a database, a clustering
algorithm should be scalable.
If it is not scalable, we cannot get an appropriate result, and it
would lead to wrong results.
2. Dealing with unstructured data Some databases contain errors, missing
values, and noisy data. The algorithm should be able to handle such
unstructured data by giving it some structure and organizing the data
into groups of similar data objects.
3. Algorithm usability with multiple data types Different kinds of data can
be used with a clustering algorithm. It should be capable of dealing with
different types of data.
4. High dimensionality The algorithm should be able to handle
high-dimensional data.
5. Interpretability The outcome of clustering should be usable, and the
results must be easy to understand.
Partitioning Method
It is a top-down approach to clustering.
In this method, the n data items/objects are divided into k partitions
(groups).
Each partition/group is called one cluster.
The number of clusters k should be less than or equal to n, the total number
of data items.
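The partitioning method described above is most commonly realised by the k-means algorithm. Below is a minimal sketch (the function name, variable names, and toy points are my own illustration, not from the notes): each point is assigned to its nearest centroid, and each centroid is then moved to the mean of its cluster.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Partition the points into k clusters: assign each point to its
    nearest centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # k initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        for i, cl in enumerate(clusters):
            if cl:                                 # keep old centroid if empty
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return clusters, centroids

# two obvious groups; k = 2 should recover them
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, centroids = kmeans(points, k=2)
```

Note that each partition (cluster) is non-empty and the two groups together cover all n = 6 points, as the method requires.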
Advantages
1. Simple to implement.
2. It is scalable to huge datasets and is also fast on large datasets.
3. The time taken by k-means to cluster rises linearly with the number of
data points.
Disadvantages
Evaluation of Clustering
Clustering evaluation assesses the feasibility and quality of clustering
analysis on a dataset by examining the results produced by a clustering
method on that dataset.
The major tasks of clustering evaluation include the following:
1. assessing clustering tendency,
2. determining the number of clusters in a dataset,
3. measuring cluster quality.
1. Assessing clustering tendency. Before evaluating or analyzing clustering
performance, it is essential to verify that the dataset under consideration
has a clustering tendency and does not merely contain uniformly
distributed points. If the data has no clustering tendency, the clusters
identified by any clustering algorithm may be useless.
A non-uniform distribution of points in a data collection is therefore
important for clustering.
2. Determining the number of clusters in a dataset Choosing the appropriate
number of clusters is essential for a clustering algorithm. It needs to
strike a balance between compressibility and accuracy. The right number
of clusters can be unclear, depending on the distribution's shape, scale,
and the desired clustering resolution.
Elbow method. A simple rule of thumb sets the number of clusters to
about the square root of n/2 for a data set of n points. The elbow method
computes the sum of within-cluster variances and, as a heuristic, uses the
turning point in the curve of this sum with respect to the number of clusters.
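The elbow heuristic can be sketched on one-dimensional toy data (all names and data here are my own illustration): the within-cluster sum of squared errors is computed for increasing k, and the "elbow", the point after which the curve stops dropping sharply, suggests the number of clusters.

```python
import math

# toy 1-D data: two well-separated groups, so the elbow should appear at k = 2
points = [1.0, 1.1, 0.9, 9.0, 9.1, 8.9]

def sse_for_k(pts, k, iters=10):
    """Within-cluster sum of squared errors after a simple 1-D k-means run."""
    centroids = list(pts[:k])                  # deterministic initialisation
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            clusters[min(range(k), key=lambda c: abs(p - centroids[c]))].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum((p - centroids[min(range(k), key=lambda c: abs(p - centroids[c]))]) ** 2
               for p in pts)

# rule of thumb from the notes: k is about sqrt(n / 2)
rule_of_thumb = round(math.sqrt(len(points) / 2))

# elbow curve: SSE drops sharply from k = 1 to k = 2, then barely changes,
# so the turning point of the curve is at k = 2
curve = [sse_for_k(points, k) for k in (1, 2, 3)]
```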
Cross validation. It is a technique, borrowed from classification, used to
determine the right number of clusters. First, divide the given data set into
m parts. Next, use m-1 parts to build a clustering model and use the
remaining part for testing the quality of the clustering.
3. Measuring cluster quality. After clustering, a number of metrics can be
used to determine how effectively the clustering worked. Ideally, a
clustering has minimum intra-cluster distance and maximum inter-cluster
distance.
There are generally two types of measures: extrinsic and intrinsic.
1. Extrinsic measures, which require some ground-truth labels, e.g.,
the adjusted Rand index, the Fowlkes-Mallows score, etc.
2. Intrinsic measures, which do not require ground-truth labels, e.g.,
the silhouette coefficient, the Calinski-Harabasz index, etc.
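As an illustration of an intrinsic measure, here is a sketch of the silhouette coefficient (the toy points and labels are my own example): for each point, a is its mean distance to its own cluster, b is the lowest mean distance to any other cluster, and the point's score is (b - a) / max(a, b); values near +1 mean tight, well-separated clusters.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    scores = []
    for i, p in enumerate(points):
        # a: mean distance to the other points of the same cluster
        same = [math.dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        a = sum(same) / len(same)
        # b: lowest mean distance to the points of any other cluster
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == other)
            / sum(1 for l in labels if l == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# two tight, well-separated clusters -> score close to +1
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])
```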
Algorithm
1. Select two parallel hyperplanes which separate the data with no points
between them.
2. Maximize the distance between them (the margin).
3. The average line between them will be the decision boundary.
Advantages
1. Memory efficient.
2. Effective in high-dimensional cases.
3. Offers great accuracy.
Disadvantages
1. High training time
2. Not suitable for large data sets.
Example. Suppose we are given the following labelled data points:
(3,1), (3,-1), (6,1), (6,-1), (1,0), (0,1), (0,-1), (-1,0)
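The notes list these eight points but omit their labels. The two-hyperplane procedure above is the maximum-margin idea behind support vector machines; a common version of this exercise takes the first four points as class +1 and the last four as class -1 (an assumption, not stated in the notes), in which case the separating direction is the x-axis and the boundary is the vertical line x = 2. A sketch under that assumption:

```python
pos = [(3, 1), (3, -1), (6, 1), (6, -1)]   # assumed class +1
neg = [(1, 0), (0, 1), (0, -1), (-1, 0)]   # assumed class -1

# Steps 1-2: two parallel vertical hyperplanes with no points between them,
# pushed as far apart as possible.
upper = min(x for x, _ in pos)   # x = 3, touching the +1 class
lower = max(x for x, _ in neg)   # x = 1, touching the -1 class

# Step 3: the decision boundary is the average line midway between them.
boundary = (upper + lower) / 2   # x = 2
margin = upper - lower           # width of the separating "street"

def classify(point):
    """Classify a point by which side of the boundary it falls on."""
    return +1 if point[0] > boundary else -1
```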
5. Assign the new data point to the category for which the number of
neighbours is maximum.
6. Our model is ready.
Advantages of KNN
1. It is very simple and easy to implement, because we need only two
parameters: the value of k and a distance measure such as the
Euclidean distance.
2. A good value of k makes it robust to noise.
3. KNN can learn a non-linear decision boundary.
4. There is no training required, so we can add new data at any time
and never have to retrain the model. As a result, it is very fast to
set up, because the training time is zero.
Disadvantages
1. Inefficient for large datasets, since distances to all points must be
computed at prediction time.
2. Does not work well with high-dimensional data.
3. It does not handle categorical features very well.
Example. Perform the KNN classification algorithm on the following dataset
and predict the class for [maths = 6, computer = 8], where k = 3.

maths  computer  Result
4      3         F
6      7         P
7      8         P
5      5         F
8      8         P
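The example above can be worked through with a short sketch (the function and variable names are my own): the three nearest neighbours of (6, 8) by Euclidean distance are (6, 7), (7, 8), and (8, 8), all labelled P, so the predicted class is P.

```python
import math
from collections import Counter

# dataset from the notes: (maths, computer) -> result
data = [((4, 3), 'F'), ((6, 7), 'P'), ((7, 8), 'P'),
        ((5, 5), 'F'), ((8, 8), 'P')]

def knn_predict(query, data, k):
    """Classify by majority vote among the k nearest neighbours (Euclidean)."""
    neighbours = sorted(data, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

prediction = knn_predict((6, 8), data, k=3)   # predicts 'P'
```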
Classification
It is a task in data mining that involves assigning a class label to each
instance in a dataset based on its features.
The goal of classification is to build a model that accurately predicts the
class label of new instances based on their features.
A classification technique is a systematic approach to building a
classification model from an input data set; the model is then used to
identify the class label of new instances.
There are two steps in classification,
1. model construction, and
2. model usage.
Decision Tree
It is a supervised learning technique.
With the help of decision trees, we can solve both regression and
classification problems.
A decision tree lets us take decisions step by step.
Decision trees represent classification and regression models in the form of
tree structures, where internal nodes represent the features of a dataset,
branches represent the decision rules, and leaf nodes represent the
outcomes.
Decision tree terminologies A decision tree contains three types of nodes:
1. root node,
2. leaf node, and
3. internal node.
1. The root node is the topmost node in the tree; the data inside it is an
attribute. It is represented by a rectangle.
2. A leaf node is a final output node and cannot be split further. A leaf
node is represented by a circle.
3. An internal node represents a feature of the dataset. It is represented
by a rectangle.
Algorithm
Create a root node and assign all the training data to it.
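After the root node is created, the usual next step is to split it on the attribute that best separates the classes. A common choice of split criterion is information gain, the reduction in entropy; a sketch of that computation (the names and toy data are my own, and the notes do not specify which criterion they use):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from splitting the rows on feature <= threshold."""
    left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
    if not left or not right:
        return 0.0
    weighted = (len(left) * entropy(left)
                + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# toy data: the first feature separates the classes perfectly,
# so splitting on it removes all entropy (gain = 1 bit)
rows = [(1, 7), (2, 3), (8, 5), (9, 6)]
labels = ['F', 'F', 'P', 'P']
gain = information_gain(rows, labels, feature=0, threshold=2)
```

The tree-building step picks the feature and threshold with the highest gain, splits the node, and recurses on each child until the leaves are pure or another stopping rule applies.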