Unit 4 Mining

Cluster analysis is a machine learning technique that groups similar data points into clusters, aiming to maximize inter-cluster distance and minimize intra-cluster distance. Hierarchical clustering organizes unlabelled datasets into a dendrogram, while partitioning methods such as k-means divide data into a predefined number of clusters based on centroids. Support Vector Machines (SVM) and K-Nearest Neighbours (KNN) are supervised learning algorithms used for classification, with SVM focusing on decision boundaries and KNN relying on the similarity of data points.



Cluster Analysis
Cluster analysis is a machine learning technique.
Cluster analysis is a method of data mining that groups similar data points
together. The dataset is unlabelled.
Goal:- The goal of cluster analysis is to divide a dataset into groups called
clusters such that the data points within each group are more similar to each
other than to data points in other groups.
Use:- Cluster analysis is used for data analysis and can help identify
patterns or relationships within the data.
Cluster:- A cluster is a subset of similar objects.
It is used by recommendation systems such as those of Amazon and Netflix to
provide recommendations based on past searches for products and movies,
respectively.
The key objective of a clustering algorithm is that inter-cluster distance is
maximized and intra-cluster distance is minimized.
Properties of Clustering
1. Clustering scalability:- Nowadays, we deal with huge amounts of data and
huge databases. To handle a huge database, the clustering algorithm
should be scalable. If it is not scalable, we cannot get an appropriate
result, and it may lead to wrong results.
2. Dealing with unstructured data:- Some databases contain errors, missing
values, and noisy data. The algorithm should be able to handle such
unstructured data by giving it some structure and organizing it into
groups of similar data objects.
3. Algorithm usability with multiple data types:- Different kinds of data
can be used with a clustering algorithm. It should be capable of dealing
with different types of data.
4. High dimensionality:- The algorithm should be able to handle
high-dimensional data.
5. Interpretability:- The outcome of clustering should be usable, and the
results must be easy to understand.


Hierarchical Clustering Methods


Hierarchical Clustering is an unsupervised machine learning algorithm.
It groups the unlabelled dataset into clusters and is also known as
Hierarchical Cluster Analysis or HCA.
These clusters are displayed in a hierarchical format called a dendrogram,
which is a tree-type structure.
Why do we use Hierarchical Clustering?
In the k-means algorithm, the number of clusters must be known in advance,
and the clusters it creates tend to be of roughly equal size. But if we
have an unlabelled dataset where we do not initially know the number of
clusters, and the clusters are not of equal size, we use Hierarchical
Clustering, which does not need knowledge of a predefined number of
clusters. To overcome these two challenges, we use Hierarchical Clustering.
Types of Hierarchical Clustering
1. Agglomerative Clustering
2. Divisive Clustering
1. Agglomerative Clustering:- It is a bottom-up approach in which the
algorithm starts by taking each data point as a single cluster and keeps
merging clusters until only one cluster is left.

2. Divisive Clustering:- It is a top-down approach, the reverse of
Agglomerative Clustering. We start with one cluster composed of all the
data points and keep dividing it until each cluster contains a single
element.
Dendrogram:- A dendrogram is a type of tree diagram showing the hierarchical
relationship between different sets of data. It records the memory of the
Hierarchical Clustering algorithm, i.e., the order in which clusters were
merged or split.
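A minimal sketch of agglomerative clustering and its dendrogram, using
scipy; the five sample points and the choice of Ward linkage are
illustrative assumptions, not from the notes.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1, 2], [2, 2], [8, 8], [9, 8], [5, 5]])

# "ward" repeatedly merges the pair of clusters that least increases
# the total within-cluster variance (bottom-up / agglomerative).
merges = linkage(points, method="ward")

dendrogram(merges)            # leaves are points, height = merge distance
plt.xlabel("data point index")
plt.ylabel("merge distance")
plt.show()
```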
Applications of the Hierarchical Clustering Algorithm
1. Identifying fake news:- words associated with fake news, such as false
information, spam, etc., can be clustered together.
2. Identifying criminal activity
3. Document analysis


Partitioning Method
It is a top-down approach to clustering.
In this method, the n data items/objects are divided into k partitions,
where each partition (group) is called one cluster.
The number of clusters k should be less than or equal to n, the total
number of data items.

When we partition, we should follow two rules.


1. Each partition should have at least one object.
2. Each object should belong to only one partition.

Example of a Partitioning Method:- k-means clustering.

k means the number of clusters; clustering means grouping similar types of
data. It is a centroid-based algorithm, where each cluster has a particular
centroid.
Each data point is assigned to the centroid that has the minimum distance
to it among all the centroids.

3
Unit 4

k-means algorithm steps

Step 1. Select the number k to decide the number of clusters.
Step 2. Select k random points as centroids. They may be points other than
those in the input dataset.
Step 3. Assign each data point to its closest centroid, which forms the
predefined k clusters.
Step 4. Calculate the variance and place a new centroid in each cluster.
Step 5. Repeat step 3, i.e., reassign each data point to the new closest
centroid of each cluster.
Step 6. If any reassignment occurs, go to step 4; otherwise, finish.
Step 7. The model is ready.
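A minimal from-scratch sketch of these steps in numpy; the stopping test
and random initialization are one reasonable reading of steps 2-6, and the
sketch assumes no cluster ever becomes empty.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3/5: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids, axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: new centroid = mean of each cluster (assumes none is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```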

Advantages of the k-means algorithm

1. Simple to implement.
2. It is scalable to huge datasets and also fast on large datasets.
3. The time taken by k-means rises linearly with the number of data
points.

Disadvantages

1. The user needs to specify an initial value of k.
2. It is not suitable for clusters of non-globular shapes or widely
varying sizes.
3. It is sensitive to outliers.
Example:-
S.No  Age  Amount
1     20   500
2     40   1000
3     30   800
4     18   300
5     28   1200
6     35   400
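A sketch of running k-means on this table with scikit-learn; k = 2 is an
assumed choice, and the features are standardized first because "Amount"
would otherwise dominate "Age" in the distance calculation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[20, 500], [40, 1000], [30, 800],
              [18, 300], [28, 1200], [35, 400]])

X_scaled = StandardScaler().fit_transform(X)   # put both features on one scale
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
print(km.labels_)          # cluster id for each of the 6 rows
print(km.cluster_centers_) # centroids in the scaled feature space
```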


Evaluation of Clustering
Clustering evaluation assesses the feasibility and quality of clustering
analysis on a dataset by examining the results produced by a clustering
method on that dataset.
The major tasks of clustering evaluation include the following:
1. assessing clustering tendency,
2. determining the number of clusters in a dataset,
3. measuring cluster quality.
1. Assessing clustering tendency:- Before evaluating or analyzing
clustering performance, it is essential to verify that the dataset under
consideration has a clustering tendency and does not consist of uniformly
distributed points. If the data has no clustering tendency, the clusters
identified by any clustering algorithm may be meaningless.
It is a non-uniform distribution of points in a data collection that makes
clustering meaningful.
2. Determining the number of clusters in a dataset:- Choosing the
appropriate number of clusters is essential for a clustering algorithm. It
needs to strike a balance between compressibility and accuracy. The right
number of clusters can be ambiguous, depending on the distribution's shape,
scale, and the desired clustering resolution.
Elbow method:- A simple heuristic sets the number of clusters to about
√(n/2) for a dataset of n points. A better heuristic computes the sum of
within-cluster variance for increasing numbers of clusters and uses the
turning point (the "elbow") in the curve of that sum with respect to the
number of clusters.
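A sketch of the elbow heuristic: plot the sum of within-cluster variance
(scikit-learn calls it inertia) against k and look for the turning point.
The three synthetic blobs are an illustrative assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs, so the elbow should appear near k = 3
X = np.vstack([rng.normal(loc, 0.5, size=(30, 2)) for loc in (0, 4, 8)])

ks = range(1, 8)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of within-cluster variance (inertia)")
plt.show()
```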
Cross-validation:- It is a technique, borrowed from classification, used to
determine the right number of clusters. First, divide the given dataset
into m parts. Next, use m − 1 parts to build a clustering model and use the
remaining part to test the quality of the clustering.
3. Measuring cluster quality:- After clustering, a number of metrics can be
used to determine how effectively the clustering worked. Ideally, a
clustering has minimum intra-cluster distance and maximum inter-cluster
distance.
There are generally two types of measures: extrinsic and intrinsic.
1. Extrinsic measures, which require ground-truth labels, e.g., the
adjusted Rand index, Fowlkes-Mallows score, etc.
2. Intrinsic measures, which do not require ground-truth labels, e.g.,
the silhouette coefficient, Calinski-Harabasz index, Davies-Bouldin
index, etc.
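A sketch of one intrinsic measure just listed, the silhouette coefficient,
which needs no ground-truth labels; the two synthetic blobs are an
illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
               rng.normal(5, 0.5, size=(30, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# near +1: tight, well-separated clusters; near 0 or negative: poor clustering
print(silhouette_score(X, labels))
```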


SVM (Support Vector Machine)

Support Vector Machine (SVM) is one of the most popular supervised
learning algorithms.
It is used for classification problems as well as regression problems.
It is generally used for classification problems because it produces more
accurate results when classifying data.
The goal of SVM is to create the best line, or decision boundary, between
the particular classes.
Picture two properly separated classes of points with a middle line between
them: that middle line is the decision boundary generated by the SVM, and
it is how SVM properly classifies the data.
Uses:-
1. Face detection,
2. image classification,
3. text categorization, etc.
Key concepts:-
1. Support vectors:- the data points that are closest to the hyperplane
are called support vectors.
2. Hyperplane:- it is the decision plane or boundary that divides a
particular set of objects into classes.
3. Margin:- the gap between the two lines through the closest data points
of different classes. It is calculated as the perpendicular distance from
the decision boundary to the support vectors. A large margin is considered
a good margin, and a small margin is considered a bad margin.
Types of SVM
1. Linear SVM
2. Non-linear SVM
1. Linear SVM:- Linear SVM is used for linearly separable data, which
means that if a dataset can be classified into two classes using a single
straight line, the classifier is called a linear SVM classifier.
2. Non-linear SVM:- It is used for non-linearly separable data, which
means that if a dataset cannot be classified using a single straight line,
the classifier is called a non-linear SVM classifier.
A kernel is used to classify the data in a non-linear SVM.
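A sketch contrasting a linear SVM with a kernel (non-linear) SVM in
scikit-learn; the XOR-style data, which no single straight line can
separate, is an illustrative assumption.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # opposite quadrants, not linearly separable

linear = SVC(kernel="linear").fit(X, y)   # linear SVM: one straight boundary
rbf = SVC(kernel="rbf").fit(X, y)         # kernel SVM: curved boundary

# the rbf kernel should score far higher on this non-linear data
print(linear.score(X, y), rbf.score(X, y))
```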


Algorithm
1. Select two parallel hyperplanes that separate the data with no points
between them.
2. Maximize the distance (margin) between them.
3. The line midway between them will be the decision boundary.
Advantages
1. Memory efficient.
2. Effective in high-dimensional cases.
3. Offers great accuracy.
Disadvantages
1. High training time.
2. Not suitable for large datasets.
Example:- Suppose we are given the following labelled data points:
(3,1), (3,-1), (6,1), (6,-1), (1,0), (0,1), (0,-1), (-1,0)
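A sketch of fitting a linear SVM to these points, assuming (as in the
classic version of this exercise) that the first four points form the
positive class and the last four the negative class; a large C is used to
approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 1], [3, -1], [6, 1], [6, -1],
              [1, 0], [0, 1], [0, -1], [-1, 0]])
y = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard-margin SVM
print(clf.support_vectors_)                  # the points closest to the boundary
print(clf.coef_, clf.intercept_)             # boundary is approximately x1 = 2
```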


Lazy Learner: KNN (K-Nearest Neighbours)

KNN is one of the simplest machine learning algorithms, based on the
supervised learning technique.
The KNN algorithm can be used for regression as well as classification
problems, but mostly it is used for classification problems.
The KNN algorithm is also called a lazy learner because it does not learn
from the training set immediately; instead, it stores the data and, at the
time of classification, performs an action on the dataset.
In supervised learning techniques, we train machines using labelled data,
i.e., data which contains both input and output.
The KNN algorithm assumes similarity between the new data and the available
data and puts the new data into the category that is most similar to the
available categories.
Suppose we have an input data point and two available data categories, a
square category and a circle category, and we want to place the input in
one of the two. Using a KNN classifier, it is placed based on similarity:
the input looks like a square, so the KNN classifier places it in the
square category.
In the example, there are two categories of data: category A, containing
the square data, and category B, containing the circle data. To place new
data in either category A or category B using the KNN algorithm, the
algorithm first finds the nearest neighbours of the input data. Suppose the
nearest neighbours of the new data are 2 squares and 1 circle. Among the
three nearest neighbours, two are squares and one is a circle, so the
majority is square, and the new data is placed in category A. Because we
place new data on the basis of its nearest neighbours, it is called the KNN
algorithm.
Working steps of the KNN algorithm:
1. Select the number k of neighbours.
2. Calculate the Euclidean distance from the new point to the data points.
3. Take the k nearest neighbours as per the Euclidean distance.
4. Among these k neighbours, count the number of data points in each
category.


5. Assign the new data point to the category for which the number of
neighbours is maximum.
6. Our model is ready.
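A from-scratch sketch of these steps in numpy: compute Euclidean distances,
take the k closest training points, and return the majority category. The
function name and interface are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # step 2: distances to all points
    nearest = np.argsort(dists)[:k]                   # step 3: k nearest neighbours
    votes = Counter(y_train[i] for i in nearest)      # step 4: count per category
    return votes.most_common(1)[0][0]                 # step 5: majority category
```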
Advantages of KNN
1. It is very simple and easy to implement, because we need only two
parameters: the k value and the Euclidean distance.
2. A good value of k makes it robust to noise.
3. KNN learns a non-linear decision boundary.
4. No training is required, so we can add new data at any time without
having to retrain the model. As a result, it is very fast to set up,
because the training time is zero.
Disadvantages
1. Inefficient at prediction time, since distances to all stored points
must be computed for every query.
2. Does not work well with high-dimensional data.
3. It does not handle categorical features very well.
Example:- Perform the KNN classification algorithm on the following dataset
and predict the class for [maths = 6, computer = 8], where k = 3.
Maths  Computer  Result
4      3         F
6      7         P
7      8         P
5      5         F
8      8         P
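A sketch of solving this example with scikit-learn. By hand, the Euclidean
distances from (6, 8) are roughly 5.39, 1.00, 1.00, 3.16, and 2.00, so the
three nearest rows are (6,7), (7,8), and (8,8), all labelled P, and the
prediction is P.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[4, 3], [6, 7], [7, 8], [5, 5], [8, 8]])
y = np.array(["F", "P", "P", "F", "P"])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[6, 8]]))   # -> ["P"], by majority of the 3 nearest rows
```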


Classification
It is a task in data mining that involves assigning a class label to each
instance in a dataset based on its features.
The goal of classification is to build a model that accurately predicts the
class label of new instances based on their features.
A classification technique is a systematic approach to building a
classification model from an input dataset; the model is then used to
identify the class label.
There are two steps in classification:
1. model construction, and
2. model usage.

Decision Tree
It is a supervised learning technique.
With the help of decision trees, we can solve both regression and
classification problems.
Decision trees let us take decisions.
Decision trees represent classification models and regression models in
the form of tree structures, where internal nodes represent the features of
a dataset, branches represent the decision rules, and leaf nodes represent
the outcomes.
Decision tree terminologies:- A decision tree contains three types of
nodes:
1. root node,
2. leaf node, and
3. internal node.
1. The root node is the topmost node in the tree; the data inside it is an
attribute. It is represented by a rectangle.
2. A leaf node is the final output node and cannot be further split. A
leaf node is represented by a circle.
3. An internal node represents a feature of the dataset. It is represented
by a rectangle.
Algorithm
1. Create a root node and assign all the training data to it.

10
Unit 4

2. Select the best splitting attribute according to certain criteria.
3. Add a branch to the root node for each value of the split.
4. Split the data into mutually exclusive subsets along the lines of the
specific split.
5. Repeat steps 2 to 4 for each leaf node until the stopping criteria are
reached.
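A minimal sketch of this loop using scikit-learn, which builds the tree by
repeatedly choosing the best splitting attribute; the iris dataset and the
depth limit are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" uses the Gini coefficient mentioned below to pick splits;
# max_depth=3 is one simple stopping criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)

print(export_text(tree))   # root, internal, and leaf nodes as indented rules
```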
Splitting criteria
Which variable to use for the first split? How should one determine the
most important variable for the first branch, and subsequently for each
subtree? Algorithms use different measures, such as least error,
information gain, and the Gini coefficient, to choose the splitting
variable that provides the most benefit.
What value to use for the split? If the variable has continuous values,
such as 4, 8, etc., what value ranges should be used?
How many branches should be allowed for each node? There could be binary
trees with just two branches at each node, or more branches may be allowed.
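A sketch of two of the splitting measures named above, Gini impurity and
entropy (the quantity behind information gain), computed for a set of class
labels; the helper names and sample labels are illustrative.

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)       # 0 for a pure node; higher = more mixed

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))    # used to compute information gain

y = np.array(["Y", "Y", "N", "Y", "N"])
print(gini(y), entropy(y))            # impurity of this 3-Y / 2-N node
```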
Stopping criteria
1. When should we stop building the tree?
2. When a certain depth of branching has been reached?
3. When the error level at every node is within a predefined tolerable
level?
Pruning:- It is the act of reducing the size of a decision tree by removing
the fraction of the tree that provides little value.
Goal:- to make the decision tree more balanced, more general, and more
easily usable.
There are two approaches to avoiding overfitting:
1. Pre-pruning and
2. Post-pruning
1. Pre-pruning means halting tree construction early, when certain criteria
are met.
Disadvantage:- it is difficult to decide what criteria to use for halting
construction, because the right stopping criteria are not known in advance.
2. Post-pruning means removing branches or subtrees from a fully grown
tree. This method is more commonly used.


Advantages of a decision tree

• Simple to understand.
• Less requirement for data cleaning compared to other algorithms.
• It can be very useful for solving decision-related problems.
• It solves both classification and regression problems.
Disadvantages
a. For more class labels, the computational complexity of the decision
tree may increase.
b. Overfitting issues.
c. The decision tree can contain many layers, which makes it complex.
Day  Weather  Temperature  Humidity  Wind    Play
1    Sunny    Hot          High      Weak    N
2    Cloudy   Hot          High      Weak    Y
3    Cloudy   Mild         High      Strong  Y
4    Rainy    Mild         High      Strong  N
5    Sunny    Mild         Normal    Strong  Y
6    Rainy    Cold         Normal    Strong  N
7    Rainy    Mild         High      Weak    Y
8    Sunny    Hot          High      Strong  N
9    Cloudy   Hot          Normal    Weak    Y
10   Rainy    Mild         High      Strong  N
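A sketch of fitting a decision tree to this play/weather table. The
categorical columns are one-hot encoded first, which is one standard way to
feed them to scikit-learn; the encoding step itself is not part of the
notes.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({
    "weather":     ["Sunny", "Cloudy", "Cloudy", "Rainy", "Sunny",
                    "Rainy", "Rainy", "Sunny", "Cloudy", "Rainy"],
    "temperature": ["Hot", "Hot", "Mild", "Mild", "Mild",
                    "Cold", "Mild", "Hot", "Hot", "Mild"],
    "humidity":    ["High", "High", "High", "High", "Normal",
                    "Normal", "High", "High", "Normal", "High"],
    "wind":        ["Weak", "Weak", "Strong", "Strong", "Strong",
                    "Strong", "Weak", "Strong", "Weak", "Strong"],
})
y = ["N", "Y", "Y", "N", "Y", "N", "Y", "N", "Y", "N"]

X = pd.get_dummies(df)   # one-hot encode the categorical attributes
tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```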

