Unit 2 ML
Clustering in Machine Learning: Types of Clustering Method: Partitioning Clustering, Distribution Model-Based
Clustering, Hierarchical Clustering, Fuzzy Clustering. Birch Algorithm, CURE Algorithm. Gaussian Mixture Models and
Expectation Maximization. Parameter estimation – MLE, MAP. Applications of Clustering.
UNIT-III
Classification algorithms: Logistic Regression, Decision Tree Classification, Neural Network, K-Nearest Neighbors (K-
NN), Support Vector Machine, Naive Bayes (Gaussian, Multinomial, Bernoulli). Performance Measures: Confusion
Matrix, Classification Accuracy, Classification Report: Precision, Recall, F1 score and Support.
UNIT-IV
Ensemble Learning and Random Forest: Introduction to Ensemble Learning, Basic Ensemble Techniques (Max Voting,
Averaging, Weighted Average), Voting Classifiers, Bagging and Pasting, Out-of-Bag Evaluation, Random Patches and
Random Subspaces, Random Forests (Extra-Trees, Feature Importance), Boosting (AdaBoost, Gradient Boosting),
Stacking.
UNIT-V
Dimensionality Reduction: The Curse of Dimensionality, Main Approaches for Dimensionality Reduction (Projection,
Manifold Learning). PCA: Preserving the Variance, Principal Components, Projecting Down to d Dimensions, Explained
Variance Ratio, Choosing the Right Number of Dimensions, PCA for Compression, Randomized PCA, Incremental PCA.
Kernel PCA: Selecting a Kernel and Tuning Hyperparameters. Learning Theory: PAC and VC model.
# Clustering in Machine Learning
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "a way of grouping data points into different clusters, each consisting of similar data points; the objects in a group share more similarities with one another than with objects in other groups."
It works by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and dividing the data points according to the presence or absence of those patterns.
It is an unsupervised learning method, so no supervision is provided to the algorithm, and it deals with unlabelled data.
After applying a clustering technique, each cluster or group is assigned a cluster ID, which an ML system can use to simplify the processing of large and complex datasets.
The clustering technique is commonly used for statistical data analysis.
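As a minimal sketch of how a clustering algorithm assigns cluster IDs to unlabelled data (assuming scikit-learn is installed; the tiny dataset below is made up purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# A tiny unlabelled dataset: two loose groups of 2-D points (made up for illustration).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [7.9, 8.3]])

# fit_predict returns one cluster ID per data point; downstream code can use these IDs.
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1]
```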
Note: Clustering is somewhat similar to classification, but the difference is the type of dataset being used. In classification, we work with a labeled dataset, whereas in clustering, we work with an unlabelled dataset.
Example: Let's understand the clustering technique with the real-world example of a shopping mall. When we visit a mall, we can observe that items with similar uses are grouped together: t-shirts are in one section, trousers in another, and in the fruit and vegetable section, apples, bananas, mangoes, etc. are kept in separate groups so that we can easily find what we are looking for. The clustering technique works in the same way. Another example of clustering is grouping documents by topic.
The clustering technique can be used in a wide variety of tasks. Some of the most common uses of this technique are:
o Market Segmentation
o Statistical data analysis
o Social network analysis
o Image segmentation
o Anomaly detection, etc.
Apart from these general uses, clustering is used by Amazon in its recommendation system to provide recommendations based on a user's past product searches. Netflix also uses this technique to recommend movies and web series to its users based on their watch history.
The diagram below illustrates the working of a clustering algorithm: the different fruits are divided into several groups with similar properties.
Partitioning Clustering
Partitioning clustering divides the data into a predefined number of non-overlapping groups; the K-means algorithm is the most common example.
Advantages:
Efficient for large datasets.
Easy to understand and implement.
Works well when clusters are globular and evenly sized.
Disadvantages:
The number of clusters (K) needs to be predefined.
Sensitive to initial centroids (can lead to poor local minima).
Assumes clusters are spherical and equally sized, which might not be the case in all datasets.
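To make the sensitivity to initial centroids concrete, here is a hedged sketch (assuming scikit-learn; the dataset and seed values are arbitrary) comparing a single random initialization of K-means against multiple restarts:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated blobs (parameters chosen for illustration).
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

# A single random initialization may converge to a poor local minimum.
single = KMeans(n_clusters=4, init="random", n_init=1, random_state=3).fit(X)

# Ten restarts keep the run with the lowest inertia (within-cluster sum of squares).
multi = KMeans(n_clusters=4, init="random", n_init=10, random_state=3).fit(X)

print("inertia with 1 init  :", single.inertia_)
print("inertia with 10 inits:", multi.inertia_)  # usually no worse than the single run
```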
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, so arbitrarily shaped clusters can be formed as long as the dense regions can be connected. The algorithm identifies regions of high density in the data space and joins them into clusters, with the dense areas separated from one another by sparser areas.
These algorithms can struggle to cluster the data points when the dataset has varying densities or a large number of dimensions.
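A minimal sketch of density-based clustering with DBSCAN (assuming scikit-learn; the eps and min_samples values are illustrative and would normally need tuning):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: non-spherical clusters that K-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius; min_samples is the number of points needed
# in that neighbourhood to form a dense region.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points labelled -1 are treated as noise; the rest form arbitrarily shaped clusters.
print("clusters found:", len(set(labels) - {-1}), "| noise points:", list(labels).count(-1))
```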
Hierarchical Clustering
Hierarchical clustering builds a tree of nested clusters (a dendrogram) by successively merging or splitting groups, so the number of clusters does not have to be fixed in advance.
Advantages:
Does not require the number of clusters to be specified in advance.
Provides a detailed view of data with a dendrogram that shows the merging/splitting process.
Can handle clusters of various shapes and sizes.
Disadvantages:
Computationally expensive, especially for large datasets.
Sensitive to noise and outliers.
May struggle with large-scale datasets as it involves repeated distance calculations.
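As a hedged sketch of agglomerative hierarchical clustering (assuming SciPy and scikit-learn are available; sizes and parameters are illustrative), the snippet below builds the linkage matrix that a dendrogram is drawn from and then cuts the tree into a flat clustering:

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Small synthetic dataset (kept small because hierarchical clustering scales poorly).
X, _ = make_blobs(n_samples=60, centers=3, random_state=1)

# Bottom-up merging; 'ward' merges the pair that least increases within-cluster variance.
Z = linkage(X, method="ward")

# Cut the merge tree to obtain a flat clustering with 3 groups.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])

# scipy.cluster.hierarchy.dendrogram(Z) would plot the full merge tree (needs matplotlib).
```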
Fuzzy Clustering
Fuzzy clustering is a type of soft clustering in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that express its degree of membership in each cluster. The Fuzzy C-means algorithm is the classic example of this type of clustering; it is sometimes also known as the Fuzzy K-means algorithm.
Advantages:
Handles overlapping clusters and soft assignments.
Provides a more nuanced view of data by assigning data points to multiple clusters.
Disadvantages:
Computationally expensive, especially for large datasets.
Requires careful selection of the fuzziness parameter (degree of membership).
Can struggle with non-convex clusters.
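Since scikit-learn does not include fuzzy C-means, the sketch below is a minimal from-scratch NumPy implementation (the function name, the fuzziness value m = 2, and the toy data are my own illustrative choices):

```python
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy C-means: returns cluster centres and the soft membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)           # each point's memberships sum to 1
    for _ in range(n_iter):
        W = U ** m                               # fuzzified memberships, shape (n, c)
        centres = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance of every point to every centre (small epsilon avoids division by zero).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2) + 1e-10
        U = 1.0 / d ** (2.0 / (m - 1.0))         # closer centres get larger coefficients
        U /= U.sum(axis=1, keepdims=True)
    return centres, U

# Two overlapping toy groups of 2-D points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
centres, U = fuzzy_c_means(X, c=2)
print(centres)           # approximate cluster centres
print(U[:3].round(2))    # soft memberships of the first three points (rows sum to 1)
```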
Clustering Algorithms
Clustering algorithms can be divided according to the cluster models explained above. Many different clustering algorithms have been published, but only a few are commonly used. The choice of algorithm depends on the kind of data being used: some algorithms require the number of clusters in the given dataset to be specified, whereas others work by finding the minimum distance between observations in the dataset.
Here we discuss the most popular clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The K-means algorithm is one of the most popular clustering algorithms. It partitions the dataset by dividing the samples into clusters of equal variance. The number of clusters must be specified in advance. It is fast and requires relatively little computation, with linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smoothed density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the mean of the points within a given region.
3. DBSCAN Algorithm: DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based model similar to mean-shift, but with some notable advantages. In this algorithm, areas of high density are separated by areas of low density, so the clusters it finds can have any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to K-means, or in cases where K-means fails. In a Gaussian Mixture Model (GMM), the data points are assumed to be generated from a mixture of Gaussian distributions (see the sketch after this list).
5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The resulting cluster hierarchy can be represented as a tree structure.
6. Affinity Propagation: It differs from other clustering algorithms in that it does not require the number of clusters to be specified. Instead, pairs of data points exchange messages until convergence. Its O(N²T) time complexity (N points, T iterations) is the main drawback of this algorithm.
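Here is a hedged sketch of EM clustering with a Gaussian Mixture Model, as referenced in item 4 above (assuming scikit-learn; the number of components and the covariance type are illustrative choices):

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Synthetic data with clusters of different spreads; in practice the number of
# components usually has to be chosen or tuned.
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 2.0, 0.5], random_state=7)

# Fit a 3-component GMM with the EM algorithm; full covariances allow elliptical clusters.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=7).fit(X)

hard_labels = gmm.predict(X)        # hard assignment: most likely component per point
soft_probs = gmm.predict_proba(X)   # soft assignment: per-component responsibilities
print(hard_labels[:10])
print(soft_probs[0].round(3))       # probabilities for the first point sum to 1
```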
Applications of Clustering
Below are some commonly known applications of the clustering technique in Machine Learning:
o In Identification of Cancer Cells: Clustering algorithms are widely used for identifying cancerous cells. They divide cancerous and non-cancerous data points into different groups.
o In Search Engines: Search engines also rely on the clustering technique. The search results shown are the objects closest to the search query, which is achieved by grouping similar data objects into one group kept far from dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm used.
o Customer Segmentation: Clustering is used in market research to segment customers based on their choices and preferences.
o In Biology: It is used in biology to classify different species of plants and animals using image recognition techniques.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This is very useful for determining the purpose for which a particular piece of land is most suitable.
# BIRCH Clustering
Clustering algorithms like K-means do not perform clustering very efficiently on large datasets when resources (such as memory or CPU time) are limited, so regular clustering algorithms do not scale well in terms of running time and quality as the size of the dataset increases. This is where BIRCH clustering comes in.
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small, compact summary of the dataset that retains as much information as possible. This smaller summary is then clustered instead of the full dataset. BIRCH is often used to complement other clustering algorithms by creating a summary of the dataset that the other algorithm can then work on. However, BIRCH has one major drawback: it can only process metric attributes, i.e., attributes whose values can be represented in Euclidean space, so no categorical attributes should be present.
Before we implement BIRCH, we must understand two important terms: the Clustering Feature (CF) and the CF tree.
Clustering Feature (CF): BIRCH summarizes large datasets into smaller, dense regions called Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an ordered triple (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the data points, and SS is the squared sum of the data points in the cluster. A CF entry may itself be composed of other CF entries.
CF Tree: The CF tree is the compact representation we have been speaking of so far. It is a height-balanced tree in which each leaf node contains sub-clusters. Every entry in an internal node holds a pointer to a child node and a CF entry equal to the sum of the CF entries in its child. Each sub-cluster in a leaf node must also satisfy a threshold requirement on its size (its radius), which controls how compact the summary is; this threshold reappears as a parameter below.
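To make the CF entry concrete, here is a small hedged sketch in plain NumPy (function names are my own) that builds an (N, LS, SS) triple, merges two entries by simple addition, and recovers the centroid and radius of the summarized points without the raw data:

```python
import numpy as np

def cf_entry(points):
    """Summarize a set of points as a Clustering Feature triple (N, LS, SS)."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), (points ** 2).sum()

def merge(cf1, cf2):
    """Two CF entries are merged by component-wise addition."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid_and_radius(cf):
    """Centroid and radius of a sub-cluster, computed from the summary alone."""
    n, ls, ss = cf
    centroid = ls / n
    radius = np.sqrt(max(ss / n - np.dot(centroid, centroid), 0.0))
    return centroid, radius

a = cf_entry([[1.0, 2.0], [2.0, 3.0]])
b = cf_entry([[1.5, 2.5]])
print(centroid_and_radius(merge(a, b)))  # summary of all three points, no raw data kept
```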
Parameters of the BIRCH algorithm (as exposed by scikit-learn's Birch class):
threshold : the maximum radius that a sub-cluster in a leaf node of the CF tree may have after a new sample is merged into it; if merging would exceed this radius, a new sub-cluster is started.
branching_factor : This parameter specifies the maximum number of CF sub-clusters in each node (internal
node).
n_clusters : The number of clusters to be returned after the entire BIRCH algorithm is complete i.e., number
of clusters after the final clustering step. If set to None, the final clustering step is not performed and
intermediate clusters are returned.
Implementation of BIRCH in Python: For the sake of this example, we will generate a dataset for clustering using scikit-learn's make_blobs() method. To learn more about make_blobs(), refer to: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
Code: create 8 clusters from 600 randomly generated samples and then plot the results in a scatter plot.
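The code below is a hedged sketch of that example (assuming scikit-learn and matplotlib are installed; the threshold and branching_factor values are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Generate 600 samples spread across 8 blobs, as described above.
X, _ = make_blobs(n_samples=600, centers=8, cluster_std=0.75, random_state=42)

# Build the CF tree and run the final clustering step to return 8 clusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=8)
labels = model.fit_predict(X)

# Scatter plot of the samples, coloured by the cluster BIRCH assigned to each.
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="rainbow", s=10)
plt.title("BIRCH clustering on make_blobs data")
plt.show()
```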