
UNIT-IV: Unsupervised Learning

Clustering - K-Means, KNN clustering, Hierarchical Clustering, Association Rule Learning, Dimensionality Reduction: Principal Component Analysis, ensemble learning, bagging and random forests, boosting, meta learning, Recommendation Systems - Content Based Recommendation System, Collaborative Filtering Based Recommendation System

Clustering
Clustering or cluster analysis is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense)
to each other than to those in other groups (clusters). Clustering is a main task of
exploratory data mining and used in many fields, including machine learning,
pattern recognition, image analysis, information retrieval, bioinformatics, data
compression, and computer graphics. It can be achieved by various algorithms that
differ significantly in their notion of what constitutes a cluster and how to efficiently
find them. Popular notions of clusters include groups with small distances between
cluster members, dense areas of the data space, etc.
Examples of data with natural clusters
In many applications, there will naturally be several groups or clusters in samples.
1. Consider the case of optical character recognition: there are two ways of writing the digit 7. The American style is a plain '7', whereas the European style has a horizontal bar through the middle (a crossed 7). When the sample contains examples from both continents, it will contain two clusters or groups, one corresponding to the American 7 and the other corresponding to the European 7.
2. In speech recognition, the same word can be uttered in different ways due to differences in pronunciation, accent, gender, age, and so forth, so there is no single, universal prototype. In a large sample of utterances of a specific word, all the different ways should be represented.

k-means clustering
Outline
The k-means clustering algorithm is one of the simplest unsupervised learning
algorithms for solving the clustering problem. Let it be required to classify a given
data set into a certain number of clusters, say, k clusters. We start by choosing k
points arbitrarily as the “centres” of the clusters, one for each cluster. We then
associate each of the given data points with the nearest centre. We now take the
averages of the data points associated with a centre and replace the centre with
the average, and this is done for each of the centres. We repeat the process until
the centres converge to some fixed points. The data points nearest to the centres
form the various clusters in the dataset. Each cluster is represented by the
associated centre
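A minimal NumPy sketch of this loop, assuming a small hypothetical two-dimensional dataset (scikit-learn's KMeans, used later in this unit, performs the same steps more efficiently):

import numpy as np

def k_means(points, k, n_iters=100):
    # choose k points arbitrarily as the initial centres
    rng = np.random.default_rng(42)
    centres = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iters):
        # associate each data point with the nearest centre
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # replace each centre with the average of its associated points
        # (an empty cluster simply keeps its previous centre)
        new_centres = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                                else centres[j] for j in range(k)])
        if np.allclose(new_centres, centres):   # centres have converged
            break
        centres = new_centres
    return centres, labels

data = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                 [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centres, labels = k_means(data, k=2)
print(centres, labels)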

Example
We illustrate the algorithm in the case where there are only two variables, so that the data points and cluster centres can be represented geometrically by points in a coordinate plane. The distance between the points (x1, x2) and (y1, y2) is calculated using the familiar Euclidean distance formula of elementary analytical geometry:

d = sqrt((x1 - y1)^2 + (x2 - y2)^2)

Problem
Use the k-means clustering algorithm to divide the following data into two clusters and also compute the representative data points for the clusters.

Solution

1. In the problem, the required number of clusters is 2 and we take k = 2.


2. We choose two points arbitrarily as the initial cluster centres. Let us choose
arbitrarily

3. We compute the distances of the given data points from the cluster centers.
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends upon the quality of the clusters it forms, but choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters or value of K; here we discuss the most commonly used method, which is given below:

Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method uses the concept of the WCSS value. WCSS stands for Within Cluster Sum of Squares, which measures the total variation within the clusters. The formula to calculate the value of WCSS (for 3 clusters) is given below:

WCSS = ∑(Pi in Cluster1) distance(Pi, C1)^2 + ∑(Pi in Cluster2) distance(Pi, C2)^2 + ∑(Pi in Cluster3) distance(Pi, C3)^2

In the above formula, ∑(Pi in Cluster1) distance(Pi, C1)^2 is the sum of the squared distances between each data point Pi in Cluster1 and its centroid C1, and the same holds for the other two terms.

To measure the distance between data points and the centroid, we can use any distance measure, such as Euclidean distance or Manhattan distance.

To find the optimal number of clusters, the elbow method follows the steps below:

o It executes K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.

Since the graph shows a sharp bend that looks like an elbow, the technique is known as the elbow method. The graph for the elbow method looks like the below image:

Python Implementation of K-means Clustering Algorithm


In the above section, we have discussed the K-means algorithm, now let's see how
it can be implemented using Python.

Before implementation, let's understand what type of problem we will solve here. We have a dataset, Mall_Customers, which contains data about customers who visit the mall and spend there.

The dataset has Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value indicating how much a customer has spent in the mall; the higher the value, the more the customer has spent). From this dataset, we need to find some patterns; since this is an unsupervised method, we don't know in advance exactly what to look for.

The steps to be followed for the implementation are given below:

o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters

Step-1: Data pre-processing Step

The first step will be the data pre-processing, as we did in our earlier topics of
Regression and Classification. But for the clustering problem, it will be different
from other models. Let's discuss it:

o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our
model, which is part of data pre-processing. The code is given below:

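# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd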
In the above code, numpy is imported for performing mathematical calculations, matplotlib is for plotting the graphs, and pandas is for managing the dataset.
o Importing the Dataset:
Next, we will import the dataset that we need to use. So here, we are using
the Mall_Customer_data.csv dataset. It can be imported using the below
code:

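# Importing the dataset
dataset = pd.read_csv('Mall_Customer_data.csv')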
By executing the above lines of code, we will get our dataset in the Spyder IDE.
The dataset looks like the below image:

From the above dataset, we need to find some patterns in it.

o Extracting Independent Variables

Here we don't need any dependent variable for data pre-processing step as it is a
clustering problem, and we have no idea about what to determine. So we will just add a
line of code for the matrix of features.
x = dataset.iloc[:, [3, 4]].values

As we can see, we are extracting only the 3rd and 4th columns (Annual Income and Spending Score). This is because we need a 2-D plot to visualize the model, and some features, such as customer_id, are not required.

Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our clustering
problem. So, as discussed above, here we are going to use the elbow method for this
purpose.

As we know, the elbow method uses the WCSS concept to draw the plot by plotting WCSS
values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate
the value for WCSS for different k values ranging from 1 to 10. Below is the code for it:

#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable as an empty list, which is used to hold the WCSS values computed for the different values of k ranging from 1 to 10.

After that, we have set up the for loop to iterate over the different values of k from 1 to 10; since range() in Python excludes the upper bound, it is written as 11 so that the 10th value is included.

The rest of the code is similar to what we did in earlier topics: we fit the model on the matrix of features and then plot the graph between the number of clusters and the WCSS.

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number of clusters here
will be 5.
Step- 3: Training the K-means algorithm on the training
dataset
As we have got the number of clusters, so we can now train the model on the dataset.

To train the model, we will use the same two lines of code as we have used in the above
section, but here instead of using i, we will use 5, as we know there are 5 clusters that
need to be formed. The code is given below:

#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the variable y_predict; calling fit_predict() both trains the model and returns the cluster assigned to each data point.

By executing the above lines of code, we will get the y_predict variable. We can check it
under the variable explorer option in the Spyder IDE. We can now compare the values
of y_predict with our original dataset. Consider the below image:

From the above image, we can see that CustomerID 1 belongs to cluster 3 (since indexing starts from 0, a value of 2 corresponds to the 3rd cluster), CustomerID 2 belongs to cluster 4, and so on.

Step-4: Visualizing the Clusters


The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.

To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.

#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')    #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')   #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')     #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')    #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

In the above lines of code, we have written one scatter call for each cluster, ranging from 1 to 5. The first argument of mtp.scatter(), e.g. x[y_predict == 0, 0], selects the x values (from the matrix of features) of the points whose predicted cluster label is 0; the y_predict labels range from 0 to 4.

Output:

The output image clearly shows the five clusters in different colors. The clusters are formed over two parameters of the dataset: the customer's annual income and spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
o Cluster2 shows the customers with high income but low spending, so we can categorize them as careful.
o Cluster3 shows low income and also low spending, so these customers can be categorized as sensible.
o Cluster4 shows the customers with low income but very high spending, so they can be categorized as careless.
o Cluster5 shows the customers with high income and high spending, so they can be categorized as target; these customers can be the most profitable customers for the mall owner.

Hierarchical Clustering in Machine Learning


Hierarchical clustering is another unsupervised machine learning algorithm used to group unlabeled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.

In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.

Sometimes the results of K-means clustering and hierarchical clustering may look similar, but the two differ in how they work; in particular, there is no requirement to predetermine the number of clusters as there is in the K-means algorithm.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the algorithm starts by taking all data points as single clusters and keeps merging them until one cluster is left.
2. Divisive: The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach.
Why hierarchical clustering?
As we already have other clustering algorithms such as K-Means Clustering, why do we need hierarchical clustering? As we have seen, K-means clustering has some challenges: it needs a predetermined number of clusters, and it always tries to create clusters of the same size. To address these two challenges, we can opt for the hierarchical clustering algorithm, because in this algorithm we don't need to know the number of clusters in advance.

In this topic, we will discuss the Agglomerative Hierarchical clustering algorithm.

Agglomerative Hierarchical clustering


The agglomerative hierarchical clustering algorithm is a popular example of HCA. To group the data into clusters, it follows the bottom-up approach. This means the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pair of clusters. It does this until all the clusters are merged into a single cluster that contains all the data points.

This hierarchy of clusters is represented in the form of the dendrogram.

How does Agglomerative Hierarchical Clustering work?


The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are N data points, so the
number of clusters will also be N.

o Step-2: Take two closest data points or clusters and merge them to form one cluster. So,
there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together to form one cluster.
There will be N-2 clusters.

o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the following clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram
to divide the clusters as per the problem.

Note: To better understand hierarchical clustering, it is advised to have a look at k-means clustering.
Measure for the distance between two clusters
As we have seen, the distance between the closest clusters is crucial for hierarchical clustering. There are various ways to calculate the distance between two clusters, and these choices decide the rule for clustering. These measures are called linkage methods. Some of the popular linkage methods are given below:

1. Single Linkage: It is the Shortest Distance between the closest points of the clusters.
Consider the below image:

2. Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.
3. Average Linkage: It is the linkage method in which the distance between each pair of
datasets is added up and then divided by the total number of datasets to calculate the
average distance between two clusters. It is also one of the most popular linkage methods.
4. Centroid Linkage: It is the linkage method in which the distance between the centroid of
the clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the type of
problem or business requirement.

Working of Dendrogram in Hierarchical Clustering


The dendrogram is a tree-like structure that is mainly used to record each merge step that the HC algorithm performs. In the dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points of the given dataset.

The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in agglomerative
clustering, and the right part is showing the corresponding dendrogram.

o As we have discussed above, first the data points P2 and P3 combine together and form a cluster; correspondingly, a dendrogram is created, which connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
o In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
o Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another.
o At last, the final dendrogram is created that combines all the data points together.

We can cut the dendrogram tree structure at any level as per our requirement.

Python Implementation of Agglomerative Hierarchical Clustering
Now we will see the practical implementation of the agglomerative hierarchical clustering
algorithm using Python. To implement this, we will use the same dataset problem that we
have used in the previous topic of K-means clustering so that we can compare both
concepts easily.
The dataset contains information about customers who have visited a mall for shopping. The mall owner wants to find some patterns or particular behaviors of his customers using this dataset.

Steps for implementation of AHC using Python:


The steps for implementation will be the same as the k-means clustering, except for some
changes such as the method to find the number of clusters. Below are the steps:

1. Data Pre-processing
2. Finding the optimal number of clusters using the Dendrogram
3. Training the hierarchical clustering model
4. Visualizing the clusters

Data Pre-processing Steps:


In this step, we will import the libraries and datasets for our model.

o Importing the libraries

# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

The above lines of code are used to import the libraries to perform specific tasks, such
as numpy for the Mathematical operations, matplotlib for drawing the graphs or scatter
plot, and pandas for importing the dataset.

o Importing the dataset

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

As discussed above, we have imported the same Mall_Customers_data.csv dataset that we used in k-means clustering. Consider the below output:
o Extracting the matrix of features

Here we will extract only the matrix of features as we don't have any further information
about the dependent variable. Code is given below:

x = dataset.iloc[:, [3, 4]].values

Here we have extracted only columns 3 and 4, as we will use a 2-D plot to see the clusters. So, we are considering the Annual Income and Spending Score as the matrix of features.
Step-2: Finding the optimal number of clusters using the
Dendrogram
Now we will find the optimal number of clusters using the Dendrogram for our model.
For this, we are going to use scipy library as it provides a function that will directly return
the dendrogram for our code. Consider the below lines of code:

#Finding the optimal number of clusters using the dendrogram
import scipy.cluster.hierarchy as shc
dendro = shc.dendrogram(shc.linkage(x, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()

In the above lines of code, we have imported the hierarchy module of the scipy library. This module provides the method shc.dendrogram(), which takes the output of linkage() as a parameter. The linkage function is used to define the distance between two clusters, so here we have passed x (the matrix of features) and the method "ward", a popular linkage method in hierarchical clustering.

The remaining lines of code are to describe the labels for the dendrogram plot.

Output:

By executing the above lines of code, we will get the below output:
Using this dendrogram, we will now determine the optimal number of clusters for our model. For this, we will find the largest vertical distance that does not cut any horizontal bar. Consider the below diagram:
In the above diagram, we have shown the vertical distances that do not cut any horizontal bars. As we can see, the 4th distance looks the largest, so according to this, the number of clusters will be 5 (the number of vertical lines in this range). We could also take the 2nd distance, as it is approximately equal to the 4th, but we will consider 5 clusters because that is the number we calculated in the K-means algorithm.

So, the optimal number of clusters will be 5, and we will train the model in the next
step, using the same.

Step-3: Training the hierarchical clustering model


As we know the required optimal number of clusters, we can now train our model. The
code is given below:

#training the hierarchical model on dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
y_pred = hc.fit_predict(x)

In the above code, we have imported the AgglomerativeClustering class from the cluster module of the scikit-learn library.

Then we have created the object of this class named as hc. The AgglomerativeClustering
class takes the following parameters:

o n_clusters=5: It defines the number of clusters, and we have taken here 5 because it is the
optimal number of clusters.
o affinity='euclidean': It is a metric used to compute the linkage.
o linkage='ward': It defines the linkage criteria, here we have used the "ward" linkage. This
method is the popular linkage method that we have already used for creating the
Dendrogram. It reduces the variance in each cluster.

In the last line, we have created the variable y_pred by fitting the model. fit_predict() not only trains the model but also returns the cluster to which each data point belongs.

After executing the above lines of code, if we go to the variable explorer option in our Spyder IDE, we can check the y_pred variable. We can compare the original dataset with the y_pred variable. Consider the below image:
As we can see in the above image, y_pred shows the cluster values, which means customer id 1 belongs to the 5th cluster (as indexing starts from 0, a value of 4 means the 5th cluster), customer id 2 belongs to the 4th cluster, and so on.

Step-4: Visualizing the clusters


As we have trained our model successfully, now we can visualize the clusters
corresponding to the dataset.

Here we will use the same lines of code as we did in k-means clustering, except one
change. Here we will not plot the centroid that we did in k-means, because here we have
used dendrogram to determine the optimal number of clusters. The code is given below:

#visualizing the clusters
mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
mtp.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

Output: By executing the above lines of code, we will get the below output:

Association Rule Learning


Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another data item and maps them accordingly so that the relationship can be exploited profitably. It tries to find interesting relations or associations among the variables of a dataset, using different rules to discover the interesting relations between variables in the database.

Association rule learning is one of the very important concepts of machine learning, and it is employed in Market Basket Analysis, web usage mining, continuous production, etc. Here, market basket analysis is a technique used by various big retailers to discover the associations between items. We can understand it with the example of a supermarket, where all products that are frequently purchased together are placed together.

For example, if a customer buys bread, he most likely can also buy butter, eggs, or milk,
so these products are stored within a shelf or mostly nearby. Consider the below diagram:

Association rule learning can be divided into three types of algorithms:

1. Apriori
2. Eclat
3. F-P Growth Algorithm

We will understand these algorithms in later chapters.

How does Association Rule Learning work?


Association rule learning works on the concept of an if-then statement, such as: if A, then B. Here the 'if' element (A) is called the antecedent, and the 'then' element (B) is called the consequent. Relationships in which we find an association between just two items are said to have single cardinality. Rule learning is all about creating rules, and as the number of items increases, the cardinality also increases accordingly. So, to measure the associations between thousands of data items, there are several metrics. These metrics are given below:

o Support
o Confidence
o Lift

Let's understand each of them:

Support
Support is the frequency of an item, or how frequently it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X. For an itemset X and a total of T transactions, it can be written as:

Support(X) = (Number of transactions containing X) / (Total number of transactions T)

Confidence
Confidence indicates how often the rule has been found to be true, i.e., how often items X and Y occur together in the dataset given that X has occurred. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X:

Confidence(X -> Y) = (Transactions containing both X and Y) / (Transactions containing X)

Lift
Lift is the strength of a rule, which can be defined by the formula below. It is the ratio of the observed support to the expected support if X and Y were independent of each other:

Lift(X -> Y) = Support(X and Y) / (Support(X) * Support(Y))

It has three possible values (a small worked example in Python follows the list below):

o If Lift = 1: The occurrence of the antecedent and the consequent are independent of each other.
o If Lift > 1: It determines the degree to which the two itemsets are dependent on each other.
o If Lift < 1: One item is a substitute for the other, which means one item has a negative effect on the other.
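As a small worked example of these three metrics, the sketch below computes them by hand for a hypothetical set of five transactions (item names and counts are made up for illustration):

# Hypothetical transactions, for illustration only
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]
T = len(transactions)

def support(itemset):
    # fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / T

# Rule: bread -> butter
sup_xy = support({"bread", "butter"})                        # 3/5 = 0.6
confidence = sup_xy / support({"bread"})                     # 0.6 / 0.8 = 0.75
lift = sup_xy / (support({"bread"}) * support({"butter"}))   # 0.6 / 0.64 = 0.9375

print(support({"bread"}), confidence, lift)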

Types of Association Rule Learning


Association rule learning can be divided into three algorithms:

Apriori Algorithm
This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a Hash Tree to find the itemsets efficiently.

It is mainly used for market basket analysis and helps to understand the products that can
be bought together. It can also be used in the healthcare field to find drug reactions for
patients.

Eclat Algorithm
Eclat algorithm stands for Equivalence Class Transformation. This algorithm uses a
depth-first search technique to find frequent itemsets in a transaction database. It
performs faster execution than Apriori Algorithm.

F-P Growth Algorithm


The FP-Growth algorithm stands for Frequent Pattern Growth, and it is an improved version of the Apriori algorithm. It represents the database in the form of a tree structure known as a frequent pattern tree (FP-tree). The purpose of this tree is to extract the most frequent patterns.
Applications of Association Rule Learning
It has various applications in machine learning and data mining. Below are some popular
applications of association rule learning:

o Market Basket Analysis: It is one of the popular examples and applications of association
rule mining. This technique is commonly used by big retailers to determine the association
between items.
o Medical Diagnosis: With the help of association rules, patients can be cured easily, as it
helps in identifying the probability of illness for a particular disease.
o Protein Sequence: The association rules help in determining the synthesis of artificial
Proteins.
o It is also used for the Catalog Design and Loss-leader Analysis and many more other
applications.

Python Implementation of Association Rule Learning using the Apriori Algorithm

DataSet:
Dimensionality Reduction

What is Dimensionality Reduction?


The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.

A dataset may contain a huge number of input features in various cases, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.

The dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems.

It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in practice; this is commonly known as the curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed grows proportionally, and the chance of overfitting also increases. If a machine learning model is trained on high-dimensional data, it tends to become overfitted and to show poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction


Some benefits of applying dimensionality reduction technique to the given dataset are
given below:

o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Less Computation training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction


There are also some disadvantages of applying the dimensionality reduction, which are
given below:

o Some data may be lost due to dimensionality reduction.


o In the PCA dimensionality reduction technique, the number of principal components to retain is sometimes not obvious.
Principal Component Analysis
Principal Component Analysis is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning. It is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. It is one of the popular tools used for exploratory data analysis and predictive modeling. It draws out the strong patterns in a dataset by keeping the directions of highest variance and discarding the low-variance ones.

PCA generally tries to find a lower-dimensional surface onto which the high-dimensional data can be projected.

PCA works by considering the variance of each attribute, because an attribute with high variance carries more of the spread of the data, and PCA reduces the dimensionality by keeping the high-variance directions. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it retains the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

o Variance and Covariance
o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

o Dimensionality: It is the number of features or variables present in the given dataset.


More easily, it is the number of columns present in the dataset.
o Correlation: It signifies how strongly two variables are related to each other, such that if one changes, the other variable also changes. The correlation value ranges from -1 to +1; -1 occurs when variables are inversely proportional to each other, and +1 indicates that variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
o Eigenvectors: If there is a square matrix M and a non-zero vector v, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is
called the Covariance Matrix.

Principal Components in PCA


As described above, the transformed new features or the output of PCA are the Principal
Components. The number of these PCs are either equal to or less than the original features
present in the dataset. Some properties of these principal components are given below:

o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n; the 1st PC has the most importance, and the nth PC has the least importance.

Steps for PCA algorithm


1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where
X is the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent the two-
dimensional matrix of independent variable X. Here each row corresponds to the data
items, and the column corresponds to the Features. The number of columns is the
dimensions of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. In a particular column, features with high variance appear more important than features with lower variance, so if the importance of the features should be independent of their variance, we divide each data item in a column by the standard deviation of that column. Here we will name the resulting matrix Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of Z.
5. Calculating the Eigenvalues and Eigenvectors
Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance matrix of Z. The eigenvectors of the covariance matrix are the directions of the axes with high information, and the corresponding eigenvalues indicate how much variance lies along each of these directions.
6. Sorting the Eigenvectors
In this step, we will take all the eigenvalues and sort them in decreasing order, from largest to smallest, and simultaneously sort the eigenvectors accordingly into a matrix P. The resultant sorted matrix will be named P*.
7. Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the P* matrix to the Z.
In the resultant matrix Z*, each observation is the linear combination of original features.
Each column of the Z* matrix is independent of each other.
8. Removing less important features from the new dataset
The new feature set is now available, so we decide what to keep and what to remove: we keep only the relevant or important features (components) in the new dataset and drop the unimportant ones. (A NumPy sketch of these steps follows the list.)
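As a rough NumPy sketch of steps 3 to 7 above (a small hypothetical data matrix X is assumed, with observations in rows and features in columns):

import numpy as np

# Hypothetical data: 6 observations, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# Step 3: standardize each column (centre and scale by standard deviation)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z
cov = np.cov(Z, rowvar=False)

# Step 5: eigenvalues and eigenvectors of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Step 6: sort eigenvectors by decreasing eigenvalue (matrix P*)
order = np.argsort(eig_vals)[::-1]
P_star = eig_vecs[:, order]

# Step 7: project Z onto the sorted eigenvectors to obtain the principal components
Z_star = Z @ P_star

# Step 8: keep only the first two components
reduced = Z_star[:, :2]
print(reduced.shape)   # (6, 2)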

Applications of Principal Component Analysis


o PCA is mainly used as the dimensionality reduction technique in various AI applications
such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions. Some fields
where PCA is used are Finance, data mining, Psychology, etc.

Ensemble learning
Ensemble methods combine several decision trees to deliver better predictive results than using a single decision tree. The primary principle behind the ensemble model is that a group of weak learners come together to form a stronger learner.

There are two main techniques, described below, that are used to build ensembles of decision trees.

Bagging
Bagging is used when our objective is to reduce the variance of a decision tree. Here the idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset of data is then used to train its own decision tree, so we end up with an ensemble of models. The average of the predictions from the numerous trees is used, which is more robust than a single decision tree (a short scikit-learn sketch follows the steps below).

Implementation Steps of Bagging


• Step 1: Multiple subsets are created from the original data set with equal
tuples, selecting observations with replacement.
• Step 2: A base model is created on each of these subsets.
• Step 3: Each model is learned in parallel with each training set and
independent of each other.
• Step 4: The final predictions are determined by combining the predictions
from all the models.

An illustration for the concept of bootstrap aggregating (Bagging)
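A minimal sketch of bagging using scikit-learn's BaggingClassifier (the built-in Iris data is used here only as a convenient stand-in; the default base model is a decision tree):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 50 decision trees, each trained on a bootstrap sample of 80% of the rows
bagging = BaggingClassifier(n_estimators=50, max_samples=0.8, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)

# The final prediction combines the votes of all 50 trees
print(accuracy_score(y_test, bagging.predict(X_test)))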

Random Forest

Random Forest is an extension of bagging. It takes one additional step: in addition to taking a random subset of the data, it also makes a random selection of features rather than using all features to grow each tree. When we have numerous such random trees, the result is called a Random Forest.

The following steps are taken to implement a Random Forest (a short sketch follows the list):

o Suppose there are X observations and Y features in the training data set. First, a sample is taken from the training data set at random with replacement.
o A tree is grown to the largest extent possible, using a random subset of the features at each split.
o The above steps are repeated, and the prediction is given based on the collection of predictions from the n trees.
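A minimal scikit-learn sketch of the Random Forest steps above (again using the built-in Iris data as a stand-in):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 trees; max_features='sqrt' gives each split a random subset of the features
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
forest.fit(X_train, y_train)
print(accuracy_score(y_test, forest.predict(X_test)))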

Advantages of using Random Forest technique:

o It manages a higher dimension data set very well.


o It manages missing quantities and keeps accuracy for missing data.

Disadvantages of using Random Forest technique:

Since the final prediction is the mean of the predictions from the individual trees, it may not give precise continuous values for a regression model.

Boosting:
Boosting is another ensemble procedure for building a collection of predictors. Here we fit consecutive trees, usually on reweighted or random samples, and at each step the objective is to reduce the net error of the prior trees.

If a given input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly; combining the entire set of hypotheses at the end converts weak learners into a better-performing model (a scikit-learn sketch of this weight-based scheme follows the algorithm below).

Gradient Boosting is an expansion of the boosting procedure.

Gradient Boosting = Gradient Descent + Boosting

It utilizes a gradient descent algorithm that can optimize any differentiable loss function. The trees of the ensemble are built one at a time and summed successively; each new tree tries to recover the residual loss (the difference between the actual and predicted values).
Algorithm:

1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data
points.
3. Increase the weight of the wrongly classified data points and decrease the
weights of correctly classified data points. And then normalize the weights of
all data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
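A minimal sketch of this re-weighting scheme using scikit-learn's AdaBoostClassifier, whose default weak learners are shallow decision trees (the Iris data is again only a stand-in):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Each successive weak learner focuses on the examples the previous ones got wrong
booster = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
booster.fit(X_train, y_train)
print(accuracy_score(y_test, booster.predict(X_test)))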

Similarities Between Bagging and Boosting


Bagging and Boosting are both commonly used methods, and they share the universal similarity of being classified as ensemble methods. The main similarities between them are:
1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or by taking the majority of them, i.e., majority voting).
4. Both are good at reducing variance and provide higher stability.

Advantages of using Gradient Boosting methods:

o It supports different loss functions.


o It works well with interactions.

Disadvantages of using Gradient Boosting methods:

o It requires cautious tuning of different hyper-parameters.

Difference between Bagging and Boosting:

1. In bagging, the models are trained independently and in parallel on bootstrap samples; in boosting, the models are trained sequentially, with each new model focusing on the errors of the previous ones.
2. Bagging gives every sample and every model equal weight; boosting re-weights the samples and weights the models according to their performance.
3. Bagging mainly reduces variance, whereas boosting mainly reduces bias.
4. Bagging is less prone to overfitting, while boosting can overfit if too many rounds are used.

What is Meta-learning?
Meta-learning, described as “learning to learn”, is a subset of machine learning
in the field of computer science. It is used to improve the results and
performance of the learning algorithm by changing some aspects of the learning
algorithm based on the results of the experiment. Meta-learning helps
researchers understand which algorithms generate the best/better predictions
from datasets.
Meta-learning algorithms use learning algorithm metadata as input. They then
make predictions and provide information about the performance of these
learning algorithms as output. An image’s metadata in a learning model can
include, for example, its size, resolution, style, creation date, and owner.
Systematic experiment design in meta-learning is the most important challenge.

Importance of Meta-learning in Present Scenario


Machine learning algorithms have some problems, such as

• The need for large datasets for training


• High operating costs due to many trials/experiments during the training
phase
• Experiments/trials take a long time to find the best model that performs
best for a given data set.

Meta-learning can help machine learning algorithms deal with these challenges
by optimizing and finding learning algorithms that perform better.

Working of Meta-learning
In general, a meta-learning algorithm is trained using the outputs (i.e., model predictions) and metadata of machine learning algorithms. After training is done, the resulting model is tested and used to make the final predictions.
Meta-learning includes tasks such as

• Observing the performance of different machine learning models on


learning tasks
• Learning from metadata
• The faster learning process for new tasks

For example, we may want to train a machine learning model to label discrete
breeds of dogs.

• First, we need an annotated dataset.


• Different ML models are built on the training set. They could only focus
on certain parts of the dataset.
• A meta-training process is used to improve the performance of these
models
• Finally, the meta-training model can be used to build a new model from
several examples based on its experience with the previous training
process
What is a Recommender System?

Recommender systems are a type of machine learning algorithm that provides


consumers with "relevant" recommendations. When we search for something
anywhere, be it in an app or in our search engine, this recommender system is
used to provide us with relevant results. They use a class of algorithms to find
out the relevant recommendation for the user.

For example, if a user listens to rock music every day, his YouTube recommendation feed will fill up with rock music and music of related genres.

In this, items are ranked according to their relevancy and the most relevant ones
are recommended to the user. The recommendation system must assess the
relevance, which is primarily based on past data. Just like the rock music thing
we just saw.

The recommender system is divided into mainly two categories: Collaborative


filtering and content based filtering.

Collaborative filtering

Methods for recommender systems that are primarily based on previous


interactions between users and the target items are known as collaborative
filtering methods.

As a result, all past data about user interactions with target objects will be fed
into a collaborative filtering system. This information is usually recorded as a
matrix, with the rows representing users and the columns representing items.

The basic premise of such systems is that the users' previous data should be
sufficient to generate a prediction. That is, we don't require anything other than
historical data, no more user input, no current trending data, and so on.

Furthermore, collaborative filtering methods are divided into two sub-groups:


memory-based methods and model-based methods.

• Memory Based

Memory-based methods are the most basic because they use no model at
all. They assume that predictions can be made based solely on "memory"
of past data and typically use a simple distance-measurement approach,
such as the nearest neighbor

• Model Based

Model-based approaches, on the other hand, usually presuppose some


form of the underlying model and attempt to ensure that any predictions
made fit the model properly.

Now let us move on to the second category of recommender system, i.e., the content-based recommendation system.

Content- based Recommender System


Content-based filtering is one popular technique of recommendation or
recommender systems. The content or attributes of the things you like are
referred to as "content."

Here, the system uses your features and likes in order to recommend things that you might like. It uses the information you provide over the internet, plus whatever it is able to gather, and then curates recommendations accordingly.
The goal behind content-based filtering is to classify products with specific
keywords, learn what the customer likes, look up those terms in the database,
and then recommend similar things.

This type of recommender system is hugely dependent on the inputs provided by users; some common examples include Google and Wikipedia. For example, when a user searches for a group of keywords, Google displays all the items containing those keywords.

Example

Suppose I am a fan of the Harry Potter series and watch only such kinds of
movies on the internet. When my data will be gathered from Google or
Wikipedia, it will be found out that I am a fan of fantasy movies. Therefore, my
recommendation will be filled with fantasy movies. Among all the movies, the
ones best for me will be curated and then recommended to me.

Suppose there are two movies, one being Fantastic Beasts and the other Shawshank Redemption; then, according to my preference for fantasy movies, Fantastic Beasts will be recommended to me.

How does it work?

The content-based recommendation system works using two methods, each based on a different model and algorithm. One uses the vector space method and is called Method 1, while the other uses a classification model and is called Method 2.

• Method 1: The vector space method

Let us suppose you read a crime thriller book by Agatha Christie and you review it on the internet. Along with it, you also review a fictional book of the comedy genre, rating the crime thriller as good and the comedy one as bad.

Now, a rating system is built from the information you provided. In a rating system from 0 to 9, the crime thriller and detective genres are ranked 9, other serious genres lie between 9 and 0, and the comedy ones lie lowest, possibly negative.

With this information, the next book recommendation you will get will be
of crime thriller genres most probably as they are the highest rated genres
for you.

For this ranking system, a user vector is created which ranks the
information provided by you. After this, an item vector is created where
books are ranked according to their genres on it.

Using these vectors, every book is assigned a score by taking the dot product of the user vector and the item vector, and that score is then used for recommendation. In this way, the scores of all the available books you have searched for are ranked, and the top 5 or top 10 books are selected (a minimal sketch of this scoring appears after Method 2 below).

This method was the first method used by a content-based


recommendation system to recommend items to the user.

• Method 2: Classification method

The second method is the classification method. In it, we can create a


decision tree and find out if the user wants to read a book or not.

For example, a book is considered, let it be The Alchemist.

Based on the user data, we first look at the author name and it is not
Agatha Christie. Then, the genre is not a crime thriller, nor is it the type of
book you ever reviewed. With these classifications, we conclude that this
book shouldn’t be recommended to you.
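A tiny NumPy sketch of the vector space scoring described in Method 1 (the genre weights and book profiles are made-up values for illustration):

import numpy as np

# User vector: how much the user likes each genre (crime, detective, drama, comedy)
user_vector = np.array([9.0, 8.0, 4.0, -2.0])

# Item vectors: how strongly each book belongs to those genres
books = {
    "Crime Novel A":    np.array([1.0, 0.8, 0.2, 0.0]),
    "Detective Tale B": np.array([0.7, 1.0, 0.3, 0.0]),
    "Comedy C":         np.array([0.0, 0.0, 0.2, 1.0]),
}

# Score each book with the dot product of the user and item vectors
scores = {title: float(user_vector @ vec) for title, vec in books.items()}

# Recommend the highest-scoring books first
for title, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(title, round(score, 2))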

Advantages and Disadvantages of content- based


recommendation system

Advantages of content-based recommender system are following;

• Because the recommendations are tailored to a single person, the model does not require any information about other users. This makes it easier to scale to a large number of users.
• The model can recognize a user's individual preferences and make
recommendations for niche things that only a few other users are
interested in.
• New items may be suggested before being rated by a large number of users, as opposed to collaborative filtering.

Listing below the disadvantages of it;

• This methodology necessitates a great deal of domain knowledge


because the feature representation of the items is hand-engineered to
some extent. As a result, the model can only be as good as the
characteristics that were hand-engineered.

• The model can only give suggestions based on the user's current
interests. To put it another way, the model's potential to build on the
users' existing interests is limited.

• Since it must align the features of a user's profile with available products,
content-based filtering offers only a small amount of novelty.

Only item profiles are generated in the case of item-based filtering, and
users are recommended items that are close to what they rate or search
for, rather than their previous background. A perfect content-based
filtering system can reveal nothing surprising or unexpected.
What is Collaborative Filtering? Types, Working and Case Study

What is Collaborative Filtering?

Ever thought about how e-commerce sites recommend products to their customers while they are looking for something exactly like that? Ever wondered how Netflix recommends similar movies based on what we have recently watched or added to our watchlist? Artificial Intelligence technology has advanced to such an extent that the world can be perceived through the lens of this technology. With various techniques like deep learning, machine learning, and artificial neural networks, AI tools and techniques have enabled the internet to become a black hole filled with information and entertainment. In this respect, it has also enabled the internet to recommend users or items to netizens active on the internet. A variety of machine learning applications and software use recommender systems, empowered by machine learning techniques and tools, to recommend similar items or products to their users.

Broadly, there are 2 types of recommendation techniques in use as of now. First, content-based filtering requires users to enter data that is then processed to produce the desired outputs. For instance, a user logs in to his/her Netflix account and enters "Hollywood Romantic Movies" in the search bar. The results obtained from this search are produced with the help of the content-based approach that works on the basis of content inputs.

Second, the technique of collaborative filtering implies that computers produce outputs based on a user's past interactions on a platform. We shall understand this with an example. Let us suppose that an individual is inclined towards romanticism and likes to watch movies belonging to the romantic genre on his Netflix account. Whenever he logs in to his account, he will see a separate section that only displays recommended movies based on his past preferences and watch history. Therefore, the technique of collaborative filtering filters information and infers from the past interactions of a user in order to recommend similar items or content. In this topic, we shall learn about collaborative filtering at length.

Recommendation systems, used on a variety of platforms like e-commerce sites, OTT platforms, and music streaming applications, employ the technology of collaborative filtering, which recommends items or users based on a user's historical browsing data. Be it Instagram recommending people we may know, or clothing items that resemble the items we've just added to our carts, collaborative filtering is a leading technology in the contemporary scenario. The key to this technique is collaborative filtering itself, which has only emerged in the 21st century as a powerful unsupervised machine learning algorithm.

How does it work?

As we have already learned, collaborative filtering is an important machine learning technique that helps a computer filter information based on past interactions and data recorded on the user's end. Simply put, collaborative filtering algorithms produce similar results based on the user's historical data. For instance, suppose it has been established that a user is interested in Pop songs. The collaborative filtering algorithms in music streaming applications will record this interaction of the user and interpret that the user prefers the Pop genre over other genres. The recommendation system built with this technique will then display other popular songs of the Pop genre. This is how a collaborative filtering algorithm works: by recording past interactions of a user on a particular platform, the technique of collaborative filtering interprets them and produces recommendation results with similar traits.

"It's based on the idea that people who agreed in their evaluation of certain items are likely to agree again in the future." - Collaborative Filtering in Recommender Systems

Types of Collaborative Filtering


Broadly, there are 2 types of Collaborative Filtering techniques that can be used by software and applications worldwide. They are as follows:

1. User-based Collaborative Filtering

As collaborative filtering procures its results from implicit data, it is able to retrieve information that users otherwise might not provide. The first class of collaborative filtering is the user-based approach. This approach narrows down users who have similar behaviors, common contacts, close demographics, and similar consumer preferences. Social networking sites incorporate this approach to recommend users to other users based on their patterns of behavior. Moreover, this approach is also employed for targeted ads and suggested items based on other users who have similar choices and preferences. Among the various approaches to collaborative filtering, user-based collaborative filtering is the first approach that came into existence.

A typical example of this approach is the 'suggested friends' category displayed on Facebook, which recommends people that users might know based on their virtual contacts and similar preferences.



2. Item-based Collaborative Filtering

A class of collaborative filtering techniques, item-based collaborative filtering refers to the recommendation of items or products using collaborative filtering. By measuring similarity among products and inferring the respective ratings, items are recommended to users based on their historical data and interaction history. This class of collaborative filtering was invented and first used by Amazon in 1998. Even today, e-commerce sites like Amazon and Flipkart use item-based recommendation systems to recommend similar items or products to users by filtering them with the help of a user's past interaction data. With the statistical technique of Nearest Neighbours, collaborative filtering in this approach works effectively and presents users with legitimate recommendations, which has only led to an increase in consumption (a tiny sketch of this approach follows).
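A minimal sketch of the item-based idea, assuming a tiny hypothetical user-item rating matrix and cosine similarity between item columns (real systems work at a vastly larger scale):

import numpy as np

# Rows = users, columns = items; 0 means "not rated" (hypothetical ratings)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

n_items = ratings.shape[1]
# Item-item similarity matrix built from the rating columns
item_sim = np.array([[cosine_sim(ratings[:, i], ratings[:, j])
                      for j in range(n_items)] for i in range(n_items)])

# Predict user 0's score for item 2 as a similarity-weighted average of their other ratings
user = ratings[0]
rated = user > 0
pred = item_sim[2, rated] @ user[rated] / (item_sim[2, rated].sum() + 1e-9)
print(round(float(pred), 2))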

Case Studies of Collaborative Filtering

In this segment, we will look at various real-world case studies that will help us understand the role of collaborative filtering in a better manner.

1. FACEBOOK
A social networking site launched in the year 2004, Facebook has pioneered the world of social networking, aiming to connect people from one corner of the world to another. Currently led by Mark Zuckerberg, Facebook uses numerous AI techniques that have advanced the social networking site. However, one of the most striking techniques used by this social media giant is collaborative filtering. Be it targeted marketing, suggested friends, or discovering friends, collaborative filtering is a highly significant technique.

"Facebook uses different recommender systems for different parts of the site. For example, the user timeline may use one algorithm, while the News section and Marketplace sections use other recommender systems to provide data it thinks is useful to the user." - Collaborative Filtering in Facebook

2. AMAZON

An e-commerce website, Amazon is a retail platform that sells various commodities and acts as a middleman by connecting sellers and buyers worldwide. Launched in 1994, Amazon earlier traded only in books. By using a variety of the best machine learning tools for better performance and enhanced user interaction, Amazon also incorporates the collaborative filtering technique in its recommendation system. Since an e-commerce platform like Amazon has millions of users surfing the platform, this technique is of great use to the company and its users. With a colossal technological interface, Amazon offers both a user-based approach and an item-based approach for suggested products and similar items.

All in all, the platform's item-based collaborative filtering has proved to be a useful system that has boosted the profit-making capacity of the business. What's more, the platform opts for item-based collaborative filtering more than the user-based approach in order to produce high-quality recommendations. At first, collaborative filtering had only one approach - the user-based approach. However, it was Amazon that developed an item-based approach that began to look at items rather than users.

3. NETFLIX

The third case study is based on one of the most renowned OTT platforms worldwide - Netflix. Known for its humongous entertainment collection and latest OTT content, Netflix was founded in 1997. With millions of users from around the world, the platform offers various recommendations to its users, thanks to a collaborative filtering movie recommendation system.

"Recommendation algorithms are at the core of the Netflix product. They provide our members with personalized suggestions to reduce the amount of time and frustration to find something great to watch." - Collaborative Filtering Applications in Netflix

In the most mundane schedule, a user could log in to his/her Netflix account and make the most of the recommendations that are displayed at every step of the way. With so much to watch and learn from, platforms like Netflix have brought in the best of all worlds, as they use recommender systems empowered by collaborative filtering approaches that narrow down our options and work according to our preferences.