Mod4_Unsupervised Learning
Clustering is the process of organizing objects into groups such that data points in the same group are more similar to each other than to data points in other groups. A cluster is a collection of objects that are similar to one another and dissimilar to the objects in other clusters.
EXAMPLE:
A bank wants to give credit card offers to its customers. Currently, they look at the details of each customer and based
on this information, decide which offer should be given to which customer. Now, the bank can potentially have
millions of customers. Does it make sense to look at the details of each customer separately and then make a decision?
Certainly not! It is a manual process and will take a huge amount of time.
So what can the bank do? One option is to segment its customers into different groups. For instance, the bank can group the customers based on their income, say into low-, medium-, and high-income groups.
The bank can now make three different strategies or offers, one for each group. Here, instead of creating different
strategies for individual customers, they only have to make 3 strategies. This will reduce the effort as well as the time.
The groups are known as clusters, and the process of creating these groups is known as clustering.
APPLICATIONS:
(i) Customer Segmentation: It isn’t just limited to banking. This strategy is used across industries, including telecom, e-commerce, sports, advertising, sales, etc.
(ii) Document Clustering: Let’s say you have multiple documents and need to group similar ones together. Clustering places similar documents in the same cluster.
(iii) Image Segmentation: We can also use clustering to perform image segmentation, grouping similar pixels of an image into the same cluster.
(iv) Recommendation Engines: Let’s say you want to recommend songs to a friend. You can look at the songs that friend likes, use clustering to find similar songs, and recommend the most similar ones.
When we are working with huge volumes of data, it makes sense to partition the data into logical groups and then analyze each group. We can use clustering algorithms such as K-Means to split the data into these groups.
K-Means clustering is a type of unsupervised learning. The main goal of this algorithm is to find groups in the data, with the number of groups represented by K. It is an iterative procedure in which each data point is assigned to one of the K groups based on feature similarity.
The K-Means algorithm starts with initial estimates of the K centroids, which are randomly selected from the dataset. The algorithm then iterates between two steps: assigning data points and updating centroids.
• Data Assignment
In this step, each data point is assigned to its nearest centroid based on
the squared Euclidean distance. A data point x is assigned to the cluster
with centroid c when c is the closest centroid to x.
• Centroid Update
Centroids are recomputed by taking the mean of all data points
assigned to a particular cluster.
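The two steps above can be sketched in plain Python (the helper names `assign` and `update` are assumptions for illustration; points and centroids are tuples of coordinates):

```python
def sq_dist(p, q):
    # Squared Euclidean distance: sum of squared coordinate differences.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    # Data assignment step: each point joins the cluster of its
    # nearest centroid under the squared Euclidean distance.
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def update(points, labels, k):
    # Centroid update step: each centroid is recomputed as the mean
    # of all points assigned to that cluster.
    new_centroids = []
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids
```

Repeating `assign` followed by `update` until the labels stop changing is the whole K-Means loop.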
EXAMPLE:
1. Take four data points A, B, C, and D, and set K = 2, starting with two clusters {A, B} and {C, D}.
2. Compute the initial centroids as the average of the points in each cluster: AB = average of A, B and CD = average of C, D. Here A = (2, 3) and the centroid AB = (4, 2).
3. Calculate the squared Euclidean distance from all data points to the centroids AB and CD.
For example, the squared distance s between A(2, 3) and AB(4, 2) is given by
s = (2 – 4)² + (3 – 2)² = 4 + 1 = 5
4. The distance between A and CD is 4, which is smaller than the distance between A and AB, which is 5. Since point A is closer to CD, we move A to the CD cluster.
5. Two clusters have been formed so far; let us recompute the centroids, i.e., B and ACD, similar to
step 2.
ACD = average of A, C, D
B = B
6. Since K-Means is an iterative procedure, we now calculate the distance of all
points (A, B, C, D) to the new centroids (B, ACD), similar to step 3.
7. Computing these distances shows that each point is now closest to its current centroid: A, for
example, is far from cluster B and near cluster ACD. All data points are assigned to the clusters
(B, ACD) based on their minimum distance, and since no assignment changes, the iterative
procedure ends here.
8. To conclude, we started with two centroids and ended up with two clusters, K = 2.
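The distance computation in step 3 can be checked directly from the coordinates given in the text (A = (2, 3), centroid AB = (4, 2)); a minimal sketch:

```python
def squared_euclidean(p, q):
    # Squared Euclidean distance: sum of squared coordinate differences.
    return sum((a - b) ** 2 for a, b in zip(p, q))

A = (2, 3)
AB = (4, 2)
print(squared_euclidean(A, AB))  # (2 - 4)**2 + (3 - 2)**2 = 4 + 1 = 5
```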
Choosing K
One method of choosing the value of K is the elbow method. In this method we run K-Means clustering for
a range of K values, say K = 1 to 10, and calculate the Sum of Squared Errors (SSE) for each. SSE is
the sum of the squared distances between each data point and its cluster centroid.
We then plot a line chart of the SSE values for each K; if the line chart looks like an arm, the elbow on the
arm gives the best value of K.
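As a sketch, the SSE for one clustering result can be computed as follows (the function name `sse` and the argument layout are assumptions for illustration):

```python
def sse(points, centroids, labels):
    # Sum of squared Euclidean distances from each point
    # to the centroid of its assigned cluster.
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
        for p, l in zip(points, labels)
    )
```

Running K-Means for each K from 1 to 10, recording this SSE, and plotting SSE against K gives the elbow curve described above.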
PROGRAM:
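The program itself is not included in this excerpt; a minimal self-contained sketch of K-Means in plain Python (synthetic two-cluster data, K = 2; all names and data are illustrative assumptions) might look like:

```python
import random

def squared_distance(p, q):
    # Squared Euclidean distance between two points.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=42):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroid estimates
    for _ in range(iterations):
        # Data assignment: each point joins its nearest centroid's cluster.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Centroid update: mean of all points assigned to each cluster.
        new_centroids = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j])
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic groups of points
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

In practice a library implementation such as scikit-learn's `KMeans` would be used instead of hand-rolled code; this sketch only mirrors the two steps described above.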