13 Unsupervised Learning

Unsupervised learning

• Unsupervised learning is a machine learning concept in which unlabelled and
unclassified data are analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training.
• They explore the data to identify patterns, groupings, sorting orders, and other
interesting knowledge.
• In supervised learning, the aim was to predict the outcome variable Y on the basis of
the feature set X1, X2,… Xn.
• Regression and classification were the two types of supervised learning.
• The concept of unsupervised learning is to observe only the features X1, X2, …, Xn;
not to predict any outcome variable, but rather to find associations between the
features, or groupings among them, in order to understand the nature of the data.
Applications
• Segmentation of target consumer populations by an advertisement consulting
agency on the basis of a few dimensions such as demographics, financial data,
purchasing habits, etc., so that advertisers can reach their target consumers
efficiently.
• Anomaly or fraud detection in the banking sector by identifying the pattern of loan
defaulters.
• Image processing and image segmentation such as face recognition, expression
identification, etc.
• Grouping of important characteristics in genes to identify important influencers in
new areas of genetics.
CLUSTERING
• Clustering refers to a broad set of techniques for finding subgroups, or clusters, in a
data set.
• Grouping is done on the basis of the characteristics of the objects within the data set,
in such a manner that the objects within a group are similar (or related) to each
other but different from (or unrelated to) the objects in other groups.
Application of clustering
• Text data mining: this includes tasks such as text categorization, text clustering,
document summarization, concept extraction, sentiment analysis, and entity relation
modelling.
• Customer segmentation: creating clusters of customers on the basis of parameters
such as demographics, financial conditions, buying habits, etc., which can be used by
retailers and advertisers to promote their products in the correct segment.
• Anomaly detection: detecting anomalous behaviours such as fraudulent bank
transactions, unauthorized computer intrusions, suspicious movements on a radar
scanner, etc.
• Data mining: simplifying the data mining task by grouping a large number of features
from an extremely large data set to make the analysis manageable.
Different types of clustering techniques
• Partitioning methods
• Hierarchical methods
• Density-based methods

Partitioning methods
• Two of the most important algorithms for partitioning-based clustering are k-means
and k-medoids.
• In the k-means algorithm, the centroid of the prototype is identified for clustering,
which is normally the mean of a group of points.
• Similarly, the k-medoids algorithm identifies the medoid, which is the most
representative point of a group of points. Note that in most cases the centroid does
not correspond to an actual data point, whereas the medoid is always an actual
data point. Let us discuss both these algorithms in detail.
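As a concrete illustration, here is a minimal k-means sketch in Python using scikit-learn (an assumed dependency; the data array X and the choice of two clusters are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data: two features per object
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Fit k-means with k = 2; each centroid is the mean of its assigned points
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index of each point
print(kmeans.cluster_centers_)  # centroids: in general not actual data points
```

Note that the reported centroids are means of the assigned points and, unlike medoids, are generally not members of the data set.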
Strengths and Weaknesses of K-means
• Strengths: simple, computationally efficient, and scales well to large data sets.
• Weaknesses: the number of clusters ‘K’ must be chosen in advance, the result is
sensitive to outliers and to the initial choice of centroids, and the method works best
for spherical clusters.
Elbow method
• This method measures the homogeneity or heterogeneity within the clusters for
various values of ‘K’ and helps in arriving at the optimal ‘K’. These iterations take
significant computational effort, and after a certain point, the gain in homogeneity
no longer justifies the effort required to achieve it; when within-cluster homogeneity
is plotted against ‘K’, this shows up as a bend in the curve. This point is known as the
elbow point, and the ‘K’ value at this point produces the optimal clustering
performance, as sketched below.
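A minimal sketch of the elbow method, assuming scikit-learn is available (the data here is randomly generated purely for illustration): the within-cluster SSE, which scikit-learn exposes as inertia_, is computed for a range of ‘K’ values, and the bend in the resulting curve marks the elbow point.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))  # hypothetical data

# Within-cluster SSE for K = 1..10
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)

# Plotting sse against K and looking for the bend reveals the elbow point
for k, s in zip(range(1, 11), sse):
    print(k, round(s, 2))
```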
• The k-means algorithm is sensitive to outliers in the data set and inadvertently
produces skewed clusters when the means of the data points are used as centroids.
Let us take an example of eight data points, and for simplicity, we can consider them
to be 1-D data with values 1, 2, 3, 5, 9, 10, 12, and 25. Point 25 is the outlier, and it
affects the cluster formation negatively when the mean of the points is considered as
centroids.
• The clustering {1, 2, 3, 5, 9} and {10, 12, 25} has a lower total SSE than {1, 2, 3, 5}
and {9, 10, 12, 25}, so k-means tends to put point 9 in the same cluster as 1, 2, 3,
and 5, even though the point is logically nearer to points 10 and 12. This skewedness
is introduced by the outlier point 25, which shifts the mean away from the centre of
the cluster.
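This can be verified directly on the eight points. The sketch below computes the total SSE of the two candidate clusterings and shows that the skewed one wins:

```python
def sse(cluster):
    """Sum of squared distances of the points in a cluster from their mean."""
    mean = sum(cluster) / len(cluster)
    return sum((x - mean) ** 2 for x in cluster)

# Eight 1-D points: 1, 2, 3, 5, 9, 10, 12, 25 (25 is the outlier)
# Clustering A: 9 grouped with the small values (skewed by the outlier)
a = sse([1, 2, 3, 5, 9]) + sse([10, 12, 25])
# Clustering B: 9 grouped with 10 and 12, where it logically belongs
b = sse([1, 2, 3, 5]) + sse([9, 10, 12, 25])

print(a, b)  # a ≈ 172.67 < b = 174.75, so k-means prefers clustering A
```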
k-medoids
• k-medoids provides a solution to this problem. Instead of considering the mean of
the data points in the cluster, k-medoids considers k representative data points from
the existing points in the data set as the centres of the clusters. It then assigns the
data points according to their distance from these centres to form k clusters. Note
that the medoids in this case are actual data points or objects from the data set,
and not imaginary points as in the k-means technique, where the mean of the data
points within the cluster is used as the centroid. The SSE is calculated as

$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, o_i)^2$$

• where o_i is the representative point or object of cluster C_i.


• The k-medoids method groups n objects in k clusters by minimizing the SSE.
• Because it uses medoids drawn from the actual data points, k-medoids is less
influenced by outliers in the data. One practical implementation of the k-medoids
principle is the Partitioning Around Medoids (PAM) algorithm.
• In this algorithm, the current representative object is replaced with a non-
representative object, and it is checked whether this improves the quality of the
clustering.
• In the iterative process, all possible replacements are attempted until the quality of
the clusters no longer improves.
• If o_1, …, o_k are the current set of representative objects or medoids and o_r is a
non-representative object, then to determine whether o_r is a good replacement for
o_j (1 ≤ j ≤ k), the distance of each object x from its nearest medoid in the set
{o_1, o_2, …, o_(j-1), o_r, o_(j+1), …, o_k} is calculated, and the SSE is computed.
• If the SSE after replacing o_j with o_r decreases, it means that o_r represents the
cluster better than o_j, and the data points are then reassigned according to the
nearest medoids.
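The following is a minimal PAM sketch in Python under the assumptions above (1-D data, squared distance, naive initialization); a production implementation would handle initialization and tie-breaking more carefully:

```python
import itertools

def total_sse(points, medoids):
    """Each point contributes its squared distance to the nearest medoid."""
    return sum(min((p - m) ** 2 for m in medoids) for p in points)

def pam(points, k):
    """Minimal PAM sketch: swap a medoid o_j with a non-representative
    object o_r and keep the swap only if it lowers the SSE."""
    medoids = list(points[:k])          # naive initial medoids
    best = total_sse(points, medoids)
    improved = True
    while improved:
        improved = False
        for j, o_r in itertools.product(range(k), points):
            if o_r in medoids:
                continue                # consider non-representative objects only
            candidate = medoids[:j] + [o_r] + medoids[j + 1:]
            cost = total_sse(points, candidate)
            if cost < best:             # SSE decreased: o_r replaces o_j
                medoids, best, improved = candidate, cost, True
    return medoids, best

points = [1, 2, 3, 5, 9, 10, 12, 25]    # 1-D data from the earlier example
print(pam(points, 2))                   # medoids are actual data points
```

Because the cluster centres must be actual data points, the outlier 25 cannot drag a centre away from the rest of a cluster; at worst it ends up isolated in a cluster of its own.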
Hierarchical clustering
• The hierarchical clustering methods are used to group the data into a hierarchy or
tree-like structure.
• There are two main hierarchical clustering methods:
• agglomerative clustering
• divisive clustering.
• Agglomerative clustering is a bottom-up technique: it starts with each object
forming its own cluster and then iteratively merges the clusters according to their
similarity, forming larger and larger clusters.
• It terminates either when a certain clustering condition imposed by the user is
achieved or all the clusters merge into a single cluster.
Divisive clustering
• The divisive method starts with one cluster with all given objects and then splits it
iteratively to form smaller clusters.
• The divisive hierarchical clustering method uses a top-down strategy.
• The starting point is the largest cluster with all the objects in it, and then, it is split
recursively to form smaller and smaller clusters, thus forming the hierarchy.
• The end of iterations is achieved when the objects in the final clusters are sufficiently
homogeneous to each other or the final clusters contain only one object or the user-
defined clustering condition is achieved.
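A brief sketch of agglomerative clustering in practice, assuming SciPy is available (the data and the cut level of three clusters are hypothetical); SciPy's linkage implements the bottom-up strategy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((10, 2))  # hypothetical data

# Agglomerative clustering: 'ward' merges, at each step, the pair of
# clusters whose union gives the smallest increase in within-cluster variance
Z = linkage(X, method='ward')

# Cut the tree into 3 flat clusters (a user-imposed clustering condition)
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)
```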

Density-based methods - DBSCAN


• You might have noticed that with the partitioning and hierarchical clustering
methods, the resulting clusters are spherical or nearly spherical in nature. For
clusters of other shapes, such as S-shaped or unevenly shaped clusters, these two
types of methods do not provide accurate results. The density-based clustering
approach provides a solution for identifying clusters of arbitrary shapes. The
principle is to identify the dense and sparse areas within the data set and then run
the clustering algorithm on that basis. DBSCAN is one of the popular density-based
algorithms; it creates clusters from connected regions of high density.
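A minimal DBSCAN sketch, assuming scikit-learn is available (the eps and min_samples values are hypothetical and would be tuned to the data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.random((100, 2))  # hypothetical data

# eps: radius of the neighbourhood around each point; min_samples: number of
# points required within that radius for a point to lie in a dense region
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

# Points in sparse areas are labelled -1, i.e. treated as noise/outliers
print(set(labels))
```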
FINDING PATTERNS USING ASSOCIATION RULES
• Association rule learning is a methodology for identifying interesting relationships
hidden in large data sets.
• It is also known as association analysis, and the discovered relationships can be
represented in the form of association rules comprising sets of frequent items.
• A common application of this analysis is the Market Basket Analysis that retailers use
for cross-selling of their products.
• For example, every large grocery store accumulates a large volume of data about the
buying pattern of the customers.
• On the basis of the items purchased together, the retailers can push some cross-
selling either by placing the items bought together in adjacent areas or creating
some combo offer with those different product types.
• The association rule below signifies that people who have bought bread and milk
have often bought eggs too; so, for the retailer, it makes sense to place these items
together to create new opportunities for cross-selling.
• {Bread, Milk} → {Egg}
Another example
• By discovering the interesting relationship between food habit and patients
developing some kind of cancer, a new cancer prevention mechanism can be found
which will benefit thousands of people in the world.
Association rule learning algorithms
• The most popular association rule learning algorithms are:
• Apriori algorithm
• Eclat algorithm.
• Apriori is designed to function on a large database containing various types of
transactions (for example, collections of products bought by customers, or details of
websites visited by customers frequently).
• The Apriori algorithm uses a ‘bottom-up’ method, where frequent subsets are
extended one item at a time (a step known as candidate generation), and groups of
candidates are tested against the data.
• The Apriori algorithm is a powerful algorithm for mining frequent itemsets for
Boolean association rules.
• The support of a rule is the fraction of transactions in which all the items of the rule
appear together. A low support may indicate that the rule has occurred by chance.
• Also, from an application perspective, such a rule may not be a very attractive
business investment, as the items are seldom bought together by the customers.
Thus, support can provide the intelligence for identifying the most interesting rules
for analysis.
• Confidence provides a measure of the reliability of the inference made by a rule:
confidence(X → Y) = support(X ∪ Y) / support(X).
• Higher confidence of a rule X → Y denotes a higher likelihood of Y being present in
transactions that contain X, as it is an estimate of the conditional probability of Y
given X.
• Also, understand that the confidence of X → Y is not the same as the confidence of
Y → X.
• For example, if the confidence of {Bread, Milk} → {Egg} is 0.75 but the confidence of
{Egg} → {Bread, Milk} is, say, 0.6, then {Bread, Milk} → {Egg} is the stronger rule.
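As a small worked example, here are hypothetical transaction counts chosen to reproduce the figures above:

```python
# Hypothetical counts chosen to reproduce the 0.75 and 0.6 figures
n_bread_milk     = 4   # transactions containing {Bread, Milk}
n_egg            = 5   # transactions containing {Egg}
n_bread_milk_egg = 3   # transactions containing all three items

conf_bm_to_e = n_bread_milk_egg / n_bread_milk  # 3/4 = 0.75
conf_e_to_bm = n_bread_milk_egg / n_egg         # 3/5 = 0.60

print(conf_bm_to_e, conf_e_to_bm)  # confidence is not symmetric
```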
• Association rules were developed in the context of Big Data and data science and are
not used for prediction. They are used for unsupervised knowledge discovery in large
databases, unlike the classification and numeric prediction algorithms.
• Still we will find that association rule learners are closely related to and share many
features of the classification rule learners.
The apriori algorithm for association rule learning
• The first step is to decide the minimum support and minimum confidence of the
association rules.
• From a set of transactions T, let us assume that we will find all the rules that have
support ≥ minS and confidence ≥ minC, where minS and minC are the support and
confidence thresholds, respectively, for a rule to be considered acceptable.
• Now, even if we set minS = 20% and minC = 50%, it is seen that more than 80% of
the rules are discarded; this means that a large portion of the computational effort
could have been avoided if the itemsets under consideration were first pruned,
removing those that cannot generate association rules with reasonable support and
confidence.
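A minimal sketch of Apriori's level-wise frequent itemset generation in Python (the transactions and the minimum support value are hypothetical; rule generation from the frequent itemsets is omitted):

```python
def apriori(transactions, min_support):
    """Minimal Apriori sketch: level-wise frequent itemset generation."""
    n = len(transactions)
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = []
    while candidates:
        # Test candidates against the data: count transactions containing each
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.extend(survivors)
        # Candidate generation: extend surviving itemsets by one item
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == len(a) + 1}
    return frequent

transactions = [frozenset(t) for t in
                [{'bread', 'milk', 'egg'}, {'bread', 'milk'},
                 {'milk', 'egg'}, {'bread', 'egg'}, {'bread', 'milk', 'egg'}]]
print(apriori(transactions, min_support=0.4))
```

Itemsets that fail the support threshold are never extended to the next level, which avoids much of the wasted counting effort described above.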
Eclat algorithm
• ECLAT stands for ‘Equivalence Class Clustering and bottom-up Lattice Traversal’.
• The Eclat algorithm is another method for frequent itemset generation, similar to
the Apriori algorithm.
• Three traversal approaches are supported: ‘top-down’, ‘bottom-up’, and hybrid.
• Eclat represents the data in a vertical format: for each item, the list of IDs of the
transactions containing it (the TID list) is stored, and itemset supports are obtained
by intersecting TID lists, as sketched below.
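A minimal Eclat sketch in Python over a hypothetical vertical database, where the support of an itemset is the size of the intersection of its items' TID lists:

```python
def eclat(tid_lists, min_count, prefix=frozenset(), out=None):
    """Minimal Eclat sketch: depth-first search with TID-list intersection."""
    if out is None:
        out = {}
    items = sorted(tid_lists)
    for i, item in enumerate(items):
        tids = tid_lists[item]
        if len(tids) >= min_count:              # itemset is frequent
            itemset = prefix | {item}
            out[itemset] = len(tids)
            # Conditional vertical database: intersect the TID lists of the
            # remaining items with the TIDs of the current itemset
            rest = {j: tid_lists[j] & tids for j in items[i + 1:]}
            eclat(rest, min_count, itemset, out)
    return out

# Vertical format: item -> set of IDs of the transactions containing it
tid_lists = {'bread': {1, 2, 4, 5}, 'milk': {1, 2, 3, 5}, 'egg': {1, 3, 4, 5}}
print(eclat(tid_lists, min_count=2))  # all frequent itemsets with their counts
```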
