ML-Unit3 Updated
Aman Kumar
Unsupervised Learning
• In the case of unsupervised learning, not all
variables and data patterns are classified.
• Instead, the machine must uncover hidden
patterns and create labels through the use of
unsupervised learning algorithms.
• The k-means clustering algorithm is a popular
example of unsupervised learning. This simple
algorithm groups data points that are found to
possess similar features as shown in Figure ahead
Clustering
• If you group data points
based on the purchasing
behavior of SME (Small
and Medium-sized
Enterprises) and large
enterprise customers, for
example, you are likely to
see two clusters emerge.
Two clusters are formed after calculating the Euclidean distance of the remaining data points to the centroids.
k-means Clustering
The centroid coordinates for each cluster are updated to reflect the cluster’s
mean value. As one data point has switched from the right cluster to the left
cluster, the centroids of both clusters are recalculated.
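The assign-then-update loop described above can be sketched in plain Python. This is a minimal illustration, not the slide's own implementation: the sample points, the initial centroids, and the function names `assign`, `update`, and `k_means` are assumptions made for the example.

```python
import math

def assign(points, centroids):
    """For each point, return the index of its nearest centroid (Euclidean distance)."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its assigned points (keep it if it has none)."""
    new = []
    for j, old in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new.append(old)
    return new

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        labels = assign(points, centroids)
    return labels, centroids

# Hypothetical data: two visually separate groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
print(labels, centroids)
```

Each pass recalculates every centroid as the mean of its members, so a point that switches clusters changes both the cluster it left and the one it joined.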
K-means vs. DBScan (rows 5 and 7 of the comparison):
5. K-means: In the domain of anomaly detection, this algorithm causes problems as anomalous points will be assigned to the same cluster as “normal” data points.
   DBScan: DBScan algorithm, on the other hand, locates regions of high density that are separated from one another by regions of low density.
7. K-means: Varying densities of the data points doesn’t affect K-means clustering algorithm.
   DBScan: DBScan clustering does not work very well for sparse datasets or for data points with varying density.
Distribution-Based Clustering
So far we learned about clustering based on similarity/distance
or density. This family of clustering algorithms takes a totally
different metric into consideration: probability.
Distribution-Based Clustering is a clustering model in which we
will fit the data on the probability of it belonging to the same
distribution.
This clustering approach assumes data is composed of
distributions, such as Gaussian (normal) or binomial. The Gaussian
model is the prominent choice when we have a fixed number of
distributions: all incoming data is fitted to them such that
the likelihood of the data is maximized.
• As seen below, data is modeled into 3 Gaussian distributions and as the
distance from the distribution’s center increases, the probability that a
point belongs to the distribution decreases.
• The bands show a decrease in probability. Distribution models of
clustering are the ones most closely related to statistics: they mirror the
way datasets are generated by random sampling, i.e., by drawing data points
from a distribution.
• Clusters can then easily be defined as objects that are most likely to belong
to the same distribution.
• The expectation-maximization algorithm is one of the popular examples of
distribution-based clustering.
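As a rough sketch of the idea, the code below computes the probability that a one-dimensional point belongs to each of three Gaussian components. This corresponds only to the expectation (E) step of expectation-maximization; the component weights, means, and standard deviations are hypothetical values chosen for illustration, and a full EM implementation would also re-estimate them.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Return P(component | x) for each (weight, mean, std) component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical mixture of three equally weighted Gaussians.
components = [(1 / 3, 0.0, 1.0), (1 / 3, 5.0, 1.0), (1 / 3, 10.0, 1.0)]

probs = responsibilities(1.0, components)
cluster = probs.index(max(probs))  # assign the point to its most likely distribution
print(probs, cluster)
```

The density falls as a point moves away from a component's center, which is why the point at 1.0 is attributed almost entirely to the component centered at 0.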
Hierarchical Clustering
It creates a tree of clusters. Hierarchical clustering, not
surprisingly, is well suited to hierarchical data, such as
taxonomies.
Another advantage is that any number of clusters can be
chosen by cutting the tree at the right level.
Refer Link:
file:///C:/Users/dell/OneDrive/CU/AI/ML-IVSem/HierarchicalClustering.html
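A minimal sketch of bottom-up (agglomerative) hierarchical clustering follows. It assumes single linkage (the distance between two clusters is the distance between their closest points), which is one of several common choices, and the sample points are hypothetical. Stopping the merging at `n_clusters` plays the role of cutting the tree at a chosen level.

```python
import math

def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Returns the final clusters (as lists of point indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical points: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three, _ = agglomerative(points, 3)
two, history = agglomerative(points, 2)
print(three, two)
```

The merge history is the tree: replaying fewer or more of its entries yields a coarser or finer clustering without rerunning the algorithm.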
Performance Metrics for Clustering
Silhouette Coefficient: The Silhouette Coefficient, or
silhouette score, is a metric used to measure the
goodness of a clustering technique. Its value ranges
from -1 to 1.
1: Means clusters are well apart from each other and
clearly distinguished.
0: Means clusters are indifferent, that is, the distance
between clusters is not significant.
-1: Means clusters are assigned in the wrong way.
Silhouette Coefficient: s = (b - a) / max(a, b)
a = average intra-cluster distance, i.e. the average distance from a point to the other points in its own cluster.
b = average nearest-cluster distance, i.e. the average distance from a point to all points in the nearest neighboring cluster.
The score of a clustering is the mean of s over all points.
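To make the roles of a and b concrete, here is a small plain-Python sketch that computes the mean silhouette score from the usual definition s = (b - a) / max(a, b). The function name `mean_silhouette` and the four sample points are assumptions for illustration; it expects at least two clusters and gives a single-point cluster a score of 0, as is conventional.

```python
import math

def mean_silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; requires at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two pairs of points far apart from each other.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # pairs grouped correctly
bad = mean_silhouette(points, [0, 1, 0, 1])   # pairs split across clusters
print(good, bad)
```

Grouping the nearby points together gives a score close to 1, while splitting them across clusters drives the score negative.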
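import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# The next line is a Jupyter/IPython magic; remove it in a plain Python script
%matplotlib inline

# Two well-separated groups: 50 points in [0, 1) x [0, 1) and 50 in [2, 3) x [2, 3)
X = np.random.rand(50, 2)
Y = 2 + np.random.rand(50, 2)
Z = np.concatenate((X, Y))
Z = pd.DataFrame(Z)  # converting into data frame for ease
sns.scatterplot(x=Z[0], y=Z[1])

# Cluster with k=2 and score the result
KMean = KMeans(n_clusters=2)
KMean.fit(Z)
label = KMean.predict(Z)
print(f'Silhouette Score(n=2): {silhouette_score(Z, label)}')

# Cluster with k=3 and score the result for comparison
KMean = KMeans(n_clusters=3)
KMean.fit(Z)
label = KMean.predict(Z)
print(f'Silhouette Score(n=3): {silhouette_score(Z, label)}')
sns.scatterplot(x=Z[0], y=Z[1], hue=label, palette='inferno_r')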
As the figure shows, with n=3 the clusters are not well separated: the inter-cluster distance
between cluster 1 and cluster 2 is almost negligible. That is why the silhouette score for
n=3 (0.596) is lower than that for n=2 (0.806).
Market Basket Analysis
• Market Basket Analysis is a technique used to
uncover purchase patterns in any retail setting.
• It is a careful study of the purchases made by
customers in a supermarket. It identifies the
items that customers frequently purchase
together. This analysis can help companies
promote deals, offers, and sales, and data
mining techniques help to carry it out.
•Lift - a measure that tells us whether the probability of an event B increases or decreases
given event A.
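The sketch below computes support, confidence, and lift for one rule, using the common definitions support(A) = fraction of transactions containing A, confidence(A -> B) = support(A ∪ B) / support(A), and lift(A -> B) = confidence(A -> B) / support(B). The five baskets are hypothetical and exist only to make the arithmetic visible.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(a, b, transactions):
    """Estimated probability of B given A."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Greater than 1: A raises the probability of B. Less than 1: A lowers it."""
    return confidence(a, b, transactions) / support(b, transactions)

# Hypothetical baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

conf = confidence({"bread"}, {"milk"}, baskets)  # about 0.75
rule_lift = lift({"bread"}, {"milk"}, baskets)   # about 0.94, slightly below 1
print(conf, rule_lift)
```

Here milk appears in 80% of all baskets but in only 75% of the baskets with bread, so the lift is just under 1: buying bread slightly lowers the probability of buying milk in this toy data.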
Apriori Property –
All non-empty subsets of a frequent itemset must be frequent.
Step-1: K=1
(I) Create a table containing the support count of each item present in the
dataset – called C1 (candidate set).
(II) Compare each candidate item's support count with the minimum support count (here
min_support=2); if the support_count of a candidate item is less than min_support,
remove that item. This gives us itemset L1.
Step-2: K=2
(I) Generate candidate set C2 using L1 (this is called the join step).
Check whether all subsets of each itemset are frequent, and if not, remove that
itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
Now find the support count of these itemsets by searching the dataset.
(II) Compare each candidate's (C2) support count with the minimum support count and
remove the itemsets that fall below it. This gives us itemset L2.
Step-3: K=3
(I) Generate candidate set C3 using L2 (join step), check that all subsets of each
itemset are frequent, and find the support count of the remaining itemsets.
(II) Compare each candidate's (C3) support count with the minimum support count (here
min_support=2); if the support_count of a candidate itemset is less than min_support,
remove that itemset. This gives us itemset L3.
Step-4: K=4
Generate candidate set C4 using L3.
Check whether all subsets of these itemsets are frequent. (Here the itemset formed by
joining L3 is {I1, I2, I3, I5}; its subsets include {I1, I3, I5}, which is not frequent.) So
there is no itemset in C4.
We stop here because no further frequent itemsets are found.
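The join, prune, and count steps above can be sketched as a short level-wise loop. The nine-transaction dataset below is an assumption: the slides do not reproduce the transaction table, so these transactions were chosen to agree with the support counts quoted in the rule-generation example (for instance sup(I1)=6, sup(I2)=7, sup(I1^I2^I3)=2). The function name `apriori` is likewise illustrative.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Step 1: C1 -> L1
    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: keep candidates that reach the minimum support count
        level = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent

# Assumed dataset, consistent with the support counts quoted in the slides.
dataset = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(dataset, min_support=2)
largest = max(len(s) for s in frequent)
print(largest, frequent[frozenset({"I1", "I2", "I3"})])
```

On this data the loop stops after K=3: the only 4-item candidate, {I1, I2, I3, I5}, is pruned because its subset {I1, I3, I5} is not frequent.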
Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that
we need to calculate the confidence of each rule.
Confidence –
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
Taking the frequent itemset {I1, I2, I3} as an example, we will show the rule generation.
So the rules can be
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=29%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if the minimum confidence is 50%, the first 3 rules can be considered strong association rules.
A confidence of 50% for the first rule means that 50% of the customers who purchased item1 and item2 also bought item3.
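The rule generation above can be reproduced directly from the support counts quoted in the rules. In this sketch the dictionary `support_count` holds exactly those counts; the function name `strong_rules` is an illustrative choice.

```python
from itertools import combinations

# Support counts taken from the confidence calculations above.
support_count = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4,
    frozenset({"I1", "I2", "I3"}): 2,
}

def strong_rules(itemset, support_count, min_confidence):
    """Return (antecedent, consequent, confidence) for every rule meeting min_confidence."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            a = frozenset(antecedent)
            # Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
            conf = support_count[itemset] / support_count[a]
            if conf >= min_confidence:
                rules.append((sorted(a), sorted(itemset - a), conf))
    return rules

rules = strong_rules({"I1", "I2", "I3"}, support_count, min_confidence=0.5)
for antecedent, consequent, conf in rules:
    print(antecedent, "=>", consequent, f"{conf:.0%}")
```

Only the three rules with a two-item antecedent reach the 50% threshold, which matches the conclusion drawn above.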
Apriori has to build candidate sets at every step and repeatedly scan the database to count
their support. These two properties inevitably make the algorithm slower. To overcome these redundant
steps, a new association-rule mining algorithm was developed, named the Frequent Pattern
Growth Algorithm. It overcomes the disadvantages of the Apriori algorithm by storing all
the transactions in a Trie Data Structure.
Market Basket Analysis - FP Growth Algorithm
This data is a hypothetical dataset of transactions with each letter representing an item.
Transaction ID   Items
T1               {E,K,M,N,O,Y}
T2               {D,E,K,N,O,Y}
T3               {A,E,K,M}
T4               {C,K,M,U,Y}
T5               {C,E,I,K,O,O}
The frequency of each individual item is computed:
Item   Frequency
A      1
C      2
D      1
E      4
I      1
K      5
M      3
N      2
O      3
U      1
Y      3
Let the minimum support be 3. A Frequent Pattern set is built which will contain all the
elements whose frequency is greater than or equal to the minimum support. These
elements are stored in descending order of their respective frequencies. After insertion
of the relevant items, the set L looks like this:
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Now, for each transaction, the respective Ordered-Item set is built. It is done by
iterating the Frequent Pattern set and checking if the current item is contained in the
transaction in question. If the current item is contained, the item is inserted in the
Ordered-Item set for the current transaction. The following table is built for all the
transactions:
Transaction ID   Items             Ordered-Item Set
T1               {E,K,M,N,O,Y}     {K,E,M,O,Y}
T2               {D,E,K,N,O,Y}     {K,E,O,Y}
T3               {A,E,K,M}         {K,E,M}
T4               {C,K,M,U,Y}       {K,M,Y}
T5               {C,E,I,K,O,O}     {K,E,O}
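The frequency count and the ordered-item sets can be sketched as follows. The transactions are the five listed above; the sketch assumes the usual definition of support, in which an item is counted at most once per transaction, so the repeated O in T5 contributes a single occurrence and O appears in three transactions. Ties in frequency are broken alphabetically here, which is an assumption that happens to reproduce the order K, E, M, O, Y.

```python
from collections import Counter

transactions = {
    "T1": ["E", "K", "M", "N", "O", "Y"],
    "T2": ["D", "E", "K", "N", "O", "Y"],
    "T3": ["A", "E", "K", "M"],
    "T4": ["C", "K", "M", "U", "Y"],
    "T5": ["C", "E", "I", "K", "O", "O"],
}
min_support = 3

# Count each item once per transaction (set() removes the duplicate O in T5).
counts = Counter(item for items in transactions.values() for item in set(items))

# Frequent items in descending order of frequency; ties broken alphabetically.
order = sorted((i for i, n in counts.items() if n >= min_support),
               key=lambda i: (-counts[i], i))

# Ordered-item set: the frequent items of each transaction, in the order of L.
ordered = {tid: [i for i in order if i in items]
           for tid, items in transactions.items()}

print(order)
for tid, items in ordered.items():
    print(tid, items)
```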
Now, all the Ordered-Item sets are inserted into a trie Data Structure.
a) Inserting the set {K, E, M, O, Y}:
Here, all the items are simply linked one after the other in the order of occurrence in
the set and initialize the support count for each item as 1.
b) Inserting the set {K, E, O, Y}: Till the insertion of the elements K and E, simply the
support count is increased by 1. On inserting O we can see that there is no direct link
between E and O, therefore a new node for the item O is initialized with the support
count as 1 and item E is linked to this new node. On inserting Y, we first initialize a new
node for the item Y with support count as 1 and link the new node of O with the new
node of Y.
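c) Inserting the set {K, E, M}:
Here the path K, E, M already exists, so the support count of each of these elements is
simply increased by 1.
d) Inserting the set {K, M, Y}:
Similar to step b), first the support count of K is increased, then new nodes for M and Y
are initialized and linked accordingly.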
e) Inserting the set {K, E, O}:
Here simply the support counts of the respective elements are increased. Note that the
support count of the new node of item O is increased.
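The insertions described above amount to building a trie in which each node stores a support count. The following is a minimal sketch; the class name `Node` and the function `build_tree` are illustrative, and the ordered-item sets are the ones from the table above.

```python
class Node:
    """One node of the frequent-pattern tree."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_tree(ordered_sets):
    """Insert each ordered-item set along a path from the root, counting visits."""
    root = Node(None)
    for items in ordered_sets:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                # No link to this item yet: initialize a new node
                child = Node(item, node)
                node.children[item] = child
            child.count += 1  # count is 1 for a new node, incremented otherwise
            node = child
    return root

ordered_sets = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]
root = build_tree(ordered_sets)
k = root.children["K"]
e = k.children["E"]
print(k.count, e.count, e.children["M"].count, e.children["O"].count)
```

After all five insertions, K has a count of 5 and E a count of 4, and the tree contains two separate O nodes: one below M (from the first set) and one directly below E (from the second and fifth sets).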
Now, for each item, the Conditional Pattern Base is computed. It consists of the path labels of all
the paths which lead to any node of the given item in the frequent-pattern tree. Note that
the items in the table below are arranged in ascending order of their frequencies.
Now for each item, the Conditional Frequent Pattern Tree is built. It is done by taking the
set of elements that is common in all the paths in the Conditional Pattern Base of that item
and calculating its support count by summing the support counts of all the paths in the
Conditional Pattern Base.
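A compact way to see both steps is to compute them straight from the ordered-item sets: the prefix that precedes an item in each set is exactly the path leading to that item's node in the tree. The sketch below follows the simplified rule stated above (keep the items common to all paths and sum the path counts); a general FP-Growth implementation would instead build a full conditional tree and filter its items by minimum support. The function names are illustrative.

```python
from collections import Counter

ordered_sets = [
    ["K", "E", "M", "O", "Y"],
    ["K", "E", "O", "Y"],
    ["K", "E", "M"],
    ["K", "M", "Y"],
    ["K", "E", "O"],
]

def conditional_pattern_base(item, ordered_sets):
    """Map each prefix path leading to `item` to the number of times it occurs."""
    base = Counter()
    for items in ordered_sets:
        if item in items:
            prefix = tuple(items[:items.index(item)])
            if prefix:
                base[prefix] += 1
    return base

def conditional_tree(base):
    """Items common to every prefix path, with the summed support of the paths."""
    if not base:
        return set(), 0
    common = set.intersection(*(set(path) for path in base))
    return common, sum(base.values())

base_y = conditional_pattern_base("Y", ordered_sets)
tree_y = conditional_tree(base_y)
tree_o = conditional_tree(conditional_pattern_base("O", ordered_sets))
print(dict(base_y), tree_y, tree_o)
```

For Y the three prefix paths share only K, giving {K : 3}; for O the paths share K and E, giving {K, E : 3}.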
From the Conditional Frequent Pattern Tree, the Frequent Pattern rules are generated by
pairing the items of the Conditional Frequent Pattern Tree set with the corresponding
item, as given in the table below.
For each row, two types of association rules can be inferred. For example, for the first row,
which contains the itemset {K, Y}, the rules K -> Y and Y -> K can be inferred. To determine
the valid rule, the confidence of both rules is calculated and the one with
confidence greater than or equal to the minimum confidence value is retained.
• Descriptive market basket analysis: This type only derives insights from past data and is the
most frequently used approach. The analysis here does not make any predictions but rates
the association between products using statistical techniques. For those familiar with the
basics of Data Analysis
• Predictive market basket analysis: This type uses supervised learning models like
classification and regression. It essentially aims to mimic the market to analyze what causes
what to happen. Essentially, it considers items purchased in a sequence to determine cross-
selling. For example, buying an extended warranty is more likely to follow the purchase of an
iPhone. While it isn't as widely used as a descriptive MBA, it is still a very valuable tool for
marketers
• Differential market basket analysis: This type of analysis is beneficial for competitor analysis.
It compares purchase history between stores, between seasons, between two time periods,
between different days of the week, etc., to find interesting patterns in consumer behavior.
For example, it can help determine why some users prefer to purchase the same product at
the same price on Amazon vs Flipkart. The answer can be that the Amazon reseller has more
warehouses and can deliver faster, or maybe something more profound like user experience.
Market Basket Analysis- Benefits
Reinforcement Learning
• Reinforcement learning (RL) is a machine learning (ML) technique that
trains software to make decisions to achieve the most optimal results.
• It mimics the trial-and-error learning process that humans use to achieve
their goals. Software actions that work towards your goal are reinforced,
while actions that detract from the goal are ignored.
• RL algorithms use a reward-and-punishment paradigm as they process
data. They learn from the feedback of each action and self-discover the
best processing paths to achieve final outcomes.
• The algorithms are also capable of delayed gratification. The best overall
strategy may require short-term sacrifices, so the best approach they
discover may include some punishments or backtracking along the way. RL
is a powerful method to help artificial intelligence (AI) systems achieve
optimal outcomes in unseen environments.
• Reinforcement learning continuously improves its model by leveraging
feedback from previous iterations. This differs from supervised and
unsupervised learning, which both stop updating once a model has been
formulated from the training and test data segments.
Key concepts
• In reinforcement learning, there are a few key concepts to
familiarize yourself with:
• The agent is the ML algorithm (or the autonomous system)
• The environment is the adaptive problem space with attributes
such as variables, boundary values, rules, and valid actions
• The action is a step that the RL agent takes to navigate the
environment
• The state is the environment at a given point in time
• The reward is the positive, negative, or zero value—in other words,
the reward or punishment—for taking an action
• The cumulative reward is the sum of all rewards or the end value
Algorithm basics
• Reinforcement learning is based on the Markov decision
process, a mathematical modeling of decision-making that
uses discrete time steps. At every step, the agent takes a
new action that results in a new environment state.
Similarly, the current state is attributed to the sequence of
previous actions.
• Through trial and error in moving through the environment,
the agent builds a set of if-then rules or policies. The
policies help it decide which action to take next for optimal
cumulative reward. The agent must also choose between
further environment exploration to learn new state-action
rewards or select known high-reward actions from a given
state. This is called the exploration-exploitation trade-off.
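One common way to handle the exploration-exploitation trade-off is an epsilon-greedy rule: with a small probability epsilon the agent tries a random action, and otherwise it picks the action with the highest known value. This is a sketch of that one strategy, not the only option; the function name and the sample action values are hypothetical.

```python
import random

def epsilon_greedy(action_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))  # exploration: any action
    # exploitation: index of the highest estimated value
    return max(range(len(action_values)), key=lambda a: action_values[a])

# Hypothetical value estimates for three actions in the current state.
values = [0.1, 0.9, 0.3]
greedy_choice = epsilon_greedy(values, epsilon=0.0)   # never explores
random_choice = epsilon_greedy(values, epsilon=1.0)   # always explores
print(greedy_choice, random_choice)
```

Setting epsilon to 0 gives a purely exploiting agent, and setting it to 1 gives a purely exploring one; values in between mix the two behaviors.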
Difference w.r.t Unsupervised Learning: RL has a predetermined end goal. While it takes an
exploratory approach, the explorations are continuously validated and improved to increase
the probability of reaching the end goal. It can teach itself to reach very specific outcomes.
• A specific algorithmic example of reinforcement learning is Q-learning. In Q-
learning, you start with a set environment of states, represented by the symbol ‘S’.
In the game Pac-Man, states could be the challenges, obstacles or pathways that
exist in the game. There may exist a wall to the left, a ghost to the right, and a
power pill above—each representing different states.
• The set of possible actions to respond to these states is referred to as “A.” In the
case of Pac-Man, actions are limited to left, right, up, and down movements, as
well as multiple combinations thereof.
• The third important symbol is “Q.” Q is the starting value and has an initial value of
“0.”
• As Pac-Man explores the space inside the game, two main things will happen:
• Q drops as negative things occur after a given state/action
• Q increases as positive things occur after a given state/action
• In Q-learning, the machine will learn to match the action for a given state that
generates or maintains the highest level of Q. It will learn initially through the
process of random movements (actions) under different conditions (states). The
machine will record its results (rewards and penalties) and how they impact its Q
level and store those values to inform and optimize its future actions
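A minimal tabular Q-learning sketch is shown below. Instead of Pac-Man it uses a hypothetical five-cell corridor (states 0 to 4, reward +1 for reaching cell 4), and the learning rate `alpha`, discount factor `gamma`, and episode count are assumed values. Actions are chosen at random during training, mirroring the random movements described above, and each step applies the standard Q-learning update Q(s, a) += alpha * (reward + gamma * max Q(s', ·) - Q(s, a)).

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Hypothetical corridor environment: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if nxt == GOAL else 0.0
    return nxt, reward, nxt == GOAL

def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Learn a Q-table from random movements; every Q value starts at 0."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # random exploration
            nxt, reward, done = step(state, action)
            # The target is the reward plus the discounted best value of the next state
            target = reward + (0.0 if done else gamma * max(Q[nxt]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = nxt
    return Q

Q = train()
# Greedy policy: in each non-goal state, take the action with the highest Q
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
print(policy)
```

After training, the action with the highest Q in every non-goal cell is "move right", because Q values grow for state/action pairs that lead toward the reward.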
Assignment 3
• Use the k-means clustering algorithm and Euclidean distance to cluster the following eight
examples into three clusters: A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4),
A7=(1,2), A8=(4,9). Find the new centroid at every new point entry into the cluster group.
Assume initial cluster centers as A1, A4, A7
• Explain Reinforcement Learning in detail, discuss its types along with various elements.
Outline on partially observable state.