UNIT – 3 Unsupervised Learning
Syllabus: Association Analysis:
3.1 Basic concepts
3.2 Frequent Itemsets
3.3 The Apriori Algorithm
3.4 FP Growth Algorithm
3.5 Association Rules
3.6 Mining various kinds of Association Rules
3.7 From Association mining to Correlation Analysis
3.8 Constraint-based Association mining
Unsupervised Learning
the machine is trained on unlabeled data
Only Inputs
learns on its own without any supervision
the model itself finds the hidden patterns and insights in the
given data
No specific output
“Unsupervised learning is a type of machine learning in which
models are trained using unlabeled dataset and are allowed to
act on that data without any supervision.”
The goal of unsupervised learning is to find the underlying
structure of the dataset and group the data according to
similarities, patterns, and differences.
Working of Unsupervised learning models:
We feed the model data with no categories or
outputs for training
Model interprets raw data to identify hidden patterns
Depending on data, we use suitable algorithms
Algorithm groups data
Clustering:
A data mining technique that groups unlabeled data based on their
similarities or differences.
It is a method of grouping objects into clusters such that objects
with the most similarities remain in the same group and have few or no
similarities with the objects of other groups.
Association:
An association rule is a rule-based method for finding relationships
between variables in a given dataset. These methods are frequently
used for market basket analysis, allowing companies to better understand
relationships between different products.
It determines the sets of items that occur together in the dataset.
Association Analysis
Association is a data mining technique that discovers the
probability of the co-occurrence of items in a collection.
Association analysis is the task of finding interesting
relationships among large sets of data items.
These interesting relationships can take two forms: frequent
item sets or association rules.
Frequent item sets are a collection of items that frequently
occur together.
The relationships between co-occurring items are expressed
as Association Rules.
3.1 & 3.2 Basic concepts
Frequent Itemsets
Frequent patterns are patterns (e.g., itemsets, subsequences)
that appear frequently in a data set.
For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent
itemset.
A subsequence, such as buying first a PC, then a digital camera,
and then a memory card, if it occurs frequently in a shopping
history database, is a (frequent) sequential pattern.
Finding frequent patterns plays an essential role in mining
associations, correlations, and many other interesting
relationships among data.
Frequent Itemsets
Itemset – A collection of one or more items
k-itemset -> An itemset that contains k items
A frequent item set is a set of items that occur together
frequently in a dataset.
Support count:
The frequency of an item set is measured by the support count,
which is the number of transactions or records in the dataset that
contain the item set.
Support:
Fraction of the transactions that contain an itemset
Frequent Itemset:
An itemset whose support is greater than or equal to a
prespecified minimum support threshold (min_sup)
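To make these definitions concrete, here is a minimal Python sketch (the transaction list and item names are hypothetical, used only for illustration) that computes the support count and support of an itemset:

# Minimal sketch: support count and support of an itemset.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread", "eggs"},
]

def support_count(itemset, transactions):
    # Number of transactions that contain every item of the itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    # Fraction of transactions that contain the itemset.
    return support_count(itemset, transactions) / len(transactions)

itemset = {"milk", "bread"}
print(support_count(itemset, transactions))   # 3
print(support(itemset, transactions))         # 0.75
# With min_sup = 0.5, {milk, bread} is a frequent itemset (0.75 >= 0.5).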
CLOSED & MAXIMAL Frequent Itemsets
Closed and maximal frequent itemsets are subsets of frequent
itemsets
An itemset X is closed in a data set D if there exists no proper
super-itemset Y such that Y has the same support count as X in
D.
An itemset X is a closed frequent itemset in set D if X is both
closed and frequent in D.
An itemset X is a maximal frequent itemset (or max-itemset)
in a data set D if X is frequent, and there exists no super-itemset
Y such that X ⊂Y and Y is frequent in D.
An itemset is maximal frequent if none of its immediate
supersets is frequent.
An itemset is closed if none of its immediate supersets has the
same support as the itemset .
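As an illustration (the itemsets and support counts below are hypothetical, not from the text), the following sketch checks whether each frequent itemset is closed and/or maximal, given all frequent itemsets with their support counts:

# Sketch: classifying frequent itemsets as closed and/or maximal.
# freq holds ALL frequent itemsets found at some min_sup, with support counts.
freq = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 5,
    frozenset({"a", "b"}): 4,   # same support as {a}, so {a} is NOT closed
}

def is_closed(x, freq):
    # Closed: no proper superset has the same support count.
    return not any(x < y and freq[y] == freq[x] for y in freq)

def is_maximal(x, freq):
    # Maximal: no proper superset is frequent at all.
    return not any(x < y for y in freq)

for x in freq:
    print(set(x), "closed:", is_closed(x, freq), "maximal:", is_maximal(x, freq))
# {a}     closed: False  maximal: False
# {b}     closed: True   maximal: False
# {a, b}  closed: True   maximal: True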
Frequent Itemset Mining
Frequent itemset mining leads to the discovery of associations
and correlations among items in large transactional or
relational data sets.
The discovery of interesting correlation relationships among
huge amounts of business transaction records can help in many
business decision-making processes such as
catalog design/store layout,
cross-marketing, and
customer shopping behavior analysis.
It allows retailers to identify relationships between the items
that people buy together frequently.
A typical example is a Market Basket Analysis.
Market Basket Analysis
This process analyzes customer buying habits by finding
associations between the different items that customers
place in their “shopping baskets”
Can find - which items are frequently purchased together by
customers.
For instance, if customers are buying milk, how likely are they
to also buy bread (and what kind of bread) on the same trip to
the supermarket?
Example:
As the manager of an AllElectronics branch, you would like to learn more
about the buying habits of your customers
For example, the information that customers who purchase computers
also tend to buy antivirus software at the same time is represented in the
following association rule.
computer ⇒ antivirus software [support = 2%,confidence = 60%]
A support of 2% means that 2% of all the transactions under analysis show
that computer and antivirus software are purchased together.
A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software.
Typically, association rules are considered interesting
(STRONG) if they satisfy both a minimum support threshold
and a minimum confidence threshold.
These thresholds can be set by users or domain experts.
3.5 Association rules & Association
rule mining
Association rule learning/mining is a rule-based machine
learning method for discovering interesting relations
between variables in large databases.
The goal of association rule mining is to identify relationships
between items in a dataset that occur frequently together.
It is intended to identify strong rules discovered in databases
using some measures(support, confidence) of interestingness.
Let I = {I1, I2, ..., Im} be the set of all items.
Let D be a set of database transactions
where each transaction T is a nonempty itemset such that T ⊆ I.
Each transaction is associated with an identifier, called a TID.
Let A and B be sets of items.
An association rule is an implication of the form
A⇒B
where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
The rule A ⇒ B holds in the transaction set D with support S
and confidence C
Support (S) :
the percentage of transactions in D that contain A ∪ B (i.e., the
union of sets A and B, that is, both A and B).
support(A⇒B) =P(A∪B)
Confidence(C):
Percentage of transactions in D containing A that also contain B.
This is taken to be the conditional probability, P(B|A)
confidence(A⇒B) = P(B|A)
= support(A∪B) / support(A)
= support_count(A ∪ B) / support_count(A)
support(X) = Freq(X) / N, where
N -> number of transactions
Freq(X) -> support_count or frequency of X in the data set
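A short sketch of how the support and confidence of a rule A ⇒ B could be computed over a transaction set (the transactions here are hypothetical):

# Sketch: support and confidence of a rule A => B.
transactions = [
    {"computer", "antivirus"},
    {"computer"},
    {"computer", "antivirus", "printer"},
    {"printer"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def rule_support(A, B, transactions):
    # support(A => B) = P(A u B)
    return support_count(A | B, transactions) / len(transactions)

def rule_confidence(A, B, transactions):
    # confidence(A => B) = P(B|A) = support_count(A u B) / support_count(A)
    return support_count(A | B, transactions) / support_count(A, transactions)

A, B = {"computer"}, {"antivirus"}
print(rule_support(A, B, transactions))     # 0.5
print(rule_confidence(A, B, transactions))  # 0.666...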
In general, association rule mining can be viewed as a
two-step process:
1. Find all frequent itemsets: By definition, each of
these itemsets will occur at least as frequently as a
predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent
itemsets: By definition, these rules must satisfy
minimum support and minimum confidence
3.3 Apriori Algorithm
Used for:
Finding Frequent Itemsets by Confined Candidate
Generation.
Mining frequent itemsets for Boolean association rules
Apriori employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k + 1)-
itemsets.
To improve the efficiency of the level-wise generation of
frequent itemsets, an important property called the Apriori
property is used to reduce the search space.
Apriori property: “All nonempty subsets of a
frequent itemset must also be frequent.” (If an itemset
is infrequent, all its supersets will be infrequent)
If an itemset I does not satisfy the minimum support
threshold min_sup, then I is not frequent, that is,
P(I) < min_sup.
If an item A is added to the itemset I, then the
resulting itemset (i.e., I ∪ A) cannot occur more
frequently than I. Therefore, I ∪ A is not frequent
either, that is, P(I ∪ A) < min_sup.
This property belongs to a special category of properties
called antimonotonicity in the sense that if a set cannot
pass a test, all of its supersets will fail the same test as
well.
Procedure:
First, the set of candidate 1-itemsets C1 is found by
scanning the database to accumulate the count for
each item;
collecting those items that satisfy minimum support
gives the set of frequent 1-itemsets, denoted L1.
Next, L1 is used to find L2, the set of frequent 2-
itemsets, which is used to find L3, and so on, until no
more frequent k-itemsets can be found.
The finding of each Lk requires one full scan of the
database.
The steps followed in the Apriori Algorithm of data mining are:
1. Join Step:
This step generates candidate k-itemsets from the frequent
(k−1)-itemsets.
To find Lk , a set of candidate K-itemsets is generated by
joining Lk−1 with itself. This set of candidates is denoted Ck .
2. Prune Step:
A database scan to determine the count of each candidate in Ck would
result in the determination of Lk
If the candidate item does not meet minimum support, then it is
regarded as infrequent and thus it is removed.
Lk -> all candidates having a count no less than the minimum support
count
This step is performed to reduce the size of the candidate itemsets.
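A minimal, self-contained Apriori sketch in Python (the transactions and helper names are illustrative, not from the text), showing the level-wise join, the Apriori-property prune, and the support-based prune:

from itertools import combinations

def apriori(transactions, min_sup_count):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets.
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup_count}

    # L1: frequent 1-itemsets.
    L = count({frozenset([i]) for t in transactions for i in t})
    frequent, k = dict(L), 2
    while L:
        # Join step: merge frequent (k-1)-itemsets to form candidate k-itemsets Ck.
        Ck = {a | b for a, b in combinations(L, 2) if len(a | b) == k}
        # Prune step 1 (Apriori property): drop candidates with an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Prune step 2: scan the database and keep candidates meeting min_sup_count.
        L = count(Ck)
        frequent.update(L)
        k += 1
    return frequent

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer", "eggs"},
                {"milk", "diapers", "beer", "cola"},
                {"bread", "milk", "diapers", "beer"},
                {"bread", "milk", "diapers", "cola"}]
print(apriori(transactions, min_sup_count=3))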
Example: Consider the following transactional data
for AllElectronics
Generation of the candidate itemsets (Ck) and frequent itemsets
(Lk), where the minimum support count is 2
Frequent Itemsets are: {I1,I2,I3}, {I1,I2,I5}
Generating Association Rules from Frequent Itemsets:
Strong association rules satisfy both minimum support and minimum
confidence
Confidence(A ⇒ B) = P(B|A)
= support_count(A ∪ B) / support_count(A)
Support_count(A ∪ B) is the number of transactions containing the
itemsets A ∪ B, and
Support_count(A) is the number of transactions containing the itemset A
Based on this equation, association rules can be generated as
follows:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if
support count(l) / support count(s) ≥ min_conf, where min_conf
is the minimum confidence threshold.
Because the rules are generated from frequent itemsets, each one
automatically satisfies the minimum support.
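A short sketch of this rule-generation step; the support-count dictionary is assumed to come from a frequent-itemset miner such as the Apriori sketch above, and the counts shown are illustrative:

from itertools import combinations

def generate_rules(freq_counts, min_conf):
    # freq_counts: {frozenset: support_count} of ALL frequent itemsets.
    rules = []
    for l, l_count in freq_counts.items():
        if len(l) < 2:
            continue
        # Every nonempty proper subset s of l gives a candidate rule s => (l - s).
        for size in range(1, len(l)):
            for s in map(frozenset, combinations(l, size)):
                conf = l_count / freq_counts[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

freq_counts = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}
for a, b, conf in generate_rules(freq_counts, min_conf=0.7):
    print(a, "=>", b, f"confidence = {conf:.0%}")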
Example: AllElectronics
Consider the frequent itemset X = {I1, I2, I5}
The nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1},
{I2}, and {I5}.
The resulting association rules are as shown below, each
listed with its confidence:
{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then
only the second, third, and last rules are output as strong
association rules.
Drawbacks of Apriori:
In many cases the Apriori candidate generate-and-test
method significantly reduces the size of candidate sets,
leading to good performance gain.
However, it can suffer from two nontrivial costs:
1. It may still need to generate a huge number of candidate
sets. For example, if there are 10^4 frequent 1-itemsets, the
Apriori algorithm will need to generate more than 10^7
candidate 2-itemsets.
2. It may need to repeatedly scan the whole database and
check a large set of candidates by pattern matching. It is
costly to go over each transaction in the database to
determine the support of the candidate itemsets.
3.4 FP Growth Algorithm
Frequent pattern growth, or simply FP-growth, adopts a
divide-and-conquer strategy.
Used for finding frequent itemsets without candidate
generation resulting in greater efficiency
It constructs a highly compact data structure (an FP-tree) to
compress the original transaction database.
Procedure:
1. The first scan of the database is the same as Apriori, which derives
the set of frequent items (1-itemsets) and their support counts
(frequencies).
2. The set of frequent items is sorted in the order of descending
support count. This resulting Frequent Pattern set or list is
denoted by L
3. Construct Ordered Itemset based on L
4. An FP-tree is then constructed
5. Start from each frequent length-1 pattern (as an initial suffix
pattern), construct its conditional pattern base (a “sub-
database,” which consists of the set of prefix paths in the FP-tree
co-occurring with the suffix pattern)
6. Then construct its conditional FP-tree, and perform mining
recursively on the tree.
7. From the Conditional FP tree, the Frequent Pattern rules are
generated
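The FP-tree itself is a prefix tree (trie) of ordered transactions with a count at each node and node-links per item; below is a minimal construction sketch (class and field names are my own, not a standard API):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}            # item -> FPNode

def build_fp_tree(transactions, min_sup_count):
    # Pass 1: frequent 1-itemsets and their support counts.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    freq = {i: c for i, c in counts.items() if c >= min_sup_count}

    root = FPNode(None, None)
    header = {i: [] for i in freq}    # header table: item -> node-links
    # Pass 2: insert each transaction, items ordered by descending support count.
    for t in transactions:
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, freq

# Hypothetical usage:
root, header, freq = build_fp_tree(
    [{"K", "E", "M", "O", "Y"}, {"K", "E", "O", "Y"}, {"K", "E", "M"}],
    min_sup_count=2)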
Example: Consider the following transactional data
Step 1: 1-itemsets
Let the minimum
support be 3
Step 2: Frequent Pattern Set/List - set of frequent
items is sorted in the order of descending support count.
L = {K : 5, E : 4, M : 3, O : 3, Y : 3}
Step 3: respective Ordered-Item set is built
done by iterating over L and checking whether the current item is contained
in the transaction in question. If it is, the item is inserted into the
Ordered-Item set for the current transaction.
Step 4: An FP-tree is then constructed – a trie data
structure into which all the Ordered-Item sets are inserted
Step 5: Conditional Pattern Base (path labels of all the paths
which lead to any node of the given item in the frequent-pattern tree) is
computed.
Step 6: For each item in the Conditional Pattern Base,
the Conditional FP-Tree (the items common to all the paths in the
Conditional Pattern Base of that item, with their counts) is built.
Step 7: From the Conditional Frequent Pattern tree, the Frequent
Pattern rules are generated by pairing the items of the Conditional
Frequent Pattern Tree set with the corresponding suffix item, as given
in the table below.
For each row, generate association rules:
For example, for the first row, the rules K -> Y and Y -> K can be inferred.
To determine the valid rule, the confidence of both the rules is
calculated and the one with confidence greater than or equal to the
minimum confidence value is retained.
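In practice, library implementations are normally used. Assuming the mlxtend package is installed, an end-to-end sketch of FP-growth followed by rule generation might look as follows; the transaction list is an assumed example, since the original table is not reproduced in these notes:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [["E", "K", "M", "N", "O", "Y"],
                ["D", "E", "K", "N", "O", "Y"],
                ["A", "E", "K", "M"],
                ["C", "K", "M", "U", "Y"],
                ["C", "E", "I", "K", "O"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets with support >= 3/5 = 0.6 (support count >= 3).
freq = fpgrowth(df, min_support=0.6, use_colnames=True)

# Strong rules with confidence >= 0.8.
rules = association_rules(freq, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])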
3.7 From Association mining to
Correlation Analysis
A misleading “strong” association rule:
Consider AllElectronics transactions with respect to the
purchase of computer games and videos.
Let game refer to the transactions containing computer
games, and video refer to those containing videos.
Of the 10,000 transactions analyzed, the data show
that 6000 of the customer transactions included
computer games, while 7500 included videos, and 4000
included both computer games and videos.
a data mining program for discovering association rules is
run on the data, using a minimum support of, say, 30% and
a minimum confidence of 60%.
The following association rule is discovered:
buys(X, “computer games”) ⇒ buys(X, “videos”)
[support = 40%, confidence = 66%] RULE 1
Rule 1 is a strong association rule
Rule 1 is misleading because the probability of purchasing
videos is 75%, which is even larger than 66%.
In fact, computer games and videos are negatively
associated
Conclusion: the support and confidence measures are
insufficient at filtering out uninteresting association rules.
Solution: a correlation measure can be used to
augment the support–confidence framework for
association rules.
This leads to correlation rules of the form:
A ⇒ B [support, confidence, correlation]
That is, a correlation rule is measured not only by its
support and confidence but also by the correlation
between itemsets A and B.
Correlation Analysis
Correlation Analysis is a statistical method that is used to
discover if there is a relationship between two or more
variables, and how strong that relationship may be.
The correlation coefficient ranges between -1 and +1 and
quantifies the direction and strength of the linear association
between the two variables.
The correlation between two variables can be POSITIVE (i.e.,
higher levels of one variable are associated with higher levels
of the other) or NEGATIVE (i.e., higher levels of one variable
are associated with lower levels of the other).
The Sign of the correlation coefficient indicates the direction
of the association.
The Magnitude of the correlation coefficient indicates the
strength of the association.
For example:
A correlation of r = 0.9 suggests a strong, positive
association between two variables.
A correlation of r = -0.2 suggests a weak, negative
association.
A correlation close to zero suggests no linear association
between two continuous variables.
Correlation measures
1. Lift:
The occurrence of itemset A is independent of the occurrence of
itemset B if P(A ∪B) = P(A)P(B); otherwise, itemsets A and B
are dependent and correlated as events.
The lift between the occurrence of A and B can be measured by
computing:
lift(A, B) = P(A ∪ B) / (P(A) P(B))
If lift(A,B) is less than 1, then the occurrence of A is negatively
correlated with the occurrence of B
If the resulting value is greater than 1, then A and B are
positively correlated.
If the resulting value is equal to 1, then A and B are
independent and there is no correlation between them.
Example: Correlation analysis using lift
From the table, the probability of purchasing a computer game is
P({game}) = 0.60, the probability of purchasing a video is
P({video}) = 0.75, and the probability of purchasing both is
P({game, video}) = 0.40.
Lift({game,video}) = P({game, video})/(P({game}) × P({video}))
= 0.40/(0.60 × 0.75) = 0.89 < 1
there is a negative correlation between the occurrence of {game}
and {video}.
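The same lift computation in a few lines of Python, using the figures from the example:

# 10,000 transactions: 6,000 contain games, 7,500 contain videos, 4,000 contain both.
N = 10_000
p_game  = 6_000 / N          # 0.60
p_video = 7_500 / N          # 0.75
p_both  = 4_000 / N          # 0.40

lift = p_both / (p_game * p_video)
print(round(lift, 2))        # 0.89 < 1  -> negative correlation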
2. χ² (Chi-square) measure:
To compute the χ² value, we take the squared difference
between the observed and expected value for a slot (A and B
pair) in the contingency table, divided by the expected value.
This amount is summed over all slots of the contingency table:
χ² = Σ (observed − expected)² / expected
Example:
χ² = 555.6
Because the χ² value is greater than 1, and the observed value of the
slot (game, video) = 4000 is less than the expected value of 4500,
buying game and buying video are negatively correlated.
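Using the same game/video figures, the χ² value can be verified with a short script that builds the 2 × 2 contingency table and sums (observed − expected)² / expected over its four slots:

# Contingency table for {game} x {video}, derived from the example figures:
#              video   no-video   (row totals 6000 / 4000, N = 10000)
observed = [[4000, 2000],
            [3500,  500]]
N = 10_000
row = [sum(r) for r in observed]          # [6000, 4000]
col = [sum(c) for c in zip(*observed)]    # [7500, 2500]

chi2 = sum((observed[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
           for i in range(2) for j in range(2))
print(round(chi2, 1))   # 555.6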
3.6 Mining various kinds of
Association Rules
Kinds of Association rules:
1. Multilevel Association Rules:
involve concepts at different abstraction levels.
2. Multidimensional Association Rules:
involve more than one dimension or predicate
Example: rules that relate what a customer buys to his
or her age.
3. Quantitative Association Rules:
involve numeric attributes that have an implicit ordering
among values
Example: age
1. Mining Multilevel Association Rules
For many applications, it is difficult to find associations among
data items at LOW or primitive levels of abstraction.
Strong associations discovered at high abstraction levels, though
with high support, could be commonsense knowledge.
Therefore data mining systems should provide capabilities for
mining patterns at multiple abstraction levels, with sufficient
flexibility for easy traversal among different abstraction spaces.
Example: Table - Sales in an AllElectronics store, showing the
items purchased for each transaction.
The concept hierarchy for the items spans five levels, from Level 0
(the root) down to Level 4 (the raw item values).
A concept hierarchy defines a sequence of mappings from a
set of low-level concepts to a higher-level, more general
concept set.
Data can be generalized by replacing low-level concepts
within the data by their corresponding higher-level concepts,
or ancestors, from a concept hierarchy.
Level 0, at the root node for “all”, is the most general abstraction
level.
Level 4 - is the most specific abstraction level of this
hierarchy. It consists of the raw data values.
Association rules generated from mining data at multiple
abstraction levels are called multiple-level or multilevel
association rules.
Multilevel association rules can be mined efficiently using
concept hierarchies under a support-confidence framework.
For each level, any algorithm for discovering frequent itemsets
may be used, such as Apriori or its variations.
Approaches:
(i) Using uniform minimum support for all levels:
Same support for all levels
(ii) Using reduced minimum support at lower levels (referred
to as reduced support):
Each abstraction level has its own minimum support
threshold.
The deeper the abstraction level, the smaller the
corresponding threshold.
(iii) Using item or group-based minimum support
(referred to as group-based support):
Because users or experts often have insight as to which
groups are more important than others, it is sometimes
more desirable to set up user-specific, item-based, or group-based
minimum support thresholds when mining multilevel rules.
For example, experts are interested in purchase patterns of
laptops. Therefore a low support threshold is set for this
group to give attention to these items’ purchase patterns.
A serious side effect of mining multilevel association rules is
its generation of many redundant rules across multiple
abstraction levels due to the “ancestor” relationships among
items.
2. Mining Multidimensional
Association Rules:
Association rule with a single predicate:
buys(X, “IBM Laptop Computer”) ⇒ buys(X, “HP Inkjet Printer”)
Association rules that involve two or more dimensions or
predicates can be referred to as multi dimensional association
rules
Example: age(X, “20…29”) ∧ occupation(X, “student”) ⇒
buys(X, “Laptop”)
The above rule contains three predicates (age, occupation, and buys)
Approach:
Using static discretization of quantitative
attributes:
Quantitative attributes, in this case, are discretized
before mining using predefined concept hierarchies
or data discretization techniques, where numeric
values are replaced by interval labels.
Categorical attributes may also be generalized to
higher conceptual levels if desired.
Data cubes are well suited for mining
Instead of searching on only one attribute like ‘buys’, we need to
search through all of the relevant attributes, treating each
attribute-value pair as an itemset.
Suitable for smaller data sets.
3. Mining Quantitative Association
Rules:
Quantitative (numeric) attributes are dynamically
discretized during the mining process so as to satisfy some
mining criteria like maximizing the confidence etc.
Such rules typically have two quantitative attributes on the left-hand
side and one categorical attribute on the right-hand side:
Aquan1 ∧ Aquan2 ⇒ Acat
Aquan1 and Aquan2 are tests on quantitative attribute
intervals (intervals are dynamically determined), and Acat
tests a categorical attribute.
Example: age(X, ”20..25”) ∧ income(X, ”30K..41K”) ⇒
buys(X, ”Laptop Computer”)
Uses ‘Binning’: quantitative attributes are partitioned into
intervals (“bins”) based on the distribution of the data.
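A small sketch of the binning idea (assuming pandas is available; the ages and intervals are illustrative): static discretization replaces numeric values by predefined interval labels, whereas distribution-based binning lets the data determine the intervals, e.g., equal-frequency bins:

import pandas as pd

ages = pd.Series([21, 23, 25, 29, 34, 37, 41, 52, 58, 63])

# Static discretization: predefined intervals with labels (concept-hierarchy style).
static_bins = pd.cut(ages, bins=[20, 29, 39, 49, 69],
                     labels=["20..29", "30..39", "40..49", "50..69"])

# Dynamic, distribution-based binning: three equal-frequency bins from the data.
dynamic_bins = pd.qcut(ages, q=3)

print(pd.DataFrame({"age": ages, "static": static_bins, "dynamic": dynamic_bins}))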
3.8 Constraint-based Association
mining
A data mining procedure can uncover thousands of rules from a
given data set, most of which end up being unrelated or
uninteresting to the users.
A good approach is to have the users specify their expectations as
constraints to confine/reduce the search space. This
strategy is called constraint-based mining.
The general constraint is the “minimum support threshold”.
Definition: Constraint-based mining is the development of
data mining algorithms that search through a pattern or
model space restricted by constraints.
Well-defined constraints ensure that only association rules that are
appealing to users are generated.
Constraint-based algorithms use constraints to reduce the
search space in the frequent itemset generation step of
association rule mining.
Constraint-based mining boosts interactive & exploratory
mining and analysis.
Constraint based mining provides
User Flexibility: allows users to describe the rules that they would like
to uncover.
System Optimization: exploits such constraints for efficient mining.
The constraints can include the following:
Knowledge type constraints
Data constraints
Dimension/level constraints
Interestingness constraints
Rule constraints
Knowledge constraints −
These define the type of knowledge to be mined, such as
association or correlation.
Data constraints −
These define the set of task-relevant data.
Dimension/level constraints −
These define the desired dimensions (or attributes) of the
data, or levels of the concept hierarchies, to be utilized in mining.
Interestingness constraints -
These specify thresholds on statistical measures of rule
interestingness, such as support, confidence, and
correlation.
Rule constraints −
These define the form of rules to be mined.
Such constraints can be defined as metarules (rule
templates), as the maximum or minimum number of
predicates that can appear in the rule antecedent (left) or
consequent (right), or as relationships between attributes,
attribute values, etc.
The above constraints can be described using a high-level
declarative data mining query language and user interface.
This form of constraint-based mining enables users to describe
the rules that they would like to uncover, thereby making the data
mining process more efficient.
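As a simple illustration of rule constraints applied as a post-filter (a sketch only; real constraint-based miners push such constraints deep into the mining step, and the rules below are hypothetical), the snippet keeps only rules whose consequent contains a required item and whose antecedent has at most two predicates:

# Sketch: filtering mined rules by user-specified rule constraints.
# Each rule is (antecedent, consequent, confidence); values are illustrative.
rules = [
    ({"age=20..29", "occupation=student"}, {"buys=Laptop"}, 0.81),
    ({"buys=Printer"}, {"buys=Ink"}, 0.74),
    ({"age=20..29", "income=30K..41K", "occupation=student"}, {"buys=Laptop"}, 0.66),
]

def satisfies(rule, required_consequent="buys=Laptop", max_antecedent_len=2):
    antecedent, consequent, _ = rule
    return required_consequent in consequent and len(antecedent) <= max_antecedent_len

for r in filter(satisfies, rules):
    print(r)   # only the first rule satisfies both constraints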