DM - Unit II

The document discusses association rule mining, a method for discovering relationships between variables in large databases, focusing on frequent patterns, mining methods, and various types of association rules. It highlights market basket analysis as a practical application, explaining concepts like frequent itemsets, support, and confidence, and introduces algorithms such as Apriori and FP-Growth for mining these patterns. Additionally, it covers different types of association rules, including multilevel and multidimensional rules, and emphasizes the importance of correlation analysis in understanding variable relationships.


Unit-2

• Association Rule Mining:
• Mining frequent patterns
• Associations and correlations
• Mining methods
• Mining various kinds of association rules
• Correlation analysis
• Constraint-based association mining
• Graph pattern mining
• Sequential pattern mining (SPM)
Association Rule Mining
• Association rule mining is a popular and well researched method for
discovering interesting relations between variables in large databases.
• It is intended to identify strong rules discovered in databases using
different measures of interestingness.
• Based on the concept of strong rules, Rakesh Agrawal et al. introduced
association rules.
• Frequent patterns are patterns (e.g., itemsets, subsequences, or
substructures) that appear frequently in a data set.
• For example, a set of items, such as milk and bread, that appear
frequently together in a transaction data set is a frequent itemset.
• A subsequence, such as buying first a PC, then a digital camera, and
then a memory card, if it occurs frequently in a shopping history
database, is a (frequent) sequential pattern.
• A substructure can refer to different structural forms, such as
subgraphs, subtrees, or sublattices, which may be combined with
itemsets or subsequences.
• If a substructure occurs frequently, it is called a (frequent) structured
pattern.
• Frequent pattern mining searches for recurring relationships in a
given data set.
• This section introduces the basic concepts of frequent pattern mining
for the discovery of interesting associations and correlations between
itemsets in transactional and relational databases.
Market Basket Analysis: A Motivating
Example
• A typical example of frequent itemset mining is market basket
analysis.
• This process analyzes customer buying habits by finding associations
between the different items that customers place in their “shopping
baskets” The discovery of these associations can help retailers develop
marketing strategies by gaining insight into which items are frequently
purchased together by customers.
• For instance, if customers are buying milk, how likely are they to also
buy bread (and what kind of bread) on the same trip?
Frequent Itemsets, Closed Itemsets, and
Association Rules
• Let I = {I1, I2, ..., Im} be a set of items.
• Let D, the task-relevant data, be a set of database transactions
• where each transaction T is a nonempty itemset such that T ⊆ I.
• Each transaction is associated with an identifier, called a TID. Let A be a set of items.
• A transaction T is said to contain A if A ⊆ T. An association rule is an implication of the form
• A ⇒ B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
• The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of
transactions in D that contain A ∪ B (i.e., the union of sets A and B, or, both A and B).
• This is taken to be the probability, P(A ∪ B).
• The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions
in D containing A that also contain B.
• This is taken to be the conditional probability, P(B|A). That is,
• support(A⇒B) =P(A ∪B) ……………………………..(6.2)
• confidence(A⇒B) =P(B|A). ………………………….(6.3)
• Rules that satisfy both a minimum support threshold (min sup) and a
minimum confidence threshold (min conf ) are called strong
• confidence(A⇒ B) = P(B|A) = support(A ∪B) / support(A)
= support count(A ∪B) /support count(A) . ……………….(6.4)
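The definitions in Eqs. (6.2)-(6.4) can be sketched directly in Python; the small transaction set below is invented purely for illustration:

```python
# Computing rule support and confidence per Eqs. (6.2)-(6.4)
# on a toy transaction set (invented data).

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, transactions):
    """support(X) = fraction of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    """confidence(A => B) = support(A u B) / support(A), i.e. P(B | A)."""
    return support(A | B, transactions) / support(A, transactions)

A, B = {"milk"}, {"bread"}
print(support(A | B, transactions))    # 0.6  (rule support)
print(confidence(A, B, transactions))  # 0.75 (rule confidence)
```

With min sup = 50% and min conf = 70%, the rule milk ⇒ bread would be strong on this data.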
In general, association rule mining can be viewed as a two-step
process:
1. Find all frequent itemsets: By definition, each of these itemsets will
occur at least as frequently as a predetermined minimum support
count, min sup.
2. Generate strong association rules from the frequent itemsets: By
definition, these rules must satisfy minimum support and minimum
confidence
• Example: If customers who purchase computers also tend to buy
antivirus software at the same time, then placing the hardware display
close to the software display may help increase the sales of both items.
• In an alternative strategy, placing hardware and software at opposite
ends of the store may entice customers who purchase such items to pick
up other items along the way.
• For instance, after deciding on an expensive computer, a customer may
observe security systems for sale while heading toward the software
display to purchase antivirus software, and may decide to purchase a
home security system as well.
• Market basket analysis can also help retailers plan which items to put
on sale at reduced prices.
• If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers as well
as computers.
• Frequent Pattern Mining: Frequent pattern mining can be classified in various
ways:
1. Based on the completeness of patterns to be mined: We can mine the
complete set of frequent itemsets, the closed frequent itemsets, and the maximal
frequent itemsets, given a minimum support threshold.
We can also mine constrained frequent itemsets, approximate frequent
itemsets, near-match frequent itemsets, top-k frequent itemsets, and so on.
2. Based on the levels of abstraction involved in the rule set: Some methods for
association rule mining can find rules at differing levels of abstraction. For example,
suppose that a set of mined association rules includes the following rules, where X is
a variable representing a customer:
• buys(X, "computer") => buys(X, "HP printer") (1)
• buys(X, "laptop computer") => buys(X, "HP printer") (2)
• In rules (1) and (2), the items bought are referenced at different levels of abstraction
(e.g., "computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule: If
the items or attributes in an association rule reference only one
dimension, then it is a single-dimensional association rule.
buys(X, "computer") => buys(X, "antivirus software")
If a rule references two or more dimensions, such as the dimensions age,
income, and buys, then it is a multidimensional association rule.
The following rule is an example of a multidimensional rule:
age(X, "30...39") ^ income(X, "42K...48K") => buys(X, "high
resolution TV")
4. Based on the types of values handled in the rule: If a rule involves
associations between the presence or absence of items, it is a Boolean
association rule.
If a rule describes associations between quantitative items or attributes,
then it is a quantitative association rule.
5. Based on the kinds of rules to be mined: Frequent pattern analysis
can generate various kinds of rules and other interesting relationships.
Association rule mining can generate a large number of rules, many of
which are redundant or do not indicate a correlation relationship among
itemsets.
The discovered associations can be further analyzed to uncover
statistical correlations, leading to correlation rules
6.Based on the kinds of patterns to be mined: Many kinds of frequent
patterns can be mined from different kinds of data sets.
• Sequential pattern mining searches for frequent subsequences in a sequence
data set, where a sequence records an ordering of events.
• For example, with sequential pattern mining, we can study the order in
which items are frequently purchased.
• For instance, customers may tend to first buy a PC, followed by a digital
camera,and then a memory card.
• Structured pattern mining searches for frequent substructures in a structured
data set.
• Single items are the simplest form of structure.
• Each element of an item set may contain a subsequence, a subtree, and so
on.
• Therefore, structured pattern mining can be considered as the most general
form of frequent pattern mining.
Frequent Itemset Mining Methods
• Apriori, the basic algorithm for finding frequent itemsets
• Efficient Frequent Itemset Mining Methods: Finding Frequent Itemsets Using Candidate
Generation:
• The Apriori Algorithm: Apriori is a seminal algorithm proposed by R. Agrawal and R.
Srikant in 1994 for mining frequent itemsets for Boolean association rules.
• The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties.
• Apriori employs an iterative approach known as a level-wise search, where k-itemsets are
used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the
count for each item, and collecting those items that satisfy minimum support.
• The resulting set is denoted L1. Next, L1 is used to find L2, the set of frequent 2-itemsets,
which is used to find L3, and so on, until no more frequent k-itemsets can be found.
• Finding each Lk requires one full scan of the database.
• A two-step process, consisting of join and prune actions, is followed in Apriori.
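The level-wise search described above can be sketched as follows; this is a minimal, unoptimized illustration with invented transactions (real implementations use a dedicated candidate-generation routine and hash trees, covered below):

```python
# Minimal sketch of Apriori's level-wise search: frequent k-itemsets (Lk)
# seed the candidate (k+1)-itemsets, one full database scan per level.

transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def apriori(transactions, min_sup_count):
    # L1: scan the database and keep items meeting minimum support
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup_count}
    frequent = set(Lk)
    k = 2
    while Lk:
        # join: candidate k-itemsets from unions of frequent (k-1)-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # one full database scan at this level to count the candidates
        Lk = {c for c in Ck
              if sum(1 for t in transactions if c <= t) >= min_sup_count}
        frequent |= Lk
        k += 1
    return frequent

freq = apriori(transactions, min_sup_count=3)
```

On this data, all three items and all three 2-itemsets are frequent at a support count of 3, while {milk, bread, butter} (count 2) is not, so the loop stops after level 3.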
• Apriori property: All nonempty subsets of a frequent itemset must
also be frequent.
• The Apriori property is based on the following observation. By
definition, if an itemset I does not satisfy the minimum support
threshold, min sup, then I is not frequent, that is, P(I) < min sup.
• If an item A is added to the itemset I, then the resulting itemset (i.e., I
∪A) cannot occur more frequently than I.
• Therefore, I ∪A is not frequent either, that is, P(I ∪A) < min sup.
This property belongs to a special category of properties called
antimonotonicity in the sense that if a set cannot pass a test, all of its
supersets will fail the same test as well.
• It is called antimonotonicity because the property is monotonic in the
context of failing a test.
• The join step: To find Lk , a set of candidate k-itemsets is generated by
joining Lk−1 with itself.
• This set of candidates is denoted Ck . Let l1 and l2 be itemsets in Lk−1. The
notation li[j] refers to the jth item in li (e.g., l1[k − 2] refers to the second to
the last item in l1).
• For efficient implementation, Apriori assumes that items within a transaction
or itemset are sorted in lexicographic order. For the (k − 1)-itemset, li , this
means that the items are sorted such that li[1] < li[2] < ··· < li[k − 1].
• The join, Lk−1 ⋈ Lk−1, is performed, where members of Lk−1 are joinable
if their first (k − 2) items are in common.
• That is, members l1 and l2 of Lk−1 are joined if (l1[1] = l2[1]) ∧ (l1[2] =
l2[2]) ∧ ··· ∧ (l1[k − 2] = l2[k − 2]) ∧(l1[k − 1] < l2[k − 1]). The condition
l1[k − 1] < l2[k − 1] simply ensures that no duplicates are generated.
• The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2],..., l1[k − 2],
l1[k − 1], l2[k − 1]}.
The prune step: Ck is a superset of Lk , that is, its members may or
may not be frequent, but all of the frequent k-itemsets are included in
Ck .
A database scan to determine the count of each candidate in Ck would
result in the determination of Lk (i.e., all candidates having a count no
less than the minimum support count are frequent by definition, and
therefore belong to Lk). Ck , however, can be huge, and so this could
involve heavy computation.
• To reduce the size of Ck , the Apriori property is used as follows. Any
(k − 1)-itemset that is not frequent cannot be a subset of a frequent k-
itemset. Hence, if any (k − 1)-subset of a candidate k-itemset is not in
Lk−1, then the candidate cannot be frequent either and so can be
removed from Ck .
• This subset testing can be done quickly by maintaining a hash tree of
all frequent itemsets.
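The join and prune steps above can be sketched as a single candidate-generation routine; the itemset names below are illustrative, and the lexicographic ordering of items is assumed as described:

```python
from itertools import combinations

# Sketch of candidate generation: the join and prune steps described
# above, assuming items within each itemset are kept in sorted order.

def apriori_gen(L_prev, k):
    L_sorted = sorted(tuple(sorted(s)) for s in L_prev)
    Ck = set()
    for i in range(len(L_sorted)):
        for j in range(i + 1, len(L_sorted)):
            l1, l2 = L_sorted[i], L_sorted[j]
            # join: first k-2 items equal, last items ordered (no duplicates)
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                Ck.add(frozenset(l1 + (l2[k - 2],)))
    # prune: every (k-1)-subset of a candidate must itself be frequent
    L_set = {frozenset(s) for s in L_prev}
    return {c for c in Ck
            if all(frozenset(sub) in L_set for sub in combinations(c, k - 1))}

L2 = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"B", "D"}]
C3 = apriori_gen(L2, 3)
# only {A, B, C} survives; {B, C, D} is pruned because {C, D} is not in L2
```

Real implementations replace the subset lookups with a hash tree of frequent itemsets, as the text notes.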
Difference between Apriori and FP-Growth Algorithm
Apriori and FP-Growth are the most basic frequent itemset mining (FIM) algorithms. There are
some basic differences between these algorithms, such as:
• Apriori generates frequent patterns by forming itemsets of increasing size (single itemsets,
double itemsets, triple itemsets); FP-Growth builds an FP-Tree for making frequent patterns.
• Apriori uses candidate generation, where frequent subsets are extended one item at a time;
FP-Growth generates a conditional FP-Tree for every item in the data.
• Apriori scans the database in each step, so it becomes time-consuming when the number of
items is large; the FP-Tree requires only one database scan in its beginning steps, so it
consumes less time.
• Apriori saves a converted version of the database in memory; FP-Growth saves a set of
conditional FP-Trees, one per item, in memory.
• Apriori uses a breadth-first search; FP-Growth uses a depth-first search.

Mining Various Kinds of Association Rules
1) Mining Multilevel Association Rules: For many applications, it is
difficult to find strong associations among data items at low or
primitive levels of abstraction due to the sparsity of data at those
levels.
• Strong associations discovered at high levels of abstraction may
represent commonsense knowledge.
• Moreover, what may represent common sense to one user may seem
novel to another.
• Therefore, data mining systems should provide capabilities for
mining association rules at multiple levels of abstraction, with
sufficient flexibility for easy traversal among different abstraction
spaces.
2) Mining Multidimensional Association Rules from Relational
Databases and Data Warehouses:
• We have studied association rules that imply a single predicate, that
is, the predicate buys.
• For instance, in mining our AllElectronics database, we may
discover the Boolean association rule
3) Mining Multidimensional Association Rules Using Static
Discretization of Quantitative Attributes: Quantitative attributes, in this
case, are discretized before mining using predefined concept hierarchies
or data discretization techniques, where numeric values are replaced by
interval labels.
• Categorical attributes may also be generalized to higher conceptual
levels if desired.
• If the resulting task-relevant data are stored in a relational table, then
any of the frequent itemset mining algorithms we have discussed can
be modified easily so as to find all frequent predicate sets rather than
frequent itemsets.
• In particular, instead of searching on only one attribute like buys, we
need to search through all of the relevant attributes, treating each
attribute-value pair as an itemset
4) Mining Quantitative Association Rules: Quantitative association rules
are multidimensional association rules in which the numeric attributes are
dynamically discretized during the mining process so as to satisfy some
mining criteria, such as maximizing the confidence or compactness of the
rules mined.
• In this section, we focus specifically on how to mine quantitative
association rules having two quantitative attributes on the left-hand side
of the rule and one categorical attribute on the right-hand side of the rule.
• Most association rule mining algorithms employ a support-confidence
framework. Often, many interesting rules can be found using low
support thresholds.
• Although minimum support and confidence thresholds help weed out
or exclude the exploration of a good number of uninteresting rules,
many rules so generated are still not interesting to the users.
• Unfortunately, this is especially true when mining at low support
thresholds or mining for long patterns.
• This has been one of the major bottlenecks for successful application
of association rule mining.
Correlation Analysis
• Correlation analysis is a statistical method used to measure the
strength of the linear relationship between two variables and compute
their association.
• Correlation analysis calculates the level of change in one variable due
to the change in the other.
• A high correlation points to a strong relationship between the two
variables, while a low correlation means that the variables are weakly
related.
• Researchers use correlation analysis to analyze quantitative data
collected through research methods like surveys and live polls for
market research.
• They try to identify relationships, patterns, significant connections,
and trends between two variables or datasets.
• There is a positive correlation between two variables when an
increase in one variable leads to an increase in the other.
• On the other hand, a negative correlation means that when one
variable increases, the other decreases and vice-versa.
• Correlation is a bivariate analysis that measures the strength of
association between two variables and the direction of the relationship.
• In terms of the strength of the relationship, the correlation coefficient's
value varies between +1 and -1. A value of ± 1 indicates a perfect
degree of association between the two variables.
• As the correlation coefficient value goes towards 0, the relationship
between the two variables will be weaker.
• The coefficient sign indicates the direction of the relationship; a + sign
indicates a positive relationship, and a - sign indicates a negative
relationship.
Types of Correlation Analysis in Data Mining
1. Pearson r correlation
Pearson r correlation is the most widely used correlation statistic to
measure the degree of the relationship between linearly related
variables.
• For example, in the stock market, if we want to measure how two
stocks are related to each other, Pearson r correlation is used to
measure the degree of relationship between the two.
• The point-biserial correlation is conducted with the Pearson
correlation formula, except that one of the variables is dichotomous.
The following formula is used to calculate the Pearson r correlation:

rxy = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )

rxy = Pearson r correlation coefficient between x and y
n = number of observations
xi = value of x (for the ith observation)
yi = value of y (for the ith observation)
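The Pearson r computation can be sketched directly from the definitions above (deviations from the means, normalized by the two standard-deviation terms); the sample values are invented:

```python
import math

# Pearson r: covariance of x and y divided by the product of their
# standard deviations (computed here from raw sums, no libraries).

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0 (perfect linear relation)
```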
2. Kendall rank correlation
• Kendall rank correlation is a non-parametric test that measures the
strength of dependence between two variables.
• Considering two samples, a and b, where each sample size is n, we
know that the total number of pairings of a with b is n(n − 1)/2.
• The following formula is used to calculate the value of Kendall rank
correlation:

τ = (Nc − Nd) / (n(n − 1)/2)

• Nc = number of concordant pairs
• Nd = number of discordant pairs
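A minimal sketch of the concordant/discordant pair counting behind Kendall's coefficient (ties, not discussed above, are simply counted toward neither total here):

```python
# Kendall rank correlation: count pairs ordered the same way in both
# samples (concordant, Nc) vs. oppositely (discordant, Nd).

def kendall_tau(a, b):
    n = len(a)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                nc += 1  # pair ranked the same way in both samples
            elif s < 0:
                nd += 1  # pair ranked oppositely
    return (nc - nd) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (fully discordant)
```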
3. Spearman rank correlation
• Spearman rank correlation is a non-parametric test that is used to measure the degree of
association between two variables.
• The Spearman rank correlation test does not carry any assumptions about the data
distribution.
• It is the appropriate correlation analysis when the variables are measured on an at least
ordinal scale.
• This coefficient requires a table of data that displays the raw data, its ranks, and the
difference between the two ranks.
• This squared difference between the two ranks will be shown on a scatter graph, which
will indicate whether there is a positive, negative, or no correlation between the two
variables.
• The constraint that this coefficient works under is -1 ≤ r ≤ +1, where a result of 0 would
mean that there was no relation between the data whatsoever.
• The following formula is used to calculate the Spearman rank correlation:

ρ = 1 − (6 Σ di²) / (n(n² − 1))

ρ = Spearman rank correlation
di = the difference between the ranks of corresponding variables
n = number of observations
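The rank-difference computation can be sketched as follows, assuming no tied values so that simple ordinal ranks suffice; the data values are invented:

```python
# Spearman rho from rank differences: rank each variable, take the
# squared rank differences d_i^2, and apply the formula above.

def spearman_rho(x, y):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# monotone but nonlinear relation -> perfect rank correlation
print(spearman_rho([10, 20, 30], [1, 4, 9]))  # 1.0
```

Note how the nonlinear pairing still scores 1.0, which Pearson r would not give; this is the sense in which Spearman carries no assumption about the data distribution.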
• When to Use These Methods
• The two methods outlined above will be used according to whether
there are parameters associated with the data gathered. The two terms
to watch out for are:
• Parametric:(Pearson's Coefficient) The data must be handled with
the parameters of populations or probability distributions.
• Typically used with quantitative data already set out within said
parameters.
• Non-parametric:(Spearman's Rank) Where no assumptions can be
made about the probability distribution.
• Typically used with qualitative data, but can be used with quantitative
data if Spearman's Rank proves inadequate.
Interpreting Results
• Typically, the best way to gain a generalized but immediate
interpretation of a data set is to visualize it on a scatter graph.
• Positive Correlation: Any score from +0.5 to +1 indicates a very
strong positive correlation, which means that they both increase
simultaneously.
• In this case, the data points trend upwards, indicating the positive
correlation.
• The line of best fit, or trend line, is placed so as to best represent the
graph's data.
• Negative Correlation: Any score from -0.5 to -1 indicates a strong
negative correlation, which means that as one variable increases, the
other decreases proportionally.
• In these cases, the line of best fit slopes downwards from the point of
origin, indicating the negative correlation.
• No Correlation: Very simply, a score of 0 indicates no correlation, or
relationship, between the two variables.
• This fact will stand true for all, no matter which formula is used.
• The more data is input into the formula (i.e., the larger the sample
size), the more accurate the result will be.
Benefits of Correlation Analysis

1. Reduce Time to Detection
• In anomaly detection, working with many metrics and surfacing
correlated anomalous metrics helps draw relationships that reduce
time to detection (TTD) and support shortened time to remediation
(TTR).
• As data-driven decision-making has become the norm, early and
robust detection of anomalies is critical in every industry domain, as
delayed detection adversely impacts customer experience and revenue.
2. Reduce Alert Fatigue
• Another important benefit of correlation analysis in anomaly detection
is reducing alert fatigue by filtering irrelevant anomalies (based on the
correlation) and grouping correlated anomalies into a single alert.
• Alert storms and false positives are significant challenges
organizations face: getting hundreds, even thousands, of separate alerts
from multiple systems when many of them stem from the same
incident.
3. Reduce Costs
• Correlation analysis helps significantly reduce the costs
associated with the time spent investigating
meaningless or duplicative alerts.
• In addition, the time saved can be spent on more
strategic initiatives that add value to the organization.
Graph Pattern Mining
• Graph pattern mining is the mining of frequent subgraphs (also called
(sub)graph patterns) in one or a set of graphs.
• Methods for mining graph patterns can be categorized into Apriori-based and
pattern growth–based approaches.
• Alternatively, we can mine the set of closed graphs, where a graph g is closed
if there exists no proper supergraph g′ that carries the same support count as
g.
• Moreover, there are many variant graph patterns, including approximate
frequent graphs, coherent graphs, and dense graphs.
• User-specified constraints can be pushed deep into the graph pattern mining
process to improve mining efficiency.
• Graph pattern mining has many interesting applications.
• For example, it can be used to generate compact and effective graph
index structures based on the concept of frequent and discriminative
graph patterns.
• Approximate structure similarity search can be achieved by exploring
graph index structures and multiple graph features.
• Moreover, classification of graphs can also be performed effectively
using frequent and discriminative subgraphs as features
Sequential Pattern Mining
• A symbolic sequence consists of an ordered set of elements or events,
recorded with or without a concrete notion of time.
• There are many applications involving data of symbolic sequences
such as customer shopping sequences, web click streams, program
execution sequences, biological sequences, and sequences of events in
science and engineering and in natural and social developments
• Sequential pattern mining has focused extensively on mining symbolic sequences.
• A sequential pattern is a frequent subsequence existing in a single sequence or a set of
sequences.
• A sequence α = ⟨a1 a2 ··· an⟩ is a subsequence of another sequence
β = ⟨b1 b2 ··· bm⟩ if there exist integers 1 ≤ j1 < j2 < ··· < jn ≤ m such that
a1 ⊆ bj1, a2 ⊆ bj2, ..., an ⊆ bjn.
• For example, if α = ⟨{a,b}, d⟩ and β = ⟨{a,b,c}, {b,e}, {d,e}, a⟩, where a, b, c, d, and e are
items, then α is a subsequence of β.
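The containment test in this definition can be sketched as a simple greedy scan; sequences are represented as lists of itemsets, using the example above:

```python
# Subsequence test per the definition above: each element (itemset) of
# alpha must be contained in some element of beta, in order.

def is_subsequence(alpha, beta):
    j = 0
    for b in beta:
        if j < len(alpha) and alpha[j] <= b:
            j += 1  # alpha[j] is contained in this element of beta
    return j == len(alpha)

alpha = [{"a", "b"}, {"d"}]
beta = [{"a", "b", "c"}, {"b", "e"}, {"d", "e"}, {"a"}]
print(is_subsequence(alpha, beta))  # True
```

Matching each element of alpha at the earliest possible position of beta is safe here: if any embedding exists, the greedy one does too.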
• Mining of sequential patterns consists of mining the set of subsequences that are
frequent in one sequence or a set of sequences.
• Many scalable algorithms have been developed as a result of extensive studies in this
area.
• Alternatively, we can mine only the set of closed sequential patterns, where a
sequential pattern s is closed if there exists no sequential pattern s′ such that s is a
proper subsequence of s′ and s′ has the same (frequency) support as s.
• Similar to its frequent pattern mining counterpart, there are also studies on efficient
mining of multidimensional, multilevel sequential patterns
Constraint-based Association Mining
• A data mining procedure can uncover thousands of rules from a given
data set, most of which end up being uninteresting or tedious to the
users.
• Users often have a good sense of which "direction" of mining may lead to
interesting patterns and the "form" of the patterns or rules they would like
to discover.
• Therefore, a good heuristic is to have the users specify such intuition or
expectations as constraints that restrict the search space.
• This strategy is called constraint-based mining.
• Constraint-based algorithms use constraints to reduce the search
space in the frequent itemset generation step (the association rule
generation step is identical to that of exhaustive algorithms).
• The most common constraint is the minimum support threshold. If a
constraint can be exploited, its inclusion in the mining phase can
significantly reduce the exploration space, because it defines a
boundary inside the search-space lattice beyond which exploration
is not needed.
• The benefit of constraints is clear: they produce only association
rules that are interesting to users. The method is straightforward, and
the rule space is reduced so that the remaining rules satisfy the
constraints.
• Constraint-based clustering discovers clusters that satisfy user-
specified preferences or constraints. Depending on the characteristics
of the constraints, constraint-based clustering may adopt rather
different approaches.
• The constraints can include the following which are as follows −
• Knowledge type constraints − These define the type of knowledge to be
mined, such as association or correlation.
• Data constraints − These define the set of task-relevant data.
• Dimension/level constraints − These define the desired dimensions (or
attributes) of the data, or levels of the concept hierarchies, to be
used in mining.
• Interestingness constraints − These define thresholds on statistical
measures of rule interestingness, such as support, confidence, and
correlation.
• Rule constraints − These define the form of rules to be mined. Such
constraints can be expressed as metarules (rule templates), as the maximum or
minimum number of predicates that can appear in the rule antecedent or
consequent, or as relationships among attributes, attribute values, and/or
aggregates.
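As an illustration of a rule constraint, here is a minimal sketch that filters frequent itemsets by a budget constraint over hypothetical item prices; the prices, itemsets, and limit are all invented for illustration:

```python
# Constraint-based filtering sketch: keep only frequent itemsets whose
# total price stays under a user-specified limit (hypothetical prices).

prices = {"computer": 900, "printer": 150, "antivirus": 40, "camera": 300}

frequent_itemsets = [
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"printer", "antivirus"},
    {"computer", "camera", "printer"},
]

def satisfies(itemset, max_total):
    # if an itemset already exceeds the budget, every superset does too,
    # so this check could also prune candidates during mining itself
    return sum(prices[i] for i in itemset) <= max_total

constrained = [s for s in frequent_itemsets if satisfies(s, 1000)]
print(constrained)  # keeps {computer, antivirus} and {printer, antivirus}
```

Because the constraint holds for no superset of a failing itemset, pushing it inside the generation step (rather than filtering afterward, as here) carves off a whole region of the search-space lattice.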
1. Metarule-Guided Mining of Association Rules
2. Constraint Pushing: Mining Guided by Rule Constraints
