Unit 4 Notes DWM
In data mining, association and correlation are key techniques for extracting
patterns and relationships from large datasets. Association uncovers relationships
between items, while correlation measures the strength of the link between two
variables. These notes cover both techniques, their main types, and the methods
used to compute them.
Multilevel Association Rules
Association rules generated from mining data at multiple levels of abstraction are
called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under
a support-confidence framework.
Rules at a high concept level may only restate common sense, while rules at a low
concept level may not always be useful.
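The support-confidence framework mentioned above scores a rule A => B with two numbers: support, the fraction of transactions containing every item of A and B, and confidence, the support of A and B together divided by the support of A. The sketch below computes both on a small, made-up list of transactions; the item names and the helper functions `support` and `confidence` are illustrative assumptions, not part of the notes.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A) for the rule A => B."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical transactions, invented for illustration only.
transactions = [
    {"computer", "printer"},
    {"computer"},
    {"computer", "printer"},
    {"printer"},
    {"computer", "printer", "mouse"},
]

rule_support = support({"computer", "printer"}, transactions)          # 3 of 5 transactions
rule_confidence = confidence({"computer"}, {"printer"}, transactions)  # about 0.75
print(rule_support, rule_confidence)
```

A rule is usually reported only if both values meet user-chosen minimum thresholds.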
Using uniform minimum support for all levels:
o When a uniform minimum support threshold is used, the search procedure is
simplified.
o The method is also simple, in that users are required to specify only one minimum
support threshold.
o The same minimum support threshold is used when mining at each level of
abstraction.
o For example, in the figure, a minimum support threshold of 5% is used throughout
(e.g. for mining from “computer” down to “laptop computer”).
o Both “computer” and “laptop computer” are found to be frequent, while “desktop
computer” is not.
Using reduced minimum support at lower levels:
o Each level of abstraction has its own minimum support threshold.
o The deeper the level of abstraction, the smaller the corresponding threshold
is.
o For example, in the figure, the minimum support thresholds for levels 1 and 2 are
5% and 3%, respectively.
o In this way, “computer,” “laptop computer,” and “desktop computer” are all
considered frequent.
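The two threshold strategies can be compared with a short sketch. The support values below are hypothetical (the figure's actual numbers are not reproduced in these notes); they are chosen only so that the outcome matches the description above. The names `item_support`, `item_level`, and `frequent_items` are likewise assumptions made for illustration.

```python
# Hypothetical support values (fraction of all transactions); not taken from the notes.
item_support = {
    "computer": 0.10,
    "laptop computer": 0.06,
    "desktop computer": 0.04,
}
# Level of each item in the concept hierarchy (1 = most general).
item_level = {
    "computer": 1,
    "laptop computer": 2,
    "desktop computer": 2,
}


def frequent_items(min_support_by_level):
    """Items whose support meets the minimum support of their own level."""
    return sorted(
        item
        for item, s in item_support.items()
        if s >= min_support_by_level[item_level[item]]
    )


# Uniform support: 5% at every level.
uniform = frequent_items({1: 0.05, 2: 0.05})
# Reduced support: 5% at level 1, 3% at level 2.
reduced = frequent_items({1: 0.05, 2: 0.03})

print(uniform)  # ['computer', 'laptop computer']
print(reduced)  # ['computer', 'desktop computer', 'laptop computer']
```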
Association rules that involve two or more dimensions or
predicates can be referred to as multidimensional association rules.
For instance, the rule
What is Association?
Association analysis can provide valuable insights into consumer behaviour and
preferences. It can help retailers identify the items that are frequently purchased
together, which can be used to optimize product placement and promotions.
Similarly, it can help e-commerce websites recommend related products to
customers based on their purchase history.
Types of Associations
Here are the most common types of associations used in data mining:
Here are the most commonly used algorithms to implement association rule mining in
data mining:
Apriori Algorithm - Apriori is one of the most widely used algorithms for association
rule mining. It generates frequent itemsets from a given dataset by pruning
infrequent itemsets iteratively. The Apriori algorithm is based on the concept that if
an itemset is frequent, then all of its subsets must also be frequent; equivalently, any
itemset that has an infrequent subset can be pruned without being counted. The algorithm
first identifies the frequent single items in the dataset, then generates candidate itemsets
of length two from them, then candidates of length three from the frequent itemsets of
length two, and so on until no more frequent itemsets can be generated. The Apriori
algorithm is computationally expensive for large datasets with many items, because it
rescans the dataset once per itemset length and may generate a very large number of
candidates.
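The level-by-level procedure described above can be sketched in a few lines of Python. This is a minimal, unoptimized illustration rather than a production implementation: the function name `apriori`, the use of absolute support counts, and the sample baskets are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets.

    `min_support` is an absolute count of transactions.
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One pass over the data: how many transactions contain each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent single items.
    singles = {frozenset([item]) for t in transactions for item in t}
    current = {c: n for c, n in count(singles).items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unions of frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {c: n for c, n in count(candidates).items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
result = apriori(baskets, min_support=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support of 3 transactions, all three single items and all three pairs are frequent, while the triple appears in only 2 baskets and is dropped.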
FP-Growth Algorithm - FP-Growth is another popular algorithm for association rule
mining that is based on the concept of frequent pattern growth. It is usually faster than
the Apriori algorithm, especially for large datasets, because it avoids candidate
generation. The FP-Growth algorithm builds a compact representation of the dataset called
a frequent pattern tree (FP-tree), which is used to mine frequent itemsets. The algorithm
scans the dataset only twice, first to count the frequency of each item and then to build
the FP-tree; the frequent itemsets are then mined from the tree without rescanning the
dataset. Like other itemset miners, FP-Growth works on discrete items, so continuous
attributes must be discretized (for example, binned into ranges) before mining.
Eclat Algorithm - Eclat (Equivalence Class Clustering and Bottom-up Lattice Traversal)
is a frequent itemset mining algorithm based on the vertical data format. The
algorithm first converts the dataset into a vertical data format, where each item and
the transaction ID in which it appears are stored. Eclat then performs a depth-first
search on a tree-like structure, representing the dataset's frequent itemsets. The
algorithm is efficient regarding both memory usage and runtime, especially for
sparse datasets.
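The vertical format and depth-first search can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `eclat`, the absolute support count, and the sample baskets are invented for the example, and no optimizations such as diffsets are attempted.

```python
def eclat(transactions, min_support):
    """Return {itemset (tuple of items): support count} via tid-list intersection."""
    # Vertical format: item -> set of IDs of the transactions containing it.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # `candidates` holds (item, tids) pairs that are frequent together with `prefix`.
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Go one level deeper by intersecting with the remaining tid-lists.
            deeper = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= min_support:
                    deeper.append((other, common))
            extend(itemset, deeper)

    start = sorted(
        ((item, tids) for item, tids in tidlists.items() if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
eclat_result = eclat(baskets, min_support=3)
for itemset, n in eclat_result.items():
    print(itemset, n)
```

Because support is read directly from the size of an intersected ID set, the original transactions are never rescanned after the vertical conversion.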
Correlation analysis is a data mining technique used to measure the degree to which two or
more variables are statistically related, that is, how changes in one variable are associated
with changes in another. Correlation can be positive, negative, or zero, depending on the
direction and strength of the relationship between the variables.
For example, suppose we are studying the relationship between the hours of study and the
grades obtained by students. If we find that as the number of hours of study increases, the
grades obtained also increase, then there is a positive correlation between the two variables.
On the other hand, if we find that as the number of hours of study increases, the grades
obtained decrease, then there is a negative correlation between the two variables. If there is
no relationship between the two variables, we would say that there is zero correlation.
Correlation analysis is important because it allows us to measure the strength and direction
of the relationship between two or more variables. This information can help identify patterns
and trends in the data, make predictions, and select relevant variables for analysis. By
understanding the relationships between different variables, we can gain valuable insights
into complex systems and make informed decisions based on data-driven analysis.
Types of Correlation Analysis in Data Mining
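There are three main types of correlation analysis used in data mining, as mentioned
below:
Pearson Correlation Coefficient - measures the strength of the linear relationship
between two variables X and Y:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
where $\bar{X}$ and $\bar{Y}$ are the sample means and $n$ is the number of observations.
Kendall Rank Correlation (Kendall's tau) - compares the ordering of every pair of
observations:
$$\tau = \frac{n_c - n_d}{n_0}, \qquad n_0 = \frac{1}{2} n(n-1)$$
where $n_c$ is the number of concordant pairs, $n_d$ is the number of discordant
pairs, $n_0$ is the total number of pairs, and $n$ represents the sample size.
Spearman Rank Correlation - applies to the ranks of the observations:
$$\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the ranks of the paired observations and $n$ is the
number of observations.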
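The three coefficients can be computed directly from their formulas. The sketch below uses only the Python standard library and assumes there are no tied values (the simple Kendall and Spearman formulas above do not correct for ties). The function names and the study-hours data are hypothetical, chosen to echo the hours-versus-grades example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = sqrt(sum((a - mean_x) ** 2 for a in x)) * sqrt(sum((b - mean_y) ** 2 for b in y))
    return num / den


def kendall_tau(x, y):
    # Assumes no ties: every pair is either concordant or discordant.
    n = len(x)
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            n_c += 1
        elif sign < 0:
            n_d += 1
    return (n_c - n_d) / (n * (n - 1) / 2)


def ranks(values):
    # Rank 1 for the smallest value; assumes no ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(x, y):
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical data: hours studied and grades obtained by five students.
hours = [1, 2, 3, 4, 5]
grades = [52, 58, 57, 70, 75]

r = pearson(hours, grades)        # roughly 0.95
tau = kendall_tau(hours, grades)  # 9 concordant, 1 discordant pair: 0.8
rho_s = spearman(hours, grades)   # sum of squared rank differences is 2: 0.9
print(r, tau, rho_s)
```

All three values are positive and close to 1, which matches the intuition that grades rise with study time in this made-up sample.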
Interpreting Results Of Correlation Analysis
Any score from +0.5 to +1 indicates a strong positive correlation, meaning that
the variables are strongly related in a positive direction: they tend to increase
together.
Any score from -0.5 to -1 indicates a strong negative correlation, meaning that the
variables are strongly related in a negative direction: as one variable increases, the
other tends to decrease, and vice versa.
A score of 0 indicates no correlation, meaning there is no linear relationship (or, for
the rank-based coefficients, no monotonic relationship) between the analyzed variables.
Under this rule of thumb, scores closer to 0 indicate a progressively weaker relationship.
Correlation analysis is a powerful tool in data mining and statistical analysis that
offers several benefits.
Here are some of the most common use cases for association and
correlation in data mining:
Market Basket Analysis - Association mining is commonly used in retail and
e-commerce industries to identify patterns in customer purchase
behaviour. By analyzing transaction data, businesses can uncover product
associations and make informed decisions about product placement,
pricing, and marketing strategies.
Medical Research - Correlation analysis is often used in medical research to
explore relationships between different variables, such as the correlation
between smoking and lung cancer risk or the correlation between blood
pressure and heart disease.
Financial Analysis - Correlation analysis is frequently used in financial
analysis to measure the strength of relationships between different
financial variables, such as the correlation between stock prices and
interest rates.
Fraud Detection - Association mining can be used to identify behaviour
patterns associated with fraudulent activity, such as multiple failed login
attempts or unusual purchase patterns.
Data Warehousing and Data Mining
Mining multilevel and multidimensional association rules
Motivation (why students should learn these topics): CDT17 provides knowledge on
mining closed frequent item sets.
Lecture Learning Outcomes (LLOs): After completion of this lecture, you should be able to…
LLO1 (on topic 1): association rule mining types, multilevel and multidimensional.
Lecture Summary – Key Takeaways
Multilevel Association Rule:
Association rules created from mining data at different levels of abstraction are called
multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept hierarchies under a
support-confidence framework.
Rules at a high concept level may only restate common sense, while rules at a low concept
level may not always be useful.