UNIT 3
Knowledge Representation
Histograms
Data Visualization
i. Scatter-plot matrices
It consists of scatter plots of all possible pairs of variables in a dataset.
ii. Chernoff faces
• This concept was introduced by Herman Chernoff in 1973.
• The faces in Chernoff faces are built from facial expressions and features of a human
being, so it becomes easy to identify the differences between the faces (and hence
between data records).
• It includes the mapping of different data dimensions with different facial features.
For example, data dimensions can be mapped to the face width, the length of the mouth,
the length of the nose, and so on.
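As a small illustration of item (i) above, here is a minimal Python sketch that draws a
scatter-plot matrix with pandas. The dataset and column names are made up for
illustration only.

```python
# Minimal sketch: a scatter-plot matrix over a small, made-up dataset.
# Column names and values are illustrative, not from these notes.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "age":    [23, 45, 31, 52, 36, 28],
    "income": [30_000, 80_000, 52_000, 95_000, 61_000, 40_000],
    "spend":  [1_200, 3_500, 2_100, 4_000, 2_600, 1_500],
})

# One scatter plot for every pair of variables; the diagonal shows each
# variable's histogram (tying back to the Histograms heading above).
scatter_matrix(df, diagonal="hist", figsize=(6, 6))
plt.show()
```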
Data Mining Task Primitives
These primitives allow us to communicate with the data mining system in an interactive
manner. The data mining task primitives are as follows −
1) Set of task-relevant data to be mined
This is the portion of the database in which the user is interested. This portion includes the
following −
• Database Attributes
• Data Warehouse dimensions of interest
2) Kind of knowledge to be mined
• Characterization
• Discrimination
• Association and Correlation Analysis
• Classification
• Prediction
• Clustering
• Outlier Analysis
• Evolution Analysis
3) Background knowledge
The background knowledge allows data to be mined at multiple levels of abstraction. For
example, concept hierarchies are one kind of background knowledge that allows data to be
mined at multiple levels of abstraction, as in the sketch below.
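A minimal Python sketch of one such concept hierarchy; the location values here are
illustrative, not from this unit.

```python
# A concept hierarchy as background knowledge: city -> state -> country.
# Values are illustrative only.
hierarchy = {
    "Mumbai":      "Maharashtra",
    "Pune":        "Maharashtra",
    "Maharashtra": "India",
    "India":       None,  # top of the hierarchy
}

def generalize(value, levels=1):
    """Climb the hierarchy to mine at a higher level of abstraction."""
    for _ in range(levels):
        parent = hierarchy.get(value)
        if parent is None:
            break
        value = parent
    return value

print(generalize("Mumbai"))            # Maharashtra (one level up)
print(generalize("Mumbai", levels=2))  # India (two levels up)
```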
4) Interestingness measures and thresholds for pattern evaluation
These are used to evaluate the patterns that are discovered by the process of knowledge
discovery. There are different interestingness measures for different kinds of knowledge.
5) Representation for visualizing the discovered patterns
This refers to the form in which discovered patterns are to be displayed. These representations
may include the following −
• Rules
• Tables
• Charts
• Graphs
• Decision Trees
• Cubes
Association Rule Mining
➢ Association rule mining is a technique used to identify patterns in large data sets. It
involves finding relationships between variables in the data and using those
relationships to make predictions or decisions.
➢ The goal of association rule mining is to uncover rules that describe the relationships
between different items in the data set.
➢ Association rule mining typically involves using algorithms to analyze the data and
identify the relationships. These algorithms can be based on statistical methods or
machine learning techniques.
➢ The resulting rules are often expressed in the form of "if-then" statements, where the
"if" part represents the antecedent (the condition being tested) and the "then" part
represents the consequent (the outcome that occurs if the condition is met).
➢ By identifying associations between variables, association rule mining can help users
understand the relationships between different variables and how those variables may
be related to one another.
➢ This can be useful for various purposes, such as identifying market trends, detecting
fraudulent activity, or understanding customer behavior.
➢ Association rule mining can also be used as a stepping stone for other types of data
analysis, such as predicting outcomes or identifying key drivers of certain phenomena.
Overall, association rule mining is a valuable tool for extracting insights and
understanding the underlying structure of data.
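To make the "if-then" form concrete, here is a minimal Python sketch that evaluates one
rule on a made-up basket dataset; the items and numbers are illustrative only.

```python
# A minimal sketch of an "if-then" association rule, evaluated on a
# made-up basket dataset (items and numbers are illustrative).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions matching the IF part, how many also match THEN."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: IF {milk, bread} THEN {butter}
print(support({"milk", "bread", "butter"}))       # 0.4
print(confidence({"milk", "bread"}, {"butter"}))  # 0.666...
```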
Applications of Association Rule Mining
Association rule mining is commonly used in a variety of applications. Some common ones
are:
1. Market Basket Analysis
One of the most well-known applications of association rule mining is in market basket
analysis. This involves analyzing the items customers purchase together to understand their
purchasing habits and preferences.
For example, a retailer might use association rule mining to discover that customers who
purchase diapers are also likely to purchase baby formula. We can use this information to
optimize product placements and promotions to increase sales.
2. Customer Segmentation
Association rule mining can also be used to segment customers based on their purchasing
habits.
For example, a company might use association rule mining to discover that customers who
purchase certain types of products are more likely to be younger. Similarly, they could learn
that customers who purchase certain combinations of products are more likely to be located
in specific geographic regions.
3. Fraud Detection
You can also use association rule mining to detect fraudulent activity. For example, a credit
card company might use association rule mining to identify patterns of fraudulent transactions,
such as multiple purchases from the same merchant within a short period of time.
4. Social Network Analysis
Various companies use association rule mining to identify patterns in social media data that
can inform the analysis of social networks.
For example, an analysis of Twitter data might reveal that users who tweet about a particular
topic are also likely to tweet about other related topics, which could inform the identification
of groups or communities within the network.
5. Recommendation systems
Association rule mining can be used to suggest items that a customer might be interested in
based on their past purchases or browsing history. For example, a music streaming service
might use association rule mining to recommend new artists or albums to a user based on their
listening history.
Market Basket Analysis
➢ Market Basket Analysis is a data mining technique used to uncover purchase patterns
in any retail setting. In simple terms, market basket analysis in data mining is the
analysis of the combinations of products that are bought together.
➢ This technique carefully studies the purchases made by a customer in a
supermarket. It identifies the patterns of items frequently purchased together by
customers.
➢ This analysis helps companies promote deals, offers, and sales, and data mining
techniques help to achieve this analysis task.
➢ Example:
• Data mining concepts are in use for Sales and marketing to provide better
customer service, to improve cross-selling opportunities, to increase direct mail
response rates.
• Customer retention, through pattern identification and prediction of likely
defections, is possible with data mining.
• The risk assessment and fraud detection areas also use data-mining concepts to
identify inappropriate or unusual behaviour.
Market basket analysis mainly works with the ASSOCIATION RULE {IF} -> {THEN}.
• IF means Antecedent: An antecedent is an item found within the data
• THEN means Consequent: A consequent is an item found in combination with
the antecedent.
As stated above, the antecedent is the item set found in the data; it forms the {IF}
component of the rule. Likewise, the consequent is the item found in combination with the
antecedent; it forms the {THEN} component of the rule.
With the help of these, we are able to predict customer behavioural patterns, and we can
build combination offers around products that customers are likely to buy together. That in
turn increases the sales and revenue of the company.
There are three types of Market Basket Analysis. They are as follows:
1. Descriptive market basket analysis: This sort of analysis looks for patterns and
connections in the data that exist between the components of a market basket. This
kind of study is mostly used to understand consumer behaviour, including which
products are purchased in combination and what the most typical item combinations
are. Descriptive market basket analysis helps retailers place products in their
stores more profitably by revealing which products are frequently bought together.
2. Predictive market basket analysis: Market basket analysis that predicts future
purchases based on past purchasing patterns is known as predictive market basket
analysis. In this sort of analysis, large volumes of data are analyzed using machine
learning algorithms in order to make predictions about which products are most
likely to be bought together in the future. Predictive market basket analysis helps
retailers make data-driven decisions about which products to carry, how to price
them, and how to optimize shop layouts.
3. Differential market basket analysis: This sort of analysis compares purchase
patterns across different stores, customer groups, seasons, or time periods. By
studying how item combinations differ between these segments, retailers can spot
patterns that a single aggregated analysis would hide.
Applications of Market Basket Analysis
1. Retail: Market basket research is frequently used in the retail sector to examine
consumer buying patterns and inform decisions about product placement,
inventory management, and pricing tactics. Retailers can utilize market basket
research to identify which items are sluggish sellers and which ones are commonly
bought together, and then modify their inventory management strategy
accordingly.
2. E-commerce: Market basket analysis can help online merchants better
understand customers' buying habits and make data-driven decisions about
product recommendations and targeted advertising campaigns. The behaviour of
visitors to a website can be examined using market basket analysis to pinpoint
problem areas.
3. Finance: Market basket analysis can be used to evaluate investor behaviour and
forecast the types of investment items that investors will likely buy in the future.
The performance of investment portfolios can be enhanced by using this
information to create tailored investment strategies.
4. Telecommunications: To evaluate consumer behaviour and make data-driven
decisions about which goods and services to provide, the telecommunications
business might employ market basket analysis. Using this data can enhance
customer satisfaction and the shopping experience.
5. Manufacturing: To evaluate consumer behaviour and make data-driven
decisions about which products to produce and which materials to employ in the
production process, the manufacturing sector might use market basket analysis.
Utilizing this knowledge will increase effectiveness and cut costs.
IMP
Apriori Algorithm
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset,
called the candidate set C1.
(II) Compare each candidate set item's support count with the minimum support count
(here min_support = 2); if the support_count of a candidate set item is less than
min_support, remove that item. This gives us the itemset L1.
Step-2: K=2
• Generate candidate set C2 using L1 (this is called the join step). The condition
for joining Lk-1 and Lk-1 is that they should have (K-2) elements in common.
• Check whether all subsets of an itemset are frequent or not, and if not, remove
that itemset. (Example: the subsets of {I1, I2} are {I1} and {I2}, which are
frequent. Check this for each itemset.)
• Now find the support count of these itemsets by searching the dataset.
(II) Compare each candidate (C2) support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than
min_support, remove that item. This gives us the itemset L2.
Step-3: K=3
• Generate candidate set C3 using L2 (join step). The condition for joining Lk-1
and Lk-1 is that they should have (K-2) elements in common. So here, for L2,
the first element should match.
The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5},
{I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.
• Check whether all subsets of these itemsets are frequent or not, and if not,
remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}
and {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not
frequent, so remove it. Similarly check every itemset.)
• Find the support count of the remaining itemsets by searching the dataset.
(II) Compare each candidate (C3) support count with the minimum support count (here
min_support = 2); if the support_count of a candidate set item is less than
min_support, remove that item. This gives us the itemset L3.
Step-4: K=4
• Generate candidate set C4 using L3 (join step). The condition for joining Lk-1
and Lk-1 (K=4) is that they should have (K-2) elements in common. So here, for
L3, the first 2 elements (items) should match.
• Check whether all subsets of these itemsets are frequent or not. (Here the
itemset formed by joining L3 is {I1, I2, I3, I5}; its subsets include
{I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
• We stop here because no further frequent itemsets are found. (A sketch of the
join and prune steps follows.)
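A minimal Python sketch of the join and prune steps used above. The L2 itemsets are
chosen to be consistent with the C3 list shown in Step-3; support counting against the
dataset is omitted.

```python
# A minimal sketch of Apriori's join and prune steps as described above.
# Itemsets are sorted tuples; the L2 below matches the Step-3 walkthrough.
from itertools import combinations

def join(prev_frequent, k):
    """Join step: merge two (k-1)-itemsets sharing their first k-2 items."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                candidates.add(a[:k - 2] + (a[k - 2], b[k - 2]))
    return candidates

def prune(candidates, prev_frequent, k):
    """Prune step: drop candidates with any infrequent (k-1)-subset."""
    return {c for c in candidates
            if all(s in prev_frequent for s in combinations(c, k - 1))}

L2 = {("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")}
C3 = prune(join(L2, 3), L2, 3)
print(C3)  # only ('I1','I2','I3') and ('I1','I2','I5') survive the prune
```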
Thus, we have discovered all the frequent itemsets. Now the generation of strong
association rules comes into the picture. For that, we need to calculate the confidence
of each rule.
Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also
bought butter.
Confidence(A->B)=Support_count(A∪B)/Support_count(A)
So here, by taking one frequent itemset as an example, we show the rule generation.
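The dataset tables for this walkthrough are not reproduced in these notes, so the sketch
below assumes support counts for the frequent itemset {I1, I2, I5} and its subsets; it
enumerates every rule from that itemset and computes its confidence with the formula above.

```python
# A minimal sketch of rule generation from one frequent itemset.
# The support counts below are assumed for illustration; in practice they
# come from the L1/L2/L3 tables computed during the Apriori passes.
from itertools import combinations

support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

itemset = frozenset({"I1", "I2", "I5"})
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(itemset, r)):
        consequent = itemset - antecedent
        # Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
        conf = support_count[itemset] / support_count[antecedent]
        print(f"{set(antecedent)} -> {set(consequent)}: "
              f"confidence = {conf:.0%}")
```

Rules whose confidence meets a chosen minimum confidence threshold are kept as strong
association rules.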
Limitations of Apriori Algorithm
➢ The Apriori Algorithm can be slow. Its main limitation is the time required to hold
a vast number of candidate sets when there are many frequent itemsets, a low
minimum support, or large itemsets; i.e., it is not an efficient approach for
large amounts of data.
➢ For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate
more than 10^7 candidate 2-itemsets, which in turn must be tested and their support
counts accumulated.
➢ Furthermore, to detect a frequent pattern of size 100, i.e., {v1, v2, … v100}, it
has to generate 2^100 candidate itemsets, which makes candidate generation costly
and time-consuming.
➢ So it checks many candidate itemsets, and it also scans the database repeatedly to
find candidate itemsets.
➢ Apriori becomes very slow and inefficient when memory capacity is limited and the
number of transactions is large.
IMP
FP Growth Algorithm in Data Mining
➢ The FP-Growth Algorithm was proposed by Han. It is an efficient and scalable method
for mining the complete set of frequent patterns by pattern fragment growth, using
an extended prefix-tree structure, named the frequent-pattern tree (FP-tree), for
storing compressed and crucial information about frequent patterns.
➢ In his study, Han proved that his method outperforms other popular methods for mining
frequent patterns, e.g., the Apriori Algorithm and the TreeProjection.
➢ In some later works, it was proved that FP-Growth performs better than other methods,
including Eclat and Relim.
➢ The popularity and efficiency of the FP-Growth Algorithm contribute to many studies
that propose variations to improve its performance.
➢ The FP-Growth Algorithm is an alternative way to find frequent item sets without using
candidate generations, thus improving performance.
➢ To do so, it uses a divide-and-conquer strategy. The core of this method is the usage
of a special data structure named the frequent-pattern tree (FP-tree), which retains
the item set association information.
o First, it compresses the input database by creating an FP-tree instance to represent
frequent items.
o After this first step, it divides the compressed database into a set of conditional
databases, each associated with one frequent pattern.
o Finally, each such database is mined separately.
Using this strategy, the FP-Growth reduces the search costs by recursively looking for short
patterns and then concatenating them into the long frequent patterns.
In large databases, holding the FP-tree in main memory may be impossible. A strategy to cope
with this problem is to partition the database into a set of smaller databases (called projected
databases) and then construct an FP-tree from each of these smaller databases.
FP-Tree
➢ The frequent-pattern tree (FP-tree) is a compact data structure that stores quantitative
information about frequent patterns in a database. Each transaction is read and then
mapped onto a path in the FP-tree.
➢ This is done until all transactions have been read. Different transactions with common
subsets allow the tree to remain compact because their paths overlap.
➢ A frequent Pattern Tree is made with the initial item sets of the database. The purpose
of the FP tree is to mine the most frequent pattern. Each node of the FP tree represents
an item of the item set.
➢ The root node represents null, while the lower nodes represent the item sets. The
associations of the nodes with the lower nodes, that is, the item sets with the other item
sets, are maintained while forming the tree.
Additionally, the frequent-item-header table can hold the support count for each item.
Algorithm by Han
The original algorithm to construct the FP-tree was defined by Han. Using this algorithm,
the FP-tree is constructed in two database scans: the first scan collects and sorts the
set of frequent items, and the second constructs the FP-tree.
Example
Table 1: Transaction database

Transaction   Items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4
Table 2: Count of each item (first database scan)

Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2
Table 3: Frequent items sorted in descending order of count (the infrequent item I5 is
removed)

Item   Count
I2     5
I1     4
I3     4
I4     4
Build FP Tree
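The notes show the finished tree as a diagram. Below is a minimal Python sketch that
builds the same FP-tree from Table 1 in two scans; a minimum support count of 3 is an
assumption chosen to match Table 3 above (it drops I5).

```python
# A minimal sketch: build the FP-tree for Table 1 in two database scans.
# MIN_SUPPORT = 3 is an assumption chosen to match Table 3.
from collections import defaultdict

transactions = [
    ["I1", "I2", "I3"], ["I2", "I3", "I4"], ["I4", "I5"],
    ["I1", "I2", "I4"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3", "I4"],
]
MIN_SUPPORT = 3

# Scan 1: collect support counts and rank frequent items by descending count.
counts = defaultdict(int)
for t in transactions:
    for item in t:
        counts[item] += 1
ranked = sorted((i for i in counts if counts[i] >= MIN_SUPPORT),
                key=lambda i: -counts[i])
order = {item: rank for rank, item in enumerate(ranked)}  # I2, I1, I3, I4

# Scan 2: map each transaction onto a path; shared prefixes overlap,
# which is what keeps the tree compact. The root (null) is implicit.
root = {}  # node = {item: [count, children_dict]}
for t in transactions:
    node = root
    for item in sorted((i for i in t if i in order), key=order.get):
        child = node.setdefault(item, [0, {}])
        child[0] += 1
        node = child[1]

def show(node, depth=0):
    """Print each branch with its count, highest-count children first."""
    for item, (count, children) in sorted(node.items(),
                                          key=lambda kv: -kv[1][0]):
        print("  " * depth + f"{item}:{count}")
        show(children, depth + 1)

show(root)  # I2:5 heads the main branch, matching Table 3
```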
Advantages of FP Growth Algorithm
Here are some advantages of the FP growth algorithm:
o This algorithm needs to scan the database only twice, compared to Apriori, which
scans the transactions once for each iteration.
o The pairing of items is not done in this algorithm, making it faster.
o The database is stored in a compact version in memory.
o It is efficient and scalable for mining both long and short frequent patterns.
Apriori and FP-Growth algorithms are the most basic FIM algorithms. There are some basic
differences between these algorithms, such as:
1. Apriori generates frequent patterns by making itemsets using pairings such as single
itemset, double itemset, and triple itemset, whereas FP Growth generates an FP-tree
for making frequent patterns.
2. Apriori uses candidate generation, where frequent subsets are extended one item at a
time, whereas FP-Growth generates a conditional FP-tree for every item in the data.
3. Since Apriori scans the database in each step, it becomes time-consuming for data
where the number of items is large, whereas the FP-tree requires only one database
scan in its beginning steps, so it consumes less time.
4. In Apriori, a converted version of the database is saved in memory, whereas in FP
Growth, a set of conditional FP-trees for every item is saved in memory.