WEB DATA MINING
UNIT-1
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports-Extended Model
• Mining Algorithm
• Rule Generation
What is Data Mining?
• Data mining is also called knowledge discovery in databases (KDD).
• It is commonly defined as the process of discovering useful patterns
or knowledge from data sources, e.g., databases, texts, images, the
Web, etc.
• Data mining is a multi-disciplinary field involving machine learning,
statistics, databases, artificial intelligence, information retrieval, and
visualization.
• There are many data mining tasks. Some of the common ones are:
1. Supervised learning (or classification),
2. Unsupervised learning (or clustering),
3. Association rule mining, and
4. Sequential pattern mining.
Data mining is typically performed in three main steps:
1. Pre-processing: The raw data is usually not suitable for mining for
various reasons. It may need to be cleaned to remove noise or
abnormalities. The data may also be too large and/or involve many
irrelevant attributes, which calls for data reduction through sampling
and attribute or feature selection.
2. Data mining: The processed data is then fed to a data mining
algorithm which will produce patterns or knowledge.
3. Post-processing: In many applications, not all discovered patterns
are useful. This step identifies those useful ones for applications.
Various evaluation and visualization techniques are used to make the
decision.
Applications:
• Data mining has a wide range of applications across various
industries, including
• Marketing,
• Finance,
• Healthcare, and
• Telecommunications.
• For example, in healthcare, it can be used to identify risk factors for
diseases and develop personalized treatment plans.
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports
• Mining class association rules
• Summary
What is Web Mining?
• Web mining aims to discover useful information or knowledge from
the Web hyperlink structure, page content, and usage data.
• Web mining can be broadly divided into three types of mining
techniques:
1. Web Structure Mining,
2. Web Content Mining, and
3. Web Usage Mining.
Web Structure Mining:
• Web structure mining is the process of discovering structure
information from the Web.
• The structure of the Web graph consists of web pages as nodes, and
hyperlinks as edges connecting related pages.
• Structure mining essentially produces a structured summary of a
particular website.
• For example, from the links, we can discover important Web pages,
which is a key technology used in search engines.
• We can also discover communities of users who share common
interests.
Web Content Mining:
• Web content mining is the process of extracting useful information
from the content of web documents.
• Web content consists of several types of data: text, images, audio, video, etc.
• Content data is the collection of facts a web page was designed to
convey to its users. It can yield effective and interesting patterns about
user needs.
• Mining text documents is related to text mining, machine learning and
natural language processing.
• This type of mining, also known as text mining, scans and mines the
text, images and groups of web pages according to the content of the
input.
Web Usage Mining:
• Web usage mining involves analyzing user behavior on the web,
including clickstream data, search queries, and other interactions with
web pages.
• Web usage mining can help identify user preferences, behavior
patterns, and trends.
• This information can be used to personalize content, improve website
design, and target advertising.
• Web usage mining can also be used for security purposes, such as
detecting fraud and identifying potential security threats.
Applications of Web Mining:
• The applications of web mining are wide-ranging and include:
• E-commerce:
Web mining is used to analyze user behavior on e-commerce websites,
including purchase history, search queries, and clickstream data. This
information can be used to optimize website design, personalize product
recommendations, and improve customer experience.
• Search engine optimization:
Web mining can be used to analyze search engine queries and search engine
results pages (SERPs). This information can be used to improve the visibility of
websites in search engine results and increase traffic to the website.
• Fraud detection:
Web mining can be used to detect fraudulent activity on websites. This
information can be used to prevent financial fraud, identity theft, and other
types of online fraud.
• Sentiment analysis:
• Web mining can be used to analyze social media data and extract
sentiment from posts, comments, and reviews. This information can
be used to understand customer sentiment towards products and
services and make informed business decisions.
Process of Web Mining:
• The process of web mining typically involves the following steps -
• Data collection -
Web data is collected from various sources, including web pages, databases, and APIs.
• Data pre-processing -
The collected data is pre-processed to remove irrelevant information, such as
advertisements and duplicate content.
• Data integration -
The pre-processed data is integrated and transformed into a structured format for
analysis.
• Pattern discovery -
Web mining techniques are applied to identify patterns, trends, and relationships.
• Evaluation -
The discovered patterns are evaluated to determine their significance and usefulness.
• Visualization -
The analysis results are visualized through graphs, charts, and other visualizations.
Difference Between Data Mining and Web Mining:
Parameter | Data Mining | Web Mining
Definition | The process of discovering patterns in large datasets | The process of discovering patterns in web data
Data Source | Databases, data warehouses, and other data repositories | Web pages, weblogs, social media, and other web-related data sources
Data Characteristics | Structured, semi-structured, and unstructured data | Mostly unstructured data
Techniques | Clustering, classification, association rules, regression, etc. | Text mining, natural language processing, image analysis, link analysis, etc.
Applications | Marketing, finance, healthcare, etc. | E-commerce, social media, search engines, etc.
Challenges | Data quality, scalability, and privacy concerns | Data heterogeneity, ambiguity, and the dynamic nature of the web
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports
• Mining class association rules
• Summary
Association Rules and Sequential Patterns:
• Association rule mining is a fundamental data mining task.
• It was proposed by Agrawal et al. in 1993.
• Its objective is to find all co-occurrence relationships, called
associations, among data items.
• The classic application of association rule mining is market basket
data analysis,
• which aims to discover how items purchased by customers in a
supermarket (or a store) are associated.
Basic Concepts of Association Rules:
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

1. Itemset: a collection of one or more items.
   Example: {Milk, Bread, Diaper}
2. Support count (σ): frequency of occurrence of an itemset.
   Example: σ({Milk, Bread, Diaper}) = 2
3. Support: fraction of transactions that contain an itemset.
   Example: s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
– An itemset whose support is greater than or equal to minsup
threshold.
• Association Rule
– An implication expression of the form X -> Y, where X and Y are any
two itemsets.
– Example: {Milk, Diaper} -> {Beer}
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t: t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T = {t1,
t2, …, tn}.
Transaction data: supermarket data
• Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
• Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it may have TID (transaction
ID)
• A transactional dataset: A set of transactions
The model: rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
• An itemset is a set of items.
• E.g., X = {milk, bread, cereal} is an itemset.
• A k-itemset is an itemset with k items.
• E.g., {milk, bread, cereal} is a 3-itemset
Rule strength measures:
• Support: The rule holds with support sup in T (the transaction data
set) if sup% of transactions contain X ∪ Y.
• sup = Pr(X ∪ Y).
• Confidence: The rule holds in T with confidence conf if conf% of
transactions that contain X also contain Y.
• conf = Pr(Y | X)
• An association rule is a pattern that states when X occurs, Y occurs
with certain probability.
Support and Confidence:
• Support count: The support count of an itemset X, denoted by
X.count, in a data set T is the number of transactions in T that
contain X. Assume T has n transactions.
• Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
• A rule is strong if its support and confidence are greater than or equal
to the user-specified minimum support (denoted by minsup) and
minimum confidence (denoted by minconf).
• Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
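As a quick illustration, here is a minimal Python sketch of these two measures on the five transactions above (the names are ours, not library code):

```python
# Minimal sketch: support and confidence over the 5-transaction example.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

n = len(transactions)
X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / n                                # 2/5
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3
print(s, round(c, 2))  # 0.4 0.67
```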
Goal and key features
• Goal: Find all rules that satisfy the user-specified
minimum support (minsup) and minimum confidence
(minconf).
• Key Features
• Completeness: find all rules.
• No target item(s) on the right-hand-side
• Mining with data on hard disk (not in memory)
An example
• Transaction data:
t1: Beer, Chicken, Milk
t2: Beer, Cheese
t3: Cheese, Boots
t4: Beer, Chicken, Cheese
t5: Beer, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
• Assume: minsup = 30%, minconf = 80%
• An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports-Extended Model
• Mining Algorithm
• Rule Generation
The Apriori algorithm:
• The best known algorithm
• Two steps:
1. Find all itemsets that have minimum support (frequent itemsets, also called
large itemsets).
2. Use frequent itemsets to generate rules.
• E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Step 1: Mining all frequent itemsets
• A frequent itemset is an itemset whose support is ≥
minsup.
• Key idea: the apriori property (downward closure
property): every subset of a frequent itemset is also a
frequent itemset.
(Illustration: the lattice of subsets of {A, B, C, D}. If {A, B, C} is
frequent, then {A, B}, {A, C}, {B, C}, {A}, {B}, and {C} must all be
frequent.)
The Algorithm
• Iterative algorithm (also called level-wise search): find all 1-
item frequent itemsets; then all 2-item frequent itemsets,
and so on.
• In each iteration k, only consider candidate itemsets built
from frequent (k−1)-itemsets.
• Find frequent itemsets of size 1: F1
• From k = 2:
• Ck = candidates of size k: those itemsets of size k that could
be frequent, given Fk-1
• Fk = those itemsets that are actually frequent, Fk ⊆ Ck
(need to scan the database once).
Apriori candidate generation
• The candidate-gen function takes Fk-1 and returns a
superset (called the candidates) of the set of all
frequent k-itemsets. It has two steps:
• join step: generate all possible candidate itemsets Ck of
length k by joining pairs of itemsets in Fk-1.
• prune step: remove those candidates in Ck that cannot
be frequent because some (k−1)-subset is not in Fk-1.
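A sketch of these two steps in Python, assuming itemsets are stored as sorted tuples (the function name and representation are illustrative, not the textbook's code):

```python
from itertools import combinations

def candidate_gen(F_prev, k):
    """Join frequent (k-1)-itemsets sharing their first k-2 items,
    then prune candidates that have an infrequent (k-1)-subset."""
    F_prev = set(F_prev)
    Ck = set()
    for a in F_prev:
        for b in F_prev:
            # join step: same first k-2 items, differing last item
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                c = a + (b[k - 2],)
                # prune step: every (k-1)-subset of c must be frequent
                if all(s in F_prev for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck
```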
Details: the algorithm
Algorithm Apriori(T)
  C1 ← init-pass(T);
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup}; // n: no. of transactions in T
  for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck ← candidate-gen(Fk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ minsup}
  end
  return F ← ∪k Fk;
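A compact Python rendering of this loop, reusing candidate_gen from the sketch above (illustrative only; a real implementation counts all candidates of a level in a single pass over the data rather than scanning per candidate):

```python
def apriori(transactions, minsup):
    """Return all frequent itemsets (as sorted tuples) with support >= minsup."""
    n = len(transactions)

    def sup(itemset):  # support count by scanning the data
        return sum(1 for t in transactions if set(itemset) <= t)

    items = sorted({i for t in transactions for i in t})
    F = {1: {(i,) for i in items if sup((i,)) / n >= minsup}}  # F1
    k = 2
    while F[k - 1]:
        Ck = candidate_gen(F[k - 1], k)
        F[k] = {c for c in Ck if sup(c) / n >= minsup}
        k += 1
    return set().union(*F.values())
```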
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports-Extended Model
• Mining Algorithm
• Rule Generation
Frequent Itemset Generation:
Consider the following dataset; we will find its frequent
itemsets and generate association rules from them.
Minimum support count = 2; minimum confidence = 60%.
TID | Items
T1 | I1, I2, I5
T2 | I2, I4
T3 | I2, I3
T4 | I1, I2, I4
T5 | I1, I3
T6 | I2, I3
T7 | I1, I3
T8 | I1, I2, I3, I5
T9 | I1, I2, I3
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set):
C1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
• (II) Compare each candidate itemset's support count with the
minimum support count (here min_support = 2); if a candidate's
support count is less than min_support, remove it.
• This gives us the itemset L1. Here every item meets the threshold, so
L1 = C1.
Step-2: K=2
Generate candidate set C2 using L1 (this is called the join step). The condition for joining two
itemsets from Lk-1 is that they have (K-2) elements in common.
Check whether all subsets of each candidate itemset are frequent; if not, remove that candidate.
(Example: the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check each itemset this way.)
Now find the support count of these itemsets by searching the dataset:
L2: {I1, I2}: 4, {I1, I3}: 4, {I1, I5}: 2, {I2, I3}: 4, {I2, I4}: 2, {I2, I5}: 2
(the remaining pairs fall below min_support and are removed).
Step-3:
Generate candidate set C3 using L2 (join step). The
condition for joining is that the itemsets have (K-2)
elements in common. Candidates whose subsets are
not all frequent are pruned (e.g., {I1, I3, I5} is pruned
because {I3, I5} is not frequent).
Find the support count of the remaining itemsets by
searching the dataset:
L3: {I1, I2, I3}: 2, {I1, I2, I5}: 2
Step-4:
• Generate candidate set C4 using L3 (join step). The condition of joining (K=4)
is that the itemsets have (K-2) elements in common. So here, for L3, the first
2 items should match.
• Check whether all subsets of these itemsets are frequent (here the itemset
formed by joining L3 is {I1, I2, I3, I5}, and its subset {I1, I3, I5} is not
frequent). So there is no itemset in C4.
• We stop here because no further frequent itemsets are found.
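Running the apriori sketch from the previous section on this dataset reproduces the walkthrough (a minimum support count of 2 over 9 transactions corresponds to minsup = 2/9):

```python
T = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
     {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
     {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]

frequent = apriori(T, minsup=2 / 9)
print(sorted(f for f in frequent if len(f) == 3))
# [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]  -- and C4 comes out empty
```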
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports-Extended Model
• Mining Algorithm
• Rule Generation
Association Rule Generation:
• Generating rules from frequent itemsets:
• Generate strong association rules from the frequent itemsets.
• For a frequent itemset with k items, we need to generate 2^k − 2
candidate rules (one for each non-empty proper subset used as the
antecedent).
• Then we need to calculate the confidence of each rule:
Confidence(A->B) = Support_count(A∪B) / Support_count(A)
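A sketch of this enumeration in Python (our own helper, not library code): it walks all 2^k − 2 splits of a frequent itemset and keeps the rules meeting minconf. With minconf = 50% on the dataset above, it returns exactly the first three rules worked out below.

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, minconf):
    """All rules A -> B with A union B = itemset and confidence >= minconf."""
    itemset = frozenset(itemset)

    def sup(s):
        return sum(1 for t in transactions if s <= t)

    rules = []
    for r in range(1, len(itemset)):            # non-empty proper subsets
        for A in map(frozenset, combinations(itemset, r)):
            conf = sup(itemset) / sup(A)        # sup(A u B) / sup(A)
            if conf >= minconf:
                rules.append((set(A), set(itemset - A), conf))
    return rules
```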
By taking an example of any frequent itemset, we will
show the rule generation.
Itemset {I1, I2, I3} //from L3
• So the candidate rules are:
1. [I1^I2]=>[I3]
confidence = sup(I1^I2^I3)/sup(I1^I2) = 2/4*100=50%
2. [I1^I3]=>[I2]
confidence = sup(I1^I2^I3)/sup(I1^I3) = 2/4*100=50%
3. [I2^I3]=>[I1]
confidence = sup(I1^I2^I3)/sup(I2^I3) = 2/4*100=50%
4. [I1]=>[I2^I3]
confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
5. [I2]=>[I1^I3]
confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
6. [I3]=>[I1^I2]
confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
So if minimum confidence is 50%, then first 3 rules can be considered
as strong association rules.
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports
• Mining class association rules
• Summary
Different data formats for mining
• The data can be in transaction form or table form.
Transaction form:
a, b
a, c, d, e
a, d, f
Table form:
Attr1 | Attr2 | Attr3
a | b | d
b | c | e
• Table data needs to be converted to transaction form
for association mining.
From a table to a set of transactions
Table form:
Attr1 | Attr2 | Attr3
a | b | d
b | c | e
Transaction form:
(Attr1, a), (Attr2, b), (Attr3, d)
(Attr1, b), (Attr2, c), (Attr3, e)
candidate-gen can be slightly improved. Why?
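A one-line sketch of the conversion in Python, using (attribute, value) pairs as items:

```python
rows = [("a", "b", "d"), ("b", "c", "e")]
attrs = ("Attr1", "Attr2", "Attr3")

# each table row becomes a transaction of (attribute, value) items
transactions = [{(a, v) for a, v in zip(attrs, row)} for row in rows]
# [{('Attr1', 'a'), ('Attr2', 'b'), ('Attr3', 'd')},
#  {('Attr1', 'b'), ('Attr2', 'c'), ('Attr3', 'e')}]
```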
Mining with Multiple
Minimum Supports
Problems with the association mining
1. The key element that makes association rule mining practical
is the minsup threshold.
2. It is used to limit the number of frequent itemsets and rules
generated.
3. In many applications, some items appear very frequently in
the data, while other items rarely appear.
4. If the frequencies of items vary a great deal, we will
encounter two problems:
   1. Rare items
   2. Combinatorial explosion
Rare items:
• If the minsup is set too high, we will not find rules that
involve infrequent items or rare items in the data.
Combinatorial Explosion:
• To find rules that involve both frequent and rare items,
minsup has to be set very low. This may cause combinatorial
explosion because those frequent items will be associated
with one another in all possible ways.
For example:
• Consider frequent itemset f1
f1: {Bread, Cheese, Egg, Bagel, Milk, Sugar, Butter} [sup =
0.007%],
• Knowing that 0.007% of the customers buy the seven items
in f1 together is useless because all these items are so
frequently purchased in a supermarket.
• they will almost certainly cause combinatorial explosion!
• A better solution is to allow the user to specify multiple
minimum supports, i.e., to specify a different minimum item
support (MIS) to each item.
• This method helps solve the problem of f1.
• f2: {Bread, Egg, Milk, CookingPan} [sup = 0.006%]
• To prevent very frequent items and very rare items from
appearing in the same itemset, a constraint will be
introduced, known as the support difference constraint.
• support difference constraint:
max_{i∈S} {sup(i)} − min_{i∈S} {sup(i)} ≤ φ,
• where 0 ≤ φ ≤ 1 is the user-specified maximum support
difference
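As a sketch, the constraint is a simple filter on an itemset's item supports (the function name and phi parameter are our own):

```python
def satisfies_support_difference(itemset, item_support, phi):
    """True if the item supports within the itemset differ by at most phi."""
    sups = [item_support[i] for i in itemset]
    return max(sups) - min(sups) <= phi
```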
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports-Extended Model
• Mining Algorithm
• Rule Generation
Extended Model:
• To allow multiple minimum supports, the original model needs to be
extended.
• In the extended model, the minimum support of a rule is expressed in
terms of minimum item supports (MIS) of the items that appear in
the rule.
• That is, each item in the data can have a MIS value specified by the
user
• By providing different MIS values for different items, the user
effectively expresses different support requirements for different
rules.
• Let MIS(i) be the MIS value of item i.
• The minimum support (minsup) of a rule R is the lowest MIS value
among the items in the rule.
That is, a rule R,
i1, i2, …, ik → ik+1, …, ir,
satisfies its minimum support if its actual support is ≥
min(MIS(i1), MIS(i2), …, MIS(ir)).
• Minimum item supports thus enable us to
achieve the goal of having higher minimum
supports for rules that involve only frequent
items,
• and having lower minimum supports for rules
that involve less frequent items.
Example:
• Consider the following items: bread, shoes, clothes.
The user-specified MIS values are as follows:
MIS(bread) = 2%, MIS(shoes) = 0.1%, MIS(clothes) = 0.2%
The following rule does not satisfy its minsup:
clothes → bread [sup = 0.15%, conf = 70%]
This is so because min(MIS(bread), MIS(clothes)) = 0.2%; a rule
requires sup ≥ minsup, and here 0.15% < 0.2%.
The following rule satisfies its minsup:
clothes → shoes [sup = 0.15%, conf = 70%]
because min(MIS(clothes), MIS(shoes)) = 0.1% and 0.15% ≥ 0.1%.
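In code, a rule's minsup under the extended model is just the minimum MIS over its items; a small sketch on the example values (names are ours):

```python
MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def rule_minsup(items):
    return min(MIS[i] for i in items)

sup = 0.0015                                      # 0.15%
print(sup >= rule_minsup(["clothes", "bread"]))   # False: minsup is 0.2%
print(sup >= rule_minsup(["clothes", "shoes"]))   # True:  minsup is 0.1%
```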
Downward closure property
• In the new model, the downward closure property no longer
holds.
E.g., consider four items 1, 2, 3 and 4 in a database. Their
minimum item supports are:
MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%
If we find that itemset {1, 2} has a support of 9% at level 2,
then it satisfies neither MIS(1) nor MIS(2).
Using the Apriori algorithm, this itemset is discarded since it
is not frequent.
• Then, the potentially frequent itemsets {1, 2, 3} and {1, 2, 4} will not
be generated for level 3.
• but {1, 2, 3} and {1, 2, 4} could be frequent.
• because MIS(3) is only 5% and MIS(4) is 6%. It is thus wrong to discard
{1, 2}. However, if we do not discard {1, 2}, the downward closure
property is lost.
• The MS-Apriori mining algorithm solves this problem.
• The essential idea is to sort the items according to their MIS values in
ascending order to avoid the problem.
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports
• Mining Algorithm
• Summary
Mining Algorithm(MS-Apriori):
• The new algorithm generalizes the Apriori algorithm for finding
frequent itemsets.
• We call the algorithm, MS-Apriori.
• When there is only one MIS value (for all items), it reduces to the
Apriori algorithm.
• Like Apriori, MS-Apriori is also based on level-wise search. It
generates all frequent itemsets by making multiple passes over the
data.
• The key operation in the new algorithm is the
sorting of the items in I in ascending order of their
MIS values
• The order is used throughout the algorithm in each
itemset.
• Let Fk denote the set of frequent k-itemsets. Each
itemset w is of the following form,
{w[1], w[2], …, w[k]}, which consists of items,
w[1], w[2], …, w[k],
where MIS(w[1]) ≤ MIS(w[2]) ≤ … ≤ MIS(w[k]).
The MS-Apriori algorithm
Algorithm MSapriori(T, MS, φ) // φ is for the support difference constraint
1.  M ← sort(I, MS); // according to MIS(i)'s stored in MS
2.  L ← init-pass(M, T); // make the first pass over T
3.  F1 ← {{i} | i ∈ L, i.count/n ≥ MIS(i)}; // n is the size of T
4.  for (k = 2; Fk-1 ≠ ∅; k++) do
5.    if k = 2 then
6.      Ck ← level2-candidate-gen(L, φ)
7.    else Ck ← MScandidate-gen(Fk-1, φ);
8.    end;
9.    for each transaction t ∈ T do
10.     for each candidate c ∈ Ck do
11.       if c is contained in t then
12.         c.count++;
13.       if c − {c[1]} is contained in t then
14.         c.tailCount++
15.     end
16.   end
17.   Fk ← {c ∈ Ck | c.count/n ≥ MIS(c[1])}
18. end
19. return F ← ∪k Fk;
• Line 1 performs the sorting on I according to the MIS value of each
item (stored in MS).
• Line 2 makes the first pass over the data using the function init-pass(),
which takes two arguments, the data set T and the sorted items M, to
produce the seeds L for generating candidate itemsets of length 2,
i.e., C2.
• init-pass() has two steps:
• 1. It first scans the data once to record the support count of each
item.
• 2. It then follows the sorted order to find the first item i in M that
meets MIS(i). i is inserted into L.
• For each subsequent item j in M after i, if j.count/n ≥ MIS(i), then j is
also inserted into L, where j.count is the support count of j, and n is
the total number of transactions in T.
• Frequent 1-itemsets (F1) are obtained from L (line 3). It is easy to
show that all frequent 1-itemsets are in F1.
Candidate itemset generation
• Special treatments needed:
• Sorting the items according to their MIS
values
• First pass over data
• Candidate generation at level-2
• Pruning step in level-k (k > 2) candidate
generation.
First pass over data:
• It makes a pass over the data to record the support
count of each item.
• It then follows the sorted order to find the first item i
in M that meets MIS(i).
• i is inserted into L.
• For each subsequent item j in M after i, if j.count/n ≥
MIS(i), then j is also inserted into L, where j.count is
the support count of j and n is the total number of
transactions in T.
• L is used by function level2-candidate-gen
Example:
• Consider the four items 1, 2, 3 and 4 in a data set. Their minimum item
supports are:
MIS(1) = 10% MIS(2) = 20%
MIS(3) = 5% MIS(4) = 6%
• Assume our data set has 100 transactions. The first pass gives us the
following support counts:
{3}.count = 6, {4}.count = 3,
{1}.count = 9, {2}.count = 25.
• Then L = {3, 1, 2}, and F1 = {{3}, {2}}
• Item 4 is not in L because 4.count/n < MIS(3) (= 5%),
• {1} is not in F1 because 1.count/n < MIS(1) (= 10%).
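A sketch of init-pass on these numbers, rendering the two steps described above in Python (names are ours):

```python
MIS = {1: 0.10, 2: 0.20, 3: 0.05, 4: 0.06}
count = {1: 9, 2: 25, 3: 6, 4: 3}
n = 100

M = sorted(MIS, key=MIS.get)    # items in ascending MIS order: [3, 4, 1, 2]
L, first_mis = [], None
for i in M:
    if first_mis is None:                  # find first item meeting its own MIS
        if count[i] / n >= MIS[i]:
            L.append(i)
            first_mis = MIS[i]
    elif count[i] / n >= first_mis:        # later items need only MIS of that item
        L.append(i)

F1 = [i for i in L if count[i] / n >= MIS[i]]
print(L, F1)                               # [3, 1, 2] [3, 2]
```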
Road map
• Data Mining
• Web Mining
• Basic concepts of Association Rules
• Apriori algorithm
• Frequent item set generation
• Association Rule Generation
• Different data formats for mining
• Mining with multiple minimum supports
• Mining Algorithm
• Rule Generation
Rule Generation:
• Association rules are generated from frequent itemsets.
• In the case of MS-Apriori, recording only the support count
of each frequent itemset is not sufficient.
• The following two lines in the MS-Apriori algorithm are important
for rule generation; they are not needed for the Apriori
algorithm:
if c − {c[1]} is contained in t then
  c.tailCount++
• Many rules cannot be generated without them.
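The reason is that c − {c[1]} may not be frequent on its own (once the lowest-MIS item c[1] is removed, the remaining items face higher MIS values), so its support count would otherwise be unknown. With tailCount recorded, the confidence of the rule (c − {c[1]}) → {c[1]} can be computed directly; a minimal sketch with assumed field names:

```python
def head_rule_confidence(c_count, tail_count):
    """Confidence of (c - {c[1]}) -> {c[1]}:
    c_count is the support count of c, tail_count that of c - {c[1]}."""
    return c_count / tail_count
```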