Web Mining: G.Anuradha References From Dunham
G.Anuradha
References from Dunham
Objective
• What is web mining?
• Taxonomy of web mining?
• Web content mining
• Web structure mining
• Web usage mining
What is web mining?
• Mining of data related to WWW
– Data present in Web pages or data related to web
activity
• Web data is classified into
– Content of web pages
– Intrapage structure (the HTML or XML code of a page) and
interpage structure (the actual linkage between pages)
– Usage data: how the pages are used by visitors
– User profiles
Taxonomy of Web Mining
Web Content Mining
• Extension of basic search engines
• Search engines are keyword-based
• Traditional search engines use
– crawlers to search the Web and gather information
– indexing techniques to store the information
– query processing to provide fast and accurate
information to users
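The indexing and query-processing steps above can be sketched with a tiny inverted index. This is only an illustration under stated assumptions: the page identifiers and texts are made up, and `build_index` and `query` are hypothetical helper names, not part of any search engine.

```python
from collections import defaultdict

# Hypothetical pages, standing in for what a crawler would have gathered.
pages = {
    "p1": "web mining uses data mining",
    "p2": "search engines index the web",
    "p3": "data mining finds patterns",
}

def build_index(pages):
    """Indexing: map each keyword to the set of pages that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def query(index, keywords):
    """Query processing: return pages that contain every keyword."""
    postings = [index.get(word.lower(), set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

index = build_index(pages)
both = query(index, ["data", "mining"])
```

Here `both` holds the pages that match both keywords, which is the keyword-based behaviour the slide describes.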
Taxonomy of Web content mining
[Figure: link graph with nodes numbered 1 to 7]
• Hubs and authorities
Step By Step HITS-1
• Determine a base set S
• Let the set of documents returned by a standard
search engine be called the root set R
• Initialize S to R
• Note: The hub weights are computed from the current authority
weights, which were computed from the previous hub weights.
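The update order described in the note can be sketched as follows. This is a minimal illustration, assuming a small hypothetical link graph, a fixed number of iterations, and L2 normalization after each round; the names `hits`, `links`, and `iterations` are illustrative and do not come from the slides.

```python
import math

def hits(links, iterations=20):
    """Return (authority, hub) weights for a small link graph.

    links maps each page to the list of pages it points to (made-up data).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of the current authority weights of the linked pages.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both weight vectors so the values stay bounded.
        for weights in (auth, hub):
            norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return auth, hub

# A and B only link out (hubs); C and D are only linked to (authorities).
links = {"A": ["C", "D"], "B": ["C", "D"], "C": [], "D": []}
auth, hub = hits(links)
```

In this toy graph the pages that only link out end up with all the hub weight, and the pages that are only linked to end up with all the authority weight.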
[Figure: Authority and Hubness plotted against an axis numbered 1 to 15]
Selime Işık-Büşra İpek
HITS vs PageRank
• HITS emphasizes mutual reinforcement between authority
and hub webpages, while PageRank does not attempt to
capture the distinction between hubs and authorities. It ranks
pages just by authority.
• HITS is applied to the local neighborhood of pages
surrounding the results of a query whereas PageRank is
applied to the entire web
• HITS is query dependent but PageRank is query-independent
• P: pages; U: users
What is session?
• Ordered list of pages accessed by a user
{<p1,t1>, <p2,t2>, …, <pn,tn>}
• Each session has a unique identifier called the
session ID.
• The length of a session is the number of pages in it,
denoted by len(S)
• Let D be a database of all sessions; the length
of D is the sum of len(S) over all sessions S in D
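The definitions above can be written down directly. The sketch below assumes each session is stored as an ordered list of (page, timestamp) pairs keyed by session ID; the page names, timestamps, and the function names `session_len` and `database_len` are made up for illustration.

```python
# Session ID -> ordered list of <page, timestamp> pairs (illustrative data).
sessions = {
    "s1": [("A", 0), ("B", 12), ("C", 30)],
    "s2": [("B", 5), ("D", 9)],
}

def session_len(session):
    """len(S): the number of pages in one session."""
    return len(session)

def database_len(db):
    """Length of D: the sum of len(S) over all sessions S in D."""
    return sum(session_len(s) for s in db.values())

total = database_len(sessions)
```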
Recap of networking
• What is an ISP?
• Internet Service Provider
• What are cookies?
• Cookies are used in identifying a single user
regardless of the machine used to access the Web
Trie
• Data structure that is used to keep track of
patterns during web usage mining
• Path from root to leaf represents a sequence
• Tries are used to store strings for
pattern-matching applications
• Each character in the string is stored on the
edge to the node and common prefixes of
strings are shared
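A minimal trie along these lines might look as follows. It is a sketch, not a reference implementation: `TrieNode`, `insert`, `contains`, and `count_nodes` are hypothetical names, and the stored strings are arbitrary examples.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # edge label (one symbol) -> child node
        self.terminal = False  # True if a stored string ends at this node

def insert(root, sequence):
    """Walk from the root, creating one edge per symbol where needed."""
    node = root
    for symbol in sequence:
        node = node.children.setdefault(symbol, TrieNode())
    node.terminal = True

def contains(root, sequence):
    """Follow the edges for each symbol; succeed only at a terminal node."""
    node = root
    for symbol in sequence:
        if symbol not in node.children:
            return False
        node = node.children[symbol]
    return node.terminal

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

root = TrieNode()
for s in ["ABC", "ABD"]:
    insert(root, s)
```

Because "ABC" and "ABD" share the prefix "AB", the trie needs only five nodes (the root plus A, B, C, D) instead of seven.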
Sample tries
[Figure: two sample structures, labeled TRIE and SUFFIX TRIE]
Characteristics of suffix trie
• Each internal node except the root has at least
two children
• Each edge represents a nonempty
subsequence
• Subsequences on sibling edges begin with different symbols
• A suffix tree built for multiple sessions is called
a generalized suffix tree (GST)
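As a rough sketch of the idea, the code below builds an uncompressed suffix trie for one session by inserting every suffix, using nested dictionaries. It is an assumption-laden illustration: a real suffix tree would additionally merge chains of single-child nodes into one edge (which is what gives the "at least two children" property above), and the names `suffix_trie` and `has_contiguous` as well as the "$" end marker are chosen here for illustration.

```python
def suffix_trie(sequence, end="$"):
    """Insert every suffix of the sequence (plus an end marker) into a trie."""
    root = {}
    symbols = list(sequence) + [end]
    for start in range(len(symbols)):
        node = root
        for symbol in symbols[start:]:
            node = node.setdefault(symbol, {})
    return root

def has_contiguous(trie, pattern):
    """A pattern occurs contiguously iff it is a path starting at the root."""
    node = trie
    for symbol in pattern:
        if symbol not in node:
            return False
        node = node[symbol]
    return True

trie = suffix_trie("ABAC")
```

Every contiguous subsequence of the session is a prefix of some suffix, so a single walk from the root answers whether a traversal such as "BA" occurred.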
Pattern Discovery
• For clickstream data, a common data mining technique is uncovering
traversal patterns
• A traversal pattern is a set of pages visited by a user in a session
• Traversal patterns differ in the following
features
– Duplicate page references may or may not be allowed
– A pattern may consist only of contiguous page references, or of any pages referenced in
the same session
– A pattern may or may not be maximal
– A frequent pattern is maximal if it is not a subpattern of another
frequent pattern
Association rules
• Can be used to find what pages are accessed
together
• In this case a page is regarded as an item and
a session is regarded as a transaction with
duplicates and ordering ignored
• Support = (number of sessions containing the itemset) / (total number of sessions)
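A small sketch of this computation, assuming support is taken as the fraction of sessions that contain every page of the itemset. The sessions below are made up, and `support` is an illustrative helper name. Converting each session to a set is what ignores duplicates and ordering.

```python
# Each session is the ordered list of pages visited (illustrative data).
sessions = [
    ["A", "B", "A", "C"],
    ["B", "C"],
    ["A", "C", "D"],
    ["B", "D"],
]

def support(itemset, sessions):
    """Fraction of sessions (transactions) that contain every page in itemset."""
    itemset = set(itemset)
    occurrences = sum(1 for s in sessions if itemset <= set(s))
    return occurrences / len(sessions)

sup_ac = support({"A", "C"}, sessions)
sup_b = support({"B"}, sessions)
```

Pages A and C appear together in two of the four sessions, so `sup_ac` is 0.5; page B appears in three sessions, so `sup_b` is 0.75.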
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Given support threshold min_sup = 2
GSP—Generalized Sequential Pattern Mining
Finding Length-1 Sequential Patterns
• Initial candidates:
– <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
• Scan database once, count support for
candidates

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>

Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
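The first scan can be reproduced with a few lines of code. The sketch below assumes each sequence is stored as a list of sets, one set per parenthesized element; the variable names are illustrative. An item is counted at most once per sequence, no matter how often it occurs in it.

```python
from collections import Counter

# The sequence database from the slide, one list of itemsets per sequence ID.
db = {
    10: [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    20: [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    30: [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    40: [{"b", "e"}, {"c", "e"}, {"d"}],
    50: [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
}
min_sup = 2

support = Counter()
for sequence in db.values():
    # Collect the distinct items of this sequence, then count each one once.
    items_in_sequence = set().union(*sequence)
    support.update(items_in_sequence)

frequent = sorted(item for item, count in support.items() if count >= min_sup)
```

With min_sup = 2, the items g and h (support 1 each) are dropped, leaving the six length-1 patterns <a> through <f>.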
Generating Length-2 Candidates
       <a>    <b>    <c>    <d>    <e>    <f>
<a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>    <da>   <db>   <dc>   <dd>   <de>   <df>
<e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>
51 length-2 candidates: the 36 sequences <xy> in the table plus
the 15 single-element candidates <(xy)>, from <(ab)> to <(ef)>
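The count of 51 can be checked by enumerating the candidates. This sketch writes candidates as strings purely for readability and assumes the six frequent items <a> to <f>; the variable names are illustrative.

```python
from itertools import combinations, product

frequent_items = ["a", "b", "c", "d", "e", "f"]

# Two items in separate elements: order matters and repeats are allowed.
sequence_candidates = [f"<{x}{y}>" for x, y in product(frequent_items, repeat=2)]

# Two items in the same element: an unordered pair of distinct items.
itemset_candidates = [f"<({x}{y})>" for x, y in combinations(frequent_items, 2)]

candidates = sequence_candidates + itemset_candidates
```

The 6 × 6 = 36 ordered pairs plus the 6 × 5 / 2 = 15 unordered pairs give the 51 length-2 candidates.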
The GSP Mining Process
1st scan: 8 cand., 6 length-1 seq. pat.
<a> <b> <c> <d> <e> <f> <g> <h>
2nd scan: 51 cand., 19 length-2 seq. pat., 10 cand. not in DB at all
<aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
3rd scan: 46 cand., 19 length-3 seq. pat., 20 cand. not in DB at all
<abb> <aab> <aba> <baa> <bab> …
4th scan: 8 cand., 6 length-4 seq. pat.
<abba> <(bd)bc> …

min_sup = 2

Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
The GSP Algorithm
• Take sequences of the form <x> as length-1 candidates
• Scan database once, find F1, the set of length-1
sequential patterns
• Let k=1; while Fk is not empty do
– Form Ck+1, the set of length-(k+1) candidates from Fk;
– If Ck+1 is not empty, scan database once, find Fk+1, the set of
length-(k+1) sequential patterns
– Let k=k+1;
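The level-wise loop above can be sketched in code. Note the simplification: instead of GSP's join-and-prune candidate generation, this sketch forms Ck+1 by extending each pattern in Fk with one frequent item, either as a new element or inside the last element, which yields the same frequent patterns but usually more candidates. All function names are hypothetical, and pattern length is counted in items.

```python
def contains(sequence, pattern):
    """True if the pattern's elements match, in order, subsets of the sequence's elements."""
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, db):
    return sum(1 for sequence in db if contains(sequence, pattern))

def extend(pattern, items):
    """Grow a pattern by one item: as a new element, or inside the last element."""
    last = pattern[-1]
    for x in items:
        yield pattern + (frozenset([x]),)
        if x > max(last):  # only larger items, so each itemset is built once
            yield pattern[:-1] + (last | {x},)

def mine(db, min_sup):
    items = sorted(set().union(*(e for s in db for e in s)))
    level = [(frozenset([x]),) for x in items]           # length-1 candidates
    level = [p for p in level if support(p, db) >= min_sup]
    frequent_items = [next(iter(p[0])) for p in level]
    result, k = {}, 1
    while level:                                         # while Fk is not empty
        result[k] = level
        candidates = [c for p in level for c in extend(p, frequent_items)]
        level = [c for c in candidates if support(c, db) >= min_sup]  # one scan
        k += 1
    return result

db = [
    [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],
    [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],
    [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],
    [{"b", "e"}, {"c", "e"}, {"d"}],
    [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],
]
result = mine(db, min_sup=2)
```

`result[k]` holds the frequent patterns with k items; for example, <(bd)cb> is supported by sequences 10 and 50 and therefore appears in `result[4]`.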
The GSP Algorithm
• Benefits from the Apriori pruning
– Reduces search space
• Bottlenecks
– Scans the database multiple times
– Generates a huge set of candidate sequences