Web Mining
Web Mining Outline
Goal: Examine the use of data mining on
the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining
Introduction
The Web is perhaps the single largest data
source in the world.
Web mining aims to extract and mine useful
knowledge from the Web.
A multidisciplinary field: data mining, machine
learning, natural language processing,
statistics, databases, information retrieval,
multimedia, etc.
Due to the heterogeneity and lack of structure
of Web data, mining is a challenging task.
Opportunities and Challenges
The amount of info on the Web is huge, and easily accessible.
The coverage of Web info is very wide and diverse: info/data of almost all types exist on the Web, e.g., structured tables, texts, multimedia data, etc.
Much of the Web information is semi-structured due to the nested structure of HTML code.
Much of the Web info is linked: hyperlinks exist among pages within a site, and across different sites.
Much of the Web info is redundant: the same piece of info, or variants of it, may appear in many pages.
Opportunities and Challenges
The Web is noisy. A Web page typically contains many kinds of info, e.g., main contents, advertisements, navigation panels, copyright notices, etc.
The Web consists of the surface Web and the deep Web.
– Surface Web: pages that can be browsed using a browser.
– Deep Web: can only be accessed through parameterized query interfaces.
The Web is also about services.
The Web is dynamic. Information on the Web changes constantly. Keeping up with the changes and monitoring them are important issues.
The Web is a virtual society. It is not only about data, information and services, but also about interactions among people, organizations and automatic systems, i.e. communities.
Web Mining – Other Issues
Size
– > 1,000 million pages; 227,225,642 web sites (Sep 2010, Netcraft Survey)
– Grows at about 1 million pages a day
– Google indexes > 5 billion documents
Diverse types of data
So it is not possible to warehouse the Web or to apply conventional data mining directly
Web Data
Web pages
Intra-page structures (HTML, XML code)
Inter-page structures (actual linkage
structures between web pages)
Usage data
Supplemental data
– Profiles
– Registration information
– Cookies
Web Mining Taxonomy
Web Content Mining
– Extends work of basic search engines
Web Structure Mining
– Mine structure (links, graph) of the Web
Web Usage Mining
– Analyses Logs of Web Access
Web Mining applications include targeted advertising, recommendation engines, CRM, etc.
Web Content Mining
Extends work of basic search engines
Web content mining: mining, extraction
and integration of useful data, information
and knowledge from Web page contents
Search Engines
– IR application, Keyword based, Similarity
between query and document
– Crawlers, Indexing
– Profiles
– Link analysis
Issues in Web Content Mining
Developing intelligent tools for IR
– Finding keywords and key phrases
– Discovering grammatical rules and
collocations
– Hypertext classification/categorization
– Extracting key phrases from text documents
– Learning extraction models/rules
– Hierarchical clustering
– Predicting relationships (e.g., between words)
Search Engine – Two Rank Functions
[Figure: search engine architecture. From the Web pages, an Indexer builds an inverted index and an Anchor Text Generator extracts anchor text; similarity between the query and the content/anchor text gives the Relevance Ranking. A Web Graph Constructor builds the Web topology graph from backward links, and link structure analysis over this graph gives the Importance Ranking.]
How do We Find Similar Web Pages?
Content-based approach
Structure-based approach
Combining both content- and structure-based approaches
Relevance Ranking
• Inverted index
- A data structure for supporting text queries
- like the index in a book
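The inverted index idea above can be sketched in a few lines. This is a minimal illustration, not any particular engine's implementation; the whitespace tokenization and AND-query semantics are simplifying assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return index

def search(index, terms):
    """AND query: documents containing every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "web mining extracts knowledge from the web",
    2: "data mining on structured tables",
    3: "web usage mining analyses server logs",
}
idx = build_inverted_index(docs)
print(search(idx, ["web", "mining"]))  # documents 1 and 3
```

Like a book index, a lookup goes from term to locations, so a query only touches the postings of its terms rather than scanning every document.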
Crawlers
Robot (spider) traverses the hypertext structure
in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web and
replaces index
Periodic Crawler – visits portions of the Web and
updates subset of index
Incremental Crawler – selectively searches the
Web and incrementally modifies index
Focused Crawler – visits pages related to a
particular subject
Focused Crawler
Only visit links from a page if that page is
determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier which assigns relevance score to
each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier and distiller scores.
Focused Crawler
Classifier relates documents to topics.
Classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages; they must be visited even if their own relevance score is not high.
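The core focused-crawling loop can be sketched as a best-first search over a link graph. Here RELEVANCE stands in for the trained topic classifier and LINKS for the live Web; both, along with the threshold, are illustrative assumptions, and the distiller step is omitted for brevity.

```python
import heapq

# Hypothetical stand-ins: RELEVANCE plays the role of the classifier's
# per-page relevance score, LINKS the role of the Web's link structure.
RELEVANCE = {"A": 0.9, "B": 0.2, "C": 0.8, "D": 0.1, "E": 0.7}
LINKS = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}

def focused_crawl(seed, threshold=0.5):
    """Visit pages most-relevant-first; follow a page's out-links
    only if the page itself scores above the relevance threshold."""
    visited = []
    seen = {seed}
    frontier = [(-RELEVANCE[seed], seed)]  # max-heap via negated scores
    while frontier:
        neg_score, page = heapq.heappop(frontier)
        visited.append(page)
        if -neg_score < threshold:
            continue  # page judged irrelevant: do not expand its links
        for nxt in LINKS[page]:
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-RELEVANCE[nxt], nxt))
    return visited

print(focused_crawl("A"))  # ['A', 'C', 'E', 'B'] -- D is never reached
```

Because B scores below the threshold, its out-link to D is never followed: the crawl stays near the topic instead of expanding the whole graph.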
Virtual Web View
Multiple Layered DataBase (MLDB) built on top
of the Web.
Each layer of the database is more generalized
(and smaller) and centralized than the one beneath
it.
Upper layers of MLDB are structured and can be
accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place
in first layer of MLDB.
Higher levels contain more summarized data
obtained through generalizations of the lower
levels.
Multilevel Databases
Examples:
– WebLog: Restructuring extracted information from Web
sources.
– W3QL: Combines structure query (organization of
hypertext) and content query (information retrieval
techniques).
[Figure: architecture of a global MLDB. A concept hierarchy connects the higher, more generalized levels down to the individual sources at the lowest layer, which are populated by resource discovery.]
Applications
ShopBot
Bookmark Organizer
Recommender Systems
Intelligent Search Engines
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to
more effectively retrieve important pages.
Web as a Graph
Web pages as nodes of a graph.
Links as directed edges.
[Figure: a small Web graph whose nodes are the pages "my page", www.vesit.edu and www.google.com, with directed edges for the hyperlinks between them.]
Link Structure of the Web
Forward links (out-edges).
Backward links (in-edges).
Approximation of importance/quality: a
page may be of high quality if it is referred
to by many other pages, and by pages of
high quality.
Authorities and Hubs
Authority is a page which has relevant
information about the topic.
Hub is a page which has collection of links
to pages about that topic.
[Figure: one hub page h pointing to authorities a1–a4.]
PageRank
Introduced by Brin and Page (1998).
Mine hyperlink structure of web to produce
‘global’ importance ranking of every web page.
Used in Google Search Engine.
Search results are returned in rank order.
Treats links like academic citations.
Assumption: highly linked pages are more ‘important’ than pages with few links.
PageRank
Used by Google
Prioritize pages returned from search by
looking at Web structure.
Importance of page is calculated based
on number of pages which point to it –
Backlinks.
Weighting is used to give more importance to backlinks coming from important pages.
PageRank: Main Idea
A page has a high rank if the sum of the
ranks of its back-links is high.
Google utilizes a number of factors to rank
the search results:
– proximity, anchor text, page rank
The benefits of PageRank are greatest for underspecified queries; for example, for the query ‘Mumbai University’, PageRank lists the university home page first.
Basic Idea
Back-links coming from important pages
convey more importance to a page.
For example, if a web page has a link from
the yahoo home page, it may be just one
link but it is a very important one.
A page has high rank if the sum of the
ranks of its back-links is high.
This covers both the case when a page
has many back-links and when a page has
a few highly ranked back-links.
Definition
A page’s rank is the sum of the ranks of the pages pointing to it, each divided by the number of links on that page:

Rank(u) = Σ_{v ∈ B_u} Rank(v) / N_v

B_u: set of pages with links to u
N_v: number of links from v
Simplified PageRank Example
Rank(u) = c Σ_{v ∈ B_u} Rank(v) / N_v, where c is a normalization constant (c < 1, to cover for pages with no outgoing links).
Expanded Definition
R(u): page rank of page u
c: factor used for normalization (<1)
Bu: set of pages pointing to u
Nv: outbound links of v
R(v): page rank of site v that points to u
E(u): distribution over web pages to which a random surfer periodically jumps (set to 0.15)
R(u) = c Σ_{v ∈ B_u} R(v) / N_v + c E(u)
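The expanded definition can be sketched as power iteration. This sketch uses the widely used damping form R(u) = (1 - c)/n + c Σ R(v)/N_v, i.e. a uniform jump vector E, which is a close relative of the c·E(u) formulation above; the choice c = 0.85, the iteration count, and the uniform handling of dangling pages are assumptions.

```python
def pagerank(links, c=0.85, iters=50):
    """Power iteration for the damping form of PageRank.
    links: {page: [pages it links to]}; every page must appear as a key.
    Dangling pages (no out-links) spread their rank uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - c) / n for p in pages}  # random-jump contribution
        for v, outs in links.items():
            if outs:
                share = c * rank[v] / len(outs)  # R(v)/N_v, damped by c
                for u in outs:
                    new[u] += share
            else:  # dangling page: distribute its rank over all pages
                for u in pages:
                    new[u] += c * rank[v] / n
        rank = new
    return rank

# C has two backlinks (from A and B), so it ends up ranked highest.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

The ranks sum to 1 at every iteration, so no separate normalization pass is needed in this form.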
Problem 1 - Rank Sink
Page cycles pointed by some incoming link.
39
Problem 2 - Dangling Links
Links that point to pages with no outgoing links, so the rank they receive has nowhere to propagate.
HITS
Hyperlink-Induced Topic Search.
[Figure: one hub page h pointing to authorities a1–a4.]
Authorities and Hubs (cont.)
Good hubs are the ones that point to good authorities.
Good authorities are the ones that are pointed to by good hubs.
[Figure: bipartite view with hubs h1–h5 on the left pointing to authorities a1–a5 on the right.]
Finding Authorities and Hubs
Construction of Sub-graph
[Figure: a topic query sent to a search engine yields the root set of pages; a crawler expands it with linked pages into the expanded set.]
Root Set and Base Set
Use query term to
collect a root set of
pages from text-
based search engine
(Lycos, Altavista ).
Root set
47
Hubs & Authorities Calculation
Iterative algorithm on the Base Set: authority weights a(p) and hub weights h(p).
– Set authority weights a(p) = 1 and hub weights h(p) = 1 for all p.
– Repeat the following two operations (and then re-normalize a and h to have unit norm):

a(p) = Σ_{q → p} h(q)
h(p) = Σ_{p → q} a(q)
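The two update rules can be sketched as a short iterative routine; the toy graph and the iteration count are illustrative assumptions.

```python
def hits(links, iters=20):
    """a(p) = sum of h(q) over q -> p;  h(p) = sum of a(q) over p -> q,
    re-normalized to unit (L2) norm after every update.
    links: {page: [pages it links to]}."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority update: sum hub weights of pages linking to p
        a = {p: sum(h[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = sum(x * x for x in a.values()) ** 0.5 or 1.0
        a = {p: x / norm for p, x in a.items()}
        # hub update: sum authority weights of pages p links to
        h = {p: sum(a[q] for q in links.get(p, ())) for p in pages}
        norm = sum(x * x for x in h.values()) ** 0.5 or 1.0
        h = {p: x / norm for p, x in h.items()}
    return a, h

# One hub H pointing to three authorities, as in the figure above.
a, h = hits({"H": ["A1", "A2", "A3"]})
```

On this graph H gets the full hub weight and zero authority weight, while A1–A3 share the authority weight equally, matching the intuition behind the definitions.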
Results
Although HITS is purely link-based (it completely disregards page content), results are quite good on many tested queries.
Starting from a narrow topic, HITS tends to end in a more general one.
A peculiarity of hub pages: their many links can cause the algorithm to drift, since they can point to authorities in different topics.
Pages from a single domain/website can dominate the result if they all point to one page, which is not necessarily a good authority.
Possible Enhancements
Use weighted sums for link calculation.
Take advantage of “anchor text” - the text surrounding the link itself.
Break hubs into smaller pieces; analyze each piece separately instead of the whole hub page as one.
Disregard or minimize the influence of links inside one domain.
IBM expanded HITS into CLEVER, but it was not seen as a viable real-time search engine.
CLEVER
Identify authoritative and hub pages.
Authoritative Pages :
– Highly important pages.
– Best source for requested information.
Hub Pages :
– Contain links to highly important pages.
CLEVER
The CLEVER algorithm is an extension of standard HITS that addresses the problems of standard HITS.
CLEVER assigns a weight to each link based on the terms of the query and the end-points of the link.
It also uses anchor text when setting link weights.
Moreover, it breaks large hub pages into smaller units, so that each unit is focused on a single topic.
Finally, when a large number of pages come from a single domain, it scales down their weights to reduce the domain’s influence on the result.
PageRank vs. HITS
PageRank (Google)
– computed for all web pages stored in the database prior to the query
– computes authorities only
– trivial and fast to compute
HITS (CLEVER)
– performed on the set of retrieved web pages for each query
– computes authorities and hubs
– easy to compute, but real-time execution is hard
Web Usage Mining
Performs mining on Web usage data, or Web logs.
A Web log is a listing of page reference data, also called a click stream.
Can be seen from either the server perspective (better Web site design) or the client perspective (prefetching of Web pages, etc.).
Web Usage Mining
Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page
references
Improve design of individual pages
Improve effectiveness of e-commerce (sales
and advertising)
Improve web server performance (Load
Balancing)
Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
Transaction: session
Itemset: pattern (or subset)
Order is important
Pattern Analysis
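The sessionize step above can be sketched as a timeout-based grouping. The record format (user_id, timestamp, page) and the 30-minute timeout are assumptions; real logs need the cleansing steps described elsewhere in this section first.

```python
from datetime import datetime, timedelta

def sessionize(records, timeout=timedelta(minutes=30)):
    """records: (user_id, timestamp, page) tuples (an assumed log format).
    Groups each user's page references into sessions, starting a new
    session when the gap between requests exceeds the timeout."""
    sessions, current = [], []
    last_user, last_ts = None, None
    for user, ts, page in sorted(records):  # sort by user, then time
        if current and (user != last_user or ts - last_ts > timeout):
            sessions.append(current)
            current = []
        current.append(page)
        last_user, last_ts = user, ts
    if current:
        sessions.append(current)
    return sessions

log = [
    ("u1", datetime(2010, 9, 1, 10, 0), "/a"),
    ("u1", datetime(2010, 9, 1, 10, 5), "/b"),
    ("u1", datetime(2010, 9, 1, 12, 0), "/a"),  # > 30 min gap: new session
    ("u2", datetime(2010, 9, 1, 10, 1), "/c"),
]
print(sessionize(log))  # [['/a', '/b'], ['/a'], ['/c']]
```

Each returned list is one session, i.e. a sequence of pages referenced by one user at a sitting, ready for pattern counting.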
Web Usage Mining Issues
Identification of exact user not possible.
Exact sequence of pages referenced by a
user not possible due to caching.
Session not well defined
Security, privacy, and legal issues
61
Web Usage Mining - Outcome
Association rules
– Find pages that are often viewed
together
Clustering
– Cluster users based on browsing patterns
– Cluster pages based on content
Classification
– Relate user attributes to patterns
Web Log Cleansing
Replace source IP address with unique
but non-identifying ID.
Replace exact URL of pages referenced
with unique but non-identifying ID.
Delete error records and records not containing page data (such as figures and code).
Data Structures
Keep track of patterns identified during
Web usage mining process
Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
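As a minimal sketch of the trie idea, the structure below counts how many sessions begin with a given page sequence. Note this handles prefixes only; suffix trees, generalized suffix trees, and WAP trees extend the idea to patterns starting anywhere in a session.

```python
class TrieNode:
    """Node of a trie over page IDs; `count` is the number of inserted
    sessions whose prefix reaches this node."""
    def __init__(self):
        self.children = {}
        self.count = 0

def insert(root, session):
    """Insert one session (a sequence of page IDs) into the trie."""
    node = root
    for page in session:
        node = node.children.setdefault(page, TrieNode())
        node.count += 1

def prefix_count(root, pattern):
    """How many inserted sessions start with this page sequence."""
    node = root
    for page in pattern:
        if page not in node.children:
            return 0
        node = node.children[page]
    return node.count

root = TrieNode()
for s in [["A", "B", "C"], ["A", "B"], ["A", "C"]]:
    insert(root, s)
print(prefix_count(root, ["A", "B"]))  # 2 sessions start with A, B
```

Sharing common prefixes keeps storage compact when many sessions begin at the same entry pages, which is the usual case for a site's home page.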
Web Usage Mining – Three
Phases
http://www.acm.org/sigs/sigkdd/explorations/issue1-2/srivastava.pdf
Phase 1: Pre-processing
Converts the raw data into the data abstraction necessary for applying the data mining algorithm:
– Mapping the log data into relational tables before an adapted data mining technique is performed.
– Using the log data directly by utilizing special pre-processing techniques.
Raw data – Web log
Click stream: a sequential series of page view requests.
User session: a delimited set of user clicks (click stream) across one or more Web servers.
Server session (visit): a collection of user clicks to a single Web server during a user session.
Episode: a subset of related user clicks that occur within a user session.
Phase 2: Pattern Discovery
Pattern Discovery uses techniques
such as statistical analysis,
association rules, clustering,
classification, sequential pattern,
dependency Modeling.
68
Phase 3: Pattern Analysis
A process to gain Knowledge about how
visitors use Website in order to
– Prevent disorientation and help designers to
place important information/functions exactly
where the visitors look for and in the way
users need it.
– Build up adaptive Website server
69
70
Techniques for Web usage mining
Construct multidimensional view on the Weblog database
– Perform multidimensional OLAP analysis to find the top
N users, top N accessed Web pages, most frequently
accessed time periods, etc.
Perform data mining on Weblog records
– Find association patterns, sequential patterns, and
trends of Web accessing
– May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer
Conduct studies to
– Analyze system performance, improve system design
by Web caching, Web page prefetching, and Web page
swapping
Software for Web Usage Mining
WEBMINER :
– introduces a general architecture for Web usage
mining, automatically discovering association rules
and sequential patterns from server access logs.
– proposes an SQL-like query mechanism for querying
the discovered knowledge in the form of association
rules and sequential patterns.
WebLogMiner
– Web log is filtered to generate a relational database
– Data mining on web log data cube and web log
database
WEBMINER
A framework for Web mining, with an SQL-like query mechanism.
– Association rules: using the Apriori algorithm
40% of clients who accessed the Web page with
URL /company/products/product1.html, also
accessed /company/products/product2.html
– Sequential patterns:
60% of clients who placed an online order in
/company/products/product1.html, also placed
an online order in
/company/products/product4.html within 15
days
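The statistic behind the association-rule example above ("40% of clients who accessed product1 also accessed product2") is rule confidence over sessions. This is only a sketch of that statistic; WEBMINER's actual Apriori-based mining, which also prunes by support, is not shown, and the session contents are illustrative.

```python
def confidence(sessions, x, y):
    """Fraction of sessions containing page x that also contain page y,
    i.e. the confidence of the rule x -> y."""
    with_x = [s for s in sessions if x in s]
    if not with_x:
        return 0.0
    return sum(1 for s in with_x if y in s) / len(with_x)

sessions = [
    {"/products/product1.html", "/products/product2.html"},
    {"/products/product1.html"},
    {"/products/product1.html", "/products/product2.html",
     "/products/product3.html"},
    {"/products/product3.html"},
]
# 2 of the 3 sessions containing product1 also contain product2
print(confidence(sessions, "/products/product1.html",
                 "/products/product2.html"))
```

Sequential patterns differ only in that order and time constraints matter, so sessions would be kept as ordered, timestamped sequences instead of sets.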
WebLogMiner
Database construction from server log file:
– data cleaning
– data transformation
Multi-dimensional web log data cube construction and
manipulation
Data mining on web log data cube and web log database
Mining the World-Wide Web
Design of a Web Log Miner
– Web log is filtered to generate a relational database
– A data cube is generated from the database
– OLAP is used to drill-down and roll-up in the cube
– OLAM is used for mining interesting knowledge
[Figure: Web log miner pipeline – (1) data cleaning → (2) data cube creation → (3) OLAP → (4) mining.]