Srivastava Tut Pres
Jaideep Srivastava
University of Minnesota
USA
[email protected]
http://www.cs.umn.edu/faculty/srivasta.html
© Jaideep Srivastava 1
Overview
Introduction to data mining
  Data mining process
  Data Mining techniques
  Classification
  Clustering
  Topic Analysis
  Concept Hierarchy
  Content Relevance
Web mining
  Web mining definition
  Web mining taxonomy
Web Content Mining
  Definition
  Pre-processing of content
  Common Mining techniques
  Classification
  Clustering
  Topic Analysis
  Concept Hierarchy
  Content Relevance
  Applications of Content Mining
Web Structure Mining
  Definition
  Interesting Web Structures
  Overview of Hyperlink Analysis Methodology
  Key Concepts
  PageRank
  Hubs and Authorities
  Web Communities
  Information Scent
  Conclusions
Web Usage Mining
  Definition
  Preprocessing of usage data
  Session Identification
  CGI Data
  Caching
  Dynamic Pages
  Robot Detection and Filtering
  Transaction Identification
  Identify Unique Users
  Identify Unique User transaction
Overview
Web Usage Mining (contd.)
  Path and Usage Pattern Discovery
  Pattern Analysis
  Applications
  Conclusions
Web mining applications
  Amazon.com
  Google
  Double Click
  AOL
  eBay
  MyYahoo
  CiteSeer
  i-MODE
  v-TAG Web Mining Server
Related Concepts
  Web Visualization
  Topic Distillation
  Web Page Categorization
  Semantic Web Mining
  Distributed Web Mining
Web services & Web mining
  Definitions
  What they provide
  Service Oriented Architecture
  SOAP
  WSDL
  UDDI
  How WM can help WS
  Web Services Optimization
Overview
Research Directions
Process Mining
Temporal Evolution of the Web
Web Services Optimization
Fraud at E-tailer
Fraud at online Auctioneer
Other threats
Web Mining and Privacy
Public Attitude towards Privacy
Why this attitude
Does understanding implications help
What needs to be done
Conclusions
Introduction to Data Mining
Why Mine Data?
The Data Mining (KDD) Process
[Figure: the KDD process. DATA -> selection -> Target Data -> preprocessing -> Preprocessed Data -> transformation -> Transformed Data -> data mining -> Patterns -> interpretation -> KNOWLEDGE]
Data Mining Techniques
Primary techniques:
Classification
Clustering
Association Rules
Sequential Patterns
Regression
Deviation Detection
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes
is the class.
Find a model for class attribute as a function
of the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test
sets, with training set used to build the model and test set
used to validate it.
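The train/test procedure described above can be sketched as follows. This is a minimal illustration, not the tutorial's own method: the toy records, the `split` and `predict_1nn` helpers, and the choice of a 1-nearest-neighbor classifier (a simple form of memory-based reasoning) are all assumptions made for the example.

```python
# Hypothetical toy records: (attribute tuple, class label).
records = [
    ((1.0, 1.2), "no"), ((5.0, 5.5), "yes"), ((1.5, 0.8), "no"),
    ((6.0, 5.0), "yes"), ((0.5, 1.0), "no"), ((5.5, 6.2), "yes"),
    ((1.2, 1.4), "no"), ((1.0, 0.9), "no"), ((5.8, 5.9), "yes"),
    ((0.9, 1.1), "no"),
]

def split(data, train_size):
    """Divide the data set into a training set and a test set."""
    return data[:train_size], data[train_size:]

def predict_1nn(train, attributes):
    """Assign the class of the closest training record (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], attributes)),
    )
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test if predict_1nn(train, attrs) == label)
    return correct / len(test)

train_set, test_set = split(records, 7)
test_accuracy = accuracy(train_set, test_set)
```

The model here is simply the stored training set; the test set is only used to measure how well previously unseen records are classified.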
Classification Example
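[Figure: a training set table with columns Tid, Refund (categorical), Marital Status (categorical), Taxable Income (continuous), and Cheat (class); a classifier is learned from the training set and applied to a test set with columns Refund, Marital Status, Taxable Income, and Cheat]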
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Genetic Algorithms
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
What is Cluster Analysis?
Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups.
Based on information found in the data that describes the objects
and their relationships.
Also known as unsupervised classification.
Many applications
Understanding: group related documents for browsing or to find
genes and proteins that have similar functionality.
Summarization: Reduce the size of large data sets.
Web Documents are divided into groups based on a
similarity metric.
Most common similarity metric is the dot product between
two document vectors.
What is not Cluster Analysis?
Supervised classification.
Have class label information.
Simple segmentation.
Dividing students into different registration groups
alphabetically, by last name.
Results of a query.
Groupings are a result of an external specification.
Graph partitioning
Some mutual relevance and synergy, but areas are not
identical.
Notion of a Cluster is Ambiguous
Types of Clusterings
A clustering is a set of clusters.
One important distinction is between
hierarchical and partitional sets of clusters.
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset.
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree.
Partitional Clustering
[Figure: two panels, each showing six points labeled 1 to 6]
Hierarchical Clustering
(agglomerative clustering)
[Figure: nested clusters over points 1 to 6 and the corresponding dendrogram (leaf order 1, 3, 2, 5, 4, 6; merge heights from 0 to 0.2)]
Other Distinctions Between Sets of
Clusters
Exclusive versus non-exclusive
In non-exclusive clusterings, points may belong to multiple
clusters.
Can represent multiple classes or ‘border’ points
Fuzzy versus non-fuzzy
In fuzzy clusterings, a point belongs to every cluster with
some weight between 0 and 1.
Weights must sum to 1.
Probabilistic clustering has similar characteristics.
Partial versus complete.
In some cases, we only want to cluster some of the data.
Mining Associations
Given a set of records, find rules that will predict the
occurrence of an item based on the occurrences of
other items in the record
Market-Basket transactions
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition of Association Rule
Association Rule: $X \Rightarrow y$, with support $s$ and confidence $c$
Support: $s = \frac{\sigma(X \cup y)}{|T|}$, i.e. $s = P(X, y)$
Confidence: $c = \frac{\sigma(X \cup y)}{\sigma(X)}$, i.e. $c = P(y \mid X)$
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Observations:
• All the rules above correspond to the
same itemset: {Milk, Diaper, Beer}
• Rules obtained from the same
itemset have identical support but
can have different confidence
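As a worked illustration of the support and confidence definitions, the sketch below evaluates two rules drawn from the itemset {Milk, Diaper, Beer} on the five market-basket transactions listed above. The function names `sigma`, `support`, and `confidence` are chosen for this example only.

```python
# The five market-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y):
    """s = sigma(X u y) / |T|"""
    return sigma(x | y) / len(transactions)

def confidence(x, y):
    """c = sigma(X u y) / sigma(X)"""
    return sigma(x | y) / sigma(x)

# Two rules from the same itemset {Milk, Diaper, Beer}.
s1 = support({"Milk", "Diaper"}, {"Beer"})      # 2 of 5 transactions
c1 = confidence({"Milk", "Diaper"}, {"Beer"})   # 2 of the 3 with Milk and Diaper
s2 = support({"Milk", "Beer"}, {"Diaper"})      # same itemset, same support
c2 = confidence({"Milk", "Beer"}, {"Diaper"})   # 2 of the 2 with Milk and Beer
```

Both rules have the same support because they come from the same itemset, while their confidences differ (2/3 versus 1).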
Association Rule Mining
Two-step approach:
1. Generate all frequent itemsets (sets of items whose
support ≥ minsup)
2. Generate high confidence association rules from
each frequent itemset
Each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is the more
expensive operation
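The two-step approach can be sketched as below. This is a small level-wise (Apriori-style) illustration under assumed thresholds (`minsup = 0.6`, `minconf = 0.8`), run on the market-basket transactions from the earlier slide; the function names are hypothetical and no attempt is made at efficiency.

```python
from itertools import combinations

transactions = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]

def frequent_itemsets(transactions, minsup):
    """Step 1: return {itemset: support count} for all itemsets with support >= minsup."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        frequent.update(survivors)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates with an infrequent subset.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(s) in survivors for s in combinations(c, k - 1))]
    return frequent

def generate_rules(frequent, minconf):
    """Step 2: each rule is a binary partitioning of a frequent itemset."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]  # every subset of a frequent itemset is frequent
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

frequent = frequent_itemsets(transactions, minsup=0.6)
rules = generate_rules(frequent, minconf=0.8)
```

With these thresholds the only rule that survives is {Beer} ⇒ {Diaper}; lowering `minconf` to 0.75 would admit the rules between the other frequent pairs as well.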
Sequential Pattern Discovery
Given a set of objects, with each object associated with
its own timeline of events, find rules that predict strong
dependencies among different events.
(A B) (C) (D E)
Examples:
In point-of-sale transaction sequences
(Intro_to_visual_C)(C++-Primer) (Perl_for_dummies)(TCL_TK)
In Telecommunication alarm logs:
(Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm)
(Fire_Alarm)
Regression
Deviation Detection
Web Mining
Web Mining
Web Mining – History
Web Mining Taxonomy
Pre-processing Web Data
Web Content
Extract “snippets” from a Web document that
represent the document
Web Structure
Identifying interesting graph patterns or
pre-processing the whole web graph to come up with
metrics such as PageRank
Web Usage
User identification, session creation, robot detection
and filtering, and extracting usage path patterns
Web Content Mining
Definition
Pre-processing Content
Content Preparation
Extract text from HTML.
Perform Stemming.
Remove Stop Words.
Calculate Collection Wide Word Frequencies (DF).
Calculate per Document Term Frequencies (TF).
Vector Creation
Common Information Retrieval Technique.
Each document (HTML page) is represented by a sparse vector of term
weights.
TFIDF weighting is most common.
Typically, additional weight is given to terms appearing as keywords or in
titles.
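The content preparation and vector creation steps can be sketched as follows. This is a simplified illustration: the stop-word list and sample documents are hypothetical, stemming and HTML text extraction are omitted, and the weighting shown is the common `tf * log(N / df)` variant of TF-IDF, which is an assumption since the slide does not fix a formula.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def tokenize(text):
    """Lowercase, split on whitespace, and remove stop words (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return one sparse vector (dict of term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()                      # collection-wide document frequencies
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)            # per-document term frequencies
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = ["web mining of web data", "mining usage data", "web structure"]
vectors = tfidf_vectors(docs)
```

Extra weight for terms in titles or keywords, as mentioned above, could be added by multiplying the corresponding term frequencies before weighting.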
Common Mining Techniques
The more basic and popular data mining
techniques include:
Classification
Clustering
Associations
The other significant ideas:
Topic Identification, tracking and drift analysis
Concept hierarchy creation
Relevance of content.
Document Classification
“Supervised” technique
Categories are defined and documents are
assigned to one or more existing categories
The “definition” of a category is usually in the
form of a term vector that is produced during a
“training” phase
Training is performed through the use of
documents that have already been classified
(often by hand) as belonging to a category
Document Clustering
“Unsupervised” technique
Documents are divided into groups based on a
similarity metric
No pre-defined notion of what the groups should
be
Most common similarity metric is the dot product
between two document vectors
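A dot-product similarity over sparse document vectors might look like the sketch below. The vectors are hypothetical term-weight dictionaries; the cosine variant (the dot product divided by the vector lengths) is included as a common refinement and is not something the slide itself specifies.

```python
import math

def dot(u, v):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u                     # iterate over the shorter vector
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cosine(u, v):
    """Length-normalized dot product; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm else 0.0

# Hypothetical document vectors.
doc_a = {"web": 2.0, "mining": 1.0}
doc_b = {"web": 1.0, "usage": 3.0}
similarity = dot(doc_a, doc_b)
normalized = cosine(doc_a, doc_b)
```

A clustering algorithm would call such a function on pairs of documents (or on a document and a cluster centroid) to decide group membership.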
Topic Identification and Tracking
Concept Hierarchy Creation
Relevance of Content
Document Relevance
Measure of how useful a given document is in a
given situation
Commonly seen in the context of queries -
results are ordered by some measure of
relevance
In general, a query is not necessary to assign a
relevance score to a document
Query Based Relevance
Most common
Well established in Information Retrieval
Similarity between query keywords and
document is calculated
Can be enhanced through additional information
such as popularity (Google) or term positions
(AltaVista)
User Based Relevance
Often associated with personalization
Profile for a particular user is created
Similarity between a profile and document is
calculated
No query is necessary
Role/Task Based Relevance
Similar to User Based Relevance
Profile is based on a particular role or task,
instead of an individual
Input to profile can come from multiple users
Web Content Mining Applications
Web Structure Mining
What is Web Structure Mining?
Motivation to study Hyperlink Structure
Web Structure Terminology(1)
Web Structure Terminology(2)
Interesting Web Structures
[ERC+2000]
Endorsement Mutual Reinforcement
Transitive Endorsement
The Bow-Tie Model of the Web
[BKM+2000]
Hyperlink Analysis Techniques [DSKT2002]
Hyperlink Analysis Techniques
Google’s PageRank [BP1998]
Key idea
Rank of a web page depends on the rank of the web pages pointing to it
[Figure: pages P1, P2, and P3 each link to page P; the link from Pi is labeled 1/OutDeg(Pi), and P also receives a d/N term]
The PageRank Algorithm [BP1998]
Set PR_0 ← [r_1, r_2, …, r_N], where r_i is some initial rank of
page i, and N is the number of Web pages in the
graph;
d ← 0.15; D ← [1/N, …, 1/N]^T;
A is the adjacency matrix as described above;
do
PR_{i+1} ← A^T * PR_i ;
PR_{i+1} ← (1-d) * PR_{i+1} + d * D;
δ ← || PR_{i+1} - PR_i ||_1
while δ > ε, where ε is a small number indicating the
convergence threshold
return PR_{i+1}.
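The iteration can be sketched in Python as below. Some assumptions are made for the example: the graph is given as out-link lists rather than a matrix, each link from page p is weighted 1/OutDeg(p) as in the figure on the previous slide, the loop stops once the L1 change drops below the threshold, and pages without out-links (which the slide does not discuss) spread their rank evenly over all pages.

```python
def pagerank(out_links, d=0.15, eps=1e-10, max_iter=1000):
    """out_links[p] lists the pages that page p links to; returns one rank per page."""
    n = len(out_links)
    pr = [1.0 / n] * n                          # initial ranks
    for _ in range(max_iter):
        nxt = [d / n] * n                       # the d * D term
        for page, targets in enumerate(out_links):
            if targets:
                share = (1 - d) * pr[page] / len(targets)
                for q in targets:
                    nxt[q] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                for q in range(n):
                    nxt[q] += (1 - d) * pr[page] / n
        delta = sum(abs(a - b) for a, b in zip(nxt, pr))   # L1 norm of the change
        pr = nxt
        if delta < eps:                         # converged
            break
    return pr

# Hypothetical 4-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, 3 -> 2.
ranks = pagerank([[1, 2], [2], [0], [2]])
```

Page 3 has no in-links, so its rank stays at the d/N floor, while page 2, which every other page points to, ends up with the highest rank.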
Hubs and Authorities [K1998]
Key ideas
Hubs and authorities are ‘fans’ and ‘centers’ in a bipartite core of a web graph
A good hub page is one that points to many good authority pages
A good authority page is one that is pointed to by many good hub pages
[Figure: Bipartite Core, with Hubs pointing to Authorities]
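The mutual reinforcement between hubs and authorities can be sketched as an iterative update. This is a minimal illustration in the spirit of [K1998], not a reproduction of it: the edge list is hypothetical, the iteration count is fixed rather than tested for convergence, and scores are normalized to unit Euclidean length after each round.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(s * s for s in scores)) or 1.0
    return [s / norm for s in scores]

def hits(edges, n, iterations=50):
    """edges is a list of (source, target) links among pages 0..n-1."""
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to it.
        auth = [0.0] * n
        for u, v in edges:
            auth[v] += hub[u]
        # Hub score: sum of the authority scores of pages it points to.
        hub = [0.0] * n
        for u, v in edges:
            hub[u] += auth[v]
        auth = normalize(auth)
        hub = normalize(hub)
    return hub, auth

# Hypothetical graph: pages 0 and 1 both link to 2 and 3; page 4 links only to 2.
hub, auth = hits([(0, 2), (0, 3), (1, 2), (1, 3), (4, 2)], n=5)
```

Pages 0 and 1 form the ‘fans’ of a bipartite core and receive equal, high hub scores; pages 2 and 3 are the ‘centers’ and receive the authority scores.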
HITS Algorithm [K1998]
Information Scent [CPCP2001]
Key idea
a user at a given page “foraging” for information would follow a
link which “smells” of that information
the probability of following a link depends on how strong the
“scent” is on that link
[Figure: a link from page P1 to page P2, labeled Scent]
Web Communities [FLG2000]
Definition
Web communities can be
described as a collection of
web pages such that each
member node has more
hyperlinks (in either direction)
within the community than
outside the community.
Approach
• Maximal-flow model
• Graph substructure
identification
Web Communities
Max Flow- Min Cut Algorithm
[Figure: flow network with a central page like Yahoo as the sink, and regions labeled Community]
Conclusions
Web Structure is a useful source for extracting
information such as
Quality of Web Page
- The authority of a page on a topic
- Ranking of web pages
Interesting Web Structures
- Graph patterns like Co-citation, Social choice,
Complete bipartite graphs, etc.
Web Page Classification
- Classifying web pages according to various topics
Conclusions (Cont…)
Web Usage Mining
What is Web Usage Mining?
The Web Usage Mining Process
Preprocessing Architecture
[Figure: preprocessing architecture; legible labels: Path Completion, Usage Statistics, Site Structure and Content, Episode File]
ECLF Log File Format
Fields: IP address | rfc931 | authuser | date and time of request | request | status | bytes | referer | user agent
Example entry:
128.101.35.92 - - [09/Mar/2002:00:03:18 -0600] "GET /~harum/ HTTP/1.0" 200 3014 http://www.cs.umn.edu/ Mozilla/4.7 [en] (X11; I; SunOS 5.8 sun4u)
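A log entry in this layout can be split into its fields with a regular expression, as sketched below. The pattern is written for the sample entry exactly as shown on the slide, where the referer and user agent are not quoted; many servers write those two fields in double quotes, so the pattern is an assumption that would need adjusting for other logs. The function name `parse_eclf` is hypothetical.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'(?P<referer>\S+) (?P<agent>.*)'
)

def parse_eclf(line):
    """Return a dict of fields for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Assumes English month abbreviations, as in the sample entry.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = ('128.101.35.92 - - [09/Mar/2002:00:03:18 -0600] '
          '"GET /~harum/ HTTP/1.0" 200 3014 http://www.cs.umn.edu/ '
          'Mozilla/4.7 [en] (X11; I; SunOS 5.8 sun4u)')
record = parse_eclf(sample)
```

The IP address, user agent, timestamp, and referer fields extracted here are the inputs that the later session identification heuristics rely on.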
Issues in Usage Data
Session Identification
CGI Data
Caching
Dynamic Pages
Robot Detection and Filtering
Transaction Identification
Identify Unique Users
Identify Unique User transaction
Session Identification Problems
Session Identification Solutions
CGI Data
Example URI
Base URI:
/cgi-bin/templates
CGI Data:
?BV_EngineID=falfiffkdgfbemmcfnnckcgl.0&BV_Operation=Dyn_RawSmartLink&BV_SessionID=2131083763.936854172&BV_ServiceName=MyStore&form%25destination=mysite/logo.tmpl
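The example URI can be separated into its base URI and CGI name/value pairs with Python's standard `urllib.parse` module, as in this short sketch. Treating `BV_SessionID` as the session key is an assumption suggested by the parameter name.

```python
from urllib.parse import urlsplit, parse_qsl

uri = ("/cgi-bin/templates"
       "?BV_EngineID=falfiffkdgfbemmcfnnckcgl.0"
       "&BV_Operation=Dyn_RawSmartLink"
       "&BV_SessionID=2131083763.936854172"
       "&BV_ServiceName=MyStore"
       "&form%25destination=mysite/logo.tmpl")

parts = urlsplit(uri)
base_uri = parts.path                      # the part before '?'
cgi_data = dict(parse_qsl(parts.query))    # percent-decoded name/value pairs

session_id = cgi_data["BV_SessionID"]
```

Note that percent-decoding turns the name `form%25destination` into `form%destination`, and that the page actually served (`mysite/logo.tmpl`) is only visible in the CGI data, not in the base URI.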
CGI Data Problems
CGI Data Solutions
Pull data directly from the HTTP traffic instead of the
Server log
Advantages: Generic, works for any Web server/Content
server configuration
Disadvantages: No access to secure data. No access to
internal Content server variables
Have Content server create an “access log”
Advantages: All relevant information is always available
Clean log of page views instead of file accesses is
created. No sessionID “first access” problems
Disadvantages: Content server performance may be
degraded. Not automatic like Server logs
Caching Problems
Server Log Incompleteness due to
Caching
[Figure: pages index.html, page1.html, and page2.html]
Wrong Access Timings Recorded
at Server
[Figure: server/client timeline showing the request for page 2, times t4 and t5, and a bracket marking the actual viewing time]
Missed Page Views at Server
[Figure: timelines for the client (t1-0, t1-3, t2-0, t2-3, t3-0, t3-3), the cache (t2-1, t2-2), and the server (t1-1, t1-2, t3-1, t3-2); a bracket marks the viewing time calculated from the server log]
Caching Solutions
Robot Detection and Filtering
[TK2002]
Web robots are software programs that automatically
traverse the hyperlink structure of the World Wide Web in
order to locate and retrieve information
Motivation for distinguishing web robot visits from other
users
Unauthorized gathering of business information at
e-commerce web sites
Consumption of considerable network bandwidth
Difficulty in performing click-stream analysis
effectively on web data
Transaction Identification
Main Questions:
how to identify unique users
how to identify/define a user transaction
Problems:
user ids are often suppressed due to security concerns
individual IP addresses are sometimes hidden behind proxy servers
client-side & proxy caching makes server log data less reliable
Standard Solutions/Practices:
user registration and client-side cookies: not foolproof
cache busting: increases network traffic
Heuristics for Transaction
Identification
Identifying User Sessions
use IP, agent, and OS fields as key attributes
use client-side cookies & unique user ids, if available
use session time-outs
use synchronized referrer log entries and time stamps to expand
user paths belonging to a session
path completion to infer cached references
EX: expanding a session A ==> B ==> C by an access pair
(B ==> D) results in: A ==> B ==> C ==> B ==> D
to disambiguate paths, sessions are expanded based on page
attributes (size, type), reference length, and no. of back references
required to complete the path
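Two of these heuristics, session time-outs and referrer-based path completion, can be sketched together as below. Several simplifications are assumed: requests arrive sorted by time as `(ip, agent, timestamp, referrer, page)` tuples, the agent string stands in for both the agent and OS fields, the 30-minute time-out is an assumed value, and path completion simply walks back along the session to the most recent occurrence of the referrer without using page attributes or reference lengths to disambiguate.

```python
SESSION_TIMEOUT = 30 * 60  # seconds; an assumed value, the slides give no number

def extend_path(path, referrer, page):
    """Append page to path; if the referrer is an earlier page of the session,
    first insert the back references needed to return to it (path completion)."""
    if path and path[-1] != referrer and referrer in path:
        last = len(path) - 1 - path[::-1].index(referrer)   # most recent occurrence
        path.extend(reversed(path[last:-1]))
    path.append(page)

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group time-ordered requests into sessions keyed by (ip, agent)."""
    sessions = []
    current = {}   # (ip, agent) -> (index into sessions, time of last request)
    for ip, agent, ts, referrer, page in requests:
        key = (ip, agent)
        if key not in current or ts - current[key][1] > timeout:
            sessions.append([])                 # start a new session
            index = len(sessions) - 1
        else:
            index = current[key][0]
        extend_path(sessions[index], referrer, page)
        current[key] = (index, ts)
    return sessions

# Hypothetical requests reproducing the slide's example: A ==> B ==> C, then (B ==> D).
requests = [
    ("1.1.1.1", "Mozilla/4.7", 0, "-", "A"),
    ("1.1.1.1", "Mozilla/4.7", 10, "A", "B"),
    ("1.1.1.1", "Mozilla/4.7", 20, "B", "C"),
    ("1.1.1.1", "Mozilla/4.7", 30, "B", "D"),
    ("1.1.1.1", "Mozilla/4.7", 3630, "-", "A"),   # more than 30 minutes later
]
sessions = sessionize(requests)
```

The first session becomes A, B, C, B, D, with the second B inferred as a cached back reference, and the late request starts a second session.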
Inferring User Transactions from
Sessions
Studies show that reference lengths follow an exponential distribution.
Page types: navigational, content, mixed.
Page types correlate with reference lengths.
Can automatically classify pages as navigational or content using % of navigational pages (based on site topology) and a normal estimate of Chi-squared distribution.
A transaction is an intra-session path ending in a content page.
[Figure: histogram of page reference lengths (secs), with regions labeled navigational pages and content pages]
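One way to turn these observations into a procedure is sketched below. It assumes reference lengths are exponentially distributed, estimates the rate from the observed mean, and picks the cutoff time below which the assumed fraction of navigational references falls (from the exponential CDF, t = -ln(1 - γ) / λ). The chi-squared normal estimate mentioned above is not reproduced, and the data, the 50% navigational fraction, and the function names are hypothetical.

```python
import math

def cutoff_time(reference_lengths, nav_fraction):
    """Reference length separating navigational from content references,
    assuming an exponential distribution of reference lengths."""
    rate = len(reference_lengths) / sum(reference_lengths)   # 1 / mean
    return -math.log(1.0 - nav_fraction) / rate

def split_transactions(session, cutoff):
    """session: list of (page, reference_length); the last page's length is None
    because the log cannot show how long it was viewed.
    Each transaction is a path that ends in a content page."""
    transactions, current = [], []
    for page, length in session:
        current.append(page)
        # Assumption: the final page of a session is treated as a content page.
        if length is None or length > cutoff:
            transactions.append(current)
            current = []
    if current:
        transactions.append(current)
    return transactions

observed = [5, 10, 120, 8, 200, 17]                 # hypothetical reference lengths (secs)
cutoff = cutoff_time(observed, nav_fraction=0.5)    # about 41.6 seconds
session = [("A", 5), ("B", 10), ("C", 120), ("D", 8), ("E", None)]
transactions = split_transactions(session, cutoff)
```

Pages A, B, and D are viewed briefly and so count as navigational; C is viewed for longer than the cutoff and closes the first transaction.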
Associations in Web Transactions
Association Rules:
discovers affinities among sets of items across transactions
X ===> Y (α, σ)
where X, Y are sets of items, α = confidence, σ = support
Examples:
60% of clients who accessed /products/, also accessed
/products/software/webminer.htm.
30% of clients who accessed /special-offer.html, placed
an online order in /products/software/.
(Actual Example from IBM official Olympics Site)
{Badminton, Diving} ===> {Table Tennis} (α = 69.7%, σ = 0.35%)
Other Patterns from Web
Transactions
Sequential Patterns:
30% of clients who visited /products/software/, had done a search in
Yahoo using the keyword “software” before their visit
60% of clients who placed an online order for WEBMINER, placed another
online order for software within 15 days
Clustering and Classification
clients who often access /products/software/webminer.html tend
to be from educational institutions.
clients who placed an online order for software tend to be students in the
20-25 age group and live in the United States.
75% of clients who download software from /products/software/demos/ visit
between 7:00 and 11:00 pm on weekends.
Path and Usage Pattern Discovery
Pattern Analysis
Implications of Web Usage Mining for
E-commerce
Electronic Commerce
determine lifetime value of clients
design cross marketing strategies across products
evaluate promotional campaigns
target electronic ads and coupons at user groups
based on their access patterns
predict user behavior based on previously learned
rules and users’ profile
present dynamic information to users based on
their interests and profiles
Implications for Other Applications
What’s Round-the-Corner for WUM
Related Concepts
Interestingness Measure [PT1998,C2000]
User Behavior Profiles [MSSZ2002]
Why?
To understand the complex human decision making process.
How?
Record click-stream data.
Gather other user information, such as demographic and
psychographic data.
At what level?
Within a web site, e.g., Amazon.com [AMZNa].
On the whole World Wide Web, e.g., Alexa Research [ALEX]
and DoubleClick [DCLKa].
Distributed Web Mining
Motivation: Data on the Web is huge and
distributed across various sites
Traditional Approach: Integrate all data into one
site and perform required analysis.
Problem: Time consuming and not scalable.
Solution: Analyze data locally at different
locations and build an overall model
Application: Personalization of Web sites
depending on the user’s ‘life on the web’ (the user’s
interests, locations, and behavior across different
sites).
Distributed Web Mining - Approaches
The approaches can be classified into two kinds
Surreptitious
User behavior across different web sites is tracked and
integrated without the user having to explicitly submit any
information.
Co-operative
Behavior is reported to a central organization or database
(e.g., network attacks are reported to CERT)
Web Visualization
Motivation
Mining Web data yields a large volume of
information that is easier to
understand with visualization
tools than as pure text.
Prominent tools developed
WebViz
WUM: Web Utilization Miner
WEEV
WebQuilt
Naviz
[Figure: WEEV Time Tube, representing the evolution of Web Ecology over time]
Naviz - User Behavior Visualization of Dynamic
Page [PPT+2003]
Naviz Features
Two operation modes
Traversal diagram mode
Traversal path mode
Thickness of edge
Represents support value
Color of edge (range from blue to red)
Represents confidence degree (low to high)
[Figure: NAVIZ traversal diagram mode]
Visitor Success
path
Web Mining
Documents
Web Page Categorization
Web page Categorization determines the category or class
a web page belongs to, from a pre-determined set of categories
or classes.
(categories can be based on topics or other functionalities such as
home page, research page, content pages etc.)
Approaches:
Pirolli et al. defined 8 categories and identified 7 features based
on which web pages can be classified.
Chakrabarti et al. used a relaxation labeling technique and
assigned categories based on neighboring documents that link to
a given document or are linked by a given document.
Getoor et al. used a Probabilistic Relational Model to specify a
probability distribution over the document link database and classify
documents using belief propagation methods.
[Figure: data sources along the path Client Computer, Modem, ISP Server, Web Server, Content Server; log types shown: client-level logs, proxy-level logs, server-level logs, content-level logs, and packet-sniffer logs]
[Screenshot: personalized storefront page greeting "Jaideep Srivastava", with Today's Deals, an "If you're not Jaideep Srivastava, click here" link, Your Message Center (6 new messages), and Your Shopping Cart (0 items)]
Use of Web mining
• cookies to identify user
• analysis of user’s past behavior and ‘peer group analysis’ for
DoubleClick
• places its own cookie on the machine
of its customer’s users
• reads this cookie each time it serves
an ad to this user through any
customer in the DoubleClick network
Use of Web mining
• use of a special cookie to track user
across multiple Web sites
• analysis of multi-site behavior
• ad serving using DART system
Understanding user communities -
AOL
Search Topic
Similar or Related
Papers
Select ‘slippery’
items state, i.e.
1-click buy
cross-sell promotions
up-sell promotions
The Sting
Perpetrator P signs up as vendor P.com, and advertises he
has 10 VCRs to sell
P also signs up as 10 customers C1, C2, … who all ‘buy’ from
P
7 of the ‘customers’ complain to A.com that they did not
receive their VCRs
A.com pays out $250 each to 4 of the customers before
discovering the sting
Fraud at On-line Auctioneer e.com
Auctioneer e.com creates the ultimate ‘virtual flea market’
Gains immense traction
Participation in large numbers
People spend large amounts of time
Popular for similar reasons as gambling and game shows
Some people will not share any kind of private data at any
cost – the ‘paranoids’
Some people will share any data for returns – the ‘Jerry
Springerites’
The vast majority in the middle wants
a reasonable level of comfort that private data about them will
NOT be misused
Tangible and compelling benefits in return for sharing their
private data – Big Mac example, frequent flier programs