Module 2 Web Usage Mining
Dr. Jalpa Darshit Mehta
Introduction
◼ Web usage mining: automatic discovery of
patterns in clickstreams and associated data
collected or generated as a result of user
interactions with one or more Web sites.
◼ Goal: analyze the behavioral patterns and
profiles of users interacting with a Web site.
◼ The discovered patterns are usually
represented as collections of pages, objects,
or resources that are frequently accessed by
groups of users with common interests.
Introduction
◼ Data in Web Usage Mining:
❑ Web server logs
❑ Site contents
❑ Data about the visitors, gathered from external channels
❑ Further application data
◼ Not all these data are always available.
◼ When they are, they must be integrated.
◼ A large part of Web usage mining is about
processing usage/ clickstream data.
❑ After that, various data mining algorithms can be applied.
Web server logs
1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://dataminingresources.blogspot.com/
2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) http://maya.cs.depaul.edu/~classes/cs589/papers.html
3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200 318814 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1) http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey
4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/
5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html
6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027 HTTP/1.1 maya.cs.depaul.edu Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) http://maya.cs.depaul.edu/~classes/cs480/announce.html
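Each record above can be split into named fields before any mining takes place. The following is a minimal sketch, assuming the 13 whitespace-separated fields appear in the order shown in the sample (date, time, client IP, user name, method, URI, query, status, bytes, protocol, host, user agent, referrer) and that the leading record number has been dropped. The names `LogEntry` and `parse_entry` are hypothetical.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: datetime
    client_ip: str
    method: str
    uri: str
    status: int
    size: int
    user_agent: str
    referrer: str


def parse_entry(line: str) -> LogEntry:
    """Parse one log record (assumed field order, see lead-in)."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 fields, got {len(fields)}")
    (date, time, ip, _user, method, uri, _query,
     status, size, _protocol, _host, agent, referrer) = fields
    return LogEntry(
        timestamp=datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
        client_ip=ip,
        method=method,
        uri=uri,
        status=int(status),
        size=int(size),
        # In this log, spaces inside the user agent are encoded as '+'.
        user_agent=agent.replace("+", " "),
        referrer=referrer,
    )


sample = (
    "2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 "
    "HTTP/1.1 maya.cs.depaul.edu "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) "
    "http://dataminingresources.blogspot.com/"
)
entry = parse_entry(sample)
```

Other servers write different field orders, so the unpacking line would need to be adapted to the actual log format.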
Web usage mining process
Data preparation
Pre-processing of web usage data
Data cleaning
◼ Data cleaning
❑ remove irrelevant references and fields in server
logs
❑ remove references due to spider navigation
❑ remove erroneous references
❑ add missing references due to caching (done after
sessionization)
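The first three cleaning steps can be sketched as a simple filter. This is only an illustration: treating embedded style sheets and images as irrelevant references, detecting spiders by substrings in the user agent, and treating status codes of 400 and above as erroneous are all assumptions, as are the names `IGNORED_SUFFIXES`, `SPIDER_MARKERS` and `is_relevant`.

```python
# Hypothetical rules; real sites tune these lists to their own content.
IGNORED_SUFFIXES = (".css", ".gif", ".jpg", ".jpeg", ".png", ".js", ".ico")
SPIDER_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_relevant(uri: str, status: int, user_agent: str) -> bool:
    """Return True if a log reference should be kept after cleaning."""
    if uri.lower().endswith(IGNORED_SUFFIXES):
        return False  # embedded object, not a page the user asked for
    if any(marker in user_agent.lower() for marker in SPIDER_MARKERS):
        return False  # request issued by a spider, not a human visitor
    if status >= 400:
        return False  # erroneous reference (client or server error)
    return True


browser = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
references = [
    ("/classes/cs480/announce.html", 200, browser),
    ("/classes/cs480/styles2.css", 200, browser),
    ("/classes/cs480/header.gif", 200, browser),
    ("/classes/cs480/missing.html", 404, browser),         # made-up error
    ("/classes/cs480/announce.html", 200, "ExampleBot/1.0"),  # made-up spider
]
cleaned = [r for r in references if is_relevant(*r)]
```

Of the five references, only the first survives; the fourth and fifth are invented here to exercise the error and spider rules.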
Identify sessions (sessionization)
Sessionization strategies
Sessionization heuristics
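One common time-oriented heuristic can be sketched as follows: a new session starts whenever the gap between two consecutive requests from the same user exceeds a threshold. This is a sketch under stated assumptions, not necessarily the heuristic intended on this slide: the 30-minute threshold, the use of the client IP address as the user identifier, and the function name `sessionize` are all illustrative choices.

```python
from datetime import datetime, timedelta


def sessionize(requests, max_gap=timedelta(minutes=30)):
    """Group (user, timestamp, page) tuples into per-user sessions.

    A new session is opened when a user's request arrives more than
    `max_gap` after that user's previous request.
    """
    sessions = []   # list of (user, [pages]) in user/time order
    last_seen = {}  # user -> (timestamp of last request, current page list)
    for user, ts, page in sorted(requests, key=lambda r: (r[0], r[1])):
        previous = last_seen.get(user)
        if previous is None or ts - previous[0] > max_gap:
            pages = []
            sessions.append((user, pages))
        else:
            pages = previous[1]
        pages.append(page)
        last_seen[user] = (ts, pages)
    return sessions


requests = [
    ("1.2.3.4", datetime(2006, 2, 1, 0, 8, 43), "/classes/cs589/papers.html"),
    ("1.2.3.4", datetime(2006, 2, 1, 0, 8, 46), "/classes/cs589/papers/cms-tai.pdf"),
    ("2.3.4.5", datetime(2006, 2, 1, 8, 1, 28), "/classes/ds575/papers/hyperlink.pdf"),
    # Made-up later request from the first user, to show a session break.
    ("1.2.3.4", datetime(2006, 2, 1, 9, 0, 0), "/classes/cs589/papers.html"),
]
sessions = sessionize(requests)
```

With these inputs the first user gets two sessions (the last request arrives almost nine hours after the previous one) and the second user gets one.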
Sessionization example
User identification
User identification: an example
Pageview
Path completion
◼ Client- or proxy-side caching can often result
in missing access references to those pages
or objects that have been cached.
◼ For instance,
❑ if a user returns to a page A during the same
session, the second access to A will likely result in
viewing the previously downloaded version of A
that was cached on the client-side, and therefore,
no request is made to the server.
❑ This results in the second reference to A not being
recorded on the server logs.
Missing references due to caching
Path completion
◼ The problem of inferring missing user
references due to caching.
◼ Effective path completion requires extensive
knowledge of the link structure within the site.
◼ Referrer information in server logs can also
be used in disambiguating the inferred paths.
◼ Problem gets much more complicated in
frame-based sites.
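A simplified version of referrer-based path completion can be sketched as follows. It assumes each logged request carries its referrer and that the user reached the referrer by pressing "Back" through cached pages. It does not consult the site's link structure, which a fuller method would use to choose among alternative paths. The function name `complete_path` is hypothetical.

```python
def complete_path(requests):
    """Infer cached back-navigation in one session.

    `requests` is a time-ordered list of (page, referrer) pairs; the
    referrer is None for the session's entry page. Returns the page
    sequence with the inferred, unlogged revisits inserted.
    """
    path = []
    for page, referrer in requests:
        if path and referrer is not None and path[-1] != referrer:
            if referrer in path:
                # Index of the most recent visit to the referrer.
                idx = len(path) - 1 - path[::-1].index(referrer)
                # Pages re-viewed from cache while going back, ending at
                # the referrer itself.
                path.extend(path[idx:len(path) - 1][::-1])
        path.append(page)
    return path


# Logged: A, then B (from A), then C (from B), then D (from A).
logged = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
completed = complete_path(logged)
```

Because D was requested from A while the last logged page was C, the sketch infers that the user went back through B to A, giving A, B, C, B, A, D.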
Integrating with e-commerce events
◼ Either product oriented or visit oriented
◼ Used to track and analyze conversion of
browsers to buyers.
❑ The major difficulty with e-commerce events is
defining and implementing the events for a site.
In contrast to clickstream data, however, getting
reliable preprocessed data is not a problem.
◼ Another major challenge is the successful
integration with clickstream data.
Product-Oriented Events
◼ Product View
❑ Occurs every time a product is displayed on a
page view
❑ Typical Types: Image, Link, Text
◼ Product Click-through
❑ Occurs every time a user “clicks” on a product to
get more information
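Once product views and click-throughs are recorded as events, a per-product click-through rate follows directly. The sketch below assumes events arrive as (kind, product) pairs with kind equal to "view" or "click"; that encoding and the name `click_through_rates` are illustrative assumptions.

```python
from collections import Counter


def click_through_rates(events):
    """Return clicks divided by views for every product that was viewed."""
    views, clicks = Counter(), Counter()
    for kind, product in events:
        if kind == "view":
            views[product] += 1
        elif kind == "click":
            clicks[product] += 1
    # A Counter returns 0 for products that were never clicked.
    return {product: clicks[product] / views[product] for product in views}


events = [
    ("view", "p1"), ("view", "p1"), ("click", "p1"),
    ("view", "p2"),
]
rates = click_through_rates(events)
```

Here product `p1` is viewed twice and clicked once, while `p2` is viewed once and never clicked.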
Product-Oriented Events
Web usage mining process
Integration with page content
Integration with link structure
E-commerce data analysis
Session analysis
Session analysis: aggregate reports
OLAP
Data mining
Data mining (cont.)
Some usage mining applications
Personalization application
Standard approaches
Summary
◼ Web usage mining has emerged as the essential
tool for realizing more personalized, user-friendly
and business-optimal Web services.
◼ The key is to use the user-clickstream data for
many mining purposes.
◼ Traditionally, Web usage mining is used by e-
commerce sites to organize their sites and to
increase profits.
◼ It is now also used by search engines to improve
search quality and to evaluate search results,
and by many other applications.