Week9 Decision Support in E-Business Web Analytics
Week 9: Lecture 1
Decision support systems
• Provide interactive ad hoc support for the decision-making
processes of managers and other business professionals.
• Support nonroutine decision making
– Example: What is the impact on the production schedule if December sales
doubled?
• Model-driven DSS
• Data-driven DSS
• Serve middle management
• Examples: product pricing, profitability forecasting, and risk analysis systems.
Components of a typical DSS
Analytical Models
• Optimization
• Simulation
• Decision analysis
• Static vs. dynamic models
• Deterministic vs. stochastic models
Statistical Models
• Descriptive statistics
• Outlier analysis
• Univariate predictive models
• Multivariate predictive models
Data mining Models
• Predictive:
– Regression
– Classification
– Collaborative Filtering
• Descriptive:
– Clustering / similarity matching
– Association rules and variants
Text mining
• Natural language processing
• Discover patterns
Term | Time frame | Specific meaning
Decision support | 1970–1985 | Use of data analysis to support decision making
Week 9: Lecture 2
Review of HTTP
Figure sources:
http://www.highteck.net/EN/Application/Application_Layer_Functionality_and_Protocols.html
http://www.doc.ic.ac.uk/~nd/surprise_97/journal/vol2/pcg1/
http://www.cisco.com/c/en/us/about/press/internet-protocol-journal/back-issues/table-contents-47/131-aggregation.html
http://seo-advisors.com/searchegnies-web-crawlers/
Collecting site navigation data
• Server Log Files
– Access log
• Common log format
• Extended (Combined) log format
– Error log
http://httpd.apache.org/docs/1.3/logs.html
What’s in a typical Web server log …
<ip_addr> - - <date> <method> <file> <protocol> <code> <bytes> <referrer> <user_agent>
203.30.5.145 - - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 - - [01/Jun/1999:03:09:23 -0600] "GET /Calls/Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 - - [01/Jun/1999:03:09:24 -0600] "GET /Calls/Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.252.234.33 - - [01/Jun/1999:03:12:31 -0600] "GET / HTTP/1.0" 200 4980 "" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:13:11 -0600] "GET /CP.html HTTP/1.0" 200 3218 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.30.5.145 - - [01/Jun/1999:03:13:25 -0600] "GET /Calls/AWAC.html HTTP/1.0" 200 104 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
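The fields in the log entries above can be pulled apart with a regular expression. The following is a minimal Python sketch, assuming every line follows the extended (combined) log format shown here; the names LOG_PATTERN and parse_log_line are illustrative and not part of any server tooling.

```python
import re
from datetime import datetime

# One combined-log entry: ip, identity, user, [timestamp], "request",
# status code, bytes sent, "referrer", "user agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 01/Jun/1999:03:12:31 -0600
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means no body was sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


sample = ('203.252.234.33 - - [01/Jun/1999:03:12:31 -0600] "GET / HTTP/1.0" '
          '200 4980 "" "Mozilla/4.06 [en] (Win95; I)"')
entry = parse_log_line(sample)
print(entry["ip"], entry["path"], entry["status"], entry["bytes"])
```

Lines that do not match (for example, error log entries) come back as None, so a caller can count and inspect them instead of silently dropping them.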
Problems using the access log
• Caching
• Dynamic Address Allocation by the ISP
• Stateless nature of the HTTP protocol
• Crawler Activity
• Tedious preprocessing and cleaning steps
– Other Approaches for session tracking
• Cookies
• URL rewriting
• Hidden form fields
Filtering Merged Access Log Files
• Filtering the entries for embedded requests
– Image, video and audio files
– HTML files within a frame
• Filtering robot entries
– Access patterns that are not human-like
• Following every link in an HTML document
• Requests only for the text documents
– Analyzing user agent fields
• Tracing popular well-behaved robots
– Requests for robots.txt
• Using a table of web pages
• Filtering consumes 80% of the effort in log analysis
Detecting Robots: http://www.cs.princeton.edu/~kyoungso/papers/robot-usenix.pdf, http://ieeexplore.ieee.org/iel5/7101/19134/00884534.pdf,
http://caltechlib.library.caltech.edu/73/01/Report-2004-NOV.pdf
Popular Robots: http://www.pgts.com.au/pgtsj/pgtsj0502d.html
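Two of the filtering steps above, dropping embedded requests and dropping robot entries, can be sketched in Python as follows. This is only an illustration: the extension list, the user-agent hints, and the crawler entries (ExampleBot, 192.0.2.10) are assumptions made for the example, and real robot detection needs the behavioural checks described in the papers cited above.

```python
import os

# Assumed list of file types that are fetched as part of a page.
EMBEDDED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico"}
# Assumed substrings that well-behaved robots put in their user agent.
ROBOT_AGENT_HINTS = ("bot", "crawler", "spider", "slurp")


def is_embedded_request(path):
    """True if the requested file is an image or similar embedded object."""
    ext = os.path.splitext(path.split("?", 1)[0])[1].lower()
    return ext in EMBEDDED_EXTENSIONS


def find_robot_clients(entries):
    """Clients that asked for robots.txt or announce themselves as robots."""
    robots = set()
    for e in entries:
        agent = e["agent"].lower()
        if e["path"] == "/robots.txt" or any(h in agent for h in ROBOT_AGENT_HINTS):
            robots.add((e["ip"], e["agent"]))
    return robots


def filter_entries(entries):
    """Keep only page requests made by clients not flagged as robots."""
    robots = find_robot_clients(entries)
    return [e for e in entries
            if (e["ip"], e["agent"]) not in robots
            and not is_embedded_request(e["path"])]


entries = [
    {"ip": "203.30.5.145", "path": "/Calls/OWOM.html", "agent": "Mozilla/4.5 [en] (Win98; I)"},
    {"ip": "203.30.5.145", "path": "/Calls/Images/earthani.gif", "agent": "Mozilla/4.5 [en] (Win98; I)"},
    # Hypothetical crawler entries, not taken from the sample log.
    {"ip": "192.0.2.10", "path": "/robots.txt", "agent": "ExampleBot/1.0"},
    {"ip": "192.0.2.10", "path": "/CP.html", "agent": "ExampleBot/1.0"},
]
kept = filter_entries(entries)
print([e["path"] for e in kept])
```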
Data preparation
Pre-processing of web usage data
Data Preprocessing (1)
Data cleaning
– remove irrelevant references and fields in server logs
– remove references due to spider navigation
– remove erroneous references
– add missing references due to caching (done after sessionization)
Data integration
– synchronize data from multiple server logs
– Integrate semantics, e.g.,
• meta-data (e.g., content labels)
• e-commerce and application server data
– integrate demographic / registration data
Data Preprocessing (2)
Data Transformation
– user identification
– sessionization / episode identification
– pageview identification
• a pageview is a set of page files and associated objects that contribute to a single display in a Web browser
Data Reduction
– sampling and dimensionality reduction (ignoring certain pageviews /
items)
– Identifying User Transactions (i.e., sets or sequences of pageviews
possibly with associated weights)
Why sessionize?
• Quality of the patterns discovered depends on the quality of the data on
which mining is applied.
• In Web usage analysis, these data are the sessions of the site visitors: the
activities performed by a user from the moment she enters the site until
the moment she leaves it.
• Difficult to obtain reliable usage data due to proxy servers and
anonymizers, dynamic IP addresses, missing references due to caching,
and the inability of servers to distinguish among different visits.
• Cookies and embedded session IDs produce the most faithful
approximation of users and their visits, but are not used in every site, and
not accepted by every user.
• Therefore, heuristics are needed that can sessionize the available access
data.
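One common time-based heuristic can be sketched as follows: treat each (IP address, user agent) pair as one user and start a new session whenever the gap between two consecutive requests exceeds a timeout. The 30-minute value and the function name sessionize are assumptions for this sketch, and the last request in the sample data is hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Split (ip, agent, timestamp, path) tuples into per-user sessions."""
    by_user = {}
    for ip, agent, ts, path in sorted(requests, key=lambda r: r[2]):
        by_user.setdefault((ip, agent), []).append((ts, path))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            # A long silence closes the current session.
            if cur[0] - prev[0] > timeout:
                sessions.append((user, current))
                current = []
            current.append(cur)
        sessions.append((user, current))
    return sessions


requests = [
    ("203.30.5.145", "Mozilla/4.5", datetime(1999, 6, 1, 3, 9, 21), "/Calls/OWOM.html"),
    ("203.252.234.33", "Mozilla/4.06", datetime(1999, 6, 1, 3, 12, 31), "/"),
    ("203.252.234.33", "Mozilla/4.06", datetime(1999, 6, 1, 3, 13, 11), "/CP.html"),
    ("203.30.5.145", "Mozilla/4.5", datetime(1999, 6, 1, 3, 13, 25), "/Calls/AWAC.html"),
    # Hypothetical later request, added to show a session break.
    ("203.30.5.145", "Mozilla/4.5", datetime(1999, 6, 1, 4, 0, 0), "/"),
]
sessions = sessionize(requests)
for user, hits in sessions:
    print(user[0], [path for _, path in hits])
```

This heuristic cannot separate two people behind the same proxy who use the same browser, which is one reason cookies or embedded session IDs give a more faithful picture when they are available.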
Mechanisms for User Identification
Basic Framework for Web Log Data Analysis
[Figure: framework diagram. Labelled components: Site Content, Content Analysis Module, Web/Application Server Logs, Data Cleaning / Sessionization Module, Integrated Sessionized Data, Data Integration Module, E-Commerce Data Mart, Data Cube, OLAP Tools, OLAP Analysis, Session Analysis / Static Aggregation]
Advantages:
– Gain insight into typical customer behaviors.
– Trace specific problems with the site.
Drawbacks:
– LOTS of data.
– Difficult to generalize.
Online Analytical Processing (OLAP)
Allows changes to aggregation level for multiple dimensions.
Generally associated with a Data Warehouse.
Advantages & Drawbacks
– Very flexible
– Requires significantly more resources than static reporting.
Page View | Number of Sessions | Average View Count per Session
Kid's Stuff Products | 2,000 | 5.9

Page View | Number of Sessions | Average View Count per Session
Kid's Stuff Products > Electronics > Educational | 63 | 2.3
Kid's Stuff Products > Electronics > Radio-Controlled | 93 | 2.5
Web Log Analytics
• The measurement, collection, analysis and reporting of internet
data for purposes of understanding and optimizing web usage
• Tools
– Webalizer
– Sawmill
– WebTrends
– AWStats
– WWWStat
– Apache Logs Viewer
– Google Analytics
[Figure labels: Level of Processing; Static Aggregation and Statistics; Session Analysis]
UCSF School of Medicine, Office of the Dean, Information Services Unit
A Few Definitions
• Hits
– A request for a file from the web server. Available only in log analysis
• Page Views
– A request for a file whose type is defined as a page
• Visits/Sessions
– A series of requests from the same uniquely identified client, ended by a set
period of inactivity (the timeout), often 30 minutes. A visit contains one or more page views
• Click Paths
– The sequence of hyperlinks that one or more website visitors follow on a
given site
https://communicators.ucsf.edu/resources/files/web_analytics.ppt
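To make the difference between these measures concrete, the sketch below counts them for requests grouped by visit. It assumes that a "page" is any path ending in .html, .htm or /, and it approximates a click path by the ordered list of pages viewed in a visit; both are simplifications chosen for this example. The request paths are the ones from the sample access log, grouped by client.

```python
PAGE_SUFFIXES = (".html", ".htm", "/")  # assumed definition of a "page"


def is_page(path):
    return path.endswith(PAGE_SUFFIXES)


def summarize(visits):
    """visits: list of visits, each an ordered list of requested paths."""
    hits = sum(len(reqs) for reqs in visits)
    click_paths = [[p for p in reqs if is_page(p)] for reqs in visits]
    page_views = sum(len(cp) for cp in click_paths)
    return {
        "hits": hits,                # every file request
        "page_views": page_views,    # only requests for pages
        "visits": len(visits),
        "click_paths": click_paths,  # pages in the order they were viewed
    }


visits = [
    ["/Calls/OWOM.html", "/Calls/Images/earthani.gif",
     "/Calls/Images/line.gif", "/Calls/AWAC.html"],
    ["/", "/Images/line.gif", "/Images/red.gif",
     "/Images/earthani.gif", "/CP.html"],
]
stats = summarize(visits)
print(stats["hits"], stats["page_views"], stats["visits"])
```

The gap between nine hits and four page views shows why hit counts alone overstate how much content visitors actually looked at.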
What Numbers Say
• About Navigation
• About Content
• About Users
Market-Basket transactions
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke
Example of Association Rules
{Diaper} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
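Association rules are usually scored with two standard measures that the slide does not name: support (the fraction of transactions containing all items of the rule) and confidence (among transactions containing the left-hand side, the fraction that also contain the right-hand side). The sketch below computes both for the five transactions above; the function names are illustrative.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions)


def support(itemset):
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    base = count(antecedent)
    if base == 0:
        return 0.0
    return count(set(antecedent) | set(consequent)) / base


# {Diaper} -> {Beer}
print(support({"Diaper", "Beer"}), confidence({"Diaper"}, {"Beer"}))
# {Beer, Bread} -> {Milk}
print(support({"Beer", "Bread", "Milk"}), confidence({"Beer", "Bread"}, {"Milk"}))
```

For {Diaper} → {Beer}, three of the five transactions contain both items and four contain Diaper, so the support is 0.6 and the confidence is 0.75. A high confidence still describes co-occurrence only.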
Applications
• Pre-fetching and caching web pages
• Web site reorganisation
• Personalisation
• Recommendation of links and products
Applications
• Calibration of a Web server:
– Prediction of the next page invocation over a group of
concurrent Web users under certain constraints
• Sequence mining, Markov chains
• Cross-selling of products:
– Mapping of Web pages/objects to products
– Discovery of associated products
• Association rules, Sequence Mining
– Placement of associated products on the same page
Applications
Sophisticated cross-selling and up-selling of products:
– Mapping of pages/objects to products of different price groups
– Identification of Customer Groups
• Clustering, Classification
– Discovery of associated products of the same/different price
categories
• Association rules, Sequence Mining
– Formulation of recommendations to the end-user
• Suggestions on associated products
• Suggestions based on the preferences of similar users
Summary
• Web usage mining has emerged as the essential tool for
realizing more personalized, user-friendly and business-
optimal Web services.
• The key is to use the user-clickstream data for many mining
purposes.
• Traditionally, Web usage mining is used by e-commerce
sites to organize their sites and to increase profits.
• It is now also used by search engines to improve search
quality and to evaluate search results, and by many
other applications.
Week 9: Lecture 5
Probabilistic models of browsing behavior
• Useful to build models that describe the
browsing behavior of users
• Can generate insight into how users use the
website
• Provide mechanism for making predictions
• Can help in pre-fetching and personalization
Markov models for understanding user behavior
• General approach is to use a finite-state Markov chain
– Each state can be a specific Web page or a category of Web
pages
– If only interested in the order of visits (and not in time), each
new request can be modeled as a transition of states
• Issues
– Self-transition
– Time-independence
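Discrete-Time Markov Chains
Many real-world systems contain uncertainty and
evolve over time.
• Calculate probability
[Figure: state-transition diagram around the Browse state; edge labels include a transition count (7) and transition probabilities (8/20, 7/20)]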
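A transition probability can be estimated as the number of observed transitions from one state to another divided by the total number of transitions leaving the first state (for example, 7 out of 20 gives 7/20). The sketch below applies that estimate to a few hypothetical sessions; the state names and sessions are made up for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next state | current state) from ordered state sequences."""
    counts = defaultdict(Counter)
    for states in sessions:
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1  # self-transitions are counted too

    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


# Hypothetical sessions, each a sequence of visited states.
sessions = [
    ["Entry", "Browse", "Browse", "Add to cart", "Exit"],
    ["Entry", "Browse", "Exit"],
    ["Entry", "Browse", "Add to cart", "Exit"],
]
probs = transition_probabilities(sessions)
print(probs["Browse"])
```

Here four transitions leave Browse: one back to Browse, two to Add to cart and one to Exit, giving probabilities 0.25, 0.5 and 0.25.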
Customer’s think time
[Figure: client/server timeline with instants t0 to t6, showing the request for Page A, the server sending Page A, the customer's think time, and the request for Page B]
Finding average think time
• Average think time for a transition = total think time over all the visits
from one state to the other / frequency of such visits
[Figure: state diagram with the states Browse, Add to cart and Select; edge labels 10, 5, 15 and 12]
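The averaging rule above can be sketched as follows: sum the think times observed for each (from state, to state) pair and divide by how often that transition occurred. The observations below are hypothetical values in seconds, not figures read from the slide.

```python
from collections import defaultdict


def average_think_times(observations):
    """observations: iterable of (from_state, to_state, think_time_seconds)."""
    totals = defaultdict(float)
    frequency = defaultdict(int)
    for src, dst, seconds in observations:
        totals[(src, dst)] += seconds
        frequency[(src, dst)] += 1
    # Average = total think time / number of times the transition was made.
    return {pair: totals[pair] / frequency[pair] for pair in totals}


# Hypothetical measurements.
observations = [
    ("Browse", "Add to cart", 10.0),
    ("Browse", "Add to cart", 20.0),
    ("Browse", "Select", 12.0),
]
averages = average_think_times(observations)
print(averages)
```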
Browsing Behaviour as a Markov Chain
Properties of the transition probability matrix of a CBMG (Customer Behavior Model Graph), where state 1 is Entry and state n is Exit
• $p_{i,1} = 0$ for $2 \le i \le n-1$
– No transition can be made to the Entry state from any state
other than the Exit state.
• $p_{1,n} = 0$
– No transition can be made from the Entry state to the Exit state.
• $p_{n,j} = 0$ for $2 \le j \le n-1$
– No transition can be made from the Exit state to any state other
than the Entry state.
• $p_{n,n} + p_{n,1} = 1$
– From the Exit state, the only possible transitions are to itself or to the Entry state.
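These four properties are easy to check mechanically. The sketch below validates a candidate matrix stored as a list of rows; note that Python indexes from 0, so the Entry state is row 0 and the Exit state is row n-1. The example matrix and its four states are hypothetical.

```python
def check_cbmg_matrix(p, tol=1e-9):
    """Return a list of violated properties (empty if the matrix is valid)."""
    n = len(p)
    entry, exit_ = 0, n - 1  # state 1 and state n in the slide's notation
    problems = []
    for i in range(1, n - 1):
        if p[i][entry] != 0:
            problems.append(f"state {i + 1} has a transition to Entry")
    if p[entry][exit_] != 0:
        problems.append("Entry has a direct transition to Exit")
    for j in range(1, n - 1):
        if p[exit_][j] != 0:
            problems.append(f"Exit has a transition to state {j + 1}")
    if abs(p[exit_][exit_] + p[exit_][entry] - 1) > tol:
        problems.append("Exit row does not sum to 1 over Exit and Entry")
    return problems


# Hypothetical 4-state CBMG: Entry, Browse, Add to cart, Exit.
P = [
    [0.0, 1.0, 0.0, 0.0],    # Entry
    [0.0, 0.25, 0.5, 0.25],  # Browse
    [0.0, 0.5, 0.0, 0.5],    # Add to cart
    [0.0, 0.0, 0.0, 1.0],    # Exit
]
print(check_cbmg_matrix(P))

# Breaking the first property: Browse is given a transition back to Entry.
bad = [row[:] for row in P]
bad[1][0] = 0.1
print(check_cbmg_matrix(bad))
```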
End of Week 9