Web Mining
Introduction
Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
Discovering useful information from the World-Wide Web and its usage patterns.
Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web.
Web Mining Outline
Goal: Examine the use of data mining on
the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Mining
The WWW is a huge, widely distributed, global information service centre for:
Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources of data for data mining
Why mine the Web?
Enormous wealth of information on Web
– Financial information (e.g. NSE)
– Book/CD/Video stores (e.g. Amazon)
– Restaurant information (e.g. Zomato)
– Car prices (e.g. Cars24)
Lots of data on user access patterns
Web logs contain sequence of URLs accessed by users
Possible to mine interesting nuggets of information
– People who ski also travel frequently to Europe
– Tech stocks have corrections in the summer and rally from November until February
Web Mining Issues
Size
– >350 million pages (1999)
– Grows by about 1 million pages a day
– Google indexes 3 billion documents
– Over 130 trillion pages (2016); Google's index exceeds 100,000,000 gigabytes
– 2.5 exabytes (2,500,000,000 gigabytes) of data created every single day
Diverse types of data
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
– Profiles
– Registration information
– Cookies
Web Mining Taxonomy
Web Content Mining
Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Crawlers
Robot (spider) that traverses the hypertext structure of the Web.
Collect information from visited pages
Used to construct indexes for search engines
Common crawling/scraping tools: Scrapy, Beautiful Soup, Selenium
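The traversal at the heart of a crawler can be sketched as follows. This is a minimal sketch over a hypothetical in-memory "web" (the PAGES dictionary is illustrative); a real crawler would fetch pages over HTTP with a tool such as Scrapy or Beautiful Soup.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "/a": ("home page", ["/b", "/c"]),
    "/b": ("news page", ["/c"]),
    "/c": ("sports page", ["/a"]),
}

def crawl(seed):
    """Breadth-first traversal that collects page text for an index."""
    index, frontier, seen = {}, deque([seed]), {seed}
    while frontier:
        url = frontier.popleft()
        text, links = PAGES[url]
        index[url] = text              # collect information from the visited page
        for link in links:
            if link not in seen:       # never revisit a page
                seen.add(link)
                frontier.append(link)
    return index

index = crawl("/a")
```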
Focused Crawler
Only visit links from a page if that page
is determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier which assigns a relevance score to each page based on the crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier and distiller scores.
Focused Crawler
Classifier to relate documents to topics
Classifier also determines how useful
outgoing links are
Hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.
Context Focused Crawler
Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to documents at the next higher level.
– Updated during the crawl itself.
Approach:
1. Construct context graphs and classifiers using seed documents as training data.
2. Perform crawling using the classifiers and context graphs created.
Context Graph
Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of the MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in the first layer of the MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.
Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify a user's preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.
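Collaborative filtering can be sketched as follows: score unseen items for a user by looking at the most similar other user, where similarity is cosine similarity over rating vectors. All user names, page IDs, and ratings below are illustrative.

```python
import math

# Hypothetical user -> {page: rating} data.
ratings = {
    "ann": {"p1": 5, "p2": 3, "p3": 4},
    "bob": {"p1": 4, "p2": 2, "p3": 5},
    "eve": {"p1": 1, "p2": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(user):
    """Recommend pages the most similar user rated but this user has not seen."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return sorted(i for i in ratings[nearest] if i not in ratings[user])
```

Using a single nearest neighbour keeps the sketch short; practical systems aggregate over many similar users.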
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to
more effectively retrieve important pages.
PageRank
Used by Google
Prioritize pages returned from a search by looking at Web structure.
Importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.
PageRank (cont'd)
PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
– PR(i): PageRank of a page i which points to target page p
– Ni: number of links coming out of page i
– c: normalization constant (c < 1)
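The formula above can be iterated to a fixed point. A minimal sketch on a hypothetical three-page graph; the added (1 − c)/N term is the commonly used damped variant of the formula (it guarantees convergence), not something stated on the slide.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, c=0.85, iters=50):
    """Iterate PR(p) = (1-c)/N + c * sum over in-links i of PR(i)/N_i."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # sum PR(i)/N_i over every page i that links to p
            rank = sum(pr[i] / len(links[i]) for i in pages if p in links[i])
            new[p] = (1 - c) / len(pages) + c * rank
        pr = new
    return pr

pr = pagerank(links)
```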
PageRank
The connections between pages are represented by a graph (the web graph).
A node represents a webpage, and an arrow from page A to page B means that there is a link from page A to page B.
The number of out-going links is an important parameter.
The "out-degree" of a node is the number of out-going links contained in a page.
Example
Let N be the total number of pages. We create an N × N matrix A by defining the (i, j)-entry as
A(i, j) = 1/Ni if page i links to page j, and 0 otherwise,
where Ni is the number of out-going links on page i.
CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
– Highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages, R.
Identify hub and authority pages for these:
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
Pages with the highest ranks in R are returned.
HITS Algorithm
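The weight calculation can be sketched as the classic mutual-reinforcement iteration: a page's authority weight is the sum of the hub weights of pages pointing to it, and its hub weight is the sum of the authority weights of pages it points to. The link graph below is a small hypothetical base set B.

```python
# Hypothetical base set B as a link dictionary: page -> pages it links to.
links = {
    "p": ["q", "r"],
    "q": ["r"],
    "r": [],
}

def hits(links, iters=20):
    auth = {p: 1.0 for p in links}
    hub = {p: 1.0 for p in links}
    for _ in range(iters):
        # authority weight: sum of hub weights of pages linking here
        auth = {p: sum(hub[i] for i in links if p in links[i]) for p in links}
        # hub weight: sum of authority weights of pages linked to
        hub = {p: sum(auth[q] for q in links[p]) for p in links}
        # normalize so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5
        nh = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

auth, hub = hits(links)
```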
Web Usage Mining
Mines the data describing how users access the Web.
Web logs record the sequences of URLs accessed by users.
Applies pattern-discovery techniques to these access sequences.
Web Usage Mining Applications
Personalization
Improve structure of a site's Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)
Identify criminal activities
Customer relationship management
Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis
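The pattern-discovery step above can be sketched as counting ordered, consecutive page patterns across sessions (the sessions below are illustrative):

```python
from collections import Counter

# Hypothetical sessions: each is the ordered page sequence of one sitting.
sessions = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C"],
]

def count_patterns(sessions, length):
    """Count ordered, consecutive page patterns of a given length."""
    counts = Counter()
    for s in sessions:
        for i in range(len(s) - length + 1):
            counts[tuple(s[i:i + length])] += 1
    return counts

pairs = count_patterns(sessions, 2)
```

Frequent patterns are then those whose count exceeds a support threshold, as in association-rule mining, but with order preserved.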
Downside
Privacy: usage mining can pose a big threat to the company.
Without high ethical standards, additional personal data can be exposed.
ARs in Web Mining
Web Mining:
– Content
– Structure
– Usage
Frequent patterns of sequential page
references in Web searching.
Uses:
– Caching
– Clustering users
– Develop user profiles
– Identify important pages
Web Usage Mining Issues
Identification of the exact user is not possible.
The exact sequence of pages referenced by a user is not recoverable due to caching.
Sessions are not well defined.
Security, privacy, and legal issues
Web Log Cleansing
Replace source IP address with unique but
non-identifying ID.
Replace exact URL of pages referenced
with unique but non-identifying ID.
Delete error records and records containing no page data (such as figures and code).
Sessionizing
Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from
a source IP address occurring within a
predefined time interval (e.g. 25 minutes).
– All consecutive page references from a
source IP address where the interclick time is
less than a predefined threshold.
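The second technique (interclick-time threshold) can be sketched as follows, assuming a toy log of (IP, timestamp-in-seconds, URL) records:

```python
# Hypothetical, already-cleansed Web log entries.
log = [
    ("1.2.3.4", 0,    "/a"),
    ("1.2.3.4", 60,   "/b"),
    ("1.2.3.4", 2000, "/c"),   # long gap -> start of a new session
    ("5.6.7.8", 10,   "/a"),
]

def sessionize(log, threshold=1500):
    """Split each IP's clickstream whenever the interclick gap exceeds threshold."""
    sessions = {}    # ip -> list of sessions (each a list of URLs)
    last_seen = {}   # ip -> timestamp of previous click
    for ip, t, url in sorted(log, key=lambda e: (e[0], e[1])):
        if ip not in sessions or t - last_seen[ip] > threshold:
            sessions.setdefault(ip, []).append([])   # open a new session
        sessions[ip][-1].append(url)
        last_seen[ip] = t
    return sessions

s = sessionize(log)
```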
Data Structures
Keep track of patterns identified during
Web usage mining process
Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
Trie vs. Suffix Tree
Trie:
– Rooted tree
– Edges labeled with a character (page) from the pattern
– Path from root to leaf represents a pattern.
Suffix Tree:
– Nodes with a single child are collapsed with their parent; the edge contains the labels of both prior edges.
Generalized Suffix Tree
Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains in each node a count of how often the pattern occurs.
WAP Tree:
Compressed version of the generalized suffix tree
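A count-maintaining trie over all suffixes of all sessions is a simple stand-in for the generalized suffix tree and can be sketched as (sessions are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0   # how many inserted patterns pass through this node

def insert(root, pattern):
    node = root
    for page in pattern:
        node = node.children.setdefault(page, TrieNode())
        node.count += 1

def frequency(root, pattern):
    """Occurrence count of a consecutive pattern across all sessions."""
    node = root
    for page in pattern:
        if page not in node.children:
            return 0
        node = node.children[page]
    return node.count

root = TrieNode()
for session in [["A", "B", "C"], ["A", "B"], ["B", "C"]]:
    # insert every suffix so consecutive patterns anywhere are counted
    for i in range(len(session)):
        insert(root, session[i:])
```

A true suffix tree additionally collapses single-child chains; the WAP tree compresses further.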
Types of Patterns
Algorithms have been developed to discover
different types of patterns.
Properties:
– Ordered – characters (pages) must occur in the exact order of the original session.
– Duplicates – duplicate characters are allowed in the pattern.
– Consecutive – all characters in the pattern must occur consecutively in the given session.
– Maximal – not a subsequence of another pattern.
Pattern Types
– Association Rules: none of the properties hold
– Episodes: only ordering holds
– Sequential Patterns: ordered and maximal
– Forward Sequences: ordered, consecutive, and maximal
– Maximal Frequent Sequences: all properties hold
Episodes
Partially ordered set of pages
Serial episode – totally ordered with a time constraint
Parallel episode – partially ordered with a time constraint
General episode – partially ordered with no time constraint
DAG for Episode
Spatial Mining Outline
Goal: Provide an introduction to some
spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering
Spatial Object
Contains both spatial and nonspatial attributes.
Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
May retrieve objects using spatial attributes, nonspatial attributes, or both.
Spatial Data Mining Applications
Geology
GIS (Geographic Information Systems)
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects
Spatial Queries
Spatial selection may involve specialized
selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
Region (Range) Query – find objects that intersect a given region.
Nearest Neighbor Query – find the object closest to an identified object.
Distance Scan – find objects within a certain distance of an identified object, where the distance is made increasingly larger.
Spatial Data Structures
Data structures designed specifically to store or
index spatial data.
Often based on the B-tree or binary search tree.
Cluster data on disk based on geographic location.
May represent complex spatial structure by
placing the spatial object in a containing structure
of a specific geographic shape.
Techniques:
– Quad Tree
– R-Tree
– k-D Tree
MBR
Minimum Bounding Rectangle
Smallest rectangle that completely
contains the object
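For an object given as a set of 2-D points (e.g. sampled boundary points), the MBR computation is just the coordinate-wise minimum and maximum:

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points: (lower-left, upper-right)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

# Illustrative object sampled as three points.
box = mbr([(1, 4), (3, 1), (2, 6)])
```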
Quad Tree
Hierarchical decomposition of the space
into quadrants (MBRs)
Each level in the tree represents the
object as the set of quadrants which
contain any portion of the object.
Each level is a more exact representation
of the object.
The number of levels is determined by
the degree of accuracy desired.
R-Tree
As with the Quad Tree, the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
The lowest-level cell has only one object.
Tree maintenance algorithms are similar to those for B-trees.
k-D Tree
Designed for multi-attribute data, not
necessarily spatial
Variation of binary search tree
Each level is used to index one of the
dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but
successive divisions of the dimension
range.
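The build can be sketched as a minimal 2-D k-D tree: each level splits on the next dimension at the median point, so the divisions follow the dimension ranges rather than MBRs (the points are illustrative):

```python
def build(points, depth=0):
    """Recursively build a 2-D k-D tree, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % 2                       # dimension indexed at this level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes the node
    return {
        "point": points[mid],
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```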
Topological Relationships
Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains
Distance Between Objects
Euclidean
Manhattan
Extensions:
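The two named distance measures, for n-dimensional points:

```python
import math

def euclidean(p, q):
    """Straight-line distance: sqrt(sum of squared coordinate differences)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```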
Progressive Refinement
Compute approximate answers prior to more accurate ones.
Filter out data not part of answer
Hierarchical view of data based on
spatial relationships
Coarse predicate recursively refined
Spatial Data Dominant Algorithm
STING
STatistical INformation Grid
Hierarchical technique to divide the area into rectangular cells
Grid data structure contains summary information about each cell
Hierarchical clustering
Similar to a quad tree
STING Build Algorithm
STING Algorithm
Spatial Rules
Characteristic Rule:
The average family income in Dallas is $50,000.
Discriminant Rule:
The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association Rule:
The average family income in Dallas for families living near White Rock Lake is $100,000.
Spatial Association Rules
Either antecedent or consequent must
contain spatial predicates.
View underlying database as set of
spatial objects.
May be created using a type of progressive refinement.
Spatial Association Rule Algorithm
Spatial Classification
Partition spatial objects
May use nonspatial attributes and/or
spatial attributes
Generalization and progressive
refinement may be used.
ID3 Extension
Neighborhood Graph
– Nodes – objects
– Edges – connects neighbors
Definition of neighborhood varies
ID3 considers nonspatial attributes of all
objects in a neighborhood (not just one)
for classification.
Spatial Decision Tree
Approach similar to that used for spatial
association rules.
Spatial objects can be described based
on objects close to them – Buffer.
Description of class based on
aggregation of nearby objects.
Spatial Decision Tree Algorithm
Spatial Clustering
Detect clusters of irregular shapes
Use of centroids and simple distance
approaches may not work well.
Clusters should be independent of order
of input.
CLARANS Extensions
Remove the main-memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and the R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi Diagram
SD(CLARANS)
Spatial Dominant
First clusters spatial components using
CLARANS
Then iteratively replaces medoids, but
limits number of pairs to be searched.
Uses generalization
Uses a learning tool to derive a description of each cluster.
SD(CLARANS) Algorithm
DBCLASD
Distribution Based Clustering of LArge Spatial Databases
Extension of DBSCAN
Assumes items in a cluster are uniformly distributed.
Identifies the distribution satisfied by distances between nearest neighbors.
Objects are added if the distribution remains uniform.
DBCLASD Algorithm
Aggregate Proximity
Aggregate proximity – a measure of how close a cluster is to a feature.
The aggregate proximity relationship finds the k closest features to a cluster.
CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull