Module 1, Part A: Web Mining Introduction
Applications
What is Web Mining
The potential for extracting valuable knowledge from the Web is evident
Web mining is the collection of technologies used to realize this potential
Source: Intel, 2012
What’s needed to succeed in the new
world of “big data” Internet?
Leveraging big data
Many of these applications must manage, clean, preprocess, and integrate
data, often unstructured, from across many channels
Biggest challenge is in data distillation and preprocessing
Effective use of data mining and analytics
No longer just a luxury but an integral part of systems
Especially important to leverage and effectively use user behavior and
social data
Real-time deployment of models
Needed for effective delivery of relevant, targeted, personalized content
Especially important on the Web: Predictive User Modeling
Predictive User Modeling
The Problem
Dynamically serve customized content (ads, products, deals,
recommendations, etc.) to users based on their profiles, preferences, or
expected interests
Web Data Mining- Challenges &
Opportunities
Web is huge
Data of all types: structured tables, semi-structured pages, unstructured text, and
multimedia files (images, audio, and video)
Information on the Web is heterogeneous
Web is linked
Hyperlinks
Authoritative pages
Web is noisy
main content
businesses and commerce
recommender systems
Web is dynamic
Web is a virtual society
opinion mining and social network analysis
Data Mining
Knowledge Discovery in Databases (KDD)
Discovering useful patterns or knowledge from data sources
Data Mining Tasks
Supervised learning (or classification)
Unsupervised learning (or clustering)
Association rule mining
Sequential pattern mining
Data Mining Stages
Pre-processing
Data mining task
Post-processing
Data Mining Vs Web Mining
Traditional data mining uses structured data stored in relational
tables, spreadsheets, or flat files in tabular form.
With the growth of the Web and text documents, Web mining and
text mining are becoming increasingly important and popular.
Although Web mining uses many data mining techniques, it is not purely an
application of traditional data mining techniques, because Web data is
heterogeneous and semi-structured or unstructured.
Web Mining
Web structure mining: discovers useful knowledge from
hyperlinks
Web content mining
Web usage mining
data collection
Information Retrieval and Web Search
Indexing
Taxonomy of Web Mining
[Figure: Web mining taxonomy diagram]
Types of Web Mining
Web content mining: extracting useful knowledge from the contents of Web
documents or other semantic information about Web resources
Applications of Web content mining:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools
Web usage mining: extracting interesting patterns from user interactions
with resources on one or more Web sites
Applications of Web usage mining:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems
Web structure mining: discovering useful patterns from the hyperlink
structure connecting Web sites or Web resources
Applications of Web structure mining:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
Web Content Mining
:: common approaches and applications
Basic notion: document similarity
Most Web content mining and information retrieval applications involve
measuring similarity among two or more documents
Vector representation facilitates similarity computations using vector-space
operations (such as Cosine of the angle between two vectors)
Examples
Search engines: measure the similarity between a query (represented as a
vector) and the indexed document vectors to return a ranked list of relevant
documents
Document clustering: group documents based on similarity or dissimilarity
(distance) among them
Document categorization: measure the similarity of a new document to be
classified with representations of existing categories (such as the mean vector
representing a group of document vectors)
Personalization: recommend documents or items based on their similarity to a
representation of the user’s profile (may be a term vector representing concepts
or terms of interest to the user)
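The similarity computations above can be sketched in a few lines. This is a minimal illustration, assuming raw term-frequency vectors and whitespace tokenization (real systems typically add weighting such as TF-IDF, stemming, and stop-word removal). The function names and the sample documents are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a document as a sparse term-frequency vector (assumed weighting)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Search: return document ids sorted by decreasing similarity to the query."""
    q = term_vector(query)
    scores = {doc_id: cosine(q, term_vector(text)) for doc_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

def mean_vector(vectors):
    """Categorization: the mean vector representing a group of document vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}

# Hypothetical sample documents.
docs = {
    "d1": "web mining extracts knowledge from web data",
    "d2": "hyperlink analysis ranks web pages",
    "d3": "cooking pasta with tomato sauce",
}
ranking = rank("web mining", docs)

# Two hypothetical categories, each represented by the mean of its documents.
categories = {
    "web": mean_vector([term_vector(docs["d1"]), term_vector(docs["d2"])]),
    "food": mean_vector([term_vector(docs["d3"])]),
}
new_doc = term_vector("ranking web pages")
label = max(categories, key=lambda name: cosine(new_doc, categories[name]))
```

The same `cosine` function serves all of the examples: a user profile stored as a term vector can be compared against candidate items exactly as the query is compared against documents here.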
Web Content Mining
:: example – clustered search results
Can drill down within clusters to view sub-topics or to view the relevant
subset of results
Web Content Mining
:: example – personalized content delivery
Google's personalized news is an example of a content-based recommender
system which recommends items (in part) based on the similarity of their
content to a user’s profile (gathered from search and click history)
Web Structure Mining
:: graph structures on the Web
The structure of a typical Web graph
Web pages as nodes
hyperlinks as edges connecting two related pages
Hyperlink Analysis
Hyperlinks can serve as a tool for pure navigation
But, often they are used to point to pages with authority on the same topic as the
source page (similar to a citation in a publication)
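As a rough sketch of the graph view described above, the Web graph can be held as two adjacency maps built from (source, target) hyperlink pairs. The page names are hypothetical, and counting in-links as a crude citation count is only an illustration of the citation analogy, not a ranking method from the slides.

```python
from collections import defaultdict

def build_graph(links):
    """Build out-link and in-link adjacency maps from (source, target) pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in links:
        out_links[source].add(target)   # edge leaving the source page
        in_links[target].add(source)    # same edge, seen from the target page
    return out_links, in_links

# Hypothetical hyperlinks between four pages.
links = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
out_links, in_links = build_graph(links)

# In-degree per page: how many distinct pages "cite" it.
in_degree = {page: len(sources) for page, sources in in_links.items()}
```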
Some interesting Web structures *
Web Structure Mining
:: example – Google’s PageRank algorithm
Basic idea:
Rank of a page depends on the ranks of pages
pointing to it
The out-degree of a page is the number of edges
pointing away from it; it is used to compute the
contribution of the page to the pages it points to
The final PageRank value represents the
probability that a random surfer will reach the
page
d is the probability that a random surfer chooses the
page directly rather than getting there via
navigation
[Figure: Illustration of PageRank propagation]
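The formula itself did not survive in this text, so the sketch below assumes one common form that matches the definitions above: PR(p) = d/N + (1 - d) * (sum over pages q linking to p of PR(q) / OutDegree(q)), where N is the number of pages. Note that d here is the probability of jumping directly to a page, as on this slide; many references instead call the link-following probability the damping factor. The default d = 0.15, the iteration count, and the even spreading of rank from pages without out-links are all assumptions.

```python
def pagerank(out_links, d=0.15, iterations=100):
    """Iteratively estimate PageRank; out_links maps page -> list of linked pages."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly (an assumption).
        dangling = sum(rank[p] for p in pages if not out_links.get(p))
        new_rank = {p: d / n + (1 - d) * dangling / n for p in pages}
        for q, targets in out_links.items():
            if targets:
                # Each page splits its rank equally among its out-links.
                share = (1 - d) * rank[q] / len(targets)
                for p in targets:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this small graph, page c is pointed to by both a and b, so it ends up with the highest value, and the values sum to 1 because they form a probability distribution.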
Web Structure Mining
:: example – Hubs and Authorities
Basic idea
Authority comes from in-edges
Being a hub comes from out-edges
Mutually reinforcing relationship
A good authority is a page that is pointed
to by many good hubs.
A good hub is a page that points to many
good authorities.
Together they tend to form a bipartite graph
[Figure: Hubs and Authorities]
This idea can be used to discover
authoritative pages related to a topic
HITS algorithm – Hypertext Induced
Topic Search
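The mutually reinforcing updates can be sketched as a simple iteration. This is a minimal version, assuming the link graph for the topic has already been collected (HITS normally runs on a query-specific subgraph, which is not built here). The page names, the iteration count, and the Euclidean normalization are assumptions for illustration.

```python
import math

def hits(out_links, iterations=50):
    """Return (hub, authority) scores; out_links maps page -> list of linked pages."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for q, targets in out_links.items():
            for p in targets:
                auth[p] += hub[q]
        # Hub score: sum of the authority scores of the pages it points to.
        hub = {q: sum(auth[p] for p in out_links.get(q, ())) for q in pages}
        # Normalize both score sets so the values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub pages pointing to two candidate authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(graph)
```

Here a1 is pointed to by all three hubs and so receives the highest authority score, while h1 and h2 point to both authorities and score higher as hubs than h3.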
Web Structure Mining
:: example – online communities
Basic idea
Web communities are collections of
Web pages such that each member
node has more hyperlinks (in either
direction) within the community than
outside the community.
Typical approach: Maximal-flow model *
Ex: separate the two subgraphs with
any choice of source node (left
subgraph) and sink node (right
subgraph), removing the three dashed
links
[Figure: Community 1 and Community 2, with source node and sink node marked]
* Source: G. Flake, et al. “Self-Organization and Identification of Web Communities”, IEEE Computer,
Vol. 35, No. 3, pp. 66-71, March 2002.
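The source/sink separation in the example can be sketched with a plain maximum-flow computation: once no more flow can be pushed, the nodes still reachable from the source form one side of a minimum cut. This is not the exact procedure of the cited paper. Unit capacities on every link, the breadth-first augmenting-path search, and the toy graph (two groups of five fully linked pages joined by three links) are assumptions made for illustration.

```python
from collections import defaultdict, deque
from itertools import combinations

def min_cut_community(links, source, sink):
    """Return (nodes on the source side of a minimum cut, max-flow value)."""
    # Each undirected hyperlink gets capacity 1 in both directions (assumption).
    capacity = defaultdict(lambda: defaultdict(int))
    for u, v in links:
        capacity[u][v] += 1
        capacity[v][u] += 1

    def reachable_from_source():
        # Breadth-first search along edges that still have spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, remaining in capacity[u].items():
                if remaining > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path is left: the reachable nodes are the community.
            return set(parent), flow
        # Walk back from the sink to recover the augmenting path.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(capacity[u][v] for u, v in path)
        for u, v in path:
            capacity[u][v] -= bottleneck   # use up forward capacity
            capacity[v][u] += bottleneck   # allow the flow to be undone later
        flow += bottleneck

# Hypothetical graph: two dense groups joined by three links.
left = [f"L{i}" for i in range(5)]
right = [f"R{i}" for i in range(5)]
links = list(combinations(left, 2)) + list(combinations(right, 2))
links += [("L0", "R0"), ("L1", "R1"), ("L2", "R2")]
community, cut_size = min_cut_community(links, source="L3", sink="R4")
```

In this toy graph the minimum cut consists of exactly the three connecting links, so the source side is the left group, mirroring the three dashed links in the slide's example.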
Web Usage Mining
The Problem: analyze Web navigational data to
Find how the Web site is used by Web users
Understand the behavior of different user segments
Predict how users will behave in the future
Target relevant or interesting information to individual users or groups of users
Increase sales, profit, loyalty, etc.
Challenge
Quantitatively capture Web users’ common interests and characterize
their underlying tasks
Applications of Web Usage Mining
Electronic Commerce
design cross marketing strategies across products
evaluate promotional campaigns
target electronic ads and coupons at user groups based on their access patterns
predict user behavior based on previously learned rules and users’ profiles
present dynamic information to users based on their interests and profiles:
“Web personalization”
Effective and Efficient Web Presence
determine the best way to structure the Web site
identify “weak links” for elimination or enhancement
prefetch files that are most likely to be accessed
enhance workgroup management & communication
Search Engines
Behavior-based ranking
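The slides do not name a specific technique for predicting user behavior or choosing files to prefetch. As one simple stand-in, the sketch below counts page-to-page transitions in past sessions and predicts the most frequent next page. The session data, page paths, and function names are hypothetical, and real usage mining would first need log cleaning and session identification, which are not shown.

```python
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count how often each page is followed directly by each other page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, page):
    """Return the most frequent next page, or None if the page was never left."""
    if page not in counts:
        return None
    return counts[page].most_common(1)[0][0]

# Hypothetical sessions, each an ordered list of pages visited by one user.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/about"],
    ["/products", "/reviews"],
]
counts = transition_counts(sessions)
next_page = predict_next(counts, "/products")
```

A site could prefetch the predicted page while the user is still viewing the current one; a candidate with few observed transitions would be a weak basis for that decision.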