Module1PartAweb mining-intro

Uploaded by

Ayush Tiwari

Overview of Web Data Mining and Applications
What is Web Mining
 Web mining: the potential for extracting valuable knowledge
from the Web has been quite evident
 Web mining is the collection of technologies that fulfill this objective

Web Mining Definition

The application of data mining and machine learning
techniques to extract useful knowledge from the content,
structure, and usage of Web resources.

 But why is this important, and why is it more relevant now than at
any other time in the history of the Web?

[Figure omitted. Source: Intel, 2012]
What’s needed to succeed in the new
world of “big data” Internet?
 Leveraging big data
 Many of these applications must manage, clean, preprocess, and integrate
often-unstructured data from across many channels
 Biggest challenge is in data distillation and preprocessing
 Effective use of data mining and analytics
 No longer just a luxury but an integral part of systems
 Especially important to leverage and effectively use user behavior and
social data
 Real-time deployment of models
 Needed for effective delivery of relevant, targeted, personalized content
 Especially important on the Web: Predictive User Modeling

Predictive User Modeling
 The Problem
 Dynamically serve customized content (ads, products, deals,
recommendations, etc.) to users based on their profiles, preferences, or
expected interests

 Why do we need it?
 Information spaces are becoming much more complex for users to navigate
(huge online repositories, social networks, mobile applications, blogs, …)
 For businesses: need to grow customer loyalty / increase sales
 Industry Research: successful online retailers are generating as much as
35% of their business from recommendations/targeted content delivery

Web Data Mining- Challenges &
Opportunities
 Web is huge
 structured tables, semi-structured pages, unstructured texts, and
multimedia files (images, audio, and video)
 Information on the Web is heterogeneous
 Web is linked
 Hyperlinks
 Authoritative pages
 Web is noisy
 main content
 businesses and commerce
 recommender systems
 Web is dynamic
 Web is a virtual society
 opinion mining and social network analysis
Data Mining
 Knowledge Discovery in Databases (KDD)
 Discovering useful patterns or knowledge from data sources
 Data Mining Tasks
 Supervised learning (or classification)
 Unsupervised learning (or clustering)
 Association rule mining
 Sequential pattern mining
 Data Mining Stages
 Pre-processing
 Data mining task
 Post-processing
Data Mining Vs Web Mining
 Traditional data mining uses structured data stored in relational
tables, spreadsheets, or flat files in tabular form.
 With the growth of the Web and text documents, Web mining and
text mining are becoming increasingly important and popular.
 Although Web mining uses many data mining techniques, it is not purely an
application of traditional data mining techniques, because Web data is
heterogeneous and semi-structured or unstructured.
Web Mining
 Web structure mining: discovers useful knowledge from
hyperlinks
 Web content mining
 Web usage mining
 data collection
 Information Retrieval and Web Search
 Indexing
Taxonomy of Web Mining
Web Mining
 Web Content Mining
 Web Usage Mining
 Web Structure Mining
Types of Web Mining
:: Web content mining
 Extracting useful knowledge from the contents of Web documents or other
semantic information about Web resources
Types of Web Mining
:: Web content mining – data
 Content data may consist of text, images, audio, video, structured
records from lists and tables, or item attributes from backend databases.
Types of Web Mining
:: Web content mining – applications
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools
Types of Web Mining
:: Web usage mining
 Extracting interesting patterns from user interactions with
resources on one or more Web sites
Types of Web Mining
:: Web usage mining – applications
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems
Types of Web Mining
:: Web structure mining
 Discovering useful patterns from the hyperlink structure
connecting Web sites or Web resources
Types of Web Mining
:: Web structure mining – data
 Data sources include the explicit hyperlinks between documents, or
implicit links among objects (e.g., two objects being “tagged” using
the same keyword).
Types of Web Mining
:: Web structure mining – applications
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis
Web Content Mining
:: common approaches and applications
 Basic notion: document similarity
 Most Web content mining and information retrieval applications involve
measuring similarity between two or more documents
 Vector representation facilitates similarity computations using vector-space
operations (such as the cosine of the angle between two vectors)
 Examples
 Search engines: measure the similarity between a query (represented as a
vector) and the indexed document vectors to return a ranked list of relevant
documents
 Document clustering: group documents based on similarity or dissimilarity
(distance) among them
 Document categorization: measure the similarity of a new document to be
classified with representations of existing categories (such as the mean vector
representing a group of document vectors)
 Personalization: recommend documents or items based on their similarity to a
representation of the user’s profile (may be a term vector representing concepts
or terms of interest to the user)
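
The search-engine example above can be sketched in a few lines of Python. This is a minimal illustration, assuming plain term-frequency vectors and whitespace tokenization; the function names (`term_vector`, `cosine`, `rank`) and the toy corpus are hypothetical, and a real system would typically add stemming, stop-word removal, and TF-IDF weighting.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = term_vector(query)
    scores = [(doc_id, cosine(q, term_vector(text)))
              for doc_id, text in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy corpus, for illustration only.
docs = {
    "d1": "web usage mining of server logs",
    "d2": "hyperlink structure and page ranking",
    "d3": "mining web content and web structure",
}
ranking = rank("web mining", docs)
```

The same `cosine` function could serve the other examples: comparing a new document with the mean vector of a category, or comparing candidate items with a user's profile vector.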

Web Content Mining
:: example – clustered search results

Can drill down within clusters to view sub-topics or to view the
relevant subset of results
Web Content Mining
:: example – personalized content delivery

Google's personalized news is an example of a content-based recommender
system, which recommends items (in part) based on the similarity of their
content to a user’s profile (gathered from search and click history)
Web Structure Mining
:: graph structures on the Web
 The structure of a typical Web graph
 Web pages as nodes
 hyperlinks as edges connecting two related pages
 Hyperlink Analysis
 Hyperlinks can serve as a tool for pure navigation
 But, often they are used to point to pages with authority on the same topic as the
source page (similar to a citation in a publication)
 Some interesting Web structures *

Web Structure Mining
:: example – Google’s PageRank algorithm

 Basic idea:
 The rank of a page depends on the ranks of the pages
pointing to it
 The out-degree of a page is the number of edges
pointing away from it; it is used to compute the
contribution of the page to those to which it
points
 The final PageRank value represents the
probability that a random surfer will reach the
page
 d is the probability that a random surfer chooses the
page directly rather than getting there via
navigation
[Figure: Illustration of PageRank propagation]
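
A minimal power-iteration sketch of the idea above. It follows this slide's convention that d is the probability of a direct jump; many references instead use d for the probability of following a link. The update rule PR(p) = d/N + (1 - d) · Σ PR(q)/outdeg(q), the value d = 0.15, the even redistribution of rank from pages without out-links, and the toy graph are all assumptions for illustration, not taken from the slides.

```python
def pagerank(links, d=0.15, iterations=50):
    """Approximate PageRank by repeated propagation.

    links maps each page to the list of pages it links to.
    d is the probability of jumping directly to a random page
    (this slide's convention; an assumed value of 0.15).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: d / n + (1 - d) * dangling / n for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            # The out-degree splits the page's contribution among its targets.
            share = (1 - d) * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this toy graph, page C is pointed to by three pages and ends up with the highest rank, while D, which has no in-links, keeps only the direct-jump share d/N.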

Web Structure Mining
:: example – Hubs and Authorities
 Basic idea
 Authority comes from in-edges
 Being a hub comes from out-edges
 Mutually reinforcing relationship
 A good authority is a page that is pointed
to by many good hubs.
 A good hub is a page that points to many
good authorities.
 Together they tend to form a bipartite
graph of hubs and authorities
 This idea can be used to discover
authoritative pages related to a topic
 HITS algorithm – Hypertext Induced
Topic Search
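
The mutually reinforcing relationship can be sketched as an iterative update. This is a simplified illustration of the HITS update rules on a fixed graph, assuming Euclidean normalization after each round and a fixed number of iterations; the construction of a topic-specific subgraph from search results is omitted, and the function name and toy graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Iteratively compute hub and authority scores.

    links maps each page to the list of pages it points to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority comes from in-edges: sum the hub scores of pages pointing here.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Being a hub comes from out-edges: sum the authority scores pointed to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded (an assumed choice of norm).
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical bipartite-style graph: three hubs pointing to two authorities.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
hub, auth = hits(graph)
```

Here a2 is pointed to by all three hubs and gets the highest authority score, while h1 and h2, which point to both authorities, score higher as hubs than h3.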

Web Structure Mining
:: example – online communities
 Basic idea
 Web communities are collections of
Web pages such that each member
node has more hyperlinks (in either
direction) within the community than
outside the community.
 Typical approach: Maximal-flow model *
 Ex: separate the two subgraphs with
any choice of source node (left
subgraph) and sink node (right
subgraph), removing the three dashed
links
[Figure: two subgraphs labeled Community 1 and Community 2, with source and sink nodes marked]
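
The separation step in the example can be sketched with a standard maximum-flow / minimum-cut computation. This is not the full procedure of the cited paper: it is a plain Edmonds-Karp search, assuming every hyperlink is an undirected link of capacity 1. The function name, node names, and the two five-page clusters joined by three links are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def min_cut_community(edges, source, sink):
    """Return (max_flow, community), where community is the source side of a
    minimum cut. Each undirected link has capacity 1 in both directions."""
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1
        cap[v][u] += 1

    def reachable_from_source():
        # Breadth-first search over edges that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path left: the reachable nodes form the community.
            return flow, set(parent)
        # Walk back from the sink to recover the augmenting path.
        path = []
        node = sink
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
        flow += bottleneck


# Two densely linked groups of pages joined by three cross links
# (standing in for the three dashed links in the figure).
left = ["a1", "a2", "a3", "a4", "a5"]
right = ["b1", "b2", "b3", "b4", "b5"]
edges = list(combinations(left, 2)) + list(combinations(right, 2))
edges += [("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
flow, community = min_cut_community(edges, "a1", "b1")
```

With this toy graph, the maximum flow equals the number of cross links, and the pages still reachable from the source once the flow is saturated are exactly the left group.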

* Source: G. Flake et al., “Self-Organization and Identification of Web Communities”, IEEE Computer,
Vol. 35, No. 3, pp. 66-71, March 2002.
Web Usage Mining
The Problem: analyze Web navigational data to
 Find how the Web site is used by Web users
 Understand the behavior of different user segments
 Predict how users will behave in the future
 Target relevant or interesting information to individuals or groups of users
 Increase sales, profit, loyalty, etc.

Challenge
 Quantitatively capture Web users’ common interests and characterize
their underlying tasks

Applications of Web Usage Mining
 Electronic Commerce
 design cross marketing strategies across products
 evaluate promotional campaigns
 target electronic ads and coupons at user groups based on their access patterns
 predict user behavior based on previously learned rules and users’ profiles
 present dynamic information to users based on their interests and profiles:
“Web personalization”
 Effective and Efficient Web Presence
 determine the best way to structure the Web site
 identify “weak links” for elimination or enhancement
 prefetch files that are most likely to be accessed
 enhance workgroup management & communication
 Search Engines
 Behavior-based ranking
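
As a rough illustration of one item above, prefetching the files most likely to be accessed, the sketch below counts page-to-page transitions in past sessions and returns the most frequent successors of a page. The first-order transition model is an assumption (the slides do not prescribe a method), and the session data, page paths, and function names are hypothetical.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count how often each page is followed directly by each other page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return counts


def likely_next(counts, page, k=1):
    """Return up to k pages most often visited right after the given page."""
    if page not in counts:
        return []
    return [p for p, _ in counts[page].most_common(k)]


# Hypothetical sessions, as they might be reconstructed from server logs.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/reviews"],
    ["/home", "/about"],
    ["/search", "/products", "/cart"],
]
counts = transition_counts(sessions)
prefetch = likely_next(counts, "/products")
```

The same counts could feed other applications in the list, such as spotting links that are rarely followed.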
