Module1PartAweb Mining-Intro

Uploaded by

Ayush Tiwari

Overview of Web Data Mining and Applications

What is Web Mining
• Web mining: the potential for extracting valuable knowledge from the Web has long been evident
• Web mining is the collection of technologies that fulfill this objective

Web Mining Definition

The application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.

• But why is this important, and why is it more relevant now than at any other time in the history of the Web?

[Figure omitted] Source: Intel, 2012

What’s needed to succeed in the new world of the “big data” Internet?
• Leveraging big data
  - Many of these applications manage, clean, preprocess, and integrate often unstructured data from across many channels
  - The biggest challenge is data distillation and preprocessing
• Effective use of data mining and analytics
  - No longer just a luxury but an integral part of systems
  - Especially important to leverage and effectively use user behavior and social data
• Real-time deployment of models
  - Needed for effective delivery of relevant, targeted, personalized content
  - Especially important on the Web: Predictive User Modeling

Predictive User Modeling
• The Problem
  - Dynamically serve customized content (ads, products, deals, recommendations, etc.) to users based on their profiles, preferences, or expected interests
• Why do we need it?
  - Information spaces are becoming much more complex for users to navigate (huge online repositories, social networks, mobile applications, blogs, …)
  - For businesses: need to grow customer loyalty / increase sales
  - Industry research: successful online retailers generate as much as 35% of their business from recommendations/targeted content delivery

Web Data Mining: Challenges & Opportunities
• Web is huge
  - structured tables, semistructured pages, unstructured texts, and multimedia files (images, audio, and video)
• Information on the Web is heterogeneous
• Web is linked
  - Hyperlinks
  - Authoritative pages
• Web is noisy
  - main content
  - businesses and commerce
  - recommender systems
• Web is dynamic
• Web is a virtual society
  - opinion mining and social network analysis

Data Mining
• Knowledge Discovery in Databases (KDD)
  - Discovering useful patterns or knowledge from data sources
• Data Mining Tasks
  - Supervised learning (or classification)
  - Unsupervised learning (or clustering)
  - Association rule mining
  - Sequential pattern mining
• Data Mining Stages
  - Pre-processing
  - Data mining task
  - Post-processing

Data Mining vs. Web Mining
• Traditional data mining uses structured data stored in relational tables, spreadsheets, or flat files in tabular form.
• With the growth of the Web and text documents, Web mining and text mining are becoming increasingly important and popular.
• Although Web mining uses many data mining techniques, it is not purely an application of traditional data mining techniques, because Web data is heterogeneous and semi-structured or unstructured.

Web Mining
• Web structure mining: discovers useful knowledge from hyperlinks
• Web content mining
• Web usage mining
  - data collection
• Information Retrieval and Web Search
  - Indexing

Taxonomy of Web Mining
Web Mining
• Web Content Mining
• Web Usage Mining
• Web Structure Mining

Types of Web Mining: Web Content Mining
Extracting useful knowledge from the contents of Web documents or other semantic information about Web resources.

Content data may consist of text, images, audio, video, structured records from lists and tables, or item attributes from backend databases.

Applications:
• document clustering or categorization
• topic identification / tracking
• concept discovery
• focused crawling
• content-based personalization
• intelligent search tools

Types of Web Mining: Web Usage Mining
Extracting interesting patterns from user interactions with resources on one or more Web sites.

Applications:
• user and customer behavior modeling
• Web site optimization
• e-customer relationship management
• Web marketing
• targeted advertising
• recommender systems

Types of Web Mining: Web Structure Mining
Discovering useful patterns from the hyperlink structure connecting Web sites or Web resources.

Data sources include the explicit hyperlinks between documents, or implicit links among objects (e.g., two objects being “tagged” using the same keyword).

Applications:
• document retrieval and ranking (e.g., Google)
• discovery of “hubs” and “authorities”
• discovery of Web communities
• social network analysis

Web Content Mining
:: common approaches and applications
• Basic notion: document similarity
  - Most Web content mining and information retrieval applications involve measuring similarity between two or more documents
  - Vector representation facilitates similarity computations using vector-space operations (such as the cosine of the angle between two vectors)
• Examples
  - Search engines: measure the similarity between a query (represented as a vector) and the indexed document vectors to return a ranked list of relevant documents
  - Document clustering: group documents based on similarity or dissimilarity (distance) among them
  - Document categorization: measure the similarity of a new document to be classified with representations of existing categories (such as the mean vector representing a group of document vectors)
  - Personalization: recommend documents or items based on their similarity to a representation of the user’s profile (which may be a term vector representing concepts or terms of interest to the user)

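The similarity computations listed above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: documents are represented as raw term-frequency vectors built by whitespace splitting (real systems typically use weighting such as TF-IDF), and the function names (term_vector, cosine, rank, centroid) and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Search-engine style ranking: (doc_id, score) by decreasing similarity."""
    q = term_vector(query)
    scores = [(doc_id, cosine(q, term_vector(text))) for doc_id, text in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def centroid(vectors):
    """Mean vector of a group of term vectors (a simple category representation)."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}

# Hypothetical mini-collection, for illustration only.
docs = {
    "d1": "web mining extracts knowledge from web data",
    "d2": "hyperlink structure of the web graph",
    "d3": "cooking recipes for pasta",
}
ranking = rank("web mining", docs)
```

The same cosine function serves the other examples: categorization compares a new document with the centroid of each category, and personalization compares candidate items with a term vector describing the user's profile.
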
Web Content Mining
:: example – clustered search results

Can drill down within clusters to view sub-topics or to view the relevant subset of results

Web Content Mining
:: example – personalized content delivery

Google's personalized news is an example of a content-based recommender system which recommends items (in part) based on the similarity of their content to a user’s profile (gathered from search and click history)

Web Structure Mining
:: graph structures on the Web
• The structure of a typical Web graph
  - Web pages as nodes
  - hyperlinks as edges connecting two related pages
• Hyperlink Analysis
  - Hyperlinks can serve as a tool for pure navigation
  - But often they are used to point to pages with authority on the same topic as the source page (similar to a citation in a publication)
• Some interesting Web structures *

Web Structure Mining
:: example – Google’s PageRank algorithm

• Basic idea:
  - The rank of a page depends on the ranks of the pages pointing to it
  - The out-degree of a page is the number of edges pointing away from it; it is used to compute the page's contribution to the pages it points to
  - The final PageRank value represents the probability that a random surfer will reach the page
  - d is the probability that a random surfer chooses the page directly rather than getting there via navigation

[Figure: Illustration of PageRank propagation]

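A minimal sketch of the propagation described above, assuming the common power-iteration formulation PR(p) = d/N + (1 - d) * sum of PR(q)/outdegree(q) over pages q that link to p. Here d follows the slide's meaning (the probability of jumping to a page directly); many presentations instead use d for the complementary probability of following a link. The default d=0.15, the handling of pages without out-links, and the four-page graph are assumptions made for illustration. Every link target must also appear as a key of the links dictionary.

```python
def pagerank(links, d=0.15, iterations=50):
    """Power-iteration PageRank.

    links: dict mapping each page to the list of pages it links to.
    d: probability that the random surfer jumps to a page directly.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the direct-jump share first.
        new_rank = {p: d / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly over its out-edges.
                share = (1 - d) * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = (1 - d) * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page graph, for illustration only.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

In this graph C is linked from three pages and ends up with the highest rank, while D, which no page links to, keeps only the direct-jump share d/N.
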
Web Structure Mining
:: example – Hubs and Authorities
• Basic idea
  - Authority comes from in-edges
  - Being a hub comes from out-edges
• Mutually reinforcing relationship
  - A good authority is a page that is pointed to by many good hubs.
  - A good hub is a page that points to many good authorities.
  - Together they tend to form a bipartite graph
• This idea can be used to discover authoritative pages related to a topic
  - HITS algorithm: Hypertext Induced Topic Search

[Figure: bipartite graph of Hubs and Authorities]

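The mutually reinforcing relationship can be sketched as an iterative update, assuming the usual formulation in which a page's authority score is the sum of the hub scores of pages pointing to it, and its hub score is the sum of the authority scores of pages it points to. The fixed iteration count, the Euclidean normalization, and the small graph are assumptions for illustration; in topic search the algorithm would run on a subgraph of pages related to the query.

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority scores.

    links: dict mapping a page to the list of pages it points to.
    Returns (hub, authority) dicts with scores normalized to unit length.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores over in-edges.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of authority scores over out-edges.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical bipartite-style graph: h* pages point to a* pages.
graph = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
hub, auth = hits(graph)
```

Here a2 is pointed to by all three hubs and receives the highest authority score, and h1 and h2, which point to both authorities, score higher as hubs than h3.
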
Web Structure Mining
:: example – online communities
• Basic idea
  - Web communities are collections of Web pages such that each member node has more hyperlinks (in either direction) within the community than outside the community.
• Typical approach: maximal-flow model *
  - Ex: separate the two subgraphs with any choice of source node (left subgraph) and sink node (right subgraph), removing the three dashed links

[Figure: two subgraphs labeled Community 1 and Community 2, with a source node and a sink node]

* Source: G. Flake, et al. “Self-Organization and Identification of Web Communities”, IEEE Computer, Vol. 35, No. 3, pp. 66-71, March 2002.

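The example above can be reproduced with a small maximum-flow computation. This is a simplified sketch of the source/sink cut idea only, not the full procedure from the cited paper: each hyperlink is treated as an undirected edge of capacity 1, the flow is computed with breadth-first augmenting paths (Edmonds-Karp), and the community is taken to be the set of nodes still reachable from the source once no augmenting path remains. The graph (two fully connected groups of five pages joined by three links) and all names are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations

def min_cut_community(edges, source, sink):
    """Return the nodes on the source side of a minimum source-sink cut.

    edges: iterable of (u, v) links, treated as undirected with capacity 1.
    """
    capacity = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        capacity[u][v] += 1
        capacity[v][u] += 1

    def reachable_with_parents():
        """Breadth-first search along edges that still have spare capacity."""
        parents = {source: None}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor, cap in capacity[node].items():
                if cap > 0 and neighbor not in parents:
                    parents[neighbor] = node
                    queue.append(neighbor)
        return parents

    while True:
        parents = reachable_with_parents()
        if sink not in parents:
            # No augmenting path left: the reachable set is the source side.
            return set(parents)
        # Find the bottleneck capacity along the path found by the search.
        bottleneck = float("inf")
        node = sink
        while parents[node] is not None:
            bottleneck = min(bottleneck, capacity[parents[node]][node])
            node = parents[node]
        # Push flow along the path and update residual capacities.
        node = sink
        while parents[node] is not None:
            prev = parents[node]
            capacity[prev][node] -= bottleneck
            capacity[node][prev] += bottleneck
            node = prev

# Two densely linked groups joined by three cross links (the "dashed" links).
left = ["L1", "L2", "L3", "L4", "L5"]
right = ["R1", "R2", "R3", "R4", "R5"]
edges = list(combinations(left, 2)) + list(combinations(right, 2))
edges += [("L2", "R1"), ("L3", "R2"), ("L4", "R3")]
community = min_cut_community(edges, "L1", "R5")
```

Because every page has more links inside its own group than across, the cheapest cut consists of the three cross links, and the returned community is exactly the left group.
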
Web Usage Mining
The Problem: analyze Web navigational data to
• Find how the Web site is used by Web users
• Understand the behavior of different user segments
• Predict how users will behave in the future
• Target relevant or interesting information to individuals or groups of users
• Increase sales, profit, loyalty, etc.

Challenge
• Quantitatively capture Web users’ common interests and characterize their underlying tasks

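As one possible illustration of predicting future behavior from navigational data, the sketch below counts page-to-page transitions in past sessions and predicts the most frequent next page. This first-order transition model is an assumption chosen for simplicity and is not a method named in these slides; the session data, URLs, and function names are hypothetical.

```python
from collections import Counter, defaultdict

def transition_counts(sessions):
    """Count how often each page is followed directly by each other page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    return counts

def predict_next(counts, page):
    """Most frequent next page and its estimated probability, or None if unseen."""
    followers = counts.get(page)
    if not followers:
        return None
    next_page, hits = followers.most_common(1)[0]
    return next_page, hits / sum(followers.values())

# Hypothetical sessions: ordered page views per visit.
sessions = [
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/products", "/reviews"],
    ["/home", "/deals", "/products", "/cart"],
]
counts = transition_counts(sessions)
prediction = predict_next(counts, "/products")
```

With these sessions, two of the three visits to /products continue to /cart, so /cart is the predicted next page with an estimated probability of 2/3.
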
Applications of Web Usage Mining
• Electronic Commerce
  - design cross-marketing strategies across products
  - evaluate promotional campaigns
  - target electronic ads and coupons at user groups based on their access patterns
  - predict user behavior based on previously learned rules and users’ profiles
  - present dynamic information to users based on their interests and profiles: “Web personalization”
• Effective and Efficient Web Presence
  - determine the best way to structure the Web site
  - identify “weak links” for elimination or enhancement
  - prefetch files that are most likely to be accessed
  - enhance workgroup management & communication
• Search Engines
  - Behavior-based ranking
