Web Mining
Introduction
Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services.
Discovering useful information from the World-Wide Web and its usage patterns.
Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web.
Web Mining Outline
Goal: Examine the use of data mining on
the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining
Web Mining
The WWW is a huge, widely distributed, global information service centre for:
Information services: news, advertisements,
consumer information, financial management,
education, government, e-commerce, etc.
Hyper-link information
Access and usage information
WWW provides rich sources of data for data mining
Why mine the Web?
Enormous wealth of information on Web
– Financial information (e.g. NSE)
– Book/CD/Video stores (e.g. Amazon)
– Restaurant information (e.g. Zomato)
– Car prices (e.g. Cars24)
Lots of data on user access patterns
Web logs contain sequence of URLs accessed by users
Possible to mine interesting nuggets of information
– People who ski also travel frequently to Europe
– Tech stocks have corrections in the summer and rally from November until February
Web Mining Issues
Size
– >350 million pages (1999)
– Grows by about 1 million pages a day
– Google indexes 3 billion documents
– Over 130 trillion pages (2016); Google's index exceeds 100,000,000 gigabytes
– 2.5 exabytes (2,500,000,000 gigabytes) of data created every single day
Diverse types of data
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
– Profiles
– Registration information
– Cookies
Web Mining Taxonomy
Web Content Mining
Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
Crawlers
Robot (spider) that traverses the hypertext structure of the Web.
Collect information from visited pages
Used to construct indexes for search engines
Common crawling/scraping tools: Scrapy, Beautiful Soup, Selenium
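The traversal at the heart of a crawler can be sketched as follows. This is a minimal sketch over a hypothetical in-memory "web" (the PAGES dictionary is illustrative); a real crawler would fetch pages over HTTP with a tool such as Scrapy or Beautiful Soup.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "/a": ("home page", ["/b", "/c"]),
    "/b": ("news page", ["/c"]),
    "/c": ("sports page", ["/a"]),
}

def crawl(seed):
    """Breadth-first traversal that collects page text for an index."""
    index, frontier, seen = {}, deque([seed]), {seed}
    while frontier:
        url = frontier.popleft()
        text, links = PAGES[url]
        index[url] = text              # collect information from the visited page
        for link in links:
            if link not in seen:       # never revisit a page
                seen.add(link)
                frontier.append(link)
    return index

index = crawl("/a")
```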
Focused Crawler
Only visit links from a page if that page
is determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier which assigns a relevance score to each page based on the crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier and distiller scores.
Focused Crawler
Classifier to relate documents to topics
Classifier also determines how useful
outgoing links are
Hub pages contain links to many relevant pages and must be visited even if they do not have a high relevance score.
Context Focused Crawler
Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to documents at the next higher level.
– Updated during the crawl itself.
Approach:
1. Construct context graphs and classifiers using seed documents as training data.
2. Perform crawling using the classifiers and context graphs created.
Context Graph
Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of the MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in the first layer of the MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.
Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify a user's preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.
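Collaborative filtering can be sketched as follows: score unseen items for a user by looking at the most similar other user, where similarity is cosine similarity over rating vectors. All user names, page IDs, and ratings below are illustrative.

```python
import math

# Hypothetical user -> {page: rating} data.
ratings = {
    "ann": {"p1": 5, "p2": 3, "p3": 4},
    "bob": {"p1": 4, "p2": 2, "p3": 5},
    "eve": {"p1": 1, "p2": 5},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(user):
    """Recommend pages the most similar user rated but this user has not seen."""
    others = [(cosine(ratings[user], ratings[o]), o)
              for o in ratings if o != user]
    _, nearest = max(others)
    return sorted(i for i in ratings[nearest] if i not in ratings[user])
```

Using a single nearest neighbour keeps the sketch short; practical systems aggregate over many similar users.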
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to
more effectively retrieve important pages.
PageRank
Used by Google
Prioritize pages returned from a search by looking at Web structure.
Importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.
PageRank (cont'd)
PR(p) = c (PR(1)/N1 + … + PR(n)/Nn)
– PR(i): PageRank of a page i which points to target page p
– Ni: number of links coming out of page i
– c: normalization constant (c < 1)
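The formula above can be iterated to a fixed point. A minimal sketch on a hypothetical three-page graph; the added (1 − c)/N term is the commonly used damped variant of the formula (it guarantees convergence), not something stated on the slide.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, c=0.85, iters=50):
    """Iterate PR(p) = (1-c)/N + c * sum over in-links i of PR(i)/N_i."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # sum PR(i)/N_i over every page i that links to p
            rank = sum(pr[i] / len(links[i]) for i in pages if p in links[i])
            new[p] = (1 - c) / len(pages) + c * rank
        pr = new
    return pr

pr = pagerank(links)
```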
PageRank
The connections between pages are represented by a graph (the web graph).
A node represents a webpage, and an arrow from page A to page B means that there is a link from page A to page B.
The number of out-going links is an important parameter.
The "out-degree" of a node is the number of out-going links contained in a page.
Example
Let N be the total number of pages. We create an N × N matrix A by defining the (i, j)-entry as
A(i, j) = 1/Ni if page i links to page j, and 0 otherwise,
where Ni is the number of out-going links on page i.
CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
– Highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages, R.
Identify hub and authority pages for these:
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
Pages with the highest ranks in R are returned.
HITS Algorithm
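The weight calculation can be sketched as the classic mutual-reinforcement iteration: a page's authority weight is the sum of the hub weights of pages pointing to it, and its hub weight is the sum of the authority weights of pages it points to. The link graph below is a small hypothetical base set B.

```python
# Hypothetical base set B as a link dictionary: page -> pages it links to.
links = {
    "p": ["q", "r"],
    "q": ["r"],
    "r": [],
}

def hits(links, iters=20):
    auth = {p: 1.0 for p in links}
    hub = {p: 1.0 for p in links}
    for _ in range(iters):
        # authority weight: sum of hub weights of pages linking here
        auth = {p: sum(hub[i] for i in links if p in links[i]) for p in links}
        # hub weight: sum of authority weights of pages linked to
        hub = {p: sum(auth[q] for q in links[p]) for p in links}
        # normalize so the weights stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5
        nh = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

auth, hub = hits(links)
```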
Web Usage Mining
Mines the data describing how users access the Web.
Web logs record the sequences of URLs accessed by users.
Applies pattern-discovery techniques to these access sequences.
Web Usage Mining Applications
Personalization
Improve structure of a site's Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)
Identify criminal activities
Customer relationship management
Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– A pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis
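The pattern-discovery step above can be sketched as counting ordered, consecutive page patterns across sessions (the sessions below are illustrative):

```python
from collections import Counter

# Hypothetical sessions: each is the ordered page sequence of one sitting.
sessions = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["B", "C"],
]

def count_patterns(sessions, length):
    """Count ordered, consecutive page patterns of a given length."""
    counts = Counter()
    for s in sessions:
        for i in range(len(s) - length + 1):
            counts[tuple(s[i:i + length])] += 1
    return counts

pairs = count_patterns(sessions, 2)
```

Frequent patterns are then those whose count exceeds a support threshold, as in association-rule mining, but with order preserved.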
Downside
Privacy: usage mining can pose a big threat to the company.
Without high ethical standards, additional personal data can be exposed.
ARs in Web Mining
Web Mining:
– Content
– Structure
– Usage
Frequent patterns of sequential page
references in Web searching.
Uses:
– Caching
– Clustering users
– Develop user profiles
– Identify important pages
Web Usage Mining Issues
Identification of the exact user is not possible.
The exact sequence of pages referenced by a user is not recoverable due to caching.
Sessions are not well defined.
Security, privacy, and legal issues
Web Log Cleansing
Replace source IP address with unique but
non-identifying ID.
Replace exact URL of pages referenced
with unique but non-identifying ID.
Delete error records and records containing no page data (such as figures and code).
Sessionizing
Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from
a source IP address occurring within a
predefined time interval (e.g. 25 minutes).
– All consecutive page references from a
source IP address where the interclick time is
less than a predefined threshold.
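The second technique (interclick-time threshold) can be sketched as follows, assuming a toy log of (IP, timestamp-in-seconds, URL) records:

```python
# Hypothetical, already-cleansed Web log entries.
log = [
    ("1.2.3.4", 0,    "/a"),
    ("1.2.3.4", 60,   "/b"),
    ("1.2.3.4", 2000, "/c"),   # long gap -> start of a new session
    ("5.6.7.8", 10,   "/a"),
]

def sessionize(log, threshold=1500):
    """Split each IP's clickstream whenever the interclick gap exceeds threshold."""
    sessions = {}    # ip -> list of sessions (each a list of URLs)
    last_seen = {}   # ip -> timestamp of previous click
    for ip, t, url in sorted(log, key=lambda e: (e[0], e[1])):
        if ip not in sessions or t - last_seen[ip] > threshold:
            sessions.setdefault(ip, []).append([])   # open a new session
        sessions[ip][-1].append(url)
        last_seen[ip] = t
    return sessions

s = sessionize(log)
```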
Data Structures
Keep track of patterns identified during
Web usage mining process
Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
Trie vs. Suffix Tree
Trie:
– Rooted tree
– Edges labeled with a character (page) from the pattern
– Path from root to leaf represents a pattern.
Suffix Tree:
– Nodes with a single child are collapsed with their parent; the edge contains the labels of both prior edges.
Generalized Suffix Tree
Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains in each node a count of how often the pattern occurs.
WAP Tree:
Compressed version of the generalized suffix tree
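A count-maintaining trie over all suffixes of all sessions is a simple stand-in for the generalized suffix tree and can be sketched as (sessions are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0   # how many inserted patterns pass through this node

def insert(root, pattern):
    node = root
    for page in pattern:
        node = node.children.setdefault(page, TrieNode())
        node.count += 1

def frequency(root, pattern):
    """Occurrence count of a consecutive pattern across all sessions."""
    node = root
    for page in pattern:
        if page not in node.children:
            return 0
        node = node.children[page]
    return node.count

root = TrieNode()
for session in [["A", "B", "C"], ["A", "B"], ["B", "C"]]:
    # insert every suffix so consecutive patterns anywhere are counted
    for i in range(len(session)):
        insert(root, session[i:])
```

A true suffix tree additionally collapses single-child chains; the WAP tree compresses further.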
Types of Patterns
Algorithms have been developed to discover
different types of patterns.
Properties:
– Ordered – characters (pages) must occur in the exact order of the original session.
– Duplicates – duplicate characters are allowed in the pattern.
– Consecutive – all characters in the pattern must occur consecutively in the given session.
– Maximal – not a subsequence of another pattern.
Pattern Types
– Association Rules: none of the properties hold
– Episodes: only ordering holds
– Sequential Patterns: ordered and maximal
– Forward Sequences: ordered, consecutive, and maximal
– Maximal Frequent Sequences: all properties hold
Episodes
Partially ordered set of pages
Serial episode – totally ordered with a time constraint
Parallel episode – partially ordered with a time constraint
General episode – partially ordered with no time constraint
DAG for Episode
Spatial Mining Outline
Goal: Provide an introduction to some
spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering
Spatial Object
Contains both spatial and nonspatial attributes.
Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
May retrieve objects using spatial attributes, nonspatial attributes, or both.
Spatial Data Mining Applications
Geology
GIS (Geographic Information Systems)
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects
Spatial Queries
Spatial selection may involve specialized
selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
Region (Range) Query – find objects that intersect a given region.
Nearest Neighbor Query – find the object closest to an identified object.
Distance Scan – find objects within a certain distance of an identified object, where the distance is made increasingly larger.
Spatial Data Structures
Data structures designed specifically to store or
index spatial data.
Often based on the B-tree or binary search tree.
Cluster data on disk based on geographic location.
May represent complex spatial structure by
placing the spatial object in a containing structure
of a specific geographic shape.
Techniques:
– Quad Tree
– R-Tree
– k-D Tree
MBR
Minimum Bounding Rectangle
Smallest rectangle that completely
contains the object
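For an object given as a set of 2-D points (e.g. sampled boundary points), the MBR computation is just the coordinate-wise minimum and maximum:

```python
def mbr(points):
    """Minimum bounding rectangle of 2-D points: (lower-left, upper-right)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

# Illustrative object sampled as three points.
box = mbr([(1, 4), (3, 1), (2, 6)])
```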
Quad Tree
Hierarchical decomposition of the space
into quadrants (MBRs)
Each level in the tree represents the
object as the set of quadrants which
contain any portion of the object.
Each level is a more exact representation
of the object.
The number of levels is determined by
the degree of accuracy desired.
R-Tree
As with the Quad Tree, the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
The lowest-level cell has only one object.
Tree maintenance algorithms are similar to those for B-trees.
k-D Tree
Designed for multi-attribute data, not
necessarily spatial
Variation of binary search tree
Each level is used to index one of the
dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but
successive divisions of the dimension
range.
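The build can be sketched as a minimal 2-D k-D tree: each level splits on the next dimension at the median point, so the divisions follow the dimension ranges rather than MBRs (the points are illustrative):

```python
def build(points, depth=0):
    """Recursively build a 2-D k-D tree, cycling through the dimensions."""
    if not points:
        return None
    axis = depth % 2                       # dimension indexed at this level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point becomes the node
    return {
        "point": points[mid],
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }

tree = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
```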
Topological Relationships
Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains
Distance Between Objects
Euclidean
Manhattan
Extensions:
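The two named distance measures, for n-dimensional points:

```python
import math

def euclidean(p, q):
    """Straight-line distance: sqrt(sum of squared coordinate differences)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))
```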
Progressive Refinement
Compute approximate answers prior to more accurate ones.
Filter out data not part of answer
Hierarchical view of data based on
spatial relationships
Coarse predicate recursively refined
Spatial Data Dominant Algorithm
STING
STatistical INformation Grid
Hierarchical technique to divide the area into rectangular cells
Grid data structure contains summary information about each cell
Hierarchical clustering
Similar to a quad tree
STING Build Algorithm
STING Algorithm
Spatial Rules
Characteristic Rule:
The average family income in Dallas is $50,000.
Discriminant Rule:
The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association Rule:
The average family income in Dallas for families living near White Rock Lake is $100,000.
Spatial Association Rules
Either antecedent or consequent must
contain spatial predicates.
View underlying database as set of
spatial objects.
May be created using a type of progressive refinement.
Spatial Association Rule Algorithm
Spatial Classification
Partition spatial objects
May use nonspatial attributes and/or
spatial attributes
Generalization and progressive
refinement may be used.
ID3 Extension
Neighborhood Graph
– Nodes – objects
– Edges – connects neighbors
Definition of neighborhood varies
ID3 considers nonspatial attributes of all
objects in a neighborhood (not just one)
for classification.
Spatial Decision Tree
Approach similar to that used for spatial
association rules.
Spatial objects can be described based
on objects close to them – Buffer.
Description of class based on
aggregation of nearby objects.
Spatial Decision Tree Algorithm
Spatial Clustering
Detect clusters of irregular shapes
Use of centroids and simple distance
approaches may not work well.
Clusters should be independent of order
of input.
CLARANS Extensions
Remove the main-memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and the R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi Diagram
SD(CLARANS)
Spatial Dominant
First clusters spatial components using
CLARANS
Then iteratively replaces medoids, but
limits number of pairs to be searched.
Uses generalization
Uses a learning tool to derive a description of each cluster.
SD(CLARANS) Algorithm
DBCLASD
Distribution Based Clustering of LArge Spatial Databases
Extension of DBSCAN
Assumes items in a cluster are uniformly distributed.
Identifies the distribution satisfied by distances between nearest neighbors.
Objects are added if the distribution remains uniform.
DBCLASD Algorithm
Aggregate Proximity
Aggregate proximity – a measure of how close a cluster is to a feature.
The aggregate proximity relationship finds the k closest features to a cluster.
CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull