Web Mining
Apply data mining techniques and other Web mining techniques.
Two categories:
- Pattern Discovery Tools
- Pattern Analysis Tools
This requires the analysis of the structure of hyperlinks and the contents of the pages
- Frequency of visits per document
- Most recent visit per document
- Frequency of use of each hyperlink
- Most recent use of each hyperlink
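As a minimal sketch of how the per-document statistics listed above could be gathered, the following assumes a hypothetical access log given as `(timestamp, document)` pairs; the function name and the log layout are illustrative and not taken from the slides.

```python
from collections import Counter

def usage_stats(log):
    """Compute visit frequency and most recent visit per document.

    log: iterable of (timestamp, document) pairs (hypothetical layout).
    """
    visits = Counter()
    last_visit = {}
    for timestamp, document in log:
        visits[document] += 1
        # Keep the latest timestamp seen for each document.
        if document not in last_visit or timestamp > last_visit[document]:
            last_visit[document] = timestamp
    return visits, last_visit
```

The same counting could be applied to hyperlinks by logging `(timestamp, link)` pairs instead of documents.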
Techniques:
- Visualization techniques
- OLAP techniques
- Data & Knowledge Querying
- Usability analysis
Filtering
Answers specific questions such as: how many visitors came to the site this week?
Sequential Patterns
Help find inter-transaction patterns.
Example: 50% of customers who bought items in /pcworld/computers/ also bought in /pcworld/accessories/ within 15 days.
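A rule like the one above could be checked against transaction data roughly as follows. This is a sketch under stated assumptions: the `purchases` layout (customer mapped to a list of `(date, path)` pairs) and the function name are hypothetical, and only the confidence of a single two-step pattern is measured.

```python
from datetime import date, timedelta

def sequential_confidence(purchases, first, then, window_days):
    """Fraction of customers who bought under `first` and later bought
    under `then` within `window_days` days.

    purchases: dict mapping customer -> list of (date, path) pairs.
    """
    window = timedelta(days=window_days)
    bought_first = 0
    followed = 0
    for events in purchases.values():
        first_dates = [d for d, path in events if path.startswith(first)]
        if not first_dates:
            continue
        bought_first += 1
        then_dates = [d for d, path in events if path.startswith(then)]
        # The second purchase must occur on or after the first, inside the window.
        if any(timedelta(0) <= t - f <= window
               for f in first_dates for t in then_dates):
            followed += 1
    return followed / bought_first if bought_first else 0.0
```

Called with `first="/pcworld/computers/"`, `then="/pcworld/accessories/"` and `window_days=15`, a return value of 0.5 would correspond to the 50% figure in the example.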
What Year?
What Type?
Crawlers
A robot (spider) traverses the hypertext structure of the Web.
- Collects information from visited pages.
- Used to construct indexes for search engines.
- Traditional Crawler: visits entire Web (?) and replaces index.
- Periodic Crawler: visits portions of the Web and updates subset of index.
- Incremental Crawler: selectively searches the Web and incrementally modifies index.
- Focused Crawler: visits pages related to a particular subject.
Focused Crawler
Only visit links from a page if that page is determined to be relevant.
The classifier is static after the learning phase.
Components:
- Classifier, which assigns a relevance score to each page based on the crawl topic.
- Distiller, which identifies hub pages.
- Crawler, which visits pages based on classifier and distiller scores.
Week 1: Data Mining II
Focused Crawler (contd)
- The classifier relates documents to topics.
- The classifier also determines how useful outgoing links are.
- Hub pages contain links to many relevant pages. They must be visited even if they do not have a high relevance score.
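The crawling rule described above (follow links only from pages judged relevant, visiting higher-scoring pages first) can be sketched as below. Everything here is an illustrative assumption: the Web is a small in-memory dict named `web`, `relevance` is a stand-in for the trained classifier, and the distiller's special handling of hub pages is left out.

```python
import heapq

def focused_crawl(web, seeds, relevance, threshold=0.5, limit=100):
    """Visit pages in priority order, expanding only relevant pages.

    web: dict mapping url -> (text, outlinks); a toy stand-in for the Web.
    relevance: function text -> score in [0, 1]; stand-in for the classifier.
    """
    # heapq is a min-heap, so scores are negated to pop the best first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, outlinks = web[url]
        score = relevance(text)
        visited.append((url, score))
        if score < threshold:
            continue  # do not follow links from an irrelevant page
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                # Links inherit the score of the page that cites them.
                heapq.heappush(frontier, (-score, link))
    return visited
```

A fuller version would add the distiller so that hub pages are expanded even when their own relevance score falls below the threshold.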
Focused Crawler (contd)
Approach:
1. Construct context graph and classifiers using seed documents as training data.
2. Perform crawling using classifiers and context graph created.
Context Graph
Personalization
Web access or contents are tuned to better fit the desires of each user.
- Manual techniques identify user preferences based on profiles or demographics.
- Collaborative filtering identifies preferences based on ratings from similar users.
- Content-based filtering retrieves pages based on similarity between pages and user profiles.
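To illustrate the collaborative filtering idea above, here is a small user-based sketch. It assumes ratings are stored as a dict of dicts (user, then page, then numeric rating) and uses cosine similarity to find similar users; both choices are assumptions made for the example, not something the slides specify.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=3):
    """Rank pages the user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue
        for page, rating in their.items():
            if page not in ratings[user]:
                scores[page] = scores.get(page, 0.0) + sim * rating
                weights[page] = weights.get(page, 0.0) + sim
    ranked = sorted(scores, key=lambda p: scores[p] / weights[p], reverse=True)
    return ranked[:top_n]
```

Content-based filtering would instead compare a page's own features with the user's profile, without consulting other users' ratings.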
Web structure mining creates a model of the Web's organization. It may be combined with content mining to retrieve important pages more effectively.
PageRank
- Used by Google.
- Prioritizes pages returned from a search by looking at Web structure.
- The importance of a page is calculated based on the number of pages that point to it (backlinks).
- Weighting is used to give more importance to backlinks coming from important pages.
PageRank (contd)
$$PR(p) = c \left( \frac{PR(1)}{N_1} + \cdots + \frac{PR(n)}{N_n} \right)$$
PR(i): PageRank of a page i that points to target page p.
$N_i$: number of links coming out of page i.
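The formula can be evaluated iteratively, as in the sketch below. Two assumptions are made that the slide does not state: c is treated as a normalization constant chosen each round so that the ranks sum to 1, and every page has at least one outgoing link and appears as a key in the hypothetical `links` dict.

```python
def pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over pages i that link to p.

    links: dict mapping page -> list of pages it links to.
    Assumes every page has at least one outlink and is a key in `links`.
    """
    pages = sorted(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        raw = {p: 0.0 for p in pages}
        for i, targets in links.items():
            share = rank[i] / len(targets)  # PR(i) / N_i
            for p in targets:
                raw[p] += share
        total = sum(raw.values())
        # c = 1 / total keeps the ranks summing to 1.
        rank = {p: raw[p] / total for p in pages}
    return rank
```

This simplified form omits the damping factor used in later formulations of PageRank, so it can fail to converge on graphs with dead ends or isolated cycles.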
CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
- Highly important pages.
- Best source for requested information.
Hub Pages:
- Contain links to highly important pages.
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages R.
Identify hub and authority pages for these:
- Expand R to a base set, B, of pages linked to or from R.
- Calculate weights for authorities and hubs.
HITS Algorithm
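A minimal sketch of the weight calculation step, assuming the base set B has already been built and is given as a hypothetical `links` dict. It uses the usual mutual-reinforcement updates (a page's authority weight is the sum of the hub weights of pages linking to it; its hub weight is the sum of the authority weights of pages it links to); the fixed iteration count and the normalization are illustrative choices.

```python
import math

def hits(links, iterations=50):
    """Compute hub and authority weights for pages in a base set.

    links: dict mapping page -> list of pages it links to (within B).
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub weights of pages that link to the page.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub: sum of authority weights of pages the page links to.
        hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        # Normalize both weight vectors to unit length.
        for weights in (auth, hub):
            norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return hub, auth
```

Pages with the largest authority weights are candidate authoritative pages, and pages with the largest hub weights are candidate hubs.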