
Web Mining

This document provides an overview of web mining techniques. It discusses why web usage mining is useful for discovering visitor profiles and measuring marketing efforts. It then describes how to perform web usage mining by obtaining web traffic data from sources like server logs, databases, and forms. Common pattern analysis techniques are outlined for understanding site usage and frequent pages. Pattern discovery tools involve preprocessing data, analyzing paths, grouping similar information, and applying techniques like clustering and decision trees. The document also discusses focused crawlers, virtual web views, personalization, and algorithms for analyzing web structure like PageRank and HITS.

Uploaded by

Teresa Sebastian
Copyright
© Attribution Non-Commercial (BY-NC)

Web Mining Taxonomy

Why Web Usage Mining?


Explosive growth of e-commerce
  Provides a cost-efficient way of doing business
  Amazon.com: an online Wal-Mart

Hidden useful information
  Visitor profiles can be discovered
  Supports measuring online marketing efforts, launching marketing campaigns, etc.

How to perform Web Usage Mining


Obtain web traffic data from:
  Web server log files
  Corporate relational databases
  Registration forms

Apply data mining techniques and other Web mining techniques
Two categories of tools:
  Pattern Discovery Tools
  Pattern Analysis Tools
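
Server log files are the first data source listed above. The following is a minimal sketch of reading one log entry, assuming the server writes the widely used Common Log Format (the slides do not say which format is used). The sample line and the field names are illustrative, not taken from a real log.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Hypothetical sample line.
sample = ('192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /company/products/bread.html HTTP/1.0" 200 2326')
entry = parse_log_line(sample)
print(entry["host"], entry["path"], entry["status"])
```

Lines that do not match the pattern come back as None, so a caller can count or discard malformed entries before any mining step.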

Pattern Analysis Tools


Answer questions like:
  How are people using this site?
  Which pages are being accessed most frequently?

This requires the analysis of the structure of hyperlinks and the contents of the pages

Pattern Analysis Tools


Output of analysis:
  Frequency of visits per document
  Most recent visit per document
  Frequency of use of each hyperlink
  Most recent use of each hyperlink

Techniques:
  Visualization techniques
  OLAP techniques
  Data & knowledge querying
  Usability analysis
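
The first two outputs listed above (visit frequency and most recent visit per document) can be computed in a few lines. This is only a sketch over hypothetical (timestamp, page) records; a real tool would read them from parsed log entries.

```python
from collections import Counter

# Hypothetical records; ISO-style timestamps sort chronologically as strings.
visits = [
    ("2000-10-10T13:55", "/index.html"),
    ("2000-10-10T14:02", "/products.html"),
    ("2000-10-11T09:30", "/index.html"),
    ("2000-10-12T18:45", "/index.html"),
]

# Frequency of visits per document.
frequency = Counter(page for _, page in visits)

# Most recent visit per document.
most_recent = {}
for timestamp, page in visits:
    if page not in most_recent or timestamp > most_recent[page]:
        most_recent[page] = timestamp

for page, count in frequency.most_common():
    print(page, count, most_recent[page])
```

The same two dictionaries keyed by (source page, target page) pairs instead of pages would give the per-hyperlink figures.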

Pattern Discovery Tools


Data pre-processing
  Filter/clean Web log files
    Eliminate outliers and irrelevant items

  Integrate Web usage data from:
    Web server logs
    Referral logs
    Registration files
    Corporate database
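
A small sketch of the cleaning step described above. What counts as "irrelevant" is an analyst's choice; here it is assumed to mean embedded resources (images, stylesheets, scripts) and failed requests. The entry fields `path` and `status` are hypothetical names.

```python
# Assumed notion of "irrelevant": embedded resources and failed requests.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")

def clean(entries):
    """Keep successful page requests; drop embedded resources and errors."""
    return [
        e for e in entries
        if 200 <= e["status"] < 300
        and not e["path"].lower().endswith(IRRELEVANT_SUFFIXES)
    ]

# Hypothetical parsed log entries.
entries = [
    {"path": "/index.html", "status": 200},
    {"path": "/logo.GIF", "status": 200},
    {"path": "/missing.html", "status": 404},
    {"path": "/products.html", "status": 200},
]
cleaned = clean(entries)
print([e["path"] for e in cleaned])
```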

Pattern Discovery Techniques


Converting IP addresses to domain names
  The Domain Name System (DNS) does the conversion
  Discover information from visitors' domain names
    Ex: .ca (Canada), .cn (China), etc.
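
As a rough illustration of reading a country from a visitor's domain name, the sketch below looks up the last label of a host name in a small table. The table holds only the two codes from the slide, and the host names are hypothetical. In practice the host name would first come from a reverse DNS lookup (for example Python's `socket.gethostbyaddr`), which needs network access and is not shown.

```python
# Illustrative subset of country-code top-level domains (from the slide).
COUNTRY_TLDS = {"ca": "Canada", "cn": "China"}

def visitor_country(hostname):
    """Return the country for a host name's top-level domain, or None if unknown."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return COUNTRY_TLDS.get(tld)

print(visitor_country("proxy.example.ca"))
print(visitor_country("host.example.com"))
```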

Converting URLs to Page Titles


Page Title: between <title> and </title>
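
One possible way to pull the page title out of an HTML document, using the standard library's `html.parser`. The class and function names are illustrative, and the sample page is made up.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def page_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

sample = "<html><head><title> What's New </title></head><body><p>Hi</p></body></html>"
print(page_title(sample))
```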

Pattern Discovery Techniques


Path Analysis
  Uses a graph model
  Provides insight into navigational problems
  Examples of information discovered by path analysis:
    78% followed the path company -> what's new -> sample -> order
    60% left the site after four or fewer page references
    => the most important information must be within the first four pages from the site's entry points
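
A minimal sketch of how such path statistics could be derived, assuming each visitor session has already been reduced to an ordered list of pages. The sessions below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter

# Hypothetical sessions: each is the ordered list of pages one visitor viewed.
sessions = [
    ["company", "whats new", "sample", "order"],
    ["company", "whats new", "sample", "order"],
    ["company", "whats new", "sample", "order"],
    ["company", "products"],
    ["company", "products", "sample", "faq", "contact", "order"],
]

# Most frequent complete path and the share of sessions that followed it.
path_counts = Counter(tuple(s) for s in sessions)
top_path, top_count = path_counts.most_common(1)[0]
top_share = top_count / len(sessions)

# Share of sessions that ended after four or fewer page references.
short_share = sum(1 for s in sessions if len(s) <= 4) / len(sessions)

print(" -> ".join(top_path), top_share, short_share)
```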

Pattern Discovery Techniques


Grouping
  Groups similar information to help draw higher-level conclusions
  Ex: all URLs containing the word "Yahoo"

Filtering
  Allows answering specific questions like:
    How many visitors came to the site this week?

Pattern Discovery Techniques


Dynamic Site Analysis
  Dynamic HTML pages link to the database and require parameters appended to URLs
  Ex: http://search.netscape.com/cgiin/search?search=Federal+Tax+Return+Form&cp=ntserch
  Knowledge gained:
    What the visitors looked for
    Which keywords should be purchased from search engines
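
The parameters appended to a dynamic URL can be read with the standard `urllib.parse` module. This sketch uses the slide's example URL, with the second parameter written as `cp`; the variable names are illustrative.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://search.netscape.com/cgiin/search"
       "?search=Federal+Tax+Return+Form&cp=ntserch")

# parse_qs decodes the query string; "+" becomes a space.
params = parse_qs(urlsplit(url).query)
keywords = params["search"][0].split()

print(params)
print(keywords)
```

Counting `keywords` across all logged search URLs gives a picture of what visitors looked for.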

Pattern Discovery Techniques


Cookies
  An ID randomly assigned by the web server to the browser
  Cookies are beneficial to both web site developers and visitors
  The cookie field in a log file entry can be used by Web traffic analysis software to track repeat visitors (loyal customers)

Pattern Discovery Techniques


Association Rules
  Help find spending patterns on related products
  Ex: 30% of visitors who accessed /company/products/bread.html also accessed /company/products/milk.htm
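
The percentage in a rule like the one above is its confidence: among sessions containing the first page, the fraction that also contain the second. A minimal sketch, with hypothetical sessions constructed so the result is 30%:

```python
BREAD = "/company/products/bread.html"
MILK = "/company/products/milk.htm"

def confidence(sessions, antecedent, consequent):
    """Fraction of sessions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    both = sum(1 for s in with_antecedent if consequent in s)
    return both / len(with_antecedent)

# Hypothetical sessions, each the set of pages one visitor accessed.
sessions = [{BREAD, MILK}] * 3 + [{BREAD}] * 7 + [{MILK}] * 2

rule_confidence = confidence(sessions, BREAD, MILK)
print(rule_confidence)
```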

Sequential Patterns
  Help find inter-transaction patterns
  Ex: 50% of visitors who bought items in /pcworld/computers/ also bought in /pcworld/accessories/ within 15 days
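
A sequential pattern adds an ordering and a time window to the co-occurrence check. The sketch below assumes each customer's purchases are available as dated records; the customer names and dates are hypothetical.

```python
from datetime import date, timedelta

COMPUTERS = "/pcworld/computers/"
ACCESSORIES = "/pcworld/accessories/"

# Hypothetical purchase histories: customer -> list of (date, section).
purchases = {
    "alice": [(date(2000, 1, 3), COMPUTERS), (date(2000, 1, 10), ACCESSORIES)],
    "bob": [(date(2000, 1, 5), COMPUTERS), (date(2000, 3, 1), ACCESSORIES)],
    "carol": [(date(2000, 2, 1), COMPUTERS)],
    "dave": [(date(2000, 2, 2), ACCESSORIES)],
    "erin": [(date(2000, 1, 20), COMPUTERS), (date(2000, 1, 25), ACCESSORIES)],
}

def followed_within(history, first, second, window):
    """True if a `first` purchase is followed by a `second` purchase within `window`."""
    for d1, s1 in history:
        if s1 != first:
            continue
        for d2, s2 in history:
            if s2 == second and timedelta(0) <= d2 - d1 <= window:
                return True
    return False

buyers = [h for h in purchases.values() if any(s == COMPUTERS for _, s in h)]
matched = [h for h in buyers
           if followed_within(h, COMPUTERS, ACCESSORIES, timedelta(days=15))]
rate = len(matched) / len(buyers)
print(rate)
```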

Pattern Discovery Techniques


Clustering
  Identifies visitors with common characteristics based on visitor profiles
  Ex: 50% of visitors who applied for the Discover Platinum Card in /discovercard/customerService/newcard were in the 25-35 age group, with annual income between $40,000 and $50,000

Pattern Discovery Techniques


Decision Trees
  A flow chart of questions leading to a decision
  Ex: car-buying decision tree
    What brand? -> What year? -> What type?
    Example outcome: 2000 model Honda Accord EX

Web Content Mining


Extends the work of basic search engines
Search engines
  IR application
  Keyword based
  Similarity between query and document
  Crawlers
  Indexing
  Profiles
  Link analysis
Week 1: Data Mining II 15

Crawlers
  A robot (spider) traverses the hypertext structure of the Web.
  Collects information from visited pages
  Used to construct indexes for search engines
  Traditional crawler: visits the entire Web (?) and replaces the index
  Periodic crawler: visits portions of the Web and updates a subset of the index
  Incremental crawler: selectively searches the Web and incrementally modifies the index
  Focused crawler: visits pages related to a particular subject


Focused Crawler
Only visit links from a page if that page is determined to be relevant.
The classifier is static after the learning phase.
Components:
  Classifier, which assigns a relevance score to each page based on the crawl topic.
  Distiller, which identifies hub pages.
  Crawler, which visits pages based on classifier and distiller scores.

Focused Crawler
The classifier relates documents to topics.
The classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages and must be visited even if their relevance score is not high.
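
The core rule of a focused crawler (follow links only from relevant pages) can be sketched over a tiny in-memory "web". Everything here is an assumption for illustration: the pages, the keyword-overlap `relevance` function standing in for a trained classifier, and the threshold. The distiller and hub-page handling are omitted.

```python
import heapq

# Hypothetical in-memory "web": page -> (text, outgoing links).
WEB = {
    "seed": ("web mining overview", ["usage", "sports"]),
    "usage": ("web usage mining of server logs", ["patterns"]),
    "patterns": ("pattern discovery in web mining", []),
    "sports": ("football scores today", ["tickets"]),
    "tickets": ("buy football tickets", []),
}
TOPIC = {"web", "mining"}

def relevance(text):
    """Stand-in for the trained classifier: share of topic words present."""
    return len(TOPIC & set(text.split())) / len(TOPIC)

def focused_crawl(start, threshold=0.5):
    visited, order = set(), []
    frontier = [(-1.0, start)]  # max-heap via negated priority
    while frontier:
        _, page = heapq.heappop(frontier)
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        text, links = WEB[page]
        score = relevance(text)
        if score < threshold:
            continue  # page is not relevant: do not follow its links
        for link in links:
            if link not in visited:
                # Links inherit the score of the page that points to them.
                heapq.heappush(frontier, (-score, link))
    return order

crawled = focused_crawl("seed")
print(crawled)
```

The off-topic page "sports" is still fetched, since it must be scored, but its link to "tickets" is never followed.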


Focused Crawler


Context Focused Crawler


Context graph:
  A context graph is created for each seed document.
  The root is the seed document.
  Nodes at each level show documents with links to documents at the next higher level.
  Updated during the crawl itself.

Approach:
  1. Construct the context graph and classifiers using seed documents as training data.
  2. Perform crawling using the classifiers and context graph created.

Context Graph


Virtual Web View


Multiple Layered Database (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and more centralized than the one beneath it.
Upper layers of the MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract the desired information to place in the first layer of the MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.


Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify users' preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.


Web Structure Mining


Mine the structure (links, graph) of the Web
Techniques:
  PageRank
  CLEVER

Create a model of the Web organization.
May be combined with content mining to retrieve important pages more effectively.


PageRank
Used by Google
Prioritizes pages returned from a search by looking at Web structure.
The importance of a page is calculated from the number of pages that point to it (backlinks).
Weighting gives more importance to backlinks coming from important pages.

PageRank (cont'd)
$$PR(p) = c \left( \frac{PR(1)}{N_1} + \cdots + \frac{PR(n)}{N_n} \right)$$
  PR(i): PageRank of a page i that points to target page p.
  N_i: number of links coming out of page i.
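
The formula can be applied repeatedly until the ranks settle. The sketch below is one simple reading of it, not Google's implementation: the slide does not define c, so it is assumed here to act as a normalization constant, realized by rescaling the ranks to sum to 1 after every round. The three-page graph is hypothetical, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over the pages i that link to p.

    `links` maps each page to the list of pages it links to. The constant c
    is treated as a normalizer (an assumption): ranks are rescaled each round.
    """
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for i, outgoing in links.items():
            share = pr[i] / len(outgoing)  # PR(i) / N_i
            for p in outgoing:
                new[p] += share
        total = sum(new.values())
        pr = {p: value / total for p, value in new.items()}
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(ranks)
```

Page C collects backlinks from both A and B, and A receives all of C's weight, so A and C end up ranked above B.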


CLEVER
Identify authoritative and hub pages.
Authoritative pages:
  Highly important pages.
  Best source for requested information.

Hub pages:
  Contain links to highly important pages.


HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages R.
Identify hub and authority pages for these:
  Expand R to a base set, B, of pages linked to or from R.
  Calculate weights for authorities and hubs.

Pages with the highest ranks in R are returned.
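
The weight calculation step can be sketched with the usual mutually reinforcing updates: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of pages it links to. The base-set graph below is hypothetical, and the keyword search and expansion of R to B are assumed to have happened already.

```python
import math

def hits(links, iterations=20):
    """Return (authority, hub) weights for a base set given as page -> linked pages."""
    pages = set(links) | {q for outgoing in links.values() for q in outgoing}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub weights of the pages pointing to p.
        auth = {p: 0.0 for p in pages}
        for p, outgoing in links.items():
            for q in outgoing:
                auth[q] += hub[p]
        # Hub update: sum of authority weights of the pages p points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both weight vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical base set: h1 links to a1 and a2, h2 links to a1.
auth, hub = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
print(max(auth, key=auth.get), max(hub, key=hub.get))
```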



HITS Algorithm
