Web Miningppt

This document provides an overview of web mining techniques. It discusses why web usage mining is useful for discovering visitor profiles and measuring marketing efforts. It then describes how to perform web usage mining by obtaining web traffic data from sources like server logs, databases, and forms. Common pattern analysis techniques are outlined for understanding site usage and frequent pages. Pattern discovery tools involve preprocessing data, analyzing paths, grouping similar information, and applying techniques like clustering and decision trees. The document also discusses focused crawlers, virtual web views, personalization, and algorithms for analyzing web structure like PageRank and HITS.

Uploaded by

Teresa Sebastian

Web Mining Taxonomy

Why Web Usage Mining?

Explosive growth of E-commerce
Provides a cost-efficient way of doing business
Amazon.com: an online Wal-Mart

Hidden Useful information

Visitors' profiles can be discovered
Measuring online marketing efforts, launching marketing campaigns, etc.

How to perform Web Usage Mining

Obtain web traffic data from
Web server log files
Corporate relational databases
Registration forms

Apply data mining techniques and other Web mining techniques
Two categories:
Pattern Discovery Tools
Pattern Analysis Tools

Pattern Analysis Tools

Answer Questions like:
How are people using this site?
Which pages are being accessed most frequently?

This requires the analysis of the structure of hyperlinks and the contents of the pages

Pattern Analysis Tools

Output of analysis:
The frequency of visits per document
Most recent visit per document
Frequency of use of each hyperlink
Most recent use of each hyperlink

Techniques:
Visualization techniques
OLAP techniques
Data & Knowledge Querying
Usability analysis
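
The first two outputs listed above (frequency of visits and most recent visit per document) can be computed from a list of page requests. This is a minimal sketch; the `visits` list and the function name `summarize_visits` are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from datetime import datetime

# Hypothetical page requests: (document URL, time of request)
visits = [
    ("/index.html", datetime(2000, 10, 10, 13, 55)),
    ("/products.html", datetime(2000, 10, 10, 14, 2)),
    ("/index.html", datetime(2000, 10, 11, 9, 30)),
]

def summarize_visits(visits):
    """Return (visit count per document, most recent visit per document)."""
    frequency = Counter(url for url, _ in visits)
    most_recent = {}
    for url, when in visits:
        if url not in most_recent or when > most_recent[url]:
            most_recent[url] = when
    return frequency, most_recent

frequency, most_recent = summarize_visits(visits)
```

The same counting would apply to hyperlinks if each record held a (source page, target page) pair instead of a single URL.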

Pattern Discovery Tools

Data Pre-processing
Filter and clean Web log files
Eliminate outliers and irrelevant items

Integration of Web Usage data from:

Web server logs
Referral logs
Registration file
Corporate database
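
A minimal sketch of the log-cleaning step, assuming the server writes Common Log Format. The regular expression, the list of "irrelevant" file suffixes, and the sample lines are illustrative assumptions; real sites would tune them.

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
# Assumed list of requests that say little about navigation (images, styles).
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")

def clean_log(lines):
    """Keep only well-formed, successful requests for actual pages."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed line
        record = match.groupdict()
        if not record["status"].startswith("2"):
            continue  # failed or redirected request
        if record["url"].lower().endswith(IRRELEVANT_SUFFIXES):
            continue  # embedded resource, not a page view
        records.append(record)
    return records

sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.2 - - [10/Oct/2000:13:56:01 -0700] "GET /missing.html HTTP/1.0" 404 0',
    'not a log line',
]
cleaned = clean_log(sample_log)
```

Only the first sample line survives: the image request, the 404, and the malformed line are dropped.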

Pattern Discovery Techniques

Converting IP addresses to Domain Names
The Domain Name System does the conversion
Discover information from visitors' domain names:
Ex: .ca (Canada), .cn (China), etc.

Converting URLs to Page Titles

Page Title: between <title> and </title>
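
Both conversions can be sketched in a few lines. The sketch assumes the reverse DNS lookup has already produced a domain name; the two-entry country table only mirrors the slide's examples, and the function names are hypothetical.

```python
import re

# Illustrative subset only; the two entries come from the slide's examples.
COUNTRY_BY_TLD = {"ca": "Canada", "cn": "China"}

def country_from_domain(domain_name):
    """Map a visitor's domain name to a country via its top-level domain."""
    tld = domain_name.rstrip(".").rsplit(".", 1)[-1].lower()
    return COUNTRY_BY_TLD.get(tld)

TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def page_title(html):
    """Return the text between <title> and </title>, or None if absent."""
    match = TITLE_PATTERN.search(html)
    return match.group(1).strip() if match else None
```

A regular expression is enough for this illustration; a full HTML parser would be more robust on malformed pages.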

Pattern Discovery Techniques

Path Analysis
Uses a graph model
Provides insights into navigational problems
Examples of information discovered by path analysis:
78%: company -> what's new -> sample -> order
60% left the site after four or fewer page references => the most important information must be within the first four pages from site entry points.
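
Statistics like the two above can be derived from per-visitor page sequences. This is a sketch under the assumption that sessions have already been reconstructed from the log; the session data and function names are made up for illustration.

```python
# Hypothetical sessions: ordered list of pages viewed by each visitor
sessions = [
    ["company", "whats_new", "sample", "order"],
    ["company", "whats_new", "sample", "order"],
    ["company", "products"],
    ["company", "whats_new", "sample", "order", "checkout", "thanks"],
]

def path_share(sessions, path):
    """Fraction of sessions that contain `path` as a contiguous run of pages."""
    path = list(path)

    def contains(session):
        return any(
            session[i:i + len(path)] == path
            for i in range(len(session) - len(path) + 1)
        )

    return sum(contains(s) for s in sessions) / len(sessions)

def short_session_share(sessions, max_pages=4):
    """Fraction of sessions that ended after at most `max_pages` page references."""
    return sum(len(s) <= max_pages for s in sessions) / len(sessions)

order_path_share = path_share(sessions, ["company", "whats_new", "sample", "order"])
early_exit_share = short_session_share(sessions)
```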

Pattern Discovery Techniques

Grouping
Groups similar information to help draw higher-level conclusions
Ex: all URLs containing the word "Yahoo"

Filtering
Allows answering specific questions like:
How many visitors came to the site this week?

Pattern Discovery Techniques

Dynamic Site Analysis
Dynamic HTML pages link to the database and require parameters appended to URLs
http://search.netscape.com/cgi-bin/search?search=Federal+Tax+Return+Form&cp=ntserch
Knowledge:
What the visitors looked for
Which keywords should be purchased from search engines
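
Recovering what a visitor looked for amounts to parsing the query string of the logged URL. A minimal sketch using Python's standard `urllib.parse`; the host `search.example.com` and the parameter name `search` are assumptions modeled on the slide's example.

```python
from urllib.parse import urlsplit, parse_qs

def search_keywords(url, parameter="search"):
    """Return the keywords carried in the given query parameter of a URL."""
    query = parse_qs(urlsplit(url).query)
    values = query.get(parameter, [])
    return values[0].split() if values else []

# Hypothetical logged URL in the same shape as the slide's example
logged_url = "http://search.example.com/search?search=Federal+Tax+Return+Form&cp=ntserch"
keywords = search_keywords(logged_url)
```

`parse_qs` decodes `+` as a space, so the four keywords come back as separate words.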

Pattern Discovery Techniques

Cookies
ID randomly assigned by the web server to the browser
Cookies are beneficial to both web site developers and visitors
The cookie field entry in the log file can be used by Web traffic analysis software to track repeat visitors (loyal customers).
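
One way traffic analysis software might use the cookie field is to count the distinct days on which each cookie ID appears. This is a sketch only; the `(cookie ID, day)` record shape and the "more than one day" rule for a repeat visitor are assumptions.

```python
from collections import defaultdict

# Hypothetical log entries: (cookie ID, day of the request)
log_entries = [
    ("cookie-a1", "2000-10-10"),
    ("cookie-b2", "2000-10-10"),
    ("cookie-a1", "2000-10-12"),
    ("cookie-a1", "2000-10-12"),
]

def repeat_visitors(entries):
    """Return cookie IDs seen on more than one distinct day, sorted."""
    days_by_cookie = defaultdict(set)
    for cookie_id, day in entries:
        days_by_cookie[cookie_id].add(day)
    return sorted(c for c, days in days_by_cookie.items() if len(days) > 1)

loyal = repeat_visitors(log_entries)
```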

Pattern Discovery Techniques

Association Rules
help find spending patterns on related products
30% of visitors who accessed /company/products/bread.html also accessed /company/products/milk.htm.

Sequential Patterns
help find inter-transaction patterns
50% of visitors who bought items in /pcworld/computers/ also bought items in /pcworld/accessories/ within 15 days.
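
The percentages in both kinds of rule are confidences: of the visitors matching the left-hand side, the share that also match the right-hand side. The sketch below computes them directly; the data, the function names, and the choice to measure the time window from a customer's earliest purchase in the first category are all assumptions.

```python
from datetime import date

def rule_confidence(sessions, antecedent, consequent):
    """Association rule: share of sessions with `antecedent` that also contain `consequent`."""
    with_antecedent = [s for s in sessions if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)

def sequential_confidence(purchases, first, then, max_days):
    """Sequential pattern: share of buyers of `first` who buy `then` within `max_days`."""
    bought_first = 0
    followed = 0
    for events in purchases.values():
        first_dates = [d for d, category in events if category == first]
        if not first_dates:
            continue
        bought_first += 1
        start = min(first_dates)  # assumption: window starts at the earliest purchase
        if any(c == then and 0 <= (d - start).days <= max_days for d, c in events):
            followed += 1
    return followed / bought_first if bought_first else 0.0

# Hypothetical data
sessions = [{"/bread.html", "/milk.htm"}, {"/bread.html"}, {"/milk.htm"}]
purchases = {
    "alice": [(date(2000, 1, 1), "computers"), (date(2000, 1, 10), "accessories")],
    "bob": [(date(2000, 1, 1), "computers"), (date(2000, 3, 1), "accessories")],
    "carol": [(date(2000, 1, 5), "accessories")],
}
bread_to_milk = rule_confidence(sessions, "/bread.html", "/milk.htm")
computers_then_accessories = sequential_confidence(purchases, "computers", "accessories", 15)
```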

Pattern Discovery Techniques

Clustering
Identifies visitors with common characteristics based on visitor profiles
50% of those who applied for a Discover Platinum card in /discovercard/customerService/newcard were in the 25-35 age group, with annual income between $40,000 and $50,000.
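
The slide does not name a clustering algorithm. As one common choice, the sketch below runs k-means on hypothetical visitor profiles of (age, annual income in thousands of dollars); the data and starting centroids are invented for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical profiles: (age, annual income in $1000s)
profiles = [(27, 42), (31, 48), (29, 45), (55, 90), (60, 85)]
centroids, clusters = kmeans(profiles, centroids=[(25, 40), (60, 90)])
```

The first cluster gathers the visitors in the 25-35 age range with incomes in the $40,000 to $50,000 band, which is the kind of group the slide's example describes.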

Pattern Discovery Techniques

Decision Trees
A flow chart of questions leading to a decision
Ex: car-buying decision tree
What brand? -> What year? -> What type? -> 2000 model Honda Accord EX
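
A decision tree of this kind can be represented as nested questions and walked with a visitor's answers. In this sketch the branch labels ("Honda", "2000", "Accord EX") are inferred from the slide's final decision and are assumptions, as are the names `car_tree` and `decide`.

```python
# Each internal node asks a question; each branch leads to another node or a decision.
car_tree = {
    "question": "What brand?",
    "branches": {
        "Honda": {
            "question": "What year?",
            "branches": {
                "2000": {
                    "question": "What type?",
                    "branches": {"Accord EX": "2000 model Honda Accord EX"},
                }
            },
        }
    },
}

def decide(tree, answers):
    """Follow the answers down the tree; return the decision, or None if no branch matches."""
    node = tree
    while isinstance(node, dict):
        answer = answers[node["question"]]
        node = node["branches"].get(answer)
    return node

choice = decide(
    car_tree,
    {"What brand?": "Honda", "What year?": "2000", "What type?": "Accord EX"},
)
```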

Web Content Mining

Extends work of basic search engines
Search engines
IR application
Keyword based
Similarity between query and document
Crawlers
Indexing
Profiles
Link analysis
Week 1: Data Mining II 15

Crawlers
Robot (spider) traverses the hypertext structure in the Web.
Collects information from visited pages
Used to construct indexes for search engines
Traditional Crawler: visits entire Web (?) and replaces index
Periodic Crawler: visits portions of the Web and updates subset of index
Incremental Crawler: selectively searches the Web and incrementally modifies index
Focused Crawler: visits pages related to a particular subject

Focused Crawler
Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
Classifier, which assigns a relevance score to each page based on crawl topic.
Distiller, to identify hub pages.
Crawler, which visits pages based on classifier and distiller scores.

Focused Crawler
Classifier relates documents to topics.
Classifier also determines how useful outgoing links are.
Hub pages contain links to many relevant pages. They must be visited even if they do not have a high relevance score.
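
The core rule of a focused crawler (follow links only from relevant pages) can be sketched over a small in-memory "web". Everything here is an assumption for illustration: the `WEB` dictionary stands in for real HTTP fetches, the keyword-overlap `relevance` function stands in for a trained classifier, and the distiller for hub pages is left out.

```python
import heapq

# Hypothetical web: url -> (page text, outgoing links)
WEB = {
    "seed": ("web mining overview", ["a", "b"]),
    "a": ("web mining with crawlers", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("mining web logs", []),
    "d": ("more recipes", []),
}
TOPIC_WORDS = {"web", "mining"}

def relevance(text):
    """Stand-in classifier: share of the page's words that are topic words."""
    words = text.split()
    return sum(word in TOPIC_WORDS for word in words) / len(words)

def focused_crawl(seed, threshold=0.5):
    """Visit pages in order of their parent's relevance; expand only relevant pages."""
    frontier = [(-1.0, seed)]  # min-heap on negated score, so best score comes first
    seen = {seed}
    visited = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        score = relevance(text)
        visited.append(url)
        if score < threshold:
            continue  # irrelevant page: do not follow its links
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited

crawl_order = focused_crawl("seed")
```

Page "d" is never fetched because it is reachable only through the irrelevant page "b".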

[Figure: Focused Crawler]

Context Focused Crawler

Context Graph:
Context graph created for each seed document.
Root is the seed document.
Nodes at each level show documents with links to documents at next higher level.
Updated during the crawl itself.

Approach:
1. Construct context graph and classifiers using seed documents as training data.
2. Perform crawling using classifiers and context graph created.

[Figure: Context Graph]

Virtual Web View

Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL-type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in the first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.

Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content-based filtering retrieves pages based on similarity between pages and user profiles.
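
Collaborative filtering can be illustrated with a nearest-neighbor sketch: find the most similar user by rating overlap and suggest pages that user rated but the target has not seen. The ratings table, the cosine similarity measure, and the single-neighbor rule are assumptions chosen for brevity.

```python
from math import sqrt

# Hypothetical page ratings per user (1 = poor, 5 = excellent)
ratings = {
    "ann": {"p1": 5, "p2": 4, "p3": 1},
    "bob": {"p1": 5, "p2": 5, "p4": 4},
    "cat": {"p1": 1, "p3": 5, "p5": 5},
}

def cosine(u, v):
    """Cosine similarity between two rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user):
    """Pages rated by the most similar user that `user` has not rated, best first."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    unseen = {p: r for p, r in ratings[nearest].items() if p not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

suggestions = recommend("ann")
```

Content-based filtering would instead compare page contents with the user's own profile, without consulting other users' ratings.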

Web Structure Mining

Mine structure (links, graph) of the Web
Techniques:
PageRank
CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.

PageRank
Used by Google
Prioritizes pages returned from a search by looking at Web structure.
Importance of a page is calculated based on the number of pages which point to it (backlinks).
Weighting is used to give more importance to backlinks coming from important pages.

PageRank (contd)
$$PR(p) = c \left( \frac{PR(1)}{N_1} + \cdots + \frac{PR(n)}{N_n} \right)$$
PR(i): PageRank for a page i which points to target page p.
N_i: number of links coming out of page i.
c: constant between 0 and 1 used for normalization.
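
Because each page's rank depends on the ranks of the pages linking to it, the formula is usually evaluated iteratively. The sketch below is a simplified illustration, not Google's implementation: it assumes every page has at least one outgoing link, and it rescales the ranks to sum to 1 after each pass, which stands in for the constant c.

```python
def pagerank(links, iterations=100):
    """Iterate PR(p) = sum over backlinks i of PR(i) / N_i, rescaled to sum to 1."""
    pages = list(links)
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                # Each page shares its rank equally among its outgoing links.
                new_ranks[target] += ranks[source] / len(targets)
        total = sum(new_ranks.values())
        ranks = {page: rank / total for page, rank in new_ranks.items()}
    return ranks

# Hypothetical three-page web: page -> pages it links to
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

On this graph the ranks settle at 0.4 for A and C and 0.2 for B: C has two backlinks, and A inherits all of C's rank through C's single outgoing link.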

CLEVER
Identify authoritative and hub pages.
Authoritative pages:
Highly important pages.
Best source for requested information.
Hub pages:
Contain links to highly important pages.

HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages R.
Identify hub and authority pages for these:
Expand R to a base set, B, of pages linked to or from R.
Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
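
The weight calculation is commonly done by mutual reinforcement: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of pages it links to. The sketch below assumes the base set B is already built; the graph, the iteration count, and the Euclidean normalization are illustrative choices.

```python
from math import sqrt

def hits(links, iterations=50):
    """Return (hub weights, authority weights) for a link graph given as page -> targets."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {page: 1.0 for page in pages}
    authorities = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of hub weights of the pages that link to this page.
        authorities = {
            p: sum(hubs[q] for q in pages if p in links.get(q, ())) for p in pages
        }
        norm = sqrt(sum(v * v for v in authorities.values())) or 1.0
        authorities = {p: v / norm for p, v in authorities.items()}
        # Hub: sum of authority weights of the pages this page links to.
        hubs = {p: sum(authorities[t] for t in links.get(p, ())) for p in pages}
        norm = sqrt(sum(v * v for v in hubs.values())) or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, authorities

# Hypothetical base set: h1 and h2 link out; a1 and a2 are linked to
base_set = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hubs, authorities = hits(base_set)
best_authority = max(authorities, key=authorities.get)
best_hub = max(hubs, key=hubs.get)
```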

[Figure: HITS Algorithm]
