Experiment 9: Web Mining
9.1 Aim
Aim: To study Web Mining
1. Crawlers
Incremental Crawler: An incremental crawler selectively searches the web and updates the index incrementally, as opposed to replacing it.
Focused Crawler: A focused crawler visits only pages related to topics of interest; it has been proposed because of the tremendous size of the web (a minimal sketch appears after this subsection).
2. Harvest System
The Harvest System is based on the use of caching, indexing and crawling. Harvest is actually a set of tools that facilitate the gathering of information from diverse sources. The Harvest design is centred around the use of gatherers and brokers. A gatherer obtains information for indexing from an Internet service provider, while a broker provides the index and query interface. The relationship between brokers and gatherers can vary: brokers may interface directly with gatherers or may go through other brokers to reach the gatherers. Indices and brokers are topic-specific in Harvest to avoid scalability problems. Harvest gatherers use the Essence system to assist in collecting data.
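To make the focused-crawler idea concrete, the following is a minimal sketch. It assumes a tiny in-memory "web", a keyword-overlap relevance score and a priority-queue frontier; these are illustrative simplifications, not how any particular crawler is implemented.

```python
# Minimal focused-crawler sketch: crawl pages in order of estimated topic relevance.
# The in-memory "web", topic keywords and scoring are illustrative assumptions.
import heapq

WEB = {  # url -> (page text, outgoing links); a stand-in for real HTTP fetching
    "u0": ("data mining on the web", ["u1", "u2"]),
    "u1": ("web usage mining and web logs", ["u3"]),
    "u2": ("cooking recipes and travel", ["u4"]),
    "u3": ("clustering web sessions", []),
    "u4": ("more recipes", []),
}

TOPIC = {"web", "mining", "clustering"}   # keywords describing the topic of interest

def relevance(text):
    """Fraction of topic keywords that appear in the page text."""
    words = set(text.split())
    return len(words & TOPIC) / len(TOPIC)

def focused_crawl(seed, max_pages=10):
    frontier = [(-1.0, seed)]             # max-heap simulated with negated scores
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = WEB[url]             # a real crawler would fetch and parse here
        order.append(url)
        score = relevance(text)            # links cited by relevant pages get priority
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return order

# u3 (linked from a relevant page) is crawled before u4 (linked from an irrelevant one).
print(focused_crawl("u0"))
```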
Database Approach
The database approaches view the web data as belonging to a database. These approaches view the web as a multilevel database and have query languages that target the web.
1. Virtual Web View
One proposed approach to handling the large amount of somewhat unstructured data on the web is to create a multiple layered database (MLDB) on top of the data in the web. This database is massive and distributed. Each layer of this database is more generalised than the layer beneath it. Unlike the lowest level (the web itself), the upper levels are structured and can be accessed and mined by an SQL-like query language. The MLDB provides an abstracted and condensed view of a portion of the web; such a view of the MLDB is called a Virtual Web View (VWV).
A web data mining query language, WebML, has been proposed to provide data mining operations on the MLDB. WebML is an extension of DMQL. A major feature of WebML is its four primitive operations, which are based on the use of concept hierarchies for the keywords:
(a) COVERS: One concept covers another if it is higher (an ancestor) in the hierarchy; this is extended to include synonyms.
(b) COVERED BY: This is the reverse of COVERS, referring to descendants.
(c) LIKE: The concept is a synonym.
(d) CLOSE TO: One concept is close to another if it is a sibling in the hierarchy; this is extended to include synonyms.
2. Personalisation
With personalisation, web access or the contents of a web page are modified to better fit the desires of the user. This may involve actually creating web pages that are unique per user, or using the desires of a user to determine which web documents to retrieve. With personalisation, advertisements to be sent to a potential customer are chosen based on specific knowledge concerning that customer. Personalisation may also be performed on the target web page; the goal here is to entice a current customer to purchase something he or she may not have thought about purchasing. Personalisation includes such techniques as the use of cookies, the use of databases, and more complex data mining and machine learning strategies. Personalisation may be performed in many ways, some of which are not data mining. Personalisation can be viewed as a type of clustering, classification, or even prediction: through classification, the desires of a user are determined based on those of the class; with clustering, the desires are determined based on those users to whom he or she is judged to be similar; prediction is used to predict what the user really wants to see. There are three basic types of web page personalisation:
Manual Techniques perform personalisation through user registration preferences or via the use of rules that classify individuals based on profiles or demographics.
Collaborative Filtering accomplishes personalisation by recommending information (pages) that has previously been given high ratings by similar users.
Content-Based Filtering retrieves pages based on the similarity between them and user profiles.
One of the earliest uses of personalisation was with My Yahoo!. Some observations about the use of personalisation are:
(a) Only a few users create very sophisticated pages by utilising the customisation provided.
(b) Most users do not seem to understand what personalisation means and use only the default page.
(c) Any personalisation system should be able to support both types of users.
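As an illustration of the collaborative-filtering idea, here is a minimal sketch. The users, page ratings and the choice of cosine similarity are illustrative assumptions, not taken from any deployed system.

```python
# Minimal user-based collaborative filtering sketch (illustrative data).
# Ratings: user -> {page: rating}; all names and values are made up.
from math import sqrt

ratings = {
    "alice": {"p1": 5, "p2": 3, "p4": 4},
    "bob":   {"p1": 4, "p2": 4, "p3": 5},
    "carol": {"p2": 2, "p3": 4, "p4": 5},
}

def cosine(u, v):
    """Cosine similarity over the pages both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[p] * v[p] for p in common)
    den = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return num / den

def recommend(target, k=2):
    """Suggest unseen pages rated highly by the k most similar users."""
    others = [(cosine(ratings[target], ratings[u]), u)
              for u in ratings if u != target]
    others.sort(reverse=True)
    scores = {}
    for sim, u in others[:k]:
        for page, r in ratings[u].items():
            if page not in ratings[target]:
                scores[page] = scores.get(page, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))   # e.g. ['p3']
```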
PageRank
The PageRank technique was designed to increase the effectiveness of search engines and improve their efficiency. PageRank is used to measure the importance of a page and to prioritise pages returned from a traditional search engine using keyword searching. The effectiveness of this measure has been demonstrated by the success of Google. The PageRank value for a page is calculated based on the number of pages that point to it. This is actually a measure based on the number of backlinks to a page; a backlink is a link pointing to a page rather than pointing out from a page. The measure is not simply a count of the number of backlinks, because a weighting is used to give more importance to backlinks coming from important pages. Given a page p, let Bp be the set of pages that point to p and Fp be the set of links out of p. The PageRank of a page p is then defined as
PR(p) = c Σ_{q ∈ Bp} [PR(q) / Nq]        (9.1)
where Nq = |Fq| is the number of links out of page q.
The constant c is a value between 0 and 1 and is used for normalisation. A problem with this PageRank calculation, called rank sink, arises when pages in a cyclic reference link only among themselves: their PR values keep increasing while other pages are starved of rank. This problem is solved by adding an additional term to the formula:
PR(p) = c Σ_{q ∈ Bp} [PR(q) / Nq] + c E(v)        (9.2)
where c is maximised. Here, E(v) is a vector over the pages that adds artificial links; it simulates a random surfer who periodically decides to stop following links and jumps to a new page. E(v) adds links of small probabilities between every pair of nodes. The PageRank technique differs from other approaches that look at links: it does not count all links the same, and the values are normalised by the number of links in the pointing page.
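Equation (9.2) can be evaluated iteratively. Below is a minimal sketch using the common damping-factor variant of the formula, assuming a small hand-made link graph and a damping value of 0.85 standing in for c and the E(v) random-jump term; it illustrates the calculation rather than reproducing Google's implementation.

```python
# Minimal PageRank sketch via power iteration (illustrative graph and damping value).
graph = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                    # start uniformly
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}   # E(v)-style random-jump term
        for q, out_links in graph.items():
            if not out_links:                           # dangling page: spread rank evenly
                for p in pages:
                    new[p] += damping * pr[q] / n
                continue
            share = damping * pr[q] / len(out_links)    # PR(q)/Nq, weighted by the damping
            for p in out_links:
                new[p] += share
        pr = new
    return pr

print(pagerank(graph))
```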
Clever
The Clever system was developed at IBM. It is aimed at finding both authoritative pages and hubs. An authoritative page is described as the best source for the requested information, while a hub is a page that contains links to authoritative pages. The Clever system identifies authoritative pages and hub pages by creating weights. A search can be viewed as having the goal of finding the best hubs and authorities. Authoritative pages have higher-quality content than other pages. Being authoritative is different from being relevant: a page may be extremely relevant, but if it contains factual errors, users may not want to retrieve it.
HITS
HITS stands for Hyperlink-Induced Topic Search. It finds hubs and authoritative pages. The HITS technique contains two components:
1. Based on a given set of keywords found in a query, a set of relevant pages is found.
2. Hub and authority measures are associated with these pages, and the pages with the highest values are returned.
Algorithm: HITS iteratively computes hub and authority weights for the pages in the relevant set.
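In each iteration, a page's authority weight is the sum of the hub weights of the pages that point to it, and its hub weight is the sum of the authority weights of the pages it points to, followed by normalisation. A minimal sketch follows; the link graph and iteration count are illustrative assumptions, not tied to the Clever system.

```python
# Minimal HITS sketch (illustrative link graph).
graph = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
}

def hits(graph, iterations=20):
    pages = list(graph)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p = sum of hub weights of pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        # Hub of p = sum of authority weights of pages p links to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        # Normalise so the weights do not grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

authorities, hubs = hits(graph)
print(sorted(authorities, key=authorities.get, reverse=True))  # best authorities first
```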
Web Usage Mining actually consists of three separate types of activities:
Preprocessing activities centre around reformatting the web log data before processing.
Pattern Discovery activities form the major portion of the mining activities, because these activities look to find hidden patterns within the log data.
Pattern Analysis is the process of looking at, and interpreting, the results of the discovery activities.
There are many issues associated with using the web log for mining purposes:
1. Identification of the user is not possible from the log alone.
2. With a web client cache, the exact sequence of pages a user actually visits is difficult to uncover from the server site.
3. Pages that are referenced may be found in the cache.
4. There are also many security, privacy and legal issues.
1) Preprocessing
Steps that are part of the preprocessing phase include cleansing, user identification, session identification, path completion and formatting. Data in the web log may be changed in several ways. For example, for security or privacy reasons, the page addresses may be changed into unique but non-identifying page identifications such as alphabetic characters; this conversion also saves storage space. Data may also be cleansed by removing any irrelevant information. Data from the log may be grouped together to provide more information: all pages visited from one source could be grouped together by a server to better understand the patterns of page references from each user, and similarly, patterns from groups of sites may be discovered. A common technique for a server site is to divide the log records into sessions. A session is a set of page references from one source site during one logical period; the login and logoff of a user represent the logical start and end of a session. Each session has a unique identifier called a session ID. Most of the problems associated with preprocessing activities centre around the correct identification of the actual user. User identification is complicated by the use of proxy servers, client-side caching and corporate firewalls. Cookies can be used to assist in identifying a single user regardless of the machine used to access the web. Prediction of missing pages by path completion is an attempt to add page accesses that do not exist in the log but actually occurred (a small session identification sketch follows).
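As an illustration of session identification, the sketch below splits one user's log entries into sessions using a timeout; the log records and the 30-minute threshold are illustrative assumptions.

```python
# Minimal session identification sketch: split one user's log entries into sessions
# using a timeout. The records and the 30-minute threshold are illustrative assumptions.
from datetime import datetime, timedelta

log = [  # (timestamp, page) for a single identified user, already sorted by time
    (datetime(2024, 1, 1, 10, 0), "/home"),
    (datetime(2024, 1, 1, 10, 5), "/catalog"),
    (datetime(2024, 1, 1, 11, 30), "/home"),     # long gap -> new session
    (datetime(2024, 1, 1, 11, 32), "/checkout"),
]

def sessionize(entries, timeout=timedelta(minutes=30)):
    sessions, current = [], []
    last_time = None
    for ts, page in entries:
        if last_time is not None and ts - last_time > timeout:
            sessions.append(current)     # close the previous session
            current = []
        current.append(page)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

print(sessionize(log))   # [['/home', '/catalog'], ['/home', '/checkout']]
```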
2) Data Structures
A basic data structure that keeps track of patterns identified during the web usage mining process is the trie. A trie is a rooted tree, where each path from the root to a leaf represents a sequence. Tries are used to store strings for pattern-matching applications; each character in the string is stored on the edge to a node, and common prefixes of strings are shared. A problem in using tries for many long strings is the space required. The compressed trie is a suffix tree. A suffix tree has the following characteristics:
(a) Each internal node, except the root, has at least two children.
(b) Each edge represents a non-empty subsequence.
(c) The subsequences represented by sibling edges begin with different symbols.
A suffix tree is efficient in finding any subsequence in a sequence and also common subsequences among multiple sequences. A slight variation on the suffix tree, used to build a single tree for multiple sessions, is called a generalised suffix tree (GST). A minimal trie sketch is shown below.
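The shared-prefix idea behind tries can be illustrated with a small sketch. Note that this is a plain trie over session page sequences, not the compressed suffix tree or GST described above; the sessions are illustrative.

```python
# Minimal trie sketch for storing session page sequences (illustrative sessions).
# Each edge label is a page; shared prefixes are stored only once.
def build_trie(sessions):
    root = {}
    for session in sessions:
        node = root
        for page in session:
            node = node.setdefault(page, {})   # follow or create the edge
        node["$"] = {}                         # end-of-sequence marker
    return root

def contains_prefix(trie, prefix):
    """Check whether some stored session begins with the given page prefix."""
    node = trie
    for page in prefix:
        if page not in node:
            return False
        node = node[page]
    return True

sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
trie = build_trie(sessions)
print(contains_prefix(trie, ["A", "B"]))   # True: shared prefix of two sessions
print(contains_prefix(trie, ["C", "A"]))   # False
```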
3) Pattern Discovery
The most common data mining technique used on clickstream data is uncovering traversal patterns. A traversal pattern is a set of pages visited by a user in a session. Similar traversal patterns may be clustered together to provide a clustering of the users. Traversal patterns may differ in how the patterns are defined; the differences between the types of patterns can be described by the following features:
(a) Duplicate page references (backward traversals and refreshes/reloads) may or may not be allowed.
(b) A pattern may be composed only of contiguous page references, or alternatively of any pages referenced in the same session.
(c) The pattern of references may or may not be required to be maximal in the session. A frequent pattern is maximal if it is not a subpattern of another frequent pattern.
Patterns found using different combinations of these three properties may be used to discover different features.
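For the simplest case, contiguous traversal patterns can be counted directly from sessions, as in the sketch below; the sessions, pattern length and support threshold are illustrative assumptions.

```python
# Minimal sketch: counting frequent contiguous traversal patterns across sessions.
# Session data, pattern length and support threshold are illustrative assumptions.
from collections import Counter

sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "C"],
    ["B", "C", "D"],
]

def frequent_contiguous(sessions, length=2, min_support=2):
    """Count contiguous page subsequences of a given length, once per session."""
    counts = Counter()
    for session in sessions:
        seen = set()                       # count each pattern at most once per session
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

print(frequent_contiguous(sessions))   # e.g. {('A','B'): 2, ('B','C'): 3, ('C','D'): 2}
```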
4) Pattern Analysis
Once a pattern has been identified, it must be analysed to determine how that information can be used. Some of the generated patterns may be determined not to be of any interest and deleted. Patterns found need not have contiguous page references. A web mining query language, MINT, facilitates the statement of interesting properties. The idea of a sequence is expanded to the concept of what the authors call a g-sequence: a vector that consists not only of the pages visited (the events) but also of wild cards. Patterns found across two logs can be compared for similarity. Similarity is determined using the following rules:
(a) Two patterns are comparable if their g-sequences have at least the first n pages the same, where n is supplied by the user.
(b) In addition, only fragments of patterns that occur frequently are considered.
The goal of this work is to increase the number of customers. Non-customer patterns with no comparable customer patterns indicate that some changes to the link structure or web page designs may be in order.
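The g-sequence idea can be illustrated with a small matcher; the '*' wild-card convention and the start-anchored matching below are illustrative assumptions rather than actual MINT syntax.

```python
# Minimal sketch: matching a g-sequence (pages plus '*' wild cards) against a session.
# The '*' convention (matches any run of pages) and anchoring the match at the start
# of the session are illustrative assumptions, not actual MINT syntax.
def matches(g_sequence, session):
    """Return True if the session begins with the g-sequence in order,
    with '*' standing for any (possibly empty) run of pages."""
    def match_from(gi, si):
        if gi == len(g_sequence):
            return True
        if g_sequence[gi] == "*":
            # Try consuming zero or more pages with the wild card.
            return any(match_from(gi + 1, sj) for sj in range(si, len(session) + 1))
        if si < len(session) and session[si] == g_sequence[gi]:
            return match_from(gi + 1, si + 1)
        return False
    return match_from(0, 0)

print(matches(["home", "*", "checkout"], ["home", "catalog", "item", "checkout"]))  # True
print(matches(["home", "*", "checkout"], ["catalog", "checkout"]))                  # False
```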