Experiment 9: Web Mining
9.1 Aim
Aim: To study Web Mining
1. Crawlers
Incremental Crawler: An incremental crawler selectively searches the web and updates the index incrementally, as opposed to replacing it.
Focused Crawler: A focused crawler visits only pages related to topics of interest; it has been proposed because of the tremendous size of the web (a minimal sketch appears after this subsection).
2. Harvest System
The Harvest System is based on the use of caching, indexing and crawling. Harvest is actually a set of tools that facilitate the gathering of information from diverse sources. The Harvest design is centred around the use of gatherers and brokers. A gatherer obtains information for indexing from an Internet service provider, while a broker provides the index and query interface. The relationship between brokers and gatherers can vary: brokers may interface directly with gatherers or may go through other brokers to reach the gatherers. Indices and brokers are topic-specific in Harvest to avoid scalability problems. Harvest gatherers use the Essence system to assist in collecting data.
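To make the focused-crawler idea concrete, the following is a minimal sketch. It assumes a tiny in-memory "web", a keyword-overlap relevance score and a priority-queue frontier; these are illustrative simplifications, not how any particular crawler is implemented.

```python
# Minimal focused-crawler sketch: crawl pages in order of estimated topic relevance.
# The in-memory "web", topic keywords and scoring are illustrative assumptions.
import heapq

WEB = {  # url -> (page text, outgoing links); a stand-in for real HTTP fetching
    "u0": ("data mining on the web", ["u1", "u2"]),
    "u1": ("web usage mining and web logs", ["u3"]),
    "u2": ("cooking recipes and travel", ["u4"]),
    "u3": ("clustering web sessions", []),
    "u4": ("more recipes", []),
}

TOPIC = {"web", "mining", "clustering"}   # keywords describing the topic of interest

def relevance(text):
    """Fraction of topic keywords that appear in the page text."""
    words = set(text.split())
    return len(words & TOPIC) / len(TOPIC)

def focused_crawl(seed, max_pages=10):
    frontier = [(-1.0, seed)]             # max-heap simulated with negated scores
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        text, links = WEB[url]             # a real crawler would fetch and parse here
        order.append(url)
        score = relevance(text)            # links cited by relevant pages get priority
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return order

# u3 (linked from a relevant page) is crawled before u4 (linked from an irrelevant one).
print(focused_crawl("u0"))
```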
Database Approach
The database approaches view the web data as belonging to a database. These approaches view the web as a multilevel database and have query languages that target the web.
1. Virtual Web View
One proposed approach to handling the large amount of somewhat unstructured data on the web is to create a multiple layered database (MLDB) on top of the data in the web. This database is massive and distributed. Each layer of this database is more generalised than the layer beneath it. Unlike the lowest level (the web itself), the upper levels are structured and can be accessed and mined by an SQL-like query language. The MLDB provides an abstracted and condensed view of a portion of the web; such a view of the MLDB is called a Virtual Web View (VWV).
A web data mining query language, WebML, has been proposed to provide data mining operations on the MLDB. WebML is an extension of DMQL. A major feature of WebML is its four primitive operations, which are based on the use of concept hierarchies for the keywords:
(a) COVERS: One concept covers another if it is higher (an ancestor) in the hierarchy; this is extended to include synonyms.
(b) COVERED BY: This is the reverse of COVERS, referring to descendants.
(c) LIKE: The concept is a synonym.
(d) CLOSE TO: One concept is close to another if it is a sibling in the hierarchy; this is extended to include synonyms.
2. Personalisation
With personalisation, web access or the contents of a web page are modified to better fit the desires of the user. This may involve actually creating web pages that are unique per user, or using the desires of a user to determine which web documents to retrieve. With personalisation, advertisements to be sent to a potential customer are chosen based on specific knowledge concerning that customer. Personalisation may also be performed on the target web page; the goal here is to entice a current customer to purchase something he or she may not have thought about purchasing. Personalisation includes such techniques as the use of cookies, the use of databases, and more complex data mining and machine learning strategies. Personalisation may be performed in many ways, some of which are not data mining. Personalisation can be viewed as a type of clustering, classification, or even prediction: through classification, the desires of a user are determined based on those of the class; with clustering, the desires are determined based on those users to whom he or she is judged to be similar; prediction is used to predict what the user really wants to see. There are three basic types of web page personalisation:
Manual Techniques perform personalisation through user registration preferences or via the use of rules that classify individuals based on profiles or demographics.
Collaborative Filtering accomplishes personalisation by recommending information (pages) that has previously been given high ratings by similar users.
Content-Based Filtering retrieves pages based on the similarity between them and user profiles.
One of the earliest uses of personalisation was with My Yahoo!. Some observations about the use of personalisation are:
(a) Only a few users create very sophisticated pages by utilising the customisation provided.
(b) Most users do not seem to understand what personalisation means and use only the default page.
(c) Any personalisation system should be able to support both types of users.
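As an illustration of the collaborative-filtering idea, here is a minimal sketch. The users, page ratings and the choice of cosine similarity are illustrative assumptions, not taken from any deployed system.

```python
# Minimal user-based collaborative filtering sketch (illustrative data).
# Ratings: user -> {page: rating}; all names and values are made up.
from math import sqrt

ratings = {
    "alice": {"p1": 5, "p2": 3, "p4": 4},
    "bob":   {"p1": 4, "p2": 4, "p3": 5},
    "carol": {"p2": 2, "p3": 4, "p4": 5},
}

def cosine(u, v):
    """Cosine similarity over the pages both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[p] * v[p] for p in common)
    den = sqrt(sum(r * r for r in u.values())) * sqrt(sum(r * r for r in v.values()))
    return num / den

def recommend(target, k=2):
    """Suggest unseen pages rated highly by the k most similar users."""
    others = [(cosine(ratings[target], ratings[u]), u)
              for u in ratings if u != target]
    others.sort(reverse=True)
    scores = {}
    for sim, u in others[:k]:
        for page, r in ratings[u].items():
            if page not in ratings[target]:
                scores[page] = scores.get(page, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))   # e.g. ['p3']
```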
PageRank
The PageRank technique was designed to increase the effectiveness of search engines and improve their efficiency. PageRank is used to measure the importance of a page and to prioritise pages returned from a traditional search engine using keyword searching. The effectiveness of this measure has been demonstrated by the success of Google. The PageRank value for a page is calculated based on the number of pages that point to it. This is actually a measure based on the number of backlinks to a page; a backlink is a link pointing to a page rather than pointing out from a page. The measure is not simply a count of the number of backlinks, because a weighting is used to give more importance to backlinks coming from important pages. Given a page p, let Bp be the set of pages that point to p and Fp be the set of links out of p. The PageRank of a page p is then defined as
PR(p) = c Σ_{q ∈ Bp} [PR(q) / Nq]        (9.1)
where Nq = |Fq| is the number of links out of page q.
The constant c is a value between 0 and 1 and is used for normalisation. A problem with this PageRank calculation, called rank sink, arises when pages in a cyclic reference link only among themselves: their PR values keep increasing while other pages are starved of rank. This problem is solved by adding an additional term to the formula:
PR(p) = c Σ_{q ∈ Bp} [PR(q) / Nq] + c E(v)        (9.2)
where c is maximised. Here, E(v) is a vector over the pages that adds artificial links; it simulates a random surfer who periodically decides to stop following links and jumps to a new page. E(v) adds links of small probabilities between every pair of nodes. The PageRank technique differs from other approaches that look at links: it does not count all links the same, and the values are normalised by the number of links in the pointing page.
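Equation (9.2) can be evaluated iteratively. Below is a minimal sketch using the common damping-factor variant of the formula, assuming a small hand-made link graph and a damping value of 0.85 standing in for c and the E(v) random-jump term; it illustrates the calculation rather than reproducing Google's implementation.

```python
# Minimal PageRank sketch via power iteration (illustrative graph and damping value).
graph = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}                    # start uniformly
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}   # E(v)-style random-jump term
        for q, out_links in graph.items():
            if not out_links:                           # dangling page: spread rank evenly
                for p in pages:
                    new[p] += damping * pr[q] / n
                continue
            share = damping * pr[q] / len(out_links)    # PR(q)/Nq, weighted by the damping
            for p in out_links:
                new[p] += share
        pr = new
    return pr

print(pagerank(graph))
```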
Clever
The Clever system was developed at IBM. It is aimed at finding both authoritative pages and hubs. An authoritative page is described as the best source for the requested information, while a hub is a page that contains links to authoritative pages. The Clever system identifies authoritative pages and hub pages by creating weights. A search can be viewed as having the goal of finding the best hubs and authorities. Authoritative pages have higher-quality content than other pages. Being authoritative is different from being relevant: a page may be extremely relevant, but if it contains factual errors, users may not want to retrieve it.
HITS
HITS stands for Hyperlink-Induced Topic Search. It finds hubs and authoritative pages. The HITS technique contains two components:
1. Based on a given set of keywords found in a query, a set of relevant pages is found.
2. Hub and authority measures are associated with these pages, and the pages with the highest values are returned.
Algorithm: HITS iteratively computes hub and authority weights for the pages in the relevant set.
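In each iteration, a page's authority weight is the sum of the hub weights of the pages that point to it, and its hub weight is the sum of the authority weights of the pages it points to, followed by normalisation. A minimal sketch follows; the link graph and iteration count are illustrative assumptions, not tied to the Clever system.

```python
# Minimal HITS sketch (illustrative link graph).
graph = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
}

def hits(graph, iterations=20):
    pages = list(graph)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of p = sum of hub weights of pages linking to p.
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        # Hub of p = sum of authority weights of pages p links to.
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        # Normalise so the weights do not grow without bound.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

authorities, hubs = hits(graph)
print(sorted(authorities, key=authorities.get, reverse=True))  # best authorities first
```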
Web Usage Mining actually consists of three separate types of activities:
Preprocessing activities centre around reformatting the web log data before processing.
Pattern Discovery activities form the major portion of the mining activities, because these activities look to find hidden patterns within the log data.
Pattern Analysis is the process of looking at, and interpreting, the results of the discovery activities.
There are many issues associated with using the web log for mining purposes:
1. Identification of the user is not possible from the log alone.
2. With a web client cache, the exact sequence of pages a user actually visits is difficult to uncover from the server site.
3. Pages that are referenced may be found in the cache.
4. There are also many security, privacy and legal issues.
1) Preprocessing
Steps that are part of the preprocessing phase include cleansing, user identification, session identification, path completion and formatting. Data in the web log may be changed in several ways. For example, for security or privacy reasons, the page addresses may be changed into unique but non-identifying page identifications such as alphabetic characters; this conversion also saves storage space. Data may also be cleansed by removing any irrelevant information. Data from the log may be grouped together to provide more information: all pages visited from one source could be grouped together by a server to better understand the patterns of page references from each user, and similarly, patterns from groups of sites may be discovered. A common technique for a server site is to divide the log records into sessions. A session is a set of page references from one source site during one logical period; the login and logoff of a user represent the logical start and end of a session. Each session has a unique identifier called a session ID. Most of the problems associated with preprocessing activities centre around the correct identification of the actual user. User identification is complicated by the use of proxy servers, client-side caching and corporate firewalls. Cookies can be used to assist in identifying a single user regardless of the machine used to access the web. Prediction of missing pages by path completion is an attempt to add page accesses that do not exist in the log but actually occurred (a small session identification sketch follows).
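As an illustration of session identification, the sketch below splits one user's log entries into sessions using a timeout; the log records and the 30-minute threshold are illustrative assumptions.

```python
# Minimal session identification sketch: split one user's log entries into sessions
# using a timeout. The records and the 30-minute threshold are illustrative assumptions.
from datetime import datetime, timedelta

log = [  # (timestamp, page) for a single identified user, already sorted by time
    (datetime(2024, 1, 1, 10, 0), "/home"),
    (datetime(2024, 1, 1, 10, 5), "/catalog"),
    (datetime(2024, 1, 1, 11, 30), "/home"),     # long gap -> new session
    (datetime(2024, 1, 1, 11, 32), "/checkout"),
]

def sessionize(entries, timeout=timedelta(minutes=30)):
    sessions, current = [], []
    last_time = None
    for ts, page in entries:
        if last_time is not None and ts - last_time > timeout:
            sessions.append(current)     # close the previous session
            current = []
        current.append(page)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

print(sessionize(log))   # [['/home', '/catalog'], ['/home', '/checkout']]
```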
2) Data Structures
A basic data structure that keeps track of patterns identified during the web usage mining process is the trie. A trie is a rooted tree, where each path from the root to a leaf represents a sequence. Tries are used to store strings for pattern-matching applications; each character in the string is stored on the edge to a node, and common prefixes of strings are shared. A problem in using tries for many long strings is the space required. The compressed trie is a suffix tree. A suffix tree has the following characteristics:
(a) Each internal node, except the root, has at least two children.
(b) Each edge represents a non-empty subsequence.
(c) The subsequences represented by sibling edges begin with different symbols.
A suffix tree is efficient in finding any subsequence in a sequence and also common subsequences among multiple sequences. A slight variation on the suffix tree, used to build a single tree for multiple sessions, is called a generalised suffix tree (GST). A minimal trie sketch is shown below.
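The shared-prefix idea behind tries can be illustrated with a small sketch. Note that this is a plain trie over session page sequences, not the compressed suffix tree or GST described above; the sessions are illustrative.

```python
# Minimal trie sketch for storing session page sequences (illustrative sessions).
# Each edge label is a page; shared prefixes are stored only once.
def build_trie(sessions):
    root = {}
    for session in sessions:
        node = root
        for page in session:
            node = node.setdefault(page, {})   # follow or create the edge
        node["$"] = {}                         # end-of-sequence marker
    return root

def contains_prefix(trie, prefix):
    """Check whether some stored session begins with the given page prefix."""
    node = trie
    for page in prefix:
        if page not in node:
            return False
        node = node[page]
    return True

sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
trie = build_trie(sessions)
print(contains_prefix(trie, ["A", "B"]))   # True: shared prefix of two sessions
print(contains_prefix(trie, ["C", "A"]))   # False
```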
3) Pattern Discovery
The most common data mining technique used on clickstream data is uncovering traversal patterns. A traversal pattern is a set of pages visited by a user in a session. Similar traversal patterns may be clustered together to provide a clustering of the users. Traversal patterns may differ in how the patterns are defined; the differences between the types of patterns can be described by the following features:
(a) Duplicate page references (backward traversals and refreshes/reloads) may or may not be allowed.
(b) A pattern may be composed only of contiguous page references, or alternatively of any pages referenced in the same session.
(c) The pattern of references may or may not be required to be maximal in the session. A frequent pattern is maximal if it is not a subpattern of another frequent pattern.
Patterns found using different combinations of these three properties may be used to discover different features.
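For the simplest case, contiguous traversal patterns can be counted directly from sessions, as in the sketch below; the sessions, pattern length and support threshold are illustrative assumptions.

```python
# Minimal sketch: counting frequent contiguous traversal patterns across sessions.
# Session data, pattern length and support threshold are illustrative assumptions.
from collections import Counter

sessions = [
    ["A", "B", "C", "D"],
    ["A", "B", "C"],
    ["B", "C", "D"],
]

def frequent_contiguous(sessions, length=2, min_support=2):
    """Count contiguous page subsequences of a given length, once per session."""
    counts = Counter()
    for session in sessions:
        seen = set()                       # count each pattern at most once per session
        for i in range(len(session) - length + 1):
            seen.add(tuple(session[i:i + length]))
        counts.update(seen)
    return {p: c for p, c in counts.items() if c >= min_support}

print(frequent_contiguous(sessions))   # e.g. {('A','B'): 2, ('B','C'): 3, ('C','D'): 2}
```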
4) Pattern Analysis
Once a pattern has been identified, it must be analysed to determine how that information can be used. Some of the generated patterns may be determined not to be of any interest and deleted. Patterns found need not have contiguous page references. A web mining query language, MINT, facilitates the statement of interesting properties. The idea of a sequence is expanded to the concept of what the authors call a g-sequence: a vector that consists not only of the pages visited (the events) but also of wild cards. Patterns found across two logs can be compared for similarity. Similarity is determined using the following rules:
(a) Two patterns are comparable if their g-sequences have at least the first n pages the same, where n is supplied by the user.
(b) In addition, only fragments of patterns that occur frequently are considered.
The goal of this work is to increase the number of customers. Non-customer patterns with no comparable customer patterns indicate that some changes to the link structure or web page designs may be in order.
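The g-sequence idea can be illustrated with a small matcher; the '*' wild-card convention and the start-anchored matching below are illustrative assumptions rather than actual MINT syntax.

```python
# Minimal sketch: matching a g-sequence (pages plus '*' wild cards) against a session.
# The '*' convention (matches any run of pages) and anchoring the match at the start
# of the session are illustrative assumptions, not actual MINT syntax.
def matches(g_sequence, session):
    """Return True if the session begins with the g-sequence in order,
    with '*' standing for any (possibly empty) run of pages."""
    def match_from(gi, si):
        if gi == len(g_sequence):
            return True
        if g_sequence[gi] == "*":
            # Try consuming zero or more pages with the wild card.
            return any(match_from(gi + 1, sj) for sj in range(si, len(session) + 1))
        if si < len(session) and session[si] == g_sequence[gi]:
            return match_from(gi + 1, si + 1)
        return False
    return match_from(0, 0)

print(matches(["home", "*", "checkout"], ["home", "catalog", "item", "checkout"]))  # True
print(matches(["home", "*", "checkout"], ["catalog", "checkout"]))                  # False
```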