Framework For Web Personalization Using Web Mining
Abstract
The World Wide Web is a vast source of information, and the number of users accessing Web sites grows every day. To serve these users efficiently and effectively, Web mining coupled with recommendation techniques can deliver personalized content. Web mining is the area of data mining that deals with extracting interesting knowledge from the Web. This paper presents a comprehensive overview of the personalization process based on Web usage mining. It covers the Web usage mining activities this process requires, including the pre-processing and integration of data from multiple sources, and the common pattern discovery techniques that are applied to the integrated usage data.
Index Terms: Web Usage Mining, Data Mining, Personalization, Pattern Discovery, Web Mining, Web Personalization
1. INTRODUCTION AND BACKGROUND
The tremendous growth in the number and complexity of information resources and services on the Web has made Web personalization an indispensable tool for both Web-based organizations and end users. The ability of a site to engage visitors at a deeper level, and to successfully guide them to useful and pertinent information, is now viewed as one of the key factors in the site's ultimate success. Web personalization can be described as any action that customizes the Web experience of a user to that user's tastes or preferences. Principal elements of Web personalization include modelling of Web objects (such as pages or products) and subjects (such as users or customers), categorization of objects and subjects, matching between and across objects and/or subjects, and determination of the set of actions to be recommended for personalization. There are several well-known drawbacks to content-based or rule-based filtering techniques for personalization. The input is often a subjective description of the users by the users themselves, and is thus prone to bias. The profiles are often static, obtained through user registration, so system performance degrades over time as the profiles age. Furthermore, using content similarity alone may miss important pragmatic relationships among Web objects based on how they are accessed by users. Collaborative filtering [Herlocker et al., 1999; Konstan et al., 1997; Shardanand and Maes, 1995] has tried to address some of these issues and has, in fact, become the predominant commercial approach in most successful e-commerce systems. These techniques generally involve matching the ratings of a current user for objects (e.g., movies or products) with those of similar users (nearest neighbours) in order to produce recommendations for objects not yet rated by the user.
The primary technique used to accomplish this task is the k-Nearest-Neighbor (kNN) classification approach, which compares a target user's record with the historical records of other users in order to find the top k users who have similar tastes or interests.
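The neighbour-matching step described above can be illustrated with a minimal sketch. It assumes that each user's record is a dictionary mapping item names to numeric ratings, and it uses cosine similarity with a similarity-weighted average for prediction; both choices, along with the function names, are illustrative assumptions rather than the method of any particular system cited here.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two sparse rating dicts (missing ratings count as 0)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbours(target, others, k):
    """Return up to k (similarity, user) pairs, most similar first."""
    scored = [(cosine_similarity(target, r), u) for u, r in others.items()]
    scored = [(s, u) for s, u in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:k]


def recommend(target, others, k=2, n=3):
    """Recommend up to n items the target has not rated, using k neighbours."""
    totals, weights = {}, {}
    for sim, user in top_k_neighbours(target, others, k):
        for item, rating in others[user].items():
            if item in target:
                continue  # only predict for items not yet rated by the target
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    # Predicted rating = similarity-weighted average of neighbour ratings.
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted, key=lambda i: (-predicted[i], i))[:n]
```

Because every target record is compared against every historical record, the cost of this sketch grows linearly with the number of users.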
use of special algorithms and heuristics not commonly employed in other domains. This process is critical to the successful extraction of useful patterns from the data. In this section we discuss some of the issues and concepts related to data modelling and preparation in Web usage mining. While this discussion is in the general context of Web usage analysis, we focus especially on the factors that have been shown to greatly affect the quality and usability of the discovered usage patterns for their application in Web personalization.
4. USAGE DATA
The log data collected automatically by the Web and application servers represents the fine-grained navigational behaviour of visitors. Depending on the goals of the analysis, this data needs to be transformed and aggregated at different levels of abstraction. In Web usage mining, the most basic level of data abstraction is that of a page view. Physically, a page view is an aggregate representation of a collection of Web objects contributing to the display on a user's browser resulting from a single user action (such as a click-through). These Web objects may include multiple pages (such as in a frame-based site), images, embedded components, or scripts and database queries that populate portions of the displayed page (in dynamically generated sites). Conceptually, each page view represents a specific type of user activity on the site, e.g., reading a news article, browsing the results of a search query, viewing a product page, adding a product to the shopping cart, and so on. At the user level, on the other hand, the most basic level of behavioural abstraction is that of a server session (or simply a session). A session (also commonly referred to as a visit) is a sequence of page views by a single user during a single visit. The notion of a session can be further abstracted by selecting a subset of page views in the session that are significant or relevant for the analysis tasks at hand. We shall refer to such a semantically meaningful subset of page views as a transaction (also referred to as an episode according to the W3C Web Characterization Activity [W3C]). It is important to note that a transaction does not refer simply to product purchases; it can include a variety of types of user actions as captured by different page views in a session.
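The three abstraction levels above (page view, session, transaction) can be sketched in code. The example assumes the log has already been reduced to (user id, timestamp in seconds, page view id) records, and it splits a user's page views into sessions with a 30-minute inactivity timeout. The timeout value, the record layout, and the function names are assumptions made for illustration; the text above does not prescribe a particular sessionization rule.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed heuristic value


def build_sessions(pageviews, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, pageview_id) records into per-user sessions.

    Returns a list of (user_id, [pageview_id, ...]) tuples, one per session.
    """
    by_user = defaultdict(list)
    for user, ts, page in pageviews:
        by_user[user].append((ts, page))

    sessions = []
    for user in sorted(by_user):
        current, last_ts = [], None
        for ts, page in sorted(by_user[user]):
            # A long gap between consecutive page views starts a new session.
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(page)
            last_ts = ts
        sessions.append((user, current))
    return sessions


def to_transaction(session_pages, relevant):
    """Keep only the page views considered relevant for the analysis."""
    return [page for page in session_pages if page in relevant]
```

Here a transaction is simply the session filtered by an analyst-supplied set of relevant page views, for example the product and shopping-cart pages of an e-commerce site.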
site-specific, and involves tasks such as removing extraneous references to embedded objects, graphics, or sound files, and removing references due to spider navigations. The latter task can be performed by maintaining a list of known spiders, and through heuristic identification of spiders and Web robots [Tan and Kumar, 2002]. It may also be necessary to merge log files from several Web and application servers. This may require global synchronization across these servers. In the absence of shared embedded session ids, heuristic methods based on the referrer field in server logs, along with various sessionization and user identification methods (see below), can be used to perform the merging. Client- or proxy-side caching can often result in missing access references to those pages or objects that have been cached. Missing references due to caching can be heuristically inferred through path completion, which relies on knowledge of the site structure and referrer information from server logs [Cooley et al., 1999]. In the case of dynamically generated pages, form-based applications using the HTTP POST method result in all or part of the user input parameters not being appended to the URL accessed by the user (though, in the latter case, it is possible to re-capture the user input through packet sniffers on the server side).
Figure 3: Distribution of pageview durations: raw-time scale (left), log-time scale (right).
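As an illustration of the cleaning tasks described above, the following sketch drops log entries that refer to embedded objects or that come from known spiders. It assumes each parsed log entry is a dictionary with "url" and "user_agent" keys. The extension set and the spider names are illustrative placeholders, not a complete or authoritative list, and heuristic robot detection is not covered.

```python
import posixpath
from urllib.parse import urlparse

# Illustrative file extensions for embedded objects, graphics, and sound files.
EMBEDDED_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico", ".wav", ".mp3"}

# Illustrative list of known spiders; a real list would be maintained separately.
KNOWN_SPIDERS = ("googlebot", "bingbot", "slurp")


def is_embedded_object(url):
    """True if the URL path ends in an extension treated as an embedded object."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in EMBEDDED_EXTENSIONS


def is_spider(user_agent):
    """True if the user agent string contains the name of a known spider."""
    agent = user_agent.lower()
    return any(name in agent for name in KNOWN_SPIDERS)


def clean_log(entries):
    """Return only the entries that look like genuine page requests by users."""
    return [
        entry for entry in entries
        if not is_embedded_object(entry["url"])
        and not is_spider(entry.get("user_agent", ""))
    ]
```

Which extensions count as extraneous depends on the site: on a photo-sharing site, for instance, image requests may themselves be the page views of interest.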
For example, an algorithm called PageGather has been used to discover significant groups of pages based on user access patterns [Perkowitz and Etzioni, 1998]. This algorithm uses, as its basis, clustering of pages based on the Clique (complete link) clustering technique. The resulting clusters are used to automatically synthesize alternative static index pages for a site, each reflecting possible interests of one user segment. Clustering of user rating records has also been used as a prior step to collaborative filtering in order to remedy the scalability problems of the k-nearest-neighbor algorithm [O'Conner and Herlocker, 1999]. Both transaction clustering and pageview clustering have been used as an integrated part of a Web personalization framework based on Web usage mining.
that use these data structures to directly produce real-time recommendations (without the a priori generation of rules).
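The idea of clique-based page clustering can be sketched as follows. This is a simplified illustration in the spirit of the approach described above, not the published PageGather algorithm: it assumes sessions are lists of page ids, links two pages when both conditional co-occurrence frequencies reach an assumed threshold, and then enumerates maximal cliques with a basic Bron-Kerbosch search. All names and parameter values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def similarity_graph(sessions, threshold):
    """Link pages whose co-occurrence frequency is at least `threshold`.

    Similarity of pages a and b is min(P(a|b), P(b|a)), estimated from sessions.
    """
    page_count, pair_count = Counter(), Counter()
    for session in sessions:
        pages = sorted(set(session))
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))

    graph = {page: set() for page in page_count}
    for (a, b), together in pair_count.items():
        similarity = min(together / page_count[a], together / page_count[b])
        if similarity >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


def page_clusters(sessions, threshold=0.5, min_size=2):
    """Return groups of pages that are all pairwise similar."""
    graph = similarity_graph(sessions, threshold)
    return [c for c in maximal_cliques(graph) if len(c) >= min_size]
```

Each returned cluster is a candidate set of links for one synthesized index page. Clique enumeration can be expensive on dense graphs, so the threshold matters in practice.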
REFERENCES
[1] R. Agarwal, C. Aggarwal, and V. Prasad. A Tree Projection Algorithm for Generation of Frequent Itemsets. In Proceedings of the High Performance Data Mining Workshop, Puerto Rico, April 1999.
[2] C. C. Aggarwal, J. L. Wolf, and P. S. Yu. A New Method for Similarity Indexing for Market Data. In Proceedings of the 1999 ACM SIGMOD Conference, Philadelphia, PA, June 1999.
[3] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), Santiago, Chile, Sept 1994.
[4] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the International Conference on Data Engineering (ICDE'95), Taipei, Taiwan, March 1995.
[5] A. Banerjee and J. Ghosh. Clickstream Clustering Using Weighted Longest Common Subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, Chicago, Illinois, April 2001.
[6] B. Berendt, A. Hotho, and G. Stumme. Towards Semantic Web Mining. In Proceedings of the First International Semantic Web Conference (ISWC'02), Sardinia, Italy, June 2002.
[7] B. Berendt, B. Mobasher, M. Nakagawa, and M. Spiliopoulou. The Impact of Site Structure and User Environment on Session Reconstruction in Web Usage Analysis. In Proceedings of the 4th WebKDD 2002 Workshop, at the ACM-SIGKDD Conference on Knowledge Discovery in Databases (KDD2002), Edmonton, Alberta, Canada, July 2002.
[8] B. Berendt and M. Spiliopoulou. Analysing Navigation Behaviour in Web Sites Integrating Multiple Information Systems. VLDB Journal, Special Issue on Databases and the Web, 9(1):56-75, 2000.
[9] J. Borges and M. Levene. Data Mining of User Navigation Patterns. In B. Masand and M. Spiliopoulou, editors, Web Usage Analysis and User Profiling: Proceedings of the WEBKDD'99 Workshop, LNAI 1836, pages 92-111. Springer-Verlag, 1999.
[10] A. Buchner and M. D. Mulvenna. Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining. SIGMOD Record, 4(27):54-61, 1999.
[11] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining Content-based and Collaborative Filters in an Online Newspaper. In Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and Evaluation, Berkeley, California, August 1999.
[12] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. Ph.D. dissertation, Department of Computer Science, University of Minnesota, Minneapolis, Minnesota, 2000.
[13] R. Cooley, B. Mobasher, and J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, 1(1):5-32, 1999.
[14] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, 118(1-2):69-113, 2000.
[15] H. Dai and B. Mobasher. Using Ontologies to Discover Domain-Level Web Usage Profiles. In Proceedings of the 2nd Semantic Web Mining Workshop at ECML/PKDD 2002, Helsinki, Finland, August 2002.
[16] M. Deshpande and G. Karypis. Selective Markov Models for Predicting Web-Page Accesses. In Proceedings of the First International SIAM Conference on Data Mining, Chicago, April 2001.
[17] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs, NJ, 1992.
BIOGRAPHIES:
Mrs. Monika Soni is pursuing an M. Tech. in Computer Science. She has published many national and international research papers and has written 3 books for engineering and engineering diploma courses.
Mr. Rahul Sharma is pursuing an M. Tech. in Computer Science & Engineering. He has published many national and international research papers and has written 5 books for engineering and engineering diploma courses.
Mr. Vishal Shrivastava is working as an Assistant Professor at Arya College & IT. He has published many national and international research papers and has in-depth knowledge of his research areas.