A Survey on Web Mining
ABSTRACT
The aim of this paper is to study and analyze the different Web mining tools and techniques used to mine
information from the World Wide Web. This survey provides detailed information on the different web mining
tools and techniques for Web Content Mining, Web Structure Mining and Web Usage Mining, as well as a comparative
study of their advantages and disadvantages.
KEYWORDS: Search Agent, Personalized Web Agents, Web Usage Mining, Web Content Mining, Web Structure
Mining
I. INTRODUCTION
The World Wide Web (WWW) is a vast resource of many types of information in varied formats. This information is
very useful for analyzing business progress, which is essential nowadays for staying competitive.
Researchers are beginning to investigate human behavior in this distributed Web data warehouse and are trying to build
models for understanding human behavior in virtual environments. Data mining, often called Web mining when applied to
the Internet, is a process of extracting hidden predictive information and discovering meaningful patterns, profiles, and
trends from large databases. Web mining is an iterative process of discovering knowledge and is proving to be a valuable
strategy for understanding consumer and business activity on the Web. There are three subcategories of Web mining:
Web Content Mining, Web Structure Mining and Web Usage Mining.
characteristics and user profiles. Information agents use a number of techniques to filter data according to predefined
information. Adapted web agents learn user preferences and discover documents related to those user profiles. The
database approach consists of a well-formed database containing schemas and attributes with defined domains.
Web content mining has the following approaches to mine data:
Structured mining
Multimedia mining
Information Extraction
Topic Tracking
Summarization
Categorization
Clustering
Information Visualization
Web Crawler
Wrapper Generation
natural representation of complex real-world objects without sending the application writer into contortions. HTML is a
special case of such intra-document structure. The techniques used for semi-structured data mining are
SKICAT
Multimedia Miner
Data collection
Preprocessing
Data Collection
Data collection gathers the data from which hidden information and usage pattern trends are discovered, which could
aid Web managers in improving the management, performance and control of Web servers.
Data Preprocessing
The selection of useful data is an important task in the data pre-processing stage. Data were selected for each
data type to generate the cluster models for finding web user access and server usage patterns. The removal of irrelevant
and noisy data is the initial step in this task. The most recently accessed data were indexed with a higher time index
value, while the least recently accessed data were placed at the bottom with the lowest value. This is a critical step for
obtaining a more precise analysis result because of the time-dependent nature of Web usage data.
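The recency indexing described above can be sketched as follows. This is a minimal illustration, not the procedure of any surveyed tool: the (url, timestamp) record layout, the noise rule and the function names `is_noisy` and `assign_time_index` are assumptions introduced here.

```python
from datetime import datetime

def is_noisy(record):
    """Assumed noise rule: a record with no URL or no timestamp is unusable."""
    url, timestamp = record
    return not url or timestamp is None

def assign_time_index(records):
    """Drop noisy records, then give the most recent access the highest index."""
    clean = [rec for rec in records if not is_noisy(rec)]
    ordered = sorted(clean, key=lambda rec: rec[1])  # oldest access first
    indexed = [(i, url) for i, (url, _) in enumerate(ordered, start=1)]
    return list(reversed(indexed))  # most recent access on top

sample = [
    ("/home", datetime(2013, 6, 1, 10, 0)),
    ("", datetime(2013, 6, 1, 10, 5)),          # noisy: empty URL
    ("/products", datetime(2013, 6, 1, 10, 30)),
    ("/contact", datetime(2013, 6, 1, 9, 0)),
]
ranked = assign_time_index(sample)
for index, url in ranked:
    print(index, url)
```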
Data Clustering
Clustering is widely used by researchers in different projects for finding usage patterns or user
profiles. Clustering algorithms have become the most common mining method for websites, and the cluster objects include
user groups (to describe user actions) and web pages.
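As an illustration of clustering users into groups, the sketch below applies a small k-means to per-user page visit counts. The text does not name a specific algorithm, so k-means is an assumption chosen for brevity, as are the sample data and the deterministic choice of the first k points as starting centroids.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def k_means(points, k, iterations=20):
    """Return a cluster label for each point (assumed start: first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels

# Hypothetical visit counts per user for the pages /news, /sports, /shop, /cart.
visits = [
    [5, 4, 0, 0],
    [0, 0, 6, 3],
    [4, 5, 0, 1],
    [0, 1, 5, 4],
]
labels = k_means(visits, k=2)
print(labels)
```

Users with the same label form one user group; the same routine could be applied to page vectors to cluster web pages instead.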
Pattern Discovery and Analysis
Through pattern discovery and pattern analysis, relevant and useful information can be predicted from
data analysis and graphs. Web usage data includes data from web server access logs, proxy server logs and browser logs,
user profiles, sessions or transactions, queries, registration data, cookies, bookmark data, mouse clicks and scrolls, or any
other data resulting from interaction. Analysis of the web access logs of web sites can help in understanding user behavior
and the web structure, thus improving the design of this massive collection of resources. There are two tendencies in Web
Usage Mining, driven by the applications of the discoveries: General Access Pattern Tracking and Customized Usage Tracking.
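One simple form of pattern discovery over sessions is counting which pages are visited together. The sketch below is a hedged illustration only: the session data, the support threshold and the function name `frequent_page_pairs` are assumptions, and real tools use richer association and sequence mining.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support):
    """Return page pairs that co-occur in at least min_support of the sessions."""
    counts = Counter()
    for session in sessions:
        # Count each unordered pair once per session, ignoring repeat visits.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    threshold = min_support * len(sessions)
    return {pair: n for pair, n in counts.items() if n >= threshold}

# Hypothetical sessions, each a list of visited pages.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
pairs = frequent_page_pairs(sessions, min_support=0.5)
print(pairs)
```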
Commonalities
All the tools automate business tasks and retrieve web data efficiently.
Differences
Screen-scraper needs prior knowledge of proxy servers and some knowledge of HTML and HTTP, whereas the
other tools do not require any such knowledge. It also needs an Internet connection to run.
Automation-Anywhere 7 allows recording of actions; this facility is not provided in the other tools.
Even with the setup file, Mozenda cannot be installed without an Internet connection; this is not the
case with the other tools.
5. Link Cardinality
The main task here is to predict the number of links between objects. Web structure mining has several uses,
such as:
Page categorization
To find out the similarity between them
Many authors have proposed web structure mining algorithms, such as the PageRank algorithm, the Weighted PageRank
algorithm, the Hyperlink-Induced Topic Search (HITS) algorithm and the Weighted Topic-Sensitive PageRank
algorithm. In the next section we will explain these algorithms in detail.
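To make the idea behind the PageRank algorithm concrete, the following is a minimal power-iteration sketch. It uses the common normalized form in which ranks sum to 1; the damping factor of 0.85, the handling of pages without outgoing links and the sample link graph are assumptions for illustration, not details taken from the surveyed papers.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores for a dict mapping page -> list of linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same base share from the random-jump term.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed rule: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this small graph page C collects links from both A and B, so it ends up with the highest score.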
C. Web Usage Mining[1]
There are different tasks to be carried out in Web Usage Mining, which are given below, and different tools are used
for that work.
i) Data Preprocessing
The first step of Web Usage Mining is the preprocessing of the data stored in web logs, as it is noisy in nature.
Preprocessing converts usage, content and structure information into data abstractions. It consists of four
phases: data cleaning, session reconstruction, content and structure information retrieval, and data abstraction.
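The data cleaning phase can be illustrated with a short sketch. It assumes that the web log uses the Common Log Format and that failed requests and requests for images, style sheets and scripts count as noise; both the filtering rules and the sample lines are assumptions made here.

```python
import re

# Pattern for Common Log Format lines (assumed input format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
# File types treated as noise because they are fetched automatically with a page.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")

def clean_log(lines):
    """Keep only successful page requests from raw log lines."""
    kept = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed line
        entry = match.groupdict()
        if not entry["status"].startswith("2"):
            continue  # failed or redirected request
        if entry["url"].lower().endswith(IGNORED_SUFFIXES):
            continue  # embedded resource, not a page view
        kept.append(entry)
    return kept

raw_lines = [
    '192.0.2.1 - - [10/Jun/2013:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.1 - - [10/Jun/2013:10:00:01 +0000] "GET /logo.png HTTP/1.1" 200 512',
    '192.0.2.2 - - [10/Jun/2013:10:02:00 +0000] "GET /missing.html HTTP/1.1" 404 0',
    'not a log line',
]
cleaned = clean_log(raw_lines)
print([entry["url"] for entry in cleaned])
```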
ii) Usage Preprocessing
This is considered the most difficult task of web usage mining because of the presence of incomplete and inconsistent
data in the server log. Only the IP address, agent and server-side click stream are available to identify users and server
sessions, which leads to problems such as single IP address/multiple server sessions, multiple IP addresses/single server
session, multiple IP addresses/single user and multiple agents/single user. Usage preprocessing also faces the problem of
inferring cached page references.
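A common heuristic for this step, sketched below, treats each (IP address, agent) pair as one user and starts a new session after a period of inactivity. The 30-minute timeout, the request tuple layout and the function name `build_sessions` are assumptions; the heuristic does not resolve the ambiguous cases listed above, it only approximates users and sessions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (ip, agent, timestamp, url) requests into per-user sessions."""
    by_user = defaultdict(list)
    for ip, agent, timestamp, url in requests:
        # Assumed user identity: the combination of IP address and agent.
        by_user[(ip, agent)].append((timestamp, url))
    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # chronological order
        current = [hits[0][1]]
        for (prev_time, _), (this_time, url) in zip(hits, hits[1:]):
            if this_time - prev_time > timeout:
                # Long gap: close the running session and start a new one.
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions

requests = [
    ("192.0.2.1", "Firefox", datetime(2013, 6, 10, 10, 0), "/home"),
    ("192.0.2.1", "Firefox", datetime(2013, 6, 10, 10, 10), "/products"),
    ("192.0.2.1", "Chrome", datetime(2013, 6, 10, 10, 5), "/home"),
    ("192.0.2.1", "Firefox", datetime(2013, 6, 10, 11, 30), "/home"),
]
sessions = build_sessions(requests)
for user, pages in sessions:
    print(user, pages)
```

Here one IP address yields three sessions: two for the Firefox agent, split by the long gap, and one for the Chrome agent.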
iii) Content Preprocessing
Content preprocessing is concerned with transforming unstructured and semi-structured documents into forms
that are suitable for web usage mining.
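One possible target form is a term-frequency vector per page. The sketch below strips the HTML markup with the standard library parser and counts the remaining words; the stop-word list, the tokenization rule and the sample page are assumptions for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Small assumed stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def term_vector(html):
    """Turn a semi-structured HTML document into a bag-of-words vector."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return Counter(token for token in tokens if token not in STOP_WORDS)

page = (
    "<html><body><h1>Web Mining</h1>"
    "<p>Mining the Web is the discovery of patterns.</p></body></html>"
)
vector = term_vector(page)
print(vector.most_common())
```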
iv) Pattern Discovery
It focuses on uncovering patterns from the abstractions produced by the preprocessing phase, applying
various methods and techniques developed in several fields such as data mining, machine learning, statistics
and pattern recognition. Discovering the desired patterns and extracting understandable knowledge from them is a challenging
Tool: Features

Data Pre-Processing Tools
Performs cleaning, extraction and transformation of data before pattern discovery.
Platform independent data transformation tool. Based on Sumatra script and supports Rapid Application Development.
Performs data pre-processing by analysing the click stream and the data collected.
Mines web server logs and reconstructs the user navigational path for session identification.

Pattern Discovery Tools
SewebarCms: Provides interaction between data analyst and domain expert to perform discovery of patterns. Helps in selection of rules among various rules in association rule mining [34].
i-Miner: Discovers data clusters by using a fuzzy clustering algorithm and a fuzzy inference system for pattern discovery and analysis [33].
Argunaut: Develops the patterns of useful data by using a sequence of various rules.
MiDas (Mining Internet Data for Associative Sequences): Discovers marketing based navigational patterns from log files. It applies more features to the traditional sequential method.

Pattern Analysis Tools
Webalizer: GNU GPL license based and produces web pages after analyzing patterns.
Naviz: Visualization tool that combines a 2-D graph of visitor access and grouping of related pages. It describes the pattern of user navigation on the web.
WebViz: Analyzes the patterns and provides them in the form of graphical patterns.
Web Miner: Mines the useful patterns and provides the user specific information.
Stratdyn: Enhances WUM and provides visualization of patterns.
III. CONCLUSIONS
This paper described several tools and techniques for Web Content Mining, Web Structure Mining and Web Usage
Mining. We analyzed their strengths and limitations and provided a comparison among them. This paper may therefore
be used as a reference by researchers when deciding which tools and techniques are suitable.
Impact Factor(JCC): 1.9586- This article can be downloaded from www.impactjournals.us
IV. REFERENCES
1. Kamika Chaudhary, Santosh Kumar Gupta, Web Usage Mining Tools & Techniques: A Survey, International
Journal of Scientific & Engineering Research, Volume 4, Issue 6, June-2013, 1762, ISSN 2229-5518.
2. V. Bharanipriya, V. Kamakshi Prasad, Web Content Mining Tools: A Comparative Study, International
Journal of Information Technology and Knowledge Management, January-June 2011, Volume 4, No. 1, pp. 211-215.
3. Preeti Chopra, Md. Ataullah, A Survey on Improving the Efficiency of Different Web Structure Mining
Algorithms, International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958,
Volume-2, Issue-3, February 2013.
4. Zhang, Q., Segall, R.S., Web Mining: A Survey of Current Research, Techniques, and Software, International
Journal of Information Technology & Decision Making, Vol. 7, No. 4, pp. 683-720, World Scientific Publishing
Company (2008).
5. Darshna Navadiya, Roshni Patel, Web Content Mining Techniques-A Comprehensive Survey, International
Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 10, December-2012, ISSN: 2278-0181.