Web Mining Frameworks
Abstract: With the availability of a huge amount of data on the World Wide Web, the web has become a fertile place for data mining research, and the volume of web data will no doubt continue to grow in the coming years. To analyze this data so as to extract information reflecting user behavior, interaction, and demands, and to optimize search results, the concept of web mining is used. Web mining is the application of data mining approaches to discover and analyze patterns extracted from the web. Its main objective is to develop intelligent tools that make it easy for the user to extract, filter, find, and evaluate useful information. Nowadays, data available on the web has become an essential part of an organization. Data is produced in huge amounts as a result of the interaction of many users with the web. Knowledge can be extracted from this output and later applied in various applications. Analysis of web site content and of the patterns obtained from user navigation is valuable for the business and research communities. The aim of research in web mining is to develop and apply new techniques to mine and extract valuable knowledge or information from web pages. Because web data is diverse and unstructured, automatically discovering targeted or unexpected knowledge is an important and challenging task. The focus of this paper is therefore to provide a more evaluative update of web mining research and the techniques available. This paper reviews the concept of web mining, the types of web mining, and the different techniques used in each type. It also discusses the current trends and challenges in this research area.

Keywords: Clustering, Association rules, Pattern discovery, Hyperlink analysis, Pattern analysis, Web access logs

I. Introduction

Today, the World Wide Web (WWW) has become a large repository of information that is used every day to store, disseminate, retrieve, and manipulate data. The data available on the web includes text, tables, multimedia, audio, video, hyperlinks, metadata, structured records, web logs, etc. Because web data is dynamic, high-dimensional, diverse, and huge, web researchers have faced many problems such as multimedia alignment, temporal issues, and scalability. The information on the web increases in such a way that it becomes as endless as an ocean. This intense growth of information can take the form of structured or semi-structured data. To manage this rapidly evolving content and high data dimensionality, it has become necessary to develop new approaches and methods to organize the data and extract relevant information according to our requirements and applications.

Applying data mining techniques to web data is referred to as web data mining or web mining. Web mining includes the analysis and extraction of relevant information from the data available on the World Wide Web. Anyone can easily be deluged with data because of the unstructured, heterogeneous, and partially structured data available on the web.

Mining the web has therefore become an important and challenging task for data mining and data management professionals [1]. Web mining is further divided into web content mining, web structure mining, and web usage mining. Each category has its own tools and algorithms and serves different purposes. As the data on the web increases, this technology plays a major role in extracting knowledge from the web.

The objective of this paper is to review the concept of web mining, the types of web mining, and the different techniques used in each type over the last decade. This paper also discusses the current trends and challenges in this research area.

The rest of this paper is divided into five sections. Section 2 describes Web Mining. Section 3 describes Web Content Mining. Section 4 discusses Web Structure Mining. Section 5 describes Web Usage Mining. The conclusion is presented in Section 6.

II. Web Mining

Data mining can be viewed as a result of the natural evolution of information technology. There is evolution in the database system by introducing data warehouse
Technically, the focus of web content mining is on the content of web pages, such as what type of content is shown and how the information is conveyed, while web structure mining tries to discover the web graph of each website, that is, the linking structure of its hyperlinks. On the basis of the structure of the hyperlinks, web structure mining classifies the web pages and produces information such as the relationships among different websites and how similar websites are to each other. Web usage mining deals with user interaction with the web and extracts valuable information from it.

III. Web Content Mining

Web content mining is the process of mining and extracting useful information from web documents and then indexing it so that it can be retrieved quickly and users can find information easily. The content of web documents may comprise text, images, audio, video, structured records such as lists and tables, and other multimedia data. The data in web documents can be semi-structured or unstructured. The group of facts that a web page is designed to convey is called its content.

3.1 Unstructured Data Mining

Information Extraction. Pattern matching is used to extract information from unstructured web data. This technique is very useful for huge volumes of text. It follows a procedure in which the first step is to detect keywords and phrases and the second step is to discover the connections between those phrases and keywords in the text [3]. In this way, useful data is extracted and mined for information, and different approaches are then used to find missing information and complete it. In information extraction, the unstructured data is converted into some structured form [4].

Topic Tracking. As the name suggests, this technique is used to find documents that are related to the interests of the user. It studies the user profile and keeps a record of the documents that the user has accessed and visited. Yahoo has applied topic tracking: the user supplies a keyword, and the user is notified of anything related to that keyword. This technique is mainly used in two fields, medicine and education. In education, a user can easily find the latest courses or any information related to their work. In medicine, doctors can easily learn about the latest treatments and news in their respective fields [3] [5].

IITM Journal of Management and IT 94 Volume 12, Issue 1 • January-June 2021

Categorization. In this technique, the first step is to count the words in the document and consider their meaning, and then to find the suitable predefined head topic. A rank is then assigned to the document according to the head topic. Web pages with the most content on a given topic are ranked first. Thus, this technique discovers the head theme and places the web pages in predefined groups accordingly [3] [4].

Clustering. In clustering, there are no predefined topics. The topics are defined on the basis of data extracted from the content of web pages. Web documents are then grouped on the basis of these topics, so that similar documents are grouped together. In this way, important documents are not excluded from the search results, and users are helped to choose the topics in which they are interested [3].

3.2 Structured Data Mining

Web Crawler. Web crawlers are computer programs that traverse the HTML structure of the web. Anyone can use this technique to extract and gather the information available on the web. The most prominent example is search engines, which use web crawlers to gather information about the content available in web documents. Web crawlers are further divided into two types, internal and external web crawlers. An internal web crawler traverses the internal structure of a website, and an external web crawler traverses different websites [3].

Page Content Mining. The page content mining technique is used to find only structured data in web pages and then mine it to gather information. A rank is given to these pages, and the search engine ranks the web pages by comparing their page ranks [6].

Wrapper Generator. A wrapper generator generates information on the basis of its sources. As web pages are ranked by search engines according to their page rank, the web pages are searched on the basis of a query [6].

3.3 Semi-Structured Data Mining

Object Exchange Model. In this method, useful information is discovered from semi-structured data, gathered into groups, and stored in the Object Exchange Model. This method is very useful for accurately understanding the structure of the information available on the web. A particularly helpful feature of this model is that there is no need to describe the structure of an object in advance; the model itself describes the object [6, 7].

Top Down Extraction. In this technique, compound objects are discovered from abundant web sources and then broken down until atomic objects are extracted [3].

Web Data Extraction Language. In this technique, the content of web pages is converted into a structured form, the relevant data is stored in the form of a table, and this data is then delivered to the users [6].

IV. Web Structure Mining

Web structure mining describes the structure of a particular website, that is, how its web pages are connected with each other via hyperlinks. Fig. 5 shows the web graph structure. Web structure is used to produce a web pattern graph [8]. The web graph mainly consists of web pages and web documents as nodes, with hyperlinks acting as edges that connect two related web pages.

Web pages contain HTML tags, due to which they can be organized in a tree structure based on the Document Object Model (DOM), and this helps in the research of link analysis. Link mining has important link-based tasks such as classification, cluster analysis, cardinality, and sturdiness of link. The study of the hyperlink structure is also called hyperlink analysis [9]. Hyperlinks connect web pages and lead to a location either in the same web page or in a different web page. There are two categories of hyperlink, inter-document and intra-document. An intra-document hyperlink connects different parts of the same page, whereas an inter-document hyperlink connects two different pages.

Fig. 6. Web Graph Structure

4.1 Hyperlink Analysis

With hyperlinks, additional information about a website can be extracted to optimize the search results. Hyperlink analysis is a technique used to evaluate the relationships (connections) between nodes (web pages). Many algorithms are used for hyperlink analysis. The two most important are PageRank and Hyperlink-Induced Topic Search (HITS).
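
The web graph described in this section, with web pages as nodes and hyperlinks as edges, can be illustrated with a minimal Python sketch. It is not taken from the reviewed literature: the page URLs, the PAGES data, and the helper names classify_link, build_web_graph, and in_degrees are all hypothetical and chosen only for illustration. The sketch assumes that a link counts as intra-document when it resolves to the same page once the fragment is removed, and as inter-document otherwise.

```python
from urllib.parse import urldefrag, urljoin

# Hypothetical pages and the href values found on each one (illustrative data only).
PAGES = {
    "http://example.org/index.html": ["#top", "about.html", "news.html#latest"],
    "http://example.org/about.html": ["index.html", "#team"],
    "http://example.org/news.html": ["index.html", "about.html"],
}


def classify_link(page_url, href):
    """Return ("intra" or "inter", target page URL with the fragment removed)."""
    target, _fragment = urldefrag(urljoin(page_url, href))
    kind = "intra" if target == page_url else "inter"
    return kind, target


def build_web_graph(pages):
    """Build a directed graph: pages are nodes, inter-document links are edges."""
    graph = {page_url: set() for page_url in pages}
    intra_counts = {}
    for page_url, hrefs in pages.items():
        for href in hrefs:
            kind, target = classify_link(page_url, href)
            if kind == "intra":
                # Links within the same page do not add an edge to the graph.
                intra_counts[page_url] = intra_counts.get(page_url, 0) + 1
            else:
                graph[page_url].add(target)
    return graph, intra_counts


def in_degrees(graph):
    """Count how many distinct pages link to each node."""
    counts = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


if __name__ == "__main__":
    graph, intra_counts = build_web_graph(PAGES)
    for page, targets in graph.items():
        print(page, "->", sorted(targets))
    print("intra-document links:", intra_counts)
    print("in-degrees:", in_degrees(graph))
```

An adjacency structure of this kind is the usual starting point for hyperlink analysis; the in-degree count is only a simple link-based measure and should not be read as an implementation of PageRank or HITS.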