Web Mining: A Framework

Surbhi Sharma*1, Sudhir Kumar Sharma*2


1,2 Institute of Information Technology and Management, New Delhi-58, India
[email protected] and [email protected]

IITM Journal of Management and IT, Volume 12, Issue 1 • January-June 2021, pp. 93-98

Abstract: The huge amount of data available on the World Wide Web has made it a fertile area for data mining research, and the volume of web data will no doubt continue to grow in the coming years. Web mining is used to analyze this data so as to extract information that reflects user behavior, interactions and demands, and to optimize search results. Web mining is the application of data mining techniques to discover and analyze patterns extracted from the web. Its main objective is to develop intelligent tools that make it easy for users to extract, filter, find and evaluate useful information. Nowadays, data available on the web has become an essential part of organizations. Data is produced in huge amounts as a result of the interaction between users and the web, and it can be mined to generate knowledge that is later applied in various applications. Analysis of website content and of the patterns obtained from user navigation is valuable to the business and research communities. The aim of research in web mining is to develop and apply new techniques to mine and extract valuable knowledge or information from web pages. Because web data is diverse and unstructured, automatically discovering targeted or unexpected knowledge is an important and challenging task. The focus of this paper is to provide a more evaluative update on web mining research and the techniques available. This paper reviews the concept of web mining, the types of web mining and the different techniques used in each type. It also discusses the current trends and challenges in this research area.

Keywords: Clustering, Association rules, Pattern discovery, Hyperlink analysis, Pattern analysis, Web access logs

I. Introduction

Today, the World Wide Web (WWW) has become a large repository of information that is used every day to store, disseminate, retrieve and manipulate data. The data available on the web includes text, tables, multimedia, audio, video, hyperlinks, metadata, structured records, web logs, etc. Because web data is dynamic, high-dimensional, diverse and huge, web researchers face many problems such as multimedia alignment, temporal issues and scalability. The information on the web is growing so fast that it seems as endless as an ocean. This information may be structured or semi-structured. To manage this rapidly evolving content and high data dimensionality, it has become necessary to develop new approaches and methods to organize the data and extract relevant information according to our requirements and applications.

The application of data mining techniques to web data is referred to as web data mining or web mining. Web mining includes the analysis and extraction of relevant information from the data available on the World Wide Web. Anyone can easily be deluged with data because of the unstructured, heterogeneous and partially structured data available on the web.

So, mining the web has become an important and challenging task for data mining and data management professionals [1]. Web mining is further divided into web content mining, web structured mining and web usage mining. Each category has its own tools and algorithms and serves different purposes. As the data on the web increases, this technology plays a major role in extracting knowledge from the web.

The objective of this paper is to review the concept of web mining, the types of web mining and the different techniques used in each type over the last decade. This paper also discusses the current trends and challenges in this research area.

This paper is divided into five further sections. Section 2 describes Web Mining. Section 3 describes Web Content Mining. Section 4 discusses Web Structured Mining. Section 5 describes Web Usage Mining. The conclusion is presented in the last section, Section 6.

II. Web Mining

Data mining can be viewed as a result of the natural evolution of information technology. Database systems have evolved through the introduction of the data warehouse,
which helps in collecting data, creating information, and managing and analyzing them. The data can be in any form, such as storage, search, retrieval, logs, transactions, etc. [2]. Web mining is one of the important applications of data mining techniques. It helps to discover and extract useful knowledge from the unstructured or semi-structured data available on the web.

Web mining can be classified into three categories. Fig. 1 shows the web mining categories. A brief introduction to each is as follows:

• Web Content Mining: It deals with the content of web pages, such as text, audio, sound and other multimedia, to extract valuable information.
• Web Structured Mining: It deals with the linking structure of the web pages inside a website to discover the web graph pattern and generate useful information.
• Web Usage Mining: It deals with web log records to find how users interact with the web.

Fig. 1. Web Mining Categories

Technically, web content mining focuses on the content of web pages, such as what type of content is shown and how the information is conveyed, while web structure mining tries to discover the web graph of each website, that is, the linking structure of its hyperlinks. On the basis of the hyperlink structure, web structure mining classifies web pages and produces information such as the relationships among different websites and how similar websites are to each other. Web usage mining deals with user interaction with the web and extracts valuable information from it.

III. Web Content Mining

Web content mining is the process of mining and extracting useful information from web documents and then indexing it so that it can be retrieved quickly and users can find information easily. The content of web documents may comprise text, images, audio, video, sound, structured records such as lists and tables, and other multimedia data. The data in web documents can be semi-structured or unstructured. The set of facts that a web page is designed to convey is called content data. Mining it reveals interesting patterns about user needs. One of the techniques for mining text is called text mining. The mining of text documents draws on machine learning, text mining and natural language processing. Text mining involves extracting information in order to study pattern recognition, word frequency distributions, annotation and text analysis, and to produce valuable knowledge. Text mining deals with natural language texts stored in either semi-structured or unstructured formats. The information extracted can be used to derive summaries of the document.

There are three approaches to web content mining. Fig. 4 shows the web content mining approaches. Brief introductions are given in the following subsections.

Fig. 4. Web Content Mining Approaches

3.1 Unstructured data mining

Information Extraction. Pattern matching is used to extract information from unstructured data. This technique is very useful for huge volumes of text. It follows a procedure in which the first step is to detect keywords and phrases and the second step is to discover the connections between those phrases and keywords in the text [3]. In this way, useful data is extracted and mined for information, and different approaches are then used to find the missing information needed to complete it. In information extraction, unstructured data is converted into a structured form [4].

Topic Tracking. As the name suggests, this technique is used to find documents that are related to the interests of the user. It studies the user profile and keeps a record of the documents accessed and visited by the user. Yahoo has applied topic tracking: a keyword is obtained from the user, and the user is notified of anything related to that keyword. This technique is used mainly in two fields, medicine and education. In education, a user can easily find the latest courses or any information related
to work. In the medical field, doctors can easily learn about the latest treatments and news in their respective fields [3] [5].

Categorization. In this technique, the first step is to count the words in the document and consider their meaning, and then to find the suitable predefined head topic. A rank is then assigned to the document according to the head topic. Web pages with the most content on a given topic are ranked first. Thus, this technique discovers the head theme and places the web pages in predefined groups accordingly [3] [4].

Clustering. In clustering, there are no predefined topics. The topics are defined on the basis of data extracted from the content of web pages. Web documents are then grouped on the basis of these topics, so that similar documents are grouped together. In this way, important documents are not excluded from the search results, and users are helped to choose the topics in which they are interested [3].

3.2 Structured data Mining

Web crawler. Web crawlers are computer programs that traverse the HTML structure of the web. Anyone can use this technique to extract the information available on the web. The best-known example is search engines, which use web crawlers to gather information about the content of web documents. Web crawlers are divided into two types, internal and external. An internal web crawler traverses the internal structure of a website, and an external web crawler traverses different websites [3].

Page Content Mining. The page content mining technique is used to find only the structured data in web pages and then mine it to gather information. A rank is given to these pages, and the search engine orders the web pages by comparing their page ranks [6].

Wrapper Generator. A wrapper generator produces information on the basis of its sources. As search engines rank web pages according to their page rank, the web pages are retrieved on the basis of the query [6].

3.3 Semi-Structured Data Mining

Object Exchange Model. In this method, useful information is discovered from semi-structured data, gathered into groups and stored in the Object Exchange Model. This method is very useful for accurately understanding the structure of the information available on the web. A helpful feature of this model is that the structure of an object need not be described in advance; the model is self-describing [6, 7].

Top down Extraction. In this technique, compound objects are discovered from rich web sources and then decomposed until atomic objects are extracted [3].

Web Data Extraction Language. In this technique, the content of web pages is converted into a structured form, and the relevant data is stored in the form of a table and then delivered to users [6].

IV. Web Structured Mining

Web structure mining describes the structure of a particular website, that is, how its web pages are connected with each other via hyperlinks. Fig. 6 shows the web graph structure. Web structure is used to produce the web pattern graph [8]. The web graph mainly consists of web pages and web documents as nodes and hyperlinks as edges that connect two related web pages.

Web pages contain HTML tags, so they can be organized in a tree structure based on the Document Object Model (DOM), which helps in link analysis research. Link mining includes important link-based tasks such as classification, cluster analysis, and determining the cardinality and strength of links. The study of the hyperlink structure is also called hyperlink analysis [9]. A hyperlink connects a web page to a location either in the same web page or in a different web page. There are two categories of hyperlink, inter-document and intra-document. An intra-document hyperlink connects different parts of the same page, whereas an inter-document hyperlink connects two different pages.

Fig. 6. Web Graph Structure

4.1 Hyperlink Analysis

Using hyperlinks, additional information about a website can be extracted to optimize search results. Hyperlink analysis is a technique used to evaluate the relationships (connections) between nodes (web pages). Many algorithms are used for hyperlink analysis. The two most important are PageRank and Hyperlink Induced Topic Search (HITS).
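
As a rough illustration of hyperlink analysis, the sketch below runs a basic PageRank power iteration over a small web graph in which pages are nodes and hyperlinks are edges. The four page names, the damping factor of 0.85 and the iteration count are assumptions chosen for illustration; they do not come from the paper.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Basic PageRank by power iteration.

    graph maps each page to the list of pages it links to.
    Every link target must also appear as a key of graph.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {}
        for page in pages:
            # Each linking page shares its rank equally among its out-links.
            incoming = sum(
                rank[src] / len(graph[src])
                for src in pages
                if page in graph[src]
            )
            new_rank[page] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Hypothetical four-page site: nodes are pages, edges are hyperlinks.
web_graph = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "contact"],
    "contact": ["home"],
}

ranks = pagerank(web_graph)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph every other page links back to "home", so it receives the highest score. The scores always sum to 1, which makes them comparable across pages.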

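HITS can be sketched in a similar spirit. The following minimal example, which assumes a hypothetical five-page graph and a fixed number of iterations, computes hub and authority scores by mutual reinforcement: a page is a good authority if good hubs link to it, and a good hub if it links to good authorities.

```python
import math


def hits(graph, iterations=50):
    """Hub and authority scores by repeated mutual reinforcement.

    graph maps each page to the list of pages it links to.
    Every link target must also appear as a key of graph.
    """
    pages = list(graph)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {
            page: sum(hub[src] for src in pages if page in graph[src])
            for page in pages
        }
        norm = math.sqrt(sum(score ** 2 for score in auth.values())) or 1.0
        auth = {page: score / norm for page, score in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {page: sum(auth[dst] for dst in graph[page]) for page in pages}
        norm = math.sqrt(sum(score ** 2 for score in hub.values())) or 1.0
        hub = {page: score / norm for page, score in hub.items()}
    return hub, auth


# Hypothetical graph of pages on one topic (names are illustrative only).
topic_graph = {
    "portal": ["news", "blog", "docs"],
    "directory": ["news", "docs"],
    "blog": ["news"],
    "news": [],
    "docs": [],
}

hub, auth = hits(topic_graph)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

Here "portal" links to the most authoritative pages and so emerges as the best hub, while "news" is linked from every hub and emerges as the best authority.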
V. Web Usage Mining

Web usage mining is the process of tracking the behavior of users online by extracting useful information from server logs. For this reason, it is also known as web log mining. User access data is collected from the web. Users who surf the web follow certain patterns, and analyzing these patterns makes it possible to find out how users interact with the web. Thus, this technique is used to predict the behavior of the user. Based on how the user interacts with websites, web usage mining addresses how to build personalized web pages or enhanced search engines. Web log data is stored at different locations, such as the web server, the web proxy server and the client browser, and a huge amount of data is stored at each location. The data can be semi-structured or unstructured and contains a lot of noise, errors, missing attributes, failed request messages, incomplete data and irrelevant data. Web log data is analyzed with the help of web usage mining techniques.

5.1 Web Servers Logs

There are four types of web server logs. Fig. 8 shows the web server logs.

Fig. 8. Web Server Logs

Access Log. This log stores information about user activities on the web and has many attributes, such as click events, visits, searches and accesses of the user.

Agent Log. The agent log stores details of the user's online environment, such as the type of browser the user uses, the browser version and the types of applications downloaded by the user.

Error Log. This log stores information about links or pages on which a user clicks but which are unable to open and show a failure, such as error 404 (not found).

Referrer Log. Information about the URLs of websites that link to web pages is stored in the referrer log. If a user clicks on a link on one website to go to another website, the URL of the first website is stored in this log [20].

5.2 Phases of web usage mining

The process of web usage mining is shown in Fig. 9. There are three phases of web usage mining. A brief introduction to each is as follows:

Data Pre-processing. This phase retrieves the raw data and processes it to make it relevant, organizing the data so that it can produce useful information. For better efficiency and scalability, the data goes through several steps, such as data cleaning, integration, transformation, reduction and discretization [14].

Discovery of pattern. In this phase, algorithms and rules are applied to extract the patterns formed. Classification, clustering, association rules and sequential analysis are some of the techniques used to discover patterns.

Analysis of pattern. After pattern discovery, the patterns are checked and analyzed to generate valuable knowledge. Various pattern analysis techniques are used to extract useful knowledge, identify the patterns that users actually follow and apply this information to commercial strategies [15].

Fig. 9. Process of Web Usage Mining

This category of web mining has several tools and approaches for analyzing the behavior of the user. It mainly uses data mining algorithms such as association rule mining, sequential rule mining and clustering.

5.3 Association Rule

Association rule mining is the most basic and widely used method. It is used to find associations and correlations among large sets of web pages that are frequently accessed together in a user's browser session. A rule shows how frequently an item-set occurs in a transaction. These rules are statements of the form M ==> N, where M and N are sets of items available in a set of transactions. The rule M ==> N states that transactions that contain the items in M may also include the items in N [17]. For example,

WebPage1, WebPage2, WebPage3 ==> WebPage4
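
To make the rule notation concrete, the following sketch computes the support and confidence of the rule WebPage1, WebPage2, WebPage3 ==> WebPage4 over a handful of browsing sessions. The session data is invented for illustration. Support and confidence are the measures commonly used to rank such rules; the paper itself does not define them, so treat their use here as an assumption.

```python
def support(itemset, sessions):
    """Fraction of sessions that contain every page in itemset."""
    itemset = set(itemset)
    hits = sum(1 for session in sessions if itemset <= session)
    return hits / len(sessions)


def confidence(antecedent, consequent, sessions):
    """Estimated probability of the consequent given the antecedent."""
    both = support(set(antecedent) | set(consequent), sessions)
    base = support(antecedent, sessions)
    return both / base if base else 0.0


# Hypothetical sessions: each set holds the pages visited in one session.
sessions = [
    {"WebPage1", "WebPage2", "WebPage3", "WebPage4"},
    {"WebPage1", "WebPage2", "WebPage3", "WebPage4"},
    {"WebPage1", "WebPage2", "WebPage3"},
    {"WebPage2", "WebPage4"},
    {"WebPage1", "WebPage5"},
]

M = {"WebPage1", "WebPage2", "WebPage3"}  # antecedent of the rule
N = {"WebPage4"}                          # consequent of the rule

rule_support = support(M | N, sessions)
rule_confidence = confidence(M, N, sessions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
# prints "support = 0.40, confidence = 0.67"
```

Two of the five sessions contain all four pages, and two of the three sessions that contain M also contain WebPage4. A rule-mining algorithm would typically keep only the rules whose support and confidence exceed chosen thresholds.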

Under the rule WebPage1, WebPage2, WebPage3 ==> WebPage4, a user who visits WebPage1, WebPage2 and WebPage3 is likely to visit WebPage4 as well. The Apriori algorithm can be used to extract such patterns by applying rules to the web pages that users frequently access.

5.4 Sequential analysis

Sequential analysis is used to find the navigation paths frequently followed by the user. This method uses data mining techniques to analyze sequential data and then extract patterns. It is used to extract interesting sequences and their subsequences and then group them together. Various criteria, such as number of occurrences, length and frequency, are used to measure how interesting a subsequence is. In sequential analysis, data is evaluated as it is collected, and the evaluation stops according to a predefined rule (known as the stopping rule) once the required results are observed. The MIDAS (Mining Internet Data for Association Sequences) algorithm can be used to extract sequential patterns in order to provide behavior-based marketing intelligence in e-commerce scenarios [18].

5.5 Clustering

Clustering is the technique of grouping abstract objects into clusters of similar objects. That is, clustering is the process of grouping objects together in such a way that objects lying in the same group have similar characteristics and those belonging to different groups do not. It is an unsupervised machine learning technique that divides a set of data points into clusters so that objects with the same characteristics are included in the same group. Different methods and techniques are used for cluster analysis. Clustering identifies users with similar behavior, so it can help in personalizing the website. Clustering can be done in two ways: usage clustering, in which users with the same browsing pattern are clustered, and page clustering, in which web pages containing the same content are clustered.

VI. Conclusion

This paper has attempted to give a detailed review of the concept of web mining, which acts as a framework for extracting patterns and analyzing valuable information from the web on the basis of content, hyperlinks and web logs. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns. This paper also discussed its categories, namely web content mining, web structured mining and web usage mining, along with the techniques and methods used in each category. Web content mining extracts knowledge by mining the data in web documents, such as text, audio, video, documents, records and tables. Web structure mining emphasizes analysis of the web pattern graph, that is, the link structure of websites. Web usage mining extracts knowledge from user navigation patterns in web data and also uses secondary data, such as the data generated by users while surfing the web, to find patterns. It collects data from different web log records to find user visit and access patterns on the web. This paper discussed the current trends and challenges in this research area.

References

[1] Jiawei Han and Kevin Chen-Chuan Chang, "Data Mining for Web Intelligence", IEEE International Conference on Data Mining, 2002.
[2] Kosala and Blockeel, "Web mining research: A survey", SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM, Vol. 2, 2000.
[3] Sharma, Arvind Kumar, and P. C. Gupta, "Study & Analysis of Web Content Mining Tools to Improve Techniques of Web Data Mining", International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Volume 1 (2012).
[4] Srividya, M., D. Anandhi and M. I. Ahmed, "Web mining and its categories - a survey", International Journal of Engineering and Computer Science, IJECS 2.4 (2013).
[5] Deepti Sharda and Sonal Chawla, "Web Content Mining Techniques: A Study", International Journal of Innovative Research in Technology & Science.
[6] Johnson, Faustina, and Santosh Kumar Gupta, "Web Content Mining Techniques: A Survey", International Journal of Computer Applications (0975–888) Volume (2012).
[7] Srividya, M., D. Anandhi and M. I. Ahmed, "Web mining and its categories - a survey", International Journal of Engineering and Computer Science, IJECS 2.4 (2013).
[8] Joy Shalom Sona, Prof. Asha Ambhaikar, "A Reconciling Website System to Enhance Efficiency with Web Mining Techniques", International Journal of Scientific & Engineering Research, Volume 3, Issue 2, February 2012, ISSN 2229-5518.
[9] Mamta M. Hegde, Prof. M. V. Phatak, "Developing an approach for hyperlink analysis with noise reduction using Web Structure Mining", International Journal of Advanced Research in Computer Engineering & Technology, Volume 1, Issue 3, May 2012.
[10] Q. Lu and L. Getoor, "Link-based classification", in Proceedings of ICML-03, 2003.

[11] N. Duhan, A. K. Sharma and K. K. Bhatia, "PageRanking Algorithms: A Survey", Proceedings of the IEEE International Conference on Advance Computing, 2009.
[12] T. Nithya, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 8, August 2013.
[13] Ashutosh Kumar Singh, Ravi Kumar P, "A Comparative Study of PageRanking Algorithms for Information Retrieval", International Journal of Electrical and Computer Engineering 4:7:2009.
[14] Mitali Srivastava, Rakhi Garg, P. K. Mishra, "Preprocessing Techniques in Web Usage Mining: A Survey", International Journal of Computer Applications (0975 – 8887), Volume 97, No. 18, July 2014.
[15] Amit Pratap Singh, Dr. R. C. Jain, "A Survey on Different Phases of Web Usage Mining for Anomaly User Behavior Investigation", International Journal of Emerging Trends & Technology in Computer Science (IJETTCS), Volume 3, Issue 3, May – June 2014, ISSN 2278-6856.
[16] Dr. S. Vijiyarani and Ms. E. Suganya, International Journal of Computer-Aided Technologies (IJCAx), Vol. 2, No. 3, July 2015.
[17] Nasrin JOKAR, Ali Reza HONARVAR, Shima AGHAMIRZADEH, Khadijeh ESFANDIARI, Bulletin de la Société des Sciences de Liège, Vol. 85, 2016, p. 321 – 328.
