Web Content Classification: A Survey: Prabhjot Kaur
ABSTRACT: As the information contained within the web is increasing day by day, organizing this information has become a necessary requirement. The data mining process extracts information from a data set and transforms it into an understandable structure for further use. Classification of web page content is essential to many tasks in web information retrieval, such as maintaining web directories and focused crawling. The uncontrolled nature of web content presents additional challenges to web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. In this paper, web classification is discussed in detail and its importance in the field of data mining is explored.
step but do belong to the overall KDD process as additional steps.
I. INTRODUCTION
Data mining offers value across a broad spectrum of industries. Telecommunications and credit card companies are two of the leaders in applying data mining to detect fraudulent use of their services. Insurance companies and stock exchanges are also interested in applying this technology to reduce fraud. Medical applications are another fruitful area, where data mining can be used to predict the effectiveness of surgical procedures or medications. Companies active in the financial markets use data mining to determine market and industry characteristics as well as to predict individual company and stock performance. Retailers are making more use of data mining to decide which products to stock in particular stores, as well as to assess the effectiveness of promotions and coupons. Pharmaceutical firms are mining large databases of chemical compounds and of genetic material to discover substances that might be candidates for development as agents for the treatment of disease.
II. WEB PAGE CLASSIFICATION
With the rapid growth of the World Wide Web (WWW), there is an increasing need to provide automated assistance to Web users for Web page classification and categorization. Such assistance is helpful in organizing the vast amount of information returned by keyword-based search engines, or in constructing catalogues that organize web documents into hierarchical collections. However, this need is difficult to meet without automated web-page classification techniques, due to the labor-intensive nature of human editing. At first glance, web-page classification can borrow directly from the machine learning literature for text classification. On closer examination, however, the solution is far from being so straightforward [8]. Web pages have their own underlying structure embedded in the HTML language. They typically contain noisy content such as advertisement banners and navigation bars. If a pure-text classification method is applied directly to these pages, it will introduce much bias into the classification algorithm, making it likely to lose focus on the main topics and important content. Thus a critical issue is to design an intelligent preprocessing technique to extract the main topic of a web page [10].
Based on the organization of categories, web page classification can also be divided into flat classification and hierarchical classification. In flat classification, categories are considered parallel, i.e. one category does not supersede another. In hierarchical classification, on the other hand, the categories are organized in a hierarchical tree-like structure in which each category may have a number of subcategories.
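The distinction between flat and hierarchical classification can be illustrated with a minimal Python sketch. The taxonomy, the category names, the keyword sets and the keyword-overlap scoring below are hypothetical and chosen only for illustration; they are not taken from any of the surveyed papers, and a real system would use a trained classifier at each decision point.

```python
# Hypothetical toy taxonomy: two top-level categories, each with two
# subcategories described by a small illustrative keyword set.
TAXONOMY = {
    "Science": {
        "Physics": {"quantum", "particle", "energy"},
        "Biology": {"cell", "gene", "species"},
    },
    "Sports": {
        "Football": {"goal", "league", "striker"},
        "Tennis": {"serve", "racket", "set"},
    },
}


def overlap(tokens, keywords):
    """Score a page by the number of distinct keywords it contains."""
    return len(set(tokens) & keywords)


def flat_classify(tokens):
    """Flat classification: all leaf categories are treated as parallel
    labels and the best-scoring leaf is chosen directly."""
    leaves = {leaf: kw for subs in TAXONOMY.values() for leaf, kw in subs.items()}
    return max(leaves, key=lambda leaf: overlap(tokens, leaves[leaf]))


def hierarchical_classify(tokens):
    """Hierarchical (top-down) classification: choose a top-level
    category first, then choose among its subcategories only."""
    def parent_score(parent):
        return sum(overlap(tokens, kw) for kw in TAXONOMY[parent].values())

    parent = max(TAXONOMY, key=parent_score)
    children = TAXONOMY[parent]
    child = max(children, key=lambda c: overlap(tokens, children[c]))
    return parent, child


page_tokens = "the striker scored a late goal in the league match".split()
print(flat_classify(page_tokens))          # Football
print(hierarchical_classify(page_tokens))  # ('Sports', 'Football')
```

In the flat case the classifier compares the page against every leaf category at once, while in the hierarchical case each decision is restricted to the subcategories of the node chosen one level above.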
Classification of web content differs in some respects from text classification. The uncontrolled nature of web content presents additional challenges compared with traditional text classification. Web content is semi-structured and contains formatting information in the form of HTML tags. A web page also contains hyperlinks that point to other pages, and this interconnected nature of web pages provides features that can be of great help in classification. A typical classification process proceeds as follows. First, all HTML tags are removed from the web pages, along with punctuation marks. The next step is to remove stop words, as they are common to all documents and do not contribute much to searching. In most cases a stemming algorithm is applied to reduce words to their basic stem; one frequently used stemmer is Porter's stemming algorithm. The remaining terms of each document are represented as a feature vector, and machine learning algorithms are then applied to these vectors to train the respective classifier. The trained classifier is then used to test an unlabelled sample document against the learnt data. In this approach, the user deals with home pages of organizational websites [9].
A neatly developed home page of a web site is treated as an entry point for the entire web site and represents a summary of the rest of the site. Many URLs link to second-level pages that tell more about the nature of the organization. The information contained in the title, meta keywords, meta description and the labels of the A HREF (anchor) tags is a very important source of rich features. In order to rank high in search engine results, site promoters pump in many relevant keywords. Most home pages are designed to fit in a single screen. The factors discussed above contribute to the expressive power of the home page in identifying the nature of the organization [11].
III. RELATED WORK
In the paper [1], the authors note that the increase in the amount of information on the Web has created the need for accurate automated classifiers for Web pages, to maintain Web directories and to increase search engine performance. Since every tag and every term on each Web page can be considered a feature, efficient methods are needed to select the best features and reduce the feature space of the Web page classification problem. The aim of that paper is to apply a recent optimization technique, the firefly algorithm (FA), to select the best features for Web page classification. The firefly algorithm is a metaheuristic algorithm inspired by the flashing behavior of fireflies. FA is used to select a subset of features, and the J48 classifier of the Weka data mining tool is employed to evaluate the fitness of the selected features. The WebKB and Conference datasets were used to evaluate the effectiveness of the proposed feature selection system. The observation is that when a subset of features is selected using FA, the WebKB and Conference datasets are classified without loss of accuracy; moreover, the time needed to classify new Web pages is reduced sharply as the number of features decreases.
In the paper [2], the authors observe that the Web has become one of the most widespread platforms for information exchange and retrieval. As it becomes easier to publish documents, as the number of users (and thus publishers) increases, and as the number of documents grows, searching for information is turning into a cumbersome and time-consuming operation. Because of the heterogeneity and unstructured nature of the data available on the WWW, Web mining uses various data mining techniques to discover useful knowledge from Web hyperlinks, page content and usage logs. The main uses of web content mining are to gather, categorize, organize and provide the best possible information available on the Web to the user requesting it. Mining tools are imperative for scanning the many HTML documents, images and text; the results are then used by search engines. The paper first introduces the concepts related to web mining, then presents an overview of different Web content mining tools, and concludes with a comparative table of these tools based on some pertinent criteria.
In the paper [3], the authors describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. Examples include identifying significant increases in the likelihood of disease outbreaks, deaths and riots in advance of the occurrence of these events in the world. Details of the methods and studies are provided, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. The predictive power of the approach is evaluated on real-world events withheld from the system.
In the paper [4], the authors give a preliminary discussion of Web mining, covering a few key computer science contributions in the field, the prominent successful applications, and some promising areas of future research. From the very beginning, the potential of extracting valuable knowledge from the Web has been quite evident. Web mining, i.e. the application of data mining techniques to extract knowledge from Web content, structure and usage, is the collection of technologies that fulfil this potential. More specifically, it is the application of data mining techniques to extract knowledge from Web data where at least one of structure or usage data is used in the mining process. Interest in Web mining has grown rapidly in its short existence, both in the research and practitioner communities.
In the paper [5], the authors note that nature-inspired metaheuristic algorithms, especially those based on swarm intelligence, have attracted much attention in the last ten years. The paper describes the fundamentals of the firefly algorithm together with a selection of recent publications. It then discusses the optimality associated with balancing exploration and exploitation, which is essential for all metaheuristic algorithms. By comparison with the intermittent search strategy, the authors conclude that metaheuristics such as the firefly algorithm are better than the optimal intermittent search strategy. The algorithms and their implications for higher-dimensional optimization problems are also analysed.
In the paper [6], the authors propose a new direction for web page classification using Artificial Neural Networks (ANN). The World Wide Web is growing at an uncontrollable rate: hundreds of thousands of web sites appear every day, with the added challenge of keeping web directories up to date. The uncontrolled nature of the web presents difficulties for Web page classification. As the number of Internet users grows, there is a need to classify web pages with greater precision in order to present users with web pages of their desired class. However, web page classification has mostly been accomplished using textual categorization methods. The paper proposes a novel approach that uses the HTML information present in a web page for its classification.
In the paper [7], the Intelligent Water Drops (IWD) algorithm is adapted for feature selection with Rough Set (RS). Specifically, IWD is used to search for a subset of features, with RS dependency as the evaluation function. The resulting system, called IWDRSFS (Intelligent Water Drops for Rough Set Feature Selection), is evaluated with six benchmark data sets. The performance of IWDRSFS is analysed and compared with that of other methods in the literature. The outcomes indicate that IWDRSFS is able to provide competitive and comparable results. In summary, the study shows that IWD is a useful method for undertaking feature selection problems with RS.
In the paper [8], the author proposes a genetic algorithm to select the best features for the Web page classification problem, in order to improve the accuracy and run-time performance of classifiers. To determine whether or not a Web page belongs to a specific class (e.g., a graduate student homepage, a course page, etc.), a classifier needs to have "good" features extracted from the Web pages. As every component in a Web page, such as HTML tags and terms, can be taken as a feature, the dimension of the classification problem becomes too high to be solved by well-known classifiers such as decision trees and support vector machines. To decrease the feature space, the author developed a genetic algorithm that determines the best features for a given set of Web pages. It is found that when the features selected by the genetic algorithm are used with a kNN classifier, the accuracy improves up to 96%.
IV. CONCLUSION
The increase in the amount of information on the Web has created the need for accurate automated classifiers for Web pages, to maintain Web directories and to increase search engine performance. Since every tag and every term on each Web page can be considered a feature, efficient methods are needed to select the best features and reduce the feature space of the Web page classification problem. Having reviewed web classification research with respect to its features and algorithms, we conclude by summarizing the lessons learned from existing research and pointing out future opportunities in web classification. Classification tasks include assigning documents on the basis of subject, function, sentiment, genre, and more. We have studied a number of techniques for web page classification, but due to the rapid growth of data on the Internet there is still a need for an efficient technique that will speed up the web page classification process and give optimized results.
V. ACKNOWLEDGEMENTS
I would like to thank all the people who helped me gain knowledge about these research papers. I am thankful to my guide, with whose guidance I was able to complete this research paper and get it published. Finally, I would like to thank the authors of all the websites and IEEE papers that I went through and referred to in making this review paper successful.
REFERENCES
[1] Esra Saraç, Selma Ayşe Özel, "Web Page Classification Using Firefly Optimization", IEEE International Symposium on Innovations in Intelligent Systems and Applications (INISTA), pp. 1-5, 19-21 June 2013.
[2] Abdelhakim Herrouz, Chabane Khentout, Mahieddine Djoudi, "Overview of Web Content Mining Tools", The International Journal of Engineering And Science (IJES), Volume 2, Issue 6, 2013.
[3] Kira Radinsky, Eric Horvitz, "Mining the Web to Predict Future Events", WSDM'13, February 4–8, 2013, Rome, Italy.
[4] Monika Yadav, Pradeep Mittal, "Web Mining: An Introduction", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 3, March 2013.
[5] Xin-She Yang, Xingshi He, "Firefly algorithm: recent advances and applications", Int. J. Swarm Intelligence, Vol. 1, No. 1, 2013.
[6] Pikakshi Manchanda, Sonali Gupta, Komal Kumar Bhatia, "On The Automated Classification of Web Pages Using Artificial Neural Network", IOSR Journal of Computer Engineering (IOSRJCE), Volume 4, Issue 1 (Sep-Oct. 2012), pp. 20-25.
[7] Basem O. Alijla, Lim Chee Peng, Ahamad Tajudin Khader, and Mohammed Azmi Al-Betar, "Intelligent Water Drops Algorithm for Rough Set Feature Selection", pp. 356–365, Springer-Verlag Berlin Heidelberg, 2013.
[8] Selma Ayşe Özel, "A Genetic Algorithm Based Optimal Feature Selection for Web Page Classification", Department of Computer Engineering, Çukurova University, 01330 Balcalı, Sarıçam, Adana, Türkiye.
[8] Xiaoguang Qi, Brian D. Davison, "Web Page Classification: Features and Algorithms", http://www.cse.lehigh.edu/~xiq204/pubs/classification-survey/LU-CSE-07-010.pdf.
[9] Lim Wern Han and Saadat M. Alhashmi, "Joint Web-Feature (JFEAT): A Novel Web Page Classification Framework", IBIMA Publishing, Vol. 2010 (2010), Article ID 73408, 8 pages.
[10] K S Chandwani, "Clustering of Web Page Search Result using Web Content Mining Approaches", International Journal of Computer, Information Technology & Bioinformatics (IJCITB), Volume-1, Issue-2.
[11] Daniele Riboni, "Feature Selection for Web Page Classification".
[12] S. Sumathi and S.N. Sivanandam, "Introduction to Data Mining and its Applications", Studies in Computational Intelligence, Volume 29.
[13] Sunita Beniwal and Jitender Arora, "Classification and Feature Selection Techniques in Data Mining", International Journal of Engineering Research & Technology (IJERT), Vol. 1, Issue 6.