Web Page Classification Based on Schema.org Collection
Abstract—The internet is a library holding a huge amount of information, and its content needs to be categorized by means of web page classification. Classifying web page content can improve the quality and accuracy of web search. Unfortunately, the high dimensionality of web page datasets makes classification difficult. An automatic method for web page classification can simplify the whole process and help the search engine return more relevant results. Today, information on the web is generally structured and formatted in an informal way. This absence of semantics has led to formal methods for adding semantic information to web pages. The search engines Bing, Google, Yahoo! and Yandex formed the Schema.org collection of schemas to support web page semantics and improve their search results. This paper explores the use of formal source code structure for classifying a large collection of web content. It focuses on using the Schema.org collection of schemas to classify web pages and categorize them unambiguously.

Keywords-Schema.org collection of schemas; Web Page Classification; Genres; Microgenres; Microformats

I. INTRODUCTION

The Internet can be considered the world's largest information library. We can imagine the individual books in the library as individual web pages. There is no one who can sort them onto shelves according to specific topics. In this situation it is not easy to get oriented, and stay oriented, in the mass of unstructured information.

Automatic classification of web pages is an effective way to deal with the difficulty of retrieving relevant information from the Internet and to help users orient themselves better.

Our goal is to offer users a more effective way to orient themselves in the mass of web information and to give them more relevant search results. We want to assign new auxiliary labels to each web page that contains the Schema.org collection: genres such as “Movie”, “Person”, “Recipe”, “Blog”, or microgenres such as “Price information”, “Something to read”, etc. These labels clearly identify the genres and microgenres of a web page. We can then offer our users more detailed search options, which can lead to much more relevant search results.

For example, a user may be searching for a lasagna recipe with at least one user review and a five-star rating. Current search engines do not allow this kind of advanced search query. The situation is similar for any product, for instance when we are searching only for web pages that contain a product specification and an offer to buy the product. There is no option to set these criteria. With more semantics on the web, however, search engines can support this in the future.

Existing web pages generally lack the semantics needed to provide users with such advanced search options. In order to provide detailed and more relevant information in web search, the schema.org collection of microdata schemas was created in 2011. It extends the existing family of RDF [1], Microformats and microdata [2], with a view to gradually replacing them in the future [3]. This collection gives us a uniform and formal set of rules and recommendations that allow us to add significantly better semantic information to the web page source code.

Search engines including Bing, Google, Yahoo! and Yandex rely on this markup to improve the display of search results, making it easier for people to find the right web pages. These global web search engines, which oversee the schema.org project, have committed to supporting microdata in the future. Their search algorithms are gradually being extended to support microdata schemas. Many applications, especially search engines, can benefit greatly from direct access to this structured data. On-page markup enables search engines to understand the information on web pages clearly and to provide richer search results, making it easier for users to find relevant information on the web. Markup can also enable new tools and applications that benefit from this structure.

Currently the most widely used and supported schema is the Recipe schema. Our work focuses on the unique identification of specific genres and microgenres [4] using the microdata schema collection, on the extraction of semantic information, and on classifying web pages. From the content of the analyzed web pages we can determine the amount of additional information that can be used to expand web search options, produce more accurate search results, and so on.

A. Related work

Apart from the schema.org collection of microdata schemas, there are several other formally defined sets of rules for the semantic web. These semantic web languages are presented in [8]. Also important to us are systems for web information extraction, which transform web pages into program-friendly structures such as a relational database. This approach analyzes the structure and the templates of the web page. A survey of the major web information extraction approaches is presented in [9].

978-1-4673-4794-5/12/$31.00 © 2012 IEEE
The concept of genres and microgenres is introduced in [4]: they are ambiguous categories without fixed boundaries, formed mainly by sets of conventions. The authors of [6] analyze web pages using web patterns and introduce a method for semantic analysis of web pages.

Automatic web page classification in a dynamic and hierarchical way is presented in [12]. It relates to text learning and document classification. Text learning is a machine learning method for textual data that incorporates information retrieval techniques and is used as a tool to extract the content of textual data [10]. The survey “Web Page Classification: Features and Algorithms” [13] examines the space of web classification approaches to find new areas for research and collects the latest practices to inform future classifier implementations. It carefully reviews the web-specific features and algorithms that have been explored and found useful for web page classification.

The importance of HTML structural elements and metadata in automated subject classification is shown in [11]. The aim of that paper was to determine how significance indicators assigned to different web page elements (internal metadata, title, headings, and main text) influence automated classification.
B. Organization

This paper is organized as follows. Section II describes our research and approach. Section III introduces the Schema.org collection of schemas, our algorithm, and genres. Finally, Section IV draws conclusions and outlines future research.
II. OUR RESEARCH
In contrast to the approaches above, the information located in the source code by tags and attributes is essential to us, so our approach cannot be applied to plain text alone. Our algorithm uses specific semantic attributes and the information they mark. It has been demonstrated that using information derived from tags can boost classifier performance [11].

People who search the web usually have a clear idea of what they are searching for and of what the ideal search result looks like. In our research we search for recipes, and for our experiments we chose the Recipe schema, the most widely used schema and one supported by the web search engines mentioned in Section I.

Our aim was to analyze the source code of a sufficient number of web pages that publish articles about cooking. We want to clearly identify the Recipe schema and, on the basis of our analysis, obtain additional semantic information about a particular web page. We then assign these web pages to one or more predefined category labels (a task also known as web page categorization or classification) in order to increase the precision of web search. The labels can be the genres and microgenres given as examples in Section I, or any others. We manually went through hundreds of international and domestic recipe web sites. We looked for pages that contain the Recipe schema, allow the recipe to be rated with stars, and offer the option to write a user review. The results of this manual crawl were compared with the results of our algorithm, which uses the Schema.org microdata collection.

A representative recipe web page from our research is shown in Fig. 1. The Schema.org schemas are marked as blocks. There are Recipe and Review schemas, which include the CreativeWork and Thing schemas with their own properties from the collection. The main genre of the web page is Recipe, and we can also see the microgenres “Something to read” and “Rating”.

Fig. 1: Recipe by chow.com described by Schema.org

2012 Fourth International Conference on Computational Aspects of Social Networks (CASoN)

III. MICRODATA COLLECTION SCHEMA.ORG

The goal of Schema.org is to recover the unambiguous meaning of information that is lost during the transfer from the database (where information is clearly divided and described in tables, columns and rows) to the application presentation layer. The collection of schemas is used to restore that lost information in the source code of web pages, and it offers the possibility of extending the semantic meaning even further. The major web search engines create and support a common vocabulary for structured data markup on web pages.

With the schema.org collection, site owners and developers can learn about structured data and improve how their sites appear in major search engines. Web page owners can improve how their sites appear in search results not only on Google, but also on Bing, Yahoo! and potentially other search engines in the future.

Schema.org also introduces schemas for more than a hundred new categories, including movies, music, organizations, TV shows, products, places and more. As webmasters add this semantic markup to their sites, search engines can develop richer search experiences. Search engines have been working independently to support structured markup for a few years now. Much of the vocabulary on schema.org was inspired by earlier formats such as Microformats, FOAF, GoodRelations, hCard and OpenCyc.

When we talk about semantics, we can imagine semantically unstructured data as plain text with some HTML tags:

    <h1>Mom’s World Famous Banana Bread</h1>
    By John Smith
    May 8, 2009
    <img src="bananabread.jpg" />
    This classic banana bread
    <strong>Nutrition facts:</strong>
    240 calories, 9 grams fat
    <strong>Ingredients:</strong>
    - 3 or 4 ripe bananas, smashed
    - 1 egg
    ...

People can read this information and understand the meaning of its individual parts, but a search engine crawler will not understand the meaning so well. The information will probably be stored in the search engine database as plain text inside some general table. At most, some words may be given more weight according to the importance of the HTML tag that contains them. If we assign schemas to the data above, the result will be much more readable for the search engine crawler. The following example shows how to embed information about a recipe, and the structure of that information, into a web page. To mark up the data, the itemtype attribute is used together with the URL of the schema. The itemscope attribute defines the scope of the item described by itemtype. Each property of the current item is identified by the itemprop attribute. Nested within the Recipe schema is a NutritionInformation schema.

    <div itemscope itemtype="http://schema.org/Recipe">
      <h1 itemprop="name">Mom’s World Famous Banana Bread</h1>
      By <span itemprop="author">John Smith</span>,
      <meta itemprop="datePublished" content="2009-05-08">
      May 8, 2009
      <img itemprop="image" src="bananabread.jpg" />
      <span itemprop="description">This classic banana bread</span>
      <div itemprop="nutrition"
           itemscope itemtype="http://schema.org/NutritionInformation">
        <strong>Nutrition facts:</strong>
        <span itemprop="calories">240 calories</span>,
        <span itemprop="fatContent">9 grams fat</span>
      </div>
      <strong>Ingredients:</strong>
      - <span itemprop="ingredients">3 or 4 ripe bananas, smashed</span>
      - <span itemprop="ingredients">1 egg</span>
      ...
    </div>

Information described using microdata is much better structured semantically [5]. We can see that the text belongs to the Recipe genre and has its own name, author, publication date and recipe description. The individual ingredients and the nutritional information are also clearly listed. We can now store these semantic results in recipe tables with the appropriate properties.

A. Algorithm

The algorithm parses the source code of web pages, which is divided into blocks by the itemscope and itemtype attributes. This procedure tells us clearly whether the Recipe schema is present on the web page or not (Fig. 2). Because we are searching for the Recipe schema, the algorithm first looks for an itemscope block whose itemtype is Recipe and then searches inside the returned data blocks for additional information (aggregateRating, Review) using the values of the itemprop attribute.

    <div itemscope itemtype="http://schema.org/Recipe">
      <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
        <span itemprop="ratingValue">4</span> stars - based
Fig. 2: Algorithm process
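
The two-step search described in the Algorithm subsection can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names MicrodataScanner and classify, the label strings "Rating" and "Review", and the choice of Python's standard html.parser module are all assumptions made for illustration. The sketch first collects every itemscope block together with its itemtype, then checks the itemprop values recorded directly inside each Recipe block.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MicrodataScanner(HTMLParser):
    """Hypothetical helper: records each itemscope block and its itemprop names."""

    def __init__(self):
        super().__init__()
        self.items = []    # one dict per itemscope block found in the page
        self._stack = []   # open elements as (tag, item opened by that tag or None)

    def _current_item(self):
        # Nearest enclosing itemscope block, if any.
        for _tag, item in reversed(self._stack):
            if item is not None:
                return item
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        owner = self._current_item()
        # An itemprop belongs to the enclosing item, even when the same tag
        # also opens a nested item through itemscope.
        if "itemprop" in attrs and owner is not None:
            owner["props"].update((attrs["itemprop"] or "").split())
        item = None
        if "itemscope" in attrs:
            item = {"types": (attrs.get("itemtype") or "").split(), "props": set()}
            self.items.append(item)
        if tag not in VOID_TAGS:
            self._stack.append((tag, item))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner elements.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def _has_type(item, name):
    # True when one of the itemtype URLs ends in the given schema name.
    return any(t.rstrip("/").rsplit("/", 1)[-1] == name for t in item["types"])


def classify(html):
    """Return (genres, microgenres) for a page; the label names are illustrative."""
    scanner = MicrodataScanner()
    scanner.feed(html)
    genres, microgenres = set(), set()
    for item in scanner.items:
        if not _has_type(item, "Recipe"):
            continue                      # step 1: keep only Recipe blocks
        genres.add("Recipe")
        props = {p.lower() for p in item["props"]}
        if "aggregaterating" in props:    # step 2: additional information
            microgenres.add("Rating")
        if "review" in props or "reviews" in props:
            microgenres.add("Review")
    return genres, microgenres
```

Calling classify on the source code of a page returns two sets of labels. A page without any itemscope block yields two empty sets, which corresponds to the case in which the Recipe schema is not present. The sketch assumes reasonably well-formed HTML and only inspects properties attached directly to the Recipe block.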
IV. CONCLUSION

Because it depends on schemas and on the semantic structure of the source code, our approach is not universal. Schemas can also currently be found on only a limited number of web pages worldwide. However, it is already clear that web search engines will push web developers to write their source code with microdata. Web developers have an incentive to create more semantic source code, because their web sites will be more visible in search engine results. We think that the schema.org collection of schemas is the future of web semantics.

In the near future, we want to focus on web pages that contain no schema information although they should: web pages that belong to one of the Schema.org schemas but have no microdata in their source code. We will analyze these web pages and try to find a way to recommend adding a schema to the source code, thereby improving their semantic meaning.

We also want to take a deeper look at web page genres and microgenres and use them to improve classification.

ACKNOWLEDGMENT

This paper was supported by the IT4Innovations Centre of Excellence project, reg. no. CZ.1.05/1.1.00/02.0070, supported by the Operational Programme ‘Research and Development for Innovations’ funded by the Structural Funds of the European Union and the state budget of the Czech Republic; by the project SoftComp: Development of human resources in research and development of innovative softcomputing methods and their practical use, reg. no. CZ.1.07/2.3.00/20.0072, funded by the Operational Programme Education for Competitiveness, co-financed by ESF and the state budget of the Czech Republic; and by SGS, VSB-Technical University of Ostrava, under grant no. SP2012/58.

REFERENCES

[1] D. Brickley and R. V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema. The World Wide Web Consortium (W3C), 2004.

[5] S. Bradley, Why (And How) You Should Use HTML5 Microdata. Van SEO Design, 2011.

[7] M. Eirinaki and M. Vazirgiannis, Web mining for web personalization. ACM Transactions on Internet Technology (TOIT), Vol. 3 Issue 1, 2003.

[8] J. Bailey, F. Bry, T. Furche and S. Schaffert, Web and Semantic Web Query Languages: A Survey. Reasoning Web Summer School, Springer-Verlag, LNCS 3564, 2005.

[9] C.-H. Chang, M. Kayed, M. R. Girgis and K. F. Shaalan, A survey of Web information extraction systems. IEEE Transactions on Knowledge and Data Engineering, 18(10):1411-1428, 2006.

[10] D. Koller and M. Sahami, Hierarchically classifying documents using very few words. Proceedings of the 14th International Conference on Machine Learning ECML98, 1998.

[11] K. Golub and A. Ardo, Importance of HTML Structural Elements and Metadata in Automated Subject Classification. ECDL 2005, LNCS 3652, Springer-Verlag Berlin Heidelberg, 2005.

[12] X. Peng and B. Choi, Automatic Web Page Classification in a Dynamic and Hierarchical Way. Center for Entrepreneurship and Information Technology (CEnIT), Louisiana Tech University, 2002.

[13] X. Qi and B. D. Davison, Web Page Classification: Features and Algorithms. ACM Computing Surveys, Vol. 41, No. 2, Article 12, 2009.

[14] A. Finn and N. Kushmerick, Learning to classify documents according to genre. In IJCAI-03 WS on Computational Approaches to Style Analysis and Synthesis, 2003.