Comparative Study On Semantic Search Engines
International Journal of Computer Applications (0975 – 8887), Volume 131 – No. 14, December 2015
- Current search engines process queries based on keywords. Thus, they retrieve all web pages containing those keywords, without considering the fact that an accurate answer must be produced on the basis of the user's context.

- Current search engines are unable to gather complex information. The current WWW contains a lot of information and knowledge, but current search engines are unable to retrieve complex information. For instance, a user fires the query "find 10 engineering colleges for the computer stream in India and the top computer companies in their close proximity". Current search engines would not be able to yield the desired results: the user has to fire separate queries and manually merge the results.

- Current search engines are handicapped by being unable to figure out the context in which a word is being used.

Although search engines are very helpful in finding information on the Internet and are getting smarter with the passage of time, they lack the ability to find the meanings of the terms and expressions used in Web pages and the relationships between them. The problem comes from the existence of words which have many meanings, known as polysemy, and of several words having the same meaning, known as synonymy, in natural languages. Thus, when a user gives a search query like "Flip-Flop" to find the definition of "Flip-Flop" in the Computer Science domain, even the most accredited search engine, Google, is unable to find the right document (no document is relevant among the top ten results returned). This is because Google does not know which flip-flop the user is talking about: a kind of footwear, or an electronic device used for one-bit memory storage. It would be possible for Google to find the right document only if it knew the relationship between the two terms given to it, "Flip-Flop" and "Electronics".

2.2 Directory
A Web Directory [4] organizes Web sites by subject and is usually maintained by humans instead of software. The searcher looks at sites organized in a series of categories and menus. A directory does not display results in the form of web pages matched on keywords; its results are links organized into categories and subcategories. The database of a directory is smaller than a search engine's database, since it is human-edited and not built by crawlers. One of the famous directories, "The Open Directory", has been around since 1999 and is a human-edited directory. Also known as DMOZ (Directory Mozilla), the Open Directory Project proposed to be the "largest on the Web", constructed and maintained by a vast, global community of volunteer editors. A directory tends to work best when the user wants to browse a relatively broad subject. As a starting point, a directory can give a good idea of the amount and type of web-based information on the user's desired topic.

2.3 Meta-Search Engines
A Meta-Search Engine [5] performs a search by calling on more than one search engine to do the actual work. The general architecture of a Meta-Search Engine is shown in Figure 2: it sends user requests to several other search engines and/or databases, aggregates the results into a single list, and displays them according to their source.

Figure 2. Architecture of a General Meta-Search Engine

Meta-Search Engines enable users to enter search criteria once and access several search engines simultaneously. They operate on the premise that the Web is too large for any one search engine to index it all, and that more comprehensive search results can be obtained by combining the results from several search engines. This may also save the user from having to use multiple search engines separately. However, end users experience that the results are often not relevant, and so they keep navigating within the search results for a long time.

To deal with this problem, Berners-Lee, Hendler and Lassila [6] presented a vision of a Web in which information is given well-defined meaning, better enabling computers to understand the meaning of content and to help people find relevant information; this Web is called the Semantic Web. The next section discusses the Semantic Web.

3. SEMANTIC WEB
The current web contains millions of unstructured web documents which are accessed using search engines such as Google, Bing, etc., but these search engines do not satisfy users' expectations, because they display a list of documents that merely match the terms present in the fired query. They are not concerned with whether they yield the information the user actually requires. This is because web documents are unstructured in nature, so their contents are analysed only syntactically, which makes it difficult to derive the meaning of the content. Therefore, to resolve this problem, Tim Berners-Lee, inventor of the WWW and director of the W3C, envisioned the Semantic Web.

According to Tim Berners-Lee, the Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

The goal of the Semantic Web is to represent data in a structured format which helps machines understand more of the information on the web; this supports richer discovery and data integration from different sources via linking, thereby producing more exact results for the user as compared to current web search engines.

3.1 Architecture of Semantic Web
As part of the development of the Semantic Web vision, Tim Berners-Lee proposed a layered architecture for the Semantic Web [7], which is shown in Figure 3. The first layer consists of documents written in Unicode and their associated Uniform Resource Identifiers (URIs) and URIRefs. A URI is a general form of identifier which allows a user to create and represent a resource uniquely. It is not necessary that the URI of a resource redirect to some other location, as a URL does. A URIRef is another type of string that represents a URI, and represents
the resource identified by that URI. It is the URI together with an optional fragment identifier at the end, separated by "#". An example of a URI is http://www.w3school.org.

Figure 3. Architecture of Semantic Web (layers, bottom to top: Unicode and URI/URIRef; XML + namespaces + XML Schema; RDF + RDF Schema; ontology vocabulary; logic; proof; trust and rules; with digital signatures and self-describing documents alongside)

The second layer contains XML, a general-purpose markup language for documents containing structured information, together with XML namespaces and XML Schema definitions. It makes sure that there is a common syntax used in the semantic web; an XML schema expresses the schema of a particular set of XML documents. The third layer contains the core data representation format for the semantic web, the Resource Description Framework (RDF) and RDF Schema (RDFS). RDF represents data in the form of <Subject, Predicate, Object> statements that a system can process and understand. RDF uses URIs to identify subject, predicate and object, and uses XML to process these RDF statements. RDF defines a specific RDF/XML markup language, used to represent RDF information and to exchange it between machines. RDF Schema defines a framework to describe classes and individuals, that is, to define the vocabulary of an application.

The next layer contains OWL (Web Ontology Language). OWL is a language derived from description logics and offers more constructs than RDFS. It is syntactically embedded into RDF, so, like RDFS, it provides additional standardized vocabulary. The Logic and Proof layers provide the ability to perform logic on semantic statements, such as inferences and agents. Proofs are more difficult in that they must trawl many assertions to come to conclusions. The semantic web is based on the internet; therefore, the levels of trust in assertions and knowledge must be determined if the source facts are to be believed. Digital signatures provide some trust elements, but referrals through the "web of trust" are also valid mechanisms. A level of trust (or distrust) will need to be factored into the agents and search engines that use the semantic web.

3.2 Why represent data in a new format in the semantic web when HTML is available?
The current web displays knowledge on pages using HTML, which is unstructured in nature. HTML is a presentation language which displays data using tags. A web browser understands those tags and displays the data accordingly, but a computer is not intelligent enough to understand the semantics of the content. For example:

<HTML>
<TITLE>my current page</TITLE>
<BODY>
<H1>welcome to my home page</H1>
<I>this is an example of current web data presentation.</I>
</BODY>
</HTML>

These tags are keywords which tell the browser how to present the data written between them. The web browser reads the HTML document and uses the tags to interpret the content of the page.

The semantic web, in contrast, is an extension of the current web in which information is represented in a structured format using a semantic language such as RDF, OWL, DAML or OIL.

For example, in RDF "the sky has color blue" is represented as a triple: a subject denoting the sky, "has color" as the predicate, and the value blue as the object. Graphically, it is represented as shown in Figure 4.

Figure 4. RDF Representation (Sky -- has color --> Blue)

Each and every object is identified by its URI, which helps to resolve the polysemy and synonymy problems that are often encountered on the current web. A number of generic ontologies are available, such as Dublin Core and FOAF, which are used to represent an object.

For example, consider the statement "The author of Book is Ranjna Jain". Here:

Subject: Book
Object: Ranjna Jain
Predicate: author

The vocabulary for the above statement is taken from Dublin Core. The subject Book can be represented by any URI, such as http://www.w3.org/Book, to fix its meaning. Dublin Core is used to represent the predicate author, which becomes http://purl.org/dc/elements/1.1/creator. Finally, the statement is expressed in RDF as:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.w3.org/book">
    <dc:creator>Ranjna Jain</dc:creator>
  </rdf:Description>
</rdf:RDF>

4. SEMANTIC SEARCH ENGINES
In order to access structured data, a number of semantic search engines have been introduced which understand the meaning of data and help in displaying more exact results as compared to current search engines. Some of the prevalent existing semantic search engines have been selected for discussion in this section, together with their architectures.

Swoogle is a crawler-based indexing and retrieval system for Semantic Web documents using RDF and OWL. It is developed by the University of Maryland, Baltimore County (http://pear.cs.umbc.edu/swoogle/). It extracts metadata and computes relations between documents. Discovered documents are also indexed by an information retrieval system
to compute the similarity among a set of documents and to compute a rank as a measure of the importance of a Semantic Web document (SWD).

4.1 Swoogle
Swoogle [9] is a crawler-based indexing and retrieval system for semantic web documents (SWDs) written in RDF and OWL. The architecture of Swoogle, shown in Figure 5, can be broken into four major components: SWD discovery, metadata creation, data analysis and interface.

Figure 5. Architecture of Swoogle

a. SWD Discovery: At the back end, Swoogle creates a database of SWDs using a hybrid approach to harvest the semantic web. It uses the following mechanisms to generate URLs at which to find SWDs on the web: (i) seed URLs and promising, trusted sites; (ii) URLs from conventional search engines, obtained using meta-crawlers; (iii) the SwoogleBot crawler, which analyses SWDs and generates new candidate URIs.

b. Indexing: This component indexes SWDs using their metadata; for this it captures encoding schemes, namely "RDF/XML" and N-Triples, and languages such as OWL, DAML, RDFS and RDF. It records ontology properties such as label, comment and version info, and relations between two SWDs, such as imports, extends, etc.

c. Analysis: This component uses the created metadata to derive analytical reports, such as the classification of SWOs and SWDBs, and ranks SWDs using a rational surfer model.

d. Services: This interface component focuses on providing data services, such as search services that search ontologies at the term level.

But Swoogle has some limitations: it is not a general-purpose search application and is restricted to retrieving ontology files with embedded RDF content on the internet. Apart from this, it has poor indexing of documents and a long response time for a fired query.

4.2 Falcon
Falcon is a keyword-based semantic search engine [10] which returns ranked RDF documents that include the terms of the fired query. For example, if a user wants to know about BSAITM, then, corresponding to this query, Falcon tries to generate the RDF documents that contain this kind of information, and the exact information is shown in the form of a snippet, so that the user does not need to crawl through and analyse other pages unnecessarily. It displays the required information in the snippet itself; therefore the user does not need to explore the pages.

The architecture of Falcon is described in Figure 6 and its components are described below.

Figure 6. Architecture of Falcon Search Engine

a. RDF Crawler: An RDF crawler is set up to crawl RDF documents. It creates queries by enumerating general keywords, which are sent to Google and Swoogle to generate RDF documents. The crawler is also customized to download RDF documents from DBpedia, Hannover, the DBLP Bibliography and pingthesemanticweb.com.

b. Document-level analysis: It contains a Jena parser, which parses the cached documents collected by the RDF crawler. During this process, newly generated URIs are queued as seeds to explore more RDF documents. Falcon indexes a URI by including its local name, its associated literal values, and a description of its neighbouring semantic web objects in the RDF graph, and it maintains a corresponding virtual document.

c. Global Analysis: Before indexing, vocabulary identification and then reasoning using the class-inclusion relation are carried out, and then indexing is performed.
d. Summarization: A query-dependent snippet of knowledge is provided, so that the end user can gather the information from the snippet itself.

e. User Interface: When a user gives a query to Falcon, it serves a list of objects as well as types, such as location, organization, etc. With this, the user can specify a type and focus on a particular dimension of knowledge.

But Falcon has a limitation: the engine does not rank these objects according to the query.

4.3 Hakia
Hakia [12] is a search engine based on semantic search technology that presents relevant results based on concept match rather than keyword match or popularity ranking. The architecture of Hakia is described in Figure 7 and its components are described below.

Figure 7. Architecture of Hakia

characterization, and by this queries are categorized into the various senses they convey.

d. QDex Storage: It creates and maintains a file for each query, which stores information about the document and the paragraph from which the query was extracted. After that, each QDex file is placed in a known destination via a hash-mode operation. All of this work is performed offline.

e. Query Processor: When the user fires a query from the user interface, the query is sent to the query analyzer to generate the sense and context of the user's query using a fall-back algorithm, and with hash mode the destination location of the QDex file is retrieved directly.

f. Ranking: A pool of relevant paragraphs is ranked by the semantic-analysis rank algorithm, which is based on advanced sentence analysis and concept match with the query. The best sentence of each paragraph, which will be highlighted in the snippet to attract the user, is also retrieved.

But Hakia has some limitations, such as issues with URL canonicalization, privacy of session IDs, and virtual and dynamic contents.
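The QDex idea described above, enumerating offline the queries a paragraph could answer and filing the paragraph under a hash-addressed destination per query, can be sketched roughly as follows. This is only a minimal illustration, not Hakia's actual implementation: the tokenizer, the stopword list, the query-enumeration rule (all 1- and 2-term combinations of content words) and the hash bucketing are all simplifying assumptions, and the function names are invented for this sketch.

```python
import hashlib
from itertools import combinations

STOPWORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}

def candidate_queries(paragraph, max_terms=2):
    """Enumerate the (simplified) queries a paragraph could answer:
    all 1- and 2-term combinations of its content words."""
    words = sorted({w.strip(".,").lower() for w in paragraph.split()} - STOPWORDS)
    for n in range(1, max_terms + 1):
        for combo in combinations(words, n):
            yield " ".join(combo)

def qdex_bucket(query):
    """Hash-mode addressing: a query maps to a fixed bucket (file) name."""
    return hashlib.md5(query.encode("utf-8")).hexdigest()[:8]

def build_index(paragraphs):
    """Offline stage: file each paragraph under every query it can answer."""
    index = {}
    for doc_id, para in paragraphs:
        for q in candidate_queries(para):
            index.setdefault(qdex_bucket(q), []).append((doc_id, para))
    return index

def lookup(index, query):
    """Runtime stage: normalize the query and jump straight to its bucket,
    with no document scanning at query time."""
    terms = sorted({w.lower() for w in query.split()} - STOPWORDS)
    return index.get(qdex_bucket(" ".join(terms)), [])

paragraphs = [("d1", "The flip flop stores one bit of memory."),
              ("d2", "A flip flop is a kind of sandal.")]
index = build_index(paragraphs)
hits = lookup(index, "flip memory")
```

Because the enumeration happens offline, query time reduces to one hash computation and one bucket read, which is the design motivation the paper attributes to QDex.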
aggregating from many sources; the description also includes inferred data which has not necessarily been published, but is derived from the existing data through reasoning.

f. Query Processing and User Interface: It accepts user queries, retrieves the top-k hits, requests the snippet result data for each of the hits, and displays them as output at the interface.

But the Semantic Web Search Engine (SWSE) has some limitations, such as poor ranking of documents, because the ranking process comes before the indexing stage; the ranking technique operates independently of the data indexed in the dataset.

5. COMPARATIVE STUDY
The comparison of the discussed search engines is performed on various measures, like the underlying technique, input parameters required, working levels, complexity, and the quality and relevancy of the returned pages. The detailed comparison study is outlined in Table 1.
Table 1. Comparison of the discussed search engines

| Measure | Google | Hakia | Swoogle | SWSE | Falcon |
| What does it crawl? | HTML Web pages | HTML Web pages | RDF data | RDF data | RDF data |
| Reasoning technique | Not available | Not available | Rule-based systems, Bayesian reasoning | SAOR (Scalable Authoritative OWL Reasoner) | Class-inclusion relation |
| Indexing scheme | Inverted indexing | QDex (Query Detection & Extraction) | Swangling technique | Inverted indexing for literals; sparse indexing for structured data | Inverted indexing |
| What does it index? | Terms | A collection of queries extracted from the web page | RDF triples | RDF literals and structured data (detailed description of an entity aggregated from different sources) | Local names of URIs along with their associated literals and descriptions of neighbouring objects |
| Ranking technique | PageRank algorithm | Semantic rank algorithm | Rational surfer model | Link-based analysis | Objects ranked by a combination of their relevance to the query and their popularity, using cosine similarity between the query and the virtual documents of objects |
| Knowledge snippet | Yes | Yes | Metadata | Yes | Yes |
| Results | Document | Document | Entity | Entity | Entity |
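The last rows of Table 1 note that Falcon ranks objects partly by cosine similarity between the query and each object's virtual document (its local name, associated literals, and descriptions of neighbouring objects). A rough version of that measure is sketched below under simplifying assumptions: plain bag-of-words term-frequency vectors, no IDF weighting, and no popularity component; the function names and the two example objects are invented for illustration.

```python
import math
from collections import Counter

def virtual_document(local_name, literals, neighbour_terms):
    """Falcon-style virtual document: the object's local name plus its
    associated literals and terms describing neighbouring objects."""
    return " ".join([local_name] + literals + neighbour_terms).lower()

def cosine(a, b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def rank_objects(query, objects):
    """Rank candidate objects by similarity of the query to their virtual documents."""
    scored = [(cosine(query, vdoc), name) for name, vdoc in objects]
    return sorted(scored, reverse=True)

objects = [
    ("FlipFlop_circuit",
     virtual_document("FlipFlop", ["one bit memory storage"], ["electronics", "latch"])),
    ("FlipFlop_sandal",
     virtual_document("FlipFlop", ["open footwear"], ["shoe", "beach"])),
]
ranking = rank_objects("flip flop electronics memory", objects)
```

Because the neighbour descriptions enter the virtual document, a query mentioning "electronics" pulls the circuit object ahead of the sandal object, which is the disambiguation behaviour the Flip-Flop example in Section 2 asks for.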
[10] Cheng, G., Ge, W., Wu, H., Qu, Y.: Searching Semantic Web Objects Based on Class Hierarchies. In: Bizer, C., Heath, T., Idehen, K., Berners-Lee, T. (eds.) LDOW 2008. CEUR-WS, vol. 369. CEUR-WS.org (2008)

[11] Qu, Y., Cheng, G., Wu, H., Ge, W., Zhang, X.: Seeking Knowledge with Falcons. Semantic Web Challenge.

[12] hakia: Semantic Search Technology White Paper, http://company.hakia.com/new/documents/White%20Paper_Semantic_Search_Technology.pdf

[13] Hogan, A., Harth, A., Umbrich, J., Kinsella, S., Polleres, A., Decker, S.: Searching and Browsing Linked Data with SWSE: The Semantic Web Search Engine. Journal of Web Semantics, Vol. 9, No. 4 (2011)

[14] Services provided by Falcon, available at: http://iws.seu.edu.cn/services/falcons/

[15] Berners-Lee, T.: Linked Data. Design Issues for the World Wide Web, World Wide Web Consortium, http://www.w3.org/DesignIssues/LinkedData.html (2006)

[16] Dong, H., Hussain, F.K., Chang, E.: A Survey in Semantic Search Technologies. In: 2008 Second IEEE International Conference on Digital Ecosystems and Technologies (IEEE DEST 2008)