Mequannint Munye
A THESIS SUBMITTED TO
THE SCHOOL OF GRADUATE STUDIES OF ADDIS ABABA UNIVERSITY IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN
COMPUTER SCIENCE
November, 2010
ADDIS ABABA UNIVERSITY
SCHOOL OF GRADUATE STUDIES
FACULTY OF COMPUTER AND MATHEMATICAL SCIENCES
DEPARTMENT OF COMPUTER SCIENCE
ADVISOR:
Solomon Atnafu (PhD)
APPROVED BY
Examining Board:
1. Dr. Solomon Atnafu, Advisor ______________________
2. ________________________ ______________________
3. ________________________ ______________________
DEDICATION
To my father: may God bless you and rest your soul in heaven. I appreciate your efforts and
struggle to show me the route that brought me to today's fruit, though you were not fortunate
enough to see it.
To my mother: I appreciate your struggle, patience, and hope in bringing me here.
To my brother: do you think I could have reached here without you?
Acknowledgements
This thesis is the harmonized result of the efforts of many people. Today, I am highly delighted
to have the opportunity to express my gratitude to all of them.
Every work is successfully accomplished through the will and interest of the almighty
God, so all my first praises and thanks are reserved for Him. I would then like to extend my
heartfelt and deepest sense of gratitude to my advisor Dr. Solomon Atnafu for his continuous
and unreserved support, guidance, constructive comments and suggestions. It is with his
continuous follow-up, encouragement and advice that this thesis came to realization. I appreciate
his dedication, from beginning to end, in teaching me for the first time how to conduct and
write research.
I would also like to take this opportunity to express my profound gratefulness to my family for
their continuous moral and financial support. I love you all. Tsegaw, you are a true brother and
a role model for how all brothers should treat their brothers. I appreciate your inspiration, hope
and patience in following me up throughout my study.
My sincere thanks also go to my friends who have been with me from my undergraduate years in
Haramaya through my postgraduate studies here in Addis Ababa to the end of this thesis. Simachew
Endale and Mehari Bayou: you have been in my life ever since we started our higher education,
and you left no stone unturned to help me reach today's achievement. You stimulated me to
start my study every day. I cherish our many holidays together and hope that we will be able to
stay in touch despite the large distances between us.
I would also like to extend my gratitude to Jijiga University for sponsoring me to attend my
second degree. I also want to thank the Graduate Committee of the Addis Ababa University Computer
Science Department for giving me permission to commence this research work. Since this
thesis is the cumulative result of two years of learning, I would like to thank all the staff
members of the Department of Computer Science involved in the process, as well as my
classmates, for working together in harmony on different projects and assignments.
Last but not least, I would like to thank all those people who provided fruitful and valuable
assistance and left their fingerprints on this work.
TABLE OF CONTENTS
List of Figures ......................................................................................................... vi
List of Tables.......................................................................................................... vii
List of Appendices ................................................................................................ viii
Acronyms and Abbreviations ..................................................................................... ix
ABSTRACT ............................................................................................................. x
CHAPTER ONE ...................................................................................................... 1
INTRODUCTION ................................................................................................... 1
1.1 General Background ........................................................................................................... 1
1.2 Statement of the Problem ................................................................................................... 3
1.3 Motivation ............................................................................................................................ 4
1.4 Objectives ............................................................................................................................. 5
1.4.1 General Objective ........................................................................................................... 5
1.4.2 Specific Objectives ......................................................................................................... 5
1.5 Scope and Limitations of the Study ................................................................................... 6
1.6 Methodology ........................................................................................................................ 6
1.6.1 Literature Review ........................................................................................................... 6
1.6.2 Analysis and Design ....................................................................................................... 7
1.6.3 Evaluation....................................................................................................................... 8
1.7 Application of Results ......................................................................................................... 8
1.8 Organization of the Thesis.................................................................................................. 9
CHAPTER TWO................................................................................................... 10
REVIEW OF LITERATURE ..................................................................... 10
2.1 Information Retrieval ....................................................................................................... 10
2.2 Information Retrieval Models .......................................................................................... 10
2.2.1 Early Information Retrieval Systems ........................................................................... 10
2.2.2 Modern Information Retrieval Systems ....................................................................... 11
2.3 Search Engine .................................................................................................................... 14
2.3.1 Components of Search Engines.................................................................................... 14
2.3.2 Multilingual Search Engines ........................................................................................ 15
2.4 Translation ......................................................................................................................... 16
2.4.1 Translation and Information Retrieval ......................................................................... 18
2.4.2 Approaches of Machine Translation ............................................................................ 19
2.4.2.1 Rule-based Approach ................................................................................. 19
2.4.2.2 Statistical Approach ................................................................................... 22
2.4.2.3 Example-based Approach .......................................................................... 23
2.4.2.4 Hybrid Machine Translation ...................................................................... 23
2.5 Transliteration ................................................................................................................... 24
4.5 Amharic Search Engine .................................................................................................... 49
6.3.2 Discussions on the Results of English-Amharic Query Translation Component ........ 77
6.4 Evaluation of the Transliteration Component ............................................................... 77
6.4.1 Discussions on the Transliteration Results................................................................... 78
6.5 Bilingual Retrieval Performance Evaluation .................................................................. 79
6.5.1 Amharic-English Bilingual Retrieval Performance Evaluation ................................... 80
6.5.2 English-Amharic Bilingual Retrieval Performance Evaluation ................................... 82
6.5.3 Summary on the Overall Experimental Results ........................................................... 84
6.5.4 The Significance of Our System Compared to General Purpose Search Engines ....... 84
References .............................................................................................................. 91
List of Figures
List of Tables
List of Appendices
Acronyms and Abbreviations
API Application Program Interface
ASCII American Standard Code for Information Interchange
ASP Active Server Page
BLIR Bilingual Information Retrieval
CLEF Cross-Lingual Evaluation Forum
CLIR Cross-Lingual Information Retrieval
DHTML Dynamic Hyper Text Markup Language
EBMT Example Based Machine Translation
HMT Hybrid Machine Translation
HTML Hyper Text Markup Language
HTTP Hyper Text Transfer Protocol
IR Information Retrieval
JDBC Java Database Connectivity
JSP Java Server Page
MLIR Multilingual Information Retrieval
MRD Machine Readable Dictionary
MT Machine Translation
NER Named Entity Recognizer
NII National Institute of Informatics
NLP Natural Language Processing
NTCIR NII Test Collection for Information Retrieval
OOV Out-Of-Vocabulary
POS Part-Of-Speech
PRP Probabilistic Ranking Principle
RBMT Rule Based Machine Translation
SDK Software Development Kit
SMT Statistical Machine Translation
TREC Text REtrieval Conference
URL Uniform Resource Locator
VSM Vector Space Model
XML eXtensible Markup Language
ABSTRACT
As non-English languages have been growing exponentially on the Web with the expansion of
the multilingual World Wide Web, the number of online non-English speakers who realize the
importance of finding information in different languages is growing enormously. However, the
major general purpose search engines, such as Google and Yahoo, have been lagging behind in
providing indexes and search features to handle non-English languages. Hence, documents that
are published in non-English languages are more likely to be missed or improperly indexed by
major search engines. Amharic, a member of the Semitic language family and the official
working language of the federal government of Ethiopia, is one of these languages with rapidly
growing content on the Web. As a result, the need to develop a bilingual search engine that
handles the specific characteristics of the user's native language query (Amharic) and retrieves
documents in both Amharic and English becomes more apparent.
In this research work, we designed a model for an Amharic-English Search Engine and
developed a bilingual Web search engine based on the model that enables Web users to find
the information they need in Amharic and English. In doing so, we have identified
different language dependent query preprocessing components for query translation. We have
also developed a bidirectional dictionary-based translation system which incorporates a
transliteration component to handle proper names which are often missing in bilingual lexicons.
We have used an Amharic search engine and an open source English search engine (Nutch) as
our underlying search engines for Web document crawling, indexing, searching, ranking and
retrieving.
Key Words: Bilingual search engines, cross-lingual information retrieval, query preprocessing,
query translation, transliteration.
CHAPTER ONE
INTRODUCTION
1.1 General Background

The rapid growth in the number of online non-English speakers who use the World Wide Web as
their major source of information and means of communication has brought an enormous amount
of information in different languages and encoding schemes onto the Web. This creates new
challenges in information search and retrieval, since search and retrieval require language
specific treatment. According to [2, 13], general purpose search engines such as Google and
Yahoo often ignore the special characteristics of non-English languages, and sometimes do not
even handle diacritics well. Nevertheless, non-English speaking users use these search engines,
which do not take the structure and special characteristics of their language into account,
because they may have no alternative. This has led to the need to develop language specific
search engines instead of general purpose ones.
However, developing search engines that support only a specific language and writing scheme
does not allow users to access all the relevant documents available on the Web. This is because
there may be documents relevant to the user's query that are not in the language and script of the
query, particularly when the query language is not widely used on the Web.
According to [9], many bilingual information retrieval (BLIR) systems have used a query
translation approach since translating the document needs large bilingual corpus particularly in
morphologically rich languages. In such cases, the documents are indexed in their source
language(s), the query is translated into each of the document languages, and the retrieval is done
in the source languages. An alternative approach of bilingual information retrieval is to do
retrieval in the query language by indexing the document translations in the query language and
searching using the original query. Both query translation and document translation attempt to
map the query and the documents into a common language, but they use different translation
strategies. In query translation, a query may be short and non-grammatical with little or no
context, whereas in document translation, each document is a large, coherent context with full,
grammatical sentences. On the other hand, once a document is translated, any mistakes or
deletions in the translation cannot be remedied, whereas translating a query allows for more
flexibility in incorporating multiple possible translations using synonyms and related terms.
For these reasons, we chose the first method for our system. In our system, however, the
documents are crawled, indexed, searched, and retrieved in both the query and the document
languages.
Any Cross Language Information Retrieval (CLIR) system integrates a machine translation
system as its core component to translate either the query into the document language or the
documents into the query language. The machine translation (MT) system may use different
approaches, such as dictionary-based, rule-based, or corpus-based translation. The choice of
translation approach may depend on the linguistic resources available for the languages, on
whether the translation is query translation or document translation, on the requirements of the
translation system, and on other factors. With these selection criteria in mind, we have followed
a dictionary-based MT approach for our system.
Dictionary-based translation, however, suffers from one serious problem: the limited word
coverage of dictionaries, which shows up as Out-Of-Vocabulary (OOV) words in the query. This
occurs because queries often contain proper names and borrowed words that are not usually
present in bilingual lexicons [11].
As a result, the translation system in bilingual information retrieval should also integrate a
transliteration component to alleviate the OOV problem by rendering such words in the target
language without using a dictionary. The OOV problem is further exacerbated for applications
involving translation between languages that use different scripts, such as English and Amharic
[8]. Bilingual applications should therefore also deal with proper names and their transliteration
into the other language and script.
Amharic is the second most widely spoken Semitic language in the world, after Arabic, and the
official working language of the Federal Democratic Republic of Ethiopia (a country with a
population of more than 78 million). It has been the working language of the government, the
military, and the Ethiopian Orthodox Tewahedo Church throughout modern times. Outside
Ethiopia, Amharic is the language of some 2.7 million emigrants (notably in Egypt, Israel, and
Sweden) [15, 16]. Thus, it has official status and is spoken by many people as their first or
second language. Of those who speak Amharic, a significant number (usually the educated class)
can understand and speak English as well. The proportion of Amharic content on the Web is
very small compared to English. However, that content can differ from what is published in
English, as Amharic content mostly relates to Ethiopian events. Thus, using only one of the
languages does not allow access to all the available relevant documents.
1.2 Statement of the Problem

There is a significant amount of work on Amharic-English information retrieval [1, 3, 4] and a
little on Amharic search engines [2, 17]. However, to the best of our knowledge, there has been
no attempt at Amharic-English information retrieval on the Web (i.e., a bilingual Amharic-English
search engine) that accepts user queries according to the user's language preference and returns
results in both languages. Therefore, monolingual Amharic-speaking Web users currently have to
know English to use the Web.
1.3 Motivation
With the availability of a vast amount of information on the Web, people's interest in using
search engines to locate information is increasing rapidly. As the currently available search
engines do not handle well all the Web documents written in different languages, there is a need
to develop language specific search engines. Amharic is one of the languages with rapidly
growing content on the Web. It has a complex morphology that combines consonantal roots with
vowel intercalation. Amharic and English differ substantially in their morphology, syntax and
writing systems. Therefore, search engines that are mainly developed for English cannot
efficiently be used to retrieve Amharic documents.
Until recently, Ethiopian Web users had to know English because there were few alternatives for
using the Web in their native language. To alleviate this problem, an Amharic search engine that
searches the Web for Amharic documents has already been developed [2, 17]. However, the
majority of Web documents are published in English rather than Amharic, and accessing only
Web documents written in Amharic may not satisfy user requirements, since there may also be
documents relevant to a user's query in English.
1.4 Objectives
1.4.1 General Objective

The general objective of this research work is to design and develop a generic model for a
bilingual search engine that integrates a query translation system for Amharic and English
languages.
1.4.2 Specific Objectives

The specific objectives of this research work are:

Studying and analyzing the language specific behaviors of Amharic and English.
Analyzing the Amharic/English and English/Amharic query translation requirements.
Designing a bidirectional bilingual (Amharic/English) search engine model.
Exploring Amharic preprocessing tools to prepare the query for translation.
Exploring and integrating Amharic-English bilingual dictionary for the purpose of
translating queries from one language to the other.
Developing a bidirectional Amharic/English dictionary based machine translation system
for the purpose of query translation.
Developing and integrating a transliteration system for proper names and borrowed words,
which are often not present in the Amharic-English bilingual dictionary but frequently
occur in queries.
Designing an interface that accepts the user's query in either English or Amharic.
Exploring and adopting Web based search engines for both Amharic and English
languages.
Developing an algorithm that integrates Amharic and English search engine results in
order to search both Amharic and English Web documents.
Developing the prototype of our work and demonstrating the effectiveness of the
proposed system.
1.5 Scope and Limitations of the Study
Web documents are written in many different languages, scripts and encodings, and come in
different kinds, such as image, audio, video and text. However, this research work considers
only text documents on the Web that are written in Amharic and English.
A fully functional bilingual search engine requires a number of tools, such as a Part-Of-Speech
(POS) tagger, a stemmer, and a Named Entity Recognizer (NER), for query preprocessing and
query translation. However, some of these tools are not publicly available, some do not have full
functionality, and some have not yet been developed. Most of the Amharic Natural Language
Processing (NLP) tools that we used for query preprocessing and the Amharic search engine
were developed for academic purposes by postgraduate students. For the evaluation and
demonstration of our system, we have integrated and used those tools that are publicly available,
despite their limited functionality; tools such as the NER are not integrated at all, since they
have not yet been developed. In addition, because no bilingual Amharic-English Machine
Readable Dictionary (MRD) is available, we have used an in-house dictionary with a limited
number of words.
1.6 Methodology
To achieve the general and specific objectives of the research, we have reviewed relevant
literature and used open source software and programming tools.
1.6.1 Literature Review

Different literatures considered relevant to the research have been reviewed, and some of their
concepts have been adopted for our work. Since our research work is on bilingual search engines,
it touches on several areas: scripts, word translation between different languages, word
transliteration between languages that use different scripts, bilingual information retrieval
systems, and search engines. Almost all of these are recent research areas and are extensively
researched, particularly for languages such as Japanese, Chinese, Korean, and Indian languages.
The characteristics of Ethiopic character encodings and their phonetic representations have been
studied, and related works have been reviewed to properly design and develop a transliteration
system to translate proper names and borrowed words that are not normally present in the
Amharic-English bilingual dictionary but frequently occur in queries. For bilingual
information retrieval, there are research works on Amharic-English information retrieval, which
we have studied and used as our source of knowledge for query translation, in parallel with
related works on other languages.
A search engine developed for Amharic language [2, 17] has been reviewed to understand the
way in which searching and retrieving Amharic Web documents can be performed. Search
engines developed for languages other than Amharic and English have also been used as
references. Bilingual and multilingual search engines have been reviewed and some of the
important concepts are adopted for our work.
1.6.2 Analysis and Design

To design the model of the bilingual Amharic-English search engine, different translation system
models, as well as bilingual and multilingual search engine models developed for other
languages, were studied.
• Translation module development: The translation module is developed using the Java
programming language, and we have used the JDBC API to access our bilingual
dictionary (see the sketch after this list).
• Open source search engine customization and configuration: Nutch, an open source
search engine, is customized, configured and used to crawl, index and search Web
documents in both Amharic and English.
• User Interface design: JavaServer Pages (JSP) is used to design the user interface of the
system, and Apache Tomcat is used as our servlet container.
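As an illustration of the dictionary access mentioned in the first bullet above, the following minimal JDBC sketch looks up the English candidate translations of an Amharic word. The table name ("dictionary") and column names ("amharic", "english") are hypothetical placeholders, not the actual schema of our system.

```java
import java.sql.*;
import java.util.*;

// Minimal sketch of a JDBC lookup against the bilingual dictionary.
// The table name ("dictionary") and its columns ("amharic", "english")
// are hypothetical placeholders, not the schema used in this work.
public class DictionaryLookup {
    public static List<String> englishFor(Connection con, String amharicWord) throws SQLException {
        List<String> translations = new ArrayList<>();
        String sql = "SELECT english FROM dictionary WHERE amharic = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, amharicWord);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) translations.add(rs.getString("english"));
            }
        }
        return translations; // may contain several candidate translations
    }
}
```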
1.6.3 Evaluation
Web documents were collected by crawling selected websites. Since query selection has an
important impact on CLIR, queries were carefully prepared, and different monolingual and
cross-language runs were conducted to evaluate the performance and efficiency of the system.
Domain experts were involved to judge the relevance of documents to the queries. Finally,
conclusions and recommendations were drawn from the evaluation results.
1.7 Application of Results

Even though most Internet users in Ethiopia are not native English speakers, they are forced to
use search engines that are mainly designed for English document retrieval. A bilingual search
engine will allow monolingual Amharic speakers to access and retrieve the vast online
information resources available in English and Amharic by using queries in their own mother
tongue.
In the tourism domain, the bilingual search engine will play a vital role by allowing tourists and
visitors to state their query language preferences and by providing Web documents in both
Amharic and English for both local people and foreigners. In addition, queries in the tourism
domain mostly consist of out-of-dictionary words such as person, place, monument or
organization names. These words are not likely to be present in bilingual dictionaries.
Therefore, a bilingual Amharic-English search engine that integrates proper name
transliteration will have a vital application in this domain.
International marketing companies can access, or make available, valuable Website information
that could contribute to worldwide brand recognition and online sales in either of the languages.
These companies will enjoy a diversity of visitors, ultimately improving their Websites' position
in search engine results. The availability of a bilingual search engine will help such companies
develop the multilingual Websites that are essential for a company to reach its entire target
market.
1.8 Organization of the Thesis

The rest of this thesis is organized as follows. Chapter 2 discusses information retrieval in
general and cross-language retrieval in particular, along with related issues such as translation
and transliteration. The chapter reviews the literature on information retrieval models, search
engines and their components, and multilingual search. It also discusses different query and
document translation approaches and some concepts of transliteration and its application areas.
Chapter 3 discusses related works on cross-language information retrieval and multilingual and
bilingual search engines. Chapter 4 presents a detailed discussion of the model of the proposed
system and the functions of each of its components.
The implementation details of the system such as: the tools, the algorithms, the techniques and
methods used to develop the system are described in Chapter 5. Chapter 6 presents the
experimental setups and the experimental results obtained from the proposed system along with
their interpretations and the reasons behind each of the results. Finally, Chapter 7 concludes the
thesis with the benefits of the research, recommendations, and future research directions.
CHAPTER TWO
REVIEW OF LITERATURE

2.1 Information Retrieval

The meaning of the term information retrieval can be very broad. However, according to [21],
information retrieval (IR) can be defined as finding materials (usually documents) of an
unstructured nature (usually text) that satisfy an information need from within large collections
(usually stored on computers). IR deals with the representation, storage, organization of, and
access to information items [25]. The aim of IR is to find and retrieve documents relevant to a
given query, usually where the documents and the query are in the same language. With further
advances in research and technology, the goal was extended beyond language barriers to
retrieval across different languages, which is known as Cross-Language Information Retrieval
(CLIR). To support this diversity, successive information retrieval models, each improving on
the previous ones, have been proposed [29].
2.2 Information Retrieval Models

2.2.1 Early Information Retrieval Systems

Early Information Retrieval (IR) systems were based on the Boolean retrieval model, which
allowed users to specify their information need using a complex combination of Boolean ANDs,
ORs and NOTs [29]. Users can pose any query in the form of a Boolean expression of terms.
The model views each document as just a set of words. In a Boolean system, there is no inherent
notion of document ranking: a document either matches the query or it does not.
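To make the Boolean model concrete, the following minimal Java sketch (with purely illustrative documents and query) evaluates a conjunctive query by simple set membership:

```java
import java.util.*;

// Minimal sketch of the Boolean model: each document is just a set of
// words, and an AND-query matches a document only if the document
// contains every query term. Matching is binary; there is no ranking.
public class BooleanRetrieval {
    public static void main(String[] args) {
        Map<String, Set<String>> docs = new HashMap<>();
        docs.put("doc1", Set.of("addis", "ababa", "university"));
        docs.put("doc2", Set.of("amharic", "search", "engine"));

        List<String> query = List.of("amharic", "engine"); // "amharic AND engine"
        for (Map.Entry<String, Set<String>> e : docs.entrySet()) {
            if (e.getValue().containsAll(query)) {
                System.out.println(e.getKey() + " matches");
            }
        }
    }
}
```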
2.2.2 Modern Information Retrieval Systems

Since Web data still consist mostly of text, Web IR needs the various text retrieval tools (e.g.,
term weighting, term similarity computation, query expansion) that are useful in traditional IR.
On the other hand, as the name implies, Web documents are heavily interconnected, so Web IR
also needs link analysis approaches such as PageRank and HITS. These approaches treat
hyperlinks as implicit recommendations about the documents to which they point. As stated in
[30], there are several promising Web IR tools that use combinations of these different retrieval
techniques. For instance, as in [30], Google seems to employ a variety of techniques to obtain
high performance text retrieval, influenced by a universal link analysis score called PageRank.
Google improves link analysis by combining it with text retrieval techniques and heavily
leverages the implicit human judgment embedded in hyperlinks.
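To illustrate the link-analysis idea, the following minimal Java sketch runs the textbook PageRank iteration over a tiny link graph. It is a simplified formulation under standard assumptions (damping factor 0.85, fixed iteration count); the production algorithm behind Google is far more elaborate.

```java
// Minimal sketch of the basic PageRank iteration over a tiny link graph.
// links[i] lists the pages that page i points to; d is the damping factor.
public class PageRankSketch {
    public static void main(String[] args) {
        int[][] links = { {1, 2}, {2}, {0} }; // page 0 -> 1,2; page 1 -> 2; page 2 -> 0
        int n = links.length;
        double d = 0.85;
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);

        for (int iter = 0; iter < 50; iter++) { // fixed iteration count for simplicity
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            for (int i = 0; i < n; i++)
                for (int j : links[i])
                    next[j] += d * rank[i] / links[i].length; // share rank over out-links
            rank = next;
        }
        for (int i = 0; i < n; i++)
            System.out.printf("page %d: %.4f%n", i, rank[i]);
    }
}
```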
The relevance of a document is determined by the ranking algorithm, which works on the basis
of an IR model. Several models have been used for the ranking process, including the vector
space model, probabilistic models, and the inference network model [30].
Vector Space Model

The representation of a set of documents as vectors in a common vector space is known as the
vector space model [21]. It is fundamental to a host of information retrieval operations, ranging
from scoring documents on a query to document classification and document clustering. In the
vector space model, text is represented by a vector of terms (words or phrases). The vector model
assigns non-binary weights to index terms in queries and in documents. If a term belongs to a
text, it gets a non-zero value in the text-vector along the dimension corresponding to the term.
These values are used to compute the similarity between each document stored in a system and
the user query. In this model, each document will be represented by an n-dimensional vector and
a user query is similarly represented by an n-dimensional vector, where n is the number of terms
in the vocabulary. To assign a numeric score to a document for a query, the model measures the
similarity between the query vector and the document vector. If D is the document vector and
Q is the query vector, then the similarity of document D to query Q can be given as: [29]
Sim(D, Q) = \sum_{t_i \in Q \cap D} w_{t_i,Q} \cdot w_{t_i,D}

where w_{t_i,Q} is the value of the i-th component in the query vector Q, and w_{t_i,D} is the
i-th component in the document vector D. The summation can only be done over the terms that
are common to the query and the document, since a word that is not present in either the query
or the document has a value of zero.
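A minimal Java sketch of this similarity computation follows, with the query and document represented as sparse term-to-weight maps; the weights shown are illustrative, not derived from any particular weighting scheme.

```java
import java.util.*;

// Sketch of the similarity above: Sim(D, Q) = sum over shared terms of
// w(t,Q) * w(t,D). Vectors are sparse maps from term to weight, so terms
// absent from either vector implicitly contribute zero.
public class VectorSpaceSim {
    static double sim(Map<String, Double> doc, Map<String, Double> query) {
        double score = 0.0;
        for (Map.Entry<String, Double> q : query.entrySet()) {
            Double w = doc.get(q.getKey());
            if (w != null) score += q.getValue() * w; // only shared terms count
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Double> doc = Map.of("amharic", 0.8, "search", 0.5, "web", 0.3);
        Map<String, Double> query = Map.of("amharic", 1.0, "engine", 1.0);
        System.out.println(sim(doc, query)); // 0.8: only "amharic" is shared
    }
}
```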
Probabilistic Model
Users may issue a query on the collection of documents and an ordered list of documents is
returned in response. Using a probabilistic model, the obvious order in which to present
documents to the user is to rank documents by their estimated probability of relevance with
respect to the information need. The vector-space model does not take the existing term
relationships in a document into account. The probabilistic model takes these term dependencies
and relationships into consideration. This family of IR models is based on the general principle
that documents in a collection should be ranked by decreasing probability of their relevance to a
query. This is often called the probabilistic ranking principle (PRP).
2.3 Search Engine
Web search engines provide the means of reaching information on the Web. The growing
amount of information on the Web increases people's need to use search engines. As defined in
[5], a search engine is a software package that collects Web pages on the Internet through a
robot (spider, crawler) program and stores the information on the pages with appropriate
indexing to facilitate quick retrieval of the desired information. Even though most search
engines have similar components, advances in technology have brought about several types of
general purpose, language specific and multilingual search engines.
The most important measures for a search engine are its search performance, the quality of its
results, and its ability to crawl and index the Web efficiently. The primary goal is to provide
high quality search results over a rapidly growing World Wide Web. Some of the efficient and
widely recommended search engines are Google, Yahoo and MSN, which share some common
features and are standardized to some extent. Even though there are differences in the ways
these various search engines work, they all perform three basic tasks:
They search the Internet or select pieces of the Internet based on important words.
They keep an index of the words they find, and where they find them.
They allow users to look for words or combinations of words found in that index.
2.3.1 Components of Search Engines

In order to perform these basic tasks, a Web search engine consists of three major components
[2, 17, 28]:
A Crawler Component
An Indexer Component
A Query Handler Component
Crawler
Web crawlers are an essential component of search engines. To find information on the hundreds
of millions of Web pages that exist, a typical search engine employs special software robots,
called spiders or crawlers, to build lists of the words found on Websites. A Web crawler is an
automated program that traverses the Web by downloading documents and following links from
page to page. When the crawler is building its lists, the process is called Web crawling [26].
Crawling is the most fragile application [27], since it involves interacting with hundreds of
thousands of Web servers and various name servers, which are all beyond the control of the
system.
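A minimal Java sketch of such a crawl loop is given below. The seed URL is a hypothetical placeholder and the link extraction is deliberately naive; a production crawler such as Nutch adds robots.txt handling, politeness delays, URL normalization and failure recovery.

```java
import java.io.InputStream;
import java.net.URI;
import java.util.*;
import java.util.regex.*;

// Minimal breadth-first crawl sketch: fetch a page, extract links, enqueue
// unseen ones, and repeat until a small page budget is exhausted.
public class CrawlerSketch {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/")); // hypothetical seed
        Set<String> seen = new HashSet<>(frontier);
        int budget = 10; // crawl only a few pages in this sketch

        while (!frontier.isEmpty() && budget-- > 0) {
            String url = frontier.poll();
            String html;
            try (InputStream in = URI.create(url).toURL().openStream()) {
                html = new String(in.readAllBytes());
            } catch (Exception e) {
                continue; // skip pages that cannot be fetched
            }
            // ... hand the page text to the indexer here ...
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                String link = m.group(1);
                if (seen.add(link)) frontier.add(link); // follow each link only once
            }
        }
    }
}
```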
Indexer
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information
retrieval [28]. The purpose of storing an index is to optimize speed and performance in finding
relevant documents for a search query. Without an index, the search engine would scan every
document in the corpus, which would require considerable time and computing power. Even
though the index requires additional computer storage and considerably increases the time
needed for an update, it saves the time required during information retrieval. The Indexer
component extracts all the words from each page (parsing), and records
the URL where each word occurred [27]. The result is a large lookup table that can provide all
the URLs that point to pages where a given word occurs. Data collected from each Web page are
then added to the search engine index. When you enter a query at a search engine site, your
input is checked against the search engine's index of all the Web pages it has analyzed. The best
URLs are then returned to you as hits, ranked in order with the best results at the top.
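The lookup table described above is commonly realized as an inverted index. The following minimal Java sketch illustrates the idea; production engines use far more compact on-disk structures.

```java
import java.util.*;

// Minimal inverted index sketch: the "large lookup table" described above,
// mapping each word to the set of URLs of the pages that contain it.
public class InvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Parse a page and record its URL under every word it contains.
    public void addPage(String url, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(word, k -> new HashSet<>()).add(url);
        }
    }

    // Return every URL whose page contains the given word.
    public Set<String> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }
}
```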
Query Handler
This component interacts with users and answers user queries using the index [2, 17]. The query
handler component accepts user queries (keywords) for a particular topic and looks them up in
the index component. The documents in the index that contain the keywords are selected and
displayed to the user. In displaying these pages, a ranking mechanism is used.
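For a multi-word query, the handler can intersect the URL sets of the individual keywords before handing the surviving pages to the ranking mechanism. A minimal Java sketch, building on the inverted-index idea above:

```java
import java.util.*;

// Sketch of a query handler: intersect the URL sets of all query keywords
// (Boolean AND semantics), then rank the surviving pages for display.
public class QueryHandlerSketch {
    static Set<String> answer(Map<String, Set<String>> index, List<String> keywords) {
        Set<String> result = null;
        for (String kw : keywords) {
            Set<String> postings = index.getOrDefault(kw, Set.of());
            if (result == null) result = new HashSet<>(postings);
            else result.retainAll(postings); // keep pages containing every keyword
        }
        return result == null ? Set.of() : result; // pass these to the ranker
    }
}
```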
2.3.2 Multilingual Search Engines

Due to the rapid growth of multilingual content on the Web, today's search engines are trying to
overcome language barriers in searching online information. This means that a query written in
one language can be used to retrieve Web documents written in any other language. The
proportion of Web documents written in non-English languages has been increasing as the
number of non-English Web users grows dramatically [34, 41]. To improve the ability of a
monolingual speaker to search multilingual content, there is a need to build systems that support
cross-lingual search across languages written in different scripts. A multilingual search engine
automatically translates a request and submits it to search engines in multiple languages.
Nowadays, a number of multilingual search engines that support fast search for typical queries
and achieve very good performance in the high precision region of the result list have been
developed for major languages [34, 36]. However, under-resourced languages such as Amharic
have still not been considered, because several tools needed for language specific information
retrieval and for translation to and from major languages are unavailable.
2.4 Translation
Machine translation has become a key technology in our globalized society [42]. As a result,
machine translation software is available for major language pairs and for major computer
platforms, including Web-based machine translation. Machine translation software basically
relies upon the availability of extensive linguistic resources [40]. To develop a good quality
machine translation system, we need to use large collections of parallel texts, aligned at the
sentence level and amounting to at least several million words. At the same time, parallel
corpora of this size tend to be very rare, especially for under-resourced languages, which makes
it difficult to create a new good quality machine translation system for these language pairs.
Machine translation (MT) software is special in the way it strongly depends on data [43, 44, 46].
Rule-based machine translation (RBMT) depends on linguistic data such as morphological
dictionaries, bilingual dictionaries, grammars and structural transfer rule files. Corpus-based
machine translation (such as statistical machine translation and example-based machine
translation) depends, directly or indirectly, on the availability of sentence-aligned parallel text.
However, in all cases, one may distinguish three components [43, 44]:
i) Engine: the program that carries out the translation, typically through steps such as:
• Part-of-speech tagging
• Transfer from the source language to the target language
• Morphological synthesis
• Restoring the format of the text
However, the sophistication of the translation engine depends on the aims for which the
translation system is designed. For instance, a translation system designed for query translation
in cross-language information retrieval may not need to consider the grammar and structure of
the query, for several reasons [50].
• First, most IR models are based on bag-of-words representations. They therefore take
no account of the syntactic structure and word order of queries, which makes query
translation easier to develop.
• Second, queries submitted to IR systems are usually short and therefore unable to
describe the user's information need in an unambiguous and precise way. This
indicates that the complex processing used in machine translation to produce
grammatical translations is not fully exploited by current IR models.
ii) Linguistic data: a bilingual dictionary, or bilingual corpora with structural transfer
rules that perform grammatical and other transformations between the two languages
involved in each direction, and control data for each of the part-of-speech taggers.
iii) Tools (optional): used to maintain these data and turn them into a format suitable for
the engine to use, including compilers that turn the linguistic data into the fast and
compact form used by the engine, and software to learn disambiguation or structural
transfer rules.
Machine translation is one of the toughest problems in natural language processing [40]. It is
generally accepted, however, that machine translation between close or related languages is
simpler than full-fledged translation between languages that differ substantially in morphological
and syntactic structure.
2.4.1 Translation and Information Retrieval
Information retrieval is traditionally based on matching the words of a query with the words of
a document collection. In Cross-Lingual Information Retrieval (CLIR), the query and the
document collection are in different languages, so this kind of direct matching is impossible.
Therefore, translation is needed, either of the query into the language of the documents or of the
documents into the language of the query. Obviously, translating the whole document collection
is more demanding, as it requires scarcer resources such as a full-fledged machine translation
system, which is not available for many languages in developing countries. Hence, query
translation techniques are more feasible and more common in the development and
implementation of CLIR systems, particularly for under-resourced languages, as in an
Amharic-English CLIR system.
Query Translation
The basic methods of query translation are statistical machine translation (SMT), corpus-based
translation, and knowledge-based or dictionary-based translation. Since established MT systems
and/or parallel corpora are not available for under-resourced languages such as Amharic, it is
advisable to base query translation on a machine-readable dictionary when building bilingual or
cross-lingual information retrieval systems.
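A minimal Java sketch of such dictionary-based, word-by-word query translation follows. Tokens missing from the lexicon are set aside as out-of-vocabulary (OOV) candidates for transliteration; this is an illustrative simplification, not the exact implementation described later in this thesis.

```java
import java.util.*;

// Word-by-word dictionary-based query translation sketch. Tokens not found
// in the bilingual lexicon are collected as out-of-vocabulary (OOV) words,
// to be handled by a transliteration component instead.
public class QueryTranslator {
    static List<String> translate(List<String> queryTokens,
                                  Map<String, List<String>> lexicon,
                                  List<String> oovOut) {
        List<String> translated = new ArrayList<>();
        for (String token : queryTokens) {
            List<String> candidates = lexicon.get(token);
            if (candidates == null) oovOut.add(token);  // likely a proper name
            else translated.addAll(candidates);          // keep all senses for recall
        }
        return translated;
    }
}
```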
Document Translation
Translating the document collection is less practical because of the demanding cost of preparing
large bilingual corpora and other translation tools. Moreover, writing a topic in another language
and then asking the search engine to automatically translate each document to be fetched into
the language of the query before retrieval degrades retrieval effectiveness compared to a
monolingual search in which requests and documents are written in the same language. For
these reasons, document translation is not commonly used, especially for under-resourced
languages. In addition, users typically give isolated words or, at best, short phrases (three to four
words) to an IR system. Therefore, the question should be how best to translate a set of isolated
words into the target language to retrieve documents in a language different from the query
language, rather than sentence-based translation.
2.4.2 Approaches of Machine Translation
Machine translation can use a method based on linguistic rules, in which the words of the source
language are translated into the most suitable words of the target language on linguistic grounds.
It is often argued that the success of machine translation requires a better understanding of
natural language first. Such methods require extensive lexicons with morphological, syntactic,
and semantic information, and large sets of rules. Machine translation systems differ in
sophistication, and there are several basic approaches to translation.
2.4.2.1 Rule-based Approach

Rule-based systems were the first 'real' kind of machine translation system. Rather than simply
translating word for word, rules are developed that allow words to be placed in different
positions and to have different meanings depending on context. In this approach, the original
text is first analyzed morphologically and syntactically to obtain a syntactic representation. This
representation can then be refined to a more abstract level, putting emphasis on the parts
relevant for translation and ignoring other types of information. The transfer process then
converts this final representation (still in the original language) into a representation at the same
level of abstraction in the target language. Translation is carried out by repeated pattern
matching and transformation of the tree or graph structures that represent the syntax or
semantics of a sentence. A rule-based system is an effective way to implement machine
translation because of its extensibility and maintainability; however, it is at a disadvantage in
processing efficiency.
Transfer-based systems are another approach to rule-based machine translation, influenced by
the interlingua idea. Instead of using a whole intermediate language, an intermediate
representation of equivalent pieces is used. This still uses language-pair-specific translation, but
the amount of language-specific rules is reduced to a minimum. There are two kinds of
transfer-based translation:
Shallow transfer (syntactic), where words are translated based on combinations of word
types.
Deep transfer (semantic), which uses a representation of the meaning of each word as a
basis for how it should be translated.
Interlingua systems are an extension of rule-based systems that use an intermediate language
instead of direct translation [39, 45]. Systems based on Interlingua can then more readily
translate between various combinations of languages. In this approach, the source language (the
text to be translated) is transformed into an Interlingua, an abstract language-independent
representation. The target language is then generated from the interlingua. The interlingua
approach rests on the assumption that all languages share a common underlying representation.
In interlingua-based MT, the intermediate representation must be independent of the languages
in question, which distinguishes it from the transfer-based approach. Within the rule-based
machine translation paradigm, the interlingual approach is an alternative to the transfer-based
approach. One of the advantages of this approach is that no transfer component has to be created
for each language pair. However, the definition of an Interlingua is difficult and may be even
impossible for a wider domain. Figure 2.2 shows translation graphs for interlingua-based
machine translation using two interlinguas.
Figure 2.3: Transfer and Interlingua pyramid diagram (Source: [40]).
Machine translation can also use a method based on dictionary entries, which means that words
are translated as they stand by a dictionary, usually without much correlation of meaning
between them. Dictionary lookups may be done with or without morphological analysis. This
approach is the least sophisticated and is suitable for the translation of long lists of phrases and
for languages that lack richer linguistic resources.
Corpus-based approaches, such as statistical and example-based machine translation, derive
their translation knowledge from large bilingual parallel corpora. Where such corpora are
available, impressive results can be achieved translating texts of a similar kind, but such corpora
are still very rare.
2.4.2.4 Hybrid Machine Translation

This approach to machine translation is a combination of statistical and rule-based or
example-based approaches with some elementary grammatical analysis. Hybrid machine
translation (HMT) leverages the strengths of both statistical and rule-based or example-based
translation methodologies. The hybrid approach improves the translation of large volumes of
speech and text compiled from a variety of sources and assists linguists, translators, and analysts
in achieving greater productivity more quickly and more cost effectively.
From this discussion, we have learned which translation approaches are suitable for which types
of languages, and which are suitable for which types of translation systems. Based on the
discussion, under-resourced languages such as Amharic, which suffer a scarcity of translation
tools, should use a machine-readable dictionary; and translation systems for cross-language
information retrieval do not need highly sophisticated translation, so simple approaches such as
direct word-to-word translation are sufficient for query translation.
2.5 Transliteration
Transliteration, as defined in [22, 23, 24], is the process of converting terms written in one
language into their approximate spelling or phonetic equivalents in another language. It converts
a word or phrase into the closest corresponding letters or characters of a different alphabet or
language so that the pronunciation is as close as possible to the original word or phrase.
Transliteration is defined for a pair of languages, a source language and a target language. The
two languages may differ in their script systems and phonetic inventories. Transliteration can
also be viewed as the process of replacing words in the source language with their approximate
phonetic or spelling equivalents in the target language. Thus, transliteration is meant to preserve
the sounds of the syllables in words. Transliteration is more phonetic than orthographic. Even
though transliteration between languages that use similar alphabets and sound systems is
relatively simple, transliterating names from Amharic into English is a non-trivial task, mainly
due to the differences in their sound and writing systems. The method of transliteration also
depends on the characteristics of the source and target languages (in this case English and
Amharic) and on the larger purpose the transliterator is meant to serve.
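As a deliberately tiny illustration of character-level Ethiopic-to-Latin transliteration, the following Java sketch maps individual Ethiopic syllables to romanizations. The three-entry table covers only the single word ሰላም (selam); a usable transliterator needs the full Ethiopic syllabary and context-sensitive rules.

```java
import java.util.*;

// Tiny character-level transliteration sketch from Ethiopic to Latin.
// The three-entry table covers only the word "ሰላም" (selam); a real
// transliterator needs the full Ethiopic syllabary plus context rules.
public class TransliterationSketch {
    static final Map<Character, String> ETHIOPIC_TO_LATIN = Map.of(
        'ሰ', "se",
        'ላ', "la",
        'ም', "m"
    );

    static String transliterate(String amharic) {
        StringBuilder latin = new StringBuilder();
        for (char c : amharic.toCharArray()) {
            latin.append(ETHIOPIC_TO_LATIN.getOrDefault(c, "?")); // unknown syllable
        }
        return latin.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("ሰላም")); // -> selam
    }
}
```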
Transliteration is helpful in situations where one does not know the script of a language but
knows how to speak and understand it. Transliteration techniques, especially name
transliteration, are useful in several areas including machine translation and information
retrieval, since proper names commonly occur among important query words. The translation of
proper names is generally recognized as a significant problem in many multilingual text and
speech processing applications. Even when the large bilingual lexicons used for machine
translation (MT) and cross-lingual information retrieval (CLIR) provide significant coverage of
the words encountered in a text, a significant portion of the tokens they do not cover are proper
names [12]. For CLIR applications in particular, proper names and technical terms are especially
important, as they carry some of the more distinctive information in a query. In IR systems,
where users provide very short queries (2-3 words), their importance grows even further.
Therefore, an efficient transliteration system is required to handle the problems related to proper
names.
Transliteration has attracted growing interest recently, particularly in the field of Machine
Translation (MT). It handles those terms for which no translation would suffice or even exist.
Some of the terms (words) that need transliteration in MT systems are:
• Person names, although many of them have homographs that can be translated.
• Foreign words, which retain the sound patterns of their original language with no
semantic translation involved.
• Names of countries may often be subject to transliteration instead of translation.
There are several situations where transliteration is especially useful, such as the following:
• When a user views names that are entered in a world-wide database, it is extremely
helpful to view and refer to the names in the user's native script.
• When the user performs searching and indexing tasks, transliteration can retrieve
information in a different script.
• When a service engineer is sent a program dump that is filled with characters from
foreign scripts, it is much easier to diagnose the problem when the text is transliterated
and the service engineer can recognize the characters.
CHAPTER THREE
RELATED WORK
When users search for Web information on a specific topic or in a language other than English,
they often find it difficult to locate useful, high-quality information using general-purpose
search engines. In response to this problem, many domain-specific or language-specific search
engines capable of searching in English and in the users' native language have been built. These
enable users to search more efficiently over collections of Web documents in their own domain
or native language.
In this chapter, we have reviewed and discussed some related works on cross lingual information
retrieval from the point of view of the translation approaches used, the information retrieval
model or tool used, the language pairs involved, and so on. We focus on Cross-Lingual
Information Retrieval (CLIR), multilingual/bilingual search engines, Amharic-English
cross-lingual information retrieval, and information retrieval for non-English queries in general
purpose search engines.
3.1 Cross-Lingual Information Retrieval

The work of Mohammed et al. [35] evaluates the use of a Machine Translation based approach
for query translation in an Arabic-English Cross-Language Information Retrieval (CLIR)
system. Arabic is a Central Semitic language, thus related to and classified along with
other Semitic languages such as Hebrew and the Neo-Aramaic languages. It is spoken by a large
number of people as a first and second language and it is the largest member in Semitic language
family in terms of speakers [35]. Modern Standard Arabic is widely taught in schools,
universities, and used in workplaces, government and the media. Amharic shares a number of
linguistic properties with Arabic, for which active research has been carried out, except that
Amharic is written from left to right, unlike Arabic. As Arabic is a relatively widely researched
Semitic language that shares a number of properties with Amharic, some recent computational
linguistics research on Amharic [48, 49] has recommended customizing and using the tools
developed for Arabic.
The work of Mohammed et al. [35] used ALKAFI, a commercial Arabic-English machine
translation system developed by the CIMOS Corporation and the first Arabic-to-English
machine translation system, to evaluate the effectiveness of a machine translation based
approach to CLIR. The system was evaluated on two standard TREC collections and topic sets
using three query types (title, description, and narrative, treated as short, medium, and long
queries respectively) to determine the effect of query length on the performance of the machine
translation based method for CLIR. The results showed that the machine translation approach
achieved 61.8%, 64.7%, and 60.2% for the title, description, and narrative fields, respectively.
Another work on cross-lingual information retrieval is that of P. L. Nikesh et al. [36], which
describes an English-Malayalam Cross-Lingual Information Retrieval system. The system
retrieves Malayalam documents in response to queries given in English or Malayalam. It
supports both cross-lingual (English-Malayalam) and monolingual (Malayalam) information
retrieval. Due to the absence of a full-fledged online bilingual dictionary and/or
parallel corpora to build the statistical lexicon for the language, the authors used a bilingual
dictionary developed in house for translation. For document ranking and retrieval, a system
developed based on the vector space model (VSM) has been used.
India is a multilingual country with 22 official languages. Malayalam is one of the most
prominent regional languages of the Indian subcontinent. It is spoken by more than 37 million
people and is the native language of the Kerala state in India [36]. As Malayalam is an
under-resourced language like Amharic, the query translation approach that the authors follow
and the Information Retrieval (IR) tool they use are also relevant to developing CLIR for
Amharic.
3.2 Multilingual Search Engines
Even though ‘traditional’ cross lingual information retrieval techniques produced satisfactory
results as we have discussed in Section 3.1, they cannot be employed directly in Web
applications. As stated in [34], there are several factors that make multilingual Web retrieval
different from traditional CLIR:
Web pages are more unstructured and are very diverse in terms of document content and
document format (such as HTML or ASP).
Traditional CLIR usually focuses on effectiveness, measured in recall and precision.
However, Web retrieval is also concerned with efficiency for end users (i.e., response time
and query length).
As most cross-lingual information retrieval researchers have used standard text collections, such
as news articles, for their test sets, they have not encountered the problems related to
multilingual Web retrieval and have consequently obtained better performance results. However,
at present, navigation in this multilingual information space is far from the ideal scenario, and
the study of integrating CLIR techniques into multilingual Web retrieval systems has arisen. In
this subsection we review and discuss two multilingual search engines and one cross-lingual
Web portal.
Joanne Capstick et al. [20] developed MULINEX, a fully implemented multilingual search
engine available in German, French and English. The system was developed by the MULINEX
consortium, which consists of five European companies that aim to improve their
competitiveness in the Internet market through the development and application of advanced
language technology for providing improved, user-friendly Web search and navigation services.
MULINEX is a multilingual Internet search engine that supports selective information access,
navigation and browsing in a multilingual environment. The goal of MULINEX development is
to find techniques for the effective retrieval of multilingual documents from the Internet. It
facilitates multilingual information access with navigation and browsing, enabling effective
multilingual searching on the Internet by providing translation of queries, customized
summaries, and thematic classification of documents.
MULINEX incorporates Web spiders, concept-based indexing, relevance feedback, translation
disambiguation, document categorization, and summarization functionalities. It also translates
retrieved documents into the user's language so that the user can read them. In this system,
queries are morphologically analyzed and then translated using multilingual dictionaries. Since
the retrieval performance of automatically translated queries is poorer than that of monolingual
information retrieval, there is an optional step of user interaction, where the user can select terms
from the translated query and add his own translations. Summaries and results of foreign
language documents are translated on demand using the LOGOS commercial machine
translation system.
Another work on multilingual search engines is that of Wen-hui Zhang et al. [18]: a multilingual
Chinese-English search engine developed in a project of the Chinese Academy of Sciences
completed in June 1998. The project provides a solution to the multilingual search problem. The
work was conducted with the intention of developing systems that can efficiently search, index
and retrieve multilingual information for Chinese (the mother tongue) and English. The search
engine provides Chinese and English indexing, retrieval and searching using queries in either
language. In this research work the authors presented the concepts, technologies, algorithms and
detailed measures for providing highly relevant and up-to-date multilingual information search,
indexing, and retrieval.
The system has uniform query interfaces for both Chinese and English to allow the user to
conduct searching. The character encoding detection module determines the language and sends
the request to the database supporting the corresponding language. A set of information retrieval
functions that natively support Chinese and English searches was developed, including Boolean
queries (and, or, not), phrase queries, fuzzy queries, proximity queries, stemming, language
parsing, and spelling error override. The system consists of four parts: the information gatherer,
the search engine, the query server and the scheduler. Each part has several functionalities of its
own in the information retrieval process. In this work, the procedure of retrieving information is
as follows.
i. User submits a query, with search options such as preferred categories, to the query server.
ii. The encoding detection module is invoked to detect the language of the query.
iii. The query is reformulated by the query server and passed to appropriate search engines.
iv. The search engines work in parallel and return ranked results to the query server. Each
result is hypertext-linked to the Web site which contains the expected document.
v. The query server reorganizes the results and returns them to the user.
In this system, the query server accepts the user's query and invokes the related search engine,
organizes the retrieved results provided by the search engine, and returns documents only in the
language of the user's query. There is no query translation or cross lingual information retrieval;
rather, the user types the query in his/her own preferred language and gets the results in the
language of the query.
From this paper we have learnt that we can retrieve Web documents in the language of the query
from several search engines by detecting the language of the query. As a result, in our work, to
search, index and retrieve both English and Amharic Web documents, we can use one of the well
known general purpose search engines for English Web documents and the Amharic Search
Engine [2, 17] for Amharic Web documents.
As the translation system is the core component of every cross lingual information retrieval
system, the authors of the English-Chinese cross lingual Web portal [34] exhaustively compared
the alternatives and made convincing decisions about the query translation approach best
applicable to Web retrieval. Among the three translation approaches used in different translation
systems (machine translation based, corpus based and dictionary based), the authors chose the
dictionary-based approach after pointing out several reasons. They combined the dictionary
based approach with phrasal translation, co-occurrence analysis, and pre- and post-translation
query expansion for translation disambiguation.
According to [34], the dictionary based approach to query translation is the most promising for
Web applications for two reasons. First, the machine-readable dictionaries used in
dictionary-based translation are more widely available and easier to use than the parallel corpora
required by the corpus-based approach; the limited availability of existing parallel corpora
cannot meet the requirements of practical retrieval systems in today's diverse and fast-growing
Web environment. Second, the dictionary-based CLIR approach is more flexible, easier to
develop, and easier to control than machine translation based CLIR, which leaves little room for
users to modify it for their specific purposes, or is too costly.
The English and Chinese collections for the Web portal were built using a digital library
development tool called the AI Lab SpidersRUs toolkit. The toolkit is capable of building
collections in different languages and encodings. It was designed by the same research group and
has components that support functionalities such as document fetching, document indexing,
collection repository management, and document retrieval.
The researchers conducted an experiment to measure the effectiveness and efficiency of the Web
portal system following TREC evaluation procedures. The system was evaluated by business/IT
domain experts fluent in both English and Chinese, using a set of queries in both languages. The
results showed that co-occurrence-based phrasal translation achieved a 74.6% improvement in
precision over simple word-by-word translation. When pre- and post-translation query expansion
were used together, the improvement rose slightly to 78.0% over the baseline word-by-word
translation approach.
The authors strongly recommended dictionary based approaches for Web based applications,
especially for under resourced languages. The work also showed the effectiveness of the
approach, as queries are usually short and IR models are based on bags of words.
Not much work has been done on Amharic-English bilingual information retrieval. The three
successive research works of Atelach Alemu are discussed in this subtopic. Atelach, in her three
consecutive research works entitled Dictionary-based Amharic - English Information Retrieval
[4], Amharic-English Information Retrieval [3], and Amharic-English Information Retrieval with
Pseudo Relevance Feedback [1], discussed cross lingual information retrieval between Amharic
and English. In all three works, the author used Amharic-English machine readable dictionaries
and an online Amharic-English dictionary for query term translation, with additional
enhancements to query translation, indexing and searching from one work to the next. The
Amharic topic set used in all the experiments was constructed by manually translating the
English topics. As the experimental results showed, progressive performance improvements
were observed from the first research work to the next, and the challenges related to these issues
were discussed.
In the first research work [4], for words that have more than one translation, all possible
translations were taken and manual disambiguation was performed. Out-of-vocabulary words
(such as proper names and borrowed words) were manually added to a separate dictionary. Two
variants of the same basic dictionary based approach, which differ mainly in the way less
informative words are identified and removed from the query (stop word removal), were
implemented and evaluated. A retrieval engine supporting the Boolean and Vector Space Models
was used for information retrieval. According to the paper, average precisions of 0.4009 and
0.3615 were achieved by the two methods.
In the second research work [3], morphological analysis and part of speech tagging were used in
the query translation process, and Out-Of-Dictionary terms were handled using fuzzy matching.
For document indexing and searching, Lucene, an open source retrieval engine, was used. Four
experiments that substantially differ from one another in terms of query expansion, fuzzy
matching, and usage of the title and description fields in the topic sets were conducted. As the
author reported, the experiments showed better retrieval performance for Amharic when
compared to the runs in the previous research.
In the third research work [1], for the translation of query terms, precedence was given to
matching bigrams over unigrams of query terms in the dictionaries, and Out-Of-Dictionary
Amharic query terms were taken to be possible named entities. These Out-Of-Dictionary terms
were further filtered through restricted fuzzy matching, based on edit distance, against
automatically extracted English proper names. The Lemur toolkit for language modeling and
information retrieval was used for indexing and retrieval. After the first top ranked documents
were retrieved, the highest weighted terms were taken to expand the query using Pseudo
Relevance Feedback. The paper reported that very limited experiments were conducted and low
indexing and retrieval precision was observed. The experiments were conducted only to show
the effects of issues such as short versus long queries, the use of Pseudo Relevance Feedback,
and taking the first translation given versus maximally expanding query terms with all
translations given in the dictionaries.
As Atelach's research works were conducted from scratch due to the unavailability of bilingual
resources for Amharic, the experimental results ended up with low precision. Since cross lingual
information retrieval requires a number of tools such as a stemmer, part of speech tagger, named
entity recognizer, morphological analyzer, etc., the papers reported that the effectiveness and
efficiency of a CLIR system are highly dependent on these tools, which cannot be developed
within a short period of time.
The Amharic-English cross lingual information retrieval works of this researcher did not address
information retrieval on the Web. However, as discussed in Section 3.2, cross lingual
information retrieval on the Web has additional overheads compared to traditional CLIR, due to
several factors specific to Web documents.
Atelach [1, 3] also mentioned the importance of handling out of dictionary words. These words
are usually proper names and borrowed words, which are often not present in the bilingual
dictionary but frequently occur in queries. The researcher handled these terms by manually
adding them to the dictionary and by a method of fuzzy matching. However, this practice leads
to the buildup of a huge bilingual dictionary, which reduces the performance of the translation
system, especially for Web CLIR. In most recent research, such terms have been handled using a
transliteration system integrated into the CLIR system.
Most Web users begin their Web activities by submitting a query to a search engine such as
Google, MSN, AltaVista or Yahoo. General purpose search engines are the most popular tools to
search for, locate, and retrieve information on the Web, and their use has been growing at
exponential rates. Although English is still the dominant language on the Web, information in
other languages is steadily gaining prominence. However, users often find it difficult to search
for useful and high-quality information on the Web using general purpose search engines when
they search in a language other than English. A search engine that can handle these problems of
language difference is therefore highly desirable and has been researched for some of the major
languages.
Several research works have been conducted to evaluate the effectiveness and performance of
general purpose search engines in handling non-English queries. These works compared the
results with language specific search engines and arrived at similar conclusions. The work in
[37] explores the characteristics of the Chinese language and how queries in this language are
handled by different search engines. The authors compared the results of entering queries in two
major search engines (Google and AlltheWeb) and two search engines developed for Chinese
(Sohu and Baidu). The results showed that the performance of the two major search engines was
not equivalent to that of the search engines developed for Chinese.
A research work in [13] was also conducted to test general (English oriented) and local
(non-English oriented) search engines on handling queries in Russian, French, Hungarian, and
Hebrew. The authors ran queries in all four languages in the local and general search engines and
found that in most cases the latter ignored the special characteristics of the language of the
queries and did not properly handle diacritics. Based on the results, the authors recommended
that morphological variation among languages be considered by the developers of search
engines, and that users be made aware of what they miss when they use a general search engine
to find information in languages other than English.
Another research work [38] that used three general search engines (AlltheWeb, AltaVista, and
Google) and three Arabic engines (Al bahhar, Ayna, and Morfixa) was conducted over Arabic
HTML documents using Arabic queries. The queries were constructed to highlight the special
characteristics of Arabic, especially the occurrence of prefixes, which are very common in
Arabic words. The results of the searches showed that the general search engines' search features
and indexing algorithms did not handle Arabic queries well and did not provide any mechanisms
to address Arabic prefixes. The findings highlight the importance of making users aware of what
they miss by using the general engines, underscoring the need to modify these engines to better
handle Arabic queries.
In general, general purpose search engines can accept queries in several languages and return
Web documents in the language of the query. However, they have performance and efficiency
problems, as they are mainly designed for English. As with other languages, general purpose
search engines can accept Amharic queries and return Amharic Web documents [2, 17], but they
do not provide effective and efficient search functionalities over a high-quality collection of
Amharic Web documents because they are not tailored to the language-specific characteristics of
Amharic.
In our work, we have used both general purpose and language specific search engines to
efficiently access Web documents. One of the well known general purpose open source search
engines has been used to access English Web documents for queries written in English or
translated from Amharic to English, and an Amharic search engine that supports Amharic
language encodings well has been used to access Amharic documents. As a result, users of our
system have simultaneous access to Web documents written in both English and Amharic.
CHAPTER FOUR
As can be surmised from its name, our bilingual search engine is designed to pull up results in
not one but two languages by accepting queries in either of the two languages (Amharic and
English). In this chapter, we present the general architectural design of the proposed
Amharic-English bilingual search engine, as depicted in Figure 4.1. The main components of the
system, together with their subcomponents or modules, and the interactions between the main
components and their subsystems are described.
Even though the general basic architectures of bilingual systems are nearly similar, they differ
substantially in their internal functionalities depending on the languages involved, the algorithms
employed and the aims for which the bilingual search engines were designed. As a result, the
subcomponents of each component have various structures and functionalities.
Like many bilingual search engines, our proposed Amharic-English bilingual search engine
shares some of the common components, and as a result it consists of the components Query
Preprocessing, Query Translation, Amharic Search Engine, and English Search Engine, as shown
in Figure 4.1.
Since our bilingual search engine is capable of accepting queries in both Amharic and English
and query preprocessing is a language specific task, preprocessing is done in two different
modules (the Amharic query preprocessing and English query preprocessing modules). The
Amharic query preprocessing module is responsible for tokenizing Amharic queries into words,
normalizing the different characters of the Amharic script that have the same sound, expanding
short forms, eliminating stop-words (less informative words), and stemming inflectional and
some derivational Amharic morphemes. The English query preprocessing module is responsible
for tokenizing English queries into words, eliminating English stop-words, and stemming
inflectional and some derivational English morphemes. The query translation process also uses
two components, Amharic Query Translation and English Query Translation, which are
responsible for lexical transfer (dictionary lookup in the bilingual Amharic-English dictionary)
and for transliterating out of dictionary words on the assumption that they are proper names or
borrowed words. The Amharic search engine component is responsible for crawling, indexing,
and ranking Amharic Web documents; the English search engine component is responsible for
crawling, indexing, and ranking English Web documents.
To analyze the Amharic query, the Amharic query preprocessing component uses different
techniques and produces a bag of Amharic words of the query in their normal forms. This
component consists of subcomponents for tokenization, normalization, stop-word removal, and
stemming. Its output is a set of Amharic bag-of-words terms that is used as input for the Amharic
query translation component. The detailed architecture of this component is shown in Figure 4.2,
and the details of its subcomponents are discussed below.
Figure 4.2: Amharic Query Preprocessing Component
Tokenization: this is the process of demarcating, classifying, and forming tokens from an input
stream of characters. Tokenization is used to split the query phrase into a sequence of character
groups that can each be treated as a useful semantic unit for processing. During tokenization, we
chop on whitespace, throw away punctuation characters and choose the correct tokens to use.
Tokenization is language specific, so the language of the document must be known.
The first step in the query preprocessing component is tokenizing the Amharic query into words
using white space and Amharic punctuation marks. For example, in the statement
"ኢህአዴግ፣መኢአድ፣ኢዴፓና ቅንጅት በምርጫ 2002 ሥነ-ምግባር አዋጅ ላይ ተወያዩ።", unlike a human
being, a computer cannot spontaneously understand that the sentence has 10 words; it sees the
statement only as a sequence of 51 characters. Therefore, to convert it into meaningful tokens, it
has to be tokenized using Amharic language tokenization criteria as: ኢህአዴግ መኢአድ ኢዴፓና
ቅንጅት በምርጫ 2002 ሥነምግባር አዋጅ ላይ ተወያዩ
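To illustrate this step, the following minimal Java sketch tokenizes the example statement above.
The class name and the exact delimiter set are illustrative assumptions, not the actual code of our
system; intra-word hyphens are simply deleted so that, for example, ሥነ-ምግባር becomes the
single token ሥነምግባር.

import java.util.Arrays;
import java.util.List;

// Illustrative sketch: split an Amharic query on whitespace and the Ethiopic
// punctuation marks ፡ ። ፣ ፤ ፥ ፦ ፧ (U+1361..U+1367); delete intra-word hyphens.
public class AmharicTokenizerSketch {
    private static final String DELIMITERS = "[\\s\\u1361-\\u1367]+";

    public static List<String> tokenize(String query) {
        String cleaned = query.replace("-", ""); // e.g. ሥነ-ምግባር -> ሥነምግባር
        return Arrays.asList(cleaned.trim().split(DELIMITERS));
    }

    public static void main(String[] args) {
        String q = "ኢህአዴግ፣መኢአድ፣ኢዴፓና ቅንጅት በምርጫ 2002 ሥነ-ምግባር አዋጅ ላይ ተወያዩ።";
        System.out.println(tokenize(q)); // prints the 10 tokens listed above
    }
}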
Normalization: the Amharic alphabet has some redundant symbols with the same sound. For
example, አ, ኣ, ዐ and ዓ; ሰ and ሠ; and ጸ and ፀ have the same sounds. These characters can be
used interchangeably without any difference in meaning. So we can replace each group by one
character and normalize the different written forms of a word to one common form. The other
functionality of this component is expanding short words. In Amharic, it is common to shorten
some words using the forward slash ('/') or the English full stop ('.'), as in ት/ቤት for ትምህርት ቤት
or ዶ/ር for ዶክተር. These words are expanded to their normal forms for dictionary lookup. The
normalization subcomponent of the Amharic query preprocessing component handles these two
activities.
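As a minimal sketch of the first activity, the replacement can be implemented as a
character-to-character mapping; the mapping table below contains only the pairs named above,
whereas the real module also covers the derived orders of each character.

import java.util.Map;

// Illustrative sketch: map each redundant Amharic character to one canonical
// form. Only the base forms named in the text are listed; a complete module
// must also map the derived orders of each character class.
public class AmharicNormalizerSketch {
    private static final Map<Character, Character> CANONICAL = Map.of(
            'ኣ', 'አ', 'ዐ', 'አ', 'ዓ', 'አ',  // same sound as አ
            'ሠ', 'ሰ',                      // same sound as ሰ
            'ፀ', 'ጸ');                     // same sound as ጸ

    public static String normalize(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (char c : word.toCharArray()) {
            sb.append(CANONICAL.getOrDefault(c, c));
        }
        return sb.toString();
    }
}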
Stop-word removal: stop-word is the name given to words that are filtered out prior to, or after,
the processing of natural language text. Most search engines and query translation systems do
not consider extremely common words, in order to save disk space or to speed up search results.
In our system, when a posed query is analyzed, words with little content (stop words) are
removed from the query words before translation. As a result, after the query is tokenized into
words and the characters and short words are normalized, the next step in our query
preprocessing component is identifying and removing Amharic stop words such as ነው፣ ሆነ፣ ወደ፣
ናቸው፣ ውስጥ፣ etc.
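A minimal sketch of this filtering step is shown below; the stop word set contains only the five
examples above, while the full list used in our system is the one given in Appendix I.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch: drop Amharic stop words from the token list.
public class AmharicStopWordFilterSketch {
    private static final Set<String> STOP_WORDS =
            Set.of("ነው", "ሆነ", "ወደ", "ናቸው", "ውስጥ");

    public static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}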
Stemming: this is the process of reducing inflected (or sometimes derived) words to their stem or
citation forms. Most search engines and translation systems use a stemmer to compare the root
forms of the search terms to the documents in their database. As Amharic is morphologically
complex, words are inflected with prefixes, suffixes and infixes. The purpose of stemming in
query processing is to reduce the size of the dictionary by avoiding entries for the different
variants of a word in the machine readable dictionary. This increases the performance of the
translation system, which in turn has a positive impact on the effectiveness of the cross lingual
IR system in retrieving documents in the other language.
In our Amharic query preprocessing component, after the Amharic stop words are removed,
stemming is performed on the remaining Amharic words. For example, the Amharic words
"በሉ"፣ "በላን"፣ "በላች"፣ "በላሁ"፣ "በላችሁ", etc. can be reduced to their citation form "በላ". This
helps the query translation component to handle morphological variations and to find matches in
the dictionaries for as many of the query words as possible.
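The sketch below illustrates the idea of suffix stripping on the example above; it is not the
stemming algorithm actually adopted in this work (the Alemayehu and Willett algorithm
discussed in Chapter Five), and its suffix list is purely illustrative.

import java.util.List;

// Illustrative suffix-stripping sketch only; the real stemmer also handles
// prefixes, infixes and many more affixes than the few suffixes listed here.
public class AmharicStemmerSketch {
    private static final List<String> SUFFIXES = List.of("ችሁ", "ች", "ን", "ሁ");

    public static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 1) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("በላችሁ")); // በላ
        System.out.println(stem("በላች"));  // በላ
    }
}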
Like the Amharic query preprocessing component, the English query preprocessing component
uses different techniques to analyze the English query. It has three subcomponents: tokenization,
stop-word removal, and stemming. The detailed architecture of this component is depicted in
Figure 4.3, and each subcomponent is described below.
Stop-word removal: as discussed for Amharic stop-word removal, some terms appear very
frequently in collections of English language documents, and most of them are not relevant to
the retrieval task. In English, these include articles, conjunctions, prepositions, etc. (for example,
the words "a", "an", "are", "be", "for", . . .), which are referred to as English stop words. The
techniques used to remove these non-content-bearing words differ from one system to another
depending on the language of the query. This subcomponent is responsible for removing such
words from the English query sentence.
Stemming: English has inflectional, derivational and compound word morphology. Compound
words such as teapot, starlight, etc. are made from two simple words, each of which can stand
alone, joined together to form a single word. Morphemes that cannot stand alone are suffixes and
prefixes, collectively known as affixes. The suffixes and prefixes form the inflectional and
derivational morphology of the language. Past tenses of regular verbs and plural forms of nouns
are examples of inflectional morphology, which inflects or alters the word by adding suffixes. In
derivational morphology, new words are formed without reference to the internal grammar of a
sentence. Table 4.1 shows examples of the different morphologies of English.
In our English query preprocessing component, English words such as: “charger”, “charging”,
“charged”, “charges” are reduced to their base word “charge” using this subcomponent. The
output of this component is used by the English query translation module to find matches in the
dictionary.
4.3 Query Translation
Among the several translation approaches discussed in Chapter Two, our proposed query
translation component is based on dictionary lookup. As discussed in the preceding two chapters,
there are several reasons that make this approach appropriate for Web based applications,
especially for under resourced languages. We can see these reasons from two angles:
a) From the point of view of the advantages of the machine readable dictionary translation
approach [34]:
- Machine-readable dictionaries used in the dictionary-based translation approach are more
  widely available and easier to use than the parallel corpora required by the corpus-based
  approach.
- The limited availability of existing parallel corpora cannot meet the requirements of
  practical retrieval systems in today's diverse and fast-growing Web environment.
- The dictionary-based CLIR approach is more flexible, easier to develop, and easier to
  control than machine translation based CLIR, which leaves little room for users to modify
  it for their specific purposes.
b) From the point of view of current IR models [50]:
- Most IR models are based on bag-of-words models. They do not take the syntactic
  structure and word order of queries into account, and hence are easy to develop.
- Queries submitted to IR systems are usually short, so they cannot describe the user's
  information need in an unambiguous and precise way. This indicates that the complex
  processing used in machine translation to produce a grammatical translation is not fully
  exploited by current IR models.
This means that a simpler translation approach may suffice to implement the translation process.
With these ideas and the scarcity of linguistic resources for Amharic in mind, we were
encouraged to develop our query translation system based on a word-by-word translation method
that looks up terms in a general-purpose Amharic-English bilingual dictionary.
4.3.1 Amharic Query Translation
Our Amharic query translation component consists of two main subcomponents: a lexical
transfer subcomponent, which uses the Amharic-English dictionary as a knowledge base, and a
transliteration subcomponent, which uses character mappings to transliterate Amharic words into
English words. The architecture of the Amharic query translation component is shown in Figure
4.4. The details of each subcomponent are discussed below.
Figure 4.4: Amharic Query Translation Component
Lexical Transfer: the Amharic query translation process in this subcomponent is performed by
taking the list of Amharic words of the query from the Amharic query preprocessing component,
looking up the English sense sets of each term in the general-purpose Amharic-English bilingual
dictionary, and selecting the possible English translation senses from the dictionary. In short, the
Amharic words received from the query preprocessing component are automatically translated
into English words using only the bilingual dictionary.
Transliteration: one of the problems of the dictionary based translation approach is the coverage
of the words of the language. In particular, proper names and borrowed words are not usually
covered in the bilingual dictionary. As many Amharic loanwords and proper names have the
same or nearly identical pronunciations when written in the English alphabet, a transliteration
system is the best choice for transforming words from one script to the other.
In our transliteration subcomponent, all the words that are not found in the bilingual dictionary
pass through this module. In CLIR, three tasks are required to handle proper names in the query:
name identification, name translation, and name searching. However, because of the absence of a
named entity recognition system for Amharic, instead of identifying and transliterating only the
proper names, all words not found in the bilingual dictionary are subjected to transliteration, on
the assumption that they are either proper names or borrowed words, for both of which a
transliteration system is very important.
The transliteration module works by segmenting each stemmed Amharic word into Amharic
phonemes (characters), transforming each character into its corresponding English characters
based on the convention for the transcription of the Ethiopic script into ASCII, and concatenating
the resulting English phonemes (characters) into a single English word.
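A minimal sketch of this character mapping step follows; the handful of mappings shown are
taken from the SERA-style convention discussed in Chapter Five, and the position-dependent
handling of sadis characters described there is omitted here for brevity.

import java.util.Map;

// Illustrative sketch: transliterate an Ethiopic word character by character.
// Only a few mappings are shown (the full table is in Appendix IV), and the
// position-dependent sadis rules of the real module are omitted.
public class TransliterationSketch {
    private static final Map<Character, String> ETHIOPIC_TO_LATIN = Map.of(
            'ሽ', "shi", 'መ', "me", 'ል', "li", 'ስ', "s",
            'ዳ', "da", 'ዱ', "du", 'ላ', "la", 'ሉ', "lu");

    public static String transliterate(String amharicWord) {
        StringBuilder latin = new StringBuilder();
        for (char c : amharicWord.toCharArray()) {
            latin.append(ETHIOPIC_TO_LATIN.getOrDefault(c, "?"));
        }
        return latin.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("ሽመልስ")); // shimelis
    }
}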
4.3.2 English Query Translation
Like the Amharic query translation component, this component has two subcomponents: lexical
transfer and transliteration. The basic algorithmic differences between the two components lie in
the way the transliteration subcomponent works and in the input and output languages they
receive and produce. In our English query translation component, the lexical transfer
subcomponent uses the Amharic-English dictionary as a knowledge base, as its corresponding
Amharic lexical transfer module does. However, the transliteration subcomponent uses the
transliteration database as its knowledge base instead of direct character mappings. The
architecture of the English query translation component is shown in Figure 4.5. The details of
each subcomponent are discussed below.
4.4 English Search Engine
The results of our bilingual search engine are highly dependent on the underlying English and
Amharic search engines used to collect Web documents from the Web. Each search engine
performs crawling, indexing and ranking according to its own mechanisms. In this subtopic, we
present the architecture of a general Web search engine, as described in [51], which is used to
crawl, index and rank the English Web documents in our Amharic-English bilingual search
engine. Figure 4.6 shows the general Web search engine architecture [51]. Based on the figure,
we give a high level overview of how the whole system and its different components work. The
general Web search engine architecture consists of the following components: the crawler
module, crawl control module, indexer module, ranking module, page repository, collection
analysis module and query engine.
Every search engine relies on a crawler module to browse the Web and fetch Web documents.
Crawlers are relatively simple, small, automated programs or scripts that crawl through Web
pages much as a human user follows links to reach different pages. A set of starting URLs, called
seed URLs, whose pages are to be retrieved from the Web, is given to the crawlers. The crawlers
then extract the URLs appearing in the retrieved pages and give this information to the crawler
control module. The crawler control module is responsible for determining which links to visit
next and feeding these links back to the crawlers, i.e. it directs the crawling operation. The
crawler follows links that are considered worthwhile and deposits a copy of each page being
crawled into a page repository. This process continues until no more local resources, such as
storage, are available in the repositories.
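The crawl loop just described can be sketched as follows; this is a deliberately naive illustration
(single thread, regex link extraction, no politeness or robots.txt handling), not the crawler of any
production engine.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the crawler/crawl-control loop: poll a URL from the
// frontier, fetch the page, extract links and feed them back to the frontier.
public class CrawlerSketch {
    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void crawl(String seedUrl, int maxPages) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(seedUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // skip already crawled pages
            StringBuilder page = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) page.append(line);
            }
            // here the page would be deposited into the page repository
            Matcher m = LINK.matcher(page);
            while (m.find()) frontier.add(m.group(1));
        }
    }
}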
The indexer module reads and extracts all the word occurrences from each page in the repository
and associates with each word a list of the URLs where it occurred. A large "lookup table" is
generated that can provide all the URLs pointing to pages where a given word occurs. The
lexicon lists all the terms found in all the pages covered in the crawling process. The lexicon also
stores some statistics, such as the number of pages on which a word is found, the position of the
word in the document, and an approximation of the font size and capitalization of the word,
which are useful for page ranking during querying. Due to the size and rapid rate of change of
the Web, indexing poses special difficulties. As a result, the indexer also builds some special,
less common kinds of indexes, such as the structure index (shown in Figure 4.6), which reflects
the links between pages.
The collection analysis module is responsible for creating a variety of other indexes. Utility
indexes, which index values (such as the number of images in each page, or the size of pages),
are generated by the collection analysis module, which may use the text and structure indexes
when creating them.
During a crawling and indexing run, search engines must store the pages they retrieve from the
Web in the page repository. On the initial crawl, when the repositories are empty, the crawling
process is carried out in full. For subsequent crawls, an update strategy must be chosen to update
the indexes wherever the content of currently indexed pages has changed. Search engines
sometimes maintain a cache of the pages they have visited beyond the time required to build the
index. This cache allows them to serve out result pages very quickly, in addition to providing
basic search facilities.
The query engine module is responsible for receiving and processing search requests from users.
The engine relies heavily on the indexes, and sometimes on the page repository, to decide which
pages are most relevant to the query. Due to the Web's size and the fact that users typically enter
only one or two keywords, result sets are usually very large. The pages thought to be most
relevant to the query should therefore be listed before those that are less relevant, i.e. the results
should be returned in a ranked fashion, although relevance can be based on different factors.
Hence, the ranking module has the task of sorting the results such that those near the top are the
most likely to be what the user is looking for.
4.5 Amharic Search Engine
Figure 4.7: Amharic Search Engine Architecture (Source: [17])
The Crawler component is responsible for collecting Amharic language documents from the
Web. The crawler begins by taking starting URLs (seed URLs) that have Amharic Web content
pages from the frontier. It uses the HTTP protocol and multiple threads to fetch the Web page at
each URL. The pages collected from each URL are given to the Amharic page identification
module, which determines whether or not they have Amharic language content. Pages with
Amharic language content are stored in the Amharic page repository for further processing; the
rest are discarded.
The character encoding identifier and converter module parses the pages stored in the Amharic
page repository and extracts the associated text and links of each page. Its major responsibility is
identifying and converting the character encodings of the Amharic Web pages: if the encoding of
a page is non-Unicode, it is converted to a Unicode representation.
The Indexer component builds indexes from the documents that it gets from the Crawler. It is
composed of an Amharic analyzer, which considers the typical characteristics of the Amharic
language, and an indexing module, which stores the indexed terms. The Amharic analyzer is
responsible for extracting terms from the given Amharic text for indexing purposes. It tokenizes
the text, analyzes Amharic aliases, removes stop words, and applies stemming to the words
before they are indexed.
The query engine component provides an interface for users to enter Amharic search queries. It
is responsible for parsing the Amharic queries and uses the Amharic analyzer to analyze them.
The Ranking module ranks the search results using the index and the Amharic Page Repository;
it accesses the index to match the parsed and analyzed query terms with the terms that were
indexed.
In our bilingual Amharic-English search engine, we use the two search engines discussed above
for crawling, indexing, and ranking English and Amharic Web documents respectively. The
search engines have a common query dispatcher, which provides the query to each search engine
based on the language of the query terms. The result pages returned by each search engine are
collected and displayed to the user on the system's user interface.
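The dispatching idea can be sketched as follows; the engine interface and class names are
hypothetical placeholders, and the script test simply checks for characters in the Unicode
Ethiopic block.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the common query dispatcher: route the original query
// and its translation to the Amharic and English engines according to the
// language of the query terms. SearchEngine is a hypothetical interface.
public class QueryDispatcherSketch {

    interface SearchEngine { List<String> search(String query); }

    private final SearchEngine amharicEngine;
    private final SearchEngine englishEngine;

    QueryDispatcherSketch(SearchEngine amharic, SearchEngine english) {
        this.amharicEngine = amharic;
        this.englishEngine = english;
    }

    // A query is treated as Amharic if it contains Ethiopic characters.
    static boolean isAmharic(String query) {
        return query.codePoints().anyMatch(cp -> cp >= 0x1200 && cp <= 0x137F);
    }

    List<String> dispatch(String query, String translatedQuery) {
        List<String> results = new ArrayList<>();
        if (isAmharic(query)) {
            results.addAll(amharicEngine.search(query));
            results.addAll(englishEngine.search(translatedQuery));
        } else {
            results.addAll(englishEngine.search(query));
            results.addAll(amharicEngine.search(translatedQuery));
        }
        return results;
    }
}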
CHAPTER FIVE
In this chapter, we have described the implementation details of the Amharic-English bilingual
search engine. The development environment and the tools used to develop the system are
briefly discussed. We describe the tools used for query preprocessing, the customizations made
on the existing query preprocessing tools to fit our system, the new algorithms developed for
query translation from Amharic to English and vice versa, and the mechanisms and conventions
used for transliteration. The techniques used to dispatch the queries and to fetch the results from
the two search engines are discussed, as are the tools used, the mechanisms they use to crawl and
index Web documents, and the configurations made on them for crawling and indexing both
Amharic and English Web documents.
5.1 Development Environment
Developing a full-fledged bilingual search engine requires a lot of resources. High performance
computer hardware, network bandwidth, and linguistic tools are among the basic resources that
determine the successful development and implementation of such search engines. Crawling the
Web, storing the indexes, and searching and retrieving Web documents are network bandwidth,
memory, storage and high performance hardware intensive tasks. Preprocessing and translating
user queries between languages require linguistic tools for each language and automated
linguistic data for the language pair.
In this subtopic, we have described the development environment used in developing the
bilingual search engine. Even though crawl databases should be stored and indexed on a variety
of machines, our system was developed and tested on a single laptop computer for
demonstration, testing and evaluation purposes. The computer has an Intel(R) Pentium(R) Dual
CPU at 2.00 GHz per processor, 3.00 GB of RAM, a 320 GB hard disk, and the Microsoft XP
Professional Service Pack 2 operating system. We used a 6 MB/sec network bandwidth shared
by thousands of Addis Ababa University community members.
5.2 Development Tools
During the development of the bilingual search engine, software developed for academic
purposes, open source software, programming language software, and bilingual dictionaries were
used. The following software tools were configured, customized and used for developing the
prototype of the system.
Nutch-1.0:
This is the latest version of an open source Java implementation of a Web search engine. Nutch
is built on top of Lucene, an API for text indexing and searching released by the Apache
Software Foundation. In addition to the Lucene components, Nutch has a Web crawler for
collecting documents from the Web. For our bilingual search engine, we configured the software
and used it for crawling and indexing English Web documents from provided seed URLs.
Apache Tomcat 6.0:
Apache Tomcat is an open source implementation of the Java Servlet and JavaServer Pages
technologies, released under the Apache License version 2. It is a Web container that allows
running Servlet and JavaServer Pages based Web applications, and by default it provides an
HTTP connector on port 8080. We used the latest version of the software as the Servlet and
JavaServer Pages container for our system.
Cygwin 1.7.5:
It is an open source collection of tools that allows Unix or Linux applications to be compiled and
run on a Windows operating system from within a Linux-like interface. It provides a Unix-like
environment and software tool set to Window users. We used the latest version of the software to
compile and run the crawler using command line instructions in the Windows XP environment.
MySQL Server:
We used the latest version of this software to build our two-way bilingual Amharic-English
dictionary. We chose it because it is open source and easy to use. The Amharic words and their
corresponding English translations were first typed into a Microsoft Access database to take
advantage of the easier user interface of Ms Access. The tables in Ms Access were converted
into utf-8 encoding and exported to a MySQL server database created to support utf-8 encoding
and non-English character sets, in order to support Amharic characters. The database was
exported to the MySQL server for better manipulation of the data using the Java programming
language.
Bilingual Dictionaries:
Two hard copy dictionaries (an Amharic-English and an English-Amharic dictionary) were used
to develop the machine readable dictionary for query translation. The words from the
dictionaries were typed into a database system and used to develop a two-way dictionary that
allows us to translate queries from Amharic to English and vice versa. The dictionary database is
accessed using JDBC (Java DataBase Connectivity) of the Java API.
5.3 User Interface Implementation
The user interface of our system is responsible for accepting user queries, dispatching the queries
to the appropriate query preprocessing module based on the language, and displaying the results
to the user in an appealing manner in the user's browser window. It provides the user with
options for selecting the language (Amharic or English), entering the query in the selected
language, submitting the query, and viewing the results of the request. As a result, the user can
type in either Amharic or English and get the results in both languages in the same window in
different frames.
The user interface of our system is developed using JavaServer Pages (JSP), which is part of the
servlet standard. JSP is a server-side programming technology that enables the creation of
dynamic Web pages and applications. It allows a Web browser to make requests to a Java
application container (Web server) and to display dynamic content to the viewer. We chose this
technology for several reasons (a minimal sketch of the interface entry point follows the list
below); JSPs have the advantage of:
- Embedding Java code into HTML, XML, DHTML, or other document types
- Displaying dynamic content to the viewer
- Reducing Web development and maintenance costs
- Creating Web applications easier and faster
- Making Java APIs or frameworks available to Web designers
- Moving Java code outside of existing JSP pages
- Reusing Java code in multiple Web applications, and
- Using the advantage of the Java motto of "write once, run anywhere"
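For illustration, the entry point of such an interface can be sketched as a servlet; the parameter
names and helper methods below are hypothetical, since the actual interface is implemented as a
JSP page.

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative sketch: read the selected language and the query, hand the
// query to the matching preprocessing pipeline, and forward to a results page.
public class SearchUiSketch extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        String lang = request.getParameter("lang");   // "am" or "en"
        String query = request.getParameter("query");
        Object results = "am".equals(lang)
                ? preprocessAmharic(query)
                : preprocessEnglish(query);
        request.setAttribute("results", results);
        request.getRequestDispatcher("/results.jsp").forward(request, response);
    }

    private Object preprocessAmharic(String q) { return q; /* Amharic pipeline */ }
    private Object preprocessEnglish(String q) { return q; /* English pipeline */ }
}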
5.4 Query Preprocessing
The query preprocessing step is the first step in every IR system. The process requires special
attention when the IR system is a cross language IR system, because the preprocessed query
must first be translated to the other language before it is fed to the search engines. When the
translation is based on dictionary lookups, the preprocessed query terms should have matches,
together with their appropriate translations, in the bilingual dictionary. This step has different
stages and may implement different techniques depending on the characteristics of the language.
Most of the steps in query preprocessing are language specific and need knowledge about the
language to develop the required tools. In this subtopic, the implementation details of each of the
components required for both Amharic and English query preprocessing, before the queries are
translated from Amharic to English or vice versa, are described.
5.4.1 Tokenization
In query preprocessing, tokenization means splitting the query sentence or phrase into individual
terms or words based on some criteria of the language, such as word delimiters. In our work, the
user can enter queries in either Amharic or English, so both Amharic and English queries need to
be tokenized based on each language's characteristics.
5.4.2 Stop Words Removal
As discussed in Chapter Four, Section 4.2, stop words have no significant content for searching
information on the Web or in any collection of documents. As a result, they have to be removed
from the query sentence or phrase before it is submitted to the translation component. Since stop
word removal is a language dependent task, it needs a standard list of stop words for each
language.
In this work, even though Amharic does not have a standard stop word list, we used a set of
common Amharic stop words that are used by [2, 17, 55], as shown in Appendix I. As stated in
these research works, the stop words were collected from Amharic literature as well as from
researchers at the Ethiopian Language Research Institute. For English, there are standard stop
word lists available on the Web, although they can be modified based on a system's
requirements. For our system, we used the list of English stop words in Appendix II.
5.4.3 Stemming
Stemming is an important analysis step in a number of areas such as natural language processing
(NLP), information retrieval (IR), machine translation (MT) and text classification [53]. Several
research works have shown the importance of stemming from two different perspectives. The
first is the positive impact of stemming in dictionary based cross language IR, where the
translation step needs to look up terms in a machine readable dictionary (MRD). In this regard, a
research work in [53] showed that in cross language information retrieval (CLIR), stemming
reduces the number of term entries in the dictionary, which in turn increases the effectiveness of
the translation system. Words in machine readable dictionaries are usually entered in their
citation forms, so stemming is required to find term matches in the dictionary during query
translation. The second is the application of stemming during indexing to reduce the vocabulary
size of the index; it is also used during query processing to ensure a representation similar to that
of the document collection. In this regard, a research work in [52] showed the importance of
stemming the words in queries and retrieved text documents to facilitate searching a database of
Amharic texts and to increase the effectiveness of IR systems.
Amharic Stemming:
In this research work, after the stop words are removed from the tokenized Amharic query, a
stemmer that strips off prefixes and suffixes is applied. The stemming algorithm developed by
Alemayehu and Willett [52], and later customized by Tessema [2], is adopted for our work. This
is an aggressive stemming algorithm which stems both inflectional and derivational morphemes.
The algorithm also changes the last character of each stemmed word to sadis (the sixth order of
the Amharic characters). However, the dictionary used in this work contains entries for words
and their derivational variants, so stemmed query terms whose endings have been changed to
sadis could not find matches in the dictionary. Table 5.1 shows the results of the stemmer.
Table 5.1: Previous stemmer result
In addition, the algorithm changes the last character of every proper name to sadis, which gives
inappropriate input to our transliteration component and leads to incorrect transliteration of
proper names that do not have sadis endings. Table 5.2 shows the results of the stemmer and
their impact on the transliteration component.
Table 5.2: Results of the stemmer on proper names and its impact on the transliteration
system
Proper name    Stemmer result    Transliteration result    Actual transliteration
To alleviate this problem, the original algorithm was modified to fit our system. The stemmer
module in our query preprocessing component only strips off prefixes and suffixes, leaving the
result as it is without changing the last character to sadis. In this way, proper names are
transliterated in their normal forms.
English Stemming:
Like Amharic words, English dictionary entries are usually base words and their derivations.
Therefore, to translate query terms from English to Amharic using a machine readable
dictionary, preprocessing the query and stemming the suffixes and prefixes of inflected words is
highly advisable. In our work, the Porter stemmer algorithm of the Lucene API is used together
with our EnglishAnalyzer module to produce the stems of the English query terms. The Porter
stemming algorithm (or 'Porter stemmer') is commonly used to remove the more common
morphological and inflectional endings from English words.
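A minimal usage sketch of Lucene's Porter stemmer is given below. Package names and
constructors differ across Lucene versions, so this should be read as illustrative rather than the
exact code of our EnglishAnalyzer; note also that the Porter stemmer produces truncated stems
(e.g. "charg") rather than dictionary words, so in practice matching against dictionary entries
needs the entries stemmed the same way.

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative sketch: run English query terms through Lucene's Porter stem
// filter and print the resulting stems.
public class EnglishStemmingSketch {
    public static void main(String[] args) throws Exception {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("charger charging charged charges"));
        try (TokenStream stems = new PorterStemFilter(tokenizer)) {
            CharTermAttribute term = stems.addAttribute(CharTermAttribute.class);
            stems.reset();
            while (stems.incrementToken()) {
                System.out.println(term.toString());
            }
            stems.end();
        }
    }
}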
5.4.4 Normalization
As we have discussed in Chapter Four, Section 4.2, normalization in our work means replacing
one character with another when the two characters have the same sound. The implementation of
this module handles two basic functionalities, as described below.
The first functionality is replacing characters that have the same sound with one common,
frequently used character. For example, the word "ፀሐይ" can be replaced by "ጸሀይ", on the
assumption that the Amharic characters ጸ and ሀ are more commonly used in the literature than
the characters ፀ and ሐ, with which they share the same sounds respectively. For this work, the
characters that have different symbols for the same sound, as identified by different researchers
(Tessema Mindaye [2] and Seid Muhie [54]), are considered.
The second functionality of this component is expanding short words that are shortened with the
forward slash ('/') into one or two words. For example, ዶ/ር is expanded to ዶክተር and ት/ቤት is
expanded to ትምህርት ቤት. For this work, a list of common Amharic short words (Appendix III),
collected from the work of Melese Tamiru [55], is considered. After the Amharic query sentence
is tokenized, each word is checked against the list and, if found, is expanded to its corresponding
expanded form.
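A minimal sketch of the expansion lookup is shown below; the two entries are the example
abbreviations above, while the full list used in our system is the one in Appendix III.

import java.util.Map;

// Illustrative sketch: expand slash-abbreviated Amharic tokens after
// tokenization; only two example entries are shown here.
public class ShortWordExpanderSketch {
    private static final Map<String, String> SHORT_FORMS = Map.of(
            "ዶ/ር", "ዶክተር",          // "Dr." -> "Doctor"
            "ት/ቤት", "ትምህርት ቤት");  // abbreviation of "school"

    public static String expand(String token) {
        return SHORT_FORMS.getOrDefault(token, token);
    }
}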
5.5 Query Translation
In any cross language information retrieval (CLIR) system, query translation is the core
component, on which a lot of effort has to be exerted. If the query translation is made using a
bilingual dictionary, there are two fundamental research tasks to be considered [56]:
1) How to improve the coverage of the bilingual dictionary, and
2) How to select the correct translation of the query among all the translations provided by
the dictionary.
The first task is the major focus of this research.
Even though dictionary-based query translation is one of the conventional and commonly used
approaches in CLIR, the appearance of Out-Of-Vocabulary (OOV) terms is one of the main
difficulties arising with this approach. These OOV terms are often proper names, borrowed
words or newly created words that usually cannot be found in a bilingual dictionary [56]. This
problem significantly limits the retrieval performance of a CLIR system. In this section, we have
described the implementation details of the design of our Machine Readable Dictionary (MRD)
for Amharic-English and vice versa, the lexical transfer between the two languages, and the
solution proposed to improve the coverage limitations of the bilingual dictionary.
5.5.1 Machine Readable Dictionary Development
Step 1: Selection of words
To do this task, the words in the two dictionaries were carefully selected in order to exclude:
1) Stop words such as yours, ourselves, about, above, across, after, again, etc. from the
English-Amharic dictionary and ነው, ሆነ, ወደ, ናቸው, ውስጥ, etc. from the Amharic-English
dictionary.
2) Inflectional morphologies of root words, such as the past tense and past perfect tense
forms of regular verbs, the plural forms of nouns, the "-ing" forms of verbs, etc. from the
English-Amharic dictionary, and words that include prefixes and suffixes (such as words that
include የ, በ, ለ, etc. as a prefix and ኦች, ው, ም, etc. as a suffix) from the Amharic-English
dictionary.
3) Words written with the characters ሐ, ኀ, ሠ, ፀ, etc. and their derivatives, for the Amharic
words of both dictionaries. Instead of entering them directly in the MRD, we replaced these
characters with the other characters that have the same sound, as the normalization module does.
4) Words that have the same pronunciation in both languages, such as: sport, lottery, hotel,
film, piano, kilo, etc.
Step 2: Development environment
After the words were carefully selected, they were typed into a Microsoft Access database. The
main reason this database management system was selected for entering the words and their
translations is the easier user interface of Ms Access, particularly for Amharic character entry.
After all the required words were entered into the database, the Ms Access database was
migrated to the MySQL Server database management system, in which a database with the same
data types, supporting utf-8 encoding and Amharic character sets, was created. MySQL Server is
open source, cross platform, fast to access, and supports Amharic characters, which can be
entered and accessed through Java DataBase Connectivity (JDBC) for manipulation using the
Java programming language. These advantages of MySQL Server are the main reasons for the
migration of the database from Ms Access to MySQL Server.
5.5.2 Lexical Transfer
After the query preprocessing step is accomplished, each of the remaining stemmed words of the
query is checked for its presence in the dictionary. If a match is found, the corresponding
translations are retrieved and given to the dispatcher, which provides them to the appropriate
search engine. However, there are two other possibilities: the word may have more than one
translation, or the word may not be found in the dictionary at all. The techniques used to handle
these cases are discussed below.
If a given word has more than one translation, it creates the problem of translation
disambiguation, which refers to finding the most appropriate translation among several choices
in the dictionary. As stated in [56], there are several approaches to solving this problem. One
approach is to select the most likely translation, usually the first one offered by the dictionary.
Another is to use all possible translations in the query, combined with the OR operator. Even
though the second approach introduces noise into the query, which leads to the retrieval of many
irrelevant documents, we have used it because it is likely to include the correct translation. This
means that a single Amharic term can, in our case, give rise to many possible alternative English
translations; at the query level, each query is initially maximally expanded and all translations
are considered for Web document searching.
On the other hand, if the word is not found in the dictionary, it is given to the transliteration
module, on the assumption that it is either a proper name or a borrowed word, which are often
not found in the dictionary and usually have the same pronunciation in both languages. The
algorithm used for lexical transfer is shown in Figure 5.1.
For our work, the Java API (Application Program Interface) called JDBC (Java DataBase
Connectivity) is used for accessing the database and for the lexical transfer between the two
languages. JDBC defines how a Java programmer can access data in tabular format from Java
code using a set of standard interfaces and classes written in the Java programming language.
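A minimal JDBC sketch of the lexical transfer follows. The connection URL and the table and
column names (dictionary, amharic_word, english_word) are hypothetical stand-ins for our
actual MySQL schema; when the result is empty, the word is passed on to the transliteration
module, and when several senses are found they are OR-ed together as discussed above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: look up all English senses of a stemmed Amharic word in
// a hypothetical MySQL dictionary table and combine them with OR.
public class LexicalTransferSketch {

    public static List<String> lookup(String amharicWord) throws Exception {
        List<String> senses = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/bilingual_dict"
                         + "?useUnicode=true&characterEncoding=utf8",
                 "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT english_word FROM dictionary WHERE amharic_word = ?")) {
            ps.setString(1, amharicWord);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) senses.add(rs.getString(1));
            }
        }
        return senses; // empty list -> hand the word to the transliteration module
    }

    public static String toOrQuery(List<String> senses) {
        return String.join(" OR ", senses);
    }
}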
5.5.3 Transliteration
As we have discussed in Chapter Four, some of the reasons dictionary-based translation has been
commonly used in cross-language information retrieval are that bilingual dictionaries are widely
available, dictionary-based approaches are easy to implement, and the efficiency of word
translation with a dictionary is high. However, due to the vocabulary limitations of dictionaries,
the translations of some words in a query often cannot be found in the dictionary.
Proper names, such as personal names and place names, are a major source of Out Of
Vocabulary (OOV) terms because many dictionaries do not include such terms. It is common for
proper names to be translated word by word based on phonetic pronunciation. The process of
pronouncing a word (a proper name, borrowed word, etc.) of one language in a similar manner in
another language is called transliteration [12, 56]. As a result, transliteration has become one of
the solutions to the coverage limitation of bilingual dictionaries during query translation.
In this subtopic, we describe the Amharic-English and back transliteration between the two
languages. The main reason for using transliteration is to handle OOV terms that are not present
in the Amharic-English dictionary. The transliteration process keeps approximate phonetic
equivalents between the two languages. Since the two languages use different scripts,
transliteration between the scripts without lexical changes is quite useful. In our work, a
heuristic-based approach which uses phoneme mappings is implemented to solve the problem of
OOV terms.
Amharic-English transliteration
The Ethiopic writing system is a rich syllabary of at least 34 consonant classes, each generally
having seven or eight forms, with a few having twelve or thirteen [60]. This means that each
letter or symbol usually represents a whole syllable, such as 'ዳ', 'ዱ', 'ላ', or 'ሉ' for 'da', 'du',
'la', or 'lu' respectively. For this work, the convention for finding the closest match between the
Latin and Ethiopic phonetic systems, called the System for Ethiopic Representation in ASCII
(SERA), is used as our main reference for developing the transliteration module. This convention
was mostly developed for typing Ethiopic script with Latin keyboard layouts. In our work,
however, the transliteration module is used for transliterating words written in Ethiopic script
into Latin script based on their Amharic pronunciations. Therefore, minor modifications of the
convention have been made for some orders of the Amharic characters to fit our system, as
shown in Appendix IV. For example, the fifth order (hamis) of each Amharic character is
represented in SERA by its consonant representation plus "E", such as ሄ=hE, ሌ=lE, ሜ=mE,
etc., but for pronunciation this is not quite useful. Therefore, we have used ሄ=hie/he, ሌ=lie/le,
ሜ=mie/me, etc. to pronounce these characters in English. Table 5.3 shows some example
characters used in the representation of Amharic characters in English.
The other modification is on sadises, depending on their position in the word. For example, the
Amharic proper name 'ሽመልስ' can be transliterated as 'shimelis' in English. In this case, a sadis
can be transliterated either as its Latin consonantal representation plus the vowel 'i' or as the
Latin consonantal character only, based on its position in the word, as in ል=li, ሽ=shi and ስ=s,
although all three are sadises. That is, when a sadis appears at the beginning of the word, or the
character before it is also a sadis, or it is before the last character (with some exceptions), it is
represented by the consonant plus the vowel 'i' (as in ሽ=shi, ል=li), but by the consonant
character only (as in ስ=s) when it appears at the end of the word or in the remaining positions
[19, 59].

Figure 5.2: The algorithm for transliterating a word from Amharic to English.
For each character in the word
    If the character is at the beginning of the word and the character is sadis
        Replace it with its Latin consonant plus the vowel 'i' ('Xi'), except for some exceptions
    Else if the character is sadis and the character before it is also sadis
        Replace it with its Latin consonant plus the vowel 'i' ('Xi'), except for some exceptions
    Else
        Transliterate the character based on the transliteration convention
    End if
End for
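The positional rule of Figure 5.2 can be sketched in Java as follows; the small character map and
the helper names are illustrative only, the full mapping is given in Appendix IV, and the
exceptions mentioned above are omitted.

import java.util.Map;

public class SadisRule {
    // Consonant-only forms for a few sadis characters (illustrative subset).
    static final Map<Character, String> SADIS = Map.of('ል', "l", 'ሽ', "sh", 'ስ', "s");

    static boolean isSadis(char c) {
        return SADIS.containsKey(c);
    }

    // Transliterates one sadis character according to its position in the word.
    static String transliterate(char c, int pos, String word) {
        String consonant = SADIS.get(c);
        boolean wordInitial = (pos == 0);
        boolean afterSadis = (pos > 0 && isSadis(word.charAt(pos - 1)));
        boolean beforeLast = (pos == word.length() - 2);
        if (wordInitial || afterSadis || beforeLast) {
            return consonant + "i";   // e.g. ሽ -> "shi"
        }
        return consonant;             // e.g. word-final ስ -> "s"
    }
}

For the name 'ሽመልስ', this rule yields 'shi', 'li', and 's' for the three sadises, which is
consistent with the transliteration 'shimelis'.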
English-Amharic transliteration
The automatic back transliteration from English to Amharic is not an easy task to implement: it
requires considerable effort as well as the involvement of linguistic experts who are fluent in
both languages. For our work, we have therefore used a list of manually transliterated proper
names, stored in a separate table, to transliterate proper names written in English into Amharic.
Most of this collection of proper names, called a gazetteer, has been collected from the work of
Seid Muhie [54], who used it for gazetteer-based answer selection in his named entity
recognizer. Figure 5.3 shows the algorithm used for transliterating English proper names into
Amharic.
For each English stemmed query word not found in the dictionary
    If the word is found in the proper name (gazetteer) table
        Retrieve its manual Amharic transliteration
    Else
        Leave the word out of the translated query
    End if
End for

Figure 5.3: The algorithm for transliterating English proper names into Amharic.
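A minimal sketch of this gazetteer lookup, assuming the manually transliterated names have been
loaded from the database table into a map, is shown below; the class and method names are ours,
for illustration only.

import java.util.HashMap;
import java.util.Map;

public class Gazetteer {
    private final Map<String, String> englishToAmharic = new HashMap<>();

    // Adds one manually transliterated proper name to the gazetteer.
    public void add(String english, String amharic) {
        englishToAmharic.put(english.toLowerCase(), amharic);
    }

    // Returns the manual Amharic transliteration, or null when the name is
    // absent, in which case the word is left out of the translated query.
    public String transliterate(String englishName) {
        return englishToAmharic.get(englishName.toLowerCase());
    }
}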
5.6 Web Document Crawling, Indexing and Searching
As we have discussed in the previous Section of this Chapter, we have used an open source Web
search engine called Nutch for our English Web document crawling and indexing, and an
Amharic search engine, developed by Hassen Redwan [12] by customizing Nutch, for our
Amharic Web document crawling and indexing. The Nutch search engine consists of roughly
four main components: the Crawler, which discovers and retrieves Web pages; the WebDB, a
custom database that stores known URLs and fetched page contents; the Indexer, which dissects
pages and builds keyword-based indexes from them; and the Search Web Application, a JSP
application that can be configured and deployed in a servlet container.
Nutch requires up to a gigabyte of free disk space and a high-speed connection for searching and
indexing Web documents. Several configurations and customizations have been made to adapt
the search engine to our context, for both crawling and searching Web documents. In this
subtopic, we describe some of the configurations made on the search engine.
From the sites used for English Web document crawling, some host only English-language
documents and some are bilingual (Amharic and English) Web sites. The bilingual Web sites are
used for crawling both Amharic and English Web documents. The sites used for only English
Web document crawling are the following.
From the sites used for Amharic Web document crawling, some host only Amharic-language
documents and some are bilingual (Amharic and English) Web sites. The sites used for only
Amharic Web document crawling are the following.
The bilingual Web sites used for crawling both Amharic and English Web documents are the
following.
• http://www.ena.gov.et/: the website of the Ethiopian News Agency, the state news service.
• http://www.ethiopianreporter.com/: the website of the Ethiopian Reporter, a daily newspaper
based in Addis Ababa.
• http://www.waltainfo.com/: Walta is a pro-government news site based in Addis Ababa. It
includes news coverage from the EPRDF point of view.
• http://www.ethpress.gov.et/: the website of the Ethiopian Press Agency, particularly for an
official Ethiopian newspaper called the Herald.
The second step is editing the file conf/crawl-urlfilter.txt and replacing MY.DOMAIN.NAME
with the name of the domain we wish to crawl. In our work, the domain names of all the seed
URLs are written with such a line, for example +^http://([a-z0-9]*\.)*addisfortune.com/, to
include any URL in the 'addisfortune' domain.
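For illustration, the relevant part of conf/crawl-urlfilter.txt might look as follows; the skip and
catch-all rules are the Nutch defaults, and only the domain lines, taken from the seed sites listed
above, are specific to our crawl.

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept urls in the seed domains
+^http://([a-z0-9]*\.)*addisfortune.com/
+^http://([a-z0-9]*\.)*ena.gov.et/
+^http://([a-z0-9]*\.)*waltainfo.com/
+^http://([a-z0-9]*\.)*ethpress.gov.et/
# skip everything else
-.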
The third step is editing the file conf/nutch-site.xml. In this step, a number of property value
changes have been performed to fit the search engine to our system; these changes, and the
crawler properties they apply to, are shown in Appendix V.
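Once the seed URLs and filters are configured, the crawl itself can be launched with the standard
Nutch crawl command. As an illustration, the directory name, depth, and thread count below
match the sample crawl log in Appendix VI:

bin/nutch crawl urls -dir AmharicCrawl -depth 10 -threads 10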
5.6.2 Web Document Searching
After we have verified that the crawling completed successfully, we have proceeded to set up the
Web interface for searching Web documents. For searching via the Web interface, we have put
the Nutch war file into the servlet container of the Apache Tomcat Web application server. Then,
in our Web application folder, we have edited the nutch-site.xml file to set up our searcher
directory, as shown in Figure 5.5.
<?xml version="1.0"?>
<configuration>
<property>
<name>searcher.dir</name>
<value>D:/nutch-1.0/crawl/</value>
</property>
</configuration>
Figure 5.5: Setting the searcher directory in nutch-site.xml.

After the searcher directory has been set up as shown in the Figure above, the user interface of
our Amharic-English bilingual search engine, designed using Java Server Pages (JSP), is
displayed. The user interface is displayed in a Microsoft Internet Explorer browser window and
provides facilities to select the query language and to enter a query in a text box. In the query
text box, a query can be written for searching the crawled and indexed Web documents.
CHAPTER SIX
EXPERIMENTAL RESULTS
The purpose of the experiment is to evaluate the effectiveness of our dictionary-based bilingual
search engine. The effectiveness of cross-language information retrieval systems is often
measured using precision and recall. The evaluation is usually performed by comparing the
results of cross-language runs with the corresponding monolingual runs of the same system on
the same queries. To do this, one needs an appropriate test collection: a document collection, test
queries, and their relevance judgments.
The test collections that have been used for CLIR experiments are usually provided by three
major evaluation workshops: the Text REtrieval Conference (TREC) with its Cross-Language
Track, the Cross-Language Evaluation Forum (CLEF) covering many European languages, and
the NII Test Collection for IR systems (NTCIR) covering Asian languages such as Chinese,
Japanese and Korean [34]. In these workshops, the task is to match queries in one language
against a document collection in another language and to return a ranked list of results. Most
previous CLIR studies conducted their experiments using standard document collections
provided by these organizations; such collections usually consist of documents prejudged and
carefully selected by human experts for evaluation purposes.
However, for Web-based IR systems, there is no established relevance judgment available for
precision and recall. As a result, precision is usually considered and reported for Web-based IR
systems at low recall levels. This is because there is no proper method of calculating the absolute
recall of a search engine, as it is impossible to know the total number of relevant documents
among the huge number of Web pages. Since our study used Web pages instead of standard
collections, the traditional CLIR evaluation techniques were not applicable. As a result, in order
to evaluate the performance of our system, an experiment was designed and conducted to test
precision instead of recall, since precision does not require knowledge of the whole test
collection. In this Chapter, we describe the approaches followed to evaluate our system, the
query preparation for testing the system, the discussion of the experimental results, and the
possible reasons behind the results found.
6.1 Experimental Approaches
There are two commonly used approaches to evaluating CLIR systems [61]. The first is
manually translating test queries written in one language into the other language of the pair and
using both the original and the translated queries to retrieve from the same document collection,
which is written in the translated query language. The performance of the cross-language
information retrieval system can then be evaluated by directly comparing the results of the
monolingual runs with their cross-language equivalents, as shown in Figure 6.1. The
disadvantage of this evaluation technique is that the manual translation requires human
judgment, and evaluation collections constructed in this way exhibit some variability depending
on the terminology chosen by a particular translator.
Figure 6.1: Bilingual CLIR system evaluation using manually translated query.
The second approach uses queries formulated in one language to retrieve documents written in
the two languages of the pair and stored in a parallel corpus. The precision of the query results at
each recall level is then compared between the documents written in the two languages. The
disadvantage of this technique is that it requires a parallel corpus, which is difficult to obtain,
particularly for under-resourced languages.
The first approach, i.e., using manually translated test queries, is a widely used evaluation
strategy [61] for under-resourced languages such as Amharic, because it permits existing test
collections in either language to be inexpensively extended to any language pair for which
translation resources are available. We have therefore used the first approach to test the
performance of our bilingual search engine. We conducted our cross-language retrieval
experiments with both Amharic and English queries on both Amharic and English Web
document collections, since our system is a bidirectional (two-way) bilingual CLIR system.
Our system is also designed and implemented to retrieve documents in two languages (Amharic
and English) for a single query in Amharic or English. As a result, to show the significance of
our bilingual system, the relevant bilingual retrieval results of our system (Amharic and English
documents) are compared with the relevant results of monolingual search engines, as shown in
Figure 6.2.
Figure 6.2: Comparison of the bilingual retrieval results with the corresponding monolingual
retrieval results.
6.2 Test Query Preparation
The effectiveness of any cross-lingual search engine is highly dependent on the performance of
its query translation component. A query formulated for cross-lingual information retrieval
needs to be properly translated into the language of the target documents before it is submitted
to the search engine. To show the effectiveness of the translation system, the translation results
of the 15 Amharic queries, together with their manual translations, and the translation results of
the 15 English queries are shown in Table 6.1 and Table 6.2 respectively. The query translation
results are also discussed in this Section.
The fifteen Amharic queries selected for the cross-lingual evaluation, which contain a total of 36
words, are shown in Table 6.1. Of these, 11 words are proper names, 3 of which are inflected
with prefixes; since they are OOV terms, they need the transliteration component for translation.
Two words are numbers, which need no translation, and 1 word is a stopword, which is removed
during query preprocessing. Of the remaining 22 words, which need the bilingual dictionary for
translation, 5 are inflected with suffixes and prefixes, and 1 of these is a short word.
Table 6.2: Automatic English query translation results

No.   English Query                      English Query Translation Component Result
1     Election 2002                      ምርጫ 2002
2     Meles zenawi                       መለስ ዜናዊ
3     Electoral Board of Ethiopian       ምርጫ ቦርድ ኢትዮጵያ
4     World cup 2010                     ዓለም ዋንጫ 2010
5     Birtukan Mideksa                   ብርቱካን ሚደቅሳ
6     Ethiopian political parties        ኢትዮጵያ ፖለቲካ ፓርቲ
7     Business and economy               ንግድ ኢኮኖሚ
8     Kenenisa Bekele                    ቀነኒሳ በቀለ
9     Quality education                  ትምህርት ጥራት
10    Amhara Development Association     አማራ ልማት ማህበር
11    Football                           እግር ኳስ
12    Tana Beles project                 ጣና ፕሮጀክት
13    Democratic party                   ዴሞክራሲያዊ ፓርቲ
14    Amharic Language                   አማርኛ ቋንቋ
15    Agricultural Development           ግብርና ልማት
6.3.1 Discussions on the Results of the Amharic-English Query Translation Component
Out of the 15 Amharic queries, 10 are properly translated. Word-wise, out of the 36 words, 31
(88.57% of the words) are properly translated. The translations of the remaining 5 words, which
cause the improper translation of the 5 queries, are not exactly the same as the manual
translations. The problems arise for different reasons. For example, the words 'ኢትዮጵያ' and
'አማራ' are transliterated as 'etyopiya' and 'Amara' instead of 'Ethiopia' and 'Amhara'
respectively, because the transliteration component does not handle some 'irregular'
transliterations, such as the appearance of the character 'h' in both words. On the other hand, the
proper name 'ብርቱካን' is directly translated to 'Orange', the corresponding word in the
bilingual dictionary, because of the absence of an automatic named entity recognizer. In
addition, in the query 'ጣና በለስ ፕሮጀክት', the word 'በለስ' is transliterated as 'les' because the
character 'በ' is considered an Amharic prefix and is stripped by the Amharic stemmer.
6.3.2 Discussions on the Results of the English-Amharic Query Translation Component
To test the effectiveness of the English query translation component, the manual translations of
the 15 (fifteen) Amharic queries are used. As can be seen in Table 6.2 above, out of the 15
queries, 14 (93.33% of the queries) are properly translated. This shows that the English query
translation component performs better than the Amharic query translation component. This is
because the English translation component uses a database to store and translate proper names,
instead of the character-map transliterations used by the Amharic query translation component.
However, since it is impossible to list all proper names, some proper names are missed entirely
from the translated queries. For example, from the query 'Tana Beles project', the term 'Beles' is
missing from the translated query, because this word is found neither in the bilingual dictionary
nor in the proper name database. It could have been handled if there were a transliteration
module, as in the Amharic-English query translation component.
The transliteration component is also evaluated independently for its performance. To do this,
we have collected 1100 (one thousand one hundred) proper names, most of which are person
names; the rest are names of countries, places, hotels, well-known cities, etc. Most of the proper
names are collected from the gazetteers used for English-Amharic manual transliteration, and a
few are collected from the Web and from the Department of Computer Science, as we tried to
include the names of Addis Ababa University Computer Science graduate students and the staff
members of the department.
The manual transliteration of the gazetteer is made with the help of linguists. For words whose
transliteration is ambiguous, the free English-Amharic online transliteration service of the
Google API (http://www.google.com/ta3reeb/) is also used for cross-checking, although this
transliteration service has a few problems of its own.
6.4.1 Discussions on the Transliteration Results
Out of the 1100 proper names, 887 (80.64% of the names) are properly transliterated. The
remaining 213 (19.36% of the names) were found to have minor or major spelling problems, for
different reasons. The causes of the problems that degrade the performance of our transliteration
component can be viewed from different angles, as follows.
1. Transliteration Variations: some Amharic characters use different English characters for
the same pronunciation during transliteration, as shown in Table 6.3.
For example:
a. the character 'ቀ' and its derivatives use the English characters 'C', 'K', and 'Q' in
different words;
b. the English character pair 'ph' is pronounced as 'ፍ' where it appears in written words;
c. the Amharic character 'ጀ' and its derivatives sometimes use 'g' instead of 'j'.
2. Irregular Transliterations: some English words include characters which are not
pronounced, or which are transliterated in a different manner. For example: Djibouti,
Jerusalem, Ghana, Amhara, etc., although their Amharic transliterations are ጅቡቲ,
ኢየሩሳሌም, ጋና, and አማራ respectively.
3. Double Consonants: the English spellings of some Amharic words double a consonant to
indicate the longer (geminated) consonant in pronunciation, as the length of the consonant
may often affect the meaning of a word. For example: Gessess, Mohammed, Tullu,
Jimma, Gojjam, Teppi, etc. for 'ገሰሰ', 'መሀመድ', 'ቱሉ', 'ጅማ', 'ጎጃም', 'ቴፒ' respectively.
4. Vowel Usage Variations: the Amharic characters 'ያ', 'ዩ', 'ዮ', etc. are represented as 'ia'
or 'ya', 'u' or 'yu', and 'yo' or 'io', etc. respectively. For example: Algeria for 'አልጀሪያ'
and yaregal for 'ያረጋል' for the character 'ያ'; Uganda for 'ዩጋንዳ' and Kalayu for 'ካላዩ'
for the character 'ዩ'; and Yonas for 'ዮናስ' and Tsion for 'ጽዮን' for the character 'ዮ'.
5. Problems Related to Sadises and Hamises: when Amharic sadis and hamis characters are
transliterated into Latin characters, they sometimes deviate from the normal convention
even though they appear in the same positions. For example: Adinew for 'አድነው' and
Admasu for 'አድማሱ' for the character 'ድ'; Abiy for 'አብይ' and Abdo for 'አብዶ' for the
character 'ብ'; and Kelkilie for 'ቀልቅሌ' and Kifle for 'ክፍሌ' for the character 'ሌ', etc.
Even though some Web-based CLIR systems are available, such as Keizai (which accepts
English queries and returns Japanese and Korean documents), TwentyOne (which supports six
European languages), Arabvista (which accepts English or Arabic queries to retrieve Web pages
in multiple languages, including Chinese, French, and German), ECIRS (an English-Chinese
Web-based system), and MULINEX (a more mature multilingual Web search and navigation
tool for English, French and German), for most of them no systematic evaluations are available,
which leaves their effectiveness uncertain [62]. Traditional CLIR systems are often tested on
standard, readily available collections (mostly news articles). However, this is not applicable to
Web-based CLIR, which requires an extensive crawling (spidering) process to build multilingual
collections.
Since it was impossible to read all the Web page documents collected during crawling to judge
their relevance, we emphasized precision only for the top 10 retrieved Web pages for each query
as our primary performance measure; this measurement is referred to as target retrieval [34, 62].
In addition, cross-language information retrieval always yields a precision loss compared to
traditional monolingual information retrieval. Therefore, it is not uncommon to evaluate the
performance of a CLIR system by comparing its results with the corresponding monolingual
run. For the monolingual run, we manually translated the original Amharic queries discussed in
Section 6.2 into the target language (i.e., English) and performed the retrieval on the translated
queries. After the precisions for both the CLIR and the monolingual retrieval were obtained, the
precisions of the CLIR and of the monolingual retrieval were compared.
Since our system is bidirectional, a single query and its corresponding manual translation yield
four different search results:
• the Amharic monolingual search results for the original Amharic query;
• the English search results of the Amharic-English cross-lingual run for the same query;
• the English monolingual search results for the manually translated English query; and
• the Amharic search results of the English-Amharic cross-lingual run for the English query.
Our system evaluation is then performed by calculating the precision of each of these retrieved
result sets. To evaluate the performance of the Amharic-English CLIR system, we compared the
average precision of the Amharic-English cross-lingual search results with the average precision
of the English monolingual search results. Likewise, to evaluate the performance of the
English-Amharic CLIR system, we compared the average precision of the English-Amharic
cross-lingual search results with the average precision of the Amharic monolingual search
results. Finally, we compared the performance of the Amharic-English and the English-Amharic
retrieval engines.
As described in Sections 6.2 and 6.3, we have used 15 Amharic queries to evaluate the
effectiveness of our Amharic-English bilingual retrieval performance. Each query is submitted
directly to the Amharic monolingual retrieval engine and, at the same time, to the
Amharic-English cross-lingual retrieval engine. This means that with a single search click we
obtain both Amharic and English Web documents in the same browser window, separated by
frames, as shown in Figure 6.3 for the Amharic query 'አማርኛ ቋንቋ'. These Amharic queries are
also manually translated into their corresponding English queries and used for the evaluation of
our system by comparing the precision of the monolingual and cross-lingual results. As
described in Chapter Two, precision is the number of relevant documents retrieved divided by
the total number of retrieved documents. However, relevance is subjective, and different users
may have different relevance measures. In our work, the relevance of the retrieved documents to
the query is evaluated by two Information Science postgraduate students who are regular users
of the Web.
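As an illustration of how the numbers reported below are computed, the following minimal
sketch calculates the precision over the top 10 results and expresses the cross-lingual precision
as a percentage of the monolingual one; the class and method names are ours.

public class PrecisionEval {
    // precision = relevant retrieved / total retrieved (here, the top 10)
    static double precisionAt10(int relevantInTop10) {
        return relevantInTop10 / 10.0;
    }

    // Cross-lingual precision expressed as a percentage of the monolingual one.
    static double percentOfMonolingual(double crossLingual, double monolingual) {
        return crossLingual / monolingual * 100.0;
    }
}

For example, with a cross-lingual average precision of 0.63 and a monolingual average precision
of 0.85, the ratio is 74.12%, as reported in Table 6.4.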
Table 6.4: Amharic-English precision evaluation result

Query      Monolingual precision    Cross-lingual precision    % of monolingual
Query1 0.8 0.8 100%
Query2 0.7 0.5 71.42%
Query3 0.9 0.7 77.78%
Query4 0.8 0.3 37.5%
Query5 0.83 0.2 24.1%
Query6 0.9 0.8 88.89%
Query7 0.8 0.7 87.5%
Query8 0.83 0.8 96.39%
Query9 0.8 0.6 75%
Query10 0.8 0.4 50%
Query11 1 0.8 80%
Query12 0.88 0.38 43.18%
Query13 0.83 0.8 96.39%
Query14 1 1 100%
Query15 0.83 0.67 80.72%
Average 0.85 0.63 74.12%
As can be seen from this Table, the average monolingual precision of the English retrieval
engine is 85% and that of the Amharic-English cross-lingual retrieval engine is 63%. From this,
we can observe that the Amharic-English cross-lingual retrieval system performed at 74.12% of
the level of its corresponding monolingual retrieval engine.
Like the Amharic queries, the English queries, which are direct manual translations of the
Amharic queries, are also submitted to the English monolingual retrieval engine and at the same
time to the English-Amharic cross-lingual retrieval engine. The English-Amharic search results
are as shown in Figure 6.4 for the English query “Election 2002”.
Figure 6.4: English-Amharic search result.
The precisions of the Amharic monolingual and English-Amharic cross-lingual retrieval for each
English query, and the precision comparison of the two, are shown in Table 6.5. The precisions
for both the monolingual and cross-lingual retrieval are calculated for only the top 10 retrieved
Amharic Web pages for each query.
Table 6.5: English-Amharic precision evaluation result

Query      Monolingual precision    Cross-lingual precision    % of monolingual
Query8 0.75 0.6 80%
Query9 0.86 0.75 87.21%
Query10 0.89 0.71 79.78%
Query11 0.8 0.57 71.25%
Query12 0.71 0.5 70.42%
Query13 0.88 0.75 85.23%
Query14 1 1 100%
Query15 0.8 0.67 83.75%
Average 0.85 0.67 78.82%
As can be surmised from Table 6.5, the Amharic monolingual retrieval engine has an average
precision of 85%, and its corresponding English-Amharic cross-lingual retrieval engine has an
average precision of 67%. This experimental result also shows that the English-Amharic
cross-lingual retrieval engine performed at 78.82% of the level of its corresponding monolingual
retrieval engine.
The experimental results in Tables 6.4 and 6.5 show that the cross-lingual retrieval engines have
lower precision than their corresponding monolingual retrieval engines, which is expected and is
observed in every cross-lingual evaluation. Furthermore, the experimental results show that the
English-Amharic cross-lingual retrieval engine performed better than the Amharic-English
cross-lingual retrieval engine. This is because the English query translation component has
better query translation performance than the Amharic query translation component, for the
reasons discussed in Section 6.3.2; this shows the dependency of cross-lingual retrieval
performance on the translation component.
6.5.4 The Significance of Our System Compared to General Purpose Search Engines
Our system is designed and implemented considering the needs of Web users to use Web
documents in both languages, and hence it handles the specific characteristics of the Amharic
language. To evaluate the significance of our system compared to general purpose search
engines, some Amharic queries that exhibit morphological variations, character variations, and
short words are given both to our system and to the Google search engine, to see whether our
system handles these characteristics of the language and to see the result variations in the
Google search results. In addition, the significance of the system's bilingualism is observed by
looking at the relevant English Web documents that the system retrieves, in addition to the
Amharic documents, for these Amharic queries. The relevance of the documents is evaluated for
only the top 10 retrieved Web documents.
Table 6.6: Google versus our system results for queries that show different characteristics
of the Amharic language.

Language        Query                     Google results        Our system results
characteristic                            Rel.     Similar   Amharic     Similar   English     Total rel.
                                          docs     docs      rel. docs             rel. docs   docs
Morphological   ኢትዮጵያ ፓርቲዎች               10       0         7           7         5           12
variations      ኢትዮጵያ ፓርቲ                 7                  7                     5           12
                ኢትዮጵያ ምርጫዎች               9        0         8           8         7           15
                ኢትዮጵያ ምርጫ                 6                  8                     7           15
Character       አማርኛ ዜና                   9        0         10          10        7           17
variations      አማርኛ ዜና (variant spelling) 2                  10                    7           17
                መለስ ዜናዊ                   10       0         6           6         4           10
                መለስ ዜናዊ (variant spelling) 3                  6                     4           10
Short words     ትምህርት ጥራት                 10       1         8           8         5           13
                ት/ት ጥራት                   6                  8                     5           13
As shown in Table 6.6, Google does not consider the characteristics of the Amharic language
and retrieves different results for variants of the same query. Our system, in contrast, returns the
same results for the different variations of the words in the query. In addition, since our system
is bilingual, it also retrieves some relevant English Web documents for these queries. In general,
the results in Table 6.6 show the significance of our system over general purpose search engines,
both in considering the characteristics of the Amharic language and in retrieving more relevant
documents by simultaneously searching Amharic and English Web documents for a given
Amharic query.
CHAPTER SEVEN
CONCLUSION AND RECOMMENDATIONS
7.1 Conclusion
With the expansion of the World Wide Web, the amount of online information in different
languages and encoding schemes is growing enormously. As a result, the number of non-English
speakers who use the Web as their major source of information and means of communication,
realizing the importance of finding information in different languages on the Web, is increasing
rapidly. These Web users need to query search engines in their own native language to find
information relevant to their needs. However, today's general purpose search engines do not
handle the special characteristics of non-English languages. This creates new challenges in
information search and retrieval, since these tasks need language-specific treatment.
Until recently, non-English-speaking Web users used general purpose search engines to find
information with English queries, even when they were not proficient enough to formulate the
queries in that language. Alternatively, they could use queries in their own language, although
such search engines do not take into account the special characteristics of the specific language
used. In response, a number of language-specific search engines that consider the characteristics
of the user's query language have been developed, to allow Web users to search for information
in their own native language. However, a search engine that supports only a specific language
may miss relevant Web documents that are not written in the language and script of the query,
particularly when the query language is not widely used on the Web.
Although Web users speak different languages and there is a huge number of non-English Web
documents, most Web resources are written and published in English. Amharic, the second most
spoken Semitic language in the world after Arabic, is one of the languages with rapidly growing
content on the Web. To search these Amharic Web documents, an Amharic search engine has
already been developed. However, far more Web documents are published in English than in
Amharic, and accessing only documents written in Amharic may not satisfy user requirements,
since there may also be relevant English documents for the user's query. As a result, Web users
need a multilingual search engine that can return results both in English and in their own native
language.
Nowadays, multilingual search engines have been developed that allow users to formulate a
query in their own native language and retrieve documents in other languages. For the case of a
single language pair, such a search engine is called a bilingual search engine. In this type of Web
information retrieval engine, since the query and the documents are written in different
languages, either the query or the documents must be translated; the translation system is
therefore one of the basic components that must be well addressed in any bilingual search
engine.
This research concerned itself with the bilingual search process, in an attempt to enable users to
find the information they need in Amharic and English. It was also concerned with the
translation of keywords from one language to the other, which is an essential issue in bilingual
search. Our work is probably the first Amharic-English bilingual search engine that displays the
search results in two different languages on the same page: one set for the user's query language
and the other for its translation. It uses a language-specific Amharic search engine and a general
purpose search engine as its underlying search engines.
In this research, we attempted to design and develop a bilingual Web search engine for the
Amharic and English languages. The search engine is developed considering the features of
Amharic and English and the translation between them in its design and implementation. The
search engine has different components which address the basic language-specific issues of the
two languages in query translation and information retrieval. These major components are:
Amharic query preprocessing, English query preprocessing, Amharic query translation, English
query translation, the Amharic search engine, and the English search engine.
The Amharic and English query preprocessing modules perform tokenization, normalization,
stop-word removal, and stemming of the Amharic and English queries respectively. The
Amharic and English query translation modules are responsible for lexical transfer (dictionary
lookup) in the bilingual Amharic-English dictionary and for transliterating the out-of-dictionary
words in the Amharic and English queries respectively. The Amharic-English transliteration
subcomponent is developed based on a heuristic Amharic-English character mapping. The
Amharic and English search engines are the underlying engines of the bilingual
Amharic-English search engine for crawling, indexing, and ranking the Amharic and English
Web documents respectively.
Nutch, an open source search engine, was chosen and customized to crawl, index, and rank both
the Amharic and English Web documents in the underlying search engines. The Web crawling
was conducted by providing selected seed URLs considered to host Amharic Web documents
for the Amharic search engine and English Web documents for the English search engine.
Some of the components of our system are independently evaluated for their performance. The
Amharic and English query translation components are evaluated by how well they translate the
given queries: the Amharic query translation module properly translates 88.57% of the words in
the given Amharic queries, whereas the English query translation module properly translates
93.33% of the given English queries. The transliteration component transliterates 80.64% of the
proper names correctly, which can be considered promising.
The precision was calculated for each query, and the average precision measure showed
promising results for both the Amharic-English and English-Amharic cross-lingual evaluations.
The average cross-lingual precision results were compared with the corresponding average
monolingual precision results for each retrieval engine. The experimental results showed that
the Amharic-English cross-lingual retrieval engine performed at 74.12% of the level of its
corresponding English monolingual retrieval engine, and the English-Amharic cross-lingual
retrieval engine at 78.82% of the level of its corresponding Amharic monolingual retrieval
engine. The results of both retrieval engines are promising.
7.2 Contributions of the Work
With this work, several novel ideas and designs are proposed.
7.3 Recommendations
In this thesis work, we have presented our efforts to design and implement a dictionary-based
bilingual Amharic-English Web search engine. However, developing a full-fledged
Amharic-English bilingual Web search engine requires the involvement of professionals from
several disciplines, such as computer science, information science, linguistics, and other related
fields. Additional features, improvements, and modifications should therefore be incorporated
to arrive at an effective and efficient bilingual search engine. Hence, we propose the following
recommendations for future research directions:
• Incorporating phrasal translation and co-occurrence analysis techniques for translation
disambiguation, in combination with the dictionary-based approach.
• Improving the performance of the transliteration component by handling exceptions,
transliteration variations and irregularities, and sadises and hamises.
• Expanding the bilingual search engine to include more languages, such as Tigrigna,
Afaan Oromo, etc.
• Enhancing the current Amharic stemmer, since it has a negative impact on the query
preprocessing for query translation, which in turn affects the search results.
• Developing and integrating an automatic Named Entity Recognizer (NER) for Amharic,
to identify whether a query term needs to be looked up in the bilingual dictionary or
should be transliterated.
• Since the main problem of the dictionary-based approach is its limited word coverage,
including a large commercial bilingual dictionary and an online bilingual dictionary for
query translation would minimize the problem.
• Investigating and integrating an Amharic thesaurus and WordNet, which can be used to
expand the query with synonyms.
• Identifying and including a comprehensive set of Amharic stop-words, short words, and
aliases.
• Building domain-specific machine translation systems and implementing the bilingual
search engine for specific applications. The bilingual search engine can easily be
customized and implemented to satisfy the needs of organizations in specific Web-based
application domains, such as medicine, business, and other domain-specific Web portals.
• Once efficient and critical masses of resources such as corpora, lexicons, morphological
analyzers, stemmers, and named entity recognizers have been developed and made
publicly available for the Amharic language, the Amharic-English bilingual Web search
engine will be able to reach a level beyond that of an academic prototype system.
References
[1]Atelach Alemu Argaw. "Amharic-English Information Retrieval with Pseudo Relevance
Feedback". In: Peters, C., et al. (eds.) Advances in Multilingual and Multimodal
Information Retrieval: 8th Workshop of the Cross Language Evaluation Forum, CLEF
2007, Budapest, Hungary, September 19-21, 2007, Revised Selected Papers, pp. 119-126.
Springer, Berlin/Heidelberg, 2008.
[4]Atelach Alemu Argaw, Lars Asker, Rickard Cöster and Jussi Karlgren. "Dictionary-based
Amharic - English Information Retrieval". In Proceedings of Cross Language Evaluation
Forum (CLEF 2004), Bath, UK. September 2004.
[5]J. Deepa Devi, Ranjani Parthasarathi and T.V. Geetha. “Tamil Search Engine”. Sixth
Tamil Internet 2003 Conference, Chennai, Tamilnadu, August 2003.
[6]Prasad Pingali, Jagadeesh J and Vasudeva Varma. “WebKhoj: Indian language IR from
multiple character encodings”. In Proceedings of the 15th International Conference on
World Wide Web Edinburgh, Scotland, May 2006.
[9]Kristen Parton, Kathleen R. McKeown, James Allan, and Enrique Henestroza.
“Simultaneous Multilingual Search for Translingual Information Retrieval”. In
Proceeding of the 17th ACM conference on Information and knowledge management,
Napa Valley, California, USA. ACM New York, NY, USA, 2008.
[11]Karunesh Arora, Ankur Garg, Gour Mohan, Somiram Singla, and Chander Mohan.
“Cross Lingual Information Retrieval Efficiency Improvement through Transliteration”.
In Proceedings of ASCNT – 2009, CDAC, Noida, India, pp. 65 – 71, 2009.
[13]Judit Bar-Ilan and Tatyana Gutman: “How do search engines handle non-English
queries? - A case study”. WWW (Alternate Paper Tracks), Budapest, Hungary, 2003.
[17]Hassen Redwan Hussen. “Enhanced Design of Amharic Search Engine (An Amharic
Search Engine with Alias and Multi-character Set Support)”. A Thesis Submitted to the
School of Graduate Studies of Addis Ababa University in Partial Fulfillment for the
Degree of Master of Science in Computer Science, 2008.
[18]Wen-hui Zhang, Hua-lin Qian, Wei Mao and Guo-nian Sun. “A Multilingual (Chinese,
English) Indexing, Retrieval, Searching Search Engine”. Available at
[20]Joanne Capstick, Abdel Kader Diagne, Gregor Erbach and Hans Uszkoreit.
“MULINEX: Multilingual Web Search and Navigation”, Accessed on August 25, 2009,
Available at http://eprints.kfupm.edu.sa/52030/1/52030.pdf, Published on 08.02.99.
[22]Yaser Al-Onaizan and Kevin Knight. “Translating Named Entities Using Monolingual
and Bilingual Resources.” In Proceedings of the 40th Annual Meeting of the Association
for Computational Linguistics (ACL), Philadelphia, July 2002.
[27]Monica Peshave and Kamyar Dezhgosha, "How Search Engines Work and a Web
Crawler Application". Department of Computer Science, University of Illinois,
Springfield USA, 2005.
[28]Sergey Brin and Lawrence Page. “The anatomy of a large-scale hypertextual Web
search engine”. Computer Networks and ISDN Systems, April 1998.
[30] Kiduk Yang. “Information Retrieval on the Web”. Annual Review of Information
Science and Technology, Volume 39 Issue 1, Pages 33 – 80, American Society for
Information Science and Technology, 2008.
[31]Andrew Graves, Mounia Lalmas. “Video retrieval using an MPEG-7 based inference
network”. In Proceedings of the 25th annual international ACM SIGIR conference on
Research and development in information retrieval. ACM New York, NY, USA, 2002.
[33]Adam Lopez. “Statistical Machine Translation”. ACM Computing Surveys, Vol. 40,
No. 3, Article 8, August 2008.
[34]Jialun Qin, Yilu Zhou, Michael Chau and Hsinchun Chen. “Multilingual Web retrieval:
An experiment in English–Chinese business intelligence”. John Wiley & Sons, Inc., New
York, NY, USA, 2006.
[37]H. Moukdad and H. Cui (2005). How Do Search Engines Handle Chinese Queries?
Webology, 2 (3), Article 17.
[38]H. Moukdad. Lost In Cyberspace: How Do Search Engines Handle Arabic Queries? The
12th International World Wide Web Conference, Budapest, Hungary, May 2003.
[39] Interlingual Machine Translation.
[40]Machine Translation.
[45]Dorr, Bonnie, Eduard Hovy and Lori Levin, "Machine Translation: Interlingual
Methods", Encyclopedia of Language and Linguistics 2nd edition ms. 939, Brown, Keith
(ed.), 2004.
[46]Alexander Franz, Keiko Horiguchi, Lei Duan, Doris Ecker, Eugene Koontz, and Kazami
Uchida. An integrated architecture for example-based machine translation. Association
for Computational Linguistics Morristown, NJ, USA. 2000.
[48]Atelach Alemu, Lars Asker, and Mesfin Getachew. “Natural language processing for
Amharic: Overview and suggestions for a way forward”. In Proceedings of the 10th
Conference on Traitement Automatique des Langues Naturelles. Batzsur-Mer, France,
June 2003.
[49]Saba Amsalu & Sisay Fissiha Adafre. “Machine Translation for Amharic: Where we
are”. In Proceedings of LREC-2006: Fifth International Conference on Language
Resources and Evaluation. 5th SALTMIL Workshop on Minority Languages: “Strategies
for developing machine translation for minority languages”, Genoa, Italy. 2006.
[50]Wessel Kraaij, Jian-Yun Nie, and Michel Simard. “Embedding Web-based statistical
translation models in cross-language information retrieval.” MIT Press Cambridge, MA,
USA, 2003.
[51]A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. “Searching the Web”.
ACM Transactions on Internet Technology (TOIT), 2001.
[53]Atelach Alemu Argaw and Lars Asker. “An Amharic Stemmer: Reducing Words to their
Citation Forms”. In Proceedings of the 5th Workshop on Important Unresolved Matters,
pages 104–110. Association for Computational Linguistics, 2007.
[54]Seid Muhie Yimam. “TETEYEQ: Amharic Question Answering for Factoid
Questions”. A Thesis Submitted to the School of Graduate Studies of the Addis Ababa
University in Partial Fulfillment for the Degree of Master of Science in Computer
Science, 2009.
[56]Lu Chengye, Xu Yue, and Geva Shlomo. “Web-Based Query Translation for English-
Chinese CLIR”. Computational Linguistics and Chinese Language Processing (CLCLP),
Vol. 13, No. 1, pp. 61-90, 2008.
[57]Amsalu Aklilu and G.P. Mosback. “English Amharic Dictionary”. Oxford University
Press, 1973; reprinted by Makusa Publishing PLC, 1996.
[59]Serge Obolensky, Debebow Zelelie, and Mulugeta Andualem. “Amharic Basic Course”.
Foreign Service Institute, Washington D.C., 1995.
[62]Yilu Zhou, Jialun Qin, Hsinchun Chen, and Jay F. Nunamaker. “Multilingual Web
Retrieval: An Experiment on a Multilingual Business Intelligence Portal”. In the
Proceedings of the 38th Annual Hawaii International Conference on System Sciences.
IEEE Computer Society Washington, DC, USA, 2005.
Appendix I: Lists of Amharic Stop Words
Appendix II: Lists of English Stop Words
a backing done fully
about backs down further
above be down furthered
across became downed furthering
after because downing furthers
again become downs g
against becomes during gave
all been e general
are before each generally
almost began early get
alone behind either gets
along being end give
already beings ended given
also best ending gives
although better ends go
always between enough going
among big even good
an both evenly goods
and but ever got
another by every great
any c everybody greater
anybody came everyone greatest
anyone can everything group
anything cannot everywhere grouped
anywhere case f grouping
are cases face groups
area certain faces h
areas certainly fact had
around clear facts has
as clearly far have
ask come felt having
asked could few he
asking d find her
asks did finds here
at differ first herself
away different for high
b differently four high
back do from high
backed does full higher
highest likely noone places
him long not point
himself longer nothing pointed
his longest now pointing
how m nowhere points
however made number possible
i make numbers present
if making o presented
important man of presenting
in many off presents
interest may often problem
interested me old problems
interesting member older put
interests members oldest puts
into men on q
is might once quite
it more one r
its most only rather
itself mostly open really
j mr opened right
just mrs opening right
k much opens room
keep must or rooms
keeps my order s
kind myself ordered said
knew n ordering same
know necessary orders saw
known need other say
knows needed others says
l needing our second
large needs out seconds
largely never over see
last new p seem
later new part seemed
latest newer parted seeming
least newest parting seems
less next parts sees
let no per several
lets nobody perhaps shall
like non place she
should their two which
show them u while
showed then under who
showing there until whole
shows therefore up whose
side these upon why
sides they us will
since thing use with
small things used within
smaller think uses without
smallest thinks v work
so this very worked
some those w working
somebody though want works
someone thought wanted would
something thoughts wanting x
somewhere three wants y
state through was year
states thus way years
still to ways yet
still today we you
such together well young
sure too wells younger
t took went youngest
take toward were your
taken turn what yours
than turned when z
that turning where
the turns whether
Appendix III: Lists of Amharic Short Words
Appendix IV: Amharic-English Character Mapping
No.   1st order   2nd order   3rd order   4th order   5th order   6th order   7th order   8th order
1. ሀ = ha ሁ = hu ሂ = hi ሃ = ha ሄ = he/hie ህ = h/hi ሆ = ho
14. አ = E ኡ = u ኢ = i ኣ = a ኤ = A እ = e ኦ = o ኧ = e
16 ወ = we ዉ = wu ዊ = wi ዋ = wa ዌ = we/wie ው = we ዎ = wo
19 የ = ye ዩ = yu ዪ = yi ያ = ya ዬ = ye/yie ይ = y/yi ዮ = yo
Appendix V: Basic Configurations of the Amharic-English Bilingual Search Engine
Appendix VI: Sample Crawling Process Test Result
crawl started in: AmharicCrawl
rootUrlDir = urls
threads = 10
depth = 10
Injector: starting
Injector: crawlDb: AmharicCrawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: AmharicCrawl/segments/20100914160014
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: AmharicCrawl/segments/20100914160014
Fetcher: threads: 10
fetching http://am.wikipedia.org/
fetching http://www.addisadmass.com/
fetching http://www.ena.gov.et/
fetching http://archives.ethiozena.com/
fetching http://www.dw-world.de/amharic/
fetching http://www.waltainfo.com/
fetching http://www.zethiopia.com/
fetching http://www.ethpress.gov.et/
fetching http://www.ethiopianreporter.com/amharic/
fetch of http://www.addisadmass.com/ failed with: Http code=500, url=http://www.addisadmass.com/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: AmharicCrawl/crawldb
CrawlDb update: segments: [AmharicCrawl/segments/20100914160014]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: AmharicCrawl/segments/20100914160038
Generator: filtering: false
Generator: topN: 2147483647
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: AmharicCrawl/segments/20100914160038
Fetcher: threads: 10
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121256.htm
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121225.htm
fetching http://www.ethpress.gov.et/../forum/forums.php
fetching http://www.ethpress.gov.et/../../ethpress/main/viewsOpinion.php
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121233.htm
fetching http://www.ethpress.gov.et/ethpress/main/economy.php
fetching http://www.ethpress.gov.et/../../ethpress/main/politics.php
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/14Sep10/121278.htm
fetching http://www.ena.gov.et/#Menu1_SkipLink
fetching http://www.ethiopianreporter.com/amharic/#
fetching http://www.addisadmass.com/
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121274.htm
fetching http://www.ethiopianreporter.com/amharic/</div>
fetch of http://www.addisadmass.com/ failed with: Http code=500, url=http://www.addisadmass.com/
fetching http://www.ethiopianreporter.com/amharic/).removeClass(
fetching http://www.ethpress.gov.et/ethpress/main/viewsOpinion.php
fetching http://www.ethpress.gov.et/ethpress/main/politics.php
fetching http://www.ethpress.gov.et/ethpress/main/contactUs.php
fetching http://www.ethpress.gov.et/ethpress/subscribe/subscribe.php
fetching http://www.dw-world.de/amharic/#
fetching http://www.dw-world.de/amharic/#mainContent
fetching http://www.ethpress.gov.et/../../ethpress/pictureGallery/pictureGallery.php
fetching http://www.ethiopianreporter.com/amharic/).addClass(
fetching http://www.ena.gov.et/Videos/Meles_InterviewII_081310.wmv
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121222.htm
fetching http://www.ethiopianreporter.com/amharic/application/x-shockwave-flash
fetching http://www.ethpress.gov.et/../subscribe/subscribe.php
fetching http://www.ethpress.gov.et/aboutus.php
fetching http://www.dw-world.de/amharic/#headSearch
fetching http://www.ethpress.gov.et/ethpress/main/survey.php
fetching http://www.ethpress.gov.et/../../ethpress/main/msgMinInfo.php
fetching http://www.ethpress.gov.et/../../ethpress/main/editorial.php
fetching http://www.ethpress.gov.et/ethpress/main/news.php
fetching http://www.ena.gov.et/Videos/Meles_Interview_EngII_081310.wmv
fetching http://www.ena.gov.et/Videos/Meles_Interview_Eng_081310.wmv
fetching http://www.ethpress.gov.et/ethpress/main/home.php
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121265.htm
fetching http://www.dw-world.de/amharic/8.0.0
fetching http://www.ethpress.gov.et/ethpress/security/login.php
fetching http://www.waltainfo.com/
fetching http://www.ena.gov.et/Videos/Meles_Interview_081310.wmv
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121273.htm
fetching http://www.dw-world.de/amharic/#headLanguage
fetching http://www.ethpress.gov.et/ethpress/main/aboutUs.php
fetching http://www.ethpress.gov.et/ethpress/forum/forums.php
fetching http://www.dw-world.de/amharic/#multi
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121275.htm
fetching http://www.ethpress.gov.et/ethpress/security/register.php
fetching http://www.ena.gov.et/Default.aspx
fetching http://www.ethpress.gov.et/../../ethpress/main/economy.php
fetching http://www.ethpress.gov.et/ethpress/advertise/reqAdvertise.php
fetching http://www.ethpress.gov.et/../../ethpress/main/news.php
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121277.htm
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121246.htm
fetching http://www.zethiopia.com/text/javascript
fetching http://www.ethpress.gov.et/contactus.php
fetching http://www.ena.gov.et/AmharicNews/2010/Sep/13Sep10/121270.htm
fetching http://www.ena.gov.et/EnglishNews/2010/Sep/13Sep10/121262.htm
fetching http://www.dw-world.de/amharic/#nav
fetching http://www.dw-world.de/amharic/#headLinks
fetching http://www.ethpress.gov.et/ethpress/main/editorial.php
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: AmharicCrawl/crawldb
CrawlDb update: segments: [AmharicCrawl/segments/20100907044047]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: AmharicCrawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100906232355
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100906232451
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100906232858
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100906233350
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100907003613
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100907024804
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100907031321
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100907034852
LinkDb: adding segment: file:/d:/nutch-1.0/AmharicCrawl/segments/20100907044047
LinkDb: done
Indexer: starting
Indexer: done
Dedup: starting
Dedup: adding indexes in: AmharicCrawl/indexes
Dedup: done
Merging indexes to: AmharicCrawl/index
Adding AmharicCrawl/indexes/part-00000
Declaration
I, the undersigned, declare that this thesis is my original work and has not been presented for a
degree in any other university, and that all source of materials used for the thesis have been duly
acknowledged.
Declared by:
Name: ________________________________________
Signature: _____________________________________
Date: _________________________________________
Confirmed by advisor:
Name: ________________________________________
Signature: _____________________________________
Date: _________________________________________