Ontology Based Text Categorization - Telugu Documents: Mrs.A.Kanaka Durga, Dr.A.Govardhan
ISSN 2229-5518
Index Terms: Concept-based model, IR, Ontology, Retrieval model, Term frequency, Text categorization, Telugu documents.
1. INTRODUCTION
In the current paper we have focused our efforts on electronic documents in the Telugu language.

The Telugu Language: Telugu is the second most spoken language in India after Hindi. It belongs to the South Central Dravidian subgroup of the Dravidian family of languages and was recently awarded classical language status. Telugu has been the language of choice for lyrical compositions because its words end in vowels, and it is rightly called the “Italian of the East”. Words in Dravidian languages, especially in Telugu, are long and complex. Like other Dravidian languages, Telugu is morphologically rich and agglutinative in nature. Telugu has 16 vowels and 40 consonants.

Text Categorization: Text categorization (TC) is the classification of documents with respect to a set of one or more pre-existing categories. TC is a hard and very useful operation, frequently applied to assign subject categories to documents, to route and filter texts, or as a part of natural language processing systems. In the past, the methods proposed for text categorization were typically based on the classical Bag-of-Words model, where each term or term stem is an independent feature. The disadvantages of this classical representation are:
a) Relations between words are ignored, so learning algorithms can detect patterns only in the terminology used, while conceptual patterns remain ignored.
b) The dimensionality of the representation space is high.

In this article, we propose a new method for text categorization, which is based on the use of the WordNet ontology to capture the relations between words. In this approach, terms are merged with their associated concepts, extracted from the ontology, to form a hybrid model for text representation. We have undertaken a series of experiments on Telugu documents which highlight the positive contribution of this approach.

Ontology: An ontology is not bound to a single norm of construction, definition or expression. Jade Goldstein [2] gives a conceptual description of an ontology in terms of concepts, attributes, entities and associations, with knowledge sharing and reuse as its main purpose. Concepts (classes) are abstract sets; attributes (properties) describe the characteristics of objects; entities (individuals, instances) are real things; and associations (relations) link two concepts or entities.

An ontology is a formal, explicit specification of a shared conceptualization. In the context of knowledge sharing, we use the term ontology to mean a specification of a conceptualization. That is, an ontology is a description of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as a set of concept definitions, but is more general.

The proposed work is an efficient way of extracting text from Telugu documents and performing information retrieval on them.

Related work in this area is explained in Section 2. Our proposed work and its layout are explained in Section 3. Results and performance are dealt with in Section 4. Section 5 states the conclusions and further work to be done.

2. LITERATURE SURVEY

Semantics has been introduced at various linguistic levels (word level, sentence level and document content extraction level) and at various stages of Information Retrieval, such as query and document
IJSER © 2011
http://www.ijser.org
International Journal of Scientific & Engineering Research Volume 2, Issue 9, September-2011 2
representation, and in indexing. Any attempt to bring in semantics needs to balance the amount of complex natural language processing required against the increase in retrieval performance. It is important to note that the pre-processing done for document representation is an offline, one-time process, while the resulting improvement in retrieval performance is obtained every time. The main modules of IR are pre-processing, indexing and retrieval. A set of documents is given as input to the pre-processing phase, where the stop words and punctuation are removed. The parts of speech of the content words are determined by the POS (Part of Speech) tagger after the stemmer stems the content words, resulting in root words. Basically, a document can be represented as a bag of words using the Boolean model. The bag of words, however, does not provide a ranking of the retrieved documents.

To overcome this limitation, keyword-based search has been put forward, in which precision and recall are improved, but it also gives some ambiguous results. The use of ontology is the motivation for semantic information retrieval. A semantic search engine is viewed as a tool that gets formal ontology-based queries (e.g., in RDQL, RQL, SPARQL, etc.) from a client, executes them against a knowledge base (KB), and returns ontology values that satisfy the query. These techniques typically use Boolean search models based on an ideal view of the information space as consisting of non-ambiguous, non-redundant, formal pieces of ontological knowledge.

Conceptual search, i.e., search based on meaning rather than just character strings, was the motivation of a large body of research in the IR field long before semantic information retrieval emerged. This drive can be found in popular and widely explored areas such as Latent Semantic Indexing, linguistic conceptualization approaches, or the use of thesauri and taxonomies to improve retrieval.

Those proposals are commonly based on shallow and sparse conceptualizations, usually considering very few types of relations between concepts and low levels of information specificity. The model we propose considers a much more detailed and densely populated conceptual space in the form of an ontology-based KB. Though it is difficult to obtain such a rich conceptual space, this is one of the major targets addressed by the Semantic Web research community.

... in order to overcome this defect, we use a morphological analyzer tool to get the root words. As a next step, domain-specific key words are identified. A text classifier was applied to the key words selected from the Telugu document, and the classifier efficiency on the key words was found to be better than on the full set of words. When we applied the ontological classification, we found a very large improvement in the classifier efficiency. Here we have used the Ontology_Dictionary (WordNet - Telugu) developed by the Centre for Advanced Linguistics and Transliteration Studies (CALTS-UOH), University of Hyderabad (Central University), for feature grouping. All words that are grouped based on their features are termed a word class/concept. With the help of the ontology, terms that are found in and around the same concept are mapped into one dimension. This helps in excluding or disambiguating terms that are present in many concepts due to semantic ambiguity.

3.1 An Illustration Using Ontology Based Classification for Telugu Document

“AarDika mMtrito, kAryadarsito muKhya mMtri assembly lo mMtaNalu” - Telugu (“Chief Minister discussed with the Finance Minister and secretary in the assembly” - English)

Words: {AarDika, mMtrito, muKhya, mMtri, assembly lo, mMtaNalu}
Root word: mMtri
Key words: {mMtri, assembly}
Feature Grouping: {mMtri, muKhyamMtri, kAryadarsi}
Word Class/Concept: {mMtri, muKhyamMtri, kAryadarsi}
Rajakeeyalu
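The mapping of grouped terms into one dimension, described in the text and illustrated with the minister example, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the dictionary `ONTOLOGY` is a hypothetical stand-in for the Ontology_Dictionary, the concept label `minister-group` is invented, and the list of root words is written by hand rather than produced by a morphological analyzer.

```python
from collections import Counter

# Hypothetical toy stand-in for the ontology dictionary: root word -> concept.
# The three entries mirror the feature grouping of the illustration; the
# concept label "minister-group" is invented for this sketch.
ONTOLOGY = {
    "mMtri": "minister-group",
    "muKhyamMtri": "minister-group",
    "kAryadarsi": "minister-group",
}


def to_hybrid_features(root_words):
    """Count features, collapsing words that share a concept into one dimension."""
    features = Counter()
    for word in root_words:
        concept = ONTOLOGY.get(word)
        # Words covered by the ontology contribute to their concept's
        # dimension; all other words remain ordinary term features.
        key = f"concept:{concept}" if concept else f"term:{word}"
        features[key] += 1
    return features


# Assumed output of morphological analysis for the example sentence
# (hand-written here, not computed).
roots = ["AarDika", "mMtri", "kAryadarsi", "muKhyamMtri", "assembly", "mMtaNalu"]
features = to_hybrid_features(roots)
```

In this sketch the three grouped words fall into a single concept dimension, so the six root words yield four features instead of six, which is the kind of dimensionality reduction the approach aims for.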
Algorithm:
1. Start
2. Morph Analysis (finding base words)
3. Apply Ontology
4. Find the sub-category
5. Recognize the parent node as the category of the respective document
6. If the parent has a child, repeat the process (iterate steps 3-5)
7. Otherwise take the parent as the final category

3.2 Equations
If H1 is true, accept the fact that the efficiency levels are better. To measure the performance of these measures, we calculated the recall rate and the precision rate.
Recall rate = a/b and precision rate = a/c
where a = No. of documents which are classified into the category correctly,
b = No. of documents of the category in the testing data,
c = No. of documents which are classified into the category.

Table 1. Groups of Misclassification
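The recall and precision rates defined above can be computed per category as in the following sketch. The function name `recall_precision` and the sample labels are hypothetical and chosen only for illustration; they are not data from the experiments.

```python
def recall_precision(true_labels, predicted_labels, category):
    """Return (recall, precision) for one category, following a/b and a/c."""
    pairs = list(zip(true_labels, predicted_labels))
    # a: documents classified into the category correctly
    a = sum(1 for t, p in pairs if t == category and p == category)
    # b: documents of the category in the testing data
    b = sum(1 for t, _ in pairs if t == category)
    # c: documents classified into the category (correctly or not)
    c = sum(1 for _, p in pairs if p == category)
    recall = a / b if b else 0.0
    precision = a / c if c else 0.0
    return recall, precision


# Made-up labels for five test documents.
true_labels = ["politics", "politics", "sports", "politics", "sports"]
predicted = ["politics", "sports", "sports", "politics", "politics"]
recall, precision = recall_precision(true_labels, predicted, "politics")
```

For the category "politics" in this toy data, a = 2, b = 3 and c = 3, so both rates equal 2/3.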
3.5 FIGURES: FLOW PROCESS OF ONTOLOGY BASED TEXT CATEGORIZATION FOR TELUGU DOCUMENTS
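The flow of the seven-step algorithm could look roughly like the sketch below. It rests on several assumptions that are not confirmed by the paper: steps 5 to 7 are read as climbing from a matched sub-category to its parent until no further parent exists; the node names "Government" and "Legislature" are invented; "Rajakeeyalu" is assumed to be a top-level category; and the majority vote over key words is an added simplification.

```python
from collections import Counter

# Hypothetical ontology fragment: child -> parent. All links are illustrative.
PARENT = {
    "mMtri": "Government",
    "muKhyamMtri": "Government",
    "kAryadarsi": "Government",
    "assembly": "Legislature",
    "Government": "Rajakeeyalu",
    "Legislature": "Rajakeeyalu",
}


def top_category(node):
    """Follow parent links until a node with no parent is reached (steps 5-7)."""
    while node in PARENT:
        node = PARENT[node]
    return node


def categorize(root_words):
    """Pick the top-level category reached by most of the known root words."""
    # Words missing from the ontology are skipped (steps 3-4 find no sub-category).
    votes = Counter(top_category(w) for w in root_words if w in PARENT)
    return votes.most_common(1)[0][0] if votes else None


# Root words are assumed to come from a separate morphological analysis (step 2).
category = categorize(["mMtri", "assembly", "kAryadarsi", "mMtaNalu"])
```

A document whose root words do not occur in the ontology receives no category in this sketch; how the original system handles that case is not stated.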
International Journal of Scientific & Engineering Research Volume 2, Issue 6, June-2011 4
4. CONCLUSION:
6. REFERENCES
[1] Sebastiani F., “Machine Learning in Automated Text Categorization,” ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.