
International Journal of Scientific & Engineering Research, Volume 2, Issue 9, September-2011

ISSN 2229-5518

Ontology Based Text Categorization - Telugu Documents
Mrs. A. Kanaka Durga, Dr. A. Govardhan
Abstract: In this paper, we introduce a new method of ontology-based text classification and retrieval for Telugu documents. Many text categorization techniques are based on word and/or phrase analysis of the text. Term frequency analysis signifies the importance of a term within a document. Two terms within a document can have the same frequency, yet one may contribute more to the meaning of a sentence than the other. Our aim is to capture the semantics of a text. The model captures the terms that present the concepts in the text and thus identifies the topic of the document. We introduce a new concept-based model that analyzes terms at the sentence and document levels. This concept-based model discriminates between terms that are unimportant to sentence semantics and terms that hold the concepts representing the sentence meaning. The limitations of keyword-based search are overcome by the use of an ontology, which is the motivation of semantic IR. The retrieval model is based on an adaptation of the classic vector-space model. In a learning stage, each ontology concept is associated with related words and their weights from pre-classified documents. In the main process, the words and their mutual relations are extracted from the target documents, and the ontology is used to map the target document. Test results are described in the paper, and we explain how concept-based classification outperforms word-based classification for Telugu documents.

Index Terms: Concept-based model, IR, Ontology, Retrieval model, Term frequency, Text categorization, Telugu documents.

1. INTRODUCTION
In the current paper we have focused our efforts on electronic documents in the Telugu language.

The Telugu Language: Telugu is the second most spoken language in India after Hindi. Telugu belongs to the South Central Dravidian subgroup of the Dravidian family of languages. It has recently been awarded Classical status. Telugu has been the language of choice for lyrical compositions because its words end in vowels, and it is rightly called the "Italian of the East". Words in Dravidian languages, especially in Telugu, are long and complex. Telugu, like other Dravidian languages, is highly rich in morphology and hence agglutinative in nature. Telugu has 16 vowels and 40 consonants.

Text Categorization: Text categorization (TC) is the classification of documents with respect to a set of one or more pre-existing categories. TC is a hard and very useful operation, frequently applied to assign subject categories to documents, to route and filter texts, or as a part of natural language processing systems. In the past, several methods proposed for text categorization were based on the classical Bag-of-Words model, where each term or term stem is an independent feature. The disadvantages of this classical representation are:
a) It ignores any relation between words, so learning algorithms are restricted to detecting patterns in the terminology used, while conceptual patterns remain ignored.
b) The representation space has a high dimensionality.
In this article, we propose a new method for text categorization, which is based on the use of the WordNet ontology to capture the relations between the words. In this approach, terms are merged with their associated concepts extracted from the ontology to form a hybrid model for text representation. We have undertaken a series of experiments on Telugu documents which highlight the positive contribution of this approach.

Ontology: There is no single norm for the construction, definition, or expression of an ontology. A conceptual description of ontology, including concept, attribute, entity, and association descriptions, with the main purpose of knowledge sharing and reuse, is given by Jade Goldstein [2]. In an ontology, concepts (classes) are abstract sets; attributes (properties) describe the characteristics of objects; entities (individuals, instances) are real things; and associations (relations) name the link between two concepts or entities.

An ontology is a formal, explicit specification of a shared conceptualization. Jade [2] defined ontology as a conceptual description, including concept, attribute, entity, and association descriptions, with the main purpose of sharing and reusing knowledge. In the context of knowledge sharing, we use the term ontology to mean a specification of a conceptualization. That is, an ontology is a description of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as a set of concept definitions, but it is more general.

The proposed work is an efficient way of extracting text from Telugu documents and performing Information Retrieval on them.

Related work in this area is reviewed in Section 2. The proposed work, its layout, and the results and performance are described in Section 3. Section 4 states the conclusions.

2. LITERATURE SURVEY
Semantics has been introduced at various linguistic levels (word level, sentence level, and document content extraction level) and at various stages of Information Retrieval, such as query and document
representation, and in indexing. Any attempt to bring in semantics needs to balance the amount of complex natural language processing required against the increase in retrieval performance. It is important to note that the pre-processing done for document representation is an offline, one-time process whose benefit to retrieval performance is obtained on every query. The main modules of IR are pre-processing, indexing, and retrieval. A set of documents is given as input to the pre-processing phase, where stop words and punctuation are removed. The stemmer reduces the content words to root words, after which the POS (Part of Speech) tagger determines their parts of speech. Basically, a document can be represented as a bag of words using the Boolean model. The bag of words, however, does not provide a ranking of the retrieved documents.

To overcome this limitation, keyword-based search has been put forward, in which precision and recall are improved but some results remain ambiguous. The use of ontology is the motivation of semantic information retrieval. A semantic search engine is viewed as a tool that gets formal ontology-based queries (e.g., in RDQL, RQL, SPARQL, etc.) from a client, executes them against a knowledge base (KB), and returns ontology values that satisfy the query. These techniques typically use Boolean search models based on an ideal view of the information space as consisting of non-ambiguous, non-redundant, formal pieces of ontological knowledge.

Conceptual search, i.e., search based on meaning rather than just character strings, has been the motivation of a large body of research in the IR field long before semantic information retrieval emerged. This drive can be found in popular and widely explored areas such as Latent Semantic Indexing, linguistic conceptualization approaches, and the use of thesauri and taxonomies to improve retrieval.

Those proposals are commonly based on shallow and sparse conceptualizations, usually considering very few different types of relations between concepts and low information specificity levels. The model we propose considers a much more detailed and densely populated conceptual space in the form of an ontology-based KB. Though it is difficult to obtain such a rich conceptual space, this is one of the major targets addressed by the Semantic Web research community.

Our approach combines the flexibility and generality of an IR model for unstructured search spaces with the expressiveness and detail of a structured relational model, which describes some of the knowledge involved in the unstructured information space in a structured and formal way, with powerful and precise data querying facilities. An ontology-based approach can be relied upon since it enables further inferencing capabilities that can be exploited to enhance the retrieval process. By building upon an ontology-based layer, our model benefits from semantic data integration facilities.

Mayfield and Finin combine ontology-based techniques and text-based retrieval in sequence. We share with Mayfield et al. the idea that semantic search should be a complement of keyword-based search as long as not enough ontologies and metadata are available.

3. DETAILS OF THE WORK CARRIED OUT
To start with, each text document is tokenized into a set of words. Classifier efficiency is low when any conventional classification method is applied to these words directly. This low efficiency is attributed to the inflected forms of the words. To overcome this defect, we use a morphological analyzer tool to get the root words. As a next step, domain-specific key words are identified. A text classifier is applied to the key words selected from the Telugu document, and we found that the classifier is more efficient on key words than on the raw words. We found that when we applied the ontological classification, there was a large further improvement in classifier efficiency. Here we have used the Ontology_Dictionary (WordNet - Telugu) developed by the Centre for Applied Linguistics and Translation Studies (CALTS-UOH), University of Hyderabad (Central University), for feature grouping. All words that are grouped based on their features are termed a word class/concept. With the help of the ontology, terms that are found in and around the same concept are mapped into one dimension. This helps in excluding or disambiguating terms that are present in many concepts because of semantic ambiguity.

3.1 An Illustration Using Ontology Based Classification for Telugu Document

"AarDika mMtrito, kAryadarsito muKhya mMtri assembly lo mMtaNalu" - Telugu ("Chief Minister discussed with the Finance Minister and secretary in the assembly" - English)

Words: {AarDika, mMtrito, muKhya, mMtri, assembly lo, mMtaNalu}
Root word: mMtri
Key words: {mMtri, assembly}
Feature Grouping: {mMtri, muKhyamMtri, kAryadarsi}
Word Class/Concept: {mMtri, muKhyamMtri, kAryadarsi}

[Figure: ontology tree rooted at Rajakeeyalu. Second level: PaRtI, Assembly. Third level: Party gurtu, Party office, Abhyarthi, Palaka Paksham, Prati Paksham. Fourth level: Speaker, Dy Speaker, Mantrulu, Sakhalu, Floorleader.]

On the pre-processed document, root words are extracted through the ontology at the concept level. Nouns are identified, and their frequencies are computed and preserved in the data bank. From the nouns thus retrieved, feature-matrix clusters are developed. We have calculated a representative feature vector for each concept node in the ontology. We have then measured the similarity of any two of those class vectors by a simple cosine measure.
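
The cosine measure between two concept class vectors can be illustrated with a minimal Python sketch. It assumes each concept node is represented as a sparse mapping from noun to weight; the vectors and their values below are hypothetical and are not taken from the experiments.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors for two concept nodes (illustrative weights only).
party_vector = {"mMtri": 2.0, "abhyarthi": 1.0}
assembly_vector = {"mMtri": 1.0, "speaker": 2.0}

similarity = cosine_similarity(party_vector, assembly_vector)
```

The value lies between 0 and 1 for non-negative weights, and two concept nodes that share no nouns score 0.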
3.2 Algorithm
1. Start
2. Morph analysis (finding base words)
3. Apply ontology
4. Find the sub-category
5. Recognize the parent node as a category of the respective document
6. If the parent has a child, repeat the process (iterate steps 3-5)
7. Otherwise, take the parent as the final category

3.3 Vector Space Model:
We used the vector space model to weigh terms and calculate feature vectors.
The weight of a term is given as: $w_{ik} = tf_{ik} \times idf_k$
where
$tf_{ik}$ is the number of occurrences of term $t_k$ in document $i$, and
$idf_k$ is the inverse document frequency of the term $t_k$ in the collection of documents.
A commonly used measure for the inverse document frequency is:
$idf_k = \log(N / n_k)$
where
$N$ is the total number of documents in the collection, and
$n_k$ is the number of documents that contain the term $t_k$.

Ontology based classification is carried out in three steps:
Step I : Ontology creation
Step II : Calculating relevance score
Step III : Text classification

Experiments were conducted on a small sample of 400 Telugu documents, which were broadly categorised into two categories, namely Rajakeeyalu (Politics) and Aatalu (Sports). Of these 400 documents, 80% were used as training documents and the remaining 20% were used as testing documents.

3.4 Hypothesis:
H1: In ontology based text categorization, more distinctive features are mapped towards the right of the ontology scale.
H0: In ontology based text categorization, less distinctive features are mapped towards the left of the ontology scale.

If H1 is true, we accept that the efficiency levels are better. To measure performance, we calculated the recall rate and the precision rate.
Recall rate = a / b and precision rate = a / c
where
a = number of documents which are classified into the category correctly,
b = number of documents of the category in the testing data,
c = number of documents which are classified into the category.

Table 1. Groups of Misclassification

Result type                                  No. of texts
Texts assigned to the subclass category      20
Texts assigned to the superclass category    4
Texts assigned to the other category         40
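
The term weighting of Section 3.3 and the recall and precision rates can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the paper does not give the base of the logarithm, so the natural logarithm is assumed, and the function names, the toy documents, and the counts in the usage lines are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with w_ik = tf_ik * log(N / n_k).

    docs is a list of token lists. The natural logarithm is an assumption.
    """
    n_docs = len(docs)
    # n_k: number of documents that contain term k at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ik: raw count of term k in document i
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

def recall_precision(a, b, c):
    """Recall = a / b and precision = a / c, with a, b, c as defined in the text."""
    recall = a / b if b else 0.0
    precision = a / c if c else 0.0
    return recall, precision

# Hypothetical toy collection of two tokenized documents.
toy_docs = [["mMtri", "assembly", "mMtri"], ["aata", "assembly"]]
toy_weights = tf_idf(toy_docs)

# Hypothetical counts, not results from the paper.
toy_recall, toy_precision = recall_precision(a=30, b=40, c=50)
```

A term that occurs in every document, such as "assembly" in the toy collection, receives weight zero, which is how the inverse document frequency suppresses terms that do not discriminate between categories.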

3.5 FIGURES: FLOW PROCESS OF ONTOLOGY BASED TEXT CATEGORIZATION FOR TELUGU DOCUMENTS

[Figure: flow diagram. Stages shown: Telugu Text Document, Tokenizing, Words, Morphological Analyzer, Root words, Key words, Text Classify; and Ontology_Dictionary, Feature grouping, Word Classes, Text Classify.]
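
One possible reading of this flow, combined with the parent/child iteration of the algorithm in Section 3, is a top-down descent through the ontology tree. The Python sketch below is only an interpretation: the toy ontology, the concept word sets, the suffix list, and the overlap score are all hypothetical, and the suffix stripper is a crude stand-in for the morphological analyzer, not a model of it.

```python
# Hypothetical toy ontology: each node lists its child nodes.
CHILDREN = {
    "Rajakeeyalu": ["PaRtI", "Assembly"],
    "PaRtI": [],
    "Assembly": [],
}

# Hypothetical word classes (concept words) attached to each sub-category.
CONCEPT_WORDS = {
    "PaRtI": {"party", "abhyarthi"},
    "Assembly": {"mMtri", "speaker", "assembly"},
}

def to_root(word, suffixes=("to", "lo")):
    """Stand-in for the morphological analyzer: strip one known suffix."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

def classify(tokens, root="Rajakeeyalu"):
    """Descend from the root while some child concept matches the document."""
    roots = {to_root(token) for token in tokens}
    node = root
    while CHILDREN.get(node):
        # Score each child by how many of its concept words occur in the document.
        scores = {
            child: len(roots & CONCEPT_WORDS.get(child, set()))
            for child in CHILDREN[node]
        }
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            break  # no child matches: keep the parent as the final category
        node = best
    return node

sentence = ["AarDika", "mMtrito", "kAryadarsito", "muKhya",
            "mMtri", "assembly", "lo", "mMtaNalu"]
category = classify(sentence)
```

With this toy data the example sentence from the illustration descends from Rajakeeyalu to Assembly, while a document with no matching concept words stays at the parent node.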

4. CONCLUSION:

The literature on earlier research has shown that in conventional methods, misclassified items are not accessible. Further, it is also easier to develop weak thesauruses than to apply conventional methods. In this paper, we have proposed an ontology model for Telugu documents and shown that the efficiency of text classification is better with it than with conventional methods.
5. ACKNOWLEDGMENTS:
We sincerely thank Dr. G. Uma Maheswara Rao for providing the Ontology_Dictionary for Telugu and the Morphological Analyzer tool.

6. REFERENCES
[1] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, pp. 1-47, 2002.

[2] Jade Goldstein, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell, "Summarizing Text Documents: Sentence Selection and Evaluation Metrics," in ACM SIGIR 1999, pp. 121-128, 1999.

[3] Dr. G. Uma Maheswara Rao, Morphological Analyser, Centre for ALTS, University of Hyderabad.

[4] Dr. G. Uma Maheswara Rao and research team, "Ontology_Dictionary-Telugu," Centre for ALTS, University of Hyderabad.

[5] A. Karthikeyan et al., "A Novel Approach Using Semantic Information Retrieval for Tamil Documents," International Journal of Engineering Science and Technology, vol. 2(9), pp. 4424-4433, 2010.

[6] S. M. Chaware et al., "A Survey: Issues of Semantic Matching for Indian Languages Using Ontology," International Journal of Information Technology and Knowledge Management, vol. 2(2), pp. 351-354, 2010.
