Computer Science & Information Technology (CS & IT)


A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDU IR SYSTEM

Mohd. Shahid Husain, Iram Siraj

Integral University, Lucknow
{siddiquisahil,sirajiram25}@gmail.com

ABSTRACT
This is the era of information technology, and what matters most today is getting the right information at the right time. More and more data repositories are being made available online, and information retrieval (IR) systems, or search engines, are used to access the electronic information available on the internet. These systems depend on the available tools and techniques to retrieve information content efficiently in response to the user's query. Over the last few years, a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Oriya, Tamil and Telugu has been made available on the web in the form of e-data. Access to these data repositories remains very low, however, because few efficient search engines or retrieval systems support these languages. We have developed a language-independent system that facilitates efficient retrieval of information available in the Urdu language and that can be used for other languages as well. The system achieves a maximum average precision of 0.63 and a maximum average recall of 0.8.

KEYWORDS
Information Retrieval, Urdu IR, Stemming

1. INTRODUCTION
The rapid growth of electronic data has drawn the attention of the research and industry communities to efficient methods for indexing, analysing and retrieving information from the large number of data repositories that hold a wide range of data for a vast domain of applications. In this era of information technology, more and more data is being made available in online data repositories, and almost every piece of information one needs is now available on the internet. English and other European languages have dominated the web since its inception. However, the web is now becoming multilingual. In particular, over the last few years a wide range of information in Indian regional languages such as Hindi, Urdu, Bengali, Oriya, Tamil and Telugu has been made available on the web in the form of e-data. Access to these data repositories remains very low, however, because few efficient search engines or retrieval systems support these languages. Hence automatic information processing and retrieval has become an urgent requirement. Moreover, since India has a wide range of regional languages, an IR approach in the Indian context should be able to handle multilingual document collections.
Jan Zizka (Eds): CCSIT, SIPP, AISC, PDCTA - 2013, pp. 397-406, 2013. CS & IT-CSCP 2013

DOI : 10.5121/csit.2013.3645

A number of information retrieval systems are available for English and some other European languages. The development of IR systems for Indian languages has attracted interest only recently, and it is constrained by the lack of linguistic resources and tools for these languages. The work reported so far for Indian languages has focused on Hindi, Tamil, Bengali, Marathi and Oriya; no work has been reported for the Urdu language. The resources available for effectively retrieving information that is available on the internet in Indian languages are insufficient, so there is a need for efficient tools and techniques to represent, express, store and retrieve the information available in different languages. The present work focuses on the development of an efficient information retrieval system for the Urdu language.

2. INFORMATION RETRIEVAL
Information retrieval is a sub-domain of text mining and natural language processing. It is the science in which a software system retrieves the relevant documents or information in response to the user's query. The information retrieval system matches the given user queries against the available data corpus and ranks the documents on the basis of their relevance to the user's need. The IR system then returns the top-ranked documents containing information relevant to the user query. IR systems may be monolingual, bilingual or multilingual. The main objective of this work is the development of a monolingual information retrieval system for the Urdu language. To retrieve relevant information on the basis of a user query, the IR system converts the query statement and the data corpus into a standard format. The query is then matched against the documents present in the corpus, which are ranked on the basis of their relevance to the query. The top-ranked documents are then retrieved.

There are various approaches for converting the query statement and the corpus data into a standard format, such as stemming, morphological analysis, stop word removal and indexing. Similarly, there are various techniques for query matching, such as cosine similarity and Euclidean distance. The efficiency of any information retrieval system depends on the term weighting scheme, the strategy used for indexing the documents and the retrieval model used to develop the IR system.

2.1 Stemmer
Stemming is the backbone process of any IR system. Stemmers are used to obtain the base or root form (i.e. the stem) of inflected (or sometimes derived) words. Unlike a morphological analyzer, whose root words carry some lexical meaning, a stemmer need not produce stems that are meaningful words.

A stemmer removes the inflected part of a word to obtain its root form. Stemming is used to reduce the overhead of indexing and to improve the performance of an IR system. More specifically, a stemmer increases the recall of the search engine, whereas precision usually decreases. However, precision may sometimes increase, depending on the information need of the user. Stemming is a basic process of any query system, because a user who needs some information on "plays" may also be interested in documents that contain the word "play" (without the "s").
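As a rough illustration of the idea only, the sketch below strips a small, hypothetical list of English suffixes so that "plays" and "play" conflate to the same index term. It is not the unsupervised stemmer of [1]; the suffix list, the minimum stem length and the name `naive_stem` are assumptions made for this example.

```python
# Toy suffix-stripping stemmer. The suffix list is a hypothetical example,
# not the unsupervised method used in this work.
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


query_terms = ["plays", "play", "played", "playing"]
stems = {naive_stem(term) for term in query_terms}
print(stems)  # all four surface forms share one stem
```

With this list, `naive_stem("studies")` yields "studi", which shows that a stem need not be a dictionary word.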

2.2 Term Frequency

This is a local parameter which indicates the frequency, or count, of a term within a document. It measures the relevance of a document to a user query term on the basis of how many times that term occurs in that particular document. Mathematically it can be given as:
$tf_{ij} = n_{ij}$
where $n_{ij}$ is the number of occurrences of the term $t_i$ in the document $d_j$.
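A minimal sketch of the raw term frequency, assuming the document has already been split into tokens; the function name `term_frequencies` and the sample sentence are illustrative only.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return raw term frequencies tf_ij = n_ij for one document d_j."""
    return Counter(tokens)


doc = "the play was a good play".split()
tf = term_frequencies(doc)
print(tf["play"])  # number of occurrences of "play" in this document
```

A `Counter` returns 0 for terms that never occur, which matches the convention that an absent term has zero frequency.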

2.3 Document Frequency

This is a global parameter that captures the distribution of a term across the documents, and thus the importance of the term within the document corpus. The number of documents in the corpus that contain the considered term $t_i$ is called the document frequency. To normalize it, this count is divided by the total number of documents in the corpus. Mathematically it can be given as:
$df_i = n_i / n$
where $n_i$ is the number of documents that contain the term $t_i$ and $n$ is the total number of documents in the corpus. The inverse document frequency (idf) is the inverse of this document frequency.
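The two quantities can be sketched as follows. The text defines idf simply as the inverse of the document frequency; the log-scaled form $\log(n / n_i)$ used below is the common variant and is an assumption here, as are the function names and the toy corpus.

```python
import math


def document_frequency(term, docs):
    """df_i = n_i / n: fraction of documents that contain the term."""
    n_i = sum(1 for doc in docs if term in doc)
    return n_i / len(docs)


def inverse_document_frequency(term, docs):
    """Log-scaled idf_i = log(n / n_i); 0.0 for unseen terms (an assumption)."""
    n_i = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_i) if n_i else 0.0


# Each document is represented by its set of distinct terms.
docs = [{"urdu", "stemmer"}, {"urdu", "retrieval"}, {"hindi", "retrieval"}, {"urdu"}]
print(document_frequency("urdu", docs))
print(inverse_document_frequency("stemmer", docs))
```

A term that occurs in few documents, such as "stemmer" here, receives a larger idf than a term that occurs in most of them.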

2.4 Document Length
The third factor which may affect the weighting function is the length of the document. Hence the term weighting function can be represented by a triplet ABC, where
A - tf component
B - idf component
C - length normalizing component

The term frequency within a document, i.e. factor A, may take the following options:

Table 1: Different options for considering term frequency

Code   Formula                                              Description
N      $tf = tf_{ij}$                                       Raw term frequency
B      $tf = 0$ or $1$                                      Binary weight
A      $tf = 0.5 + 0.5\,(tf_{ij} / \max tf \text{ in } D_j)$   Augmented term frequency
L      $tf = \ln(tf_{ij}) + 1.0$                            Logarithmic term frequency

The options for the inverse document frequency, i.e. factor B, are:

Table 2: Different options for considering inverse document frequency

Code   Formula                  Description
N      $W_t = tf$               No conversion, i.e. idf is not taken into account
T      $W_t = tf \times idf$    idf is taken into account

The options for the document length, i.e. factor C, are:

Table 3: Different options for considering document length

Code   Formula                                  Description
N      $W_{ij} = w_t$                           No conversion
C      $W_{ij} = w_t / \sqrt{\sum_t w_t^2}$     Normalized weight
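The triplet can be read as three independent switches. The sketch below is one possible reading of the three tables, not the implementation used in this work; the one-letter codes mirror the tables where they are given, using "L" for the logarithmic variant is an assumption, and the function names and sample values are hypothetical.

```python
import math


def tf_component(tf, max_tf, option):
    """Factor A: raw (N), binary (B), augmented (A) or logarithmic (L, assumed code)."""
    if tf == 0:
        return 0.0
    if option == "N":
        return float(tf)
    if option == "B":
        return 1.0
    if option == "A":
        return 0.5 + 0.5 * tf / max_tf
    if option == "L":
        return math.log(tf) + 1.0
    raise ValueError(f"unknown tf option: {option}")


def weight_document(tfs, idfs, scheme="NTC"):
    """Weight one document's terms under a three-letter scheme such as 'NTC'."""
    tf_opt, idf_opt, norm_opt = scheme
    max_tf = max(tfs.values(), default=1)
    weights = {}
    for term, tf in tfs.items():
        w = tf_component(tf, max_tf, tf_opt)
        if idf_opt == "T":  # factor B: multiply by idf; 'N' leaves w unchanged
            w *= idfs.get(term, 0.0)
        weights[term] = w
    if norm_opt == "C":  # factor C: divide by the Euclidean length of the vector
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
    return weights


tfs = {"urdu": 3, "stemmer": 1}        # hypothetical term frequencies
idfs = {"urdu": 1.0, "stemmer": 2.0}   # hypothetical idf values
print(weight_document(tfs, idfs, "NTC"))
```

Under "NTC" the resulting vector has unit length, so long and short documents become comparable; under "NNN" the weights are just the raw counts.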

2.5 Indexing
Indexing is done to represent the documents in the corpus and the user query statement; that is, the process of transforming the document text and the given query statement into some representation of them is known as indexing. Different index structures can be used for indexing. The data structure most commonly used by IR systems is the inverted index. Indexing techniques are concerned with the selection of good document descriptors, such as keywords or terms, to describe the information content of the documents. A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection. The most widely used method is to represent the query and the document as a set of tokens, i.e. index terms or keywords. For indexing a document, there are different indexing strategies, as given below:

2.5.1 Character Indexing: In this scheme the tokens used for representing the documents are the characters present in the document.
2.5.2 Word Indexing: This approach uses the words in the document to represent it.
2.5.3 N-gram Indexing: This method breaks the words into n-grams, and these n-grams are used to index the documents.
2.5.4 Compound Word Indexing: In this method bi-words or tri-words are used for indexing.
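The four strategies differ only in how text is cut into tokens, so each can be sketched as a tokenizer that feeds the same inverted index. The function names, the choice of n = 3 and the sample documents below are assumptions made for illustration.

```python
from collections import defaultdict


def char_tokens(text):
    """Character indexing: every non-space character is a token."""
    return [ch for ch in text if not ch.isspace()]


def word_tokens(text):
    """Word indexing: whitespace-separated words are tokens."""
    return text.split()


def ngram_tokens(text, n=3):
    """N-gram indexing: overlapping character n-grams of each word."""
    grams = []
    for word in text.split():
        if len(word) <= n:
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def biword_tokens(text):
    """Compound word indexing: pairs of adjacent words (bi-words)."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def build_inverted_index(docs, tokenizer=word_tokens):
    """Map each token to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenizer(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


docs = {1: "urdu information retrieval", 2: "urdu stemmer"}
print(ngram_tokens("urdu"))
print(biword_tokens(docs[1]))
print(build_inverted_index(docs)["urdu"])
```

Passing a different `tokenizer` to `build_inverted_index` switches the indexing strategy without changing the index structure itself.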

2.6 Information Retrieval models

An IR model defines the following aspects of the retrieval procedure of a search engine:
a. How the documents and the users' queries are represented,
b. How the system retrieves relevant documents according to the users' queries, and
c. How the retrieved documents are ranked.

Any typical IR model comprises the following:
a. A model for documents,
b. A model for queries, and
c. A matching function which compares queries to documents.

The IR models can be categorized as:

2.6.1 Classical models of IR: These are the simplest IR models. They are based on well-recognized and easily understood mathematical knowledge. Classical models are easy to implement and are very efficient. The three classical information retrieval models are:
- Boolean
- Vector
- Probabilistic

2.6.2 Non-Classical models of IR: Non-classical information retrieval models are based on principles such as the information logic model, the situation theory model and the interaction model. They are not based on the concepts of similarity, probability and Boolean operations on which the classical retrieval models are based.

2.6.3 Alternative models of IR: Alternative models are advanced classical IR models. These models make use of specific techniques from other fields, such as the cluster model, the fuzzy model and the latent semantic indexing (LSI) model.

2.6.4 Boolean Retrieval model: This is the simplest retrieval model, which retrieves information on the basis of a query given as a Boolean expression. Boolean queries are queries that use the Boolean operations AND, OR and NOT to join the query terms.
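Boolean retrieval maps directly onto set operations over postings. The following is a minimal sketch under the assumption that each term's postings are held as a set of document ids; the index contents and helper names are hypothetical.

```python
def boolean_and(index, all_ids, *terms):
    """Documents containing every one of the given terms."""
    result = set(all_ids)
    for term in terms:
        result &= index.get(term, set())
    return result


def boolean_or(index, *terms):
    """Documents containing at least one of the given terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())


# Hypothetical postings: term -> set of document ids.
index = {
    "urdu": {1, 2, 4},
    "stemmer": {2, 3},
    "hindi": {3, 4},
}
all_ids = {1, 2, 3, 4}

# Query: urdu AND (NOT hindi)
hits = boolean_and(index, all_ids, "urdu") & boolean_not(index, all_ids, "hindi")
print(sorted(hits))
```

The result is a plain set: a document either satisfies the expression or it does not, and nothing in the model orders the matching documents.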

One drawback of the Boolean information retrieval model is that it requires a Boolean query instead of free text. A second drawback is that the model cannot rank documents by their relevance to the user query: it returns a document if it contains the query word, regardless of the term count in the document or the actual importance of that query word in the document.

2.6.5 Vector Space model: This model represents documents and queries as vectors of features representing terms. Features are assigned a numerical value that is usually some function of the frequency of terms. In this model, each document d is viewed as a vector of tf*idf values, one component for each term. So we have a vector space where
a. terms are the axes, and
b. documents live in this space.
The ranking algorithm computes the similarity between the document and query vectors to yield a retrieval score for each document. The postulate is: documents related to the same information are close together in the vector space.

2.6.6 Probabilistic retrieval model: In this model, some set of documents is initially retrieved by using the vector model or the Boolean model. The user inspects these documents, looking for the relevant ones, and gives feedback. The IR system uses this feedback information to refine the search criteria. This process is repeated until the user gets the desired information in response to his or her needs.

2.7 Similarity measures

To retrieve the documents most relevant to the user's information need, the IR system matches the documents available in the corpus against the given user query. Different similarity measures are used to perform this process, for example Euclidean distance and cosine similarity.

2.7.1 Cosine Similarity: We regard the query as a short document. The documents present in the corpus and the query are represented by vectors in the vector space, with the features as axes. The IR system ranks the documents by the closeness of the document vectors to the query vector, and then returns the top-ranked documents to the user.

Fig. 1: A VSM model representing 3 documents and a query

The above diagram shows a vector space model where axes ti and tj are the terms used for indexing.
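Cosine similarity between a query and each document can be sketched as below, with vectors stored sparsely as term-to-weight dictionaries. The three document vectors and the query are hypothetical values on two term axes and are not taken from Fig. 1.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (term -> weight dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    scores = {doc_id: cosine_similarity(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights on two term axes, t_i and t_j.
doc_vecs = {
    "d1": {"t_i": 1.0, "t_j": 0.0},
    "d2": {"t_i": 1.0, "t_j": 1.0},
    "d3": {"t_i": 0.5, "t_j": 2.0},
}
query_vec = {"t_i": 2.0, "t_j": 2.0}
for doc_id, score in rank(query_vec, doc_vecs):
    print(doc_id, round(score, 3))
```

Because only the angle matters, d2 scores highest even though the query vector is twice as long: both point in the same direction.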

2.8 Metrics for IR Evaluation

The aim of any information retrieval system is to find documents, in response to a user query, that are relevant to the user's information need. The performance of an IR system is evaluated on the basis of how relevant the documents it retrieves are. Relevance depends upon a specific user's judgment and is subjective in nature. The true relevance of a retrieved document can be judged only by the user, on the basis of his or her information need. For the same query statement, the desired information may differ from user to user. Traditionally, IR systems have been evaluated on a set of queries and test document collections. For each test query, a ranked set of relevant documents is created manually, and the system's result is then checked against it. There are many retrieval models, algorithms and systems, and different performance metrics are used to assess how efficiently an IR system retrieves documents in response to a user's information need. The different criteria for the evaluation of an IR system are:
a. Coverage of the collection
b. Time lag
c. Presentation format
d. User effort
e. Precision
f. Recall

Effectiveness is the performance measure of an IR system which describes how well the system satisfies a user's information need by retrieving relevant documents. Aspects of effectiveness include:
a. Whether the retrieved documents are pertinent to the information need of the user.
b. Whether the retrieved documents are ranked according to their relevance to the user query.
c. Whether the IR system returns to the user a reasonable number of the relevant documents present in the corpus.
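Precision and recall are not defined formally in the text; the sketch below uses their standard definitions, where precision is the fraction of retrieved documents that are relevant and recall is the fraction of relevant documents that are retrieved. The document ids and relevance judgments are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical system output and manual judgments for a single query.
retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = ["d2", "d4", "d7", "d9"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Averaging these per-query values over a query set gives system-level figures of the kind reported for an evaluation.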

3. OUR APPROACH
In this work, to develop an Information Retrieval system for Urdu language, the following methods and evaluation parameters are used.

3.1 Stemmer: For developing the stemmer we have used an unsupervised approach [1], which gives an accuracy of 84.2%.

3.2 Term Weighting scheme: For term weighting we have used tf*idf.

3.3 Indexing Scheme: In this work the query statement and the documents are represented using the word indexing strategy.

3.4 Retrieval Model: To implement our IR system we have used the vector space model.

3.5 Encoding Scheme: As the system focuses on the Urdu language, UTF-8 character encoding is used to access the data.

3.6 Similarity Measure: To obtain the documents that are most closely related to the query, i.e. to measure the similarity between the documents in the corpus and the query statement, the cosine similarity measure is used.

3.7 Ranking of the documents: To rank the retrieved documents in order of their relevance to the query, the cosine similarity values are used. A document with a higher cosine value (minimum angular distance) to the query is more similar to it, i.e. it contains the query terms more frequently, and hence such documents are considered more relevant to the user query.
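The choices in sections 3.2 to 3.7 can be put together in a compact sketch: UTF-8 decoding, word indexing, tf*idf weighting, and cosine ranking in a vector space. This is an illustration under stated assumptions, not the system evaluated in this paper: the stemming step of section 3.1 is omitted, the log-scaled idf is assumed, and the three short Urdu documents and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical Urdu documents stored as UTF-8 bytes (section 3.5).
raw_docs = {
    "d1": "صحت اور تعلیم".encode("utf-8"),
    "d2": "سڑک کی حفاظت".encode("utf-8"),
    "d3": "صحت کے مسائل".encode("utf-8"),
}


def tokenize(data: bytes):
    """Decode UTF-8 and split on whitespace (word indexing, section 3.3)."""
    return data.decode("utf-8").split()


def tfidf_vector(tokens, idf):
    """tf*idf weights (section 3.2); terms unknown to the corpus get weight 0."""
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokens).items()}


def cosine(u, v):
    """Cosine similarity between sparse vectors (section 3.6)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


docs = {doc_id: tokenize(data) for doc_id, data in raw_docs.items()}
n = len(docs)
df = Counter(t for tokens in docs.values() for t in set(tokens))
idf = {t: math.log(n / n_t) for t, n_t in df.items()}  # assumed log-scaled idf
doc_vecs = {doc_id: tfidf_vector(tokens, idf) for doc_id, tokens in docs.items()}


def search(query: str):
    """Rank all documents by cosine similarity to the query (section 3.7)."""
    q_vec = tfidf_vector(query.split(), idf)
    scored = [(doc_id, cosine(q_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for doc_id, score in search("صحت تعلیم"):
    print(doc_id, round(score, 3))
```

Nothing in the sketch is specific to Urdu beyond the sample strings; the same code runs unchanged on any whitespace-delimited UTF-8 text, which is the sense in which the approach is language independent.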

4. EXPERIMENT
The data set used in this work for training and testing the developed Urdu IR system is taken from the EMILLE corpus, in which the documents are in XML format. The data set taken from the EMILLE corpus is a tagged data set consisting of documents with information related to health issues, road safety issues, education issues, legal issues, social issues, housing issues, etc. The testing data set consists of documents from various domains, as listed in Table 4.

Table 4: Dataset specification used for Urdu IR

Domain             Number of documents   Number of words
Health             33                    223412
Education          8                     115264
Housing            8                     120327
Legal              8                     108055
Social issues      12                    146083
Homeopathy         32                    527360
Drama              13                    135680
Myths              10                    202880
Story and Novel    21                    300160
Media              15                    224000
Science            47                    704000
History            33                    502400
Politics           21                    728320
Psychology         27                    555520
Religion           34                    556800
Sociology          21                    398080
Miscellaneous      48                    985374

A query set consisting of 200 queries was prepared manually for training and testing the IR system.

5. RESULTS AND DISCUSSIONS

To test the developed information retrieval system, a test collection of 350 documents was used. A set of 200 queries was constructed on these 350 documents, and this query set was used to evaluate the developed Urdu IR system.

Table 5: Results of testing the developed Urdu IR system

Number of documents   Number of queries   Precision (avg.)     Recall (avg.)
                                          Min       Max        Min      Max
350                   200                 0.13      0.63       0.5      0.8

As shown in the above table, the minimum average precision of the system is 0.13 and its maximum average precision is 0.63. Similarly, the minimum average recall of the system is 0.5 and its maximum average recall is 0.8.

6. CONCLUSION AND FUTURE WORK

In this paper we have discussed various indexing schemes and IR models. We have used the tf*idf scheme for term weighting and the Vector Space Model (VSM) to implement the IR system. The experimental results show that the developed IR system reaches a maximum average recall of 0.8 with a maximum average precision of 0.63. IR is a very active research field, and much new research can be done to provide efficient IR systems that satisfy users' information needs. In particular, a lot of research is needed to develop language-independent approaches that support IR systems for multilingual data collections.

REFERENCES
[1] Mohd Shahid Husain et al. A Language Independent Approach to Develop Urdu Stemmer. Proceedings of the Second International Conference on Advances in Computing and Information Technology, 2012.
[2] Rizvi, J. et al. Modeling Case Marking System of Urdu-Hindi Languages by Using Semantic Information. Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE '05), 2005.
[3] Butt, M., King, T. Non-Nominative Subjects in Urdu: A Computational Analysis. Proceedings of the International Symposium on Non-nominative Subjects, Tokyo, December, pp. 525-548, 2001.
[4] Chen, A., Gey, F. Building an Arabic Stemmer for Information Retrieval. Proceedings of the Text Retrieval Conference, 47, 2002.
[5] Wicentowski, R. Multilingual Noise-Robust Supervised Morphological Analysis using the Word Frame Model. In Proceedings of the Seventh Meeting of the ACL Special Interest Group on Computational Phonology (SIGPHON), pp. 70-77, 2004.
[6] Rizvi, Hussain M. Analysis, Design and Implementation of Urdu Morphological Analyzer. SCONEST, 1-7, 2005.
[7] Krovetz, R. View Morphology as an Inference Process. In the Proceedings of 5th International Conference on Research and Development in Information Retrieval, 1993.
[8] Thabet, N. Stemming the Quran. In the Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, 2004.
[9] Paik, Pauri. A Simple Stemmer for Inflectional Languages. FIRE 2008.
[10] Sharifloo, A. A., Shamsfard, M. A Bottom up Approach to Persian Stemming. IJCNLP, 2008.
[11] Kumar, A. and Siddiqui, T. An Unsupervised Hindi Stemmer with Heuristics Improvements. In Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, 2008.
[12] Kumar, M. S. and Murthy, K. N. Corpus Based Statistical Approach for Stemming Telugu. Creation of Lexical Resources for Indian Language Computing and Processing (LRIL), C-DAC, Mumbai, India, 2007.
[13] Qurat-ul-Ain Akram, Asma Naseer, Sarmad Hussain. Assas-Band, an Affix-Exception-List Based Urdu Stemmer. Proceedings of ACL-IJCNLP 2009.
[14] http://en.wikipedia.org/wiki/Urdu
[15] http://www.bbc.co.uk/languages/other/guide/urdu/steps.shtml
[16] http://www.andaman.org/BOOK/reprints/weber/rep-weber.htm
[17] Tanveer Siddiqui, U. S. Tiwary. Natural Language Processing and Information Retrieval.
[18] William B. Frakes, Ricardo Baeza-Yates. Information Retrieval: Data Structures and Algorithms.
[19] http://www.crulp.org/software/ling_resources.htm
