Application NLP

Uploaded by

dayanand
Application of NLP in Information Retrieval
Presentation Outline

 Overview of current IR systems
 Problems with NLP in IR
 Major applications of NLP in IR
Motivation

 The most successful general-purpose retrieval methods are statistical methods.
 Sophisticated linguistic processing often degrades performance.
What is IR?
 “An information retrieval system is one that searches a collection of natural language documents with the goal of retrieving exactly the set of documents that pertain to a user's question.”
 IR systems have their origins in library systems.
 They do not attempt to deduce or generate answers.
The problem of IR
 Goal = find documents relevant to an information need from a large document set.
[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list.]
Basics of IR Systems
Basics of IR Systems (contd…)
 Indexing the collection of documents.
 Transforming the query in the same way as the document content is represented.
 Comparing the description of each document with that of the query.
 Listing the results in order of relevance.
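The four steps above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the function names `to_terms` and `search`, the sample documents, and the use of Jaccard overlap as the comparison measure are all hypothetical choices, not something the slides prescribe.

```python
import re

def to_terms(text):
    """Represent a text as the set of its lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query, documents):
    """Rank documents by Jaccard overlap with the query, dropping non-matches."""
    # Step 1: index the collection (one term set per document).
    index = {doc_id: to_terms(text) for doc_id, text in documents.items()}
    # Step 2: transform the query the same way as the documents.
    query_terms = to_terms(query)
    # Step 3: compare each document description with the query.
    scored = []
    for doc_id, terms in index.items():
        union = query_terms | terms
        score = len(query_terms & terms) / len(union) if union else 0.0
        if score > 0:
            scored.append((doc_id, score))
    # Step 4: list the results in order of relevance.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": "Gas mileage figures for the Audi A8",
    "d2": "Audi dealership opening hours",
    "d3": "Recipes for a quick dinner",
}
ranking = search("audi a8 gas mileage", docs)
```

Here `d1` shares all four query terms and ranks first, `d2` shares only one, and `d3` shares none and is left out of the answer list.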


Basics of IR Systems (contd…)

 Retrieval systems consist mainly of two processes:
 Indexing
 Matching
Indexing
 Indexing is the process of selecting terms to represent a text.
 Indexing involves:
 Tokenization of the string
 Removing frequent words (stop words)
 Stemming (removing suffixes such as -ing, -ed, etc.)
 Two common indexing techniques:
 Boolean model
 Vector space model
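The three indexing steps listed above (tokenization, removal of frequent words, stemming) can be sketched as follows. The stop-word set and the `naive_stem` helper are illustrative assumptions; a real system would more likely use an established stemmer such as Porter's rather than this crude suffix stripping.

```python
import re

# Illustrative stop-word list only; not a standard one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "my", "i"}

def naive_stem(token):
    """Strip a few common suffixes, keeping at least three characters."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    """Tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in content]

terms = index_terms("Indexing the retrieved documents")
# terms == ["index", "retriev", "document"]
```

Note that a stem such as "retriev" need not be a dictionary word; it only has to be the same for all variants of the term.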
Indexing
Information Retrieval Models
 A retrieval model consists of:
 D: representation for documents
 Q: representation for queries
 F: a modeling framework for D, Q
 R(q, di): a ranking or similarity function which orders the documents with respect to a query.
Boolean Model
 In this model, each term is recorded as 1 (present) or 0 (absent) for every document.
 Queries are represented as Boolean combinations of terms.
 The set of documents that satisfy the Boolean expression is retrieved in response to the query.
 Drawback
 The user is given no indication of whether some documents in the retrieved set are likely to be better than others in the set.
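A minimal sketch of Boolean retrieval, assuming an inverted index that maps each term to the set of documents containing it. The helper name `build_inverted_index` and the sample documents are hypothetical; Boolean operators then become plain set operations.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "audi a8 gas mileage",
    2: "audi a4 review",
    3: "gas prices rise",
}
index = build_inverted_index(docs)

both = index["audi"] & index["gas"]          # audi AND gas
either = index["audi"] | index["gas"]        # audi OR gas
audi_not_gas = index["audi"] - index["gas"]  # audi AND NOT gas
```

Each result is an unordered set, which illustrates the drawback noted above: a document either satisfies the expression or it does not, and nothing ranks the matches.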
Vector Space Model
 In this model, documents and queries are represented by vectors in a T-dimensional space.
 T is the number of distinct terms used in the documents.
 Each axis corresponds to one term.
 The result is a ranked list of documents ordered by similarity to the query, where the similarity between a query and a document is computed using a metric on the respective vectors (commonly the cosine of the angle between them).
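As a rough illustration of the vector space model, the sketch below builds raw term-count vectors over the collection vocabulary and ranks documents by cosine similarity. Cosine is one common choice of metric, assumed here; the names `to_vector` and `cosine` and the sample texts are hypothetical.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """One component per vocabulary term: the term's count in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = {
    "d1": "audi a8 gas mileage",
    "d2": "audi a4 review",
    "d3": "fuel prices rise",
}
# T distinct terms give a T-dimensional space, one axis per term.
vocabulary = sorted({t for text in docs.values() for t in text.lower().split()})
query_vec = to_vector("audi gas mileage", vocabulary)
scores = {d: cosine(query_vec, to_vector(text, vocabulary)) for d, text in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Unlike the Boolean model, every document receives a graded score, so partial matches such as `d2` still appear in the ranking, below the closer match `d1`.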
Matching
 Matching is the process of computing a measure of similarity between two text representations.
 The relevance of a document is computed from the following parameters:
 tf (term frequency): the number of times a given term appears in a document, normalized by the document's length.
tf_{i,j} = (count of term i in document j) / (total number of terms in document j)
 idf (inverse document frequency): a measure of the general importance of the term across the collection; the ratio is usually log-scaled.
idf_i = log((total no. of documents) / (no. of documents containing term i))
 tf-idf score: tfidf_{i,j} = tf_{i,j} × idf_i
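The weighting can be worked through on a toy corpus. This sketch assumes the common log-scaled form of idf (natural logarithm of the document ratio); removing `math.log` gives the raw ratio instead. The function names and the tokenized corpus are made up for illustration.

```python
import math

def tf(term, doc_tokens):
    """Count of the term in the document, divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing) if containing else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [
    ["audi", "a8", "gas", "mileage"],
    ["audi", "a4", "review"],
    ["gas", "prices", "rise", "again"],
    ["audi", "dealer", "hours"],
]
rare = tf_idf("mileage", corpus[0], corpus)   # appears in 1 of 4 documents
common = tf_idf("audi", corpus[0], corpus)    # appears in 3 of 4 documents
```

Both terms occur once in the first document, so their tf is equal; "mileage" gets the higher score only because it is rarer across the collection.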
Evaluation of IR Systems
 Two common effectiveness measures are:
 Precision: the proportion of retrieved documents that are relevant (how accurate the retrieved set is).
Precision = no. of retrieved relevant documents / total no. of retrieved documents
 Recall: the proportion of relevant documents that are retrieved.
Recall = no. of retrieved relevant documents / total no. of relevant documents
 Ideally both precision and recall should be 1.
 In practice, they tend to trade off: improving one usually lowers the other.
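A small worked example may help keep the two measures apart: precision divides the number of relevant retrieved documents by the size of the retrieved set, while recall divides it by the size of the relevant set. The function name and the document ids below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved documents that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
# p == 2/4, r == 2/3
```

Retrieving more documents can only keep recall the same or raise it, but it usually pulls in irrelevant ones and lowers precision, which is the trade-off noted above.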
Case Study
Query: I need to know the gas mileage for my audi a8 2004 model

Source: Yahoo search (search.yahoo.com)


Case Study (contd…)
Query: I need to know the gas mileage for my audi a8 2004 model

Source: Y!Q search (yq.search.yahoo.com)


Case Study (contd…)
Query: I need to know the gas mileage for my audi a8 2004 model

Source: Google search (www.google.com)


Case Study (contd…)
 Yahoo Search
 Pure text-based search.
 The results are documents that contain the same text as the query.
 Y!Q Search
 Uses semantics, but not efficiently.
 Attempts to generate an answer, but does so less effectively.
 Google Search
 Efficient use of NLP to deduce the answer from the given question.
 A step towards question answering!
Conclusion

 Research efforts to address appropriate tasks are underway, e.g. document summarization and answer generation.
 Achieving extremely efficient NLP techniques remains an ideal rather than a present reality.
References
 Voorhees, E. M., "Natural Language Processing and Information Retrieval," in Pazienza, M. T. (ed.), Information Extraction: Towards Scalable, Adaptable Systems, New York: Springer, 1999.
 Salton, G., Wong, A., and Yang, C. S., "A Vector Space Model for Automatic Indexing," Communications of the ACM, 1975, pp. 613-620.
 Vallez, M., and Pedraza-Jimenez, R., "Natural Language Processing in Textual Information Retrieval and Related Topics," Hipertext.net, no. 5, 2007.
 Khaitan, S., Verma, K., and Bhattacharyya, P., "Exploiting Semantic Proximity for Information Retrieval," IJCAI 2007 Workshop on Cross Lingual Information Access, Hyderabad, India, Jan. 2007.
 Wikipedia
Questions?
Thank you!
