
Artificial Intelligence in Information Retrieval

This paper explores the intersection of Artificial Intelligence (AI) and Information Retrieval (IR), detailing how AI techniques enhance the retrieval of information. It discusses various models and methodologies, including the use of natural language processing and inverted indexing, to improve the efficiency and effectiveness of information retrieval systems. The document emphasizes the importance of structured, unstructured, and semi-structured data in the context of AI-driven IR solutions.

2022 4th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N) | 978-1-6654-7436-8/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICAC3N56670.2022.10074291

ARTIFICIAL INTELLIGENCE IN INFORMATION RETRIEVAL
Lalita Shukla1, Dr. J.N. Singh2, Dr. Prashant Johri3, Dr. Avneesh Kumar4
Galgotias University, Greater Noida, U.P., India.
[email protected], [email protected], [email protected]

Abstract: This paper discusses the relationship between information retrieval (IR) and artificial intelligence (AI). It examines text retrieval, summarizes its key features, and illustrates the state of the art by introducing one model in detail, along with test results that show its value. The paper then analyzes this model and related effective methods, focusing on and justifying their use of weak, redundant representation and reasoning. It also describes some of the most effective ways in which AI is used in information retrieval. Information retrieval is an important information management technology: it combines searching for information with indexing, storing and categorizing it.

Keywords: Information retrieval, NLP, Inverted Index, Stemming, Lemmatization, Standardization, Parameterization.

I. INTRODUCTION

"Information retrieval" is an all-encompassing term. This paper is based on the well-established concept of text retrieval. We limit ourselves, at first, to document (text) retrieval and set aside the retrieval of other types of material, such as images. The paper addresses the question: what does information retrieval (in the sense of document retrieval) have to do with artificial intelligence? The answer may seem obvious: everything. If IR is, at its most important and challenging, the automatic retrieval of information based on content, then a common thought in AI is that AI researchers will show IR practitioners how this is done. Information retrieval is a process that involves activities related to human understanding and information management; therefore, the design of information access systems can benefit from strategies that account for the inherent uncertainties reflecting the subjectivity of the task.

Artificial intelligence methods in information retrieval

Artificial intelligence techniques are used throughout the standard retrieval process to obtain information and to provide new, value-added capabilities. The first section gives a brief overview of information retrieval. The following sections are organized according to the steps of the retrieval process and provide examples of applications.

AI AND IR

“This is the use of computers to carry out tasks requiring reasoning on world knowledge, as exemplified by giving responses to questions in situation where one is dealing with only partial knowledge and with indirect connectivity” [1]

An IR system is a software system that provides access to books, magazines, and other documents, and that stores and maintains those documents.

1.1. Basic Terms

• Corpus: A large repository of documents stored on computers.
• Information Need: A topic about which we want to obtain information.
• Relevance: A document in the corpus is relevant if it contains the information the user is searching for.

1.2. Types of data

1. STRUCTURED DATA

It refers to information in the form of tables, with a clear, overt semantic structure.

Example: Data stored in a relational database.

2. UNSTRUCTURED DATA

It lacks a clear, overt semantic structure that is easy for a computer to interpret.


Example:

• Any search phrase on the web.
• Social media.
• Emails, mobile data, etc.

3. SEMI-STRUCTURED DATA

In fact, almost no data is truly "unstructured". This category covers all textual data that carries a good deal of structure, such as captions, paragraphs and footnotes.

1.3. Areas of Artificial Intelligence for Information Retrieval

• Natural language processing
• Representation of knowledge
• Expert systems
    • Ex: Logical formalisms, conceptual graphs, etc.
• Machine learning
    • Short term: over a single session
    • Long term: over multiple searches by multiple users
• Computer vision
    • Ex: OCR
• Reasoning under uncertainty
    • Ex: probability theory

II. WORKING OF IR

HOW DOES INFORMATION RETRIEVAL WORK?

Step 1: Document Processing

Documents are added to the system. Although this first step may involve some form of markup and extraction, most information systems generate an inverted file, or alphabetical word list. Stop words are not included in this list. New text is merged into the existing list so that all occurrences of each word in the system are recorded in one place, together with their positions in each document. As it grows, a text retrieval system can add or build a knowledge base with internal dictionaries, semantic networks, or lists of phrases, synonyms and personal pronouns.

Step 2: Query Processing

Whenever a query is submitted, it must be interpreted by the system. For logical (Boolean) systems this is not a difficult task, because the searcher has already phrased the query in terms the computer can interpret. Statistical and NLP systems must do some of the work otherwise done by the searcher to refine the query. A statistical index identifies the words to be considered and may look for word forms such as singular and plural. It can also assign a weight to each word. A full NLP system tags all parts of speech, identifies objects, subjects, agents and functions, expands place names, and adds similar names and alternate forms for proper nouns. It then creates a representation of the query that the system matches against its knowledge base. It is also possible to identify qualifiers, basic syntax and more as part of the NLP process that generates the query representation. Term weights are calculated.

A closer look at these models reveals that they are very similar to the traditional vector space model of retrieval [2].

Step 3: Query Matching

The main function of query matching in IR is first to locate the documents that match the query (retrieval phase) and then to rank the matching documents (ranking phase). Matching takes place between the query and every document in the collection; because the collection can be very large (billions of documents), the matching procedure must be efficient. The interpreted query is matched against the inverted file and against the knowledge base, if any. Traditional online services match each query term specified by the searcher against the terms in the index. A full NLP system could interpret a question such as "Is slippery a common condition on stoves?" or "I love the place of all the smooth toffees in New England." The NLP system will expand "New England" and add similar terms from its knowledge base, possibly "location" [3].

Step 4: Ranking & Sorting

Once all the candidate documents matching the query have been selected, they are sorted by date, by field, or by a score that estimates how relevant each document is to the query.

Statistical and NLP-based systems use the same kinds of similarity measures [3].

III. METHODOLOGY

“The amount of available information is growing at an incredible rate, for example the Internet and World Wide Web.

• Information is stored in many forms e.g., images, text, video, and audio.
• Common methods include the Boolean, Vector Space and Probabilistic models” [4]

3.1. Traditional approach (Solution)

• We can use grepping (named after the UNIX command grep), which performs a linear scan through the documents.


• We can use a scripting language.

The problem with these solutions is that both are very time consuming, and the work is repeated for each and every query.

3.2. Better Solution: Boolean Retrieval

• Preprocess the corpus in advance.
• Organize the information about the occurrence of different words in a way that speeds up query processing.
• For each document, record which terms appear in it.
• To do this, we create a term-document incidence matrix.
• Query: Huffman AND Tree AND NOT dangling suffix

Table 1: Term-document incidence matrix

Term             Non-Binary    Uniquely         Adaptive      LZ77        Static
                 Huffman Code  Decodable Codes  Huffman Tree  Dictionary  Dictionary
Dangling suffix  0             1                0             0           0
Huffman code     1             0                1             0           0
Dictionary       0             0                0             1           1
Tree             1             0                1             0           0

QUERY: Take the incidence vectors for Huffman and Tree and the complemented vector for dangling suffix, then apply bitwise AND.

From the incidence matrix, the vectors for Huffman, Tree and dangling suffix (complemented) are:

10100 AND 10100 AND 10111

Applying the bitwise AND operator gives 10100.

Our answer: Non-Binary Huffman Code and Adaptive Huffman Tree.

Boolean Retrieval Model

The Boolean retrieval model is sometimes also called the incidence-vector model [10][11][12][13]. It is a retrieval model in which any query can be posed as a Boolean expression of terms, that is, terms combined with the operators AND, OR and NOT.

Problem with Boolean Approach:

Suppose a corpus has 1 million documents (texts).

Number of distinct terms = 500,000

If we create a term-document matrix for this scenario:

No. of cells = 500,000 × 1,000,000 = 5 × 10^11 (about 500 GB at one byte per cell)

This would need far more memory than is practical, which makes the approach infeasible.

The term-document matrix is sparse.

• Sparse means that most of the entries in the matrix are zero.
• To overcome this problem, we convert the matrix into an inverted index.

3.3. INVERTED INDEX

The inverted index is a key data structure in modern information retrieval.

The index always maps back from a term to the parts of the documents where it occurs. Inverted index, also known as inverted file, is the standard term in IR.

Basic Idea of Inverted Index:

The basic idea of the inverted index is that we maintain a dictionary of terms (sometimes called a vocabulary or lexicon). For each term t, we store a list of all documents that contain t.

• Identify each document by a document number or document ID (docID).

Each such list is called a postings list (or inverted list), and all the postings lists taken together are referred to as the postings.

Example (Shakespeare play)

Figure 3.1: A term-document incidence matrix

SORTING:

The main indexing step is to sort this list so that the terms are in alphabetical order, which makes searching faster (Figure 3.2).
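As a rough illustration of the two ideas in this section, the sketch below answers the query Huffman AND Tree AND NOT dangling suffix twice: once with bitwise operations on the incidence vectors of Table 1, and once with postings lists obtained by converting the same matrix into a small inverted index. The docIDs 1 to 5 (one per column of Table 1) and all function names are assumptions made for this example, not part of the original paper.

```python
# Columns of Table 1 in order; docIDs 1..5 are assigned here for illustration.
TITLES = [
    "Non-Binary Huffman Code",
    "Uniquely Decodable Codes",
    "Adaptive Huffman Tree",
    "LZ77 Dictionary",
    "Static Dictionary",
]

# Rows of Table 1 as incidence vectors (leftmost bit = first column).
INCIDENCE = {
    "dangling suffix": "01000",
    "huffman code": "10100",
    "dictionary": "00011",
    "tree": "10100",
}


def boolean_query_bits(must, must_not):
    """Answer an AND / AND NOT query with bitwise operations on the vectors."""
    width = len(TITLES)
    mask = (1 << width) - 1
    result = mask
    for term in must:
        result &= int(INCIDENCE[term], 2)
    for term in must_not:
        # Complement the vector, keeping only `width` bits.
        result &= ~int(INCIDENCE[term], 2) & mask
    return format(result, f"0{width}b")


def build_inverted_index(incidence):
    """Convert the matrix into term -> postings list (docIDs in ascending order)."""
    return {
        term: [doc_id for doc_id, bit in enumerate(row, start=1) if bit == "1"]
        for term, row in sorted(incidence.items())  # dictionary in alphabetical order
    }


def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_query_postings(index, must, must_not):
    """Answer the same query using postings lists instead of bit vectors."""
    result = index[must[0]]
    for term in must[1:]:
        result = intersect(result, index[term])
    excluded = {doc_id for term in must_not for doc_id in index[term]}
    return [doc_id for doc_id in result if doc_id not in excluded]


index = build_inverted_index(INCIDENCE)
bits = boolean_query_bits(["huffman code", "tree"], ["dangling suffix"])
hits = boolean_query_postings(index, ["huffman code", "tree"], ["dangling suffix"])
print(bits)                           # 10100
print([TITLES[d - 1] for d in hits])  # ['Non-Binary Huffman Code', 'Adaptive Huffman Tree']
```

Both routes return the same two documents. The inverted index stores only the positions of the 1 entries, which is why it avoids the memory cost of the full sparse matrix.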


Figure 3.2: The two parts of an inverted index [5]

How should the postings lists be maintained in memory?

Fixed-size array:

• It is not useful if we want to modify the data.
• What happens if the word “Caesar” is added to document 14?

Variable-size array:

• We use variable-size postings lists instead of fixed-size arrays.
• On disk, storing the postings as a continuous run is normal and optimal.
• In memory, use linked lists or variable-length arrays.
• There are some tradeoffs between size and ease of insertion.

Construction of Inverted Index:

• Collect the documents to be indexed [6]:
    Brutus, Caesar, Calpurnia, are friends
• Tokenize the text, that is, use a tokenizer to cut the character sequence into a token stream:
    Brutus  Caesar  Calpurnia  are  Friends
• Use linguistic modules to normalize the tokens, which become the indexing terms [6]:
    Brutus  Caesar  Calpurnia  are  friend
• Using the indexer, create an inverted index consisting of a dictionary and postings lists sorted by document ID, as shown in Figure 3.2 [6].

3.4. Statistical Processing of Natural Language

Statistical processing of natural language [6] represents an older model of retrieval systems, in which each document is characterized by a set of keywords, known as the term index.

This approach is essentially a "bag of words" model: all the words in a text are taken as its index terms, while word order, structure, meaning, etc. are ignored [6].

1. Preprocessing

• Eliminating tags (e.g., XML): removal of items that should not be indexed, such as certain document tags or headers.
• Standardization of text: normalizing the text so that the whole collection is in a consistent form to work with. This includes handling capitalized and non-capitalized words; handling particular elements such as numbers, dates, abbreviations and acronyms; removing empty words using a list of function words (prepositions, articles, etc.); and identifying N-grams (for example, multi-word and underlined terms) [7].
• Stemming and lemmatization: the lemma of a word is its root form, to which its inflected forms are reduced. The main purpose of stemming is to reduce a word to its stem, i.e., the keywords in the query or document are represented by their stems rather than by the original words. Stemming and lemmatization are very helpful in obtaining the root form of derived words. The word "inform", for instance, can serve as the root form of "information".

For example:
Dancing → Dance
Dances → Dance

• Common root form: 'dance'

2. Parameterization

Parameterizing documents involves assigning weights to the relevant terms associated with each document. The weight of a term is generally calculated from the frequency with which it appears in the text, indicating the significance of the term as a descriptor of the document's content.

Parameterization is a vital challenge [8]: awareness of the limitations in designing complex systems, and modelling for both profiling and data quality (and availability), is an important requirement [8].
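The preprocessing and parameterization steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the stop-word list is a tiny illustrative sample, and `naive_stem` is a hypothetical toy suffix stripper written for this example, not an established stemming algorithm. The term weight is simply the raw term frequency.

```python
import re
from collections import Counter

# Illustrative stop-word sample only; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the"}


def strip_tags(text):
    """Remove markup tags such as <doc> ... </doc>."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Toy stemmer (assumption): strip one common suffix if a stem of 3+ letters remains."""
    for suffix in ("ing", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_weights(document):
    """Preprocess a document and weight each term by its frequency."""
    tokens = tokenize(strip_tags(document))
    terms = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)


weights = term_weights("<doc>Dancing and dances: the dance of friends</doc>")
print(weights["danc"], weights["friend"])  # 3 1
```

Note that the toy stemmer maps "dancing", "dances" and "dance" to the common stem "danc", which is not a dictionary word; a lemmatizer would instead return the lemma "dance".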
Statistical processing of natural language [6]
represents an older model of retrieval systems, and


IV. CONCLUSION

To appreciate the results of the algorithms we have studied, it is helpful to provide performance measures for ad hoc retrieval. Documents, queries and relevance judgments are required for all proposed evaluations. The paper concludes with a framework for further research on the practical intelligence of information retrieval systems. New applications, such as question answering based on highly intelligent analysis, can be expected to gain additional market share soon.

The most important unanswered question is: in what future directions can (and should) this approach be taken?

Different approaches can be used depending on the requirement. For instance, Boolean retrieval over an incidence matrix can be used for small amounts of data, but in our opinion the inverted index is the best choice.

REFERENCES

[1] K. S. Jones, “The role of artificial intelligence in information
retrieval,” Journal of the American Society for Information
Science, vol. 42, no. 8, pp. 558-565, 1991.
[2] T. Mandl, “Tolerant information retrieval with backpropagation
networks.,” Neural Computing & Applications, pp. 280-289,
2000.
[3] S. Feldman, “NLP Meets the Jabberwocky: Natural Language
Processing in Information Retrieval,” ONLINE-WESTON THEN
WILTON- 23 , pp. 62-73, 1999.
[4] B. N. Khan, “Integrating Artificial Intelligence into Information
Retrieval,” [Online].
[5] An example information retrieval problem, [Online].
[6] C. Manning and H. Schutze, Foundations of statistical natural
language processing, MIT press, 1999.
[7] M. Vallez and R. Pedraza-Jimenez, “ Natural language
processing in textual information retrieval and related topics.,”
2007. [Online].
[8] A. Mubayi, “Inferring Patterns, Dynamics, and Model-Based
Metrics of Epidemiological Risks of Neglected Tropical
Diseases,” Handbook of Statistics, vol. 37, pp. 155-183, 2017.
[9] S. Hartrumpf, “Extending knowledge and deepening linguistic
processing for the question answering system InSicht.,” In
Workshop of the Cross-Language Evaluation Forum for
European Languages, pp. 361-369, 2005.
[10] J. N. Singh and S. K. Dwivedi, “Analysis of vector space model
in information retrieval,” in Proc. of IJCA National Conference
on Communication Technologies & its Impact on Next
Generation Computing, 2012, vol. 2, pp. 14–18.
[11] J. N. Singh and S. K. Dwivedi, “A comparative study on
approaches of vector space model in information retrieval,” in
International Conference of Reliability, Infocom Technologies
and Optimization, 2013.
[12] J. N. Singh and S. K. Dwivedi, “Comparative study on evaluative
measures of search engines,” in Proceedings of 3rd International
Conference on Reliability, Infocom Technologies and
Optimization, 2014, pp. 1–6.
[13] J. N. Singh and S. K. Dwivedi, Analyze & Evaluate the
Performance of Search Engines. Lambert Academic Publishing,
2015.

