Information Storage and Retrieval - Group 5, Section 2 Assignment
ABOR UNIVERSITY
FACULTY OF TECHNOLOGY
Sec_2 Group_5
Name ID no
1. Mogessie Yirdaw…………………………………………………………………….IT®094/11
2. Dejenu Fekadie………………………………………………………………………. .IT®038/11
3. Yebelay Yesigat………………………………………………………………………. IT®126/11
4. Birtukan Melikamu………………………………………………………………… IT®030/11
5. Rahel Mulatu…………………………………………………………………………… IT®100/11
6. Emaway Demelash……………………………………………………………………IT®042/11
7. Selam Andargie……………………………………………………………………….. IT®102/11
8. Aderaji Getinet…………………………………………………………………………. IT®010/11
9. Getahun Melaku……………………………………………………………………IT®058/11
10. Endalikachew Acha………………………………………………………………….. IT®046/11
11.G/Selassie Abebe………………………………………………………………………. IT®054/11
Information storage and retrieval assignment
An information storage and retrieval system (ISRS) is a network with a built-in user interface
that facilitates the creation, searching, and modification of stored data.
Information retrieval is the science of searching for information within documents, searching
for the documents themselves, and searching the metadata that describes data, as well as
searching databases of text, images, or sound.
IRS is the process of obtaining information system resources that are relevant to an information
need from a collection of those resources. Searches can be based on full-text or other content-
based indexing.
Automated information retrieval systems are used to reduce what has been called information
overload.
An IR system is a software system that provides access to books, journals and other documents;
stores and manages those documents.
NLP draws from many disciplines, including computer science and computational
linguistics, in its pursuit to fill the gap between human communication and computer
understanding.
NLP drives computer programs that translate text from one language to another, respond
to spoken commands, and summarize large volumes of text rapidly—even in real time.
But NLP also plays a growing role in enterprise solutions that help streamline business
operations, increase employee productivity, and simplify mission-critical business processes.
The difference between natural language processing (NLP) and information storage and
retrieval (ISR)
The goal of NLP is to understand and generate the languages that humans use naturally, so that
eventually we will be able to communicate with computers as we do with our fellow humans.
NLP work focuses on creating systems that efficiently process text and make its information
accessible to computer applications.
IRS, on the other hand, is about finding resources relevant to an information need within a huge
collection of resources. Because textual data dominates the internet, the primary task of any IR
system is to process that text, and this is where NLP plays a pivotal role.
IR is not limited to text; it applies to image and video search as well.
Information retrieval is the broader task of digging out data within a specific context (i.e. the
query intent). This data is textual in many IR applications.
NLP is used to understand the structure and meaning of human language by analyzing different
aspects such as syntax, semantics, pragmatics, and morphology. Computer science then transforms
this linguistic knowledge into rule-based or machine learning algorithms that can solve specific
problems and perform the desired tasks.
Text length counts: features such as character length and word length are commonly significant
in text datasets.
Non-dictionary word counts: the count or ratio of non-dictionary, or out-of-vocabulary (OOV),
words in the text. This can serve as a pseudo-feature representing how formal the text is.
Readability metrics: metrics such as the Flesch-Kincaid Readability Test and SMOG can be used.
Essentially, these are ratios of complex words (polysyllables) to the total number of words or
sentences, which are hypothesized to signal how easily readable a piece of text is.
Unique word ratios: the ratio of unique words to total words. This feature gives a sense of how
much word repetition there is in the data points.
Part of speech (POS): Part-of-Speech (POS) tags are among the most frequently used features in
practice, and they are incredibly powerful.
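To make these features concrete, here is a minimal sketch (our own illustration, not part of the original material) of computing them in Python. It assumes NLTK is installed with its tokenizer and POS-tagger models downloaded, that the input text is non-empty, and that `dictionary` is a hypothetical set of known words used for the OOV ratio.

```python
# Minimal sketch of the surface features described above. Requires NLTK plus
# one-time downloads: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger').
import nltk

def text_features(text, dictionary):
    """dictionary: a set of known words, used to count OOV tokens."""
    tokens = nltk.word_tokenize(text)      # assumes non-empty text
    n = len(tokens)
    return {
        "char_length": len(text),                                    # text length counts
        "word_length": n,
        "oov_ratio": sum(t.lower() not in dictionary for t in tokens) / n,
        "unique_word_ratio": len(set(t.lower() for t in tokens)) / n,
        "pos_tags": nltk.pos_tag(tokens),                            # POS features
    }
```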
Features of information storage and retrieval
Presentation
Organization
Storage
Access
Similarities between natural language processing and information storage and retrieval:
Both are used to access useful information.
Both involve storing and searching information.
Stemming and Lemmatization
Stemming
Stemming is the process of reducing the morphological variants of a word to a common root/base
form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Often when
searching text for a certain keyword, it helps if the search returns variations of the word. For
instance, searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the
stem for [boat, boater, boating, boats].
Stemming uses the stem of the word, while lemmatization uses the context in which the word
is being used.
Stemming just removes or stems the last few characters of a word, often producing incorrect
meanings and spellings, while lemmatization considers the context and converts the word to its
meaningful base form, called the lemma. Sometimes the same word can have multiple different
lemmas.
A stemmer returns the stem of a word, which need not be identical to the morphological root of
the word. It is usually sufficient that related words map to the same stem, even if the stem is
not itself a valid root. Lemmatization, by contrast, returns the dictionary form of a word, which
must be a valid word.
In lemmatization, the part of speech of a word must first be determined, and the normalization
rules differ for different parts of speech. A stemmer, on the other hand, operates on a single
word without knowledge of its context, and therefore cannot discriminate between words that have
different meanings depending on their part of speech.
Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms
to linguistically valid lemmas.
Lemmatization deals only with inflectional variance, whereas stemming may also deal with
derivational variance.
Stemming identifies the common root form of a word by removing or replacing word suffixes
(e.g. “flooding” is stemmed as “flood”), while lemmatization identifies the inflected forms of a
word and returns its base form (e.g. “better” is lemmatized as “good”).
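As a small illustration of the difference, the sketch below uses NLTK's Porter stemmer and WordNet lemmatizer (assuming NLTK and its WordNet data are installed); it reproduces the “flooding”/“better” examples above.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer   # needs nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("flooding"))                  # 'flood' (suffix stripped)
print(stemmer.stem("boating"))                   # 'boat'
# The lemmatizer must be told the part of speech: 'a' marks an adjective.
print(lemmatizer.lemmatize("better", pos="a"))   # 'good' (dictionary form)
print(lemmatizer.lemmatize("better"))            # 'better' (defaults to noun)
```

Note that the lemmatizer only returns “good” when told that “better” is an adjective, which reflects the point above that lemmatization depends on the part of speech.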
Lemmatization is similar to stemming in that both reduce a word variant to a base form: the
"stem" in stemming and the "lemma" in lemmatization. Lemmatization uses vocabulary and
morphological analysis to return words to their dictionary form; it converts each word to its
basic form, the lemma.
Stemming and lemmatization play an important role in order to increase the recall capabilities of
an information retrieval system.
The basic principle of both techniques is to group similar words which have either the same root
or the same canonical citation form.
Both are used in Text and Natural Language Processing.
Of the common stemmers, the Porter stemmer is the least aggressive algorithm; the full
description of each algorithm is somewhat lengthy and technical.
Ontologies are usually huge repositories of concepts and relations between concepts in a
certain domain.
Using ontologies in information retrieval improves retrieval effectiveness.
Research in this area explores the impact of using an ontology for query expansion to improve
retrieval results. The results show that ontology-based query expansion retrieves a higher
number of relevant documents than other query expansion processes.
Ontologies are applicable to domain independent retrieval such as web information retrieval
and also more useful in specialized information retrieval tasks. They have also been used in
query expansion. Ontologies are effectively formal and explicit specifications in the form of
concepts and relations of shared conceptualizations.
4. WordNet based query expansion
WordNet-based query expansion expands query terms by means of WordNet relations. One well-known
method applies the WordNet ontology in the geographical domain, expanding geographical terms
through the synonymy and meronymy relationships.
WordNet can be used in a more effective way during the indexing phase, by adding
synonyms and homonyms to the index terms.
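A hedged sketch of query-time expansion using NLTK's WordNet interface: each query term is expanded with the synonyms (lemma names) and part-meronyms of its noun synsets. The function name `expand_term` is our own, and a real geographical system would restrict expansion to place names.

```python
from nltk.corpus import wordnet as wn   # needs nltk.download('wordnet')

def expand_term(term):
    expansion = set()
    for synset in wn.synsets(term, pos=wn.NOUN):
        expansion.update(l.name() for l in synset.lemmas())   # synonyms
        for mero in synset.part_meronyms():                   # meronyms
            expansion.update(l.name() for l in mero.lemmas())
    expansion.discard(term)
    return expansion

print(expand_term("car"))   # e.g. 'auto', 'automobile', 'accelerator', 'air_bag', ...
```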
5. Thesaurus based query expansion
A thesaurus is a classification system compiled of words or phrases, organized with the
objective of helping the user express an idea. Section 4 highlights the construction of a
thesaurus in detail.
Term weighting: In some cases all the terms in the query carry equal weight, i.e. all terms are
of equal importance. Often, however, different weights must be assigned to different terms
depending on their importance.
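A common way of assigning such weights is tf-idf, where weight(t, d) = tf(t, d) x log(N / df(t)). A minimal sketch (the toy documents are made up for illustration):

```python
import math
from collections import Counter

docs = ["information retrieval system",
        "retrieval of stored information",
        "natural language processing"]
tokenized = [d.split() for d in docs]
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))   # document frequency

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)                          # term frequency
    return tf * math.log(N / df[term]) if df[term] else 0.0

# 'retrieval' occurs in 2 of 3 documents, so it is weighted lower than
# 'stored', which occurs in only one.
print(tfidf("retrieval", tokenized[1]))   # ~0.405
print(tfidf("stored", tokenized[1]))      # ~1.099
```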
4. What is a search engine? Discuss in detail the basic idea behind search engines.
Additionally, list some common search engines and explain how they work.
A search engine is a web-based tool that people use to locate information on the internet.
Search engines allow users to search the internet for content using keywords.
When a user enters a query into a search engine, a search engine results page (SERP) is
returned, ranking the found pages in order of their relevance.
Web search engines are the most visible IR applications.
Search engines often change their algorithms (the programs that rank the results) to improve user
experience. They aim to understand how users search and give them the best answer to their
query. This means giving priority to the highest quality and most relevant pages.
Crawling - search engines use programs, called spiders, bots or crawlers, to scour the
internet. They may do this every few days, so it is possible for content to be out-of-date
until they crawl your website again.
Indexing - the search engine tries to understand and categorize the content on a web page
through 'keywords'. Following SEO best practice helps the search engine understand your
content so you can rank for the right search queries.
Ranking - search results are ranked based on a number of factors. These may include
keyword density, speed and links. The search engine's aim is to provide the user with the
most relevant result; a toy sketch of all three stages follows this list.
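The toy sketch below (our own illustration, not any real engine's code) walks through the three stages over a tiny in-memory "web"; a real crawler would fetch pages over HTTP and follow links, and real ranking uses far more signals than term counts.

```python
# A tiny in-memory "web" standing in for pages a crawler would fetch.
web = {
    "page1": "search engines crawl the web",
    "page2": "engines rank pages by relevance",
    "page3": "relevance of search results matters",
}

# Crawling + indexing: visit every page and record which terms it contains.
index = {}
for url, text in web.items():
    for term in set(text.split()):
        index.setdefault(term, set()).add(url)

# Ranking: score pages by how many query terms they contain (a crude stand-in
# for the keyword, speed and link factors mentioned above).
def search(query):
    scores = {}
    for term in query.split():
        for url in index.get(term, ()):
            scores[url] = scores.get(url, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("search relevance"))   # 'page3' first: it matches both terms
```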
Although most search engines will provide tips on how to improve your page ranking, the exact
algorithms used are well guarded and change frequently to avoid misuse. But by following
search engine optimization (SEO) best practice you can ensure that:
Search engines can easily crawl your website. You can also prompt them to crawl new
content.
Your content is indexed for the right keywords so it can appear for relevant searches.
5. As you know, stop-word lists are language dependent (the stop-word list of one language
is different from the stop-word list of another language). Prepare the stop-word list of your
mother tongue language. If your group is composed of members speaking different
languages, select one language and prepare stop-word lists on it.
What words are not stop words?
Generally speaking, most stop words are function (filler) words: words with little or no
meaning of their own that help form a sentence. Content words such as adjectives, nouns, and
verbs are usually not considered stop words. However, a programmer may choose to add very
common domain words to the list. For example, Computer Hope might treat "computer" as a stop
word because it could be used to describe any computer-related product (e.g., computer
motherboard, computer video card, etc.). Following this description, we can list stop words in
the Amharic language.
Examples:
ስለዚህ (therefore)
ነገር ግን (but / however)
ይሁን እንጅ (nevertheless)
ናቸው ((they) are)
ነው (is)
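A minimal sketch of applying such a list, assuming the text is already tokenized; a multi-word entry such as “ነገር ግን” would need phrase matching in a real system.

```python
AMHARIC_STOP_WORDS = {"ስለዚህ", "ነገር ግን", "ይሁን እንጅ", "ናቸው", "ነው"}

def remove_stop_words(tokens):
    # Keep only content-bearing tokens.
    return [t for t in tokens if t not in AMHARIC_STOP_WORDS]

# "ይህ መጽሐፍ ጥሩ ነው" = "This book is good"; the copula "ነው" is dropped.
print(remove_stop_words(["ይህ", "መጽሐፍ", "ጥሩ", "ነው"]))
```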
6. There are several data structure/file structure to use for text indexing in IR, such
as sequential file indexing, inverted file indexing, suffix array indexing and
signature file indexing, etc. Discuss how they work in text indexing.
Indexed sequential file indexing
Records in indexed sequential files are stored in the order that they are written to the disk.
Records may be retrieved in sequential order, or in random order using a numeric index
representing the record number in the file.
An Indexed Sequential File is organized so that it is easy to find information quickly without
the computer having to search through the whole file. The file includes a number of indexes, and
the computer can look at these to find exactly where in the file the record is stored - in much
the same way that you would use a library catalogue to find a book rather than looking along all
the shelves.
This is ideal for files which are used for reference - such as in an airline booking system or a
library catalogue: it would be very slow to have to hunt all the way through the files in order to
find a particular flight or a particular library book – so an ordinary sequential file would not be
suitable.
It is possible to insert new records into an Indexed Sequential File - so it is not necessary to copy
a whole file across to a new one just to add a single record (as would be the case with a
sequential file) as the system leaves space for later insertions; likewise, it is easy to alter a record
in place without writing the whole file out again, or to delete a record.
Again this is very suitable for systems such as travel or theatre booking where information has to
be changed frequently and it would be inefficient to copy out a whole file again just to alter
information about one or two seats.
When setting up an Indexed Sequential File in the first place, the programmer has to specify
which field will be used as the key field - e.g. a student file might use student name, or number
as the key field - this is the item which will be looked up in the index in order to find a record.
Note that some operating systems automatically set up a properly indexed file with sufficient
space when your program runs: with others however you will have to set up a template for the
file (i.e. give information about it) using a utility program before running the program which will
use the file.
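A minimal sketch of the idea in Python (the record layout and key values are made up): fixed-length records are written in key order, an in-memory index maps each key to its byte offset, and a lookup becomes a single seek instead of a scan of the whole file.

```python
RECORD_SIZE = 32   # fixed-length records make offsets predictable

records = [("ST030", "Birtukan"), ("ST094", "Mogessie"), ("ST126", "Yebelay")]

index = {}
with open("students.dat", "wb") as f:
    for key, name in sorted(records):              # sequential (key) order
        index[key] = f.tell()                      # remember the byte offset
        f.write(f"{key},{name}".ljust(RECORD_SIZE).encode())

def fetch(key):
    with open("students.dat", "rb") as f:
        f.seek(index[key])                         # jump straight to the record
        return f.read(RECORD_SIZE).decode().strip()

print(fetch("ST094"))   # 'ST094,Mogessie' -- no scan of the other records
```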
Inverted file indexing
An inverted index (also referred to as a postings file or inverted file) is a database index
storing a mapping from content, such as words or numbers, to its locations in a table, document,
or set of documents (named in contrast to a forward index, which maps from documents to
content). The purpose of an inverted index is to allow fast full-text searches, at the cost of
increased processing when a document is added to the database. The inverted file may be the
database file itself, rather than its index. It is the most popular data structure used in
document retrieval systems.
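A minimal sketch of building and querying an inverted index (the toy documents are our own); each term maps to a postings list of (document id, position) pairs, so full-text lookup is a dictionary access instead of a scan of every document.

```python
documents = {
    0: "stemming reduces words to stems",
    1: "lemmatization returns the dictionary form",
    2: "stemming and lemmatization aid retrieval",
}

inverted = {}
for doc_id, text in documents.items():
    for pos, term in enumerate(text.split()):
        inverted.setdefault(term, []).append((doc_id, pos))   # postings list

print(inverted["stemming"])        # [(0, 0), (2, 0)]
print(inverted["lemmatization"])   # [(1, 0), (2, 2)]
```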
Suffix array indexing
What is a suffix array in information retrieval?
A suffix array is a sorted array of all suffixes of a given string. Any suffix-tree-based
algorithm can be replaced with an algorithm that uses a suffix array enhanced with additional
information and solves the same problem in the same time complexity. Documents are normally
stored as lists of words.
The suffix array is one of the most prevalent data structures for string indexing; it stores the
lexicographically sorted list of suffixes of a given string. Its practical advantage compared to the
suffix tree is space efficiency. In Property Indexing, we are given a string x of length n and a
property Π, i.e., a set of Π-valid intervals over x, so that a pattern p occurs in x if and only
if x has an occurrence of p that lies entirely within an interval of Π. A suffix-tree-like index
over the valid prefixes of suffixes of x can be built in time and space O(n), and a
suffix-array-like index, the Property Suffix Array (PSA), can likewise be built directly in time
and space O(n). The main motivation comes from weighted (probabilistic) sequences: sequences of
probability distributions over a given alphabet. Given a probability threshold 1/z, a string p
of length m matches a weighted sequence x of length n at starting position i if the product of
the probabilities of the letters of p at positions i, …, i+m − 1 in x is at least 1/z. The PSA
construction can be applied directly to build an O(nz)-sized suffix-array-like index over x in
time and space O(nz); experimental results show that this indexing data structure is well suited
for real-world applications.
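A minimal suffix array sketch (using the naive O(n^2 log n) construction for clarity; Python 3.10+ is assumed for bisect's key argument). Searching for a pattern is a binary search over the lexicographically sorted suffixes.

```python
import bisect

def suffix_array(s):
    # Indices of all suffixes, sorted lexicographically by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurs(s, sa, p):
    # Find the first suffix >= p; p occurs iff that suffix starts with p.
    lo = bisect.bisect_left(sa, p, key=lambda i: s[i:])
    return lo < len(sa) and s[sa[lo]:].startswith(p)

s = "banana"
sa = suffix_array(s)          # [5, 3, 1, 0, 4, 2]: a, ana, anana, banana, na, nana
print(occurs(s, sa, "ana"))   # True
print(occurs(s, sa, "nab"))   # False
```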
Signature file indexing
The signature file method is a popular indexing technique used in information retrieval and
databases. It excels in efficient index maintenance and low space overhead. However, it suffers
from inefficiency in query processing, because for each query the entire signature file must be
scanned.
A signature file allows fast search over text data. It is typically a very compact data
structure that aims at minimizing disk access at query time. Query processing is performed in
two stages: filtering, where false negatives are guaranteed not to occur but false positives may
occur, and query refinement, where the false positives are removed.
Efficient and effective text indexing is a well-known and long-standing problem in information
retrieval. While inverted files are nowadays the de facto standard for text indexing, in the
early days their storage overhead was not acceptable for larger datasets. In addition, accessing an
inverted file on disk would require a relatively large number of (expensive) disk seeks. The main
motivation for signature files is to allow fast filtering of text using a linear scan of the signature
file for finding text segments that may contain the queried term(s). Given that the found
segments may be false-positives, a refinement step is required before the final correct answer is
returned. The main compromise in signature files lies in how to build signatures for terms and
for text segments that allow low storage overhead, fast disk access, and a low ratio of
false positives.
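A hedged sketch of signature-file filtering with superimposed coding (the parameters and toy text blocks are made up): each term sets a few hash-chosen bits, a block's signature is the OR of its terms' signatures, and a query term need only be verified in blocks whose signature contains all of the term's bits.

```python
SIG_BITS, BITS_PER_TERM = 64, 3

def term_signature(term):
    sig = 0
    for i in range(BITS_PER_TERM):             # superimposed coding: set a few bits
        sig |= 1 << (hash((term, i)) % SIG_BITS)
    return sig

blocks = ["signature files allow fast search",
          "inverted files dominate text indexing",
          "query refinement removes false positives"]
signatures = [0] * len(blocks)
for b, text in enumerate(blocks):
    for term in text.split():
        signatures[b] |= term_signature(term)  # OR of the block's term signatures

def candidates(term):
    t = term_signature(term)
    # Filtering: no false negatives, but false positives are possible.
    return [b for b, s in enumerate(signatures) if s & t == t]

# Query refinement: verify each candidate against the actual text.
for b in candidates("false"):
    print(b, "false" in blocks[b].split())
```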