
DEBRE TABOR UNIVERSITY

FACULTY OF TECHNOLOGY

DEPARTMENT OF INFORMATION TECHNOLOGY

Course title: Information storage and retrieval

Sec_2 Group_5
Name ID no
1. Mogessie Yirdaw…………………………………………………………………….IT®094/11
2. Dejenu Fekadie………………………………………………………………………. .IT®038/11
3. Yebelay Yesigat………………………………………………………………………. IT®126/11
4. Birtukan Melikamu………………………………………………………………… IT®030/11
5. Rahel Mulatu…………………………………………………………………………… IT®100/11
6. Emaway Demelash……………………………………………………………………IT®042/11
7. Selam Andargie……………………………………………………………………….. IT®102/11
8. Aderaji Getinet…………………………………………………………………………. IT®010/11
9. Getahun Melaku……………………………………………………………………IT®058/11
10.Endalikachew Acha………………………………………………………………….. IT®046/11
11.G/Selassie Abebe………………………………………………………………………. IT®054/11
Information storage and retrieval assignment

Submitted to: Ms. Senbeto K. Submitted date: 07/11/13 E.C.


1. Differentiate the following terms, giving clear distinguishing definitions and features. Explain
their similarities as well.
 Natural Language Processing (NLP) and ISR
 Stemming and Lemmatization

Information storage and retrieval (ISR)

An information storage and retrieval system (ISRS) is a network with a built-in user interface
that facilitates the creation, searching, and modification of stored data.
Information retrieval is the science of searching for information in a document, searching for
documents themselves, and also searching for the metadata that describes data, and
for databases of texts, images or sounds.
Information retrieval (IR) is the process of obtaining resources that are relevant to an
information need from a collection of those resources. Searches can be based on full-text or
other content-based indexing.

Automated information retrieval systems are used to reduce what has been called information
overload.

An IR system is a software system that provides access to books, journals and other documents;
stores and manages those documents. 

Natural Language Processing (NLP)

 Natural language processing (NLP) is a branch of artificial intelligence that helps


computers understand, interpret and manipulate human language.
 NLP is broadly defined as the automatic manipulation of natural language, like speech
and text, by software.
 The study of natural language processing has been around for more than 50 years and
grew out of the field of linguistics with the rise of computers.

DTU 2013 Page 1



 NLP draws from many disciplines, including computer science and computational
linguistics, in its pursuit to fill the gap between human communication and computer
understanding.
 NLP drives computer programs that translate text from one language to another, respond
to spoken commands, and summarize large volumes of text rapidly—even in real time.

There’s a good chance you’ve interacted with NLP in the form of

 voice-operated GPS systems


 digital assistants
 speech-to-text dictation software
 customer service chatbots
 And other consumer conveniences.

But NLP also plays a growing role in enterprise solutions that help streamline business
operations, increase employee productivity, and simplify mission-critical business processes.

The difference between natural language processing (NLP) and information storage and
retrieval (ISR)

The goal of NLP is to understand and generate the languages that humans use naturally, so that
eventually we will be able to communicate with computers as we do with our fellow humans.
NLP focuses on creating systems that efficiently process text and make its information
accessible to computer applications.
ISR, by contrast, is about finding the resources relevant to an information need within a huge
collection of resources. Because textual data dominates the internet, the primary task of any IR
system is to process that text, and here NLP plays a pivotal role.
ISR is not limited to text; it is applicable to image and video search as well.
Information Retrieval is the broader aspect of digging out data within a specific context (i.e.
query intent). This data is textual in many IR applications.


Features of natural language processing

NLP is used to understand the structure and meaning of human language by analyzing different
aspects like syntax, semantics, pragmatics, and morphology. Then, computer science transforms
this linguistic knowledge into rule-based, machine learning algorithms that can solve specific
problems and perform desired tasks

The basic features include the following:

 Text length counts: features like character length and word length are quite commonly
significant in text datasets.
 Non-dictionary word counts: Count/ratio of non-dictionary words or OOV (out of vocab)
words in the text. It can be a pseudo-feature representing how formal the text is.
 Readability metrics: metrics like the Flesch–Kincaid Readability Test and SMOG can be used.
Essentially, these are ratios of complex words (polysyllables) to the total number of words or
sentences — which are hypothesized to signal how easily readable a piece of text is.
 Unique word ratios: ratio of number of unique words/total words. This feature gives you a
sense of word repetitions in the data points.
 Part of speech (pos): If there’s a single feature I have used the most in my experience, it’s
the Part-Of-Speech (POS) tags. They are incredibly powerful.
Features of information storage and retrieval

Presentation
Organization
Storage
Access

The similarities between natural language processing and information storage and retrieval:
Both are used to access useful information.
Both are used for storing and searching.
Stemming and Lemmatization
Stemming
Stemming is the process of reducing morphological variants of a word to a common root/base
form. Stemming programs are commonly referred to as stemming algorithms or stemmers. Often when searching


text for a certain keyword, it helps if the search returns variations of the word. For instance,
searching for “boat” might also return “boats” and “boating”. Here, “boat” would be the stem for
[boat, boater, boating, boats].

Reduce terms to their “roots” before indexing.

“Stemming” suggests crude affix chopping.
It is language dependent.
E.g. automate(s), automatic, automation are all reduced to automat.
For example, compressed and compression are both accepted as equivalent to compress.
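As a minimal Python sketch of this kind of crude affix chopping (the suffix list and the minimum stem length of 3 are assumptions chosen for the example, not the real Porter rules):

```python
def crude_stem(word):
    """Crudely chop one common English suffix (illustrative only, not Porter)."""
    for suffix in ("ation", "ion", "ing", "sses", "ies", "ed", "es", "s"):
        # Strip only if a reasonably long stem would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ("boats", "boating", "compressed", "compression"):
    print(w, "->", crude_stem(w))
```

Note that "compression" reduces to "compress" only because "ion" happens to be in the suffix list; crude stemmers are full of such ad hoc choices, which is why real systems use a carefully designed algorithm such as Porter's.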
Lemmatization
Lemmatization looks beyond word reduction and considers a language’s full vocabulary to
apply a morphological analysis to words. The lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is
‘mouse’.
Reduce inflectional/variant forms to base form.
E.g. am, are, is→ be
car, cars, car's, cars' → car
The boy's cars are different colors → the boy car be different color.
Lemmatization implies doing “proper” reduction to dictionary headword form.
The difference between Stemming and Lemmatization

Stemming uses the stem of the word, while lemmatization uses the context in which the word
is being used.

Stemming just removes or stems the last few characters of a word, often leading to incorrect
meanings and spelling, while lemmatization considers the context and converts the word to its
meaningful base form, which is called the lemma. Sometimes, the same word can have multiple
different lemmas.
A stemmer returns the stem of a word, which need not be identical to the morphological root of
the word. It is usually sufficient that related words map to the same stem, even if the stem is
not in itself a valid root. Lemmatization, in contrast, returns the dictionary form of a word,
which must be a valid word.


In lemmatization, the part of speech of a word must first be determined, and the normalization
rules differ for different parts of speech. The stemmer, by contrast, operates on a single word
without knowledge of the context, and therefore cannot discriminate between words that have
different meanings depending on part of speech.
Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms
to linguistically valid lemmas.
Lemmatization deals only with inflectional variance, whereas stemming may also deal with
derivational variance.
Stemming identifies the common root form of a word by removing or replacing word suffixes
(e.g. “flooding” is stemmed as “flood”), while lemmatization identifies the inflected forms of a
word and returns its base form (e.g. “better” is lemmatized as “good”).
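To make the contrast concrete, here is a toy Python sketch comparing a lookup-based lemmatizer with a one-rule suffix stemmer (the lemma table and the stemming rule are invented for illustration; real lemmatizers consult a full vocabulary plus part-of-speech information):

```python
# Toy lemma dictionary (real lemmatizers consult a full vocabulary plus POS tags).
LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be",
          "mice": "mouse", "better": "good", "cars": "car"}

def lemmatize(word):
    """Return the dictionary form if known, else the word itself."""
    return LEMMAS.get(word, word)

def naive_stem(word):
    """One crude rule: chop a trailing 's' (stems need not be valid words)."""
    return word[:-1] if word.endswith("s") else word

print(lemmatize("was"), "vs", naive_stem("was"))    # be vs wa
print(lemmatize("mice"), "vs", naive_stem("mice"))  # mouse vs mice
```

The stemmer happily produces the non-word "wa", while the lemmatizer maps "was" to the valid dictionary form "be" — exactly the difference described above.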

The features of stemming

1. Stemming is used in information retrieval systems like search engines.


2. It is used to determine domain vocabularies in domain analysis.
The features of lemmatization
The main feature of lemmatization is that it takes the context of the word into consideration to
determine the intended meaning the user is looking for. This process decreases noise and speeds
up the user's task.

The Similarity of stemming and lemmatization

Lemmatization is similar to stemming in that both reduce a word variant to a base form — to its
"stem" in stemming and to its "lemma" in lemmatization. Lemmatization uses vocabulary and
morphological analysis to return words to their dictionary form, converting each word to its
basic form, the lemma.

Stemming and lemmatization play an important role in order to increase the recall capabilities of
an information retrieval system.
 The basic principle of both techniques is to group similar words which have either the same root
or the same canonical citation form.
Both are used in Text and Natural Language Processing.


2. Define the following IR tools and concepts.


NLTK
Cross language IR
Porter Stemmer

Natural Language Toolkit (NLTK)


 It is a platform used for building Python programs that work with human language data,
for application in statistical natural language processing (NLP).
 It contains text processing libraries for tokenization, parsing, classification, stemming,
tagging and semantic reasoning.
 It also includes graphical demonstrations and sample data sets, and is accompanied by a
cookbook and a book explaining the principles behind the underlying language
processing tasks that NLTK supports.
Cross-language information retrieval (CLIR)
It is a subfield of information retrieval dealing with retrieving information written in
a language different from the language of the user's query.
Most CLIR systems use various translation techniques. These techniques can be classified into
different categories based on different translation resources:

 Dictionary-based CLIR techniques


 Parallel corpora based CLIR techniques
 Comparable corpora based CLIR techniques
 Machine translator based CLIR techniques
Porter stemming
The Porter stemming algorithm is a process for removing the commoner morphological and
inflexional endings from words in English. Its main use is as part of a term normalization
process that is usually done when setting up information retrieval systems.


The Porter stemmer is the least aggressive of the common stemming algorithms, although the
description of each algorithm is actually somewhat lengthy and technical.
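As a small, concrete taste of the algorithm, step 1a of Porter's published rules (plural handling) can be written directly; the later steps follow the same pattern, with an added "measure" condition on the remaining stem:

```python
def porter_step_1a(word):
    """Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (removed)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```

Note how the output "poni" is not a valid English word — consistent with the point made earlier that stems need not be valid dictionary forms.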

3. Discuss the following query processing techniques in ISR:


1. Relevance feedback
2. Query expansion
3. Ontology based query expansion
4. Wordnet based query expansion
5. Thesaurus based query expansion
1. Relevance feedback
Relevance feedback is a feature of some information retrieval systems. The idea
behind relevance feedback is to take the results that are initially returned from a given query,
to gather user feedback, and to use information about whether or not those results
are relevant to perform a new query.
Relevance feedback:
 The user provides feedback on the relevance of documents in an initial set of results.
 The user marks some results as relevant or non-relevant.
 The system computes a better representation of the information need based on the feedback.
 Relevance feedback can go through one or more iterations.
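The classic concrete realization of this loop is the Rocchio algorithm, sketched here with dictionary term vectors (the weights alpha, beta and gamma are conventional defaults, not mandated values):

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant docs and away from non-relevant ones."""
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)
    new_query = {}
    for t in terms:
        rel = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        non = sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant) if nonrelevant else 0.0
        weight = alpha * query.get(t, 0.0) + beta * rel - gamma * non
        if weight > 0:                 # negative weights are usually discarded
            new_query[t] = round(weight, 3)
    return new_query

# One iteration: the user marked one result relevant and one non-relevant.
q = rocchio({"boat": 1.0},
            relevant=[{"boat": 1.0, "sail": 1.0}],
            nonrelevant=[{"car": 1.0}])
print(q)  # boat: 1.75, sail: 0.75; 'car' gets a negative weight and is dropped
```

The reformulated query now also contains "sail", a term the user never typed, because it appeared in a document the user judged relevant.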
2. Query expansion (QE)
 It is a process in Information Retrieval which consists of selecting and adding terms to the
user's query with the goal of minimizing query-document mismatch and thereby
improving retrieval performance
 It is the process of reformulating a given query to improve retrieval performance
in information retrieval operations, particularly in the context of query understanding.
 In the context of search engines, query expansion involves evaluating a user's input (what
words were typed into the search query area, and sometimes other types of data) and
expanding the search query to match additional documents.
3. Ontology based query expansion
 Ontology-based query expansion is a technique to enhance the effectiveness of the system
in the retrieval process.


 Ontologies are usually huge repositories of concepts and relations between concepts in a
certain domain.
 Using ontologies in information retrieval improves retrieval effectiveness.
 Research in this area explores the impact of using an ontology for query expansion to
improve retrieval results.
 Results show that ontology-based query expansion retrieves a higher number of relevant
documents than other query expansion processes.

Ontologies are applicable to domain independent retrieval such as web information retrieval
and also more useful in specialized information retrieval tasks. They have also been used in
query expansion. Ontologies are effectively formal and explicit specifications in the form of
concepts and relations of shared conceptualizations.
4. WordNet based query expansion
 WordNet-based query expansion expands query terms by means of WordNet relations. One
described method, applied only in the geographical domain, expands geographical terms
using WordNet synonyms and meronyms.
 WordNet can also be used more effectively during the indexing phase, by adding
synonyms and homonyms to the index terms.
5. Thesaurus based query expansion
A thesaurus is a classification system compiled of words or phrases organized with the
objective of helping the user express an idea.
Term weighting: in some cases all the terms in the query carry an equal weight, i.e. all
terms are of equal importance. Often, however, different weights have to be assigned to
different terms depending on their importance.
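Putting the two ideas together, a thesaurus-based expansion with down-weighted added terms might be sketched as follows (the tiny thesaurus and the 0.5 synonym weight are assumptions for illustration):

```python
# A tiny hand-made thesaurus (a real system would use a curated resource).
THESAURUS = {"car": ["automobile", "vehicle"], "fast": ["quick", "rapid"]}

def expand_query(terms, syn_weight=0.5):
    """Original terms keep weight 1.0; thesaurus synonyms get a lower weight."""
    expanded = {t: 1.0 for t in terms}
    for t in terms:
        for syn in THESAURUS.get(t, []):
            expanded.setdefault(syn, syn_weight)
    return expanded

print(expand_query(["fast", "car"]))
```

Down-weighting the added synonyms reflects the term-weighting point above: expansion terms are useful but should not dominate the terms the user actually typed.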
4. What are search engines? Discuss in detail the basic idea behind search engines.
Additionally, list some common search engines and explain how they work.
 A search engine is a web based tool that is used by people to locate information on the
internet.
 Search engines allow users to search the internet for content using keywords.


 When a user enters a query into a search engine, a search engine results page (SERP) is
returned, ranking the found pages in order of their relevance.
 Web search engines are the most visible IR applications.

Search engines often change their algorithms (the programs that rank the results) to improve user
experience. They aim to understand how users search and give them the best answer to their
query. This means giving priority to the highest quality and most relevant pages.

Some of the most popular examples of search engines are:


 Google
 Bing
 Yahoo!
 MSN Search
There are three key steps to how most search engines work:

 Crawling - search engines use programs, called spiders, bots or crawlers, to scour the
internet. They may do this every few days, so it is possible for content to be out-of-date
until they crawl your website again. 

 Indexing - the search engine will try to understand and categorize the content on a web
page through 'keywords'. Following SEO best practice will help the search engine
understand your content so you can rank for the right search queries.

 Ranking - search results are ranked based on a number of factors. These may include
keyword density, speed and links. The search engine's aim is to provide the user with the
most relevant result. 
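The ranking step can be caricatured in a few lines of Python, scoring pages by raw query-term frequency (real engines combine hundreds of signals, but the shape of the computation is the same):

```python
def rank(pages, query):
    """Order page URLs by how often the query terms appear, highest first."""
    terms = query.lower().split()
    scores = {url: sum(text.lower().split().count(t) for t in terms)
              for url, text in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)

pages = {
    "a.html": "boats and boat rentals",
    "b.html": "boat boat boat shop",
    "c.html": "car dealership",
}
print(rank(pages, "boat"))  # ['b.html', 'a.html', 'c.html']
```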
Although most search engines will provide tips on how to improve your page ranking, the exact
algorithms used are well guarded and change frequently to avoid misuse. But by following
search engine optimization (SEO) best practice you can ensure that:

 Search engines can easily crawl your website. You can also prompt them to crawl new
content.
 Your content is indexed for the right keywords so it can appear for relevant searches.

 Your content can rank highly on the SERP. 


5. As you know, stop-word lists are language dependent (the stop-word list of one language
is different from the stop-word list of another language). Prepare the stop-word list of your
mother tongue language. If your group is composed of members speaking different
languages, select one language and prepare stop-word lists on it.
What words are not stop words?
Generally speaking, most stop words are function (filler) words, which are words with little or no
meaning that help form a sentence. Content words like adjectives, nouns, and verbs are often not
considered as stop words. However, a programmer may choose to add very common words. For
example, Computer Hope may consider "computer" a stop word because it could be used to
describe any computer-related product (e.g., computer motherboard, computer video card,
etc.). According to this description, we can list stop words in the Amharic language.
Examples:

ስለዚህ, ነገር ግን, ይሁን እንጅ and other connectives are stop words.

ናቸው, ነው, ነች and other number-indicating words are stop words.

In the phrase ስለ ሀገራችን, the stop-word elements are ስለ and the suffix ኣችን.
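Filtering such a list out of a token stream is then straightforward; only the single-word entries are used here, since multi-word items like ነገር ግን would require phrase matching:

```python
# Single-word Amharic stop words from the list above.
AMHARIC_STOPWORDS = {"ስለዚህ", "ናቸው", "ነው", "ነች", "ስለ"}

def remove_stopwords(tokens):
    """Drop stop words from a token stream before indexing."""
    return [t for t in tokens if t not in AMHARIC_STOPWORDS]

print(remove_stopwords(["ስለ", "ሀገራችን", "ነው"]))  # ['ሀገራችን']
```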

6. There are several data structure/file structure to use for text indexing in IR, such
as sequential file indexing, inverted file indexing, suffix array indexing and
signature file indexing, etc. Discuss how they work in text indexing.

What is an indexed sequential file?

Records in indexed sequential files are stored in the order that they are written to the disk.
Records may be retrieved in sequential order or in random order using a numeric index to
represent the record number in the file.


Indexed Sequential Files

An Indexed Sequential File is organized so that it is easy to find information quickly without the
computer having to search through the whole file. The file includes a number of indexes and the
computer can look at these to find exactly where in the file to find the record - in much the same
way that you would use a library catalogue to find a book rather than looking along all the
shelves.

This is ideal for files which are used for reference - such as in an airline booking system or a
library catalogue: it would be very slow to have to hunt all the way through the files in order to
find a particular flight or a particular library book – so an ordinary sequential file would not be
suitable.

It is possible to insert new records into an Indexed Sequential File - so it is not necessary to copy
a whole file across to a new one just to add a single record (as would be the case with a
sequential file) as the system leaves space for later insertions; likewise, it is easy to alter a record
in place without writing the whole file out again, or to delete a record.

Again this is very suitable for systems such as travel or theatre booking where information has to
be changed frequently and it would be inefficient to copy out a whole file again just to alter
information about one or two seats.

When setting up an Indexed Sequential File in the first place, the programmer has to specify
which field will be used as the key field - e.g. a student file might use student name or number
as the key field - this is the item which will be looked up in the index in order to find a record.

Note that some operating systems automatically set up a properly indexed file with sufficient
space when your program runs: with others however you will have to set up a template for the
file (i.e. give information about it) using a utility program before running the program which will
use the file.

Inverted file indexing

An inverted index (also referred to as a postings file or inverted file) is a database index storing a
mapping from content, such as words or numbers, to its locations in a table, or in a document or


a set of documents (named in contrast to a forward index, which maps from documents to
content). The purpose of an inverted index is to allow fast full-text searches, at a cost of
increased processing when a document is added to the database. The inverted file may be the
database file itself, rather than its index. It is the most popular data structure used in document
retrieval systems.
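A minimal inverted index can be built with a dictionary from each term to the set of documents containing it; a conjunctive (AND) query is then just an intersection of postings sets:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: set of doc_ids containing the term}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "the boat sails", 2: "the car drives", 3: "boat and car"}
index = build_inverted_index(docs)
print(sorted(index["boat"] & index["car"]))  # [3] -- documents containing both terms
```

This also illustrates the cost mentioned above: adding a new document means updating the postings set of every term it contains.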
Suffix array indexing
What is suffix array in information retrieval?
A suffix array is a sorted array of all suffixes of a given string. Any suffix tree based algorithm
can be replaced with an algorithm that uses a suffix array enhanced with additional information
and solves the same problem in the same time complexity. Documents are normally stored as
lists of words.
The suffix array is one of the most prevalent data structures for string indexing; it stores the
lexicographically sorted list of suffixes of a given string. Its practical advantage compared to the
suffix tree is space efficiency. Research extensions such as the Property Suffix Array (PSA)
index only the "valid" parts of a string: given a set of valid intervals over a string x of length n,
a pattern p counts as occurring in x only if an occurrence lies entirely within a valid interval,
and a suffix-array-like index supporting this can be built in O(n) time and space. A motivating
application is weighted (probabilistic) sequences — sequences of probability distributions over an
alphabet — where, given a probability threshold 1/z, a pattern of length m matches at position i
only if the product of the probabilities of its letters at positions i, …, i+m−1 is at least 1/z; an
O(nz)-sized suffix-array-like index over such a sequence can be built in O(nz) time and space.
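A naive Python sketch makes the basic definition concrete (the O(n² log n) construction and the linear-scan lookup are for clarity only; practical indexes use linear-time construction and binary search over the array):

```python
def suffix_array(s):
    """Start positions of the suffixes of s, in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurrences(s, sa, p):
    """All positions where pattern p occurs (linear scan shown for clarity)."""
    return sorted(i for i in sa if s[i:i + len(p)] == p)

s = "banana"
sa = suffix_array(s)
print(sa)                        # [5, 3, 1, 0, 4, 2]: a, ana, anana, banana, na, nana
print(occurrences(s, sa, "an"))  # [1, 3]
```

Because every occurrence of a pattern is a prefix of some suffix, the matching suffixes form a contiguous run in the sorted array, which is what makes binary search possible.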
Signature file indexing
The signature file method is a popular indexing technique used in information retrieval and
databases. It excels in efficient index maintenance and has lower space overhead. However, it
suffers from inefficiency in query processing, because for each query processed the entire
signature file needs to be scanned.
A signature file allows fast search for text data. It is typically a very compact data structure that


aims at minimizing disk access at query time. Query processing is performed in two stages:
filtering, where false negatives are guaranteed not to occur but false positives may occur, and
query refinement, where false positives are removed.

Efficient and effective text indexing is a well-known and long-standing problem in information
retrieval. While inverted files are nowadays a de facto standard for text indexing, in the early
days their storage overhead was not acceptable for larger datasets. In addition, accessing an
inverted file on disk would require a relatively large number of (expensive) disk seeks. The main
motivation for signature files is to allow fast filtering of text using a linear scan of the signature
file for finding text segments that may contain the queried term(s). Given that the found
segments may be false-positives, a refinement step is required before the final correct answer is
returned. The main compromise in signature files lies in how to build signatures for terms and
for text segments that allow low storage overhead, fast disk access, and minimizes the ratio of
false-positives.
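A sketch of such superimposed coding, using a deterministic hash (the signature width and bits-per-term are arbitrary choices for the example): each document signature ORs together a few bits per term, and the filtering test can produce false positives but never false negatives.

```python
import hashlib

BITS = 64  # signature width in bits (an arbitrary choice for this example)
K = 3      # hash bits set per term

def _bit(term, j):
    """Deterministic hash of (term, j) onto a bit position."""
    digest = hashlib.md5(f"{term}:{j}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % BITS

def signature(terms):
    """Superimposed coding: OR together K hash bits for each term."""
    sig = 0
    for term in terms:
        for j in range(K):
            sig |= 1 << _bit(term, j)
    return sig

def maybe_contains(doc_sig, query_sig):
    """Filtering step: True for every real match; occasionally a false positive."""
    return doc_sig & query_sig == query_sig

doc = signature(["boat", "sails", "harbour"])
print(maybe_contains(doc, signature(["boat"])))  # True -- never a false negative
```

Documents whose signatures pass the filter must still be checked against the actual text in the refinement step, since distinct terms can hash onto overlapping bits.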

