Assignment 1
By
Dr Syed Khaldoon Khurshid
• When searching for a word in a single document, you
could get away with searching the simplest
possible way: reading through the whole
document from beginning to end, looking
for the right words.
• That’s not how any search engine you have
used works, and it wouldn’t be remotely
possible for Google to work this way!
• But we will develop our understanding of
indexes by using this simple approach.
• There's "one weird trick" behind any search
engine: building and using an index.
• A book’s index allows you to access a book in an
"inside out" manner. Instead of scanning the book to
find references to a topic, you consult the index: an
alphabetical listing of topics that tells you where in the
book each topic appears.
• Even huge internet search engines containing many
trillions of documents ultimately rely on storing all that
information in an index that is stored in a computer.
• The most fundamental operation on an index is
looking up a word. If the index is built just like a
book’s index, then a word can be found by looking in
the middle of the index, and then (depending on the
word) looking at the word in the middle of the first
half or the middle of the second half.
• By always looking in the middle, you can look up a word
in an index containing trillions of web pages by just
checking a few dozen entries! That’s the magic of binary
search!
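The "always look in the middle" idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the index is simply a sorted list of words; the function name `binary_search` and the sample words are hypothetical, not part of the lecture.

```python
def binary_search(sorted_words, target):
    """Return (position, checks).

    position is the index of target in sorted_words, or None if absent;
    checks counts how many entries were examined along the way.
    """
    low, high = 0, len(sorted_words) - 1
    checks = 0
    while low <= high:
        mid = (low + high) // 2          # look in the middle
        checks += 1
        if sorted_words[mid] == target:
            return mid, checks
        if sorted_words[mid] < target:
            low = mid + 1                # keep only the second half
        else:
            high = mid - 1               # keep only the first half
    return None, checks


# Hypothetical sample entries, already in alphabetical order.
index_words = ["against", "hope", "index", "search", "springs"]
print(binary_search(index_words, "hope"))   # (1, 3)
print(binary_search(index_words, "zebra"))  # (None, 3)
```

Each check halves the remaining range, so a sorted list of about a trillion entries needs roughly 40 checks, because 2^40 is a little over 10^12.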
• In a book, an index is a static document, created one time for
one book. There’s no space in a book’s index to add new
topics. That won’t work for a search engine: the web is
changing constantly!
• A search engine’s index has to support one other critical
operation: adding a new word. You can think of an index
that supports addition as a huge organized wall full of sticky
notes. It’s always possible to add a new sticky
note.
• In further lectures, you will assume
that you have an index that supports these
two critical operations:
• quickly looking up words and
• quickly inserting new words.
• Based on these two operations, you’ll learn
how to build an index for a search engine
that efficiently supports complex queries
over trillions of documents.
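As a rough sketch of an index that supports both operations, the class below keeps words in alphabetical order and uses Python's standard `bisect` module to find positions by binary search. The class name `SortedIndex` and its methods are hypothetical and chosen only for illustration; the lecture does not prescribe a particular data structure.

```python
import bisect


class SortedIndex:
    """Toy index: words in alphabetical order, each with its 'sticky note'."""

    def __init__(self):
        self.words = []   # sorted list of words
        self.notes = []   # notes[i] holds the document identifiers for words[i]

    def lookup(self, word):
        """Return the note (list of document ids) for word, or None if absent."""
        i = bisect.bisect_left(self.words, word)   # binary search
        if i < len(self.words) and self.words[i] == word:
            return self.notes[i]
        return None

    def insert(self, word):
        """Add a blank note for word if it is new; return the word's note."""
        i = bisect.bisect_left(self.words, word)
        if i == len(self.words) or self.words[i] != word:
            self.words.insert(i, word)   # keep alphabetical order
            self.notes.insert(i, [])
        return self.notes[i]


index = SortedIndex()
index.insert("hope").append("D1")
index.insert("against").append("D2")
print(index.words)              # ['against', 'hope']
print(index.lookup("hope"))     # ['D1']
print(index.lookup("springs"))  # None
```

Lookup here is a genuine binary search, but inserting into a Python list shifts every later entry, so insertion is not fast at scale. Structures such as balanced trees or hash tables are common ways to make both operations quick.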
Building An Index
• But to get big, you'll start very, very small.
In this module, you'll build an index that lets
you search these three "documents":
• This collection contains 3 documents, with a
total of ten words.
• Your goal is to create an easily searchable record of
where words appear. In addition to your documents, you
have a pad of sticky notes. Start with the first document:
• Ignoring punctuation and capitalization, the first
step is to write the first word, "Hope", on a sticky
note, along with the document identifier, D1:
• The next word in D2 is "against." You'll need a new sticky note for
that one:
• The last word in D2 is "hope" again. The sticky note for "hope"
already has D2 on it, so you have a choice to make!
• You could either write "D2" on the sticky note again or not.
If you choose not to write down D2 a second time on the
"hope" sticky note, and if you do the same thing for future
repeated words, what information will be lost?
• Continue on to the third document.
• The first word is "springs” In context, this has a
• Here is an algorithm for building a search engine's index. For each
document in our database and for each word in that document, run
the following procedure:
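The procedure itself is not reproduced in the text, so the following is only a sketch under an assumption: for each word, look it up in the index, insert a new entry (a new sticky note) if none exists, then write the document identifier on that entry. The function name `build_index`, the operation counters, and the two sample documents are hypothetical stand-ins, not the documents from the slide.

```python
import re
from collections import Counter


def build_index(documents):
    """Build a word -> list-of-document-ids index and count the operations.

    Assumed per-word procedure: look the word up, insert a new entry if
    it is missing, then write the document identifier on its entry.
    """
    index = {}
    ops = Counter()
    for doc_id, text in documents.items():
        # Ignore punctuation and capitalization.
        for word in re.findall(r"[a-z]+", text.lower()):
            ops["lookup"] += 1
            if word not in index:
                ops["insert"] += 1
                index[word] = []
            ops["write"] += 1
            index[word].append(doc_id)   # this sketch writes the id every time
    return index, ops


# Hypothetical stand-in documents, not the ones shown on the slide.
docs = {"D1": "Hope floats.", "D2": "Hope against hope!"}
index, ops = build_index(docs)
print(index)      # {'hope': ['D1', 'D2', 'D2'], 'floats': ['D1'], 'against': ['D2']}
print(dict(ops))  # {'lookup': 5, 'insert': 3, 'write': 5}
```

In this sketch there is one lookup and one write per word occurrence, and one insert per distinct word, which is why the counters above read 5, 3, and 5 for five words with three distinct entries.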
• Why count the number of lookups and inserts
performed by the index-building algorithm at all?
• One reason is that understanding the number of
operations an algorithm performs is a good way to
better understand the algorithm.
• Another reason is that the lookup, insert, and write
operations, while hopefully quite fast, take up most
of the time of creating an index. This is true whether
you're building the index with powerful computers
or with sticky notes on your wall.
• Counting the number of operations enables us to
ignore the specifics of how we're putting together
the search engine's index, while still understanding
important details about how much time it will take
to build the index.
Case Study I for IR: Design a Simple Document Search Engine
• Creating a simple document search engine for information
retrieval involves several steps. Here's a step-by-step approach:
• Step 1: Define Your Goals
Understand what you want your search engine to do. Do you
want to search for specific words in documents? Do you want to
find documents related to certain topics?