IR Unit 3
DEPARTMENT OF CSE
CS6007 – INFORMATION RETRIEVAL
Web search overview, web structure, the user, paid placement, search engine optimization/spam. Web size measurement – search engine optimization/spam – Web Search Architectures – crawling – meta-crawlers – Focused Crawling – web indexes – Near-duplicate detection – Index Compression – XML retrieval.
The World Wide Web allows people to share information globally. The amount
of information grows without bound. In order to extract information the user
needs a tool to search the Web. The tool is called a search engine.
1. Crawler-Based Search Engines
Crawler-based search engines use automated software programs to survey and
categorize web pages. The programs used by the search engines to access our
web pages are called 'spiders', 'crawlers', 'robots', or 'bots'.
Examples of crawler-based search engines are:
a) Google (www.google.com) b) Ask Jeeves (www.ask.com)
2. Directories
A 'directory' uses human editors who decide which category a site belongs to; they place websites within specific categories in the directory's database. The human editors comprehensively check the website and rank it, based on the information they find, using a pre-defined set of rules.
There are two major directories a) Yahoo Directory (www.yahoo.com) b) Open
Directory (www.dmoz.org)
3. Hybrid Search Engines
Hybrid search engines use a combination of both crawler-based results and
directory results. More and more search engines these days are moving to a
hybrid-based model. Examples of hybrid search engines are:
a) Yahoo (www.yahoo.com) b) Google (www.google.com)
4. Meta Search Engines
Meta search engines take the results from other search engines and combine them into one large listing. Examples of meta search engines include: a) Metacrawler (www.metacrawler.com) b) Dogpile (www.dogpile.com)
Types of Search Engines
Fig.3.2.1 Bow-tie structure of Web
A web crawl is a task performed by special-purpose software that surfs the Web, starting from a multitude of web pages and then continuously following the hyperlinks it encounters until the end of the crawl. One of the intriguing findings of this crawl was that the Web has a bow-tie structure, as shown in the figure above.
The central core of the Web (the knot of the bow-tie) is the strongly connected
component (SCC), which means that for any two pages in the SCC, a user can
navigate from one of them to the other and back by clicking on links embedded
in the pages encountered. The relative size of the SCC turned out to be 27.5%
of the crawled portion of the Web. The left bow, called IN, contains pages that
have a directed path of links leading to the SCC and its relative size was 21.5%
of the crawled pages. Pages in the left bow might be either new pages that have
not yet been linked to, or older web pages that have not become popular
enough to become part of the SCC. The right bow, called OUT, contains pages
that can be reached from the SCC by following a directed path of links and its
relative size was also 21.5% of the crawled pages. Pages in the right bow might
be pages in e-commerce sites that have a policy not to link to other sites. The
other components are the "tendrils" and the "tubes" that together comprised 21.5% of the crawled portion of the Web, and further "disconnected"
components whose total size was about 8% of the crawl. A web page in Tubes
has a directed path from IN to OUT bypassing the SCC, and a page in Tendrils
can either be reached from IN or leads into OUT. The pages in disconnected are
not even weakly connected to the SCC; that is, even if we ignored the fact that
hyperlinks only allow forward navigation, allowing them to be traversed
backwards as well as forwards, we still could not reach the SCC from
them.
3.2.2 Small-World Structure of the Web
The study of the Web also revealed some other interesting properties
regarding the structure and navigability of the Web. It turns out that over 75%
of the time there is no directed path of links from one random web page to
another. When such a path exists, its average distance is 16 clicks and when
an undirected path exists (i.e., one allowing backward traversal of links) its
average distance is only seven clicks. Thus, the Web is a small-world network, popularly known through the notion of "six degrees of separation," where any two random people in the world can discover that there is only a short chain of at most six acquaintances between them. Since in a small-world network the average distance between any two nodes is logarithmic in the number of nodes in the network, a 10-fold increase in the size of the Web would only lead to the average distance increasing by a few clicks.
The diameter of a graph is the maximum shortest distance between any
two nodes in the graph. It was found that the diameter of the SCC of the web
graph is at least 28 but when considering the Web as a whole, the diameter is
at least 500; this longest path is from the most distant node in IN to the most
distant node in OUT.
Users face some common problems when using the interface of a search engine. The following guidelines can help:
- Specify the words clearly, stating which words should be in the page and which words should not be in the page.
- Provide as many particular terms as possible: page title, date, and country.
- If looking for a company, institution, or organization, try to guess the URL by using the www prefix followed by the name, and then .com, .edu, .org, .gov, or a country code.
- Some search engines specialize in some areas. For example, users can use ResearchIndex (www.researchindex.com) to search research papers.
- If using broad queries, try to use Web directories as starting points.
- Note that anyone can publish data on the Web, so information obtained from search engines might not be accurate.
Fig 3.4.1 Example of Paid Placement
In this scheme the search engine separates its query results list into two parts: (i) an organic list, which contains the free unbiased results, displayed according to the search engine's ranking algorithm, and (ii) a sponsored list, which is paid for by advertising. This method of payment is called pay per click (PPC), also known as cost per click (CPC), since payment is made by the advertiser each time a user clicks on the link in the sponsored listing.
Organic Search vs Paid Search
Whenever you type a question into Google, or any other search engine, the list
of links that appear below the ads are known as "organic results." These
appear purely based on the quality and content of the page. Traffic that comes
from people finding your links among these results is classified as "organic
search" traffic or just organic traffic.
Paid search results are those that companies have paid for in order to appear at the top of search results.
Search engine optimization is broken down into two basic areas: on-page and off-page optimization. On-page optimization refers to website elements which comprise a web page, such as HTML code, textual content, and images. Off-page optimization refers, predominantly, to backlinks (links pointing to the site which is being optimized, from other relevant websites).
On-Page Factors
1. Content, content, content (body text) <body>
2. Keyword frequency and density
3. Title tags <title>
4. ALT image tags
5. Header tags <h1>
6. Hyperlink text
Off-Page Factors
1. Anchor text
2. Link popularity ("votes" for your site) – adds credibility
1. Title tag, Meta description tag and Keywords
2. Linking Strategies
- The text in the links should include keywords.
- The more inbound links, the higher the search engine ranking.
- If the site linking to us is already indexed, spiders will also reach our site.
3. Keywords
- Keywords are the most important factor in optimizing rankings.
- Keywords are the words that appear most often in a page.
- The spider chooses the appropriate keywords for each page, then sends them back to its search engine.
- Our web site will then be indexed based on our keywords.
- Keywords can be key phrases or a single keyword.
- Do not use common words, e.g. 'the', 'and', 'of': spiders ignore them.
- Write keyword-rich text.
- Balance keyword richness with readability.
4. Title tags
The title tag on pages of your website tells search engines what the page is
about. It should be 70 characters or less and include your business or brand
name and keywords that relate to that specific page only. This tag is placed
between the <HEAD> </HEAD> tags near the top of the HTML code for the
page.
6. Alt tags
- Include keywords in your alt tags.
<IMG src="star.gif" alt="star logo">
White Hat SEO ensures that web page content is created for the users and not just for the search engines. It ensures good quality web pages and the availability of useful content on them. Always follow White Hat SEO tactics and do not try to fool your site visitors; be honest and you will definitely get something more. Always stay away from Black Hat tactics when trying to improve the rank of our site: search engines are smart enough to identify them, and ultimately we are not going to get anything.
There are two major types of search engine optimization, white hat search
engine optimization (the 'good' kind), and black hat (the 'not so good' kind).
SEO Services
There are a number of SEO services which can help contribute to the
improvement of the organic search engine rankings of a website. These services
include, but are not limited to, on-page (or on-site) optimization, link building,
search engine friendly website design and development, and search engine
friendly content writing services.
Measuring the World Wide Web is a very difficult task due to its dynamic
nature. In 1999, there were over 40 million computers in more than 200
countries connected to the Internet. Of these computers, over 3 million were Web servers (NetSizer, 1998). There are two caveats to this count of Web servers. The first is that many Web sites share the same Web server using virtual hosts, and not all of them are fully accessible to the outside world. The second is that not all Web sites start with the prefix www; counting only Web sites with this prefix, there were just 780,000 in 1998.
As for the total number of Web pages, there were estimated to be 350 million in 1998. Between 1997 and 1998, the number of Web pages doubled in nine months and was growing at a rate of 20 million pages per month (Baeza-Yates, 1999). The most popular formats of Web documents are HTML, followed by GIF and JPG (image formats), ASCII files, Postscript, and ASP. The most popular compression tools are GNU zip, Zip, and Compress. Most HTML pages are not standard, because they do not comply with HTML specifications; HTML documents seldom start with a document type definition. They are typically small, with an average of 5 Kb and a median of 2 Kb. On average, each HTML page contains one or two images
and five to fifteen hyperlinks. Most of these hyperlinks are local, meaning
the associated Web pages are mostly stored in the same Web server (Baeza-
Yates, 1999). The top ten most referenced Web sites such as Microsoft,
Netscape, and Yahoo are referenced in over 100,000 places. In addition, the
site containing the most external links is Yahoo!. Yahoo! glues all the
isolated Web sites together to form a large Web database. If we assume that
the average size of an HTML page is 5 Kb and there are 300 million Web
pages, then we have at least 1.5 terabytes of text. This huge size is
consistent with other studies done by other organizations (Baeza-Yates,
1999).
Search engines make billions of dollars each year selling ads. Most search
engine traffic goes to the free, organically listed sites. The ratio of traffic
distribution is going to be keyword dependent and search engine dependent,
but we believe about 85% of Google's traffic clicks on the organic listings. Most
other search engines display ads a bit more aggressively than Google does. In
many of those search engines, organic listings get around 70% of the traffic.
Some sites rank well on merit, while others are there due exclusively to ranking
manipulation. In many situations, a proper SEO campaign can provide a much
greater ROI (Return on Investment) than paid ads do. This means that while
search engine optimizers—known in the industry as SEOs—and search engines
have business models that may overlap, they may also compete with one
another for ad dollars. Sometimes SEOs and search engines are friends with
each other, and, unfortunately, sometimes they are enemies. When search
engines return relevant results, they get to deliver more ads. When their results
are not relevant, they lose market share. Beyond relevancy, some search
engines also try to bias the search results to informational sites such that
commercial sites are forced into buying ads. There is a huge sum of money in
manipulating search results. There are ways to improve search engine
placement that go with the goals of the search engines, and there are also ways
that go against them. Quality SEOs aim to be relevant, whether or not they
follow search guidelines. Many effective SEO techniques may be considered
somewhat spammy. Like anything in life, we should make an informed decision about which SEO techniques we want to use and which ones we do not. We may choose to use highly aggressive, "crash and burn" techniques, or slower, more predictable, less risky techniques. Most industries will not require extremely aggressive promotional techniques.
First, search engines crawl the Web to see what is there. This task is
performed by a piece of software, called a crawler or a spider (or Googlebot, as
is the case with Google). Spiders follow links from one page to another and
index everything they find on their way. Given the number of pages on the Web (over 20 billion), it is impossible for a spider to visit a site daily just to see if a new page has appeared or if an existing page has been modified; sometimes crawlers may not end up visiting our site for a month or two.
After a page is crawled, the next step is to index its content. The indexed page
is stored in a giant database, from where it can later be retrieved. Essentially,
the process of indexing is identifying the words and expressions that best
describe the page and assigning the page to particular keywords. For a human
it will not be possible to process such amounts of information but generally
search engines deal just fine with this task. Sometimes they might not get the
meaning of a page right but if we help them by optimizing it, it will be easier for
them to classify our pages correctly and for us – to get higher rankings.
When a search request comes, the search engine processes it – i.e. it compares
the search string in the search request with the indexed pages in the database.
Since it is likely that more than one page (practically it is millions of pages)
contains the search string, the search engine starts calculating the relevancy
of each of the pages in its index with the search string.
1.Keyword research
It allows us to see which keywords users actually employ to find products and
services within our chosen market, instead of making guesses at the keywords
we believe are the most popular.
2. Content development
3. Web development
4. Link Building
Building links will make up about 60% of our work. There are ways to
automate this process using shortcuts, workarounds, and submission services.
Internal linking is also very important. Treat links to our own content the same way we would treat links from an external site.
5. Webmaster Tools
Webmaster dashboards offer a number of tools which allow us to understand how the search engine sees our site. These are the only way to identify crawling, indexing, and ranking issues with our site.
The main components of a search engine are the crawler, indexer, search
index, query engine, and search interface.
A web crawler is a software program that traverses web pages, downloads them for indexing, and follows the hyperlinks that are referenced on the downloaded pages. A web crawler is also known as a spider, a wanderer, or a software robot.
The second component is the indexer which is responsible for creating the
search index from the web pages it receives from the crawler.
The Search Index
The search index is a data repository containing all the information the search
engine needs to match and retrieve web pages. The type of data structure used
to organize the index is known as an inverted file. It is very much like an index
at the back of a book. It contains all the words appearing in the web pages
crawled, listed in alphabetical order (this is called the index file), and for each
word it has a list of references to the web pages in which the word appears (this
is called the posting list ).
Example:
Consider the entry for "chess" in the search index. Attached to the entry is the posting list of all web pages that contain the word "chess"; for example, the entry for "chess" could be
chess → [www.chess.co.uk, www.uschess.org, www.chessclub.com, . . .]
Often, more information is stored for each entry in
the index such as the number of documents in the posting list for the entry,
that is, the number of web pages that contain the keyword, and for each
individual entry in the posting file we may also store the number of
occurrences of the keyword in the web page and the position of each
occurrence within the page. This type of information is useful for determining
content relevance.
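As an illustration of the inverted file just described, the following minimal Python sketch builds a toy index over a handful of pages; the page texts and the www.example.com URL are made-up stand-ins for crawled content, and a real engine would of course keep the index on disk in compressed form.

from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text (a toy stand-in for crawled pages)."""
    index = defaultdict(dict)                    # term -> {url: [positions of the term]}
    for url, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(pos)
    return index

pages = {
    "www.chess.co.uk": "chess openings and chess endgames",
    "www.uschess.org": "us chess federation",
    "www.example.com": "cooking recipes",
}
index = build_index(pages)
postings = index["chess"]
print("document frequency:", len(postings))      # number of pages containing 'chess'
print("posting list:", postings)                  # URL -> positions within that page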
The search index will also store information pertaining to hyperlinks in a
separate link database, which allows the search engine to perform hyperlink
analysis, which is used as part of the ranking process of web pages. The link
database can also be organized as an inverted file in such a way that its index
file is populated by URLs and the posting list for each URL entry, called the
source URL, contains all the destination URLs forming links between these
source and destination URLs.
The link database for the Web can be used to reconstruct the structure of the Web. To have good coverage, its index file will have to contain billions of entries, and when we include the posting lists in the calculation of the size of the link database, the total number of entries is an order of magnitude higher. Compression of the link database is thus an important issue for search engines, which need to perform efficient hyperlink analysis.
Randall and colleagues have developed compression techniques for the link database which take advantage of the structure of the Web. Their techniques are based on the observations that most web pages tend to link to other pages on the same website, and that many web pages on the same website tend to link to a common set of pages. Combining these observations with well-known compression methods, they have managed to reduce the space requirements to six bits per hyperlink.
The text which is attached to a hyperlink, called link (or anchor) text, that is
clicked on by users following the link, is considered to be part of the web page
it references. So when a word such as "chess" appears in some link text, then
the posting list for that word will contain an entry for the destination URL of
the link.
The Query Engine
The inner workings of a commercial query engine are a well-guarded secret, since search engines are rightly paranoid, fearing web sites that wish to increase their ranking by unscrupulously taking advantage of the algorithms the search engine uses to rank result pages. Search engines view such manipulation as spam, since it
has direct effects on the quality of the results presented to the user.
Spam
Spam is normally associated with unsolicited e-mail, also known as junk e-mail, although the word spam originally derives from spiced ham and refers to a canned meat product. It is not straightforward to distinguish between search
engine spam and organic search engine optimization, where a good and healthy
design of web pages leads them to be visible on the top results of search
engines for queries related to the pages.
The query engine provides the interface between the search index, the user,
and the Web. The query engine processes a user query in two steps. In the first
step, the query engine retrieves from the search index information about
potentially relevant web pages that match the keywords in the user query, and
in the second step a ranking of the results is produced, from the most relevant
downwards.
The ranking algorithm combines content relevance of web pages and other
relevance measures of web pages based on link analysis and popularity.
Deciding how to rank web pages revolves around our understanding of the
concept of what is ―relevant‖ for a user, given a query. The problem with
relevance is that what is relevant for one user may not be relevant to another.
In a nutshell, relevance is, to a large degree, personal and depends on the
context and task the user has in mind. Search engines take a very pragmatic
view of relevance and continuously tweak and improve their ranking algorithms
by examining how surfers search the Web; for example, by studying recent
query logs.
Once the query is processed, the query engine sends the results list to the
search interface, which displays the results on the user's screen. The user
interface provides the look and feel of the search engine, allowing the user to
submit queries, browse the results list, and click on chosen web pages for
further browsing.
3.9 Web Crawling
Basic Crawler Architecture
Distributed crawling
- The crawling operation can be performed by several dedicated threads.
- Parallel crawling can be distributed over the nodes of a distributed system (geographical distribution, link-based distribution, etc.).
- Distribution involves a host-splitter, which dispatches URLs to the corresponding crawling node.
URL Frontier
- Back queues are used for politeness: each back queue only contains URLs for a given host (there is a mapping between hosts and back-queue identifiers).
- When a back queue is empty, it is refilled with URLs from the front priority queues.
- A heap contains the earliest time te at which each host may be contacted again.
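A minimal Python sketch of this politeness mechanism is given below. It is an assumption-laden toy (one back queue per host plus a heap of next-allowed contact times), not the full frontier with front priority queues, and the two-second delay and the URLs are invented for illustration.

import heapq, time
from collections import defaultdict, deque

POLITENESS_DELAY = 2.0                      # assumed minimum gap between hits to one host

back_queues = defaultdict(deque)            # host -> back queue of URLs for that host
heap = []                                   # entries (te, host): earliest contact time te

def add_url(url):
    host = url.split("/")[2]                # crude host extraction, fine for the sketch
    if not back_queues[host]:               # host not scheduled yet: put it on the heap
        heapq.heappush(heap, (time.time(), host))
    back_queues[host].append(url)

def next_url():
    te, host = heapq.heappop(heap)          # host we are allowed to contact earliest
    delay = te - time.time()
    if delay > 0:
        time.sleep(delay)                   # politeness: wait until te before fetching
    url = back_queues[host].popleft()
    if back_queues[host]:                   # host still has URLs: reschedule it
        heapq.heappush(heap, (time.time() + POLITENESS_DELAY, host))
    return url

for u in ["http://a.example/1", "http://a.example/2", "http://b.example/x"]:
    add_url(u)
print(next_url(), next_url(), next_url())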
3.9.2 Meta-crawlers
E.g., http://www.kartoo.com: this meta-search site shows the results with sites being interconnected by keywords.
Focused Crawling
The URL queue contains a list of unvisited URLs maintained by the crawler and is initialized with seed URLs. The web page downloader fetches URLs from the URL queue and downloads the corresponding pages from the Internet. The parser and extractor extracts information such as the terms and the hyperlink URLs from a downloaded page. The relevance calculator calculates the relevance of a page with respect to the topic and assigns scores to the URLs extracted from the page. The topic filter analyzes whether the content of parsed pages is related to the topic or not. If a page is relevant, the URLs extracted from it are added to the URL queue; otherwise they are added to the irrelevant table.
A focused crawling algorithm loads a page and extracts the links. By
rating the links based on keywords the crawler decides which page to retrieve
next. The Web is traversed link by link and the existing work is extended in the
area of focused document crawling. There are various categories in focused
crawlers:
(a) Classic focused crawler
(b) Semantic crawler
(c) Learning crawler
(a) Classic focused crawlers
Classic focused crawlers guide the search towards pages of interest by taking as input a user query that describes the topic. They assign priorities to the links based on the topic of the query, and the pages with high priority are downloaded first. These priorities are computed on the basis of the similarity between the topic and the page containing the links. Text similarity is computed using an information retrieval similarity model such as the Boolean or the Vector Space Model. (A minimal sketch of this crawling loop is given after the list of crawler categories below.)
(b) Semantic crawlers
Semantic crawlers are a variation of classic focused crawlers. Download priorities are assigned to pages by applying semantic similarity criteria when computing topic-to-page relevance: the sharing of conceptually similar terms defines the relevance of a page to the topic. An ontology is used to define the conceptual similarity between terms.
(c) Learning crawlers
Learning crawlers use a training process to guide crawling and to assign visit priorities to web pages. The learning crawler is supplied with a training set consisting of relevant and non-relevant Web pages. Links extracted from pages classified as relevant to the topic are assigned higher visit priorities. Methods based on context graphs and Hidden Markov Models take into account not only the page content but also the link structure of the Web and the probability that a given page will lead to a relevant page.
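The following Python sketch illustrates the classic focused-crawling loop under simplifying assumptions: fetch_page and extract_links are hypothetical stand-ins for a real downloader and parser, and relevance is a crude term-overlap score rather than a proper Boolean or Vector Space similarity.

import heapq

TOPIC = {"chess", "openings", "endgame"}             # assumed topic description

def relevance(text, topic=TOPIC):
    words = set(text.lower().split())
    return len(words & topic) / len(topic)           # crude term-overlap similarity

def focused_crawl(seed_urls, fetch_page, extract_links, max_pages=100, threshold=0.2):
    # frontier is a priority queue of (negated score, URL); heapq is a min-heap,
    # so negating the score makes the most promising URL come out first
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen, relevant = set(seed_urls), []
    while frontier and len(relevant) < max_pages:
        _, url = heapq.heappop(frontier)
        text = fetch_page(url)
        score = relevance(text)
        if score < threshold:                        # topic filter: drop irrelevant page
            continue
        relevant.append((url, score))
        for link in extract_links(text):             # score child URLs by the parent page
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant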
Web indexing means creating indexes for individual Web sites, intranets,
collections of HTML documents, or even collections of Web sites. Indexes are
systematically arranged items, such as topics or names, that serve as entry
points to go directly to desired information within a larger document or set of
documents. Indexes are traditionally alphabetically arranged. But they may
also make use of hierarchical arrangements, as provided by thesauri, or they
may be entirely hierarchical, as in the case of taxonomies. An index might not
even be displayed, if it is incorporated into a searchable database.
A Web index is often a browsable list of entries from which the user makes
selections, but it may be non-displayed and searched by the user typing into a
search box. A site A-Z index is a kind of Web index that resembles an
alphabetical back-of-the-book style index, where the index entries are
hyperlinked directly to the appropriate Web page or page section, rather than
using page numbers. Web indexes work particularly well in sites that have a
flat structure with only one or two levels of hierarchy. Indexes complement search engines on larger web sites; for smaller sites, they provide a cost-effective alternative. Whether to use a back-of-the-book style index or a hierarchy of categories will depend on the size of the site and how rapidly the content is changing. Site indexes are best done by individuals
skilled in indexing who also have basic skills in HTML or in using HTML
indexing tools.
The Web Index assesses the Web's contribution to social, economic and political progress in countries around the world.
Parliament of Australia
A large back-of-the-book style index.
UNIXhelp for Users
Both a back-of-the-book style index and a searchable index.
Daily Herald story index - 1901 to 1964
An alphabetical list of subject headings for newspaper articles.
Writer's Block
A "living" index that is updated quarterly.
The World Bank Group
Subject list showing major topics and sub-topics. Each page for a major
topic is structured differently.
US Census Bureau
A large multi-level subject list. Also provides a list of subjects in
alphabetical order.
3.12 Near duplicate detection
The Web contains multiple copies of the same content. By some estimates, as
many as 40% of the pages on the Web are duplicates of other pages. Search
engines try to avoid indexing multiple copies of the same content, to keep down
storage and processing overheads.
A standard technique for attacking this problem is shingling. Given a positive integer k, the k-shingles of a document d are all the sequences of k consecutive terms in d. Near-duplicate detection then reduces to comparing the sets of shingles for all web pages. Let S(d) denote the set of shingles of document d. Intuitively, two documents d1 and d2 are near duplicates if their shingle sets are nearly the same, which we measure with the Jaccard coefficient
J(S(d1), S(d2)) = |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|.
If this value exceeds a preset threshold (say 0.9), we declare the two documents to be near duplicates and eliminate one from indexing. However, this does not appear to have simplified matters: we still have to compute Jaccard coefficients pairwise.
To avoid this, we use a form of hashing. First, we map every shingle into a hash value over a large space, say 64 bits. For j = 1, 2, let H(dj) be the corresponding set of 64-bit hash values derived from S(dj). We now invoke the following trick to detect document pairs whose sets have large Jaccard overlaps. Let π be a random permutation from the 64-bit integers to the 64-bit integers, let Π(dj) be the set of permuted hash values in H(dj), and let xj be the smallest value in Π(dj).
Theorem.
J(S(d1), S(d2)) = P(x1 = x2)    (247)
Proof. We give the proof in a slightly more general setting: consider a family of sets whose elements are drawn from a common universe. View the sets as columns of a matrix A, with one row for each element in the universe; A has a 1 in row i, column j if and only if element i belongs to the jth set Sj. Let Π be a random permutation of the rows of A; denote by Π(Sj) the column that results from applying Π to the jth column. Finally, let xj be the index of the first row in which the column Π(Sj) has a 1. We then prove that, for any two columns j1 and j2,
P(xj1 = xj2) = J(Sj1, Sj2)    (248)
If we can prove this, the theorem follows. Consider the ordered pairs of entries of columns j1 and j2; they partition the rows into four types: rows with 0's in both columns, rows with a 0 in the first column and a 1 in the second, rows with a 1 in the first column and a 0 in the second, and finally rows with 1's in both columns. Denote by C00 the number of rows with 0's in both columns, and by C01, C10 and C11 the numbers of rows of the second, third and fourth types. Then
J(Sj1, Sj2) = C11 / (C01 + C10 + C11)    (249)
To complete the proof, scan the permuted rows in increasing index order until the first row with a 1 in either column is found. Because Π is a random permutation, the probability that this smallest row has a 1 in both columns is exactly the right-hand side of Equation 249.
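A small Python sketch of shingling and the min-hash estimate follows. Instead of true random permutations it uses salted hash functions as stand-ins, and the two example sentences are invented, so the printed numbers are only illustrative.

import random

def shingles(text, k=4):
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash_estimate(a, b, num_hashes=200, seed=0):
    rng = random.Random(seed)
    matches = 0
    for _ in range(num_hashes):
        salt = rng.getrandbits(64)
        # a salted hash stands in for applying a random permutation to the shingles
        if min(hash((salt, x)) for x in a) == min(hash((salt, x)) for x in b):
            matches += 1
    return matches / num_hashes           # fraction of agreeing minima ~ Jaccard

d1 = "the quick brown fox jumped over the lazy dog"
d2 = "the quick brown fox leaped over the lazy dog"
s1, s2 = shingles(d1), shingles(d2)
print("exact Jaccard:    ", round(jaccard(s1, s2), 3))
print("min-hash estimate:", round(minhash_estimate(s1, s2), 3))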
3.13 Index Compression
The dictionary and the inverted index are the central data structures in information retrieval (IR). Besides the obvious saving in disk space, compressing them brings two more subtle benefits.
The first is increased use of caching. Search systems use some parts of
the dictionary and the index much more than others. For example, if we cache
the postings list of a frequently used query term t, then the computations
necessary for responding to the one-term query t can be entirely done in
memory. With compression, we can fit a lot more information into main
memory. Instead of having to expend a disk seek when processing a query with
t, we instead access its postings list in memory and decompress it. As we will
see below, there are simple and efficient decompression methods, so that the
penalty of having to decompress the postings list is small. As a result, we are
able to decrease the response time of the IR system substantially. Because
memory is a more expensive resource than disk space, increased speed owing
to caching – rather than decreased space requirements – is often the prime
motivator for compression.
The second more subtle advantage of compression is faster transfer of
data from disk to memory. Efficient decompression algorithms run so fast on
modern hardware that the total time of transferring a compressed chunk of
data from disk and then decompressing it is usually less than transferring the
same chunk of data in uncompressed form. For instance, we can reduce
input/output (I/O) time by loading a much smaller compressed postings list,
even when you add on the cost of decompression. So, in most cases, the
retrieval system runs faster on compressed postings lists than on
uncompressed postings lists.
Figure Storing the dictionary as an array of fixed-width entries.
Total space:
M×(2×20 + 4 + 4) = 400,000×48 = 19.2 MB
NB: why 40 bytes per term? (Unicode: 2 bytes per character × a maximum term length of 20 characters.)
The simplest data structure for the dictionary is to sort the vocabulary
lexicographically and store it in an array of fixed-width entries as shown in
Figure above. Assuming a Unicode representation with 2 bytes per character, we allocate 2 × 20 = 40 bytes for the term itself (because few terms have more than twenty characters in English), 4 bytes for its document frequency, and 4 bytes for the pointer to its postings list. Four-byte pointers resolve a 4 gigabyte (GB)
address space. For large collections like the web, we need to allocate more
bytes per pointer. We look up terms in the array by binary search.
Dictionary-as-a-string storage
Space use of Dictionary as-a-string
Pointers mark the end of the preceding term and the beginning of the next. For
example, the first three terms in this example are systile, syzygetic, and
syzygial. Using fixed-width entries for terms is clearly wasteful. The average length of a term in English is about eight characters, so on average we are wasting twelve characters (or 24 bytes in Unicode) in the fixed-width scheme. Also, there is no way of storing terms with more than twenty characters like hydrochlorofluorocarbons and supercalifragilisticexpialidocious. We can overcome these shortcomings by storing the dictionary terms as one long string of characters, as shown in the figure above. The pointer to the next term is also used to demarcate the end of the current term. As before, we locate terms in the data structure by way of binary search in the (now smaller) table of term pointers.
This scheme saves us 60% compared to fixed-width storage: 24 bytes on average of the 40 bytes we allocated for each term before. However, we now also need to store term pointers. The term pointers resolve 400,000 × 8 = 3.2 × 10^6 character positions, so they need to be log2(3.2 × 10^6) ≈ 22 bits, or 3 bytes, long. In this new scheme, we need 400,000 × (4 + 4 + 3 + 2 × 8) = 10.8 MB for the Reuters-RCV1 dictionary: 4 bytes each for frequency and postings pointer, 3 bytes for the term pointer, and 16 bytes (8 Unicode characters) on average for the term. So we have reduced the space requirements from 19.2 MB to 10.8 MB.
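The space arithmetic above can be checked with a few lines of Python; all figures (400,000 terms, a 20-character fixed width, an 8-character average term, 2 bytes per Unicode character) are the lecture's own assumptions.

M = 400_000                                   # number of dictionary terms
fixed_width = M * (2 * 20 + 4 + 4)            # Unicode term + doc. frequency + postings pointer
as_string   = M * (4 + 4 + 3 + 2 * 8)         # freq + postings ptr + term ptr + average term
print(fixed_width / 1e6, "MB")                # 19.2 MB
print(as_string / 1e6, "MB")                  # 10.8 MB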
Blocked storage with four terms per block.
The first block consists of systile, syzygetic, syzygial, and syzygy with lengths of
seven, nine, eight, and six characters, respectively. Each term is preceded by a
byte encoding its length that indicates how many bytes to skip to reach
subsequent terms.
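A toy Python sketch of blocked storage with k = 4 terms per block is shown below; it only packs and unpacks the term string with one length byte per term, and it ignores the document-frequency and postings-pointer fields for brevity.

def pack_block(terms):
    out = bytearray()
    for t in terms:
        b = t.encode("utf-8")
        out.append(len(b))                   # one length byte tells how far to skip
        out += b
    return bytes(out)

def unpack_block(block):
    terms, i = [], 0
    while i < len(block):
        n = block[i]
        terms.append(block[i + 1:i + 1 + n].decode("utf-8"))
        i += 1 + n
    return terms

block = pack_block(["systile", "syzygetic", "syzygial", "syzygy"])
print(len(block), "bytes")                   # 30 term characters + 4 length bytes = 34
print(unpack_block(block))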
Dictionary compression for Reuters
Representation size in MB
dictionary, fixed-width 19.2
dictionary as a string 10.8
~, with blocking, k= 4 10.3
~, with blocking & front coding 7.9
1. First pass over the collection to determine the number of unique terms (the vocabulary) and the number of documents to be indexed.
2. Allocate the matrix and make a second pass over the collection to fill the matrix.
3. Traverse the matrix row by row and write the posting lists to file.
Postings compression
Postings compression
Gap encoding
We encode the gaps between successive docIDs instead of the docIDs themselves. For example, for the term computer we store the gaps 107, 5, 43, ..., instead of the docIDs 283154, 283159, 283202, .... The first docID of a postings list is left unchanged (in the accompanying figure this is only shown for arachnocentric).
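A small Python sketch of gap encoding and decoding, using the docIDs quoted above:

def to_gaps(doc_ids):
    # keep the first docID, then store only differences between consecutive docIDs
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    ids = [gaps[0]]
    for g in gaps[1:]:
        ids.append(ids[-1] + g)              # add gaps back up to recover the docIDs
    return ids

ids = [283154, 283159, 283202]
print(to_gaps(ids))                          # [283154, 5, 43]
print(from_gaps(to_gaps(ids)) == ids)        # True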
Three posting entries
The postings file is much larger than the dictionary, by a factor of at least 10. The key desideratum is to store each posting compactly. A posting for our purposes is a docID. For Reuters-RCV1 (800,000 documents) we would use 32 bits per docID when using 4-byte integers. Alternatively, we can use log2(800,000) ≈ 20 bits per docID. Our goal is to use far fewer than 20 bits per docID.
Unary code
The unary code of a number n is a run of n 1s followed by a 0; for example, the unary code of 3 is 1110.
Gamma codes
- We can compress better with bit-level codes; the gamma code is the best known of these.
- Represent a gap G as a pair (length, offset).
- offset is G in binary, with the leading bit cut off. For example, 13 → 1101 → 101.
- length is the length of offset. For 13 (offset 101), this is 3; we encode length with the unary code: 1110.
- The gamma code of 13 is the concatenation of length and offset: 1110101.
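A Python sketch following this recipe (the decoder assumes a single well-formed gamma codeword, with no error handling):

def gamma_encode(gap):
    assert gap >= 1
    binary = bin(gap)[2:]                  # e.g. 13 -> '1101'
    offset = binary[1:]                    # drop the leading bit      -> '101'
    length = "1" * len(offset) + "0"       # unary code of len(offset) -> '1110'
    return length + offset                 # '1110101'

def gamma_decode(code):
    n = code.index("0")                    # number of leading 1s = length of offset
    offset = code[n + 1:n + 1 + n]
    return int("1" + offset, 2)            # put the leading bit back

print(gamma_encode(13))                    # 1110101
print(gamma_decode("1110101"))             # 13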
What is XML?
XML is a meta-language (a language used to describe other languages). It is able to represent a mix of structured and unstructured (text) information, and was defined by a World Wide Web Consortium working group whose technical lead was James Clark. It is the de facto standard markup language. Example:
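The example figure itself did not survive extraction; a minimal XML document consistent with the description that follows (and with the DOM tree of Figure 2) would look roughly like this, with the verse text elided:

<play>
  <author>Shakespeare</author>
  <title>Macbeth</title>
  <act number="I">
    <scene number="vii">
      <title>Macbeth's castle</title>
      <verse>...</verse>
    </scene>
  </act>
</play>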
An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and a closing tag. An element can have one or more XML attributes. In the XML document above, the scene element is enclosed by the two tags <scene ...> and </scene>. It has an attribute number with value vii and two child elements, title and verse.
Fig 2:The XML document in Figure 1 as a simplified DOM
object.
Figure 2 shows the XML document of Figure 1 as a simplified tree. The leaf nodes of the tree consist of text, e.g., Shakespeare, Macbeth, and Macbeth's castle. The tree's internal nodes encode either the structure of the document (title, act, and scene) or metadata functions (author).
The standard for accessing and processing XML documents is the XML
Document Object Model or DOM . The DOM represents elements, attributes
and text within elements as nodes in a tree. Figure 2 is a simplified DOM
representation of the XML document in Figure 1 . With a DOM API, we can
process an XML document by starting at the root element and then descending
down the tree from parents to children.
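A short Python sketch of DOM-style processing using the standard library's xml.dom.minidom (the XML string is a cut-down version of the example above):

from xml.dom.minidom import parseString

xml = """<play><author>Shakespeare</author><title>Macbeth</title>
<act number="I"><scene number="vii"><title>Macbeth's castle</title></scene></act></play>"""

def walk(node, depth=0):
    indent = "  " * depth
    if node.nodeType == node.ELEMENT_NODE:
        attrs = dict(node.attributes.items())        # element attributes as a dict
        print(indent + "element:", node.tagName, attrs)
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        print(indent + "text:", node.data.strip())
    for child in node.childNodes:                    # descend from parents to children
        walk(child, depth + 1)

doc = parseString(xml)
walk(doc.documentElement)                            # start at the root element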
A common format for XML queries is NEXI (Narrowed Extended XPath I). We
give an example in Figure 3. We display the query on four lines for
typographical convenience, but it is intended to be read as one unit without
line breaks. In particular, //section is embedded under //article.
The query in Figure 3 specifies a search for sections about the summer
holidays that are part of articles from 2001 or 2002. As in XPath double
slashes indicate that an arbitrary number of elements can intervene on a path.
The dot in a clause in square brackets refers to the element the clause
modifies. The clause [.//yr = 2001 or .//yr = 2002] modifies //article. Thus,
the dot refers to //article in this case. Similarly, the dot in [about(., summer
holidays)] refers to the section that the clause modifies.
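The query of Figure 3 did not survive extraction; consistent with the clauses quoted above, it presumably reads (written here on one line):
//article[.//yr = 2001 or .//yr = 2002]//section[about(., summer holidays)]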
We focus on the core information retrieval problem in XML retrieval, namely
how to rank documents according to the relevance criteria expressed in the
about conditions of the NEXI query.
The first challenge in structured retrieval is that users want us to return parts
of documents (i.e., XML elements), not entire documents as IR systems usually
do in unstructured retrieval. If we query Shakespeare's plays for Macbeth's
castle, should we return the scene, the act or the entire play in Figure 2 ? In
this case, the user is probably looking for the scene. On the other hand, an
otherwise unspecified search for Macbeth should return the play of this name,
not a subunit. One criterion for selecting the most appropriate part of a document is the structured document retrieval principle: a system should always retrieve the most specific part of a document answering the query.
The main idea is to have each dimension of the vector space encode a word together with its position within the XML tree.
- Take each text node (leaf) and break it into multiple nodes, one for each word, e.g. split Bill Gates into Bill and Gates.
- Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term.
We then represent queries and documents as vectors in this space of lexicalized subtrees and compute matches between them, e.g. using the vector space formalism. The main difference is that the dimensions of the vector space in unstructured retrieval are vocabulary terms, whereas they are lexicalized subtrees in XML retrieval.
Structural term
There is a tradeoff between the dimensionality of the space and the accuracy of query results.
- If we restrict the dimensions to vocabulary terms, we have a standard vector space retrieval system that will retrieve many documents that do not match the structure of the query (e.g., Gates in the title element as opposed to the author element).
- If we create a separate dimension for each lexicalized subtree occurring in the collection, the dimensionality of the space becomes too large.
As a compromise, we index all paths that end in a single vocabulary term, in other words all XML-context/term pairs. We call such an XML-context/term pair a structural term and denote it by <c, t>: a pair of XML context c and vocabulary term t.
Context resemblance
A simple measure of the similarity of a path cq in a query and a path cd in a
document is the following context resemblance function CR:
CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd (that is, if cq can be transformed into cd by inserting additional nodes), and CR(cq, cd) = 0 otherwise, where |cq| and |cd| are the numbers of nodes in the query path and the document path, respectively.
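A small Python sketch of this computation, where a query path matches a document path if it can be turned into it by inserting extra nodes (i.e., it is a subsequence of it); the example paths are invented for illustration.

def matches(cq, cd):
    # cq matches cd if cq can be transformed into cd by inserting nodes,
    # i.e. cq is a subsequence of cd
    it = iter(cd)
    return all(node in it for node in cq)

def context_resemblance(cq, cd):
    return (1 + len(cq)) / (1 + len(cd)) if matches(cq, cd) else 0.0

print(context_resemblance(["book", "title"], ["book", "chapter", "title"]))  # 0.75
print(context_resemblance(["title"], ["book", "chapter", "title"]))          # 0.5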