Unit-5. Search Engines

The document discusses Information Retrieval (IR), a field focused on the organization, storage, and retrieval of unstructured information from large digital collections. It highlights the importance of search engines, their functionalities, and the challenges they face, such as data volume, relevance, and spam. The document also differentiates between IR and traditional databases, emphasizing the iterative nature of the search process and the role of user intent in retrieving relevant information.

Uploaded by

joyvinod09
Information Retrieval
& Search Engines
Search and Information Retrieval
• Search on the Web is a daily activity for many people throughout the world
• Data processing, search and communication are the most popular uses of the computer
• Applications involving search are everywhere
• The field of computer science that is most involved with R&D for search is information retrieval (IR)
IR problem
 First applications: in libraries (1950s)
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation,
analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
 external attributes and internal attribute (content)
 Search by external attributes = Search in DB
 IR: Search by content
Information retrieval
(IR)
 What comes to mind when I say “information
retrieval”?
 Where have you seen IR? What are some real-
world examples/uses?
 Search engines
 File search (e.g. Windows Instant Search, Google Desktop)
 Databases?
 Catalog search (e.g. library)
 Intranet search (i.e. corporate networks)
Information retrieval
• Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.
• Since the 1950s, the field has focused primarily on text and text documents.
• Web pages, email, scholarly papers, books, and news stories are just a few of the many examples of documents. All of these documents have some amount of structure, such as the title, author and date; these are called attributes.
Example: Google Web search

• Google. Google Search is the most widely used search engine in the world and one of Google's most popular products. ...
• Bing. Bing is Microsoft's answer to Google; it was launched in 2009. ...
• Yahoo. ...
• Baidu. ...
• AOL. ...
• Ask.com. ...
• Excite. ...
The Standard Retrieval Model

An information retrieval system is a system that is capable of storage, retrieval and maintenance of information.
Information Retrieval
Information retrieval is finding material in documents of an unstructured nature that satisfies an information need from within large collections of digitally stored content.
Goal = find documents relevant to an information need from a large document set.
[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list.]
Example information needs:
•Find all documents about MCA
•Find all course web pages at DSI
•What is the cheapest flight from Bangalore to Delhi?
•Who was the 10th president of India?
What is the difference between an information need and a query?

Information need                        Query
•Find all documents about “MCA”         MCA
•Find all course web pages at DSI       DSI AND college AND course (url-contains)
•Who was the 10th president?            WHO=president NUMBER=10
What is Information Retrieval?
• The process of actively seeking out information relevant to a topic of interest
• Typically it refers to the automatic (rather than manual) retrieval of documents
  - Information Retrieval System (IRS)
• “Document” is the generic term for an information holder (book, chapter, article, webpage, etc.)
Possible approaches
1. String matching (linear search in documents)
- Slow
- Difficult to improve
2. Indexing (*)
- Fast
- Flexible to further improvement
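
The trade-off between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the three-document collection `DOCS` and all function names are made up for the example, and words are split on whitespace only.

```python
from collections import defaultdict

# Hypothetical toy collection: document id -> text.
DOCS = {
    1: "information retrieval with search engines",
    2: "database systems store structured records",
    3: "web search engines crawl and index pages",
}

def linear_search(docs, word):
    """Approach 1: scan every document for the word at query time."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())

def build_index(docs):
    """Approach 2: build a word -> set of document ids mapping once, up front."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def index_search(index, word):
    """Look the word up directly instead of rescanning the documents."""
    return sorted(index.get(word, set()))

INDEX = build_index(DOCS)
print(linear_search(DOCS, "search"))
print(index_search(INDEX, "search"))
```

Both functions return the same documents; the difference is that `linear_search` rescans the whole collection on every query, while `index_search` is a single dictionary lookup once the index has been built.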
IR is an Iterative Process
[Figure: Goals, Workspace, Repositories]
• The exchange doesn’t end with the first answer
• Users can recognize elements of a useful answer, even when incomplete
• Questions and understanding change as the process continues
Challenges
Why is information retrieval hard?
• Lots and lots of data
  - efficiency
  - storage
  - discovery (web)
• Data is unstructured
• Querying/Understanding user intent
• SPAM
• Data quality
IR vs. databases
Structured data tends to refer to information in “tables”

Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000

Typically allows numerical range and exact match (for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
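
As a rough sketch of why such queries are easy for a database, the example query can be evaluated by comparing field values directly. The list-of-dictionaries representation and the `select` helper below are assumptions made for illustration; a real database would use SQL and its own storage.

```python
# The employee table from the slide, one dictionary per record.
EMPLOYEES = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy", "manager": "Smith", "salary": 50000},
]

def select(records, predicate):
    """Return the records for which the predicate holds (a tiny WHERE clause)."""
    return [record for record in records if predicate(record)]

# Salary < 60000 AND Manager = Smith
matches = select(
    EMPLOYEES,
    lambda r: r["salary"] < 60000 and r["manager"] == "Smith",
)
names = [r["employee"] for r in matches]
print(names)
```

Every condition is an exact comparison on a well-defined field, so there is no notion of a record being "more" or "less" relevant; it either matches or it does not.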
Documents vs. Database Records
• Database records (or tuples in relational databases) are typically made up of well-defined fields (or attributes)
  - e.g., bank records with account numbers, balances, names, addresses, social security numbers, dates of birth, etc.
• Easy to compare fields with well-defined semantics to queries in order to find matches
Documents vs. Records
• Example bank database query
  - Find records with balance > $50,000 in branches located in Delhi
  - Matches easily found by comparison with field values of records
• Example search engine query
  - bank scandals in western mass
  - This text must be compared to the text of entire documents
Big Issues in IR
• Relevance
  - A relevant document contains the information that a person was looking for when they submitted a query to the search engine
  - Many factors influence a person’s decision about what is relevant: e.g., timeliness, authority or novelty of the result, task, context, style
Big Issues in IR
• Evaluation
  - Experimental procedures and measures for comparing system output with user expectations
  - Typically use a test collection of documents, queries, and relevance judgments
  - Most commonly used are the TREC (Text REtrieval Conference) collections as benchmarks
  - Recall and precision are two examples of effectiveness measures
Big Issues in IR
• Precision: Proportion of retrieved documents that are relevant
• Recall: Proportion of relevant documents that are retrieved
• Assumption: All relevant documents are known.
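
The two measures defined above can be computed directly from a retrieved list and a set of relevance judgments. The document identifiers below are invented for illustration, and the function assumes, as the slide does, that all relevant documents are known.

```python
def precision_recall(retrieved, relevant):
    """Precision = |retrieved ∩ relevant| / |retrieved|,
    recall = |retrieved ∩ relevant| / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 5 documents judged relevant.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5", "d6", "d7"],
)
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 2/4) and two of the five relevant documents were found (recall 2/5).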
Information Retrieval
 Information Retrieval – a Definition. The aim of information
retrieval (IR) is to make machine-stored data discoverable: unlike
data mining which extracts structures from online records, IR is
concerned with filtering specific information from a set of data.
 A “Search engine” is one of many different kinds of “information
retrieval systems.”
 An information retrieval system is any multi-tiered system by
which underlying stored data (raw, structured, or even
documents, images and sound/video) is indexed by keywords or
codes to allow it to be searched, enumerated and retrieved.
 There may be a GUI that allows a human to perform these
functions, a documented API may be provided for integrating
• Search - Take any document, say a web page or a Word document, press CTRL+F and search for a word. In this case you are searching for an exact word in a document. Similarly, you can search databases for exact words, with minor variations as well.
• Information Retrieval (IR) - IR typically goes beyond search: it mines information from a vast knowledge base and gives results based not just on your keywords but also on your intent. It also takes into account your personal preferences, different meanings of keywords, and spelling errors.
Search Engine - Intro
• You have probably been using search engines, but perhaps not as effectively as possible.
• A lot of information is available online, but not all of it is completely accurate.
• Web page addresses are constantly changing; a page may be available for only a short time.
Search Engine
• A Web search engine is a software system that is designed to search for information on the World Wide Web.
• It uses keywords to search for documents that relate to these keywords and then puts the results in order of relevance to the topic that was searched for.
Search Engine
• A search engine maintains a huge database of internet resources such as web pages, newsgroups, programs, images, etc. It helps to locate information on the World Wide Web.
• A user can search for any information by passing a query in the form of keywords or a phrase. The search engine then searches for relevant information in its database and returns it to the user.
Search Engines
A search engine is a computer program that does the following:
• Allows the user to submit a query that consists of a word or phrase
• Searches the database
• Returns a list of suitable URLs which match the query
• Allows the user to revise and resubmit the query
Why Search Engines?
• Search engines are important because, with billions of web pages available, it would otherwise be impossible to find the specific information that is needed.
• This is why search engines are used to filter the information that is on the internet and transform it into results that each individual can easily access and use within a matter of seconds.

A search engine is the practical application of information retrieval techniques to large-scale text collections.
Search engines can be found in many different applications, such as desktop search or enterprise search.
Search engines have been around for many years. For example, MEDLINE, the online medical literature search system, started in the 1970s.
How do search engines work?
Search engines have three primary functions:
• Crawl: Scour the Internet for content, looking over the code/content for each URL they find.
• Index: Store and organize the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result to relevant queries.
• Rank: Provide the pieces of content that will best answer a searcher's query, which means that results are ordered from most relevant to least relevant.
[Figure: job of a spider]
Crawling
• Crawling is the discovery process in which search engines send out a team of robots (known as crawlers or spiders) to find new and updated content. Content can vary (it could be a webpage, an image, a video, a PDF, etc.), but regardless of the format, content is discovered by links.
• Googlebot starts out by fetching a few web pages, and then follows the links on those webpages to find new URLs. By hopping along this path of links, the crawler is able to find new content and add it to its index, called Caffeine (a massive database of discovered URLs), to later be retrieved.
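
The link-following behaviour described above is essentially a breadth-first traversal of the link graph. The sketch below runs over a small in-memory graph instead of the real Web, so nothing is actually downloaded; the URLs in `LINKS`, the `crawl` function and its `max_pages` limit are all hypothetical and are not how Googlebot itself is implemented.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: URL -> links found on that page.
LINKS = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/about"],
    "c.example/": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Start from seed URLs and follow links breadth-first, visiting each URL once."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already discovered
    order = []                # URLs in the order they were fetched
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

discovered = crawl(["a.example/"], lambda url: LINKS.get(url, []))
print(discovered)
```

In a real crawler `fetch_links` would download the page and extract its links; passing it in as a function keeps the traversal logic separate from network access.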
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is a program that systematically browses the World Wide Web.
Simple Index Diagram
[Figure: simple index diagram. Credit: Avi Rappoport, Search Tools Consulting; UCB SIMS 202, Sept. 2004]
Processing is distributed
among several peers in a
decentralized
The usual search scenario involves someone typing in a query to a search engine and receiving answers in the form of a list of documents in ranked order.
• Web search (search of the World Wide Web) is by far the most common application involving information retrieval.
• Vertical search is a specialized form of web search where the domain of the search is restricted to a particular topic.
• Enterprise search involves finding the required information in the huge variety of computer files scattered across a corporate intranet. Web pages are certainly a part of that distributed information store, but most information will be found in sources such as email, reports, presentations, spreadsheets, and structured data in corporate databases.
• Desktop search is the personal version of enterprise search, where the information sources are the files stored on an individual computer, including email messages and web pages that have recently been browsed.
• Peer-to-peer search involves finding information in networks of nodes or computers without any centralized control. This type of search began as a file sharing tool for music but can be used in any community based on shared interests.
History

Types of Search Engines
• Crawler based Search Engines
• Directories
• Specialty Search Engines
• Hybrid Search Engines
• Meta Search Engines

Directories
• A Web Directory or Web Guide is a hierarchical representation of hyperlinks.
• The top level is typically a wide range of very general topics.
• Each topic contains hyperlinks to more specialized sub-topics.
• Very easy to use.
[Figure: hierarchical representation]
Comparison: Directory vs. Search Engine
1. Directory: A directory allows you to explore and get what you want eventually.
   Search engine: A search engine brings you to the exact page for the words or phrases you are looking for.
2. Directory: Use a directory to find cooking-related websites.
   Search engine: Use a search engine to find a specific recipe, by providing the names of the ingredients.
3. Directory: Use a directory to find travel guides in a country.
   Search engine: Use a search engine to find the train schedule in South Africa.
Search Engine Issues
• Performance
  - improving the efficiency of search, e.g., reducing response time, increasing query throughput, increasing indexing speed
  - Indexes are data structures designed to improve search efficiency; designing and implementing them are major issues for search engines
Search Engine Issues
• Dynamic data
  - The “collection” for most real applications is constantly changing in terms of updates, additions, deletions, e.g., web pages
  - Acquiring or “crawling” the documents is a major task
  - Typical measures are coverage (how much has been indexed) and freshness (how recently it was indexed)
Search Engine Issues
• Scalability
  - Making everything work with millions of users every day, and many terabytes of documents
  - Distributed processing is essential
• Adaptability
  - Changing and tuning search engine components (e.g., the interface) for different applications
Spam
• For Web search, spam in all its forms is one of the major issues
• It affects the efficiency of search engines and, more seriously, the effectiveness of the results
• Many types of spam will degrade performance
• This has given rise to a new subfield called adversarial IR, since spammers are “adversaries” with different goals
White Pages / Yellow Pages
• White pages allow users to look up information about individuals.
• We can use white pages to track down telephone numbers and email addresses.
• People can abuse white pages.
• Some people think that white pages are an invasion of their privacy.
• Yellow pages contain information about businesses.
Search Engine Components
Generally there are three basic components of a search engine, as listed below:
Web crawler
• It is also known as a spider or bot. It is a software component that traverses the web to gather information.
Database
• All the information gathered from the web is stored in a database. It consists of a huge collection of web resources.
Search Interfaces
• This component is an interface between the user and the database. It helps the user to search through the database.

An architecture is designed to ensure that a system will satisfy the application requirements or goals. The two primary goals of a search engine are:
• Effectiveness (quality): We want to be able to retrieve the most relevant set of documents possible for a query.
• Efficiency (speed): We want to process queries from users as quickly as possible.
Search Engine Working
• The web crawler, database and search interface are the major components that actually make a search engine work. Search engines make use of the Boolean operators AND, OR and NOT to restrict and widen the results of a search. The steps performed by the search engine are:
• The search engine looks for the keyword in the index of its predefined database instead of going directly to the web to search for the keyword.
• It then uses software to search for the information in the database. This software component is known as the web crawler.
• Once the web crawler finds the pages, the search engine shows the relevant web pages as a result. The retrieved results generally include the title of the page, the size of the text portion, the first several sentences, etc.
• These search criteria may vary from one search engine to another. The retrieved information is ranked according to various factors such as frequency of keywords, relevancy of information, links, etc.
• The user can click on any of the search results to open it.
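
The way AND, OR and NOT restrict and widen a result set can be illustrated with set operations over an index. The tiny index below (keyword to set of page ids) and the function names are assumptions made for this sketch; real engines use far more elaborate index structures.

```python
# Hypothetical index: keyword -> ids of the pages that contain it.
INDEX = {
    "search": {1, 2, 4},
    "engine": {1, 4},
    "database": {2, 3},
}
ALL_PAGES = {1, 2, 3, 4}

def pages_for(keyword):
    """Pages containing the keyword (empty set if the keyword is not indexed)."""
    return INDEX.get(keyword, set())

def boolean_and(left, right):
    return left & right           # restricts: pages must be in both sets

def boolean_or(left, right):
    return left | right           # widens: pages may be in either set

def boolean_not(pages):
    return ALL_PAGES - pages      # everything except the given pages

both = boolean_and(pages_for("search"), pages_for("engine"))
either = boolean_or(pages_for("search"), pages_for("database"))
without = boolean_and(pages_for("search"), boolean_not(pages_for("database")))
print(sorted(both), sorted(either), sorted(without))
```

Note that the keyword is looked up in the index, not on the live web, which matches the first step in the list above.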
Architecture
• The search engine architecture comprises the three basic layers listed below:
  - Content collection and refinement
  - Search core
  - User and application interfaces
SE Processing - Index Processing
• Text acquisition
  - identifies and stores documents for indexing
• Text transformation
  - transforms documents into index terms or features
• Index creation
  - takes index terms and creates data structures (indexes) to support fast searching
[Figure: indexing process]
Text Acquisition
• The task of the text acquisition component is to identify and make available the documents that will be searched.
• Text acquisition will build a collection by crawling or scanning the Web, a corporate intranet, a desktop, or other sources of information.
• Also, the text acquisition component creates a document data store, which contains the text and metadata for all the documents.
• Metadata is information about a document that is not part of its text content.
Text Acquisition
• Crawler
  - Identifies and acquires documents for the search engine
  - Many types: web crawlers, document crawlers
• Web crawlers are designed to follow the links on webpages to discover and download new pages
  - Must efficiently find huge numbers of web pages (coverage) and keep the document store up to date (freshness)
• Document crawlers for enterprise and desktop search
  - Follow links and scan directories to discover both external and internal (i.e., restricted to the corporate intranet) pages, but also must scan both corporate and personal directories to identify email, word processing documents, presentations, database records, etc.
Text Acquisition
• Feeds
  - Mechanism for accessing real-time streams of documents, e.g., web feeds for news, blogs, video, radio, TV
  - RSS (Rich Site Summary) is a standard web feed format used to publish frequently updated information
  - An RSS “reader” can provide new XML documents to the search engine
  - The reader monitors those feeds and provides new documents as they arrive
Document data store
• Stores text, metadata, and other related content for documents
  - Metadata is information about a document, such as type and creation date
  - Other content includes links and anchor text
• Provides fast access to document contents for search engine components
• A relational database system could be used to store the documents
Text Transformation
• This component transforms documents into index terms or features.
• Index terms, as the name implies, are the parts of a document that are stored in the index and used in searching.
• Examples of index terms or features are phrases, names of people, dates, and links in a web page. Index terms are sometimes simply referred to as “terms.” The set of all the terms that are indexed for a document collection is called the index vocabulary.
Text Transformation
• Parser
  - Processes the sequence of text tokens in the document to recognize structural elements/terms, e.g., titles, links, headings, etc.
• Tokenizer recognizes “words” in the text
  - must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
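
A small regular-expression tokenizer makes these issues concrete. The policy chosen here (lowercase everything, keep hyphens and apostrophes inside a word, treat every other character as a separator) is only one possible set of assumptions, and the names `TOKEN_PATTERN` and `tokenize` are invented for the sketch.

```python
import re

# A token is a run of letters/digits, optionally joined by single
# apostrophes or hyphens (so "doesn't" and "e-mail" stay whole).
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return the list of word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

tokens = tokenize("Web-based search doesn't ignore CAPITALS, e-mail, or C++!")
print(tokens)
```

The last token shows the cost of a simple policy: `C++` is reduced to `c`, so a query for the programming language would match any document containing the letter as a word.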
Text Transformation
 Stopping
 Remove common words
 e.g., “and”, “or”, “the”, “in”
 Some impact on efficiency and effectiveness

 Stemming
 Group words derived from a common stem
 e.g., “computer”, “computers”, “computing”, “compute”
 Usually effective, but not for all queries
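
Stopping and stemming can both be sketched as simple token filters. The stopword list is the one given in the example above; the suffix-stripping stemmer is a deliberately crude stand-in written for this illustration (it is not the Porter stemmer or any other published algorithm), and the suffix list is an assumption.

```python
STOPWORDS = {"and", "or", "the", "in"}

# Suffixes tried in order, longest first. Purely illustrative.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")

def remove_stopwords(tokens):
    """Stopping: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(word):
    """Stemming: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {crude_stem(w) for w in ["computer", "computers", "computing", "compute"]}
kept = remove_stopwords(["the", "search", "in", "engine"])
print(stems, kept)
```

All four example words collapse to the same stem, which is the intended grouping. The same rules also turn `news` into `new`, which is one way to see why stemming is usually effective but not for all queries.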
Text Transformation
• Link Analysis
  - Makes use of links and anchor text in web pages
  - Link analysis identifies popularity and community information, e.g., PageRank, which is a way of measuring the importance of website pages
  - Anchor text can significantly enhance the representation of pages pointed to by links
• Link analysis provides the search engine with a rating of the popularity, and to some extent, the authority of a page (in other words, how important it is).
• Anchor text, which is the clickable text of a web link, can be used to enhance the text content of a page that the link points to.
• These two factors can significantly improve the effectiveness of web search for some types of queries.
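
To give a feel for how link analysis produces a popularity rating, here is a minimal power-iteration sketch in the spirit of PageRank. The four-page link graph, the damping factor of 0.85 and the fixed iteration count are assumptions chosen for the example; this is a textbook simplification, not the ranking any search engine actually deploys.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to.
    Every link target must also appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share, then receives rank from its in-links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: A, B and D all link to C; C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(LINKS)
print(max(ranks, key=ranks.get))
```

Page C ends up with the highest score because three pages point to it, and A outranks B and D because it is linked from the highly ranked C.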
Search engine indexing & Ranking
Search engine indexing
• Search engines process and store information they find in an index, a huge database of all the content they’ve discovered and deem good enough to serve up to searchers.
Search engine ranking
• When someone performs a search, search engines scour their index for highly relevant content and then order that content in the hopes of solving the searcher's query. This ordering of search results by relevance is known as ranking. In general, you can assume that the higher a website is ranked, the more relevant the search engine believes that site is to the query.
• It’s possible to block search engine crawlers from part or all of your site, or instruct search engines to avoid storing certain pages in their index. While there can be reasons for doing this, if you want your content found by searchers, you have to first make sure it’s accessible to crawlers and is indexable. Otherwise, it’s as good as invisible.
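
One common way to block crawlers from part of a site is a robots.txt file, which the text above does not name explicitly, so treat its use here as an assumption. The sketch checks two URLs against a made-up robots.txt using Python's standard `urllib.robotparser`; the bot name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

can_fetch_public = parser.can_fetch("ExampleBot", "https://www.example.com/index.html")
can_fetch_private = parser.can_fetch("ExampleBot", "https://www.example.com/private/report.html")
print(can_fetch_public, can_fetch_private)
```

A well-behaved crawler would run a check like this before fetching each URL; pages under the disallowed path are never crawled and so cannot be found through the index.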
Index Creation
• Document Statistics
  - The task of the document statistics component is simply to gather and record statistical information about words, features, and documents.
  - This information is used by the ranking component to compute scores for documents.
  - The document statistics are stored in lookup tables.
• Weighting
  - Index term weights reflect the relative importance of words in documents and are used in the ranking algorithm
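
The slide does not say which weighting scheme is used. A widely taught choice is tf-idf, sketched below as an assumption: a term's weight in a document is its count in that document multiplied by the logarithm of (number of documents / number of documents containing the term). The three tokenized documents are invented for the example.

```python
import math
from collections import Counter

# Hypothetical tokenized documents.
DOCS = {
    "d1": ["search", "engine", "index"],
    "d2": ["search", "engine", "crawler", "crawler"],
    "d3": ["database", "index", "search"],
}

def tf_idf_weights(docs):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    n_docs = len(docs)
    # Document statistics: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for doc_id, tokens in docs.items():
        term_counts = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
    return weights

weights = tf_idf_weights(DOCS)
print(weights["d2"])
```

In this toy collection `search` occurs in every document, so its weight is zero everywhere, while `crawler`, which is frequent in d2 and absent elsewhere, gets the largest weight in d2. The `doc_freq` table plays the role of the lookup table of document statistics mentioned above.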
Index Creation
• Inversion
  - Core of indexing process
  - Converts document-term information to term-document for indexing
  - Format of inverted file is designed for fast query processing
  - Must also handle updates
  - Compression used for efficiency
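
The inversion step, together with a simple update and one basic compression idea, might look like the sketch below. The postings format (term to a dictionary of document id and term frequency) and the gap encoding of document ids are illustrative assumptions; the slide does not specify a format.

```python
from collections import defaultdict

def add_document(inverted, doc_id, terms):
    """Update: add one document's terms to an existing inverted index."""
    for term in terms:
        postings = inverted[term]
        postings[doc_id] = postings.get(doc_id, 0) + 1

def invert(doc_terms):
    """Convert document -> terms into term -> {document: term frequency}."""
    inverted = defaultdict(dict)
    for doc_id, terms in doc_terms.items():
        add_document(inverted, doc_id, terms)
    return inverted

def to_gaps(doc_ids):
    """Compression idea: store differences between sorted ids, which are small numbers."""
    gaps, previous = [], 0
    for doc_id in sorted(doc_ids):
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

# Hypothetical document-term information.
DOC_TERMS = {
    1: ["search", "engine", "index"],
    3: ["index", "index", "crawler"],
    7: ["search", "crawler"],
}
inverted = invert(DOC_TERMS)
add_document(inverted, 12, ["search"])   # a new document arrives later
print(dict(inverted["search"]), to_gaps(inverted["search"]))
```

After inversion, answering "which documents contain `search`?" is a single lookup, and the gap list is what a compressed index would typically encode with a variable-length code.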
Query Process
• User interaction
  - supports creation and refinement of query, display of results
• Ranking
  - uses query and indexes to generate ranked list of documents
• Evaluation
  - monitors and measures effectiveness and efficiency
Query Process
User Interaction
 Query input
 Provides interface and parser for query
language
 Query language used to describe more complex
queries and results of query transformation
 IR query languages also allow content and
structure specifications, but focus on content
User Interaction

 Query transformation
 Improves initial query
 Spell checking and query suggestion provide
alternatives to original query using search log
 Query expansion and relevance feedback
modify the original query with additional terms
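
Both kinds of query transformation can be sketched with the standard library. Spelling suggestion below uses `difflib.get_close_matches` against a vocabulary that stands in for terms seen in a search log, and query expansion uses a hand-written table of related terms; the vocabulary, the table and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical vocabulary, e.g. terms that appear often in the search log.
QUERY_LOG_TERMS = ["information", "retrieval", "search", "engine", "crawler"]

# Hypothetical table of related terms used for expansion.
RELATED_TERMS = {"search": ["retrieval"], "engine": ["system"]}

def suggest_spelling(term, vocabulary):
    """Return the closest known term, or the term itself if nothing is close."""
    if term in vocabulary:
        return term
    matches = get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term

def expand_query(terms, related):
    """Append related terms to the original query, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

corrected = suggest_spelling("retreival", QUERY_LOG_TERMS)
expanded = expand_query(["search", "engine"], RELATED_TERMS)
print(corrected, expanded)
```

Relevance feedback works along the same lines as `expand_query`, except that the additional terms come from documents the user marked as relevant rather than from a fixed table.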
Search Engine - How does it work?
• User Interface – Allows you to type a query and displays the results.
• Searcher – The engine searches the database for matches to your query.
• Evaluator – The engine assigns scores to the retrieved information.
• Gatherer – The component that travels the Web and collects information.
• Indexer – The engine that categorizes the data collected by the gatherer.
User Interface
• Provides a mechanism for a user to submit queries to the search engine.
• Uses forms, which are very user friendly.
• The user interface displays the search results in a convenient way.
• A summary of each matched page is shown.
Searcher
• It is a program that uses the search engine’s database to locate the matches for a specific query.
• The database of a search engine holds an extremely large number of indexed pages.
• A highly efficient search algorithm is necessary.
• Computer scientists have spent years developing searching and sorting methods.
Evaluator
• The searcher returns a set of URLs that match your query.
• Not all of the hits match your query equally well.
• The more references there are to a page, the higher its ranking will be.
• How is the relevancy score calculated?
  - It varies from one engine to another.
  - The number of times the word appears?
  - Do the query words appear in the title?
  - Do the query words appear in the META tag?
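
Since the scoring varies from one engine to another, the following is only one hypothetical way to combine the three signals listed above: occurrences in the body, presence in the title, and presence in the META tag. The bonus values and the two sample pages are invented for the example.

```python
def relevancy_score(page, query_words, title_bonus=5, meta_bonus=3):
    """Score = occurrences of each query word in the body,
    plus a bonus if the word is in the title and another if it is in the META tag."""
    body_words = page["body"].lower().split()
    title_words = page["title"].lower().split()
    meta_words = page["meta"].lower().split()
    score = 0
    for word in (w.lower() for w in query_words):
        score += body_words.count(word)
        if word in title_words:
            score += title_bonus
        if word in meta_words:
            score += meta_bonus
    return score

# Hypothetical pages returned by the searcher.
PAGES = [
    {"url": "b.example/cooking", "title": "Cooking tips", "meta": "recipes cooking",
     "body": "search for a recipe by ingredient"},
    {"url": "a.example/search-basics", "title": "Search engine basics", "meta": "search engine tutorial",
     "body": "a search engine ranks pages for each search"},
]

ranked = sorted(PAGES, key=lambda page: relevancy_score(page, ["search"]), reverse=True)
print([page["url"] for page in ranked])
```

Both pages match the query, but the second one mentions the word more often and also has it in its title and META tag, so the evaluator places it first.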
Link Popularity
[Figure: link popularity, with links labeled "reference"]
Gatherer
• It is a program that traverses the Web and gathers information about Web documents.
• It runs at short, regular intervals.
• It returns information that will be indexed in the database.
• Alternate names: Bot, Crawler, Robot, Spider and Worm.
Indexer
• It organizes the data by creating a set of keys or an index.
  - E.g., libraries: Author, Title, ISBN, etc.
• Indexes need to be rebuilt frequently in order to ensure that the returned URLs are not out of date.
• The search engine is very complex and needs to be broken down into different components.

ALL THE
BEST
