Information Retrieval and Web Search
LECTURER: Ms. X.
Information Retrieval (IR) Concepts
Retrieval Models
Types of Queries in IR Systems
Text Preprocessing
Inverted Indexing
Evaluation Measures of Search Relevance
Web Search and Analysis
Trends in Information Retrieval
Information Retrieval (IR) Concepts
Information retrieval
• Process of retrieving documents from a collection in response to a query by a user
Query
• Set of terms
• Used by searcher to specify information need
Hyperlinks
• Used to interconnect Web pages
• Mainly used for browsing
Anchor texts
• Text phrases within documents used to label hyperlinks
• Very relevant to browsing
Web search
• Combines browsing and retrieval
Rank of a Webpage
• Measure of relevance to query that generated result set
Retrieval Models
Semantic model
Boolean Model
Vector Space Model
Documents
• Represented as features and weights in an n-dimensional vector space
Query
• Specified as a vector of terms
• Compared to the document vectors for similarity/relevance assessment
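The comparison of a query vector with document vectors can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the similarity measure (the slides do not name one); the three-term vocabulary and the weights in `docs` and `query` are made up for the example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical vocabulary (one dimension per term): ["database", "index", "query"]
docs = {
    "d1": [2.0, 1.0, 0.0],
    "d2": [0.0, 1.0, 3.0],
}
query = [0.0, 1.0, 1.0]

# Rank documents by decreasing similarity to the query.
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # ['d2', 'd1']
```

Cosine similarity depends only on the direction of the vectors, so a long document is not favored merely for repeating terms.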
Vector Space Model (cont’d.)
TF-IDF
• Statistical weight measure
• Used to evaluate the importance of a document word in a collection of documents
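A small sketch of how such a weight might be computed. TF-IDF has several variants; this one assumes tf = term count divided by document length and idf = log(N / df), and the function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["web", "search", "engine"],
    ["web", "crawler"],
    ["web", "search", "ranking", "ranking"],
]
weights = tf_idf(docs)
print(weights[2])
```

In this sample, "web" occurs in every document and so gets weight 0, while "ranking" is frequent in the third document and absent elsewhere, so it gets the highest weight there.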
Rocchio algorithm
• Well-known relevance feedback algorithm
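As a rough sketch, the Rocchio update moves the query vector toward the centroid of documents the user marked relevant and away from the centroid of those marked non-relevant. The weights `alpha`, `beta`, and `gamma` below are commonly quoted defaults, not values from the slides, and clipping negative weights to zero is an assumption of this sketch.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector after one round of relevance feedback.

    query: list of floats; relevant / non_relevant: lists of document vectors.
    """
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * n
               for q, r, n in zip(query, rel_c, non_c)]
    # Assumption: negative term weights are clipped to zero.
    return [max(0.0, w) for w in updated]

new_query = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
    non_relevant=[[0.0, 0.0, 1.0]],
)
print(new_query)  # [1.75, 0.75, 0.0]
```

The second term, absent from the original query, gains weight because it appears in the relevant documents.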
Probabilistic Model
Okapi system
Semantic Model
Knowledge-based IR systems
• Based on semantic models
• Cyc knowledge base
• WordNet
Types of Queries in IR Systems
Keywords
• Consist of words, phrases, and other characterizations of documents
• Used by IR system to build inverted index
IR systems do not pay attention to the ordering of these words in the query
Boolean Queries
Proximity Queries
Accounts for how close within a record multiple terms should be to each other
Common option requires terms to be in the exact order
Various operator names
• NEAR, ADJ(adjacent), or AFTER
Computationally expensive
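A proximity check can be sketched over token positions. The function `near` and its parameters are hypothetical and do not reproduce the operator syntax of any particular system; `ordered=True` loosely mimics an AFTER-style constraint.

```python
def near(tokens, term_a, term_b, max_distance, ordered=False):
    """True if term_a and term_b occur within max_distance positions.

    With ordered=True, term_a must appear before term_b.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    for a in pos_a:
        for b in pos_b:
            gap = b - a
            if ordered and gap <= 0:
                continue
            if gap != 0 and abs(gap) <= max_distance:
                return True
    return False

tokens = "the inverted index maps each term to a posting list".split()
print(near(tokens, "inverted", "index", 1))                 # True
print(near(tokens, "index", "inverted", 1, ordered=True))   # False
print(near(tokens, "index", "list", 3))                     # False
```

The nested loop over position pairs hints at why these queries cost more than plain keyword lookups: the system needs term positions, not just document membership.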
Wildcard Queries
Text Preprocessing
• Commonly used text preprocessing techniques
Stopwords
• Very commonly used words in a language
• Expected to occur in 80 percent or more of the documents
• the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it
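Stopword removal can be sketched as a simple set lookup. The set below reuses the example words listed above; whitespace tokenization and lowercasing are simplifying assumptions.

```python
STOPWORDS = {
    "the", "of", "to", "a", "and", "in", "said", "for", "that",
    "was", "on", "he", "is", "with", "at", "by", "it",
}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

result = remove_stopwords("The rank of a page is a measure of relevance")
print(result)  # ['rank', 'page', 'measure', 'relevance']
```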
Stem
• Word obtained after trimming the suffix and prefix of an original word
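To illustrate the idea only, here is a deliberately naive suffix stripper. It is a toy, not a real stemming algorithm such as Porter's; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["indexing", "indexed", "links", "is"]:
    print(w, "->", naive_stem(w))
# indexing -> index
# indexed -> index
# links -> link
# is -> is
```

Mapping "indexing" and "indexed" to the same stem lets a query on one form match documents that contain the other.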
Thesaurus
• Precompiled list of important concepts and the main word that describes each
• Synonym converted to its matching concept during preprocessing
• Examples:
• UMLS
• Large biomedical thesaurus of concepts/meta concepts/relationships
• WordNet
• Manually constructed thesaurus that groups words into strict synonym sets
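The synonym-to-concept conversion can be pictured as a dictionary lookup. The entries below are invented for illustration and are not taken from UMLS or WordNet.

```python
# Hypothetical thesaurus: each synonym maps to its main concept word.
THESAURUS = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "physician": "doctor",
    "doctor": "doctor",
}

def to_concepts(tokens):
    """Replace each token by its concept; unknown tokens pass through."""
    return [THESAURUS.get(t, t) for t in tokens]

concepts = to_concepts(["auto", "repair", "physician"])
print(concepts)  # ['automobile', 'repair', 'doctor']
```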
Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases
Digits, dates, phone numbers, e-mail addresses, and URLs may or may not be removed during preprocessing
Hyphens and punctuation marks
• May be handled in different ways
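Because these choices vary between systems, a sketch can expose them as options. The function `normalize` and its flags `keep_digits` and `split_hyphens` are hypothetical names chosen for this example.

```python
import re

def normalize(text, keep_digits=True, split_hyphens=True):
    """Lowercase the text, then handle hyphens, digits, and punctuation."""
    text = text.lower()
    if split_hyphens:
        text = text.replace("-", " ")
    if not keep_digits:
        text = re.sub(r"\d+", " ", text)
    # Remove punctuation; word characters, whitespace, and hyphens survive.
    text = re.sub(r"[^\w\s-]", " ", text)
    return text.split()

sample = "State-of-the-art IR, since 1990!"
print(normalize(sample))
# ['state', 'of', 'the', 'art', 'ir', 'since', '1990']
print(normalize(sample, keep_digits=False, split_hyphens=False))
# ['state-of-the-art', 'ir', 'since']
```

Whichever options are chosen, the same normalization has to be applied to both documents and queries so that their terms still match.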
Information Extraction (IE)
Generic term for extracting structured content from text
Examples of IE tasks
Mostly used to identify contextually relevant features that involve text analysis, matching, and categorization
Inverted Indexing
Vocabulary
• Set of distinct query terms in the document set
Inverted index
• Data structure that associates each distinct term with a list of all documents that contain the term
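A minimal sketch of building such an index, assuming whitespace tokenization and lowercasing with no other preprocessing. The document IDs and texts are invented for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "web search engine", 2: "web crawler", 3: "search ranking"}
index = build_inverted_index(docs)
vocabulary = sorted(index)

print(vocabulary)       # ['crawler', 'engine', 'ranking', 'search', 'web']
print(index["web"])     # [1, 2]
# A two-keyword AND query intersects the two document lists.
both = sorted(set(index["web"]) & set(index["search"]))
print(both)             # [1]
```

Answering a keyword query then only touches the lists for the query terms instead of scanning every document.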
Evaluation Measures of Search Relevance
Topical relevance
• Measures extent to which topic of a result matches topic of query
User relevance
• Describes “goodness” of a retrieved result with regard to user’s information need
Recall
• Number of relevant documents retrieved by a search / Total number of existing relevant documents
Precision
• Number of relevant documents retrieved by a search / Total number of documents retrieved by that search
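The two ratios can be computed directly from the set of retrieved documents and the set of relevant documents. The document IDs are made up; returning 0.0 for an empty set is an assumption of this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5", "d6", "d7"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 0.4
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5), and 2 of the 5 relevant documents were found (recall 0.4).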
Recall and Precision (cont’d.)
Average precision
• Useful for computing a single precision value to compare different retrieval algorithms
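One common way to compute it, sketched under the assumption that precision is taken at the rank of each relevant document and averaged over all relevant documents (some texts average only over the relevant documents actually retrieved):

```python
def average_precision(ranked, relevant):
    """Average of precision values measured at each relevant document's rank."""
    relevant = set(relevant)
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3", "d4", "d5"]
ap = average_precision(ranked, {"d1", "d3", "d5"})
print(round(ap, 4))  # 0.7556
```

The relevant documents sit at ranks 1, 3, and 5, giving precisions 1/1, 2/3, and 3/5, whose mean is about 0.7556.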
Recall/precision curve
• Usually has a negative slope indicating inverse relationship between precision and recall
F-score
• Single measure that combines precision and recall to compare different result sets
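A frequently used form is the harmonic mean of precision and recall (often called F1). The slide gives no formula, so treat this as one standard choice rather than the only definition.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f = f_score(0.5, 0.4)
print(round(f, 4))  # 0.4444
```

The harmonic mean stays low unless both values are high, so a result set cannot score well by maximizing only one of them.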
Web Search and Analysis
Metasearch engines
• Query different search engines simultaneously
Digital libraries
• Collections of electronic resources and services
Web Analysis and Its Relationship to IR
Hyperlink components
• Destination page
• Anchor text
Hub
• Web page or a Website that links to a collection of prominent sites (authorities) on a common topic
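The hub and authority notions can be made concrete with a HITS-style iteration. The slides do not name an algorithm, so this is only an illustrative sketch; the link graph, page names, and the iteration count are assumptions.

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) score dicts for a {page: [linked pages]} graph."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"portal": ["siteA", "siteB"], "blog": ["siteA"], "siteA": [], "siteB": []}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))  # portal siteA
```

The page linking to the most prominent sites ends up as the strongest hub, and the page with the most incoming links from good hubs ends up as the strongest authority.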
Analyzing the Link Structure of Web Pages
Database-based approach
• Infer the structure of a Web site, or transform the Web site to organize it as a database
Web Usage Analysis
• Clustering of pages
• Pages with similar contents are grouped together
• Sequential patterns
• Dependency modeling
• Pattern modeling
Practical Applications of Web Analysis
Web analytics
• Understand and optimize the performance of Web usage
Web spamming
• Deliberate activity to promote a page by manipulating results returned by search engines
Web security
Alternate uses for Web crawlers
Trends in Information Retrieval
Faceted search
• Allows users to explore by filtering available information
• Facet
• Defines properties or characteristics of a class of objects
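Filtering by facets can be sketched as matching on item properties. The catalog, the facet names (`brand`, `color`), and the function `filter_by_facets` are hypothetical.

```python
def filter_by_facets(items, **selected):
    """Keep the items whose facet values match every selected facet."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]

catalog = [
    {"name": "laptop A", "brand": "X", "color": "silver"},
    {"name": "laptop B", "brand": "Y", "color": "silver"},
    {"name": "laptop C", "brand": "X", "color": "black"},
]

matches = filter_by_facets(catalog, brand="X", color="silver")
print([m["name"] for m in matches])  # ['laptop A']
```

Each additional facet narrows the result set further, which is how a user explores a collection step by step.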
Social search
• New phenomenon facilitated by recent Web technologies: collaborative social search, guided participation
Trends in Information Retrieval (cont’d.)
IR introduction
• Basic terminology, query and browsing modes, semantics, retrieval modes