Information Retrieval and Web Search

Uploaded by

mihlemaza03


INFORMATION RETRIEVAL AND WEB SEARCH
LECTURER: Ms. X.
Information Retrieval (IR) Concepts
Retrieval Models
Types of Queries in IR Systems
Text Preprocessing
Inverted Indexing
Evaluation Measures of Search Relevance
Web Search and Analysis
Trends in Information Retrieval
Information Retrieval (IR) Concepts

Information retrieval
• Process of retrieving documents from a collection in response to a query by a user

Introduction to information retrieval

• What is the distinction between structured and unstructured data?
• Information retrieval defined
• “Discipline that deals with the structure, analysis, organization, storage, searching, and retrieval of information”

User’s information need expressed as a free-form search request

• Keyword search query
• Query

IR systems characterized by:

• Types of users
• Types of data
• Types of information needed
• Levels of scale
Information Retrieval (IR) Concepts (cont’d.)

High noise-to-signal ratio

Enterprise search systems
• IR solutions for searching different entities in an enterprise’s intranet

Desktop search engines

• Retrieve files, folders, and different kinds of entities stored on the computer
Databases and IR Systems: A Comparison
Brief History of IR

Inverted file organization

• Based on keywords and their weights
• SMART system in 1960s

Text Retrieval Conference (TREC)

Search engine
• Application of information retrieval to large-scale document collections
• Crawler
• Responsible for discovering, analyzing, and indexing new documents
Modes of Interaction in IR Systems

Query
• Set of terms
• Used by searcher to specify information need

Main modes of interaction with IR systems:

• Retrieval
• Extraction of information from a repository of documents through an IR query
• Browsing
• User visiting or navigating through similar or related documents
Modes of Interaction in IR Systems (cont’d.)

Hyperlinks
• Used to interconnect Web pages
• Mainly used for browsing

Anchor texts
• Text phrases within documents used to label hyperlinks
• Very relevant to browsing

Web search
• Combines browsing and retrieval

Rank of a Webpage
• Measure of relevance to query that generated result set
Retrieval Models

Three main statistical models

• Boolean
• Vector space
• Probabilistic

Semantic model
Boolean Model

Documents represented as a set of terms

Form queries using standard Boolean logic set-theoretic operators
• AND, OR and NOT

Retrieval and relevance

• Binary concepts

Lacks sophisticated ranking algorithms
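The set-based view of the Boolean model can be illustrated with a minimal sketch. The toy collection and the helper name `matching` are assumptions made for illustration; they are not part of the slides.

```python
# Toy collection (assumed for illustration): each document is a set of terms.
docs = {
    1: {"information", "retrieval", "web"},
    2: {"database", "query", "web"},
    3: {"information", "database"},
}

def matching(term):
    """Return the ids of documents whose term set contains `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_ids = set(docs)

# information AND web -> set intersection
and_result = matching("information") & matching("web")
# information OR web -> set union
or_result = matching("information") | matching("web")
# information AND NOT database -> intersection with the complement
not_result = matching("information") & (all_ids - matching("database"))

print(and_result, or_result, not_result)
```

Each document either matches or does not match; the result sets carry no ordering, which is why the model offers no ranking.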

Vector Space Model

Documents
• Represented as features and weights in an n-dimensional vector space

Query
• Specified as a terms vector
• Compared to the document vectors for similarity/relevance assessment
Vector Space Model (cont’d.)

Different similarity functions can be used

• Cosine of the angle between the query and document vector commonly used

TF-IDF
• Statistical weight measure
• Used to evaluate the importance of a document word in a collection of documents

Rocchio algorithm
• Well-known relevance feedback algorithm
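TF-IDF weighting and cosine similarity can be sketched as follows. This is one common formulation (raw term frequency multiplied by log(N/df)); the toy documents and function names are assumptions, and real systems often use smoothed or normalized variants.

```python
import math
from collections import Counter

# Toy collection (assumed for illustration).
docs = [
    "web search engine",
    "database query engine",
    "web web crawler",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

def idf(term):
    """Inverse document frequency: log(N / df); 0 for unseen terms."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(n_docs / df) if df else 0.0

def tfidf_vector(tokens):
    """One weight per vocabulary term: raw term frequency times IDF."""
    counts = Counter(tokens)
    return [counts[t] * idf(t) for t in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

doc_vectors = [tfidf_vector(doc) for doc in tokenized]
query_vector = tfidf_vector("web crawler".split())
scores = [cosine(query_vector, dv) for dv in doc_vectors]
best = max(range(n_docs), key=lambda i: scores[i])
print(scores, best)
```

The query is treated exactly like a short document, so the same weighting applies to both sides of the comparison.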
Probabilistic Model

Probability ranking principle

• Decide whether the document belongs to the relevant set or the nonrelevant set for a query

Conditional probabilities calculated using Bayes’ Rule

BM25 (Best Match 25)
• Popular probabilistic ranking algorithm

Okapi system
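A minimal sketch of BM25 scoring, assuming the widely used form with parameters k1 and b and a non-negative IDF variant. The parameter values (k1 = 1.5, b = 0.75) and the toy documents are assumptions for illustration, not values given in the slides.

```python
import math
from collections import Counter

# Toy collection (assumed for illustration), already tokenized.
docs = [
    "web search engine ranking".split(),
    "database query processing".split(),
    "web crawler indexes web pages".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length
k1, b = 1.5, 0.75                      # assumed, commonly used settings

def idf(term):
    """Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1)."""
    n = sum(1 for d in docs if term in d)
    return math.log((N - n + 0.5) / (n + 0.5) + 1.0)

def bm25(query, doc):
    """Sum of per-term scores with term-frequency saturation and length normalization."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(term) * f * (k1 + 1) / denom
    return score

query = "web crawler".split()
scores = [bm25(query, d) for d in docs]
ranking = sorted(range(N), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```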
Semantic Model

Include different levels of analysis

• Morphological
• Syntactic
• Semantic

Knowledge-based IR systems
• Based on semantic models
• Cyc knowledge base
• WordNet
Types of Queries in IR Systems

Keywords
• Consist of words, phrases, and other characterizations of documents
• Used by IR system to build inverted index

Queries compared to set of index keywords

Most IR systems
• Allow use of Boolean and other operators to build a complex query
Keyword Queries

Simplest and most commonly used forms of IR queries

Keywords implicitly connected by a logical AND operator
Remove stopwords
• Most commonly occurring words
• a, the, of

IR systems do not pay attention to the ordering of these words in the query
Boolean Queries

AND: both terms must be found

OR: either term found
NOT: record containing keyword omitted
( ): used for nesting
+: equivalent to AND
–: equivalent to AND NOT
Document retrieved if query logically true as exact match in document
Phrase Queries

Phrases encoded in inverted index or implemented differently

Phrase generally enclosed within double quotes
More restricted and specific version of proximity searching
Proximity Queries

Accounts for how close within a record multiple terms should be to each other
Common option requires terms to be in the exact order
Various operator names
• NEAR, ADJ(adjacent), or AFTER

Computationally expensive
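Proximity and phrase matching can be sketched with a positional index, one possible implementation among several. The function name `near`, its parameters, and the toy documents are assumptions for illustration.

```python
from collections import defaultdict

# Toy collection (assumed for illustration).
docs = {
    1: "information retrieval on the web",
    2: "retrieval of web information",
}

# Positional index: term -> {doc_id: [positions of the term in that document]}
index = defaultdict(lambda: defaultdict(list))
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term][doc_id].append(pos)

def near(term_a, term_b, max_gap, ordered=False):
    """Ids of documents where the two terms occur within max_gap positions.

    With ordered=True, term_b must come after term_a.
    """
    hits = set()
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    for doc_id in postings_a.keys() & postings_b.keys():
        # Comparing every pair of positions is what makes this query type costly.
        for pa in postings_a[doc_id]:
            for pb in postings_b[doc_id]:
                gap = pb - pa
                if 0 < gap <= max_gap or (not ordered and 0 < -gap <= max_gap):
                    hits.add(doc_id)
    return hits

# A two-word phrase is the special case "adjacent and in order".
phrase_hits = near("information", "retrieval", max_gap=1, ordered=True)
near_hits = near("information", "retrieval", max_gap=3)
print(phrase_hits, near_hits)
```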
Wildcard Queries

Support regular expressions and pattern matching-based searching

• ‘Data*’ would retrieve data, database, datapoint, dataset

Involves preprocessing overhead

Not considered worth the cost by many Web search engines today
Retrieval models do not directly provide support for this query type
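A trailing-wildcard query such as 'Data*' can be sketched as a prefix lookup over a sorted vocabulary. This is only one simple approach and covers only the trailing-wildcard case; the vocabulary and the function name `prefix_matches` are assumptions.

```python
from bisect import bisect_left

# Sorted vocabulary of index terms (assumed for illustration).
vocabulary = sorted(
    ["data", "database", "datapoint", "dataset", "date", "index", "query"]
)

def prefix_matches(prefix):
    """Return all vocabulary terms that start with `prefix` (query 'prefix*')."""
    start = bisect_left(vocabulary, prefix)  # first term >= prefix
    matches = []
    for term in vocabulary[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can match
        matches.append(term)
    return matches

matches = prefix_matches("data")
print(matches)
```

Keeping the vocabulary sorted is part of the preprocessing overhead; each matched term then still needs its own postings lookup.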
Natural Language Queries

Few natural language search engines

Active area of research
Easier to answer questions
• Definition and factoid questions

Text Preprocessing
• Commonly used text preprocessing techniques

• Part of text processing task

Stopword Removal

Stopwords
• Very commonly used words in a language
• Expected to occur in 80 percent or more of the documents
• the, of, to, a, and, in, said, for, that, was, on, he, is, with, at, by, and it

Removal must be performed before indexing

Queries can be preprocessed for stopword removal
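A minimal sketch of stopword removal, using the example stopwords listed on this slide. Whitespace tokenization and lowercasing are simplifying assumptions.

```python
# Stopword list taken from the examples on the slide.
STOPWORDS = {
    "the", "of", "to", "a", "and", "in", "said", "for", "that",
    "was", "on", "he", "is", "with", "at", "by", "it",
}

def remove_stopwords(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

tokens = remove_stopwords("The history of the Web and the growth of search")
print(tokens)
```

The same function can be applied to documents before indexing and to queries at search time, so both sides use the same vocabulary.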
Stemming

Stem
• Word obtained after trimming the suffix and prefix of an original word

Reduces different forms of the word formed by inflection

Most famous stemming algorithm:
• Martin Porter’s stemming algorithm
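The idea of stemming can be illustrated with a deliberately naive suffix stripper. This is not Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list and minimum stem length here are assumptions, and prefixes are not handled.

```python
# Naive suffix list (assumed); checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["indexing", "indexed", "indexes", "index"]]
print(stems)
```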
Utilizing a Thesaurus

Thesaurus
• Precompiled list of important concepts and the main word that describes each
• Synonym converted to its matching concept during preprocessing
• Examples:
• UMLS
• Large biomedical thesaurus of concepts/meta concepts/relationships
• WordNet
• Manually constructed thesaurus that groups words into strict synonym sets
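Converting synonyms to a matching concept can be sketched with a small lookup table. The entries below are hypothetical and are not taken from UMLS or WordNet.

```python
# Hypothetical thesaurus: word -> main word for its concept.
THESAURUS = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "physician": "doctor",
    "doctor": "doctor",
}

def to_concepts(tokens):
    """Replace each token by its concept word; unknown tokens pass through."""
    return [THESAURUS.get(t, t) for t in tokens]

concepts = to_concepts(["auto", "repair", "physician"])
print(concepts)
```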
Other Preprocessing Steps: Digits, Hyphens, Punctuation Marks, Cases

Digits, dates, phone numbers, e-mail addresses, and URLs may or may not be removed during preprocessing
Hyphens and punctuation marks
• May be handled in different ways

Most information retrieval systems perform case-insensitive search

Text preprocessing steps language specific
Information Extraction

Generic term
Extracting structured content from text
Examples of IE tasks
Mostly used to identify contextually relevant features that involve text analysis, matching, and categorization
Inverted Indexing

Vocabulary
• Set of distinct query terms in the document set

Inverted index
• Data structure that associates each distinct term with a list of all documents that contain the term

Steps involved in inverted index construction
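One simple way to build such an index is sketched below: tokenize each document, record which documents contain each term, and sort the postings. The toy collection and function name are assumptions, and preprocessing such as stopword removal and stemming is omitted.

```python
from collections import defaultdict

# Toy collection (assumed for illustration).
docs = {
    1: "web search engine",
    2: "database search",
    3: "web crawler",
}

def build_inverted_index(documents):
    """Map each distinct term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(docs)
vocabulary = list(index)  # the distinct terms, in sorted order
print(index)
```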

Evaluation Measures of Search Relevance

Topical relevance
• Measures extent to which topic of a result matches topic of query

User relevance
• Describes “goodness” of a retrieved result with regard to user’s information need

Web information retrieval

• Must evaluate document ranking order
Recall and Precision

Recall
• Number of relevant documents retrieved by a search / Total number of existing relevant documents

Precision
• Number of relevant documents retrieved by a search / Total number of documents retrieved by that search
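The two ratios above can be computed directly from sets of document ids. The ids in the example are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 4 documents retrieved, 5 relevant documents exist, 2 in common.
precision, recall = precision_recall(
    retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7]
)
print(precision, recall)
```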
Recall and Precision (cont’d.)

Average precision
• Useful for computing a single precision value to compare different retrieval algorithms

Recall/precision curve
• Usually has a negative slope indicating inverse relationship between precision and recall

F-score
• Single measure that combines precision and recall to compare different result sets
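A minimal sketch of the F-score and average precision follows. The F-score is the weighted harmonic mean of precision and recall (F1 when beta = 1). For average precision, this sketch assumes the common definition that averages the precision at each rank where a relevant document appears and divides by the total number of relevant documents; other texts normalize differently.

```python
def f_score(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks of relevant documents."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant) if relevant else 0.0

f1 = f_score(0.5, 0.4)
ap = average_precision(ranked=[3, 1, 7, 2], relevant=[1, 2])
print(f1, ap)
```

Unlike plain precision, average precision depends on the order of the results, so it rewards rankings that place relevant documents early.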
Web Search and Analysis

Vertical search engines

• Topic-specific search engines

Metasearch engines
• Query different search engines simultaneously

Digital libraries
• Collections of electronic resources and services
Web Analysis and Its Relationship to IR

Goals of Web analysis:

• Improve and personalize search results relevance
• Identify trends

Classify Web analysis:

• Web content analysis
• Web structure analysis
• Web usage analysis
Searching the Web

Hyperlink components
• Destination page
• Anchor text

Hub
• Web page or a Website that links to a collection of prominent sites (authorities) on a common topic
Analyzing the Link Structure of Web Pages

The PageRank ranking algorithm

• Used by Google
• Highly linked pages are more important (have greater authority) than pages with fewer links
• Measure of query-independent importance of a page/node
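The idea behind PageRank can be sketched as a power iteration over a small link graph. This is a simplified textbook formulation, not a description of any production system; the damping factor of 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to (every target must be a key)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from the random jump.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph (assumed): A, B, and D all link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(ranks)
```

The scores depend only on the link graph, not on any query, which matches the query-independent notion of importance above.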

HITS Ranking Algorithm

• Contains two main steps: a sampling component and a weight-propagation component
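The weight-propagation step of HITS can be sketched as below, assuming the sampling step has already produced a small set of pages and links. The toy graph, the iteration count, and the normalization by Euclidean length are assumptions.

```python
import math

def hits(links, iterations=20):
    """links: page -> list of pages it links to (every target must be a key)."""
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                auth[target] += hub[page]
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links[p]) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Toy graph (assumed): H1 and H2 act as hubs, A1 and A2 as authorities.
links = {"H1": ["A1", "A2"], "H2": ["A1"], "A1": [], "A2": []}
hub, auth = hits(links)
print(hub, auth)
```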
Web Content Analysis

Structured data extraction

• Several approaches: writing a wrapper, manual extraction, wrapper induction, wrapper generation

Web information integration

• Web query interface integration and schema matching

Ontology-based information integration

• Single, multiple, and hybrid
Web Content Analysis (cont’d.)

Building concept hierarchies

• Documents in a search result are organized into groups in a hierarchical fashion

Segmenting Web pages and detecting noise

• Eliminate superfluous information such as ads and navigation
Approaches to Web Content Analysis

Agent-based approach categories

• Intelligent Web agents
• Information filtering/categorization
• Personalized Web agents

Database-based approach
• Infers the structure of a Website or transforms the Website to organize it as a database
Web Usage Analysis

Typically consists of three main phases:

• Preprocessing, pattern discovery, and pattern analysis

Pattern discovery techniques:

• Statistical analysis
• Association rules
• Clustering of users
• Establish groups of users exhibiting similar browsing patterns
Web Usage Analysis (cont’d.)

• Clustering of pages
• Pages with similar contents are grouped together
• Sequential patterns
• Dependency modeling
• Pattern modeling
Practical Applications of Web Analysis

Web analytics
• Understand and optimize the performance of Web usage

Web spamming
• Deliberate activity to promote a page by manipulating results returned by search engines

Web security
Alternate uses for Web crawlers
Trends in Information Retrieval

Faceted search
• Allows users to explore by filtering available information
• Facet
• Defines properties or characteristics of a class of objects

Social search
• New phenomenon facilitated by recent Web technologies: collaborative social search, guided participation
Trends in Information Retrieval (cont’d.)

Conversational search (CS)

• Interactive and collaborative information finding interaction
• Aided by intelligent agents
Summary

IR introduction
• Basic terminology, query and browsing modes, semantics, retrieval modes

Web search analysis

• Content, structure, usage
• Algorithms
• Current trends
