
INLS 509: Introduction to Information Retrieval

Jaime Arguello
[email protected]

January 8, 2014



Introductions

• Hello, my name is ______.

• However, I’d rather be called ______. (optional)

• I’m in the ______ program.

• I’m taking this course because I want to ______.



What is Information Retrieval?

• Information retrieval (IR) is the science and practice of developing and evaluating systems that match information seekers with the information they seek.



What is Information Retrieval?

• This course mainly focuses on search engines

• Given a query and a corpus, find relevant items

query: a user’s expression of their information need


corpus: a repository of retrievable items
relevance: satisfaction of the user’s information need



What is Information Retrieval?

• Gerard Salton, 1968:

Information retrieval is a field concerned with the structure, analysis, organization, storage, and retrieval of information.



Information Retrieval
structure



Information Retrieval
document structure



Information Retrieval
document structure

However, the main content of the page is in the form of natural language text, which has little structure that a computer can understand.



Information Retrieval
document structure

However, the main content of the page is in the form of natural language text, which has little structure that a computer can understand.

As it turns out, it’s not necessary for a computer to understand natural language text for it to determine that this document is likely to be relevant to a particular query (e.g., “Gerard Salton”).



Information Retrieval
collection structure



Information Retrieval
analysis: classification and information extraction



Information Retrieval
organization: cataloguing

http://www.dmoz.org



Information Retrieval
organization: cataloguing

http://www.dmoz.org



Information Retrieval
analysis and organization: reading-level



Information Retrieval
organization: recommendations

http://www.yelp.com/biz/cosmic-cantina-chapel-hill (not actual page)


Information Retrieval
storage

• How might a web search engine view these pages differently in terms of storage?



Information Retrieval
retrieval

• Efficiency: retrieving results in this lifetime (or, better yet, in 0.18 seconds)

• Effectiveness: retrieving results that satisfy the user’s information need (more on this later)

• We will focus more on effectiveness

• However, we will also discuss in some detail how search engines retrieve results as fast as they do
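The details come later in the semester; as a preview, here is a minimal sketch of the data structure that makes fast retrieval possible, an inverted index. The toy corpus and query are made up for illustration and are not from any real system.

    # A minimal inverted index: map each term to the set of documents containing it,
    # so a query can be answered without scanning the whole corpus.
    from collections import defaultdict

    docs = {
        1: "how to bathe a cat",
        2: "cat care and feeding",
        3: "bathe your dog safely",
    }

    index = defaultdict(set)            # term -> ids of documents containing the term
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)

    # Query processing: intersect the posting lists of the query terms.
    query = ["bathe", "cat"]
    print(set.intersection(*(index[t] for t in query)))   # {1}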



Many Types of Search Engines



Many Types of Search Engines



The Search Task

• Given a query and a corpus, find relevant items

query: user’s expression of their information need


corpus: a repository of retrievable items
relevance: satisfaction of the user’s information need



Search Engines
web search

query
corpus
results

web pages



Search Engines
digital library search

query
corpus
results

scientific publications



Search Engines
news search

query
corpus
results

news articles



Search Engines
local business search

query
corpus
results

curated/synthesized business listings



Search Engines
desktop search

query
corpus
results

files on my laptop



Search Engines
micro-blog search

query
corpus
results

tweets



Search Engines
people/profile search

query
corpus
results

profiles



Information Retrieval Tasks and Applications
digital library search
desktop search
web search
question-answering
enterprise search
federated search
news search
social search
local business search
expert search
image search
product search
video search
patent search
(micro-)blog search
recommender systems
community Q&A search
opinion mining


The Search Task

• Given a query and a corpus, find relevant items

query: user’s expression of their information need


corpus: a repository of retrievable items
relevance: satisfaction of the user’s information need



The Search Task
in this course
• Given a query and a corpus, find relevant items

query: user’s expression of their information need


‣ a textual description of what the user wants
corpus: a repository of retrievable items
‣ a collection of textual documents
relevance: satisfaction of the user’s information need
‣ the document contains information the user wants
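In code-skeleton form, the task looks roughly like the sketch below. The names are illustrative, and the scoring function is only a placeholder for the retrieval models covered later in the course (vector space model, language modeling, ...).

    # The search task: given a query and a corpus of textual documents,
    # return document ids ordered by descending predicted relevance.
    def score(query, document):
        # Placeholder relevance predictor: fraction of query terms present in the document.
        terms = query.lower().split()
        words = set(document.lower().split())
        return sum(t in words for t in terms) / len(terms)

    def search(query, corpus):
        # Rank every document in the corpus, best first.
        return sorted(corpus, key=lambda doc_id: score(query, corpus[doc_id]), reverse=True)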



Why is IR fascinating?
• Information retrieval is an uncertain process

‣ users don’t know what they want


‣ users don’t know how to convey what they want
‣ computers can’t elicit information like a librarian
‣ computers can’t understand natural language text
‣ the search engine can only guess what is relevant
‣ the search engine can only guess if a user is satisfied
‣ over time, we can only guess how users adjust their
short- and long-term behavior for the better


Queries and Relevance



Queries and Relevance
‣ soft surroundings ‣ broadstone raquet club
‣ trains interlocking dog sheets ‣ seadoo utopia
‣ belly dancing music ‣ seasons white plains condo
‣ christian dior large bag ‣ priority club.com
‣ best western airport sea tac ‣ aircat tools
‣ www.bajawedding.com ‣ epicurus evil
‣ marie selby botanical gardens ‣ instructions
‣ big chill down coats ‣ hinds county city of jackson
‣ www.magichat.co.uk ‣ last searches on aol a to z
‣ marie selby botanical gardens ‣ riverbank run

(AOL query-log)



Queries and Relevance

• A query is an impoverished description of the user’s information need
• Highly ambiguous to anyone other than the user



Queries and Relevance
the input to the system
• Query 435: curbing population growth

what is in the user’s head


• Description: What measures have been taken worldwide
and what countries have been effective in curbing
population growth? A relevant document must describe
an actual case in which population measures have been
taken and their results are known. Reduction measures
must have been actively pursued. Passive events such as
decease, which involuntarily reduce population, are not
relevant.
(from TREC 2005 HARD Track)


Queries and Relevance

• Query 435: curbing population growth

• Description: ???????????????????????????????? (unknown to the system; only the user knows what is in their head)

(from TREC 2005 HARD Track)




Queries and Relevance

• Query 435: curbing population growth

• Can we imagine a relevant document without all these query terms?



Queries and Relevance

• Query 435: curbing population growth

• The same concept can be expressed in different ways



Queries and Relevance

• Query 435: curbing population growth

• Can we imagine a non-relevant document with all these query terms?



Queries and Relevance

• Query 435: curbing population growth

• The query concept can have different “senses”



Queries and Relevance

• This is why IR is difficult (and fascinating!)

• Croft, Metzler, & Strohman:

Understanding how people compare text and designing computer algorithms to accurately perform this comparison is at the core of information retrieval.

• IR does not seek a deep “understanding” of the document text

• It uses statistical properties of the text to predict whether a document is relevant to a query

‣ easier and oftentimes sufficient
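As a small illustration of what “statistical properties” means in practice, the sketch below counts query-term occurrences for query 435 against two made-up snippets; no language understanding is involved, which is also why the approach can be fooled when the terms appear in a different sense.

    # Counting query terms: a purely statistical view of the text.
    query = "curbing population growth"

    doc_a = "china reported that measures for curbing population growth reduced the growth rate"
    doc_b = "the company is curbing spending while revenue growth slows and its user population shifts"

    def overlap(query, document):
        # Number of times the query terms occur in the document.
        terms = query.lower().split()
        words = document.lower().split()
        return sum(words.count(t) for t in terms)

    print(overlap(query, doc_a))   # 4 -- looks relevant, and is
    print(overlap(query, doc_b))   # 3 -- looks almost as good, but the terms are used in a different sense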


Predicting Relevance

• What types of evidence can we use to predict that a document is relevant to a query?

‣ query-document evidence: a property of the query-document pair (e.g., a measure of similarity)

‣ document evidence: a property of the document (same for all queries)



Query-Document Evidence

• Query: bathing a cat



Query-Document Evidence

• Query: bathing a cat

• The important query terms occur frequently
• Both terms occur

• Terms occur close together

• Terms occur in the title

• Terms occur in the URL (www.wikihow.com/bathe-your-cat)

• Any other ideas?



Query-Document Evidence

• Terms occur in hyperlinks pointing to the page

• Same language as query

• Other terms semantically related to query-terms (e.g., feline, wash)



Query-Document Evidence

• Does not contain “.com”

• [verb] [article] [noun]

• Not one of the most popular queries

• Does not contain the term “news”



Query-Document Evidence

• We can also use previous user interactions, e.g.:

‣ The query is similar to other queries associated with clicks on this document

‣ The document is similar to other documents associated with clicks for this query
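A rough sketch of how a few of these query-document signals might be computed from raw text. The feature functions and the toy example are illustrative only, not any particular engine’s implementation, and the query is simplified to “bathe cat”.

    # Toy query-document evidence: every value depends on both the query and the document.
    def query_document_features(query, title, body, url):
        q = query.lower().split()
        words = body.lower().split()
        title_words = set(title.lower().split())
        positions = {t: [i for i, w in enumerate(words) if w == t] for t in q}
        return {
            "term_frequency": sum(len(p) for p in positions.values()),   # how often query terms occur
            "all_terms_present": all(positions[t] for t in q),           # do all terms occur?
            "terms_in_title": sum(t in title_words for t in q),          # terms in the title
            "terms_in_url": sum(t in url.lower() for t in q),            # terms in the URL
        }

    features = query_document_features(
        "bathe cat",
        title="How to Bathe Your Cat",
        body="fill the tub before you bathe the cat because a calm cat is easier to bathe",
        url="www.wikihow.com/bathe-your-cat",
    )
    # {'term_frequency': 4, 'all_terms_present': True, 'terms_in_title': 2, 'terms_in_url': 2}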



Document Evidence

• Lots of in-links (endorsements)
• Non-spam properties:

‣ grammatical sentences
‣ no profanity
• Has good formatting

• Any other ideas?



Document Evidence
• Author attributes

• Peer-reviewed by many

• Reading-level appropriate for the user community

• Has pictures

• Recently modified (fresh)

• Normal length

• From a domain with other high-quality documents
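Document evidence can be sketched the same way; these values are computed once per document and are the same for every query. The link count and modification date would come from crawling, the profanity list is left empty as a placeholder, and all of it is illustrative rather than a real system.

    # Toy document-only evidence: none of these values depend on the query.
    import datetime

    PROFANITY = set()   # a real system would load an actual word list here

    def document_features(body, in_links, last_modified):
        words = body.lower().split()
        return {
            "in_links": in_links,                            # endorsements from other pages
            "length": len(words),                            # unusually short or long pages are suspect
            "days_since_modified": (datetime.date.today() - last_modified).days,   # freshness
            "has_profanity": any(w in PROFANITY for w in words),                   # crude quality check
        }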



Predicting Relevance

• IR does not require a deep “understanding” of information

• We can get by using shallow sources of evidence, which can be generated from the query-document pair or just the document itself.
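One simple way to turn shallow evidence into a single prediction is a weighted combination of the features; the sketch below continues the earlier toy features, and the weights are invented for illustration (real systems tune or learn them from data).

    # Combine query-document and document evidence into one relevance score.
    WEIGHTS = {
        "term_frequency": 1.0,
        "all_terms_present": 2.0,
        "terms_in_title": 1.5,
        "terms_in_url": 1.5,
        "in_links": 0.01,
        "has_profanity": -5.0,
    }

    def predict_relevance(features):
        # Weighted sum of whatever features are available; unknown features get weight 0.
        return sum(WEIGHTS.get(name, 0.0) * float(value) for name, value in features.items())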



The Search Task

• Output: a ranking of items in descending order of predicted relevance (simplifies the task)

• Assumption: the user scans the results from top to bottom and stops when he/she is satisfied or gives up



Evaluating a Ranking

• So, how good is a particular ranking?

• Suppose we know which documents are truly relevant to the query...


Evaluating a Ranking

A B

• Which ranking is better?



Evaluating a Ranking

A B

• In general, a ranking with all the relevant documents at the top is best (A is better than B)


Evaluating a Ranking

A B

• Which ranking is better?



Evaluating a Ranking

A B

• Oftentimes the (relative) quality of a ranking is unclear and depends on the task


Evaluating a Ranking

A B

• Web search: ??????



Evaluating a Ranking

A B

• Web search: A is better than B


• Many documents (redundantly) satisfy the user; the higher the first relevant document, the better



Evaluating a Ranking

A B

• Patent search: ??????



Evaluating a Ranking

A B

• Patent search: B is better than A


• The user wants to see everything in the corpus that is related to the query (high cost of missing something)



Evaluating a Ranking

A B

• Exploratory search: ??????



Evaluating a Ranking

A B

• Exploratory search: A is better than B


• Satisfying the information need requires information found in different documents



Evaluating a Ranking
evaluation metrics

• Given a ranking with known relevant/non-relevant documents, an evaluation metric outputs a quality score
• Many, many metrics

• Different metrics make different assumptions

• Choosing the “right one” requires understanding the task

• Often, we use several (sanity check)
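As a small illustration of how different metrics encode different assumptions, the sketch below defines three common ones over a toy ranking of document ids: precision at k, recall, and reciprocal rank. Roughly speaking, a web searcher cares most about something like reciprocal rank and a patent searcher about recall; the example rankings and relevance judgments are made up.

    # Three toy evaluation metrics over a ranked list of document ids.
    def precision_at_k(ranking, relevant, k):
        # Fraction of the top k results that are relevant.
        return sum(d in relevant for d in ranking[:k]) / k

    def recall(ranking, relevant):
        # Fraction of all relevant documents retrieved anywhere in the ranking.
        return sum(d in relevant for d in ranking) / len(relevant)

    def reciprocal_rank(ranking, relevant):
        # 1 / rank of the first relevant result (0 if none is retrieved).
        for rank, d in enumerate(ranking, start=1):
            if d in relevant:
                return 1.0 / rank
        return 0.0

    relevant = {2, 5, 7}
    print(precision_at_k([2, 9, 5, 1, 4], relevant, k=3))   # 0.67 (2 of the top 3 are relevant)
    print(recall([2, 9, 5, 1, 4], relevant))                # 0.67 (document 7 was missed)
    print(reciprocal_rank([9, 1, 2, 5, 7], relevant))       # 0.33 (first relevant result at rank 3)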



Summary
• The goal of information retrieval is to match information-seekers with the information they seek.
• IR involves analysis, organization, storage, and retrieval

• There are many types of search engines

• There is uncertainty at every step of the search process

• Simple heuristics don’t work, so IR systems make predictions about relevance!
• IR systems use “superficial” evidence to make predictions

• Users expect different things, depending on the task

• Evaluation requires understanding the user community.

• My goal is to convince you that IR is a fascinating science




Course Overview
Jaime Arguello
[email protected]

January 8, 2014



Course Objectives

• How do search engines work?

‣ effectiveness and efficiency


• How do users behave with them?

‣ how do users determine usefulness of information?


‣ how can a search engine mimic this process?
• Why do search engines fail?

‣ the user? the corpus? the system? something else?


• How can they be evaluated (off-line)?

• How can they be monitored and tuned (on-line)?




Why are these important questions?
• Most of the world’s information is in natural language text
‣ the world wide web
‣ scientific publications
‣ books
‣ social media interactions

• The amount of this information is growing quickly; human capacity is not (evolution doesn’t move that fast)
• We need smarter tools

• IR provides tools for analyzing and organizing content to facilitate search, discovery, and learning



Course Structure

• Information retrieval is an interdisciplinary problem

a spectrum: from people who want to understand people, through people who care about information retrieval, to people who want to understand how computers can solve problems

• We need to understand both ends of the spectrum



Course Structure

• IR: computer-based solutions to a human problem

the system (first half of the semester) and the user (second half of the semester)

• Understanding IR systems requires math!



Road Map
• Introduction to ad-hoc retrieval
‣ controlled vocabularies
‣ full-text indexing
• Boolean retrieval
• Indexing and query processing
• Statistical Properties of Text
• Document Representation
• Retrieval Models
‣ vector space model
‣ language modeling
‣ others (depending on how quickly we progress)


Road Map
• Evaluation

‣ test-collection construction
‣ evaluation metrics
‣ experimentation
‣ user studies
‣ search-log analysis
• Studies of search behavior

• Federated Search

• Clustering

• Text Classification



Grading
• 30% homework

‣ 10% each
• 15% midterm

• 15% final exam

• 30% literature review

‣ 5% proposal
‣ 10% presentation
‣ 15% paper

• 10% (and chocolates) participation



Grading for Graduate Students

• H: 95-100%

• P: 80-94%

• L: 60-79%

• F: 0-59%



Grading for Undergraduate Students
• A+: 97-100%
• A: 94-96%
• A-: 90-93%
• B+: 87-89%
• B: 84-86%
• B-: 80-83%
• C+: 77-79%
• C: 74-76%
• C-: 70-73%
• D+: 67-69%
• D: 64-66%
• D-: 60-63%
• F: <= 59%



Homework vs. Midterm vs. Final

• The homework will be challenging. It should be: you have more time for it than you do for the midterm or the final.



Literature Review
• See description on the syllabus

• Form groups of 2 or 3

• Choose an IR task (next slide)

• Write a short proposal (mostly for feedback)

• Review the literature

‣ not just the different solutions to the problem


‣ the best solutions to the problem!
• Write a paper (~30 pages double-spaced)

• Make a presentation

‣ 10-minute presentation + 5 minutes of Q&A



Literature Review
example tasks
• Personalized information retrieval

• Session-based information retrieval

• Clustering of search results

• Book search

• Multimedia search (over items not inherently associated with text)
• Social-media data for forecasting and event-detection

• Query-log analysis for forecasting and event-detection

• Faceted search

• Federated search



Literature Review
tips

• Be thorough
• Be scientific

‣ don’t focus on the writing of the papers you review


‣ focus on the science (the method and the evaluation)
• Be constructive
• Contribute new insight and structure

‣ your literature review shouldn’t read like a “list”


‣ connect dots that haven’t been connected
• Say what you think!



Course Tips
• Work hard

• Do the assigned readings

• Do other readings

• Be patient and have reasonable expectations

‣ you’re not supposed to understand everything we cover in class during class
• Seek help sooner rather than later

‣ office hours: Manning 305, T/Th 9:30-10:30am


‣ questions via email
• Keep laptop usage to a minimum (live in the present)


Questions?

