Intro To IRE


Introduction to Information Retrieval

CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 1: Boolean retrieval

Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

Unstructured (text) vs. structured (database) data in 1996

Unstructured (text) vs. structured (database) data in 2009

Sec. 1.1

Unstructured data in 1680
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is that not the answer?
  Slow (for large corpora)
  NOT Calpurnia is non-trivial
  Other operations (e.g., find the word Romans near countrymen) not feasible
  Ranked retrieval (best documents to return)
    Later lectures

Sec. 1.1

Term-document incidence

Brutus AND Caesar BUT NOT Calpurnia

1 if play contains word, 0 otherwise

Sec. 1.1

Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
110100 AND 110111 AND 101111 = 100100.
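
The bitwise computation above can be sketched in a few lines of Python. The three bit vectors are the ones on the slide. The order of the plays is an assumption taken from the textbook's example matrix, since the matrix itself is not reproduced here.

```python
# Plays in column order (leftmost bit first). This ordering is an assumption
# based on the textbook's example matrix; the slide text does not list it.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# 0/1 incidence vectors from the slide, one bit per play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # its 6-bit complement is 101111

# Complement Calpurnia within 6 bits, then AND the three vectors.
mask = (1 << len(PLAYS)) - 1
result = brutus & caesar & (~calpurnia & mask)

# Decode the set bits of the result back into play titles.
matches = [play for i, play in enumerate(PLAYS)
           if result & (1 << (len(PLAYS) - 1 - i))]

print(format(result, "06b"))  # 100100
print(matches)
```

Python integers have no fixed width, so the complement is masked to six bits before the AND.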

Sec. 1.1

Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Sec. 1.1

Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task

The classic search model
TASK
  | Misconception?
Info Need: Info about removing mice without killing them
  | Mistranslation?
Verbal form
  | Misformulation?
Query: mouse trap
  |
SEARCH ENGINE <- Corpus
  |
Results -> Query Refinement -> Query

Sec. 1.1

How good are the retrieved docs?
Precision: Fraction of retrieved docs that are relevant to the user's information need
Recall: Fraction of relevant docs in collection that are retrieved
More precise definitions and measurements to follow in later lectures
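
As a rough illustration of the two definitions, the sketch below computes both measures from sets of docIDs. The function name `precision_recall` and the example docIDs are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 docs retrieved, 3 of them relevant,
# out of 6 relevant docs in the collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```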

Sec. 1.1

Bigger collections
Consider N = 1 million documents, each with about 1000 words.
Avg 6 bytes/word including spaces/punctuation
  6GB of data in the documents.
Say there are M = 500K distinct terms among these.

Sec. 1.1

Can't build the matrix
500K x 1M matrix has half-a-trillion 0's and 1's.
But it has no more than one billion 1's. Why?
  matrix is extremely sparse.
What's a better representation?
  We only record the 1 positions.
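
The numbers on this slide and the previous one follow from simple arithmetic, sketched below. The variable names are illustrative. The bound on the number of 1's assumes each word occurrence can set at most one matrix cell.

```python
N = 1_000_000          # documents
WORDS_PER_DOC = 1_000  # about 1000 words each
BYTES_PER_WORD = 6     # average, including spaces/punctuation
M = 500_000            # distinct terms

corpus_bytes = N * WORDS_PER_DOC * BYTES_PER_WORD  # 6 * 10**9 bytes, i.e. 6GB
matrix_cells = M * N                               # 5 * 10**11, half a trillion

# Each word occurrence sets at most one (term, document) cell to 1,
# so the number of 1's is bounded by the total number of word occurrences.
max_ones = N * WORDS_PER_DOC                       # 10**9, one billion
max_density = max_ones / matrix_cells              # at most 0.2% of the cells

print(corpus_bytes, matrix_cells, max_ones, max_density)
```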

Sec. 1.2

Inverted index
For each term t, we must store a list of all documents that contain t.
  Identify each by a docID, a document serial number
Can we use fixed-size arrays for this?

Brutus      1 2 4 11 31 45 173 174
Caesar      1 2 4 5 6 16 57 132
Calpurnia   2 31 54 101

What happens if the word Caesar is added to document 14?

Sec. 1.2

Inverted index
We need variable-size postings lists
  On disk, a continuous run of postings is normal and best
  In memory, can use linked lists or variable length arrays
    Some tradeoffs in size/ease of insertion

Dictionary   Postings
Brutus       1 2 4 11 31 45 173 174
Caesar       1 2 4 5 6 16 57 132
Calpurnia    2 31 54 101

Each docID entry in a list is a posting.
Postings are sorted by docID (more later on why).
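
A minimal in-memory sketch of this structure, assuming a Python dict for the dictionary and variable-length lists for the postings. The helper `add_posting` is a hypothetical name. It shows one way to handle the earlier question of adding Caesar to document 14 while keeping the list sorted by docID.

```python
from bisect import bisect_left

# Dictionary: term -> postings list (docIDs sorted in increasing order).
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, keeping them sorted and duplicate-free."""
    postings = index.setdefault(term, [])
    pos = bisect_left(postings, doc_id)  # binary search for the insertion point
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

add_posting(index, "Caesar", 14)
print(index["Caesar"])  # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```

A Python list behaves like a variable-length array: finding the position is fast, but the insertion shifts later entries, which is one of the size/ease-of-insertion tradeoffs mentioned above.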

Sec. 1.2

Inverted index construction
Documents to be indexed: Friends, Romans, countrymen.
  -> Tokenizer
Token stream: Friends Romans Countrymen
  -> Linguistic modules (more on these later)
Modified tokens: friend roman countryman
  -> Indexer
Inverted index:
friend       2 4
roman        1 2
countryman   13 16

Sec. 1.2

Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

Sec. 1.2

Indexer steps: Sort
Sort by terms
  And then docID
Core indexing step

Sec. 1.2

Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings
Doc. frequency information is added.
  Why frequency? Will discuss later.
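
The three indexer steps can be sketched end to end on Doc 1 and Doc 2. This is an illustration under stated assumptions, not the course's indexer: the `tokenize` helper is hypothetical and only lowercases and splits on non-letters, with no stemming or other linguistic processing.

```python
import re
from itertools import groupby

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Assumption: lowercase, keep runs of letters and apostrophes; no stemming.
    return re.findall(r"[a-z']+", text.lower())

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge repeated entries, then split into a dictionary
# (term -> document frequency) and postings (term -> sorted docIDs).
dictionary, postings = {}, {}
for term, group in groupby(pairs, key=lambda pair: pair[0]):
    doc_ids = sorted({doc_id for _, doc_id in group})
    postings[term] = doc_ids
    dictionary[term] = len(doc_ids)

print(dictionary["caesar"], postings["caesar"])  # 2 [1, 2]
print(dictionary["killed"], postings["killed"])  # 1 [1]
```

The term killed occurs twice in Doc 1 but yields a single posting, which is the merge of multiple entries described above.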

Sec. 1.2

Where do we pay in storage?
Lists of docIDs
Terms and counts
Pointers

Later in the course: How do we index efficiently? How much storage do we need?

Sec. 1.3

The index we just built
How do we process a query? (Today's focus)
  Later: what kinds of queries can we process?

Sec. 1.3

Query processing: AND
Consider processing the query: Brutus AND Caesar
Locate Brutus in the Dictionary;
  Retrieve its postings.
Locate Caesar in the Dictionary;
  Retrieve its postings.
"Merge" the two postings:

Brutus   2 4 8 16 32 64 128
Caesar   1 2 3 5 8 13 21 34

Sec. 1.3

The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries

Brutus   2 4 8 16 32 64 128
Caesar   1 2 3 5 8 13 21 34

If list lengths are x and y, merge takes O(x+y) operations.
Crucial: postings sorted by docID.

Intersecting two postings lists (a "merge" algorithm)
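
The algorithm figure for this slide did not survive extraction. The sketch below is a standard two-pointer intersection of sorted lists, written in Python, and is not necessarily identical to the slide's pseudocode. The function name `intersect` is an assumption.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each loop iteration advances at least one pointer, so the work is bounded by the combined length of the two lists. This relies on both lists being sorted by docID.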

Sec. 1.3

Boolean queries: Exact match
The Boolean retrieval model is being able to ask a query that is a Boolean expression:
  Boolean Queries use AND, OR and NOT to join query terms
    Views each document as a set of words
    Is precise: document matches condition or not.
  Perhaps the simplest model to build an IR system on
Primary commercial retrieval tool for 3 decades.
Many search systems you still use are Boolean:
  Email, library catalog, Mac OS X Spotlight

Sec. 1.4

Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
Tens of terabytes of data; 700,000 users
Majority of users still use boolean queries
Example query:
  What is the statute of limitations in cases involving the federal tort claims act?
  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
    /3 = within 3 words, /S = in same sentence

Sec. 1.4

Example: WestLaw http://www.westlaw.com/
Another example query:
  Requirements for disabled people to be able to access a workplace
  disabl! /p access! /s work-site work-place (employment /3 place)
Note that SPACE is disjunction, not conjunction!
Long, precise queries; proximity operators; incrementally developed; not like web search
Many professional searchers still like Boolean search
  You know exactly what you are getting
But that doesn't mean it actually works better.

Sec. 1.3

Boolean queries: More general merges
Exercise: Adapt the merge for the queries:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)? What can we achieve?
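
One possible approach to the first query, offered as a sketch and not as the official solution: the same two-pointer walk, keeping docIDs that occur only in the first list. The function name `and_not` is an assumption.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that are not in p2; both lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p1[i] cannot appear later in p2
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # present in both lists: exclude it
            j += 1
        else:
            j += 1                # skip smaller docIDs in p2
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(and_not(brutus, caesar))  # [4, 16, 32, 64, 128]
```

This walk still takes O(x+y) steps. Brutus OR NOT Caesar is different in kind: its answer includes every docID outside Caesar's list, so the output alone can be as large as the whole collection.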

Sec. 1.3

Merging
What about an arbitrary Boolean formula?
  (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in "linear" time?
  Linear in what?
Can we do better?

Sec. 1.3

Query optimization
What is the best order for query processing?
Consider a query that is an AND of n terms.
For each of the n terms, get its postings, then AND them together.

Brutus      2 4 8 16 32 64 128
Caesar      1 2 3 5 8 16 21 34
Calpurnia   13 16

Query: Brutus AND Calpurnia AND Caesar

Sec. 1.3

Query optimization example
Process in order of increasing freq:
  start with smallest set, then keep cutting further.
  This is why we kept document freq. in dictionary

Brutus      2 4 8 16 32 64 128
Caesar      1 2 3 5 8 16 21 34
Calpurnia   13 16

Execute the query as (Calpurnia AND Brutus) AND Caesar.
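
The ordering heuristic can be sketched as follows. For simplicity, this sketch uses the length of each postings list as the document frequency, whereas a real index would read it from the dictionary. The names `intersect` and `and_query` are assumptions.

```python
def intersect(p1, p2):
    """Two-pointer intersection of postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_query(index, terms):
    """AND the postings of all terms, processing the rarest term first."""
    ordered = sorted(terms, key=lambda term: len(index[term]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:  # an empty intermediate result cannot grow again
            break
        result = intersect(result, index[term])
    return ordered, result

index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
order, result = and_query(index, ["Brutus", "Calpurnia", "Caesar"])
print(order)   # ['Calpurnia', 'Brutus', 'Caesar']
print(result)  # [16]
```

Starting with the shortest list keeps every intermediate result no longer than that list, so later intersections have less to scan.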

Sec. 1.3

More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
Get doc. freq.'s for all terms.
Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
Process in increasing order of OR sizes.
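
A small sketch of this estimate. The document frequencies below are invented purely for illustration and do not come from any real collection. The helper name `estimated_size` is also an assumption.

```python
# Hypothetical document frequencies, made up for this example only.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 40, "strife": 120}

# The query as a conjunction of OR groups.
query = [("madding", "crowd"), ("ignoble", "strife")]

def estimated_size(or_group):
    """Upper bound on the size of an OR: the sum of its terms' doc. frequencies."""
    return sum(doc_freq[term] for term in or_group)

# Process the OR groups in increasing order of estimated size.
ordered = sorted(query, key=estimated_size)
print(ordered)  # [('ignoble', 'strife'), ('madding', 'crowd')]
```

The sum is conservative because documents containing both terms of an OR are counted twice, so the true result can only be smaller.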

Exercise
Recommend a query processing order for

(tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

Query processing exercises
Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?
Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.

Exercise
Try the search feature at http://www.rhymezone.com/shakespeare/
Write down five search features you think it could do better

What's ahead in IR? Beyond term search
What about phrases?
  Stanford University
Proximity: Find Gates NEAR Microsoft.
  Need index to capture position information in docs.
Zones in documents: Find documents with (author = Ullman) AND (text contains automata).

Evidence accumulation
1 vs. 0 occurrence of a search term
  2 vs. 1 occurrence
  3 vs. 2 occurrences, etc.
  Usually more seems better
Need term frequency information in docs

Ranking search results
Boolean queries give inclusion or exclusion of docs.
Often we want to rank/group results
  Need to measure proximity from query to each doc.
  Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.

IR vs. databases: Structured vs unstructured data
Structured data tends to refer to information in "tables"

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
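
To make the contrast with text search concrete, here is the example query evaluated over the table above, with plain Python records standing in for a database. The field names are assumptions chosen to mirror the column headers.

```python
employees = [
    {"employee": "Smith", "manager": "Jones", "salary": 50000},
    {"employee": "Chang", "manager": "Smith", "salary": 60000},
    {"employee": "Ivy",   "manager": "Smith", "salary": 50000},
]

# Salary < 60000 AND Manager = Smith: a numerical range plus an exact text match.
matches = [row["employee"] for row in employees
           if row["salary"] < 60000 and row["manager"] == "Smith"]
print(matches)  # ['Ivy']
```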

Unstructured data
Typically refers to free-form text
Allows
  Keyword queries including operators
  More sophisticated "concept" queries, e.g.,
    find all web pages dealing with drug abuse
Classic model for searching text documents

Semi-structured data
In fact almost no data is "unstructured"
E.g., this slide has distinctly identified zones such as the Title and Bullets
Facilitates "semi-structured" search such as
  Title contains data AND Bullets contain search
... to say nothing of linguistic structure

More sophisticated semi-structured search
Title is about Object Oriented Programming AND Author something like stro*rup
where * is the wild-card operator
Issues:
  how do you process "about"?
  how do you rank results?
The focus of XML search (IIR chapter 10)

Clustering, classification and ranking
Clustering: Given a set of docs, group them into clusters based on their contents.
Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
Ranking: Can we learn how to best order a set of documents, e.g., a set of search results

The web and its challenges
Unusual and diverse documents
Unusual and diverse users, queries, information needs
Beyond terms, exploit ideas from social networks
  link analysis, clickstreams ...
How do search engines work? And how can we make them better?

More sophisticated information retrieval
Cross-language information retrieval
Question answering
Summarization
Text mining

Course details
Course URL: cs276.stanford.edu
  [a.k.a., http://www.stanford.edu/class/cs276/]
Work/Grading:
  Problem sets (2): 20%
  Practical exercises (2): 10% + 20% = 30%
  Midterm: 20%
  Final: 30%
Textbook:
  Introduction to Information Retrieval
    In bookstore and online (http://informationretrieval.org/)
    We're happy to get comments/corrections/feedback on it!

Course staff
Professor: Pandu Nayak [email protected]
Professor: Prabhakar Raghavan [email protected]
TAs: Sonali Aggarwal, Sandeep Sripada, Valentin Spitkovsky
In general, don't use the above addresses, but:
  Newsgroup: su.class.cs276 [preferred]
  [email protected]

Resources for today's lecture
Introduction to Information Retrieval, chapter 1
Shakespeare:
  http://www.rhymezone.com/shakespeare/
  Try the neat browse by keyword sequence feature!
Managing Gigabytes, chapter 3.2
Modern Information Retrieval, chapter 8.2
Any questions?
