Intro To IRE
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 1: Boolean retrieval
Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
Unstructured (text) vs. structured (database) data in 1996
Unstructured (text) vs. structured (database) data in 2009
Sec. 1.1
Unstructured data in 1680
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Why is that not the answer?
  Slow (for large corpora)
  NOT Calpurnia is non-trivial
  Other operations (e.g., find the word Romans near countrymen) not feasible
  Ranked retrieval (best documents to return)
    Later lectures
Sec. 1.1
Term-document incidence
Sec. 1.1
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them.
110100 AND 110111 AND 101111 = 100100.
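The bitwise AND on this slide can be checked with a few lines of Python. This is only a sketch: the three vectors come from the slide (Calpurnia is written uncomplemented, so that its complement is 101111), while the variable names and the `N_DOCS` constant are assumptions made for illustration.

```python
# Incidence vectors from the slide, one bit per play (6 plays).
N_DOCS = 6
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # uncomplemented; its complement is 101111

# Complement within N_DOCS bits, then AND the three vectors together.
mask = (1 << N_DOCS) - 1
not_calpurnia = ~calpurnia & mask
answer = brutus & caesar & not_calpurnia

print(format(not_calpurnia, "06b"))  # 101111
print(format(answer, "06b"))         # 100100
```

Each 1 bit in the answer marks a play that satisfies the query.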
Sec. 1.1
Answers to query
Antony and Cleopatra, Act III, Scene ii
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain.
Sec. 1.1
Basic assumptions of Information Retrieval
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
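The classic search model
[Figure: TASK -> Info Need -> Verbal form -> Query (e.g., mouse trap) -> SEARCH ENGINE, which searches the Corpus and returns Results; Results feed Query Refinement back into the Query. "Misconception?" marks the step from TASK to Info Need, and "Misformulation?" marks the step from Verbal form to Query.]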
Sec. 1.1
How good are the retrieved docs?
Precision: Fraction of retrieved docs that are relevant to the user's information need
Recall: Fraction of relevant docs in the collection that are retrieved
More precise definitions and measurements to follow in later lectures
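As a rough illustration of the two fractions, here is a minimal Python sketch. The function name `precision_recall` and the docIDs in the example are hypothetical; the more precise definitions promised for later lectures may differ in detail.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of docIDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 docs retrieved, 3 of them relevant,
# out of 6 relevant docs in the whole collection.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(p, r)  # 0.75 0.5
```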
Sec. 1.1
Bigger collections
Consider N = 1 million documents, each with about 1000 words.
Avg 6 bytes/word including spaces/punctuation
  6 GB of data in the documents.
Say there are M = 500K distinct terms among these.
Sec. 1.1
Can't build the matrix
A 500K x 1M matrix has half a trillion 0's and 1's.
But it has no more than one billion 1's. Why?
  The matrix is extremely sparse.
What's a better representation?
  We only record the 1 positions.
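The numbers on this slide and the previous one follow from simple arithmetic, sketched below in Python. The constants are the slide's figures; the variable names are assumptions. The bound on the number of 1's holds because each word occurrence can set at most one matrix cell.

```python
N = 1_000_000         # documents
WORDS_PER_DOC = 1000  # approximate words per document
BYTES_PER_WORD = 6    # average, including spaces/punctuation
M = 500_000           # distinct terms

corpus_bytes = N * WORDS_PER_DOC * BYTES_PER_WORD
matrix_cells = M * N
# Each word occurrence can set at most one cell to 1,
# so the number of 1's is bounded by the total token count.
max_ones = N * WORDS_PER_DOC

print(corpus_bytes)             # 6000000000 (6 GB)
print(matrix_cells)             # 500000000000 (half a trillion)
print(max_ones / matrix_cells)  # 0.002
```

At most 0.2% of the cells can be 1, which is why recording only the 1 positions pays off.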
Sec. 1.2
Inverted index
For each term t, we must store a list of all documents that contain t.
  Identify each by a docID, a document serial number
Can we use fixed-size arrays for this?
Brutus     ->  1  2  4  11  31  45  173  174
Caesar     ->  1  2  4  5  6  16  57  132
Calpurnia  ->  2  31  54  101
Sec. 1.2
Inverted index
We need variable-size postings lists
  On disk, a continuous run of postings is normal and best
  In memory, can use linked lists or variable length arrays
    Some tradeoffs in size/ease of insertion
[Figure: the Dictionary of terms, each entry pointing to its Postings list; each docID in a list is a Posting.]
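One way to picture the in-memory option is a dictionary that maps each term to a variable-length array of docIDs. The Python sketch below is an assumption-laden illustration, not the course's implementation: the postings are the Brutus/Caesar/Calpurnia lists from the example, while `add_posting` and the inserted docID 14 are hypothetical.

```python
import bisect

# Dictionary: term -> postings list (docIDs in increasing order).
index = {
    "Brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into term's postings, keeping them sorted and unique."""
    postings = index.setdefault(term, [])
    pos = bisect.bisect_left(postings, doc_id)
    if pos == len(postings) or postings[pos] != doc_id:
        postings.insert(pos, doc_id)

# Hypothetical update: the word Caesar is added to document 14.
add_posting(index, "Caesar", 14)
print(index["Caesar"])  # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```

A Python list grows as needed, which is the property a fixed-size array lacks; the cost is that inserting in the middle shifts the later entries.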
Sec. 1.2
Inverted index construction
[Figure: Documents to be indexed are tokenized (e.g., Countrymen), the tokens are normalized into modified tokens (e.g., roman, countryman), and the indexer builds the Inverted index of postings for each modified token.]
Sec. 1.2
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious
Sec. 1.2
Indexer steps: Sort
Sort by terms
  And then docID
Core indexing step
Sec. 1.2
Indexer steps: Dictionary & Postings
Multiple term entries in a single document are merged.
Split into Dictionary and Postings
Doc. frequency information is added.
  Why frequency? Will discuss later.
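The three indexer steps (token sequence, sort, dictionary and postings) can be sketched end to end in Python on the two example documents. The `tokenize` function and its normalization (lowercasing, keeping letters and apostrophes) are assumptions for illustration; the slides do not specify how tokens are modified.

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

def tokenize(text):
    # Assumed normalization: lowercase, keep letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

# Step 1: sequence of (modified token, docID) pairs.
pairs = [(tok, doc_id) for doc_id, text in docs.items() for tok in tokenize(text)]

# Step 2: sort by term, then by docID.
pairs.sort()

# Step 3: merge duplicates, split into dictionary (doc. freq.) and postings.
postings = defaultdict(list)
for term, doc_id in pairs:
    if not postings[term] or postings[term][-1] != doc_id:
        postings[term].append(doc_id)
doc_freq = {term: len(plist) for term, plist in postings.items()}

print(postings["caesar"], doc_freq["caesar"])  # [1, 2] 2
print(postings["killed"], doc_freq["killed"])  # [1] 1
```

Because the pairs are sorted before merging, each postings list comes out in docID order, and the document frequency is just the length of the list.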
Sec. 1.2
Where do we pay in storage?
Lists of docIDs
Terms and counts
Pointers
Later in the course:
  How do we index efficiently?
  How much storage do we need?
Sec. 1.3
The index we just built
How do we process a query? (Today's focus)
  Later: what kinds of queries can we process?
Sec. 1.3
Query processing: AND
Consider processing the query: Brutus AND Caesar
  Locate Brutus in the Dictionary;
    Retrieve its postings.
  Locate Caesar in the Dictionary;
    Retrieve its postings.
Sec. 1.3
The merge
Walk through the two postings simultaneously, in time linear in the total number of postings entries
Brutus  ->  2  4  8  16  32  64
Caesar  ->  1  2  3  5  8  13
If list lengths are x and y, merge takes O(x+y) operations.
Crucial: postings sorted by docID.
Intersecting two postings lists (a "merge" algorithm)
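The merge can be sketched in Python as a two-pointer walk over both lists. This is an illustrative version, not necessarily the pseudocode shown on the original slide; the function name `intersect` and the use of Python lists are assumptions, and the example postings are the ones from the merge slide.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(x + y) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64]
caesar = [1, 2, 3, 5, 8, 13]
print(intersect(brutus, caesar))  # [2, 8]
```

Every loop iteration advances at least one pointer, so the number of steps is bounded by x + y. The walk is only correct because both lists are sorted by docID.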
Sec. 1.3
Boolean queries: Exact match
The Boolean retrieval model is being able to ask a query that is a Boolean expression:
  Boolean Queries use AND, OR and NOT to join query terms
    Views each document as a set of words
    Is precise: document matches condition or not.
  Perhaps the simplest model to build an IR system on
Primary commercial retrieval tool for 3 decades.
Many search systems you still use are Boolean:
  Email, library catalog, Mac OS X Spotlight
Sec. 1.4
Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992)
Tens of terabytes of data; 700,000 users
Majority of users still use boolean queries
Example query:
  What is the statute of limitations in cases involving the federal tort claims act?
  LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
    /3 = within 3 words, /S = in same sentence
Sec. 1.4
Example: WestLaw http://www.westlaw.com/
Another example query:
  Requirements for disabled people to be able to access a workplace
  disabl! /p access! /s work-site work-place (employment /3 place)
But that doesn't mean it actually works better.
Sec. 1.3
Boolean queries: More general merges
Exercise: Adapt the merge for the queries:
  Brutus AND NOT Caesar
  Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)? What can we achieve?
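One possible approach to the first query is sketched below in Python; it is offered as a hint rather than as the official solution, and the function name `and_not` is an assumption. The same two-pointer walk works, except that a docID from the first list is kept only when the second list has moved past it without matching.

```python
def and_not(p1, p2):
    """Return docIDs in p1 that are not in p2; both lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])  # p2 cannot contain p1[i] any more
            i += 1
        elif p1[i] == p2[j]:
            i += 1                # excluded: docID is in both lists
            j += 1
        else:
            j += 1                # catch p2 up to p1[i]
    return answer

brutus = [2, 4, 8, 16, 32, 64]
caesar = [1, 2, 3, 5, 8, 13]
print(and_not(brutus, caesar))  # [4, 16, 32, 64]
```

This stays within O(x+y). Brutus OR NOT Caesar is different in kind: its answer includes every document that does not contain Caesar, so the result size can be proportional to the whole collection, not just to the two postings lists.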
Sec. 1.3
Merging
What about an arbitrary Boolean formula?
  (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)
Can we always merge in "linear" time?
  Linear in what?
Can we do better?
Sec. 1.3
Query optimization
What is the best order for query processing?
Consider a query that is an AND of n terms.
For each of the n terms, get its postings, then AND them together.
Brutus     ->  2  4  8  16  32  64  128
Caesar     ->  1  2  3  5  8  16  21  34
Calpurnia  ->  13  16
Query: Brutus AND Calpurnia AND Caesar
Sec. 1.3
Query optimization example
Process in order of increasing freq:
  start with smallest set, then keep cutting further.
  This is why we kept document freq. in dictionary
Brutus     ->  2  4  8  16  32  64  128
Caesar     ->  1  2  3  5  8  16  21  34
Calpurnia  ->  13  16
Execute the query as (Calpurnia AND Brutus) AND Caesar.
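The ordering heuristic can be sketched in Python as follows. The postings are the three lists from this example; `intersect_many` is a hypothetical helper, and using the length of each in-memory list as its document frequency is an assumption (a real dictionary would store the frequency so the lists need not be read first).

```python
def intersect(p1, p2):
    """Two-pointer intersection of postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_many(postings_lists):
    """AND together several postings lists, shortest first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for plist in ordered[1:]:
        if not result:
            break  # nothing left to match, stop early
        result = intersect(result, plist)
    return result

index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 16, 21, 34],
    "Calpurnia": [13, 16],
}
order = sorted(index, key=lambda term: len(index[term]))
print(order)                                 # ['Calpurnia', 'Brutus', 'Caesar']
print(intersect_many(list(index.values())))  # [16]
```

Starting from the shortest list keeps every intermediate result no longer than that list, which bounds the work of the later merges.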
Sec. 1.3
More general optimization
e.g., (madding OR crowd) AND (ignoble OR strife)
Get doc. freq.'s for all terms.
Estimate the size of each OR by the sum of its doc. freq.'s (conservative).
Process in increasing order of OR sizes.
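A minimal Python sketch of this ordering rule is given below. The document frequencies are invented purely for illustration and do not come from any real collection; `order_or_groups` is a hypothetical helper name.

```python
def order_or_groups(groups, doc_freq):
    """Order OR-groups by the sum of their terms' document frequencies."""
    return sorted(groups, key=lambda group: sum(doc_freq[t] for t in group))

# Hypothetical document frequencies, invented for illustration only.
doc_freq = {"madding": 10, "crowd": 500, "ignoble": 5, "strife": 40}
groups = [("madding", "crowd"), ("ignoble", "strife")]

print(order_or_groups(groups, doc_freq))
# [('ignoble', 'strife'), ('madding', 'crowd')]
```

The sum is an upper bound on the size of each OR result, since a document containing both terms is counted twice; that is the sense in which the estimate is conservative.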
Exercise
Recommend a query processing order for
Query processing exercises
Exercise: If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen?
Exercise: Extend the merge to an arbitrary Boolean query. Can we always guarantee execution in time linear in the total postings size?
Hint: Begin with the case of a Boolean formula query where each term appears only once in the query.
Exercise
Try the search feature at http://www.rhymezone.com/shakespeare/
Write down five search features you think it could do better
What's ahead in IR? Beyond term search
What about phrases?
  Stanford University
Proximity: Find Gates NEAR Microsoft.
  Need index to capture position information in docs.
Zones in documents: Find documents with (author = Ullman) AND (text contains automata).
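To make the need for position information concrete, here is a small Python sketch of a proximity check over a positional index. Everything in it is an assumption for illustration: the index layout (term -> docID -> word positions), the function name `near`, the window parameter `k`, and the example data.

```python
def near(pos_index, t1, t2, k):
    """Return docIDs where t1 and t2 occur within k word positions."""
    docs1 = pos_index.get(t1, {})
    docs2 = pos_index.get(t2, {})
    hits = []
    # Only documents containing both terms can match.
    for doc_id in sorted(docs1.keys() & docs2.keys()):
        if any(abs(p1 - p2) <= k
               for p1 in docs1[doc_id] for p2 in docs2[doc_id]):
            hits.append(doc_id)
    return hits

# Hypothetical positional index: term -> {docID: [word positions]}.
pos_index = {
    "gates":     {1: [3, 40], 2: [7]},
    "microsoft": {1: [5], 2: [90], 3: [1]},
}
print(near(pos_index, "gates", "microsoft", k=3))  # [1]
```

The pairwise comparison of positions is the naive version, chosen for brevity; a plain docID-only index could not answer this query at all, because it does not record where in a document each term occurs.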
Evidence accumulation
1 vs. 0 occurrence of a search term
  2 vs. 1 occurrence
  3 vs. 2 occurrences, etc.
  Usually more seems better
Need term frequency information in docs
Ranking search results
Boolean queries give inclusion or exclusion of docs.
Often we want to rank/group results
  Need to measure proximity from query to each doc.
  Need to decide whether docs presented to user are singletons, or a group of docs covering various aspects of the query.
IR vs. databases: Structured vs unstructured data
Structured data tends to refer to information in "tables"
Employee  Manager  Salary
Smith     Jones    50000
Chang     Smith    60000
Ivy       Smith    50000
Typically allows numerical range and exact match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.
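The example query can be run against the three-row table with a short Python sketch. Representing rows as dictionaries is an assumption made for illustration; a database would express the same condition in its own query language.

```python
employees = [
    {"Employee": "Smith", "Manager": "Jones", "Salary": 50000},
    {"Employee": "Chang", "Manager": "Smith", "Salary": 60000},
    {"Employee": "Ivy",   "Manager": "Smith", "Salary": 50000},
]

# Numerical range on Salary combined with exact match on Manager.
matches = [row["Employee"] for row in employees
           if row["Salary"] < 60000 and row["Manager"] == "Smith"]
print(matches)  # ['Ivy']
```

Every row either satisfies the condition or does not; there is no notion of a partial or ranked match.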
Unstructured data
Typically refers to free-form text
Allows
  Keyword queries including operators
  More sophisticated "concept" queries, e.g.,
    find all web pages dealing with drug abuse
Classic model for searching text documents
Semi-structured data
In fact almost no data is "unstructured"
E.g., this slide has distinctly identified zones such as the Title and Bullets
Facilitates "semi-structured" search such as
  Title contains data AND Bullets contain search
... to say nothing of linguistic structure
More sophisticated semi-structured search
Title is about Object Oriented Programming AND Author something like stro*rup
where * is the wild-card operator
Issues:
  how do you process "about"?
  how do you rank results?
The focus of XML search (IIR chapter 10)
Clustering, classification and ranking
Clustering: Given a set of docs, group them into clusters based on their contents.
Classification: Given a set of topics, plus a new doc D, decide which topic(s) D belongs to.
Ranking: Can we learn how to best order a set of documents, e.g., a set of search results?
The web and its challenges
Unusual and diverse documents
Unusual and diverse users, queries, information needs
Beyond terms, exploit ideas from social networks
  link analysis, clickstreams ...
How do search engines work? And how can we make them better?
More sophisticated information retrieval
Cross-language information retrieval
Question answering
Summarization
Text mining
Course details
Course URL: cs276.stanford.edu
  [a.k.a., http://www.stanford.edu/class/cs276/]
Work/Grading:
  Problem sets (2): 20%
  Practical exercises (2): 10% + 20% = 30%
  Midterm: 20%
  Final: 30%
Textbook:
  Introduction to Information Retrieval
    In bookstore and online (http://informationretrieval.org/)
    We're happy to get comments/corrections/feedback on it!
Course staff
Professor: Pandu Nayak [email protected]
Professor: Prabhakar Raghavan [email protected]
Resources for today's lecture
Introduction to Information Retrieval, chapter 1
Shakespeare:
  http://www.rhymezone.com/shakespeare/
  Try the neat browse by keyword sequence feature!
Managing Gigabytes, chapter 3.2
Modern Information Retrieval, chapter 8.2
Any questions?