
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Christopher Manning and Prabhakar Raghavan
Lecture 6: Scoring, Term Weighting and the Vector Space Model
1

Recap of lecture 5

Collection and vocabulary statistics: Heaps' and Zipf's laws
Dictionary compression for Boolean indexes: dictionary string, blocks, front coding
Postings compression: gap encoding, prefix-unique codes, Variable Byte and Gamma codes

                                       Size (MB)
collection (text, xml markup etc)        3,600.0
collection (text)                          960.0
term-document incidence matrix          40,000.0
postings, uncompressed (32-bit words)      400.0
postings, uncompressed (20 bits)           250.0
postings, variable byte encoded            116.0
postings, gamma-encoded                    101.0

2

This lecture; IIR Sections 6.2–6.4.3

Ranked retrieval
Scoring documents
Term frequency
Collection statistics
Weighting schemes
Vector space scoring

3

Ranked retrieval (Ch. 6)

Thus far, our queries have all been Boolean.
Documents either match or don't.

Good for expert users with a precise understanding of their needs and the collection.
Also good for applications: applications can easily consume 1000s of results.

Not good for the majority of users.
Most users are incapable of writing Boolean queries (or they are, but they think it's too much work).
Most users don't want to wade through 1000s of results.
This is particularly true of web search.
4

Problem with Boolean search: feast or famine (Ch. 6)

Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" → 200,000 hits
Query 2: "standard user dlink 650 no card found" → 0 hits
It takes a lot of skill to come up with a query that produces a manageable number of hits.
AND gives too few; OR gives too many.

5

Ranked retrieval models

Rather than a set of documents satisfying a query expression, in ranked retrieval models the system returns an ordering over the (top) documents in the collection with respect to a query.
Free text queries: rather than a query language of operators and expressions, the user's query is just one or more words in a human language.
In principle these are two separate choices, but in practice ranked retrieval models have normally been associated with free text queries, and vice versa.
6

Feast or famine: not a problem in ranked retrieval (Ch. 6)

When a system produces a ranked result set, large result sets are not an issue.
Indeed, the size of the result set is not an issue.
We just show the top k (≈10) results.
We don't overwhelm the user.
Premise: the ranking algorithm works.

7

Scoring as the basis of ranked retrieval (Ch. 6)

We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score, say in [0, 1], to each document.
This score measures how well document and query match.

8

Query-document matching scores (Ch. 6)

We need a way of assigning a score to a query/document pair.
Let's start with a one-term query.
If the query term does not occur in the document: the score should be 0.
The more frequent the query term in the document, the higher the score (should be).
We will look at a number of alternatives for this.

9

Take 1: Jaccard coefficient (Ch. 6)

Recall from Lecture 3: a commonly used measure of the overlap of two sets A and B:
jaccard(A, B) = |A ∩ B| / |A ∪ B|
jaccard(A, A) = 1
jaccard(A, B) = 0 if A ∩ B = ∅
A and B don't have to be the same size.
Always assigns a number between 0 and 1.
10

Jaccard coefficient: scoring example (Ch. 6)

What is the query-document match score that the Jaccard coefficient computes for each of the two documents below?
Query: "ides of march"
Document 1: "caesar died in march"
Document 2: "the long march"
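To make the answer concrete, here is a minimal Python sketch (mine, not from the slides) that treats query and document as word sets:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient of the word sets of two texts."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

query = "ides of march"
print(jaccard(query, "caesar died in march"))  # 1/6 ~ 0.17 (overlap: {march})
print(jaccard(query, "the long march"))        # 1/5 = 0.20 (overlap: {march})
```

Note that the shorter Document 2 scores higher on the same one-word overlap, which leads into the issues on the next slide.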

11

Issues with Jaccard for scoring (Ch. 6)

It doesn't consider term frequency (how many times a term occurs in a document).
Rare terms in a collection are more informative than frequent terms. Jaccard doesn't consider this information.
We need a more sophisticated way of normalizing for length.
Later in this lecture, we'll use |A ∩ B| / √(|A| × |B|) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.
12

Recall (Lecture 1): Binary term-document incidence matrix (Sec. 6.2)

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony          1          1       0        0       0        1
Brutus          1          1       0        1       0        0
Caesar          1          1       0        1       1        1
Calpurnia       0          1       0        0       0        0
Cleopatra       1          0       0        0       0        0
mercy           1          0       1        1       1        1
worser          1          0       1        1       1        0

Each document is represented by a binary vector ∈ {0,1}^|V|.
13

Term-document count matrices (Sec. 6.2)

Consider the number of occurrences of a term in a document:
Each document is a count vector in ℕ^|V|: a column below.

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony         157         73      0        0       0        0
Brutus           4        157      0        1       0        0
Caesar         232        227      0        2       1        1
Calpurnia        0         10      0        0       0        0
Cleopatra       57          0      0        0       0        0
mercy            2          0      3        5       5        1
worser           2          0      1        1       1        0
14

Bag of words model

The vector representation doesn't consider the ordering of words in a document.
"John is quicker than Mary" and "Mary is quicker than John" have the same vectors.
This is called the bag of words model.
In a sense, this is a step back: the positional index was able to distinguish these two documents.
We will look at recovering positional information later in this course.
For now: bag of words model.
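A one-line check (illustrative sketch) that the two sentences above get identical bag-of-words vectors:

```python
from collections import Counter

d1 = "John is quicker than Mary"
d2 = "Mary is quicker than John"

# A bag of words keeps only term counts; word order is discarded.
print(Counter(d1.lower().split()) == Counter(d2.lower().split()))  # True
```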
15

Term frequency tf

The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
We want to use tf when computing query-document match scores. But how?
Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.

NB: frequency = count in IR
16

Log frequency weighting (Sec. 6.2)

The log frequency weight of term t in d is

    w_{t,d} = 1 + log10(tf_{t,d})   if tf_{t,d} > 0
            = 0                     otherwise

tf 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.

Score for a document-query pair: sum over terms t in both q and d:

    score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})

The score is 0 if none of the query terms is present in the document.
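A small sketch (illustrative; the tokenization is naive) of this scoring rule:

```python
import math
from collections import Counter

def log_tf_score(query: str, doc: str) -> float:
    """Sum of (1 + log10 tf) over query terms that occur in the document."""
    tf = Counter(doc.lower().split())
    return sum(1 + math.log10(tf[t])
               for t in set(query.lower().split()) if tf[t] > 0)

print(log_tf_score("caesar brutus", "caesar died caesar lived"))  # 1 + log10(2) ~ 1.3
```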
17

Document frequency (Sec. 6.2.1)

Rare terms are more informative than frequent terms.
Recall stop words.
Consider a term in the query that is rare in the collection (e.g., arachnocentric).
A document containing this term is very likely to be relevant to the query "arachnocentric".
We want a high weight for rare terms like arachnocentric.

18

Document frequency, continued (Sec. 6.2.1)

Frequent terms are less informative than rare terms.
Consider a query term that is frequent in the collection (e.g., high, increase, line).
A document containing such a term is more likely to be relevant than a document that doesn't.
But it's not a sure indicator of relevance.
For frequent terms, we want high positive weights for words like high, increase, and line,
but lower weights than for rare terms.
We will use document frequency (df) to capture this.
19

idf weight (Sec. 6.2.1)

df_t is the document frequency of t: the number of documents that contain t.
df_t is an inverse measure of the informativeness of t.
df_t ≤ N.

We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

We use log(N/df_t) instead of N/df_t to dampen the effect of idf.
It will turn out that the base of the log is immaterial.

20

idf example, suppose N = 1 million (Sec. 6.2.1)

    idf_t = log10(N / df_t)

term              df_t    idf_t
calpurnia            1        6
animal             100        4
sunday           1,000        3
fly             10,000        2
under          100,000        1
the          1,000,000        0

There is one idf value for each term t in a collection.
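The idf values above follow directly from the formula; a short sketch reproduces the table:

```python
import math

N = 1_000_000
for term, df in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                 ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:>9}  df={df:>9,}  idf={math.log10(N / df):.0f}")
```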
21

Effect of idf on ranking

Does idf have an effect on ranking for one-term queries, like "iPhone"?
idf has no effect on ranking one-term queries.
idf affects the ranking of documents for queries with at least two terms.
For the query "capricious person", idf weighting makes occurrences of "capricious" count for much more in the final document ranking than occurrences of "person".

22

Collection vs. document frequency (Sec. 6.2.1)

The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.
Example:

Word        Collection frequency    Document frequency
insurance                  10440                  3997
try                        10422                  8760

Which word is a better search term (and should get a higher weight)?
23

tf-idf weighting (Sec. 6.2.2)

The tf-idf weight of a term is the product of its tf weight and its idf weight:

    w_{t,d} = (1 + log10 tf_{t,d}) × log10(N / df_t)

Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign!
Alternative names: tf.idf, tf x idf.

Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.


24

Final ranking of documents for a query (Sec. 6.2.2)

    Score(q, d) = Σ_{t ∈ q ∩ d} tf-idf_{t,d}
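Combining the tf and idf pieces, a sketch of the whole scoring function (illustrative; the df table and its values here are hypothetical, not from the slides):

```python
import math
from collections import Counter

def tfidf_score(query: str, doc: str, df: dict, N: int) -> float:
    """Score(q, d) = sum over terms in both q and d of (1 + log10 tf) * log10(N / df)."""
    tf = Counter(doc.lower().split())
    return sum((1 + math.log10(tf[t])) * math.log10(N / df[t])
               for t in set(query.lower().split())
               if tf[t] > 0 and df.get(t))

# Hypothetical document frequencies in a 1M-document collection.
df = {"capricious": 100, "person": 100_000}
print(tfidf_score("capricious person", "a capricious person met a person", df, 1_000_000))
```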

25

Binary → count → weight matrix (Sec. 6.3)

            Antony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra   Caesar  Tempest
Antony        5.25       3.18     0        0       0       0.35
Brutus        1.21       6.1      0        1       0       0
Caesar        8.59       2.54     0        1.51    0.25    0
Calpurnia     0          1.54     0        0       0       0
Cleopatra     2.85       0        0        0       0       0
mercy         1.51       0        1.9      0.12    5.25    0.88
worser        1.37       0        0.11     4.15    0.25    1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|.
26

Documents as vectors (Sec. 6.3)

So we have a |V|-dimensional vector space.
Terms are axes of the space.
Documents are points or vectors in this space.
Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine.
These are very sparse vectors; most entries are zero.

27

Queries as vectors (Sec. 6.3)

Key idea 1: do the same for queries: represent them as vectors in the space.
Key idea 2: rank documents according to their proximity to the query in this space.
proximity = similarity of vectors
proximity ≈ inverse of distance
Recall: we do this because we want to get away from the you're-either-in-or-out Boolean model.
Instead: rank more relevant documents higher than less relevant documents.
28

Formalizing vector space proximity (Sec. 6.3)

First cut: distance between two points
(= distance between the end points of the two vectors)
Euclidean distance?
Euclidean distance is a bad idea...
...because Euclidean distance is large for vectors of different lengths.

29

Why distance is a bad idea (Sec. 6.3)

[Figure] The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

30

Use angle instead of distance (Sec. 6.3)

Thought experiment: take a document d and append it to itself. Call this document d′.
Semantically, d and d′ have the same content.
The Euclidean distance between the two documents can be quite large.
The angle between the two documents is 0, corresponding to maximal similarity.
Key idea: rank documents according to angle with query.
31

From angles to cosines (Sec. 6.3)

The following two notions are equivalent:
Rank documents in decreasing order of the angle between query and document.
Rank documents in increasing order of cosine(query, document).

Cosine is a monotonically decreasing function on the interval [0°, 180°].

32

From angles to cosines (Sec. 6.3)

But how, and why, should we be computing cosines?


33

Length normalization (Sec. 6.3)

A vector can be (length-) normalized by dividing each of its components by its length; for this we use the L2 norm:

    ||x||_2 = √(Σ_i x_i²)

Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
Effect on the two documents d and d′ (d appended to itself) from the earlier slide: they have identical vectors after length normalization.
Long and short documents now have comparable weights.
34

cosine(query, document) (Sec. 6.3)

    cos(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

q · d is the dot product of q and d; q/|q| and d/|d| are unit vectors.
q_i is the tf-idf weight of term i in the query.
d_i is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
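The formula transcribes directly into code (a sketch, assuming dense vectors of equal dimension):

```python
import math

def cosine(q, d):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    q_len = math.sqrt(sum(qi * qi for qi in q))
    d_len = math.sqrt(sum(di * di for di in d))
    return dot / (q_len * d_len)

print(cosine([1.0, 2.0], [2.0, 4.0]))  # 1.0: same direction, different lengths
```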
35

Cosine for length-normalized vectors

For length-normalized vectors, cosine similarity is simply the dot product (or scalar product):

    cos(q, d) = q · d = Σ_{i=1}^{|V|} q_i d_i

for q, d length-normalized.

36

Cosine similarity illustrated

[Figure]

37

Cosine similarity amongst 3 documents (Sec. 6.3)

How similar are the novels
SaS: Sense and Sensibility,
PaP: Pride and Prejudice, and
WH: Wuthering Heights?

Term frequencies (counts):

term        SaS   PaP   WH
affection   115    58   20
jealous      10     7   11
gossip        2     0    6
wuthering     0     0   38

Note: to simplify this example, we don't do idf weighting.

38

3 documents example contd. (Sec. 6.3)

Log frequency weighting:

term        SaS    PaP    WH
affection   3.06   2.76   2.30
jealous     2.00   1.85   2.04
gossip      1.30   0      1.78
wuthering   0      0      2.58

After length normalization:

term        SaS     PaP     WH
affection   0.789   0.832   0.524
jealous     0.515   0.555   0.465
gossip      0.335   0       0.405
wuthering   0       0       0.588

cos(SaS, PaP) ≈ 0.789×0.832 + 0.515×0.555 + 0.335×0 + 0×0 ≈ 0.94
cos(SaS, WH) ≈ 0.79
cos(PaP, WH) ≈ 0.69
Why do we have cos(SaS, PaP) > cos(SaS, WH)?
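These numbers can be reproduced from the raw counts with a short sketch (illustrative):

```python
import math

counts = {  # rows: affection, jealous, gossip, wuthering
    "SaS": [115, 10, 2, 0],
    "PaP": [58, 7, 0, 0],
    "WH":  [20, 11, 6, 38],
}

def log_normalize(v):
    w = [1 + math.log10(x) if x > 0 else 0.0 for x in v]
    norm = math.sqrt(sum(x * x for x in w))
    return [x / norm for x in w]

vec = {name: log_normalize(v) for name, v in counts.items()}
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
print(dot(vec["SaS"], vec["PaP"]))  # ~0.94
print(dot(vec["SaS"], vec["WH"]))   # ~0.79
print(dot(vec["PaP"], vec["WH"]))   # ~0.69
```

SaS and PaP concentrate their weight on the same terms (affection, jealous), while much of WH's weight sits on wuthering, which the other two lack.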
39

Computing cosine scores (Sec. 6.3)

[Figure: the CosineScore algorithm from IIR Figure 6.14]
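A minimal Python sketch of the same term-at-a-time idea, with hypothetical in-memory stand-ins for the postings lists and document lengths (not the book's API):

```python
import heapq

def cosine_score(query_weights, postings, length, k=10):
    """Term-at-a-time cosine scoring.

    query_weights: {term: w_tq}               (query term weights)
    postings:      {term: [(doc_id, w_td)]}   (per-term document weights)
    length:        {doc_id: L2 norm of the document vector}
    """
    scores = {}
    for term, w_tq in query_weights.items():
        for doc_id, w_td in postings.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + w_td * w_tq
    for doc_id in scores:
        scores[doc_id] /= length[doc_id]  # cosine (length) normalization
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```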

40

tf-idf weighting has many variants (Sec. 6.4)

[Figure: the SMART table of tf, df, and normalization variants]
Columns headed "n" are acronyms for weight schemes.
Why is the base of the log in idf immaterial?
41

Weighting may differ in queries vs documents (Sec. 6.4)

Many search engines allow for different weightings for queries vs. documents.
SMART notation: denotes the combination in use in an engine, with the notation ddd.qqq, using the acronyms from the previous table.
A very standard weighting scheme is: lnc.ltc
Document: logarithmic tf (l as first character), no idf, and cosine normalization.
Query: logarithmic tf (l in leftmost column), idf (t in second column), cosine normalization.

A bad idea?

42

tf-idf example: lnc.ltc (Sec. 6.4)

Document: "car insurance auto insurance"
Query: "best car insurance"

                 Query                                Document                 Prod
Term        tf-raw  tf-wt     df  idf   wt  n'lize    tf-raw  tf-wt   wt  n'lize
auto             0      0   5000  2.3    0       0         1      1    1    0.52      0
best             1      1  50000  1.3  1.3    0.34         0      0    0       0      0
car              1      1  10000  2.0  2.0    0.52         1      1    1    0.52   0.27
insurance        1      1   1000  3.0  3.0    0.78         2    1.3  1.3    0.68   0.53

Exercise: what is N, the number of docs?
Doc length = √(1² + 0² + 1² + 1.3²) ≈ 1.92
Score = 0 + 0 + 0.27 + 0.53 = 0.8
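A sketch (illustrative) that reproduces the table's bottom line:

```python
import math

def log_tf(tf):
    return 1 + math.log10(tf) if tf > 0 else 0.0

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Terms: auto, best, car, insurance
q_tf = [0, 1, 1, 1]
d_tf = [1, 0, 1, 2]
idf  = [2.3, 1.3, 2.0, 3.0]

q = l2_normalize([log_tf(tf) * w for tf, w in zip(q_tf, idf)])  # ltc query
d = l2_normalize([log_tf(tf) for tf in d_tf])                   # lnc document
print(round(sum(qi * di for qi, di in zip(q, d)), 2))           # 0.8
```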
43

Summary: vector space ranking

Represent the query as a weighted tf-idf vector.
Represent each document as a weighted tf-idf vector.
Compute the cosine similarity score for the query vector and each document vector.
Rank documents with respect to the query by score.
Return the top K (e.g., K = 10) to the user.

44

Resources for today's lecture (Ch. 6)

IIR 6.2–6.4.3
http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
Term weighting and cosine similarity tutorial for SEO folk!

45
