Lecture 4
• Zipf Distribution
• Statistical Dependence
• Information retrieval (support querying)
• Other applications
– Clustering
– Information extraction
–…
[Figure: decision tree examples. Node and leaf labels: Fever, Flu, Strep, Cold, None; branches yes / no; branches <=30, 30..40, >40, overcast with yes / no leaves; one node marked "not pure"]
Each leaf node represents a class/decision
Information content of Xi
Probability mass function of X
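The formula that these two labels annotate did not survive extraction. Assuming it is the standard Shannon entropy with base-2 logarithms (an assumption, not recovered from the slide), it would read:

```latex
I(x_i) = -\log_2 P(x_i), \qquad
H(X) = \sum_{i} P(x_i)\, I(x_i) = -\sum_{i} P(x_i) \log_2 P(x_i)
```

Here P is the probability mass function of X and I(x_i) is the information content of outcome x_i.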
Dr. Yi, C., Tsinghua SEM 14
Feature Selection Measure: Information Gain
◼ Consider a set S of 10 documents, seven of class A and three of class B
◼ p(A) = 7/10 = 0.7
◼ entropy (S)
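The value of entropy(S) is cut off on the slide. As a minimal sketch, assuming the usual base-2 Shannon entropy over class proportions (the function name `entropy` is only illustrative), the figure for seven documents of class A and three of class B can be computed as:

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as raw counts."""
    total = sum(counts)
    # Skip empty classes: 0 * log2(0) is taken as 0 by convention.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 7 documents of class A, 3 of class B
h = entropy([7, 3])
print(round(h, 3))  # 0.881
```

A pure set (all documents in one class) gives 0 bits and a 5/5 split gives 1 bit, so 0.881 indicates that S is fairly mixed.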
[Figure: decision tree with branches <=30, 30..40, >40, overcast and yes / no leaves]
The proportion of correct decisions
Information Life Cycle Management
• Creation
• Collection / Capture
• Organization / Indexing
• Storage / Retrieval
• Distribution / Dissemination
• Reuse / Leverage
[Figure: retrieval process. Labels: Text Operations, Index, Searching, retrieved docs, Ranking, ranked docs, Text Database, Potentially Relevant Documents]
What is a good structure for index?
Documents as rows:
docs   t1  t2  t3
D1     1   0   1
D2     1   0   0
D3     0   1   1
D4     1   0   0
D5     1   1   1
D6     1   1   0
D7     0   1   0
D8     0   1   0
D9     0   0   1
D10    0   1   1

Terms as rows:
Terms  D1  D2  D3  D4  D5  D6  D7  …
t1     1   1   0   1   1   1   0
t2     0   0   1   0   1   1   1
t3     1   0   1   0   1   0   0
Inverted Files
Each word of the original documents is listed with its document IDs:
W1: d1, d2, d3
W2: d2, d4, d7, d9
Wn: di, … dn
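To connect the 0/1 matrix to the inverted-file layout, here is a small sketch that turns the docs-by-terms table (D1 to D10, t1 to t3) into one list of document IDs per term. The function name `invert` and the dictionary representation are illustrative choices, not something the slides prescribe.

```python
# Rows copied from the docs x terms table (columns t1, t2, t3).
doc_term = {
    "D1": [1, 0, 1], "D2": [1, 0, 0], "D3": [0, 1, 1], "D4": [1, 0, 0],
    "D5": [1, 1, 1], "D6": [1, 1, 0], "D7": [0, 1, 0], "D8": [0, 1, 0],
    "D9": [0, 0, 1], "D10": [0, 1, 1],
}
terms = ["t1", "t2", "t3"]

def invert(doc_term, terms):
    """Map each term to the list of documents whose row has a 1 for it."""
    inverted = {t: [] for t in terms}
    for doc, row in doc_term.items():  # dicts keep insertion order (Python 3.7+)
        for term, present in zip(terms, row):
            if present:
                inverted[term].append(doc)
    return inverted

inverted = invert(doc_term, terms)
for term, docs in inverted.items():
    print(f"{term}: {', '.join(docs)}")
```

The inverted lists store only the 1 entries, which is why they tend to be more compact than the full matrix when most terms are absent from most documents.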
Tokens with document ID (Doc 1, Doc 2):
token    ID
for      1
all      1
good     1
men      1
the      1
aid      1
of       1
their    1
country  1
it       2
was      2
a        2
dark     2
Sorted alphabetically (left: parse order, right: sorted):
term     ID    term      ID
to       1     in        2
the      1     is        1
aid      1     it        2
of       1     manor     2
their    1     men       1
country  1     midnight  2
it       2     night     2
was      2     now       1
a        2     of        1
dark     2     past      2
and      2     stormy    2
stormy   2     the       1
night    2     the       1
in       2     the       2
the      2     the       2
country  2     their     1
manor    2     time      1
the      2     time      2
time     2     to        1
was      2     to        1
past     2     was       2
midnight 2     was       2
• Within-document term frequency is compiled
term     ID  freq
good     1   1
in       2   1
is       1   1
it       2   1
manor    2   1
now      1   1
of       1   1
past     2   1
stormy   2   1
the      1   2
the      2   2
their    1   1
time     1   1
time     2   1
to       1   2
was      2   2
Creating Inverted Files
• Then the file can be split into
– A Dictionary file, and
– A Postings file
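The whole construction (record term and document ID, sort, merge duplicates into frequencies, split into two files) can be sketched as follows. The input pairs are a subset of the rows in the example; the exact layout of the dictionary entries (`n_docs`, `offset`) and the names `build_inverted_file` and `lookup` are assumptions made for illustration, since the slides do not specify the file formats.

```python
from collections import Counter

# (term, doc_id) pairs in parse order; a subset of the example rows.
pairs = [
    ("the", 1), ("time", 1), ("to", 1), ("to", 1), ("the", 1),
    ("the", 2), ("time", 2), ("was", 2), ("the", 2), ("was", 2),
]

def build_inverted_file(pairs):
    """Sort the pairs, merge duplicates into frequencies, then split the result."""
    # Counter merges repeated (term, doc) pairs; sorted() orders by term, then doc.
    merged = sorted(Counter(pairs).items())
    dictionary = {}  # term -> number of postings and where they start
    postings = []    # flat list of (doc_id, within-document frequency)
    for (term, doc_id), freq in merged:
        if term not in dictionary:
            dictionary[term] = {"n_docs": 0, "offset": len(postings)}
        dictionary[term]["n_docs"] += 1
        postings.append((doc_id, freq))
    return dictionary, postings

dictionary, postings = build_inverted_file(pairs)

def lookup(term):
    """Follow the dictionary entry into the postings list."""
    entry = dictionary.get(term)
    if entry is None:
        return []
    start = entry["offset"]
    return postings[start:start + entry["n_docs"]]

print(lookup("the"))  # [(1, 2), (2, 2)]
print(lookup("to"))   # [(1, 2)]
```

Keeping the dictionary small (one entry per distinct term) means it can usually stay in memory, while the longer postings list is read only at the offsets a query needs.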