The lecture covers classification and retrieval methods in text processing, focusing on decision trees as a classification technique. It discusses the process of tree induction, feature selection using information gain, and the challenges of overfitting. The importance of balancing model complexity and generalization is emphasized throughout the lecture.

Lecture 4: Classification and Retrieval

Dr. YI Cheng (易成)


School of Economics and Management
Mar 18, 2024
Last Lecture
• Information Organization
– Categorization/classification

• Text processing basics


– Statistical Properties of Text

• Zipf Distribution

• Statistical Dependence

– Text Processing Process



Document Processing Steps (L3)



Text Processing Applications
• Classification and prediction
– Text categorization (support browsing)

• Information retrieval (support querying)

• Other applications
– Clustering
– Information extraction
– …



Classification as a Prediction Problem
• The classification process



Classification Methods
• Decision trees
• Spatial techniques
• Probabilistic classifiers
• Neural networks
• …



Decision Tree
• The classification process is modeled using a set of
hierarchical decisions on the features, arranged in a
tree-like structure
– Tree structures are common for organizing classification
schemes
• The decision at a particular node, referred to as the
split criterion, is a condition on one or more features,
learned from the training data; it divides the training
data into two or more parts
• The goal is to identify a split criterion that maximizes
the separation of the different classes among the
child nodes (i.e., try to derive pure sets)
Decision Tree Example
A decision tree to help a doctor diagnose a patient’s disease:

[Figure: a tree whose root node tests “Pain” (none, throat, abdomen, chest);
internal nodes such as “Fever” and “Cough” split on yes/no; the leaves are
diagnoses such as Appendicitis, Heart attack, Flu, Strep, Cold, or None.]

Each branch node represents a choice among a number of features.
Each leaf node represents a class/decision.



Example of Tree Induction: Instance Data “buys_computer” (Training Set)
This follows an example of Quinlan’s ID3 algorithm: 14 instances, 2 classes (yes/no).

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no



Output: A Decision Tree for “buys_computer”

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes



Tree Induction Algorithm

▪ The algorithm operates over a set of training instances, C.


▪ If all instances in C are in class P (i.e., pure), create a node
P and stop, otherwise select a feature or attribute F and
create a decision node.
▪ Partition the training instances in C into subsets S
according to the values V of F.
▪ Apply the algorithm recursively to each of the subsets S.
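A minimal sketch of this recursive procedure in Python (not code from the lecture; the dataset format, the entropy helper, and the information_gain function are assumptions that follow the formulas introduced on the following slides). Applied to the 14-instance buys_computer training set, a procedure like this should reproduce the tree shown on the surrounding slides.

```python
import math
from collections import Counter

def entropy(labels):
    """Bits of information required to classify an arbitrary instance."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Gain(F) = entropy of the parent minus the weighted entropy of the children."""
    n = len(labels)
    children = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        children += (len(subset) / n) * entropy(subset)
    return entropy(labels) - children

def induce_tree(rows, labels, features):
    """ID3-style induction over training instances C (rows are feature->value dicts)."""
    if len(set(labels)) == 1:                      # pure: create a leaf for class P and stop
        return labels[0]
    if not features:                               # no features left: use the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    branches = {}
    for value in set(row[best] for row in rows):   # partition C into subsets S by the values V of F
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = induce_tree([rows[i] for i in idx],      # recurse on each subset
                                      [labels[i] for i in idx],
                                      [f for f in features if f != best])
    return {best: branches}
```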



Output: A Decision Tree for “buys_computer”
Which feature to start? What feature follows the root feature? When to stop?

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes



Feature Selection Measure: Information Gain
◼ The most common splitting criterion is called information gain
◼ Based on a purity measure (i.e., homogeneity with respect to
the target variable): entropy

[Figure: three example sets: one not pure (a mix of yes and no, where
1 bit of information is required to distinguish yes and no), one pure
(100% no), and one pure (100% yes).]
Shannon’s Information Theory (L1)
▪ Information Entropy
• A measure of the disorder/uncertainty of a system and inversely
related to the amount of energy available to do work.
• For a discrete random variable X with possible values {x1, ..., xn}
and probability mass function p(xi):

    Entropy:  H(X) = − Σ (i=1..n)  p(xi) log2 p(xi)
    Information content of xi:  I(xi) = − log2 p(xi)
Feature Selection Measure:
Information Gain
◼ Consider a set S of 10 documents, seven of class A and three of class B
◼ p(A) = 7/10 = 0.7
◼ p(B) = 3/10 = 0.3
◼ entropy(S)
  = − [ p(A) log2 p(A) + p(B) log2 p(B) ]
  = − [ 0.7 × log2(0.7) + 0.3 × log2(0.3) ] ≈ 0.88
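A quick check of this number in plain Python (using log base 2, as on the slide):

```python
import math

p_A, p_B = 0.7, 0.3
entropy_S = -(p_A * math.log2(p_A) + p_B * math.log2(p_B))
print(round(entropy_S, 2))   # 0.88
```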



Feature Selection Measure: Information Gain
◼ What is the feature with the highest information gain?
◼ S contains si instances of class Ci, for i = 1, …, m; s is the total number of instances
◼ Information (I) measures the info required to classify any arbitrary instance
  (entropy of the “parent”):

    I(s1, s2, ..., sm) = − Σ (i=1..m)  (si / s) log2 (si / s)

◼ Entropy (E) of feature A with values {a1, a2, …, av}, where sij is the number of
  instances of class Ci in the subset with A = aj (weighted entropy of the “children”):

    E(A) = Σ (j=1..v)  [ (s1j + ... + smj) / s ] × I(s1j, ..., smj)

◼ Information gained by branching on feature A:

    Gain(A) = I(s1, s2, ..., sm) − E(A)


Output: A Decision Tree for “buys_computer”
Which feature to start?

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes



Feature Selection by Information Gain
◼ Class P: buys_computer = “yes”; Class N: buys_computer = “no”
◼ I(p, n) = I(9, 5) = 0.940
◼ Compute the entropy for age:

    age      pi  ni  I(pi, ni)
    <=30     2   3   0.971
    31…40    4   0   0
    >40      3   2   0.971

    E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

    (5/14) I(2,3) means that “age <=30” has 5 out of the 14 samples,
    with 2 yes’es and 3 no’s.

    Hence, Gain(age) = I(p, n) − E(age) = 0.246

◼ Similarly (using the training data shown earlier),
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
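The same calculation can be scripted; below is a sketch in Python over the 14-instance training set shown earlier. The printed gains differ from the slide only in the third decimal place for age and student, because the slide subtracts rounded intermediate entropies (0.940 and 0.694).

```python
import math
from collections import Counter

# the 14 training instances: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),          ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),        (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),          (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),   ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),         (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),       (">40", "medium", "no", "excellent", "no"),
]
COLUMN = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def I(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
print(round(I(labels), 3))                       # 0.94, i.e. the slide's I(9, 5) = 0.940

def gain(feature):
    col = COLUMN[feature]
    e = 0.0                                      # E(A): weighted entropy of the children
    for v in set(row[col] for row in data):
        subset = [row[-1] for row in data if row[col] == v]
        e += len(subset) / len(data) * I(subset)
    return I(labels) - e

for f in COLUMN:
    print(f, round(gain(f), 3))
# age 0.247, income 0.029, student 0.152, credit_rating 0.048
# (the slide's 0.246 and 0.151 come from subtracting the rounded intermediate values)
```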
Decision Tree Classification
• How good is each of the features individually (i.e.,
as the root of the tree)?
– Information gain
– Over the entire set of instances
• If we apply the tree algorithm to the data, what does the
tree look like?
– The feature with the highest information gain should be
the root
– Lower-level features/nodes are selected based on the
subset of instances that reaches them in the tree
– A recursive process of divide and conquer



Decision Tree Example:
Classifying News Stories – whether they
will likely result in a change in stock price



Decision Tree Example:
Classifying News Stories
• 1. Summit Tech announces revenues for the three months ended Dec 31, 1998 were $22.4 million,
an increase of 13%.
• 2. Summit Tech and Autonomous Technologies Corporation announce that the Joint
Proxy/Prospectus for Summit’s acquisition of Autonomous has been declared effective by the SEC.
• 3. Summit Tech said that its procedure volume reached new levels in the first quarter and that it had
concluded its acquisition of Autonomous Technologies Corporation.
• 4. Announcement of annual shareholders meeting.
• 5. Summit Tech announces it has filed a registration statement with the SEC to sell 4,000,000 shares
of its common stock.
• 6. A US FDA panel backs the use of a Summit Tech laser in LASIK procedures to correct
nearsightedness with or without astigmatism.
• 7. Summit up 1-1/8 at 27-3/8.
• 8. Summit Tech said today that its revenues for the three months ended June 30, 1999 increased
14%…
• 9. Summit Tech announces the public offering of 3,500,000 shares of its common stock priced at
$16/share.
• 10. Summit announces an agreement with Sterling Vision, Inc. for the purchase of up to six of
Summit’s state of the art, ApexPlus Laser Systems.
• 11. Preferred Capital Markets, Inc. initiates coverage of Summit Technology Inc. with a Strong Buy
rating and a 12-16 month price target of $22.50.
Feature Selection in Document Classification
• Each document is a feature vector, with features being
TF×IDF values of terms
– Labeled “change” or “no change” in training set

• What term is the most useful for distinguishing a news story
that will lead to substantial stock price changes from one
that will not?
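A sketch of building such feature vectors with scikit-learn's TfidfVectorizer (one common choice, not necessarily what was used for this example; the two story snippets and their class labels below are shortened, made-up stand-ins for the labeled training set):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

stories = ["Summit Tech said revenues for the quarter increased 14 percent",
           "Announcement of annual shareholders meeting"]
labels = ["change", "no change"]                # hypothetical class labels

vectorizer = TfidfVectorizer()                  # each document becomes a vector of TFxIDF term weights
X = vectorizer.fit_transform(stories)           # sparse matrix: one row per document, one column per term
print(vectorizer.get_feature_names_out())       # the term vocabulary
print(X.toarray().round(2))                     # the TFxIDF feature vectors, ready for a classifier
```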



Decision Tree Example:
Classifying News Stories
• Terms with high information gain



Information Gain Drawbacks
• Problem: attributes with a large number of
values (extreme case: person’s name or ID code)
code  age     income  student  credit_rating  buys_computer
1     <=30    high    no       fair           no
2     <=30    high    no       excellent      no
3     31…40   high    no       fair           yes
4     >40     medium  no       fair           yes
5     >40     low     yes      fair           yes
6     >40     low     yes      excellent      no
7     31…40   low     yes      excellent      yes
8     <=30    medium  no       fair           no
9     <=30    low     yes      fair           yes
10    >40     medium  yes      fair           yes
11    <=30    medium  yes      excellent      yes
12    31…40   medium  no       excellent      yes
13    31…40   high    yes      fair           yes
14    >40     medium  no       excellent      no



Information Gain Drawbacks
• Subsets are more likely to be pure if there is a
large number of values for a feature
– Information gain is biased towards choosing
features with a large number of values
– This may result in overfitting
• Selection of an attribute that is non-optimal for
prediction in general



Gain Ratio
• Gain ratio: a modification of the information gain
that reduces its bias
• Gain ratio takes number and size of branches
into account when choosing an attribute
– It corrects the information gain by taking the split
information into account
– But it may overcompensate: choose an attribute just
because its split information is very low
• Standard fix: only consider attributes with greater than
average information gain, and then compare them on gain
ratio
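A small sketch of these quantities (split information and gain ratio as defined in C4.5, which the slide alludes to; the numbers in the comments are my own calculations for the earlier buys_computer example, not values given in the lecture):

```python
import math

def split_info(subset_sizes):
    """SplitInfo(A) = - sum over branches of (|Sj|/|S|) * log2(|Sj|/|S|)."""
    n = sum(subset_sizes)
    return -sum(s / n * math.log2(s / n) for s in subset_sizes)

def gain_ratio(gain, subset_sizes):
    return gain / split_info(subset_sizes)

# "age" splits the 14 buys_computer instances into subsets of size 5, 4 and 5
print(round(split_info([5, 4, 5]), 3))           # 1.577
print(round(gain_ratio(0.246, [5, 4, 5]), 3))    # 0.156

# an ID-code attribute splits the data into 14 singleton subsets:
print(round(split_info([1] * 14), 3))            # 3.807, a huge denominator, hence a tiny gain ratio
```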
Discussion
Q: Is a tree with only pure leaves always the best
classifier you can have?



Output: A Decision Tree for “buys_computer”

age?
  <=30   → student?
             no  → no
             yes → yes
  31…40  → yes
  >40    → credit rating?
             excellent → no
             fair      → yes



Output: A Decision Tree for
Consumer Churn Problem



Discussion
Q: Is a tree with only pure leaves always the best
classifier you can have?
A: No.
This tree is the best classifier on the training set,
but possibly not on new and unseen data.
Because of overfitting, the tree may not
generalize very well.



Overfitting
• We want models to apply not just to the exact
training set but to the general population
from which the training data came.
• There is a fundamental trade-off between
model complexity and the possibility of
overfitting.



Overfitting in Tree Induction
• If we continue to split the data, eventually the
subsets will be pure.
• Any training instance given to the tree for
classification will eventually land at the
appropriate leaf → perfectly accurate!
• But at some point, the tree will start to overfit:
it acquires details of the training set that are
not characteristics of the population in
general, as represented by the holdout set
A Typical Fitting Graph for Tree Induction

[Figure: fitting graph; the y-axis is the proportion of correct decisions.]



Avoiding Overfitting with Tree
Induction
• Two common techniques to avoid overfitting:
– (1) to stop growing the tree before it gets too
complex (Prepruning)
– (2) to grow the tree until it is too large, then
“prune” it back, reducing its size (Post-pruning)



Prepruning
• The simplest method is to specify a minimum number
of instances that must be present in a leaf
– But at what threshold?
• Alternatively, base the decision on a statistical significance test
– Stop growing the tree when there is no statistically
significant association between any attribute and the
class at a particular node
• Most popular test: the chi-squared test
• ID3 used the chi-squared test in addition to information gain
– Only statistically significant attributes were allowed to
be selected by the information gain procedure
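A sketch of such a significance check in plain Python (the contingency table is the age-versus-class split from the buys_computer example, the 5% critical values are the standard chi-squared thresholds, and this is an illustration rather than the exact test used in ID3):

```python
# observed counts of (attribute value x class) at the node: age vs. buys_computer
observed = [[2, 3],    # <=30:  2 yes, 3 no
            [4, 0],    # 31…40: 4 yes, 0 no
            [3, 2]]    # >40:   3 yes, 2 no

rows, cols = len(observed), len(observed[0])
n = sum(map(sum, observed))
row_tot = [sum(r) for r in observed]
col_tot = [sum(observed[i][j] for i in range(rows)) for j in range(cols)]

# chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = sum((observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
           for i in range(rows) for j in range(cols))

critical = {1: 3.841, 2: 5.991, 3: 7.815}        # 5% significance level
df = (rows - 1) * (cols - 1)
print(round(chi2, 2), chi2 > critical[df])       # about 3.55, False: not significant
# with only 14 instances the association is not significant, so this rule would stop the split
```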
Post-pruning
• Build a full tree first, then cut off leaves and
branches and replace them with leaves
• One general idea is to estimate whether replacing a
set of leaves or a branch with a leaf would reduce
accuracy
– If not, then go ahead and prune
– The process is iterated on successive subtrees
until any further removal or replacement
would reduce accuracy
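A minimal sketch of that idea (reduced-error style pruning against a holdout set; the nested-tuple tree mirrors the buys_computer tree from the lecture, and the two holdout instances are made up for illustration):

```python
# internal node = (feature, {value: subtree}, majority class at the node); a leaf is a class label
tree = ("age", {
    "<=30":  ("student", {"no": "no", "yes": "yes"}, "no"),
    "31…40": "yes",
    ">40":   ("credit_rating", {"fair": "yes", "excellent": "no"}, "yes"),
}, "yes")

def classify(node, x):
    while not isinstance(node, str):
        feature, branches, majority = node
        node = branches.get(x[feature], majority)       # unseen value -> majority class
    return node

def accuracy(node, data):
    return sum(classify(node, x) == y for x, y in data) / len(data)

def prune(node, data):
    """Bottom-up: prune the subtrees first, then replace this node with a leaf
    if doing so does not reduce accuracy on the holdout data."""
    if isinstance(node, str) or not data:
        return node
    feature, branches, majority = node
    branches = {v: prune(sub, [(x, y) for x, y in data if x[feature] == v])
                for v, sub in branches.items()}
    candidate = (feature, branches, majority)
    return majority if accuracy(majority, data) >= accuracy(candidate, data) else candidate

# hypothetical holdout instances: (features, true class)
holdout = [({"age": "<=30", "student": "yes", "credit_rating": "fair"}, "yes"),
           ({"age": ">40",  "student": "no",  "credit_rating": "excellent"}, "no")]
print(prune(tree, holdout))     # with this tiny holdout the full tree survives unpruned
```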



Subtree Replacement
• Bottom-up
• Consider replacing a tree
only after considering all its
subtrees



Summary
• Decision Trees
– splits – binary, multi-way
– split criteria – information gain, gain
ratio, …
– pruning
– …
• No method is always superior –
experiment!



Homework 2 (Due 5pm Mar 25)
• Resource:
https://docs.rapidminer.com/latest/studio/operators/



Information Life Cycle

[Figure: the Information Life Cycle Management cycle: analyzing user
requirement; creation; collection/capture; organization/indexing;
storage/retrieval; distribution/dissemination; reuse/leverage.]



Structure of an IR System

[Figure: an Information Storage and Retrieval System with two lines.
– Search line: interest profiles and queries are formulated in terms of
descriptors and stored in Store 1 (profiles / search requests).
– Storage line: documents and data are indexed (descriptive and subject
indexing) and stored in Store 2 (document representations).
– Rules of the game = rules for subject indexing + thesaurus (which
consists of lead-in vocabulary and indexing language).
– Comparison/matching between the two stores yields potentially
relevant documents.]



Central Concepts in IR
• Documents
• Collections
• User Interface and Queries
• Relevance
• Evaluation



The Retrieval Process

[Figure: the user expresses a need as text through the user interface;
text operations produce a logical view of the query and of the documents;
query operations build the query (refined by user feedback), while the
indexing module and DB manager build an inverted-file index over the
text database; searching the index returns retrieved docs, which are
ranked and presented to the user as ranked docs.]
What is a good structure for index?

- Is this a good algorithm?


- No! Query processing time should be largely
independent of database size.
- Probably proportional to answer size.
What is a good structure for index?
• We need a good data structure to support the
operation:
– “given term t, get all the documents that contain it”
• The structure must support this operation very
efficiently.
• It should be built at preprocessing time, not at
query time
– Can afford to spend some time in its construction.
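In practice this is a map from each term to its postings list, built once at preprocessing time; a minimal sketch in Python with two toy documents (the document contents are made up):

```python
from collections import defaultdict

docs = {"d1": "information organization and categorization",
        "d2": "information retrieval and indexing"}

index = defaultdict(set)                     # term -> set of documents containing it
for doc_id, text in docs.items():            # built once, at preprocessing time
    for term in text.split():
        index[term].add(doc_id)

# query time: a single dictionary lookup, independent of how many documents are stored
print(sorted(index["information"]))          # ['d1', 'd2']
print(sorted(index["retrieval"]))            # ['d2']
```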



Inverted Files
• The crucial data structure for indexing
• A file “inverted” so that rows become
columns and columns become rows

Document-term matrix:              Inverted (term-document) matrix:
docs  t1  t2  t3                   Terms  D1  D2  D3  D4  D5  D6  D7  …
D1    1   0   1                    t1     1   1   0   1   1   1   0
D2    1   0   0                    t2     0   0   1   0   1   1   1
D3    0   1   1                    t3     1   0   1   0   1   0   0
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1



Creating Inverted Files

[Figure: original documents → word extraction → word IDs, each mapped to
the IDs of the documents containing that word, e.g. W1: d1, d2, d3;
W2: d2, d4, d7, d9; …; Wn: di, … dn → the inverted file.]



Creating Inverted Files
• Map the file names to file IDs
• Consider the following original documents:

D1: The Department of Computer Science was established in 1984.
D2: The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3: followed by the MSc in Computer Science which was started in 1991.
D4: The Department also produced its first PhD graduate in 1994.
D5: Our staff have contributed intellectually and professionally to the advancements in these fields.



Creating Inverted Files
• Remove stop words from documents D1–D5 above (the stop words are
highlighted in blue on the original slide).



Creating Inverted Files
• After stemming, make lowercase (optional) and delete numbers (optional):

D1: depart comput scienc establish
D2: depart launch bsc hons comput studi
D3: follow msc comput scienc start
D4: depart produc phd graduat
D5: staff contribut intellectu profession advanc field



Creating Inverted Files (unsorted)

Words       Documents        Words        Documents
depart      d1, d2, d4       produc       d4
comput      d1, d2, d3       phd          d4
scienc      d1, d3           graduat      d4
establish   d1               staff        d5
launch      d2               contribut    d5
bsc         d2               intellectu   d5
hons        d2               profession   d5
studi       d2               advanc       d5
follow      d3               field        d5
msc         d3
start       d3



Creating Inverted Files (sorted)

Words       Documents        Words        Documents
advanc      d5               msc          d3
bsc         d2               phd          d4
comput      d1, d2, d3       produc       d4
contribut   d5               profession   d5
depart      d1, d2, d4       scienc       d1, d3
establish   d1               staff        d5
field       d5               start        d3
follow      d3               studi        d2
graduat     d4
intellectu  d5
launch      d2
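The construction above can be scripted directly; a sketch that rebuilds this sorted inverted file from the stemmed documents D1–D5 (the stemmed term lists are copied from the earlier slide):

```python
from collections import defaultdict

stemmed = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}

inverted = defaultdict(list)                 # word -> postings list of document IDs
for doc_id, text in stemmed.items():
    for word in text.split():
        if doc_id not in inverted[word]:     # one posting per (word, document) pair
            inverted[word].append(doc_id)

for word in sorted(inverted):                # sort the file alphabetically
    print(word, ",".join(inverted[word]))
# e.g. comput d1,d2,d3   depart d1,d2,d4   scienc d1,d3
```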



Another Example: Creating Inverted Files
• Documents are parsed to extract tokens
• These are saved with the Document ID

Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

Term      Doc #
now       1
is        1
the       1
time      1
for       1
all       1
good      1
men       1
to        1
come      1
to        1
the       1
aid       1
of        1
their     1
country   1
it        2
was       2
a         2
dark      2
and       2
stormy    2
night     2
in        2
the       2
country   2
manor     2
the       2
time      2
was       2
past      2
midnight  2



Creating Inverted Files
• After all documents have been parsed, the inverted file is sorted
alphabetically:

Term      Doc #
a         2
aid       1
all       1
and       2
come      1
country   1
country   2
dark      2
for       1
good      1
in        2
is        1
it        2
manor     2
men       1
midnight  2
night     2
now       1
of        1
past      2
stormy    2
the       1
the       1
the       2
the       2
their     1
time      1
time      2
to        1
to        1
was       2
was       2



Creating Inverted Files
• Multiple term entries for a single document are merged
• Within-document term frequency information is compiled

Term      Doc #  Freq
a         2      1
aid       1      1
all       1      1
and       2      1
come      1      1
country   1      1
country   2      1
dark      2      1
for       1      1
good      1      1
in        2      1
is        1      1
it        2      1
manor     2      1
men       1      1
midnight  2      1
night     2      1
now       1      1
of        1      1
past      2      1
stormy    2      1
the       1      2
the       2      2
their     1      1
time      1      1
time      2      1
to        1      2
was       2      2
Creating Inverted Files
• Then the file can be split into
– a Dictionary file, and
– a Postings file
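A sketch of that split in Python (the merged (term, doc, frequency) entries below are a small subset of the table on the next slide, just to show the mechanics; the offset field is an assumption about how a dictionary entry typically points into the postings file):

```python
from collections import defaultdict

# merged (term, doc_id, within-document frequency) entries
merged = [("the", 1, 2), ("the", 2, 2), ("their", 1, 1), ("time", 1, 1), ("time", 2, 1)]

grouped = defaultdict(list)
for term, doc_id, freq in merged:
    grouped[term].append((doc_id, freq))

dictionary = {}        # term -> (number of docs, total frequency, offset into the postings file)
postings = []          # flat list of (doc_id, freq) pairs

for term in sorted(grouped):
    entries = grouped[term]
    dictionary[term] = (len(entries), sum(f for _, f in entries), len(postings))
    postings.extend(entries)

print(dictionary["the"])    # (2, 4, 0): "the" appears in 2 docs, 4 times in total
print(postings)             # [(1, 2), (2, 2), (1, 1), (1, 1), (2, 1)]
```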



Creating Inverted Files
Dictionary (Term, N docs, Tot Freq) and Postings (Doc #, Freq), after
collapsing the list to resolve repeats:

Term      N docs  Tot Freq   Postings (Doc #: Freq)
a         1       1          2: 1
aid       1       1          1: 1
all       1       1          1: 1
and       1       1          2: 1
come      1       1          1: 1
country   2       2          1: 1,  2: 1
dark      1       1          2: 1
for       1       1          1: 1
good      1       1          1: 1
in        1       1          2: 1
is        1       1          1: 1
it        1       1          2: 1
manor     1       1          2: 1
men       1       1          1: 1
midnight  1       1          2: 1
night     1       1          2: 1
now       1       1          1: 1
of        1       1          1: 1
past      1       1          2: 1
stormy    1       1          2: 1
the       2       4          1: 2,  2: 2
their     1       1          1: 1
time      2       2          1: 1,  2: 1
to        1       2          1: 2
was       1       2          2: 2



How Inverted Files are Used
Query on “time” AND “dark” (using the dictionary and postings files above;
the postings give the separate documents for each term):
• 2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file
• 1 doc with “dark” in the dictionary -> ID 2 from the postings file
• Therefore, only doc 2 satisfies the query
Inverted Indexes
• For each term, you get a list consisting of:
– Document ID
– Frequency of term in doc (optional)
– Position of term in doc (optional)
• Permit fast search for individual terms
• These lists can be used to solve Boolean queries:
– country -> d1, d2
– manor -> d2
– country AND manor -> d2
• Also used for statistical ranking algorithms
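A sketch of answering such a Boolean query with set intersection (postings as in the examples above, written with d1/d2 document IDs):

```python
postings = {"country": {"d1", "d2"}, "manor": {"d2"},
            "time": {"d1", "d2"}, "dark": {"d2"}}

def boolean_and(*terms):
    """Intersect the postings lists of all the query terms."""
    result = postings[terms[0]]
    for term in terms[1:]:
        result = result & postings[term]
    return result

print(boolean_and("country", "manor"))   # {'d2'}
print(boolean_and("time", "dark"))       # {'d2'}: only doc 2 satisfies "time AND dark"
```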



Course Schedule

Week   Date           Lesson Topics
1      Feb 26         Introduction
2      Mar 4          Metadata and subject analysis (metadata schemes, controlled vocabularies)
3      Mar 11         Information categorization; computational classification: text processing basics
4      Mar 18         Computational classification: decision tree; information retrieval: inverted indexes
5-6    Mar 25, Apr 1  Information retrieval: models (Boolean, vector space, probabilistic) and evaluation
7      Apr 8          Project presentation
8-9    Apr 15, 22     Web search (link analysis, paid search)
10     Apr 29         Test 1; guest lecture
11-12  May 6, 13      Information and social network (information cascades, social network analysis)
13-14  May 20, 27     Social and ethical issues (pricing of information, information goods market, IP issues); review
15     Jun 3          Test 2
