Machine Learning for Text/Web Data Mining

Byoung-Tak Zhang
School of Computer Science and Engineering
Seoul National University
E-mail: [email protected]

This material is available at http://scai.snu.ac.kr/~btzhang/
Overview
- Introduction
  - Web Information Retrieval
  - Machine Learning (ML)
  - ML Methods for Text/Web Data Mining
- Summary
  - Current and Future Work
[Figure: example text/Web mining tasks - text classification, information filtering (user profile -> filtered data), information extraction (DB template filling with fields such as Location and Date -> DB record), and question answering (question -> answer, with user feedback against a DB)]
Machine Learning
- Supervised Learning
  - Estimate an unknown mapping from known input-output pairs.
  - Learn f_w from a training set D = {(x, y)} s.t. f_w(x) = y = f(x).
  - Classification: y is discrete.
  - Regression: y is continuous.
- Unsupervised Learning
  - Only input values are provided.
  - Learn f_w from D = {(x)} s.t. f_w(x) = x.
  - Density estimation.
  - Compression, clustering.
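To make the two settings concrete, here is a minimal Python sketch using scikit-learn; the toy documents, labels, and model choices are illustrative assumptions, not part of the original slides.

```python
# A minimal, illustrative sketch of supervised vs. unsupervised learning
# on tiny toy text data (assumed example; not from the original slides).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

docs = ["the team won the hockey game",
        "new graphics card for the computer",
        "baseball season starts in spring",
        "unix workstation with fast graphics"]
labels = ["sports", "computers", "sports", "computers"]   # y for supervised learning

X = TfidfVectorizer().fit_transform(docs)                 # bag-of-words features x

# Supervised: learn f_w from (x, y) pairs; y is discrete -> classification
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X[:1]))

# Unsupervised: only x is given; here, clustering into 2 groups
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```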
Neural Networks
- Multilayer Perceptrons (MLPs)
- Self-Organizing Maps (SOMs)
- Support Vector Machines (SVMs)
Probabilistic Models
- Bayesian Networks (BNs)
- Helmholtz Machines (HMs)
- Latent Variable Models (LVMs)
[Figure: bag-of-words document representation - a term-count vector over the vocabulary, e.g. baseball 0, car 0, clinton 0, computer 0, graphics 0, hockey 0, quicktime 2, ..., references 1, space 0, specs 3, unix 1, ...]
TDT2 Corpus
- Topic detection and tracking (TDT): NIST
- Used 6,169 documents in experiments
Text Mining:
Helmholtz Machine Architecture

[Figure: two-layer Helmholtz machine with latent nodes h1, ..., hm and data nodes d1, ..., dn, connected by recognition (bottom-up) and generative (top-down) weights]

Recognition model:  P(h_i = 1) = 1 / (1 + exp(-(b_i + sum_{j=1}^{n} w_ij d_j)))
Generative model:   P(d_i = 1) = 1 / (1 + exp(-(b_i + sum_{j=1}^{m} w_ij h_j)))

- Latent nodes (h_i)
  - Binary values.
  - Extract the underlying causal structure in the document set.
  - Capture correlations of the words in documents.
- Data nodes (d_i)
  - Binary values.
  - Represent the presence or absence of words in documents.
Text Mining:
Learning Helmholtz Machines
- Introduce a recognition network Q for estimation of the generative network, giving a lower bound on the log-likelihood:

  log P(D | theta) = sum_{t=1}^{T} log sum_{h^(t)} P(d^(t), h^(t) | theta)
                   = sum_{t=1}^{T} log sum_{h^(t)} Q(h^(t)) [ P(d^(t), h^(t) | theta) / Q(h^(t)) ]
                  >= sum_{t=1}^{T} sum_{h^(t)} Q(h^(t)) log [ P(d^(t), h^(t) | theta) / Q(h^(t)) ]

- Wake-Sleep Algorithm (see the sketch below)
  - Train the recognition and generative models alternately.
  - Update the weights in each network iteratively by a simple local delta rule.
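A simplified numpy sketch of one wake-sleep step for the two-layer binary Helmholtz machine described above; the layer sizes, the Bernoulli(0.5) latent prior, and the learning rate are assumptions made for illustration.

```python
# Simplified wake-sleep step for a two-layer binary Helmholtz machine.
# Layer sizes, the Bernoulli(0.5) latent prior, and the learning rate are assumed.
import numpy as np

rng = np.random.default_rng(0)
n, m, lr = 20, 5, 0.1                              # n data bits (words), m latent bits
R, b_r = rng.normal(0, 0.1, (m, n)), np.zeros(m)   # recognition weights / biases
G, b_g = rng.normal(0, 0.1, (n, m)), np.zeros(n)   # generative weights / biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wake_sleep_step(d):
    global R, G, b_r, b_g
    # Wake phase: recognize latent causes of a real document d,
    # then update the generative weights by the local delta rule.
    h = (rng.random(m) < sigmoid(R @ d + b_r)).astype(float)
    p_d = sigmoid(G @ h + b_g)
    G += lr * np.outer(d - p_d, h)
    b_g += lr * (d - p_d)
    # Sleep phase: dream a fantasy (h', d') from the generative model,
    # then update the recognition weights to invert it.
    h_s = (rng.random(m) < 0.5).astype(float)      # simplified latent prior
    d_s = (rng.random(n) < sigmoid(G @ h_s + b_g)).astype(float)
    p_h = sigmoid(R @ d_s + b_r)
    R += lr * np.outer(h_s - p_h, d_s)
    b_r += lr * (h_s - p_h)

doc = (rng.random(n) < 0.3).astype(float)          # toy binary word-occurrence vector
for _ in range(100):
    wake_sleep_step(doc)
```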
Text Mining:
Categorization and Topic-Words Extraction
- Examples of extracted topic words (one topic per line):
  - warplane, airline, saudi, gulf, wright, soldiers, yitzhak, tanks, stealth, sabah, stations, kurds, mordechai, separatist, governor
  - olympics, nagano, olympic, winter, medal, hockey, atheletes, cup, games, slalom, medals, bronze, skating, lillehammer, downhill
  - netanyahu, palestinian, arafat, israeli, yasser, kofi, annan, benjamin, palestinians, mideast, gaza, jerusalem, eu, paris, israel
  - India, pakistan, pakistani, delhi, hindu, vajpayee, nuclear, tests, atal, kashmir, indian, janata, bharatiya, islamabad, bihari
  - imf, monetary, currencies, currency, rupiah, singapore, bailout, traders, markets, thailand, inflation, investors, fund, banks, baht
  - pope, cuba, cuban, embargo, castro, lifting, cubans, havana, alan, invasion, reserve, paul, output, vatican, freedom
Decision Tree vs. Decision Tree + Factor Analysis:
Variables in the Discriminant Model

Decision Tree:
V13 (SendEmail), V234 (OrderItemQuantitySum% HavingDiscountRange(5 . 10)), V237 (OrderItemQuantitySum% HavingDiscountRange(10.)), V240 (Friend), V243 (OrderLineQuantitySum), V245 (OrderLineQuantityMaximum), V304 (OrderShippingAmtMin), V324 (NumLegwearProductViews), V368 (WeightAverage), V374 (NumMainTemplateViews), V412 (NumReplenishableStockViews)

Decision Tree + Factor Analysis:
V240 (Friend), V229 (Order-Average), V304 (OrderShippingAmtMin), V368 (WeightAverage), V43 (HomeMarketValue), V377 (NumAcountTemplateViews)
+ V11 (WhichDoYouWearMostFrequent), V13 (SendEmail), V17 (USState), V45 (VehicleLifeStyle), V68 (RetailActivity), V19 (Date)
[Figure: an example Bayesian network over nodes A, B, C, D, E]

P(A,B,C,D,E) = P(A) P(B|A) P(C|B) P(D|A,B) P(E|B,C,D)
Summary
Bayesian Networks:
Architecture

[Figure: example network over variables L, B, G, M]

P(L, B, G, M) = P(L) P(B|L) P(G|L,B) P(M|L,B,G)
             = P(L) P(B) P(G|B) P(M|B,L)

- In general, a Bayesian network factors the joint distribution into local conditionals:
  P(X) = prod_{i=1}^{n} P(X_i | pa_i),  where pa_i denotes the parents of X_i.
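As an illustration of the factorization P(X) = prod_i P(X_i | pa_i), here is a small Python sketch that evaluates the joint of the L, B, G, M example from its local conditionals; all CPT numbers are made-up placeholders.

```python
# Illustrative sketch: computing a Bayesian-network joint probability as the
# product of local conditionals P(X_i | pa_i). The structure follows the second
# factorization above; all CPT numbers are made-up placeholders.

P_L = 0.2                       # P(L = True)
P_B = 0.3                       # P(B = True)
P_G_given_B = {True: 0.9, False: 0.1}
P_M_given_BL = {(True, True): 0.95, (True, False): 0.6,
                (False, True): 0.4, (False, False): 0.05}

def bernoulli(p, value):
    """Return P(X = value) for a binary variable with P(X = True) = p."""
    return p if value else 1.0 - p

def joint(l, b, g, m):
    # P(L,B,G,M) = P(L) P(B) P(G|B) P(M|B,L)
    return (bernoulli(P_L, l) * bernoulli(P_B, b) *
            bernoulli(P_G_given_B[b], g) * bernoulli(P_M_given_BL[(b, l)], m))

print(joint(l=True, b=False, g=False, m=True))
```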
Bayesian Networks:
Applications in IR - A Simple BN for Text Classification

[Figure: network with a class node C pointing to term nodes t1, t2, ..., t8754]
- C: document class
- t_i: i-th term
Bayesian Networks:
Experimental Results
- Dataset
  - The acq dataset from Reuters-21578.
  - 8754 terms were selected by TF-IDF.
  - Training data: 8762 documents.
  - Test data: 3009 documents.
- Parametric Learning
  - Dirichlet prior assumptions for the network parameter distributions:
    p(theta_ij | S^h) = Dir(theta_ij | alpha_ij1, ..., alpha_ijr_i)
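Under a Dirichlet prior, the posterior-mean parameter estimate simply adds the prior pseudo-counts alpha_ijk to the observed counts. A minimal sketch follows; the hyperparameter values and counts are illustrative, not taken from the experiments.

```python
# Posterior-mean estimate of a multinomial CPT entry under a Dirichlet prior:
# theta_ijk = (alpha_ijk + N_ijk) / (sum_k alpha_ijk + sum_k N_ijk)
# (alpha values and counts below are illustrative placeholders).
import numpy as np

alpha = np.array([1.0, 1.0])        # Dirichlet hyperparameters for a binary node
counts = np.array([40.0, 10.0])     # observed counts N_ijk for each value of X_i

theta = (alpha + counts) / (alpha.sum() + counts.sum())
print(theta)                        # smoothed estimates of P(X_i = k | pa_i = j)
```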
Bayesian Networks:
Experimental Results
- For training data (accuracy: 94.28%)

                      Recall (%)   Precision (%)
  Positive examples     96.83         75.98
  Negative examples     93.76         99.32

- For test data (accuracy: 96.51%)

                      Recall (%)   Precision (%)
  Positive examples     95.16         89.17
  Negative examples     96.88         98.67
[Figure: document clustering and topic-words extraction with latent topics z_k, k = 1, ..., K]
EM (Expectation-Maximization) Algorithm
- Algorithm to maximize a pre-defined log-likelihood, given document-word counts n(d_n, w_m) and latent topics z_k.

E-step:
  P(z_k | d_n, w_m) = P(z_k) P(d_n | z_k) P(w_m | z_k) / sum_{l=1}^{K} P(z_l) P(d_n | z_l) P(w_m | z_l)

M-step:
  P(w_m | z_k) = sum_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m) / sum_{m'=1}^{M} sum_{n=1}^{N} n(d_n, w_m') P(z_k | d_n, w_m')

  P(d_n | z_k) = sum_{m=1}^{M} n(d_n, w_m) P(z_k | d_n, w_m) / sum_{m=1}^{M} sum_{n'=1}^{N} n(d_n', w_m) P(z_k | d_n', w_m)

  P(z_k) = (1/R) sum_{m=1}^{M} sum_{n=1}^{N} n(d_n, w_m) P(z_k | d_n, w_m),  where R = sum_{m=1}^{M} sum_{n=1}^{N} n(d_n, w_m)
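A compact numpy sketch of the E- and M-steps above for a document-word count matrix; the toy counts, number of topics, and random initialization are assumptions.

```python
# Minimal EM sketch for the latent-variable (aspect) model above.
# n[i, j] holds the count n(d_i, w_j); toy data and initialization are assumed.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 6, 10, 2                           # documents, words, latent topics
n = rng.integers(0, 5, size=(N, M)).astype(float)

P_z = np.full(K, 1.0 / K)                    # P(z_k)
P_d_z = rng.dirichlet(np.ones(N), size=K)    # P(d_n | z_k), shape (K, N)
P_w_z = rng.dirichlet(np.ones(M), size=K)    # P(w_m | z_k), shape (K, M)

for _ in range(50):
    # E-step: P(z_k | d_n, w_m) is proportional to P(z_k) P(d_n|z_k) P(w_m|z_k)
    post = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]  # (K, N, M)
    post /= post.sum(axis=0, keepdims=True) + 1e-12
    # M-step: re-estimate P(w_m|z_k), P(d_n|z_k), P(z_k) from expected counts
    weighted = n[None, :, :] * post                                    # (K, N, M)
    P_w_z = weighted.sum(axis=1)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_d_z = weighted.sum(axis=2)
    P_d_z /= P_d_z.sum(axis=1, keepdims=True)
    P_z = weighted.sum(axis=(1, 2)) / n.sum()

print(P_z)
```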
Examples of extracted topic words (stemmed terms, one topic per line):
- german, germani, mr, parti, year, foreign, people, countri, govern, asylum, polit, nation, law, minist, europ, state, immigr, democrat, wing, social, turkish, west, east, member, attack
- percent, estonia, bank, state, privat, russian, year, enterprise, trade, million, trade, estonian, econom, countri, govern, compani, foreign, baltic, polish, loan, invest, fund, product
- research, technology, develop, mar, materi, system, nuclear, environment, electr, process, product, power, energi, countrol, japan, pollution, structur, chemic, plant
- jordan, peac, isreal, palestinian, king, isra, arab, meet, talk, husayn, agreem, presid, majesti, negoti, minist, visit, region, arafat, secur, peopl, east, washington, econom, sign, relat, jerusalem, rabin, syria, iraq
Boosting:
Algorithms
[Figure: importance weights of training documents are updated and fed to a sequence of learners, whose hypotheses h1, h2, h3, h4 are combined into a final classifier f(h1, h2, h3, h4)]
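The slides do not spell out the exact boosting variant, so as one concrete instance of the reweight-and-combine idea in the figure, here is a short discrete AdaBoost sketch with decision stumps; the toy data and the number of rounds are assumptions.

```python
# Sketch of the reweighting idea behind boosting (discrete AdaBoost with
# decision stumps); the concrete variant and toy data are assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)   # toy labels in {-1, +1}

T = 4                                   # number of weak learners (h1..h4 in the figure)
w = np.full(len(y), 1.0 / len(y))       # importance weights of training documents
learners, alphas = [], []

for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = h.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    w *= np.exp(-alpha * y * pred)      # up-weight the examples h got wrong
    w /= w.sum()
    learners.append(h)
    alphas.append(alpha)

# Final classifier f(h1, ..., hT): weighted vote of the weak hypotheses
f = lambda X_: np.sign(sum(a * h.predict(X_) for a, h in zip(alphas, learners)))
print((f(X) == y).mean())
```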
Boosting:
Applied to Text Filtering
- Naive Bayes
  - Traditional algorithm for text filtering.
  - Assumes independence among the terms of a document d_i:

    c_NB = argmax_{c_j in {relevant, irrelevant}} P(c_j) P(d_i | c_j)
         = argmax_{c_j in {relevant, irrelevant}} P(c_j) prod_k P(w_k | c_j)
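A minimal sketch of such a naive Bayes filter with scikit-learn; the toy documents and relevance labels are placeholders.

```python
# Minimal naive Bayes text-filtering sketch: classify documents as
# relevant / irrelevant, assuming term independence given the class.
# (Toy documents and labels are placeholders.)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["stock merger acquisition deal announced",
              "quarterly earnings report for the company",
              "local team wins championship game",
              "weather forecast rain this weekend"]
train_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

vec = CountVectorizer()
X = vec.fit_transform(train_docs)

nb = MultinomialNB()                   # estimates P(c_j) and P(w_k | c_j)
nb.fit(X, train_labels)

new_doc = vec.transform(["company announces acquisition"])
print(nb.predict(new_doc))             # argmax_j P(c_j) * prod_k P(w_k | c_j)
```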
Boosting:
Applied to Text Filtering - Experimental Results
- TREC-7 test documents: AP articles (1989~1990), 471 MB, 162,999 documents; no. of topics: 50.
- TREC-8 test documents: Financial Times (1993~1994), 382 MB, 140,651 documents; no. of topics: 50.
[Figure: example of a document]
Boosting:
Applied to Text Filtering - Experimental Results
- Compared with the state-of-the-art text filtering systems:

TREC-7 (Averaged Scaled F1):   Boosting 0.474,  ATT 0.461,  NTT 0.452,  PIRC 0.500
TREC-7 (Averaged Scaled F3):   Boosting 0.467,  ATT 0.460,  NTT 0.505,  PIRC 0.509
TREC-8 (Averaged Scaled LF1):  Boosting 0.717,  PLT1 0.712, PLT2 0.713, PIRC 0.714
TREC-8:                        Boosting 0.722,  CL 0.721,   PIRC 0.734, Mer 0.720
Evolutionary Learning:
Applications in IR - Web-Document Retrieval
[Figure: chromosomes encode weights for link information and HTML tags (e.g., <A>); retrieval performance determines the fitness of each chromosome]
Evolutionary Learning:
Applications in IR - Tag Weighting
- Crossover: parent chromosomes X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) produce an offspring chromosome Z = (z1, z2, z3, ..., zn) with z_i = (x_i + y_i) / 2 w.p. P_c.
- Mutation: individual genes of a chromosome X are perturbed.
- Selection: truncation selection (see the sketch below).
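A compact sketch of these operators on real-valued tag-weight chromosomes; the population size, mutation scheme, and the stand-in fitness function are assumptions (in the slides, fitness comes from retrieval performance).

```python
# Sketch of the evolutionary operators above for real-valued tag-weight
# chromosomes. The fitness function is a stand-in; in the slides, fitness
# comes from retrieval performance with the encoded tag/link weights.
import numpy as np

rng = np.random.default_rng(0)
POP, N_GENES, P_C, P_M = 20, 8, 0.7, 0.1   # assumed GA settings

def fitness(chrom):
    # Placeholder: reward weights close to a fictitious "good" weighting.
    target = np.linspace(1.0, 0.2, N_GENES)
    return -np.sum((chrom - target) ** 2)

pop = rng.uniform(0, 1, size=(POP, N_GENES))
for generation in range(30):
    # Truncation selection: keep the top half of the population.
    ranked = pop[np.argsort([fitness(c) for c in pop])[::-1]]
    parents = ranked[: POP // 2]
    children = []
    while len(children) < POP - len(parents):
        x, y = parents[rng.integers(len(parents), size=2)]
        # Intermediate crossover: z_i = (x_i + y_i) / 2 with probability P_c
        mask = rng.random(N_GENES) < P_C
        z = np.where(mask, (x + y) / 2.0, x)
        # Mutation: perturb each gene with probability P_m
        z += np.where(rng.random(N_GENES) < P_M, rng.normal(0, 0.1, N_GENES), 0.0)
        children.append(z)
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)
print(best)
```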
Evolutionary Learning:
Applications in IR - Experimental Results
- Datasets
  - TREC-8 Web Track data (WT2g): 2 GB, 247,491 web documents.
  - No. of training topics: 10; no. of test topics: 10.
- Results
[Figure: retrieval performance results]
Reinforcement Learning:
Basic Concept
[Figure: agent-environment interaction loop - 1. the agent observes state s_t, 2. takes action a_t, 3. receives reward r_{t+1}, and 4. observes the next state s_{t+1}]
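A generic sketch of this interaction loop with a tabular Q-learning update; the toy chain environment and parameter values are assumptions and are not specific to the IR application that follows.

```python
# Generic sketch of the agent-environment loop in the figure, with a tabular
# Q-learning update. The toy chain environment and parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 5, 2          # tiny chain world: move left (0) or right (1)
alpha, gamma, epsilon = 0.1, 0.9, 0.2
Q = np.zeros((N_STATES, N_ACTIONS))

def step(state, action):
    """Environment: return (next_state, reward); reaching the last state pays 1."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(200):
    s = 0
    for t in range(20):
        # 1. observe state s_t; 2. choose action a_t (epsilon-greedy)
        a = rng.integers(N_ACTIONS) if rng.random() < epsilon else int(np.argmax(Q[s]))
        # 3. receive reward r_{t+1}; 4. observe next state s_{t+1}
        s_next, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if s == N_STATES - 1:
            break

print(np.argmax(Q, axis=1))         # learned policy: should prefer moving right
```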
Reinforcement Learning:
Applications in IR - Information Filtering
[Seo & Zhang, 2000]
[Figure: the WAIR document-filtering loop - 1. State_i: the user profile; 2. Action_i: modify the profile, retrieve documents, and calculate similarity; 3. Reward_{i+1}: relevance feedback from the user on the filtered documents; 4. State_{i+1}: the updated profile]
Reinforcement Learning:
Experimental Results (Explicit Feedback)
[Figure: filtering performance (%) with explicit relevance feedback]
Reinforcement Learning:
Experimental Results (Implicit Feedback)
[Figure: filtering performance (%) with implicit relevance feedback]