Brazdil: Classification with tm
Pavel Brazdil
LIAAD - INESC Porto LA
FEP, Univ. of Porto
Summer School
Aspects of Natural Language Processing
F. Letras, UP, 4th June 2009
http://www.liaad.up.pt
Overview
1. Introduction
2. Preprocessing document collection in “tm”
2.1 The dataset 20Newsgroups
2.2 Creating a directory with a corpus for 2 subgroups
2.3 Preprocessing operations
2.4 Creating Document-Term matrix
2.5 Converting the Matrix into Data Frame
2.6 Including class information
3. Classification of documents
3.1 Using a kNN classifier
3.2 Using a Decision Tree classifier
3.3 Using a Neural Net
1. Introduction
2. Preprocessing document collection in “tm”
2.1 The dataset 20Newsgroups
20Newsgroups
This involves:
- Selecting one of the newsgroups (e.g. sci.electronics)
- Invoking “Change directory” in R to 20news-bydate-train
- Invoking the instruction Corpus():
sci.electr.train <- Corpus(DirSource("sci.electronics"),
readerControl = list(reader = readNewsgroup, language = "en_US"))
If we type:
> sci.electr.train
the system responds “A corpus with 591 documents”
(length(sci.electr.train) returns the number of documents, 591).
Similarly, we can obtain documents from another class
(e.g. talk.religion.misc):
talk.religion.train (377 documents)
Similarly, we can obtain the test data:
sci.electr.test (393 documents)
talk.religion.test (251 documents)
Example of one document
sci.electr.train[[1]]
An object of class “NewsgroupDocument”
[1] "In article <[email protected]>
[email protected] writes:"
[2] ">[...]"
[3] ">There are a variety of water-proof housings I could use but the real meat"
[4] ">of the problem is the electronics...hence this posting. What kind of"
[5] ">transmission would be reliable underwater, in murky or even night-time"
[6] ">conditions? I'm not sure if sound is feasible given the distortion under-"
[7] ">water...obviously direction would have to be accurate but range could be"
[8] ">relatively short (I imagine 2 or 3 hundred yards would be more than enough)"
[9] ">"
[10] ">Jim McDonald"
…
[35] " ET \"Tesla was 100 years ahead of his time. Perhaps now his time comes.\""
[36] "----"
2.3 Preprocessing
Result of preprocessing of one document
Original document:
> sci.rel.tr.ts[[1]] (=sci.electr.train[[1]])
An object of class “NewsgroupDocument”
[1] "In article <[email protected]>
[email protected] writes:"
[2] ">[...]"
[3] ">There are a variety of water-proof housings I could use but the real meat"
[4] ">of the problem is the electronics...hence this posting. What kind of"
[5] ">transmission would be reliable underwater, in murky or even night-time"
[6] ">conditions? I'm not sure if sound is feasible given the distortion under-"
Pre-processed document (the slide marks part of this output as undesirable):
> sci.rel.tr.ts.p[[1]]
[1] in article fbaeffaesoprutgersedu mcdonaldaesoprutgersedu writes
[2]
[3] there variety waterproof housings real meat
[4] of electronicshence posting
[5] transmission reliable underwater murky nighttime
[6] conditions sound feasible distortion under
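The effect of such preprocessing can be imitated in base R. This is only a sketch and not the tm pipeline used in the slides: the helper name preprocess_line and the short stopword list are assumptions made for illustration. It shows why "water-proof" becomes "waterproof" and "electronics...hence" becomes "electronicshence": punctuation is deleted without inserting a space.

```r
# Hypothetical helper (not part of tm): lower-case a line, strip punctuation
# and digits, then drop stopwords. The stopword list is a small illustrative
# sample, not tm's English stopword list.
preprocess_line <- function(x,
                            stopwords = c("i", "a", "are", "but", "could",
                                          "of", "the", "use")) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)   # deletes punctuation without adding a space
  x <- gsub("[[:digit:]]", "", x)   # deletes digits
  words <- strsplit(x, "[[:space:]]+")[[1]]
  words <- words[nzchar(words) & !(words %in% stopwords)]
  paste(words, collapse = " ")
}

line3 <- ">There are a variety of water-proof housings I could use but the real meat"
line4 <- ">of the problem is the electronics...hence this posting. What kind of"

res3 <- preprocess_line(line3)
res4 <- preprocess_line(line4)
print(res3)
print(res4)
```

Replacing punctuation with a space instead of deleting it would keep "electronics" and "hence" as separate words, at the cost of splitting "water-proof" into two tokens.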
Note:
The command DocumentTermMatrix() is available only in R Version 2.9.1.
In R Version 2.8.1 the function available is TermDocMatrix(<DocCollection>).
Creating Document-Term Matrix (DTM)
Simple Example:
> DocumentTermMatrix(sci.rel.tr.ts.p)
A document-term matrix (1612 documents, 21967 terms)
Options of DTM
Improved example:
> dtm.mx.sci.rel <- DocumentTermMatrix( sci.rel.tr.ts.p,
control=list(weighting=weightTfIdf, minWordLength=2, minDocFreq=5))
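To make the tf-idf weighting concrete, the following base R sketch builds a tiny document-term matrix by hand and weights it. The three toy documents are invented, and the formula used (term frequency normalised by document length, multiplied by log2 of the number of documents over the document frequency) is one common tf-idf variant, stated here as an assumption rather than as the exact behaviour of the tm version used in the slides.

```r
# Toy pre-processed documents (invented for illustration)
docs <- c(d1 = "god lord god", d2 = "wire circuit wire", d3 = "wire god")
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# Raw counts: one column per document, one row per term, then transposed
counts <- vapply(tokens,
                 function(tk) tabulate(match(tk, terms), nbins = length(terms)),
                 integer(length(terms)))
dtm <- t(counts)            # documents in rows, terms in columns
colnames(dtm) <- terms

# tf-idf: (count / document length) * log2(N / number of documents with the term)
tf <- dtm / rowSums(dtm)
idf <- log2(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(tf, 2, idf, `*`)

print(dtm)
print(round(tfidf, 3))
```

Terms that occur in every document receive an idf of 0 under this variant, so they contribute nothing to the weighted matrix.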
Generating DTM with different options
Inspecting the DTM
As we can see, the matrix is very sparse. By chance, the part shown contains no values other than 0.
Note: The DTM is not an ordinary matrix, as it exploits
object-oriented representation (includes meta-data).
The function inspect(..) converts this into an ordinary matrix which can be inspected.
Example:
> freqterms100 <- findFreqTerms( dtm.mx.sci.rel, 100)
> freqterms100
[1] "wire" "elohim" "god" "jehovah" "lord"
(Slide annotation: the religious terms are from talk.religion.)
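The idea behind selecting frequent terms can be sketched on an ordinary matrix in base R. The counts below are invented, and summing each column and comparing against a threshold is an assumption about what "frequent" means here, not a description of how findFreqTerms is implemented.

```r
# Toy document-term counts (invented): rows are documents, columns are terms
dtm <- rbind(d1 = c(god = 60, wire = 0,   lord = 30),
             d2 = c(god = 50, wire = 120, lord = 40))

# Total weight of each term over all documents
freq <- colSums(dtm)

# Keep the terms whose total reaches the threshold of 100
freqterms100 <- names(freq)[freq >= 100]
print(freqterms100)
```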
2.6 Appending class information
3 Classification of Documents
3.1 Using a KNN classifier
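As a minimal sketch of this step, the function knn() from the recommended R package class can classify documents represented as rows of term counts. The toy data, the choice k = 3 and the variable names are assumptions made for illustration; the slides' own data frame and settings are not reproduced here.

```r
library(class)  # recommended package shipped with standard R installations

# Toy training data (invented): one row per document, one column per term
train <- data.frame(god  = c(3, 4, 5, 0, 0, 1),
                    wire = c(0, 1, 0, 4, 3, 5))
class.tr <- factor(c("rel", "rel", "rel", "sci", "sci", "sci"))

# Two unseen documents to classify
test <- data.frame(god = c(4, 0), wire = c(0, 4))

# Each test document gets the majority class of its 3 nearest training documents
predictions.ts <- knn(train, test, cl = class.tr, k = 3)
print(predictions.ts)
```

Because kNN relies on distances between rows, the outcome depends on how the term columns are weighted and scaled.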
3.2 Classification of Docs using a Decision Tree
> dt
n= 968

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 968 377 sci (0.3894628 0.6105372)
  2) god>=2.5 26 0 rel (1.0000000 0.0000000) *
  3) god< 2.5 942 351 sci (0.3726115 0.6273885)
    6) jesus>=2.5 16 0 rel (1.0000000 0.0000000) *
    7) jesus< 2.5 926 335 sci (0.3617711 0.6382289) *
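A tree of this kind can be grown with rpart(), which is where this print format comes from. The sketch below uses invented toy data and default rpart settings, so it is an illustration under those assumptions and not the call used to obtain the tree in the slides. In the printout, each node lists the split, the number of documents, the number misclassified (loss), the predicted class and the class probabilities.

```r
library(rpart)  # recommended package shipped with standard R installations

# Toy training data (invented): "god" separates the classes, "wire" does not
train <- data.frame(
  god   = c(rep(3, 10), rep(0, 30)),
  wire  = rep(c(0, 1), 20),
  class = factor(c(rep("rel", 10), rep("sci", 30)))
)

# Grow a classification tree on all term columns
dt <- rpart(class ~ ., data = train, method = "class")
print(dt)

# Classify two unseen documents
newdocs <- data.frame(god = c(4, 0), wire = c(0, 1))
pred <- predict(dt, newdata = newdocs, type = "class")
print(pred)
```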
Evaluating the Decision Tree
Acknowledgements:
Rui Pedrosa, M.Sc. Student, M. ADSAD, 2009
The error rate reported on a similar task was quite good: about 17%.
5. Calculating the evaluation measures
The basis for all calculations is the confusion matrix, such as:
> conf.mx <- table(class.ts, predictions.ts)
> conf.mx
        predictions.ts
class.ts rel sci
     rel 201  50
     sci   3 390
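Overall accuracy and error rate follow directly from this matrix. The sketch below re-enters the four counts shown above by hand, since class.ts and predictions.ts themselves are not available here; the names accuracy and error.rate are chosen for illustration.

```r
# Confusion matrix re-entered from the counts above (rows: true class,
# columns: predicted class); matrix() fills column by column
conf.mx <- matrix(c(201, 3, 50, 390), nrow = 2,
                  dimnames = list(class.ts = c("rel", "sci"),
                                  predictions.ts = c("rel", "sci")))

# Correctly classified documents lie on the diagonal
accuracy <- sum(diag(conf.mx)) / sum(conf.mx)
error.rate <- 1 - accuracy

print(round(100 * accuracy, 2))
print(round(100 * error.rate, 2))
```

With these counts, 591 of the 644 test documents are classified correctly and 53 are misclassified.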
Evaluation measures
           predicted +   predicted -
actual +       TP            FN
actual -       FP            TN
Recall = TP / (TP + FN) * 100%
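As a worked example, recall can be computed from the confusion matrix counts given earlier (201, 50, 3, 390). Treating "rel" as the positive class is an assumption made here for illustration. Precision is added using its standard definition, TP / (TP + FP), which does not appear on the slide as extracted.

```r
# Counts taken from the confusion matrix, with "rel" assumed to be the positive class
TP <- 201  # rel documents predicted as rel
FN <- 50   # rel documents predicted as sci
FP <- 3    # sci documents predicted as rel
TN <- 390  # sci documents predicted as sci

recall <- TP / (TP + FN) * 100      # share of actual positives that were found
precision <- TP / (TP + FP) * 100   # share of predicted positives that are correct

print(round(recall, 2))
print(round(precision, 2))
```

Under this assumption the classifier rarely labels a sci document as rel, but it misses about one in five rel documents.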