Text Mining
Numerical Vectors
Introduction
● To mine text, we need to process it into a form that data mining
procedures can use.
● As discussed in an earlier chapter, this involves generating features in a
spreadsheet format.
● Classical data mining looks at highly structured data.
● The spreadsheet model is the embodiment of a representation that
supports predictive modeling.
● Predictive text mining is simpler and more restrictive than open-ended
data mining.
● Text is called unstructured because it is very far from the spreadsheet
model into which we need to put data for prediction.
Introduction
● Transforming the data to the spreadsheet model is a methodical and
carefully organized procedure for filling in the cells of a spreadsheet.
● We have to determine the nature of each column in the spreadsheet.
● Some features are easy to obtain, others are difficult: a word in a
text is easy; the grammatical function of a word in a sentence is hard.
● What do we discuss here?
● How to obtain the kinds of features generated from text.
Collecting Documents
● The first step of text mining is to collect the data (the documents).
● A web page retrieval application for an intranet implicitly specifies
the relevant documents to be the web pages on that intranet.
● If the documents are identified, they can be obtained; the main issue
is to cleanse the samples and ensure high quality.
● For a web application comprising a number of autonomous
websites, one may deploy a software tool such as a web crawler to collect
the documents.
Collecting Documents
● In other applications, you attach a logging process to an input
data stream for a length of time (e.g., for an email audit you log
the incoming and outgoing messages at the mail server for a period
of time).
● For research and development work in text mining, we need generic data:
a corpus.
● The accompanying software uses the Reuters corpus (RCV1).
● In the early days (the 1960s and 1970s), one million words was considered a
large collection; the Brown corpus consists of 500 samples of about 2,000 words
each of American English text.
Collecting Documents
● A European corpus was modeled on the Brown corpus, but for British
English.
● In the 1970s and 80s more resources became available, often government
sponsored.
● Some widely used corpora: the Penn Treebank (a collection of manually parsed
sentences from the Wall Street Journal).
● Another resource is the World Wide Web. Web crawlers can build collections
of pages from a particular site such as Yahoo. Given the size of the web,
such collections require cleaning before use.
Document Standardization
Document Standardization - XML
● XML is a standard way to insert tags into text to identify its parts.
● Each document is marked off from the corpus with XML tags.
● Typical XML tags include:
● <Date>
● <Subject>
● <Topic>
● <Text>
● <Body>
● <Header>
XML – An Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
XML
● The main reason to identify the parts is to allow selection of
those parts that will be used to generate features.
● The selected parts of a document are concatenated into strings,
separated by tags.
● Document standardization: why should we care?
● The advantage of standardization is that mining tools can be applied
without having to consider the pedigree of each document.
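● To make this concrete, here is a minimal sketch (not the book's accompanying software) that parses an XML document with Python's standard xml.etree.ElementTree and concatenates the text of selected parts; the tag names and the select_parts helper are illustrative assumptions.

import xml.etree.ElementTree as ET

def select_parts(xml_text, tags=("heading", "body")):
    """Parse an XML document and concatenate the text of the selected tags.

    The returned string is what later steps (tokenization, dictionary
    building) would operate on.
    """
    root = ET.fromstring(xml_text)
    parts = []
    for tag in tags:
        for element in root.iter(tag):
            if element.text:
                parts.append(element.text.strip())
    return " ".join(parts)

note = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

print(select_parts(note))  # "Reminder Don't forget me this weekend!"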
Tokenization
● With the documents collected in XML format, we now examine the data.
● Break the character stream into words, called TOKENS.
● Each token is an instance of a type; the number of tokens is therefore
higher than the number of types.
● If "the" occurs twice in a sentence, those are two tokens of one type;
"token" refers to an occurrence of a type.
● Space and tab characters are not tokens; they are white space.
● A comma or colon between letters is a token in its own right (e.g., USA,INDIA),
but between digits it acts as a delimiter (e.g., 121,135).
● An apostrophe has a number of uses: it can be a delimiter or part of a
token (e.g., D'Angelo).
● When it is followed by a terminator, it is treated as an internal quote
(e.g., Tess'.).
Tokenization - Pseudocode
● A dash is usually a terminator, but it can be part of a token when it joins
pieces such as the digits in 522-3333.
● Without identifying tokens, it is difficult to imagine extracting
higher-level information from a document.
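● The pseudocode itself is not reproduced above, so the following is only a rough regex-based sketch of the tokenization rules just described; the exact treatment of dashes, commas, and apostrophes is an assumption for illustration.

import re

# Simplified tokenizer sketch (not the book's exact pseudocode):
# words keep internal apostrophes (D'Angelo), digit runs joined by a dash
# stay together (522-3333), and other punctuation becomes its own token.
TOKEN_PATTERN = re.compile(
    r"\d+(?:-\d+)*"               # numbers, possibly dash-joined: 522-3333
    r"|[A-Za-z]+(?:'[A-Za-z]+)*"  # words with internal apostrophes: D'Angelo
    r"|[^\sA-Za-z\d]"             # any other single non-space character
)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("D'Angelo called 522-3333 from the USA,INDIA office."))
# ["D'Angelo", 'called', '522-3333', 'from', 'the', 'USA', ',', 'INDIA', 'office', '.']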
Lemmatization
● Once a character stream has been segmented into a sequence of
tokens, what is the next step?
● Convert each token to a standard form, known as stemming
or lemmatization (whether to do so is application dependent).
● This reduces the number of distinct types in the corpus and increases the
frequency of occurrence of individual types.
● English speakers agree that the nouns "book" and "books" are two forms of the
same word; it is often an advantage to eliminate this kind of variation.
● Normalization that regularizes grammatical variants is called inflectional
stemming.
Stemming to a Root
Stemming Pseudocode
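● The original pseudocode is not shown here; as a stand-in, this is a minimal sketch of inflectional stemming using a few hand-picked suffix rules (the rules and the length check are illustrative assumptions, far cruder than a real stemmer such as Porter's).

def inflectional_stem(token):
    """Very rough inflectional stemming: strip a few common endings.

    Real stemmers use many more rules plus exception lists; this only
    sketches the idea of mapping grammatical variants to one form.
    """
    word = token.lower()
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([inflectional_stem(w) for w in ["Books", "walked", "walking", "studies"]])
# ['book', 'walk', 'walk', 'stud']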
Vector Generation for Prediction
● Consider the problem of categorizing documents.
● The characteristic features are the tokens or words the documents contain.
● Without deep analysis, we can choose to describe each document
by features that represent the most frequent tokens.
● The collection of features is called a dictionary.
● The tokens or words in the dictionary form the basis for creating a
spreadsheet of numeric data corresponding to the document
collection.
● Each row is a document; each column is a feature.
Vector Generation for Prediction
● A cell in the spreadsheet is a measurement of a feature for a document.
● In the most basic model of the data, we simply record the presence or absence
of words.
● Checking for words is simple: rather than scanning the dictionary for each word,
we build a hash table. Large samples of digital documents are readily available,
which gives confidence about the variations and combinations of words that occur.
● If prediction is our goal, then we need one more column for the correct answer.
● In preparing data for learning, information is available from the document
labels. Our labels are binary answers, also called the class.
● Instead of generating a global dictionary over all classes, we can consider
only the words in the class that we are trying to predict.
● If this class is far smaller than the negative class, which is typical, the
local dictionary is far smaller than the global dictionary.
● Another reduction in dictionary size is to compile a list of stopwords and
remove them from the dictionary (see the sketch after this list).
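● The following Python fragment sketches the dictionary and spreadsheet construction described above (local dictionary, stopword removal, hash-table lookup, binary cells); the dictionary size cap and the tiny stopword list are illustrative assumptions.

from collections import Counter

def build_dictionary(tokenized_docs, stopwords, max_size=1000):
    """Collect the most frequent tokens, skipping stopwords."""
    counts = Counter(tok for doc in tokenized_docs
                     for tok in doc if tok not in stopwords)
    return [tok for tok, _ in counts.most_common(max_size)]

def binary_vectors(tokenized_docs, dictionary):
    """One row per document, one 0/1 column per dictionary word."""
    index = {tok: j for j, tok in enumerate(dictionary)}  # hash table lookup
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(dictionary)
        for tok in doc:
            if tok in index:
                row[index[tok]] = 1
        rows.append(row)
    return rows

docs = [["the", "stock", "market", "fell"], ["the", "game", "was", "won"]]
dictionary = build_dictionary(docs, stopwords={"the", "was"})
print(dictionary)
print(binary_vectors(docs, dictionary))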
● Stopwords almost never have any predictive capability: articles
such as "a" and "the", pronouns such as "it" and "they".
● Frequency information on the word counts can be quite useful
for reducing the dictionary size and improving predictive
performance.
● The most frequent words are often stopwords and can be deleted.
● An alternative to local dictionary generation is to
generate a global dictionary from all documents in the
collection. Special feature selection routines then attempt to
select a subset of words that have the greatest potential for
prediction, independently of the prediction method.
● If we have 100 topics to categorize, then we have 100 problems to solve.
Our choice is 100 small dictionaries or 1 global dictionary.
● The vectors implied by the spreadsheet model are then regenerated to
correspond to the smaller dictionary.
● Instead of placing every form of a word in the dictionary, we can follow the
practice of a printed dictionary and avoid storing every variation of a word
(no separate singular/plural or past/present forms).
● Verbs are stored in stemmed form.
● This adds a layer of complexity to the text processing, but performance is
gained and the dictionary size is reduced.
● A universal procedure that trims words to their root form can lose differences
in meaning: "exit" and "exiting" have different meanings in the context of
programming.
● With a small dictionary, you can capture the best words easily.
● The use of tokens and stemming are examples of procedures that help produce
smaller dictionaries. They improve the manageability of learning and its
accuracy.
● A document collection can thus be converted to a spreadsheet.
● Each column is a feature; each row is a document.
● The model of data for predictive text mining is thus a spreadsheet
populated by ones and zeros.
● The cells represent the presence or absence of dictionary words in a document
collection. Higher accuracy may require additional transformations.
● They are:
● Word pairs and collocations
● Frequency
● tf-idf
● Word pairs and collocations: they serve to increase the size of the
dictionary and improve the performance of prediction.
● Instead of 0s and 1s in the cells, the frequency of the word can be used
(if the word "the" occurs 10 times, the count 10 is used).
● Counts can give better results than binary values in the cells.
● This often leads to solutions as compact as those of the binary data
model, and the additional frequency information can yield a simpler solution.
● Frequencies are helpful in prediction but add complexity to the
solutions.
● A compromise that works well is a three-value system, 0/1/2
(see the sketch after this list):
● Word did not occur: 0
● Word occurred once: 1
● Word occurred two or more times: 2
● This captures much of the added value of frequency without adding much
complexity to the model.
● Another variant is zeroing values below a threshold, so that a token
must reach a minimum frequency before being considered of any use.
● This reduces the complexity of the spreadsheet used by data mining
algorithms.
● Other methods to reduce complexity are chi-square, mutual
information, odds ratio, etc.
● The next step beyond counting frequency is to modify the count by the
perceived importance of that word.
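● A minimal sketch of the two frequency transformations just described, the 0/1/2 compromise and the minimum-frequency threshold; the threshold value of 3 is an illustrative assumption.

def three_value(count):
    """Map a raw word count onto the 0/1/2 scheme described above."""
    if count == 0:
        return 0
    return 1 if count == 1 else 2

def threshold(count, min_count=3):
    """Zero out counts below a minimum frequency (min_count is illustrative)."""
    return count if count >= min_count else 0

counts = [0, 1, 2, 5, 10]
print([three_value(c) for c in counts])  # [0, 1, 2, 2, 2]
print([threshold(c) for c in counts])    # [0, 0, 0, 5, 10]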
● tf-idf: compute weightings or scores for words.
● The values are positive numbers that still capture the absence or
presence of a word (an absent word gets a score of zero).
● In Eq. (a) the weight assigned to word j is its term frequency
modified by a scale factor for the importance of the word. The scale factor
is the inverse document frequency of Eq. (b).
● It simply counts the number of documents containing the word,
df(j), and reverses the scaling.
● tf-idf(j) = tf(j) * idf(j)                                   Eq. (a)
● idf(j) = log( N / df(j) ),  N = total number of documents    Eq. (b)
● When a word appears in many documents, its scale factor is lowered,
perhaps toward zero; if a word is unique or appears in only a few documents,
the scale factor zooms upward and the word appears important.
● Alternatives to this tf-idf formulation exist, but the motivation is the
same. The result is a positive score that replaces the simple
frequency or binary (true/false) entry in the cell of the spreadsheet.
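● A small sketch of Eq. (a) and Eq. (b) applied to a matrix of raw counts; the natural logarithm is used here, and the toy counts are made up for illustration.

import math

def tfidf_matrix(count_rows):
    """Compute tf-idf(j) = tf(j) * log(N / df(j)) for every cell,
    following Eq. (a) and Eq. (b) above."""
    n_docs = len(count_rows)
    n_words = len(count_rows[0])
    # df(j): number of documents containing word j
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_words)]
    return [[row[j] * idf[j] for j in range(n_words)] for row in count_rows]

counts = [[3, 0, 1],
          [0, 2, 1],
          [1, 1, 1]]
for row in tfidf_matrix(counts):
    print([round(v, 3) for v in row])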
● Another variant is to weight tokens differently depending on the part of the
document they come from.
● Which data transformation method is BEST?
● There is no universal answer.
● The best predictive accuracy depends on matching these methods to the
problem at hand.
● The best variation for one problem may not be the best for another. Test them all.
● We have described the data as populating a spreadsheet in which most cells
are 0: each document contains only a small subset of the dictionary words.
● In text classification, a corpus has thousands of dictionary words, yet each
individual document contains relatively few unique tokens.
● Most of the spreadsheet row for a document is therefore 0. Rather than store
all the 0s, it is better to represent the spreadsheet as a set of sparse vectors
(each row is a list of pairs, where one element of a pair is the column and the
other is the corresponding nonzero value). By not storing the zeros, memory use
is greatly reduced.
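● A minimal sketch of the sparse representation described above: only the (column, value) pairs of the nonzero cells are stored.

def to_sparse(row):
    """Keep only the nonzero cells of a spreadsheet row as (column, value) pairs."""
    return [(j, v) for j, v in enumerate(row) if v != 0]

dense_row = [0, 0, 3, 0, 0, 0, 1, 0]
sparse_row = to_sparse(dense_row)
print(sparse_row)                       # [(2, 3), (6, 1)]
print(len(dense_row), len(sparse_row))  # 8 2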
Multi Word Features
● So far, features have been associated with single words (tokens delimited by
white space).
● This simple scenario can be extended to include pairs of words, e.g., "bon" and
"vivant". Instead of treating them separately, we could make the pair a single
feature, "bon vivant".
● Why stop at pairs? Why not consider multiword features?
● A multiword feature is x words occurring within a maximum window of size y
(y >= x, naturally).
● How are such features extracted from text? Specialized methods are needed.
● If we use frequency methods, we look for combinations of words that are relatively
frequent.
● A straightforward implementation simply enumerates combinations of x words within a
window of size y (see the sketch after this list).
● The value of a multiword feature is measured by the correlation between the words in
the potential feature; measures such as mutual information or the likelihood
ratio are used.
● Algorithms exist for generating multiword features, but a straightforward
implementation consumes a lot of memory.
● Multiword features are not found very often in a document collection, but they are
highly predictive.
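● A rough sketch of window-based extraction, restricted to word pairs (x = 2) for brevity; the window size is an illustrative assumption, and the ranking by mutual information or likelihood ratio mentioned above is not implemented here.

from collections import Counter

def word_pairs_in_window(tokens, window=3):
    """Count ordered word pairs that co-occur within `window` tokens of
    each other (a simple stand-in for multiword-feature extraction)."""
    pairs = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1:i + window]:
            pairs[(first, second)] += 1
    return pairs

tokens = "the stock market fell as the bond market rose".split()
for pair, count in word_pairs_in_window(tokens).most_common(3):
    print(pair, count)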
● A commonly used ranking score is the information gain criterion,
which is:
IG(j) = L_label - L(j)

L_label = sum over c = 0..1 of  Pr(y = c) * log2( 1 / Pr(y = c) )

L(j) = sum over v = 0..1 of  Pr(x_j = v) * sum over c = 0..1 of
       Pr(y = c | x_j = v) * log2( 1 / Pr(y = c | x_j = v) )
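● A small sketch of the information gain computation for one binary feature x_j against a binary label y, following the formulas above; the toy data are made up for illustration.

import math

def entropy(probabilities):
    """Sum of Pr * log2(1 / Pr) over outcomes with nonzero probability."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

def information_gain(x_j, y):
    """IG(j) = L_label - L(j) for a binary feature x_j and binary label y."""
    n = len(y)
    l_label = entropy([y.count(c) / n for c in (0, 1)])
    l_j = 0.0
    for v in (0, 1):
        rows = [i for i in range(n) if x_j[i] == v]
        if not rows:
            continue
        p_v = len(rows) / n
        cond = [sum(1 for i in rows if y[i] == c) / len(rows) for c in (0, 1)]
        l_j += p_v * entropy(cond)
    return l_label - l_j

# Tiny example: the feature almost determines the label, so the gain is high.
x = [1, 1, 1, 0, 0, 0]
y = [1, 1, 1, 0, 0, 1]
print(round(information_gain(x, y), 3))  # ~0.459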