
From Textual Information to Numerical Vectors

1
Introduction
● To mine text, we need to process it into a form that data mining procedures can use.
● As discussed in an earlier chapter, this involves generating features in a spreadsheet format.
● Classical data mining looks at highly structured data.
● The spreadsheet model is the embodiment of a representation that supports predictive modeling.
● Predictive text mining is simpler and more restrictive than open-ended data mining.
● Text is called unstructured because it is very far from the spreadsheet model that we need in order to process data for prediction.

2
Introduction
● The transformation of data to the spreadsheet model is a methodical, carefully organized procedure for filling in the cells of a spreadsheet.
● We have to determine the nature of each column in the spreadsheet.
● Some features are easy to obtain (e.g., the words in a text); others are difficult (e.g., the grammatical function of a word in a sentence).
● What we discuss here:
● How to obtain the kinds of features generated from text.

3
Collecting Documents
● The first step in text mining is to collect the data (the documents).
● A web page retrieval application for an intranet implicitly specifies the relevant documents to be the web pages on that intranet.
● Once the documents are identified, they can be obtained; the main issue is to cleanse the samples and ensure high quality.
● For a web application comprising a number of autonomous websites, one may deploy a software tool such as a web crawler to collect the documents.

4
Collecting Documents
● In other applications, you may have a logging process attached to an input data stream for a length of time (e.g., for an email audit you log the incoming and outgoing messages at the mail server for a period of time).
● For research and development work in text mining, we need generic data, i.e., a corpus.
● The corpus accompanying the software is from Reuters and is called the Reuters corpus (RCV1).
● In the early days (the 1960s and 1970s), one million words was considered a large collection; the Brown corpus consists of 500 samples of about 2,000 words each of American English text.

5
Collecting Documents
● A European corpus was modeled on the Brown corpus, but for British English.
● In the 1970s and 80s, more resources became available, many of them government sponsored.
● Some widely used corpora include the Penn Treebank (a collection of manually parsed sentences from the Wall Street Journal).
● Another resource is the World Wide Web. Web crawlers can build collections of pages from a particular site such as Yahoo. Given the size of the web, such collections require cleaning before use.

6
Document Standardization

● When documents are collected, they may arrive in different formats.
● Some documents may be collected in Word format, others as simple ASCII text. To process these documents we have to convert them to a standard format.
● The standard format used here is XML.
● XML is the Extensible Markup Language.

7
Document Standardization-XML
● XML is a standard way to insert tags into text to identify its parts.
● Each document is marked off from the corpus with XML tags.
● Typical tags include:
● <Date>
● <Subject>
● <Topic>
● <Text>
● <Body>
● <Header>

8
XML – An Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
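
Once a document is in this standard form, its parts can be selected with any XML parser. As a minimal illustration (not from the slides), assuming the note above is saved as note.xml (a hypothetical file name), Python's xml.etree module can pull out the individual parts:

import xml.etree.ElementTree as ET

# Parse the standardized document and keep only the parts we want
# to generate features from.
note = ET.parse("note.xml").getroot()      # "note.xml" is a hypothetical file
print(note.findtext("heading"))            # Reminder
print(note.findtext("body"))               # Don't forget me this weekend!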

9
XML
● The main reason to identify the parts is to allow selection of those parts that are used to generate features.
● The selected parts of a document are concatenated into strings, separated by tags.
● Document standardization: why should we care?
● The advantage of standardization is that mining tools can be applied without having to consider the pedigree of each document.

10
Tokenization
● Once documents are collected in XML format, we examine the data.
● We break the character stream into words, called TOKENS.
● Each token is an instance of a type; the number of tokens is higher than the number of types.
● If "the" occurs twice in a sentence, those are two tokens of one type; a token refers to an occurrence of a type.
● Space and tab characters are not tokens; they are white space.
● A comma or colon between letters is a token (e.g., USA,INDIA); between digits it is a delimiter (e.g., 121,135).
● An apostrophe has a number of uses: it can be a delimiter or part of a token (e.g., D'Angelo).
● When it is followed by a terminator, it is an internal quote (e.g., Tess'.).

11
Tokenization - Pseudocode
● A dash is a terminator unless the token is preceded or followed by another dash (e.g., 522-3333).
● Without identifying tokens, it is difficult to imagine extracting higher-level information from a document.
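
The pseudocode itself is not reproduced in these slides; the following is a minimal regex-based sketch in Python (an illustration of the rules above, not the book's algorithm):

import re

# Keep digit runs with internal commas/periods/dashes (121,135 or 522-3333),
# words with internal apostrophes (D'Angelo), and emit other punctuation
# as separate tokens; white space is only a delimiter.
TOKEN = re.compile(r"\d+(?:[,.\-]\d+)*|[A-Za-z]+(?:'[A-Za-z]+)*|[^\s\w]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Call 522-3333, D'Angelo!"))
# ['Call', '522-3333', ',', "D'Angelo", '!']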

12
Lemmatization
● Once a character stream has been segmented into a sequence of tokens, what is the next step?
● Convert each token to a standard form, a step called stemming or lemmatization (whether it is applied is application dependent).
● This reduces the number of distinct types in the corpus and increases the frequency of occurrence of individual types.
● English speakers agree that the nouns "book" and "books" are two forms of the same word; it is often advantageous to eliminate this kind of variation.
● Normalization that regularizes grammatical variants is called inflectional stemming.

13
Stemming to a Root

● Grammatical variants include singular/plural and present/past.
● It is always advantageous to eliminate this kind of variation before further processing.
● When normalization is confined to regular grammatical variants such as singular/plural and present/past, the process is called inflectional stemming.
● The intent of stemming to a root is to reach a root form with no inflectional or derivational prefixes or suffixes; the end result of this more aggressive stemming is to reduce the number of types in the text.

14
Stemming Pseudocode
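
The pseudocode from this slide is not included in the extracted text; the following is a minimal inflectional-stemming sketch in Python (an illustration handling a few regular English suffixes, not the book's algorithm):

def inflectional_stem(token):
    # Strip a few regular inflectional suffixes; real stemmers use many more rules.
    word = token.lower()
    for suffix, replacement in (("ies", "y"), ("sses", "ss"),
                                ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

print([inflectional_stem(w) for w in ["books", "studies", "exiting"]])
# ['book', 'study', 'exit']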

15
Vector Generation for prediction
● Consider the problem of categorizing documents.
● The characteristic features are the tokens or words the documents contain.
● Without deep analysis, we can choose to describe each document by features that represent its most frequent tokens.
● The collection of features is called a dictionary.
● The tokens or words in the dictionary form the basis for creating a spreadsheet of numeric data corresponding to the document collection.
● Each row corresponds to a document and each column to a feature, as in the sketch below.
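
A minimal sketch of this step in Python (an illustration, not from the slides): build the dictionary from the most frequent tokens and represent each document as a row of counts.

from collections import Counter

# Two tiny, already-tokenized documents (made-up data).
docs = [["the", "broken", "vase"],
        ["the", "vase", "the", "book"]]

# Dictionary: the k most frequent tokens across the collection.
k = 3
totals = Counter(tok for doc in docs for tok in doc)
dictionary = [tok for tok, _ in totals.most_common(k)]

# Spreadsheet: one row per document, one column per dictionary word.
rows = [[doc.count(tok) for tok in dictionary] for doc in docs]
print(dictionary)   # e.g. ['the', 'vase', 'broken']
print(rows)         # e.g. [[1, 1, 1], [2, 1, 0]]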

16
Vector Generation for prediction
● Each cell in the spreadsheet is a measurement of a feature for a document.
● In the basic model of the data, we simply check the presence or absence of words.
● Checking for words is simple because we do not scan the dictionary for each word; we build a hash table instead. Large samples of digital documents are readily available, which gives confidence about the variations and combinations of words that occur.
● If prediction is our goal, then we need one more column for the correct answer.
● In preparing data for learning, information is available from the document labels. Our labels are binary answers, also called the class.
● Instead of generating a global dictionary, we can consider only the words in the class that we are trying to predict (a local dictionary).
● If this class is far smaller than the negative class, which is typical, the local dictionary is far smaller than the global dictionary.
● Another reduction in dictionary size is to compile a list of stopwords and remove them from the dictionary.

17
● Stopwords almost never have any predictive capability; examples are articles such as "a" and "the" and pronouns such as "it" and "they".
● Frequency information on the word counts can be quite useful in reducing the dictionary size and improving predictive performance.
● The most frequent words are often stopwords and can be deleted.
● An alternative to local dictionary generation is to generate a global dictionary from all documents in the collection. Special feature selection routines then attempt to select the subset of words that has the greatest potential for prediction (selection methods are discussed later).
● If we have 100 topics to categorize, then we have 100 problems to solve; our choice is between 100 small dictionaries or 1 global dictionary.

18
● The vectors implied by the spreadsheet model are regenerated to correspond to the smaller dictionary.
● Instead of placing every word in the dictionary, we can follow the path of a printed dictionary and avoid storing every variation of a word (no separate singular/plural or past/present forms).
● Verbs are stored in their stemmed form.
● This adds a layer of complexity to the text processing, but performance may be gained and the dictionary size is reduced.
● A universal procedure of trimming words to their root form can blur differences in meaning (e.g., "exit" and "exiting" have different meanings in the context of programming).
● With a small dictionary, you can capture the best words easily.
● The use of tokens and stemming are examples of procedures that help produce smaller dictionaries, improving the manageability of learning and accuracy.
● Documents can thus be converted to a spreadsheet.
19
● Each column is a feature; each row is a document.
● The basic model of data for predictive text mining is a spreadsheet populated by ones and zeros.
● The cells represent the presence or absence of dictionary words in the document collection. For higher accuracy, additional transformations can be applied. They are:
● Word pairs and collocations
● Frequency
● tf-idf
● Word pairs and collocations: these increase the size of the dictionary and can improve the performance of prediction.
● Frequency: instead of 0s and 1s in the cells, the count of each word can be used (if the word "the" occurs 10 times, the count 10 is used).
● Counts can give better results than binary values in the cells.
● They can lead to solutions as compact as those of the binary data model, and the additional frequency information may yield simpler solutions.
20
● Frequencies are helpful for prediction but add complexity to the solutions.
● A compromise that works well is a three-valued system (0/1/2), sketched after this slide:
● 0 if the word did not occur
● 1 if the word occurred once
● 2 if the word occurred two or more times
● This captures much of the added value of frequency without adding much complexity to the model.
● Another variant is to zero out values below a threshold, i.e., to require a minimum frequency before a token is considered of any use.
● This reduces the complexity of the spreadsheet used by the data mining algorithms.
● Other methods to reduce complexity are chi-square, mutual information, odds ratio, etc.
● The next step beyond counting frequency is to modify the count by the perceived importance of that word.
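
A minimal sketch of the three-valued encoding in Python (an illustration, not from the slides):

def three_valued(count):
    # 0 = word absent, 1 = occurred once, 2 = occurred two or more times.
    return 0 if count == 0 else 1 if count == 1 else 2

print([three_valued(c) for c in [0, 1, 10, 3]])   # [0, 1, 2, 2]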

21
● tf-idf: compute weightings or scores for the words.
● The cell values are now positive numbers that capture more than the absence or presence of words.
● In Eq. (a), the weight assigned to word j is its term frequency modified by a scale factor for the importance of the word. The scale factor is the inverse document frequency, Eq. (b).
● The idf simply checks the number of documents containing the word, df(j), and scales inversely, where N is the total number of documents in the collection.
● tf-idf(j) = tf(j) * idf(j)        ... Eq. (a)
● idf(j) = log( N / df(j) )        ... Eq. (b)
● When a word appears in many documents, the scale factor is lowered, perhaps to zero; if a word is unique or appears in only a few documents, the scale factor zooms upward and the word appears important.
● Alternatives to this tf-idf formulation exist, but the motivation is the same: the result is a positive score that replaces the simple frequency or binary (true/false) entry in a cell of the spreadsheet.
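
A minimal sketch of Eqs. (a) and (b) in Python (an illustration, not from the slides), taking tf(j) as the raw count of word j in the current document:

import math

def tf_idf(tf_j, df_j, n_docs):
    # idf(j) = log(N / df(j));  tf-idf(j) = tf(j) * idf(j)
    return tf_j * math.log(n_docs / df_j)

# Made-up counts: a word occurring 5 times here, in 10 of 1000 documents.
print(round(tf_idf(5, 10, 1000), 2))   # 23.03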
22
● Another variant is to weight tokens differently depending on the part of the document they come from.
● Which data transformation method is best?
● There is no universal answer.
● The best predictive accuracy depends on matching these methods to the data and the learning method.
● The best variation for one method may not be the best for another: test them all.
● We describe the data as populating a spreadsheet, but most cells are 0.
● Each document contains only a small subset of the dictionary words.
● In text classification, a corpus has thousands of dictionary words, while each individual document has comparatively few unique tokens.
● Most of the spreadsheet row for such a document is therefore 0. Rather than store all the 0s, it is better to represent the spreadsheet as a set of sparse vectors: each row is a list of pairs, where one element of the pair is a column index and the other is the corresponding nonzero value. By not storing the zeros, we greatly reduce memory use.
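
A minimal sketch of the sparse representation in Python (an illustration, not from the slides):

def to_sparse(row):
    # Keep only (column_index, nonzero_value) pairs for one spreadsheet row.
    return [(j, v) for j, v in enumerate(row) if v != 0]

print(to_sparse([0, 3, 0, 0, 1, 0]))   # [(1, 3), (4, 1)]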

23
Multi Word Features
● So far, features have been associated with single words (tokens delimited by white space).
● This simple scenario can be extended to include pairs of words, e.g., "bon" and "vivant": instead of treating them separately, we can use the pair "bon vivant" as a single feature.
● Why stop at pairs? Why not consider multiword features?
● Unlike word pairs, the words in a multiword feature need not be consecutive.
● E.g., with "Don Smith" as a feature, we can ignore his middle name "Leroy" that may appear in some references to the person.
● Similarly, we may have to accommodate references to a noun that involve a number of adjectives, with the desired adjective not adjacent to the noun; e.g., we want to accept the phrase "broken and dirty vase" as an instance of "broken vase".

24
● A multiword feature is a set of x words occurring within a maximum window of size y (y >= x, naturally).
● How are such features extracted from text? Specialized methods are needed.
● If we use frequency-based methods, we look for combinations of words that are relatively frequent.
● A straightforward implementation simply enumerates combinations of x words within a window of size y.
● The value of a multiword feature is measured by the correlation between the words in the potential feature; measures such as mutual information or the likelihood ratio are used.
● An algorithm exists for generating multiword features, but a straightforward implementation consumes a lot of memory.
● Multiword features are not found very often in a document collection, but they are highly predictive.

AM(T) = [ size(T) * log10(freq(T)) * freq(T) ] / Σ_{word_i ∈ T} freq(word_i)

25
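
A minimal sketch of this association measure in Python (an illustration; the counts are made-up, not from the slides):

import math

def am(phrase_freq, word_freqs):
    # AM(T) = size(T) * log10(freq(T)) * freq(T) / sum of freq(word_i)
    return len(word_freqs) * math.log10(phrase_freq) * phrase_freq / sum(word_freqs)

# Hypothetical counts: "broken vase" seen 40 times; "broken" 300 times, "vase" 100 times.
print(round(am(40, [300, 100]), 2))   # 0.32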
26
Labels for Right Answers:
● For prediction, an extra column is added to the spreadsheet.
● This last column contains the labels; it looks no different from the others.
● Each entry is a 0 or 1, indicating whether the right answer is false or true.
● In the sparse vector format, the label is appended to each vector separately as either a one (positive class) or a zero (negative class).
Feature Selection by Attribute Ranking:
● In addition to frequency-based approaches, feature selection can be done in a number of ways.
● Select a set of features for each category to form a local dictionary for that category.
● Independently rank the feature attributes according to their predictive abilities for the category under consideration.
● The predictive ability of an attribute can be measured by a quantity expressing how well it is correlated with the label.
● Let n be the number of documents, x_j denote the presence or absence of attribute j in a document x, and y denote the label of the document in the last column.

27
● A commonly used ranking score is the information gain criterion:
IG(j) = L_label - L(j)

L_label = Σ_{c=0}^{1} Pr(y = c) * log2( 1 / Pr(y = c) )

L(j) = Σ_{v=0}^{1} Pr(x_j = v) * Σ_{c=0}^{1} Pr(y = c | x_j = v) * log2( 1 / Pr(y = c | x_j = v) )

● The quantity L(j) is the number of bits required to encode the label and the attribute j minus the number of bits required to encode the attribute alone, i.e., the number of bits needed to encode the label once the value of attribute j is known.
● The probabilities needed to compute L(j) can be easily estimated using the following (smoothed) estimators:
Pr(x_j = v) = ( freq(x_j = v) + 1 ) / ( n + 2 )

Pr(y = c | x_j = v) = ( freq(x_j = v, label = c) + 1 ) / ( freq(x_j = v) + 2 )
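
A minimal sketch of the information gain score in Python (an illustration using the smoothed estimators above; the tiny label/attribute lists are made-up):

import math

def entropy(probs):
    # Sum of p * log2(1/p) over the outcomes with nonzero probability.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def information_gain(x_j, y):
    # IG(j) = L_label - L(j), with Laplace-smoothed probabilities for L(j).
    n = len(y)
    l_label = entropy([sum(1 for c in y if c == label) / n for label in (0, 1)])
    l_j = 0.0
    for v in (0, 1):
        n_v = sum(1 for x in x_j if x == v)
        pr_v = (n_v + 1) / (n + 2)
        cond = [(sum(1 for x, c in zip(x_j, y) if x == v and c == label) + 1) / (n_v + 2)
                for label in (0, 1)]
        l_j += pr_v * entropy(cond)
    return l_label - l_j

# Attribute present in exactly the positive documents.
print(round(information_gain([1, 1, 0, 0], [1, 1, 0, 0]), 3))   # 0.189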
28
Sentence Boundary Determination
● If the XML markup for the corpus does not mark sentence boundaries, it may be necessary to identify them ourselves.
● In particular, it is necessary to determine when a period is part of a token and when it is not.
● For more sophisticated linguistic parsing, the algorithms often require a complete sentence as input.
● Extraction algorithms typically operate on text one sentence at a time.
● These algorithms work best when sentences are identified clearly.
● Sentence boundary determination is the problem of deciding which instances of a period followed by white space are sentence delimiters and which are not; since we assume the characters ? and ! always end a sentence, this is a classification problem.
● A simple algorithm achieves good accuracy, and adjustments give better performance; a rule-based sketch follows below.
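
A minimal rule-based sketch of such a classifier in Python (an illustration, not the slides' algorithm): a period followed by white space is treated as a sentence boundary unless it ends a known abbreviation.

import re

ABBREVIATIONS = {"dr", "mr", "mrs", "etc", "e.g", "i.e"}   # hypothetical list

def is_sentence_boundary(text, i):
    # Classify the period at index i (assumed to be followed by white space).
    prev_word = re.findall(r"[\w.]+$", text[:i])
    return not (prev_word and prev_word[0].lower() in ABBREVIATIONS)

text = "Dr. Smith arrived. He left."
periods = [i for i, ch in enumerate(text)
           if ch == "." and (i + 1 == len(text) or text[i + 1].isspace())]
print([(i, is_sentence_boundary(text, i)) for i in periods])
# [(2, False), (17, True), (26, True)]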
29
30
Thank you ☺ !!!!

31
