Text Mining
Numerical Vectors
Introduction
● To mine text, we need to process it into a form that data mining
procedures can use.
● As discussed in an earlier chapter, this involves generating features in a
spreadsheet format.
● Classical data mining looks at highly structured data.
● The spreadsheet model is the embodiment of a representation that
supports predictive modeling.
● Predictive text mining is simpler and more restrictive than open-ended
data mining.
● Text is called unstructured because it is very far from the spreadsheet
model into which we need to put data for prediction.
Introduction
● Transforming the data to the spreadsheet model is a methodical and
carefully organized procedure for filling in the cells of a spreadsheet.
● We have to determine the nature of each column in the spreadsheet.
● Some features are easy to obtain, others are difficult: a word in a
text is easy; the grammatical function of a word in a sentence is hard.
● What do we discuss here?
● How to obtain the kinds of features generated from text.
Collecting Documents
● The first step of text mining is to collect the data (the documents).
● A web page retrieval application for an intranet implicitly specifies
the relevant documents to be the web pages on that intranet.
● If the documents are identified, they can be obtained; the main issue
is to cleanse the samples and ensure high quality.
● For a web application comprising a number of autonomous
websites, one may deploy a software tool such as a web crawler to collect
the documents.
Collecting Documents
● In other applications, you attach a logging process to an input
data stream for a length of time (e.g., for an email audit you log
the incoming and outgoing messages at the mail server for a period
of time).
● For research and development work in text mining, we need generic data:
a corpus.
● The accompanying software uses the Reuters corpus (RCV1).
● In the early days (the 1960s and 1970s), one million words was considered a
large collection; the Brown corpus consists of 500 samples of about 2,000 words
each of American English text.
Collecting Documents
● A European corpus was modeled on the Brown corpus, but for British
English.
● In the 1970s and 80s more resources became available, often government
sponsored.
● Some widely used corpora: the Penn Treebank (a collection of manually parsed
sentences from the Wall Street Journal).
● Another resource is the World Wide Web. Web crawlers can build collections
of pages from a particular site such as Yahoo. Given the size of the web,
such collections require cleaning before use.
Document Standardization
Document Standardization - XML
● XML is a standard way to insert tags into text to identify its parts.
● Each document is marked off from the corpus with XML tags.
● Typical XML tags include:
● <Date>
● <Subject>
● <Topic>
● <Text>
● <Body>
● <Header>
XML – An Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
XML
● The main reason to identify the parts is to allow selection of
those parts that will be used to generate features.
● The selected parts of a document are concatenated into strings,
separated by tags.
● Document standardization: why should we care?
● The advantage of standardization is that mining tools can be applied
without having to consider the pedigree of each document.
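● To make this concrete, here is a minimal sketch (not the book's accompanying software) that parses an XML document with Python's standard xml.etree.ElementTree and concatenates the text of selected parts; the tag names and the select_parts helper are illustrative assumptions.

import xml.etree.ElementTree as ET

def select_parts(xml_text, tags=("heading", "body")):
    """Parse an XML document and concatenate the text of the selected tags.

    The returned string is what later steps (tokenization, dictionary
    building) would operate on.
    """
    root = ET.fromstring(xml_text)
    parts = []
    for tag in tags:
        for element in root.iter(tag):
            if element.text:
                parts.append(element.text.strip())
    return " ".join(parts)

note = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>"""

print(select_parts(note))  # "Reminder Don't forget me this weekend!"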
Tokenization
● With the documents collected in XML format, we now examine the data.
● Break the character stream into words, called TOKENS.
● Each token is an instance of a type; the number of tokens is therefore
higher than the number of types.
● If "the" occurs twice in a sentence, those are two tokens of one type;
"token" refers to an occurrence of a type.
● Space and tab characters are not tokens; they are white space.
● A comma or colon between letters is a token in its own right (e.g., USA,INDIA),
but between digits it acts as a delimiter (e.g., 121,135).
● An apostrophe has a number of uses: it can be a delimiter or part of a
token (e.g., D'Angelo).
● When it is followed by a terminator, it is treated as an internal quote
(e.g., Tess'.).
Tokenization - Pseudocode
● A dash is usually a terminator, but it can be part of a token when it joins
pieces such as the digits in 522-3333.
● Without identifying tokens, it is difficult to imagine extracting
higher-level information from a document.
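● The pseudocode itself is not reproduced above, so the following is only a rough regex-based sketch of the tokenization rules just described; the exact treatment of dashes, commas, and apostrophes is an assumption for illustration.

import re

# Simplified tokenizer sketch (not the book's exact pseudocode):
# words keep internal apostrophes (D'Angelo), digit runs joined by a dash
# stay together (522-3333), and other punctuation becomes its own token.
TOKEN_PATTERN = re.compile(
    r"\d+(?:-\d+)*"               # numbers, possibly dash-joined: 522-3333
    r"|[A-Za-z]+(?:'[A-Za-z]+)*"  # words with internal apostrophes: D'Angelo
    r"|[^\sA-Za-z\d]"             # any other single non-space character
)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("D'Angelo called 522-3333 from the USA,INDIA office."))
# ["D'Angelo", 'called', '522-3333', 'from', 'the', 'USA', ',', 'INDIA', 'office', '.']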
Lemmatization
● Once a character stream has been segmented into a sequence of
tokens, what is the next step?
● Convert each token to a standard form, known as stemming
or lemmatization (whether to do so is application dependent).
● This reduces the number of distinct types in the corpus and increases the
frequency of occurrence of individual types.
● English speakers agree that the nouns "book" and "books" are two forms of the
same word; it is often an advantage to eliminate this kind of variation.
● Normalization that regularizes grammatical variants is called inflectional
stemming.
Stemming to a Root
Stemming Pseudocode
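● The original pseudocode is not shown here; as a stand-in, this is a minimal sketch of inflectional stemming using a few hand-picked suffix rules (the rules and the length check are illustrative assumptions, far cruder than a real stemmer such as Porter's).

def inflectional_stem(token):
    """Very rough inflectional stemming: strip a few common endings.

    Real stemmers use many more rules plus exception lists; this only
    sketches the idea of mapping grammatical variants to one form.
    """
    word = token.lower()
    for suffix in ("ies", "es", "s", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([inflectional_stem(w) for w in ["Books", "walked", "walking", "studies"]])
# ['book', 'walk', 'walk', 'stud']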
Vector Generation for Prediction
● Consider the problem of categorizing documents.
● The characteristic features are the tokens or words the documents contain.
● Without deep analysis, we can choose to describe each document
by features that represent the most frequent tokens.
● The collection of features is called a dictionary.
● The tokens or words in the dictionary form the basis for creating a
spreadsheet of numeric data corresponding to the document
collection.
● Each row is a document; each column is a feature.
Vector Generation for Prediction
● A cell in the spreadsheet is a measurement of a feature for a document.
● In the most basic model of the data, we simply record the presence or absence
of words.
● Checking for words is simple: rather than scanning the dictionary for each word,
we build a hash table. Large samples of digital documents are readily available,
which gives confidence about the variations and combinations of words that occur.
● If prediction is our goal, then we need one more column for the correct answer.
● In preparing data for learning, information is available from the document
labels. Our labels are binary answers, also called the class.
● Instead of generating a global dictionary over all classes, we can consider
only the words in the class that we are trying to predict.
● If this class is far smaller than the negative class, which is typical, the
local dictionary is far smaller than the global dictionary.
● Another reduction in dictionary size is to compile a list of stopwords and
remove them from the dictionary (see the sketch after this list).
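● The following Python fragment sketches the dictionary and spreadsheet construction described above (local dictionary, stopword removal, hash-table lookup, binary cells); the dictionary size cap and the tiny stopword list are illustrative assumptions.

from collections import Counter

def build_dictionary(tokenized_docs, stopwords, max_size=1000):
    """Collect the most frequent tokens, skipping stopwords."""
    counts = Counter(tok for doc in tokenized_docs
                     for tok in doc if tok not in stopwords)
    return [tok for tok, _ in counts.most_common(max_size)]

def binary_vectors(tokenized_docs, dictionary):
    """One row per document, one 0/1 column per dictionary word."""
    index = {tok: j for j, tok in enumerate(dictionary)}  # hash table lookup
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(dictionary)
        for tok in doc:
            if tok in index:
                row[index[tok]] = 1
        rows.append(row)
    return rows

docs = [["the", "stock", "market", "fell"], ["the", "game", "was", "won"]]
dictionary = build_dictionary(docs, stopwords={"the", "was"})
print(dictionary)
print(binary_vectors(docs, dictionary))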
● Stopwords almost never have any predictive capability: articles
such as "a" and "the", pronouns such as "it" and "they".
● Frequency information on the word counts can be quite useful
for reducing the dictionary size and improving predictive
performance.
● The most frequent words are often stopwords and can be deleted.
● An alternative to local dictionary generation is to
generate a global dictionary from all documents in the
collection. Special feature selection routines then attempt to
select a subset of words that have the greatest potential for
prediction, independently of the prediction method.
● If we have 100 topics to categorize, then we have 100 problems to solve.
Our choice is 100 small dictionaries or 1 global dictionary.
● The vectors implied by the spreadsheet model are then regenerated to
correspond to the smaller dictionary.
● Instead of placing every form of a word in the dictionary, we can follow the
practice of a printed dictionary and avoid storing every variation of a word
(no separate singular/plural or past/present forms).
● Verbs are stored in stemmed form.
● This adds a layer of complexity to the text processing, but performance is
gained and the dictionary size is reduced.
● A universal procedure that trims words to their root form can lose differences
in meaning: "exit" and "exiting" have different meanings in the context of
programming.
● With a small dictionary, you can capture the best words easily.
● The use of tokens and stemming are examples of procedures that help produce
smaller dictionaries. They improve the manageability of learning and its
accuracy.
● A document collection can thus be converted to a spreadsheet.
● Each column is a feature; each row is a document.
● The model of data for predictive text mining is thus a spreadsheet
populated by ones and zeros.
● The cells represent the presence or absence of dictionary words in a document
collection. Higher accuracy may require additional transformations.
● They are:
● Word pairs and collocations
● Frequency
● tf-idf
● Word pairs and collocations: they serve to increase the size of the
dictionary and improve the performance of prediction.
● Instead of 0s and 1s in the cells, the frequency of the word can be used
(if the word "the" occurs 10 times, the count 10 is used).
● Counts can give better results than binary values in the cells.
● This often leads to solutions as compact as those of the binary data
model, and the additional frequency information can yield a simpler solution.
● Frequencies are helpful in prediction but add complexity to the
solutions.
● A compromise that works well is a three-value system, 0/1/2
(see the sketch after this list):
● Word did not occur: 0
● Word occurred once: 1
● Word occurred two or more times: 2
● This captures much of the added value of frequency without adding much
complexity to the model.
● Another variant is zeroing values below a threshold, so that a token
must reach a minimum frequency before being considered of any use.
● This reduces the complexity of the spreadsheet used by data mining
algorithms.
● Other methods to reduce complexity are chi-square, mutual
information, odds ratio, etc.
● The next step beyond counting frequency is to modify the count by the
perceived importance of that word.
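● A minimal sketch of the two frequency transformations just described, the 0/1/2 compromise and the minimum-frequency threshold; the threshold value of 3 is an illustrative assumption.

def three_value(count):
    """Map a raw word count onto the 0/1/2 scheme described above."""
    if count == 0:
        return 0
    return 1 if count == 1 else 2

def threshold(count, min_count=3):
    """Zero out counts below a minimum frequency (min_count is illustrative)."""
    return count if count >= min_count else 0

counts = [0, 1, 2, 5, 10]
print([three_value(c) for c in counts])  # [0, 1, 2, 2, 2]
print([threshold(c) for c in counts])    # [0, 0, 0, 5, 10]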
● tf-idf: compute weightings or scores for words.
● The values are positive numbers that still capture the absence or
presence of a word (an absent word gets a score of zero).
● In Eq. (a) the weight assigned to word j is its term frequency
modified by a scale factor for the importance of the word. The scale factor
is the inverse document frequency of Eq. (b).
● It simply counts the number of documents containing the word,
df(j), and reverses the scaling.
● tf-idf(j) = tf(j) * idf(j)                                   Eq. (a)
● idf(j) = log( N / df(j) ),  N = total number of documents    Eq. (b)
● When a word appears in many documents, its scale factor is lowered,
perhaps toward zero; if a word is unique or appears in only a few documents,
the scale factor zooms upward and the word appears important.
● Alternatives to this tf-idf formulation exist, but the motivation is the
same. The result is a positive score that replaces the simple
frequency or binary (true/false) entry in the cell of the spreadsheet.
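● A small sketch of Eq. (a) and Eq. (b) applied to a matrix of raw counts; the natural logarithm is used here, and the toy counts are made up for illustration.

import math

def tfidf_matrix(count_rows):
    """Compute tf-idf(j) = tf(j) * log(N / df(j)) for every cell,
    following Eq. (a) and Eq. (b) above."""
    n_docs = len(count_rows)
    n_words = len(count_rows[0])
    # df(j): number of documents containing word j
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_words)]
    idf = [math.log(n_docs / df[j]) if df[j] else 0.0 for j in range(n_words)]
    return [[row[j] * idf[j] for j in range(n_words)] for row in count_rows]

counts = [[3, 0, 1],
          [0, 2, 1],
          [1, 1, 1]]
for row in tfidf_matrix(counts):
    print([round(v, 3) for v in row])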
● Another variant is to weight tokens differently depending on the part of the
document they come from.
● Which data transformation method is BEST?
● There is no universal answer.
● The best predictive accuracy depends on matching these methods to the
problem at hand.
● The best variation for one problem may not be the best for another. Test them all.
● We have described the data as populating a spreadsheet in which most cells
are 0: each document contains only a small subset of the dictionary words.
● In text classification, a corpus has thousands of dictionary words, yet each
individual document contains relatively few unique tokens.
● Most of the spreadsheet row for a document is therefore 0. Rather than store
all the 0s, it is better to represent the spreadsheet as a set of sparse vectors
(each row is a list of pairs, where one element of a pair is the column and the
other is the corresponding nonzero value). By not storing the zeros, memory use
is greatly reduced.
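● A minimal sketch of the sparse representation described above: only the (column, value) pairs of the nonzero cells are stored.

def to_sparse(row):
    """Keep only the nonzero cells of a spreadsheet row as (column, value) pairs."""
    return [(j, v) for j, v in enumerate(row) if v != 0]

dense_row = [0, 0, 3, 0, 0, 0, 1, 0]
sparse_row = to_sparse(dense_row)
print(sparse_row)                       # [(2, 3), (6, 1)]
print(len(dense_row), len(sparse_row))  # 8 2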
Multi Word Features
● So far, features have been associated with single words (tokens delimited by
white space).
● This simple scenario can be extended to include pairs of words, e.g., "bon" and
"vivant". Instead of treating them separately, we could make the pair a single
feature, "bon vivant".
● Why stop at pairs? Why not consider multiword features?
● A multiword feature is x words occurring within a maximum window of size y
(y >= x, naturally).
● How are such features extracted from text? Specialized methods are needed.
● If we use frequency methods, we look for combinations of words that are relatively
frequent.
● A straightforward implementation simply enumerates combinations of x words within a
window of size y (see the sketch after this list).
● The value of a multiword feature is measured by the correlation between the words in
the potential feature; measures such as mutual information or the likelihood
ratio are used.
● Algorithms exist for generating multiword features, but a straightforward
implementation consumes a lot of memory.
● Multiword features are not found very often in a document collection, but they are
highly predictive.
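● A rough sketch of window-based extraction, restricted to word pairs (x = 2) for brevity; the window size is an illustrative assumption, and the ranking by mutual information or likelihood ratio mentioned above is not implemented here.

from collections import Counter

def word_pairs_in_window(tokens, window=3):
    """Count ordered word pairs that co-occur within `window` tokens of
    each other (a simple stand-in for multiword-feature extraction)."""
    pairs = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1:i + window]:
            pairs[(first, second)] += 1
    return pairs

tokens = "the stock market fell as the bond market rose".split()
for pair, count in word_pairs_in_window(tokens).most_common(3):
    print(pair, count)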
● A commonly used ranking score is the information gain criterion,
which is:
IG(j) = L_label - L(j)

L_label = sum over c = 0..1 of  Pr(y = c) * log2( 1 / Pr(y = c) )

L(j) = sum over v = 0..1 of  Pr(x_j = v) * sum over c = 0..1 of
       Pr(y = c | x_j = v) * log2( 1 / Pr(y = c | x_j = v) )
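● A small sketch of the information gain computation for one binary feature x_j against a binary label y, following the formulas above; the toy data are made up for illustration.

import math

def entropy(probabilities):
    """Sum of Pr * log2(1 / Pr) over outcomes with nonzero probability."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

def information_gain(x_j, y):
    """IG(j) = L_label - L(j) for a binary feature x_j and binary label y."""
    n = len(y)
    l_label = entropy([y.count(c) / n for c in (0, 1)])
    l_j = 0.0
    for v in (0, 1):
        rows = [i for i in range(n) if x_j[i] == v]
        if not rows:
            continue
        p_v = len(rows) / n
        cond = [sum(1 for i in rows if y[i] == c) / len(rows) for c in (0, 1)]
        l_j += p_v * entropy(cond)
    return l_label - l_j

# Tiny example: the feature almost determines the label, so the gain is high.
x = [1, 1, 1, 0, 0, 0]
y = [1, 1, 1, 0, 0, 1]
print(round(information_gain(x, y), 3))  # ~0.459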