Text Processing
KNIME AG
Table of Contents
1. Text Mining Concepts
2. The Text Processing Extension
3. Importing Text
4. Enrichment
5. Preprocessing
6. Transformation
7. Classification
8. Visualization
9. Clustering
10. Bonus Cases and Workflows
Text Mining Concepts
Sources / References
§ Chapter 7 – Text Mining, Sentiment Analysis, and Social Analytics (© 2012, © 2020)
§ Articles
§ White papers
§ Tutorials
§ Text Mining
§ Text Analytics
§ Text Processing
§ Information Retrieval
§ Information Extraction
§ Natural Language Processing
§ Computational Linguistics
§ Unstructured Data Mining
§ …
§ Text Analytics = Information Retrieval + NLP + Data Mining + Web Mining
§ Roughly 85-90 percent of all corporate data is in some kind of unstructured form
(e.g., text)
§ Unstructured corporate data is doubling in size every 18 months…
§ Tapping into these information sources is not optional but a necessity for staying
competitive
§ Answer: text mining / text analytics / text processing
§ A semi-automated process of extracting information and discovering knowledge from
unstructured data sources
§ It is also described as text data mining or knowledge discovery in textual databases
§ Deception detection
§ A very difficult problem
§ If detection is limited to only text, then the problem is even more difficult
§ The study …
§ Analyzed text-based testimonies of persons of interest at military bases [a high-stakes environment!]
§ Used only text-based features (i.e., textual cues)
Source: Fuller, C. M., Biros, D. P., & Delen, D. (2011). An investigation of data and text mining methods for
real world deception detection. Expert Systems with Applications, 38(7), 8392-8398.
Text Mining Process

CRISP-DM steps (with feedback loops back to earlier steps):
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment

SEMMA steps:
§ Sample (generate a representative sample of the data)
§ Explore (visualization and basic description of the data)
§ Modify (select variables, transform variable representations)
§ Model (use a variety of statistical and machine learning models)
§ Assess (evaluate the accuracy and usefulness of the models)

Survey of methodologies used in practice (bar chart): CRISP-DM, SEMMA, KDD Process, My own, My organization's, Domain-specific methodology, None, Other methodology (not domain specific)

Source: Used with permission of Delen, D., & M. Crossland. (2008). "Seeding the Survey and Analysis of Research Literature with Text Mining." Expert Systems with Applications, 34(3), pp. 1707–1720.
Journal | Year | Authors | Title | Vol/Issue | Pages | Keywords | Abstract (excerpt)
ISR | 1999 | D. Robey and M. C. Boudreau | Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications | 10/2 | 167-185 | organizational transformation, impacts of technology, organization theory, research methodology, intraorganizational power, electronic communication, MIS implementation, culture, systems | "Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …"
JMIS | 2001 | R. Aron and E. K. Clemons | Achieving the optimal balance between investment in quality and investment in self-promotion for information products | 18/2 | 65-88 | information products, internet advertising, product positioning, signaling, signaling games | "When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …"
… | … | … | … | … | … | … | …
Today’s Example

Goal:
§ Build a classifier to distinguish between reviews about Italian or Chinese
restaurants: is a given review about an Italian or a Chinese restaurant?
§ Input fields per review: Title, Rating, Author, Full text

§ Additional tasks:
§ Sentiment analysis
§ Visualization of documents
§ Document clustering
Installation
§ Open KNIME
§ Install KNIME Textprocessing Extension
§ Download workflows from the provided URL

Data structures:
§ Document table: list of documents
§ Bag of words: tuples of documents and terms
§ Document vectors: numerical representations of documents

Illustration: a tagged document ("… perhaps your name is Rumpelstiltskin [Person]? …")
is turned into a numerical document vector (e.g., 111010011 011001000 001110110),
which can then be used for classification, clustering, and visualization.
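The three data structures above can be sketched in plain Python (a minimal illustration, not the KNIME API; the toy documents are made up):

```python
# Sketch of the three text representations: document list -> bag of words
# -> document vectors (bit vectors over the vocabulary).

documents = [
    "light caresses colours",
    "light plays with shadows",
]

# Bag of words: (document index, term) tuples
bag_of_words = [(i, term) for i, doc in enumerate(documents) for term in doc.split()]

# Document vectors: one bit column per term in the vocabulary
vocabulary = sorted({term for _, term in bag_of_words})
vectors = [[1 if term in doc.split() else 0 for term in vocabulary]
           for doc in documents]

print(vocabulary)
print(vectors)
```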
Data Source Nodes
§ Node anatomy (screenshot annotations): output port, status, node label
§ Configuration dialog: basic settings, help button, preview
§ File selection modes: Relative to … (Mountpoint), Custom URL, Absolute URL, Mountpoint-relative URL
§ Read a single file or all files in a folder
§ Supported operations: column filtering, column sorting, column renaming,
column type mapping, and selecting between union or intersection of columns
(in case of reading many files)
§ Reader dialog annotations: file system, file path, sheet-specific settings, preview
§ Document metadata fields: title, text, category, authors, tokenizer
§ Reading document collections: directory, recursive search, file extensions,
metadata to extract, extraction of attachments
You can download the training workflows from the KNIME Hub:
https://hub.knime.com/knime/spaces/Education/latest/Courses/
Import text
§ Table Reader
§ Row Filter
§ Strings to Documents
§ Column Filter
Enrichment
§ Node Repository:
Other Data Types/Text Processing/Enrichment
§ Available Tagger Nodes
§ Stanford tagger
§ Dictionary (& Wildcard) tagger
§ OpenNLP tagger
§ Abner tagger
§ Amazon Comprehend
§ …
Tagger dialog annotations:
§ Number of parallel threads
§ Model to use
§ Exact match or "contains"
§ Dictionary column
§ Tag type and tag value to be assigned
§ Document column
Tagged Document Viewer annotations:
§ Document column
§ View and interactivity configuration
§ Number of documents to display
§ Enable display of tags
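The exact-match vs. "contains" behavior of a dictionary tagger can be sketched as follows (a simplified illustration, not the KNIME implementation; the dictionary, tag type, and tag value are made up):

```python
# Dictionary-based tagging sketch: each term that matches a dictionary
# entry is assigned a (tag type, tag value) pair, e.g. ("NE", "PERSON").

dictionary = {"rumpelstiltskin"}
TAG = ("NE", "PERSON")  # tag type and tag value to be assigned

def tag_terms(text, exact_match=True):
    tagged = []
    for term in text.lower().split():
        if exact_match:
            hit = term in dictionary
        else:  # "contains": a dictionary entry may be a substring of the term
            hit = any(entry in term for entry in dictionary)
        tagged.append((term, TAG if hit else None))
    return tagged

print(tag_terms("perhaps your name is Rumpelstiltskin"))
```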
Enrichment
§ POS tagger
§ Tagged Document Viewer
Enrichment
§ File Reader
§ Dictionary Tagger
§ Tagged Document Viewer
StanfordNLP NE Learner annotations:
§ Inputs: document corpus, dictionary
§ Output: StanfordNLP NE model
§ Settings: dictionary column, parameters for the built-in model
Term unmodifiability:
§ Set terms unmodifiable in tagger nodes so that preprocessing does not overwrite them
§ "Ignore unmodifiability" in preprocessing nodes overrides this protection
§ Trains an NER model for Latin and Gallic names based on "De Bello Gallico" by
Julius Caesar.
Preprocessing
§ Node Repository:
Other Data Types/Text Processing/Preprocessing
§ Available Preprocessing Nodes
§ Stop Word Filter
§ Snowball Stemmer
§ Tag Filter
§ Case Converter
§ RegEx Filter
§ …
Stop Word Filter annotations:
§ Append original document
§ Ignore term unmodifiability
§ Built-in stop word lists
§ Custom stop word list
Porter stemmer rules (examples):
§ SS → SS: caress → caress
§ S → (removed): cats → cat

Original text:
Light caresses colours, sets them aglow, plays with nuances, shadows and
structures.
Porter stemmer:
Light caress colour, set them aglow, plai with nuanc, shadow and structure.
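The two suffix rules above can be sketched directly (an assumption: only these two rules are implemented, not the full Porter algorithm):

```python
# Porter step-1a sketch:
#   SS -> SS   (caress stays caress)
#   S  -> ""   (cats becomes cat)

def step1a(word):
    if word.endswith("ss"):
        return word          # SS -> SS
    if word.endswith("s"):
        return word[:-1]     # S -> (removed)
    return word

print(step1a("caress"))  # caress
print(step1a("cats"))    # cat
```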
Dialog annotations:
§ Language selection
§ Tag type selection
§ Tag value selection
Preprocessing
§ Number Filter
§ Punctuation Erasure
§ Stop Word Filter
§ Case Converter
§ Snowball Stemmer
§ POS Filter
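The chain of nodes above roughly corresponds to this plain-Python sketch (the stop-word list and the one-rule stemmer are toy assumptions; the real nodes are far more complete):

```python
# Preprocessing pipeline sketch: number filter -> punctuation erasure ->
# case conversion -> stop word filter -> simplified stemming.
import re

STOP_WORDS = {"and", "them", "with"}  # toy list, not KNIME's built-in lists

def preprocess(text):
    text = re.sub(r"\d+", "", text)            # Number Filter
    text = re.sub(r"[^\w\s]", "", text)        # Punctuation Erasure
    terms = text.lower().split()               # Case Converter
    terms = [t for t in terms if t not in STOP_WORDS]   # Stop Word Filter
    terms = [t[:-1] if t.endswith("s") and not t.endswith("ss") else t
             for t in terms]                   # one-rule stemming sketch
    return terms

print(preprocess("Light caresses 12 colours, sets them aglow!"))
```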
Transformation
§ Node Repository:
Other Data Types/Text Processing/Transformation
§ Available Transformation Nodes
§ Bag of Words Creator
§ Document Vector
§ Strings to Document
§ Sentence Extractor
§ Document Data Extractor
§ Unique Term Extractor
§ …
Bag of Words Creator annotations:
§ Input: document list
§ Documents used to create the bag of words
§ Original documents can be appended
§ Output: bag of words
Term To String annotations:
§ Terms to transform to strings
Preprocessing II
§ Bag of Words Creator
§ Term to String
§ GroupBy
§ Row Filter
§ Reference Row Filter
Document Vector annotations:
§ Create bit or numerical vector columns
§ Documents to append to the left of the created vector
§ Output: document vector
Document Vector Applier annotations:
§ Reference document vectors
§ Use settings from model input
Document Vector Hashing annotations:
§ Dimensions of document vectors
§ Hashing function
§ Output: hashed document vector
Document Data Extractor annotations:
§ Fields to extract
§ Node Repository:
Other Data Types/Text Processing/Frequencies
§ Available Frequency Nodes
§ TF
§ IDF
§ NGram creator
§ …
§ Computes the relative or absolute term frequency (TF) of each term within a
document
§ Computes three variants of inverse document frequency (IDF) for each term
within the documents
§ Smooth, normalized, and probabilistic
§ Creates n-grams from the documents of the input table and counts their frequencies
§ Both word and character n-grams are possible
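The frequency measures can be sketched with the standard textbook formulas (an assumption: KNIME's exact IDF variants may be defined slightly differently):

```python
# TF, smooth IDF, and word n-grams over a toy corpus.
import math

docs = [["light", "caresses", "colours"],
        ["light", "plays", "with", "shadows"],
        ["light", "and", "shadows"]]

def tf(term, doc):
    """Relative term frequency of `term` within one document."""
    return doc.count(term) / len(doc)

def idf_smooth(term, docs):
    """Smooth inverse document frequency: log(1 + N / df)."""
    df = sum(term in d for d in docs)
    return math.log(1 + len(docs) / df)

def ngrams(doc, n):
    """Word n-grams of one document."""
    return [tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)]

print(tf("light", docs[0]))          # 1/3
print(idf_smooth("shadows", docs))   # log(1 + 3/2)
print(ngrams(docs[1], 2))
```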
Transformation
§ TF
§ Document Vector
§ Document Data Extractor
Classification
Methods:
§ Decision Trees
§ Neural Networks
§ Naïve Bayes
§ Logistic Regression
§ Support Vector Machine
§ Tree Ensembles
Train/test setup (diagram, shown on several slides):
§ The original data set is partitioned into a training set and a test set
§ Training set → Train Model
§ Test set → Apply Model (using the trained model) → Score Model
§ C4.5 builds a tree from a set of training data using the concept of information
entropy.
§ At each node of the tree, the attribute of the data with the highest normalized
information gain (difference in entropy) is chosen to split the data.
§ The C4.5 algorithm then recurses on the smaller sublists.
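The split criterion can be sketched as follows (assuming plain information gain over toy binary labels; C4.5 additionally normalizes the gain by the split's intrinsic information):

```python
# Entropy-based split criterion sketch: information gain = parent entropy
# minus the weighted entropy of the sublists produced by the split.
import math

def entropy(labels):
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, sublists):
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in sublists)
    return entropy(parent) - weighted

parent = ["italian"] * 4 + ["chinese"] * 4
split = [["italian"] * 3 + ["chinese"], ["italian"] + ["chinese"] * 3]
print(information_gain(parent, split))
```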
Model evaluation: the trained model is applied to the test set and scored. The
Scorer's confusion matrix reports true positives, false negatives, false positives,
and true negatives.
Classification
§ Color Manager
§ Column Filter
§ Partitioning
§ Decision Tree Learner
§ Decision Tree Predictor
§ Scorer
§ Usually the documents used to train a model are read from a different source
than the documents to which the model is applied afterwards
§ To apply a trained model on a second set of documents we need to ensure that
all features of the training set exist as features of the second set.
§ This means that all document vector columns of the training set must exist as
document vector columns in the second set.
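Aligning the second set to the training features can be sketched like this (representing vectors as term-to-value dicts is an assumption for illustration; in KNIME this alignment is what the applier does with its model input):

```python
# Keep exactly the training columns in the second set: drop terms never
# seen during training, fill missing training terms with 0.

train_columns = ["pasta", "pizza", "noodles"]

def align(vector, columns):
    return [vector.get(col, 0) for col in columns]

test_vector = {"pizza": 1, "wonton": 1}   # "wonton" was never seen in training
print(align(test_vector, train_columns))  # [0, 1, 0]
```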
Classification II
Sentiment Analysis
Predictive modeling:
§ Build classifier to distinguish between positive and negative reviews.
§ “Ah, Moonwalker, I'm a huge Michael Jackson fan, I grew up with his music, Thriller was actually the
first music video I ever saw apparently. …”
§ “This film has a very simple but somehow very bad plot. …”
Positive or negative?
Classification
§ Strings to document
§ Preprocessing nodes
§ Bag of words creation, grouping,
counting, and filtering
§ Vector creation
§ Model training and scoring
Dictionary based:
§ Use a custom dictionary to count positive and negative words.
§ Compute sentiment score to predict sentiment label.
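The dictionary-based score can be sketched as follows (the word lists and the formula (#positive − #negative) / #terms are illustrative assumptions, not the course's exact workflow):

```python
# Dictionary-based sentiment sketch: count positive and negative words,
# compute a score, and derive a sentiment label from its sign.

POSITIVE = {"huge", "fan", "great"}
NEGATIVE = {"bad", "simple"}

def sentiment(text):
    terms = text.lower().split()
    pos = sum(t in POSITIVE for t in terms)
    neg = sum(t in NEGATIVE for t in terms)
    score = (pos - neg) / len(terms)
    return "positive" if score > 0 else "negative"

print(sentiment("This film has a very bad plot"))  # negative
print(sentiment("I am a huge fan"))                # positive
```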
Classification
§ Strings to Documents
§ Dictionary Tagger
§ Bag of words, TF, and
GroupBy for counting
§ Pivoting
§ Math Formula
§ Rule Engine
§ Scorer
Visualization Nodes
§ Node Repository:
Other Data Types/Text Processing/Misc
§ Available Visualization Nodes
§ Document Viewer
§ Tag Cloud
§ Tagged Document Viewer (in JS Views (Labs))
§ The KNIME Textprocessing extension provides only the first two dedicated
visualization nodes
§ …but various other nodes can be used for visualization too!
Tag Cloud annotations:
§ Size of words corresponds to frequency
§ Scaling of font size: linear, log, exp
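The three scaling options can be sketched as follows (the exact mapping the Tag Cloud node uses is an assumption here; this just maps a frequency into a font-size range):

```python
# Font-size scaling sketch: map term frequency into [min_size, max_size]
# with a linear, logarithmic, or exponential ratio.
import math

def font_size(freq, max_freq, scaling="linear", min_size=10, max_size=40):
    if scaling == "linear":
        ratio = freq / max_freq
    elif scaling == "log":
        ratio = math.log(1 + freq) / math.log(1 + max_freq)
    else:  # "exp"
        ratio = (freq / max_freq) ** 2
    return min_size + ratio * (max_size - min_size)

for scaling in ("linear", "log", "exp"):
    print(scaling, font_size(freq=5, max_freq=10, scaling=scaling))
```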
Visualization
§ Decision Tree Learner
§ Tag Cloud
§ (Optional Coloring)
§ TF, Document Data Extractor, Group By,
Pivoting, Math Formula, Color Manager
Document Viewer / Tagged Document Viewer annotations:
§ Document column
§ Tagged terms can be hilited
Visualization
§ Document Viewer
§ Tagged Document Viewer
§ Supplementary workflows:
§ R Theme River (R plot)
§ Twitter Word Tree (JavaScript view)
Clustering
§ We can use standard KNIME nodes to cluster the numerical document vectors.
Methods:
§ Hierarchical clustering
§ K-Means / Medoids
§ Density based
§ …
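The distance computation that feeds these methods can be sketched for cosine distance, the usual choice for document vectors (toy vectors, not KNIME code):

```python
# Cosine distance matrix sketch over toy bit document vectors.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

vectors = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
matrix = [[cosine_distance(u, v) for v in vectors] for u in vectors]
print(matrix)
```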
Distance Matrix Calculate annotations:
§ Input: document vectors
§ Columns to use for distance computation
§ Distance function
§ Output: distance column
Hierarchical Clustering annotations:
§ Distance column (optional)
§ Linkage type
§ Output: hierarchical clustering model
Hierarchical Cluster View annotations:
§ Input: hierarchical clustering model
§ Shows: dendrogram of clustering or distance curve, colors
Hierarchical Cluster Assigner annotations:
§ Input: hierarchical clustering model (hierarchy of data points, illustrated as a dendrogram)
§ Threshold-based or cluster-count-based assignment
§ Output: cluster assignment
k-Medoids annotations:
§ Input: data (e.g., document vectors), distance matrix column
§ Cluster count k
§ Output: assignment of clusters
Clustering
§ Distance Matrix Calculate
§ Hierarchical Clustering
§ Cluster View
§ Cluster Assigner
§ k-Medoids
Topic Modeling in Text Mining

LDA model structure (diagram): documents are linked to [latent] topics through
topics-to-documents weights, and topics are linked to words through words-to-topics
weights.
§ Example: we have a corpus of three different subject areas (1, 2, 3) and are
presumed to have three topics (A, B, and C)
§ With LDA, each topic is associated with each document with a weight, and each
word is associated with each topic with a weight. The distribution of topics will be
heavy on one over the others:
§ Doc from Subject 1: 88% topic A, 5% topic B, 7% topic C
§ Doc from Subject 2: 5% topic A, 91% topic B, 4% topic C
§ Doc from Subject 3: 5% topic A, 6% topic B, 89% topic C
§ Example: Genres associated with movies
§ Each movie is associated with multiple genres with different weights
Bonus workflows:
§ R Theme River
§ Leader / Follower analysis of users
§ Sentiment analysis of users
Forecasting Box Office Success of Hollywood Movies
§ Dursun Delen, Ph.D.
§ Regents Professor
§ Department of Management Science and Information Systems
§ Spears School of Business, Oklahoma State University
§ Our approach
§ Use a data mining approach
§ Use as much historical data as possible
§ Make it web-enabled
§ Predict before the initial release
Class No. | Range (in $Millions)
1 | < 1 (Flop)
2 | > 1, < 10
3 | > 10, < 20
4 | > 20, < 40
5 | > 40, < 65
6 | > 65, < 100
7 | > 100, < 150
8 | > 150, < 200
9 | > 200 (Blockbuster)
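The binning into the nine revenue classes can be sketched as follows (assumption: a boundary value falls into the higher class, which the source does not specify):

```python
# Bin box-office revenue (in $Millions) into classes 1 (Flop, < $1M)
# through 9 (Blockbuster, > $200M).

BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]  # class boundaries in $Millions

def revenue_class(revenue_millions):
    for i, bound in enumerate(BOUNDS, start=1):
        if revenue_millions < bound:
            return i
    return 9

print(revenue_class(0.5))   # 1 (Flop)
print(revenue_class(55))    # 5
print(revenue_class(250))   # 9 (Blockbuster)
```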
A Typical Classification Problem

Dependent variable: the box-office revenue class (1–9 above).

Independent Variable | Number of Values | Possible Values
MPAA Rating | 5 | G, PG, PG-13, R, NR
Competition | 3 | High, Medium, Low
Star value | 3 | High, Medium, Low
Genre | 10 | Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
PREDICTION MODELS
§ Statistical Models:
§ Discriminant Analysis
§ Ordinal Multiple Logistic Regression
Movie Forecast Guru (MFG) system architecture (diagram): the user reaches the MFG
engine (web server) through a remote GUI (Internet browser, HTML over TCP/IP); data
sources are ingested via web services (XML/SOAP) and ETL; the engine draws on remote
and local prediction models, the MFG database (ODBC & ETL), and a knowledge base of
business rules (XML, maintained by a manager).
References:
§ Delen, D., R. Sharda and P. Kumar (2006). "Movie Forecast Guru: A Web-based DSS for Hollywood Managers." Decision Support Systems. In press.
§ Henry, M., R. Sharda and D. Delen (2007). "Using Neural Networks to Forecast Box-Office Success." Americas Conference on Information Systems (AMCIS), Keystone, Colorado. Association for Information Systems, 1512-1516.
§ Sharda, R. and D. Delen (2006). "Predicting box-office success of motion pictures with neural networks." Expert Systems with Applications, 30(2), 243-254.
§ Sharda, R. and D. Delen (2006). "How to Predict a Movie's Success at the Box Office." FORESIGHT: The International Journal of Applied Forecasting, October 2006.