Text Processing
KNIME AG
Table of Contents
1. Text Mining Concepts
2. The Text Processing Extension
3. Importing Text
4. Enrichment
5. Preprocessing
6. Transformation
7. Classification
8. Visualization
9. Clustering
10. Bonus Cases and Workflows
Text Mining Concepts
Sources / References
§ Chapter 7 – Text Mining, Sentiment Analysis, and Social Analytics (© 2012, © 2020)
§ Articles
§ White papers
§ Tutorials
§ Text Mining
§ Text Analytics
§ Text Processing
§ Information Retrieval
§ Information Extraction
§ Natural Language Processing
§ Computational Linguistics
§ Unstructured Data Mining
§ …
§ Text Analytics = Information Retrieval + NLP + Data Mining + Web Mining
§ Roughly 85-90 percent of all corporate data is in some kind of unstructured form
(e.g., text)
§ Unstructured corporate data is doubling in size every 18 months…
§ Tapping into these information sources is not optional but a necessity for staying
competitive
§ Answer: text mining / text analytics / text processing
§ A semi-automated process of extracting information and discovering knowledge from
unstructured data sources
§ It is also described as text data mining or knowledge discovery in textual databases
§ Deception detection
§ A very difficult problem
§ If detection is limited to only text, then the problem is even more difficult
§ The study …
§ Analyzed text-based testimonies of persons of interest at military bases [a high-stakes environment!]
§ Used only text-based features (i.e., textual cues)
Source: Fuller, C. M., Biros, D. P., & Delen, D. (2011). An investigation of data and text mining methods for
real world deception detection. Expert Systems with Applications, 38(7), 8392-8398.
Text Mining Process

CRISP-DM steps (with feedback loops back to earlier steps):
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment

SEMMA steps:
§ Sample (generate a representative sample of the data)
§ Explore (visualization and basic description of the data)
§ Modify (select variables, transform variable representations)
§ Model (use a variety of statistical and machine learning models)
§ Assess (evaluate the accuracy and usefulness of the models)

Survey of methodologies used in practice (bar chart): CRISP-DM, SEMMA, KDD Process, My own, My organization's, Domain-specific methodology, None, Other methodology (not domain specific)

Source: Used with permission of Delen, D., & M. Crossland. (2008). "Seeding the Survey and Analysis of Research Literature with Text Mining." Expert Systems with Applications, 34(3), pp. 1707–1720.
Journal | Year | Authors | Title | Vol/Issue | Pages | Keywords | Abstract (excerpt)
ISR | 1999 | D. Robey and M. C. Boudreau | Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications | 10/2 | 167-185 | organizational transformation, impacts of technology, organization theory, research methodology, intraorganizational power, electronic communication, MIS implementation, culture, systems | "Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …"
JMIS | 2001 | R. Aron and E. K. Clemons | Achieving the optimal balance between investment in quality and investment in self-promotion for information products | 18/2 | 65-88 | information products, internet advertising, product positioning, signaling, signaling games | "When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …"
… | … | … | … | … | … | … | …
Today’s Example

Goal:
§ Build a classifier to distinguish between reviews about Italian or Chinese
restaurants: is a given review about an Italian or a Chinese restaurant?
§ Input fields per review: Title, Rating, Author, Full text

§ Additional tasks:
§ Sentiment analysis
§ Visualization of documents
§ Document clustering
Installation
§ Open KNIME
§ Install KNIME Textprocessing Extension
§ Download workflows from the provided URL

Data structures:
§ Document table: list of documents
§ Bag of words: tuples of documents and terms
§ Document vectors: numerical representations of documents

Illustration: a tagged document ("… perhaps your name is Rumpelstiltskin [Person]? …")
is turned into a numerical document vector (e.g., 111010011 011001000 001110110),
which can then be used for classification, clustering, and visualization.
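The three data structures above can be sketched in plain Python (a minimal illustration, not the KNIME API; the toy documents are made up):

```python
# Sketch of the three text representations: document list -> bag of words
# -> document vectors (bit vectors over the vocabulary).

documents = [
    "light caresses colours",
    "light plays with shadows",
]

# Bag of words: (document index, term) tuples
bag_of_words = [(i, term) for i, doc in enumerate(documents) for term in doc.split()]

# Document vectors: one bit column per term in the vocabulary
vocabulary = sorted({term for _, term in bag_of_words})
vectors = [[1 if term in doc.split() else 0 for term in vocabulary]
           for doc in documents]

print(vocabulary)
print(vectors)
```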
Data Source Nodes
§ Node anatomy (screenshot annotations): output port, status, node label
§ Configuration dialog: basic settings, help button, preview
§ File selection modes: Relative to … (Mountpoint), Custom URL, Absolute URL, Mountpoint-relative URL
§ Read a single file or all files in a folder
§ Supported operations: column filtering, column sorting, column renaming,
column type mapping, and selecting between union or intersection of columns
(in case of reading many files)
§ Reader dialog annotations: file system, file path, sheet-specific settings, preview
§ Document metadata fields: title, text, category, authors, tokenizer
§ Reading document collections: directory, recursive search, file extensions,
metadata to extract, extraction of attachments
You can download the training workflows from the KNIME Hub:
https://hub.knime.com/knime/spaces/Education/latest/Courses/
Import text
§ Table Reader
§ Row Filter
§ Strings to Documents
§ Column Filter
Enrichment
§ Node Repository:
Other Data Types/Text Processing/Enrichment
§ Available Tagger Nodes
§ Stanford tagger
§ Dictionary (& Wildcard) tagger
§ OpenNLP tagger
§ Abner tagger
§ Amazon Comprehend
§ …
Tagger dialog annotations:
§ Number of parallel threads
§ Model to use
§ Exact match or "contains"
§ Dictionary column
§ Tag type and tag value to be assigned
§ Document column
Tagged Document Viewer annotations:
§ Document column
§ View and interactivity configuration
§ Number of documents to display
§ Enable display of tags
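The exact-match vs. "contains" behavior of a dictionary tagger can be sketched as follows (a simplified illustration, not the KNIME implementation; the dictionary, tag type, and tag value are made up):

```python
# Dictionary-based tagging sketch: each term that matches a dictionary
# entry is assigned a (tag type, tag value) pair, e.g. ("NE", "PERSON").

dictionary = {"rumpelstiltskin"}
TAG = ("NE", "PERSON")  # tag type and tag value to be assigned

def tag_terms(text, exact_match=True):
    tagged = []
    for term in text.lower().split():
        if exact_match:
            hit = term in dictionary
        else:  # "contains": a dictionary entry may be a substring of the term
            hit = any(entry in term for entry in dictionary)
        tagged.append((term, TAG if hit else None))
    return tagged

print(tag_terms("perhaps your name is Rumpelstiltskin"))
```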
Enrichment
§ POS tagger
§ Tagged Document Viewer
Enrichment
§ File Reader
§ Dictionary Tagger
§ Tagged Document Viewer
StanfordNLP NE Learner annotations:
§ Inputs: document corpus, dictionary
§ Output: StanfordNLP NE model
§ Settings: dictionary column, parameters for the built-in model
Term unmodifiability:
§ Set terms unmodifiable in tagger nodes so that preprocessing does not overwrite them
§ "Ignore unmodifiability" in preprocessing nodes overrides this protection
§ Trains an NER model for Latin and Gallic names based on "De Bello Gallico" by
Julius Caesar.
Preprocessing
§ Node Repository:
Other Data Types/Text Processing/Preprocessing
§ Available Preprocessing Nodes
§ Stop Word Filter
§ Snowball Stemmer
§ Tag Filter
§ Case Converter
§ RegEx Filter
§ …
Stop Word Filter annotations:
§ Append original document
§ Ignore term unmodifiability
§ Built-in stop word lists
§ Custom stop word list
Porter stemmer rules (examples):
§ SS → SS: caress → caress
§ S → (removed): cats → cat

Original text:
Light caresses colours, sets them aglow, plays with nuances, shadows and
structures.
Porter stemmer:
Light caress colour, set them aglow, plai with nuanc, shadow and structure.
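The two suffix rules above can be sketched directly (an assumption: only these two rules are implemented, not the full Porter algorithm):

```python
# Porter step-1a sketch:
#   SS -> SS   (caress stays caress)
#   S  -> ""   (cats becomes cat)

def step1a(word):
    if word.endswith("ss"):
        return word          # SS -> SS
    if word.endswith("s"):
        return word[:-1]     # S -> (removed)
    return word

print(step1a("caress"))  # caress
print(step1a("cats"))    # cat
```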
Dialog annotations:
§ Language selection
§ Tag type selection
§ Tag value selection
Preprocessing
§ Number Filter
§ Punctuation Erasure
§ Stop Word Filter
§ Case Converter
§ Snowball Stemmer
§ POS Filter
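The chain of nodes above roughly corresponds to this plain-Python sketch (the stop-word list and the one-rule stemmer are toy assumptions; the real nodes are far more complete):

```python
# Preprocessing pipeline sketch: number filter -> punctuation erasure ->
# case conversion -> stop word filter -> simplified stemming.
import re

STOP_WORDS = {"and", "them", "with"}  # toy list, not KNIME's built-in lists

def preprocess(text):
    text = re.sub(r"\d+", "", text)            # Number Filter
    text = re.sub(r"[^\w\s]", "", text)        # Punctuation Erasure
    terms = text.lower().split()               # Case Converter
    terms = [t for t in terms if t not in STOP_WORDS]   # Stop Word Filter
    terms = [t[:-1] if t.endswith("s") and not t.endswith("ss") else t
             for t in terms]                   # one-rule stemming sketch
    return terms

print(preprocess("Light caresses 12 colours, sets them aglow!"))
```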
Transformation
§ Node Repository:
Other Data Types/Text Processing/Transformation
§ Available Transformation Nodes
§ Bag of Words Creator
§ Document Vector
§ Strings to Document
§ Sentence Extractor
§ Document Data Extractor
§ Unique Term Extractor
§ …
Bag of Words Creator annotations:
§ Input: document list
§ Documents used to create the bag of words
§ Original documents can be appended
§ Output: bag of words
Term To String annotations:
§ Terms to transform to strings
Preprocessing II
§ Bag of Words Creator
§ Term to String
§ GroupBy
§ Row Filter
§ Reference Row Filter
Document Vector annotations:
§ Create bit or numerical vector columns
§ Documents to append to the left of the created vector
§ Output: document vector
Document Vector Applier annotations:
§ Reference document vectors
§ Use settings from model input
Document Vector Hashing annotations:
§ Dimensions of document vectors
§ Hashing function
§ Output: hashed document vector
Document Data Extractor annotations:
§ Fields to extract
§ Node Repository:
Other Data Types/Text Processing/Frequencies
§ Available Frequency Nodes
§ TF
§ IDF
§ NGram creator
§ …
§ Computes the relative or absolute term frequency (TF) of each term within a
document
§ Computes three variants of inverse document frequency (IDF) for each term
within the documents
§ Smooth, normalized, and probabilistic
§ Creates n-grams from the documents of the input table and counts their frequencies
§ Both word and character n-grams are possible
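The frequency measures can be sketched with the standard textbook formulas (an assumption: KNIME's exact IDF variants may be defined slightly differently):

```python
# TF, smooth IDF, and word n-grams over a toy corpus.
import math

docs = [["light", "caresses", "colours"],
        ["light", "plays", "with", "shadows"],
        ["light", "and", "shadows"]]

def tf(term, doc):
    """Relative term frequency of `term` within one document."""
    return doc.count(term) / len(doc)

def idf_smooth(term, docs):
    """Smooth inverse document frequency: log(1 + N / df)."""
    df = sum(term in d for d in docs)
    return math.log(1 + len(docs) / df)

def ngrams(doc, n):
    """Word n-grams of one document."""
    return [tuple(doc[i:i + n]) for i in range(len(doc) - n + 1)]

print(tf("light", docs[0]))          # 1/3
print(idf_smooth("shadows", docs))   # log(1 + 3/2)
print(ngrams(docs[1], 2))
```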
Transformation
§ TF
§ Document Vector
§ Document Data Extractor
Classification
Methods:
§ Decision Trees
§ Neural Networks
§ Naïve Bayes
§ Logistic Regression
§ Support Vector Machine
§ Tree Ensembles
Train/test setup (diagram, shown on several slides):
§ The original data set is partitioned into a training set and a test set
§ Training set → Train Model
§ Test set → Apply Model (using the trained model) → Score Model
§ C4.5 builds a tree from a set of training data using the concept of information
entropy.
§ At each node of the tree, the attribute of the data with the highest normalized
information gain (difference in entropy) is chosen to split the data.
§ The C4.5 algorithm then recurses on the smaller sublists.
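The split criterion can be sketched as follows (assuming plain information gain over toy binary labels; C4.5 additionally normalizes the gain by the split's intrinsic information):

```python
# Entropy-based split criterion sketch: information gain = parent entropy
# minus the weighted entropy of the sublists produced by the split.
import math

def entropy(labels):
    probs = [labels.count(c) / len(labels) for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, sublists):
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in sublists)
    return entropy(parent) - weighted

parent = ["italian"] * 4 + ["chinese"] * 4
split = [["italian"] * 3 + ["chinese"], ["italian"] + ["chinese"] * 3]
print(information_gain(parent, split))
```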
Model evaluation: the trained model is applied to the test set and scored. The
Scorer's confusion matrix reports true positives, false negatives, false positives,
and true negatives.
Classification
§ Color Manager
§ Column Filter
§ Partitioning
§ Decision Tree Learner
§ Decision Tree Predictor
§ Scorer
§ Usually the documents used to train a model are read from a different source
than the documents to which the model is applied afterwards
§ To apply a trained model on a second set of documents we need to ensure that
all features of the training set exist as features of the second set.
§ This means that all document vector columns of the training set must exist as
document vector columns in the second set.
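Aligning the second set to the training features can be sketched like this (representing vectors as term-to-value dicts is an assumption for illustration; in KNIME this alignment is what the applier does with its model input):

```python
# Keep exactly the training columns in the second set: drop terms never
# seen during training, fill missing training terms with 0.

train_columns = ["pasta", "pizza", "noodles"]

def align(vector, columns):
    return [vector.get(col, 0) for col in columns]

test_vector = {"pizza": 1, "wonton": 1}   # "wonton" was never seen in training
print(align(test_vector, train_columns))  # [0, 1, 0]
```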
Classification II
Sentiment Analysis
Predictive modeling:
§ Build classifier to distinguish between positive and negative reviews.
§ “Ah, Moonwalker, I'm a huge Michael Jackson fan, I grew up with his music, Thriller was actually the
first music video I ever saw apparently. …”
§ “This film has a very simple but somehow very bad plot. …”
Positive or negative?
Classification
§ Strings to document
§ Preprocessing nodes
§ Bag of words creation, grouping,
counting, and filtering
§ Vector creation
§ Model training and scoring
Dictionary based:
§ Use a custom dictionary to count positive and negative words.
§ Compute sentiment score to predict sentiment label.
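The dictionary-based score can be sketched as follows (the word lists and the formula (#positive − #negative) / #terms are illustrative assumptions, not the course's exact workflow):

```python
# Dictionary-based sentiment sketch: count positive and negative words,
# compute a score, and derive a sentiment label from its sign.

POSITIVE = {"huge", "fan", "great"}
NEGATIVE = {"bad", "simple"}

def sentiment(text):
    terms = text.lower().split()
    pos = sum(t in POSITIVE for t in terms)
    neg = sum(t in NEGATIVE for t in terms)
    score = (pos - neg) / len(terms)
    return "positive" if score > 0 else "negative"

print(sentiment("This film has a very bad plot"))  # negative
print(sentiment("I am a huge fan"))                # positive
```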
Classification
§ Strings to Documents
§ Dictionary Tagger
§ Bag of words, TF, and
GroupBy for counting
§ Pivoting
§ Math Formula
§ Rule Engine
§ Scorer
Visualization Nodes
§ Node Repository:
Other Data Types/Text Processing/Misc
§ Available Visualization Nodes
§ Document Viewer
§ Tag Cloud
§ Tagged Document Viewer (in JS Views (Labs))
§ The KNIME Textprocessing extension provides only the first two dedicated
visualization nodes
§ …but various other nodes can be used for visualization too!
Tag Cloud annotations:
§ Size of words corresponds to frequency
§ Scaling of font size: linear, log, exp
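The three scaling options can be sketched as follows (the exact mapping the Tag Cloud node uses is an assumption here; this just maps a frequency into a font-size range):

```python
# Font-size scaling sketch: map term frequency into [min_size, max_size]
# with a linear, logarithmic, or exponential ratio.
import math

def font_size(freq, max_freq, scaling="linear", min_size=10, max_size=40):
    if scaling == "linear":
        ratio = freq / max_freq
    elif scaling == "log":
        ratio = math.log(1 + freq) / math.log(1 + max_freq)
    else:  # "exp"
        ratio = (freq / max_freq) ** 2
    return min_size + ratio * (max_size - min_size)

for scaling in ("linear", "log", "exp"):
    print(scaling, font_size(freq=5, max_freq=10, scaling=scaling))
```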
Visualization
§ Decision Tree Learner
§ Tag Cloud
§ (Optional Coloring)
§ TF, Document Data Extractor, Group By,
Pivoting, Math Formula, Color Manager
Document Viewer / Tagged Document Viewer annotations:
§ Document column
§ Tagged terms can be hilited
Visualization
§ Document Viewer
§ Tagged Document Viewer
§ Supplementary workflows:
§ R Theme River (R plot)
§ Twitter Word Tree (JavaScript view)
Clustering
§ We can use standard KNIME nodes to cluster the numerical document vectors.
Methods:
§ Hierarchical clustering
§ K-Means / Medoids
§ Density based
§ …
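The distance computation that feeds these methods can be sketched for cosine distance, the usual choice for document vectors (toy vectors, not KNIME code):

```python
# Cosine distance matrix sketch over toy bit document vectors.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

vectors = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
matrix = [[cosine_distance(u, v) for v in vectors] for u in vectors]
print(matrix)
```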
Distance Matrix Calculate annotations:
§ Input: document vectors
§ Columns to use for distance computation
§ Distance function
§ Output: distance column
Hierarchical Clustering annotations:
§ Distance column (optional)
§ Linkage type
§ Output: hierarchical clustering model
Hierarchical Cluster View annotations:
§ Input: hierarchical clustering model
§ Shows: dendrogram of clustering or distance curve, colors
Hierarchical Cluster Assigner annotations:
§ Input: hierarchical clustering model (hierarchy of data points, illustrated as a dendrogram)
§ Threshold-based or cluster-count-based assignment
§ Output: cluster assignment
k-Medoids annotations:
§ Input: data (e.g., document vectors), distance matrix column
§ Cluster count k
§ Output: assignment of clusters
Clustering
§ Distance Matrix Calculate
§ Hierarchical Clustering
§ Cluster View
§ Cluster Assigner
§ k-Medoids
Topic Modeling in Text Mining

LDA model structure (diagram): documents are linked to [latent] topics through
topics-to-documents weights, and topics are linked to words through words-to-topics
weights.
§ Example: we have a corpus of three different subject areas (1, 2, 3) and are
presumed to have three topics (A, B, and C)
§ With LDA, each topic is associated with each document with a weight, and each
word is associated with each topic with a weight. The distribution of topics will be
heavy on one over the others:
§ Doc from Subject 1: 88% topic A, 5% topic B, 7% topic C
§ Doc from Subject 2: 5% topic A, 91% topic B, 4% topic C
§ Doc from Subject 3: 5% topic A, 6% topic B, 89% topic C
§ Example: Genres associated with movies
§ Each movie is associated with multiple genres with different weights
Bonus workflows:
§ R Theme River
§ Leader / Follower analysis of users
§ Sentiment analysis of users
Forecasting Box Office Success of Hollywood Movies
§ Dursun Delen, Ph.D.
§ Regents Professor
§ Department of Management Science and Information Systems
§ Spears School of Business, Oklahoma State University
§ Our approach
§ Use a data mining approach
§ Use as much historical data as possible
§ Make it web-enabled
§ Predict before the initial release
Class No. | Range (in $Millions)
1 | < 1 (Flop)
2 | > 1, < 10
3 | > 10, < 20
4 | > 20, < 40
5 | > 40, < 65
6 | > 65, < 100
7 | > 100, < 150
8 | > 150, < 200
9 | > 200 (Blockbuster)
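The binning into the nine revenue classes can be sketched as follows (assumption: a boundary value falls into the higher class, which the source does not specify):

```python
# Bin box-office revenue (in $Millions) into classes 1 (Flop, < $1M)
# through 9 (Blockbuster, > $200M).

BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]  # class boundaries in $Millions

def revenue_class(revenue_millions):
    for i, bound in enumerate(BOUNDS, start=1):
        if revenue_millions < bound:
            return i
    return 9

print(revenue_class(0.5))   # 1 (Flop)
print(revenue_class(55))    # 5
print(revenue_class(250))   # 9 (Blockbuster)
```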
A Typical Classification Problem

Dependent variable: the box-office revenue class (1–9 above).

Independent Variable | Number of Values | Possible Values
MPAA Rating | 5 | G, PG, PG-13, R, NR
Competition | 3 | High, Medium, Low
Star value | 3 | High, Medium, Low
Genre | 10 | Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
PREDICTION MODELS
§ Statistical Models:
§ Discriminant Analysis
§ Ordinal Multiple Logistic Regression
Movie Forecast Guru (MFG) system architecture (diagram): the user reaches the MFG
engine (web server) through a remote GUI (Internet browser, HTML over TCP/IP); data
sources are ingested via web services (XML/SOAP) and ETL; the engine draws on remote
and local prediction models, the MFG database (ODBC & ETL), and a knowledge base of
business rules (XML, maintained by a manager).
References:
§ Delen, D., R. Sharda and P. Kumar (2006). "Movie Forecast Guru: A Web-based DSS for Hollywood Managers." Decision Support Systems. In press.
§ Henry, M., R. Sharda and D. Delen (2007). "Using Neural Networks to Forecast Box-Office Success." Americas Conference on Information Systems (AMCIS), Keystone, Colorado. Association for Information Systems, 1512-1516.
§ Sharda, R. and D. Delen (2006). "Predicting box-office success of motion pictures with neural networks." Expert Systems with Applications, 30(2), 243-254.
§ Sharda, R. and D. Delen (2006). "How to Predict a Movie's Success at the Box Office." FORESIGHT: The International Journal of Applied Forecasting, October 2006.