
l4 TP Slides Text Processing

The document provides an introduction to text processing concepts and techniques, focusing on text mining, text analytics, and the KNIME Text Processing Extension. It covers various aspects such as importing text, preprocessing, transformation, classification, and visualization of unstructured data. Additionally, it discusses the importance of text mining in extracting valuable insights from unstructured corporate data and outlines the text mining process and applications.

[L4-TP] Introduction to Text Processing
KNIME AG

Table of Contents
1. Text Mining Concepts
2. The Text Processing Extension
3. Importing Text
4. Enrichment
5. Preprocessing
6. Transformation
7. Classification
8. Visualization
9. Clustering
10. Bonus Cases and Workflows

Text Mining Concepts

Sources / References
Chapter 7 – Text Mining, Sentiment Analysis, and Social Analytics
+ Articles, white papers, tutorials

© 2012 © 2020

© 2021 KNIME AG. All rights reserved. 4


Too Many Terms

§ Text Mining
§ Text Analytics
§ Text Processing
§ Information Retrieval
§ Information Extraction
§ Natural Language Processing
§ Computational Linguistics
§ Unstructured Data Mining
§ …



Text Mining versus Text Analytics

Text Analytics =
§ Information Retrieval +
§ NLP + Data Mining +
§ Web Mining



Why Text Mining?

§ Roughly 85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
§ Unstructured corporate data is doubling in size every 18 months…
§ Tapping into these information sources is not optional; it is necessary to stay competitive
§ Answer: text mining / text analytics / text processing
§ A semi-automated process of extracting information and discovering knowledge from unstructured data sources
§ It is also described as text data mining or knowledge discovery in textual databases



Data Mining versus Text Mining

§ Both seek novel and useful patterns
§ Both are semi-automated processes
§ Difference is the nature of the data
§ Structured versus unstructured data
§ Structured data: in databases
§ Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
§ To perform text mining: first impose structure on the data, then mine the structured data.



Use of Text Mining

§ Benefits of text mining are obvious, especially in text-rich data environments
§ Such as law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.
§ Electronic communication records (e.g., Email)
§ Spam filtering
§ Email prioritization and categorization
§ Automatic response generation



Text Mining Terminology

§ Lexicon (word dictionary)
§ WordNet, SentiWordNet
§ Corpus (and corpora)
§ Term-by-document matrix
§ Word/document frequency
§ Numeric, binary, TF/IDF
§ Dimensional reduction: Singular value decomposition (SVD)
§ Topic Modeling
§ Latent Dirichlet Allocation (LDA)
§ Word Embedding
§ Word2Vec
§ Bag of Words
§ Unstructured data
§ Words and Terms
§ Concepts
§ Stemming vs. Lemmatization
§ Stop [include] words/terms
§ Synonyms (and homonyms)
§ Morphology
§ Tokenization
§ Part-of-speech tagging



Use of Text Mining

Stemming vs. Lemmatization


§ Common goal: to generate the root form of the words
§ Difference:
§ Stem results in truncated/chopped words (not necessarily a complete word)
§ Stemming is syntactic and fast: it follows an algorithm
§ Example: amuse, amusing, amused → amus
§ Lemma results in an actual language word (inflection free)
§ Lemmatization is semantic and slow: it follows a linguistic dictionary
§ Example: am, is, are → be
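
The contrast between the two approaches can be sketched in a few lines of Python. This is a toy illustration only: the suffix list and the mini-dictionary are hypothetical, and real stemmers and lemmatizers use far richer rules and dictionaries.

```python
# Toy illustration only: the suffix list and dictionary below are hypothetical.
SUFFIXES = ("ing", "ed", "e")  # tried in this order

# Hypothetical mini-dictionary mapping inflected forms to their lemma.
LEMMA_DICT = {"am": "be", "is": "be", "are": "be",
              "amusing": "amuse", "amused": "amuse"}


def toy_stem(word):
    """Chop the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words come back unchanged."""
    return LEMMA_DICT.get(word, word)


print([toy_stem(w) for w in ("amuse", "amusing", "amused")])
print([toy_lemmatize(w) for w in ("am", "is", "are")])
```

The stemmer needs only string rules, which is why it is fast; the lemmatizer is only as good as the dictionary behind it.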



Text Mining Application – Deception Detection

§ Deception detection
§ A very difficult problem
§ If detection is limited to only text, then the problem is even more difficult
§ The study …
§ Analyzed text-based testimonies of persons of interest at military bases [a high-stakes environment!]
§ Used only text-based features (i.e., textual cues)

Source: Fuller, C. M., Biros, D. P., & Delen, D. (2011). An investigation of data and text mining methods for
real world deception detection. Expert Systems with Applications, 38(7), 8392-8398.



Text Mining Application – Deception Detection

Source: Fuller, C. M., Biros, D. P., & Delen, D. (2011). An investigation of data and text mining
methods for real world deception detection. Expert Systems with Applications, 38(7), 8392-8398.



Text Mining Application – Deception Detection



Text Mining Application – Deception Detection

§ 371 usable/labeled statements are generated


§ 31 features/cues are used
§ Different feature selection methods used
§ 10-fold cross validation is used
§ Early Prediction Results (overall % accuracy):
§ Logistic regression 67.28
§ Decision trees 71.60
§ Neural networks 73.46
§ Tools used: General Architecture for Text Engineering (GATE), Linguistic Inquiry and Word Count (LIWC), and WEKA



Text Mining Process

Text Mining Process

§ A standard process: the manifestation of the “best” practices


§ Standard process for data mining:
§ Cross industry process for data mining (CRISP-DM)
§ Sample, Explore, Modify, Model, Assess (SEMMA)

[Figure: CRISP-DM cycle with six steps: 1 Business Understanding, 2 Data Understanding, 3 Data Preparation, 4 Model Building, 5 Testing and Evaluation, 6 Deployment, with feedback loops around the data]
[Figure: SEMMA cycle: Sample (generate a representative sample of the data), Explore (visualization and basic description of the data), Modify (select variables, transform variable representations), Model (use variety of statistical and machine learning models), Assess (evaluate the accuracy and usefulness of the models)]
[Figure: bar chart of methodology usage, axis 0 to 70, with categories CRISP-DM, My own, SEMMA, KDD Process, My organization's, Domain-specific methodology, None, Other methodology (not domain specific)]



Text Mining Process



Text Mining Process



Text Mining Process

§ Task 1: Establish the corpus
§ Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)
§ Digitize, standardize the collection (e.g., all in ASCII text files)
§ Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)

§ Task 2: Create the Term-by-Document Matrix (TDM)
§ Should all the terms be included?
§ Stop words, include words
§ Synonyms, homonyms
§ Stemming, lemmatization
§ What is the best representation of the indices (values in cells)?
§ Row counts; binary frequencies; log frequencies
§ TF/IDF
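
To make the choice of cell values concrete, here is a minimal Python sketch that builds a small term-by-document matrix under four representations. The mini-corpus is hypothetical, and the TF/IDF formula used (count times log(N / df)) is just one common variant, chosen here as an assumption.

```python
import math
from collections import Counter

docs = {  # hypothetical mini-corpus
    "d1": "pasta pizza pasta wine",
    "d2": "noodles rice wine",
    "d3": "pizza pizza pizza",
}

# Counts per document, then a sorted vocabulary for the matrix columns.
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({term for c in counts.values() for term in c})

# Document frequency: number of documents that contain each term.
df = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
n_docs = len(docs)


def cell(count, term, scheme):
    """Value stored in one TDM cell under the chosen representation."""
    if scheme == "count":
        return float(count)
    if scheme == "binary":
        return 1.0 if count > 0 else 0.0
    if scheme == "log":
        return math.log(1 + count)
    if scheme == "tfidf":  # assumed variant: tf * log(N / df)
        return count * math.log(n_docs / df[term])
    raise ValueError(f"unknown scheme: {scheme}")


def tdm(scheme):
    """Rows are documents; columns follow the order of `vocab`."""
    return {name: [cell(c[t], t, scheme) for t in vocab]
            for name, c in counts.items()}


print(vocab)
for scheme in ("count", "binary", "log", "tfidf"):
    print(scheme, tdm(scheme)["d1"])
```

Most cells are zero even in this tiny corpus, which is the sparsity that motivates the dimensionality reduction discussed next.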



Text Mining Process

§ Task 2 – Cont.
§ TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
§ Manual by a domain expert, frequency based, SVD, …

§ Task 3: Extract knowledge
§ Classification (text categorization)
§ Clustering (natural groupings of text)
§ Improve search recall
§ Improve search precision
§ Scatter/gather
§ Query-specific clustering
§ Association
§ Trend Analysis



Text Mining Process in KNIME

§ Logical organization of KNIME Text Processing nodes



Text Mining Application – Trend Identification

§ Mining the published IS literature
§ Three top IS journals targeted
§ MIS Quarterly (MISQ)
§ Journal of MIS (JMIS)
§ Information Systems Research (ISR)
§ Covered 12-year period (1994-2005)
§ 901 papers are included in the study
§ Only the paper abstracts are used
§ 9 clusters are generated for further analysis

Source: Used with permission of Delen, D., & M. Crossland. (2008). “Seeding the Survey and Analysis
of Research Literature with Text Mining.” Expert Systems with Applications, 34(3), pp. 1707–1720.



Text Mining Application – Trend Identification
Journal: MISQ
Year: 2005
Author(s): A. Malhotra, S. Gosain and O. A. El Sawy
Title: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
Vol/No: 29/1
Pages: 145-187
Keywords: knowledge management; supply chain; absorptive capacity; interorganizational information systems; configuration approaches
Abstract: The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing …

Journal: ISR
Year: 1999
Author(s): D. Robey and M. C. Boudreau
Title: Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
Vol/No: 10/2
Pages: 167-185
Keywords: organizational transformation; impacts of technology; organization theory; research methodology; intraorganizational power; electronic communication; mis implementation; culture; systems
Abstract: Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …

Journal: JMIS
Year: 2001
Author(s): R. Aron and E. K. Clemons
Title: Achieving the optimal balance between investment in quality and investment in self-promotion for information products
Vol/No: 18/2
Pages: 65-88
Keywords: information products; internet advertising; product positioning; signaling; signaling games
Abstract: When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …

…



Text Mining Application – Trend Identification



Text Mining Application – Trend Identification



Today’s Example

Today’s Example

§ Classification of free-text documents is a common task in the field of text mining.

§ It is used to categorize documents, i.e. assign pre-defined topics, or it can be used for sentiment analysis.

§ Today we want to construct a workflow that reads and preprocesses text documents, transforms them into a numerical representation and builds a predictive model to assign pre-defined labels to documents.

§ Additional tasks:
§ Sentiment analysis
§ Visualization of documents
§ Document clustering



Today’s Example

[Screenshot of an example review with callouts: Title, Rating, Author, Full text]



Today’s Example

Goal:
§ Build a classifier to distinguish between reviews about Italian and Chinese restaurants.

Review about an Italian or a Chinese restaurant?



Today’s Example



Bonus Examples



The KNIME Text Processing Extension

Installation

§ Install KNIME Textprocessing Extension from menu…



Installation

§ …or drag-and-drop from KNIME Hub



Tips for Better Performance

§ Increase maximum memory for KNIME


§ Edit knime.ini
§ Add “-Xmx3G” as last line of knime.ini file
§ Replace 3 with the number of gigabytes to allocate to KNIME

§ Useful additional extensions


§ XML-Processing (KNIME extension)
§ Parsing and processing of XML documents
§ KNIME JavaScript Views (Labs)
§ Tagged Document Viewer
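
For the memory setting described above, the edit to knime.ini can also be scripted. The following Python sketch works on the file content as a string; the function name and the sample content are hypothetical, and it assumes the option is written on a line of its own.

```python
def set_max_heap(ini_text, gigabytes):
    """Return knime.ini content with the max-heap option set to <gigabytes>G."""
    option = f"-Xmx{gigabytes}G"
    lines = ini_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("-Xmx"):
            lines[i] = option  # overwrite an existing setting in place
            break
    else:
        lines.append(option)  # otherwise add it as the last line
    return "\n".join(lines) + "\n"


# Hypothetical file content, for illustration only.
example = "-vmargs\n-Xmx2048m\n"
print(set_max_heap(example, 3))
```

Replacing an existing -Xmx line instead of always appending avoids ending up with two conflicting settings in the file.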



Philosophy
[Figure: processing pipeline. Extract document data → Enrichment (e.g., “… perhaps your name is Rumpelstiltskin [Person]? …”) → Clean-up & preprocessing → Term-document matrix (rows of bits such as 111010011) → Knowledge extraction (Classification, Clustering, Visualization)]



Additional Data Types: Document

§ KNIME uses a composite/aggregate data type to represent textual content


§ Fields include:
§ Title
§ Text
§ Source
§ Category
§ Author(s)
§ Date, …
§ Generic Meta Data



Additional Data Types: Term

§ KNIME uses a composite/aggregate data type to represent terms [keywords]


§ Fields include:
§ Sentiment
§ POS tag
§ City
§ Person name
§ Etc.



Part of Speech Tags

Source: Penn-Treebank Project.



Data Table Structures

§ Document table
§ List of documents

§ Bag of words
§ Tuples of documents and terms

§ Document vectors
§ Numerical representations of documents



Section Exercise

§ Open KNIME
§ Install KNIME Textprocessing Extension
§ Download workflows from the provided URL



Importing Text

Data Source Nodes

Typically characterized by:


§ Orange color
§ By default no input ports, 1-2 output ports
§ New file handling with KNIME 4.4.
§ Consistent user experience across all nodes and file systems
§ Managing of various file systems within the same workflow
§ Performance improvements

Output port
Status

Node label



CSV Reader
§ Reads either one or multiple .csv and .txt files
§ Further tabs to
§ limit the rows
§ select encoding

[Screenshot callouts: Basic settings, Advanced settings, File system, File path, Preview, Help button]



Common Settings: Four Default File Systems

§ Local File System

§ Relative to …

§ Mountpoint

§ Custom URL



Common Settings: Connecting to other File Systems

§ Add file system connection port to connect to another file system


§ Click on the three dots on the lower left to add or remove a dynamic port.

§ Supported file systems


§ Microsoft Azure
§ Google
§ Amazon
§ Databricks
§ BigData file systems (hdfs, httpFS, …)
§ On-premise (e.g. ssh, ftp, …)



Common Settings: Read Single or Multiple Files

§ Single file

§ Files in a folder

§ Option to include subfolder


§ Option to define filter criteria



Common Settings: Transformation Tab

§ Supported operations
§ Column filtering
§ Column sorting
§ Column renaming
§ Column type mapping
§ Select between union or intersection of columns (in case of reading many files)



Alternative Faster Way …

Drag & Drop

OR

Copy & Paste



File Path Options Old File Handling
New file handling
§ Local path

§ Absolute URL

§ Mountpoint-relative URL



Workflow-Relative File Paths (Old File Handling)

§ Best choice if workflows are to be shared
§ Requires matching folder structure within workflow group
§ Independent of environment outside of workflow group
§ Example: Path to “Sentiment Analysis.table”
§ Local path:
§ C:\Users\rb\knime-workspace\KNIMEUserTraining\data\Sentiment Analysis.table
§ Workflow relative:

YouTube KNIME TV Channel: https://youtu.be/U9sP4g4yGwY



Excel Reader (XLS)

§ Reads .xls and .xlsx files from Microsoft Excel


§ Supports reading from multiple sheets



Excel Reader

File path
File system

Sheet specific settings

Preview



New Node: Table Reader

§ Reads tables from the native KNIME Format


§ Maximum performance
§ Minimum configuration

File system

File path

Preview



New Node: Database Reader

§ Connectors for Common DB types


§ MySQL, Postgres, SQLite
§ Also works with any JDBC driver
§ Common nodes for SQL Query Building
§ GroupBy, Join, Filter, Sort



Other Useful Data Sources

§ KNIME Analytics Platform provides many more options to access data:
§ PMML Reader – reads standard predictive models
§ XML Reader with XPATH support
§ Python/R Source nodes
§ Tika Parser – extracts textual data from 200+ file types
§ REST Web Services, and many more

§ Find out more by downloading the free book “Will they blend”:
https://www.knime.com/knimepress/download-will-they-blend
Parser Nodes

§ Node Repository: Other Data Types/Text Processing/IO


§ Available Parser Nodes
§ Flat File Document Parser
§ PDF Parser
§ Word Parser
§ Document Grabber
§ …



New Node: Strings To Document

§ Creation of document cells from strings


§ Converts string cells to document cells
§ Useful in combination with File Reader, XLS Reader, database nodes



Strings To Document: Configuration

Title
Text

Category

Authors

Tokenizer



Tokenizers

§ Different tokenizers are available, leading to slightly different terms extracted from the document

Example: “I’m enjoying the tutorial”

English WordTokenizer: “I”, “’m”, “enjoying”, “the”, “tutorial”
WhitespaceTokenizer: “I’m”, “enjoying”, “the”, “tutorial”
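
The difference between the two results can be reproduced with a short Python sketch. The regular expression below is a hypothetical stand-in for a word tokenizer, not the implementation used by the KNIME nodes; it only splits off clitics such as ’m and single punctuation marks.

```python
import re


def whitespace_tokenize(text):
    """Split on whitespace only; clitics stay attached to their word."""
    return text.split()


# Hypothetical pattern: a clitic starting with an apostrophe, a run of
# letters, or any single non-space, non-letter character.
_WORD_OR_CLITIC = re.compile(r"[’'][A-Za-z]+|[A-Za-z]+|[^\sA-Za-z]")


def word_tokenize(text):
    """Split words and separate clitics such as ’m from the word before."""
    return _WORD_OR_CLITIC.findall(text)


sentence = "I’m enjoying the tutorial"
print(word_tokenize(sentence))
print(whitespace_tokenize(sentence))
```

The choice matters downstream: with whitespace tokenization, “I’m” becomes its own term and will not match a stop word list that contains only “I”.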



New Node: Meta Info Inserter & Extractor

§ Inserter allows adding document meta info


§ Adds meta info to documents as key value pairs
§ Helpful if more meta info available than covered by Strings to Documents node
§ Extractor brings data back from document cell into table columns
§ Each key results in a column, containing the specific values for each document related to that key.



New Node: Tika Parser

§ Reads files of various formats from directory


§ Searches for all files with specified extension in directory
§ Creates one document for each file
§ Extracts specified (meta) information



Tika Parser: Configuration

Directory
Recursive search
File extensions
Meta data to extract
Extraction of attachments



Section Exercise

§ Start with “Exercise: Importing text”


§ Import string data from:
§ TripadvisorReviews-SanFranciscoRestaurants-ItalianChineseFood.table
§ Filter rows with missing titles
§ Convert strings to documents
§ Filter all columns except the document column

You can download the training workflows from the KNIME Hub:
https://hub.knime.com/knime/spaces/Education/latest/Courses/



Section Solution

Import text
§ Table Reader
§ Row Filter
§ Strings to Documents
§ Column Filter



Enrichment

Enrichment

§ Semantic information is indicated by a tag assignment


§ Part of speech, named entities (persons, organizations, genes, …), sentiments
§ A tag consists of a type and a value
§ Type represents the class or set of tags
§ e.g. POS (part of speech), NE (named entity)
§ Value represents the actual tag value
§ e.g. NN (noun), PERSON

Term “food” with tag value “NN” and type “POS”
Column containing terms with tags



Tagger Nodes

§ Typically characterized by:


§ Yellow color
§ 1 to 2 input ports (requiring one document column), 1 output port
§ Assignment of semantic information (tags) to terms



Tagger Nodes

§ Node Repository:
Other Data Types/Text Processing/Enrichment
§ Available Tagger Nodes
§ Stanford tagger
§ Dictionary (& Wildcard) tagger
§ OpenNLP tagger
§ Abner tagger
§ Amazon Comprehend
§ …



Tagger Nodes

§ Allows specifying the number of parallel threads.
§ Note: each thread will load a separate model into memory!
§ Tagged terms are set unmodifiable (we’ll explain this shortly).

Number of parallel threads



New Node: Stanford Tagger

§ Assigns part of speech tags to terms


§ Models for English, German, French (from Stanford NLP Group)
§ Alternative node: POS tagger
§ Model only for English (from OpenNLP)



Stanford Tagger: Configuration

Number of parallel threads
Model to use



New Node: Dictionary Tagger

§ Assigns selected tag to matching terms


§ Matches terms in documents against terms in dictionary
§ Tag to be assigned to matching terms is specified in the dialog
§ Alternative node: Wildcard tagger
§ Terms in dictionary may contain wild cards and regular expressions
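
The matching logic of a dictionary-based tagger can be sketched as follows. This is a minimal Python illustration, not the node's implementation: the Tag and Term classes, the tag types "SENTIMENT" and "NE", and the use of shell-style wildcards via fnmatch are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Tag:
    type: str   # class of the tag, e.g. "POS" or "NE"
    value: str  # actual tag value, e.g. "NN" or "PERSON"


@dataclass
class Term:
    text: str
    tags: list = field(default_factory=list)


def dictionary_tag(terms, dictionary, tag, wildcard=False):
    """Append `tag` to every term whose text matches a dictionary entry."""
    for term in terms:
        if wildcard:
            hit = any(fnmatchcase(term.text, pattern) for pattern in dictionary)
        else:
            hit = term.text in dictionary
        if hit and tag not in term.tags:
            term.tags.append(tag)
    return terms


terms = [Term(w) for w in "the pasta was delicious".split()]
# Hypothetical tag types and values, for illustration only.
dictionary_tag(terms, {"delicious", "great"}, Tag("SENTIMENT", "POSITIVE"))
dictionary_tag(terms, ["pas*"], Tag("NE", "FOOD"), wildcard=True)
for term in terms:
    print(term.text, term.tags)
```

Keeping the tag as a (type, value) pair mirrors the structure described in the Enrichment section and lets later steps filter by either part.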



Dictionary Tagger: Configuration

Exact match or “contains”
Dictionary column
Type of tag to be assigned
Tag value to be assigned



New Node: Tagged Document Viewer

§ Displays documents with tags highlighted:


§ Takes a column with documents as input
§ Allows inspecting the tags assigned to documents

Document with tags highlighted

Document column



Tagged Document Viewer: Configuration

Document column
View and interactivity configuration
Number of documents to display
Enable display of tags



Section Exercise

§ Start with “Exercise: Enrichment”


§ Assign (English) POS tags
§ View tagged documents



Section Solution

Enrichment
§ POS tagger
§ Tagged Document Viewer



Section Exercise (Bonus)

§ Start with “Exercise: Enrichment II”


§ Read files that contain positive and negative words
§ MPQA-OpinionCorpus-PositiveList.csv
§ MPQA-OpinionCorpus-NegativeList.csv
§ Assign positive and negative sentiment tags based on positive and negative word lists
§ View tagged documents
§ Tip: Dictionary Tagger node



Section Solution (Bonus)

Enrichment
§ File Reader
§ Dictionary Tagger
§ Tagged Document Viewer



Custom NER Models

§ The provided NER models of OpenNLP NE tagger and StanfordNLP NE tagger are trained for a few types of entities and English language only.
§ For more specific applications and other languages, custom models are needed.



New Node: StanfordNLP NE Learner

§ Trains a NER model based on the input dictionary and corpus


§ Tag type and value can be set in the dialog
§ Creates tagged corpus based on input documents and dictionary. Trains model with tagged corpus.

Document
corpus
StanfordNLP
NE model

Dictionary



StanfordNLP NE Learner: Configuration

Document
corpus
Dictionary
column

Tag type and value



New Node: StanfordNLP NE tagger

§ Tags documents based on input NER model.


§ NER model can be specified in dialog, built-in or model from input port



StanfordNLP NE Tagger: Configuration

Parameters for
built-in model

Use model from input port or built-in models



Tagging Conflicts

§ In case of tag intersections, the last node overwrites.


§ “Serbian-American inventor Nikola Tesla developed the …”
1. POS tagger: “Serbian-American\NNP inventor\NNP Nikola\NNP Tesla\NNP developed\VBD the\DT …”
2. NE tagger: “Serbian-American\NNP inventor\NNP Nikola Tesla\Person developed\VBD the\DT …”

Overwrite!



Unmodifiable Terms

§ Tagged terms can be set unmodifiable


§ Unmodifiable terms are not affected by any preprocessing node
§ Preprocessing nodes can explicitly ignore unmodifiability

Set unmodifiable in tagger nodes
Ignore unmodifiability in preprocessing nodes



Supplementary Workflows: NER Tagger Model Training

§ Trains NER model for Latin and Gallic names based on “De Bello Gallico” by Julius Caesar.



Preprocessing

Preprocessing

§ Reduction of feature space (terms)


§ Filtering of unnecessary terms
§ Stop words, based on POS tags, dictionaries, RegEx, …
§ Normalization of terms
§ Stemming, case conversion



Preprocessing Nodes

§ Typically characterized by:


§ Yellow color
§ 1 to 2 input ports (requiring one document column), 1 output port
§ For filtering and normalizing terms of documents and bags of words



Preprocessing Nodes

§ Node Repository:
Other Data Types/Text Processing/Preprocessing
§ Available Preprocessing Nodes
§ Stop Word Filter
§ Snowball Stemmer
§ Tag Filter
§ Case Converter
§ RegEx Filter
§ …



Preprocessing Nodes

§ Preprocessing tab in node dialog to specify:


§ Append original documents
§ Ignore term unmodifiability (set by tagger nodes).

Append original document
Ignore term unmodifiability



New Node: Stop Word Filter

§ Filters stop words


§ Built-in stop word lists: English, French, German, Italian, …
§ Alternatively load custom stop word list



Stop Word Filter: Configuration

Built-in stop word lists
Custom stop word list



Stemming

Stemming reduces different forms of a word to its common base by sequential application of stemming rules.

Rule          Example
SSES → SS     caresses → caress
IES → I       ponies → poni
SS → SS       caress → caress
S →           cats → cat

Original text:
Light caresses colours, sets them aglow, plays with nuances, shadows and structures

Porter stemmer:
Light caress colour, set them aglow, plai with nuanc, shadow and structure.
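
The four rules in the table correspond to the first step (step 1a) of the Porter algorithm. A minimal Python sketch of just that step follows; the later steps, which produce forms such as “plai” and “nuanc”, are not implemented here.

```python
# Step 1a rules, tried in order; only the first matching suffix is applied.
RULES_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def porter_step_1a(word):
    """Apply the first matching plural-suffix rule to a lowercase word."""
    for suffix, replacement in RULES_1A:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
```

The order of the rules matters: the SS → SS rule exists only to stop the final S rule from turning “caress” into “cares”.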



New Node: Snowball Stemmer

§ Reduces terms to word stem


§ For various languages: English, German, French, Italian, …
§ Integration of Snowball stemming library
§ Alternative nodes: Porter Stemmer, Kuhlen Stemmer
§ For English only



Snowball Stemmer: Configuration

Language selection



New Node: Tag Filter

§ Filters terms based on specified tag values


§ For all tag types and values



Tag Filter: Configuration

Tag type selection
Tag value selection



Section Exercise

§ Start with “Exercise: Preprocessing”


§ Filtering:
§ Numbers
§ Punctuation marks
§ Stop words
§ All terms except: nouns, verbs, adjectives
§ Stemming
§ To lower case



Section Solution

Preprocessing
§ Number Filter
§ Punctuation Erasure
§ Stop Word Filter
§ Case Converter
§ Snowball Stemmer
§ POS Filter



Transformation

Transformation

§ Transformation of data table structures


§ List of documents → bag of words
§ Bag of words → document / term vectors
§ Extraction of document fields to string columns
§ Conversion of terms to strings



Transformation Nodes

§ Typically characterized by:


§ Yellow color
§ 1 input port, 1 output port



Transformation Nodes

§ Node Repository:
Other Data Types/Text Processing/Transformation
§ Available Transformation Nodes
§ Bag of Words Creator
§ Document Vector
§ Strings to Document
§ Sentence Extractor
§ Document Data Extractor
§ Unique Term Extractor
§ …



New Node: Bag of Words Creator

§ Transforms list of documents into bag of words


§ Original documents can be appended in a column

Bag of words
Document list
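
The transformation from a document list to a bag of words can be sketched in Python as follows. It assumes whitespace tokenization and that each (document, term) pair appears once in the bag; the document ids and texts are hypothetical.

```python
docs = {"doc1": "good pasta good wine", "doc2": "good noodles"}  # hypothetical


def bag_of_words(docs):
    """Return one (document, term) tuple per distinct term in each document."""
    rows = []
    for doc_id, text in docs.items():
        seen = set()
        for term in text.split():
            if term not in seen:  # keep the first occurrence only
                seen.add(term)
                rows.append((doc_id, term))
    return rows


for row in bag_of_words(docs):
    print(row)
```

The table grows with the number of distinct terms per document, which is why term filtering is usually applied before or right after this step.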



Bag of Words Creator: Configuration

Documents used to create bag of words
Original documents can be appended



New Node: Term to String

§ Transforms term cells into string cells


§ Tag information will get lost

Bag of words with string column
Bag of words



Term to String: Configuration

Terms to transform to strings



Section Exercise

§ Start with “Exercise: Preprocessing II”


§ Create bag of words
§ Filter terms that occur in fewer than 5 documents
§ Tip: Bag of Words, GroupBy, and Reference Row Filter



Section Solution

Preprocessing II
§ Bag of Words Creator
§ Term to String
§ GroupBy
§ Row Filter
§ Reference Row Filter
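
The logic behind this node sequence can be sketched in Python: count the documents per term (the GroupBy step), keep the terms that reach the threshold (the Row Filter step), and keep only the bag-of-words rows for those terms (the Reference Row Filter step). The function name and sample data are hypothetical.

```python
from collections import Counter


def filter_rare_terms(bow, min_docs=5):
    """Drop (document, term) rows whose term occurs in fewer than min_docs documents."""
    # GroupBy: number of distinct documents per term.
    doc_freq = Counter(term for _doc, term in set(bow))
    # Row Filter: terms that reach the threshold.
    keep = {term for term, n in doc_freq.items() if n >= min_docs}
    # Reference Row Filter: keep only rows for those terms.
    return [(doc, term) for doc, term in bow if term in keep]


bow = [("d1", "a"), ("d1", "b"), ("d2", "a"), ("d3", "c")]  # hypothetical rows
print(filter_rare_terms(bow, min_docs=2))
```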



New Node: Document Vector

§ Transforms bag of words into document vectors


§ Creates bit or numerical vectors

Bag of words with frequency column
Document vector
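
As a rough illustration of this transformation, the following Python sketch turns bag-of-words rows with a frequency column into one vector per document, either as bits or as numbers. The row layout and sample values are assumptions made for the example.

```python
def document_vectors(rows, bit=True):
    """rows: (doc_id, term, frequency) tuples. Returns (vocab, {doc_id: vector})."""
    vocab = sorted({term for _doc, term, _freq in rows})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = {}
    for doc_id, term, freq in rows:
        vec = vectors.setdefault(doc_id, [0] * len(vocab))
        vec[index[term]] = 1 if bit else freq  # bit vector or numerical vector
    return vocab, vectors


# Hypothetical bag of words with a relative TF column.
rows = [("doc1", "pasta", 0.5), ("doc1", "wine", 0.5), ("doc2", "noodles", 1.0)]
vocab, bit_vectors = document_vectors(rows, bit=True)
print(vocab)
print(bit_vectors)
print(document_vectors(rows, bit=False)[1])
```

Every distinct term becomes one column, so the vocabulary built here defines the feature space that any later model is trained on.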



Document Vector: Configuration

Documents to append to left of the created vector columns
Create bit or numerical vector



New Node: Document Vector Applier

§ Transforms bag of words into document vectors


§ Creates feature space of reference document vectors
§ Creates bit or numerical vectors

Document vector

Reference document
vectors



Document Vector Applier: Configuration

Use settings from model input
Include and exclude lists of features of the reference vectors



New Node: Document Vector Hashing

§ Transforms documents into document vectors


§ Vector indices of terms are determined by term hashing
§ Creates bit or numerical vectors
§ Is streamable

Hashed document
vector
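
The idea of determining vector indices by hashing can be sketched in Python as follows. The hash function (MD5) and the dimension are arbitrary choices for illustration and are not necessarily those offered by the node.

```python
import hashlib


def hashed_vector(tokens, dim=8, bit=False):
    """Map tokens to a fixed-size vector; each index comes from a term hash."""
    vec = [0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % dim  # stable across runs
        vec[idx] = 1 if bit else vec[idx] + 1
    return vec


tokens = "good pasta good wine".split()
print(hashed_vector(tokens))
print(hashed_vector(tokens, bit=True))
```

Because no vocabulary has to be collected first, each document can be converted on its own, which is what makes this approach suitable for streaming. The price is that different terms can collide on the same index.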



Document Vector Hashing: Configuration

Dimensions of
document vectors

Hashing function



New Node: Document Data Extractor

§ Extracts document fields as strings


§ Title, text, categories, …
Reminder: we stored the restaurant type in the Category field during the Strings To Document conversion

Extracted field as string column
Document column



Document Data Extractor: Configuration

Fields to extract



Frequencies

§ Frequencies are based on the number of occurrences of terms


§ Locally (in documents): term frequency (TF) absolute or relative
§ Globally (in corpus): inverse document frequency (IDF)
§ Frequencies can also be used for term filtering



Frequency Nodes

§ Typically characterized by:


§ Green color
§ 1 input port, 1 output port
§ Require bag of words
Append column with
relative TF values



Frequency Nodes

§ Node Repository:
Other Data Types/Text Processing/Frequencies
§ Available Frequency Nodes
§ TF
§ IDF
§ NGram creator
§ …



New Node: TF

§ Computes the relative or absolute term frequency (TF) of each term within a
document

Appended column with


TF values



New Node: DF

§ Computes the number of documents that contain each term

Appended column with


DF values



New Node: IDF

§ Computes three variants of inverse document frequency (IDF) for each term
within the documents
§ Smooth, normalized, and probabilistic
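The slide does not spell out the three formulas. The sketch below uses commonly cited definitions, with N the number of documents and df(t) the number of documents containing term t: smooth log(1 + N/df), normalized log(N/df), and probabilistic log((N - df)/df). Treat these definitions and the natural logarithm as assumptions, and check the node description for the exact variant and log base.

```python
import math


def idf_variants(n_docs, df):
    """Three common IDF variants for a term found in df of n_docs documents."""
    if not 0 < df <= n_docs:
        raise ValueError("df must be between 1 and n_docs")
    return {
        "smooth": math.log(1 + n_docs / df),
        "normalized": math.log(n_docs / df),
        # Undefined when the term occurs in every document (log of zero).
        "probabilistic": math.log((n_docs - df) / df) if df < n_docs else float("-inf"),
    }


rare = idf_variants(n_docs=10, df=1)
common = idf_variants(n_docs=10, df=5)
```

In all three variants a term that appears in few documents gets a higher value than a term that appears in many.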

Appended column with


IDF values



New Node: Term Co-Occurrence Counter

§ Counts the number of pairwise co-occurrences of terms in the bag of words within
selected parts of the document (e.g. sentence, paragraph, title)
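A minimal sketch of pairwise co-occurrence counting, assuming each document part (here a sentence) is given as a list of terms. The function name and the sample data are made up for the illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count in how many sentences each unordered pair of terms appears together."""
    counts = Counter()
    for tokens in sentences:
        # Sorting makes ("good", "wine") and ("wine", "good") the same key.
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


sentences = [["good", "pasta", "wine"], ["good", "wine"]]
pairs = cooccurrence_counts(sentences)
```

Passing paragraphs or whole documents instead of sentences changes the level at which co-occurrence is counted.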



New Node: Ngram Creator

§ Creates ngrams from documents of input table and counts their frequencies
§ Both word and character ngrams are possible
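Word and character n-grams differ only in the unit that is windowed. A short Python sketch, with illustrative function names:

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Sliding window of n consecutive words."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Sliding window of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


tokens = ["the", "food", "was", "good", "the", "food"]
bigram_counts = Counter(word_ngrams(tokens, 2))
trigrams = char_ngrams("food", 3)
```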



Section Exercise

§ Start with “Exercise: Transformation”


§ Compute relative term frequencies
§ Create document vectors
§ Extract class label / category



Section Solution

Transformation
§ TF
§ Document Vector
§ Document Data Extractor



Classification

Classification

§ Assigning pre-defined labels to documents


§ Categorization
§ Sentiment analysis
§ Topic assignment
§ Supervised learning

§ In the last section we transformed textual documents into a numerical


representation (document vectors).
§ We can use standard KNIME nodes to classify / analyze these vectors.



Classification

Methods:
§ Decision Trees
§ Neural Networks
§ Naïve Bayes
§ Logistic Regression
§ Support Vector Machine
§ Tree Ensembles



Data Mining: Process Overview

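Stages: Partition data → Train and apply model → Evaluate performance

[Figure: the original data set is partitioned into a training set and a test set. The training
set is used to train the model; the model is applied to the test set and then scored.]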



New Node: Partitioning

§ Use it to split data into training and evaluation sets


§ Partition by count (e.g. 10 rows) or fraction (e.g. 10%)
§ Sample by a variety of methods: random, linear, stratified
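Stratified sampling keeps the class proportions the same in both partitions. The Python sketch below shows the idea for a split by fraction; `stratified_split`, the default of 70%, and the sample rows are assumptions for the illustration, not settings of the node.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, fraction=0.7, seed=42):
    """Split rows so each class contributes the same fraction to the first partition."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    first, second = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        cut = round(len(members) * fraction)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second


# Hypothetical data: 10 Italian and 10 Chinese restaurant reviews.
rows = [(i, "Italian" if i < 10 else "Chinese") for i in range(20)]
train, test = stratified_split(rows, label_of=lambda row: row[1])
```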





The Learner-Predictor Motif

§ In KNIME, predictive models follow a Learner-Predictor motif.


§ The Learner node trains the model with its input data.
§ The Predictor node applies the model to a different subset of data.
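The motif separates training from application: the learner returns a model object, and the predictor only needs that object plus new data. The toy Python sketch below uses a deliberately simple keyword-vote model to show the split; it is not one of the KNIME learners, and all names are illustrative.

```python
from collections import Counter, defaultdict


def learn(training_rows):
    """Learner: builds a model from (terms, label) pairs."""
    class_counts = Counter()
    term_class_counts = defaultdict(Counter)
    for terms, label in training_rows:
        class_counts[label] += 1
        for term in set(terms):
            term_class_counts[term][label] += 1
    return {
        "default": class_counts.most_common(1)[0][0],
        # Each known term votes for the class it was seen with most often.
        "term_votes": {t: c.most_common(1)[0][0] for t, c in term_class_counts.items()},
    }


def predict(model, terms):
    """Predictor: applies a trained model to unseen data."""
    votes = Counter(model["term_votes"][t] for t in set(terms) if t in model["term_votes"])
    return votes.most_common(1)[0][0] if votes else model["default"]


training_set = [
    (["pasta", "wine"], "Italian"),
    (["pizza", "pasta"], "Italian"),
    (["noodle", "tea"], "Chinese"),
]
model = learn(training_set)
prediction = predict(model, ["noodle", "soup"])
```

Because the model is a separate object, it can be stored, inspected, or applied to a different data set later.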

Trained
Model

Training set

Test set



Decision Tree

§ C4.5 builds a tree from a set of training data using the concept of information
entropy.
§ At each node of the tree, the attribute of the data with the highest normalized
information gain (difference in entropy) is chosen to split the data.
§ The C4.5 algorithm then recurses on the smaller sublists.
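The split criterion can be made concrete with a small calculation. The Python sketch below computes entropy and the normalized information gain (gain ratio) of one attribute; the row format (a list of dicts) and the function names are assumptions for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(rows, labels, attribute):
    """Information gain of splitting on `attribute`, normalized by the split entropy."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy([row[attribute] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Bit-vector features for four hypothetical reviews.
rows = [
    {"italian": 1, "wine": 1},
    {"italian": 1, "wine": 0},
    {"italian": 0, "wine": 1},
    {"italian": 0, "wine": 0},
]
labels = ["Italian", "Italian", "Chinese", "Chinese"]
best = max(["italian", "wine"], key=lambda a: gain_ratio(rows, labels, a))
```

In this toy data the term "italian" separates the classes perfectly, so it would be chosen for the first split, while "wine" carries no information.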

J.R. Quinlan, “C4.5 Programs for machine learning”


J. Shafer, R. Agrawal, M. Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining”



New Node: Decision Tree Learner



Decision Tree: View

If the word “Italian”


occurs in a review, the
restaurant is very likely
an Italian restaurant.



New Node: Decision Tree Predictor

§ Consumes a Decision Tree model and new data to classify


§ Check the box to append class probabilities





New Node: Scorer

§ Compare predicted results to known truth to evaluate model quality


§ Confusion matrix shows the distribution of model errors
§ An accuracy statistics table provides additional info
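How the accuracy statistics follow from the confusion matrix can be sketched as below. The function name `score` and the choice of measures are assumptions for the illustration; the Scorer node reports its own set of statistics.

```python
from collections import Counter


def score(actual, predicted, positive):
    """Confusion-matrix cells and derived measures for one class treated as positive."""
    cells = Counter(zip(actual, predicted))
    tp = cells[(positive, positive)]
    fn = sum(n for (a, p), n in cells.items() if a == positive and p != positive)
    fp = sum(n for (a, p), n in cells.items() if a != positive and p == positive)
    tn = len(actual) - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "TP": tp, "FN": fn, "FP": fp, "TN": tn,
        "accuracy": (tp + tn) / len(actual),
        "precision": precision, "recall": recall, "f1": f1,
    }


actual = ["Italian", "Italian", "Italian", "Chinese", "Chinese", "Chinese"]
predicted = ["Italian", "Italian", "Chinese", "Chinese", "Chinese", "Italian"]
stats = score(actual, predicted, positive="Italian")
```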



Scorer: Confusion Matrix

This is the difference between the confusion matrix data table
and the confusion matrix view

Confusion matrix cells: True Positives, False Negatives, False Positives, True Negatives



Scorer: Accuracy Measures

From the confusion


matrix



Section Exercise

§ Start with “Exercise: Classification”


§ Append color information based on class labels
§ Split data into training and test set
§ Train decision tree classifier on training set
§ Apply trained model on test set
§ Score model



Section Solution

Classification
§ Color Manager
§ Column Filter
§ Partitioning
§ Decision Tree Learner
§ Decision Tree Predictor
§ Scorer



Classification (Bonus)

§ Usually the documents used to train a model are read from a different source
than that of the documents to which the model is applied afterwards
§ To apply a trained model on a second set of documents we need to ensure that
all features of the training set exist as features of the second set.
§ This means that all document vector columns of the training set must exist as
document vector columns in the second set.
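In KNIME this alignment is the job of the Document Vector Applier node introduced earlier. The underlying idea is small enough to sketch in Python: project each new document onto the training feature list, filling missing terms with zero and dropping unseen ones. The function name and sample data are illustrative assumptions.

```python
def apply_feature_space(reference_features, doc_term_freqs):
    """Project one document's term frequencies onto the reference (training) features."""
    # Terms missing from the document become 0; terms unknown to the model are ignored.
    return [doc_term_freqs.get(term, 0) for term in reference_features]


training_features = ["noodle", "pasta", "wine"]
new_document = {"pasta": 1, "lobster": 2}  # "lobster" was never seen in training
vector = apply_feature_space(training_features, new_document)
```

The resulting vector always has exactly the columns the trained model expects, in the same order.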



Classification (Bonus)

All features of the training set


must exist as features in the
second set.



Section Exercise (Bonus)

§ Start with “Exercise: Classification II”


§ Create document vectors for the second set of documents “Boston Tripadvisor Reviews”
§ The feature space of the second set has to contain all features of the training set!
§ Apply the trained model on the second set of documents



Section Solution (Bonus)

Classification II



Text Mining Bonus –
Sentiment Analysis

Sentiment Analysis

§ Sentiment → belief, view, opinion, and conviction


§ Also known as Opinion Mining
§ Sentiment analysis tries to answer the question “What do people feel/think
about a certain topic?”
§ Helps analyze data on the opinions of many people [the population] using a
variety of automated text analysis tools
§ Used in a variety of domains, but its applications in CRM are especially noteworthy
(CRM deals directly with customers’/consumers’ opinions)
§ Often collects and uses data obtained from the internet



Sentiment Analysis Applications

§ Voice of the Customer (VOC)


§ Voice of the Market (VOM)
§ Voice of the Employee (VOE)
§ Brand Management
§ Product/Service Management
§ Financial Markets
§ Politics
§ Entertainment
§ Government Intelligence



Sentiment Analysis Methods

§ Classification based – Supervised


§ Establish the corpus: Collect relevant documents
§ Such as online product reviews
§ Manually label a subset of documents as positive or negative
§ Use employees, students, or Mechanical Turk to manually label reviews
§ Make sure to build in cross-validation [each document reviewed by three people]
§ Develop classification models using the labeled samples
§ Use single models and ensembles to identify the best predictor
§ Use the identified best classification model to predict the sentiment for the rest of the samples



Sentiment Analysis Methods

§ Lexicon driven – Unsupervised


§ Use a term dictionary with sentiment descriptors [e.g., opinion corpus or SentiWordNet]



Sentiment Analysis Process

§ Step 1 – Sentiment Detection


§ Fact [= objectivity] versus Opinion [= subjectivity]
§ Step 2 – N-P Polarity Classification
§ N [= negative] versus P [= positive]
§ Step 3 – Target Identification
§ Person, Product, Event, etc.
§ Step 4 – Aggregation
§ Word → Statement → Paragraph → Document



Sentiment Analysis Example (Bonus)

§ The Large Movie Review Dataset v1.0


§ 50000 English movie reviews
§ Associated sentiment labels “positive” and “negative”
§ http://ai.stanford.edu/~amaas/data/sentiment/
§ Subset contains 2000 documents
§ 1000 positive reviews
§ 1000 negative reviews
§ …/data/IMDb-sample.csv



Sentiment Analysis Example (Bonus)

Predictive modeling:
§ Build classifier to distinguish between positive and negative reviews.
§ “Ah, Moonwalker, I'm a huge Michael Jackson fan, I grew up with his music, Thriller was actually the
first music video I ever saw apparently. …”
§ “This film has a very simple but somehow very bad plot. …”

Positive or negative?



Section Exercise (Bonus)

§ Start with “Exercise: Classification III”


§ Create document cells
§ Preprocess documents
§ Punctuation Erasure, N Chars Filter, Stop Word Filter, Case Converter, Snowball Stemmer
§ Filter all terms that occur in fewer than 20 documents
§ Create document vectors
§ Extract sentiment label and assign colors
§ Partition into training and test set
§ Train decision tree model and score it



Section Solution (Bonus)

Classification
§ Strings To Document
§ Preprocessing nodes
§ Bag of words creation, grouping,
counting, and filtering
§ Vector creation
§ Model training and scoring



Sentiment Analysis Example (Bonus)

Dictionary based:
§ Use a custom dictionary to count positive and negative words.
§ Compute sentiment score to predict sentiment label.
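A dictionary-based score can be sketched as follows. The slide does not fix a formula, so the score used here, (positive - negative) / (positive + negative), the zero threshold, and the tiny word lists are all assumptions chosen for illustration.

```python
# Hypothetical mini lexicons; a real workflow would load full sentiment dictionaries.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "awful"}


def sentiment_score(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Score in [-1, 1]: share of positive minus share of negative dictionary hits."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    if pos + neg == 0:
        return 0.0  # no dictionary term found
    return (pos - neg) / (pos + neg)


def sentiment_label(score, threshold=0.0):
    """Turn the numeric score into a predicted label."""
    return "positive" if score > threshold else "negative"


review = "bad plot but good cast and boring pace".split()
score_value = sentiment_score(review)
label = sentiment_label(score_value)
```

No training data is needed here, which is why the approach is called unsupervised; its quality depends entirely on the dictionaries.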



Section Exercise (Bonus)

§ Start with “Exercise: Classification IV”


§ Create document cells
§ Tag terms based on sentiment dictionaries
§ Tip: Dictionary Tagger
§ Extract and count positive and negative terms
§ Compute sentiment score based on the number of positive and negative terms
§ Predict sentiment labels based on score
§ Score predictions



Section Solution (Bonus)

Classification
§ Strings To Document
§ Dictionary Tagger
§ Bag of words, TF, and
GroupBy for counting
§ Pivoting
§ Math Formula
§ Rule Engine
§ Scorer



Visualization

Visualization Nodes

§ Typically characterized by:


§ Blue color
§ 1 input port, 1-2 output ports (image port)



Visualization Nodes

§ Node Repository:
Other Data Types/Text Processing/Misc
§ Available Visualization Nodes
§ Document Viewer
§ Tag Cloud
§ Tagged Document Viewer (in JS Views (Labs))

§ The KNIME Textprocessing extension provides only the first two dedicated
visualization nodes
§ …but various other nodes can be used for visualization too!



New Node: Tag Cloud

§ Shows terms visualized in a cloud


§ Colors are specified via the Color Manager
§ Requires a term and a numerical column (usually TF)
§ Creates image, available at image out port

Size of words
corresponds to
frequency

List of terms and


frequencies



Tag Cloud: Configuration

Display only top N


terms (rows)

Term column and


frequency column



Tag Cloud: View

Scaling of font
size: linear, log,
exp

Min and max


fontsize, angle,



Additional Visualizations

§ Decision Tree View


§ Inspect trained model
§ See which terms are discriminative



Section Exercise

§ Start with “Exercise: Visualization”


§ Inspect decision tree via its view
§ Visualize bag of words using a tag cloud
§ Assign colors to terms in tag cloud (Optional)
§ Green if term occurs mostly in Chinese reviews, blue if term occurs mostly in Italian reviews



Section Solution

Visualization
§ Decision Tree Learner
§ Tag Cloud
§ (Optional Coloring)
§ TF, Document Data Extractor, Group By,
Pivoting, Math Formula, Color Manager



New Node: Document Viewer

§ Shows details of documents


§ Title, Full text
§ Meta information
§ Tagged terms can be hilited and linked

Document column



Document Viewer: View
List of all documents. Double click for details
Details view with title and full text
Tagset to hilite
Tagged terms can be hilited
Author, category, meta information, …



Reminder: Tagged Document Viewer

§ Displays documents with tags highlighted:


§ Takes a column with documents as input
§ Lets you inspect the tags assigned to documents

Document with tags


highlighted

Document column



Section Exercise

§ Start with “Exercise: Visualization II”


§ View document content
§ View document content and highlight tagged terms
§ View tagged documents



Section Solution

Visualization
§ Document Viewer
§ Tagged Document Viewer



Bonus Visualizations

§ Supplementary Workflows/
§ R Theme River (R plot)
§ Twitter Word Tree (JavaScript view)



Clustering

Clustering

§ Find groups (clusters) of similar documents


§ Topic detection
§ Exploration
§ Unsupervised learning

§ We can use standard KNIME nodes to cluster the numerical document vectors.



Clustering

Methods:
§ Hierarchical clustering
§ K-Means / Medoids
§ Density based
§ …



Hierarchical Clustering

§ Creates hierarchy for all data points


§ Agglomerative, bottom-up
§ Combine the “closest” data points/clusters, one at a time
§ Hierarchy can be illustrated by dendrogram
§ Applicable only on small data sets (<5000)

§ Complete linkage: combine data object/cluster with minimal maximum distance


§ Finds compact, convex clusters
§ Single linkage: combine data object/cluster with minimal minimum distance
§ Also finds concave clusters
§ Average linkage: distance between two clusters c1 and c2 = mean distance
between all points in c1 and c2
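The three linkage types differ only in how the pairwise distances between two clusters are summarized. A minimal Python sketch, with an illustrative function name and one-dimensional toy points:

```python
def linkage_distance(cluster_a, cluster_b, dist, linkage="average"):
    """Distance between two clusters under single, complete, or average linkage."""
    pairwise = [dist(a, b) for a in cluster_a for b in cluster_b]
    if linkage == "single":
        return min(pairwise)  # closest pair of points
    if linkage == "complete":
        return max(pairwise)  # farthest pair of points
    if linkage == "average":
        return sum(pairwise) / len(pairwise)  # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")


def absolute_difference(a, b):
    return abs(a - b)


c1, c2 = [0, 1], [3, 5]
single = linkage_distance(c1, c2, absolute_difference, "single")
complete = linkage_distance(c1, c2, absolute_difference, "complete")
average = linkage_distance(c1, c2, absolute_difference, "average")
```

Agglomerative clustering repeatedly merges the two clusters for which this value is smallest, so the choice of linkage decides which shapes of clusters emerge.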



Prototype-based Clustering

§ K-Medoids, K-Means, Fuzzy C-Means, …


§ Data are condensed to a small fixed number of prototypical data points
§ Each prototype represents a subset of data points
§ Applicable on large data sets
§ Number of prototypes (k) must be specified in advance



New Node: Distance Matrix Calculate

§ Computes all pairwise distances


§ Different distance measures available
§ Euclidean, Manhattan, Cosine, Dice, Tanimoto, …
§ Optional distance model input port
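For document vectors, cosine distance is a common choice because it ignores document length. The Python sketch below computes a full pairwise matrix; treating an all-zero vector as maximally distant is an assumption of this example.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty document is maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """All pairwise distances as a square, symmetric matrix."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


vectors = [[1, 0], [0, 1], [2, 0]]
matrix = distance_matrix(vectors)
```

The first and third vectors point in the same direction, so their distance is zero even though their lengths differ.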

Distance column
Document vectors



Distance Matrix Calculate: Configuration

Name of distance Distance measure


column

Columns to use
for distance
computation



New Node: Hierarchical Clustering (DistMatrix)

§ Creates hierarchy of input data points


§ Complete Linkage, Average Linkage, Single Linkage
§ Requires distance column or model

Clustering model
Distance column

Distance function
(optional)



Hierarchical Clustering (DistMatrix): Configuration

Distance column
Linkage
type



New Node: Hierarchical Cluster View

§ Shows:
§ Dendrogram of clustering
§ Distance curve
§ Colors

Dendrogram or distance

Hierarchical
clustering model

Data points, e.g.


document vectors



New Node: Hierarchical Cluster Assigner

§ Assigns data points to clusters based on


§ Distance threshold
§ Number of clusters

Cluster assignment
Hierarchical
clustering model

Data points, e.g.


document vectors



Hierarchical Cluster Assigner: Configuration

Threshold or cluster
count based
assignment



Hierarchical Clustering: Example Workflow

Data, e.g. document vectors
Hierarchy of data points
Illustration of dendrogram
Assignment of clusters



New Node: k-Medoids

§ Computes k prototypes (medoids)


§ Requires distance column or model
§ Requires specification of k
§ Similar nodes:
§ k-Means
§ Fuzzy c-Means

Input: data points and distance column
Output: cluster assignment
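A compact sketch of k-medoids on a precomputed distance matrix is shown below. It uses the simple alternating scheme (assign points to the nearest medoid, then re-pick each cluster's medoid), which is one of several k-medoids variants and not necessarily the one the node implements. The function name and toy data are assumptions.

```python
import random


def k_medoids(dist, k, seed=0, max_iter=100):
    """Alternating k-medoids on a square distance matrix; returns medoid indices
    and, for every point, the index of its medoid. Assumes distinct data points."""
    n = len(dist)
    rng = random.Random(seed)  # fixed seed gives reproducible results
    medoids = sorted(rng.sample(range(n), k))
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            # The medoid is the member with the smallest total distance to its cluster.
            new_medoids.append(min(members, key=lambda c: sum(dist[c][j] for j in members)))
        new_medoids.sort()
        if new_medoids == medoids:
            break
        medoids = new_medoids
    assignment = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, assignment


points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, assignment = k_medoids(dist, k=2)
```

Unlike k-means, the prototypes are always actual data points, which is why only a distance column is required and no mean of vectors has to be computed.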



k-Medoids: Configuration

Distance matrix column
Cluster count k
Random seed for reproducible results



k-Medoids Clustering: Example Workflow

Data e.g.: Assignment of


document vectors clusters



Stay Up-To-Date and Contribute

§ Follow the KNIME Community Journal on Medium


Low Code for Advanced Data Science

§ Daily content on data stories, data science theory,


getting started with KNIME and more
for the community by the community

§ Would you like to share your data story


with the KNIME community?

Contributions are always welcome!



Section Exercise

§ Start with “Exercise: Clustering”


§ What groups of documents are in the data?
§ Compute pairwise cosine distances
§ Apply hierarchical clustering
§ View dendrogram to find out the number of clusters (k)
§ Assign k clusters
§ Apply k-Medoids with k as number of clusters
§ Select documents of one cluster in dendrogram, hilite them, and inspect data in a table view



Section Solution

Clustering
§ Distance Matrix Calculate
§ Hierarchical Clustering (DistMatrix)
§ Hierarchical Cluster View
§ Hierarchical Cluster Assigner
§ k-Medoids



Text Mining – Topic Modeling

Topic Modeling in Text Mining

§ The process of discovering (learning, identifying, extracting) topics across


a collection of documents (corpus)
§ Common assumptions for all topic modeling models:
§ Each document consists of a mix of topics
§ Each topic consists of a collection of words/terms
§ Topics are “hidden” or “latent” constructs in between documents and words
§ The goal of topic modeling is to discover these latent variables (i.e., topics)
that shape the meaning/semantics in the document collection



Topic Modeling Techniques

§ Latent Semantic Analysis (LSA)


§ First, construct the term-by-document matrix (Raw count vs. TF-IDF)
§ Second [optional], use Singular Value Decomposition (SVD)
§ Third, identify/learn the natural groupings/clusters of documents
§ Pros: it is straightforward, relatively easy and fast to apply
§ Cons: exclusive assignments to words and groups; need for very large set of documents and
vocabulary; and lack of interpretability
§ Probabilistic Latent Semantic Analysis (pLSA)
§ Uses probabilities as opposed to singular values (via SVD)
§ Latent Dirichlet Allocation (LDA)
§ A richer representation of the document-topic-word structure…



Latent Dirichlet Allocation (LDA)

§ LDA uses Dirichlet priors/distributions for the document-to-topic and topic-to-


word associations/allocations
§ LDA is a generative statistical model
§ It is an unsupervised learning process
§ Given a set of training data the goal is to identify the underlying distribution by generating samples
from the same distribution
§ Dirichlet distribution - Dir(α)
§ It is a family of continuous multivariate probability distributions parameterized by a vector α of positive
reals.
§ It is a multivariate generalization of the beta distribution
§ Hence, it is also called multivariate beta distribution (MBD)
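To make the generative view concrete, the Python sketch below draws a topic mixture from a Dirichlet distribution (by normalizing independent gamma draws, a standard construction) and then generates words from it. The topic-word weights, the value of α, and all names are made up for illustration; this shows the generative story only, not how LDA is fitted.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dir(alpha) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate words: pick a topic per word from theta, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # document-to-topic weights
    topics = list(topic_word)
    words = []
    for _ in range(length):
        topic = rng.choices(topics, weights=theta)[0]
        vocabulary = list(topic_word[topic])
        weights = [topic_word[topic][w] for w in vocabulary]
        words.append(rng.choices(vocabulary, weights=weights)[0])
    return theta, words


# Hypothetical topic-to-word weights for three topics.
topic_word = {
    "A": {"pasta": 0.6, "wine": 0.4},
    "B": {"noodle": 0.7, "tea": 0.3},
    "C": {"burger": 0.5, "fries": 0.5},
}
rng = random.Random(7)
theta, words = generate_document(topic_word, alpha=[0.5, 0.5, 0.5], length=8, rng=rng)
```

With small α values the sampled mixtures tend to be dominated by one topic, which matches the intuition that a document is mostly about one subject.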



Latent Dirichlet Allocation (LDA)
[Figure: three-layer diagram. Documents (Doc 1 … Doc m) are connected to the latent topics
(Topic 1 … Topic k) by topics-to-documents weights; topics are connected to words
(Word 1 … Word n) by words-to-topics weights.]



Latent Dirichlet Allocation (LDA)

§ Example: a corpus covers three different subject areas (1, 2, 3) and is
presumed to contain three topics (A, B, and C)
§ With LDA, each topic is associated with each document with a weight, and each
word is associated with each topic with a weight. In each document, the topic
distribution is dominated by one topic:
§ Doc from Subject 1: 88% topic A, 5% topic B, 7% topic C
§ Doc from Subject 2: 5% topic A, 91% topic B, 4% topic C
§ Doc from Subject 3: 5% topic A, 6% topic B, 89% topic C
§ Example: Genres associated with movies
§ Each movie is associated with multiple genres with different weights



Latent Dirichlet Allocation (LDA) in KNIME

Output



Supplementary Workflows

R Theme River

Creates theme river using ggplot2.


§ ggplot2 has to be installed!
§ Change lib path



Twitter Word Tree

Creates a word tree using the JavaScript Google charting library.



Term Co-occurrences

Term co-occurrences of all term pairs are counted on sentence


and document level.



Topic Extraction

Extracts two topics from the input documents and 10 words


to represent each topic.



Social Media Analysis

Leader / Follower
analysis of users

Sentiment
analysis of users



Social Media Analysis

§ Slashdot forum data


§ Text Mining: sentiment analysis of users
§ Network Mining: leader and follower scoring of users



Romeo and Juliet

Read epub file
Tag character names and count frequencies
Load JPEG and convert to PNG
Insert PNG images and visualize network



Romeo and Juliet

§ Interaction network of characters.


§ Border color indicates family assignment
§ Node size is related to TF of character
names



Text Mining Application
Analytics Goes to Hollywood

Forecasting Box Office Success of Hollywood Movies
§ Dursun Delen, Ph.D.
§ Regents Professor
§ Department of Management Science and Information Systems
§ Spears School of Business, Oklahoma State University



Forecasting Box-Office: A Tough Problem!

“… No one can tell you how a movie


is going to do in the marketplace…
not until the film opens in darkened
theatre and sparks fly up between the
screen and the audience…”

Mr. Jack Valenti


Long time President and CEO
of the Motion Picture Association of America



Current work on prediction…

§ A lot of research has been done
§ Behavioral models
§ Analytical models
(-) Predict after the initial release

§ Our approach
§ Use a data mining approach
§ Use as much historical data as possible
§ Make it web-enabled
(+) Predict before the initial release



Our Analytics Approach – Movie Forecast Guru

§ DATA – 849 movies released between 1998 and 2006


§ Movie Decision Parameters:
§ Intensity of competition rating
§ MPAA Rating
§ Star power
§ Genre
§ Technical Effects
§ Sequel?
§ Estimated screens at opening
§ …more…
§ Output: Box office gross receipts (flop → blockbuster)

Class No.       1       2       3       4       5       6       7       8       9
Range           < 1     > 1     > 10    > 20    > 40    > 65    > 100   > 150   > 200
(in Millions)   (Flop)  < 10    < 20    < 40    < 65    < 100   < 150   < 200   (Blockbuster)
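Turning a gross figure into one of the nine classes is a simple binning step. The Python sketch below uses the class boundaries from the table; how a value that falls exactly on a boundary is classified is not stated on the slide, so assigning it to the higher class is an assumption.

```python
from bisect import bisect_right

# Upper boundaries (in millions) between classes 1..9, taken from the table.
CLASS_BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]


def box_office_class(gross_millions):
    """Map box office gross (in millions) to class 1 (flop) .. 9 (blockbuster)."""
    # Assumption: a value exactly on a boundary goes to the higher class.
    return bisect_right(CLASS_BOUNDS, gross_millions) + 1


examples = {gross: box_office_class(gross) for gross in (0.5, 5, 64.9, 250)}
```

This discretization is what turns the forecasting task into a nine-class classification problem.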



Data Description

Dependent Variable

Class No.       1       2       3       4       5       6       7       8       9
Range           < 1     > 1     > 10    > 20    > 40    > 65    > 100   > 150   > 200
(in $Millions)  (Flop)  < 10    < 20    < 40    < 65    < 100   < 150   < 200   (Blockbuster)

Independent Variables

Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related,
                                          Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects        3                  High, Medium, Low
Sequel                 1                  Yes, No
Number of screens      1                  Positive integer

A Typical Classification Problem



Our Analytics Approach – Movie Forecast Guru

PREDICTION MODELS
§ Statistical Models:
§ Discriminant Analysis
§ Ordinal Multiple Logistic Regression

§ Machine Learning Models:


§ Artificial Neural Networks
§ Decision Tree Induction
§ CART - Classification & Regression Trees
§ C5 - Decision Tree
§ Support Vector Machines
§ Rough Sets
§ Ensembles…



Prediction Results

Prediction Models

                      Individual Models            Ensemble Models
Performance Measure   SVM      ANN      C&RT       Random Forest   Boosted Tree   Fusion (Average)
Count (Bingo)         192      182      140        189             187            194
Count (1-Away)        104      120      126        121             104            120
Accuracy (% Bingo)    55.49%   52.60%   40.46%     54.62%          54.05%         56.07%
Accuracy (% 1-Away)   85.55%   87.28%   76.88%     89.60%          84.10%         90.75%
Standard deviation    0.93     0.87     1.05       0.76            0.84           0.63


* Training set: 1998 – 2005 movies; Test set: 2006 movies
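The two accuracy rows can be reproduced from the counts. "Bingo" is an exact class match, and the reported 1-Away accuracy is cumulative: exact matches plus predictions that are off by one class. The test-set size of 346 is not stated on the slide; it is inferred here from 192 / 55.49% and is consistent with the other columns. The Python sketch below shows both the general calculation and the check for the SVM column.

```python
def bingo_and_one_away(actual, predicted):
    """Exact-match accuracy and within-one-class accuracy for ordinal class labels."""
    n = len(actual)
    bingo = sum(1 for a, p in zip(actual, predicted) if a == p)
    one_away = sum(1 for a, p in zip(actual, predicted) if abs(a - p) == 1)
    return bingo / n, (bingo + one_away) / n


# Toy example with four movies (classes 1..9).
bingo_acc, within_one_acc = bingo_and_one_away([1, 5, 9, 3], [1, 4, 7, 3])

# Check against the SVM column; 346 test movies is an inferred value.
TEST_SET_SIZE = 346
svm_bingo_pct = round(192 / TEST_SET_SIZE * 100, 2)
svm_one_away_pct = round((192 + 104) / TEST_SET_SIZE * 100, 2)
```

Because the classes are ordered, a prediction that misses by one class is a much smaller error than one that misses by several, which is why both measures are reported.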



Text Mining the Plot Summaries [Storylines]

§ Is there any predictive power within the plot storylines?



Text Mining the Plot Summaries [Storylines]

§ Read the data


§ Movie_Dataset_2002-2006.xls
§ Process (numericize/structure) the data
§ Try Bag-of-Words approach
§ Try LDA
§ Build and test prediction models
§ Use single models (Decision Tree, ANN, SVM, kNN, etc.)
§ Use model ensembles (Random Forest, Boosted Trees, etc.)
§ Compare the findings



Web-based DSS
[Figure: architecture of the Movie Forecast Guru (MFG) web-based DSS.
Components: user (manager); GUI (internet browser); MFG engine (web server); prediction models
(local and remote); remote data sources; MFG database; knowledge base (business rules).
Connection labels: HTML, TCP/IP, web services (XML / SOAP), ODBC & ETL, ETL, XML.]



Implementation of MFG



References for Movie Prediction

References:
Delen, D., R. Sharda and P. Kumar (2006). “Movie Forecast Guru: A Web-based DSS for Hollywood
Managers”, Decision Support Systems. In press.
Henry, M., R. Sharda and D. Delen (2007). “Using Neural Networks to Forecast Box-Office Success”,
Americas Conference on Information Systems (AMCIS), Keystone, Colorado. Association for
Information Systems, 1512-1516.
Sharda, R. and D. Delen (2006). “Predicting box-office success of motion pictures with neural networks”
Expert Systems with Applications, 30(2), 243-254.
Sharda, R., D. Delen (2006). “How to Predict a Movie’s Success at the Box Office”, FORESIGHT: The
International Journal of Applied Forecasting, October 2006.

Media Coverage of “Box Office Forecasting” Project





Thank you