KNIME PDF
Nodes perform tasks on data
Workflows combine nodes to model data flow
[Figure: example workflow with Document, Components, and Preprocessing nodes]
Databases
MySQL, PostgreSQL, Oracle
Theobald
any JDBC (DB2, MS SQL Server)
Amazon DynamoDB
Files
CSV, txt, Excel, Word, PDF
SAS, SPSS
XML, JSON, PMML
Images, texts, networks
Other
Twitter, Google
Amazon S3, Azure Blob Store
Sharepoint, Salesforce
Kafka
REST, Web services
Preprocessing
Row, column, matrix based
Data blending
Join, concatenate, append
Aggregation
Grouping, pivoting, binning
Feature Creation and Selection
Regression
Linear, logistic
Classification
Decision tree, ensembles, SVM, MLP, Naïve Bayes
Clustering
k-means, DBSCAN, hierarchical
Validation
Cross-validation, scoring, ROC
Deep Learning
Keras, DL4J
External
R, Python, Weka, H2O, Keras
Interactive Visualizations
JavaScript-based nodes
Scatter Plot, Box Plot, Line Plot
Networks, ROC Curve, Decision Tree
Plotly Integration
Adding more with each release!
Misc
Tag cloud, OpenStreetMap, molecules
Script-based visualizations
R, Python
Database
Files
Excel, CSV, txt
XML
PMML
to: local, KNIME Server, Amazon S3, Azure Blob Store
BIRT Reporting
KNIME Explorer
Node Description
Workflow Coach
Workflow Editor
KNIME Hub
Node Repository
By default the Node Monitor shows the output table of the node selected in the workflow editor.
Click the three dots in the upper right to show the flow variables, configuration, etc.
Port types: Data, Model, Flow Variable, Image, DB Connection, DB Data
Not Configured: The node is waiting for configuration or incoming data.
Configured: The node has been configured correctly and can be executed.
Executed: The node has been successfully executed. Results may be viewed and used in downstream nodes.
Error: The node has encountered an error during execution.
Node Execution
Right-click the node
Select Execute in the context menu
If execution is successful, the status shows a green light
If execution encounters errors, the status shows a red light
The buttons in the toolbar apply to the active workflow. The most important are:
  Execute selected and executable nodes (F7)
  Execute all executable nodes
  Execute selected nodes and open first view
  Cancel selected, running nodes (F9)
  Cancel all running nodes
[Figure: Scatter Plot node with a Data input port and a Plot View output]
https://hub.knime.com
Getting Started: KNIME Example Server
Node Execution
  Shift + F10: executes all configured nodes and opens all views
  F9: cancels selected running nodes
  Shift + F9: cancels all running nodes
Node Connections
  Ctrl + L: connects selected nodes
  Ctrl + Shift + L: disconnects selected nodes
Move Nodes and Annotations
  Ctrl + Shift + Arrow: moves the selected node in the arrow direction
  Ctrl + Shift + PgUp/PgDown: moves the selected annotation to the front or back of all overlapping annotations
Workflow Operations
  F8: resets selected nodes
  Ctrl + S: saves the workflow
  Ctrl + Shift + S: saves all open workflows
  Ctrl + Shift + W: closes all open workflows
Metanode
  Shift + F12: opens metanode wizard
There are many strategies to deal with this problem (see the "How to deal with missing values?" post on the KNIME Blog: https://www.knime.com/blog/how-to-deal-with-missing-values).
We adopt the strategy that predicts the missing values based on the other attributes in the same data row.
Copy SQL statement
You can download the training workflows from the KNIME Hub:
https://hub.knime.com/knime/spaces/Education/latest/Courses/
Aggregate to summarize data. Example: aggregated on "Group" by method sum("Value"):

Input table:
Row ID | Group | Value
r1 | m | 2
r2 | f | 3
r3 | m | 1
r4 | f | 5
r5 | f | 7
r6 | m | 5

Aggregated output:
Group | sum(Value) | aggregated rows
m | 8 | r1+r3+r6
f | 15 | r2+r4+r5
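In the course this aggregation is done with KNIME's GroupBy node; as a reference-only sketch, the same grouping in PySpark (Python, which the course also integrates) could look as follows. Note that r2's value of 3 is inferred from the group sum f = 15:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    rows = [("r1", "m", 2), ("r2", "f", 3), ("r3", "m", 1),
            ("r4", "f", 5), ("r5", "f", 7), ("r6", "m", 5)]
    df = spark.createDataFrame(rows, ["row_id", "Group", "Value"])

    # Group on "Group" and aggregate with sum("Value"): m -> 8, f -> 15.
    df.groupBy("Group").agg(F.sum("Value").alias("sum(Value)")).show()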
Column selection patterns: "Matches all columns" or "Matches all numeric columns"
[Figure: Join by ID (Inner Join) - missing values appear in the right table for unmatched left rows and in the left table for unmatched right rows; columns from the left table and columns from the right table go to the output table]
Optional: Sort the data rows by descending AGEP and extract the top 10 only. Hint: Use LIMIT to restrict the number of rows returned by the database.
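A minimal runnable sketch of the SQL this hint points at, using Python's built-in sqlite3 as a stand-in database (in KNIME the statement is assembled by the DB nodes; table and column names are taken from the exercise, the sample values are made up):

    import sqlite3

    # Hypothetical stand-in for the course database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ss13pme (SERIALNO INTEGER, AGEP INTEGER)")
    conn.executemany("INSERT INTO ss13pme VALUES (?, ?)",
                     [(1, 45), (2, 20), (3, 87), (4, 42)])

    # Sort by descending AGEP; LIMIT restricts the number of rows returned.
    top = conn.execute(
        "SELECT * FROM ss13pme ORDER BY AGEP DESC LIMIT 10").fetchall()
    print(top)  # rows with the highest AGEP first
    conn.close()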
Predicting COW Values with KNIME
Missing Value Implementation Approach
Train a Decision Tree to predict the COW where COW is not null
Apply the Decision Tree Model to predict COW where COW is missing (null)
Create table as select
Insert/append data
Update values in table
  Columns to update
  Columns that identify the records to update
Delete rows from table
Append to or drop existing table
  Increase batch size for better performance
  Apply custom variable types
Drop table
  Missing table handling, cascade option
Execute any SQL statement, e.g. DDL
Manipulate existing queries
DB Transaction Start/End
  Take advantage of these nodes to group several database data manipulation operations into a single unit of work. The transaction either completes entirely or not at all. Uses the default isolation level of the connected database.
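The all-or-nothing behavior of such a transaction can be sketched outside KNIME with Python's sqlite3 module; the with-block below plays the role of the DB Transaction Start/End pair (table and values are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])

    try:
        # The connection as a context manager wraps one transaction:
        # commit if the block succeeds, roll back if it raises.
        with conn:
            conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
            conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
    except sqlite3.Error:
        pass  # both updates were rolled back together

    print(conn.execute("SELECT * FROM accounts").fetchall())
    conn.close()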
Optional: Write the learned Decision Tree Model and the timestamp into a new table named "model".
Access: Hive. Storage: HDFS.
[Figure: a large file is split into blocks (default: 64 MB) that are distributed across DataNodes]
NameNode
  Master service that manages the file system namespace
  Maintains metadata for all files and directories in the filesystem tree
  Knows on which DataNodes the blocks of a given file are located
  The whole system depends on the availability of the NameNode
DataNodes
  Workers; store and retrieve blocks per request of client or NameNode
  Periodically report to the NameNode that they are running and which blocks they are storing
Data Replication
All blocks of a file are stored as a sequence of blocks
Blocks of a file are replicated
[Figure: a client PC (e.g. KNIME) connects to HDFS in a Hadoop cluster; blocks B1, B2, B3 of File 1 are replicated across several DataNodes]
[Figure: Hadoop stack - Hive on top of the MapReduce / Tez / Spark execution engines, YARN resource management, and HDFS storage]
[Figure: MAP(...) and REDUCE(...) steps run on DataNodes over files such as table_1.csv, table_2.csv, table_3.csv]
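To make the MAP(...) / REDUCE(...) steps concrete, here is a toy, single-machine sketch in plain Python (the real thing runs distributed across DataNodes; file contents are made up):

    from collections import Counter
    from functools import reduce

    # Toy stand-ins for table_1.csv, table_2.csv, table_3.csv.
    files = [["John Doe,35", "Jane Roe,41"], ["John Doe,35"], ["Max Mustermann,29"]]

    # MAP: emit a (name, 1) pair for every record in every file.
    mapped = [(line.split(",")[0], 1) for f in files for line in f]

    # REDUCE: sum the counts per key.
    counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}), mapped, Counter())
    print(counts)  # Counter({'John Doe': 2, ...})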
Immutable:
  Data manipulation = creating a new DataFrame from an existing one by applying a function to it
Lazily evaluated:
  Functions are not executed until an action is triggered that requests to actually see the row data
Distributed:
  Each row belongs to exactly one partition
  Each partition is held by a Spark Executor
Note: Earlier versions of KNIME and Spark used RDDs (resilient distributed datasets). In KNIME, DataFrames are always used in Spark 2 and later.
Spark – Lazy Evaluation
[Figure: a chain of transformations only runs once an action triggers evaluation; see the sketch below]
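A minimal PySpark sketch of immutability and lazy evaluation (local session; data is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame([("John Doe", 35), ("Jane Roe", 41)], ["name", "age"])

    # Immutable + lazy: filter() returns a NEW DataFrame and only records
    # the plan; no row data is touched yet.
    adults = df.filter(df.age > 30).select("name")

    # An action (show/collect/count) triggers the actual evaluation.
    adults.show()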
Spark Context
Main entry point for Spark functionality
Represents the connection to a Spark cluster
Allocates resources on the cluster
[Figure: the Spark Driver holds the Spark Context and schedules tasks onto Spark Executors, each running multiple tasks]
[Figure: KNIME big data architecture]
KNIME Analytics Platform with Big Data Extensions:
  Build Spark workflows graphically
  Submit Hive queries via JDBC (HiveServer2)
  Interact with Spark via HTTP(S) (Livy)
  Upload workflows to KNIME Server via HTTP(S)
KNIME Server with Big Data Extensions:
  Scheduled execution and RESTful workflow submission
  Submit large Impala queries via JDBC
Both connect to the Hadoop cluster.
Other possible nodes
[Figure: Sales parent table with a partition column]
About partition columns:
  Optional (!) performance optimization
  Use columns that are often used in WHERE clauses
  Use only categorical columns with a suitable value range (see the sketch below)
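A hedged PySpark sketch of a partitioned table (table and column names are made up; assumes a Hive-enabled Spark session such as the Local Big Data Environment provides):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.master("local[*]")
             .enableHiveSupport().getOrCreate())

    sales = spark.createDataFrame(
        [(1, "EMEA", 100.0), (2, "APAC", 80.0)], ["id", "region", "amount"])

    # "region" is categorical and frequently filtered on, so it is a
    # reasonable partition column; each value becomes its own directory.
    sales.write.partitionBy("region").mode("overwrite").saveAsTable("sales")

    # A filter on the partition column reads only matching partitions.
    spark.table("sales").filter("region = 'EMEA'").show()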
Start from the workflow that implements the missing value strategy and write the results back into Hive. That is: write the results into a new table named "newTable" in Hive, using the HDFS Connection of the Local Big Data Environment along with both the DB Table Creator and DB Loader nodes.
Connection/Administration
  HDFS Connection
  HDFS File Permission
Utilize the existing remote file handling nodes:
  Transfer Files
  Create Folders
  Delete Files/Folders
  Compress and Decompress
Full documentation available
Runs on Hadoop
Supported Spark Versions
1.2, 1.3, 1.5, 1.6, 2.x
One KNIME extension for all Spark versions
Scalable machine learning library (Spark MLlib and spark.ml)
Algorithms for
Classification (decision tree, naïve Bayes, logistic regression, …)
Regression (linear regression, …)
Clustering (k-means)
Collaborative filtering (ALS)
Dimensionality reduction (SVD, PCA)
Item sets / Association rules
Spark Context Port
[Figure: a Spark context is created from KNIME; data is imported into Spark from Hive queries, from databases, and from other sources stored in HDFS. Database connections are also supported.]
Spark DataFrame Ports
To do:
Connect to Spark via the Create Local Big Data Environment node
To do:
Column Filter to remove the PWGTP* and PUMA* columns
Join with ss13hme on SERIALNO
Find the rows with the top ten AGEP values in the dataset and import them into KNIME
Calculate the average AGEP per SEX group
Split the dataset into two:
  One where COW is null
  One where COW is not null
(See the PySpark sketch after this list.)
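A hedged PySpark sketch of these steps, assuming ss13pme and ss13hme are already registered as Spark tables (in the course they arrive through KNIME's Spark nodes):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder.master("local[*]")
             .enableHiveSupport().getOrCreate())
    pme = spark.table("ss13pme")
    hme = spark.table("ss13hme")

    # Column Filter: drop the PWGTP* and PUMA* columns.
    keep = [c for c in pme.columns if not c.startswith(("PWGTP", "PUMA"))]
    pme = pme.select(*keep)

    # Join with ss13hme on SERIALNO.
    joined = pme.join(hme, on="SERIALNO")

    # Top ten AGEP rows, pulled into the local (KNIME) side.
    top10 = joined.orderBy(F.col("AGEP").desc()).limit(10).collect()

    # Average AGEP per SEX group.
    avg_age = joined.groupBy("SEX").agg(F.avg("AGEP").alias("avg_AGEP"))

    # Split the dataset by missing / present COW.
    cow_null = joined.filter(F.col("COW").isNull())
    cow_not_null = joined.filter(F.col("COW").isNotNull())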
Spark MLlib model
Alternative nodes
Spark Predictor Node
Alternative nodes
On the ss13pme table, the current workflow separates the rows where COW is null from those where COW is not null, and then modifies COW to be zero-based.
To do:
Where COW is not null:
  Fix missing values in the feature columns
  Train a decision tree to predict COW on those data rows
Where COW is null:
  Remove the COW column
  Apply the decision tree model to predict COW
(A PySpark sketch follows below.)
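A hedged spark.ml sketch of this train/apply pattern (the course uses the Spark Decision Tree Learner and Spark Predictor nodes; DataFrame and column names continue the sketch above and are assumptions):

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import DecisionTreeClassifier

    # Assume cow_not_null / cow_null are the Spark DataFrames from the
    # previous step, with numeric feature columns and zero-based COW.
    features = [c for c in cow_not_null.columns if c not in ("COW", "SERIALNO")]
    assembler = VectorAssembler(inputCols=features, outputCol="features",
                                handleInvalid="keep")  # option available in Spark 2.4+

    # Train the decision tree where COW is not null.
    model = DecisionTreeClassifier(
        labelCol="COW", featuresCol="features"
    ).fit(assembler.transform(cow_not_null))

    # Apply the model where COW is null to predict the missing values.
    predictions = model.transform(assembler.transform(cow_null.drop("COW")))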
[Figure: two deployment patterns - learn the model at scale (MLlib model) and apply the model at scale; or do sophisticated model learning in KNIME (PMML model) and apply the model on demand]
To KNIME: Spark DataFrame -> KNIME data table
To CSV file in HDFS: remote file connection + Spark DataFrame
To Hive: Hive connection + Spark DataFrame -> Hive table
To other storages: Spark DataFrame
To database: database connection + Spark DataFrame -> database table
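Hedged PySpark equivalents of these export paths (in KNIME each path is a dedicated node such as Spark to Table, Spark to CSV, or Spark to Hive; paths, table names, and connection details are placeholders):

    # Assume `df` is a Spark DataFrame, e.g. the predictions from above.

    local_rows = df.collect()                                 # -> local (KNIME) table
    df.write.mode("overwrite").csv("hdfs:///tmp/out_csv")     # -> CSV files in HDFS
    df.write.mode("overwrite").saveAsTable("newTable")        # -> Hive table
    df.write.mode("overwrite").parquet("hdfs:///tmp/out_pq")  # -> other storage
    df.write.jdbc(url="jdbc:postgresql://host/db", table="newTable",
                  properties={"user": "x", "password": "y"})  # -> database table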
Connectivity to
Google Cloud Storage
Google BigQuery (via DB Nodes)
Google Cloud Dataproc
This workflow provides a Spark predictor to predict COW values for the ss13pme data set. The model is applied to predict COW values where they are missing.
Now export the new data set without missing values to:
  A KNIME table
  A Parquet file in HDFS
  A Hive table
Benefits:
Efficient compression: stored as columns and compressed, which leads to smaller disk reads.
Fast reads: data is split into multiple files. Files include a built-in index, min/max values, and other aggregates. In addition, predicate pushdown pushes filters into reads so that minimal rows are read.
Proven in large-scale deployments: Facebook uses the ORC file format for a 300+ PB deployment.
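A small PySpark sketch of writing and reading ORC, where the filter can be pushed down into the read (path and data are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    df = spark.createDataFrame(
        [(1, "female", 45), (2, "male", 20)], ["ID", "Gender", "Age"])

    # Columnar, compressed storage with min/max statistics per stripe.
    df.write.mode("overwrite").orc("/tmp/people_orc")

    # The Age > 40 predicate is pushed into the ORC reader, so stripes
    # whose min/max metadata rule out matches can be skipped entirely.
    spark.read.orc("/tmp/people_orc").filter("Age > 40").show()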
Example: rows (ID, Gender, Age) are stored in stripes, each carrying metainformation:

Stripe 1: e.g. (1, female, 45), (2, male, 20), ⋮
  Metainformation: ID: min = 1; max = 3333 | Gender: female; male | Age: min = 20; max = 45
Stripe 2: e.g. (3334, female, 42), ⋮
  Metainformation: ID: min = 3334; max = 6666 | Gender: female; male | Age: min = 5; max = 24
Stripe 3: e.g. ⋮, (10000, male, 42)
  Metainformation: ID: min = 6667; max = 10000 | Gender: female; male | Age: min = 45; max = 87