
Overview

KNIME Analytics Platform


What is KNIME Analytics Platform?

A tool for data analysis, manipulation, visualization, and reporting
Based on the graphical programming paradigm
Provides a diverse array of extensions:
Text Mining
Network Mining
Cheminformatics
Many integrations, such as Java, R, Python, Weka, Keras, Plotly, H2O, etc.

Visual KNIME Workflows

Nodes perform tasks on data

Workflows combine nodes to model data flow

Components encapsulate complexity & expertise

Data Access

Databases
MySQL, PostgreSQL, Oracle
Theobald
any JDBC (DB2, MS SQL Server)
Amazon DynamoDB
Files
CSV, txt, Excel, Word, PDF
SAS, SPSS
XML, JSON, PMML
Images, texts, networks
Other
Twitter, Google
Amazon S3, Azure Blob Store
Sharepoint, Salesforce
Kafka
REST, Web services

Big Data

Spark & Databricks


HDFS support
Hive
Impala
In-database processing

Transformation

Preprocessing
Row, column, matrix based
Data blending
Join, concatenate, append
Aggregation
Grouping, pivoting, binning
Feature Creation and Selection

Analysis & Data Mining

Regression
Linear, logistic
Classification
Decision tree, ensembles, SVM, MLP, Naïve Bayes
Clustering
k-means, DBSCAN, hierarchical
Validation
Cross-validation, scoring, ROC
Deep Learning
Keras, DL4J
External
R, Python, Weka, H2O, Keras

Visualization

Interactive Visualizations
JavaScript-based nodes
Scatter Plot, Box Plot, Line Plot
Networks, ROC Curve, Decision Tree
Plotly Integration
Adding more with each release!
Misc
Tag cloud, OpenStreetMap, molecules
Script-based visualizations
R, Python

Deployment

Database
Files
Excel, CSV, txt
XML
PMML
to: local, KNIME Server, Amazon S3, Azure Blob Store
BIRT Reporting

Over 2000 Native and Embedded Nodes
Included:

Data Access: MySQL, Oracle, …; SAS, SPSS, …; Excel, Flat, …; Hive, Impala, …; XML, JSON, PMML; Text, Doc, Image, …; Web Crawlers; Community / Industry Specific / 3rd party
Transformation: Row, Column, Matrix; Text, Image; Time Series; Java, Python; Community / 3rd party
Analysis & Mining: Statistics; Data Mining; Machine Learning; Web Analytics; Text Mining; Network Analysis; Social Media Analysis; R, Weka, Python; Community / 3rd party
Visualization: R; JFreeChart; JavaScript; Plotly; Community / 3rd party
Deployment: via BIRT; PMML; XML, JSON; Databases; Excel, Flat, etc.; Text, Doc, Image; Industry Specific; Community / 3rd party
Install KNIME Analytics Platform
Select the KNIME version for your computer:
Mac
Windows – 32 or 64 bit
Linux
Download the archive and extract the file, or download the installer package and run it

Start KNIME Analytics Platform

Use the shortcut created by the installer
Or go to the installation directory and launch KNIME via knime.exe

The KNIME Workspace

The workspace is the folder/directory in which workflows (and potentially data files) are stored for the current KNIME session
Workspaces are portable (just like KNIME)

The KNIME Analytics Platform Workbench

Panels: KNIME Explorer, Workflow Coach, Workflow Editor, Node Repository, Node Description, KNIME Hub view, Console & Node Monitor, Outline

KNIME Explorer

In LOCAL you can access your own workflow projects.
Other mountpoints allow you to connect to:
EXAMPLES Server
KNIME Hub
KNIME Server
The Explorer toolbar on the top has a search box and buttons to:
select the workflow displayed in the active editor
refresh the view
The KNIME Explorer can contain 4 types of content:
Workflows
Workflow groups
Data files
Shared Components

Creating New Workflows, Importing, and
Exporting
Right-click inside the KNIME Explorer to create a new workflow or a workflow group, or to import a workflow
Right-click the workflow or workflow group to export

Node Repository

The Node Repository lists all KNIME nodes
The search box has 2 modes:
Standard Search – exact match of node name
Fuzzy Search – finds the most similar node name

Description

The Description view provides information about:
Node functionality
Input & output
Node settings
Ports
References to literature

Workflow Description

When selecting the workflow, the Description view gives you information about the workflow:
Title
Description
Associated tags and links
Creation date
Author

Workflow Coach

Node recommendation engine
Gives hints about which node to use next in the workflow
Based on the KNIME community's usage statistics
Based on your own KNIME workflows

Node Monitor

By default the Node Monitor shows you the output table of the node selected in the workflow editor
Click on the three dots on the upper right to show the flow variables, configuration, etc.

Console and Other Views

The Console view prints out error and warning messages about what is going on under the hood
Click View and select Other… to add different views: Node Monitor, Licenses, etc.

Inserting and Connecting Nodes

Insert nodes into the workspace by dragging them from the Node Repository or by double-clicking in the Node Repository
Connect nodes by left-clicking the output port of Node A and dragging the cursor to the (matching) input port of Node B
Common port types: Data, Model, Flow Variable, Image, DB Connection, DB Data

More on Nodes…

A node can have 4 states:

Not Configured: the node is waiting for configuration or incoming data.
Configured: the node has been configured correctly, and can be executed.
Executed: the node has been successfully executed. Results may be viewed and used in downstream nodes.
Error: the node has encountered an error during execution.
Node Configuration

Most nodes need to be configured
To access a node configuration dialog:
Double-click the node
Right-click -> Configure

Node Execution

Right-click the node
Select Execute in the context menu
If execution is successful, the status shows a green light
If execution encounters errors, the status shows a red light

Tool Bar

The buttons in the toolbar can be used for the active workflow. The most important buttons are:

Execute selected and executable nodes (F7)
Execute all executable nodes
Execute selected nodes and open first view
Cancel all selected, running nodes (F9)
Cancel all running nodes

Node Views

Right-click a node to inspect the execution results:
select output ports (last option in the context menu) to inspect tables, images, etc.
select Interactive View to open visualization results in a browser

KNIME File Extensions

Dedicated file extensions for workflows and workflow groups associated with KNIME Analytics Platform:

*.knwf for KNIME Workflow Files
*.knar for KNIME Archive Files

Getting Started: KNIME Hub

Place to search and share:
Workflows
Nodes
Components
Extensions

https://hub.knime.com
Getting Started: KNIME Example Server

Connect via KNIME Explorer to a public repository with a large selection of example workflows for many, many applications

Hot Keys (for Future Reference)
Task | Hot key | Description
Node Configuration | F6 | opens the configuration window of the selected node
Node Execution | F7 | executes selected configured nodes
Node Execution | Shift + F7 | executes all configured nodes
Node Execution | Shift + F10 | executes all configured nodes and opens all views
Node Execution | F9 | cancels selected running nodes
Node Execution | Shift + F9 | cancels all running nodes
Node Connections | Ctrl + L | connects selected nodes
Node Connections | Ctrl + Shift + L | disconnects selected nodes
Move Nodes and Annotations | Ctrl + Shift + Arrow | moves the selected node in the arrow direction
Move Nodes and Annotations | Ctrl + Shift + PgUp/PgDown | moves the selected annotation in front of or behind all overlapping annotations
Workflow Operations | F8 | resets selected nodes
Workflow Operations | Ctrl + S | saves the workflow
Workflow Operations | Ctrl + Shift + S | saves all open workflows
Workflow Operations | Ctrl + Shift + W | closes all open workflows
Metanode | Shift + F12 | opens metanode wizard

Introduction to the Big Data Course
Goal of this Course

Become familiar with the Hadoop ecosystem and the KNIME Big Data Extensions

What you need:
KNIME Analytics Platform with:
KNIME Big Data Connectors
KNIME Extension for Apache Spark
KNIME Extension for Local Big Data Environment
KNIME File Handling Nodes

Installation of Big Data Extensions

…or install via drag-and-drop from the KNIME Hub

Big Data Resources (1)

SQL Syntax and Examples
https://www.w3schools.com

Apache Spark MLlib
https://spark.apache.org/docs/latest/ml-guide.html

KNIME Big Data Extensions (Hadoop + Spark)
https://www.knime.com/knime-big-data-extensions

Example workflows on KNIME Hub
https://www.knime.com/nodeguide/big-data

Big Data Resources (2)

Whitepaper “KNIME opens the Doors to Big Data”
https://www.knime.com/sites/default/files/inline-images/big_data_in_knime_1.pdf
Blog Posts
https://www.knime.org/blog/Hadoop-Hive-meets-Excel
https://www.knime.com/blog/SparkSQL-meets-HiveSQL
https://www.knime.com/blog/speaking-kerberos-with-knime-big-data-extensions
https://www.knime.com/blog/new-file-handling-out-of-labs-and-into-production
Video
https://www.knime.com/blog/scaling-analytics-with-knime-big-data-extensions

Overview

1. Use a traditional database and KNIME Analytics Platform native machine learning nodes
2. Move in-database processing to Hadoop Hive
3. Move in-database processing and machine learning to Spark

Today’s Example: Missing Values Strategy

Missing values are a big problem in data science!
Many strategies to deal with the problem (see “How to deal with missing values” on the KNIME Blog: https://www.knime.com/blog/how-to-deal-with-missing-values)
We adopt the strategy that predicts the missing values based on the other attributes in the same data row
CENSUS data set with missing COW values from http://www.census.gov/programs-surveys/acs/data/pums.html

CENSUS Data Set

CENSUS data contains questions to a sample of US residents (1%) over 10 years
CENSUS data set description: http://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMSDataDict15.pdf
ss13hme (60K rows) -> questions about housing to Maine residents
ss13pme (60K rows) -> questions about themselves to Maine residents
ss13hus (31M rows) -> questions about housing to all US residents in the sample
ss13pus (31M rows) -> questions about themselves to all US residents in the sample
Missing Values Strategy Implementation

Connect to the CENSUS data set
Separate data rows with COW from data rows with missing COW
Train a decision tree to predict COW (obviously only on data rows with COW)
Apply the decision tree to predict COW where COW is missing
Update the original data set with the new predicted COW values

Today’s Example: Missing Values Strategy

Let’s Practice First on a Traditional Database

Database Extension
Database Extension

Visually assemble complex SQL statements (no SQL coding needed)
Connect to all JDBC-compliant databases
Harness the power of your database within KNIME

Database Connectors

Many dedicated DB Connector nodes available
If a connector node is missing, use the DB Connector node with a JDBC driver

In-Database Processing

Database manipulation nodes generate a SQL query on top of the input SQL query (brown square port)
SQL operations are executed on the database!
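
To illustrate how this query stacking works, here is a minimal sketch in plain Python (not KNIME code; the stacked statements and helper names are hypothetical): each manipulation node wraps the incoming statement in a subquery instead of executing it.

# A minimal sketch (plain Python, not KNIME code; table and column names
# are hypothetical) of how DB manipulation nodes stack SQL: each node
# wraps the incoming statement in a subquery instead of executing it.
base_query = "SELECT * FROM ss13pme"

def row_filter(query, condition):
    # Corresponds to a DB Row Filter node: adds a WHERE clause on top.
    return f"SELECT * FROM ({query}) AS t WHERE {condition}"

def column_filter(query, columns):
    # Corresponds to a DB Column Filter node: restricts the SELECT list.
    return f"SELECT {', '.join(columns)} FROM ({query}) AS t"

stacked = column_filter(row_filter(base_query, "COW IS NOT NULL"),
                        ["SERIALNO", "AGEP", "COW"])
print(stacked)  # the whole statement only runs once a reader node executes it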

Export Data

Writing data back into the database
Exporting data into KNIME
Tip

SQL statements can be easily viewed from the DB node output ports.

Database Port Types
Database Port Types

Database Connection Port (brown):
Connection information
SQL statement

Database JDBC Connection Port (red):
Connection information

DB Connection Port View

DB Data Port View

Copy SQL statement

Connect to Database and Import Data
Database Connectors

Dedicated nodes to connect to specific databases:
Necessary JDBC driver included
Easy to use
Import DB-specific behavior/capability

The Hive and Impala connectors are part of the KNIME Big Data Connectors extension

General Database Connector:
Can connect to any JDBC source
Register a new JDBC driver via File -> Preferences -> KNIME -> Databases

Register JDBC Driver

Register single-jar-file JDBC drivers
Register new JDBC drivers with companion files
Open KNIME and go to File -> Preferences, then KNIME -> Databases

DB Connector Node

DB Connector Node – Type Mapping

KNIME will do its best to guess what type mappings are appropriate based on what it knows about your database
If you need more control, you can specify type mappings manually in two ways:
By name, for individual fields – or groups of fields using RegEx
By type
Two separate tabs govern input and output type mappings

Dedicated Database Connectors

MS SQL Server, MySQL, Postgres, SQLite, …
Propagate connection information to other DB nodes

Workflow Credentials – Usage

Replaces username and password fields
Supported by several nodes that require login credentials:
DB connectors
Remote file system connectors
Send mail

Credentials Configuration Node

Works together with all nodes that support workflow credentials

DB Table Selector Node

Takes connection information and constructs a query
Explore DB metadata
Outputs a SQL query

DB Reader Node

Executes the incoming SQL query on the database
Reads the results into a KNIME data table
(input: Database Connection Port; output: KNIME Data Table)

Section Exercise – 01_DB_Connect

Connect to the database (SQLite) newCensus.sqlite in folder 1_Data
Use the SQLite Connector (Note: the SQLite Connector supports the knime:// protocol)
Explore DB metadata
Select table ss13pme (person data in Maine)
Import the data into a KNIME data table

Optional: Create a Credentials Input node and use it in a MySQL Connector instead of user name and password.

You can download the training workflows from the KNIME Hub:
https://hub.knime.com/knime/spaces/Education/latest/Courses/

In-Database Processing
Query Nodes

Filter rows and columns
Join tables/queries
Extract samples
Bin numeric columns
Sort your data
Write your own query
Aggregate your data

Data Aggregation

Input table:
RowID | Group | Value
r1 | m | 2
r2 | f | 3
r3 | m | 1
r4 | f | 5
r5 | f | 7
r6 | m | 5

Aggregated on “Group” by method sum(“Value”):
RowID | Group | Sum(Value)
r1+r3+r6 | m | 8
r2+r4+r5 | f | 15
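
The same aggregation as plain SQL, run here with Python's built-in sqlite3 module for illustration (a sketch mirroring the slide's example table, not the node's exact generated SQL):

import sqlite3

# The aggregation above as a GROUP BY statement; the rows mirror the
# slide's example table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (row_id TEXT, grp TEXT, value INTEGER)")
con.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("r1", "m", 2), ("r2", "f", 3), ("r3", "m", 1),
     ("r4", "f", 5), ("r5", "f", 7), ("r6", "m", 5)],
)
for grp, total in con.execute("SELECT grp, SUM(value) FROM t GROUP BY grp"):
    print(grp, total)  # f 15 and m 8 (row order may vary)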

Database GroupBy Node

Aggregate to summarize data

DB GroupBy Node – Manual Aggregation

Returns the number of rows per group

Database GroupBy Node – Pattern-Based Aggregation

Tick this option if the search pattern is a regular expression; otherwise it is treated as a string with wildcards ('*' and '?')

Database GroupBy Node – Type-Based Aggregation

Matches all columns
Matches all numeric columns

Database GroupBy Node – DB-Specific Aggregation Methods

SQLite: 7 aggregation functions
PostgreSQL: 25 aggregation functions

DB GroupBy Node – Custom Aggregation Function

Joining Columns of Data
Join the left table and the right table by ID:
Inner Join
Left Outer Join: missing values in the right table
Right Outer Join: missing values in the left table

Joining Columns of Data
Join the left table and the right table by ID:
Full Outer Join: missing values in the right table and in the left table

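As a hedged sketch, the join variants above expressed as SQL strings (hypothetical tables left_t and right_t joined on a shared id column), roughly what a join node generates for each mode:

# Hedged sketch: the join variants above as SQL (hypothetical names);
# not the Database Joiner node's exact output.
joins = {
    "inner":       "SELECT * FROM left_t l INNER JOIN right_t r ON l.id = r.id",
    "left_outer":  "SELECT * FROM left_t l LEFT OUTER JOIN right_t r ON l.id = r.id",
    "right_outer": "SELECT * FROM left_t l RIGHT OUTER JOIN right_t r ON l.id = r.id",
    "full_outer":  "SELECT * FROM left_t l FULL OUTER JOIN right_t r ON l.id = r.id",
}
for mode, sql in joins.items():
    print(f"{mode}: {sql}")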
Database Joiner Node

Combines columns from 2 different tables
Top port contains the “left” data table
Bottom port contains the “right” data table

Joiner Configuration – Linking Rows

Values to join on. Multiple joining columns are allowed.

Joiner Configuration – Column Selection

Columns from the left table to include in the output table
Columns from the right table to include in the output table

Database Row Filter Node

Filters rows that do not match the filter criteria
Use the IS NULL or IS NOT NULL operator to filter missing values

Database Sorter Node

Sorts the input data by one or multiple columns

Database Query Node

Executes arbitrary SQL queries
#table# is replaced with the input query
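
A minimal sketch of the #table# substitution (plain Python; the queries are hypothetical): the incoming query from the input port is spliced in as a subquery.

# Sketch of the #table# substitution: the incoming query from the
# input port replaces the placeholder as a subquery.
incoming_query = "SELECT * FROM ss13pme"
user_query = "SELECT SEX, AVG(AGEP) AS avg_age FROM #table# GROUP BY SEX"

final_query = user_query.replace("#table#", f"({incoming_query}) AS t")
print(final_query)
# SELECT SEX, AVG(AGEP) AS avg_age FROM (SELECT * FROM ss13pme) AS t GROUP BY SEX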

Section Exercise – 02_DB_InDB_Processing

From tables ss13hme (house data) and ss13pme (person data) in database newCensus.sqlite create 4 tables:
1. Join ss13hme and ss13pme on SERIALNO. Remove all columns named PUMA* and PWGTP* from both tables.
2. Filter all rows from ss13pme where COW is NULL.
3. Filter all rows from ss13pme where COW is NOT NULL.
4. Calculate the average AGEP for the different SEX groups.
For all 4 tasks, at the end load the data into KNIME.

Optional: Sort the data rows by descending AGEP and extract the top 10 only. Hint: Use LIMIT to restrict the number of rows returned by the database.
Predicting COW Values with KNIME
Missing Value Implementation Approach

Remember that after we perform some in-database ETL on the data, a key task is to fill in missing values for the COW field in our dataset
We could try to do this by applying some simple business rules, but a more sophisticated approach is to build a model to predict COW
Therefore, we will train and apply a decision tree model for COW

Section Exercise – 03_DB_Modelling

Train a Decision Tree to predict COW where COW is not null
Apply the Decision Tree model to predict COW where COW is missing (null)

Write/Load Data into a Database
Database Writing Nodes

Create table as select
Insert/append data
Update values in table
Delete rows from table

DB Writer Node

Writes data from a KNIME data table directly into a database table
Append to or drop the existing table
Increase the batch size for better performance

DB Writer Node (continued)

Writes data from a KNIME data table directly into a database table
Apply custom variable types

DB Connection Table Writer Node

Creates a new database table based on the input SQL query

DB Delete Node

Deletes all database records that match the values of the selected columns
Increase the batch size for better performance

Database Update Node

Updates all database records that match the update criteria
Columns to update
Columns that identify the records to update

Utility

Drop table
missing table handling
cascade option
Execute any SQL statement, e.g. DDL
Manipulate existing queries

More Utility Nodes and Transaction Support
DB Connection Extractor
DB Connection Closer
DB Transaction Start/End

Take advantage of these nodes to group several database data manipulation operations into a single unit of work
The transaction either completes entirely or not at all
Uses the default isolation level of the connected database

The workflow is available on the KNIME Hub
Section Exercise – 04_DB_WritingToDB

Write the original table to the ss13pme_original table with a Database Connection Table Writer node ... just in case we mess up with the updates in the next step.
Update all rows in the ss13pme table with the output of the predictor node. That is, update all rows with a missing COW value with the predicted COW value, using column SERIALNO for the WHERE condition (SERIALNO uniquely identifies each person). Check the UpdateStatus column for success.

Optional: Write the learned Decision Tree model and the timestamp into a new table named "model"

Let’s Now Try the Same with Hadoop

A Quick Intro to Hadoop
Apache Hadoop

Open-source framework for distributed storage and processing of large data sets
Designed to scale up to thousands of machines
Does not rely on hardware to provide high availability
Handles failures at the application layer instead
First release in 2006
Rapid adoption, promoted to top-level Apache project in 2008
Inspired by the Google File System (2003) paper
Spawned a diverse ecosystem of products

Hadoop Ecosystem

Access: Hive
Processing: MapReduce, Tez, Spark
Resource Management: YARN
Storage: HDFS

HDFS

Hadoop distributed file system
Stores large files across multiple machines
A (large!) file is split into blocks (default: 64MB)
The blocks are distributed across DataNodes

HDFS – NameNode and DataNode

NameNode:
Master service that manages the file system namespace
Maintains metadata for all files and directories in the filesystem tree
Knows on which datanodes the blocks of a given file are located
The whole system depends on the availability of the NameNode

DataNodes:
Workers; store and retrieve blocks per request of client or namenode
Periodically report to the namenode that they are running and which blocks they are storing

HDFS – Data Replication and File Size

Data Replication:
All blocks of a file are stored as a sequence of blocks
Blocks of a file are replicated for fault tolerance (usually 3 replicas)
Aims: improve data reliability, availability, and network bandwidth utilization

(Figure: the NameNode places replicas of the blocks B1–B3 of a file across datanodes in three racks.)

HDFS – Access and File Size

Several ways to access HDFS data:
HDFS: direct transmission of data from nodes to client; needs access to all nodes in the cluster
WebHDFS: direct transmission of data from nodes to client via HTTP; needs access to all nodes in the cluster
HttpFS: all data is transmitted to the client via one single gateway node -> HttpFS service

File size:
Hadoop is designed to handle fewer large files instead of lots of small files
Small file: a file significantly smaller than the Hadoop block size
Problems: namenode memory, MapReduce performance

HDFS – Access

(Figure: a client, e.g. KNIME, talks to the Hadoop cluster and exchanges data directly with the HDFS DataNodes.)

YARN

Cluster resource management system
Two elements:
Resource Manager (one per cluster):
Knows where worker nodes are located and how many resources they have
Scheduler: decides how to allocate resources to applications
Node Manager (many per cluster):
Launches application containers
Monitors resource usage and reports to the Resource Manager

MapReduce

Input -> Splitting -> Mapping -> Shuffling -> Reducing -> Result

Map applies a function to each element
For each word, emit: (word, 1)
Reduce aggregates a list of values to one result
For all equal words, sum up the counts

(Worked example: the input lines "blue red orange", "yellow blue yellow", "blue" are split, mapped to (word, 1) pairs, shuffled by key, and reduced to the counts blue: 3, yellow: 2, orange: 1, red: 1. See the sketch below.)
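
The same word count in plain Python, mirroring the map, shuffle, and reduce phases (a compact sketch of the programming model, not a distributed implementation):

from collections import defaultdict

# The slide's word count in the MapReduce style: map emits (word, 1)
# pairs, shuffling groups them by key, reduce sums the counts per word.
lines = ["blue red orange", "yellow blue yellow", "blue"]

mapped = [(word, 1) for line in lines for word in line.split()]  # map

groups = defaultdict(list)                                       # shuffle
for word, count in mapped:
    groups[word].append(count)

result = {word: sum(counts) for word, counts in groups.items()}  # reduce
print(result)  # {'blue': 3, 'red': 1, 'orange': 1, 'yellow': 2}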

Hive

SQL-like database on top of files in HDFS
Provides data summarization, query, and analysis
Interprets a set of files as a database table (schema information to be provided)
Translates SQL queries to MapReduce, Tez, or Spark jobs
Supports various file formats:
Text/CSV
SequenceFile
Avro
ORC
Parquet

Hive

(Figure: a SQL query "select * from table" is translated into MapReduce / Tez / Spark jobs (MAP(...), REDUCE(...)) that read the table's files, e.g. table_1.csv, table_2.csv, table_3.csv, from the DataNodes and return a result table.)

Spark

Cluster computing framework for large-scale data processing
Keeps large working datasets in memory between jobs
No need to always load data from disk -> much (!) faster than MapReduce
Programmatic interface:
Scala, Java, Python, R
Functional programming paradigm: map, flatmap, filter, reduce, fold, …
Great for:
Iterative algorithms
Interactive analysis
Spark – Data Representation

DataFrame:
Table-like: a collection of rows, organized in columns with names and types (e.g. Name | Surname | Age: John | Doe | 35; Jane | Roe | 29)
Immutable: data manipulation = creating a new DataFrame from an existing one by applying a function to it
Lazily evaluated: functions are not executed until an action is triggered that requests to actually see the row data
Distributed: each row belongs to exactly one partition; each partition is held by a Spark Executor

Note: earlier versions of KNIME and Spark used RDDs (resilient distributed datasets). In KNIME, DataFrames are always used in Spark 2 and later.
Spark – Lazy Evaluation

Functions ("transformations") on DataFrames are not executed immediately
Spark keeps a record of the transformations for each DataFrame
The actual execution is only triggered once the data is needed (an action triggers evaluation)
Offers the possibility to optimize the transformation steps
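
For illustration, lazy evaluation in plain PySpark (a sketch assuming an existing SparkSession named `spark`; in KNIME, the context node provides the session under the hood):

# Sketch of lazy evaluation in plain PySpark; assumes an existing
# SparkSession named `spark`.
df = spark.createDataFrame(
    [("John", "Doe", 35), ("Jane", "Roe", 29)],
    ["name", "surname", "age"],
)

# Transformations: only recorded in the execution plan, nothing runs yet.
adults = df.filter(df.age > 30).select("name", "age")

# Action: triggers the actual, possibly optimized, execution.
adults.show()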

Spark Context

Main entry point for Spark functionality
Represents the connection to a Spark cluster
Allocates resources on the cluster

(Figure: the Spark Driver holds the Spark Context and distributes tasks to Spark Executors.)
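
In KNIME the context is created by a node (e.g. Create Local Big Data Environment); a rough plain-PySpark equivalent of what such a node does, as a hedged sketch:

from pyspark.sql import SparkSession

# Rough equivalent of a context-creating node: connect to a cluster
# (here: local mode) and allocate resources. Names and settings are
# illustrative, not what KNIME actually configures.
spark = (
    SparkSession.builder
    .master("local[*]")                     # local Spark, no cluster needed
    .appName("knime-style-context")
    .config("spark.executor.memory", "2g")  # example resource setting
    .getOrCreate()
)

sc = spark.sparkContext        # the underlying SparkContext
print(sc.defaultParallelism)   # resources visible to this context
spark.stop()                   # destroying the context frees its DataFrames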

Big Data Architecture with KNIME

(Figure: workflows are built graphically in KNIME Analytics Platform with Big Data Extensions and uploaded via HTTP(S) to KNIME Server with Big Data Extensions for scheduled execution and RESTful workflow submission. Both submit Impala queries via JDBC to Impala, submit Hive queries via JDBC to HiveServer2, and interact with Spark via Livy over HTTP(S) on the Hadoop cluster.)

In-Database Processing on Hadoop
KNIME Big Data Connectors

Package required drivers/libraries for HDFS, Hive, Impala access
Preconfigured database connectors:
Hive
Impala

Hive Connector

Creates a JDBC connection to Hive
On unsecured clusters, no password required

Preferences

Create Local Big Data Environment Node

Creates a fully functional big data environment on your local machine with:
Apache Hive
HDFS
Apache Spark
Try out Big Data nodes without a Hadoop cluster
Build and test workflows locally on sample data

Section Exercise – 01_Hive_Connect

Execute the workflow 00_Setup_Hive_Table to create a local big data environment with the data used in this training.

In the workflow implemented in the previous section to predict missing COW values, move execution from the database to Hive. That means: change this workflow to run on the ss13pme table on the Hive database in your local big data environment.

Write/Load Data into Hadoop
Loading Data into Hive/Impala

The connectors come from the KNIME Big Data Connectors extension
Use DB Table Creator and DB Loader from the regular DB framework
Other possible nodes

DB Loader

Hive Partitioning

(Figure: a Sales parent table is range-partitioned by date into monthly partitions, e.g. Jan2017 with date ≥ 01-01-2017 and date ≤ 01-31-2017, then Feb2017 and Mar2017; each monthly partition is list-subpartitioned by region into Europe, Asia, and USA.)
Partitioning

About partition columns:
Optional (!) performance optimization
Use columns that are often used in WHERE clauses
Use only categorical columns with a suitable value range, i.e. not too few distinct values (e.g. 2) and not too many distinct values (e.g. 10 million)
Partition columns should not contain missing values
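
As a hedged sketch, a partitioned Hive table could be declared like this (hypothetical table and column names; in KNIME, the DB Table Creator node generates such DDL for you):

# Hedged sketch: Hive DDL for a partitioned table (hypothetical names).
ddl = """
CREATE TABLE sales (
    id     INT,
    amount DOUBLE
)
PARTITIONED BY (sale_date STRING, region STRING)
STORED AS ORC
"""
# e.g. spark.sql(ddl) on a SparkSession with Hive support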

Section Exercise – 02_Hive_WritingToDB

Start from the workflow that implements the missing value strategy and write the results back into Hive. That is:

Write the results into a new table named "newTable" in Hive, using the HDFS Connection of the Local Big Data Environment along with both the DB Table Creator and DB Loader nodes.

HDFS File Handling
File Handling Nodes

Connection/Administration:
HDFS Connection
HDFS File Permission
Utilize the existing remote file handling nodes:
Transfer Files
Create Folders
Delete Files/Folders
Compress and Decompress
Full documentation available

Ready for Spark?

KNIME Extension for Apache Spark
Spark: Machine Learning on Hadoop

Runs on Hadoop
Supported Spark versions: 1.2, 1.3, 1.5, 1.6, 2.x
One KNIME extension for all Spark versions
Scalable machine learning library (Spark MLlib and spark.ml)
Algorithms for:
Classification (decision tree, naïve Bayes, logistic regression, …)
Regression (linear regression, …)
Clustering (k-means)
Collaborative filtering (ALS)
Dimensionality reduction (SVD, PCA)
Item sets / association rules

Spark Integration in KNIME

Spark Contexts: Creating

Three nodes to create a Spark context:

Create Local Big Data Environment
Runs Spark locally on your machine (no cluster required)
Good for workflow prototyping

Create Spark Context (Livy)
Requires a cluster that provides the Livy service
Good for production use

Create Databricks Environment
Runs Spark on a remote Databricks cluster
Good for large-scale production use
Spark Contexts: Using, Destroying

The Spark Context port is required by all Spark nodes
Destroying a Spark Context destroys all Spark DataFrames within the context

Create Spark Context (Livy)

Allows using Spark nodes on clusters with Apache Livy
Out-of-the-box compatibility with:
Hortonworks (v2.6.3 and higher)
Amazon EMR (v5.9.0 and higher)
Azure HDInsight (v3.6 and higher)
Also supported:

Import Data from KNIME or Hadoop
Import Data to Spark

From a KNIME data table (inputs: KNIME data table, Spark context)
From a CSV file in HDFS (inputs: remote file connection, Spark context)
From Hive (inputs: Hive query, Spark context)
From a database (inputs: database query, Spark context)
From other sources
Spark DataFrame Ports

The Spark DataFrame port points to a DataFrame in the Spark cluster
Data stays within Spark
The output port provides a data preview and column information

Reminder: lazy evaluation
A green node status does not always mean that computation has been performed!
Section Exercise – 01_Spark_Connect

To do:
Connect to Spark via the Create Local Big Data Environment node
Import the ss13pme table from Hive into Spark

Virtual Data Warehouse

Pre-Processing with Spark
Spark Column Filter Node

Spark Row Filter Node

Spark Joiner Node

Spark Missing Value Node

Spark GroupBy and Spark Pivot Nodes

Spark Sorter Node

Spark SQL Query Node
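
As an illustration of what this node does in plain PySpark (a hedged sketch; `spark`, `df`, and the view name are assumptions, not the node's actual internals):

# Hedged sketch: run a SQL query against a Spark DataFrame, roughly
# what the Spark SQL Query node does with its input DataFrame.
df.createOrReplaceTempView("input_table")
result = spark.sql(
    "SELECT SEX, AVG(AGEP) AS avg_age FROM input_table GROUP BY SEX"
)
result.show()  # the action that triggers execution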

Section Exercise – 02_Spark_Preprocessing

This exercise will demonstrate some data manipulation operations in Spark. It initially imports the ss13pme and ss13hme tables from Hive.

To do:
Column Filter to remove PWGTP* and PUMA* columns
Join with ss13hme on SERIALNO
Find the rows with the top ten AGEP in the dataset and import them into KNIME
Calculate the average AGEP per SEX group
Split the dataset into two:
One where COW is null
One where COW is not null

Mix & Match

Thanks to the transferring nodes (Hive to Spark and Spark to Hive, Table to Spark and Spark to Table) you can mix and match in-database processing operations

Modularize and Execute Your Own Spark Code: Java Snippets

Modularize and Execute Your Own Spark Code: PySpark Script

Machine Learning with Spark
MLlib Integration: Familiar Usage Model

Usage model and dialogs like existing nodes
No coding required
Various algorithms for classification, regression, and clustering supported

MLlib Integration: Spark MLlib Model Port

MLlib model ports for model transfer
Model ports provide more information about the model itself

MLlib Integration: Categorical Features

MLlib learner nodes only support numeric features and labels
String columns (with categorical values) need to be mapped to numeric first

MLlib Integration: Categorical Values for Decision Tree Algorithms

MLlib tree algorithms have an optional PMML input port
If connected, it hints to the Decision Tree algorithm which numeric columns are categorical in nature
Improves performance in some cases

Alternative nodes
Spark Predictor Node

Spark Predictor (MLlib) assigns labels based on an MLlib model
Supports all supervised classification & regression MLlib models
spark.ml models have a separate learner/predictor

Alternative nodes

Section Exercise – 03_Spark_Modelling

On the ss13pme table, the current workflow separates the rows where COW is null from those where COW is not null, and then modifies COW to be zero-based.

To do:
Where COW is not null:
Fix missing values in the feature columns
Train a decision tree on COW
Where COW is null:
Remove the COW column
Apply the decision tree model to predict COW
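
For orientation, the equivalent steps in plain pyspark.ml might look roughly like this (a sketch with hypothetical feature columns; the KNIME Spark learner/predictor nodes wrap this API):

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler

# Sketch: assumes `df` is the ss13pme Spark DataFrame with a numeric,
# zero-based COW label; the feature subset is a hypothetical choice.
train_df = df.filter(df.COW.isNotNull())
apply_df = df.filter(df.COW.isNull()).drop("COW")

assembler = VectorAssembler(
    inputCols=["AGEP", "SEX"],   # hypothetical feature subset
    outputCol="features",
    handleInvalid="keep",        # crude stand-in for missing value handling
)
tree = DecisionTreeClassifier(labelCol="COW", featuresCol="features")

model = Pipeline(stages=[assembler, tree]).fit(train_df)
predicted = model.transform(apply_df)  # adds a "prediction" column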

Mass Learning in Spark – Conversion to PMML

Mass learning on Hadoop
Convert supported MLlib models to PMML

Sophisticated Model Learning in KNIME – Mass Prediction in Spark

Supports KNIME models and pre-processing steps
Sophisticated model learning in KNIME
Mass prediction in Spark using the Spark PMML Model Predictor

Closing the Loop

(Figure: a closed loop: learn a model at scale in Spark (MLlib model) and apply it on demand in KNIME; or do sophisticated model learning in KNIME (PMML model) and apply the model at scale in Spark.)

Mix and Match

KNIME <-> Hive <-> Spark

Export Data back into KNIME/Hadoop
Export Data from Spark

To KNIME (Spark DataFrame -> KNIME data table)
To a CSV file in HDFS (Spark DataFrame + remote file connection)
To Hive (Spark DataFrame + Hive connection -> Hive table)
To other storages (Spark DataFrame)
To a database (Spark DataFrame + database connection -> database table)

Cloud & Big Data Connectivity: Databricks

Create Databricks Environment: connect to your Databricks cluster
Azure or AWS
Databricks Delta, Databricks File System, or Apache Spark

Cloud & Big Data Connectivity: Google

Connectivity to:
Google Cloud Storage
Google BigQuery (via DB nodes)
Google Cloud Dataproc

Section Exercise – 04_Spark_WritingToDB

This workflow provides a Spark predictor to predict COW values for the ss13pme data set. The model is applied to predict COW values where they are missing.

Now export the new data set without missing values to:
A KNIME table
A Parquet file in HDFS
A Hive table

Examples
Analyzing the Irish Meter Dataset Using Spark SQL

Analyzing the Irish Meter Dataset Using Spark SQL

Columnar File Formats

Available in KNIME Analytics Platform: ORC and Parquet

Benefits:
Efficient compression: stored as columns and compressed, which leads to smaller disk reads
Fast reads: data is split into multiple files. Files include a built-in index, min/max values, and other aggregates. In addition, predicate pushdown pushes filters into reads so that minimal rows are read.
Proven in large-scale deployments: Facebook uses the ORC file format for a 300+ PB deployment

Improves performance when Hive is reading, writing, and processing data in HDFS
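
A short PySpark sketch of writing and reading a columnar file, with a filter that columnar formats can push down into the scan (assumes an existing SparkSession `spark` and a DataFrame `df` with the ID/Gender/Age columns from the following example; the path is hypothetical):

# Hedged sketch: write a DataFrame as Parquet, then read it back with a
# filter that the columnar reader can push down into the scan.
df.write.mode("overwrite").parquet("/tmp/people.parquet")

people = spark.read.parquet("/tmp/people.parquet")
ids = people.filter((people.Age > 30) & (people.Gender == "female")).select("ID")
ids.show()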

Example: Columnar File Formats

(Figure: a table with columns ID, Gender, Age is split into three files of 3333–3334 rows each; every file carries per-column metainformation, e.g. file 1: ID min = 1, max = 3333; Gender: female, male; Age min = 20, max = 45.)

Example: Columnar File Formats
(Figure: for the query "Select ID from table where Age > 30 and Gender = female", the per-file metainformation lets the reader skip files whose min/max Age or Gender values cannot match, so only a subset of the files is actually scanned.)
Example: Write an ORC File

H2O Integration

KNIME integrates the H2O machine learning library
H2O: open source, focus on scalability and performance
Supports many different models:
Generalized Linear Model
Gradient Boosting Machine
Random Forest
k-Means, PCA, Naive Bayes, etc. and more to come!
Includes support for MOJO model objects for deployment
Sparkling Water = H2O on Spark

The H2O Sparkling Water Integration

Conclusions
SQLite

Hadoop Hive

Spark

Stay Up-To-Date and Contribute

Follow the KNIME Community Journal on Medium: Low Code for Advanced Data Science
Daily content on data stories, data science theory, getting started with KNIME, and more; for the community, by the community

Would you like to share your data story with the KNIME community?
Contributions are always welcome!

The End
Twitter: @KNIME
