Loffler, V.: Automatic Data Processing Using the KNIME Analytics Platform, 2021
Analytic path
Knime workflows permit the creation and saving of complete analytic
procedures, which turn the original raw data into organized data processed
exactly according to our needs. This can be very convenient if we execute an
analysis only rarely or if the input data always differ only slightly. In such
cases, the visualized analytic path helps us to get oriented quickly and to
complete our analysis efficiently, even though such an analytic workflow may
not be suitable for automation.
What is included in the publication
The publication primarily explores the automation of data processing using
semi-automatic and autonomous data workflows. You will gradually become
acquainted with the basic building blocks of automation: variables, loops
and branches. We will also show you how to optimize your data workflows,
protect them against errors and launch them automatically.
The individual chapters address theory as well as practice. We have
endeavored to apply a balanced approach. That is why each explored topic is
complemented by an example, which can be practically tested.
Theory
Topic explanation
Usage examples with a detailed analysis of individual operations
Notes and observations from practice
References to other examples and information sources
Practice
We believe that interactive teaching benefits students the most. That is why
we have prepared a shared folder, from which you can download a package
that includes all the explored workflows (20 fully functional Knime
workflows), including the data used.
Note:
None of the provided examples require the paid version of Knime (i.e., they
do not require the Knime Server).
Used symbols (apply to the printed and PDF versions; the ePub and Mobi
formats contain only text)
We assume that you are already familiar with the basics of working with the
Knime Analytics Platform. If you are not, we recommend getting familiar
with the basics of working in the Knime environment first.
The following link leads to the up-to-date introduction to working with Knime:
https://www.knime.com/getting-started-guide
You can also find links to other study materials in the last chapter of the
book.
If you do not have Knime installed on your computer yet, you can use one
of the automatic installers (for MS Windows, Linux and macOS) that can be
found on the Knime website.
Link: https://www.knime.com/downloads/download-knime
Training files
The folder with all the explored workflows and the corresponding data files
can be downloaded here (password for the download is Knime2020):
Knime workflow
Download archive – Knime_Automation_book
Data files
Download archive – Knime_Advanced_Data
Note: a special Archive folder is always located in the output folder and its
subfolders. It contains the prepared result files (so that you can compare
your outputs with ours).
Import Knime workflow
Save the Knime_Automation_book archive with the book workflows to your
Knime workspace using the “Import KNIME Workflow…” option. Upon
completion of the import, the workflows should be available in the tree of
local workflows.
Import KNIME Workflow
Automation
What do we mean when we say automation?
We define the automation level in the Knime context in the following
manner:
Automation examples
Knime can be used for automating almost any data task. Just to give you an
idea, we present here some typical examples. Particular solutions of the
selected tasks are then described in detail as part of our explanations.
3. KNIME – variables – introduction
In the Knime environment, we can work with variables, known as Flow
Variables. Variables allow us to execute more complex operations within our
workflows. We can continuously save useful values in variables and then use
them in the workflow nodes whenever needed.
A variable is defined by its name, data type and value. For example, the city
variable can be of the String data type (i.e., a text string) and its value can be
Tokyo. Or, the age variable can be of the Integer data type and its value can
be 62.
Table of the basic variable types:
Types of variables
We use two types of variables:
You can define global variables using the “Workflow Variables…” option
(available by right-clicking the name of the individual workflow).
You can add a variable using the Add button. We define the variable name,
type and initial value.
The result looks like this.
Node configurations almost always include the Flow Variables tab. The tree
of variables shows the fields that are accessible within a particular node.
Existing variables, which then influence the behavior of the given node, can
be selected in the grey fields (input variables). Variable names (new or
existing) can be entered in the white fields; these will then contain values
from the output of the given node (output variables, which in turn serve as
input variables for subsequent nodes).
Grey and white fields for entering variables
Input node variables
Workflow: 001_Variables - global 1
In order to explain the “grey fields” for the Flow Variables, we create a
global workflow variable called v_article, into which we save the string
Article (the name of the table column whose values we want to convert
using the Number To String node).
In this simple workflow, we read a file that contains sales data with material
numbers, which Knime loaded as the Integer type. However, we want to
convert the material numbers into the String type.
In the Number To String node, we select the variable v_article for the
“included_names” parameter.
We do not manually select anything on the Options tab (the Include area will
be empty), since the selection control was taken over by the v_article
variable.
Conversion result.
We can see the current variable values on the Flow Variables tab.
The principle of such nodes is very similar: variables are created based on
the values of a given table or as the result of a certain rule.
The possibility to create variables in this manner is fundamental for
automatic workflows, where we need to control the workflow behavior using
parameters that can change later.
Example 1 - variables formed based on a customization table
We create our customization table with the workflow parameters using the
Table Creator node. This is a simplification; in real life, you would use a
CSV definition file, an MS Excel table, a table from an SQL database, etc.
We created the Path, City, Chunk and Filter columns, and we entered the
corresponding values in the row.
It is certainly good practice to describe the significance of the individual
variables in the technical documentation (we recommend doing so, and
keeping such descriptions, for more extensive workflows).
We can then use the variables arbitrarily, for example, when uploading a file,
in the nodes for filtering and converting individual values, etc.
Workflow example
Workflow: 002_Variables - variable nodes 1
In the CSV Reader node, we first activated the variable ports using the
“Show Flow Variable Ports” option.
Next, we selected suitable variables.
The Path variable for dynamic file reading.
The Chunk variable for limiting the number of rows read.
We selected the City variable in the Row Filter node in a similar manner.
In the Number To String node, we used the Filter variable.
File reading – with the read row IDs and read column headers parameters.
The workflow can then look, for example, like this.
Workflow: 003_Variables - variable nodes 2
The Table Column to Variable node creates variables according to the
selected columns of our customization table; the values are taken from the
Value column.
The String Manipulation (Variable) node allows for manipulations with
strings, including data type conversions. Our conversion of the Chunk
variable goes from the String type to the Integer type: we selected the
toInt(x) function for the Chunk variable and checked the Replace Variable
option.
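A sketch of what the conversion expression can look like (the $${S...}$$
syntax references a String flow variable; Chunk is the variable created from
our customization table):

toInt($${SChunk}$$)

With the Replace Variable option checked, the original String variable
Chunk is replaced by its Integer equivalent.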
KNIME – creating variables using the “input”
nodes
Another method for creating variables for your Knime workflows is the use
of the Input nodes.
Input nodes allow for the creation of variables using a user dialogue.
Example 1 - file selection for the CSV Reader
Node configuration
Wrapping in a Component is necessary in order for the Input node to start
acting as an interactive input field with a user dialogue.
4. We complete the final configuration steps of the finished
Component node
5. We name the Flow variable
We then use the Component wherever it makes sense to use our Flow
variable – in our case, again in the Row Filter node.
The workflow that uses the Input nodes can then look, for example, like this.
Workflow: 004_Active_elements - Input 1
File: C:\KNIME_Advanced\Input\Sales_full.csv
Note – using variables of the Input nodes
If we want to use an Input node variable, we need to connect the given
variable ports in the output node of the component and include the selected
variable in the Include part of the filter. It is then not necessary to enter the
Flow variable when our Component nodes are called for the first time.
Combining Input fields within a single component
Several Input nodes can be combined within a single component. A single
use of the given component then allows for entering several variables
simultaneously.
The node prepared in this manner can look, for example, like this.
The use of the Workflow can then look, for example, like this:
Workflow: 005_Active_elements - Input 2
File: C:\KNIME_Advanced\Input\Sales_full.csv
We wrap the node in a component again and select an output variable.
However, the node returns only one row, which consists of all the entered
values.
Tables prepared in this manner can already be used, for example, for
filtering. Filtering by the table values can be executed using the Reference
Row Filter node.
Similar to the Input nodes, Widgets can be wrapped in a Component, upon
which they can be called within a single common interactive view.
To demonstrate how the Widgets work, we will modify the
005_Active_elements - Input 2 workflow so that it includes Widgets instead
of the Input nodes.
Workflow: 007_Active_elements - Widgets 1
We launch the Component using the “Interactive View: Widget multi view”.
We can enter values of the variables in the open interactive view.
Detail of the “Widget multi view” Component.
The links given below include more detailed information about the Widgets
and their use.
All information about the Widgets, including workflow examples, can be
found on NodePit:
https://nodepit.com/category/flowabstraction/widgets
Using Widgets in the environment of the Web Portal and Knime Server:
https://www.knime.com/knime-software/knime-webportal
Formatting Widgets using CSS:
https://docs.knime.com/2019-06/analytics_platform_css_guide/index.html
4. KNIME – executing workflows in a
loop – cycles
When we need a part of our workflow to be executed in a loop (cycle), we
use nodes of the Loop type.
A loop must have a starting node (Loop Start) and an ending node (Loop
End). Any sequence of nodes (the loop body, in our case Action) is then
repeated within the loop until the loop ends.
We will become familiar with the most commonly used loop types in detail.
KNIME – loop over the table rows
We start the overview of the loops with a very practical loop type - Table
Row To Variable.
The Table Row To Variable Loop Start node gradually passes through the
rows of a table and converts the content of each row into variables. We can
then work with these variables inside the loop body, which can be any
workflow. We close the loop using the Loop End node, which starts the
subsequent loop iteration and, at the same time, gathers the results of the
individual iterations into a table.
The loop can be started as a whole (the individual iterations are then
executed together) or step-by-step (you can then execute and monitor the
individual loop iterations).
Options of the Loop End node
Example – combining sheets of an MS Excel workbook
Situation: We have an MS Excel file. It contains data on poor-quality costs
for cost accounting (reworking, warranty claims and scrapping) for the
individual calendar weeks. The data are arranged in a single file, but they
are divided into individual sheets according to the calendar weeks (CW01,
CW02…). A new sheet with the data for the previous calendar week is added
to the file each week. We need to create a universal workflow that will
combine the data from all the sheets into a single table.
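A possible shape of the whole loop (a sketch; it assumes the sheet names are
obtained with the Read Excel Sheet Names (XLS) node, which returns a
table with one row per sheet):

Read Excel Sheet Names (XLS) -> Table Row To Variable Loop Start -> Excel Reader (XLS) -> Number To String -> Constant Value Column -> Loop End -> Excel Writer (XLS)

The Table Row To Variable Loop Start node converts each sheet name into
the Sheet flow variable used in the following steps.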
We then use the Sheet variable when configuring the Excel Reader (XLS)
node.
Note: notice the connection of the Excel Reader (XLS) node via the variable
port.
In the loop body, we convert the values in the Document ID and Cost Center
columns into the String type and add the CW column, which will contain the
name of the current sheet (the sheet name represents the calendar week here;
thus, we do not lose the information about the calendar week from which the
data came).
Configuration of the Number To String node
We can save the final (combined) table into a new MS Excel file.
Other practical examples of the use of the Table Row To Variable Loop can
be found under the following link:
https://nodepit.com/node/org.knime.base.node.flowvariable.variableloophead.LoopStartVa
KNIME – loop according to the value groups
The Group Loop Start node gradually passes through the groups of rows of
the input table, defined by the values of the selected columns. Arbitrary
workflow steps can then be executed in the loop body with the groups
created in this manner. Again, we close the loop using the Loop End node,
which starts the subsequent loop iteration and, at the same time, gathers the
results of the individual iterations into a table.
Similar to the first example, the Group Loop can be started as a whole (the
individual iterations are then executed together) or step-by-step (we can then
execute and monitor the individual loop iterations).
Options of the Loop End node
In the Group Loop Start node, we set the column whose values will define
the individual groups. In our case, the groups will be based on the values of
the Cost Center column.
Node result (in our case the result of the last iteration)
Flow variables after the node execution
The String Manipulation (Variable) nodes in the loop body prepare a
suitable file name and path for us, which we then use for saving the final
files.
First of all, we replace the CW10_All string in the Path variable with the
value of the Cost Center variable – in our case 100199.
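A sketch of the expression in the String Manipulation (Variable) node,
assuming both variables are of the String type (an Integer Cost Center
variable would first have to be converted with the string() function):

replace($${SPath}$$, "CW10_All", $${SCost Center}$$)

The result can then replace the original Path variable, which subsequently
controls the Excel Writer (XLS) node.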
Node result
We will save the final files using the Excel Writer (XLS) node; the path and
the file name will be dynamically controlled by the prepared Path variable.
Additional examples of the practical use of the Group Loop can be found
here:
https://nodepit.com/node/org.knime.base.node.meta.looper.group.GroupLoopStartNodeFa
KNIME – loop over an interval
The Interval Loop Start node starts a loop that gradually increases the value
of a variable within the interval we define. In the node, we determine the
start and end values of the interval, as well as the step, which represents the
value by which the variable increases in each iteration.
If we set an interval from 0 to 10 with step 2 for variable X, the variable will
gradually acquire the values 0, 2, 4, 6, 8 and 10.
In the Interval Loop Start node, we set the lower limit, the upper limit and
the interval step. We set the variable prefix to loop_X. We subsequently use
the created variables in other loop nodes (X will thus be represented by the
loop_Xvalue variable).
Flow variable after the second iteration.
In the next section of our workflow, we create the necessary data. This
section is repeated until we reach the upper limit of the loop (i.e., until the
value of the flow variable loop_Xvalue equals the value of the flow variable
loop_Xto = 6.28).
We use the Constant Value Column node to create a new table row. The new
row will contain a value in column X that corresponds to the value of the
loop_Xvalue variable. During the loop, the value in this row is overwritten in
each iteration, and the final table (in the Loop End node) will contain all the
rows generated by the individual loop iterations.
This is followed by two Math Formula nodes, which calculate the values of
the sin(x) and cos(x) functions and record them in the table.
The value of the sin(loop_Xvalue) function will be recorded in the sin(x)
column.
Similarly, the value of the cos(loop_Xvalue) function will be recorded in the
cos(x) column.
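The two expressions can then look like this (a sketch; the $${D...}$$ syntax
references a Double flow variable, and each result is appended as a new
column):

sin($${Dloop_Xvalue}$$) (appended as the sin(x) column)
cos($${Dloop_Xvalue}$$) (appended as the cos(x) column)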
Final table of the second Math Formula (cos(x)) node after the seventh
iteration.
We configure the Loop End node, for example, in the following manner.
Once the entire loop is executed, the node result will contain 629 rows (the
interval from 0 to 6.28 with a step of 0.01 yields 6.28/0.01 + 1 = 629 values) –
the values of variable X and the corresponding values of the functions sin(x)
and cos(x).
We can then use such a table for drawing the given function graphs, which
we execute using the Line Plot node.
Once the node is executed, we can view the final graphs – Interactive View
option: Line Plot.
Practical use of loops of the Interval Loop type
Apart from “academic” examples, the Interval Loop offers several practical
usage options: for example, simulating the impact of input material prices on
product profitability (the impact of a price change by 1 EUR, 2 EUR…) or,
similarly, iteratively modelling the impact of changes in tax rates, various
margins, or the number of workers. In advanced scenarios, we can use it, for
example, to search for the optimal parameters of machine learning models
(we change the model parameters in the loop and save the resulting model
accuracy in a table, from which we subsequently select the best model
parameters for productive deployment).
You can find several practical examples of the use of the Interval Loop here:
https://nodepit.com/node/org.knime.base.node.meta.looper.LoopStartIntervalNodeFactory
KNIME – loop over the table columns
The Column List Loop Start node starts a loop that goes through the selected
table columns and executes the defined workflow nodes over these columns.
The Loop End (Column Append) node ensures the gradual passage through
the individual columns and also gathers the processed columns into the final
table.
Workflow: 011_Loops – 4
File: C:\KNIME_Advanced\Input\Shipping\Shipping_report.xlsx
Using the Excel Reader (XLS) node, we first read the given file, upon which
we create two column groups in the Column Splitter node:
Observe the flow variables formed within the loop. We use the
currentColumnName variable for the dynamic renaming of the columns (we
need to rename the columns to, for example, tmp in order to allow for
universal data modifications of all columns within the loop – see below).
First of all, we convert the column data type in the loop body (the Number
To String node; the column name is controlled by the currentColumnName
variable), after which we change the texts in the given column using the
String Manipulation node.
In the other nodes in the loop body, we want to add the "N_" string in front of
each code (gradually in all columns), thus making sure that we can
immediately see that they are categories and not numbers.
Since we will change the data in the individual columns (and there will be
several of them), we need to use a little trick: we change the column name
before the String Manipulation node, for example, to tmp. Once the string
manipulation is completed, we restore the original column name. (Note:
should we only want to change the data type in the column from a number
to a string, without changing the data, the Number To String node alone
would suffice.)
The String Manipulation node and the join() function need to know the
name of the column in which the given data manipulation will be executed.
The picture below shows an example with the Plant column. If we did not
gradually rename all the columns to, for example, tmp, we would need
multiple nodes – one for each column.
When we rename each column within the loop to tmp, the String
Manipulation node will look like this.
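The expression itself can then be as simple as this sketch (the $tmp$ syntax
references the current table column, renamed to tmp; the result replaces the
column content):

join("N_", $tmp$)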
We have thus ensured that all the columns of the table will be modified.
We execute the first renaming of the columns using the Column Rename
(Regex) node, where the searchString parameter is controlled by the
currentColumnName variable and the Replacement: parameter is
permanently set to tmp.
We use the same node type for the second renaming, we just reverse the
settings: the searchString parameter is set to tmp and the replacement is
controlled by the currentColumnName variable (the implicit text prefix_$1
plays no role here).
We close the loop using the Loop End (Column Append) node.
The result after the entire loop is executed looks like this.
All the columns are of the S = String type and the data in the columns have
been modified as we needed.
In the last step of our workflow, we join the two column groups back
together using the Joiner node. The joining column will be the technical
Row ID column.
Result of the join.
Alternative workflow
The example given above explains a loop over table columns. However,
Knime is continuously being developed, and version 4.2 brought a new node
type – String Manipulation (Multi Column). This node makes mass
manipulation of columns a very easy task. A workflow created in Knime
version 4.2 can look, for example, like this.
Workflow: 011_Loops – 4
Several examples of the use of the Counting Loop can be found here:
https://nodepit.com/node/org.knime.base.node.meta.looper.LoopStartCountNodeFactory
Several examples of workflows with the use of a loop with a condition at the
end can be studied here:
https://nodepit.com/node/org.knime.base.node.meta.looper.condition.LoopStartGenericNo
Recursive Loop
The Recursive Loop Start node initiates a special loop type, which is
exceptional in that it allows the result of any workflow executed within the
loop body to be returned to the start node of the loop using the terminal
Recursive Loop End node. The first iteration of the loop body processes the
table at the input port of the start node. The second and subsequent
iterations then process the final table sent back by the Recursive Loop End
node.
The Recursive Loop End node has two input ports. The first port is used for
collecting data from individual iterations (identically to other loop types).
The second port is designated for the data that will be recursively sent back to
the Recursive Loop Start node.
Configuration of the Recursive Loop Start node.
Some practical examples of the use of recursive loops can be found here:
https://nodepit.com/node/org.knime.base.node.meta.looper.recursive.RecursiveLoopStartN
5. KNIME – conditional workflow
branching
Should our workflow reach a situation where we need to execute an
alternative set of steps, we can conveniently use the nodes for conditional
branching.
Knime uses two types of branching:
1. Branching of the IF type – we use it for “either/or”, “yes/no”,
etc. cases.
2. Branching of the CASE type – we use it for “black, white or
green”, “0, 1 or 2”, etc. cases.
The IF Switch node includes the PortChoice parameter, which can acquire
the following values: both, bottom or top. This value controls whether the
workflow leaves the IF Switch node via port 1, port 2, or via both ports. If
we, for example, select the top value, the bottom port of the node closes and
only the nodes linked to the upper port are executed.
To make the workflow automatic, the branching needs to be controlled by a
variable.
Workflow: 012_If_Switch - 1
File: C:\KNIME_Advanced\Input\KPI (folder with the files)
The key moment of the workflow is the Rule Engine Variable node. In this
node, we fill the variable that we have named switch with the value top,
provided the flow variable Location (which successively holds the individual
file names) contains the string “daily”. The switch variable is filled with the
value bottom, provided the flow variable Location contains the string
“weekly”.
Setting up a rule for the switch variable:
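A sketch of what the two rules can look like (assuming the file name is held
in the String flow variable Location, referenced with the $${S...}$$ syntax;
LIKE matches with * wildcards):

$${SLocation}$$ LIKE "*daily*" => "top"
$${SLocation}$$ LIKE "*weekly*" => "bottom"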
Flow variables after reading the given CSV file in the CSV Reader node:
We insert the switch variable in the IF Switch node.
Workflow: 013_CASE_Switch - 1
File: C:\KNIME_Advanced\Input\Shifts (folder with the files)
The key moment of the workflow is the Rule Engine Variable node, which
fills the Switch variable based on the information about the number of
columns in the table, obtained from the Extract Table Dimension node (the
Number Columns variable). This variable will control the ports in the CASE
Switch Data (Start) node, and thus also the action that will be executed with
the data.
Setting up the rules for the Switch variable according to the Number Columns
value:
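A sketch of such rules (the column counts shown here are purely illustrative
and depend on the actual structure of the shift files; Number Columns is an
Integer variable, hence the $${I...}$$ reference):

$${INumber Columns}$$ = 8 => 0
$${INumber Columns}$$ = 7 => 1
$${INumber Columns}$$ = 6 => 2

The resulting values 0, 1 and 2 correspond to the ports of the CASE Switch
Data (Start) node.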
Flow variables for the execution of the Extract Table Dimension node (one
of the loop steps):
In the CASE Switch Data (Start) node, we connect the PortIndex parameter
with the Switch variable.
Based on the value of the Switch variable, the corresponding actions are
executed in individual Metanodes (metanodes are used here for clarity
purposes).
We will look into the details of the executed actions. We open individual
metanodes.
The 7 days file metanode is for tables that contain the columns from Monday
to Sunday. Only “unpivoting” is executed here – the transformation of
columns into rows.
In the 6 days file metanode (for tables that are missing the Sunday column),
we add the missing column using the Rule Engine node (using the condition
Switch = 1, we fill the column with 0 values; to add the column, we use the
option Append Column: Sun).
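A sketch of the corresponding rule (Switch is an Integer flow variable, hence
the $${I...}$$ reference; with this branch active, every row receives the value
0 in the appended Sun column):

$${ISwitch}$$ = 1 => 0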
The other nodes are then identical, i.e., we execute “unpivoting” using the
Unpivoting node, and rearrange the order of the columns using the Column
Resorter node.
Result after rearranging the order of the columns.
The 5 days file metanode is for tables that are missing both the Saturday and
the Sunday columns.
Should the Saturday and Sunday columns be missing in a file, we add both
of them using two Rule Engine nodes (we fill the columns with 0 values
using the conditions Switch = 1 and Switch = 2; to add the columns, we use
the options Append Column: Sat and Append Column: Sun, respectively).
The other nodes of this workflow branch are identical to the previous two
cases.
Workflow completion
The workflow can be completed, for example, like this.
The Loop End node aggregates the data from the gradually loaded files,
while the Column Auto Type Cast node identifies the unknown data type of
the ColumnValues column. Using the Excel Writer (XLS) node, we save the
final combined file. If we also want to see the corresponding graph, we can,
for example, arrange the data using the GroupBy node and display them
using the Line Plot node.
Data type “?” in the ColumnValues column (which includes the required
number of workers)
“Tuning” workflows
When we start a workflow, we usually find parts that we could improve.
In our case, it seems convenient to move the Unpivoting node behind the
CASE branches, to rename the generic column names ColumnNames and
ColumnValues to clearer ones, such as Day and FTE Count, and to eliminate
the problem with the type identification of the ColumnValues column – and
thus also the need for the Column Auto Type Cast node.
The ColumnValues variable at the output of the Loop End node.
If we remove the quotation marks, the column will contain “I” (Integer)
values.
Should the ColumnValues column still be of the “?” type, the next step
would be tracing (launching the loop step-by-step) and searching for the
source of the error (a file, a node, etc.); alternatively, we would keep the
Column Auto Type Cast node for identifying the correct data type, as in our
case.
6. KNIME – workflow calling
automation
Knime allows for two types of automation:
assisted automation
autonomous automation.
When you need to call a workflow with a global variable, add the following
parameter:
-workflow.variable=<var>,<value>,<type> (e.g., -workflow.variable=package,123,int)
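For illustration, a complete batch-mode call of Knime from the command
line can then look like this (a sketch; the workspace path is just an example):

knime -consoleLog -nosplash -reset -application org.knime.product.KNIME_BATCH_APPLICATION -workflowDir="C:\KNIME_workspace\014_External_WF_Call - 1" -workflow.variable=package,123,int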
Workflow: 014_External_WF_Call – 1
File: C:\KNIME_Advanced\Input\KPI_filter (file folder)
Notice the Container Output (Table) node. This node ensures the transfer of
the table when the workflow is called by another workflow. The table will be
transferred using the Parameter Name setting (here it has the name output).
We will call the workflow shown above using the Call Workflow (Table
Based) node. The actual data transfer is governed by the parameter Fill
output table from: = output.
Workflow: 015_External_WF_Call - 2
Launching workflows using the Knime Server
The most effective tool for the automated launching of workflows is the
Knime Server. Apart from automating work with workflows, the Knime
Server can do several other things: it supports teamwork, manages access
rights to the given data, and publishes data in the form of well-arranged
dashboards. Apart from the above, it allows for executing automatic data
modifications and transformations (which is the main topic of this
publication). It also forms a platform for data science: data scientists can use
Knime for creating and tuning machine learning models, while the Knime
Server ensures the productive deployment of these models.
Knime Server is a paid service. Price information and more detailed
information about the content and functionalities of the Knime Server can be
found here: https://www.knime.com/knime-server.
7. KNIME – workflow automation
and clarity
Workflow automation is not only about variables, loops and branching.
Automatic workflows should also be well arranged, understandable, fast and
resistant to errors.
Workflow clarity
Always keep in mind that you need to create workflows that will run for a
long time without human intervention. When you come back to a certain
workflow after a while, you need to quickly grasp what the workflow does
and how it operates. You can significantly improve workflow clarity by using
the following elements:
Node descriptions (short and long)
Annotations
Metanodes
Components
Meta information
Node descriptions
Short descriptions can be entered directly in the “visible” part of the nodes.
Long descriptions allow for detailed documentation of the nodes. We can
enter long texts using the “Edit Node Description” option. You will then see
the text when you hold the mouse pointer over the given node for
approximately one second.
Hold the mouse pointer over the node for approximately 1 second and the
corresponding node description will be displayed.
Annotations
Annotations allow for the effective documentation of the workflows. Using
annotations, you can comment on and document workflow logic blocks. The
annotation editor is very simple and well-arranged.
Annotation editor.
When we open this metanode, we will see several nodes that form a logical
unit. Wrapping them into the metanode has significantly improved the clarity
of the workflow, without hiding any of its logic.
Workflow: 016_Metanode_example – 1
(adopted from KNIME Example Workflow “Finding Association Rules for
Market Basket Analysis”)
The fastest way to create a metanode is to mark consecutive workflow nodes
and, using the right mouse button, select the “Create Metanode” option.
Knime will create everything else (except the metanode description)
automatically.
A dialogue window opens (at the right part of the screen, at the location
where node descriptions usually are) with a structure for entering meta
information about our workflow.
Now we can start editing.
Well-maintained meta information has an added value, particularly because
you can quickly grasp a particular workflow by reading it. The well-arranged
searching of workflows according to the data and tags in the meta
information is not fully functional yet (as of version 4.2).
8. KNIME – workflow automation
and speed
Speed
When it comes to speed, Knime is a relatively well-tuned application.
While we usually do not have to assess performance for manually started,
one-time workflows, the “performance” issue can be important for automatic
workflows that operate with large volumes of data.
Let us explore three areas that can significantly influence the performance of
our workflows:
1. Configuration of the given environment
2. Using the “Cache” node
3. Using “Parallel execution”
Cache node
The output of the Cache node is the same table as the input table, but it is
newly created in memory, free of any links to previous transformations and
intermediate states. This can significantly speed up subsequent data
processing. For example, if we filter out the 5 columns that we are interested
in from a table with a total of 100 columns, the workflow does not delete the
remaining columns, it only hides them. When we use the Cache node, the
workflow continues only with the “clean” five columns, which can thus be
much faster.
One of the possible uses of the Cache node - after a transformation of the
table columns.
Workflow: 017_Performance_Speed – 1
File: C:\KNIME_Advanced\Input\Sales\Sales_full.csv
Be careful though, the Cache node has its own overhead. That is why you
should verify that its use is justified in each particular case. To do so, you
can use the helpful Timer Info node, which shows the time consumed by the
individual workflow nodes.
Output example of the Timer Info node.
Other information related to the Cache node and its use can be studied here:
https://nodepit.com/node/org.knime.base.node.util.cache.CacheNodeFactory
Parallel execution
The parallel execution of selected nodes can significantly influence the
processing speed of some workflows.
To obtain the nodes that provide this ability, an additional extension must be
installed using “Install KNIME Extensions…”. This particular extension is
called KNIME Virtual Nodes.
When the installation is complete, three new nodes are added to the Node
Repository.
Example of the use of the Parallel Chunk Start and Parallel Chunk End
nodes.
Workflow: 018_Performance_Speed – 2
Files:
C:\KNIME_Advanced\Input\Sales\Sales_full.csv
C:\KNIME_Advanced\Input\Sales\Sales_locations.xlsx
Once again, the use of parallel processing should be properly assessed and
tested. These nodes also have their own overhead, and only proper
simulation, testing and measurement will show us the real effect of this
method (it can be enormous, though it can also be negative if used
unsuitably).
Other information related to virtual nodes and their uses for parallel
workflow operations can be found here:
https://nodepit.com/node/org.knime.core.node.workflow.virtual.parallelchunkstart.Parallel
Knime resolves this situation in an elegant way using the Empty Table
Switch node.
This handy node works very simply. The workflow continues via port 1
when the input table contains data. However, when the input table is empty,
the workflow continues via port 2. This will, for example, allow us to load
the data in an alternative way (from a different source, from the same source
but with different parameters, etc.), or to prepare a corresponding error
record – an error log.
The strength of the Empty Table Switch node is demonstrated by this
workflow.
The workflow loads the files from a stock inventory count (the files should
be saved in a single folder) and combines them into a single file. If the folder
is empty, it produces an error log instead of the expected result.
Workflow: 019_Automation_log_file – 1
File: C:\KNIME_Advanced\Input\Inventory_upload (file folder)
Catching errors
Knime allows us to catch and correct error states using two nodes:
Try – we place this node before the operation (a node or a
sequence of nodes) in which an error that we need to handle
can occur
Catch Errors – we place this node after the handled node or
sequence of nodes
These nodes form a part of the Error Handling group of nodes. They exist in
multiple variants.
The Catch Errors node either continues with the unchanged (expected,
correct) output of the handled node (input port 1), or with the node output
prepared for error situations (input port 2). If an error occurs, it is recorded
in the given variables.
We will demonstrate how the “Try – Catch” process works using a simple
example. The workflow either loads data from the SAP system and saves the
result in a data file, or creates a log with the given error description. The
CSV Writer node creates either the Data_SAP_download file with the
required data (if the data connection and download are free of errors), or the
Error_log_SAP_download file with the corresponding information about the
given error state (if an error occurs in the Python Source node).
Workflow: 020_Automation_log_file – 2
Note: the example with the extraction from SAP was simply at hand. The
workflow would work the same way with other database connectors as well.
This is what the result of the Catch Errors node looks like when the caught
error reports are saved in the prepared “Failing…” variables.
Configuration of the Catch Errors node.
This is what an error log can look like (the error description was shortened).
Data visualization
Knime offers very good data visualization options. Data views are
implemented using interactive views (I have already touched on this topic in
the chapter about loops). The visualization abilities become even more
apparent in combination with the (paid) Knime Server and its Web Portal.
You can find the visualization nodes in basic Knime under Views/
JavaScript:
Examples of the use of the visualization nodes can be found under
EXAMPLES/ 03_Visualization:
Power BI
A useful extension can be additionally installed in Knime that enables
sending data to Microsoft Power BI from within a workflow. However, you
must have at least the Power BI Pro license.
Databases
We will explore databases and database connectors (MS SQL, SAP ERP,
SAP HANA, MySQL, etc.) in the next part of our Knime series (2021).
Once again, NodePit is a rich source of information:
https://nodepit.com/category/database
Connectors to SAP
A common topic in medium-sized and large companies is the possibility of a
direct connection between the SAP system and Knime.
This connection is feasible (for classic SAP as well as SAP S/4HANA), and
there are several ways of achieving it. We have tried three different ways:
1. Connection using SAP RFC
a. Configuration of the local environment for SAP
RFC
b. Uploading data from SAP using RFC functions
(from the Python Source node) and their use
directly in a Knime workflow
2. Connection using the Knime SAP Integration extension (based
on Theobald Software)
3. Connection using REST services
Regarding 2:
https://hub.knime.com/knime/extensions/org.knime.features.sap.theobald/latest/org.knime
Regarding 3:
https://hub.knime.com/knime/extensions/org.knime.features.rest/latest/org.knime.rest.node
Machine learning
The Knime Analytics Platform is a tool whose main advantage is its ability
to resolve and automate tasks from the area of advanced data analytics
without using program code. This excellent characteristic of Knime is
particularly apparent in the area of machine learning and artificial
intelligence. Knime supports classic machine learning algorithms as well as
deep learning algorithms.
The library of examples, part 04 Analytics, contains several workflows that
demonstrate the use of various algorithm types.
Python and R
Apart from the native Knime nodes (built in Java), we can also additionally
install support for the Python and R programming languages.
R installation
Since I am a fan of the Python language and do not know R well, I do not
have any practical experience with installing R support in Knime.
Nevertheless, detailed instructions for working with the “R nodes” can be
found here:
https://docs.knime.com/2019-12/r_installation_guide/index.html
Python installation
If you want to work with nodes that support the Python language, you need
to install the Knime Python Integration extension and configure a local
Python environment in the Knime preferences.
Knime Server
Knime Server is a tool that can move your automation to a completely
different level. The server is a paid tool and its price is relatively high.
However, if you have multiple automatic workflows and are considering
automatic business reporting or the implementation of AI scenarios in your
company, Knime Server definitely represents a solution that you can count
on.
Knime Server is available as a local (“on-premise”) installation or as a cloud
solution (AWS and MS Azure).
Knime Server introduction:
https://www.youtube.com/watch?v=NuEhV7TXh1Y
KNIME Server official website:
https://www.knime.com/knime-server
11. KNIME – links to other sources
Links
Links to other sources, including social networks, blogs, in-person and
online courses, as well as links to real datasets (public datasets to play with).
Knime – official
Knime – up-to-date getting-started guide for the platform:
https://www.knime.com/getting-started-guide
Knime forum:
https://forum.knime.com/
Knime hub:
https://hub.knime.com/
Knime blog:
https://www.knime.com/blog
Knime – community
NodePit:
https://nodepit.com/
Training
https://www.knime.com/knime-courses
MOOC courses
https://www.udemy.com/course/knime-bootcamp/ (basics, free)
https://www.udemy.com/courses/search/?
src=ukw&q=%C5%A1t%C4%9Btinov%C3%A1 (courses of my colleague B.
Stetinova on the topic of Knime and machine learning)
https://www.udemy.com/courses/search/?src=ukw&q=knime (search link to
other courses on Udemy.com)
https://www.coursera.org/lecture/code-free-data-science/introduction-to-
knime-analytics-platform-YBD5E (introductions to Knime on the
Coursera.org platform)
Social networks
Social network Link
Facebook https://www.facebook.com/KNIMEanalytics/
Instagram https://www.instagram.com/knime_official/
LinkedIn https://ch.linkedin.com/company/knime.com
Youtube https://www.youtube.com/user/KNIMETV
https://cz.linkedin.com/in/vladim%C3%ADr-l%C3%B6ffler-a84a9456
https://www.udemy.com/user/vladimir-loffler/
[email protected]
13. Acknowledgment
I would like to thank my colleagues (and particularly Barbora Stetinova) for
their feedback and valuable advice and comments related to the content of
this textbook. I would also like to thank my family, who supported me while
I was working on this book.