Codeless Deep Learning with KNIME
Copyright © 2020 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any
means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or
reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the
information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or
its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this
book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book
by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80056-661-3
www.packt.com
Packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you
plan your personal development and advance your career. For more information, please visit our website.
Why subscribe?
Spend less time learning and more time coding with practical eBooks and videos from over 4,000 industry professionals
Improve your learning with skill plans tailored especially for you
Get a free eBook or video every month
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to
the eBook version at packt.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with
us at [email protected] for more details.
At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive
exclusive discounts and offers on Packt books and eBooks.
Contributors
About the authors
Kathrin Melcher is a data scientist at KNIME. She holds a master's degree in mathematics from the University of Konstanz,
Germany. She joined the evangelism team at KNIME in 2017 and has a strong interest in data science and machine learning
algorithms. She enjoys teaching and sharing her data science knowledge with the community, for example, in the book From Excel
to KNIME, as well as on various blog posts and at training courses, workshops, and conference presentations.
Rosaria Silipo has been working in data analytics since 1992. Currently, she is a principal data scientist at KNIME. In the past,
she has held senior positions with Siemens, Viseca AG, and Nuance Communications, and worked as a consultant in a number of
data science projects. She holds a Ph.D. in bioengineering from the Politecnico di Milano and a master's degree in electrical
engineering from the University of Florence (Italy). She is the author of more than 50 scientific publications, many scientific white
papers, and a number of books for data science practitioners.
There are so many people to thank! We would like to thank Corey Weisinger for the Demand Prediction workflow in Chapter
6, and Jon Fuller for the image classification workflow in Chapter 9; Marcel Wiedenmann, Christian Dietz, and Benjamin
Wilhelm, from the KNIME development team, for the great Keras integration and the many deep learning nodes; and finally,
Paolo Tamagnini and Maarit Widmann, from the components team at KNIME, for the shared components we used in this
book.
About the reviewers
Corey Weisinger is a data scientist at KNIME in Austin, Texas. He studied mathematics at Michigan State University, focusing
on actuarial techniques and functional analysis. Prior to KNIME, he worked as an analytics consultant for the auto industry in
Detroit, Michigan. He currently focuses on signal processing and numeric prediction techniques, teaches a time series course on
KNIME, and is the author of the guidebook, From Alteryx to KNIME.
Adrian Nembach has a master's degree in computer science from the University of Konstanz. During his master's, he focused on
deep learning for computer vision, including generative adversarial networks for semi-supervised classification of cell images and
depth extraction from light field images. Alongside his studies, he also worked as a working student at KNIME, where he was
involved in the development of various machine learning-related nodes and extensions, including integrations for Keras, XGBoost,
and a rewrite of KNIME's native logistic regression and random forest nodes. After completing his degree, he started as a software
engineer at KNIME, developing nodes for machine learning interpretability, active learning, and weak supervision.
Barbora Stetinova is experienced in the data science and business intelligence spheres. She started her career, after obtaining her
MA and MBA degrees from university, at WITTE Automotive. Her data journey began in the controlling department as a data
analyst, and she currently works in the IT department, where she is responsible for data science and business intelligence projects.
Parallel to this, Barbora is engaged as a business analyst consultant for different industries at Leadership Synergy Community. To
help others on their data science journey, she publishes her own data science e-learning courses. All of this led her to cooperate with
Packt on data science projects as an e-learning trainer and technical reviewer, and with KNIME AG on a data visualization course.
Packt is searching for authors like you
If you're interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with
thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You
can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Table of Contents
Preface
Section 1: Feedforward Neural Networks and KNIME Deep Learning Extension
Chapter 1: Introduction to Deep Learning with KNIME Analytics Platform
Summary
Chapter 2: Data Access and Preprocessing with KNIME Analytics Platform
Accessing Data
Transforming Data
Summary
Chapter 3: Getting Started with Neural Networks
Loss Functions
Summary
Chapter 4: Building and Training a Feedforward Neural Network
Normalization
Summary
Chapter 5: Autoencoder for Fraud Detection
Optimizing Threshold
Taking Actions
Summary
Chapter 6: Recurrent Neural Networks for Demand Prediction
Introducing RNNs
Demand Prediction
Summary
Chapter 7: Implementing NLP Applications
Index Encoding
The Dataset
Summary
Chapter 8: Neural Machine Translation
Encoder-Decoder Architecture
Summary
Chapter 9: Convolutional Neural Networks for Image Classification
Introduction to CNNs
Introducing Padding
Introducing Pooling
Summary
Chapter 10: Deploying a Deep Learning Network
Using TensorFlow 2
Summary
Chapter 11: Best Practices and Other Deployment Options
Shared Components
Chapter 2, Data Access and Preprocessing with KNIME Analytics Platform, dives a bit deeper into the basic and advanced
functionalities of KNIME Analytics Platform: from data access to workflow parameterization.
Chapter 3, Getting Started with Neural Networks, is the only theoretical chapter of the book. It paints an overview of the basic
concepts around neural and deep learning networks and the algorithms used to train them.
Chapter 4, Building and Training a Feedforward Neural Network, is where we put into practice what we describe in Chapter 3,
Getting Started with Neural Networks; we will build, train, and evaluate our first simple feedforward networks for classification tasks.
Chapter 5, Autoencoder for Fraud Detection, is where we start the series of case studies based on deep learning solutions, using a neural autoencoder to solve the problem of fraud detection in credit card transactions.
Chapter 6, Recurrent Neural Networks for Demand Prediction, is where we introduce Long Short-Term Memory (LSTM) models in recurrent neural networks. Indeed, with their dynamic behavior, they are particularly effective in solving time series problems, such as a classic demand prediction problem.
Chapter 7, Implementing NLP Applications, covers how LSTM-based recurrent neural networks are often also used to implement
solutions for natural language processing tasks. In this chapter, we cover a few case studies for free text generation, free name
generation, and sentiment analysis.
Chapter 8, Neural Machine Translation, looks at an encoder-decoder architecture for automatic translations.
Chapter 9, Convolutional Neural Networks for Image Classification, covers a case study on image classification, which we could not
miss. We classify histopathology images into cancer diagnoses using a convolutional neural network.
Chapter 10, Deploying a Deep Learning Network, starts describing the deployment phase. A simple example of the deployment
workflow is explained in detail.
Chapter 11, Best Practices and Other Deployment Options, extends the previous chapter dedicated to deployment with more
deployment options, such as web applications and REST services, and we conclude the book with a few tips and tricks.
To get the most out of this book
KNIME Analytics Platform is a very easy-to-use tool. No previous coding knowledge is necessary.
Some previous math knowledge, however, is necessary to deal with the data transformations and understand the training algorithms.
KNIME Analytics Platform is open source and can be downloaded, installed, and used for free. Download it from https://www.knime.com/downloads.
KNIME Server is only needed in Chapter 11, Best Practices and Other Deployment Options, to run the trained network within a REST service or a web application. KNIME Server is not open source and cannot be used for free but requires a yearly license. For a test license, please fill in the contact form at https://www.knime.com/contact.
Download the example workflows
You can download the example workflows for this book from KNIME Hub at https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/, or from GitHub, at https://github.com/PacktPublishing/Codeless-Deep-Learning-with-KNIME. If there's an update to the workflows, it will be updated in the existing KNIME Hub and GitHub repositories.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781800566613_ColorImages.pdf.
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "We drag and drop the Demographics.csv file from Example Workflows/TheData/Misc into the workflow editor."
Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "After configuration is complete, we click OK; the node state moves to yellow and the node can now be executed."
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from?
Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think
about our products, and our authors can see your feedback on their book. Thank you!
Chapter 1: Introduction to Deep Learning with KNIME Analytics Platform
Deep learning can be quite complex, and we must make sure that the journey is worth the result. Thus, we'll start this chapter by stating, once again, the relevance of deep learning techniques when it comes to successfully implementing applications for data science.
We will continue by providing a quick overview of the tool of choice for this book – KNIME Software – and focus on how its two products, KNIME Analytics Platform and KNIME Server, complement each other.
The work we'll be doing throughout this book will be implemented in KNIME Analytics Platform, which is open source and available
for free. We will dedicate a full section to how to download, install, and use KNIME Analytics Platform, even though more details will
be provided in the chapters to follow.
Among the benefits of KNIME Analytics Platform is, of course, its codeless Deep Learning - Keras Integration extension, which we will
be making extensive use of throughout this book. In this chapter, we will just focus on the basic concepts and requirements for this
KNIME extension.
Finally, we will conclude this chapter by stating the goal and structure of this book. We wanted to give it a practical flavor, so most of
the chapters will revolve around a practical case study that includes real-world data. In each chapter, we will take the chance to dig
deeper into the required neural architecture, data preparation, deployment, and other aspects necessary to make the case study at hand a
success.
We'll start by stating the importance of deep learning when it comes to successful data science applications.
The Importance of Deep Learning
If you have been working in the field of data science – or Artificial Intelligence (AI), as it is called nowadays – for a few
years, you might have noticed the recent sudden explosion of scholarly and practitioner articles about successful solutions based on
deep learning techniques.
The big breakthrough happened in 2012 when the deep learning-based AlexNet network won the ImageNet challenge by an
unprecedented margin. This victory kicked off a surge in the usage of deep learning networks. Since then, these have expanded to
many different domains and tasks.
So, what are we referring to exactly when we talk about deep learning? Deep learning covers a subset of Machine Learning (ML) algorithms, most of which stem from neural networks. Deep learning is indeed the modern evolution of traditional neural networks. Apart from the classic feedforward, fully connected, backpropagation-trained, multilayer perceptron architectures, deeper architectures have been added. Deeper indicates more hidden layers and a few new additional neural paradigms, including Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and more.
The recent success of these new types of neural networks is due to several reasons. First, the increased computational power in modern
machines has favored the introduction and development of new paradigms and more complex neural architectures. Training a complex
neural network in minutes leaves space for more experimentation compared to training the same network for hours or days. Another reason is their flexibility. Neural networks are universal function approximators, which means that they can approximate almost anything, provided that their architecture is sufficiently complex.
Having mathematical knowledge of these algorithms, experience with the most effective paradigms and architectures, and domain
wisdom are all basic, important, and necessary ingredients for the success of any data science project. However, there are other, more
contingent factors – such as ease of learning, speed of prototyping, options for debugging and testing to ensure the correctness of the
solution, flexibility to experiment, availability of help from external experts, and automation and security capabilities – that also
influence the final result of the project.
In this book, we'll present deep learning solutions that can be implemented with the open source, visual programming-based, free-to-
use tool known as KNIME Analytics Platform. The deployment phases for some of these solutions also use a few features provided by
KNIME Server.
Next, we will learn about how KNIME Analytics Platform and KNIME Server complement each other, as well as which tasks both
should be used for.
We'll concentrate on KNIME Analytics Platform first and provide an overview of what it can accomplish.
Visual programming is a key feature of KNIME Analytics Platform for quick prototyping. It makes the tool very easy to use. In
visual programming, a Graphical User Interface (GUI) guides you through all the necessary steps for building a pipeline
(workflow) of dedicated blocks (nodes). Each node implements a given task; each workflow of nodes takes your data from the
beginning to the end of the designed journey. A workflow substitutes a script; a node substitutes one or more script lines.
Without extensive coverage when it comes to commonly used data wrangling techniques, machine learning algorithms, and data types
and formats, and without integration with most common database software, data sources, reporting tools, external scripts, and
programming languages, the software's ease of use would be limited. For this reason, KNIME Analytics Platform has been designed to
be open to different data formats, data types, data sources, and data platforms, as well as external tools such as Python and R.
We'll start by looking at a few ML algorithms. KNIME Analytics Platform covers most machine learning algorithms: from decision
trees to random forest and gradient boosted trees, from recommendation engines to a number of clustering techniques, from Naïve
Bayes to linear and logistic regression, from neural networks to deep learning. Most of these algorithms are native to KNIME Analytics
Platform, though some can be integrated from other open source tools such as Python and R.
To train different deep learning architectures, such as RNNs, autoencoders, and CNNs, KNIME Analytics Platform has integrated the
Keras deep learning library through the KNIME Deep Learning - Keras Integration extension
(https://www.knime.com/deeplearning/keras). Through this extension, it is possible to drag and drop nodes to define complex neural
architectures and train the final network without necessarily writing any code.
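To give a sense of what these nodes wrap, here is a minimal, hypothetical sketch of the raw Keras code that a short chain of layer nodes stands in for; the layer sizes, activations, and loss are illustrative assumptions, not values from any of the book's case studies:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Roughly what a chain of KNIME nodes -- a Keras Input Layer followed by
# two Keras Dense Layer nodes -- stands in for when written by hand.
model = keras.Sequential([
    layers.InputLayer(input_shape=(10,)),   # Keras Input Layer node
    layers.Dense(64, activation="relu"),    # Keras Dense Layer node
    layers.Dense(1, activation="sigmoid"),  # Keras Dense Layer node
])

# Training (the role of the Keras Network Learner node, approximately)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```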
However, defining the network is just one of the many steps that must be taken. Ensuring the data is in the right form to train the network is another crucial step. For this, a very large number of nodes are available so that we can implement a myriad of data wrangling techniques. By combining nodes dedicated to small tasks, you can implement very complex data transformation operations.
KNIME Analytics Platform also connects to most of the required data sources: from databases to cloud repositories, from big data
platforms to files.
But what if all of this is not enough? What if you need a specific procedure for a specific domain? What if you need a specific network
manipulation function from Python? Where KNIME Analytics Platform and its extensions cannot reach, you can integrate with other
scripting and programming languages, such as Python , R , Java , and Javascript , just to mention a few. In addition, KNIME
Analytics Platform has seamless integration with BIRT, a business intelligence and reporting tool. Integrations with other reporting
platforms such as Tableau, QlikView, Power BI, and Spotfire are also available.
Several JavaScript-based nodes are dedicated to implementing data visualization plots and charts: from a simple scatter plot to a more
complex sunburst chart, from a simple histogram to a parallel coordinate plot, and more. These nodes seem simple but are potentially
quite powerful. If you combine them within a component , you can interactively select data points across multiple charts. By doing
this, the component inherits and combines all the views from the contained nodes and connects them in a way that, if the points are
selected and visualized in one chart, they can also be selected and visualized in the other charts of the component's composite view.
Figure 1.1 shows the composite view of a component containing a scatter plot, a bar chart, and a parallel coordinate plot. The three plots visualize the same data and are connected in such a way that selecting data in the bar chart also selects, and optionally visualizes, the corresponding data in the other two charts.
When it comes to creating a data science solution, KNIME Analytics Platform provides everything you need. However, KNIME Server
offers a few additional features to ease your job when it comes to moving the solution to production.
This process of moving the application into the real world is called moving into production. The process of including the trained model
in this final application is called deployment . Both phases are deeply connected and can be quite problematic since all the errors that
occurred in the application design show up at this stage.
It is possible, though limited, to move an application into production using KNIME Analytics Platform. If you, as a lone data scientist
or a data science student, do not regularly deploy applications and models, KNIME Analytics Platform is probably enough for your
needs. However, if you are just a bit more involved in an enterprise environment, where scheduling, versioning, access rights, disaster
recovery, web applications and REST services, and all the other typical functions of a production server are needed, then just using
KNIME Analytics Platform for production can be cumbersome.
In this case, KNIME Server , which comes with an annual license fee, can make your life easier. First of all, it is going to fit the
governance of the enterprise's IT environment better. It also offers a protected collaboration environment for your group and the entire
data science lab. And of course, its main advantage consists of making model deployment and moving it into production easier and
safer since it uses the integrated deployment feature and allows you to use one-click deployment into production. End users can then
run the application from a KNIME Analytics Platform client or – even better – from a web browser.
Remember those composite views that offer interactive interconnected views of selected points? These become fully formed web pages when the application is executed in a web browser via KNIME Server's WebPortal.
Using the components as touchpoints within the workflow, we get a Guided Analytics application within the web browser. Guided analytics inserts touchpoints within the flow of the application, to be consumed by the end user from a web browser. The end user can take advantage of these touchpoints to insert knowledge or preferences and to steer the analysis in the desired direction.
To download KNIME Analytics Platform, follow these steps:
1. Go to https://www.knime.com/downloads.
2. Provide some details about yourself (step 1 in Figure 1.2).
3. Download the version that's suitable for your operating system (step 2 in Figure 1.2).
4. While you're waiting for the appropriate version to download, browse through the different steps to get started (step 3 in Figure 1.2):
Figure 1.2 – Steps for downloading the KNIME Analytics Platform package
Once you've downloaded the package, locate it, start it, and follow the instructions that appear onscreen to install it in any directory that
you have write permissions for.
Once it's been installed, locate your instance of KNIME Analytics Platform – from the appropriate folder, desktop link, application, or
link in the start menu – and start it.
When the splash screen appears, a window will ask for the location of your workspace ( Figure 1.3). This workspace is a folder on your
machine that will host all your work. The default workspace folder is called knime-workspace:
Figure 1.3 – The KNIME Analytics Platform Launcher window asking for the workspace folder
The KNIME workbench consists of different panels that can be resized, removed by clicking the X on their tab, or reinserted via the
View menu. Let's take a look at these panels:
KNIME Explorer: The KNIME Explorer panel in the upper-left corner displays all the workflows in the selected (LOCAL) workspace, possible connections to mounted KNIME servers, a connection to the EXAMPLES server, and a connection to the My-KNIME-Hub space.
The LOCAL workspace displays all workflows saved in the workspace folder that was selected when KNIME Analytics Platform was started. The very first time the platform is opened, the LOCAL workspace only contains the workflows and data in the Example Workflows folder. These are example applications to be used as starting points for your projects.
The EXAMPLES server is a read-only KNIME-hosted server that contains many more example workflows, organized into categories. Just double-click it to be automatically logged in with read-only access. Once you've done this, you can browse, open, explore, and download all available example workflows. Once you have located a workflow, double-click it to explore it or drag and drop it into LOCAL to create a local editable copy.
My-KNIME-Hub provides access to the KNIME community shared repository (KNIME Hub ), either in public or private mode.
You can use My-KNIME-Hub/Public to share your work with the KNIME community or My-KNIME-Hub/Private as a
space for your current work.
Workflow Coach : Workflow Coach is a node recommendation engine that aids you when you're building workflows. Based
on worldwide user statistics or your own private statistics, it will give you suggestions on which nodes you should use to complete
your workflow.
Node Repository : The Node Repository contains all the KNIME nodes you have currently installed, organized into categories.
To help you with orientation, a search box is located at the top of the Node Repository panel. The magnifier lens on its left
switches between the exact match and the fuzzy search option.
Workflow Editor : The Workflow Editor is the canvas at the center of the page and is where you assemble workflows,
configure and execute nodes, inspect results, and explore data. Nodes are added from the Node Repository panel to the workflow
editor by drag and drop or double-click. Upon starting KNIME Analytics Platform, the Workflow Editor will open on the
Welcome Page panel, which includes a number of useful tips on where to find help, courses, events, and the latest news about the
software.
Outline : The Outline view displays the entire workflow, even if only a small part is visible in the workflow editor. This part is
marked in gray in the Outline view. Moving the gray rectangle in the Outline view changes the portion of the workflow that's
visible in the Workflow Editor.
Console and Node Monitor : The Console and the Node Monitor share one panel with two tabs. The Console tab prints out
possible error and warning messages. The same information is written to a log file, located in the workspace directory. The Node
Monitor tab shows you the data that's available at the output ports of the selected executed node in the Workflow Editor. If a node
has multiple output ports, you can select the data of interest from a dropdown menu. By default, the data at the top output port is
shown.
KNIME Hub: The KNIME Hub (https://hub.knime.com/) is an external space where KNIME users can share their work. This
panel allows you to search for workflows, nodes, and components shared by members of the KNIME community.
Description : The Description panel displays information about the selected node or category. In particular, for nodes, it explains
the node's task, the algorithm behind it (if any), the dialog options, the available views, the expected input data, and the resulting
output data. For categories, it displays all contained nodes.
Finally, at the very top, you can find the Top Menu , which includes menus for file management and preference settings, workflow
editing options, additional views, node commands, and help documentation.
Under the top menu, a toolbar is available. When a workflow is open, the toolbar offers commands for workflow editing, node
execution, and customization.
A workflow can be built by dragging and dropping nodes from the Node Repository panel onto the Workflow Editor window or by just double-clicking them. Nodes are the basic processing units of any workflow. Each node has several input and/or output ports. Data flows over a connection from an output port to the input port(s) of other nodes. Two nodes are connected – and the data flow is established – by clicking the mouse at the output port of the first node and releasing the mouse at the input port of the next node. A pipeline of such nodes makes a workflow.
In Figure 1.5, under each node, you will see a status light: red, yellow, or green:
When a new node is created, the status light is usually red, which means that the node's settings still need to be configured for the node
to be able to execute its task.
To configure a node, right-click it and select Configure, or just double-click it. Then, adjust the necessary settings in the node's dialog.
When the dialog is closed by pressing the OK button, the node is configured, and the status light changes to yellow; this means that the
node is ready to be executed. Right-clicking on the node again shows an enabled Execute option; pressing it will execute the node.
The ports on the left are input ports, where the data from the outport of the predecessor node is fed into the node. Ports on the right are output ports. The result of the node's operation on the data is provided at the output ports to the successor nodes. When you hover over a port, a tooltip will provide information about the output dimension of the node.
IMPORTANT NOTE
Only ports of the same type can be connected!
Data ports (black triangles) are the most common type of node ports and transfer flat data tables from node to node. Database
ports (brown squares) transfer SQL queries from node to node. Many more node ports exist and transfer different objects from one
node to the next.
After successful execution, the status light of the node turns green, indicating that the processed data is now available on the outports.
The result(s) can be inspected by exploring the outport view(s): the last entries in the context menu open them.
With that, we have completed our quick tour of the workbench in KNIME Analytics Platform.
Now, let's take a look at where we can find starting examples and help.
All workflows described in this book are also available on the KNIME Hub for you: https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/.
Once you've isolated the workflow you are interested in, click on it to open its page, and then download it or open it in KNIME
Analytics Platform to customize it to your own needs.
On the other hand, to share your work on the KNIME Hub, just copy your workflows from your local workspace into the My-KNIME-
Hub/Public folder in the KNIME Explorer panel within the KNIME workbench. It will be automatically available to all members of
the KNIME community.
The KNIME community is also very active, with tips and tricks available on the KNIME Forum (https://forum.knime.com/). Here, you can ask questions or search for answers.
IMPORTANT NOTE
Folders in KNIME Explorer are called Workflow Groups.
Similarly, you can create a new workflow, as follows:
1. Right-click the Chapter 1 folder (or anywhere you want your workflow to be) and select New KNIME Workflow (as shown in Figure 1.7); in the window that opens, name it My_first_workflow.
2. Click Finish. You should then see a new workflow with that name in the KNIME Explorer panel.
After clicking Finish, the Workflow Editor will open the canvas for the empty workflow.
TIP
By default, the canvas for a new workflow opens with the grid on; to turn it off, click the Open the settings dialog for the
workflow editor button (the button before the last one) in the toolbar. This button opens a window where you can customize the
workflow's appearance (for example, allowing curved connections) and perform editing (turn the grid on/off).
Figure 1.7 shows the New Workflow Group ... option in the KNIME Explorer's context menu. It allows you to create a new, empty
folder:
Figure 1.7 – Context menu for creating a new folder and a new workflow in KNIME Explorer
Let's create the node so that we can read the adult.csv ASCII file:
a) In the Node Repository, search for the File Reader node (it is actually located in the IO/Read category).
b) Drag and drop the File Reader node onto the Workflow Editor panel.
c) Alternatively, just double-click the File Reader node in the Node Repository; this will automatically create it in the Workflow Editor panel.
In Figure 1.8, you can see the File Reader node located in the Node Repository:
Figure 1.8 – The File Reader node under IO/Read in the Node Repository
Now, let's configure the node so that it reads the adult.csv file. Double-click the newly created File Reader node in the
Workflow Editor and manually configure it with the file path to the adult.csv file. Alternatively, just drag and drop the adult.csv file
from the KNIME Explorer panel (or from anywhere on your machine) onto the Workflow Editor window. You can see this
action in Figure 1.9 :
Figure 1.9 – Dragging and dropping the adult.csv file onto the Workflow Editor panel.
This automatically generates a File Reader node that contains most of the correct configuration settings for reading the file.
TIP
The Advanced button in the File Reader configuration window leads you to additional advanced settings: reading files with special
characters, such as quotes; allowing lines with different lengths; using different encodings; and so on.
To execute this node, just right-click it and from the context menu, select Execute ; alternatively, click on the Execute buttons (single
and double white arrows on a green background) that are available in the toolbar.
To inspect the output data table that's produced by this node's execution, right-click on the node and select the last option available in
the context menu. This opens the data table that appears as a result of reading the adult.csv file. You will notice columns such as Age ,
Workclass , and so on.
IMPORTANT NOTE
Data in KNIME Analytics Platform is organized into tables. Each cell is uniquely identified via the column header and the row ID .
Therefore, column headers and row IDs need to have unique values.
fnlwgt is one column whose meaning we were never sure of. So, let's remove it from further analysis by using the Column Filter node.
To do this, search for Column Filter in the search box above the Node Repository, then drag and drop it onto the Workflow Editor
and connect the output of the File
Reader node to the input of the Column Filter node. Alternatively, we can
select the File Reader node in the Workflow Editor panel and then double-click the Column Filter node in
the Node Repository. This automatically creates a node and its connections in the Workflow Editor.
The Column Filter node and its configuration window are shown in Figure 1.10 :
Figure 1.10 – Configuring the Column Filter node to remove the column named fnlwgt from the input data table
Again, double-click the node, or right-click it and then select Configure, to configure it. This configuration window contains three options that can be selected via three radio buttons: Manual Selection, Wildcard/Regex Selection, and Type Selection. Let's take a look at these in more detail:
Manual Selection offers an Include/Exclude framework so that you can manually transfer columns from the Include set into the
Exclude set and vice versa.
Wildcard/Regex Selection extracts the columns you wish to keep, based on a wildcard (using * as the wildcard) or regex
expression.
Type Selection keeps the columns based on the data types they carry.
Since this is our first workflow, we'll go for the easiest approach; that is, Manual Selection. Go to the Manual Selection tab and transfer the fnlwgt column to the Exclude set via the buttons in-between the two frames (these can be seen in Figure 1.10).
After executing the Column Filter node, if we inspect the output data table (right-click and select the last option in the context menu),
we'll see a table that doesn't contain the fnlwgt column.
Now, let's extract all the records of people who work more than 20 hours/week. hours-per-week is the column that
contains the data of interest.
For this, we need to create a Row Filter node and implement the required condition:
use range checking keeps only those data rows whose value in the Column to test column falls between the lower bound and upper bound values.
only missing values match only keeps the data rows where a missing value is present in the selected column.
The default behavior is to include the matching data rows in the output data table. However, this can be changed by enabling Exclude
rows by attribute value via the radio buttons on the left-hand side of the configuration window.
Alternative filtering criteria can be done by row number or by row ID. This can also be enabled via the radio buttons on the left-hand
side of the configuration window:
Figure 1.11 – Configuring the Row Filter node to keep only rows with hours-per-week > 20 in the input data table
After execution, upon opening the output data table ( Figure 1.12), no data rows with hours-per-week < 20 should be present:
Figure 1.12 – Right-clicking a successfully executed node and selecting the last option shows the data table that was produced by the
node
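Purely as a point of reference for readers who know Python, the three nodes built so far (File Reader, Column Filter, and Row Filter) perform roughly the following pandas operations; the file path is an assumption about where adult.csv lives on your machine:

```python
import pandas as pd

# File Reader node: read the CSV file (path is an assumption)
df = pd.read_csv("adult.csv")

# Column Filter node: drop the fnlwgt column
df = df.drop(columns=["fnlwgt"])

# Row Filter node: keep only rows with hours-per-week > 20
df = df[df["hours-per-week"] > 20]

print(df.head())
```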
Now, let's look at some very basic visualization. Let's visualize the number of men versus women in this dataset, which contains people
who work more than 20 hours/week:
Figure 1.13 – The Bar Chart node and its configuration window
To do this, locate the Bar Chart node in the Node Repository, create an instance in the workflow, connect it to receive input from the Row Filter node, and open its configuration window (Figure 1.13). Here, there are four tabs we can use for configuration
purposes. Options covers all data settings, General Plot Options covers all plot settings, Control Options covers all control
options, and Interactivity covers all subscription events when it comes to interacting with other plots, views, and charts when they've
been assembled to create a component. Again, since this is just a beginner's workflow, we'll adopt all the default settings and just set the
following:
From the Options tab, set Category Column to sex, ensuring it appears on the x axis. Then, select Occurrence Count in
order to count the number of rows by sex.
From the General Plot Options tab, set a title, a subtitle, and the axis labels.
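Again only as a rough point of reference, the counting and plotting that the Bar Chart node is configured to do here corresponds approximately to this pandas/matplotlib sketch, continuing from the df of the previous snippet:

```python
import matplotlib.pyplot as plt

# Occurrence Count on the sex category column
counts = df["sex"].value_counts()

# Bar chart with a title and axis labels, as set in the two tabs above
ax = counts.plot.bar()
ax.set_title("People working more than 20 hours/week, by sex")
ax.set_xlabel("sex")
ax.set_ylabel("count")
plt.show()
```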
This node does not produce data, but rather a view of the bar chart. So, to inspect the results produced by this node after its execution,
right-click it and select the central option; that is, Interactive View: Group Bar Chart (Figure 1.14):
Figure 1.14 – Right-clicking a successfully executed visualization node and selecting the Interactive View: Grouped Bar Chart option
to see the chart/plot that has been produced
IMPORTANT NOTE
Most data visualization nodes produce a view and not a data table. To see the respective view, right-click the successfully executed
node and select the Interactive View: … option.
The second lower input port of the Bar Chart node is optional (a white triangle) and is used to read a color map so that you can color
the bars in the bar chart.
IMPORTANT NOTE
Note that a number of different data visualization nodes are available in the Node Repository: JavaScript, Local(Swing), Plotly, and so
on. JavaScript and Plotly nodes offer the highest level of interactivity and the most polished graphics. We used the Bar
Chart node from the JavaScript category in the Node Repository panel here.
Now, we'll add a few comments to document the workflow. You can add comments at the node level or at the general workflow level.
Each node in the workflow is created with a default label of Node xx under it. Upon double-clicking it, the node label editor appears.
This allows you to customize the text, the font, the color, the background, and other similar properties of the node (Figure 1.15 ). We
need to write a little comment under each node to make it clear what tasks they are implementing:
Figure 1.15 – Editor for customizing the labels under each node
You can also write annotations at the workflow level. Just right-click anywhere in the Workflow Editor and select New Workflow
Annotation . A yellow frame will appear in editing mode. Here, you can add text and customize it, as well as its frame. To close the
annotation editor, just click anywhere else in the Workflow Editor. To reopen the annotation editor, double-click in the top-left corner
of the annotation ( Figure 1.16):
Congratulations! You have just built your first workflow with KNIME Analytics Platform. It should look something like the one in
Figure 1.17:
Figure 1.17 – My_first_Workflow
Now, let's make sure we have KNIME Deep Learning – Keras Integration installed and functioning.
KNIME Analytics Platform consists of a software core and several provided extensions and integrations. Such extensions and
integrations are provided by the KNIME community and extend the original software core through a variety of data science
functionalities, including advanced algorithms for AI.
The KNIME extension of interest here is called KNIME Deep Learning – Keras Integration . It offers a codeless GUI-based
integration of the Keras library, while using TensorFlow as its backend. This means that a number of functions from Keras libraries
have been wrapped into KNIME nodes, within KNIME's classic, easy-to-use visual dialog window. Due to this integration, you can
read, write, create, train, and execute deep learning networks without writing code.
Another deep learning integration that's available is called KNIME Deep Learning - TensorFlow Integration . This extension
allows you to convert Keras models into TensorFlow models, as well as read, execute, and write TensorFlow models.
TensorFlow is an open source library provided by Google that includes a number of deep learning paradigms. TensorFlow functions
can run on single devices, as well as on multiple CPUs and multiple GPUs. This parallel calculation feature is the key to speeding up the
computationally intensive training that's required for deep learning networks.
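As a quick illustration (a minimal sketch, not a step required by the KNIME integration), in any Python environment with TensorFlow installed you can list the devices TensorFlow sees; an empty GPU list means training will run on the CPU:

```python
import tensorflow as tf

# List the compute devices TensorFlow can use; an empty GPU list means
# training will fall back to the CPU
print("CPUs:", tf.config.list_physical_devices("CPU"))
print("GPUs:", tf.config.list_physical_devices("GPU"))
```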
However, using the TensorFlow library within Python can prove quite complicated, even for an expert Python programmer or a deep
learning pro. Thus, a number of simplified interfaces have been developed on top of TensorFlow that expose a subset of its functions
and parameters. The most successful of such TensorFlow-based libraries is Keras. However, even Keras still requires some
programming skills. The KNIME Deep Learning – Keras Integration puts the KNIME GUI on top of the Keras libraries that are
available, mostly eliminating the need to code.
To make the KNIME Deep Learning – Keras Integration work, a few pieces of the puzzle need to be installed: the deep learning node extensions and a Python environment with the required libraries.
Let's start with the first piece: installing the Keras and TensorFlow nodes.
You can install them from within KNIME Analytics Platform by clicking on File from the top menu and selecting Install KNIME
Extension… . This opens the dialog shown in Figure 1.18:
From this new dialog, you can select the extensions and integrations you want to install. Using the search bar at the top is helpful for
filtering the available extensions and integrations.
TIP
Another way you can install extensions is by dragging and dropping them from the KNIME Hub.
To install the Keras and TensorFlow nodes that will be used in the case studies described in this book, you need to select the KNIME Deep Learning - Keras Integration and KNIME Deep Learning - TensorFlow Integration extensions.
Similar to the KNIME Python Integration, the KNIME Deep Learning Integration uses Anaconda to manage Python environments. If you have already installed Anaconda, for example for the KNIME Python Integration, you can skip the first step.
1. First, get and install the latest Anaconda version (Anaconda ≥ 2019.03, conda ≥ 4.6.2)
from https://www.anaconda.com/products/individual. On the Anaconda download page, you can choose between Anaconda with
Python 3.x or Python 2.x. Either one should work (if you're not sure, we suggest selecting Python 3).
2. Next, we need to create an environment with the correct libraries installed. To do so, from within KNIME Analytics Platform, open the Python Deep Learning preferences. From here, do the following:
3. First, select File -> Preferences from the top menu. This will open a new dialog with a list on the left.
4. From the dialog, select KNIME -> Python Deep Learning.
From this page, create some Conda environments with the correct packages installed for Keras or TensorFlow 2. For the case studies
in this book, it will be sufficient to set up an environment for Keras.
5. To create and set up a new environment, enable Use special Deep Learning configuration and set Library used for DL Python to Keras. Next, enable Conda and provide the path to your Conda installation directory.
6. In addition, to create a new environment for Keras, click on the New environment… button in the Keras framework.
This opens a new dialog, as in Figure 1.21, where you can set the new environment's name:
7. Click on the Create new CPU environment or Create new GPU environment button to create a new environment for
using either a CPU or GPU, if available.
Now, you can get started. In this section, you were introduced to the most convenient way of setting up a Python environment. Other options can be found in the KNIME documentation: https://docs.knime.com/2019-06/deep_learning_installation_guide/index.html#keras-integration.
Goal and Structure of this Book
In this book, our aim is to provide you with a strong theoretical basis about deep learning architectures and training paradigms, as well
as some detailed codeless experience of their implementations for solving practical case studies based on real-world data.
For this journey, we have adopted the codeless tool, KNIME Analytics Platform. KNIME Analytics Platform is based on visual
programming and exploits a user-friendly GUI to make data analytics a more affordable task without the barrier of coding. As with
many other external extensions, KNIME Analytics Platform has integrated the Keras libraries under this same GUI, thus including deep
learning as part of its list of codeless extensions. From within KNIME Analytics Platform, you can build, train, and test a deep learning
architecture with just a few drag and drops and a few clicks of the mouse. We provided a little introduction to the tool in this chapter,
but we will provide more detailed information about it in Chapter 2, Data Access and Preprocessing with KNIME Analytics Platform.
After that, in Chapter 3 , Getting Started with Neural Networks, we will provide a quick overview of the basic concepts behind neural
networks and deep learning. This chapter will by no means provide complete coverage of all the architectures and paradigms involved
in neural networks and deep learning. Instead, it will provide a quick overview of them to help you familiarize yourself with the
concept, either for the first time or again, before you continue implementing them. Please refer to more specialized literature if you
want to know more about the mathematical background of deep learning.
As we stated previously, we decided to talk about deep learning techniques in a very practical way; that is, always with reference to real
case studies where a particular deep learning technique had been successfully implemented. We'll start this trend in Chapter 4, Building
and Training a Feedforward Network, where we'll describe a few basic example applications we can use to train and apply the basic
concepts surrounding deep learning networks that we explored in Chapter 3 , Getting Started with Neural Networks. Although these are
simple toy examples, they are still useful for illustrating how to apply the theoretical concepts we described in the previous chapter.
With Chapter 5, Autoencoder for Fraud Detection, we'll start looking at real case studies. The first case study we'll describe in this chapter aims to detect fraud in credit card transactions by firing an alarm every time a suspicious transaction is detected. To implement this subspecies of anomaly detection, we'll use an approach based on the autoencoder architecture, as well as the calculated distance between the output and the input values of the network.
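To make the idea concrete, here is a minimal NumPy sketch of that scoring step, assuming the network's reconstructions have already been computed; all names, values, and the threshold are illustrative assumptions, not the book's actual workflow:

```python
import numpy as np

def anomaly_scores(inputs: np.ndarray, reconstructions: np.ndarray) -> np.ndarray:
    """Euclidean distance between each input row and its reconstruction."""
    return np.linalg.norm(inputs - reconstructions, axis=1)

# Hypothetical data: four transactions with three features each
x = np.array([[0.1, 0.2, 0.0],
              [0.0, 0.1, 0.1],
              [0.9, 0.8, 0.7],   # an unusual transaction
              [0.1, 0.0, 0.2]])
# Hypothetical autoencoder outputs for those transactions
x_hat = np.array([[0.1, 0.2, 0.1],
                  [0.1, 0.1, 0.1],
                  [0.2, 0.3, 0.2],  # reconstructed poorly
                  [0.1, 0.1, 0.2]])

threshold = 0.5  # illustrative; in practice tuned on legitimate transactions
alarms = anomaly_scores(x, x_hat) > threshold
print(alarms)  # [False False  True False]
```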
With Chapter 5, Autoencoder for Fraud Detection, we are still in the realm of classic neural networks, including feedforward networks and those trained with backpropagation, albeit with an original architecture. In Chapter 6, Recurrent Neural Networks for Demand Prediction, we'll enter the realm of deep learning networks with RNNs – specifically, with LSTMs. Here, the dynamic character of such
networks and their capability to capture the time evolution of a signal will be exploited to solve a classic time series analysis problem:
demand prediction.
Upon introducing RNNs, we will learn how to use them for Natural Language Processing (NLP) case studies. Chapter 7,
Implementing NLP Applications, covers a few such NLP use cases: sentiment analysis, free text generation, and product name
generation, to name a few. All such use cases are similar in the sense that they analyze streams of text. All of them are also slightly
different in that they find a solution to a different problem: classification for sentiment analysis for the former case, and unconstrained
generation of sequences of words or characters for the other two use cases. Nevertheless, data preparation techniques and RNN
architectures are similar for all case studies, which is why they have been placed into one single chapter.
Chapter 8, Neural Machine Translation, describes a spin-off case of free text generation with RNNs. Here, a sequence of words will be
generated at the output of the network as a response to a corresponding sequence of words in the input layer. The output sequence will
be generated in the target language, while the input sequence will be provided in the source language.
Deep learning does not just come in the form of RNNs and text mining. Actually, the first examples of deep learning networks came
from the field of image processing. Chapter 9, Convolutional Neural Networks for Image Classification, is dedicated to describing a
case study where histopathology slide images must be classified as one of three different types of cancer. To do that, we will introduce
CNNs. Training networks for image analysis is not a simple task in terms of time, the amount of data, and computational resources.
Often, to train a neural network so that it recognizes images, we must rely on the benefits of transfer learning, as described in Chapter
9, Convolutional Neural Networks for Image Classification, as well.
Chapter 9, Convolutional Neural Networks for Image Classification, concludes our in-depth look into how deep learning techniques
can be implemented for real case studies. We are aware of the fact that other deep learning paradigms have been used to produce
solutions for other data science problems. However, here, we decided to only report the common paradigms in which we had real-life
experiences.
After training a network, the deployment phase must take place. Deployment is often conveniently forgotten since this is the phase
where all problems are put to the test. This includes errors in the application's design, in training the network, in accessing and
preparing the data: all of them will show up here, during deployment. Due to this, the last two chapters of this book are dedicated to the
deployment phase of trained deep learning networks.
Chapter 10, Deploying a Deep Learning Network, will show you how to build a deployment application, while Chapter 11, Best
Practices and Other Deployment Options, will show you all the deployment options that are available (a web application or a REST
service). It will also provide you with a few tips and tricks from our own experience.
Each chapter comes with its own set of questions so that you can test your understanding of the material that's been provided.
With that, please read on to discover the various deep learning architectures that can be applied to real use cases using KNIME
Analytics Platform.
Summary
This first chapter aimed to prepare you for the content provided in this book.
Thus, we started this chapter by reminding you of the importance of deep learning, as well as the surge in popularity it garnered
following the first deep learning success stories. Such a surge in popularity is probably what brought you here, with the desire to learn
more about practical implementations of deep learning networks for real use cases.
Nowadays, the main barrier that we come across when learning about deep learning is the coding skills that are required. Here, we
adopted KNIME software, and in particular the open source KNIME Analytics Platform, so that we can look at the case studies that will
be proposed throughout this book. To do this, we described KNIME software and KNIME Analytics Platform in detail.
KNIME Analytics Platform also benefits from an extension known as KNIME Deep Learning – Keras Integration, which helps with
integrating Keras deep learning libraries. It does this by wrapping Python-based libraries into the codeless KNIME GUI. We dedicated a
full section to installing it.
Finally, we concluded this chapter by providing an overview of what the remaining chapters in this book will cover.
Before we dive into the math and applications of deep learning networks, we will use the next chapter to familiarize ourselves with the
basic features of KNIME Analytics Platform.
Chapter 2: Data Access and Preprocessing with KNIME Analytics Platform
Before deep-diving into neural networks and deep learning architectures, it might be a good idea to get familiar with KNIME Analytics
Platform and its most important functions.
In this chapter, we will cover a few basic operations within KNIME Analytics Platform. Since every project needs data, we will first go
through the basics of how to access data: from files or databases. In KNIME Analytics Platform, you can also access data from REST
services, cloud repositories, specific industry formats, and more. We will leave the exploration of these other options to you.
Data comes in a number of shapes and types. In the Data Types and Conversions section, we will briefly investigate the tabular nature
of the KNIME data representation, the basic types of data in a data table, and how to convert from one type to another.
At this point, after we have imported the data into a KNIME workflow, we will show some basic data operations, such as filtering,
joining, concatenating, aggregating, and other commonly used data transformations.
The parameterization of a static workflow will conclude this very quick overview of the basic operations you can perform with KNIME
Analytics Platform on your data.
In this chapter, we will cover the following main topics:
Accessing Data
Transforming Data
Parameterizing the Workflow
Accessing Data
Before starting with examples of how to access and import data into a KNIME workflow, let's create the workflow:
1. Click on the File item in the top menu or right-click on a folder, such as LOCAL , for example, in KNIME Explorer .
2. Then, select the New KNIME Workflow option.
An empty canvas will open in the central part of the KNIME workbench: the workflow editor.
For this chapter, we will use toy data already available at installation. A set of workflows is installed together with the core KNIME
Analytics Platform. You can find them in the Example Workflows folder (Figure 2.1 ) in the KNIME Explorer panel.
Its TheData sub-folder contains some free toy datasets:
Figure 2.1 – Structure of the Example Workflows folder in the KNIME Explorer panel
TIP
In order to upload data to the KNIME Explorer panel, just copy it into a folder within the current workspace folder on your
machine. The folder and its contents will then appear in KNIME Explorer in the list of workflows, servers, KNIME Hub spaces, and
data available.
There are two ways to create and configure a File Reader node: a long way and a short way. In the long way, you search for the File Reader node in the Node Repository; drag and drop it into the workflow editor; double-click it to open its configuration window, or alternatively, right-click it and then select Configure; and fill in the required settings, which at the very least means providing the file path via the Browse button (Figure 2.2).
In the short way, you just drag and drop your CSV-formatted file from the File Explorer panel into the workflow editor. This way automatically creates a File Reader node, fills in most of its configuration settings, including the file path, and keeps the configuration window open for further adjustments (Figure 2.2).
Under the file path, there are some basic settings: whether to read the first row as column headers and/or the first column as RowID,
the column delimiter for general text files, and how to deal with spaces and tabs.
Notice two more things in this configuration window of the File Reader node: the data preview and the Advanced button. The data
preview in the lower part of the window allows you to see whether the dataset is being read properly. The Advanced button takes you
to more advanced settings, such as enabling shorter lines, character encoding, quotes, and other similar preferences.
When using the short way to create and configure a File Reader node, in the preview panel in the node configuration window
( Figure 2.2), you can see whether the automatic settings were sufficient or whether additional customization is necessary:
Figure 2.2 – The File Reader node and its configuration window
We drag and drop the Demographics.csv file from Example Workflows/TheData/Misc
into the workflow editor. In the configuration window of the File Reader node, we see that the CustomerKey column is
interpreted as the row ID of the data rows, rather than its own column. We need to disable the read Row IDs option to read the data
properly. After the configuration is complete, we click OK ; the node state moves to yellow and the node can now be executed.
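If you are curious about what such a reader node does behind the scenes, here is a minimal Python sketch of the equivalent operation with pandas. This is only an illustration for comparison; the KNIME node itself involves no code:

import pandas as pd

# Read the CSV file: header=0 reads the first row as column headers,
# and index_col=None keeps CustomerKey as a regular data column
# instead of using it as the row ID.
demographics = pd.read_csv("Demographics.csv", header=0, index_col=None, sep=",")
print(demographics.head())  # inspect the first rows, like the node's data preview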
TIP
The automatic creation of the node and the configuration of its settings by file drag and drop works only for specific file extensions:
.csv for a File Reader node, .table for a Table Reader node, .xls and .xlsx for an Excel Reader node, and so on.
Similarly, if we drag and drop the ProductData2.xls file from the KNIME Explorer panel to the workflow editor, an
Excel Reader (XLS ) node is created and automatically configured (Figure 2.3):
Figure 2.3 – The Excel Reader (XLS) node and its configuration window
The configuration window (Figure 2.3) is similar to that of the File Reader node, but, of course, customized to deal with Excel files. Three items in particular are different:
The preview part is activated by a refresh button. You need to click on refresh to update the preview.
Column headers and row IDs are extracted from spreadsheet cells, identified with an alphabet letter (the column with the row IDs)
and a row number (the row with the column headers), according to the Excel standards.
On top of the URL path, there is a menu with a default choice, Custom URL. This menu allows you to express the file path as an absolute path (local file system), as a path relative to a mountpoint (Mountpoint), as a path relative to one of the current locations (data, workflow, or mountpoint), or as a custom path (Custom URL). This feature will soon be extended to other reader nodes.
In our case, the automated configuration process does not include the column headers. We can see this from the preview segment. So,
because we have the column headers in the first row, we adjust the Table contains column names in row number setting to 1,
refresh the preview, and click OK to save the changes and close the window.
Next, let's read the SentimentAnalysis.table file. .table files contain binary content in a KNIME
proprietary format optimized for speed and size. These files are read by the Table Reader node. Since all the information about the file
is already included in the file itself, the configuration window of the Table Reader node just requires the file path and a few
additional settings to limit the content to import. Again, dragging and dropping the SentimentAnalysis.table
file automatically generates a Table Reader node with a pre-configured URL.
To conclude this section, let's read the last two files, SentimentRating.csv and WebDataOldSystem.csv, with two more File Reader nodes; let's add the name of the file in the comment under each node; and, finally, let's group all these reader nodes inside an annotation labeled Reading data from files (Figure 2.9).
Demographics.csv contains the demographics of a number of customers, such as age and gender. Each customer is
identified via a CustomerKey value. ProductData2.xls contains the products purchased by each customer,
again identified via the CustomerKey value. SentimentAnalysis.table contains the sentiment
expressed as text by the customer toward the company and the product, again identified via the CustomerKey value.
SentimentRating.csv contains the mapping between the sentiment rating and the sentiment text. Finally, WebDataOldSystem.csv contains the old activity index for each customer, as classified in the old web system, before migration.
Of course, if there is a dataset from before migration, we must have a newer dataset with data from the system after migration. This can
be found in a database table in the WebActivity.sqlite SQLite database.
This leads us to the next section, where we will learn how to read data from a database.
Reading data from a database requires a few steps: connecting to the database, selecting a table, building the SQL query, and importing the data. There are nodes for each of these steps, as shown in Figure 2.4:
Figure 2.4 – Importing data from databases: connect, select, build SQL query, and import
Connectors : Connector nodes connect KNIME Analytics Platform to a database. The only generic connector is the DB
Connector node. Besides that, there are many dedicated connectors. There is a SQLite connector, a MySQL connector, a Microsoft
Access connector, and many other dedicated connectors. The SQLite connector is what we need to connect to the database contained
in the WebActivity.sqlite file. The configuration window only requires the database file path since SQLite is a
file-based database. All other settings have been preset in the node. Indeed, it is common to have some preset settings in dedicated
connectors, and therefore dedicated connectors need fewer settings than the generic DB Connector node. A drag and drop of the
.sqlite file automatically generates the SQLite DB Connector node with preloaded configuration settings.
Selecting the Table : The DB Table Selector node allows you to select the table from the connected database to work on. If
you are a SQL expert, the Custom Query flag allows you to create your own query for the subset of data to extract.
Build SQL Query: If you are not a SQL expert, you can still build your SQL query to extract the subset of data. The DB nodes in the DB/Query category take a SQL query as input and add one more SQL query on top of it. The node GUI is completely codeless and therefore there is no need to know any SQL code. So, for example, the configuration window of the DB Row Filter node presents a graphical editor on the right to build a row-filtering SQL query.
In the following screenshot (Figure 2.5), records with CustomerKey = 11177 have been excluded:
Figure 2.5 – The GUI of the DB Row Filter node. This node builds a SQL query to filter out records without using any SQL script
Import Data : Finally, the DB Reader node imports the data from the database connection according to the input SQL query. The
DB Reader node has no configuration window since all the required SQL settings to import the data are contained in the SQL query
at its input port. There are many other nodes, besides the DB Reader node, to import data from a database at the end of such a sequence. They are all in the DB/Read/Write category in the Node Repository panel.
IMPORTANT NOTE
Did you notice the node ports in Figure 2.4? We passed from the black triangle (data) to the red square (connection) to the brown
square (SQL query). Only ports of the same type, transporting data of the same type, can be connected!
In order to inspect the results, after the successful execution of the DB Reader node, you can right-click the last node in the sequence
– the one with a black triangle (data) port, in this case, the DB Reader node – and select the last item in the menu. This shows the
output data table.
The upstream DB query nodes, instead, only produce a SQL query at their output port. You can still inspect the results of the query by right-clicking the node, selecting the last item in the menu, and then clicking on the Cache no. of Rows button in the Table Preview tab to temporarily visualize just the selected number of top rows.
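For comparison, the whole connect – select table – build query – import sequence corresponds to a few lines of Python. This is only a sketch: the table name web_activity is an assumption, as the text does not spell it out:

import sqlite3
import pandas as pd

# Connect: SQLite is file-based, so the file path is all we need.
conn = sqlite3.connect("WebActivity.sqlite")

# Select the table and build the query, excluding CustomerKey 11177
# as in the DB Row Filter node (table name assumed).
query = "SELECT * FROM web_activity WHERE CustomerKey <> 11177"

# Import the data according to the SQL query, like the DB Reader node.
web_activity_new = pd.read_sql_query(query, conn)
conn.close()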
At this point, we have also imported the last dataset, including customer web activity after migration to the new web system.
Let's spend a bit of time now on the data structure and data types.
Data Types and Conversions
If you inspect any of the output data tables from any of the nodes described previously, you will see a table-like representation of the
data. Here, each value is identified via RowID, the identification number for the record, and via a column header , the name of
the attribute ( Figure 2.6). So, the gender of CustomerKey 11000 is M, as identified via the Gender column header,
and the row ID is Row0. In a reader node, the row ID and column header can be generated automatically or assigned from the values
in a column or a row in the data.
The following is a screenshot of the data table output by the File Reader node:
Figure 2.6 – A KNIME data table. Here, a cell is identified via its RowID value and column header
Each data column also has a data type, as you can see in Figure 2.6 from the icons in the column headers. Basic data types are Integer, Double, Boolean (true/false), and String. However, more complex data types are also available, such as Date&Time, Document, Image, Network, and more. We will see some of these data types in the upcoming chapters.
Of course, a data column is not condemned to keep its data type forever. If needed, it can be converted to another data type.
Some nodes are dedicated to conversions and can be found in the Node Repository under Manipulation/Column/Convert &
Replace .
In the data that we have read, CustomerKey has been imported as a five-digit integer. However, it might be convenient to
move from an integer type representation to a string type representation. For that, we use the Number to String node. The
configuration window consists of an include/exclude framework to select those columns whose type needs changing. The opposite
transformation is obtained with the String to Number node. The Double to Int node might also be useful for a transformation
from double to integer.
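In code, the same conversions are one-liners. A small sketch with pandas, assuming the demographics table from the earlier sketch:

# Integer to string, like the Number to String node:
demographics["CustomerKey"] = demographics["CustomerKey"].astype(str)

# String back to number, like the String to Number node:
demographics["CustomerKey"] = demographics["CustomerKey"].astype(int)

# Double to integer (by truncation), like the Double to Int node:
demographics["Age"] = demographics["Age"].astype(float).astype(int)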
TIP
The String Manipulation and Math Formula nodes, even though their primary task is data transformation, also present some
conversion functionality.
We would like to draw your attention to the Category To Number node. This node comes in handy to transform nominal classes into numbers, as neural networks only accept numbers as target classes.
Special data types, such as Image or Date&Time, offer their own conversion nodes. A very helpful node of this kind is the String to Date&Time node. Date or Time values are often read as String, and this node converts them into the appropriate type object.
In the next section, we want to consolidate all this customer information, starting with the web activity before and after the migration.
In these two datasets, the columns describing web activity have different names: First_WebActivity_ and
First(WebActivity). Let's standardize them to the same name: First_WebActivity_.
This is what the Column Rename node does:
Figure 2.7 – The Column Rename node and its configuration window
The configuration window of the Column Rename node lists all the columns from the input data table on the left. Double-clicking
on a column opens a frame on the right showing the current column name and requiring the new name and/or new type. All the nodes
we have introduced in this section can be seen in the workflow in Figure 2.13.
Now, we are ready to concatenate the two web activity datasets and join all the other datasets by their CustomerKey values.
The web activity dataset from the new web system comes from the SQLite database and consists of three columns:
CustomerKey, First_WebActivity_, and Count. Count is just a progressive number associated with
the data rows. It is not important for the upcoming analysis. We can decide later whether to remove it or keep it.
It would be nice to have both rankings for the web activity, from the old and the new system, together in one single data table. For this,
we use the Concatenate node. Two input data tables are placed together in the same output data table. Data cells belonging to
columns with the same name are placed in the same output column. Data columns existing in only one of the tables can be retained
(union of columns) or removed (intersection of columns), as set in the node configuration window. The node configuration window
also offers a few strategies to deal with rows with the same row IDs existing in both input tables.
We concatenated the two web activity data tables and kept the union of data columns in the output data table.
IMPORTANT NOTE
The Concatenate node icon shows three dots in its lower-left corner. Clicking these three dots gives you the chance to add more input
ports and therefore to concatenate more input data tables.
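The pandas counterpart of this operation is pd.concat, where join="outer" corresponds to the union of columns and join="inner" to the intersection. A sketch, with the two table variable names assumed:

import pandas as pd

# Stack the old and new web activity tables on top of each other.
# join="outer" keeps the union of columns (missing cells become NaN).
web_activity = pd.concat([web_activity_old, web_activity_new],
                         join="outer", ignore_index=True)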
Let's now move on to the sentiment analysis data. SentimentAnalysis.table produced a data table with
CustomerKey and SentimentAnalysis columns. SentimentAnalysis includes the customer's
sentiment toward the company and product, expressed as text. SentimentRating.csv produced a data table with two
columns: SentimentAnalysis and SentimentRating. Both columns express the customer sentiment: one
in text and one in ranking ordinals. This is a mapping data table, translating text into ranking sentiment and vice versa. Depending on
the kind of analysis we will run, we might need the text expression or the ranking expression. So, to be on the safe side, let's join these
two data tables together to have them all, CustomerKey, SentimentAnalysis (text), and
SentimentRating (ordinals), in one data table only. This is obtained with the Joiner node.
The Joiner node joins data cells from two input data tables together into the same data row, according to a key value. In our case, the
key values are provided by the SentimentAnalysis columns present in both input data tables. So, each customer
(CustomerKey) will have the SentimentAnalysis text value and the corresponding
SentimentRating value. The Joiner node offers four different join modes: inner join (intersection of key values in the two tables), left outer join (all key values from the left/top table), right outer join (all key values from the right/bottom table), and full outer join (all key values from both tables).
In Figure 2.8 , you can find the two tabs of the configuration window of the Joiner node:
Figure 2.8 – Configuration window of the Joiner node: the Joiner Settings and Column Selection tabs
The configuration window of the Joiner node includes two tabs: Joiner Settings and Column Selection. The Joiner Settings tab exposes for selection the join mode and the data columns containing the key values for both input tables. The Column Selection tab sets the columns from both input tables to retain when building the final joined data rows. A few additional options are available to deal with columns with the same names in the two tables and to set what to do with the key columns after the join is performed.
IMPORTANT NOTE
There can be more than one level of key columns for the join. Just select the + button in the Joiner Settings tab to add more key
columns. If you have more than one level of key columns, you can decide whether a join is performed if all key values match (Match
all of the following ) or if just one key value matches (Match any of the following ), as set in the top radio buttons (Figure 2.8
on the left).
We joined the two sentiment tables using SentimentAnalysis as the key column in both tables and using a left outer
join. The left outer join includes all key values from the left (upper) table (the customer table) and therefore makes sure that all
sentiment values for all customers are retained in the output data table.
After joining CustomerKey with all the sentiment expressions, we will perform other similar join operations, multiple times, in
cascade, using CustomerKey as the key column, to collect together the different pieces of data for the same customers in one
single table ( Figure 2.13).
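In pandas terms, the Joiner node corresponds to a merge, with the four join modes mapping onto how="inner"/"left"/"right"/"outer". A sketch of the left outer join described above, with the table variable names assumed:

import pandas as pd

# Left outer join on the SentimentAnalysis key column:
# keep all rows from the left (customer) table.
sentiment = pd.merge(sentiment_analysis, sentiment_rating,
                     on="SentimentAnalysis", how="left")

# The subsequent joins in cascade would use CustomerKey as the key:
# customers = pd.merge(customers, sentiment, on="CustomerKey", how="left")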
If we inspect the output produced by the File Reader node on the Demographics.csv file, we notice two data columns
that are also provided by other files: WebActivity and
SentimentRating. They are old columns and should be
substituted with the same columns from the SentimentAnalysis.table file and the web activity files. We could
remove these two columns in the Column Selection tab of the Joiner node. Alternatively, we can just filter those two columns out
with a dedicated node.
Let's see how to filter columns and rows out of a data table. To filter columns, we use the Column Filter node. Its configuration window offers three ways to define the columns to keep:
Manually select the columns to include or exclude (Manual Selection).
Use a wildcard or a Regex expression to match the names of the columns to exclude or to keep (Wildcard/Regex Selection).
Define one or more data types for the columns to include or exclude (Type Selection).
All these options are available at the top of the configuration window of the Column Filter node. Selecting one of them changes the configuration window according to the required settings for that option. Here are the options in more detail:
Manual Selection : Provides an include/exclude framework to move columns from one frame to the other to include or exclude
input columns from the output data table (Figure 2.9 ).
Wildcard/Regex Selection : This option provides a textbox to enter the expression to match the column names. Wildcard
expressions use * for joker characters; for example, R* indicates all words starting with R, R*a indicates all words starting with
R and ending with a, and so on. Regex refers to regular expressions.
Type Selection : This option provides a multiple choice for the data types of the columns to include.
The configuration window of the Column Filter node is shown in Figure 2.9 :
Figure 2.9 – The Column Filter node and its configuration window
So far, we have been filtering data by columns. The other flavor for data filtering is by rows. In this case, we want to remove or keep
just some of the data rows in the table. For example, still working on the data from the Demographics.csv file, we might
want to keep only the men in the dataset or remove all records with CustomerKey 11177 . For this kind of filtering operation, there
are many different nodes: Row Filter , Row Filter (Labs) , Rule-based Row Filter , Reference Row Filter , Date&Time
based Row Filter , and more:
The Row Filter node is very simple and very powerful: on the right, the filtering condition and on the left the filtering mode.
Filtering Condition matches the content of cells in a data column with a condition. The input data column to match is selected at
the top. The condition can consist of pattern matching , including wildcards and regex in the pattern expression; range
checking , which is useful for numerical columns; and missing value matching .
Filtering Mode on the left sets whether to include or exclude the matching rows, matching by attribute value, row number, or
RowID :
Figure 2.10 – The Row Filter node and its configuration window
Here, we filter out, using the Exclude option on the left, all data rows where the CustomerKey attribute has a value of
11177.
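Both filtering flavors take one line each in pandas. A sketch, assuming the demographics table from the earlier sketches and a numeric CustomerKey column:

# Column filter: drop the two outdated columns.
demographics = demographics.drop(columns=["WebActivity", "SentimentRating"])

# Row filter: exclude all rows where CustomerKey is 11177.
demographics = demographics[demographics["CustomerKey"] != 11177]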
A similar result could have been obtained using a Reference Row Filter node. The Reference Row Filter node has two input
ports. It matches the rows in the top table with the rows in the lower table according to the cell content in the set columns. Matching
rows will be excluded or included according to the node configuration settings. In the workflow in Figure 2.13, we feed the value
11177 into the lower port of the Reference Row Filter node from a Table Creator node.
The Table Creator node is an interesting node for temporary small data. It plays the role of an internal spreadsheet in which to store a few lines of data. Let's now move on to aggregating data.
Figure 2.11 – The two tabs of the GroupBy node's configuration window: Groups and Manual Aggregation
The GroupBy node isolates groups of data and calculates some measures on these groups, such as simple count, average, variance, percentages, and others. Identification of the groups happens in the tab named Groups of the configuration window; measure setting happens in one of the other tabs (Figure 2.11).
In the Groups tab, you select the data columns whose value combinations define the different groups of data. The node then creates
one row for each group. For example, selecting the Gender column as the group column with distinct values of male and female
means to identify those groups of data with Gender as male or female . Selecting the Gender (male /female ) and MaritalStatus
(single /married ) columns as group columns means to identify the single-female , single-male , married-female , and married-
male groups.
Then, we need to select the measures we want to provide for these groups. Here we can proceed by doing the following:
Manually selecting the columns and the measures to apply one by one (Manual Aggregation)
Selecting the columns based on a pattern, including wildcard or Regex expressions, and the measures to apply to each set of columns (Pattern Based Aggregation)
Selecting the columns by type and the measures to apply to each set of columns (Type Based Aggregation)
Each measure setting mode has its own tab in the configuration window (Figure 2.11 ). In the Manual Aggregation tab, we set the
simple Count measure on the CustomerKey column and the Mean measure on the Age column. For Gender as the
group column, we then get the number and the average age of women and men in the input table.
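The same aggregation, sketched with pandas for comparison:

# One output row per Gender group, with the count of CustomerKey
# values and the average Age in each group:
stats = demographics.groupby("Gender").agg(
    count=("CustomerKey", "count"),
    mean_age=("Age", "mean"),
)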
IMPORTANT NOTE
The GroupBy node offers a large number of measures. We have seen Count and Mean . However, we could have also used
percentage, median, variance, number of missing values, sum, mode, minimum, maximum, first, last, kurtosis, concatenation of
(distinct) values, correlation, and more. It is worth taking some time to investigate all the measurement methods available within the
GroupBy node.
Similar to the GroupBy node is the Pivoting node. The Pivoting node also identifies groups in the data and provides some aggregation measures on selected data columns for each group. The difference from the GroupBy node lies in the shape of the result.
Let's take the example of the Gender (Group ) and MaritalStatus (Pivot ) groups and the Count measure applied
to the CustomerKey data column. The final result is a table with male /female as the row IDs, married /single as the
column headers, and the count of occurrences of each combination as the cell content.
This means that the distinct values in the group columns generate rows and the distinct values in the pivoting columns generate
columns.
The configuration window of the Pivoting node then has three tabs: Groups to select the group columns, Pivots to select the pivoting columns, and Manual Aggregation to manually select data columns and the measures to calculate on them. If more than one manual aggregation is used, the resulting pivot table has one column for each combination of aggregation method and pivot value.
In addition, the node returns the total aggregation based on only the group columns on the second output port and the total aggregation
based on only the pivoted columns at the third output port.
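The Gender/MaritalStatus example above, sketched with the pandas pivot_table function:

import pandas as pd

# Gender values become the rows, MaritalStatus values become the
# columns, and each cell holds the count of CustomerKey occurrences:
pivot = pd.pivot_table(demographics,
                       index="Gender",           # group column
                       columns="MaritalStatus",  # pivot column
                       values="CustomerKey",
                       aggfunc="count")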
Let's move on now to a few more very flexible and very powerful nodes to perform data transformation.
The String Manipulation node applies transformations on string values in data cells. The transformations are listed in the Function
panel in the node configuration window (Figure 2.12 ). There, you can see the function and its possible syntaxes. If you select a
function in the list, in the panel on the right, named Description , a full description of the function task and syntax appears. The
transformation, however, is implemented in the Expression editor at the bottom of the window.
First, you select (double-click) a transformation from the Function list, then you populate it with the appropriate arguments in the
Expression editor. Arguments of a function can be constant strings, thus enclosed in "", or values from other columns in the input
data table. Values from columns are inserted automatically with the right syntax with a double-click on the column name in the
Column List panel on the left.
Let's take an example:
1. In the data table resulting from the GroupBy node, we got two data rows: one for male (M) and one for female (F), containing the number of occurrences and the average age for each group (M/F). Let's change "M" to "Male" and "F" to "Female".
2. Then, we would use the replace(str, search, replace) function, where str indicates the column
to work on, search the string to search in the cell value, and replace the string to use as a replacement. Double-clicking
on the Gender column in the Column List panel and completing the expression by hand, we end up with the following
expression:
replace($Gender$, "M", "Male")
3. We get the following in a subsequent node:
Figure 2.12 – The String Manipulation node and its configuration window
It is also possible to nest transformation functions. If, for example, we want to standardize all cells to make sure they contain the capital letters "M" or "F" before applying the replace() transformation, we would nest the upperCase(str) function in it and end up with the following expression:
replace(upperCase($Gender$), "M", "Male")
IMPORTANT NOTE
In the String Manipulation (Multi Column) node, if we want to apply the same expression to all columns selected in the upper
part of the configuration window, we need to use the $$CURRENTCOLUMN$$ general column name in the Expression
editor. The very large number of string transformations in the Function list makes this node extremely powerful.
A node very similar to the String Manipulation node, even though working on a different task, is the Math Formula node. The Math
Formula node implements a mathematical expression on the input data. Besides that, it works exactly the same as the String
Manipulation node. In the configuration window, the available math functions are listed in the central Function panel. If a function
from the list is selected, the description appears in the Description panel. The final expression is crafted in the Expression editor at
the bottom. Insertion of column names in the Expression editor happens by double-clicking the column name in the Column List
panel on the left. Nested mathematical functions are possible.
The Math Formula (Multi Column) node extends the Math Formula node to apply the same formula onto many selected
columns.
Figure 2.13 shows the final workflow containing all the operations described in this chapter, which is also available on the KNIME
Hub: https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%202/:
Figure 2.13 – Workflow that summarizes some data access, data conversion, and data transformation nodes available in KNIME
Analytics Platform
So far, we have seen static transformations on data. What about having a different transformation for different conditions? Let's take
the Row Filter node. Today, I might want to filter out the female occurrences from the data table, while tomorrow the male ones.
How can I do that without having to change the configuration settings for all involved nodes at every run? The time has come to
introduce you to Flow Variables .
Parameterizing the Workflow
Let's consider a simple workflow: read the Demographics.csv file, filter all data rows with Gender = M or
F, and replace M or F with Male or Female, respectively. Once we have decided whether to work on M or F, the workflow
becomes quite simple and includes a File Reader , Row Filter , and String Manipulation node with the replace()
function:
1. Let's add one node that allows us to choose whether to work on M or F records: the String Configuration node. This node
generates a flow variable. A flow variable is a parameter that travels with the data flow along the workflow branch and it can be used
to overwrite settings in other nodes.
2. As far as we are concerned, for now, two settings are important in the configuration window of this node: the default value and the
variable name. Let's use default value M for now, to work with Gender = M records, and let's name the flow variable
gender_variable.
3. Executing the node creates a Flow Variable named gender_variable with value M:
4. Now, let's use the value of the Flow Variable to overwrite the filter setting in the Row Filter node. In the configuration window of the Row Filter node (Figure 2.10), on the right of the Use pattern matching setting, there is a button with a V on it. Through this button, we can overwrite the setting with the value contained in one of the available flow variables.
Alternatively, if the configuration setting does not display this button, you can overwrite the setting via the Flow Variables tab at
the top of the configuration window. In this tab, search for your setting, and in the corresponding empty space, select the Flow
Variable to use. In our case, we overwrote the Use pattern matching setting with the Flow Variable
gender_variable via the V button.
Did you notice the red connection between the String Configuration node and the Row Filter node? This is a Flow Variable
connection. Flow variables are injected into nodes and branches via these connections.
IMPORTANT NOTE
All nodes have hidden red circle ports for the input and output of flow variables. Clicking on the flow variable port of a node and
releasing on another node brings out the hidden flow variable port and connects the nodes. Alternatively, in the context menu of
each node, the Show Flow Variable Ports option makes them visible.
5. After that, we create a small table with two rows, [M, Male] and [F, Female]. We select the row corresponding to the value in the gender_variable flow variable, and we aim to replace the M or F character with the corresponding text. For this last part, we need to replace the hardcoded strings in the String Manipulation node with the current text values. We already have the M or F character as a Flow Variable.
6. Now, we transform the Male/Female text into a new flow variable. We do this via the Table Column To Variable node. This node converts the values from a table column into flow variables, with the row IDs as the variable names and the values in the selected column as the variable values.
At this point, the String Manipulation node sees both Flow Variables: gender_variable, generated by the String Configuration node, and Row0, generated by the Table Column To Variable node. So, we can use the following syntax to perform the replacement operation alternatively with [M, Male] or [F, Female], depending on what has been selected in the String Configuration node:
replace($Gender$, $${Sgender_variable}$$, $${SRow0}$$)
Notice the difference in syntax in the expression when dealing with columns ($Gender$) and flow variables
($${Sgender_variable}$$). Also, flow variables can be inserted automatically and with the right syntax in the
Expression editor, by double-clicking on the flow variable name in the Flow Variable List panel on the left of the String
Manipulation node's configuration window (Figure 2.12 ).
The benefit of using flow variables is clear. When we decide to use F instead of M, we just change the setting in the String
Configuration node instead of checking and changing the setting in every single node.
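The idea behind flow variables is plain parameterization: define a value once, in one place, and let every downstream step read it from there. A minimal Python sketch of the same M/Male versus F/Female logic, with the demographics table assumed from the earlier sketches:

# Change this one value to re-run the whole pipeline on the other group:
gender_variable = "M"                   # plays the role of the String Configuration node
labels = {"M": "Male", "F": "Female"}   # plays the role of the Table Creator node

rows = demographics[demographics["Gender"] == gender_variable]                # Row Filter
rows = rows.replace({"Gender": {gender_variable: labels[gender_variable]}})  # String Manipulation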
We have shown only a small fraction of the nodes dealing with flow variables. You can explore more of these nodes in the Workflow
Control/Variables category in the Node Repository panel.
Summary
We do not have space in this book to describe more of the many nodes available in KNIME Analytics Platform. We will leave this
exploratory task to you.
KNIME Analytics Platform includes more than 2,000 nodes and covers a large variety of functionalities. However, the factotum nodes that cover most situations are much fewer in number: for example, File Reader, Row Filter, GroupBy, Joiner, Concatenate, Math Formula, String Manipulation, Rule Engine, and a few more. We have described most of them in this chapter to give you a solid basis to build more complex workflows for deep learning, which we will do in the coming chapters.
Questions and Exercises
Check your level of understanding of the concepts presented in this chapter by answering the following questions:
c) By using the File Reader node and the allow short lines enabled option
d) By using the File Reader node and the Limit Rows enabled option
b) By using the Row Filter node and pattern matching =42 with the Include option on the right
c) By using the Row Filter node and range checking on, lower boundary 42, with the Include option on the right
d) By using the Row Filter node and range checking on, lower boundary 42 , with the Exclude option on the right
3. How can I find the average sentiment rating for single women?
a) By using a GroupBy node with Gender and MaritalStatus as group columns and the mean operation on the SentimentRating column
b) By using a GroupBy node with Gender as the group column and a count operation on the CustomerKey column
c) By using a GroupBy node with CustomerKey as the group column and a concatenate operation on the SentimentAnalysis column
d) By using a GroupBy node with MaritalStatus as the group column and a percent operation on the SentimentRating column
4. Why do we need flow variables?
Chapter 3: Getting Started with Neural Networks
We will start with the basic concepts of neural networks and deep learning: from the first artificial neuron as a simulation of the
biological neuron to the training of a network of neurons, a fully connected feedforward neural network, using a backpropagation
algorithm.
We will then discuss the design of a neural architecture as well as the training of the final neural network. Indeed, when designing a
neural architecture, we need to appropriately select its topology, neural layers, and activation functions, and introduce some techniques
to avoid overfitting.
Finally, before training, we need to know when to use which loss function and which parameters have to be set for training. This will be described in the last part of this chapter.
Neural Networks and Deep Learning – Basic Concepts
All you hear about at the moment is deep learning. Deep learning stems from the traditional discipline of neural networks, in the realm
of machine learning.
The field of neural networks has gone through a number of stop-and-go phases: from the early excitement for the first perceptron in the '60s and the subsequent lull when it became evident what the perceptron could not do; through the renewed enthusiasm for the backpropagation algorithm applied to multilayer feedforward neural networks and the subsequent lull when it became apparent that training recurrent networks required hardware capabilities that were not available at the time; right up to today's new deep learning paradigms, units, and architectures running on much more powerful, possibly GPU-equipped, hardware.
Let's start from the beginning and, in this section, go through the basic concepts behind neural networks and deep learning. While these
basic notions might be familiar to you, especially if you have already attended a neural networks or deep learning course, we would
still like to describe them here as a reference for descriptions of KNIME deep learning functionalities in the coming chapters, and for
anyone who might be a neophyte to this field.
Figure 3.1 – On the left, a biological neuron with inputs xj at dendrites and output y at the axon terminal (image from Wikipedia). On
the right, an artificial neuron (perceptron) with inputs xj connected to weights wj and producing output y
The simplest simulation tries to reproduce a biological neuron with just two inputs and one output, like on the right in Figure 3.1. The input signals at the dendrites are now called $x_1$ and $x_2$ and reach the soma of the artificial cell via two weights, $w_1$ and $w_2$, simulating the chemical reactions in the synapses. If the total input signal reaching the soma is higher than a given threshold $\theta$, simulating the "high enough" concept, an output signal is generated. This simple artificial neuron is called a perceptron.
Two details to clarify here: the total input signal and the threshold function. There are many neural electrical input-output voltage
models (Hodgkin, A. L.; Huxley, A. F. (1952), A quantitative description of membrane current and its application to conduction and
excitation in nerve, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1392413, The Journal of Physiology. 117 (4): 500–544). The
simplest way to represent the total input signal uses a weighted sum of all input signals, where the weights represent the role of the synapse reactions. The firing function of the neuron soma can be described via a step function. Thus, for our simplified simulation in Figure 3.1, the output is calculated as follows:

$y = f(w_1 x_1 + w_2 x_2)$

with $f(a) = 1$ if $a > \theta$ and $f(a) = 0$ otherwise.

Generalizing to a neuron with $n$ input signals and with any other activation function $f$, we get the following formula:

$y = f\left(\sum_{j=0}^{n} w_j x_j\right)$

Here, the threshold $\theta$ has been transformed into a weight $w_0 = -\theta$ connected to an input signal $x_0$ that is constantly on – that is, constantly set to 1.
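A perceptron is only a few lines of NumPy. This sketch folds the threshold into the bias weight $w_0$, exactly as described above; the AND weights are one illustrative choice, not taken from the book:

import numpy as np

def perceptron(x, w):
    # w[0] is the bias weight (the former threshold), paired with a
    # constant input of 1; w[1:] are the weights of the real inputs.
    a = w[0] * 1.0 + np.dot(w[1:], x)  # weighted sum of all input signals
    return 1.0 if a > 0 else 0.0       # step activation function

# Example: weights implementing the logical AND of two binary inputs.
w_and = np.array([-1.5, 1.0, 1.0])
print(perceptron(np.array([1, 1]), w_and))  # -> 1.0
print(perceptron(np.array([0, 1]), w_and))  # -> 0.0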
However, one single artificial neuron, just like one single biological neuron, does not have high computational capability. It can
implement just a few simple functions, as we will see later in the next sub-section, Understanding the need for hidden layers. As in the
biological world, networks of neurons have a much bigger computational potential than one single neuron alone. Networks of
biological neurons, such as even simple brains, can learn and carry out very complex tasks. Similarly, networks of artificial neurons can
learn and carry out very complex tasks. The key to the success of neural networks is this flexibility in forming more or less complex
architectures and in training them to perform more or less complex tasks.
An example of a network of perceptrons is shown in Figure 3.2. This network has three layers of neurons: an input layer accepting the input signals $x_j$; a first hidden layer with two neurons connected to the outputs of the input layer; a second hidden layer with three neurons connected to the outputs of the first hidden layer; and finally an output layer with one neuron only, fed by the outputs of the hidden layer and producing the final output $y$ of the network. Neurons are indicated by a circle including the symbol $\Sigma$ for the weighted sum of the input signals and the symbol $f$ for the activation function:
Figure 3.2 – On the left, a network of biological neurons (image from Wikipedia). On the right, a network of artificial neurons (multi-
layer perceptron)
Notice that in this particular architecture, all connections move from the input to the output layer: this is a fully connected feedforward
architecture. Of course, a feedforward neural network can have as many hidden layers as you want, and each neural layer can have as many artificial neurons as you want. A feedforward network of perceptrons is called a Multi-Layer Perceptron (MLP).
Neural layers are numbered progressively from the input layer to the output layer. This progressive numbering of the neural layers is also contained in the weight and hidden unit notations: $y_i^{(l)}$ is the output value of neural unit $i$ in the $l$-th (hidden) layer, and $w_{ij}^{(l)}$ is the weight connecting neural unit $j$ in layer $l-1$ with neural unit $i$ in layer $l$.
IMPORTANT NOTE
Notice that, in this notation, $(l)$ in $y_i^{(l)}$ and in $w_{ij}^{(l)}$ are NOT exponents. They just describe the network layer of the output unit for $y$ and the destination layer for $w$.
Figure 3.3 – Fully connected feedforward neural network with just one hidden layer and one output unit
There are many types of activation functions. We have seen the step function in the previous section, which however has a main flaw: it is neither continuous nor differentiable. Some similar activation functions have been introduced over the years that are easier to handle, since they are continuous and differentiable everywhere. Common examples are the sigmoid, $\sigma(a)$, and the hyperbolic tangent, $\tanh(a)$. Recently, a new activation function, named Rectified Linear Unit (ReLU), has been introduced, which seems to perform better with fully connected feedforward neural networks with many hidden layers. We will describe these activation functions in detail in the coming chapters.
IMPORTANT NOTE
Usually, neurons in the same layer have the same activation function. Different layers, though, can have different activation functions.
Another parameter of a network is its topology , or architecture. We have seen a fully connected feedforward network, where all
connections move from the input toward the output and, under this constraint, all units are connected to all units in the next layer.
However, this is of course not the only possible neural topology. Cross-connections within the same layer, backward connections from a later layer to an earlier layer, and autoconnections of a single neuron with itself are also possible.
Different connections and different architectures produce different data processing functions. For example, autoconnections introduce a time component, since the current output of neuron $i$ at time $t$ will be an additional input for the same neuron at time $t+1$; a feedforward neural network with as many outputs as inputs can implement an autoencoder and be used for compression or for outlier
detection. We will see some of these different neural architectures and the tasks they can implement later in this book. For now, we just
give you a little taste of possible neural topologies in Figure 3.4:
The first network from the left in Figure 3.4 has its neurons all completely connected, so that the definition of the layer becomes
unnecessary. This is a Hopfield network and is generally used as an associative memory.
The second network is a feedforward autoencoder: three layers, with as many input units as output units ($n$), and a hidden layer with $m$ units, where usually $m < n$; this network architecture has been adopted for outlier detection or to implement a dimensionality reduction of the input space.
Finally, the third network presents units with autoconnections. As said before, autoconnections introduce a time component within the
function implemented by the network and therefore are often adopted for time series analysis. This last network qualifies as a recurrent
neural network.
Let's go back to fully connected feedforward neural networks. Now that we've seen how they are structured, let's try to understand why
they are built this way.
Understanding the need for hidden layers
A single perceptron implements a linear discriminant surface and can therefore solve only linearly separable problems. Classic examples are the OR and AND problems, which can be solved by a line separating the "1" outputs from the "0" outputs. Therefore, a perceptron can implement a solution to both problems. However, it cannot implement a solution to the XOR problem.
The XOR function outputs "1" when the two inputs are different (one is "0" and one is "1") and outputs "0" when the two inputs are
the same (both are "0" or both are "1"). Indeed, the XOR operator is a nonlinearly separable problem and one line only is not sufficient
to separate the "1" outputs from the "0" outputs ( Figure 3.5):
Figure 3.5 – A perceptron implements a linear discriminant surface, which is a line in a two-dimensional space. All linearly separable
problems can be solved by a single perceptron. A perceptron cannot solve non-linearly separable problems
The only possibility to solve the XOR problem is to add one hidden layer with two units into the perceptron architecture, making it into
an MLP ( Figure 3.6). The two hidden units in green and red each implement one line to separate some "0"s and "1"s. The one unit in
the output layer then builds a new line on top of the two previous lines and implements the final discriminant:
Figure 3.6 – One additional hidden layer with two units enables the MLP to solve the XOR problem
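To make the XOR construction concrete, here is a sketch with hand-set weights: the two hidden units implement OR and NAND (each a single line in the input plane), and the output unit fires only when both are active, which is exactly XOR. This particular choice of lines is one of many; the book's figure may use different ones:

import numpy as np

def step(a):
    return (a > 0).astype(float)

def mlp_xor(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: OR(x1, x2)
    h2 = step(-x1 - x2 + 1.5)   # hidden unit 2: NAND(x1, x2)
    return step(h1 + h2 - 1.5)  # output unit: AND(h1, h2) = XOR(x1, x2)

x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
print(mlp_xor(x1, x2))  # -> [0. 1. 1. 0.]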
The example in Figure 3.7 shows a three-layer network: one input layer receiving the input values $x_1$ and $x_2$, one hidden layer with two units, and one output layer with one unit only. The two hidden units implement two discrimination lines, one for the red unit and one for the orange unit. The output unit implements a discrimination line on top of these two, which identifies the green area in the $(x_1, x_2)$ plane shown in Figure 3.7:
Figure 3.7 – The network on the left fires up only for the points in the green zone in the input space, as depicted on the right
As you see, adding just one hidden layer makes the neural network much more powerful in terms of possible functions to implement.
However, there is more. The Universal Approximation Theorem states that a simple feedforward network with a single hidden layer and a sufficient number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function and assuming that the network has been sufficiently trained (Hornik K., Stinchcombe M., White H. (1989), Multilayer feedforward networks are universal approximators, Neural Networks, Vol. 2, Issue 5, Pages 359-366: http://www.sciencedirect.com/science/article/pii/0893608089900208). This
theorem proves that neural networks have a kind of universality property. That is, any function can be approximated by a sufficiently
large and sufficiently trained neural network. Sufficiently large refers to the number of neurons in a feedforward network. In addition,
the cited paper refers to network architectures with just one single hidden layer with enough neurons.
Even very simple network architectures, thus, can be very powerful! Of course, this is all true under the assumption of a sufficiently
large hidden layer (which might become too large for a reasonable training time) and a sufficient training time.
"A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly" (Goodfellow I., Bengio Y., Courville A. (2016). Deep Learning, MIT Press).
We have seen that introducing one or more hidden layers to a feedforward neural network makes it extremely powerful. Let's see how
to train it.
Training a neural network means showing it examples from the training set repeatedly and each time adjusting the parameter values, the weights, so as to minimize a loss function calculated on the desired input-output behavior. To find the weights that minimize the loss function, the gradient descent algorithm or variants of Stochastic Gradient Descent (SGD) are used. The idea is to update the weights by taking steps in the direction of steepest descent on the error surface. The direction of steepest descent is equivalent to the negative of the gradient. To calculate the gradient efficiently, the backpropagation algorithm is used. Let's find out how it works.
The math behind backpropagation
A classic loss function for regression problems is the total squared error, defined as follows:

$E = \frac{1}{2} \sum_{p} \sum_{k} (t_{k,p} - y_{k,p})^2$

Here, $t_k$ and $y_k$ are respectively the desired target and the real answer for output unit $k$, and the sum runs on all units $k$ of the output layer and on all examples $p$ in the training set.

If we adopt the gradient descent strategy to reach a minimum in the loss function surface, at each training iteration, each weight of the network must be incremented in the opposite direction of the derivative of $E$ in the weight space (Goodfellow I., Bengio Y., Courville A. (2016). Deep Learning, MIT Press):

$\Delta w_{ij} = -\eta \, \frac{\partial E}{\partial w_{ij}}$

This partial derivative of the error with respect to the weight is calculated using the chain rule:

$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j} \, \frac{\partial y_j}{\partial a_j} \, \frac{\partial a_j}{\partial w_{ij}}$

Here, $E$ is the loss function, $y_j = f(a_j)$ the output of neuron $j$, $a_j = \sum_i w_{ij} \, y_i$ its total input, and $w_{ij}$ its input weight from neuron $i$ in the previous layer.

For the weights connecting to units $k$ in the output layer, the derivatives will be as follows:

$\frac{\partial E}{\partial y_k} = -(t_k - y_k), \qquad \frac{\partial y_k}{\partial a_k} = f'(a_k), \qquad \frac{\partial a_k}{\partial w_{jk}} = y_j$

So, finally:

$\frac{\partial E}{\partial w_{jk}} = -(t_k - y_k) \, f'(a_k) \, y_j$

Therefore, the weight change for weights connecting to output units is as follows:

$\Delta w_{jk} = \eta \, \delta_k \, y_j$

Here, $\delta_k = (t_k - y_k) \, f'(a_k)$, $y_j$ is the input to the output node $k$, and $\eta$ is the learning rate.

For the weights connecting to the units in a hidden layer, the calculation of the derivative, and therefore of the weight change, is a bit more complicated. While the last two derivatives remain the same also when referring to neurons in hidden layers, $\frac{\partial E}{\partial y_j}$ will need to be recalculated.

If we consider the loss function as a function of all input sums $a_m$ to all neurons $m$ in the next layer connected to neuron $j$, we obtain the following recursion:

$\delta_j = f'(a_j) \sum_{m} \delta_m \, w_{jm}$

The update formula for all weights, leading to output or hidden neurons, is this:

$\Delta w_{ij} = \eta \, \delta_j \, y_i$

This recursive formula tells us that $\delta_j$ for unit $j$ in the hidden layer can be calculated as the linear combination of all $\delta_m$ in the next layer, which will be $\delta_k = (t_k - y_k) f'(a_k)$ if this is the output layer or $\delta_m = f'(a_m) \sum_q \delta_q w_{mq}$ if this is another hidden layer. This means that moving from the output layer backward toward the input layer, we can calculate all $\delta$s, starting from $\delta_k$ and then through all $\delta_j$, as a combination of the $\delta$s computed in the previous step, layer after layer. Together with the $\delta$s, we can also calculate all weight updates $\Delta w_{ij}$.
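To ground the formulas, here is a sketch of one backpropagation step for a tiny network with a single hidden layer and sigmoid activations, following the delta recursion above. It processes a single training example, while the book's description accumulates the updates over the whole training set; the network sizes and values are arbitrary illustrations:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Tiny network: 2 inputs -> 3 hidden units -> 1 output unit.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))   # hidden-layer weights
W2 = rng.normal(size=(1, 3))   # output-layer weights
x = np.array([0.5, -1.0])      # input example
t = np.array([1.0])            # desired target
eta = 0.1                      # learning rate

# Forward pass.
a1 = W1 @ x;  y1 = sigmoid(a1)   # hidden layer
a2 = W2 @ y1; y2 = sigmoid(a2)   # output layer

# Backward pass: deltas from the output layer back to the hidden layer.
# For the sigmoid, f'(a) = y * (1 - y).
delta2 = (t - y2) * y2 * (1 - y2)         # delta_k = (t_k - y_k) f'(a_k)
delta1 = (W2.T @ delta2) * y1 * (1 - y1)  # delta_j = f'(a_j) sum_k delta_k w_jk

# Weight updates: Delta w = eta * delta * input to that weight.
W2 += eta * np.outer(delta2, y1)
W1 += eta * np.outer(delta1, x)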
The Idea Behind Backpropagation
So, the training of a feedforward neural network can be seen as a two-step process:
1. All training vectors are presented, one after the other, to the input layer of the network, and the signal is propagated throughout all
network connections (and weights) till the output layer. After all of the training examples have passed through the network, the total
squared error is calculated at the output layer as the sum of the single squared errors. This is the forward pass :
Figure 3.8 – In the forward pass of the backpropagation algorithm, all training examples are presented at the input layer and
forward-propagated through the network till the output layer, to calculate the output values
2. All $\delta_k$ are calculated for all units $k$ in the output layer. Then, the $\delta$s are backpropagated from the output layer through all network connections (and weights) till the input layer, and all $\delta_j$ in the hidden layers are also calculated. This is the backward pass:
Figure 3.9 – In the backward pass of the backpropagation algorithm, all $\delta$s are calculated at the output layer and backpropagated through the network till the input layer. After all examples from the training set have passed through the network forth and back, all weights are updated
This algorithm is called backpropagation, as a reference to the $\delta$s backpropagating through the network during the second pass. After all the training data has passed through the network forth and back, all weights are updated.
Also notice the first derivative $f'(a_j)$ of the unit activation function in the expression of $\delta_j$. Of course, using a continuous, differentiable function helps with the calculations. This is the reason why the sigmoid and $\tanh$ functions have been so popular with neural architectures.
The gradient descent algorithm is not guaranteed to reach the global minimum of the error function, but it often ends up in a local
minimum. If the local minimum does not ensure satisfactory performance of the network, the training process must be repeated starting
from new initial conditions, meaning new initial values for the weights of the network.
Neural networks are very powerful in implementing input-output models and very flexible in terms of architecture and parameters. It is
extremely easy to build huge neural networks, by adding more and more neurons and more and more hidden layers. Besides the longer
training times, an additional risk is to run quickly into the overfitting of the training data. Overfitting is a drawback of too complex
models, usually with too many free parameters to fit a simple task. The result of an over-dimensioned model for a simple task is that the
model, at some point, will start using the extra parameters to memorize noise and errors in the training set, considerably worsening the
model's performance. The power and flexibility of neural networks make them prone to overfitting, especially if we are dealing with
small training sets.
IMPORTANT NOTE
Another big objection that has been leveled against neural networks since their introduction is their non-interpretability. The adjustment
of the weights has no correspondence with any entity in the data domain. When dealing with neural networks, we need to accept that
we are dealing with black boxes and we might not understand the decision process.
If interpretability is a requirement for our project, then maybe neural networks are not the tool for us. A few techniques have been
proposed recently to extract knowledge on the decision process followed in black-box models, such as Shapley values or Partial Dependence Plots (Molnar C., Interpretable Machine Learning, https://christophm.github.io/interpretable-ml-book/index.html, GitHub). They are currently in their infancy and not immune from criticism. However, they constitute an interesting
attempt to fix the interpretability problem of neural networks. These are beyond the scope of this book, so we will not be exploring
them in any more detail.
With the basic theory covered, let's get into the design of a network.
Designing your Network
In the previous section, we learned that neural networks are characterized by a topology, weights, and activation functions. In
particular, feedforward neural networks have an input and an output layer, plus a certain number of hidden layers in between. While
the values for the network weights are automatically estimated via the training procedure, the network topology and the activation
functions have to be predetermined during network design before training. Different network architectures and different activation
functions implement different input-output tasks. Designing the appropriate neural architecture for a given task is still an active research
field in the deep learning area (Goodfellow I., Bengio Y., Courville A. (2016). Deep Learning, MIT Press).
Other parameters are involved in the training algorithm of neural networks, such as the learning rate or the loss function. We have also seen that the power and flexibility of neural networks make it easy for them to run into the overfitting problem. Would it be possible to contain the weight growth, to change the loss function, or to self-limit the network structure during training, so as to avoid overfitting?
This section gives you an overview of all those remaining parameters: the topology of the network, the parameters in the training
algorithm, the possible activation functions, the loss functions, regularization terms, and more, always keeping an eye on containing the
overfitting effect, making the training algorithm more efficient, and developing more powerful neural architectures.
Each neural layer performs two calculation steps: first, the weighted sums of its input values, $a_i = \sum_j w_{ij} x_j$, for $i = 1, \ldots, m$; and then the activations, $y_i = f(a_i)$. Note that $a_i$ is the weighted sum of the input values to the $i$-th neuron and $\mathbf{a} = (a_1, \ldots, a_m)$ is the vector of all weighted input sums.
A network can then also be seen as a chain of functions $f^{(1)}, f^{(2)}, \ldots, f^{(L)}$, where each function implements a neural layer. Depending on the network architecture, each neural layer has different input values and uses a different activation function $f$, and therefore implements a different function $f^{(l)}$, using the two calculation steps described previously.
The complexity of the total function implemented by the full network also depends on the number of layers involved; that is, it
depends on the network depth.
A layer where all neurons are connected to all outputs of the previous layer is called a dense layer. Fully connected feedforward networks are just a chain of dense layers, where each layer has its own activation function. In feedforward neural networks, then, a layer function $f^{(l)}$ is defined by the number of the layer's neurons, the number of inputs, and the activation function. The key difference between layers is then the activation function. Let's look at the most commonly used activation functions in neural networks.
Sigmoid Function
The sigmoid function is an S-shaped function with values between 0 and 1. For the $i$-th neuron in the layer, the function is defined as follows:

$y_i = \sigma(a_i) = \dfrac{1}{1 + e^{-a_i}}$
It is plotted on the left in Figure 3.10.
For binary classification problems, this is the go-to function for the output neural layer, as the $(0, 1)$ value range allows us to interpret the output as the probability of one of the two classes. In this case, the output neural layer consists of only one neuron, AKA unit size 1, with the sigmoid activation function. Of course, the same function can also be used as an activation function for output and hidden layers with a bigger unit size:
Figure 3.10 – The sigmoid function (on the left) can be used as the activation function of the single output neuron of a network
implementing the solution to a binary classification problem (in the center). It can be used generically as an activation function for
neurons placed in hidden or output layers in a network (on the right)
One of the biggest advantages of the sigmoid function is its derivability everywhere and its easy derivative expression. Indeed, when using the sigmoid activation function, the weight update rule for the backpropagation algorithm becomes very simple, since the first derivative can be expressed through the function itself: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$.
On the other hand, one of the biggest disadvantages of using sigmoid as the neurons' activation function in more complex or deep neural architectures is the vanishing gradient problem. Indeed, when calculating the derivatives to update the network weights, the chain multiplication of derivative values (< 1) from sigmoid functions might produce very small numbers. In this case, gradients that are too small are produced at each training iteration, leading to slow convergence for the training algorithm.
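You can verify this effect with a few lines of Python (an illustration of ours, not part of the book's workflows): multiplying one sigmoid derivative per layer shrinks the gradient factor rapidly toward zero.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # never larger than 0.25

grad = 1.0
for layer in range(1, 11):
    grad *= sigmoid_prime(0.5)   # one factor < 1 per layer in the chain rule
    print(f"after {layer} layers: gradient factor = {grad:.2e}")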
Tanh Function
Another commonly used S-shaped activation function is the hyperbolic tangent:

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

It takes values between $-1$ and $+1$. Here also, one of the biggest advantages of the tanh function is its continuity and its derivability everywhere, which leads to simpler formulas for the updates of the weights in the training algorithm. tanh also has the advantage of being centered at 0, which can help to stabilize the training process.
Again, one of the biggest disadvantages of using tanh as an activation function in complex or deep neural architectures is the vanishing
gradient problem.
Linear Function
A special activation function is the linear activation function, also known as the identity function:

$f(x) = x$
When would such a function be used? A neural layer with a linear activation function implements a linear regression model.
Sometimes, a neural layer with a linear activation function is also introduced to keep the original network response, before it is
transformed to get the required range or probability score. In this case, the last layer of the network is split into two layers: one with the
linear activation function preserves the original output and the other one applies another activation function for the required output
format.
In Chapter 7, Implementing NLP Applications, where we describe the Generating Product Name case study, this approach is used to
introduce a new parameter called temperature after the linear activation function layer.
ReLU Function
An activation function that helps to overcome the problem of vanishing gradients is the Rectified Linear Unit function, ReLU for short. The ReLU function is like the linear function, at least from 0 on. Indeed, the ReLU function is $0$ for negative values of $x$ and is the identity function for positive values of $x$:

$\mathrm{ReLU}(x) = \max(0, x)$
The ReLU activation function, while helping with the vanishing gradient problem, is not differentiable at $x = 0$. In practice, this is not a problem when training neural networks, as usually one of the one-sided derivatives is used rather than reporting that the derivative is not defined.
Softmax Function
All activation functions introduced until now are functions that have a single value as output. This means only the weighted sum $a_j$ is used to calculate the output value of the $j$th neuron, independently from the weighted sums $a_k$, with $k \neq j$, being used to calculate the outputs of the other neurons in the same layer. The softmax function, on the other hand, works on the whole vector $\mathbf{a}$ of weighted sums and not just on one single value $a_j$.
In general, the softmax function transforms a vector $\mathbf{a} = (a_1, \dots, a_C)$ of size $C$ into a vector $\mathrm{softmax}(\mathbf{a})$ of the same size, with values between $0$ and $1$ and with the constraint that all values sum to $1$. This additional constraint allows us to interpret the components of the output vector as probabilities of the different classes. Therefore, the softmax activation function is often the function of choice for the last neural layer in a multiclass classification problem. The $k$th element of the output vector is calculated as follows:

$\mathrm{softmax}(\mathbf{a})_k = \dfrac{e^{a_k}}{\sum_{j=1}^{C} e^{a_j}}$
Figure 3.13 shows an example network that uses the softmax function in the last layer, where all output values sum up to 1:
Figure 3.13 – A simple neural layer with the softmax activation function
IMPORTANT NOTE
The softmax function is also used by the logistic regression algorithm for multiclass classification problems.
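For reference, the softmax computation can be sketched in a few lines of NumPy. The subtraction of the maximum value is a common numerical-stability trick that we add here; it does not change the result:

import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))   # shift by the max for numerical stability
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
print(softmax(a))         # approximately [0.659 0.242 0.099]
print(softmax(a).sum())   # 1.0, as required for a probability vector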
Variants of ReLU are the Leaky Rectified Linear Unit (LeakyReLU) and the Parametric Rectified Linear Unit (PReLU). LeakyReLU offers an almost-zero line ($\alpha x$, with a small fixed slope $\alpha$) for negative values of the function argument rather than just zero as in the pure ReLU. PReLU makes this line with a parametric slope $\alpha$ rather than the fixed slope of the LeakyReLU. Parameter $\alpha$ then becomes part of the parameters that the network must train.
LeakyReLU: $f(x) = \alpha x$ for $x < 0$ (with small, fixed $\alpha$) and $f(x) = x$ for $x \geq 0$
PReLU: $f(x) = \alpha x$ for $x < 0$ (with learnable $\alpha$) and $f(x) = x$ for $x \geq 0$
ELU: $f(x) = \alpha\,(e^{x} - 1)$ for $x < 0$ and $f(x) = x$ for $x \geq 0$
SELU: $f(x) = \lambda\,\mathrm{ELU}(x)$, with fixed constants $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$
An approximation of the sigmoid activation function is the hard sigmoid activation function . It is faster to calculate than
sigmoid. Despite being an approximation of the sigmoid activation function, it still provides reasonable results on classification tasks.
However, since it's just an approximation, it performs worse on regression tasks:
Hard-Sigmoid: $f(x) = \max(0, \min(1, 0.2\,x + 0.5))$
The SoftPlus activation function is also quite popular. This is a smoothed version of the ReLU activation function:
SoftPlus: $f(x) = \ln(1 + e^{x})$
The images in Figure 3.14 show the plots of the aforementioned activation functions:
Figure 3.14 – Plots of some additional popular activation functions, mainly variants of ReLU and sigmoid functions
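To make the definitions above concrete, here is a minimal NumPy sketch of these variants, following the formulas given in this section (treat the exact hard sigmoid constants as an assumption; different libraries use slightly different ones):

import numpy as np

def leaky_relu(x, alpha=0.01):   # small, fixed slope for x < 0
    return np.where(x < 0, alpha * x, x)

def elu(x, alpha=1.0):           # smooth exponential branch for x < 0
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

def hard_sigmoid(x):             # piecewise-linear approximation of sigmoid
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def softplus(x):                 # smoothed version of ReLU
    return np.log1p(np.exp(x))

x = np.linspace(-3.0, 3.0, 7)
print(leaky_relu(x))
print(softplus(x))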
Large neural networks, trained on too small datasets, often incur the problem of fitting the training data too well and missing the
capability to generalize to new data. This problem is known as overfitting. Figure 3.15 shows a regression input-output function
implemented by a neural network on the training data (full crosses) and on the test data (empty crosses). On the left, we see a regression
function that does not even manage to fit the training data properly, much less the test data. This is probably due to an insufficient
architecture size or short training time ( underfitting ). In the center, we find a regression curve decently fitting both training and test
data. On the right, we have a regression curve fitting the training data perfectly and failing in the fit on the test data; this is the
overfitting problem:
Figure 3.15 – From left to right, the regression curve implemented by a network underfitting, fitting just fine, and overfitting the
training data
How can we know in advance the right size of the neural architecture and the right number of epochs for the training algorithm? A few
tricks can be adopted to address the problem of overfitting without worrying too much about the exact size of the network and the
number of epochs: norm regularization, dropout, and early stopping.
Norm Regularization
One sign of overfitting is high values of the weights. Thus, the idea behind norm regularization is to penalize weights with high values by adding a penalty term to the objective function, AKA the loss function:

$E_{\mathrm{new}} = E(y, \hat{y}) + \lambda\,\Omega(\mathbf{w})$

Here, $y$ are the true values, $\hat{y}$ are the predicted values, and $\Omega(\mathbf{w})$ is a norm of the weight vector $\mathbf{w}$. A new loss function is thus obtained. The training algorithm, while minimizing this new loss function, will reach a weight configuration with smaller values. This is a well-known regularization approach you might already know from the linear or logistic regression algorithms.
The parameter $\lambda$ is used to control the penalty effect. $\lambda = 0$ is equivalent to no regularization. Higher values of $\lambda$ implement a stronger regularization effect and lead to smaller weights.
There are two commonly used penalty norm functions: the L1 norm and the L2 norm. The L1 norm is the sum of the absolute values of the weights and the L2 norm is the sum of the squares of the weights:

$\Omega_{L1}(\mathbf{w}) = \sum_i |w_i| \qquad \Omega_{L2}(\mathbf{w}) = \sum_i w_i^2$
L1 and L2 are both common methods to avoid overfitting, with one big difference. L2 regularization generally leads to smaller weights but lacks the ability to reduce the weights all the way to zero. On the other hand, L1 regularization allows for a few larger weights while reducing all other weights to zero. When designing a loss function, it is also possible to use a mixture of both L1 and L2 regularization.
In addition, you can also apply regularization terms to weights of selected layers. Three different norm regularizations have been
designed to act on single layers: kernel regularization , bias regularization , and activity regularization .
Kernel regularization penalizes the weights, but not the biases; bias regularization penalizes the biases only; and activity regularization
leads to smaller output values for the selected layer.
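In Keras, which KNIME uses as its deep learning backend, these three per-layer regularizations correspond to the kernel_regularizer, bias_regularizer, and activity_regularizer arguments of a layer. A minimal sketch, with the unit count and the regularization strength of 0.01 chosen purely for illustration:

from tensorflow.keras import layers, regularizers

dense = layers.Dense(
    units=8,
    activation="relu",
    kernel_regularizer=regularizers.l2(0.01),    # penalizes the weights
    bias_regularizer=regularizers.l1(0.01),      # penalizes the biases
    activity_regularizer=regularizers.l2(0.01),  # penalizes the layer outputs
)
# regularizers.l1_l2(l1=0.01, l2=0.01) would mix both norms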
Dropout
Another common approach in machine learning to avoid overfitting is to introduce the dropout technique, which is another
regularization technique.
The idea is, at each training iteration, to randomly ignore (drop) some of the neurons in the input layer or a hidden layer, together with all their input and output connections. At each iteration, different neurons are dropped. Therefore, the number of neurons in the architecture, and which of them are trained, effectively changes from iteration to iteration. The randomization introduced in this way helps to control the overfitting effect.
Dropout makes sure that individual neurons and layers do not rely on single neurons in the preceding layers, thus becoming more
robust and less prone to overfitting:
Figure 3.16 – The dropout technique selects some neurons in each layer and drops them from being updated in the current training
iteration. The full network on the left is trained only partially in the four training iterations described on the right.
Dropout is applied to each layer of the network separately. This often translates into a temporary layer, the dropout layer, being
inserted after the layer we want to randomize. The dropout layer controls how many neurons of the previous layer are dropped at each
training iteration.
To control how many neurons in a layer are dropped, a new parameter is introduced: the drop rate . The drop rate defines the fraction
of neurons in the layer that should be dropped from training at each iteration.
TIP
Here are two quick tips for dropout:
First, dropout leads to layers with fewer neurons and therefore reduces the layer capacity. It is recommended to start with a high
number of neurons per layer.
Second, dropout is only applied to the input or hidden layers, not to the output layer since we want the response of the model to always
be the same at each iteration.
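In Keras terms, this corresponds to inserting a Dropout layer after the layer to be randomized. A minimal sketch, with the layer sizes and the drop rate of 0.2 chosen only as an example:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(81,)),
    layers.Dropout(rate=0.2),   # drops 20% of the previous layer's units per iteration
    layers.Dense(64, activation="relu"),
    layers.Dropout(rate=0.2),
    layers.Dense(1, activation="sigmoid"),   # no dropout after the output layer
])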
Early Stopping
Another option to avoid overfitting is to stop the training process before the network starts overfitting, which is known as early
stopping . To detect the point where the algorithm starts to fit the training data better than the test data, an additional validation set with
new data is used. During training, the network performances are monitored on both the training set and the validation set. At the
beginning of the training phase, the network performance on both the training and validation sets improves. At some point, though, the
performance of the network on the training set keeps improving while on the validation set it starts deteriorating. Once the performance
starts to get worse on the validation set, the training is stopped.
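In Keras, early stopping is available as the EarlyStopping callback. The following sketch assumes a validation set is passed to fit(); the patience value is an arbitrary choice of ours:

from tensorflow.keras.callbacks import EarlyStopping

stop = EarlyStopping(
    monitor="val_loss",         # watch the loss on the validation set
    patience=5,                 # tolerate 5 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch seen
)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=200, callbacks=[stop])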
Convolutional Layers
One area where neural networks are extremely powerful is image analysis, for example, image classification. Feedforward neural
networks are also frequently used in this area. Often, though, the sequence of dense layers is not used alone, but in combination with
another series of convolutional layers. Convolutional layers are placed after the input of the neural network, to extract features and
then create a better representation of the image to pass to the next dense layers – the feedforward architecture – for the classification.
These networks are called Convolutional Neural Networks, CNNs for short.
Chapter 9, Convolutional Neural Networks for Image Classification, explains in detail how convolutional layers work. It will also
introduce some other related neural layers that are suitable to analyze data with spatial relationships, such as the flatten layer and the
max pooling layer.
We will start with an overview of possible loss functions for regression, binary classification, and multiclass classification problems.
Then, we will introduce some optimizers and additional training parameters for the training algorithms.
Loss Functions
In order to train a feedforward neural network, an appropriate error function, often called a loss function , and a matching last layer
have to be selected. Let's start with an overview of commonly used loss functions for regression problems.
Mean Squared Error (MSE) Loss: The mean squared error is the default error metric for regression problems. For $n$ training samples, it is calculated as follows:

$\mathrm{MSE} = \dfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

Here, $y_i$ are the true values and $\hat{y}_i$ are the predicted values. The MSE gives more importance to large error values and it is always positive. A perfect predictor would have an MSE of $0$.
Mean Squared Logarithmic Error (MSLE) Loss: The MSLE is a loss function that penalizes large errors less than the MSE. It is calculated by applying the logarithm to the predicted and the true values before using the MSE. For $n$ training samples, it is calculated as follows:

$\mathrm{MSLE} = \dfrac{1}{n}\sum_{i=1}^{n} \left(\ln(y_i + 1) - \ln(\hat{y}_i + 1)\right)^2$

MSLE applies to numbers greater than or equal to $0$, such as prices. $1$ is added to both $y_i$ and $\hat{y}_i$ to avoid having $\ln(0)$, which is undefined. This loss function is recommended if the range of the target values is large and larger errors shouldn't be penalized significantly more than smaller errors. The MSLE is always positive and a perfect model has a loss of $0$.
Mean Absolute Error (MAE) Loss: The MAE loss function is more robust with regard to outliers. This means it punishes large errors even less than the previous two loss functions, MSE and MSLE. For $n$ training samples, it is calculated as follows:

$\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$
In summary, we can say that we can choose between three different loss functions for regression problems: MSE, MSLE, and MAE.
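The three formulas are easy to verify numerically. A small NumPy sketch with made-up values:

import numpy as np

y_true = np.array([3.0, 100.0, 0.5])
y_pred = np.array([2.5, 110.0, 0.4])

mse  = np.mean((y_true - y_pred) ** 2)
msle = np.mean((np.log(y_true + 1.0) - np.log(y_pred + 1.0)) ** 2)
mae  = np.mean(np.abs(y_true - y_pred))

# The MSE is dominated by the single large error on the second sample,
# while the MSLE and MAE are far less sensitive to it.
print(mse, msle, mae)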
Let's continue with loss functions for binary and multiclass classification problems.
A common approach to binary classification problems is to encode the two classes as $0$ and $1$ and to train the network to predict the probability $\hat{y}_i$ for class $1$. Here, the output layer consists of just one unit with the sigmoid activation function. For this approach, the recommended default loss function is binary cross entropy.
On a training set of $n$ samples, the binary cross-entropy can be calculated as follows:

$\mathrm{BCE} = -\dfrac{1}{n}\sum_{i=1}^{n} \left[\, y_i \ln(\hat{y}_i) + (1 - y_i)\ln(1 - \hat{y}_i) \,\right]$

Here, $y_i$ is $1$ if sample $i$ belongs to class $1$ and takes the same value, $0$, for the other class. $\hat{y}_i$ is the predicted value, as in the previously shown loss functions.
Other possible loss functions for binary classification problems are Hinge and Squared Hinge. In this case, the two classes have to be encoded as $-1$ and $+1$ and therefore the unit in the output layer must use the tanh activation function.
The default loss function for multiclass classification problems is categorical cross-entropy. On a training set of $n$ samples, the categorical cross-entropy can be calculated as an extension to $C$ classes of the binary cross-entropy:

$\mathrm{CCE} = -\dfrac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{C} y_{ik}\,\ln(\hat{y}_{ik})$

Here, $y_{ik}$ is $1$ if training sample $i$ belongs to class $k$ and $0$ otherwise, and $\hat{y}_{ik}$ is the corresponding probability predicted by output neuron $k$ of the network for training sample $i$.
For multiclass classification problems with very many different classes, such as language modeling where each word in the dictionary is one class, sparse categorical cross-entropy is used.
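For reference, the two cross-entropy formulas can be computed directly in NumPy. The clipping of the predictions away from 0 and 1 is our own addition to avoid log(0):

import numpy as np

eps = 1e-12   # keeps predictions away from 0 and 1; not part of the definition

def binary_cross_entropy(y, p):
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def categorical_cross_entropy(Y, P):   # shapes: (n_samples, n_classes)
    P = np.clip(P, eps, 1.0 - eps)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))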
IMPORTANT NOTE
Backpropagation is typically referred to as the algorithm that calculates the gradients of the weights. The algorithm to train a neural
network is usually some variant of SGD and it makes use of backpropagation to update the network weights.
A big role in the training algorithm is played by the learning rate $\eta$. The learning rate defines the size of the step taken along the direction of the gradient descent on the error surface during the learning phase. A too-small $\eta$ produces tiny steps and therefore takes a long time to reach the minimum of the loss function, especially if the loss function happens to have flat slopes. A too-large $\eta$ produces large steps that might overshoot and miss the minimum of the loss function, especially if the loss function is narrow and with steep slopes. The choice of the right value of the learning rate is critical. A possible solution is to use an adaptive learning rate, starting large and progressively decreasing with the number of training iterations.
In Figure 3.17, there are examples of moving on the loss function with a too-small, too-large, and adaptive learning rate:
Figure 3.17 – The progressive decrease of the error with a too-small learning rate (on the left), a too-large learning rate (in the center), and an adaptive learning rate (on the right)
All loss functions are defined as a sum over all training samples. This leads to algorithms that update the weights after all training
samples have passed through the network. This training strategy is called batch training . It is the correct way to proceed; however, it
is also computationally expensive and often slow.
The alternative is to use the online training strategy, where weights are updated after the pass of each training sample. This strategy
is less computationally expensive, but it is just an approximation of the original backpropagation algorithm. It is also prone to running
into oscillations. In this case, it is good practice to use smaller values for the learning rate.
Virtually all modern deep learning frameworks make use of a mixture of batch and online training, where they use small batches of
training examples to perform a single update step.
The momentum term $\mu$ is added to the weight delta to increase the weight update as long as the current update has the same sign as the previous delta. Momentum speeds up the training on long, flat error surfaces and can help the network pass a local minimum. The weight update then includes an extra term:

$\Delta w_{ij}(t) = -\eta\,\dfrac{\partial E}{\partial w_{ij}} + \mu\,\Delta w_{ij}(t-1)$
Number of epochs : The number of epochs defines the number of cycles that run over the full training dataset.
To summarize, the algorithm goes through the whole training set $p$ times, where $p$ is the number of epochs. Each epoch consists of a number of iterations and, for each iteration, a subset of the training set (a batch) is used. At the end of each iteration, the weights are updated, following the mini-batch strategy described previously.
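As a quick worked example of how these quantities relate (the numbers are arbitrary):

import math

n_samples  = 32561   # for example, the size of the adult dataset
batch_size = 80
epochs     = 50

iterations_per_epoch = math.ceil(n_samples / batch_size)   # 408 weight updates per epoch
total_updates = iterations_per_epoch * epochs              # 20400 updates overall
print(iterations_per_epoch, total_updates)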
Summary
We have reached the end of this chapter, where we have learned the basic theoretical concepts behind neural networks and deep
learning networks. All of this will be helpful to understand the steps for the practical implementation of deep learning networks
described in the coming chapters.
We started with the artificial neuron and moved on to describe how to assemble and train a network of neurons, a fully connected
feedforward neural network, via a variant of the gradient descent algorithm, using the backpropagation algorithm to calculate the
gradient.
We concluded the chapter with a few hints on how to design and train a neural network. First, we described some commonly used
network topologies, neural layers, and activation functions to design the appropriate neural architecture.
We then moved to analyze the effects of some parameters involved in the training algorithm. We introduced a few more parameters
and techniques to optimize the training algorithm against a selected loss function.
In the next chapter, you will learn how you can perform all the steps we introduced in this chapter using KNIME Analytics Platform.
Questions and Exercises
Test how well you have understood the concepts in this chapter by answering the following questions:
a. Each neuron from the previous layer is connected to each neuron in the next layer.
b. To speed up calculations
d. For symmetry
d. The deltas calculated at the output layer and backpropagated through the network
a. MAE
b. RMSE
c. Categorical cross-entropy
d. Binary cross-entropy
a. RNNs
b. CNNs
d. Autoencoders
6. How is the last layer of a network commonly configured when solving a binary classification problem?
b. On image data
c. On sequential data
d. On sparse datasets
Chapter 4: Building and Training a Feedforward Neural Network
In Chapter 3, Getting Started with Neural Networks, you learned the basic theory behind neural networks and deep learning. This
chapter sets that knowledge into practice. We will implement two very simple classification examples: a multiclass classification using
the iris flower dataset, and a binary classification using the adult dataset, also known as the census income dataset.
These two datasets are quite small and the corresponding classification solutions are also quite simple. A fully connected feedforward
network will be sufficient in both examples. However, we decided to show them here as toy examples to describe all of the required
steps to build, train, and apply a fully connected feedforward classification network with KNIME Analytics Platform and
KNIME Keras Integration .
These steps include commonly used preprocessing techniques, the design of the neural architecture, the setting of the activation
functions, the training and application of the network, and lastly, the evaluation of the results.
Preparing the Data
In Chapter 3, Getting Started with Neural Networks, we introduced the backpropagation algorithm, which is used by gradient descent
algorithms to train a neural network. These algorithms work on numbers and can't handle nominal/categorical input features or class
values. Therefore, nominal input features or nominal output values must be encoded into numerical values if we want the network to
make use of them. In this section, we will show several numerical encoding techniques and the corresponding nodes in KNIME
Analytics Platform to carry them out.
Besides that, we will also go through many other classic data preprocessing steps to feed machine learning algorithms: creating
training, validation, and test sets from the original dataset; normalization; and missing value imputation.
Along the way, we will also show you how to import data, how to perform a few additional data operations, and some commonly
used tricks within KNIME Analytics Platform. The workflows described in this chapter are available on the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%204/.
In this chapter, we work through two classification examples:
Classification of iris species (multiclass) based on the data from the Iris dataset
Classification of income (binary class) based on the data from the adult dataset
Figure 4.1 – Overview of the Iris dataset, used here to implement a multiclass classification
The adult dataset consists of 32,561 samples of people living in the US. Each record describes a person through 14 demographics
features, including their current annual income (> 50K/<= 50K). Figure 4.2 shows an overview of the features in the dataset:
numerical features, such as age and hours worked per week, and nominal features, such as work class and marital status.
The goal is to train a neural network to predict whether a person earns more or less than 50K per year, using all the other attributes as
input features. The network we want to use should have two hidden layers, each one with eight units and the ReLU activation
function.
TIP
To get an overview of the dataset, you can use the Data Explorer node. This node displays some statistical measures of the input
data within an interactive view. In Figure 4.1 and Figure 4.2, you can see the view of the node for the two example datasets.
To summarize the Iris dataset, it consists of four numerical features, plus the iris nominal class; the adult dataset consists of 14 mixed
features, numerical and nominal. The first step in the data preparation would, therefore, be to transform all nominal features into
numerical ones. Let's move on, then, to the encoding techniques.
One-hot vector encoding overcomes this problem by representing each feature with a vector, where the distance across all the vectors is always the same. The vector consists of the same quantity of binary components as possible values in the original feature. Each component is then associated with one of the values and is set to 1 for that value; the other components remain set to 0. In the hair color example, each hair color would thus be represented by its own binary component.
IMPORTANT NOTE
A one-hot vector is a vector with a single 1 and all other values being 0. It can be used to encode different classes without adding
any artificial distance between them.
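Before turning to the KNIME nodes, here is a rough Python/pandas sketch of the two encodings, with a made-up hair color column for illustration:

import pandas as pd

df = pd.DataFrame({"hair": ["blond", "brown", "black", "brown"]})

# integer encoding: map each value to an arbitrary integer code
df["hair_int"] = df["hair"].map({"blond": 0, "brown": 1, "black": 2})

# one-hot encoding: one binary column per distinct value
one_hot = pd.get_dummies(df["hair"], prefix="hair")
print(pd.concat([df, one_hot], axis=1))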
Let's see now how to implement these encodings with KNIME nodes.
The first node we introduce for integer encoding is the Category to Number node. It has two output ports:
A data output port with the integer-encoded columns
A PMML model output port (blue square) with the mapping rules
Figure 4.3 shows you the node, as well as its configuration window:
Figure 4.3 – The Category to Number node performs an integer encoding on the selected columns
In the upper part of the configuration window, you can select the string-type input columns to apply the integer encoding to. The columns in the Include frame will be transformed, while the columns in the Exclude frame will be left unchanged. You can move columns from one frame to the other using the buttons between them.
By default, values in the original columns are replaced with the integer-encoded values. However, the Append columns checkbox
creates additional columns for the integer-encoded values so as not to overwrite the original columns. If you activate this checkbox,
you can also define a custom suffix for the new columns' headers.
In the lower part of the configuration window, you can define the encoding rule: the start value, the increment, the maximum
allowed number of categories, and an integer value for all missing values.
TIP
To apply the same integer encoding mapping stored in the PMML output port to another dataset, you can use the Category to
Number (Apply) node.
The Category to Number node defines the mapping automatically. This means you cannot manually define which nominal value
should be represented by which integer value. If you wish to do so, you have other options in KNIME Analytics Platform, and we
will introduce two of them: the Cell Replacer node and the Rule Engine node.
The Cell Replacer node replaces cell values in a column according to a dictionary table. It has two inputs:
The top input for the table with the target column whose values are to be replaced
The bottom input for the dictionary table with the lookup and replacement values
Figure 4.4 shows the configuration window of the Cell Replacer node:
Figure 4.4 – The Cell Replacer node implements an encoding mapping based on a dictionary
In the upper part of the configuration window, you can select the target column from the input table at the top input port; this means
the column whose values you want to replace based on the dictionary values.
In the Dictionary table part of the configuration window, you can select, from the data table at the lower input port, the column
with the lookup values – that is, the Input (Lookup) column – and the column containing the replacement values – that is, the
Output (Replacement) column.
Any occurrence in the target column (first input) that matches the lookup value is replaced with the corresponding replacement
value. The result is stored in the output column, which is either added to the table or replaces the original target column.
Missing values are treated as ordinary values; that is, they are valid values both as lookup and replacement values. If there are
duplicates in the lookup column in the dictionary table, the last occurrence (lowest row) defines the replacement pair.
For the integer encoding example, you need a dictionary table to map the nominal values and the integer values. For example, each
education level should be mapped to a corresponding integer value. You can then feed the original dataset into the top input port and
this map/dictionary table into the lowest input port.
TIP
The Table Creator node can be helpful to manually create the lookup table.
If you don't have a dictionary table and you don't want to create one, you can use the Rule Engine node.
The Rule Engine node transforms the values in the input columns according to a set of manually defined rules, which are defined in
its configuration window.
Figure 4.5 shows you the configuration window of the Rule Engine node:
Figure 4.5 – The Rule Engine node implements an integer encoding from user-defined rules
In the Expression part of the configuration window, you can define the set of rules to apply. Each rule consists of an antecedent
(condition) and a consequence, joined by =>, in the form of "antecedent => consequence". The results
are either inserted into a new column or replace the values in a selected column. For each data row in the input table, the rule-
matching process moves from the top rule to the lowest: the first matching rule determines the outcome, and then the rule process
stops. The last default condition, collecting all the remaining data rows, is expressed as "TRUE => consequence".
The outcome of a rule may be a string (in between " or / symbols), a number, a Boolean constant, or a reference to another
column. If no rule matches, the outcome is a missing value. References to other columns are represented by the column name in
between $. You can insert a column reference by hand or by double-clicking on a column in Column List on the left side of the
configuration window.
Besides the Expression panel, you find the Function , Column List , and Flow Variable List panels. The Function panel lists all functions, the Column List panel lists all input columns, and the Flow Variable List panel contains all the available flow variables.
Double-clicking on any of them adds them to the Expression window with the right syntax. Also, selecting any of the functions
shows a description as well as an example.
To summarize, there are many ways to implement integer encoding in KNIME Analytics Platform. We introduced three options:
The Category to Number node offers an automatic, easy approach if you do not want to define the mapping by hand.
The Cell Replacer node is really useful if you have a lookup table at hand.
The Rule Engine node is useful if you want to manually define the mapping between the nominal values and the integer values
via a set of rules.
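As a code analogy, the first-match-wins behavior of the Rule Engine node resembles the following Python sketch, here using the iris class rules that we will meet later in this chapter:

import pandas as pd

df = pd.DataFrame({"class": ["Iris-setosa", "Iris-virginica", "Iris-versicolor"]})

def encode(value):
    # the first matching rule wins, as in the Rule Engine node
    if value == "Iris-setosa":
        return 0
    if value == "Iris-virginica":
        return 1
    return 2   # the TRUE => 2 default rule

df["class"] = df["class"].map(encode)
print(df)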
One-hot encoding is implemented by the One to Many node. In its configuration window, you can select the string-type columns on which to perform the one-hot encoding. For each column, as
many new columns will be created as there are different values. The header of each new column will be the original value in the
nominal column and its cells take a value of either 0 or 1, depending on the presence or absence of the header value in the original
column.
Figure 4.6 – The One to Many node implements the one-hot encoding for nominal features
Creating one-hot encoded vectors leads to very large and very sparse data tables with many zeros. This can weigh on the workflow
performance during execution. The Keras Network Learner node does accept large and sparse one-hot-encoded data tables. However, it also offers a very nice optional feature that avoids this whole step of explicitly creating the data table with the one-hot-encoded vectors. It
can create the one-hot-encoded vectors internally from an integer-encoded version of the original column. In this way, the one-hot
encoding representation of the data remains hidden within the Keras Network Learner node and is never passed from node to
node. In this case, the value of each integer-encoded cell must be presented to the Keras Network Learner node as a collection type
cell. To create a collection type cell, you can use the Create Collection Column node. In the Training the Network section of
this chapter, you will see how to configure the Keras Network Learner node properly to make use of this feature.
Figure 4.7 shows the configuration window of the Create Collection Column node. In the Exclude-Include frame, you select
one or more columns to aggregate in a collection-type column. In the lower part of the configuration window, you can decide
whether to remove the original columns and define the new collection type column's name:
Figure 4.7 – The Create Collection Column node aggregates the values from multiple columns as a collection into one single
column
Notice that for this two-step one-hot encoding – first integer encoding, then one-hot encoding – you need to create the integer
encoding column with one of the nodes listed in the previous section, and then apply the Create Collection Column node to just
one column: the integer-encoded column that we have just created.
A common approach to binary classification is to encode the two classes with $0$ and $1$ and then to train the network to predict the probability for the class encoded as $1$. In this case, either the Category to Number node or the Rule Engine node can work.
In the case of a multiclass problem, there are also two options to encode the class column: the One to Many node on its own or the
Category to Number followed by the Create Collection Column node.
Normalization
Most neural networks are trained using some variant of stochastic gradient descent with the backpropagation algorithm to calculate
the gradient. Input features with non-comparable ranges can create problems during learning, as the input features with the largest
range can overpower the calculation of the weight update, possibly even overshooting a local minimum. This can create oscillations
and slow down the convergence of the learning process. To speed up the learning phase, it is recommended to normalize the data in advance; for example, by using z-score normalization, so that the values in each column have a mean of 0.0 and a standard deviation of 1.0.
In Figure 4.8, you can see the Normalizer node and its configuration window, as well as the Normalizer (Apply) node:
Figure 4.8 – The Normalizer node creates a normalization function for the selected input columns. The Normalizer (Apply) node
applies the same normalization function to another dataset
The Normalizer node creates a normalization function on the selected input columns and normalizes them. The Normalizer (Apply)
node takes an external predefined normalization function and applies it to the input data. A classic case for the application of this pair
of nodes is on training and test sets. The Normalizer node normalizes the training data and the Normalizer (Apply) node applies the
same normalization transformation to the test data.
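This fit-on-training, apply-to-test pattern corresponds to the following scikit-learn sketch (shown only as a familiar reference; KNIME performs the equivalent internally, and the data here is made up):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 4) * 10.0   # stand-ins for the real feature tables
X_test  = np.random.rand(25, 4) * 10.0

scaler = StandardScaler().fit(X_train)    # learn mean and std on the training set only
X_train_norm = scaler.transform(X_train)  # like the Normalizer node
X_test_norm  = scaler.transform(X_test)   # like the Normalizer (Apply) node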
The Normalizer node has one data input port and two output ports:
One data output port with the normalized input columns
One model output port containing the normalization parameters, which can be used on another dataset in a Normalizer (Apply) node
In the configuration window of the Normalizer node, you can select the numerical columns to normalize and the normalization
method.
The configuration window of the Normalizer (Apply) node is minimal since all of the necessary parameters are contained in the input
normalization model.
TIP
With a Partitioning node, you can create the training and test sets before normalizing the data.
A powerful node to impute missing values is the Missing Value node. This node allows you to select between many imputation
methods, such as mean value, fixed value, and most frequent value, to name just a few.
Figure 4.9 shows the two tabs of the configuration window of the node. In the first tab, the Default tab, you can select an
imputation method to apply to all columns of the same type in the dataset; all columns besides those set in the second tab of the
configuration, the Column Settings tab. In this second tab, you can define the imputation method for each individual column:
Figure 4.9 – The Missing Value node selects among many imputation methods for missing values
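As a rough code analogy (illustrative only), the same two imputation methods in pandas would look like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [39.0, np.nan, 52.0],
                   "workclass": ["Private", None, "Self-emp"]})

df["age"] = df["age"].fillna(df["age"].mean())        # mean value imputation
df["workclass"] = df["workclass"].fillna("Missing")   # fixed value imputation
print(df)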
Most neural networks are trained in a supervised way. Therefore, another necessary step is the creation of a training set and a test set,
and optionally a validation set. To create different disjoint subsets, you can use the Partitioning node.
In the configuration window of the Partitioning node in Figure 4.10, you can set the size for the first partition, by either an absolute
or a relative percentage number. Below that, you can set the sampling technique to create this first subset, by random extraction
following the data distribution according to the categories in a selected column (stratified sampling), linearly every n data rows, or
just sequentially starting from the top. The top output port produces the resulting partition; the lower output port produces all other
remaining data rows:
For classification problems, the Stratified sampling option is recommended. It ensures that the distribution of the categories in the selected column is (approximately) retained in the two partitions. For time-series analysis, the Take from top option is preferable, if your data is sorted ascending by date. Samples further back in time will then be in one partition and more recent samples in the other.
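In code terms, the Partitioning node with stratified sampling behaves like a stratified train/test split. A scikit-learn sketch with made-up arrays:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(150, 4)
y = np.repeat([0, 1, 2], 50)   # three balanced classes, as in the Iris dataset

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=42
)   # the class proportions are (approximately) retained in both partitions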
We have talked about encoding for categorical features, normalization for numerical features, missing value imputation, and
partitioning of the dataset. It is likely that those are not the only nodes you might need to prepare your data for the neural network.
Let's see how the data preparation works in practice, by implementing the data preparation part on the two example datasets we have
described previously.
The workflow starts with reading the Iris dataset using the Table Reader node.
TIP
You can find the dataset in the data folder for this chapter.
As the dataset has only numerical input features (petal and sepal measures), there is no need for encoding:
Figure 4.11 – This workflow snippet shows the preprocessing for the data in the iris dataset example
However, the target variable contains three different categories: the names of each flower species. The categories in this nominal
column need to be converted into numbers via some encoding technique. To avoid the introduction of non-existent relationships, we
opted for one-hot encoding. To implement it, we chose the combination of integer encoding via KNIME nodes and one-hot encoding within the Keras Network Learner node. We will talk about the one-hot encoding internal to the Keras Network Learner node in the Training the Network section. Here, we will focus on the creation of an integer encoding of the flower classes inside a collection type column:
1. In order to transform the species names into an index, we use the Rule Engine node, with the following rules:
$class$ = "Iris-setosa" => 0
$class$ = "Iris-virginica" => 1
TRUE => 2
In addition, we decided to replace the values in the class column.
2. Afterward, we pass the results from the Rule Engine node through a Create Collection Column node, to format the encoded
class values as collection type cells. This means we include the class column, and we exclude all other columns in the
configuration window.
3. Next, the training and test sets are created with a Partitioning node, using 75% of the data for training and the remaining 25%
for testing.
4. Lastly, the data is normalized using the z-score normalization.
The Iris dataset is quite small and quite well defined. Only a few nodes, the minimum required, were sufficient to implement the data
preparation part.
Let's see now what happens on a more complex (but still small) dataset, such as the adult dataset.
Like for the Iris dataset, you can find the two datasets used in the workflow in the data folder for this chapter: the adult dataset and a
dictionary Excel sheet. In the adult dataset, education levels are spelled out as text. The dictionary Excel file provides a map between
the education levels and the corresponding standard integer codes. We could use these integer codes as the numerical encoding of the
education input feature.
Next, the Cell Replacer node replaces all education levels with the corresponding codes. We get an integer encoding practically without effort.
Some of the nominal columns have missing values. Inside the Missing Value node, they get imputed with a fixed value:
"Missing".
Next, we proceed with the encoding of all other nominal features, besides education. For the following features, an integer encoding
is used, implemented by the Category to Number node: marital status, race, and sex. We can afford to use the integer encoding
here, because the features are either binary or with just a few categories.
For the remaining nominal features – work class, occupation, relationship, and native-country – one-hot encoding is used,
implemented by the One to Many node. Remember that this node creates one new column for each value in each of the selected
columns. So, after this transformation, the dataset has 82, instead of the original 14, features.
Next, the training, validation, and test sets are created with a sequence of two Partitioning nodes, always using a stratified sampling
based on the Income class column.
Lastly, the Income column gets integer encoded on all subsets and all their data is normalized.
TIP
To hide complexity and to tidy up your workflows, you can create metanodes . Metanodes are depicted as gray nodes and contain
sub-workflows of nodes. To create a metanode, select the nodes you want to hide, right-click, and select Create Metanode .
Our data is ready. Let's now build the neural network.
Building a Feedforward Neural Architecture
To build a neural network architecture using the KNIME Keras integration, you can use a chain of Keras layer nodes. The available nodes to construct layers are grouped by categories in the Keras->Layers folder in the Node Repository , such as Advanced Activations , Convolution , Core , Embedding , and Recurrent , to name just a few.
Each layer displayed in the Keras->Layers folder has a specialty. For example, layers in Advanced Activations create layers
with units with specific activation functions; layers in Convolution create layers for convolutional neural networks; Core contains
all classic layers, such as the Input layer to collect the input values and the Dense layer for a fully connected feedforward neural
network; and so on.
We will explore many of these layers along the way in this book. However, in this current chapter, we will limit ourselves to the basic
layers needed in a fully connected feedforward neural network.
The first layer in any network is the layer that receives the input values. Let's start from the Keras Input Layer node.
On the left of Figure 4.13, you can see the Keras Input Layer node and on the right its configuration window. As you can see, the
node does not have an input port, just one output port of a different shape and color (red square) from the nodes encountered so far:
this is the Keras Network Port :
Figure 4.13 – The Keras Input Layer node defines the input layer of your neural network
TIP
The color and shape of a port indicate which ports can be connected with each other. Most of the time, only ports of the same color
and shape can be connected, but there are exceptions. For example, you can connect a gray square, which is a Python DL port, with
a Keras port, a red square.
Each layer node has a configuration window, with the setting options required for this specific layer. Compared to other layer nodes,
this node has a simple configuration window with only a few setting options.
The most important setting is Shape . Shape allows you to define the input shape of your network, meaning how many neurons
your input layer has. Remember, the number of neurons in the input layer has to match the number of your preprocessed input
columns.
The Iris dataset has four features that we will use as inputs: sepal length, sepal width, petal length, and petal width. Therefore, the
input shape here is 4.
In addition, in the configuration window of the Keras Input Layer node, you can set the batch size and a name prefix for the layer.
Let's now move on to the Keras Dense Layer node. Figure 4.14 shows the configuration window of this node. The setting options are split into two tabs: Options and Advanced .
The Options tab contains the most important settings, such as the number of neurons, also known as units, and the activation
function.
In addition, the Input tensor setting defines the part of the input tensor coming from the previous node. In a feedforward network,
the input tensor is the output tensor from the previous layer. However, some layer nodes – such as, for example, the Keras LSTM
Layer node – create not just one hidden output tensor, but multiple. In such cases, you must select one among the different input
tensors, or hidden states, produced by the previous layer node. Keras Input Layer, like Keras Dense Layer, produces only one output
vector and this is what we select as input tensor to our Keras Dense Layer node.
In the upper part of the Advanced tab, you can select how to randomly initialize the weights and biases of the network; this means
the starting values for all weights and biases before the first iteration of the learning process.
The lower part of the Advanced tab allows you to add norm regularization for the weights in this layer. Norm regularization is a
technique to avoid overfitting, which we introduced in Chapter 3, Getting Started with Neural Networks. In the configuration
window, you can select whether to apply it to the kernel weight matrix, the bias vector, and/or the layer activation. After activating
the corresponding checkbox, you can select between using the L1 norm as a penalty term, the L2 norm as a penalty term, or both.
Lastly, you can set the value of the regularization parameter $\lambda$ for the penalty terms, as well as constraints on the weight and bias values.
By using the Keras Input Layer node and multiple Keras Dense Layer nodes, you can build a feedforward network for many
different tasks, such as, for example, to classify iris flowers:
Figure 4.14 – The Keras Dense Layer node allows you to add a fully connected layer to your neural network, including a selection
of commonly used activation functions
Configuration of the other layer nodes is similar to what was described here for the dense and input layers, and you will learn more
about them in the next chapters.
Since both basic examples used in this chapter refer to feedforward networks, we now have all of the necessary pieces to build both
feedforward neural networks.
The network for the iris example consists of the following layers:
One input layer with four units, one for each input feature
One hidden layer with eight units and the ReLU activation function
One output layer with three units, one for each output class, meaning one for each iris species, with the softmax activation function
We opted for the ReLU activation function in the hidden layer for its better performance when used in hidden layers, and for the
softmax activation function in the output layer for its probabilistic interpretability. The output unit with the highest output from the
softmax function is the unit with the highest class probability.
Figure 4.15 shows the neural network architecture used for the iris classification problem:
Figure 4.15 – A diagram of the feedforward network used for the iris flower example
Figure 4.16 shows the workflow snippet with the three layer nodes building the network and their configuration windows, including
the number of units and activation functions:
Figure 4.16 – This workflow snippet builds the neural network in Figure 4.15 for the Iris dataset example. The configuration
windows below them show you the nodes' configurations
The input layer has four input units, Shape = 4, for the four numerical input features. The first Keras Dense Layer node,
which is the hidden layer, has eight units and uses the ReLU activation function. In the output layer, the softmax activation function
is used with three units, one unit for each class.
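For readers who know Keras, the three layer nodes assemble the equivalent of the following Python model. This is a sketch of what the KNIME nodes build under the hood, not the exact code they generate:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.InputLayer(input_shape=(4,)),                    # Keras Input Layer, Shape = 4
    layers.Dense(8, activation="relu"),                     # hidden Keras Dense Layer
    layers.Dense(3, activation="softmax", name="Output"),   # output Keras Dense Layer
])
model.summary()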
TIP
In the last layer, the name prefix Output has been used. This makes it easier to identify the layer in the Keras Network Executor node and has the advantage that the layer name doesn't change if more Keras Dense Layer nodes are added as hidden layers.
The network for the adult dataset example consists of the following layers:
One input layer with 81 units, one for each preprocessed input feature
One hidden layer with six units and the ReLU activation function
One more hidden layer with six units and the ReLU activation function
One output layer with one unit and the sigmoid activation function
The output layer uses a classic implementation for the binary classification problem: one single unit with the sigmoid activation function. The sigmoid function, spanning the range between 0 and 1, can easily implement class attribution, using 0 for one class and 1 for the other. Thus, for a binary classification problem where the two classes are encoded as 0 and 1, one sigmoid unit alone can produce the probability for the class encoded as 1.
Figure 4.17 shows you the workflow snippet that builds this fully connected feedforward neural network:
Figure 4.17 – This workflow snippet builds the fully connected feedforward neural network used as a solution for the adult dataset
example
After preprocessing, the adult dataset ends up having 82 columns, 81 input features, and the target column. Therefore, the input
layer has Shape = 81. Next, the two hidden layers are built using two Keras Dense Layer nodes with Units = 6
and the ReLU activation function. The output layer consists of a Keras Dense Layer node, this time with Units = 1 and the sigmoid activation function.
In this section, you've learned how to build a feedforward neural network using the KNIME Keras integration nodes. The next step is
to set the other required parameters for the network training, such as, for example, the loss function, and then to train the network.
Training the Network
We have the data ready and we have the network. The goal of this section is to show you how to train the network with the data in
the training set. This requires the selection of the loss function, the setting of the training parameters, the specification of the training
set and the validation set, and the tracking of the training progress.
The key node for network training and for all these training settings is the Keras Network Learner node. This is a really powerful, really flexible node, with many possible settings, distributed over four tabs: Input Data , Target Data , Options , and Advanced Options .
The Keras Network Learner node has three input ports:
Top port : The neural network you want to train
Middle port : The training set
Lowest port : The optional validation set
It has one output port, exporting the trained network.
In addition, the node has the Learning Monitor view, which you can use to monitor the network training progress.
Let's find out first how to select the loss function before we continue with the training parameters.
Now that the network structure is defined and you have selected the correct loss function, the next step is to define which columns of
the input dataset are the inputs for your network and which column contains the target values.
The input data is the data that your network expects as input, which means the columns that fit the input size of the network. In the
Input Data tab, the number of input neurons for the selected network and the consequent shape are reported at the very top:
Figure 4.19 – In the Input Data tab of the Keras Network Learner node, you can select the input column(s) and the correct
conversion
Next, you must select the conversion type; this means the transformation for the selected input columns into a format that is accepted by the network input specification. The possible conversion types include, among others, From Number (double), From Number (integer), and From Collection of Number (integer) to One-Hot Tensor. The other conversion types just take the input columns in the specified format (double, integer, or image) and present them to the network.
After selecting the conversion type, you can select the input columns to the network through an include-exclude frame. Notice that
the frame has been pre-loaded with all the input columns matching the selected conversion type.
Let's now select the target column. The target data must match the specifications from the output layer. This means that, if your
output layer has 20 units, your target data must be 20-dimensional vectors; or, if your output layer has only one unit, your target
data must also consist of one single value for each training sample or data row.
In the Target Data tab, at the very top, the number of neurons in the output layer of the network and the resulting shape is
reported. Like in theInput Data tab, here you can select from many conversion options to translate from the input dataset into the
network specifications. The menu, with all the available conversion types to select from, has been preloaded with the conversion
types that fit the specifications of the output layer of the network.
For multiclass classification problems, the conversion type from a collection of numbers (integer) to one-hot tensor is really helpful.
Instead of creating the one-hot vectors in advance, you need only to encode the position of the class (1) in the input collection cell.
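At execution time, this conversion performs the equivalent of Keras' to_categorical utility. A small sketch for a three-class problem like iris:

import numpy as np
from tensorflow.keras.utils import to_categorical

class_index = np.array([0, 1, 2, 1])   # integer-encoded classes, as in the collection cells
print(to_categorical(class_index, num_classes=3))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]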
All the training parameters can be found in the Options and Advanced Options tabs. In Figure 4.20, you can see the Options
tab of the Keras Network Learner node:
Figure 4.20 – In the Options tab of the Keras Network Learner node, you can set all the training parameters
In the upper part of the Options tab, in the configuration window, you can define the number of epochs and the batch size. The batch size determines the number of data rows from the training and validation sets fed to the network in each training iteration.
IMPORTANT NOTE
If you defined a batch size in the Keras Input Layer node, the batch size settings are deactivated.
Under that, there are two checkboxes. One shuffles the training data randomly before each epoch, and one sets a random seed.
Shuffling the training data often improves the learning process. Indeed, updating the network with the same batches in the same
order in each epoch can have a detrimental effect on the convergence speed of the training. If the shuffling checkbox is selected, the
random seed checkbox becomes active and the displayed number is used to generate the random sequence for the shuffling
operation. The usage of a random seed produces a repeatable random shuffling procedure and therefore allows us to repeat the
results of a specific training run. Clicking the New seed button generates a new random seed and a new random shuffling
procedure. Disabling the checkbox for the random seed creates a new seed for each node execution.
In the lower part of the Options tab, you can select the Optimizer algorithm and its parameters to use during training. The optimizer algorithm is the training algorithm. For example, you can select the RMSProp optimizer and then the corresponding Learning rate and Learning rate decay values. When the node is selected, the Description panel on the right is populated with details about the node. A list of optimizers is provided, as well as links to the original Keras library documentation explaining all the parameters required in this frame.
At the very bottom of the Options tab, you can constrain the size of the gradient values. If Clip norm is checked, the gradients
whose L2 norm exceeds the given norm will be clipped to that norm. If Clip value is checked, the gradients whose absolute value
exceeds the given value will be clipped to that value (or the negated value, respectively).
The Advanced Options tab contains a few additional settings for special termination and learning rate reduction cases. The last
option allows you to specify which GPU to use on systems with multiple GPUs.
Clicking on Loss above the line plot shows the loss curve on the training set instead of the accuracy.
More information about the training progress is available in the Keras Log Output view. This can be selected in the top part of the Keras Network Learner node's view, in the last tab after Accuracy and Loss :
Figure 4.21 – The Learning Monitor view shows the progress of the learning process
TIP
The Learning Monitor view of the Keras Network Learner node allows you to track the learning of your model. You can open it
by right-clicking on the executing node and selecting View: Learning Monitor .
If you are using a validation set, a blue line appears in the accuracy/loss plot. The blue line shows the corresponding progress of the
training procedure on the validation set.
Under the plot, you have the option to zoom in on the x axis – the batch axis – to see the progress after each batch in more detail.
The Smoothing checkbox introduces the moving average curve of the original accuracy or loss curve. The Log Scale checkbox
changes the curve representation to a logarithmic scale for a more detailed evaluation of the training run.
Finally, at the bottom of the view, you can see the Stop learning button. This is an option for on-demand early stopping of the
training process. If training is stopped before it is finished, the network is saved in the current status.
Let's now configure the Keras Network Learner node for the iris example. In the first tab, the Input Data tab, the four numerical inputs are selected as the input features. During the data preparation part, we
applied no nominal feature encoding on the input features. So, we just feed them as they are into the input layer of the network, by
using the From Number (double) conversion type.
In the second tab, the Target Data tab, the target column is selected. If you remember, during the data preparation part, we integer
encoded the class into a collection cell to proceed later with the one-hot encoding conversion. So, we selected the
class_collection input column, containing the integer-encoded class as a collection, and we applied the From
Collection of Number (integer) to One-Hot Tensor conversion. Therefore, during execution, the Keras Network Learner node creates the one-hot encoded version of the three classes as a three-dimensional vector, as required to match the network output. In the lower part of this second tab, select the Categorical cross entropy loss function.
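To illustrate what this conversion does internally, here is a minimal numpy sketch with made-up class values (the node performs the equivalent transformation at execution time):

import numpy as np

classes = np.array([0, 2, 1])   # integer-encoded iris classes, one per row
one_hot = np.eye(3)[classes]    # three-dimensional one-hot vectors
# one_hot is now [[1,0,0], [0,0,1], [0,1,0]]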
In the third tab, named Options , the training parameters are defined. The network is trained using 50 epochs, a training batch size
of 5, and the RMSProp optimizer.
The settings in the Advanced Options tab are left inactive, by default.
In the first tab, Input Data , the From Number (double) conversion is selected and the 81 input feature columns are included.
Only the target column, Income, is in the Exclude part. Here, in the data preparation phase, some of the input features
were already numerical and have not been encoded, some have been integer encoded, and some have been one-hot encoded via
KNIME native nodes. So, all input features are ready to be fed as they are into the network. Notice that, since we decided to mix
integer encoding, one-hot encoding, and original features, the only possible encoding applicable to all those different features is a
simple From Number type of transformation.
Also, in the second tab, Target Data , the From Number (double) conversion is selected, as the target is just a numerical value:
0 or 1. This also fits the one output from the sigmoid function in the output layer of the network. In the include-exclude frame,
only the target column, Income, is included. Next, the Binary cross entropy loss function is selected, to fit a binary
classification problem such as this one.
In the third tab, Options , we set the network to be trained for 80 epochs with a training batch size of 80 data rows. In this example, we also use a validation set, so that we can already see, during training, how the network performs on data not included in the training set.
For the processing of the validation set, a batch size of 40 data rows is set. Lastly, we select Adam as the optimizer for this training
process.
Again, the settings in the last tab, Advanced Options , are disabled by default.
Testing and Applying the Network
Now that the neural network has been trained, the last step is to apply the network to the test set and evaluate its performance.
In the first tab of the configuration window, named Options , you can select, in the upper part, the backend engine, the batch size
for the input data, and whether to also keep the original input columns in the output data table.
Under that, you can specify the input columns and the required conversion. Like in the Keras Network Learner node, the input
specifications from the neural network are printed at the top. Remember that, since you are using the same network and the same
format for the data, the settings for the input features must be the same as the ones in the Keras Network Learner node.
In the last part of this tab, you can add the settings for the output(s). First, you need to specify where to take the output from; this
should be the output layer from the input network. To add one output layer, click on the add output button. In the new window,
you see a menu containing all layers from the input network. If you configured prefixes in the layer nodes, you will see them in the drop-down menu, making it easier for you to recognize the layer of interest. Select the output layer:
Figure 4.22 – The Keras Network Executor node runs the network on new data. In the configuration window, you can select the
outputs by clicking on the add output button
In all use cases included in this book, the last layer of the network is used as the output layer. This layer is easily recognizable, as it is
the only one without the (hidden) suffix in the drop-down list.
TIP
You can also extract the output of a hidden layer, for example, for debugging purposes.
Finally, select the appropriate conversion type, to get the output values in the shape you prefer – for example, in one cell as a list
( To List of Number (double) ) or with a new column for each output unit ( To Number (double) ). In this last case, you can define a prefix for the names of the output columns.
The Advanced Options part contains settings to let the network run on GPU-enabled machines.
The last step is the evaluation of the model. To evaluate a classification model, you can use either the Scorer node or the ROC
Curve node. The output of the Scorer node gives you common performance metrics, such as the accuracy, Cohen's kappa, or the
confusion matrix.
TIP
Another very useful node for evaluating the performance of a binary classification problem is the Binary Classification Inspector node. The node is part of the KNIME Machine Learning Interpretability Extension:
https://hub.knime.com/knime/extensions/org.knime.features.mli/latest.
For the evaluation of regression solutions, the Numeric Scorer node calculates some error measures, such as mean squared error,
root mean squared error, mean absolute error, mean absolute percentage error, mean signed difference, and R-squared.
Figure 4.23 – This workflow snippet applies the trained network and extracts and evaluates the predictions for the iris flower
example
In the configuration window of the Keras Network Executor node, the four input features are selected as input columns. In the
lower part of the Options tab, the output layer has been selected by clicking on the add output button. As we didn't use any
prefixes in the configuration window of the layer nodes, the last layer here is just called "dense_2/Softmax:0_", and a Conversion type of To Number (double) is selected. As the Iris dataset has three different possible class values, the node
adds three new columns with the three probabilities for the three classes. Another conversion option is To List of Number
(double) . This conversion option would lead to only one new column, with all the class probabilities in one cell packaged as a list.
Next, the predictions are extracted with the Rule Engine node. The probabilities for the different classes are in the Output_1/Softmax:0_0 column for class 0, the Output_1/Softmax:0_1 column for class 1, and the Output_1/Softmax:0_2 column for class 2. Here, the class with the highest probability is selected as the predicted outcome.
The first rule checks whether the class encoded as 0 has the highest probability by comparing it to the probability for the other two
classes. The second rule does the same for the class encoded as 1, and the third rule for the class encoded as 2. The last rule defines
a default value.
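The same argmax-style selection can be sketched in Python for clarity (column names as referenced above; this is a hedged illustration of the logic, not the exact Rule Engine configuration):

import pandas as pd

def predict_class(row: pd.Series) -> str:
    # pick the class whose probability column holds the highest value
    probs = {"0": row["Output_1/Softmax:0_0"],
             "1": row["Output_1/Softmax:0_1"],
             "2": row["Output_1/Softmax:0_2"]}
    return max(probs, key=probs.get)

row = pd.Series({"Output_1/Softmax:0_0": 0.1,
                 "Output_1/Softmax:0_1": 0.7,
                 "Output_1/Softmax:0_2": 0.2})
print(predict_class(row))   # -> "1"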
In the configuration window of the Keras Network Executor node, in the Options tab, the 81 input features are included, and the
dense_3 output layer is added as the output. In this case, the output of the network is the probability for the class encoded as
1, ">50K".
Finally, the Rule Engine node checks whether the output probability is higher or lower than the 0.5 threshold using the following
code:
$dense_3/Softmax:0_0$ < 0.5 => "<=50K"
TRUE => ">50K"
Lastly, the network performance is evaluated with the Scorer node.
With this, we have gone through the whole process, from data access and data preparation to defining, training, applying, and
evaluating a neural network using KNIME Analytics Platform.
Summary
We have reached the end of this chapter, where you have learned how to perform the different steps involved in training a neural
network in KNIME Analytics Platform.
We started with common preprocessing steps, including different encodings, normalization, and missing value handling. Next, you
learned how to define a neural network architecture by using different Keras layer nodes without writing code. We then moved on to
the training of the neural network and you learned how to define the loss function, as well as how you can monitor the learning
progress, apply the network to new data, and extract the predictions.
Each section closed with small example sessions, preparing you to perform all these steps on your own.
In the next chapter, you will see how these steps can be applied to the first use case of the book: fraud detection using an
autoencoder.
Questions and Exercises
Check your level of understanding of the concepts presented in this chapter by answering the following questions:
1. How can you set the loss function to train your neural network?
c) By creating an integer encoding using the Category to Number node and afterward, the Integer to One Hot Encoding node
d) By creating an integer encoding, transforming it into a collection cell, and selecting the right conversion
3. How can you define the number of neurons for the input of your network?
c) The input dimension is set automatically based on the selected features in the Keras Network Learner node.
4. How can you monitor the training of your neural network on a validation set?
a) Feed a validation set into the optional input port of the Keras Network Learner node and open the training monitor view. The
performance of the validation set is shown in red.
b) Click on the apply on validation set button in the training monitor view.
c) Feed a validation set into the optional input port of the Keras Network Learner node and open the training monitor view. The
performance of the validation set is shown in blue.
d) Feed a validation set into the optional input port of the Keras Network Learner node and open the validation set tab of the training monitor view.
Build a workflow to read the Iris dataset and to train a neural network with one hidden layer (eight units and the ReLU activation function) to distinguish the three species from each other based on the four input features.
Section 2: Deep Learning Networks
Here, we move on to more advanced concepts in neural networks (deep learning) and how to implement them within KNIME
Analytics Platform, based on some case studies.
Those were two simple examples using quite small datasets, in which all the classes were adequately represented, with just a few
hidden layers in the network and a straightforward encoding of the output classes. However, they served their purpose: to teach you
how to assemble, train, and apply a neural network in KNIME Analytics Platform.
Now, the time has come to explore more realistic examples and apply more complex neural architectures and more advanced deep
learning paradigms in order to solve more complicated problems based sometimes on ill-conditioned datasets. In the following
chapters, you will look at some of these more realistic case studies, requiring some more creative solutions than just a fully connected
feedforward network for classification.
We will start with a binary classification problem with a dataset that has data from only one of the two classes. Here, the classic
classification approach cannot work, since one of the two classes is missing from the training set. There are many problems of this
kind, such as anomaly detection to predict mechanical failures or fraud detection to distinguish legitimate from fraudulent credit card
transactions.
This chapter investigates an alternative neural approach to design a solution for this extreme situation in fraud detection: the
autoencoder architecture.
We will cover the following topics:
Introducing Autoencoders
Why is Detecting Fraud so Hard?
Building and Training the Autoencoder
Optimizing the Autoencoder Strategy
Deploying the Fraud Detector
Introducing Autoencoders
In previous chapters, we have seen that neural networks are very powerful algorithms. The power of each network lies in its
architecture, activation functions, and regularization terms, plus a few other features. Among the varieties of neural architectures,
there is a very versatile one, especially useful for three tasks: detecting unknown events, detecting unexpected events, and reducing
the dimensionality of the input space. This neural network is the autoencoder .
The simplest autoencoder has only three layers: one input layer, one hidden layer, and one output layer. More complex autoencoder structures might include additional hidden layers:
Autoencoders can be used for many different tasks. Let's first see how an autoencoder can be used for dimensionality reduction.
In this case, the first part of the network, moving the data from a vector with size $n$ to a vector with size $m$ (with $m < n$), plays the role of the encoder. The second part of the network, reconstructing the input vector from the $m$-dimensional space back into the $n$-dimensional space, is the decoder. The compression rate is then $n/m$. The larger the value of $n$ and the smaller the value of $m$, the higher the compression rate:
Figure 5.2 – Encoder and decoder subnetworks in a three-layer autoencoder
When using the autoencoder for dimensionality reduction , the full network is first trained to reproduce the input vector onto the
output layer. Then, before deployment, it is split into two parts: the encoder (input layer and hidden layer) and the decoder
(hidden layer and output layer). The two subnetworks are stored separately.
TIP
If you are interested in the output of the bottleneck layer, you can configure the Keras Network Executor node to output the
middle layer. Alternatively, you can split the network within the DL Python Network Editor node by writing a few lines of
Python code.
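As a rough idea of what those few lines of Python could look like, here is a self-contained sketch assuming TensorFlow's Keras API (layer sizes are illustrative, not the book's exact network):

from tensorflow import keras

n, m = 30, 8                                          # input size and bottleneck size
inp = keras.Input(shape=(n,))
code = keras.layers.Dense(m, activation="sigmoid")(inp)
out = keras.layers.Dense(n, activation="sigmoid")(code)

autoencoder = keras.Model(inp, out)                   # full network, trained first
encoder = keras.Model(inp, code)                      # encoder: input -> compressed record
dec_in = keras.Input(shape=(m,))
decoder = keras.Model(dec_in, autoencoder.layers[-1](dec_in))  # decoder: code -> reconstruction

The two subnetworks reuse the weights of the trained autoencoder, which mirrors the split into encoder and decoder described above.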
During the deployment phase, in order to compress an input record, we just pass it through the encoder and save the output of the
hidden layer as the compressed record. Then, in order to reconstruct the original vector, we pass the compressed record through the
decoder and save the output values of the output layer as the reconstructed vector.
If a more complex structure is used for the autoencoder – for example, with more than one hidden layer – one of the hidden
layers must work as the compressor output, producing the compressed record and separating the encoder from the decoder
subnetwork.
Now, the question when we talk about data compression is how faithfully can the original record be reconstructed? How much
information is lost by using the output of the hidden layer instead of the original data vector? Of course, this all depends on how well
the autoencoder performs and how large our error tolerance is.
During testing, when we apply the network to new data, we denormalize the output values and we calculate the chosen error metric
– for example, the Root Mean Square Error (RMSE ) – between the original input data and the reconstructed data on the
whole test set. This error value gives us a measure of the quality of the reconstructed data. Of course, the higher the compression rate,
the higher the reconstruction error. The problem thus becomes to train the network to achieve acceptable performance, as per our
error tolerance.
Since no anomaly examples are available, the autoencoder is trained only on non-anomaly examples. Let's call these the examples of the
"normal" class. On a training set full of "normal" data, the autoencoder network is trained to reproduce the input feature vector onto
the output layer.
The idea is that, when required to reproduce a vector of the "normal" class, the autoencoder is likely to perform a decent job because
that is what it was trained to do. However, when required to reproduce an anomaly on the output layer, it will hopefully fail because
it won't have seen this kind of vector throughout the whole training phase. Therefore, if we calculate the distance – any distance –
between the original vector and the reproduced vector, we see a small distance for input vectors of the "normal" class and a much
larger distance for input vectors representing an anomaly.
Thus, by setting a threshold, $K$, we should be able to detect anomalies with the following rule:

$$\text{IF } e(x) > K \text{ THEN } x \text{ is an anomaly}$$

Here, $e(x)$ is the reconstruction error for the input vector, $x$, and $K$ is the set threshold.
This sort of solution has already been implemented successfully for fraud detection, as described in a blog post, Credit Card Fraud
Detection using Autoencoders in Keras -- TensorFlow for Hackers (Part VII), by Venelin Valkov
(https://medium.com/@curiousily/credit-card-fraud-detection-using-autoencoders-in-keras-tensorflow-for-hackers-part-vii-
20e0c85301bd). In this chapter, we will use the same idea to build a similar solution using a different autoencoder structure.
Let's find out how the idea of an autoencoder can be used to detect fraudulent transactions.
Why is Detecting Fraud so Hard?
Fraud detection is a set of activities undertaken to prevent money or property from being obtained through false pretenses. Fraud
detection is applied in many industries, such as banking or insurance. In banking, fraud may include forging checks or using stolen
credit cards. For this example, we will focus on fraud in credit card transactions.
This kind of fraud, in credit card transactions, is a huge problem for credit card issuers as well as for the final payers. The European
Central Bank reported that in 2016, the total number of card fraud cases using cards issued in the Single Euro Payments Area
( SEPA ) amounted to 17.3 million, and the total number of card transactions using cards issued in SEPA amounted to 74.9 billion
(https://www.ecb.europa.eu/pub/cardfraud/html/ecb.cardfraudreport201809.en.html#toc1).
However, the amount of fraud is not the only problem. From a data science perspective, fraud detection is also a very hard task to
solve, because of the small amount of data available on fraudulent transactions. That is, often we have tons of data on legitimate
credit card transactions and just a handful on fraudulent transactions. A classic approach (training, then applying a model) is not
possible in this case since the examples for one of the two classes are missing.
Fraud detection, however, can also be seen as anomaly detection. An anomaly is any event that is unexpected within a dataset. A fraudulent transaction is indeed an unexpected event, and therefore we can consider it an anomaly in a dataset of legitimate credit card transactions.
One option is the discriminative approach. Based on a training set with both classes, legitimate and fraudulent transactions, we build a
model that distinguishes between data from the two classes. This could be a simple threshold-based rule or a supervised machine
learning model. This is the classic approach based on a training set including enough examples from both classes.
Alternatively, you can treat a fraud detection problem as outlier detection. In this case, you can use a clustering algorithm that leaves
space for outliers (noise), such as DBSCAN ; or you can use the isolation forest technique , which isolates outliers with just a
few cuts with respect to legitimate data. Fraudulent transactions, though, must belong to the original dataset, to be isolated as outliers.
Another approach, called the generative approach , involves using only legitimate transactions during the training phase to train an autoencoder to reproduce the input vector onto the output layer. Once the model for the autoencoder has been trained, we use it during
deployment to reproduce the input transaction. We then calculate the distance (or error) between the input values and the output
values. If that distance falls below a given threshold, the transaction is likely to be legitimate; otherwise, it is flagged as a fraud
candidate.
In this example, we will use the credit card dataset available on Kaggle. This dataset contains credit card transactions from European
cardholders in September 2013. Fraudulent transactions have been labeled with 1, while legitimate transactions are labeled with 0.
The dataset contains 284,807 transactions, but only 492 (0.2%) of them are fraudulent. Due to privacy reasons, principal
components are used instead of the original transaction features. Thus, each credit card transaction is represented by 30 features: 28
principal components extracted from the original credit card data, the transaction time, and the transaction amount.
Let's proceed with the building, training, and testing of the autoencoder.
Building and Training the Autoencoder
Let's go into detail about the particular application we will build to tackle fraud detection with a neural autoencoder. Like all data
science projects, it includes two separate applications: one to train and optimize the whole strategy on dedicated datasets, and one to
set it in action to analyze real-world credit card transactions. The first application is implemented with the training workflow ; the
second application is implemented with the deployment workflow .
TIP
Often, training and deployment are separate applications since they work on different data and have different goals.
The training workflow uses a lab dataset to produce an acceptable model to implement the task, sometimes requiring a few different
trials. The deployment workflow does not change the model or the strategy anymore; it just applies it to real-world transactions to get
fraud alarms.
In this section, we will focus on the training phase, including the following steps:
Data Access : Here, we read the lab data from the file, including all 28 principal components, the transaction amount, and the
corresponding time.
Data Preparation : The data comes already clean and transformed via Principal Component Analysis (PCA ). What remains to be done in this phase is to create all the data subsets required for the training, optimization, and testing of the neural autoencoder and the whole strategy.
Building the Neural Network : An autoencoder is a feedforward neural network with as many inputs as outputs. Let's then
decide the number of hidden layers, the number of hidden neurons, and the activation functions in each layer, and then build it
accordingly.
Training the Neural Autoencoder : In this part, the autoencoder is trained on a training set of just legitimate transactions with
one of the training algorithms (the optimizers), according to the selected training parameters, such as, at least, the loss function, the
number of epochs, and the batch size.
Rule for Fraud Alarms : After the network has been trained and it is able to reproduce legitimate transactions on the output
layer, we need to complete the strategy by calculating the distance between the input and output layers and by setting a threshold-
based rule to trigger fraud alarms.
Testing the whole Strategy : The last step is to test the whole strategy performance. How many legitimate transactions are
correctly recognized? How many fraud alarms are correctly triggered and how many are false alarms?
Then, we need an additional data subset, the threshold optimization set, to optimize the threshold, $K$, in the rule-based fraud alarm generator. This last subset should include all fraudulent transactions, in addition to a number of legitimate transactions, as follows:
This all translates into one Row Splitter node to separate legitimate transactions from fraudulent transactions, one Concatenate node to add back the fraudulent transactions into the threshold optimization set, and a number of Partitioning nodes. All data extraction in the Partitioning nodes is performed at random:
Figure 5.3 – The datasets used in the fraud detection process
IMPORTANT NOTE
The training set, validation set, and threshold optimization set must be completely separated. No records can be shared across any of
the subsets. This is to ensure a meaningful performance measure during evaluation and an independent optimization procedure.
Next, all data in each subset must be normalized to fall in $[0, 1]$. Normalization is defined on the training set and applied to the other two subsets. The normalization parameters are also saved for the deployment workflow using the Model Writer node:
Figure 5.4 – The workflow implementing data preparation for fraud detection
The workflow in Figure 5.4 shows how the creation of the different datasets and the normalization can be performed in KNIME
Analytics Platform.
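The same fit-on-training, apply-everywhere logic can be sketched in Python (assuming scikit-learn; the workflow instead normalizes with KNIME nodes and saves the model with the Model Writer node, and the placeholder data below is random):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
train_X = rng.normal(size=(1000, 30))                     # placeholder training data
valid_X = rng.normal(size=(200, 30))                      # placeholder validation data

scaler = MinMaxScaler(feature_range=(0, 1)).fit(train_X)  # defined on the training set only
train_norm = scaler.transform(train_X)
valid_norm = scaler.transform(valid_X)                    # same parameters reused on the other subsets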
The neural network was built using the following (see Figure 5.5):
The Keras Input Layer node with Shape = 30
Five Keras Dense Layer nodes to implement the hidden layers, using sigmoid as the activation function and 40, 20, 8, 20, and
40 units, respectively
The Keras Dense Layer node for the output layer, with 30 units and sigmoid as the activation function:
Figure 5.5 – Structure of the neural autoencoder trained to reproduce credit card transactions from the input layer onto the output
layer
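For reference, the same layer stack written out in code (a sketch assuming TensorFlow's Keras API; in the book, the network is built codelessly with the nodes listed above):

from tensorflow import keras

autoencoder = keras.Sequential([
    keras.Input(shape=(30,)),                        # Keras Input Layer, Shape = 30
    keras.layers.Dense(40, activation="sigmoid"),    # five hidden layers: 40-20-8-20-40
    keras.layers.Dense(20, activation="sigmoid"),
    keras.layers.Dense(8, activation="sigmoid"),     # bottleneck
    keras.layers.Dense(20, activation="sigmoid"),
    keras.layers.Dense(40, activation="sigmoid"),
    keras.layers.Dense(30, activation="sigmoid"),    # output layer with 30 units
])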
Now that we've built the autoencoder, let's train and test it using the data.
Training and Testing the Autoencoder
To train and validate the network, we use the Keras Network Learner node, with the training set and the validation set at the
input ports, and the following settings (Figure 5.6 ):
The number of epochs is set to 50, the batch size for the training and validation set is set to 300, and the Adam optimizer (an adaptive variant of gradient descent) is used, in the Options tab.
The target and input features are the same in the Input tab and in the Target tab and are accepted as simple Double numbers.
In the Loss tab of the Learning Monitor view of the Keras Network Learner node, you can see two curves now: one is the mean
loss (or error) per training sample in a batch (in red) and the other one is the mean loss per sample on the validation data (in blue).
At the end of the training phase, the final mean loss value fell in around [0.0012, 0.0016] for batches from the training set and in
[0.0013, 0.0018] for batches from the validation set. The calculated loss is the mean reconstruction error for one batch, calculated by
the following formula:

$$L = \frac{1}{B} \sum_{k=1}^{B} \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{(k)} - \hat{y}_i^{(k)} \right)^2$$

Here, $B$ is the batch size, $n$ is the number of units on the output layer, $\hat{y}_i^{(k)}$ is the output value of neuron $i$ in the output layer for training sample $k$, and $y_i^{(k)}$ is the corresponding target answer.
After training, the network is applied to the optimization set, using the Keras Network Executor node, and it is saved for
deployment as a Keras file using the Keras Network Writer node.
Figure 5.6 shows the configuration for the Options tab in the Keras Network Executor node: all 30 input features are passed as
Double numbers and the input columns are kept so that the reconstruction error can be calculated later on. The last layer is selected
as the output and the values are exported as simple Double numbers:
Figure 5.6 – The Keras Network Executor node and its configuration window
The next step is to calculate the distance between the original feature vector and the reproduced feature vector, and to apply a threshold, $K$, to discover fraud candidates.
First, we run the new transaction, $x$, through the autoencoder via the Keras Network Executor node. The reproduction of the original transaction is generated at the output layer. Now, a reconstruction error, $e(x)$, is calculated as the distance between the original transaction vector and the reproduced one. A transaction is then considered a fraud candidate according to the following rule:

$$\text{IF } e(x) > K \text{ THEN } x \text{ is a fraud candidate}$$

Here, $e(x)$ is the reconstruction error value for transaction $x$ and $K$ is a threshold. The MSE was also adopted for the reconstruction error:

$$e(x) = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2$$
Here, $x_i$ is the $i$th feature of transaction $x$, and $\hat{x}_i$ is the corresponding value on the output layer of the network.
$e(x)$ is calculated via a Math Formula node, and the previous rule is implemented via a Rule Engine node, assuming, for now, a threshold value chosen based on experience. 1 is the fraud candidate class and 0 is the legitimate transaction class. A Scorer (Javascript) node
finally calculates some performance metrics for the whole approach: 83.64% accuracy, with 83.60% specificity and 99.95%
sensitivity on class 1. Specificity is the ratio between the number of true legitimate transactions and all transactions that did not raise
any alarm. Sensitivity, on the opposite side, measures the ratio of fraud alarms that actually hit a fraudulent transaction.
Specificity produces a measure of the frauds we might have missed, while sensitivity produces a measure of the frauds we hit:
Figure 5.7 – The rule implemented in the Rule Engine node, comparing reconstruction error with threshold
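In plain Python, the distance calculation and the rule amount to something like the following minimal sketch (the workflow uses the Math Formula and Rule Engine nodes instead):

import numpy as np

def reconstruction_error(x: np.ndarray, x_hat: np.ndarray) -> float:
    return float(np.mean((x - x_hat) ** 2))   # MSE over the 30 features

def fraud_class(e: float, K: float) -> int:
    return 1 if e > K else 0                  # 1 = fraud candidate, 0 = legitimate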
Optimizing the Autoencoder Strategy
What is the best value to use for threshold $K$? In the last section, we adopted a value based on our experience. However, is this the best value for $K$? Threshold $K$, in this case, is not automatically optimized via the training procedure. It is just a static parameter external to the training algorithm. In KNIME Analytics Platform, it is also possible to optimize static parameters outside of
the Learner nodes.
Optimizing Threshold K
If no labeled fraudulent transactions are available in the dataset, the value of threshold $K$ is defined as a high percentile of the reconstruction errors on the optimization set.
During the data preparation phase, we generated three data subsets: the training set and validation set for the Keras Network Learner
node to train and validate the autoencoder, and one last subset, which we called the threshold optimization set. This final subset
includes 1/3 of all the legitimate transactions and the handful of fraudulent transactions. We can use this subset to optimize the value of threshold $K$ against the accuracy of the whole fraud detection strategy.
To optimize a parameter means to find the value within a range that maximizes or minimizes a given measure. Based on our experience, we assume the value of $K$ to be a positive number (> 0) and to lie below 0.02. So, to optimize the value of threshold $K$ means to find the value in $(0, 0.02]$ that maximizes the accuracy of the whole application.
The accuracy of the application is calculated via a Scorer (JavaScript) node, considering the results of the Rule Engine node as the
predictions and comparing them with the original class (0 = legitimate transaction, 1 = fraudulent transaction) in the optimization
set.
The spanning of the value interval and the identification of the threshold value yielding the maximum accuracy are performed by an
optimization loop . Every loop in KNIME Analytics Platform is implemented via two nodes: a loop start node and a
loop end node. In the optimization loop, these two nodes are the Parameter Optimization Loop Start node and the
Parameter Optimization Loop End node.
The Parameter Optimization Loop Start node spans parameter values in a given interval with a given step size. The interval and step size have been chosen here based on the range of the reconstruction error feature, as shown in the Lower Bound and Upper Bound cells in the Spec tab of the data table at the output port of the Math Formula node, named MSE input-output distance , after the Keras Network Executor node.
The Parameter Optimization Loop End node collects all results as flow variables, detects the best (maximum or minimum)
value for the target measure, and exports it together with the parameter that generated it. In our case, the target measure is the
accuracy, measured on the predictions from the Rule Engine node, which must be maximized against values for threshold $K$.
All nodes in between the loop start and the loop end make up the body of the loop – that is, the part that gets repeated as many times as needed until the input interval of parameter values has been fully covered. In the loop body, we add the additional constraint that the
optimal accuracy should be found only for those parameters where the specificity and sensitivity are close in value. This is the goal
of the metanode named Coefficient 0/1. Here, if the specificity and sensitivity are more than 10% apart, the
coefficient is set to 0, otherwise to 1. This coefficient then multiplies the overall accuracy coming from the Scorer (JavaScript)
node. In this way, the maximum accuracy is detected only for those cases where the specificity and sensitivity are close to each other:
Figure 5.8 – The optimization loop
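The loop logic, sketched in Python for clarity (the interval bounds follow the text; the step size, helper structure, and the standard specificity/sensitivity formulas are assumptions for illustration):

import numpy as np

def sweep_threshold(errors, labels, lo=0.0005, hi=0.02, step=0.0005):
    best_k, best_score = None, -1.0
    for k in np.arange(lo, hi + step, step):
        pred = (errors > k).astype(int)                   # rule: e(x) > K => class 1
        acc = float(np.mean(pred == labels))
        tp = np.sum((pred == 1) & (labels == 1))
        tn = np.sum((pred == 0) & (labels == 0))
        fp = np.sum((pred == 1) & (labels == 0))
        fn = np.sum((pred == 0) & (labels == 1))
        sens = tp / max(tp + fn, 1)
        spec = tn / max(tn + fp, 1)
        coeff = 1.0 if abs(spec - sens) <= 0.1 else 0.0   # the Coefficient 0/1 metanode
        if coeff * acc > best_score:
            best_k, best_score = float(k), coeff * acc
    return best_k, best_score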
After extracting the optimal threshold, we transform it into a flow variable and pass it to the final rule implementation.
In the component we created, we set up the Component Output node to export the flow variable containing the value for the
optimal threshold. This flow variable needs to exit the component to be used in the final rule for fraud detection. The final rule is
implemented in a new Rule Engine node and the final predictions are evaluated against the original classes in a new Scorer
(JavaScript) node.
The final workflow to train and test the neural autoencoder using credit card transaction data and to implement the fraud detection
rule with the optimal threshold is shown in Figure 5.9. The workflow, named
01_Autoencoder_for_Fraud_Detection_Training, is downloadable from the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%205/:
Figure 5.9 – The workflow to train and test the autoencoder and to find the optimal threshold, K
Now that we have found the best threshold, let's have a look at the performance of the autoencoder.
Performance Metrics
In this section, we report the performance measures of this approach on the threshold optimization set after applying the fraud
detection rule. The optimal threshold value found by the optimization loop yielded an accuracy of 93.52%.
In Figure 5.10, you can see the confusion matrix , the class statistics based on it, and the general performance measures, all of
them describing how well the fraud detector is performing on the optimization set:
Figure 5.10 – Performance metrics of the final fraud detector with optimized threshold K
Let's consider class 1 (fraud) as the positive class. The high number of false positives (6,236) shows the weakness of this approach: it
is prone to generating false positives. In other words, it tends to label perfectly legitimate transactions as fraud candidates. Now, there
are case studies where false positives are not a huge problem, and this is one of those. In the case of a false positive, the price to pay
is to send a message to the credit card owner about the current transaction. If the message turns out to be useless, the damage is not
much compared to the possible risk. Of course, this tolerance does not apply to all case studies. A false positive in medical diagnosis
carries a much heavier responsibility than a wrong fraud alarm in a credit card transaction.
IMPORTANT NOTE
The whole process could also be forced to lean more toward fraud candidates or legitimate transactions, by introducing an expertise-
based bias in the definition of threshold K.
In general, the autoencoder captures 87% of the fraudulent transactions and 93% of the legitimate transactions in the validation set,
for an overall accuracy of 85% and a Cohen's kappa of 0.112. Considering the high imbalance between the number of normal and
fraudulent transactions in the validation set (96,668 versus 492), the results are still promising.
Notice that this false positive-prone approach is a desperate solution for a case study where no, or almost no, examples from one of
the classes exist. A supervised classifier on a training set with labeled examples would probably reach better performance. But this is
the data we have to deal with!
We have now trained the autoencoder and found the best threshold for our rule system. We will see, in the next section, how to
deploy it in the real world on real data.
Deploying the Fraud Detector
At this point, we have an autoencoder network and a rule with acceptable performance for fraud detection. In this section, we will implement the deployment workflow.
The deployment workflow (Figure 5.11 ), like all deployment workflows, takes in new transaction data, passes it through the
autoencoder, calculates the distance, applies the fraud detection rule, and finally, flags the input transaction as fraud or legitimate.
The trained autoencoder is read back from the Keras file saved during training. At the same time, data from some new credit card transactions is read from the file using the File Reader node. This particular file
contains two new transactions.
The transactions are normalized with the same parameters built on the training data and previously saved in the file named
normalizer model. These normalization parameters are read from the file using the Model Reader node.
The last file to read contains the value of the optimized threshold, $K$.
Afterward, the MSEs between the original features and the reconstructed features for each transaction are calculated using the Math
Formula node.
The Rule Engine node applies the threshold, $K$, as defined during the optimization phase, to detect possible fraud candidates.
The following table shows the reconstruction errors for the two transactions and the consequent class assignment. The application
(autoencoder and distance rule) defines the first transaction as legitimate and the second transaction as a fraud candidate:
Figure 5.12 – Reconstruction errors and fraud class assignment for credit card transactions in the dataset used for deployment
Taking Actions
In the last part of the workflow, we need to take action:
IF-THEN conditions involving actions are implemented in KNIME Analytics Platform via switch blocks. Similar to loops, switch
blocks have a start node and an end node. The end node in switch blocks is optional, however. The switch start node activates only
one of the output ports, enabling de facto only one possible further path for the data flow. The switch end node collects the results
from the different branches. The most versatile switch block is the CASE switch in all its variants: for data, flow variables, or models.
The active port, and then the active branch, is controlled via the configuration window of the Switch CASE Start node. This
configuration setting is usually controlled via a flow variable, whose values enable one or the other output each time.
In our case, we have two branches. The upper branch is connected to port 0, activated by class 0, and performs nothing. The
second branch is connected to port 1, activated by class 1, and sends an email to the owner of the credit card.
We conclude here the section on the implementation of the autoencoder-based strategy for fraud detection.
Summary
In this chapter, we discussed approaches for building a fraud detector for credit card transactions in the desperate case when no, or
almost no, examples of the fraud class are available. This solution trains a neural autoencoder to reproduce legitimate transactions
from the input onto the output layer. Some postprocessing is necessary to set an alarm for the fraud candidate based on the
reconstruction error.
In describing this solution, we have introduced the concept of training and deployment applications, components, optimization
loops, and switch blocks.
In the next chapter, we will discuss a special family of neural networks, so-called recurrent neural networks, and how they can be
used to train neural networks for sequential data.
b) Anomaly detection
d) Regression problems
a) By training a network with an output layer that has fewer units than the input layer
d) By building a network with more hidden neurons than the input and output layers
Chapter 6: Recurrent Neural Networks for Demand Prediction
We have gathered some experience, by now, with fully connected feedforward neural networks in two variants: implementing a
classification task by assigning an input sample to a class in a set of predefined classes or trying to reproduce the shape of an input
vector via an autoencoder architecture. In both cases, the output response depends only on the values of the current input vector. At
time $t$, the output response, $y(t)$, depends on, and only on, the input vector, $x(t)$, at time $t$. The network has no memory.
With Recurrent Neural Networks (RNNs ), we introduce the time component. We are going to discover networks where the output response, $y(t)$, at time $t$ depends on the current input sample, $x(t)$, as well as on previous input samples, $x(t-1)$, $x(t-2)$, …, $x(t-p)$, where the extension $p$ of the network memory of past samples depends on the
network architecture.
We will first introduce the general concept of RNNs, and then the specific concept of Long Short-Term Memory (LSTM ) in
the realm of a classic time series analysis task: demand prediction . Then, we will show how to feed the network with not only static vectors, $x(t)$, but also sequences of vectors, such as $x(t)$, $x(t-1)$, $x(t-2)$, …, $x(t-p)$, spanning $p$ samples of the past input signal. These sequences of input vectors (tensors) built on the training set are used to train and
evaluate a practical implementation of an LSTM-based RNN.
Introducing RNNs
The Demand Prediction Problem
Introducing RNNs
Let's start with an overview of RNNs.
RNNs are a family of neural networks that are not constrained to the feedforward architecture.
IMPORTANT NOTE
RNNs are obtained by introducing auto or backward connections – that is, recurrent connections – into feedforward neural
networks.
When introducing a recurrent connection, we introduce the concept of time. This allows RNNs to take context into account; that is,
to remember inputs from the past by capturing the dynamic of the signal.
Introducing recurrent connections changes the nature of the neural network from static to dynamic, making it suitable for analyzing time series. Indeed, RNNs are often used to create solutions to problems involving time-ordered sequences, such as time
series analysis, language modeling, free text generation, automatic machine translation, speech recognition, image captioning, and
other similar problems investigating the time evolution of a given signal.
Figure 6.1 – A simple fully connected feedforward network on the left and its more compact matrix-based representation on the
right
The compact representation of the network in Figure 6.1 includes one multi-dimensional input, $x$, one possibly multi-dimensional output, $y$, one hidden layer represented by the box containing the neuron icons, and the two weight matrixes, $W_{in}$ from the input to the hidden layer and $W_{out}$ from the hidden layer to the output.
Let's now introduce to this network a recurrent connection, feeding the output vector, $y(t-1)$, back into the input layer in addition to the original input vector, $x(t)$ ( Figure 6.2). This simple change to the network architecture changes the network behavior. Before, the function implemented by the network was just $y(t) = f(x(t))$, where $t$ is the current time when the input sample, $x(t)$, is presented to the network. Now, the function implemented by the recurrent network assumes the shape $y(t) = f(x(t), y(t-1))$; that is, the current output depends on the current input, as well as on the output produced in the previous step for the previous input sample. We have introduced the concept of time:
Figure 6.2 – Adding a recurrent connection to the feedforward network
Thanks to these recurrent connections, the output of RNNs also contains a bit of the history of the input signal. We then say that they
have memory. How far in the past the memory span goes depends on the recurrent architecture and the paradigms contained in it.
For this reason, RNNs are more suitable than feedforward networks for analyzing sequential data, because they can also process
information from the past. Past input information is metabolized via the output feedback into the input layer through the recurrent
connection.
The problem now becomes how to train a network where the output depends on the previous output(s) as well. As you can imagine,
a number of algorithms have been proposed over the years. The simplest one, and therefore the most commonly adopted, is Back
Propagation Through Time (BPTT ) (Goodfellow I, Bengio Y., Courville A., Deep Learning, MIT Press, (2016)).
BPTT is based on the concept of unrolling the network over time. To understand the concept of unrolling , let's take a few glimpses at
the network at different times, , during training:
At time $t_0$, we have the original feedforward network, with weight matrixes $W_{in}$ and $W_{out}$, input $x(t_0)$ and $y(t_0 - 1)$, and output $y(t_0)$.
At time $t_1$, we again have the original feedforward network, with weight matrixes $W_{in}$ and $W_{out}$, but this time with input $x(t_1)$ and $y(t_0)$, and output $y(t_1)$.
At time $t_2$, again, we have the original feedforward network, with weight matrixes $W_{in}$ and $W_{out}$, but this time with input $x(t_2)$ and $y(t_1)$, and output $y(t_2)$.
Practically, we can copy the same original feedforward network with static weight matrixes $W_{in}$ and $W_{out}$ $p$ times, which is as many samples as in the input sequence ( Figure 6.3). Each copy of the original network at time $t$ will have the current input vector, $x(t)$, and the previous output vector, $y(t-1)$, as input. More generically, at each time $t$, the network copy will produce an output, $y(t)$, and a related state, $s(t)$. The state, $s(t)$, is the network memory and feeds the next copy of the static network, while $y(t)$ is the dedicated output of each network copy. In some recurrent architectures, $y(t)$ and $s(t)$ are identical.
Producing a state tensor, $s(t)$, related to output tensor $y(t)$, used as the network memory
IMPORTANT NOTE
This recurrent network can also be just a sub-network that is a hidden unit in a bigger neural architecture. In this case, it is fed by the outputs of previous layers, and its output forms the input to the next layers in the bigger network. Then, $y(t)$ is not the output of the whole network, but just the output of this recurrent unit – that is, an intermediate hidden state of the full network.
In Figure 6.3, we show the unrolling over four time steps of the simple recurrent network in Figure 6.2:
At this point, we have transformed the recurrent sub-network into a sequence of copies of the original feedforward network –
that is, into a much larger static feedforward network. As large as it might be, we do already know how to train fully connected
feedforward networks with the backpropagation algorithm. So, the backpropagation algorithm has been adapted to include the
unrolling process and to train the resulting feedforward network. This is the basic BPTT algorithm. Many variations of the BPTT
algorithm have also been proposed over the years.
We will now dive into the details of the simplest recurrent network, the one made of just one layer of recurrent units.
This simple recurrent unit already shows some memory, in the sense that the current output also depends on previously presented samples at the input layer. However, its architecture is a bit too simple to show a considerable memory span. Of course, how long a memory span is needed depends on the task to solve. A classic example is sentence completion.
To complete a sentence, you need to know the topic of the sentence, and to know the topic, you need to know the previous words in
the sentence. For example, analyzing the sentence Cars drive on the …, we realize that the topic is cars and then the only logical
answer would be road. To complete this sentence, we need a memory of just four words. Let's now take a more complex sentence, such as I love the beach. My favorite sound is the crashing of the …. Here, many answers are possible, including cars, glass, or
waves. To understand which is the logical answer, we need to go back in the sentence to the word beach, which is nine words
backward. The memory span needed to analyze this sentence is more than double the memory span needed to analyze the previous
sentence. This short example shows that sometimes a longer memory span is needed to give the correct answer.
The simple recurrent neural unit provides some memory, but often not enough to solve most required tasks. We need something
more powerful that can crawl backward farther in the past than just what the simple recurrent unit can do. This is exactly why LSTM
units were introduced.
Figure 6.5 shows the structure of an unrolled LSTM unit (C. Olah, Understanding LSTM Networks, 2015,
https://colah.github.io/posts/2015-08-Understanding-LSTMs/):
As you can see, the different copies of the unit are connected by two hidden vectors. The one on the top is the cell state vector, $C(t)$, used to make information travel through the different unit copies. The second one on the bottom is the output vector of the unit, $h(t)$.
Next, we have the gates, three in total. Gates can open or close (or partially open/close) and, in this way, they make decisions on what
to store or delete from a hidden vector. A gate consists of a sigmoid function and a pointwise multiplication. Indeed, the sigmoid function takes values in [0,1]. Specifically, $0$ removes the input (forgets it), while $1$ lets the input pass unaltered (remembers it). In between $0$ and $1$, a variety of nuances of remembering and forgetting are possible.
The weights of these sigmoid layers, which implement the gates, are adjusted via the learning process. That is, the gates learn when to
allow data to enter, leave, or be deleted through the iterative process of making guesses, backpropagating error, and adjusting
weights via gradient descent. The training algorithm for LSTM layers is again an adaptation of the backpropagation algorithm.
An LSTM layer contains three gates: a forget gate, an input gate, and an output gate ( Figure 6.5). Let's have a closer look at these
gates.
The forget gate : based on the values in the current input vector, $x(t)$, and in the output vector of the previous unit, $h(t-1)$, the gate produces a forget or remember decision, $f(t)$, as follows:

$$f(t) = \sigma\left(W_f \cdot [h(t-1), x(t)] + b_f\right)$$

The vector of decisions, $f(t)$, is then pointwise multiplied by the hidden cell state vector, $C(t-1)$, to decide what to remember ($f_i(t) = 1$) and what to forget ($f_i(t) = 0$) from the previous state.
The question now is why do we want to forget? If LSTM units have been introduced to obtain a longer memory, why should we
need to forget something? Take, for example, analyzing a document in a text corpus; you might need to forget all knowledge about
the previous document since the two documents are probably unrelated. Therefore, with each new document, the memory should be
reset to 0.
Even within the same text, if you move to the next sentence and the subject of the text changes, and with the new subject a new
gender appears, then you might want to forget the gender of the previous subject, to be ready to incorporate the new one and to
adjust the corresponding part of speech accordingly.
The input gate decides which new information should enter the cell state. Its sigmoid layer, $i(t)$, lets input components pass completely ($1$), blocks them completely ($0$), or something in between, depending on their importance to the final, current, and future outputs.
The input gate doesn't operate on the previous cell state, $C(t-1)$, directly. Instead, a new cell state candidate, $\tilde{C}(t)$, is created, based on the values in the current input vector, $x(t)$, and in the output vector of the previous unit, $h(t-1)$, using a tanh layer.
The input gate now decides which information of the cell candidate state vector, $\tilde{C}(t)$, should be added to the cell state vector, $C(t)$. Therefore, the candidate state, $\tilde{C}(t)$, is multiplied pointwise by the output of the sigmoid layer of the input gate, $i(t)$, and then added to the filtered cell state vector, $f(t) * C(t-1)$. The final state, $C(t)$, then results in the following:

$$C(t) = f(t) * C(t-1) + i(t) * \tilde{C}(t)$$
What have we done here? We have added new content to the previous cell state vector, $C(t-1)$. Let's suppose we want to look at a new sentence in the text where there is a subject with a different gender. In the forget gate, we forgot about the gender previously
stored in the cell state vector. Now, we need to fill in the void and push the new gender into memory – that is, into the new cell state
vector.
Again, like all other gates, the output gate applies a sigmoid function to all components of the input vector, $x(t)$, and of the previous output vector, $h(t-1)$, in order to decide what to block and what to pass from the newly created state vector, $C(t)$, into the final output vector, $h(t)$. All decisions, $o(t)$, are then pointwise multiplied by the newly created state vector, squashed through a tanh function:

$$h(t) = o(t) * \tanh(C(t))$$

In this case, the output vector, $h(t)$, and the state vector, $C(t)$, produced by the LSTM recurrent unit are different.
Why do we need a different output from the unit cell state? Well, sometimes the output needs to be something different from the
memory. For example, while the cell state is supposed to carry the memory of the gender to the next unit copy, the output might be
required to produce the number, plural or singular, of the subject rather than its gender.
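Putting the three gates together, one LSTM step can be sketched in a few lines of numpy, matching the equations above (a didactic sketch only; the weight and bias shapes below are illustrative, and a real layer learns them via training):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    z = np.concatenate([h_prev, x])           # previous output and current input
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate decisions
    i = sigmoid(W["i"] @ z + b["i"])          # input gate decisions
    C_tilde = np.tanh(W["c"] @ z + b["c"])    # cell state candidate
    C = f * C_prev + i * C_tilde              # C(t) = f(t)*C(t-1) + i(t)*C~(t)
    o = sigmoid(W["o"] @ z + b["o"])          # output gate decisions
    h = o * np.tanh(C)                        # h(t) = o(t)*tanh(C(t))
    return h, C

h_dim, x_dim = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(h_dim, h_dim + x_dim)) for k in "fico"}
b = {k: np.zeros(h_dim) for k in "fico"}
h, C = lstm_step(rng.normal(size=x_dim), np.zeros(h_dim), np.zeros(h_dim), W, b)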
LSTM layers are a very powerful recurrent architecture, capable of keeping the memory of a large number of previous inputs. These
layers thus fit – and are often used to solve – problems involving ordered sequences of data. If the ordered sequences of data are
sorted based on time, then we talk about time series. Indeed, LSTM-based RNNs have been applied often and successfully to time
series analysis problems. A classic task to solve in time series analysis is demand prediction. In the next section, we will explore an
application of LSTM-based neural networks to solve a demand prediction problem.
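In Keras terms, such an LSTM-based network for a univariate time series could look roughly like this (a sketch assuming TensorFlow's Keras API; the sizes are illustrative, not the exact setup of the coming use case):

from tensorflow import keras

p = 24                                        # past samples fed as one input sequence
model = keras.Sequential([
    keras.Input(shape=(p, 1)),                # p time steps, one value each
    keras.layers.LSTM(64),                    # recurrent layer with internal memory
    keras.layers.Dense(1),                    # predicted next value of the series
])
model.compile(optimizer="adam", loss="mse")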
Demand prediction is a task related to the need to make estimates about the future. We all agree that knowing what lies ahead in the
future makes life much easier. This is true for life events as well as, for example, the prices of washing machines and refrigerators, or
demand for electrical energy in an entire city. Knowing how many bottles of olive oil customers will want tomorrow or next week
allows for better restocking plans in retail stores. Knowing of a likely increase in the demand for gas or diesel allows a trucking
company to better plan its finances. There are countless examples where this kind of knowledge of the future can be of help.
Demand Prediction
Demand prediction , or demand forecasting, is a big branch of data science. Its goal is to make estimations about future demand
using historical data and possibly other external information. Demand prediction can refer to any kind of numbers: visitors to a
restaurant, generated kWh, new school registrations, beer bottles, diaper packages, home appliances, fashion clothing and
accessories, and so on. Demand forecasting may be used in production planning, inventory management, and at times in assessing
future capacity requirements, or in making decisions on whether to enter a new market.
Demand prediction techniques are usually based on time series analysis. Previous values of demand for a given product, goods, or
service are stored and sorted over time to form a time series. When past values in the time series are used to predict future values in
the same time series, we are talking about autoregressive analysis techniques. When past values from other external time series are
also used to predict future values in the time series, then we are talking about multi-regression analysis techniques.
Time series analysis is a field of data science with a lot of tradition, as it already offers a wide range of classical techniques.
Traditional forecasting techniques stem from statistics and their top techniques are found in the Autoregressive Integrated
Moving Average (ARIMA ) model and its variations. These techniques require the assumption of a number of statistical hypotheses, which are hard to verify and often not realistic. On the other hand, they are satisfied with a relatively small amount of past
data.
Recently, with the growing popularity of machine learning algorithms, a few data-based regression techniques have also been
applied to demand prediction problems. The advantages of these machine learning techniques consist of the absence of required
statistical hypotheses and less overhead in data transformation. The disadvantages consist of the need for a larger amount of data.
Also, notice that in the case of time series where all required statistical hypotheses are verified, traditional methods tend to perform
better.
Let's try to predict the next values in the time series based on the $p$ past values. When using a machine learning model for time series analysis, such as, for example, linear regression or a regression tree, we need to supply the vector of the $p$ past samples as input to train the model to predict the next values. While this strategy is commonly implemented and yields satisfactory results,
it is still a static approach to time series analysis – static in the sense that each output response depends only on the corresponding
input vector. The order of presentation of input samples to the model does not influence the response. There is no concept of an
input sequence, but just of an input vector.
TIP
KNIME Analytics Platform offers a few nodes and standard components to deal with time series analysis. The key node here is the
Lag Column node to build a vector of past samples. In addition to the Lag Column node, a number of components dedicated to
time series analysis are available in the EXAMPLES/00_Components/Time Series folder in the
KNIME Explorer panel. These components use the KNIME GUI to run the statsmodels Python module in the
background. Because of that, they require the installation of the KNIME Python integration (https://www.knime.com/blog/setting-
up-the-knime-python-extension-revisited-for-python-30-and-20).
In Figure 6.6 , you can see the list of available components for time series analysis tasks within KNIME Analytics Platform:
Figure 6.6 – The EXAMPLES/00_Components/Time Series folder contains components dedicated to time series analysis
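What the Lag Column node produces can be mimicked in a couple of lines of pandas (the column name and the choice of three lags are made up for illustration):

import pandas as pd

df = pd.DataFrame({"energy": [1.2, 1.5, 1.1, 1.8, 2.0]})
for lag in range(1, 4):
    df[f"energy(-{lag})"] = df["energy"].shift(lag)
# each row now carries the current value plus its three predecessors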
All things considered, these machine learning-based strategies, using regression models, do not fully exploit the sequential structure of the data, where the fact that $x(t)$ comes after $x(t-1)$ carries some additional information. This is where RNNs,
and particularly LSTMs, might offer an edge on the other machine learning algorithms, thanks to their internal memory .
Let's now introduce the case study for this chapter: predicting the energy demand, in kilowatts (kW ), needed hour by hour.
To enable this kind of prediction, some years ago energy companies started to monitor the electricity consumption of each household,
store, or other entity, by means of smart meters. A pilot project was launched in 2009 by the Irish Commission for Energy Regulation
( CER ).
The Smart Metering Electricity Customer Behaviour Trials (CBTs ) took place between 2009 and 2010 with over 5,000 Irish
homes and businesses participating. The purpose of the trials was to assess the impact on consumers' electricity consumption, in order
to inform the cost-benefit analysis for a national rollout. Electric Ireland residential and business customers and Bord Gáis Energy
business customers who participated in the trials had an electricity smart meter installed in their homes or on their premises and
agreed to take part in the research to help establish how smart metering can help shape energy usage behaviors across a variety of
demographics, lifestyles, and home sizes.
The original dataset contains over 5,000 time series, each one measuring the electricity usage for each installed smart meter for a bit
over a year. All original time series have been aligned and standardized to report energy measures by the hour.
The final goal is to predict energy demand across all users. At this point, we have a dilemma: should we train one model for each
time series and sum up all predictions to get the demand in the next hour or should we train one single model on all time series to get
the global demand for the next hour?
Training one model on a single time series is easier and probably more accurate. However, training 5,000 models (and probably
more in real life) can pose a few technical problems. Training one single model on all time series might not be that accurate. As is
often the case, a compromise solution was implemented. Smart energy meters have been clustered based on their energy usage profile, and the
average time series of hourly energy usage for each cluster has been calculated. The goal now is to calculate the energy demand in
the next hour for each clustered time series, weight it by the cluster size, and then sum up all contributions to find the final total
energy demand for the next hour.
Thirty smart meter clusters have been detected based on the energy used on business days versus the weekend, at different times over
the 24 hours, and the average hourly consumption.
More details on this data preparation procedure can be found in the Data Chef ETL Battles. What can be prepared with today's data?
Ingredient Theme: Energy Consumption Time Series blog post, available at
https://www.knime.com/blog/EnergyConsumptionTimeSeries, and in the Big Data, Smart Energy, and Predictive Analytics
whitepaper, available at https://files.knime.com/sites/default/files/inline-images/knime_bigdata_energy_timeseries_whitepaper.pdf.
The final dataset contains 30 time series of average energy usage by the 30 clusters. Each time series shows the electrical profile of a
given cluster of smart meters: from stores (high energy consumption from 9 a.m. to 5 p.m. on business days) to nightly business
customers (high energy consumption from 9 p.m. to 6 a.m. every day), from family households (high energy consumption from 7
a.m. to 9 a.m. and then again from 6 p.m. to 10 p.m. every business day) to other unclear entities (using energy across 24 hours on
all 7 days of the week). For example, cluster 26 refers to stores ( Figure 6.7). Here, electrical energy is used mainly between 9 a.m.
and 5 p.m. on all business days:
Figure 6.7 – Plot of energy usage by the hour for cluster 26
On the opposite side, cluster 13 includes a number of restaurants ( Figure 6.8), where the energy usage is pushed toward the evening,
mainly from 6 p.m. to midnight, every day of the week:
Figure 6.8 – Plot of energy usage by the hour for cluster 13
Notice that cluster 26 is the poster child for time series analysis, with a clear seasonality over the 24 hours of the day and the 7 days of the
week series. In this chapter, we will continue with an autoregressive analysis of cluster 26's time series. The goal will be to predict the
average energy usage in the next hour, based on the average energy usage in the past hours, for cluster 26.
Now that we have a set of time series describing the usage of electrical energy by the hour for clusters of users, we will try to perform
some predictions of future usage for each cluster. Let's focus first on the data preparation for this time series problem.
When dealing with time series , the data preparation steps are slightly different from those implemented in other classification or
clustering applications. Let's go through these steps:
Data loading : Read from the file the time series of the average hourly used energy for the 30 identified clusters and the
corresponding times.
Date and time standardization : Time is usually read as a string from the file. To make sure that it is processed appropriately,
it is best practice to transform it into a Date&Time object. A number of nodes are available to deal with Date&Time objects in an
appropriate and easy way, but especially in a standardized way.
Timestamp alignment : Once the time series has been loaded, we need to make sure that its sampling has been consistent with
no time holes. Possible time holes need to be filled with missing values. We also need to make sure that the data of the time series
has been time-sorted.
Partitioning : Here, we need to create a training set to train the network and a test set to evaluate its performance. Differently
from classification problems, here we need to respect the time order so as not to mix the past and future of the time series in the
same set. Past samples should be reserved for the training set and future samples for the test set.
Missing value imputation : Missing value imputation for time series is also different from missing value imputation in a static
dataset. Since what comes after depends on what was there before, most techniques of missing value imputation for time series are
based on previous and/or the following sample values.
Creating the input vector of past samples : Once the time series is ready for analysis, we need to build the tensors to feed
the network. The tensors must consist of past samples that the network will use to predict the value for the next sample in time.
So, we need to produce n-dimensional vectors of past samples for all training and test records.
Creating the list to feed the network : Finally, the input tensors of past samples must be transformed into a list of values, as
this is the input format required by the network.
Date and time standardization is performed by the String to Date&Time node. In its configuration window ( Figure 6.9), you must
select the string input columns containing the date and/or time information and define the date/time format. You can do this
manually, by providing a format string – for example, dd.MM.yyyy, where dd indicates the day, MM the month, and yyyy the year.
For example, if your dates have the format day(2).month(2).year(4), you can manually add the option
dd.MM.yyyy, if it is not available among the Date format options. When manually adding the date/time type, you must select
the appropriate New type option: Date , Time , or Date&time .
Alternatively, you can provide the date/time format automatically, by pressing the Guess data type and format button. With this
last option, KNIME Analytics Platform will parse your string to find out the date/time format. It works most of the time! If it does
not, you can always revert to manually entering the date/time format.
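For readers who want to see the equivalent logic in code, here is a minimal pandas sketch of the same parsing step; the column name date and the sample values are hypothetical, and pandas is only an analogy for what the node does internally:
import pandas as pd

df = pd.DataFrame({"date": ["01.01.2009", "02.01.2009", "03.01.2009"]})  # hypothetical data
# Explicit format string, the equivalent of entering dd.MM.yyyy manually
df["date"] = pd.to_datetime(df["date"], format="%d.%m.%Y")
# Without a format, pandas guesses it - the analog of the Guess data type and format button
guessed = pd.to_datetime(["2009-01-01 13:00"])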
TIP
In the node description of the String to Date&Time node, you can find an overview of possible placeholders in the format
structures. The most important ones are y for year, M for month in year, d for day of month, H for hour of day (between 0 and 23),
m for minute of hour, and s for second of minute. Many more placeholders are supported – for example, W for week of month or
D for day of year.
The String to Date&Time node is just one of the many nodes that deals with Date&Time objects, all contained in the Other Data
Types/Time Series folder in the Node Repository panel. Some nodes manipulate Date&Time objects, such as, for example, to
calculate a time difference or produce a time shift; other nodes are used to convert Date&Time objects from one format to another.
After that, a Column Filter node is inserted to isolate the time series for cluster 26 only. The only standardization required here was
the conversion of the date from a string to a Date&Time object. We can now move on to data cleaning.
The Timestamp Alignment component is part of the time series-dedicated component set available in
EXAMPLES/00_Components/Time Series. To create an instance in your workflow, just drag and drop
it into the workflow editor or double-click it.
After that, we partition the data into a training set and test set, to train the LSTM-based RNN and evaluate it. We have not provided
an additional validation set here to evaluate the network performance throughout the training process. We decided to keep things
simple and just provide a training set to the Keras Network Learner node and a test set to measure the error on the time series
prediction task:
Figure 6.10 – The Partitioning node and its configuration window. Notice the Take from top data extraction mode for time series
analysis
To separate the input dataset into training and test sets, we again use a Partitioning node. Here, we decided to implement an 80%–
20% split: 80% of the input data will be directed toward training and 20% toward testing. In addition, we set the extraction procedure
to Take from top (Figure 6.10). In a time series analysis problem, we want to keep the intrinsic time order of the data: we use the
past to train the network and the future to test it. When using the Take from top data extraction option, the top percentage of the
rows is sent to the top output port, while the remaining rows are sent to the lower output port. If the data is time-sorted from
past to future, then this data extraction modality preserves the time order of the data.
IMPORTANT NOTE
In a time series analysis problem, partitioning should use the Take from top data extraction modality, in order to preserve the time
order of the data and use the past for training and the future for testing.
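In code, the Take from top extraction mode corresponds to a simple positional split; here is a minimal pandas sketch, assuming df holds the time-sorted data:
split_point = int(len(df) * 0.8)  # 80%-20% split
train = df.iloc[:split_point]     # the past, for training
test = df.iloc[split_point:]      # the future, for testing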
As for every dataset, missing value imputation is an important operation: first, because neural networks cannot deal with
missing values, and second, because the choice of the imputation technique can affect the final results.
IMPORTANT NOTE
Missing value imputation must be implemented after the Timestamp Alignment component since this component, by definition,
creates missing values.
In Chapter 4 , Building and Training a Feedforward Network , we already introduced the Missing Value node and its different
strategies to impute missing values. Some of these strategies are especially useful when it comes to sequential data, as they take the
previous and/or following values in a time series into account. Possible strategies are as follows:
Average/linear interpolation , replacing the missing value with the average of, or a linear interpolation between, the previous and next available samples
Moving average , replacing the missing value with the mean value of a window of surrounding samples
Next , replacing the missing value with the value of the next sample
Previous , replacing the missing value with the value of the previous sample
We went for linear interpolation between the previous and next values to impute missing values in the time series (Figure 6.11 ):
Figure 6.11 – The Missing Value node and its configuration window
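For intuition, here is a minimal pandas sketch of the four imputation strategies on a hypothetical time series s; the window size of 3 for the moving average is an arbitrary illustration:
s_linear = s.interpolate(method="linear")  # average/linear interpolation
s_movavg = s.fillna(s.rolling(3, min_periods=1, center=True).mean())  # moving average
s_next = s.bfill()  # next: fill with the value of the next sample
s_prev = s.ffill()  # previous: fill with the value of the previous sample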
Let's focus next on the creation of input tensors for the neural network.
IMPORTANT NOTE
A key node for creating the vectors of past samples so often needed in time series analysis is the Lag Column node.
Figure 6.12 shows the Lag Column node and its configuration window:
Figure 6.12 – The Lag Column node and its configuration window
The Lag Column node makes L copies of the selected column and shifts each copy down by a multiple of I cells, where I is the
Lag interval setting and L is the Lag setting in the configuration window.
The Lag Column node is a very simple yet very powerful node that comes in handy in a lot of situations. If the input column is time-
sorted, then shifting down the cells corresponds to moving them into the past or the future, depending on the time order.
Figure 6.13 – The Lag Column node takes snapshots of the same column at different times, as defined by the Lag and Lag interval
settings
Considering Lag = 4 and Lag interval = 2, the Lag Column node produces four copies of the selected column, each copy moving
backward with a step of 2. That is, besides the selected column at the current time t, we will also have four snapshots of the same
column at times t-2, t-4, t-6, and t-8 ( Figure 6.13).
For our demand prediction problem, we used the values for the average energy used by cluster 26 in the immediate 200 past hours to
predict the average energy need at the current hour. That is, we built an input vector with the 200 immediate past samples, using a
Lag Column node with Lag=200 and Lag Interval=1 ( Figure 6.12).
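The same lagging logic can be sketched in a few lines of pandas; the DataFrame energy_df and the column name cluster_26 are hypothetical:
import pandas as pd

def lag_column(df: pd.DataFrame, col: str, lag: int, lag_interval: int) -> pd.DataFrame:
    # Create `lag` copies of `col`, each shifted down by a multiple of `lag_interval` rows
    out = df.copy()
    for i in range(1, lag + 1):
        out[f"{col}(-{i * lag_interval})"] = df[col].shift(i * lag_interval)
    return out

lagged = lag_column(energy_df, "cluster_26", lag=200, lag_interval=1)
lagged = lagged.dropna()  # the first 200 rows have incomplete past vectors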
For space reasons, we then transformed the vector of cells into a collection of cells using the Column Aggregator node, as it is
one of the possible formats to feed the neural network via the Keras Network Learner node. The Column Aggregator node is another
way to produce lists of data cells. The node groups the selected columns per row and aggregates their cells using the selected
aggregation method. In this case, the List aggregation method was selected and applied to the 200 past values of cluster 26, as
created via the Lag Column node.
The workflow snippet implementing the data preparation part that feeds the upcoming RNN for the demand prediction problem is
shown in Figure 6.14:
Figure 6.14 – Data preparation for demand prediction: date and time standardization, time alignment, missing value imputation,
creating the input vector of past samples, and partitioning
The data is ready. Let's now build, train, and test the LSTM-based RNN to predict the average demand of electrical energy for cluster
26 at the current hour given the average energy used in the previous 200 hours by the same cluster 26.
A relatively simple network already achieves good error measures on the test set for our demand prediction task; therefore,
we decided to focus this section on how to test a model for time series prediction rather than on how to optimize the structural
parameters of the neural network. We looked at the optimization loop in Chapter 5, Autoencoder for Fraud Detection. In general, this
optimization loop can also be applied to optimize network hyperparameters. Let's begin by building an LSTM-based RNN with three layers:
One input layer accepting tensors of 200 past vectors – each past vector being just the previous sample, that is, with size 1 –
obtained through a Keras Input Layer node with Shape = 200, 1.
One hidden layer with 100 LSTM units, accepting the previous tensor as the only input, through the Keras LSTM Layer node
A classic dense layer as output with just one neuron producing the predicted value for the next sample in the time series, obtained
through the Keras Dense Layer node with the ReLU activation function.
The nodes used to build this neural architecture are shown in Figure 6.15:
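As a code-level reference, here is a minimal Keras sketch of the same architecture, assuming the layer sizes above and the training settings described later in this section:
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(200, 1))                  # Keras Input Layer, Shape = 200, 1
x = layers.LSTM(100)(inputs)                          # Keras LSTM Layer, 100 units
outputs = layers.Dense(1, activation="relu")(x)       # Keras Dense Layer, 1 unit, ReLU
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")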
IMPORTANT NOTE
The size of the input tensor was [200,1], which is a sequence of 200 1-sized vectors. If the length of the input sequence is not
known, we can use ? to indicate unknown sequence length. The NLP case studies in the next chapter will show you some examples
of this.
We have already described the Keras Input Layer node and the Keras Dense Layer node in previous chapters. Let's explore, in this
section, just the Keras LSTM Layer node.
IMPORTANT NOTE
Until now, we have used the term vector when talking about the input, the cell state, and the output. A tensor is a more
generalized form, representing an array extending along k dimensions. A rank 0 tensor is equal to a scalar value, a rank 1 tensor is
equal to a vector, and a rank 2 tensor is equal to a matrix.
Notice that the Keras LSTM Layer node accepts up to three input tensors: one with the input values of the sequence and two
optional ones to initialize the hidden states. If the previous neural layer produces more than one tensor as output, you can select, via a
drop-down menu in the configuration window of the current LSTM layer, which tensor should be used as input or to initialize the hidden states.
We will explore more complex neural architectures in the next chapters. Here, we have limited our architecture to the simplest classic
LSTM layer configuration, accepting just one input tensor from the input layer. The one input tensor accepted as input can be seen in
the configuration window of the LSTM Layer node in Figure 6.16:
Figure 6.16 – The Keras LSTM Layer node and its configuration window
For the LSTM layer, we can set two activation functions, called Activation and Recurrent activation . The Recurrent
activation function is used by the gates to filter the input components. The function selected as Activation is used to create the
candidates for the cell state, $\tilde{C}_t$, and to normalize the new cell state, $C_t$, before applying the output gate. This means that
for the standard LSTM unit, which we introduced in this chapter, the setting for Activation is the tanh function and for Recurrent
activation the sigmoid function.
We set the layer to add biases to the different gates of the LSTM unit but decided not to use dropout.
The Implementation and Unroll setting options don't have any impact on the results but can improve the performance depending
on your hardware and the sequence length. When activating the Unroll checkbox, the network will be unrolled before training,
which can speed up the learning process, but it is memory-expensive and only suitable for short input sequences. If unchecked, a so-
called symbolic loop is used in the TensorFlow backend.
You can choose whether to return the full sequence of intermediate output tensors or just the last output tensor (the Return
sequences option). In addition, you can also output the hidden cell state tensor (the Return state option). In the energy demand
prediction case study, only the final output tensor of the LSTM unit is used to feed the next dense layer with the ReLU activation
function. Therefore, the two checkboxes are not activated.
The other three tabs in the node configuration window set the regularization terms, initialization strategies, and constraints on the
learning algorithm. We set no regularizations and no constraints in this layer. Let's train this network.
The loss function is set to Mean Squared Error (MSE ) in the Target tab.
The number of epochs is set to 50, the training batch size to 256, and the training algorithm to Adam – an optimized version
of backpropagation – in the Options tab.
The learning rate is set to 0.001 with no learning rate decay.
For this network, with just one neuron in the output layer, the MSE loss function on a training batch takes on a simpler form and
becomes the following:
\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - t_i)^2
Here, m is the batch size, y_i is the output value for training sample i, and t_i is the corresponding target answer.
Since we are talking about number prediction and MSE as the loss function, the plot in the Loss tab of the Learning Monitor
view is the one to take into account to evaluate the learning process. Since we are trying to predict exact numbers, the accuracy is not
meaningful in this case. Figure 6.17 shows the Learning Monitor view of the Keras Network Learner node for this demand
prediction example:
Figure 6.17 – Plot of the MSE loss function over training epochs in the Loss tab of the Learning Monitor view
The screenshot in Figure 6.17 shows that after just a few batch training iterations, we reach an acceptable prediction error, at least on
the training set. After training, the network should be applied to the test set, using the Keras Network Executor node, and saved
for deployment as a Keras file using the Keras Network Writer node.
Let's now apply the trained LSTM network to the test set.
The In-sample testing component selects the number of input sequences to test on (the Row Filter node), then passes them through
the Keras Network Executor node, and joins the predictions with the corresponding target answers.
After that, and outside of the In-sample testing component, the Numeric Scorer node calculates some error metrics and the
Line Plot (Plotly) node shows the original time series and the reconstructed time series (final workflow in Figure 6.25). The
numeric error metrics quantify the error, while the line plot gives a visual idea of how faithful the predictions are. Predictions
generated with this approach are called in-sample predictions.
The Numeric Scorer node calculates six error metrics (Figure 6.19 ): R², Mean Absolute Error (MAE ), MSE, Root Mean
Squared Error (RMSE ), Mean Signed Difference (MSD ), and Mean Absolute Percentage Error (MAPE ). The
corresponding formulas are shown here:
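These metrics follow the standard textbook definitions below (a reconstruction using the symbols defined right after; \bar{t} denotes the mean of the target values):
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - t_i)^2}{\sum_{i=1}^{N} (t_i - \bar{t})^2}
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - t_i \rvert
\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2
\text{RMSE} = \sqrt{\text{MSE}}
\text{MSD} = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)
\text{MAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \left\lvert \frac{y_i - t_i}{t_i} \right\rvert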
Here, N is the number of predictions from the test set, y_i is the output value for test sample i, and t_i is the
corresponding target answer. We chose to apply the network on a test set of 600 tensors, generated the corresponding predictions,
and calculated the error metrics. This is the result we got:
Figure 6.19 – Error measures between in-sample predicted 600 values and the corresponding target values
Each metric has its pros and cons. Commonly adopted errors for time series predictions are MAPE, MAE, or MSE. MAPE, for
example, shows just 9% error on the next 600 values of the predicted time series, which is a really good result. The plot in Figure
6.20 proves it:
Figure 6.20 – The next 600 in-sample predicted values against the next 600 target values in the time series
This is an easy test. For each value to predict, we feed the network with the previous history of real values. This is a luxury situation
that we cannot always afford. Often, we predict the next 600 values, one by one, based just on past predicted values. That is, once we
have trained the network, we trigger the next prediction with the first 200 real past values in the test set. After that, however, we
predict the next value based on the latest 199 real values plus the currently predicted one; then again based on the latest 198 real
values plus the previously predicted one and the currently predicted one, and so on. This is a suboptimal, yet more realistic, situation.
Predictions generated with this approach are called out-sample predictions and this kind of testing is called out-sample testing.
To implement out-sample testing, we need to implement the loop that feeds the current prediction back into the vector of past
samples. This loop has been implemented in the deployment workflow as well. Let's have a look at the details of this implementation.
Here, a recursive loop , formed by a Recursive Loop Start node and a Recursive Loop End node, predicts, at each
iteration, the next value and forms the new input sequence for the network, by eliminating the oldest sample and adding the latest
prediction. The Recursive Loop Start node requires no configuration, while the Recursive Loop End node requires the
ending condition for the loop. We parameterized this ending condition (600 predictions) through the flow variable, named
no_preds, created in the Integer Configuration node (no_preds=600).
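The logic of the recursive loop can be sketched in a few lines of Python; model and seed (the last 200 real values that trigger the loop) are hypothetical names:
import numpy as np

no_preds = 600                  # ending condition, as in the no_preds flow variable
window = list(seed)             # current input sequence of 200 past samples
predictions = []
for _ in range(no_preds):
    x = np.array(window).reshape(1, 200, 1)        # one sequence of 200 size-1 vectors
    y = float(model.predict(x, verbose=0)[0, 0])   # predict the next value
    predictions.append(y)
    window = window[1:] + [y]   # drop the oldest sample, append the latest prediction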
The Integer Configuration node belongs to a special group of configuration nodes, so its configuration window transfers into the
configuration window of the component that contains it. As a consequence, the Deployment Loop component has a configuration
setting for the number of predictions to create with the recursive loop, as shown in Figure 6.22:
IMPORTANT NOTE
The recursive loop is one of the few loops in KNIME Analytics Platform that allows you to pass the results back to be consumed in
the next iteration.
The Deployment Loop component uses two more new important nodes:
The Keras to TensorFlow Network Converter node : The Keras to TensorFlow Converter node converts a Keras deep
learning model with a TensorFlow backend into a TensorFlow model. TensorFlow models are executed using the TensorFlow Java
API, which is usually faster than the Python kernel available via the Keras Python API. If we use the Keras Network Executor node
within the recursive loop, a Python kernel must be started at each iteration, which slows down the network execution. A
TensorFlow model makes the network execution much faster.
The TensorFlow Network Executor node : The configuration window of the TensorFlow Network Executor node is similar
to the configuration window of the Keras Network Executor node, the only difference being the backend engine, which in this
case is TensorFlow.
For out-sample testing, the deployment loop is triggered with the first tensor in the test set and from there it generates 600 predictions
autonomously. In the out-sample testing component, these predictions are then joined with the target values and outside of the out-
sample testing component, the Numeric Error node calculates the selected error metrics.
Obviously, for out-sample testing, the error values become larger ( Figure 6.23), since the prediction error is influenced by the
prediction errors in the previous steps. MAPE, for example, reaches 18%, which is practically double the result from in-sample
testing:
Figure 6.23 – Error measures between the out-sample predicted 600 values and the corresponding target values
In Figure 6.24, we can see the prediction error when visualizing the predicted time series and comparing it with the original time
series for the first 600 out-sample predictions:
Figure 6.24 – The next 600 out-sample predicted values (orange) against the next 600 target values (blue) in the time series
There, we can see that the first predictions are quite accurate, but they deteriorate the further we move from the onset of the test
set. This effect is, of course, not present for in-sample predictions. Indeed, the error values on the first out-sample predictions are
comparable to the error values for the corresponding in-sample predictions.
We have performed here a pretty crude time series prediction since we have not taken into account the seasonality prediction as a
separate problem. We have somehow let the network manage the whole prediction by itself, without splitting seasonality and
residuals. Our results are satisfactory for this use case. However, for more complex use cases, the seasonality index could be
calculated, the seasonality subtracted, and predictions performed only on the residual values of the time series. Hopefully, this would
be an easier problem and would lead to more accurate predictions. Nevertheless, we are satisfied with the prediction error, especially
considering that the network had to manage the prediction of the seasonality as well.
The final workflow, building, training, and in-sample testing the network, is shown in Figure 6.25:
Figure 6.25 – The final workflow to prepare the data and build, train, and test the LSTM-based network on a time series prediction
problem
This workflow is available in the book's GitHub space. Let's now move on to the deployment workflow.
In the deployment workflow, we use the same recursive loop to generate new samples (here, we went for 600); we apply the trained
LSTM-based RNN inside the deployment loop; and finally, we visualize the predictions with a Line Plot (Plotly) node. Notice that
this time there are no predictions versus target values, since the deployment data is real-world data and not lab data, and as such does
not have any target values to be compared to.
The deployment workflow is shown in Figure 6.26 and is available on KNIME Hub at
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%206/:
Figure 6.26 – The deployment workflow for a demand prediction problem
This is the deployment workflow, including data reading, the same data preparation as for the data in the training workflow, network
reading, and a deployment loop to generate the predictions.
In this last section, we have learned how to apply the deployment loop to a deployment workflow to generate new predictions in real
life.
Summary
In this chapter, we introduced a new recurrent neural unit: the LSTM unit. We showed how it is built and trained, and how it can be
applied to a time series analysis problem, such as demand prediction.
As an example of a demand prediction problem, we tried to predict the average energy consumed by a cluster of users in the next
hour, given the energy used in the previous 200 hours. We showed how to test in-sample and out-sample predictions and some
numeric measures commonly used to quantify the prediction error. Demand prediction applied to energy consumption is just one of
the many demand prediction use cases. The same approach learned here could be applied to predict the number of customers in a
restaurant, the number of visitors to a web site, or the amount of a type of food required in a supermarket.
In this chapter, we also introduced a new loop in KNIME Analytics Platform, the recursive loop, and we mentioned a new
visualization node, the Line Plot (Plotly) node.
In the next chapter, we will continue with RNNs, focusing on different text-related applications.
Questions and Exercises
Check your level of understanding of the concepts explored in this chapter by answering the following questions:
2. What is the data extraction option to use for partitioning in time series analysis?
3. What is a tensor?
b). A tensor is a k-dimensional vector.
a). In-sample testing uses the real past values from the test set to make the predictions. Out-sample testing uses past prediction
values to make new predictions.
d). In-sample testing applies the trained network while out-sample testing uses rules.
Chapter 7: Implementing NLP Applications
In Chapter 6, Recurrent Neural Networks for Demand Prediction, we introduced Recurrent Neural Networks (RNNs ) as a
family of neural networks that are especially powerful to analyze sequential data. As a case study, we trained a Long Short-Term
Memory (LSTM )-based RNN to predict the next value in the time series of consumed electrical energy. However, RNNs are not
just suitable for strictly numeric time series, as they have also been applied successfully to other types of time series.
Another field where RNNs are state of the art is Natural Language Processing (NLP ). Indeed, RNNs have been applied
successfully to text classification, language models, and neural machine translation. In all of these tasks, the time series is a sequence
of words or characters, rather than numbers.
In this chapter, we will run a short review of some classic NLP case studies and their RNN-based solutions: a sentiment analysis
application, a solution for free text generation, and a similar solution for the generation of name candidates for new products.
We will start with an overview of text encoding techniques to prepare the sequence of words/characters to feed our neural network.
The first case study, then, classifies text based on its sentiment. The last two case studies generate new text as sequences of new words,
and new words as sequences of new characters, respectively.
Exploring Text Encoding Techniques for Neural Networks
In Chapter 4, Building and Training a Feedforward Neural Network, you learned that feedforward networks – and all other neural
networks as well – are trained on numbers and don't understand nominal values. In this chapter, we want to feed words and
characters into neural networks. Therefore, we need to introduce some techniques to encode sequences of words or characters – that
is, sequences of nominal values – into sequences of numbers or numerical vectors. In addition, in NLP applications with RNNs, it is
mandatory that the order of words or characters in the sequence is retained throughout the text encoding procedure.
Let's have a look at some text encoding techniques before we dive into the NLP case studies.
Index Encoding
In Chapter 4 , Building and Training a Feedforward Neural Network , you learned about index encoding for nominal values. The
idea was to represent each nominal class with an integer value, also called an index.
We can use this same idea for text encoding. Here, instead of encoding each class with a different index, we encode each word or
each character with a different index. First, a dictionary must be created to map all words/characters in the text collection to an index;
afterward, through this mapping, each word/character is transformed into its corresponding index and, therefore, each sequence of
words/characters into the sequence of corresponding indexes. In the end, each text is represented as a sequence of indexes, where
each index encodes a word or a character. The following figure gives you an example:
Figure 7.1 – An example of text encoding via indexes at the word level
Notice that index 1, for the word the, and index 13, for the word brown, are repeated twice in the sequence, as the words appear
twice in the example sentence, the quick brown fox jumped over the brown dog.
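A minimal Python sketch of index encoding at the word level follows; note that the actual index values depend on how the dictionary was built over the whole corpus (in Figure 7.1, the = 1 and brown = 13), while here the first occurrence defines the index:
text = "the quick brown fox jumped over the brown dog"
words = text.split()

index = {}                      # dictionary mapping each word to an index
for w in words:
    if w not in index:
        index[w] = len(index) + 1

encoded = [index[w] for w in words]  # repeated words reuse the same index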
Later in this chapter, in the Finding the Tone of Your Customers' Voice – Sentiment Analysis section, we'll use index encoding on
words to represent text.
In the Free Text Generation with RNNs section, on the other hand, we'll use one-hot vectors as text encoding on characters. Let's
explore what one-hot vector encoding is.
TIP
Remember that the Keras Learner node can convert index-based encodings into one-hot vectors. Thus, to train a neural network
on one-hot vectors, it is sufficient to feed it with an index-based encoding of the text document.
A commonly used text encoding – similar to one-hot vectors but one that doesn't retain the word order – is the document vector .
Here, a vector is built from all the words available in the document collection and each word becomes a component in the vector
space. Thus, each text is transformed into a vector of 0s and 1s, encoding the presence (1) or absence (0) of the words. One vector
represents one text document and contains multiple 1s. Notice that this encoding does not retain the word order because all of the text
is encoded within the same vector structure regardless of the word order.
Working with words, the dimension of one-hot vectors is equal to the dictionary size – that is, to the number of words available in
the document corpus. If the document corpus is large, the dictionary size quickly becomes the number of words in the whole
language. Therefore, one-hot vector encoding on a word level can lead to very large and sparse representations.
Working with characters, the dictionary size is the size of the character set, which, even including punctuation and special signs, is
much smaller than in the previous case. Thus, one-hot vector encoding fits well for character encoding but might lead to
dimensionality explosion on word encoding.
To encode a document at the word level, a much more appropriate method is word embeddings, which project each word into a dense, continuous vector space of much lower dimension.
To learn the projection of each word into the continuous vector space, a dedicated neural network layer is used, which is called the
embedding layer. This layer learns to associate a vector representation with each word. The best-known word embedding techniques
are Word2vec and GloVe .
There are two ways that word embeddings can be used (J. Brownlee, How to Use Word Embedding Layers for Deep Learning with
Keras, Machine Learning Mastery Blog, 2017, https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-
keras/): the embedding can be learned jointly with the network for the task at hand, or a pre-trained embedding, such as Word2vec or GloVe vectors, can be reused.
If trained jointly with a neural network, the input to an embedding layer is an index-based encoded sequence. The number of output
units in the embedding layer defines the dimension of the embedding space. The weights of the embedding layer, which are used to
calculate the embedding representation of each index, and therefore of each word, are learned during the training of the network.
Now that we are familiar with different text encoding techniques, let's move on to our first NLP use case.
Generally, sentiment analysis belongs to a bigger group of NLP applications known as text classification. In the case of sentiment
analysis, the goal is to predict the sentiment class.
Another common example of text classification is language detection. Here, the goal is to recognize the text language. In both cases,
if we use an RNN for the task, we need to adopt a many-to-one architecture. A many-to-one neural architecture accepts a sequence
of inputs at different times, x_1, ..., x_n, and uses the final state of the output unit to predict the one single class – that is, sentiment or
language.
Figure 7.3 – An example of a many-to-one neural architecture: a sequence of many inputs at different times and only the final
status of the output
In our first use case in this chapter, we want to analyze the sentiment of movie reviews. The goal is to train an RNN at a word level,
with an embedding layer and an LSTM layer.
For this example, we will use the IMDb dataset, which contains two columns: the text of the movie reviews and the sentiment. The
sentiment is encoded as 1 for positive reviews and as 0 for negative reviews.
Figure 7.4 shows you a small subset with some positive and some negative movie reviews:
Figure 7.4 – Extract of the IMDb dataset, showing positive- and negative-labeled reviews
Let's start with reading and encoding the texts of the movie reviews.
As the number of words available in the IMDb document corpus is very high, we decided to reduce them during the text
preprocessing phase, by removing stop words and reducing all words to their stems. In addition, only the k most frequent terms in
the training set are encoded with a dedicated index, while all others receive just a default index.
In theory, RNNs can handle sequences of variable length. In practice, though, the sequence length for all input samples in one
training batch must be the same. As the number of words per review might differ, we define a fixed sequence length and we zero-
pad too-short sequences – that is, we add 0s to complete the sequence – and we truncate too-long sequences.
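This fixed-length logic is what, for example, the Keras pad_sequences utility implements in code; a minimal sketch with hypothetical index sequences and the sequence length of 80 used later in this chapter:
from tensorflow.keras.preprocessing.sequence import pad_sequences

encoded_reviews = [[12, 5, 91, 7], [4, 8]]   # hypothetical index-encoded reviews
fixed = pad_sequences(encoded_reviews, maxlen=80,
                      padding="post", truncating="post", value=0)
# -> shape (2, 80): too-short sequences are 0-padded, too-long ones truncated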
All these preprocessing steps are applied to the training set and the test set, with one difference. In the preprocessing of the training
set, the dictionary with the most frequent terms is created. This dictionary is then only applied during the preprocessing of the test
set.
1. Read the movie review dataset and partition it into a training set and a test set.
2. Tokenize, clean, and stem the movie reviews in the training set and the test set.
3. Create a dictionary of all the terms. The most frequent terms in the training set are represented by dedicated indexes and all
other terms by a default index.
4. Map the words in the training and test set to the corresponding dictionary indexes.
5. Truncate too-long word sequences in the training set and test set.
6. Zero-pad too-short word sequences in the training set and test set.
Figure 7.5 – Preprocessing workflow snippet for the sentiment analysis case study
The first metanode, Read and partition data , reads the table with the movie reviews and sentiment information and partitions the
dataset into a training set and a test set. The Preprocess training set metanode performs the different preprocessing steps on
the training set and creates and applies the dictionary, which is available at the second output port. The last metanode, Preprocess
test set , applies the created dictionary to the test set and performs the different preprocessing steps on the test set.
Let's see how all these steps are implemented in KNIME Analytics Platform.
Figure 7.6 shows you the workflow snippet inside the metanode:
Figure 7.6 – Workflow snippet inside the Read and partition metanode
The Table Reader node reads the table with the sentiment information as an integer value and the movie reviews as a string value.
Next, the sentiment information is transformed into a string with the Number To String node. This step is necessary to allow
stratified sampling in the Partitioning node. In the last step, the data type of the column sentiment is transformed back into an
integer using the String To Number node so that it can be used as the target column during training by the Keras Learner node.
Now that we have a training set and a test set, let's continue with the preprocessing of the training set.
Figure 7.7 – Workflow snippet inside the Preprocess training set metanode
For the preprocessing of the movie reviews, the KNIME Text Processing extension is used.
TIP
The KNIME Text Processing extension includes nodes to read and write documents from and to a variety of text formats; to
transform words; to clean up sentences of spurious characters and meaningless words; to transform a text into a numeric table; to
calculate all required text statistics; and finally, to explore topics and sentiment.
The KNIME Text Processing extension relies on a new data type: Document object . Raw text becomes a document when
additional metadata, such as title, author(s), source, and class, are added to it. Text in a document is tokenized following one of the
many available language-specific tokenization algorithms. Document tokenization produces a hierarchical structure of the text
items: sections, paragraphs, sentences, and words. Words are often referred to as tokens or terms.
To make use of the preprocessing nodes of the KNIME Text Processing extension, we need to transform the movie reviews into
documents, via the Strings To Document node. This node collects values from different columns and turns them into a document
object, after tokenizing the main text.
Figure 7.8 shows you the configuration window of the Strings To Document node:
Figure 7.8 – Configuration window of the Strings To Document node
Next, the document objects are cleaned through a sequence of text preprocessing nodes, contained in the Text Preprocessing
component of the workflow in Figure 7.7. The inside of the Text Preprocessing component is shown in Figure 7.9:
Figure 7.9 – Workflow snippet showing the inside of the Preprocessing component
The workflow snippet starts with the Punctuation Erasure node, to strip all punctuation from the input documents.
The Number Filter node filters out all numbers, expressed as digits, including decimal separators (, or . ) and possible leading signs
(+ or - ).
The N Chars Filter node filters out all terms with fewer than N characters, where N is specified in the configuration window of the node.
Filler words, such as so, thus, and so on, are called stop words . They carry little information and can be removed with the Stop
Word Filter node. This node filters out all terms that are contained in the selected stop word list. A custom stop word list can be
passed to the node via the second input port, or a default built-in stop word list can be adopted. A number of built-in stop word lists
are available for various languages.
The Case Converter node converts all terms into upper or lowercase. In this case study, they are converted into lowercase.
Lastly, the Snowball Stemmer node reduces words to their stem, removing the grammar inflection, using the Snowball stemming
library (http://snowball.tartarus.org/).
IMPORTANT NOTE
The goal of stemming is to reduce inflectional forms and derivationally related forms to a common base form. For example, look,
looking, looks, and looked are all replaced by their stem, look.
Now that we have cleaned up the text of the movie reviews of the training set, we can create the dictionary.
The dictionary assigns two indexes to each word:
Index : A progressive integer index, where index 1 is assigned to the most frequent term.
Counter : A progressive eight-digit index for each of the words. This eight-digit index is just a temporary index that will help us
deal with truncation.
Figure 7.10 – A small subset of the dictionary, where each word is represented by a progressive integer index and another
progressive eight-digit integer index
Both indexes are created in the Create Dictionary component and Figure 7.11 shows you the workflow snippet inside the
component:
The Create Dictionary component has a configuration window, which you can see in Figure 7.12. The input option in the
configuration window is inherited from the Integer Configuration node and requests the dictionary size as the number of most
frequent terms to encode with dedicated indexes.
The workflow inside the component first creates a global set of unique terms over all the documents by using the Unique Term
Extractor node:
This node allows us to create an index column and a frequency column, as shown in the preceding screenshot. The index column
contains a progressive integer number starting from 1, where 1 is assigned to the most frequent term.
The node optionally provides the possibility to filter the top k most frequent terms. For that, three frequency measures are available:
the term frequency , the document frequency , and the inverse document frequency . For now, we want to select all terms
and we will work on the dictionary size later.
IMPORTANT NOTE
Term frequency (TF ): The number of occurrences of a term in all documents
Document frequency (DF ): The number of documents in which a term occurs
Inverse document frequency (IDF ): Logarithm of the number of documents divided by DF
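In formula form, for a term t in a corpus of N documents:
\text{IDF}(t) = \log \frac{N}{\text{DF}(t)}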
The eight-digit index is created via the Counter Generation node. This node adds a new Counter column to the input data table,
starting from a minimum value (Min Value ) of 10,000,000 and using 1 as the step size. This minimum value guarantees the eight-
digit format.
The Index and Counter columns are then converted from integers into strings with the Number To String node.
Next comes the reduction of the dictionary size. The top k most frequent terms keep the progressive index assigned by the Unique
Term Extractor node, while all other terms get a single default index. Remember that k can be changed via the
component's configuration window. For this example, k was set to 20,000. In the lower part of the component sub-workflow, the
Row Splitter node splits the input data table into two sub-tables: the top k rows (top output port) and the rest of the rows (lower
output port).
The Constant Value Column node then replaces all index values with the default index value in the lower sub-table.
Lastly, the two sub-tables are concatenated back together.
Now that the dictionary is ready, we can continue with the truncation of the movie reviews.
First, we set the maximum number, m, of terms allowed in a document. Again, this is a parameter that can be changed through the
component's configuration window, shaped via the Integer Configuration node. We set the maximum number of terms in a
document – that is, the maximum document size – as m = 80 terms. If a document is too long, we should just keep the
first 80 terms and throw away the rest.
It is not easy to count the number of words in a text. Since words have variable lengths, we should detect the spaces separating the
words within a loop and then count the words. Loops, however, often slow down execution. So, an alternative trick is to use the
eight-digit representation of the words inside the text.
Within the text, each word is substituted by its eight-digit code via the Dictionary Replacer node. The Dictionary Replacer
node matches terms in the input documents at the top input port with dictionary terms at the lower input port and then replaces them
with the corresponding value in the dictionary table.
The lower input port provides the dictionary table for the matching and replacement operation.
IMPORTANT NOTE
The dictionary table must consist of at least two string columns. One string column contains the terms to replace (keys) and the
other string column contains the replacement strings (values). In the configuration window, we can set both columns from the data
table at the lower input port.
At this point, we have text with terms of fixed length (8 digits + 1 <space>) and not words of variable length.
So, limiting a text to 80 terms is the same as limiting it to 80 × 9 = 720 characters. This operation is much easier to carry out
without loops or complex node structures, but just with a String
Manipulation node. However, the String Manipulation node works on string objects and not on documents. To use it, we need
to move temporarily back to text as strings.
The text is extracted from the document as a simple string with the Document Data Extractor node. This node extracts
information, such as, for example, the text and title, from a document cell.
The Math Formula (Variable) node takes the flow variable for the maximum document size and calculates the maximum
number of characters allowed in a document.
The String Manipulation node extracts the substring from the text starting from the first character (at position 0) until the
maximum number of characters allowed, using the substr() function. This effectively keeps only the first 80 terms and
removes all others.
Lastly, the text is transformed back into a document, called Truncated Document , and all superfluous columns are removed in
the Column Filter node.
At this point, the eight-digit indexes have exhausted their task and can be substituted with the progressive integer index for the
encoding. This is done in the Dictionary Replacer node, once again.
With that, we have truncated too-long documents to the maximum number of terms allowed. Next, we need to zero-pad too-short
documents.
Zero-padding is again performed at the string level, and not at the document level. After the text has been extracted as a string from
the input document using the Document Data Extractor node, the Cell Splitter node splits the input text at each
<space> and creates one new column for each index.
Remember that all truncated text now has a maximum length of 80 indexes from the previous step. So, from those texts, the
number of newly generated columns is surely 80. For all other texts with shorter term sequences, the Cell Splitter node will fill
the empty columns with missing values. It is enough to turn these missing values into 0s and the zero-padding procedure is complete.
This replacement of missing values with 0s is performed by the Missing Value node.
Lastly, all superfluous columns are removed within the Column Filter node.
Now that all term sequences – that is, all text – have the same length, collection cells are created with the Create Collection
Cell node to feed the Keras Learner node.
Next, we need to perform the same preprocessing on the test set and apply the created dictionary.
Figure 7.16 shows you the workflow snippet inside the Preprocess test set metanode:
Figure 7.16 – Workflow snippet inside the Preprocess test set metanode
The lower part of the workflow is similar to the workflow snippet inside the Preprocess training set metanode, only the part
including the creation of the dictionary is different. Here, the dictionary for the test set is based on the dictionary from the training
set. All terms available in the training set dictionary receive the corresponding index encoding; all remaining terms receive the
default index.
Therefore, first a list of all terms in the test set is created using the Unique Term Extractor node. Next, this list is joined with the
list of terms in the training set dictionary using a right outer join. A right outer join allows us to keep all the rows from the lower
input port – that is, all terms in the test set – and to add the indexes from the training dictionary, if available. For all terms that are
not in the training dictionary, the Joiner node creates missing values in the index columns. These missing values are then replaced
with the default index value using the Missing Value node.
All other steps, such as truncation and zero-padding, are performed in the same way as in the preprocessing of the training set.
We have finished the preprocessing phase and we can now continue with the definition of the network architecture and its training.
Network Architecture
We want to use an LSTM-based RNN, where we train the embedding as well. The embedding is trained by an embedding layer.
Therefore, we create a neural network with four layers:
An input layer accepting the index-encoded term sequences
An embedding layer learning a vector representation for each term index
An LSTM layer
A dense layer with one unit with the sigmoid activation function, as we have a binary classification problem at hand
The embedding layer expects a sequence of index-based encoded terms as input. Therefore, the input layer must accept sequences of
integer indexes (in our case, 80 of them). This means Shape = 80 and data type = Int 32 in
the configuration window of the Keras Input Layer node.
Next, the Keras Embedding Layer node must learn to embed the integer indexes into an appropriate high-dimensional vector
space. Figure 7.17 shows its configuration window. The input tensor is directly recovered from the output of the previous input
layer:
There are two important configuration settings for the Keras Embedding Layer node. For the Input dimension setting, we
need to provide the dictionary size – that is, the number of unique indexes. The number of unique indexes is the maximum index
value plus 1, since the counter started from 0. For Output dimension , we provide the dimension of the final embedding space. We
have arbitrarily chosen an embedding dimension of 128. The output tensor of the Keras Embedding Layer node then has the
dimension [sequence length , embedding dimension]. In our case, this is [80, 128].
Next, the Keras LSTM Layer node is used to add an LSTM-based recurrent layer to the network. This node is used with the
default settings and 128 units, which means Units = 128, Activation = Tanh, Recurrent
activation = Hard sigmoid, Dropout = 0.2, Recurrent dropout = 0.2, and return sequences,
return state, go backward, and unroll all unchecked.
Lastly, a Keras Dense Layer node with one unit with the sigmoid activation function is used to predict the final binary sentiment
classification.
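As a code-level reference, here is a minimal Keras sketch of this four-layer network; the vocabulary size is an assumption (20,000 dedicated indexes plus the default index and the 0 used for padding):
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20002                                     # assumption, see lead-in
inputs = keras.Input(shape=(80,), dtype="int32")       # Keras Input Layer, Shape = 80
x = layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)  # embedding layer
x = layers.LSTM(128, activation="tanh", recurrent_activation="hard_sigmoid",
                dropout=0.2, recurrent_dropout=0.2)(x)  # Keras LSTM Layer
outputs = layers.Dense(1, activation="sigmoid")(x)      # Keras Dense Layer
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])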
Now that we have our preprocessed data and the neural architecture, we can start training the network.
In the first tab, Input Data , the From Collection of Number (integer) conversion is selected, as our input is a collection cell
of integer values (the indexes), encoding our movie reviews. Next, the collection cell is selected as input.
In the second tab, Target Data , the From Number (integer) conversion type and the column with the sentiment class are
selected. In the lower part, the binary cross-entropy is selected as the loss function since it is a binary classification task.
In the third tab, Options , the following training parameters are set: Epochs = 30, Training batch size = 100, Shuffle
training data before each epoch activated, and Optimizer = Adam (with the default settings).
Now that the network is trained, we can apply it to the test set and evaluate how good its performance is at predicting the sentiment
behind a review text.
In the configuration window, we again select From Collection of Number (integer) as the conversion type and the collection
cell as input.
As output, we are interested in the output of the last dense layer, as this gives us the probability for sentiment being equal to 1
(positive). Therefore, we click on the add output button, select the sigmoid layer, and make sure that the To Number (double)
conversion is used.
The Keras Network Executor node adds one new column to the input table with the probability for the positive class encoded as
1.
Next, the Rule Engine node translates this probability into a class prediction with the following expression:
$dense_1/Sigmoid:0_0$ > 0.5 => 1
TRUE => 0
Here, $dense_1/Sigmoid:0_0$ is the name of the output column from the network.
The expression transforms all values above 0.5 into 1s and all other values into 0s.
IMPORTANT NOTE
Remember that the different instruction lines in a Rule Engine node are executed sequentially. Execution stops when the
antecedent in one line is verified.
Lastly, the Scorer node evaluates the performance of the model and the Keras Network Writer node saves the trained network
for deployment. Figure 7.18 shows the network performance, in the view of the Scorer node, achieving a respectable 83% of
correct sentiment classification on the movie reviews:
Figure 7.18 – Performance of the LSTM and embedding-based network on sentiment classification
With this, we have finished our first NLP case study. Figure 7.19 displays the complete workflow used to implement the example.
You can download the workflow from the KNIME Hub at
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%207/:
Figure 7.19 – Complete workflow to prepare the text and build, train, and evaluate the neural network for sentiment analysis
For now, we offer no deployment workflow. In Chapter 10, Deploying a Deep Learning Network, we will come back to this trained
network to build a deployment workflow.
Let's now move on to the next NLP application: free text generation with RNNs.
For a network to generate new fairy tales, it must be trained on existing fairy tales. We downloaded the Brothers Grimm corpus from the Gutenberg project, from https://www.gutenberg.org/ebooks/2591.
Training a network at the word level sounds more logical since languages are structured by words and not by characters. Input
sequences (sequences of words) are short but the dictionary size (all words in the domain) is large. On the other hand, training the
network at a character level relies on much smaller and more manageable dictionaries, but might lead to very long input sequences.
According to Wikipedia, the English language, for example, has around 170,000 different words and only 26 different letters. Even
if we distinguish between uppercase and lowercase, and we add numbers, punctuation signs, and special characters, we have a
dictionary with less than 100 characters.
We want to train a network to generate text in the Brothers Grimm style. In order to do that, we train the network with a few
Brothers Grimm tales, which already implies a very large number of words in the dictionary. So, to avoid the problem of a huge
dictionary and the consequent possibly unmanageable network size, we opt to train our fairy tale generator at the character level.
Training at the character level means that the network must learn to predict the next character after the past characters have passed
through the input. The training set, then, must consist of many samples of sequences of characters together with the next
character to predict (the target value).
During deployment, a start sequence of characters triggers the network to generate the new text. The network predicts the next character from this first sequence; then, in the next step, the most recent characters of the sequence plus the predicted character form the new input sequence to predict the following character, and so on.
In the next section, we will explain how to clean, transform, and encode the text data from the Grimms' fairy tales to feed the
network.
On the left of Figure 7.20, you can see the created input sequences and the target values:
Figure 7.20 – Example of overlapping sequences used for training
Next, we need to encode the character sequences. In order to avoid introducing an artificial distance among characters, we opted for one-hot vector encoding. We will perform the one-hot encoding in two steps. First, we perform an index-based encoding; then we convert it into one-hot encoding in the Keras Network Learner node via the From Collection of Number (integer) to One-Hot Tensor conversion option. The resulting overlapping index-encoded sequences for the training set are shown on the right of Figure 7.20.
The workflow snippet in the next figure reads and transforms the fairy tales into overlapping index-based encoded character
sequences and their associated target character. Both the input sequence and target character are stored in a collection-type column:
Figure 7.21 – Preprocessing workflow snippet reading and transforming text from Brothers Grimm fairy tales
1. Reads all the fairy tales from the corpus and extracts five fairy tales for training and Snow white and Rose red as the seed for deployment
2. Reshapes the text, placing one character per row in a single column
3. Creates and applies the index-based dictionary, consisting, in this case, of the character set, including punctuation and special signs
4. Using the Lag Column node, creates the overlapping sequences and then re-sorts them from the oldest to the newest character in the sequence
5. Encapsulates the input sequence and target character into collection-type columns
The Row Splitter node splits the collection of fairy tales into two subsets: at the lower output port, only Snow white and Rose red, and at the top output port, all the other fairy tales. We'll save Snow white and Rose red for deployment.
Next, a Row Filter node is used to extract the first five fairy tales, which are used for training.
The next step is the reshaping of the text into a sequence of characters with one single column.
It starts with two String Manipulation nodes. The first one replaces each white space with a tilde character. The second one
replaces each character with the character itself plus a <space> character, by using the regexReplace() function.
regexReplace() takes advantage of regular expressions, such as "[^\\s]" to match any character in the input
string and "$0 " for the matched character plus <space>. The final syntax for the regexReplace() function, used within the String Manipulation node and applied to the input column, $Col0$, is then the following:
regexReplace($Col0$, "[^\\s]", "$0 ")
For the reshaping of the text, we set all columns as value columns and none as retaining columns. The result is the representation of
the fairy tale as a long sequence of characters within one column.
At last, some cleaning up: all rows with missing values are removed with the Row Filter node.
Now that we have the dictionary ready, we apply it with the Cell Replacer node, already introduced in Chapter 4 , Building and
Training a Feedforward Neural Network.
Figure 7.24 – Resulting output of the Lag Column node, where the time is sorted in ascending order from right to left
We need to reorder the columns to follow an ascending order from left to right, in order to have the oldest character on the left and
the most recent character on the right. This re-sorting is performed by the Resort Columns metanode.
Figure 7.25 shows you the inside of the metanode:
Here, the Reference Column Resorter node changes the order of the data columns in the table at the top input port according to
the order established in the data table at the lower input port. The reference data table at the lower input port must contain a string-type column with the column headers from the first input table in a particular order. The columns in the first data table are then
sorted according to the row order of the column names in the second data table.
To create the table with sorted column headers, we extract the column headers with the Extract Column Header node. The
Extract Column Header node separates the column headers from the table content and outputs the column headers at the top
output port and the content at the lower output port.
Then, the row of column headers is transposed into a column with the Transpose node.
Finally, we assign an increasing integer number to each column header via the Counter Generation node and we sort them by
counter value in descending order using the Sorter node.
Now that we have the column headers from the first table sorted correctly in time, we can input it at the lower port of the Reference
Column Resorter node. The result is a data table where each row is a sequence of characters, time is sorted
from left to right, and subsequent rows contain overlapping character sequences. At this point, we can create the collection cells for
the input and target data of the network.
IMPORTANT NOTE
Even though the target data consists of only one single value, we still need to transform it into a collection cell so that the index can
be transformed into a one-hot vector by the Keras Network Learner node.
Let's move on to the next step: defining and training the network architecture.
As usual, we define the input shape of the neural network using a Keras Input Layer node. The input here is a second-order tensor: the first dimension is the sequence length (100 during training, but we allow a variable length, ?) and the second dimension is the size of the one-hot vectors – that is, the size of the character set (65). So, the input shape is [?, 65].
As we don't need the intermediate hidden states, we leave most of the settings as default in the Keras LSTM Layer node. We just
set the number of units to 512.
Free text generation can be seen as a multi-class classification application, where the characters are the classes. Therefore, the Keras Dense Layer node at the output of the network is set to have 65 units (one for each character in the character set) with the softmax activation function, to score the probability of each character being the next character.
Let's proceed with training this network on the encoded overlapping sequences.
In the second configuration tab, Target Data , we select From Collection of Number (integer) to One-Hot-Tensor again
on the collection column containing the target value. As this is a multi-class classification problem, we set the loss function to
Categorical Cross Entropy .
In the third configuration tab, Options , we provide the training parameters: 50 epochs, training batch size 256, shuffling
option on, and optimizer as Adam with default settings for the learning rate.
The network is finally saved in Keras format with the Keras Network Writer node. In addition, the network is converted into a
TensorFlow network with the Keras to TensorFlow Network Converter node and saved with the TensorFlow Network
Writer node. The TensorFlow network is used in deployment to avoid a time-consuming Python startup, required by the Keras
network.
Figure 7.26 shows the full workflow implementing all the described steps to train a neural network to generate fairy tales. This
workflow and the used dataset are available on KNIME Hub at
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%207/:
Now that we have trained and saved the network, let's move on to deployment to generate a new fairy tale's text.
During deployment, the text generation is triggered by a start sequence of the same length as the training sequences (100 characters). We feed the network with that sequence to predict the next character; then, we delete the oldest character in the sequence, add the predicted one, and apply the network again to our new input sequence, and so on. This is exactly the same procedure that we used in the case study for demand prediction. So, we will implement it here again with a recursive loop ( Figure 7.27):
The trigger sequence was taken from the Snow white and Rose red fairy tale. The text for the trigger sequence was preprocessed, sequenced, and encoded as in the workflow used to train the network. This is done in the Read and Pre-Process metanode, shown in Figure 7.28:
Figure 7.28 – Workflow content in the Read and Pre-Process metanode to read and preprocess the trigger sequence
The workflow reads the Snow white and Rose red fairy tale as well as the dictionary from the files created in the training
workflow. Then, the same preprocessing steps as in the training workflow are applied.
After that, we read the trained TensorFlow network and apply it to the trigger sequence with the TensorFlow Network Executor
node.
The output of the network is the probability of each character to be the next. We can pick the predicted character following two
possible strategies:
The character with the highest probability is assigned to be the next character, known as the greedy strategy.
The next character is picked randomly, according to the probability distribution produced by the network.
We have implemented both strategies in the Extract Index metanode in two different deployment workflows.
Figure 7.29 shows the content of the Extract Index metanode when implementing the first strategy:
Figure 7.29 – Workflow snippet to extract the character with the highest probability
This metanode takes as input the output probabilities from the executed network and extracts the character with the highest
probability. The key node here is the Many to One node, which extracts the cell with the highest score (probability) from the
network output.
Figure 7.30 shows the content of the Extract Index metanode when implementing the second strategy:
Figure 7.30 – Workflow snippet to pick the next character based on a probability distribution
This workflow snippet expects as input the probability distribution for the characters and picks one according to it. The key node here is the Random Label Assigner (Data) node, which assigns a value based on the input probability distribution.
The Random Label Assigner (Data) node assigns one index to each data row at the lower input port based on the probability distribution at the upper input port. The data table at the upper input port must have two columns: one column with the class values – in our case, the index-encoded characters in string format – and one column with the corresponding probabilities. Therefore, the first part of the workflow snippet in Figure 7.30 prepares the data table for the top input port of the Random Label Assigner (Data) node, from the network output, using the Transpose node, the Counter Generation node, and the Number To String node, while the Table Creator node creates a new table with only one row. The Random Label Assigner (Data) node then picks one index, based on the probability distribution defined by the table at the first input port.
TIP
The idea of the recursive loop and its implementation are explained in detail in Chapter 6, Recurrent Neural Networks for Demand
Prediction.
You can download the deployment workflow, implementing both options, from the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%207/.
The trigger sequence of 100 characters (not italics) comes from the first sentence of the fairy tale, Snow white and Rose red. The
remaining text has been automatically generated by the network.
SNOW-WHITE AND ROSE-RED There was once a poor widow who lived in a lonely cottage. In front of the cas, and a
hunbred of wine behind the door of the; and he said the ansmer: 'What want yeurnKyow yours went for bridd, like is good any,
or cries, and we will say I only gave the witeved to the brood of the country to go away with it.' But when the father said, 'The
cat soon crick.' The youth, the old …
The network has successfully learned the structure of the English language. Although the text is not perfect, you can see sensible
character combinations, full words, some correct usage of quotation marks, and other similarly interesting features that the network
has assimilated from the training text.
Generating Product Names with RNNs
This last NLP case study is similar to the previous one. There, we wanted the network to create new free text based on a start
sequence; here, we want the network to create new free words based on a start token. There, we wanted the network to create new
sequences of words; here, we want the network to create new sequences of characters. Indeed, the goal of this product name
generation case study is to create new names – that is, new words. While there'll be some differences, the approaches will be similar.
Let's take a classic creative marketing example: product naming. Before a new product can be launched to the market, it actually
needs a name. To find the name, the most creative minds of the company come together to generate a number of proposals for
product names, taking different requirements into account. For example, the product name should sound familiar to the customers
and yet be new and fresh too. Of all those candidates, ultimately only one will survive and be adopted as the name for the new
product. Not an easy task!
Now, let's take one of the most creative industries: fashion. A company specializing in outdoor wear has a new line of clothes ready
for the market. The task is to generate a sufficiently large number of name candidates for the new line of clothing. Names of
mountains were proposed, as many other outdoor fashion labels have. Names of mountains evoke the feeling of nature and sound
familiar to potential customers. However, new names must also be copyright free and original enough to stand out in the market.
Why not use fictitious mountain names then? Since they are fictitious, they are copyright free and differ from competitor names;
however, since they are similar to existing mountain names, they also sound familiar enough to potential customers. Could an
artificial intelligence model help generate new fictitious mountain names that still sound realistic enough and are evocative of
adventure? What kind of network architecture could we use for such a task?
As we want to be able to form new words that are somehow reminiscent of mountain names, the network must be trained on the
names of already-existing mountains.
To form the training set, we use a list of 33,012 names of US mountains, as extracted from Wikipedia through a Wikidata query.
Now that we have some training data, we can think about the network architecture. This time, we want to train a many-to-many
LSTM-based RNN (see Figure 7.32). This means that during training, we have a sequence as input and a sequence as output. During
deployment, the RNN, based on some initialized hidden states and the start token, must predict the first character of the new name
candidate; then at the next step, based on the predicted character and on the updated hidden states, it must predict the next character
– and so on until an end token is predicted and the process of generating the new candidate name is concluded:
Figure 7.32 – Simplified, unrolled visualization of the many-to-many RNN architecture for the product name generation case study
To train the LSTM unit for this task, we need two sequences: an input sequence, made of a start token plus the mountain name, and a
target sequence, made of the mountain name plus an end token. Notice that, at each training iteration, we feed the correct character
into the network from the training set and not its prediction. This is called the teacher forcing training approach.
Let's focus first on preprocessing and encoding input and target sequences.
Figure 7.33 – Workflow to read, encode, and create the input and target sequences for mountain name generation
The workflow snippet in the preceding figure creates the input and target sequences by doing the following:
1. Reading the mountain names and removing duplicates by using the Table Reader node and the Duplicate Row Filter node
2. Replacing each <space> with a tilde character and afterward, each character with the character itself and <space>,
using two String Manipulation nodes (this step is described in detail in the preprocessing of the previous case study, Free text
generation with RNNs)
3. Creating and applying a dictionary (we will have a close look at this step in the next sub-section)
4. Character splitting based on <space> and replacing all missing values with end tokens, to zero pad too-short sequences
5. Creating input and target sequences as collection type cells
Most of the steps are similar to the preprocessing steps in the case study of free text generation with RNNs. We will only take a closer
look at step 3 and step 5. Let's start with step 3.
In this metanode, we again use nodes from the KNIME Text Processing extension. The Strings To Document node tokenizes
these names with spaced characters so that each character becomes its own term. Then, the Unique Term Extractor node gives us
the list of unique characters in all documents – that is, the character set. The Counter Generation node assigns an index to each
character starting from 2, as we want to use indexes 0 and 1 for the end and start tokens. To use it as a dictionary in the next step, the created numerical indexes are transformed into strings by the Number To String node. Finally, the dictionary is applied (the Dictionary Replacer node), to transform characters into indexes in the original mountain names, and the text is extracted from the document (the Document Data Extractor node).
TIP
The KNIME Text Processing extension and some of its nodes, such as Strings To Document , Unique Term Extractor , Dictionary Replacer , and Document Data Extractor , were introduced in more detail in the first case study of this chapter, Finding the Tone of Your Customers' Voice – Sentiment Analysis.
In the separate, lower branch of the workflow snippet, we finalize the dictionary for the deployment by adding one more row for the
end token, using the Add Empty Rows node (see Figure 7.35). This node adds a number of rows to the input data table, either
with missing values or with predefined constant values for each cell type. In our case, we add one additional row, and we use 0 as
the default value for the integer cells and an empty string for the string cells. This adds one new row to our dictionary table, with 0
in the index column and empty strings in the character columns. We need this additional row in the deployment workflow to remove
the end token(s):
Figure 7.35 – Configuration window of the Add Empty Rows node
The additional values are added with Constant Values Column nodes, where the constant value 1 is used for the start token in
the input sequence and the value 0 for the end token in the target sequence. In the case of the input sequence, the new column with the start token must be at the beginning. This is taken care of by the Column Resorter node. Now, the sequences can be aggregated and transformed into collection cells, using the Create Collection Column node.
Let's now design and train the appropriate network architecture.
The number of unique characters in the training set – that is, the character set size – is 95. Since we allow sequences of variable length, the shape of the input layer is [?, 95]. The ? stands for a variable sequence length.
Next, we have the Keras LSTM Layer node. This time, it is important to activate the Return sequences and Return state checkboxes, as we need the intermediate output states during the training process and the cell state in the deployment. We also set 256 units for this layer and leave all other settings unchanged.
In this case study, we want to add even more randomization to the character picked at the output layer, to increase the network's creativity. This is done by introducing the temperature parameter in the softmax function of the trained output layer.
Remember, the softmax function is defined as follows:

$$y_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

with $z_i$ being the linear part (the weighted sum plus bias) for unit $i$.

If we now introduce the additional temperature parameter, $T$, the formula for the activation function changes to the following:

$$y_i = \frac{e^{z_i / T}}{\sum_{j=1}^{n} e^{z_j / T}}$$

This means we divide the linear part, $z_i$, by $T$ before applying the softmax function.
After training the network, the temperature, $T$, is added by using the DL Python Editor node with the following lines of Python code:
from keras.models import Model
from keras.layers import Input, Lambda
from keras import backend as K
# Define Inputs
state1=Input((256,))
state2=Input((256,))
new_input=Input((1,95))
# Extract layers
lstm=input_network.layers[-4]
dense_softmax=input_network.layers[-1]
dense_linear=input_network.layers[-2]
# Apply LSTM Layer on new Inputs
x, h1, h2=lstm(new_input, initial_state=[state1, state2])
# Apply the linear layer
linear=dense_linear(x)
# Add lambda: divide the linear part by the temperature T
# (multiplying by 0.9 corresponds to T = 1/0.9)
linear_div_temp=Lambda(lambda x: x*0.9)(linear)
# Apply Softmax activation
probabilities = dense_softmax(linear_div_temp)
output_network = Model(inputs=[new_input, state1, state2],
outputs=[probabilities, h1, h2])
Remember that the hidden states of the previous LSTM unit are always used as input in the next LSTM unit. Therefore, three inputs
are defined in the code: two for the two hidden states and one for the last predicted character encoded as a one-hot vector.
Finally, the network is transformed into a TensorFlow network object and saved for deployment. The final training workflow is
shown in Figure 7.36:
Figure 7.36 – Training workflow for the product name generation case study
In the last two case studies, the hidden state vectors were re-initialized at each iteration, as we always had previous characters or
previous values as input. In this case study, we pass back, from the loop end node to the loop start node, not only the predicted
index but also the two hidden state tensors from the LSTM layer.
In Figure 7.37, you can see the deployment workflow, which is also available on the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%207/. Let's look at the
setting differences in detail:
Figure 7.37 – Deployment workflow to create multiple possible product names
The first component, Create Start Token , sets the number of new fictitious mountain names to generate. Then, it creates a table with three columns and one row per name. One column contains only start tokens – that is, a collection cell with the value 1. The other two columns contain the initial hidden states – that is, collection cells with 256 zeros in both columns.
The TensorFlow Network Executor node executes the network a first time, producing as output the probability distribution over the indexes. In the configuration window of the TensorFlow Network Executor node, we have selected as input the columns with the first hidden state, the second hidden state, and the input collection. In addition, we set three output columns: one output column
for the probability distribution, one output column for the first hidden state, and one output column for the second hidden state. We
then pick the next index-encoded character according to the output probability distribution using the Random Label Assigner
(Data) node in the First Char metanode. All these output values, predicted indexes, and hidden states make their way to the loop
start node to predict the second index-encoded character.
Then, we start the recursive loop to generate one character after the next. At each iteration, we apply the network to the last predicted index and hidden states. We then pick the next character, again with the Random Label Assigner (Data) node, and we feed the last predicted value and the new hidden states into the lower input port of the Recursive Loop End node so that they can reach back to the loop start node.
In the Extract Mountain Names component, we finally apply the dictionary – created in the training workflow – and we
remove all the mountain names that appeared already in the training set.
In Figure 7.38, you can see some of the generated mountain names. Indeed, they are new, copyright-free, and evocative of mountains and nature, and they can be generated automatically in any desired quantity:
Figure 7.38 – Mountain names generated by the deployment workflow
Summary
We have reached the end of this relatively long chapter. Here, we have described three NLP case studies, each one solved by training
an LSTM-based RNN applied to a time series prediction kind of problem.
The first case study analyzed movie review texts to extract the sentiment hidden in them. We dealt there with a simplified problem, considering a binary classification (positive versus negative) rather than the many nuances of possible user sentiment.
The second case study was language modeling. Training an RNN on a given text corpus with a given style produced a network
capable of generating free text in that given style. Depending on the text corpus on which the network is trained, it can produce fairy
tales, Shakespearean dialogue, or even rap songs. We showed an example that generates text in fairy tale style. The same workflows
can be easily extended, with more or less success, to generate rap songs (R. Silipo, AI generated rap songs, CustomerThink, 2019, https://customerthink.com/ai-generated-rap-songs/) or Shakespearean dialogue (R. Silipo, Can AI write like Shakespeare?, Towards Data Science, 2019, https://towardsdatascience.com/can-ai-write-like-shakespeare-de710befbfee).
The last case study involved the generation of candidates for new product names that must be innovative and copyright-free, stand out in the market, and be evocative of nature. So, we trained an RNN to generate fictitious mountain names to be used as name candidates for a new outdoor clothing line.
In the next chapter, we will describe one more NLP example: neural machine translation.
Questions and Exercises
1. What is a word embedding?
In this chapter, we will build on top of this case study for free text generation and train a neural network to automatically translate
sentences from a source language into a target language. To do that, we will use concepts learned from the free text generation
network, as well as from the autoencoder introduced in Chapter 5, Autoencoder for Fraud Detection.
We will start by describing the general concept of machine translation, followed by an introduction to the encoder-decoder neural
architectures that will be used for neural machine translation. Next, we will discuss all the steps involved in the implementation of the
application, from preprocessing to defining the network structure to training and applying the network.
The development of automatic translation systems started in the early 1970s with Rule-Based Machine Translation (RBMT).
Here, automatic translation was implemented through hand-developed rules and dictionaries by specialized linguists at the lexical,
syntactic, and semantic levels of sentences.
In the 1990s, statistical machine translation models became state of the art, even though the first concepts for statistical machine translation were introduced in 1949 by Warren Weaver. Instead of using dictionaries and handwritten rules, the idea became to use a vast corpus of examples to train statistical models. This task can be described as modeling the probability distribution, $p(t \mid s)$, that a string, $t$, in the target language (for example, German) is the translation of a string, $s$, in the source language (for example, English). Different approaches have been introduced to model this probability distribution, the most popular of which came from the Bayes theorem and modeled $p(t \mid s)$ as proportional to $p(s \mid t)\,p(t)$. Thus, in this approach, the task is split into two subtasks: training a language model, $p(t)$, and modeling the translation probability, $p(s \mid t)$. More generally, several subtasks can be defined, and several models are trained and tuned for each subtask.
More recently, neural machine translation gained quite some popularity in the task of automatic translation. Also, here, a vast corpus
of example sentences in a source and target language is required to train the translation model. The difference between classical
statistical-based models and neural machine translation is in the definition of the task: instead of training many small sub-components
and tuning them separately, one single network is trained in an end-to-end fashion.
One network architecture that can be used for neural machine translations is an encoder-decoder network. Let's find out what this is.
Encoder-Decoder Architecture
In this section, we will first introduce the general concept of an encoder-decoder architecture. Afterward, we will focus on how the
encoder is used in neural machine translation. In the last two subsections, we will concentrate on how the decoder is applied during
training and deployment.
In the case of encoder-decoder networks for neural machine translation, the task of the encoder is to extract the context of the
sentence in the source language (the input sentence) into a dense representation, while the task of the decoder is to create the
corresponding translation in the target language from the dense representation of the encoder.
Figure 8.1 – The general structure of an encoder-decoder network for neural machine translation
Here, the source language is English, and the target language is German. The goal is to translate the sentence I
am a
student from English into German, where one correct translation could be Ich bin ein Student. The
encoder consumes the I am a student sentence and produces as output a dense vector representation of the content of
the sentence. This dense vector representation is fed into the decoder, which then outputs the translation.
In this case study, the input and the output of the network are sequences. Therefore, Recurrent Neural Network (RNN ) layers
are commonly used in the encoder and decoder parts, to capture the context information and to handle input and output sequences of
variable length.
In general, encoder-decoder RNN-based architectures are used for all kinds of sequence-to-sequence analysis tasks – for example,
question-and-answer systems. Here, the question is first processed by the encoder, which creates a dense numerical representation of
it, then the decoder generates the answer.
Let's focus now on the encoder part of the neural translation network, before we move on to the decoder, to understand what kind of
data preparation is needed.
TIP
In Chapter 6, Recurrent Neural Networks for Demand Prediction, we introduced LSTM layers. Remember that an LSTM layer has
two hidden states, one being the cell state and the other being a filtered version of it. The cell state contains a summary of all previous
inputs.
In a classic encoder-decoder network architecture, the vectors of the hidden states of the LSTM layer are used to store the dense
representation. Figure 8.2 shows how the LSTM-based encoder processes the input sentence:
Figure 8.2 – Example of how the encoder processes the input sentence
The encoder starts with some initialized hidden state vectors. At each step, the next word in the sequence is fed into the LSTM unit
and the hidden state vectors are updated. The final hidden state vectors, after processing the whole input sequence in the source
language, contain the context representation and become the input for the hidden state vectors in the decoder.
The intermediate output hidden states of the encoder are not used.
Now that we have a dense representation of the context, we can use it to feed the decoder. While the way the encoder works during
training and deployment stays the same, the way the decoder works is a bit different during training and deployment.
Figure 8.3 shows an example of teacher forcing during the training phase of the decoder:
Figure 8.3 – Example of teacher forcing while training of the decoder
The dense context representation of the encoder is used to initialize the hidden states of the decoder's LSTM layer. Next, two
sequences are used by the LSTM layer to train the decoder: the input sequence with the true word/character values, starting with a
start token , and the target sequence, also with the true word/character values.
IMPORTANT NOTE
The target sequence, in this case, is the input sequence shifted by one character and with an end token at the end.
To summarize, three sequences of words/characters are needed during training:
The input sequence for the encoder – that is, the sentence in the source language
The input sequence for the decoder – that is, the sentence in the target language, starting with a start token
The target sequence for the decoder – that is, the sentence in the target language, shifted by one step and ending with an end token
During deployment, we don't have the input and output sequence for the decoder. So, let's find out how the trained decoder can be
used during deployment.
In the first step, the dense context representation from the encoder forms the input hidden state vectors and the start token forms
the input value for the decoder. Based on this, the first word is predicted, and the hidden state vectors are updated. In the next steps,
the updated hidden state vectors and the last predicted word are fed back into the LSTM unit, to predict the next word. This means
that if a wrong word is predicted once, the error accumulates in this kind of sequential prediction.
In this section, you learned what encoder-decoder neural networks are and how they can be used for neural machine translation.
In the next sections, we will go through the steps required to train a neural machine translation network to translate sentences from
English into German. As usual, the first step is data preparation.
So, let's start by creating the three sequences required to train a neural machine translation network using an encoder-decoder
structure.
Preparing the Data for the Two Languages
In Chapter 7, Implementing NLP Applications, we talked about the advantages and disadvantages of training neural networks at the
character and word levels. As we already have some experience with the character level, we decided to also train this network for
automatic translation at the character level.
To train a neural machine translation network, we need a dataset with bilingual sentence pairs for the two languages. Datasets for
different language combinations can be downloaded for free at www.manythings.org/anki/. From there, we can download a dataset
containing a number of sentences in English and German that are commonly used in everyday life. The dataset consists of two
columns only: the original short text in English and the corresponding translation in German.
Figure 8.5 shows you a subset of this dataset to be used as the training set:
Figure 8.5 – Subset of the training set with English and German sentences
As you can see, for some English sentences, there is more than one possible translation. For example, the sentence Hug Tom can
be translated to Umarmt Tom, Umarmen Sie Tom, or Drücken Sie Tom.
Remember that a network doesn't understand characters, only numerical values. Thus, character input sequences need to be
transformed into numerical input sequences. In the first part of the previous chapter, we introduced several encoding techniques.
In addition, a dictionary mapping for the English and German characters with their index is also needed. In the previous chapter, for
product name generation, we resorted to the KNIME Text Processing Extension to generate the index-based encoding for the
character sequences. We will do the same here.
For the training of the neural machine translation, three index-encoded character sequences must be created:
The input sequence to feed the encoder. This is the index-encoded input character sequence from the source language – in our
case, English.
The input sequence to feed the decoder. This is the index-encoded character sequence for the target language, starting with a start
token.
The target sequence to train the decoder, which is the input sequence to the decoder shifted by one step in the past and ending with
an end token.
The workflow in Figure 8.6 reads the bilingual sentence pairs, extracts the first 10,000 sentences, performs the index-encoding for
the sentences in English and German separately, and finally, partitions the dataset into a training and a test set:
Figure 8.6 – Preprocessing workflow snippet to prepare the data to train the network for neural machine translation
Most of the work of the preprocessing happens inside the component named Index encoding and sequence creation . Figure
8.7 shows its content:
Figure 8.7 – Workflow snippet inside the component named Index encoding and sequence creation
The workflow snippet inside the component first separates the English text from the German text, then produces the index-encoding
for the sentences – in the upper part for the German sentences and the lower part for the English sentences. Then, finally, for each
language, a dictionary is created, applied, and saved.
After the index-encoding of the German sentences, the two sequences for the decoder are created: in the upper branch by adding a
start token at the beginning and in the lower branch by adding an end token at the end of the sequence.
All sequences from the German and English languages are then transformed into collection cells so that they can be converted to
one-hot encoding before training.
The encoder consists of the following two layers:
An input layer via a Keras Input Layer node, defining the shape of the input sequences in the source language (English); in our example, [?, 70]
An LSTM layer via a Keras LSTM Layer node: in this node, we use 256 units and enable the return state checkbox to pass the hidden states to the upcoming decoder network
The decoder consists of the following three layers:
An input layer via a Keras Input Layer node, defining the shape of the input sequences in the target language (German). In our example, the input tensor for German has a shape of [?, 85].
An LSTM layer via a Keras LSTM Layer node. This time, the optional input ports are used to feed the hidden states from the
encoder into the decoder. This means the output port of the first LSTM layer in the encoder network is connected to both optional
input ports in the decoder network. In addition, the output port of the Keras Input Layer node for the German input sequences is
connected to the top input port. In its configuration window, it is important to select the correct input tensors as well as the hidden
tensors. The return sequence and return state checkboxes must be activated to return the intermediate output hidden states, which
are used in the next layer to extract the probability distribution for the next predicted character. As in the encoder LSTM, 256 units
are used.
Last, a softmax layer is added via a Keras Dense Layer node to produce the probability vector over the characters in the dictionary of the target language (German). In the configuration window, the softmax activation function is selected and the number of units is set to 85, which is the size of the dictionary of the target language.
The upper part of the workflow defines the encoder with a Keras Input Layer and Keras LSTM Layer node. In the lower part,
the decoder is defined as described previously.
Now that we have defined the encoder-decoder architecture, we can train the network.
In the first tab of the configuration window, named Input Data , the input columns for both input layers are selected: in the upper
part for the source language, which means the input for the encoder, and in the lower part for the target language, which means the
input for the decoder. To convert the index-encoded sequences into one-hot-encoded sequences, the From Collection of
Number (integer) to One-Hot Tensor conversion type is used for both columns.
In the next tab of the configuration window, named Target Data , the column with the target sequence for the decoder is selected
and the From Collection of Number (integer) to One-Hot Tensor conversion type is enabled again. Characters are again
considered like classes in a multi-class classification problem; therefore, the Categorical Cross Entropy loss function is adopted for
the training process.
In the third tab, Options , the training phase is set to run for a maximum of 120 epochs, with a batch size of 128 data rows, shuffling
the data before each epoch and using Adam as the optimizer algorithm with the default settings.
During training, we monitor the performance using the Learner Monitor view of the Keras Network Learner node and decide to
stop the learning process when an accuracy of 94% has been reached.
Remember that the output of the decoder is the probability distribution across all characters in the target language. In Chapter 7 ,
Implementing NLP Applications, we introduced two approaches for the prediction of the next character based on this output
probability distribution. Option one picks the character with the highest probability as the next character. Option two picks the next
character randomly according to the given probability distribution.
In this case study, we use option one and implement it directly in the decoder via an additional lambda layer . To summarize, when postprocessing the trained network, we need to perform the following steps:
1. Extract the trained encoder and decoder as separate networks.
2. Introduce a lambda layer with an argmax function that selects the character with the highest probability in the softmax layer.
IMPORTANT NOTE
Lambda layers allow you to use arbitrary TensorFlow functions when constructing sequential and functional API models using TensorFlow as the backend. Lambda layers are best suited for simple operations or quick experimentation.
Let's start with extracting the encoder.
1. Load packages:
from keras.models import Model
from keras.layers import Input
2. Define input:
new_input = Input((None,70))
3. Extract trained encoder LSTM and define model:
encoder = input_network.layers[-3]
output = encoder(new_input)
output_network = Model(inputs=new_input, outputs=output)
It starts with defining the input, feeding it into the encoder's LSTM layer, and then defining the output.
In more detail, in the first two lines, the required packages are loaded. Next, an input layer is defined; then, the -3 layer – the
trained LSTM layer of the encoder – is extracted. Finally, the network output is defined as the output of the trained encoder LSTM
layer.
Now that we have extracted the encoder, let's see how we can extract the decoder.
Extracting the Decoder and Adding a Lambda Layer
In the following code snippet, you can see the code used in the DL Python Network Editor node to extract the decoder part and
add the lambda layer to it:
1. Load the packages:
from keras.models import Model
from keras.layers import Input, Lambda
from keras import backend as K
2. Define the inputs – two for the hidden states and one for the one-hot-encoded character vector (85 is the size of the German character set):
state1=Input((256,))
state2=Input((256,))
new_input=Input((1,85))
3. Extract the trained LSTM and softmax layers of the decoder:
decoder_lstm = input_network.layers[-2]
decoder_dense = input_network.layers[-1]
4. Apply the LSTM and dense layer:
x, out_h, out_c = decoder_lstm(new_input, initial_state=
[state1, state2])
probability_output = decoder_dense(x)
5. Add the lambda layer and define the output:
argmax_output = Lambda(lambda x: K.argmax(x, axis=-1))(probability_output)
output_network = Model(inputs=[new_input, state1, state2],
outputs=[probability_output, argmax_output, out_h,
out_c])
The code again first loads the necessary packages, then defines three inputs – two for the input hidden states and one for the one-
hot-encoded character vector. Next, it extracts the trained LSTM layer and the softmax layer in the decoder. Finally, it introduces the
lambda layer with the argmax function and defines the output.
For faster execution during deployment, the encoder and the decoder are converted into TensorFlow networks using the Keras to
TensorFlow Network Converter node.
Now that we have trained the neural machine translation network and we have separated the encoder and the decoder, we want to
apply them to the sentences in the test set.
The decoder should be initialized with the first hidden states from the encoder and with the start token from the input sequence, to
trigger the translation character by character in the recursive loop. Figure 8.9 visualizes the process:
Figure 8.9 – The idea of applying the encoder and decoder model during deployment
Figure 8.10 – This workflow snippet applies the trained encoder-decoder neural architecture to translate English sentences into
German sentences
It starts with a TensorFlow Network Executor node (the first one on the left in Figure 8.10). This node takes the encoder and
the index-encoded English sentences as input. In its configuration window, two outputs are defined from the LSTM hidden states.
Next, we create a start token and transform it into a collection cell. To this start token, we apply the decoder network using another
TensorFlow Network Executor node (the second one from the left). In the configuration window, we make sure that the hidden
states from the encoder produced in the previous TensorFlow Network Executor node are used as input. As output, we again
set the hidden states, as well as the next predicted character – that is, the first character of the translated sentence.
Now, we enter the recursive loop, where this process is repeated multiple times using the updated hidden states from the last iteration
and the last predicted character as input.
Finally, the German dictionary is applied to the index-encoded predicted characters and the final translation is obtained. The
following is an excerpt of the translation results:
Figure 8.11 – Final results of the deployed translation network on new English sentences
In the first column, we have the new English sentences, in the second column, the correct translations, and in the last column, the
translation generated by the network. Most of these translations are actually correct, even though they don't match the sentences in
column two, as the same sentence can have different translations. On the other hand, the translation of the Talk to Tom
sentence is not correct as rune is not a German word.
In this section, you have learned how you can define, train, and apply encoder-decoder architectures based on the example of neural
machine translation at the character level.
Summary
In this chapter, we explored the topic of neural machine translation and trained a network to produce English-to-German translations.
We started with an introduction to automatic machine translation, covering its history from rule-based machine translation to neural
machine translation. Next, we introduced the concept of encoder-decoder RNN-based architectures, which can be used for neural
machine translation. In general, encoder-decoder architectures can be used for sequence-to-sequence prediction tasks, such as question-answer systems.
After that, we covered all the steps needed to train and apply a neural machine translation model at the character level, using a simple
network structure with only one LSTM unit for both the encoder and decoder. The joint network, derived from the combination of
the encoder and decoder, was trained using a teacher forcing paradigm.
At the end of the training phase and before deployment, a lambda layer was inserted in the decoder part to predict the character
with the highest probability. In order to do that, the structure of the trained network was modified after the training process with a
few lines of Python code in a DL Python Network Editor node. The Python code split the decoder and the encoder networks and
added the lambda layer. This was the only part involving a short, simple snippet of Python code.
Of course, this network could be further improved in many ways – for example, by stacking multiple LSTM layers or by training
the model at the word level using additional embeddings.
This is the last chapter on RNNs. In the next chapter, we want to move on to another class of neural networks, Convolutional
Neural Networks (CNNs), which have proven to be very successful for image processing.
Questions and Exercises
1. An encoder-decoder model is a:
We will start with a general introduction to CNNs, explaining the basic idea behind a convolution layer and introducing some related
terminology such as padding, pooling, filters, and stride.
Afterward, we will build and train a CNN for image classification from scratch. We will cover all required steps: from reading and
preprocessing of the images to defining, training, and applying the CNN.
To train a neural network from scratch, a huge amount of labeled data is usually required. For some specific domains, such as images
or videos, such a large amount of data might not be available, and the training of a network might become impossible. Transfer
learning is a proposed solution to handle this problem. The idea behind transfer learning consists of using a state-of-the-art neural
network trained for a task A as a starting point for another, related, task B.
Introduction to CNNs
CNNs are commonly used in image processing and have been the winning models in several image-processing competitions. They
are often used, for example, for image classification, object detection, and semantic segmentation.
Sometimes, CNNs are also used for non-image-related tasks, such as recommendation systems, videos, or time-series analysis.
Indeed, CNNs are not only applied to two-dimensional data with a grid structure but can also work when applied to one- or three-
dimensional data. In this chapter, however, we focus on the most common CNN application area: image processing .
A CNN is a neural network with at least one convolution layer . As the name states, convolution layers perform a convolution
mathematical transformation on the input data. Through such a mathematical transformation, convolution layers acquire the ability to
detect and extract a number of features from an image, such as edges, corners, and shapes. Combinations of such extracted features
are used to classify images or to detect specific objects within an image.
A convolution layer is often found together with a pooling layer , also commonly used in the feature extraction part of image
processing.
The goal of this section is thus to explain how convolution layers and pooling layers work separately and together and to detail the
different setting options for the two layers.
As mentioned, in this chapter we will focus on CNNs for image analysis. So, before we dive into the details of CNNs, let’s quickly
review how images are stored.
How are Images Stored?
A grayscale image can be stored as a matrix, where each cell represents one pixel of the image and the cell value represents the gray level of the pixel. For example, a black and white image with size 28 x 28 pixels can be represented as a matrix with dimensions 28 x 28, where each value of the matrix ranges between 0 and 255: 0 is a black pixel, 255 is a white pixel, and a value in between corresponds to a level of gray in the grayscale.
As each pixel is represented by one gray value only, one channel (matrix) is sufficient to represent this image. For color images, on
the other hand, more than one value is needed to define the color of each pixel. One option is to use the three values specifying the
intensity of red, green, and blue to define the pixel color. In the following screenshot, to represent a color image, three channels are
used instead of one: ( Figure 9.2):
Figure 9.2 – Representing a 28 x 28 color image using three channels for RGB
Moving from a grayscale image to a red, green, and blue (RGB) image, the more general concept of tensor (instead of a simple matrix) becomes necessary. In this way, the grayscale image can be described as a tensor of 28 x 28 x 1, while a color image with 28 x 28 pixels can be represented with a 28 x 28 x 3 tensor.
In general, a tensor representing an image with h pixels height, w pixels width, and c channels has the dimension h x w x c.
But why do we need special networks to analyze images? Couldn’t we just flatten the image, represent each image as a long
vector, and train a standard fully connected feedforward neural network?
IMPORTANT NOTE
The process of transforming a matrix representation of an image into a vector is called flattening .
Indeed, the spatial dependency is lost when the image is flattened into a vector. As a result, fully connected feedforward networks are
not translation-invariant. This means that they produce different results for shifted versions of the same image. For example, a
network might learn to identify a cat in the upper-left corner of an image, but the same network is not able to detect a cat in the
lower-right corner of the same image.
In addition, the flattening of an image produces a very long vector, and therefore it requires a very large fully connected feedforward network with many weights. For example, a 224 x 224 pixel image with three channels requires 224 x 224 x 3 = 150,528 inputs. If the next layer has 1,000 neurons, we would need to train over 150 million weights in the first layer alone. You see that the number of weights can quickly become unmanageable, likely leading to overfitting during training.
Convolution layers, which are the main building block of a CNN, allow us to solve this problem by exploiting the spatial properties
of the image. So, let’s find out how a convolution layer works.
For an image with one channel, a filter is a small matrix, often of size 3 x 3 or 5 x 5, called a kernel . Different kernels – that is, matrices with different values – filter different patterns. A kernel moves across an image and performs a convolution operation. That convolution operation gives the layer its name. The output of such a convolution is called a feature map .
IMPORTANT NOTE
For an input image with three channels (for example, an input tensor with shape 28 x 28 x 3), a kernel with kernel size 2 has the shape 2 x 2 x 3. This means the kernel can incorporate information from all channels but only within a small (2 x 2, in this example) region of the input image.
Figure 9.3 here shows an example of how a convolution is calculated for an image of size 4 x 4 and a kernel with size 3 x 3:
In this example, we start by applying the kernel to the upper-left region of the image. The image values are elementwise multiplied with the kernel values and then summed up.
The result of this elementwise multiplication and sum is the first value, in the upper-left corner, in the output feature map. The kernel
is then moved across the whole image to calculate all other values of the output feature map.
IMPORTANT NOTE
The convolution operation is denoted with a * and is different from a matrix multiplication. Even though the layer is called
convolution, most neural network libraries actually implement a related function called cross-correlation . To perform a correct
convolution, according to its mathematical definition, the kernel in addition must be flipped. For CNNs this doesn’t make a
difference because the weights are learned anyway.
In a convolution layer, a large number of filters (kernels) are trained in parallel on the input dataset and for the required task. That is,
the weights in the kernel are not set manually but are adjusted automatically as weights during the network training procedure.
During execution, all trained kernels are applied to calculate the feature map.
The dimension of the feature map is then a tensor of size output height x output width x number of kernels. In the example in Figure 9.3, we applied only one kernel, so the dimension of the feature map is 2 x 2 x 1.
Historically, kernels were designed manually for selected tasks. For example, the kernel in Figure 9.3 detects vertical lines. Figure 9.4
here shows you the impact of some other handcrafted kernels:
The convolution operation is just a part of the convolution layer. After that, a bias and a non-linear activation function are applied to each entry in the feature map. For example, we can add a bias value to each value in the feature map and then apply the rectified linear unit (ReLU) activation function to set all negative values to 0.
IMPORTANT NOTE
In Chapter 3, Getting Started with Neural Networks, we introduced dense layers. In a dense layer, the weighted sum of the input is
first calculated; then, a bias value is added to the sum, and the activation function is applied. In a convolutional layer, the weighted
sum of the dense layer is replaced by the convolution.
A convolution layer has multiple setting options. We have already introduced three of them along the way, and they are listed here:
The number of filters (kernels) to train
The kernel size
The activation function, where ReLU is the one most commonly used
There are three more setting options: padding, stride, and dilation rate. Let's continue with padding.
Introducing Padding
When we applied the filter in the example in Figure 9.3, the dimension of the feature map shrank compared to the dimension of the input image. The input image had a size of 4 x 4 and the feature map a size of 2 x 2.
In addition, by looking at the feature map, we can see that pixels in the inner part of the input image (cells with values f, g, j, and k)
are more often considered in the convolution than pixels at corners and borders. This implies that inner values will get a higher
weight in further analysis. To overcome this issue, images can be zero-padded by adding zeros in additional external cells ( Figure
9.5). This is a process called padding .
Figure 9.5 here shows you an example of a zero-padded input:
Figure 9.5 – Example of a zero-padded image
Here, two cells with value zero have been added to each row and column, all around the original image. If a kernel of size 3 x 3 is now applied to this padded image, the output dimension of the feature map would be the same as the dimension of the original image. The number of cells to use for zero padding is one more setting available in convolution layers.
Two other settings that influence the output size, if no padding is used, are called stride and dilation rate .
The number of pixels used for the kernel shift is called stride . The stride is normally defined by a tuple, specifying the number of
cells for the shift in the horizontal and vertical direction. A higher stride value, without padding, leads to a downsampling of the
input image.
The top part of Figure 9.6 shows how a kernel of size 3 x 3 moves across an image with stride 2, 2.
Another setting option for a convolution layer is the dilation rate. The dilation rate indicates that only one out of every d consecutive cells in the input image is used for the convolution operation. A dilation rate of 2 uses only one of every two pixels from the input image for the convolution. A dilation rate of 3 uses one of three consecutive pixels. As for the stride, the dilation rate is a tuple of values for the horizontal and vertical direction. When using a dilation rate higher than 1, the kernel gets dilated to a larger field of view on the original image. So, a 3 x 3 kernel with dilation rate 2 explores a field of view of size 5 x 5 in the input image, while using only nine convolution parameters.
For a 2 x 2 kernel and a dilation rate of 2, the kernel scans an area of 3 x 3 on the input image using only its corner values (see the lower part of Figure 9.6). This means for a dilation rate of 2, we have a gap of size 1. For a dilation rate of 3, we would have a gap of size 2, and so on:
Figure 9.6 – Impact of different stride and dilation rate values on the output feature map
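The settings interact. For an input of size n, kernel size k, zero padding p, stride s, and dilation rate d, the output feature map size along each dimension is output = floor((n + 2p − d·(k − 1) − 1) / s) + 1. For the example in Figure 9.3 (n = 4, k = 3, p = 0, s = 1, d = 1), this yields 2, matching the 2 x 2 feature map. In Keras code—which is what the KNIME layer nodes configure behind the scenes—these settings correspond to the arguments of the Conv2D layer; a minimal sketch with illustrative values:

from keras.layers import Conv2D

# 32 filters, 3 x 3 kernels, shifting 2 pixels per step; 'same' applies zero
# padding, 'valid' would apply none. dilation_rate=(1, 1) is the default.
layer = Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2),
               padding='same', dilation_rate=(1, 1), activation='relu')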
Introducing Pooling
The idea of pooling is to replace an area of the feature map with a summary statistic. For example, pooling can replace each 2 x 2 area of the feature map with its maximum value, called max pooling, or its average value, called average pooling (Figure 9.7):
Figure 9.7 – Results of max and average pooling
A pooling layer efficiently reduces the dimension of its input and allows the extraction of dominant features that are invariant to small rotations and positional shifts.
As with a filter, in pooling we need to define the size of the explored area for which to calculate the summary statistic. A commonly used setting is a pooling size of 2 x 2 pixels and a stride of two pixels in each direction. This setting halves the image dimension.
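As a sketch of the arithmetic, 2 x 2 max and average pooling with stride 2 can be computed in NumPy by grouping the feature map into non-overlapping 2 x 2 blocks:

import numpy as np

fm = np.array([[1., 3., 2., 0.],
               [4., 6., 1., 1.],
               [2., 1., 0., 5.],
               [0., 2., 3., 4.]])          # a 4 x 4 feature map
blocks = fm.reshape(2, 2, 2, 2)            # group into 2 x 2 blocks
max_pooled = blocks.max(axis=(1, 3))       # [[6., 2.], [2., 5.]]
avg_pooled = blocks.mean(axis=(1, 3))      # [[3.5, 1.], [1.25, 3.]]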
IMPORTANT NOTE
Pooling layers don’t have any weights, and all settings are defined during the configuration of the layer. They are static layers, and
their parameters do not get trained like the other weights in the network.
Pooling layers are normally used after a single convolution layer or after multiple stacked convolution layers.
Convolution layers can be applied to input images as well as to feature maps. Indeed, multiple convolution layers are often stacked
on top of each other in a CNN. In such a hierarchy, the first convolution layer may extract low-level features, such as edges. The
filters in the next layer then work on top of the extracted features and may learn to detect shapes, and so on.
The final extracted features can then be used for different tasks. In the case of image classification, the feature map—resulting from
the stacking of multiple convolution layers—is flattened, and a classifier network is applied on top of it.
To summarize, a standard CNN for image classification first uses a series of convolution and pooling layers, then a flatten layer, and then a series of dense layers for the final classification.
Now that we are familiar with convolutional layers and pooling layers, let’s see how they can be introduced inside a network for
image classification.
Classifying Images with CNNs
In this section, we will see how to build and train from scratch a CNN for image classification.
The goal is to classify handwritten digits between 0 and 9 with the data from the MNIST database , a large database of handwritten
digits commonly used for training various image-processing applications. The MNIST database contains 60,000 training images and
10,000 testing images of handwritten digits and can be downloaded from this website: http://yann.lecun.com/exdb/mnist/.
To read and preprocess images, KNIME Analytics Platform offers a set of dedicated nodes and components, available after installing
the KNIME Image Processing Extension .
TIP
The KNIME Image Processing Extension (https://www.knime.com/community/image-processing) allows you to read in more than 140 different image format types (thanks to the Bio-Formats Application Programming Interface (API)). In addition, it can
be used to apply well-known image-processing techniques such as segmentation, feature extraction, tracking, and classification,
taking advantage of the graphical user interface within KNIME Analytics Platform.
In general, the nodes operate on multi-dimensional image data (for example, videos, 3D images, multi-channel images, or even a
combination of these), via the internal library ImgLib2-API. Several nodes calculate image features (for example, Zernike,
texture, or histogram features) for segmented images (for example, a single cell). Machine learning algorithms are applied on the
resulting feature vectors for the final classification.
To apply and train neural networks on images, we need one further extension: the KNIME Image Processing - Deep
Learning Extension . This extension introduces a number of useful image operations—for example, some conversions necessary
for image data to feed the Keras Network Learner node.
IMPORTANT NOTE
To train and apply neural networks on images, you need to install the following extensions:
KNIME Image Processing (https://www.knime.com/community/image-processing)
KNIME Image Processing – Deep Learning Extension (https://hub.knime.com/bioml-
konstanz/extensions/org.knime.knip.dl.feature/latest)
Let’s get started with reading and preprocessing the handwritten digits.
The goal of the reading and preprocessing workflow is to read the images and to match them with their labels. Therefore, the following steps are implemented (also shown in Figure 9.8):
1. Read and sort the images for training.
2. Read the table with the image labels.
3. Append the correct label to each image.
4. Transform the pixel type from unsigned byte to float.
5. Create a collection column for the labels.
These steps are performed by the workflow shown in the following screenshot:
Figure 9.8 – This workflow reads a subset of the MNIST dataset, adds the corresponding labels, and transforms the pixel type from
unsigned byte to float
To read the images, we use the Image Reader (Table) node. This node expects an input column with the Uniform Resource
Locator (URL ) paths to the image files. To create the sorted list of URLs, the List Files node first gets all paths to the image files
in the training folder. Then, the Sort images metanode is used. Figure 9.9 here shows you the inside of the metanode:
The metanode extracts the image number from the filename with a String Manipulation node and sorts them with a Sorter node.
The Image Reader (Table) node then reads the images.
The File Reader node, in the lower branch, reads the table with the image labels.
In the next step, the Column Appender node appends the correct label to each image. Since the images have been sorted so as to match their corresponding labels, a simple appending operation is sufficient. Figure 9.10 here shows a subset of the output of the Column Appender node:
Figure 9.10 – Output of the Column Appender node, with the digit image and the corresponding label
Next, the Image Calculator node changes the pixel type from unsigned byte to float, by dividing each pixel value by 255.
Finally, the Create Collection Column node creates a collection cell for each label. The collection cell is required to create the
one-hot vector-encoded classes, to use during training.
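For comparison, the two preprocessing steps performed by the Image Calculator and Create Collection Column nodes correspond to the following Python sketch (array names and shapes are ours, for illustration):

import numpy as np
from keras.utils import to_categorical

images = np.random.randint(0, 256, size=(100, 28, 28, 1))  # stand-in for the digit images
labels = np.random.randint(0, 10, size=(100,))             # stand-in for the digit labels

x = images.astype('float32') / 255.0  # pixel normalization, as in the Image Calculator node
y = to_categorical(labels, 10)        # one-hot encoding of the classes used during training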
Now that we have read and preprocessed the training images, we can design the network structure.
Designing the Network
In this section, you will learn how to define a classical CNN for image classification.
A classical CNN for image classification consists of two parts, which are trained together in an end-to-end fashion, as follows:
Feature Extraction : The first part performs the feature extraction of the images, by training a number of filters.
Classification : The second part trains a classification network on the extracted features, available in the flattened feature map
resulting from the feature extraction part.
We start with a simple network structure with only one convolution layer, followed by a pooling layer for the feature extraction part.
The resulting feature maps are then flattened, and a simple classifier network, with just one hidden layer with the ReLU activation
function, is trained on them.
Figure 9.11 – This workflow snippet builds a simple CNN for the classification of the MNIST dataset
The workflow starts with a Keras Input Layer node to define the input shape. The images of the MNIST dataset have 28 x 28 pixels and only one channel, as they are grayscale images. Thus, the input is a tensor of shape (28, 28, 1), and therefore the input shape is set to 28, 28, 1.
Next, the convolutional layer is implemented with a Keras Convolution 2D Layer node. Figure 9.12 here shows you the
configuration window of the node:
Figure 9.12 – Keras Convolution 2D Layer node and its configuration window
The setting named Filters sets the number of filters to apply. This will be the last dimension of the feature map. In this example, we decided to train 32 filters.
Next, you can set the Kernel size option in pixels—that is, an integer tuple defining the height and width of each kernel. For the MNIST dataset, we use a kernel size of 3 x 3 pixels. This means the setting is 3, 3.
Next, you can set the Strides option, which is again defined by a tuple of two integers, specifying the strides of the convolution
along the height and width of the image. Any stride value greater than 1 is incompatible with any dilation_rate greater
than 1.
The output tensor of the convolutional layer (that is, our feature map) has the dimension 26 x 26 x 32, as we have 32 filters and we don't use padding.
Next, a Keras Max Pooling 2D Layer node is used to apply max pooling on the two dimensions.
Figure 9.13 here shows you the configuration window of the node:
Figure 9.13 – Keras Max Pooling 2D Layer node and its configuration window
In the configuration window of the Keras Max Pooling 2D Layer node, you can define the Pool size . Again, this is an integer
tuple defining the pooling window. Remember, the idea of max pooling is to represent each area of the size of the pooling window
with the maximum value in the area.
The stride is again an integer tuple, setting the step size to shift the pooling window.
Lastly, you can select whether to apply zero padding by selecting Valid for no padding, and Same to apply padding.
For this MNIST example, we set Pool size as 2, 2, Strides as 2, 2, and applied no padding. Therefore, the dimension of the output of the pooling layer is 13 x 13 x 32.
Next, a Keras Flatten Layer node is used to transform the feature map into a vector with dimension 13 * 13 * 32 = 5,408.
After the Keras Flatten Layer node, we build a simple classification network with one hidden layer and one output layer. The
hidden layer with the ReLU activation function and 100 units is implemented by the first Keras Dense Layer node in Figure 9.11,
while the output layer is implemented by the second (and last) Keras Dense Layer node in Figure 9.11 . As it is a multiclass
classification problem with 10 different classes, here the softmax activation function with 10 units is used. In addition, the Name
prefix output is used so that we can identify the output layer more easily when applying the network to new data.
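For orientation, the network assembled by this sequence of layer nodes corresponds to the following Keras definition (a sketch; the ReLU activation of the convolution layer is our assumption, the rest follows the settings described above):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    Conv2D(32, kernel_size=(3, 3), activation='relu',   # 32 filters, 3 x 3 kernels
           input_shape=(28, 28, 1)),                    # 28 x 28 grayscale input
    MaxPooling2D(pool_size=(2, 2), strides=(2, 2)),     # 26 x 26 x 32 -> 13 x 13 x 32
    Flatten(),                                          # 13 * 13 * 32 = 5,408 values
    Dense(100, activation='relu'),                      # hidden layer of the classifier
    Dense(10, activation='softmax', name='output'),     # one unit per digit class
])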
Now that we have defined the network structure, we can move on to train the CNN.
The network is trained with the Keras Network Learner node. Since the input data are images, an additional conversion option is required. Figure 9.14 here shows the Input Data tab of the configuration window of the Keras Network Learner node, including this additional conversion option:
Figure 9.14 – Input Data tab of the configuration window of the Keras Network Learner node with the additional conversion
option, From Image (Auto-mapping)
In the Target Data tab, the conversion option from From Collection of Number (integer) to One-Hot Tensor is selected
for the column with the collection cell of the image label.
At the bottom, the Categorical cross entropy loss function is selected, as the problem is a multiclass classification problem.
In the Options tab, the following training parameters are set:
Number of epochs : 10
Training batch size : 200
Optimizer : Adadelta with the default settings
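For reference, the same training configuration in plain Keras code would look roughly as follows (continuing the sketches above; the Keras Network Learner node performs these steps internally):

# model, x, and y as defined in the previous sketches
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',              # Adadelta with its default settings
              metrics=['accuracy'])
model.fit(x, y, epochs=10, batch_size=200)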
Figure 9.15 here shows the progress of the training procedure in the Learning Monitor view of the Keras Network Learner
node after node execution:
Figure 9.15 – The Learning Monitor view shows the training progress of the network
The Learning Monitor view shows the progress of the network during training over the many training batches. On the right-hand side, you can see the accuracy for the last few batches. Current Value shows you the accuracy for the last batch, which is in this case 0.995.
Now that we have a trained CNN satisfactorily performing on the training set, we can apply it to the test set. Here, the same reading
and preprocessing steps as for the training set must also be applied on the test set.
The Keras Network Executor node applies the trained network on the images in the test set. In the configuration window, the last
layer, producing the probability distribution of the different digits, is selected as output.
At this point, a bit of postprocessing is required in order to extract the final prediction from the network output.
The goal of the postprocessing is to extract the class with the highest probability and then to evaluate the network performance. This
is implemented by the workflow snippet shown here in Figure 9.16:
Figure 9.16 – This workflow snippet extracts the digit class with the highest probability and evaluates the network performance on
the test set
The Many to One node extracts the column header of the column with the highest probability in each row.
Then, the Column Expression node extracts the class from the column header.
TIP
The Column Expression node is a very powerful node. It provides the possibility to append an arbitrary number of new columns
or modify existing columns using expressions.
For each column to be appended or modified, a separate expression can be defined. These expressions can be simply created using
predefined functions, similarly to the Math Formula and the String Manipulation nodes. Nevertheless, there is no restriction on
the number of lines an expression can have and the number of functions it can use. Additionally, intermediate results of functions or
calculations can be stored within an expression by assigning them to temporary variables (using =).
Available flow variables and columns of the input table can be accessed via the provided access functions variable("variableName") and column("columnName").
Figure 9.17 here shows you the configuration window of the Column Expression node, with the expression used in the workflow
snippet in Figure 9.16 to extract the class information. In this case, the expression extracts the last character from the strings in the
column named Detected Digit :
Figure 9.17 – The Column Expression node and its configuration window
Next, the data type of the predicted class is converted from String to Integer with the String to Number node, and
the network performance is evaluated on the test set with the Scorer node.
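Conceptually, this postprocessing is just an argmax over the ten class probabilities, followed by a comparison with the true labels; a NumPy sketch with stand-in data:

import numpy as np

probs = np.random.rand(10000, 10)               # stand-in for the network output per test image
probs /= probs.sum(axis=1, keepdims=True)       # rows sum to 1, like the softmax output
predicted = probs.argmax(axis=1)                # class with the highest probability
true = np.random.randint(0, 10, size=(10000,))  # stand-in for the true labels
accuracy = (predicted == true).mean()           # what the Scorer node reports as accuracy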
Figure 9.18 here shows the view produced by the Scorer node:
Figure 9.18 – View of the Scorer node, showing the performance of the network on the test set
As you can see, this simple CNN has already reached an accuracy of 94% and a Cohen’s kappa of 0.934 on the test set. The
complete workflow is available on the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%209/.
In this section, we built and trained from scratch a simple CNN, reaching an acceptable performance for this rather simple image
classification task. Of course, we could try to further improve the performance of this network by doing the following:
Increasing the number of training epochs
Using augmentation
Using dropout
We leave this up to you, and continue with another way of training networks, called transfer learning.
But why should we use transfer learning instead of training models in the traditional, isolated way?
Getting a comprehensive labeled dataset for a new domain, so that a network can be trained to reach state-of-the-art performance, can be difficult or even impossible. As an example, the often-used ImageNet database, which is used to train state-of-the-art models, has been developed over the course of many years. It would take a long time to create a similar new dataset for a new image
domain. However, when these state-of-the-art models are applied to other related domains, they often suffer a considerable loss in
performance, or, even worse, they break down. This happens due to the model bias toward the training data and domain.
Transfer learning allows us to use the knowledge gained during training on a task and domain where sufficient labeled data was
available as a starting point, to train new models in domains where not enough labeled data is yet available. This approach has shown
great results in many computer vision and natural language processing (NLP ) tasks.
Figure 9.20 here visualizes the idea behind transfer learning:
Figure 9.20 – Idea behind transfer learning
Before we talk about how we can apply transfer learning when training a neural network, let’s have a quick look at the formal
definition of transfer learning and the many scenarios in which it can be applied.
A domain D consists of a feature space X and a marginal probability distribution P(X) over that feature space. For a given domain D, a task T consists of the following two components as well:
A label space Y
A predictive function f(·)
Here, the predictive function could be the conditional probability distribution P(Y|X). In general, the predictive function is a function trained on the labeled training data to predict the label for any sample in the feature space.
Using this terminology, transfer learning is defined by Sinno Jialin Pan and Qiang Yang in the following way:
"Given a source domain D_S and learning task T_S, a target domain D_T and learning task T_T, transfer learning aims to help improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T, or T_S ≠ T_T."
Sebastian Ruder uses this definition in his article, Transfer Learning - Machine Learning's Next Frontier, 2017 (https://ruder.io/transfer-learning/) to describe the following four scenarios in which transfer learning can be used:
1. Different feature spaces between the source and target domains (X_S ≠ X_T). An example in the article is cross-lingual adaptation, where we have documents in different languages.
2. Different marginal probability distributions between the source and target domains (P(X_S) ≠ P(X_T)). An example comes in the form of documents that discuss different topics. This scenario is called domain adaptation.
3. Different label spaces between the source and target tasks (Y_S ≠ Y_T).
4. Different conditional probability distributions between the source and target tasks (P(Y_S|X_S) ≠ P(Y_T|X_T)).
Now that we have a basic understanding of transfer learning, let’s find out next how transfer learning can be applied to the field of
deep learning.
In a stacked CNN for image classification, the initial convolution layers are responsible for extracting low-level features such as
edges, while the next convolution layers extract higher-level features such as body parts, animals, or faces. The last layers are trained
to classify the images, based on the extracted features.
So, if we want to train a CNN for a different image-classification task, on different images and with different labels, we do not have to train the new filters from scratch, but can use the previously trained convolution layers of a state-of-the-art network as the starting point. Hopefully, the new training procedure will then be faster and will require a smaller amount of data.
To use the trained layers from another network as the training starting point, we need to extract the convolution layers from the
original network and then build some new layers on top. To do so, we have the following two options:
We freeze the weights of the trained layers and just train the added layers based on the output of the frozen layers. This approach
is often used in NLP applications, where trained embeddings are reused.
We use the trained weights to initialize new convolution layers in the network and then fine-tune them while training the added layers. In this case, a small learning rate is used so as not to unlearn the knowledge gained from the source task.
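In Keras terms, the two options differ only in whether the reused layers remain trainable and, for the second option, in the size of the learning rate; a sketch, assuming base_model holds the pretrained layers:

from keras.optimizers import Adam

# Option 1: freeze the reused layers and train only the newly added head.
for layer in base_model.layers:   # base_model is assumed to be a pretrained network
    layer.trainable = False

# Option 2: keep them trainable, but fine-tune with a small learning rate
# so the knowledge from the source task is not unlearned.
optimizer = Adam(lr=1e-5)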
For the last case study of this book, we want to train a neural network to predict cancer type from histopathology slide images. To
speed up the learning process and considering the relatively small dataset we have, we will apply transfer learning starting from the
convolution layers in the popular VGG16 network used here as the source network.
VGG16 is one of the winning models of the ImageNet Challenge of 2014. It is a stacked CNN, using kernels of size 3 x 3 with an increasing depth—that is, with an increasing number of filters. The original network was trained on the ImageNet dataset, containing millions of images referring to more than 1,000 classes.
Figure 9.21 shows you the network structure of the VGG16 model.
It starts with two convolution layers, each with 64 filters. After a max pooling layer, again two convolution layers are used, this time
each with 128 filters. Then, another max pooling layer is followed by three convolution layers, each with 256 filters. After one more
max pooling layer, there are again three convolution layers, each with 512 filters, followed by another pooling layer and three
convolution layers each with 512 filters. After one last pooling layer, three dense layers are used:
In this case study, we would like to reuse the trained convolution layers of the VGG16 model and add some layers on top for the
cancer cell classification task. During training, the convolution layers will be frozen and only the added layers will be trained.
To do so, we build three separate sub-workflows: one workflow to download the data, one workflow to preprocess the images, and a
third workflow to train the neural network, using transfer learning. You can download the workflow with the three sub-workflows
from the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter%209/. Let’s start with
the workflow to download the data.
The workflow first defines a directory for the downloaded data using the Create Directory node. Next, the GET Request node and the Binary Objects to Files node are used to download and save the tar.gz file into the created directory. The Unzip Files node unzips the downloaded file. As a result, we get three sub-directories—one for each lymphoma type. Next, the workflow creates a data table that stores the paths to the image files using the List Files node. Based on the subfolder, the Rule Engine node adds the class label according to the image's class of lymphoma. Finally, the created table is written into a .table file.
The next step is to preprocess the images.
The loading and preprocessing steps are performed by the workflow shown here in Figure 9.23:
Figure 9.23 – This workflow loads and preprocesses the image
This second workflow starts with reading the table created in the first workflow, including the paths to the images as well as the class
information. Next, the Category To Number node encodes the different nominal class values (FL, MCL, and CLL) with an
index, before the dataset is split into a training set and a test set using the Partitioning node. For this case study, we decided to use
60% of the data for training and 40% of the data for testing, using stratified sampling on the class column.
In the Load and preprocess images (Local Files) component, the images are uploaded and preprocessed.
Figure 9.24 – Inside of the Load and preprocess images (Local Files) component
The component uses a loop to load and preprocess one image after the other. The Chunk Loop Start node, with one row per chunk, starts the loop, while the Loop End node, concatenating the resulting rows from the loop iterations, ends the loop.
In the loop body, one image at a time is loaded with the Image Reader (Table) node. The image is then normalized using the Image Calculator node, dividing each pixel value by 255.
Next, the Image Cropper node is used to crop the image to a size that is divisible by 64. Since the original size of the images is 1388 x 1040 px, the first 44 pixels on the left side and the top 16 pixels of each image are cropped.
Figure 9.25 here shows you the configuration window of the node:
Figure 9.25 – Image Cropper node and its configuration window
Next, the Splitter node splits each image into 336 images of size 64 x 64 pixels, storing each new sub-image in a new column, for a
total of ~75,000 patches. Figure 9.26 here shows you the Advanced tab of the configuration window of the Splitter node, where
the maximum size for each dimension of the resulting images has been set:
Figure 9.26 – The Splitter node and its configuration window
Next, the table is transposed into one column and renamed, before the class information is added to each image with the Cross
Joiner node.
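The cropping and splitting arithmetic works out as follows: 1388 − 44 = 1344 and 1040 − 16 = 1024 are both divisible by 64, yielding 21 x 16 = 336 patches per image. A NumPy sketch of the same operation:

import numpy as np

img = np.zeros((1040, 1388, 3))            # stand-in for one slide image (height x width x channels)
cropped = img[16:, 44:, :]                 # drop the top 16 rows and left 44 columns -> 1024 x 1344
h, w = cropped.shape[0] // 64, cropped.shape[1] // 64
patches = (cropped.reshape(h, 64, w, 64, 3)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(h * w, 64, 64, 3))   # 336 patches of 64 x 64 pixels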
Now that we have the prepared images, we can continue with the last workflow.
To train the final network, we will use the Keras Network Learner node and the ~75,000 patches created from the training set images. These steps are performed by the workflow shown here in Figure 9.27:
Figure 9.27 – Training workflow to train the new network to classify images of cancer cells
The workflow first reads the VGG16 network with the Keras Network Reader node. The Keras Network Reader node can
read models in three different file formats. Models are saved in a .h5 file with the complete network structure and weights, or
networks are saved in .json or .yaml files with just the network structure.
In this case, we read the .h5 file of the trained VGG16 network because we aim to use all of the knowledge embedded inside the
network.
The output tensor of the VGG16 network has dimensions 2 x 2 x 512, which is the size of the output of the last max pooling layer. Before we can add some dense layers for the classification task, we flatten the output using the Keras Flatten Layer node.
Now, a dense layer with ReLU activation and 64 neurons is added using a Keras Dense Layer node. Next, a Dropout Layer node is introduced to apply dropout regularization. Finally, one last Keras Dense Layer node defines the output of the network. As we are dealing with a classification problem with three different classes, the softmax activation function with three units is adopted.
If we were to connect the output of the last Keras Dense Layer node to a Keras Network Learner node, we would fine-tune
all layers, including the trained convolution layers from the VGG16 model. We do not want to lose all that knowledge! So, we
decided to not fine-tune the layers of the VGG16 model but to train only the newly added layers. Therefore, the layers of the VGG16
model must be frozen.
To freeze layers of a network, we use the Keras Freeze Layers node. Figure 9.28 here shows you the configuration window of
this node:
Figure 9.28 – The Keras Freeze Layers node and its configuration window
In the configuration window, you can select the layer(s) to freeze. Later on, when training the network, the weights of the selected
layers will not be updated. All other layers will be trained. We froze every layer except the ones we added at the end of the VGG16
network.
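The structure built by these nodes corresponds roughly to the following Keras sketch (the dropout rate of 0.5 is our assumption; the dimensions follow from the 64 x 64 input patches):

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense, Dropout

base = VGG16(weights='imagenet', include_top=False, input_shape=(64, 64, 3))
for layer in base.layers:
    layer.trainable = False                  # what the Keras Freeze Layers node does

x = Flatten()(base.output)                   # 2 * 2 * 512 = 2,048 values
x = Dense(64, activation='relu')(x)
x = Dropout(0.5)(x)                          # dropout rate assumed here
out = Dense(3, activation='softmax')(x)      # one unit per lymphoma class
model = Model(inputs=base.input, outputs=out)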
In the lower branch of the workflow, we read the training data using the Table Reader node and we one-hot encode the class using
the One to Many node.
Now that we have the training data and the network structure, we can fine-tune it with the Keras Network Learner node.
As with all other case studies in this book, the columns for the input data and target data are selected in the configuration window of
the Keras Network Learner node, together with the required conversion type. In this case, the From Image conversion for the input column and the From Number (double) conversion for the target column have been selected. Because this is a multiclass classification task, the
Categorical cross entropy loss function has been adopted. To fine-tune this network, it has been trained for 5 epochs using a
training batch size of 64 and RMSProp with the default settings as optimizer.
Once the network has been fine-tuned, we evaluate its performance on the test images. The preprocessed test images, as patches of 64
x 64 px, are read with a Table Reader node. To predict the class of an image, we generate predictions for each of the 64 x 64px
patches using the Keras Network Executor node. Then, all predictions are combined using a simple majority voting scheme,
implemented in the Extract Prediction metanode.
Finally, the network is evaluated using the Scorer node. The classifier has achieved 96% accuracy (fine-tuning for a few more
epochs can push the accuracy to 98%).
TIP
In this use case, the VGG16 model is only used for feature extraction. Therefore, another approach is to apply the convolutional
layers of the VGG16 model to extract the features beforehand and to feed them as input into a classic feedforward neural network.
This has the advantage that the forward pass through VGG16 would be done only once per image, instead of doing it in every batch
update.
We could now save the network and deploy it to allow a pathologist to access those predictions via a web browser, for example. How
this can be done using KNIME Analytics Platform and KNIME Server is shown in the next chapter.
Summary
In this chapter, we explored CNNs, focusing on image data.
We started with an introduction to convolution layers, which motivates the name of this new family of neural networks. In this
introduction, we explained why CNNs are so commonly used for image data, how convolutional networks work, and the impact of
the many setting options. Next, we discussed pooling layers, commonly used in CNNs to efficiently downsample the data.
Finally, we put all this knowledge to work by building and training from scratch a CNN to classify images of digits between 0 and 9
from the MNIST dataset. Afterward, we discussed the concept of transfer learning, introduced four scenarios in which transfer
learning can be applied, and showed how we can use transfer learning in the field of neural networks.
In the last section, we applied transfer learning to train a CNN to classify histopathology slide images. Instead of training it from
scratch, this time we reused the convolutional layers of a trained VGG16 model for the feature extraction of the images.
Now that we have covered the many different use cases, we will move on to the next step, which is the deployment of the trained
neural networks. In the next chapter, you will learn about different deployment options with KNIME software.
Questions and Exercises
1. What is the kernel size in a convolutional layer?
During the exploration of some of the use cases, a second workflow has already been introduced, to deploy the network to work on
real-world data. So, you have already seen some deployment examples. In this last section of the book, however, we focus on the
many deployment options for machine learning models in general, and for trained deep learning networks in particular.
Usually, a second workflow is built and dedicated to deployment. This workflow reads the trained model and the new real-world data, preprocesses this data in exactly the same way as the training data, then applies the trained deep learning network to this transformed data and produces the results according to the project's requirements.
This chapter focuses on the reading, writing, and preprocessing of the data in a deployment workflow.
This chapter starts with a review of the features for saving, reading, and converting a trained network. This is followed by two
examples of how the preprocessing for our sentiment analysis use case can also be implemented in a deployment workflow. Finally,
the chapter shows how to improve execution speed by enabling GPU support.
Conversion of the Network Structure
The goal of a deployment workflow is to apply a trained network to new real-world data. Therefore, the last step of the training
workflow must be to save the trained network.
However, Keras-formatted networks can only be interpreted and executed via the Keras libraries. This is already one level on top of
the TensorFlow libraries. Executing the network application on the TensorFlow Java API directly, rather than on a Python kernel via
the Keras Python API, makes execution faster. The good news is that KNIME Analytics Platform also has nodes for TensorFlow
execution in addition to the nodes based on Keras libraries.
Thus, if faster execution is needed, the Keras network should be converted into a TensorFlow network using the Keras to TensorFlow Network Converter node. After conversion, the network can be saved using the TensorFlow Network Writer node as a SavedModel file, a compressed zip file. A SavedModel file contains a complete TensorFlow program,
including weights and computation. It does not require the original model building code to run, which makes it useful for sharing or
deploying.
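For reference, with TensorFlow 1.x a trained Keras network can be exported to the SavedModel format with a few lines of Python (a sketch; the file names are ours):

import tensorflow as tf
from keras import backend as K
from keras.models import load_model

model = load_model('sentiment_network.h5')          # trained Keras network
tf.saved_model.simple_save(K.get_session(),         # TF1-style SavedModel export
                           'sentiment_savedmodel',
                           inputs={'input': model.input},
                           outputs={'output': model.output})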
The first step in a deployment network is to read a trained network.
The Keras Network Reader node reads a Keras deep learning network from a file. The file can either contain a full, pre-trained
network (.h5 file) or just a network architecture definition without weights (a .json or .yaml file). You can use the node
to read networks trained with KNIME Analytics Platform or networks trained directly with Keras, such as pretrained Keras networks.
The TensorFlow Network Reader (or TensorFlow 2 Network Reader) node reads a TensorFlow (or TensorFlow 2) deep
learning network from a directory or from a zip file. If reading from a directory, it has to be a valid SavedModel folder.
If reading from a zip file, it must contain a valid SavedModel folder.
TIP
The TensorFlow Network Reader node allows us to select a tag and a signature in its configuration window. Tags are used to identify
the meta graph definition to load. Signatures are concrete functions specifying the expected input and output. A
SavedModel can have multiple tags as well as multiple signatures per tag. A network saved with KNIME Analytics Platform
has only one tag and one signature. In the Advanced tab of the configuration window, you can define your own signature by
defining the input and output of the model by selecting one of the hidden layers as output, for example.
Another node, which allows you to read pretrained networks without writing a single line of code, is the ONNX Network Reader
node. ONNX stands for Open Neural Network Exchange and is a standard format for neural networks developed by Microsoft
and Facebook. Since it is a standard format, it is portable across machine learning frameworks such as PyTorch, Caffe2, TensorFlow,
and more. You can download pretrained networks from the ONNX Model Zoo (https://github.com/onnx/models#vision) and read
them with the ONNX Network Reader node. The ONNX networks can also be converted into TensorFlow networks using the
ONNX to TensorFlow Network Converter node, and then executed with the TensorFlow Network Executor node.
TIP
To use the ONNX nodes, you need to install the KNIME Deep Learning – ONNX Integration extension.
Another option for reading a network using Python code is the DL Python Network Creator node, which can be used to read
pretrained neural networks using a few lines of Python code.
TIP
The DL Python Network Creator node can also be used in training workflows to define the network architecture using Python code
instead of layer nodes.
So far, we have used Keras-based nodes with TensorFlow 1 as the backend. There are also nodes that use TensorFlow 2 as the
backend to implement similar operations.
Using TensorFlow 2
For all the examples in this book, we have used Keras-based nodes that run TensorFlow 1 as the backend. TensorFlow 2 is also
supported since the release of KNIME Analytics Platform 4.2. On the KNIME Hub, you can find many examples of how to use
TensorFlow 2 integration.
In this section, we use the sentiment analysis case study shown in Chapter 7, Implementing NLP Applications, as an example, and we
build two deployment workflows for it. The goal of both workflows is to read new movie reviews from a database, predict the
sentiment, and write the prediction into the database.
In the first example, the preprocessing steps are implemented manually into the deployment workflow. In the second example, the
Integrated Deployment feature is used.
These steps are performed by the workflow in Figure 10.1, which you can download from the KNIME Hub at
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter_10/:
Figure 10.1 – Deployment workflow for the sentiment analysis case study from Chapter 7, Implementing NLP Applications
The workflow first connects to a SQLite database, where the new movie reviews are stored, using the SQLite Connector node.
Next, the SELECT SQL statement to read the new reviews from the table named new_reviews is implemented by the DB Table
Selector node.
The SQL statement is then executed through the DB Reader node. As a result, we have the new reviews in a data table at the output
port of the node.
TIP
In Chapter 2, Data Access and Preprocessing with KNIME Analytics Platform, the database extension was introduced in detail.
Remember that the database nodes create a SQL statement at their brown square output port.
Before applying the network to these new reviews, we need to perform the same transformations as in the training workflow. In the
training workflow, reported in Chapter 7, Implementing NLP Applications, there was a metanode named Preprocess test set
where all the required preprocessing steps were applied to the test data. We used this metanode as the basis for creating the
preprocessing steps for the incoming data in the deployment workflow.
Figure 10.2 shows the content of this metanode, which is dedicated to the preprocessing of the test set:
Figure 10.2 – Preprocessing of the test data in the training workflow of the sentiment analysis case study from Chapter 7,
Implementing NLP Applications
In the deployment workflow in Figure 10.1, the dictionary created during training is read first; then, the preprocessing steps are
implemented in the Preprocessing metanode.
Figure 10.3 shows you the workflow snippet inside this metanode:
Figure 10.3 – Workflow snippet inside the Preprocessing metanode of the deployment workflow
Comparing the workflow snippets in Figure 10.2 and Figure 10.3, you can see that they contain the same preprocessing steps, as expected.
Now that the same preprocessing as for the training data has been applied to the deployment data, the trained network can be
introduced through the Keras Network Reader node (Figure 10.1).
Next, the trained network runs on the preprocessed deployment reviews using the Keras Network Executor node. The output of
the network is the probability of the sentiment being equal to 1, where 1 encodes a positive movie review. The same threshold as during training is also applied here through the Rule Engine node.
In the last step, the tables in the database are updated. First, the DB Delete node deletes the reviews we just analyzed from the
new_reviews table. Then, the DB Writer node appends the new movie reviews with their predictions to another table in the
database, named review-with-sentiment .
This is the first example of the deployment of a neural network using KNIME Analytics Platform. This workflow should be executed
on a regular basis to predict the sentiment for all new incoming movie reviews.
TIP
KNIME Server can schedule the execution of workflows, so you can trigger their execution automatically on a regular schedule.
This approach has one disadvantage. If the model is retrained on more data or with different settings (for example, if more or fewer terms are taken into account during training, or the threshold for the Rule Engine node is changed), we need to remember to also update the preprocessing steps in the deployment workflow. This manual step slows down the process and can easily lead to mistakes. Automating the construction of parts of the deployment workflow is a safer option, especially if the models change often, for example, every day or even every hour.
IMPORTANT NOTE
Other common names for the training workflow are data science creation workflow or modeling workflow.
The nodes from the Integrated Deployment extension close the gap between creating and deploying data science: they capture the deployment-relevant parts of the training workflow. In the sentiment analysis example, these parts include the preprocessing steps, the Keras Network Executor node, and the Rule Engine node, which decides on the positive or the negative class based on a threshold applied to the output class probability.
The workflow in Figure 10.4 shows you this example based on the sentiment analysis case study. You can download the workflow
from the KNIME Hub at
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter_10/:
Figure 10.4 – Training workflow that automatically creates a deployment workflow using Integrated Deployment
The part in the thick box is the captured workflow snippet. The Capture Workflow Start node defines the beginning and the
Capture Workflow End node defines the end of the workflow snippet to capture.
The start node doesn't need any configuration. Figure 10.5 shows the configuration window of the Capture Workflow End
node:
Figure 10.5 – Configuration window of the Capture Workflow End node
In the configuration window, you can set the name of the captured workflow snippet. You can also set whether the captured snippet
should be stored with the data and, if yes, the maximum number of data rows to include. We will see in a second why it can be
helpful to store some data in the captured workflow snippet.
The captured workflow snippet, with or without data, is then exported via the output port (the black square) of the Capture Workflow End node. In the workflow in Figure 10.4, the workflow snippet is then collected by the Workflow Writer node and written into the deployment workflow, with unaltered settings and configuration.
Figure 10.6 shows the configuration window of the Workflow Writer node:
Figure 10.6 – The Workflow Writer node and its configuration window
At the top, you can set the location of the folder of the destination workflow ( Output location ).
Next, you need to set the name of the destination workflow. The node automatically proposes a default name, which you can
customize via the Use custom workflow name option. If the name you choose refers to a workflow that already exists, you can
let the writer node fail or overwrite.
At the bottom, you can select the deployment option for the destination workflow: just create it, create it and open it, or save it as a
.knwf file to export.
The next figure, Figure 10.7, shows you the deployment workflow automatically generated by the Workflow Writer node:
Figure 10.7 – Automatically created deployment workflow from the workflow snippet captured via Integrated Deployment
In the captured workflow, you can see the Preprocessing test set metanode, as well as the Keras Network Executor, Rule Engine, and Column Filter nodes. Additionally, the whole Integrated Deployment process has added the following:
Two Reference Reader nodes. They are generic reader nodes, loading the connection information of static parameters not
found in the captured workflow snippet.
A Container Input (Table) and a Container Output (Table) node in order to accept input data and to send output data
respectively from and to other applications.
The execution of this deployment workflow can be triggered either by another workflow using the Call Workflow (Table Based) node or via a REST service if the workflow has been deployed on a KNIME Server. In the next chapter, we will talk about REST calls and REST services in detail.
In Figure 10.7, the example deployment workflow reads two entities at the top of the workflow using the two reader nodes without
an icon inside them. The left one provides the dictionary table based on the training data, and the right one provides the trained
neural network.
The Container Input (Table) node receives a data table from an external caller (that is, the Call Workflow (Table Based)
node) and makes it available on the output port. A configuration parameter enables the external caller to send a data table to the
Container Input (Table) node.
The Container Input (Table) node also has an optional input port (represented by an unfilled input port). If a data table is
connected to the optional input, the node will simply forward this table to the next node; if a table is supplied via a REST API, then
the supplied table will be available on the output port.
The Container Output (Table) node sends a KNIME data table to an external caller.
Let's now find out how the automatically created workflow can be used to predict the sentiment of new reviews during deployment.
Using the Automatically Created Workflow
Let's have a look now at how the deployment workflow can be consumed.
Figure 10.8 shows you an example of how the automatically created deployment workflow can be consumed to classify the
sentiment of new movie reviews, and you can download it from the KNIME Hub to try it out, at
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter_10/ :
The workflow connects to the database and reads the incoming new movie reviews.
Then, the Call Workflow (Table Based) node calls the deployment workflow (Figure 10.7), the one that was automatically
built. The Call Workflow (Table Based) node indeed calls other workflows residing in your local workspace or on a mounted KNIME Server. The called workflow must contain at least one Container Input node and one Container Output node to define the
interface between the two workflows: the called and the caller workflows.
Via the Call Workflow (Table Based) node, we send the new movie reviews to the deployment workflow to feed the Container Input (Table) node. The deployment workflow is then executed, and the predictions are sent back to the caller workflow and made available via the output port of the Call Workflow (Table Based) node.
A great advantage of this strategy is the ensured consistency between the data operations in the training workflow and the data
operations in the deployment workflow. If we now change any settings in the data operations in the training workflow, for example,
the value of the threshold in the Rule Engine node (Figure 10.4), and we re-execute the training workflow, these changes are
automatically imported into the new version of the deployment workflow ( Figure 10.7) and used by any workflow relying on it
( Figure 10.8).
TIP
Another great node of the Integrated Deployment extension is the Workflow Combiner node, which allows us to combine
workflow snippets from different original workflows.
We have reached the last section of this chapter, which is on scalability and GPU execution.
GPUs have been designed to handle multiple computations simultaneously. This paradigm suits the intensive computations required
to train a deep learning network. Hence, GPUs are an alternative option to train large deep learning networks efficiently and
effectively on large datasets.
Some Keras libraries can exploit the computational power of NVIDIA®-compatible GPUs via the TensorFlow paradigms. As a
consequence, KNIME Keras integration can also exploit the computational power of GPUs to train deep learning networks more
quickly.
In Chapter 1, Introduction to Deep Learning with KNIME Analytics Platform, we introduced how to set up Python for KNIME
Keras integration and KNIME TensorFlow integration. In order to run the KNIME Keras integration on the GPU rather than on the
CPU, you do not need to take many extra steps.
Of course, you need a GPU-enabled computer. TensorFlow 1.12 requires an NVIDIA GPU card with a CUDA compute capability of
3.5 or higher.
Besides that, most of the required dependencies (that is, CUDA® and cuDNN) will be automatically installed by Anaconda when installing the conda packages tensorflow=1.12 and keras-gpu=2.2.4.
The only extra step at installation is the latest version of the NVIDIA® GPU driver, to be installed manually.
At installation time, by selecting Create new GPU environment instead of Create new CPU environment , an environment
with keras-gpu=2.2.4 is created.
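To verify that the GPU environment is actually picked up, a quick check from a Python console in the new conda environment can help (a sketch using the TensorFlow 1.x API):

import tensorflow as tf

# Returns True if TensorFlow can see a CUDA-enabled GPU.
print(tf.test.is_gpu_available())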
The KNIME TensorFlow integration, which reads and executes TensorFlow SavedModel files, can also execute on the GPU.
IMPORTANT NOTE
The GPU support for the KNIME TensorFlow integration (which uses the TensorFlow Java API) is generally independent of
the GPU support for the KNIME Keras integration (which uses Python). Hence, the two GPU supports must be set up
individually. Due to the limitations of TensorFlow, the GPU support for the KNIME TensorFlow integration can only run on
Windows and Linux, and not on Mac.
At the time of writing, the following GPU configuration is recommended by KNIME. The KNIME TensorFlow integration uses TensorFlow version 1.13.1, which requires the following NVIDIA® software to be installed on your system:
NVIDIA® GPU drivers (410.x or higher)
CUDA® Toolkit 10.0
cuDNN SDK (7.4.1 or higher)
For detailed instructions and the most recent updates, please check the KNIME documentation (https://docs.knime.com/2019-
06/deep_learning_installation_guide/index.html#tensorflow-integration).
Summary
In this chapter, we have covered three different topics. We started with a summary of the many options for reading, converting, and
writing neural networks.
We then moved on to the deployment of neural networks, using the sentiment analysis case study from Chapter 7, Implementing
NLP Applications, as an example. The goal here was to build a workflow that uses the trained neural network to predict the sentiment
of new reviews stored in the database. We have shown that a deployment workflow can be assembled in two ways: manually or
automatically with Integrated Deployment.
The last section of the chapter dealt with the scalability of network training and execution. In particular, it showed how to exploit the
computational power of GPUs when training a neural network.
In the next and last chapter of this book, we will explore further deployment options and best practices when working with deep
learning.
Questions and Exercises
1. Which network conversions are available in KNIME Analytics Platform?
2. Which statements regarding Integrated Deployment are true (two statements are correct)?
b) The execution of the automatically generated workflow can be triggered by another workflow.
In the first section of this chapter, you will learn how to deploy a deep learning model as a web application so that end users can
execute, interact with, and control the application via a web browser. In order to implement a web application, we need to introduce
the KNIME WebPortal, a feature of KNIME Server. Components play a central role in the development of web applications since
they are used to implement the interaction points according to the Guided Analytics feature of the KNIME software. In this
chapter, you will also learn more about components.
Another deployment option to consume a deep learning model is a web service, through a REST interface. Web services have
become very popular recently because they allow you to integrate and orchestrate a number of applications seamlessly and easily
within the same ecosystem. In the second section of this chapter, you will learn how to build, deploy, and call workflows as REST
services with the KNIME software.
We will conclude this chapter with some best practice advice and tips and tricks for working with both neural networks and KNIME
Analytics Platform. These best practices and tips and tricks originate from our own experiences of many years of working on deep
learning projects, some of which have been described in this book.
Figure 11.1 – A web application implemented by a KNIME workflow, running on KNIME Server, and called via KNIME
WebPortal from any web browser
In this example, the web application is a very simple one. It has only two interaction points – that is, two web pages: the first one for
the image upload and the second one to inspect the results.
More complex web applications can be developed codelessly using this combination of KNIME Analytics Platform, KNIME Server,
and KNIME WebPortal. Some examples of quite complex and very beautifully designed web applications, such as Guided
Visualization, Guided Labeling, and Guided Automation, are available for download from the KNIME Hub (https://hub.knime.com).
In contrast to KNIME Analytics Platform, KNIME Server contains no data operations or model training algorithms. However, it
contains the whole IT infrastructure to allow collaboration among team members, on-demand and scheduled execution of
applications, a definition of the access rights for each registered user or group of users, model management, auditing features, and, of
course, deployment options, as we will see in this chapter. Also, in contrast to KNIME Analytics Platform, KNIME Server is not open
source but rather needs a yearly license. Figure 11.2 shows the login page for KNIME WebPortal:
Among the many IT features available with KNIME Server, KNIME WebPortal allows you to see and manage workflows from any
web browser. This seems a simple feature, but it can be the missing link between the data scientist and the end user.
IMPORTANT NOTE
The end user is an expert in their domain and usually has neither the time nor the inclination to open KNIME Analytics Platform and
investigate workflows and nodes. All the end user needs is a comfortable web-based application running on a web browser and
showing only the information they need to see; at the very least, the page for the data upload and the final page summarizing the
results.
WebPortal does not need any special installation. It comes already pre-packaged with the installation of KNIME Server. However, its
appearance can be easily customized through dedicated CSS style sheets. KNIME WebPortal only accepts registered users and
requires logging in ( Figure 11.2).
After a successful login, the starting page appears with the folders you have been granted access to. Navigate to the workflow you
would like to start and then press Run (Figure 11.3):
Figure 11.3 – Start page for a selected workflow on KNIME WebPortal
IMPORTANT NOTE
KNIME Server is the complementary tool to KNIME Analytics Platform. While KNIME Analytics Platform has all the algorithms
and data operations, KNIME Server provides the IT infrastructure for team-based collaboration, application automation, model
management, and deployment options.
Let's find out how a workflow must be structured to create a sequence of pages on KNIME WebPortal with defined interaction
options.
It would be long and complicated if we had to build all those pages/steps from scratch. Luckily, there are components. Each
page/step just visualizes the content of the composite view of a component in the underlying workflow. So, implementing a
sequence of web pages for WebPortal in reality corresponds to implementing a sequence of components with the required
composite views.
The upper part of Figure 11.4 shows three web pages from an application running on KNIME WebPortal: a form to import
customers from a database; a scatter plot and a table, connected to each other, to select some customers; and finally, a page displaying
the information for the selected customers. The lower part of Figure 11.4 shows the underlying workflow with the corresponding
three components. The composite view of each component produces one page during the execution of the workflow on KNIME
WebPortal:
Figure 11.4 – Top: step execution of an application on KNIME WebPortal. Bottom: the corresponding workflow generating the
web pages for the step execution
The workflow in Figure 11.4 serves as an example of the step execution of a workflow from a web browser and refers to a customer
dataset. We will use this dataset throughout this section, since it allows us to show many different features used in component
construction. This web application has been designed to allow the end user to inspect customer data and select customers with a high
risk of churn to be contacted by a team member.
The first component, Get Customers from Database, creates the first page on the left. Here, the end user must provide their
username and password to connect to the database.
After clicking on the Next button in the lower-right corner, the workflow is executed until the next component, Select Customers
to Contact, is reached and the corresponding web page is created. On this page, the end user gets an overview of the customer data
via a scatter plot and a table and selects the customers to contact. For the selection, the view provides two interaction options: select a
product via the radio buttons in the upper-left corner or change the churn score using the range slider in the lower-left corner. The
scatter plot and table are automatically updated according to the new selection parameters. Once the end user is happy with the
selection, they click Next again to get to the last page of the web application.
The final page is created by the Browse and Download Customers List component. Here, the data of each selected customer is
reported in a tile view and can be exported into an Excel file. The workflow is available on the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter_11/.
TIP
To open the interactive view of a component in KNIME Analytics Platform after execution, right-click the component and
select Interactive View: <name of the component>.
To summarize, each of these pages is created by one component in the workflow and displays its interactive view. Components
and their composite views are thus the key elements for building workflows for web applications.
Let's now see how a composite view can be created and customized.
In the lower-left part of the corresponding page, the end user has the option to define a threshold for the churn score via a slider.
This interactive slider is created by the Interactive Range Slider Filter Widget node.
In the upper-left corner, the option to select the product is created by the Interactive Value Filter Widget node.
In addition, the page shows an interactive scatter plot and an interactive table. Those two views are created by the Scatter Plot node
and the Table View node, respectively.
As you can see, each widget/view node adds one piece to the final composite view and therefore to the corresponding page in
WebPortal.
TIP
In KNIME Analytics Platform, the view of each widget/view node can be opened by right-clicking the node and selecting Interactive
View: <name of node>.
The nodes that can contribute pieces to a composite view can be categorized into three groups:
Widget nodes
View nodes
Interactive widget nodes
Let's have a look at each category in detail.
Widget Nodes
Widget nodes produce a view with an interactive form for setting parameters. The newly set parameters are then exported as flow
variables and can be used by other nodes down the line in the workflow.
TIP
In Chapter 2, Data Access and Preprocessing with KNIME Analytics Platform, we introduced the concept of flow variables and how
they can be used to overwrite setting options.
Each widget node is specialized in producing one specific input or interaction form, such as string input, integer input, selecting one
or many values from a list, and more. You can find all the available widget nodes in the Node Repository under Workflow
Abstraction | Widgets, as shown in Figure 11.6:
Figure 11.6 – Available widget nodes in the Node Repository
Selection widget nodes: The widget nodes in the selection category produce web forms to select values from a list, such as
choosing a specific column from a data table, including/excluding multiple columns from a dataset, or selecting one or more
values to filter data from a table.
Output widget nodes: These widget nodes add custom text, links, or images to the composite view.
As an example, Figure 11.7 shows the Single Selection Widget node and its configuration window:
Figure 11.7 – The Single Selection Widget node and its configuration window
Most of the widget nodes share some important settings, such as Label, Description, and Variable Name:
Label: This creates a label on top of the form created by the widget node.
Description: This value is shown as a tooltip on the widget form.
Variable Name: This gives the name of the flow variable created by the node.
Let's have a look at the additional configuration settings for the Single Selection Widget node (Figure 11.7):
Selection Type: Defines the objects used for the selection: a drop-down menu, vertical or horizontal radio buttons, or a list
Possible Choices: Defines the list of available values to choose from
Default Value: Assigns an initial default value to the selection operation
A standard widget node produces either flow variables or a table as output, which can be used by downstream nodes in
the workflow. A special set of widget nodes, the interactive widget nodes, is described below, after the view nodes.
View Nodes
View nodes visualize data through interactive charts, plots, and tables.
Figure 11.8 shows you an overview of the available view nodes in the Node Repository.
If multiple view nodes are present inside a component, their views interact with each other in the resulting composite view; for
example, data points selected in the view of one node can be highlighted or even isolated in the view of another node.
TIP
The Plotly nodes and the JavaScript nodes in the Labs category offer even more interactive options to visualize your data in
composite views. Views from (local) nodes in the Local (Swing) category cannot be integrated into the composite view of a
component.
Interactive Widget Nodes
In comparison to the standard widget nodes, interactive widget nodes – such as the Interactive Value Filter Widget and
Interactive Range Slider Filter Widget nodes we saw earlier – trigger direct filter events in the open composite view or web page.
The flow variable created by a standard widget node can be used by a subsequent node but doesn't trigger direct changes in the open
page or composite view.
Now that we have an overview of the nodes available to build a composite view, let's customize the composite view of a component
through some layout options.
TIP
To open a component in a new tab in the workflow editor, Ctrl + double-click the component or right-click the
component and select Component | Open.
The layout in a composite view is set via the layout editor from inside the component. After opening the content of the component
in a new tab in the workflow editor, click the layout editor button at the rightmost side of the top toolbar, as shown in Figure 11.9:
Figure 11.9 – Toolbar with layout editor button to the far right
Upon clicking on the layout editor button, the visual layout editor (Figure 11.10) opens:
Figure 11.10 – Visual layout editor
The layout editor uses a grid structure with rows and columns.
On the left, there are row templates with different numbers of columns and a list of all the still-unplaced views. On the right, there is
the layout editor itself.
You can change the layout by adding new row templates via drag and drop from the template list on the left to the layout editor on
the right. To add a new empty column, click the + button in the layout editor. Columns inside a row can be manually resized.
Empty cells in the layout editor can be populated by dragging and dropping views from the list of unused views into the cells in the
layout editor.
The default layout consists of one column only and all views from the widget and view nodes are placed in it from top to bottom. To
start from a blank canvas, click the clear layout button in the upper-left corner of the layout editor. This clearing action adds all
views to the list on the left.
TIP
Node labels (the text below the node) are used in the layout editor to identify the views. It is best practice to change the node labels to
meaningful descriptions, to easily recognize the views in the layout editor.
If you want to exclude the view of a node from the composite view, you can go to the first tab of the layout editor, called Node
Usage, and disable the node view for the WebPortal/composite view.
It is also possible to have nested components, that is, a component inside another component. If the nested component has a view, it
shows up as a node view in the layout editor. Thus, you can integrate the view of the nested component into your layout as you
would do for any other node.
A composite view can be easily beautified – for example, by adding a header or a sidebar and styling the text body. You are in luck
as there are shared components on the KNIME Hub to do that.
Figure 11.11 shows the web page before and after the introduction of some styling elements using some of the available shared
components:
Figure 11.11 – Web page without (left) and with (right) a header, sidebar, and additional information in the body of the page
You will see some shared components in action at the end of this section when we build the cancer cell classification example.
Shared Components
In the previous section, Creating Composite Views, we discussed how to use components to create composite views and then pages
for WebPortal applications. Components can also bundle up functionalities that can be reused and shared with others via the KNIME
Hub and KNIME Server. These functionalities range from simple repetitive tasks, such as entering credentials into a database, to
more complicated tasks, such as optimizing parameters.
In comparison to metanodes, components have their own configuration window. They can be configured without touching the
individual nodes inside – providing a handy way to hide configuration complexity. Of course, if needed, you can still open the
component, dive into the details, and make any adjustments relevant to your use case.
To add settings to the configuration window of a component, you can use the configuration nodes. They work similarly to
widget nodes, but at the level of the configuration window instead of the composite view. You can find them in the Node Repository
under Workflow Abstraction | Configuration. Like any KNIME node, components can have a description in the Description panel.
From inside a component, you can edit the description by clicking on the pen in the upper-left corner of the Description panel in
KNIME Analytics Platform.
For components to become available like all other KNIME nodes, they have to be shared. When sharing a component, you first
choose the destination of the component template; a dialog then opens asking you for the link type. The link type links the
component instance to the component template and defines the location where KNIME Analytics Platform looks for the template
when checking for updates:
Create absolute link: The workflow uses the absolute path when looking for the component template.
Create mountpoint-relative link: The workflow uses the relative path starting from the selected mountpoint when looking
for the component template.
Create workflow-relative link: The workflow uses a relative path starting from the current workflow folder when looking
for the component template.
Don't create link with shared instance: A component template is created but is not linked to the current instance.
TIP
When you deploy a workflow to KNIME Server, make sure that all link types on the component instances also work on the
server.
To create an instance of a shared component, simply drag and drop the component template from the KNIME Hub or KNIME
Explorer to the workflow editor. Newly created instances are read-only and link to the corresponding shared component.
Each time the workflow is started, KNIME Analytics Platform searches for possible updates of the component template and, if there
are any, proposes to update the instance as well. The advantage is that if something changes in the component template, the
changes are automatically reflected in the instances.
TIP
Being read-only, new instances cannot be edited. To change the content of an instance, you first need to disconnect it from the
template: right-click on the component instance and select Component | Disconnect Link.
There are a lot of public shared components on the EXAMPLES Server or the KNIME Hub. You will also find some shared
components in the workflow group for this chapter on the KNIME Hub.
Now that you are familiar with shared components and the WebPortal, let's have a look at the deployment example of cancer cell
classification.
The goal here is to produce a web application for pathologists who are not familiar with KNIME Analytics Platform and data science
in general. It should help them in their daily routine by suggesting a cancer classification during the analysis of histopathology
images. An additional requirement is the option to upload multiple images in sequence without restarting the application. Figure
11.12 shows you the workflow implementing the application. You can download the workflow from the KNIME Hub:
https://hub.knime.com/kathrin/spaces/Codeless%20Deep%20Learning%20with%20KNIME/latest/Chapter_11/.
Let's focus first on the middle part of the workflow: the loop body inside the annotation box:
Figure 11.12 – Deployment workflow to score new histopathology images from a web browser
At each iteration, one image is uploaded, the classification is produced, and two web pages are presented to the pathologist. The loop
takes care of the iterations, and the two components in the loop body – the Upload Image component and the View Results
component – take care of the web pages:
Figure 11.13 – Workflow inside the Upload Image component and the webpage created by it
The loop body starts with the Upload Image component, which creates the first web page of the web application. You can see the
created page as well as the inside of the component in Figure 11.13.
The header of the web page, with the KNIME logo and the navigation path, is created by the shared component named WebPortal
Header. For WebPortal applications with many steps, a header like this helps the end user to get an overview of the current step
(framed), the steps already covered (yellow or light-gray boxes), and the steps yet to come (gray boxes).
Next, the workflow reads the trained deep learning network using the Keras Network Reader node and applies it to the image
patches with the Keras Network Executor node.
In the Prepare Visualization metanode, the image patches are assigned a color, according to the probability of belonging to one
of the three cancer classes.
Finally, the results are visualized using the last component, named View Results. Figure 11.15 shows you the workflow inside the
component and the corresponding web page obtained when executing the workflow in the web browser:
Figure 11.15 – Workflow snippet inside the View Results component and the web page created by it
In the View Results component, we again find the shared WebPortal Header component, to create the page header, this time
with the Upload Image box in yellow (past steps) and the Results box with the yellow frame (current step).
The Table to Image node converts the image contained in the first row of the selected column into an image object. This image
object is then fed into the Image Output Widget node to display it inside a composite view.
Lastly, the pathologist must decide whether to upload another image. This selection is implemented in the web page via radio
buttons and in the workflow by the Single Selection Widget node. This node produces a flow variable at its output port with the
selected option.
Figure 11.16 – Configuration window of the Variable Condition Loop End node
This whole snippet, from Upload Image to View Results, is wrapped in a loop to meet the additional requirement of letting the
pathologist upload multiple images. The radio button selection on the last page is used as the loop-stopping criterion.
TIP
Remember that a loop always needs a loop start and a loop end node. In between these two nodes, there is the loop body, which is
executed at each loop iteration.
There are many different loop start and loop end nodes available. Some, for example, use only a subset of rows at each iteration
(Group Loop Start and Chunk Loop Start) and some only a subset of the columns (Column List Loop Start).
The workflow in Figure 11.12 uses the Generic Loop Start node to start the loop and the Variable Condition Loop End
node to close the loop. The two nodes allow us to build a loop with a custom stopping criterion that can be defined in the
configuration of the Variable Condition Loop End node (Figure 11.16). According to these settings, the loop stops if the flow
variable named keepgoing – created by the Single Selection Widget node – has the value No. As the Generic Loop Start node
always needs an input table, an empty table is created with the Empty Table Creator node to feed it.
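For readers more familiar with code than with KNIME loops, the control flow of this node pair is roughly that of a while loop that
re-evaluates a condition variable after each pass through the body. The following Python sketch is only an analogy; the three
functions are hypothetical stand-ins for the corresponding components and nodes, not KNIME APIs:

    # A rough code analogy of the loop in Figure 11.12. The three functions
    # below are hypothetical stand-ins for the Upload Image component, the
    # Keras Network Executor node, and the View Results component.

    def upload_image() -> str:
        return "histopathology_patch.png"      # stand-in for the web upload form

    def classify(image: str) -> str:
        return f"predicted class for {image}"  # stand-in for the Keras model

    def show_results(prediction: str) -> str:
        print(prediction)
        return "No"                            # the radio button answer: "Yes" or "No"

    keepgoing = "Yes"                          # corresponds to the keepgoing flow variable
    while keepgoing == "Yes":                  # Variable Condition Loop End checks this
        keepgoing = show_results(classify(upload_image()))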
After deploying this workflow to KNIME Server and running it on KNIME WebPortal, the pathologist can easily upload new
images and get the result from the automatic classification.
In this section, you have learned how to build a web application using the KNIME software and how to deploy a deep learning
network as a web application.
Let's now discover how deep learning networks can be deployed as REST services.
The KNIME Server REST API offers an interface for non-KNIME applications to communicate with KNIME Server via simple
HTTP requests. The main benefit of RESTful web services is the ease of integration of the application into the company IT
landscape. Self-contained and isolated applications can call each other and exchange data via the REST interface. In this way, it
becomes easier to add new applications to the ecosystem.
Any workflow uploaded to KNIME Server is automatically available via the REST API. This allows you to seamlessly deploy
KNIME workflows as web services via the REST API and integrate them into the infrastructure of your data science lab.
In the sentiment analysis example, we want to deploy the deep learning network as a REST service. In this way, external
applications – for example, a website or a mobile app – can send some text to the REST service and get back the predicted
sentiment.
Let's quickly look at the steps required to build a deployment workflow as a REST service in KNIME Analytics Platform.
When building a REST service with an input and an output, we need to define the structure of the inputs and outputs. In KNIME
Analytics Platform, this can be done via the container input and output nodes.
IMPORTANT NOTE
Not every REST service has inputs and outputs. For example, a REST service that connects to a database to get the most recent data
only has outputs. A REST service that concludes the process by writing the results to a database does not need to output any results.
KNIME Analytics Platform has a variety of input nodes that can be used to define the structure of the input to the REST API. You
can find these nodes in the Node Repository under Workflow Abstraction | Workflow Invocation (Figure 11.17):
Figure 11.17 – Available container nodes to define the REST API
As you can see in Figure 11.17, there are four Container Input nodes – for credentials, for one data row only, for a data table, or
for flow variables. A table input allows you to send either a single data row or multiple data rows to the web service. On the other
hand, a row input sends only one single data row.
The Container Input (Row) and Container Input (Table) nodes have an optional input port. This port receives a template
data table and, based on that table, defines the input structure. This template serves two purposes: first, if no input table is provided
when the workflow is called via the REST API, the values from the template are used as the default input to execute the workflow.
Second, the table is used to define the input structure that the web service expects. If the structure of the current input differs from
the template, the web service will produce an error message. The advantage of this template technique is that the input is parsed
automatically and converted into the specified types.
Similarly, to define the output of the REST service, you can use one of the Container Output nodes: either the Container
Output (Row) or the Container Output (Table) node.
For our deployment workflow, to classify one movie review at a time, we used a Container Input (Row) node to define the input
structure and a Container Output (Row) node to define the output structure of the REST service. In order to classify one or
more movie reviews at a time, the Container Input (Table) node and the Container Output (Table) node could be used.
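To make the expected input structure more concrete, here is a minimal sketch of what the JSON payload for a table-based container
could look like, written as a Python dictionary. The parameter name input-table, the column name review, and the sample texts are
our own assumptions for the sentiment analysis case; always check the schema published for your deployed workflow for the exact
format it expects:

    import json

    # Hypothetical payload for a Container Input (Table) node: "table-spec"
    # declares the column names and types, "table-data" holds one list per row.
    # The parameter name, column name, and texts are assumptions of ours.
    payload = {
        "input-table": {
            "table-spec": [{"review": "string"}],
            "table-data": [
                ["A gripping story with wonderful acting."],
                ["Two hours of my life I will never get back."],
            ],
        }
    }
    print(json.dumps(payload, indent=2))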
There are two ways to create a workflow that can be deployed as a REST service:
In Chapter 10, Deploying a Deep Learning Network, we introduced the Integrated Deployment extension of KNIME Analytics
Platform, which allows you to capture parts of the training workflow and deploy them automatically. There, too, we used the
sentiment analysis case study as an example. In Figure 11.18, you can see the workflow automatically created via Integrated
Deployment, with a Container Input (Table) node and a Container Output (Table) node to define the input and output
data structures:
Figure 11.18 – Automatically created deployment workflow from Chapter 10, Deploying a Deep Learning Network
In Chapter 10, Deploying a Deep Learning Network, we saved this automatically created workflow locally and triggered its
execution through a Call Workflow (Table Based) node. Instead of saving the workflow locally, we could deploy it on
KNIME Server.
The other option is to build the REST service manually from scratch. In this case, we have to provide the dictionary and the trained
model (Figure 11.19).
As you can see, it looks very similar to the previous workflow in Figure 11.18, the only difference being that it uses a Table
Reader node and a Keras Network Reader node to read the dictionary and the trained model. In addition, a template table has
been inserted to define the input data structure of the REST API.
We have the REST service; let's see how we can call it. KNIME Server documents the REST API of each deployed workflow on a
dedicated web page. This web page is created using an open source framework called Swagger, which has been integrated into
KNIME Server to document the REST API and to make it easy to explore and test the different HTTP requests.
For example, you could test how to trigger the execution of the REST service with a POST request. By selecting the POST request,
Swagger shows you an overview of the possible parameters, the schema for the input data, and the URL to call. You can also test the
request directly by clicking on the Try it out button.
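Outside of Swagger, any HTTP client can call the service in the same way. The following Python sketch sends a POST request to a
deployed workflow; the server URL, workflow path, credentials, endpoint pattern, and parameter name are all placeholders of ours,
so copy the real request URL and body schema from the Swagger page of your workflow:

    import requests

    # All values below are placeholders; take the real URL and request body
    # from the Swagger page of the deployed workflow on your KNIME Server.
    SERVER = "https://your-knime-server.example.com"
    WORKFLOW_PATH = "/Chapter_11/sentiment_rest_service"   # hypothetical repository path

    payload = {
        "input-table": {                                   # assumed container parameter name
            "table-spec": [{"review": "string"}],
            "table-data": [["A gripping story with wonderful acting."]],
        }
    }

    response = requests.post(
        f"{SERVER}/knime/rest/v4/repository{WORKFLOW_PATH}:execution",  # assumed endpoint pattern
        json=payload,
        auth=("username", "password"),                     # KNIME Server credentials
        timeout=120,
    )
    response.raise_for_status()
    print(response.json())                                 # includes the container output table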
You can also trigger the execution of the REST service from another workflow using the Call Workflow (Table Based) node.
This node calls local or remote workflows, sending the provided input table and outputting the REST service response.
The workflow in Figure 11.21 shows how to trigger the execution of a REST service on KNIME Server:
Figure 11.21 – This workflow triggers the execution of a REST service on KNIME Server
In the configuration window of the Call Workflow (Table Based) node, you get the list of all the workflows deployed on
KNIME Server by clicking on the Browse workflows button. After selecting a workflow, in the advanced settings, you can assign
the input table to the input of the called workflow and the output table to the output of the called workflow. This feature comes in
very handy when the deployed workflow has many input nodes.
In this section, you have learned how to deploy a workflow as a REST service on KNIME Server. Let's now conclude with some tips
and tricks from our own experience.
Shuffle the training data before each epoch: Shuffling the training data before each epoch prevents the network from learning
anything from the order in which the training samples are presented. To do that, make sure you activate the Shuffle training data
before each epoch checkbox in the Advanced tab of the configuration window of the Keras Network Learner node.
Use batch normalization: To add batch normalization to your network, you can use the Keras Batch Normalization Layer node.
Use metanodes and components: To keep large workflows tidy and clean, it is recommended to hide the implementation
details and some of the complexity inside metanodes or components. Indeed, to make a workflow easily understandable at first
glance, you can create one metanode or component for each step in the project, such as data access, data preprocessing, model
training, and model evaluation. Inside each metanode/component, you can have further metanodes and components for different
sub-steps, such as different preprocessing steps or network layers.
Documenting a workflow: KNIME Analytics Platform offers you three ways to document a workflow:
a) Node labels
b) Annotation boxes
c) Workflow description
Node labels and annotation boxes help you and other users to understand the workflow's tasks and subtasks at a glance.
It is also possible to add meta-information to your workflow through the Description panel. To do so, click anywhere in the
workflow editor (not on a node). The Description view changes to a workflow description with meta-information about the
workflow: the title, description, and related links and tags.
Force the execution order: Sometimes two nodes are not connected by a data flow but must still be executed in a specific order –
for example, two nodes writing to the same Excel file. A flow variable connection lets you enforce this order, as shown in Figure 11.22:
Figure 11.22 – In this workflow, the execution order is forced by using a flow variable connection
Of course, the Excel Sheet Appender (XLS) node should be executed after the Excel Writer (XLS) node. By using a flow
variable connection from the flow variable output port of the Excel Writer (XLS) node to the flow variable input port of the
Excel Sheet Appender (XLS) node, we force the execution of the Excel Sheet Appender (XLS) node to start only after the
execution of the Excel Writer (XLS) node has finished.
Summary
In this chapter, you learned about two more options to deploy your trained deep learning networks: web applications and REST
services. We finished the chapter – and the book – with some tips and tricks to successfully work with deep learning in KNIME
Analytics Platform.
In the first section of this chapter, you learned how to build web applications using KNIME WebPortal, a feature of KNIME Server,
so that end users can execute workflows and interact with the resulting web pages comfortably from a web browser.
Next, you learned how to build, deploy, and call REST services using KNIME Server to integrate your deep learning networks into
the company's IT infrastructure. You learned about the many options to define the input and output data structure of the REST
service, how to inspect the REST API using the open source Swagger tool, and how to trigger the execution of a REST service from
within KNIME Analytics Platform.
In the last section, we collated some tips and tricks from our own experience that might turn out to be helpful when working with
deep learning in KNIME Analytics Platform.
At this point, we think that you are well equipped to start building and deploying your own workflows to train and use deep learning
networks suitable to your own business cases and data with the KNIME software.
Questions and Exercises
1. Which kind of nodes can you use to add input fields to a composite view?
a) Configuration nodes
b) Widget nodes
c) View nodes
c) By selecting some view or widget nodes, right-clicking, and selecting Create Component
4. Which node can be used to define the input and output of a REST service?
a) Configuration nodes
b) Widget nodes
c) View nodes
d) Container nodes
Other Books You May Enjoy
If you enjoyed this book, you may be interested in these other books by Packt:
ISBN: 978-1-83864-729-2
Understand the key mathematical concepts for building neural network models
Cover optimization algorithms, from basic stochastic gradient descent (SGD) to the advanced Adam optimizer
ISBN: 978-1-83864-630-1
Implement deep neural network from scratch using the Keras library
Implement a convolutional neural network (CNN) image classifier for traffic signal signs
Train and test neural networks for behavioral-cloning by driving a car in a virtual simulator
Leave a review - let other readers know what you think
Please share your thoughts on this book with others by leaving a review on the site that you bought it from. If you purchased the
book from Amazon, please leave us an honest review on this book's Amazon page. This is vital so that other potential readers can
see and use your unbiased opinion to make purchasing decisions, we can understand what our customers think about our products,
and our authors can see your feedback on the title that they have worked with Packt to create. It will only take a few minutes of
your time, but is valuable to other potential customers, our authors, and Packt. Thank you!