
KNIME Big Data Workshop

Tobias Kötter and Björn Lohrmann


KNIME

© 2017 KNIME.com AG. All Rights Reserved.


Variety, Velocity, Volume
• Variety:
– Integrating heterogeneous data…
– ... and tools
• Velocity:
– Real time scoring of millions of records/sec
– Continuous data streams
– Distributed computation
• Volume:
– From small files...
– ...to distributed data repositories
– Moving computation to the data



Variety



The KNIME Analytics Platform: Open for Every Data, Tool, and User

Users: Data Scientist, Business Analyst

KNIME Analytics Platform
• External Data Connectors
• Native Data Access, Analysis, Visualization, and Reporting
• External and Legacy Tools
• Distributed / Cloud Execution


Data Integration



Integrating R and Python



Modular Integrations



Other Programming/Scripting Integrations



Velocity



Velocity

• High Demand Scoring/Prediction:


– High Performance Scoring using generic Workflows
– High Performance Scoring of Predictive Models
• Continuous Data Streams
– Streaming in KNIME
• Distributed Computation
– KNIME Cluster Executor



High Performance Scoring via Workflows

• Record (or small batch) based processing


• Exposed as RESTful web service



High Performance Scoring using Models

• KNIME PMML Scoring via compiled PMML


• Deployed on KNIME Server
• Exposed as RESTful web service

• Partnership with Zementis


– ADAPA Real Time Scoring
– UPPI Big Data Scoring Engine





Streaming in KNIME





KNIME Cluster Executor: Distributed Data



KNIME Cluster Execution: Distributed Analytics



Volume



Moving computation to the data



Volume

• Database Extension
• Introduction to Hadoop
• KNIME Big Data Connector
• KNIME Spark Executor



Database Extension

• Visually assemble complex SQL statements


• Connect to almost all JDBC-compliant databases
• Harness the power of your database within KNIME



In-Database Processing

• Operations are performed within the database
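
As a rough illustration of the idea (not the SQL that KNIME generates), the sketch below uses Python's built-in sqlite3 module with a hypothetical `sales` table. The aggregation runs inside the database, and only the small result set travels back to the client.

```python
import sqlite3

# Hypothetical in-memory database and table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5)],
)

# The GROUP BY is evaluated by the database engine; the client receives
# one row per region instead of every row of the table.
rows_per_region = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()
conn.close()
print(rows_per_region)
```

With a remote database the benefit is larger, since the raw rows never cross the network.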



Tip

• SQL statements are logged in the KNIME log file


Database Port Types

Database Connection Port (dark red)
• Connection information
• SQL statement

Database JDBC Connection Port (light red)
• Connection information

Database Connection Ports can be connected to Database JDBC Connection Ports, but not vice versa.


Database Connectors
• Nodes to connect to specific databases
– Bundling necessary JDBC drivers
– Easy to use
– DB specific behavior/capability
• Hive and Impala connectors are part of the commercial KNIME Big Data Connectors extension
• General Database Connector
– Can connect to any JDBC source
– Register new JDBC driver via preferences page


Register JDBC Driver

Open KNIME and go to File -> Preferences

• Register single jar file JDBC drivers
• Register new JDBC driver with companion files
• Increase connection timeout for long running database operations


Query Nodes

• Filter rows and columns
• Join tables/queries
• Extract samples
• Bin numeric columns
• Sort your data
• Write your own query
• Aggregate your data



Database GroupBy – Manual Aggregation

Returns number of rows per group



Database GroupBy – Pattern Based Aggregation

Tick this option if the search pattern is a regular expression; otherwise it is treated as a string with wildcards ('*' and '?').
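
To make the difference between the two pattern modes concrete, here is a minimal Python sketch. The column names and the `matching_columns` helper are hypothetical, and whether KNIME requires the pattern to match the whole column name is an assumption made here.

```python
import fnmatch
import re

# Hypothetical column names, for illustration only.
columns = ["price_eur", "price_usd", "quantity", "p1"]

def matching_columns(pattern, names, is_regex=False):
    """Return the names matched by a regular expression or a wildcard pattern."""
    if is_regex:
        compiled = re.compile(pattern)
        # Assumption: the whole column name has to match the expression.
        return [n for n in names if compiled.fullmatch(n)]
    # Wildcard mode: '*' matches any run of characters, '?' exactly one.
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]

wildcard_hits = matching_columns("price_*", columns)
single_char_hits = matching_columns("p?", columns)
regex_hits = matching_columns(r"price_(eur|usd)", columns, is_regex=True)
```

Note that Python's fnmatch also treats '[' as special, which the wildcard syntax described on the slide does not mention.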



Database GroupBy – Type Based Aggregation

Matches all columns

Matches all numeric columns


Database GroupBy – DB Specific Aggregation Methods

SQLite: 7 aggregation functions

PostgreSQL: 25 aggregation functions


Database GroupBy – Custom Aggregation Function



Database Writing Nodes

• Create table as select


• Insert/append data
• Update values in table
• Delete rows from table



Performance Tip
– Increase batch size in database manipulation nodes

Increase batch size for better performance
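
The effect of the batch size can be sketched with Python's built-in sqlite3 module. The `measurements` table and the `insert_in_batches` helper are hypothetical; an in-memory SQLite database has no network round trips, so the batch counts only illustrate how many statements a remote database would receive.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size):
    """Insert rows in chunks of batch_size; return the number of batches sent."""
    batches = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        conn.executemany("INSERT INTO measurements VALUES (?, ?)", chunk)
        conn.commit()
        batches += 1
    return batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (id INTEGER, value REAL)")
rows = [(i, i * 0.5) for i in range(1000)]

small = insert_in_batches(conn, rows, batch_size=10)    # many small batches
large = insert_in_batches(conn, rows, batch_size=500)   # few large batches
total = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
conn.close()
```

A larger batch size means fewer round trips for the same number of rows, at the cost of more memory per batch.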





Apache Hadoop
• Open-source framework for distributed storage and processing of large data sets
• Designed to scale up to thousands of machines
• Does not rely on hardware to provide high availability
– Handles failures at application layer instead
• First release in 2006
– Rapid adoption, promoted to top level Apache project in 2008
– Inspired by Google File System (2003) paper
• Spawned diverse ecosystem of products



Hadoop Ecosystem

Access: Hive
Processing: MapReduce, Tez, Spark
Resource Management: YARN
Storage: HDFS


HDFS
• Hadoop distributed file system
• Stores large files across multiple machines

[Figure: a (large) file is split into blocks (default: 64MB in Hadoop 1.x, 128MB since Hadoop 2.x), which are distributed across DataNodes]


HDFS – NameNode and DataNode
• NameNode
– Master server that manages file system namespace
  • Maintains metadata for all files and directories in filesystem tree
  • Knows on which datanode blocks of a given file are located
– Whole system depends on availability of NameNode
• DataNodes
– Workers, store and retrieve blocks per request of client or namenode
– Periodically report to namenode that they are running and which blocks they are storing


Reading Data from HDFS

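[Figure: read path, seen from the client node]
1: HDFS client calls open on DistributedFileSystem
2: DistributedFileSystem gets block locations from NameNode
3: HDFS client reads from FSDataInputStream
4, 5: FSDataInputStream reads the blocks from the DataNodes
6: HDFS client closes FSDataInputStream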
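
The division of labour in the read path can be mimicked with a toy model in plain Python. All class and variable names below are hypothetical and nothing here talks to a real Hadoop cluster; the point is only that the NameNode serves metadata while the DataNodes serve the bytes.

```python
class ToyNameNode:
    """Holds only metadata: which DataNodes store each block of a file."""
    def __init__(self):
        self.block_locations = {}  # path -> list of (block_id, [datanode names])

    def get_block_locations(self, path):
        return self.block_locations[path]

class ToyDataNode:
    """Stores the actual block contents."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

def read_file(path, namenode, datanodes):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        # Simplification: always use the first DataNode that holds the block.
        node = datanodes[holders[0]]
        data += node.blocks[block_id]
    return data

datanodes = {"dn1": ToyDataNode(), "dn2": ToyDataNode()}
datanodes["dn1"].blocks["b1"] = b"hello "
datanodes["dn2"].blocks["b2"] = b"world"
namenode = ToyNameNode()
namenode.block_locations["/demo.txt"] = [("b1", ["dn1"]), ("b2", ["dn2"])]

content = read_file("/demo.txt", namenode, datanodes)
```

The file contents never pass through the toy NameNode, which mirrors why the real NameNode is not a bandwidth bottleneck for reads.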



HDFS – Data replication and file size
Data Replication
• Each file is stored as a sequence of blocks
• Blocks of a file are replicated for fault tolerance (usually 3 replicas)
– Aims: improve data reliability, availability, and network bandwidth utilization

[Figure: blocks B1, B2, B3 of File 1 are replicated across DataNodes n1 to n4 in rack 1, rack 2, and rack 3; the NameNode tracks the block locations]
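
A small Python sketch of the two ideas, splitting a file into blocks and replicating each block. The function names are hypothetical, and the round-robin placement is a deliberate simplification: real HDFS placement is rack-aware.

```python
import math

def split_into_blocks(file_size, block_size):
    """Number of blocks needed for a file (the last block may be partial)."""
    return math.ceil(file_size / block_size)

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration; it ignores racks entirely and
    assumes replication <= len(nodes).
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement

MB = 1024 * 1024
num_blocks = split_into_blocks(200 * MB, 64 * MB)   # 200MB file, 64MB blocks
placement = place_replicas(num_blocks, ["n1", "n2", "n3", "n4"])
```

Losing any single node in this toy placement still leaves two copies of every block.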



HDFS – Access and File Size
• Several ways to access HDFS data
– FileSystem (FS) shell commands
  • Direct RPC connection
  • Requires Hadoop client to be installed
– WebHDFS
  • Provides REST API functionality, lets external applications connect via HTTP
  • Direct transmission of data from node to client
  • Needs access to all nodes in cluster
– HttpFS
  • All data is transmitted to client via one single node -> gateway

File Size
• Hadoop is designed to handle fewer large files instead of lots of small files
• Small file: File significantly smaller than Hadoop block size
• Problems:
– Namenode memory
– MapReduce performance
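
For the WebHDFS route, a request is an ordinary HTTP URL under the `/webhdfs/v1` prefix with an `op` parameter. The sketch below only builds such a URL and does not contact a cluster; the host name, the user, and the port (50070, the usual NameNode HTTP port in Hadoop 2.x) are assumptions, and `webhdfs_open_url` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode

def webhdfs_open_url(host, port, path, user=None):
    """Build the WebHDFS URL for reading a file (operation OPEN)."""
    params = {"op": "OPEN"}
    if user is not None:
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"

# Hypothetical host, port, path, and user.
url = webhdfs_open_url("namenode.example.com", 50070, "/data/demo.csv", user="alice")
print(url)
```

The NameNode answers an OPEN request with a redirect to a DataNode that holds the data, which is why the client needs network access to all nodes in the cluster.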



YARN
• Cluster resource management system
• Two elements
– Resource manager (one per cluster):
  • Knows where workers are located and how many resources they have
  • Scheduler: Decides how to allocate resources to applications
– Node manager (many per cluster):
  • Launches application containers
  • Monitors resource usage and reports to Resource Manager


YARN

[Figure: clients submit jobs to the Resource Manager; Node Managers host Application Masters and Containers. Arrows: MapReduce status, job submission, node status, resource request]


Hive
• Infrastructure on top of Hadoop
• Provides data summarization, query, and analysis
• SQL-like language (HiveQL)
• Converts queries to MapReduce, Apache Tez, and Spark jobs
• Supports various file formats:
– Text/CSV
– SequenceFile
– Avro
– ORC
– Parquet


Spark

• Cluster computing framework for large-scale data processing
• Keeps large working datasets in memory between jobs
– No need to always load data from disk -> way (!) faster than MapReduce
• Great for:
– Iterative algorithms
– Interactive analysis


Spark – Basic Concepts
• SparkContext
– Main entry point for Spark functionality
– Represents connection to a Spark cluster
– Create RDDs, accumulators, and broadcast variables on cluster

• RDD: Resilient Distributed Dataset
– Read-only multiset of data items distributed over cluster of machines
– Fault-tolerant: Lost partition automatically reconstructed from RDDs it was computed from
– Lazy evaluation: Computation only happens when action is required


Spark – DataFrame and Dataset
• DataFrame
– Distributed collection of data organized in named columns
– Similar to table in relational database
– Can be constructed from many sources: structured data files, Hive table, RDDs...

• Dataset
– Extension of DataFrame API
– Strongly-typed, immutable collection of objects mapped to a relational schema
– Catches syntax and analysis errors at compile time




KNIME Big Data Connectors

• Package required drivers/libraries for HDFS, Hive, Impala access
• Preconfigured connectors
– Hive
– Cloudera Impala
• Extends the open source database integration


Hive/Impala Loader

• Batch upload a KNIME data table to Hive/Impala



HDFS File Handling

• New nodes
– HDFS Connection
– HDFS File Permission
• Utilize the existing remote file handling nodes
– Upload/download files
– Create/list directories
– Delete files



HDFS File Handling





KNIME Spark Executor
• Based on Spark MLlib
• Scalable machine learning library
• Runs on Hadoop
• Algorithms for
– Classification (decision tree, naïve bayes, …)
– Regression (logistic regression, linear regression, …)
– Clustering (k-means)
– Collaborative filtering (ALS)
– Dimensionality reduction (SVD, PCA)



Familiar Usage Model

• Usage model and dialogs similar to existing nodes


• No coding required



MLlib Integration

• MLlib model ports for model transfer


• Native MLlib model learning and prediction
• Spark nodes start and manage Spark jobs
– Including Spark job cancelation

Native MLlib model



Data Stays Within Your Cluster

• Spark RDDs as input/output format


• Data stays within your cluster
• No unnecessary data movements
• Several input/output nodes, e.g. Hive, HDFS files, …


Machine Learning – Unsupervised Learning Example



Machine Learning – Supervised Learning Example



Mass Learning – Fast Event Prediction

• Convert supported MLlib models to PMML



Sophisticated Learning - Mass Prediction

• Supports KNIME models and pre-processing steps



Closing the Loop

[Figure: closing the loop]
• Learn model at scale (MLlib model)
• Apply model on demand (PMML model)
• Sophisticated model learning
• Apply model at scale


Mix and Match

• Combine with existing KNIME nodes such as loops



Modularize and Execute Your Own Spark Code



Lazy Evaluation in Spark

• Transformations are lazy


• Actions trigger evaluation
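
The behaviour can be imitated in plain Python. `LazyDataset` below is a hypothetical toy class, not the Spark API: its `map` and `filter` only record what should be done, and `collect` plays the role of an action that finally runs the recorded steps.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: only remembered, nothing is computed yet.
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only remembered.
        return LazyDataset(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: replay all recorded steps over the data.
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

calls = []

def double(x):
    calls.append(x)  # side effect lets us observe when work happens
    return x * 2

pipeline = LazyDataset(range(5)).map(double).filter(lambda v: v > 4)
calls_before_action = len(calls)  # no element has been processed yet
result = pipeline.collect()       # the action triggers the computation
```

Deferring the work lets the engine see the whole chain of transformations before running anything, which is what allows Spark to plan and pipeline the steps.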



Spark Node Overview



KNIME Big Data Architecture
[Figure: architecture overview]
• KNIME Analytics Platform with extensions (KNIME Big Data Connectors, KNIME Big Data Executor for Spark)
– Build Spark workflows graphically
– Workflow upload to KNIME Server via HTTP(S)
• KNIME Server with extensions (KNIME Big Data Connectors, KNIME Big Data Executor for Spark)
– Scheduled execution and RESTful workflow submission
• Hadoop cluster
– Impala: submit Impala queries via JDBC
– HiveServer2: submit Hive queries via JDBC
– Spark Job Server*: submit Spark jobs via HTTP(S)

*Software provided by KNIME, based on https://github.com/spark-jobserver/spark-jobserver


Executing KNIME Nodes on Spark



Behind the Scene
[Figure: KNIME Analytics Platform or KNIME Server executes a KNIME workflow on Spark. On each cluster worker node, a Spark executor JVM runs a workflow replica (KNIME workflow in OSGi) that reads partitions of the input RDD and writes partitions of the output RDD]


Behind the Scene
• Variation (1): Send RDD data through a single workflow replica

[Figure: the input RDD partitions of all cluster worker nodes are sent to one Spark executor JVM, where a single workflow replica (KNIME workflow in OSGi) produces the output RDD partition]


Behind the Scene
• Variation (2): Send pre-grouped RDD data through workflow replicas

[Figure: the input RDD partitions are regrouped before the Spark executor JVM on each cluster worker node runs its own workflow replica (KNIME workflow in OSGi) and writes a partition of the output RDD]


Big Data, IoT, and the three Vs
• Variety:
– KNIME inherently well-suited: open platform
– broad data source/type support
– extensive tool integration
• Velocity:
– High Performance Scoring of predictive models
– Streaming execution
• Volume:
– Bring the computation to the data
– Big Data Extensions cover ETL and model learning
– Distributed Execution of KNIME workflows



Demo



Want to try it at home?
• Hadoop cluster
– Use your own Hadoop cluster
– Use a preconfigured virtual machine
  • http://hortonworks.com/products/hortonworks-sandbox/
  • http://www.cloudera.com/downloads/quickstart_vms.html
• Download and install compatible Spark Job Server
– See installation steps at https://www.knime.org/knime-spark-executor#install
• For a free 30-day trial go to https://www.knime.org/knime-big-data-extensions-free-30-day-trial

Resources
– SQL Syntax and Examples (www.w3schools.com)
– Apache Spark MLlib (http://spark.apache.org/mllib/)
– The KNIME Website (www.knime.org)
• Database Documentation (https://tech.knime.org/database-documentation)
• Big Data Extensions (https://www.knime.org/knime-big-data-extensions)
• Forum (tech.knime.org/forum)
• LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
• Blog for news, tips and tricks (www.knime.org/blog)
– KNIME TV channel on YouTube
– KNIME on Twitter: @KNIME


The KNIME® trademark and logo and OPEN FOR INNOVATION® trademark are used by
KNIME.com AG under license from KNIME GmbH, and are registered in the United States.
KNIME® is also registered in Germany.