Big Data Workshop
• Database Extension
• Introduction to Hadoop
• KNIME Big Data Connector
• KNIME Spark Executor
Introduction to Hadoop
[Diagram: Hadoop component stack — data access via Hive, storage via HDFS]
• Files are stored as a sequence of blocks
• Blocks of a file are replicated for fault tolerance (usually 3 replicas)
  – Aims: improve data reliability, availability, and network bandwidth utilization
[Diagram: HDFS architecture — a NameNode tracks block locations; DataNodes on racks 1–3 hold the replicated blocks, which clients read directly]
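The rack-aware placement of those three replicas can be sketched in a few lines. This is a toy pure-Python illustration of HDFS's default policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack); the rack and node names are made up for the example.

```python
import random

def place_replicas(racks, writer_node):
    """Toy sketch of HDFS's default 3-replica placement policy."""
    writer_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    replicas = [writer_node]                      # 1st replica: writer's node
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second = random.choice(racks[remote_rack])    # 2nd replica: different rack
    replicas.append(second)
    others = [n for n in racks[remote_rack] if n != second]
    # 3rd replica: another node on the same remote rack (fall back to the
    # writer's rack if that rack has only one node)
    replicas.append(random.choice(others) if others
                    else random.choice(racks[writer_rack]))
    return replicas

racks = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
print(place_replicas(racks, "dn1"))
```

Writing two of the three replicas to one remote rack is the bandwidth trade-off the slide alludes to: the data survives a whole-rack failure while only crossing the rack interconnect once.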
[Diagram: YARN architecture — clients submit applications to the Resource Manager; Node Managers launch an Application Master and containers for each application]
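The allocation flow in that diagram can be sketched as a toy model (not YARN's actual API — the class and field names here are invented for illustration): the Resource Manager hands out containers from Node Managers with spare capacity, and the first container hosts the Application Master.

```python
class NodeManager:
    """A worker node with a fixed number of container slots."""
    def __init__(self, name, slots):
        self.name, self.free = name, slots

class ResourceManager:
    """Grants containers from whichever nodes still have capacity."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, n):
        granted = []
        for node in self.nodes:
            while node.free and len(granted) < n:
                granted.append((node.name, f"container_{len(granted)}"))
                node.free -= 1
        return granted

rm = ResourceManager([NodeManager("nm1", 2), NodeManager("nm2", 2)])
# The first granted container hosts the Application Master, which then
# coordinates the remaining containers as task workers.
am, *workers = rm.allocate(3)
print(am, workers)
```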
• Columnar storage formats: ORC and Parquet
• Newer execution engines (Tez, Spark) target workloads that MapReduce handles poorly:
  – Iterative algorithms
  – Interactive analysis
[Diagram: MapReduce, Tez, and Spark all run on YARN on top of HDFS]
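Why iterative algorithms favor Spark comes down to caching: Spark can keep an intermediate dataset in memory across iterations instead of re-reading it from disk each time, as MapReduce effectively does. A toy pure-Python sketch of that lazy-evaluation-plus-cache idea (not Spark's real API):

```python
class LazyDataset:
    """Minimal sketch of Spark-style lazy evaluation with caching."""
    def __init__(self, compute):
        self._compute = compute
        self._cache = None

    def map(self, f):
        return LazyDataset(lambda: [f(x) for x in self.collect()])

    def cache(self):
        self._cache = self._compute()   # materialize once, keep in memory
        return self

    def collect(self):
        return self._cache if self._cache is not None else self._compute()

reads = []   # counts how often the "source on disk" is read
base = LazyDataset(lambda: reads.append(1) or [1, 2, 3])

derived = base.map(lambda x: x * 2)
derived.collect(); derived.collect()    # each action recomputes the source
print(len(reads))                       # -> 2

reads.clear()
cached = base.map(lambda x: x * 2).cache()
cached.collect(); cached.collect()      # source read only once
print(len(reads))                       # -> 1
```

An iterative algorithm that calls an action per iteration pays the first cost repeatedly; caching pays it once.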
• Dataset
– Extension of DataFrame API
– Strongly-typed, immutable collection of objects mapped to a relational schema
– Catches syntax and analysis errors at compile time
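Spark's Dataset API is a Scala/Java feature, so its compile-time checking cannot be shown directly in Python; as a loose analogy only, a frozen dataclass gives records a fixed relational-style schema and immutability, so a misspelled field fails fast instead of silently producing nulls. The `Person` type below is invented for the example.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)     # immutable, loosely like a Dataset row
class Person:
    name: str
    age: int

schema = [f.name for f in fields(Person)]     # relational-style schema
rows = [Person("Ada", 36), Person("Grace", 45)]

adults = [p for p in rows if p.age >= 40]     # typed field access
print(schema, [p.name for p in adults])

# A typo such as p.agee raises AttributeError at first use, whereas an
# untyped dict-of-rows pipeline might only fail deep inside a job.
```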
KNIME Big Data Connector
• New nodes
  – HDFS Connection
  – HDFS File Permission
• Utilize the existing remote file handling nodes
  – Upload/download files
  – Create/list directories
  – Delete files
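Under the hood, one common transport for such file operations is HDFS's WebHDFS REST API. As a sketch of the requests those operations map to (the `namenode` host is a placeholder; 9870 is the Hadoop 3 default NameNode HTTP port), this helper just builds the URLs rather than sending them:

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL, e.g. op=LISTSTATUS, CREATE, MKDIRS, DELETE."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# List a directory, create a file, delete it -- mirroring the
# upload/list/delete operations of the remote file handling nodes.
print(webhdfs_url("namenode", 9870, "/data", "LISTSTATUS"))
print(webhdfs_url("namenode", 9870, "/data/f.csv", "CREATE", overwrite="true"))
print(webhdfs_url("namenode", 9870, "/data/f.csv", "DELETE"))
```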
KNIME Spark Executor
• KNIME Big Data Executor for Spark
  – Build Spark workflows graphically
  – Submit Hive queries via JDBC (HiveServer2) and Impala queries via JDBC
  – Upload workflows and submit Spark jobs via HTTP(S) to the Spark Job Server*
[Diagram: KNIME talks to HiveServer2 and Impala over JDBC, and to the Spark Job Server over HTTP(S)]
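The HTTP(S) leg follows the spark-jobserver REST API pattern: upload a jar, then start a job. As a sketch, the helpers below only construct the requests (the base URL, app name, and job class are placeholders, and real use would also handle contexts and authentication):

```python
from urllib.parse import urlencode

BASE = "https://jobserver.example.com:8090"   # placeholder Spark Job Server

def upload_jar_request(app_name):
    # POST the jar bytes to /jars/<appName>
    return ("POST", f"{BASE}/jars/{app_name}")

def submit_job_request(app_name, class_path, **conf):
    # POST /jobs?appName=...&classPath=... starts the job
    query = urlencode({"appName": app_name, "classPath": class_path, **conf})
    return ("POST", f"{BASE}/jobs?{query}")

print(upload_jar_request("knime-spark"))
print(submit_job_request("knime-spark", "org.example.MyJob"))
```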
[Diagram: the input RDD is split into partitions; one workflow replica runs per RDD partition in parallel]
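That execution model — replicate the same workflow across partitions of the input, then merge the results — can be sketched in plain Python (this is a toy analogy, not KNIME's or Spark's implementation; `workflow` is a stand-in for one workflow replica):

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n):
    """Split the input into n roughly equal partitions."""
    size = -(-len(data) // n)        # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def workflow(rows):
    """Stand-in for one workflow replica running on its own partition."""
    return [r * 10 for r in rows]

data = list(range(8))
parts = partition(data, 3)
with ThreadPoolExecutor() as pool:   # one replica per partition
    results = list(pool.map(workflow, parts))
merged = [x for part in results for x in part]
print(parts, merged)
```

Because each replica touches only its own partition, the replicas need no coordination until the merge — which is what lets the pattern scale out across a cluster.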