
Apache Spark

Crash Course - DataWorks Summit – Berlin 2018

Robert Hryniewicz
Data Evangelist
@RobHryniewicz
Data

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Data Sources
• Internet of Things (IoT)
  – Wind turbines, oil rigs
  – Beacons, wearables
  – Smart cars

• User-generated content (social, web & mobile)
  – Twitter, Facebook, Snapchat
  – Clickstream
  – PayPal, Venmo

3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Data Growth in Zettabytes (ZB)

[Chart: worldwide data volume by year, 2006–2020, on a 0–70 ZB scale; annotated "50+ ZB in 2021".]

4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


The “Big Data” Problem
Problem
• A single machine cannot process, or even store, all the data!

Solution
• Distribute data over large clusters

Difficulties
• How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?

5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Spark

6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What Is Apache Spark?

• Apache open source project originally developed at AMPLab (University of California, Berkeley)
• Unified, general data processing engine that operates across varied data workloads and platforms

7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Why Apache Spark?

• Elegant developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)
• In-memory computation model – fast!
  – Effective for iterative computations and ML
• Machine Learning
  – Implementations of distributed ML algorithms
  – Pipeline API (Spark MLlib)
  – External libraries via open & commercial projects (e.g. H2O's Sparkling Water)

8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Real-time Machine Learning Graph Analysis

9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL

10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


More Flexible /// Better Storage and Performance

12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Overview

• Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
• Two ways to manipulate data:
  – DataFrame/Dataset API
  – SQL queries

13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


SparkSession

What is it?
• Main entry point for Spark functionality
• Allows programming with the DataFrame and Dataset APIs
• Exposed as the variable spark and auto-initialized in notebook environments such as Zeppelin or Jupyter; outside a notebook you create it yourself (a minimal sketch follows)
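A hedged PySpark sketch of creating the session explicitly (the app name is illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession when not running inside Zeppelin/Jupyter
spark = (SparkSession.builder
         .appName("spark-crash-course")   # illustrative app name
         .getOrCreate())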

14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


DataFrames

• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R

Data is described as a DataFrame with rows, columns, and a schema (a small sketch follows).

[Diagram: a DataFrame as a grid of rows and named columns Col1 … ColN.]
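A small, hedged PySpark sketch (the column names "city" and "temp" are invented for illustration) showing that a DataFrame is just rows plus named columns plus a schema:

# Build a tiny DataFrame from local rows; the column list defines the schema
df = spark.createDataFrame(
    [("Berlin", 13.5), ("Madrid", 21.0)],
    ["city", "temp"])
df.printSchema()   # city: string, temp: double
df.show()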

15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sources

[Diagram: sources such as Avro, CSV, JSON, and Hive are read through Spark SQL into a DataFrame of rows and named columns (Col1 … ColN).]

16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Create a DataFrame

Example
val path = "examples/flights.json"
val flights = spark.read.json(path)
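A quick, hedged follow-up to inspect what was loaded; printSchema() shows the schema Spark inferred from the JSON and show(5) previews five rows (the calls read the same in Scala and PySpark):

flights.printSchema()
flights.show(5)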

17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Register a Temporary View (SQL API)

Example
flights.createOrReplaceTempView("flightsView")
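Once the view is registered, SQL can be run against it directly with spark.sql; a minimal sketch:

spark.sql("SELECT COUNT(*) FROM flightsView").show()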

18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Two API Examples: DataFrame and SQL APIs

DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)

SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5

Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Real-time Machine Learning Graph Analysis

21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best
Action

Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics

22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Modern Data Applications approach to Insights

Traditional Analytics              Next Generation Analytics
Structured & Repeatable            Iterative & Exploratory
Structure built to store data      Data is the structure
Start with hypothesis              Data leads the way
Test against selected data         Explore all data, identify correlations
Analyze after landing…             Analyze in motion…

23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming

Overview
• Extension of the Spark Core API
• Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant

(Note: the ZeroMQ and MQTT receivers are no longer supported in Spark 2.x.)

24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

Discretized Streams (DStreams)

• High-level abstraction representing a continuous stream of data
• Internally represented as a sequence of RDDs
• An operation applied on a DStream translates to operations on the underlying RDDs (a word-count sketch follows)
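A minimal PySpark DStream sketch, assuming a socket source on a placeholder host/port: word counts over 1-second micro-batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, 1)                      # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()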

26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

Example: flatMap operation
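The figure on the original slide illustrates flatMap; continuing the socket sketch above, each incoming line is split so that every word becomes an element of the new DStream:

words = lines.flatMap(lambda line: line.split(" "))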

27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

Window Operations
• Apply transformations over a sliding window of data, e.g. a rolling average (a sketch follows)
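A hedged sketch over the words DStream from the earlier example: word counts over the last 30 seconds, recomputed every 10 seconds (both durations are illustrative).

windowed = (words.map(lambda word: (word, 1))
                 .reduceByKeyAndWindow(lambda a, b: a + b,   # combine counts
                                       None,                 # no inverse function
                                       30, 10))              # window, slide (seconds)
windowed.pprint()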

28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Challenges in Streaming Data

• Consistency
• Fault tolerance
• Out-of-order data

29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Structured Streaming

• High-level APIs – DataFrames, Datasets and SQL; the same in streaming as in batch (a sketch follows)
• Event-time processing – native support for working with out-of-order and late data
• End-to-end exactly-once – transactional both in processing and output
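A minimal Structured Streaming sketch in PySpark: the same DataFrame API as batch, but the source is unbounded (the socket host/port and the console sink are placeholders):

socket_df = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

counts = socket_df.groupBy("value").count()        # running counts per input line

query = (counts.writeStream
               .outputMode("complete")             # emit the full updated result table
               .format("console")
               .start())
query.awaitTermination()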

30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Structured Streaming: Basics

31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Structured Streaming: Model

32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Handling late arriving data
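The slide's figure shows event-time windows being updated by late records; a hedged PySpark sketch using a watermark (the DataFrame events_df and its columns eventTime and word are assumptions for illustration):

from pyspark.sql.functions import window, col

late_tolerant_counts = (events_df
    .withWatermark("eventTime", "10 minutes")                   # drop data >10 min late
    .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
    .count())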

33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark MLlib

34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark ML Pipeline

• fit() is for training
• transform() is for prediction

[Diagram: Train – an input DataFrame (TRAIN) passes through the Pipeline and fit() produces a Pipeline Model.
 Predict – an input DataFrame (TEST) passes through the Pipeline Model and transform() produces an output DataFrame (PREDICTIONS).]

36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark ML Pipeline

[Diagram: the Pipeline chains stages – Feature transform 1 → Feature transform 2 → Combine features → Linear Regression.
 Train: input DataFrame → Pipeline (the fitted model can be exported).
 Predict: input DataFrame → Pipeline Model → output DataFrame.]

37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sample Spark ML Pipeline

indexer = …
parser = …
hashingTF = …
vecAssembler = …

rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])

model = pipe.fit(trainData) # Train model


results = model.transform(testData) # Test model
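A hedged PySpark sketch of how the elided stages above might be defined (the column names label_str, text, and other_feature are invented for illustration; the actual lab notebook may differ):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, RegexTokenizer, HashingTF, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

indexer = StringIndexer(inputCol="label_str", outputCol="label")           # encode the target
parser = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W")  # split text into words
hashingTF = HashingTF(inputCol="words", outputCol="text_features")          # hash words to a vector
vecAssembler = VectorAssembler(inputCols=["text_features", "other_feature"],
                               outputCol="features")                        # combine feature columns
rf = RandomForestClassifier(numTrees=100)

pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])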

38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Exporting ML Models - PMML
• Predictive Model Markup Language (PMML)
  – XML-based predictive model interchange format
• Supported models
  – K-Means
  – Linear Regression
  – Ridge Regression
  – Lasso
  – SVM
  – Binary Logistic Regression

39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark GraphX

40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


• PageRank
• Topic Modeling (LDA)
• Community Detection

42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved (Source: ampcamp.berkeley.edu)


43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
GraphX Algorithms

• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count

44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sample GraphX Code in Scala

// Schematic example (elided parts kept as on the original slide)
val graph = Graph(vertices, edges)

// Join per-vertex messages read from HDFS into the graph
val messages = spark.sparkContext.textFile("hdfs://...")
val graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}

45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Zeppelin

46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What’s Apache Zeppelin?

A web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more.

47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Zeppelin with HDP 2.6+
Web-based Notebook for interactive analytics

Features
• Ad-hoc experimentation
• Deeply integrated with Spark + Hadoop
• Supports multiple language backends
• Incubating at Apache

Use Cases
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time experience
• "Modern Data Science Studio"

48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How does Zeppelin work?

[Diagram: a notebook author works in Zeppelin; Zeppelin connects to the cluster (Spark | Hive | HBase – any of 30+ back ends); collaborators and report viewers consume the shared notebook.]

50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Big Data Lifecycle
[Diagram: Collect → ETL / Process → Analysis → Report / Data Product. The Data Engineer owns collection and ETL, the Data Scientist owns analysis, and results reach the business user and the customer.]

51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Zeppelin Multitenancy

52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Livy

• Livy is an open source REST interface for interacting with Apache Spark from anywhere (a REST sketch follows the diagram below)
• Installed as the Spark Ambari service

[Diagram: a Livy Client talks to the Livy Server over HTTP; the Livy Server talks to Spark over HTTP (RPC), managing interactive sessions and batch sessions, each with its own SparkContext.]
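A hedged sketch of Livy's REST API using Python's requests library (the Livy host is a placeholder; 8998 is Livy's default port):

import json
import requests

livy = "http://livy-server:8998"                     # placeholder Livy Server URL
headers = {"Content-Type": "application/json"}

# 1. Create an interactive PySpark session
resp = requests.post(livy + "/sessions", headers=headers,
                     data=json.dumps({"kind": "pyspark"}))
session_url = livy + resp.headers["Location"]        # e.g. .../sessions/0

# 2. Submit a statement to that session; poll the returned statement URL for its result
requests.post(session_url + "/statements", headers=headers,
              data=json.dumps({"code": "spark.range(100).count()"}))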

53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Security Across Zeppelin-Livy-Spark

[Diagram: Zeppelin authenticates users via Shiro against LDAP; the Spark group interpreter (driver) calls the Livy APIs secured with SPNEGO/Kerberos; the Livy Server submits to Spark on YARN using Kerberos.]

54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Reasons to Integrate with Livy

• Bring sessions to Apache Zeppelin
  – Isolation
  – Session sharing
• Enable efficient cluster resource utilization
  – The default Spark interpreter keeps the YARN/Spark job running forever
  – The Livy interpreter is recycled after 60 minutes of inactivity
    (controlled by livy.server.session.timeout)
• Identity propagation
  – Send the user identity from Zeppelin → Livy → Spark on YARN

55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


SparkSession Sharing

[Diagram: the Livy Server multiplexes clients onto shared sessions – e.g. Clients 1 and 2 attach to Session-1 (SparkSession-1 with its SparkContext), while Client 3 attaches to Session-2 (SparkSession-2 with its SparkContext).]

56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Zeppelin + Livy End-to-End Security

[Diagram: user "Tommy Callahan" logs into Zeppelin (backed by LDAP); the Spark group interpreter calls the Livy APIs over SPNEGO/Kerberos; the Livy Server submits to Spark on YARN via Kerberos/RPC, and the job runs as Tommy Callahan.]

57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


HDP Basics

58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


• Zeppelin → interactive notebook
• Spark APIs – Scala, Java, Python, R: Spark SQL, Spark Streaming, MLlib, GraphX on the Spark Core Engine
• YARN → resource management
• HDFS → distributed storage layer (4M files); future: Ozone object store

[Diagram: the Spark Core Engine runs on YARN over an HDFS cluster of nodes 1…N.]

59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Hortonworks Data Platform

63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sample Architecture

64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


68 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Managed Dataflow

[Diagram: data moves from SOURCES through REGIONAL INFRASTRUCTURE to CORE INFRASTRUCTURE.]
69 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


High-Level Overview
[Diagram: IoT devices feed single-node IoT Edge instances (NiFi); the edge forwards to a hub / data broker in the data center (on prem/cloud), which lands data in a data store (HDFS/Ozone) and a column DB (HBase/Cassandra) and drives a live dashboard.]

72 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 2.x & HDP 2.x
What’s New?

73 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What’s New

• Future HDP / Spark 2.3
  – Structured Streaming continuous mode: single-digit millisecond latency in stream processing
    (instead of the ~100 ms we'd normally see with micro-batching)
  – Stream-to-stream joins
  – PySpark performance boost via pandas UDFs
  – Native support for running Spark applications on Kubernetes clusters
• HDP 2.6.4 / Spark 2.2
  – Structured Streaming GA
  – Yahoo! benchmark: 65M records/s
  – ORC feature & performance improvements → Parquet parity
• HDP 2.6.3 / Spark 2.1
  – Spark SQL Ranger integration for row- and column-level security
  – Dataset API GA
  – GraphX GA
74 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
75 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 2.3

76 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

77 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


DSX + HDP

78 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Data Science Experience (DSX) Local
Enterprise Data Science platform for teams

[Diagram: DSX connects through the Livy REST interface to the Hortonworks Data Platform (HDP), which provides enterprise compute (Spark/Hive) and storage (HDFS/Ozone).]

79 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Lab

80 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Hortonworks Community Connection
community.hortonworks.com

• Full Q&A Platform (like StackOverflow)

• Knowledge Base Articles

• Code Samples and Repositories

81 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Community Engagement
community.hortonworks.com

20k+
Registered Users

45k+
Answers

100k+
Technical Assets
82 © Hortonworks Inc. 2011 – 2016. All Rights Reserved



Future of Data Meetups

83 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Thanks!
Robert Hryniewicz
@RobHryniewicz
