
Apache Spark

Crash Course - DataWorks Summit – Berlin 2018

Robert Hryniewicz
Data Evangelist
@RobHryniewicz
Data

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Data Sources
• Internet of Things (IoT)
  – Wind turbines, oil rigs
  – Beacons, wearables
  – Smart cars

• User-generated content (social, web & mobile)
  – Twitter, Facebook, Snapchat
  – Clickstream
  – PayPal, Venmo

3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Data Growth in Zettabytes (ZB)

[Chart: worldwide data volume by year, 2006–2020, on a 0–70 ZB scale; annotated "50+ ZB in 2021".]

4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


The “Big Data” Problem
Problem
• A single machine cannot process, or even store, all the data!

Solution
• Distribute data over large clusters

Difficulties
• How to split work across machines?
• Moving data over the network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?

5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Spark

6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What Is Apache Spark?

• Apache open source project originally developed at AMPLab (University of California, Berkeley)
• Unified, general data processing engine that operates across varied data workloads and platforms

7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Why Apache Spark?

• Elegant developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)
• In-memory computation model – fast!
  – Effective for iterative computations and ML
• Machine Learning
  – Implementations of distributed ML algorithms
  – Pipeline API (Spark MLlib)
  – External libraries via open & commercial projects (e.g. H2O's Sparkling Water)

8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Real-time Machine Learning Graph Analysis

9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL

10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


More Flexible /// Better Storage and Performance

12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Overview

• Spark module for structured data processing (e.g. ORC, Parquet, Avro, MySQL)
• Two ways to manipulate data:
  – DataFrame/Dataset API
  – SQL queries

13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


SparkSession

What is it?
• Main entry point for Spark functionality
• Allows programming with the DataFrame and Dataset APIs
• Exposed as the variable spark and auto-initialized in notebook environments such as Zeppelin or Jupyter; outside a notebook you create it yourself (a minimal sketch follows)
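A hedged PySpark sketch of creating the session explicitly (the app name is illustrative):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession when not running inside Zeppelin/Jupyter
spark = (SparkSession.builder
         .appName("spark-crash-course")   # illustrative app name
         .getOrCreate())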

14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


DataFrames

• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R

Data is described as a DataFrame with rows, columns, and a schema (a small sketch follows).

[Diagram: a DataFrame as a grid of rows and named columns Col1 … ColN.]
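A small, hedged PySpark sketch (the column names "city" and "temp" are invented for illustration) showing that a DataFrame is just rows plus named columns plus a schema:

# Build a tiny DataFrame from local rows; the column list defines the schema
df = spark.createDataFrame(
    [("Berlin", 13.5), ("Madrid", 21.0)],
    ["city", "temp"])
df.printSchema()   # city: string, temp: double
df.show()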

15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sources

[Diagram: sources such as Avro, CSV, JSON, and Hive are read through Spark SQL into a DataFrame of rows and named columns (Col1 … ColN).]

16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Create a DataFrame

Example
val path = "examples/flights.json"
val flights = spark.read.json(path)
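A quick, hedged follow-up to inspect what was loaded; printSchema() shows the schema Spark inferred from the JSON and show(5) previews five rows (the calls read the same in Scala and PySpark):

flights.printSchema()
flights.show(5)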

17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Register a Temporary View (SQL API)

Example
flights.createOrReplaceTempView("flightsView")
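Once the view is registered, SQL can be run against it directly with spark.sql; a minimal sketch:

spark.sql("SELECT COUNT(*) FROM flightsView").show()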

18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Two API Examples: DataFrame and SQL APIs

DataFrame API
flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)

SQL API
SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5

Results
+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Real-time Machine Learning Graph Analysis

21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best
Action

Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics

22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Modern Data Applications approach to Insights

Traditional Analytics              Next Generation Analytics
Structured & Repeatable            Iterative & Exploratory
Structure built to store data      Data is the structure
Start with hypothesis              Data leads the way
Test against selected data         Explore all data, identify correlations
Analyze after landing…             Analyze in motion…

23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Streaming

Overview
• Extension of the Spark Core API
• Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant

(Note: the ZeroMQ and MQTT receivers are no longer supported in Spark 2.x.)

24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

Discretized Streams (DStreams)

• High-level abstraction representing a continuous stream of data
• Internally represented as a sequence of RDDs
• An operation applied on a DStream translates to operations on the underlying RDDs (a word-count sketch follows)
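A minimal PySpark DStream sketch, assuming a socket source on a placeholder host/port: word counts over 1-second micro-batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-wordcount")
ssc = StreamingContext(sc, 1)                      # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)    # placeholder source
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()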

26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

Example: flatMap operation
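The figure on the original slide illustrates flatMap; continuing the socket sketch above, each incoming line is split so that every word becomes an element of the new DStream:

words = lines.flatMap(lambda line: line.split(" "))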

27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark Streaming

Window Operations
• Apply transformations over a sliding window of data, e.g. a rolling average (a sketch follows)
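A hedged sketch over the words DStream from the earlier example: word counts over the last 30 seconds, recomputed every 10 seconds (both durations are illustrative).

windowed = (words.map(lambda word: (word, 1))
                 .reduceByKeyAndWindow(lambda a, b: a + b,   # combine counts
                                       None,                 # no inverse function
                                       30, 10))              # window, slide (seconds)
windowed.pprint()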

28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Challenges in Streaming Data

• Consistency
• Fault tolerance
• Out-of-order data

29 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Structured Streaming

• High-level APIs – DataFrames, Datasets and SQL; the same in streaming as in batch (a sketch follows)
• Event-time processing – native support for working with out-of-order and late data
• End-to-end exactly-once – transactional both in processing and output
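A minimal Structured Streaming sketch in PySpark: the same DataFrame API as batch, but the source is unbounded (the socket host/port and the console sink are placeholders):

socket_df = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

counts = socket_df.groupBy("value").count()        # running counts per input line

query = (counts.writeStream
               .outputMode("complete")             # emit the full updated result table
               .format("console")
               .start())
query.awaitTermination()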

30 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Structured Streaming: Basics

31 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Structured Streaming: Model

32 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Handling late arriving data
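The slide's figure shows event-time windows being updated by late records; a hedged PySpark sketch using a watermark (the DataFrame events_df and its columns eventTime and word are assumptions for illustration):

from pyspark.sql.functions import window, col

late_tolerant_counts = (events_df
    .withWatermark("eventTime", "10 minutes")                   # drop data >10 min late
    .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
    .count())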

33 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark MLlib

34 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

35 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark ML Pipeline

• fit() is for training
• transform() is for prediction

[Diagram: Train – an input DataFrame (TRAIN) passes through the Pipeline and fit() produces a Pipeline Model.
 Predict – an input DataFrame (TEST) passes through the Pipeline Model and transform() produces an output DataFrame (PREDICTIONS).]

36 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark ML Pipeline

[Diagram: the Pipeline chains stages – Feature transform 1 → Feature transform 2 → Combine features → Linear Regression.
 Train: input DataFrame → Pipeline (the fitted model can be exported).
 Predict: input DataFrame → Pipeline Model → output DataFrame.]

37 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sample Spark ML Pipeline

indexer = …
parser = …
hashingTF = …
vecAssembler = …

rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])

model = pipe.fit(trainData) # Train model


results = model.transform(testData) # Test model
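A hedged PySpark sketch of how the elided stages above might be defined (the column names label_str, text, and other_feature are invented for illustration; the actual lab notebook may differ):

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, RegexTokenizer, HashingTF, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

indexer = StringIndexer(inputCol="label_str", outputCol="label")           # encode the target
parser = RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W")  # split text into words
hashingTF = HashingTF(inputCol="words", outputCol="text_features")          # hash words to a vector
vecAssembler = VectorAssembler(inputCols=["text_features", "other_feature"],
                               outputCol="features")                        # combine feature columns
rf = RandomForestClassifier(numTrees=100)

pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])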

38 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Exporting ML Models - PMML
• Predictive Model Markup Language (PMML)
  – XML-based predictive model interchange format
• Supported models
  – K-Means
  – Linear Regression
  – Ridge Regression
  – Lasso
  – SVM
  – Binary Logistic Regression

39 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark GraphX

40 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

41 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


• PageRank
• Topic Modeling (LDA)
• Community Detection

42 © Hortonworks Inc. 2011 – 2016. All Rights Reserved (Source: ampcamp.berkeley.edu)


43 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
GraphX Algorithms

• PageRank
• Connected components
• Label propagation
• SVD++
• Strongly connected components
• Triangle count

44 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sample GraphX Code in Scala

// Schematic example (elided parts kept as on the original slide)
val graph = Graph(vertices, edges)

// Join per-vertex messages read from HDFS into the graph
val messages = spark.sparkContext.textFile("hdfs://...")
val graph2 = graph.joinVertices(messages) {
  (id, vertex, msg) => ...
}

45 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Zeppelin

46 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What’s Apache Zeppelin?

A web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more.

47 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Zeppelin with HDP 2.6+
Web-based Notebook for interactive analytics

Features
• Ad-hoc experimentation
• Deeply integrated with Spark + Hadoop
• Supports multiple language backends
• Incubating at Apache

Use Cases
• Data exploration and discovery
• Visualization
• Interactive snippet-at-a-time experience
• "Modern Data Science Studio"

48 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


49 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
How does Zeppelin work?

[Diagram: a notebook author works in Zeppelin; Zeppelin connects to the cluster (Spark | Hive | HBase – any of 30+ back ends); collaborators and report viewers consume the shared notebook.]

50 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Big Data Lifecycle
[Diagram: Collect → ETL / Process → Analysis → Report / Data Product. The Data Engineer owns collection and ETL, the Data Scientist owns analysis, and results reach the business user and the customer.]

51 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Zeppelin Multitenancy

52 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Livy

• Livy is an open source REST interface for interacting with Apache Spark from anywhere (a REST sketch follows the diagram below)
• Installed as the Spark Ambari service

[Diagram: a Livy Client talks to the Livy Server over HTTP; the Livy Server talks to Spark over HTTP (RPC), managing interactive sessions and batch sessions, each with its own SparkContext.]
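A hedged sketch of Livy's REST API using Python's requests library (the Livy host is a placeholder; 8998 is Livy's default port):

import json
import requests

livy = "http://livy-server:8998"                     # placeholder Livy Server URL
headers = {"Content-Type": "application/json"}

# 1. Create an interactive PySpark session
resp = requests.post(livy + "/sessions", headers=headers,
                     data=json.dumps({"kind": "pyspark"}))
session_url = livy + resp.headers["Location"]        # e.g. .../sessions/0

# 2. Submit a statement to that session; poll the returned statement URL for its result
requests.post(session_url + "/statements", headers=headers,
              data=json.dumps({"code": "spark.range(100).count()"}))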

53 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Security Across Zeppelin-Livy-Spark

[Diagram: Zeppelin authenticates users via Shiro against LDAP; the Spark group interpreter (driver) calls the Livy APIs secured with SPNEGO/Kerberos; the Livy Server submits to Spark on YARN using Kerberos.]

54 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Reasons to Integrate with Livy

• Bring sessions to Apache Zeppelin
  – Isolation
  – Session sharing
• Enable efficient cluster resource utilization
  – The default Spark interpreter keeps the YARN/Spark job running forever
  – The Livy interpreter is recycled after 60 minutes of inactivity
    (controlled by livy.server.session.timeout)
• Identity propagation
  – Send the user identity from Zeppelin → Livy → Spark on YARN

55 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


SparkSession Sharing

[Diagram: the Livy Server multiplexes clients onto shared sessions – e.g. Clients 1 and 2 attach to Session-1 (SparkSession-1 with its SparkContext), while Client 3 attaches to Session-2 (SparkSession-2 with its SparkContext).]

56 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Apache Zeppelin + Livy End-to-End Security

[Diagram: user "Tommy Callahan" logs into Zeppelin (backed by LDAP); the Spark group interpreter calls the Livy APIs over SPNEGO/Kerberos; the Livy Server submits to Spark on YARN via Kerberos/RPC, and the job runs as Tommy Callahan.]

57 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


HDP Basics

58 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


• Zeppelin → interactive notebook
• Spark APIs – Scala, Java, Python, R: Spark SQL, Spark Streaming, MLlib, GraphX on the Spark Core Engine
• YARN → resource management
• HDFS → distributed storage layer (4M files); future: Ozone object store

[Diagram: the Spark Core Engine runs on YARN over an HDFS cluster of nodes 1…N.]

59 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Hortonworks Data Platform

63 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Sample Architecture

64 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


68 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Managed Dataflow

[Diagram: data moves from SOURCES through REGIONAL INFRASTRUCTURE to CORE INFRASTRUCTURE.]
69 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


High-Level Overview
[Diagram: IoT devices feed single-node IoT Edge instances (NiFi); the edge forwards to a hub / data broker in the data center (on prem/cloud), which lands data in a data store (HDFS/Ozone) and a column DB (HBase/Cassandra) and drives a live dashboard.]

72 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 2.x & HDP 2.x
What’s New?

73 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


What’s New

• Future HDP / Spark 2.3
  – Structured Streaming continuous mode: single-digit millisecond latency in stream processing
    (instead of the ~100 ms we'd normally see with micro-batching)
  – Stream-to-stream joins
  – PySpark performance boost via pandas UDFs
  – Native support for running Spark applications on Kubernetes clusters
• HDP 2.6.4 / Spark 2.2
  – Structured Streaming GA
  – Yahoo! benchmark: 65M records/s
  – ORC feature & performance improvements → Parquet parity
• HDP 2.6.3 / Spark 2.1
  – Spark SQL Ranger integration for row- and column-level security
  – Dataset API GA
  – GraphX GA
74 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
75 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark 2.3

76 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Spark SQL Spark Streaming Spark MLlib GraphX
Structured Data Near Real-time Machine Learning Graph Analysis

77 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


DSX + HDP

78 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Data Science Experience (DSX) Local
Enterprise Data Science platform for teams

[Diagram: DSX connects through the Livy REST interface to the Hortonworks Data Platform (HDP), which provides enterprise compute (Spark/Hive) and storage (HDFS/Ozone).]

79 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Lab

80 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Hortonworks Community Connection
community.hortonworks.com

• Full Q&A Platform (like StackOverflow)

• Knowledge Base Articles

• Code Samples and Repositories

81 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Community Engagement
community.hortonworks.com

20k+
Registered Users

45k+
Answers

100k+
Technical Assets
82 © Hortonworks Inc. 2011 – 2016. All Rights Reserved



Future of Data Meetups

83 © Hortonworks Inc. 2011 – 2016. All Rights Reserved


Thanks!
Robert Hryniewicz
@RobHryniewicz
