
Scala and the JVM for Big Data:

Lessons from Spark

polyglotprogramming.com/talks
[email protected]
@deanwampler

1
©Dean Wampler 2014-2019, All Rights Reserved
Spark

2
A Distributed
Computing Engine
on the JVM
3
Cluster

(Diagram: an RDD (Resilient Distributed Dataset) is split into partitions,
one or more per node of the cluster.)

Resilient Distributed Datasets
Productivity?

Very concise, elegant, functional APIs.


•Scala, Java
•Python, R
•... and SQL!
5
Productivity?

Interactive shell (REPL)


•Scala, Python, R, and SQL

6
Notebooks
•Jupyter
•Spark Notebook
•Zeppelin
•Beaker
•Databricks
7
Example:
Inverted Index
9
Web Crawl

(Diagram: the crawl step writes an index of (path, contents) records into
blocks, e.g.:

  wikipedia.org/hadoop   Hadoop provides MapReduce and HDFS ...
  wikipedia.org/hbase    HBase stores data in HDFS ...
  wikipedia.org/hive     Hive queries ...
)

Compute Inverted Index

(Diagram: a "Miracle!!" step turns that index into an inverse index from
word to (path, count) pairs, e.g.:

  hadoop   (.../hadoop,1)
  hbase    (.../hbase,1),(.../hive,1)
  hdfs     (.../hadoop,1),(.../hbase,1),(.../hive,1)
  hive     (.../hive,1)
)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                         // (id, content)
  }.flatMap {
    case (id, content) =>
      toWords(content).map(word => ((word,id),1))  // toWords not shown
  }.reduceByKey(_ + _).
  map {
    case ((word,id),n) => (word,(id,n))
  }.groupByKey.
  mapValues {
    seq => sortByCount(seq)                      // Sort the value seq by count, desc.
  }.saveAsTextFile("/path/to/output")
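The helpers toWords and sortByCount are not shown in the talk; here is a
minimal, hypothetical sketch of what they might look like, only to make the
listing self-contained:

// Hypothetical helpers (not the author's code), just enough to compile:
def toWords(content: String): Seq[String] =
  content.toLowerCase.split("""\W+""").filter(_.nonEmpty).toSeq

def sortByCount(seq: Iterable[(String, Int)]): Seq[(String, Int)] =
  seq.toSeq.sortBy { case (_, count) => -count }   // descending by count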
The same pipeline again, annotated with the RDD type produced at each stage:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").
  // RDD[String]: .../hadoop, Hadoop provides...
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap {
    // RDD[(String,String)]: (.../hadoop, Hadoop provides...)
    case (id, contents) =>
      toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).
  // RDD[((String,String),Int)]: ((Hadoop, .../hadoop), 20)
  map {
    case ((word,id),n) => (word,(id,n))
  }.groupByKey.
  // RDD[(String,Iterable[(String,Int)])]: (Hadoop, Seq((.../hadoop, 20), ...))
  mapValues {
    seq => sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
Productivity?

Intuitive API:
•Dataflow of steps.
•Inspired by Scala collections and functional programming.

(Diagram: textFile → map → flatMap → reduceByKey → map → groupByKey → map →
saveAsTextFile)

Performance?

Lazy API:
•Combines steps into "stages".
•Cache intermediate data in memory.

(Same dataflow diagram. A small sketch of laziness and caching follows.)
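A minimal sketch of what the lazy API means in practice (illustrative, reusing
the sparkContext from the inverted-index listing; not code from the talk):

// Transformations only record lineage; nothing executes yet.
val counts = sparkContext.textFile("/path/to/input")
  .flatMap(line => line.split("""\W+"""))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .cache()                          // ask Spark to keep this RDD in memory

counts.count()                      // first action: runs the whole pipeline
counts.take(10).foreach(println)    // reuses the cached data, no recompute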
Higher-Level
APIs
22
SQL:
Datasets/
DataFrames
Example

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Queries")
  .getOrCreate()

val flights = spark.read.parquet(".../flights")
val planes  = spark.read.parquet(".../planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
Each form returns another Dataset:

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)

In the second form, flights("tailNum") === planes("tailNum") is not an
"arbitrary" anonymous function, but a "Column" instance.
Performance
The Dataset API has the
same performance for all
languages:
Scala, Java,
Python, R,
and SQL!
def join(right: Dataset[_], joinExprs: Column): DataFrame = {
def groupBy(cols: Column*): RelationalGroupedDataset = {
def orderBy(sortExprs: Column*): Dataset[T] = {
def select(cols: Column*): Dataset[...] = {
def where(condition: Column): Dataset[T] = {
def limit(n: Int): Dataset[T] = {
def intersect(other: Dataset[T]): Dataset[T] = {
def sample(withReplacement: Boolean, fraction, seed) = {
def drop(col: Column): DataFrame = {
def map[U](f: T => U): Dataset[U] = {
def flatMap[U](f: T => Traversable[U]): Dataset[U] ={
def foreach(f: T => Unit): Unit = {
def take(n: Int): Array[Row] = {
def count(): Long = {
def distinct(): Dataset[T] = {
def agg(exprs: Map[String, String]): DataFrame = {
31
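A small, hedged usage example combining a few of these methods on the flights
Dataset above (the column names "origin" and "dep_delay" are assumptions about
the data, not from the talk):

import org.apache.spark.sql.functions.{avg, col, count, lit}

flights
  .where(col("dep_delay") > 15)                       // filter late departures
  .groupBy(col("origin"))
  .agg(count(lit(1)).as("late_flights"),
       avg(col("dep_delay")).as("avg_delay"))
  .orderBy(col("late_flights").desc)
  .limit(10)
  .show()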
Structured
Streaming
33
DStream (discretized stream)

(Diagram: events arrive continuously and are grouped into batches, one RDD
per time interval: Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD, ...
Windows span several consecutive batches, e.g. "Window of 3 RDD Batches #1"
and "Window of 3 RDD Batches #2". A minimal Structured Streaming sketch
follows.)
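The diagram above shows the older DStream model; a minimal Structured
Streaming sketch, assuming Spark 2.x and a local socket source (e.g.
"nc -lk 9999"), might look like this (illustrative, not from the talk):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StreamingWordCount")
  .getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val counts = lines.as[String]
  .flatMap(_.split("""\s+"""))
  .groupBy("value")                 // default column name for a Dataset[String]
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()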
ML/MLlib
K-Means

•Machine Learning requires:


•Iterative training of models.
•Good linear algebra performance.
(A minimal K-Means sketch follows.)
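A hedged K-Means sketch with spark.ml, using made-up data and column names
(an illustration, not the talk's code; assumes the SparkSession "spark" from
earlier):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Tiny, made-up 2-D dataset.
val raw = Seq((0.0, 0.0), (0.5, 1.0), (9.0, 8.0), (8.5, 9.5)).toDF("x", "y")

// Assemble the numeric columns into the "features" vector K-Means expects.
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(raw)

val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)   // one center per cluster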
GraphX
PageRank

•Graph algorithms require:


•Incremental traversal.
•Efficient edge and node representations.
(A minimal PageRank sketch follows.)
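A hedged PageRank sketch with GraphX on a tiny, made-up graph (illustrative
only; "sc" is the SparkContext from the earlier examples):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "hadoop"), (2L, "hbase"), (3L, "hive")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.001).vertices        // iterate until convergence
ranks.join(vertices).collect().foreach(println)   // (id, (rank, name))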
Foundation:

The JVM
20 Years of
DevOps

Lots of Java Devs


Tools and Libraries
Akka
Breeze
Algebird
Spire & Cats
Axle
...
41
Big Data Ecosystem

42
But it’s

not perfect...
43
Richer data libs.
in Python & R
Garbage
Collection

45
GC Challenges
•Typical Spark heaps: 10s-100s GB.
•Uncommon for “generic”, non-data
services.

46
GC Challenges
•Too many cached RDDs leads to huge
old generation garbage.
•Billions of objects => long GC pauses.

47
Tuning GC
•Best for Spark:
•-XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx...
 -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=...

databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
(A sketch of passing such options to Spark follows.)
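One hedged way to wire such options into a job from Scala, via
spark.executor.extraJavaOptions (the flag values here are placeholders to
tune, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("GC-tuned job")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:-ResizePLAB " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4")
// Note: executor heap size comes from spark.executor.memory
// (e.g. --executor-memory), not from -Xms/-Xmx in extraJavaOptions.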
JVM Object Model

49
Java Objects?
•"abcd": 4 bytes for raw UTF8, right?
•48 bytes for the Java object:
•12 byte header.
•8 bytes for hash code.
•20 bytes for array overhead.
•8 bytes for UTF16 chars.
(A quick way to check this follows.)
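A quick, hedged check in a Spark REPL using Spark's own SizeEstimator (exact
numbers vary with JVM version, compressed oops, and the JDK's internal String
representation):

import org.apache.spark.util.SizeEstimator

SizeEstimator.estimate("abcd")                 // tens of bytes, not 4
SizeEstimator.estimate(new Array[String](0))   // even an empty array has overhead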
Arrays

val myArray: Array[String]

(Diagram: the array holds references; slots 0, 1, 2, 3 point to separately
allocated String objects "zeroth", "first", "second", "third".)

Class Instances

val person: Person

(Diagram: a Person instance holds references to other objects: name points
to the String "Buck Trends", age holds the Int 29, addr points to an Address
instance, and so on.)

Hash Maps

(Diagram: a hash map is a table of hash codes with key and value slots; each
key and value, e.g. "a key" -> "a value", is a separately allocated object
reached through references.)
Improving Performance

Why obsess about this?


Spark jobs are CPU bound:
•Improve network I/O? ~2% better.
•Improve disk I/O? ~20% better.
What changed?

•Faster HW (compared to ~2000)


•10Gbs networks
•SSDs.
55
What changed?

•Smarter use of I/O


•Pruning unneeded data sooner.
•Caching more effectively.
•Efficient formats, like Parquet.
(A small pruning example follows.)
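A small, hedged illustration of pruning with Parquet (the path and column
names are hypothetical; assumes the SparkSession "spark"):

import org.apache.spark.sql.functions.col

val flights = spark.read.parquet("/path/to/flights")
flights
  .select("tailNum", "dest")      // column pruning: only these columns are read
  .where(col("dest") === "ORD")   // predicate pushdown to the Parquet reader
  .explain()                      // plan shows ReadSchema and PushedFilters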
What changed?

•But more CPU use today:


•More Serialization.
•More Compression.
•More Hashing (joins, group-bys).
Improving Performance

To improve performance, we need to focus on the CPU:

•Better algorithms, sure.
•And optimized use of memory.
Project Tungsten

Initiative to greatly improve


Dataset/DataFrame performance.
Goals

60
Reduce References
•Fewer, bigger objects to GC.
•Fewer cache misses.

(Diagram: the same Array[String], Person, and hash map pictures as before,
each a web of references to many small, separately allocated objects.)
Less Expression Overhead
sql("SELECT a + b FROM table")

•Evaluating expressions billions of


times:
•Virtual function calls.
•Boxing/unboxing.
•Branching (if statements, etc.)
Implementation

64
Object Encoding
New CompactRow type:

  | null bit set (1 bit/field) | values (8 bytes/field) | variable length data |
                                 (offset to variable-length data)

•Compute hashCode and equals on the raw bytes.

(Diagram: compare with the Person instance above, whose name, addr, etc. are
references to separately allocated objects; here the whole record is one flat
byte layout. A short encoder sketch follows.)
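A hedged illustration of the user-facing side of this: Dataset encoders map
JVM objects to and from Spark's compact binary row format, so operations can
work on raw bytes (illustrative, assumes the SparkSession "spark"):

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Int)

val enc: Encoder[Person] = Encoders.product[Person]
println(enc.schema)                              // the flat schema: name (string), age (int)

import spark.implicits._
val ds = Seq(Person("Buck Trends", 29)).toDS()   // uses the same encoder machinery
ds.show()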
•BytesToBytesMap:

(Diagram: hash codes index directly into Tungsten memory pages that hold the
key/value pairs k1 v1, k2 v2, k3 v3, k4 v4 as contiguous bytes.)

•Compare with the classic hash map above, where every key and value is a
separately allocated object reached through references.
Memory Management
•Some allocations off heap.
•sun.misc.Unsafe.
(A small off-heap sketch follows.)

69
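A hedged sketch of what off-heap allocation with sun.misc.Unsafe looks like
(this is what Tungsten builds on, not something application code should
normally do; newer JDKs may warn about or restrict this access):

import sun.misc.Unsafe

// Unsafe is not meant to be obtained directly; the usual idiom is reflection.
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val address = unsafe.allocateMemory(8L)   // 8 bytes off the JVM heap
unsafe.putLong(address, 42L)
println(unsafe.getLong(address))          // 42
unsafe.freeMemory(address)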
Less Expression Overhead
sql("SELECT a + b FROM table")

•Solution:
•Generate custom byte code.
•Spark 1.X - for subexpressions.
•Spark 2.0 - for whole queries.
(A small way to see this in the query plan follows.)
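A hedged way to observe the effect (Spark 2.x+): operators fused by
whole-stage code generation are marked with a leading "*" in the physical
plan.

val df = spark.range(1000).selectExpr("id AS a", "id * 2 AS b")
df.selectExpr("a + b").explain()   // "*" marks codegen'd (fused) operators
// In versions where the debug helpers are available, the generated Java
// source itself can be dumped:
//   import org.apache.spark.sql.execution.debug._
//   df.selectExpr("a + b").debugCodegen()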
No Value Types

(Planned for Java 9 or 10)


73
case class Timestamp(epochMillis: Long) {

  override def toString: String = { ... }

  def add(delta: TimeDelta): Timestamp = {
    /* return new shifted time */
  }
  ...
}

Don't allocate on the heap; just push the primitive long on the stack.
(scalac does this now.)
(A value-class sketch follows.)
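A hedged sketch of the same idea with a Scala value class (extends AnyVal),
which usually erases the wrapper at runtime; TimeDelta is simplified here to
a plain Long:

// Value class: in most uses the underlying Long is passed around directly,
// with no Timestamp wrapper allocated on the heap.
case class Timestamp(epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp =
    Timestamp(epochMillis + deltaMillis)
}

val t = Timestamp(0L).add(1000L)   // typically no wrapper allocation here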
Long operations
aren’t atomic
According to the
JVM spec
75
No Unsigned Types

What’s
factorial(-1)?
76
Arrays Indexed
with Ints
Byte Arrays
limited to 2GB!
77
scala> val N = 1100*1000*1000
N: Int = 1100000000 // 1.1 billion

scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)

scala> import org.apache.spark.util.SizeEstimator

scala> SizeEstimator.estimate(array)
res3: Long = 2200000016 // 2.2GB
78
scala> val b = sc.broadcast(array)
...broadcast.Broadcast[Array[Short]] = ...

scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

79
scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

Boom!

java.lang.OutOfMemoryError:
Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
80
But wait...
I actually lied
to you...
81
Spark handles large
broadcast variables
by breaking them
into blocks.
Scala
REPL
java.lang.OutOfMemoryError:
Requested array size exceeds VM limit

  at java.util.Arrays.copyOf(...)
  ...
  at java.io.ByteArrayOutputStream.write(...)
  ...
  at java.io.ObjectOutputStream.writeObject(...)
  at ...spark.serializer.JavaSerializationStream.writeObject(...)
  ...
  at ...spark.util.ClosureCleaner$.ensureSerializable(..)
  ...
  at org.apache.spark.rdd.RDD.map(...)

Reading the trace from the bottom up: we pass this closure to RDD.map:

  i => b.value(i)

RDD.map verifies that the closure is "clean" (serializable), which it does by
serializing it to a byte array, which requires copying an array...

What array??? The closure is just i => b.value(i). Remember, though:

  scala> val array = Array.fill[Short](N)(0)
Why did this
happen?
89
•You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

•Scala compiles:
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)

  class $iwC extends Serializable {
    sc.parallelize(...).map(i => b.value(i))
  }
}

So, this closure over "b" ... sucks in the whole enclosing object, including
the 2.2GB array!
Lightbend is
investigating
re-engineering
the REPL
Workarounds...

94
•Transient is often all you need:
scala> @transient val array =
| Array.fill[Short](N)(0)
scala> ...

95
object Data { // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB // local ref!
    val rdd = sc.parallelize(...).
      map(i => b.value(i)) // only needs b
    rdd.take(10).foreach(println)
  }
}
Why Scala?
See the longer version
of this talk at
polyglotprogramming.com/talks
polyglotprogramming.com/talks
lightbend.com/fast-data-platform
[email protected]
@deanwampler

Questions?
Bonus Material
You can find an extended version of this
talk with more details at
polyglotprogramming.com/talks

100
