
Scala and the JVM for Big Data:

Lessons from Spark

polyglotprogramming.com/talks
[email protected]
@deanwampler

1
©Dean Wampler 2014-2019, All Rights Reserved
Spark

2
A Distributed
Computing Engine
on the JVM
3
Cluster

(Diagram: an RDD (Resilient Distributed Dataset) is split into partitions,
one or more per node of the cluster.)

Resilient Distributed Datasets
Productivity?

Very concise, elegant, functional APIs.


•Scala, Java
•Python, R
•... and SQL!
5
Productivity?

Interactive shell (REPL)


•Scala, Python, R, and SQL

6
Notebooks
•Jupyter
•Spark Notebook
•Zeppelin
•Beaker
•Databricks
7
Example:
Inverted Index
9
Web Crawl

(Diagram: the crawl step writes an index of (path, contents) records into
blocks, e.g.:

  wikipedia.org/hadoop   Hadoop provides MapReduce and HDFS ...
  wikipedia.org/hbase    HBase stores data in HDFS ...
  wikipedia.org/hive     Hive queries ...
)

Compute Inverted Index

(Diagram: a "Miracle!!" step turns that index into an inverse index from
word to (path, count) pairs, e.g.:

  hadoop   (.../hadoop,1)
  hbase    (.../hbase,1),(.../hive,1)
  hdfs     (.../hadoop,1),(.../hbase,1),(.../hive,1)
  hive     (.../hive,1)
)
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))                         // (id, content)
  }.flatMap {
    case (id, content) =>
      toWords(content).map(word => ((word,id),1))  // toWords not shown
  }.reduceByKey(_ + _).
  map {
    case ((word,id),n) => (word,(id,n))
  }.groupByKey.
  mapValues {
    seq => sortByCount(seq)                      // Sort the value seq by count, desc.
  }.saveAsTextFile("/path/to/output")
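The helpers toWords and sortByCount are not shown in the talk; here is a
minimal, hypothetical sketch of what they might look like, only to make the
listing self-contained:

// Hypothetical helpers (not the author's code), just enough to compile:
def toWords(content: String): Seq[String] =
  content.toLowerCase.split("""\W+""").filter(_.nonEmpty).toSeq

def sortByCount(seq: Iterable[(String, Int)]): Seq[(String, Int)] =
  seq.toSeq.sortBy { case (_, count) => -count }   // descending by count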
The same pipeline again, annotated with the RDD type produced at each stage:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

val sparkContext = new SparkContext(master, "Inv. Index")

sparkContext.textFile("/path/to/input").
  // RDD[String]: .../hadoop, Hadoop provides...
  map { line =>
    val array = line.split(",", 2)
    (array(0), array(1))
  }.flatMap {
    // RDD[(String,String)]: (.../hadoop, Hadoop provides...)
    case (id, contents) =>
      toWords(contents).map(w => ((w,id),1))
  }.reduceByKey(_ + _).
  // RDD[((String,String),Int)]: ((Hadoop, .../hadoop), 20)
  map {
    case ((word,id),n) => (word,(id,n))
  }.groupByKey.
  // RDD[(String,Iterable[(String,Int)])]: (Hadoop, Seq((.../hadoop, 20), ...))
  mapValues {
    seq => sortByCount(seq)
  }.saveAsTextFile("/path/to/output")
Productivity?

Intuitive API:
•Dataflow of steps.
•Inspired by Scala collections and functional programming.

(Diagram: textFile → map → flatMap → reduceByKey → map → groupByKey → map →
saveAsTextFile)

Performance?

Lazy API:
•Combines steps into "stages".
•Cache intermediate data in memory.

(Same dataflow diagram. A small sketch of laziness and caching follows.)
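A minimal sketch of what the lazy API means in practice (illustrative, reusing
the sparkContext from the inverted-index listing; not code from the talk):

// Transformations only record lineage; nothing executes yet.
val counts = sparkContext.textFile("/path/to/input")
  .flatMap(line => line.split("""\W+"""))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .cache()                          // ask Spark to keep this RDD in memory

counts.count()                      // first action: runs the whole pipeline
counts.take(10).foreach(println)    // reuses the cached data, no recompute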
Higher-Level
APIs
22
SQL:
Datasets/
DataFrames
Example

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Queries")
  .getOrCreate()

val flights = spark.read.parquet(".../flights")
val planes  = spark.read.parquet(".../planes")
flights.createOrReplaceTempView("flights")
planes.createOrReplaceTempView("planes")
flights.cache(); planes.cache()

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)
Each form returns another Dataset:

val planes_for_flights1 = spark.sql("""
  SELECT * FROM flights f
  JOIN planes p ON f.tailNum = p.tailNum
  LIMIT 100""")

val planes_for_flights2 =
  flights.join(planes,
    flights("tailNum") === planes("tailNum")).limit(100)

In the second form, flights("tailNum") === planes("tailNum") is not an
"arbitrary" anonymous function, but a "Column" instance.
Performance
The Dataset API has the
same performance for all
languages:
Scala, Java,
Python, R,
and SQL!
def join(right: Dataset[_], joinExprs: Column): DataFrame = {
def groupBy(cols: Column*): RelationalGroupedDataset = {
def orderBy(sortExprs: Column*): Dataset[T] = {
def select(cols: Column*): Dataset[...] = {
def where(condition: Column): Dataset[T] = {
def limit(n: Int): Dataset[T] = {
def intersect(other: Dataset[T]): Dataset[T] = {
def sample(withReplacement: Boolean, fraction, seed) = {
def drop(col: Column): DataFrame = {
def map[U](f: T => U): Dataset[U] = {
def flatMap[U](f: T => Traversable[U]): Dataset[U] ={
def foreach(f: T => Unit): Unit = {
def take(n: Int): Array[Row] = {
def count(): Long = {
def distinct(): Dataset[T] = {
def agg(exprs: Map[String, String]): DataFrame = {
31
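A small, hedged usage example combining a few of these methods on the flights
Dataset above (the column names "origin" and "dep_delay" are assumptions about
the data, not from the talk):

import org.apache.spark.sql.functions.{avg, col, count, lit}

flights
  .where(col("dep_delay") > 15)                       // filter late departures
  .groupBy(col("origin"))
  .agg(count(lit(1)).as("late_flights"),
       avg(col("dep_delay")).as("avg_delay"))
  .orderBy(col("late_flights").desc)
  .limit(10)
  .show()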
Structured
Streaming
33
DStream (discretized stream)

(Diagram: events arrive continuously and are grouped into batches, one RDD
per time interval: Time 1 RDD, Time 2 RDD, Time 3 RDD, Time 4 RDD, ...
Windows span several consecutive batches, e.g. "Window of 3 RDD Batches #1"
and "Window of 3 RDD Batches #2". A minimal Structured Streaming sketch
follows.)
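The diagram above shows the older DStream model; a minimal Structured
Streaming sketch, assuming Spark 2.x and a local socket source (e.g.
"nc -lk 9999"), might look like this (illustrative, not from the talk):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StreamingWordCount")
  .getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

val counts = lines.as[String]
  .flatMap(_.split("""\s+"""))
  .groupBy("value")                 // default column name for a Dataset[String]
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()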
ML/MLlib
K-Means

•Machine Learning requires:


•Iterative training of models.
•Good linear algebra performance.
(A minimal K-Means sketch follows.)
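A hedged K-Means sketch with spark.ml, using made-up data and column names
(an illustration, not the talk's code; assumes the SparkSession "spark" from
earlier):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

// Tiny, made-up 2-D dataset.
val raw = Seq((0.0, 0.0), (0.5, 1.0), (9.0, 8.0), (8.5, 9.5)).toDF("x", "y")

// Assemble the numeric columns into the "features" vector K-Means expects.
val features = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(raw)

val model = new KMeans().setK(2).setSeed(1L).fit(features)
model.clusterCenters.foreach(println)   // one center per cluster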
GraphX
PageRank

•Graph algorithms require:


•Incremental traversal.
•Efficient edge and node representations.
(A minimal PageRank sketch follows.)
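A hedged PageRank sketch with GraphX on a tiny, made-up graph (illustrative
only; "sc" is the SparkContext from the earlier examples):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "hadoop"), (2L, "hbase"), (3L, "hive")))
val edges: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

val graph = Graph(vertices, edges)
val ranks = graph.pageRank(0.001).vertices        // iterate until convergence
ranks.join(vertices).collect().foreach(println)   // (id, (rank, name))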
Foundation:

The JVM
20 Years of
DevOps

Lots of Java Devs


Tools and Libraries
Akka
Breeze
Algebird
Spire & Cats
Axle
...
41
Big Data Ecosystem

42
But it’s

not perfect...
43
Richer data libs.
in Python & R
Garbage
Collection

45
GC Challenges
•Typical Spark heaps: 10s-100s GB.
•Uncommon for “generic”, non-data
services.

46
GC Challenges
•Too many cached RDDs leads to huge
old generation garbage.
•Billions of objects => long GC pauses.

47
Tuning GC
•Best for Spark:
•-XX:+UseG1GC -XX:-ResizePLAB -Xms... -Xmx...
 -XX:InitiatingHeapOccupancyPercent=... -XX:ConcGCThreads=...

databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
(A sketch of passing such options to Spark follows.)
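One hedged way to wire such options into a job from Scala, via
spark.executor.extraJavaOptions (the flag values here are placeholders to
tune, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("GC-tuned job")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:-ResizePLAB " +
    "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4")
// Note: executor heap size comes from spark.executor.memory
// (e.g. --executor-memory), not from -Xms/-Xmx in extraJavaOptions.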
JVM Object Model

49
Java Objects?
•"abcd": 4 bytes for raw UTF8, right?
•48 bytes for the Java object:
•12 byte header.
•8 bytes for hash code.
•20 bytes for array overhead.
•8 bytes for UTF16 chars.
(A quick way to check this follows.)
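A quick, hedged check in a Spark REPL using Spark's own SizeEstimator (exact
numbers vary with JVM version, compressed oops, and the JDK's internal String
representation):

import org.apache.spark.util.SizeEstimator

SizeEstimator.estimate("abcd")                 // tens of bytes, not 4
SizeEstimator.estimate(new Array[String](0))   // even an empty array has overhead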
Arrays

val myArray: Array[String]

(Diagram: the array holds references; slots 0, 1, 2, 3 point to separately
allocated String objects "zeroth", "first", "second", "third".)

Class Instances

val person: Person

(Diagram: a Person instance holds references to other objects: name points
to the String "Buck Trends", age holds the Int 29, addr points to an Address
instance, and so on.)

Hash Maps

(Diagram: a hash map is a table of hash codes with key and value slots; each
key and value, e.g. "a key" -> "a value", is a separately allocated object
reached through references.)
Improving Performance

Why obsess about this?


Spark jobs are CPU bound:
•Improve network I/O? ~2% better.
•Improve disk I/O? ~20% better.
What changed?

•Faster HW (compared to ~2000)


•10Gbs networks
•SSDs.
55
What changed?

•Smarter use of I/O


•Pruning unneeded data sooner.
•Caching more effectively.
•Efficient formats, like Parquet.
(A small pruning example follows.)
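A small, hedged illustration of pruning with Parquet (the path and column
names are hypothetical; assumes the SparkSession "spark"):

import org.apache.spark.sql.functions.col

val flights = spark.read.parquet("/path/to/flights")
flights
  .select("tailNum", "dest")      // column pruning: only these columns are read
  .where(col("dest") === "ORD")   // predicate pushdown to the Parquet reader
  .explain()                      // plan shows ReadSchema and PushedFilters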
What changed?

•But more CPU use today:


•More Serialization.
•More Compression.
•More Hashing (joins, group-bys).
Improving Performance

To improve performance, we need to focus on the CPU:

•Better algorithms, sure.
•And optimized use of memory.
Project Tungsten

Initiative to greatly improve


Dataset/DataFrame performance.
Goals

60
Reduce References
•Fewer, bigger objects to GC.
•Fewer cache misses.

(Diagram: the same Array[String], Person, and hash map pictures as before,
each a web of references to many small, separately allocated objects.)
Less Expression Overhead
sql("SELECT a + b FROM table")

•Evaluating expressions billions of


times:
•Virtual function calls.
•Boxing/unboxing.
•Branching (if statements, etc.)
Implementation

64
Object Encoding
New CompactRow type:

  | null bit set (1 bit/field) | values (8 bytes/field) | variable length data |
                                 (offset to variable-length data)

•Compute hashCode and equals on the raw bytes.

(Diagram: compare with the Person instance above, whose name, addr, etc. are
references to separately allocated objects; here the whole record is one flat
byte layout. A short encoder sketch follows.)
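A hedged illustration of the user-facing side of this: Dataset encoders map
JVM objects to and from Spark's compact binary row format, so operations can
work on raw bytes (illustrative, assumes the SparkSession "spark"):

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Int)

val enc: Encoder[Person] = Encoders.product[Person]
println(enc.schema)                              // the flat schema: name (string), age (int)

import spark.implicits._
val ds = Seq(Person("Buck Trends", 29)).toDS()   // uses the same encoder machinery
ds.show()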
•BytesToBytesMap:

(Diagram: hash codes index directly into Tungsten memory pages that hold the
key/value pairs k1 v1, k2 v2, k3 v3, k4 v4 as contiguous bytes.)

•Compare with the classic hash map above, where every key and value is a
separately allocated object reached through references.
Memory Management
•Some allocations off heap.
•sun.misc.Unsafe.
(A small off-heap sketch follows.)

69
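A hedged sketch of what off-heap allocation with sun.misc.Unsafe looks like
(this is what Tungsten builds on, not something application code should
normally do; newer JDKs may warn about or restrict this access):

import sun.misc.Unsafe

// Unsafe is not meant to be obtained directly; the usual idiom is reflection.
val field = classOf[Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[Unsafe]

val address = unsafe.allocateMemory(8L)   // 8 bytes off the JVM heap
unsafe.putLong(address, 42L)
println(unsafe.getLong(address))          // 42
unsafe.freeMemory(address)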
Less Expression Overhead
sql("SELECT a + b FROM table")

•Solution:
•Generate custom byte code.
•Spark 1.X - for subexpressions.
•Spark 2.0 - for whole queries.
(A small way to see this in the query plan follows.)
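A hedged way to observe the effect (Spark 2.x+): operators fused by
whole-stage code generation are marked with a leading "*" in the physical
plan.

val df = spark.range(1000).selectExpr("id AS a", "id * 2 AS b")
df.selectExpr("a + b").explain()   // "*" marks codegen'd (fused) operators
// In versions where the debug helpers are available, the generated Java
// source itself can be dumped:
//   import org.apache.spark.sql.execution.debug._
//   df.selectExpr("a + b").debugCodegen()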
No Value Types

(Planned for Java 9 or 10)


73
case class Timestamp(epochMillis: Long) {

  override def toString: String = { ... }

  def add(delta: TimeDelta): Timestamp = {
    /* return new shifted time */
  }
  ...
}

Don't allocate on the heap; just push the primitive long on the stack.
(scalac does this now.)
(A value-class sketch follows.)
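A hedged sketch of the same idea with a Scala value class (extends AnyVal),
which usually erases the wrapper at runtime; TimeDelta is simplified here to
a plain Long:

// Value class: in most uses the underlying Long is passed around directly,
// with no Timestamp wrapper allocated on the heap.
case class Timestamp(epochMillis: Long) extends AnyVal {
  def add(deltaMillis: Long): Timestamp =
    Timestamp(epochMillis + deltaMillis)
}

val t = Timestamp(0L).add(1000L)   // typically no wrapper allocation here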
Long operations
aren’t atomic
According to the
JVM spec
75
No Unsigned Types

What’s
factorial(-1)?
76
Arrays Indexed
with Ints
Byte Arrays
limited to 2GB!
77
scala> val N = 1100*1000*1000
N: Int = 1100000000 // 1.1 billion

scala> val array = Array.fill[Short](N)(0)
array: Array[Short] = Array(0, 0, ...)

scala> import org.apache.spark.util.SizeEstimator

scala> SizeEstimator.estimate(array)
res3: Long = 2200000016 // 2.2GB
78
scala> val b = sc.broadcast(array)
...broadcast.Broadcast[Array[Short]] = ...

scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

79
scala> SizeEstimator.estimate(b)
res0: Long = 2368

scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

Boom!

java.lang.OutOfMemoryError:
Requested array size exceeds VM limit
  at java.util.Arrays.copyOf(...)
  ...
80
But wait...
I actually lied
to you...
81
Spark handles large
broadcast variables
by breaking them
into blocks.
Scala
REPL
java.lang.OutOfMemoryError:
Requested array size exceeds VM limit

  at java.util.Arrays.copyOf(...)
  ...
  at java.io.ByteArrayOutputStream.write(...)
  ...
  at java.io.ObjectOutputStream.writeObject(...)
  at ...spark.serializer.JavaSerializationStream.writeObject(...)
  ...
  at ...spark.util.ClosureCleaner$.ensureSerializable(..)
  ...
  at org.apache.spark.rdd.RDD.map(...)

Reading the trace from the bottom up: we pass this closure to RDD.map:

  i => b.value(i)

RDD.map verifies that the closure is "clean" (serializable), which it does by
serializing it to a byte array, which requires copying an array...

What array??? The closure is just i => b.value(i). Remember, though:

  scala> val array = Array.fill[Short](N)(0)
Why did this
happen?
89
•You write:
scala> val array = Array.fill[Short](N)(0)
scala> val b = sc.broadcast(array)
scala> sc.parallelize(0 until 100000).
     |   map(i => b.value(i))

•Scala compiles:
class $iwC extends Serializable {
  val array = Array.fill[Short](N)(0)
  val b = sc.broadcast(array)

  class $iwC extends Serializable {
    sc.parallelize(...).map(i => b.value(i))
  }
}

So, this closure over "b" ... sucks in the whole enclosing object, including
the 2.2GB array!
Lightbend is
investigating
re-engineering
the REPL
Workarounds...

94
•Transient is often all you need:
scala> @transient val array =
| Array.fill[Short](N)(0)
scala> ...

95
object Data { // Encapsulate in objects!
  val N = 1100*1000*1000
  val array = Array.fill[Short](N)(0)
  val getB = sc.broadcast(array)
}

object Work {
  def run(): Unit = {
    val b = Data.getB // local ref!
    val rdd = sc.parallelize(...).
      map(i => b.value(i)) // only needs b
    rdd.take(10).foreach(println)
  }
}
Why Scala?
See the longer version
of this talk at
polyglotprogramming.com/talks
polyglotprogramming.com/talks
lightbend.com/fast-data-platform
[email protected]
@deanwampler

Questions?
Bonus Material
You can find an extended version of this
talk with more details at
polyglotprogramming.com/talks

100
