Ebook: Accelerating Apache Spark 3
Leveraging NVIDIA GPUs to Power the Next Era of Analytics and AI
[Figure: MapReduce execution: the master assigns map tasks over the input splits and reduce tasks to workers; map workers read their splits and write intermediate results to local disk, reduce workers read those results remotely and write the final output files (output file 0, output file 1) to HDFS.]
[Figure: a dataset of rows divided into Partition 1 through Partition 4 and distributed across the cluster.]
[Figure: Apache Arrow provides a common in-memory columnar format shared by Pandas, Spark, Drill, Impala, Parquet, HBase, Cassandra, and Kudu.]
[Figure: the Apache Spark stack: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph) on top of the Spark core.]
[Figure: Spark architecture: the driver program's application and SparkSession connect through a cluster manager to worker nodes; each worker node runs an executor with a cache of data partitions, tasks, and local disk.]
A DataFrame is a distributed collection of org.apache.spark.sql.Row objects; a DataFrame is like a partitioned table.
[Figure: DataFrame partitions of rows and columns cached in executors across worker nodes.]
The application is launched with ./bin/spark-submit. The main entry point creates a SparkSession:

object Taxi {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("Taxi")
      .master("local[*]")
      .getOrCreate()
    // ...
  }
}
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val schema =
StructType(Array(
StructField("vendor_id", DoubleType),
StructField("passenger_count", DoubleType),
StructField("trip_distance", DoubleType),
StructField("pickup_longitude", DoubleType),
StructField("pickup_latitude", DoubleType),
StructField("rate_code", DoubleType),
StructField("store_and_fwd", DoubleType),
StructField("dropoff_longitude", DoubleType),
StructField("dropoff_latitude", DoubleType),
StructField("fare_amount", DoubleType),
StructField("hour", DoubleType),
StructField("year", IntegerType),
StructField("month", IntegerType),
StructField("day", DoubleType),
StructField("day_of_week", DoubleType),
StructField("is_weekend", DoubleType)
))
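A DataFrame can then be loaded from the raw CSV data using this schema. A minimal sketch, assuming a hypothetical file path and a header row:

val df = spark.read
  .option("header", "true")       // assumption: the file has a header row
  .schema(schema)
  .csv("/data/taxi/train.csv")    // path is an assumption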
result:
df: org.apache.spark.sql.DataFrame = [vendor_id: double, passenger_count:
double ... 14 more fields]
The take action returns an Array of org.apache.spark.sql.Row:
df.take(1)
result:
Array[org.apache.spark.sql.Row] = Array([4.52563162E8,5.0,2.72,-
73.948132,40.829826999999995,-6.77418915E8,-1.0,-
73.969648,40.797472000000006,11.5,10.0,2012,11,13.0,6.0,1.0])
Transformations create a new DataFrame from the current one. Actions return values to the driver program or write data out to disk.
[Figure: the driver program (application, SparkSession) sends work through the cluster manager to executors on worker nodes, which hold cached partitions and run tasks.]
Example transformations: select, join, groupBy.
The groupBy transformation:
df.groupBy("hour").count().show(4)
result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 49|
| 2.0| 658|
| 3.0| 742|
+----+-----+
Example actions: show(n), take(n), count.
Narrow transformations, such as filter(), can be computed within a single partition; no data needs to move between partitions.
[Figure: a narrow transformation maps each input partition to a single output partition.]
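The result below appears to come from a narrow select and filter on these columns; a minimal sketch (the exact call is an assumption):

df.select("hour", "fare_amount").filter("hour = 0.0").show(2)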
result:
+----+-----------+
|hour|fare_amount|
+----+-----------+
| 0.0| 10.5|
| 0.0| 12.5|
+----+-----------+
Wide transformations, such as groupBy, agg, sortBy, and orderBy, require data with the same key to be shuffled into the same partition.
[Figure: a wide transformation shuffles rows so that all values for a key land in the same output partition.]
df.groupBy("hour").count().show(4)
result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 49|
| 2.0| 658|
| 3.0| 742|
+----+-----+
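df3 is not defined in the text above; from the physical plan below it appears to be built roughly as follows (a hedged reconstruction):

val df3 = df.filter("day_of_week = 6.0").groupBy("hour").count()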
df3.explain
result:
== Physical Plan ==
* HashAggregate (6)
+- Exchange (5)
+- * HashAggregate (4)
+- * Project (3)
+- * Filter (2)
+- Scan csv (1)
(1) Scan csv
Output [2]: [hour, day_of_week]
(2) Filter [codegen id : 1]
Input [2]: [hour, day_of_week]
Condition : (isnotnull(day_of_week) AND (day_of_week = 6.0))
(3) Project [codegen id : 1]
Output [1]: [hour]
Input [2]: [hour, day_of_week]
(4) HashAggregate [codegen id : 1]
Input [1]: [hour]
Functions [1]: [partial_count(1) AS count]
Aggregate Attributes [1]: [count]
Results [2]: [hour, count]
(5) Exchange
Input [2]: [hour, count]
Arguments: hashpartitioning(hour, 200), true, [id=]
(6) HashAggregate [codegen id : 2]
Input [2]: [hour, count]
Keys [1]: [hour]
Functions [1]: [finalmerge_count(merge count) AS count(1)]
Aggregate Attributes [1]: [count(1)]
Results [2]: [hour, count(1) AS count]
[Figure: the query as a DAG of physical operators: file scan, filter, project, hash aggregate, exchange, hash aggregate.]
[Figure: query execution planning: the planner generates physical plans, a cost model selects one, and the selected physical plan is executed as RDDs.]
[Figure: the Exchange (shuffle) splits the job into Stage 1 and Stage 2; each stage is a set of tasks, one task per partition.]
The task set for each stage is sent to the task scheduler, which sends tasks to the executors to run.
[Figure: Stage 1 and Stage 2 task sets dispatched to executors on worker nodes, each task operating on a cached partition of data.]
[Figure: DataFrames are loaded from and saved to external data sources.]
df.groupBy("hour").avg("fare_amount")
.orderBy("hour").show(5)
result:
+----+------------------+
|hour| avg(fare_amount)|
+----+------------------+
| 0.0|11.083333333333334|
| 1.0|22.581632653061224|
| 2.0|11.370820668693009|
| 3.0|13.873989218328841|
| 4.0| 14.57204433497537|
+----+------------------+
%sql
select trip_distance,avg(trip_distance), avg(fare_amount)
from taxi
group by trip_distance order by avg(trip_distance) desc
%sql
select hour, avg(fare_amount), avg(trip_distance)
from taxi
group by hour order by hour
%sql
select rate_code, avg(fare_amount), avg(trip_distance)
from taxi
group by rate_code order by rate_code
%sql
select day_of_week, avg(fare_amount), avg(trip_distance)
from taxi
group by day_of_week order by day_of_week
df.write.format("parquet")
.partitionBy("year")
.option("path", "/data")
.saveAsTable("taxi")
path/to/table/
    year=2019/
        part01.parquet
        part02.parquet
    year=2018/
        part01.parquet
        ...
df.filter("year = '2019'")
  .groupBy("year").avg("fare_amount")
df.write.format("parquet")
.partitionBy("year")
.bucketBy(4,"hour")
.option("path", "/data")
.saveAsTable("taxi")
df.filter("year = '2019'")
  .groupBy("hour")
  .avg("fare_amount")
GPU Scheduling
[Figure: GPU scheduling flow: the application is submitted; the Spark driver requests executor containers with GPUs; the cluster manager launches executors with GPU(s); each executor registers its GPUs and the driver passes GPU addresses to tasks as it assigns them; the task code runs with its assigned GPU and can launch TensorFlow or another AI algorithm on it.]
Operators such as scan, filter, and join are among those the plugin can run on the GPU.
--conf [conf key]=[conf value]
${SPARK_HOME}/bin/spark-submit \
  --jars 'rapids-4-spark_2.12-0.5.0.jar,cudf-0.19.2-cuda10-1.jar' \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.incompatibleOps.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh
spark.task.resource.gpu.amount can be a fraction of a GPU, allowing multiple tasks to share one GPU; the number of tasks that run concurrently on an executor is still governed by spark.executor.cores and spark.task.cpus (which defaults to 1). For example, with spark.executor.cores=2 and spark.task.resource.gpu.amount=0.5, two tasks run in parallel on the executor and share its GPU; with spark.executor.cores=6 and spark.task.cpus=1, up to six tasks run per executor.
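For example, a submission along these lines (a sketch, not taken from the text) would run two concurrent tasks per executor that share a single GPU:

${SPARK_HOME}/bin/spark-submit \
  --conf spark.executor.cores=2 \
  --conf spark.task.cpus=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.5 \
  ...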
The size of the input partitions created when reading files is controlled by spark.sql.files.maxPartitionBytes and spark.hadoop.mapreduce.input.fileinputformat.split.minsize.
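The output and GPU physical plan below appear to come from the same kind of select and filter shown earlier, followed by show and explain; a minimal sketch (an assumption):

val df2 = df.select("hour", "fare_amount").filter("hour = 0.0")
df2.show(2)
df2.explain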
result:
+----+-----------+
|hour|fare_amount|
+----+-----------+
| 0.0| 10.5|
| 0.0| 12.5|
+----+-----------+
result:
== Physical Plan ==
*(1) GpuColumnarToRow false
+- !GpuProject [hour#10, fare_amount#9]
+- GpuCoalesceBatches TargetSize(1000000,2147483647)
+- !GpuFilter (gpuisnotnull(hour#10) AND (hour#10 = 0.0))
+- GpuBatchScan[fare_amount#9, hour#10] GpuCSVScan Location:
InMemoryFileIndex[s3a://spark-taxi-dataset/raw-small/train], ReadSchema:
struct<fare_amount:double,hour:double>
[Figure: supervised learning builds a model F(X1, X2) = Y from data with features X1, X2 and a label Y, then uses the model to predict Y for new data with features X1, X2.]
Linear regression models the label as Y = intercept + (coefficient * X) + error, i.e. Y = a + bX with intercept a and coefficient (slope) b.
[Figure: linear regression builds a model Y = a + bX from data pairing a feature X (size) with a label Y, then uses the model to predict Y for new values of X.]
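To make the formula concrete, here is a minimal sketch of fitting Y = a + bX with Spark ML; the toy data and values are illustrative assumptions, not from the text:

import spark.implicits._
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Toy data where y is roughly 2 + 3*x (values made up for illustration).
val toy = Seq((1.0, 5.1), (2.0, 7.9), (3.0, 11.2), (4.0, 13.8)).toDF("x", "y")

// Assemble the single feature column into a vector, as Spark ML expects.
val assembled = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(toy)

// Fit the linear model; intercept is a and coefficients(0) is the slope b.
val lrModel = new LinearRegression()
  .setLabelCol("y")
  .setFeaturesCol("features")
  .fit(assembled)

println(s"intercept a = ${lrModel.intercept}, slope b = ${lrModel.coefficients(0)}")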
import org.apache.spark._
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.regression._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.tuning._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.ml.Pipeline
df.show
result:
+---------+--------+------+----------+----------+----------+--------+---------+---------+
|longitude|latitude|medage|totalrooms|totalbdrms|population|houshlds|medincome|medhvalue|
+---------+--------+------+----------+----------+----------+--------+---------+---------+
| -122.23| 37.88| 41.0| 880.0| 129.0| 322.0| 126.0| 8.3252| 452600.0|
| -122.22| 37.86| 21.0| 7099.0| 1106.0| 2401.0| 1138.0| 8.3014| 358500.0|
| -122.24| 37.85| 52.0| 1467.0| 190.0| 496.0| 177.0| 7.2574| 352100.0|
+---------+--------+------+----------+----------+----------+--------+---------+---------+
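The ratio columns used below (roomsPhouse, popPhouse, bedrmsPRoom) are not in the raw table; they are presumably derived along these lines (the exact formulas are an assumption; the notebook-style redefinition of df is intentional):

val df = df
  .withColumn("roomsPhouse", col("totalrooms") / col("houshlds"))   // rooms per household
  .withColumn("popPhouse", col("population") / col("houshlds"))     // population per household
  .withColumn("bedrmsPRoom", col("totalbdrms") / col("totalrooms")) // bedrooms per room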
df.cache
df.createOrReplaceTempView("house")
spark.catalog.cacheTable("house")
df.describe("medincome","medhvalue","roomsPhouse","popPhouse").show
result:
+-------+------------------+------------------+------------------+------------------+
|summary| medincome| medhvalue| roomsPhouse| popPhouse|
+-------+------------------+------------------+------------------+------------------+
| count| 20640| 20640| 20640| 20640|
| mean|3.8706710030346416|206855.81690891474| 5.428999742190365| 3.070655159436382|
| stddev|1.8998217183639696|115395.61587441359|2.4741731394243205| 10.38604956221361|
| min| 0.4999| 14999.0|0.8461538461538461|0.6923076923076923|
| max| 15.0001| 500001.0| 141.9090909090909|1243.3333333333333|
+-------+------------------+------------------+------------------+------------------+
df.select(corr("medhvalue","medincome")).show()
+--------------------------+
|corr(medhvalue, medincome)|
+--------------------------+
| 0.688075207464692|
+--------------------------+
val Array(trainingData, testData) = df.randomSplit(Array(0.8, 0.2), 1234)
[Figure: transformers such as a VectorAssembler and a scaler turn the input DataFrame into a DataFrame with label and features columns.]
val rf = new RandomForestRegressor()
  .setLabelCol("medhvalue")
  .setFeaturesCol("features")
[Figure: a Pipeline chains the transformers and the estimator.]
val paramGrid = new ParamGridBuilder()
.addGrid(rf.maxBins, Array(100, 200))
.addGrid(rf.maxDepth, Array(2, 7, 10))
.addGrid(rf.numTrees, Array(5, 20))
.build()
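The assembler, pipeline, and cross-validator referenced below are not shown above; a minimal sketch of how they are typically wired together (the feature column list is inferred from the feature-importance output, so treat the names and fold count as assumptions):

// Assemble the feature columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("medincome", "roomsPhouse", "popPhouse",
    "bedrmsPRoom", "medage", "longitude", "latitude"))
  .setOutputCol("features")

// Chain the transformer and the estimator into a Pipeline.
val pipeline = new Pipeline().setStages(Array(assembler, rf))

// Score candidate models by RMSE against the label column.
val evaluator = new RegressionEvaluator()
  .setLabelCol("medhvalue")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

// Cross-validate over the parameter grid defined above.
val crossvalidator = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val pipelineModel = crossvalidator.fit(trainingData)

// Pull the trained random forest out of the best pipeline to inspect it.
val featureImportances = pipelineModel.bestModel
  .asInstanceOf[PipelineModel]
  .stages.last
  .asInstanceOf[RandomForestRegressionModel]
  .featureImportances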
assembler.getInputCols
.zip(featureImportances.toArray)
.sortBy(-_._2)
.foreach { case (feat, imp) =>
println(s"feature: $feat, importance: $imp") }
result:
feature: medincome, importance: 0.4531355014139285
feature: popPhouse, importance: 0.12807843645878508
feature: longitude, importance: 0.10501162983981065
feature: latitude, importance: 0.1044621179898163
feature: bedrmsPRoom, importance: 0.09720295935509805
feature: roomsPhouse, importance: 0.058427239343697555
feature: medage, importance: 0.05368211559886386
result:
rfr_maxBins: 50,
rfr_maxDepth: 2,
rfr_numTrees: 5
[Figure: the fitted Pipeline of transformers and the trained estimator.]
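The predictions below come from applying the fitted model to the held-out test set; a one-line sketch (an assumption):

val predictions = pipelineModel.transform(testData)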
result:
+------------------+---------+
| prediction|medhvalue|
+------------------+---------+
|104349.59677450571| 94600.0|
| 77530.43231856065| 85800.0|
|111369.71756877871| 90100.0|
| 97351.87386020401| 82800.0|
+------------------+---------+
val predictions = predictions.withColumn("error",
  col("prediction") - col("medhvalue"))
result:
+------------------+---------+-------------------+
| prediction|medhvalue| error|
+------------------+---------+-------------------+
| 104349.5967745057| 94600.0| 9749.596774505713|
| 77530.4323185606| 85800.0| -8269.567681439352|
| 101253.3225967887| 103600.0| -2346.677403211302|
+------------------+---------+-------------------+
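The RMSE below is typically computed with a RegressionEvaluator on the predictions DataFrame; a minimal sketch (an assumption):

val rmseEvaluator = new RegressionEvaluator()
  .setLabelCol("medhvalue")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = rmseEvaluator.evaluate(predictions)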
result:
rmse: Double = 52724.70
[Figure: the pipeline model transforms a DataFrame into a DataFrame with predictions, which the evaluator scores.]
pipelineModel.write.overwrite().save(modeldir)
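The saved model can later be reloaded and reused for scoring; a minimal sketch, assuming pipelineModel above is the fitted CrossValidatorModel:

val sameModel = CrossValidatorModel.load(modeldir)
val scored = sameModel.transform(testData)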
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.evaluation._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressor, XGBoostRegressionModel}
Show the first rows with show(5):
tdf.select("trip_distance", "rate_code", "fare_amount").show(5)
result:
+------------------+-------------+-----------+
| trip_distance| rate_code|fare_amount|
+------------------+-------------+-----------+
| 2.72|-6.77418915E8| 11.5|
| 0.94|-6.77418915E8| 5.5|
| 3.63|-6.77418915E8| 13.0|
| 11.86|-6.77418915E8| 33.5|
| 3.03|-6.77418915E8| 11.0|
+------------------+-------------+-----------+
The describe function returns summary statistics:
tdf.select("trip_distance", "rate_code","fare_amount").describe().show
+-------+------------------+--------------------+------------------+
|summary| trip_distance| rate_code| fare_amount|
+-------+------------------+--------------------+------------------+
| count| 7999| 7999| 7999|
| mean| 3.278923615451919|-6.569284350812602E8|12.348543567945994|
| stddev|3.6320775770793547|1.6677419425906155E8|10.221929466939088|
| min| 0.0| -6.77418915E8| 2.5|
| max|35.970000000000006| 1.957796822E9| 107.5|
+-------+------------------+--------------------+------------------+
%sql
select trip_distance, fare_amount
from taxi
num_workers sets the number of Spark tasks used to train the model in parallel, and tree_method selects the tree construction algorithm (for example hist on CPU or gpu_hist on GPU).
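A minimal sketch of configuring an XGBoostRegressor with these parameters (the parameter values, label column, and features column are assumptions):

val xgbParams = Map(
  "objective" -> "reg:squarederror",
  "num_round" -> 100,
  "num_workers" -> 2,           // number of parallel Spark tasks used for training
  "tree_method" -> "gpu_hist"   // GPU histogram algorithm; use "hist" on CPU
)

val xgbRegressor = new XGBoostRegressor(xgbParams)
  .setLabelCol("fare_amount")
  .setFeaturesCol("features")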
object Benchmark {
def time[R](phase: String)(block: => R): (R, Float) = {
val t0 = System.currentTimeMillis
val result = block // call-by-name
val t1 = System.currentTimeMillis
println("Elapsed time [" + phase + "]: " +
((t1 - t0).toFloat / 1000) + "s")
(result, (t1 - t0).toFloat / 1000)
}
}
// use the estimator to fit (train) a model
val (model, _) = Benchmark.time("train") {
xgbRegressor.fit(trainSet)
}
[Figure: the fitted model transforms a DataFrame with label and features into a DataFrame with label, features, and predictions; an evaluator then scores the predictions against the labels.]