Spark SQL allows users to query structured data using SQL. It includes components for parsing SQL, optimizing queries, and executing them to produce a DataFrame. The parser converts SQL into a logical plan, which is optimized and converted into physical execution plans involving operations like scans, filters, and aggregations on Resilient Distributed Datasets (RDDs). This allows Spark SQL to leverage Spark's distributed processing capabilities to perform queries efficiently in a parallel manner.


ACCELERATING

APACHE
SPARK 3
Leveraging NVIDIA GPUs
to Power the Next Era
of Analytics and AI

Carol McDonald with contributions from NVIDIA


[Figure: MapReduce execution — (1) the user program forks a master and worker processes; (2) the master assigns map and reduce tasks to workers; (3) map workers read input splits and (4) write intermediate files to local disk; (5) reduce workers read those files remotely and (6) write the output files.]

[Figure: An iterative algorithm as a chain of MapReduce jobs — each job runs its maps and reduces and writes its output to HDFS as a SequenceFile, which becomes the input to the next job, until the last job writes the final output.]
[Figure: A data file split into partitions (Partition 1–4) distributed across the cluster.]

In Spark, data is:
• Read into a memory cache
• Partitioned across a cluster
• Operated on in parallel
• Cached in memory for iterative use

[Figure: Partitions P1–P4 cached in executor memory on different nodes.]
[Figure: Apache Arrow provides a standard in-memory columnar data format shared by Pandas, Drill, Spark, Impala, Parquet, HBase, Cassandra, Kudu, and other systems.]
[Figure: The Apache Spark stack — Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) built on the Spark core engine.]
[Figure: Spark architecture — the driver program's application creates a SparkSession, which connects through a cluster manager to worker nodes; each worker runs an executor that caches partitions in memory and runs tasks against data on disk.]

A DataFrame is like a partitioned table: a distributed collection of rows (org.apache.spark.sql.Row) organized into named columns and split into partitions that are cached in executor memory across worker nodes.

[Figure: DataFrame partitions, each holding rows and columns, cached in executors on worker nodes.]
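As a minimal sketch (using an illustrative range DataFrame rather than the taxi data used later), the partitions backing a DataFrame and its Row records can be inspected directly:

// hypothetical example: a small DataFrame of ids, just to look at its partitions and rows
val ids = spark.range(0, 1000000).toDF("id")
println(ids.rdd.getNumPartitions)   // how many partitions back this DataFrame
ids.take(1)                         // Array[org.apache.spark.sql.Row]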

$ /[installation path]/bin/spark-shell --master local[2]

val spark = SparkSession.builder.appName("Simple Application").master("local[2]").getOrCreate()
A standalone application can be launched with ./bin/spark-submit:

object Taxi {
  def main(args: Array[String]) {
    val spark: SparkSession = SparkSession.builder()
      .appName("Taxi").master("local[*]").getOrCreate()
    val df = spark.read.option("inferSchema", "false")
      .option("header", "true").schema(schema).csv(file)
    df.groupBy("hour").count().show()
  }
}

import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val schema =
StructType(Array(
StructField("vendor_id", DoubleType),
StructField("passenger_count", DoubleType),
StructField("trip_distance", DoubleType),
StructField("pickup_longitude", DoubleType),
StructField("pickup_latitude", DoubleType),
StructField("rate_code", DoubleType),
StructField("store_and_fwd", DoubleType),
StructField("dropoff_longitude", DoubleType),
StructField("dropoff_latitude", DoubleType),
StructField("fare_amount", DoubleType),
StructField("hour", DoubleType),
StructField("year", IntegerType),
StructField("month", IntegerType),
StructField("day", DoubleType),
StructField("day_of_week", DoubleType),
StructField("is_weekend", DoubleType)
))

val file = "/data/taxi_small.csv"

val df = spark.read.option("inferSchema", "false")
  .option("header", true).schema(schema).csv(file)

result:
df: org.apache.spark.sql.DataFrame = [vendor_id: double, passenger_count:
double ... 14 more fields]

DataFrame rows have the type org.apache.spark.sql.Row:

df.take(1)
result:
Array[org.apache.spark.sql.Row] = Array([4.52563162E8,5.0,2.72,-
73.948132,40.829826999999995,-6.77418915E8,-1.0,-
73.969648,40.797472000000006,11.5,10.0,2012,11,13.0,6.0,1.0])
Transformations create a new DataFrame from the current one. Actions return values to the driver or write data to disk.

[Figure: The driver program's SparkSession sends work through the cluster manager to executors on worker nodes, where tasks run against cached partitions.]
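As a minimal sketch (using the taxi DataFrame df loaded above), the lazy behavior can be observed directly:

// transformations are lazy: this returns a new DataFrame but runs no job yet
val nightTrips = df.filter($"hour" === 0.0)
// actions trigger execution: count() runs a job and returns a value to the driver
nightTrips.count()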

Example transformations include:
• select
• join
• groupBy

The groupBy transformation:
df.groupBy("hour").count().show(4)

result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 49|
| 2.0| 658|
| 3.0| 742|
+----+-----+

Example actions include:
• show(n)
• take(n)
• count

Narrow transformations, such as filter(), can be computed within a single partition; no data needs to move between partitions.

[Figure: A narrow transformation — each output partition is computed from a single input partition.]
// select and filter are narrow transformations


df.select($"hour", $"fare_amount").filter($"hour" === "0.0" ).show(2)

result:
+----+-----------+
|hour|fare_amount|
+----+-----------+
| 0.0| 10.5|
| 0.0| 12.5|
+----+-----------+
Wide transformations, such as groupBy, agg, sortBy, and orderBy, require data to be shuffled between partitions so that rows with the same key end up in the same partition.

[Figure: A wide transformation — values from all input partitions are shuffled so that equal keys are grouped into the same output partition.]

df.groupBy("hour").count().show(4)

result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 49|
| 2.0| 658|
| 3.0| 742|
+----+-----+



[Figure: Catalyst query optimization — a SQL query, Dataset, or DataFrame is turned into an unresolved logical plan by the Parser, resolved against the catalog by the Analyzer, rewritten into an optimized logical plan by the Optimizer (using cache metadata), expanded into candidate physical plans by the Planner, ranked by a cost model, and the selected physical plan is executed as operations on RDDs.]
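As a minimal sketch, explain(true) prints the parsed, analyzed, and optimized logical plans along with the physical plan that Catalyst produces (df is the taxi DataFrame loaded in the previous chapter):

df.filter($"day_of_week" === 6.0).groupBy("hour").count().explain(true)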
[Figure: The physical plan for df2 below — a CSV scan that reads only the hour, fare_amount, and day_of_week columns, with the filter on day_of_week pushed down to the scan.]

val df = spark.read.option("inferSchema", "false")
  .option("header", true).schema(schema).csv(file)
val df2 = df.select($"hour", $"fare_amount",
$"day_of_week").filter($"day_of_week" === "6.0" )
df2.show(3)
result:
+----+-----------+-----------+
|hour|fare_amount|day_of_week|
+----+-----------+-----------+
|10.0| 11.5| 6.0|
|10.0| 5.5| 6.0|
|10.0| 13.0| 6.0|
+----+-----------+-----------+
df2.explain("formatted")
result:
== Physical Plan ==
* Project (3)
+- * Filter (2)
+- Scan csv (1)

(1) Scan csv


Location: [dbfs:/FileStore/tables/taxi_tsmall.csv]
Output [3]: [fare_amount#143, hour#144, day_of_week#148]
PushedFilters: [IsNotNull(day_of_week), EqualTo(day_of_week,6.0)]

(2) Filter [codegen id : 1]


Input [3]: [fare_amount#143, hour#144, day_of_week#148]
Condition : (isnotnull(day_of_week#148) AND (day_of_week#148 = 6.0))

(3) Project [codegen id : 1]


Output [3]: [hour#144, fare_amount#143, day_of_week#148]
Input [3]: [fare_amount#143, hour#144, day_of_week#148]
[Figure: The df2 plan — READ (scan) and FILTER.]

When we add a groupBy to create df3, the physical plan becomes Scan → Filter → Project → HashAggregate → Exchange → HashAggregate; the Exchange is the shuffle caused by the groupBy.

val df3 = df2.groupBy("hour").count


df3.orderBy(asc("hour")).show(5)
result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 47|
| 2.0| 658|
| 3.0| 742|
| 4.0| 812|
+----+-----+

df3.explain
result:
== Physical Plan ==
* HashAggregate (6)
+- Exchange (5)
+- * HashAggregate (4)
+- * Project (3)
+- * Filter (2)
+- Scan csv (1)
(1) Scan csv
Output [2]: [hour, day_of_week]
(2) Filter [codegen id : 1]
Input [2]: [hour, day_of_week]
Condition : (isnotnull(day_of_week) AND (day_of_week = 6.0))
(3) Project [codegen id : 1]
Output [1]: [hour]
Input [2]: [hour, day_of_week]
(4) HashAggregate [codegen id : 1]
Input [1]: [hour]
Functions [1]: [partial_count(1) AS count]
Aggregate Attributes [1]: [count]
Results [2]: [hour, count]
(5) Exchange
Input [2]: [hour, count]
Arguments: hashpartitioning(hour, 200), true, [id=]
(6) HashAggregate [codegen id : 2]
Input [2]: [hour, count]
Keys [1]: [hour]
Functions [1]: [finalmerge_count(merge count) AS count(1)]
Aggregate Attributes [1]: [count(1)]
Results [2]: [hour, count(1) AS count]

[Figure: The same query as physical operators — FILE SCAN, FILTER, PROJECT, HASH AGGREGATE, EXCHANGE (the shuffle that groups rows by key), and a final HASH AGGREGATE.]
[Figure: The Planner generates candidate physical plans, a cost model selects one, and the selected physical plan is executed as RDD operations.]

[Figure: The physical plan is split into stages at the shuffle boundary — Stage 1 (read, filter, project, partial aggregate) and Stage 2 (final aggregate after the exchange).]

[Figure: Each stage is split into tasks, one per partition, and the tasks for Stage 1 and Stage 2 form task sets.]
The task set is sent to the task scheduler, which sends the tasks to executors to run.

[Figure: Stage 1 tasks dispatched to executors on worker nodes, each task processing one cached partition of data.]
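As a minimal sketch, the lineage of the underlying RDD shows where the shuffle boundaries (and therefore the stage splits) occur (df3 is the grouped DataFrame from above):

println(df3.rdd.toDebugString)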
[Figure: Load Data — data sources are read into DataFrames.]

// load the data as in Chapter 1


val file = "/data/taxi_small.csv"

val df = spark.read.option("inferSchema", "false")
  .option("header", true).schema(schema).csv(file)

// cache DataFrame in columnar format in memory


df.cache

// create Table view of DataFrame for Spark SQL


df.createOrReplaceTempView("taxi")

// cache taxi table in columnar format in memory


spark.catalog.cacheTable("taxi")
%sql
select hour, avg(fare_amount)
from taxi
group by hour order by hour

The equivalent DataFrame query:

df.groupBy("hour").avg("fare_amount")
.orderBy("hour").show(5)

result:
+----+------------------+
|hour| avg(fare_amount)|
+----+------------------+
| 0.0|11.083333333333334|
| 1.0|22.581632653061224|
| 2.0|11.370820668693009|
| 3.0|13.873989218328841|
| 4.0| 14.57204433497537|
+----+------------------+
%sql
select trip_distance,avg(trip_distance), avg(fare_amount)
from taxi
group by trip_distance order by avg(trip_distance) desc

%sql
select hour, avg(fare_amount), avg(trip_distance)
from taxi
group by hour order by hour

%sql
select rate_code, avg(fare_amount), avg(trip_distance)
from taxi
group by rate_code order by rate_code
%sql
select day_of_week, avg(fare_amount), avg(trip_distance)
from taxi
group by day_of_week order by day_of_week








df.write.format("parquet")
.partitionBy("year")
.option("path", "/data ")
.saveAsTable("taxi")

path/to/table/
  year=2019/
    part01.parquet
    part02.parquet
  year=2018/
    part01.parquet
    ...

df.filter("year = '2019'")
  .groupBy("year").avg("fare_amount")
df.write.format("parquet")
.partitionBy("year")
.bucketBy(4,"hour")
.option("path", "/data ")
.saveAsTable("taxi")

df.filter("year = '2019'")
  .groupBy("hour")
  .avg("hour")
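As a minimal sketch (assuming the partitioned taxi table written above), filtering on the partition column lets Spark prune directories so only the year=2019 files are read; the scan in the plan should show a partition filter on year:

val pruned = spark.table("taxi").filter("year = 2019")
pruned.explain()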
GPU Scheduling

[Figure: GPU scheduling in a Spark application — the driver submits the application and requests executor containers with GPUs from the cluster manager (YARN, Kubernetes, etc.); the launched executors register their GPU(s); the driver assigns tasks and passes GPU addresses to them; the task code runs on the assigned GPU, for example launching TensorFlow or another AI algorithm.]
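As a minimal sketch (assuming an application launched with the GPU resource configurations described below, and reusing the taxi DataFrame df), a task can read the GPU address(es) the scheduler assigned to it from its TaskContext and hand them to an AI library:

import org.apache.spark.TaskContext

df.rdd.foreachPartition { _ =>
  // resources() maps resource names to the addresses assigned to this task
  val gpus = TaskContext.get().resources()("gpu").addresses
  println(s"task assigned GPU(s): ${gpus.mkString(",")}")
}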



select fs.airport, fs.total_sales
from flight_sales fs, flight_airports fa
where fs.airport = fa.airport and fa.region = 'NEUSA'

[Figure: The query plan — scans of flight_sales and flight_airports, a filter on region, and a join on airport.]








































--conf [conf key]=[conf value]
${SPARK_HOME}/bin/spark-shell --jars 'rapids-4-spark_2.12-0.5.0.jar,cudf-0.19.2-cuda10-1.jar' \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.incompatibleOps.enabled=true

spark.conf.set("[conf key]", [conf value])


scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)

--conf spark.executor.resource.gpu.amount=1

--conf spark.task.resource.gpu.amount=1

--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh
spark.task.resource.gpu.amount can be a decimal so that multiple tasks share one GPU, and it should be set together with spark.executor.cores. For example, with spark.executor.cores=2 and spark.task.resource.gpu.amount=0.5, the two concurrent tasks share the executor's single GPU. With spark.executor.cores=6 (and the default spark.task.cpus=1), up to six tasks can run concurrently, so the GPU amount per task must be chosen so that all concurrent tasks fit on the GPUs assigned to the executor.
The size of the input partitions (and therefore the number of tasks reading a file) is controlled by:
• spark.sql.files.maxPartitionBytes
• spark.hadoop.mapreduce.input.fileinputformat.split.minsize
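As a minimal sketch (reusing the taxi schema and file from Chapter 1; the 512 MB value is only illustrative), spark.sql.files.maxPartitionBytes is a SQL configuration and can also be adjusted at runtime before reading a file:

spark.conf.set("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)
// fewer, larger input partitions for the same file
val bigPartitionsDf = spark.read.option("header", "true").schema(schema).csv(file)
println(bigPartitionsDf.rdd.getNumPartitions)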

// select and filter are narrow transformations


df.select($"hour", $"fare_amount").filter($"hour" === "0.0" ).show(2)

result:
+----+-----------+
|hour|fare_amount|
+----+-----------+
| 0.0| 10.5|
| 0.0| 12.5|
+----+-----------+

df.select($"hour", $"fare_amount").filter($"hour" === "0.0" ).explain

result:
== Physical Plan ==
*(1) GpuColumnarToRow false
+- !GpuProject [hour#10, fare_amount#9]
+- GpuCoalesceBatches TargetSize(1000000,2147483647)
+- !GpuFilter (gpuisnotnull(hour#10) AND (hour#10 = 0.0))
+- GpuBatchScan[fare_amount#9, hour#10] GpuCSVScan Location:
InMemoryFileIndex[s3a://spark-taxi-dataset/raw-small/train], ReadSchema:
struct<fare_amount:double,hour:double>

val df3 = df2.groupBy("month").count
df3.orderBy(asc("month")).show(5)
[Figure: Supervised learning — a model F(X1, X2) = Y is built from data with features X1, X2, and the model is then used to predict Y for new data.]

Linear regression models the relationship between a label Y and a feature X as:

Y = intercept + (coefficient * X) + error

[Figure: Linear regression example — a model Y = a + bX is built from data with feature X (size), then used to predict Y for new data; a is the intercept and b is the coefficient (the slope of the line).]
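As a minimal sketch (with a hypothetical DataFrame named training that already has "features" and "label" columns), Spark ML's LinearRegression exposes exactly these fitted parameters:

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("label")
val lrModel = lr.fit(training)   // training is a hypothetical prepared DataFrame
// the fitted intercept and coefficients of Y = intercept + coefficients . X
println(s"intercept = ${lrModel.intercept}, coefficients = ${lrModel.coefficients}")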




[Figure: A random forest — the training data is sampled into subsets, a decision tree is built on each subset, and the trees' predictions are combined.]





















[Figure: Load Data → DataFrame.]

import org.apache.spark._
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.regression._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.tuning._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.ml.Pipeline

val schema = StructType(Array(
StructField("longitude", FloatType, true),
StructField("latitude", FloatType, true),
StructField("medage", FloatType, true),
StructField("totalrooms", FloatType, true),
StructField("totalbdrms", FloatType, true),
StructField("population", FloatType, true),
StructField("houshlds", FloatType, true),
StructField("medincome", FloatType, true),
StructField("medhvalue", FloatType, true)
))

var file ="/path/cal_housing.csv"

var df = spark.read.format("csv").option("inferSchema", "false").schema(schema).load(file)

df.show
result:
+---------+--------+------+----------+----------+----------+--------+---------+---------+
|longitude|latitude|medage|totalrooms|totalbdrms|population|houshlds|medincome|medhvalue|
+---------+--------+------+----------+----------+----------+--------+---------+---------+
| -122.23| 37.88| 41.0| 880.0| 129.0| 322.0| 126.0| 8.3252| 452600.0|
| -122.22| 37.86| 21.0| 7099.0| 1106.0| 2401.0| 1138.0| 8.3014| 358500.0|
| -122.24| 37.85| 52.0| 1467.0| 190.0| 496.0| 177.0| 7.2574| 352100.0|
+---------+--------+------+----------+----------+----------+--------+---------+---------+

// create ratios for features


df = df.withColumn("roomsPhouse", col("totalrooms")/col("houshlds"))
df = df.withColumn("popPhouse", col("population")/col("houshlds"))
df = df.withColumn("bedrmsPRoom", col("totalbdrms")/col("totalrooms"))

df=df.drop("totalrooms","houshlds", "population" , "totalbdrms")

df.cache
df.createOrReplaceTempView("house")
spark.catalog.cacheTable("house")
df.describe("medincome","medhvalue","roomsPhouse","popPhouse").show

result:
+-------+------------------+------------------+------------------+------------------+
|summary| medincome| medhvalue| roomsPhouse| popPhouse|
+-------+------------------+------------------+------------------+------------------+
| count| 20640| 20640| 20640| 20640|
| mean|3.8706710030346416|206855.81690891474| 5.428999742190365| 3.070655159436382|
| stddev|1.8998217183639696|115395.61587441359|2.4741731394243205| 10.38604956221361|
| min| 0.4999| 14999.0|0.8461538461538461|0.6923076923076923|
| max| 15.0001| 500001.0| 141.9090909090909|1243.3333333333333|
+-------+------------------+------------------+------------------+------------------+

df.select(corr("medhvalue","medincome")).show()

+--------------------------+
|corr(medhvalue, medincome)|
+--------------------------+
| 0.688075207464692|
+--------------------------+
val Array(trainingData, testData) = df.randomSplit(Array(0.8, 0.2), 1234)

val featureCols = Array("medage", "medincome", "roomsPhouse", "popPhouse",
  "bedrmsPRoom", "longitude", "latitude")

//put features into a feature vector column


val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("rawfeatures")
val scaler = new StandardScaler()
  .setInputCol("rawfeatures")
  .setOutputCol("features")
  .setWithStd(true)
  .setWithMean(true)

[Figure: Transformers — the DataFrame passes through the VectorAssembler and the Scaler to produce a DataFrame with the label and a features column.]
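As a minimal sketch, applying just the assembler stage shows the added "rawfeatures" vector column before scaling:

assembler.transform(trainingData).select("medhvalue", "rawfeatures").show(3, false)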

val rf = new
RandomForestRegressor().setLabelCol("medhvalue").setFeaturesCol("features")

val steps = Array(assembler, scaler, rf)

val pipeline = new Pipeline().setStages(steps)

[Figure: The Pipeline — transformer stages (VectorAssembler, Scaler) followed by an estimator stage (RandomForestRegressor).]
val paramGrid = new ParamGridBuilder()
.addGrid(rf.maxBins, Array(100, 200))
.addGrid(rf.maxDepth, Array(2, 7, 10))
.addGrid(rf.numTrees, Array(5, 20))
.build()

val evaluator = new RegressionEvaluator()


.setLabelCol("medhvalue")
.setPredictionCol("prediction")
.setMetricName("rmse")

val crossvalidator = new CrossValidator()


.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)

// fit the training data set and return a model


val pipelineModel = crossvalidator.fit(trainingData)

val featureImportances = pipelineModel


.bestModel.asInstanceOf[PipelineModel]
.stages(2)
.asInstanceOf[RandomForestRegressionModel]
.featureImportances

assembler.getInputCols
.zip(featureImportances.toArray)
.sortBy(-_._2)
.foreach { case (feat, imp) =>
println(s"feature: $feat, importance: $imp") }

result:
feature: medincome, importance: 0.4531355014139285
feature: popPhouse, importance: 0.12807843645878508
feature: longitude, importance: 0.10501162983981065
feature: latitude, importance: 0.1044621179898163
feature: bedrmsPRoom, importance: 0.09720295935509805
feature: roomsPhouse, importance: 0.058427239343697555
feature: medage, importance: 0.05368211559886386

val bestEstimatorParamMap = pipelineModel


.getEstimatorParamMaps
.zip(pipelineModel.avgMetrics)
.maxBy(_._2)
._1
println(s"Best params:\n$bestEstimatorParamMap")

result:
rfr_maxBins: 50,
rfr_maxDepth: 2,
rfr_numTrees: 5
[Figure: The fitted Pipeline (VectorAssembler, Scaler, RandomForest) is used on the test DataFrame — load the test data, extract features, predict with the model, and pass the predictions to an evaluator.]

// declared as a var so an error column can be added below
var predictions = pipelineModel.transform(testData)
predictions.select("prediction", "medhvalue").show(5)

result:
+------------------+---------+
| prediction|medhvalue|
+------------------+---------+
|104349.59677450571| 94600.0|
| 77530.43231856065| 85800.0|
|111369.71756877871| 90100.0|
| 97351.87386020401| 82800.0|
+------------------+---------+
predictions = predictions.withColumn("error", col("prediction") - col("medhvalue"))

predictions.select("prediction", "medhvalue", "error").show

result:
+------------------+---------+-------------------+
| prediction|medhvalue| error|
+------------------+---------+-------------------+
| 104349.5967745057| 94600.0| 9749.596774505713|
| 77530.4323185606| 85800.0| -8269.567681439352|
| 101253.3225967887| 103600.0| -2346.677403211302|
+------------------+---------+-------------------+

predictions.describe("prediction", "medhvalue", "error").show


result:
+-------+-----------------+------------------+------------------+
|summary| prediction| medhvalue| error|
+-------+-----------------+------------------+------------------+
| count| 4161| 4161| 4161|
| mean|206307.4865123929|205547.72650805095| 759.7600043416329|
| stddev|97133.45817381598|114708.03790345002| 52725.56329678355|
| min|56471.09903814694| 26900.0|-339450.5381565819|
| max|499238.1371374392| 500001.0|293793.71945819416|
+-------+-----------------+------------------+------------------+

val maevaluator = new RegressionEvaluator()


.setLabelCol("medhvalue")
.setMetricName("mae")

val mae = maevaluator.evaluate(predictions)


result:
mae: Double = 36636.35

val evaluator = new RegressionEvaluator()


.setLabelCol("medhvalue")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)

result:
rmse: Double = 52724.70
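As a minimal sketch, the RMSE can be cross-checked directly from the error column computed above (the root of the mean squared error):

predictions.selectExpr("sqrt(avg(error * error)) as rmse").show()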

[Figure: The pipeline model transforms the test DataFrame into a DataFrame with predictions, which the evaluator scores.]
pipelineModel.write.overwrite().save(modeldir)

val sameModel = CrossValidatorModel.load(modeldir)
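As a minimal sketch, the reloaded model can score new data just like the original (reusing the testData split from above):

val reloadedPredictions = sameModel.transform(testData)
reloadedPredictions.select("prediction", "medhvalue").show(3)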



 →
 →

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.evaluation._
import org.apache.spark.sql.types._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressor,
XGBoostRegressionModel}

import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}

lazy val labelName = "fare_amount"

lazy val schema =
StructType(Array(
StructField("vendor_id", DoubleType),
StructField("passenger_count", DoubleType),
StructField("trip_distance", DoubleType),
StructField("pickup_longitude", DoubleType),
StructField("pickup_latitude", DoubleType),
StructField("rate_code", DoubleType),
StructField("store_and_fwd", DoubleType),
StructField("dropoff_longitude", DoubleType),
StructField("dropoff_latitude", DoubleType),
StructField(labelName, DoubleType),
StructField("hour", DoubleType),
StructField("year", IntegerType),
StructField("month", IntegerType),
StructField("day", DoubleType),
StructField("day_of_week", DoubleType),
StructField("is_weekend", DoubleType)
))
val trainPath = "/FileStore/tables/taxi_tsmall.csv"
val evalPath = "/FileStore/tables/taxi_esmall.csv"
val spark = SparkSession.builder().appName("Taxi-GPU").getOrCreate

[Figure: Load Data → DataFrame.]

val tdf = spark.read.option("inferSchema", "false").option("header",
  true).schema(schema).csv(trainPath)
val edf = spark.read.option("inferSchema", "false").option("header",
  true).schema(schema).csv(evalPath)

Use show(5) to display the first rows:

tdf.select("trip_distance", "rate_code","fare_amount").show(5)
result:
+------------------+-------------+-----------+
| trip_distance| rate_code|fare_amount|
+------------------+-------------+-----------+
| 2.72|-6.77418915E8| 11.5|
| 0.94|-6.77418915E8| 5.5|
| 3.63|-6.77418915E8| 13.0|
| 11.86|-6.77418915E8| 33.5|
| 3.03|-6.77418915E8| 11.0|
+------------------+-------------+-----------+

The describe function returns summary statistics:

tdf.select("trip_distance", "rate_code","fare_amount").describe().show
+-------+------------------+--------------------+------------------+
|summary| trip_distance| rate_code| fare_amount|
+-------+------------------+--------------------+------------------+
| count| 7999| 7999| 7999|
| mean| 3.278923615451919|-6.569284350812602E8|12.348543567945994|
| stddev|3.6320775770793547|1.6677419425906155E8|10.221929466939088|
| min| 0.0| -6.77418915E8| 2.5|
| max|35.970000000000006| 1.957796822E9| 107.5|
+-------+------------------+--------------------+------------------+
%sql
select trip_distance, fare_amount
from taxi

[Figure: Load data → Transform — a VectorAssembler transforms the DataFrame into a DataFrame with a features column.]

// feature column names


val featureNames = Array("passenger_count","trip_distance",
"pickup_longitude","pickup_latitude","rate_code","dropoff_longitude",
"dropoff_latitude", "hour", "day_of_week","is_weekend")
// create transformer
object Vectorize {
def apply(df: DataFrame, featureNames: Seq[String], labelName: String):
DataFrame = {
val toFloat = df.schema.map(f => col(f.name).cast(FloatType))
new VectorAssembler()
.setInputCols(featureNames.toArray)
.setOutputCol("features")
.transform(df.select(toFloat:_*))
.select(col("features"), col(labelName))
}
}
// transform method adds features column
var trainSet = Vectorize(tdf, featureNames, labelName)
var evalSet = Vectorize(edf, featureNames, labelName)
trainSet.take(1)
result:
res8: Array[org.apache.spark.sql.Row] = Array([[5.0,2.7200000286102295,-
73.94813537597656,40.82982635498047,-6.77418944E8,-
73.96965026855469,40.79747009277344,10.0,6.0,1.0],11.5])

The key XGBoost training parameters here are num_workers and tree_method; for CPU training, tree_method is set to "hist" and num_workers to the number of CPU workers (12 in the code below).

[Figure: Load data → Transform → Input — the DataFrame with features is the input to the estimator.]

lazy val paramMap = Map(


"learning_rate" -> 0.05,
"max_depth" -> 8,
"subsample" -> 0.8,
"gamma" -> 1,
"num_round" -> 500
)
// set up xgboost parameters
val xgbParamFinal = paramMap ++ Map("tree_method" -> "hist", "num_workers" -> 12)
// create the xgboostregressor estimator
val xgbRegressor = new XGBoostRegressor(xgbParamFinal)
.setLabelCol(labelName)
.setFeaturesCol("features")
For GPU training, set num_workers to the number of GPUs and tree_method to "gpu_hist":

val xgbParamFinal = paramMap ++ Map("tree_method" -> "gpu_hist",
  "num_workers" -> 1)
// create the estimator
val xgbRegressor = new XGBoostRegressor(xgbParamFinal)
.setLabelCol(labelName)
.setFeaturesCols(featureNames)

[Figure: Load data → Transform → Input → Fit — the estimator fits the DataFrame with features and produces a fitted model.]

object Benchmark {
def time[R](phase: String)(block: => R): (R, Float) = {
val t0 = System.currentTimeMillis
val result = block // call-by-name
val t1 = System.currentTimeMillis
println("Elapsed time [" + phase + "]: " +
((t1 - t0).toFloat / 1000) + "s")
(result, (t1 - t0).toFloat / 1000)
}
}
// use the estimator to fit (train) a model
val (model, _) = Benchmark.time("train") {
xgbRegressor.fit(trainSet)
}
[Figure: Transform — the fitted model transforms the DataFrame with label and features into a DataFrame with label, features, and predictions.]

val (prediction, _) = Benchmark.time("transform") {


val ret = model.transform(evalSet).cache()
ret.foreachPartition(_ => ())
ret
}
prediction.select( labelName, "prediction").show(10)
Result:
+-----------+------------------+
|fare_amount| prediction|
+-----------+------------------+
| 5.0| 4.749197959899902|
| 34.0|38.651187896728516|
| 10.0|11.101678848266602|
| 16.5| 17.23284912109375|
| 7.0| 8.149757385253906|
| 7.5|7.5153608322143555|
| 5.5| 7.248467922210693|
| 2.5|12.289423942565918|
| 9.5|10.893491744995117|
| 12.0| 12.06682014465332|
+-----------+------------------+

[Figure: Evaluate — the evaluator scores the DataFrame with label and predictions.]

val evaluator = new RegressionEvaluator().setLabelCol(labelName)


val (rmse, _) = Benchmark.time("evaluation") {
evaluator.evaluate(prediction)
}
println(s"RMSE == $rmse")
Result:
Elapsed time [evaluation]: 0.356s
RMSE == 2.6105287283128353
model.write.overwrite().save(savepath)

val sameModel = XGBoostRegressionModel.load(savepath)
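As a minimal sketch, the reloaded XGBoost model can score new data just like the original (reusing the evalSet from above):

val reloadedPrediction = sameModel.transform(evalSet)
reloadedPrediction.select(labelName, "prediction").show(3)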















