Spark SQL allows users to query structured data using SQL. It includes components for parsing SQL, optimizing queries, and executing them to produce a DataFrame. The parser converts SQL into a logical plan, which is optimized and converted into physical execution plans involving operations like scans, filters, and aggregations on Resilient Distributed Datasets (RDDs). This allows Spark SQL to leverage Spark's distributed processing capabilities to perform queries efficiently in a parallel manner.


ACCELERATING

APACHE
SPARK 3
Leveraging NVIDIA GPUs
to Power the Next Era
of Analytics and AI

Carol McDonald with contributions from NVIDIA


[Figure: MapReduce execution — (1) the user program forks a master and worker processes; (2) the master assigns map and reduce tasks to workers; (3) map workers read input splits and (4) write intermediate files to local disk; (5) reduce workers read those files remotely and (6) write the output files.]

[Figure: An iterative algorithm as a chain of MapReduce jobs — each job runs its maps and reduces and writes its output to HDFS as a SequenceFile, which becomes the input to the next job, until the last job writes the final output.]
[Figure: A data file split into partitions (Partition 1–4) distributed across the cluster.]

In Spark, data is:
• Read into a memory cache
• Partitioned across a cluster
• Operated on in parallel
• Cached in memory for iterative use

[Figure: Partitions P1–P4 cached in executor memory on different nodes.]
[Figure: Apache Arrow provides a standard in-memory columnar data format shared by Pandas, Drill, Spark, Impala, Parquet, HBase, Cassandra, Kudu, and other systems.]
[Figure: The Apache Spark stack — Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing) built on the Spark core engine.]
[Figure: Spark architecture — the driver program's application creates a SparkSession, which connects through a cluster manager to worker nodes; each worker runs an executor that caches partitions in memory and runs tasks against data on disk.]

A DataFrame is like a partitioned table: a distributed collection of rows (org.apache.spark.sql.Row) organized into named columns and split into partitions that are cached in executor memory across worker nodes.

[Figure: DataFrame partitions, each holding rows and columns, cached in executors on worker nodes.]
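As a minimal sketch (using an illustrative range DataFrame rather than the taxi data used later), the partitions backing a DataFrame and its Row records can be inspected directly:

// hypothetical example: a small DataFrame of ids, just to look at its partitions and rows
val ids = spark.range(0, 1000000).toDF("id")
println(ids.rdd.getNumPartitions)   // how many partitions back this DataFrame
ids.take(1)                         // Array[org.apache.spark.sql.Row]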

$ /[installation path]/bin/spark-shell --master local[2]

val spark = SparkSession.builder.appName("Simple Application").master("local[2]").getOrCreate()
A standalone application can be launched with ./bin/spark-submit:

object Taxi {
  def main(args: Array[String]) {
    val spark: SparkSession = SparkSession.builder()
      .appName("Taxi").master("local[*]").getOrCreate()
    val df = spark.read.option("inferSchema", "false")
      .option("header", "true").schema(schema).csv(file)
    df.groupBy("hour").count().show()
  }
}

import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val schema =
StructType(Array(
StructField("vendor_id", DoubleType),
StructField("passenger_count", DoubleType),
StructField("trip_distance", DoubleType),
StructField("pickup_longitude", DoubleType),
StructField("pickup_latitude", DoubleType),
StructField("rate_code", DoubleType),
StructField("store_and_fwd", DoubleType),
StructField("dropoff_longitude", DoubleType),
StructField("dropoff_latitude", DoubleType),
StructField("fare_amount", DoubleType),
StructField("hour", DoubleType),
StructField("year", IntegerType),
StructField("month", IntegerType),
StructField("day", DoubleType),
StructField("day_of_week", DoubleType),
StructField("is_weekend", DoubleType)
))

val file = "/data/taxi_small.csv"

val df = spark.read.option("inferSchema", "false")
  .option("header", true).schema(schema).csv(file)

result:
df: org.apache.spark.sql.DataFrame = [vendor_id: double, passenger_count:
double ... 14 more fields]

DataFrame rows have the type org.apache.spark.sql.Row:

df.take(1)
result:
Array[org.apache.spark.sql.Row] = Array([4.52563162E8,5.0,2.72,-
73.948132,40.829826999999995,-6.77418915E8,-1.0,-
73.969648,40.797472000000006,11.5,10.0,2012,11,13.0,6.0,1.0])
Transformations create a new DataFrame from the current one. Actions return values to the driver or write data to disk.

[Figure: The driver program's SparkSession sends work through the cluster manager to executors on worker nodes, where tasks run against cached partitions.]
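As a minimal sketch (using the taxi DataFrame df loaded above), the lazy behavior can be observed directly:

// transformations are lazy: this returns a new DataFrame but runs no job yet
val nightTrips = df.filter($"hour" === 0.0)
// actions trigger execution: count() runs a job and returns a value to the driver
nightTrips.count()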

Example transformations include:
• select
• join
• groupBy

The groupBy transformation:
df.groupBy("hour").count().show(4)

result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 49|
| 2.0| 658|
| 3.0| 742|
+----+-----+

Example actions include:
• show(n)
• take(n)
• count

Narrow transformations, such as filter(), can be computed within a single partition; no data needs to move between partitions.

[Figure: A narrow transformation — each output partition is computed from a single input partition.]
// select and filter are narrow transformations


df.select($"hour", $"fare_amount").filter($"hour" === "0.0" ).show(2)

result:
+----+-----------+
|hour|fare_amount|
+----+-----------+
| 0.0| 10.5|
| 0.0| 12.5|
+----+-----------+
Wide transformations, such as groupBy, agg, sortBy, and orderBy, require data to be shuffled between partitions so that rows with the same key end up in the same partition.

[Figure: A wide transformation — values from all input partitions are shuffled so that equal keys are grouped into the same output partition.]

df.groupBy("hour").count().show(4)

result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 49|
| 2.0| 658|
| 3.0| 742|
+----+-----+



[Figure: Catalyst query optimization — a SQL query, Dataset, or DataFrame is turned into an unresolved logical plan by the Parser, resolved against the catalog by the Analyzer, rewritten into an optimized logical plan by the Optimizer (using cache metadata), expanded into candidate physical plans by the Planner, ranked by a cost model, and the selected physical plan is executed as operations on RDDs.]
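As a minimal sketch, explain(true) prints the parsed, analyzed, and optimized logical plans along with the physical plan that Catalyst produces (df is the taxi DataFrame loaded in the previous chapter):

df.filter($"day_of_week" === 6.0).groupBy("hour").count().explain(true)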
[Figure: The physical plan for df2 below — a CSV scan that reads only the hour, fare_amount, and day_of_week columns, with the filter on day_of_week pushed down to the scan.]

val df = spark.read.option("inferSchema", "false")
  .option("header", true).schema(schema).csv(file)
val df2 = df.select($"hour", $"fare_amount",
$"day_of_week").filter($"day_of_week" === "6.0" )
df2.show(3)
result:
+----+-----------+-----------+
|hour|fare_amount|day_of_week|
+----+-----------+-----------+
|10.0| 11.5| 6.0|
|10.0| 5.5| 6.0|
|10.0| 13.0| 6.0|
+----+-----------+-----------+
df2.explain("formatted")
result:
== Physical Plan ==
* Project (3)
+- * Filter (2)
+- Scan csv (1)

(1) Scan csv


Location: [dbfs:/FileStore/tables/taxi_tsmall.csv]
Output [3]: [fare_amount#143, hour#144, day_of_week#148]
PushedFilters: [IsNotNull(day_of_week), EqualTo(day_of_week,6.0)]

(2) Filter [codegen id : 1]


Input [3]: [fare_amount#143, hour#144, day_of_week#148]
Condition : (isnotnull(day_of_week#148) AND (day_of_week#148 = 6.0))

(3) Project [codegen id : 1]


Output [3]: [hour#144, fare_amount#143, day_of_week#148]
Input [3]: [fare_amount#143, hour#144, day_of_week#148]
[Figure: The df2 plan — READ (scan) and FILTER.]

When we add a groupBy to create df3, the physical plan becomes Scan → Filter → Project → HashAggregate → Exchange → HashAggregate; the Exchange is the shuffle caused by the groupBy.

val df3 = df2.groupBy("hour").count


df3.orderBy(asc("hour")).show(5)
result:
+----+-----+
|hour|count|
+----+-----+
| 0.0| 12|
| 1.0| 47|
| 2.0| 658|
| 3.0| 742|
| 4.0| 812|
+----+-----+

df3.explain
result:
== Physical Plan ==
* HashAggregate (6)
+- Exchange (5)
+- * HashAggregate (4)
+- * Project (3)
+- * Filter (2)
+- Scan csv (1)
(1) Scan csv
Output [2]: [hour, day_of_week]
(2) Filter [codegen id : 1]
Input [2]: [hour, day_of_week]
Condition : (isnotnull(day_of_week) AND (day_of_week = 6.0))
(3) Project [codegen id : 1]
Output [1]: [hour]
Input [2]: [hour, day_of_week]
(4) HashAggregate [codegen id : 1]
Input [1]: [hour]
Functions [1]: [partial_count(1) AS count]
Aggregate Attributes [1]: [count]
Results [2]: [hour, count]
(5) Exchange
Input [2]: [hour, count]
Arguments: hashpartitioning(hour, 200), true, [id=]
(6) HashAggregate [codegen id : 2]
Input [2]: [hour, count]
Keys [1]: [hour]
Functions [1]: [finalmerge_count(merge count) AS count(1)]
Aggregate Attributes [1]: [count(1)]
Results [2]: [hour, count(1) AS count]

[Figure: The same query as physical operators — FILE SCAN, FILTER, PROJECT, HASH AGGREGATE, EXCHANGE (the shuffle that groups rows by key), and a final HASH AGGREGATE.]
[Figure: The Planner generates candidate physical plans, a cost model selects one, and the selected physical plan is executed as RDD operations.]

[Figure: The physical plan is split into stages at the shuffle boundary — Stage 1 (read, filter, project, partial aggregate) and Stage 2 (final aggregate after the exchange).]

[Figure: Each stage is split into tasks, one per partition, and the tasks for Stage 1 and Stage 2 form task sets.]
The task set is sent to the task scheduler, which sends the tasks to executors to run.

[Figure: Stage 1 tasks dispatched to executors on worker nodes, each task processing one cached partition of data.]
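As a minimal sketch, the lineage of the underlying RDD shows where the shuffle boundaries (and therefore the stage splits) occur (df3 is the grouped DataFrame from above):

println(df3.rdd.toDebugString)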
[Figure: Load Data — data sources are read into DataFrames.]

// load the data as in Chapter 1


val file = "/data/taxi_small.csv"

val df = spark.read.option("inferSchema", "false")
  .option("header", true).schema(schema).csv(file)

// cache DataFrame in columnar format in memory


df.cache

// create Table view of DataFrame for Spark SQL


df.createOrReplaceTempView("taxi")

// cache taxi table in columnar format in memory


spark.catalog.cacheTable("taxi")
%sql
select hour, avg(fare_amount)
from taxi
group by hour order by hour

The equivalent DataFrame query:

df.groupBy("hour").avg("fare_amount")
.orderBy("hour").show(5)

result:
+----+------------------+
|hour| avg(fare_amount)|
+----+------------------+
| 0.0|11.083333333333334|
| 1.0|22.581632653061224|
| 2.0|11.370820668693009|
| 3.0|13.873989218328841|
| 4.0| 14.57204433497537|
+----+------------------+
%sql
select trip_distance,avg(trip_distance), avg(fare_amount)
from taxi
group by trip_distance order by avg(trip_distance) desc

%sql
select hour, avg(fare_amount), avg(trip_distance)
from taxi
group by hour order by hour

%sql
select rate_code, avg(fare_amount), avg(trip_distance)
from taxi
group by rate_code order by rate_code
%sql
select day_of_week, avg(fare_amount), avg(trip_distance)
from taxi
group by day_of_week order by day_of_week








df.write.format("parquet")
.partitionBy("year")
.option("path", "/data ")
.saveAsTable("taxi")

path/to/table/
  year=2019/
    part01.parquet
    part02.parquet
  year=2018/
    part01.parquet
    ...

df.filter("year = '2019'")
  .groupBy("year").avg("fare_amount")
df.write.format("parquet")
.partitionBy("year")
.bucketBy(4,"hour")
.option("path", "/data ")
.saveAsTable("taxi")

df.filter("year = '2019'")
  .groupBy("hour")
  .avg("hour")
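As a minimal sketch (assuming the partitioned taxi table written above), filtering on the partition column lets Spark prune directories so only the year=2019 files are read; the scan in the plan should show a partition filter on year:

val pruned = spark.table("taxi").filter("year = 2019")
pruned.explain()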
GPU Scheduling

[Figure: GPU scheduling in a Spark application — the driver submits the application and requests executor containers with GPUs from the cluster manager (YARN, Kubernetes, etc.); the launched executors register their GPU(s); the driver assigns tasks and passes GPU addresses to them; the task code runs on the assigned GPU, for example launching TensorFlow or another AI algorithm.]
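As a minimal sketch (assuming an application launched with the GPU resource configurations described below, and reusing the taxi DataFrame df), a task can read the GPU address(es) the scheduler assigned to it from its TaskContext and hand them to an AI library:

import org.apache.spark.TaskContext

df.rdd.foreachPartition { _ =>
  // resources() maps resource names to the addresses assigned to this task
  val gpus = TaskContext.get().resources()("gpu").addresses
  println(s"task assigned GPU(s): ${gpus.mkString(",")}")
}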



select fs.airport, fs.total_sales
from flight_sales fs, flight_airports fa
where fs.airport = fa.airport and fa.region = 'NEUSA'

[Figure: The query plan — scans of flight_sales and flight_airports, a filter on region, and a join on airport.]








































--conf [conf key]=[conf value]
${SPARK_HOME}/bin/spark-shell --jars 'rapids-4-spark_2.12-0.5.0.jar,cudf-0.19.2-cuda10-1.jar' \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.incompatibleOps.enabled=true

spark.conf.set("[conf key]", [conf value])


scala> spark.conf.set("spark.rapids.sql.incompatibleOps.enabled", true)

--conf spark.executor.resource.gpu.amount=1

--conf spark.task.resource.gpu.amount=1

--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh
spark.task.resource.gpu.amount can be a decimal so that multiple tasks share one GPU, and it should be set together with spark.executor.cores. For example, with spark.executor.cores=2 and spark.task.resource.gpu.amount=0.5, the two concurrent tasks share the executor's single GPU. With spark.executor.cores=6 (and the default spark.task.cpus=1), up to six tasks can run concurrently, so the GPU amount per task must be chosen so that all concurrent tasks fit on the GPUs assigned to the executor.
The size of the input partitions (and therefore the number of tasks reading a file) is controlled by:
• spark.sql.files.maxPartitionBytes
• spark.hadoop.mapreduce.input.fileinputformat.split.minsize
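As a minimal sketch (reusing the taxi schema and file from Chapter 1; the 512 MB value is only illustrative), spark.sql.files.maxPartitionBytes is a SQL configuration and can also be adjusted at runtime before reading a file:

spark.conf.set("spark.sql.files.maxPartitionBytes", 512L * 1024 * 1024)
// fewer, larger input partitions for the same file
val bigPartitionsDf = spark.read.option("header", "true").schema(schema).csv(file)
println(bigPartitionsDf.rdd.getNumPartitions)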

// select and filter are narrow transformations


df.select($"hour", $"fare_amount").filter($"hour" === "0.0" ).show(2)

result:
+----+-----------+
|hour|fare_amount|
+----+-----------+
| 0.0| 10.5|
| 0.0| 12.5|
+----+-----------+

df.select($"hour", $"fare_amount").filter($"hour" === "0.0" ).explain

result:
== Physical Plan ==
*(1) GpuColumnarToRow false
+- !GpuProject [hour#10, fare_amount#9]
+- GpuCoalesceBatches TargetSize(1000000,2147483647)
+- !GpuFilter (gpuisnotnull(hour#10) AND (hour#10 = 0.0))
+- GpuBatchScan[fare_amount#9, hour#10] GpuCSVScan Location:
InMemoryFileIndex[s3a://spark-taxi-dataset/raw-small/train], ReadSchema:
struct<fare_amount:double,hour:double>

val df3 = df2.groupBy("month").count
df3.orderBy(asc("month")).show(5)
[Figure: Supervised learning — a model F(X1, X2) = Y is built from data with features X1, X2, and the model is then used to predict Y for new data.]

Linear regression models the relationship between a label Y and a feature X as:

Y = intercept + (coefficient * X) + error

[Figure: Linear regression example — a model Y = a + bX is built from data with feature X (size), then used to predict Y for new data; a is the intercept and b is the coefficient (the slope of the line).]
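As a minimal sketch (with a hypothetical DataFrame named training that already has "features" and "label" columns), Spark ML's LinearRegression exposes exactly these fitted parameters:

import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("label")
val lrModel = lr.fit(training)   // training is a hypothetical prepared DataFrame
// the fitted intercept and coefficients of Y = intercept + coefficients . X
println(s"intercept = ${lrModel.intercept}, coefficients = ${lrModel.coefficients}")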




[Figure: A random forest — the training data is sampled into subsets, a decision tree is built on each subset, and the trees' predictions are combined.]





















[Figure: Load Data → DataFrame.]

import org.apache.spark._
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.regression._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.tuning._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.ml.Pipeline

val schema = StructType(Array(
StructField("longitude", FloatType, true),
StructField("latitude", FloatType, true),
StructField("medage", FloatType, true),
StructField("totalrooms", FloatType, true),
StructField("totalbdrms", FloatType, true),
StructField("population", FloatType, true),
StructField("houshlds", FloatType, true),
StructField("medincome", FloatType, true),
StructField("medhvalue", FloatType, true)
))

var file ="/path/cal_housing.csv"

var df = spark.read.format("csv").option("inferSchema", "false").schema(schema).load(file)

df.show
result:
+---------+--------+------+----------+----------+----------+--------+---------+---------+
|longitude|latitude|medage|totalrooms|totalbdrms|population|houshlds|medincome|medhvalue|
+---------+--------+------+----------+----------+----------+--------+---------+---------+
| -122.23| 37.88| 41.0| 880.0| 129.0| 322.0| 126.0| 8.3252| 452600.0|
| -122.22| 37.86| 21.0| 7099.0| 1106.0| 2401.0| 1138.0| 8.3014| 358500.0|
| -122.24| 37.85| 52.0| 1467.0| 190.0| 496.0| 177.0| 7.2574| 352100.0|
+---------+--------+------+----------+----------+----------+--------+---------+---------+

// create ratios for features


df = df.withColumn("roomsPhouse", col("totalrooms")/col("houshlds"))
df = df.withColumn("popPhouse", col("population")/col("houshlds"))
df = df.withColumn("bedrmsPRoom", col("totalbdrms")/col("totalrooms"))

df=df.drop("totalrooms","houshlds", "population" , "totalbdrms")

df.cache
df.createOrReplaceTempView("house")
spark.catalog.cacheTable("house")
df.describe("medincome","medhvalue","roomsPhouse","popPhouse").show

result:
+-------+------------------+------------------+------------------+------------------+
|summary| medincome| medhvalue| roomsPhouse| popPhouse|
+-------+------------------+------------------+------------------+------------------+
| count| 20640| 20640| 20640| 20640|
| mean|3.8706710030346416|206855.81690891474| 5.428999742190365| 3.070655159436382|
| stddev|1.8998217183639696|115395.61587441359|2.4741731394243205| 10.38604956221361|
| min| 0.4999| 14999.0|0.8461538461538461|0.6923076923076923|
| max| 15.0001| 500001.0| 141.9090909090909|1243.3333333333333|
+-------+------------------+------------------+------------------+------------------+

df.select(corr("medhvalue","medincome")).show()

+--------------------------+
|corr(medhvalue, medincome)|
+--------------------------+
| 0.688075207464692|
+--------------------------+
val Array(trainingData, testData) = df.randomSplit(Array(0.8, 0.2), 1234)

val featureCols = Array("medage", "medincome", "roomsPhouse", "popPhouse",
  "bedrmsPRoom", "longitude", "latitude")

//put features into a feature vector column


val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("rawfeatures")
val scaler = new StandardScaler()
  .setInputCol("rawfeatures")
  .setOutputCol("features")
  .setWithStd(true)
  .setWithMean(true)

[Figure: Transformers — the DataFrame passes through the VectorAssembler and the Scaler to produce a DataFrame with the label and a features column.]
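As a minimal sketch, applying just the assembler stage shows the added "rawfeatures" vector column before scaling:

assembler.transform(trainingData).select("medhvalue", "rawfeatures").show(3, false)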

val rf = new
RandomForestRegressor().setLabelCol("medhvalue").setFeaturesCol("features")

val steps = Array(assembler, scaler, rf)

val pipeline = new Pipeline().setStages(steps)

[Figure: The Pipeline — transformer stages (VectorAssembler, Scaler) followed by an estimator stage (RandomForestRegressor).]
val paramGrid = new ParamGridBuilder()
.addGrid(rf.maxBins, Array(100, 200))
.addGrid(rf.maxDepth, Array(2, 7, 10))
.addGrid(rf.numTrees, Array(5, 20))
.build()

val evaluator = new RegressionEvaluator()


.setLabelCol("medhvalue")
.setPredictionCol("prediction")
.setMetricName("rmse")

val crossvalidator = new CrossValidator()


.setEstimator(pipeline)
.setEvaluator(evaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)

// fit the training data set and return a model


val pipelineModel = crossvalidator.fit(trainingData)

val featureImportances = pipelineModel


.bestModel.asInstanceOf[PipelineModel]
.stages(2)
.asInstanceOf[RandomForestRegressionModel]
.featureImportances

assembler.getInputCols
.zip(featureImportances.toArray)
.sortBy(-_._2)
.foreach { case (feat, imp) =>
println(s"feature: $feat, importance: $imp") }

result:
feature: medincome, importance: 0.4531355014139285
feature: popPhouse, importance: 0.12807843645878508
feature: longitude, importance: 0.10501162983981065
feature: latitude, importance: 0.1044621179898163
feature: bedrmsPRoom, importance: 0.09720295935509805
feature: roomsPhouse, importance: 0.058427239343697555
feature: medage, importance: 0.05368211559886386

val bestEstimatorParamMap = pipelineModel


.getEstimatorParamMaps
.zip(pipelineModel.avgMetrics)
.maxBy(_._2)
._1
println(s"Best params:\n$bestEstimatorParamMap")

result:
rfr_maxBins: 50,
rfr_maxDepth: 2,
rfr_numTrees: 5
[Figure: The fitted Pipeline (VectorAssembler, Scaler, RandomForest) is used on the test DataFrame — load the test data, extract features, predict with the model, and pass the predictions to an evaluator.]

// declared as a var so an error column can be added below
var predictions = pipelineModel.transform(testData)
predictions.select("prediction", "medhvalue").show(5)

result:
+------------------+---------+
| prediction|medhvalue|
+------------------+---------+
|104349.59677450571| 94600.0|
| 77530.43231856065| 85800.0|
|111369.71756877871| 90100.0|
| 97351.87386020401| 82800.0|
+------------------+---------+
predictions = predictions.withColumn("error", col("prediction") - col("medhvalue"))

predictions.select("prediction", "medhvalue", "error").show

result:
+------------------+---------+-------------------+
| prediction|medhvalue| error|
+------------------+---------+-------------------+
| 104349.5967745057| 94600.0| 9749.596774505713|
| 77530.4323185606| 85800.0| -8269.567681439352|
| 101253.3225967887| 103600.0| -2346.677403211302|
+------------------+---------+-------------------+

predictions.describe("prediction", "medhvalue", "error").show


result:
+-------+-----------------+------------------+------------------+
|summary| prediction| medhvalue| error|
+-------+-----------------+------------------+------------------+
| count| 4161| 4161| 4161|
| mean|206307.4865123929|205547.72650805095| 759.7600043416329|
| stddev|97133.45817381598|114708.03790345002| 52725.56329678355|
| min|56471.09903814694| 26900.0|-339450.5381565819|
| max|499238.1371374392| 500001.0|293793.71945819416|
+-------+-----------------+------------------+------------------+

val maevaluator = new RegressionEvaluator()


.setLabelCol("medhvalue")
.setMetricName("mae")

val mae = maevaluator.evaluate(predictions)


result:
mae: Double = 36636.35

val evaluator = new RegressionEvaluator()


.setLabelCol("medhvalue")
.setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)

result:
rmse: Double = 52724.70
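As a minimal sketch, the RMSE can be cross-checked directly from the error column computed above (the root of the mean squared error):

predictions.selectExpr("sqrt(avg(error * error)) as rmse").show()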

[Figure: The pipeline model transforms the test DataFrame into a DataFrame with predictions, which the evaluator scores.]
pipelineModel.write.overwrite().save(modeldir)

val sameModel = CrossValidatorModel.load(modeldir)
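As a minimal sketch, the reloaded model can score new data just like the original (reusing the testData split from above):

val reloadedPredictions = sameModel.transform(testData)
reloadedPredictions.select("prediction", "medhvalue").show(3)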



 →
 →

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.evaluation._
import org.apache.spark.sql.types._
import ml.dmlc.xgboost4j.scala.spark.{XGBoostRegressor,
XGBoostRegressionModel}

import ml.dmlc.xgboost4j.scala.spark.rapids.{GpuDataReader, GpuDataset}

lazy val labelName = "fare_amount"

lazy val schema =
StructType(Array(
StructField("vendor_id", DoubleType),
StructField("passenger_count", DoubleType),
StructField("trip_distance", DoubleType),
StructField("pickup_longitude", DoubleType),
StructField("pickup_latitude", DoubleType),
StructField("rate_code", DoubleType),
StructField("store_and_fwd", DoubleType),
StructField("dropoff_longitude", DoubleType),
StructField("dropoff_latitude", DoubleType),
StructField(labelName, DoubleType),
StructField("hour", DoubleType),
StructField("year", IntegerType),
StructField("month", IntegerType),
StructField("day", DoubleType),
StructField("day_of_week", DoubleType),
StructField("is_weekend", DoubleType)
))
val trainPath = "/FileStore/tables/taxi_tsmall.csv"
val evalPath = "/FileStore/tables/taxi_esmall.csv"
val spark = SparkSession.builder().appName("Taxi-GPU").getOrCreate

[Figure: Load Data → DataFrame.]

val tdf = spark.read.option("inferSchema", "false").option("header",
  true).schema(schema).csv(trainPath)
val edf = spark.read.option("inferSchema", "false").option("header",
  true).schema(schema).csv(evalPath)

Use show(5) to display the first rows:

tdf.select("trip_distance", "rate_code","fare_amount").show(5)
result:
+------------------+-------------+-----------+
| trip_distance| rate_code|fare_amount|
+------------------+-------------+-----------+
| 2.72|-6.77418915E8| 11.5|
| 0.94|-6.77418915E8| 5.5|
| 3.63|-6.77418915E8| 13.0|
| 11.86|-6.77418915E8| 33.5|
| 3.03|-6.77418915E8| 11.0|
+------------------+-------------+-----------+

The describe function returns summary statistics:

tdf.select("trip_distance", "rate_code","fare_amount").describe().show
+-------+------------------+--------------------+------------------+
|summary| trip_distance| rate_code| fare_amount|
+-------+------------------+--------------------+------------------+
| count| 7999| 7999| 7999|
| mean| 3.278923615451919|-6.569284350812602E8|12.348543567945994|
| stddev|3.6320775770793547|1.6677419425906155E8|10.221929466939088|
| min| 0.0| -6.77418915E8| 2.5|
| max|35.970000000000006| 1.957796822E9| 107.5|
+-------+------------------+--------------------+------------------+
%sql
select trip_distance, fare_amount
from taxi

[Figure: Load data → Transform — a VectorAssembler transforms the DataFrame into a DataFrame with a features column.]

// feature column names


val featureNames = Array("passenger_count","trip_distance",
"pickup_longitude","pickup_latitude","rate_code","dropoff_longitude",
"dropoff_latitude", "hour", "day_of_week","is_weekend")
// create transformer
object Vectorize {
def apply(df: DataFrame, featureNames: Seq[String], labelName: String):
DataFrame = {
val toFloat = df.schema.map(f => col(f.name).cast(FloatType))
new VectorAssembler()
.setInputCols(featureNames.toArray)
.setOutputCol("features")
.transform(df.select(toFloat:_*))
.select(col("features"), col(labelName))
}
}
// transform method adds features column
var trainSet = Vectorize(tdf, featureNames, labelName)
var evalSet = Vectorize(edf, featureNames, labelName)
trainSet.take(1)
result:
res8: Array[org.apache.spark.sql.Row] = Array([[5.0,2.7200000286102295,-
73.94813537597656,40.82982635498047,-6.77418944E8,-
73.96965026855469,40.79747009277344,10.0,6.0,1.0],11.5])

The key XGBoost training parameters here are num_workers and tree_method; for CPU training, tree_method is set to "hist" and num_workers to the number of CPU workers (12 in the code below).

[Figure: Load data → Transform → Input — the DataFrame with features is the input to the estimator.]

lazy val paramMap = Map(


"learning_rate" -> 0.05,
"max_depth" -> 8,
"subsample" -> 0.8,
"gamma" -> 1,
"num_round" -> 500
)
// set up xgboost parameters
val xgbParamFinal = paramMap ++ Map("tree_method" -> "hist", "num_workers" -> 12)
// create the xgboostregressor estimator
val xgbRegressor = new XGBoostRegressor(xgbParamFinal)
.setLabelCol(labelName)
.setFeaturesCol("features")
For GPU training, set num_workers to the number of GPUs and tree_method to "gpu_hist":

val xgbParamFinal = paramMap ++ Map("tree_method" -> "gpu_hist",
  "num_workers" -> 1)
// create the estimator
val xgbRegressor = new XGBoostRegressor(xgbParamFinal)
.setLabelCol(labelName)
.setFeaturesCols(featureNames)

[Figure: Load data → Transform → Input → Fit — the estimator fits the DataFrame with features and produces a fitted model.]

object Benchmark {
def time[R](phase: String)(block: => R): (R, Float) = {
val t0 = System.currentTimeMillis
val result = block // call-by-name
val t1 = System.currentTimeMillis
println("Elapsed time [" + phase + "]: " +
((t1 - t0).toFloat / 1000) + "s")
(result, (t1 - t0).toFloat / 1000)
}
}
// use the estimator to fit (train) a model
val (model, _) = Benchmark.time("train") {
xgbRegressor.fit(trainSet)
}
[Figure: Transform — the fitted model transforms the DataFrame with label and features into a DataFrame with label, features, and predictions.]

val (prediction, _) = Benchmark.time("transform") {


val ret = model.transform(evalSet).cache()
ret.foreachPartition(_ => ())
ret
}
prediction.select( labelName, "prediction").show(10)
Result:
+-----------+------------------+
|fare_amount| prediction|
+-----------+------------------+
| 5.0| 4.749197959899902|
| 34.0|38.651187896728516|
| 10.0|11.101678848266602|
| 16.5| 17.23284912109375|
| 7.0| 8.149757385253906|
| 7.5|7.5153608322143555|
| 5.5| 7.248467922210693|
| 2.5|12.289423942565918|
| 9.5|10.893491744995117|
| 12.0| 12.06682014465332|
+-----------+------------------+

[Figure: Evaluate — the evaluator scores the DataFrame with label and predictions.]

val evaluator = new RegressionEvaluator().setLabelCol(labelName)


val (rmse, _) = Benchmark.time("evaluation") {
evaluator.evaluate(prediction)
}
println(s"RMSE == $rmse")
Result:
Elapsed time [evaluation]: 0.356s
RMSE == 2.6105287283128353
model.write.overwrite().save(savepath)

val sameModel = XGBoostRegressionModel.load(savepath)
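As a minimal sketch, the reloaded XGBoost model can score new data just like the original (reusing the evalSet from above):

val reloadedPrediction = sameModel.transform(evalSet)
reloadedPrediction.select(labelName, "prediction").show(3)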















