Dataset - Databricks
Dataset
Dataset API
In this notebook, we demonstrate the new Dataset API in Spark 2.0, using a very
simple JSON file.
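The cell that printed the file is not preserved in this export; a minimal sketch, assuming the file sits at a hypothetical path /tmp/people.json and using Databricks' dbutils.fs.head, which returns the file contents as a String:
// Show the raw contents of the JSON file (one JSON object per line); the path is a placeholder
dbutils.fs.head("/tmp/people.json")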
res39: String =
"
{"name":"Matei Zaharia","email":"[email protected]","iq":180}
{"name":"Reynold Xin","email":"[email protected]","iq":80}
"
The first way to create a Dataset, used primarily in testing and demos, uses the range function available
on SparkSession.
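The cell itself is missing from this export; a minimal sketch, assuming a range of 100 elements to match the truncated output below:
// Create a Dataset[Long] holding the numbers 0 through 99 and bring them back to the driver
spark.range(100).collect()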
74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99)
The second way, which is probably the most common, is to create a
DataFrame/Dataset by referencing files on external storage systems.
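The cell that creates jsonData is not included in the export; a minimal sketch, assuming the JSON file shown above lives at a hypothetical path /tmp/people.json:
// Read the JSON file into a DataFrame (an untyped Dataset of Row); Spark infers the schema automatically
val jsonData = spark.read.json("/tmp/people.json")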
display(jsonData)
import org.apache.spark.sql.Dataset
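The definition of jsonDataset does not appear in the export; a minimal sketch consistent with the output below, assuming it simply views the DataFrame as a Dataset of Row (DataFrame is an alias for Dataset[Row]):
import org.apache.spark.sql.Row

// A DataFrame is just a Dataset[Row], so no conversion is needed
val jsonDataset: Dataset[Row] = jsonData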
jsonDataset: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [email: string, iq: bigint ... 1 more field]
display(jsonDataset)
(display output: a table with columns email, iq, and name listing the two records from the JSON file)
A DataFrame (or Dataset of Row) is great, but sometimes we want compile-time type
safety and the ability to work with our own domain-specific objects. Here we
demonstrate how to turn an untyped Dataset into a typed Dataset.
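The conversion cell is not shown in the export; a minimal sketch, assuming a case class named Person for the domain object (ds is the typed Dataset the following cells operate on):
// Needed for the Encoder that .as[Person] requires (imported by default in Databricks notebooks)
import spark.implicits._

// A domain object matching the inferred schema: email string, iq bigint, name string
case class Person(email: String, iq: Long, name: String)

// Turn the untyped Dataset[Row] into a typed Dataset[Person]
val ds = jsonDataset.as[Person]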
Metadata operations
There are a few metadata operations that are very handy for Datasets.
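Besides explain, shown next, a few other standard metadata calls are worth knowing; a brief sketch (these cells are not part of the original export):
// Print the schema as a tree
ds.printSchema()

// Column names and their data types
ds.columns
ds.dtypes

// The full schema as a StructType
ds.schema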
// Explain the logical and physical query plan to compute the Dataset.
ds.explain(true)
== Physical Plan ==
WholeStageCodegen
:  +- Scan HadoopFiles[email#2250,iq#2251L,name#2252] Format: JSON, PushedFilters: [], ReadSchema: struct<email:string,iq:bigint,name:string>
// Run a map
ds.map(_.name).collect()
// Run a filter
ds.filter(_.iq > 140).collect()
// Can also run aggregations to compute total IQ and average IQ grouped by some key.
// In this case we are just grouping by a constant 0, i.e. all records get grouped together.
import org.apache.spark.sql.expressions.scala.typed
ds.groupByKey(_ => 0).agg(typed.sum(_.iq), typed.avg(_.iq)).collect()
import org.apache.spark.sql.expressions.scala.typed
res48: Array[(Int, Double, Double)] = Array((0,260.0,130.0))
// The select function is similar to the map function, but is not typed (i.e. it returns a DataFrame)
ds.select("name").collect()
// Run some aggregations: note that we are using groupBy, which is different from the type-safe groupByKey
import org.apache.spark.sql.functions.{sum, avg}
ds.groupBy().agg(sum("iq"), avg("iq")).collect()
// Datasets interoperate with the RDD API: rdd returns the underlying RDD of domain objects
ds.rdd