Dataset - Databricks

This document provides an overview of the Dataset API introduced in Spark 2.0, demonstrating how to create DataFrames and Datasets from JSON files. It explains the difference between typed and untyped Datasets, along with various operations such as filtering, aggregating, and metadata retrieval. Additionally, it highlights the interoperability between Datasets and RDDs.



Dataset

Dataset API
In this notebook, we demonstrate the new Dataset API in Spark 2.0, using a very
simple JSON file.

To read the companion blog post, click here:


https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

// Take a look at the content of the file


dbutils.fs.head("/home/webinar/person.json")

res39: String =
"
{"name":"Matei Zaharia","email":"[email protected]","iq":180}
{"name":"Reynold Xin","email":"[email protected]","iq":80}
"

Creating DataFrames and Datasets


Starting with Spark 2.0, a DataFrame is simply a type alias for Dataset of Row. There are
many ways to create DataFrames and Datasets.

The first way, used primarily in testing and demos, uses the range function available
on SparkSession.

// range(100) creates a Dataset with 100 elements, from 0 to 99.


val range100 = spark.range(100)
range100.collect()

range100: org.apache.spark.sql.Dataset[Long] = [id: bigint]


res40: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99)

The second way, which is probably the most common way, is to create a
DataFrame/Dataset by referencing some files on external storage systems.

// Read the data in as a DataFrame


val jsonData = spark.read.json("/home/webinar/person.json")

jsonData: org.apache.spark.sql.DataFrame = [email: string, iq: bigint ... 1 more field]

display(jsonData)

email                    iq    name
[email protected]   180   Matei Zaharia
[email protected]   80    Reynold Xin

// DataFrame is just an alias for Dataset[Row]


import org.apache.spark.sql.Dataset
val jsonDataset: Dataset[Row] = jsonData

import org.apache.spark.sql.Dataset
jsonDataset: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [email: string, iq: bigint ... 1 more field]

Databricks' display works on both DataFrames and Datasets.

display(jsonDataset)

email                    iq    name
[email protected]   180   Matei Zaharia
[email protected]   80    Reynold Xin
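Before moving on, note that a Dataset can also be created from a local Scala collection. This is not part of the original notebook; it is a minimal sketch assuming the same environment, where spark and its implicits are available.

// Sketch (not from the original notebook): create a Dataset from a local
// Scala collection using the implicit encoders provided by the session.
import spark.implicits._

val namesDs = Seq("Matei Zaharia", "Reynold Xin").toDS()
namesDs.show()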


DataFrame (or Dataset of Row) is great, but sometimes we want compile-time type safety
and the ability to work with our own domain-specific objects. Here we demonstrate how to
turn an untyped Dataset into a typed Dataset.

// First, define our domain-specific class


case class Person(email: String, iq: Long, name: String)

// Turn a generic DataFrame into a Dataset of Person


val ds = spark.read.json("/home/webinar/person.json").as[Person]

defined class Person


ds: org.apache.spark.sql.Dataset[Person] = [email: string, iq: bigint ... 1 more field]
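The as[Person] conversion relies on an implicit Encoder for the case class. Databricks notebooks and spark-shell provide the session implicits automatically; a standalone application needs the explicit import. A minimal sketch under that assumption:

// Sketch (assumption: running as a standalone application, so the session
// implicits that derive Encoder[Person] must be imported by hand).
import spark.implicits._

case class Person(email: String, iq: Long, name: String)
val typedDs = spark.read.json("/home/webinar/person.json").as[Person]
typedDs.show()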

Metadata operations
There are a few metadata operations that are very handy for Datasets.

// Get the list of columns


ds.columns

res43: Array[String] = Array(email, iq, name)

// Get the schema of the underlying data structure.


ds.schema

res44: org.apache.spark.sql.types.StructType = StructType(StructField(email,StringType,true), StructField(iq,LongType,true), StructField(name,StringType,true))
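A related helper, printSchema, renders the same information as an indented tree. A small sketch using the same ds (exact formatting may vary by Spark version):

// Print the schema as a readable tree (same information as ds.schema).
ds.printSchema()
// root
//  |-- email: string (nullable = true)
//  |-- iq: long (nullable = true)
//  |-- name: string (nullable = true)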

// Explain the logical and physical query plan to compute the Dataset.
ds.explain(true)

== Parsed Logical Plan ==


Relation[email#2250,iq#2251L,name#2252] HadoopFiles

== Analyzed Logical Plan ==


email: string, iq: bigint, name: string
Relation[email#2250,iq#2251L,name#2252] HadoopFiles

== Optimized Logical Plan ==


Relation[email#2250,iq#2251L,name#2252] HadoopFiles


== Physical Plan ==
WholeStageCodegen
: +- Scan HadoopFiles[email#2250,iq#2251L,name#2252] Format: JSON, PushedFilters: [], ReadSchema: struct<email:string,iq:bigint,name:string>

Typed Dataset API


Dataset includes a typed functional API similar to RDDs and Scala's own collection
library. This API is available in Scala/Java, but not Python/R.

// Run a map
ds.map(_.name).collect()

res46: Array[String] = Array(Matei Zaharia, Reynold Xin)

// Run a filter
ds.filter(_.iq > 140).collect()

res47: Array[Person] = Array(Person([email protected],180,Matei Zaharia))

// We can also run aggregations to compute total IQ and average IQ grouped by some key.
// In this case we are just grouping by a constant 0, i.e. all records get grouped together.
import org.apache.spark.sql.expressions.scala.typed
ds.groupByKey(_ => 0).agg(typed.sum(_.iq), typed.avg(_.iq)).collect()

import org.apache.spark.sql.expressions.scala.typed
res48: Array[(Int, Double, Double)] = Array((0,260.0,130.0))
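For comparison (not in the original notebook), grouping by a real field works the same way; a minimal sketch that counts records per name using the same ds:

// Sketch: group by an actual key (the name field) and count records per key.
ds.groupByKey(_.name).count().collect()
// e.g. Array((Matei Zaharia,1), (Reynold Xin,1))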

Untyped Dataset API (a.k.a. DataFrame API)


Dataset also includes untyped functions that return results in the form of DataFrames
(i.e. Dataset[Row]). This API is available in all programming languages
(Java/Scala/Python/R).

// The select function is similar to the map function, but is not typed (i.e. it returns a DataFrame)
ds.select("name").collect()

res49: Array[org.apache.spark.sql.Row] = Array([Matei Zaharia], [Reynold Xin])
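The untyped counterpart of the earlier typed filter uses a Column expression instead of a lambda. A small sketch, assuming the $ column syntax from spark.implicits._ is in scope, as it is in the notebook:

// Untyped filter with a Column expression; compare with ds.filter(_.iq > 140).
ds.filter($"iq" > 140).select("name").collect()
// e.g. Array([Matei Zaharia])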


// Run some aggregations: note that we are using groupBy, which is different from the type safe groupByKey
import org.apache.spark.sql.functions.{sum, avg}
ds.groupBy().agg(sum("iq"), avg("iq")).collect()

import org.apache.spark.sql.functions.{sum, avg}


res50: Array[org.apache.spark.sql.Row] = Array([260,130.0])

Interoperate with RDDs


A Dataset can be easily turned into an RDD.

ds.rdd

res51: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[679] at rdd at <console>:65
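Going the other direction is just as simple. A minimal sketch (not from the original notebook) that turns the RDD back into a Dataset, assuming the session implicits are in scope so an Encoder[Person] is available:

// Sketch: convert an RDD[Person] back into a Dataset[Person].
val personRdd = ds.rdd
val backToDs = spark.createDataset(personRdd)   // personRdd.toDS() also works
backToDs.show()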

Again, to read the companion blog post, click here:


https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

