Dataset - Databricks

This document provides an overview of the Dataset API introduced in Spark 2.0, demonstrating how to create DataFrames and Datasets from JSON files. It explains the difference between typed and untyped Datasets, along with various operations such as filtering, aggregating, and metadata retrieval. Additionally, it highlights the interoperability between Datasets and RDDs.



Dataset

Dataset API
In this notebook, we demonstrate the new Dataset API in Spark 2.0, using a very
simple JSON file.

To read the companion blog post, click here:


https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

// Take a look at the content of the file


dbutils.fs.head("/home/webinar/person.json")

res39: String =
"
{"name":"Matei Zaharia","email":"[email protected]","iq":180}
{"name":"Reynold Xin","email":"[email protected]","iq":80}
"

Creating DataFrames and Datasets


Starting with Spark 2.0, a DataFrame is simply a type alias for Dataset of Row. There are
many ways to create DataFrames and Datasets.

The first way, used primarily in testing and demos, uses the range function available
on SparkSession.

// range(100) creates a Dataset with 100 elements, from 0 to 99.


val range100 = spark.range(100)
range100.collect()

range100: org.apache.spark.sql.Dataset[Long] = [id: bigint]


res40: Array[Long] = Array(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
96, 97, 98, 99)

The second way, which is probably the most common way, is to create a
DataFrame/Dataset by referencing some files on external storage systems.

// Read the data in as a DataFrame


val jsonData = spark.read.json("/home/webinar/person.json")

jsonData: org.apache.spark.sql.DataFrame = [email: string, iq: bigint ... 1 more field]

display(jsonData)

email                    iq    name
[email protected]   180   Matei Zaharia
[email protected]   80    Reynold Xin

// DataFrame is just an alias for Dataset[Row]


import org.apache.spark.sql.Dataset
val jsonDataset: Dataset[Row] = jsonData

import org.apache.spark.sql.Dataset
jsonDataset: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [email: string, iq: bigint ... 1 more field]

Databricks' display works on both DataFrames and Datasets.

display(jsonDataset)

email                    iq    name
[email protected]   180   Matei Zaharia
[email protected]   80    Reynold Xin
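Before moving on, note that a Dataset can also be created from a local Scala collection. This is not part of the original notebook; it is a minimal sketch assuming the same environment, where spark and its implicits are available.

// Sketch (not from the original notebook): create a Dataset from a local
// Scala collection using the implicit encoders provided by the session.
import spark.implicits._

val namesDs = Seq("Matei Zaharia", "Reynold Xin").toDS()
namesDs.show()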


DataFrame (or Dataset of Row) is great, but sometimes we want compile-time type safety
and the ability to work with our own domain-specific objects. Here we demonstrate how to
turn an untyped Dataset into a typed Dataset.

// First, define our domain-specific class


case class Person(email: String, iq: Long, name: String)

// Turn a generic DataFrame into a Dataset of Person


val ds = spark.read.json("/home/webinar/person.json").as[Person]

defined class Person


ds: org.apache.spark.sql.Dataset[Person] = [email: string, iq: bigint ... 1 more field]
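The as[Person] conversion relies on an implicit Encoder for the case class. Databricks notebooks and spark-shell provide the session implicits automatically; a standalone application needs the explicit import. A minimal sketch under that assumption:

// Sketch (assumption: running as a standalone application, so the session
// implicits that derive Encoder[Person] must be imported by hand).
import spark.implicits._

case class Person(email: String, iq: Long, name: String)
val typedDs = spark.read.json("/home/webinar/person.json").as[Person]
typedDs.show()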

Metadata operations
There are a few metadata operations that are very handy for Datasets.

// Get the list of columns


ds.columns

res43: Array[String] = Array(email, iq, name)

// Get the schema of the underlying data structure.


ds.schema

res44: org.apache.spark.sql.types.StructType = StructType(StructField(email,StringType,true), StructField(iq,LongType,true), StructField(name,StringType,true))
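A related helper, printSchema, renders the same information as an indented tree. A small sketch using the same ds (exact formatting may vary by Spark version):

// Print the schema as a readable tree (same information as ds.schema).
ds.printSchema()
// root
//  |-- email: string (nullable = true)
//  |-- iq: long (nullable = true)
//  |-- name: string (nullable = true)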

// Explain the logical and physical query plan to compute the Dataset.
ds.explain(true)

== Parsed Logical Plan ==


Relation[email#2250,iq#2251L,name#2252] HadoopFiles

== Analyzed Logical Plan ==


email: string, iq: bigint, name: string
Relation[email#2250,iq#2251L,name#2252] HadoopFiles

== Optimized Logical Plan ==


Relation[email#2250,iq#2251L,name#2252] HadoopFiles


== Physical Plan ==
WholeStageCodegen
: +- Scan HadoopFiles[email#2250,iq#2251L,name#2252] Format: JSON, PushedFilters: [], ReadSchema: struct<email:string,iq:bigint,name:string>

Typed Dataset API


Dataset includes a typed functional API similar to RDDs and Scala's own collection
library. This API is available in Scala/Java, but not Python/R.

// Run a map
ds.map(_.name).collect()

res46: Array[String] = Array(Matei Zaharia, Reynold Xin)

// Run a filter
ds.filter(_.iq > 140).collect()

res47: Array[Person] = Array(Person([email protected],180,Matei Zaharia))

// We can also run aggregations to compute total IQ and average IQ grouped by some key.
// In this case we are just grouping by a constant 0, i.e. all records get grouped together.
import org.apache.spark.sql.expressions.scala.typed
ds.groupByKey(_ => 0).agg(typed.sum(_.iq), typed.avg(_.iq)).collect()

import org.apache.spark.sql.expressions.scala.typed
res48: Array[(Int, Double, Double)] = Array((0,260.0,130.0))
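For comparison (not in the original notebook), grouping by a real field works the same way; a minimal sketch that counts records per name using the same ds:

// Sketch: group by an actual key (the name field) and count records per key.
ds.groupByKey(_.name).count().collect()
// e.g. Array((Matei Zaharia,1), (Reynold Xin,1))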

Untyped Dataset API (a.k.a. DataFrame API)


Dataset also includes untyped functions that return results in the form of DataFrames
(i.e. Dataset[Row]). This API is available in all programming languages
(Java/Scala/Python/R).

// The select function is similar to the map function, but is not typed (i.e. it returns a DataFrame)
ds.select("name").collect()

res49: Array[org.apache.spark.sql.Row] = Array([Matei Zaharia], [Reynold Xin])
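The untyped counterpart of the earlier typed filter uses a Column expression instead of a lambda. A small sketch, assuming the $ column syntax from spark.implicits._ is in scope, as it is in the notebook:

// Untyped filter with a Column expression; compare with ds.filter(_.iq > 140).
ds.filter($"iq" > 140).select("name").collect()
// e.g. Array([Matei Zaharia])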


// Run some aggregations: note that we are using groupBy, which is different from the type safe groupByKey
import org.apache.spark.sql.functions.{sum, avg}
ds.groupBy().agg(sum("iq"), avg("iq")).collect()

import org.apache.spark.sql.functions.{sum, avg}


res50: Array[org.apache.spark.sql.Row] = Array([260,130.0])

Interoperate with RDDs


A Dataset can be easily turned into an RDD.

ds.rdd

res51: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[679] at rdd at <console>:65
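Going the other direction is just as simple. A minimal sketch (not from the original notebook) that turns the RDD back into a Dataset, assuming the session implicits are in scope so an Encoder[Person] is available:

// Sketch: convert an RDD[Person] back into a Dataset[Person].
val personRdd = ds.rdd
val backToDs = spark.createDataset(personRdd)   // personRdd.toDS() also works
backToDs.show()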

Again, to read the companion blog post, click here:


https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html

