
Adding StructType columns to Spark DataFrames
Matthew Powers
Jan 15, 2018 · 3 min read

StructType objects define the schema of Spark DataFrames. A StructType contains a list of StructField objects that define the name, type, and nullable flag for each column in a DataFrame.
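For example, a single field for a nullable integer column (the column name here is purely illustrative) looks like this:

import org.apache.spark.sql.types._

// name, data type, and whether the column may contain null values
StructField("age", IntegerType, nullable = true)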

Let’s start with an overview of StructType objects and then demonstrate how StructType columns can be added to DataFrame schemas (essentially creating a nested schema).

StructType columns are a great way to eliminate order dependencies from Spark code.

StructType overview
The StructType case class can be used to define a DataFrame
schema as follows.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = Seq(
  Row(1, "a"),
  Row(5, "z")
)

val schema = StructType(
  List(
    StructField("num", IntegerType, true),
    StructField("letter", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df.show()

+---+------+
|num|letter|
+---+------+
|  1|     a|
|  5|     z|
+---+------+

The DataFrame schema method returns a StructType object.


print(df.schema)

StructType(
  StructField(num, IntegerType, true),
  StructField(letter, StringType, true)
)

Let’s look at another example to see how StructType columns can be appended to DataFrames.

Appending StructType columns

Let’s use the struct() function to append a StructType column to a DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = Seq(
  Row(20.0, "dog"),
  Row(3.5, "cat"),
  Row(0.000006, "ant")
)

val schema = StructType(
  List(
    StructField("weight", DoubleType, true),
    StructField("animal_type", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

val actualDF = df.withColumn(
  "animal_interpretation",
  struct(
    (col("weight") > 5).as("is_large_animal"),
    col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
  )
)

actualDF.show(truncate = false)

+------+-----------+---------------------+
|weight|animal_type|animal_interpretation|
+------+-----------+---------------------+
|20.0  |dog        |[true,true]          |
|3.5   |cat        |[false,true]         |
|6.0E-6|ant        |[false,false]        |
+------+-----------+---------------------+

Let’s take a look at the schema.


print(actualDF.schema)

StructType(
  StructField(weight,DoubleType,true),
  StructField(animal_type,StringType,true),
  StructField(animal_interpretation, StructType(
    StructField(is_large_animal,BooleanType,true),
    StructField(is_mammal,BooleanType,true)
  ), false)
)

The animal_interpretation column has a StructType type, so this DataFrame has a nested schema.

It’s easier to view the schema with the printSchema method.


actualDF.printSchema()

root
 |-- weight: double (nullable = true)
 |-- animal_type: string (nullable = true)
 |-- animal_interpretation: struct (nullable = false)
 |    |-- is_large_animal: boolean (nullable = true)
 |    |-- is_mammal: boolean (nullable = true)

We can flatten the DataFrame as follows.


actualDF.select(
  col("animal_type"),
  col("animal_interpretation")("is_large_animal")
    .as("is_large_animal"),
  col("animal_interpretation")("is_mammal")
    .as("is_mammal")
).show(truncate = false)

+-----------+---------------+---------+
|animal_type|is_large_animal|is_mammal|
+-----------+---------------+---------+
|dog        |true           |true     |
|cat        |false          |true     |
|ant        |false          |false    |
+-----------+---------------+---------+
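As an aside, Spark can also expand every field of a struct column at once via star expansion on the nested column name. A minimal sketch:

// Expand all fields of the struct into top-level columns
actualDF.select("animal_type", "animal_interpretation.*").show()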

Using StructTypes to eliminate order dependencies

Let’s demonstrate some order-dependent code and then use a StructType column to eliminate the order dependencies.

Let’s consider three custom transformations that add is_teenager, has_positive_mood, and what_to_do columns to a DataFrame.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

object ExampleTransforms {

  def withIsTeenager()(df: DataFrame): DataFrame = {
    df.withColumn("is_teenager", col("age").between(13, 19))
  }

  def withHasPositiveMood()(df: DataFrame): DataFrame = {
    df.withColumn(
      "has_positive_mood",
      col("mood").isin("happy", "glad")
    )
  }

  def withWhatToDo()(df: DataFrame): DataFrame = {
    df.withColumn(
      "what_to_do",
      when(
        col("is_teenager") && col("has_positive_mood"),
        "have a chat"
      )
    )
  }
}
Notice that both the withIsTeenager and withHasPositiveMood transformations must be run before the withWhatToDo transformation. The functions have an order dependency: they must be run in a certain order for the code to work.

Let’s build a DataFrame and execute the functions in the right order so the code will run.
val data = Seq(
  Row(30, "happy"),
  Row(13, "sad"),
  Row(18, "glad")
)

val schema = StructType(
  List(
    StructField("age", IntegerType, true),
    StructField("mood", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

df
  .transform(ExampleTransforms.withIsTeenager())
  .transform(ExampleTransforms.withHasPositiveMood())
  .transform(ExampleTransforms.withWhatToDo())
  .show()

+---+-----+-----------+-----------------+-----------+
|age| mood|is_teenager|has_positive_mood| what_to_do|
+---+-----+-----------+-----------------+-----------+
| 30|happy|      false|             true|       null|
| 13|  sad|       true|            false|       null|
| 18| glad|       true|             true|have a chat|
+---+-----+-----------+-----------------+-----------+
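For contrast, running withWhatToDo first breaks the pipeline: the is_teenager and has_positive_mood columns don’t exist yet, so Spark fails at analysis time (the exact error message varies by Spark version).

// Throws org.apache.spark.sql.AnalysisException: cannot resolve 'is_teenager'
df.transform(ExampleTransforms.withWhatToDo())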

Let’s use the struct function to append a StructType column to the DataFrame and remove the order dependencies from this code.
val isTeenager = col("age").between(13, 19)
val hasPositiveMood = col("mood").isin("happy", "glad")

df.withColumn(
  "best_action",
  struct(
    isTeenager.as("is_teenager"),
    hasPositiveMood.as("has_positive_mood"),
    when(
      isTeenager && hasPositiveMood,
      "have a chat"
    ).as("what_to_do")
  )
).show(truncate = false)

+---+-----+-----------------------+
|age|mood |best_action            |
+---+-----+-----------------------+
|30 |happy|[false,true,null]      |
|13 |sad  |[true,false,null]      |
|18 |glad |[true,true,have a chat]|
+---+-----+-----------------------+
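Nested fields remain addressable with dot notation, so downstream code can consume the struct without caring how its fields were computed. A sketch, assigning the result to a hypothetical withBestAction value:

val withBestAction = df.withColumn(
  "best_action",
  struct(
    isTeenager.as("is_teenager"),
    hasPositiveMood.as("has_positive_mood"),
    when(isTeenager && hasPositiveMood, "have a chat").as("what_to_do")
  )
)

// Filter on a nested field with dot notation
withBestAction
  .where(col("best_action.what_to_do").isNotNull)
  .show(truncate = false)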

Order dependencies can be a big problem in large Spark codebases

If your code is organized as DataFrame transformations, order dependencies can become a big problem.

You might need to figure out how to call 20 functions in exactly the right order to get the desired result.

StructType columns are one way to eliminate order dependencies from your code. I’ll discuss other strategies in more detail in a future blog post!
