JSON to DataFrame


1. Read JSON file into DataFrame

Sample Data:

{"ID":1,"NAME":"Jourdan","GENDER":"Female","DOB":"2012-01-01","SALARY":82445.63,"NRI":null}
{"ID":2,"NAME":"Alvera","GENDER":"Female","DOB":"2023-08-08","SALARY":75985.14,"NRI":true}
{"ID":3,"NAME":"Chauncey","GENDER":"Male","DOB":"2010-09-17","SALARY":81600.32,"NRI":null}
{"ID":4,"NAME":"Karrie","GENDER":"Female","DOB":"2024-02-28","SALARY":93889.24,"NRI":null}
{"ID":5,"NAME":"Phil","GENDER":"Female","DOB":"2022-06-06","SALARY":99743.67,"NRI":true}

1.1 Read JSON file without specifying the schema

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val inputDF = spark.read
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Single_Line.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

Output:

Show DataFrame schema and data


root
|-- DOB: string (nullable = true) // Spark does not infer the date type when no schema is specified
|-- GENDER: string (nullable = true)
|-- ID: long (nullable = true)
|-- NAME: string (nullable = true)
|-- NRI: boolean (nullable = true)
|-- SALARY: double (nullable = true)

inputDF:
+----------+------+---+--------+----+--------+
|DOB |GENDER|ID |NAME |NRI |SALARY |
+----------+------+---+--------+----+--------+
|2012-01-01|Female|1 |Jourdan |null|82445.63|
|2023-08-08|Female|2 |Alvera |true|75985.14|
|2010-09-17|Male |3 |Chauncey|null|81600.32|
|2024-02-28|Female|4 |Karrie |null|93889.24|
|2022-06-06|Female|5 |Phil |true|99743.67|
+----------+------+---+--------+----+--------+
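Note that DOB was inferred as a string. When defining a full schema is not worth it, one option is to cast the column after the read; a minimal sketch (typedDF is just an illustrative name) using to_date from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{col, to_date}

// Convert the inferred string column to a proper date after the read.
// "yyyy-MM-dd" matches the sample data above; adjust the pattern if the file differs.
val typedDF = inputDF.withColumn("DOB", to_date(col("DOB"), "yyyy-MM-dd"))
typedDF.printSchema() // DOB: date (nullable = true)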

1.2 Read JSON file with specifying the schema

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val schema = StructType(
  Array(
    StructField("ID", IntegerType),
    StructField("NAME", StringType),
    StructField("GENDER", StringType),
    StructField("DOB", DateType),
    StructField("SALARY", DoubleType),
    StructField("NRI", BooleanType)
  )
)

val inputDF = spark.read
  .schema(schema)
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Single_Line.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

Output:

Show DataFrame schema and data


root
|-- ID: integer (nullable = true)
|-- NAME: string (nullable = true)
|-- GENDER: string (nullable = true)
|-- DOB: date (nullable = true) // date type comes from the user-specified schema
|-- SALARY: double (nullable = true)
|-- NRI: boolean (nullable = true)

inputDF:
+---+--------+------+----------+--------+----+
|ID |NAME |GENDER|DOB |SALARY |NRI |
+---+--------+------+----------+--------+----+
|1 |Jourdan |Female|2012-01-01|82445.63|null|
|2 |Alvera |Female|2023-08-08|75985.14|true|
|3 |Chauncey|Male |2010-09-17|81600.32|null|
|4 |Karrie |Female|2024-02-28|93889.24|null|
|5 |Phil |Female|2022-06-06|99743.67|true|
+---+--------+------+----------+--------+----+
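With an explicit schema the columns also come back in the schema's order rather than alphabetically. If a file's dates were not in the default yyyy-MM-dd form, the JSON reader's dateFormat option could be added; a hedged sketch, where the dd-MM-yyyy pattern and the file name are purely hypothetical:

// Sketch only: assumes a hypothetical file whose DOB values look like "01-01-2012".
val inputDF = spark.read
  .schema(schema)
  .option("dateFormat", "dd-MM-yyyy") // pattern used to parse DateType fields
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Single_Line_ddMMyyyy.json") // hypothetical path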

2. Read multiline JSON file into DataFrame

Sample Data:

[{"ID":1,
"NAME":"Jourdan",
"GENDER":"Female",
"DOB":"2012-01-01",
"SALARY":82445.63,
"NRI":null
},
{"ID":2,
"NAME":"Alvera",
"GENDER":"Female",
"DOB":"2023-08-08",
"SALARY":75985.14,
"NRI":true
},
{"ID":3,
"NAME":"Chauncey",
"GENDER":"Male",
"DOB":"2010-09-17",
"SALARY":81600.32,
"NRI":null
},
{"ID":4,
"NAME":"Karrie",
"GENDER":"Female",
"DOB":"2024-02-28",
"SALARY":93889.24,
"NRI":null
},
{"ID":5,
"NAME":"Phil",
"GENDER":"Female",
"DOB":"2022-06-06",
"SALARY":99743.67,
"NRI":true
}]

Code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val schema = StructType(
  Array(
    StructField("ID", IntegerType),
    StructField("NAME", StringType),
    StructField("GENDER", StringType),
    StructField("DOB", DateType),
    StructField("SALARY", DoubleType),
    StructField("NRI", BooleanType)
  )
)

val inputDF = spark.read
  .schema(schema)
  .option("multiline", "true")
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Simple_Multi_line.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

Output:

Show DataFrame schema and data


root
|-- ID: integer (nullable = true)
|-- NAME: string (nullable = true)
|-- GENDER: string (nullable = true)
|-- DOB: date (nullable = true)
|-- SALARY: double (nullable = true)
|-- NRI: boolean (nullable = true)

inputDF:
+---+--------+------+----------+--------+----+
|ID |NAME    |GENDER|DOB       |SALARY  |NRI |
+---+--------+------+----------+--------+----+
|1  |Jourdan |Female|2012-01-01|82445.63|null|
|2  |Alvera  |Female|2023-08-08|75985.14|true|
|3  |Chauncey|Male  |2010-09-17|81600.32|null|
|4  |Karrie  |Female|2024-02-28|93889.24|null|
|5  |Phil    |Female|2022-06-06|99743.67|true|
+---+--------+------+----------+--------+----+
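Had the multiline option been left out for this file, each physical line would be parsed as its own JSON document and the records would fail to parse. A hedged sketch of surfacing such failures with the reader's corrupt-record options (the column name below is Spark's default):

// Sketch: reading the multiline file WITHOUT the multiline option.
// Unparseable lines are kept in the corrupt-record column instead of being dropped.
val corruptDF = spark.read
  .option("mode", "PERMISSIVE") // default mode; keeps bad records alongside the raw text
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema.add("_corrupt_record", StringType)) // schema must include the corrupt-record column
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Simple_Multi_line.json")
corruptDF.show(false)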

3. Read nested JSON data into DataFrame

[
{
"ID": 2,
"NAME": "Jane Smith",
"AGE": 35,
"HEIGHT": 5.6,
"WEIGHT": 155.0,
"IS_STUDENT": false,
"DOB": "1987-09-20",
"ADDRESS": {
"STREET": "456 Oak St",
"CITY": "Othertown",
"STATE": "CA",
"ZIPCODE": "54321"
},
"GREADES": [75, 85, 90],
"SALARY": 85000.75,
"IS_MANAGER": true
},
{
"ID": 3,
"NAME": "Alice Johnson",
"AGE": 28,
"HEIGHT": 5.4,
"WEIGHT": 140.0,
"IS_STUDENT": true,
"DOB": "1993-03-10",
"ADDRESS": {
"STREET": "789 Pine St",
"CITY": "Smalltown",
"STATE": "TX",
"ZIPCODE": "67890"
},
"GREADES": [90, 95, 100],
"SALARY": 65000.25,
"IS_MANAGER": false
},
{
"ID": 4,
"NAME": "Robert Brown",
"AGE": 40,
"HEIGHT": 6.0,
"WEIGHT": 180.0,
"IS_STUDENT": false,
"DOB": "1982-12-05",
"ADDRESS": {
"STREET": "101 Elm St",
"CITY": "Villagetown",
"STATE": "IL",
"ZIPCODE": "98765"
},
"GREADES": [80, 85, 90],
"SALARY": 90000.00,
"IS_MANAGER": true
},
{
"ID": 5,
"NAME": "Emily Lee",
"AGE": 25,
"HEIGHT": 5.8,
"WEIGHT": 160.0,
"IS_STUDENT": true,
"DOB": "1996-07-08",
"ADDRESS": {
"STREET": "321 Maple St",
"CITY": "Hometown",
"STATE": "FL",
"ZIPCODE": "54321"
},
"GREADES": [95, 95, 95],
"SALARY": 60000.50,
"IS_MANAGER": false
},
{
"ID": 6,
"NAME": "Michael Davis",
"AGE": 45,
"HEIGHT": 6.2,
"WEIGHT": 190.0,
"IS_STUDENT": false,
"DOB": "1977-11-15",
"ADDRESS": {
"STREET": "567 Cedar St",
"CITY": "Mountainview",
"STATE": "CA",
"ZIPCODE": "12345"
},
"GREADES": [70, 75, 80],
"SALARY": 100000.00,
"IS_MANAGER": true
}
]

Code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

var inputDF = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Nested_Data.json")

inputDF = inputDF.withColumn("DOB", col("DOB").cast(DateType))

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.select("*").show(false)

println("Splitting nested fields in ADDRESS Column")
val splitAddressDF = inputDF
  .selectExpr("ID", "NAME", "DOB", "AGE", "SALARY", "HEIGHT", "WEIGHT", "IS_MANAGER", "GRADES", "ADDRESS.*")
splitAddressDF.printSchema()
splitAddressDF.show(false)

println("Exploding GRADES Array into separate rows")
val explodedDF = splitAddressDF.withColumn("GRADE", explode(col("GRADES"))).drop("GRADES")
explodedDF.show(false)

Output:

Show DataFrame schema and data


root
|-- ADDRESS: struct (nullable = true)
| |-- CITY: string (nullable = true)
| |-- STATE: string (nullable = true)
| |-- STREET: string (nullable = true)
| |-- ZIPCODE: string (nullable = true)
|-- AGE: long (nullable = true)
|-- DOB: date (nullable = true)
|-- GRADES: array (nullable = true)
| |-- element: long (containsNull = true)
|-- HEIGHT: double (nullable = true)
|-- ID: long (nullable = true)
|-- IS_MANAGER: boolean (nullable = true)
|-- IS_STUDENT: boolean (nullable = true)
|-- NAME: string (nullable = true)
|-- SALARY: double (nullable = true)
|-- WEIGHT: double (nullable = true)
inputDF:
+---------------------------------------+---+----------+-------------+------+---+----------+----------+-------------+--------+------+
|ADDRESS                                |AGE|DOB       |GRADES       |HEIGHT|ID |IS_MANAGER|IS_STUDENT|NAME         |SALARY  |WEIGHT|
+---------------------------------------+---+----------+-------------+------+---+----------+----------+-------------+--------+------+
|{Othertown, CA, 456 Oak St, 54321}     |35 |1987-09-20|[75, 85, 90] |5.6   |2  |true      |false     |Jane Smith   |85000.75|155.0 |
|{Smalltown, TX, 789 Pine St, 67890}    |28 |1993-03-10|[90, 95, 100]|5.4   |3  |false     |true      |Alice Johnson|65000.25|140.0 |
|{Villagetown, IL, 101 Elm St, 98765}   |40 |1982-12-05|[80, 85, 90] |6.0   |4  |true      |false     |Robert Brown |90000.0 |180.0 |
|{Hometown, FL, 321 Maple St, 54321}    |25 |1996-07-08|[95, 95, 95] |5.8   |5  |false     |true      |Emily Lee    |60000.5 |160.0 |
|{Mountainview, CA, 567 Cedar St, 12345}|45 |1977-11-15|[70, 75, 80] |6.2   |6  |true      |false     |Michael Davis|100000.0|190.0 |
+---------------------------------------+---+----------+-------------+------+---+----------+----------+-------------+--------+------+
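When only a few nested values are needed, the struct fields can also be referenced directly with dot notation instead of flattening the whole ADDRESS column; a small sketch against the inputDF above:

// Dot notation pulls individual fields out of the ADDRESS struct.
inputDF.select(col("NAME"), col("ADDRESS.CITY"), col("ADDRESS.STATE")).show(false)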

Splitting nested fields in ADDRESS Column


root
|-- ID: long (nullable = true)
|-- NAME: string (nullable = true)
|-- DOB: date (nullable = true)
|-- AGE: long (nullable = true)
|-- SALARY: double (nullable = true)
|-- HEIGHT: double (nullable = true)
|-- WEIGHT: double (nullable = true)
|-- IS_MANAGER: boolean (nullable = true)
|-- GRADES: array (nullable = true)
| |-- element: long (containsNull = true)
|-- CITY: string (nullable = true)
|-- STATE: string (nullable = true)
|-- STREET: string (nullable = true)
|-- ZIPCODE: string (nullable = true)

+---+-------------+----------+---+--------+------+------+----------+-------------+------------+-----+------------+-------+
|ID |NAME         |DOB       |AGE|SALARY  |HEIGHT|WEIGHT|IS_MANAGER|GRADES       |CITY        |STATE|STREET      |ZIPCODE|
+---+-------------+----------+---+--------+------+------+----------+-------------+------------+-----+------------+-------+
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |[75, 85, 90] |Othertown   |CA   |456 Oak St  |54321  |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |[90, 95, 100]|Smalltown   |TX   |789 Pine St |67890  |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |[80, 85, 90] |Villagetown |IL   |101 Elm St  |98765  |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |[95, 95, 95] |Hometown    |FL   |321 Maple St|54321  |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |[70, 75, 80] |Mountainview|CA   |567 Cedar St|12345  |
+---+-------------+----------+---+--------+------+------+----------+-------------+------------+-----+------------+-------+

Exploding GRADES Array into separate rows


+---+-------------+----------+---+--------+------+------+----------+------------+-----+------------+-------+-----+
|ID |NAME         |DOB       |AGE|SALARY  |HEIGHT|WEIGHT|IS_MANAGER|CITY        |STATE|STREET      |ZIPCODE|GRADE|
+---+-------------+----------+---+--------+------+------+----------+------------+-----+------------+-------+-----+
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |Othertown   |CA   |456 Oak St  |54321  |75   |
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |Othertown   |CA   |456 Oak St  |54321  |85   |
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |Othertown   |CA   |456 Oak St  |54321  |90   |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |Smalltown   |TX   |789 Pine St |67890  |90   |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |Smalltown   |TX   |789 Pine St |67890  |95   |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |Smalltown   |TX   |789 Pine St |67890  |100  |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |Villagetown |IL   |101 Elm St  |98765  |80   |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |Villagetown |IL   |101 Elm St  |98765  |85   |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |Villagetown |IL   |101 Elm St  |98765  |90   |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |Hometown    |FL   |321 Maple St|54321  |95   |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |Hometown    |FL   |321 Maple St|54321  |95   |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |Hometown    |FL   |321 Maple St|54321  |95   |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |Mountainview|CA   |567 Cedar St|12345  |70   |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |Mountainview|CA   |567 Cedar St|12345  |75   |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |Mountainview|CA   |567 Cedar St|12345  |80   |
+---+-------------+----------+---+--------+------+------+----------+------------+-----+------------+-------+-----+
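explode discards each grade's position in the original array. If the position matters, posexplode (sketched below; GRADE_POS is just an illustrative column name) emits the index alongside the value:

// posexplode produces two columns per array element: its index and its value.
val explodedWithPosDF = splitAddressDF
  .select(col("ID"), col("NAME"), posexplode(col("GRADES")).as(Seq("GRADE_POS", "GRADE")))
explodedWithPosDF.show(false)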

4. Read nested JSON with arrays of structs into DataFrame

{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}

Code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val schema = StructType(
  Array(
    StructField("id", StringType),
    StructField("type", StringType),
    StructField("name", StringType),
    StructField("ppu", DoubleType),
    StructField("batters", StructType(
      Array(
        StructField("batter", ArrayType(StructType(
          Array(
            StructField("id", StringType),
            StructField("type", StringType)
          )
        )))
      )
    )),
    StructField("topping", ArrayType(StructType(
      Array(
        StructField("id", StringType),
        StructField("type", StringType)
      )
    )))
  )
)

val inputDF = spark.read
  .schema(schema)
  .option("multiline", "true")
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\DONUT_JSON.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

val sampleDF = inputDF.withColumnRenamed("id", "key")

println("creating a separate row for each element of “batter” array by exploding “batter” column and \nExtract the individual elements from the “new_batter” struct")
val finalBatDF = sampleDF
  .select(col("key"), explode(col("batters.batter")).alias("new_batter"))
  .select("key", "new_batter.*")
  .withColumnRenamed("id", "bat_id")
  .withColumnRenamed("type", "bat_type")
finalBatDF.show(false)

println("Convert Nested “toppings” to Structured DataFrame")
val topDF = sampleDF
  .select(col("key"), explode(col("topping")).alias("new_topping"))
  .select("key", "new_topping.*")
  .withColumnRenamed("id", "top_id")
  .withColumnRenamed("type", "top_type")
topDF.show(false)

println("Explode the batters array")
val explodedBattersDF = inputDF.select(col("id"), col("type"), col("name"), col("ppu"),
  explode(col("batters.batter")).as("batter"), col("topping"))
println("explodedBattersDF")
explodedBattersDF.show(100, false)

println("Explode the topping array")
val explodedToppingDF = explodedBattersDF.select(col("id"), col("type"), col("name"), col("ppu"),
  col("batter.id").as("batter_id"), col("batter.type").as("batter_type"),
  explode(col("topping")).as("topping"))
println("explodedToppingDF:")
explodedToppingDF.show(100, false)

println("Select the desired columns to form the complete DataFrame")
val completeDF = explodedToppingDF.select(col("id"), col("type"), col("name"), col("ppu"),
  col("batter_id"), col("batter_type"), col("topping.id").as("topping_id"),
  col("topping.type").as("topping_type"))
completeDF.show(100, false)

Output:

Show DataFrame schema and data


root
|-- id: string (nullable = true)
|-- type: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: double (nullable = true)
|-- batters: struct (nullable = true)
| |-- batter: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- type: string (nullable = true)
|-- topping: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)

inputDF:
+----+-----+----+----+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|id  |type |name|ppu |batters                                                  |topping                                                                                                                                  |
+----+-----+----+----+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|0001|donut|Cake|0.55|{[{1001, Regular}, {1002, Chocolate}, {1003, Blueberry}]}|[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
+----+-----+----+----+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+

creating a separate row for each element of “batter” array by exploding “batter” column and
Extract the individual elements from the “new_batter” struct
+----+------+---------+
|key |bat_id|bat_type |
+----+------+---------+
|0001|1001 |Regular |
|0001|1002 |Chocolate|
|0001|1003 |Blueberry|
+----+------+---------+

Convert Nested “toppings” to Structured DataFrame


+----+------+------------------------+
|key |top_id|top_type |
+----+------+------------------------+
|0001|5001 |None |
|0001|5002 |Glazed |
|0001|5005 |Sugar |
|0001|5007 |Powdered Sugar |
|0001|5006 |Chocolate with Sprinkles|
|0001|5003 |Chocolate |
|0001|5004 |Maple |
+----+------+------------------------+
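Since finalBatDF and topDF both keep the donut key, a hedged alternative to the chained explodes shown next is simply joining the two flattened frames, which produces the same 21 batter/topping combinations:

// Join the two flattened DataFrames on the donut key.
// 3 batter rows x 7 topping rows for key 0001 => 21 combinations.
val combinedDF = finalBatDF.join(topDF, Seq("key"))
combinedDF.show(100, false)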

Explode the batters array


explodedBattersDF
+----+-----+----+----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|id  |type |name|ppu |batter           |topping                                                                                                                                  |
+----+-----+----+----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|0001|donut|Cake|0.55|{1001, Regular}  |[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
|0001|donut|Cake|0.55|{1002, Chocolate}|[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
|0001|donut|Cake|0.55|{1003, Blueberry}|[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
+----+-----+----+----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------+
Explode the topping array
explodedToppingDF:
+----+-----+----+----+---------+-----------+--------------------------------+
|id |type |name|ppu |batter_id|batter_type|topping |
+----+-----+----+----+---------+-----------+--------------------------------+
|0001|donut|Cake|0.55|1001 |Regular |{5001, None} |
|0001|donut|Cake|0.55|1001 |Regular |{5002, Glazed} |
|0001|donut|Cake|0.55|1001 |Regular |{5005, Sugar} |
|0001|donut|Cake|0.55|1001 |Regular |{5007, Powdered Sugar} |
|0001|donut|Cake|0.55|1001 |Regular |{5006, Chocolate with Sprinkles}|
|0001|donut|Cake|0.55|1001 |Regular |{5003, Chocolate} |
|0001|donut|Cake|0.55|1001 |Regular |{5004, Maple} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5001, None} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5002, Glazed} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5005, Sugar} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5007, Powdered Sugar} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5006, Chocolate with Sprinkles}|
|0001|donut|Cake|0.55|1002 |Chocolate |{5003, Chocolate} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5004, Maple} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5001, None} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5002, Glazed} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5005, Sugar} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5007, Powdered Sugar} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5006, Chocolate with Sprinkles}|
|0001|donut|Cake|0.55|1003 |Blueberry |{5003, Chocolate} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5004, Maple} |
+----+-----+----+----+---------+-----------+--------------------------------+

Select the desired columns to form the complete DataFrame


+----+-----+----+----+---------+-----------+----------+------------------------+
|id |type |name|ppu |batter_id|batter_type|topping_id|topping_type |
+----+-----+----+----+---------+-----------+----------+------------------------+
|0001|donut|Cake|0.55|1001 |Regular |5001 |None |
|0001|donut|Cake|0.55|1001 |Regular |5002 |Glazed |
|0001|donut|Cake|0.55|1001 |Regular |5005 |Sugar |
|0001|donut|Cake|0.55|1001 |Regular |5007 |Powdered Sugar |
|0001|donut|Cake|0.55|1001 |Regular |5006 |Chocolate with Sprinkles|
|0001|donut|Cake|0.55|1001 |Regular |5003 |Chocolate |
|0001|donut|Cake|0.55|1001 |Regular |5004 |Maple |
|0001|donut|Cake|0.55|1002 |Chocolate |5001 |None |
|0001|donut|Cake|0.55|1002 |Chocolate |5002 |Glazed |
|0001|donut|Cake|0.55|1002 |Chocolate |5005 |Sugar |
|0001|donut|Cake|0.55|1002 |Chocolate |5007 |Powdered Sugar |
|0001|donut|Cake|0.55|1002 |Chocolate |5006 |Chocolate with Sprinkles|
|0001|donut|Cake|0.55|1002 |Chocolate |5003 |Chocolate |
|0001|donut|Cake|0.55|1002 |Chocolate |5004 |Maple |
|0001|donut|Cake|0.55|1003 |Blueberry |5001 |None |
|0001|donut|Cake|0.55|1003 |Blueberry |5002 |Glazed |
|0001|donut|Cake|0.55|1003 |Blueberry |5005 |Sugar |
|0001|donut|Cake|0.55|1003 |Blueberry |5007 |Powdered Sugar |
|0001|donut|Cake|0.55|1003 |Blueberry |5006 |Chocolate with Sprinkles|
|0001|donut|Cake|0.55|1003 |Blueberry |5003 |Chocolate |
|0001|donut|Cake|0.55|1003 |Blueberry |5004 |Maple |
+----+-----+----+----+---------+-----------+----------+------------------------+
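Once flattened, the result can be persisted like any other DataFrame; a small sketch writing completeDF out as Parquet (the output directory is illustrative only):

// Write the flattened result; the path below is just an example.
completeDF.write
  .mode("overwrite")
  .parquet("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\donut_flat_parquet")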
