JSON to DataFrame


1. Read JSON file into DataFrame

Sample Data:

{"ID":1,"NAME":"Jourdan","GENDER":"Female","DOB":"2012-01-01","SALARY":82445.63,"NRI":null}
{"ID":2,"NAME":"Alvera","GENDER":"Female","DOB":"2023-08-08","SALARY":75985.14,"NRI":true}
{"ID":3,"NAME":"Chauncey","GENDER":"Male","DOB":"2010-09-17","SALARY":81600.32,"NRI":null}
{"ID":4,"NAME":"Karrie","GENDER":"Female","DOB":"2024-02-28","SALARY":93889.24,"NRI":null}
{"ID":5,"NAME":"Phil","GENDER":"Female","DOB":"2022-06-06","SALARY":99743.67,"NRI":true}

1.1 Read JSON file without specifying the schema

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val inputDF = spark.read
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Single_Line.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

Output:

Show DataFrame schema and data


root
|-- DOB: string (nullable = true) // Spark does not infer the date type when no schema is specified
|-- GENDER: string (nullable = true)
|-- ID: long (nullable = true)
|-- NAME: string (nullable = true)
|-- NRI: boolean (nullable = true)
|-- SALARY: double (nullable = true)

inputDF:
+----------+------+---+--------+----+--------+
|DOB |GENDER|ID |NAME |NRI |SALARY |
+----------+------+---+--------+----+--------+
|2012-01-01|Female|1 |Jourdan |null|82445.63|
|2023-08-08|Female|2 |Alvera |true|75985.14|
|2010-09-17|Male |3 |Chauncey|null|81600.32|
|2024-02-28|Female|4 |Karrie |null|93889.24|
|2022-06-06|Female|5 |Phil |true|99743.67|
+----------+------+---+--------+----+--------+
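Note that DOB was inferred as a string. When defining a full schema is not worth it, one option is to cast the column after the read; a minimal sketch (typedDF is just an illustrative name) using to_date from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{col, to_date}

// Convert the inferred string column to a proper date after the read.
// "yyyy-MM-dd" matches the sample data above; adjust the pattern if the file differs.
val typedDF = inputDF.withColumn("DOB", to_date(col("DOB"), "yyyy-MM-dd"))
typedDF.printSchema() // DOB: date (nullable = true)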

1.2 Read JSON file with specifying the schema

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val schema = StructType(
  Array(
    StructField("ID", IntegerType),
    StructField("NAME", StringType),
    StructField("GENDER", StringType),
    StructField("DOB", DateType),
    StructField("SALARY", DoubleType),
    StructField("NRI", BooleanType)
  )
)

val inputDF = spark.read
  .schema(schema)
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Single_Line.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

Output:

Show DataFrame schema and data


root
|-- ID: integer (nullable = true)
|-- NAME: string (nullable = true)
|-- GENDER: string (nullable = true)
|-- DOB: date (nullable = true) // date type comes from the user-specified schema
|-- SALARY: double (nullable = true)
|-- NRI: boolean (nullable = true)

inputDF:
+---+--------+------+----------+--------+----+
|ID |NAME |GENDER|DOB |SALARY |NRI |
+---+--------+------+----------+--------+----+
|1 |Jourdan |Female|2012-01-01|82445.63|null|
|2 |Alvera |Female|2023-08-08|75985.14|true|
|3 |Chauncey|Male |2010-09-17|81600.32|null|
|4 |Karrie |Female|2024-02-28|93889.24|null|
|5 |Phil |Female|2022-06-06|99743.67|true|
+---+--------+------+----------+--------+----+
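With an explicit schema the columns also come back in the schema's order rather than alphabetically. If a file's dates were not in the default yyyy-MM-dd form, the JSON reader's dateFormat option could be added; a hedged sketch, where the dd-MM-yyyy pattern and the file name are purely hypothetical:

// Sketch only: assumes a hypothetical file whose DOB values look like "01-01-2012".
val inputDF = spark.read
  .schema(schema)
  .option("dateFormat", "dd-MM-yyyy") // pattern used to parse DateType fields
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Single_Line_ddMMyyyy.json") // hypothetical path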

2. Read multiline JSON file into DataFrame

Sample Data:

[{"ID":1,
"NAME":"Jourdan",
"GENDER":"Female",
"DOB":"2012-01-01",
"SALARY":82445.63,
"NRI":null
},
{"ID":2,
"NAME":"Alvera",
"GENDER":"Female",
"DOB":"2023-08-08",
"SALARY":75985.14,
"NRI":true
},
{"ID":3,
"NAME":"Chauncey",
"GENDER":"Male",
"DOB":"2010-09-17",
"SALARY":81600.32,
"NRI":null
},
{"ID":4,
"NAME":"Karrie",
"GENDER":"Female",
"DOB":"2024-02-28",
"SALARY":93889.24,
"NRI":null
},
{"ID":5,
"NAME":"Phil",
"GENDER":"Female",
"DOB":"2022-06-06",
"SALARY":99743.67,
"NRI":true
}]

Code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val schema = StructType(
  Array(
    StructField("ID", IntegerType),
    StructField("NAME", StringType),
    StructField("GENDER", StringType),
    StructField("DOB", DateType),
    StructField("SALARY", DoubleType),
    StructField("NRI", BooleanType)
  )
)

val inputDF = spark.read
  .schema(schema)
  .option("multiline", "true")
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Simple_Multi_line.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

Output:

Show DataFrame schema and data


root
|-- ID: integer (nullable = true)
|-- NAME: string (nullable = true)
|-- GENDER: string (nullable = true)
|-- DOB: date (nullable = true)
|-- SALARY: double (nullable = true)
|-- NRI: boolean (nullable = true)

inputDF:
+---+--------+------+----------+--------+----+
|ID |NAME    |GENDER|DOB       |SALARY  |NRI |
+---+--------+------+----------+--------+----+
|1  |Jourdan |Female|2012-01-01|82445.63|null|
|2  |Alvera  |Female|2023-08-08|75985.14|true|
|3  |Chauncey|Male  |2010-09-17|81600.32|null|
|4  |Karrie  |Female|2024-02-28|93889.24|null|
|5  |Phil    |Female|2022-06-06|99743.67|true|
+---+--------+------+----------+--------+----+
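Had the multiline option been left out for this file, each physical line would be parsed as its own JSON document and the records would fail to parse. A hedged sketch of surfacing such failures with the reader's corrupt-record options (the column name below is Spark's default):

// Sketch: reading the multiline file WITHOUT the multiline option.
// Unparseable lines are kept in the corrupt-record column instead of being dropped.
val corruptDF = spark.read
  .option("mode", "PERMISSIVE") // default mode; keeps bad records alongside the raw text
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema.add("_corrupt_record", StringType)) // schema must include the corrupt-record column
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Simple_Multi_line.json")
corruptDF.show(false)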

3. Read nested JSON data into DataFrame

[
{
"ID": 2,
"NAME": "Jane Smith",
"AGE": 35,
"HEIGHT": 5.6,
"WEIGHT": 155.0,
"IS_STUDENT": false,
"DOB": "1987-09-20",
"ADDRESS": {
"STREET": "456 Oak St",
"CITY": "Othertown",
"STATE": "CA",
"ZIPCODE": "54321"
},
"GREADES": [75, 85, 90],
"SALARY": 85000.75,
"IS_MANAGER": true
},
{
"ID": 3,
"NAME": "Alice Johnson",
"AGE": 28,
"HEIGHT": 5.4,
"WEIGHT": 140.0,
"IS_STUDENT": true,
"DOB": "1993-03-10",
"ADDRESS": {
"STREET": "789 Pine St",
"CITY": "Smalltown",
"STATE": "TX",
"ZIPCODE": "67890"
},
"GREADES": [90, 95, 100],
"SALARY": 65000.25,
"IS_MANAGER": false
},
{
"ID": 4,
"NAME": "Robert Brown",
"AGE": 40,
"HEIGHT": 6.0,
"WEIGHT": 180.0,
"IS_STUDENT": false,
"DOB": "1982-12-05",
"ADDRESS": {
"STREET": "101 Elm St",
"CITY": "Villagetown",
"STATE": "IL",
"ZIPCODE": "98765"
},
"GREADES": [80, 85, 90],
"SALARY": 90000.00,
"IS_MANAGER": true
},
{
"ID": 5,
"NAME": "Emily Lee",
"AGE": 25,
"HEIGHT": 5.8,
"WEIGHT": 160.0,
"IS_STUDENT": true,
"DOB": "1996-07-08",
"ADDRESS": {
"STREET": "321 Maple St",
"CITY": "Hometown",
"STATE": "FL",
"ZIPCODE": "54321"
},
"GREADES": [95, 95, 95],
"SALARY": 60000.50,
"IS_MANAGER": false
},
{
"ID": 6,
"NAME": "Michael Davis",
"AGE": 45,
"HEIGHT": 6.2,
"WEIGHT": 190.0,
"IS_STUDENT": false,
"DOB": "1977-11-15",
"ADDRESS": {
"STREET": "567 Cedar St",
"CITY": "Mountainview",
"STATE": "CA",
"ZIPCODE": "12345"
},
"GREADES": [70, 75, 80],
"SALARY": 100000.00,
"IS_MANAGER": true
}
]

Code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

var inputDF = spark.read
  .option("multiline", "true")
  .option("inferSchema", "true")
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\Nested_Data.json")

inputDF = inputDF.withColumn("DOB", col("DOB").cast(DateType))

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.select("*").show(false)

println("Splitting nested fields in ADDRESS Column")
val splitAddressDF = inputDF
  .selectExpr("ID", "NAME", "DOB", "AGE", "SALARY", "HEIGHT", "WEIGHT", "IS_MANAGER", "GRADES", "ADDRESS.*")
splitAddressDF.printSchema()
splitAddressDF.show(false)

println("Exploding GRADES Array into separate rows")
val explodedDF = splitAddressDF.withColumn("GRADE", explode(col("GRADES"))).drop("GRADES")
explodedDF.show(false)

Output:

Show DataFrame schema and data


root
|-- ADDRESS: struct (nullable = true)
| |-- CITY: string (nullable = true)
| |-- STATE: string (nullable = true)
| |-- STREET: string (nullable = true)
| |-- ZIPCODE: string (nullable = true)
|-- AGE: long (nullable = true)
|-- DOB: date (nullable = true)
|-- GRADES: array (nullable = true)
| |-- element: long (containsNull = true)
|-- HEIGHT: double (nullable = true)
|-- ID: long (nullable = true)
|-- IS_MANAGER: boolean (nullable = true)
|-- IS_STUDENT: boolean (nullable = true)
|-- NAME: string (nullable = true)
|-- SALARY: double (nullable = true)
|-- WEIGHT: double (nullable = true)
inputDF:
+---------------------------------------+---+----------+-------------+------+---+----------+----------+-------------+--------+------+
|ADDRESS                                |AGE|DOB       |GRADES       |HEIGHT|ID |IS_MANAGER|IS_STUDENT|NAME         |SALARY  |WEIGHT|
+---------------------------------------+---+----------+-------------+------+---+----------+----------+-------------+--------+------+
|{Othertown, CA, 456 Oak St, 54321}     |35 |1987-09-20|[75, 85, 90] |5.6   |2  |true      |false     |Jane Smith   |85000.75|155.0 |
|{Smalltown, TX, 789 Pine St, 67890}    |28 |1993-03-10|[90, 95, 100]|5.4   |3  |false     |true      |Alice Johnson|65000.25|140.0 |
|{Villagetown, IL, 101 Elm St, 98765}   |40 |1982-12-05|[80, 85, 90] |6.0   |4  |true      |false     |Robert Brown |90000.0 |180.0 |
|{Hometown, FL, 321 Maple St, 54321}    |25 |1996-07-08|[95, 95, 95] |5.8   |5  |false     |true      |Emily Lee    |60000.5 |160.0 |
|{Mountainview, CA, 567 Cedar St, 12345}|45 |1977-11-15|[70, 75, 80] |6.2   |6  |true      |false     |Michael Davis|100000.0|190.0 |
+---------------------------------------+---+----------+-------------+------+---+----------+----------+-------------+--------+------+
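When only a few nested values are needed, the struct fields can also be referenced directly with dot notation instead of flattening the whole ADDRESS column; a small sketch against the inputDF above:

// Dot notation pulls individual fields out of the ADDRESS struct.
inputDF.select(col("NAME"), col("ADDRESS.CITY"), col("ADDRESS.STATE")).show(false)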

Splitting nested fields in ADDRESS Column


root
|-- ID: long (nullable = true)
|-- NAME: string (nullable = true)
|-- DOB: date (nullable = true)
|-- AGE: long (nullable = true)
|-- SALARY: double (nullable = true)
|-- HEIGHT: double (nullable = true)
|-- WEIGHT: double (nullable = true)
|-- IS_MANAGER: boolean (nullable = true)
|-- GRADES: array (nullable = true)
| |-- element: long (containsNull = true)
|-- CITY: string (nullable = true)
|-- STATE: string (nullable = true)
|-- STREET: string (nullable = true)
|-- ZIPCODE: string (nullable = true)

+---+-------------+----------+---+--------+------+------+----------+-------------+------------+-----+------------+-------+
|ID |NAME         |DOB       |AGE|SALARY  |HEIGHT|WEIGHT|IS_MANAGER|GRADES       |CITY        |STATE|STREET      |ZIPCODE|
+---+-------------+----------+---+--------+------+------+----------+-------------+------------+-----+------------+-------+
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |[75, 85, 90] |Othertown   |CA   |456 Oak St  |54321  |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |[90, 95, 100]|Smalltown   |TX   |789 Pine St |67890  |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |[80, 85, 90] |Villagetown |IL   |101 Elm St  |98765  |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |[95, 95, 95] |Hometown    |FL   |321 Maple St|54321  |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |[70, 75, 80] |Mountainview|CA   |567 Cedar St|12345  |
+---+-------------+----------+---+--------+------+------+----------+-------------+------------+-----+------------+-------+

Exploding GRADES Array into separate rows


+---+-------------+----------+---+--------+------+------+----------+------------+-----+------------+-------+-----+
|ID |NAME         |DOB       |AGE|SALARY  |HEIGHT|WEIGHT|IS_MANAGER|CITY        |STATE|STREET      |ZIPCODE|GRADE|
+---+-------------+----------+---+--------+------+------+----------+------------+-----+------------+-------+-----+
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |Othertown   |CA   |456 Oak St  |54321  |75   |
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |Othertown   |CA   |456 Oak St  |54321  |85   |
|2  |Jane Smith   |1987-09-20|35 |85000.75|5.6   |155.0 |true      |Othertown   |CA   |456 Oak St  |54321  |90   |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |Smalltown   |TX   |789 Pine St |67890  |90   |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |Smalltown   |TX   |789 Pine St |67890  |95   |
|3  |Alice Johnson|1993-03-10|28 |65000.25|5.4   |140.0 |false     |Smalltown   |TX   |789 Pine St |67890  |100  |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |Villagetown |IL   |101 Elm St  |98765  |80   |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |Villagetown |IL   |101 Elm St  |98765  |85   |
|4  |Robert Brown |1982-12-05|40 |90000.0 |6.0   |180.0 |true      |Villagetown |IL   |101 Elm St  |98765  |90   |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |Hometown    |FL   |321 Maple St|54321  |95   |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |Hometown    |FL   |321 Maple St|54321  |95   |
|5  |Emily Lee    |1996-07-08|25 |60000.5 |5.8   |160.0 |false     |Hometown    |FL   |321 Maple St|54321  |95   |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |Mountainview|CA   |567 Cedar St|12345  |70   |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |Mountainview|CA   |567 Cedar St|12345  |75   |
|6  |Michael Davis|1977-11-15|45 |100000.0|6.2   |190.0 |true      |Mountainview|CA   |567 Cedar St|12345  |80   |
+---+-------------+----------+---+--------+------+------+----------+------------+-----+------------+-------+-----+
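explode discards each grade's position in the original array. If the position matters, posexplode (sketched below; GRADE_POS is just an illustrative column name) emits the index alongside the value:

// posexplode produces two columns per array element: its index and its value.
val explodedWithPosDF = splitAddressDF
  .select(col("ID"), col("NAME"), posexplode(col("GRADES")).as(Seq("GRADE_POS", "GRADE")))
explodedWithPosDF.show(false)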

4. Read nested JSON with arrays of structs into DataFrame

{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}

Code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local").appName("Reading JSON").getOrCreate()

val schema = StructType(
  Array(
    StructField("id", StringType),
    StructField("type", StringType),
    StructField("name", StringType),
    StructField("ppu", DoubleType),
    StructField("batters", StructType(
      Array(
        StructField("batter", ArrayType(StructType(
          Array(
            StructField("id", StringType),
            StructField("type", StringType)
          )
        )))
      )
    )),
    StructField("topping", ArrayType(StructType(
      Array(
        StructField("id", StringType),
        StructField("type", StringType)
      )
    )))
  )
)

val inputDF = spark.read
  .schema(schema)
  .option("multiline", "true")
  .json("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\DONUT_JSON.json")

println("Show DataFrame schema and data")
inputDF.printSchema()

println("inputDF:")
inputDF.show(false)

val sampleDF = inputDF.withColumnRenamed("id", "key")

println("creating a separate row for each element of “batter” array by exploding “batter” column and \nExtract the individual elements from the “new_batter” struct")
val finalBatDF = sampleDF
  .select(col("key"), explode(col("batters.batter")).alias("new_batter"))
  .select("key", "new_batter.*")
  .withColumnRenamed("id", "bat_id")
  .withColumnRenamed("type", "bat_type")
finalBatDF.show(false)

println("Convert Nested “toppings” to Structured DataFrame")
val topDF = sampleDF
  .select(col("key"), explode(col("topping")).alias("new_topping"))
  .select("key", "new_topping.*")
  .withColumnRenamed("id", "top_id")
  .withColumnRenamed("type", "top_type")
topDF.show(false)

println("Explode the batters array")
val explodedBattersDF = inputDF.select(col("id"), col("type"), col("name"), col("ppu"),
  explode(col("batters.batter")).as("batter"), col("topping"))
println("explodedBattersDF")
explodedBattersDF.show(100, false)

println("Explode the topping array")
val explodedToppingDF = explodedBattersDF.select(col("id"), col("type"), col("name"), col("ppu"),
  col("batter.id").as("batter_id"), col("batter.type").as("batter_type"),
  explode(col("topping")).as("topping"))
println("explodedToppingDF:")
explodedToppingDF.show(100, false)

println("Select the desired columns to form the complete DataFrame")
val completeDF = explodedToppingDF.select(col("id"), col("type"), col("name"), col("ppu"),
  col("batter_id"), col("batter_type"), col("topping.id").as("topping_id"),
  col("topping.type").as("topping_type"))
completeDF.show(100, false)

Output:

Show DataFrame schema and data


root
|-- id: string (nullable = true)
|-- type: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: double (nullable = true)
|-- batters: struct (nullable = true)
| |-- batter: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- type: string (nullable = true)
|-- topping: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)

inputDF:
+----+-----+----+----+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|id  |type |name|ppu |batters                                                  |topping                                                                                                                                  |
+----+-----+----+----+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|0001|donut|Cake|0.55|{[{1001, Regular}, {1002, Chocolate}, {1003, Blueberry}]}|[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
+----+-----+----+----+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+

creating a separate row for each element of “batter” array by exploding “batter” column and
Extract the individual elements from the “new_batter” struct
+----+------+---------+
|key |bat_id|bat_type |
+----+------+---------+
|0001|1001 |Regular |
|0001|1002 |Chocolate|
|0001|1003 |Blueberry|
+----+------+---------+

Convert Nested “toppings” to Structured DataFrame


+----+------+------------------------+
|key |top_id|top_type |
+----+------+------------------------+
|0001|5001 |None |
|0001|5002 |Glazed |
|0001|5005 |Sugar |
|0001|5007 |Powdered Sugar |
|0001|5006 |Chocolate with Sprinkles|
|0001|5003 |Chocolate |
|0001|5004 |Maple |
+----+------+------------------------+
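Since finalBatDF and topDF both keep the donut key, a hedged alternative to the chained explodes shown next is simply joining the two flattened frames, which produces the same 21 batter/topping combinations:

// Join the two flattened DataFrames on the donut key.
// 3 batter rows x 7 topping rows for key 0001 => 21 combinations.
val combinedDF = finalBatDF.join(topDF, Seq("key"))
combinedDF.show(100, false)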

Explode the batters array


explodedBattersDF
+----+-----+----+----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|id  |type |name|ppu |batter           |topping                                                                                                                                  |
+----+-----+----+----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------+
|0001|donut|Cake|0.55|{1001, Regular}  |[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
|0001|donut|Cake|0.55|{1002, Chocolate}|[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
|0001|donut|Cake|0.55|{1003, Blueberry}|[{5001, None}, {5002, Glazed}, {5005, Sugar}, {5007, Powdered Sugar}, {5006, Chocolate with Sprinkles}, {5003, Chocolate}, {5004, Maple}]|
+----+-----+----+----+-----------------+-----------------------------------------------------------------------------------------------------------------------------------------+
Explode the topping array
explodedToppingDF:
+----+-----+----+----+---------+-----------+--------------------------------+
|id |type |name|ppu |batter_id|batter_type|topping |
+----+-----+----+----+---------+-----------+--------------------------------+
|0001|donut|Cake|0.55|1001 |Regular |{5001, None} |
|0001|donut|Cake|0.55|1001 |Regular |{5002, Glazed} |
|0001|donut|Cake|0.55|1001 |Regular |{5005, Sugar} |
|0001|donut|Cake|0.55|1001 |Regular |{5007, Powdered Sugar} |
|0001|donut|Cake|0.55|1001 |Regular |{5006, Chocolate with Sprinkles}|
|0001|donut|Cake|0.55|1001 |Regular |{5003, Chocolate} |
|0001|donut|Cake|0.55|1001 |Regular |{5004, Maple} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5001, None} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5002, Glazed} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5005, Sugar} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5007, Powdered Sugar} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5006, Chocolate with Sprinkles}|
|0001|donut|Cake|0.55|1002 |Chocolate |{5003, Chocolate} |
|0001|donut|Cake|0.55|1002 |Chocolate |{5004, Maple} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5001, None} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5002, Glazed} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5005, Sugar} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5007, Powdered Sugar} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5006, Chocolate with Sprinkles}|
|0001|donut|Cake|0.55|1003 |Blueberry |{5003, Chocolate} |
|0001|donut|Cake|0.55|1003 |Blueberry |{5004, Maple} |
+----+-----+----+----+---------+-----------+--------------------------------+

Select the desired columns to form the complete DataFrame


+----+-----+----+----+---------+-----------+----------+------------------------+
|id |type |name|ppu |batter_id|batter_type|topping_id|topping_type |
+----+-----+----+----+---------+-----------+----------+------------------------+
|0001|donut|Cake|0.55|1001 |Regular |5001 |None |
|0001|donut|Cake|0.55|1001 |Regular |5002 |Glazed |
|0001|donut|Cake|0.55|1001 |Regular |5005 |Sugar |
|0001|donut|Cake|0.55|1001 |Regular |5007 |Powdered Sugar |
|0001|donut|Cake|0.55|1001 |Regular |5006 |Chocolate with Sprinkles|
|0001|donut|Cake|0.55|1001 |Regular |5003 |Chocolate |
|0001|donut|Cake|0.55|1001 |Regular |5004 |Maple |
|0001|donut|Cake|0.55|1002 |Chocolate |5001 |None |
|0001|donut|Cake|0.55|1002 |Chocolate |5002 |Glazed |
|0001|donut|Cake|0.55|1002 |Chocolate |5005 |Sugar |
|0001|donut|Cake|0.55|1002 |Chocolate |5007 |Powdered Sugar |
|0001|donut|Cake|0.55|1002 |Chocolate |5006 |Chocolate with Sprinkles|
|0001|donut|Cake|0.55|1002 |Chocolate |5003 |Chocolate |
|0001|donut|Cake|0.55|1002 |Chocolate |5004 |Maple |
|0001|donut|Cake|0.55|1003 |Blueberry |5001 |None |
|0001|donut|Cake|0.55|1003 |Blueberry |5002 |Glazed |
|0001|donut|Cake|0.55|1003 |Blueberry |5005 |Sugar |
|0001|donut|Cake|0.55|1003 |Blueberry |5007 |Powdered Sugar |
|0001|donut|Cake|0.55|1003 |Blueberry |5006 |Chocolate with Sprinkles|
|0001|donut|Cake|0.55|1003 |Blueberry |5003 |Chocolate |
|0001|donut|Cake|0.55|1003 |Blueberry |5004 |Maple |
+----+-----+----+----+---------+-----------+----------+------------------------+
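Once flattened, the result can be persisted like any other DataFrame; a small sketch writing completeDF out as Parquet (the output directory is illustrative only):

// Write the flattened result; the path below is just an example.
completeDF.write
  .mode("overwrite")
  .parquet("C:\\Users\\RECVUE-1162\\Desktop\\JSON_POC\\donut_flat_parquet")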
