
CHENCHU’S

C. R. Anil Kumar Reddy
Associate Developer for Apache Spark 3.0

🚀 Mastering PySpark and Databricks 🚀

Part-19
Handling JSON Files
Read JSON | Flatten JSON

www.linkedin.com/in/chenchuanil

Sample JSON file:
{
"Employee": [
{
"emp_id": 4606601,
"Designation": "Manager",
"attribute": [
{
"Parent_id": 4655002,
"status_flag": 1,
"Department": [
{
"Dept_id": 46044403,
"Code": "ep",
"dept_type": "MP",
"dept_flag": 1
}
]
}
]
},
{
"emp_id": 56555000,
"Designation": "Supervisor",
"attribute": [
{
"Parent_id": 5605501,
"status_flag": 1,
"Department": [
{
"Dept_id": 56044402,
"Code": "ep",
"dept_type": "P",
"dept_flag": 1
},
{
"Dept_id": 56044000,
"Code": "fs",
"dept_type": "D",
"dept_flag": 0
}
]
}
]
}
]
}
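
Before flattening, the file must be read into a DataFrame. A minimal read
sketch follows (the file name employees.json and the session setup are our
assumptions, not from the original post); the multiLine option is needed
because each record in the sample spans several lines:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FlattenJson").getOrCreate()

# multiLine=True lets Spark parse JSON records that span multiple lines;
# the default reader expects one JSON object per line.
df = spark.read.option("multiLine", True).json("employees.json")
df.printSchema()  # Employee resolves to array<struct<...>>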


The function below can be reused across many projects; it has been tested on a
deeply nested, complex JSON file with 200+ columns and 30 levels of hierarchy.

By default, the Spark engine will not flatten a complex, deeply nested JSON
file, so we need to write our own user-defined flattening function.


1.Identify Complex Fields:

complex_fields = dict([(field.name, field.dataType)
                       for field in df.schema.fields
                       if type(field.dataType) == ArrayType
                       or type(field.dataType) == StructType])

The function first builds a dictionary that maps each column name to its data
type, keeping only columns whose type is ArrayType or StructType (the complex
fields that still need flattening).


2.Iterate Until No Complex Fields:

while len(complex_fields) != 0:

The function processes the DataFrame in a loop until there are no complex
fields left in the schema.

3.Process the First Complex Field:


col_name = list(complex_fields.keys())[0]
print("Processing :" + col_name + " Type : " + str(type(complex_fields[col_name])))

The first complex column name (col_name) and its type are retrieved and printed
for debugging.

4.Handle StructType Fields:

if type(complex_fields[col_name]) == StructType:
    expanded = [col(col_name + '.' + k).alias(col_name + '_' + k)
                for k in [n.name for n in complex_fields[col_name]]]
    df = df.select("*", *expanded).drop(col_name)

Each field of the struct is promoted to a top-level column named
<column>_<field>, and the original struct column is dropped.
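
As a small illustration (hypothetical data, not from the original post), this
is what the expansion does to a two-field struct column; it reuses the spark
session from the read sketch above.

from pyspark.sql.functions import col

# Hypothetical struct column "emp" with fields "id" and "name"
df2 = spark.createDataFrame([((1, "Asha"),)], "emp struct<id:int, name:string>")
expanded = [col("emp." + k).alias("emp_" + k) for k in ["id", "name"]]
df2.select("*", *expanded).drop("emp").show()
# +------+--------+
# |emp_id|emp_name|
# +------+--------+
# |     1|    Asha|
# +------+--------+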


5.Handle ArrayType Fields:

elif type(complex_fields[col_name]) == ArrayType:
    df = df.withColumn(col_name, explode_outer(col_name))

Array columns are exploded with explode_outer, producing one output row per
array element while keeping rows whose array is null or empty.
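
A quick illustration of why explode_outer is used rather than explode
(hypothetical data, not from the post): explode drops rows whose array is
null or empty, while explode_outer keeps them with a null value.

from pyspark.sql.functions import explode_outer

df3 = spark.createDataFrame([(1, ["a", "b"]), (2, None)],
                            "id int, tags array<string>")
df3.withColumn("tags", explode_outer("tags")).show()
# +---+----+
# | id|tags|
# +---+----+
# |  1|   a|
# |  1|   b|
# |  2|null|
# +---+----+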

6.Recompute Complex Fields:


complex_fields = dict([(field.name, field.dataType)
                       for field in df.schema.fields
                       if type(field.dataType) == ArrayType
                       or type(field.dataType) == StructType])

Because expanding one column can expose new nested columns, the dictionary of
complex fields is rebuilt from the updated schema on every pass of the loop.

7.Return the Flattened DataFrame:


return df

Once the loop ends, the function returns the fully flattened DataFrame.
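
Putting the snippets above together, here is the whole helper in one place.
The function name flatten_df and the import lines are our additions; the body
is exactly the steps walked through above.

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, StructType

def flatten_df(df):
    # 1. Collect all ArrayType/StructType columns: {column name: data type}
    complex_fields = dict([(field.name, field.dataType)
                           for field in df.schema.fields
                           if type(field.dataType) == ArrayType
                           or type(field.dataType) == StructType])

    # 2. Keep processing until the schema has no complex fields left
    while len(complex_fields) != 0:
        # 3. Take the first complex column and log it for debugging
        col_name = list(complex_fields.keys())[0]
        print("Processing :" + col_name + " Type : " + str(type(complex_fields[col_name])))

        # 4. StructType: promote each nested field to a top-level column
        if type(complex_fields[col_name]) == StructType:
            expanded = [col(col_name + '.' + k).alias(col_name + '_' + k)
                        for k in [n.name for n in complex_fields[col_name]]]
            df = df.select("*", *expanded).drop(col_name)

        # 5. ArrayType: one row per element, keeping null/empty arrays
        elif type(complex_fields[col_name]) == ArrayType:
            df = df.withColumn(col_name, explode_outer(col_name))

        # 6. Recompute the complex fields against the updated schema
        complex_fields = dict([(field.name, field.dataType)
                               for field in df.schema.fields
                               if type(field.dataType) == ArrayType
                               or type(field.dataType) == StructType])

    # 7. Fully flattened DataFrame
    return df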

Purpose

The function is designed to handle deeply nested data structures, which are common in
JSON or hierarchical datasets, and transform them into a tabular format that is easier to
query and analyze.
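
A usage sketch on the sample file above (the column list is what the
step-by-step logic produces for that schema):

flat_df = flatten_df(df)
flat_df.show(truncate=False)
# Expected flat columns for the sample file:
# Employee_emp_id, Employee_Designation,
# Employee_attribute_Parent_id, Employee_attribute_status_flag,
# Employee_attribute_Department_Dept_id, Employee_attribute_Department_Code,
# Employee_attribute_Department_dept_type, Employee_attribute_Department_dept_flag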

Torture the data, and it will confess to anything

DATA ANALYTICS

Happy Learning

SHARE IF YOU LIKE THE POST

Let's connect to discuss more on Data

www.linkedin.com/in/chenchuanil
