Part-19: Handling JSON Files
Read JSON | Flatten JSON
Sample JSON file
{
  "Employee": [
    {
      "emp_id": 4606601,
      "Designation": "Manager",
      "attribute": [
        {
          "Parent_id": 4655002,
          "status_flag": 1,
          "Department": [
            {
              "Dept_id": 46044403,
              "Code": "ep",
              "dept_type": "MP",
              "dept_flag": 1
            }
          ]
        }
      ]
    },
    {
      "emp_id": 56555000,
      "Designation": "Supervisor",
      "attribute": [
        {
          "Parent_id": 5605501,
          "status_flag": 1,
          "Department": [
            {
              "Dept_id": 56044402,
              "Code": "ep",
              "dept_type": "P",
              "dept_flag": 1
            },
            {
              "Dept_id": 56044000,
              "Code": "fs",
              "dept_type": "D",
              "dept_flag": 0
            }
          ]
        }
      ]
    }
  ]
}
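Before flattening, the file has to be read into a DataFrame. A minimal sketch, assuming the JSON above is saved as employee.json (the file name and the SparkSession setup are assumptions, not part of the original slides):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FlattenJSON").getOrCreate()

# multiline=true is required because each JSON record spans multiple lines
df = spark.read.option("multiline", "true").json("employee.json")
df.printSchema()  # shows Employee as an array of structs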
The function below can be reused across many projects; it has been tested on a deeply nested, complex JSON file containing 200+ columns and 30 levels of hierarchy.
By default, the Spark engine will not flatten a complex, deeply nested JSON file, so we need to write our own user-defined function (UDF).
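For reference, here is a sketch of the full function, consistent with the fragments walked through below (the name flatten_json is an assumption; the logic follows the standard struct-expand / array-explode pattern):

from pyspark.sql.types import StructType, ArrayType
from pyspark.sql.functions import col, explode_outer

def flatten_json(df):
    # Collect every column whose type is a struct or an array
    complex_fields = {field.name: field.dataType
                      for field in df.schema.fields
                      if isinstance(field.dataType, (ArrayType, StructType))}

    while len(complex_fields) != 0:
        col_name = list(complex_fields.keys())[0]
        print("Processing:", col_name, type(complex_fields[col_name]))

        if isinstance(complex_fields[col_name], StructType):
            # Expand each struct field into a top-level parent_child column
            expanded = [col(col_name + '.' + k).alias(col_name + '_' + k)
                        for k in [n.name for n in complex_fields[col_name]]]
            df = df.select("*", *expanded).drop(col_name)
        elif isinstance(complex_fields[col_name], ArrayType):
            # Explode the array into one row per element;
            # explode_outer keeps rows whose array is empty or null
            df = df.withColumn(col_name, explode_outer(col_name))

        # Recompute the remaining complex fields after each transformation
        complex_fields = {field.name: field.dataType
                          for field in df.schema.fields
                          if isinstance(field.dataType, (ArrayType, StructType))}
    return df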
while len(complex_fields) != 0:

The function processes the DataFrame in a loop until no complex fields (structs or arrays) remain in the schema. On each pass, the first complex column name (col_name) and its type are retrieved and printed for debugging.
if isinstance(complex_fields[col_name], StructType):
    # Promote every field of the struct to a top-level column named
    # parent_child, then drop the original struct column
    expanded = [col(col_name + '.' + k).alias(col_name + '_' + k)
                for k in [n.name for n in complex_fields[col_name]]]
    df = df.select("*", *expanded).drop(col_name)
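The fragment above covers struct columns. Array columns take the other branch, as in the sketch earlier: the array is exploded into one row per element, and the resulting struct is expanded on a later pass of the loop.

elif isinstance(complex_fields[col_name], ArrayType):
    # One output row per array element; empty/null arrays yield a null row
    df = df.withColumn(col_name, explode_outer(col_name))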
Once the loop ends, the function returns the fully flattened DataFrame.
Purpose
The function is designed to handle deeply nested data structures, which are common in
JSON or hierarchical datasets, and transform them into a tabular format that is easier to
query and analyze.
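As an illustration, flattening the sample file would look like this (the column names shown assume the parent_child naming convention used above):

flat_df = flatten_json(df)
flat_df.show(truncate=False)
# Expected columns for the sample file:
# Employee_emp_id, Employee_Designation,
# Employee_attribute_Parent_id, Employee_attribute_status_flag,
# Employee_attribute_Department_Dept_id, Employee_attribute_Department_Code,
# Employee_attribute_Department_dept_type, Employee_attribute_Department_dept_flag

Each employee/department combination becomes one row, so the sample yields three rows: one for the Manager and two for the Supervisor, who belongs to two departments.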
Happy Learning
NIL REDDY CHENCHU
CHENCHU’S DATA ANALYTICS
www.linkedin.com/in/chenchuanil