Step 1: Import the SparkSession class from pyspark.sql.
Step 2: Create a SparkSession object.
Step 3: Read the JSON file and create a DataFrame from the JSON data. Display the DataFrame. Save the
DataFrame to a Parquet file named Employees.
Step 4: From the DataFrame, display the associates who are mapped to the ‘JAVA’ stream. Save the resulting
DataFrame to a Parquet file named JavaEmployees.
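Step 3 needs an input file to read. By default, spark.read.json expects JSON Lines, that is, one complete JSON object per line; a pretty-printed, multi-line JSON document needs the multiLine option instead. The sketch below uses only the Python standard library to write a small sample employees.json in that layout. The records and the field names id and name are assumptions made for illustration; only the stream field is implied by Step 4.

```python
import json

# Hypothetical sample records. "id" and "name" are illustrative assumptions;
# "stream" matches the column that Step 4 filters on.
employees = [
    {"id": 1, "name": "Asha", "stream": "JAVA"},
    {"id": 2, "name": "Ravi", "stream": "PYTHON"},
    {"id": 3, "name": "Meena", "stream": "JAVA"},
]


def to_json_lines(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Write the sample input next to the script so Step 3 can read it.
with open("employees.json", "w", encoding="utf-8") as f:
    f.write(to_json_lines(employees))
```

With this sample input, the filter in Step 4 should keep the two records whose stream is JAVA.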
# Step 1: Import the SparkSession class
from pyspark.sql import SparkSession
# Step 2: Create a SparkSession object
spark = SparkSession.builder \
    .appName("Employee Data Processing") \
    .getOrCreate()
# Step 3: Read the JSON file and create a DataFrame
# Assuming the JSON file is named 'employees.json'
df = spark.read.json("employees.json")
# Display the DataFrame
df.show()
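# Save the DataFrame in Parquet format; Spark writes a directory named
# "Employees" containing the part files.
# mode("overwrite") replaces any existing output so the script can be re-run;
# remove it to keep Spark's default of failing when the path already exists.
df.write.mode("overwrite").parquet("Employees")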
# Step 4: Filter associates mapped to the 'JAVA' stream
java_employees_df = df.filter(df.stream == "JAVA")
# Display the filtered DataFrame
java_employees_df.show()
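# Save the filtered DataFrame to a Parquet file named JavaEmployees
# (mode("overwrite") replaces any existing output so the script can be re-run)
java_employees_df.write.mode("overwrite").parquet("JavaEmployees")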