PySpark - FP - Course ID 58339 - Hands On 1


Step 1: Import the SparkSession class.

Step 2: Create a SparkSession object.


Step 3: Read the JSON file and create a DataFrame from the JSON data. Display the DataFrame. Save the
DataFrame to a Parquet file named Employees.

Step 4: From the DataFrame, display the associates who are mapped to the 'JAVA' stream. Save the resulting
DataFrame to a Parquet file named JavaEmployees.

# Step 1: Import the SparkSession class
from pyspark.sql import SparkSession

# Step 2: Create a SparkSession object
spark = SparkSession.builder \
    .appName("Employee Data Processing") \
    .getOrCreate()

# Step 3: Read the JSON file and create a DataFrame
# Assuming the JSON file is named 'employees.json'
df = spark.read.json("employees.json")

# Display the DataFrame
df.show()

# Save the DataFrame to a Parquet file
df.write.parquet("Employees")

# Step 4: Filter associates mapped to the 'JAVA' stream
# Assuming each record has a 'stream' field
java_employees_df = df.filter(df.stream == "JAVA")

# Display the filtered DataFrame
java_employees_df.show()

# Save the filtered DataFrame to a Parquet file
java_employees_df.write.parquet("JavaEmployees")
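The script assumes that employees.json already exists. To try it without the course's data file, a small input file can be generated first. The sketch below is only an illustration: the only field taken from the script is 'stream', while the 'id' and 'name' fields, the sample values, and the helper write_json_lines are hypothetical. It writes one JSON object per line, which is the layout spark.read.json reads by default.

```python
import json

# Hypothetical sample records: only the 'stream' field comes from the script's
# filter; the other field names and all values are made up for illustration.
sample_employees = [
    {"id": 1, "name": "Asha", "stream": "JAVA"},
    {"id": 2, "name": "Ben", "stream": "PYTHON"},
    {"id": 3, "name": "Chen", "stream": "JAVA"},
]


def write_json_lines(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Create the input file under the name the script assumes.
write_json_lines(sample_employees, "employees.json")
```

With this sample input, the Step 4 filter would keep the two records whose stream is 'JAVA'.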
