PySpark Practice Template

The document outlines a series of steps for data cleaning using Spark, including initializing a Spark session, reading a CSV file, and performing various cleaning tasks such as removing duplicates, handling missing values, and normalizing string columns. It also details advanced techniques for data cleaning, such as filtering out unwanted data, handling incorrect data types, and managing outliers. The document serves as a guide for effectively preparing data for analysis.

Uploaded by

Srivamshi Bandi
# Initialize Spark Session

# TODO: Write SparkSession builder here

# Read CSV file


# TODO: Read CSV with header and schema

# Remove duplicates
# TODO: Drop duplicates

# Trim and normalize string columns


# TODO: Trim and convert "name" column to lowercase

# Handle missing values


# TODO: Fill nulls in "age" and "city" columns

# Filter invalid rows


# TODO: Filter where "age" > 0

# Remove special characters from a column


# TODO: Remove special characters from "name"

# Show cleaned data


# TODO: Show the final DataFrame

# Advanced Data Cleaning Techniques

# 1. Remove Null or Missing Values


# TODO: Drop rows with any nulls
# TODO: Drop rows with nulls in specific columns
# TODO: Fill nulls in specific columns

# 2. Handle Duplicates
# TODO: Drop all duplicates
# TODO: Drop duplicates based on specific columns

# 3. Trim and Normalize String Columns


# TODO: Trim whitespace
# TODO: Convert to lowercase
# TODO: Convert to uppercase

# 4. Handle Incorrect Data Types


# TODO: Cast column to IntegerType
# TODO: Replace invalid values with null using when()

# 5. Filter Out Unwanted Data


# TODO: Keep rows where column > 0
# TODO: Keep rows matching a specific value

# 6. Rename or Drop Columns


# TODO: Rename a column
# TODO: Drop a column

# 7. Remove Non-ASCII or Special Characters
# TODO: Remove special characters from a column

# 8. Fill Missing Values for Specific Data Types


# TODO: Calculate mean of numeric column
# TODO: Fill numeric column with mean
# TODO: Fill string column with default

# 9. Handle Outliers
# TODO: Filter out values outside limits
# TODO: Replace values outside limits

# 10. Combine or Split Columns


# TODO: Combine columns with a separator
# TODO: Split a column

# 11. Drop Rows with Corrupted Data


# TODO: Read file with DROPMALFORMED mode

# 12. Replace Specific Values


# TODO: Replace multiple values in a column

# 13. Validate and Correct Data


# TODO: Apply validation rule using when()

# Show cleaned DataFrame


# TODO: Show final cleaned DataFrame
