PySpark Transformations

Can you explain the different transformations you've used in your project?

Be prepared: learn these 50 PySpark transformations to stand out.

Abhishek Agrawal
Azure Data Engineer
1. Normalization
Scaling data to a range between 0 and 1.

from pyspark.ml.feature import MinMaxScaler


scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaled_data = scaler.fit(data).transform(data)
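
MinMaxScaler (like StandardScaler below) operates on a vector column, so the numeric inputs are usually assembled first. A minimal sketch, assuming hypothetical numeric columns 'value1' and 'value2':

from pyspark.ml.feature import VectorAssembler

# Assemble the raw numeric columns into a single 'features' vector column
assembler = VectorAssembler(inputCols=["value1", "value2"], outputCol="features")
data = assembler.transform(data)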

2. Standardization
Transforming data to have zero mean and unit variance.

from pyspark.ml.feature import StandardScaler


scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaled_data = scaler.fit(data).transform(data)

3. Log Transformation
Applying a logarithmic transformation to handle skewed data.

from pyspark.sql.functions import log

# Apply a natural log (with a +1 shift to handle zeros) to the skewed 'value' column
data = data.withColumn("log_value", log(data["value"] + 1))



4. Binning
Grouping continuous values into discrete bins.

from pyspark.sql.functions import when

# Add a new column 'bin_column' based on conditions


data = data.withColumn(
"bin_column",
when(data["value"] < 10, "Low")
.when(data["value"] < 20, "Medium")
.otherwise("High")
)

5. One-Hot Encoding
Converting categorical variables into binary columns.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Step 1: Indexing the categorical column


indexer = StringIndexer(inputCol="category", outputCol="category_index")
indexed_data = indexer.fit(data).transform(data)

# Step 2: One-hot encoding the indexed column


encoder = OneHotEncoder(inputCol="category_index", outputCol="category_onehot")
encoded_data = encoder.fit(indexed_data).transform(indexed_data)

6. Label Encoding
Converting categorical values into integer labels.

from pyspark.ml.feature import StringIndexer

# Step 1: Create a StringIndexer to index the 'category' column


indexer = StringIndexer(inputCol="category", outputCol="category_index")

# Step 2: Fit the indexer on the data and transform it


indexed_data = indexer.fit(data).transform(data)



7. Pivoting
Pivoting is the process of transforming long-format data (where each row
represents a single observation or record) into wide-format data (where
each column represents a different attribute or category). This
transformation is typically used when you want to turn a categorical
variable into columns and aggregate values accordingly.

# Pivoting the data to create a summary of sales by month for each ID


pivoted_data = data.groupBy("id") \
    .pivot("month") \
    .agg({"sales": "sum"})

8. Unpivoting
Unpivoting is the opposite of pivoting. It transforms wide-format data
(where each column represents a different category or attribute) into
long-format data (where each row represents a single observation). This
is useful when you want to turn column headers back into values.

# Unpivoting the data to convert columns into rows


unpivoted_data = data.selectExpr(
"id",
"stack(2, 'Jan', Jan, 'Feb', Feb) as (month, sales)"
)

9. Aggregation
Summarizing data by applying functions like sum(), avg(), etc.

# Aggregating data by category to compute the sum of values


aggregated_data = data.groupBy("category") \
.agg({"value": "sum"})



10. Feature Extraction
Extracting useful features from raw data.

from pyspark.sql.functions import year, month, dayofmonth

# Add year, month, and day columns to the DataFrame


data = (
data
.withColumn("year", year(data["timestamp"]))
.withColumn("month", month(data["timestamp"]))
.withColumn("day", dayofmonth(data["timestamp"]))
)

11. Outlier Removal


Filtering out extreme values (outliers).

# Filter rows where the 'value' column is less than 1000


filtered_data = data.filter(data["value"] < 1000)
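
The fixed threshold above assumes the cutoff is known in advance; a hedged sketch deriving bounds from the data with the IQR rule, using approxQuantile:

# Estimate the 1st and 3rd quartiles of 'value' (1% relative error)
q1, q3 = data.approxQuantile("value", [0.25, 0.75], 0.01)
iqr = q3 - q1

# Keep rows within 1.5 * IQR of the quartiles
filtered_data = data.filter(
    (data["value"] >= q1 - 1.5 * iqr) & (data["value"] <= q3 + 1.5 * iqr)
)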

12. Data Imputation


Filling missing values with the mean or median.

from pyspark.ml.feature import Imputer

# Create an Imputer instance


imputer = Imputer(inputCols=["column"], outputCols=["imputed_column"])

# Fit the imputer model and transform the data


imputed_data = imputer.fit(data).transform(data)
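
By default Imputer fills with the mean; a hedged variant switching to the median strategy:

# Use the median instead of the mean for imputation
imputer = Imputer(inputCols=["column"], outputCols=["imputed_column"], strategy="median")
imputed_data = imputer.fit(data).transform(data)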



13. Date/Time Parsing
Converting string columns to timestamps.

from pyspark.sql.functions import to_timestamp

# Convert the 'date_string' column to a timestamp with the specified format


data = data.withColumn("timestamp", to_timestamp(data["date_string"], "yyyy-MM-dd"))

14. Text Transformation


Converting text to lowercase.

from pyspark.sql.functions import lower

# Convert the text in 'text_column' to lowercase and store it in a new column


data = data.withColumn("lowercase_text", lower(data["text_column"]))

15. Data Merging


Combining two datasets based on a common column.

# Perform an inner join between data1 and data2 on the 'id' column
merged_data = data1.join(data2, data1["id"] == data2["id"], "inner")

16. Data Joining


Joining data using inner, left, or right joins.

# Perform a left join between data1 and data2 on the 'id' column
joined_data = data1.join(data2, on="id", how="left")



17. Filtering Rows
Filtering rows based on a condition.

# Filter rows where the 'value' column is greater than 10


filtered_data = data.filter(data["value"] > 10)

18. Column Renaming


Renaming columns for clarity.

# Rename the column 'old_column' to 'new_column'


data = data.withColumnRenamed("old_column", "new_column")

19. Column Dropping


Removing unnecessary columns.

# Drop the 'unwanted_column' from the DataFrame


data = data.drop("unwanted_column")

20. Column Conversion


Converting a column from one data type to another.

from pyspark.sql.functions import col

# Convert 'column_string' to an integer and create a new column 'column_int'


data = data.withColumn("column_int", col("column_string").cast("int"))



21. Type Casting
Changing the type of a column (e.g., from string to integer).

# Convert 'column_string' to an integer and create a new column 'column_int'


data = data.withColumn("column_int", data["column_string"].cast("int"))

22. Duplicate Removal


Removing duplicate rows based on specified columns.

# Remove duplicate rows based on 'column1' and 'column2'


data = data.dropDuplicates(["column1", "column2"])

23. Null Value Removal


Filtering rows with null values in specified columns.

# Filter rows where the 'column' is not null


cleaned_data = data.filter(data["column"].isNotNull())

24. Windowing Functions


Using window functions to rank or aggregate data.

from pyspark.sql.window import Window


from pyspark.sql.functions import rank

# Define a window specification partitioned by 'category' and ordered by 'value'


window_spec = Window.partitionBy("category").orderBy("value")

# Add a 'rank' column based on the window specification


data = data.withColumn("rank", rank().over(window_spec))



25. Column Combination
Combining multiple columns into one.

from pyspark.sql.functions import concat

# Concatenate 'first_name' and 'last_name' columns to create 'full_name'


data = data.withColumn("full_name", concat(data["first_name"], data["last_name"])
)
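
concat joins the two values with no separator; a hedged variant using concat_ws to place a space between them:

from pyspark.sql.functions import concat_ws

# Concatenate with a space separator between first and last name
data = data.withColumn("full_name", concat_ws(" ", data["first_name"], data["last_name"]))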

26. Cumulative Sum


Calculating a running total of a column.

from pyspark.sql.window import Window


from pyspark.sql.functions import sum

# Define a window specification ordered by 'date' with an unbounded preceding frame
window_spec = Window.orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Add a 'cumulative_sum' column that computes the cumulative sum of 'value'


data = data.withColumn("cumulative_sum", sum("value").over(window_spec))

27. Rolling Average


Calculating a moving average over a window of rows.

from pyspark.sql.window import Window


from pyspark.sql.functions import avg

window_spec = Window.orderBy("date").rowsBetween(-2, 2)

data = data.withColumn("rolling_avg", avg("value").over(window_spec))



28. Value Mapping
Mapping values of a column to new values.

from pyspark.sql.functions import when

# Map 'value' column: set 'mapped_column' to 'A' if 'value' is 1, otherwise 'B'


data = data.withColumn("mapped_column", when(data["value"] == 1, "A").
otherwise("B"))

29. Subsetting Columns

Selecting only a subset of columns from the dataset.
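
No snippet accompanies this one; a minimal sketch using select(), with hypothetical column names:

# Keep only the columns needed downstream
subset_data = data.select("column1", "column2")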

30. Column Operations


Performing arithmetic operations on columns.

# Create a new column 'new_column' as the sum of 'value1' and 'value2'


data = data.withColumn("new_column", data["value1"] + data["value2"])

31. String Splitting


Splitting a string column into multiple columns based on a delimiter.

from pyspark.sql.functions import split

# Split the values in 'column' by a comma and store the result in 'split_column'
data = data.withColumn("split_column", split(data["column"], ","))



32. Data Flattening
Flattening nested structures (e.g., JSON) into a tabular format.

from pyspark.sql.functions import explode

# Flatten the array or map in 'nested_column' into multiple rows in 'flattened_column'
data = data.withColumn("flattened_column", explode(data["nested_column"]))

33. Sampling Data


Taking a random sample of the data.

# Sample 10% of the data


sampled_data = data.sample(fraction=0.1)
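
sample() is non-deterministic between runs; a hedged variant fixing the seed for reproducibility:

# Sample 10% of the data with a fixed seed so the result is reproducible
sampled_data = data.sample(fraction=0.1, seed=42)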

34. Stripping Whitespace


Removing leading and trailing whitespace from string columns.

from pyspark.sql.functions import trim

# Remove leading and trailing spaces from 'string_column' and create 'trimmed_column'
data = data.withColumn("trimmed_column", trim(data["string_column"]))



35. String Replacing
Replacing substrings within a string column.

from pyspark.sql.functions import regexp_replace

# Replace occurrences of 'old_value' with 'new_value' in 'text_column' and create 'updated_column'
data = data.withColumn("updated_column", regexp_replace(data["text_column"], "old_value", "new_value"))

36. Date Difference


Calculating the difference between two date columns.

from pyspark.sql.functions import datediff

# Calculate the difference in days between 'end_date' and 'start_date' in a new 'date_diff' column
data = data.withColumn("date_diff", datediff(data["end_date"], data["start_date"]))

37. Window Rank


Ranking rows based on a specific column.

from pyspark.sql.window import Window


from pyspark.sql.functions import rank

# Define a window specification ordered by 'value'


window_spec = Window.orderBy("value")

# Add a 'rank' column based on the window specification


data = data.withColumn("rank", rank().over(window_spec))



38. Multi-Column Aggregation
Performing multiple aggregation operations on different columns.

# Group by 'category' and calculate the sum of 'value1' and the average of 'value2'
aggregated_data = data.groupBy("category").agg(
    {"value1": "sum", "value2": "avg"}
)
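
The dictionary form does not let you name the output columns; a hedged alternative using explicit functions with aliases:

from pyspark.sql.functions import sum, avg

# Same aggregation with readable output column names
aggregated_data = data.groupBy("category").agg(
    sum("value1").alias("total_value1"),
    avg("value2").alias("avg_value2")
)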

39. Date Truncation


Truncating a date column to a specific unit (e.g., year, month).

from pyspark.sql.functions import trunc

# Truncate 'date_column' to the beginning of the month and add it as a new column
data = data.withColumn("truncated_date", trunc(data["date_column"], "MM"))

40. Repartitioning Data


Changing the number of partitions for better performance.

# Repartition the DataFrame into 4 partitions


data = data.repartition(4)



41. Adding Sequence Numbers
Assigning a unique sequence number to each row.

from pyspark.sql.functions import monotonically_increasing_id

# Add a new column 'row_id' with a unique, monotonically increasing ID


data = data.withColumn("row_id", monotonically_increasing_id())

42. Shuffling Data


Randomly shuffling rows in a dataset.

from pyspark.sql.functions import rand

# Shuffle the DataFrame by ordering rows randomly


shuffled_data = data.orderBy(rand())

43. Array Aggregation


Combining values into an array.

from pyspark.sql.functions import collect_list

# Group by 'id' and aggregate 'value' into a list, stored in a new column 'values_array'
data = data.groupBy("id").agg(collect_list("value").alias("values_array"))



44. Scaling
Rescaling a continuous column by discretizing it into a fixed number of quantile-based buckets.

from pyspark.ml.feature import QuantileDiscretizer

# Initialize the QuantileDiscretizer with input column, output column, and number of buckets
scaler = QuantileDiscretizer(inputCol="value", outputCol="scaled_value", numBuckets=10)

# Fit the discretizer to the data and transform the DataFrame


scaled_data = scaler.fit(data).transform(data)

45. Bucketing
Grouping continuous data into buckets.

from pyspark.ml.feature import Bucketizer

# Define split points for bucketing


splits = [0, 10, 20, 30, 40, 50]

# Initialize the Bucketizer with splits, input column, and output column
bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="bucketed_value")

# Apply the bucketizer transformation to the DataFrame


bucketed_data = bucketizer.transform(data)



46. Boolean Operations
Performing boolean operations on columns.

from pyspark.sql.functions import col

# Add a new column 'is_valid' indicating whether the 'value' column is greater than 10
data = data.withColumn("is_valid", col("value") > 10)

47. Extracting Substrings


Extracting a portion of a string from a column.

from pyspark.sql.functions import substring, col

# Add a new column 'substring' containing the first 5 characters of 'text_column'
data = data.withColumn("substring", substring(col("text_column"), 1, 5))

48. JSON Parsing


Parsing JSON data into structured columns.

from pyspark.sql.functions import from_json, col

# Parse the JSON data in 'json_column' into a structured column 'json_data' using the specified schema
data = data.withColumn("json_data", from_json(col("json_column"), schema))



49. String Length
Finding the length of a string column.

from pyspark.sql.functions import length, col

# Add a new column 'string_length' containing the length of the strings in 'text_column'
data = data.withColumn("string_length", length(col("text_column")))

50. Row-wise Operations


Applying a custom row-wise function to a column using a User-Defined Function (UDF).

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# Define a function to add 2 to the input value


def add_two(value):
return value + 2

# Register the function as a UDF


add_two_udf = udf(add_two, IntegerType())

# Apply the UDF to the 'value' column and create a new column 'incremented_value'
data = data.withColumn("incremented_value", add_two_udf(col("value")))



Follow for more content like this

Abhishek Agrawal
Azure Data Engineer
