Master PySpark 1-18
Spark SQL is a module in Spark for working with structured data. It allows you to query
structured data inside Spark using SQL and integrates seamlessly with DataFrames.
# Sample Data
data = [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "David"), (5, "Eve")]
columns = ["ID", "Name"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
Step-by-Step Guide
Before loading the data, ensure you import the necessary modules:
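A typical set of imports for this section looks like the following (SparkSession is usually already available as spark in notebook environments such as Databricks):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType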
You can define a custom schema for your CSV file. This allows you to explicitly set the data
types for each column.
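For example, a schema for the Order.csv file used below might look like this (the column names and types here are illustrative; adjust them to your file):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

order_schema = StructType([
    StructField("OrderID", IntegerType(), True),
    StructField("Customer", StringType(), True),
    StructField("Amount", DoubleType(), True)
])
df = spark.read.csv("/FileStore/tables/Order.csv", header=True, schema=order_schema)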
Load the CSV file into a DataFrame using the read.csv() method. Here, header=True treats
the first row as headers, and inferSchema=True allows Spark to automatically assign data
types to columns.
To read multiple CSV files into a single DataFrame, you can pass a list of file paths. Ensure
that the schema is consistent across all files.
df = spark.read.csv("/FileStore/tables/Order.csv", header=True, inferSchema=True, sep=',')
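To load several files at once, as described above, pass a list of paths (the second path below is illustrative):

df_all = spark.read.csv(["/FileStore/tables/Order.csv", "/FileStore/tables/Order2.csv"], header=True, inferSchema=True, sep=',')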
Use the following commands to check the schema and display the DataFrame:
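df.printSchema()
df.show()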
• Behind the Scenes: When you use inferSchema, Spark runs a job that scans the CSV
file from top to bottom to identify the best-suited data type for each column based
on the values it encounters.
• Pros:
o Useful when the schema of the file keeps changing, as it allows Spark to
automatically detect the data types.
• Cons:
o Performance Impact: Spark must scan the entire file, which can take extra
time, especially for large files.
o Loss of Control: You lose the ability to explicitly define the schema, which may
lead to incorrect data types if the data is inconsistent.
Conclusion
Loading data from CSV files into a DataFrame is straightforward in PySpark. Understanding
how to define a schema and the implications of using inferSchema is crucial for optimizing
your data processing workflows.
# Select multiple columns
from pyspark.sql.functions import col

df.select(
    col("Name").alias("EmployeeName"),      # Rename "Name" to "EmployeeName"
    col("Salary").alias("EmployeeSalary"),  # Rename "Salary" to "EmployeeSalary"
    col("Department"),                      # Select "Department"
    df.Joining_Date                         # Select "Joining_Date"
).show()
selectExpr() allows you to use SQL expressions directly and rename columns concisely:
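A short sketch, assuming Name and Salary columns exist in df (the 10% raise is illustrative):

df.selectExpr("Name as EmployeeName", "Salary * 1.10 as RevisedSalary").show()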
Summary
In PySpark, the withColumn() function is widely used to add new columns to a DataFrame.
You can either assign a constant value using lit() or perform transformations using existing
columns.
• Example:
o Assign a constant value with lit().
o Perform calculations using existing columns like multiplying values.
• Renaming a column:
• Handling column names with special characters or spaces: If a column has special
characters or spaces, you need to use backticks (`) to escape it:
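A minimal sketch of these operations, assuming df has Salary and Department columns (the constant value and bonus rate are illustrative):

from pyspark.sql.functions import lit, col

df_new = (df
    .withColumn("Country", lit("India"))           # constant value via lit()
    .withColumn("Bonus", col("Salary") * 0.10)     # computed from an existing column
    .withColumnRenamed("Department", "Dept Name")  # rename (new name contains a space)
)
df_new.select(col("`Dept Name`")).show()           # backticks escape the space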
df2 = df.drop("Country")
Dropping columns creates a new DataFrame, and the original DataFrame remains
unchanged.
4. Immutability of DataFrames
In Spark, DataFrames are immutable by nature. This means that after creating a DataFrame,
its contents cannot be changed. All transformations like adding, renaming, or dropping
columns result in a new DataFrame, keeping the original one intact.
• For instance, dropping columns creates a new DataFrame without altering the
original:
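For example (assuming df has a Country column):

df2 = df.drop("Country")
df.printSchema()    # the original still contains Country
df2.printSchema()   # Country is removed only in the new DataFrame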
This immutability ensures data consistency and supports Spark’s parallel processing, as
transformations do not affect the source data.
Key Points
• Use withColumn() for adding columns, with lit() for constant values and expressions
for computed values.
• Use withColumnRenamed() to rename columns and backticks for special characters
or spaces.
• Use drop() to remove one or more columns.
• DataFrames are immutable in Spark—transformations result in new DataFrames,
leaving the original unchanged.
In PySpark, you can change the data type of a column using the cast() method. This is
helpful when you need to convert data types for columns like Salary or Phone.
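A minimal sketch, assuming Salary and Phone columns:

from pyspark.sql.functions import col

df = df.withColumn("Salary", col("Salary").cast("double")) \
       .withColumn("Phone", col("Phone").cast("string"))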
df.printSchema()
2. Filtering Data
You can filter rows based on specific conditions. For instance, to filter employees with a
salary greater than 50,000:
You can also apply multiple conditions using & or | (AND/OR) to filter data. For example,
finding employees over 30 years old and in the IT department:
Filtering based on whether a column has NULL values or not is crucial for data cleaning:
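Sketches of these filters, assuming Salary, Age, Department, and Address columns:

from pyspark.sql.functions import col

df.filter(col("Salary") > 50000).show()                              # single condition
df.filter((col("Age") > 30) & (col("Department") == "IT")).show()    # multiple conditions (AND)
df.filter(col("Address").isNull()).show()                            # rows with a missing Address
df.filter(col("Age").isNotNull()).show()                             # rows with a valid Age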
This set of operations will help you efficiently manage and transform your data in PySpark,
ensuring data integrity and accuracy for your analysis!
1. Changing Data Types: Easily modify column types using .cast(). E.g., change 'Salary' to
double or 'Phone' to string for better data handling.
2. Filtering Data: Use .filter() or .where() to extract specific rows. For example, filter
employees with a salary over 50,000 or non-null Age.
3. Multiple Conditions: Chain filters with & and | to apply complex conditions, such as
finding employees over 30 in the IT department.
4. Handling NULLs: Use .isNull() and .isNotNull() to filter rows with missing or available
values, such as missing addresses or valid emails.
5. Unique/Distinct Values: Use .distinct() to get unique rows or distinct values in a
column. Remove duplicates based on specific fields like Email or Phone using
.dropDuplicates().
6. Count Distinct Values: Count distinct values in one or multiple columns to analyze
data diversity, such as counting unique departments or combinations of Department
and Performance_Rating.
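Sketches for points 5 and 6 (the column names follow the examples above):

from pyspark.sql.functions import countDistinct

df.distinct().show()                                                   # unique rows
df.select("Department").distinct().show()                              # distinct values in one column
df.dropDuplicates(["Email"]).show()                                    # remove duplicates based on Email
df.select(countDistinct("Department", "Performance_Rating")).show()    # count distinct combinations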
# Sample data
data = [
("USA", "North America", 100, 50.5),
("India", "Asia", 300, 20.0),
("Germany", "Europe", 200, 30.5),
("Australia", "Oceania", 150, 60.0),
("Japan", "Asia", 120, 45.0),
("Brazil", "South America", 180, 25.0)
]
# Column names (the last column name is an assumption; rename it to match your data)
columns = ["Country", "Region", "UnitsSold", "UnitPrice"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
Note: By default, the sorting is in ascending order. This shows the top 5 countries in
alphabetical order.
Note: Here, the DataFrame is sorted first by Country (ascending), and within the same
country, it is sorted by UnitsSold in ascending order.
Note: This ensures that null values (if present) are placed at the end when sorting by
Country.
• Sorting: You can sort a DataFrame by one or more columns using .orderBy() or .sort().
By default, sorting is ascending, but you can change it using asc() or desc().
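Sketches of the sorts described in the notes above:

from pyspark.sql.functions import col

df.orderBy("Country").show(5)                                      # ascending, top 5 alphabetically
df.orderBy(col("Country").asc(), col("UnitsSold").asc()).show()    # multi-column sort
df.orderBy(col("Country").asc_nulls_last()).show()                 # null countries placed at the end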
These functions and transformations are common in PySpark for manipulating and querying
data effectively!
Note: This transforms the first letter of each word in the Country column to uppercase.
Concatenation Functions
Note: Concatenates the values of Region and Country without any separator.
Note: This creates a new column concatenated by combining Region and Country with a
space between them.
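Sketches of these string functions (the alias names are illustrative):

from pyspark.sql.functions import initcap, concat, concat_ws, col

df.select(initcap(col("Country")).alias("Country_Initcap")).show()                      # capitalize each word
df.select(concat(col("Region"), col("Country")).alias("Region_Country")).show()         # no separator
df.withColumn("Region_Country", concat_ws(" ", col("Region"), col("Country"))).show()   # space separator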
These functions and transformations are common in PySpark for manipulating and querying
data effectively!
Let's create a PySpark DataFrame for employee data, which will include columns such as
EmployeeID, Name, Department, and Skills.
I'll demonstrate the usage of the split, explode, and other relevant PySpark functions with
the employee data, along with notes for each operation.
# Sample data (illustrative values; Skills is a space-separated string)
data = [(1, "Alice", "IT", "Python Spark Cloud"), (2, "Bob", "HR", "Excel Communication")]
columns = ["EmployeeID", "Name", "Department", "Skills"]
df = spark.createDataFrame(data, columns)
Note: This splits the Skills column into an array of skills based on the space separator. The
alias("Skills_Array") gives the resulting array a meaningful name.
Note: The array index starts from 0, so Skills_Array[0] gives the first skill for each employee.
Note: The size() function returns the number of elements (skills) in the Skills_Array.
Note: This returns a boolean indicating whether the array contains the specified skill,
"Cloud", for each employee.
Note: The explode() function takes an array column and creates a new row for each
element of the array. Here, each employee will have multiple rows, one for each skill.
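Sketches of the array functions described in the notes above:

from pyspark.sql.functions import split, explode, size, array_contains, col

df_arr = df.select("Name", split(col("Skills"), " ").alias("Skills_Array"))
df_arr.select("Name", col("Skills_Array")[0].alias("First_Skill")).show()                        # index starts at 0
df_arr.select("Name", size(col("Skills_Array")).alias("Num_Skills")).show()                      # number of skills
df_arr.select("Name", array_contains(col("Skills_Array"), "Cloud").alias("Knows_Cloud")).show()  # contains "Cloud"?
df_arr.select("Name", explode(col("Skills_Array")).alias("Skill")).show()                        # one row per skill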
Steps:
1. Create sample employee data.
2. Demonstrate the usage of ltrim(), rtrim(), trim(), lpad(), and rpad() on string columns.
Sample Data Creation for Employees
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim, col

# Sample data with leading/trailing spaces (names are illustrative)
data = [("  Alice ",), (" Bob  ",), (" Charlie ",)]
columns = ["Name"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
Output Explanation:
• ltrim_Name: The leading spaces from the Name column are removed.
• rtrim_Name: The trailing spaces from the Name column are removed.
• trim_Name: Both leading and trailing spaces are removed from the Name column.
• lpad_Name: The Name column is padded on the left with "X" until the string length
becomes 10.
• rpad_Name: The Name column is padded on the right with "Y" until the string length
becomes 10.
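A sketch that produces the columns described above:

from pyspark.sql.functions import ltrim, rtrim, trim, lpad, rpad, col

df.select(
    col("Name"),
    ltrim(col("Name")).alias("ltrim_Name"),
    rtrim(col("Name")).alias("rtrim_Name"),
    trim(col("Name")).alias("trim_Name"),
    lpad(col("Name"), 10, "X").alias("lpad_Name"),
    rpad(col("Name"), 10, "Y").alias("rpad_Name")
).show()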
Code Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date, current_timestamp, date_add, date_sub, col, datediff, months_between, to_date, lit
4. months_between:
• months_between(to_date(lit("2016-01-01")), to_date(lit("2017-01-01"))): Calculates the number of months between January 1, 2016, and January 1, 2017, which is -12 months because the first date is earlier than the second.
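A sketch of these date functions (the literal dates come from the notes above; the 7-day offsets are illustrative):

df_dates = spark.range(1).select(
    current_date().alias("today"),
    current_timestamp().alias("now"),
    date_add(current_date(), 7).alias("plus_7_days"),
    date_sub(current_date(), 7).alias("minus_7_days"),
    datediff(to_date(lit("2017-01-01")), to_date(lit("2016-01-01"))).alias("diff_days"),
    months_between(to_date(lit("2016-01-01")), to_date(lit("2017-01-01"))).alias("diff_months")  # -12.0
)
df_dates.show(truncate=False)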
3. Handling Timestamps:
o You can use to_timestamp to convert strings with both date and time into a
timestamp format. This is useful when working with datetime values.
o After casting to a timestamp, you can extract various date/time components
such as the year, month, day, hour, minute, and second.
Sample Output
For the input "2017-12-11" parsed with the format yyyy-MM-dd, you can expect the following results:
• Year: 2017
• Month: 12
• Day: 11
• Hour: 0 (since no time is provided)
• Minute: 0
• Second: 0
For invalid date strings (like "2017-20-12"), you will get null in the resulting DataFrame.
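A sketch of the parsing and extraction described above (to_timestamp and the extraction helpers are imported explicitly here):

from pyspark.sql.functions import to_date, to_timestamp, year, month, dayofmonth, hour, minute, second, lit

df_ts = spark.range(1).select(
    to_timestamp(lit("2017-12-11"), "yyyy-MM-dd").alias("ts"),      # time parts default to 0
    to_date(lit("2017-20-12"), "yyyy-MM-dd").alias("invalid_d")     # month 20 is invalid -> null, per the note above
)
df_ts.select(
    year("ts").alias("Year"), month("ts").alias("Month"), dayofmonth("ts").alias("Day"),
    hour("ts").alias("Hour"), minute("ts").alias("Minute"), second("ts").alias("Second"),
    "invalid_d"
).show()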
Here’s an example of how you can use PySpark functions for null handling with sales data.
The code includes null detection, dropping rows with nulls, filling null values, and using
coalesce() to handle nulls in aggregations. I will provide the notes alongside the code.
Notes:
1. Detecting Null Values:
The isNull() function identifies rows where a specified column has null values. The
output shows a boolean flag for each row to indicate whether the value in the
column is null.
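Sketches of the operations listed above (the column names Sales and Region are assumptions for the sales data):

from pyspark.sql import functions as F

df.filter(F.col("Sales").isNull()).show()       # detect rows with null Sales
df.na.drop(subset=["Sales"]).show()             # drop rows where Sales is null
df.na.fill({"Sales": 0}).show()                 # replace null Sales with 0
df.groupBy("Region").agg(F.sum(F.coalesce(F.col("Sales"), F.lit(0))).alias("total_sales")).show()  # treat nulls as 0 in aggregation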
Let's create a sample DataFrame using PySpark that includes various numerical values. This
dataset will be useful for demonstrating the aggregate functions.
from pyspark.sql import Row

# Create sample data
data = [
Row(id=1, value=10),
Row(id=2, value=20),
Row(id=3, value=30),
Row(id=4, value=None),
Row(id=5, value=40),
Row(id=6, value=20)
]
# Create DataFrame
df = spark.createDataFrame(data)
# Show the DataFrame
df.show()
4. Maximum (max) and Minimum (min): Finds the maximum and minimum values in a
specified column.
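A sketch of the aggregate functions over the value column defined above:

from pyspark.sql import functions as F

df.agg(
    F.count("value").alias("count_non_null"),   # nulls are not counted
    F.sum("value").alias("sum"),
    F.avg("value").alias("avg"),
    F.max("value").alias("max"),
    F.min("value").alias("min")
).show()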
This should give you a solid understanding of aggregate functions in PySpark.
Let's create some sample data to demonstrate each of these PySpark DataFrame operations
and give notes explaining the functions. Here's how you can create a PySpark DataFrame
and apply these operations.
Sample Data
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Sample data
data = [
("HR", 10000, 500, "John"),
("Finance", 20000, 1500, "Doe"),
("HR", 15000, 1000, "Alice"),
("Finance", 25000, 2000, "Eve"),
("HR", 20000, 1500, "Mark")
]
# Define schema
schema = StructType([
StructField("department", StringType(), True),
StructField("salary", IntegerType(), True),
StructField("bonus", IntegerType(), True),
StructField("employee_name", StringType(), True)
])
# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show()
1. Grouped Aggregation
Perform aggregation within groups based on a grouping column.
Explanation:
• sum: Adds the values in the group for column1.
• avg: Calculates the average value of column1 in each group.
• max: Finds the maximum value.
• min: Finds the minimum value.
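A sketch of grouped aggregation on the department data above (this single .agg() call also shows multiple aggregations in one step):

from pyspark.sql import functions as F

df.groupBy("department").agg(
    F.sum("salary").alias("total_salary"),
    F.avg("salary").alias("avg_salary"),
    F.max("salary").alias("max_salary"),
    F.min("salary").alias("min_salary")
).show()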
2. Multiple Aggregations
Perform multiple aggregations in a single step.
3. Concatenate Strings
Concatenate strings within a column (a combined code sketch for items 3-6 follows item 6 below).
4. First and Last Values
Explanation:
• first: Retrieves the first value of the name column within each group.
• last: Retrieves the last value of the name column within each group.
5. Standard Deviation and Variance
Explanation:
• stddev: Calculates the standard deviation of column.
• variance: Calculates the variance of column.
6. Sum of Distinct Values
Explanation:
• sumDistinct: Sums only the distinct values in column. This avoids counting duplicates.
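Sketches of items 3-6 on the same department data (sumDistinct is named sum_distinct from Spark 3.2 onward; the aliases are illustrative):

from pyspark.sql import functions as F

df.groupBy("department").agg(
    F.concat_ws(", ", F.collect_list("employee_name")).alias("employees"),  # concatenate names per group
    F.first("employee_name").alias("first_employee"),
    F.last("employee_name").alias("last_employee"),
    F.stddev("salary").alias("salary_stddev"),
    F.variance("salary").alias("salary_variance"),
    F.sum_distinct("bonus").alias("distinct_bonus_sum")                     # duplicate bonus values counted once
).show()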
Joins in PySpark
Joins are used to combine two DataFrames based on a common column or condition.
PySpark supports several types of joins, similar to SQL. Below are explanations and
examples for each type of join.
1. Inner Join
Explanation:
• Purpose: Returns rows where there is a match in both DataFrames (df1 and df2)
based on the common_column.
• Behavior: Rows with no matching value in either DataFrame are excluded.
• Use Case: When you only need records that exist in both DataFrames.
2. Left Join
Explanation:
• Purpose: Returns all rows from df1 and the matching rows from df2. If no match
exists in df2, the result will contain NULL for columns from df2.
• Behavior: All rows from the left DataFrame (df1) are preserved, even if there’s no
match in the right DataFrame (df2).
• Use Case: When you want to retain all rows from df1, even if there's no match in df2.
3. Full (Outer) Join
Explanation:
• Purpose: Returns all rows when there is a match in either df1 or df2. Non-matching
rows will have NULL values in the columns from the other DataFrame.
• Behavior: Retains all rows from both DataFrames, filling in NULL where there is no
match.
• Use Case: When you want to retain all rows from both DataFrames, regardless of
whether there’s a match.
4. Left Semi Join
Explanation:
• Purpose: Returns only the rows from df1 where there is a match in df2. It behaves
like an inner join but only keeps columns from df1.
• Behavior: Filters df1 to only keep rows that have a match in df2.
• Use Case: When you want to filter df1 to keep rows with matching keys in df2, but
you don’t need columns from df2.
5. Left Anti Join
Explanation:
• Purpose: Returns only the rows from df1 that do not have a match in df2.
• Behavior: Filters out rows from df1 that have a match in df2.
• Use Case: When you want to filter df1 to keep rows with no matching keys in df2.
6. Cross Join
Explanation:
• Purpose: Returns the Cartesian product of df1 and df2, meaning every row of df1 is
paired with every row of df2.
• Behavior: The number of rows in the result will be the product of the row count of
df1 and df2.
• Use Case: Typically used in edge cases or for generating combinations of rows, but be
cautious as it can result in a very large DataFrame.
7. Join with Explicit Condition
Explanation:
• Purpose: This is an example of an inner join where the common columns have
different names in df1 and df2.
• Behavior: Joins df1 and df2 based on a condition where columnA from df1 matches
columnB from df2.
• Use Case: When the join condition involves columns with different names or more
complex conditions.
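The join syntaxes described above look like this in code (df1, df2, common_column, columnA, and columnB are placeholders):

inner_df     = df1.join(df2, on="common_column", how="inner")
left_df      = df1.join(df2, on="common_column", how="left")
full_df      = df1.join(df2, on="common_column", how="full")
left_semi_df = df1.join(df2, on="common_column", how="left_semi")
left_anti_df = df1.join(df2, on="common_column", how="left_anti")
cross_df     = df1.crossJoin(df2)
explicit_df  = df1.join(df2, df1["columnA"] == df2["columnB"], "inner")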
Conclusion:
• Inner Join: Matches rows from both DataFrames.
• Left/Right Join: Keeps all rows from the left or right DataFrame and matches where
possible.
• Full Join: Keeps all rows from both DataFrames.
• Left Semi: Filters df1 to rows that match df2 without including columns from df2.
• Left Anti: Filters df1 to rows that do not match df2.
• Cross Join: Returns the Cartesian product, combining all rows of both DataFrames.
• Explicit Condition Join: Allows complex join conditions, including columns with
different names.
These joins are highly useful for various types of data integration and analysis tasks in
PySpark.
from pyspark.sql import Row

# Sample DataFrames
data1 = [Row(id=0), Row(id=1), Row(id=1), Row(id=None), Row(id=None)]
data2 = [Row(id=1), Row(id=0), Row(id=None)]
df1 = spark.createDataFrame(data1)
df2 = spark.createDataFrame(data2)
# Inner Join
inner_join = df1.join(df2, on="id", how="inner")
print("Inner Join:")
inner_join.show()
# Left Join
left_join = df1.join(df2, on="id", how="left")
print("Left Join:")
left_join.show()
• Left Join:
o Keeps all rows from the left DataFrame and includes matching rows from the
right, filling in null for unmatched rows.
• Left Anti Join:
o Keeps only rows from the left DataFrame that do not have a match in the right
DataFrame.
• Summary: Choose a left join to combine data and keep all rows from the left
DataFrame. Use a left anti join to identify entries unique to the left DataFrame.
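A sketch of a left anti join on the same df1 and df2 used above:

# Left Anti Join
left_anti_join = df1.join(df2, on="id", how="left_anti")
print("Left Anti Join:")
left_anti_join.show()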
• Definition: A broadcast join optimizes joins when one DataFrame is small enough to
fit into memory by broadcasting it to all nodes. This eliminates the need to shuffle
data across the cluster, significantly improving performance for large datasets.
• Usage: Recommended for joining a large DataFrame with a small DataFrame (that
can fit into memory). You can force a broadcast join using broadcast(df_small).
• Advantages:
o Avoids data shuffling, which can speed up processing for suitable cases.
o Reduces network I/O, since the small DataFrame is copied to each executor once instead of being shuffled alongside the large one.
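A minimal sketch of a broadcast join (df_large and df_small are placeholders, and df_small is assumed to fit in memory):

from pyspark.sql.functions import broadcast

joined = df_large.join(broadcast(df_small), on="id", how="inner")
joined.show()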
In summary:
• Left Anti Join is useful for identifying non-matching rows from one DataFrame
against another.
• Broadcast Join is a performance optimization technique ideal for joining a large
DataFrame with a small one efficiently, reducing shuffle costs.
These concepts help you manage and optimize your data processing tasks efficiently in
PySpark.
# Sample DataFrames
emp_data = [
    Row(emp_id=1, emp_name="Alice", emp_salary=50000, emp_dept_id=101, emp_location="New York"),
    Row(emp_id=2, emp_name="Bob", emp_salary=60000, emp_dept_id=102, emp_location="Los Angeles"),
    Row(emp_id=3, emp_name="Charlie", emp_salary=55000, emp_dept_id=101, emp_location="Chicago"),
    Row(emp_id=4, emp_name="David", emp_salary=70000, emp_dept_id=103, emp_location="San Francisco"),
    Row(emp_id=5, emp_name="Eve", emp_salary=48000, emp_dept_id=102, emp_location="Houston")
]
dept_data = [
    Row(dept_id=101, dept_name="Engineering", dept_head="John", dept_location="New York"),
    Row(dept_id=102, dept_name="Marketing", dept_head="Mary", dept_location="Los Angeles"),