
PySpark DataFrame Cheat Sheet

Getting Started with PySpark DataFrames

Initializing a SparkSession:

# Step 1: Import the required modules
from pyspark.sql import SparkSession

# Step 2: Create a SparkSession
#   - appName sets a name for your Spark application
#   - config adds any additional configuration if needed
spark = SparkSession.builder \
    .appName("PySpark DataFrame Tutorial") \
    .config("spark.some.config.option", "config-value") \
    .getOrCreate()

# Step 3: Verify the SparkSession
print(spark.version)  # Print the Spark version to ensure the SparkSession was created successfully
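Once the session exists, you can also sanity-check it with a tiny generated DataFrame. A minimal sketch (the range size is arbitrary):

# spark.range() creates a single-column DataFrame of longs named "id"
df_check = spark.range(5)
df_check.show()           # displays rows 0 through 4
print(df_check.count())   # 5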

DataFrame Basics

In PySpark, a DataFrame is a distributed collection of data organized into named columns. For example, consider the following simple DataFrame representing sales data with distinct rows and columns:

+---------+--------+----------+-----------+
| OrderID | Item   | Quantity | UnitPrice |
+---------+--------+----------+-----------+
| 101     | Apple  | 5        | 1.50      |
| 102     | Banana | 3        | 0.75      |
| 103     | Orange | 2        | 1.00      |
+---------+--------+----------+-----------+
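That table can be built directly with createDataFrame. A minimal sketch using the same rows, assuming the SparkSession created above is named spark:

# Build the sales DataFrame shown above from a list of tuples
sales_data = [
    (101, "Apple", 5, 1.50),
    (102, "Banana", 3, 0.75),
    (103, "Orange", 2, 1.00)
]
sales_df = spark.createDataFrame(sales_data, ["OrderID", "Item", "Quantity", "UnitPrice"])
sales_df.show()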

Creating DataFrames

PySpark's versatility extends beyond working with structured data files; it also offers the flexibility to create DataFrames from Resilient Distributed Datasets (RDDs) and various external data sources. Let's explore how you can harness these capabilities.

Creating DataFrames from RDDs

# Sample RDD
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])

# Convert RDD to DataFrame
df_from_rdd = rdd.toDF(["ID", "Name"])

# Show the DataFrame
df_from_rdd.show()

Creating a sample DataFrame with an explicit schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35),
    (4, "David", 30),
    (5, "Eva", 25)
]

# Define the DataFrame schema
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

# Create the DataFrame
df = spark.createDataFrame(data, schema)

Working with Various Data Sources

# Creating DataFrame from CSV
csv_path = "path/to/data.csv"
df_from_csv = spark.read.csv(csv_path, header=True, inferSchema=True)

# Creating DataFrame from JSON
json_path = "path/to/data.json"
df_from_json = spark.read.json(json_path)

# Creating DataFrame from Parquet
parquet_path = "path/to/data.parquet"
df_from_parquet = spark.read.parquet(parquet_path)
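Reading has a mirror-image write API. A brief sketch of saving a DataFrame back out (the output paths are placeholders):

# Writing a DataFrame back out; mode("overwrite") replaces any existing output
df_from_csv.write.mode("overwrite").parquet("path/to/output.parquet")
df_from_csv.write.mode("overwrite").option("header", True).csv("path/to/output_csv")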
Basic DataFrame Operations

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("DataFrame Operations Example") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Display DataFrame contents

Input:

# Display the first 20 rows of the DataFrame
df.show()

Output:

+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 2  | Bob     | 22  |
| 3  | Charlie | 35  |
+----+---------+-----+

Check DataFrame schema

Input:

# Check the DataFrame schema
df.printSchema()

Output:

root
 |-- ID: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
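Beyond show() and printSchema(), a couple of other built-in inspection helpers are often handy. A short sketch using the same df:

# Row count, summary statistics, and column names
print(df.count())          # number of rows
df.describe("Age").show()  # count, mean, stddev, min, max for the Age column
print(df.columns)          # list of column names as Python strings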
Loading data from CSV

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Loading Data from CSV") \
    .getOrCreate()

# Load data from CSV file into a DataFrame
csv_file_path = "path/to/your/csv/file.csv"
df_csv = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df_csv.show()

Loading data from JSON

# Load data from JSON file into a DataFrame
json_file_path = "path/to/your/json/file.json"
df_json = spark.read.json(json_file_path)

# Show the first few rows of the DataFrame
df_json.show()

Loading data from Parquet

# Load data from Parquet file into a DataFrame
parquet_file_path = "path/to/your/parquet/file.parquet"
df_parquet = spark.read.parquet(parquet_file_path)

# Show the first few rows of the DataFrame
df_parquet.show()
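CSV reads in particular accept extra reader options. A small sketch; the delimiter and null marker shown here are illustrative values, not requirements:

# Additional CSV reader options (values shown are examples)
df_csv_opts = (spark.read
    .option("header", True)
    .option("inferSchema", True)
    .option("delimiter", ";")
    .option("nullValue", "NA")
    .csv(csv_file_path))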
Selecting and Filtering Data

A sample DataFrame to demonstrate each operation:

Original DataFrame:
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 2  | Bob     | 22  |
| 3  | Charlie | 35  |
+----+---------+-----+

Selecting specific columns from the DataFrame:

# Select specific columns from the DataFrame
selected_columns = df.select("Name", "Age")

# Show the DataFrame with selected columns
print("DataFrame with Selected Columns:")
selected_columns.show()

Output:

DataFrame with Selected Columns:
+---------+-----+
| Name    | Age |
+---------+-----+
| Alice   | 28  |
| Bob     | 22  |
| Charlie | 35  |
+---------+-----+

Additionally, you can also use the col() function from pyspark.sql.functions to select columns. Here's how you can do it:

# Using col() function to select columns
selected_columns_v2 = df.select(col("Name"), col("Age"))

# Show the DataFrame with selected columns using col() function
print("DataFrame with Selected Columns (using col() function):")
selected_columns_v2.show()

Output:

DataFrame with Selected Columns (using col() function):
+---------+-----+
| Name    | Age |
+---------+-----+
| Alice   | 28  |
| Bob     | 22  |
| Charlie | 35  |
+---------+-----+
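selectExpr() accepts SQL expressions directly, which can be convenient for inline renames and arithmetic. A minimal sketch on the same df (the alias names are arbitrary):

# SQL-style expressions inside a select
renamed = df.selectExpr("Name AS FullName", "Age + 1 AS AgeNextYear")
renamed.show()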
Filtering Data Elements

Using filter() method

# Method 1: Using filter() method
filtered_df = df.filter(df.Age > 25)

# Show the filtered DataFrame
print("Filtered DataFrame (using filter()):")
filtered_df.show()

Output:

Filtered DataFrame (using filter()):
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 3  | Charlie | 35  |
+----+---------+-----+

Using where() Method

# Method 2: Using where() method
where_filtered_df = df.where(df.Age > 25)

# Show the filtered DataFrame
print("Filtered DataFrame (using where() method):")
where_filtered_df.show()

Output:

Filtered DataFrame (using where() method):
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 3  | Charlie | 35  |
+----+---------+-----+

Using expr() Function

from pyspark.sql.functions import expr

# Method 3: Using expr() function
expr_filtered_df = df.filter(expr("Age > 25"))

# Show the filtered DataFrame
print("Filtered DataFrame (using expr() function):")
expr_filtered_df.show()

Output:

Filtered DataFrame (using expr() function):
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 3  | Charlie | 35  |
+----+---------+-----+
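Conditions can also be combined with & (and), | (or), and ~ (not), or expressed with isin(). A brief sketch, assuming col is imported from pyspark.sql.functions:

# Combining filter conditions; each condition must be parenthesized
combined = df.filter((col("Age") > 25) & (col("Name") != "Charlie"))
in_list = df.filter(col("Name").isin("Alice", "Bob"))
combined.show()
in_list.show()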
Data Manipulation and Transformation

Let's use a sample DataFrame to demonstrate some common data transformation examples:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat, expr, when

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Data Transformation") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Adding new columns

# Adding a new column "City" with a default value
df_with_new_column = df.withColumn("City", lit("New York"))

# Show the DataFrame with the new column
print("DataFrame with New Column:")
df_with_new_column.show()

Output:

DataFrame with New Column:
+----+---------+-----+----------+
| ID | Name    | Age | City     |
+----+---------+-----+----------+
| 1  | Alice   | 28  | New York |
| 2  | Bob     | 22  | New York |
| 3  | Charlie | 35  | New York |
+----+---------+-----+----------+

Renaming Columns

# Renaming the column "Name" to "Full_Name"
df_with_renamed_column = df.withColumnRenamed("Name", "Full_Name")

# Show the DataFrame with the renamed column
print("DataFrame with Renamed Column:")
df_with_renamed_column.show()

Output:

DataFrame with Renamed Column:
+----+-----------+-----+
| ID | Full_Name | Age |
+----+-----------+-----+
| 1  | Alice     | 28  |
| 2  | Bob       | 22  |
| 3  | Charlie   | 35  |
+----+-----------+-----+

Dropping Columns

# Dropping the column "Age"
df_dropped_column = df.drop("Age")

# Show the DataFrame with the dropped column
print("DataFrame with Dropped Column:")
df_dropped_column.show()

Output:

DataFrame with Dropped Column:
+----+---------+
| ID | Name    |
+----+---------+
| 1  | Alice   |
| 2  | Bob     |
| 3  | Charlie |
+----+---------+

Using Expressions to Transform Data

# Using expressions to transform data
df_with_id_name = df.withColumn("ID_Name", concat(col("ID"), lit("_"), col("Name")))

# Show the DataFrame with the new "ID_Name" column
print("DataFrame with 'ID_Name' Column:")
df_with_id_name.show()

Output:

DataFrame with 'ID_Name' Column:
+----+---------+-----+-----------+
| ID | Name    | Age | ID_Name   |
+----+---------+-----+-----------+
| 1  | Alice   | 28  | 1_Alice   |
| 2  | Bob     | 22  | 2_Bob     |
| 3  | Charlie | 35  | 3_Charlie |
+----+---------+-----+-----------+

Using SQL-like Expressions

# Using SQL-like expressions to transform data
df_with_age_group = df.withColumn("Age_Group", expr("CASE WHEN Age <= 25 THEN 'Young' ELSE 'Adult' END"))

# Show the DataFrame with the new "Age_Group" column
print("DataFrame with 'Age_Group' Column:")
df_with_age_group.show()

Output:

DataFrame with 'Age_Group' Column:
+----+---------+-----+-----------+
| ID | Name    | Age | Age_Group |
+----+---------+-----+-----------+
| 1  | Alice   | 28  | Adult     |
| 2  | Bob     | 22  | Young     |
| 3  | Charlie | 35  | Adult     |
+----+---------+-----+-----------+

Using Functions to Transform Data

# Using functions to transform data
df_with_status = df.withColumn("Status", when(col("Age") > 25, "Adult").otherwise("Young"))

# Show the DataFrame with the new "Status" column
print("DataFrame with 'Status' Column:")
df_with_status.show()

Output:

+----+---------+-----+--------+
| ID | Name    | Age | Status |
+----+---------+-----+--------+
| 1  | Alice   | 28  | Adult  |
| 2  | Bob     | 22  | Young  |
| 3  | Charlie | 35  | Adult  |
+----+---------+-----+--------+
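Changing a column's type is another common transformation. A minimal sketch using cast() on the same df:

# Casting the Age column to string with cast(); printSchema() confirms the new type
df_age_as_string = df.withColumn("Age", col("Age").cast("string"))
df_age_as_string.printSchema()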
Aggregating and Grouping Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, min, max

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Aggregation Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 100),
    ("Bob", 200),
    ("Charlie", 150),
    ("David", 120),
    ("Eva", 180)
]

# Define the DataFrame schema
columns = ["Name", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

Output:

DataFrame:
+---------+--------+
| Name    | Salary |
+---------+--------+
| Alice   | 100    |
| Bob     | 200    |
| Charlie | 150    |
| David   | 120    |
| Eva     | 180    |
+---------+--------+

Aggregation using min() and max() Functions

# Using min() and max() functions to find minimum and maximum salary
min_salary = df.select(min("Salary")).collect()[0][0]
max_salary = df.select(max("Salary")).collect()[0][0]

print("Minimum Salary:", min_salary)
print("Maximum Salary:", max_salary)

Output:

Minimum Salary: 100
Maximum Salary: 200

Aggregation using sum() Function

# Using sum() function to calculate total salary
total_salary = df.select(sum("Salary")).collect()[0][0]
print("Total Salary:", total_salary)

Output:

Total Salary: 750

Aggregation using avg() Function

# Using avg() function to calculate average salary
average_salary = df.select(avg("Salary")).collect()[0][0]
print("Average Salary:", average_salary)

Output:

Average Salary: 150.0

Aggregation using count() Function

# Using count() function to count the number of employees
employee_count = df.select(count("Name")).collect()[0][0]
print("Number of Employees:", employee_count)

Output:

Number of Employees: 5

Grouping Data Elements

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Grouping Data") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", "Department A", 100),
    ("Bob", "Department B", 200),
    ("Charlie", "Department A", 150),
    ("David", "Department C", 120),
    ("Eva", "Department B", 180)
]

# Define the DataFrame schema
columns = ["Name", "Department", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

# Grouping data based on "Department" and calculating average salary
grouped_data = df.groupBy("Department").agg(avg("Salary").alias("AvgSalary"))

# Show the grouped data
print("Grouped Data:")
grouped_data.show()

Output:

DataFrame:
+---------+--------------+--------+
| Name    | Department   | Salary |
+---------+--------------+--------+
| Alice   | Department A | 100    |
| Bob     | Department B | 200    |
| Charlie | Department A | 150    |
| David   | Department C | 120    |
| Eva     | Department B | 180    |
+---------+--------------+--------+

Grouped Data:
+--------------+-----------+
| Department   | AvgSalary |
+--------------+-----------+
| Department B | 190.0     |
| Department C | 120.0     |
| Department A | 125.0     |
+--------------+-----------+
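Several aggregates can also be computed in a single agg() call, with aliases for readable column names. A brief sketch reusing the salary DataFrame above:

from pyspark.sql.functions import min, max, avg, count

# Multiple aggregations in one pass
summary = df.agg(
    min("Salary").alias("MinSalary"),
    max("Salary").alias("MaxSalary"),
    avg("Salary").alias("AvgSalary"),
    count("Name").alias("Employees")
)
summary.show()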
Joins and Combining DataFrames

A sample example to illustrate the different join types:

DataFrame 1:
+----+---------+
| ID | Name    |
+----+---------+
| 1  | Alice   |
| 2  | Bob     |
| 3  | Charlie |
+----+---------+

DataFrame 2:
+----+----------+
| ID | Role     |
+----+----------+
| 2  | Manager  |
| 3  | Employee |
| 4  | Intern   |
+----+----------+

Now, let's demonstrate different join types:

1. Inner Join:
inner_join = df1.join(df2, on="ID", how="inner")
inner_join.show()

2. Outer Join:
outer_join = df1.join(df2, on="ID", how="outer")
outer_join.show()

3. Left Join:
left_join = df1.join(df2, on="ID", how="left")
left_join.show()

4. Right Join:
right_join = df1.join(df2, on="ID", how="right")
right_join.show()

Let's now use a few examples to illustrate how to combine DataFrames using different join types. Consider two sample DataFrames:

a. DataFrame orders:
+-------+---------+----------+
| order | product | quantity |
+-------+---------+----------+
| 101   | apple   | 3        |
| 102   | banana  | 2        |
| 103   | orange  | 4        |
+-------+---------+----------+

b. DataFrame products:
+---------+-------+
| product | price |
+---------+-------+
| apple   | 1.5   |
| banana  | 0.75  |
| grape   | 2.0   |
+---------+-------+

1. Inner Join:
inner_join = orders.join(products, on="product", how="inner")
inner_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
+---------+-------+----------+-------+

2. Left Join:
left_join = orders.join(products, on="product", how="left")
left_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
| orange  | 103   | 4        | null  |
+---------+-------+----------+-------+

3. Right Join:
right_join = orders.join(products, on="product", how="right")
right_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
| grape   | null  | null     | 2.0   |
+---------+-------+----------+-------+

4. Outer Join:
outer_join = orders.join(products, on="product", how="outer")
outer_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
| grape   | null  | null     | 2.0   |
| orange  | 103   | 4        | null  |
+---------+-------+----------+-------+
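When one side of a join is small, an explicit broadcast hint can avoid a shuffle (this ties in with the performance tips later). A minimal sketch using the orders and products DataFrames above:

from pyspark.sql.functions import broadcast

# Broadcast the small products DataFrame to every executor before joining
broadcast_join = orders.join(broadcast(products), on="product", how="inner")
broadcast_join.show()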
Working with Dates and Timestamps

Sample DataFrame with date and timestamp data as mentioned below:

+---------+------------+---------------------+
| Name    | Date       | Timestamp           |
+---------+------------+---------------------+
| Alice   | 2022-01-15 | 2022-01-15 08:30:00 |
| Bob     | 2021-12-20 | 2021-12-20 15:45:00 |
| Charlie | 2022-02-28 | 2022-02-28 11:00:00 |
+---------+------------+---------------------+

Current Date and Timestamp

from pyspark.sql.functions import current_date, current_timestamp

# Adding columns for current date and timestamp
df_with_current_date = df.withColumn("CurrentDate", current_date())
df_with_current_timestamp = df.withColumn("CurrentTimestamp", current_timestamp())

# Show the DataFrames
print("DataFrame with Current Date:")
df_with_current_date.show()

print("DataFrame with Current Timestamp:")
df_with_current_timestamp.show()

Date Difference

from pyspark.sql.functions import datediff

date_diff_df = df.withColumn("DaysSince", datediff(current_date(), col("Date")))

# Show the DataFrame with date difference
print("DataFrame with Date Difference:")
date_diff_df.show()

Months Between

from pyspark.sql.functions import months_between

months_between_df = df.withColumn("MonthsBetween", months_between(current_date(), col("Date")))

# Show the DataFrame with months between
print("DataFrame with Months Between:")
months_between_df.show()

Date Addition and Subtraction

from pyspark.sql.functions import date_add, date_sub

date_add_df = df.withColumn("DatePlus10Days", date_add(col("Date"), 10))
date_sub_df = df.withColumn("DateMinus5Days", date_sub(col("Date"), 5))

# Show the DataFrames with date addition and subtraction
print("DataFrame with Date Addition:")
date_add_df.show()

print("DataFrame with Date Subtraction:")
date_sub_df.show()
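If dates arrive as strings, to_date() and date_format() convert between strings and date values. A brief sketch on the same df; the patterns shown are examples:

from pyspark.sql.functions import to_date, date_format

# Parse a string column into a date, then format it back into a display string
df_parsed = df.withColumn("DateParsed", to_date(col("Date"), "yyyy-MM-dd"))
df_formatted = df_parsed.withColumn("MonthLabel", date_format(col("DateParsed"), "MMMM yyyy"))
df_formatted.show()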
Advanced DataFrame Operations

Window Functions

Suppose you have a DataFrame with sales data and you want to calculate the rolling average of sales for each product over a specific window size.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, avg

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Window Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-01-02", 150),
    ("ProductA", "2022-01-03", 200),
    ("ProductA", "2022-01-04", 120),
    ("ProductA", "2022-01-05", 180),
    ("ProductB", "2022-01-01", 50),
    ("ProductB", "2022-01-02", 70),
    ("ProductB", "2022-01-03", 90)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Define the window specification
window_spec = Window.partitionBy("Product").orderBy("Date").rowsBetween(-1, 1)

# Calculate rolling average using window function
df_with_rolling_avg = df.withColumn("RollingAvg", avg(col("Sales")).over(window_spec))

# Show the DataFrame with rolling average
df_with_rolling_avg.show()

Pivot Tables

Suppose you have a DataFrame with sales data and you want to create a pivot table showing total sales for each product in different months.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pivot Tables") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-02-01", 150),
    ("ProductA", "2022-01-01", 200),
    ("ProductB", "2022-02-01", 120),
    ("ProductB", "2022-02-01", 180)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Create a pivot table
pivot_table = df.groupBy("Product").pivot("Date").sum("Sales")

# Show the pivot table
pivot_table.show()

Handling Missing Data

Identifying and handling missing data is a crucial step in data preprocessing and analysis. PySpark provides several techniques to identify and handle missing data in DataFrames. Let's explore these techniques.

Identifying Missing Data

PySpark represents missing data as null values. You can use various methods to identify missing data in DataFrames:
▪ isNull() and isNotNull()
▪ dropna()
▪ fillna()

Handling Missing Data

• Dropping Rows with Null Values

df_without_null = df.dropna()

• Filling Null Values

from pyspark.sql.functions import avg

df_filled_mean = df.fillna({'age': df.select(avg('age')).first()[0]})

• Imputation

from pyspark.ml.feature import Imputer

imputer = Imputer(inputCols=['age'], outputCols=['imputed_age'])
imputed_df = imputer.fit(df).transform(df)

• Adding an Indicator Column

df_with_indicator = df.withColumn('age_missing', df['age'].isNull())

• Handling Categorical Missing Data

df_filled_category = df.fillna({'gender': 'unknown'})
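A quick way to see how much data is missing is to count nulls per column. A minimal sketch that works on any DataFrame df, assuming col and sum are imported from pyspark.sql.functions:

from pyspark.sql.functions import col, sum as sum_

# Count null values in each column of df
null_counts = df.select([
    sum_(col(c).isNull().cast("int")).alias(c) for c in df.columns
])
null_counts.show()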
Performance Optimization Tips

Optimizing PySpark DataFrame operations is essential to achieve better performance and efficient data processing. Here are some tips and best practices to consider:

1. Use Lazy Evaluation: PySpark uses lazy evaluation, which means that transformations on DataFrames are not executed immediately but are queued up. This allows PySpark to optimize the execution plan before actually performing computations.

2. Caching: Caching involves storing a DataFrame or RDD in memory so that it can be reused efficiently across multiple operations. Use .cache() or .persist() to cache intermediate DataFrames that are used in multiple transformations or actions (see the short sketch after this list).

3. Partitioning: Partitioning involves dividing your data into smaller subsets (partitions) based on certain criteria. This can significantly improve query performance as it reduces the amount of data that needs to be processed. Use .repartition() or .coalesce() to manage partitions.

4. Broadcasting: Broadcasting is a technique where smaller DataFrames are distributed to worker nodes and cached in memory for join operations. This is particularly useful when you have a small DataFrame that needs to be joined with a larger one.

5. Avoid Using .collect(): Using .collect() brings all the data to the driver node, which can lead to memory issues. Instead, try to perform operations using distributed computations.

6. Use Built-in Functions: Whenever possible, use built-in functions from the pyspark.sql.functions module. These functions are optimized for distributed processing and provide better performance than custom Python functions.

7. Avoid Shuffling: Shuffling is an expensive operation that involves data movement between partitions. Minimize shuffling by using appropriate partitioning and avoiding operations that require data to be rearranged.

8. Optimize Joins: Joins can be performance-intensive. Try to avoid shuffling during joins by ensuring both DataFrames are appropriately partitioned and using the appropriate join strategy (broadcast, sortMerge, etc.).

9. Use explain(): Use the explain() method on DataFrames to understand the execution plan and identify potential optimization opportunities.

10. Hardware Considerations: Consider the cluster configuration, hardware resources, and the amount of data being processed. Properly allocating resources and scaling up the cluster can significantly impact performance.

11. Monitor Resource Usage: Keep an eye on resource usage, including CPU, memory, and disk I/O. Monitoring can help identify performance bottlenecks and resource constraints.

12. Use Parquet Format: Parquet is a columnar storage format that is highly efficient for both reading and writing. Consider using Parquet for storage as it can improve read and write performance.
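A short sketch tying together caching (tip 2) and explain() (tip 9). It assumes a DataFrame df with an Age column and col imported from pyspark.sql.functions:

# Cache a DataFrame that will be reused, then inspect the planned execution
df_cached = df.filter(col("Age") > 25).cache()
df_cached.count()       # first action materializes the cache
df_cached.explain()     # prints the physical plan
df_cached.unpersist()   # release the cached data when done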
PySpark SQL Cheat Sheet: SQL Functions for DataFrames

Filtering Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySpark SQL Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 28),
    ("Bob", 22),
    ("Charlie", 35)
]

# Define the DataFrame schema
columns = ["Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Using SQL-like functions to filter data
filtered_data = df.filter(col("Age") > 25)
selected_columns = df.select("Name", "Age")

# Show the filtered and selected data
print("Filtered Data:")
filtered_data.show()

print("Selected Columns:")
selected_columns.show()

Aggregation

from pyspark.sql.functions import avg, max

# Using SQL-like functions for aggregation
grouped_data = df.groupBy("Age").agg(avg("Age"), max("Age"))

# Show the aggregated data
print("Aggregated Data:")
grouped_data.show()

Sorting Data

# Using SQL-like functions for sorting
sorted_data = df.orderBy("Age")
desc_sorted_data = df.sort(col("Age").desc())

# Show the sorted data
print("Sorted Data:")
sorted_data.show()

print("Descending Sorted Data:")
desc_sorted_data.show()

Joining DataFrames

# Sample data for another DataFrame
data2 = [
    ("Alice", "New York"),
    ("Bob", "San Francisco"),
    ("Eva", "Los Angeles")
]

columns2 = ["Name", "City"]
df2 = spark.createDataFrame(data2, columns2)

# Using SQL-like functions to join DataFrames
joined_data = df.join(df2, on="Name", how="inner")

# Show the joined data
print("Joined Data:")
joined_data.show()
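DataFrames can also be queried with plain SQL by registering a temporary view. A brief sketch on the same df; the view name is arbitrary:

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25 ORDER BY Age DESC")
result.show()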
