
PySpark DataFrame Cheat Sheet

Getting Started with PySpark DataFrames

Initializing a SparkSession:

# Step 1: Import the required modules
from pyspark.sql import SparkSession

# Step 2: Create a SparkSession
#   - appName sets a name for your Spark application
#   - config adds any additional configuration if needed
spark = SparkSession.builder \
    .appName("PySpark DataFrame Tutorial") \
    .config("spark.some.config.option", "config-value") \
    .getOrCreate()

# Step 3: Verify the SparkSession
print(spark.version)  # Print the Spark version to ensure the SparkSession was created successfully
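Once the session exists, you can also sanity-check it with a tiny generated DataFrame. A minimal sketch (the range size is arbitrary):

# spark.range() creates a single-column DataFrame of longs named "id"
df_check = spark.range(5)
df_check.show()           # displays rows 0 through 4
print(df_check.count())   # 5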

DataFrame Basics

In PySpark, a DataFrame is a distributed collection of data organized into named columns. For example, consider the following simple DataFrame representing sales data with distinct rows and columns:

+---------+--------+----------+-----------+
| OrderID | Item   | Quantity | UnitPrice |
+---------+--------+----------+-----------+
| 101     | Apple  | 5        | 1.50      |
| 102     | Banana | 3        | 0.75      |
| 103     | Orange | 2        | 1.00      |
+---------+--------+----------+-----------+
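That table can be built directly with createDataFrame. A minimal sketch using the same rows, assuming the SparkSession created above is named spark:

# Build the sales DataFrame shown above from a list of tuples
sales_data = [
    (101, "Apple", 5, 1.50),
    (102, "Banana", 3, 0.75),
    (103, "Orange", 2, 1.00)
]
sales_df = spark.createDataFrame(sales_data, ["OrderID", "Item", "Quantity", "UnitPrice"])
sales_df.show()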

Creating DataFrames

PySpark's versatility extends beyond working with structured data files; it also offers the flexibility to create DataFrames from Resilient Distributed Datasets (RDDs) and various external data sources. Let's explore how you can harness these capabilities.

Creating DataFrames from RDDs

# Sample RDD
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])

# Convert RDD to DataFrame
df_from_rdd = rdd.toDF(["ID", "Name"])

# Show the DataFrame
df_from_rdd.show()

Creating a sample DataFrame with an explicit schema:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35),
    (4, "David", 30),
    (5, "Eva", 25)
]

# Define the DataFrame schema
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

# Create the DataFrame
df = spark.createDataFrame(data, schema)

Working with Various Data Sources

# Creating DataFrame from CSV
csv_path = "path/to/data.csv"
df_from_csv = spark.read.csv(csv_path, header=True, inferSchema=True)

# Creating DataFrame from JSON
json_path = "path/to/data.json"
df_from_json = spark.read.json(json_path)

# Creating DataFrame from Parquet
parquet_path = "path/to/data.parquet"
df_from_parquet = spark.read.parquet(parquet_path)
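Reading has a mirror-image write API. A brief sketch of saving a DataFrame back out (the output paths are placeholders):

# Writing a DataFrame back out; mode("overwrite") replaces any existing output
df_from_csv.write.mode("overwrite").parquet("path/to/output.parquet")
df_from_csv.write.mode("overwrite").option("header", True).csv("path/to/output_csv")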
Basic DataFrame Operations

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("DataFrame Operations Example") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Display DataFrame contents

Input:

# Display the first 20 rows of the DataFrame
df.show()

Output:

+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 2  | Bob     | 22  |
| 3  | Charlie | 35  |
+----+---------+-----+

Check DataFrame schema

Input:

# Check the DataFrame schema
df.printSchema()

Output:

root
 |-- ID: long (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
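Beyond show() and printSchema(), a couple of other built-in inspection helpers are often handy. A short sketch using the same df:

# Row count, summary statistics, and column names
print(df.count())          # number of rows
df.describe("Age").show()  # count, mean, stddev, min, max for the Age column
print(df.columns)          # list of column names as Python strings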
Loading data from CSV

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Loading Data from CSV") \
    .getOrCreate()

# Load data from CSV file into a DataFrame
csv_file_path = "path/to/your/csv/file.csv"
df_csv = spark.read.csv(csv_file_path, header=True, inferSchema=True)

# Show the first few rows of the DataFrame
df_csv.show()

Loading data from JSON

# Load data from JSON file into a DataFrame
json_file_path = "path/to/your/json/file.json"
df_json = spark.read.json(json_file_path)

# Show the first few rows of the DataFrame
df_json.show()

Loading data from Parquet

# Load data from Parquet file into a DataFrame
parquet_file_path = "path/to/your/parquet/file.parquet"
df_parquet = spark.read.parquet(parquet_file_path)

# Show the first few rows of the DataFrame
df_parquet.show()
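CSV reads in particular accept extra reader options. A small sketch; the delimiter and null marker shown here are illustrative values, not requirements:

# Additional CSV reader options (values shown are examples)
df_csv_opts = (spark.read
    .option("header", True)
    .option("inferSchema", True)
    .option("delimiter", ";")
    .option("nullValue", "NA")
    .csv(csv_file_path))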
Selecting and Filtering Data

A sample DataFrame to demonstrate each operation:

Original DataFrame:
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 2  | Bob     | 22  |
| 3  | Charlie | 35  |
+----+---------+-----+

Selecting specific columns from the DataFrame:

# Select specific columns from the DataFrame
selected_columns = df.select("Name", "Age")

# Show the DataFrame with selected columns
print("DataFrame with Selected Columns:")
selected_columns.show()

Output:

DataFrame with Selected Columns:
+---------+-----+
| Name    | Age |
+---------+-----+
| Alice   | 28  |
| Bob     | 22  |
| Charlie | 35  |
+---------+-----+

Additionally, you can also use the col() function from pyspark.sql.functions to select columns. Here's how you can do it:

# Using col() function to select columns
selected_columns_v2 = df.select(col("Name"), col("Age"))

# Show the DataFrame with selected columns using col() function
print("DataFrame with Selected Columns (using col() function):")
selected_columns_v2.show()

Output:

DataFrame with Selected Columns (using col() function):
+---------+-----+
| Name    | Age |
+---------+-----+
| Alice   | 28  |
| Bob     | 22  |
| Charlie | 35  |
+---------+-----+
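selectExpr() accepts SQL expressions directly, which can be convenient for inline renames and arithmetic. A minimal sketch on the same df (the alias names are arbitrary):

# SQL-style expressions inside a select
renamed = df.selectExpr("Name AS FullName", "Age + 1 AS AgeNextYear")
renamed.show()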
Filtering Data Elements

Using filter() method

# Method 1: Using filter() method
filtered_df = df.filter(df.Age > 25)

# Show the filtered DataFrame
print("Filtered DataFrame (using filter()):")
filtered_df.show()

Output:

Filtered DataFrame (using filter()):
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 3  | Charlie | 35  |
+----+---------+-----+

Using where() Method

# Method 2: Using where() method
where_filtered_df = df.where(df.Age > 25)

# Show the filtered DataFrame
print("Filtered DataFrame (using where() method):")
where_filtered_df.show()

Output:

Filtered DataFrame (using where() method):
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 3  | Charlie | 35  |
+----+---------+-----+

Using expr() Function

from pyspark.sql.functions import expr

# Method 3: Using expr() function
expr_filtered_df = df.filter(expr("Age > 25"))

# Show the filtered DataFrame
print("Filtered DataFrame (using expr() function):")
expr_filtered_df.show()

Output:

Filtered DataFrame (using expr() function):
+----+---------+-----+
| ID | Name    | Age |
+----+---------+-----+
| 1  | Alice   | 28  |
| 3  | Charlie | 35  |
+----+---------+-----+
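Conditions can also be combined with & (and), | (or), and ~ (not), or expressed with isin(). A brief sketch, assuming col is imported from pyspark.sql.functions:

# Combining filter conditions; each condition must be parenthesized
combined = df.filter((col("Age") > 25) & (col("Name") != "Charlie"))
in_list = df.filter(col("Name").isin("Alice", "Bob"))
combined.show()
in_list.show()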
Data Manipulation and Transformation

Let's use a sample DataFrame to demonstrate some common data transformation examples:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, concat, expr, when

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Data Transformation") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    (1, "Alice", 28),
    (2, "Bob", 22),
    (3, "Charlie", 35)
]

# Define the DataFrame schema
columns = ["ID", "Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

Adding new columns

# Adding a new column "City" with a default value
df_with_new_column = df.withColumn("City", lit("New York"))

# Show the DataFrame with the new column
print("DataFrame with New Column:")
df_with_new_column.show()

Output:

DataFrame with New Column:
+----+---------+-----+----------+
| ID | Name    | Age | City     |
+----+---------+-----+----------+
| 1  | Alice   | 28  | New York |
| 2  | Bob     | 22  | New York |
| 3  | Charlie | 35  | New York |
+----+---------+-----+----------+

Renaming Columns

# Renaming the column "Name" to "Full_Name"
df_with_renamed_column = df.withColumnRenamed("Name", "Full_Name")

# Show the DataFrame with the renamed column
print("DataFrame with Renamed Column:")
df_with_renamed_column.show()

Output:

DataFrame with Renamed Column:
+----+-----------+-----+
| ID | Full_Name | Age |
+----+-----------+-----+
| 1  | Alice     | 28  |
| 2  | Bob       | 22  |
| 3  | Charlie   | 35  |
+----+-----------+-----+

Dropping Columns

# Dropping the column "Age"
df_dropped_column = df.drop("Age")

# Show the DataFrame with the dropped column
print("DataFrame with Dropped Column:")
df_dropped_column.show()

Output:

DataFrame with Dropped Column:
+----+---------+
| ID | Name    |
+----+---------+
| 1  | Alice   |
| 2  | Bob     |
| 3  | Charlie |
+----+---------+

Using Expressions to Transform Data

# Using expressions to transform data
df_with_id_name = df.withColumn("ID_Name", concat(col("ID"), lit("_"), col("Name")))

# Show the DataFrame with the new "ID_Name" column
print("DataFrame with 'ID_Name' Column:")
df_with_id_name.show()

Output:

DataFrame with 'ID_Name' Column:
+----+---------+-----+-----------+
| ID | Name    | Age | ID_Name   |
+----+---------+-----+-----------+
| 1  | Alice   | 28  | 1_Alice   |
| 2  | Bob     | 22  | 2_Bob     |
| 3  | Charlie | 35  | 3_Charlie |
+----+---------+-----+-----------+

Using SQL-like Expressions

# Using SQL-like expressions to transform data
df_with_age_group = df.withColumn("Age_Group", expr("CASE WHEN Age <= 25 THEN 'Young' ELSE 'Adult' END"))

# Show the DataFrame with the new "Age_Group" column
print("DataFrame with 'Age_Group' Column:")
df_with_age_group.show()

Output:

DataFrame with 'Age_Group' Column:
+----+---------+-----+-----------+
| ID | Name    | Age | Age_Group |
+----+---------+-----+-----------+
| 1  | Alice   | 28  | Adult     |
| 2  | Bob     | 22  | Young     |
| 3  | Charlie | 35  | Adult     |
+----+---------+-----+-----------+

Using Functions to Transform Data

# Using functions to transform data
df_with_status = df.withColumn("Status", when(col("Age") > 25, "Adult").otherwise("Young"))

# Show the DataFrame with the new "Status" column
print("DataFrame with 'Status' Column:")
df_with_status.show()

Output:

+----+---------+-----+--------+
| ID | Name    | Age | Status |
+----+---------+-----+--------+
| 1  | Alice   | 28  | Adult  |
| 2  | Bob     | 22  | Young  |
| 3  | Charlie | 35  | Adult  |
+----+---------+-----+--------+
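Changing a column's type is another common transformation. A minimal sketch using cast() on the same df:

# Casting the Age column to string with cast(); printSchema() confirms the new type
df_age_as_string = df.withColumn("Age", col("Age").cast("string"))
df_age_as_string.printSchema()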
Aggregating and Grouping Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, min, max

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Aggregation Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 100),
    ("Bob", 200),
    ("Charlie", 150),
    ("David", 120),
    ("Eva", 180)
]

# Define the DataFrame schema
columns = ["Name", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

Output:

DataFrame:
+---------+--------+
| Name    | Salary |
+---------+--------+
| Alice   | 100    |
| Bob     | 200    |
| Charlie | 150    |
| David   | 120    |
| Eva     | 180    |
+---------+--------+

Aggregation using min() and max() Functions

# Using min() and max() functions to find minimum and maximum salary
min_salary = df.select(min("Salary")).collect()[0][0]
max_salary = df.select(max("Salary")).collect()[0][0]

print("Minimum Salary:", min_salary)
print("Maximum Salary:", max_salary)

Output:

Minimum Salary: 100
Maximum Salary: 200

Aggregation using sum() Function

# Using sum() function to calculate total salary
total_salary = df.select(sum("Salary")).collect()[0][0]
print("Total Salary:", total_salary)

Output:

Total Salary: 750

Aggregation using avg() Function

# Using avg() function to calculate average salary
average_salary = df.select(avg("Salary")).collect()[0][0]
print("Average Salary:", average_salary)

Output:

Average Salary: 150.0

Aggregation using count() Function

# Using count() function to count the number of employees
employee_count = df.select(count("Name")).collect()[0][0]
print("Number of Employees:", employee_count)

Output:

Number of Employees: 5

Grouping Data Elements

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Grouping Data") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", "Department A", 100),
    ("Bob", "Department B", 200),
    ("Charlie", "Department A", 150),
    ("David", "Department C", 120),
    ("Eva", "Department B", 180)
]

# Define the DataFrame schema
columns = ["Name", "Department", "Salary"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print("DataFrame:")
df.show()

# Grouping data based on "Department" and calculating average salary
grouped_data = df.groupBy("Department").agg(avg("Salary").alias("AvgSalary"))

# Show the grouped data
print("Grouped Data:")
grouped_data.show()

Output:

DataFrame:
+---------+--------------+--------+
| Name    | Department   | Salary |
+---------+--------------+--------+
| Alice   | Department A | 100    |
| Bob     | Department B | 200    |
| Charlie | Department A | 150    |
| David   | Department C | 120    |
| Eva     | Department B | 180    |
+---------+--------------+--------+

Grouped Data:
+--------------+-----------+
| Department   | AvgSalary |
+--------------+-----------+
| Department B | 190.0     |
| Department C | 120.0     |
| Department A | 125.0     |
+--------------+-----------+
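Several aggregates can also be computed in a single agg() call, with aliases for readable column names. A brief sketch reusing the salary DataFrame above:

from pyspark.sql.functions import min, max, avg, count

# Multiple aggregations in one pass
summary = df.agg(
    min("Salary").alias("MinSalary"),
    max("Salary").alias("MaxSalary"),
    avg("Salary").alias("AvgSalary"),
    count("Name").alias("Employees")
)
summary.show()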
Joins and Combining DataFrames

A sample example to illustrate the different join types:

DataFrame 1:
+----+---------+
| ID | Name    |
+----+---------+
| 1  | Alice   |
| 2  | Bob     |
| 3  | Charlie |
+----+---------+

DataFrame 2:
+----+----------+
| ID | Role     |
+----+----------+
| 2  | Manager  |
| 3  | Employee |
| 4  | Intern   |
+----+----------+

Now, let's demonstrate different join types:

1. Inner Join:
inner_join = df1.join(df2, on="ID", how="inner")
inner_join.show()

2. Outer Join:
outer_join = df1.join(df2, on="ID", how="outer")
outer_join.show()

3. Left Join:
left_join = df1.join(df2, on="ID", how="left")
left_join.show()

4. Right Join:
right_join = df1.join(df2, on="ID", how="right")
right_join.show()

Let's now use a few examples to illustrate how to combine DataFrames using different join types. Consider two sample DataFrames:

a. DataFrame orders:
+-------+---------+----------+
| order | product | quantity |
+-------+---------+----------+
| 101   | apple   | 3        |
| 102   | banana  | 2        |
| 103   | orange  | 4        |
+-------+---------+----------+

b. DataFrame products:
+---------+-------+
| product | price |
+---------+-------+
| apple   | 1.5   |
| banana  | 0.75  |
| grape   | 2.0   |
+---------+-------+

1. Inner Join:
inner_join = orders.join(products, on="product", how="inner")
inner_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
+---------+-------+----------+-------+

2. Left Join:
left_join = orders.join(products, on="product", how="left")
left_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
| orange  | 103   | 4        | null  |
+---------+-------+----------+-------+

3. Right Join:
right_join = orders.join(products, on="product", how="right")
right_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
| grape   | null  | null     | 2.0   |
+---------+-------+----------+-------+

4. Outer Join:
outer_join = orders.join(products, on="product", how="outer")
outer_join.show()

Output:
+---------+-------+----------+-------+
| product | order | quantity | price |
+---------+-------+----------+-------+
| apple   | 101   | 3        | 1.5   |
| banana  | 102   | 2        | 0.75  |
| grape   | null  | null     | 2.0   |
| orange  | 103   | 4        | null  |
+---------+-------+----------+-------+
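When one side of a join is small, an explicit broadcast hint can avoid a shuffle (this ties in with the performance tips later). A minimal sketch using the orders and products DataFrames above:

from pyspark.sql.functions import broadcast

# Broadcast the small products DataFrame to every executor before joining
broadcast_join = orders.join(broadcast(products), on="product", how="inner")
broadcast_join.show()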
Working with Dates and Timestamps

Sample DataFrame with date and timestamp data as mentioned below:

+---------+------------+---------------------+
| Name    | Date       | Timestamp           |
+---------+------------+---------------------+
| Alice   | 2022-01-15 | 2022-01-15 08:30:00 |
| Bob     | 2021-12-20 | 2021-12-20 15:45:00 |
| Charlie | 2022-02-28 | 2022-02-28 11:00:00 |
+---------+------------+---------------------+

Current Date and Timestamp

from pyspark.sql.functions import current_date, current_timestamp

# Adding columns for current date and timestamp
df_with_current_date = df.withColumn("CurrentDate", current_date())
df_with_current_timestamp = df.withColumn("CurrentTimestamp", current_timestamp())

# Show the DataFrames
print("DataFrame with Current Date:")
df_with_current_date.show()

print("DataFrame with Current Timestamp:")
df_with_current_timestamp.show()

Date Difference

from pyspark.sql.functions import datediff

date_diff_df = df.withColumn("DaysSince", datediff(current_date(), col("Date")))

# Show the DataFrame with date difference
print("DataFrame with Date Difference:")
date_diff_df.show()

Months Between

from pyspark.sql.functions import months_between

months_between_df = df.withColumn("MonthsBetween", months_between(current_date(), col("Date")))

# Show the DataFrame with months between
print("DataFrame with Months Between:")
months_between_df.show()

Date Addition and Subtraction

from pyspark.sql.functions import date_add, date_sub

date_add_df = df.withColumn("DatePlus10Days", date_add(col("Date"), 10))
date_sub_df = df.withColumn("DateMinus5Days", date_sub(col("Date"), 5))

# Show the DataFrames with date addition and subtraction
print("DataFrame with Date Addition:")
date_add_df.show()

print("DataFrame with Date Subtraction:")
date_sub_df.show()
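If dates arrive as strings, to_date() and date_format() convert between strings and date values. A brief sketch on the same df; the patterns shown are examples:

from pyspark.sql.functions import to_date, date_format

# Parse a string column into a date, then format it back into a display string
df_parsed = df.withColumn("DateParsed", to_date(col("Date"), "yyyy-MM-dd"))
df_formatted = df_parsed.withColumn("MonthLabel", date_format(col("DateParsed"), "MMMM yyyy"))
df_formatted.show()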
Advanced DataFrame Operations

Window Functions

Suppose you have a DataFrame with sales data and you want to calculate the rolling average of sales for each product over a specific window size.

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, avg

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Window Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-01-02", 150),
    ("ProductA", "2022-01-03", 200),
    ("ProductA", "2022-01-04", 120),
    ("ProductA", "2022-01-05", 180),
    ("ProductB", "2022-01-01", 50),
    ("ProductB", "2022-01-02", 70),
    ("ProductB", "2022-01-03", 90)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Define the window specification
window_spec = Window.partitionBy("Product").orderBy("Date").rowsBetween(-1, 1)

# Calculate rolling average using window function
df_with_rolling_avg = df.withColumn("RollingAvg", avg(col("Sales")).over(window_spec))

# Show the DataFrame with rolling average
df_with_rolling_avg.show()

Pivot Tables

Suppose you have a DataFrame with sales data and you want to create a pivot table showing total sales for each product in different months.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Pivot Tables") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("ProductA", "2022-01-01", 100),
    ("ProductA", "2022-02-01", 150),
    ("ProductA", "2022-01-01", 200),
    ("ProductB", "2022-02-01", 120),
    ("ProductB", "2022-02-01", 180)
]

# Define the DataFrame schema
columns = ["Product", "Date", "Sales"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Create a pivot table
pivot_table = df.groupBy("Product").pivot("Date").sum("Sales")

# Show the pivot table
pivot_table.show()

Handling Missing Data

Identifying and handling missing data is a crucial step in data preprocessing and analysis. PySpark provides several techniques to identify and handle missing data in DataFrames. Let's explore these techniques.

Identifying Missing Data

PySpark represents missing data as null values. You can use various methods to identify missing data in DataFrames:
▪ isNull() and isNotNull()
▪ dropna()
▪ fillna()

Handling Missing Data

• Dropping Rows with Null Values

df_without_null = df.dropna()

• Filling Null Values

from pyspark.sql.functions import avg

df_filled_mean = df.fillna({'age': df.select(avg('age')).first()[0]})

• Imputation

from pyspark.ml.feature import Imputer

imputer = Imputer(inputCols=['age'], outputCols=['imputed_age'])
imputed_df = imputer.fit(df).transform(df)

• Adding an Indicator Column

df_with_indicator = df.withColumn('age_missing', df['age'].isNull())

• Handling Categorical Missing Data

df_filled_category = df.fillna({'gender': 'unknown'})
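A quick way to see how much data is missing is to count nulls per column. A minimal sketch that works on any DataFrame df, assuming col and sum are imported from pyspark.sql.functions:

from pyspark.sql.functions import col, sum as sum_

# Count null values in each column of df
null_counts = df.select([
    sum_(col(c).isNull().cast("int")).alias(c) for c in df.columns
])
null_counts.show()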
Performance Optimization Tips

Optimizing PySpark DataFrame operations is essential to achieve better performance and efficient data processing. Here are some tips and best practices to consider:

1. Use Lazy Evaluation: PySpark uses lazy evaluation, which means that transformations on DataFrames are not executed immediately but are queued up. This allows PySpark to optimize the execution plan before actually performing computations.

2. Caching: Caching involves storing a DataFrame or RDD in memory so that it can be reused efficiently across multiple operations. Use .cache() or .persist() to cache intermediate DataFrames that are used in multiple transformations or actions (see the short sketch after this list).

3. Partitioning: Partitioning involves dividing your data into smaller subsets (partitions) based on certain criteria. This can significantly improve query performance as it reduces the amount of data that needs to be processed. Use .repartition() or .coalesce() to manage partitions.

4. Broadcasting: Broadcasting is a technique where smaller DataFrames are distributed to worker nodes and cached in memory for join operations. This is particularly useful when you have a small DataFrame that needs to be joined with a larger one.

5. Avoid Using .collect(): Using .collect() brings all the data to the driver node, which can lead to memory issues. Instead, try to perform operations using distributed computations.

6. Use Built-in Functions: Whenever possible, use built-in functions from the pyspark.sql.functions module. These functions are optimized for distributed processing and provide better performance than custom Python functions.

7. Avoid Shuffling: Shuffling is an expensive operation that involves data movement between partitions. Minimize shuffling by using appropriate partitioning and avoiding operations that require data to be rearranged.

8. Optimize Joins: Joins can be performance-intensive. Try to avoid shuffling during joins by ensuring both DataFrames are appropriately partitioned and using the appropriate join strategy (broadcast, sortMerge, etc.).

9. Use explain(): Use the explain() method on DataFrames to understand the execution plan and identify potential optimization opportunities.

10. Hardware Considerations: Consider the cluster configuration, hardware resources, and the amount of data being processed. Properly allocating resources and scaling up the cluster can significantly impact performance.

11. Monitor Resource Usage: Keep an eye on resource usage, including CPU, memory, and disk I/O. Monitoring can help identify performance bottlenecks and resource constraints.

12. Use Parquet Format: Parquet is a columnar storage format that is highly efficient for both reading and writing. Consider using Parquet for storage as it can improve read and write performance.
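A short sketch tying together caching (tip 2) and explain() (tip 9). It assumes a DataFrame df with an Age column and col imported from pyspark.sql.functions:

# Cache a DataFrame that will be reused, then inspect the planned execution
df_cached = df.filter(col("Age") > 25).cache()
df_cached.count()       # first action materializes the cache
df_cached.explain()     # prints the physical plan
df_cached.unpersist()   # release the cached data when done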
PySpark SQL Cheat Sheet: SQL Functions for DataFrames

Filtering Data

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a SparkSession
spark = SparkSession.builder \
    .appName("PySpark SQL Functions") \
    .getOrCreate()

# Sample data for the DataFrame
data = [
    ("Alice", 28),
    ("Bob", 22),
    ("Charlie", 35)
]

# Define the DataFrame schema
columns = ["Name", "Age"]

# Create the DataFrame
df = spark.createDataFrame(data, columns)

# Using SQL-like functions to filter data
filtered_data = df.filter(col("Age") > 25)
selected_columns = df.select("Name", "Age")

# Show the filtered and selected data
print("Filtered Data:")
filtered_data.show()

print("Selected Columns:")
selected_columns.show()

Aggregation

from pyspark.sql.functions import avg, max

# Using SQL-like functions for aggregation
grouped_data = df.groupBy("Age").agg(avg("Age"), max("Age"))

# Show the aggregated data
print("Aggregated Data:")
grouped_data.show()

Sorting Data

# Using SQL-like functions for sorting
sorted_data = df.orderBy("Age")
desc_sorted_data = df.sort(col("Age").desc())

# Show the sorted data
print("Sorted Data:")
sorted_data.show()

print("Descending Sorted Data:")
desc_sorted_data.show()

Joining DataFrames

# Sample data for another DataFrame
data2 = [
    ("Alice", "New York"),
    ("Bob", "San Francisco"),
    ("Eva", "Los Angeles")
]

columns2 = ["Name", "City"]
df2 = spark.createDataFrame(data2, columns2)

# Using SQL-like functions to join DataFrames
joined_data = df.join(df2, on="Name", how="inner")

# Show the joined data
print("Joined Data:")
joined_data.show()
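DataFrames can also be queried with plain SQL by registering a temporary view. A brief sketch on the same df; the view name is arbitrary:

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25 ORDER BY Age DESC")
result.show()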
