Python PySpark sum() Function
PySpark, the Python API for Apache Spark, is a powerful tool for big data processing and analytics. One of its essential functions is sum(), which is part of the pyspark.sql.functions module. This function allows us to compute the sum of a column's values in a DataFrame, enabling efficient data analysis on large datasets.
Overview of the PySpark sum() Function
The sum() function in PySpark is used to calculate the sum of a numerical column across all rows of a DataFrame. It can be used both as a plain aggregate over an entire DataFrame and inside grouped (groupBy()) operations.
Syntax:
pyspark.sql.functions.sum(col)
Here, col is the column name or column expression for which we want to compute the sum.
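One practical note: importing sum directly from pyspark.sql.functions shadows Python's built-in sum() within that scope. The examples below import it directly for brevity, but a common alternative (a style choice, not an API requirement) is to import the functions module under an alias:
Python
from pyspark.sql import functions as F

# F.sum() refers to Spark's column aggregate, while Python's
# built-in sum() remains available under its usual name,
# e.g. df.select(F.sum("Marks"))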
To illustrate the use of sum(), let’s start with a simple example.
Setting Up PySpark
First, ensure PySpark is installed. If it isn't, we can install it using pip:
pip install pyspark
Example 1: Basic Sum Calculation
Let’s create a simple DataFrame and compute the sum of a numerical column.
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum  # shadows Python's built-in sum()
# Create a Spark session
spark = SparkSession.builder.appName("SumExample").getOrCreate()
# Sample data
data = [
("Science", 93),
("Physics", 72),
("Operation System", 81)
]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Marks"])
# Show the DataFrame
df.show()
# Calculate the sum of the 'Marks' column
total_marks = df.select(sum("Marks")).collect()[0][0]
print(f"Total Marks: {total_marks}")
Output:
+----------------+-----+
|            Name|Marks|
+----------------+-----+
|         Science|   93|
|         Physics|   72|
|Operating System|   81|
+----------------+-----+

Total Marks: 246
Explanation:
- DataFrame Creation: We create a DataFrame with subject names and their marks.
- Sum Calculation: We use the sum() function to calculate the total of the "Marks" column and then collect the result.
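Because sum() accepts a full column expression, not just a column name, we can aggregate a derived value directly. The snippet below is a sketch reusing df from the example above; rescaling marks to a 10-point scale is purely an illustrative assumption:
Python
from pyspark.sql.functions import sum, col

# Sum a derived expression: marks rescaled to a 10-point scale
rescaled_total = df.select(sum(col("Marks") / 10).alias("Total_Rescaled"))
rescaled_total.show()
Using alias() here gives the result column a readable name instead of the default sum((Marks / 10)) header.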
Example 2: sum() with groupBy() in Sales Data Analysis
Let's consider a more realistic scenario: analyzing sales data for a retail store. Suppose we have a DataFrame of sales records, each with a product name and a sales amount.
Python
# Sample sales data
sales_data = [
("Laptop", 1200.00),
("Smartphone", 800.00),
("Tablet", 600.00),
("Laptop", 1500.00),
("Smartphone", 950.00)
]
# Create DataFrame
sales_df = spark.createDataFrame(sales_data, ["Product", "Sales"])
# Show the DataFrame
sales_df.show()
# Calculate total sales for each product
total_sales_by_product = sales_df.groupBy("Product").agg(sum("Sales").alias("Total_Sales"))
total_sales_by_product.show()
Output:
+----------+------+
|   Product| Sales|
+----------+------+
|    Laptop|1200.0|
|Smartphone| 800.0|
|    Tablet| 600.0|
|    Laptop|1500.0|
|Smartphone| 950.0|
+----------+------+

+----------+-----------+
|   Product|Total_Sales|
+----------+-----------+
|    Laptop|     2700.0|
|Smartphone|     1750.0|
|    Tablet|      600.0|
+----------+-----------+

Explanation:
- Group By: We use groupBy("Product") to group the sales records by product name.
- Aggregation: The agg(sum("Sales").alias("Total_Sales")) computes the total sales for each product, renaming the result to "Total_Sales".
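Two details are worth knowing when aggregating real data: sum() ignores NULL values rather than treating them as zero, and the grouped result can be sorted for reporting. The sketch below reuses spark and total_sales_by_product from above; the NULL-containing rows are hypothetical illustration data:
Python
from pyspark.sql.functions import sum

# sum() skips NULLs: a group containing only NULLs sums to NULL, not 0
nulls_df = spark.createDataFrame(
    [("Laptop", None), ("Tablet", 100.0)], ["Product", "Sales"]
)
nulls_df.groupBy("Product").agg(sum("Sales").alias("Total_Sales")).show()

# Sort the per-product totals from the example above, highest first
total_sales_by_product.orderBy("Total_Sales", ascending=False).show()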
Example 3: Using sum() with Conditions
We can also compute sums conditionally. The most direct approach is to filter the rows with where() before aggregating; for instance, suppose we want to calculate total sales only for records that exceed a certain threshold. An equivalent when()-based version is sketched after the explanation.
Python
# Reusing sales_df from the previous example
# Calculate total sales for products with sales greater than 1000
conditional_sum = sales_df.where("Sales > 1000").agg(sum("Sales").alias("Total_Sales_Above_1000"))
conditional_sum.show()
Output:
+----------------------+
|Total_Sales_Above_1000|
+----------------------+
|                2700.0|
+----------------------+

Explanation:
- Filtering: The where("Sales > 1000") filters the records to include only those with sales over 1000.
- Aggregation: The sum of the filtered records is computed.
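An equivalent approach keeps all rows and applies the condition inside the aggregation with when(): rows that fail the condition map to NULL, which sum() ignores. A minimal sketch reusing sales_df:
Python
from pyspark.sql.functions import sum, when, col

# Rows with Sales <= 1000 become NULL and are skipped by sum()
conditional_sum_when = sales_df.agg(
    sum(when(col("Sales") > 1000, col("Sales"))).alias("Total_Sales_Above_1000")
)
conditional_sum_when.show()
This form is convenient when several conditional sums are needed in a single agg() call.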
Conclusion
The sum() function in PySpark is a fundamental tool for performing aggregations on large datasets. Whether you're calculating total values across a DataFrame or aggregating data based on groups, sum() provides a flexible and efficient way to handle numerical data.
In real-world applications, this function can be used extensively in data analysis tasks such as sales reporting, financial analysis, and performance tracking. With its ability to process massive amounts of data quickly, PySpark's sum() function plays a crucial role in the analytics landscape.