Python PySpark sum() Function
Last Updated: 20 Sep, 2024
PySpark, the Python API for Apache Spark, is a powerful tool for big data processing and analytics. One of its essential functions is sum(), which is part of the pyspark.sql.functions module. This function allows us to compute the sum of a column's values in a DataFrame, enabling efficient data analysis on large datasets.
Overview of the PySpark sum() Function
The sum() function in PySpark calculates the sum of a numerical column across the rows of a DataFrame. It can be used both as a plain aggregation over the whole DataFrame and inside grouped operations with groupBy().
Syntax:
pyspark.sql.functions.sum(col)
Here, col is the column name or column expression for which we want to compute the sum.
To illustrate the use of sum(), let’s start with a simple example.
Setting Up PySpark
First, ensure PySpark is installed. If it isn't, install it with pip:
pip install pyspark
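After installation, a quick sanity check such as the one below (a minimal sketch; it only assumes PySpark imports correctly) prints the installed version:
Python
import pyspark
print(pyspark.__version__)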
Example 1: Basic Sum Calculation
Let’s create a simple DataFrame and compute the sum of a numerical column.
Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum
# Create a Spark session
spark = SparkSession.builder.appName("SumExample").getOrCreate()
# Sample data
data = [
("Science", 93),
("Physics", 72),
("Operation System", 81)
]
# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Marks"])
# Show the DataFrame
df.show()
# Calculate the sum of the 'Marks' column
total_marks = df.select(sum("Marks")).collect()[0][0]
print(f"Total Marks: {total_marks}")
Output:
[Image: Example of PySpark sum() function]
Explanation:
- DataFrame Creation: We create a DataFrame with subject names and their marks.
- Sum Calculation: We use the sum() function to calculate the total of the "Marks" column and then collect the result.
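One caveat: importing sum directly from pyspark.sql.functions shadows Python's built-in sum(). A common alternative, shown in this minimal sketch (it reuses the df created above), is to import the functions module under an alias:
Python
from pyspark.sql import functions as F

# Same aggregation, without shadowing the built-in sum()
total_marks = df.select(F.sum("Marks")).collect()[0][0]
print(f"Total Marks: {total_marks}")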
Example 2: sum() with groupBy() in Sales Data Analysis
Let's consider a more realistic scenario: analyzing sales data for a retail store. Suppose we have a DataFrame with sales records, including the product name and the total sales amount.
Python
# Sample sales data
sales_data = [
("Laptop", 1200.00),
("Smartphone", 800.00),
("Tablet", 600.00),
("Laptop", 1500.00),
("Smartphone", 950.00)
]
# Create DataFrame
sales_df = spark.createDataFrame(sales_data, ["Product", "Sales"])
# Show the DataFrame
sales_df.show()
# Calculate total sales for each product
total_sales_by_product = sales_df.groupBy("Product").agg(sum("Sales").alias("Total_Sales"))
total_sales_by_product.show()
Output:
[Image: Real-world example of PySpark sum() with groupBy()]
Explanation:
- Group By: We use groupBy("Product") to group the sales records by product name.
- Aggregation: The agg(sum("Sales").alias("Total_Sales")) computes the total sales for each product, renaming the result to "Total_Sales".
Example 3: Using sum() with Conditions
We can also compute sums conditionally. The simplest approach is to filter the rows with where() before aggregating; a when()-based variant that produces the same result is sketched after the explanation below. For instance, suppose we want the total sales only for records that exceed a certain threshold.
Python
# ...
# Calculate total sales for products with sales greater than 1000
conditional_sum = sales_df.where("Sales > 1000").agg(sum("Sales").alias("Total_Sales_Above_1000"))
conditional_sum.show()
Output:
[Image: Using PySpark sum() with conditions]
Explanation:
- Filtering: The where("Sales > 1000") filters the records to include only those with sales over 1000.
- Aggregation: The sum of the filtered records is computed.
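The same total can be computed without a separate filtering step by wrapping the column in when(): rows that fail the condition become null, and sum() ignores nulls. This is a minimal sketch that reuses sales_df from Example 2:
Python
from pyspark.sql.functions import col, when, sum as spark_sum

# Sum only the Sales values greater than 1000; other rows contribute null
conditional_sum = sales_df.agg(
    spark_sum(when(col("Sales") > 1000, col("Sales"))).alias("Total_Sales_Above_1000")
)
conditional_sum.show()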
Conclusion
The sum() function in PySpark is a fundamental tool for performing aggregations on large datasets. Whether you're calculating total values across a DataFrame or aggregating data based on groups, sum() provides a flexible and efficient way to handle numerical data.
In real-world applications, this function can be used extensively in data analysis tasks such as sales reporting, financial analysis, and performance tracking. With its ability to process massive amounts of data quickly, PySpark's sum() function plays a crucial role in the analytics landscape.