RDD Actions
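Actions are the RDD operations that trigger execution of the lineage and either return a result to the driver (reduce, fold, aggregate, takeOrdered) or run purely for their side effects on the executors (foreach, foreachPartition). The examples below walk through each in turn.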
foreach() example
from pyspark import SparkContext

sc = SparkContext("local", "ForEachExample")
rdd = sc.parallelize([1, 2, 3, 4, 5])

def my_function(x):
    print(x)

# Apply my_function to every element; foreach runs for its side effects and returns nothing
rdd.foreach(my_function)
sc.stop()
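Note that foreach runs on the executors, not the driver: in local mode the printed values appear in the console (in no guaranteed order), but on a cluster they end up in the executor logs.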
foreachPartition() example
from pyspark import SparkContext

sc = SparkContext("local", "ForEachPartitionExample")
rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # Create 2 partitions

def my_partition_function(iterator):
    # Receives an iterator over all elements of one partition
    for x in iterator:
        print(x)

rdd.foreachPartition(my_partition_function)
sc.stop()
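The advantage of foreachPartition over foreach is that per-partition setup cost is paid once per partition rather than once per element. A minimal sketch of the usual pattern, where get_connection() and insert() are hypothetical stand-ins for a real database client:

def save_partition(iterator):
    conn = get_connection()   # hypothetical helper: open one connection per partition
    for record in iterator:
        conn.insert(record)   # hypothetical method: write a single record
    conn.close()              # close once, after the whole partition is processed

rdd.foreachPartition(save_partition)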
fold() example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("FoldExample").getOrCreate()
# Create an RDD of numbers
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Define the binary function for multiplication
def multiply(x, y):
    return x * y

# Use the fold action; 1 is the zero value (the identity for multiplication)
product_result = numbers_rdd.fold(1, multiply)
# Print the result
print("Product using fold:", product_result)
# Stop the Spark session
spark.stop()
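The zero value must be the identity for the operation (1 for multiplication, 0 for addition), because fold applies it once inside every partition and once more when merging the partial results. A small sketch of what goes wrong otherwise, assuming sc is an active SparkContext:

rdd = sc.parallelize([1, 2, 3, 4, 5], 2)  # 2 partitions

# Identity zero value: the expected sum
print(rdd.fold(0, lambda x, y: x + y))    # 15

# Non-identity zero value: 10 is folded in once per partition and once in the final merge
print(rdd.fold(10, lambda x, y: x + y))   # 15 + 10 * (2 + 1) = 45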
reduce() example
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("ReduceExample").getOrCreate()
# Create an RDD of numbers
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Define the binary function for addition
def add(x, y):
    return x + y

# Use the reduce action; the function must be commutative and associative,
# since partial results from different partitions are combined in arbitrary order
sum_result = numbers_rdd.reduce(add)
# Print the result
print("Sum using reduce:", sum_result)
# Stop the Spark session
spark.stop()
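Unlike fold, reduce takes no zero value, so it raises a ValueError on an empty RDD. The function is also often written inline; inside the example above (before spark.stop()), this is equivalent:

sum_result = numbers_rdd.reduce(lambda x, y: x + y)  # inline form of add; returns 15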
aggregate() example
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("AggregateExample").getOrCreate()

# Create an RDD of numbers with 2 partitions
numbers_rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2)

# Define the zero value: an accumulator for (sum, product)
zero_value = (0, 1)

def seq_op(accumulator, element):
    # Update the accumulator by adding the element to the sum
    # and multiplying it into the product
    return (accumulator[0] + element, accumulator[1] * element)

def comb_op(acc1, acc2):
    # Combine two accumulators by adding their sums and multiplying their products
    return (acc1[0] + acc2[0], acc1[1] * acc2[1])

# Use the aggregate action
(sum_result, product_result) = numbers_rdd.aggregate(zero_value, seq_op, comb_op)

# Print the results
print("Sum:", sum_result)
print("Product:", product_result)

# Stop the Spark session
spark.stop()
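aggregate generalizes fold: seq_op folds elements into the accumulator inside each partition, comb_op merges the per-partition accumulators, and the accumulator type may differ from the element type, which is what lets a single pass return both a sum and a product here.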
takeOrdered() example
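The notes do not include a body for this example; a minimal sketch in the style of the examples above, assuming the same local setup:

from pyspark import SparkContext

sc = SparkContext("local", "TakeOrderedExample")
rdd = sc.parallelize([5, 1, 4, 2, 3])

# The three smallest elements, in ascending order
print(rdd.takeOrdered(3))                     # [1, 2, 3]

# The three largest elements, obtained by negating the sort key
print(rdd.takeOrdered(3, key=lambda x: -x))   # [5, 4, 3]

sc.stop()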
Persistence storage levels
• MEMORY_AND_DISK: cache the RDD in memory and spill partitions to disk if memory is insufficient.
• MEMORY_AND_DISK_SER: cache the RDD in memory as serialized Java objects and spill to disk if memory is insufficient.
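These levels are passed to persist(). A minimal sketch, assuming sc is an active SparkContext; note that PySpark always pickles RDD data, so the _SER variants are mainly a distinction in the JVM (Scala/Java) API:

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill to disk when needed

rdd.count()      # the first action materializes and caches the RDD
rdd.unpersist()  # release the cached data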