Ravi PySpark RDD Tutorial


Tutorial_2_RDD_Basics

RDD (Resilient Distributed Dataset)

Terminologies
RDD stands for Resilient Distributed Dataset: an immutable, partitioned collection of elements
that can be operated on in parallel across the nodes of a cluster.

RDDs are...
immutable
fault tolerant, with automatic recovery of lost partitions from lineage
able to have multiple operations applied to them

RDD operations are of two types...

Transformation
Action

Basic Operations (Ops)


count(): Returns the number of elements in the RDD.
collect(): Returns all the elements of the RDD to the driver.
foreach(f): Applies the given callable f to each element of the RDD; it returns nothing and is
used for side effects such as printing or writing to external storage.
filter(f): Takes a callable and returns a new RDD containing only the elements that satisfy it.
map(f, preservesPartitioning = False): Returns a new RDD obtained by applying the
function to each element of the RDD.
reduce(f): Returns the single value obtained by combining the elements of the RDD with the
specified commutative and associative binary operation.
join(other, numPartitions = None): Returns an RDD of (key, (value1, value2)) pairs for every
key that appears in both RDDs.
cache(): Persist this RDD with the default storage level (MEMORY_ONLY). You
can also check whether the RDD is cached via its is_cached attribute
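
A minimal sketch of cache() and foreach(), which are not demonstrated later in the notebook, assuming the Databricks-provided sc used throughout:

nums = sc.parallelize([1, 2, 3, 4, 5])

print(nums.is_cached)   # False - nothing persisted yet
nums.cache()            # persist with the default MEMORY_ONLY storage level
print(nums.is_cached)   # True

# foreach() returns None; the callable runs on the executors for its side effects,
# so any print output appears in the executor logs, not in the driver notebook.
nums.foreach(lambda n: print(n))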

Narrow transformations
map()
filter()
flatMap()
distinct()
Wide (Broad) transformations
reduceByKey()
groupBy()
sortBy()
join()

Actions
count()
take()
takeOrdered()
top()
collect()
saveAsTextFile()
first()
reduce()
fold()
aggregate()
foreach()
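
takeOrdered(), top(), fold() and aggregate() from the list above are not demonstrated later; a minimal sketch, assuming the usual sc:

nums = sc.parallelize([5, 1, 4, 2, 3])

print(nums.takeOrdered(3))               # [1, 2, 3]  the 3 smallest elements, ascending
print(nums.top(2))                       # [5, 4]     the 2 largest elements, descending
print(nums.fold(0, lambda a, b: a + b))  # 15         like reduce(), but with a zero value per partition

# aggregate() uses one function inside each partition and another to merge the partition results
sum_and_count = nums.aggregate((0, 0),
                               lambda acc, v: (acc[0] + v, acc[1] + 1),
                               lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(sum_and_count)                     # (15, 5) -> sum and count of the elements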

Dictionary functions
keys()
values()
keyBy()

Functional transformations
mapValues()
flatMapValues()
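
Most of the pair-RDD helpers above (values(), keyBy(), mapValues(), flatMapValues()) are not demonstrated later; a minimal sketch of all of them, assuming the usual sc:

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])

print(pairs.keys().collect())                       # ['a', 'b', 'a']
print(pairs.values().collect())                     # [1, 2, 3]
print(pairs.mapValues(lambda v: v * 10).collect())  # [('a', 10), ('b', 20), ('a', 30)]
print(pairs.flatMapValues(lambda v: range(v)).collect())
# [('a', 0), ('b', 0), ('b', 1), ('a', 0), ('a', 1), ('a', 2)]

names = sc.parallelize(['Ravi', 'Raj', 'Sridhar'])
print(names.keyBy(lambda w: w[0]).collect())        # [('R', 'Ravi'), ('R', 'Raj'), ('S', 'Sridhar')]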

Grouping, sorting and aggregation


groupByKey()
reduceByKey()
foldByKey()
sortByKey()
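
foldByKey() and sortByKey() are not shown later (groupByKey() and reduceByKey() are); a minimal sketch, assuming the usual sc:

sales = sc.parallelize([('A', 3), ('B', 5), ('A', 2), ('B', 1)])

print(sales.foldByKey(0, lambda a, b: a + b).collect())  # [('A', 5), ('B', 6)]  (key order may vary)
print(sales.sortByKey().collect())                       # pairs ordered by key: all 'A' pairs before 'B' pairs
print(sales.sortByKey(ascending=False).collect())        # descending key order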

Joins
join()
leftOuterJoin()
rightOuterJoin()
fullOuterJoin()
cogroup()
cartesian()

Set operations
union()
intersection()
subtract()
subtractByKey()
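
subtractByKey() is listed above but never demonstrated; a minimal sketch, assuming the usual sc:

x = sc.parallelize([('a', 1), ('b', 4), ('b', 5), ('c', 7)])
y = sc.parallelize([('a', 9), ('c', 0)])

print(x.subtractByKey(y).collect())   # [('b', 4), ('b', 5)]  pairs whose key also appears in y are dropped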

Numeric RDD
min()
max()
sum()
mean()
stdev()
variance()

# https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/rdd.html


Lambda Function
A lambda function is a small anonymous function.
A lambda function can take any number of arguments, but can only have one
expression.

def my_sum(a, b):
    return a + b

my_sum(10, 20)

Out[6]: 30

def my_func(a, b):
    return a + b

a = 55
b = 77
my_func(a, b)

x = lambda a : a + 10
print(x(3))
print(x(20))

13
30

x = lambda a, b : a + b
print(x(5, 6))
print(x(2, 50))

list_a = [1, 2, 3, 4, 5, 6, 7, 8]
for i in list_a:
    if i < 4:
        print(i)

my_list = [1,2,3,4,5,6,7,8,8,34,3,34,34,343,5656]
my_list

my_rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,23,23,232323,2323])
my_rdd.collect()

x = sc.parallelize([1,2,3,4,5,6,7,8],4)
new_rdd = x.filter(lambda x: x<4)

x.glom().collect()

no_of_rows = new_rdd.count()

type(no_of_rows)

type(new_rdd)

text_rdd = sc.textFile("dbfs:/databricks-datasets/SPARK_README.md")
type(text_rdd)

%sql
select * from sample_db.emp where deptno=10

What is sparkContext
A SparkContext represents the connection to a Spark cluster,
and can be used to create RDDs, accumulators and broadcast variables on that
cluster
Note: Only one SparkContext should be active per JVM. You must stop() the
active SparkContext before creating a new one.
param: config - a Spark Config object describing the application configuration.
Any settings in this config override the default configs as well as system
properties.
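
Accumulators and broadcast variables are mentioned above but not shown anywhere else in this notebook; a minimal sketch, assuming the Databricks-provided sc:

counter = sc.accumulator(0)                # write-only shared variable, updated from tasks
lookup = sc.broadcast({'a': 1, 'b': 2})    # read-only value cached on every executor

letters = sc.parallelize(['a', 'b', 'a'])
letters.foreach(lambda k: counter.add(lookup.value[k]))
print(counter.value)                       # 4  (1 + 2 + 1), read back on the driver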

Creating RDD using SparkContext Parallelize Method.


Create a Python list, e.g. [1,2,3,4,5,6,7,8,9,10].
The sc.parallelize() method of the SparkContext turns such a local collection into a
parallelized (distributed) collection.
This allows Spark to distribute the data across multiple nodes, instead of
depending on a single node to process the data.

rdd = sc.parallelize([1,2,3,4,5,6,7,8],5)

range_rdd = sc.range(10)
range_rdd.collect()

Creating RDD using file.

textFile_rdd = sc.textFile("/databricks-datasets/samples/docs/README.md")
textFile_rdd.count()

Creating RDD using SparkContext


Applying MAP Transformation in RDD.
MAP(func)
Return a new distributed dataset formed by passing each element of the source
through a function func.

x=sc.parallelize([1,2,3,4,5,6,7])
y=x.map(lambda a: a+10 )
print(x.collect())
print(y.collect())

[1, 2, 3, 4, 5, 6, 7]
[11, 12, 13, 14, 15, 16, 17]

print("X RDD VAlues :",x.collect())


print("Y RDD Values after applying map and lambda
transformation",y.collect())

x_rdd = sc.parallelize([1,2,3,4,5,6,7])
y_odd_rdd = x_rdd.filter(lambda z: z % 2 == 1)
print(x_rdd.collect())
print(y_odd_rdd.collect())

[1, 2, 3, 4, 5, 6, 7]
[1, 3, 5, 7]

y_odd_rdd.collect()

print('RDD X values : ', x.collect())
print('RDD Y values : ', y.collect())

filter(func) Transformation
Return a new dataset formed by selecting those elements of the source on
which func returns true.

x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.filter(lambda x: x%2 == 1) #keep odd values
print('RDD X Values Before Filter : ',x.collect())
print('RDD Y Values after applying filter on the X RDD : ', y.collect())

flatMap(func) Transformation
Similar to map, but each input item can be mapped to 0 or more output items
(so func should return a Seq rather than a single item).

x.flatMap(lambda x: (x,x+55,x*400)).collect()

Out[20]: [1, 56, 400, 2, 57, 800, 3, 58, 1200]

x = sc.parallelize([1,2,3])
y_map = x.map(lambda x: (x, x*100))
z_flatmap = x.flatMap(lambda x: (x, x*100))
print(x.collect())
print(y_map.collect())
print(z_flatmap.collect())

[1, 2, 3]
[(1, 100), (2, 200), (3, 300)]
[1, 100, 2, 200, 3, 300]

#When using PySpark, the flatMap() function does the flattening for us.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sc.parallelize(data).flatMap(lambda x: x).collect()

groupBy(func) Transformation in RDD


Group the data in the original RDD. Create pairs where the key is the output of
a user function, and the value is all items for which the function yields this key

x = sc.parallelize(['John', 'Fred', 'Anna', 'James','Jan','Frend','Axe'])


y = x.groupBy(lambda w: w[0])
print('Before Group By',x.collect())
#print(y.collect())
print('After group by ',[(k,tuple(v)) for (k, v) in y.collect()])

Before Group By ['John', 'Fred', 'Anna', 'James', 'Jan', 'Frend', 'Axe']


After group by [('J', ('John', 'James', 'Jan')), ('F', ('Fred', 'Frend')), ('A', ('Anna', 'Axe'))]

x = sc.parallelize(['Ravi', 'Sridhar', 'Prasad', 'Raj'])


y = x.groupBy(lambda w: w[0])
print(x.collect())
print([(k, list(v)) for (k, v) in y.collect()])

['Ravi', 'Sridhar', 'Prasad', 'Raj']
[('R', ['Ravi', 'Raj']), ('S', ['Sridhar']), ('P', ['Prasad'])]

What is iterable
anything that can be looped over (e.g. you can loop over a string or a file), i.e.
anything that can appear on the right-hand side of a for loop: for x in iterable: ...

print(y.collect())
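
groupBy() returns the values for each key as a pyspark ResultIterable, which is why y.collect() above prints ResultIterable objects; being iterable, it can simply be looped over. A minimal sketch using the grouped RDD y from the previous cells:

for letter, names in y.collect():   # each element is (key, ResultIterable)
    for name in names:              # loop over the iterable of grouped values
        print(letter, name)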

Sample Transformation

Return a sampled subset of this RDD.

Parameters

withReplacement – can elements be sampled multiple times (replaced when sampled out)

fraction – expected size of the sample as a fraction of this RDD's size.
Without replacement: the probability that each element is chosen; fraction must be in [0, 1].
With replacement: the expected number of times each element is chosen; fraction must be >= 0

seed – seed for the random number generator

x = sc.parallelize([1,2,3,4,5,6,7])
x.take(3)

Out[32]: [1, 2, 3]

x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.sample(True,5,300)
print(x.collect())
print(y.collect())
print(y.count())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8,
8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10]
67
Union(DataSet) Transformation
Return a new dataset that contains the union of the elements in the source
dataset and the argument.
glom() Return an RDD created by coalescing all elements within each partition
into an array

# In PySpark, RDD union() behaves like SQL UNION ALL: it returns all records, including duplicates.
# To get SQL UNION behaviour (duplicates eliminated), apply distinct() after the union.
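
A minimal sketch of union() itself (the cell that follows actually demonstrates subtract()), assuming the usual sc:

a = sc.parallelize([1, 2, 3, 4, 5])
b = sc.parallelize([3, 4, 5, 6, 7])

print(a.union(b).collect())              # [1, 2, 3, 4, 5, 3, 4, 5, 6, 7]  duplicates kept (UNION ALL behaviour)
print(a.union(b).distinct().collect())   # duplicates removed (UNION behaviour); order is not guaranteed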

x = sc.parallelize([1,2,3,4,5])
y = sc.parallelize([3,4,5,6,7])
z = y.subtract(x)
print(z.collect())
#print(z.glom().collect())

[6, 7]

intersection(otherDataset)
Return a new RDD that contains the intersection of elements in the source
dataset and the argument.
The output will not contain any duplicate elements, even if the input RDDs did.

x = sc.parallelize([1,2,3,3,4,5], 2)
y = sc.parallelize([3,4,4,5,6,7], 1)
z = x.intersection(y)
print(x.glom().collect())
print(y.glom().collect())
print(z.collect())
print(z.glom().collect())

x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.intersection(y)
print(z.collect())
print(z.glom().collect())

subtract(otherDataset) Transformation
Returns an RDD containing only the values that are present in the first RDD and not in the
second RDD.
Duplicates in the first RDD are kept in the result; subtract() does not deduplicate.

x = sc.parallelize([1,2,2,3,3,4,5,7,8], 2)
y = sc.parallelize([3,4,3,4,5,6], 1)
z = x.subtract(y)
print(z.collect())
print(z.glom().collect())

Cartesian Transformation
Returns the Cartesian product of two RDDs:
a new RDD containing every pair (a, b) where a comes from the first RDD and b from the
second.

x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.cartesian(y)
print(z.collect())

Distinct Transformation
Return a new RDD containing distinct items from the original RDD (omitting all
duplicates)

x = sc.parallelize([1,2,3,3,4,5,5,5,6,6,6,6,7,7,7,3,2,1,8,9,4])
y = x.distinct()
print(y.collect())

coalesce(numPartitions) Transformation
Decrease the number of partitions in the RDD to numPartitions.
Useful for running operations more efficiently after filtering down a large
dataset.
x = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 4, 5, 6, 8, 9, 10, 12, 14, 12, 12, 14, 15, 16, 44, 44, 45], 10)
y = x.coalesce(4)
print(x.glom().collect())
print(y.glom().collect())

%fs ls /user/hive/warehouse/mydb.db/dept_csv/

repartition(numPartitions) Transformation
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them.
This always shuffles all data over the network.
(Repartitioning by column, repartition(numPartitions, *cols), applies to DataFrames; an
RDD's repartition() only takes numPartitions.)

x = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 12, 434, 545, 65, 3434, 232, 4545, 234234, 33, 434, 3434], 5)
y = x.repartition(8)
print(x.glom().collect())
print(x.getNumPartitions())
print(y.glom().collect())
print(y.getNumPartitions())

PartitionBy Transformation
Return a new RDD with the specified number of partitions,
placing original items into the partition returned by a user supplied function

x = sc.parallelize([('J','James'),('F','Fred'),('A','Anna'),('J','John'),
('R','Ravi'),('E','Eswar')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print (x.glom().collect())
print (y.glom().collect())
print('X RDD No. of Partitions : ', x.getNumPartitions())
print('Y RDD No. of Partitions : ', y.getNumPartitions())

ZIP Transformation
Return a new RDD containing pairs whose key is the item in the original RDD,
and whose
value is that item’s corresponding element (same partition, same index) in a
second RDD

x = sc.parallelize([1, 2, 3])
y = x.map(lambda n:n*n)
z = x.zip(y)
print('display X RDD Values: ',x.collect())
print('Display Y RDD Values: ',y.collect())
print(z.collect())

x = sc.parallelize([1, 2, 3,5,6])
y = sc.parallelize([3,4,5,4,8])
z = x.zip(y)
print(z.collect())

groupByKey Transformation in RDD


When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using reduceByKey or aggregateByKey will yield much
better performance.
Note: By default, the level of parallelism in the output depends on the number of
partitions of the parent RDD. You can pass an optional numPartitions argument
to set a different number of tasks.

x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1),('C',4),
('C',5),('B',4)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))

JOINS
join(self, other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and
other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in
self and (k, v2) is in other.
Performs a hash join across the cluster.

x = sc.parallelize([("a", 1), ("b", 4)])


y = sc.parallelize([("a", 2), ("a", 3)])
z = x.join(y)
print(z.collect())
#print(sorted(z.collect()))

leftOuterJoin(self, other, numPartitions=None)


Perform a left outer join of self and other.
For each element (k, v) in self, the resulting RDD will either contain all pairs (k,
(v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.
Hash-partitions the resulting RDD into the given number of partitions.

x = sc.parallelize([("a", 1), ("b", 4)])


y = sc.parallelize([("a", 2)])
z = x.leftOuterJoin(y)
print(z.collect())
#print(sorted(z.collect()))

rightOuterJoin(self, other, numPartitions=None)


Perform a right outer join of self and other.
For each element (k, w) in other, the resulting RDD will either contain all pairs
(k, (v, w)) for v in this, or the pair (k, (None, w)) if no elements in self have key k.
Hash-partitions the resulting RDD into the given number of partitions.

x = sc.parallelize([("a", 1)])
y = sc.parallelize([("a", 2),("c", 4)])
z = x.rightOuterJoin(y)
print(z.collect())
#print(sorted(z.collect()))

fullOuterJoin(self, other, numPartitions=None)


Perform a full outer join of self and other, returning both matching and non-matching keys;
missing values are filled with None.

x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("c", 5)])
z = x.fullOuterJoin(y)
print(z.collect())
#print(sorted(z.collect()))

cogroup(self, other, numPartitions=None)


For each key k in self or other, return a resulting RDD that contains a tuple with
the list of values for that key in self as well as other.

x = sc.parallelize([("a", 1), ("b", 4)])


y = sc.parallelize([("a", 2)])
z = x.cogroup(y)
[(x, list(map(list,y))) for x, y in z.collect()]

What is glom()
glom() returns an RDD created by coalescing all elements within each partition into a list,
which makes it easy to see how the data is laid out across partitions.
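
A minimal sketch, assuming the usual sc:

x = sc.parallelize([1, 2, 3, 4, 5, 6], 3)

print(x.collect())          # [1, 2, 3, 4, 5, 6]
print(x.glom().collect())   # [[1, 2], [3, 4], [5, 6]]  one list per partition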

Shuffle

The shuffle is an expensive operation since it involves disk I/O, data serialization, and
network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to
organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from
MapReduce and does not directly relate to Spark's map and reduce operations.

Operations which can cause a shuffle include repartition operations like repartition and
coalesce, 'ByKey' operations (except for counting) like groupByKey and reduceByKey, and
join operations like cogroup and join.
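
A small sketch showing that a ByKey operation shuffles the data into a new set of partitions (the explicit numPartitions argument is used here only for illustration), assuming the usual sc:

pairs = sc.parallelize([('a', 1), ('b', 1), ('a', 1), ('c', 1)], 4)
summed = pairs.reduceByKey(lambda a, b: a + b, numPartitions=2)   # wide transformation -> shuffle

print(pairs.getNumPartitions())    # 4
print(summed.getNumPartitions())   # 2
print(summed.glom().collect())     # keys are hash-partitioned across the 2 output partitions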

GetNumPartitions()
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

reduceByKey(func, [numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce function func,
which must be of type (V,V) => V. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.

NOTE: If you are grouping with groupByKey in order to perform an aggregation (such as a
sum or average) over each key, using reduceByKey or aggregateByKey will provide much
better performance.

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']

wordsRDD = sc.parallelize(wordsList, 4)
wordCountsCollected = (wordsRDD
                       .map(lambda w: (w, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
print(wordCountsCollected)

ACTIONS

reduce(func) Action
Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one).
The function should be commutative and associative so that it can be computed
correctly in parallel.

# reduce numbers 1 to 10 by adding them up


x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)
type(y)

count() Action
Return the number of elements in the dataset.

x = sc.parallelize([1,2,3,4,5,6,7,8,9,34,34,34,34,6,45,23,2,0])
x.count()

first() Action
Return the first element of the dataset (similar to take(1)).

x = sc.parallelize([66,77,55,44,33,1,2,3,4])
print(x.first())
print(x.take(4))

takeSample(withReplacement, num, [seed])


Return an array with a random sample of num elements of the dataset, with or without
replacement, optionally pre-specifying a random number generator seed.
Returns a fixed-size sampled subset of this RDD as an array.
withReplacement – whether sampling is done with replacement
num – size of the returned sample
seed – seed for the random number generator

rdd = sc.parallelize(range(0, 10))


print(rdd.takeSample(True, 20, 1))
print(rdd.takeSample(False, 20, 1))

print(rdd.takeSample(False, 5, 2))

print(rdd.takeSample(False, 8, 5))

take(n) Action
Return an array with the first n elements of the dataset.
x = sc.parallelize([1,2,3,4])
x.take(2)

countByValue(self)
Return the count of each unique value in this RDD as a dictionary of (value,
count) pairs.

x = sc.parallelize([1, 2, 1, 2, 2,3,3,4,3,3,4,4,5,5,6,6])
y = x.countByValue().items()
print(y)
type(y)

isEmpty()
Returns true if and only if the RDD contains no elements at all.
Note: an RDD may be empty even when it has at least 1 partition.

x = sc.parallelize(range(10))
print(x.collect())
print(x.isEmpty())
y = sc.parallelize(range(0))
print(y.collect())
print(y.isEmpty())

keys()
Return an RDD with the keys of each tuple.

x = sc.parallelize([(1, 2), (3, 4),(5,55),(6,66)])


y = x.keys()
print(y.collect())

saveAsTextFile(path, compressionCodecClass=None)
Save the RDD to the filesystem indicated in the path
dbutils.fs.rm("dbfs:/tmp/test_data/",True)
x = sc.parallelize(['Raveendra','Eswar','Vamsi','Lakshmi','Vinod'])
x.saveAsTextFile("dbfs:/tmp/test_data/saveAs")

%fs ls dbfs:/tmp/test_data/saveAs/

y = sc.textFile("dbfs:/tmp/test_data/saveAs")
print(y.collect())

saveAsPickleFile(self, path, batchSize=10)


Save this RDD as a SequenceFile of serialized objects. The serializer used is
pyspark.serializers.PickleSerializer; the default batch size is 10.

dbutils.fs.rm("dbfs:/tmp/test_data/picklefile/",True)
x = sc.parallelize(['Raveendra','Eswar','Vinod','Lakshmi','Vamsi'])
x.saveAsPickleFile("dbfs:/tmp/test_data/picklefile")

%fs ls dbfs:/tmp/test_data/picklefile

y = sc.pickleFile("dbfs:/tmp/test_data/picklefile")
print(y.collect())

%fs ls dbfs:/tmp/test_data/picklefile/

STDEV()
Return the standard deviation of the items in the RDD

x = sc.parallelize([2,4,1])
y = x.stdev()
print(x.collect())
print(y)

MIN()
Return the MIN value of the items in the RDD

x = sc.parallelize([2,4,1,3,5,6,7])
y = x.min()
print(x.collect())
print(y)

MAX()
Return the MAX value of the items in the RDD

x = sc.parallelize([2,4,1,55,66,77,8845,454545,])
y = x.max()
print(x.collect())
print(y)

MEAN()
Return the mean of the items in the RDD

x = sc.parallelize([2,4,1])
y = x.mean()
print(x.collect())
print(y)

SUM()
Return the Sum of the items in the RDD

x = sc.parallelize([2,4,1,55,44,34,34,344])
y = x.sum()
print(x.collect())
print(y)

Stats()
Return a StatCounter object that captures the mean, variance and count of the
RDD’s elements in one operation.

list_rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
list_rdd.stats()

my_rdd = sc.parallelize(['test1','test2','test3'])

help(my_rdd)
