Ravi PySpark RDD Tutorial


Tutorial_2_RDD_Basics

RDD (Resilient Distributed Dataset)

Terminologies
RDD stands for Resilient Distributed Dataset: an immutable, partitioned collection of elements
that can be operated on in parallel across the nodes of a cluster.

RDDs are...
immutable
fault tolerant, with automatic recovery of lost partitions from lineage
able to have multiple operations applied to them

RDD operations are of two types...

Transformation
Action

Basic Operations (Ops)


count(): Returns the number of elements in the RDD.
collect(): Returns all the elements of the RDD to the driver.
foreach(f): Applies the given callable f to each element of the RDD; it returns nothing and is
used for side effects such as printing or writing to external storage.
filter(f): Takes a callable and returns a new RDD containing only the elements that satisfy it.
map(f, preservesPartitioning = False): Returns a new RDD obtained by applying the
function to each element of the RDD.
reduce(f): Returns the single value obtained by combining the elements of the RDD with the
specified commutative and associative binary operation.
join(other, numPartitions = None): Returns an RDD of (key, (value1, value2)) pairs for every
key that appears in both RDDs.
cache(): Persist this RDD with the default storage level (MEMORY_ONLY). You
can also check whether the RDD is cached via its is_cached attribute
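
A minimal sketch of cache() and foreach(), which are not demonstrated later in the notebook, assuming the Databricks-provided sc used throughout:

nums = sc.parallelize([1, 2, 3, 4, 5])

print(nums.is_cached)   # False - nothing persisted yet
nums.cache()            # persist with the default MEMORY_ONLY storage level
print(nums.is_cached)   # True

# foreach() returns None; the callable runs on the executors for its side effects,
# so any print output appears in the executor logs, not in the driver notebook.
nums.foreach(lambda n: print(n))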

Narrow transformations
map()
filter()
flatMap()
distinct()
Wide (Broad) transformations
reduceByKey()
groupBy()
sortBy()
join()

Actions
count()
take()
takeOrdered()
top()
collect()
saveAsTextFile()
first()
reduce()
fold()
aggregate()
foreach()
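
takeOrdered(), top(), fold() and aggregate() from the list above are not demonstrated later; a minimal sketch, assuming the usual sc:

nums = sc.parallelize([5, 1, 4, 2, 3])

print(nums.takeOrdered(3))               # [1, 2, 3]  the 3 smallest elements, ascending
print(nums.top(2))                       # [5, 4]     the 2 largest elements, descending
print(nums.fold(0, lambda a, b: a + b))  # 15         like reduce(), but with a zero value per partition

# aggregate() uses one function inside each partition and another to merge the partition results
sum_and_count = nums.aggregate((0, 0),
                               lambda acc, v: (acc[0] + v, acc[1] + 1),
                               lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(sum_and_count)                     # (15, 5) -> sum and count of the elements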

Dictionary functions
keys()
values()
keyBy()

Functional transformations
mapValues()
flatMapValues()
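
Most of the pair-RDD helpers above (values(), keyBy(), mapValues(), flatMapValues()) are not demonstrated later; a minimal sketch of all of them, assuming the usual sc:

pairs = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])

print(pairs.keys().collect())                       # ['a', 'b', 'a']
print(pairs.values().collect())                     # [1, 2, 3]
print(pairs.mapValues(lambda v: v * 10).collect())  # [('a', 10), ('b', 20), ('a', 30)]
print(pairs.flatMapValues(lambda v: range(v)).collect())
# [('a', 0), ('b', 0), ('b', 1), ('a', 0), ('a', 1), ('a', 2)]

names = sc.parallelize(['Ravi', 'Raj', 'Sridhar'])
print(names.keyBy(lambda w: w[0]).collect())        # [('R', 'Ravi'), ('R', 'Raj'), ('S', 'Sridhar')]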

Grouping, sorting and aggregation


groupByKey()
reduceByKey()
foldByKey()
sortByKey()
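
foldByKey() and sortByKey() are not shown later (groupByKey() and reduceByKey() are); a minimal sketch, assuming the usual sc:

sales = sc.parallelize([('A', 3), ('B', 5), ('A', 2), ('B', 1)])

print(sales.foldByKey(0, lambda a, b: a + b).collect())  # [('A', 5), ('B', 6)]  (key order may vary)
print(sales.sortByKey().collect())                       # pairs ordered by key: all 'A' pairs before 'B' pairs
print(sales.sortByKey(ascending=False).collect())        # descending key order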

Joins
join()
leftOuterJoin()
rightOuterJoin()
fullOuterJoin()
cogroup()
cartesian()

Set operations
union()
intersection()
subtract()
subtractByKey()
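
subtractByKey() is listed above but never demonstrated; a minimal sketch, assuming the usual sc:

x = sc.parallelize([('a', 1), ('b', 4), ('b', 5), ('c', 7)])
y = sc.parallelize([('a', 9), ('c', 0)])

print(x.subtractByKey(y).collect())   # [('b', 4), ('b', 5)]  pairs whose key also appears in y are dropped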

Numeric RDD
min()
max()
sum()
mean()
stdev()
variance()

# https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/rdd.html


Lambda Function
A lambda function is a small anonymous function.
A lambda function can take any number of arguments, but can only have one
expression.

def my_sum(a, b):
    return a + b

my_sum(10, 20)

Out[6]: 30

def my_func(a, b):
    return a + b

a = 55
b = 77
my_func(a, b)

x = lambda a : a + 10
print(x(3))
print(x(20))

13
30

x = lambda a, b : a + b
print(x(5, 6))
print(x(2, 50))

list_a = [1, 2, 3, 4, 5, 6, 7, 8]
for i in list_a:
    if i < 4:
        print(i)

my_list = [1,2,3,4,5,6,7,8,8,34,3,34,34,343,5656]
my_list

my_rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,23,23,232323,2323])
my_rdd.collect()

x = sc.parallelize([1,2,3,4,5,6,7,8],4)
new_rdd = x.filter(lambda x: x<4)

x.glom().collect()

no_of_rows = new_rdd.count()

type(no_of_rows)

type(new_rdd)

text_rdd = sc.textFile("dbfs:/databricks-datasets/SPARK_README.md")
type(text_rdd)

%sql
select * from sample_db.emp where deptno=10

What is sparkContext
A SparkContext represents the connection to a Spark cluster,
and can be used to create RDDs, accumulators and broadcast variables on that
cluster
Note: Only one SparkContext should be active per JVM. You must stop() the
active SparkContext before creating a new one.
param: config - a Spark Config object describing the application configuration.
Any settings in this config override the default configs as well as system
properties.
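
Accumulators and broadcast variables are mentioned above but not shown anywhere else in this notebook; a minimal sketch, assuming the Databricks-provided sc:

counter = sc.accumulator(0)                # write-only shared variable, updated from tasks
lookup = sc.broadcast({'a': 1, 'b': 2})    # read-only value cached on every executor

letters = sc.parallelize(['a', 'b', 'a'])
letters.foreach(lambda k: counter.add(lookup.value[k]))
print(counter.value)                       # 4  (1 + 2 + 1), read back on the driver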

Creating RDD using SparkContext Parallelize Method.


Create a Python list, e.g. [1,2,3,4,5,6,7,8,9,10].
The sc.parallelize() method of the SparkContext turns such a local collection into a
parallelized (distributed) collection.
This allows Spark to distribute the data across multiple nodes, instead of
depending on a single node to process the data.

rdd = sc.parallelize([1,2,3,4,5,6,7,8],5)

range_rdd = sc.range(10)
range_rdd.collect()

Creating RDD using file.

textFile_rdd = sc.textFile("/databricks-datasets/samples/docs/README.md")
textFile_rdd.count()

Creating RDD using SparkContext


Applying MAP Transformation in RDD.
MAP(func)
Return a new distributed dataset formed by passing each element of the source
through a function func.

x=sc.parallelize([1,2,3,4,5,6,7])
y=x.map(lambda a: a+10 )
print(x.collect())
print(y.collect())

[1, 2, 3, 4, 5, 6, 7]
[11, 12, 13, 14, 15, 16, 17]

print("X RDD VAlues :",x.collect())


print("Y RDD Values after applying map and lambda
transformation",y.collect())

x_rdd = sc.parallelize([1,2,3,4,5,6,7])
y_odd_rdd = x_rdd.filter(lambda z: z % 2 == 1)
print(x_rdd.collect())
print(y_odd_rdd.collect())

[1, 2, 3, 4, 5, 6, 7]
[1, 3, 5, 7]

y_odd_rdd.collect()

print('RDD X values : ', x.collect())
print('RDD Y values : ', y.collect())

filter(func) Transformation
Return a new dataset formed by selecting those elements of the source on
which func returns true.

x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.filter(lambda x: x%2 == 1) #keep odd values
print('RDD X Values Before Filter : ',x.collect())
print('RDD Y Values after applying filter on the X RDD : ', y.collect())

flatMap(func) Transformation
Similar to map, but each input item can be mapped to 0 or more output items
(so func should return a Seq rather than a single item).

x.flatMap(lambda x: (x,x+55,x*400)).collect()

Out[20]: [1, 56, 400, 2, 57, 800, 3, 58, 1200]

x = sc.parallelize([1,2,3])
y_map = x.map(lambda x: (x, x*100))
z_flatmap = x.flatMap(lambda x: (x, x*100))
print(x.collect())
print(y_map.collect())
print(z_flatmap.collect())

[1, 2, 3]
[(1, 100), (2, 200), (3, 300)]
[1, 100, 2, 200, 3, 300]

#When using PySpark, the flatMap() function does the flattening for us.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sc.parallelize(data).flatMap(lambda x: x).collect()

groupBy(func) Transformation in RDD


Group the data in the original RDD. Create pairs where the key is the output of
a user function, and the value is all items for which the function yields this key

x = sc.parallelize(['John', 'Fred', 'Anna', 'James','Jan','Frend','Axe'])


y = x.groupBy(lambda w: w[0])
print('Before Group By',x.collect())
#print(y.collect())
print('After group by ',[(k,tuple(v)) for (k, v) in y.collect()])

Before Group By ['John', 'Fred', 'Anna', 'James', 'Jan', 'Frend', 'Axe']


After group by [('J', ('John', 'James', 'Jan')), ('F', ('Fred', 'Frend')), ('A', ('Anna', 'Axe'))]

x = sc.parallelize(['Ravi', 'Sridhar', 'Prasad', 'Raj'])


y = x.groupBy(lambda w: w[0])
print(x.collect())
print([(k, list(v)) for (k, v) in y.collect()])

['Ravi', 'Sridhar', 'Prasad', 'Raj']
[('R', ['Ravi', 'Raj']), ('S', ['Sridhar']), ('P', ['Prasad'])]

What is iterable
anything that can be looped over (e.g. you can loop over a string or a file), i.e.
anything that can appear on the right-hand side of a for loop: for x in iterable: ...

print(y.collect())
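
groupBy() returns the values for each key as a pyspark ResultIterable, which is why y.collect() above prints ResultIterable objects; being iterable, it can simply be looped over. A minimal sketch using the grouped RDD y from the previous cells:

for letter, names in y.collect():   # each element is (key, ResultIterable)
    for name in names:              # loop over the iterable of grouped values
        print(letter, name)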

Sample Transformation

Return a sampled subset of this RDD.

Parameters

withReplacement – can elements be sampled multiple times (replaced when sampled out)

fraction – expected size of the sample as a fraction of this RDD's size.
Without replacement: the probability that each element is chosen; fraction must be in [0, 1].
With replacement: the expected number of times each element is chosen; fraction must be >= 0

seed – seed for the random number generator

x = sc.parallelize([1,2,3,4,5,6,7])
x.take(3)

Out[32]: [1, 2, 3]

x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.sample(True,5,300)
print(x.collect())
print(y.collect())
print(y.count())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8,
8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10]
67
Union(DataSet) Transformation
Return a new dataset that contains the union of the elements in the source
dataset and the argument.
glom() Return an RDD created by coalescing all elements within each partition
into an array

# In PySpark, RDD union() behaves like SQL UNION ALL: it returns all records, including duplicates.
# To get SQL UNION behaviour (duplicates eliminated), apply distinct() after the union.
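
A minimal sketch of union() itself (the cell that follows actually demonstrates subtract()), assuming the usual sc:

a = sc.parallelize([1, 2, 3, 4, 5])
b = sc.parallelize([3, 4, 5, 6, 7])

print(a.union(b).collect())              # [1, 2, 3, 4, 5, 3, 4, 5, 6, 7]  duplicates kept (UNION ALL behaviour)
print(a.union(b).distinct().collect())   # duplicates removed (UNION behaviour); order is not guaranteed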

x = sc.parallelize([1,2,3,4,5])
y = sc.parallelize([3,4,5,6,7])
z = y.subtract(x)
print(z.collect())
#print(z.glom().collect())

[6, 7]

intersection(otherDataset)
Return a new RDD that contains the intersection of elements in the source
dataset and the argument.
The output will not contain any duplicate elements, even if the input RDDs did.

x = sc.parallelize([1,2,3,3,4,5], 2)
y = sc.parallelize([3,4,4,5,6,7], 1)
z = x.intersection(y)
print(x.glom().collect())
print(y.glom().collect())
print(z.collect())
print(z.glom().collect())

x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.intersection(y)
print(z.collect())
print(z.glom().collect())

subtract(otherDataset) Transformation
Returns an RDD containing only the values that are present in the first RDD and not in the
second RDD.
Duplicates in the first RDD are kept in the result; subtract() does not deduplicate.

x = sc.parallelize([1,2,2,3,3,4,5,7,8], 2)
y = sc.parallelize([3,4,3,4,5,6], 1)
z = x.subtract(y)
print(z.collect())
print(z.glom().collect())

Cartesian Transformation
Returns the Cartesian product of two RDDs:
a new RDD containing every pair (a, b) where a comes from the first RDD and b from the
second.

x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.cartesian(y)
print(z.collect())

Distinct Transformation
Return a new RDD containing distinct items from the original RDD (omitting all
duplicates)

x = sc.parallelize([1,2,3,3,4,5,5,5,6,6,6,6,7,7,7,3,2,1,8,9,4])
y = x.distinct()
print(y.collect())

coalesce(numPartitions) Transformation
Decrease the number of partitions in the RDD to numPartitions.
Useful for running operations more efficiently after filtering down a large
dataset.
x = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 5, 4, 5, 6, 8, 9, 10, 12, 14, 12, 12, 14, 15, 16, 44, 44, 45], 10)
y = x.coalesce(4)
print(x.glom().collect())
print(y.glom().collect())

%fs ls /user/hive/warehouse/mydb.db/dept_csv/

repartition(numPartitions) Transformation
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them.
This always shuffles all data over the network.
(Repartitioning by column, repartition(numPartitions, *cols), applies to DataFrames; an
RDD's repartition() only takes numPartitions.)

x = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 12, 434, 545, 65, 3434, 232, 4545, 234234, 33, 434, 3434], 5)
y = x.repartition(8)
print(x.glom().collect())
print(x.getNumPartitions())
print(y.glom().collect())
print(y.getNumPartitions())

PartitionBy Transformation
Return a new RDD with the specified number of partitions,
placing original items into the partition returned by a user supplied function

x = sc.parallelize([('J','James'),('F','Fred'),('A','Anna'),('J','John'),
('R','Ravi'),('E','Eswar')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print (x.glom().collect())
print (y.glom().collect())
print('X RDD No. of Partitions : ', x.getNumPartitions())
print('Y RDD No. of Partitions : ', y.getNumPartitions())

ZIP Transformation
Return a new RDD containing pairs whose key is the item in the original RDD,
and whose
value is that item’s corresponding element (same partition, same index) in a
second RDD

x = sc.parallelize([1, 2, 3])
y = x.map(lambda n:n*n)
z = x.zip(y)
print('display X RDD Values: ',x.collect())
print('Display Y RDD Values: ',y.collect())
print(z.collect())

x = sc.parallelize([1, 2, 3,5,6])
y = sc.parallelize([3,4,5,4,8])
z = x.zip(y)
print(z.collect())

groupByKey Transformation in RDD


When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or
average) over each key, using reduceByKey or aggregateByKey will yield much
better performance.
Note: By default, the level of parallelism in the output depends on the number of
partitions of the parent RDD. You can pass an optional numPartitions argument
to set a different number of tasks.

x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1),('C',4),
('C',5),('B',4)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))

JOINS
join(self, other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and
other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in
self and (k, v2) is in other.
Performs a hash join across the cluster.

x = sc.parallelize([("a", 1), ("b", 4)])


y = sc.parallelize([("a", 2), ("a", 3)])
z = x.join(y)
print(z.collect())
#print(sorted(z.collect()))

leftOuterJoin(self, other, numPartitions=None)


Perform a left outer join of self and other.
For each element (k, v) in self, the resulting RDD will either contain all pairs (k,
(v, w)) for w in other, or the pair (k, (v, None)) if no elements in other have key k.
Hash-partitions the resulting RDD into the given number of partitions.

x = sc.parallelize([("a", 1), ("b", 4)])


y = sc.parallelize([("a", 2)])
z = x.leftOuterJoin(y)
print(z.collect())
#print(sorted(z.collect()))

rightOuterJoin(self, other, numPartitions=None)


Perform a right outer join of self and other.
For each element (k, w) in other, the resulting RDD will either contain all pairs
(k, (v, w)) for v in this, or the pair (k, (None, w)) if no elements in self have key k.
Hash-partitions the resulting RDD into the given number of partitions.

x = sc.parallelize([("a", 1)])
y = sc.parallelize([("a", 2),("c", 4)])
z = x.rightOuterJoin(y)
print(z.collect())
#print(sorted(z.collect()))

fullOuterJoin(self, other, numPartitions=None)


Perform a full outer join of self and other, returning both matching and non-matching keys;
missing values are filled with None.

x = sc.parallelize([("a", 1), ("b", 4)])
y = sc.parallelize([("a", 2), ("c", 5)])
z = x.fullOuterJoin(y)
print(z.collect())
#print(sorted(z.collect()))

cogroup(self, other, numPartitions=None)


For each key k in self or other, return a resulting RDD that contains a tuple with
the list of values for that key in self as well as other.

x = sc.parallelize([("a", 1), ("b", 4)])


y = sc.parallelize([("a", 2)])
z = x.cogroup(y)
[(x, list(map(list,y))) for x, y in z.collect()]

What is glom()
glom() returns an RDD created by coalescing all elements within each partition into a list,
which makes it easy to see how the data is laid out across partitions.
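
A minimal sketch, assuming the usual sc:

x = sc.parallelize([1, 2, 3, 4, 5, 6], 3)

print(x.collect())          # [1, 2, 3, 4, 5, 6]
print(x.glom().collect())   # [[1, 2], [3, 4], [5, 6]]  one list per partition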

Shuffle

The shuffle is an expensive operation since it involves disk I/O, data serialization, and
network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to
organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from
MapReduce and does not directly relate to Spark's map and reduce operations.

Operations which can cause a shuffle include repartition operations like repartition and
coalesce, 'ByKey' operations (except for counting) like groupByKey and reduceByKey, and
join operations like cogroup and join.
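
A small sketch showing that a ByKey operation shuffles the data into a new set of partitions (the explicit numPartitions argument is used here only for illustration), assuming the usual sc:

pairs = sc.parallelize([('a', 1), ('b', 1), ('a', 1), ('c', 1)], 4)
summed = pairs.reduceByKey(lambda a, b: a + b, numPartitions=2)   # wide transformation -> shuffle

print(pairs.getNumPartitions())    # 4
print(summed.getNumPartitions())   # 2
print(summed.glom().collect())     # keys are hash-partitioned across the 2 output partitions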

GetNumPartitions()
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

reduceByKey(func, [numPartitions])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce function func,
which must be of type (V,V) => V. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.

NOTE: If you are grouping with groupByKey in order to perform an aggregation (such as a
sum or average) over each key, using reduceByKey or aggregateByKey will provide much
better performance.

wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']

wordsRDD = sc.parallelize(wordsList, 4)
wordCountsCollected = (wordsRDD
                       .map(lambda w: (w, 1))
                       .reduceByKey(lambda x, y: x + y)
                       .collect())
print(wordCountsCollected)

ACTIONS

reduce(func) Action
Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one).
The function should be commutative and associative so that it can be computed
correctly in parallel.

# reduce numbers 1 to 10 by adding them up


x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)
type(y)

count() Action
Return the number of elements in the dataset.

x = sc.parallelize([1,2,3,4,5,6,7,8,9,34,34,34,34,6,45,23,2,0])
x.count()

first() Action
Return the first element of the dataset (similar to take(1)).

x = sc.parallelize([66,77,55,44,33,1,2,3,4])
print(x.first())
print(x.take(4))

takeSample(withReplacement, num, [seed])


Return an array with a random sample of num elements of the dataset, with or without
replacement, optionally pre-specifying a random number generator seed.
Returns a fixed-size sampled subset of this RDD as an array.
withReplacement – whether sampling is done with replacement
num – size of the returned sample
seed – seed for the random number generator

rdd = sc.parallelize(range(0, 10))


print(rdd.takeSample(True, 20, 1))
print(rdd.takeSample(False, 20, 1))

print(rdd.takeSample(False, 5, 2))

print(rdd.takeSample(False, 8, 5))

take(n) Action
Return an array with the first n elements of the dataset.
x = sc.parallelize([1,2,3,4])
x.take(2)

countByValue(self)
Return the count of each unique value in this RDD as a dictionary of (value,
count) pairs.

x = sc.parallelize([1, 2, 1, 2, 2,3,3,4,3,3,4,4,5,5,6,6])
y = x.countByValue().items()
print(y)
type(y)

isEmpty()
Returns true if and only if the RDD contains no elements at all.
Note: an RDD may be empty even when it has at least 1 partition.

x = sc.parallelize(range(10))
print(x.collect())
print(x.isEmpty())
y = sc.parallelize(range(0))
print(y.collect())
print(y.isEmpty())

keys()
Return an RDD with the keys of each tuple.

x = sc.parallelize([(1, 2), (3, 4),(5,55),(6,66)])


y = x.keys()
print(y.collect())

saveAsTextFile(path, compressionCodecClass=None)
Save the RDD to the filesystem indicated in the path
dbutils.fs.rm("dbfs:/tmp/test_data/",True)
x = sc.parallelize(['Raveendra','Eswar','Vamsi','Lakshmi','Vinod'])
x.saveAsTextFile("dbfs:/tmp/test_data/saveAs")

%fs ls dbfs:/tmp/test_data/saveAs/

y = sc.textFile("dbfs:/tmp/test_data/saveAs")
print(y.collect())

saveAsPickleFile(self, path, batchSize=10)


Save this RDD as a SequenceFile of serialized objects. The serializer used is
pyspark.serializers.PickleSerializer; the default batch size is 10.

dbutils.fs.rm("dbfs:/tmp/test_data/picklefile/",True)
x = sc.parallelize(['Raveendra','Eswar','Vinod','Lakshmi','Vamsi'])
x.saveAsPickleFile("dbfs:/tmp/test_data/picklefile")

%fs ls dbfs:/tmp/test_data/picklefile

y = sc.pickleFile("dbfs:/tmp/test_data/picklefile")
print(y.collect())

%fs ls dbfs:/tmp/test_data/picklefile/

STDEV()
Return the standard deviation of the items in the RDD

x = sc.parallelize([2,4,1])
y = x.stdev()
print(x.collect())
print(y)

MIN()
Return the MIN value of the items in the RDD

x = sc.parallelize([2,4,1,3,5,6,7])
y = x.min()
print(x.collect())
print(y)

MAX()
Return the MAX value of the items in the RDD

x = sc.parallelize([2,4,1,55,66,77,8845,454545,])
y = x.max()
print(x.collect())
print(y)

MEAN()
Return the mean of the items in the RDD

x = sc.parallelize([2,4,1])
y = x.mean()
print(x.collect())
print(y)

SUM()
Return the Sum of the items in the RDD

x = sc.parallelize([2,4,1,55,44,34,34,344])
y = x.sum()
print(x.collect())
print(y)

Stats()
Return a StatCounter object that captures the mean, variance and count of the
RDD’s elements in one operation.

list_rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
list_rdd.stats()

my_rdd = sc.parallelize(['test1','test2','test3'])

help(my_rdd)
