Ravi Pyspark RDD Tutorial 1665758938
Terminologies
RDD stands for Resilient Distributed Dataset; these are the elements that run
and operate on multiple nodes to do parallel processing on a cluster.
RDDs are...
immutable
fault tolerant / automatic recovery
Transformation
Action
Narrow transformations
map()
filter()
flatMap()
distinct()
Wide (Broad) transformations
reduceByKey()
groupByKey()
sortBy()
join()
Actions
count()
take()
takeOrdered()
top()
collect()
saveAsTextFile()
first()
reduce()
fold()
aggregate()
foreach()
Dictionary functions
keys()
values()
keyBy()
Functional transformations
mapValues()
flatMapValues()
Joins
join()
leftOuterJoin()
rightOuterJoin()
fullOuterJoin()
cogroup()
cartesian()
Set operations
union()
intersection()
subtract()
subtractByKey()
Numeric RDD
min()
max()
sum()
mean()
stdev()
variance()
# https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/rdd.html a ...
Lambda Function
A lambda function is a small anonymous function.
A lambda function can take any number of arguments, but can only have one
expression.
def my_sum(a, b):
    return a + b

my_sum(10, 20)
Out[6]: 30
def my_func(a, b):
    return a + b

a = 55
b = 77
my_func(a, b)
x = lambda a : a + 10
print(x(3))
print(x(20))
13
30
x = lambda a, b : a + b
print(x(5, 6))
print(x(2, 50))
list_a = [1,2,3,4,5,6,7,8]
for i in list_a:
    if i < 4:
        print(i)
my_list = [1,2,3,4,5,6,7,8,8,34,3,34,34,343,5656]
my_list
my_rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,23,23,232323,2323])
my_rdd.collect()
x = sc.parallelize([1,2,3,4,5,6,7,8],4)
new_rdd = x.filter(lambda x: x<4)
x.glom().collect()
no_of_rows = new_rdd.count()
type(no_of_rows)
type(new_rdd)
text_rdd = sc.textFile("dbfs:/databricks-datasets/SPARK_README.md")
type(text_rdd)
%sql
select * from sample_db.emp where deptno=10
What is sparkContext
A SparkContext represents the connection to a Spark cluster,
and can be used to create RDDs, accumulators and broadcast variables on that
cluster
Note: Only one SparkContext should be active per JVM. You must stop() the
active SparkContext before creating a new one.
param config: a Spark config object describing the application configuration.
Any settings in this config override the default configs as well as system
properties.
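The description above mentions accumulators and broadcast variables; here is a minimal
sketch of both, assuming sc is already available (as in the rest of this notebook) and
using illustrative values:
# Broadcast variable: a read-only value shipped once to every executor
lookup = sc.broadcast({1: 'one', 2: 'two', 3: 'three'})
# Accumulator: a counter that tasks can only add to
counter = sc.accumulator(0)
def tag(n):
    counter.add(1)                        # count processed elements
    return lookup.value.get(n, 'other')   # look up via the broadcast dict
rdd = sc.parallelize([1, 2, 3, 4, 5])
print(rdd.map(tag).collect())   # ['one', 'two', 'three', 'other', 'other']
print(counter.value)            # 5, once the action above has run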
rdd = sc.parallelize([1,2,3,4,5,6,7,8],5)
range_rdd = sc.range(10)
range_rdd.collect()
textFile_rdd = sc.textFile("/databricks-datasets/samples/docs/README.md")
textFile_rdd.count()
x=sc.parallelize([1,2,3,4,5,6,7])
y=x.map(lambda a: a+10 )
print(x.collect())
print(y.collect())
[1, 2, 3, 4, 5, 6, 7]
[11, 12, 13, 14, 15, 16, 17]
x_rdd=sc.parallelize([1,2,3,4,5,6,7])
y_odd_rdd = x_rdd.filter(lambda z: z%2 == 1)
print(x_rdd.collect())
print(y_odd_rdd.collect())
[1, 2, 3, 4, 5, 6, 7]
[1, 3, 5, 7]
y_odd_rdd.collect()
filter(func) Transformation
Return a new dataset formed by selecting those elements of the source on
which func returns true.
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.filter(lambda x: x%2 == 1) #keep odd values
print('RDD X Values Before Filter : ',x.collect())
print('RDD Y Values After applying Filter on X RDD : ',y.collect())
flatMap(func) Transformation
Similar to map, but each input item can be mapped to 0 or more output items
(so func should return a Seq rather than a single item).
x.flatMap(lambda x: (x,x+55,x*400)).collect()
x = sc.parallelize([1,2,3])
y_map = x.map(lambda x: (x, x*100))
z_flatmap = x.flatMap(lambda x: (x, x*100))
print(x.collect())
print(y_map.collect())
print(z_flatmap.collect())
[1, 2, 3]
[(1, 100), (2, 200), (3, 300)]
[1, 100, 2, 200, 3, 300]
#When using PySpark, the flatMap() function does the flattening for us.
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sc.parallelize(data).flatMap(lambda x: x).collect()
What is iterable
anything that can be looped over (for example, you can loop over a string or a file), or
anything that can appear on the right-hand side of a for-loop: for x in iterable: ...
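A quick illustration in plain Python (values chosen arbitrarily):
# strings, lists and range objects are all iterables
for ch in "abc":
    print(ch)       # a, b, c
for n in range(3):
    print(n)        # 0, 1, 2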
print(y.collect())
Sample Transformation
Parameters
fraction – expected size of the sample as a fraction of this RDD's size.
Without replacement: probability that each element is chosen; fraction must be [0, 1].
With replacement: expected number of times each element is chosen; fraction must
be >= 0.
x = sc.parallelize([1,2,3,4,5,6,7])
x.take(3)
Out[32]: [1, 2, 3]
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
y = x.sample(True,5,300)
print(x.collect())
print(y.collect())
print(y.count())
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8,
8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10]
67
Union(DataSet) Transformation
Return a new dataset that contains the union of the elements in the source
dataset and the argument.
glom() Return an RDD created by coalescing all elements within each partition
into an array
# UNION ALL in SQL returns all records, including duplicates; union() in PySpark behaves the same way
# UNION in SQL eliminates duplicates; call distinct() after union() to get that behaviour
x = sc.parallelize([1,2,3,4,5])
y = sc.parallelize([3,4,5,6,7])
z = x.union(y)
print(z.collect())
#print(z.glom().collect())
[1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
intersection(otherDataset)
Return a new RDD that contains the intersection of elements in the source
dataset and the argument.
The output will not contain any duplicate elements, even if the input RDDs did.
x = sc.parallelize([1,2,3,3,4,5], 2)
y = sc.parallelize([3,4,4,5,6,7], 1)
z = x.intersection(y)
print(x.glom().collect())
print(y.glom().collect())
print(z.collect())
print(z.glom().collect())
x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.intersection(y)
print(z.collect())
print(z.glom().collect())
subtract(otherDataset) Transformation
Returns an RDD containing only the values that are present in the first RDD and not in
the second RDD.
Duplicates in the first RDD are not removed; any that do not appear in the second RDD are returned as-is.
x = sc.parallelize([1,2,2,3,3,4,5,7,8], 2)
y = sc.parallelize([3,4,3,4,5,6], 1)
z = x.subtract(y)
print(z.collect())
print(z.glom().collect())
Cartesian Transformation
Provides cartesian product of 2 RDDs
i.e. it returns a new RDD that pairs every value of the 1st RDD with every value of the
2nd RDD.
x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.cartesian(y)
print(z.collect())
Distinct Transformation
Return a new RDD containing distinct items from the original RDD (omitting all
duplicates)
x = sc.parallelize([1,2,3,3,4,5,5,5,6,6,6,6,7,7,7,3,2,1,8,9,4])
y = x.distinct()
print(y.collect())
coalesce(numPartitions) Transformation
Decrease the number of partitions in the RDD to numPartitions.
Useful for running operations more efficiently after filtering down a large
dataset.
x = sc.parallelize([1,2,3,4,5,6,7,8,9,5,4,5,6,8,9,10,12,14,12,12,14,15,16,44,44,45], 10)
y = x.coalesce(4)
print(x.glom().collect())
print(y.glom().collect())
%fs ls /user/hive/warehouse/mydb.db/dept_csv/
repartition(numPartitions) Transformation
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them.
This always shuffles all data over the network.
Repartition by column: with DataFrames we can also repartition by columns (see the
sketch after the example below). Syntax: repartition(numPartitions, *cols)
x = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,12,434,545,65,3434,232,4545,234234,33,434,3434], 5)
y = x.repartition(8)
print(x.glom().collect())
print(x.getNumPartitions())
print(y.glom().collect())
print(y.getNumPartitions())
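As noted above, repartitioning by column is a DataFrame operation rather than an RDD one;
a minimal sketch, assuming spark (a SparkSession) is available in the notebook and using
an illustrative dept column:
df = spark.createDataFrame(
    [(1, 10), (2, 20), (3, 10), (4, 30), (5, 20)],
    ['empno', 'dept'])
# repartition into 3 partitions, hashing on the dept column
df2 = df.repartition(3, 'dept')
print(df.rdd.getNumPartitions())
print(df2.rdd.getNumPartitions())   # 3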
PartitionBy Transformation
Return a new RDD with the specified number of partitions,
placing original items into the partition returned by a user supplied function
x = sc.parallelize([('J','James'),('F','Fred'),('A','Anna'),('J','John'),
('R','Ravi'),('E','Eswar')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print (x.glom().collect())
print (y.glom().collect())
print('X RDD No. of Partitions : ', x.getNumPartitions())
print('Y RDD No. of Partitions : ', y.getNumPartitions())
ZIP Transformation
Return a new RDD containing pairs whose key is the item in the original RDD,
and whose
value is that item’s corresponding element (same partition, same index) in a
second RDD
x = sc.parallelize([1, 2, 3])
y = x.map(lambda n:n*n)
z = x.zip(y)
print('display X RDD Values: ',x.collect())
print('Display Y RDD Values: ',y.collect())
print(z.collect())
x = sc.parallelize([1, 2, 3,5,6])
y = sc.parallelize([3,4,5,4,8])
z = x.zip(y)
print(z.collect())
# groupByKey(): group the values for each key in the RDD into a single iterable
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1),('C',4),
('C',5),('B',4)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))
JOINS
join(self, other, numPartitions=None)
Return an RDD containing all pairs of elements with matching keys in self and
other.
Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in
self and (k, v2) is in other.
Performs a hash join across the cluster.
x = sc.parallelize([("a", 1)])
y = sc.parallelize([("a", 2),("c", 4)])
z = x.rightOuterJoin(y)
print(z.collect())
#print(sorted(z.collect()))
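The cell above shows a right outer join; a minimal sketch of the other join flavours
listed in the outline (join, leftOuterJoin, fullOuterJoin, cogroup), using small
illustrative pair RDDs:
x = sc.parallelize([('a', 1), ('b', 2)])
y = sc.parallelize([('a', 3), ('c', 4)])
print(sorted(x.join(y).collect()))           # [('a', (1, 3))]
print(sorted(x.leftOuterJoin(y).collect()))  # [('a', (1, 3)), ('b', (2, None))]
print(sorted(x.fullOuterJoin(y).collect()))  # [('a', (1, 3)), ('b', (2, None)), ('c', (None, 4))]
# cogroup returns, per key, a pair of iterables: (values from x, values from y)
print([(k, (sorted(a), sorted(b))) for k, (a, b) in sorted(x.cogroup(y).collect())])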
What is glom()
glom() returns an RDD created by coalescing all the elements within each partition
into a single list, which makes it easy to see how the data is distributed across
partitions.
Shuffle: wide transformations (e.g. reduceByKey(), join(), repartition()) redistribute
data between partitions over the network.
getNumPartitions()
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)
reduceByKey(func, [numPartitions])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce function func,
which must be of type (V,V) => V. Like in groupByKey, the number of reduce
tasks is configurable through an optional second argument.
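reduceByKey() is described above but not demonstrated elsewhere in this notebook; a
minimal sketch with illustrative key/value pairs:
x = sc.parallelize([('A', 1), ('B', 2), ('A', 3), ('B', 4), ('C', 5)])
y = x.reduceByKey(lambda a, b: a + b)   # sum the values for each key
print(sorted(y.collect()))              # [('A', 4), ('B', 6), ('C', 5)]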
ACTIONS
reduce(func) Action
Aggregate the elements of the dataset using a function func (which takes two
arguments and returns one).
The function should be commutative and associative so that it can be computed
correctly in parallel.
x = sc.parallelize([1,2,3,4,5,6,7,8,9,34,34,34,34,6,45,23,2,0])
print(x.reduce(lambda a, b: a + b))
print(x.count())
first() Action
Return the first element of the dataset (similar to take(1)).
x = sc.parallelize([66,77,55,44,33,1,2,3,4])
print(x.first())
print(x.take(4))
print(rdd.takeSample(False, 5, 2))
print(rdd.takeSample(False, 8, 5))
take(n) Action
Return an array with the first n elements of the dataset.
x = sc.parallelize([1,2,3,4])
x.take(2)
countByValue(self)
Return the count of each unique value in this RDD as a dictionary of (value,
count) pairs.
x = sc.parallelize([1, 2, 1, 2, 2,3,3,4,3,3,4,4,5,5,6,6])
y = x.countByValue().items()
print(y)
type(y)
isEmpty()
Returns true if and only if the RDD contains no elements at all.
Note: an RDD may be empty even when it has at least 1 partition.
x = sc.parallelize(range(10))
print(x.collect())
print(x.isEmpty())
y = sc.parallelize(range(0))
print(y.collect())
print(y.isEmpty())
keys()
Return an RDD with the keys of each tuple.
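keys() (and its counterpart values(), listed in the outline above) has no example
elsewhere in this notebook; a minimal sketch:
x = sc.parallelize([('A', 1), ('B', 2), ('C', 3)])
print(x.keys().collect())    # ['A', 'B', 'C']
print(x.values().collect())  # [1, 2, 3]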
saveAsTextFile(path, compressionCodecClass=None)
Save the RDD to the filesystem indicated in the path
dbutils.fs.rm("dbfs:/tmp/test_data/",True)
x = sc.parallelize(['Raveendra','Eswar','Vamsi','Lakshmi','Vinod'])
x.saveAsTextFile("dbfs:/tmp/test_data/saveAs")
%fs ls dbfs:/tmp/test_data/saveAs/
y = sc.textFile("dbfs:/tmp/test_data/saveAs")
print(y.collect())
dbutils.fs.rm("dbfs:/tmp/test_data/picklefile/",True)
x = sc.parallelize(['Raveendra','Eswar','Vinod','Lakshmi','Vamsi'])
x.saveAsPickleFile("dbfs:/tmp/test_data/picklefile")
%fs ls dbfs:/tmp/test_data/picklefile
y = sc.pickleFile("dbfs:/tmp/test_data/picklefile")
print(y.collect())
%fs ls dbfs:/tmp/test_data/picklefile/
STDEV()
Return the standard deviation of the items in the RDD
x = sc.parallelize([2,4,1])
y = x.stdev()
print(x.collect())
print(y)
MIN()
Return the MIN value of the items in the RDD
x = sc.parallelize([2,4,1,3,5,6,7])
y = x.min()
print(x.collect())
print(y)
MAX()
Return the MAX value of the items in the RDD
x = sc.parallelize([2,4,1,55,66,77,8845,454545])
y = x.max()
print(x.collect())
print(y)
MEAN()
Return the mean of the items in the RDD
x = sc.parallelize([2,4,1])
y = x.mean()
print(x.collect())
print(y)
SUM()
Return the Sum of the items in the RDD
x = sc.parallelize([2,4,1,55,44,34,34,344])
y = x.sum()
print(x.collect())
print(y)
Stats()
Return a StatCounter object that captures the mean, variance and count of the
RDD’s elements in one operation.
list_rdd = sc.parallelize([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
list_rdd.stats()
my_rdd = sc.parallelize(['test1','test2','test3'])
help(my_rdd)