RDD Task1

The document outlines a series of steps performed in a Jupyter notebook using PySpark to create and manipulate RDDs. It includes importing libraries, creating RDDs from lists and text files, counting rows, splitting data, filtering words, and performing join operations on key-value pair RDDs. The document provides code snippets and their corresponding outputs for each operation.

12/7/24, 10:42 PM RDD_Task.ipynb - Colab

Import the required libraries, then create a SparkContext

from pyspark import SparkContext


sc = SparkContext()

Create and display an RDD from the following list

members = [('JK', 22), ('V', 24), ('Jimin', 24), ('RM', 25), ('J-Hope', 25), ('Suga', 26), ('Jin', 27)]

rdd1 = sc.parallelize(members)  # avoid naming the variable `list`, which shadows the built-in
rdd1.collect()

[('JK', 22),
('V', 24),
('Jimin', 24),
('RM', 25),
('J-Hope', 25),
('Suga', 26),
('Jin', 27)]

Read example.txt into an RDD and display the first 2 elements

rdd2 = sc.textFile('example.txt')
rdd2.take(2)

['Centre for Speech and Language Therapy and Hearing Science',
 'Center for Rehabilitative Auditory Research']

Count the total number of rows in the RDD

rdd2.count()
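For a text-file RDD, count() returns the number of lines. A plain-Python stand-in (no Spark required), assuming example.txt holds the three lines implied by the outputs shown below:

```python
# count() on a textFile RDD counts lines; len() plays the same role here.
lines = [
    "Centre for Speech and Language Therapy and Hearing Science",
    "Center for Rehabilitative Auditory Research",
    "Department of Hearing and Speech Science",
]
print(len(lines))  # 3
```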

Split each line into words using flatMap

rdd3 = rdd2.flatMap(lambda line: line.split())
rdd3.collect()

['Centre',
'for',
'Speech',
'and',
'Language',
'Therapy',
'and',
'Hearing',
'Science',
'Center',
'for',
'Rehabilitative',
'Auditory',
'Research',
'Department',
'of',
'Hearing',
'and',
'Speech',
'Science']
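A plain-Python sketch (no Spark) of why flatMap, not map, is used here: map would yield one word-list per line, while flatMap flattens those lists into a single sequence, matching the output above.

```python
# The three lines from example.txt, as shown in the earlier output.
lines = [
    "Centre for Speech and Language Therapy and Hearing Science",
    "Center for Rehabilitative Auditory Research",
    "Department of Hearing and Speech Science",
]
mapped = [line.split() for line in lines]                   # map: 3 word-lists
flat_mapped = [w for line in lines for w in line.split()]   # flatMap: one flat list
print(len(mapped), len(flat_mapped))  # 3 20
```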

Filter the words starting with 'D'

rdd4 = rdd3.filter(lambda c: c.startswith('D'))
rdd4.collect()


['Department']
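A plain-Python equivalent of the filter step: keep only the words whose first character is 'D'. Note that startswith is case-sensitive, so 'Department' matches but 'department' would not.

```python
# Word list produced by the flatMap step above.
words = ['Centre', 'for', 'Speech', 'and', 'Language', 'Therapy', 'and',
         'Hearing', 'Science', 'Center', 'for', 'Rehabilitative', 'Auditory',
         'Research', 'Department', 'of', 'Hearing', 'and', 'Speech', 'Science']
d_words = [w for w in words if w.startswith('D')]  # case-sensitive match
print(d_words)  # ['Department']
```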

Create some key-value pair RDDs

rdd5 = sc.parallelize([('a', 2), ('b', 3)])
rdd6 = sc.parallelize([('a', 9), ('b', 7), ('c', 10)])

Perform a join operation on the RDDs (rdd5, rdd6)

rdd5.join(rdd6).collect()

[('b', (3, 7)), ('a', (2, 9))]
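A plain-Python sketch of the inner join Spark performs on pair RDDs: keys present on both sides are paired up, while 'c', which appears only on the right side, is dropped from the result.

```python
# Inner join of two key-value lists, mirroring RDD.join semantics.
left = [('a', 2), ('b', 3)]
right = [('a', 9), ('b', 7), ('c', 10)]

# Index the right side by key, keeping all values per key.
right_by_key = {}
for k, v in right:
    right_by_key.setdefault(k, []).append(v)

# Pair each left value with every matching right value (inner join).
joined = [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]
print(sorted(joined))  # [('a', (2, 9)), ('b', (3, 7))]
```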

