RDD Task1

The document outlines a series of steps performed in a Jupyter notebook using PySpark to create and manipulate RDDs. It includes importing libraries, creating RDDs from lists and text files, counting rows, splitting data, filtering words, and performing join operations on key-value pair RDDs. The document provides code snippets and their corresponding outputs for each operation.

12/7/24, 10:42 PM RDD_Task.ipynb - Colab

Import the required libraries, then create a SparkContext

from pyspark import SparkContext


sc = SparkContext()

Create and display an RDD from the following list

members = [('JK', 22), ('V', 24), ('Jimin', 24), ('RM', 25), ('J-Hope', 25), ('Suga', 26), ('Jin', 27)]

rdd1 = sc.parallelize(members)  # avoid naming the variable `list`, which shadows the built-in
rdd1.collect()

[('JK', 22),
('V', 24),
('Jimin', 24),
('RM', 25),
('J-Hope', 25),
('Suga', 26),
('Jin', 27)]

Read example.txt into an RDD and display the first 2 elements

rdd2 = sc.textFile('example.txt')
rdd2.take(2)

['Centre for Speech and Language Therapy and Hearing Science',
 'Center for Rehabilitative Auditory Research']

Count the total number of rows in the RDD

rdd2.count()
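For a text-file RDD, count() returns the number of lines. A plain-Python stand-in (no Spark required), assuming example.txt holds the three lines implied by the outputs shown below:

```python
# count() on a textFile RDD counts lines; len() plays the same role here.
lines = [
    "Centre for Speech and Language Therapy and Hearing Science",
    "Center for Rehabilitative Auditory Research",
    "Department of Hearing and Speech Science",
]
print(len(lines))  # 3
```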

Split each line into words using flatMap

rdd3 = rdd2.flatMap(lambda line: line.split())
rdd3.collect()

['Centre',
'for',
'Speech',
'and',
'Language',
'Therapy',
'and',
'Hearing',
'Science',
'Center',
'for',
'Rehabilitative',
'Auditory',
'Research',
'Department',
'of',
'Hearing',
'and',
'Speech',
'Science']
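A plain-Python sketch (no Spark) of why flatMap, not map, is used here: map would yield one word-list per line, while flatMap flattens those lists into a single sequence, matching the output above.

```python
# The three lines from example.txt, as shown in the earlier output.
lines = [
    "Centre for Speech and Language Therapy and Hearing Science",
    "Center for Rehabilitative Auditory Research",
    "Department of Hearing and Speech Science",
]
mapped = [line.split() for line in lines]                   # map: 3 word-lists
flat_mapped = [w for line in lines for w in line.split()]   # flatMap: one flat list
print(len(mapped), len(flat_mapped))  # 3 20
```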

Filter the words starting with 'D'

rdd4 = rdd3.filter(lambda c: c.startswith('D'))
rdd4.collect()


['Department']
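A plain-Python equivalent of the filter step: keep only the words whose first character is 'D'. Note that startswith is case-sensitive, so 'Department' matches but 'department' would not.

```python
# Word list produced by the flatMap step above.
words = ['Centre', 'for', 'Speech', 'and', 'Language', 'Therapy', 'and',
         'Hearing', 'Science', 'Center', 'for', 'Rehabilitative', 'Auditory',
         'Research', 'Department', 'of', 'Hearing', 'and', 'Speech', 'Science']
d_words = [w for w in words if w.startswith('D')]  # case-sensitive match
print(d_words)  # ['Department']
```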

Create some key-value pair RDDs

rdd5 = sc.parallelize([('a', 2), ('b', 3)])
rdd6 = sc.parallelize([('a', 9), ('b', 7), ('c', 10)])

Perform a join operation on the RDDs (rdd5, rdd6)

rdd5.join(rdd6).collect()

[('b', (3, 7)), ('a', (2, 9))]
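A plain-Python sketch of the inner join Spark performs on pair RDDs: keys present on both sides are paired up, while 'c', which appears only on the right side, is dropped from the result.

```python
# Inner join of two key-value lists, mirroring RDD.join semantics.
left = [('a', 2), ('b', 3)]
right = [('a', 9), ('b', 7), ('c', 10)]

# Index the right side by key, keeping all values per key.
right_by_key = {}
for k, v in right:
    right_by_key.setdefault(k, []).append(v)

# Pair each left value with every matching right value (inner join).
joined = [(k, (lv, rv)) for k, lv in left for rv in right_by_key.get(k, [])]
print(sorted(joined))  # [('a', (2, 9)), ('b', (3, 7))]
```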

