
ApacheSpark MyNotes

This document discusses the differences between RDDs, DataFrames, and Datasets in Apache Spark. RDDs are Spark's fundamental data structure, which are immutable distributed collections that allow parallel operations. DataFrames were introduced to overcome RDD limitations by organizing data into named columns. Datasets provide a typed API over DataFrames for optimized processing. The document provides details on when to use each, how to create them, and their differences in terms of schema inference, debugging support, and other features.


Apache Spark – RDD vs Dataframe vs Dataset

Introduction
It has been 11 years now since Apache Spark came into existence, and it impressively continues to be the first choice of big data developers. Developers have always loved it for providing simple and powerful APIs that can do any kind of analysis on big data.

Initially, in 2011, they came up with the concept of RDDs, then in 2013 with DataFrames, and later in 2015 with the concept of Datasets. None of them has been deprecated; we can still use all of them. In this article, we will understand and see the differences between all three of them.

What are RDDs?


RDDs, or Resilient Distributed Datasets, are the fundamental data structure of Spark. An RDD is an immutable distributed collection of objects of any type that can be processed in parallel.

It is resilient (fault-tolerant): if you perform multiple transformations on the RDD and then, for any reason, a node fails, the RDD is capable of recovering automatically because of its lineage. So, an RDD is immutable and fault tolerant.
There are 3 ways of creating an RDD:

1. Parallelizing an existing collection of data
2. Referencing an external data file stored in external storage
3. Creating an RDD from an already existing RDD (example below)

# 1. Parallelizing a data collection
my_list = [1, 2, 3, 4, 5]
my_list_rdd = sc.parallelize(my_list)

# 2. Referencing an external data file
file_rdd = sc.textFile("path_of_file")
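
The third way, creating an RDD from an already existing RDD, is not shown above; here is a minimal sketch that reuses the my_list_rdd defined earlier (the squared_rdd name is just for illustration):

# 3. Creating an RDD from an already existing RDD
# every transformation (here, map) returns a new RDD; the original is unchanged
squared_rdd = my_list_rdd.map(lambda x: x * x)
print(squared_rdd.collect())  # [1, 4, 9, 16, 25]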

When to use RDDs?


We can use RDDs in the following situations:

- When the transformations are low level: RDDs are beneficial for fast and straightforward data manipulation close to the source of the data.
- When the schema is not needed: RDDs do not automatically infer the schema of the ingested data, so we need to specify the schema of each and every dataset ourselves when we create an RDD.
- When the data is unstructured, like text and media streams: RDDs are beneficial in terms of performance.

What are Dataframes?
DataFrames were first introduced in Spark version 1.3 to overcome the limitations of the Spark RDD. A Spark DataFrame is a distributed collection of data points, but here the data is organized into named columns. DataFrames allow developers to debug code during runtime, which was not possible with RDDs.

DataFrames can read and write data in various formats and sources, like CSV, JSON, Avro, HDFS, and Hive tables. They are already optimized to process large datasets for most pre-processing tasks, so we do not need to write complex functions on our own.

DataFrames use the Catalyst optimizer for optimization purposes.

A DataFrame is an immutable distributed collection of data that enables Spark developers to impose a structure on distributed data. This way, it allows abstraction at a higher level.
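
As a rough illustration of what Catalyst does, calling explain() on a DataFrame query prints the logical and physical plans the optimizer produces; the tiny DataFrame below is hypothetical, assuming an existing SparkSession named spark:

# a small hypothetical DataFrame, just to have something to query
df = spark.createDataFrame([(1, "a"), (2, "b")], ["number", "word"])

# Catalyst rewrites this query into an optimized physical plan before execution
df.filter(df.number > 1).select("word").explain(True)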

How to create a dataframe:

There are three ways to create a DataFrame in Spark by hand:

1. Create a list and parse it as a DataFrame using the createDataFrame() method of the SparkSession.

2. Convert an RDD to a DataFrame using the toDF() method.

3. Import a file into a SparkSession as a DataFrame directly (see the sketch below).
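
Here is a minimal PySpark sketch of the third option, reading a file straight into a DataFrame (the file paths and options are placeholders, not from the original notes):

# read a CSV file directly into a DataFrame via the SparkSession
csv_df = spark.read.csv("path_of_file.csv", header=True, inferSchema=True)

# JSON and other supported sources work the same way
json_df = spark.read.json("path_of_file.json")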

toDF()

toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark implicits:

import spark.implicits._

The toDF() method can be called on a sequence object to create a DataFrame:

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF has the following schema.

root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)

toDF() is limited because the column type and nullable flag cannot be
customized. In this example, the number column is not nullable and
the word column is nullable.

The import spark.implicits._ statement can only be run inside of class definitions when the SparkSession is available. All imports should be at the top of the file, before the class definition, so toDF() encourages bad Scala coding practices.

toDF() is suitable for local testing, but production grade code that’s
checked into master should use a better solution.

createDataFrame()

The createDataFrame() method addresses the limitations of the toDF() method and allows for full schema customization and good Scala coding practices.

Here is how to create someDF with createDataFrame().

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)

createDataFrame() provides the functionality we need, but the syntax is verbose. Our test files will become cluttered and difficult to read if createDataFrame() is used frequently.

createDF()

createDF() is defined in spark-daria and allows for the following terse syntax:

val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)

createDF() creates readable code like toDF() and allows for full schema
customization like createDataFrame(). It’s the best of both worlds.

Big shout out to Nithish for writing the advanced Scala code to
make createDF() work so well.

How to create a DataFrame in PySpark:

https://phoenixnap.com/kb/spark-create-dataframe
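
A minimal PySpark sketch of what the linked guide covers, building a DataFrame from a local collection (the data, column names, and app name below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_notes").getOrCreate()

# build a DataFrame from a local list of tuples with explicit column names
people_df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"]
)
people_df.show()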

Spark SQL:

https://www.analyticsvidhya.com/blog/2020/02/hands-on-tutorial-spark-sql-analyze-data/
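
A minimal sketch of the Spark SQL workflow the linked tutorial walks through: register a DataFrame as a temporary view and query it with plain SQL (the people_df from the sketch above and the view name are illustrative):

# expose the DataFrame to the SQL engine under a (hypothetical) view name
people_df.createOrReplaceTempView("people")

# run a plain SQL query; the result is itself a DataFrame
adults_df = spark.sql("SELECT name, age FROM people WHERE age > 40")
adults_df.show()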
