
ApacheSpark MyNotes

This document discusses the differences between RDDs, DataFrames, and Datasets in Apache Spark. RDDs are Spark's fundamental data structure, which are immutable distributed collections that allow parallel operations. DataFrames were introduced to overcome RDD limitations by organizing data into named columns. Datasets provide a typed API over DataFrames for optimized processing. The document provides details on when to use each, how to create them, and their differences in terms of schema inference, debugging support, and other features.


Apache Spark – RDD vs Dataframe vs Dataset

Introduction
It has been 11 years now since Apache Spark came into existence, and it impressively continues to be the first choice of big data developers. Developers have always loved it for providing simple and powerful APIs that can do any kind of analysis on big data.

Initially, in 2011, they came up with the concept of RDDs, then in 2013 with DataFrames, and later in 2015 with the concept of Datasets. None of them has been deprecated; we can still use all of them. In this article, we will understand and see the differences between all three of them.

What are RDDs?


RDDs, or Resilient Distributed Datasets, are the fundamental data structure of Spark. An RDD is an immutable distributed collection of objects of any type that can be processed in parallel.

It is resilient (fault-tolerant): if you perform multiple transformations on the RDD and then, for any reason, a node fails, the RDD is capable of recovering automatically because of its lineage. So, an RDD is immutable and fault tolerant.
There are 3 ways of creating an RDD:

1. Parallelizing an existing collection of data
2. Referencing an external data file stored in external storage
3. Creating an RDD from an already existing RDD (example below)

# 1. Parallelizing a data collection
my_list = [1, 2, 3, 4, 5]
my_list_rdd = sc.parallelize(my_list)

# 2. Referencing an external data file
file_rdd = sc.textFile("path_of_file")
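
The third way, creating an RDD from an already existing RDD, is not shown above; here is a minimal sketch that reuses the my_list_rdd defined earlier (the squared_rdd name is just for illustration):

# 3. Creating an RDD from an already existing RDD
# every transformation (here, map) returns a new RDD; the original is unchanged
squared_rdd = my_list_rdd.map(lambda x: x * x)
print(squared_rdd.collect())  # [1, 4, 9, 16, 25]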

When to use RDDs?


We can use RDDs in the following situations:

- When the transformations are low level: RDDs are beneficial for fast and straightforward data manipulation close to the source of the data.
- When the schema is not needed: RDDs do not automatically infer the schema of the ingested data, so we need to specify the schema of each and every dataset ourselves when we create an RDD.
- When the data is unstructured, like text and media streams: RDDs are beneficial in terms of performance.

What are Dataframes?
DataFrames were first introduced in Spark version 1.3 to overcome the limitations of the Spark RDD. A Spark DataFrame is a distributed collection of data points, but here the data is organized into named columns. DataFrames allow developers to debug code during runtime, which was not possible with RDDs.

DataFrames can read and write data in various formats and sources, like CSV, JSON, Avro, HDFS, and Hive tables. They are already optimized to process large datasets for most pre-processing tasks, so we do not need to write complex functions on our own.

DataFrames use the Catalyst optimizer for optimization purposes.

A DataFrame is an immutable distributed collection of data that enables Spark developers to impose a structure on distributed data. This way, it allows abstraction at a higher level.
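
As a rough illustration of what Catalyst does, calling explain() on a DataFrame query prints the logical and physical plans the optimizer produces; the tiny DataFrame below is hypothetical, assuming an existing SparkSession named spark:

# a small hypothetical DataFrame, just to have something to query
df = spark.createDataFrame([(1, "a"), (2, "b")], ["number", "word"])

# Catalyst rewrites this query into an optimized physical plan before execution
df.filter(df.number > 1).select("word").explain(True)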

How to create a dataframe:

There are three ways to create a DataFrame in Spark by hand:

1. Create a list and parse it as a DataFrame using the createDataFrame() method of the SparkSession.

2. Convert an RDD to a DataFrame using the toDF() method.

3. Import a file into a SparkSession as a DataFrame directly (see the sketch below).
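
Here is a minimal PySpark sketch of the third option, reading a file straight into a DataFrame (the file paths and options are placeholders, not from the original notes):

# read a CSV file directly into a DataFrame via the SparkSession
csv_df = spark.read.csv("path_of_file.csv", header=True, inferSchema=True)

# JSON and other supported sources work the same way
json_df = spark.read.json("path_of_file.json")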

toDF()

toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark implicits:

import spark.implicits._

The toDF() method can be called on a sequence object to create a DataFrame:

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF has the following schema.

root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)

toDF() is limited because the column type and nullable flag cannot be
customized. In this example, the number column is not nullable and
the word column is nullable.

The import spark.implicits._ statement can only be run inside of class definitions when the SparkSession is available. All imports should be at the top of the file, before the class definition, so toDF() encourages bad Scala coding practices.

toDF() is suitable for local testing, but production grade code that’s
checked into master should use a better solution.

createDataFrame()

The createDataFrame() method addresses the limitations of the toDF() method and allows for full schema customization and good Scala coding practices.

Here is how to create someDF with createDataFrame().

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)

createDataFrame() provides the functionality we need, but the syntax is verbose. Our test files will become cluttered and difficult to read if createDataFrame() is used frequently.

createDF()

createDF() is defined in spark-daria and allows for the following terse syntax:

val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)

createDF() creates readable code like toDF() and allows for full schema
customization like createDataFrame(). It’s the best of both worlds.

Big shout out to Nithish for writing the advanced Scala code to
make createDF() work so well.

How to create a DataFrame in PySpark:

https://phoenixnap.com/kb/spark-create-dataframe
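
A minimal PySpark sketch of what the linked guide covers, building a DataFrame from a local collection (the data, column names, and app name below are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_notes").getOrCreate()

# build a DataFrame from a local list of tuples with explicit column names
people_df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],
    ["name", "age"]
)
people_df.show()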

Spark SQL:

https://www.analyticsvidhya.com/blog/2020/02/hands-on-tutorial-spark-sql-analyze-data/
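
A minimal sketch of the Spark SQL workflow the linked tutorial walks through: register a DataFrame as a temporary view and query it with plain SQL (the people_df from the sketch above and the view name are illustrative):

# expose the DataFrame to the SQL engine under a (hypothetical) view name
people_df.createOrReplaceTempView("people")

# run a plain SQL query; the result is itself a DataFrame
adults_df = spark.sql("SELECT name, age FROM people WHERE age > 40")
adults_df.show()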
