ApacheSpark MyNotes
Introduction
It has been 11 years now since Apache Spark came into existence, and it
impressively continues to be the first choice of big data developers. Developers
have always loved it for providing simple and powerful APIs that can handle almost
any kind of analysis on big data.
Initially, in 2011, Spark came up with the concept of RDDs; then, in 2013, came
Dataframes, and later, in 2015, the concept of Datasets. None of them has been
deprecated, and we can still use all of them. In this article, we will understand and
see the difference between all three of them.
If the data is unstructured, like text or media streams, RDDs will be beneficial
in terms of performance.
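For instance, a classic word count over raw text fits the RDD API naturally. Below
is a minimal sketch, assuming a running SparkSession named spark; the file path is
hypothetical.

    // Word count on unstructured text with the RDD API.
    // Assumes a SparkSession named `spark`; the path is hypothetical.
    val lines = spark.sparkContext.textFile("/data/raw_logs.txt")
    val wordCounts = lines
      .flatMap(_.split("\\s+"))      // split each line into words
      .map(word => (word, 1))        // pair each word with a count of 1
      .reduceByKey(_ + _)            // sum the counts per word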
What are Dataframes?
Dataframes were first introduced in Spark version 1.3 to overcome the limitations of
the Spark RDD. A Spark Dataframe is a distributed collection of data points in
which the data is organized into named columns. Dataframes let developers
debug code during runtime, which was not possible with RDDs.
Dataframes can read and write data in various formats like CSV, JSON, and
Avro, as well as from storage systems like HDFS and Hive tables. They are already
optimized to process large datasets for most pre-processing tasks, so we do not
need to write complex functions on our own.
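As a rough illustration of the read/write API, the sketch below loads a CSV file
and writes it back out as JSON; the SparkSession name and file paths are
assumptions.

    // Read a CSV file into a Dataframe and write it back out as JSON.
    // Assumes a SparkSession named `spark`; paths are hypothetical.
    val df = spark.read
      .option("header", "true")       // treat the first row as column names
      .option("inferSchema", "true")  // guess column types from the data
      .csv("/data/input.csv")

    df.write
      .mode("overwrite")
      .json("/data/output")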
A Dataframe is an immutable distributed data collection that enables Spark developers to
impose a structure on the data.
A common way to create a DataFrame is to build a local list and parse it, using
either the toDF() method or the createDataFrame() method from the SparkSession;
the spark-daria library adds a third option, createDF(). Each approach is covered
below.
toDF()
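Here is a minimal sketch of toDF() on a local Seq of tuples, assuming a
SparkSession named spark is in scope.

    // Create a Dataframe from a local Seq with toDF().
    // Assumes a SparkSession named `spark`.
    import spark.implicits._

    val someDF = Seq(
      (8, "bat"),
      (64, "mouse"),
      (-27, "horse")
    ).toDF("number", "word")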
toDF() is limited because the column types and nullable flags cannot be
customized. In this example, the number column is not nullable and
the word column is nullable.
toDF() is suitable for local testing, but production-grade code that's
checked into master should use a better solution.
createDataFrame()
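createDataFrame() on the SparkSession takes an RDD of Rows together with an
explicit schema, so both column types and nullable flags are fully customizable,
at the cost of more verbose code. A minimal sketch, again assuming a SparkSession
named spark:

    // Create a Dataframe with an explicit schema via createDataFrame().
    // Assumes a SparkSession named `spark`.
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val someData = Seq(
      Row(8, "bat"),
      Row(64, "mouse"),
      Row(-27, "horse")
    )

    val someSchema = StructType(List(
      StructField("number", IntegerType, nullable = true),
      StructField("word", StringType, nullable = true)
    ))

    val someDF = spark.createDataFrame(
      spark.sparkContext.parallelize(someData),
      someSchema
    )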
createDF()
createDF(), from the spark-daria library, creates readable code like toDF() and
allows for full schema customization like createDataFrame(). It's the best of both
worlds, as the sketch below shows.
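A sketch of createDF(), assuming spark-daria is on the classpath and a
SparkSession named spark is in scope:

    // Create a Dataframe with spark-daria's createDF():
    // tuple-style data like toDF(), plus full schema control.
    import org.apache.spark.sql.types.{IntegerType, StringType}
    import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

    val someDF = spark.createDF(
      List(
        (8, "bat"),
        (64, "mouse"),
        (-27, "horse")
      ), List(
        ("number", IntegerType, true),
        ("word", StringType, true)
      )
    )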
Big shout out to Nithish for writing the advanced Scala code to
make createDF() work so well.
Spark SQL:
https://www.analyticsvidhya.com/blog/2020/02/hands-on-tutorial-spark-sql-analyze-data/