Apache Spark
2. SPARK – RDD
Resilient Distributed Datasets
The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain
any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data in stable storage or other RDDs. An RDD is
a fault-tolerant collection of elements that can be operated on in parallel.
There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Spark uses the concept of the RDD to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are
not so efficient.
Data Sharing is Slow in MapReduce
MapReduce is widely adopted for processing and generating large datasets with a
parallel, distributed algorithm on a cluster. It allows users to write parallel computations,
using a set of high-level operators, without having to worry about work distribution and
fault tolerance.
Unfortunately, in most current frameworks, the only way to reuse data between
computations (e.g., between two MapReduce jobs) is to write it to an external stable
storage system (e.g., HDFS). Although such frameworks provide numerous abstractions for
accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel
jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk
I/O. Most Hadoop applications spend more than 90% of their time doing HDFS
read-write operations.
Iterative Operations on MapReduce
Multi-stage applications reuse intermediate results across multiple computations. The
following illustration shows how the current framework handles iterative
operations on MapReduce. Each step writes its results to stable storage and the next step
reads them back, which incurs substantial overhead from data replication, disk I/O, and
serialization and makes the system slow.
Figure: Iterative operations on MapReduce
Interactive Operations on MapReduce
Users run ad-hoc queries on the same subset of data. Each query performs disk I/O on
stable storage, which can dominate application execution time.
The following illustration explains how the current framework works while doing the
interactive queries on MapReduce.
Figure: Interactive operations on MapReduce
Data Sharing using Spark RDD
Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read-write
operations.
Recognizing this problem, researchers developed a specialized framework called Apache
Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports
in-memory computation. This means that Spark keeps the state of a job in memory as an
object that can be shared across jobs. Data sharing in memory
is 10 to 100 times faster than over the network or from disk.
Let us now try to find out how iterative and interactive operations take place in Spark
RDD.
Iterative Operations on Spark RDD
The illustration given below shows iterative operations on Spark RDD. Spark stores
intermediate results in distributed memory instead of stable storage (disk), which makes
the system faster.
Note: If the distributed memory (RAM) is not sufficient to store intermediate results (the
state of the job), then Spark stores those results on disk.
Figure: Iterative operations on Spark RDD
Interactive Operations on Spark RDD
This illustration shows interactive operations on Spark RDD. If different queries are run
on the same set of data repeatedly, this particular data can be kept in memory for better
execution times.
Figure: Interactive operations on Spark RDD
By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access the next time you query it. There
is also support for persisting RDDs on disk, or replicating them across multiple nodes.
3. SPARK – INSTALLATION
Spark is often deployed alongside Hadoop, so it is best to install Spark on a Linux-based
system. The following steps show how to install Apache Spark.
Step 1: Verifying Java Installation
Java is a prerequisite for installing Spark. Use the following
command to verify the Java version.
$ java -version
If Java is already installed on your system, you will see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, install it before
proceeding to the next step.
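As a small illustration of this check, the sketch below prints only the version number from the output of java -version. The helper name parse_java_version is hypothetical (it is not part of Java or Spark), and the pattern assumes the first line contains the word version followed by a quoted string, as in the sample response above.

```shell
# Hypothetical helper: read `java -version` output on stdin and print
# the quoted version string from the first matching line.
parse_java_version() {
    sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# `java -version` writes to stderr, so redirect it to stdout before piping.
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | parse_java_version
else
    echo "java not found on PATH; install Java before continuing"
fi
```

The command -v guard keeps the sketch from failing on a machine where Java is missing; it reports the problem instead.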
Step 2: Verifying Scala installation
You need the Scala language to work with Spark, so let us verify the Scala installation using
the following command.
$ scala -version
If Scala is already installed on your system, you will see the following response:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
If Scala is not installed on your system, proceed to the next step to
install it.
Step 3: Downloading Scala
Download the latest version of Scala by visiting the Download Scala link. For this
tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar
file in the download folder.
Step 4: Installing Scala
Follow the steps given below to install Scala.
Extract the Scala tar file
Type the following command to extract the Scala tar file.
$ tar xvf scala-2.11.6.tgz
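Before moving the extracted files, it can help to confirm the name of the directory the archive unpacks into. The following is a minimal sketch, assuming the archive is gzip-compressed and keeps everything under a single top-level directory; archive_top_dir is a hypothetical helper name.

```shell
# Hypothetical helper: print the top-level directory name inside a .tgz archive.
# Assumes every entry in the archive lives under one top-level directory.
archive_top_dir() {
    # t lists the contents, z handles gzip compression, f names the archive file
    tar tzf "$1" | head -n 1 | cut -d/ -f1
}

# Only run the check if the archive from this tutorial is present.
if [ -f scala-2.11.6.tgz ]; then
    archive_top_dir scala-2.11.6.tgz
fi
```

The printed name is the directory to pass to mv in the next step.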
Move Scala software files
Use the following commands to move the Scala software files to the target directory
(/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command to set PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
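Note that a shell assignment takes no spaces around the = sign, and that an export typed at the prompt lasts only for the current shell session. As a sketch, the hypothetical helper add_to_path below appends a directory to PATH only when it is not already listed, so running it repeatedly does not create duplicate entries.

```shell
# Hypothetical helper: append a directory to PATH unless it is already listed.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;       # append to the end of PATH
    esac
    export PATH
}

add_to_path /usr/local/scala/bin
```

Wrapping PATH in colons before matching lets the same pattern catch the directory at the start, middle, or end of the list.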
Verifying Scala Installation
After installation, it is good practice to verify it. Use the following command to verify the Scala
installation.
$ scala -version
If Scala is installed correctly, you will see the following response:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Step 5: Downloading Apache Spark
Download the latest version of Spark by visiting the Download Spark link. For
this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it,
you will find the Spark tar file in the download folder.
Step 6: Installing Spark
Follow the steps given below to install Spark.
Extracting Spark tar
Use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Moving Spark software files
Use the following commands to move the Spark software files to the target directory
(/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Setting up the environment for Spark
Add the following line to the ~/.bashrc file. It adds the location of the Spark
software files to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
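If the installation steps are repeated, the same export line can end up in ~/.bashrc more than once. The sketch below shows one way to avoid that. The helper append_line_once is hypothetical, and the call at the end is left commented out so that nothing is written until the reader chooses to run it.

```shell
# Hypothetical helper: append a line to a file unless that exact line is already there.
append_line_once() {
    line=$1
    file=$2
    touch "$file"
    # -x matches whole lines, -F treats the text literally, -q suppresses output
    if ! grep -qxF -e "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example call (single quotes keep $PATH unexpanded until the file is sourced):
# append_line_once 'export PATH=$PATH:/usr/local/spark/bin' ~/.bashrc
```

After the line has been added, sourcing ~/.bashrc as shown above applies it to the current session.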
Step 7: Verifying the Spark Installation
Run the following command to open the Spark shell.
$ spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop