
2. SPARK – RDD

Resilient Distributed Datasets


Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain
any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data in stable storage or other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
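
As a minimal sketch of these two creation methods (Scala, as used in spark-shell; the SparkContext sc is provided by the shell and the HDFS path is only a placeholder):

// 1. Parallelize an existing collection in the driver program
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))
println(numbers.reduce(_ + _))    // 15

// 2. Reference a dataset in an external storage system (here, HDFS)
val lines = sc.textFile("hdfs://localhost:9000/user/hadoop/input.txt")
println(lines.count())            // number of lines in the file

Both values are RDDs, so the same transformations and actions apply to either of them.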

Spark uses the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.

Data Sharing is Slow in MapReduce


MapReduce is widely adopted for processing and generating large datasets with a
parallel, distributed algorithm on a cluster. It allows users to write parallel computations,
using a set of high-level operators, without having to worry about work distribution and
fault tolerance.

Unfortunately, in most current frameworks, the only way to reuse data between computations (for example, between two MapReduce jobs) is to write it to an external stable storage system (for example, HDFS). Although this framework provides numerous abstractions for accessing a cluster’s computational resources, users still want more.

Both iterative and interactive applications require faster data sharing across parallel jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. With respect to the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Iterative Operations on MapReduce


Multi-stage applications reuse intermediate results across multiple computations. The following illustration shows how the current framework works while performing iterative operations on MapReduce. This incurs substantial overhead due to data replication, disk I/O, and serialization, which makes the system slow.


Figure: Iterative operations on MapReduce

Interactive Operations on MapReduce


The user runs ad-hoc queries on the same subset of data. Each query performs disk I/O on the stable storage, which can dominate the application execution time.

The following illustration explains how the current framework works while doing the
interactive queries on MapReduce.

Figure: Interactive operations on MapReduce


Data Sharing using Spark RDD


Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most Hadoop applications spend more than 90% of their time doing HDFS read-write operations.

Recognizing this problem, researchers developed a specialized framework called Apache Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing: it stores the state of memory as an object across jobs, and that object is sharable between those jobs. Data sharing in memory is 10 to 100 times faster than sharing over the network or disk.

Let us now try to find out how iterative and interactive operations take place in Spark
RDD.

Iterative Operations on Spark RDD


The illustration given below shows iterative operations on Spark RDD. Spark stores intermediate results in distributed memory instead of stable storage (disk), which makes the system faster; a code sketch follows the figure.

Note: If the distributed memory (RAM) is not sufficient to store the intermediate results (the state of the job), then those results are spilled to disk.

Figure: Iterative operations on Spark RDD
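
As a hedged sketch of this idea (again assuming the SparkContext sc from spark-shell and a placeholder HDFS path), an iterative job can cache its base RDD once and then reuse it from memory in every iteration:

// Load and parse the data once, and keep the parsed RDD in distributed memory
val data = sc.textFile("hdfs://localhost:9000/user/hadoop/data.csv")
             .map(_.split(",").map(_.toDouble))
             .cache()

// Each iteration reads `data` from memory instead of re-reading it from HDFS
for (i <- 1 to 5) {
  val threshold = 10.0 * i
  val matches = data.filter(row => row.sum > threshold).count()
  println(s"Iteration $i: $matches rows with sum above $threshold")
}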

Interactive Operations on Spark RDD


This illustration shows interactive operations on Spark RDD. If different queries are run
on the same set of data repeatedly, this particular data can be kept in memory for better
execution times.

Figure: Interactive operations on Spark RDD



By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
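
The following is a small sketch of the persistence API (Spark 1.x RDD API; the file path and the filter conditions are illustrative only):

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://localhost:9000/user/hadoop/logs.txt")
               .filter(_.contains("ERROR"))

// Keep the elements in memory on the cluster for faster repeated access
errors.persist(StorageLevel.MEMORY_ONLY)        // errors.cache() is equivalent

errors.count()                                  // the first action computes and caches the RDD
errors.filter(_.contains("timeout")).count()    // later queries reuse the in-memory data

// Other storage levels spill to disk or replicate across nodes,
// for example StorageLevel.MEMORY_AND_DISK or StorageLevel.MEMORY_ONLY_2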

3. SPARK – INSTALLATION

Spark builds on the Hadoop ecosystem; therefore, it is better to install Spark on a Linux-based system. The following steps show how to install Apache Spark.

Step 1: Verifying Java Installation


Java installation is one of the mandatory prerequisites for installing Spark. Try the following command to verify the Java version.

$java -version

If Java is already installed on your system, you will see the following response:

java version "1.7.0_71"


Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If you do not have Java installed on your system, install Java before proceeding to the next step.

Step 2: Verifying Scala Installation


Spark requires the Scala language, so let us verify the Scala installation using the following command.

$scala -version

If Scala is already installed on your system, you will see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

If you do not have Scala installed on your system, proceed to the next step for Scala installation.

Step 3: Downloading Scala


Download the latest version of Scala by visiting the Download Scala link. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.


Step 4: Installing Scala


Follow the steps given below to install Scala.

Extract the Scala tar file


Type the following command for extracting the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move Scala software files


Use the following commands to move the Scala software files to their respective directory (/usr/local/scala).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala


Use the following command for setting PATH for Scala.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation


After installation, it is better to verify it. Use the following command to verify the Scala installation.

$scala -version

If Scala is installed correctly, you will see the following response:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark


Download the latest version of Spark by visiting the Download Spark link. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the download folder.


Step 6: Installing Spark


Follow the steps given below for installing Spark.

Extracting Spark tar


Use the following command to extract the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving Spark software files


Use the following commands to move the Spark software files to their respective directory (/usr/local/spark).

$ su -
Password:

# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark


Add the following line to the ~/.bashrc file. This adds the location of the Spark binaries to the PATH variable.

export PATH=$PATH:/usr/local/spark/bin

Use the following command to source the ~/.bashrc file.

$ source ~/.bashrc

Step 7: Verifying the Spark Installation


Use the following command to open the Spark shell.

$spark-shell

If Spark is installed successfully, you will see the following output.

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
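
Once the scala> prompt appears, a quick sanity check (an illustrative example, not part of the startup output) confirms that the SparkContext sc is available:

scala> sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
res0: Long = 50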
