There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
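As a minimal sketch of both approaches in the Spark shell (Scala), where sc is the SparkContext provided by the shell and the HDFS path is a placeholder:

// Create an RDD by parallelizing an existing collection
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

// Create an RDD by referencing an external dataset, e.g. a file in HDFS
// (the path below is a placeholder)
val lines = sc.textFile("hdfs://namenode:9000/user/hadoop/input.txt")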
Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are
not so efficient.
Unfortunately, in most current frameworks, the only way to reuse data between
computations (e.g., between two MapReduce jobs) is to write it to an external stable
storage system (e.g., HDFS). Although this framework provides numerous abstractions for
accessing a cluster's computational resources, users still want more.
Both iterative and interactive applications require faster data sharing across parallel
jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk
I/O. Regarding storage, most Hadoop applications spend more than 90% of their time
doing HDFS read-write operations.
The following illustration explains how the current framework works while doing
interactive queries on MapReduce.
Let us now try to find out how iterative and interactive operations take place in Spark
RDD.
Note: If the distributed memory (RAM) is not sufficient to store intermediate results
(the state of the job), then it will store those results on the disk.
By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access the next time you query it. There
is also support for persisting RDDs on disk, or replicating them across multiple nodes.
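A minimal sketch of persistence in the Spark shell (Scala); the file paths are placeholders:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("input.txt")   // placeholder path
val cached = lines.cache()             // shorthand for persist(MEMORY_ONLY)
cached.count()                         // first action computes and caches the RDD
cached.count()                         // later actions read from memory

// Other storage levels cover disk and replication, e.g. spill to disk
// and keep two replicas across the cluster:
val other = sc.textFile("other.txt")   // placeholder path
other.persist(StorageLevel.MEMORY_AND_DISK_2)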
3. SPARK – INSTALLATION
Spark is a Hadoop sub-project. Therefore, it is better to install Spark on a Linux-based
system. The following steps show how to install Apache Spark. First, verify the Java
installation:
$java -version
If Java is already installed on your system, you will see a response similar to the following.
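For example (illustrative output only; the exact version and build numbers depend on the installed JDK):

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)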
In case you do not have Java installed on your system, install Java before
proceeding to the next step. Next, check whether Scala is installed:
$scala -version
If Scala is already installed on your system, you will see a response similar to the following.
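For example (illustrative output only; the exact version depends on the installed Scala):

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL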
In case you don’t have Scala installed on your system, proceed to the next step for
Scala installation.
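Download Scala and extract the archive. Assuming the downloaded file is scala-2.11.6.tgz (matching the directory moved below):

$ tar xvf scala-2.11.6.tgz

Then move the extracted files to /usr/local/scala as root: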
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
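Next, add Scala's bin directory to your PATH so the scala command is available; a minimal line for ~/.bashrc (the path follows from the move above):

export PATH=$PATH:/usr/local/scala/bin

Verify the installation again: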
$scala -version
If Scala was installed successfully, you will see the version response shown above.
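The same pattern installs Spark itself. Download Spark and extract the archive; assuming the downloaded file is spark-1.3.1-bin-hadoop2.6.tgz (matching the directory moved below):

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Then move the extracted files to /usr/local/spark as root: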
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
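Before reloading ~/.bashrc, add Spark's bin directory to the PATH there; a minimal line for ~/.bashrc (the path follows from the move above):

export PATH=$PATH:/usr/local/spark/bin

Then reload the environment and start the shell: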
$ source ~/.bashrc
$spark-shell
If Spark is installed successfully, you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop