Apache Spark Tutorial
Audience
This tutorial has been prepared for professionals aspiring to learn the basics of Big Data
Analytics using Spark Framework and become a Spark Developer. In addition, it would
be useful for Analytics Professionals and ETL developers as well.
Prerequisite
Before proceeding with this tutorial, we assume that you have prior exposure to Scala programming, database concepts, and any of the Linux operating system flavors.
1. SPARK INTRODUCTION
Industries are using Hadoop extensively to analyze their data sets. The reason is that the Hadoop framework is based on a simple programming model (MapReduce) and enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective. Here, the main concern is to maintain speed in processing large datasets, in terms of both the waiting time between queries and the waiting time to run a program.
Spark was introduced by the Apache Software Foundation to speed up the Hadoop computational process.
Contrary to common belief, Spark is not a modified version of Hadoop, nor does it really depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to implement Spark.
Spark uses Hadoop in two ways: one is storage and the other is processing. Since Spark has its own cluster-management computation, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster-computing technology, designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computations, including interactive queries and stream processing. The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Speed: Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; the intermediate processing data is stored in memory.
Advanced Analytics: Spark does not only support Map and Reduce. It also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.
Hadoop YARN: In a YARN deployment, Spark simply runs on YARN without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.
Components of Spark
The different components of Spark are described below.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
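As an illustration only, here is a minimal sketch of querying structured data from the Spark shell, assuming Spark 1.3+, the shell's pre-created SparkContext sc, and a hypothetical people.json input file:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")      // structured data with an inferred schema
people.registerTempTable("people")                   // expose it to SQL
val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teens.collect().foreach(println)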
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed
Datasets) transformations on those mini-batches of data.
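As an illustration, here is a minimal mini-batch sketch, assuming the Spark shell and a text source on localhost port 9999 (for example, started with nc -lk 9999):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))            // 10-second mini-batches
val lines = ssc.socketTextStream("localhost", 9999)        // DStream of incoming text lines
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()                                             // RDD-style transformations applied per mini-batch
ssc.start()
ssc.awaitTermination()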
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
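As an illustration, here is a minimal sketch of building and querying a small graph in the Spark shell; the vertex and edge data are made up:
import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph(vertices, edges)                // a property graph built over two RDDs
graph.inDegrees.collect().foreach(println)        // (vertexId, in-degree) pairs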
2. SPARK RDD
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
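As an illustration, here is a minimal sketch of persisting an RDD in the Spark shell; sc is the shell's pre-created SparkContext and data.txt is a hypothetical input file:
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt")
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY)   // keep computed partitions in memory (MEMORY_AND_DISK or DISK_ONLY use disk, MEMORY_ONLY_2 replicates)
println(words.count())                    // the first action computes the RDD and caches it
println(words.count())                    // later actions reuse the cached partitions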
3. SPARK INSTALLATION
Spark is Hadoop's sub-project. Therefore, it is better to install Spark on a Linux-based system. The following steps show how to install Apache Spark.
Step 6: Installing Spark
The following commands move the Spark software files to the /usr/local/spark directory (run as the root user, hence the # prompt):
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Step 7: Verifying the Spark Installation
Write the following command to open the Spark shell:
$ spark-shell
If Spark is installed successfully, you will find the following output.
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
4. SPARK CORE PROGRAMMING
Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as RDD (Resilient Distributed Datasets), which is a logical collection of data partitioned across machines. RDDs can be created in two ways: one is by referencing datasets in external storage systems, and the other is by applying transformations (e.g. map, filter, reduce, join) on existing RDDs.
The RDD abstraction is exposed through a language-integrated API. This simplifies
programming complexity because the way applications manipulate RDDs is similar to
manipulating local collections of data.
Spark Shell
Spark provides an interactive shell: a powerful tool to analyze data interactively. It is available in either Scala or Python. Spark's primary abstraction is a distributed
collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created
from Hadoop Input Formats (such as HDFS files) or by transforming other RDDs.
RDD Transformations
RDD transformations return a pointer to a new RDD and allow you to create dependencies between RDDs. Each RDD in the dependency chain (string of dependencies) has a function for calculating its data and has a pointer (dependency) to its parent RDD.
Spark is lazy, so nothing will be executed unless you call some transformation or action that triggers job creation and execution. Look at the following snippet of the word-count example.
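A minimal sketch of such a snippet, assuming the Spark shell (with its pre-created SparkContext sc) and a hypothetical input.txt file:
// Build the word-count lineage; these are all transformations, so nothing runs yet.
val textFile = sc.textFile("input.txt")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
// Only this action triggers job creation and execution.
counts.collect().foreach(println)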
Therefore, an RDD transformation is not a set of data but a step in a program (possibly the only step) telling Spark how to get data and what to do with it.
Given below is a list of RDD transformations.

flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

sample(withReplacement, fraction, seed)
Samples a fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset)
Returns a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset)
Returns a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks])
Returns a new dataset that contains the distinct elements of the source dataset.

groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars])
Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions)
Decreases the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions)
Reshuffles the data in the RDD randomly to create either more or fewer partitions and balances it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner)
Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
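As an illustration, here is a minimal sketch of a few of these transformations in the Spark shell; sc is the pre-created SparkContext and the values are arbitrary:
val nums = sc.parallelize(1 to 10, 2)                  // an RDD with 2 partitions
val doubled = nums.map(_ * 2)                          // map: one output element per input element
val pairs = nums.flatMap(n => Seq((n % 3, n)))         // flatMap: 0 or more output elements per input
val grouped = pairs.groupByKey()                       // (K, Iterable<V>) pairs
val tagged = nums.mapPartitionsWithIndex((idx, it) => it.map(n => (idx, n)))  // tag each element with its partition index
// Transformations are lazy; toDebugString shows the lineage without running anything.
println(grouped.toDebugString)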
Actions
Given below is a list of actions, which return values.

reduce(func)
Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect()
Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count()
Returns the number of elements in the dataset.

takeSample(withReplacement, num, [seed])
Returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])
Returns the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path)
Writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala)
Writes the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface.

saveAsObjectFile(path) (Java and Scala)
Writes the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func)
Runs a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of foreach() may result in undefined behavior. See Understanding closures for more details.
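As an illustration, here is a minimal sketch of a few of these actions in the Spark shell; sc is the pre-created SparkContext, and the values and output directory name are made up:
val nums = sc.parallelize(Array(1, 2, 3, 4, 5))
val total = nums.reduce(_ + _)                       // 15: commutative, associative aggregation
val all = nums.collect()                             // Array(1, 2, 3, 4, 5) brought back to the driver
val n = nums.count()                                 // 5
nums.saveAsTextFile("numbers-out")                   // writes part files to a hypothetical directory
val byParity = nums.map(x => (x % 2, x)).countByKey()  // counts per key: two even, three odd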
Example
Consider a word count example: it counts each word appearing in a document. Consider the following text as an input; it is saved as a file named input.txt in the home directory.
input.txt: input file
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful
as they love,
as they care as they share.
Open Spark-Shell
The following command is used to open the Spark shell. Generally, Spark is built using Scala; therefore, a Spark program runs in a Scala environment.
$ spark-shell
If the Spark shell opens successfully, you will find the following output. Look at the last line of the output: "Spark context available as sc" means the Spark shell has automatically created a SparkContext object with the name sc. The SparkContext object must be created before starting the first step of a program.
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hadoop); users
with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server'
on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Create an RDD
First, we have to read the input file using Spark-Scala API and create an RDD.
The following command is used for reading a file from a given location. Here, a new RDD is created with the name inputfile. The String given as an argument to the textFile() method is the absolute path of the input file. However, if only the file name is given, the input file is assumed to be in the current location.
scala> val inputfile = sc.textFile("input.txt")
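The next commands refer to an RDD named counts that holds the per-word counts and has been cached. A minimal sketch of that intervening word-count step (the original snippet is not shown above) would be:
// Split each line into words, pair each word with 1, and sum the counts per word.
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Keep the result in memory so later actions can reuse it (it is un-persisted further below).
scala> counts.cache()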
Current RDD
While working with an RDD, if you want to know about the current RDD, use the following command. It will show you a description of the current RDD and its dependencies, which is useful for debugging.
scala> counts.toDebugString
Apply the saveAsTextFile action to store the output in a text file. The following command saves the word counts in a directory named output:
scala> counts.saveAsTextFile("output")
The output directory contains the following files:
part-00000
part-00001
_SUCCESS
The following command is used to see the output in the part-00000 file.
[hadoop@localhost output]$ cat part-00000
Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they, 7)
(look,1)
The following command is used to see the output in the part-00001 file.
[hadoop@localhost output]$ cat part-00001
Output
(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)
If you want to un-persist the storage space of a particular RDD, use the following command.
scala> counts.unpersist()
You will see the output as follows:
15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from persistence list
15/06/27 00:57:33 INFO BlockManager: Removing RDD 9
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480 dropped from
memory (free 280061810)
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296 dropped from
memory (free 280062106)
res7: cou.type = ShuffledRDD[9] at reduceByKey at <console>:14
For verifying the storage space in the browser, use the following URL.
http://localhost:4040
You will see a screen that shows the storage space used by the applications running in the Spark shell.
5. SPARK DEPLOYMENT
spark-submit is a shell command used to deploy a Spark application on a cluster. It uses all the respective cluster managers through a uniform interface. Therefore, you do not have to configure your application for each one.
Example
Let us take the same word-count example we used before with shell commands. Here, we consider the same example as a Spark application.
Sample Input
The following text is the input data and the file is named in.txt.
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful
as they love,
SparkWordCount.scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark._

object SparkWordCount {
  def main(args: Array[String]) {
    /* Create the Spark context */
    val sc = new SparkContext(new SparkConf().setAppName("SparkWordCount"))

    /* Read the input file into an RDD */
    val input = sc.textFile("in.txt")

    /* Transform the inputRDD into countRDD */
    val count = input.flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    /* Save the word counts to an output directory and signal completion */
    count.saveAsTextFile("outfile")
    println("OK")
  }
}
Submit the Spark application using the following command:
spark-submit --class SparkWordCount --master local wordcount.jar
If it is executed successfully, you will find the output given below. The OK in the following output is for user identification; it is printed by the last line of the program. If you carefully read the following output, you will find different things, such as MemoryStore cleared.
OK
15/07/08 13:56:13 INFO SparkContext: Invoking stop() from shutdown hook
15/07/08 13:56:13 INFO SparkUI: Stopped Spark web UI at
http://192.168.1.217:4040
15/07/08 13:56:13 INFO DAGScheduler: Stopping DAGScheduler
15/07/08 13:56:14 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
15/07/08 13:56:14 INFO Utils: path = /tmp/spark-45a07b83-42ed-42b3-b2c2823d8d99c5af/blockmgr-ccdda9e3-24f6-491b-b509-3d15a9e05818, already present as
root for deletion.
15/07/08 13:56:14 INFO MemoryStore: MemoryStore cleared
15/07/08 13:56:14 INFO BlockManager: BlockManager stopped
15/07/08 13:56:14 INFO BlockManagerMaster: BlockManagerMaster stopped
15/07/08 13:56:14 INFO SparkContext: Successfully stopped SparkContext
15/07/08 13:56:14 INFO Utils: Shutdown hook called
15/07/08 13:56:14 INFO Utils: Deleting directory /tmp/spark-45a07b83-42ed-42b3b2c2-823d8d99c5af
15/07/08 13:56:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
The output in the part-00000 file ends with:
(look,1)
The commands for checking output in the part-00001 file are:
$ cat part-00001
(walk, 1)
(or, 1)
(talk, 1)
(only, 1)
(love, 1)
(care, 1)
(share, 1)
Go through the following section to know more about the spark-submit command.
Spark-submit Syntax
spark-submit [options] <app jar | python file> [app arguments]
Options
The list given below describes these options:

--master
spark://host:port, mesos://host:port, yarn, or local.

--deploy-mode
Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (default: client).

--class
Your application's main class (for Java / Scala apps).

--name
A name of your application.

--jars
Comma-separated list of local jars to include on the driver and executor classpaths.

--packages
Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths.

--repositories
Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.

--py-files
Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.

--files
Comma-separated list of files to be placed in the working directory of each executor.

--conf (prop=val)
Arbitrary Spark configuration property.

--properties-file
Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.

--driver-memory
Memory for driver (e.g. 1000M, 2G) (default: 512M).

--driver-java-options
Extra Java options to pass to the driver.

--driver-library-path
Extra library path entries to pass to the driver.

--driver-class-path
Extra class path entries to pass to the driver.

--executor-memory
Memory per executor (e.g. 1000M, 2G) (default: 1G).

--proxy-user
User to impersonate when submitting the application.

--help, -h
Show this help message and exit.

--verbose, -v
Print additional debug output.

--version
Print the version of current Spark.

--driver-cores NUM
Cores for driver (default: 1).

--supervise
If given, restarts the driver on failure.

--kill
If given, kills the driver specified.

--status
If given, requests the status of the driver specified.

--total-executor-cores
Total cores for all executors.

--executor-cores
Number of cores per executor.
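As an illustration, a submission that combines several of these options might look like the following; the class name, jar, and memory sizes are hypothetical:
spark-submit --class SparkWordCount --name "Word Count" --master local --driver-memory 1G --executor-memory 1G wordcount.jar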
6. ADVANCED SPARK PROGRAMMING
Spark contains two different types of shared variables: one is broadcast variables and the second is accumulators.
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a set of stages, separated by distributed shuffle
operations. Spark automatically broadcasts the common data needed by tasks within
each stage.
The data broadcasted this way is cached in serialized form and is deserialized before
running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code given below shows this:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
Output:
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
After the broadcast variable is created, it should be used instead of the value v in any
functions run on the cluster, so that v is not shipped to the nodes more than once. In
addition, the object v should not be modified after its broadcast, in order to ensure that
all nodes get the same value of the broadcast variable.
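In the shell, the broadcast value can then be read through the value method; the output should look roughly like the following (the res number depends on your session):
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)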
Accumulators
Accumulators are variables that are only added to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark's UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.
The code given below shows an accumulator being used to add up the elements of an array:
scala> val accum = sc.accumulator(0)
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
scala> accum.value
Output
res2: Int = 10
min()
Minimum value among all elements in the RDD.
variance()
Variance of the elements.
stdev()
Standard deviation.
If you want to use only one of these methods, you can call the corresponding method directly on the RDD.
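A minimal sketch in the Spark shell, assuming sc is the pre-created SparkContext; the numbers are arbitrary:
scala> val nums = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
scala> nums.min()        // smallest element
scala> nums.variance()   // variance of the elements
scala> nums.stdev()      // standard deviation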