Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
4. SPARK – CORE PROGRAMMING
Spark Core is the base of the whole project. It provides distributed task dispatching,
scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data
structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of
data partitioned across machines. RDDs can be created in two ways: one is by
referencing datasets in external storage systems, and the second is by applying
transformations (e.g. map, filter, reduceByKey, join) to existing RDDs.
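As a minimal sketch in the Spark shell (the file name data.txt and the filter predicate are
assumptions for illustration), the two creation paths look like this:

scala> val fromStorage = sc.textFile("data.txt")                               // reference a dataset in external storage
scala> val fromExisting = fromStorage.filter(line => line.contains("Spark"))   // apply a transformation to an existing RDD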
Spark Shell
Spark provides an interactive shell: a powerful tool to analyze data interactively. It is
available in either Scala or Python. Spark's primary abstraction is a distributed
collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created
from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
$ spark-shell
The Spark RDD API introduces a few Transformations and a few Actions to manipulate
RDDs.
RDD Transformations
An RDD transformation returns a pointer to a new RDD and allows you to create dependencies
between RDDs. Each RDD in the dependency chain (string of dependencies) has a function
for calculating its data and a pointer (dependency) to its parent RDD.
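As a small sketch (the file name and the map function here are assumptions), the dependency
chain of an RDD can be inspected in the shell with toDebugString, which prints each RDD and
its parent:

scala> val lines = sc.textFile("data.txt")
scala> val lengths = lines.map(line => line.length)   // lengths depends on lines, which depends on the file
scala> println(lengths.toDebugString)                 // prints the lineage (chain of dependencies)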
Spark is lazy, so nothing will be executed unless you call some transformation or action
that triggers job creation and execution. Therefore, an RDD transformation is not a set of
data but a step in a program (possibly the only step) telling Spark how to get data and
what to do with it. Look at the following snippet of the word-count example.
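A minimal sketch of that word-count chain in the Spark shell (the file name input.txt and
the output directory are assumptions here); nothing is computed until the final action is
called:

scala> val text = sc.textFile("input.txt")                  // RDD of lines
scala> val words = text.flatMap(line => line.split(" "))    // RDD of words
scala> val pairs = words.map(word => (word, 1))             // RDD of (word, 1) pairs
scala> val counts = pairs.reduceByKey(_ + _)                // RDD of (word, count) pairs
scala> counts.saveAsTextFile("output")                      // action: triggers the whole computation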
Given below is a list of RDD transformations.

map(func)
   Returns a new distributed dataset, formed by passing each element of the source through a function func.

filter(func)
   Returns a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func)
   Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func)
   Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func)
   Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
union(otherDataset)
   Returns a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset)
   Returns a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks])
   Returns a new dataset that contains the distinct elements of the source dataset.
groupByKey([numTasks])
   When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.

reduceByKey(func, [numTasks])
   When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func.

sortByKey([ascending], [numTasks])
   When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified by the ascending argument.

join(otherDataset, [numTasks])
   When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks])
   When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset)
   When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars])
   Pipes each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions)
   Decreases the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions)
   Reshuffles the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner)
   Repartitions the RDD according to the given partitioner and, within each resulting partition, sorts records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
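As an illustrative sketch (the sample data below is an assumption, not part of the tutorial),
a few of these transformations applied in the Spark shell; they only build the dependency
chain, and nothing runs until an action is called:

scala> val nums = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val doubled = nums.map(x => x * 2)                        // map: apply a function to every element
scala> val multiplesOfFour = doubled.filter(x => x % 4 == 0)     // filter: keep elements matching a predicate
scala> val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
scala> val sums = pairs.reduceByKey(_ + _)                       // reduceByKey: aggregate values per key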
Actions
The following is a list of Actions, which return values.
reduce(func)
   Aggregates the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect()
   Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count()
   Returns the number of elements in the dataset.

first()
   Returns the first element of the dataset (similar to take(1)).

take(n)
   Returns an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed])
   Returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])
   Returns the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path)
   Writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

saveAsObjectFile(path)
   Writes the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey()
   Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func)
   Runs a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
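A short sketch of a few of these actions in the Spark shell (the sample data is assumed for
illustration); unlike transformations, each of these calls triggers a job and returns a
value to the driver:

scala> val words = sc.parallelize(List("spark", "hadoop", "spark", "scala"))
scala> words.count()                          // 4
scala> words.first()                          // "spark"
scala> words.take(2)                          // Array(spark, hadoop)
scala> words.map(_.length).reduce(_ + _)      // 21, the total number of characters
scala> words.foreach(println)                 // side effects only; on a cluster the output goes to the executors' stdout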
Example
Consider a word-count example: it counts each word appearing in a document. Consider
the following text as an input; it is saved as an input.txt file in the home directory.