
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc

scala>

4. SPARK – CORE PROGRAMMING

Spark Core is the base of the whole project. It provides distributed task dispatching, scheduling, and basic I/O functionalities. Spark uses a specialized fundamental data structure known as the RDD (Resilient Distributed Dataset), which is a logical collection of data partitioned across machines. RDDs can be created in two ways: by referencing datasets in external storage systems, or by applying transformations (e.g. map, filter, reduce, join) to existing RDDs.

The RDD abstraction is exposed through a language-integrated API. This reduces programming complexity because the way applications manipulate RDDs is similar to manipulating local collections of data.
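
As a minimal sketch of the two creation paths (assuming an existing SparkContext named sc and a hypothetical data.txt file):

// 1. Reference a dataset in an external storage system
val lines = sc.textFile("data.txt")

// 2. Apply a transformation to an existing RDD to derive a new one
val lineLengths = lines.map(line => line.length)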

Spark Shell
Spark provides an interactive shell: a powerful tool to analyze data interactively. It is available in either Scala or Python. Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.

Open Spark Shell


The following command is used to open the Spark shell.

$ spark-shell

Create simple RDD

Let us create a simple RDD from a text file. Use the following command to create a simple RDD.

scala> val inputfile = sc.textFile("input.txt")

The output for the above command is:

inputfile: org.apache.spark.rdd.RDD[String] = input.txt MappedRDD[1] at textFile at <console>:12

The Spark RDD API provides a set of Transformations and Actions to manipulate RDDs.

RDD Transformations
An RDD transformation returns a pointer to a new RDD and lets you build dependencies between RDDs. Each RDD in the dependency chain (the string of dependencies) has a function for computing its data and a pointer (dependency) to its parent RDD.


Spark is lazy, so nothing is executed until you call an action that triggers job creation and execution; transformations only record how the data is to be computed. Look at the following snippet of the word-count example.
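
The snippet below is a minimal sketch of such a chain, reusing the inputfile RDD created earlier; wordPairs and counts are illustrative names, not part of the original example.

scala> val wordPairs = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1))
scala> val counts = wordPairs.reduceByKey(_ + _)
// Nothing has run yet: each line only adds another RDD to the dependency chain.
// A job is triggered only by an action, e.g. counts.collect() or counts.saveAsTextFile("output").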

An RDD transformation, therefore, is not a set of data but a step in a program (possibly the only step) that tells Spark how to obtain the data and what to do with it.

Given below is a list of RDD transformations.

S.No  Transformations & Meaning

1   map(func)
    Returns a new distributed dataset, formed by passing each element of the source through a function func.

2   filter(func)
    Returns a new dataset formed by selecting those elements of the source on which func returns true.

3   flatMap(func)
    Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

4   mapPartitions(func)
    Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

5   mapPartitionsWithIndex(func)
    Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

6   sample(withReplacement, fraction, seed)
    Sample a fraction of the data, with or without replacement, using a given random number generator seed.

7   union(otherDataset)
    Returns a new dataset that contains the union of the elements in the source dataset and the argument.

8   intersection(otherDataset)
    Returns a new RDD that contains the intersection of elements in the source dataset and the argument.

9   distinct([numTasks])
    Returns a new dataset that contains the distinct elements of the source dataset.

10  groupByKey([numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
    Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.

11  reduceByKey(func, [numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

12  aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
    When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different from the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

13  sortByKey([ascending], [numTasks])
    When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending argument.

14  join(otherDataset, [numTasks])
    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

15  cogroup(otherDataset, [numTasks])
    When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

16  cartesian(otherDataset)
    When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

17  pipe(command, [envVars])
    Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

18  coalesce(numPartitions)
    Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

19  repartition(numPartitions)
    Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

20  repartitionAndSortWithinPartitions(partitioner)
    Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
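
As a brief illustration, the following shell session sketches a few of these transformations; other.txt is a hypothetical second input file and all value names are illustrative.

scala> val other = sc.textFile("other.txt")
scala> val all = inputfile.union(other)                      // union of the two datasets
scala> val words = all.flatMap(line => line.split(" "))      // one line -> many words
scala> val unique = words.distinct()                         // drop duplicate words
scala> val sorted = words.map(w => (w, 1)).reduceByKey(_ + _).sortByKey()   // word counts sorted by key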


Actions
The following table gives a list of Actions, which return values.

S.No  Action & Meaning

1   reduce(func)
    Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

2   collect()
    Returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

3   count()
    Returns the number of elements in the dataset.

4   first()
    Returns the first element of the dataset (similar to take(1)).

5   take(n)
    Returns an array with the first n elements of the dataset.

6   takeSample(withReplacement, num, [seed])
    Returns an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

7   takeOrdered(n, [ordering])
    Returns the first n elements of the RDD using either their natural order or a custom comparator.

8   saveAsTextFile(path)
    Writes the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark calls toString on each element to convert it to a line of text in the file.

9   saveAsSequenceFile(path) (Java and Scala)
    Writes the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc.).

10  saveAsObjectFile(path) (Java and Scala)
    Writes the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

11  countByKey()
    Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

12  foreach(func)
    Runs a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
    Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
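
As a short sketch, a few of these actions applied to the counts pair RDD built in the earlier word-count sketch (the output directory name is illustrative):

scala> counts.count()                        // number of distinct words; triggers a job
scala> counts.take(5).foreach(println)       // print the first five (word, count) pairs at the driver
scala> counts.saveAsTextFile("counts_out")   // write the full result to a directory of text files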

Programming with RDD

Let us see the implementation of a few RDD transformations and actions in RDD programming with the help of an example.

Example
Consider a word-count example: it counts each word appearing in a document. Consider the following text as an input; it is saved as input.txt in a home directory.

input.txt: input file.

people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,