Spark Shell
Spark provides an interactive shell − a powerful tool to analyze data
interactively. It is available in either Scala or Python. Spark’s
primary abstraction is a distributed collection of items called a Resilient
Distributed Dataset (RDD). RDDs can be created from Hadoop Input Formats
(such as HDFS files) or by transforming other RDDs.
The Spark RDD API provides a set of Transformations and a set of Actions to
manipulate RDDs; each list below is followed by a short usage sketch.
RDD Transformations
An RDD transformation returns a pointer to a new RDD and allows you to create
dependencies between RDDs. Each RDD in the dependency chain (string of
dependencies) has a function for calculating its data and a pointer
(dependency) to its parent RDD.
Spark is lazy, so nothing is executed until you call an action that triggers
job creation and execution; transformations only describe the work to be done.
Look at the following snippet of the word-count example.
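This sketch uses the same word-count logic that is developed step by step later in this chapter; the point here is only that nothing runs until the final action is called.
scala> val lines = sc.textFile("input.txt")                          // transformation: nothing is read yet
scala> val words = lines.flatMap(line => line.split(" "))            // transformation: still lazy
scala> val counts = words.map(word => (word, 1)).reduceByKey(_ + _)  // transformation: still lazy
scala> counts.collect()                                              // action: triggers the actual job
The transformations used here, together with the other common ones, are listed below.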
1 map(func)
Returns a new distributed dataset, formed by passing each element of the
source through the function func.
2 filter(func)
Returns a new dataset formed by selecting those elements of the source on
which func returns true.
3 flatMap(func)
Similar to map, but each input item can be mapped to 0 or more output
items (so func should return a Seq rather than a single item).
4 mapPartitions(func)
Similar to map, but runs separately on each partition (block) of the RDD,
so func must be of type Iterator<T> ⇒ Iterator<U> when running on an
RDD of type T.
5 mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an integer value
representing the index of the partition, so func must be of type (Int,
Iterator<T>) ⇒ Iterator<U> when running on an RDD of type T.
7 union(otherDataset)
Returns a new dataset that contains the union of the elements in the source
dataset and the argument.
8 intersection(otherDataset)
Returns a new RDD that contains the intersection of elements in the source
dataset and the argument.
9 distinct([numTasks])
Returns a new dataset that contains the distinct elements of the source
dataset.
10 groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K,
Iterable<V>) pairs.
11 reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs
where the values for each key are aggregated using the given reduce function
func, which must be of type (V, V) ⇒ V.
13 sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements Ordered, returns
a dataset of (K, V) pairs sorted by keys in ascending or descending order, as
specified in the Boolean ascending argument.
14 join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K,
(V, W)) pairs with all pairs of elements for each key. Outer joins are
supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
15 cogroup(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K,
(Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
16 cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of (T, U) pairs
(all pairs of elements).
17 pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g. a Perl or bash
script. RDD elements are written to the process's stdin and lines output to
its stdout are returned as an RDD of strings.
18 coalesce(numPartitions)
Decreases the number of partitions in the RDD to numPartitions. Useful for
running operations more efficiently after filtering down a large dataset.
19 repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more or fewer
partitions and balance it across them. This always shuffles all data over the
network.
20 repartitionAndSortWithinPartitions(partitioner)
Repartition the RDD according to the given partitioner and, within each
resulting partition, sort records by their keys. This is more efficient than
calling repartition and then sorting within each partition because it can push
the sorting down into the shuffle machinery.
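As a sketch of how some of the key-value transformations above behave; the two small datasets are made up purely for illustration, and the ordering of collected results may vary.
scala> val scores = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3)))
scala> val ages   = sc.parallelize(Seq(("alice", 30), ("carol", 25)))
scala> scores.reduceByKey(_ + _).collect()   // values aggregated per key: alice -> 4, bob -> 2
scala> scores.join(ages).collect()           // keeps only keys present in both datasets: alice
scala> scores.cogroup(ages).collect()        // one (K, (Iterable[V], Iterable[W])) entry per key
scala> scores.sortByKey().collect()          // pairs sorted by key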
Actions
The following table gives a list of Actions, which return values.
1 reduce(func)
Aggregate the elements of the dataset using a function func (which takes
two arguments and returns one). The function should be commutative and
associative so that it can be computed correctly in parallel.
2 collect()
Returns all the elements of the dataset as an array at the driver program.
This is usually useful after a filter or other operation that returns a
sufficiently small subset of the data.
3 count()
Returns the number of elements in the dataset.
4 first()
Returns the first element of the dataset (similar to take(1)).
5 take(n)
Returns an array with the first n elements of the dataset.
7 takeOrdered(n, [ordering])
Returns the first n elements of the RDD using either their natural order or
a custom comparator.
8 saveAsTextFile(path)
Writes the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-supported
file system. Spark calls toString on each element to convert it to a line of
text in the file.
11 countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs
with the count of each key.
12 foreach(func)
Runs a function func on each element of the dataset. This is usually done
for side effects such as updating an Accumulator or interacting with external
storage systems.
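A short sketch of some of these actions on a small RDD; the numbers are made up purely for illustration.
scala> val nums = sc.parallelize(1 to 10)
scala> nums.reduce(_ + _)                          // 55
scala> nums.count()                                // 10
scala> nums.first()                                // 1
scala> nums.take(3)                                // Array(1, 2, 3)
scala> nums.takeOrdered(3)(Ordering[Int].reverse)  // Array(10, 9, 8)
scala> sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).countByKey()  // a -> 2, b -> 1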
Example
Consider a word-count example − it counts each word appearing in a document.
The input is a plain text file saved as input.txt in the home directory.
Open Spark-Shell
The following command is used to open the Spark shell. Spark is built using
Scala; therefore, the Spark shell runs in a Scala environment.
$ spark-shell
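If you want to control where the shell runs, spark-shell also accepts a master URL; for example, the following (purely optional) variant starts it with four local worker threads.
$ spark-shell --master local[4]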
If the Spark shell opens successfully, you will find the following output.
Look at the last line of the output: “Spark context available as sc” means
that the shell has automatically created a SparkContext object with the name
sc. A SparkContext object is needed before starting the first step of a
program; in the shell it is created for you.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions:
Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
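Once the prompt appears, you can confirm that the SparkContext is available by evaluating it directly; the exact values printed depend on your installation.
scala> sc           // res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@...
scala> sc.master    // e.g. local[*], depending on how the shell was started
scala> sc.version   // the Spark version of your installation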
Create an RDD
First, we have to read the input file using the Spark-Scala API and create an RDD.
The following command is used for reading a file from a given location. Here,
a new RDD is created with the name inputfile. The String given as an argument
in the textFile(“”) method is the absolute path of the input file. However,
if only the file name is given, the input file is assumed to be in the
current location.
scala> val inputfile = sc.textFile("input.txt")
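If the file is not in the current directory, you can pass an absolute path or an HDFS URI instead; the paths and host name below are only placeholders.
scala> val inputfile = sc.textFile("/home/hadoop/input.txt")                        // absolute local path (placeholder)
scala> val inputfile = sc.textFile("hdfs://localhost:9000/user/hadoop/input.txt")   // HDFS URI (placeholder)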
Next, read each word as a key with a value of ‘1’ (<key, value> =
<word, 1>) using the map function (map(word ⇒ (word, 1))).
The following command is used for executing the word-count logic. After
executing it, you will not see any output, because this is not an action but
a transformation: it points to a new RDD and tells Spark what to do with the
given data.
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
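Because reduceByKey is also a transformation, nothing has been computed yet; to see the word counts you can apply an action, for example collecting the (small) result to the driver.
scala> counts.collect().foreach(println)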
Current RDD
While working with the RDD, if you want to know about the current RDD, then
use the following command. It will show you a description of the current RDD
and its dependencies, which is useful for debugging.
scala> counts.toDebugString
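The part-00000, part-00001 and _SUCCESS entries listed next are what the output directory contains once the result has been cached and written out with saveAsTextFile, and the directory is then listed from a separate terminal. A minimal sketch of those steps, assuming the output directory is simply named output:
scala> counts.cache()                    // optional: persists the RDD (this is what unpersist() releases later)
scala> counts.saveAsTextFile("output")   // writes one part-NNNNN file per partition
$ ls -1 output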
part-00000
part-00001
_SUCCESS
Output of part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)
Output of part-00001
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)
For verifying the storage space used by the cached RDD in the browser, use
the following URL. The Storage page shows the storage space used by the
application running on the Spark shell.
http://localhost:4040/
If you want to un-persist the storage space of a particular RDD, then use the
following command. After running it, the RDD will no longer appear on the
Storage page.
scala> counts.unpersist()