Lecture 09
Parallel Programming with Spark
Preface: Content of this lecture. In this lecture we will discuss an overview of Spark, the fundamentals of Scala and functional programming, Spark concepts, Spark operations, and job execution.
What is Spark? It is a fast, expressive cluster computing system that is compatible with Apache Hadoop: Spark works with any Hadoop-supported storage system, such as HDFS, S3, sequence files, and so on. It improves efficiency through in-memory computing primitives and general computation graphs, and it improves usability through a rich collection of APIs in Scala, Java, and Python, together with an interactive shell. All of this makes up the Spark scenario. Using in-memory computation it is up to 100 times faster than the earlier generation of MapReduce systems, and with the interactive shell it often needs (2-10x) less code.
Refer slide time :( 01:44)
So, how do we run it? Either on a local multi-core machine, or on a private cluster using Mesos, YARN, or the standalone mode, as sketched below.
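A rough sketch of choosing the run mode when creating the context (the host and port in the cluster URLs are placeholders, not values from the lecture):
import org.apache.spark.SparkContext

// Local mode with 4 worker threads; for a private cluster the master URL
// would instead be something like "spark://host:7077" (standalone) or
// "mesos://host:5050", while YARN is typically used via spark-submit.
val sc = new SparkContext("local[4]", "RunModesExample")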
Spark was originally written in Scala, which allows concise function syntax and interactive use. APIs are now available for Java, Scala, and Python, and interactive shells are available for Scala and Python.
Scala is a high-level language for the Java Virtual Machine: it compiles to JVM byte code, it is statically typed, and it interoperates with Java.
Functions:

def square(x: Int): Int = x * x

def square(x: Int): Int = {
  x * x
}

def announce(text: String): Unit = {
  println(text)
}
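For instance, these definitions could be exercised as follows (a tiny usage sketch):

square(4)          // => 16
announce("Hello")  // prints Hello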
And so, here we take a quick tour of the functional programming language Scala. In Scala, variables are defined with the var keyword, and there are two ways to do it: one is by specifying the type of the variable explicitly, the other is to let the compiler infer it, as sketched below:
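A minimal sketch of the two styles (the values are just for illustration):

var x: Int = 7     // type given explicitly
var y = 7          // type inferred by the compiler (Int)
val name = "Spark" // val declares an immutable binding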
Refer slide time :( 05:41)
Generic types:
var arr = new Array[Int](8)

Indexing:
arr(5) = 7
println(arr(5))

val list = List(1, 2, 3)
def addTwo(x: Int): Int = x + 2
list.map(addTwo)    // => List(3, 4, 5)
So, the goal of Spark is to provide a distributed collection abstraction, and the central concept of Spark is the resilient distributed dataset (RDD), which supports this distributed computation. RDDs are immutable collections of objects spread across the cluster; they are built through transformations such as map and filter; they are automatically rebuilt on failure, because a lineage is recorded and used to reconstruct lost data; and their persistence is controllable, for example by caching them. A small sketch is given below.
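A rough sketch of these ideas in the Spark shell (the file path and the threshold are hypothetical):

// Build an RDD from a text file on any Hadoop-supported storage (path is hypothetical).
val nums = sc.textFile("hdfs://namenode:9000/data/numbers.txt")
  .map(_.trim.toInt)   // transformation: parse each line into an integer
  .filter(_ > 100)     // transformation: keep only the large values

nums.cache()           // controllable persistence: keep the RDD in memory
println(nums.count())  // action: triggers the actual distributed computation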
In all these cases an entire corpus of terabytes of data can be processed efficiently and quickly. Now let us look at RDD fault tolerance. RDDs track the transformations used to build them, their lineage, so that lost data can be recomputed. Here, for instance, we first filter the lines that contain errors, then split each of those tab-separated lines, and collect the result as messages; the original file itself is stored in HDFS. Because the filtered RDD keeps track of the transformations used to build it, any lost partition of messages can be recomputed from its lineage. A sketch of this pipeline is shown below.
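A hedged sketch of that log-mining pipeline in Scala (the file name and the index of the tab-separated field are assumptions):

val lines    = sc.textFile("hdfs://namenode:9000/logs/app.log") // hypothetical log file
val errors   = lines.filter(_.contains("ERROR"))                // keep only the error lines
val messages = errors.map(_.split("\t")(1))                     // take the message field (assumed index)

messages.cache()                               // keep the filtered data in memory for reuse
messages.filter(_.contains("timeout")).count() // queries run against the cached RDD
// If a partition of `messages` is lost, Spark replays filter and map on the
// corresponding block of `lines` -- this recorded chain is the lineage.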
Now consider the behaviour with less RAM: when the data is fully cached, the iteration time is quite low; with less memory available for caching, the iterations take longer.
Refer slide time :( 19:18)
Now the question is which language to use; Scala will generally be the better-performing one. Let us see a tour of Spark operations. The easiest way to use Spark is through the interpreter, the Spark shell, which runs in local mode with only one thread by default; this can be controlled with the MASTER setting to use more threads or a cluster. The first stop is the SparkContext, the main entry point to Spark functionality, which is created for you by the Spark shell.
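In the shell, sc already exists, so operations can be typed directly (a small illustrative sketch):

// `sc` is pre-created by the Spark shell as the entry point to Spark.
val data = sc.parallelize(1 to 1000)       // distribute a local collection
println(data.filter(_ % 2 == 0).count())   // => 500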
Refer slide time :( 19:18)
And let us see how to create and access a key-value pair (a tuple) in each language:
Python: pair = (a, b)
        pair[0] # => a
        pair[1] # => b

Scala:  val pair = (a, b)
        pair._1 // => a
        pair._2 // => b

Java:   Tuple2 pair = new Tuple2(a, b);  // class scala.Tuple2
        pair._1 // => a
        pair._2 // => b
Refer slide time :( 25:59)
Now let us see some more key-value pair operations, for example word count:
lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
So what does the map function do? For every word it emits (word, 1). For the line "to be or not to be" it generates (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1). After the map step, reduceByKey groups the pairs by key and aggregates their values with the plus operator. So all the pairs with key "be" are collected together and their values are added, giving (be, 2); likewise "to" appears twice and becomes (to, 2); whereas "or" and "not" each appear only once, so they remain (or, 1) and (not, 1). This is the reduce function that is applied. A Scala trace of this pipeline is sketched below.
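For the line "to be or not to be", a Scala trace of the same pipeline might look like this (a sketch; the input line is assumed from the example above):

val words = sc.parallelize("to be or not to be".split(" "))
val pairs = words.map(word => (word, 1))
// pairs: (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
val counts = pairs.reduceByKey(_ + _)
counts.collect()  // => Array((or,1), (not,1), (to,2), (be,2)), in some order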
Note that all the pair-RDD operations take an optional second parameter for the number of tasks, as shown below.
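For example (a sketch; the partition count of 5 is arbitrary):

val pairs  = sc.parallelize(Seq(("to", 1), ("be", 1), ("to", 1)))
// The optional second argument sets the number of reduce tasks / output partitions.
val counts = pairs.reduceByKey(_ + _, 5)
println(counts.partitions.length)  // => 5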
Similarly, there is a task scheduler that supports general task graphs internally, pipelines functions wherever possible, reuses cached data with locality awareness, and is partition-aware so that shuffles can be avoided.
Refer slide time :( 31:21)
As for Hadoop compatibility, Spark can read from and write to any storage system or format that has a plugin for Hadoop. APIs such as SparkContext.textFile support these file systems, while SparkContext.hadoopRDD allows passing any Hadoop job configuration to configure the input.
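A hedged sketch (all of the URIs below are placeholders):

// textFile accepts any Hadoop-supported URI scheme.
val localFile = sc.textFile("file:///tmp/input.txt")
val hdfsFile  = sc.textFile("hdfs://namenode:9000/data/input.txt")
val s3File    = sc.textFile("s3n://my-bucket/input.txt")
// SequenceFiles and arbitrary Hadoop InputFormats are also supported,
// e.g. through sc.sequenceFile or sc.hadoopRDD with a JobConf.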
So, this is the complete word count program that we have discussed: it is launched with "local" as the master, WordCount is the name of the program, and the input file is passed in as a program argument.
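A self-contained sketch of such a program in Scala (the object name and the argument layout are assumptions; the lecture's version may differ in detail):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey

object WordCount {
  def main(args: Array[String]): Unit = {
    // args(0): master, e.g. "local"; args(1): input file path
    val sc = new SparkContext(args(0), "WordCount")
    val counts = sc.textFile(args(1))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}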
Refer slide time :( 32:40)
And now let us work through the PageRank example in the Spark system.
Refer slide time :( 32:51)
So, here we can see that this particular node has two outgoing links, so its contribution of 1 is divided equally into 0.5 and 0.5. Similarly, this page also has two outgoing links, so each of its contributions is 0.5. This one has only a single outgoing link, so it contributes its entire rank of 1, and this page likewise contributes 1. Now look at this page: it has one incoming contribution of 0.5, so its new page rank is 0.15 + 0.85 x 0.5 = 0.575 ≈ 0.58. For this page the incoming contribution is 1, so its new rank is 0.15 + 0.85 x 1 = 1.0; the rank of 1 does not change. And how about this one? Here links are coming in from several pages, contributing 1 + 0.5 + 0.5 = 2 in total, so its new rank becomes 0.15 + 0.85 x 2 = 1.85. After this iteration the page ranks have therefore changed to 1.0, 0.58, 0.58, and 1.85, and the iterations continue; the algorithm stops when the ranks no longer change.
Refer slide time :( 35:52)
Further, let us see how this entire PageRank algorithm can be implemented in Scala.
ranks.saveAsTextFile(...)
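Only the final line of the slide's program appears above; a hedged sketch of the full computation might look like the following (the input path, the iteration count, and the variable names are assumptions):

// links: (url, iterable of outgoing-link targets), cached because it is reused every iteration
val links = sc.textFile("hdfs://namenode:9000/data/links.txt")   // hypothetical edge list "src dst"
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .groupByKey()
  .cache()

var ranks = links.mapValues(_ => 1.0)   // every page starts with rank 1.0

for (_ <- 1 to 10) {                    // assumed number of iterations
  val contribs = links.join(ranks).values.flatMap { case (neighbours, rank) =>
    neighbours.map(dest => (dest, rank / neighbours.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("hdfs://namenode:9000/out/ranks")  // hypothetical output path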
Refer slide time :( 37:23)
So, we see that the PageRank performance with Spark is very efficient and much faster than Hadoop: with 16 machines the iteration time is far lower. There are other iterative algorithms implemented on Spark, such as K-means clustering and logistic regression, and in all these cases Spark is far more efficient than Hadoop.
Refer slide time :( 37:53)
Finally, these are some of the references for Spark.
Refer slide time :( 37:59)
In conclusion, Spark offers rich APIs to make data analytics fast, both fast to write and fast to run. It achieves up to 100x speed-ups in real applications. There is a growing community, with 14 companies contributing, and detailed tutorials are available on the website, www.spark-project.org.
Thank you.