Lecture 09
Parallel Programming with Spark
Preface: Content of this lecture. In this lecture we will discuss an overview of Spark, the fundamentals of Scala and functional programming, Spark concepts, Spark operations, and job execution.
What is Spark? It is a fast, expressive cluster computing system that is compatible with Apache Hadoop: Spark works with any Hadoop-supported storage system, such as HDFS, S3, sequence files, and so on. It improves efficiency through in-memory computing primitives and general computation graphs, and it improves usability through a rich collection of APIs in Scala, Java, and Python, together with an interactive shell. All of this makes up the Spark scenario. Using in-memory computation it is up to 100 times faster than the earlier generation of MapReduce systems, and with the interactive shell it often needs (2-10x) less code.
Refer slide time :( 01:44)
So, how do we run it? Either on a local multi-core machine, or on a private cluster using Mesos, YARN, or the standalone mode, as sketched below.
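A rough sketch of choosing the run mode when creating the context (the host and port in the cluster URLs are placeholders, not values from the lecture):
import org.apache.spark.SparkContext

// Local mode with 4 worker threads; for a private cluster the master URL
// would instead be something like "spark://host:7077" (standalone) or
// "mesos://host:5050", while YARN is typically used via spark-submit.
val sc = new SparkContext("local[4]", "RunModesExample")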
Spark was originally written in Scala, which allows concise function syntax and interactive use. APIs are now available for Java, Scala, and Python, and interactive shells are available for Scala and Python.
Scala is a high-level language for the Java Virtual Machine: it compiles to JVM byte code, it is statically typed, and it interoperates with Java.
Functions:

def square(x: Int): Int = x * x

def square(x: Int): Int = {
  x * x
}

def announce(text: String): Unit = {
  println(text)
}
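For instance, these definitions could be exercised as follows (a tiny usage sketch):

square(4)          // => 16
announce("Hello")  // prints Hello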
And so, here we take a quick tour of the functional programming language Scala. In Scala, variables are defined with the var keyword, and there are two ways to do it: one is by specifying the type of the variable explicitly, the other is to let the compiler infer it, as sketched below:
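A minimal sketch of the two styles (the values are just for illustration):

var x: Int = 7     // type given explicitly
var y = 7          // type inferred by the compiler (Int)
val name = "Spark" // val declares an immutable binding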
Refer slide time :( 05:41)
Generic types:
var arr = new Array[Int](8)

Indexing:
arr(5) = 7
println(arr(5))

val list = List(1, 2, 3)
def addTwo(x: Int): Int = x + 2
list.map(addTwo)    // => List(3, 4, 5)
So, the goal of Spark is to provide a distributed collection abstraction, and the central concept of Spark is the resilient distributed dataset (RDD), which supports this distributed computation. RDDs are immutable collections of objects spread across the cluster; they are built through transformations such as map and filter; they are automatically rebuilt on failure, because a lineage is recorded and used to reconstruct lost data; and their persistence is controllable, for example by caching them. A small sketch is given below.
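A rough sketch of these ideas in the Spark shell (the file path and the threshold are hypothetical):

// Build an RDD from a text file on any Hadoop-supported storage (path is hypothetical).
val nums = sc.textFile("hdfs://namenode:9000/data/numbers.txt")
  .map(_.trim.toInt)   // transformation: parse each line into an integer
  .filter(_ > 100)     // transformation: keep only the large values

nums.cache()           // controllable persistence: keep the RDD in memory
println(nums.count())  // action: triggers the actual distributed computation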
In all these cases an entire corpus of terabytes of data can be processed efficiently and quickly. Now let us look at RDD fault tolerance. RDDs track the transformations used to build them, their lineage, so that lost data can be recomputed. Here, for instance, we first filter the lines that contain errors, then split each of those tab-separated lines, and collect the result as messages; the original file itself is stored in HDFS. Because the filtered RDD keeps track of the transformations used to build it, any lost partition of messages can be recomputed from its lineage. A sketch of this pipeline is shown below.
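A hedged sketch of that log-mining pipeline in Scala (the file name and the index of the tab-separated field are assumptions):

val lines    = sc.textFile("hdfs://namenode:9000/logs/app.log") // hypothetical log file
val errors   = lines.filter(_.contains("ERROR"))                // keep only the error lines
val messages = errors.map(_.split("\t")(1))                     // take the message field (assumed index)

messages.cache()                               // keep the filtered data in memory for reuse
messages.filter(_.contains("timeout")).count() // queries run against the cached RDD
// If a partition of `messages` is lost, Spark replays filter and map on the
// corresponding block of `lines` -- this recorded chain is the lineage.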
Now consider the behaviour with less RAM: when the data is fully cached, the iteration time is quite low; with less memory available for caching, the iterations take longer.
Refer slide time :( 19:18)
Now the question is which language to use; Scala will generally be the better-performing one. Let us see a tour of Spark operations. The easiest way to use Spark is through the interpreter, the Spark shell, which runs in local mode with only one thread by default; this can be controlled with the MASTER setting to use more threads or a cluster. The first stop is the SparkContext, the main entry point to Spark functionality, which is created for you by the Spark shell.
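In the shell, sc already exists, so operations can be typed directly (a small illustrative sketch):

// `sc` is pre-created by the Spark shell as the entry point to Spark.
val data = sc.parallelize(1 to 1000)       // distribute a local collection
println(data.filter(_ % 2 == 0).count())   // => 500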
Refer slide time :( 19:18)
And let us see how to create and access a key-value pair (a tuple) in each language:
Python: pair = (a, b)
        pair[0] # => a
        pair[1] # => b

Scala:  val pair = (a, b)
        pair._1 // => a
        pair._2 // => b

Java:   Tuple2 pair = new Tuple2(a, b);  // class scala.Tuple2
        pair._1 // => a
        pair._2 // => b
Refer slide time :( 25:59)
Now let us see some more key-value pair operations, for example word count:
lines = sc.textFile("hamlet.txt")
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)
So what does the map function do? For every word it emits (word, 1). For the line "to be or not to be" it generates (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1). After the map step, reduceByKey groups the pairs by key and aggregates their values with the plus operator. So all the pairs with key "be" are collected together and their values are added, giving (be, 2); likewise "to" appears twice and becomes (to, 2); whereas "or" and "not" each appear only once, so they remain (or, 1) and (not, 1). This is the reduce function that is applied. A Scala trace of this pipeline is sketched below.
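For the line "to be or not to be", a Scala trace of the same pipeline might look like this (a sketch; the input line is assumed from the example above):

val words = sc.parallelize("to be or not to be".split(" "))
val pairs = words.map(word => (word, 1))
// pairs: (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
val counts = pairs.reduceByKey(_ + _)
counts.collect()  // => Array((or,1), (not,1), (to,2), (be,2)), in some order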
Note that all the pair-RDD operations take an optional second parameter for the number of tasks, as shown below.
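For example (a sketch; the partition count of 5 is arbitrary):

val pairs  = sc.parallelize(Seq(("to", 1), ("be", 1), ("to", 1)))
// The optional second argument sets the number of reduce tasks / output partitions.
val counts = pairs.reduceByKey(_ + _, 5)
println(counts.partitions.length)  // => 5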
Similarly, there is a task scheduler that supports general task graphs internally, pipelines functions wherever possible, reuses cached data with locality awareness, and is partition-aware so that shuffles can be avoided.
Refer slide time :( 31:21)
As for Hadoop compatibility, Spark can read from and write to any storage system or format that has a plugin for Hadoop. APIs such as SparkContext.textFile support these file systems, while SparkContext.hadoopRDD allows passing any Hadoop job configuration to configure the input.
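A hedged sketch (all of the URIs below are placeholders):

// textFile accepts any Hadoop-supported URI scheme.
val localFile = sc.textFile("file:///tmp/input.txt")
val hdfsFile  = sc.textFile("hdfs://namenode:9000/data/input.txt")
val s3File    = sc.textFile("s3n://my-bucket/input.txt")
// SequenceFiles and arbitrary Hadoop InputFormats are also supported,
// e.g. through sc.sequenceFile or sc.hadoopRDD with a JobConf.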
So, this is the complete word count program that we have discussed: it is launched with "local" as the master, WordCount is the name of the program, and the input file is passed in as a program argument.
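A self-contained sketch of such a program in Scala (the object name and the argument layout are assumptions; the lecture's version may differ in detail):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey

object WordCount {
  def main(args: Array[String]): Unit = {
    // args(0): master, e.g. "local"; args(1): input file path
    val sc = new SparkContext(args(0), "WordCount")
    val counts = sc.textFile(args(1))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}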
Refer slide time :( 32:40)
And now let us work through the PageRank example in the Spark system.
Refer slide time :( 32:51)
So, here we can see that this particular node has two outgoing links, so its contribution of 1 is divided equally into 0.5 and 0.5. Similarly, this page also has two outgoing links, so each of its contributions is 0.5. This one has only a single outgoing link, so it contributes its entire rank of 1, and this page likewise contributes 1. Now look at this page: it has one incoming contribution of 0.5, so its new page rank is 0.15 + 0.85 x 0.5 = 0.575 ≈ 0.58. For this page the incoming contribution is 1, so its new rank is 0.15 + 0.85 x 1 = 1.0; the rank of 1 does not change. And how about this one? Here links are coming in from several pages, contributing 1 + 0.5 + 0.5 = 2 in total, so its new rank becomes 0.15 + 0.85 x 2 = 1.85. After this iteration the page ranks have therefore changed to 1.0, 0.58, 0.58, and 1.85, and the iterations continue; the algorithm stops when the ranks no longer change.
Refer slide time :( 35:52)
Further, let us see how this entire PageRank algorithm can be implemented in Scala.
ranks.saveAsTextFile(...)
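Only the final line of the slide's program appears above; a hedged sketch of the full computation might look like the following (the input path, the iteration count, and the variable names are assumptions):

// links: (url, iterable of outgoing-link targets), cached because it is reused every iteration
val links = sc.textFile("hdfs://namenode:9000/data/links.txt")   // hypothetical edge list "src dst"
  .map { line => val parts = line.split("\\s+"); (parts(0), parts(1)) }
  .groupByKey()
  .cache()

var ranks = links.mapValues(_ => 1.0)   // every page starts with rank 1.0

for (_ <- 1 to 10) {                    // assumed number of iterations
  val contribs = links.join(ranks).values.flatMap { case (neighbours, rank) =>
    neighbours.map(dest => (dest, rank / neighbours.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("hdfs://namenode:9000/out/ranks")  // hypothetical output path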
Refer slide time :( 37:23)
So, we see that the PageRank performance with Spark is very efficient and much faster than Hadoop: with 16 machines the iteration time is far lower. There are other iterative algorithms implemented on Spark, such as K-means clustering and logistic regression, and in all these cases Spark is far more efficient than Hadoop.
Refer slide time :( 37:53)
Finally, these are some of the references for Spark.
Refer slide time :( 37:59)
In conclusion, Spark offers rich APIs to make data analytics fast, both fast to write and fast to run. It achieves up to 100x speed-ups in real applications. There is a growing community, with 14 companies contributing, and detailed tutorials are available on the website, www.spark-project.org.
Thank you.