Spark
Components of Spark
Apache Spark is built around a core engine, Spark Core, on top of which sit higher-level components such as Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the fundamental data structure of Spark. An RDD is
an immutable distributed collection of objects. Each dataset in an RDD is divided into
logical partitions, which may be computed on different nodes of the cluster. RDDs
can contain any type of Python, Java, or Scala objects, including user-defined classes.
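For instance, here is a minimal sketch of creating and transforming an RDD in the spark-shell, where sc is the SparkContext the shell provides (the variable names and values are illustrative):

// Create an RDD from a local Scala collection, split into 2 logical partitions.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5), 2)
// RDDs are immutable: map returns a new RDD rather than modifying numbers.
val doubled = numbers.map(_ * 2)
println(numbers.partitions.length)          // 2
println(doubled.collect().mkString(", "))   // 2, 4, 6, 8, 10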
Transformations
The following table gives a list of Transformations and their meaning.
1. map(func)
Returns a new distributed dataset, formed by passing each
element of the source through a function func.
2. filter(func)
Returns a new dataset formed by selecting those elements of
the source on which func returns true.
3. flatMap(func)
Similar to map, but each input item can be mapped to 0 or more
output items (so func should return a Seq rather than a single
item).
4. mapPartitions(func)
Similar to map, but runs separately on each partition (block) of
the RDD, so func must be of type Iterator<T> ⇒ Iterator<U>
when running on an RDD of type T.
5. mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides func with an
integer value representing the index of the partition,
so func must be of type (Int, Iterator<T>) ⇒ Iterator<U> when
running on an RDD of type T.
6. sample(withReplacement, fraction, seed)
Sample a fraction of the data, with or without replacement,
using a given random number generator seed.
7. union(otherDataset)
Returns a new dataset that contains the union of the elements
in the source dataset and the argument.
8. intersection(otherDataset)
Returns a new RDD that contains the intersection of elements
in the source dataset and the argument.
9. distinct([numTasks])
Returns a new dataset that contains the distinct elements of
the source dataset.
10. groupByKey([numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of
(K, Iterable<V>) pairs.
Note − If you are grouping in order to perform an aggregation
(such as a sum or average) over each key, using reduceByKey
or aggregateByKey will yield much better performance.
11. reduceByKey(func, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of
(K, V) pairs where the values for each key are aggregated using
the given reduce function func, which must be of type (V, V) ⇒
V. Like in groupByKey, the number of reduce tasks is
configurable through an optional second argument.
12. aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
When called on a dataset of (K, V) pairs, returns a dataset of
(K, U) pairs where the values for each key are aggregated using
the given combine functions and a neutral "zero" value. Allows
an aggregated value type that is different from the input value
type, while avoiding unnecessary allocations. Like in
groupByKey, the number of reduce tasks is configurable
through an optional second argument.
13. sortByKey([ascending], [numTasks])
When called on a dataset of (K, V) pairs where K implements
Ordered, returns a dataset of (K, V) pairs sorted by keys in
ascending or descending order, as specified in the Boolean
ascending argument.
14. join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a
dataset of (K, (V, W)) pairs with all pairs of elements for each
key. Outer joins are supported through leftOuterJoin,
rightOuterJoin, and fullOuterJoin.
15. cogroup(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a
dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This
operation is also called groupWith.
16. cartesian(otherDataset)
When called on datasets of types T and U, returns a dataset of
(T, U) pairs (all pairs of elements).
17. pipe(command, [envVars])
Pipe each partition of the RDD through a shell command, e.g.
a Perl or bash script. RDD elements are written to the process's
stdin and lines output to its stdout are returned as an RDD of
strings.
18. coalesce(numPartitions)
Decrease the number of partitions in the RDD to numPartitions.
Useful for running operations more efficiently after filtering
down a large dataset.
19. repartition(numPartitions)
Reshuffle the data in the RDD randomly to create either more
or fewer partitions and balance it across them. This always
shuffles all data over the network.
20. repartitionAndSortWithinPartitions(partitioner)
Repartition the RDD according to the given partitioner and,
within each resulting partition, sort records by their keys. This
is more efficient than calling repartition and then sorting within
each partition because it can push the sorting down into the
shuffle machinery.
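As an illustrative sketch (again assuming a spark-shell session where sc is available; the variable names are arbitrary), a few of these transformations can be chained together. Because transformations are lazy, nothing is computed until an action is applied − this is the same pattern used in the word count example later in this chapter.

// Build a small RDD of text lines and derive per-word counts from it.
val lines  = sc.parallelize(Seq("to be or not to be", "that is the question"))
val words  = lines.flatMap(line => line.split(" "))   // 0..n output items per input line
val pairs  = words.map(word => (word, 1))             // (K, V) pairs
val counts = pairs.reduceByKey(_ + _)                 // aggregate the values of each key
// counts only describes a lineage of transformations; no job has run yet.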
Actions
The following table gives a list of Actions, which return values.
1. reduce(func)
Aggregate the elements of the dataset using a function func (which takes
two arguments and returns one). The function should be commutative and
associative so that it can be computed correctly in parallel.
2. collect()
Returns all the elements of the dataset as an array at the driver program.
This is usually useful after a filter or other operation that returns a
sufficiently small subset of the data.
3. count()
Returns the number of elements in the dataset.
4. first()
Returns the first element of the dataset (similar to take(1)).
5. take(n)
Returns an array with the first n elements of the dataset.
6. takeSample(withReplacement, num, [seed])
Returns an array with a random sample of num elements of the dataset,
with or without replacement, optionally pre-specifying a random number
generator seed.
7. takeOrdered(n, [ordering])
Returns the first n elements of the RDD using either their natural order or
a custom comparator.
8. saveAsTextFile(path)
Writes the elements of the dataset as a text file (or set of text files) in a
given directory in the local filesystem, HDFS or any other Hadoop-
supported file system. Spark calls toString on each element to convert it to
a line of text in the file.
9. saveAsSequenceFile(path) (Java and Scala)
Writes the elements of the dataset as a Hadoop SequenceFile in a given
path in the local filesystem, HDFS or any other Hadoop-supported file
system. This is available on RDDs of key-value pairs that implement
Hadoop's Writable interface. In Scala, it is also available on types that are
implicitly convertible to Writable (Spark includes conversions for basic
types like Int, Double, String, etc).
10. saveAsObjectFile(path) (Java and Scala)
Writes the elements of the dataset in a simple format using Java
serialization, which can then be loaded using SparkContext.objectFile().
11. countByKey()
Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs
with the count of each key.
12. foreach(func)
Runs a function func on each element of the dataset. This is usually done
for side effects such as updating an Accumulator or interacting with
external storage systems.
Note − modifying variables other than Accumulators outside of the
foreach() may result in undefined behavior. See Understanding closures for
more details.
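To make the difference from transformations concrete, here is a brief sketch of applying a few of these actions to the counts RDD built in the earlier sketch (the returned values depend on the sample data):

counts.count()                      // number of distinct words
counts.first()                      // one (word, count) pair
counts.take(3).foreach(println)     // first three pairs, printed at the driver
counts.collect()                    // all pairs, returned as an Array to the driver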
Example
Consider a word count example − it counts each word appearing in a document. Consider the following text as input, saved as a file named input.txt in the home directory.
input.txt − input file.
people are not as beautiful as they look,
as they walk or as they talk.
they are only as beautiful as they love,
as they care as they share.
Follow the procedure given below to execute the given example.
Open Spark-Shell
The following command is used to open the Spark shell. Generally, Spark is built using Scala; therefore, a Spark program runs in a Scala environment.
$ spark-shell
If the Spark shell opens successfully, you will find the following output. The last line of the output, “Spark context available as sc”, means that the Spark shell has automatically created a SparkContext object with the name sc. The SparkContext object must be created before starting the first step of a program.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled;
ui acls disabled; users with view permissions: Set(hadoop); users with modify
permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port
43292.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Create an RDD
First, we have to read the input file using Spark-Scala API and create an RDD.
The following command is used for reading a file from the given location. Here, a new RDD is created with the name inputfile. The String given as an argument to the textFile(“”) method is the absolute path of the input file. However, if only the file name is given, the input file is assumed to be in the current location.
scala> val inputfile = sc.textFile("input.txt")
Execute Word count Transformation
Our aim is to count the words in a file. Create a flat map for splitting each line into words (flatMap(line ⇒ line.split(“ ”))).
Next, read each word as a key with a value of 1 (<key, value> = <word, 1>) using the map function (map(word ⇒ (word, 1))).
Finally, reduce those keys by adding the values of similar keys (reduceByKey(_+_)).
The following command is used for executing the word count logic. After executing it, you will not find any output, because this is not an action; it is a transformation, which points to a new RDD and tells Spark what to do with the given data.
scala> val counts = inputfile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
Current RDD
While working with the RDD, if you want to know about the current RDD, then use the following command. It shows a description of the current RDD and its dependencies, which is useful for debugging.
scala> counts.toDebugString
Caching the Transformations
You can mark an RDD to be persisted using the persist() or cache() methods on it. The
first time it is computed in an action, it will be kept in memory on the nodes. Use the
following command to store the intermediate transformations in memory.
scala> counts.cache()
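For RDDs, cache() is shorthand for persist() with the default MEMORY_ONLY storage level. If you prefer to state the storage level explicitly, a minimal equivalent sketch is:

scala> import org.apache.spark.storage.StorageLevel
scala> counts.persist(StorageLevel.MEMORY_ONLY)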
Applying the Action
Applying an action, such as saveAsTextFile, stores the result of all the transformations into a text file. The String argument for the saveAsTextFile(“”) method is the absolute path of the output folder. Try the following command to save the output in a text file. In the following example, the ‘output’ folder is in the current location.
scala> counts.saveAsTextFile("output")
Checking the Output
Open another terminal and go to the home directory (where Spark is executed in the other terminal). Use the following commands to check the output directory.
[hadoop@localhost ~]$ cd output/
[hadoop@localhost output]$ ls -1
part-00000
part-00001
_SUCCESS
The following command is used to see the output from the part-00000 file.
[hadoop@localhost output]$ cat part-00000
Output
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)
The following command is used to see the output from the part-00001 file.
[hadoop@localhost output]$ cat part-00001
Output
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)
Unpersist the Storage
Before unpersisting, if you want to see the storage space that is used for this application, then use the following URL in your browser.
http://localhost:4040
You will see the following screen, which shows the storage space used by the applications running on the Spark shell.
If you want to unpersist the storage space of a particular RDD, then use the following command.
scala> counts.unpersist()
You will see the output as follows −
15/06/27 00:57:33 INFO ShuffledRDD: Removing RDD 9 from persistence list
15/06/27 00:57:33 INFO BlockManager: Removing RDD 9
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_1
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_1 of size 480 dropped from memory
(free 280061810)
15/06/27 00:57:33 INFO BlockManager: Removing block rdd_9_0
15/06/27 00:57:33 INFO MemoryStore: Block rdd_9_0 of size 296 dropped from memory
(free 280062106)
res7: counts.type = ShuffledRDD[9] at reduceByKey at <console>:14
For verifying the storage space in the browser, use the following URL.
http://localhost:4040/
You will see the following screen. It shows the storage space used by the applications running on the Spark shell.