8 PDFsam Apache Spark Tutorial

Uploaded by

mitmak

Apache Spark

2. SPARK – RDD

Resilient Distributed Datasets

The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain
any type of Python, Java, or Scala objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data in stable storage or other RDDs. An RDD is
a fault-tolerant collection of elements that can be operated on in parallel.
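
The idea of a read-only, partitioned collection can be sketched in plain Python. This is a toy model only: it does not use Spark, and the helper names `split_into_partitions` and `map_partitions` are assumptions made up for illustration. Threads stand in for the cluster nodes that would each process one partition.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal slices."""
    size, extra = divmod(len(records), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(tuple(records[start:end]))  # tuples are read-only
        start = end
    return partitions


def map_partitions(partitions, func):
    """Apply func to every record, one worker per partition.

    Returns new partitions; the input partitions are never modified.
    """
    def work(partition):
        return tuple(func(x) for x in partition)

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(work, partitions))


partitions = split_into_partitions(list(range(10)), 3)
squared = map_partitions(partitions, lambda x: x * x)
total = sum(sum(p) for p in squared)

print(partitions)  # [(0, 1, 2, 3), (4, 5, 6), (7, 8, 9)]
print(total)       # 285
```

Because each transformation returns a new set of partitions rather than changing the old one, the original data stays available for any other computation that needs it.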

There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop Input Format.

Spark uses the concept of RDD to achieve faster and more efficient MapReduce
operations. Let us first discuss how MapReduce operations take place and why they are
not so efficient.

Data Sharing is Slow in MapReduce

MapReduce is widely adopted for processing and generating large datasets with a
parallel, distributed algorithm on a cluster. It allows users to write parallel computations,
using a set of high-level operators, without having to worry about work distribution and
fault tolerance.

Unfortunately, in most current frameworks, the only way to reuse data between
computations (for example, between two MapReduce jobs) is to write it to an external stable
storage system (for example, HDFS). Although these frameworks provide numerous abstractions for
accessing a cluster’s computational resources, users still want more.

Both iterative and interactive applications require faster data sharing across parallel
jobs. Data sharing is slow in MapReduce due to replication, serialization, and disk
I/O. Most Hadoop applications spend more than 90% of their time doing HDFS
read-write operations.

Iterative Operations on MapReduce

Multi-stage applications reuse intermediate results across multiple computations. The
following illustration shows how the current framework handles iterative
operations on MapReduce: each stage writes its results to stable storage and the next
stage reads them back. This incurs substantial overhead due to data replication, disk
I/O, and serialization, which makes the system slow.


Figure: Iterative operations on MapReduce

Interactive Operations on MapReduce

A user runs ad-hoc queries on the same subset of data. Each query performs disk I/O on
the stable storage, which can dominate application execution time.

The following illustration explains how the current framework works while doing the
interactive queries on MapReduce.

Figure: Interactive operations on MapReduce


Data Sharing using Spark RDD

Data sharing is slow in MapReduce due to replication, serialization, and disk I/O. Most
Hadoop applications spend more than 90% of their time doing HDFS read-write
operations.

Recognizing this problem, researchers developed a specialized framework called Apache
Spark. The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports
in-memory computation. This means that Spark stores the state of a job in memory as an
object that can be shared between jobs. Data sharing in memory
is 10 to 100 times faster than sharing over the network or disk.

Let us now try to find out how iterative and interactive operations take place in Spark
RDD.

Iterative Operations on Spark RDD

The illustration given below shows iterative operations on Spark RDD. Spark stores
intermediate results in distributed memory instead of stable storage (disk), which makes
the system faster.

Note: If the distributed memory (RAM) is not sufficient to store intermediate results (the
state of the job), then Spark will store those results on disk.

Figure: Iterative operations on Spark RDD
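
The difference between the two iterative styles can be sketched in plain Python. This is a toy comparison, not Spark or Hadoop code: the `step` function, the file name `intermediate.pkl`, and both `iterate_*` helpers are hypothetical. The disk-based loop serializes its result to a file after every iteration and reads it back, while the in-memory loop simply passes the result along.

```python
import os
import pickle
import tempfile


def step(values):
    """One iteration of a hypothetical job: halve every value and add 1."""
    return [v / 2 + 1 for v in values]


def iterate_via_disk(values, iterations, workdir):
    """MapReduce style: every intermediate result goes through stable storage."""
    disk_ops = 0
    path = os.path.join(workdir, "intermediate.pkl")
    for _ in range(iterations):
        values = step(values)
        with open(path, "wb") as f:   # serialize and write
            pickle.dump(values, f)
        with open(path, "rb") as f:   # read and deserialize
            values = pickle.load(f)
        disk_ops += 2
    return values, disk_ops


def iterate_in_memory(values, iterations):
    """RDD style: intermediate results stay in memory between iterations."""
    for _ in range(iterations):
        values = step(values)
    return values, 0


data = [0.0, 4.0, 8.0]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = iterate_via_disk(data, 5, workdir)
memory_result, memory_ops = iterate_in_memory(data, 5)

print(disk_result == memory_result)  # True: same answer either way
print(disk_ops, memory_ops)          # 10 0
```

Both loops compute the same values; the only difference is the ten extra serialize-and-read round trips, which is the overhead the text attributes to MapReduce.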

Interactive Operations on Spark RDD

This illustration shows interactive operations on Spark RDD. If different queries are run
on the same set of data repeatedly, this particular data can be kept in memory for better
execution times.

Figure: Interactive operations on Spark RDD


By default, each transformed RDD may be recomputed each time you run an action on
it. However, you may also persist an RDD in memory, in which case Spark will keep the
elements around on the cluster for much faster access the next time you query it. There
is also support for persisting RDDs on disk or replicating them across multiple nodes.
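
The effect of persisting can be sketched with a small toy class in plain Python. `ToyRDD` is an assumption made up for this illustration and is not Spark's implementation; it only borrows the method names `map`, `persist`, `count`, and `collect` to mirror the vocabulary used above. A counter records how often the transformation function actually runs.

```python
class ToyRDD:
    """Lazy dataset: remembers how to compute itself, computes on an action."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning a list
        self._persist = False
        self._cache = None

    def map(self, func):
        """Transformation: returns a new ToyRDD, nothing is computed yet."""
        parent = self
        return ToyRDD(lambda: [func(x) for x in parent._materialize()])

    def persist(self):
        """Ask for the computed elements to be kept after the first action."""
        self._persist = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        result = self._compute()
        if self._persist:
            self._cache = result
        return result

    def collect(self):
        """Action: return all elements."""
        return list(self._materialize())

    def count(self):
        """Action: return the number of elements."""
        return len(self._materialize())


calls = {"n": 0}


def slow_square(x):
    calls["n"] += 1   # track how many times the transformation runs
    return x * x


source = ToyRDD(lambda: [1, 2, 3, 4])

squares = source.map(slow_square)
squares.count()
squares.collect()
calls_without_persist = calls["n"]   # recomputed for each action

calls["n"] = 0
cached = source.map(slow_square).persist()
cached.count()
cached.collect()
calls_with_persist = calls["n"]      # computed once, then reused

print(calls_without_persist, calls_with_persist)  # 8 4
```

Without persisting, two actions trigger two full recomputations of the four elements; with persisting, the second action reuses the stored result.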

3. SPARK – INSTALLATION

Spark is often deployed alongside Hadoop, so it is best installed on a Linux-based
system. The following steps show how to install Apache Spark.

Step 1: Verifying Java Installation

A Java installation is mandatory for installing Spark. Use the following
command to verify the Java version.

$ java -version

If Java is already installed on your system, you will see a response like the following:

java version "1.7.0_71"

Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If Java is not installed on your system, install it before
proceeding to the next step.

Step 2: Verifying Scala installation

You need the Scala language to implement Spark programs, so let us verify the Scala
installation using the following command.

$ scala -version

If Scala is already installed on your system, you will see a response like the following:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

If Scala is not installed on your system, proceed to the next step to
install it.

Step 3: Downloading Scala

Download the latest version of Scala by visiting the Download Scala link. For this
tutorial, we are using version scala-2.11.6. After downloading, you will find the Scala tar
file in the download folder.


Step 4: Installing Scala

Follow the steps given below for installing Scala.

Extract the Scala tar file

Type the following command to extract the Scala tar file.

$ tar xvf scala-2.11.6.tgz

Move Scala software files

Use the following commands to move the Scala software files to their directory
(/usr/local/scala).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit

Set PATH for Scala

Use the following command to set PATH for Scala. Note that there must be no spaces around the = sign.

$ export PATH=$PATH:/usr/local/scala/bin

Verifying Scala Installation

After installation, it is better to verify it. Use the following command to verify the Scala
installation.

$ scala -version

If Scala is installed correctly, you will see a response like the following:

Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

Step 5: Downloading Apache Spark

Download the latest version of Spark by visiting the Download Spark link. For
this tutorial, we are using version spark-1.3.1-bin-hadoop2.6. After downloading it,
you will find the Spark tar file in the download folder.


Step 6: Installing Spark

Follow the steps given below for installing Spark.

Extracting Spark tar

Use the following command to extract the Spark tar file.

$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz

Moving Spark software files

Use the following commands to move the Spark software files to their directory
(/usr/local/spark).

$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit

Setting up the environment for Spark

Add the following line to the ~/.bashrc file. It adds the location of the Spark
software files to the PATH variable. Note that there must be no spaces around the = sign.

export PATH=$PATH:/usr/local/spark/bin

Use the following command for sourcing the ~/.bashrc file.

$ source ~/.bashrc

Step 7: Verifying the Spark Installation

Use the following command to open the Spark shell.

$ spark-shell

If Spark is installed successfully, you will see output like the following.

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
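
As a quick cross-check of the steps above, the following sketch looks up the three commands used in this chapter on the current PATH. It is a minimal illustration and not part of the Spark distribution; the helper name `find_tools` is an assumption. It only reports where each command resolves and does not start any of them.

```python
import shutil


def find_tools(names=("java", "scala", "spark-shell")):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_tools()
for name, path in tools.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If `scala` or `spark-shell` is reported as not found, the most likely cause is that the export line was not added correctly or that ~/.bashrc has not been sourced in the current shell.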
