
LEARN. DO. EARN.
About ACADGILD
ACADGILD is a technology education startup that aims to create an ecosystem for skill development in
which people can learn from mentors and from each other. We believe that software development
requires highly specialized skills that are best learned with guidance from experienced practitioners.
Online videos or classroom formats are poor substitutes for building real projects with help from a
dedicated mentor. Our mission is to teach hands-on, job-ready software programming skills, globally,
in small batches of 8 to 10 students, using industry experts.

ACADGILD offers courses in:

Android Development, Digital Marketing, Machine Learning with R, Big Data Analysis, Java for Freshers, Big Data & Hadoop Administration, Full Stack Web Development, Node JS, Front End Development (with AngularJS), and Cloud Computing.

Enroll in our programming course & boost your career.


Watch this short video to know more about ACADGILD.

© 2016 ACADGILD. All rights reserved.


No part of this book may be reproduced, distributed, or transmitted in any form or by any means, electronic or
mechanical methods, including photocopying, recording, or by any information storage retrieval system, without
permission in writing from ACADGILD.

Disclaimer
This material is intended only for the learners and is not intended for any commercial purpose. If you are not the
intended recipient, then you should not distribute or copy this material. Please notify the sender immediately or
click here to contact us.

Published by
ACADGILD,
[email protected]



In this eBook, we will discuss the basics of Spark’s functionality and its installation.

What is Spark?
Apache Spark is a cluster computing framework that can run on Hadoop and handle different types of data. It is a one-stop solution to many problems. Spark has rich resources for handling data and, most importantly, it is 10-20x faster than Hadoop’s MapReduce. It attains this speed of computation through its in-memory primitives: the data is cached in memory (RAM) and all computations are performed in-memory.

Spark’s rich set of components covers almost all the components of Hadoop. For example, we can perform both batch processing and real-time data processing in Spark without using additional tools from the Hadoop ecosystem such as Kafka or Flume; it has its own streaming engine called Spark Streaming.

[Diagram: the Spark stack, with Spark SQL + DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph computation), all built on the Spark Core API.]



We can perform various functions with Spark:

SQL operations: Spark has its own SQL engine called Spark SQL, which covers the features of both SQL and Hive.

Machine Learning: Spark has a machine learning library, MLlib, so it can perform machine learning without the help of Mahout.

Graph processing: Spark performs graph processing using the GraphX component.

All the above features are built into Spark.

Spark can be run on different types of cluster managers, such as the Hadoop YARN framework and the Apache Mesos framework, and it has its own standalone scheduler to get started if the other frameworks are not available. Spark provides easy access to data storage and can work with many storage systems, for example HDFS, HBase, MongoDB, and Cassandra, and it can also store data in the local file system.

Resilient Distributed Datasets


A Resilient Distributed Dataset (RDD) is a simple and immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In Spark, all operations are performed on RDDs.

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.



Let’s now look at the features of Resilient Distributed Datasets:

01. In Hadoop, we store data as blocks on different data nodes. In Spark, instead of following this approach, we make partitions of the RDDs and store them on worker nodes (datanodes), where they are computed in parallel across all the nodes.

02. In Hadoop, we need to replicate the data for fault recovery, but in the case of Spark, replication is not required because lost RDD partitions can be recomputed.

03. RDDs load the data for us and are resilient, which means they can be recomputed.

04. RDDs support two types of operations: transformations, which create a new dataset from a previous RDD, and actions, which return a value to the driver program after performing a computation on the dataset. A short sketch of both is given after this list.

05. RDDs keep track of the transformations applied to them and check them periodically. If a node fails, the lost RDD partitions can be rebuilt on the other nodes, in parallel.
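To make the difference between transformations and actions concrete, here is a minimal sketch of what could be typed into the spark-shell (Scala) once Spark is installed; the file name sample.txt is only a hypothetical example:

val lines = sc.textFile("sample.txt")          // creates an RDD from a file (nothing is computed yet)
val longLines = lines.filter(_.length > 20)    // transformation: lazily builds a new RDD
val count = longLines.count()                  // action: triggers the computation and returns a value to the driver

Nothing is actually read or filtered until the action count() is called; this laziness is what lets Spark track the chain of transformations described in point 05.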



RDDs can be created in two different ways:

1. By referencing an external dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

2. By parallelizing a collection of objects (a list or a set) in the driver program.

Both approaches are sketched briefly below.
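As a small illustration (again typed into the spark-shell, Scala), the two ways of creating an RDD might look like the following; the file path is only an assumed example:

val fileRDD = sc.textFile("/home/acadgild/work/sparkdata/input.txt")   // 1. referencing an external dataset
val listRDD = sc.parallelize(List(1, 2, 3, 4, 5))                      // 2. parallelizing a collection from the driver program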

Step-by-step process to install Spark


Before installing Spark, Scala needs to be installed on the system. We need to follow the steps below to install Scala.

1. Open the terminal in your CentOS machine.

To download Scala, type the below command:

wget http://downloads.typesafe.com/scala/2.11.1/scala-2.11.1.tgz



2. Extract the downloaded tar file using the below command:

tar -xvf scala-2.11.1.tgz

3. After extracting, specify the path of Scala in the .bashrc file.
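The exact lines depend on where the archive was extracted; assuming Scala was extracted in the home directory of the acadgild user, the .bashrc entries might look like this:

export SCALA_HOME=/home/acadgild/scala-2.11.1
export PATH=$PATH:$SCALA_HOME/bin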

After setting the path, we need to save the file and type the below command to apply the configuration:

source .bashrc

This completes the Scala installation. We then need to install Spark.



To install Spark in CentOS, we need to follow the below steps to download and install a single-node Spark cluster.

1. Open the browser, go to the Spark download link, and download spark-1.5.1-bin-hadoop2.6.tgz.

The file will be downloaded into the Downloads folder.
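If you prefer the command line over the browser, the same archive can usually also be fetched from the Apache archive (this URL is an assumption and may differ from the link in the original guide):

wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz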


Go to the Downloads folder and untar the downloaded file using the below command:

tar -xvf spark-1.5.1-bin-hadoop2.6.tgz

2. After untarring the file, we need to move the extracted folder to the home folder using the below command:

sudo mv spark-1.5.1-bin-hadoop2.6 /home/acadgild

The above command moves the folder to the home directory.


We need to update the path for Spark in .bashrc in the same way as we did for Scala.

3. Refer to the given screenshot for updating the path in .bashrc.
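The screenshot is not reproduced in this copy; assuming Spark was moved to /home/acadgild in the previous step, the .bashrc entries would typically look like this:

export SPARK_HOME=/home/acadgild/spark-1.5.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin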



4. After adding the path for Spark, type the command source .bashrc (refer to the screenshot for the same).

5. Make a folder named ‘work’ in HOME using the below command:
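The command itself is not reproduced in this copy; a typical command to create the folder in HOME would be:

mkdir ~/work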

6. Inside the work folder, we need to make another folder named ‘sparkdata’ using the below command:
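Again, the command is not reproduced here; it would typically be:

mkdir ~/work/sparkdata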

We need to give the sparkdata folder 777 permissions using the below command:
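A typical command for this, assuming the paths from the previous steps, would be:

chmod 777 ~/work/sparkdata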

7. Now move into the conf directory of the Spark folder using the below command:
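Assuming Spark was moved to /home/acadgild in step 2, the command would look like:

cd /home/acadgild/spark-1.5.1-bin-hadoop2.6/conf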



Type the command ls to see the files inside the conf folder. There will be a file named spark-env.sh.template; we need to copy that file as spark-env.sh and then edit spark-env.sh using the below commands:
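The commands are not reproduced in this copy; they would typically be:

cp spark-env.sh.template spark-env.sh
vi spark-env.sh

Since the guide later mentions one Master and two Workers, spark-env.sh would typically be given entries along these lines (the values are assumptions):

export JAVA_HOME=/usr/lib/jvm/java        # path to your JDK installation
export SPARK_WORKER_INSTANCES=2           # run two Worker instances on this node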

Let’s follow the below steps to start the Spark single-node cluster. Move to the sbin directory of the Spark folder using the below command:
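The command is not reproduced here; assuming you are still inside the conf directory, it would be:

cd ../sbin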



Inside sbin, type the below command to start the Master and Worker (slave) daemons.
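The usual way to start both the Master and the Worker daemons of a standalone Spark cluster from the sbin directory is:

./start-all.sh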

Now the Spark single-node cluster will start with one Master and two Workers.

You can check whether the cluster is running by using the below command:

jps

If the Master and Worker daemons are listed, it means you have successfully started the Spark single-node cluster.

We hope this eBook helped you gain a basic understanding of Spark and the ways to install it.

