
LEARN. DO. EARN.
About ACADGILD
ACADGILD is a technology education startup that aims to create an ecosystem for skill development in
which people can learn from mentors and from each other. We believe that software development
requires highly specialized skills that are best learned with guidance from experienced practitioners.
Online videos or classroom formats are poor substitutes for building real projects with help from a
dedicated mentor. Our mission is to teach hands-on, job-ready software programming skills, globally,
in small batches of 8 to 10 students, using industry experts.

ACADGILD offers courses in:

Android Development, Digital Marketing, Machine Learning with R, Big Data Analysis, Java for Freshers, Big Data & Hadoop Administration, Full Stack Web Development, Node JS, Front End Development (with AngularJS), and Cloud Computing.

Enroll in our programming course & boost your career.


Watch this short video to know more about ACADGILD.

© 2016 ACADGILD. All rights reserved.


No part of this book may be reproduced, distributed, or transmitted in any form or by any means, electronic or
mechanical methods, including photocopying, recording, or by any information storage retrieval system, without
permission in writing from ACADGILD.

Disclaimer
This material is intended only for the learners and is not intended for any commercial purpose. If you are not the
intended recipient, then you should not distribute or copy this material. Please notify the sender immediately or
click here to contact us.

Published by
ACADGILD,
[email protected]



In this eBook, we will discuss the basics of Spark’s functionality and its installation.

What is Spark?
Apache Spark is a cluster computing framework that can run on Hadoop and handle different types of data. It is a one-stop solution to many problems. Spark has rich resources for handling data and, most importantly, it is 10-20x faster than Hadoop’s MapReduce. It attains this speed of computation through its in-memory primitives: the data is cached in memory (RAM) and all computations are performed in-memory.

Spark’s rich set of components covers almost all the components of Hadoop. For example, we can perform both batch processing and real-time data processing in Spark without using additional tools from the Hadoop ecosystem such as Kafka or Flume; it has its own streaming engine called Spark Streaming.

[Diagram: the Spark stack, with Spark SQL + DataFrames, Spark Streaming, MLlib (machine learning), and GraphX (graph computation), all built on the Spark Core API.]



We can perform various functions with Spark:

SQL operations: Spark has its own SQL engine called Spark SQL, which covers the features of both SQL and Hive.

Machine Learning: Spark has a machine learning library, MLlib, so it can perform machine learning without the help of Mahout.

Graph processing: Spark performs graph processing using the GraphX component.

All the above features are built into Spark.

Spark can be run on different types of cluster managers, such as the Hadoop YARN framework and the Apache Mesos framework, and it has its own standalone scheduler to get started if the other frameworks are not available. Spark provides easy access to data storage and can work with many storage systems, for example HDFS, HBase, MongoDB, and Cassandra, and it can also store data in the local file system.

Resilient Distributed Datasets


A Resilient Distributed Dataset (RDD) is a simple and immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. In Spark, all operations are performed on RDDs.

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.



Let’s now look at the features of Resilient Distributed Datasets:

01. In Hadoop, we store data as blocks on different data nodes. In Spark, instead of following this approach, we make partitions of the RDDs and store them on worker nodes (datanodes), where they are computed in parallel across all the nodes.

02. In Hadoop, we need to replicate the data for fault recovery, but in the case of Spark, replication is not required because lost RDD partitions can be recomputed.

03. RDDs load the data for us and are resilient, which means they can be recomputed.

04. RDDs support two types of operations: transformations, which create a new dataset from a previous RDD, and actions, which return a value to the driver program after performing a computation on the dataset. A short sketch of both is given after this list.

05. RDDs keep track of the transformations applied to them and check them periodically. If a node fails, the lost RDD partitions can be rebuilt on the other nodes, in parallel.
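To make the difference between transformations and actions concrete, here is a minimal sketch of what could be typed into the spark-shell (Scala) once Spark is installed; the file name sample.txt is only a hypothetical example:

val lines = sc.textFile("sample.txt")          // creates an RDD from a file (nothing is computed yet)
val longLines = lines.filter(_.length > 20)    // transformation: lazily builds a new RDD
val count = longLines.count()                  // action: triggers the computation and returns a value to the driver

Nothing is actually read or filtered until the action count() is called; this laziness is what lets Spark track the chain of transformations described in point 05.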



RDDs can be created in two different ways:

1. By referencing an external dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

2. By parallelizing a collection of objects (a list or a set) in the driver program.

Both approaches are sketched briefly below.
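As a small illustration (again typed into the spark-shell, Scala), the two ways of creating an RDD might look like the following; the file path is only an assumed example:

val fileRDD = sc.textFile("/home/acadgild/work/sparkdata/input.txt")   // 1. referencing an external dataset
val listRDD = sc.parallelize(List(1, 2, 3, 4, 5))                      // 2. parallelizing a collection from the driver program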

Step-by-step process to install Spark


Before installing Spark, Scala needs to be installed on the system. We need to follow the steps below to install Scala.

1. Open the terminal in your CentOS machine.

To download Scala, type the below command:

wget http://downloads.typesafe.com/scala/2.11.1/scala-2.11.1.tgz



2. Extract the downloaded tar file using the below command:

tar -xvf scala-2.11.1.tgz

3. After extracting, specify the path of Scala in the .bashrc file.
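The exact lines depend on where the archive was extracted; assuming Scala was extracted in the home directory of the acadgild user, the .bashrc entries might look like this:

export SCALA_HOME=/home/acadgild/scala-2.11.1
export PATH=$PATH:$SCALA_HOME/bin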

After setting the path, we need to save the file and type the below command to apply the configuration:

source .bashrc

This completes the Scala installation. We then need to install Spark.



To install Spark in CentOS, we need to follow the below steps to download and install a single-node Spark cluster.

1. Open the browser, go to the Spark download link, and download spark-1.5.1-bin-hadoop2.6.tgz.

The file will be downloaded into the Downloads folder.
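If you prefer the command line over the browser, the same archive can usually also be fetched from the Apache archive (this URL is an assumption and may differ from the link in the original guide):

wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz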


Go to the Downloads folder and untar the downloaded file using the below command:

tar -xvf spark-1.5.1-bin-hadoop2.6.tgz

2. After untarring the file, we need to move the extracted folder to the home folder using the below command:

sudo mv spark-1.5.1-bin-hadoop2.6 /home/acadgild

The above command moves the folder to the home directory.


We need to update the path for Spark in .bashrc in the same way as we did for Scala.

3. Refer to the given screenshot for updating the path in .bashrc.
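The screenshot is not reproduced in this copy; assuming Spark was moved to /home/acadgild in the previous step, the .bashrc entries would typically look like this:

export SPARK_HOME=/home/acadgild/spark-1.5.1-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin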



4. After adding the path for Spark, type the command source .bashrc (refer to the screenshot for the same).

5. Make a folder named ‘work’ in HOME using the below command:
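The command itself is not reproduced in this copy; a typical command to create the folder in HOME would be:

mkdir ~/work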

6. Inside the work folder, we need to make another folder named ‘sparkdata’ using the below command:
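Again, the command is not reproduced here; it would typically be:

mkdir ~/work/sparkdata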

We need to give the sparkdata folder 777 permissions using the below command:
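A typical command for this, assuming the paths from the previous steps, would be:

chmod 777 ~/work/sparkdata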

7. Now move into the conf directory of the Spark folder using the below command:
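Assuming Spark was moved to /home/acadgild in step 2, the command would look like:

cd /home/acadgild/spark-1.5.1-bin-hadoop2.6/conf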



Type the command ls to see the files inside the conf folder. There will be a file named spark-env.sh.template; we need to copy that file as spark-env.sh and then edit spark-env.sh using the below commands:
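The commands are not reproduced in this copy; they would typically be:

cp spark-env.sh.template spark-env.sh
vi spark-env.sh

Since the guide later mentions one Master and two Workers, spark-env.sh would typically be given entries along these lines (the values are assumptions):

export JAVA_HOME=/usr/lib/jvm/java        # path to your JDK installation
export SPARK_WORKER_INSTANCES=2           # run two Worker instances on this node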

Let’s follow the below steps to start the Spark single-node cluster. Move to the sbin directory of the Spark folder using the below command:
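The command is not reproduced here; assuming you are still inside the conf directory, it would be:

cd ../sbin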



Inside sbin, type the below command to start the Master and Worker (slave) daemons.
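The usual way to start both the Master and the Worker daemons of a standalone Spark cluster from the sbin directory is:

./start-all.sh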

Now the Spark single-node cluster will start with one Master and two Workers.

You can check whether the cluster is running by using the below command:

jps

If the Master and Worker daemons are listed, it means you have successfully started the Spark single-node cluster.

We hope this eBook helped you gain a basic understanding of Spark and the ways to install it.

