Beginner Guide Spark
About ACADGILD
ACADGILD is a technology education startup that aims to create an ecosystem for skill development in
which people can learn from mentors and from each other. We believe that software development
requires highly specialized skills that are best learned with guidance from experienced practitioners.
Online videos or classroom formats are poor substitutes for building real projects with help from a
dedicated mentor. Our mission is to teach hands-on, job-ready software programming skills, globally,
in small batches of 8 to 10 students, using industry experts.
Courses offered:
JAVA FOR FRESHER
BIG DATA & HADOOP ADMINISTRATION
FULL STACK WEB DEVELOPMENT
NODE JS
FRONT END DEVELOPMENT (WITH ANGULARJS)
CLOUD COMPUTING
Disclaimer
This material is intended only for learners and not for any commercial purpose. If you are not the
intended recipient, you should not distribute or copy this material. Please notify the sender immediately.
Published by
ACADGILD,
[email protected]
What is Spark?
Apache Spark is a cluster computing framework which runs on Hadoop and handles different
types of data. It is a one-stop solution to many problems. Spark has rich resources for handling
data and, most importantly, it is 10-20x faster than Hadoop's MapReduce. It attains this speed of
computation through its in-memory primitives: the data is cached in memory (RAM), and all the
computations are performed in-memory.

Spark's rich set of components covers almost all the components of Hadoop. For example, we can
perform both batch processing and real-time data processing in Spark without using any additional
tools like Kafka or Flume from the Hadoop ecosystem; Spark has its own streaming engine called
Spark Streaming.

[Figure: the Spark stack - Spark SQL + DataFrames, Spark Streaming, MLlib (Machine Learning),
and GraphX (Graph Computation), all built on the Spark Core API.]
It has its own SQL engine called Spark SQL, which covers the features of both SQL and Hive. It has
a Machine Learning library, MLlib, which can perform machine learning without the help of Mahout.
It performs graph processing using the GraphX component.
It can be run on different cluster managers such as Hadoop YARN and Apache Mesos, and it has its
own standalone scheduler to get started when other frameworks are not available. Spark provides
easy access to data storage and can run over many storage systems, for example HDFS, HBase,
MongoDB, and Cassandra, and it can also store data in the local file system.
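As a sketch of the cluster-manager choices above, the master is selected when launching a Spark
shell; the host names and ports below are placeholders, not values from this guide:

```shell
# Standalone scheduler (Spark's own cluster manager); host/port are placeholders
spark-shell --master spark://master-host:7077
# Hadoop YARN
spark-shell --master yarn
# Apache Mesos; host/port are placeholders
spark-shell --master mesos://mesos-host:5050
# Local mode with 2 threads, for getting started on a single machine
spark-shell --master local[2]
```

The same `--master` URLs work with `spark-submit` when submitting applications.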
In Hadoop we store the data as blocks and distribute them across different DataNodes. RDDs load
the data for us and are resilient, which means they can be recomputed if a partition is lost.
After setting the path, we need to save the file and run the command below to apply all the
configurations:
source ~/.bashrc
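The path settings referred to above typically look like the following; this is a sketch assuming
Spark is extracted to /home/acadgild/spark, so adjust the path to your own installation:

```shell
# Hypothetical Spark environment settings in ~/.bashrc
# (the actual install path on your machine may differ)
export SPARK_HOME=/home/acadgild/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```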
Let's follow the below steps to start the Spark single-node cluster. Move to the sbin directory of
the Spark folder using the below command:
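The steps above can be sketched as follows, assuming the SPARK_HOME variable points at your Spark
installation folder:

```shell
# Move into Spark's sbin directory (assumes SPARK_HOME is set)
cd $SPARK_HOME/sbin
# Start the standalone master and worker daemons
./start-all.sh
```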
Now the Spark single-node cluster will start with one Master and two Workers.
You can check whether the cluster is running by using the command below:
jps
If the Master and Worker nodes are running, it means you have successfully started the Spark
single-node cluster.
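A sketch of what a healthy `jps` listing looks like; `jps` ships with the JDK, and the process IDs
below are placeholders:

```shell
jps
# Expected entries (PIDs will differ on your machine):
# 12345 Master
# 12346 Worker
# 12347 Worker
```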