
Apache Spark Tutorial (Fast Data Architecture Series)

by Bill Ward · Jul. 13, 18 · Big Data Zone · Tutorial


Continuing the Fast Data Architecture Series, this article focuses on Apache Spark. In this Apache Spark tutorial, we will learn what Spark is and why it is important for Fast Data Architecture. We will install Spark on our Mesos cluster and run a sample Spark application.

Previous articles in this series:

1. Installing Apache Mesos 1.6.0 on Ubuntu 18.04
2. Kafka Tutorial for Fast Data Architecture
3. Kafka Python Tutorial for Fast Data Architecture

Video Introduction
Check out my Apache Spark Tutorial Video on YouTube:

5 Minute Spark Tutorial


What Is Apache Spark?


Apache Spark is a unified computing engine and a collection of libraries that help data scientists analyze big data. Unified means that Spark aims to support many different data analysis tasks, ranging from SQL queries to machine learning and graph processing. Before Spark, Hadoop MapReduce was the dominant player in data analysis platforms. Spark was developed to remedy some of the issues that were identified with Hadoop MapReduce.

There are several components to Spark that make up this unified computing engine model, as outlined in the Apache Spark documentation:

1. Spark SQL - A library that allows data scientists to analyze data using simple SQL queries.
2. Spark Streaming - A library that processes streaming data in near real-time using micro-batches.
3. MLlib - A machine learning library for Spark.
4. GraphX - A library that adds graph processing functionality to Spark.
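
To make the first of these concrete, here is a minimal PySpark sketch of the Spark SQL component. It is my own illustration, not from this tutorial: it assumes Spark and the pyspark package are installed locally, and the table contents and names are made up.

# Minimal Spark SQL sketch. Assumes a local Spark/pyspark install;
# the data and names here are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSqlDemo").getOrCreate()

# Build a small in-memory DataFrame and register it as a SQL view.
people = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
people.createOrReplaceTempView("people")

# Analyze the data with a plain SQL query, as Spark SQL allows.
spark.sql("SELECT name FROM people WHERE age > 30").show()

spark.stop()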

We will cover these in much more detail in future articles. For now, we will install Apache Spark on the Mesos cluster that we set up in previous articles in the Fast Data Architecture Series.

Apache Spark Architecture


Spark consists of a driver program that manages the Spark application. Driver programs can be written in many languages, including Python, Scala, Java, and R. The driver program splits the application into tasks and schedules them to run on executors. You can run Spark applications on many cluster managers, including Apache Mesos and Kubernetes, or in standalone mode. In this Apache Spark tutorial, we will deploy the Spark driver program to a Mesos cluster and run an example application that comes with Spark to test it.

The Spark driver program schedules work on Spark executors. Executors actually carry
out the work of the Spark application. In our Mesos environment, these executors are
scheduled on Mesos nodes and are short-lived. They are created, carry out their assigned
tasks, report their status back to the driver, and then they are destroyed.
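
As a concrete illustration of this split, here is a minimal, hypothetical PySpark driver program (assuming a local pyspark install; the names are mine, not from this tutorial). Everything below runs in the driver except the lambdas, which Spark ships to the executors.

# A sketch of the driver/executor split. Assumes pyspark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DriverExecutorDemo").getOrCreate()
sc = spark.sparkContext

# The driver splits this range into 4 partitions (tasks) and schedules
# them on executors; the lambdas below execute on the executors.
total = (sc.parallelize(range(1, 1001), 4)
           .map(lambda x: x * x)
           .reduce(lambda a, b: a + b))

# reduce() brings the final result back to the driver.
print("Sum of squares of 1..1000:", total)
spark.stop()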

Install Apache Spark


Run the following on your Mesos Masters and all your Mesos Slaves. You will also want to
install this on your local development system.

$ wget http://apache.claz.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
$ mkdir -p /usr/local/spark && tar -xvf spark-2.3.1-bin-hadoop2.7.tgz -C /usr/local/spark --strip-components=1
$ chown root:root -R /usr/local/spark/

This creates a binary installation of Apache Spark that we can use to deploy Spark applications on Mesos. In the next section, we will create a systemd service that will run our Spark cluster dispatcher.

Create a systemd Service Definition


When we have Spark deployed in cluster mode on a Mesos cluster, we need the Spark dispatcher running to schedule our Spark applications on Mesos. In this section, we will create a systemd service definition that we will use to manage the Spark dispatcher as a service.

On one of your Mesos masters, create a new file, /etc/systemd/system/spark.service, and add the following contents:

[Unit]
Description=Spark Dispatcher Service
After=mesos-master.service
Requires=mesos-master.service

[Service]
Environment=MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
ExecStart=/usr/local/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher --master mesos://192.168.1.30:5050

[Install]
WantedBy=multi-user.target

This file configures the Spark dispatcher service to start after the mesos-master service. Also, notice that we specify the IP address and port of our Mesos master; be sure that yours reflects the actual IP address and port of your Mesos master. Now we can enable the service and start it:

# systemctl daemon-reload
# systemctl start spark.service
# systemctl enable spark.service

You can make sure it is running using this command:

# systemctl status spark.service

If everything is working correctly, you will see the service reported as active (running). The next part of our Apache Spark tutorial is to test our Spark deployment!

Testing Spark
Now that we have our Spark dispatcher service running on our Mesos cluster, we can test it by running an example job. We will use an example that comes with Spark that calculates Pi for us.

$ /usr/local/spark/bin/spark-submit --name SparkPiTestApp --class org.apache.spark.examples.SparkPi --master mesos://192.168.1.30:7077 --deploy-mode cluster --executor-memory 1G --total-executor-cores 30 /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar

You will see that our example is scheduled:


2018-07-11 16:44:12 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-11 16:44:12 INFO  RestSubmissionClient:54 - Submitting a request to launch an application in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submission successfully created as driver-20180711164412-0001. Polling submission state...
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Submitting a request for the status of submission driver-20180711164412-0001 in mesos://192.168.1.30:7077.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - State of driver driver-20180711164412-0001 is now QUEUED.
2018-07-11 16:44:13 INFO  RestSubmissionClient:54 - Server responded with CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "2.3.1",
  "submissionId" : "driver-20180711164412-0001",
  "success" : true
}
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Shutdown hook called
2018-07-11 16:44:13 INFO  ShutdownHookManager:54 - Deleting directory /tmp/spark-4edf5319-8ff1-45bc-
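
As an aside, spark-submit is talking to the dispatcher's REST submission API here, so you can also poll the driver state yourself. The sketch below is my own illustration, not from this tutorial; it assumes the /v1/submissions/status endpoint that the RestSubmissionClient output above is using, with the host, port, and submission ID taken from this example.

# Hedged sketch: poll the Spark dispatcher's REST API for a driver's state.
# Host, port, and submission ID are the ones from the output above.
import json
from urllib.request import urlopen

dispatcher = "http://192.168.1.30:7077"
submission_id = "driver-20180711164412-0001"

with urlopen(f"{dispatcher}/v1/submissions/status/{submission_id}") as resp:
    status = json.load(resp)

# driverState is e.g. QUEUED, RUNNING, FINISHED, or FAILED.
print(status.get("driverState"))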

To see the output, we need to look at the sandbox for our job in Mesos. Go to your Mesos web interface at http://{mesos-ip}:5050. You should see a task named Driver for SparkPiTestApp under Completed Tasks; that is our job.


Click on the Sandbox link for our job, then click on the stdout link to see the logging for
our application. You will see that it calculated Pi for us.

2018-07-11 16:44:20 INFO  DAGScheduler:54 - ResultStage 0 (reduce at SparkPi.scala:38) finished in 3
2018-07-11 16:44:20 INFO  DAGScheduler:54 - Job 0 finished: reduce at SparkPi.scala:38, took 3.55147
Pi is roughly 3.1418855141885516
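
For the curious, SparkPi estimates Pi with a Monte Carlo method: it scatters random points in the unit square and counts how many fall inside the quarter circle. Here is a rough PySpark equivalent of that idea, a sketch of my own rather than Spark's actual example source, assuming a local pyspark install.

# Monte Carlo Pi estimate, mirroring what the SparkPi example does.
import random
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySparkPi").getOrCreate()
sc = spark.sparkContext

def hit(_):
    # Sample a random point in the unit square; count it if it falls
    # inside the quarter circle of radius 1.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

n = 1000000  # total samples; more samples give a better estimate
count = sc.parallelize(range(n), 10).map(hit).reduce(lambda a, b: a + b)

# Ratio of areas: (quarter circle) / (unit square) = pi / 4.
print("Pi is roughly", 4.0 * count / n)
spark.stop()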

Conclusion
This Apache Spark tutorial simply demonstrated how to get Apache Spark installed. The true power of Spark lies in the APIs it provides for writing powerful analytical applications that process your raw data and produce meaningful results you can use to make real-world business decisions. Don't miss the next several articles, where we cover how to write Spark applications using the Python API and continue our exploration of the SMACK stack. If you haven't already, please sign up for my weekly newsletter so you will get updates when I release new articles. Thanks for reading this tutorial. If you liked it or hated it, then please leave a comment below.


Topics: BIG DATA, APACHE SPARK, DATA ARCHITECTURE, TUTORIAL

Published at DZone with permission of Bill Ward, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.