Apache Spark Tutorial (Fast Data Architecture Series)
Continuing the Fast Data Architecture Series, this article focuses on Apache Spark. In this Apache Spark tutorial, we will learn what Spark is and why it is important for Fast Data Architecture. We will install Spark on our Mesos cluster and run a sample Spark application.
Video Introduction
Check out my Apache Spark Tutorial video on YouTube.
There are several components to Spark that make up its unified computing engine, as outlined in the Apache Spark documentation.
1. Spark SQL - A library that allows data scientists to analyze data using simple SQL queries.
2. Spark Streaming - A library that processes streaming data in near real-time using micro-batches.
3. MLlib - A machine learning library for Spark.
4. GraphX - A library that adds graph processing functionality to Spark.
We will be covering these in much more detail in future articles. For now, we will install Apache Spark on the Mesos cluster that we set up in previous articles in the Fast Data Architecture Series.
The Spark driver program schedules work on Spark executors. Executors actually carry
out the work of the Spark application. In our Mesos environment, these executors are
scheduled on Mesos nodes and are short-lived. They are created, carry out their assigned
tasks, report their status back to the driver, and then they are destroyed.
Installing Spark
First, download the Spark binary distribution and extract it to /usr/local/spark:
$ wget http://apache.claz.org/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
$ mkdir -p /usr/local/spark && tar -xvf spark-2.3.1-bin-hadoop2.7.tgz -C /usr/local/spark --strip-components=1
$ chown root:root -R /usr/local/spark/
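As a quick sanity check, you can ask the freshly extracted spark-submit for its version (a simple smoke test; nothing here is Mesos-specific yet):
$ /usr/local/spark/bin/spark-submit --version
It should print a Spark 2.3.1 banner. If the binary is not found, the tar --strip-components step above likely did not flatten the archive as intended.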
These commands give us a binary installation of Apache Spark that we can use to deploy our Spark applications on Mesos. In the next section, we will create a systemd service that will run our Spark cluster dispatcher.
[Unit]
Description=Spark Dispatcher Service
After=mesos-master.service
Requires=mesos-master.service

[Service]
Environment=MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
# Replace {mesos-master-ip} with your Mesos master's address
ExecStart=/usr/local/spark/bin/spark-class org.apache.spark.deploy.mesos.MesosClusterDispatcher --master mesos://{mesos-master-ip}:5050

[Install]
WantedBy=multi-user.target
This file configures the Spark dispatcher service to start up after the mesos-master service. Also, notice that we specify the IP and port of our Mesos master. Be sure that yours reflects your actual IP address and port for your Mesos master. Now we can enable the service and start it:
# systemctl daemon-reload
# systemctl start spark.service
# systemctl enable spark.service
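To verify, check the unit's status (this assumes the unit file above was saved as /etc/systemd/system/spark.service, which is what the service name in these commands implies):
# systemctl status spark.service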
If everything is working correctly, you will see the service reported as started and active. The next part of our Apache Spark tutorial is to test our Spark deployment!
Testing Spark
Now that we have our Spark dispatcher service running on our Mesos cluster, we can test it by running an example job. We will use the SparkPi example that ships with Spark, which calculates Pi for us.
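A submission along the following lines will run SparkPi in cluster mode through the dispatcher. Treat this as a sketch rather than a definitive command: it assumes the install path used above, the dispatcher's default port of 7077, and that the examples jar is present at the same path on every Mesos agent (if it isn't, host the jar over HTTP so the cluster can fetch it). The --name value matches the task name referenced below.
$ /usr/local/spark/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master mesos://{spark-dispatcher-ip}:7077 \
    --deploy-mode cluster \
    --name SparkPiTestApp \
    /usr/local/spark/examples/jars/spark-examples_2.11-2.3.1.jar 100
Here 100 is the number of partitions SparkPi uses to sample points; any small integer works.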
To see the output, we need to look at the Sandbox for our job in Mesos. Go to your Mesos web interface, http://{mesos-ip}:5050. You should see a task named Driver for SparkPiTestApp under Completed Tasks; that is our job.
Click on the Sandbox link for our job, then click on the stdout link to see the logging for
our application. You will see that it calculated Pi for us.
Conclusion
This Apache Spark tutorial simply demonstrated how to get Apache Spark installed. The
true power of Spark lies in the APIs that it provides to write powerful analytical
applications to process your raw data and provide meaningful results that you can use to
make real-world business decisions. Don't miss out on the next several articles where we
cover how to write Spark applications using the Python API and continue in our
exploration of the SMACK Stack. If you haven't already, please sign up for my weekly newsletter so you will get updates when I release new articles. Thanks for reading this tutorial. If you liked it or hated it, please leave a comment below.
Published at DZone with permission of Bill Ward, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.