Installing Spark
Starting with Apache Spark can be intimidating. However, after you have gone
through the process of installing it on your local machine, in hindsight, it will not
look so scary.
In this chapter, we will guide you through the requirements of Spark 2.0, the
installation process of the environment itself, and the setup of the Jupyter
notebook so that it is convenient and easy to write your code.
The topics covered are:
• Requirements
• Installing Spark
• Jupyter on PySpark
• Installing in the cloud
Requirements
Before we begin, let's make sure your computer is ready for the Spark installation.
What you need is Java 7+ and Python 2.6+/3.4+. Spark also requires R 3.1+ if you
want to run R code. For the Scala API, Spark 2.0 uses Scala 2.11, so you will need
to use a compatible Scala version (2.11.x).
Spark installs Scala during the installation process, so we just need to make sure that
Java and Python are present on your machine.
Checking for presence of Java and Python
If you are not sure whether you have Java installed (or simply do not know where it
is), you can try locating the Java binaries. On Linux, you can try executing the
following command:
locate java
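If you prefer a more direct check (our own suggestion rather than a step from the original text), you can ask both tools for their versions from the Terminal:
java -version
python --version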
If you have Python installed, the python --version command should print out its
version in the Terminal. In our case, this is:
Python 3.5.1 :: Anaconda 2.4.1 (x86_64)
If, however, you do not have Python, you will have to install a compatible version on
your machine (see the following section, Installing Python).
Installing Java
It goes beyond the scope of this book to provide detailed instructions on how you
should install Java. However, it is a fairly straightforward process and the high-level
steps you need to undertake are:
1. Go to https://www.java.com/en/download/mac_download.jsp and
download the version appropriate for your system.
2. Once downloaded, follow the instructions to install on your machine.
Check https://www.java.com/en/download/help/ie_online_install.xml for
steps outlining the installation process on Windows.
Installing Python
Our preferred flavor of Python is Anaconda (provided by Continuum) and we
strongly recommend this distribution. The package comes with all the necessary and
most commonly used modules included (such as pandas, NumPy, SciPy, or scikit-learn,
among many others). If a module you want to use is not present, you can quickly
install it using the conda package management system.
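For example, if a package you need were missing (pandas ships with Anaconda by default, so this is purely illustrative), you could add or update it with:
conda install pandas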
The Anaconda environment can be downloaded from https://www.continuum.io/downloads.
Check the correct version for your operating system and follow the instructions
presented to install the distribution.
Once downloaded, follow the installer's instructions for your operating system.
Once both of the environments are installed, repeat the steps from the preceding
section, Checking for presence of Java and Python. Everything should work now.
To learn what bash is, check out the following link:
https://www.gnu.org/software/bash/manual/html_node/What-is-Bash_003f.html
To make Java and Python easily accessible from the command line, we need to add
their locations to the PATH environment variable; on Mac and Linux, this can be done
by editing the bash startup file, ~/.bash_profile. We will use the vi text editor in
the CLI to do this, but you are free to choose a text editor of your liking:
vi ~/.bash_profile
We need to add a couple of lines, preferably at the end of the file. If you are using vi,
press the I key on your keyboard (that will initiate the edit mode in vi), navigate to
the end of the file, and insert a new line by hitting the Enter key. Starting on the new
line, add the following two lines to your file (for Mac):
export PATH=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin:$PATH
export PATH=/Library/Frameworks/Python.framework/Versions/3.5/bin/:$PATH
On Linux:
export PATH=/usr/lib/jvm/java-8-sun-1.8.0.40/jre/bin/java:$PATH
export PATH=$HOME/anaconda/bin:$PATH
Note that we are referring to Java 1.8 update 40 and Python 3.5 in the preceding
paths; adjust them to match the versions installed on your machine.
Once you are done typing, hit the Esc key and type the following command:
:wq
Once you hit the Enter key, vi will write and quit.
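To have the new settings take effect in your current Terminal session, you can reload the profile and re-check that both tools are found (this verification step is our suggestion and is not spelled out in the original text):
source ~/.bash_profile
java -version
python --version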
Changing PATH on Windows
On Windows, open the System Properties dialog (for example, right-click on Computer,
select Properties, and then Advanced system settings). Click on Environment Variables
and, in the System variables list, search for Path. Once found, click on Edit:
In the new window that opens, click New and then Browse. Navigate to
C:\Program Files (x86)\Java\jre1.8.0_91\bin and click OK:
Once this is done, continue clicking the OK button until the System window closes.
Installing Spark
Your machine is now ready to install Spark. You can do this in one of three ways:
1. Download the source code and compile the environment yourself; this gives
you the most flexibility.
2. Download pre-built binaries.
3. Install the PySpark libraries through pip (see here: http://bit.ly/2ivVhbH,
and the short example after this list).
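As an illustration of the third option (a sketch on our part; whether a pip-installable PySpark package exists depends on the Spark version you target, so check the preceding link first):
pip install pyspark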
The following instructions for Mac and Linux guide you through the first way. We
will show you how to configure your Windows machine while showcasing the
second option of installing Spark.
Check https://developer.apple.com/library/mac/documentation/Darwin/Conceptual/KernelProgramming/Architecture/Architecture.html
or http://www.ee.surrey.ac.uk/Teaching/Unix/unixintro.html for more
information if you feel so inclined.
Go to the downloads page at http://spark.apache.org/downloads.html and:
1. Choose a Spark release: 2.1.0 (Dec 28, 2016). Note that at the time you read
this, the version might be different; simply select the latest available release.
2. Choose a package type: Source code.
3. Choose a download type: Direct download.
4. Click on the link next to Download Spark: it should state something similar
to spark-2.1.0.tgz.
Once the download finishes, go to your CLI and navigate to the folder you have
downloaded the file to; in our case it is ~/Downloads/:
cd ~/Downloads
The tilde sign ~ denotes your home folder on both Mac and Linux.
To confirm the authenticity and completeness of the file, in your CLI, type the
following (on Mac):
md5 spark-2.1.0.tgz
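On Linux, the equivalent command is md5sum (this alternative is our addition; the original shows the Mac command only):
md5sum spark-2.1.0.tgz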
You can then compare the output with the corresponding MD5 checksum provided by Spark:
http://www.apache.org/dist/spark/spark-2.1.0/spark-2.1.0.tgz.md5
Next, we need to unpack the archive. This can be achieved with the following
command:
tar -xvf spark-2.1.0.tgz
The -xvf options of the tar command make it extract the archive (the x part) and
produce a verbose output (the v option) from the file (the f part) that we specified.
We will be building Spark with Maven and sbt, which we will later use to package
up our applications deployed in the cloud.
Maven is a build automation system that is used to install Spark. You can
read more at https://maven.apache.org. sbt stands for Scala build tool
and it is an incremental compiler for Scala. Scala is a scalable
programming language (that is where its name comes from: Scalable
Language). Code written in Scala compiles to Java bytecode, so it
can run in the Java Virtual Machine (JVM). For more information, check
out the following link: http://www.scala-lang.org/what-is-scala.html.
Installing with Maven
You do not need to install Maven explicitly, as the Spark source code ships with mvn
located in the build folder; that will get us started.
First, we need to change the default memory settings for Maven so that it allocates
more memory for the installer to use. You attain this by executing the following
command in your CLI (everything in one line):
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Also, you need to have the JAVA_HOME system variable specified properly and
pointing to where your Java JDK distribution is installed. This can be done with the
help of the following command on Mac:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home
Or the following command on Linux:
export JAVA_HOME=/usr/lib/jvm/open-jdk
Note that your distribution locations might be different, so you will have
to adapt the preceding commands to your system.
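Before starting the build, it is worth double-checking that the variable points at a working JDK; this quick verification is our suggestion rather than a step from the original text:
echo $JAVA_HOME
$JAVA_HOME/bin/java -version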
Having the Maven options and the JAVA_HOME environment variable set, we can
proceed to build Spark.
We will build Spark with Hadoop 2.7 and Hive. Execute the following command in
your CLI (again, everything in one line):
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive
-Phive-thriftserver -DskipTests clean package
Once you issue the preceding command, the installer will download Zinc, Scala,
and Maven.
If everything goes well, you will see the build progress printed to the screen for
each module and, at the end, a final summary confirming that the build succeeded.
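The next step in the original walkthrough is to run the PySpark test suite from the root of the unpacked source tree. The exact invocation is not reproduced in this extract, so treat the following as a sketch of the full-suite run (the script ships with the Spark sources):
./python/run-tests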
The preceding commands will clean up the installation and run the tests of all the
modules of PySpark (since we execute the run-tests script inside the python folder).
If you want to test only a specific module of PySpark, you can use the following command:
./python/run-tests --python-executables=python --modules=pyspark-sql
Once the tests finish, assuming all went well, you should see a summary reporting
that all the tests passed.
First run
Now you can try running pyspark. Let's navigate to the bin folder of your Spark
distribution (in our case, ~/Spark/bin) and start the interactive shell by using the
following command:
cd ~/Spark/bin
./pyspark
You should see the Spark welcome banner followed by an interactive Python prompt.
sc is the SparkContext that is automatically created for you when PySpark starts.
Initializing a PySpark session creates a sqlContext as well. We will get to describing
what these are later in this book.
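As a quick sanity check of the new shell (an illustrative snippet of ours, not part of the original text), you can run a trivial job through sc; it should return 100:
sc.parallelize(range(100)).count()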
To exit the pyspark session, type quit().
Windows
Installing Spark on Windows is also fairly straightforward. However, as mentioned
earlier, instead of building the whole environment from scratch, we will download a
precompiled version of Spark. Go to the downloads page at
http://spark.apache.org/downloads.html and:
1. Choose a Spark release: 2.1.0 (Dec 28, 2016). Note that, when you read this,
the version might be different; simply select the latest available release.
2. Choose a package type: Pre-built for Hadoop 2.7 and later.
3. Choose a download type: Direct download.
4. Click on the link next to Download Spark: it should state something similar
to spark-2.1.0-bin-hadoop2.7.tgz.
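To verify the integrity of the download on Windows, the original relies on Microsoft's File Checksum Integrity Verifier (fciv). A sketch of the check, assuming fciv.exe is on your PATH and you run it from the folder the archive was downloaded to:
fciv -md5 spark-2.1.0-bin-hadoop2.7.tgz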
This should produce a string of random-looking letters and numbers similar to the
following one:
50E73F255F9BDE50789AD5BD657C7A71 spark-2.1.0-bin-hadoop2.7.tgz
You can compare the cited number with the corresponding MD5 checksum found here:
http://www.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz.md5
Check this video for step-by-step instructions on how to install and use
the fciv tool: https://www.youtube.com/watch?v=G08xum0AuFg
Let's unpack the archive now. If you have 7-Zip or another unarchiver that can
handle .tgz archives, you are ready to go. However, if you do not have an
unarchiver, we suggest you go to http://www.7-zip.org, download the version
of 7-Zip that is compatible with your system, and install it.
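The original then extracts the archive and launches the shell from the command prompt. A minimal sketch of that step, assuming you extracted the package to C:\Spark (the folder name and location are our assumption):
cd C:\Spark\spark-2.1.0-bin-hadoop2.7\bin
pyspark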
Once you hit the Enter key, the PySpark shell should start up, just as it did on Mac and Linux.
Note that, when running Spark locally, a Hadoop installation is not required.
However, even though Spark might print some error messages when starting up,
complaining about not finding Hadoop binaries, it will still execute your PySpark
code. Spark, according to the FAQ from http://spark.apache.org, requires
Hadoop (or any other distributed file system) only when you deploy Spark on a
cluster; running locally, it is not necessary and the error can be ignored.
Jupyter on PySpark
Jupyter is a convenient and powerful shell for Python in which you can create
notebooks with embedded code. Jupyter's notebooks allow you to include regular
text, code, and images, and you can also create tables or use the LaTeX typesetting
system. Running on top of Python, it is a really convenient way of writing your
applications: you essentially keep your thoughts, documentation, and code in
one place.
Installing Jupyter
If you run the Anaconda distribution of Python, you can easily install Jupyter by
running the following command:
conda install jupyter
The command will install all the necessary modules Jupyter depends on, as well
as Jupyter itself. If, however, you do not run Anaconda, follow the instructions on
http://jupyter.readthedocs.io/en/latest/install.html to install Jupyter
manually.
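On Mac and Linux, the original adds a few more lines to ~/.bash_profile at this point. They are not reproduced in this extract, so the following is a sketch of what they typically look like; the exact Spark location and the PYSPARK_DRIVER_PYTHON_OPTS line are our assumptions:
export PATH=$HOME/Spark/bin:$PATH
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'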
In the first line, we allow the bash environment to find the newly compiled
binaries (pyspark lives in the ~/Spark/bin folder). So, now, you can simply
type the following:
pyspark
Windows
In the same way as we added and changed the environment variables earlier
(see the Changing PATH on Windows section), change the PATH variable to point to
your Spark distribution's bin folder. Next, add a PYSPARK_DRIVER_PYTHON variable
and set its value to jupyter.
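If you prefer the command prompt over the GUI, the same effect can be achieved with the setx utility (our own illustration; as on Mac and Linux, the PYSPARK_DRIVER_PYTHON_OPTS value is an assumption about the typical notebook setup):
setx PYSPARK_DRIVER_PYTHON jupyter
setx PYSPARK_DRIVER_PYTHON_OPTS notebook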
Starting Jupyter
Now, every time you type pyspark in the CLI, a new instance of Jupyter and PySpark
will be created, and your default browser will launch with the starting screen
from Jupyter:
Clicking on the New button and selecting a Python notebook will open a notebook in
another tab of your browser, as shown in the following screenshot:
Now we can start coding. Type sc in the first cell and hit the Alt + Enter keys (or the
Option + Enter keys on Mac). This will execute the command and create a new cell for
the code that follows. The output of the command should look similar to the
following screenshot:
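In the next cell, you can poke at sqlContext as well; the following one-liner (ours, not from the original text) builds a tiny five-row DataFrame and displays it:
sqlContext.range(5).show()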
We are in business! As a last step, let's rename our notebook, as it normally starts
with the default Untitled name; click on the title at the top of the page to change it.
To stop the notebook (which you should do every time you want to finish working
with a notebook), go to File and click on the Close and Halt option. What this
does is close the notebook, but it also stops the Python kernel, releasing
the memory.
Summary
In this chapter, we walked you through the (sometimes painful) process of setting
up your local Spark environment. We showed you how to check whether two
required environments (Java and Python) were present on your machine. We
also provided some guidance on how to install them on your system if these two
packages were missing.
Even though the process of installing Spark itself might be intimidating at times, we
hope, with our help, you were able to successfully install the engine and execute the
minimal code presented in this chapter. At this point, you should be able to run code
in Jupyter notebook.