Machine Learning With Spark - Sample Chapter
Acknowledgments
Writing this book has been quite a rollercoaster ride over the past year, with many ups
and downs, late nights, and working weekends. It has also been extremely rewarding to
combine my passion for machine learning with my love of the Apache Spark project, and
I hope to bring some of this out in this book.
I would like to thank the Packt Publishing team for all their assistance throughout the
writing and editing process: Rebecca, Susmita, Sudhir, Amey, Neil, Vivek, Pankaj, and
everyone who worked on the book.
Thanks also go to Debora Donato at StumbleUpon for assistance with data- and
legal-related queries.
Writing a book like this can be a somewhat lonely process, so it is incredibly helpful to
get the feedback of reviewers to understand whether one is headed in the right direction
(and what course adjustments need to be made). I'm deeply grateful to Andrea Mostosi,
Hao Ren, and Krishna Sankar for taking the time to provide such detailed and
critical feedback.
I could not have gotten through this project without the unwavering support of all my
family and friends, especially my wonderful wife, Tammy, who will be glad to have me
back in the evenings and on weekends once again. Thank you all!
Finally, thanks to all of you reading this; I hope you find it useful!
Chapter 6, Building a Regression Model with Spark, shows how to create a model for
regression, extending the classification model created in Chapter 5, Building a
Classification Model with Spark. Evaluation metrics for the performance of regression
models will be detailed here.
Chapter 7, Building a Clustering Model with Spark, explores how to create a clustering
model as well as how to use related evaluation methodologies. You will learn how to
analyze and visualize the clusters generated.
Chapter 8, Dimensionality Reduction with Spark, takes us through methods to extract the
underlying structure from and reduce the dimensionality of our data. You will learn some
common dimensionality-reduction techniques and how to apply and analyze them, as
well as how to use the resulting data representation as input to another machine
learning model.
Chapter 9, Advanced Text Processing with Spark, introduces approaches to deal with
large-scale text data, including techniques for feature extraction from text and dealing
with the very high-dimensional features typical in text data.
Chapter 10, Real-time Machine Learning with Spark Streaming, provides an overview of
Spark Streaming and how it fits in with the online and incremental learning approaches to
apply machine learning on data streams.
The standalone local mode, where all Spark processes are run within the
same Java Virtual Machine (JVM) process
If you have previous experience in setting up Spark and are familiar with the basics
of writing a Spark program, feel free to skip this chapter.
Chapter 1
As Spark's local mode is fully compatible with the cluster mode, programs written
and tested locally can be run on a cluster with just a few additional steps.
The first step in setting up Spark locally is to download the latest version (at the time
of writing this book, the version is 1.2.0). The download page of the Spark project
website, found at http://spark.apache.org/downloads.html, contains links to
download various versions as well as to obtain the latest source code via GitHub.
The Spark project documentation website at http://spark.apache.org/docs/latest/
is a comprehensive resource to learn more about
Spark. We highly recommend that you explore it!
Spark places user scripts to run Spark in the bin directory. You can test whether
everything is working correctly by running one of the example programs included
in Spark:
>./bin/run-example org.apache.spark.examples.SparkPi
This will run the example in Spark's local standalone mode. In this mode, all the Spark
processes are run within the same JVM, and Spark uses multiple threads for parallel
processing. By default, the preceding example uses a number of threads equal to the
number of cores available on your system. Once the program is finished running, you
should see something similar to the following lines near the end of the output:
To configure the level of parallelism in the local mode, you can pass in a master
parameter of the local[N] form, where N is the number of threads to use. For
example, to use only two threads, run the following command instead:
>MASTER=local[2] ./bin/run-example org.apache.spark.examples.SparkPi
Spark clusters
A Spark cluster is made up of two types of processes: a driver program and multiple
executors. In the local mode, all these processes are run within the same JVM. In a
cluster, these processes are usually run on separate nodes.
For example, a typical cluster that runs in Spark's standalone mode (that is, using
Spark's built-in cluster-management modules) will have:
A master node that runs the Spark standalone master process as well as the
driver program
While we will be using Spark's local standalone mode throughout this book to
illustrate concepts and examples, the same Spark code that we write can be run
on a Spark cluster. In the preceding example, if we run the code on a Spark
standalone cluster, we could simply pass in the URL for the master node as follows:
>MASTER=spark://IP:PORT ./bin/run-example org.apache.spark.examples.SparkPi
Here, IP is the IP address, and PORT is the port of the Spark master. This tells Spark
to run the program on the cluster where the Spark master process is running.
A full treatment of Spark's cluster management and deployment is beyond the scope
of this book. However, we will briefly teach you how to set up and use an Amazon
EC2 cluster later in this chapter.
For an overview of the Spark cluster-application deployment, take a
look at the following links:
http://spark.apache.org/docs/latest/cluster-overview.html
http://spark.apache.org/docs/latest/submitting-applications.html
Once initialized, we will use the various methods found in the SparkContext object
to create and manipulate distributed datasets and shared variables. The Spark shell
(available for Scala and Python, but unfortunately not for Java) takes care
of this context initialization for us, but the following lines of code show an example
of creating a context running in the local mode in Scala:
val conf = new SparkConf()
  .setAppName("Test Spark App")
  .setMaster("local[4]")
val sc = new SparkContext(conf)
This creates a context running in the local mode with four threads, with the name
of the application set to Test Spark App. If we wish to use default configuration
values, we could also call the following simple constructor for our SparkContext
object, which works in exactly the same way:
val sc = new SparkContext("local[4]", "Test Spark App")
To use the Spark shell with Scala, simply run ./bin/spark-shell from the Spark
base directory. This will launch the Scala shell and initialize SparkContext, which is
available to us as the Scala value, sc. Your console output should look similar to the
following screenshot:
To use the Python shell with Spark, simply run the ./bin/pyspark command. Like the
Scala shell, the Python SparkContext object should be available as the Python variable
sc. You should see an output similar to the one shown in this screenshot:
Creating RDDs
RDDs can be created from existing collections, for example, in the Scala Spark shell
that you launched earlier:
val collection = List("a", "b", "c", "d", "e")
val rddFromCollection = sc.parallelize(collection)
RDDs can also be created from Hadoop-based input sources, including the local
filesystem, HDFS, and Amazon S3. A Hadoop-based RDD can utilize any input
format that implements the Hadoop InputFormat interface, including text files,
other standard Hadoop formats, HBase, Cassandra, and many more. The following
code is an example of creating an RDD from a text file located on the local filesystem:
val rddFromTextFile = sc.textFile("LICENSE")
The preceding textFile method returns an RDD where each record is a String
object that represents one line of the text file.
Spark operations
Once we have created an RDD, we have a distributed collection of records that
we can manipulate. In Spark's programming model, operations are split into
transformations and actions. Generally speaking, a transformation operation applies
some function to all the records in the dataset, changing the records in some way.
An action typically runs some computation or aggregation operation and returns the
result to the driver program where SparkContext is running.
Spark operations are functional in style. For programmers familiar with functional
programming in Scala or Python, these operations should seem natural. For those
without experience in functional programming, don't worry; the Spark API is
relatively easy to learn.
One of the most common transformations that you will use in Spark programs is
the map operator. This applies a function to each record of an RDD, thus mapping the
input to some new output. For example, the following code fragment takes the RDD
we created from a local text file and applies the size function to each record in the
RDD. Remember that we created an RDD of Strings. Using map, we can transform
each string to an integer, thus returning an RDD of Ints:
val intsFromStringsRDD = rddFromTextFile.map(line => line.size)
You should see output similar to the following line in your shell; this indicates the
type of the RDD:
intsFromStringsRDD: org.apache.spark.rdd.RDD[Int] = MappedRDD[5] at map at <console>:14
In the preceding code, we saw the => syntax used. This is the Scala syntax for an
anonymous function, which is a function that is not a named method (that is, one
defined using the def keyword in Scala or Python, for example).
While a detailed treatment of anonymous functions is beyond the scope
of this book, they are used extensively in Spark code in Scala and Python,
as well as in Java 8 (both in examples and real-world applications), so it is
useful to cover a few practicalities.
The line => line.size syntax means that we are applying a function
where the input variable is to the left of the => operator, and the output is
the result of the code to the right of the => operator. In this case, the input
is line, and the output is the result of calling line.size. In Scala, this
function that maps a string to an integer is expressed as String => Int.
This syntax saves us from having to separately define functions every
time we use methods such as map; this is useful when the function is
simple and will only be used once, as in this example.
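For comparison, here is a minimal sketch of the same idea in plain Python, applied to an ordinary list rather than an RDD. The sample lines and the name line_length are made up for illustration; they are not taken from the LICENSE file or from the Spark API.

```python
# Hypothetical sample lines standing in for the records of an RDD of strings.
lines = ["Apache License", "Version 2.0", ""]

# Named function: defined once with def and reusable by name.
def line_length(line):
    return len(line)

# Anonymous function: the Python counterpart of Scala's line => line.size.
lengths_named = list(map(line_length, lines))
lengths_lambda = list(map(lambda line: len(line), lines))
```

Both calls produce the same list of lengths; the lambda form simply avoids a separate definition for a function that is used only once.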
Now, we can apply a common action operation, count, to return the number of
records in our RDD:
intsFromStringsRDD.count
The result should look something like the following console output:
14/01/29 23:28:28 INFO SparkContext: Starting job: count at <console>:17
...
14/01/29 23:28:28 INFO SparkContext: Job finished: count at <console>:17,
took 0.019227 s
res4: Long = 398
Perhaps we want to find the average length of the lines in this text file. We can first
use the sum function to add up all the lengths of all the records and then divide the
sum by the number of records:
val sumOfRecords = intsFromStringsRDD.sum
val numRecords = intsFromStringsRDD.count
val aveLengthOfRecord = sumOfRecords / numRecords
Spark operations, in most cases, return a new RDD, with the exception of most
actions, which return the result of a computation (such as Long for count and Double
for sum in the preceding example). This means that we can naturally chain together
operations to make our program flow more concise and expressive. For example,
the same result as the one in the preceding line of code can be achieved using the
following code:
val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).
sum / rddFromTextFile.count
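To make the equivalence concrete, the following sketch works through the same arithmetic in plain Python on three made-up lines (assumed sample data, not the contents of the text file). It only mirrors the calculation; it does not use Spark.

```python
# Hypothetical sample lines standing in for the text file's records.
lines = ["one", "three", "seven!!"]

# Step by step, mirroring the three-line version.
lengths = [len(line) for line in lines]
sum_of_records = float(sum(lengths))  # a floating-point sum, like Spark's Double
num_records = len(lengths)
ave_length = sum_of_records / num_records

# Chained into a single expression.
ave_length_chained = float(sum(map(len, lines))) / len(lines)
```

The lengths are 3, 5, and 7, so both versions give 15.0 / 3 = 5.0.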
An important point to note is that Spark transformations are lazy. That is, invoking
a transformation on an RDD does not immediately trigger a computation. Instead,
transformations are chained together and are effectively only computed when an
action is called. This allows Spark to be more efficient by only returning results to the
driver when necessary so that the majority of operations are performed in parallel on
the cluster.
This means that if your Spark program never uses an action operation, it will
never trigger an actual computation, and you will not get any results. For example,
the following code will simply return a new RDD that represents the chain of
transformations:
val transformedRDD = rddFromTextFile.map(line => line.size).
filter(size => size > 10).map(size => size * 2)
Notice that no actual computation happens and no result is returned. If we now call
an action, such as sum, on the resulting RDD, the computation will be triggered:
val computation = transformedRDD.sum
You will now see that a Spark job is run, and it results in the following
console output:
...
14/11/27 21:48:21 INFO SparkContext: Job finished: sum at <console>:16,
took 0.193513 s
computation: Double = 60468.0
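Python 3 iterators offer a rough analogy for this lazy behavior. The sketch below is plain Python with made-up sample records, and it is only an analogy: it is not how Spark implements laziness, and unlike an RDD, a Python iterator can be consumed only once. The calls list is a hypothetical helper used to observe when the mapping function actually runs.

```python
# A call log lets us observe when the mapping function actually runs.
calls = []

def line_size(line):
    calls.append(line)
    return len(line)

# Hypothetical sample records standing in for rddFromTextFile.
lines = ["a short line", "tiny", "a considerably longer line"]

# "Transformations": building the chain does no work yet, because
# map and filter return lazy iterators in Python 3.
sizes = map(line_size, lines)
long_sizes = filter(lambda size: size > 10, sizes)
doubled = map(lambda size: size * 2, long_sizes)
calls_before_action = len(calls)

# "Action": summing consumes the chain and triggers the computation.
total = sum(doubled)
calls_after_action = len(calls)
```

No call to line_size is recorded until sum is invoked; at that point all three records are processed and the two lengths above 10 (12 and 26) are doubled and added.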
Caching RDDs
One of the most powerful features of Spark is the ability to cache data in memory
across a cluster. This is achieved through use of the cache method on an RDD:
rddFromTextFile.cache
Calling cache on an RDD tells Spark that the RDD should be kept in memory. The
first time an action is called on the RDD that initiates a computation, the data is
read from its source and put into memory. Hence, the first time such an operation is
called, the time it takes to run the task is partly dependent on the time it takes to read
the data from the input source. However, when the data is accessed the next time
(for example, in subsequent queries in analytics or iterations in a machine learning
model), the data can be read directly from memory, thus avoiding expensive I/O
operations and speeding up the computation, in many cases, by a significant factor.
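The effect of caching can be pictured with a small plain-Python sketch. The class CachedDataset and the function load_lines are hypothetical names invented for this illustration; they are not part of Spark, and the sketch ignores everything that makes Spark's cache distributed.

```python
# Count how often the "expensive" source is read.
reads = {"count": 0}

def load_lines():
    """Hypothetical stand-in for reading the input file from disk."""
    reads["count"] += 1
    return ["first line", "second", "third line here"]

class CachedDataset:
    """Illustrative helper, not a Spark class: loads once, then reuses."""

    def __init__(self, loader):
        self._loader = loader
        self._data = None

    def records(self):
        if self._data is None:   # first access: read from the source
            self._data = self._loader()
        return self._data        # later accesses: served from memory

dataset = CachedDataset(load_lines)
num_records = len(dataset.records())                   # triggers the load
total_length = sum(len(r) for r in dataset.records())  # reuses cached data
```

Although the records are used twice, the source is read only once, which is the saving that caching is meant to provide.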
If we now call the count or sum function on our cached RDD, we will see that the
RDD is loaded into memory:
val aveLengthOfRecordChained = rddFromTextFile.map(line => line.size).
sum / rddFromTextFile.count
Indeed, in the following output, we see that the dataset was cached in memory on
the first call, taking up approximately 62 KB and leaving us with around 297 MB of
memory free:
...
14/01/30 06:59:27 INFO MemoryStore: ensureFreeSpace(63454) called with
curMem=32960, maxMem=311387750
14/01/30 06:59:27 INFO MemoryStore: Block rdd_2_0 stored as values to
memory (estimated size 62.0 KB, free 296.9 MB)
14/01/30 06:59:27 INFO BlockManagerMasterActor$BlockManagerInfo: Added
rdd_2_0 in memory on 10.0.0.3:55089 (size: 62.0 KB, free: 296.9 MB)
...
We will see from the console output that the cached data is read directly from
memory:
...
14/01/30 06:59:34 INFO BlockManager: Found block rdd_2_0 locally
...
The console output shows that the broadcast variable was stored in memory,
taking up approximately 488 bytes, and it also shows that we still have around 297 MB
available to us:
14/01/30 07:13:32 INFO MemoryStore: ensureFreeSpace(488) called with
curMem=96414, maxMem=311387750
14/01/30 07:13:32 INFO MemoryStore: Block broadcast_1 stored as values to
memory (estimated size 488.0 B, free 296.9 MB)
broadcastAList: org.apache.spark.broadcast.Broadcast[List[String]] = Broadcast(1)
A broadcast variable can be accessed from nodes other than the driver program that
created it (that is, the worker nodes) by calling value on the variable:
sc.parallelize(List("1", "2", "3")).map(x => broadcastAList.value ++
x).collect
This code creates a new RDD with three records from a collection (in this case, a
Scala List) of ("1", "2", "3"). In the map function, it returns a new collection with
the relevant record from our new RDD appended to the broadcastAList that is our
broadcast variable.
Notice that we used the collect method in the preceding code. This is a Spark action
that returns the entire RDD to the driver as a Scala (or Python or Java) collection.
We will often use collect when we wish to apply further processing to our results
locally within the driver program.
Note that collect should generally only be used in cases where we
really want to return the full result set to the driver and perform further
processing. If we try to call collect on a very large dataset, we might
run out of memory on the driver and crash our program.
It is preferable to perform as much heavy-duty processing on our Spark
cluster as possible, preventing the driver from becoming a bottleneck. In
many cases, however, collecting results to the driver is necessary, such
as during iterations in many machine learning models.
On inspecting the result, we will see that for each of the three records in our new
RDD, we now have a record that is our original broadcasted List, with the new
element appended to it (that is, there is now either "1", "2", or "3" at the end):
...
14/01/31 10:15:39 INFO SparkContext: Job finished: collect at
<console>:15, took 0.025806 s
res6: Array[List[Any]] = Array(List(a, b, c, d, e, 1), List(a, b, c, d,
e, 2), List(a, b, c, d, e, 3))
An accumulator is also a variable that is broadcasted to the worker nodes. The key
difference between a broadcast variable and an accumulator is that while the broadcast
variable is read-only, the accumulator can be added to. There is a limitation, however:
the addition must be an associative operation so that the global
accumulated value can be correctly computed in parallel and returned to the driver
program. Each worker node can only access and add to its own local accumulator
value, and only the driver program can access the global value. Accumulators are also
accessed within the Spark code using the value method.
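The reason for the associativity requirement can be seen in a small plain-Python sketch. The partitions below are made-up data standing in for the records held by three workers; nothing here uses the Spark accumulator API.

```python
from functools import reduce

# Hypothetical partitions of a dataset, as if held by three workers.
partitions = [[1, 2, 3], [4, 5], [6]]

# Each worker adds only to its own local value...
local_totals = [sum(partition) for partition in partitions]

# ...and the driver merges the local values into the global one.
global_total = reduce(lambda a, b: a + b, local_totals)

# Because addition is associative, the grouping of the merge does not matter.
merged_left = (local_totals[0] + local_totals[1]) + local_totals[2]
merged_right = local_totals[0] + (local_totals[1] + local_totals[2])
```

Whichever way the local totals are grouped, the merged result is the same, so the global value does not depend on the order in which the workers' contributions are combined.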
For our Scala program, we need to create two files: our Scala code and our project
build configuration file for the Scala Build Tool (SBT). For ease of use,
we recommend that you download the sample project code called scala-spark-app
for this chapter. This code also contains the CSV file under the data directory.
You will need SBT installed on your system in order to run this example program
(we use version 0.13.1 at the time of writing this book).
Setting up SBT is beyond the scope of this book; however, you can
find more information at http://www.scala-sbt.org/release/docs/Getting-Started/Setup.html.
Our SBT configuration file, build.sbt, looks like this (note that the empty lines
between each line of code are required):
name := "scala-spark-app"

version := "1.0"
In our main method, we need to initialize our SparkContext object and use this
to access our CSV data file with the textFile method. We will then map the raw
text by splitting the string on the delimiter character (a comma in this case) and
extracting the relevant records for username, product, and price:
def main(args: Array[String]) {
  val sc = new SparkContext("local[2]", "First Spark App")
  // we take the raw data in CSV format and convert it into a
  // set of records of the form (user, product, price)
  val data = sc.textFile("data/UserPurchaseHistory.csv")
    .map(line => line.split(","))
    .map(purchaseRecord => (purchaseRecord(0), purchaseRecord(1), purchaseRecord(2)))
Now that we have an RDD, where each record is made up of (user, product,
price), we can compute various interesting metrics for our store, such as the
following ones:
This last piece of code to compute the most popular product is an example of the
Map/Reduce pattern made popular by Hadoop. First, we mapped our records of
(user, product, price) to the records of (product, 1). Then, we performed a
reduceByKey operation, where we summed up the 1s for each unique product.
Once we have this transformed RDD, which contains the number of purchases for
each product, we will call collect, which returns the results of the computation to
the driver program as a local Scala collection. We will then sort these counts locally
(note that in practice, if the amount of data is large, we will perform the sorting in
parallel, usually with a Spark operation such as sortByKey).
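The pattern itself can be sketched without Spark. The following plain-Python example uses made-up purchase records (not the contents of UserPurchaseHistory.csv) and an ordinary dictionary in place of reduceByKey, so it illustrates only the logic, not the distributed execution.

```python
# Hypothetical purchase records of the form (user, product, price).
purchases = [
    ("alice", "Widget", 9.99),
    ("bob", "Gadget", 4.50),
    ("carol", "Widget", 9.99),
]

# Map step: turn every record into a (product, 1) pair.
pairs = [(product, 1) for (_user, product, _price) in purchases]

# Reduce step: sum the 1s for each unique product key.
counts = {}
for product, one in pairs:
    counts[product] = counts.get(product, 0) + one

# Sort locally by count, in descending order, and take the first entry.
most_popular = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[0]
```

With this sample data, the counts are 2 for Widget and 1 for Gadget, so the most popular product is Widget.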
Finally, we will print out the results of our computations to the console:
    println("Total purchases: " + numPurchases)
    println("Unique users: " + uniqueUsers)
    println("Total revenue: " + totalRevenue)
    println("Most popular product: %s with %d purchases".
      format(mostPopular._1, mostPopular._2))
  }
}
We can run this program by running sbt run in the project's base directory or by
running the program in your Scala IDE if you are using one. The output should look
similar to the following:
...
[info] Compiling 1 Scala source to ...
We can see that we have five purchases from four different users with a total revenue
of 39.91. Our most popular product is an iPhone cover with 2 purchases.
In this section, we covered the standard Java API syntax. For more
details and examples related to working with RDDs in Java as well as the
Java 8 lambda syntax, see the Java sections of the Spark Programming
Guide found at
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
We will see examples of most of these differences in the following Java program,
which is included in the example code of this chapter in the directory named
java-spark-app. The code directory also contains the CSV data file under the
data subdirectory.
We will build and run this project with the Maven build tool, which we assume you
have installed on your system.
Installing and setting up Maven is beyond the scope of this book.
Usually, Maven can easily be installed using the package manager on
your Linux system or HomeBrew or MacPorts on Mac OS X.
Detailed installation instructions can be found here:
http://maven.apache.org/download.cgi.
The project contains a Java file called JavaApp.java, which contains our
program code:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.DoubleFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/**
 * A simple Spark app in Java
 */
public class JavaApp {
  public static void main(String[] args) {
As in our Scala example, we first need to initialize our context. Notice that we will
use the JavaSparkContext class here instead of the SparkContext class that we
used earlier. We will use the JavaSparkContext class in the same way to access
our data using textFile and then split each row into the required fields. Note
how we used an anonymous class to define a split function that performs the string
processing, in the highlighted code:
    JavaSparkContext sc = new JavaSparkContext("local[2]", "First Spark App");
    // we take the raw data in CSV format and convert it into a
    // set of records of the form (user, product, price)
    JavaRDD<String[]> data =
      sc.textFile("data/UserPurchaseHistory.csv")
        .map(new Function<String, String[]>() {
          @Override
          public String[] call(String s) throws Exception {
            return s.split(",");
          }
        });
Now, we can compute the same metrics as we did in our Scala example. Note how
some methods are the same (for example, distinct and count) for the Java and
Scala APIs. Also note the use of anonymous classes that we pass to the map function.
This code is highlighted here:
    // let's count the number of purchases
    long numPurchases = data.count();
    // let's count how many unique users made purchases
    long uniqueUsers = data.map(new Function<String[], String>() {
      @Override
      public String call(String[] strings) throws Exception {
        return strings[0];
      }
    }).distinct().count();
    // let's sum up our total revenue
    // (in the Spark 1.x Java API, a DoubleFunction is passed to mapToDouble)
    double totalRevenue = data.mapToDouble(new DoubleFunction<String[]>() {
      @Override
      public double call(String[] strings) throws Exception {
        return Double.parseDouble(strings[2]);
      }
    }).sum();
In the following lines of code, we can see that the approach to compute the most
popular product is the same as that in the Scala example. The extra code might seem
complex, but it is mostly related to the Java code required to create the anonymous
functions (which we have highlighted here). The actual functionality is the same:
    // let's find our most popular product
    // first we map the data to records of (product, 1) using a PairFunction
    // and the Tuple2 class.
    // (in the Spark 1.x Java API, a PairFunction is passed to mapToPair)
    // then we call a reduceByKey operation with a Function2,
    // which is essentially the sum function
    List<Tuple2<String, Integer>> pairs = data.mapToPair(
        new PairFunction<String[], String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String[] strings) throws Exception {
        return new Tuple2<String, Integer>(strings[1], 1);
      }
    }).reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer integer, Integer integer2) throws Exception {
        return integer + integer2;
      }
    }).collect();
    // finally we sort the result. Note we need to create a Comparator function,
    // that reverses the sort order.
    Collections.sort(pairs, new Comparator<Tuple2<String, Integer>>() {
      @Override
      public int compare(Tuple2<String, Integer> o1, Tuple2<String, Integer> o2) {
        return -(o1._2() - o2._2());
      }
    });
    String mostPopular = pairs.get(0)._1();
    int purchases = pairs.get(0)._2();
    System.out.println("Total purchases: " + numPurchases);
    System.out.println("Unique users: " + uniqueUsers);
    System.out.println("Total revenue: " + totalRevenue);
    System.out.println(String.format(
      "Most popular product: %s with %d purchases", mostPopular, purchases));
  }
}
As can be seen, the general structure is similar to the Scala version, apart from
the extra boilerplate code to declare variables and functions via anonymous inner
classes. It is a good exercise to work through both examples and compare the same
lines of Scala code to those in Java to understand how the same result is achieved in
each language.
This program can be run with the following command executed from the project's
base directory:
>mvn exec:java -Dexec.mainClass="JavaApp"
You will see output that looks very similar to the Scala version, with the results of
the computation identical:
...
14/01/30 17:02:43 INFO spark.SparkContext: Job finished: collect at
JavaApp.java:46, took 0.039167 s
Total purchases: 5
Unique users: 4
Total revenue: 39.91
Most popular product: iPhone Cover with 2 purchases
from pyspark import SparkContext

sc = SparkContext("local[2]", "First Spark App")
# we take the raw data in CSV format and convert it into a set of
# records of the form (user, product, price)
data = sc.textFile("data/UserPurchaseHistory.csv").map(lambda line:
    line.split(",")).map(lambda record: (record[0], record[1], record[2]))
# let's count the number of purchases
numPurchases = data.count()
# let's count how many unique users made purchases
uniqueUsers = data.map(lambda record: record[0]).distinct().count()
# let's sum up our total revenue
totalRevenue = data.map(lambda record: float(record[2])).sum()
# let's find our most popular product
products = data.map(lambda record: (record[1], 1.0)).reduceByKey(lambda a, b: a + b).collect()
mostPopular = sorted(products, key=lambda x: x[1], reverse=True)[0]
print("Total purchases: %d" % numPurchases)
print("Unique users: %d" % uniqueUsers)
print("Total revenue: %2.2f" % totalRevenue)
print("Most popular product: %s with %d purchases" % (mostPopular[0],
    mostPopular[1]))
If you compare the Scala and Python versions of our program, you will see that
generally, the syntax looks very similar. One key difference is how we express
anonymous functions (also called lambda functions; hence, the use of this keyword
for the Python syntax). In Scala, we've seen that an anonymous function mapping an
input x to an output y is expressed as x => y, while in Python, it is lambda x: y. In
the highlighted line in the preceding code, we are applying an anonymous function
that maps two inputs, a and b, generally of the same type, to an output. In this case,
the function that we apply is the plus function; hence, lambda a, b: a + b.
The best way to run the script is to run the following command from the base
directory of the sample project:
>$SPARK_HOME/bin/spark-submit pythonapp.py
Here, the SPARK_HOME variable should be replaced with the path of the directory
in which you originally unpacked the Spark prebuilt binary package at the start of
this chapter.
Upon running the script, you should see output similar to that of the Scala and Java
examples, with the results of our computation being the same:
...
14/01/30 11:43:47 INFO SparkContext: Job finished: collect at pythonapp.
py:14, took 0.050251 s
Total purchases: 5
Running it in this way without an argument will show the help output:
Usage: spark-ec2 [options] <action> <cluster_name>
<action> can be: launch, destroy, login, stop, start, get-master
Options:
...
Before creating a Spark EC2 cluster, you will need to ensure you have an
Amazon account.
If you don't have an Amazon Web Services account, you can sign up at
http://aws.amazon.com/.
The AWS console is available at http://aws.amazon.com/console/.
You will also need to create an Amazon EC2 key pair and retrieve the relevant
security credentials. The Spark documentation for EC2 (available at
http://spark.apache.org/docs/latest/ec2-scripts.html) explains the requirements:
Create an Amazon EC2 key pair for yourself. This can be done by logging into your
Amazon Web Services account through the AWS console, clicking on Key Pairs
on the left sidebar, and creating and downloading a key. Make sure that you set the
permissions for the private key file to 600 (that is, only you can read and write it)
so that ssh will work.
Whenever you want to use the spark-ec2 script, set the environment variables
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your Amazon EC2 access
key ID and secret access key, respectively. These can be obtained from the AWS
homepage by clicking Account | Security Credentials | Access Credentials.
When creating a key pair, choose a name that is easy to remember. We will simply
use spark for the key pair name. The key pair file itself will be called spark.pem.
As mentioned earlier, ensure that the key pair file permissions are set appropriately
and that the environment variables for the AWS credentials are exported using the
following commands:
>chmod 600 spark.pem
>export AWS_ACCESS_KEY_ID="..."
>export AWS_SECRET_ACCESS_KEY="..."
You should also be careful to keep your downloaded key pair file safe and not lose it,
as it can only be downloaded once when it is created!
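If you want to confirm what the 600 permission setting means, the following sketch checks it from Python on a Unix-like system. It works on a throwaway temporary file rather than on your real key; the file and variable names are assumptions made for this illustration.

```python
import os
import stat
import tempfile

# Create a throwaway file standing in for spark.pem (hypothetical file).
handle, key_path = tempfile.mkstemp(suffix=".pem")
os.close(handle)

# Equivalent of `chmod 600`: the owner may read and write, nobody else may.
os.chmod(key_path, stat.S_IRUSR | stat.S_IWUSR)

# Read the permission bits back to confirm them.
mode = stat.S_IMODE(os.stat(key_path).st_mode)
is_private = (mode == 0o600)

# Clean up the temporary file.
os.remove(key_path)
```

The same stat check could be pointed at the path of your own key pair file to verify its permissions before running ssh.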
Note that launching an Amazon EC2 cluster in the following section will incur costs
to your AWS account.
This will launch a new Spark cluster called test-cluster with one master and one
slave node of instance type m3.medium. This cluster will be launched with a Spark
version built for Hadoop 2. The key pair name we used is spark, and the key pair file
is spark.pem (if you gave the files different names or have an existing AWS key pair,
use that name instead).
It might take quite a while for the cluster to fully launch and initialize. You should
see something like this screenshot immediately after running the launch command:
If the cluster has launched successfully, you should eventually see the console output
similar to the following screenshot:
To test whether we can connect to our new cluster, we can run the following
command:
>ssh -i spark.pem root@ec2-54-227-127-14.compute-1.amazonaws.com
Remember to replace the public domain name of the master node (the address after
root@ in the preceding command) with the correct Amazon EC2 public domain
name that will be shown in your console output after launching the cluster.
You can also retrieve your cluster's master public domain name by running this line
of code:
>./spark-ec2 -i spark.pem get-master test-cluster
After successfully running the ssh command, you will be connected to your Spark
master node in EC2, and your terminal output should match the following screenshot:
We can test whether our cluster is correctly set up with Spark by changing into the
Spark directory and running an example in the local mode:
>cd spark
>MASTER=local[2] ./bin/run-example SparkPi
You should see output similar to running the same command on your
local computer:
...
14/01/30 20:20:21 INFO SparkContext: Job finished: reduce at SparkPi.
scala:35, took 0.864044012 s
Pi is roughly 3.14032
...
Now that we have an actual cluster with multiple nodes, we can test Spark in the
cluster mode. We can run the same example on the cluster, using our 1 slave node,
by passing in the master URL instead of the local version:
>MASTER=spark://ec2-54-227-127-14.compute-1.amazonaws.com:7077 ./bin/run-example SparkPi
Note that you will need to substitute the preceding master domain
name with the correct domain name for your specific cluster.
Again, the output should be similar to running the example locally; however, the log
messages will show that your driver program has connected to the Spark master:
...
14/01/30 20:26:17 INFO client.Client$ClientActor: Connecting to master
spark://ec2-54-220-189-136.eu-west-1.compute.amazonaws.com:7077
14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend: Connected to
Spark cluster with app ID app-20140130202617-0001
14/01/30 20:26:17 INFO client.Client$ClientActor: Executor added:
app-20140130202617-0001/0 on
worker-20140130201049-ip-10-34-137-45.eu-west-1.compute.internal-57119
(ip-10-34-137-45.eu-west-1.compute.internal:57119) with 1 cores
14/01/30 20:26:17 INFO cluster.SparkDeploySchedulerBackend: Granted
executor ID app-20140130202617-0001/0 on hostPort
ip-10-34-137-45.eu-west-1.compute.internal:57119 with 1 cores, 2.4 GB RAM
Feel free to experiment with your cluster. Try out the interactive console in Scala,
for example:
> ./bin/spark-shell --master spark://ec2-54-227-127-14.compute-1.amazonaws.com:7077
Once you've finished, type exit to leave the console. You can also try the PySpark
console by running the following command:
> ./bin/pyspark --master spark://ec2-54-227-127-14.compute-1.amazonaws.com:7077
You can use the Spark Master web interface to see the applications registered
with the master. To load the Master Web UI, navigate to
ec2-54-227-127-14.compute-1.amazonaws.com:8080 (again, remember to replace this domain name
with your own master domain name). You should see something similar to the
following screenshot showing the example you ran as well as the two console
applications you launched:
Remember that you will be charged by Amazon for usage of the cluster. Don't forget to
stop or terminate this test cluster once you're done with it. To do this, you can first
exit the ssh session by typing exit to return to your own local system and then run
the following command:
>./ec2/spark-ec2 -k spark -i spark.pem destroy test-cluster
Summary
In this chapter, we covered how to set up Spark locally on our own computer as
well as in the cloud as a cluster running on Amazon EC2. You learned the basics of
Spark's programming model and API using the interactive Scala console, and we
wrote the same basic Spark program in Scala, Java, and Python.
In the next chapter, we will consider how to go about using Spark to create a
machine learning system.