Big Data Analysis Using Amazon Web Services and Support Vector Machines
A Writing Project
Presented to
The Faculty of the Department of Computer Science
San José State University
In Partial Fulfillment
of the Requirements for the Degree
Master of Science
by
Dhruv Jalota
May 2013
COPYRIGHT 2013 DHRUV JALOTA
APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE
ABSTRACT
Big Data Analysis Using Amazon Web Services and Support Vector Machines
by Dhruv Jalota
This writing project applies the supervised machine learning technique known as Support Vector Machines (SVM) to a large labeled data set, attempts to classify an unlabeled data set using the model trained on the labeled data, and then analyzes the various results obtained using different Amazon Elastic Compute Cloud instances, sizes of input data set, and different parameters or kernels of the SVM tool. The given data set is relatively large for SVM and for the tool being used, known as libsvm, having approximately 1.3 million training examples and 341 attributes with binary classification labels, i.e. true (+1) and false (-1). By using the open source tool and deploying it to the cloud, we make use of the computing power available to get the best possible results for classification. We conclude with a detailed analysis of the performance of all the experiments conducted and draw conclusions from these results.
ACKNOWLEDGMENT
I would like to thank Dr. Tseng for his technical guidance and his constant support throughout my writing project. Dr. Tseng always gave me kind words, even through the toughest times. I would also like to thank Dr. Pollett and Mr. Sumeet Trehan for being my committee members.
TABLE OF CONTENTS
1. PROBLEM DESCRIPTION....................................................................................................8
5. CONCLUSION........................................................................................................................40
6. REFERENCES.........................................................................................................................41
List of Tables and Figures
Table 1. Comparison of EC2 instances 10-11
Figure 1. Linear Separating Hyperplane 12
Figure 2. The “outlier” case 13
Figure 3. Outlier causes hyperplane to shift drastically 14
Figure 4. AWS console 16
Figure 5. AWS zones 17
Figure 6. Launch an instance 18
Figure 7. Launch an instance step 2 18
Figure 8. Launch an instance step 3 19
Figure 9. Launch an instance step 4 19
Figure 10. Launch an instance step 5 20
Figure 11. Launch an instance step 6 20
Figure 12. Launch an instance step 7 21
Figure 13. Launch an instance step 8 21
Figure 14. Launch an instance step 9 22
Figure 15. Launch an instance step 10 22
Figure 16. Optimal margin classifier is the black separating hyperplane 24
Figure 17. Non-linear separating hyperplane 26
Table 2. Comparison of Training and Classification outputs from libsvm 34-35
Figure 18. Graph of Training Examples vs Training Time for Various Kernels 36
Figure 19. Graph of Training Examples vs Classification Time for Various Kernels 37
Figure 20. Graph of Training Examples vs Accuracy Rates for Various Kernels 38
Figure 21. Graph of Training Examples vs Accuracy Rates for Various Kernels 38
Table 3. Cost spent on AWS hours for all experiments of various kernels 39
1. Problem Description
The two data sets were provided to us by IBM as part of ‘The Great Minds Challenge’. The entire data set is a part of what is actually used, for learning purposes, by IBM’s Watson supercomputer, the same system that competed on the ‘Jeopardy!’ television show. The first data set has 1,314,407 rows and 343 columns. These 1.3 million rows are known as feature vectors or training examples and the 343 columns are known as features or attributes. Of these 343 columns, the first column is a “question id” and the last column is a label.
According to IBM’s description provided for this data set [1], “For each question that
Watson answers many possible answers are generated using standard information retrieval
techniques. These "candidate answers" along with the corresponding question are fed to a series
of "scorers" that evaluate the likelihood that the answer is a correct one. These "features" are
then fed into a machine learning algorithm that learns how to appropriately weight these features.
Each row in the file represents a possible answer to a question. The row contains the question
identifier (i.e. the question that it was a candidate answer to), the feature scores and it also
contains a label indicating whether it is the right answer. The vast majority of rows in the file are
for wrong answers with a smaller percentage being the correct answer. The file is in CSV format
and is a comma delimited list of feature scores. The two important "columns" in the file are the
first column that contains a unique question id and the last column that contains the label.
Candidate answers to the same question share a common question id. The label is true for a right
answer and false for an incorrect answer. Note that some questions may not have a correct
answer.”
This file of labeled data was provided to us as a plain text (CSV) file of size 3.23 GB.
What is Amazon Web Services, or AWS as it is popularly known? Recently the terms “Cloud Computing” and “Big Data” have gained tremendous importance, because the computing and analytical power now available has grown enormously compared with the past, when large scientific data crunching could be achieved only by the likes of supercomputers. In today’s world, powerful computing infrastructure can be rented on a pay-per-use basis from companies like Amazon (AWS is one of many such services) for processing complex data of high importance. The computing is done in this “cloud”, so called because the user never sees it himself but only accesses it over the internet, and the data being processed is “big” in the sense of either size or complexity and importance.
The wide-spread use and adoption of these services has made them affordable to the organizations and individuals that require them, and AWS is among the most popular of these services. AWS offers many products including computing, networking, content delivery, and storage [2]. We will be focusing on the computing service known as Elastic Compute Cloud (EC2).
We have used EC2 instances to store and analyze our data for the classification purposes. The table below [3] provides a comparison of the various EC2 options available; we will discuss our choice of instance shortly.
Table 1. Comparison of EC2 instances (columns: Name, Memory, Compute Units, Linux cost per hour).
In the table above, one EC2 compute unit is equivalent to a 1.0-1.2 GHz 2007 Intel Xeon or AMD Opteron processor [6].
As we can see, there are currently 18 different EC2 instance types available, but not all are affordable or suitable for this project. Initially, in an effort to keep costs low, I used the “M1 Large” machine, since its RAM was almost double that available on my laptop, but it was unable to process the large data sets in a relatively short amount of time.
From observation, I saw that the libsvm tool used up to 25% of the RAM for each instance of its process, and 99% of the CPU. Using the highest level of EC2 machine types was not feasible due to the budget limit of $100 per instance from the Amazon educational grant funds; at a rate of $4 per hour, such a machine could be used for only 25 hours. So I settled on the mid-tier machine “M3 Double Extra Large” due to its higher RAM and an affordable hourly rate, which allowed me to run a large number of experiments within the budget.
1.3 Brief Introduction to Support Vector Machines
Support Vector Machines (SVM) is a technique for supervised machine learning that has gained popularity over the last two decades. It was originally invented by Vladimir N. Vapnik, and the now standard implementation was created by Vapnik and Corinna Cortes [4]. It has proven to be quite useful in classification tasks, particularly for the case of binary classes.
Essentially, SVM, which is also called a “large margin classifier”, establishes an optimal separating hyperplane to divide points in the feature space into one of two classes (Figure 1).
This is typical for linearly separable data, and can be susceptible to the “outliers” case (Figure 2).
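In the notation used by the source of Figure 1 [5] (a sketch in standard notation, where θ denotes the parameter vector learned from the training examples), the classifier and its separating hyperplane can be written as:

\[
h_{\theta}(x) = g\!\left(\theta^{T} x\right), \qquad \theta^{T} x = 0 \;\;\text{(the separating hyperplane)},
\]

with a point x predicted to belong to the positive class when θᵀx ≥ 0. The farther a point lies from the hyperplane, the more confident the classification, which is the intuition behind points A, B and C discussed below.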
To enhance this method for non-linearly separable data, SVMs use a technique known as the “kernel trick”, which establishes the separating hyperplane for data in high-dimensional feature spaces.

Figure 1. Linear Separating Hyperplane [5]
In the figure above, we see the separating hyperplane which has divided the points into the two classes (crosses and circles), and hence established positive or negative classifications respectively. In this case, when new points “A”, “B” and “C” are introduced to the system, it is relatively more confident about the classification for A than for B, and has the least confidence for C. But this can be sensitive to the “outliers” case, as seen below.
Here, though the data has been linearly separated by the hyperplane, there is one point which falls in the wrong category; this is an outlier. Such a point would cause drastic changes to the hyperplane, which would then no longer be as optimally spaced in terms of margins as before (Figure 3).
Figure 3. Outlier causes hyperplane to shift drastically. [7]
There is a link between SVMs and Perceptrons, due to the linear classifier concept. [4]
Why are we using the AWS services here, especially since they can be expensive for high-powered machines? When the project began, I experimented on my own laptop, a MacBook Pro with a 2GHz Intel Core i7 processor and 4GB of memory. After researching the several tools available for applying SVM and trying out a few, MATLAB proved to be the most robust for our purposes, mainly because of the size of the data. As we shall see later, this was one of the major challenges in this project, and crossing this hurdle in phase 1 was achieved by using MATLAB.
Other than MATLAB, the rest of the tools out there are mostly open-source, and they have their own requirements for the format of the data to be input. MATLAB was the only one which could read the CSV file directly (the format in which we received the data) and produce a matrix in its workspace for our manipulation. However, once the entire data set (3.2GB raw size) was loaded into memory, performing the required matrix manipulations and using the SVM toolbox with it would drastically reduce the performance of the machine, and dealing with these machine limitations became a constant struggle.
This is where AWS came to the rescue: it gave us a relatively higher-powered machine (we went with the mid-tier machines, ranging in cost from $0.40 to $1.00 per hour) that was able to deal with the large data set with ease, so we could better conduct our SVM analysis of the data.
For my project, after trying out several of the cheaper machines available, I settled on the “M3 Double Extra Large” machine since it was not too expensive and gave me enough memory to be able to run a lot of training and classification experiments simultaneously, which sped up the overall process considerably.
I will now walk through the process of setting up a machine in the cloud, or launching an
instance as it is known.
First, you need an Amazon AWS account to be able to use their services. You can sign up for this by going to the web site http://aws.amazon.com and clicking on the “Sign Up” button. After signing up with an email address and credit card, Amazon conducts an automated identity verification by calling your phone number and giving you a unique PIN, which you use to get access to your account on first-time use only. Once these initial steps are complete, you are taken to the AWS Management Console (Figure 4).

Figure 4. AWS console.
We can see here on the console screen the various services offered by AWS. We are interested in the EC2 service, as seen on the left side of the screen. To launch an instance, we click on the EC2 link. By default, AWS picks the “U.S. East” zone for where the servers are hosted, as seen by the “N. Virginia” text in the top right corner. You can choose any zone you prefer, but instances in this zone are generally cheaper. The various zones available are as shown below.
Figure 5. AWS zones.
We select “Classic Wizard” and click on “Continue” (Figure 6). On the next screen we select the first option, with 64-bit, and then press “Select” (Figure 7). Next, we select the “m3.2xlarge” option from the drop-down list and click on “Continue” (Figure 8). The remaining screens of the wizard are completed in the same way, as shown in Figures 9 through 15.
Figure 6. Launch an instance.
Figure 7. Launch an instance step 2.
Figure 8. Launch an instance step 3.
Figure 9. Launch an instance step 4.
Figure 10. Launch an instance step 5.
Figure 11. Launch an instance step 6.
Figure 12. Launch an instance step 7.
Figure 13. Launch an instance step 8.
Figure 14. Launch an instance step 9.
Figure 15. Launch an instance step 10.
We obtain the IP address of this instance from the EC2 dashboard. On my Mac, I use the ‘Terminal’ program and enter an ssh command to access my instance. But before the instance can be successfully accessed, the file permissions on the key-pair need to be restricted so that only the owner can read it. Then we can access the instance with ssh, as sketched below.
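A minimal sketch of such a session, assuming a key file named my-keypair.pem and a placeholder public DNS name (both illustrative), is:

# restrict the downloaded key pair so that only the owner can read it
> chmod 400 my-keypair.pem
# connect to the instance as the default Amazon Linux user
> ssh -i my-keypair.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com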
To ensure our instance uses the entire 100GB from its EBS (Elastic Block Storage) volume, the file system on the root device may need to be grown to fill the volume after the first boot; a sketch of this check follows.
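Assuming an ext-based root file system (the device name below is illustrative and may differ per instance), the check and resize look like:

# see how much space the file system currently reports
> df -h
# grow the ext4 file system to fill the attached EBS volume
> sudo resize2fs /dev/xvda1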
3. Support Vector Machines (SVM)
As discussed in the earlier section on SVM, I had mentioned that SVM establishes the optimal margin hyperplane for the data; this is illustrated in the figure below. As we can see, the black separating hyperplane has the maximum margin from the closest points of either class (these closest points are known as the support vectors).
Figure 16. Optimal margin classifier is the black separating hyperplane. [7]
I will now illustrate mathematically how SVM performs the optimization to arrive at the optimal margin classifier. This is the SVM cost function for the linear case, given in [7].
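Written out in the standard regularized hinge-loss notation (assumed here to match the form used by [7]), the objective is:

\[
\min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)}\,\mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right) + \left(1 - y^{(i)}\right)\mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
\]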
where θ is the parameter vector, x(i) is the i’th training example with label y(i), cost1 and cost0 are the loss terms for positive and negative examples respectively, C is the penalty (regularization) parameter, m is the number of training examples and n is the number of features.
For the non-linear case, the optimization objective is as follows [7].
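Assuming the same hinge-loss form as above, with the kernel-derived features f(i) in place of the raw inputs x(i), this reads:

\[
\min_{\theta}\; C \sum_{i=1}^{m} \left[ y^{(i)}\,\mathrm{cost}_1\!\left(\theta^{T} f^{(i)}\right) + \left(1 - y^{(i)}\right)\mathrm{cost}_0\!\left(\theta^{T} f^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
\]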
where ‘f(i)’ is the feature vector of the i’th example; these features are higher-order functions (for example polynomials) of the attributes, derived through the kernel function used. The rest of the variables are the same as before. This leads to a non-linear separating hyperplane, as shown in Figure 17.
Figure 17. Non-linear separating hyperplane.
The kernel options available in the ‘libsvm’ tool (described later) are the linear, polynomial [8], radial basis function (RBF) [8] and sigmoid [8] kernels, whose functional forms are listed below.
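The forms below follow the definitions given in the libsvm documentation, where γ, r and d are kernel parameters set on the command line:

\[
\begin{aligned}
\text{linear: } & K(x_i, x_j) = x_i^{T} x_j \\
\text{polynomial: } & K(x_i, x_j) = \left(\gamma\, x_i^{T} x_j + r\right)^{d}, \quad \gamma > 0 \\
\text{RBF: } & K(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right), \quad \gamma > 0 \\
\text{sigmoid: } & K(x_i, x_j) = \tanh\!\left(\gamma\, x_i^{T} x_j + r\right)
\end{aligned}
\]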
These kernel functions are basically similarity functions for the features which are mapped from
the attributes.
In phase 1 of the project, I was using the student version of MATLAB 2012a on my
MacBook Pro, with the “Bioinformatics” toolbox which includes the svm functions that we need
to use.
To get the data into the workspace, we use the “Import Data” tool from the File menu and select our CSV file with the labeled data set. We set the formatting options (the column separator is set to “Comma”) and then the data is imported. This creates a matrix which we can then manipulate in the workspace.
I then wrote MATLAB code to generate my svm training model, an ‘svmstruct’ object, from this matrix, and further code to classify data using the generated ‘svmstruct’ model. This gives us a ‘class’ variable which contains the required predictions of classification of the unlabeled data.
As I mentioned earlier, this was a very slow process due to the limitations of my laptop; on my machine, I was never able to successfully train on the entire data set. The memory restrictions encountered, and the lessons learned from this phase, were applied in phase 2 of the project, when using AWS.
3.3 Tutorial on LIBSVM
The open source tool that I used for my svm training and classification is called libsvm, one of the most widely used open source tools for applying svm. The current version is 3.17, which is the one that I used.
My AWS instance was a Linux based one, and these are the steps I used to install and use libsvm on it. First, we transfer the libsvm files over to the instance (after downloading them to the local machine) using scp, with a destination of the form ec2-user@<instance-address>:/home/ec2-user/
> cd libsvm-3.17
> make
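Putting the transfer and build steps together, a typical sequence looks like the sketch below (the key file, host name and archive name are illustrative):

# copy the libsvm archive from the local machine to the instance
> scp -i my-keypair.pem libsvm-3.17.tar.gz ec2-user@<instance-address>:/home/ec2-user/
# on the instance: unpack the archive and build the svm-train and svm-predict binaries
> tar xzf libsvm-3.17.tar.gz
> cd libsvm-3.17
> make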
Reading the README file within this folder gives a good description of how to use the
software. The key part is the data format, which we will come to next.
3.4 Converting the Data to Libsvm Format
The data set given to us by IBM was in the format described earlier, where the first column was a “question id”, the next 341 columns were features (real numbers), and the final column was a label (“true” or “false”), and there were approximately 1.3 million rows of data.
Libsvm expects each line of its input to be of the form <label> <index1>:<value1> <index2>:<value2> ..., where, according to the libsvm documentation [9], “Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label. The pair <index>:<value> gives a feature (attribute) value: <index> is an integer starting from 1 and <value> is a real number.”
To achieve this conversion I wrote a Perl script, the code for which is below.
#!/usr/bin/perl
use strict;
use warnings;

# directory containing the labeled CSV files to be converted
my $my_dir = "/Users/Dhruv/Documents/MATLAB/data/labelled_2000_lines/";

opendir my $dh, $my_dir or die "Cannot open $my_dir: $!";

while ( defined( my $filename = readdir $dh ) ) {
    # skip '.', '..' and any other hidden entries
    next if $filename =~ /^\./;

    open my $svmfile, "<", "$my_dir$filename" or die "$!";
    open my $outfile, ">>", "out" or die "$!";

    while ( my $line = <$svmfile> ) {
        chomp $line;
        my @splitline = split( ',', $line );

        # the first column is the question id (not used in the output)
        # and the last column is the textual label
        my $qid   = shift @splitline;
        my $label = pop @splitline;

        # convert the textual label to the numeric labels libsvm expects
        my $newlabel;
        if ( $label eq "false" ) {
            $newlabel = '-1';
        } elsif ( $label eq "true" ) {
            $newlabel = '+1';
        } else {
            $newlabel = '0';
        }

        # re-emit the remaining columns as <index>:<value> pairs, indexed from 1
        my @newline;
        my $i = 1;
        foreach my $value (@splitline) {
            push( @newline, "$i:$value " );
            $i += 1;
        }
        unshift( @newline, "$newlabel " );
        print $outfile @newline, "\n";
    }
    close($svmfile);
    close($outfile);
}
closedir $dh;
exit;
This code opens each labeled data set file, reads it line-by-line, checks what the label assigned is, converts it to +1 for true, -1 for false and 0 otherwise, drops the question id column, rearranges the remaining data with the required feature index values, and then writes it out to a new file, again line-by-line. This successfully converts the IBM supplied data to the required format of libsvm.
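Running the script and sanity-checking its output can be done as follows (the script name is illustrative; the script itself appends to a file called ‘out’):

# convert all CSV files in the configured directory
> perl convert_to_libsvm.pl
# confirm the number of converted rows and inspect the first one
> wc -l out
> head -n 1 out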
4. Execution and Analysis
For running my training and classification of data, I used six subsets of the entire data set and the entire data set as a whole, making seven data sets in total. For this, I divided the original data set into subsets of 2,000 rows, 10,000 rows, 50,000 rows, 100,000 rows, 200,000 rows and 500,000 rows, in addition to the entire data set of almost 1.3 million rows. I used the linux ‘split’ command to create these subsets.
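A sketch of how such fixed-size subsets can be produced with ‘split’ (file names are illustrative):

# cut the converted libsvm file into pieces of 50,000 lines each,
# named subset_aa, subset_ab, and so on
> split -l 50000 labeled_all.libsvm subset_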
As per the documentation of libsvm [9], training is achieved by using the ‘svm-train’ command, whose usage is ‘svm-train [options] training_set_file [model_file]’. The kernel is selected with the ‘-t kernel_type’ option (0 = linear, 1 = polynomial, 2 = radial basis function, 3 = sigmoid; default 2), and further options include:
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking : whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates : whether to train a SVC or SVR model for probability estimates, 0 or 1
(default 0)
-wi weight : set the parameter C of class i to weight*C, for C-SVC (default 1)
-v n: n-fold cross validation mode
-q : quiet mode (no outputs)
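A typical training and classification run, timed with the linux ‘time’ command (file names are illustrative), therefore looks like:

# train an RBF-kernel model (-t 2) on the 50,000-row subset
> time ./svm-train -t 2 subset_50k model_50k_rbf
# classify an unseen subset with the trained model; svm-predict reports the accuracy
> time ./svm-predict test_50k model_50k_rbf predictions_50k_rbf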
I ran training runs with the 4 different kernel options available, and each had different
training time, classification time, and accuracy rate for the various data set sizes.
For the six subsets, I trained and classified on data sets of the same size, but the training set had been seen by the system while the set being classified was unseen. For training on the entire data set, classification was likewise performed on unseen data.
I present here a table comparing my various outputs from the training and classification
experiments.
Table 2. Comparison of Training and Classification outputs from libsvm (columns: Kernel Type, Data Set Size in rows, Training Time in hh:mm:ss, Classification Time in hh:mm:ss, Classification Accuracy in %).
Note:
1) The classification accuracy is calculated by the libsvm tool, and forms part of the output.
2) The time taken to train and classify was found by using the linux ‘time’ command when executing the train and classify commands; this is the real elapsed time reported by the ‘time’ command.
3) The kernels are varied by using the kernel option in libsvm as described in the ‘svm-train’ usage earlier.
4) RBF training on the full data set takes more than 100 hours, which is the maximum credit I can use for a single instance, so that run could not be completed.
Figure 18. Graph of Training Examples vs Training Time for Various Kernels
As we can see above, the RBF kernel took the most time to train (even though it has only 6 data points); since the kernel function it uses is an exponential one, its training-time curve also increases exponentially.
Figure 19. Graph of Training Examples vs Classification Time for Various Kernels
As we can see here, the classification time for the RBF kernel is again the highest and again increases exponentially.
For the two graphs below, we can see that the RBF kernel gives the best classification accuracy and the polynomial kernel is the most erratic. Also, the fluctuations in the linear kernel’s output suggest that the data is not linearly separable.
Figure 20. Graph of Training Examples vs Accuracy Rates for Various Kernels
Figure 21. Graph of Training Examples vs Accuracy Rates for Various Kernels
Table 3. Cost spent on AWS hours for all experiments of various kernels (columns: Kernel Type, Training Cost (approx.), Classification Cost (approx.), Total Cost).
As can be seen from the above table, RBF cost the most in terms of dollars spent on AWS, whereas Sigmoid and Polynomial were the least expensive. We see the graph of comparisons below.
Graph of Training, Classification and Total AWS costs (in dollars) for the RBF, Linear, Sigmoid and Polynomial kernels.
5. Conclusion
As has been seen from this project, the major challenge in such a task was dealing with
the large data set size. Once the hurdle of machine power was crossed using AWS, data
preparation was crucial for the tool being used. The libsvm tool, and the SVM technique in
general, has proven to be quite effective for this task of classification. Using its powerful method of kernels, it can deal with complicated data that cannot be linearly separated. Though these kernels are computationally expensive for large data sets, AWS offers an affordable solution to the average user.
The RBF kernel proved to be the most expensive in cost and took the most time to train and classify, but gave the best classification results, whereas the Sigmoid kernel proved to be the most cost-effective, giving consistent results without taking long to train or classify and being cheap in AWS costs.
6. References
[1] IBM 'The Great Minds Challenge' introduction documentation for technical pilot at SJSU.
[6] http://stackoverflow.com/questions/11253746/amazon-ec2-compute-unit-and-gceu-google-
10, 2013