Scala Data Analysis Cookbook - Sample Chapter
Starting with introductory recipes on utilizing the Breeze and Spark libraries, the book then gets to grips with
how to import data from a host of possible sources and how to pre-process numerical, string, and date data.
Next, you'll gain an understanding of concepts that will help you visualize data using the Apache Zeppelin
and Bokeh bindings in Scala, enabling exploratory data analysis. Discover how to program quintessential
machine learning algorithms using the Spark ML library. Work through steps to scale your machine learning
models and deploy them in a standalone cluster, EC2, YARN, and Mesos.
This book will introduce you to the most popular Scala data analysis tools, libraries, and frameworks
through practical recipes around loading, manipulating, and preparing your data.
$ 44.99 US
28.99 UK
Arun Manivannan
Preface
JVM has become a clear winner in the race between different methods of scalable data
analysis. The power of JVM, strong typing, simplicity of code, composability, and availability of
highly abstracted distributed and machine learning frameworks make Scala a clear contender
for the top position in large-scale data analysis. Thanks to its dynamic-looking, yet static type
system, scientists and programmers coming from Python backgrounds feel at ease with Scala.
This book aims to provide easy-to-use recipes in Apache Spark, a massively scalable
distributed computation framework, and Breeze, a linear algebra library on which Spark's
machine learning toolkit is built. The book will also help you explore data using interactive
visualizations in Apache Zeppelin.
Other than the handful of frameworks and libraries that we will see in this book, there's a
host of other popular data analysis libraries and frameworks that are available for Scala.
They are by no means lesser beasts, and they could actually fit our use cases well.
Unfortunately, they aren't covered as part of this book.
Apache Flink
Apache Flink (http://flink.apache.org/), just like Spark, has first-class support
for Scala and provides features that are strikingly similar to Spark. Real-time streaming
(unlike Spark's mini-batch DStreams) is its distinctive feature. Flink also provides a machine
learning and a graph processing library and runs standalone as well as on the YARN cluster.
Scalding
Scalding (https://github.com/twitter/scalding) needs no introduction: it is Scala's
idiomatic approach to writing Hadoop MR jobs.
Saddle
Saddle (https://saddle.github.io/) is the "pandas" (http://pandas.pydata.org/)
of Scala, with support for vectors, matrices, and DataFrames.
Spire
Spire (https://github.com/non/spire) has a powerful set of advanced numerical
types that are not available in the default Scala library. It aims to be fast and precise in
its numerical computations.
Akka
Akka (http://akka.io) is an actor-based concurrency framework that has actors as its
foundation and unit of work. Actors are fault tolerant and distributed.
Accord
Accord (https://github.com/wix/accord) is a simple, yet powerful, validation library
in Scala.
Getting Started with Breeze
In this chapter, we will cover the following recipes:
Introduction
This chapter gives you a quick overview of one of the most popular data analysis libraries in
Scala, how to get it, and its most frequently used functions and data structures.
We will be focusing on Breeze in this first chapter, which is one of the most popular and
powerful linear algebra libraries. Spark MLlib, which we will be seeing in the subsequent
chapters, builds on top of Breeze and Spark, and provides a powerful framework for scalable
machine learning.
How to do it...
Let's add the Breeze dependencies into our build.sbt so that we can start playing with
them in the subsequent recipes. The Breeze dependencies are just two: the breeze (core)
and the breeze-natives dependencies.
1. Under a brand new folder (which will be our project root), create a new file called
build.sbt.
2. Next, add the breeze libraries to the project dependencies:
organization := "com.packt"
name := "chapter1-breeze"
scalaVersion := "2.10.4"
libraryDependencies ++= Seq(
"org.scalanlp" %% "breeze" % "0.11.2",
//Optional - the 'why' is explained in the How it works section
"org.scalanlp" %% "breeze-natives" % "0.11.2"
)
3. From that folder, issue a sbt compile command in order to fetch all your
dependencies.
Chapter 1
You could import the project into your Eclipse using sbt eclipse
after installing the sbteclipse plugin (https://github.com/typesafehub/sbteclipse/).
For IntelliJ IDEA, you just need to import
the project by pointing to the root folder where your build.sbt file is.
There's more...
Let's look into the details of what the breeze and breeze-natives library dependencies we
added bring to us.
If you are running a Mac, you are in luck: native BLAS libraries come out of the box on Macs.
Installing native BLAS on Ubuntu / Debian involves just running the following commands:
sudo apt-get install libatlas3-base libopenblas-base
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so.3
Creating vectors:
Vector arithmetic:
Scalar operations
Standard deviation
Finding the sum, square root and log of all the values in the vector
Getting ready
In order to run the code, you could use either the Scala REPL or the Worksheet feature available
in the Eclipse Scala plugin (or Scala IDE) or in IntelliJ IDEA. These options are
suggested because of their quick turnaround time.
How to do it...
Let's look at each of the above sub-recipes in detail. For easier reference, the output of the
respective command is shown as well. All the classes that are being used in this recipe are
from the breeze.linalg package. So, an "import breeze.linalg._" statement at
the top of your file would be perfect.
Creating vectors
Let's look at the various ways we could construct vectors. Most of these construction
mechanisms are through the apply method of the vector. There are two different flavors
of vector, breeze.linalg.DenseVector and breeze.linalg.SparseVector, and the
choice of the vector depends on the use case. The general rule of thumb is that if you have
data that is at least 20 percent zeroes, you are better off choosing SparseVector, although
that 20 percent threshold itself varies with the use case.
Creating a dense vector from values: Creating a DenseVector from values is just
a matter of passing the values to the apply method:
val dense=DenseVector(1,2,3,4,5)
println (dense) //DenseVector(1, 2, 3, 4, 5)
Creating a sparse vector from values: Creating a SparseVector from values is also
through passing the values to the apply method:
val sparse=SparseVector(0.0, 1.0, 0.0, 2.0, 0.0)
println (sparse) //SparseVector((0,0.0), (1,1.0), (2,0.0), (3,2.0), (4,0.0))
val sparseZeros=SparseVector.zeros[Double](5)
//SparseVector()
Not surprisingly, the SparseVector does not allocate any memory for the contents of
the vector. However, the creation of the SparseVector object itself is accounted for in
the memory.
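The storage behavior described above can be observed with a minimal sketch. It assumes Breeze 0.11.x on the classpath; the object name SparseStorageSketch is only illustrative, and activeSize is read here as the number of entries the vector explicitly stores.

```scala
import breeze.linalg.SparseVector

object SparseStorageSketch {
  // A zero-initialised sparse vector has a logical length of 5 but stores no entries.
  val sparseZeros = SparseVector.zeros[Double](5)
  val storedBefore = sparseZeros.activeSize

  // Writing one non-zero value makes the vector store exactly one entry.
  sparseZeros(2) = 1.5
  val storedAfter = sparseZeros.activeSize

  def main(args: Array[String]): Unit =
    println(s"length=${sparseZeros.length}, stored before=$storedBefore, stored after=$storedAfter")
}
```

The logical length stays at 5 throughout; only the number of stored entries changes.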
Just like the range function, which has all the arguments as integers, there is also a rangeD
function that takes the start, stop, and the step parameters as Double:
val rangeD=DenseVector.rangeD(0.5, 20, 2.5)
// DenseVector(0.5, 3.0, 5.5, 8.0, 10.5, 13.0, 15.5)
Vector arithmetic
Now let's look at the basic arithmetic that we could do on vectors with scalars and vectors.
Scalar operations
Operations with scalars work just as we would expect, propagating the value to each element
in the vector.
Adding a scalar to each element of the vector is done using the + function (surprise!):
val inPlaceValueAddition=evenNosTill20 +2
//DenseVector(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
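As a minimal sketch of scalar propagation, the following assumes that evenNosTill20 holds the even numbers 0 through 18 (its definition is not shown in this excerpt) and that Breeze 0.11.x is on the classpath. The object name is illustrative.

```scala
import breeze.linalg.DenseVector

object ScalarOpsSketch {
  // Assumed definition: the even numbers 0, 2, ..., 18.
  val evenNosTill20 = DenseVector.range(0, 20, 2)

  // Each operator applies the scalar to every element and returns a new vector;
  // the original vector is left untouched.
  val plusTwo = evenNosTill20 + 2
  val minusTwo = evenNosTill20 - 2
  val timesTwo = evenNosTill20 * 2

  def main(args: Array[String]): Unit = {
    println(plusTwo)
    println(minusTwo)
    println(timesTwo)
  }
}
```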
We'll create another vector from 0 to 5 with a step value of 1 (a fancy way of saying 0
through 4):
val zeroThrough4=DenseVector.range(0, 5, 1)
//DenseVector(0, 1, 2, 3, 4)
There's an interesting behavior encapsulated in the addition though. Assuming you try to add
two vectors of different lengths, if the first vector is smaller and the second vector larger, the
resulting vector would be the size of the first vector and the rest of the elements in the second
vector would be ignored!
val fiveLength=DenseVector(1,2,3,4,5)
//DenseVector(1, 2, 3, 4, 5)
val tenLength=DenseVector.fill(10, 20)
//DenseVector(20, 20, 20, 20, 20, 20, 20, 20, 20, 20)
fiveLength+tenLength
//DenseVector(21, 22, 23, 24, 25)
On the other hand, if the first vector is larger and the second vector smaller, it would result
in an ArrayIndexOutOfBoundsException:
tenLength+fiveLength
// java.lang.ArrayIndexOutOfBoundsException: 5
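Because the outcome of adding vectors of different lengths depends on the operand order, one option is to check the lengths up front. The sketch below uses a hypothetical helper, safeAdd, which is not part of Breeze; it simply wraps the + operator with a require call.

```scala
import breeze.linalg.DenseVector
import scala.util.Try

object LengthCheckedAddSketch {
  // Hypothetical helper (not a Breeze API): refuse to add vectors of different lengths.
  def safeAdd(a: DenseVector[Int], b: DenseVector[Int]): DenseVector[Int] = {
    require(a.length == b.length, s"length mismatch: ${a.length} vs ${b.length}")
    a + b
  }

  val fiveLength = DenseVector(1, 2, 3, 4, 5)
  val tenLength = DenseVector.fill(10, 20)

  // Same lengths: behaves exactly like the plain + operator.
  val sameLengthSum = safeAdd(fiveLength, DenseVector(10, 20, 30, 40, 50))

  // Different lengths: fails fast in either order instead of silently truncating.
  val mismatched = Try(safeAdd(fiveLength, tenLength))

  def main(args: Array[String]): Unit = {
    println(sameLengthSum)
    println(mismatched.isFailure)
  }
}
```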
No surprise here. There is also the horzcat method that places the second vector
horizontally next to the first vector, thus forming a matrix.
val concatVector1=DenseVector.horzcat(zeroThrough4, justFive2s)
//Returns a breeze.linalg.DenseMatrix[Int] with one column per vector
While dealing with vectors of different length, the vertcat function happily
arranges the second vector at the bottom of the first vector. Not surprisingly,
the horzcat function throws an exception:
java.lang.IllegalArgumentException, meaning all vectors must be
of the same size!
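The two concatenation functions can be compared with a short sketch. The definition of justFive2s is not shown in this excerpt, so a vector of five 2s is assumed here; the object name is illustrative.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

object ConcatSketch {
  val zeroThrough4 = DenseVector.range(0, 5, 1)
  // Assumed definition: five elements, all equal to 2.
  val justFive2s = DenseVector(2, 2, 2, 2, 2)

  // vertcat stacks the vectors end to end into one longer vector.
  val stacked: DenseVector[Int] = DenseVector.vertcat(zeroThrough4, justFive2s)

  // horzcat uses each vector as one column of a matrix.
  val sideBySide: DenseMatrix[Int] = DenseVector.horzcat(zeroThrough4, justFive2s)

  def main(args: Array[String]): Unit = {
    println(stacked)
    println(sideBySide)
  }
}
```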
Now, let's briefly look at how to calculate some basic summary statistics for a vector.
Mean and variance
Calculating the mean and variance of a vector could be achieved by calling the
meanAndVariance universal function in the breeze.stats package. Note that
this needs a vector of Double:
meanAndVariance(evenNosTill20Double)
//MeanAndVariance(9.0,36.666666666666664,10)
Standard deviation
Calling stddev on a Double vector gives the standard deviation:
stddev(evenNosTill20Double)
//Double = 6.0553007081949835
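The relationship between the two results can be checked with a small sketch: the standard deviation should be the square root of the variance reported by meanAndVariance. The vector below is an assumed definition of evenNosTill20Double (the even numbers 0 through 18 as Doubles), and the mean and variance field names are those of the result printed above.

```scala
import breeze.linalg.DenseVector
import breeze.stats._

object SummaryStatsSketch {
  // Assumed definition: the even numbers 0.0, 2.0, ..., 18.0.
  val evenNosTill20Double =
    DenseVector(0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0)

  val mv = meanAndVariance(evenNosTill20Double)
  val sd = stddev(evenNosTill20Double)

  // The standard deviation recomputed from the variance, for comparison.
  val sdFromVariance = math.sqrt(mv.variance)

  def main(args: Array[String]): Unit =
    println(s"mean=${mv.mean}, variance=${mv.variance}, stddev=$sd")
}
```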
Finding the sum, square root and log of all the values in the vector
The same as with max, the sum universal function inside the breeze.linalg package
calculates the sum of the vector:
val intSumOfVectorVals=sum (evenNosTill20)
//90
The functions sqrt, log, and various other universal functions in the breeze.numerics
package calculate the square root and log values of all the individual elements inside
the vector:
The Sqrt function
val sqrtOfVectorVals= sqrt (evenNosTill20)
// DenseVector(0.0, 1.4142135623730951, 2.0, 2.449489742783178,
// 2.8284271247461903, 3.1622776601683795, 3.4641016151377544,
// 3.7416573867739413, 4.0, 4.242640687119285)
How to do it...
There are a variety of functions that we have in a matrix. In this recipe, we will look at some
details around:
Creating matrices:
Addition
Standard deviation
Finding the sum, square root and log of all the values in the matrix
Creating matrices
Let's first see how to create a matrix.
Creating a matrix from values
The simplest way to create a matrix is to pass in the values in a row-wise fashion into the
apply function of the matrix object:
val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))
//Returns a DenseMatrix[Int]
// 1   2   3
// 11  12  13
// 21  22  23
There's a sparse version of the matrix too: the Compressed Sparse Column Matrix
(CSCMatrix):
val sparseMatrix=CSCMatrix((1,0,0),(11,0,0),(0,0,23))
//Returns a SparseMatrix[Int]
(0,0) 1
(1,0) 11
(2,2) 23
val compressedSparseMatrix=CSCMatrix.zeros[Double](5,4)
//Returns a CSCMatrix[Double] = 5 x 4 CSCMatrix
Notice how the SparseMatrix doesn't allocate any memory for the
values in the zero value matrix.
//Returns a DenseMatrix[Double] =
// 0.0  0.0  0.0  0.0
// 0.0  1.0  2.0  3.0
// 0.0  2.0  4.0  6.0
// 0.0  3.0  6.0  9.0
// 0.0  4.0  8.0  12.0
The type parameter is needed only if you would like to convert the type of the matrix from
an Int to a Double. So, the following call without the parameter would just return an
Int matrix:
val denseTabulate=DenseMatrix.tabulate(5,4)((firstIdx,secondIdx)=>firstIdx*secondIdx)
//Returns a DenseMatrix[Int] =
// 0  0  0  0
// 0  1  2  3
// 0  2  4  6
// 0  3  6  9
// 0  4  8  12
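The effect of the type parameter can be seen side by side in a minimal sketch. The object and value names are illustrative; the only Breeze call used is DenseMatrix.tabulate.

```scala
import breeze.linalg.DenseMatrix

object TabulateSketch {
  // Without a type parameter, the element type follows the function's result: Int.
  val intTable = DenseMatrix.tabulate(5, 4)((row, col) => row * col)

  // With an explicit type parameter, the Int product is widened to Double.
  val doubleTable = DenseMatrix.tabulate[Double](5, 4)((row, col) => row * col)

  def main(args: Array[String]): Unit = {
    println(intTable)
    println(doubleTable)
  }
}
```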
Creating a matrix from random numbers
The rand function in the matrix would generate a matrix of a given dimension (4 rows * 4
columns in our case) with random values between 0 and 1. We'll have an in-depth look into
random number generated vectors and matrices in a subsequent recipe.
val randomMatrix=DenseMatrix.rand(4, 4)
//Returns DenseMatrix[Double]
// 0.09762565779429777   0.19428193961985674  0.01089176285376725  0.2660579009292807
// 0.9662568115400412    0.3957540854393169   0.718377391997945    0.8230367668470933
// 0.9080090988364429    0.26722019105654415  0.7697780247035393   0.49887760321635066
// 3.326843165250004E-4  0.7682752255172411   0.447925644082819    0.8195838733418965
If there are more values than the number of values required by the dimensions of the matrix,
the rest of the values are ignored. Note how (6,7) is ignored in the array:
val vectFromArray=new DenseMatrix(2,2,Array(2,3,4,5,6,7))
//Returns a DenseMatrix[Int] (the array is read column by column)
// 2  4
// 3  5
However, if fewer values are present in the array than what is required by the dimensions of
the matrix, then the constructor call would throw an ArrayIndexOutOfBoundsException:
val vectFromArrayIobe=new DenseMatrix(2,2,Array(2,3,4))
//throws java.lang.ArrayIndexOutOfBoundsException: 3
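When building a matrix from an array, it helps to remember that Breeze fills a DenseMatrix column by column. The following minimal sketch, with illustrative names, makes the ordering visible by reading individual cells.

```scala
import breeze.linalg.DenseMatrix

object ArrayBackedMatrixSketch {
  // The array supplies the first column (2, 3) and then the second column (4, 5).
  val fromArray = new DenseMatrix(2, 2, Array(2, 3, 4, 5))

  val topLeft = fromArray(0, 0)
  val bottomLeft = fromArray(1, 0)
  val topRight = fromArray(0, 1)
  val bottomRight = fromArray(1, 1)

  def main(args: Array[String]): Unit =
    println(fromArray)
}
```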
Matrix arithmetic
Now let's look at the basic arithmetic that we could do using matrices.
Let's consider a simple 3*3 simpleMatrix and a corresponding identity matrix:
val simpleMatrix=DenseMatrix((1,2,3),(11,12,13),(21,22,23))
//DenseMatrix[Int]
// 1   2   3
// 11  12  13
// 21  22  23
val identityMatrix=DenseMatrix.eye[Int](3)
//DenseMatrix[Int]
// 1  0  0
// 0  1  0
// 0  0  1
Addition
Adding two matrices will result in a matrix whose corresponding elements are summed up.
val additionMatrix=identityMatrix + simpleMatrix
// Returns DenseMatrix[Int]
// 2   2   3
// 11  13  13
// 21  22  24
Multiplication
Now, as you would expect, multiplying a matrix with its identity should give you the matrix
itself:
val simpleTimesIdentity=simpleMatrix * identityMatrix
//Returns DenseMatrix[Int]
// 1   2   3
// 11  12  13
// 21  22  23
Breeze also has an alternative element-by-element operation that has the format of prefixing
the operator with a colon, for example, :+,:-, :*, and so on. Check out what happens when
we do an element-wise multiplication of the identity matrix and the simple matrix:
val elementWiseMulti=identityMatrix :* simpleMatrix
//DenseMatrix[Int]
// 1  0   0
// 0  12  0
// 0  0   23
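The difference between the matrix product and the element-wise product can be put side by side in a minimal sketch. It targets Breeze 0.11.x, where the element-wise operator is written :* (an assumption worth checking on newer releases, which spell it differently); the object name is illustrative.

```scala
import breeze.linalg.DenseMatrix

object MatrixProductSketch {
  val simpleMatrix = DenseMatrix((1, 2, 3), (11, 12, 13), (21, 22, 23))
  val identityMatrix = DenseMatrix.eye[Int](3)

  // Matrix product: multiplying by the identity returns the matrix itself.
  val matrixProduct = simpleMatrix * identityMatrix

  // Element-wise product: only the diagonal survives, because every
  // off-diagonal element is multiplied by 0.
  val elementWise = simpleMatrix :* identityMatrix

  def main(args: Array[String]): Unit = {
    println(matrixProduct)
    println(elementWise)
  }
}
```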
The recipes that follow work with a 2 * 2 simpleMatrix of Doubles:
//DenseMatrix[Double]
// 4.0  7.0
// 3.0  -5.0
Getting the second column vector and so on is achieved by passing the correct zero-indexed
column number:
val secondVector=simpleMatrix(::,1)
//DenseVector(7.0, -5.0)
While explicitly stating the range (as in 0 to 1), we have to be careful not to exceed the matrix
size. For example, the following attempt to select 3 columns (0 through 2) on a 2 * 2 matrix
would throw an ArrayIndexOutOfBoundsException:
val errorTryingToSelect3ColumnsOn2By2Matrix=simpleMatrix(0,0 to 2)
//java.lang.ArrayIndexOutOfBoundsException
Getting the second row vector is achieved by passing the second row (1) and all the columns
(::) in that vector:
val secondRow=simpleMatrix(1,::)
//Transpose(DenseVector(3.0, -5.0))
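The column and row selections above can be reproduced with a minimal sketch. The 2 * 2 matrix is an assumption chosen to match the outputs printed in this recipe, and calling .t on the selected row is one way to turn the transposed row back into an ordinary DenseVector.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}

object SliceSketch {
  // Assumed matrix, consistent with the printed outputs of this recipe.
  val simpleMatrix = DenseMatrix((4.0, 7.0), (3.0, -5.0))

  // All rows (::) of the column with index 1.
  val secondColumn: DenseVector[Double] = simpleMatrix(::, 1)

  // The row with index 1 across all columns, transposed back into a plain vector.
  val secondRow: DenseVector[Double] = simpleMatrix(1, ::).t

  def main(args: Array[String]): Unit = {
    println(secondColumn)
    println(secondRow)
  }
}
```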
Transpose is available as a function on the matrix object itself, like so:
val transpose=simpleMatrix.t
// 4.0  3.0
// 7.0  -5.0
//The inverse of simpleMatrix:
// 0.12195121951219512  0.17073170731707318
// 0.07317073170731708  -0.0975609756097561
Let's multiply the matrix by its inverse and confirm whether the result is an identity matrix:
simpleMatrix * inverse
// 1.0                     0.0
// -5.551115123125783E-17  1.0
As expected, the result is indeed an identity matrix with rounding errors when doing floating
point arithmetic.
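The check against the identity matrix can also be done numerically rather than by eye. The sketch below assumes the 2 * 2 matrix implied by the outputs of this recipe and uses inv from breeze.linalg; the tolerance is an arbitrary choice for illustration.

```scala
import breeze.linalg.{DenseMatrix, inv}

object InverseCheckSketch {
  // Assumed matrix, consistent with the printed outputs of this recipe.
  val simpleMatrix = DenseMatrix((4.0, 7.0), (3.0, -5.0))

  val inverse = inv(simpleMatrix)
  val product = simpleMatrix * inverse
  val identity = DenseMatrix.eye[Double](2)

  // Largest absolute deviation of the product from the identity matrix.
  val maxError = (product - identity).data.map(x => math.abs(x)).max

  def main(args: Array[String]): Unit =
    println(s"max deviation from identity: $maxError")
}
```

A deviation in the order of 1E-16 is the floating point rounding error mentioned above, not a sign of a wrong inverse.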
Alternatively, converting an Int matrix to a Double matrix and calculating the mean and
variance for that Matrix could be merged into a one-liner:
meanAndVariance(convert(simpleMatrix, Double))
Standard deviation
Calling stddev on a Double matrix gives the standard deviation:
stddev(simpleMatrixAsDouble)
//Double = 8.703447592764606
Finding the sum, square root and log of all the values in the matrix
The same as with max, the sum object inside the breeze.linalg package calculates the
sum of all the matrix elements:
val intSumOfMatrixVals=sum (simpleMatrix)
//108
The functions sqrt, log, and various other objects (universal functions) in the
breeze.numerics package calculate the square root and log values of all the individual values
inside the matrix.
Sqrt
val sqrtOfMatrixVals= sqrt (simpleMatrix)
//DenseMatrix[Double] =
// 1.0               1.4142135623730951  1.7320508075688772
// 3.3166247903554   3.4641016151377544  3.605551275463989
// 4.58257569495584  4.69041575982343    4.795831523312719
Log
val log2MatrixVals=log(simpleMatrix)
//DenseMatrix[Double]
// 0.0                 0.6931471805599453  1.0986122886681098
// 2.3978952727983707  2.4849066497880004  2.5649493574615367
// 3.044522437723423   3.091042453358316   3.1354942159291497
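The universal functions can be tried together in one minimal sketch. To stay on the safe side, it first converts the Int matrix to Double with convert before applying sqrt and log; the object name is illustrative.

```scala
import breeze.linalg.{DenseMatrix, convert, sum}
import breeze.numerics.{log, sqrt}

object MatrixUFuncSketch {
  val simpleMatrix = DenseMatrix((1, 2, 3), (11, 12, 13), (21, 22, 23))

  // sum adds up every element of the matrix.
  val total = sum(simpleMatrix)

  // sqrt and log are applied element by element and return a new matrix.
  val asDouble = convert(simpleMatrix, Double)
  val roots = sqrt(asDouble)
  val logs = log(asDouble)

  def main(args: Array[String]): Unit = {
    println(total)
    println(roots)
    println(logs)
  }
}
```

Note that log here is the natural logarithm, which is why the entry for 2 is about 0.693.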
We could extract the eigenvectors and eigenvalues by calling the corresponding functions on
the returned Eig reference:
val eigenVectors=denseEig.eigenvectors
//DenseMatrix[Double] =
// 0.9642892971721949   -0.5395744865143975
// 0.26485118719604456  0.8419378679586305
The two eigenvalues corresponding to the two eigenvectors could be captured using the
eigenvalues function on the Eig object:
val eigenValues=denseEig.eigenvalues
//DenseVector[Double] = DenseVector(5.922616289332565, -6.922616289332565)
2. Then let's multiply the first eigenvalue with the first eigenvector. The resulting vector
will be the same as the product of the matrix and that eigenvector, with a marginal error
when doing floating point arithmetic:
val vectorToEigValue=denseEig.eigenvectors(::,0) * denseEig.eigenvalues (0)
//DenseVector(5.7111154990610915, 1.5686119555363618)
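The defining property of an eigenpair, A * v = lambda * v, can be verified for every pair at once. The sketch below assumes the 2 * 2 matrix implied by the outputs of this recipe and assumes denseEig is the result of calling eig on it; norm is used to measure how far the two sides are apart.

```scala
import breeze.linalg.{DenseMatrix, eig, norm}

object EigenCheckSketch {
  // Assumed matrix, consistent with the printed outputs of this recipe.
  val simpleMatrix = DenseMatrix((4.0, 7.0), (3.0, -5.0))
  val denseEig = eig(simpleMatrix)

  // For every eigenpair, the distance between A * v and lambda * v.
  val residuals: Seq[Double] =
    (0 until denseEig.eigenvalues.length).map { i =>
      val v = denseEig.eigenvectors(::, i)
      val lambda = denseEig.eigenvalues(i)
      norm(simpleMatrix * v - v * lambda)
    }

  // The eigenvalues of a matrix add up to its trace (4.0 + -5.0 here).
  val eigenvalueSum = denseEig.eigenvalues.toArray.sum

  def main(args: Array[String]): Unit =
    println(s"residuals=$residuals, eigenvalue sum=$eigenvalueSum")
}
```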
How it works...
The same as with vectors, the initialization of the Breeze matrices is achieved by way of the
apply method or one of the various methods in the matrix's companion object. Various other
operations are provided by way of polymorphic functions available in the breeze.numerics,
breeze.linalg and breeze.stats packages.
How it works...
Before we delve into how to create the vectors and matrices out of random numbers, let's
create instances of the most common random number distribution. All these generators are
under the breeze.stats.distributions package:
//Uniform distribution with low being 0 and high being 10
val uniformDist=Uniform(0,10)
//Gaussian distribution with mean being 5 and standard deviation being 1
val gaussianDist=Gaussian(5,1)
//Poisson distribution with mean being 5
val poissonDist=Poisson(5)
//DenseVector(4.235655596913547, 5.535011377545014, 6.201428236839494,
// 6.046289604188366, 4.319709374229152,
// 4.2379652913447154, 2.957868021601233, 3.96371080427211,
// 4.351274306757224, 5.445022658876723)
We saw how easy it is to create a vector of random values. Now, let's proceed to create a
matrix of random values. Similar to DenseVector.rand to generate vectors with random
values, we'll use the DenseMatrix.rand function to generate a matrix of random values.
// 0.4492155777289115  0.9098840386699856    0.8203022252988292
// 0.0888975848853315  0.009677790736892788  0.6058885905934237
// 0.6201415814136939  0.7017492438727635    0.08404147915159443
// 7.592014659345548  8.164652560340933    6.966445294464401
// 8.35949395084735   3.442654641743763    3.6761640240938442
// 9.42626645215854   0.23658921372298636  7.327120138868571
// 5.724540885605018   5.647051873430568  5.337906135107098
// 6.2228893721489875  4.799561665187845  5.12469779489833
// 5.136960834730864   5.176410360757703  5.262707072950913
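Putting the distributions and the rand functions together, a minimal sketch could look as follows. It assumes Breeze 0.11.x, where the distributions pick up a default random source on their own (newer releases may require one to be supplied explicitly), and it assumes that both rand functions accept a distribution as an optional last argument. The object and value names are illustrative.

```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.stats.distributions.{Gaussian, Uniform}

object RandomDataSketch {
  val uniformDist = Uniform(0, 10)
  val gaussianDist = Gaussian(5, 1)

  // Ten values drawn uniformly from the range 0 to 10.
  val uniformVector = DenseVector.rand(10, uniformDist)

  // A 3 x 3 matrix drawn from a Gaussian with mean 5 and standard deviation 1.
  val gaussianMatrix = DenseMatrix.rand(3, 3, gaussianDist)

  // Without a distribution, the values are drawn uniformly between 0 and 1.
  val defaultMatrix = DenseMatrix.rand(3, 3)

  def main(args: Array[String]): Unit = {
    println(uniformVector)
    println(gaussianMatrix)
    println(defaultMatrix)
  }
}
```

Since the values are random, each run prints different numbers; only the shapes and the value ranges are predictable.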
How it works...
There are just two functions that we need to remember in order to read and write data from
and to CSV files. The signatures of the functions are pretty straightforward too:
csvread(file, separator, quote, escape, skipLines)
csvwrite(file, mat, separator, quote, escape, skipLines)
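file: This is the java.io.File that is read from or written to.
separator: This defaults to a comma and can be overridden when needed.
skipLines: This is the number of lines to be skipped while reading the file.
Generally, if there is a header, we pass a skipLines=1.
mat: While writing, this is the matrix object that is being written.
quote: This defaults to double quotes. It is a character that implies that the value
inside is one single value.
escape: This defaults to a backslash. It is the character used to escape
special characters.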
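The two signatures can be exercised with a round trip through a temporary file. This is a minimal sketch under stated assumptions: the data is made up for illustration, the file is a throwaway temp file rather than the one used in this recipe, and the quote and escape parameters are left at their defaults.

```scala
import java.io.File
import breeze.linalg.{DenseMatrix, csvread, csvwrite}

object CsvRoundTripSketch {
  // Illustrative data: an id column followed by two numeric columns.
  val original = DenseMatrix((1.0, 88.0, 2.0), (2.0, 84.0, 3.0), (3.0, 85.0, 4.0))

  // Throwaway file, removed when the JVM exits.
  val file = File.createTempFile("breeze-csv-sketch", ".csv")
  file.deleteOnExit()

  // Write the matrix, then read it straight back. The file has no header line,
  // so no lines are skipped.
  csvwrite(file, original, separator = ',')
  val readBack = csvread(file, separator = ',', skipLines = 0)

  // Drop the first column by selecting all rows of columns 1 and 2.
  val firstColumnSkipped = readBack(::, 1 to 2)

  def main(args: Array[String]): Unit = {
    println(readBack)
    println(firstColumnSkipped)
  }
}
```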
Let's see these in action. For the sake of clarity, I have skipped the quote and the escape
parameter while calling the csvread and csvwrite functions. For this recipe, we will do
three things:
println ("First Column skipped \n"+ firstColumnSkipped(0 to 5, ::))
Output :
1.0  88.0
2.0  84.0
3.0  85.0
4.0  85.0
5.0  84.0
6.0  85.0