SPARKLY NOTEBOOK: INTERACTIVE
ANALYSIS AND VISUALIZATION WITH SPARK
FELIX CHEUNG
APRIL 2015
HTTP://WWW.MEETUP.COM/SEATTLE-SPARK-MEETUP/EVENTS/208711962/
SETUP
• Spark on CDH cluster
• Vagrant - 2-nodes - custom provisioning
AGENDA
• IPython + PySpark cluster
• Zeppelin
• Spark’s Streaming k-means
• Lightning
SPARK - 10 SEC INTRODUCTION
• Spark
• Spark SQL + Data Frame + data source
• Spark Streaming
• MLlib
• GraphX
It’s a lot of time spent looking at data..
REPL
• Read-Eval-Print-Loop
A set of REPLs related to Spark…
$ spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
15/04/15 11:31:28 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> val a = sc.parallelize(1 to 100)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> a.collect.foreach(x => println(x))
1
2
3
4
GOOD
• See results instantly
NOT SO GOOD
• Only OK as an IDE
• No Save / Repeat
• No visualization
NOTEBOOK
IPYTHON/JUPYTER
Jupyter
IPython will continue to exist as a Python kernel for Jupyter, but
the notebook and other language-agnostic parts of IPython will
move to new projects under the Jupyter name. IPython 3.0 will
be the last monolithic release of IPython.
“IPython” http://ipython.org/
• interactive shell
• browser-based notebook
• 'Kernel'
• great support for visualization libraries (eg. matplotlib)
• built on pyzmq, tornado
IPYTHON NOTEBOOK

NOTEBOOK == BROWSER-BASED REPL
IPython Notebook is a web-based interactive
computational environment for creating IPython
notebooks. An IPython notebook is a JSON
document containing an ordered list of input/output
cells which can contain code, text, mathematics,
plots and rich media.
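To make that structure concrete, here is a minimal sketch of a notebook document written out as a Python dict; the field names follow the nbformat spec approximately and are illustrative only.

# sketch of an IPython notebook file as a Python dict (nbformat-4 style, approximate)
notebook = {
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {"kernelspec": {"name": "pyspark", "display_name": "PySpark"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Word count demo"]},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "source": ["sc.parallelize(range(10)).sum()"],
         "outputs": [{"output_type": "execute_result", "execution_count": 1,
                      "data": {"text/plain": ["45"]}, "metadata": {}}]},
    ],
}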
MATPLOTLIB
matplotlib tries to make easy things easy and hard things
possible. You can generate plots, histograms, power
spectra, bar charts, error charts, scatter plots, etc., with just a
few lines of code, using a familiar MATLAB-style API.
import matplotlib.pyplot as plt

# assumes y_pos, performance, error and people were defined earlier in the notebook
plt.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
plt.yticks(y_pos, people)
plt.xlabel('Performance')
plt.title('How fast do you want to go today?')
plt.show()
PYSPARK
• Spark on Python; this serves as the Kernel, integrating with IPython
• Each notebook spins up a new instance of the Kernel (ie. PySpark running as the
Spark Driver, in whichever deploy mode Spark/PySpark supports)
(All notebook examples are a subset of those in the Meetup, reconstructed here)
Markdown
Spark in Python
Source: http://nbviewer.ipython.org/github/ResearchComputing/scientific_computing_tutorials/blob/master/spark/02_word_count.ipynb
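For context, a minimal PySpark word count in the spirit of that notebook (a sketch only: the input path is a placeholder, and sc is the SparkContext the kernel provides).

# minimal word count; "hdfs:///data/text8_lines" is a placeholder path
lines = sc.textFile("hdfs:///data/text8_lines")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # 10 most frequent words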
WORD2VEC EXAMPLE
Word2Vec computes distributed vector
representations of words. Distributed vector
representations have been shown to be useful in many
natural language processing applications such as
named entity recognition, disambiguation, parsing,
tagging and machine translation.

https://code.google.com/p/word2vec/
Spark MLlib implements the Skip-gram approach.
With Skip-gram we want to predict a window of
words given a single word.
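A minimal sketch of the MLlib Word2Vec API from PySpark (assuming a Spark version that ships the Python binding; the query word and vector size are illustrative):

from pyspark.mllib.feature import Word2Vec

# each element is a list of tokens, e.g. one line of the sampled text8 data
corpus = sc.textFile("text8_lines").map(lambda line: line.split())

model = Word2Vec().setVectorSize(100).fit(corpus)
for word, similarity in model.findSynonyms("china", 5):
    print("%s %f" % (word, similarity))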
WORD2VEC DATASET
Wikipedia dump http://mattmahoney.net/dc/textdata
grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
then randomly sampled to ~200k lines
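One way to do that sampling from PySpark (a sketch; the ~200k target drives the fraction, and the output path is a placeholder):

lines = sc.textFile("text8_lines")
fraction = 200000.0 / lines.count()          # aim for roughly 200k lines
sample = lines.sample(withReplacement=False, fraction=fraction, seed=42)
sample.repartition(1).saveAsTextFile("text8_sampled")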
MORE VISUALIZATIONS
matplotlib: http://matplotlib.org
Seaborn: http://stanford.edu/~mwaskom/software/seaborn/
Bokeh: http://bokeh.pydata.org/en/latest/
SETUP
To set up IPython:
• Python 2.7.9 (separate from the CentOS default 2.6.6), on all nodes
• matplotlib, on the host running IPython
To run IPython with the PySpark Kernel, set these in the environment
(please check out my handy script on github):
PYSPARK_PYTHON              command to run python, eg. "python2.7"
PYSPARK_DRIVER_PYTHON       command to run ipython
PYSPARK_DRIVER_PYTHON_OPTS  "notebook --profile"
PYSPARK_SUBMIT_ARGS         pyspark command line, eg. --master, --deploy-mode
YARN_CONF_DIR               if running in YARN mode
LD_LIBRARY_PATH             for matplotlib
IPYTHON/JUPYTER KERNELS
• IPython
• IGo
• Bash
• IR
• IHaskell
• IMatlab
• ICSharp
• IScala
• IRuby
• IJulia
.. and more: https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages
ZEPPELIN
Apache Zeppelin (incubating) is an interactive data analytics environment
for distributed data processing systems. It provides a beautiful interactive
web-based interface, data visualization, a collaborative work environment
and many other nice features to make your data analytics more fun and
enjoyable.
Zeppelin has been incubating since Dec 2014.

https://zeppelin.incubator.apache.org/
shell script & calling library package
Load and process data with Spark
SQL query powered by Spark SQL - progress & parameterization via dynamic form
Python & data passing across languages (interpreters)
ZEPPELIN ARCHITECTURE
Realtime collaboration - enabled by websocket communications
Frontend: AngularJS
Backend server: Java
Interpreters: Java
Visualization: NVD3
INTERPRETERS
• Spark group
• Spark (Scala)
• PySpark
• Spark SQL
• Dependency
• Markdownjs
• Shell
• Hive
• Coming: jdbc, Tajo, etc.
CLUSTERING
• Clustering tries to find natural groupings in
data. It puts objects into groups in which
those within a group are more similar to each
other than to those in other groups.
• Unsupervised learning
K-MEANS
• First, given an initial set of k cluster centers,
we find which cluster each data point is
closest to
• Then, we compute the average of each of the new clusters and use the result
to update our cluster centers (a tiny sketch of one such iteration follows)
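To make those two steps concrete, here is a tiny NumPy sketch of one k-means iteration (illustrative only, not the MLlib implementation):

import numpy as np

def kmeans_step(points, centers):
    # step 1: assign each point (row of `points`) to its closest center
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    assignment = dists.argmin(axis=1)
    # step 2: move each center to the mean of its assigned points
    # (an empty cluster keeps its old center in this toy version)
    return np.array([points[assignment == j].mean(axis=0)
                     if np.any(assignment == j) else centers[j]
                     for j in range(len(centers))])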
K-MEANS|| IN MLLIB
• a parallelized variant of k-means++ (see the PySpark usage sketch after the parameter list)

http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Parameters:
• k is the number of desired clusters.
• maxIterations is the maximum number of iterations to run.
• initializationMode specifies either random initialization or initialization via
k-means||.
• runs is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple
times on a given dataset, the algorithm returns the best clustering result).
• initializationSteps determines the number of steps in the k-means||
algorithm.
• epsilon determines the distance threshold within which we consider k-means
to have converged.
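A minimal PySpark sketch of how these parameters map onto the MLlib API (the input path and parameter values are placeholders):

from pyspark.mllib.clustering import KMeans

# placeholder input: one comma-separated numeric vector per line
data = (sc.textFile("hdfs:///data/points.csv")
          .map(lambda line: [float(x) for x in line.split(",")]))

model = KMeans.train(data, k=5, maxIterations=20, runs=3,
                     initializationMode="k-means||")
print(model.clusterCenters)          # the k fitted centers
print(model.predict(data.first()))   # cluster index for one point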
CASE STUDY: K-MEANS - ZEPPELIN
Details on github at: http://bit.ly/1JWOPh8
ANOMALY DETECTION WITH K-MEANS
Using Spark DataFrames with the CSV data source to process the KDDCup’99 data
Scoring with different k values (a rough sketch follows)
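Roughly what that scoring could look like in PySpark (a sketch, not the notebook's actual code; `features` is assumed to be an RDD of numeric vectors parsed from the KDDCup'99 CSV):

import numpy as np
from pyspark.mllib.clustering import KMeans

def wssse(model, data):
    # within-set sum of squared errors, computed against the fitted centers
    centers = model.clusterCenters  # plain numpy arrays, safe to use in closures
    return data.map(lambda p: float(min(np.sum((np.array(p) - c) ** 2)
                                        for c in centers))).sum()

for k in (5, 10, 20, 40):
    model = KMeans.train(features, k, maxIterations=10)
    print("k=%d WSSSE=%f" % (k, wssse(model, features)))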
COMING SOON (NOW!)
Realtime updates
Dashboard
OTHER NOTEBOOKS
• Spark-notebook: https://github.com/andypetrella/spark-notebook
• ISpark: https://github.com/tribbloid/ISpark
• Spark Kernel: https://github.com/ibm-et/spark-kernel
• Jove Notebook: https://github.com/jove-sh/jove-notebook
• Beaker: https://github.com/twosigma/beaker-notebook
• Databricks Cloud notebook
PART 2
STREAMING K-MEANS
WHY STREAMING?
• Train - model - predict works well on static
data
• What if data is
• Coming in streams
• Changing over time?
STREAMING K-MEANS DESIGN
• Proposed by Dr Jeremy Freeman (here)
STREAMING K-MEANS
• key concept: forgetfulness
• balances the relative importance of new
data versus past history
• half-life
• time it takes before past data contributes to
only one half of the current model
STREAMING K-MEANS
• time unit
• batches (which have a fixed duration in
time), or points
• eliminate dying clusters (a rough PySpark sketch of the API follows)
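A rough PySpark-flavored sketch of this API (note: the Python StreamingKMeans binding shipped in a Spark release after this talk, and the demo's training job is actually Scala, so treat names and arguments as indicative):

from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import StreamingKMeans
from pyspark.mllib.linalg import Vectors

ssc = StreamingContext(sc, batchDuration=1)

# forgetfulness expressed as a half-life, measured in batches
model = (StreamingKMeans(k=3, timeUnit="batches")
         .setHalfLife(5, "batches")
         .setRandomCenters(dim=2, weight=0.0, seed=42))

training = (ssc.textFileStream("training/")
               .map(lambda line: Vectors.dense([float(x) for x in line.split(",")])))
model.trainOn(training)
# ssc.start(); ssc.awaitTermination()   # start the streaming job
# model.latestModel().clusterCenters    # current centers as the stream evolves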

VISUALIZING STREAMING K-MEANS - LIGHTNING
LIGHTNING
• Lightning - data visualization server

http://lightning-viz.org
• provides API-based access to reproducible, web-
based, interactive visualizations. It includes a core set
of visualization types, but is built for extendability
and customization. Lightning supports modern
libraries like d3.js and three.js, and is designed for
interactivity over large data sets and continuously
updating data streams.
VISUALIZING STREAMING K-MEANS ON IPYTHON + LIGHTNING
RUNNING LIGHTNING
• API: node.js, Python, Scala (see the Python sketch below)
• Extension support for custom charts (eg. d3.js)
• Requirements:
• Postgres recommended (SQLite ok)
• node.js (npm, gulp)
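A small sketch with the lightning-python client (the server address is a placeholder; double-check method names against the package docs):

from lightning import Lightning
import numpy as np

lgn = Lightning(host="http://localhost:3000")   # placeholder Lightning server
lgn.create_session("streaming-kmeans-demo")

pts = np.random.randn(200, 2)
viz = lgn.scatter(pts[:, 0], pts[:, 1])
# Lightning also offers streaming plot types that can be updated as new batches arrive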
The Freeman Lab at Janelia Research Campus uses Lightning to visualize
large-scale neural recordings from zebrafish, in collaboration with the
Ahrens Lab
SPARK STREAMING K-MEANS
DEMO
Environment
• requires: numpy, scipy, scikit-learn
• IPython/Python requires: lightning-python package
Demo consists of 3 parts:

https://github.com/felixcheung/spark-ml-streaming
• Python driver script, data generator
• Scala job - Spark Streaming & Streaming k-means
• IPython notebook to process result, visualize with Lightning

Originally this was part of the Python driver script - it has
been modified for this talk to run within IPython
CHALLENGES
• Package management
• Version/build conflicts!
YOU CAN RUN THIS TOO!
• Notebooks available at http://bit.ly/1JWOPh8
• Everything is heavily scripted and automated

Vagrant config for local, virtual environment
available at http://bit.ly/1DB3OLw
QUESTIONS?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
