Spark and Scala - Module 5
ᗍ When Bombay Stock Exchange (the seventh largest stock exchange in the world in terms of market capitalization)
wanted to scale up its operations, the company faced major challenges
ᗍ These challenges were the exponential growth of data (read: big data), the need for complex analytics, and
managing information that was scattered across multiple, monolithic systems
ᗍ DataMetica (a Mumbai/Pune based big data organization) suggested a three-phased solution to BSE:
• In the first phase, they created a POC which demonstrated how a Hadoop-based big data implementation could
work for BSE
• In the second phase, they worked with BSE to pick the most critical business use cases (those with the
maximum ROI for BSE) and implemented them
• Finally, in the third phase, they delivered the complete solution in a multi-faceted manner for a full-fledged
implementation
ᗍ That’s how Hadoop got implemented at BSE in a cost-effective and scalable fashion
[Diagram: BSE solution architecture, in which structured data ingested via Sqoop and unstructured data are loaded into the data processing system, and a business analytics / batch processing layer (Pig) produces the output]
ᗍ City of Chicago uses a MongoDB-based real-time analytics platform called WindyGrid
ᗍ This platform integrates unstructured data from various city departments to predict correlations and outcomes in a
proactive manner, e.g. how a rodent complaint often follows within seven days of a garbage complaint
WindyGrid in Practice
ᗍ With the MongoDB-based system, WindyGrid created a central nervous system for Chicago, helping improve services,
cut costs, and create a more livable city
ᗍ By pulling together 311 and 911 calls, tweets, and bus locations, the city can better manage traffic and incidents
and get streets cleaned and opened up more quickly
ᗍ The city of Chicago collects more than seven million rows of data every day. With MongoDB’s flexible data schema,
the system doesn’t need to worry about unwieldy and constantly changing schema requirements
ᗍ The next step for WindyGrid is building an open-source, predictive analytics system called the SmartData Platform to
anticipate problems before they occur and propose solutions even faster
Apache Hadoop is a framework that allows the distributed processing of large data sets across
clusters of commodity computers using a simple programming model
[Diagram: Hadoop characteristics, including Reliable and Flexible]
ᗍ Apache Spark is a fast and general engine for large-scale data processing
ᗍ Apache Spark is a general-purpose, in-memory cluster computing system
ᗍ It is used for fast data analytics
ᗍ It provides APIs in Java, Scala and Python, and an optimized engine that supports general execution
graphs
ᗍ Provides various high-level tools like Spark SQL for structured data processing, MLlib for machine learning and
more
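An illustrative Spark SQL sketch (assumes Spark 1.4 or later and an existing SparkContext sc; the input file people.json is hypothetical):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // infer the schema from a JSON file
people.registerTempTable("people")                 // expose the DataFrame to SQL queries
sqlContext.sql("SELECT name FROM people WHERE age >= 21").show()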
[Diagram: the Spark ecosystem built on the Core Spark Engine, comprising BlinkDB (an approximate SQL query engine, alpha/pre-alpha), Spark Streaming (enables analytical and interactive apps over live streaming data), GraphX (graph computation, similar to Giraph), SparkR (a package letting R users leverage Spark from the R shell) and MLlib (a machine learning library being built on top of Spark, with speeds up to 100 times faster than MapReduce)]
ᗍ BlinkDB
• An approximate query engine that runs over the Core Spark Engine
• Trades off accuracy for response time
ᗍ MLlib
• Machine learning library being built on top of Spark (see the sketch after this list)
• Supports many machine learning algorithms, with speeds up to 100 times faster than MapReduce
• Mahout is also being migrated to MLlib
ᗍ GraphX
• Graph computation engine (similar to Giraph)
• Combines data-parallel and graph-parallel concepts
ᗍ SparkR
• Package for the R language that enables R users to leverage Spark’s power from the R shell
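An illustrative MLlib sketch (assumes an existing SparkContext sc; the input file kmeans_data.txt, holding space separated numeric features, is hypothetical):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector and keep the RDD in memory
val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster the points into 2 groups using at most 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)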
ᗍ Spark exposes a simple programming layer which provides powerful caching and disk persistence capabilities
ᗍ The Spark framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark’s own cluster
manager
ᗍ The Spark framework is polyglot: it can be programmed in several languages (currently Scala, Java and
Python are supported)
ᗍ Has a very active community
ᗍ Spark fits well with the existing Hadoop ecosystem
• Can be launched in an existing YARN cluster
• Can fetch data from Hadoop 1.0
• Can be integrated with Hive
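An illustrative sketch of the Hive integration on Spark 1.x (assumes a Hive-enabled Spark build with hive-site.xml on the classpath; the table src is hypothetical):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)        // reuses the existing SparkContext
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)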
ᗍ MapReduce is a very powerful programming paradigm, but it has some limitations:
• It is difficult to program an algorithm directly in native MapReduce
• Performance bottlenecks, specifically for small batches that do not fit its use cases
• Many categories of algorithms are not supported (e.g. iterative algorithms, asynchronous algorithms etc.)
ᗍ In short, MapReduce doesn’t compose well for large applications
ᗍ We are often forced to take “hybrid” approaches
ᗍ Therefore, many specialized systems evolved over a period of time as workarounds
Pregel, Giraph, Impala, GraphLab, Storm, S4
ᗍ Unlike other evolved specialized systems, Spark’s design goal is to generalize the MapReduce concept to support new
apps within the same engine
ᗍ Two reasonably small additions are enough to express the previous models:
• Fast data sharing (for faster processing)
• General DAGs (for lazy processing)
ᗍ This allows for an approach which is more efficient for the engine, and much simpler for the end users
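An illustrative sketch of both additions (assumes an existing SparkContext sc; the log file app.log and its tab separated format are hypothetical):
// Building the DAG is lazy: nothing runs until an action is called
val logs   = sc.textFile("app.log")
val errors = logs.filter(_.contains("ERROR"))
val byHost = errors.map(line => (line.split('\t')(0), 1)).reduceByKey(_ + _)

// Fast data sharing: cache the filtered RDD so later actions reuse it in memory
errors.cache()
println(errors.count())            // first action triggers execution of the DAG
byHost.take(10).foreach(println)   // reuses the cached errors instead of re-reading the file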
[Chart: “Code Size” in lines of code (scale 0 to 140,000), with GraphX, Shark and Streaming shown as components]
ᗍ RDDs track the series of transformations used to build them (their lineage) to recompute lost data
Example:
val messages = sc.textFile(…).filter(_.contains("error"))
                             .map(_.split('\t')(2))
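If a cached partition is lost, Spark rebuilds it from this lineage instead of replicating the data; the lineage itself can be inspected (an illustrative continuation of the sketch above, where the "mysql" query is hypothetical):
messages.cache()                                 // keep the derived RDD in memory across queries
messages.filter(_.contains("mysql")).count()     // lost partitions are recomputed from the lineage
println(messages.toDebugString)                  // prints the chain of transformations (the lineage)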
[Diagram: Spark and Hadoop benefits, covering RDDs for in-memory performance, DAGs that unify processing, unlimited scale and ease of development]
[Diagram: Spark cluster architecture, with a Driver Program coordinating Worker Nodes, each running an Executor holding a Cache and multiple Tasks]
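An illustrative sketch of a standalone driver program (the object name SimpleDriver and the local master URL are hypothetical choices):
import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The master URL decides where executors run: local[*], YARN, Mesos,
    // or a spark:// URL pointing at Spark's own cluster manager
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val nums = sc.parallelize(1 to 1000000)   // partitions processed as tasks in the executors
    println("sum = " + nums.sum())            // the result is collected back to the driver

    sc.stop()
  }
}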
Why SBT?
ᗍ Full Scala language support for creating tasks
ᗍ Continuous command execution
ᗍ Launch REPL in project context
ᗍ SBT can be used both as a command line script and as a build console
ᗍ We’ll be primarily using it as a build console, but most commands can be run standalone by passing the command
as an argument to SBT, e.g. sbt compile, sbt test or sbt run
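A minimal build.sbt sketch for a Spark project (the project name and the Scala/Spark versions are illustrative assumptions):
name := "spark-module5"

version := "0.1"

scalaVersion := "2.10.6"

// Spark core as a library dependency, resolved from Maven Central
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"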
This simple program provides a good test case for parallel processing, since it:
ᗍ Requires a minimal amount of code
ᗍ Demonstrates use of both symbolic and numeric values
ᗍ Isn’t many steps away from search indexing
val f = sc.textFile("README.md")
val wc = f.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)   // count occurrences of each word
wc.saveAsTextFile("wc_out")