
Spark and Scala - Module 5

The document outlines a training module on Apache Spark and Scala, focusing on Spark's role in big data processing and its advantages over traditional MapReduce. It includes a case study of the Bombay Stock Exchange's implementation of Hadoop, details the Spark ecosystem, and highlights the architecture and functionalities of Spark. Additionally, it discusses the integration of Spark with Hadoop and the benefits of using both technologies together.

Apache Spark and Scala

Module 5: Spark and Big Data

© 2015 BlueCamphor Technologies (P) Ltd.


Course Topics

Module 1: Getting Started / Introduction to Scala
Module 2: Scala – Essentials and Deep Dive
Module 3: Introducing Traits and OOPS in Scala
Module 4: Functional Programming in Scala
Module 5: Spark and Big Data
Module 6: Advanced Spark Concepts
Module 7: Understanding RDDs
Module 8: Shark, SparkSQL and Project Discussion

© 2015 BlueCamphor Technologies (P) Ltd. www.skillspeed.com Slide 2


Session Objectives
In this session, you will:

• Compare batch processing and real-time processing
• Understand the Spark ecosystem
• Analyze the limitations of MapReduce
• Review Spark's history
• Analyze the Spark architecture
• Understand the advantages of Spark and Hadoop
• Analyze the benefits of combining Spark and Hadoop
• Install Spark



Bombay Stock Exchange – Big Data Case Study

• When the Bombay Stock Exchange (the seventh largest stock exchange in the world by market capitalization) wanted to scale up its operations, it faced major challenges
• These challenges were the exponential growth of data (read: big data), the need for complex analytics, and the management of information scattered across multiple monolithic systems
• DataMetica (a Mumbai/Pune-based big data organization) proposed a three-phase solution to BSE:
  - In the first phase, they built a proof of concept (POC) demonstrating how a Hadoop-based big data implementation could work for BSE
  - In the second phase, they worked with BSE to pick the most critical business use cases (those with the maximum ROI for BSE) and implemented them
  - Finally, in the third phase, they delivered the complete solution in a multi-faceted manner for a full-fledged implementation
• That is how Hadoop was implemented at BSE in a cost-effective and scalable fashion



Batch Processing Phase / Life Cycle

• Batch processing means processing transactions in a group or batch
• The following three phases are common to any batch-processing or business-analytics project, irrespective of the type of data (structured or unstructured):

Data Collection → Data Preparation → Data Presentation



Data Collection

[Diagram: Flume feeds unstructured data from real-time systems, and Sqoop feeds structured data, into the business analytics / batch processing system]



Data Preparation

[Diagram: within the business analytics / batch processing system, Pig performs the data processing]



Data Presentation

[Diagram: Pig processes the data within the business analytics / batch processing system and produces the output]



Real Time Analytics Examples – WindyGrid

• The City of Chicago uses a MongoDB-based real-time analytics platform called WindyGrid
• The platform integrates unstructured data from various city departments to predict correlations and outcomes proactively, e.g. that a rodent complaint will often follow within 7 days of a garbage complaint

[Image: WindyGrid in practice]



Real Time Analytics Examples – WindyGrid (Cont’d)

• With its MongoDB-based system, WindyGrid created a central nervous system for Chicago, helping improve services, cut costs, and create a more livable city
• By pulling together 311 and 911 calls, tweets, and bus locations, the city can better manage traffic and incidents and get streets cleaned and reopened more quickly
• The City of Chicago collects more than seven million rows of data every day; with MongoDB's flexible data schema, the system doesn't need to worry about unwieldy and constantly changing schema requirements

The next step for WindyGrid is building an open-source, predictive analytics system called the SmartData Platform to anticipate problems before they occur and propose solutions even faster



What is Hadoop?

Apache Hadoop is a framework that allows the distributed processing of large data sets across clusters of commodity computers using a simple programming model

It is an open-source data management framework with scale-out storage and distributed processing



Hadoop Key Characteristics

Hadoop's key characteristics: Reliable, Scalable, Economical, Flexible



What is Spark?

• Apache Spark is a fast and general engine for large-scale data processing
• Apache Spark is a general-purpose cluster computing system with in-memory processing
• It is used for fast data analytics
• It exposes APIs in Java, Scala and Python, and provides an optimized engine that supports general execution graphs
• It provides various high-level tools such as Spark SQL for structured data processing, MLlib for machine learning, and more



Spark Ecosystem

[Diagram: the Spark Core Engine underpins Spark SQL, Spark Streaming, MLlib (machine learning), GraphX (graph computation), SparkR (R on Spark) and BlinkDB (approximate SQL; alpha/pre-alpha)]



Spark Ecosystem (Cont’d)

• BlinkDB (approximate SQL): an approximate query engine that runs over the Spark Core Engine (alpha/pre-alpha)
• Spark SQL: used for structured data; can run unmodified Hive queries on an existing Hadoop deployment
• Spark Streaming: enables analytical and interactive apps for live streaming data
• MLlib: a machine learning library being built on top of Spark, with support for many machine learning algorithms at speeds up to 100 times faster than MapReduce
• GraphX: a graph computation engine (similar to Giraph)
• SparkR: a package for the R language that enables R users to leverage Spark's power from the R shell



Spark Ecosystem (Cont’d)

• Spark Core Engine
  - The core engine for the entire Spark framework
  - Provides utilities and architecture for the other components
• Spark SQL
  - Spark SQL is the newest component of Spark and provides a SQL-like interface
  - Used for structured data
  - Can expose many datasets as tables
  - Spark SQL is tightly integrated with the Spark programming APIs and with Hive
• Spark Streaming
  - Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
  - A good alternative to Storm



Spark Ecosystem (Cont’d)

• BlinkDB
  - An approximate query engine that runs over the Spark Core Engine
  - Trades accuracy for response time
• MLlib
  - A machine learning library being built on top of Spark
  - Supports many machine learning algorithms at speeds up to 100 times faster than MapReduce
  - Mahout is also being migrated to MLlib
• GraphX
  - A graph computation engine (similar to Giraph)
  - Combines data-parallel and graph-parallel concepts
• SparkR
  - A package for the R language that enables R users to leverage Spark's power from the R shell



Why Spark?

• Spark exposes a simple programming layer with powerful caching and disk-persistence capabilities
• The Spark framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark's own cluster manager
• The Spark framework is polyglot – it can be programmed in several languages (currently Scala, Java and Python are supported)
• It has a very active community
• Spark fits well into the existing Hadoop ecosystem:
  - Can be launched in an existing YARN cluster
  - Can fetch data from Hadoop 1.x
  - Can be integrated with Hive



Brief History: M/R Limitations

• MapReduce is a very powerful programming paradigm, but it has some limitations:
  - Algorithms are difficult to program directly in native MapReduce
  - Performance bottlenecks, especially for small batches that don't fit its use cases
  - Many categories of algorithms are not supported (e.g. iterative and asynchronous algorithms)
• In short, MapReduce doesn't compose well for large applications
• We are often forced to take "hybrid" approaches
• Therefore, many specialized systems evolved over time as workarounds



Brief History: Evolution of Specialized Systems

[Diagram: from MapReduce (general batch processing) evolved specialized systems for iterative, interactive, streaming and graph workloads – Pregel, Giraph, Dremel, Drill, Tez, Impala, GraphLab, Storm, S4]



Brief History: Spark

• Unlike the other specialized systems that evolved, Spark's design goal is to generalize the MapReduce concept to support new applications within the same engine
• Two reasonably small additions are enough to express the previous models:
  - Fast data sharing (for faster processing)
  - General DAGs (for lazy processing)
• This allows for an approach that is more efficient for the engine and much simpler for end users
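The lazy-processing idea can be illustrated with plain Scala (no Spark required). This is a toy analogy, not Spark's engine: a strict collection evaluates every transformation immediately, while a lazy view – like Spark's deferred DAG of transformations – only computes what an action actually demands.

```scala
object LazySketch {
  var evaluations = 0 // counts how many elements the map actually touches

  def run(): (Int, Int) = {
    evaluations = 0
    val pipeline = (1 to 100).view            // lazy: nothing computed yet
      .map { n => evaluations += 1; n * 2 }   // deferred transformation
      .filter(_ % 3 == 0)                     // another deferred step
    val first = pipeline.head                 // the "action": forces only what is needed
    (first, evaluations)
  }
}
```

Forcing only the first qualifying element touches just the first few inputs, not all 100 – the same reason Spark can skip work that no action ever requires.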



Brief History: Spark Key Points

[Chart: code size in non-test, non-example source lines – Spark, including GraphX, Shark (which also calls into Hive) and Streaming, is far smaller than the specialized systems it replaces; these components are used as libraries instead of separate systems]

Source: Matei Zaharia, "The State of Spark, and where we're going next", Spark Summit 2013 (youtu.be/nU6v02EJAb4)



Brief History: Spark Key Points (Cont’d)

RDD Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) in order to recompute lost data

Example:

val messages = textFile(…).filter(_.contains("error"))
                          .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(…)) → MapperRDD (func = _.split(…))
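The same chain can be checked on plain Scala collections (a sketch with made-up log lines; no Spark needed). These are exactly the two functions Spark would re-run on the source data to rebuild a lost partition of `messages`:

```scala
object LineageSketch {
  // hypothetical tab-separated log lines standing in for the HDFS file
  val logLines = List(
    "error\t14:02\tdisk full",
    "info\t14:03\tjob started",
    "error\t14:05\ttimeout")

  // the slide's lineage: filter(_.contains("error")) then map(_.split('\t')(2))
  val messages: List[String] =
    logLines.filter(_.contains("error")).map(_.split('\t')(2))
}
```

Here `messages` comes out as `List("disk full", "timeout")` – only the error lines survive the filter, and the map keeps their third tab-separated field.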



Spark in Industry



Spark Advantages

• Ease of development: easier APIs; Python, Scala, Java
• In-memory performance: RDDs; DAGs unify processing
• Combine workflows: SQL, ML, Streaming, GraphX



Spark + Hadoop

Operational applications augmented by in-memory performance – the combination of Spark on Hadoop:

• Spark brings: ease of development, in-memory performance, the ability to combine workflows
• Hadoop brings: unlimited scale, an enterprise platform, a wide range of applications



Spark Cluster Manager

[Diagram: the Driver Program's SparkContext connects to a Cluster Manager, which allocates Worker Nodes; each worker runs an Executor with a cache that executes its assigned tasks]



Spark Architecture – SparkContext

• Spark applications run as independent sets of processes on a cluster
• All of the distributed processing is coordinated by the SparkContext object in the driver program
• The SparkContext object connects to a cluster manager (Standalone/YARN/Mesos) for resource allocation across the cluster
• The cluster manager provides executors, which are essentially JVM processes that run the application logic and store application data
• The SparkContext object then sends the application code (JAR files / Python scripts) to the executors
• Finally, the SparkContext sends tasks for the executors to run
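The flow above can be sketched as a toy model in plain Scala (purely illustrative – this is not Spark's real scheduler): the driver partitions the data, each "executor" applies the shipped task to its partition, and the driver collects the results.

```scala
object DriverSketch {
  val task: Int => Int = _ * 2 // stands in for the shipped application code

  def run(data: List[Int], executors: Int): List[Int] = {
    // the driver splits the data into partitions, roughly one per executor
    val partitions = data.grouped(math.max(1, data.size / executors)).toList
    // each executor independently runs the same task on its own partition
    val perExecutor = partitions.map(partition => partition.map(task))
    perExecutor.flatten // the driver collects the results
  }
}
```

For example, `DriverSketch.run(List(1, 2, 3, 4), 2)` splits the data into two partitions and returns `List(2, 4, 6, 8)`.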



SBT Demo

• SBT stands for Scala Build Tool
• A build tool provides facilities to compile, run, test and package your projects
• SBT is a modern build tool; while it is written in Scala and provides many Scala conveniences, it is a general-purpose build tool

Why SBT?
• Full Scala language support for creating tasks
• Continuous command execution
• Launch a REPL in the project context

Note: the SBT installation guide is provided separately
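A minimal `build.sbt` for a Spark project might look like the following (a sketch – the project name and version numbers are illustrative; match the Spark and Scala versions to your cluster):

```scala
// build.sbt (illustrative)
name := "spark-module5"

version := "0.1.0"

scalaVersion := "2.10.4" // example version

// example Spark dependency; pick the release your cluster runs
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
```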



SBT Demo (Cont’d)
SBT console:

Testing in the console:

• SBT can be used both as a command-line script and as a build console
• We'll primarily use it as a build console, but most commands can be run standalone by passing the command as an argument to SBT, e.g.
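For example, the standard SBT commands can be run either inside the console or standalone from the shell (the project-specific screenshots from the original slide are not reproduced here):

```
sbt compile   # compile the project sources
sbt test      # compile and run the tests
sbt run       # run the project's main class
sbt package   # produce a jar
sbt console   # launch a Scala REPL with project classes on the classpath
```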



SBT Demo (Cont’d)
SBT allows you to start a Scala REPL with all your project dependencies loaded. It compiles your project sources before launching the console, providing a quick way to bench-test our parser



Simple Spark Apps: Word Count

This simple program provides a good test case for parallel processing, since it:
• Requires a minimal amount of code
• Demonstrates use of both symbolic and numeric values
• Isn't many steps away from search indexing

val f = sc.textFile("README.md")

val wc = f.flatMap(l => l.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

wc.saveAsTextFile("wc_out")
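The same lambda chain can be sanity-checked on plain Scala collections before running it on a cluster (a sketch, not Spark: here `groupBy` plus a per-key sum plays the role of `reduceByKey`):

```scala
object WordCountSketch {
  def wordCount(lines: List[String]): Map[String, Int] =
    lines.flatMap(_.split(" "))   // same tokenization as the Spark job
      .map(word => (word, 1))     // pair each word with a count of 1
      .groupBy(_._1)              // local stand-in for the shuffle
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // reduce per key
}
```

For instance, `wordCount(List("to be or", "not to be"))` counts "to" and "be" twice each, and "or" and "not" once each.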



Using Hadoop as Storage

• Spark can use Hadoop for storage:
  - Spark is NOT limited to HDFS for its storage needs
  - HDFS provides distributed storage of large datasets
  - High availability is assured natively through HDFS
  - No extra software installation is required
  - Compatible with Hadoop 1.x as well; using HDFS as storage doesn't require Hadoop 2.x
  - Data loss during computation is handled by HDFS itself



Using Hadoop as Execution Engine

• Spark can use Hadoop as an execution engine:
  - Spark can be integrated with YARN for its execution
  - Spark can also be used with other engines (such as Mesos or Spark's own cluster manager)
  - YARN integration automatically provides processing scalability to Spark
  - Spark needs Hadoop 2.0+ in order to use it for execution
  - Spark must also be installed on every node in the Hadoop cluster
  - Using a Hadoop cluster for Spark processes requires upgrading the RAM of the data nodes
  - The integrated distribution of Spark is quite new and still stabilizing
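Submitting a Spark application to YARN then typically looks like the following (a sketch – the main class name and jar path are placeholders for your own build):

```
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.WordCount \
  path/to/your-app.jar
```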



Questions

