Spark and Scala - Module 5
ᗍ When Bombay Stock Exchange (the seventh largest stock exchange in the world in terms of market capitalization)
wanted to scale up its operations, the company faced major challenges
ᗍ These challenges were the exponential growth of data (read: big data), the need for complex analytics, and
managing information that was scattered across multiple, monolithic systems
ᗍ DataMetica (a Mumbai/Pune based big data organization) suggested a three-phased solution to BSE:
• In the first phase, they created a POC which demonstrated how a Hadoop-based big data implementation could
work for BSE
• In the second phase, they worked with BSE to pick the most critical business use cases (those with the
maximum ROI for BSE) and implemented them
• Finally, in the third phase, they delivered the complete solution in a multi-faceted manner for a full-fledged
implementation
ᗍ That’s how Hadoop got implemented at BSE in a cost-effective and scalable fashion
[Diagram: BSE solution architecture, in which structured data ingested via Sqoop and unstructured data are loaded into the data processing system, and a business analytics / batch processing layer (Pig) produces the output]
ᗍ City of Chicago uses a MongoDB-based real-time analytics platform called WindyGrid
ᗍ This platform integrates unstructured data from various city departments to predict correlations and outcomes in a
proactive manner, e.g. how a rodent complaint often follows within seven days of a garbage complaint
WindyGrid in Practice
ᗍ With the MongoDB-based system, WindyGrid created a central nervous system for Chicago, helping improve services,
cut costs, and create a more livable city
ᗍ By pulling together 311 and 911 calls, tweets, and bus locations, the city can better manage traffic and incidents
and get streets cleaned and opened up more quickly
ᗍ The city of Chicago collects more than seven million rows of data every day. With MongoDB’s flexible data schema,
the system doesn’t need to worry about unwieldy and constantly changing schema requirements
ᗍ The next step for WindyGrid is building an open-source, predictive analytics system called the SmartData Platform to
anticipate problems before they occur and propose solutions even faster
Apache Hadoop is a framework that allows the distributed processing of large data sets across
clusters of commodity computers using a simple programming model
[Diagram: Hadoop characteristics, including Reliable and Flexible]
ᗍ Apache Spark is a fast and general engine for large-scale data processing
ᗍ Apache Spark is a general-purpose, in-memory cluster computing system
ᗍ It is used for fast data analytics
ᗍ It provides APIs in Java, Scala and Python, and an optimized engine that supports general execution
graphs
ᗍ Provides various high-level tools like Spark SQL for structured data processing, MLlib for machine learning and
more
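An illustrative Spark SQL sketch (assumes Spark 1.4 or later and an existing SparkContext sc; the input file people.json is hypothetical):
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // infer the schema from a JSON file
people.registerTempTable("people")                 // expose the DataFrame to SQL queries
sqlContext.sql("SELECT name FROM people WHERE age >= 21").show()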
[Diagram: the Spark ecosystem built on the Core Spark Engine, comprising BlinkDB (an approximate SQL query engine, alpha/pre-alpha), Spark Streaming (enables analytical and interactive apps over live streaming data), GraphX (graph computation, similar to Giraph), SparkR (a package letting R users leverage Spark from the R shell) and MLlib (a machine learning library being built on top of Spark, with speeds up to 100 times faster than MapReduce)]
ᗍ BlinkDB
• An approximate query engine that runs over the Core Spark Engine
• Trades off accuracy for response time
ᗍ MLlib
• Machine learning library being built on top of Spark (see the sketch after this list)
• Supports many machine learning algorithms, with speeds up to 100 times faster than MapReduce
• Mahout is also being migrated to MLlib
ᗍ GraphX
• Graph computation engine (similar to Giraph)
• Combines data-parallel and graph-parallel concepts
ᗍ SparkR
• Package for the R language that enables R users to leverage Spark’s power from the R shell
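An illustrative MLlib sketch (assumes an existing SparkContext sc; the input file kmeans_data.txt, holding space separated numeric features, is hypothetical):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line into a dense feature vector and keep the RDD in memory
val points = sc.textFile("kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Cluster the points into 2 groups using at most 20 iterations
val model = KMeans.train(points, 2, 20)
model.clusterCenters.foreach(println)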
ᗍ Spark exposes a simple programming layer which provides powerful caching and disk persistence capabilities
ᗍ The Spark framework can be deployed through Apache Mesos, Apache Hadoop via YARN, or Spark’s own cluster
manager
ᗍ The Spark framework is polyglot: it can be programmed in several languages (currently Scala, Java and
Python are supported)
ᗍ Has a very active community
ᗍ Spark fits well with the existing Hadoop ecosystem
• Can be launched in an existing YARN cluster
• Can fetch data from Hadoop 1.0
• Can be integrated with Hive
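An illustrative sketch of the Hive integration on Spark 1.x (assumes a Hive-enabled Spark build with hive-site.xml on the classpath; the table src is hypothetical):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)        // reuses the existing SparkContext
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)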
ᗍ MapReduce is a very powerful programming paradigm, but it has some limitations:
• It is difficult to program an algorithm directly in native MapReduce
• Performance bottlenecks, specifically for small batches that do not fit its use cases
• Many categories of algorithms are not supported (e.g. iterative algorithms, asynchronous algorithms etc.)
ᗍ In short, MapReduce doesn’t compose well for large applications
ᗍ We are often forced to take “hybrid” approaches
ᗍ Therefore, many specialized systems evolved over a period of time as workarounds
Pregel, Giraph, Impala, GraphLab, Storm, S4
ᗍ Unlike other evolved specialized systems, Spark’s design goal is to generalize the MapReduce concept to support new
apps within the same engine
ᗍ Two reasonably small additions are enough to express the previous models:
• Fast data sharing (for faster processing)
• General DAGs (for lazy processing)
ᗍ This allows for an approach which is more efficient for the engine, and much simpler for the end users
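An illustrative sketch of both additions (assumes an existing SparkContext sc; the log file app.log and its tab separated format are hypothetical):
// Building the DAG is lazy: nothing runs until an action is called
val logs   = sc.textFile("app.log")
val errors = logs.filter(_.contains("ERROR"))
val byHost = errors.map(line => (line.split('\t')(0), 1)).reduceByKey(_ + _)

// Fast data sharing: cache the filtered RDD so later actions reuse it in memory
errors.cache()
println(errors.count())            // first action triggers execution of the DAG
byHost.take(10).foreach(println)   // reuses the cached errors instead of re-reading the file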
[Chart: “Code Size” in lines of code (scale 0 to 140,000), with GraphX, Shark and Streaming shown as components]
ᗍ RDDs track the series of transformations used to build them (their lineage) to recompute lost data
Example:
val messages = sc.textFile(…).filter(_.contains("error"))
                             .map(_.split('\t')(2))
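If a cached partition is lost, Spark rebuilds it from this lineage instead of replicating the data; the lineage itself can be inspected (an illustrative continuation of the sketch above, where the "mysql" query is hypothetical):
messages.cache()                                 // keep the derived RDD in memory across queries
messages.filter(_.contains("mysql")).count()     // lost partitions are recomputed from the lineage
println(messages.toDebugString)                  // prints the chain of transformations (the lineage)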
[Diagram: Spark and Hadoop benefits, covering RDDs for in-memory performance, DAGs that unify processing, unlimited scale and ease of development]
[Diagram: Spark cluster architecture, with a Driver Program coordinating Worker Nodes, each running an Executor holding a Cache and multiple Tasks]
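An illustrative sketch of a standalone driver program (the object name SimpleDriver and the local master URL are hypothetical choices):
import org.apache.spark.{SparkConf, SparkContext}

object SimpleDriver {
  def main(args: Array[String]): Unit = {
    // The master URL decides where executors run: local[*], YARN, Mesos,
    // or a spark:// URL pointing at Spark's own cluster manager
    val conf = new SparkConf().setAppName("SimpleDriver").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    val nums = sc.parallelize(1 to 1000000)   // partitions processed as tasks in the executors
    println("sum = " + nums.sum())            // the result is collected back to the driver

    sc.stop()
  }
}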
Why SBT?
ᗍ Full Scala language support for creating tasks
ᗍ Continuous command execution
ᗍ Launch REPL in project context
ᗍ SBT can be used both as a command line script and as a build console
ᗍ We’ll be primarily using it as a build console, but most commands can be run standalone by passing the command
as an argument to SBT, e.g. sbt compile, sbt test or sbt run
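A minimal build.sbt sketch for a Spark project (the project name and the Scala/Spark versions are illustrative assumptions):
name := "spark-module5"

version := "0.1"

scalaVersion := "2.10.6"

// Spark core as a library dependency, resolved from Maven Central
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"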
This simple program provides a good test case for parallel processing, since it:
ᗍ Requires a minimal amount of code
ᗍ Demonstrates use of both symbolic and numeric values
ᗍ Isn’t many steps away from search indexing
val f = sc.textFile("README.md")
val wc = f.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)   // count occurrences of each word
wc.saveAsTextFile("wc_out")