Big Data Processing With Spark

View Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training
Apache Spark | Spark SQL

Objectives
At the end of this module, you will be able to:
• Explain what Spark is
• Describe the Spark architecture
• Explain what an RDD is
• Create an RDD and run a sample example (demo)
• Describe Spark SQL

What is Spark?
Apache Spark is an open-source, parallel data processing framework that complements Apache Hadoop to make it
easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics.
• Developed at UC Berkeley
• Written in Scala, a functional programming language that runs on the JVM
• Generalizes the MapReduce framework

Why Spark?
Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Ease of Use: Supports different languages for developing applications using Spark.
Generality: Combine SQL, streaming, and complex analytics into one platform.
Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.

MapReduce Limitations
• MapReduce is a great solution for one-pass computations, but it is not very efficient for use cases that require multi-pass computations and algorithms (machine learning, etc.)
• To run complicated jobs, you have to string together a series of MapReduce jobs and execute them in sequence
• Each of those jobs is high-latency, and none can start until the previous job has finished completely
• The output data of each job has to be stored in the distributed file system before the next step can begin
• Hadoop requires the integration of several tools for different big data use cases (like Mahout for machine learning and Storm for streaming data processing)

Spark Features
• Spark takes MapReduce to the next level with less expensive shuffles in data processing and capabilities like in-memory data storage
• Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing
• It is designed to be an execution engine that works both in memory and on disk
• Lazy evaluation of big data queries helps optimize the overall data processing workflow
• Provides concise and consistent APIs in Scala, Java, and Python
• Offers an interactive shell for Scala and Python; this is not yet available for Java
• Spark supports high-level APIs for developing applications (Scala, Java, Python, Clojure, R)
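The lazy-evaluation point above can be illustrated without Spark itself. The following is a minimal sketch in plain Python, not Spark's API: the `LazyPipeline` class and its method names are hypothetical, chosen only to show that chained steps are merely recorded until a result is requested.

```python
class LazyPipeline:
    """Toy model of lazy evaluation (hypothetical; not part of Spark)."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)  # recorded plan, not yet executed

    def map(self, fn):
        # Record the step and return a new pipeline; no data is touched.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyPipeline(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Only here is the recorded plan actually run, in a single pass.
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []

def traced_square(x):
    calls.append(x)  # lets us observe when work really happens
    return x * x

plan = LazyPipeline(range(1, 6)).map(traced_square).filter(lambda v: v % 2 == 1)
calls_before_collect = len(calls)  # still 0: nothing has run yet
result = plan.collect()            # the whole plan runs now
```

Because the full chain is known before anything runs, an engine is free to fuse or reorder steps; this toy simply fuses them into one pass over the data.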

Spark Architecture
Spark Core, with libraries on top: Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, SparkR
Cluster management (native Spark cluster, YARN, Mesos)
Distributed storage (HDFS, Cassandra, S3, HBase)

Spark Advantages
EASE OF DEVELOPMENT
• Easier APIs
• Python, Scala, Java
IN-MEMORY PERFORMANCE
• RDDs
• DAGs unify processing
COMBINE WORKFLOWS
• Shark, ML, Streaming, GraphX

Hadoop Advantages
UNLIMITED SCALE
• Multiple data sources
• Multiple applications
• Multiple users
ENTERPRISE PLATFORM
• Reliability
• Multi-tenancy
• Security
WIDE RANGE OF APPLICATIONS
• Files
• Databases
• Semi-structured

Spark + Hadoop
Hadoop: UNLIMITED SCALE, WIDE RANGE OF APPLICATIONS, ENTERPRISE PLATFORM
Spark: EASE OF DEVELOPMENT, COMBINE WORKFLOWS, IN-MEMORY PERFORMANCE
Together: operational applications augmented by in-memory performance

Resilient Distributed Datasets
RDD (Resilient Distributed Dataset)
Resilient – if data in memory is lost, it can be recreated
Distributed – stored in memory across the cluster
Dataset – initial data can come from a file or be created programmatically
RDDs are the fundamental unit of data in Spark

RDDs are the core concept of the Spark framework.
RDDs can store any type of data:
Primitive types: integers, characters, Booleans, etc.
Files: text files, SequenceFiles, etc.
RDDs are fault tolerant.
RDDs are immutable.
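The two properties above, immutability and the ability to recreate lost in-memory data, can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation: `ToyDataset`, `lose_memory`, and `compute` are hypothetical names, and the "failure" is simulated by simply dropping a cached copy.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: immutable, rebuilt from its lineage."""

    def __init__(self, load, lineage=()):
        self._load = load                # how to re-read the initial data
        self._lineage = tuple(lineage)   # functions applied since loading
        self._cached = None              # in-memory copy; may be lost

    def map(self, fn):
        # Immutability: the original is untouched; a new dataset is returned.
        return ToyDataset(self._load, self._lineage + (fn,))

    def compute(self):
        if self._cached is None:
            rows = list(self._load())
            for fn in self._lineage:
                rows = [fn(r) for r in rows]
            self._cached = tuple(rows)
        return self._cached

    def lose_memory(self):
        # Simulate losing the in-memory copy (for example, a failed worker).
        self._cached = None


base = ToyDataset(lambda: [1, 2, 3])
doubled = base.map(lambda x: x * 2)

first = doubled.compute()
doubled.lose_memory()
recreated = doubled.compute()  # rebuilt from the source plus recorded steps
```

The dataset never stores edits in place; it stores how it was derived, which is what makes recreating it after a loss possible.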

RDDs support two types of operations:
Transformation: a transformation does not return a single value; it returns a new RDD.
Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
Action: an action evaluates the RDD and returns a value rather than a new RDD.
Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
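The split between transformations and actions can be made concrete with a word count. The sketch below is plain Python rather than Spark: `MiniRDD` is a hypothetical class whose method names mirror a few of the operations listed above, and unlike real Spark it evaluates each step eagerly and on a single machine.

```python
from functools import reduce as fold


class MiniRDD:
    """Hypothetical, eager stand-in for an RDD (real Spark evaluates lazily)."""

    def __init__(self, items):
        self._items = tuple(items)

    # --- Transformations: each returns a new MiniRDD ---
    def map(self, fn):
        return MiniRDD(fn(x) for x in self._items)

    def flatMap(self, fn):
        return MiniRDD(y for x in self._items for y in fn(x))

    def filter(self, pred):
        return MiniRDD(x for x in self._items if pred(x))

    def reduceByKey(self, fn):
        merged = {}
        for key, value in self._items:
            merged[key] = fn(merged[key], value) if key in merged else value
        return MiniRDD(merged.items())

    # --- Actions: each returns a plain value ---
    def reduce(self, fn):
        return fold(fn, self._items)

    def collect(self):
        return list(self._items)

    def count(self):
        return len(self._items)

    def first(self):
        return self._items[0]


lines = MiniRDD(["to be or", "not to be"])
counts = (lines.flatMap(str.split)
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

word_counts = counts.collect()   # action: list of (word, count) pairs
distinct_words = counts.count()  # action: a number
total_words = counts.map(lambda kv: kv[1]).reduce(lambda a, b: a + b)
```

Every call in the `counts` chain yields another `MiniRDD`, so the chain can keep growing; only `collect`, `count`, and `reduce` hand back ordinary Python values.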

Spark SQL
Spark SQL runs on top of Spark Core.
• Spark SQL allows relational queries through Spark
• The backbone for all these operations is SchemaRDD
• SchemaRDDs are made of Row objects along with metadata describing the schema
• SchemaRDDs are equivalent to RDBMS tables
• They can be constructed from existing RDDs, JSON data sets, Parquet files, or HiveQL queries against data stored in Apache Hive
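The idea of "row objects plus schema metadata" can be sketched in plain Python. This is only an illustration and does not use Spark SQL: the column names `id`, `name`, and `score` and the JSON records are assumptions made up for the example, and the "query" is an ordinary list comprehension standing in for a relational SELECT.

```python
import json
from collections import namedtuple

# Assumed sample input: one JSON document per line.
json_lines = [
    '{"id": 1, "name": "john", "score": 4.1}',
    '{"id": 2, "name": "mike", "score": 3.5}',
    '{"id": 3, "name": "sally", "score": 6.4}',
]

records = [json.loads(line) for line in json_lines]

# Metadata: the schema (column names) is inferred from the first record.
schema = tuple(records[0].keys())

# Row objects: every record becomes a Row that follows the schema.
Row = namedtuple("Row", schema)
table = [Row(**record) for record in records]

# Rough equivalent of: SELECT name FROM table WHERE score > 4.0
high_scorers = [row.name for row in table if row.score > 4.0]
```

Keeping the schema next to the rows is what lets the data be addressed by column name, in the same way a table in a relational database would be.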

Spark SQL
Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Scala and Java.
• The Shark project has been discontinued; Spark SQL is used in its place
Shark: development ending; transitioning to Spark SQL
Spark SQL: a new SQL engine designed from the ground up for Spark
Hive on Spark: helps existing Hive users migrate to Spark

Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead
Instead, Spark SQL employs column-oriented storage using arrays of primitive types
Row Storage
1  john   4.1
2  mike   3.5
3  sally  6.4

Column Storage
1     2     3
john  mike  sally
4.1   3.5   6.4
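The row and column layouts above can be mimicked in plain Python. This is a rough sketch of the idea, not Spark SQL's storage code: the standard-library `array` type is used here as an assumed stand-in for "arrays of primitive types", and the variable names are made up for the example.

```python
from array import array

# Row storage: one object (a tuple) per record.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one container per column. Numeric columns are packed
# primitive arrays, so there is no per-value object wrapper.
ids = array("q", (r[0] for r in rows))     # 'q' = signed 64-bit integers
names = [r[1] for r in rows]               # strings stay in a plain list
scores = array("d", (r[2] for r in rows))  # 'd' = 64-bit floats

# A query over a single column only has to scan that column's array.
average_score = sum(scores) / len(scores)

# A full record can still be reassembled by position when needed.
second_row = (ids[1], names[1], scores[1])

bytes_per_score = scores.itemsize  # fixed width of each packed value
```

Each packed value takes a fixed number of bytes, which is the overhead saving the slide refers to when it contrasts this layout with caching one object per record.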

Demo On Spark RDDs

Course Features
• LIVE Online Class
• Class Recording in LMS
• 24/7 Post Class Support
• Module Wise Quiz
• Project Work
• Verifiable Certificate

Questions

Course Topics
• Module 1
» Introduction to Scala
• Module 2
» Scala Essentials
• Module 3
» Traits and OOPs in Scala
• Module 4
» Functional Programming in Scala
• Module 5
» Introduction to Big Data and Spark
• Module 6
» Spark Baby Steps
• Module 7
» Playing with RDDs
• Module 8
» Spark with SQL: When Spark meets Hive
