Apache Spark | Spark SQL
View Apache Spark and Scala course details at www.edureka.co/apache-spark-scala-training
Slide 2
Objectives
By the end of this module, you will be familiar with:
‱ An introduction to Spark
‱ Spark architecture
‱ What an RDD is
‱ A demo on creating an RDD and running a sample example
‱ Spark SQL
Slide 3
What is Spark?
Apache Spark is an open-source, parallel data processing framework that complements Apache Hadoop, making it easy to develop fast, unified Big Data applications that combine batch, streaming, and interactive analytics.
‱ Developed at UC Berkeley
‱ Written in Scala, a functional programming language that runs on the JVM
‱ Generalizes the MapReduce framework
Slide 4
Why Spark?
‱ Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
‱ Ease of Use: Supports several languages for developing Spark applications.
‱ Generality: Combines SQL, streaming, and complex analytics in one platform.
‱ Runs Everywhere: Spark runs on Hadoop, Mesos, standalone, or in the cloud.
Slide 5 www.edureka.co/apache-spark-scala-trainingSlide 5
ï‚źMap Reduce is a great solution for one-pass computations, but not very efficient for use cases that require multi-pass
computations and algorithms ( Machine learning etc.)
ï‚źTo run complicated jobs, you would have to string together a series of Map Reduce jobs and execute them in
sequence
ï‚ź Each of those jobs was high-latency, and none could start until the previous job had finished completely
ï‚źThe Job output data between each step has to be stored in the local file system before the next step can begin
ï‚ź Hadoop requires the integration of several tools for different big data use cases (like Mahout for Machine Learning
and Storm for streaming data processing)
Map Reduce Limitations
Slide 6
Spark Features
‱ Spark takes MapReduce to the next level with less expensive shuffles during data processing and with capabilities such as in-memory data storage
‱ Spark has an advanced DAG (directed acyclic graph) execution engine that supports acyclic data flow and in-memory computing
‱ It is designed to be an execution engine that works both in memory and on disk
‱ Lazy evaluation of big data queries helps optimize the overall data processing workflow
‱ Provides concise and consistent APIs in Scala, Java, and Python
‱ Offers an interactive shell for Scala and Python; this is not yet available for Java
‱ Spark supports high-level APIs for developing applications (Scala, Java, Python, Clojure, R)
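The lazy-evaluation point above can be illustrated without Spark. The following is a minimal sketch in plain Python, not Spark's API: the class `LazyPipeline` and the helper `traced_double` are hypothetical names, chosen only to show that transformations record work and that nothing runs until a result is requested.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, source: Iterable,
                 steps: Optional[List[Callable[[Iterator], Iterator]]] = None):
        self._source = source
        self._steps = steps or []

    def map(self, fn):
        # Record the step only; the data is not touched here.
        return LazyPipeline(self._source,
                            self._steps + [lambda it: (fn(x) for x in it)])

    def filter(self, pred):
        # Also just recorded, not executed.
        return LazyPipeline(self._source,
                            self._steps + [lambda it: (x for x in it if pred(x))])

    def collect(self):
        # The recorded chain of steps is executed only when a result is requested.
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


calls = []  # records every element the mapping function actually sees


def traced_double(x):
    calls.append(x)
    return x * 2


pipeline = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_collect = list(calls)  # still empty: no work has been done yet
result = pipeline.collect()         # now the whole chain runs in one pass
```

Because the whole chain is known before anything executes, an engine has the opportunity to plan the work as a unit, which is the optimization benefit the slide refers to.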
Slide 7
Spark Architecture
[Diagram: Spark Core, Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, SparkR]
Slide 8
Spark Architecture
[Diagram: Spark Core, Spark Streaming, Spark SQL, BlinkDB, MLlib, GraphX, SparkR]
Cluster management (native Spark cluster, YARN, Mesos)
Distributed storage (HDFS, Cassandra, S3, HBase)
Slide 9
Spark Advantages
EASE OF DEVELOPMENT
‱ Easier APIs
‱ Python, Scala, Java
IN-MEMORY PERFORMANCE
‱ RDDs
‱ DAGs Unify Processing
COMBINE WORKFLOWS
‱ Shark, ML, Streaming, GraphX
Slide 10
Hadoop Advantages
UNLIMITED SCALE
‱ Multiple data sources
‱ Multiple applications
‱ Multiple users
ENTERPRISE PLATFORM
‱ Reliability
‱ Multi-tenancy
‱ Security
WIDE RANGE OF APPLICATIONS
‱ Files
‱ Databases
‱ Semi-structured
Slide 11
Spark + Hadoop
Hadoop: UNLIMITED SCALE, WIDE RANGE OF APPLICATIONS, ENTERPRISE PLATFORM
Spark: EASE OF DEVELOPMENT, COMBINE WORKFLOWS, IN-MEMORY PERFORMANCE
Operational Applications Augmented by In-Memory Performance
Slide 12
Resilient Distributed Datasets
RDD (Resilient Distributed Dataset)
Resilient: if data in memory is lost, it can be recreated
Distributed: stored in memory across the cluster
Dataset: initial data can come from a file or be created programmatically
RDDs are the fundamental unit of data in Spark
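The "resilient" property rests on remembering how each dataset was derived, so that lost data can be recomputed rather than restored from a replica. Below is a toy sketch of that idea in plain Python. `ToyRDD` and its methods are hypothetical and do not reflect Spark's implementation; the sketch assumes the source data itself is durable (for example, it lives in a file).

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical model of lineage-based recovery; not Spark's implementation."""

    def __init__(self, partitions: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None):
        self._parent = parent  # lineage: where this dataset came from
        self._fn = fn          # lineage: how it was derived from the parent
        if partitions is not None:
            # Source dataset: assumed durable, so it simply holds its data.
            self._cache: Dict[int, List[int]] = dict(enumerate(partitions))
            self.num_partitions = len(partitions)
        else:
            # Derived dataset: starts empty and is computed on demand.
            self._cache = {}
            self.num_partitions = parent.num_partitions

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(parent=self, fn=fn)

    def compute(self, index: int) -> List[int]:
        if index not in self._cache:
            # Rebuild the partition from the parent using the recorded function.
            self._cache[index] = [self._fn(x) for x in self._parent.compute(index)]
        return self._cache[index]

    def lose_partition(self, index: int) -> None:
        # Simulate a node failure that wipes one in-memory partition.
        self._cache.pop(index, None)


source = ToyRDD(partitions=[[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)
before = [squared.compute(i) for i in range(squared.num_partitions)]
squared.lose_partition(1)       # the data for partition 1 is gone from memory
recovered = squared.compute(1)  # recreated from the parent and the recorded function
```

Only the lost partition is recomputed; the surviving partition stays in memory untouched.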
Slide 13
Resilient Distributed Datasets
RDDs are the core concept of the Spark framework.
RDDs can store any type of data:
Primitive types: integers, characters, Booleans, etc.
Files: text files, SequenceFiles, etc.
RDDs are fault tolerant.
RDDs are immutable.
Slide 14
Resilient Distributed Datasets
RDDs support two types of operations:
Transformation: a transformation does not return a single value; it returns a new RDD.
Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
Action: an action evaluates the RDD and returns a value.
Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach.
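To make the two kinds of operation concrete, here is a sketch of what a few of the operations named above compute, written as plain Python functions over ordinary lists. The helper names (`flat_map`, `reduce_by_key`, `count_by_key`) mirror the Spark operation names but are hypothetical local functions; unlike Spark, they run eagerly on a single machine.

```python
from collections import Counter
from functools import reduce
from typing import Callable, Dict, Iterable, List, Tuple


def flat_map(fn: Callable, data: Iterable) -> List:
    # Transformation-style: each input element may yield several outputs.
    return [y for x in data for y in fn(x)]


def reduce_by_key(fn: Callable, pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    # Transformation-style: merge the values for each key; returns a new dataset.
    merged: Dict[str, int] = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return sorted(merged.items())


def count_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    # Action-style: returns a plain value (a dict), not another dataset.
    return dict(Counter(key for key, _ in pairs))


lines = ["spark sql", "spark core"]
words = flat_map(str.split, lines)                       # new dataset of words
pairs = [(word, 1) for word in words]                    # map each word to (word, 1)
word_counts = reduce_by_key(lambda a, b: a + b, pairs)   # still a dataset of pairs
total = reduce(lambda a, b: a + b, (n for _, n in word_counts))  # single value
occurrences = count_by_key(pairs)                        # single dict value
```

The first three steps each produce another collection, as transformations do, while `total` and `occurrences` are final values of the kind an action hands back.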
Slide 15
Spark SQL
[Diagram: Spark SQL on top of Spark Core]
‱ Spark SQL allows relational queries to be run through Spark
‱ The backbone for all these operations is SchemaRDD
‱ SchemaRDDs are made of Row objects along with metadata (schema) information
‱ SchemaRDDs are equivalent to tables in an RDBMS
‱ They can be constructed from existing RDDs, JSON data sets, Parquet files, or HiveQL queries against data stored in Apache Hive
Slide 16
Spark SQL
‱ Spark SQL lets you query structured data as a distributed dataset (RDD) in Spark, with integrated APIs in Scala and Java
‱ The Shark project has been discontinued; where Shark was used earlier, Spark SQL is used now
‱ Shark: development ending, transitioning to Spark SQL
‱ Spark SQL: a new SQL engine designed from the ground up for Spark
‱ Hive on Spark: helps existing Hive users migrate to Spark
Slide 17
Efficient In-Memory Storage
‱ Simply caching Hive records as Java objects is inefficient due to high per-object overhead
‱ Instead, Spark SQL employs column-oriented storage using arrays of primitive types
Column Storage
1     2     3
john  mike  sally
4.1   3.5   6.4
Row Storage
1  john   4.1
2  mike   3.5
3  sally  6.4
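The two layouts above can be sketched with the three sample records from the slide. This is a plain-Python illustration of the layout idea only, using the standard `array` module for the primitive-typed columns; it is not Spark SQL's actual in-memory format, and the variable names are made up for the example.

```python
from array import array

# Row storage: one object per record, each holding all three fields.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one sequence per field. The numeric columns are packed
# arrays of primitives rather than collections of boxed objects.
ids = array("i", (r[0] for r in rows))     # C ints
names = [r[1] for r in rows]               # strings stay in an ordinary list
scores = array("d", (r[2] for r in rows))  # C doubles

# A query that touches one column scans a single contiguous array.
max_score = max(scores)

# A full record can still be reassembled by position when needed.
second_row = (ids[1], names[1], scores[1])
```

The column form pays off when a query reads only a few fields of many records, since the unused columns are never touched.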
Slide 18
Demo On Spark RDDs
Slide 19
Course Features
‱ Live online class
‱ Class recording in LMS
‱ 24/7 post-class support
‱ Module-wise quiz
‱ Project work
‱ Verifiable certificate
Slide 20
Questions
Slide 21
Course Topics
‱ Module 1
» Introduction to Scala
‱ Module 2
» Scala Essentials
‱ Module 3
» Traits and OOP in Scala
‱ Module 4
» Functional Programming in Scala
‱ Module 5
» Introduction to Big Data and Spark
‱ Module 6
» Spark Baby Steps
‱ Module 7
» Playing with RDDs
‱ Module 8
» Spark with SQL: When Spark Meets Hive