Getting Started with Apache Spark
Presented by Manish Mishra and Pradyuman Pratap Singh
KnolX Etiquettes
Lack of etiquette and manners is a huge turn off.
• Punctuality: Join the session 5 minutes prior to the start time. We start on time and conclude on time!
• Feedback: Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
• Silent Mode: Keep your mobile devices in silent mode. Feel free to step out of the session if you need to attend an urgent call.
• Avoid Disturbance: Avoid unwanted chit-chat during the session.
1. Introduction to Big Data and Apache Spark
   • What is Big Data?
   • What is Apache Spark?
   • Features of Apache Spark
2. Overview of Spark Architecture
3. Spark Components
4. Spark Basics & Programming Model
   • Spark Context
   • Spark Session
   • RDD
   • Dataframe
   • RDD v/s Dataframe
5. Advantages of Apache Spark
6. Disadvantages of Apache Spark
7. Demo
Getting Started with Apache Spark (Scala)
What is Big Data?
Big Data refers to data sets that are too large, too fast-moving, and too complex for traditional data-processing systems to handle. It includes a wide variety of data types from many sources.
It is characterized by the 5 Vs:
• Volume: Massive amounts of data.
• Velocity: The speed at which data is generated and processed.
• Variety: Different types of data (structured, semi-structured, unstructured).
• Veracity: Data quality and accuracy.
• Value: The value the data provides.
What is Apache Spark?
• Apache Spark is an open-source engine for large-scale distributed data processing, analytics, and machine learning. It can handle both batch and streaming (near-real-time) workloads.
• It builds on the ideas of the Hadoop MapReduce model and generalizes them to support more types of computation efficiently, including interactive queries and stream processing.
• The main feature of Spark is in-memory computing: intermediate data can be kept in memory rather than written to disk between steps, which increases the processing speed of an application.
Features of Apache Spark
• In Memory Computation
• Speed
• Different Cluster Managers
• Distributed Processing
• Fault Tolerant
• Lazy Evaluation
Apache Spark Architecture
Spark Components
Spark Core: the Spark engine that the libraries build on.
Libraries:
• Spark SQL
• Spark Streaming: real-time processing
• MLlib: machine learning
• GraphX: graph processing
Supported languages: Scala, Java, Python, R
Spark Basics
1. Spark Context: SparkContext is the primary entry point to Spark functionality. When we run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs the operations inside executors on the worker nodes.
2. Spark Session: SparkSession is a unified entry point for Spark applications, introduced in Spark 2.0. It gives access to Spark's underlying functionality, including RDDs, DataFrames, and Datasets, through a single interface for structured data processing.
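
The relationship between the two entry points can be sketched as follows. This is a minimal illustration, not the presenters' demo code: the object name `SessionSketch`, the app name, and the `local[*]` master are assumptions chosen so the example runs on a single machine with the Spark SQL library on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object SessionSketch {
  // Builds (or reuses) a SparkSession. The app name and the "local[*]"
  // master are illustrative assumptions for running on one machine.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("getting-started-sketch")
      .master("local[*]")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    val spark = buildSession()

    // The SparkContext is still there; the session exposes it.
    val sc = spark.sparkContext
    println(s"App name: ${sc.appName}, master: ${sc.master}")

    spark.stop()
  }
}
```

In Spark 2.0 and later, application code usually starts from the session and reaches the context through `spark.sparkContext` only when it needs the RDD API.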
RDD
• A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
• There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDD Operations:
o Transformations: build a new RDD from an existing one and are evaluated lazily.
o Actions: trigger the computation and return a result to the driver or write it to storage.
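
A small sketch of RDD creation, transformations, and actions, assuming a local Spark installation. The object name `RddSketch`, the helper `evenSquares`, and the input numbers are hypothetical and exist only for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Transformations only: filter and map describe a new RDD,
  // but nothing is computed until an action is called on it.
  def evenSquares(numbers: RDD[Int]): RDD[Int] =
    numbers.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]") // assumption: run locally
      .getOrCreate()
    val sc = spark.sparkContext

    // Way 1: parallelize an existing collection in the driver program,
    // split here into 4 partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Way 2 (not run here): reference external storage, for example
    // sc.textFile(path), where `path` would point at a file or directory.

    val result = evenSquares(numbers) // still lazy

    // Actions: collect() and count() trigger the actual computation.
    println(result.collect().mkString(", "))
    println(result.count())

    spark.stop()
  }
}
```

Keeping the transformations in their own function makes the laziness visible: calling `evenSquares` returns immediately, and work is scheduled on the executors only when `collect()` or `count()` runs.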
Dataframe
• In Spark, a DataFrame is a distributed collection of data organized into rows and named columns. Each column has a name and an associated type. DataFrames are like traditional database tables: structured and concise.
• Conceptually, a DataFrame is equivalent to a table in a relational database, but with richer optimizations applied under the hood.
• Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs. DataFrames allow the processing of huge amounts of data.
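
As an illustration of named, typed columns, the sketch below builds a DataFrame from a local collection and queries it both through the DataFrame API and through SQL. The object name `DataFrameSketch`, the sample rows, and the view name `people` are hypothetical; a local Spark installation is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DataFrameSketch {
  // Builds a small DataFrame from hypothetical sample rows.
  // toDF assigns the column names; the types are inferred (string, int).
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Asha", 34), ("Ravi", 28), ("Meera", 41)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[*]") // assumption: run locally
      .getOrCreate()

    val df = people(spark)
    df.printSchema() // shows each column's name and type

    // DataFrame API: filter rows and project a column.
    df.filter(df("age") > 30).select("name").show()

    // The same query through SQL, using a temporary view.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}
```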
RDD v/s Dataframe

Feature: Data format
  RDD: Structured and unstructured data.
  Dataframe: Structured and semi-structured data.

Feature: APIs
  RDD: Provides a low-level API that requires more code to perform transformations and actions on data.
  Dataframe: Provides a high-level API that makes it easier to perform transformations and actions on data.

Feature: Schema enforcement
  RDD: Has no explicit schema and is often used for unstructured data.
  Dataframe: Has an explicit schema that describes the data and its types. The schema is enforced at runtime.

Feature: Optimization
  RDD: No built-in optimization engine is available.
  Dataframe: Uses the Catalyst optimizer.
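
To make the comparison concrete, the sketch below computes the same per-category totals with both APIs. The object name `RddVsDataFrameSketch`, the sample sales data, and the column names are hypothetical, and a local Spark installation is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object RddVsDataFrameSketch {
  // Hypothetical sample data: (category, amount).
  val sales: Seq[(String, Double)] =
    Seq(("books", 12.0), ("games", 30.0), ("books", 8.0))

  // RDD version: positional tuples, no schema, no query optimizer.
  def totalsWithRdd(spark: SparkSession): Map[String, Double] =
    spark.sparkContext
      .parallelize(sales)
      .reduceByKey(_ + _)
      .collect()
      .toMap

  // DataFrame version: named columns and a declarative aggregation
  // that Catalyst can plan and optimize.
  def totalsWithDataFrame(spark: SparkSession): Map[String, Double] = {
    import spark.implicits._
    val totals = sales.toDF("category", "amount")
      .groupBy("category")
      .agg(sum("amount").as("total"))
    totals.explain() // prints the physical plan chosen for the query
    totals.collect().map(row => row.getString(0) -> row.getDouble(1)).toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-vs-dataframe-sketch")
      .master("local[*]") // assumption: run locally
      .getOrCreate()
    println(totalsWithRdd(spark))
    println(totalsWithDataFrame(spark))
    spark.stop()
  }
}
```

Both functions return the same totals. The difference is in how the work is expressed: the RDD version spells out the computation step by step, while the DataFrame version states what is wanted and leaves the execution plan to the optimizer.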
Advantages of Apache Spark
• In Memory Computation
• Speed
• Ease of Use
• Advanced Analytics
• Fault Tolerant
• Multi Language Support
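Disadvantages of Apache Spark
• Small files issue: processing a large number of small files is inefficient.
• No built-in file management system: Spark relies on external storage such as HDFS or cloud object stores.
• No fully automatic optimization process: jobs often need manual tuning, for example of partitioning and caching.
• Fewer algorithms: MLlib offers fewer algorithms than some dedicated machine learning libraries.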