Spark Structured APIs
Using Databricks
Presented By:
Raviyanshu Singh
Software Consultant
Knoldus Inc
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
Punctuality
Join the session 5 minutes prior to
the session start time. We start on
time and conclude on time!
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Silent Mode
Keep your mobile devices in silent
mode. Feel free to step out of the
session if you need to attend
an urgent call.
Avoid Disturbance
Avoid unwanted chit-chat during
the session.
Our Agenda
01 What is Spark
02 What’s an RDD
03 Dataframes
04 Datasets
05 Databricks
06 Demo
What is Spark?
Unified Analytics Engine
Apache Spark is a unified engine designed for large-scale distributed data
processing, on premises in data centers or in the cloud.
Spark’s design philosophy is based
on these principles:
● Speed
● Ease of Use
● Modularity
● Extensibility
Spark APIs Trio
RDD, Dataframe & Datasets
The Timeline of Three
● RDD (2011): Distributed collections of JVM objects. Functional operators (map, filter, etc.).
● Dataframe (2013): Distributed collections of Row objects. Expression-based operations and UDFs. Fast, efficient internal representations.
● Datasets (2015): Internally rows, externally JVM objects. “Best of both worlds”: type safe + fast.
Whatʼs RDD?
[Resilient Distributed Datasets]
● An RDD represents an immutable, partitioned collection of records that can be operated on in
parallel.
● RDDs give you complete control because every record in an RDD is just a Java or Python object.
Characteristics of an RDD
● Dependencies
● Partitions
● Compute Function: Partition => Iterator[T]
RDD Characteristics
1. Dependencies
➢ A list of dependencies that tells Spark how an RDD is constructed from its inputs.
➢ Spark can recreate an RDD from these dependencies by replaying the operations on it.
(This characteristic gives RDDs resiliency)
2. Partitions
➢ Partitions give Spark the ability to split the work and parallelize computation across executors.
➢ Spark also uses locality information to send work to executors close to the data.
(This characteristic gives RDDs distribution)
3. Compute Function
➢ An abstract method that computes one partition (the input split), within a TaskContext, and returns an
iterator over values of type T:
compute(split: Partition, context: TaskContext): Iterator[T]
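The three characteristics can be inspected on a small RDD. The following is a minimal sketch, assuming a local Spark session and script or notebook style execution (in a Databricks notebook `spark` already exists); the data and the names `numbers` and `evens` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[2]").appName("rdd-characteristics").getOrCreate()
val sc = spark.sparkContext

// Ask for four partitions explicitly so the distribution is visible.
val numbers = sc.parallelize(1 to 100, numSlices = 4)
val evens = numbers.filter(_ % 2 == 0).map(_ * 10)

// Partitions: the units of parallel work.
val partitionCount = evens.getNumPartitions

// Dependencies: the lineage Spark can replay to rebuild a lost partition.
val parentCount = evens.dependencies.size
println(evens.toDebugString)

// Compute function: work is evaluated per partition. mapPartitions has the
// same iterator-in, iterator-out shape as compute().
val partitionSums = evens.mapPartitions(it => Iterator(it.sum)).collect()
```

`toDebugString` prints the chain of parent RDDs, which is the dependency information Spark relies on for recovery.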
Visualizing RDD
Simple &
Elegant
Whatʼs the Problem?
RDDs Express How-to, Not What-to
● The compute function (or computation) is opaque to Spark.
● Slow for non-JVM languages like Python.
● No optimization by Spark.
● No data compression techniques.
● All of this leads to inadvertent inefficiencies.
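To make the how-to versus what-to contrast concrete, here is a hedged sketch that computes an average per key twice: once with RDD lambdas and once with DataFrame expressions. The sample data and all names are hypothetical, and a local session is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().master("local[2]").appName("how-vs-what").getOrCreate()
import spark.implicits._

// Hypothetical sample data: (name, age) pairs.
val people = Seq(("Asha", 20), ("Ben", 31), ("Asha", 25), ("Ben", 35))

// RDD version: the lambdas spell out HOW to build the average.
// Spark only sees opaque functions and cannot optimize inside them.
val avgByNameRdd = spark.sparkContext.parallelize(people)
  .map { case (name, age) => (name, (age, 1)) }
  .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
  .mapValues { case (total, count) => total.toDouble / count }
  .collect()
  .toMap

// DataFrame version: states WHAT is wanted and leaves the plan to Spark.
val avgByNameDf = people.toDF("name", "age")
  .groupBy("name")
  .agg(avg("age").as("avg_age"))
  .as[(String, Double)]
  .collect()
  .toMap
```

Both versions produce the same averages; the difference is that only the second one gives Spark expressions it can inspect and rearrange.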
Dataframe
Solution is in structuring
What do we mean by structuring?
● Ordering and structuring your data so that it can be arranged in a
tabular format.
● Expressing computation using common patterns such as filtering, selecting,
and counting.
The DataFrame API
Distributed in-memory tables with named columns and schemas, where each column has a
specific data type (String, Int, Timestamp, etc.).
To the human eye, a DataFrame looks like a table.
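As an illustration of named columns with a schema, the sketch below builds a small DataFrame from an explicit `StructType`. The column names and rows are made up for the example, and a local session is assumed.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[2]").appName("df-schema").getOrCreate()

// Hypothetical schema: every column has a name and a specific data type.
val schema = StructType(Seq(
  StructField("Id", IntegerType, nullable = false),
  StructField("First", StringType, nullable = true),
  StructField("Hits", IntegerType, nullable = true)
))

// Hypothetical rows matching the schema above.
val rows = Seq(Row(1, "Asha", 4535), Row(2, "Ben", 8908), Row(3, "Chen", 25568))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

df.printSchema() // prints column names and types
df.show()        // prints the rows as a table
```

Declaring the schema up front, rather than letting Spark infer it, avoids an extra pass over the data when reading from large files.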
Visualizing Dataframes
With Custom Data
Spark Operations on Data
Manoeuvring Data
Spark Operation: Transformations and Actions
Transformations
● Transform a Spark DataFrame into a new DataFrame without altering the original data.
● This gives DataFrames their immutability property.
● Examples: orderBy(), groupBy(), filter(), select()
Actions
● Actions are operations that return a raw value (or print output) rather than a new DataFrame.
● An action triggers the lazy evaluation of all the recorded transformations.
● Examples: show(), take(), count(), collect()
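The split between recording a plan and running it can be seen in a few lines. This is a minimal sketch with made-up data and names, assuming a local session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[2]").appName("lazy-eval").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val hits = Seq(("a.example", 12000), ("b.example", 300), ("c.example", 45000)).toDF("Url", "Hits")

// Transformations: each call returns a new DataFrame and only records a step in the plan.
val popular = hits.filter(col("Hits") > 10000).select("Url")

// The original DataFrame is unchanged (immutability).
val originalColumns = hits.columns.toSeq

// Action: count() runs the recorded plan and returns a plain Long.
val popularCount = popular.count()

// explain() prints the physical plan Spark built from the transformations.
popular.explain()
```

Nothing is computed when `popular` is defined; the work happens only when `count()` is called.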
Common Dataframe Ops
Projections & Filter
➢ A projection returns only the chosen columns; a filter returns only the rows matching a relational condition.
➢ Projections are done with the select() method, while filters can be expressed using the filter() or where() method.
import spark.implicits._  // enables the $"column" syntax
// Filter on Hits first, then project the columns to keep.
val topHits = df.where($"Hits" > 10000)
  .select("Id", "First", "Url")
Renaming, Adding, and Dropping Columns
➢ withColumnRenamed() renames a column, withColumn() adds a new column, and
drop() removes the column named in its argument.
import org.apache.spark.sql.functions.{col, to_timestamp}
// Rename two columns, then parse Published into a timestamp column and drop the original.
val newDf = df.withColumnRenamed("First", "First_Name").withColumnRenamed("Last", "Last_Name")
val dfWithTS = newDf.withColumn("Issued_Date", to_timestamp(col("Published"), "dd/MM/yyyy"))
  .drop("Published")
Common Dataframe Ops
Aggregation
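➢ Operations on DataFrames such as groupBy(), orderBy(), and count() let you group rows by column name and
then aggregate counts across the groups. Note that count() on grouped data returns a new DataFrame, unlike the count() action.
import org.apache.spark.sql.functions.{col, desc}
// Count rows per campaign, most frequent first.
val mostShare = dfWithTS.select("Campaigns", "First_Name").where(col("Campaigns").isNotNull)
  .groupBy("Campaigns")
  .count()
  .orderBy(desc("count"))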
The Datasets API
A Type-Safe one
According to the Dataset Documentation:
➢ A strongly typed collection of domain-specific objects that can be
transformed in parallel using functional or relational operations. Each
Dataset [in Scala] also has an untyped view called a DataFrame, which
is a Dataset of Row.
Structured APIs
Untyped API: DataFrame
● DataFrame = Dataset[Row]
● Alias in Scala
Typed API: Dataset
● Dataset[T]
● In Scala & Java
Visualizing Datasets
Case Class (Type-Safe Hero)
Datasets Ops
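A case class gives a Dataset its type. The sketch below is illustrative only: the `Blogger` class, its fields, and the rows are hypothetical, and it is written in script or notebook style with a local session assumed. In a compiled application the case class would need to be defined outside the method that uses it.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain type; field names become the column names.
case class Blogger(id: Int, first: String, hits: Long)

val spark = SparkSession.builder().master("local[2]").appName("dataset-ops").getOrCreate()
import spark.implicits._

val bloggers: Dataset[Blogger] = Seq(
  Blogger(1, "Asha", 4535L),
  Blogger(2, "Ben", 25568L),
  Blogger(3, "Chen", 12000L)
).toDS()

// Typed, functional operations: the compiler checks field names and types.
val popularNames = bloggers.filter(_.hits > 10000).map(_.first).collect().sorted

// The untyped view: a DataFrame is a Dataset[Row], and as[T] converts back.
val df = bloggers.toDF()
val roundTrip = df.as[Blogger].count()
```

A typo such as `_.hitz` fails at compile time here, whereas a misspelled column name in a DataFrame expression is only caught when the query is analyzed at run time.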
Databricks?
A Lakehouse Company
● The Databricks Lakehouse Platform provides a unified set of tools for building, deploying, sharing, and
maintaining enterprise-grade data solutions at scale.
● Databricks integrates with cloud storage and security in your cloud account, and manages and deploys cloud
infrastructure on your behalf.
Common Tools In Databricks
Core Data Tasks
● REST API
● Interactive Notebooks
● ML Model Serving
● Workflows Scheduler
● Source Control (Git)
● SQL Editor & Dashboard
● Compute Management
● Data Ingestion
DEMO
Thank You !
