Big data processing with Apache Spark and Oracle database
Martin Toshev
Who am I
Software consultant (CoffeeCupConsulting)
BG JUG board member (http://jug.bg)
(BG JUG is a 2018 Oracle Duke’s
choice award winner)
Agenda
• Apache Spark from an eagle’s eye
• Apache Spark capabilities
• Using Oracle RDBMS as a Spark datasource
Apache Spark from an Eagle’s eye
Highlights
• A framework for large-scale distributed data processing
• Originally written in Scala, with APIs also available for Java, Python and R
• One of the most actively contributed open source/Apache/GitHub projects, with over 1400 contributors
Spark vs MapReduce
• Spark has been developed in order to address the
shortcomings of the MapReduce programming model
• In particular MapReduce is unsuitable for:
– real-time processing (it is suited to batch processing of already collected data)
– operations not limited to the key-value format of data
– large data on a network
– online transaction processing
– graph processing
– sequential program execution
Spark vs Hadoop
• Spark is faster as it keeps data in RAM where possible and tries to minimize disk IO (on the storage system)
• Spark however can still use Hadoop:
– as a storage engine (HDFS)
– as a compute engine (MapReduce or Hadoop YARN)
• Spark has pluggable storage and compute engine
architecture
Spark components
Spark Framework:
– Spark Core
– Spark SQL
– Spark Streaming
– MLlib
– GraphX
Spark architecture
• A Spark application (JAR) is submitted to the SparkContext (driver), which distributes work through a cluster manager
• Worker nodes execute the tasks, reading from the input data sources and writing to the output data sources
Apache Spark capabilities
Spark datasets
• The building blocks of Spark are RDDs (Resilient Distributed Datasets)
• They are immutable collections of objects spread across
a Spark cluster and stored in RAM or on disk
• Created by means of distributed transformations
• Rebuilt on failure of a Spark node
Spark datasets
• The DataFrame API is a higher-level abstraction over RDDs; as of Spark 2.0 it is unified with the Dataset API (a DataFrame is a Dataset<Row>)
• The Dataset API combines the type safety of RDDs with the relational operations of DataFrames
• The DataFrame/Dataset API is preferred over raw RDDs due to improved performance and more advanced operations
Spark datasets
List<Item> items = …;
SparkConf configuration = new SparkConf()
    .setAppName("ItemsManager")
    .setMaster("local");
JavaSparkContext context = new JavaSparkContext(configuration);
JavaRDD<Item> itemsRDD = context.parallelize(items);
Spark transformations
map           itemsRDD.map(i -> { i.setName("phone"); return i; });
filter        itemsRDD.filter(i -> i.getName().contains("phone"))
flatMap       itemsRDD.flatMap(i -> Arrays.asList(i, i).iterator());
union         itemsRDD.union(newItemsRDD);
intersection  itemsRDD.intersection(newItemsRDD);
distinct      itemsRDD.distinct()
cartesian     itemsRDD.cartesian(otherDatasetRDD)
Spark transformations
groupBy       pairItemsRDD = itemsRDD.mapToPair(i -> new Tuple2(i.getType(), i));
              modifiedPairItemsRDD = pairItemsRDD.groupByKey();
reduceByKey   pairItemsRDD = itemsRDD.mapToPair(o -> new Tuple2(o.getType(), o));
              modifiedPairItemsRDD = pairItemsRDD.reduceByKey((o1, o2) ->
                  new Item(o1.getType(),
                           o1.getCount() + o2.getCount(),
                           o1.getUnitPrice()));
• Other transformations include aggregateByKey,
sortByKey, join, cogroup …
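• A hedged sketch of two of these (reusing pairItemsRDD from above, here assumed typed as JavaPairRDD<String, Item>, and joining with a hypothetical stockCountsRDD of type JavaPairRDD<String, Integer> keyed by the same item type):
// sort the (type, item) pairs by key
JavaPairRDD<String, Item> sortedByType = pairItemsRDD.sortByKey();
// join pairs each type with a tuple of the matching values from both RDDs
JavaPairRDD<String, Tuple2<Item, Integer>> joined =
    pairItemsRDD.join(stockCountsRDD);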
Spark actions
• Spark actions are the terminal operations that produce
results from the transformations
• Actions are a way to communicate back from the
execution engine to the Spark driver instance
Spark actions
collect itemsRDD.collect()
reduce itemsRDD.map(i ->
i.getUnitPrice() * i.getCount()).
reduce((x, y) -> x + y);
count itemsRDD.count()
first itemsRDD.first()
take itemsRDD.take(4)
takeOrdered itemsRDD.takeOrdered(4, comparator)
foreach itemsRDD.foreach(System.out::println)
saveAsTextFile itemsRDD.saveAsTextFile(path)
saveAsObjectFile itemsRDD.saveAsObjectFile(path)
DataFrames/DataSets
• A dataframe can be created using an instance of the
org.apache.spark.sql.SparkSession class
• The DataFrame/DataSet APIs provide more advanced
operations and the capability to run SQL queries on the
data
itemsDS.createOrReplaceTempView("items");
session.sql("SELECT * FROM items");
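• The result of such a query is itself a Dataset<Row>; a minimal sketch of consuming it (the WHERE clause is illustrative):
Dataset<Row> phones =
    session.sql("SELECT * FROM items WHERE name LIKE '%phone%'");
phones.show(); // prints the first rows to the console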
DataFrames/DataSets
• An existing RDD can be converted to a Spark dataframe:
• An RDD can be retrieved from a dataframe as well:
SparkSession session =
SparkSession.builder().appName("app").getOrCreate();
Dataset<Row> itemsDS =
session.createDataFrame(itemsRDD, Item.class);
itemsDS.rdd() // or itemsDS.toJavaRDD() for a JavaRDD
Spark data sources
• Spark can receive data from a variety of data sources in a
variety of ways (batching, real-time streaming)
• These data sources might be:
– files: Spark supports reading data in a variety of formats (JSON, CSV, Avro, etc.), as the sketch below shows
– relational databases: using a JDBC/ODBC driver, Spark can extract data from an RDBMS
– TCP sockets, messaging systems: using the streaming capabilities of Spark, data can be read from messaging systems and raw TCP sockets
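• A minimal sketch of reading a file-based source (the items.json path is a hypothetical example):
Dataset<Row> itemsFromJson = session.read().json("items.json");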
Spark data sources
• Spark provides support for operations on batch data as well as real-time data
• For real-time data Spark provides two main APIs:
– Spark Streaming is an older API working on RDDs
– Spark Structured Streaming is a newer API working on DataFrames/DataSets
Spark data sources
• Spark provides capabilities to plug in additional data sources not supported out of the box
• For streaming sources you can define your own custom receivers, as the minimal sketch below shows
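• A minimal custom receiver sketch (the polled source and its readRecord() helper are hypothetical; requires org.apache.spark.streaming.receiver.Receiver and org.apache.spark.storage.StorageLevel):
public class CustomReceiver extends Receiver<String> {
    public CustomReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        // consume the source on a separate thread so onStart() returns quickly
        new Thread(() -> {
            while (!isStopped()) {
                store(readRecord()); // push each record into the stream
            }
        }).start();
    }

    @Override
    public void onStop() {
        // nothing to do: the polling thread exits once isStopped() is true
    }

    private String readRecord() { /* hypothetical source */ return "..."; }
}
// plugged in via: jssc.receiverStream(new CustomReceiver());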
Spark streaming
• Data is divided into micro-batches represented as DStreams (discretized streams)
• Typical use case is the integration of Spark with
messaging systems such as Kafka, RabbitMQ and
ActiveMQ etc.
• Fault tolerance can be enabled in Spark Streaming by checkpointing data to reliable storage such as HDFS
Spark streaming
• To define a Spark stream you need to create a
JavaStreamingContext instance
SparkConf conf = new SparkConf()
    .setMaster("local[4]")
    .setAppName("CustomerItems");
JavaStreamingContext jssc =
    new JavaStreamingContext(conf, Durations.seconds(1));
Spark streaming
• Then a receiver can be created for the data:
– from sockets: jssc.socketTextStream("localhost", 7777);
– from a data directory: jssc.textFileStream("... some data directory ...");
– from RDD streams (for testing purposes): jssc.queueStream(... RDDs queue ... )
Spark streaming
• Then the data pipeline can be built using transformations
and actions on the streams
• Finally retrieval of data must be triggered from the
streaming context:
jssc.start();
jssc.awaitTermination();
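• Putting it together, a minimal sketch that counts the lines received on a socket in every one-second batch (names reused from the previous slides):
JavaDStream<String> lines = jssc.socketTextStream("localhost", 7777);
lines.count().print(); // an action, triggered once the context starts
jssc.start();
jssc.awaitTermination();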
Spark streaming
• Window streams can be created over stream data based
on two criteria:
– length of the window
– sliding interval for the windows
• Streaming datasets can also be joined with other
streaming or batch datasets
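• A hedged sketch of a window over the lines stream from the previous example (30-second window length, sliding every 10 seconds):
JavaDStream<String> windowed =
    lines.window(Durations.seconds(30), Durations.seconds(10));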
Spark structured streaming
• Newer streaming API working on DataSets/DataFrames:
• A schema can be specified on the streaming data using
the .schema(<schema>) method on the read stream
SparkSession spark = SparkSession
    .builder()
    .appName("CustomerItems")
    .getOrCreate();
Dataset<Row> lines = spark
    .readStream()
    .format("socket")
    .option("host", "localhost")
    .option("port", 7777)
    .load();
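• As a sketch, the wordCounts dataset used with the write sink on the next slide can be derived from the lines stream as follows (requires org.apache.spark.sql.Encoders, org.apache.spark.api.java.function.FlatMapFunction and java.util.Arrays):
Dataset<String> words = lines
    .as(Encoders.STRING())
    .flatMap((FlatMapFunction<String, String>) line ->
        Arrays.asList(line.split(" ")).iterator(), Encoders.STRING());
Dataset<Row> wordCounts = words.groupBy("value").count();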
Spark structured streaming
• Write sinks can also be used to write out streaming datasets:
• The following write sinks are provided by Spark:
– file
– Kafka
– foreach
– console (for testing purposes)
– memory (for testing purposes)
StreamingQuery query =
wordCounts.writeStream()
.outputMode("complete")
.format("console")
.start();
query.awaitTermination();
Clustering
• Spark supports the following cluster managers:
– Standalone scheduler (default)
– YARN
– Mesos
• Support for the Kubernetes cluster manager is also under development (experimental at present)
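• As a sketch, the cluster manager is selected via the master URL passed to the Spark configuration (host names and ports below are placeholders):
new SparkConf().setMaster("local[4]");          // local mode with 4 threads
new SparkConf().setMaster("spark://host:7077"); // standalone scheduler
new SparkConf().setMaster("yarn");              // YARN
new SparkConf().setMaster("mesos://host:5050"); // Mesos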
Using Oracle RDBMS
as a Spark datasource
Oracle RDBMS data source
• Spark supports retrieval of data through JDBC/ODBC
• The database driver must be supplied on the Spark classpath (specified with the --driver-class-path option)
• For Oracle RDBMS that is the ojdbc driver
Oracle RDBMS data source
session.read()
.format("jdbc")
.option("url","jdbc:oracle:thin:@//127.0.0.1:1521/ORCL")
.option("dbtable", "items")
.option("user", "c##spark")
.option("password", "spark")
.load();
Oracle RDBMS data source
• You can use a variety of options when reading data from an RDBMS using the jdbc format:
– query: a subquery that provides the possibility to limit the retrieved data (see the sketch below)
– queryTimeout: specifies the timeout for the JDBC query executed against the RDBMS
• You can also save datasets to a table:
itemsDF.write().mode(org.apache.spark.sql.SaveMode.Append)
    .jdbc("jdbc:oracle:thin:@//127.0.0.1:1521/ORCL", "items", prop);
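• A hedged sketch of the query option (the WHERE clause is illustrative):
Dataset<Row> recentItems = session.read()
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//127.0.0.1:1521/ORCL")
    .option("query", "SELECT * FROM items WHERE ordertime > SYSDATE - 1")
    .option("user", "c##spark")
    .option("password", "spark")
    .load();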
Data processing options
• However, the support provided by Spark is for batch processing of data from the RDBMS …
• In many cases one might want to process data in a
streaming manner
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process records as they are inserted in the RDBMS
Id  Type        OrderTime
1   Laptop      2019.11.05 11:55:05
2   Battery     2019.11.05 12:04:23
3   Headphones  2019.11.05 12:24:17
4   Laptop      2019.11.05 12:52:32
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process records on evenly-sized batches
(same sample orders table as above)
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process records on evenly-sized time intervals (record size may vary)
(same sample orders table as above)
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process batches of overlapping records using a sized window
(same sample orders table as above)
Data processing options
• For stream processing of data from an Oracle RDBMS a
Spark instance may have to:
– process batches based on custom filter criteria
(same sample orders table as above)
Data processing options
• These can be achieved using the following mechanisms:
– by duplicating writes over a streaming system such as Kafka
– via a Spark streaming receiver (see the sketch below) that:
• buffers records (if a small delay is tolerable)
• creates an endpoint that an RDBMS trigger calls upon insertion
• listens for database changes using DCN (Database Change Notifications) via JDBC (only pre-12c, DCN support dropped for PDBs as of 12c)
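• A minimal sketch of the buffering approach: a custom receiver that periodically polls the items table over JDBC (the query, the one-second polling interval and the id-based high-water mark are illustrative assumptions, not part of the talk; requires java.sql.* plus the Receiver and StorageLevel imports from the earlier sketch):
public class OracleItemsReceiver extends Receiver<String> {
    private long lastSeenId = 0; // high-water mark of already processed rows

    public OracleItemsReceiver() {
        super(StorageLevel.MEMORY_AND_DISK_2());
    }

    @Override
    public void onStart() {
        new Thread(() -> {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//127.0.0.1:1521/ORCL", "c##spark", "spark")) {
                while (!isStopped()) {
                    PreparedStatement stmt = conn.prepareStatement(
                        "SELECT id, type FROM items WHERE id > ? ORDER BY id");
                    stmt.setLong(1, lastSeenId);
                    ResultSet rs = stmt.executeQuery();
                    while (rs.next()) {
                        lastSeenId = rs.getLong("id");
                        store(rs.getString("type")); // push new records to Spark
                    }
                    Thread.sleep(1000); // the small tolerable delay between polls
                }
            } catch (Exception e) {
                restart("Error polling the items table", e);
            }
        }).start();
    }

    @Override
    public void onStop() {
        // nothing to do: the polling thread exits once isStopped() is true
    }
}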
DEMO
Summary
• Apache Spark is one of the most feature-rich and actively developed big data processing frameworks
• Provides a mechanism to distribute load over a large
number of nodes using different cluster managers
• A great option for fast and scalable processing of data
from an Oracle RDBMS