Spark SQL, Dataframes, SparkR
hadoop fs -cat /data/spark/books.xml
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>
An in-depth look at creating applications
…
…
</book>
<book id="bk102">
…
…
</book>
…
...
</catalog>
Loading XML
We will use: https://github.com/databricks/spark-xml
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-xml_2.10:0.4.1
Load the Data:
val df = spark.read.format("xml").option("rowTag", "book").load("/data/spark/books.xml")
OR
val df = spark.read.format("com.databricks.spark.xml")
.option("rowTag", "book").load("/data/spark/books.xml")
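As a sketch of the reverse direction (not on the slides), the same spark-xml package can also write a DataFrame back out as XML; per its README, the rootTag and rowTag options control the element names on write. The output path here is hypothetical.

```scala
// Sketch: writing a DataFrame back out as XML with spark-xml.
// Assumes the spark-xml package is on the classpath, as above.
df.select("author", "title")
  .write.format("com.databricks.spark.xml")
  .option("rootTag", "catalog")   // root element of the output document
  .option("rowTag", "book")       // one <book> element per row
  .save("/data/spark/books_out")
```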
scala> df.show()
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
| _id| author| description| genre|price|publish_date| title|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
|bk101|Gambardella, Matthew| An in...| Computer|44.95| 2000-10-01|XML Developer's G...|
|bk102| Ralls, Kim|A former architec...| Fantasy| 5.95| 2000-12-16| Midnight Rain|
|bk103| Corets, Eva|After the collaps...| Fantasy| 5.95| 2000-11-17| Maeve Ascendant|
|bk104| Corets, Eva|In post-apocalyps...| Fantasy| 5.95| 2001-03-10| Oberon's Legacy|
|bk105| Corets, Eva|The two daughters...| Fantasy| 5.95| 2001-09-10| The Sundered Grail|
|bk106| Randall, Cynthia|When Carla meets ...| Romance| 4.95| 2000-09-02| Lover Birds|
|bk107| Thurman, Paula|A deep sea diver ...| Romance| 4.95| 2000-11-02| Splish Splash|
|bk108| Knorr, Stefan|An anthology of h...| Horror| 4.95| 2000-12-06| Creepy Crawlies|
|bk109| Kress, Peter|After an inadvert...|Science Fiction| 6.95| 2000-11-02| Paradox Lost|
|bk110| O'Brien, Tim|Microsoft's .NET ...| Computer|36.95| 2000-12-09|Microsoft .NET: T...|
|bk111| O'Brien, Tim|The Microsoft MSX...| Computer|36.95| 2000-12-01|MSXML3: A Compreh...|
|bk112| Galos, Mike|Microsoft Visual ...| Computer|49.95| 2001-04-16|Visual Studio 7: ...|
+-----+--------------------+--------------------+---------------+-----+------------+--------------------+
What is RPC - Remote Procedure Call
[{
Name: John,
Phone: 1234
},
{
Name: John,
Phone: 1234
},]
…
getPhoneBook("myuserid")
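The idea above can be sketched in plain Scala (all names here are illustrative, not from any real framework): the caller invokes what looks like a local method, and an RPC framework substitutes a network-backed implementation behind the same interface.

```scala
// Hypothetical sketch of the idea behind RPC.
case class Contact(name: String, phone: String)

trait PhoneBookService {
  def getPhoneBook(userId: String): List[Contact]
}

// A local stand-in; a real RPC framework (e.g. Avro RPC) would generate
// a client stub that serializes the call, sends it over the wire, and
// deserializes the server's response.
object LocalPhoneBook extends PhoneBookService {
  def getPhoneBook(userId: String): List[Contact] =
    List(Contact("John", "1234"))
}

println(LocalPhoneBook.getPhoneBook("myuserid"))
```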
Avro is:
1. A remote procedure call framework
2. A data serialization framework
3. Uses JSON for defining data types and protocols
4. Serializes data in a compact binary format
5. Similar to Thrift and Protocol Buffers
6. Doesn't require running a code-generation program
Its primary use is in Apache Hadoop, where it can provide both a serialization format
for persistent data, and a wire format for communication between Hadoop nodes,
and from client programs to the Hadoop services.
Apache Spark SQL can access Avro as a data source.[1]
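For illustration, a hypothetical Avro schema (schemas are plain JSON) matching the episodes data used on the next slides might look like:

```json
{
  "type": "record",
  "name": "Episode",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "air_date", "type": "string"},
    {"name": "doctor", "type": "int"}
  ]
}
```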
AVRO
We will use: https://github.com/databricks/spark-avro
Start Spark-Shell:
/usr/spark2.0.1/bin/spark-shell --packages com.databricks:spark-avro_2.11:3.2.0
Loading AVRO
Load the Data:
val df = spark.read.format("com.databricks.spark.avro")
.load("/data/spark/episodes.avro")
Display Data:
df.show()
+--------------------+----------------+------+
| title| air_date|doctor|
+--------------------+----------------+------+
| The Eleventh Hour| 3 April 2010| 11|
| The Doctor's Wife| 14 May 2011| 11|
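As a sketch (not on the slides), the same spark-avro package can also write a DataFrame back out in Avro format; the output path here is hypothetical.

```scala
// Sketch: writing the DataFrame back out as Avro with spark-avro.
df.write.format("com.databricks.spark.avro").save("/data/spark/episodes_out")
```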
https://parquet.apache.org/
Data Sources
ā— Columnar storage format
ā— Any project in the Hadoop ecosystem
ā— Regardless of
ā—‹ Data processing framework
ā—‹ Data model
ā—‹ Programming language.
Data Sources
Method 1 - Automatic (Parquet unless otherwise configured)
var df = spark.read.load("/data/spark/users.parquet")
df = df.select("name", "favorite_color")
df.write.save("namesAndFavColors_21jan2018.parquet")
Data Sources
Method 2 - Manually Specifying Options
df = spark.read.format("json").load("/data/spark/people.json")
df = df.select("name", "age")
df.write.format("parquet").save("namesAndAges.parquet")
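A side note worth knowing here (a sketch, not on the slides): by default, save() fails if the target path already exists; passing a SaveMode changes that behavior.

```scala
import org.apache.spark.sql.SaveMode

// Overwrite the target path if it exists; other modes are Append,
// Ignore and ErrorIfExists (the default).
df.write.format("parquet")
  .mode(SaveMode.Overwrite)
  .save("namesAndAges.parquet")
```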
Data Sources
Method 3 - Running SQL directly on files
val sqlDF = spark.sql("SELECT * FROM parquet.`/data/spark/users.parquet`")
val sqlDF = spark.sql("SELECT * FROM json.`/data/spark/people.json`")
ā— Spark SQL also supports reading and writing data stored in Apache Hive.
ā— Since Hive has a large number of dependencies, it is not included in the default Spark assembly.
Hive Tables
Hive Tables
ā— Place your hive-site.xml, core-site.xml and hdfs-site.xml file in conf/
ā— Not required in case of CloudxLab, it already done.
Hive Tables - Example
/usr/spark2.0.1/bin/spark-shell
scala> import spark.implicits._
import spark.implicits._
scala> var df = spark.sql("select * from a_student")
scala> df.show()
+---------+-----+-----+------+
| name|grade|marks|stream|
+---------+-----+-----+------+
| Student1| A| 1| CSE|
| Student2| B| 2| IT|
| Student3| A| 3| ECE|
| Student4| B| 4| EEE|
| Student5| A| 5| MECH|
| Student6| B| 6| CHEM|
Hive Tables - Example
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark Hive Example")
  .enableHiveSupport()
  .getOrCreate()
From DBs using JDBC
ā— Spark SQL also includes a data source that can read data from DBs using JDBC.
ā— Results are returned as a DataFrame
ā— Easily be processed in Spark SQL or joined with other data sources
hadoop fs -copyToLocal /data/spark/mysql-connector-java-5.1.36-bin.jar
/usr/spark2.0.1/bin/spark-shell --driver-class-path mysql-connector-java-5.1.36-bin.jar --jars mysql-connector-java-5.1.36-bin.jar
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex")
.option("dbtable", "widgets")
.option("user", "sqoopuser")
.option("password", "NHkkP876rp")
.load()
jdbcDF.show()
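The JDBC source also supports writes; a minimal sketch, where the target table name "widgets_copy" is hypothetical:

```scala
// Sketch: writing a DataFrame back to MySQL over JDBC.
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://ip-172-31-13-154/sqoopex")
  .option("dbtable", "widgets_copy")  // hypothetical target table
  .option("user", "sqoopuser")
  .option("password", "NHkkP876rp")
  .save()
```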
From DBs using JDBC
var df = spark.sql("select * from a_student")
df.show()
jdbcDF.createOrReplaceTempView("jdbc_widgets")
df.createOrReplaceTempView("hive_students")
spark.sql("select * from jdbc_widgets j, hive_students h where h.marks = j.id").show()
Joining Across
Data Frames
[Diagram: DataFrames (Spark SQL) sit between the data sources — JSON, Hive, RDD, text, Parquet, RDBMS (JDBC) — and the two processing styles: map()/reduce()-style APIs and SQL]
ā— Spark SQL as a distributed query engine
ā— using its JDBC/ODBC
ā— or command-line interface.
ā— Users can run SQL queries on Spark
ā— without the need to write any code.
Distributed SQL Engine
Distributed SQL Engine - Setting up
Step 1: Running the Thrift JDBC/ODBC server
The Thrift JDBC/ODBC server corresponds to HiveServer2. You can start it from the local Spark installation:
./sbin/start-thriftserver.sh
It starts in the background and writes to a log file. To follow the logs, use the tail -f command.
Step 2: Connecting
Connect to the Thrift service using beeline:
./bin/beeline
On the beeline shell:
!connect jdbc:hive2://localhost:10000
You can then query using the same commands as in Hive.
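Once connected, plain HiveQL works on the beeline shell; for example (using the a_student table queried earlier in this deck):

```sql
show tables;
select * from a_student limit 5;
```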
Demo
Distributed SQL Engine
Thank you!
Dataframes & Spark SQL