Spark SQL Tutorial

Uploaded by

Anusha Reddy

Spark SQL

About the Tutorial

Apache Spark is a lightning-fast cluster computing technology designed for fast computation.
It builds on the Hadoop MapReduce model and extends it to efficiently support more types
of computation, including interactive queries and stream processing.

This is a brief tutorial that explains the basics of Spark SQL programming.

Audience
This tutorial has been prepared for professionals who want to learn the basics of Big Data
analytics using the Spark framework and become Spark developers. It is also useful for
analytics professionals and ETL developers.

Prerequisite
Before proceeding with this tutorial, you should have prior exposure to Scala
programming, database concepts, and any flavor of the Linux operating system.

Copyright & Disclaimer

All the content and graphics published in this e-book are the property of Tutorials Point
(I) Pvt. Ltd. The user of this e-book is prohibited to reuse, retain, copy, distribute or
republish any contents or a part of contents of this e-book in any manner without written
consent of the publisher.

We strive to update the contents of our website and tutorials as timely and as precisely
as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I)
Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of
our website or its contents, including this tutorial. If you discover any errors on our
website or in this tutorial, please notify us at [email protected]


Table of Contents

About the Tutorial
Audience
Prerequisite
Copyright & Disclaimer
Table of Contents

1. SPARK SQL – INTRODUCTION
Apache Spark
Evolution of Apache Spark
Features of Apache Spark
Spark Built on Hadoop
Components of Spark

2. SPARK SQL – RDD
Resilient Distributed Datasets
Data Sharing is Slow in MapReduce
Iterative Operations on MapReduce
Interactive Operations on MapReduce
Data Sharing using Spark RDD
Iterative Operations on Spark RDD
Interactive Operations on Spark RDD

3. SPARK SQL – INSTALLATION
Step 1: Verifying Java Installation
Step 2: Verifying Scala Installation
Step 3: Downloading Scala
Step 4: Installing Scala
Step 5: Downloading Apache Spark
Step 6: Installing Spark
Step 7: Verifying the Spark Installation

4. SPARK SQL – FEATURES AND ARCHITECTURE
Features of Spark SQL
Spark SQL Architecture

5. SPARK SQL – DATAFRAMES
Features of DataFrame
SQLContext
DataFrame Operations
Running SQL Queries Programmatically
Inferring the Schema using Reflection
Programmatically Specifying the Schema

6. SPARK SQL – DATA SOURCES
JSON Datasets
DataFrame Operations
Hive Tables
Parquet Files

1. SPARK SQL – INTRODUCTION

Industries are using Hadoop extensively to analyze their data sets. The reason is that
the Hadoop framework is based on a simple programming model (MapReduce) and
enables a computing solution that is scalable, flexible, fault-tolerant, and cost-effective.
Here, the main concern is maintaining speed when processing large datasets, in terms of
both the waiting time between queries and the waiting time to run a program.

Apache Spark, now a project of the Apache Software Foundation, was introduced to
speed up computation on Hadoop.

Contrary to a common belief, Spark is not a modified version of Hadoop and does not
really depend on Hadoop, because it has its own cluster management. Hadoop is just
one of the ways to deploy Spark.

Spark can use Hadoop in two ways: for storage and for processing. Since Spark has its
own cluster management and computation engine, it typically uses Hadoop for storage
only.

Apache Spark
Apache Spark is a lightning-fast cluster computing technology designed for fast
computation. It builds on the Hadoop MapReduce model and extends it to efficiently
support more types of computation, including interactive queries and stream
processing. The main feature of Spark is its in-memory cluster computing, which
increases the processing speed of an application.

Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries, and streaming. Besides supporting all these
workloads in a single system, it reduces the management burden of maintaining
separate tools.

Evolution of Apache Spark

Spark began in 2009 as a research project at UC Berkeley’s AMPLab, started by
Matei Zaharia. It was open-sourced in 2010 under a BSD license. It was donated to the
Apache Software Foundation in 2013 and became a top-level Apache project in
February 2014.

Features of Apache Spark

Apache Spark has the following features.

• Speed: Spark can run an application on a Hadoop cluster up to 100 times faster
in memory, and up to 10 times faster on disk, than Hadoop MapReduce. This is
possible because it reduces the number of read/write operations to disk by keeping
intermediate processing data in memory.

• Supports multiple languages: Spark provides built-in APIs in Java, Scala, and
Python, so you can write applications in different languages. Spark also comes
with over 80 high-level operators for interactive querying.

• Advanced Analytics: Spark supports not only ‘map’ and ‘reduce’ but also
SQL queries, streaming data, machine learning (ML), and graph algorithms.
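
The Speed point above can be illustrated with a small sketch. This is plain Python rather than Spark code, and it only models the idea: a multi-stage job that hands data from one stage to the next through files on disk (roughly how chained MapReduce jobs share data) versus one that keeps intermediate results in memory. The function names, the three-stage pipeline, and the I/O counter are hypothetical and chosen for illustration only.

```python
import json
import os
import tempfile


def square(x):
    return x * x


def add_one(x):
    return x + 1


# A hypothetical three-stage pipeline applied to every element.
STAGES = [square, add_one, square]


def run_with_disk_handoff(data, stages, workdir):
    """Write each stage's output to a file and read it back for the next stage."""
    io_ops = 0
    current = data
    for i, stage in enumerate(stages):
        result = [stage(x) for x in current]
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:   # intermediate write to disk
            json.dump(result, f)
        io_ops += 1
        with open(path) as f:        # intermediate read from disk
            current = json.load(f)
        io_ops += 1
    return current, io_ops


def run_in_memory(data, stages):
    """Keep every intermediate result in memory; no disk I/O between stages."""
    current = data
    for stage in stages:
        current = [stage(x) for x in current]
    return current, 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(run_with_disk_handoff([1, 2, 3], STAGES, workdir))
    print(run_in_memory([1, 2, 3], STAGES))
```

Both functions produce the same result, but the disk-based version performs two file operations per stage, while the in-memory version performs none. On a real cluster the cost of those intermediate reads and writes also includes serialization and replication, which this sketch does not model.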

Spark Built on Hadoop

The following diagram shows three ways in which Spark can be built with Hadoop
components.

There are three ways to deploy Spark, as explained below.

• Standalone: In a standalone deployment, Spark sits on top of HDFS (Hadoop
Distributed File System), and space is allocated for HDFS explicitly. Here, Spark
and MapReduce run side by side to cover all Spark jobs on the cluster.

• Hadoop YARN: In a Hadoop YARN deployment, Spark simply runs on YARN, with
no pre-installation or root access required. This helps integrate Spark into the
Hadoop ecosystem (the Hadoop stack) and allows other components to run on
top of the stack.

• Spark in MapReduce (SIMR): SIMR is used to launch Spark jobs in addition to
standalone deployment. With SIMR, a user can start Spark and use its shell
without any administrative access.


End of ebook preview
