What Is Apache Spark - Azure Synapse Analytics - Microsoft Docs
In this article
What is Apache Spark
Spark pool architecture
Apache Spark in Azure Synapse Analytics use cases
Where do I start
Next steps
https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview
4/26/22, 2:29 PM
What is Apache Spark

Apache Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data in memory and query it repeatedly, which is much faster than disk-based processing. Spark also integrates with multiple programming languages so you can manipulate distributed data sets like local collections. There's no need to structure everything as map and reduce operations.
Spark pools in Azure Synapse offer a fully managed Spark service. The benefits of
creating a Spark pool in Azure Synapse Analytics are listed here.
Speed and efficiency: Spark instances start in approximately 2 minutes for fewer than 60 nodes and in approximately 5 minutes for more than 60 nodes. By default, the instance shuts down 5 minutes after the last job executes unless it is kept alive by a notebook connection.

Ease of creation: You can create a new Spark pool in Azure Synapse in minutes using the Azure portal, Azure PowerShell, or the Synapse Analytics .NET SDK. See Get started with Spark pools in Azure Synapse Analytics.

Ease of use: Synapse Analytics includes a custom notebook derived from Nteract. You can use these notebooks for interactive data processing and visualization.

REST APIs: Spark in Azure Synapse Analytics includes Apache Livy, a REST-API-based Spark job server for remotely submitting and monitoring jobs.

Support for Azure Data Lake Storage Generation 2: Spark pools in Azure Synapse can use Azure Data Lake Storage Generation 2 as well as Blob storage. For more information on Data Lake Storage, see Overview of Azure Data Lake Storage.

Integration with third-party IDEs: Azure Synapse provides an IDE plugin for JetBrains' IntelliJ IDEA that is useful for creating and submitting applications to a Spark pool.

Pre-loaded Anaconda libraries: Spark pools in Azure Synapse come with Anaconda libraries pre-installed. Anaconda provides close to 200 libraries for machine learning, data analysis, visualization, and more.

Scalability: Apache Spark in Azure Synapse pools can have autoscale enabled, so that pools scale by adding or removing nodes as needed. Also, Spark pools can be shut down with no loss of data since all the data is stored in Azure Storage or Data Lake Storage.
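To make the Livy-based job submission concrete, here is a hedged sketch of building a batch submission body. The endpoint URL, storage path, and workspace and pool names are hypothetical placeholders; the `file`, `className`, and `args` payload fields follow the Apache Livy batches API. A real submission would also need an Azure AD bearer token.

```python
import json

# Hypothetical Synapse Livy endpoint; substitute your own workspace and pool.
LIVY_BATCHES_URL = (
    "https://myworkspace.dev.azuresynapse.net/livyApi/versions/"
    "2019-11-01-preview/sparkPools/mypool/batches"
)

def build_batch_payload(file, class_name=None, args=None):
    """Build the JSON body for a Livy POST /batches job submission."""
    payload = {"file": file}  # main definition file (JAR or .py) in storage
    if class_name is not None:
        payload["className"] = class_name  # entry class, for JAR submissions
    if args is not None:
        payload["args"] = args  # command-line arguments passed to the job
    return payload

payload = build_batch_payload(
    "abfss://jobs@myaccount.dfs.core.windows.net/wordcount.py",
    args=["--input", "data/"],
)
body = json.dumps(payload)  # POST this, with an Authorization header, to submit
```

After submitting, a client would poll the returned batch id for its state to monitor the job remotely.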
Spark pools in Azure Synapse include the following components that are available on the pools by default.

Spark Core. Includes Spark SQL, GraphX, and MLlib.
Anaconda
Apache Livy
Nteract notebook
Spark pool architecture

The SparkContext can connect to the cluster manager, which allocates resources across applications. The cluster manager is Apache Hadoop YARN. Once connected, Spark acquires executors on nodes in the pool, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
The SparkContext runs the user's main function and executes the various parallel
operations on the nodes. Then, the SparkContext collects the results of the operations.
The nodes read and write data from and to the file system. The nodes also cache transformed data in memory as Resilient Distributed Datasets (RDDs).
The SparkContext connects to the Spark pool and is responsible for converting an
application to a directed acyclic graph (DAG). The graph consists of individual tasks that
get executed within an executor process on the nodes. Each application gets its own
executor processes, which stay up for the duration of the whole application and run
tasks in multiple threads.
Apache Spark in Azure Synapse Analytics use cases

Machine Learning
Apache Spark comes with MLlib, a machine learning library built on top of Spark that you can use from a Spark pool in Azure Synapse Analytics. Spark pools in Azure Synapse Analytics also include Anaconda, a Python distribution with a variety of packages for data science, including machine learning. When combined with built-in support for notebooks, you have an environment for creating machine learning applications.
Where do I start
Use the following articles to learn more about Apache Spark in Azure Synapse Analytics:
Quickstart: Create a Spark pool in Azure Synapse
Quickstart: Create an Apache Spark notebook
Tutorial: Machine learning using Apache Spark
Apache Spark official documentation
Note

Some of the official Apache Spark documentation relies on the Spark console, which is not available in Azure Synapse Spark. Use the notebook or IntelliJ experiences instead.
Next steps
In this overview, you got a basic understanding of Apache Spark in Azure Synapse Analytics. Advance to the next article to learn how to create a Spark pool in Azure Synapse Analytics.
Recommended content
Quickstart: Create a serverless Apache Spark pool using web tools - Azure
Synapse Analytics
This quickstart shows how to use the web tools to create a serverless Apache Spark pool in
Azure Synapse Analytics and how to run a Spark SQL query.
Quickstart: Create a serverless Apache Spark pool using the Azure portal -
Azure Synapse Analytics
Create a serverless Apache Spark pool using the Azure portal by following the steps in this
guide.
Overview of how to use Linux Foundation Delta Lake in Apache Spark for
Azure Synapse Analytics - Azure Synapse Analytics
Learn how to use Delta Lake in Apache Spark for Azure Synapse Analytics to create and use tables with ACID properties.
Quickstart: Transform data using Apache Spark job definition - Azure Synapse
Analytics
This quickstart provides step-by-step instructions for using Azure Synapse Analytics to transform data with an Apache Spark job definition.