
What Is Apache Spark - Azure Synapse Analytics - Microsoft Docs

Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big data analytic applications. Apache Spark in Azure Synapse Analytics provides a fully managed Spark service that makes it easy to create and configure serverless Apache Spark pools in the cloud. Spark pools can leverage Azure Storage and Azure Data Lake Storage Gen2 and come with Spark, Anaconda, Apache Livy, and Nteract notebooks pre-installed.

4/26/22, 2:29 PM What is Apache Spark - Azure Synapse Analytics | Microsoft Docs


Apache Spark in Azure Synapse Analytics
Article • 02/16/2022 • 4 minutes to read • 9 contributors  

In this article
What is Apache Spark
Spark pool architecture
Apache Spark in Azure Synapse Analytics use cases
Where do I start
Next steps

Apache Spark is a parallel processing framework that supports in-memory processing to
boost the performance of big-data analytic applications. Apache Spark in Azure Synapse
Analytics is one of Microsoft's implementations of Apache Spark in the cloud. Azure
Synapse makes it easy to create and configure a serverless Apache Spark pool in Azure.
Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake
Storage Generation 2, so you can use Spark pools to process your data stored in Azure.

What is Apache Spark

https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview

Apache Spark provides primitives for in-memory cluster computing. A Spark job can
load and cache data into memory and query it repeatedly. In-memory computing is
much faster than disk-based processing. Spark also integrates with multiple
programming languages to let you manipulate distributed data sets like local
collections. There's no need to structure everything as map and reduce operations.
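
The caching behavior described above can be illustrated without a cluster. The following is a plain-Python analogy, not Spark code: `CachedDataset` and `load_rows` are hypothetical names invented for this sketch. It loads data once, keeps it in memory, and answers repeated queries from the cached copy, with each query written as an ordinary collection operation rather than as explicit map and reduce phases. In PySpark, the corresponding step would typically be a call such as `DataFrame.cache()`.

```python
from typing import Callable, List, Optional


class CachedDataset:
    """Toy stand-in for a cached dataset (hypothetical, not a Spark API)."""

    def __init__(self, loader: Callable[[], List[int]]) -> None:
        self._loader = loader
        self._rows: Optional[List[int]] = None
        self.load_count = 0  # how many times the slow source was read

    def rows(self) -> List[int]:
        # Read from the slow source only on first access, then reuse memory.
        if self._rows is None:
            self._rows = self._loader()
            self.load_count += 1
        return self._rows


def load_rows() -> List[int]:
    # Stands in for an expensive read from disk or remote storage.
    return list(range(1, 11))


dataset = CachedDataset(load_rows)

# Three different queries, written like operations on a local collection.
total = sum(dataset.rows())
evens = [x for x in dataset.rows() if x % 2 == 0]
largest = max(dataset.rows())

print(total, evens, largest, dataset.load_count)
```

The counter `load_count` stays at 1 however many queries run, which is the point of caching: only the first access pays the cost of reading the source.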

Spark pools in Azure Synapse offer a fully managed Spark service. The benefits of
creating a Spark pool in Azure Synapse Analytics are listed here.

Feature | Description

Speed and efficiency | Spark instances start in approximately 2 minutes for fewer than 60 nodes and approximately 5 minutes for more than 60 nodes. The instance shuts down, by default, 5 minutes after the last job executed unless it is kept alive by a notebook connection.

Ease of creation | You can create a new Spark pool in Azure Synapse in minutes using the Azure portal, Azure PowerShell, or the Synapse Analytics .NET SDK. See Get started with Spark pools in Azure Synapse Analytics.

Ease of use | Synapse Analytics includes a custom notebook derived from Nteract. You can use these notebooks for interactive data processing and visualization.

REST APIs | Spark in Azure Synapse Analytics includes Apache Livy, a REST API-based Spark job server to remotely submit and monitor jobs.

Support for Azure Data Lake Storage Generation 2 | Spark pools in Azure Synapse can use Azure Data Lake Storage Generation 2 as well as BLOB storage. For more information on Data Lake Storage, see Overview of Azure Data Lake Storage.

Integration with third-party IDEs | Azure Synapse provides an IDE plugin for JetBrains' IntelliJ IDEA that is useful to create and submit applications to a Spark pool.

Pre-loaded Anaconda libraries | Spark pools in Azure Synapse come with Anaconda libraries pre-installed. Anaconda provides close to 200 libraries for machine learning, data analysis, visualization, etc.

Scalability | Apache Spark in Azure Synapse pools can have Auto-Scale enabled, so that pools scale by adding or removing nodes as needed. Also, Spark pools can be shut down with no loss of data since all the data is stored in Azure Storage or Data Lake Storage.

Spark pools in Azure Synapse include the following components that are available on
the pools by default.
Spark Core. Includes Spark Core, Spark SQL, GraphX, and MLlib.
Anaconda
Apache Livy
Nteract notebook
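
Because Apache Livy exposes job submission over REST, a batch job can be described as a JSON document and sent with an HTTP POST. The sketch below only builds such a request and does not send it. The field names (`file`, `name`, `args`, `numExecutors`) follow the Apache Livy batch API; the endpoint URL, the storage path, and the helper name `build_batch_request` are placeholders assumed for illustration, and authentication is omitted. The exact endpoint format for a Synapse workspace should be taken from the Synapse REST reference.

```python
import json
import urllib.request
from typing import List


def build_batch_request(endpoint: str, app_file: str, name: str,
                        args: List[str], num_executors: int) -> urllib.request.Request:
    """Build (but do not send) a Livy-style batch submission request."""
    payload = {
        "file": app_file,              # application to run
        "name": name,                  # display name of the batch job
        "args": args,                  # command-line arguments for the application
        "numExecutors": num_executors, # number of executors to request
    }
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values, assumed for this example only.
request = build_batch_request(
    endpoint="https://example.invalid/livy",
    app_file="abfss://container@account.dfs.core.windows.net/jobs/wordcount.py",
    name="wordcount-sketch",
    args=["input.txt"],
    num_executors=2,
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect before any job is submitted to a real pool.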

Spark pool architecture


The components of Spark are easiest to understand by looking at how Spark runs on
Azure Synapse Analytics.

Spark applications run as independent sets of processes on a pool, coordinated by the
SparkContext object in your main program (called the driver program).

The SparkContext connects to the cluster manager, which allocates resources across
applications. The cluster manager is Apache Hadoop YARN. Once connected, Spark
acquires executors on nodes in the pool, which are processes that run computations and
store data for your application. Next, it sends your application code (defined by JAR or
Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks
to the executors to run.

The SparkContext runs the user's main function and executes the various parallel
operations on the nodes. Then, the SparkContext collects the results of the operations.
The nodes read and write data from and to the file system. The nodes also cache
transformed data in-memory as Resilient Distributed Datasets (RDDs).

The SparkContext connects to the Spark pool and is responsible for converting an
application to a directed acyclic graph (DAG). The graph consists of individual tasks that
get executed within an executor process on the nodes. Each application gets its own
executor processes, which stay up for the duration of the whole application and run
tasks in multiple threads.
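
The flow described above (a driver splits work into tasks, executors run those tasks in multiple threads, and the driver collects the results) can be imitated in plain Python. This is a simulation using a thread pool, not Spark itself; `split_into_partitions`, `run_task`, `driver`, and the round-robin partitioning scheme are hypothetical names and choices made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_partitions(data: List[int], num_partitions: int) -> List[List[int]]:
    # Round-robin split so each partition gets a similar number of elements.
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition: List[int]) -> int:
    # One task: square every element of a partition and sum the results.
    return sum(x * x for x in partition)


def driver(data: List[int], num_partitions: int = 4, executor_threads: int = 2) -> int:
    # The "driver" plans one task per partition, hands the tasks to a pool of
    # worker threads (standing in for executors), then combines the results.
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=executor_threads) as pool:
        partial_sums = list(pool.map(run_task, partitions))
    return sum(partial_sums)


result = driver(list(range(1, 101)))
print(result)
```

Each partition is processed independently, so the number of worker threads can change without affecting the final result; only the combining step in `driver` sees all partial results.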

Apache Spark in Azure Synapse Analytics use cases
Spark pools in Azure Synapse Analytics enable the following key scenarios:

Data Engineering/Data Preparation


Apache Spark includes many language features to support preparation and processing
of large volumes of data so that it can be made more valuable and then consumed by
other services within Azure Synapse Analytics. This is enabled through multiple
languages (C#, Scala, PySpark, Spark SQL) and supplied libraries for processing and
connectivity.

Machine Learning
Apache Spark comes with MLlib, a machine learning library built on top of Spark that
you can use from a Spark pool in Azure Synapse Analytics. Spark pools in Azure Synapse
Analytics also include Anaconda, a Python distribution with a variety of packages for
data science including machine learning. When combined with built-in support for
notebooks, you have an environment for creating machine learning applications.

Where do I start
Use the following articles to learn more about Apache Spark in Azure Synapse Analytics:
Quickstart: Create a Spark pool in Azure Synapse
Quickstart: Create an Apache Spark notebook
Tutorial: Machine learning using Apache Spark
Apache Spark official documentation

Note

Some of the official Apache Spark documentation relies on using the Spark console,
which is not available on Azure Synapse Spark. Use the notebook or IntelliJ
experiences instead.

Next steps

In this overview, you get a basic understanding of Apache Spark in Azure Synapse
Analytics. Advance to the next article to learn how to create a Spark pool in Azure
Synapse Analytics:

Create a Spark pool in Azure Synapse
