
Azure Synapse Analytics



Azure Synapse Analytics
• Azure Synapse is an enterprise analytics service that accelerates time to insight across:
  • Data warehouses
  • Big data systems


Azure Synapse Analytics
• Azure Synapse brings together:
  • The best of SQL technologies used in enterprise data warehousing
  • Spark technologies used for big data
  • Data Explorer for log and time series analytics
  • Pipelines for data integration and ETL/ELT
  • Deep integration with other Azure services such as Power BI, Cosmos DB, and Azure ML




Azure Synapse Analytics
[Diagram] Data sources 1…n → ETL operations (ADF) → big data analytics (like Databricks) → data warehouses (like SQL DB) → BI tools (like Power BI) → business decision makers, with ADLS as the underlying storage; Azure Synapse Analytics brings all of these stages together in a single service


Azure Synapse Analytics

[Diagram - Credit: Azure Cloud]


Basic Concepts in Azure Synapse Analytics


Basic Concepts
• Synapse Workspace & Studio
• Linked Services
• SQL pool
• Apache Spark pool
• Pipelines
• Data flow
• Integration with Power BI


Basic Concepts
• Synapse Workspace: The central environment where you manage and work with your data and analytics resources; a logical container for your analytics projects
• Inside a Synapse workspace you can create and manage:
  • SQL pools
  • Apache Spark pools
  • Data flows
  • Linked services
  • Pipelines
  • Other resources related to data integration, analytics, and reporting


Basic Concepts
• Synapse Studio: A web-based integrated development environment (IDE) provided by Azure Synapse Analytics
• Serves as the primary tool for:
  • Designing
  • Developing
  • Monitoring
  • Managing data analytics solutions
• Enables collaboration on:
  • Data integration
  • Data preparation
  • Data transformation
  • Data visualization tasks


Basic Concepts
• Linked Services: Configurations that define connections to external data sources or destinations
• These connections can be to:
  • Databases
  • Storage accounts
  • Other services you need to access within your data analytics workflows
• Linked services store the connection information, credentials, and other settings needed to interact with external data sources
• Linked services are used in various components, such as pipelines and data flows


Basic Concepts
• SQL Pool (formerly known as SQL Data Warehouse): The component of Azure Synapse Analytics used for high-performance data warehousing and structured data analytics
• Provides a massively parallel processing (MPP) architecture for query execution
• Apache Spark Pool: Used for big data analytics
• With Apache Spark pools, you can use Spark notebooks and libraries to perform a wide range of data processing tasks, including:
  • Data preparation
  • Exploration
  • Advanced analytics


Basic Concepts
• Pipelines: Used to create, schedule, and manage data workflows
• Allow you to orchestrate data movement and data transformation from various sources to destinations
• Pipelines can include activities such as:
  • Data copying
  • Data transformation
  • Data loading
• Azure Data Factory is often used to design and manage these pipelines


Basic Concepts
• Data Flow: A data transformation activity within pipelines
• Provides a visual interface for designing ETL (Extract, Transform, Load) processes
• Data flows are used to clean, enrich, and shape data as it moves through the pipeline, making them ideal for data preparation and transformation tasks


Basic Concepts
• Integration with Power BI: Azure Synapse Analytics can be integrated with Power BI, a powerful BI tool
• This integration allows you to create interactive reports and dashboards
• You can connect Power BI directly to your Synapse Analytics data sources, enabling real-time data analysis and visualization for BI and reporting purposes
• Together, these concepts provide a comprehensive environment for:
  • Data integration
  • Data analytics
  • Reporting in Azure Synapse Analytics


Basic Concepts
• Synapse Workspace & Studio
• Linked Services
• SQL pool
• Apache Spark pool
• Pipelines
• Data flow
• Integration with Power BI


Analyze Data with a Serverless SQL Pool


Analyze Data with a Serverless SQL Pool

• Step 1: Upload the datasets to Gen2 storage (employee.csv, NYCTripSmall.parquet)
• Step 2: The built-in serverless SQL pool
• Step 3: Analyze employee data with a serverless SQL pool (employee.csv dataset)
• Step 4: Analyze NYC Taxi data with a serverless SQL pool (NYCTripSmall.parquet dataset)
• Step 5: Select the top 100 rows from both datasets (UI)


Analyze Data with Dedicated SQL Pool


Analyze Data with Dedicated SQL Pool

• Step 1: Upload the dataset to Gen2 storage (NYCTripSmall.parquet)
• Step 2: Create a dedicated SQL pool
• Step 3: Load the NYC Taxi data into the dedicated SQL pool (see the sketch below)
• Step 4: Explore the NYC Taxi data in the dedicated SQL pool
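One common way to implement Step 3 is the COPY statement. A minimal sketch, assuming the storage account and container used elsewhere in this deck and a hypothetical, pre-created table dbo.NYCTripSmall:

COPY INTO dbo.NYCTripSmall
FROM 'https://gen2storageaccount257.dfs.core.windows.net/synapsecontainer/data/NYCTripSmall.parquet'
WITH (
    FILE_TYPE = 'PARQUET'   -- Parquet is self-describing, so no field/row terminators are needed
);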


Analyze Data with a Serverless Spark Pool


Analyze Data with a Serverless Spark Pool

• Step 1: Create a serverless Apache Spark pool
• Step 2: Understand serverless Apache Spark pools
• Step 3: Analyze NYC Taxi data with a Spark pool
• Step 4: Load the NYC Taxi data into the Spark nyctaxi database
• Step 5: Analyze the NYC Taxi data using Spark and notebooks


Understanding Serverless Spark Pools

• A serverless Spark pool is a way of indicating how a user wants to work with Spark
• When you start using a pool, a Spark session is created if needed
• The pool controls:
  • How many Spark resources will be used by that session
  • How long the session will last before it automatically pauses
• You pay for the Spark resources used during that session, not for the pool itself
• In this way, a Spark pool lets you use Apache Spark without managing clusters
• This is similar to how a serverless SQL pool works


Integrate with Pipelines


Integrate with Pipelines
• Step 1: Create a pipeline and add a notebook activity
• Step 2: Force the pipeline to run immediately
• Step 3: Schedule the pipeline to run every hour
• Step 4: Monitor pipeline execution


Monitor Synapse Workspace


Monitor Synapse Workspace
• Introduction to the Monitor Hub
• Integration
• Data Explorer activities
• Apache Spark activities
• SQL activities


Administrative Accounts in Synapse SQL


Administrative accounts in Synapse SQL

• When you create a Synapse Analytics workspace, two admin IDs are created:
1. SQL admin
2. Microsoft Entra admin
• Synapse Analytics uses a combination of Microsoft Entra and SQL authentication methods to manage admin access to its various components


Administrative accounts in Synapse SQL

• SQL admin username: The traditional SQL authentication username that grants access to dedicated SQL pools within an Azure Synapse workspace
• This account possesses full administrative privileges over the SQL pool, including creating, reading, updating, and deleting data, as well as managing database objects and user permissions
• SQL Microsoft Entra admin: A more secure and centralized approach to managing SQL pool administration
• It leverages Microsoft Entra (a unified identity management platform) to authenticate and authorize access to the SQL pool
• This method eliminates the need to manage local SQL passwords and provides a more robust authentication mechanism


Administrative accounts in Synapse SQL

Feature               | SQL admin username                 | SQL Microsoft Entra admin
Authentication method | Local SQL authentication           | Microsoft Entra
Scope                 | Individual dedicated SQL pool      | Entire Azure Synapse workspace
Security              | Less secure due to local passwords | More secure with centralized identity management
Management            | Requires managing local passwords  | Simplifies administration and reduces password exposure


Temporary Tables in Synapse SQL


Temporary tables in Synapse SQL

• Temporary tables are useful when processing data, especially during transformations where the intermediate results are transient
• In Synapse SQL, temporary tables exist at the session level. They are visible only to the session in which they were created, and they are automatically dropped when the session ends (see the sketch below)
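A minimal sketch of a session-scoped temporary table created with CTAS in a dedicated SQL pool; the table, columns, and source table are illustrative assumptions:

CREATE TABLE #stage_trips
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)   -- heap suits transient staging data
AS
SELECT trip_id, fare_amount
FROM dbo.NYCTripSmall;                    -- hypothetical permanent source table

SELECT COUNT(*) FROM #stage_trips;        -- visible only within this session
-- #stage_trips is dropped automatically when the session ends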


Temporary table limitations

• Dedicated SQL pool has a few implementation limitations for temporary tables:
1. Only session-scoped temporary tables are supported; global temporary tables aren't supported
2. Views can't be created on temporary tables
3. Temporary tables can only be created with hash or round-robin distribution; replicated temporary table distribution isn't supported


Temporary tables in serverless SQL pool

• Temporary tables in serverless SQL pool are supported, but their usage is limited: they can't be used in queries that target files
• For example, you can't join a temporary table with data from files in storage
• The number of temporary tables is limited to 100, and their total size is limited to 100 MB


Identity


Agenda

• Surrogate key
• Creating a table with an IDENTITY column
• Allocation of values
• Skewed data
• SELECT..INTO
• CREATE TABLE AS SELECT
• Explicitly inserting values into an IDENTITY column
• Loading data
• System views
• Limitations
• Common tasks: find the highest allocated value for a table; find the seed and increment for the IDENTITY property


What is a surrogate key?

• A surrogate key on a table is a column with a unique identifier for each row
• The key is not generated from the table data
• Data modelers like to create surrogate keys on their tables when they design data warehouse models
• You can use the IDENTITY property to achieve this goal simply and effectively, without affecting load performance (see the sketch below)
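A minimal sketch of creating a table with an IDENTITY column in a dedicated SQL pool; the table and column names are illustrative (note that the IDENTITY column can't also be the distribution key):

CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT IDENTITY(1, 1) NOT NULL,   -- surrogate key: seed 1, increment 1
    CustomerName NVARCHAR(100)      NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerName), CLUSTERED COLUMNSTORE INDEX);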


Skewed data

• The range of values for the data type is spread evenly across the distributions
• If a distributed table suffers from skewed data, the range of values available to the data type can be exhausted prematurely
• For example, if all the data ends up in a single distribution, then effectively the table has access to only one-sixtieth of the values of the data type
• For this reason, the IDENTITY property is limited to the INT and BIGINT data types only


SELECT..INTO

• When an existing IDENTITY column is selected into a new table, the new column inherits the IDENTITY property, unless one of the following conditions is true:
  • The SELECT statement contains a join
  • Multiple SELECT statements are joined by using UNION
  • The IDENTITY column is listed more than one time in the SELECT list
  • The IDENTITY column is part of an expression
• If any one of these conditions is true, the column is created NOT NULL instead of inheriting the IDENTITY property (see the sketch below)
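A minimal sketch contrasting the two cases, using the hypothetical dbo.DimCustomer table above:

-- The new CustomerKey column inherits the IDENTITY property:
-- a plain SELECT with no join, UNION, repetition, or expression
SELECT CustomerKey, CustomerName
INTO dbo.DimCustomer_copy
FROM dbo.DimCustomer;

-- The new column is created NOT NULL instead, because the
-- IDENTITY column is part of an expression
SELECT CustomerKey + 0 AS CustomerKey, CustomerName
INTO dbo.DimCustomer_noidentity
FROM dbo.DimCustomer;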


CREATE TABLE AS SELECT

• CREATE TABLE AS SELECT (CTAS) follows the same SQL Server behavior that's documented for SELECT..INTO
• However:
  • You can't specify an IDENTITY property in the column definition of the CREATE TABLE part of the statement
  • You also can't use the IDENTITY function in the SELECT part of the CTAS
• To populate such a table, use CREATE TABLE to define the table, followed by INSERT..SELECT to populate it (see the sketch below)
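A minimal sketch of the CREATE TABLE followed by INSERT..SELECT pattern, with hypothetical table names:

CREATE TABLE dbo.DimProduct
(
    ProductKey  INT IDENTITY(1, 1) NOT NULL,   -- IDENTITY must be defined here, not via CTAS
    ProductName NVARCHAR(100)      NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductName), CLUSTERED COLUMNSTORE INDEX);

INSERT INTO dbo.DimProduct (ProductName)       -- ProductKey values are generated automatically
SELECT ProductName
FROM dbo.StagingProduct;                       -- hypothetical staging table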


Limitations

• The IDENTITY property can't be used:
  • When the column data type is not INT or BIGINT
  • When the column is also the distribution key
  • When the table is an external table


Limitations

• The following related functions are not supported in dedicated SQL pool (see the sketch after this list):
  • IDENTITY()
  • @@IDENTITY
  • SCOPE_IDENTITY
  • IDENT_CURRENT
  • IDENT_INCR
  • IDENT_SEED
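Because these functions are unavailable, the common tasks listed in the agenda are handled with ordinary queries against the table and the catalog views. A minimal sketch, using the hypothetical dbo.DimCustomer table above:

-- Find the highest allocated value for a table
SELECT MAX(CustomerKey) AS HighestValue FROM dbo.DimCustomer;

-- Find the seed and increment for the IDENTITY property
SELECT sm.name AS schema_name, tb.name AS table_name,
       co.name AS column_name, co.seed_value, co.increment_value
FROM sys.identity_columns AS co
JOIN sys.tables  AS tb ON co.object_id = tb.object_id
JOIN sys.schemas AS sm ON tb.schema_id = sm.schema_id
WHERE sm.name = 'dbo' AND tb.name = 'DimCustomer';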


OPENROWSET


OPENROWSET

• The OPENROWSET function allows you to access files in Azure Storage
• The OPENROWSET function:
  • Reads the content of a remote data source
  • Returns the content as a set of rows
• Within the serverless SQL pool resource, the OPENROWSET bulk rowset provider is accessed by:
  • Calling the OPENROWSET function
  • Specifying the BULK option


OPENROWSET

• The OPENROWSET function can be referenced in the FROM clause of a query like a table name
• Example:

SELECT *
FROM OPENROWSET(
    BULK 'https://gen2storageaccount257.dfs.core.windows.net/synapsecontainer/data/employee.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS output
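A similar sketch for the Parquet dataset used earlier in this deck; the storage path mirrors the CSV example and is an assumption:

SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://gen2storageaccount257.dfs.core.windows.net/synapsecontainer/data/NYCTripSmall.parquet',
    FORMAT = 'PARQUET'   -- Parquet is self-describing, so no parser options are needed
) AS trips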


Indexes on Dedicated SQL Pool Tables


Index types

• Dedicated SQL pool offers several indexing options, including:
  • Clustered columnstore indexes
  • Clustered indexes and nonclustered indexes
  • A non-index option, also known as heap


Clustered columnstore indexes

• By default, dedicated SQL pool creates a clustered columnstore index when no index options are specified on a table
• Clustered columnstore tables offer both:
  • The highest level of data compression
  • The best overall query performance
• Clustered columnstore tables will generally outperform clustered index or heap tables, and they are usually the best choice for large tables
• For these reasons, clustered columnstore is the best place to start when you are unsure of how to index your table


Clustered columnstore indexes

• To create a clustered columnstore table, simply specify CLUSTERED COLUMNSTORE INDEX in the WITH clause, or leave the WITH clause off (see the sketch below)

[Figure: example syntax - Credit: Azure Cloud]
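A minimal sketch with a hypothetical fact table (the original slide shows a screenshot of similar syntax):

CREATE TABLE dbo.FactSales
(
    SaleKey INT            NOT NULL,
    Amount  DECIMAL(18, 2) NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX);   -- also the default when the WITH clause is omitted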


Clustered columnstore indexes

• A few scenarios where clustered columnstore may not be a good option:
1. Columnstore tables do not support varchar(max), nvarchar(max), and varbinary(max); consider a heap or clustered index instead
2. Columnstore tables may be less efficient for transient data; consider heap tables, and perhaps even temporary tables
3. For small tables with fewer than 60 million rows, consider heap tables


Heap tables

• When you are temporarily landing data in dedicated SQL pool, you may find that using a heap table makes the overall process faster
• This is because loads to heaps are faster than loads to indexed tables, and in some cases the subsequent read can be done from cache
• If you are loading data only to stage it before running more transformations, loading it into a heap table is much faster than loading it into a clustered columnstore table
• In addition, loading data into a temporary table is faster than loading it into permanent storage. After loading the data, you can create indexes on the table for faster query performance


Heap tables

• Clustered columnstore tables begin to achieve optimal compression once there are more than 60 million rows
• For small lookup tables with fewer than 60 million rows, consider using a heap or clustered index for faster query performance


Heap tables

• To create a heap table, simply specify HEAP in the WITH clause (see the sketch below)

[Figure: example syntax - Credit: Azure Cloud]
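A minimal sketch with a hypothetical staging table:

CREATE TABLE dbo.StageSales
(
    SaleKey INT            NOT NULL,
    Amount  DECIMAL(18, 2) NOT NULL
)
WITH (HEAP);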


Clustered and nonclustered indexes

• Clustered indexes may outperform clustered columnstore tables when a single row needs to be quickly retrieved
• For queries where a single-row or very-few-row lookup must be performed with extreme speed, consider a clustered index or a nonclustered secondary index
• The disadvantage of using a clustered index is that the only queries that benefit are the ones that use a highly selective filter on the clustered index column
• To improve filtering on other columns, a nonclustered index can be added to those columns
• However, each index that is added to a table adds both space and processing time to loads


Clustered and nonclustered indexes

• To create a clustered index table, simply specify CLUSTERED INDEX in the WITH clause (see the sketch below)

[Figure: example syntax - Credit: Azure Cloud]
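A minimal sketch with a hypothetical lookup table:

CREATE TABLE dbo.DimDate
(
    DateKey      INT  NOT NULL,
    CalendarDate DATE NOT NULL
)
WITH (CLUSTERED INDEX (DateKey));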


Clustered and nonclustered indexes

• To add a nonclustered index to a table, use the following syntax (sketched below)

[Figure: example syntax - Credit: Azure Cloud]
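A minimal sketch, adding a secondary index to the hypothetical dbo.DimDate table above:

CREATE INDEX ix_DimDate_CalendarDate
ON dbo.DimDate (CalendarDate);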


Partitioning Tables in Dedicated SQL Pool


Agenda

• What are table partitions?
• Partition sizing


Table partitions

• Table partitions enable you to divide your data into smaller groups of data. In most cases, table partitions are created on a date column
• Partitioning is supported on all dedicated SQL pool table types:
  • Clustered columnstore
  • Clustered index
  • Heap
• Partitioning is also supported on all distribution types:
  • Hash-distributed
  • Round-robin distributed


Table partitions

• Partitioning can benefit:
  • Data maintenance
  • Query performance
• Whether it benefits both or just one depends on how data is loaded and whether the same column can be used for both purposes, since partitioning can only be done on one column


Table partitions

• Benefits for data loading:
1. Turbocharge your data pipeline: Load data at warp speed with partition deletion, switching, and merging
2. Log less, do more: Skip time-consuming transaction logging and focus on what matters
3. Archive like a pro: Archive old data effortlessly by dropping partitions (e.g., say goodbye to month-long deletes)


Table partitions

• Benefits for your queries:
1. Laser-focused searches: Find what you need faster; queries skip unwanted partitions, scanning only relevant data
2. Smarter scans, even smarter results: Clustering and indexing amplify the benefits, optimizing even more searches


Table partitions

• Remember: Choose a key date column for partitioning that matches your data flow for maximum impact


Partition sizing

• Balancing act: Partitioning enhances performance, but excessive partitioning can harm it, especially for clustered columnstore tables
• Key considerations:
  • Understand when to use partitioning and the optimal number of partitions
  • Successful schemes typically involve tens to hundreds of partitions, not thousands
• Clustered columnstore tables:
  • Optimal compression and performance require a minimum of 1 million rows per distribution and partition
  • Distributions are preset to 60; any partitions you add should align with this distribution count


Partition sizing

• Example scenario:
  • For a sales fact table with 36 monthly partitions and 60 distributions, aim for 60 million rows per month (60 distributions × 1 million rows), or about 2.1 billion rows in total (36 × 60 million)
  • Adjust partitioning based on the recommended minimum rows per partition for optimal performance


Syntax differences between dedicated SQL pool and SQL Server

• Dedicated SQL pool introduces a way to define partitions that is simpler than SQL Server's
• Partition functions and schemes are not used in dedicated SQL pool as they are in SQL Server
• Instead, all you need to do is identify the partitioned column and the boundary points (see the sketch below)
• While the partitioning syntax may be slightly different from SQL Server's, the basic concepts are the same: SQL Server and dedicated SQL pool both support one partition column per table, which can be range-partitioned
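A minimal sketch of the dedicated SQL pool syntax, with hypothetical table and boundary values:

CREATE TABLE dbo.FactSalesPartitioned
(
    SaleKey  INT            NOT NULL,
    SaleDate DATE           NOT NULL,
    Amount   DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleKey),
    CLUSTERED COLUMNSTORE INDEX,
    -- only the partition column and boundary points; no partition function or scheme
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2023-01-01', '2023-02-01', '2023-03-01'))
);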


Replicated Tables


Replicated Table

• A replicated table has a full copy of the table accessible on each Compute node (see the sketch below)
• Replicating a table removes the need to transfer data among Compute nodes before a join or aggregation
• Since the table has multiple copies, replicated tables work best when the table size is less than 2 GB compressed
• 2 GB is not a hard limit; if the data is static and does not change, you can replicate larger tables
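A minimal sketch of creating a replicated table, with a hypothetical dimension table:

CREATE TABLE dbo.DimGeography
(
    GeographyKey INT          NOT NULL,
    City         NVARCHAR(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);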


Replicated Table

[Diagram - Credit: Azure Cloud]


Replicated Table

• Replicated tables are ideal for dimension tables in a star schema, particularly when dealing with slowly changing descriptive data like customer names and addresses
• This approach minimizes maintenance due to the stable nature of dimension data


Replicated Table

• Consider using replicated tables when:
  • The table size is less than 2 GB, irrespective of row count
  • Tables are frequently joined, since replication avoids data movement in the process
  • Query plans reveal a BroadcastMoveOperation, indicating a potential benefit from eliminating data movement


Replicated Table

• Replicated tables may not offer optimal performance in scenarios involving:
  • Frequent insert, update, or delete operations, which necessitate frequent rebuilding
  • Frequent scaling of the SQL pool, which leads to replicated table rebuilding
  • Tables with many columns where only a subset is frequently accessed; consider a distributed table with selective indexing for improved efficiency in such cases


Table Constraints


Table constraints

• Dedicated SQL pool supports these table constraints (see the sketch after this list):
  • PRIMARY KEY is only supported when NONCLUSTERED and NOT ENFORCED are both used
  • UNIQUE constraint is only supported when NOT ENFORCED is used
  • FOREIGN KEY constraint is not supported in dedicated SQL pool
• Having a primary key and/or unique key allows the dedicated SQL pool engine to generate an optimal execution plan for a query
• All values in a primary key column or a unique constraint column should be unique
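A minimal sketch showing both supported constraint forms, with a hypothetical table:

CREATE TABLE dbo.DimAccount
(
    AccountKey  INT          NOT NULL PRIMARY KEY NONCLUSTERED NOT ENFORCED,
    AccountCode NVARCHAR(20) NOT NULL UNIQUE NOT ENFORCED
)
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);

-- NOT ENFORCED means the engine does not check uniqueness;
-- the loading process is responsible for keeping these values unique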


Apache Spark in Azure Synapse Analytics


Apache Spark in Azure Synapse Analytics

• Apache Spark: A parallel processing framework that supports in-memory processing to boost the performance of big data analytic applications
• Apache Spark in Azure Synapse Analytics: One of Microsoft's implementations of Apache Spark in the cloud
• Azure Synapse makes it easy to create and configure a serverless Apache Spark pool in Azure
• Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake Storage Gen2, so you can use Spark pools to process your data stored in Azure


Apache Spark in Azure Synapse Analytics

[Diagram - Credit: Azure Cloud]


Apache Spark in Azure Synapse Analytics

• Apache Spark provides primitives for in-memory cluster computing
• A Spark job can load and cache data into memory and query it repeatedly
• In-memory computing is much faster than disk-based applications
• Spark also integrates with multiple programming languages to let you manipulate distributed data sets like local collections
• There's no need to structure everything as map and reduce operations


Apache Spark in Azure Synapse Analytics

[Diagram - Credit: Azure Cloud]


Apache Spark in Azure Synapse Analytics

Feature              | Traditional MapReduce | Spark
Data storage         | HDFS only             | HDFS or other sources
Intermediate results | On disk               | In memory
Shuffle              | Less efficient        | More efficient
Fault tolerance      | Less fault-tolerant   | More fault-tolerant


Benefits of creating a Spark pool in Azure Synapse Analytics

• Azure documentation link: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview#what-is-apache-spark


Spark Pool Architecture

• SparkContext coordinates: The SparkContext object, within your main program (the driver), coordinates the application's processes on the pool
• Resource allocation: It connects to the cluster manager (YARN) to allocate resources and acquire executors on nodes
• Code distribution: Your application code (JAR or Python files) is sent to the executors
• Task execution: SparkContext sends tasks to the executors to run
• Parallel operations: Executors execute parallel operations on nodes
• Result collection: SparkContext gathers the results


Spark Pool Architecture

• Data reading/writing: Nodes handle reading data from and writing data to the file system
• Data caching: Nodes cache transformed data in memory as RDDs
• DAG creation: SparkContext converts the application into a DAG (directed acyclic graph)
• Task execution: Individual tasks run within executor processes on nodes
• Dedicated executors: Each application gets its own executor processes, which persist throughout the application and run tasks in multiple threads


Use Cases

• Data engineering/data preparation: Apache Spark empowers large-scale data preparation and processing with multiple languages (C#, Scala, PySpark, Spark SQL) and extensive libraries
  • Makes data more valuable for other Azure Synapse Analytics services
• Machine learning: Azure Synapse Analytics provides a complete machine learning environment:
  • MLlib: Built-in machine learning library for Spark, ready for large-scale tasks
  • Anaconda: Pre-installed Python distribution with extensive data science packages
  • Notebooks: Built-in support for creating and running machine learning workflows


Use Cases

• Streaming data: Synapse Spark supports structured streaming, with these key points:
  • Limited support: Only with specific runtime versions
  • 7-day lifespan: All jobs (batch and streaming) expire after 7 days
  • Restart automation: Typically managed using Azure Functions

