Azure Synapse Analytics
Deep integration with other Azure services such as Power BI, CosmosDB, and AzureML
Linked Services
SQL pool
Pipelines
Data flow
Linked Services store connection information, credentials, and other settings needed to interact
with external data sources
Linked services are used in various components like pipelines and data flows
With Apache Spark pools, you can use Spark notebooks and libraries to perform a wide range of data processing tasks, including,
Data preparation
Exploration
Advanced analytics
Pipelines allow you to orchestrate data movement & data transformation from various sources to
destinations
Azure Data Factory is often used to design and manage these pipelines
It provides a visual interface for designing ETL processes (Extract, Transform, Load)
Data Flows are used to clean, enrich, and shape data as it moves through the pipeline, making
them ideal for data preparation and transformation tasks
You can connect Power BI directly to your Synapse Analytics data sources, enabling real-time data analysis &
visualization for BI and reporting purposes
Data analytics
Step 3: Analyze employee data with a serverless SQL pool (employee.csv dataset)
Step 4: Analyze NYC Taxi data with a serverless SQL pool (NYCTripSmall.parquet dataset)
Step 3: Load the NYC Taxi Data into Dedicated SQL Pool
Step 4: Explore the NYC Taxi data in the dedicated SQL pool
Step 4: Load the NYC Taxi data into the Spark nyctaxi database
Step 5: Analyze the NYC Taxi data using Spark and notebooks
You pay for Spark resources used during that session, not for the pool itself
This way a Spark pool lets you use Apache Spark without managing clusters
This is similar to how a serverless SQL pool works
Integration
SQL activities
When we create a Synapse Analytics workspace, two admin IDs are created,
1. SQL admin
Synapse Analytics utilizes the combination of Microsoft Entra & SQL authentication methods to
manage admin access to various components
This account possesses full administrative privileges over the SQL pool, including creating, reading, updating,
and deleting data, as well as managing database objects and user permissions
2. SQL Microsoft Entra admin: A more secure & centralized approach to managing SQL pool administration
It leverages Microsoft Entra (a unified identity management platform) to authenticate & authorize access to the
SQL pool
This method eliminates the need for managing local SQL passwords & provides a more robust authentication
mechanism
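As a hedged illustration (the user principal name is hypothetical), once the Microsoft Entra admin is set, individual Entra identities can be given access with T-SQL like the following,
CREATE USER [data.analyst@contoso.com] FROM EXTERNAL PROVIDER;
-- Grant read access; no SQL password is created or stored anywhere
EXEC sp_addrolemember 'db_datareader', 'data.analyst@contoso.com';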
Temporary tables are useful when processing data, especially during transformation where the
intermediate results are transient
With Synapse SQL, temporary tables exist at the session level. They're only visible to the
session in which they were created. They are automatically dropped when the session ends
Dedicated SQL pool does have a few implementation limitations for temporary tables:
1. Only session scoped temporary tables are supported. Global Temporary Tables aren't
supported
2. Temporary tables can only be created with hash or round robin distribution. Replicated
temporary table distribution isn't supported
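A minimal sketch of a session-scoped temporary table in a dedicated SQL pool, created with CTAS and round robin distribution (table and column names are hypothetical),
CREATE TABLE #stage_sales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT SaleID, Amount
FROM dbo.FactSales
WHERE SaleDate >= '2024-01-01';

-- Visible only to this session; dropped automatically when the session ends
SELECT COUNT(*) AS staged_rows FROM #stage_sales;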
Temporary tables in serverless SQL pool are supported but their usage is limited. They can't be
used in queries which target files
For example, you can't join a temporary table with data from files in storage
The number of temporary tables is limited to 100, and their total size is limited to 100 MB
Data modelers like to create surrogate keys on their tables when they design data warehouse
models
You can use the IDENTITY property to achieve this goal simply & effectively without affecting
load performance
If a distributed table suffers from skewed data, then the range of values available to the
datatype can be exhausted prematurely
For example, if all the data ends up in a single distribution, then effectively the table has
access to only one-sixtieth of the values of the data type
For this reason, the IDENTITY property is limited to INT and BIGINT data types only.
If any one of these conditions is true, the column is created NOT NULL instead of inheriting the
IDENTITY property
However, with CTAS (CREATE TABLE AS SELECT), you can't specify an IDENTITY property in the column
definition of the CREATE TABLE part of the statement
You also can't use the IDENTITY function in the SELECT part of the CTAS
To populate a table, you need to use CREATE TABLE to define the table followed by INSERT..SELECT
to populate it
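A minimal sketch of a surrogate key generated with IDENTITY and populated via INSERT..SELECT (table and column names are hypothetical),
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT IDENTITY(1,1) NOT NULL,  -- surrogate key; IDENTITY supports INT and BIGINT only
    CustomerName NVARCHAR(100)     NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerName), CLUSTERED COLUMNSTORE INDEX);

-- The IDENTITY values are generated automatically during the load
INSERT INTO dbo.DimCustomer (CustomerName)
SELECT DISTINCT CustomerName
FROM dbo.StageCustomer;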
IDENTITY()
@@IDENTITY
SCOPE_IDENTITY
IDENT_CURRENT
IDENT_INCR
IDENT_SEED
OPENROWSET function: Within the serverless SQL pool, the OPENROWSET bulk rowset provider is accessed as follows,
SELECT *
FROM OPENROWSET(
BULK 'https://gen2storageaccount257.dfs.core.windows.net/synapsecontainer/data/employee.csv',
FORMAT = 'CSV',
PARSER_VERSION = '2.0',
HEADER_ROW = TRUE
) as output
Clustered columnstore tables will generally outperform clustered index or heap tables and are usually the best
choice for large tables
For these reasons, clustered columnstore is the best place to start when you are unsure of how to index your
table
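A minimal sketch of a clustered columnstore fact table (hypothetical names); clustered columnstore is also the default when no index option is specified,
CREATE TABLE dbo.FactWebSales
(
    OrderID  BIGINT        NOT NULL,
    Quantity INT           NOT NULL,
    Amount   DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderID), CLUSTERED COLUMNSTORE INDEX);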
However, there are cases where clustered columnstore may not be the best option:
1. Columnstore tables may be less efficient for transient data. Consider heap and perhaps even
temporary tables
2. For small tables with less than 60 million rows, consider heap tables
This is because loads to heaps are faster than loads to index tables and, in some cases, the subsequent
read can be done from the cache
If you are loading data only to stage it before running more transformations, loading the data into a
heap table is much faster than loading it into a clustered columnstore table
In addition, loading data to a temporary table loads faster than loading a table to permanent
storage. After data loading, you can create indexes in the table for faster query performance
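A hedged sketch of a heap staging table loaded with the COPY statement (storage path and table names are hypothetical; a credential may be required depending on the storage setup),
CREATE TABLE dbo.StageSales
(
    SaleID INT           NOT NULL,
    Amount DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Load raw files into the heap first, then transform into the final clustered columnstore table
COPY INTO dbo.StageSales
FROM 'https://mystorageaccount.blob.core.windows.net/staging/sales/*.csv'
WITH (FILE_TYPE = 'CSV', FIRSTROW = 2);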
For small lookup tables (fewer than 60 million rows), consider using a heap or a clustered index for
faster query performance
For queries where a single or very few row lookup is required to perform with extreme speed,
consider a clustered index or nonclustered secondary index
The disadvantage of using a clustered index is that the only queries that benefit are the ones that use a highly
selective filter on the clustered index column
To improve filtering on other columns, a nonclustered index can be added to those columns
However, each index that is added to a table adds both space and processing time to loads
To create a clustered index table, simply specify CLUSTERED INDEX in the WITH clause
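A minimal sketch of a lookup table with a clustered index plus a nonclustered index for filters on another column (hypothetical names),
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED INDEX (ProductKey));

-- Speeds up highly selective filters on ProductName, at the cost of extra space and slower loads
CREATE INDEX IX_DimProduct_ProductName ON dbo.DimProduct (ProductName);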
Partition sizing
Partitioning can benefit data maintenance and query performance
Whether it benefits both or just one depends on how data is loaded and whether the
same column can be used for both purposes, since partitioning can only be done on one column
1. Turbocharge Your Data Pipeline: Load data at warp speed with partition deletion, switching,
and merging
2. Log Less, Do More: Skip time-consuming transaction logs - focus on what matters
3. Archive Like a Pro: Archive old data effortlessly by dropping or switching partitions (e.g., say goodbye to
month-long deletes; see the sketch after these lists)
1. Laser-Focused Searches: Find what you need faster. Queries skip unwanted partitions,
scanning only relevant data
2. Smarter Scans, Even Smarter Results: Clustering and indexing amplify the benefits, optimizing
searches even further
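A hedged sketch of the archiving point above: switching an old partition out of the fact table (table names and partition numbers are hypothetical; both tables must share the same structure and partition boundaries),
-- Metadata-only operation, so removing a whole month of data is nearly instant
ALTER TABLE dbo.FactSales SWITCH PARTITION 2 TO dbo.FactSales_Archive PARTITION 2;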
Key Considerations:
Understand when to use partitioning and the optimal number of partitions
Distributions are pre-set to 60; the partition count should be chosen with these 60 distributions in mind
For example, for a sales fact table with 36 monthly partitions and 60 distributions, aim for at least 1 million rows
per distribution and partition, i.e. 60 million rows per month or roughly 2.16 billion rows in total.
Adjust partitioning based on the recommended minimum rows per partition for optimal
performance.
Dedicated SQL pool introduces a way to define partitions that is simpler than SQL Server
Partitioning functions and schemes are not used in dedicated SQL pool as they are in SQL
Server
Instead, all you need to do is identify the partition column and the boundary points
While the syntax of partitioning may be slightly different from SQL Server, the basic concepts
are the same. SQL Server and dedicated SQL pool support one partition column per table, which
can be a range partition
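A minimal sketch of defining partitions in a dedicated SQL pool; only the partition column and the boundary points are specified (hypothetical names and dates),
CREATE TABLE dbo.FactSales
(
    SaleID   INT           NOT NULL,
    SaleDate DATE          NOT NULL,
    Amount   DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(SaleID),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01'))
);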
Replicating a table removes the need to transfer data among Compute nodes before a join or
aggregation
Since the table has multiple copies, replicated tables work best when the table size is less than
2 GB compressed
2 GB is not a hard limit. If the data is static and does not change, you can replicate larger tables
This approach minimizes maintenance due to the stable nature of dimension data
For tables with many columns where only a subset is frequently accessed, replication may not be ideal;
consider a distributed table with selective indexing for improved efficiency in such cases
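A minimal sketch of a replicated dimension table (hypothetical names),
CREATE TABLE dbo.DimGeography
(
    GeographyKey INT          NOT NULL,
    City         NVARCHAR(50) NOT NULL,
    Country      NVARCHAR(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);
-- A full copy is cached on each Compute node, so joins against it need no data movement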
PRIMARY KEY is only supported when NONCLUSTERED and NOT ENFORCED are both used
Having a primary key and/or unique key allows the dedicated SQL pool engine to generate an
optimal execution plan for a query
All values in a primary key column or a unique constraint column should be unique
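A minimal sketch of an informational primary key (hypothetical names); because it is NOT ENFORCED, the load process has to guarantee the values really are unique,
CREATE TABLE dbo.DimCurrency
(
    CurrencyKey  INT         NOT NULL PRIMARY KEY NONCLUSTERED NOT ENFORCED,
    CurrencyCode NVARCHAR(3) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);
-- The engine uses the declared key only to build better plans; it does not check uniqueness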
Apache Spark: A parallel processing framework that supports in-memory processing to boost the
performance of big data analytic applications
Azure Synapse makes it easy to create & configure a serverless Apache Spark pool in Azure
Spark pools in Azure Synapse are compatible with Azure Storage & Azure Data Lake Storage Gen2,
so you can use Spark pools to process your data stored in Azure
A Spark job can load & cache data into memory and query it repeatedly
Spark also integrates with multiple programming languages to let you manipulate distributed
data sets like local collections
SparkContext Coordinates: The SparkContext object, within your main program (driver),
coordinates the application's processes on the pool
Resource Allocation: It connects to the cluster manager (YARN) to allocate resources and acquire
executors on nodes
Code Distribution: Your application code (JAR or Python files) is sent to the executors
Data Reading/Writing: Nodes handle data reading/writing from/to the file system
DAG Creation: SparkContext converts the application into a DAG (directed acyclic graph)
Dedicated Executors: Each application gets its own executor processes, which persist
throughout the application and run tasks in multiple threads
7-day lifespan: All jobs (batch and streaming) expire after 7 days