Section 1 - Design & Performance For Netezza Migration To Azure Synapse
Table of Contents
Context
Overview
Design considerations
    Migration scope
        Preparation for migration
        Choosing the workload for the initial migration
        'Lift and shift as-is' vs a phased approach incorporating changes
        Use Azure Data Factory to implement a metadata-driven migration
    Design differences between Netezza and Azure Synapse
        Multiple databases vs single database and schemas
        Table considerations
        Unsupported Netezza database object types
        Netezza data type mapping
        SQL DML syntax differences
        Functions, stored procedures and sequences
    Extracting metadata and data from a Netezza environment
        Data Definition Language (DDL) generation
        Data extraction from Netezza
Performance recommendations for Netezza migrations
    Similarities in performance tuning approach concepts
    Differences in performance tuning approach
Context
This paper is one of a series of documents that discuss aspects of migrating legacy
data warehouse implementations to Azure Synapse Analytics. The focus of this
paper is on the design and performance aspects of migrating specifically from
existing Netezza environments; other topics such as ETL, the recommended
migration approach, and advanced analytics in the data warehouse are covered in
separate documents. This document should be read in conjunction with the
'Section 1 – Design and Performance' document, which discusses the general
aspects of design and performance for migrations to Azure Synapse.
Overview
'More than just a database' – the Azure environment includes a comprehensive set of capabilities and tools

Given the end of support from IBM, many existing users of Netezza data warehouse
systems are now looking to take advantage of the innovations provided by newer
environments (e.g. cloud, IaaS, PaaS) and to delegate tasks such as infrastructure
maintenance and platform development to the cloud provider.
While there are similarities between Netezza and Azure Synapse in that both are
SQL databases designed to use massively parallel processing (MPP) techniques to
achieve high query performance on very large data volumes, there are also some
basic differences in approach:
• Legacy Netezza systems are installed on-premises, using proprietary hardware,
whereas Azure Synapse is cloud-based, using Azure storage and compute
resources.
• Upgrading a Netezza configuration is a major task involving additional physical
hardware and a potentially lengthy database reconfiguration or dump and
reload. Since storage and compute resources are separate in the Azure
environment these can easily be scaled (upwards and downwards) independently
leveraging the elastic scalability capability.
• Azure Synapse can be paused or resized as required to reduce resource
utilization and therefore cost.
Microsoft Azure is a globally available, highly secure, scalable cloud environment
which includes Azure Synapse within an eco-system of supporting tools and
capabilities.
Azure Synapse gives best performance and price-performance in independent benchmark

Azure Synapse provides best-of-breed relational database performance by using
techniques such as massively parallel processing (MPP) and automatic in-memory
caching – the results of this approach can be seen in independent benchmarks such
as the one run recently by GigaOm – see https://gigaom.com/report/data-
warehouse-cloud-benchmark/ – which compares Azure Synapse to other popular
cloud data warehouse offerings. Customers who have already migrated to this
environment have seen many benefits including:
• Improved performance and price/performance
This paper looks at schema migration with a view to obtaining equivalent or better
performance of your migrated Netezza data warehouse and data marts on Azure
Synapse. The topics included in this paper apply specifically to migrations from an
existing Netezza environment.
Design considerations
Migration scope
In terms of size, it is important that the data volume to be migrated in the initial
exercise is large enough to demonstrate the capabilities and benefits of the Azure
Synapse environment while keeping the time to demonstrate value short – typically
in the 1-10TB range.
One possible approach which will minimize the risk and reduce the implementation
time for the initial migration project is to confine the scope of the migration to just
the data marts. This approach by definition limits the scope of
the migration and can typically be achieved within short timescales and so can be a
good starting point – however this will not address the broader topics such as ETL
migration and historical data migration as part of the initial migration project. These
would have to be addressed in later phases of the project as the migrated data mart
layer is ‘back filled’ with the data and processes required to build them.
'Lift and shift as-is' vs a phased approach incorporating changes

A 'lift and shift' as-is migration minimizes both the risk and the time taken to
migrate by reducing the work that has to be done to achieve the benefits of
moving to the Azure cloud environment.
This is a good fit for existing Netezza environments where a single data mart
is to be migrated, or where the data is already in a well-designed star or snowflake
schema, or where there are time and cost pressures to move to a more modern
cloud environment.
The recommended approach for this is to initially move the existing data
model ‘as-is’ into the Azure environment then to use the performance and
flexibility of the Azure environment to apply the re-engineering changes,
leveraging the Azure capabilities where appropriate to make the changes
without impacting the existing source system.
Use Azure Data Factory to implement a metadata-driven migration

Azure Data Factory is a cloud-based data integration service that allows creation of
data-driven workflows in the cloud for orchestrating and automating data movement
and data transformation. Using Azure Data Factory, you can create and schedule
data-driven workflows (called pipelines) that can ingest data from disparate data
stores. It can process and transform the data by using compute services such as
Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine
Learning.
By creating metadata to list the data tables to be migrated and their location it is
possible to use the ADF facilities to manage the migration process.
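As a minimal illustration of this approach (the control table and its columns here are hypothetical, not a prescribed ADF structure), the driving metadata might be held in a simple table:

    CREATE TABLE dbo.migration_control
    ( source_database VARCHAR(128) NOT NULL,  -- Netezza database name
      source_table    VARCHAR(128) NOT NULL,  -- Netezza table name
      target_schema   VARCHAR(128) NOT NULL,  -- Azure Synapse schema
      target_table    VARCHAR(128) NOT NULL,  -- Azure Synapse table name
      migrated_flag   CHAR(1) DEFAULT 'N' );  -- set to 'Y' once loaded

An ADF pipeline can then read this table with a Lookup activity and iterate over the rows with a ForEach activity to drive the individual copy operations.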
Design differences between Netezza and Azure Synapse

Multiple databases vs single database and schemas

In the Azure Synapse environment there is a single database, and schemas are used
to separate the tables into logically separate groups. Therefore, the recommendation
is to use a series of schemas within the target Azure Synapse to mimic any separate
databases that will be migrated from the Netezza environment. If schemas are
already being used within the Netezza environment then it may be necessary to use
a new naming convention to move the existing Netezza tables and views to the new
environment (e.g. concatenate the existing Netezza schema and table names into the
new Azure Synapse table name and use schema names in the new environment to
maintain the original separate database names). Another option is to use SQL views
over the underlying tables to maintain the logical structures (illustrated in the
sketch after the list below) – but there are some potential downsides to this approach:
• Views in Azure Synapse are read-only – therefore any updates to the data must
take place on the underlying base tables
• There may already be a layer (or layers) of views in existence, and adding an
extra layer of views might impact performance
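As a minimal sketch of the schema-per-database approach (all object names are hypothetical), a Netezza table SALESDB.ADMIN.FACT_SALES might be carried over as follows:

    CREATE SCHEMA salesdb;

    -- Concatenate the original Netezza schema and table names:
    CREATE TABLE salesdb.admin_fact_sales
    ( sale_id   BIGINT NOT NULL,
      sale_date DATE   NOT NULL,
      amount    DECIMAL(18,2) )
    WITH ( DISTRIBUTION = HASH(sale_id), CLUSTERED COLUMNSTORE INDEX );

    -- Optional read-only view preserving the original logical table name:
    CREATE VIEW salesdb.fact_sales AS
    SELECT sale_id, sale_date, amount FROM salesdb.admin_fact_sales;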
Table considerations
Use existing indexes to give an indication of candidates for indexing in the migrated warehouse

When migrating tables between different technologies it is generally only the raw
data (and the metadata that describes it) that gets physically moved between the
two environments. Other database elements from the source system (e.g. indexes)
are not migrated, as these may not be needed or may be implemented differently
within the new target environment.
• Zone Maps – In Netezza zone maps are automatically created and maintained for
some column types and are used at query time to restrict the amount of data to
be scanned. They are created on the following column types:
INTEGER columns of length 8 bytes or less
It is possible to find out which columns have zone maps by using the
nz_zonemap utility (part of the NZ Toolkit).
Azure Synapse does not include zone maps, but similar results can be
achieved by using other (user-defined) index types and/or partitioning.
• Clustered Base Tables (CBT) – In Netezza, CBTs are most commonly used
for fact tables with billions of records. Scanning such a huge table requires
a lot of processing time, as a full table scan might be needed to retrieve the
relevant records. Organizing records via a restrictive CBT allows Netezza
to group records in the same or nearby extents; this process also
creates zone maps that improve performance by reducing the
amount of data to be scanned.
In Azure Synapse a similar effect can be achieved by use of partitioning
and/or use of other indexes.
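As a minimal sketch (the fact table and partition boundary values are hypothetical), a date-organized CBT might be approximated in Azure Synapse with a partitioned clustered columnstore table:

    CREATE TABLE dbo.fact_orders
    ( order_id   BIGINT NOT NULL,
      order_date DATE   NOT NULL,
      amount     DECIMAL(18,2) )
    WITH
    ( DISTRIBUTION = HASH(order_id),
      CLUSTERED COLUMNSTORE INDEX,
      -- Date partitioning limits scans to the relevant partitions, broadly
      -- analogous to zone map elimination on an organized column:
      PARTITION ( order_date RANGE RIGHT FOR VALUES
                  ('2018-01-01', '2019-01-01', '2020-01-01') ) );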
Netezza data type mapping

Netezza data type        Azure Synapse data type
NUMERIC(p,s)             NUMERIC(p,s)
REAL                     REAL
SMALLINT                 SMALLINT
ST_GEOMETRY(n)           Spatial data types such as ST_GEOMETRY are not
                         currently supported in Azure Synapse, but the data
                         could be stored as VARCHAR or VARBINARY
TIME                     TIME
TIME WITH TIME ZONE      DATETIMEOFFSET
TIMESTAMP                DATETIME
There are 3rd party vendors who offer tools and services to automate migration
including the mapping of data types as described above. Also, if a 3rd party ETL tool
such as Informatica or Talend is already in use in the Netezza environment, these can
implement any required data transformations.
SQL DML syntax differences

There are some differences in SQL Data Manipulation Language (DML) syntax
between Netezza SQL and Azure Synapse T-SQL which need to be taken into
account during migration. For example, the position of a substring within a string
is found via the CHARINDEX function in Azure Synapse:

SELECT CHARINDEX('def', 'abcdef')...
• AGE – Netezza supports the AGE operator to give the interval between two
temporal values (timestamps, dates, etc.) – e.g.

SELECT AGE('23-03-1956', '01-01-2019') FROM...

This can be achieved in Azure Synapse by using DATEDIFF (note also the date
representation sequence):

SELECT DATEDIFF(day, '1956-03-23', '2019-01-01') FROM...
Functions, stored procedures and sequences

It may be that there are facilities in the Azure environment that replace the
functionality implemented as functions or stored procedures in the Netezza
environment – in which case it is generally more efficient to use the built-in Azure
facilities rather than re-coding the Netezza functions.
3rd party vendors offer tools and services that can automate the migration of these –
see, for example, the Attunity or WhereScape migration products.
Functions
In common with most database products, Netezza supports system functions
and also user-defined functions within the SQL implementation. When
migrating to another database platform such as Azure Synapse, common
system functions are generally available and can be migrated without
change. Some system functions may have slightly different syntax, but the
required changes can be automated in this case.
Stored procedures
Most modern database products allow for procedures to be stored within
the database – in Netezza’s case the NZPLSQL language is provided for this
purpose. NZPLSQL is based on Postgres PL/pgSQL. A stored procedure
typically contains SQL statements and some procedural logic and may return
data or a status.
Azure Synapse also supports stored procedures using T-SQL, so any
stored procedures to be migrated must be recoded accordingly.
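As a minimal illustration of the target form (the procedure, table, and column names are hypothetical), a simple procedure recoded from NZPLSQL into T-SQL might look like this:

    CREATE PROCEDURE dbo.usp_refresh_sales_stats
    AS
    BEGIN
        -- SQL statements plus procedural logic, as in the NZPLSQL original:
        UPDATE STATISTICS dbo.fact_sales;
        UPDATE dbo.etl_audit
        SET last_stats_refresh = GETDATE()
        WHERE table_name = 'fact_sales';
    END;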
Sequences
In Netezza, a sequence is a named database object created via CREATE
SEQUENCE that provides unique values via the NEXT VALUE FOR
method. These can be used to generate unique numbers for use as
surrogate key values in primary keys.
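One way to achieve the equivalent in Azure Synapse (a sketch with hypothetical names) is an IDENTITY column; note that IDENTITY values in a distributed table are unique but not guaranteed to be sequential:

    CREATE TABLE dbo.dim_customer
    ( customer_sk   INT IDENTITY(1,1) NOT NULL,   -- surrogate key
      customer_code VARCHAR(20)       NOT NULL )
    WITH ( DISTRIBUTION = HASH(customer_code), CLUSTERED COLUMNSTORE INDEX );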
Extracting metadata and data from a Netezza environment

Data Definition Language (DDL) generation

It is possible to edit existing Netezza CREATE TABLE and CREATE VIEW scripts to
create the equivalent definitions (with modified data types if necessary as described
above) – typically this involves removing or modifying any extra Netezza-specific
clauses (e.g. ORGANIZE ON).
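As an illustration of the kind of edit involved (table and column names are hypothetical), a Netezza definition and its edited Azure Synapse equivalent might look like this:

    -- Netezza source DDL:
    CREATE TABLE fact_sales
    ( sale_id   BIGINT,
      sale_date DATE,
      amount    NUMERIC(18,2) )
    DISTRIBUTE ON (sale_id)
    ORGANIZE ON (sale_date);

    -- Edited for Azure Synapse, with the Netezza-specific clauses replaced:
    CREATE TABLE dbo.fact_sales
    ( sale_id   BIGINT,
      sale_date DATE,
      amount    DECIMAL(18,2) )
    WITH ( DISTRIBUTION = HASH(sale_id), CLUSTERED COLUMNSTORE INDEX );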
However, all the information that specifies the current definitions of tables and views
within the existing Netezza environment is maintained within system catalog tables –
this is the best source of this information as it is bound to be up to date and
complete. (Be aware that user-maintained documentation may not be in sync with
the current table definitions).
This information can be accessed via utilities such as nz_ddl_table and can be used to
generate the CREATE TABLE DDL statements which can then be edited for the
equivalent tables in Azure Synapse.
3rd party migration and ETL tools also use the catalog information to achieve the
same result.
Data extraction from Netezza

If sufficient network bandwidth exists, data can be extracted directly from an on-
premises Netezza system into Azure Synapse tables or Azure Blob Storage by
using Azure Data Factory processes or 3rd party data migration or ETL products.
Recommended data formats for the extracted data are delimited text files (also
called Comma Separated Values or CSV files), Optimized Row Columnar (ORC)
files, or Parquet files.
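Once extracted files have landed in Azure storage, they can be loaded with the COPY statement – a sketch, assuming hypothetical storage account, container, and table names (authentication options such as CREDENTIAL are omitted for brevity):

    COPY INTO dbo.fact_sales
    FROM 'https://myaccount.blob.core.windows.net/extracts/fact_sales/*.parquet'
    WITH ( FILE_TYPE = 'PARQUET' );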
For more detailed information on the process of migrating data and ETL from a
Netezza environment see the associated document ‘Section 2.1. Data Migration ETL
and Load from Netezza’.
Performance recommendations for Netezza migrations

Similarities in performance tuning approach concepts

Many performance tuning concepts carry over directly from a Netezza environment,
including:
• Using data distribution to co-locate data to be joined onto the same processing
node
• Using the smallest data type for a given column will save storage space and
accelerate query processing
• Ensuring data types of columns to be joined are identical will optimize join
processing by reducing the need to transform data for matching
• Ensuring statistics are up to date will help the optimizer produce the best
execution plan
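As a minimal sketch of these recommendations in DDL form (the table names and types are hypothetical, not a definitive design):

    -- Distribute both tables on the join key to co-locate joined rows,
    -- using identical, minimal data types for that key:
    CREATE TABLE dbo.fact_sales
    ( product_sk INT NOT NULL,
      sale_date  DATE NOT NULL,
      amount     DECIMAL(18,2) )
    WITH ( DISTRIBUTION = HASH(product_sk), CLUSTERED COLUMNSTORE INDEX );

    CREATE TABLE dbo.dim_product
    ( product_sk   INT NOT NULL,
      product_name VARCHAR(100) )
    WITH ( DISTRIBUTION = HASH(product_sk), CLUSTERED COLUMNSTORE INDEX );

    -- Keep statistics up to date so the optimizer has accurate information:
    CREATE STATISTICS stat_fact_product_sk ON dbo.fact_sales (product_sk);
    UPDATE STATISTICS dbo.dim_product;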
Differences in performance tuning approach

Data indexing
Azure Synapse provides a number of user-definable indexing options, but
these are different in operation and usage from the system-managed zone
maps in Netezza. Understand the different indexing options as described in
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-index.
Existing system managed zone maps within the source Netezza environment
can however provide a useful indication of how the data is currently used
and provide an indication of candidate columns for indexing within the Azure
Synapse environment.
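For illustration (hypothetical objects; the appropriate index choice depends on the workload), a column that carried a Netezza zone map and is frequently used in range filters might be given a secondary index in Azure Synapse:

    CREATE TABLE dbo.dim_date
    ( date_sk       INT  NOT NULL,
      calendar_date DATE NOT NULL )
    WITH ( DISTRIBUTION = REPLICATE, CLUSTERED INDEX (date_sk) );

    -- Secondary index on a frequently filtered column:
    CREATE INDEX ix_dim_date_calendar ON dbo.dim_date (calendar_date);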
Data partitioning
In an enterprise data warehouse fact tables can contain many billions of rows
and partitioning is a way to optimize the maintenance and querying of these
tables by splitting them into separate parts to reduce the amount of data
processed. The partitioning specification for a table is defined in the CREATE
TABLE statement.
Only one field per table can be used for partitioning, and this is frequently a
date field as many queries will be filtered by date or a date range. Note that
it is possible to change the partitioning of a table after initial load, if
necessary, by recreating the table with the new partitioning scheme using the
CREATE TABLE AS (or CTAS) statement. See https://docs.microsoft.com/en-
us/azure/sql-data-warehouse/sql-data-warehouse-tables-partition for a
detailed discussion of partitioning in Azure Synapse.
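As a sketch of that repartitioning pattern (the table names, distribution key, and boundary values are hypothetical):

    -- Recreate the table with the new partitioning scheme via CTAS:
    CREATE TABLE dbo.fact_sales_new
    WITH
    ( DISTRIBUTION = HASH(sale_id),
      CLUSTERED COLUMNSTORE INDEX,
      PARTITION ( sale_date RANGE RIGHT FOR VALUES
                  ('2019-01-01', '2020-01-01', '2021-01-01') ) )
    AS SELECT * FROM dbo.fact_sales;

    -- Swap the tables once the new copy is verified:
    RENAME OBJECT dbo.fact_sales TO fact_sales_old;
    RENAME OBJECT dbo.fact_sales_new TO fact_sales;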