
Running Databricks Migrations Code Analyzer

Detailed steps for Running Analyzer
1. PREREQUISITES
2. EXTRACT THE METADATA
3. RUN THE ANALYZER
   Informatica PowerCenter
   Informatica Cloud
   All SQL code and DBT SQL code (Redshift, Snowflake, Teradata, Oracle, SQL Server, Synapse, Greenplum, Netezza and .sql files)
   SSIS
4. INTERPRETING THE RESULTS
Appendix
   Exporting XMLs out of ETL tools
      DataStage
      Informatica PowerCenter XML export
      For Informatica Cloud (IICS)
      For SQL-Based Systems (Snowflake, Teradata, Netezza, Oracle, Synapse, SQL Server, Greenplum, Vertica, Presto, any DB, etc.)
      Azure Synapse (Dedicated)
      Azure Synapse (Serverless)
      SSIS
   How is complexity calculated in the analyzer?
      SQL Code Analysis
      Informatica Code Analysis
      DataStage Analysis
      Talend Analysis
      SSIS Code Analysis
      Alteryx Code Analysis
      BODS Code Analysis
      SAS Code Analysis
      Pentaho Code Analysis
   Splitter Instructions
Detailed steps for Running Analyzer
Contact Databricks PS if you need help running it.
The analyzer can be executed on Windows, macOS, and Linux.

1. PREREQUISITES

IMPORTANT NOTE:
Running the Code Analyzer (and SQL Splitter) requires the prerequisites below, so make sure to work through this section before continuing.

If you have downloaded the Mac version, please pay special attention to section C below.

A) Analyzer Package Download and Integrity Checks

1) Download the Code Analyzer package using the SAS URL shared with you by email.

2) The Analyzer package is a zip file which contains the Code Analyzer and SQL Splitter binaries (both Linux and Windows versions).

3) If you would like to verify the integrity of the downloaded zip file, please let your Databricks representative know and we can provide the necessary information.

B) Share Folder for Sending Back the Generated Code Analysis Report(s)

1. Running the Code Analyzer (steps are in the subsequent sections) generates analysis reports in spreadsheet format (xlsx). Do not share these reports with the Databricks team by email.

2. Instead, use a share folder (with proper access control) to upload the generated report(s) and give access to your Databricks technical point of contact.
   a. You will receive a shared-folder URL in the same email as the Code Analyzer download link.
   b. If your organization has an existing secured file-sharing process, use it to share the generated report files. Otherwise, you can use the same shared folder (from step a) to upload the xlsx files.
C) Mac version requires an extra step to allow running the analyzer (and additional steps if your Mac has an M3 or newer chip)

1. After downloading and unzipping the analyzer/splitter zip file, navigate in the Mac terminal to the location of the unzipped analyzer contents (you may want to place the analyzer in a folder outside of "Downloads").
2. Perform an "ls -l" and you will notice that the unzipped "analyzer" file is not executable. Change this by running "chmod +x analyzer".
3. Re-run "ls -l" to verify you now see something like this:

-rwxr-xr-x@ 1 user.name staff 12345678 Jan 30 12:50 analyzer

4. Run "./analyzer", which will bring up a pop-up window saying the tool developer is not verified (this is expected). DO NOT move it to trash; click Cancel.
5. Navigate to System Settings -> Privacy and Security, scroll down to the entry for the blocked analyzer, and click "Open Anyway".
6. Another window might open asking you to confirm that you want to open it; click Open.
7. Re-run "./analyzer" to verify that you can now run the analyzer tool from the command line. If it works, it will respond with the usage output showing which flags/arguments are available/required by the tool.
8. If you have an M3, M3 Pro, or M3 Max, you also need to install Rosetta (see https://support.arduino.cc/hc/en-us/articles/7765785712156-Error-bad-CPU-type-in-executable-on-macOS).
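For reference, the Terminal side of these steps looks roughly like this (a minimal sketch; the path is a placeholder, and installing Rosetta via softwareupdate is one common approach, as described in the linked article):

cd ~/path/to/unzipped/analyzer    # placeholder: wherever you unzipped the package
chmod +x analyzer                 # make the binary executable
ls -l analyzer                    # should now show -rwxr-xr-x permissions
./analyzer                        # first run triggers the "developer not verified" prompt

# Apple Silicon Macs that need it (e.g. the M3 family) can install Rosetta 2 from the command line:
softwareupdate --install-rosetta --agree-to-license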

2. EXTRACT THE METADATA

All the major ETL platforms provide some kind of export of their code repositories.
Typically this is done in XML or JSON formats that can be used to restore the
environment. Here is a short guide for how to export from the various environments:
DataStage
Informatica PowerCenter XML export
Informatica Cloud (IICS)
SQL-Based Systems
Azure Synapse (Dedicated)
Azure Synapse (Serverless)
Talend
SSIS

3. RUN THE ANALYZER

The analyzer can be executed on Windows, macOS, and Linux. Before running the analyzer
you might need to move all config files into the same directory as the tool itself. Config files
may be packaged in the zip download in a 'config' directory; simply copy those files into the
same directory as the 'analyzer' executable file.

Please confirm that general_sql_specs.json is in the same directory as the analyzer
executable file; it should be included in the download already. If other config files are
needed, please ask your Databricks representative to help acquire any other specific
config files needed for your source system.
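A minimal sketch of that preparation step (assuming the package was unzipped into a folder called analyzer-package; the folder name is illustrative):

cd analyzer-package
cp config/*.json .                     # copy packaged config files next to the analyzer executable
ls general_sql_specs.json analyzer     # verify both now sit in the same directory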
Sample commands to run the analyzer:

DataStage
analyzer -t DATASTAGE -d "<folder with ds xml files>" -r <path to xlsx report file>

Informatica PowerCenter
analyzer -t INFA -d "<folder with xml files>" -r <path to xlsx report file>

Informatica Cloud
analyzer -u ic2dws.json -t INFACLOUD -d "<folder with zip files>" -r <path to xlsx report file>

All SQL code and DBT SQL code
(e.g. Redshift, Snowflake, Teradata, Oracle, SQL Server, Synapse, Greenplum, Netezza, and .sql files)

Important Note
Analyzing SQL code requires the following steps for accurate analysis. Make sure to follow the instructions below.

● Typically, DDL statements (such as create table, create view, create procedure and create function) are extracted for analysis using various utilities/commands and are kept in a few large files. These files need to be split before running the analyzer.

● The DML statements, queries, and data load scripts, on the other hand, are maintained as part of the application code and are kept in a large number of small files. These files should not be split before running them through the analyzer.

● Follow the steps below to get your code ready for the Analyzer:

1) Keep the SQL DDL files and the rest of the SQL code in separate root folders. For example:
   ● sql_ddl → DDL files
   ● sql_other → all other SQL files

2) Run the sqlsplit program with sql_ddl as input. You need an empty folder for the output created by sqlsplit (let's call it analyzer_input/sql_ddl).

3) Also copy sql_other to analyzer_input/sql_other.

4) Now run the analyzer using analyzer_input as the input folder.

5) File extensions to be processed by the Analyzer: by default the analyzer looks for files with the .sql extension only, so it is important to tell the analyzer about ALL file extension types in your SQL code base (for example bteq, in the case of Teradata BTEQ scripts) using the -E input flag. The value for this parameter is a comma-separated list, and there is no need to include the "." (period).

Adjust this extension list (ksh,sh,bteq,sql) in the analyzer command below:

mkdir analyzer_input
mkdir analyzer_input/sql_ddl
cp -R sql_other analyzer_input/sql_other

sqlsplit -d sql_ddl -o analyzer_input/sql_ddl

analyzer -t SQL -E ksh,sh,bteq,sql -d analyzer_input -r analyzer_report_v1.xlsx -u general_sql_specs.json

SSIS

analyzer.exe -d "<folder with dtsx/ssis exported files>" -t SSIS -u ssis2dws.json -r analyzer_report_v1.xlsx

Note:
If a path contains spaces (say in a folder name), wrap it in double quotes,
e.g. "C:\Users\xyz\Downloads\analyzer-package\SQL Server"
(double quotes because the "SQL Server" directory name has a space); otherwise you will get an error.

4. INTERPRETING THE RESULTS


You can review how the analyzer calculates complexity for each source system in the Appendix section "How is complexity calculated in the analyzer?".
Appendix

Exporting XMLs out of ETL tools

DataStage
The easiest way to export metadata is through the GUI, one folder at a time. To do so, right-click on the folder to export and select the "Export" option. Please ensure that the XML format is specified for the export and that all the jobs within the folder are selected (they are by default).

Informatica PowerCenter XML export

Overview
To run the Analyzer or converters on Informatica XMLs, the XML files first need to be
extracted from the PowerCenter repository. Typically, it is easier to deal with the
conversion at a relatively granular level, so extracting the artifacts at the workflow level
is advisable.
● Metadata Extraction
To extract the metadata from the PowerCenter repository, use the following commands:

● Connect to the repository
pmrep connect <list of credentials>

● Get the list of folders
pmrep listobjects -o FOLDER

● For each folder, get the list of workflows
pmrep listobjects -o WORKFLOW -f <your folder name>

● Workflow extraction
Create a batch script with the following command template for each workflow in each folder (Excel can be used to generate the commands); a sketch of such a script follows below.
pmrep objectexport -n workflow_name -o WORKFLOW -f folder_name -b -r -m -s -u path-to-output-file

(Or do it manually by exporting the entire folder and saving it as XML.)
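A minimal sketch of such an export script (shell syntax; the connection flags, the workflows.txt list file, and the output paths are illustrative placeholders, not values from this document):

# Connect once to the repository (repository, domain, user, and password are placeholders)
pmrep connect -r MyRepo -d MyDomain -n repo_user -x repo_password

# workflows.txt contains one "folder_name workflow_name" pair per line
while read -r folder workflow; do
  pmrep objectexport -n "$workflow" -o WORKFLOW -f "$folder" \
    -b -r -m -s -u "exports/${folder}_${workflow}.xml"
done < workflows.txt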

For Informatica Cloud (IICS):

The following comes from this article: How to read metadata in Informatica Cloud (IICS)? - ThinkETL.

● Select all the Mapping Configuration tasks you want to read the metadata from and export them as a single file.
● Exporting a Mapping task also fetches the associated mapping.
● Make sure you select the check box to include all dependent assets.
● Next, click on My Import/Export Logs in the left pane, go to the Export tab, find the name under which you exported the code, and click Download.
● All the tasks and their dependencies are downloaded as a single zip file. In this example the file name will be IICS_Demo_Export.zip.

Talend
To export all jobs in bulk, right-click on Job Designs and select "Export Items". In the popup, select "Include All Dependencies".
Also see this link on the topic: Talend export and import a job - Stack Overflow.
Note: while Talend jobs can be exported as a single zip file, please unzip the file(s) before running the analyzer or any converter utilities. Both the analyzer and the converters look for .item and .properties files in non-zipped folders.

For SQL-Based Systems (Snowflake, Teradata, Netezza, Oracle, Synapse, SQL Server, Greenplum, Vertica, Presto, any DB, etc.)

(Get the code/scripts/stored procedures from the database into a folder.)

Typically, client environments make use of source code repositories such as Git, SVN, Perforce, and others. It is preferable to get the code from such a repository, potentially a combination of production branches and dev/qa branches, whichever makes sense. This is the preferred method of getting the code, as it is stored in its original form, unobstructed by any database-injected code snippets.
The same is true for general shell scripts and shell script wrappers with embedded SQL code.

If such a repository is not available, SQL-based objects such as procedures, UDFs, macros, and table and view DDLs can be extracted using native code export utilities. SQL scripts and BTEQ code that live outside of the database on a file system can be taken as is.
For example, in the case of Snowflake, you can use the statement below to extract DDLs. It extracts the definitions of schemas, tables, views, functions, stored procedures, tasks, etc. in that database. Please repeat the step for all production databases.
You will have to use the analyzer splitter option to split the DDLs automatically (see the Splitter Instructions section for how to use it).

select get_ddl('database','<database name>');
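One way to run that statement and capture clean output into a file is via the snowsql CLI (a sketch; the connection name, database name, and output file are placeholders):

snowsql -c my_connection \
  -q "select get_ddl('database', 'MY_PROD_DB');" \
  -o output_file=my_prod_db_ddl.sql \
  -o header=false -o timing=false -o friendly=false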

Please note that some SQL exporter utilities may create files with a single long line, with all
the statements appended on the same line. This would not be an acceptable import into
the analyzer.

Also, note that every database object (table/view/procedure/function/macro, etc.) should be exported into its own individual file. If that is not possible and the only way to export database code is into one large file, then the SQL Splitter utility that we provide should be run to split the large combined files into smaller individual files.

(Ask the Teradata/Oracle DBA to export the table DDLs, view DDLs, packages, stored procedures, functions, etc. to a folder, and then run the analyzer on it.)

Azure Synapse (Dedicated)


To extract metadata such as table, view, and stored procedure DDL, you can use Microsoft SQL Server Management Studio.

● The preferred way to export the DDLs is one file per database object, by selecting the "One script file per object" option in the Set Scripting Options step of the Generate Scripts wizard. In that step, select all required object types.
● If you already have all the DDL statements in a single file, the Analyzer package comes with a SQL splitter program which you can use to split one large file containing all DDL statements into individual files. This needs to be run before the analyzer command. See the "Run the Analyzer" section for SQL code.
Azure Synapse (Serverless)
To extract metadata such as table, view, and stored procedure DDL, you can use Microsoft SQL Server Management Studio.

For a serverless database, the "Generate Scripts" context-menu option is not available at the database level in the studio (as of version 19.1), so you need to use the "Object Explorer Details" view and select the required objects to export the corresponding DDL to a file:

● Switch to the Object Explorer Details view
● Export the external table DDLs
● Export the view DDLs

SSIS

● You'll need to export the DTSX packages. For details on how to obtain them, see: Save and Run Package (SQL Server Import and Export Wizard) - SQL Server Integration Services (SSIS) | Microsoft Learn.

ODI

● Exporting jobs in ODI is detailed in the Oracle documentation chapter "Exporting and Importing".

Alteryx

● The Analyzer needs the .yxmd files. These can be obtained by selecting File > Export to download your workflow to your local machine in .yxmd format.

SAP Business Objects Data Services

● Instructions for export can be found in the corresponding SAP Help Portal articles.
How is complexity calculated in the analyzer?

SQL Code Analysis


At the beginning of script analysis, mark the script with a complexity level of LOW.

If any of the following conditions are true, then mark the script as MEDIUM complexity:

1. At least one loop
2. Conventional Statement count greater than 10
3. Simple Statement count greater than 1000
4. Number of pivot statements between 1 and 3
5. Number of XML SQL statements between 1 and 3

If any of the following conditions are true, then mark the script as COMPLEX complexity:

1. Number of loops greater than 5
2. Conventional Statement count greater than 30
3. Simple Statement count greater than 2000
4. Number of pivot statements greater than 3
5. Number of XML SQL statements greater than 3

If any of the following conditions are true, then mark the script as VERY COMPLEX complexity:

1. Number of loops greater than 8
2. Conventional Statement count greater than 50
3. Simple Statement count greater than 5000
4. Number of pivot statements greater than 5
5. Number of XML SQL statements greater than 5

Simple Statement count is determined by regex patterns in the analyzer config file.

Conventional Statement count is determined by the formula below:

Conventional Statement count = Total Statement count - Simple Statement count

If the analyzer encounters a SQL procedure or function body inside a SQL file, it will categorize the script as "ETL".

Teradata MLOAD and FLOAD scripts follow the same rules as above.
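For illustration only, the escalation above can be read as a sequence of threshold checks, roughly as in this sketch (not the analyzer's actual implementation; the counts would come from the analyzer's own parsing):

# Hypothetical sketch of the SQL complexity rules. Argument order:
# loops, conventional statements, simple statements, pivot statements, XML SQL statements
classify_sql_script() {
  local loops=$1 conventional=$2 simple=$3 pivots=$4 xml=$5
  local level="LOW"
  if (( loops >= 1 || conventional > 10 || simple > 1000 || pivots >= 1 || xml >= 1 )); then
    level="MEDIUM"
  fi
  if (( loops > 5 || conventional > 30 || simple > 2000 || pivots > 3 || xml > 3 )); then
    level="COMPLEX"
  fi
  if (( loops > 8 || conventional > 50 || simple > 5000 || pivots > 5 || xml > 5 )); then
    level="VERY COMPLEX"
  fi
  echo "$level"
}

# Example: 2 loops, 12 conventional statements, 300 simple statements, no pivots or XML SQL
classify_sql_script 2 12 300 0 0    # prints MEDIUM
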
Informatica Code Analysis
At the beginning of mapping analysis, mark the mapping with a complexity level of LOW.

If any of the following conditions are true, then mark the mapping as MEDIUM complexity:

1. Number of expressions with 5+ function calls between 2 and 4
2. Number of sources > 1
3. Number of joins >= 1
4. Number of lookups between 4 and 6
5. Number of targets > 1
6. Overall function call count >= 10
7. Number of components (transformations) >= 10

If any of the following conditions are true, then mark the mapping as COMPLEX complexity:

1. Three MEDIUM breaks from the list above (i.e. at least three of the MEDIUM conditions are met)
2. Number of expressions with 5+ function calls between 5 and 7
3. Number of mapping components >= 20
4. Overall function call count >= 20
5. Complex or Unstructured nodes are being used (e.g. Normalizer)
6. Number of lookups between 7 and 14

If any of the following conditions are true, then mark the mapping as VERY COMPLEX complexity:

1. Three COMPLEX breaks from the list above
2. Number of expressions with 5+ function calls > 7
3. Number of lookups > 15
4. Number of mapping components >= 50
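As a rough illustration of the "breaks" counting (a hedged sketch, not the analyzer's actual code; the example counts are made up): a mapping that trips three or more of the MEDIUM rules is promoted to COMPLEX.

# Hypothetical example mapping: 3 sources, 2 joins, 5 lookups, 2 targets,
# 4 function calls overall, 8 components, 1 expression with 5+ function calls
sources=3; joins=2; lookups=5; targets=2; func_calls=4; components=8; big_exprs=1

medium_breaks=0
(( big_exprs >= 2 && big_exprs <= 4 )) && (( medium_breaks += 1 ))
(( sources > 1 ))                      && (( medium_breaks += 1 ))
(( joins >= 1 ))                       && (( medium_breaks += 1 ))
(( lookups >= 4 && lookups <= 6 ))     && (( medium_breaks += 1 ))
(( targets > 1 ))                      && (( medium_breaks += 1 ))
(( func_calls >= 10 ))                 && (( medium_breaks += 1 ))
(( components >= 10 ))                 && (( medium_breaks += 1 ))

level="LOW"
(( medium_breaks >= 1 )) && level="MEDIUM"
(( medium_breaks >= 3 )) && level="COMPLEX"
echo "$level"    # prints COMPLEX: sources, joins, lookups, and targets all break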

DataStage Analysis
At the beginning of job analysis, mark the job with a complexity level of LOW.

If any of the following conditions are true, then mark the job as MEDIUM complexity:

1. Number of expressions with 5+ function calls between 2 and 4
2. Number of sources > 1
3. Number of joins >= 1
4. Number of lookups between 4 and 6
5. Number of targets > 1
6. Overall function call count >= 10

If any of the following conditions are true, then mark the job as COMPLEX complexity:

1. Three MEDIUM breaks from the list above
2. Number of expressions with 5+ function calls between 5 and 7
3. Number of job components >= 20
4. Overall function call count >= 20
5. Complex or Unstructured nodes are being used (ChangeCapture, etc.)
6. Number of lookups between 7 and 14

If any of the following conditions are true, then mark the job as VERY COMPLEX complexity:

1. Three COMPLEX breaks from the list above
2. Number of expressions with 5+ function calls > 7
3. Number of lookups > 15
4. Number of job components >= 50

Talend Analysis
At the beginning of job analysis, mark the job with a complexity level of LOW.

If any of the following conditions are true, then mark the job as MEDIUM complexity:

1. Number of expressions with 5+ function calls between 2 and 4
2. Number of sources > 1
3. Number of joins >= 1
4. Number of job components >= 10
5. Number of targets > 1
6. Overall function call count >= 10

If any of the following conditions are true, then mark the job as COMPLEX complexity:

1. Three MEDIUM breaks from the list above
2. Number of expressions with 5+ function calls between 5 and 7
3. Number of job components >= 20
4. Overall function call count >= 20
5. Complex or Unstructured nodes are being used (ChangeCapture, etc.)

If any of the following conditions are true, then mark the job as VERY COMPLEX complexity:

1. Three COMPLEX breaks from the list above
2. Number of job components >= 50

SSIS Code Analysis
At the beginning of package analysis, mark the package with a complexity level of LOW.

If any of the following conditions are true, then mark the package as MEDIUM complexity:

1. Number of expressions with 5+ function calls between 2 and 4
2. Number of sources > 1
3. Number of targets > 1
4. Overall function call count >= 10
5. Number of package components >= 10

If any of the following conditions are true, then mark the package as COMPLEX complexity:

1. Three MEDIUM breaks from the list above
2. Number of expressions with 5+ function calls between 5 and 7
3. Number of package components >= 20
4. Overall function call count >= 20

If any of the following conditions are true, then mark the package as VERY COMPLEX complexity:

1. Three COMPLEX breaks from the list above
2. Number of expressions with 5+ function calls > 7
3. Number of package components >= 50

Alteryx Code Analysis


At the beginning of package analysis, mark the package with a complexity level of LOW.

If any of the following conditions are true, then mark the package as MEDIUM complexity:

1. Number of expressions with 5+ function calls between 2 and 4
2. Overall function call count >= 10
3. Number of package components >= 10

If any of the following conditions are true, then mark the package as COMPLEX complexity:

1. Three MEDIUM breaks from the list above
2. Number of expressions with 5+ function calls between 5 and 7
3. Number of package components >= 20
4. Overall function call count >= 20

If any of the following conditions are true, then mark the package as VERY COMPLEX complexity:

1. Three COMPLEX breaks from the list above
2. Number of expressions with 5+ function calls > 7
3. Number of package components >= 50

BODS Code Analysis


At the beginning of job analysis, mark the job with a complexity level of LOW.

If any of the following conditions are true, then mark the job as MEDIUM complexity:

1. Number of expressions with 5+ function calls between 2 and 4
2. Overall function call count >= 10
3. Number of job components >= 10

If any of the following conditions are true, then mark the job as COMPLEX complexity:

1. Three MEDIUM breaks from the list above
2. Number of expressions with 5+ function calls between 5 and 7
3. Number of job components >= 20
4. Overall function call count >= 20

If any of the following conditions are true, then mark the job as VERY COMPLEX complexity:

1. Three COMPLEX breaks from the list above
2. Number of expressions with 5+ function calls > 7
3. Number of job components >= 50

SAS Code Analysis


At the beginning of SAS script analysis, mark the script with a complexity level of LOW.

If any of the following conditions are true, then mark the script as MEDIUM complexity:

1. Macro definition count > 3
2. Data block count > 5
3. Number of statements inside macros and data blocks > 50
4. Conditional statement count > 10
5. 'DO' loop count > 3
6. Count of SQL Procs categorized as MEDIUM > 0
7. SQL Proc count > 10

If any of the following conditions are true, then mark the script as COMPLEX:

1. Macro definition count > 7
2. Data block count > 15
3. Number of statements inside macros and data blocks > 100
4. Conditional statement count > 20
5. 'DO' loop count > 10
6. Count of SQL Procs categorized as COMPLEX > 0
7. SQL Proc count > 20

If any of the following conditions are true, then mark the script as VERY COMPLEX:

1. Macro definition count > 15
2. Data block count > 25
3. Number of statements inside macros and data blocks > 150
4. Conditional statement count > 50
5. 'DO' loop count > 20
6. Count of SQL Procs categorized as VERY COMPLEX > 0
7. SQL Proc count > 40

Pentaho Code Analysis


At the beginning of mapping analysis, mark the mapping with a complexity level of LOW.

If any of the following conditions are true, then mark the mapping as MEDIUM complexity:

1. Number of expressions with 5+ function calls between 2 and 4
2. Number of sources > 1
3. Number of joins >= 1
4. Number of lookups between 4 and 6
5. Number of targets > 1
6. Overall function call count >= 10
7. Number of components (transformations) >= 10

If any of the following conditions are true, then mark the mapping as COMPLEX complexity:

1. Three MEDIUM breaks from the list above
2. Number of expressions with 5+ function calls between 5 and 7
3. Number of mapping components >= 20
4. Overall function call count >= 20
5. Complex or Unstructured nodes are being used (e.g. Normalizer)
6. Number of lookups between 7 and 14

If any of the following conditions are true, then mark the mapping as VERY COMPLEX complexity:

1. Three COMPLEX breaks from the list above
2. Number of expressions with 5+ function calls > 7
3. Number of lookups > 15
4. Number of mapping components >= 50

Splitter Instructions:
Purpose - Splits large SQL files with multiple objects into individual .sql files

sqlsplit
-h this message
######## OPTIONS ########

-i input file OR comma-separated list of files
   OR
-d input folder

-o output folder

[-s plug in newline after semicolon]
[-E extensions. Default is sql]
[-t trim lines from both sides]
[-b remove square brackets]
[-P do not add package variables to procedures and functions]
[-G custom object separator pattern]
[-v verbose mode]
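As a usage example (the input file name is illustrative; the flags are from the option list above):

# Split one large exported DDL file into individual .sql files,
# adding a newline after each semicolon (-s) and trimming lines (-t)
sqlsplit -i all_ddl_export.sql -o analyzer_input/sql_ddl -s -t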
