SAP BO Data Services Transforms – Basics Part 1

Transforms:
- Query
- Case
- Merge
- Row_Generation
- Key_Generation
- Date_Generation
- Effective_Date
- Table_Comparison
- Hierarchy_Flattening
- History_Preserving
- Pivot
- Reverse Pivot
- Map_Operation
- Validation
- SQL
- XML_Map
- Data_Transfer
- Text Data Processing
 
 
1. Query Transform:
- Query Transform is similar to a SQL SELECT statement.
- It can perform the following operations:
  - Choose (filter) the data to extract from sources
  - Join data from multiple sources
  - Map columns from input to output schemas
  - Perform transformations and functions on the data
  - Add new columns, nested schemas, and function results to the output schema
  - Assign primary keys to output columns
- Different functions, such as LOOKUP, aggregate, and conversion functions, can be used within the Query transform.
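As a rough SQL illustration (the EMP and DEPT tables and the column names here are assumptions, not part of the original example), the operations a Query transform performs map onto an ordinary SELECT:

select e.empno        as EMP_ID,      -- column mapping / rename
       upper(e.ename) as EMP_NAME,    -- function applied to a column
       d.dname        as DEPT_NAME    -- column taken from a joined source
from emp e
inner join dept d on d.deptno = e.deptno   -- join of two sources
where e.sal > 1000;                        -- filter on the extracted data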
 
                    

 
 
                       

 
 
 
2. Merge Transform:
- Merge Transform combines the rows from two or more sources into a single target.
- The output schema is the same as the schema of the input source objects.
- All the sources must have:
  - the same number of columns
  - the same column data types
  - the same column names
- The transform does not strip out duplicate rows.
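In SQL terms, the Merge transform behaves roughly like a UNION ALL of its inputs (the table names below are assumptions used only for this sketch):

select empno, ename, deptno from emp_region_a
union all                                   -- duplicates are kept, matching Merge behaviour
select empno, ename, deptno from emp_region_b;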
 
                   

 
                      

3. Case Transform:
- The Case transform is used to route rows coming from the source to two or more targets based on the given conditions.
- Data inputs
  - Only one data flow source is allowed.
- Data outputs
  - Connect the output of the Case transform to another object in the workspace and choose a case label from the pop-up menu. Each label represents a case expression (WHERE clause) created in the Case editor.
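Conceptually, each case label routes the rows that satisfy its expression, much as if the single source were split with several WHERE clauses. A rough sketch (the EMP table and DEPTNO conditions are assumptions):

-- rows routed to the target behind one case label
select * from emp where deptno = 10;
-- rows routed to the target behind another case label
select * from emp where deptno = 20;
-- rows matching no label can be routed to a default output if that option is enabled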
 
 
             

 
 
              

          
Source:

                
 
Target:
        
 

                    
           

                       
 

                         
 
4. Row Generation:
- This transform does not need an input.
- It generates a column filled with integer values starting at zero and incrementing by one up to the end value you specify.
- You can set the starting row number as per your requirement.
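For intuition only, an Oracle query such as the following produces a similar sequence of integers; it is merely an illustration, not what BODS executes internally:

select level - 1 as DI_ROW_ID   -- 0, 1, 2, ... up to the row count you specify
from dual
connect by level <= 10;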
 
 
                       

 
                          

 
5. Effective_Date:
- Generates an additional effective-to date column based on the primary key's effective date.
- The input data must therefore contain an effective date column.
- Effective_Date allows you to indicate changes to information over time. This can be used when implementing SCD Type 2.
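A rough SQL analogue of what the transform derives, assuming a history table EMP_HIST keyed on EMPNO with an EFF_DATE column (the names are illustrative only):

select empno,
       eff_date,
       lead(eff_date) over (partition by empno order by eff_date) as eff_to_date
from emp_hist;   -- the latest row per key gets NULL here; the transform instead
                 -- fills in its configured default effective-to date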
 
 
                           
 
 
                               

 
6. Pivot Transform:
- The Pivot transform creates a new row for each value in the columns that we identify as pivot columns.
- It can rearrange the data into a simpler and more manageable form, with all data in a single column, without losing the category information.
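In Oracle SQL terms the effect resembles an UNPIVOT that turns quarterly columns into rows (the SALES table and its Q1..Q4 columns are assumptions for this sketch):

select product, quarter, amount
from sales
unpivot (amount for quarter in (q1 as 'Q1', q2 as 'Q2', q3 as 'Q3', q4 as 'Q4'));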
 
 
 
                   

                      
 
 
 
7. Reverse Pivot Transform:
- The Reverse Pivot transform creates a single row of data from several existing rows.
- It allows us to combine data from several rows into a single row by creating new columns.
- It can rearrange the data into a more searchable form without losing the category information.
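Conversely, Reverse Pivot resembles an Oracle PIVOT that turns those rows back into columns (again, the table and column names are assumptions):

select *
from (select product, quarter, amount from sales_rows)
pivot (sum(amount) for quarter in ('Q1' as q1, 'Q2' as q2, 'Q3' as q3, 'Q4' as q4));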
 
                   

 
                

How to use SQL Transform in SAP BODS?


 
SQL Transformation:
This transform is used to perform a SQL operation to retrieve data from one or more tables.
When do we use it: sometimes it is not possible, or too complex, to design the ETL for complex business logic using the standard transforms.
It is very easy to build; all we need is the SQL SELECT statement.
Create a new PROJECT and a JOB as shown in the screen below.

Bring the SQL Transform into Data flow.


Paste the below simple SQL code inside the transform.

select e.empno, e.ename, e.sal, d.dname
from emp e, dept d
where d.loc in ('NEW YORK', 'DALLAS')
  and e.deptno = d.deptno
  and e.empno in (select e.empno
                  from emp e
                  where e.job in ('MANAGER', 'ANALYST')
                    and e.comm is null)
order by d.loc asc;

Select the Data Store Name and Database type

Keep the other options as shown in the screen.


Click on Update schema
 
Change the data types of the columns if required.
 
 
In this case, EMPNO shows its data type as decimal, but we can change it to int.
 
Use a Query transform and a template table as the destination.

 
SAVE the work.
 
Run the JOB.
 
Target table's data after the JOB run.
 

Date Generation Transformation:


 
This transform helps us create dates incremented according to our requirement.
 
When do we use it: to create time dimension fields and tables.

Create a new PROJECT and a JOB as shown in the screen below.

Bring the Date_Generation Transform into Data flow.

Set the options as shown below.

 
Start Date : 2000.01.01
 
End Date : 2100.12.31
 
Increment: Daily
 
 
 
Bring the Query_Transform and map the columns as shown.

 
Map the columns as shown below:
 
DATE:    Date_Generation.DI_GENERATED_DATE
WEEK:    week_in_year(Date_Generation.DI_GENERATED_DATE)
MONTH:   month(Date_Generation.DI_GENERATED_DATE)
QUARTER: quarter(Date_Generation.DI_GENERATED_DATE)
YEAR:    year(Date_Generation.DI_GENERATED_DATE)
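For comparison only, a similar time dimension could be produced directly in Oracle SQL; this is purely illustrative and not what the Date_Generation transform executes:

select d                            as dt,
       to_number(to_char(d, 'WW'))  as week_in_year,
       extract(month from d)        as month,
       to_number(to_char(d, 'Q'))   as quarter,
       extract(year from d)         as year
from (select date '2000-01-01' + level - 1 as d
      from dual
      connect by date '2000-01-01' + level - 1 <= date '2100-12-31');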
 

I have used the following optimization techniques while working on projects.
 
I am sharing the problems I faced and their solutions, which will help you understand the issues and the optimization techniques used for them.
 
1) Table Comparison options:
 
The performance of a Table_Comparison transform can be improved by caching the comparison table. There are three comparison modes:
 
- Row-by-row select
- Cached comparison table
- Sorted input
 

Basically, we use Table_Comparison for target-based change data capture (CDC).
It captures rows that are present in the source but not in the target, as well as rows that have changed in the source compared to the target.
There are many methods available for source-based and target-based CDC.
For Oracle databases, see: https://docs.oracle.com/cd/B28359_01/server.111/b28313/cdc.htm

1.1 Row-by-row select
 
We can choose this option in the following cases.
1.1.1 Option 1 - Normal operations
 
Select this option to have the transform look up the target table using SQL every time it receives an input row. This option is best if the target table is large compared to the number of rows the transform will receive as input. Make sure the appropriate indexes exist on the lookup columns in the target table for optimal performance.
 
1.1.2 Option 2 - When trailing blanks must be considered during comparison
 
During comparison, if a source field value has trailing blanks, BODS treats it as a different value; i.e., BODS does not apply any trim function to that particular field.
Example:
Consider an account number '1234 ' (i.e., with a trailing space) in the source and '1234' in the target. In row-by-row mode, BODS considers the two account numbers to be different, whereas in cached comparison mode it considers them to be the same. The latter mode is explained below.
 
1.2 Cached comparison
 
Select this option to load the comparison table into memory. In this case, queries to the comparison table access memory rather than the actual table; however, the table must fit in the available memory. This option is best when the table fits into memory and you are comparing the entire target table.
With this option, BODS captures the target data into an internal cache, which is much faster to access.
1.3 Sorted input
 
Often the most efficient solution when dealing with large data sources, because DS reads the comparison table only once. This option can only be selected when it is guaranteed that the incoming data are sorted in exactly the same order as the primary key of the comparison table. In most cases the incoming data must be pre-sorted, e.g. using a Query transform with an Order By (which may be pushed down to the underlying database), to take advantage of this functionality, as sketched below.
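For example, pre-sorting with a Query transform typically results in an ORDER BY on the comparison key being pushed down to the source database, roughly like the following (table and column names are assumed for illustration):

select account_no, stat
from source_table
order by account_no;   -- must match the primary key order of the comparison table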
 
2) Auto-correct load option:
Auto-correct load is used to avoid loading duplicate data into the target table in SAP BODS.
Basically, auto-correct load is used to implement SCD 1 when we are not using the table comparison feature of SAP BODS. There are many options available with auto-correct load that can help optimize performance.
 
If you do not choose any auto-correct load option, BODS generates a simple INSERT query.
 
The snapshot and the query generated by BODS are shown below:

Query (BODS generates an INSERT query, so it may insert duplicate data):
INSERT /*+ APPEND */ INTO "DS_1"."TEST_AUTO_CORRECT_LOAD"
( "ACCOUNT_NO" , "STAT" )
SELECT "TABLE1"."ACCOUNT_NO" ACCOUNT_NO , "TABLE1"."STAT" STAT
FROM "DS_2"."TABLE1" "TABLE1";
 
Many options are available with BODS auto-correct load. Some of them are explained in the sections below.
2.1 Allow merge set to Yes & Ignore columns with null set to No

With auto-correct load, if you set 'Allow merge' to Yes and 'Ignore columns with null' to No, BODS generates a MERGE query to maintain SCD 1; NULL values coming from the source simply overwrite the corresponding target values.
By choosing 'Allow merge', BODS inserts the new rows coming from the source into the target and updates the rows that already exist in the target.

 
Snapshot for same is as below:
 
Query generated by BODS job is as follows:
 
MERGE INTO "DS_1"."TEST_AUTO_CORRECT_LOAD" s
USING
(SELECT  "TABLE_1"."ACCOUNT_NO"  ACCOUNT_NO ,  "TABLE_1"."STAT"  STAT
FROM "PSEUDO_PROD"."TABLE_1" "TABLE_1"
)n
ON ((s.ACCOUNT_NO = n.ACCOUNT_NO))
WHEN MATCHED THEN
UPDATE SET s."STAT" = n.STAT
WHEN NOT MATCHED THEN
INSERT  /*+ APPEND */ (s."ACCOUNT_NO", s."STAT" )
VALUES (n.ACCOUNT_NO , n.STAT)
 
Here, the query generated by BODS is a MERGE query that inserts the new rows coming from the source and updates the existing rows present in the target.
This query is pushed down to the database, so it is an optimized one.
 
2.2 Allow merge set to Yes & Ignore columns with null set to Yes
Query generated by BODS job is as follows:
 
MERGE INTO "DS_1"."TEST_AUTO_CORRECT_LOAD" s
USING
(SELECT  "TABLE_1"."ACCOUNT_NO"  ACCOUNT_NO ,  "TABLE_1"."STAT"  STAT
FROM "PSEUDO_PROD"."TABLE_1" "TABLE_1"
)n
ON ((s.ACCOUNT_NO = n.ACCOUNT_NO))
WHEN MATCHED THEN
UPDATE SET s."STAT" = NVL(n.STAT,S."STAT")
WHEN NOT MATCHED THEN
INSERT  /*+ APPEND */ (s."ACCOUNT_NO", s."STAT" )
VALUES (n.ACCOUNT_NO , n.STAT)
 
The snapshot for the same is as follows:

As seen in the snapshot above, BODS adds the NVL function so that NULL values coming from the source do not overwrite the existing target values.

2.3 Allow merge set to No while Auto correct load set to Yes

If you set 'Allow merge' to No and 'Auto correct load' to Yes, BODS generates PL/SQL code, which can again be very helpful when considering performance.
Below is the code generated by BODS:

BEGIN
DECLARE
CURSOR s_cursor IS
SELECT  "TABLE_1"."ACCOUNT_NO"  ACCOUNT_NO ,  "TABLE_1"."STAT"  STAT
FROM "PSEUDO_PROD"."TABLE_1" "TABLE_1"
;
s_row   s_cursor%ROWTYPE;
CURSOR t_cursor(p_ACCOUNT_NO s_row.ACCOUNT_NO%TYPE) IS
SELECT "ACCOUNT_NO" ACCOUNT_NO, "STAT" STAT, rowid
FROM "DS_1"."TEST_AUTO_CORRECT_LOAD"
WHERE (p_ACCOUNT_NO = "ACCOUNT_NO");
t_row   t_cursor%ROWTYPE;
commit_count NUMBER;
BEGIN
commit_count := 0;
:processed_row_count := 0;
FOR r_reader IN
(SELECT  "TABLE_1"."ACCOUNT_NO"  ACCOUNT_NO ,  "TABLE_1"."STAT"  STAT
FROM "PSEUDO_PROD"."TABLE_1" "TABLE_1"
) LOOP
OPEN t_cursor(r_reader.ACCOUNT_NO);
FETCH t_cursor INTO t_row;
IF t_cursor%NOTFOUND THEN
INSERT INTO "DS_1"."TEST_AUTO_CORRECT_LOAD"("ACCOUNT_NO", "STAT" )
VALUES (r_reader.ACCOUNT_NO , r_reader.STAT);
commit_count := commit_count + 1;
:processed_row_count := :processed_row_count + 1;
ELSE
LOOP
UPDATE "DS_1"."TEST_AUTO_CORRECT_LOAD" SET
"STAT" = r_reader.STAT
WHERE rowid = t_row.rowid;
commit_count := commit_count + 1;
:processed_row_count := :processed_row_count + SQL%ROWCOUNT;
IF (commit_count = 1000) THEN
COMMIT; commit_count := 0;
END IF;
FETCH t_cursor INTO t_row;
EXIT WHEN t_cursor%NOTFOUND;
END LOOP;
END IF;
CLOSE t_cursor;
IF (commit_count = 0) THEN
COMMIT; commit_count := 0;
END IF;
END LOOP;
COMMIT;
END;
END;
 
 
Snapshot for same is as below.
 
 
3) Array fetch & Rows per commit options
 
3.1 Array fetch size:
 
The array fetch feature lowers the number of database requests by "fetching" multiple rows (an array) of data with each request. The number of rows to be fetched per request is entered in the Array fetch size option on any source table editor or SQL transform editor. The default setting is 1000, which means that with each database request, the software automatically fetches 1000 rows of data from the source database. The maximum array fetch size that can be specified is 5000 rows.
 
 
 
Suggestion for using the Array fetch size option:
 
The optimal number for Array fetch size depends on the size of your table rows (the number and type of columns involved) as well as the network round-trip time involved in the database requests and responses. If your computing environment is very powerful (meaning the computers running the Job Server, the related databases, and the connections are extremely fast), then try higher values for Array fetch size and test the performance of your jobs to find the best setting.
3.2 Rows per commit:
 
'Rows per commit' specifies the transaction size in number of rows. If set to 1000, Data Integrator sends a commit to the underlying database for every 1000 rows.
 
'Rows per commit' for regular loading defaults to 1000 rows. The Rows per commit value significantly affects job performance. Adjust the value in the target table editor's Options tab, noting the following rules:
- Do not use negative numbers or other non-numeric characters.
- If you enter nothing or 0, the text box automatically displays 1000.
- If you enter a number larger than 5000, the text box automatically displays 5000.
 
It is recommended that you set rows per commit between 500 and 2000 for best performance. You
might also want to calculate a value. To do this, use the following formula:
max_IO_size/row size (in bytes)
For most platforms, max_IO_size is 64K. For Solaris, max_IO_size is 1024K.
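For example (illustrative figures only): if max_IO_size is 64K (65,536 bytes) and the average row size is about 64 bytes, then 65,536 / 64 = 1,024, so a Rows per commit value of around 1000 would be a reasonable starting point.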
Note that even with a value greater than one set for Rows per commit, SAP Data Services will submit data one row at a time if either of the following conditions exists:
- You are loading into a database and have a column with a LONG datatype attribute (this scenario does not apply to Oracle databases).
- You are using an overflow file and the transaction failed. However, once all the rows are loaded successfully, the commit size reverts to the number you entered. In this case, depending on how often a load error happens, performance might become worse than setting Rows per commit to 1.
 
 

Let us consider a scenario with different Array fetch size and Rows per commit values.
 
Say the Array fetch size is set to 100, Rows per commit is set to 500, and the source has 800 rows in total. If the job terminates after processing 700 rows, only 500 rows are committed to the target, because Rows per commit is set to 500. The remaining 200 processed rows do not reach the target and have to be fetched again when the job is re-run. This can leave the target with incomplete data.
So, setting the Array fetch size and Rows per commit values appropriately is very important while designing a job.

4) Use the database link option in the datastore:

 
When fetching data from datastores connected to different databases, BODS may not be able to push the query down to the database; instead, it generates a separate query for each datastore.
To avoid such a situation, we can define a database link between the databases at the datastore level. After applying the DB link at the datastore level, BODS generates push-down SQL, which is the optimized form.
Refer to the scenario below.
 
 
4.1 Scenario :
 
I have considered the following tables and databases for this scenario.
Table Information:
1) ACCOUNT DIMENSION
2) BORG FLAT TABLE
Datastore & Database information:
1) DATASTORE: EDW_REP POINTING TO DATABASE TCLBI.
2) DATASTORE: PSEUDO_PROD POINTING TO DATABASE PSEUDO_PROD.
The account dimension table has been taken from the EDW_REP datastore, which points to the EDW_REP schema of the TCLBI database, whereas the other table, BORG_DTL, is from the PSEUDO_PROD database.
 
The job has been designed as shown in the image below:

 
 
As both tables are from different databases, BODS generates multiple SQL statements, which are shown in the snapshot below.
A DB link is also present between the two databases (the DB link is edwrep_to_pseudoprod). This link helps BODS generate push-down SQL, and thereby an optimized SQL statement.
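For reference, the database link itself is created on the Oracle side before it can be selected in the datastore; a minimal sketch (the user, password, and TNS alias below are placeholders, not values from this scenario) looks like:

create database link edwrep_to_pseudoprod
  connect to pseudo_prod_user identified by pseudo_prod_pwd
  using 'PSEUDO_PROD';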
In order to achieve this, the following advanced setting needs to be made in the BODS datastore:
Right-click the DS_EDW_REP datastore -> Edit -> Advanced -> scroll down to Linked Datastores.

Snapshot of same is attached below:

 
Click on Linked Datastores and you will get the following window:
 
Choose the datastore that needs to be linked with the other datastore; in this case, it is DS_1.
Press OK and provide the proper DB link name from the dropdown.

 
Press OK again. After saving the job, check the optimized SQL generated by BODS, which is as
follows.

INSERT /*+ APPEND */ INTO "DS_1"."TEST_DB_LINK_BODS" ( "KEY_1" , "APP_PRINCIPLE_AMT" , "LOAN_AVAILED_DATE" )
SELECT "BORG_DTL"."KEY_1" KEY_1 , "BORG_DTL"."APP_PRINCIPLE_AMT" APP_PRINCIPLE_AMT ,
       "ACCOUNT_DIM"."LOAN_AVAILED_DATE" LOAN_AVAILED_DATE
FROM "DS_1"."ACCOUNT_DIM" "ACCOUNT_DIM"
INNER JOIN "PSEUDO_PROD"."BORG_DTL"@DBLINK "BORG_DTL"
        ON ( "ACCOUNT_DIM"."ACCOUNT_NO" = "BORG_DTL"."KEY_1" )
 
 
It generates an INSERT INTO ... SELECT statement, which is optimized push-down SQL.
 
Similarly, more than one DB link may be added if required.
 
5) Use a server group for splitting data flows among job servers
 
In BODS, we can define multiple data flows in parallel within a single workflow. When a job is designed with parallel data flows, BODS starts all of them at the same time on the particular Job Server specified when executing the job.
Because of this, a single Job Server might experience heavy load. In addition, the resources of that Job Server are shared among the parallel data flows, which may degrade the performance of the BODS job.
To avoid this scenario, we can distribute the parallel data flows across a Job Server group containing multiple Job Servers.
 
5.1     What is a server group?
 
You can distribute the execution of a job or a part of a job across multiple Job Servers within a Server
Group to better balance resource-intensive operations. A server group automatically measures
resource availability on each Job Server in the group and distributes scheduled batch jobs to the Job
Server with the lightest load at runtime.
5.2     How to create/manage a server group
 
Go to the Management Console -> Server Groups -> All Server Groups -> Server Group Configuration -> click Add.

 
Add the Job Servers to the predefined server group and click Apply.

5.3     Distribution levels for data flow execution


 
When you execute a job, you can specify the following values for the Distribution level option:
- Job level - an entire job can execute on an available Job Server.
- Data flow level - each data flow within a job can execute on an available Job Server and can take advantage of additional memory (up to two gigabytes) for both in-memory and pageable cache on another computer.
- Sub data flow level - a resource-intensive operation (such as a sort, table comparison, or table lookup) within a data flow can execute on an available Job Server. Each operation can take advantage of up to two gigabytes of additional memory for both in-memory and pageable cache on another computer.
 
If a job is run using only a single Job Server, all the data flows start on that single Job Server, which may take more time to execute because a single Job Server has limited resources.
For such a job we can split the data flows across multiple Job Servers with the help of a Job Server group, as shown below.

[Note: In this example we will consider Data flow level distribution.]


I have created a sample job to test how the data flow split occurs.
 
 
 

In 'Job server or Server Group', select the server group.


In 'Distribution level', select the level (i.e., Job, Data flow, or Sub data flow) at which the splitting should occur.
After executing the job, the Management Console appears as below.
