DataStage Interview Question
Both DataStage and Informatica are powerful ETL tools. They do almost exactly the same thing in almost exactly the same
way. Performance, maintainability, and learning curve are all comparable. Below are a few things I would like to
highlight regarding both tools.
Multiple Partitions
Informatica offers dynamic partitioning, which is applied by default at the workflow level rather than at each stage/object level in a
mapping/job. Informatica offers other partitioning choices as well at the workflow level.
DataStage's pipeline partitioning splits the data into multiple partitions, which are processed and then re-collected within DataStage.
DataStage lets you control a job design based on the logic of the processing instead of defaulting the whole pipeline flow to one
partition type. DataStage offers 7 different types of multi-processing partitions.
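The partition, process, re-collect idea can be sketched in plain Python. This is only an illustration under stated assumptions, not how DataStage is implemented: hash partitioning is used as one example scheme, the partitions are processed one after another rather than in parallel, and the names hash_partition, process, collect, customer_id and amount are all hypothetical.

```python
from typing import Dict, Iterable, List
from zlib import crc32

Row = Dict[str, int]


def hash_partition(rows: Iterable[Row], key: str, num_partitions: int) -> List[List[Row]]:
    """Send every row to a partition chosen from a stable hash of its key value."""
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is used instead of hash() so the assignment is stable across runs.
        index = crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[index].append(row)
    return partitions


def process(partition: List[Row]) -> List[Row]:
    """Hypothetical per-partition work: double the amount of each row."""
    return [{**row, "amount": row["amount"] * 2} for row in partition]


def collect(partitions: Iterable[List[Row]]) -> List[Row]:
    """Re-collect the processed partitions into a single stream."""
    return [row for partition in partitions for row in partition]


# Hypothetical input: 20 rows spread over 5 customer ids.
rows = [{"customer_id": i % 5, "amount": i} for i in range(20)]
partitions = hash_partition(rows, "customer_id", 3)
result = collect(process(p) for p in partitions)
print(len(result))
```

Because the partition is chosen from the key, all rows with the same customer_id land in the same partition, which is what key-based operations such as aggregation need.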
User Interface
Informatica supports development and monitoring through its 4 GUIs: PowerCenter Designer, Repository Manager, Workflow
Manager, and Workflow Monitor.
DataStage supports development and monitoring of its jobs through 3 GUIs: IBM DataStage Designer (for development), Job Sequence
Designer (workflow design), and Director (for monitoring).
Version Control
Informatica offers instant version control through its repository server, managed with the “Repository Manager” GUI console. A mapping
with work in progress cannot be opened until it is saved and checked back into the repository. Version control is done using check-in
and check-out.
In DataStage, version control was offered as a component up to Ascential DataStage 7.5.x. After IBM acquired Ascential and
integrated DataStage into IBM Information Server at version 8.0.1, support for version control as a
component was discontinued.
Data Encryption
Informatica has an offering within PowerCenter Designer as a separate transformation called “Data Masking Transformation”.
Variety of Transformations
Informatica offers about 30 general transformations for processing incoming data.
DataStage offers about 40 data transformation stages/objects. DataStage is a more powerful transformation engine thanks to its functions
(Oconv and Iconv) and routines. We can do almost any transformation.
DataStage lets you drag and drop a piece of functionality, i.e. a stage, within one canvas area for a pipeline source-target job. Within
“DataStage Designer”, both source and target metadata must be imported before proceeding with the variety of stages offered, such as
database stages, transformation stages, etc.
The biggest difference between both the vendor offerings in this area is Informatica forces you to be organized through a step-by-step
design process, while DataStage leaves the organization as a choice and gives you flexibility in dragging and dropping objects based
on the logic flow.
Checking Dependencies
Informatica offers a separate Advanced edition that helps with data lineage and impact analysis. We can go to individual
targets and sources and check all the dependencies on them.
In DataStage, dependency or impact analysis is available in Designer by right-clicking on a job.
Components Used
The Informatica ETL transformations are very specific-purpose, so you tend to need more boxes on the page to do the same thing. E.g.,
a simple transform in Informatica would have a Source Table, Source Qualifier, Lookup, Router, 2 Update Strategies, and 2 Target
Tables (9 boxes).
In DataStage, you would have a Table and Hashed File for the lookup, plus a Source Relational Stage, Transformation Stage, and 2
links to a target Relational Stage (5 boxes). The visual clutter in Informatica is a bit annoying.
Type of link
To link two components in Informatica, you have to link at the column level. We have to connect each and every column between the two
components.
In DataStage, you link at the component level, and then map individual columns. This allows you to have coding templates that are all
linked up: just add columns. I find this a big advantage in DS.
Reusability
Informatica offers ease of re-usability through Mapplets and Worklets for re-using mappings and workflows. This really improves
performance.
DataStage offers re-usability of a job through containers (local and shared). To re-use a Job Sequence (workflow), you will need to make
a copy, compile and run.
Heterogeneous Sources
In Informatica we can use both heterogeneous and homogeneous sources.
DataStage does not perform very well with heterogeneous sources. You might end up extracting data from all the sources, putting
them into a hashed file, and then starting your transformation.
Informatica supports Full History, Recent Values, and Current & Previous Values using SCD wizards.
DataStage supports these only through custom scripts and does not have a wizard to do this.
Informatica's marvellous Dynamic Cache Lookup has no equivalent in DS Server Edition. It saves some effort and is very
easily maintainable.
http://shortcut-tricks.blogspot.com/2016/04/difference-between-informatica-and.html
Datastage Scenario Based Questions and Answers for Freshers and Experienced
1. Create a job to load the first 3 records from a flat file into a target table?
2. Create a job to load the last 3 records from a flat file into a target table?
3. Create a job to load the first record from a flat file into one table A, the last record from a flat file into table B and the remaining
records into table C?
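The row-selection logic behind these three questions can be sketched in Python. This is a minimal illustration of the logic only, not a DataStage job design; the function names and the sample rows are hypothetical, and treating a one-record file as "first record only" is an assumption.

```python
from typing import List, Sequence, Tuple


def first_n(records: Sequence[str], n: int = 3) -> List[str]:
    """Return the first n records (fewer if the input is shorter)."""
    return list(records[:n])


def last_n(records: Sequence[str], n: int = 3) -> List[str]:
    """Return the last n records (fewer if the input is shorter)."""
    return list(records[-n:]) if n > 0 else []


def split_first_last_rest(records: Sequence[str]) -> Tuple[List[str], List[str], List[str]]:
    """Route the first record to A, the last record to B and the rest to C."""
    if not records:
        return [], [], []
    if len(records) == 1:
        # Assumption: a single record counts as the first record only.
        return [records[0]], [], []
    return [records[0]], [records[-1]], list(records[1:-1])


rows = ["r1", "r2", "r3", "r4", "r5"]  # hypothetical flat-file lines
print(first_n(rows))
print(last_n(rows))
print(split_first_last_rest(rows))
```

Finding the last records requires knowing the total row count (or reading the file to the end), which is the part that usually needs extra care in a streaming ETL job.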
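Consider the following products data, which contains duplicate rows, as the source:
A
B
C
C
B
D
B
Q1. Create a job to load all unique products into one table and the duplicate rows into another table.
The first table should contain:
A
D
The second table should contain:
B
B
B
C
C
Q2. Create a job to load each product once into one table and the remaining duplicated products into another table.
The first table should contain:
A
B
C
D
The second table should contain:
B
B
C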
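The two duplicate-handling questions differ only in how a repeated product is routed, which a short Python sketch makes concrete. It illustrates the logic only, not a DataStage design; the function names are hypothetical and the rows are kept in source order rather than sorted.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def split_unique_and_duplicates(products: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Q1: products that occur exactly once go to table 1, every other row to table 2."""
    counts = Counter(products)
    unique = [p for p in products if counts[p] == 1]
    duplicated = [p for p in products if counts[p] > 1]
    return unique, duplicated


def split_first_occurrence(products: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Q2: the first occurrence of each product goes to table 1, repeats to table 2."""
    seen = set()
    first, repeats = [], []
    for product in products:
        if product in seen:
            repeats.append(product)
        else:
            seen.add(product)
            first.append(product)
    return first, repeats


source = ["A", "B", "C", "C", "B", "D", "B"]
print(split_unique_and_duplicates(source))
print(split_first_occurrence(source))
```

Q1 needs the full count per product before any row can be routed, while Q2 can be decided row by row once the data is grouped or a "seen" set is kept.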
employee_id, salary
-------------------
10, 1000
20, 2000
30, 3000
40, 5000
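Q1. Create a job to load the cumulative sum of salaries of employees into the target table.
The target table data should look like this:
employee_id, salary, cumulative_sum
-----------------------------------
10, 1000, 1000
20, 2000, 3000
30, 3000, 6000
40, 5000, 11000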
Q2. Create a job to get the previous row's salary for the current row. If no previous row exists for the current row, then the
previous row salary should be displayed as null.
Q3. Create a job to get the next row's salary for the current row. If there is no next row for the current row, then the next row salary
should be displayed as null.
Q4. Create a job to find the sum of salaries of all employees and this sum should repeat for all the rows.
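The four salary questions are running-total and row-offset calculations. The Python sketch below shows the expected results for the source data above; it illustrates the logic only, not a DataStage job, and the function names are hypothetical. Python's None stands in for null.

```python
from itertools import accumulate
from typing import List, Optional, Tuple

Row = Tuple[int, int]  # (employee_id, salary)

# Source rows from the question, in load order.
employees: List[Row] = [(10, 1000), (20, 2000), (30, 3000), (40, 5000)]


def cumulative_salaries(rows: List[Row]) -> List[Tuple[int, int, int]]:
    """Q1: add a running total of salary to every row."""
    running = accumulate(salary for _, salary in rows)
    return [(emp, sal, total) for (emp, sal), total in zip(rows, running)]


def previous_salaries(rows: List[Row]) -> List[Tuple[int, int, Optional[int]]]:
    """Q2: add the previous row's salary, None for the first row."""
    shifted: List[Optional[int]] = [None] + [sal for _, sal in rows][:-1]
    return [(emp, sal, prev) for (emp, sal), prev in zip(rows, shifted)]


def next_salaries(rows: List[Row]) -> List[Tuple[int, int, Optional[int]]]:
    """Q3: add the next row's salary, None for the last row."""
    shifted: List[Optional[int]] = [sal for _, sal in rows][1:] + [None]
    return [(emp, sal, nxt) for (emp, sal), nxt in zip(rows, shifted)]


def with_total_salary(rows: List[Row]) -> List[Tuple[int, int, int]]:
    """Q4: repeat the grand total of all salaries on every row."""
    total = sum(sal for _, sal in rows)
    return [(emp, sal, total) for emp, sal in rows]


for line in cumulative_salaries(employees):
    print(line)
```

Q1 and Q2 can be computed in a single forward pass, whereas Q3 and Q4 need information from rows that have not been read yet, so they require a look-ahead or a second pass over the data.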
department_no, employee_name
----------------------------
20, R
10, A
10, D
20, P
10, B
10, C
20, Q
20, S
Q1. Create a job to load a target table with the following values from the above source?
department_no, employee_list
--------------------------------
10, A
10, A,B
10, A,B,C
10, A,B,C,D
20, A,B,C,D,P
20, A,B,C,D,P,Q
20, A,B,C,D,P,Q,R
20, A,B,C,D,P,Q,R,S
Q2. Create a job to load a target table with the following values from the above source?
department_no, employee_list
----------------------------
10, A
10, A,B
10, A,B,C
10, A,B,C,D
20, P
20, P,Q
20, P,Q,R
20, P,Q,R,S
Q3. Create a job to load a target table with the following values from the above source?
department_no, employee_names
-----------------------------
10, A,B,C,D
20, P,Q,R,S
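The three employee-list targets above differ in when the running list of names is reset. A Python sketch of that logic follows; it is an illustration only, not a DataStage design, the function names are hypothetical, and it assumes the rows are first sorted by department and then by name, as the expected outputs suggest.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (department_no, employee_name)

source: List[Row] = [
    (20, "R"), (10, "A"), (10, "D"), (20, "P"),
    (10, "B"), (10, "C"), (20, "Q"), (20, "S"),
]


def running_lists(rows: List[Row], reset_per_department: bool) -> List[Tuple[int, str]]:
    """Q1 (no reset) and Q2 (reset when the department changes)."""
    output: List[Tuple[int, str]] = []
    current: List[str] = []
    last_department = None
    for department, name in sorted(rows):
        if reset_per_department and department != last_department:
            current = []
        current.append(name)
        last_department = department
        output.append((department, ",".join(current)))
    return output


def names_per_department(rows: List[Row]) -> List[Tuple[int, str]]:
    """Q3: one row per department with all of its names."""
    grouped: Dict[int, List[str]] = {}
    for department, name in sorted(rows):
        grouped.setdefault(department, []).append(name)
    return [(department, ",".join(names)) for department, names in grouped.items()]


for line in running_lists(source, reset_per_department=True):
    print(line)
```

The sort step matters: without it, the running list would follow the arrival order of the source rows instead of the alphabetical order shown in the targets.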
Product_id, product_type
------------------------
10, video
10, Audio
20, Audio
30, Audio
40, Audio
50, Audio
10, Movie
20, Movie
30, Movie
40, Movie
50, Movie
60, Movie
Assume that only 3 product types are available in the source. The source contains 12 records and you don't know how many
products are available in each product type.
Q1. Create a job to select 9 products in such a way that 3 products should be selected from video, 3 products should be selected from
Audio and the remaining 3 products should be selected from Movie.
Q2. In the above problem Q1, if the number of products in a particular product type is less than 3, then you won't get the total of 9
records in the target table. For example, see the video type in the source data. Now design a job in such a way that even if the
number of products in a particular product type is less than 3, the missing records are taken from the other product
types. For example: if the number of video products is 1, then the remaining 2 records should come from Audio or Movie. So,
the total number of records in the target table should always be 9.
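Both product-selection questions can be sketched with a per-type counter plus a list of leftover rows used for backfilling. The Python below illustrates the logic only and is not a DataStage design; pick_products and its parameters are hypothetical names, product types are compared case-insensitively, and taking the backfill rows in source order is an assumption, since the question allows them to come from either remaining type.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (product_id, product_type)

source: List[Row] = [
    (10, "video"),
    (10, "Audio"), (20, "Audio"), (30, "Audio"), (40, "Audio"), (50, "Audio"),
    (10, "Movie"), (20, "Movie"), (30, "Movie"), (40, "Movie"), (50, "Movie"), (60, "Movie"),
]


def pick_products(rows: List[Row], per_type: int = 3, total: int = 9,
                  backfill: bool = False) -> List[Row]:
    """Take up to per_type rows of each type; optionally top up to total rows."""
    taken: Dict[str, int] = {}
    selected: List[Row] = []
    leftovers: List[Row] = []
    for row in rows:
        product_type = row[1].lower()
        if taken.get(product_type, 0) < per_type:
            taken[product_type] = taken.get(product_type, 0) + 1
            selected.append(row)
        else:
            leftovers.append(row)
    if backfill:
        # Q2: fill the shortfall from rows that were skipped in the first pass.
        shortfall = max(0, total - len(selected))
        selected.extend(leftovers[:shortfall])
    return selected


print(len(pick_products(source)))                 # Q1 behaviour
print(len(pick_products(source, backfill=True)))  # Q2 behaviour
```

With the sample data, the first call returns fewer than 9 rows because only one video product exists, and the second call tops the result up from the skipped Audio rows.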
Col
---
a
b
c
d
e
f
id, value
---------
10, a
10, b
10, c
20, d
20, e
20, f
http://shortcut-tricks.blogspot.com/2016/04/datastage-scenario-based-questions-and.html
Answer / kiran
Answer / madhava
Parallel jobs:
1. Run on the parallel engine.
2. Support pipeline and partition parallelism.
3. Are compiled into OSH.
Server jobs:
1. Run on the server engine.
Answer / yarramasu
https://www.allinterview.com/showanswers/33491/what-is-exact-difference-between-parallel-jobs-and-server-jobs.html
Pipelining and partitioning: server jobs do not support them; parallel jobs do.
Parallel jobs support both massively parallel processing (MPP) and symmetric multiprocessing (SMP).
Answer / bharath
SERVER JOBS
->Run on a single node
->Execute on the DS server engine
->Handle smaller volumes of data
->Slower data processing
->Fewer components (i.e., palette stages)
->Compiled into BASIC
PARALLEL JOBS
->Run on multiple nodes
->Execute on the DS parallel engine
->Handle huge volumes of data
->Faster data processing
->More components
->Compiled into OSH (Orchestrate Shell script), except the Transformer (C++ and OSH)
Answer / poonam
https://www.allinterview.com/showanswers/33495/what-is-exact-difference-between-parallel-jobs-and-server-jobs.html
https://www.allinterview.com/company/1000/ibm/interview-questions/177/data-stage.html