ETL


1) What is ETL?

In data warehousing architecture, ETL is an important component that manages the data for any business process. ETL stands for Extract, Transform and Load. Extract reads data from a database, Transform converts the data into a format appropriate for reporting and analysis, and Load writes the data into the target database.
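
As a rough illustration (not tied to any specific ETL tool), a minimal extract-transform-load flow might look like the Python sketch below; the file path, table and column names are hypothetical.

import sqlite3
import csv

def extract(csv_path):
    # Extract: read raw rows from a source file (hypothetical path)
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: convert data into a shape suitable for reporting and analysis
    return [
        {"customer_id": int(r["customer_id"]),
         "amount": round(float(r["amount"]), 2)}
        for r in rows
        if r.get("amount")          # drop rows with a missing measure
    ]

def load(rows, db_path="warehouse.db"):
    # Load: write the transformed rows into the target database
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id INTEGER, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (:customer_id, :amount)", rows)
    con.commit()
    con.close()

load(transform(extract("sales_feed.csv")))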

2) Explain what ETL testing includes?

ETL testing includes

 Verify whether the data is transformed correctly according to business requirements
 Verify that the projected data is loaded into the data warehouse without any truncation or data loss
 Make sure that the ETL application reports invalid data and replaces it with default values
 Make sure that data loads within the expected time frame to improve scalability and performance
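
A minimal sketch of such checks, assuming hypothetical staging and warehouse tables in SQLite (the table and column names are made up):

import sqlite3

def basic_etl_checks(con, src="stg_sales", tgt="dw_sales"):
    # 1. No data loss: the target holds as many rows as the source
    src_n = con.execute(f"SELECT COUNT(*) FROM {src}").fetchone()[0]
    tgt_n = con.execute(f"SELECT COUNT(*) FROM {tgt}").fetchone()[0]
    assert tgt_n == src_n, f"row count mismatch: {src_n} source vs {tgt_n} target"

    # 2. No truncation: loaded text is never shorter than the source text
    #    (assumes a shared business key column named id)
    truncated = con.execute(
        f"""SELECT COUNT(*) FROM {src} s JOIN {tgt} t ON s.id = t.id
            WHERE LENGTH(t.customer_name) < LENGTH(s.customer_name)"""
    ).fetchone()[0]
    assert truncated == 0, "possible truncation detected"

    # 3. Invalid/missing values were replaced with defaults, not loaded as NULL
    nulls = con.execute(
        f"SELECT COUNT(*) FROM {tgt} WHERE customer_name IS NULL"
    ).fetchone()[0]
    assert nulls == 0, "NULLs found where default values were expected"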

3) Mention what are the types of data warehouse applications and what is the
difference between data mining and data warehousing?

The types of data warehouse applications are

 Info Processing
 Analytical Processing
 Data Mining

Data mining can be defined as the process of extracting hidden predictive information from large databases and interpreting the data, while data warehousing may make use of a data mine for analytical processing of the data in a faster way. Data warehousing is the process of aggregating data from multiple sources into one common repository.

4) What are the various tools used in ETL?

 Cognos Decision Stream
 Oracle Warehouse Builder
 Business Objects XI
 SAS Business Warehouse
 SAS Enterprise ETL Server

5) What is fact? What are the types of facts?

A fact is the central component of a multi-dimensional model and contains the measures to be analysed. Facts are related to dimensions.

Types of facts are

 Additive Facts
 Semi-additive Facts
 Non-additive Facts

6) Explain what are Cubes and OLAP Cubes?

Cubes are data processing units composed of fact tables and dimensions from the data warehouse. They provide multi-dimensional analysis.

OLAP stands for Online Analytical Processing, and an OLAP cube stores large data in multi-dimensional form for reporting purposes. It consists of facts, called measures, categorized by dimensions.

7) Explain what is tracing level and what are the types?

Tracing level is the amount of data stored in the log files. Tracing levels can be classified into two types, Normal and Verbose. The Normal level logs information in a detailed manner, while Verbose logs information at each and every row.

8) Explain what is Grain of Fact?

Grain of fact can be defined as the level at which the fact information is stored. It is also known as fact granularity.

9) Explain what a factless fact schema is and what Measures are?

A fact table without measures is known as a factless fact table. It can track the number of occurring events; for example, it can be used to record an event such as employee count in a company.

The numeric data based on columns in a fact table is known as Measures.

10) Explain what is transformation?

A transformation is a repository object that generates, modifies or passes data. Transformations are of two types, Active and Passive.

11) Explain the use of Lookup Transformation?

The Lookup Transformation is useful for

 Getting a related value from a table using a column value
 Updating slowly changing dimension tables
 Verifying whether records already exist in the table

12) Explain what is partitioning, hash partitioning and round robin partitioning?

To improve performance, transactions are subdivided; this is called partitioning. Partitioning enables the Informatica Server to create multiple connections to various sources.

The types of partitions are

Round-Robin Partitioning:

 Informatica distributes data evenly among all partitions
 This partitioning is applicable where the number of rows to process in each partition is approximately the same

Hash Partitioning:

 The Informatica server applies a hash function to the partitioning keys to group data among partitions
 It is used when you need to ensure that groups of rows with the same partitioning key are processed in the same partition
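
Outside Informatica, the two strategies can be illustrated with a small Python sketch; the row data and the partition count below are made up.

# Distribute rows across 4 partitions two ways: round-robin and by hash of a key.
NUM_PARTITIONS = 4
rows = [{"order_id": i, "customer_id": i % 7} for i in range(20)]  # dummy data

# Round-robin: rows are spread evenly, regardless of their content
round_robin = [[] for _ in range(NUM_PARTITIONS)]
for i, row in enumerate(rows):
    round_robin[i % NUM_PARTITIONS].append(row)

# Hash partitioning: all rows sharing a partitioning key land in the same partition
hash_parts = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    hash_parts[hash(row["customer_id"]) % NUM_PARTITIONS].append(row)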

13) Mention what is the advantage of using DataReader Destination Adapter?

The advantage of using the DataReader Destination Adapter is that it populates an ADO recordset (consisting of records and columns) in memory and exposes the data from the DataFlow task by implementing the DataReader interface, so that other applications can consume the data.

14) Using SSIS (SQL Server Integration Services), what are the possible ways to update a table?

To update a table using SSIS, the possible ways are:

 Use a SQL command
 Use a staging table
 Use a cache
 Use the Script Task
 Use the full database name for updating if MSSQL is used

15) In case you have a non-OLEDB (Object Linking and Embedding Database) source for the lookup, what would you do?

If you have a non-OLEDB source for the lookup, then you have to use a cache to load the data and use it as the source.

16) In what case do you use dynamic cache and static cache in connected and
unconnected transformations?

 Dynamic cache is used when you have to update a master table and slowly changing dimensions (SCD) type 1
 Static cache is used for flat files

17) Explain what are the differences between Unconnected and Connected lookup?

Connected Lookup:
 Participates in the mapping data flow
 Can return multiple values
 Can be connected to another transformation and returns a value
 Can use a static or dynamic cache
 Supports user-defined default values
 Multiple columns can be returned from the same row or inserted into the dynamic lookup cache

Unconnected Lookup:
 Is used when a lookup function is needed inside an expression transformation
 Returns only one output port
 Cannot be connected to another transformation
 Uses only a static cache
 Does not support user-defined default values
 Designates one return port and returns one column from each row

18) Explain what is data source view?

A data source view allows you to define the relational schema that will be used in the Analysis Services databases. Dimensions and cubes are created from data source views rather than directly from data source objects.

19) Explain what is the difference between OLAP tools and ETL tools?

The difference between an ETL tool and an OLAP tool is that an ETL tool is meant for extracting data from legacy systems and loading it into a specified database with some process of cleansing the data (e.g. Informatica, DataStage), while an OLAP tool is meant for reporting: data is available in a multidimensional model so that simple queries can extract data from the database (e.g. Business Objects, Cognos).

20) How can you extract SAP data using Informatica?

 With the PowerConnect option you can extract SAP data using Informatica
 Install and configure the PowerConnect tool
 Import the source into the Source Analyzer; PowerConnect acts as a gateway between Informatica and SAP. The next step is to generate the ABAP code for the mapping; only then can Informatica pull data from SAP
 PowerConnect is used to connect to and import sources from external systems

21) Mention what is the difference between Power Mart and Power Center?

Power Center:
 Designed to process huge volumes of data
 Supports ERP sources such as SAP, PeopleSoft etc.
 Supports both local and global repositories
 Can convert a local repository into a global repository

Power Mart:
 Designed to process low volumes of data
 Does not support ERP sources
 Supports only a local repository
 Has no option to convert a local repository into a global repository

22) Explain what a staging area is and what is the purpose of a staging area?

The data staging area is where you hold data temporarily on the data warehouse server. Data staging includes the following steps:

 Source data extraction and data transformation (restructuring)
 Data transformation (data cleansing, value transformation)
 Surrogate key assignments
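
For instance, the surrogate key assignment step can be sketched roughly as follows in Python; the natural key column and the simple counter are illustrative assumptions, not a specific tool's behaviour.

# Assign warehouse surrogate keys to incoming staged rows.
surrogate_map = {}          # natural key -> surrogate key already issued
next_key = 1

def assign_surrogate(natural_key):
    global next_key
    if natural_key not in surrogate_map:
        surrogate_map[natural_key] = next_key
        next_key += 1
    return surrogate_map[natural_key]

staged = [{"customer_code": c} for c in ("C100", "C200", "C100")]
for row in staged:
    row["customer_sk"] = assign_surrogate(row["customer_code"])
# C100 receives the same surrogate key both times it appears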

23) What is Bus Schema?

A BUS schema is used to identify the common dimensions across the various business processes. It consists of conformed dimensions along with a standardized definition of information.

24) Explain what is data purging?

Data purging is the process of deleting data from the data warehouse. It deletes junk data such as rows with null values or extra spaces.

25) Explain what are Schema Objects?

Schema objects are the logical structures that directly refer to the database's data. Schema objects include tables, views, sequences, synonyms, indexes, clusters, functions, packages and database links.

26) Explain these terms Session, Worklet, Mapplet and Workflow ?

 Mapplet: arranges or creates sets of transformations
 Worklet: represents a specific set of tasks
 Workflow: a set of instructions that tells the server how to execute the tasks
 Session: a set of parameters that tells the server how to move data from sources to targets

More questions (from interviews):

How do you check CDC (change data capture)?

How do you validate each and every record to check whether the values in the source and target are the same? (A rough sketch follows below.)

Questions on SQL will be asked.

If a particular ETL tool is listed, for example Informatica, questions on mapplets and workflows will be asked.

What is a DWH? Questions on schemas will be asked.

More real-time questions related to counts and query output will be asked, or a scenario will be given and you will need to write the query.
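
One hedged way to do the record-by-record source/target comparison mentioned above, assuming SQLite-style connections and made-up table names:

def compare_source_target(src_con, tgt_con,
                          src_table="orders_src", tgt_table="orders_dw"):
    # Assumes the business key is the first selected column in both tables.
    src = {r[0]: r for r in src_con.execute(f"SELECT * FROM {src_table}")}
    tgt = {r[0]: r for r in tgt_con.execute(f"SELECT * FROM {tgt_table}")}

    missing    = src.keys() - tgt.keys()    # extracted but never loaded
    unexpected = tgt.keys() - src.keys()    # loaded but absent in source
    mismatched = [k for k in src.keys() & tgt.keys() if src[k] != tgt[k]]
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}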

Question 1. What Is ETL?
Answer:
ETL stands for extraction, transformation and loading.
ETL provides developers with an interface for designing source-to-target mappings, transformations and job control parameters.

Extraction:
Takes data from an external source and moves it to the warehouse pre-processor database.

Transformation:
The transform data task allows point-to-point generating, modifying and transforming of data.

Loading:
The load data task adds records to a database table in the warehouse.

Question 2. What Is The Difference Between ETL Tools And OLAP Tools?
Answer:
An ETL tool is meant for extracting data from legacy systems and loading it into a specified database with some process of cleansing the data.
ex: Informatica, DataStage, etc.

OLAP is meant for reporting purposes; in OLAP, data is available in a multidimensional model, so that you can write simple queries to extract data from the database.
ex: Business Objects, Cognos, etc.

Question 4. What Is ODS (Operational Data Store)?
Answer:
o ODS - Operational Data Store.
o ODS comes between the staging area & the Data Warehouse. The data in the ODS will be at a low level of granularity.
o Once data is populated in the ODS, aggregated data will be loaded into the EDW through the ODS.

Question 5. Where Do We Use Connected And Unconnected Lookups?
Answer:
o If only one return port is needed, you can go for an unconnected lookup; more than one return port is not possible with an unconnected lookup. If more than one return port is needed, go for a connected lookup.
o If you require a dynamic cache, i.e. where your data will change dynamically, then you can go for a connected lookup. If your data is static and won't change when the session loads, you can go for an unconnected lookup.

Question 6. Where Do We Use Semi And Non Additive Facts?
Answer:
o Additive: a measure that can participate in arithmetic calculations using all or any dimensions. Ex: sales profit.
o Semi-additive: a measure that can participate in arithmetic calculations using only some dimensions. Ex: sales amount.
o Non-additive: a measure that cannot participate in arithmetic calculations using dimensions. Ex: temperature.

Question 7. What Are Non-additive Facts In Detail?
Answer:
o A fact may be a measure, a metric or a dollar value. Measure and metric are non-additive facts.
o Dollar value is an additive fact. If we want to find out the amount for a particular place for a particular period of time, we can add the dollar amounts and come up with the total amount.
o A non-additive fact, for example measured height(s) for 'citizens by geographical location': when we roll up 'city' data to 'state' level data we should not add the heights of the citizens; rather we may want to use it to derive a 'count' (see the sketch below).
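
To make the height example concrete, here is a small Python sketch with invented figures showing why heights are counted rather than summed when rolling city-level data up to state level:

citizens = [
    {"state": "KA", "city": "Bangalore", "height_cm": 172},
    {"state": "KA", "city": "Mysore",    "height_cm": 165},
    {"state": "MH", "city": "Mumbai",    "height_cm": 170},
]

state_rollup = {}
for c in citizens:
    s = state_rollup.setdefault(c["state"], {"citizen_count": 0})
    s["citizen_count"] += 1   # deriving a count is meaningful
    # summing height_cm across citizens would be meaningless, so we don't

print(state_rollup)   # {'KA': {'citizen_count': 2}, 'MH': {'citizen_count': 1}}
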
Question 8. What Is A Staging Area? Do We Need It? What Is The Purpose Of A
Staging Area?
Answer :
Data staging is actually a collection of processes used to prepare source system
data for loading a data warehouse. Staging includes the following steps:
o Source data extraction, Data transformation (restructuring),
o Data transformation (data cleansing, value transformations),
o Surrogate key assignments.
Question 9. What Is The Latest Version Of Power Center / Power Mart?
Answer:
The latest version is 7.2.

Question 10. What Are The Modules In Power Mart?
Answer:
o PowerMart Designer
o Server
o Server Manager
o Repository
o Repository Manager

Question 11. What Are Active Transformations / Passive Transformations?
Answer:
o An active transformation can change the number of rows that pass through it (decrease or increase the row count).
o A passive transformation cannot change the number of rows that pass through it (see the sketch below).
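
In generic terms (not Informatica-specific code), the distinction can be sketched like this, with made-up rows:

rows = [{"qty": q} for q in (5, 0, 12, 0, 3)]

# Active transformation: may change the number of rows passing through (here, a filter)
def filter_zero_qty(rows):
    return [r for r in rows if r["qty"] > 0]

# Passive transformation: row count stays the same (here, an expression/derivation)
def add_flag(rows):
    return [{**r, "is_bulk": r["qty"] >= 10} for r in rows]

assert len(filter_zero_qty(rows)) != len(rows)   # active changed the row count
assert len(add_flag(rows)) == len(rows)          # passive preserved it
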
Question 12. What Are The Different Lookup Methods Used In Informatica?
Answer:
Connected lookup:
A connected lookup receives input from the pipeline and sends output to the pipeline; it can return any number of values and does not contain a return port.
Unconnected lookup:
An unconnected lookup can return only one column; it contains a return port.

Question 14. How Do We Call Shell Scripts From Informatica?
Answer:
Specify the full path of the shell script in the "Post-session properties" of the session/workflow.

Question 16. What Is A Mapping, Session, Worklet, Workflow, Mapplet?
Answer :
o A mapping represents dataflow from sources to targets.
o A mapplet creates or configures a set of transformations.
o A workflow is a set of instructions that tell the Informatica server how to
execute the tasks.
o A worklet is an object that represents a set of tasks.
o A session is a set of instructions that describe how and when to move
data from sources to targets.

What Is Informatica Metadata And Where Is It Stored?
Answer:
Informatica metadata is data about data, and it is stored in Informatica repositories.

Question 21. How To Determine What Records To Extract?
Answer:
When addressing a table, some dimension key must reflect the need for a record to get extracted. Mostly it will be from the time dimension (e.g. date >= 1st of the current month) or a transaction flag (e.g. Order Invoiced Status). A foolproof approach would be adding an archive flag to the record, which gets reset when the record changes.

Question 22. What Is Full Load & Incremental Or Refresh Load?
Answer:
Full Load: completely erasing the contents of one or more tables and reloading them with fresh data.
Incremental Load: applying ongoing changes to one or more tables based on a predefined schedule.
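
A rough sketch of the two approaches in Python, assuming a hypothetical last_updated column and SQLite-style staging and warehouse tables:

import sqlite3

def full_load(con):
    # Full load: erase the target completely and reload everything
    con.execute("DELETE FROM dw_orders")
    con.execute("INSERT INTO dw_orders SELECT * FROM stg_orders")
    con.commit()

def incremental_load(con, last_run_ts):
    # Incremental load: apply only the changes made since the previous run
    con.execute(
        "INSERT INTO dw_orders SELECT * FROM stg_orders WHERE last_updated > ?",
        (last_run_ts,),
    )
    con.commit()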

Compare ETL & Manual Development?
Answer:
These are some differences between manual and ETL development.
ETL
o The process of extracting data from multiple sources (e.g. flat files, XML, COBOL, SAP etc.) is simpler with the help of tools.
o High and clear visibility of logic.
o Contains metadata, and changes can be made easily.
o Error handling, log summaries and load progress make life easier for the developer and the maintainer.
o Can handle historic data very well.
Manual
o Loading data other than flat files and Oracle tables needs more effort.
o Complex and not so user-friendly visibility of logic.
o No metadata concept, and changes need more effort.
o Needs maximum effort from a maintenance point of view.
o As data grows, the processing time degrades.

Question 28. What Is Cube Grouping?
Answer:
A transformer-built set of similar cubes is known as a cube grouping. Cube groupings are generally used in creating smaller cubes that are based on the data in a level of a dimension.

Question 29. What Is Data Warehousing?
Answer:
o A data warehouse can be considered as a storage area where relevant data is stored irrespective of the source.
o Data warehousing merges data from multiple sources into an easy and complete form.
Question 30. What Is Virtual Data Warehousing?
Answer:
A virtual data warehouse provides a collective view of the completed data. It can be considered a logical data model containing metadata.
Question 31. What Is Active Data Warehousing?
Answer:
An active data warehouse represents a single state of the business. It considers the analytic perspectives of customers and suppliers. It helps to deliver updated data through reports.

Question 32. What Is Data Modeling And Data Mining?
Answer:
Data Modeling is a technique used to define and analyze the requirements of data that support an organization's business processes. In simple terms, it is used for the analysis of data objects in order to identify the relationships among these data objects in any business.
Data Mining is a technique used to analyze datasets to derive useful insights/information. It is mainly used in retail, consumer goods, telecommunication and financial organizations that have a strong consumer orientation, in order to determine the impact on sales, customer satisfaction and profitability.

Question 33. What Are Critical Success Factors?
Answer:
Key areas of activity in which favorable results are necessary for a company to reach its goal.
There are four basic types of CSFs:
o Industry CSFs
o Strategy CSFs
o Environmental CSFs
o Temporal CSFs

Question 46. What Is The ETL Process? How Many Steps Does ETL Contain? Explain With An Example.
Answer:
ETL is the extraction, transformation and loading process: you extract data from the source, apply the business rules on it, and then load it into the target. The steps are:
o define the source (create the ODBC connection to the source DB)
o define the target (create the ODBC connection to the target DB)
o create the mapping (you apply the business rules here by adding transformations, and define how the data will flow from the source to the target)
o create the session (a set of instructions that runs the mapping)
o create the workflow (instructions that run the session)

Question 47. Give Some Popular Tools?
Answer:
Popular tools:
o IBM WebSphere Information Integration (Ascential DataStage)
o Ab Initio
o Informatica
o Talend

Question 48. Give Some ETL Tool Functionalities?
Answer:
While the selection of a database and a hardware platform is a must, the selection
of an ETL tool is highly recommended, but it's not a must. When you evaluate
ETL tools, it pays to look for the following characteristics:
o Functional capability: This includes both the 'transformation' piece
and the 'cleansing' piece. In general, the typical ETL tools are either
geared towards having strong transformation capabilities or having
strong cleansing capabilities, but they are seldom very strong in both.
As a result, if you know your data is going to be dirty coming in, make
sure your ETL tool has strong cleansing capabilities. If you know
there are going to be a lot of different data transformations, it then
makes sense to pick a tool that is strong in transformation.
o Ability to read directly from your data source: For each
organization, there is a different set of data sources. Make sure the
ETL tool you select can connect directly to your source data.
o Metadata support: The ETL tool plays a key role in your metadata
because it maps the source data to the destination, which is an
important piece of the metadata. In fact, some organizations have
come to rely on the documentation of their ETL tool as their metadata
source. As a result, it is very important to select an ETL tool that
works with your overall metadata strategy.

Question 52. What Are The Various Tools? - Name A Few.
Answer:
o Ab Initio
o DataStage
o Informatica
o Cognos Decision Stream
o Oracle Warehouse Builder
o Business Objects XI (Extreme Insight)
o SAP Business Warehouse
o SAS Enterprise ETL Server

Question 57. What Are The Different Versions Of Informatica?
Answer:
Here are some popular versions of Informatica:
o Informatica PowerCenter 4.1
o Informatica PowerCenter 5.1
o Informatica PowerCenter 6.1.2
o Informatica PowerCenter 7.1.2
o Informatica PowerCenter 8.1
o Informatica PowerCenter 8.5
o Informatica PowerCenter 8.6

Question 59. What Is The Difference Between Power Center & Power Mart?
Answer:
PowerCenter - has the ability to organize repositories into a data mart domain and share metadata across repositories.
PowerMart - only a local repository can be created.

Question 60. What Are Snapshots? What Are Materialized Views & Where Do We Use Them? What Is A Materialized View Log?
Answer:
Snapshots are read-only copies of a master table located on a remote node, which are periodically refreshed to reflect changes made to the master table. Snapshots are mirrors or replicas of tables.
Views are built using the columns from one or more tables. A single-table view can be updated, but a view over multiple tables cannot. A view can be updated/deleted/inserted into if it has only one base table; if the view is based on columns from more than one table, then insert, update and delete are not possible.
Materialized view:
A pre-computed table comprising aggregated or joined data from fact and possibly dimension tables. Also known as a summary or aggregate table.
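
As a rough, database-agnostic illustration of the "pre-computed summary table" idea (the fact rows below are invented):

# Build an aggregate (summary) table from detailed fact rows ahead of query time.
fact_sales = [
    {"product": "pen",  "region": "north", "amount": 10.0},
    {"product": "pen",  "region": "south", "amount": 4.5},
    {"product": "book", "region": "north", "amount": 30.0},
]

summary = {}
for row in fact_sales:
    key = (row["product"], row["region"])
    summary[key] = summary.get(key, 0.0) + row["amount"]
# Queries can now read `summary` instead of re-aggregating fact_sales each time,
# which is the role a materialized view / aggregate table plays in the warehouse.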
