DataStage PPT

1. ETL (Extraction, Transformation, and Loading) is usually a batch process that handles large volumes of data from heterogeneous sources to load into data warehouses, marts, and analytical applications.
2. DataStage is an ETL tool that provides a graphical interface for designing data flows to extract, transform, and load data. It utilizes stages connected by links to represent these processes.
3. The Designer component allows creating and editing DataStage jobs, while the Director is used for scheduling, monitoring, and running jobs on the DataStage server.


ETL Basics

Extraction, Transformation & Load

Usually a batch process of large volumes of data.

Scenarios
- Load a warehouse, mart, analytical and reporting applications
- Application/Data Integration
- Load packaged applications, or external systems through their APIs or interface databases
- Data Migration

Extract
- Heterogeneous data sources: relational & non-relational databases; sequential flat files, complex flat files, COBOL files, VSAM data, XML data, etc.; packaged applications (e.g. SAP, Siebel, etc.)
- Incremental/changed data or complete/snapshot data
- Internal data or third-party data
- Push/Pull

Transform
- Cleansing & validation
  - Simple: range checks, duplicate checks, NULL value transforms, etc.
  - Specialized/Complex: name & address validations, de-duplication, etc.
- Computations (arithmetic, string, date, etc.)
- Pivot
- Split or Concatenate
- Aggregate
- Filter
- Join, look-up

Load
- Historical vs. Refresh load
- Incremental vs. Snapshot
- Bulk Loading vs. Record-level Loading
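The transform operations above are easiest to picture as per-row functions. As an illustration only (plain Python, not DataStage syntax; the column names quantity, unit_price, and region are invented for the example), a minimal sketch of a range check, a NULL value transform, and a derived column:

```python
# Illustrative only (plain Python, not DataStage syntax): a minimal sketch of the
# kind of row-level cleansing and computation an ETL transform step applies.
from datetime import date

def transform(row):
    """Validate and enrich one input record (a dict); return None to reject it."""
    if not 0 <= row["quantity"] <= 10000:                   # range check
        return None
    row["region"] = row.get("region") or "UNKNOWN"          # NULL value transform
    row["total"] = round(row["quantity"] * row["unit_price"], 2)  # simple computation
    row["load_date"] = date.today().isoformat()             # date function
    return row

rows = [{"quantity": 3, "unit_price": 9.99, "region": None},
        {"quantity": -1, "unit_price": 5.00, "region": "EU"}]
cleansed = [r for r in (transform(dict(r)) for r in rows) if r is not None]
print(cleansed)
```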

ETL Platform Options


- Database features including SQL, stored procedures, etc.: Oracle, Teradata, etc.
- Code-based custom scripts: PL/SQL, COBOL, Pro*C, etc.
- Engine-based products: IBM/Ascential DataStage, Informatica PowerCenter, Ab Initio

Usual features provided by ETL tools:

- Graphical data flow definition interfaces for easy development
- Native & ODBC connectivity to standard databases, packages, etc.
- Metadata maintenance components
- Metadata import & export from standard databases, packages, etc.
- Inbuilt standard functions & transformations, e.g. date, aggregate, sort, etc.
- Options for sharing or reusing developed components
- Facility to call external routines or write custom code for complex requirements
- Batch definition to handle dependencies between data flows to create the application
- ETL engines that handle the data manipulation without depending on the database engines
- Run-time support for monitoring the data flow and reading message logs
- Scheduling options

Architecture of a Typical ETL Tool

[Architecture diagram: an ETL Engine sits between the source and target databases, moving data between them and exchanging metadata with an ETL Metadata Repository.]

GUI-Based Development Environment
- Metadata definition/import/export
- Data flow & transformation definition
- Batch definition
- Test & debug
- Schedule

Run-time Environment
- Trigger ETL
- Monitor flow
- View logs

Optional additional functions


- Cleansing capability: name & address cleansing, de-duplication
- Data Profiling
- Metadata Management
- Run Audit
- Pre-built templates
- Additional adaptors for interfacing with third-party products, models & protocols

DataStage Components

Server Components
- DataStage Server
- Repository
- DataStage Package Installer

Client Components
- DataStage Designer
- DataStage Director
- DataStage Administrator
- DataStage Manager

[Component diagram: the Server engine moves data from sources to targets; the Repository holds the ETL metadata, maintained in an internal format.]

Director
- Execute Jobs
- Monitor Jobs, view job logs

Manager
- Manage the Repository
- Create custom routines & transforms
- Import & Export component definitions

Designer
- Assemble Jobs
- Debug
- Compile Jobs
- Execute Jobs

DataStage Server Components


DataStage Server:
- Available for: Windows NT, 2000, Server 2003; IBM AIX; HP Compaq Tru64; HP HP-UX; Red Hat Enterprise Linux AS; Sun Solaris
- The server runs the executables and manages the data

Repository:
- Contains all the metadata, mapping rules, etc.
- DataStage applications are organized into Projects; each server can handle multiple projects
- The DataStage repository is maintained in an internal format, not in a database

Package Installer

Note: DataStage uses OS-level security; only the root/admin user can administer the server.

DataStage Client Components


- Windows-based components
- Need to access the server at development time as well
- Designer: used to create DataStage jobs, which are compiled to create the executables
- Director: validate, schedule, run, and monitor jobs
- Manager: view and edit the contents of the Repository
- Administrator: set up users, create and move projects, and set up purging criteria
- Designer, Director & Manager can connect to one Project at a time

Most DataStage configuration tasks are carried out using the DataStage Administrator, a client program provided with DataStage. To access the DataStage Administrator:
1. From the Ascential DataStage program folder, choose DataStage Administrator.
2. Log on to the server. If you do so as an Administrator (for Windows NT servers), or as dsadm (for UNIX servers), you have unlimited administrative rights; otherwise your rights are restricted as described in the previous section.
3. The DataStage Administration window appears. The General page lets you set server-wide properties. It is enabled only when at least one project exists. The controls and buttons on this page are enabled only if you logged on as an administrator.

The DataStage Manager is:
1. Used to store and manage re-usable metadata for the jobs.
2. Used to import and export components from the file system to DataStage projects.
3. The primary interface to the DataStage Repository.
4. Also where custom routines and transforms can be created.

The DataStage Director is the client component that validates, runs, schedules, and monitors jobs run by the DataStage Server. It is the starting point for most of the tasks a DataStage operator needs to do in respect of DataStage jobs.

Job Category Pane

Menu Bar

Toolbar

Status Bar

Display Area

The display area is the main part of the DataStage Director window. There are three views:

Job Status - The default view, which appears in the right pane of the DataStage Director window. It displays the status of all jobs in the category currently selected in the job category tree. If you hide the job category pane, the Job Status view includes a Category column and displays the status of all server jobs in the current project, regardless of their category.

Job Schedule - Displays a summary of scheduled jobs and batches in the currently selected job category. If the job category pane is hidden, the display area shows all scheduled jobs and batches, regardless of their category.

Job Log - Displays the log file for a job chosen from the Job Status view or the Job Schedule view.

DataStage Designer is used to:
- Create DataStage jobs that are compiled into executable programs
- Design the jobs that extract, integrate, aggregate, load, and transform the data
- Create and reuse metadata and job components
It allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating, and loading data.

Use Designer to:
- Specify how data is extracted
- Specify data transformations
- Decode data going into the target tables using reference lookups
- Aggregate data
- Split data into multiple outputs on the basis of defined constraints

The Designer graphical interface lets you select Stage icons, drop them onto the Designer work area, and add links. Then, still working in the Designer, you define the required actions and processes for each stage and link. A job created with the Designer is easily scalable. This means that you can easily create a simple job, get it working, then insert further processing, additional data sources, and so on.

1.Enter the name of your host in the Host system field. This is the name of the system where the DataStage server components are installed. 2. Enter your user name in the User name field. This is your user name on the server system. 3. Enter your password in the Password field. 4. Choose the project to connect to from the Project list. This list box displays all the projects installed on your DataStage server. At this point, you may only have one project installed on your system and this is displayed by default. 5. Select the Save settings check box to save your logon settings

The DataStage Designer window consists of the following parts:
- One or more Job windows where you design your jobs
- The Property Browser window where you view the properties of the selected job
- The Repository window where you view components in a project
- A Toolbar from where you select Designer functions
- A Tool Palette from which you select job components
- A Debug Toolbar from where you select debug functions
- A Status Bar which displays one-line help for the window components, and information on the current state of job operations, for example, compilation
For full information about the Designer window, including the functions of the pull-down and shortcut menus, refer to the DataStage Designer Guide.

STAGES IN DATASTAGE

FILE: SEQUENTIAL FILE, DATA SET

PROCESSING: TRANSFORMER, COPY, FILTER, SORT, AGGREGATOR, FUNNEL, REMOVE DUPLICATES, JOIN, LOOKUP, MERGE, MODIFY

DATABASE: NETEZZA, TERADATA, ORACLE

Learn how to:
Create an Enterprise Edition Job that generates data, and take a look at some of that data.
- Create a Job
- Select and position stages
- Connect stages with links
- Import a schema
- Set stage options
- Save, Compile & Run a Job
- View and Delete Job Log

Stages used:
- Row Generator
- Peek

To create a new job: Select File > New and select Parallel, OR click the New Program icon on the toolbar.

Creating a New Job

Create the following flow:


- Select the Row Generator stage, drag it onto the Parallel Canvas, and drop it
- Select the Peek stage, drag it onto the Parallel Canvas, and drop it
- Right-click on the Row Generator stage and drag a Link onto Peek

Does your flow look like the one above?

Importing a Schema

1. Select to import the schema.
2. Enter the appropriate path and file name (the instructor will provide details).
3. You can also use the File Browser.

Importing a Schema

Make sure you put it into the right category; it should reflect your userid.

Click on Next/Import/Finish to import.



Importing a Schema - End Goal

Did everything go smoothly?


After clicking on Finish, select the imported schema

This lets you select which columns you want to bring in. We want all of the columns! Click OK.


Column Properties

Row Generator specific options: Here you can select specific properties for the data you are going to generate.

- Double-click here to access additional options
- Click on Next> to step through column properties
- Click on Close when done

Final Touches

Your job should look like this. However, the eye should not wink. Notice the new icon on the link, indicating the presence of metadata.

Next:
- Click on the Compile icon
- Save the job (Lab2) under your own Category
Did it compile successfully?

Ready to Run

Action: Click to Run. Click for Log.

Running Job

After you click, select Run. Click for Log.

Clearing the Job Log

Tips:
- Clear away unnecessary Job Logs
- Use the <Copy> button and paste as text into any editor of your choice

Objectives

Learn how to:
Modify the simple data-generating program to sort the data and save it.
- Create a copy of a Job
- Edit an existing Job
- Create a Dataset
- Handle Errors
- View a Dataset

New stage used:
- Sort

Create a Copy of a Job


If necessary, open the Job created in Lab 2.
1. Access the stage properties for the Peek stage
2. Select the Input tab
3. Override the default Partition type from (Auto) to Hash
Click here to specify Sort Insertion.
Next: Click OK. What happens?

Insert a Sort
Let's sort by birth_date:
- Select the birth_date column from the Available list
- Once selected, you should see birth_date listed under Selected

Food for Thought: Why Hash partitioning type?
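Food for thought, sketched out: hash partitioning on the sort key sends every row with the same key value to the same partition, so a sort performed independently inside each partition still keeps equal keys together. A minimal Python sketch of that idea (illustrative only, not DataStage internals; the column names follow the lab's generated schema):

```python
# Illustrative only (plain Python, not DataStage): hash partitioning on the sort key
# followed by a sort inside each partition, which is roughly what choosing Hash
# partitioning plus sort insertion gives you.
from zlib import crc32

def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition based on a hash of its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        p = crc32(str(row[key]).encode()) % num_partitions
        partitions[p].append(row)
    return partitions

rows = [{"name": "John Parker", "birth_date": "1979-04-24"},
        {"name": "Susan Calvin", "birth_date": "1967-12-24"},
        {"name": "Ann Claybourne", "birth_date": "1960-10-29"}]

for part in hash_partition(rows, key="birth_date", num_partitions=4):
    part.sort(key=lambda r: r["birth_date"])   # sort runs independently per partition
    print(part)
```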


Sort Insertion

Are your results sorted on birth_date?

Note the new icon appears on the link, denoting the presence of a sort.
Select Save As from the File menu and save the Job (Lab3).

Choose one of these to compile and run your job.

Let's Stage the Data

We'll now save the output of the sort for later use. Now attach a Dataset stage to the program by:
- Placing a Dataset stage on the Canvas
- Right-clicking on Peek and drawing a Link over to the Dataset stage

Your Job should now look like this:

Viewing a Dataset
Right-click on the Dataset stage and select View DSLinkX data (note: link names may vary). Click OK to bring up the Data Browser:

Objectives
Use the Lookup stage to replace state codes with state names

Learn how to:
- Use the Lookup operator
- Start thinking about partitioning

New operators used:
- Lookup
- Entire partitioner

Remember the Records in Lab 2?


They look like this:

John Parker M 1979-04-24 MA 0 1 0 0
Susan Calvin F 1967-12-24 IL 0 1 1 1
William Mandella M 1962-04-07 CA 0 1 2 2
Ann Claybourne F 1960-10-29 FL 0 1 3 3
Frank Chalmers M 1969-12-10 NY 0 1 4 4
Jane Studdock F 1962-02-24 TX 0 1 5 5

One of the fields is a two-character state code. Let's expand it out into a full state name.

The State Table


We have a table that maps state codes to state names:

Alabama	AL
Alaska	AK
American Samoa	AS
Arizona	AZ
Arkansas	AR
California	CA
Colorado	CO
Connecticut	CT
Delaware	DE
District of Columbia	DC
[...]

It is a Unix text file with a tab after the full state name. We imported this file in Lab 5a.

We'll use that table to tack on the expanded state name to the rows generated in Lab 2.
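Conceptually, the Lookup stage behaves like an in-memory map from the key column to the reference columns. A minimal Python sketch of the same idea (illustrative only, not DataStage; in the lab the table comes from the tab-delimited states.txt file, so a few rows are inlined here just to make the sketch self-contained):

```python
# Illustrative only (plain Python, not DataStage). Build a lookup table from
# tab-delimited "state name <TAB> code" rows and append the full state name to each row.
STATE_FILE_CONTENT = "Massachusetts\tMA\nNew York\tNY\nColorado\tCO\n"

def load_state_table(text):
    table = {}
    for line in text.splitlines():
        name, code = line.split("\t")
        table[code] = name                 # key: two-character state code
    return table

def lookup_state(rows, table):
    for row in rows:
        # Unmatched source rows could be routed to a reject link instead
        row["state_name"] = table.get(row["state"], "UNKNOWN")
        yield row

states = load_state_table(STATE_FILE_CONTENT)
source = [{"name": "John Parker", "state": "MA"},
          {"name": "Frank Chalmers", "state": "NY"}]
for out in lookup_state(source, states):
    print(out)
```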


What We're Going To Build...

- Uses the states.txt file as the lookup table
- Has a TAB delimiter between the state_name & state columns
- Use the state column as the lookup key
- Note that the source data has a column called state while the lookup table has state_code
- Use the same schema as Lab 2
- Generate 100 rows

Reminder: Don't forget to perform column mapping (see next slide).

Lookup Mapping


What You Should See...

Sample Output (make sure the state names match the state codes):

Peek,0: John Parker M 1979-04-24 0087228.46 MA 0 1 0 0 Massachusetts
Peek,0: Frank Chalmers M 1969-12-10 0004881.94 NY 0 1 4 4 New York
Peek,0: John Boone M 1964-04-16 0042729.03 CO 0 1 8 8 Colorado
Peek,0: Frank Sinatra M 1984-06-12 0082552.55 OH 0 1 12 12 Ohio
Peek,0: John Calvin M 1961-11-30 0025966.39 FL 0 1 16 16 Florida
Peek,0: Frank Studdock M 1962-10-29 0022976.45 KY 0 1 20 20 Kentucky
Peek,0: John Sarandon M 1964-06-03 0005305.48 MI 0 1 24 24 Michigan
Peek,0: Frank Austin M 1971-01-21 0098979.80 CA 0 1 28 28 California
Peek,0: John Mandella M 1981-06-16 0023340.92 NJ 0 1 32 32 New Jersey
Peek,0: Frank Glass M 1983-04-15 0068974.57 SD 0 1 36 36 South Dakota

Objectives

Learn how to:
- Use the Join stage to find out which products the customer purchased
- Use an InnerJoin

New stages used:
- Join
- Remdup
- Hash partitioner

Background - What We Have

Customers of ACME Hardware place orders for products. We have two simple tables to model this:

- customer_order table: tells us which orders were placed by each customer (columns: customer, order)
- order_product table: tells us how many of each product are in an order (columns: order, product, quantity)

[Sample rows shown on the original slide: customer numbers run 1-4, order numbers run 1000-1005, and the products include screws, nuts, bolts, nails, and washers with quantities such as 137, 200, 145, ...]

Note the data types involved. Use Integer and Varchar types where appropriate when defining the table definitions.

Background - What We Want


Q: Which products have been ordered by each customer?
A: Customer 1 has ordered washers, bolts, screws, and ...

Go ahead and assemble this flow, but do so in a more optimized manner (see next slide). Save it as Lab8a and a copy for Lab 9. Use cust_order.txt & order_prod_plus.txt as input files. See the previous slide for the file layouts (whitespace-delimited fields). Note: column ordering matters! Make sure you get the column data types correct also.

Job Optimizations

These two jobs are equivalent!

Notice the different partitioner/collector icons. Can you visually determine how the data is being handled?


We want to join the tables using the order number as the join key. This means the tables need to be hashed and sorted on the order number. The resulting table will have one record for each row of the order_product table, with the customer field added. Make sure you use order_prod_plus.txt as your input. If we then sort these records on customer and product and remove duplicated customer/product combinations, we have our answer; 234 records should be written out. A sketch of this logic follows.
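To make the data flow concrete, here is the same join-then-deduplicate logic as a minimal Python sketch (illustrative only, not DataStage; the in-line sample rows are invented, whereas the lab reads them from the input text files):

```python
# Illustrative only (plain Python, not DataStage): inner join on the order number,
# then sort on customer/product and drop duplicate customer/product combinations.
customer_order = [(1, 1000), (1, 1001), (2, 1002)]             # (customer, order)
order_product = [(1000, "screws", 137), (1001, "nuts", 200),
                 (1001, "screws", 135), (1002, "bolts", 145)]  # (order, product, quantity)

# Inner join: index customer_order by the join key (order number)
order_to_customer = {order: cust for cust, order in customer_order}
joined = [(order_to_customer[order], order, product, qty)
          for order, product, qty in order_product
          if order in order_to_customer]

# Sort on customer and product, then keep one row per customer/product combination
joined.sort(key=lambda r: (r[0], r[2]))
seen, result = set(), []
for cust, _order, product, _qty in joined:
    if (cust, product) not in seen:
        seen.add((cust, product))
        result.append((cust, product))
print(result)   # which products each customer has on order
```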

Using Lookup and Merge

This is what your flows should look like,

Using Lookup:

Note the "PhantomOrders" links leading to "Customerless" files


This is what your flows should look like,

Using Merge:

Order Matters!
Remember:
- Lookup captures unmatched Source rows, on the Primary link
- Merge captures unmatched Update rows, on the Secondary link(s)
(A short sketch of this difference follows the tip below.)

Tip:
Always check the Link Ordering Tab in the Stage page
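The reject-handling difference is easier to see in miniature. A Python sketch (illustrative only, not DataStage; the rows are invented) of where unmatched rows end up in each case:

```python
# Illustrative only (plain Python). Lookup can capture unmatched *source* (primary)
# rows on its reject link; Merge can capture unmatched *update* (secondary) rows on
# its reject link(s).
source = [{"order": 1000}, {"order": 1001}]                    # primary / master input
reference = {1001: {"customer": 1}, 1002: {"customer": 7}}     # lookup / update rows by order

# Lookup-style: source rows with no reference match go to the reject output
lookup_out, lookup_rejects = [], []
for row in source:
    match = reference.get(row["order"])
    if match:
        lookup_out.append({**row, **match})
    else:
        lookup_rejects.append(row)

# Merge-style: update rows never matched by a master row go to the reject output
matched_keys = {row["order"] for row in source}
merge_rejects = [{"order": k, **v} for k, v in reference.items() if k not in matched_keys]

print(lookup_rejects)   # [{'order': 1000}]                -- no reference match
print(merge_rejects)    # [{'order': 1002, 'customer': 7}] -- no master match
```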


New Results from Lookup and Merge


Outputs: Lookup and Merge should yield outputs with 234 rows, just as InnerJoin did.

Rejects: Lookup and Merge populate the "Customerless" file with the following two rows:
"1000","gaskets","28"
"1000","widgets","14"
You caught ACME Hardware red-handed: they tried to boost their stock by reporting a phantom order of 28 gaskets and 14 widgets!

Objectives

Use the Aggregator stage to see how many of each product each customer has on order.

New stages used:
- Aggregator

The Aggregator stage is a processing stage. It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group.

Our InnerJoin Job Was A Bit Incomplete...


We did almost enough work in the InnerJoin lab (Lab 8a) to find out how many of each product each customer has on order. Now that we know about the Aggregator stage, we can finish the job.
- Go back to the version of Lab 8a
- Remove the implicit Remdup (sort unique)
- Insert an Aggregator

Aggregator Options

- Method: sort (we could have a lot of customer/product groups)
- Grouping keys: customer and product
- Column for Calculation: quantity
- Function to apply: Sum
- Output Column - Name of result column: quantity
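The sort-method aggregation amounts to grouping sorted rows and summing within each group. A Python sketch of the same calculation (illustrative only, not DataStage; the sample rows are invented):

```python
# Illustrative only (plain Python): sort-method aggregation -- sort by the grouping
# keys, then sum the quantity within each (customer, product) group.
from itertools import groupby
from operator import itemgetter

rows = [
    {"customer": 1, "product": "screws", "quantity": 137},
    {"customer": 1, "product": "screws", "quantity": 135},
    {"customer": 2, "product": "nuts",   "quantity": 200},
]

rows.sort(key=itemgetter("customer", "product"))   # the sort method needs sorted input
aggregated = [
    {"customer": cust, "product": prod, "quantity": sum(r["quantity"] for r in group)}
    for (cust, prod), group in groupby(rows, key=itemgetter("customer", "product"))
]
print(aggregated)
```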


What You Should Have...


Your Job should look like this. Compile and Run your Job.


Modify Stage

The Modify stage is a processing stage. It can have a single input link and a single output link. The modify stage alters the record schema of its input data set. The modified data set is then output. You can drop or keep columns from the schema, or change the type of a column.

Dropping and Keeping Columns

The following example takes a data set comprising the following columns: CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE, REPID, CREDITLIMIT, and COMMENTS.

The modify stage is used to drop the REPID, CREDITLIMIT, and COMMENTS columns. To do this, the stage properties are set as follows:
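The drop specification itself is not reproduced in this extract; judging from the keep alternative quoted below, it would read along the lines of:
DROP REPID, CREDITLIMIT, COMMENTS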

You could achieve the same effect by specifying which columns to keep, rather than which ones to drop. In the case of this example the required specification to use in the stage properties would be:
KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE

Changing Data Type

You could also change the data types of one or more of the columns from the above example. Say you wanted to convert CUSTID from decimal to string; you would specify a new column to take the converted data, and specify the conversion in the stage properties:
conv_CUSTID:string = string_from_decimal(CUSTID)

Copy Stage
The Copy stage is a processing stage. It can have a single input link and any number of output links. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.

The Copy stage properties are fairly simple. The only property is Force, and we do not need to set it in this instance as we are copying to multiple data sets (and DataStage will not attempt to optimize it out of the job). We need to concentrate on telling DataStage which columns to drop on each output link. The easiest way to do this is using the Outputs page Mapping tab. When you open this for a link the left pane shows the input columns, simply drag the columns you want to preserve across to the right pane. We repeat this for each link as follows:
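The mapping screenshots are not reproduced in this text, but the idea is simple: every record goes to every output link, and each link keeps only the columns mapped onto it. A minimal Python sketch (illustrative only, not DataStage; the link and column names are invented):

```python
# Illustrative only (plain Python): copy every input record to every output link,
# keeping only the columns mapped onto each link.
output_columns = {"link1": ["id", "name"], "link2": ["id", "city"]}

rows = [{"id": 1, "name": "Ann", "city": "Boston"},
        {"id": 2, "name": "John", "city": "Denver"}]

outputs = {link: [{col: row[col] for col in cols} for row in rows]
           for link, cols in output_columns.items()}
print(outputs)
```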

Funnel Stage
The Funnel stage is a processing stage. It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. The stage can have any number of input links and a single output link.

The continuous funnel method is selected on the Stage page Properties tab of the Funnel stage:

The continuous funnel method does not attempt to impose any order on the data it is processing. It simply writes rows as they become available on the input links. In our example the stage has written a row from each input link in turn. A sample of the final, funneled, data is as follows:
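The funneled sample itself is not reproduced in this extract, but the behaviour is easy to sketch. Illustrative only (plain Python, not DataStage): a continuous funnel writes rows as they become available, so here the two inputs are simply interleaved, one row from each in turn:

```python
# Illustrative only (plain Python): combine two inputs into a single output stream,
# taking a row from each input in turn and imposing no overall ordering.
from itertools import zip_longest

input_a = [{"id": 1}, {"id": 3}]
input_b = [{"id": 2}, {"id": 4}, {"id": 5}]

_SKIP = object()
funneled = [row for pair in zip_longest(input_a, input_b, fillvalue=_SKIP)
            for row in pair if row is not _SKIP]
print(funneled)
```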

Filter Stage
The Filter stage is a processing stage. It can have a single input link, any number of output links, and, optionally, a single reject link. The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links. The filtered-out records can be routed to a reject link, if required.

Specifying the Filter

The operation of the Filter stage is governed by the expressions you set in the Where property on the Properties tab. You can use the following elements to specify the expressions:
- Input columns
- Requirements involving the contents of the input columns
- Optional constants to be used in comparisons
- The Boolean operators AND and OR to combine requirements
When a record meets the requirements, it is written unchanged to the specified output link. The Where property supports standard SQL expressions, except when comparing strings.
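As a rough picture of what the Where expressions do, here is the routing logic in miniature as plain Python (illustrative only; the column names and predicates are invented, the real stage evaluates SQL-like Where strings rather than Python functions, and this sketch routes each row to the first matching output only, whereas the stage can also write a row to every output whose Where clause it satisfies):

```python
# Illustrative only (plain Python): route each input row to the first output whose
# predicate it satisfies, and send everything else to an optional reject output.
filters = [
    ("big_orders", lambda r: r["quantity"] > 300),              # like: WHERE quantity > 300
    ("west_coast", lambda r: r["state"] in ("CA", "OR", "WA")),  # like: WHERE state = 'CA' OR ...
]

outputs = {name: [] for name, _ in filters}
rejects = []

rows = [{"state": "CA", "quantity": 120},
        {"state": "NY", "quantity": 527},
        {"state": "TX", "quantity": 45}]

for row in rows:
    for name, predicate in filters:
        if predicate(row):
            outputs[name].append(row)   # written unchanged to the matching output link
            break
    else:
        rejects.append(row)             # filtered-out records go to the reject link

print(outputs, rejects)
```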
