DataStage PPT
Scenarios
Load a warehouse, mart, or analytical and reporting applications
Application/Data Integration: load packaged applications or external systems through their APIs or interface databases
Data Migration
Extract
Transform
Heterogeneous data sources:
- Relational and non-relational databases
- Sequential flat files, complex flat files, COBOL files, VSAM data, XML data, etc.
- Packaged applications (e.g. SAP, Siebel)
Incremental/changed data or complete/snapshot data
Internal data or third-party data
Push/Pull
Cleansing & validation:
- Simple: range checks, duplicate checks, NULL value transforms, etc. (sketched below)
- Specialized/complex: name & address validation, de-duplication, etc.
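To make the simple checks concrete, here is a minimal plain-Python sketch (not DataStage syntax); the rows and field names are made up for illustration.

```python
# Minimal sketch of the simple cleansing checks listed above:
# range check, duplicate check, NULL value transform.
rows = [
    {"name": "Smith", "age": 34},
    {"name": "Smith", "age": 34},   # duplicate
    {"name": None,    "age": 250},  # NULL name, out-of-range age
]

seen = set()
clean, rejected = [], []
for row in rows:
    key = (row["name"], row["age"])
    if key in seen:                     # duplicate check
        rejected.append(row)
        continue
    seen.add(key)
    if not (0 <= row["age"] <= 120):    # range check
        rejected.append(row)
        continue
    if row["name"] is None:             # NULL value transform
        row["name"] = "UNKNOWN"
    clean.append(row)

print(clean)     # valid, de-duplicated rows
print(rejected)  # rows that failed a check
```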
Transform
Load
Computations (arithmetic, string, date, etc.)
Pivot
Split or concatenate
Aggregate
Filter
Join, lookup
Historical vs. refresh load
Incremental vs. snapshot
Bulk loading vs. record-level loading
Data
Run-time Environment
Trigger ETL
Monitor flow
View logs
Client Components: DataStage Designer, DataStage Director, DataStage Administrator, DataStage Manager
DataStage Components
Architecture (diagram): Sources → DataStage Server (Engine) → Targets
Repository: ETL metadata, maintained in an internal format
Client tools shown: Manager, Director, Designer
Manager: manage the Repository, create custom routines & transforms, import & export component definitions
Repository:
Contains all the metadata, mapping rules, etc. DataStage applications are organized into Projects; each server can handle multiple projects. The DataStage repository is maintained in an internal format, not in a database.
Package Installer
Note: DataStage uses OS-level security; only the root/admin user can administer the server.
Most DataStage configuration tasks are carried out using the DataStage Administrator, a client program provided with DataStage. To access the DataStage Administrator:
1. From the Ascential DataStage program folder, choose DataStage Administrator.
2. Log on to the server. If you do so as an Administrator (for Windows NT servers) or as dsadm (for UNIX servers), you have unlimited administrative rights; otherwise your rights are restricted as described in the previous section.
3. The DataStage Administration window appears. The General page lets you set server-wide properties. It is enabled only when at least one project exists. The controls and buttons on this page are enabled only if you logged on as an administrator.
1. Used to store and manage reusable metadata for jobs.
2. Used to import and export components between the file system and DataStage projects.
3. Primary interface to the DataStage Repository.
4. Custom routines and transforms can also be created in the Manager.
The DataStage Director is the client component that validates, runs, schedules, and monitors jobs run by the DataStage Server. It is the starting point for most of the tasks a DataStage operator needs to do in respect of DataStage jobs.
Menu Bar
Toolbar
Status Bar
Display Area
The display area is the main part of the DataStage Director window. There are three views:
Job Status - The default view, which appears in the right pane of the DataStage Director window. It displays the status of all jobs in the category currently selected in the job category tree. If you hide the job category pane, the Job Status view includes a Category column and displays the status of all server jobs in the current project, regardless of their category.
Job Schedule - Displays a summary of scheduled jobs and batches in the currently selected job category. If the job category pane is hidden, the display area shows all scheduled jobs and batches, regardless of their category.
Job Log - Displays the log file for a job chosen from the Job Status view or the Job Schedule view.
DataStage Designer is used to:
- Create DataStage Jobs that are compiled into executable programs
- Design the jobs that extract, integrate, aggregate, load, and transform the data
- Create and reuse metadata and job components
It allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating, and loading data.
Use Designer to:
- Specify how data is extracted
- Specify data transformations
- Decode data going into the target tables using reference lookups
- Aggregate data
- Split data into multiple outputs on the basis of defined constraints
The Designer graphical interface lets you select Stage icons, drop them onto the Designer work area, and add links. Then, still working in the Designer, you define the required actions and processes for each stage and link. A job created with the Designer is easily scalable. This means that you can easily create a simple job, get it working, then insert further processing, additional data sources, and so on.
1. Enter the name of your host in the Host system field. This is the name of the system where the DataStage server components are installed.
2. Enter your user name in the User name field. This is your user name on the server system.
3. Enter your password in the Password field.
4. Choose the project to connect to from the Project list. This list box displays all the projects installed on your DataStage server. At this point, you may only have one project installed on your system and this is displayed by default.
5. Select the Save settings check box to save your logon settings.
The DataStage Designer window consists of the following parts:
- One or more Job windows where you design your jobs
- The Property Browser window, where you view the properties of the selected job
- The Repository window, where you view the components in a project
- A Toolbar from which you select Designer functions
- A Tool Palette from which you select job components
- A Debug Toolbar from which you select debug functions
- A Status Bar, which displays one-line help for the window components and information on the current state of job operations, for example, compilation
For full information about the Designer window, including the functions of the pull-down and shortcut menus, refer to the DataStage Designer Guide.
STAGES IN DATASTAGE
FILE: SEQUENTIAL FILE
DATA SET
PROCESSING
TRANSFORMER
COPY
FILTER
SORTER
AGGREGATOR
FUNNEL
REMOVE DUPLICATES
JOIN
LOOKUP
MERGE
MODIFY
DATABASE:
NETEZZA
TERADATA
ORACLE
Create an Enterprise Edition Job that generates data, and take a look at some of that data.
- Create a Job
- Select and position stages
- Connect stages with links
- Import a schema
- Set stage options
- Save, compile & run a Job
- View and delete the Job Log
Stages used: Row Generator, Peek
To create a new job: select File > New and select Parallel, OR click the New Program icon on the toolbar.
Right-Click on the Row Generator stage and drag a Link onto Peek
Does your flow look like the one above?
Importing a Schema
2. Enter the appropriate path and file name (instructor will provide details).
3. You can also use the File Browser.
Importing a Schema
Make sure you put it into the right category; it should reflect your userid.
This lets you select which columns you want to bring in. We want all of the columns! Click OK.
Column Properties
Row Generator specific options: Here you can select specific properties for the data you are going to generate.
Double-click here to access additional options.
Click on Next > to step through column properties.
Click on Close when done.
Final Touches
Your job should look like this; however, the eye should not wink. Notice the new icon on the link, indicating the presence of metadata.
Next:
- Click on the Compile icon
- Save the job (Lab2) under your own category
Did it compile successfully?
Ready to Run
Running Job
Tips:
- Clear away unnecessary Job Logs
- Use the <Copy> button and paste as text into any editor of your choice
Objectives
Learn how to:
Modify the simple data-generating program to sort the data and save it.
- Create a copy of a Job
- Edit an existing Job
- Create a Dataset
- Handle errors
- View a Dataset
Insert a Sort
Let's sort by birth_date. Select the birth_date column from the Available list. Once selected, you should see birth_date listed under Selected.
Sort Insertion
Are your results sorted on birth_date?
Note the new icon (A→Z) that appears on the link, denoting the presence of a sort. Select Save As from the File menu and save the job (Lab3).
We'll now save the output of the sort for later use. Attach a Dataset stage to the program by:
- Placing a Dataset stage on the canvas
- Right-clicking on Peek and drawing a Link over to the Dataset stage
Viewing a Dataset
Right-click on the Dataset stage and select View DSLinkX data (note: link names may vary).
Click OK to bring up the Data Browser.
Objectives
Use the Lookup stage to replace state codes with state names
New stage used: Entire partitioner
One of the fields is a two-character state code. Let's expand it out into a full state name.
A Unix text file with a tab after the full state name. We imported this file in Lab5a.
We'll use that table to tack on the expanded state name to the rows generated in Lab 2 (sketched below).
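Here is a minimal plain-Python sketch of what this lookup does; the column order in the reference file (code, tab, full name) and the stand-in rows are assumptions based on the lab description.

```python
# Minimal sketch of the Lookup: a reference table maps two-character
# state codes to full names, and each row gets the expanded name.
reference_lines = ["CO\tColorado", "OH\tOhio", "FL\tFlorida"]  # stand-in for the file

state_names = {}
for line in reference_lines:
    code, name = line.split("\t")   # assumed layout: code <tab> full name
    state_names[code] = name

def expand_state(row):
    # Tack the expanded state name onto a generated row.
    row["state_name"] = state_names.get(row["state"], "")
    return row

print(expand_state({"state": "CO"}))   # {'state': 'CO', 'state_name': 'Colorado'}
```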
Lookup Mapping
0042729.03 CO 0 1 8 8 Colorado
0082552.55 OH 0 1 12 12 Ohio
0025966.39 FL 0 1 16 16 Florida
0022976.45 KY 0 1 20 20 Kentucky
0005305.48 MI 0 1 24 24 Michigan
0098979.80 CA 0 1 28 28 California
0023340.92 NJ 0 1 32 32 New Jersey
Peek,0: Frank Studdock M 1962-10-29
Peek,0: John Sarandon M 1964-06-03
Peek,0: Frank Austin M 1971-01-21
Peek,0: John Mandella M 1981-06-16
Peek,0: Frank Glass M 1983-04-15
Objectives
Learn how to:
- Use the Join stage to find out which products the customer purchased
- Use an Inner Join
New stages used:
- Join
- Remdup
- Hash partitioner
Background - What We Have
Customers of ACME Hardware place orders for products.
We have two simple tables to model this
[Slide shows two sample tables: a customer table (columns: customer, order) listing each customer's order numbers, and an order table (columns: order, product, quantity) listing products such as screws, nuts, bolts, nails, and washers with their ordered quantities.]
Use Integer and Varchar types where appropriate when defining table definitions.
Go ahead and assemble this flow, but do so in a more optimized manner; see the next slide (save as Lab8a and a copy for Lab 9). Use cust_order.txt & order_prod_plus.txt as input files. See the previous slide for the file layouts (whitespace-delimited fields). Note: column ordering matters! Make sure you get the column data types correct also.
Job Optimizations
Notice the different partitioner/collector icons. Can you visually determine how the data is being handled?
We want to join the tables using the order number as the join key. This means the tables need to be hashed and sorted on the order number. The resulting table will have one record for each row of the order_product table, with the customer field added. Make sure you use order_prod_plus.txt as your input. If we then sort these records on customer and product and remove duplicated customer/product combinations, we have our answer; 234 records should be written out.
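As a sanity check, here is a minimal plain-Python sketch of that flow (hash-join on the order number, then sort and remove duplicates); the sample rows are illustrative, not the lab data.

```python
# Minimal sketch: join on order number, then sort on customer/product
# and drop duplicate combinations.
cust_order = [("1", "1000"), ("3", "1001")]                    # (customer, order)
order_prod = [("1001", "screws", 137), ("1001", "screws", 63),
              ("1001", "nuts", 200)]                           # (order, product, qty)

# "Hash" the customer table on the join key (the order number).
cust_by_order = {order: customer for customer, order in cust_order}

# Inner join: one output record per order_prod row, with customer added.
joined = [(cust_by_order[o], o, p, q)
          for (o, p, q) in order_prod if o in cust_by_order]

# Sort on customer and product, then remove duplicated combinations.
joined.sort(key=lambda r: (r[0], r[2]))
answer, seen = [], set()
for customer, order, product, qty in joined:
    if (customer, product) not in seen:
        seen.add((customer, product))
        answer.append((customer, product))
print(answer)   # [('3', 'nuts'), ('3', 'screws')]
```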
Using Lookup:
Using Merge:
Order Matters!
Remember:
Lookup captures unmatched Source rows, on the Primary link.
Merge captures unmatched Update rows, on the Secondary link(s).
(Contrasted in the sketch below.)
Tip:
Always check the Link Ordering Tab in the Stage page
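A minimal plain-Python sketch contrasting the two capture behaviors; the link names and table contents are hypothetical.

```python
# Lookup: unmatched *source* rows are captured.
source = [{"order": "1000"}, {"order": "1001"}]          # primary link
reference = {"1001": {"product": "widgets"}}             # reference table

matched, unmatched_source = [], []
for row in source:
    if row["order"] in reference:
        matched.append({**row, **reference[row["order"]]})
    else:
        unmatched_source.append(row)                     # Lookup captures these

# Merge: unmatched *update* rows are captured.
master = [{"order": "1001", "customer": "3"}]            # master link
update = [{"order": "1000", "qty": 28}]                  # update link

update_by_key = {u["order"]: u for u in update}
merged = [{**m, **update_by_key.pop(m["order"], {})} for m in master]
unmatched_update = list(update_by_key.values())          # Merge captures these

print(unmatched_source)  # [{'order': '1000'}]
print(unmatched_update)  # [{'order': '1000', 'qty': 28}]
```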
Lookup and Merge populate the "Customerless" file with the following two rows: "1000","gaskets","28" and "1000","widgets","14". You caught ACME Hardware red-handed: they tried to boost their stock by reporting a phantom order of 28 gaskets and 14 widgets!
Objectives
Use the Aggregator stage to see how many of each product each customer has on order.
The Aggregator stage is a processing stage. It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group.
Aggregator Options
Method: sort
Grouping keys: customer and product
Column for Calculation: quantity
Function to apply: Sum
Output Column
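A minimal plain-Python sketch of this configuration (grouping on customer and product, summing quantity); the sample rows are illustrative, not the lab data.

```python
# Minimal sketch of the aggregation: group on (customer, product),
# apply Sum to quantity.
from collections import defaultdict

rows = [
    {"customer": "1", "product": "screws", "quantity": 137},
    {"customer": "1", "product": "screws", "quantity": 63},
    {"customer": "1", "product": "nuts",   "quantity": 200},
]

totals = defaultdict(int)
for row in rows:
    # Grouping keys: customer and product; function: Sum(quantity).
    totals[(row["customer"], row["product"])] += row["quantity"]

for (customer, product), total in sorted(totals.items()):
    print(customer, product, total)   # e.g. "1 screws 200"
```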
Modify Stage
The Modify stage is a processing stage. It can have a single input link and a single output link. The modify stage alters the record schema of its input data set. The modified data set is then output. You can drop or keep columns from the schema, or change the type of a column.
Dropping and Keeping Columns
The following example takes a data set comprising the columns CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE, REPID, CREDITLIMIT, and COMMENTS.
The Modify stage is used to drop the REPID, CREDITLIMIT, and COMMENTS columns. To do this, the stage properties are set as follows:
DROP REPID, CREDITLIMIT, COMMENTS
You could achieve the same effect by specifying which columns to keep, rather than which ones to drop. In the case of this example the required specification to use in the stage properties would be:
KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE
Changing Data Type
You could also change the data types of one or more of the columns from the above example. Say you wanted to convert CUSTID from decimal to string: you would specify a new column to take the converted data, and specify the conversion in the stage properties:
conv_CUSTID:string = string_from_decimal(CUSTID)
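A minimal plain-Python sketch of these three Modify operations, using the column names from the example (the values are placeholders).

```python
# Minimal sketch of Modify semantics: drop columns, keep columns,
# and convert a column's type.
row = {"CUSTID": 1001, "NAME": "ACME", "ADDRESS": "1 Main St",
       "CITY": "Denver", "STATE": "CO", "ZIP": "80014", "AREA": "303",
       "PHONE": "555-0100", "REPID": 7, "CREDITLIMIT": 5000,
       "COMMENTS": "n/a"}

# DROP REPID, CREDITLIMIT, COMMENTS
dropped = {k: v for k, v in row.items()
           if k not in {"REPID", "CREDITLIMIT", "COMMENTS"}}

# KEEP CUSTID, NAME, ADDRESS, CITY, STATE, ZIP, AREA, PHONE
keep = ("CUSTID", "NAME", "ADDRESS", "CITY", "STATE", "ZIP", "AREA", "PHONE")
kept = {k: row[k] for k in keep}
assert dropped == kept   # dropping and keeping have the same effect here

# conv_CUSTID:string = string_from_decimal(CUSTID)
converted = {**kept, "conv_CUSTID": str(row["CUSTID"])}
print(converted["conv_CUSTID"])   # "1001"
```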
Copy Stage
The Copy stage is a processing stage. It can have a single input link and any number of output links. The Copy stage copies a single input data set to a number of output data sets. Each record of the input data set is copied to every output data set. Records can be copied without modification, or you can drop or change the order of columns.
The Copy stage properties are fairly simple. The only property is Force, and we do not need to set it in this instance, as we are copying to multiple data sets (and DataStage will not attempt to optimize the stage out of the job). We need to concentrate on telling DataStage which columns to drop on each output link. The easiest way to do this is using the Outputs page Mapping tab. When you open this for a link, the left pane shows the input columns; simply drag the columns you want to preserve across to the right pane. We repeat this for each link.
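A minimal plain-Python sketch of the resulting behavior; the link names and the column subsets kept on each link are hypothetical.

```python
# Minimal sketch of Copy semantics: every input record goes to every
# output link, each link keeping only the columns mapped to it.
input_rows = [{"CUSTID": 1001, "NAME": "ACME", "PHONE": "555-0100"}]

output_columns = {
    "out_names":  ["CUSTID", "NAME"],    # columns preserved on link 1
    "out_phones": ["CUSTID", "PHONE"],   # columns preserved on link 2
}

outputs = {link: [] for link in output_columns}
for row in input_rows:
    for link, cols in output_columns.items():
        outputs[link].append({c: row[c] for c in cols})
print(outputs)
```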
Funnel Stage
The Funnel stage is a processing stage. It copies multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. The stage can have any number of input links and a single output link.
The continuous funnel method is selected on the Stage page Properties tab of the Funnel stage.
The continuous funnel method does not attempt to impose any order on the data it is processing. It simply writes rows as they become available on the input links. In our example, the stage has written a row from each input link in turn.
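A minimal plain-Python sketch of the continuous funnel semantics; taking a row from each input in turn stands in for "as rows become available".

```python
# Minimal sketch of the continuous funnel: rows from multiple inputs
# are interleaved with no ordering imposed.
from itertools import chain, zip_longest

input1 = [{"id": 1}, {"id": 3}]
input2 = [{"id": 2}, {"id": 4}]

_GAP = object()   # filler for inputs of unequal length
funneled = [row
            for row in chain.from_iterable(
                zip_longest(input1, input2, fillvalue=_GAP))
            if row is not _GAP]
print(funneled)   # rows interleaved: ids 1, 2, 3, 4
```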
Filter Stage
The Filter stage is a processing stage. It can have a single input link, any number of output links, and, optionally, a single reject link. The Filter stage transfers, unmodified, the records of the input data set which satisfy the specified requirements and filters out all other records. You can specify different requirements to route rows down different output links. The filtered-out records can be routed to a reject link, if required.
Specifying the Filter
The operation of the Filter stage is governed by the expressions you set in the Where property on the Properties tab. You can use the following elements to specify the expressions:
- Input columns
- Requirements involving the contents of the input columns
- Optional constants to be used in comparisons
- The Boolean operators AND and OR to combine requirements
When a record meets the requirements, it is written unchanged to the specified output link. The Where property supports standard SQL expressions, except when comparing strings.
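A minimal plain-Python sketch of these semantics; the link names and Where clauses are hypothetical.

```python
# Minimal sketch of Filter semantics: Where expressions built from
# input columns, constants, and AND/OR route records to output links;
# records matching no clause go to the reject link.
rows = [{"STATE": "CO", "QTY": 150},
        {"STATE": "OH", "QTY": 20},
        {"STATE": "FL", "QTY": 5}]

wheres = {
    "big_orders": lambda r: r["QTY"] > 100,                         # QTY > 100
    "oh_or_ky":   lambda r: r["STATE"] == "OH" or r["STATE"] == "KY",
}

outputs = {link: [] for link in wheres}
reject = []
for row in rows:
    hit = False
    for link, where in wheres.items():
        if where(row):
            outputs[link].append(row)   # written unchanged
            hit = True
    if not hit:
        reject.append(row)              # optional reject link
print(outputs)
print(reject)   # [{'STATE': 'FL', 'QTY': 5}]
```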