Custom Operator Reference
Version 8 Release 5
SC19-2925-01
Note: Before using this information and the product that it supports, read the information in "Notices and trademarks" on page 201.
Copyright IBM Corporation 2009, 2010. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents

Chapter 1. Defining custom operators . . . . . . . . . 1
   Compiling operators . . . . . . . . . . . . . . . . 1
   Using your operator . . . . . . . . . . . . . . . . 1
   Using the messaging macros . . . . . . . . . . . . 61
   Descriptions of the message macros . . . . . . . . 61
   Message environment variables . . . . . . . . . . 63
   Localizing message code . . . . . . . . . . . . . 63
   Convert a multi-line message . . . . . . . . . . . 63
   Convert a message with no run time variables . . . 64
   Convert a message with run time variables . . . . 64
   Steps to convert pre-NLS messages . . . . . . . . 64
   Eliminating deprecated interfaces . . . . . . . . 66
   Overriding APT_Partitioner::partitionInput()
   Overriding APT_Partitioner::getPartitioningStyle()
   Example partitioning method definition
   Hashing functions
   Using a view adapter with a partitioner
Contacting IBM . . . . . . . . . . . . . . . . . . 195
Product documentation . . . . . . . . . . . . . . . 197
Product accessibility . . . . . . . . . . . . . . . 199
Notices and trademarks . . . . . . . . . . . . . . 201
Index . . . . . . . . . . . . . . . . . . . . . . . 205
Compiling operators
After you have defined your custom operators, you must compile them. You compile them with the same C++ compilers that must be installed on the engine tier of your IBM InfoSphere Information Server installation. Details of the compilers that are required for different platforms are given in the InfoSphere Information Server requirements at http://www.ibm.com/software/data/infosphere/info-server/overview/requirements.html.
You must make the shared library that contains your custom operator available to the parallel engine. On Windows, add the directory that contains the library to your PATH, or place the library in the parallel engine bin directory; on UNIX or Linux, add the directory to your library search path. You must also map your operator so that it is recognized by the parallel engine. You map your operator by adding a line that describes it to the file IS_install\Server\PXEngine\etc\operator.apt (Windows) or IS_install/Server/PXEngine/etc/operator.apt (UNIX or Linux), where IS_install is the InfoSphere Information Server installation directory, for example c:\IBM\InformationServer or /opt/IBM/InformationServer. You add an entry to this file in the form:
osh_name_of_operator shared_library 1
- Compile your operator code, and include its shared library in the libraries optionally defined by your OSH_PRELOAD_LIBS environment variable.
ExampleOperator example
The ExampleOperator operator is an example of an operator derived from APT_Operator that uses no arguments. The ExampleOperator operator changes the contents of its two record fields and then writes each record to a single output data set.
[Figure: ExampleOperator, with interface schema iField:int32; sField:string and a single output data set]
The following table contains the code for the ExampleOperator. Comments follow the table.
Table 1. APT_Operator Derivation With No Arguments

    #include <apt_framework/orchestrate.h>                            // 1

    class ExampleOperator : public APT_Operator                       // 2
    {
        APT_DECLARE_RTTI(ExampleOperator);                            // 4
        APT_DECLARE_PERSISTENT(ExampleOperator);                      // 5

    public:
        ExampleOperator();

    protected:
        virtual APT_Status describeOperator();                        // 9
        virtual APT_Status runLocally();                              // 10
        virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
            APT_Operator::InitializeContext context);                 // 11
    };

    #define ARGS_DESC "{}"                                            // 12

    APT_DEFINE_OSH_NAME(ExampleOperator, exampleOp, ARGS_DESC);       // 13
    APT_IMPLEMENT_RTTI_ONEBASE(ExampleOperator, APT_Operator);        // 14
    APT_IMPLEMENT_PERSISTENT(ExampleOperator);                        // 15

    ExampleOperator::ExampleOperator()                                // 16
    {}

    APT_Status ExampleOperator::initializeFromArgs_(const APT_PropertyList &args,
        APT_Operator::InitializeContext context)                      // 18
    {
        return APT_StatusOk;
    }

    void ExampleOperator::serialize(APT_Archive& archive, APT_UInt8)  // 22
    {}

    APT_Status ExampleOperator::describeOperator()                    // 24
    {
        setKind(APT_Operator::eParallel);
        setInputDataSets(1);
        setOutputDataSets(1);
        setInputInterfaceSchema("record (iField:int32; sField:string)", 0);
        setOutputInterfaceSchema("record (iField:int32; sField:string)", 0);
        return APT_StatusOk;
    }

    APT_Status ExampleOperator::runLocally()                          // 33
    {
        APT_InputCursor inCur;
        APT_OutputCursor outCur;
        setupInputCursor(&inCur, 0);
        setupOutputCursor(&outCur, 0);

        APT_InputAccessorToInt32 iFieldInAcc("iField", &inCur);
        APT_InputAccessorToString sFieldInAcc("sField", &inCur);
        APT_OutputAccessorToInt32 iFieldOutAcc("iField", &outCur);
        APT_OutputAccessorToString sFieldOutAcc("sField", &outCur);

        while (inCur.getRecord())
        {
            cout << "*sFieldInAcc =" << *sFieldInAcc << endl;
            cout << "*iFieldInAcc =" << *iFieldInAcc << endl;
            *sFieldOutAcc = "XXXXX" + *sFieldInAcc;
            *iFieldOutAcc = *iFieldInAcc + 100;
            cout << "*sFieldOutAcc =" << *sFieldOutAcc << endl;
            cout << "*iFieldOutAcc =" << *iFieldOutAcc << endl;
            outCur.putRecord();
        }
        return APT_StatusOk;
    }
Comments

1      Include the orchestrate.h header file.
2      All operators are derived, directly or indirectly, from APT_Operator.
4      Use the required macro, APT_DECLARE_RTTI, to declare runtime type information for the operator.
5      Use the required macro, APT_DECLARE_PERSISTENT, to declare object persistence for operator objects transmitted to the processing nodes.
9-11   You must override the virtual function initializeFromArgs_ and the two pure virtual functions, describeOperator() and runLocally(). Overrides are in this example.
12     See the header file, install_directory/Server/PXEngine/include/apt_util/argvcheck.h, for documentation on the ARGS_DESC string.
13     Use APT_DEFINE_OSH_NAME to connect the class name to the name used to invoke the operator from osh, and pass your argument description string to DataStage.
14-15  APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement runtime type information and persistent object processing.
16     ExampleOperator::ExampleOperator() defines a default constructor for the operator. All operators must have a public default constructor, even if the constructor is empty.
18     Use the override of initializeFromArgs_() to transfer information from the arguments to the class instance. Because there are no arguments for this example, initializeFromArgs_() simply returns APT_StatusOk.
22     The function serialize() defines complex persistence in a derived class. The serialization operators operator||, operator<<, and operator>> are declared and defined by macros in terms of APT_Persistent::serialize(). There are no data variables in ExampleOperator; therefore, serialize() is empty. When there are data variables, as in the HelloWorldOp example, each variable has its own line. For example: archive || member_variable.
24     Use the override of describeOperator() to describe the configuration information for the operator. This example implementation specifies that:
       - The operator is run in parallel. Parallel mode is the default; therefore it is not necessary to explicitly set parallel mode with setKind().
       - There is one input data set and one output data set.
       - The input and output schemas are as defined by setInputInterfaceSchema() and setOutputInterfaceSchema().
       - The data partitioning method is random.
33     With your override of runLocally(), you define what the operator does at run time. In this example implementation, the lines APT_InputCursor inCur and APT_OutputCursor outCur declare the input and output cursors, and the functions setupInputCursor() and setupOutputCursor() initialize them. The APT_InputAccessorTotype() and APT_OutputAccessorTotype() functions set up the input and output field accessors. The example while statement loops over all the records, and does these tasks for each record:
       - Prints out the initial contents of the two fields in the record.
       - Changes the contents of the fields and writes the new contents to the output record.
       - Prints out the new contents of the fields.
       - Writes the record to output.
HelloWorldOp example
The HelloWorldOp operator is an example of an operator derived from APT_Operator that uses arguments. The following figure illustrates the interface schema for the HelloWorldOp operator:
[Figure: HelloWorldOp, with interface schema outRec:* and a single output data set]
The definition for HelloWorldOp is explained in the following tables. The first table contains the code for the header file, and the second table has the code for the .C file. The operator takes two arguments which determine how many times the string "hello world" is printed and whether it is printed in uppercase or lowercase. The operator simply copies its input to its output.
Chapter 2. Creating operators

Table 2. APT_Operator Derivation With Arguments: Header File

    #include <apt_framework/orchestrate.h>                        // 1

    class HelloWorldOp : public APT_Operator                      // 2
    {
        APT_DECLARE_PERSISTENT(HelloWorldOp);                     // 4
        APT_DECLARE_RTTI(HelloWorldOp);                           // 5

    public:
        HelloWorldOp();                                           // 7
        void setNumTimes(APT_Int32 numTimes);                     // 8
        void setUpperCase(bool uppercase);                        // 9

    protected:
        virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
            APT_Operator::InitializeContext context);             // 11
        virtual APT_Status describeOperator();                    // 12
        virtual APT_Status runLocally();                          // 13

    private:
        APT_Int32 numTimes_;                                      // 15
        bool uppercase_;                                          // 16
    };
1      Include the orchestrate.h header file.
2      All operators are derived, directly or indirectly, from APT_Operator.
4-5    Declare run time type information and persistence.
7      Declare the constructor for this class.
8-9    Declare the C++ initialization methods for this operator. These methods are called from initializeFromArgs_().
11-13  Declare the three virtual functions that must be overridden. The function initializeFromArgs_() is the osh initialization function which makes the operator osh aware. The function describeOperator() specifies the pre-parallel initialization steps, and runLocally() specifies the parallel execution steps.
15-16  Declare the two member variables.
Table 3. APT_Operator Derivation With Arguments: .C File

    #include "hello.h"                                            // 1

    #define HELLO_ARGS_DESC \                                     // 3
        "{uppercase={optional, description='capitalize or not'}," \
        "numtimes={value={type={int, min=1, max=10}, usageName='times'}," \
        "optional, description='number of times to print message'}}"

    APT_DEFINE_OSH_NAME(HelloWorldOp, hello, HELLO_ARGS_DESC);    // 4
    APT_IMPLEMENT_RTTI_ONEBASE(HelloWorldOp, APT_Operator);       // 5
    APT_IMPLEMENT_PERSISTENT(HelloWorldOp);                       // 6

    HelloWorldOp::HelloWorldOp()                                  // 7
        : numTimes_(1),
          uppercase_(false)
    {}

    void HelloWorldOp::setNumTimes(APT_Int32 numTimes)            // 11
    {
        numTimes_ = numTimes;
    }

    void HelloWorldOp::setUpperCase(bool uppercase)               // 15
    {
        uppercase_ = uppercase;
    }

    APT_Status HelloWorldOp::initializeFromArgs_(const APT_PropertyList &args,
        APT_Operator::InitializeContext context)                  // 19
    {
        APT_Status status = APT_StatusOk;
        if (context == APT_Operator::eRun)
            return status;

        for (int i = 0; i < args.count(); i++)
        {
            const APT_Property& prop = args[i];
            if (prop.name() == "numtimes")
                numTimes_ = (int) prop.valueList().getProperty("value", 0).valueDFloat();
            else if (prop.name() == "uppercase")
                uppercase_ = true;
        }
        return status;
    }

    void HelloWorldOp::serialize(APT_Archive& archive, APT_UInt8) // 33
    {
        archive || numTimes_;
        archive || uppercase_;
    }

    APT_Status HelloWorldOp::describeOperator()                   // 38
    {
        setKind(eParallel);
        setInputDataSets(1);
        setOutputDataSets(1);
        setInputInterfaceSchema("record (in:*)", 0);
        setOutputInterfaceSchema("record (out:*)", 0);
        declareTransfer("in", "out", 0, 0);
        return APT_StatusOk;
    }

    APT_Status HelloWorldOp::runLocally()                         // 48
    {
        APT_Status status = APT_StatusOk;
        APT_InputCursor inCur;
        setupInputCursor(&inCur, 0);
        APT_OutputCursor outCur;
        setupOutputCursor(&outCur, 0);

        while (inCur.getRecord() && status == APT_StatusOk)
        {
            transfer(0);
            outCur.putRecord();
            int i;
            for (i = 0; i < numTimes_; i++)
            {
                if (uppercase_)
                    cout << "HELLO, WORLD" << endl;
                else
                    cout << "hello, world" << endl;
            }
        }
        return status;
    }
1      hello.h is the header file for this example operator.
3      A formal description of the operator arguments is given in the HELLO_ARGS_DESC string. This string enables the argv checking facility for an operator.
4      Use APT_DEFINE_OSH_NAME to connect the class name to the name used to invoke the operator from osh and pass your argument description string to DataStage.
5-6    APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement run time type information and persistent object processing.
7      HelloWorldOp::HelloWorldOp() defines a constructor for the operator. All operators must have a public default constructor, even if the constructor is empty.
11-18  The functions setNumTimes() and setUpperCase() are the initialization methods for this operator. The method setNumTimes() sets the number of times to print the string, and the method setUpperCase() sets whether to print in uppercase letters.
19     Override initializeFromArgs_() to traverse the property list created from your argument description string and transfer information from the arguments to the class instance.
33     Override APT_Persistent::serialize() to define complex persistence in a derived class. The method is called before the operator is parallelized to archive the values of its member variables. The method is called again after parallelization to restore those variables from the archive in each parallel copy of the operator.
38     Override describeOperator() to describe the configuration information for the operator. Configuration information specifies whether the operator runs sequentially or in parallel, how many input and output data sets there are, what the input and output schemas are, and what data is transferred out of the operator.
48     Override runLocally() to define what the operator does at run time. In this example implementation, the lines APT_InputCursor inCur and APT_OutputCursor outCur declare the input and output cursors, and the functions setupInputCursor() and setupOutputCursor() initialize them. The while loop copies the input record to the output record and writes it to output. Finally, the "hello, world" string is printed the specified number of times by each node.
addNodeConstraint() or addResourceConstraint()
    Specify any operator constraints. The constraints are appended to the existing constraints.

checkWasSuccessful()
    Indicate whether the operator has been successfully checked.

declareTransfer()
    Specify any data transfers.

getConcreteInputSchema()/getConcreteOutputSchema()
    Return the concrete schema associated with a data set.

lookupCollectionMethod()
    Determine the collection method to be used.

lookupPartitionMethod()
    Determine the partitioning method to be used.

reportError()
    Specify error-description strings.

setAvailableNodes()
    Limit the nodes on which this operator can run. It always clears previous constraints.

setCollectionMethod()
    Specify the collection method for a sequential operator.

setKind()
    Specify parallel or sequential execution of the operator.

setNodeMap()
    Specify how many partitions an operator has and which node each partition runs on.

setPartitionMethod()
    Specify the partitioning method for a parallel operator.

setPreservePartitioningFlag() or clearPreservePartitioningFlag()
    Modify the preserve-partitioning flag in an output data set.

setRuntimeArgs()
    Specify an initialization property list passed to initializeFromArgs_() at run time.

setWorkingDirectory()
    Set the working directory before runLocally() is called.
setKind() Sets the execution mode of an operator to either parallel or sequential. This function can be called only from within describeOperator(). setKind() overrides a call to setRequestedKind(). If a call to setKind() specifies a different execution mode than setRequestedKind(), DataStage issues a warning.
[Figure: a partitioner dividing an input data set into partitions, one for each instance of the parallel operator]
All steps containing a parallel operator will also contain a partitioner to divide the input data set into individual partitions for each processing node in the system. The partitioning method can either be supplied by the framework or defined by you.
[Figure: data flow between Operator 1 (sequential) and Operator 2 (parallel)]
All sequential operators define a collection method specifying how a sequential operator combines the partitions of an input data set for processing by a single node. The collection method can be either supplied by the framework or defined by you.
of an output data set are read and write. When you use a data set as an output, you write results to the records of the output data set.

For both input and output data sets, you must specify:
- The names for the fields accessed within a record of the data set.
- The data type of each accessed field.

To make these specifications, you include calls to two member functions from within APT_Operator::describeOperator(): setInputInterfaceSchema() and setOutputInterfaceSchema(). The following figure is an example of an interface schema for an operator:
In this figure, the interface schema for both the input and the output data sets is the same. Both schemas require the data sets to have at least three fields: two integer fields named field1 and field2 and a floating-point field named field3. Any extra fields in the input data set are dropped by this operator.

You can use view adapters to convert components of the record schema of a data set to match components of the interface schema of an operator.

Consider these variations when you specify interface schemas:
- Schema variables, which use wildcards to include a potentially large number of record fields.
- A dynamic schema, which lets the user of the operator define some or all of the schema.
Schema variables
Use schema variables where input or output schemas might have many fields. The following figure shows how to use variables in both the input and output data set schemas:
By default, a schema variable, denoted by the type *, references an entire record, regardless of the data type of each field in the record. In this example, field1, field2, and field3 are included in the schema variable. You can include schema variables in your specification for either an input or an output interface schema. In the figure above, the operator can take any data set as input as long as that data set contains two integers and one floating-point field with the specified names. The data set can contain more than the required fields. By default, for an operator using schema variables, all fields of the input data set are transferred to the output data set.
In this example, the operator specifies both an input and an output interface schema containing a single schema variable. This figure shows that the operator allows the user to specify two fields of its input interface schema when the operator is instantiated.
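The schema-variable pattern described above can be sketched in a describeOperator() override. This is a sketch only: the operator name FilterOp and the variable names inRec and outRec are illustrative assumptions, while the calls themselves (setInputInterfaceSchema(), setOutputInterfaceSchema(), declareTransfer()) are the ones used by the examples in this chapter.

```cpp
// Sketch: a hypothetical operator whose input interface schema combines
// two required fields with a schema variable, so any input data set that
// contains field1 and field2 is accepted, and the entire record is
// transferred to the output.
APT_Status FilterOp::describeOperator()
{
    setInputDataSets(1);
    setOutputDataSets(1);
    // inRec:* matches all remaining fields of the input record,
    // whatever their names and types.
    setInputInterfaceSchema("record (field1:int32; field2:int32; inRec:*)", 0);
    setOutputInterfaceSchema("record (outRec:*)", 0);
    // Transfer the whole input record to the output record.
    declareTransfer("inRec", "outRec", 0, 0);
    return APT_StatusOk;
}
```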
InfoSphere DataStage provides a number of different partitioning methods, including:

any
    The operator does not care how data is partitioned; therefore, InfoSphere DataStage can partition the data set in any way to optimize the performance of the operator. Any is the default partitioning method. Operators that use a partitioning method of any allow the user to explicitly override the partitioning method. To set a partitioning method, the user assigns a partitioner to a data set used as an input to the operator. The framework then partitions the data set accordingly.

round robin
    This method divides the data set so that the first record goes to the first node, the second record goes to the second node, and so on. When the last node in the system is reached, this method starts back at the first node. This method is useful for resizing the partitions of an input data set that are not equal in size. The round robin method always creates approximately equal-sized partitions.

random
    This method randomly distributes records across all nodes. Like round robin, random partitioning can rebalance the partitions of an input data set to guarantee that each processing node receives an approximately equal-sized partition.

same
    This method performs no repartitioning on the data; the partitions of the previous operator are inherited. This partitioning method is often used within composite operators.

entire
    Every instance of an operator on every processing node receives the complete data set as input. This form of partitioning is useful when you want the benefits of parallel execution, but you also want each instance of the operator to be able to access the entire input data set.

hash by field
    Uses a field in a record as a hash key and partitions the records based on a function of this hash key. You use the class APT_HashPartitioner to implement this partitioning method.

modulus
    Partitioning is based on a key field modulo the number of partitions. This method is like hash by field, but involves simpler computation.

range
    Divides a data set into approximately equal-sized partitions based on one or more partitioning keys. You use the APT_RangePartitioner class to implement this partitioning method.

DB2
    Partitions an input data set in the same way that DB2 partitions it. For example, if you use this method to partition an input data set containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node. Any reads and writes of the DB2 table entail no network activity.

other
    You can define a custom partitioning operator by deriving a class from the C++ APT_Partitioner class. Other is the partitioning method for operators that use custom partitioners.
By default, operators use the partitioning method any. The any partitioning method allows operator users to prefix the operator with a partitioning operator to control partitioning. For example, a user could insert the hash partitioner operator in the stage before the derived operator. To set an explicit partitioning method for the operator that cannot be overridden, you must include a call to APT_Operator::setPartitionMethod() within APT_Operator::describeOperator(). Another option is to define your own partitioner for each input to the operator. To do so, you must derive a partitioner class from APT_Partitioner.
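As a sketch, locking the partitioning method looks like the following. The enumerator name APT_Operator::eRandom is an assumption based on the eParallel naming style used elsewhere in this chapter; check the APT_Operator header for the exact PartitionMethod values before using it.

```cpp
// Sketch: force a partitioning method from within describeOperator(),
// so that users of the operator cannot override it with their own
// partitioner.
APT_Status MyOp::describeOperator()
{
    setKind(APT_Operator::eParallel);
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema("record (inRec:*)", 0);
    setOutputInterfaceSchema("record (outRec:*)", 0);
    // eRandom is an assumed enumerator name; this call fixes the
    // partitioning method for input 0.
    setPartitionMethod(APT_Operator::eRandom, 0);
    return APT_StatusOk;
}
```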
Set and get the partitioning style of an operator input data set
To set and get the partitioning style of an operator input data set, use the APT_Operator functions below:
void setInputPartitioningStyle(APT_PartitioningStyle::partitioningStyle, int inputDS)
APT_PartitioningStyle::partitioningStyle getPartitioningStyle(int inputDS)
void setOutputPartitioningStyle(APT_PartitioningStyle::partitioningStyle, int outputDS)

These functions allow you to:
- Direct InfoSphere DataStage to insert appropriate partitioners in a data flow.
- Require that two inputs be identically partitioned.
- Determine whether two inputs are identically partitioned.
Set and get the partitioning keys of an operator input data set
To set and get the partitioning keys of an operator input data set, use the APT_Operator functions below:
void setInputPartitioningKeys(APT_PropertyList keys, int inputDS);
APT_PropertyList APT_Operator::getPartitioningKeys(int input)
For example:
{ key={value=lastname}, key={value=balance} }
This function must be overridden by custom partitioners to define the partitioning style into which they fit.
Get the name of the partitioner last used to partition an operator input
To get the name of the partitioner last used to partition an operator input, use this APT_Operator function:
APT_String APT_Operator::getPartitioningName(int input)
Note: An operator that calls setPartitionMethod() on a data set can call setInputSortKeys() and setOutputSortKeys(), but cannot call setInputPartitioningStyle() and setInputPartitioningKeys() on an input data set, or setOutputPartitioningStyle() and setOutputPartitioningKeys() on an output data set.
round robin
    This method reads a record from the first input partition, then from the second partition, and so on. When the last processing node in the system is reached, it starts over.

ordered
    This method reads all records from the first partition, then all records from the second partition, and so on. This collection method preserves any sorted order in the input data set.

sorted merge
    This method reads records in an order based on one or more fields of the record. The fields used to define record order are called collecting keys. You use the sortmerge collection operator to implement this method.

other
    You can define a custom collection method by deriving a class from APT_Collector. Operators that use custom collectors have a collection method of other.
By default, sequential operators use the collection method any. The any collection method allows operator users to prefix the operator with a collection operator to control the collection method. For example, a user could insert the ordered collection operator in a step before the derived operator. To set an explicit collection method for the operator that cannot be overridden, you must include a call to APT_Operator::setCollectionMethod() within APT_Operator::describeOperator(). You can also define your own collection method for each operator input. To do so, you derive a collector class from APT_Collector.
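A fixed collection method can be sketched the same way as a fixed partitioning method. This is a sketch only: the enumerator names eSequential and eCollectOrdered are assumptions modeled on the eParallel naming style used in this chapter, so consult the APT_Operator header for the exact values.

```cpp
// Sketch: a sequential operator that fixes its collection method in
// describeOperator(), so users cannot override it with a collection
// operator of their own.
APT_Status MySeqOp::describeOperator()
{
    setKind(APT_Operator::eSequential);  // collection applies to sequential operators
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema("record (inRec:*)", 0);
    setOutputInterfaceSchema("record (outRec:*)", 0);
    // Read all records from partition 0, then partition 1, and so on,
    // preserving any sort order in the input data set.
    setCollectionMethod(APT_Operator::eCollectOrdered, 0);
    return APT_StatusOk;
}
```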
Using cursors
For your operators to process an input data set and write results to an output data set, you need a mechanism for accessing the records and record fields that make up a data set.
There are three mechanisms that work together for accessing the records and record fields of a data set: cursors, subcursors, and field accessors. Cursors allow you to reference specific records in a data set, and field accessors let you access the individual fields in those records. You use cursors and field accessors from within your override of the APT_Operator::runLocally() function. You use subcursors only with vectors of subrecords. A subcursor allows you to identify the current element of the vector.

You use two types of cursors with data sets: input cursors and output cursors. Input cursors provide read-only access to the records of an input data set; output cursors provide read/write access to the current record of an output data set.

To process an input data set, you initialize the input cursor to point to the current input record. The cursor advances through the records until all have been processed. Once a cursor has moved beyond a record of an input data set, that record can no longer be accessed.

For the output data set, the output cursor initially points to the current output record and advances through the rest of the records. As with input data sets, once a cursor has moved beyond a record of an output data set, that record can no longer be accessed. The following figure shows an operator with a single input and a single output data set:
[Figure: an operator with a single input data set and a single output data set]
There are two classes that represent cursors:
- APT_InputCursor defines a cursor for an input data set.
- APT_OutputCursor defines a cursor for an output data set.

When you create an input cursor, it does not reference a record. You must call APT_InputCursor::getRecord() from within runLocally() to initialize the cursor to point to a specific record. When you have finished processing that record, you again call APT_InputCursor::getRecord() to advance the cursor to the next record in the input data set, making it the current input record. When no more input records are available, APT_InputCursor::getRecord() returns false. Typically, you use a while loop to determine when APT_InputCursor::getRecord() returns false.
When you create an output cursor, it references the first record in the output data set. The record output fields are set to the following default values:
- Nullable fields are set to null.
- Integers = 0.
- Floats = 0.
- Dates = January 1, 0001.
- Decimals = 0.
- Times = 00:00:00 (midnight).
- Timestamps = 00:00:00 (midnight) on January 1, 0001.
- The length of variable-length string and raw fields is set to 0.
- The characters of a fixed-length string are set to null (0x00) or to the pad character, if specified.
- The bytes of fixed-length raw fields are set to zero.
- The tag of a tagged aggregate is set to 0 to set the data type to be that of the first field of the tagged aggregate.
- The length of variable-length vector fields is set to 0.

After you write to that output record, you must call APT_OutputCursor::putRecord() to advance the cursor to the next record in the output data set, making it the current output record.

The following table shows an example of a while loop that processes the records of a single input data set and writes its results to a single output data set. You would place this loop, along with the code to define and initialize the cursors, within the override of the runLocally() function.
Table 4. Example of while loop

    APT_Status AddOperator::runLocally()
    {
        APT_InputCursor inCur;               // 3
        APT_OutputCursor outCur;             // 4
        setupInputCursor(&inCur, 0);         // 5
        setupOutputCursor(&outCur, 0);       // 6
        while (inCur.getRecord())            // 7
        {
            // body of the loop              // 9
            outCur.putRecord();              // 10
        }
        return APT_StatusOk;
    }
3      Define inCur, an instance of APT_InputCursor, as the input cursor for the first data set input to this operator.
4      Define outCur, an instance of APT_OutputCursor, as the output cursor for the first data set output by this operator.
5      Use APT_Operator::setupInputCursor() to initialize inCur. Input data sets are numbered starting from 0.
6      Use APT_Operator::setupOutputCursor() to initialize outCur. Output data sets are numbered starting from 0.
7      Use APT_InputCursor::getRecord() to advance the input cursor to the next input record. You must call this function before attempting to process an input record because an input cursor does not initially reference a valid record. APT_InputCursor::getRecord() returns false when there are no more records in the input data set, terminating the while loop.
9      Process the record. Processing includes writing any results to the current output record. Do not call APT_OutputCursor::putRecord() until after you have written to the first output record (unless you want default values), because the output cursor initially points to the first empty record in an output data set.
10     Use APT_OutputCursor::putRecord() to update the current output record, then advance the output cursor to the next output record.
Not all operators produce an output record for each input record. Also, operators can produce more output records than there are input records. An operator can process many input records before computing a single output record. You call putRecord() only when you have completed processing an output record, regardless of the number of input records you process between calls to putRecord().
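For example, a runLocally() that emits one output record per group of input records calls putRecord() only when a group is complete. The following is a sketch under stated assumptions: the field names value and total and the fixed group size of 10 are illustrative, not part of any real operator in this guide.

```cpp
// Sketch: sum the "value" field over groups of 10 input records and
// write one "total" output record per completed group.
APT_Status GroupSumOp::runLocally()
{
    APT_InputCursor inCur;
    APT_OutputCursor outCur;
    setupInputCursor(&inCur, 0);
    setupOutputCursor(&outCur, 0);

    APT_InputAccessorToInt32 valueInAcc("value", &inCur);
    APT_OutputAccessorToInt32 totalOutAcc("total", &outCur);

    APT_Int32 total = 0;
    int count = 0;
    while (inCur.getRecord())
    {
        total += *valueInAcc;
        if (++count == 10)
        {
            *totalOutAcc = total;
            outCur.putRecord();    // one call per completed output record
            total = 0;
            count = 0;
        }
    }
    if (count > 0)                 // flush the final partial group
    {
        *totalOutAcc = total;
        outCur.putRecord();
    }
    return APT_StatusOk;
}
```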
AddOperator
    input schema:  field1:int32; field2:int32
    output schema: field1:int32; field2:int32; total:int32

Figure 10. Sample operator and interface schemas
This operator adds two fields of a record and stores the sum in a third field. For each of the components of the input and output interface schemas, you define a single field accessor. In this case, therefore, you need two input accessors for the input interface schema and three output accessors for the output interface schema. This example uses field accessors to explicitly copy field1 and field2 from an input record to the corresponding fields in an output record. If the input data set had a record schema that defined more than these two fields, all other fields would be dropped by AddOperator and not copied to the output data set. The following table provides an example override of APT_Operator::describeOperator().
Table 5. Example override of APT_Operator::describeOperator()

    APT_Status AddOperator::describeOperator()
    {
        setInputDataSets(1);                                                 // 3
        setOutputDataSets(1);                                                // 4
        setInputInterfaceSchema("record (field1:int32; field2:int32;)", 0);  // 5
        setOutputInterfaceSchema(
            "record (field1:int32; field2:int32; total:int32;)", 0);         // 6
        return APT_StatusOk;
    }
3. Set the number of input data sets.
4. Set the number of output data sets.
5. Because input data sets are numbered starting from 0, specify the interface schema of input 0. You can simply pass a string containing the interface schema as an argument to APT_Operator::setInputInterfaceSchema().
6. Specify the interface schema of output 0, the first output data set.
Field accessors normally are defined as local variables in APT_Operator::runLocally(). To access the values of fields defined in an interface schema, you must create input accessors and output accessors in the APT_Operator::runLocally() function of the operator. The following table provides an example override of APT_Operator::runLocally().
Table 6. Override of APT_Operator::runLocally()

APT_Status AddOperator::runLocally()
{
    APT_InputCursor inCur;
    APT_OutputCursor outCur;
    setupInputCursor(&inCur, 0);
    setupOutputCursor(&outCur, 0);
    APT_InputAccessorToInt32 field1InAcc("field1", &inCur);     // 7
    APT_InputAccessorToInt32 field2InAcc("field2", &inCur);     // 8
    APT_OutputAccessorToInt32 field1OutAcc("field1", &outCur);  // 9
    APT_OutputAccessorToInt32 field2OutAcc("field2", &outCur);
    APT_OutputAccessorToInt32 totalOutAcc("total", &outCur);    // 11
    while (inCur.getRecord())                                   // 12
    {
        *totalOutAcc = *field1InAcc + *field2InAcc;             // 14
        *field1OutAcc = *field1InAcc;
        *field2OutAcc = *field2InAcc;                           // 16
        outCur.putRecord();                                     // 17
    }
    return APT_StatusOk;
}
7 - 8. Define read-only field accessors for the fields of the input interface schema.
9 - 11. Define read/write field accessors for the fields of the output interface schema.
12. Use APT_InputCursor::getRecord() to advance the input data set to the next input record.
14. Use APT_InputAccessorToInt32::operator* and APT_OutputAccessorToInt32::operator* to dereference the field accessors, in order to access the values of the record fields in both the input and the output data sets. Note: You can also use the equivalent member functions APT_InputAccessorToInt32::value() and APT_OutputAccessorToInt32::setValue() to access the fields of input and output records. Here is line 14 rewritten using these functions:
*totalOutAcc = field1InAcc.value() + field2InAcc.value();
17. Use APT_OutputCursor::putRecord() to update the current output record, then advance the output data set to the next output record.
Background
The following figure shows an operator containing schema variables in both its input and its output interface schemas:
This figure shows the same operator with an input data set:
[Figure: operator with input interface schema field1:int32; inRec:*; and output interface schema outRec:*;, reading an input data set whose record schema is fName:string; lName:string; age:int32;.]
By default, a schema variable in an input interface schema corresponds to an entire record of the input data set, including the record schema. In this example:
"inRec:*" "fName:string; lName:string; age:int32;"
Performing an operation on inRec corresponds to performing an operation on an entire record and the record schema of the input data set. When a transfer is declared, an output schema variable, by default, assumes the record schema of the variable in the input interface. As you can see in the previous figure, outRec assumes the schema of inRec, which corresponds to the record schema of the input data set. Therefore, the output interface schema of the operator is:
"outRec:*" = "inRec:*" "fName:string; lName:string; age:int32;"
By using a variable to reference a record, operators can copy, or transfer, an entire record from an input to an output data set. The process of copying records from an input to an output data set then becomes much simpler because you do not have to define any field accessors.
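The effect of a schema-variable transfer can be sketched outside the framework (a hypothetical analogy in which a record is modeled as a string map; this is not the APT API):

```cpp
#include <cassert>
#include <map>
#include <string>

using Record = std::map<std::string, std::string>;

// Hypothetical model of a schema-variable transfer: the operator copies
// the whole input record to the output (the transfer), then writes only
// the fields it explicitly declares, leaving every other field intact.
Record transferAndAnnotate(const Record& in) {
    Record out = in;        // transfer(): copy the entire record
    out["flag"] = "seen";   // explicit output-interface field
    return out;
}
```

Fields the operator knows nothing about (fName, age, and so on) pass through untouched, which is exactly why no field accessors are needed for them.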
[Figure 15. Record transfer using schema variables]
Many operators have a more complicated interface schema than the one shown in the previous figure. The next figure shows an operator with three elements in its input interface schema and two elements in its output interface schema:
[Figure: NewOperator, with input interface schema a:string; c:int32; d:int16; inRec:*; and output interface schema x:int32; y:int32; outRec:*;, declaring a transfer from inRec to outRec.]
In this example, the operator adds two fields to the output data set and calculates the values of those fields. For the operator to complete this task, you must define field accessors for the three defined fields in the input data set as well as the two new fields in the output data set. You cannot define accessors for the fields represented by an interface variable. The total output interface schema of this operator is the combination of the schema variable outRec and the two new fields, as shown:

"x:int32; y:int32; outRec:*;"
The order of fields in the interface schema determines the order of fields in the records of the output data set. In this example, the two new fields are added to the front of the record. If the output interface schema is defined as:
"outRec:*; x:int32; y:int32;"
the two new fields would be added to the end of each output record. A schema variable can contain fields with the same name as fields explicitly stated in an output interface schema. For example, if the input data set in the previous example contained a field named x, the schema variable would also contain a field named x, as shown:
[Figure 18. Schema field with variable x: both the explicit output interface schema and the schema variable outRec contain a field named x, creating a name conflict.]
Name conflicts are resolved by dropping all fields with the same name as a previous field in the schema and issuing a warning. In this example, the second occurrence of field x, which is contained in the schema variable, is dropped.
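The drop-duplicates rule can be modeled in plain C++ (a hypothetical sketch, not framework code): walk the combined output schema in order and keep only the first occurrence of each field name:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the name-conflict rule: traverse the combined
// output schema in order and drop any field whose name duplicates an
// earlier one, as the framework does when a schema variable repeats an
// explicitly stated field.
std::vector<std::string> resolveConflicts(const std::vector<std::string>& fields) {
    std::vector<std::string> kept;
    for (const auto& f : fields) {
        bool duplicate = false;
        for (const auto& k : kept)
            if (k == f) { duplicate = true; break; }
        if (!duplicate)
            kept.push_back(f);   // first occurrence wins
        // else: a real implementation would issue a warning here
    }
    return kept;
}
```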
Performing transfers
To perform a transfer, your code must perform the following steps:
1. Define the schema variables affected by the transfer using the APT_Operator::declareTransfer() function within the body of APT_Operator::describeOperator().
2. Implement the transfer from within the body of APT_Operator::runLocally() using the function APT_Operator::transfer().

The code in the following table shows the describeOperator() override for NewOperator, the operator introduced in the previous figure:
Table 7. Override of describeOperator() with a transfer

APT_Status NewOperator::describeOperator()
{
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema("record (a:string; c:int32; d:int16; inRec:*;)", 0);  // 5
    setOutputInterfaceSchema("record (x:int32; y:int32; outRec:*;)", 0);          // 6
    declareTransfer("inRec", "outRec", 0, 0);                                     // 7
    return APT_StatusOk;
}
5. Specify the interface schema of input 0 (the first data set). You can simply pass a string containing the interface schema as an argument to APT_Operator::setInputInterfaceSchema().
6. Specify the interface schema of output 0.
7. Use APT_Operator::declareTransfer() to define a transfer from inRec to outRec.
Declaring transfers
The function declareTransfer() returns an index identifying this transfer. The first call to declareTransfer() returns 0, the next call returns 1, and so on. You can store this index value as a data member, then pass the value to APT_Operator::transfer() in the runLocally() function to perform the transfer.
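The index bookkeeping can be pictured with a minimal registry class (an illustrative sketch, not the APT implementation):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model of declareTransfer() bookkeeping: each declaration
// is appended to a list and its zero-based position is returned, so the
// first call returns 0, the second returns 1, and so on.
class TransferRegistry {
public:
    int declareTransfer(int inputDS, int outputDS) {
        transfers_.push_back({inputDS, outputDS});
        return static_cast<int>(transfers_.size()) - 1;
    }
private:
    std::vector<std::pair<int, int>> transfers_;
};
```

An operator with several schema variables would store each returned index in a data member and pass it to the matching transfer() call.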
Because this transfer is the only transfer defined for this operator, the index is not stored. You could, however, define an operator that takes more than one data set as input, or one that uses multiple schema variables. If you do, you can define multiple transfers and, therefore, store the returned index. With the transfer defined, you implement it. Typically, you call transfer() just before you call putRecord(). The code in the following table is the runLocally() override for NewOperator:
Table 8. runLocally() override for NewOperator

APT_Status NewOperator::runLocally()
{
    APT_InputCursor inCur;
    APT_OutputCursor outCur;
    // define accessors
    // initialize cursors
    // initialize accessors
    while (inCur.getRecord())    // 5
    {
        // access input record using input accessors
        // update any fields in the output record using output accessors
        transfer(0);             // 6
        outCur.putRecord();      // 7
    }
    return APT_StatusOk;
}
5. Use APT_InputCursor::getRecord() to advance the input cursor to the next input record.
6. After processing the input record and writing any fields to the output record, use APT_Operator::transfer() to perform the transfer. This function copies the current input record to the current output record.
7. Use APT_OutputCursor::putRecord() to update the current output record and advance the output data set to the next output record.
Note: For more efficient processing, use APT_Operator::transferAndPutRecord(), which is equivalent to deferredTransfer() followed by a putRecord().
[Figure 19. Schema variables in a filter operator: FilterOperator, with input interface schema key:string; inRec:*; and output interface schema outRec:*;, writing to an output data set.]
This operator uses a single field accessor to determine the value of key. Because the operator does not modify the record during the filter operation, you need not define any output field accessors. To iterate through the records of the output data set, you must still define an output cursor. FilterOperator uses the transfer mechanism to copy a record without modification from the input data set to the output data set. If key is equal to "REMOVE", the operator does not transfer that record. FilterOperator is derived from APT_Operator, as shown in the following table:
Table 9. Filter operator

#include <apt_framework/orchestrate.h>

class FilterOperator : public APT_Operator
{
    APT_DECLARE_RTTI(FilterOperator);
    APT_DECLARE_PERSISTENT(FilterOperator);
public:
    FilterOperator();
protected:
    virtual APT_Status describeOperator();
    virtual APT_Status runLocally();
    virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
        APT_Operator::InitializeContext context);
};

#define ARGS_DESC "{}"                                           // 12
APT_DEFINE_OSH_NAME(FilterOperator, FilterOperator, ARGS_DESC);  // 13
APT_IMPLEMENT_RTTI_ONEBASE(FilterOperator, APT_Operator);        // 14
APT_IMPLEMENT_PERSISTENT(FilterOperator);                        // 15
FilterOperator::FilterOperator() {}                              // 16

APT_Status FilterOperator::initializeFromArgs_(const APT_PropertyList &args,
    APT_Operator::InitializeContext context)                     // 18
{
    return APT_StatusOk;
}
Table 9. Filter operator (continued)

APT_Status FilterOperator::describeOperator()                    // 22
{
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema("record (key:string; inRec:*;)", 0); // 26
    setOutputInterfaceSchema("record (outRec:*;)", 0);           // 27
    declareTransfer("inRec", "outRec", 0, 0);                    // 28
    return APT_StatusOk;
}

APT_Status FilterOperator::runLocally()
{
    APT_InputCursor inCur;                                       // 33
    APT_OutputCursor outCur;
    setupInputCursor(&inCur, 0);
    setupOutputCursor(&outCur, 0);                               // 36
    APT_InputAccessorToString keyInAcc("key", &inCur);           // 37
    while (inCur.getRecord())                                    // 38
    {
        if (*keyInAcc != "REMOVE")                               // 40
        {
            transfer(0);                                         // 42
            outCur.putRecord();                                  // 43
        }
    }
    return APT_StatusOk;
}

void FilterOperator::serialize(APT_Archive& archive, APT_UInt8) {}  // 48
12. See the header file argvcheck.h for documentation on the ARGS_DESC string.
13. With APT_DEFINE_OSH_NAME, you connect the class name to the name used to invoke the operator from osh and pass your argument description string. See osh_name.h for documentation on this macro.
14 - 15. APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement run time type information and persistent object processing.
16. Define the default constructor for FilterOperator.
18. With your override of initializeFromArgs_(), you transfer information from the arguments to the class instance, making it osh aware. Because there are no arguments for this example, initializeFromArgs_() simply returns APT_StatusOk. See the header file operator.h for documentation on this function.
22. The describeOperator() function defines the input and output interface schema of the operator, as well as the transfer from inRec to outRec.
26. Specify the interface schema of input 0 (the first input data set). You can pass a string containing the interface schema as an argument to APT_Operator::setInputInterfaceSchema().
Chapter 2. Creating operators
27. Specify the interface schema of output 0.
28. Use APT_Operator::declareTransfer() to define a transfer from inRec to outRec. Because this transfer is the only transfer defined for this operator, the index is not stored.
33 - 36. Define and initialize the input and output cursors.
37. Define a read-only field accessor for key in the input interface schema. Because this operator does not access any fields of an output record, it does not define any output field accessors.
38. Use APT_InputCursor::getRecord() to advance the input cursor to the next input record. You must call this function before attempting to process an input record, because an input cursor does not initially reference a valid record.
40. Determine whether key of an input record is equal to "REMOVE". If it is not, transfer the record and update the output data set to the next empty record. If key is equal to "REMOVE", bypass the transfer and get the next input record.
42. Use APT_Operator::transfer() to perform the transfer.
43. Use APT_OutputCursor::putRecord() to update the current output record and advance the output cursor to the next output record.
48. FilterOperator does not define any data members; therefore, serialize() is empty. You must provide serialize(), even if it is an empty function.
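Stripped of the framework, the record-by-record logic of FilterOperator can be sketched as ordinary C++ (a hypothetical analogue in which vectors stand in for data sets):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record: the key field plus the rest of the record, which
// stands in for the fields carried by the schema variable inRec.
struct Rec {
    std::string key;
    std::string payload;
};

// Stand-alone analogue of FilterOperator's runLocally(): records whose
// key equals "REMOVE" are skipped; all others are transferred unchanged
// to the output.
std::vector<Rec> filterRecords(const std::vector<Rec>& input) {
    std::vector<Rec> output;
    for (const auto& rec : input) {   // getRecord() analogue
        if (rec.key != "REMOVE") {
            output.push_back(rec);    // transfer() + putRecord() analogue
        }
    }
    return output;
}
```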
[Figure: CalculateInterestOperator, with input interface schema accountType:int8; inRec:*; for the customer data and output interface schema interestRate:dfloat; outRec:*;.]
The input interface schema of the lookup table specifies two fields: an account type and an interest rate. The input interface schema of customerData specifies a single field containing the account type as well as a schema variable. The output interface schema of the operator specifies a schema variable and the new field interestRate. The interestRate field is the field that the operator prepends to each record of customerData. The following figure shows CalculateInterestOperator with its input and output data sets:
[Figure 21. CalculateInterestOperator, with input interface schema accountType:int8; inRec:*; and output interface schema interestRate:dfloat; outRec:*;. The outData data set schema is fName:string; lName:string; address:string; accountType:int8; interestRate:dfloat;.]
CalculateInterestOperator is derived from APT_Operator, as shown in the code in the following table:
Table 10. Example of a two-input operator

#include <apt_framework/orchestrate.h>

class CalculateInterestOperator : public APT_Operator
{
    APT_DECLARE_RTTI(CalculateInterestOperator);
    APT_DECLARE_PERSISTENT(CalculateInterestOperator);
public:
    CalculateInterestOperator();
protected:
    virtual APT_Status describeOperator();
    virtual APT_Status runLocally();
    virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
        APT_Operator::InitializeContext context);
};

#define ARGS_DESC "{}"                                                     // 12
APT_DEFINE_OSH_NAME(CalculateInterestOperator, CalculateInterestOperator,
                    ARGS_DESC);                                            // 13
APT_IMPLEMENT_RTTI_ONEBASE(CalculateInterestOperator, APT_Operator);       // 14
APT_IMPLEMENT_PERSISTENT(CalculateInterestOperator);                       // 15
CalculateInterestOperator::CalculateInterestOperator() {}                  // 16

APT_Status CalculateInterestOperator::initializeFromArgs_(
    const APT_PropertyList &args, APT_Operator::InitializeContext context) // 18
{
    return APT_StatusOk;
}

APT_Status CalculateInterestOperator::describeOperator()
{
    setKind(APT_Operator::eParallel);                                      // 24
    setInputDataSets(2);                                                   // 25
    setOutputDataSets(1);
    setPartitionMethod(APT_Operator::eEntire, 0);                          // 27
    setPartitionMethod(APT_Operator::eAny, 1);                             // 28
    setInputInterfaceSchema("record"
        "(accountType:int8; interestRate:dfloat;)", 0);                    // 29
    setInputInterfaceSchema("record"
        "(accountType:int8; inRec:*;)", 1);                                // 30
    setOutputInterfaceSchema("record"
        "(interestRate:dfloat; outRec:*;)", 0);                            // 31
    declareTransfer("inRec", "outRec", 1, 0);                              // 32
    return APT_StatusOk;
}

APT_Status CalculateInterestOperator::runLocally()
{
Table 10. Example of a two-input operator (continued)

    APT_InputCursor inCur0;                                            // 37
    APT_InputCursor inCur1;                                            // 38
    APT_OutputCursor outCur0;                                          // 39
    setupInputCursor(&inCur0, 0);
    setupInputCursor(&inCur1, 1);
    setupOutputCursor(&outCur0, 0);
    APT_InputAccessorToInt8 tableAccountInAcc("accountType", &inCur0);
    APT_InputAccessorToDFloat tableInterestInAcc("interestRate", &inCur0);
    APT_InputAccessorToInt8 customerAcctTypeInAcc("accountType", &inCur1);
    APT_OutputAccessorToDFloat interestOutAcc("interestRate", &outCur0);
    struct interestTableEntry                                          // 47
    {
        APT_Int8 aType;
        APT_DFloat interest;
    };
    interestTableEntry lookupTable[10];                                // 52
    for (int i = 0; i < 10; ++i)                                       // 53
    {
        bool gotRecord = inCur0.getRecord();
        APT_ASSERT(gotRecord);
        lookupTable[i].aType = *tableAccountInAcc;
        lookupTable[i].interest = *tableInterestInAcc;
    }
    APT_ASSERT(!inCur0.getRecord());
    while (inCur1.getRecord())                                         // 61
    {
        int i;
        for (i = 0; i < 10; i++)
        {
            if (lookupTable[i].aType == *customerAcctTypeInAcc)
            {
                *interestOutAcc = lookupTable[i].interest;             // 68
                transfer(0);                                           // 69
                outCur0.putRecord();                                   // 70
            }
        }
    }
    return APT_StatusOk;
}

void CalculateInterestOperator::serialize(APT_Archive& archive, APT_UInt8) {}  // 74
12. See the header file argvcheck.h.
13. With APT_DEFINE_OSH_NAME, you connect the class name to the name used to invoke the operator from osh and pass your argument description string. See osh_name.h for documentation on this macro.
14 - 15. APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement run-time type information and persistent object processing.
16. Define the default constructor for CalculateInterestOperator.
18. With your override of initializeFromArgs_(), you transfer information from the arguments to the class instance. Because there are no arguments for this example, initializeFromArgs_() returns APT_StatusOk. See the header file operator.h for documentation on this function.
24. Use APT_Operator::setKind() to explicitly set the execution mode to parallel. Calling this function means that the operator cannot be run sequentially.
25. Use APT_Operator::setInputDataSets() to set the number of input data sets to 2.
27. Use APT_Operator::setPartitionMethod() to set the partitioning method of the lookup table to entire. This partitioning method specifies that every instance of this operator on every processing node receives the complete lookup table as input. This data set requires entire partitioning because all instances of this operator require the complete table as input.
28. Set the partitioning method of input 1, the customer data set. Because this operator is a parallel operator and the records in the data set are not related to each other, choose a partitioning method of any.
29. Specify the interface schema of input 0, the lookup table.
30. Specify the interface schema of input 1, the customer data set.
31. Specify the interface schema of output 0.
32. Declare a transfer from input data set 1, the customer data set, to the output data set.
37. Define an input cursor for data set 0, the lookup table.
38. Define an input cursor for data set 1, the customer data.
39. Define an output cursor for data set 0.
47. Define a structure representing an entry in the lookup table. Each record in the lookup table has two fields: an account type and the daily interest rate for that type of account.
52. Define lookupTable, an array of 10 elements, to hold the lookup table.
53. Use a for loop to read in each record of the table to initialize lookupTable. For example purposes, the number of elements in the array and the loop counter are hard-coded, but this practice is not recommended.
61. Use a while loop to read in each record of the customer data. Compare the account type of the input record to each element in the lookup table to determine the daily interest rate corresponding to the account type.
68. Read the matching interest rate from the table and add it to the current output record.
69. Transfer the entire input record from the customer data set to the output data set.
70. Update the current output record, then advance the output data set to the next output record.
74. CalculateInterestOperator does not define any data members; therefore, serialize() is empty. You must provide serialize(), even if it is an empty function.
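The table-build-then-probe pattern at the heart of this operator can be sketched without the framework (a hypothetical analogue; the types and names here are illustrative):

```cpp
#include <cassert>
#include <vector>

// Hypothetical analogue of CalculateInterestOperator's lookup table:
// one entry per account type, built from the first (entire-partitioned)
// input before the customer records are processed.
struct RateEntry {
    int accountType;
    double rate;
};

// Probe the table for the rate matching a customer's account type,
// mirroring the inner for loop of runLocally().
double lookupRate(const std::vector<RateEntry>& table, int accountType) {
    for (const auto& e : table)
        if (e.accountType == accountType)
            return e.rate;
    return 0.0;   // no match: the caller decides how to handle this
}
```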
[Figure: ADynamicOperator, with a partially user-defined input interface schema ending in the schema variable inRec:*; and output interface schema outRec:*;, writing to an output data set.]
This operator specifies a schema variable for both the input and the output interface schema. In addition, when you instantiate the operator, you must specify two fields of its input interface schema. To create an operator with such a dynamic interface schema, you typically provide a constructor that takes an argument defining the schema. This constructor can take a single argument defining a single interface field, an array of arguments defining multiple interface fields, or any combination of arguments specific to your derived operator. The constructor for the operator shown in the previous figure has the following form:
ADynamicOperator (char * inputSchema);
The constructor for an operator that supports a dynamic interface must make the interface schema definition available to the APT_Operator::describeOperator() function, which contains calls to:
- setInputInterfaceSchema() to define the input interface schema
- setOutputInterfaceSchema() to define the output interface schema

For example:
setInputInterfaceSchema("record(a:string; b:int64; inRec:*;)", 0); setOutputInterfaceSchema("record (outRec:*;)", 0);
These functions must include any schema components specified in the constructor or by member functions of the operator. APT_Operator::runLocally() defines the cursors and field accessors used by the operator to access the records and record fields of the data sets processed by an operator. Because users of the operator can define field names and data types for the interface schema of an operator, APT_Operator::runLocally() must be able to create the corresponding accessors for those fields.
One way to achieve this aim is by passing a schema definition to the constructor for the field accessor class APT_MultiFieldAccessor. Note: A more efficient technique for defining run-time multifield accessors is demonstrated in Chapter 15, Advanced features, on page 175. The technique uses the APT_InputAccessorBase class to set up an array of input accessors. In the following figure, an APT_MultiFieldAccessor object is used to access three components of an input interface schema:
[Figure 23. Multifield accessor: an APT_MultiFieldAccessor object containing accessor 1, accessor 2, and accessor 3, one for each of the fields a, b, and c of the input interface schema a:type; b:type; c:type; inRec:*;, with output interface schema outRec:*;.]
This figure shows an APT_MultiFieldAccessor object containing three accessors, one for each possible component of the input interface schema. An APT_MultiFieldAccessor object is an array of accessors. In this example, the array is three elements long, with one element for each of the three user-specified fields in the input interface schema. There is a current accessor location, which you can use to read a field of an input record or read and write the field of an output record. You use two member functions, APT_MultiFieldAccessor::previousField() and APT_MultiFieldAccessor::nextField(), to change the location of the current accessor in the array. You must initialize the APT_MultiFieldAccessor object before you can use it to access record fields. The following statements define and initialize an APT_MultiFieldAccessor object as part of the runLocally() override of a dynamic operator:
APT_MultiFieldAccessor inAccessor(inputInterfaceSchema(0));
inAccessor.setup(&inCur);
The first line defines inAccessor, an instance of APT_MultiFieldAccessor. This constructor takes a single argument containing a schema definition. The second line uses APT_MultiFieldAccessor::setup() to initialize the multifield accessor to all components of the input interface schema, except schema variables. You must specify the input cursor used to access the current input record. APT_MultiFieldAccessor lets you access any field in a record schema by using these two steps:
1. Determine the data types of the fields you want to access.
2. Read the field values.

For example, to access the first field of the input interface schema, you use the following statements:
switch (inAccessor.type())
{
    case APT_AccessorBase::eInt8:
        // access field
        break;
    case APT_AccessorBase::eUInt8:
        // access field
        break;
    case APT_AccessorBase::eInt16:
        // access field
        break;
    // ... cases for the remaining data types
    default:
        break;
}
To read the field values, you use the following APT_MultiFieldAccessor member functions:
- getInt8()
- getUInt8()
- getInt16()
- getUInt16()
- getInt32()
- getUInt32()
- getInt64()
- getUInt64()
- getSFloat()
- getDFloat()
- getDecimal()
- getSimpleString()
- getStringField()
- getUStringField()
- getRawField()
- getDate()
- getTime()
- getTimeStamp()
- getGeneric()
getSimpleString() returns a copy of the field, and getStringField() and getUStringField() return a reference to the field. You use overloads of APT_MultiFieldAccessor::setValue() to write to the fields of a record of an output data set. After processing the first field, you update the APT_MultiFieldAccessor object to the next field using the statement:
inAccessor.nextField();
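The type-then-get pattern can be modeled in standard C++ with std::variant (a hypothetical analogue of APT_MultiFieldAccessor, not the real class):

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical analogue of APT_MultiFieldAccessor: a cursor over fields
// whose types are only known at run time, inspected with a type query
// before the value is read (mirroring the type()/getInt32() pattern).
using FieldValue = std::variant<int, double, std::string>;

class MultiFieldAccessor {
public:
    explicit MultiFieldAccessor(std::vector<FieldValue> fields)
        : fields_(std::move(fields)), current_(0) {}

    void nextField() { ++current_; }   // advance the current accessor

    bool isInt() const { return std::holds_alternative<int>(fields_[current_]); }
    int getInt() const { return std::get<int>(fields_[current_]); }
    std::string getString() const { return std::get<std::string>(fields_[current_]); }

private:
    std::vector<FieldValue> fields_;
    std::size_t current_;
};
```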
[Figure: ADynamicOperator, with input interface schema field1:int32; field2:int16; inRec:*; and output interface schema outRec:*;, writing to an output data set.]
Using this operator, a user can specify up to two fields of the input interface schema. The constructor for this operator takes as input a string defining these two fields. This string must be incorporated into the complete schema definition statement required by setInputInterfaceSchema() in describeOperator(). For example, if the constructor is called as:
ADynamicOperator myOp("field1:int32; field2:int16; ");
the complete input schema definition for this operator would be:
"record (field1:int32; field2:int16; inRec:*;) "
Table 11. Example of an operator with a dynamic interface (continued)

APT_Status ADynamicOperator::initializeFromArgs_(const APT_PropertyList &args,
    APT_Operator::InitializeContext context)                     // 24
{
    return APT_StatusOk;
}

APT_Status ADynamicOperator::describeOperator()
{
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema(inputSchema.data(), 0);
    setOutputInterfaceSchema("record (outRec:*;)", 0);
    declareTransfer("inRec", "outRec", 0, 0);
    return APT_StatusOk;
}

APT_Status ADynamicOperator::runLocally()
{
    APT_InputCursor inCur;
    APT_OutputCursor outCur;
    setupInputCursor(&inCur, 0);
    setupOutputCursor(&outCur, 0);
    APT_MultiFieldAccessor inAccessor(inputInterfaceSchema(0));  // 43
    inAccessor.setup(&inCur);                                    // 44
    while (inCur.getRecord())
    {
        transfer(0);
        outCur.putRecord();
    }
    return APT_StatusOk;
}

void ADynamicOperator::serialize(APT_Archive& archive, APT_UInt8)  // 52
{
    archive || inputSchema;
}
8. Declare a constructor that takes a string defining part of the input interface schema.
9. Declare a default constructor. This constructor is required by the persistence mechanism.
15. Define inputSchema, an instance of APT_String, to hold the complete input interface schema of this operator. The complete input interface schema is equal to the schema passed in to the constructor, plus the definition of the schema variable inRec.
16. See the header file argvcheck.h for documentation on the ARGS_DESC string.
17. With APT_DEFINE_OSH_NAME, you connect the class name to the name used to invoke the operator from osh and pass your argument description string. See osh_name.h for documentation on this macro.
18 - 19. APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement run-time type information and persistent object processing.
20. The constructor creates the complete schema definition statement and writes it to the private data member inputSchema.
24. With your override of initializeFromArgs_(), you transfer information from the arguments to the class instance. Because there are no arguments for this example, initializeFromArgs_() returns APT_StatusOk. See the header file operator.h for documentation on this function.
43. Define inAccessor, an instance of APT_MultiFieldAccessor. The constructor for APT_MultiFieldAccessor takes a single argument containing a schema definition. In this example, you use APT_Operator::inputInterfaceSchema() to return the schema of the input interface as set by setInputInterfaceSchema() in describeOperator().
44. Use APT_MultiFieldAccessor::setup() to initialize the multifield accessor for all components of the input interface schema.
52. ADynamicOperator defines a single data member, inputSchema. You must serialize inputSchema within serialize().
Specifying custom information through the APT_CustomReportInfo class ensures that your custom XML information appears in this name/description/value format:
<custom_info Name="custName" Desc="description of custName">
  add any information here, including information in XML notation
</custom_info>
Internal storage vectors are maintained for Job Monitor metadata and summary messages. You add your custom information to the storage vectors using these two functions, which are declared in $APT_ORCHHOME/include/apt_framework/operator.h:

void addCustomMetadata(APT_CustomReportInfo &);
void addCustomSummary(APT_CustomReportInfo &);
The function addCustomMetadata() must be called within describeOperator(), and addCustomSummary() must be executed after the runLocally() processing loop. You can add any number of entries to the storage vectors. Your entries are appended to the Job Monitoring metadata and summary messages. Here are examples of metadata and summary messages with appended custom information:
<response type="metadata" jobID="17518">
  <component ident="custom_import">
    <componentstats startTime="2002-06-28 22:08:05"/>
    <linkstats portNum="0" portType="out"/>
    ...
    <resource type="in">
      resource description
    </resource>
    ...
    <custom_info Name="custName" Desc="description of custName">
      add any information here, including data in XML notation
    </custom_info>
    ...
  </component>
  ...
</response>

<response type="summary" jobID="17518">
  <component ident="custom_import" pid="17550">
    <componentstats currTime="2002-06-28 22:08:09" percentCPU="10.0"/>
    <linkstats portNum="0" portType="out" recProcessed="50000"/>
    ...
    <custom_info Name="custName" Desc="description of custName">
      add any information here, including data in XML notation
    </custom_info>
    ...
  </component>
  ...
</response>
Call sendCustomReport() before the execution of runLocally(), which is before the parallel execution of an operator, to generate a message of type custom_report.
Call sendCustomInstanceReport() during the execution of runLocally(), which is during parallel execution, to generate a message of type custom_instance_report. This function provides operator instance information.
[Figure: a composite operator containing two suboperators connected by a virtual data set.]
This composite operator contains two suboperators. Each suboperator takes a single input, creates a single output, and defines its own partitioning method. Node pool constraints placed on a composite operator propagate to the suboperators; node map constraints do not propagate to the suboperators.
The figure shows a temporary data set between the two suboperators. This data set is invisible to the user of the operator because it is created solely to be used as the output of the first suboperator and the input to the second. Typically, you create a composite operator for one of two reasons:
- To hide complexity from a user of the operator.
- To allow a second operator to use a data set already partitioned by a previous operator.

As an example of the first reason, you might want to create a single operator that removes all duplicate records from a data set. To incorporate this functionality into a single operator, you need an operator to sort the data set so that the duplicate records are adjacent. One solution is to select a sort operator followed by the remdup operator, which removes duplicate records. By creating a single composite operator from these two operators, you have to instantiate and reference only a single operator. The following figure illustrates the second reason for creating composite operators: allowing an operator to use data already partitioned by the previous operator. Here, you can see that the first suboperator partitions the input data sets. The second suboperator does not define a partitioning method; instead, it uses the already partitioned data from the previous operator.
[Figure: a composite operator in which the first suboperator partitions the input data sets and the second suboperator processes the already partitioned data.]
You can use a composite operator to create a join and filter operator. In this case, the first suboperator would join the two data sets and the second suboperator would filter records from the combined data set. The second suboperator, filter,
simply processes the already partitioned data set. In order to implement this scenario, you must specify same as the partitioning method of the second suboperator.
The following code shows the class definition for a composite operator derived from APT_CompositeOperator:
Table 12. APT_CompositeOperator derivation. The numbers in the comments that follow refer to lines of this code.
protected:
    virtual APT_Status describeOperator();
    virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
        APT_Operator::InitializeContext context);
private:
    APT_DataSet * tempDS;
    SortOperator * sortOp;
    RemoveOperator * removeOp;
};
#define ARGS_DESC "{}"
APT_DEFINE_OSH_NAME(RemoveDuplicatesOperator, RemoveDups, ARGS_DESC);
APT_IMPLEMENT_RTTI_ONEBASE(RemoveDuplicatesOperator, APT_CompositeOperator);
APT_IMPLEMENT_PERSISTENT(RemoveDuplicatesOperator);
APT_Status RemoveDuplicatesOperator::initializeFromArgs_(const APT_PropertyList &args,
    APT_Operator::InitializeContext context)
{
    return APT_StatusOk;
}
RemoveDuplicatesOperator::RemoveDuplicatesOperator()
{
    tempDS = new APT_DataSet;
    sortOp = new SortOperator;
    removeOp = new RemoveOperator;
    sortOp->attachOutput(tempDS, 0);
    removeOp->attachInput(tempDS, 0);
}
void RemoveDuplicatesOperator::serialize(APT_Archive& archive, APT_UInt8) {}
APT_Status RemoveDuplicatesOperator::describeOperator()
{
    setInputDataSets(1);
    setOutputDataSets(1);
    markSubOperator(sortOp);
    markSubOperator(removeOp);
    redirectInput(0, sortOp, 0);
    redirectOutput(0, removeOp, 0);
    return APT_StatusOk;
}
2  Derive this class from APT_CompositeOperator.
9  Override describeOperator(). You cannot override APT_Operator::runLocally() for a composite operator.
11  Define a pointer to a temporary data set that connects the two suboperators. This data set will be dynamically allocated in the constructor for RemoveDuplicates.
12  Dynamically allocate the suboperators of the composite operator. This line defines sortOp, a pointer to a SortOperator, as a private data member of the composite operator. You instantiate SortOperator in the constructor for RemoveDuplicates.
13  Define removeOp, a pointer to a RemoveOperator, as a private data member of the composite operator.
14  See the header file, argvcheck.h, for documentation on the ARGS_DESC string.
15  With APT_DEFINE_OSH_NAME, you connect the class name to the name used to invoke the operator from osh and pass your argument description string. See osh_name.h for documentation on this macro.
16-17  APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement run-time type information and persistent object processing.
18  With your override of initializeFromArgs_(), you transfer information from the arguments to the class instance. Because there are no arguments for this example, initializeFromArgs_() returns APT_StatusOk. See the header file, operator.h, for documentation on this function.
22  The constructor for RemoveDuplicatesOperator must instantiate the suboperators and tie together the suboperators and temporary data sets that make up the composite operator. In this example, you must specify tempDS as the output data set of sortOp and the input data set of removeOp.
24  Dynamically allocate a data set.
25-26  Dynamically allocate a SortOperator object and a RemoveOperator object. Both are derived from APT_Operator. Suboperators must be allocated dynamically. They are deleted by InfoSphere DataStage using the default destructor, and must not be deleted or serialized by the composite operator.
27  Specify tempDS as the output data set of sortOp. You need not specify an input data set. The function APT_CompositeOperator::redirectInput() in the
override of APT_Operator::describeOperator() specifies the input data set.
28  Specify tempDS as the input data set of removeOp. You do not have to specify an output data set. The function APT_CompositeOperator::redirectOutput() in the override of APT_Operator::describeOperator() specifies the output data set.
36  Specify sortOp as a suboperator.
37  Specify removeOp as a suboperator.
38  Use APT_CompositeOperator::redirectInput() to specify that the data set input to sortOp corresponds to the first data set input to the composite operator.
39  Use APT_CompositeOperator::redirectOutput() to specify that the data set output by removeOp corresponds to the first data set output by the composite operator.
APT_SubProcessOperator::runSource()
Use this function to pass information or data to the third-party application. The function runSource() writes a record, a record field, or any other information to the third-party application. As part of the write operation, you can preprocess records before they are passed to the third-party application. Possible preprocessing tasks include rearranging record fields and stripping off unnecessary fields. Within your override of runSource(), you use the function APT_SubProcessOperator::writeToSubProcess() to copy a buffer to the third-party application through stdin. While APT_SubProcessOperator::writeToSubProcess() can be used to copy any type of data stored in a buffer, you commonly use the function to copy an entire record or a portion of a record. To get the current input record into a buffer, perform these steps:
1. Call APT_Operator::getTransferBufferSize() to determine the size of the buffer required to hold the current input record. This function determines the size using a schema variable in the input interface schema of the operator. To use APT_Operator::getTransferBufferSize(), you must have declared a transfer from a schema variable in the input interface schema of the APT_SubProcessOperator to a schema variable in the output interface schema of the operator. APT_Operator::getTransferBufferSize() uses the information from the transfer to determine buffer size.
2. Create a buffer large enough to hold the current input record and any additional information that you want to add to the record.
3. Call APT_Operator::transferToBuffer() to transfer the current input record to the buffer you created in Step 2.
4. Perform any preprocessing on the record required by the third-party application.
5. Call APT_SubProcessOperator::writeToSubProcess() to copy the buffer to the third-party application. This application must be configured to receive inputs over stdin.
APT_SubProcessOperator::runSink()
Use this function to read data from the third-party application.

The function runSink() reads a buffer back from a third-party application. The returned buffer can contain a record, record fields, results calculated from a record, and any other output information. You can perform post-processing on the results after getting them back from APT_SubProcessOperator.

To read a fixed-length buffer back from the third-party application, perform the following steps:
1. Determine the buffer length. Typically, you call APT_SubProcessOperator::readFromSubProcess() to read a fixed-length buffer containing the length of the results buffer. The buffer is read from the stdout of the subprocess.
2. Allocate a buffer equal to the length determined in Step 1.
3. Call APT_SubProcessOperator::readFromSubProcess() again, this time reading the fixed-length buffer from the third-party application.

To read a variable-length buffer, perform the following steps:
1. Call APT_SubProcessOperator::getReadBuffer() to read a block of data back from the subprocess. This function returns a pointer to a buffer containing data read from the subprocess.
2. Parse the buffer to determine field and record boundaries. Process the buffer as necessary.

To read a character-delimited buffer, call APT_SubProcessOperator::getTerminatedReadBuffer(). This function reads a block of data up to a specified delimiter back from the subprocess, and returns a pointer to a buffer containing the data read.

To write the returned buffer to an output data set, you typically call APT_Operator::transferFromBuffer(). This function transfers the results buffer to the output record of the operator.
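The fixed-length read protocol in the first set of steps can be sketched in standard C++, with an istream standing in for the stdout of the subprocess. The helper name readFixedLengthResult and the 8-character decimal length header are assumptions for illustration; they are not part of the APT API:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical analog of the fixed-length protocol: the subprocess first
// writes a fixed-width (8-character) decimal length, then writes that
// many bytes of result data.
static std::string readFixedLengthResult(std::istream& subprocessOut)
{
    char lenbuf[8];
    subprocessOut.read(lenbuf, sizeof lenbuf);           // step 1: read length
    size_t len = std::stoul(std::string(lenbuf, sizeof lenbuf));
    std::vector<char> buf(len);                          // step 2: allocate
    subprocessOut.read(buf.data(), len);                 // step 3: read result
    return std::string(buf.data(), len);
}
```

For example, a stream containing "00000005hello" yields the five-byte result "hello".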
APT_SubProcessOperator::commandLine()
Use this function to pass a command-line string to the third-party application. You override the pure virtual function commandLine() to pass a command-line string to the third-party application. This string is used to execute the subprocess. As part of the command line, you must configure the subprocess to receive all input over stdin and write all output to stdout. This function is called once to invoke the third-party application on each processing node of your system. For example, the UNIX gzip command takes its input from standard input by default. You can use the -c option to configure gzip to write its output to standard output. An example command line for gzip would be:
"/usr/local/bin/gzip -c"
Including any redirection symbols, for either input to or output from gzip, causes an error.
Zip9Operator
Figure: the Zip9Operator interface; the output data set has the schema street:string; city:string; state:string; zip9:string.
Zip9Operator takes as input a data set containing exactly four fields, one of which contains a five-digit zip code. As output, this operator creates a data set with exactly four fields, one of which is a nine-digit zip code. The input strings cannot contain commas because the UNIX utility ZIP9 takes as input comma-separated strings. Also, the total line length of the strings must be less than or equal to 80 characters. The code in the following table contains an APT_SubProcessOperator derivation.
Table 13. APT_SubProcessOperator derivation. The numbers in the comments that follow refer to lines of this code.

#include <apt_framework/orchestrate.h>
class Zip9Operator: public APT_SubProcessOperator                     // 2
{
    APT_DECLARE_RTTI(Zip9Operator);                                   // 4
    APT_DECLARE_PERSISTENT(Zip9Operator);                             // 5
public:
    Zip9Operator();
protected:
    virtual APT_Status describeOperator();                            // 9
    virtual APT_UString commandLine() const;                          // 10
    virtual APT_Status runSink();                                     // 11
    virtual APT_Status runSource();                                   // 12
    virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
        APT_Operator::InitializeContext context);
};
#define ARGS_DESC "{}"                                                // 13
APT_DEFINE_OSH_NAME(Zip9Operator, zip9, ARGS_DESC);                   // 14
APT_IMPLEMENT_RTTI_ONEBASE(Zip9Operator, APT_SubProcessOperator);     // 15
APT_IMPLEMENT_PERSISTENT(Zip9Operator);                               // 16
Zip9Operator::Zip9Operator() {}                                       // 17
APT_Status Zip9Operator::initializeFromArgs_(const APT_PropertyList &args,   // 19
    APT_Operator::InitializeContext context)
{
    return APT_StatusOk;
}
APT_Status Zip9Operator::describeOperator()
{
    setKind(APT_Operator::eParallel);
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema("record"
        "(street:string; city:string; state:string; zip5:string;)", 0);      // 28
    setOutputInterfaceSchema("record"
        "(street:string; city:string; state:string; zip9:string;)", 0);      // 29
    return APT_StatusOk;
}
APT_UString Zip9Operator::commandLine() const                         // 32
{
    return APT_UString("ZIP9");
}
APT_Status Zip9Operator::runSource()                                  // 36
{
    APT_InputCursor inCur;
    setupInputCursor(&inCur, 0);
    APT_InputAccessorToString street("street", &inCur);
    APT_InputAccessorToString city("city", &inCur);
    APT_InputAccessorToString state("state", &inCur);
    APT_InputAccessorToString zip5("zip5", &inCur);
    char linebuf[80];
    while (inCur.getRecord())
    {
        // This code builds a comma-separated string containing the street,
        // city, state, and 5-digit zipcode, and passes it to ZIP9.
        char * write = linebuf;
        memcpy(write, street->content(), street->length());
        write += street->length();
        *write++ = ',';
        memcpy(write, city->content(), city->length());
        write += city->length();
        *write++ = ',';
        memcpy(write, state->content(), state->length());
        write += state->length();
        *write++ = ',';
        memcpy(write, zip5->content(), zip5->length());
        write += zip5->length();
        *write++ = '\n';
        size_t lineLength = write - linebuf;
        APT_ASSERT(lineLength <= 80);
        writeToSubProcess(linebuf, lineLength);
    }
    return APT_StatusOk;
}
APT_Status Zip9Operator::runSink()                                    // 66
{
    APT_OutputCursor outCur;
    setupOutputCursor(&outCur, 0);
    APT_OutputAccessorToString street("street", &outCur);
    APT_OutputAccessorToString city("city", &outCur);
    APT_OutputAccessorToString state("state", &outCur);
    APT_OutputAccessorToString zip9("zip9", &outCur);
    char linebuf[80];
    while (1)
    {
        // read a single line of text from the subprocess
        size_t lineLength = readTerminatedFromSubProcess(linebuf, '\n', 80);
        if (lineLength == 0) break;
        char* scan = linebuf;
        char* streetStr = scan;
        while (*scan != ',')
        {
            scan++;
            APT_ASSERT(*scan != '\n');
            APT_ASSERT(scan - linebuf < 80);
        }
        size_t streetLen = scan - streetStr;
        *street = APT_String(streetStr, streetLen);
        scan++;
        char* cityStr = scan;
        while (*scan != ',')
        {
            scan++;
            APT_ASSERT(*scan != '\n');
            APT_ASSERT(scan - linebuf < 80);
        }
        size_t cityLen = scan - cityStr;
        *city = APT_String(cityStr, cityLen);
        scan++;
        char* stateStr = scan;
        while (*scan != ',')
        {
            scan++;
            APT_ASSERT(*scan != '\n');
            APT_ASSERT(scan - linebuf < 80);
        }
        size_t stateLen = scan - stateStr;
        *state = APT_String(stateStr, stateLen);
        scan++;
        char* zipStr = scan;
        while (*scan != '\n')
        {
            scan++;
            APT_ASSERT(scan - linebuf < 80);
        }
        size_t zipLen = scan - zipStr;
        *zip9 = APT_String(zipStr, zipLen);
        outCur.putRecord();
    }
    return APT_StatusOk;
}
void Zip9Operator::serialize(APT_Archive& archive, APT_UInt8) {}      // 112
2  Derive this class from APT_SubProcessOperator.
4  The macro APT_DECLARE_RTTI() is required to support run time type information.
5  The macro APT_DECLARE_PERSISTENT() is required for persistence
support. This macro also inserts a declaration for the APT_Persistent::serialize() function required by the persistence mechanism.
9  Specify the override of APT_Operator::describeOperator().
10  Specify the override of APT_SubProcessOperator::commandLine().
11  Specify the override of APT_SubProcessOperator::runSink().
12  Specify the override of APT_SubProcessOperator::runSource().
13  See the header file, argvcheck.h, for documentation on the ARGS_DESC string.
14  With APT_DEFINE_OSH_NAME, you connect the class name to the name used to invoke the operator from osh and pass your argument description string. See osh_name.h for documentation on this macro.
15-16  APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement run-time type information and persistent object processing.
17  Define the default constructor.
19  With your override of initializeFromArgs_(), you transfer information from the arguments to the class instance, making it osh aware. Because there are no arguments for this example, initializeFromArgs_() simply returns APT_StatusOk. See the header file, operator.h, for documentation on this function.
28  Specify the input interface schema of the first input data set.
29  Specify the output interface schema of the first output data set.
32  The override of APT_SubProcessOperator::commandLine() specifies the command line for an example UNIX utility ZIP9, a utility to convert a five-digit zip code to a nine-digit zip code. The function returns the command line for ZIP9.
36  The override of APT_SubProcessOperator::runSource() copies the street, city, state, and zip5 fields to ZIP9 as a single line of text containing comma-separated fields.
66  The override of APT_SubProcessOperator::runSink() copies the street, city, state, and zip9 fields back from ZIP9 as a single string containing comma-separated fields. This function then writes the returned fields to the output data set.
112  Zip9Operator does not define any data members; therefore, serialize() is empty. You must provide serialize() even if it is an empty function.
APT_MSG()
This macro issues a message. The macro can only be called from .C files.

APT_MSG(severity, englishString, argumentArray, logOrSourceModule)
Parameters
The severity argument can have one of the following string values:
"Info"      issues an informational message; execution does not terminate
"Warning"   issues a warning message; execution does not terminate
"Error"     issues an error message; execution terminates at the end of the step
"Fatal"     issues a fatal message; execution terminates immediately
"Monitor" issues output to the Job Monitor; execution does not terminate "Metadata" issues metadata output; execution does not terminate An example of theenglishString argument is: "The number of output datasets is invalid." The string cannot contain a right parenthesis followed by a semi-colon ( ); ). A new-line is automatically appended to the englishString when the message is issued. The string can be parameterized to accept run time values, using {n} syntax. See the argumentArray description. The argumentArray argument lists the parameters that are to be resolved at run time using the APT_Formattable array. When there are no run time parameters, argumentArray must be NULL. An array example is:
APT_Formattable args [] = { numOutDS, opName };
APT_MSG(Error,
    "The number of output datasets {0} is invalid for custom "
    "operator {1}.", args, errorLog);
The maximum number of run time elements in the APT_Formattable array is 10. The elements can be of type char, int, unsigned int, short int, unsigned short int, long int, unsigned long int, long long, unsigned long long, float, char*, APT_String, APT_Int53, UChar*, or APT_UString. Note: If you include an object in the APT_Formattable array, such as APT_Identifier, access its member data using the appropriate methods. See fmtable.h for the definition of APT_Formattable. An example without run time parameters is:
APT_MSG(Error, "The number of output datasets is invalid.", NULL, errorLog);
The final argument is logOrSourceModule: the object to which the message is written. The argument value can be an APT_ErrorLog object, an APT_Error::SourceModule object, or it can be NULL. If the value is NULL, the message is written to the APT_Error::SourceModule object APT_localErrorSourceModule. Message indices are generated automatically by the messaging system.
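The {n} substitution performed by the messaging system can be illustrated with a small standard-C++ formatter. formatMessage is a hypothetical stand-in for the real NLS machinery, not the APT implementation:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the messaging system's {n} resolution:
// each "{i}" in the template is replaced by args[i] (indexing begins
// at 0). Assumes every '{' opens a well-formed "{digits}" placeholder.
static std::string formatMessage(const std::string& englishString,
                                 const std::vector<std::string>& args)
{
    std::string out;
    for (size_t i = 0; i < englishString.size(); ++i) {
        if (englishString[i] == '{') {
            size_t close = englishString.find('}', i);
            size_t index = std::stoul(englishString.substr(i + 1, close - i - 1));
            out += args.at(index);          // an out-of-range index throws
            i = close;                      // resume after the placeholder
        } else {
            out += englishString[i];
        }
    }
    return out;
}
```

With args {"3", "myOp"}, the template "The number of output datasets {0} is invalid for custom operator {1}." formats to "The number of output datasets 3 is invalid for custom operator myOp."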
APT_NLS()
This macro does not issue a message but simply returns an APT_UString containing the localized version of the englishString. This macro is needed to pass localized strings to other functions that output messages. This macro can be called from .C files only.

APT_NLS(englishString, argumentArray)

For example:
APT_Formattable args [] = { hostname };
error_ = APT_NLS("Invalid hostname: {0}", args);
In this example, a member variable is set to a localized string that can be output later.
The two arguments to APT_NLS() are identical to the englishString and argumentArray arguments of APT_MSG(). If no run-time arguments are needed, the value of the APT_Formattable array must be NULL. englishString cannot contain a right parenthesis followed by a semi-colon ( ); ).
APT_DECLARE_MSG_LOG()
This macro uniquely identifies a message. It must appear in files that call APT_MSG() and APT_NLS().

APT_DECLARE_MSG_LOG(moduleId, "$Revision:$");

The macro must always be used in this context:
APT_Error::SourceModule stemId("CUST");
APT_DECLARE_MSG_LOG(stem, "$Revision:$");
where stem is a string specifying a unique identifier for the source module, and CUST is the identifier for custom operators. For example:
APT_Error::SourceModule APT_customOpId("CUST");
APT_DECLARE_MSG_LOG(APT_customOp, "$Revision:$")
This message uses several stream operators (<<) over three lines; however, it is a single error message and must be converted using a single APT_MSG() macro.
APT_MSG(Error,
    "Output schema has duplicate field names. If the "
    "-flatten key is being used it is likely generating a scalar "
    "field name that is the same as another input field.",
    NULL, errorLog());
Convert this message by declaring the run-time variables in an APT_Formattable array and designating them in your englishString argument using {} and an index into the args[] array. Indexing begins at 0. The converted message is:
APT_Formattable args[] = { errorCode, fieldName1, fieldName2 };
APT_MSG(Error,
    "Error code {0}: Fields {1} and {2} cannot have variable length.",
    args, errorLog());
3. Convert all APT_USER_REQUIRE() macros into this form:
if (rep_->numInputs_ > 128)
{
    APT_MSG(Fatal,
        "The number of inputs attached must be no more than 128",
        NULL, NULL);
}
The if statement is the negation of the APT_USER_REQUIRE() argument. Also note that the log_or_module argument to APT_MSG() is NULL because no APT_ErrorLog or APT_Error::SourceModule object is included in the APT_USER_REQUIRE() macro.
4. Convert all APT_USER_REQUIRE_LONG() macros of this form:
APT_USER_REQUIRE_LONG(rep_->numInputs_ <= 128, localErrorModule,
    APT_PERLBUILDOP_ERROR_START+43,
    "The number of inputs attached must be no more than 128");
into this:
if (rep_->numInputs_ > 128)
{
    APT_MSG(Fatal,
        "The number of inputs attached must be no more than 128",
        NULL, localErrorModule);
}
The severity for APT_USER_REQUIRE_LONG() is always Fatal. Note that the if statement is the negation of the first APT_USER_REQUIRE_LONG() argument, and that the log_or_module argument to APT_MSG() is localErrorModule.
5. Convert all APT_DETAIL_LOGMSG() macros of this form:
APT_DETAIL_LOGMSG(APT_Error::eInfo, PM_PROCESSMGR_INDEX_START + 7,
    "APT_PMCleanUp::registerFileImpl(" << name << ", "
    << flagString(dispositionWhenOK) << ", "
    << flagString(dispositionWhenFailed)
    << ") - path registered is " << fullName);
into this:
APT_Formattable args[] = { name, flagString(dispositionWhenOK),
    flagString(dispositionWhenFailed), fullName };
APT_MSG(Info,
    "APT_PMCleanUp::registerFileImpl({0}, {1}, {2}) "
    "- path registered is {3}", args, NULL);
into this:
APT_MSG(Info, "Timestamp message test 1", NULL, localErrorModule);
into this:
APT_MSG(Fatal, "Timestamp message test", NULL, NULL);
The severity for APT_DETAIL_FATAL() is always Fatal.
8. Convert all APT_DETAIL_FATAL_LONG() macros of this form:
APT_DETAIL_FATAL_LONG(errorModule, APT_PERLBUILDOP_ERROR_START+43,
    "Timestamp message test");
into this:
APT_MSG(Fatal, "Timestamp message test", NULL, errorModule);
The severity for APT_DETAIL_FATAL_LONG() is always Fatal.
9. Replace all errorLog() messages of this form:
*errorLog() << "There must be at least" << numCoords
    << "coordinates in the input vectors." << endl;
errorLog().logError(APT_CLUSTERQUALITYOP_ERROR_START+2);
into this:
APT_Formattable args [] = { numCoords };
APT_MSG(Error,
    "There must be at least {0} coordinates in the input vectors.",
    args, errorLog());
In addition to logError() messages, there can be logWarning() or logInfo() messages. The corresponding APT_MSG() severities are Warning or Info.
10. Replace all occurrences of APT_APPEND_LOG() with appendLog() so that the string passed to the function is an APT_UString. For example, change:
APT_APPEND_LOG(log, subLog, "Error when checking operator:");
to
log.appendLog(subLog, APT_NLS("Error when checking operator:"));
11. Replace all occurrences of APT_PREPEND_LOG() with prepend() so that the string passed to the function is an APT_UString. For example, change:
APT_PREPEND_LOG(*log, "Trouble importing field \"" << path_.unparse()
    << sub(vecLen, vecElt) << "\"" << data << ", at offset: "
    << bufferSave-recStart << ": ");
into this:
APT_Formattable args [] = { path_.unparse(), sub(vecLen, vecElt),
    data, bufferSave-recStart };
log->prepend(APT_NLS("Trouble importing field \"{0}{1}\"{2}, at offset: {3}: ", args));
12. Replace all occurrences of APT_DUMP_LOG() with dump() so that the string passed to the function is an APT_UString. For example, change:
APT_DUMP_LOG(*logp, "Import warning at record "
    << rep_->goodRecords_ + rep_->badRecords_ << ": ");
into this:
APT_Formattable args [] = { rep_->goodRecords_ + rep_->badRecords_ };
logp->dump(APT_NLS("Import warning at record {0}: ", args));
Substituting the function defined on the fourth line of the example for the function defined on the first line resolves the compilation error in your .C file.
C is the class name associated with the operator, O is the osh name for the operator, and U is the argument description string. For example:
APT_DEFINE_OSH_NAME(APT_TSortOperator, tsort, APT_UString(TSORT_ARGS_DESC))
At run time, the argument processor uses your argument description and the actual arguments given to your operator to produce a property-list encoding of the arguments and their values, which is passed to your override of initializeFromArgs_(). The APT_DEFINE_OSH_NAME macro is defined in osh_name.h. The initializeFromArgs_() function is defined in operator.h, partitioner.h, and collector.h.
The argument name is always present as the argName property in the list. Value properties will be present according to whether the argument item has any values. Subarguments will be present when the argument item has subarguments. If an argument item does not have a value or subarguments, it just appears as an empty property in the property list. The property list presents argument items in the order in which they are encountered on the command line. For example, given the argument description for the tsort operator and this osh command line:
tsort -key product -ci -sorted -hash -key productid int32 -memory 32 -stats
Note: For readability, the argument-description syntax table and the examples in the header files omit the quotes and backslashes.
Table 14. Argument description syntax code and short comments Line number 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 } type = Type syntax (expanded from line 6) a single type is required } }, subArgDesc = {argName = {...}, ... }, minOccurrences = int, maxOccurrences = int, optional, oshName = string, oshAlias = string, silentOshAlias = string, default, hidden, deprecated = { property, property, ... } }, argName = { {...}, ... }, otherInfo = { exclusive = {name, name, ... }, exclusive_required = {name, name, ... }, implies = {name, name, ... }, description = string, inputs = dataset_type_descriptions, outputs = dataset_type_descriptions, op; 0 or more op; 0 or more op; 0 or more op; goes in usage string req req req op; 0 or more op; default = 0 op; default = inf op; same as min/max = 0/1 op op; 0 or more op; 0 or more; not in usage op; goes in usage string op; not in usage string Code { argName = { description = description_string, value = { type = { type, other_properties } usageName = string, optional, default = type_literal_value, deprecated 1 req; syntax starts on line 33 req op; value is optional op; affects usage string op; omit from usage string req op; 0 or more Short comments Optional; 0 or more
Table 14. Argument description syntax code and short comments (continued) Line number 35 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 } | type = { fieldType, min = type_literal, must be the first property op; no lower limit by default
Chapter 6. Argument-list processor
Code { string, list = { string, string, ... }, regexp = regexp, case = sensitive | insensitive } | type = { ustring, list = { ustring, ustring, ... }, regexp = regexp, case = sensitive | insensitive } | type = { int, min = int, max = int, list = {int, int, ... } } | type = { float, min = float, max = float, list = { float, float, ... }
Short comments must be the first property list = { string, string, ... }, op; list of legal values op; regexp for legal values op; default: case-insensitive
must be the first property op; list of legal values op; regexp for legal values op; default: case-insensitive
must be the first property op; no lower limit by default op; no upper limit by default op; list of legal values; list exclusive with min/max
must be the first property op; no lower limit by default op; no upper limit by default op; list of legal values; list exclusive with min/max
Table 14. Argument description syntax code and short comments (continued) Line number 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 } | type =
Code max = type_literal, list = { type_literal, type_literal, ... }, compareOptions = { ... }, } | type = { propList, elideTopBraces, requiredProperties = { property, property, ... } } | type = { schema, acceptSubField, acceptVector, acceptSchemaVar } | type = { fieldName, input | output, acceptSubField } | type = { fieldTypeName, list = { name, name, ... }, noParams
Short comments op; no upper limit by default op; list of legal values op; adjusts comparisons list exclusive with min/max
must be the first property op; default: top-level only op; default: no vectors op; default: no schema vars
must be the first property op; list of legal type names op; default: params accepted
Table 14. Argument description syntax code and short comments (continued) Line number 106 107 108 109 110 111 } inputs | outputs = Input/Output dataset_type_descriptions (expanded from lines 30 and 31) Code { pathName, canHaveHost, defaultExtension = string must be the first property op op Short comments
112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136
{ portTypeName = { description = string, oshName = string, minOccurrences = int, maxOccurrences = int, optional, required, once, multiple, any, constraints = { ifarg = argName, ifnotarg = argName, ifargis = (name = argName, value = argValue), argcount = argName, argcountopt = argName, portcount = portName }, incomplete }, portTypeName = { {...}, ... }, } op op; 0 or more port type if specified port type if not specified port type if specified value times the port type appears port appears N or N+1 times output appears input times req op op; default = 0 op; default = inf op; same as min/max=0/1 op; same as min/max=1/1 op; same as min/max=1/1 op; same as min/max=1/inf op; same as min/max=0/inf req; name of the data set
Table 15. Argument description syntax comments

Syntax line 1: There can be zero or more argument entries. An argName property is required for an argument item, and it must be unique within an argument list. It is matched in a case-insensitive manner. By default, the osh tag for an argument is formed by prepending a hyphen to the name of the argument. You can override this rule with the oshName property.
Syntax line 3: The description property is required and is used when generating a usage string.
Syntax line 4: There can be zero or more value properties to describe the values that are expected to follow the argument name on the command line. The order in which the value entries are listed determines the order in which the values must be presented on the command line after the argument name.
Syntax line 6: A single type subproperty must be supplied. Its syntax allows you to specify one of nine data types, which are described starting on line 34.
Syntax line 7: The required usageName property defines the name of the type as it appears on the command line.
Syntax line 8: The optional property indicates that the value need not be supplied on the command line. Only the final value is optional. The optional property itself is optional.
Syntax line 9: You can use the optional default flag property to specify a value to be used when no value is given on the command line. This only affects the generated usage string. The print/scan generic function of the type is used to read the literal value of the type.
Syntax line 10: The deprecated flag property is optional. When supplied, the value description is omitted by default from generated usage strings.
Syntax line 12: Using the optional subArgDesc property allows you to provide additional flags to be associated with an argument usage string.
Syntax lines 13-14: By default, an argument vector can have any number of items matching an argument description. You can restrict the number of times an argument can occur by using the optional minOccurrences and maxOccurrences properties. The default values are 0 for minOccurrences and any integer for maxOccurrences.
Syntax line 15: The optional parameter allows you to specify minOccurrences = 0 and maxOccurrences = 1.
Syntax line 16: With oshName, you specify a non-default name for the argument on the osh command line.
Syntax line 17: An argument can optionally have one or more osh alias names, allowing you to provide abbreviations, variant spellings, and so on. You specify them using the oshAlias property.
Syntax line 18: By using the optional silentOshAlias property, an argument can have one or more osh alias names which are not listed in a usage string.
Syntax line 19: You use the optional default flag property to indicate that the argument represents a default that is in force if this argument (and typically other related arguments in an exclusion set with this argument) is not present. This information is put into the generated usage string, and has no other effect.
Syntax line 20: With the hidden flag property, you can optionally describe arguments that are not normally exposed to the operator user. Hidden argument descriptions are, by default, omitted from generated usage strings.
Syntax line 21: You can optionally use the deprecated flag parameter to indicate that an argument description exists only for back-compatibility. Its argument description is omitted by default from generated usage strings.
Syntax line 24: The required otherInfo parameter allows you to specify constraints. For an operator without arguments, the minimum required argument description contains the otherInfo parameter and its description subparameter. For example: { otherInfo = { description = "this operator has no arguments." } }
Syntax line 26: With the exclusive constraint, you optionally name a set of arguments which are mutually exclusive. Multiple exclusive sets can be defined.
76
Table 15. Argument description syntax comments (continued)

Syntax line 27: An exclusive_required constraint is like an exclusive constraint, described on line 26; however, one of its listed arguments must be present.

Syntax line 28: An implies constraint specifies that if one given argument occurs, then another given argument must also be supplied.

Syntax line 29: Your optional description string is added to the generated usage string.

Syntax lines 30-31: Both the input and output properties are required. If they are omitted, warning messages are emitted when the operator is run.

Syntax lines 34-110: The type property must be the first subproperty in a value clause. It must be an Orchestrate type. For example: value = { type = int32, usageName = "mbytes", default = 20 }. Comments for each type are given next.

Syntax lines 34-40: String type. The list and regexp subproperties optionally specify the legal values, either in list form or in a regular expression. If neither of these two subproperties is specified, any string value is accepted for the argument. When case has its default value of insensitive, list matching is performed in a case-insensitive manner, and the regexp is evaluated on a copy of the string value that has been converted to lowercase.

Syntax lines 42-48: Ustring type. The list and regexp subproperties optionally specify the legal values, either in list form or in a regular expression. If neither of these two subproperties is specified, any ustring value is accepted for the argument. When case has its default value of insensitive, list matching is performed in a case-insensitive manner, and the regexp is evaluated on a copy of the string value that has been converted to lowercase.

Syntax lines 50-56: Integer type. The min and max subproperties are optional. By default, there are no lower or upper limits. The optional list subproperty specifies a list of legal values; it is exclusive with min and max. Integer values are 32 bits, signed. The field value is encoded as a dfloat in the argument's value = value property.

Syntax lines 58-64: Float type. The min and max subproperties are optional. By default, there are no lower or upper limits. The optional list subproperty specifies a list of legal values; it is exclusive with min and max. Floating-point values are double precision. The field value is encoded as a dfloat in the argument's value = value property.

Syntax lines 66-73: FieldType type. The optional min and max subproperties can be specified if the field type supports ordered comparison. By default, there are no lower or upper limits. The optional list subproperty can be provided if the field type supports equality comparisons. It specifies a list of legal values and is exclusive with min and max. The print/scan generic function is used to parse the type_literal values. The optional compareOptions subproperty adjusts how comparisons are done with the min, max, and list values. The field value is encoded as a string in the argument's value = value property.
Table 15. Argument description syntax comments (continued)

Syntax lines 75-81: PropList type. The elideTopBraces and requiredProperties subproperties are optional. The elideTopBraces property notifies InfoSphere DataStage that the user omits the top-level opening and closing property-list braces. The field value is encoded as a property list in the argument's value = value property.

Syntax lines 83-89: Schema type. The acceptSubField, acceptVector, and acceptSchemaVar subproperties are optional; their default values indicate not to accept subfields, vectors, and schema variables.

FieldName type: You must also specify either input or output. The acceptSubField subproperty is optional. Its default value is top-level field only.

FieldTypeName type: Using the optional list subproperty, you can specify acceptable type names. The noParams subproperty is also optional. The default is to accept type parameters.

PathName type: The canHaveHost and defaultExtension subproperties are optional. The canHaveHost subproperty specifies that the path name can begin with a host computer designation, host:.

You use the required portTypeName property to specify a one-word name for the port. Input and output ports are the same as input and output datasets. With the required description property, you can describe the purpose of the port. Use the oshName property to specify the port name for the osh command line. You can restrict the number of times a portTypeName property can occur by using the optional minOccurrences and maxOccurrences subproperties. The default values are 0 for minOccurrences and any integer for maxOccurrences. The optional subproperty specifies zero or one occurrence of a portTypeName property. The optional required subproperty specifies one and only one occurrence of a portTypeName property. The optional once subproperty has the same functionality as the required subproperty: it specifies one and only one occurrence of a portTypeName property. The optional multiple subproperty specifies one or more occurrences of a portTypeName property. The optional any subproperty specifies zero or more occurrences of a portTypeName property.

The constraints property is optional. If it is present, it cannot be the empty list. The syntax supplied provides simple constraint types that make it convenient to describe most simple cases.

Syntax line 126: The ifarg constraint subproperty specifies that the port type does not appear unless the argName has been specified. This subproperty can appear more than once, to specify multiple "enabling" options combined by logical OR. An example is the reject option for import/export.

Syntax line 127: The ifnotarg constraint subproperty indicates that the port type appears only if the argName has not been specified. This subproperty can appear more than once to specify multiple "disabling" options, which are combined by logical OR. An example is the createOnly option for the lookup operator.

Syntax line 128: The ifargis constraint subproperty indicates that the port type appears if the specified argName has the specified argValue. This suboption can be specified more than once to specify multiple "enabling" values. It can be combined with ifarg or ifnotarg. If it is specified alone, it is effectively equivalent to also specifying ifarg for the same argName. An example is "ifNotFound = reject" for the lookup operator.

Syntax line 129: The argcount constraint subproperty indicates that the port type appears exactly as many times as the argName appears. An example is the percent option for the sample operator.
Table 15. Argument description syntax comments (continued)

Syntax line 130: The argcountopt constraint subproperty indicates that if argName appears N times, the port type appears either N or N+1 times. An example is the table option for the lookup operator.

Syntax line 131: The portcount constraint subproperty indicates an output port type that appears as many times as an input port type with the specified portName.

Syntax line 133: The incomplete flag indicates that the provided input/output description is not completely accurate, given the complexity of the behavior of the operator.
The argument-list processor uses the description elements of your argument description to produce a usage string for an operator. The second argument description example is the argument description string for the tsort operator:
"{key={value={type={fieldName, input}, usageName="name"},
      value={type={fieldTypeName}, usageName="type", optional, deprecated},
      subArgDesc={ci={optional, description="case-insensitive comparison"},
                  cs={optional, default, description="case-sensitive comparison"},
                  ebcdic={optional, description="use EBCDIC collating sequence"},
                  nulls={value={type={string, list={first, last}},
                                usageName="first/last", default=first},
                         optional, description="where null values should sort"},
                  hash={optional, description="hash partition using this key"},
                  asc={oshAlias="-ascending", optional, default,
                       description="ascending sort order"},
                  desc={oshAlias="-descending", silentOshAlias="-des",
                        optional, description="descending sort order"},
                  sorted={optional, description="records are already sorted by this key"},
                  clustered={optional, description="records are grouped by this key"},
                  param={value={type={propList, elideTopBraces}, usageName="params"},
                         optional, description="extra parameters for sort key"},
                  otherInfo={exclusive={ci, cs}, exclusive={asc, desc},
                             exclusive={sorted, clustered},
                             description="Sub-options for sort key:"}},
      description="specifies a sort key"},
 memory={value={type={int32, min=4}, usageName="mbytes", default=20},
         optional, description="size of memory allocation"},
 flagCluster={optional, description="generate flag field identifying clustered/sorted key value changes in output"},
 stable={optional, default, description="use stable sort algorithm"},
 nonStable={silentOshAlias="-unstable", optional,
            description="use non-stable sort algorithm (can reorder same-key records)"},
 stats={oshAlias="-statistics", optional, description="print execution statistics"},
 unique={oshAlias="-distinct", optional,
         description="keep only first record of same-key runs in output"},
 keys={value={type={schema}, usageName="keyschema"},
       deprecated=key, maxOccurrences=1,
       description="schema specifying sort key(s)"},
 seq={silentOshAlias="-sequential", deprecated, optional,
      description="select sequential execution mode"},
 otherInfo={exclusive={stable, nonStable},
            exclusive_required={key, keys},
            description="InfoSphere DataStage sort operator:",
            inputs={unSorted={description="unsorted dataset", required}},
            outputs={sorted={description="sorted dataset", required}}}}"
The argument-list processor generates this property list based on the tsort argument description string shown in the previous section:
{key={value=lastname, subArgs={ci, sorted, hash}}, key={value=balance, value=int32}, memory={value=32}, stats }
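For illustration, an osh invocation that could give rise to a property list of this shape might look like the following. This is a hypothetical command line inferred from the argument description above: -key takes a field name with an optional type, -ci, -sorted, and -hash are sub-options of the preceding -key, and -memory takes an integer value.

```
tsort -key lastname -ci -sorted -hash
      -key balance int32
      -memory 32
      -stats
```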
{
    APT_Status status = APT_StatusOk;
    if (context == APT_Operator::eRun)
        return status;
    for (int i = 0; i < args.count(); i++)
    {
        const APT_Property& prop = args[i];
        if (prop.name() == "numtimes")
            numTimes_ = (int) prop.valueList().getProperty("value", 0).valueDFloat();
        else if (prop.name() == "uppercase")
            uppercase_ = true;
    }
    return status;
}
Error handling
Syntax errors in your argument-description string are not detected at compile time. Instead, errors are generated at run time when your argument description string and the current operator arguments are used to produce a property-list object. You can use the error-log facility to capture errors.
Usage strings
The usage string is generated from the description elements in your argument description string. You can access an operator usage string from the osh command line. For example:

$ osh -usage tsort

The usage string generated for the tsort operator follows. The example assumes that both deprecated options and current options have been requested.
sort operator:
-key -- specifies a sort key; 1 or more
  name -- input field name
  type -- field type; optional; DEPRECATED
  Sub-options for sort key:
    -ci -- case-insensitive comparison; optional
    -cs -- case-sensitive comparison; optional; default
    -ebcdic -- use EBCDIC collating sequence; optional
    -nulls -- where null values should sort; optional
      first/last -- string; value one of first, last; default=first
    -hash -- hash partition using this key; optional
    -asc or -ascending -- ascending sort order; optional; default
    -desc or -descending -- descending sort order; optional
    -sorted -- records are already sorted by this key; optional
    -clustered -- records are grouped by this key; optional
    -param -- extra parameters for sort key; optional
      params -- property=value pair(s), without curly braces
    (mutually exclusive: -ci, -cs)
    (mutually exclusive: -asc, -desc)
    (mutually exclusive: -sorted, -clustered)
-memory -- size of memory allocation; optional
  mbytes -- int32; 4 or larger; default=20
-flagKey -- generate flag field identifying key value changes in output; optional
-flagCluster -- generate flag field identifying clustered/sorted key value changes in output; optional
-stable -- use stable sort algorithm; optional; default
-nonStable -- use non-stable sort algorithm (can reorder same-key records); optional
-stats or -statistics -- print execution statistics; optional
-unique or -distinct -- keep only first record of same-key runs in output; optional
-collation_sequence -- use a collation sequence; optional
  collationfile -- string
-strength -- strength level; optional
  strength -- string
-keys -- schema specifying sort key(s); optional; DEPRECATED: use -key instead
  keyschema -- string
-seq -- select sequential execution mode; optional; DEPRECATED
(mutually exclusive: -stable, -nonStable)
(mutually exclusive: -key, -keys; one of these must be provided)
A complete specification of the argument description language is given in the argvcheck.h header file.
Table 16. Example operator with predefined type-conversion code

#include <apt_framework/orchestrate.h>
#include <apt_framework/type/conversion.h>                                // 2

class PreDefinedConversionOperator : public APT_Operator
{
    APT_DECLARE_RTTI(PreDefinedConversionOperator);
    APT_DECLARE_PERSISTENT(PreDefinedConversionOperator);
public:
    PreDefinedConversionOperator();
protected:
    virtual APT_Status describeOperator();
    virtual APT_Status runLocally();
    virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
                                           APT_Operator::InitializeContext context);
private:
    // other data members
};

APT_Status PreDefinedConversionOperator::describeOperator()
{
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema("record(dField:date)", 0);
    setOutputInterfaceSchema("record(dField:date; sField:string[1])", 0);
    return APT_StatusOk;
}

APT_Status PreDefinedConversionOperator::runLocally()
{
    APT_InputCursor inCur;
    APT_OutputCursor outCur;
    setupInputCursor(&inCur, 0);
    setupOutputCursor(&outCur, 0);
    APT_InputAccessorToDate dFieldInAcc("dField", &inCur);
    APT_OutputAccessorToDate dFieldOutAcc("dField", &outCur);
    APT_OutputAccessorToString sFieldOutAcc("sField", &outCur);
    APT_FieldConversion* nameConv = APT_FieldConversionRegistry::get()
        .lookupAndParse("weekday_from_date[Monday]", NULL);               // 32
    APT_ASSERT(nameConv);                                                 // 33
    APT_FieldConversion* conv = APT_FieldConversionRegistry::get()
        .lookupDefault("Int8", "String");                                 // 34
    APT_ASSERT(conv);                                                     // 35
Table 16. Example operator with predefined type-conversion code (continued)

    while (inCur.getRecord())
    {
        *dFieldOutAcc = *dFieldInAcc;
        APT_Int8 weekday;                                                 // 39
        nameConv->convert(&(*dFieldInAcc), &weekday, 0);                  // 40
        weekday = weekday + 1;                                            // 41
        conv->convert(&weekday, &(*sFieldOutAcc), 0);                     // 42
        outCur.putRecord();
    }
    if (nameConv) nameConv->disOwn();                                     // 45
    if (conv) conv->disOwn();
    return APT_StatusOk;
}
2: Include the header file, conversion.h, which defines the type-conversion interface.

32: Create an object with an explicit parameterized conversion.

33: Assert that the conversion object exists.

34: Create a conversion object based on the default conversion between int8 and string.

35: Use an assertion to make sure that the conversion object exists.

39: Create an intermediate local variable to store the conversion result.

40: Call the member function convert() to perform the conversion between the input field dField and the local variable weekday. The default value of weekday is 0.

41: Increment weekday by 1 to mark the first day.

42: Call the member function convert() to perform the conversion between the local variable weekday and the output field sField.

45: Call disOwn() on the nameConv conversion object since it is returned by using its own() function.
Table 17. Example operator with custom type conversion code (continued)

public:
    RawStringConversion();
    virtual APT_Status convert(const void* STval, void* DTval,
                               void* data) const;                         // 10
    static bool registerConversion();                                     // 11
protected:
    virtual APT_FieldConversion* clone() const;                           // 13
};

APT_IMPLEMENT_RTTI_ONEBASE(RawStringConversion, APT_FieldConversion);
APT_IMPLEMENT_PERSISTENT(RawStringConversion);

RawStringConversion::RawStringConversion()                                // 17
    : APT_FieldConversion(eImmutable, "string", "raw", "raw_from_string") // 18
{}

APT_FieldConversion* RawStringConversion::clone() const                   // 20
{
    return new RawStringConversion(*this);
}

APT_Status RawStringConversion::convert(const void* STval, void* DTval,
                                        void* data) const                 // 24
{
    const APT_String& s = *(const APT_String*)STval;                      // 26
    APT_RawField& d = *(APT_RawField*)DTval;
    d.assignFrom(s.content(), s.length());                                // 28
    return APT_StatusOk;
}

void RawStringConversion::serialize(APT_Archive& ar, APT_UInt8) {}        // 31

static RawStringConversion* sRawString = new RawStringConversion;         // 33

bool RawStringConversion::registerConversion()                            // 34
{
    APT_FieldConversionRegistry::get().addFieldConversion(sRawString);
    return true;
}

static bool sRegisteredRawStringConversion =
    RawStringConversion::registerConversion();                            // 39
2: Include the orchestrate.h header file.

3: Include the header file that defines the type-conversion interface.

4: All type conversions are derived, directly or indirectly, from APT_FieldConversion.

10: You must override the virtual function convert().

11: Define the member function that is used to register the newly defined type conversion.

13: You must override the virtual function clone().

17: Define the default constructor for this conversion.

18: Set up the conversion information on the initialization line. eImmutable indicates that this conversion does not accept parameters and is not subject to change. The arguments string and raw identify the source schema type and destination schema type of this type conversion. For an explicit conversion, the conversion name must be specified. In this example, raw_from_string is the conversion name. The schema type names and the conversion name are case-insensitive.

20: The override of the virtual function clone().

24: The override of the virtual function convert(), which performs the conversion from string to raw. The pointer STval points to the address which contains the value for the source string type. The pointer DTval points to the valid (already constructed) instance of the destination raw type. This instance is assigned the value converted from the source string. For information about the data argument, see the header file conversion.h.

26-28: The implementation for the conversion raw_from_string. References are used for the source and the destination, with the pointers STval and DTval being cast into APT_String and APT_RawField. The member function assignFrom() of the class APT_RawField is used to complete the conversion.

31: The serialize() function is empty because RawStringConversion has no internal state.

33: Create a static instance of RawStringConversion, which is used within registerConversion() to register the newly defined conversion function.

34: The implementation of the member function registerConversion().

39: Call the function registerConversion() to register the new conversion, raw_from_string.
[Figure: a sample record schema with fields of types int16, string[8], int16, and uint32]
A cursor defines the current input or output record of a data set. Field accessors perform relative access to the current record, allowing you to access the fields of the current record as defined by a cursor. To access a different record, you update a cursor to move it through the data set, creating a new current record. However, you do not have to update the field accessors; they automatically reference the record fields of the new current record. A record field is characterized by its field name and data type. There is a different field accessor for every data type. To access a record field, you must create an accessor for the data type of the field.
Cursors
Cursors let you reference specific records in a data set, while field accessors let you access the individual fields in those records. You use cursors and field accessors from within your override of the APT_Operator::runLocally() function. Each input and output data set requires its own cursor object. You use two classes to represent cursors:

APT_InputCursor
    Defines a cursor object providing read access to an input data set.

APT_OutputCursor
    Defines a cursor object providing read/write access to an output data set.

The APT_InputCursor and APT_OutputCursor classes define the following functions for making input records available for access.
When you first create an input cursor, it is uninitialized and does not reference a record. Therefore, field accessors to the input data set do not reference valid data. You must call APT_InputCursor::getRecord() to initialize the cursor and make the first record in the data set the current record. You can then use field accessors to access the fields of the input record. When you have finished processing a record in an input data set, you again call APT_InputCursor::getRecord() to advance the input cursor to the next record in the data set, making it the current input record. When no more input records are available, APT_InputCursor::getRecord() returns false. Commonly, you use a while loop to determine when APT_InputCursor::getRecord() returns false.

When you first create an output cursor, it references the first record in the output data set. If the record is valid, the record fields are set to the following default values:
v Nullable fields are set to null.
v Integers = 0.
v Floats = 0.
v Dates = January 1, 0001.
v Decimals = 0.
v Times = 00:00:00 (midnight).
v Timestamps = 00:00:00 (midnight) on January 1, 0001.
v The length of variable-length string, ustring, and raw fields is set to 0.
v The characters of a fixed-length string and fixed-length ustring are set to null (0x00) or to the pad character, if one is specified.
v The bytes of fixed-length raw fields are set to zero.
v The tag of a tagged aggregate is set to 0 to set the data type to be that of the first field of the tagged aggregate.
v The length of variable-length vector fields is set to 0.

When you have finished writing to an output record, you must call APT_OutputCursor::putRecord() to advance the output cursor to the next record in the output data set, making it the current output record.
3: Define inCur, an instance of APT_InputCursor, the input cursor for the first data set input to this operator.

4: Define outCur, an instance of APT_OutputCursor, the output cursor for the first data set output by this operator.

5: Use APT_Operator::setupInputCursor() to initialize inCur. Input data sets are numbered starting from 0.

6: Use APT_Operator::setupOutputCursor() to initialize outCur. Output data sets are numbered starting from 0.

7: Use APT_InputCursor::getRecord() to advance the input cursor to the next input record. You must call this function before attempting to process an input record, because an input cursor does not initially reference a valid record. APT_InputCursor::getRecord() can take as an argument a skip value that allows you to bypass records of an input data set. By default, the skip value is 0, causing you to update the input cursor to the next record in the input data set. The following statement causes the input cursor to skip two records every time it is called: inCur.getRecord(2). There is no way to go back to the skipped records. APT_InputCursor::getRecord() returns false when there are no more records in the input data set, terminating the while loop.

9: Process the record (including writing any results to the current output record). You need not call APT_OutputCursor::putRecord() until after you have written to the first output record, because the output cursor initially points to the first empty record in an output data set.
Fields in the output record are read/write. This allows you to use the output record as temporary storage within your operator. The output field values are not permanent until you call APT_OutputCursor::putRecord() to advance the output cursor to the next output record.

10: Use APT_OutputCursor::putRecord() to update the current output record, then advance the output cursor to the next output record. Not all operators produce an output record for each input record. Also, operators can produce more output records than there are input records. An operator can process many input records before computing a single output record. You call putRecord() only when you have completed processing an output record, regardless of the number of input records you process between calls to putRecord().
Field accessors
After you have defined a cursor to reference the records of a data set, you define field accessors to reference record fields. You assign field accessors to each component of the record schema of the data set that you want to access.

For an input or output data set, field accessors provide named access to the record fields. Such access is necessary if an operator is to process data sets. No field accessor is allowed for schema variables, which have no defined data type.

Operators use field accessors to read the fields of an input record and to write the fields of an output record. Field accessors do not allow access to the entire data set at one time; instead, they allow you to access the fields of the current input or output record as defined by an input or output cursor.

Field accessors allow you to work with nullable fields. Using accessors, you can determine if a field contains a null before processing the field, or you can set a field to null.

The fields of an input record are considered read-only. There is no mechanism for you to write into the fields of the records of an input data set. Because the fields of an output record are considered read/write, you can modify the records of an output data set.
Table 19. Field Accessor Classes (continued)

Field Type                  Input Accessor Class            Output Accessor Class
32-bit signed integer       APT_InputAccessorToInt32        APT_OutputAccessorToInt32
32-bit unsigned integer     APT_InputAccessorToUInt32       APT_OutputAccessorToUInt32
64-bit signed integer       APT_InputAccessorToInt64        APT_OutputAccessorToInt64
64-bit unsigned integer     APT_InputAccessorToUInt64       APT_OutputAccessorToUInt64
Single-precision float      APT_InputAccessorToSFloat       APT_OutputAccessorToSFloat
Double-precision float      APT_InputAccessorToDFloat       APT_OutputAccessorToDFloat
String                      APT_InputAccessorToString       APT_OutputAccessorToString
Ustring                     APT_InputAccessorToUString      APT_OutputAccessorToUString
Raw                         APT_InputAccessorToRawField     APT_OutputAccessorToRawField
Date                        APT_InputAccessorToDate         APT_OutputAccessorToDate
Decimal                     APT_InputAccessorToDecimal      APT_OutputAccessorToDecimal
Time                        APT_InputAccessorToTime         APT_OutputAccessorToTime
Timestamp                   APT_InputAccessorToTimeStamp    APT_OutputAccessorToTimeStamp
Here is example code that uses three of the field accessor classes:
// Define input accessors
APT_InputAccessorToInt32 aInAccessor;
APT_InputAccessorToSFloat bInAccessor;
APT_InputAccessorToString cInAccessor;
In addition, the classes APT_InputTagAccessor and APT_OutputTagAccessor provide access to the tag of a tagged aggregate. This is necessary to determine the current data type of a tagged aggregate. You must also define individual accessors for each element of a tagged aggregate. The tag of a tagged aggregate in an input data set is read-only; the tag of a tagged aggregate in an output data set is read/write.
All the output accessor classes support the following member functions:
v value() returns the value of an output field.
v valueAt() returns the value of an output vector field.
v vectorLength() returns the length of an output vector field.
v setValue() sets the value of an output field.
v setValueAt() sets the value of an output vector field.
v setVectorLength() sets the length of an output vector field.
[Figure 30. Sample operator: AddOperator, whose output data set has the interface schema field1:int32; field2:int32; total:int32]
This operator adds two fields of an input record and stores the sum in a field of the output record. In addition, this operator copies the two fields of the input to corresponding fields of the output. For each of the components of the input and output interface schemas, you define a single field accessor. In this case, therefore, you need two input accessors for the input interface schema and three output accessors for the output interface schema. This example uses field accessors to explicitly copy field1 and field2 from an input record to the corresponding fields in an output record. If the input data set had a record schema that defined more than these two fields, all other fields would be ignored by AddOperator and not copied to the output data set. The code in the following section is the describeOperator() function for AddOperator:
Table 20. Accessors to numeric data types in describeOperator() code

APT_Status AddOperator::describeOperator()
{
    setInputDataSets(1);                                                  // 3
    setOutputDataSets(1);                                                 // 4
    setInputInterfaceSchema("record (field1:int32; field2:int32)", 0);    // 5
    setOutputInterfaceSchema(
        "record (field1:int32; field2:int32; total:int32)", 0);           // 6
    return APT_StatusOk;
}
3: Set the number of input data sets to 1.

4: Set the number of output data sets to 1.

5: Specify the interface schema of input 0 (input data sets are numbered starting from 0). You can pass a string containing the interface schema as an argument to APT_Operator::setInputInterfaceSchema().

6: Specify the interface schema of output 0 (the first output data set).
Field accessors are defined as local variables of APT_Operator::runLocally(). To access the values of fields defined in an interface schema, you must create input accessors and output accessors in the APT_Operator::runLocally() function of this operator. The code in the following section shows how the runLocally() function for AddOperator would be written:
Table 21. Creating Accessors in runLocally() code

APT_Status AddOperator::runLocally()
{
    APT_InputCursor inCur;
    APT_OutputCursor outCur;
    setupInputCursor(&inCur, 0);
    setupOutputCursor(&outCur, 0);
    APT_InputAccessorToInt32 field1InAcc("field1", &inCur);               // 7
    APT_InputAccessorToInt32 field2InAcc("field2", &inCur);               // 8
    APT_OutputAccessorToInt32 field1OutAcc("field1", &outCur);            // 9
    APT_OutputAccessorToInt32 field2OutAcc("field2", &outCur);            // 10
    APT_OutputAccessorToInt32 totalOutAcc("total", &outCur);              // 11
    while (inCur.getRecord())                                             // 12
    {
        *totalOutAcc = *field1InAcc + *field2InAcc;                       // 14
        *field1OutAcc = *field1InAcc;                                     // 15
        *field2OutAcc = *field2InAcc;                                     // 16
        outCur.putRecord();                                               // 17
    }
    return APT_StatusOk;
}
7-8: Define read-only accessors for the fields of the operator input interface schema.

9-11: Define read/write accessors for the fields of the operator output interface schema.

12: Use APT_InputCursor::getRecord() to advance the input data set to the next input record.

14-16: Dereference the field accessors to access the values of the record fields in both the input and the output data set.

17: Use APT_OutputCursor::putRecord() to update the current output record and advance the output data set to the next output record.
[Figure 31. Operator containing a vector field in its interface schemas: AddOperator, whose output data set has the interface schema field1[10]:int32; total:int32]
This operator adds all elements of the vector in the input record and stores the sum in a field of the output record. In addition, this operator copies the input vector to the output. The code in the following table is the describeOperator() function for AddOperator.
Table 22. Accessors to fixed-length vector data types in describeOperator() code

APT_Status AddOperator::describeOperator()
{
    setInputDataSets(1);                                                  // 3
    setOutputDataSets(1);                                                 // 4
    setInputInterfaceSchema("record (field1[10]:int32;)", 0);             // 5
    setOutputInterfaceSchema("record (field1[10]:int32; total:int32)", 0);// 6
    return APT_StatusOk;
}
3: Set the number of input data sets to 1.

4: Set the number of output data sets to 1.

5: Specify the interface schema of input 0 (input data sets are numbered starting from 0). You can pass a string containing the interface schema as an argument to APT_Operator::setInputInterfaceSchema().

6: Specify the interface schema of output 0 (the first output data set).
For a vector, you only need to define a single accessor to access all vector elements. The runLocally() function for AddOperator would be written as shown in the following code example.
Table 23. Accessors to fixed-length vector fields in runLocally() code

    APT_Status AddOperator::runLocally()
    {
        APT_InputCursor inCur;
        APT_OutputCursor outCur;
        setupInputCursor(&inCur, 0);
        setupOutputCursor(&outCur, 0);
        APT_InputAccessorToInt32 field1InAcc("field1", &inCur);     // 7
        APT_OutputAccessorToInt32 field1OutAcc("field1", &outCur);  // 8
        APT_OutputAccessorToInt32 totalOutAcc("total", &outCur);    // 9
        while (inCur.getRecord())
        {
            *totalOutAcc = 0;                                       // 12
            for (int i = 0; i < 10; i++)                            // 13
            {
                *totalOutAcc = *totalOutAcc + field1InAcc[i];       // 15
                field1OutAcc[i] = field1InAcc[i];                   // 16
            }
            outCur.putRecord();
        }
        return APT_StatusOk;
    }

Comments:

7: Define a read-only accessor for the fields of the operator's input interface schema.

8-9: Define read/write accessors for the fields of the operator's output interface schema.

12: Clear the total in the output record. The initial value of all numeric fields in an output record is already 0 (or NULL if the field is nullable); this statement is included for clarity only.

13: Create a for loop to add the elements of the input vector to the output total field.

15: Dereference the field accessors to access the values of the vector elements.

16: Copy the value of element i of the input vector to element i of the output vector. Because the output interface schema defines the length of the vector in the output record, you do not have to set it; however, you must set the vector length of a variable-length vector. You could also use the following equivalent statement to write the field value: field1OutAcc.setValueAt(i, field1InAcc.valueAt(i));
The following figure shows an operator containing a variable-length vector field in its interface schemas:

Figure 32. Operator containing a variable-length vector field in its interface schemas (AddOperator; output interface schema: field1[]:int32; total:int32;)

This operator adds all the elements of the vector in the input record and stores the sum in a field of the output record. In addition, this operator copies the input vector to the output. Because the input interface schema defines a variable-length vector, the output interface schema contains a corresponding variable-length vector. The code in the following table is the describeOperator() function for AddOperator.
Table 24. Accessors to variable-length vector data types in describeOperator() code

    APT_Status AddOperator::describeOperator()
    {
        setInputDataSets(1);                                                   // 3
        setOutputDataSets(1);                                                  // 4
        setInputInterfaceSchema("record (field1[]:int32;)", 0);                // 5
        setOutputInterfaceSchema("record (field1[]:int32; total:int32)", 0);   // 6
        return APT_StatusOk;
    }

Comments:

3: Set the number of input data sets to 1.

4: Set the number of output data sets to 1.

5: Specify the interface schema of input 0 (input data sets are numbered starting from 0). You can pass a string containing the interface schema as an argument to APT_Operator::setInputInterfaceSchema().

6: Specify the interface schema of output 0 (the first output data set).
For a vector, you only need to define a single accessor to access all vector elements. The runLocally() function for AddOperator would be written as shown in the following code example.
Table 25. Accessors to variable-length vector fields in runLocally() code

    APT_Status AddOperator::runLocally()
    {
        APT_InputCursor inCur;
        APT_OutputCursor outCur;
        setupInputCursor(&inCur, 0);
        setupOutputCursor(&outCur, 0);
        APT_InputAccessorToInt32 field1InAcc("field1", &inCur);        // 7
        APT_OutputAccessorToInt32 field1OutAcc("field1", &outCur);     // 8
        APT_OutputAccessorToInt32 totalOutAcc("total", &outCur);       // 9
        while (inCur.getRecord())
        {
            *totalOutAcc = 0;                                          // 12
            field1OutAcc.setVectorLength(field1InAcc.vectorLength());  // 13
            for (int i = 0; i < field1InAcc.vectorLength(); i++)       // 14
            {
                *totalOutAcc = *totalOutAcc + field1InAcc[i];          // 16
                field1OutAcc[i] = field1InAcc[i];                      // 17
            }
            outCur.putRecord();
        }
        return APT_StatusOk;
    }

Comments:

7: Define a read-only accessor for the fields of the input interface schema.

8-9: Define read/write accessors for the fields of the output interface schema.

12: Clear the total in the output record. The initial value of all numeric fields in an output record is already 0 (or NULL if the field is nullable); this statement is included for clarity only.

13: Set the length of the variable-length vector field in the output record.

14: Create a for loop to add the elements of the input vector to the output total field. APT_InputAccessorToInt32::vectorLength() returns the length of a vector field.

16: Use APT_InputAccessorToInt32::operator*, APT_InputAccessorToInt32::operator[], and APT_OutputAccessorToInt32::operator* to dereference the field accessors and access the values of the vector elements. You can also use the equivalent member functions to access the fields of input and output records. This is line 16 rewritten using these functions: totalOutAcc.setValue(totalOutAcc.value() + field1InAcc.valueAt(i));

17: Copy the value of element i of the input vector to element i of the output vector. You can also use the following equivalent statement to write the field value: field1OutAcc.setValueAt(i, field1InAcc.valueAt(i));
For example, you can have a data set whose record schema contains an age field. If the age field of a particular record is null, the age is not known for the person corresponding to the record. As part of processing a record field, you can detect a null and take the appropriate action. For instance, you can omit the null field from a calculation, signal an error condition, or take some other action. To recognize a nullable field, the field of the interface must be defined to be nullable. You include the keyword nullable in the interface specification of a field to make it nullable. For example, all fields of the operator shown in the following figure are nullable:
AddOperator (interface schema: field1:nullable int32; field2:nullable int32; total:nullable int32;)

In the describeOperator() function for this operator, you specify that all fields of the interface schema of input 0 are nullable; you can also individually specify any or all fields of the interface schema as nullable. Likewise, you specify that all fields of the output interface schema are nullable. For the first record in the output data set, and after each call to APT_OutputCursor::putRecord(), the null indicator in all nullable output fields is set, marking each field as containing a null. Writing a value to an output field clears the null indicator.
If an input field to an operator contains null and the corresponding operator interface field is not nullable, InfoSphere DataStage issues a fatal error and aborts your application. You can use view adapters to prevent a fatal error in this case.
Both input and output accessors contain member functions for working with nulls.

For input accessors, you use:

isNull() Returns true if the accessor references a field containing a null.

isNullAt() Returns true if the accessor references a vector element containing a null.

isNullable() Returns true if the accessor references a nullable field.

For output accessors, you use:

isNull() Returns true if the accessor references a field containing a null.

isNullAt() Returns true if the accessor references a vector element containing a null.

isNullable() Returns true if the accessor references a nullable field.

clearIsNull() Clears the null indicator for the field referenced by an accessor and sets the field to the default value. If isNullable() returns false, clearIsNull() does nothing. For the first record in the output data set, and after each call to APT_OutputCursor::putRecord(), the null indicator in all nullable output fields of the new output record is set, marking the field as containing a null. Writing a value to an output field clears the null indicator, so it is typically unnecessary to call clearIsNull().

clearIsNullAt() Clears the null indicator for the vector element referenced by an accessor, and sets the value of the field to the default value for its type.

setIsNull() Sets the null indicator for the field referenced by an accessor, marking the field as containing a null. setIsNull() requires that isNullable() returns true. Because the null flag for all nullable output fields is initially set, you only need to call setIsNull() if you have written valid data to an output field and later decide to set the field to null.

setIsNullAt() Sets the null indicator for a vector element referenced by an accessor, marking the field as containing a null.

You typically use the operator* member function of an input or output accessor to obtain the value of a field.
Chapter 8. Using cursors and accessors

For an input accessor, using operator* on a field containing a null causes a requirements violation and aborts your application. Therefore, you must first determine whether a nullable input field contains a null, using isNull(), before attempting to access it. For an output accessor, calling operator* always clears the null indicator for a nullable field, marking the field as containing valid data. You can use setIsNull() to explicitly set the null indicator in an output field, if necessary.

How you handle nulls detected in an input field, and what conditions cause an output field to be set to null, is determined by your operator logic. Often, a null in an input field is propagated through to the output. For example, the following code example is the APT_Operator::runLocally() function for AddOperator:
Table 27. Handling nullable fields in runLocally() code

    APT_Status AddOperator::runLocally()
    {
        APT_InputCursor inCur;
        APT_OutputCursor outCur;
        setupInputCursor(&inCur, 0);
        setupOutputCursor(&outCur, 0);
        APT_InputAccessorToInt32 field1InAcc("field1", &inCur);
        APT_InputAccessorToInt32 field2InAcc("field2", &inCur);
        APT_OutputAccessorToInt32 field1OutAcc("field1", &outCur);
        APT_OutputAccessorToInt32 field2OutAcc("field2", &outCur);
        APT_OutputAccessorToInt32 totalOutAcc("total", &outCur);
        while (inCur.getRecord())
        {
            if (!field1InAcc.isNull())                               // 14
                *field1OutAcc = *field1InAcc;
            if (!field2InAcc.isNull())                               // 16
                *field2OutAcc = *field2InAcc;
            if (!field1InAcc.isNull() && !field2InAcc.isNull())      // 18
                *totalOutAcc = *field1InAcc + *field2InAcc;
            outCur.putRecord();                                      // 20
        }
        return APT_StatusOk;
    }

Comments:

14: Use field1InAcc.isNull() to determine if field1 contains a null. If not, copy it to the output record.

16: Determine if field2 contains a null. If not, copy it to the output record.

18: If both field1 and field2 contain valid data, perform the addition. Writing to total clears the null indicator for the field.

20: Call APT_OutputCursor::putRecord() to write the output record.
APT_UString represents the value type of the ustring schema field, which is used for processing multi-byte Unicode character data. It stores its contents in a UChar array. UChar is defined in the ICU header file unicode/utf.h, and its implementation is operating-system dependent. Strings are sorted using the collation services. APT_UString uses conversion services for input from, and output to, streams that are UTF-8 encoded.

The header files basicstring.h, string.h, and ustring.h contain descriptions of the APT_String and APT_UString class interface functions. Also see the header file unicode_utils.h.

You access a field of type APT_String or APT_UString using the field accessors APT_InputAccessorToString or APT_InputAccessorToUString and APT_OutputAccessorToString or APT_OutputAccessorToUString. Once you have defined and initialized the accessor, you use indirect addressing, by means of the dereferencing operator ->, to call a member function of APT_String or APT_UString to process the field.

As the class member functions show, APT_String and APT_UString let you copy a string field using operator=, and compare string fields using operator==, operator!=, isEqualCI(), and other functions. These classes also contain member functions to access the contents and length of a string field. The following code creates an input accessor to a string field, then uses the accessor to call member functions of APT_String:
    APT_InputAccessorToString field1InAcc("field1", &inCur);
    while (inCur.getRecord())
    {
        APT_Int32 fieldLen = field1InAcc->length();
        const char * buffer = field1InAcc->content();
        . . .
    }
The following figure shows an operator containing a variable-length ustring field and a fixed-length string field in its interface schemas:
Figure 34. StringOperator with variable-length string fields (interface schema: field1:ustring; field2:string[10];)
The code in the following table is the describeOperator() function for StringOperator.
Table 28. Accessors to string and ustring data types in describeOperator() code

    APT_Status StringOperator::describeOperator()
    {
        setInputDataSets(1);                                                    // 3
        setOutputDataSets(1);                                                   // 4
        setInputInterfaceSchema("record (field1:ustring;field2:string[10])",0); // 5
        setOutputInterfaceSchema("record (field1:ustring;field2:string[10])",0);// 6
        return APT_StatusOk;
    }

Comments:

3: Set the number of input data sets to 1.

4: Set the number of output data sets to 1.

5: Specify the interface schema of input 0 (input data sets are numbered starting from 0). You can pass a string containing the interface schema as an argument to APT_Operator::setInputInterfaceSchema().

6: Specify the interface schema of output 0 (the first output data set).
In the runLocally() function for StringOperator, the numbered lines perform these tasks:

7-8: Define the input string accessors using APT_InputAccessorToUString and APT_InputAccessorToString. After you have set up an accessor to a field of type APT_UString or APT_String, you can use the member functions of APT_UString or APT_String to manipulate the field. Because accessors are a type of pointer, you can use them to call APT_UString or APT_String member functions using the dereferencing operator, ->.

9-10: Define the output string accessors using APT_OutputAccessorToUString and APT_OutputAccessorToString.

13: Determine the number of code points in the variable-length ustring field.

14: Return a pointer to the contents of the ustring field. Because a ustring field is not defined to be null-terminated, you need the length of the field as well.

15: Copy the buffer to the output field, including the ustring length. By default, a variable-length ustring field in an output data set has a length of 0; you must set the ustring length as part of writing to the field.

16: Directly copy the fixed-length input string field to the output string field. If the input fixed-length string is longer than the output fixed-length string, the input string is truncated to the length of the output. If the input string is shorter, the output string is by default padded with zeros to the length of the output string. You can call setPadChar() to specify a different pad character.
Processing fixed and variable length vectors of string fields is the same as for vectors of numeric data types.
Unicode utilities
The Unicode utility functions are described in the header file unicode_utils.h. The Unicode utility functions include the following categories of character set conversion functions:
v UTF-8 character set conversion functions
v OS character set conversion functions
v Input character set conversion functions
v Invariant ASCII character set conversion functions
v User-specified character set conversion functions
Unicode utility functions also include string manipulation functions, Ctype functions, and file-related functions. These functions accept both char and UChar arguments unless otherwise noted.

Some of the APT_UString comparison methods accept a collation sequence. The <, >, <=, and >= operators also use a collation sequence if a non-default sequence is available. The default collation sequence uses byte-wise comparisons.
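The difference between the default byte-wise ordering and a non-default collation can be illustrated in plain C++. This sketch uses std::string instead of APT_UString and a simple case-insensitive ordering as a stand-in for a collation sequence; real collation services (ICU) apply locale rules and are not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Byte-wise comparison: what the default (no collation sequence) ordering does.
bool byteLess(const std::string& a, const std::string& b) {
    return a < b;  // std::string::operator< compares bytes
}

// A trivial stand-in for a non-default collation: case-insensitive ordering.
bool ciLess(std::string a, std::string b) {
    std::transform(a.begin(), a.end(), a.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    std::transform(b.begin(), b.end(), b.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return a < b;
}
```

Byte-wise, every uppercase letter sorts before every lowercase letter, so "Zebra" precedes "apple"; under the case-insensitive ordering it does not.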
    APT_Int32 transform(C* newData, APT_Int32 length, bool caseInsensitive = false,
                        APT_Int32 strLen = -1, const APT_CollationSeq* seq = NULL) const;
    APT_Int32 transformLength(const APT_CollationSeq* seq = NULL,
                              bool caseInsensitive = false) const;
    bool isTransformNecessary(const APT_CollationSeq* seq = NULL) const;
    void setCollationSeq(APT_CollationSeq* col);
    const APT_CollationSeq* getCollationSeq(const APT_CollationSeq* seq = NULL) const;
The following figure shows an operator containing a variable-length and a fixed-length raw field in its interface schemas:
Figure 35. RawOperator with variable-length and fixed-length raw fields (interface schema: field1:raw; field2:raw[10];)
In the runLocally() function for RawOperator, the numbered lines perform these tasks:

7-8: Define the input raw accessors using APT_InputAccessorToRawField. Once you have set up an accessor to a field of type APT_RawField, you can use the member functions of APT_RawField to manipulate the field. Because accessors are a type of pointer, you can use them to call APT_RawField member functions using the dereferencing operator, ->.

9-10: Define the output raw accessors using APT_OutputAccessorToRawField.

13: Use APT_RawField::length() to return the number of bytes in the variable-length raw field.

14: Use APT_RawField::content() to return a pointer to the contents of the raw field. The returned pointer is of type void*. Because a raw field is not defined to be null-terminated, you might need the length of the field as well.

15: After processing, use APT_RawField::assignFrom() to copy the results to the output raw field, including the field length. You must set the length as part of writing to the field. By default, a variable-length raw field in an output data set has a length of 0.

16: Use APT_OutputAccessorToRawField::operator* to copy the input fixed-length raw field directly to the output field. Because the output field is read/write, you can process the raw field in place.
Processing vectors of raw fields, either fixed or variable length, is the same as for vectors of numeric data types.
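The raw-field operations described above — length(), content(), and assignFrom() — can be modeled with a byte vector. The class below is an illustrative stand-in, not the APT_RawField API; it only demonstrates that a raw field carries its length with its contents because the buffer is not null-terminated.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal model of a variable-length raw field: a byte buffer that is not
// null-terminated, so its length must travel with its contents.
class RawField {
public:
    std::size_t length() const { return bytes_.size(); }   // number of bytes
    const void* content() const { return bytes_.data(); }  // raw pointer, no terminator
    // Copy both contents and length, as APT_RawField::assignFrom() is described
    // to do above; a fresh output field starts with length 0.
    void assignFrom(const void* data, std::size_t len) {
        const unsigned char* p = static_cast<const unsigned char*>(data);
        bytes_.assign(p, p + len);
    }
private:
    std::vector<unsigned char> bytes_;
};
```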
After you have defined an accessor to one of these fields, you use the dereference operator, ->, to call member functions of the corresponding class to process the field. For example, the following figure shows an operator containing a date field in its interface schemas:
DateOperator (interface schema: field1:date;)
The runLocally() function for DateOperator would be written as shown in the following table:
Table 32. Handling a date field in runLocally() code

    APT_Status DateOperator::runLocally()
    {
        APT_InputCursor inCur;
        APT_OutputCursor outCur;
        setupInputCursor(&inCur, 0);
        setupOutputCursor(&outCur, 0);
        APT_InputAccessorToDate field1InAcc("field1", &inCur);      // 7
        APT_OutputAccessorToDate field1OutAcc("field1", &outCur);   // 8
        APT_Date cutoffDate(1997, 01, 01);                          // 9
        while (inCur.getRecord())
        {
            if (*field1InAcc < cutoffDate)
            {
                int year = field1InAcc->year();
                int month = field1InAcc->month();
                int day = field1InAcc->day();
                . . .
            }
            outCur.putRecord();
        }
        return APT_StatusOk;
    }

Comments:

7: Create an input accessor to the input date field.

8: Create an output accessor to the output date field.

9: Create cutoffDate, an instance of APT_Date, and initialize it to 1/1/97.
Figure 37. Operator containing a subrecord and a tagged aggregate field in its interface schemas (AggregateOperator; schema: a:subrec (aSubField1:int32; aSubField2:sfloat;); b:tagged (bTaggedField1:string; bTaggedField2:int32;))
In order to access the elements of an aggregate field, you must define accessors to:
v Each element of the aggregate
v The tag for a tagged aggregate
Table 33. Handling aggregate fields in runLocally() code

    APT_Status AggregateOperator::runLocally()
    {
        APT_InputCursor inCur;
        APT_OutputCursor outCur;
        setupInputCursor(&inCur, 0);
        setupOutputCursor(&outCur, 0);
        APT_InputAccessorToInt32 aSubField1In("a.aSubField1", &inCur);           // 7
        APT_InputAccessorToSFloat aSubField2In("a.aSubField2", &inCur);          // 8
        APT_InputTagAccessor bTagIn;                                             // 9
        inCur.setupTagAccessor("b", &bTagIn);                                    // 10
        APT_InputAccessorToString bTaggedField1In("b.bTaggedField1", &inCur);    // 11
        APT_InputAccessorToInt32 bTaggedField2In("b.bTaggedField2", &inCur);     // 12
        APT_OutputAccessorToInt32 aSubField1Out("a.aSubField1", &outCur);        // 13
        APT_OutputAccessorToSFloat aSubField2Out("a.aSubField2", &outCur);       // 14
        APT_OutputTagAccessor bTagOut;                                           // 15
        outCur.setupTagAccessor("b", &bTagOut);                                  // 16
        APT_OutputAccessorToString bTaggedField1Out("b.bTaggedField1", &outCur); // 17
        APT_OutputAccessorToInt32 bTaggedField2Out("b.bTaggedField2", &outCur);  // 18
        while (inCur.getRecord())
        {
            *aSubField1Out = *aSubField1In;
            *aSubField2Out = *aSubField2In;
            switch (bTagIn.tag())                                                // 23
            {
            case 0:
                bTagOut.setTag(0);                                               // 25
                *bTaggedField1Out = *bTaggedField1In;                            // 26
                break;
            case 1:
                bTagOut.setTag(1);
                *bTaggedField2Out = *bTaggedField2In;
                break;
            default:
                APT_ASSERT(0);                                                   // 35
            }
            outCur.putRecord();
        }
        return APT_StatusOk;
    }

Comments:

7-8: Define input accessor elements for the subrecord. Note that you use dot-delimited referencing to refer to the fields of an aggregate in much the same way that you do for the elements of a C structure. Once you have defined accessors for the subrecord aggregate elements, you access the fields of a subrecord aggregate in the same way you access ordinary fields.

9-12: Define a tag accessor and accessor elements for the tagged aggregate.

13-14: Define output accessors for fields in the output subrecord.

15-18: Define a tag accessor and accessor elements for the output tagged aggregate.

23: Determine the active tag element. For tagged aggregates, only one element of the tagged aggregate is active at a time. For an input data set, you use a tag accessor to determine the currently active element.

25: Set the tag in the output tagged field. For an output record, you must set the tag in order to specify the data type of a tagged field. Though you can change the tag for every record in an output data set, data might be destroyed. Once you set the tag for a record, it is good practice not to change it.

26: Copy the input field to the output field.

35: Use the macro APT_ASSERT(0) to generate an assertion failure if the tag value is not 0 or 1. Such a value is invalid because field b is defined to contain only two elements. This condition should never happen; therefore, you handle it using an assertion.
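The tag mechanics described above can be modeled with std::variant: exactly one alternative is active at a time, you inspect the tag before reading, and you set the tag (by assigning an alternative) before writing. This plain C++ sketch does not use the APT tag accessors; the types and function are illustrative only.

```cpp
#include <cassert>
#include <string>
#include <variant>

// Model of the tagged field b: element 0 is a string, element 1 is an int32.
using TaggedB = std::variant<std::string, int>;

// Copy the active element of the input tag to the output, as the switch on
// bTagIn.tag() does in the runLocally() example above.
TaggedB copyTagged(const TaggedB& in) {
    TaggedB out;
    switch (in.index()) {                  // index() plays the role of tag()
    case 0: out = std::get<0>(in); break;  // setTag(0) + copy the string element
    case 1: out = std::get<1>(in); break;  // setTag(1) + copy the int element
    default: assert(false);                // cannot happen: only two elements
    }
    return out;
}
```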
SubrecordVectorOperator (interface schema: A[10]:subrec (aSubField1:int32; aSubField2:sfloat;))
To access the elements of a subrecord vector, you must define:
v An accessor to each element of the subrecord
v A subcursor to each subrecord vector
Table 34. Handling subrecord vector fields in runLocally() code

    APT_Status SubrecordVectorOperator::runLocally()
    {
        APT_InputCursor inCur;
        APT_OutputCursor outCur;
        setupInputCursor(&inCur, 0);
        setupOutputCursor(&outCur, 0);
        APT_InputSubCursor inSubCur("a", &inCur);                          // 7
        APT_OutputSubCursor outSubCur("a", &outCur);                       // 8
        APT_InputAccessorToInt32 aSubField1In("a.aSubField1", &inCur);     // 9
        APT_InputAccessorToSFloat aSubField2In("a.aSubField2", &inCur);    // 10
        APT_OutputAccessorToInt32 aSubField1Out("a.aSubField1", &outCur);  // 11
        APT_OutputAccessorToSFloat aSubField2Out("a.aSubField2", &outCur); // 12
        while (inCur.getRecord())                                          // 13
        {
            for (int i = 0; i < inSubCur.vectorLength(); i++)              // 15
            {
                *aSubField1Out = *aSubField1In;                            // 17
                *aSubField2Out = *aSubField2In;                            // 18
                inSubCur.next();                                           // 19
                outSubCur.next();                                          // 20
            }
            outCur.putRecord();                                            // 22
        }
        return APT_StatusOk;
    }

Comments:

7: Define inSubCur, an instance of APT_InputSubCursor. In order to access the elements of the subrecord vector field, you must define a subcursor for the aggregate vector and a field accessor for each element of the aggregate. You pass to the constructor the name of the subrecord field accessed by the subcursor, and the input cursor used by the subcursor. An input subcursor starts out as valid, unlike an input cursor.

8: Define outSubCur, an instance of APT_OutputSubCursor.

9-12: Create input and output accessors.

13: Call APT_InputCursor::getRecord() to initialize the input cursor to the first input record. This function also resets all subcursors bound to the input cursor to reference the first element in the subrecord vector.

15: Create a for loop to index through all vector elements. Use APT_InputSubCursor::vectorLength() to return the length of the input vector.

17: Copy the first input subrecord field to the output.

18: Copy the second input subrecord field to the output.

19: Use APT_InputSubCursor::next() to increment the input subcursor to the next element in the vector.

20: Use APT_OutputSubCursor::next() to increment the output subcursor to the next element in the vector.

22: Use APT_OutputCursor::putRecord() to write the output record. This call resets all output subcursors bound to this output cursor to reference the first vector element.
You can also define a vector of subrecords and nest it within a subrecord that is itself either a vector or a scalar. You use the same procedure described for nested subrecord vectors.
You can use the protected member functions setPreservePartitioningFlag() and clearPreservePartitioningFlag() to set or clear the preserve-partitioning flag in the output data set of a derived operator as part of the override of APT_Operator::describeOperator().

These functions have a lower priority than the member functions of APT_DataSet that manipulate the preserve-partitioning flag: they can modify the preserve-partitioning flag of an output data set only if the flag has not been explicitly set or cleared by the APT_DataSet member functions. Any attempt by an operator to modify a flag that has been explicitly set or cleared by the member functions of APT_DataSet is ignored.

You include calls to these member functions as part of the override of APT_Operator::describeOperator(). You must call the appropriate member function for each output data set.
The first argument, pType, specifies the partitioning method, as defined by one of the following values:
v APT_Operator::eAny (default)
v APT_Operator::eRoundRobin
v APT_Operator::eRandom
v APT_Operator::eSame
v APT_Operator::eEntire

The second argument, inputDS, specifies the number of the input data set to the operator. The input data sets to an operator are numbered starting from 0. For example, to use round robin partitioning with an operator that takes a single input data set, include the following statements within the describeOperator() function:
    setKind(APT_Operator::eParallel);
    setPartitionMethod(APT_Operator::eRoundRobin, 0);
If the operator has two input data sets and you want to partition both data sets using random partitioning, you include the lines:
    setKind(APT_Operator::eParallel);
    setPartitionMethod(APT_Operator::eRandom, 0); // input data set 0
    setPartitionMethod(APT_Operator::eRandom, 1); // input data set 1
(Figure: SortOperator, with input interface schema field1:int32; field2:int32; field3:string; in:*; and output interface schema out:*;)
APT_HashPartitioner does not define any interface schema; you use the APT_HashPartitioner constructor or the member function APT_HashPartitioner::setKey() to specify the key fields. The constructor for APT_HashPartitioner has two overloads:
    APT_HashPartitioner();
    APT_HashPartitioner(const APT_FieldList& fList);
The first overload creates an APT_HashPartitioner object without specifying any key fields. You must then use setKey() to specify key fields. The second form of the constructor creates an APT_HashPartitioner object using a list of key fields from the input interface schema for the operator. These fields can be any field type, including raw, date, and timestamp. APT_HashPartitioner determines the data type of each field from the input interface schema. SortOperator requires three fields as input: two integer fields and a string field. You can specify the interface schema of the partitioner within the describeOperator() function, as the code in the following table shows:
Table 35. Partitioner interface schema in describeOperator()

    APT_Status SortOperator::describeOperator()
    {
        setKind(APT_Operator::eParallel);
        setInputDataSets(1);
        setOutputDataSets(1);
        setInputInterfaceSchema("record (field1:int32; field2:int32; field3:string; in:*;)", 0);
        setOutputInterfaceSchema("record (out:*;)", 0);
        declareTransfer("in", "out", 0, 0);
        APT_HashPartitioner * hashPart = new APT_HashPartitioner;   // 9
        hashPart->setKey("field1", "int32");                        // 10
        hashPart->setKey("field2", "int32");                        // 11
        setPartitionMethod(hashPart, APT_ViewAdapter(), 0);         // 12
        return APT_StatusOk;
    }

Comments:

9: Use the default constructor to dynamically allocate an APT_HashPartitioner object. Partitioner objects must be dynamically allocated within describeOperator(); the framework deletes the partitioner for you when it is no longer needed. You must call setKey() to specify the key fields for this APT_HashPartitioner object.

10: Use APT_HashPartitioner::setKey() to specify field1 as a key field for this APT_HashPartitioner object. setKey() takes both a field name and a data type for the field. The order in which key fields are listed is unimportant.

11: Specify field2 as a key field for this APT_HashPartitioner object.

12: Use APT_Operator::setPartitionMethod() to specify hashPart as the partitioner for this operator. After calling this function, do not delete the partitioner, because the framework has claimed its memory. Because you do not need a view adapter with this partitioner, this call creates and passes a default view adapter.
An application developer using this operator can use adapters to translate the name of a data set field and its data type in the input data set schema to match the input interface schema. In the previous figure, the data set myDS is input to the sort operator. An application developer could translate field a and field b of myDS to field1 and field2 of the operator. Therefore, the hash partitioner would partition the record by fields a and b.
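The effect of hashing on two integer key fields can be sketched without the framework: combine the two key values into one hash and reduce it modulo the partition count. The mixing function below is illustrative only — it is not the parallel engine's actual hash — but it shows the two properties hash partitioning relies on: the result is always a valid partition number, and equal keys always map to the same partition.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Assign a partition from two int32 key fields (field1, field2), as an
// APT_HashPartitioner configured with setKey("field1", "int32") and
// setKey("field2", "int32") conceptually does. Illustrative mixing only.
int hashPartition(std::int32_t field1, std::int32_t field2, int numPartitions) {
    std::size_t h1 = std::hash<std::int32_t>{}(field1);
    std::size_t h2 = std::hash<std::int32_t>{}(field2);
    // Combine the two hashes (boost-style mix), then reduce to a partition.
    std::size_t combined = h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
    return static_cast<int>(combined % static_cast<std::size_t>(numPartitions));
}
```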
Range partitioning guarantees that all records with the same partitioning key values are assigned to the same partition and that the partitions are approximately equal in size. This means that all nodes perform an approximately equal amount of work when processing the data set. You use the class APT_RangePartitioner to implement this partitioning method.

For the range partitioner to determine the partition boundaries, you pass it a sorted sample of the data set to be range partitioned. This sorted sample is stored in a data file, not as a normal data set. From this sample, the range partitioner can determine the partition boundaries for the entire data set. Typically, you use a separate job to create the sorted sample of the data set. Once you have the sorted sample, you can either:

v Instantiate an APT_RangePartitioner object and pass it into the derived operator. The APT_Operator::describeOperator() override then uses APT_Operator::setPartitionMethod() to configure the operator to use the APT_RangePartitioner object.

v Pass a reference to the sorted sample to the operator using an operator constructor or member function. You then dynamically allocate an APT_RangePartitioner object in your override of APT_Operator::describeOperator() using the sorted sample. Finally, you use APT_Operator::setPartitionMethod() to configure the operator to use the dynamically allocated range partitioner.
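The boundary lookup that range partitioning performs can be sketched as a binary search over partition boundaries derived from the sorted sample. This is plain C++ with integer keys; the real APT_RangePartitioner derives the boundaries from the sample file for you, so the function below is only a model of the assignment step.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Given sorted partition boundaries (one fewer than the number of partitions),
// assign a key to the partition whose range contains it. With boundaries
// {10, 20, 30}: keys < 10 -> partition 0, [10, 20) -> 1, [20, 30) -> 2,
// and >= 30 -> 3. Equal keys always land in the same partition.
int rangePartition(int key, const std::vector<int>& boundaries) {
    return static_cast<int>(
        std::upper_bound(boundaries.begin(), boundaries.end(), key)
        - boundaries.begin());
}
```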
Overriding APT_Partitioner::describePartitioner()
Many partitioning methods use the fields of a record to determine the partition for the record.
To access those fields, the partitioner must have an interface schema, defined by overriding the pure virtual function describePartitioner(). The following figure shows a partitioner with a single integer field named hashField as its interface schema:
(Figure: SortOperator, with partitioner interface schema hashField:int32; and input data set schema hashField1:int32; sortField:string; in:*;)
The concrete schema of the input data set must be compatible with the interface schema of the partitioner. In this example, both schemas contain an integer field named hashField. If an input interface schema of the operator is not compatible with the schema of the partitioner, you can use an adapter to translate components. viewAdaptedSchema() returns the data set concrete schema as projected through the view adapter.

A partitioner is not required to define an interface schema if it does not use record fields as part of its method. This type of partitioner is called a keyless partitioner. You still must provide an override of describePartitioner(), but the function should simply return APT_StatusOk.
Overriding APT_Partitioner::setupInputs()
After you have established the interface schema of the partitioner, you need to define field accessors. Field accessors provide named access to any type of field within a record of a data set. A field accessor is normally defined as a private data member of the derived partitioner class. You override the pure virtual function setupInputs() to initialize the field accessors. The following figure shows a partitioner that defines a single integer field named hashField as its interface schema:
[Figure: SortOperator, with input interface schema "hashField:int32; sortField:string; in:*;", using a partitioner whose interface schema is "hashField:int32;"]
In this example, you override the pure virtual function setupInputs() to initialize the single field accessor used by a partitioner to access hashField. If your partitioning method does not access record fields, you still must override setupInputs(), but it should simply return APT_StatusOk.
Overriding APT_Partitioner::partitionInput()
You must override the pure virtual function APT_Partitioner::partitionInput() to perform the actual partitioning operation. APT_Partitioner::partitionInput() contains the code defining your partitioning method. Here is the function prototype of partitionInput():
virtual int partitionInput(int numPartitions) = 0;
The function partitionInput() assigns a partition number to a record of an input data set. InfoSphere DataStage calls partitionInput() for each record of an input data set; you do not call it directly. The argument numPartitions specifies the number of partitions available for the record. The value numPartitions is passed to partitionInput(), where numPartitions is guaranteed to be positive. Your override of partitionInput() must return an integer value denoting the partition for the current input record. This returned value must satisfy the requirement:
0 <= returnValue < numPartitions
Overriding APT_Partitioner::getPartitioningStyle()
You must override the pure virtual function APT_Partitioner::getPartitioningStyle(), which returns the partitioning style of the partitioner. Here is the function prototype of getPartitioningStyle():
virtual APT_PartitioningStyle::partitioningStyle getPartitioningStyle() const = 0;
The following figure shows an operator that partitions an input data set based on an integer field of the records, and sorts the records based on the integer field and a string field:
[Figure: SortOperator using a partitioner whose interface schema is "hashField:int32;"]
To access the record hashField, the partitioner defines one accessor. The partitioner schema and input interface schema of the operator both contain an integer field named hashField. Therefore, they are compatible. If they were not compatible, you could create a view adapter to translate the interface schema. The code in the table shows the derivation of SortPartitioner, the partitioner for this operator:
Table 36. Partitioner derivation (the numbered markers correspond to the comments that follow)

#include <apt_framework/orchestrate.h>

class SortPartitioner : public APT_Partitioner                    // 2
{
    APT_DECLARE_RTTI(SortPartitioner);                            // 4
    APT_DECLARE_PERSISTENT(SortPartitioner);                      // 5
public:
    SortPartitioner();                                            // 7
protected:
    virtual APT_Status describePartitioner();                     // 9
    virtual APT_Status setupInputs(int numPartitions);            // 10
    virtual int partitionInput(int numPartitions);                // 11
    virtual APT_PartitioningStyle::partitioningStyle
        getPartitioningStyle() const;                             // 12
    virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
                                           InitializeContext context); // 13
private:
    APT_InputAccessorToInt32 hashFieldAccessor;                   // 15
};

#define ARGS_DESC "{}"                                            // 17
APT_DEFINE_OSH_NAME(SortPartitioner, sortpart, APT_UString(ARGS_DESC)); // 18
APT_IMPLEMENT_RTTI_ONEBASE(SortPartitioner, APT_Partitioner);     // 19
APT_IMPLEMENT_PERSISTENT(SortPartitioner);                        // 20

SortPartitioner::SortPartitioner()                                // 21
{}

APT_Status SortPartitioner::initializeFromArgs_(const APT_PropertyList &args,
    APT_Partitioner::InitializeContext context)                   // 23
{
    return APT_StatusOk;
}

void SortPartitioner::serialize(APT_Archive& archive, APT_UInt8)  // 27
{}

APT_Status SortPartitioner::describePartitioner()                 // 29
{
    setInputInterfaceSchema("record(hashField:int32;)");          // 31
    return APT_StatusOk;
}

APT_Status SortPartitioner::setupInputs(int numPartitions)        // 34
{
    setupInputAccessor("hashField", &hashFieldAccessor);          // 36
    return APT_StatusOk;
}

int SortPartitioner::partitionInput(int numPartitions)            // 39
{
    APT_UInt32 hashVal = APT_hash(*hashFieldAccessor);            // 41
    return hashVal % numPartitions;                               // 42
}

APT_PartitioningStyle::partitioningStyle
SortPartitioner::getPartitioningStyle() const                     // 44
{
    return APT_PartitioningStyle::eLocalKeys;
}
2  Derive SortPartitioner from APT_Partitioner.
4  Include the macro APT_DECLARE_RTTI(), which is required to support run time type information.
5  Include the macro APT_DECLARE_PERSISTENT(), which is required for persistence support. This macro also inserts a declaration for the APT_Persistent::serialize() function required by the persistence mechanism.
7  Include the default constructor for SortPartitioner. This constructor is required for persistent classes.
9-13  Override the APT_Partitioner virtual functions: describePartitioner(), setupInputs(), partitionInput(), getPartitioningStyle(), and initializeFromArgs_().
15  Define hashFieldAccessor, a field accessor to access hashField.
17  If your partitioner has arguments, you supply a description of them in the ARGS_DESC string. See $APT_ORCHHOME/include/apt_util/argvcheck.h for documentation on this string.
18  With APT_DEFINE_OSH_NAME, you connect the class name to the name used to invoke the operator from osh and pass your argument description string. See osh_name.h for documentation on this macro.
19  Include the macro APT_IMPLEMENT_RTTI_ONEBASE(), which is required to support run time type information.
20  Include the macro APT_IMPLEMENT_PERSISTENT(), which is required for persistence support.
21  Declare a default constructor.
23  With your override of initializeFromArgs_(), you transfer information from the arguments to the class instance, making it osh aware. See operator.h for documentation on this function.
27  The function APT_Persistent::serialize() defines complex persistence.
29  Override APT_Partitioner::describePartitioner(), a pure virtual function.
31  Use APT_Partitioner::setInputInterfaceSchema() to specify the interface to this partitioner. The interface schema of the partitioner consists of a single integer field.
34  You must override the pure virtual function APT_Partitioner::setupInputs() to initialize the field accessors used by SortPartitioner to access the required fields of a record.
36  Use APT_Partitioner::setupInputAccessor() to initialize hashFieldAccessor.
39  You must override the pure virtual function APT_Partitioner::partitionInput() to perform the actual partitioning operation.
41  Use APT_hash() to compute a hash value for the integer field.
42  Return the hash value modulo numPartitions, where numPartitions is passed in by InfoSphere DataStage.
44  Override APT_Partitioner::getPartitioningStyle(), a pure virtual function.
Once you have defined your partitioner, you can use it with a derived operator. Typically, you define the partitioner within the override of APT_Operator::describeOperator(). To use SortPartitioner with SortOperator, you use APT_Operator::setPartitionMethod() within the APT_Operator::describeOperator() function of SortOperator. setPartitionMethod() allows you to specify a partitioner class for your partitioning method. Here is the function prototype of setPartitionMethod():
void setPartitionMethod(APT_Partitioner * partitioner, const APT_ViewAdapter& adapter, int inputDS);
In this form, setPartitionMethod() takes three arguments:
v partitioner, a pointer to a partitioner object.
v adapter, a view adapter. If no adapter is required, you can simply pass a default-constructed adapter.
v inputDS, the input data set for this partitioner.

In the example describeOperator() function below, the partitioner is allocated and specified as the partitioner for the operator in setPartitionMethod(). In this example, setPartitionMethod() takes a default-constructed adapter because the interface schemas of the operator and the partitioner are compatible.
APT_Status SortOperator::describeOperator()
{
    setKind(APT_Operator::eParallel);
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema(
        "record (hashField:int32; sortField:string; in:*;)", 0);
    setOutputInterfaceSchema("record (out:*;)", 0);
    declareTransfer("in", "out", 0, 0);
    SortPartitioner * sortPart = new SortPartitioner;
    setPartitionMethod(sortPart, APT_ViewAdapter(), 0);
    return APT_StatusOk;
}
Hashing functions
As part of your partitioning method, you can choose to calculate a hash value based on fields of a record; these fields are referred to as hash keys. InfoSphere DataStage provides several overloads of the partitioning hash function APT_hash() to handle most data types that can be used as hash keys. APT_hash() returns a hash value for a specified hash key. You must perform modulo division on the value returned by APT_hash() to ensure that the result is between 0 and one less than the number of partitions for the corresponding data set. Normally, you call APT_hash() from within APT_Partitioner::partitionInput() when you derive your own partitioning method from APT_Partitioner. InfoSphere DataStage supplies overloads of APT_hash() for the most common data types, as shown in the following declarations:
#include <apt_framework/orchestrate.h>

extern APT_UInt32 APT_hash(char key);
extern APT_UInt32 APT_hash(int key);
extern APT_UInt32 APT_hash(long key);
extern APT_UInt32 APT_hash(float key);
extern APT_UInt32 APT_hash(double key);
extern APT_UInt32 APT_hash(const char * key, bool caseSensitive = true);
extern APT_UInt32 APT_hash(const char * key, APT_UInt32 keyLength,
                           bool caseSensitive = true);
extern APT_UInt32 APT_hash(const APT_String& key, bool caseSensitive = true);
extern APT_UInt32 APT_hash(const UChar* d, APT_UInt32 len,
                           bool caseSensitive = true);
extern APT_UInt32 APT_hash(const UChar* d, bool caseSensitive = true);
extern APT_UInt32 APT_hash(const APT_UString& d, bool caseSensitive = true);
extern APT_UInt32 APT_hash(const APT_RawField& d);
You specify the hash key using the key argument. You use the keyLength argument to specify the length of a character string if the string is not null-terminated. With the caseSensitive argument, you can specify whether the key represents a case-sensitive character string (caseSensitive = true) or a case-insensitive string (caseSensitive = false).

In addition to APT_hash(), InfoSphere DataStage also provides hashing functions for the decimal, date, time, and timestamp data types:
v APT_Decimal::hash()
v APT_Date::hash()
v APT_Time::hash()
v APT_TimeStamp::hash()
[Figure: SortOperator, with input interface schema "field1:int32; field2:int32; field3:string; in:*;", using a partitioner whose interface schema is "pField1:int32; pField2:int32;"]
The input interface schema of the partitioner defines two integer fields, pField1 and pField2, which it uses to partition the records of an input data set. This schema is not compatible with the interface schema of the operator, so it is necessary to define and initialize an APT_ViewAdapter within the describeOperator() function of the derived operator. Here is the describeOperator() function for SortOperator:
#include <apt_framework/orchestrate.h>

APT_Status SortOperator::describeOperator()
{
    setKind(APT_Operator::eParallel);
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema(
        "record(field1:int32; field2:int32; field3:string; in:*;)", 0);
    setOutputInterfaceSchema("record (out:*;)", 0);
    APT_ViewAdapter partitionAdapter(
        "pField1 = field1;"
        "pField2 = field2;");
    SortPartitioner * sortPart = new SortPartitioner;
    setPartitionMethod(sortPart, partitionAdapter, 0);
    return APT_StatusOk;
}
partitionAdapter is defined within describeOperator() and is destroyed when this function completes. However, APT_Operator::setPartitionMethod() makes a copy of the adapter and passes it to the partitioner, so that when describeOperator() completes, the destruction of the adapter does not affect the partitioner.
might actually be ready for processing before those records from partition 0. In this case, the sequential operator must wait, possibly creating a processing bottleneck in your application. The ordered collection method is necessary if you want to process a totally sorted data set with a sequential operator and preserve the sort order. Unless your sequential operator requires a deterministic order for processing records, you typically will use the any collection method. If you want more control over the order of records processed by the operator, you can use the ordered method, the APT_SortedMergeCollector, or a custom collector that you define.
The first argument, cType, specifies the collection method as defined by the following values:
v APT_Operator::eCollectRoundRobin
v APT_Operator::eCollectOrdered

The second argument, inputDS, specifies the number of the input data set to the operator. These data sets are numbered starting from 0. For example, to use round robin collection with a sequential operator that takes a single input data set, include the following statements within the describeOperator() function of the operator:
setKind(APT_Operator::eSequential);
setCollectionMethod(APT_Operator::eCollectRoundRobin, 0);
If the operator has two input data sets and you want to use the ordered collection method for both, include the lines:

setCollectionMethod(APT_Operator::eCollectOrdered, 0);
setCollectionMethod(APT_Operator::eCollectOrdered, 1);
APT_SortedMergeCollector example
The class APT_SortedMergeCollector orders the records processed by a sequential operator, based on one or more fields of a record. APT_SortedMergeCollector uses a dynamic interface schema that allows you to specify one or more numeric or string fields as input. The following figure shows a sequential operator using APT_SortedMergeCollector:
[Figure: MyOperator, a sequential operator with input interface schema "field1:int32; field2:int32; field3:string; in:*;", using APT_SortedMergeCollector]
APT_SortedMergeCollector does not define any interface schema; you use the APT_SortedMergeCollector member function APT_SortedMergeCollector::setKey() to specify the collecting key fields.
MyOperator requires three fields as input: two integer fields and a string field. You can specify the collector interface schema within the describeOperator() function, as shown:
APT_Status MyOperator::describeOperator()
{
    setKind(APT_Operator::eSequential); // set mode to sequential
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema(
        "record(field1:int32; field2:int32; field3:string; in:*;)", 0);
    setOutputInterfaceSchema("record (out:*;)", 0);
    declareTransfer("in", "out", 0, 0);

    // Define the collector
    APT_SortedMergeCollector * coll = new APT_SortedMergeCollector;
    APT_SchemaTypeSpec schType;
    schType.setType(APT_SchemaTypeSpec::eInt);
    coll->setKey("field1", schType);
    coll->setKey("field2", schType);
    setCollectionMethod(coll, APT_ViewAdapter(), 0);
    return APT_StatusOk;
}
In the example above, the default constructor is used to dynamically allocate an APT_SortedMergeCollector object. The collector is deleted for you. You must call setKey() to specify the key fields for the APT_SortedMergeCollector object. In this example, setKey() specifies field1 as the primary collecting key field and field2 as the secondary collecting key field for the APT_SortedMergeCollector object. The function setKey() is used to specify both a field name and a data type for the field. The function APT_Operator::setCollectionMethod() specifies coll as the collector for this operator. It is not necessary to use an input field adapter with this collector; a default view adapter is passed instead.
APT_Collector contains two other functions: the public member function initializeFromArgs() and the protected function initializeFromArgs_(). You use these functions to enable the argument-list processing facility and to make your collector osh-aware. As part of deriving a collector class, you can include support for detecting error and warning conditions and for relaying that information back to users.
In this figure, each input partition has a current record, corresponding to the record that a sequential operator would read if it consumed a record from that partition. When a sequential operator calls APT_InputCursor::getRecord() as part of its override of APT_Operator::runLocally() to obtain the next record from an input data set, the collector determines the partition that supplies the record. The selected partition then updates itself, so the next record in the partition becomes the current record. When any partition becomes empty because it has supplied its final record, that partition returns an End Of File (EOF) whenever a record is requested from it. The call to APT_InputCursor::getRecord() causes the operator to call APT_Collector::selectInput(), one of the pure virtual functions that you must override when defining a collector. This function returns the number of the input partition supplying the record read by the operator. Your override of selectInput() implements the algorithm defining the order of records supplied to the operator. As part of its algorithm for determining the partition number, your override of selectInput() can interrogate fields within the current record of each partition. This allows you to use information in the record to determine the order in which the
Chapter 10. Creating Collectors
sequential operator reads records. To access the fields of a record, you must define input accessors to each record field in each partition you want to access. The following figure shows a collector that uses field information to determine record order:
[Figure: input data set partitions p0 through pN, each with a current record; the collector reads record fields to select which partition supplies the next record to the sequential operator]
You use the override of APT_Collector::setupInputs() to define the field accessors used by the collector. If your collector uses no accessors, this function should simply return APT_StatusOk.
Overriding APT_Collector::describeCollector()
Many collection methods use the fields of a record to determine the order of records processed by a sequential operator. To access those fields, the collector must have an interface schema, defined by overriding the pure virtual function describeCollector(). The following figure shows a collector with a single integer field named collectorField as its interface schema:
[Figure: collector with interface schema "collectorField:int32;"]
The input interface schema of an operator must be compatible with the interface schema of the collector. In this example, both contain an integer field named collectorField. If an input interface schema is not compatible with the schema of the collector, you can use a view adapter to translate components. A collector is not required to define an interface schema if it does not use record fields as part of its collection method. This type of collector is called a keyless collector. You must still provide an override to describeCollector(), but the function should return APT_StatusOk.
Overriding APT_Collector::setupInputs()
After you have established the interface schema for the collector, you must define field accessors for each component of the interface schema. Field accessors provide named access to any type of field within a record of a data set. Field accessors normally are defined as a private data member of the derived collector class. You then override the pure virtual function setupInputs() to initialize the field accessors. The following figure shows a collector that defines a single integer field named collectorField as its interface schema:
[Figure: operator with input interface schema "collectorField:int32; aField:string; in:*;" using a collector whose interface schema is "collectorField:int32;"]
In this example, you override the pure virtual function setupInputs() to initialize one field accessor for each partition of the input data set to access collectorField. If your collection method does not access record fields, you still must override setupInputs(), but it should return APT_StatusOk.
Overriding APT_Collector::selectInput()
You must override the pure virtual function APT_Collector::selectInput() to perform the actual collection operation. Here is the function prototype of selectInput():
virtual int selectInput(int numPartitions) = 0;
selectInput() returns the number of the input partition supplying the next record to the operator. InfoSphere DataStage calls selectInput() each time the operator reads a record from the data set; you do not call it directly. The argument numPartitions
specifies the number of input partitions. InfoSphere DataStage passes numPartitions to selectInput(), where numPartitions is guaranteed to be positive. Your override of selectInput() must return an integer value denoting the partition supplying the input record. This returned value must satisfy the requirement:
0 <= returnValue < numPartitions
[Figure: sequential operator using a collector whose interface schema is "collectorField:int32;"]
This operator uses a collector that determines the next record by inspecting a single integer field. The partition whose current record has the smallest value for collectorField supplies the record to the operator. To access collectorField, the collector defines one accessor for each partition of the input data set. The collector schema and the operator input interface schema both contain an integer field named collectorField. Therefore, they are compatible. If they were not, you could create a view adapter to translate the interface schema. The following table shows the derivation of MyCollector, the collector for this operator:
Table 37. Example APT_Collector derivation (the numbered markers correspond to the comments that follow)

#include <apt_framework/orchestrate.h>

class MyCollector : public APT_Collector                          // 2
{
    APT_DECLARE_RTTI(MyCollector);                                // 4
    APT_DECLARE_PERSISTENT(MyCollector);                          // 5
public:
    MyCollector();                                                // 7
    ~MyCollector();                                               // 8
protected:
    virtual APT_Status describeCollector();                       // 10
    virtual APT_Status setupInputs(int numPartitions);
    virtual int selectInput(int numPartitions);
    virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
                                           InitializeContext context); // 13
private:
    APT_InputAccessorToInt32 * collectorFieldAccessors;           // 15
    int numParts;                                                 // 16
};

#define ARGS_DESC "{otherInfo="\
    "{description='An example collector with no arguments to"\
    "describe'}}"                                                 // 18
APT_DEFINE_OSH_NAME(MyCollector, mycollector, APT_UString(ARGS_DESC)); // 19
APT_IMPLEMENT_RTTI_ONEBASE(MyCollector, APT_Collector);           // 20
APT_IMPLEMENT_PERSISTENT(MyCollector);                            // 21

MyCollector::MyCollector()                                        // 22
    : collectorFieldAccessors(0), numParts(0)
{}

MyCollector::~MyCollector()                                       // 25
{
    delete[] collectorFieldAccessors;
}

APT_Status MyCollector::initializeFromArgs_(const APT_PropertyList &args,
    APT_Collector::InitializeContext context)                     // 27
{
    return APT_StatusOk;
}

void MyCollector::serialize(APT_Archive& archive, APT_UInt8)      // 29
{}

APT_Status MyCollector::describeCollector()                       // 33
{
    setInputInterfaceSchema("record(collectorField:int32;)");     // 35
    return APT_StatusOk;
}

APT_Status MyCollector::setupInputs(int numPartitions)
{
    collectorFieldAccessors =
        new APT_InputAccessorToInt32[numPartitions];              // 40
    for (int n = 0; n < numPartitions; n++)
    {
        setupInputAccessor("collectorField",
                           &collectorFieldAccessors[n], n);       // 43
    }
    return APT_StatusOk;
}

int MyCollector::selectInput(int numPartitions)
{
    int minVal = INT_MAX;                                         // 49
    int minPartIndex = -1;                                        // 50
    for (int n = 0; n < numPartitions; n++)                       // 51
    {
        if (atEOF(n)) continue;                                   // 53
        if (*collectorFieldAccessors[n] <= minVal)                // 54
        {
            minVal = *collectorFieldAccessors[n];
            minPartIndex = n;
        }
    }
    APT_ASSERT(minPartIndex != -1);                               // 60
    return minPartIndex;                                          // 61
}
2  Derive MyCollector from APT_Collector.
4  Include the macro APT_DECLARE_RTTI(), which is required to support run time type information.
5  Include the macro APT_DECLARE_PERSISTENT(), which is required for persistence support. This macro also inserts a declaration for the APT_Persistent::serialize() function required by the persistence mechanism.
7  Include the default constructor for MyCollector. This constructor is required for persistent classes.
8  Define the destructor for MyCollector.
10-13  Override the virtual functions describeCollector(), setupInputs(), selectInput(), and initializeFromArgs_().
15  Define collectorFieldAccessors, a pointer to an array of field accessors used to read collectorField in the partitions of the input data set. You need a single accessor for each field in each partition that you want to access.
16  Define numParts to hold the length of the accessor array referenced by collectorFieldAccessors.
18  If your collector has arguments, you supply a description of them in the ARGS_DESC string. See argvcheck.h for documentation on this string and the argv checking facility.
19  With APT_DEFINE_OSH_NAME, you connect the class name to the name used to invoke the operator from osh and pass your argument description string. See osh_name.h for documentation on this macro.
20  Include the macro APT_IMPLEMENT_RTTI_ONEBASE(), which is required to support run time type information.
21  Include the macro APT_IMPLEMENT_PERSISTENT(), which is required for persistence support.
22  The default constructor.
25  Define the destructor for MyCollector. Delete the array of accessors used by the collector.
27  With your override of initializeFromArgs_(), you transfer information from the collector arguments to the class instance, making it osh aware. See the header file operator.h for documentation on this function.
29  The function APT_Persistent::serialize() defines complex persistence.
33  Override APT_Collector::describeCollector().
35  Use APT_Collector::setInputInterfaceSchema() to specify the interface to this collector. This interface schema consists of a single integer field.
40  Define an array of field accessors, one for each input partition, and initialize collectorFieldAccessors, a pointer to the array.
43  Use APT_Collector::setupInputAccessor() to initialize a field accessor to collectorField in each input partition.
49  Define minVal to hold the current minimum value for collectorField and initialize it to INT_MAX, the largest supported APT_Int32.
50  Define minPartIndex to hold the number of the partition with the minimum value for collectorField.
51  Iterate through all partitions of the input data set.
53  Use APT_Collector::atEOF() to determine whether the current partition contains a valid record. APT_Collector::atEOF() returns true after the final record has been read from a partition.
54  If a partition contains a record, compare the record's collectorField value against the current minimum field value. If the current record's collectorField is less than or equal to the current minVal, update minVal and minPartIndex accordingly.
60  Your override of APT_Collector::selectInput() must always return a partition index. This statement issues an assertion failure if you have iterated through all the partitions and have not found a partition number to return. You can also return -1 to indicate a fatal error.
61  Return the partition number of the record read by the operator.
Once you have defined your collector, you can use it with a derived operator. Typically, you define the collector within the override of APT_Operator::describeOperator(). To use MyCollector with MySequentialOperator, you use APT_Operator::setCollectionMethod() within the APT_Operator::describeOperator() function of MySequentialOperator. setCollectionMethod() allows you to specify a collector object for your operator. The following example shows the function prototype of setCollectionMethod():
void setCollectionMethod(APT_Collector * collector, const APT_ViewAdapter& adapter, int inputDS);
In this form, setCollectionMethod() takes three arguments:
v collector, a pointer to a collector object.
v adapter, a view adapter. If no adapter is required, you can simply pass a default-constructed adapter.
v inputDS, the input data set for this collector.

The following example shows the describeOperator() function for MySequentialOperator. The collector MyCollector is dynamically allocated; InfoSphere DataStage performs the deletion. The function APT_Operator::setCollectionMethod() specifies opCollector as the collector for this operator. This function takes a default-constructed adapter because the interface schemas of the operator and collector are compatible.
APT_Status MySequentialOperator::describeOperator()
{
    setKind(APT_Operator::eSequential);
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema(
        "record (collectorField:int32; aField:string; in:*;)", 0);
    setOutputInterfaceSchema("record (out:*;)", 0);
    declareTransfer("in", "out", 0, 0);
    MyCollector * opCollector = new MyCollector;
    setCollectionMethod(opCollector, APT_ViewAdapter(), 0);
    return APT_StatusOk;
}
[Figure: MySequentialOperator, with input interface schema "field1:int32; field2:int32; field3:string; in:*;", using a collector whose interface schema is "collectorField:int32;"]
The input interface schema of the collector defines the integer field collectorField that it uses to combine the records of an input data set. This schema is not compatible with the interface schema of the operator, so an APT_ViewAdapter object must be defined and initialized within the describeOperator() function of the derived operator. Here is the describeOperator() function for MySequentialOperator:
#include <apt_framework/orchestrate.h>

APT_Status MySequentialOperator::describeOperator()
{
    setKind(APT_Operator::eSequential);
    setInputDataSets(1);
    setOutputDataSets(1);
    setInputInterfaceSchema(
        "record(field1:int32; field2:int32; field3:string; in:*;)", 0);
    setOutputInterfaceSchema("record (out:*;)", 0);
    APT_ViewAdapter collectorAdapter("collectorField = field1;");
    MyCollector * opCollector = new MyCollector;
    setCollectionMethod(opCollector, collectorAdapter, 0);
    return APT_StatusOk;
}
An instance of APT_ViewAdapter, collectorAdapter, is defined to translate field1 of the input interface schema of the operator to collectorField of the interface schema of the collector. This translation does not perform any type conversion, but you can also use adapters to perform type conversion during a translation.
A function that can operate on one or more fields of a data set using this record schema could have the following prototype:
void processFields(const APT_FieldList& fList);
This function takes as an argument the list of fields to process. The field list you include can have several forms, as shown:
processFields("a, b, e");  // comma-separated list of fields
processFields("a - c");    // field range
processFields("a - c, e"); // field range and a comma-separated field
processFields("*");        // all fields
To create a field list, you use one or more of these elements:
v Individual field identifiers.
v Field ranges, which are two field identifiers separated by a hyphen. The first field must appear earlier in the record schema definition than the second (the fields must be in schema order, not alphabetic order). A field range includes all the fields whose identifiers fall within the range.
v A wildcard (*), which represents all the fields in the record schema.
You then specify the following record schema as the context and expand the list using APT_FieldList::expand():
static char schema1[] = "record"
    "( a:int32; "
    "  b:int32; "
    "  c:int16; "
    "  d:sfloat; "
    "  e:string; )";
listObject.expand(APT_Schema(schema1));
The expanded field list contains three APT_FieldSelector objects: one each for the fields a, b, and c. If the field list is already expanded, APT_FieldList::expand() does nothing. You cannot unexpand an expanded field list. Selecting a different record schema as the context results in a different number of selectors. Consider, for example, the following schema:
static char schema2[] = "record"
    "( a:int32; "
    "  a1:int32; "
    "  a2:int16; "
    "  a3:sfloat; "
    "  c:string; )";
listObject.expand(APT_Schema(schema2));
In this example, the expanded field list contains five field selectors, one for each field in the record schema. After you expand a field list, you can access the APT_FieldSelector for each field in the list. Using the APT_FieldSelector, you can then determine the data type of the field and process the field accordingly.
[Figure: DynamicOperator, with output interface schema "outRec:*;", writing to an output data set]
The operator has the following characteristics:
v Takes a single data set as input.
v Writes its results to a single output data set.
v Has an input interface schema consisting of a single schema variable inRec and an output interface schema consisting of a single schema variable outRec.

To use this operator, you specify the field list to its constructor, as shown in the following statements:
// Schema of the input data set
static char schema[] = "record"
    "( a:int32; "
    "  b:int32; "
    "  c:int16; "
    "  d:sfloat; "
    "  e:string; )";

DynamicOperator aDynamicOp("a - c");
The following table shows the definition of DynamicOperator. Comments follow the code in the table.
Table 38. DynamicOperator

Comment  Code
         #include <apt_framework/orchestrate.h>
         class DynamicOperator : public APT_Operator
         {
             APT_DECLARE_RTTI(DynamicOperator);
             APT_DECLARE_PERSISTENT(DynamicOperator);
         public:
             DynamicOperator();
             DynamicOperator(const APT_FieldList& fList)
                 : inSchema(), inFList(fList)
             {}
         protected:
             virtual APT_Status describeOperator();
             virtual APT_Status runLocally();
             virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
                 APT_Operator::InitializeContext context);
         private:
16           APT_Schema inSchema;
17           APT_FieldList inFList;
             . . .
         };

         APT_Status DynamicOperator::describeOperator()
         {
             setKind(APT_Operator::eParallel);
             setInputDataSets(1);
             setOutputDataSets(1);
23           APT_Schema tempSchema = viewAdaptedSchema(0);
24           APT_FieldList::Error err;
25           inFList.expand(tempSchema, &err);
26           if (err.errorOccurred())
             {
                 reportError(err.description());
                 return APT_StatusFailed;
             }
31           int numFields = inFList.numFields();
32           for (int n = 0; n < numFields; n++)
             {
34               APT_FieldSelector fs = inFList.field(n);
35               APT_SchemaField f = tempSchema.field(fs);
36               inSchema.addField(f);
             }
38           APT_SchemaField inField;
             inField.setIdentifier("inRec");
             inField.setTypeSpec("*");
41           inSchema.addField(inField);
42           setInputInterfaceSchema(inSchema, 0);
             setOutputInterfaceSchema("record (outRec:*;)", 0);
             declareTransfer("inRec", "outRec", 0, 0);
             return APT_StatusOk;
         }
16      Define storage for the complete input interface schema to the operator.
17      Define storage for the input field list specifying the input schema.
23      Use APT_Operator::viewAdaptedSchema() to obtain the complete record schema of the input data set. You will expand the input field list in the context of the input data set record schema.
24      Define err, an instance of APT_FieldList::Error, to hold any errors generated when expanding the field list.
25      Use APT_FieldList::expand() to expand the field list relative to the record schema of the input data set. If the field list does not contain a wildcard or field range, the expand function does nothing.
26      If any errors occurred during the expansion, print the description of the error and return APT_StatusFailed from describeOperator().
31      Use APT_FieldList::numFields() to determine the number of fields in the field list.
32      Create a for loop to iterate through the field list. For each element in the field list, add a schema field to inSchema, the input interface schema of the operator.
34      Use APT_FieldList::field() to get a field selector from the expanded field list.
35      Use APT_Schema::field() to return an APT_SchemaField object corresponding to a record schema field of the input data set.
36      Use APT_Schema::addField() to add the APT_SchemaField object to inSchema.
38 - 41 After adding all the fields of the field list to the input interface schema, you must create a schema field component for the schema variable "inRec:*;" and add that component to the input interface schema. Use APT_Schema::addField() to add the schema variable to allow transfers from the input data set to the output data set.
42      Use APT_Operator::setInputInterfaceSchema() to set the input interface schema of the operator to the schema contained in inSchema.
To make a class complex-persistent, you directly or indirectly derive your class from the APT_Persistent base class. If your class has none of the complex conditions outlined above, you can make your class simple-persistent. To do this you simply need to define serialization operators. Derivation from APT_Persistent is not necessary. The serialized representation of a simple-persistent object consumes no archive storage other than that used to serialize the object.
Storing and loading simple and complex persistent objects

Store an object

1. Create a storing archive.
2. Use a serialization operator to store the object in the archive.
Create archives
The base class APT_Archive defines the base-level functionality of the archive facility. You use the derived classes APT_FileArchive and APT_MemoryArchive to perform object serialization. APT_FileArchive is used to store objects to a file or load objects from a file. APT_MemoryArchive is used to store objects to a memory buffer or load objects back from a memory buffer. Accessing a buffer typically is faster than accessing a file. Objects stored to a buffer can later be stored to a file. For example, this line of code creates a storing file archive using the file output.dat:
APT_FileArchive ar("output.dat", APT_Archive::eStoring);
This line of code creates a loading file archive using the file input.dat:
APT_FileArchive ar("input.dat", APT_Archive::eLoading);
You establish the archive mode when you create the archive object. The mode cannot be subsequently changed. Also, archives do not support a random access seeking mechanism. You must load objects back from an archive in the order in which you store them.
Any complexity in the internal structure of Catalog does not complicate the code required to store and load instances of Catalog. The bidirectional operator, operator||, can perform either storing or loading. It determines its action based on the mode of the archive supplied to the function. Here is an example that uses operator||:
APT_FileArchive ar1("input.dat", APT_Archive::eLoading);
ar1 || gCat;    // load gCat from input.dat
APT_FileArchive ar2("output.dat", APT_Archive::eStoring);
ar2 || gCat;    // store gCat to output.dat
The APT_Archive base class defines bidirectional serialization operators for the following data types:
v signed or unsigned 8-, 16-, 32-, or 64-bit integers
v single-precision (32-bit) or double-precision (64-bit) floats
v time of day
v boolean

The one-value serialization operators are shown below:
class APT_Archive {
public:
    ...
    friend APT_Archive& operator|| (APT_Archive& ar, APT_UInt8& d);
    friend APT_Archive& operator|| (APT_Archive& ar, APT_Int8& d);
    friend APT_Archive& operator|| (APT_Archive& ar, char& d);
    friend APT_Archive& operator|| (APT_Archive& ar, UChar& d);
    friend APT_Archive& operator|| (APT_Archive& ar, APT_UInt16& d);
    friend APT_Archive& operator|| (APT_Archive& ar, APT_Int16& d);
    friend APT_Archive& operator|| (APT_Archive& ar, APT_UInt32& d);
    friend APT_Archive& operator|| (APT_Archive& ar, APT_Int32& d);
    friend APT_Archive& operator|| (APT_Archive& ar, APT_UInt64& d);
    friend APT_Archive& operator|| (APT_Archive& ar, APT_Int64& d);
    friend APT_Archive& operator|| (APT_Archive& ar, unsigned int& d);
    friend APT_Archive& operator|| (APT_Archive& ar, int& d);
    friend APT_Archive& operator|| (APT_Archive& ar, time_t& d);
    friend APT_Archive& operator|| (APT_Archive& ar, float& d);
    friend APT_Archive& operator|| (APT_Archive& ar, double& d);
    friend APT_Archive& operator|| (APT_Archive& ar, bool& d);
    ...
};
In addition, the base class defines the corresponding directional serialization operators operator>> and operator<<. See $APT_ORCHHOME/include/apt_util/archive.h for a complete definition of APT_Archive.

Because derived classes define data in terms of the built-in data types, you construct serialization operators using the supplied operators defined in APT_Archive. This means that you need not write serialization operators from scratch; you can simply build serialization operators from those supplied in APT_Archive.

If you are creating an operator, partitioner, or collector class compatible with osh, the shell command facility, you can override initializeFromArgs_() in the derived class to handle object initialization. In this case, you still must include an override of APT_Persistent::serialize() in your derived class, but it can be empty.
The following example code shows how simple-persistence can be built into a class:
#include <apt_util/archive.h>

class FPair
{
    float x_, y_;
public:
    ...
    friend APT_Archive& operator|| (APT_Archive& ar, FPair& d)
    {
        return ar || d.x_ || d.y_;
    }
};

// define operator<< and operator>>
APT_DIRECTIONAL_SERIALIZATION(FPair);
You explicitly provide only the bidirectional serialization operator. By including the APT_DIRECTIONAL_SERIALIZATION macro, you also provide both the store and load unidirectional serialization operators. This allows FPair objects to be serialized in the same manner as built-in types. For example:
APT_MemoryArchive ar;
FPair fp = ...;
ar << fp;
For simple classes such as FPair, just defining serialization operators suffices to make a class persistent. Simple-persistent classes have low overhead, since they do not inherit from a persistence base class, and their serialized representation consumes no archive storage other than that used to serialize the class members.

When loading, you always call the serialize() function on a default-constructed object. This is because the loaded object has either just been dynamically allocated, or it has been explicitly destroyed and default-constructed in place. This policy simplifies the serialize() function, since it need not worry about the previous state of an object.

With simple persistence, a class serialization operator needs to recognize that it might be loading over the pre-load state of an object. When a simple-persistent object is loaded, the object state is overwritten by the serialization operator for the class. It is up to the serialization operator to properly manage any state that the object might have had before loading.

A crucial limitation of simple-persistent classes is that pointers to class objects cannot be serialized. If it is necessary to serialize an object pointer, the class must be complex-persistent.
APT_IMPLEMENT_ABSTRACT_PERSISTENT macro in the .C file. These macros generate operator>>, operator<<, and operator|| overloads automatically to support serialization.

Rule 3
You must create a default constructor. The default constructor need not be public; it can be protected or private. A compiler-generated default constructor is acceptable.

Rule 4
You must implement APT_Persistent::serialize() in all derivations, regardless of whether the derived class is concrete or abstract. The APT_DECLARE_PERSISTENT macro includes a declaration for APT_Persistent::serialize() in the .h file. APT_Persistent::serialize() is the only function that you must define to support persistence in a derived class. All three serialization operators are declared and defined by macros in terms of APT_Persistent::serialize().

Rule 5
You must provide Run-Time Type Information (RTTI) support on the derived class. RTTI currently is not supported for template classes.

The following example shows how to apply the rules to make your class complex-persistent. In a file named ComplexPersistentClass.h, you define a complex persistent class named ComplexPersistentClass:
#include <apt_framework/orchestrate.h>

class ComplexPersistentClass : public APT_Persistent     // Rule 1
{
    APT_DECLARE_PERSISTENT(ComplexPersistentClass);      // Rules 2 & 4
    APT_DECLARE_RTTI(ComplexPersistentClass);            // Rule 5

public:
    ComplexPersistentClass(int, float);

private:
    ComplexPersistentClass();                            // Rule 3

    int i_;
    float f_;
};
The definition of APT_Persistent::serialize() uses the built-in forms of APT_Archive::operator|| to implement persistence for ComplexPersistentClass. See the next section for more information.
Implementing APT_Persistent::serialize()
You use the macro APT_DECLARE_PERSISTENT() to declare APT_Persistent::serialize() as a private member function in a derived class. The function prototype of APT_Persistent::serialize() is:
void serialize(APT_Archive& archive, APT_UInt8);
The first argument to APT_Persistent::serialize() specifies an archive. The second argument is reserved.

Because APT_Persistent::serialize() is called whenever you load or store an object, you typically implement your definition of APT_Persistent::serialize() using operator||. The function operator|| determines its operation by the mode of the archive passed to it. You can also use the member functions of APT_Archive to determine the mode of the archive if your class requires special processing for either a load or a store. This approach lets you use operator<< and operator>> within APT_Persistent::serialize().

When storing an object to an archive, APT_Persistent::serialize() stores the object. When storing a pointer to an archive, the object referenced by the pointer is stored. If the object has already been stored, subsequent stores are not performed; APT_Persistent::serialize() simply stores a pointer to the previously stored object.

When an object is being loaded from an archive, the action of APT_Persistent::serialize() depends on whether you are loading a pointer or a reference:
v If you are loading a pointer, InfoSphere DataStage dynamically allocates the object, calls the default constructor for the object, and then calls APT_Persistent::serialize() to load the object from the archive.
v If an object has already been loaded, a second load is not performed. Instead, operator|| simply initializes the pointer to reference the loaded object.
v If you are loading an object by using a reference to a class derived from APT_Persistent, InfoSphere DataStage destroys the object, default constructs the object in place, and then calls APT_Persistent::serialize() to load the object from the archive.
v When loading objects of simple data types, such as integers and floats, APT_Persistent::serialize() simply loads the object.

During loading, the previous state of an object is irrelevant because the object is always default constructed.
Your definition of APT_Persistent::serialize() should not have any effects other than reading, writing, or otherwise modifying the object being serialized.
Serializing pointers
Using classes that support persistence lets you save and load objects directly or use a pointer to an object.
Chapter 12. Enabling object persistence
Basic data types such as integers and characters, however, do not support serialization via a pointer. You must write your APT_Persistent::serialize() overload to handle serialization of pointers to data types that do not support the persistence mechanism. For example, char * pointers can often be replaced by an instance of the persistent class APT_String. Because APT_String supports persistence, you can serialize a reference to an APT_String object.
Serializing arrays
The persistence mechanism does not directly contain support for serializing arrays. If your classes contain array members, you must build support for array serialization within APT_Persistent::serialize(). Arrays are relatively simple to serialize. The serialization code depends on whether the array is fixed length (and fixed-allocation) or variable-length (and dynamically allocated). The example contains both fixed-length and variable-length arrays. In this example, ObjClass is a persistent class and Container is a persistent class containing two ObjClass arrays. When writing the APT_Persistent::serialize() definition for Container, you would handle the ObjClass arrays as follows:
#include <apt_framework/orchestrate.h>

class ObjClass : public APT_Persistent
{
    APT_DECLARE_PERSISTENT(ObjClass);
    APT_DECLARE_RTTI(ObjClass);
public:
    ObjClass();
    . . .
};

class Container : public APT_Persistent
{
    APT_DECLARE_PERSISTENT(Container);
    APT_DECLARE_RTTI(Container);
public:
    Container() : variable_(0), nVariable_(0) {}
    ~Container() { delete[] variable_; }
    . . .
private:
    ObjClass fixed_[12];    // define a fixed-length array
    ObjClass* variable_;    // define a variable-length array
    int nVariable_;         // contains length of array variable_
};
The definition of APT_Persistent::serialize() for Container is shown below. This definition is written using the bidirectional operator|| so that it can be used for both loading and storing Container objects. Comments follow the code in the table.
Code comments
v Use a simple for loop to serialize the elements of the fixed-length array.
v Serialize the length of the variable-length array, nVariable_.
v Ensure that the length of the variable-length array is greater than 0.
v Use the APT_Archive::isLoading() member function to determine whether the operation is a load.
v If this is a load operation, delete the current contents of variable_, and create a new variable-length array.
v Serialize the elements of the variable-length array.
Persistence macros
This section describes the macros that you use when declaring and defining classes that support the persistence mechanism.

APT_DECLARE_ABSTRACT_PERSISTENT() declares the operator||, operator<<, operator>>, and serialize() definitions within the definition of an abstract base class that supports persistence.

#define APT_DECLARE_ABSTRACT_PERSISTENT(className);

className specifies the name of the persistent class.

APT_DECLARE_PERSISTENT() declares the operator||, operator<<, operator>>, and serialize() definitions within the definition of a class that supports persistence.

#define APT_DECLARE_PERSISTENT(className);
className specifies the name of the persistent class.

APT_IMPLEMENT_ABSTRACT_PERSISTENT() implements operator||, operator<<, and operator>> in terms of serialize() for an abstract base class supporting persistence. This macro means that you only have to define serialize() for the abstract base class; you do not have to define operator||, operator<<, and operator>>.

#define APT_IMPLEMENT_ABSTRACT_PERSISTENT(className);

className specifies the name of the persistent class.

APT_IMPLEMENT_PERSISTENT() implements operator||, operator<<, and operator>> in terms of serialize() for a class supporting persistence. This macro means that you only have to define serialize() for the class; you do not have to define operator||, operator<<, and operator>>.

#define APT_IMPLEMENT_PERSISTENT(className);

className specifies the name of the persistent class.
The two primary objectives of the RTTI facility are to determine the run time data type of an object and to cast a pointer to a derived type or a base type.
The dynamic data type can change at run time, as shown here:
DClass dObject;               // Static type of dObject is DClass.
BClass * basePtr = &dObject;  // Static type of basePtr is BClass,
                              // but its dynamic type is DClass.
const char * sType = APT_STATIC_TYPE(*basePtr).name();   // returns BClass
const char * dType = APT_DYNAMIC_TYPE(*basePtr).name();  // returns DClass
This example uses two RTTI macros:
v APT_STATIC_TYPE(): Returns an APT_TypeInfo object describing the static data type of an object reference. The static data type is the data type of the reference, not of the object referenced.
v APT_DYNAMIC_TYPE(): Returns an APT_TypeInfo object describing the dynamic data type of an object reference. The dynamic data type is the data type of the object referenced.
The class APT_TypeInfo contains information describing the data type of an object. APT_TypeInfo::name() returns a string containing the class name of the data type.
Performing casts
Use casting to assign a pointer to another pointer of a different data type. The data types of both the pointer and the casted pointer must support the RTTI facility. You perform checked casts using the APT_PTR_CAST() macro, which converts a pointer to a pointer of a new data type. If the cast cannot be performed, the pointer is set to 0. The data types of both the pointer and the casted pointer must exist in the same inheritance hierarchy. The following example uses APT_PTR_CAST(), with the class DClass derived from the base class BClass:
BClass bObject;
BClass * bPtr = &bObject;

// cast bPtr to type DClass
DClass * dPtr = APT_PTR_CAST(DClass, bPtr);

if (dPtr)   // APT_PTR_CAST() returns 0 if the cast cannot be performed
{
    ...
}
The macro APT_DECLARE_RTTI() is used to declare ExampleOperator as a class that supports the RTTI facility. The APT_DECLARE_PERSISTENT macro, which is required, declares object persistence for operator objects transmitted to the processing nodes. In example.C, you define ExampleOperator:
The macro APT_IMPLEMENT_RTTI_ONEBASE() is used to implement the RTTI facility for a derived class with a single base class, and the macro APT_IMPLEMENT_PERSISTENT() is used to implement the persistence mechanism. You use different macros depending on the derivation hierarchy of a class. For example, if you define a class using multiple inheritance, you declare and define the class as shown in the following example. In the .h file:
class MI_Class : public B1, public B2
{
    APT_DECLARE_RTTI(MI_Class);
    APT_DECLARE_PERSISTENT(MI_Class);
public:
    ...
};
The class MI_Class is derived from two base classes: B1 and B2. Typically, both base classes must support the RTTI facility. If a base class does not support RTTI, that class cannot be used as part of a checked cast. MI_Class is a user-defined class; it is not derived from a supplied base class. In the .C file, you must include the macros:
APT_IMPLEMENT_RTTI_BEGIN(MI_Class);
APT_IMPLEMENT_RTTI_BASE(MI_Class, B1);
APT_IMPLEMENT_RTTI_BASE(MI_Class, B2);
APT_IMPLEMENT_RTTI_END(MI_Class);
...
The APT_IMPLEMENT_RTTI_BEGIN() macro starts a block defining the base classes of MI_Class, the APT_IMPLEMENT_RTTI_BASE() macros specify the base classes of MI_Class, and the APT_IMPLEMENT_RTTI_END() macro ends the definition. An APT_IMPLEMENT_RTTI_BEGIN() - APT_IMPLEMENT_RTTI_END() block can contain only a sequence of APT_IMPLEMENT_RTTI_BASE() macros. A checked cast to an ambiguous base class yields the first base class defined in an APT_IMPLEMENT_RTTI_BEGIN() - APT_IMPLEMENT_RTTI_END() block. For cases in which the class being defined has no bases or only one base, the APT_IMPLEMENT_RTTI_NOBASE() and APT_IMPLEMENT_RTTI_ONEBASE() macros can be used instead.
RTTI macros
The macros that support RTTI fall into two categories: derivation macros and application macros.
Derivation macros
You use RTTI derivation macros when you are declaring and defining object classes that support the RTTI facility. The following macros are available:

v APT_DECLARE_RTTI(). You insert this macro in a class definition to support the RTTI facility.
APT_DECLARE_RTTI(className);
className specifies the name of the class.

v APT_IMPLEMENT_RTTI_BASE(). This macro specifies a base class of a derived class within an APT_IMPLEMENT_RTTI_BEGIN() - APT_IMPLEMENT_RTTI_END() block in the source code file for the derived class.
APT_IMPLEMENT_RTTI_BASE(className, baseClass);
className specifies the name of the derived class. baseClass specifies a base class of className.

v APT_IMPLEMENT_RTTI_BEGIN(). If a derived class has two or more base classes, you must specify each base class within an APT_IMPLEMENT_RTTI_BEGIN() - APT_IMPLEMENT_RTTI_END() block in the source code file for the derived class. This macro defines the beginning of this block.
APT_IMPLEMENT_RTTI_BEGIN(className);
className specifies the name of the defined class.

v APT_IMPLEMENT_RTTI_NOBASE(). If a derived class has no base class, you must insert this macro in the source code file of the defined class.
APT_IMPLEMENT_RTTI_NOBASE(className);
v APT_IMPLEMENT_RTTI_ONEBASE(). If a derived class has a single base class, you use this macro in the source code file of the derived class to specify the base class name.
APT_IMPLEMENT_RTTI_ONEBASE(className, baseClass);
className specifies the name of the derived class. baseClass specifies a base class of className.

This macro is equivalent to:
APT_IMPLEMENT_RTTI_BEGIN(className);
APT_IMPLEMENT_RTTI_BASE(className, baseClass);
APT_IMPLEMENT_RTTI_END(className);
Application macros
You use the RTTI application macros to determine the data type of objects instantiated from a class that supports the RTTI facility. The following macros are available:

v APT_DYNAMIC_TYPE(). This macro returns an APT_TypeInfo object describing the dynamic, or run-time, data type of an object. The object must be instantiated from a class that supports RTTI.

const APT_TypeInfo& APT_DYNAMIC_TYPE(object);
object specifies an object instantiated from a class that supports the RTTI facility. The following example uses APT_DYNAMIC_TYPE(), with the class DClass derived from the base class BClass:
DClass dObject;
BClass * basePtr = &dObject;
const char * dType = APT_DYNAMIC_TYPE(*basePtr).name();  // returns "DClass"
v APT_PTR_CAST(). This macro lets you perform a checked cast to assign a pointer of one data type to a pointer of a different data type. The pointer and casted pointer both must reference objects instantiated from classes that support RTTI.
destType * APT_PTR_CAST(destType, sourcePtr);
destType specifies the resultant data type of the cast. sourcePtr is a pointer to convert. The data types of destType and sourcePtr must exist in the same inheritance hierarchy; otherwise the resulting pointer is set to 0. APT_PTR_CAST() returns 0 if the data type of sourcePtr does not exist in the same inheritance hierarchy as destType, or if sourcePtr is equal to 0.

v APT_STATIC_TYPE(). This macro returns an APT_TypeInfo object describing the static data type of an object reference. The class must support RTTI.
const APT_TypeInfo& APT_STATIC_TYPE(object);
object specifies an object or a reference to an object instantiated from a class that supports the RTTI facility. The example uses APT_STATIC_TYPE(), with the class DClass derived from the base class BClass:
DClass dObject;
const char * dType = APT_STATIC_TYPE(dObject).name();  // returns "DClass"
v APT_TYPE_INFO(). This macro returns an APT_TypeInfo object describing the class type. The class must support RTTI.
const APT_TypeInfo& APT_TYPE_INFO(className);
Figure 52. Error log structure (diagram: the APT_ErrorLog accumulator, written through operator*, feeds the error, warning, and information log buffers, which are accessed with getError()/resetError(), getWarning()/resetWarning(), and getInfo()/resetInfo(); logInfo() and its companions copy the accumulator into the corresponding buffer)

APT_ErrorLog eLog(APT_userErrorSourceModule);
*eLog << "a string";
// Use the appropriate function to copy the accumulator to a log buffer
In this example, APT_userErrorSourceModule defines the module identifier for the error log. After writing to the accumulator, you call logError(), logWarning(), or logInfo() to copy the information from the accumulator to the appropriate log buffer. Calling these functions appends the information in the accumulator to the current contents of the appropriate buffer, then clears the accumulator. To access the information in the error log buffer, you use UgetError(). This function returns a ustring containing the buffer contents.

You can also use the member function APT_ErrorLog::dump() to display the information stored in an error log. This function writes all the information contained in an APT_ErrorLog to standard error on the workstation that invoked the application, then resets all the log buffers to the empty state.

One use of dump() is to periodically purge an APT_ErrorLog that might contain many messages. For example, you might have an error or warning caused by each record processed by an operator. Because an operator can process a huge number of records, the amount of memory consumed by an APT_ErrorLog object can become correspondingly large. Using dump(), you can purge the APT_ErrorLog object to prevent the object from overflowing memory.
A message identifier has two parts: the module identifier and the message index number. The module identifier defines the functional component that issued the message. The message index is a unique value for each message within each module identifier.

You do not have to construct an APT_ErrorLog object for a derived operator, partitioner, or collector. APT_ErrorLog objects are generated for you, with the module identifier APT_userErrorSourceModule, when an object is instantiated from a derived class. You only need to specify a module identifier if you create your own APT_ErrorLog objects. The list of available module identifiers is extensible.

The message index number is a unique value for each error, warning, or informational message you write to an error log for a given module identifier. Typically, you define a base value for your message index, then modify this value for every unique message written to the log. For example, the following code defines the base value for all messages written to the log:
#define MESSAGE_ID_BASE 0
When writing an error message to the error log, you would then specify a message index value as shown:
APT_ErrorLog eLog(APT_userErrorSourceModule);  // create an error log
*eLog << "a string";                           // write a message to the error log
eLog.logError(MESSAGE_ID_BASE + 1);            // log the error message and index
Subsequent messages written to the log would have an index of MESSAGE_ID_BASE + 2, MESSAGE_ID_BASE + 3, and so on. All messages have a unique index within each module identifier. Creating a unique index for each of your error messages allows you to catalog and track your messages.
Table 39. Using the error log in a derived operator

Comment  Code
         #include <apt_framework/orchestrate.h>
         class ExampleOperator : public APT_Operator
         {
             APT_DECLARE_RTTI(ExampleOperator);
             APT_DECLARE_PERSISTENT(ExampleOperator);
         public:
             ExampleOperator();
8            void setKey(const APT_String& name);
             void setStable();
         protected:
             virtual APT_Status describeOperator();
             virtual APT_Status runLocally();
             virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,
                 APT_Operator::InitializeContext context);
         private:
15           int numKeys;
             // other data members
         };

         #define MESSAGE_ID_BASE 0

         APT_Status ExampleOperator::describeOperator()
         {
             setKind(APT_Operator::eParallel);
             setInputDataSets(1);
             setOutputDataSets(1);
             setInputInterfaceSchema("record (iField:int32; sField:string)", 0);
             setOutputInterfaceSchema("record (iField:int32; sField:string)", 0);
25           if (numKeys == 0)
             {
26               APT_ErrorLog& eLog = errorLog();
27               *eLog << "no keys specified for operator.";
28               eLog.logError(MESSAGE_ID_BASE + 1);
29               return APT_StatusFailed;
             }
             return APT_StatusOk;
         }
8    Use the member function setKey() to set a key field. The variable numKeys is a private variable containing the number of key fields specified using setKey().
15   (See comment 8.)
25   If the number of keys is 0, the user has not specified a key field, and an error condition exists.
26   Use APT_Operator::errorLog() to access the APT_ErrorLog object for the operator instance. This function returns a reference to the APT_ErrorLog object. Alternatively, you could use the following statement to replace lines 26 and 27:
     *errorLog() << "No keys specified for operator.";
27   APT_ErrorLog::operator*() returns an ostream to which messages can be written. Use APT_ErrorLog::operator*() to write an error message to the accumulator.
28   Use APT_ErrorLog::logError() to copy the contents of the accumulator to the error log buffer, including the message index.
29   Return APT_StatusFailed from the override. Upon return, the APT_ErrorLog object for the class is checked and any contained warning or error information is displayed. In addition, because the function returned APT_StatusFailed, the job is terminated.
You do not have to return from the function after detecting a single error or warning. You might want to execute the entire function, logging multiple error or warnings before returning. In this case, you can define a flag to signal that an error occurred and check that flag before returning from the function.
(Figure: generic function class hierarchy — TypeOpStringGF and TypeOpInt32GF derive from TypeOpGF, which derives from APT_GenericFunction)
The code also demonstrates how generic accessors can be safely cast into strongly typed accessors by casting an APT_InputAccessorBase pointer to both a string accessor and an int32 accessor. Using the default conversion mechanism, the code also converts int8 field values to int32 values. In addition, the code demonstrates how to build a schema variable field and add it to the schema so that entire records can be transferred without change from input to output. The following table shows the code of types.C. The code is followed by comments which are keyed to code line numbers.
Table 40. types.C

Comment  Code
1        #include "types.h"
2        class TypeOpGF : public APT_GenericFunction
         {
             APT_DECLARE_RTTI(TypeOpGF);
             APT_DECLARE_ABSTRACT_PERSISTENT(TypeOpGF);
         public:
             TypeOpGF(char *type)
                 : APT_GenericFunction(type, "typeop", "") {};
             virtual APT_String display(APT_InputAccessorBase *) = 0;
         };
         APT_IMPLEMENT_RTTI_ONEBASE(TypeOpGF, APT_GenericFunction);
         APT_IMPLEMENT_ABSTRACT_PERSISTENT(TypeOpGF);
         void TypeOpGF::serialize(APT_Archive & ar, APT_UInt8 version)
         {
             // TypeOpGF has no internal state to serialize.
         }
15       class TypeOpStringGF : public TypeOpGF
         {
             APT_DECLARE_RTTI(TypeOpStringGF);
             APT_DECLARE_PERSISTENT(TypeOpStringGF);
         public:
             TypeOpStringGF() : TypeOpGF("string") {}
             virtual APT_GenericFunction* clone() const
             {
                 return new TypeOpStringGF(*this);
             }
             APT_String display(APT_InputAccessorBase *inAcc)
             {
                 return "This is a string; its value is: " +
                     ((APT_InputAccessorToString*)inAcc)->value();
             }
         };
30       void TypeOpStringGF::serialize(APT_Archive & ar, APT_UInt8 version)
         {
             // TypeOpStringGF has no internal state to serialize.
         }
         APT_IMPLEMENT_RTTI_ONEBASE(TypeOpStringGF, TypeOpGF);
         APT_IMPLEMENT_PERSISTENT(TypeOpStringGF);
176
Table 40. types.C (continued)

    class TypeOpInt32GF : public TypeOpGF                     // 36
    {
        APT_DECLARE_RTTI(TypeOpInt32GF);
        APT_DECLARE_PERSISTENT(TypeOpInt32GF);
    public:
        TypeOpInt32GF() : TypeOpGF("int32") {}
        virtual APT_GenericFunction* clone() const
        { return new TypeOpInt32GF(*this); }
        APT_String display(APT_InputAccessorBase *inAcc)
        {
            char tmp[25];
            sprintf(tmp, "%d", ((APT_InputAccessorToInt32*)inAcc)->value());
            return APT_String("This is an int32; its value is: ") + tmp;
        }
    };

    void TypeOpInt32GF::serialize(APT_Archive & ar, APT_UInt8 version) {}
    APT_IMPLEMENT_RTTI_ONEBASE(TypeOpInt32GF, TypeOpGF);      // 53
    APT_IMPLEMENT_PERSISTENT(TypeOpInt32GF);

    int registerGFs()
    {
        APT_GenericFunction *tmp = new TypeOpStringGF();
        APT_GenericFunctionRegistry::get().addGenericFunction(tmp);
        tmp = new TypeOpInt32GF();
        APT_GenericFunctionRegistry::get().addGenericFunction(tmp);
        return 0;
    }
    static int sTypeOp = registerGFs();

    APT_IMPLEMENT_RTTI_ONEBASE(TypesOp, APT_Operator);        // 67
    APT_IMPLEMENT_PERSISTENT(TypesOp);
    #define ARG_DESC "{}"                                     // 69
    APT_DEFINE_OSH_NAME(TypesOp, fieldtypes, ARG_DESC);
    TypesOp::TypesOp() {}                                     // 71
    APT_Status TypesOp::initializeFromArgs_(const APT_PropertyList &args,   // 73
        APT_Operator::InitializeContext context)
    {
        APT_Status status = APT_StatusOk;
        return status;
    }
    void TypesOp::serialize(APT_Archive & ar, APT_UInt8 version)            // 78
    {
        // TypesOp has no members to serialize.
    }
Table 40. types.C (continued)

    APT_Status TypesOp::describeOperator()                    // 80
    {
        setInputDataSets(1);
        setOutputDataSets(1);
        APT_Schema schema = viewAdaptedSchema(0);
        for (int i = 0; i < schema.numFields(); i++)
        {
            APT_SchemaField &field = schema.field(i);
            const APT_FieldTypeDescriptor *fd = field.typeDescriptor();
            if (APT_PTR_CAST(APT_Int8Descriptor, fd))         // 89
            {
                APT_FieldTypeDescriptor *int32fd =
                    APT_FieldTypeRegistry::get().lookupBySchemaTypeName("int32");
                field.setTypeDescriptor(int32fd);
            }
        }                                                     // 94
        APT_SchemaField schemaVar;                            // 95
        schemaVar.setIdentifier("in");
        schemaVar.setKind(APT_SchemaField::eSchemaVariable);
        schema.addField(schemaVar);
        setInputInterfaceSchema(schema, 0);
        setOutputInterfaceSchema("record (out:*)", 0);
        declareTransfer("in", "out", 0, 0);
        setCheckpointStateHandling(eNoState);
        return APT_StatusOk;
    }

    APT_Status TypesOp::runLocally()                          // 105
    {
        APT_Status status = APT_StatusOk;
        APT_Schema schema = inputInterfaceSchema(0);
        int count = 0;
        for (; count < schema.numFields();)
        {
            if (schema.field(count).kind() != APT_SchemaField::eValue)
                schema.removeField(count);
            else
                count++;
        }                                                     // 115
        APT_InputAccessorBase *accessors = new APT_InputAccessorBase[count];  // 116
        TypeOpGF **gfs = new TypeOpGF *[count];
        for (int q = 0; q < count; q++)
        {
            gfs[q] = APT_PTR_CAST(TypeOpGF,
                APT_GenericFunctionRegistry::get().lookupDefault(
                    schema.field(q).typeDescriptor()->schemaTypeName(), "typeop"));
        }                                                     // 122
Table 40. types.C (continued)

        APT_InputCursor inCur;                                // 124
        setupInputCursor(&inCur, 0);
        for (int p = 0; p < schema.numFields(); p++)
            inCur.setupAccessor(schema.field(p).path(), &accessors[p]);
        APT_OutputCursor outCur;
        setupOutputCursor(&outCur, 0);
        while (inCur.getRecord() && status == APT_StatusOk)   // 130
        {
            for (int z = 0; z < schema.numFields(); z++)
            {
                if (gfs[z])
                {
                    *errorLog() << gfs[z]->display(&accessors[z]);
                    errorLog().logInfo(1);
                }
            }
            transfer(0);
            outCur.putRecord();
        }                                                     // 142
        delete[] gfs;
        delete[] accessors;
        return status;
    }
1: Include the header file, types.h. It defines the TypesOp operator, which is directly derived from APT_Operator.

2: Create a first-level generic-function class, TypeOpGF, to provide a public interface for the two generic-function classes that derive from TypeOpGF. The three arguments to APT_GenericFunction are:
- type: a schema type name, which identifies the schema type.
- "typeop": an interface name, which uniquely identifies the generic function interface among all the interfaces of the same schema type.
- "": you can supply an implementation name or the empty string for the third argument. If you supply a name, the implementation must be explicitly selected by name. If you supply the empty string, the generic interface is the default implementation for its schema type and interface name.
The second and third arguments are used together to determine the generic function to be used. See the call to lookupDefault() called from the runLocally() method and the call to lookupBySchemaTypeName() in the describeOperator() method for TypesOp.

15-30 and 36-53: These code lines define the TypeOpStringGF and TypeOpInt32GF classes, which provide string and int32 implementations of the generic function. Except for data type, the two implementations are similar. Overriding clone() is an APT_GenericFunction base-class requirement. The display() function displays the data type and value. You can call this type of function to perform many kinds of operations with any number of arguments.
Chapter 15. Advanced features
The display() function in this case takes a generic accessor base pointer so that the return value can be safely cast to any data type pointer.

67-68: The APT_IMPLEMENT_RTTI_ONEBASE macro supports run time type information. The APT_IMPLEMENT_PERSISTENT macro defines the persistence mechanism.

69: The argument description string for TypesOp is empty because the operator takes no arguments.

71: TypesOp::TypesOp() is the constructor for the operator.

73: Because the operator has no arguments, its osh initialization function, initializeFromArgs_, simply returns APT_StatusOk.

78: The serialize method is part of the persistence mechanism. Before the operator is parallelized, this method is called to archive the values of its member variables; it is called again after parallelization, to restore those variables from the archived values in each parallel copy of the operator. In this case, there are no members to serialize.

80: Just before the operator is parallelized, the framework calls the describeOperator() method to set important operator properties.

89-94: A simple assignment is used to convert int8 types into int32 types because InfoSphere DataStage supplies a default conversion between these types. When there is no default conversion, an adapter must be used to perform the conversion.

105-115: In the parallel execution method runLocally(), schema variables are removed from a copy of the schema in order to exclude them from having accessors assigned to them. This code also skips over subrecs and tagged fields. The runLocally() method then determines how many accessors are needed, and declares pointers to the generic functions to be used with the accessors.

116-122: The method also defines a cursor for the input data set, sets up accessors to their corresponding fields, and defines a cursor for the output data set.

130-142: The while loop iterates over the available input records, calling the generic functions. The loop could also have used APT_InputAccessorBase::type() to return an enum on which to run switch(); however, in many cases that technique is not as efficient as using generic functions, particularly when there are many fields in a record. Because a generic function has not been created for every data type, the code guards against generic functions that have not been allocated.

The following table contains the types.h header file, which defines the TypesOp sample operator.
Table 41. types.h

    #include <apt_framework/orchestrate.h>

    class TypesOp : public APT_Operator
    {
        APT_DECLARE_PERSISTENT(TypesOp);
        APT_DECLARE_RTTI(TypesOp);

    public:
        // constructor
        TypesOp();

        // C++ initialization methods for this operator which are
        // called from the initializeFromArgs_ method.

    protected:
        // osh initialization function which makes the operator "osh-aware".
        virtual APT_Status initializeFromArgs_(
            const APT_PropertyList &args,
            APT_Operator::InitializeContext context);

        // pre-parallel initialization
        virtual APT_Status describeOperator();

        // parallel execution method
        virtual APT_Status runLocally();

        // There are no member variables.
    };
Combinable operators
A combinable operator is an operator that is managed by the combinable operator controller. A combinable operator processes an input record and then returns control to the controller. This frees the framework from waiting for the operator to consume all of its input. All combinable operators are based on the abstract base class APT_CombinableOperator. When you write to the combinable API, you enable the internal combinable optimization mechanism of the framework. The framework will combine operators when possible, and will avoid combination in cases where it does not make sense. The framework will not combine operators when there is a need for flow-based data buffering, when repartitioning is necessary, when the operators have differing degrees of parallelism, and so on.
181
Advantages
Combinable operators can substantially improve performance for certain kinds of operations by reducing the record transit time between operators. They do this by eliminating the time that non-combinable operators need to pack records into a buffer as the data passes between operators. In addition, combinable operators allow you to process records whose size exceeds the 128 KB limit. When data flows between two non-combinable operators, it is stored in a buffer which limits record size to 128 KB. With combinable operators there is no buffer limit, so record size is unlimited until the combinable operator outputs to a non-combinable operator or repartitions.
Disadvantages
Although combinable operators can confer substantial performance advantages to data flows, combinable operators expose the complexity of the internal API and as such their use comes with certain risks. These risks can be minimized, however, by following the guidelines. Also, combinable operators are not always an appropriate choice. For example, using combinable operators reduces pipeline parallelism and so could actually slow down the data flow.
Virtual methods
A virtual method is one that you must implement. The framework calls the virtual methods listed here.

- APT_Status doFinalProcessing()
  This method is called once per operator instance. If this method outputs only one record, use transferAndPutRecord(). If there are multiple outputs for an operator instance, use requestWriteOutputRecord() instead, because it returns control to the framework to process the outputs.
182
- APT_Status doInitialProcessing()
  This method lets you generate output records before doing input processing. It is called once per operator instance. If this method outputs only one record, use transferAndPutRecord(). If there are multiple outputs for an operator instance, use requestWriteOutputRecord() instead, which returns control to the framework to process the outputs.
- void outputAbandoned(int outputDS)
  This method sends a notification to the combinable operator that the output is no longer needed. It is called by the framework when the operator following this operator has received all the input it needs.
- void processEOF(int inputDS)
  You use this method for processing that needs to be done after the last record for the input data set has been received. The framework calls this method once for each input.
- void processInputRecord(int inputDS)
  You use this method to apply processing on a per-record basis. Use no more than one putRecord() or transferAndPutRecord() per output data set for each input record. Call requestWriteOutputRecord() for each additional record output in the method.
- APT_Status writeOutputRecord()
  The framework calls this method in response to operator calls to requestWriteOutputRecord() when the controller is ready for the next output record. You can output one record for each call to this function. You can call requestWriteOutputRecord() again if necessary.
Non-virtual methods
A non-virtual method is one that you do not need to implement. The combinable operator that you write can call the following non-virtual methods.

- void abandonInput(int inputDS)
  This method notifies the framework to stop sending records from inputDS to the operator. After this function is called, atEOF() for the input returns true and processInputRecord() cannot be called with this input as the argument. If atEOF(inputDS) is true, this function does nothing. Conceptually, abandonInput() is called automatically for all inputs when the operator terminates. Call this method only if inputDS is less than inputDataSets() and inputConsumptionPattern() equals eSpecificInput.
- int activeInput() const
  Returns the number of the currently active input. This value remains stable throughout the entire dynamic scope of processInputRecord() or processEOF(). Calls to setActiveInput() or advanceToNextInput() affect the value of activeInput().
- void clearCombiningOverrideFlag()
  Sets the combining override flag for the operator to false and sets the hasCombiningOverrideFlag() flag to true.
- bool combiningOverrideFlag()
  This function returns the combining override flag for the operator. It returns true if setCombiningOverrideFlag() was called and false if clearCombiningOverrideFlag() was called. Do not call this function unless hasCombiningOverrideFlag() is true.
- bool hasCombiningOverrideFlag()
  This function indicates whether the combining override flag of the operator has been set. If this flag is set to true and combiningOverrideFlag() returns false, operator combining is prevented and combinable operators are treated as ordinary operators. See also the related functions combiningOverrideFlag(), setCombiningOverrideFlag(), and clearCombiningOverrideFlag().
- APT_InputAccessorInterface* inputAccessorInterface(int inputDS)
  This method provides access to the input accessor interface associated with each input defined for this operator. Use it to set up input accessors. Call this method in doInitialProcessing() instead of exposing input cursors. To call this method, inputDS must be nonnegative and less than inputDataSets().
- InputConsumptionPattern inputConsumptionPattern()
  This method returns the input consumption pattern set in describeOperator() by a call to setInputConsumptionPattern(). You can call this method any time after the framework calls describeOperator(). You can call setInputConsumptionPattern() only from within describeOperator(). setInputConsumptionPattern() can take three values:
  - eSpecificInput: the operator exercises direct control over the consumption pattern by using the setActiveInput() or advanceToNextInput() functions. This is the default value.
  - eBalancedInput: requests that the framework drive the consumption pattern in a balanced manner, consuming one record from each input that is not at the end-of-file marker, in a circular fashion. The operator must not call setActiveInput() or advanceToNextInput().
  - eAnyInput: the operator is indifferent to the order of record consumption; the framework can direct the operator to process records from any input. The operator must not call setActiveInput() or advanceToNextInput().
- APT_OutputCursor* outputCursor(int outputDS)
  This method provides access to the output cursor associated with each output defined for this operator.
  Use it to set up output accessors and to call putRecord(). For performance reasons, it is better to call this function once in doInitialProcessing() and then store the returned pointer for frequent access, for example in processInputRecord(). To use this method, outputDS must be nonnegative and less than outputDataSets().
- int remainingOutputs() const
  Returns the number of outputs this operator has that have not been abandoned.
- void requestWriteOutputRecord()
  Use this method when your combinable operator method needs to output multiple records for a single input record. Within the functions doInitialProcessing(), processInputRecord(), processEOF(), doFinalProcessing(), and writeOutputRecord(), at most one putRecord() or transferAndPutRecord() operation can be performed per output port. If your operator requires additional putRecord() operations, it must call requestWriteOutputRecord() to schedule a call to writeOutputRecord(). The requested call to writeOutputRecord() takes place before any other calls back into this operator. Multiple calls to this method within a single activation of doInitialProcessing(), processInputRecord(), processEOF(), doFinalProcessing(), or writeOutputRecord() have the same effect as a single call. To write out multiple records, call requestWriteOutputRecord() from writeOutputRecord() after the previous record has been output.
- void setCombiningOverrideFlag()
  Sets to true both the combiningOverrideFlag() flag and the hasCombiningOverrideFlag() flag for the operator.
- void terminateOperator(APT_Status s)
  This method sets the termination status of the operator as specified and terminates the operator as soon as the current function returns. If all input has not been consumed, a warning is issued and the remaining input is consumed. This method must be called only from within the dynamic scope of doInitialProcessing(), processInputRecord(), processEOF(), or writeOutputRecord(). It cannot be called during or after doFinalProcessing().
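The one-putRecord-per-activation contract and the requestWriteOutputRecord()/writeOutputRecord() handshake can be modeled with a toy controller. This is a simplified stand-in, not the APT framework: only the method names mirror the API described above, and the int-valued "records" and the ToyOperator/runController names are invented for illustration.

```cpp
#include <cassert>
#include <vector>

// Toy model of the controller contract: an operator may emit at most one
// record per activation; extra records are produced by scheduling
// writeOutputRecord() callbacks via requestWriteOutputRecord().
class ToyOperator {
public:
    std::vector<int> output;           // records "written" downstream

    void processInputRecord(int value) {
        pending_ = value;
        copiesLeft_ = 3;               // want 3 output records per input
        putRecord(value);              // the one allowed putRecord()
        copiesLeft_--;
        if (copiesLeft_ > 0)
            requestWriteOutputRecord();
    }
    void writeOutputRecord() {         // called back by the "framework"
        putRecord(pending_);
        copiesLeft_--;
        if (copiesLeft_ > 0)
            requestWriteOutputRecord();   // schedule one more callback
    }
    bool writeRequested() const { return writeRequested_; }
    void clearWriteRequest() { writeRequested_ = false; }
private:
    void putRecord(int v) { output.push_back(v); }
    void requestWriteOutputRecord() { writeRequested_ = true; }
    int pending_ = 0;
    int copiesLeft_ = 0;
    bool writeRequested_ = false;
};

// Minimal controller loop: deliver each input record, then drain any
// scheduled writeOutputRecord() calls before the next input.
std::vector<int> runController(const std::vector<int>& input) {
    ToyOperator op;
    for (int rec : input) {
        op.processInputRecord(rec);
        while (op.writeRequested()) {
            op.clearWriteRequest();
            op.writeOutputRecord();
        }
    }
    return op.output;
}
```

Note how each activation emits exactly one record and then cedes control; the fan-out to three records per input happens entirely through scheduled callbacks, which is the pattern the text prescribes.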
- When writing a combinable operator, do not use getRecord(). A combinable operator must never call getRecord(); instead, the framework calls getRecord() on behalf of the combinable operator and then calls processInputRecord() for each record.
- After calling putRecord() or transferAndPutRecord(), a combinable operator must not do anything to disturb the output record before returning control to the framework. The safest thing to do after calling putRecord() or transferAndPutRecord() is to return to the framework.
void setWaveAware()
  Informs the framework that the operator knows how to handle end-of-wave (EOW) markers properly. This method should be called only from describeOperator(). If this function is not called, the framework provides default end-of-wave behavior for the operator. For non-combinable operators, the framework resets the operator, calls the operator's initializeFromArgs_() followed by runLocally(), and re-runs the operator on the next wave. For combinable operators, the framework calls the operator's doFinalProcessing(), resets the operator, calls the operator's initializeFromArgs_() followed by doInitialProcessing(), and re-runs the operator on the next wave.

bool isWaveAware() const
  Returns true if this operator has informed the framework, by calling setWaveAware(), that it knows how to handle EOW correctly; false otherwise.

bool isFirstWave() const
  Returns true if the operator is processing its first wave; false otherwise. This method should be called only from within the dynamic scope of runLocally() for non-combinable operators, or between the start of doInitialProcessing() and the end of doFinalProcessing() for combinable operators. The value of this function does not change within a wave; it goes from true to false upon completion of the first wave. This is done automatically by the framework.

The APT_InputCursor class provides the position() function to indicate whether an end-of-wave marker has been encountered.

inputCursorPosition position() const
  Returns the position of the cursor after the preceding call to getRecord(). If getRecord() returns true, eNotAtEnd is returned; otherwise, eAtEndOfWave or eAtEndOfFile is returned.

For combinable operators, the APT_CombinableOperator class provides position() as a wrapper to APT_InputCursor::position().

APT_InputCursor::inputCursorPosition position(int inputDS) const
  Returns the position of the cursor for the given input: eNotAtEnd, eAtEndOfWave, or eAtEndOfFile.
  inputDS must be nonnegative and less than inputDataSets().

In addition, the APT_CombinableOperator class provides the following two virtual functions to support initial wave processing and end-of-wave processing.

virtual APT_Status initializeWave()
  This function is called immediately after doInitialProcessing() for the first wave, and for subsequent waves after processEOW() is called for all inputs. Override this function for per-wave setup processing (for example, resetting variables that change from wave to wave). The default implementation returns APT_StatusOk.

virtual void processEOW(int inputDS)
  This function is called when all records for the given input are exhausted at end-of-wave. Operators with multiple inputs might want to note the change of state, and generally call setActiveInput() or advanceToNextInput() before returning. A step containing no end-of-wave markers is considered one wave, which means that this function is always called at least once per step. The default implementation calls processEOF() to maintain backward compatibility. Up to a single putRecord() (or transferAndPutRecord()) or markEndOfWave() operation per output port can be performed in processEOW(). See requestWriteOutputRecord(). If remainingInputs() is non-zero, either setActiveInput() or advanceToNextInput() must be called before returning.

This example shows how to make a simple operator wave-aware. EndOfWaveOperator is a non-combinable operator derived from the APT_Operator class. It does not take any arguments.
Table 42. Making an operator wave-aware

    #include <apt_framework/orchestrate.h>                    // 1
    class EndOfWaveOperator : public APT_Operator             // 2
    {
        APT_DECLARE_RTTI(EndOfWaveOperator);                  // 4
        APT_DECLARE_PERSISTENT(EndOfWaveOperator);            // 5
    public:
        EndOfWaveOperator();
    protected:
        virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,   // 9
            APT_Operator::InitializeContext context);
        virtual APT_Status describeOperator();                // 11
        virtual APT_Status runLocally();
    private:
        APT_InputCursor inCur_;                               // 13
        int numWaves_;                                        // 14
        int numRecords_;                                      // 15
    };
    #define ARGS_DESC "{}"                                    // 17
    APT_DEFINE_OSH_NAME(EndOfWaveOperator, eowOp, (APT_UString)ARGS_DESC);   // 18
    APT_IMPLEMENT_RTTI_ONEBASE(EndOfWaveOperator, APT_Operator);             // 19
    APT_IMPLEMENT_PERSISTENT(EndOfWaveOperator);              // 20
    EndOfWaveOperator::EndOfWaveOperator()                    // 21
        : numWaves_(1), numRecords_(0)                        // 22
    {
    }
    APT_Status EndOfWaveOperator::initializeFromArgs_(const APT_PropertyList &args,   // 25
        APT_Operator::InitializeContext context)
    {
        return APT_StatusOk;
    }
    void EndOfWaveOperator::serialize(APT_Archive& ar, APT_UInt8)            // 29
    {
    }
    APT_Status EndOfWaveOperator::describeOperator()          // 32
    {
        setInputDataSets(1);
        setOutputDataSets(1);
        setInputInterfaceSchema("record(inRec0:*;)", 0);
        setOutputInterfaceSchema("record(outRec0:*;)", 0);
        declareTransfer("inRec0", "outRec0", 0, 0);
        setWaveAware();
        return APT_StatusOk;
    }
Table 42. Making an operator wave-aware (continued)

    APT_Status EndOfWaveOperator::runLocally()                // 42
    {
        if ( isFirstWave() )
        {
            setupInputCursor(&inCur_, 0);
        }
        else
            numWaves_++;
        while (inCur_.getRecord())
        {
            APT_ASSERT(inCur_.position() == APT_InputCursor::eNotAtEnd);
            numRecords_++;
            transferAndPutRecord(0);
        }
        if ( inCur_.position() == APT_InputCursor::eAtEndOfWave )
        {
            numRecords_ = 0;
        }
        else
        {
            APT_ASSERT(inCur_.position() == APT_InputCursor::eAtEndOfFile);
            APT_ASSERT(numRecords_ == 0);
            APT_ASSERT(numWaves_ >= 2);
        }
        return APT_StatusOk;
    }
1: Include the orchestrate.h header file.

2: All operators are derived, directly or indirectly, from APT_Operator.

4: Use the required macro, APT_DECLARE_RTTI, to declare runtime type information for the operator.

5: Use the required macro, APT_DECLARE_PERSISTENT, to declare object persistence for operator objects transmitted to the processing nodes.

9-11: You must override the virtual function initializeFromArgs_ and the two pure virtual functions, describeOperator() and runLocally(). The overrides are in this example.

13: Define the input cursor.

14: Define the total number of waves.

15: Define the number of records per wave.

17: See the header file, install_directory/Server/PXEngine/include/apt_util/argvcheck.h, for documentation on the ARGS_DESC string.

18: Use APT_DEFINE_OSH_NAME to connect the class name to the name used to invoke the operator from osh, and to pass your argument description string to DataStage.

19-20: APT_IMPLEMENT_RTTI_ONEBASE and APT_IMPLEMENT_PERSISTENT are required macros that implement runtime type information and persistent object processing.

21: EndOfWaveOperator::EndOfWaveOperator() defines a default constructor for the operator. All operators must have a public default constructor, even if the constructor is empty.

22: Initialize the member variables.

25: Use the override of initializeFromArgs_() to transfer information from the arguments to the class instance. Because there are no arguments for this example, initializeFromArgs_() simply returns APT_StatusOk.

29: The function serialize() defines complex persistence in a derived class. The serialization operators operator||, operator<<, and operator>> are declared and defined by macros in terms of APT_Persistent::serialize(). There are no data variables in EndOfWaveOperator that need to be serialized; therefore, serialize() is empty.

32: Use the override of describeOperator() to describe the configuration information for EndOfWaveOperator. This example implementation specifies that:
- The operator is run in parallel. Parallel mode is the default; therefore, it is not necessary to explicitly set parallel mode with setKind().
- There is one input data set and one output data set.
- The input and output schemas are as defined by setInputInterfaceSchema() and setOutputInterfaceSchema().
- Data is simply transferred from input to output.
- The operator is wave-aware.

42: With your override of runLocally(), you define what the operator does at run time. In this example implementation, the function setupInputCursor() is called in the first wave to initialize the input cursor, because the input cursor does not change its state from wave to wave. The while statement loops over all the records and does these tasks for each record:
- Confirms that the input cursor's position is APT_InputCursor::eNotAtEnd.
- Increases the number of records in the current wave by 1.
- Transfers the current record from input to output.
When there is no more input data, getRecord() returns false and code execution exits from the while loop. The code then checks whether the input encountered an end-of-wave marker or an end-of-file marker. If an end-of-wave marker is encountered, numRecords_ is reset to 0 so it can count the number of records in the next wave. If an end-of-file marker is encountered, the code confirms that an end-of-wave marker was seen before the end-of-file marker and that there are no records between these two markers.
The following example shows how to make a combinable operator wave-aware. EndOfWaveCombinableOperator is derived from the APT_CombinableOperator class and takes no arguments.

Table 43.

    #include <apt_framework/orchestrate.h>                    // 1
    class EndOfWaveCombinableOperator : public APT_CombinableOperator   // 2
    {
        APT_DECLARE_RTTI(EndOfWaveCombinableOperator);        // 4
        APT_DECLARE_PERSISTENT(EndOfWaveCombinableOperator);  // 5
    public:
        EndOfWaveCombinableOperator();
    protected:
        virtual APT_Status initializeFromArgs_(const APT_PropertyList &args,   // 9
            APT_Operator::InitializeContext context);
        virtual APT_Status describeOperator();
        virtual APT_Status doInitialProcessing();
        virtual APT_Status initializeWave();
        virtual void processInputRecord(int inputDS);
        virtual void processEOW(int inputDS);
        virtual APT_Status doFinalProcessing();               // 15
    private:
        APT_InputAccessorInterface* inCur_;                   // 17
        APT_InputAccessorToInt32 input0Int32a_;               // 18
        int numWaves_;                                        // 19
        int numRecords_;                                      // 20
    };
    #define ARGS_DESC "{}"                                    // 22
    APT_DEFINE_OSH_NAME(EndOfWaveCombinableOperator, eowCombOp, (APT_UString)ARGS_DESC);   // 23
    APT_IMPLEMENT_RTTI_ONEBASE(EndOfWaveCombinableOperator, APT_CombinableOperator);       // 24
    APT_IMPLEMENT_PERSISTENT(EndOfWaveCombinableOperator);    // 25
    EndOfWaveCombinableOperator::EndOfWaveCombinableOperator()   // 26
        : inCur_(0), numWaves_(0), numRecords_(0)             // 27
    {
    }
    APT_Status EndOfWaveCombinableOperator::initializeFromArgs_(const APT_PropertyList &args,   // 30
        APT_Operator::InitializeContext context)
    {
        return APT_StatusOk;
    }
    void EndOfWaveCombinableOperator::serialize(APT_Archive& ar, APT_UInt8)   // 34
    {
    }
    APT_Status EndOfWaveCombinableOperator::describeOperator()   // 37
    {
        setInputDataSets(1);
        setOutputDataSets(1);
        setInputInterfaceSchema("record(a:int32;inRec0:*)", 0);
        setOutputInterfaceSchema("record(outRec0:*)", 0);
        declareTransfer("inRec0", "outRec0", 0, 0);
        setWaveAware();
        return APT_StatusOk;
    }
Table 43. (continued)

    APT_Status EndOfWaveCombinableOperator::doInitialProcessing()   // 47
    {
        inCur_ = inputAccessorInterface(0);
        inCur_->setupAccessor(APT_UString("a"), &input0Int32a_);
        return APT_StatusOk;
    }
    APT_Status EndOfWaveCombinableOperator::initializeWave()  // 55
    {
        numRecords_ = 0;
        return APT_StatusOk;
    }
    void EndOfWaveCombinableOperator::processInputRecord(int inputDS)   // 58
    {
        APT_ASSERT(position(0) == APT_InputCursor::eNotAtEnd);
        if ( isFirstWave() )
            cout << "input0Int32a_[0]=" << input0Int32a_[0] << endl;
        numRecords_++;
        transferAndPutRecord(0);
        return;
    }
    void EndOfWaveCombinableOperator::processEOW(int inputDS)   // 67
    {
        if ( position(0) == APT_InputCursor::eAtEndOfWave )
        {
            numWaves_++;
        }
        else if ( position(0) == APT_InputCursor::eAtEndOfFile )
        {
            APT_ASSERT(numRecords_ == 0);
        }
        else
            APT_ASSERT(0);                                    // 78
        return;
    }
    APT_Status EndOfWaveCombinableOperator::doFinalProcessing()   // 81
    {
        APT_ASSERT(position(0) == APT_InputCursor::eAtEndOfFile);
        APT_ASSERT(numRecords_ == 0);
        return APT_StatusOk;
    }
1 2 4 5 9-15 17 18 19
Include the orchestrate.h header file. EndOfWaveCombinableOperator is derived from APT_CombinableOperator. Use the required macro, APT_DECLARE_RTTI, to declare runtime type information for the operator. Use the required macro, APT_DECLARE_PERSISTENT, to declare object persistence for operator objects transmitted to the processing nodes. You must override the virtual function initializeFromArgs_ and the pure virtual function describeOperator(). Overrides are in this example. Define the input cursor. Define an input accessor. Define the total number of waves.
192
20 30
Define the number of records per wave. Use the override of initializeFromArgs_() to transfer information from the arguments to the class instance. Because there are no arguments for this example, initializeFromArgs_() simply returns APT_StatusOk. The function serialize() defines complex persistence in a derived class. The serialization operators operator||, operator<<, and operator>> are declared and defined by macros in terms of APT_Persistent::serialize(). There are no data variables in EndOfWaveOperator that need to be serialized; therefore, serialize() is empty. Use the override of describeOperator() to describe the configuration information for EndOfWaveCombinableOperator. This example implementation specifies that: 1. The operator is run in parallel. Parallel mode is the default; therefore it is not necessary to explicitly set parallel mode with setKind(). 2. There is one input data set and one output data set. 3. The input and output schemas are as defined by setInputInterfaceSchema() and setOutputInterfaceSchema(). 4. Data is simply transferred from input to output. 5. The operator is wave-aware. Initialize input cursor and setup input accessor in doInitialProcessing(). Both data variables do not change state from wave to wave. Reset the number of records at the beginning of each wave. With your override of processInputRecord(), you define what the operator does at run time. This example implementation does these tasks for each record: 1. Confirms that the input cursor's position is APT_InputCursor::eNotAtEnd. 2. For every record in the first wave, print the value of field a. Increases the number of records in the current wave by 1. 3. Transfers the current record from input to output.
34
37
47 55 58
67
When there is no more input data, code execution exits from processInputRecord() and enters processEOW(). The example code then checks to see if the input encounters an end-of-wave marker or an end-of-file marker. If an end-of-wave marker is encountered, numWaves_ is incremented by 1. If an end-of-file marker is encountered, the code confirms that an end-of-wave marker is seen before the end-of-file marker and there are no records between these two markers. Issue an fatal error if neither end-of-wave nor end-of-file marker is encountered in the dynamic scope of processEOW(). In transanction-like data processing, doFinalProcessing() is called only when all input datasets encounter end-of-file markers.
Per-partition processing
The postFinalRunLocally() function is invoked for each data set partition after runLocally() is called. When end-of-wave processing is done, the runLocally() function can be called multiple times. In this case, the postFinalRunLocally() function is called after the last invocation of runLocally().
Chapter 15. Advanced features
Contacting IBM
You can contact IBM for customer support, software services, product information, and general information. You can also provide feedback on products and documentation.
Customer support
For customer support for IBM products and for product download information, go to the support and downloads site at www.ibm.com/support/. You can open a support request by going to the software support service request site at www.ibm.com/software/support/probsub.html.
My IBM
You can manage links to IBM Web sites and information that meet your specific technical support needs by creating an account on the My IBM site at www.ibm.com/account/.
Software services
For information about software, IT, and business consulting services, go to the solutions site at www.ibm.com/businesssolutions/.
General information
To find general information about IBM, go to www.ibm.com.
Product feedback
You can provide general product feedback through the Consumability Survey at www.ibm.com/software/data/info/consumability-survey.
Documentation feedback
To comment on the information center, click the Feedback link on the top right side of any topic in the information center. You can also send your comments about PDF file books, the information center, or any other documentation in the following ways:
v Online reader comment form: www.ibm.com/software/data/rcf/
v E-mail: [email protected]
Product documentation
Documentation is provided in a variety of locations and formats, including in help that is opened directly from the product interface, in a suite-wide information center, and in PDF file books.

The information center is installed as a common service with IBM InfoSphere Information Server. The information center contains help for most of the product interfaces, as well as complete documentation for all product modules in the suite. You can open the information center from the installed product or by entering a Web address and default port number.

You can use the following methods to open the installed information center.
v From the IBM InfoSphere Information Server user interface, click Help on the upper right of the screen to open the information center.
v From InfoSphere Information Server clients, such as the InfoSphere DataStage and QualityStage Designer client, the FastTrack client, and the Balanced Optimization client, press the F1 key to open the information center. The F1 key opens the information center topic that describes the current context of the user interface.
v On the computer on which InfoSphere Information Server is installed, you can access the InfoSphere Information Server information center even when you are not logged in to the product. Open a Web browser and enter the following address: http://host_name:port_number/infocenter/topic/com.ibm.swg.im.iis.productization.iisinfsv.nav.doc/dochome/iisinfsrv_home.html where host_name is the name of the services tier computer where the information center is installed and where port_number is the port number for InfoSphere Information Server. The default port number is 9080.
For example, on a Microsoft Windows Server computer named iisdocs2, the Web address is in the following format: http://iisdocs2:9080/infocenter/topic/com.ibm.swg.im.iis.productization.iisinfsv.nav.doc/dochome/iisinfsrv_home.html

A subset of the product documentation is also available online from the product documentation library at publib.boulder.ibm.com/infocenter/iisinfsv/v8r1/index.jsp.

PDF file books are available through the InfoSphere Information Server software installer and the distribution media. A subset of the information center is also available online and periodically refreshed at www.ibm.com/support/docview.wss?rs=14&uid=swg27008803.

You can also order IBM publications in hardcopy format online or through your local IBM representative. To order publications online, go to the IBM Publications Center at www.ibm.com/shop/publications/order.

You can send your comments about documentation in the following ways:
v Online reader comment form: www.ibm.com/software/data/rcf/
v E-mail: [email protected]
Product accessibility
You can get information about the accessibility status of IBM products. The IBM InfoSphere Information Server product modules and user interfaces are not fully accessible. The installation program installs the following product modules and components:
v IBM InfoSphere Business Glossary
v IBM InfoSphere Business Glossary Anywhere
v IBM InfoSphere DataStage
v IBM InfoSphere FastTrack
v IBM InfoSphere Information Analyzer
v IBM InfoSphere Information Services Director
v IBM InfoSphere Metadata Workbench
v IBM InfoSphere QualityStage
For information about the accessibility status of IBM products, see the IBM product accessibility information at http://www.ibm.com/able/product_accessibility/ index.html.
Accessible documentation
Accessible documentation for InfoSphere Information Server products is provided in an information center. The information center presents the documentation in XHTML 1.0 format, which is viewable in most Web browsers. XHTML allows you to set display preferences in your browser. It also allows you to use screen readers and other assistive technologies to access the documentation.
Notices
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003
U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.

Each copy or any portion of these sample programs or any derivative work must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.

If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml.

The following terms are trademarks or registered trademarks of other companies:

Adobe is a registered trademark of Adobe Systems Incorporated in the United States, and/or other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

The United States Postal Service owns the following trademarks: CASS, CASS Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS and United States Postal Service. IBM Corporation is a non-exclusive DPV and LACSLink licensee of the United States Postal Service.

Other company, product or service names may be trademarks or service marks of others.
Index

A
accessibility
   documentation 197
accessors 91, 96, 175
   field 95
   multifield 40
accumulator 170
aggregate fields 113
apt_archive 154
APT_Collector::errorLog() 170
APT_CombinableOperator 181
APT_CompositeOperator
   class interface 49
apt_decimal
   interface 99
APT_DECLARE_ABSTRACT_PERSISTENT 157
APT_DECLARE_ABSTRACT_PERSISTENT() 161
APT_DECLARE_MSG_LOG() 61
APT_DECLARE_PERSISTENT 157, 164
APT_DECLARE_PERSISTENT() 159, 161
APT_DECLARE_RTTI 164, 166
apt_define_osh_name 69
apt_detail 64
APT_DYNAMIC_TYPE 163, 166
APT_ErrorLog 169, 170
   class interface 169
APT_FieldConversion 85
   interface 86
APT_FieldConversionRegistry 85
   interface 86
APT_FieldList
   class interface 147
apt_filearchive 154
apt_formattable 64
APT_GenericFunction 175
APT_GFComparison 175
APT_GFEquality 175
APT_GFPrint 175
APT_hash 130
APT_HashPartitioner 121
APT_IMPLEMENT_ABSTRACT_PERSISTENT 157
APT_IMPLEMENT_ABSTRACT_PERSISTENT() 161
APT_IMPLEMENT_PERSISTENT 157
APT_IMPLEMENT_PERSISTENT() 161
APT_IMPLEMENT_RTTI_BASE 166
APT_IMPLEMENT_RTTI_BEGIN 166
APT_IMPLEMENT_RTTI_END 166
APT_IMPLEMENT_RTTI_NOBASE 166
APT_IMPLEMENT_RTTI_ONEBASE 166
APT_InputCursor 92
   interface 92
APT_InputSubCursor 116
apt_memoryarchive 154
apt_msg 61, 63, 64
APT_NLS 61
APT_Operator 172
   additional member functions 10
   class interface 4
   derivation examples 4
   derivation requirements 3
APT_Operator::errorLog() 170
APT_OutputCursor 92
   interface 92
APT_OutputSubCursor 116
APT_Partitioner
   class interface 124
   describePartitioner 125
   getPartitioningStyle 126
   partitionInput 126
   setupInputs 125
APT_Partitioner::errorLog() 170
apt_persistent 153, 156
   interface 159
APT_Persistent::serialize() 159, 160
apt_prepend 64
APT_PTR_CAST 164, 166
APT_RangePartitioner 124
apt_rawfield
   interface 110
APT_STATIC_TYPE 163, 166
APT_StatusFailed 169
APT_StatusOk 169
apt_string 66, 91
   interface 106
APT_SubProcessOperator
   class interface 53
   commandLine function 55
   deriving from 53
   runSink function 54
   runSource function 53
APT_TYPE_INFO 166
APT_TypeInfo 163
apt_user_require 64
apt_ustring 66, 91
   interface 106
archive 155
   create 154
argc/argv argument 69
argument description examples 79
argument errors
   command line 82
argument values 70
argument-list
   supply description 69
   syntax 71
argument-list description
   elements 70
argument-list processor 69
arrays
   serialize 160

C
C++ compiler 1
cast 164
character set conversion 109
class
   apt_rawfield 110
   define complex-persistent 157
   define simple-persistent 156
class interface
   APT_CompositeOperator 49
   APT_Operator 4
   APT_Partitioner 124
   APT_SubProcessOperator 53
collector 20
collectors 1
combinable controller 185
combinable operator 181
   limitations and risks 185
   methods 182
   using 182
compiler 1
complex-persistent class 157
composite operators 47
   example 50
conversion
   default types 85
cursors 21, 91
   records 94
custom messages 45
customer support 195

D
data types
   numeric 97
decimal fields 99
derivation requirements
   APT_Operator 3
derived operator
   error log 172
describePartitioner 125
documentation
   accessible 197
dynamic interface schema 39
dynamic interface schemas 15
dynamic operators 26
dynamic schema 175

E
end-of-wave 186
end-of-wave marker 186
end-of-wave processing 193
error
   token 82
error log 169, 170, 172
error log functions
   base class 170
example 31, 34, 41

F
field accessor 95
field accessors 23, 26, 96
   aggregate fields 113
   date, time, and timestamp 112
   decimal fields 99
   fixed-length vector 99
   nullable fields 104
   numeric data types 97
   schema variables 25
   string and ustring fields 106
   variable-length vector 101
field list
   creating 147
   expanding 148
   using 149
fields
   aggregate 113
fixed-length vector fields 99

G
generic functions 175
getPartitioningStyle 126

H
hashing functions 130
HelloWorldOp example 7

I
information log 170
interface
   apt_archive 154
interface schemas 13
interfaces
   eliminate deprecated 66

J
job monitoring 44

L
legal notices 201

M
message
   convert multi-line 63
   convert pre-NLS 64
   convert with run-time variables 64
   define identifiers 171
   duplicate 63
   error log 171
   identify 61
   issue 61
   localize 63
   macro 61, 63
messages 45
   classify 171
messaging macro 61
metadata 44
monitoring 44
multifield accessors 40

N
National Language Support 61
no run-time variables message 64
nullable fields 104

O
object
   save and restore 155
object serialization 153
operator
   run 84
operator arguments
   property list encoding 81
operator messages
   localize 61
operator.apt 2
operator.h 4
operators 1
   compiling 1
   composite 47, 50
   serialization 155
   subprocess 53
   using 2

P
parallel operators 11
   creating 12
partitioner 131
partitioners 1
   creating 119
partitioning 18
partitioning method 16
   any 120
   APT_HashPartitioner 121
   APT_RangePartitioner 124
   choosing 120
   custom 124
   keyless 120
partitionInput 126
per-partition processing 193
persistence
   enable object 153
   simple and complex 153
persistent objects
   store and load 154
postFinalRunLocally 193
preserve-partitioning flag 119
product accessibility 199
property list
   encoding 81
   example of traverse 81
   structure 70

R
record fields
   reference 91
records
   accessing with cursors 94
registering operators 2
response messages 44
RTTI 163, 164
   application macros 166
   derivation macros 166
run time data type 163
run time type information 163

S
schema variables 14, 26, 31
screen readers
   product documentation 197
sequential operators 11
   creating 12
serialization
   operator 154
serialization operators 155
setupInputs 125
simple-persistent 156
software services 195
sort keys 19
sorting 18
string 85
string and ustring fields 106
subcursors 21
subprocess operator
   example derivation 55
subprocess operators 53
subrecord fields 116
support
   customer 195
syntax
   example 71

T
token
   error 82
transaction 186
transfer mechanism 28, 30
type conversions
   non-default and pre-defined 85
   non-default type conversion functions 85

U
unicode 91, 109
ustring 85, 106

V
variable-length vector 101
vectors
   accessing 116
view adapter 131

W
warning log 170
wave-aware 186