IBM InfoSphere DataStage and QualityStage Parallel Job Advanced Developer Guide
Version 8 Release 7
SC19-3458-01
Note: Before using this information and the product that it supports, read the information in "Notices and trademarks" on page 819.
Copyright IBM Corporation 2001, 2012. US Government Users Restricted Rights Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents

Chapter 1. Terminology
Chapter 2. Job design tips
   InfoSphere DataStage Designer interface
   Processing large volumes of data
   Modular development
   Designing for good performance
   Combining data
   Sorting data
   Default and explicit type conversions
   Using Transformer stages
   Using Sequential File stages
   Using Database stages (database sparse lookup vs. join; DB2, Oracle, and Teradata database tips)
Environment variables
   Buffering, decimal support, disk I/O, general job administration, job monitoring, lookup support, miscellaneous, network, NLS support, Oracle support, partitioning, reading and writing files, reporting, SAS support, sorting, Sybase support, Teradata support, and transport block variables
   Guide to setting environment variables (settings for all jobs; optional settings)
Chapter 7. Operators
   Stage to operator mapping
   The bloom filter, changeapply, changecapture, checksum, compare, copy, diff, encode, filter, funnel, generator, head, lookup, merge, modify, pcompress, peek, pftp, pivot, remdup, sample, sequence, switch, tail, transform, and writerangemap operators
The import/export library
   Record schemas; field and record properties; implicit import and export; the import and export operators; import/export properties
The partitioning library
   The hash, modulus, random, range, roundrobin, and same partitioners; the writerangemap operator and the makerangemap utility
Restructure operators
   The tagswitch operator
The ODBC interface library
   The odbcread, odbcwrite, odbcupsert, and odbclookup operators
The SAS interface library
   The sasin, sas, sasout, and sascontents operators
The Informix interface library
   Read and write operators for Informix; the hplread, hplwrite, infxread, infxwrite, xpsread, and xpswrite operators
The Teradata interface library
   The teraread and terawrite operators
The SQL Server interface library
   The sqlsrvrread, sqlsrvrwrite, sqlsrvrupsert, and sqlsrvrlookup operators
Product accessibility
Accessing product documentation
Links to non-IBM Web sites
Notices and trademarks
Contacting IBM
Index
Chapter 1. Terminology
Because some of the descriptions in this manual are technical, it sometimes refers to details of the engine that drives parallel jobs. This involves terms that might be unfamiliar to ordinary parallel job users:
- Operators. These underlie the stages in an InfoSphere DataStage job. A single stage might correspond to a single operator or to a number of operators, depending on the properties you have set and on whether you have chosen to partition, collect, or sort data on the input link to a stage. At compilation, InfoSphere DataStage evaluates your job design and will sometimes optimize operators out if they are judged to be superfluous, or insert other operators if they are needed for the logic of the job.
- OSH. This is the scripting language used internally by the parallel engine.
- Players. Players are the workhorse processes in a parallel job. There is generally a player for each operator on each node. Players are the children of section leaders; there is one section leader per processing node. Section leaders are started by the conductor process running on the conductor node (the conductor node is defined in the configuration file).
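For orientation, here is a minimal, purely illustrative sketch of what a hand-written OSH fragment can look like; the file name, schema, and key field are invented for this example, and in normal use InfoSphere DataStage generates the OSH for you when it compiles the job design.

    # Hypothetical example only: import a flat file, sort it on "id", and write
    # a persistent data set. All names and the schema are placeholders.
    osh "import -file /tmp/input.txt
            -schema record(id:int32; name:string[max=30])
         | tsort -key id
         > sorted_by_id.ds"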
Modular development
Aim to use modular development techniques in your job designs to maximize the reuse of parallel jobs and components and to save development time.
- Use job parameters in your design and supply values at run time. This allows a single job design to process different data in different circumstances, rather than producing multiple copies of the same job with slightly different arguments (see the sketch after this list).
- Using job parameters also allows you to exploit the InfoSphere DataStage Director's multiple invocation capability. You can run several invocations of a job at the same time with different runtime arguments.
- Use shared containers to share common logic across a number of jobs. Remember that shared containers are inserted when a job is compiled. If the shared container is changed, the jobs using it will need recompiling.
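As a sketch of the run-time side of the first point, a job parameter defined in the design (and referenced in stage properties as #ParameterName#) might be supplied from the command line along these lines; the project, job, and parameter names are invented, and the dsjob options available depend on your release.

    # Hypothetical example: run the same job design against different source
    # files by supplying a parameter value at run time.
    dsjob -run -param SourceFile=/data/feeds/monday.txt MyProject LoadCustomers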
Avoid reading from sequential files using the Same partitioning method.
Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition (see "Using Sequential File Stages" for more tips on using Sequential file stages).
Combining data
The two major ways of combining data in an InfoSphere DataStage job are via a Lookup stage or a Join stage. How do you decide which one to use? Lookup and Join stages perform equivalent operations: combining two or more input data sets based on one or more specified keys. When one unsorted input is very large or sorting is not feasible, Lookup is preferred. When all inputs are of manageable size or are pre-sorted, Join is the preferred solution.
The Lookup stage is most appropriate when the reference data for all Lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of physical memory. The Lookup stage requires all but the first input (the primary input) to fit into physical memory. If the reference to a lookup comes directly from a DB2 or Oracle table and the number of input rows is significantly smaller than the number of reference rows (a ratio of 1:100 or more), a Sparse Lookup might be appropriate. If performance issues arise while using Lookup, consider using the Join stage. The Join stage must be used if the data sets are larger than available memory resources.
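At the operator level the two approaches look similar; the fragments below are illustrative sketches only (data set and field names are invented), and the operators themselves are described in the operator reference later in this guide.

    # Hypothetical examples. Lookup: the reference input is held in memory,
    # so it should be small.
    osh "lookup -key cust_id < transactions.v < customers.v > enriched.v"

    # Join: both inputs should be partitioned and sorted on the join key.
    osh "innerjoin -key cust_id < transactions.v < customers.v > joined.v"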
Sorting data
Look at job designs and try to reorder the job flow to combine operations around the same sort keys where possible, and coordinate your sorting strategy with your hashing strategy. It is sometimes possible to rearrange the order of business logic within a job flow to leverage the same sort order, partitioning, and groupings.

If data has already been partitioned and sorted on a set of key columns, specify the "don't sort, previously sorted" option for those key columns in the Sort stage. This reduces the cost of sorting and takes greater advantage of pipeline parallelism. When writing to parallel data sets, sort order and partitioning are preserved; when reading from these data sets, try to maintain this sorting if possible by using the Same partitioning method.

The stable sort option is much more expensive than a non-stable sort, and should be used only if there is a need to maintain row order other than as needed to perform the sort.

The performance of individual sorts can be improved by increasing the memory usage per partition using the Restrict Memory Usage (MB) option of the Sort stage. The default setting is 20 MB per partition. Note that sort memory usage can be specified only for standalone Sort stages; it cannot be changed for inline (on a link) sorts.
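For example, the per-partition memory limit of a standalone sort can be raised above the 20 MB default. The fragment below is an illustrative osh sketch only (the key field and value are invented); in the Designer the equivalent setting is the Restrict Memory Usage (MB) option of the Sort stage.

    # Hypothetical example: sort on accountNo with 60 MB of sort memory per
    # partition instead of the default 20 MB.
    osh "tsort -key accountNo -memory 60 < in.v > sorted.v"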
Table 1. Default and explicit type conversions, part 1 (continued). A matrix of the numeric source types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat) against the numeric destination types (int32, uint32, int64, uint64, sfloat, dfloat), with each cell marked according to the legend below.

Table 2. Default and explicit type conversions, part 2. A matrix of the source types (int8 through timestamp) against the destination types decimal, string, ustring, raw, date, time, and timestamp, with each cell marked according to the legend below.

d = default conversion; m = modify operator conversion; blank = no conversion needed or provided

You should also note the following points about type conversion:
- When converting from variable-length to fixed-length strings using default conversions, parallel jobs pad the remaining length with NULL (ASCII zero) characters.
- The environment variable APT_STRING_PADCHAR can be used to change the default pad character from an ASCII NULL (0x0) to another character; for example, an ASCII space (0x20) or a Unicode space (U+0020). See the example after this list.
- As an alternative, the PadString function can be used to pad a variable-length (Varchar) string to a specified length using a specified pad character. Note that PadString does not work with fixed-length (Char) string types. You must first convert Char to Varchar before using PadString.
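As a small illustration, the default pad character could be overridden before a run as shown below. The value syntax is an assumption and can vary between releases; confirm the accepted format in the environment variable reference for APT_STRING_PADCHAR.

    # Hypothetical example: pad fixed-length strings with an ASCII space rather
    # than the default NUL character. Some releases accept the character itself,
    # others a numeric form such as 0x20.
    export APT_STRING_PADCHAR=' '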
You should avoid generating target tables in the database from your InfoSphere DataStage job (that is, using the Create write mode on the database stage) unless they are intended for temporary storage only. This method does not allow you to, for example, specify the target table space, and you might inadvertently violate data-management policies on the database.

If you want to create a table on a target database from within a job, use the Open command property on the database stage to explicitly create the table, allocate its tablespace, and set any other options required. The Open command property allows you to specify a command (for example, some SQL) that is executed by the database before it processes any data from the stage. There is also a Close property that allows you to specify a command to execute after the data from the stage has been processed. (Note that, when using user-defined Open and Close commands, you might need to explicitly specify locks where appropriate.)
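Purely as an illustration of this pattern at the operator level, the sketch below creates the target table through an Open command and then appends to it; the table, columns, and tablespace are placeholders, connection options are omitted, and the exact option names should be checked against the orawrite operator reference.

    # Hypothetical example: create the table explicitly in the Open command,
    # rather than relying on the Create write mode.
    osh "orawrite -table SALES_TGT
           -open 'CREATE TABLE SALES_TGT (CUST_ID NUMBER(10), AMOUNT NUMBER(12,2)) TABLESPACE SALES_TS'
           < load_ready.v"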
The default value is 2. Setting this too low on a large system can result in so many players that the job fails due to insufficient resources. v requestedsessions. This is a number between 1 and the number of vprocs in the database. The default is the maximum number of available sessions.
Understanding a flow
To resolve performance issues, it is essential to understand the flow of InfoSphere DataStage jobs.
Score dumps
To help understand a job flow you can take a score dump. Do this by setting the APT_DUMP_SCORE environment variable to true and running the job (APT_DUMP_SCORE can be set in the Administrator client, under the Parallel > Reporting branch). This causes a report to be produced which shows the operators, processes, and data sets in the job. The report includes information about:
v Where and how data is repartitioned.
v Whether InfoSphere DataStage has inserted extra operators in the flow.
v The degree of parallelism each operator runs with, and on which nodes.
v Where data is buffered.
The dump score information is included in the job log when you run a job. The score dump is particularly useful in showing you where InfoSphere DataStage is inserting additional components in the job flow. In particular, InfoSphere DataStage will add partition and sort operators where the logic of the job demands it. Sorts in particular can be detrimental to performance, and a score dump can help you to detect superfluous operators and amend the job design to remove them.
Performance monitoring
There are various tools you can use to aid performance monitoring, some provided with InfoSphere DataStage and some general UNIX tools.
Job monitor
You access the IBM InfoSphere DataStage job monitor through the InfoSphere DataStage Director (see InfoSphere DataStage Director Client Guide). You can also use certain dsjob commands from the command line to access monitoring functions (see IBM InfoSphere DataStage Programmer's Guide for details). The job monitor provides a useful snapshot of a job's performance at a moment of execution, but does not provide thorough performance metrics. That is, a job monitor snapshot should not be used in place of a full run of the job, or a run with a sample set of data. Due to buffering and to some job semantics, a snapshot image of the flow might not be a representative sample of the performance over the course of the entire job. The CPU summary information provided by the job monitor is useful as a first approximation of where time is being spent in the flow. However, it does not include any sorts or similar operators that might be inserted automatically in a parallel job. For these components, the score dump can be of assistance. See "Score Dumps". A worst-case scenario occurs when a job flow reads from a data set and passes immediately to a sort on a link. The job will appear to hang when, in fact, rows are being read from the data set and passed to the sort. The operation of the job monitor is controlled by two environment variables: APT_MONITOR_TIME and APT_MONITOR_SIZE. By default the job monitor takes a snapshot every five seconds. You can alter the time interval by changing the value of APT_MONITOR_TIME, or you can have the monitor generate a new snapshot every so-many rows by following this procedure:
1. Select APT_MONITOR_TIME on the InfoSphere DataStage Administrator environment variable dialog box, and press the set to default button.
2. Select APT_MONITOR_SIZE and set the required number of rows as the value for this variable.
Iostat
The UNIX tool iostat is useful for examining the throughput of various disk resources. If one or more disks have high throughput, understanding where that throughput is coming from is vital. If there are spare CPU cycles, I/O is often the culprit. The specifics of iostat output vary slightly from system to system. Here is an example from a Linux machine which shows a relatively light load. (The first set of output is cumulative data since the machine was booted.)
Device:    tps     Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
dev8-0     13.50   144.09       122.33       346233038   293951288
...
Device:    tps     Blk_read/s   Blk_wrtn/s   Blk_read    Blk_wrtn
dev8-0     4.00    ...
Load average
Ideally, a performant job flow should be consuming as much CPU as is available. The load average on the machine should be two to three times the number of processors on the machine (for example, an 8-way SMP should have a load average of roughly 16-24). Some operating systems, such as HP-UX, show per-processor load average. In this case, the load average should be 2-3, regardless of the number of CPUs on the machine. If the machine is not CPU-saturated, a bottleneck might exist elsewhere in the flow. A useful strategy in this case is to over-partition your data, as more partitions cause extra processes to be started, utilizing more of the available CPU power. If the flow causes the machine to be fully loaded (all CPUs at 100%), then the flow is likely to be CPU limited, and some determination needs to be made as to where the CPU time is being spent (setting the APT_PM_PLAYER_TIMING environment variable can be helpful here - see the following section). The commands top or uptime can provide the load average.
Runtime information
When you set the APT_PM_PLAYER_TIMING environment variable, information is provided for each operator in a job flow. This information is written to the job log when the job is run. An example output is:
##I TFPM 000324 08:59:32(004) <generator,0> Calling runLocally: step=1, node=rh73dev04, op=0, ptn=0
##I TFPM 000325 08:59:32(005) <generator,0> Operator completed. status: APT_StatusOk elapsed: 0.04 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(006) <peek,0> Calling runLocally: step=1, node=rh73dev04, op=1, ptn=0
##I TFPM 000325 08:59:32(012) <peek,0> Operator completed. status: APT_StatusOk elapsed: 0.01 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
##I TFPM 000324 08:59:32(013) <peek,1> Calling runLocally: step=1, node=rh73dev04a, op=1, ptn=1
##I TFPM 000325 08:59:32(019) <peek,1> Operator completed. status: APT_StatusOk elapsed: 0.00 user: 0.00 sys: 0.00 suser: 0.09 ssys: 0.02 (total CPU: 0.11)
This output shows that each partition of each operator has consumed about one tenth of a second of CPU time during its runtime portion. In a real-world flow, we would see many operators and many partitions. It is often useful to see how much CPU each operator (and each partition of each component) is using. If one partition of an operator is using significantly more CPU than the others, it might mean that the data is partitioned in an unbalanced way, and that repartitioning, or choosing different partitioning keys, might be a useful strategy.
If one operator is using a much larger portion of the CPU than others, it might be an indication that you have discovered a problem in your flow. Common sense is generally required here; a sort is going to use dramatically more CPU time than a copy. This will, however, give you a sense of which operators are the CPU hogs, and when combined with other metrics presented in this document can be very enlightening. Setting the environment variable APT_DISABLE_COMBINATION might be useful in some situations to get finer-grained information as to which operators are using up CPU cycles. Be aware, however, that setting this flag will change the performance behavior of your flow, so this should be done with care. Unlike the job monitor CPU percentages, setting APT_PM_PLAYER_TIMING will provide timings on every operator within the flow.
Performance data
You can record performance data about job objects and computer resource utilization in parallel job runs. You can record performance data in these ways:
v At design time, with the Designer client
v At run time, with either the Designer client or the Director client
Performance data is written to an XML file that is in the default directory C:\IBM\InformationServer\Server\Performance. You can override the default location by setting the environment variable APT_PERFORMANCE_DATA. Use the Administrator client to set a value for this variable at the project level, or use the Parameters page of the Job Properties window to specify a value at the job level.
Procedure
1. Open a job in the Designer client.
2. Click Edit > Job Properties.
3. Click the Execution page.
4. Select the Record job performance data check box.
Results
Performance data is recorded each time that the job runs successfully.
Procedure
1. Open a job in the Designer client, or select a job in the display area of the Director client.
2. Click the Run button on the toolbar to open the Job Run Options window.
3. Click the General page.
4. Select the Record job performance data check box.
Results
Performance data is recorded each time that the job runs successfully.
Procedure
1. Open the Performance Analysis window by using one of the following methods:
v In the Designer client, click File > Performance Analysis.
v In the Director client, click Job > Analyze Performance.
v In either client, click the Performance Analysis toolbar button.
2. In the Performance Data group in the left pane, select the job run that you want to analyze. Job runs are listed in descending order according to the timestamp.
3. In the Charts group, select the chart that you want to view.
4. If you want to exclude certain job objects from a chart, use one of the following methods:
v For individual objects, clear the check boxes in the Job Tree group.
v For all objects of the same type, clear the check boxes in the Partitions, Stages, and Phases groups.
5. Optional: In the Filters group, change how data is filtered in a chart.
6. Click Save to save the job performance data in an archive. The archive includes the following files:
v Performance data file named performance.xxxx (where xxxx is the suffix that is associated with the job run)
v Computer descriptions file named description.xxxx
v Computer utilization file named utilization.xxxx
v Exported job definition named exportedjob.xxxx
Results
When you open a performance data file, the system creates a mapping between the job stages that are displayed on the Designer client canvas and the operating system processes that define a job. The mapping might not create a direct relationship between stages and processes for these reasons: v Some stages compile into many processes. v Some stages are combined into a single process. You can use the check boxes in the Filters area of the Performance Analysis window to include data about hidden operators in the performance data file. For example, Modify stages are combined with the previous stage in a job design. If you want to see the percentage of elapsed time that is used by a modify operator, clear the Hide Inserted Operators check box. Similarly, you can clear the Hide Composite Operators check box to expose performance data about composite operators.
What to do next
You can delete performance data files by clicking Delete. All of the data files that belong to the selected job run, including the performance data, utilization data, computer description data, and job export data, are deleted.
Performance analysis
Once you have carried out some performance monitoring, you can analyze your results. Bear in mind that, in a parallel job flow, certain operators might complete before the entire flow has finished, but the job isn't successful until the slowest operator has finished all its processing.
Similarly, specifying a nodemap for an operator might prove useful to eliminate repartitions. In this case, a transform stage sandwiched between a DB2 stage reading (db2read) and another one writing (db2write) might benefit from a nodemap placed on it to force it to run with the same degree of parallelism as the two db2 operators to avoid two repartitions.
Resource estimation
New in this release, you can estimate and predict the resource utilization of parallel job runs by creating models and making projections in the Resource Estimation window. A model estimates the system resources for a job, including the amount of scratch space, disk space, and CPU time that is needed for each stage to run on each partition. A model also estimates the data set throughput in a job. You can generate these types of models: v Static models estimate disk space and scratch space only. These models are based on a data sample that is automatically generated from the record schema. Use static models at compilation time. v Dynamic models predict disk space, scratch space, and CPU time. These models are based on a sampling of actual input data. Use dynamic models at run time. An input projection estimates the size of all of the data sources in a job. You can project the size in megabytes or in number of records. A default projection is created when you generate a model. The resource utilization results from a completed job run are treated as an actual model. A job can have only one actual model. In the Resource Estimation window, the actual model is the first model in the Models list. Similarly, the total size of the data sources in a completed job run are treated as an actual projection. You must select the actual projection in the Input Projections list to view the resource utilization statistics in the actual model. You can compare the actual model to your generated models to calibrate your modeling techniques.
Creating a model
You can create a static or dynamic model to estimate the resource utilization of a parallel job run.
Procedure
1. Open a job in the Designer client, or select a job in the Director client.
2. Open the Resource Estimation window by using one of the following methods:
v In the Designer, click File > Estimate Resource.
v In the Director, click Job > Estimate Resource.
v Click the Resource Estimation toolbar button.
The first time that you open the Resource Estimation window for a job, a static model is generated by default.
3. Click the Model toolbar button to display the Create Resource Model options.
4. Type a name in the Model Name field. The specified name must not already exist.
5. Select a type in the Model Type field.
6. If you want to specify a data sampling range for a dynamic model, use one of the following methods:
v Click the Copy Previous button to copy the sampling specifications from previous models, if any exist.
v Clear the Auto check box for a data source, and type values in the From and To fields to specify a record range.
7. Click Generate.
Results
After the model is created, the Resource Estimation window displays an overview of the model that includes the model type, the number of data segments in the model, the input data size, and the data sampling description for each input data source.
What to do next
Use the controls in the left pane of the Resource Estimation window to view statistics about partition utilization, data set throughput, and operator utilization in the model. You can also compare the model to other models that you generate.
Table 3. Static and dynamic models (continued)

Characteristic: Sample data
v Static models: Requires automatic data sampling. Uses the actual size of the input data if the size can be determined; otherwise, the sample size is set to a default value of 1000 records on each output link from each source stage.
v Dynamic models: Accepts automatic data sampling or a data range. Automatic data sampling determines the sample size dynamically according to the stage type: for a database source stage, the sample size is set to 1000 records on each output link from the stage; for all other source stage types, the sample size is set to the minimum number of input records among all sources on all partitions. A data range specifies the number of records to include in the sample for each data source. If the size of the sample data exceeds the actual size of the input data, the model uses the entire input data set.

Characteristic: Disk space and scratch space
v Static models: Estimates are based on a worst-case scenario.
v Dynamic models: Estimates are based on linear regression.

Characteristic: CPU time
v Static models: Not estimated.
v Dynamic models: Estimates are based on linear regression and are determined dynamically.

Characteristic: Data flow
v Static models: Estimates are based on a best-case scenario: no record is dropped, and input data is propagated from the source stages to all other stages in the job.
v Dynamic models: The best-case scenario does not apply: input data is processed, not propagated, and records can be dropped.

Characteristic: Record size
v Static models: Solely determined by the record schema. Estimates are based on a worst-case scenario.
v Dynamic models: Dynamically determined by the actual record at run time. Estimates are based on linear regression.

Characteristic: Data partitioning
v Static models: Data is assumed to be evenly distributed among all partitions.
v Dynamic models: Dynamically determined. Estimates are based on linear regression.
When a model is based on a worst-case scenario, the model uses maximum values. For example, if a variable can hold up to 100 characters, the model assumes that the variable always holds 100 characters. When a model is based on a best-case scenario, the model assumes that no single input record is dropped anywhere in the data flow. The accuracy of a model depends on these factors: Schema definition The size of records with variable-length fields cannot be determined until the records are processed. Use fixed-length or bounded-length schemas as much as possible to improve accuracy.
Input data When the input data contains more records with one type of key field than another, the records might be unevenly distributed across partitions. Specify a data sampling range that is representative of the input data. Parallel processing environment The availability of system resources when you run a job can affect the degree to which buffering occurs. Generate models in an environment that is similar to your production environment in terms of operating system, processor type, and number of processors.
Custom stages that need disk space and scratch space must call two additional functions within the dynamic scope of APT_Operator::runLocally(): v For disk space, call APT_Operator::setDiskSpace() to describe actual disk space usage. v For scratch space, call APT_Operator::setScratchSpace() to describe actual scratch space usage. Both functions accept values of APT_Int64.
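For illustration, a runLocally() implementation might accumulate its own usage counts and report them as follows. This is only a minimal sketch: the class name MyCustomOperator, the way the byte counts are accumulated, and the omitted record-processing loop are assumptions, not part of the documented API; only setDiskSpace(), setScratchSpace(), and the APT_Int64 argument type are taken from the description above.

// MyCustomOperator is a hypothetical subclass of APT_Operator.
APT_Status MyCustomOperator::runLocally()
{
    APT_Int64 diskBytesUsed    = 0;  // bytes this partition writes to permanent storage
    APT_Int64 scratchBytesUsed = 0;  // bytes this partition writes to scratch files

    // ... process records here, adding to diskBytesUsed and scratchBytesUsed
    // whenever the operator writes to disk or to scratch space ...

    // Report actual usage so that dynamic resource estimation can record it.
    setDiskSpace(diskBytesUsed);
    setScratchSpace(scratchBytesUsed);

    return APT_StatusOk;
}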
Making a projection
You can make a projection to predict the resource utilization of a job by specifying the size of the data sources.
Procedure
1. Open a job in the Designer client, or select a job in the Director client. 2. Open the Resource Estimation window by using one of the following methods: v In the Designer, click File > Estimate Resource. v In the Director, click Job > Estimate Resource. v Click the Resource Estimation toolbar button.
3. Click the Projection toolbar button to display the Make Resource Projection options.
4. Type a name in the Projection Name field. The specified name must not already exist.
5. Select the unit of measurement for the projection in the Input Units field.
6. Specify the input size upon which to base the projection by using one of the following methods:
v Click the Copy Previous button to copy the specifications from previous projections, if any exist.
v If the Input Units field is set to Size in Megabytes, type a value in the Megabytes (MB) field for each data source.
v If the Input Units field is set to Number of Records, type a value in the Records field for each data source.
7. Click Generate.
Results
The projection applies the input data information to the existing models, excluding the actual model, to predict the resource utilization for the given input data.
Procedure
1. In the Resource Estimation window, select a model in the Models list. 2. Select a projection in the Input Projections list. If you do not select a projection, the default projection is used. 3. Click the Report toolbar button.
Results
By default, reports are saved in the following directory: C:\IBM\InformationServer\Clients\Classic\Estimation\Host_name_for_InfoSphere_Information_Server_engine\project_name\job_name\html\report.html
What to do next
You can print the report or rename it by using the controls in your Web browser.
v Saves the merged records to two different data sets based on the value of a specific field.
In this example, each data source has 5 million records. You can use resource estimation models and projections to answer questions such as these:
v Which stage merges data most efficiently?
v When should data be sorted?
v Are there any performance bottlenecks?
v What are the disk and scratch space requirements if the size of the input data increases?
By comparing the models, you see that Job 1 does not require any scratch space, but is the slowest of the three jobs. The Lookup stage also requires memory to build a lookup table for a large amount of reference data. Therefore, the optimal job design uses either a Merge stage or a Join stage to merge data.
v Job 4 sorts the data first: 1. Each data source is linked to a separate Sort stage. 2. The sorted data is sent to a single Funnel stage for consolidation. 3. The Funnel stage sends the data to the Merge or Join stage, where it is merged with the data from the fourth data source. v Job 5 consolidates the data first: 1. The three source stages are linked to a single Funnel stage that consolidates the data. 2. The consolidated data is sent to a single Sort stage for sorting. 3. The Sort stage sends the data to the Merge or Join stage, where it is merged with the data from the fourth data source. Use the same processing configuration as in the first example to generate an automatic dynamic model for each job. The resource utilization statistics for each job are shown in the table:
Table 5. Resource utilization statistics
Job                          CPU (seconds)   Disk (MB)   Scratch (MB)
Job 4 (Sort before Funnel)   74.6812         515.125     801.086
Job 5 (Sort after Funnel)    64.1079         515.125     743.866
You can see that sorting data after consolidation is a better design because Job 5 uses approximately 15% less CPU time and 8% less scratch space than Job 4.
Table 6. Resource utilization statistics (continued)
Job                                                          CPU (seconds)   Disk (MB)   Scratch (MB)
Job 7 (Join stage with job parameter in Transformer stage)   106.5           801.125     915.527
Job performance is significantly improved after you remove the bottleneck in the Transformer stage. Total CPU time for Jobs 6 and 7 is about half of the total CPU time for Jobs 2 and 3. CPU time for the Transformer stage is a small portion of total CPU time: v In Job 6, the Transformer stage uses 13.8987 seconds out of the 109.065 seconds of total CPU time for the job. v In Job 7, the Transformer stage uses 13.1489 seconds out of the 106.5 seconds of total CPU time for the job. These models also show that job performance improves by approximately 2.4% when you merge data by using a Join stage rather than a Merge stage.
Lookup requires all but one (the first or primary) input to fit into physical memory. Join requires all inputs to be sorted. When one unsorted input is very large or sorting isn't feasible, lookup is the preferred solution. When all inputs are of manageable size or are pre-sorted, join is the preferred solution.
Combinable Operators
Combined operators generally improve performance at least slightly (in some cases the difference is dramatic). There might also be situations where combining operators actually hurts performance, however. Identifying such operators can be difficult without trial and error. The most common situation arises when multiple operators are performing disk I/O (for example, the various file stages and sort). In these sorts of situations, turning off combination for those specific stages might result in a performance increase if the flow is I/O bound. Combinable operators often provide a dramatic performance increase when a large number of variable length fields are used in a flow.
Disk I/O
Total disk throughput is often a fixed quantity that InfoSphere DataStage has no control over. It can, however, be beneficial to follow some rules.
v If data is going to be read back in, in parallel, it should never be written as a sequential file. A data set or file set stage is a much more appropriate format.
v When importing fixed-length data, the Number of Readers per Node property on the Sequential File stage can often provide a noticeable performance boost as compared with a single process reading the data.
v Some disk arrays have read-ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting the environment variable APT_CONSISTENT_BUFFERIO_SIZE=N will force stages to read data in chunks which are size N or a multiple of N.
v Memory mapped I/O, in many cases, contributes to improved performance. In certain situations, however, such as a remote disk mounted via NFS, memory mapped I/O might cause significant
performance problems. Setting the environment variables APT_IO_NOMAP and APT_BUFFERIO_NOMAP true will turn off this feature and sometimes affect performance. (AIX and HP-UX default to NOMAP. Setting APT_IO_MAP and APT_BUFFERIO_MAP true can be used to turn memory mapped I/O on for these platforms.)
Buffering
Buffering is intended to slow down input to match the consumption rate of the output. When the downstream operator reads very slowly, or not at all, for a length of time, upstream operators begin to slow down. This can cause a noticeable performance loss if the buffer's optimal behavior is something other than rate matching. By default, each link has a 3 MB in-memory buffer. Once that buffer reaches half full, the operator begins to push back on the upstream operator's rate. Once the 3 MB buffer is filled, data is written to disk in 1 MB chunks.
In most cases, the easiest way to tune buffering is to eliminate the pushback and allow the buffer operator to buffer the data to disk as necessary. Setting APT_BUFFER_FREE_RUN=N or setting Buffer Free Run in the Output page Advanced tab on a particular stage will do this. A buffer will read N * max_memory (3 MB by default) bytes before beginning to push back on the upstream. If there is enough disk space to buffer large amounts of data, this will usually fix any egregious slowdown issues caused by the buffer operator.
If there is a significant amount of memory available on the machine, increasing the maximum in-memory buffer size is likely to be very useful if buffering is causing any disk I/O. Setting the APT_BUFFER_MAXIMUM_MEMORY environment variable or Maximum memory buffer size on the Output page Advanced tab on a particular stage will do this. It defaults to 3145728 (3 MB).
For systems where small to medium bursts of I/O are not desirable, the 1 MB write-to-disk chunk size might be too small. The environment variable APT_BUFFER_DISK_WRITE_INCREMENT or Disk write increment on the Output page Advanced tab on a particular stage controls this and defaults to 1048576 (1 MB). This setting cannot exceed max_memory * 2/3.
Finally, in a situation where a large, fixed buffer is needed within the flow, Queue upper bound on the Output page Advanced tab (no environment variable exists) can be set equal to max_memory to force a buffer of exactly max_memory bytes. Such a buffer will block an upstream operator (until data is read by the downstream operator) once its buffer has been filled, so this setting should be used with extreme caution. This setting is rarely, if ever, necessary to achieve good performance, but might be useful in an attempt to squeeze every last byte of performance out of the system where it is desirable to eliminate buffering to disk entirely. No environment variable is available for this flag, and therefore it can only be set at the individual stage level.
AIX
If you are running IBM InfoSphere DataStage on an IBM RS/6000 SP or a network of workstations on IBM AIX, verify your setting of the thewall network parameter.
Buffering assumptions
This section describes buffering in more detail, and in particular the design assumptions underlying its default behavior. Buffering in InfoSphere DataStage is designed around the following assumptions:
v Buffering is primarily intended to remove the potential for deadlock in flows with fork-join structure.
v Throughput is preferable to overhead. The goal of the InfoSphere DataStage buffering mechanism is to keep the flow moving with as little memory and disk usage as possible. Ideally, data should simply stream through the data flow and rarely land to disk. Upstream operators should tend to wait for downstream operators to consume their input before producing new data records.
v Stages in general are designed so that on each link between stages data is being read and written whenever possible. While buffering is designed to tolerate occasional backlog on specific links due to one operator getting ahead of another, it is assumed that operators are at least occasionally attempting to read and write data on each link.
Buffering is implemented by the automatic insertion of a hidden buffer operator on links between stages. The buffer operator attempts to match the rates of its input and output. When no data is being read from the buffer operator by the downstream stage, the buffer operator tries to throttle back incoming data from the upstream stage to avoid letting the buffer grow so large that it must be written out to disk. The goal is to avoid situations where data will have to be moved to and from disk needlessly, especially in situations where the consumer cannot process data at the same rate as the producer (for example, due to a more complex calculation). Because the buffer operator wants to keep the flow moving with low overhead, it is assumed in general that it is better to cause the producing stage to wait before writing new records, rather than allow the buffer operator to consume resources.
Controlling buffering
InfoSphere DataStage offers two ways of controlling the operation of buffering: you can use environment variables to control buffering on all links of all stages in all jobs, or you can make individual settings on the links of particular stages via the stage editors.
Buffering policy
You can set this via the APT_BUFFERING_POLICY environment variable, or via the Buffering mode field on the Inputs or Outputs page Advanced tab for individual stage editors. The environment variable has the following possible values:
v AUTOMATIC_BUFFERING. Buffer a data set only if necessary to prevent a dataflow deadlock. This setting is the default if you do not define the environment variable.
v FORCE_BUFFERING. Unconditionally buffer all links.
v NO_BUFFERING. Do not buffer links. This setting can cause deadlock if used inappropriately.
The possible settings for the Buffering mode field are:
v (Default). This will take whatever the default settings are as specified by the environment variables (this will be Auto buffer unless you have explicitly changed the value of the APT_BUFFERING_POLICY environment variable).
v Auto buffer. Buffer data only if necessary to prevent a dataflow deadlock situation.
v Buffer. This will unconditionally buffer all data output from/input to this stage.
v No buffer. Do not buffer data under any circumstances. This could potentially lead to deadlock situations if not used carefully.
disk access, but might decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but might increase the amount of disk access.
v APT_BUFFER_FREE_RUN. Specifies how much of the available in-memory buffer to consume before the buffer offers resistance to any new data being written to it, as a percentage of Maximum memory buffer size. When the amount of buffered data is less than the Buffer free run percentage, input data is accepted immediately by the buffer. After that point, the buffer does not immediately accept incoming data; it offers resistance to the incoming data by first trying to output data already in the buffer before accepting any new input. In this way, the buffering mechanism avoids buffering excessive amounts of data and can also avoid unnecessary disk I/O. The default percentage is 0.5 (50% of Maximum memory buffer size, or by default 1.5 MB). You must set Buffer free run greater than 0.0. Typical values are between 0.0 and 1.0. You can set Buffer free run to a value greater than 1.0. In this case, the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing data to disk.
The available settings in the Inputs or Outputs page Advanced tab of stage editors are:
v Maximum memory buffer size (bytes). Specifies the maximum amount of virtual memory, in bytes, used per buffer. The default size is 3145728 (3 MB).
v Buffer free run (percent). Specifies how much of the available in-memory buffer to consume before the buffer resists. This is expressed as a percentage of Maximum memory buffer size. When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.
v Queue upper bound size (bytes). Specifies the maximum amount of data buffered at any time using both memory and disk. The default value is zero, meaning that the buffer size is limited only by the available disk space as specified in the configuration file (resource scratchdisk). If you set Queue upper bound size (bytes) to a non-zero value, the amount of data stored in the buffer will not exceed this value (in bytes) plus one block (where the data stored in a block cannot exceed 32 KB). If you set Queue upper bound size to a value equal to or slightly less than Maximum memory buffer size, and set Buffer free run to 1.0, you will create a finite capacity buffer that will not write to disk. However, the size of the buffer is limited by the virtual memory of your system and you can create deadlock if the buffer becomes full. (Note that there is no environment variable for Queue upper bound size.)
v Disk write increment (bytes). Sets the size, in bytes, of blocks of data being moved to/from disk by the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access against throughput for small amounts of data. Increasing the block size reduces disk access, but might decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but might increase the amount of disk access.
The default setting for Buffer free run is 0.5 for the environment variable, (50% for Buffer free run on the Advanced tab), which means that half of the internal memory buffer can be consumed before pushback occurs. This biases the buffer operator to avoid allowing buffered data to be written to disk. If your stage needs to buffer large data sets, we recommend that you initially set Buffer free run to a very large value such as 1000, and then adjust according to the needs of your application. This will allow the buffer operator to freely use both memory and disk space in order to accept incoming data without pushback. We recommend that you set the Buffer free run property only for those links between stages that require a non-default value; this means altering the setting on the Inputs page or Outputs page Advanced tab of the stage editors, not the environment variable.
Procedure
1. Do one of:
a. Choose File > New from the Designer menu. The New dialog box appears.
b. Open the Stage Type folder and select the Parallel Custom Stage Type icon.
c. Click OK. The Stage Type dialog box appears, with the General page on top.
Or:
d. Select a folder in the repository tree.
e. Choose New > Other Parallel Stage Custom from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
2. Fill in the fields on the General page as follows:
v Stage type name. This is the name that the stage will be known by to InfoSphere DataStage. Avoid using the same name as existing stages.
v Parallel Stage type. This indicates the type of new Parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.
v Execution Mode. Choose the execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See InfoSphere DataStage Parallel Job Developer Guide for a description of the execution mode.
v Mapping. Choose whether the stage has a Mapping tab or not. A Mapping tab enables the user of the stage to specify how output columns are derived from the data produced by the stage. Choose None to specify that output mapping is not performed, or choose Default to accept the default setting that InfoSphere DataStage uses.
v Preserve Partitioning. Choose the default setting of the Preserve Partitioning flag. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag.
v Partitioning. Choose the default partitioning method for the stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods. v Collecting. Choose the default collection method for the stage. This is the method that will appear in the Inputs page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the collection methods. v Operator. Enter the name of the Orchestrate operator that you want the stage to invoke. v Short Description. Optionally enter a short description of the stage. v Long Description. Optionally enter a long description of the stage. 3. Go to the Links page and specify information about the links allowed to and from the stage you are defining. Use this to specify the minimum and maximum number of input and output links that your custom stage can have, and to enable the ViewData feature for target data (you cannot enable target ViewData if your stage has any output links). When the stage is used in a job design, a ViewData button appears on the Input page, which allows you to view the data on the actual data target (provided some has been written there). In order to use the target ViewData feature, you have to specify an Orchestrate operator to read the data back from the target. This will usually be different to the operator that the stage has used to write the data (that is, the operator defined in the Operator field of the General page). Specify the reading operator and associated arguments in the Operator and Options fields. If you enable target ViewData, a further field appears in the Properties grid, called ViewData. 4. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a version number to the stage so you can keep track of any subsequent changes. You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field. You can also specify that the stage has its own icon. You need to supply a 16 x 16 bit bitmap and a 32 x 32 bit bitmap to be displayed in various places in the InfoSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the large bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default InfoSphere DataStage icon for this stage. 5. Go to the Properties page. This allows you to specify the options that the Orchestrate operator requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page. 6. Fill in the fields as follows: v Property name. The name of the property. v Data type. The data type of the property. Choose from: Boolean Float Integer String Pathname List Input Column Output Column If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.
If you choose List you should open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
v Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
v Default Value. The value the option will take if no other is specified.
v Required. Set this to True if the property is mandatory.
v Repeats. Set this to True if the property repeats (that is, you can have multiple instances of it).
v Use Quoting. Specify whether the property will have quotes added when it is passed to the Orchestrate operator.
v Conversion. Specifies the type of property as follows:
-Name. The name of the property will be passed to the operator as the option value. This will normally be a hidden property, that is, not visible in the stage editor.
-Name Value. The name of the property will be passed to the operator as the option name, and any value specified in the stage editor is passed as the value.
-Value. The value for the property specified in the stage editor is passed to the operator as the option name. Typically used to group operator options that are mutually exclusive.
Value only. The value for the property specified in the stage editor is passed as it is.
Input Schema. Specifies that the property will contain a schema string whose contents are populated from the Input page Columns tab.
Output Schema. Specifies that the property will contain a schema string whose contents are populated from the Output page Columns tab.
None. This allows the creation of properties that do not generate any osh, but can be used for conditions on other properties (for example, for use in a situation where you have mutually exclusive properties, but at least one of them must be specified).
v Schema properties require format options. Select this check box to specify that the stage being specified will have a Format tab.
If you have enabled target ViewData on the Links page, the following property is also displayed:
v ViewData. Select Yes to indicate that the value of this property should be used when viewing data. For example, if this property specifies a file to write to when the stage is used in a job design, the value of this property will be used to read the data back if ViewData is used in the stage.
If you select a conversion type of Input Schema or Output Schema, you should note the following:
v Data Type is set to String.
v Required is set to Yes.
v The property is marked as hidden and will not appear on the Properties page when the custom stage is used in a job design.
If your stage can have multiple input or output links there would be an Input Schema property or Output Schema property per-link. When the stage is used in a job design, the property will contain the following OSH for each input or output link:
-property_name record {format_properties} ( column_definition {format_properties}; ...)
Where: v property_name is the name of the property (usually `schema') v format_properties are formatting information supplied on the Format page (if the stage has one). v there is one column_definition for each column defined in the Columns tab for that link. The format_props in this case refers to per-column format information specified in the Edit Column Meta Data dialog box.
Schema properties are mutually exclusive with schema file properties. If your custom stage supports both, you should use the Extended Properties dialog box to specify a condition of "schemafile= " for the schema property. The schema property is then only valid provided the schema file property is blank (or does not exist). 7. If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box. The settings you use depend on the type of property you are specifying: v Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category. v Specify that the property will be hidden and not appear in the stage editor. This is primarily intended to support the case where the underlying operator needs to know the JobName. This can be passed using a mandatory String property with a default value that uses a DS Macro. However, to prevent the user from changing the value, the property needs to be hidden. v If you are specifying a List category, specify the possible values for list members in the List Value field. v If the property is to be a dependent of another property, select the parent property in the Parents field. v Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns. v Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification of this property is a bar '|' separated list of conditions that are AND'ed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d. 8. If your custom stage will create columns, go to the Mapping Additions page. It contains a grid that allows for the specification of columns created by the stage. You can also specify that column details are filled in from properties supplied when the stage is used in a job design, allowing for dynamic specification of columns. The grid contains the following fields: v Column name. The name of the column created by the stage. You can specify the name of a property you specified on the Property page of the dialog box to dynamically allocate the column name. Specify this in the form #property_name#, the created column will then take the value of this property, as specified at design time, as the name of the created column. v Parallel type. The type of the column (this is the underlying data type, not the SQL data type). Again you can specify the name of a property you specified on the Property page of the dialog box to dynamically allocate the column type. Specify this in the form #property_name#, the created column will then take the value of this property, as specified at design time, as the type of the created column. (Note that you cannot use a repeatable property to dynamically allocate a column type in this way.) v Nullable. Choose Yes or No to indicate whether the created column can contain a null. v Conditions. Allows you to enter an expression specifying the conditions under which the column will be created. This could, for example, depend on the setting of one of the properties specified in the Property page. 
You can propagate the values of the Conditions fields to other columns if required. Do this by selecting the columns you want to propagate to, then right-clicking in the source Conditions field and choosing Propagate from the shortcut menu. A dialog box asks you to confirm that you want to propagate the conditions to all columns. 9. Click OK when you are happy with your custom stage definition. The Save As dialog box appears. 10. Select the folder in the repository tree where you want to store the stage type and click OK.
Procedure
1. Do one of:
a. Choose File > New from the Designer menu. The New dialog box appears.
b. Open the Stage Type folder and select the Parallel Build Stage Type icon.
c. Click OK. The Stage Type dialog box appears, with the General page on top.
Or:
d. Select a folder in the repository tree.
e. Choose New > Other Parallel Stage Custom from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
2. Fill in the fields on the General page as follows:
v Stage type name. This is the name that the stage will be known by to InfoSphere DataStage. Avoid using the same name as existing stages.
v Class Name. The name of the C++ class. By default this takes the name of the stage type.
v Parallel Stage type. This indicates the type of new parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting. v Execution mode. Choose the default execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See InfoSphere DataStage Parallel Job Developer Guide for a description of the execution mode. v Preserve Partitioning. This shows the default setting of the Preserve Partitioning flag, which you cannot change in a Build stage. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag. v Partitioning. This shows the default partitioning method, which you cannot change in a Build stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods. v Collecting. This shows the default collection method, which you cannot change in a Build stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the collection methods. v Operator. The name of the operator that your code is defining and which will be executed by the InfoSphere DataStage stage. By default this takes the name of the stage type. v Short Description. Optionally enter a short description of the stage. v Long Description. Optionally enter a long description of the stage. 3. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a release number to the stage so you can keep track of any subsequent changes. You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field. You can also specify that the stage has its own icon. You need to supply a 16 x 16 bit bitmap and a 32 x 32 bit bitmap to be displayed in various places in the InfoSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the large bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default InfoSphere DataStage icon for this stage. 4. Go to the Properties page. This allows you to specify the options that the Build stage requires as properties that appear in the Stage Properties tab. For custom stages the Properties tab always appears under the Stage page. Fill in the fields as follows: v Property name. The name of the property. This will be passed to the operator you are defining as an option, prefixed with `-' and followed by the value selected in the Properties tab of the stage editor. v Data type. The data type of the property. Choose from: Boolean Float Integer String Pathname List Input Column Output Column If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns.
If you choose List you should open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
v Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
v Default Value. The value the option will take if no other is specified.
v Required. Set this to True if the property is mandatory.
v Conversion. Specifies the type of property as follows:
-Name. The name of the property will be passed to the operator as the option value. This will normally be a hidden property, that is, not visible in the stage editor.
-Name Value. The name of the property will be passed to the operator as the option name, and any value specified in the stage editor is passed as the value.
-Value. The value for the property specified in the stage editor is passed to the operator as the option name. Typically used to group operator options that are mutually exclusive.
Value only. The value for the property specified in the stage editor is passed as it is.
5. If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box. The settings you use depend on the type of property you are specifying:
v Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category.
v If you are specifying a List category, specify the possible values for list members in the List Value field.
v If the property is to be a dependent of another property, select the parent property in the Parents field.
v Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns.
v Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification of this property is a bar '|' separated list of conditions that are AND'ed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d.
Click OK when you are happy with the extended properties.
6. Click on the Build page. The tabs here allow you to define the actual operation that the stage will perform. The Interfaces tab enables you to specify details about inputs to and outputs from the stage, and about automatic transfer of records from input to output. You specify port details, a port being where a link connects to the stage. You need a port for each possible input link to the stage, and a port for each possible output link from the stage. You provide the following information on the Input sub-tab:
v Port Name. Optional name for the port. The default names for the ports are in0, in1, in2 ... . You can refer to them in the code using either the default name or the name you have specified.
v Alias. Where the port name contains non-ascii characters, you can give it an alias in this column (this is only available where NLS is enabled).
v AutoRead. This defaults to True which means the stage will automatically read records from the port. Otherwise you explicitly control read operations in the code.
v Table Name. Specify a table definition in the InfoSphere DataStage Repository which describes the meta data for the port. You can browse for a table definition by choosing Select Table from the menu that appears when you click the browse button. You can also view the schema corresponding to this table definition by choosing View Schema from the same menu. You do not have to supply a Table Name. If any of the columns in your table definition have names that contain non-ascii
characters, you should choose Column Aliases from the menu. The Build Column Aliases dialog box appears. This lists the columns that require an alias and lets you specify one.
v RCP. Choose True if runtime column propagation is allowed for inputs to this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.
You provide the following information on the Output sub-tab:
v Port Name. Optional name for the port. The default names for the links are out0, out1, out2 ... . You can refer to them in the code using either the default name or the name you have specified.
v Alias. Where the port name contains non-ascii characters, you can give it an alias in this column.
v AutoWrite. This defaults to True which means the stage will automatically write records to the port. Otherwise you explicitly control write operations in the code. Once records are written, the code can no longer access them.
v Table Name. Specify a table definition in the InfoSphere DataStage Repository which describes the meta data for the port. You can browse for a table definition. You do not have to supply a Table Name. A shortcut menu accessed from the browse button offers a choice of Clear Table Name, Select Table, Create Table, View Schema, and Column Aliases. The use of these is as described for the Input sub-tab.
v RCP. Choose True if runtime column propagation is allowed for outputs from this port. Defaults to False. You do not need to set this if you are using the automatic transfer facility.
The Transfer sub-tab allows you to connect an input buffer to an output buffer such that records will be automatically transferred from input to output. You can also disable automatic transfer, in which case you have to explicitly transfer data in the code. Transferred data sits in an output buffer and can still be accessed and altered by the code until it is actually written to the port. You provide the following information on the Transfer tab:
v Input. Select the input port to connect to the buffer from the drop-down list. If you have specified an alias, this will be displayed here.
v Output. Select the output port to transfer input records from the output buffer to from the drop-down list. If you have specified an alias, this will be displayed here.
v Auto Transfer. This defaults to False, which means that you have to include code which manages the transfer. Set to True to have the transfer carried out automatically.
v Separate. This is False by default, which means this transfer will be combined with other transfers to the same port. Set to True to specify that the transfer should be separate from other transfers.
The Logic tab is where you specify the actual code that the stage executes. The Definitions sub-tab allows you to specify variables, include header files, and otherwise initialize the stage before processing any records. The Pre-Loop sub-tab allows you to specify code which is executed at the beginning of the stage, before any records are processed. The Per-Record sub-tab allows you to specify the code which is executed once for every record processed. The Post-Loop sub-tab allows you to specify code that is executed after all the records have been processed. You can type straight into these pages or cut and paste from another editor. The shortcut menu on the Pre-Loop, Per-Record, and Post-Loop pages gives access to the macros that are available for use in the code. The Advanced tab allows you to specify details about how the stage is compiled and built.
Fill in the page as follows:
v Compile and Link Flags. Allows you to specify flags that are passed to the C++ compiler.
v Verbose. Select this check box to specify that the compile and build is done in verbose mode.
v Debug. Select this check box to specify that the compile and build is done in debug mode. Otherwise, it is done in optimize mode.
v Suppress Compile. Select this check box to generate files without compiling, and without deleting the generated files. This option is useful for fault finding.
v Base File Name. The base filename for generated files. All generated files will have this name followed by the appropriate suffix. This defaults to the name specified under Operator on the General page.
v Source Directory. The directory where generated .c files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
v Header Directory. The directory where generated .h files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
v Object Directory. The directory where generated .so files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
v Wrapper directory. The directory where generated .op files are placed. This defaults to the buildop folder in the current project directory. You can also set it using the DS_OPERATOR_BUILDOP_DIR environment variable in the InfoSphere DataStage Administrator.
7. When you have filled in the details in all the pages, click Generate to generate the stage. A window appears showing you the result of the build.
Informational macros
Use these macros in your code to determine the number of inputs, outputs, and transfers as follows:
v inputs(). Returns the number of inputs to the stage.
v outputs(). Returns the number of outputs from the stage.
v transfers(). Returns the number of transfers in the stage.
Flow-control macros
Use these macros to override the default behavior of the Per-Record loop in your stage definition:
v endLoop(). Causes the operator to stop looping, following completion of the current loop and after writing any auto outputs for this loop.
v nextLoop(). Causes the operator to immediately skip to the start of the next loop, without writing any outputs.
v failStep(). Causes the operator to return a failed status and terminate the job.
Input and output macros
These macros allow you to explicitly control the reading and writing of records. Each of the macros takes an argument as follows:
v input is the index of the input (0 to n). If you have defined a name for the input port you can use this in place of the index in the form portname.portid_.
v output is the index of the output (0 to n). If you have defined a name for the output port you can use this in place of the index in the form portname.portid_.
v index is the index of the transfer (0 to n).
The following macros are available:
v readRecord(input). Immediately reads the next record from input, if there is one. If there is no record, the next call to inputDone() will return true.
v writeRecord(output). Immediately writes a record to output.
v inputDone(input). Returns true if the last call to readRecord() for the specified input failed to read a new record, because the input has no more records.
v holdRecord(input). Causes auto input to be suspended for the current record, so that the operator does not automatically read a new record at the start of the next loop. If auto is not set for the input, holdRecord() has no effect.
v discardRecord(output). Causes auto output to be suspended for the current record, so that the operator does not output the record at the end of the current loop. If auto is not set for the output, discardRecord() has no effect.
v discardTransfer(index). Causes auto transfer to be suspended, so that the operator does not perform the transfer at the end of the current loop. If auto is not set for the transfer, discardTransfer() has no effect.
Transfer Macros
These macros allow you to explicitly control the transfer of individual records. Each of the macros takes an argument as follows:
v input is the index of the input (0 to n). If you have defined a name for the input port you can use this in place of the index in the form portname.portid_.
v output is the index of the output (0 to n). If you have defined a name for the output port you can use this in place of the index in the form portname.portid_.
v index is the index of the transfer (0 to n).
The following macros are available:
v doTransfer(index). Performs the transfer specified by index.
v doTransfersFrom(input). Performs all transfers from input.
v doTransfersTo(output). Performs all transfers to output.
v transferAndWriteRecord(output). Performs all transfers and writes a record for the specified output. Calling this macro is equivalent to calling the macros doTransfersTo() and writeRecord().
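For example, the following Per-Record code is a minimal sketch (not taken from the product examples) for a stage that has one input port with Auto Read disabled, one output port, and a single transfer defined. The index values (0) and the overall logic are illustrative assumptions only:

readRecord(0);                  // Auto Read is off, so read the next input record explicitly
if (inputDone(0))
{
    endLoop();                  // no more input records; stop looping after this iteration
}
else
{
    transferAndWriteRecord(0);  // perform all transfers to output 0 and write the record
}

If Auto Read, Auto Transfer, and Auto Write are all left at their defaults, none of these calls are needed, because the framework performs the equivalent read, transfer, and write steps automatically on each loop iteration.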
Procedure
1. Handles any definitions that you specified in the Definitions sub-tab when you entered the stage details.
2. Executes any code that was entered in the Pre-Loop sub-tab.
3. Loops repeatedly until either all inputs have run out of records, or the Per-Record code has explicitly invoked endLoop(). In the loop, performs the following steps:
a. Reads one record for each input, except where any of the following is true:
b. The input has no more records left.
c. The input has Auto Read set to false.
d. The holdRecord() macro was called for the input last time around the loop.
e. Executes the Per-Record code, which can explicitly read and write records, perform transfers, and invoke loop-control macros such as endLoop().
f. Performs each specified transfer, except where any of the following is true:
g. The input of the transfer has no more records.
h. The transfer has Auto Transfer set to False.
i. The discardTransfer() macro was called for the transfer during the current loop iteration.
j. Writes one record for each output, except where any of the following is true:
k. The output has Auto Write set to false.
l. The discardRecord() macro was called for the output during the current loop iteration.
4. If you have specified code in the Post-loop sub-tab, executes it.
5. Returns a status, which is written to the InfoSphere DataStage Job Log.
This method is fine if you want your stage to read a record from every link, every time round the loop.
Using inputs with auto read enabled for some and disabled for others
You define one (or possibly more) inputs as Auto Read, and the rest with Auto Read disabled. You code the stage in such a way that the processing of records from the Auto Read input drives the processing of the other inputs. Each time round the loop, your code should call inputDone() on the Auto Read input and, when it returns true, call endLoop() to complete the actions of the stage. This method is fine where you process a record from the Auto Read input every time around the loop, and then process records from one or more of the other inputs depending on the results of processing the Auto Read record.
Details about the single input to Divide are given on the Input sub-tab of the Interfaces tab. A table definition for the input link is available to be loaded from the InfoSphere DataStage Repository. Details about the outputs are given on the Output sub-tab of the Interfaces tab. When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output sub-tabs. Details about the transfers carried out by the stage are defined on the Transfer sub-tab of the Interfaces tab.
5. The code itself is defined on the Logic tab. In this case all the processing is done in the Per-Record loop and so is entered on the Per-Record sub-tab.
6. As this example uses all the compile and build defaults, all that remains is to click Generate to build the stage.
Procedure
1. Do one of:
a. Choose File > New from the Designer menu. The New dialog box appears.
b. Open the Stage Type folder and select the Parallel Wrapped Stage Type icon.
c. Click OK. The Stage Type dialog box appears, with the General page on top.
Or:
d. Select a folder in the repository tree.
e. Choose New > Other Parallel Stage Wrapped from the shortcut menu. The Stage Type dialog box appears, with the General page on top.
2. Fill in the fields on the General page as follows:
v Stage type name. This is the name that the stage will be known by to InfoSphere DataStage. Avoid using the same name as existing stages or the name of the actual UNIX command you are wrapping.
v Category. The category that the new stage will be stored in under the stage types branch. Type in or browse for an existing category or type in the name of a new one. The category also determines what group in the palette the stage will be added to. Choose an existing category to add to an existing group, or specify a new category to create a new palette group.
v Parallel Stage type. This indicates the type of new Parallel job stage you are defining (Custom, Build, or Wrapped). You cannot change this setting.
v Wrapper Name. The name of the wrapper file InfoSphere DataStage will generate to call the command. By default this will take the same name as the Stage type name.
v Execution mode. Choose the default execution mode. This is the mode that will appear in the Advanced tab on the stage editor. You can override this mode for individual instances of the stage as required, unless you select Parallel only or Sequential only. See InfoSphere DataStage Parallel Job Developer Guide for a description of the execution mode.
v Preserve Partitioning. This shows the default setting of the Preserve Partitioning flag, which you cannot change in a Wrapped stage. This is the setting that will appear in the Advanced tab on the stage editor. You can override this setting for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the preserve partitioning flag.
v Partitioning. This shows the default partitioning method, which you cannot change in a Wrapped stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the partitioning methods.
v Collecting. This shows the default collection method, which you cannot change in a Wrapped stage. This is the method that will appear in the Inputs Page Partitioning tab of the stage editor. You can override this method for individual instances of the stage as required. See InfoSphere DataStage Parallel Job Developer Guide for a description of the collection methods.
v Command. The name of the UNIX command to be wrapped, plus any required arguments. The arguments that you enter here are ones that do not change with different invocations of the command. Arguments that need to be specified when the Wrapped stage is included in a job are defined as properties for the stage.
v Short Description. Optionally enter a short description of the stage.
v Long Description. Optionally enter a long description of the stage.
3. Go to the Creator page and optionally specify information about the stage you are creating. We recommend that you assign a release number to the stage so you can keep track of any subsequent changes. You can specify that the actual stage will use a custom GUI by entering the ProgID for a custom GUI in the Custom GUI Prog ID field. You can also specify that the stage has its own icon.
You need to supply a 16 x 16 bit bitmap and a 32 x 32 bit bitmap to be displayed in various places in the InfoSphere DataStage user interface. Click the 16 x 16 Bitmap button and browse for the smaller bitmap file. Click the 32 x 32 Bitmap button and browse for the large bitmap file. Note that bitmaps with 32-bit color are not supported. Click the Reset Bitmap Info button to revert to using the default InfoSphere DataStage icon for this stage. 4. Go to the Properties page. This allows you to specify the arguments that the UNIX command requires as properties that appear in the stage Properties tab. For wrapped stages the Properties tab always appears under the Stage page.
Fill in the fields as follows:
v Property name. The name of the property that will be displayed on the Properties tab of the stage editor.
v Data type. The data type of the property. Choose from: Boolean, Float, Integer, String, Pathname, List, Input Column, Output Column. If you choose Input Column or Output Column, when the stage is included in a job a drop-down list will offer a choice of the defined input or output columns. If you choose List you should open the Extended Properties dialog box from the grid shortcut menu to specify what appears in the list.
v Prompt. The name of the property that will be displayed on the Properties tab of the stage editor.
v Default Value. The value the option will take if no other is specified.
v Required. Set this to True if the property is mandatory.
v Repeats. Set this true if the property repeats (that is, you can have multiple instances of it).
v Conversion. Specifies the type of property as follows:
- Name. The name of the property will be passed to the command as the argument value. This will normally be a hidden property, that is, not visible in the stage editor.
- Name Value. The name of the property will be passed to the command as the argument name, and any value specified in the stage editor is passed as the value.
- Value. The value for the property specified in the stage editor is passed to the command as the argument name. Typically used to group operator options that are mutually exclusive.
- Value only. The value for the property specified in the stage editor is passed as it is.
5. If you want to specify a list property, or otherwise control how properties are handled by your stage, choose Extended Properties from the Properties grid shortcut menu to open the Extended Properties dialog box. The settings you use depend on the type of property you are specifying:
v Specify a category to have the property appear under this category in the stage editor. By default all properties appear in the Options category.
v If you are specifying a List category, specify the possible values for list members in the List Value field.
v If the property is to be a dependent of another property, select the parent property in the Parents field.
v Specify an expression in the Template field to have the actual value of the property generated at compile time. It is usually based on values in other properties and columns.
v Specify an expression in the Conditions field to indicate that the property is only valid if the conditions are met. The specification of this property is a bar '|' separated list of conditions that are AND'ed together. For example, if the specification was a=b|c!=d, then this property would only be valid (and therefore only available in the GUI) when property a is equal to b, and property c is not equal to d.
Click OK when you are happy with the extended properties.
6. Go to the Wrapped page. This allows you to specify information about the command to be executed by the stage and how it will be handled.
The Interfaces tab is used to describe the inputs to and outputs from the stage, specifying the interfaces that the stage will need to function.
Details about inputs to the stage are defined on the Inputs sub-tab:
v Link. The link number, this is assigned for you and is read-only. When you actually use your stage, links will be assigned in the order in which you add them. In the example, the first link will be taken as link 0, the second as link 1 and so on. You can reassign the links using the stage editor's Link Ordering tab on the General page.
v Table Name. The meta data for the link. You define this by loading a table definition from the Repository. Type in the name, or browse for a table definition. Alternatively, you can specify an argument to the UNIX command which specifies a table definition. In this case, when the wrapped stage is used in a job design, the designer will be prompted for an actual table definition to use.
v Stream. Here you can specify whether the UNIX command expects its input on standard in, or another stream, or whether it expects it in a file. Click on the browse button to open the Wrapped Stream dialog box. In the case of a file, you should also specify whether the file to be read is given in a command line argument, or by an environment variable.
Details about outputs from the stage are defined on the Outputs sub-tab:
v Link. The link number, this is assigned for you and is read-only. When you actually use your stage, links will be assigned in the order in which you add them. In the example, the first link will be taken as link 0, the second as link 1 and so on. You can reassign the links using the stage editor's Link Ordering tab on the General page.
v Table Name. The meta data for the link. You define this by loading a table definition from the Repository. Type in the name, or browse for a table definition.
v Stream. Here you can specify whether the UNIX command will write its output to standard out, or another stream, or whether it outputs to a file. Click on the browse button to open the Wrapped Stream dialog box. In the case of a file, you should also specify whether the file to be written is specified in a command line argument, or by an environment variable.
The Environment tab gives information about the environment in which the command will execute. Set the following on the Environment tab:
v All Exit Codes Successful. By default InfoSphere DataStage treats an exit code of 0 as successful and all others as errors. Select this check box to specify that all exit codes should be treated as successful other than those specified in the Failure codes grid.
v Exit Codes. The use of this depends on the setting of the All Exits Codes Successful check box. If All Exits Codes Successful is not selected, enter the codes in the Success Codes grid which will be taken as indicating successful completion. All others will be taken as indicating failure. If All Exits Codes Successful is selected, enter the exit codes in the Failure Code grid which will be taken as indicating failure. All others will be taken as indicating success.
v Environment. Specify environment variables and settings that the UNIX command requires in order to run.
7. When you have filled in the details in all the pages, click Generate to generate the stage.
Wrapping the sort command in this way is useful when you have a fixed sort operation that is likely to be needed in several jobs. Having it as an easily reusable stage saves having to configure a built-in Sort stage every time you need it. When included in a job and run, the stage will effectively call the sort command as follows:
sort -r -o outfile -k 2 infile1 infile2
This stage is defined in InfoSphere DataStage using the Stage Type dialog box, as described in the following procedure.
Procedure
1. First, general details are supplied in the General tab. The argument defining the second column as the key is included in the command because this does not vary.
2. The reverse order argument (-r) is included as a property because it is optional and might or might not be included when the stage is incorporated into a job.
3. The fact that the sort command expects two files as input is defined on the Input sub-tab on the Interfaces tab of the Wrapped page.
4. The fact that the sort command outputs to a file is defined on the Output sub-tab on the Interfaces tab of the Wrapped page.
Note: When you use the stage in a job, make sure that you use table definitions compatible with the tables defined in the input and output sub-tabs.
5. Because all exit codes other than 0 are treated as errors, and because there are no special environment requirements for this command, you do not need to alter anything on the Environment tab of the Wrapped page. All that remains is to click Generate to build the stage.
Chapter 6. Environment Variables
The environment variables that affect parallel jobs are grouped by category as follows:
DB2 Support: APT_RDBMS_COMMIT_ROWS, DB2DBDFT
Debugging: APT_DEBUG_OPERATOR, APT_DEBUG_MODULE_NAMES, APT_DEBUG_PARTITION, APT_DEBUG_SIGNALS, APT_DEBUG_STEP, APT_DEBUG_SUBPROC, APT_EXECUTION_MODE, APT_PM_DBX, APT_PM_GDB, APT_PM_SHOW_PIDS, APT_PM_XLDB, APT_PM_XTERM, APT_PXDEBUGGER_FORCE_SEQUENTIAL, APT_SHOW_LIBLOAD
Decimal Support: APT_DECIMAL_INTERM_PRECISION, APT_DECIMAL_INTERM_SCALE, APT_DECIMAL_INTERM_ROUND_MODE
Disk I/O: APT_BUFFER_DISK_WRITE_INCREMENT, APT_CONSISTENT_BUFFERIO_SIZE, APT_EXPORT_FLUSH_COUNT, APT_IO_MAP/APT_IO_NOMAP and APT_BUFFERIO_MAP/APT_BUFFERIO_NOMAP, APT_PHYSICAL_DATASET_BLOCK_SIZE
General Job Administration: APT_CHECKPOINT_DIR, APT_CLOBBER_OUTPUT, APT_CONFIG_FILE, APT_DISABLE_COMBINATION, APT_EXECUTION_MODE, APT_ORCHHOME, APT_STARTUP_SCRIPT, APT_NO_STARTUP_SCRIPT, APT_STARTUP_STATUS, APT_THIN_SCORE
Job Monitoring: APT_MONITOR_SIZE, APT_MONITOR_TIME, APT_NO_JOBMON, APT_PERFORMANCE_DATA
Look Up Support: APT_LUTCREATE_NO_MMAP
Miscellaneous: APT_COPY_TRANSFORM_OPERATOR, APT_EBCDIC_VERSION, APT_DATE_CENTURY_BREAK_YEAR, APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL, APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS, APT_INSERT_COPY_BEFORE_MODIFY, APT_ISVALID_BACKCOMPAT, APT_OLD_BOUNDED_LENGTH, APT_OPERATOR_REGISTRY_PATH, APT_PM_NO_SHARED_MEMORY, APT_PM_NO_NAMED_PIPES, APT_PM_SOFT_KILL_WAIT, APT_PM_STARTUP_CONCURRENCY, APT_RECORD_COUNTS, APT_SAVE_SCORE, APT_SHOW_COMPONENT_CALLS, APT_STACK_TRACE, APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING, APT_TRANSFORM_LOOP_WARNING_THRESHOLD, APT_WRITE_DS_VERSION, OSH_PRELOAD_LIBS
Network: APT_IO_MAXIMUM_OUTSTANDING, APT_IOMGR_CONNECT_ATTEMPTS, APT_PM_CONDUCTOR_HOSTNAME, APT_PM_NO_TCPIP, APT_PM_NODE_TIMEOUT, APT_PM_SHOWRSH, APT_PM_STARTUP_PORT, APT_PM_USE_RSH_LOCALLY, APT_RECVBUFSIZE, APT_USE_IPV4
NLS: APT_COLLATION_SEQUENCE, APT_COLLATION_STRENGTH, APT_ENGLISH_MESSAGES, APT_IMPEXP_CHARSET, APT_INPUT_CHARSET, APT_OS_CHARSET, APT_OUTPUT_CHARSET, APT_STRING_CHARSET
Oracle Support: APT_ORACLE_LOAD_OPTIONS, APT_ORACLE_NO_OPS, APT_ORACLE_PRESERVE_BLANKS, APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM, APT_ORA_WRITE_FILES, APT_ORAUPSERT_COMMIT_ROW_INTERVAL, APT_ORAUPSERT_COMMIT_TIME_INTERVAL
Partitioning: APT_NO_PART_INSERTION, APT_NO_PARTSORT_OPTIMIZATION, APT_PARTITION_COUNT, APT_PARTITION_NUMBER
Reading and Writing Files: APT_DELIMITED_READ_SIZE, APT_FILE_IMPORT_BUFFER_SIZE, APT_FILE_EXPORT_BUFFER_SIZE, APT_IMPORT_PATTERN_USES_FILESET, APT_MAX_DELIMITED_READ_SIZE, APT_PREVIOUS_FINAL_DELIMITER_COMPATIBLE, APT_STRING_PADCHAR
Reporting: APT_DUMP_SCORE, APT_ERROR_CONFIGURATION, APT_MSG_FILELINE, APT_PM_PLAYER_MEMORY, APT_PM_PLAYER_TIMING, APT_RECORD_COUNTS, OSH_DUMP, OSH_ECHO, OSH_EXPLAIN, OSH_PRINT_SCHEMAS
SAS Support: APT_HASH_TO_SASHASH, APT_NO_SASOUT_INSERT, APT_NO_SAS_TRANSFORMS, APT_SAS_ACCEPT_ERROR, APT_SAS_CHARSET, APT_SAS_CHARSET_ABORT, APT_SAS_COMMAND, APT_SASINT_COMMAND, APT_SAS_DEBUG, APT_SAS_DEBUG_IO, APT_SAS_DEBUG_LEVEL, APT_SAS_DEBUG_VERBOSE, APT_SAS_NO_PSDS_USTRING, APT_SAS_S_ARGUMENT, APT_SAS_SCHEMASOURCE_DUMP, APT_SAS_SHOW_INFO, APT_SAS_TRUNCATION
Sorting: APT_NO_SORT_INSERTION, APT_SORT_INSERTION_CHECK_ONLY
Teradata Support: APT_TERA_64K_BUFFERS, APT_TERA_NO_ERR_CLEANUP, APT_TERA_NO_PERM_CHECKS, APT_TERA_NO_SQL_CONVERSION, APT_TERA_SYNC_DATABASE, APT_TERA_SYNC_USER
Transport Blocks: APT_LATENCY_COEFFICIENT, APT_DEFAULT_TRANSPORT_BLOCK_SIZE, APT_MAX_TRANSPORT_BLOCK_SIZE/APT_MIN_TRANSPORT_BLOCK_SIZE
Buffering
These environment variables are all concerned with the buffering that InfoSphere DataStage performs on stage links to avoid deadlock situations. These settings can also be made on the Inputs page or Outputs page Advanced tab of the parallel stage editors.
APT_BUFFER_FREE_RUN
This environment variable is available in the InfoSphere DataStage Administrator, under the Parallel category. It specifies how much of the available in-memory buffer to consume before the buffer resists accepting new data. This is expressed as a decimal representing the percentage of Maximum memory buffer size (for example, 0.5 is 50%). When the amount of data in the buffer is less than this value, new data is accepted automatically. When the data exceeds it, the buffer first tries to write some of the data it contains before accepting more. The default value is 50% of the Maximum memory buffer size. You can set it to greater than 100%, in which case the buffer continues to store data up to the indicated multiple of Maximum memory buffer size before writing to disk.
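For example, to allow the buffer to hold up to twice the Maximum memory buffer size before it starts writing to disk, you might set the variable as follows (Korn shell syntax; the value 2.0 is illustrative only):

$ export APT_BUFFER_FREE_RUN=2.0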
APT_BUFFER_MAXIMUM_MEMORY
Sets the default value of Maximum memory buffer size, which specifies the maximum amount of virtual memory, in bytes, used per buffer. The default value is 3145728 (3 MB).
APT_BUFFER_MAXIMUM_TIMEOUT
InfoSphere DataStage buffering is self tuning, which can theoretically lead to long delays between retries. This environment variable specifies the maximum wait before a retry, in seconds, and is by default set to 1.
APT_BUFFER_DISK_WRITE_INCREMENT
Sets the size, in bytes, of blocks of data being moved to/from disk by the buffering operator. The default is 1048576 (1 MB). Adjusting this value trades amount of disk access against throughput for small amounts of data. Increasing the block size reduces disk access, but might decrease performance when data is being read/written in smaller units. Decreasing the block size increases throughput, but might increase the amount of disk access.
APT_BUFFERING_POLICY
This environment variable is available in the InfoSphere DataStage Administrator, under the Parallel category. Controls the buffering policy for all virtual data sets in all steps. The variable has the following settings:
v AUTOMATIC_BUFFERING (default). Buffer a data set only if necessary to prevent a data flow deadlock.
v FORCE_BUFFERING. Unconditionally buffer all virtual data sets. Note that this can slow down processing considerably.
v NO_BUFFERING. Do not buffer data sets. This setting can cause data flow deadlock if used inappropriately.
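For example, to unconditionally buffer all virtual data sets, perhaps while investigating a suspected deadlock, you might set (Korn shell syntax):

$ export APT_BUFFERING_POLICY=FORCE_BUFFERING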
APT_DISABLE_ROOT_FORKJOIN
Set this environment variable to disable a change to the fork-join pattern that was introduced in IBM InfoSphere Information Server, Version 8.7. A change was introduced in InfoSphere Information Server, Version 8.7, so that combination of the root operator of a fork-join pattern is disabled to prevent potential deadlock situations. To ensure compatibility with Version 8.5, set the APT_DISABLE_ROOT_FORKJOIN environment variable to re-enable buffering of concurrent input operators, such as the funnel operator.
APT_SHARED_MEMORY_BUFFERS
Typically the number of shared memory buffers between two processes is fixed at 2. Setting this variable increases the number used. The likely result is an increase in both latency and performance. This setting can significantly increase memory use.
DS_OPERATOR_BUILDOP_DIR
Identifies the directory in which generated buildops are created. By default this identifies a directory called buildop under the current project directory. If the directory is changed, the corresponding entry in APT_OPERATOR_REGISTRY_PATH needs to change to match the buildop folder.
OSH_BUILDOP_CODE
Identifies the directory into which buildop writes the generated .C file and build script. It defaults to the current working directory. The -C option of buildop overrides this setting.
OSH_BUILDOP_HEADER
Identifies the directory into which buildop writes the generated .h file. It defaults to the current working directory. The -H option of buildop overrides this setting.
OSH_BUILDOP_OBJECT
Identifies the directory into which buildop writes the dynamically loadable object file, whose extension is .so on Solaris, .sl on HP-UX, or .o on AIX. Defaults to the current working directory. The -O option of buildop overrides this setting.
OSH_BUILDOP_XLC_BIN
AIX only. Identifies the directory specifying the location of the shared library creation command used by buildop. On older AIX systems this defaults to /usr/lpp/xlC/bin/makeC++SharedLib_r for thread-safe compilation. On newer AIX systems it defaults to /usr/ibmcxx/bin/makeC++SharedLib_r. For non-thread-safe compilation, the default path is the same, but the name of the file is makeC++SharedLib.
OSH_CBUILDOP_XLC_BIN
AIX only. Identifies the directory specifying the location of the shared library creation command used by cbuildop. If this environment variable is not set, cbuildop checks the setting of OSH_BUILDOP_XLC_BIN for the path. On older AIX systems OSH_CBUILDOP_XLC_BIN defaults to /usr/lpp/xlC/bin/makeC++SharedLib_r for thread-safe compilation. On newer AIX systems it defaults to /usr/ibmcxx/bin/makeC++SharedLib_r. For non-threadsafe compilation, the default path is the same, but the name of the command is makeC++SharedLib.
Compiler
These environment variables specify details about the C++ compiler used by InfoSphere DataStage in connection with parallel jobs.
APT_COMPILER
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Compiler branch. Specifies the full path to the C++ compiler.
APT_COMPILEOPT
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Compiler branch. Specifies extra options to be passed to the C++ compiler when it is invoked.
APT_LINKER
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Compiler branch. Specifies the full path to the C++ linker.
APT_LINKOPT
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Compiler branch. Specifies extra options to be passed to the C++ linker when it is invoked.
DB2 Support
These environment variables are concerned with setting up access to DB2 databases from InfoSphere DataStage.
APT_DB2INSTANCE_HOME
Specifies the DB2 installation directory. This variable is set by InfoSphere DataStage to values obtained from the DB2Server table, representing the currently selected DB2 server.
APT_DB2READ_LOCK_TABLE
If this variable is defined and the open option is not specified for the DB2 stage, InfoSphere DataStage performs the following open command to lock the table:
lock table table_name in share mode
APT_DBNAME
Specifies the name of the database if you choose to leave out the Database option for DB2 stages. If APT_DBNAME is not defined as well, DB2DBDFT is used to find the database name. These variables are set by InfoSphere DataStage to values obtained from the DB2Server table, representing the currently selected DB2 server.
APT_RDBMS_COMMIT_ROWS
Specifies the number of records to insert into a data set between commits. The default value is 2048.
DB2DBDFT
For DB2 operators, you can set the name of the database by using the -dbname option or by setting APT_DBNAME. If you do not use either method, DB2DBDFT is used to find the database name. These variables are set by InfoSphere DataStage to values obtained from the DB2Server table, representing the currently selected DB2 server.
Debugging
These environment variables are concerned with the debugging of InfoSphere DataStage parallel jobs.
APT_DEBUG_OPERATOR
Specifies the operators on which to start debuggers. If not set, no debuggers are started. If set to an operator number (as determined from the output of APT_DUMP_SCORE), debuggers are started for that single operator. If set to -1, debuggers are started for all operators.
APT_DEBUG_MODULE_NAMES
This comprises a list of module names separated by white space that are the modules to debug, that is, where internal IF_DEBUG statements will be run. The subproc operator module (module name is "subproc") is one example of a module that uses this facility.
APT_DEBUG_PARTITION
Specifies the partitions on which to start debuggers. One instance, or partition, of an operator is run on each node running the operator. If set to a single number, debuggers are started on that partition; if not set or set to -1, debuggers are started on all partitions. See the description of APT_DEBUG_OPERATOR for more information on using this environment variable. For example, setting APT_DEBUG_STEP to 0, APT_DEBUG_OPERATOR to 1, and APT_DEBUG_PARTITION to -1 starts debuggers for every partition of the second operator in the first step.
The following table summarizes how these two variables interact:

APT_DEBUG_OPERATOR   APT_DEBUG_PARTITION   Effect
not set              any value             no debugging
-1                   not set or -1         debug all partitions of all operators
-1                   >= 0                  debug a specific partition of all operators
>= 0                 -1                    debug all partitions of a specific operator
>= 0                 >= 0                  debug a specific partition of a specific operator
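For example, the settings described above (debuggers for every partition of the second operator in the first step) could be made in Korn shell syntax as follows:

$ export APT_DEBUG_STEP=0
$ export APT_DEBUG_OPERATOR=1
$ export APT_DEBUG_PARTITION=-1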
APT_DEBUG_SIGNALS
You can use the APT_DEBUG_SIGNALS environment variable to specify that signals such as SIGSEGV, SIGBUS, and so on, should cause a debugger to start. If any of these signals occurs within an APT_Operator::runLocally() function, a debugger such as dbx is invoked. Note that the UNIX and InfoSphere DataStage variables DEBUGGER, DISPLAY, and APT_PM_XTERM, specifying a debugger and how the output should be displayed, must be set correctly.
APT_DEBUG_STEP
Specifies the steps on which to start debuggers. If not set or if set to -1, debuggers are started on the processes specified by APT_DEBUG_OPERATOR and APT_DEBUG_PARTITION in all steps. If set to a step number, debuggers are started for processes in the specified step.
APT_DEBUG_SUBPROC
Display debug information about each subprocess operator.
APT_EXECUTION_MODE
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode:
v ONE_PROCESS one-process mode
v MANY_PROCESS many-process mode
v NO_SERIALIZE many-process mode, without serialization
In ONE_PROCESS mode:
v The application executes in a single UNIX process. You need only run a single debugger session and can set breakpoints anywhere in your code.
v Data is partitioned according to the number of nodes defined in the configuration file.
v Each operator is run as a subroutine and is called the number of times appropriate for the number of partitions on which it must operate.
In MANY_PROCESS mode the framework forks a new process for each instance of each operator and waits for it to complete rather than calling operators as subroutines. In both cases, the step is run entirely on the Conductor node rather than spread across the configuration.
NO_SERIALIZE mode is similar to MANY_PROCESS mode, but the InfoSphere DataStage persistence mechanism is not used to load and save objects. Turning off persistence might be useful for tracking errors in derived C++ classes.
APT_PM_DBX
Set this environment variable to the path of your dbx debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.
APT_PM_GDB
Linux only. Set this environment variable to the path of your gdb debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.
APT_PM_LADEBUG
Tru64 only. Set this environment variable to the path of your ladebug debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.
APT_PM_SHOW_PIDS
If this variable is set, players will output an informational message upon startup, displaying their process id.
APT_PM_XLDB
Set this environment variable to the path of your xldb debugger, if a debugger is not already included in your path. This variable sets the location; it does not run the debugger.
APT_PM_XTERM
If InfoSphere DataStage invokes dbx, the debugger starts in an xterm window; this means InfoSphere DataStage must know where to find the xterm program. The default location is /usr/bin/X11/xterm. You can override this default by setting the APT_PM_XTERM environment variable to the appropriate path. APT_PM_XTERM is ignored if you are using xldb.
APT_PXDEBUGGER_FORCE_SEQUENTIAL
Set the APT_PXDEBUGGER_FORCE_SEQUENTIAL environment variable to specify the names of one or more stages that are to be run in sequential mode when a parallel job is run in debug mode. Multiple stage names are separated by a single space character. The environment variable is set at job level.
APT_SHOW_LIBLOAD
If set, dumps a message to stdout every time a library is loaded. This can be useful for testing/verifying the right library is being loaded. Note that the message is output to stdout, NOT to the error log.
Decimal support
APT_DECIMAL_INTERM_SCALE
Specifies the default scale value for any decimal intermediate variables required in calculations. Default value is 10.
APT_DECIMAL_INTERM_ROUND_MODE
Specifies the default rounding mode for any decimal intermediate variables required in calculations. The default is round_inf.
Disk I/O
These environment variables are all concerned with when and how InfoSphere DataStage parallel jobs write information to disk.
APT_BUFFER_DISK_WRITE_INCREMENT
For systems where small to medium bursts of I/O are not desirable, the default 1 MB write-to-disk chunk size might be too small. APT_BUFFER_DISK_WRITE_INCREMENT controls this and can be set larger than 1048576 (1 MB). The setting must not exceed max_memory * 2/3.
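For example, to move data to and from disk in 4 MB blocks rather than the default 1 MB, you might set (Korn shell syntax; the value is illustrative and must stay within the max_memory * 2/3 limit):

$ export APT_BUFFER_DISK_WRITE_INCREMENT=4194304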
APT_CONSISTENT_BUFFERIO_SIZE
Some disk arrays have read ahead caches that are only effective when data is read repeatedly in like-sized chunks. Setting APT_CONSISTENT_BUFFERIO_SIZE=N will force stages to read data in chunks which are size N or a multiple of N.
APT_EXPORT_FLUSH_COUNT
Allows the export operator to flush data to disk more often than it typically does (data is explicitly flushed at the end of a job, although the OS might choose to do so more frequently). Set this variable to an integer which, in number of records, controls how often flushes should occur. Setting this value to a low number (such as 1) is useful for real time applications, but there is a small performance penalty associated with setting this to a low value.
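For example, a near real-time job might flush after every record, accepting the performance penalty noted above (Korn shell syntax):

$ export APT_EXPORT_FLUSH_COUNT=1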
APT_PHYSICAL_DATASET_BLOCK_SIZE
Specify the block size to use for reading and writing to a data set stage. The default is 128 KB.
General Job Administration
These environment variables are concerned with details about the running of parallel jobs.
APT_CHECKPOINT_DIR
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. By default, when running a job, InfoSphere DataStage stores state information in the current working directory. Use APT_CHECKPOINT_DIR to specify another directory.
APT_CLOBBER_OUTPUT
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. By default, if an output file or data set already exists, InfoSphere DataStage issues an error and stops before overwriting it, notifying you of the name conflict. Setting this variable to any value permits InfoSphere DataStage to overwrite existing files or data sets without a warning message.
APT_CONFIG_FILE
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. Sets the path name of the configuration file. (You might want to include this as a job parameter, so that you can specify the configuration file at job run time).
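For example, to point jobs at a particular configuration file (Korn shell syntax; the path shown is hypothetical):

$ export APT_CONFIG_FILE=/opt/configs/4node.apt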
APT_DISABLE_COMBINATION
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. Globally disables operator combining. Operator combining is InfoSphere DataStage's default behavior, in which two or more (in fact any number of) operators within a step are combined into one process where possible. You might need to disable combining to facilitate debugging. Note that disabling combining generates more UNIX processes, and hence requires more system resources and memory. It also disables internal optimizations for job efficiency and run times.
APT_EXECUTION_MODE
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode:
v ONE_PROCESS one-process mode
v MANY_PROCESS many-process mode
v NO_SERIALIZE many-process mode, without serialization
In ONE_PROCESS mode:
v The application executes in a single UNIX process. You need only run a single debugger session and can set breakpoints anywhere in your code.
v Data is partitioned according to the number of nodes defined in the configuration file.
v Each operator is run as a subroutine and is called the number of times appropriate for the number of partitions on which it must operate.
In MANY_PROCESS mode the framework forks a new process for each instance of each operator and waits for it to complete rather than calling operators as subroutines. In both cases, the step is run entirely on the Conductor node rather than spread across the configuration.
NO_SERIALIZE mode is similar to MANY_PROCESS mode, but the InfoSphere DataStage persistence mechanism is not used to load and save objects. Turning off persistence might be useful for tracking errors in derived C++ classes.
APT_ORCHHOME
Must be set by all InfoSphere DataStage users to point to the top-level directory of the parallel engine installation.
APT_STARTUP_SCRIPT
As part of running a job, InfoSphere DataStage creates a remote shell on all InfoSphere DataStage processing nodes on which the job runs. By default, the remote shell is given the same environment as the shell from which InfoSphere DataStage is invoked. However, you can write an optional startup shell script to modify the shell configuration of one or more processing nodes. If a startup script exists, InfoSphere DataStage runs it on remote shells before running your job. APT_STARTUP_SCRIPT specifies the script to be run. If it is not defined, InfoSphere DataStage searches ./startup.apt, $APT_ORCHHOME/etc/startup.apt and $APT_ORCHHOME/etc/startup, in that order. APT_NO_STARTUP_SCRIPT disables running the startup script.
APT_NO_STARTUP_SCRIPT
Prevents InfoSphere DataStage from executing a startup script. By default, this variable is not set, and InfoSphere DataStage runs the startup script. If this variable is set, InfoSphere DataStage ignores the startup script. This might be useful when debugging a startup script. See also APT_STARTUP_SCRIPT.
APT_STARTUP_STATUS
Set this to cause messages to be generated as parallel job startup moves from phase to phase. This can be useful as a diagnostic if parallel job startup is failing.
APT_THIN_SCORE
Setting this variable decreases the memory usage of steps with 100 operator instances or more by a noticeable amount. To use this optimization, set APT_THIN_SCORE=1 in your environment. There are no performance benefits in setting this variable unless you are running out of real memory at some point in your flow or the additional memory is useful for sorting or buffering. This variable does not affect any specific operators which consume large amounts of memory, but improves general parallel job memory handling.
QSM_DISABLE_DISTRIBUTE_COMPONENT
Set this environment variable to ensure that control files for InfoSphere QualityStage jobs are not copied from the conductor node to one or more compute nodes. In MPP or grid environments, the conductor and compute nodes might not share the project directory. If the project directory is not shared, some InfoSphere QualityStage jobs must copy control files from the conductor node to the compute nodes. If the project directory is shared, the control files do not need to be copied. Copying the control files can cause file access issues. Set this environment variable to ensure that control files are not copied.
If you configure your grid environment to run InfoSphere DataStage jobs, this environment variable is set by default.
Job Monitoring
These environment variables are concerned with the Job Monitor on InfoSphere DataStage.
APT_MONITOR_SIZE
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. Determines the minimum number of records the InfoSphere DataStage Job Monitor reports. The default is 50000 records.
APT_MONITOR_TIME
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel branch. Determines the minimum time interval in seconds for generating monitor information at runtime. The default is 5 seconds. This variable takes precedence over APT_MONITOR_SIZE.
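For example, to have the Job Monitor generate information every 10 seconds instead of the default 5 (Korn shell syntax):

$ export APT_MONITOR_TIME=10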
APT_NO_JOBMON
Turn off job monitoring entirely.
APT_PERFORMANCE_DATA
Set this variable to turn on performance data output generation. APT_PERFORMANCE_DATA can be either set with no value, or be set to a valid path which will be used as the default location for performance data output.
Look up support
This environment variable is concerned with how lookup tables are created.
APT_LUTCREATE_NO_MMAP
Set this to force lookup tables to be created using malloced memory. By default lookup table creation is done using memory mapped files. There might be situations, depending on the OS configuration or file system, where writing to memory mapped files causes poor performance. In these situations this variable can be set so that malloced memory is used, which should boost performance.
Miscellaneous
APT_COPY_TRANSFORM_OPERATOR
If set, distributes the shared object file of the sub-level transform operator and the shared object file of user-defined functions (not extern functions) via distribute-component in a non-NFS MPP.
APT_DATE_CENTURY_BREAK_YEAR
The four-digit year that marks the century that two-digit dates belong to. It is set to 1900 by default.
APT_EBCDIC_VERSION
Some operators, including the import and export operators, support the ebcdic property. This is set to specify that field data is represented in the EBCDIC character set. The APT_EBCDIC_VERSION environment variable indicates which EBCDIC character set to use. You can set APT_EBCDIC_VERSION to one of the following values:
v HP. Use the EBCDIC character set supported by HP terminals (this is the default, except on USS installations).
v IBM. Use the EBCDIC character set supported by IBM 3780 terminals.
v ATT. Use the EBCDIC character set supported by AT&T terminals.
v USS. Use the IBM 1047 EBCDIC character set (this is the default on USS installations).
v IBM037. Use the IBM 037 EBCDIC character set.
v IBM500. Use the IBM 500 EBCDIC character set.
If the value of APT_EBCDIC_VERSION is HP, IBM, ATT, or USS, EBCDIC data is internally converted to and from 7-bit ASCII. If the value is IBM037 or IBM500, EBCDIC data is internally converted to and from ISO-8859-1 (the 8-bit Latin-1 superset of ASCII, with accented character support).
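For example, to have EBCDIC data converted to and from ISO-8859-1 using the IBM 037 character set (Korn shell syntax):

$ export APT_EBCDIC_VERSION=IBM037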
APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL
When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.
APT_IMPORT_REJECT_STRING_FIELD_OVERRUNS
When set, InfoSphere DataStage will reject any string or ustring fields read that go over their fixed size. By default these records are truncated.
APT_INSERT_COPY_BEFORE_MODIFY
When defined, turns on automatic insertion of a copy operator before any modify operator (WARNING: if this variable is not set and the operator immediately preceding 'modify' in the data flow uses a modify adapter, the 'modify' operator will be removed from the data flow). Only set this if you write your own custom operators AND use modify within those operators.
APT_ISVALID_BACKCOMPAT
Set this environment variable to disable a change to the isValid() function that was introduced in IBM InfoSphere Information Server, Version 8.1. A change was introduced in InfoSphere Information Server, Version 8.1, to the result returned by the isValid() function when coded in a transform expression. The isValid() function was changed to return true or false to indicate whether an implicit conversion of the given string to a variable of the specified type would succeed.
Set the APT_ISVALID_BACKCOMPAT environment variable to ensure compatibility with versions before InfoSphere Information Server, Version 8.1.
APT_OLD_BOUNDED_LENGTH
Some parallel datasets generated with InfoSphere DataStage 7.0.1 and later releases require more disk space when the columns are of type VarChar when compared to 7.0. This is due to changes added for performance improvements for bounded length VarChars in 7.0.1. Set APT_OLD_BOUNDED_LENGTH to any value to revert to pre-7.0.1 storage behavior when using bounded length varchars. Setting this variable can have adverse performance effects. The preferred and more performant solution is to use unbounded length VarChars (don't set any length) for columns where the maximum length is rarely used, rather than set this environment variable.
APT_OPERATOR_REGISTRY_PATH
Used to locate operator .apt files, which define what operators are available and which libraries they are found in.
APT_PM_NO_SHARED_MEMORY
By default, shared memory is used for local connections. If this variable is set, named pipes rather than shared memory are used for local connections. If both APT_PM_NO_NAMED_PIPES and APT_PM_NO_SHARED_MEMORY are set, then TCP sockets are used for local connections.
APT_PM_NO_NAMED_PIPES
Specifies not to use named pipes for local connections. Named pipes will still be used in other areas of InfoSphere DataStage, including subprocs and setting up of the shared memory transport protocol in the process manager.
APT_PM_SOFT_KILL_WAIT
Delay between SIGINT and SIGKILL during abnormal job shutdown. Gives time for processes to run cleanups if they catch SIGINT. Defaults to ZERO.
APT_PM_STARTUP_CONCURRENCY
Setting this to a small integer determines the number of simultaneous section leader startups to be allowed. Setting this to 1 forces sequential startup. The default is defined by SOMAXCONN in sys/socket.h (currently 5 for Solaris, 10 for AIX).
APT_RECORD_COUNTS
Causes InfoSphere DataStage to print, for each operator Player, the number of records consumed by getRecord() and produced by putRecord(). Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.
APT_SAVE_SCORE
Sets the name and path of the file used by the performance monitor to hold temporary score data. The path must be visible from the host machine. The performance monitor creates this file, therefore it need not exist when you set this variable.
APT_SHOW_COMPONENT_CALLS
This forces InfoSphere DataStage to display messages at job check time as to which user overloadable functions (such as checkConfig and describeOperator) are being called. This will not produce output at runtime and is not guaranteed to be a complete list of all user-overloadable functions being called, but an effort is made to keep this synchronized with any new virtual functions provided.
APT_STACK_TRACE
This variable controls the number of lines printed for stack traces. The values are:
v unset. 10 lines printed
v 0. infinite lines printed
v N. N lines printed
v none. no stack trace
The last setting can be used to disable stack traces entirely.
APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING
Set APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING to revert to manual handling of nulls for all jobs in the project. This setting means that, when you use an input column in the derivation expression of an output column in the Transformer stage, you have to explicitly handle any nulls that occur in the input data. If you do not specify such handling, a null causes the row to be dropped or, if a reject link exists, rejected. If you do set APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING, then a null in the input column used in the derivation causes a null to be output (unless overridden at the job level).
APT_TRANSFORM_LOOP_WARNING_THRESHOLD
Set APT_TRANSFORM_LOOP_WARNING_THRESHOLD to a value to control the number of times that a loop iterates, and the size of the cache used to hold input rows. A warning is written to the job log when a loop has repeated the specified number of times, and the warning is repeated every time a multiple of that value is reached. So, for example, if you specify a threshold of 100, warnings are written to the job log when the loop iterates 100 times, 200 times, 300 times, and so on. Setting the threshold to 0 specifies that no warnings are issued. Similarly, a warning is issued when the cache holding input rows for aggregate operations reach the specified value, or a multiple of that value. Setting APT_TRANSFORM_LOOP_WARNING_THRESHOLD specifies a threshold for all jobs in a project.
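For example, to log a warning at every 100 loop iterations, as in the scenario described above (Korn shell syntax):

$ export APT_TRANSFORM_LOOP_WARNING_THRESHOLD=100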
APT_WRITE_DS_VERSION
By default, InfoSphere DataStage saves data sets in the Orchestrate Version 4.1 format. APT_WRITE_DS_VERSION lets you save data sets in formats compatible with previous versions of Orchestrate. The values of APT_WRITE_DS_VERSION are:
v v3_0. Orchestrate Version 3.0
v v3. Orchestrate Version 3.1.x
v v4. Orchestrate Version 4.0
v v4_0_3. Orchestrate Version 4.0.3 and later versions up to but not including Version 4.1
v v4_1. Orchestrate Version 4.1 and later versions through and including Version 4.6
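For example, to save data sets in a format readable by Orchestrate Version 4.0.3 (Korn shell syntax):

$ export APT_WRITE_DS_VERSION=v4_0_3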
OSH_PRELOAD_LIBS
Specifies a colon-separated list of names of libraries to be loaded before any other processing. Libraries containing custom operators must be assigned to this variable or they must be registered. For example, in Korn shell syntax:
$ export OSH_PRELOAD_LIBS="orchlib1:orchlib2:mylib1"
Network
These environment variables are concerned with the operation of InfoSphere DataStage parallel jobs over a network.
APT_IO_MAXIMUM_OUTSTANDING
Sets the amount of memory, in bytes, allocated to an InfoSphere DataStage job on every physical node for network communications. The default value is 2097152 (2 MB). When you are executing many partitions on a single physical node, this number might need to be increased.
APT_IOMGR_CONNECT_ATTEMPTS
Sets the number of attempts for a TCP connect in case of a connection failure. This is necessary only for jobs with a high degree of parallelism in an MPP environment. The default value is 2 attempts (1 retry after an initial failure).
APT_PM_CONDUCTOR_HOSTNAME
The network name of the processing node from which you invoke a job should be included in the configuration file as either a node or a fastname. If the network name is not included in the configuration file, InfoSphere DataStage users must set the environment variable APT_PM_CONDUCTOR_HOSTNAME to the name of the node invoking the InfoSphere DataStage job.
APT_PM_NO_TCPIP
This turns off use of UNIX sockets to communicate between player processes at runtime. If the job is being run in an MPP (non-shared memory) environment, do not set this variable, as UNIX sockets are your only communications option.
APT_PM_NODE_TIMEOUT
This controls the number of seconds that the conductor will wait for a section leader to start up and load a score before deciding that something has failed. The default for starting a section leader process is 30. The default for loading a score is 120.
APT_PM_SHOWRSH
Displays a trace message for every call to RSH.
APT_PM_STARTUP_PORT
Use this environment variable to specify the port number from which the parallel engine will start looking for TCP/IP ports.
By default, InfoSphere DataStage starts looking at port 10000. If you know that ports in this range are used by another application, set APT_PM_STARTUP_PORT to start at a different level. You should check the /etc/services file for reserved ports.
APT_PM_USE_RSH_LOCALLY
If set, startup will use rsh even on the conductor node.
APT_RECVBUFSIZE
The value of APT_RECVBUFSIZE specifies the per-connection TCP/IP buffer space that is allocated. This might need to be set if any stage within a job has a large number of inter-node communication links. The value is specified in bytes. Setting APT_RECVBUFSIZE overrides the values of the following environment variables: v APT_SENDBUFSIZE v APT_IO_MAXIMUM_OUTSTANDING Set APT_SENDBUFSIZE or APT_RECVBUFSIZE in preference to APT_IO_MAXIMUM_OUTSTANDING for more efficient use of buffer space.
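For example, to allocate 1 MB of TCP/IP buffer space per connection (Korn shell syntax; the value is illustrative):

$ export APT_RECVBUFSIZE=1048576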
APT_USE_IPV4
Set this environment variable to force network class to use only IPv4 protocols. The default is to use the IPv6/IPv4 dual stack protocol.
NLS Support
These environment variables are concerned with InfoSphere DataStage's implementation of NLS. Note: You should not change the settings of any of these environment variables other than APT_COLLATION_STRENGTH if NLS is enabled.
APT_COLLATION_SEQUENCE
This variable is used to specify the global collation sequence to be used by sorts, compares, and so on. This value can be overridden at the stage level.
APT_COLLATION_STRENGTH
Set this to specify the details of the collation algorithm. This can be used to ignore accents, punctuation, or other details. It is set to one of Identical, Primary, Secondary, Tertiary, or Quaternary. Setting it to Default unsets the environment variable. For more information, see http://oss.software.ibm.com/icu/userguide/Collate_Concepts.html.
APT_ENGLISH_MESSAGES
If set to 1, outputs every message issued with its English equivalent.
APT_IMPEXP_CHARSET
Controls the character encoding of ustring data imported and exported to and from InfoSphere DataStage, and the record and field properties applied to ustring fields. Its syntax is:
APT_IMPEXP_CHARSET icu_character_set
APT_INPUT_CHARSET
Controls the character encoding of data input to schema and configuration files. Its syntax is:
APT_INPUT_CHARSET icu_character_set
APT_OS_CHARSET
Controls the character encoding InfoSphere DataStage uses for operating system data such as the names of created files and the parameters to system calls. Its syntax is:
APT_OS_CHARSET icu_character_set
APT_OUTPUT_CHARSET
Controls the character encoding of InfoSphere DataStage output messages and operators like peek that use the error logging system to output data input to the osh parser. Its syntax is:
APT_OUTPUT_CHARSET icu_character_set
APT_STRING_CHARSET
Controls the character encoding InfoSphere DataStage uses when performing automatic conversions between string and ustring fields. Its syntax is:
APT_STRING_CHARSET icu_character_set
Oracle Support
These environment variables are concerned with the interaction between InfoSphere DataStage and Oracle databases.
APT_ORACLE_LOAD_OPTIONS
You can use the environment variable APT_ORACLE_LOAD_OPTIONS to control the options that are included in the Oracle load control file. You can load a table with indexes without using the Index Mode or Disable Constraints properties by setting the APT_ORACLE_LOAD_OPTIONS environment variable appropriately. You need to set the DIRECT option or the PARALLEL option to FALSE, for example:
APT_ORACLE_LOAD_OPTIONS=OPTIONS(DIRECT=FALSE,PARALLEL=TRUE)
In this example the stage would still run in parallel, however, since DIRECT is set to FALSE, the conventional path mode rather than the direct path mode would be used. If loading index organized tables (IOTs), you should not set both DIRECT and PARALLEL to true as direct parallel path load is not allowed for IOTs.
APT_ORACLE_NO_OPS
Set this if you do not have Oracle Parallel server installed on an AIX system. It disables the OPS checking mechanism on the Oracle Enterprise stage.
APT_ORACLE_PRESERVE_BLANKS
Set this to set the PRESERVE BLANKS option in the control file. This preserves leading and trailing spaces. When PRESERVE BLANKS is not set Oracle removes the spaces and considers fields with only spaces to be NULL values.
APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM
By default InfoSphere DataStage determines the number of processing nodes available for a parallel write to Oracle from the configuration file. Set APT_ORA_IGNORE_CONFIG_FILE_PARALLELISM to use the number of data files in the destination table's tablespace instead.
APT_ORA_WRITE_FILES
Set this to prevent the invocation of the Oracle loader when write mode is selected on an Oracle Enterprise destination stage. Instead, the sqlldr commands are written to a file, the name of which is specified by this environment variable. The file can be invoked once the job has finished to run the loaders sequentially. This can be useful in tracking down export and pipe-safety issues related to the loader.
APT_ORAUPSERT_COMMIT_ROW_INTERVAL APT_ORAUPSERT_COMMIT_TIME_INTERVAL
These two environment variables work together to specify how often target rows are committed when using the Upsert method to write to Oracle. Commits are made whenever the time interval period has passed or the row interval is reached, whichever comes first. By default, commits are made every 2 seconds or 5000 rows.
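For example, to commit after every 10000 rows or every 10 seconds, whichever comes first (illustrative values):
export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=10000   # illustrative row interval
export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=10     # illustrative time interval, in seconds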
Partitioning
The following environment variables are concerned with how InfoSphere DataStage automatically partitions data.
APT_NO_PART_INSERTION
InfoSphere DataStage automatically inserts partition components in your application to optimize the performance of the stages in your job. Set this variable to prevent this automatic insertion.
APT_NO_PARTSORT_OPTIMIZATION
Set this environment variable to disable optimizations that were introduced in IBM InfoSphere Information Server, Version 8.7. If APT_NO_PARTSORT_OPTIMIZATION is defined, these optimizations are disabled:
v repartitioning on all input links even if one input link requires a repartition
v if a user-defined tsort does not fulfill the requirements of the downstream operator, replacing the user-defined tsort with a new tsort instead of inserting an additional tsort after the user-defined tsort
APT_PARTITION_COUNT
Read only. InfoSphere DataStage sets this environment variable to the number of partitions of a stage. The number is based both on information listed in the configuration file and on any constraints applied to the stage. The number of partitions is the degree of parallelism of a stage. For example, if a stage executes on two processing nodes, APT_PARTITION_COUNT is set to 2.
You can access the environment variable APT_PARTITION_COUNT to determine the number of partitions of the stage from within:
v an operator wrapper
v a shell script called from a wrapper
v getenv() in C++ code
v sysget() in the SAS language.
APT_PARTITION_NUMBER
Read only. On each partition, InfoSphere DataStage sets this environment variable to the index number (0, 1, ...) of this partition within the stage. A subprocess can then examine this variable when determining which partition of an input file it should handle.
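As an illustration, a shell script called from an operator wrapper might use these two read-only variables to select the slice of data it should handle; the file naming below is hypothetical:
#!/bin/sh
# Hypothetical wrapper fragment: report which partition this player is,
# then read only the input file that belongs to this partition.
echo "running partition $APT_PARTITION_NUMBER of $APT_PARTITION_COUNT" >&2
cat /data/input_part_${APT_PARTITION_NUMBER}.dat   # hypothetical per-partition file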
APT_DELIMITED_READ_SIZE
By default, InfoSphere DataStage reads ahead 500 bytes to get the next delimiter. For streaming inputs (socket, FIFO, and so on) this is sub-optimal, because InfoSphere DataStage might block (and not output any records). When this environment variable is set, InfoSphere DataStage reads this many bytes (the minimum legal value is 2) instead of 500 when reading a delimited record. If a delimiter is not found within N bytes, N is incremented by a factor of 2 (when this environment variable is not set, the factor is 4).
APT_FILE_IMPORT_BUFFER_SIZE
The value in kilobytes of the buffer for reading in files. The default is 128 (that is, 128 KB). It can be set to values from 8 upward, but is clamped to a minimum value of 8. That is, if you set it to a value less than 8, then 8 is used. Tune this upward for long-latency files (typically from heavily loaded file servers).
APT_FILE_EXPORT_BUFFER_SIZE
The value in kilobytes of the buffer for writing to files. The default is 128 (that is, 128 KB). It can be set to values from 8 upward, but is clamped to a minimum value of 8. That is, if you set it to a value less than 8, then 8 is used. Tune this upward for long-latency files (typically from heavily loaded file servers).
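For example, to use 1 MB read and write buffers when the files live on a heavily loaded file server (illustrative values, in kilobytes):
export APT_FILE_IMPORT_BUFFER_SIZE=1024   # illustrative value, in KB
export APT_FILE_EXPORT_BUFFER_SIZE=1024   # illustrative value, in KB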
APT_IMPORT_PATTERN_USES_FILESET
When this is set, InfoSphere DataStage turns any file pattern into a fileset before processing the files. This allows the files to be processed in parallel as opposed to sequentially. By default, a file pattern cats the files together to be used as the input.
APT_MAX_DELIMITED_READ_SIZE
By default, when reading, InfoSphere DataStage will read ahead 500 bytes to get the next delimiter. If it is not found, InfoSphere DataStage looks ahead 4*500=2000 (1500 more) bytes, and so on (4X) up to 100,000
bytes. This variable controls the upper bound which is by default 100,000 bytes. Note that this variable should be used instead of APT_DELIMITED_READ_SIZE when a larger than 500 bytes read-ahead is desired.
APT_PREVIOUS_FINAL_DELIMITER_COMPATIBLE
Set this to revert to the pre-release 7.5 behavior of the final delimiter whereby, when writing data, a space is inserted after every field in a record, including the last one. (The new behavior is that a space is written after every field except the last one.)
APT_STRING_PADCHAR
Overrides the pad character of 0x0 (ASCII null), used by default when InfoSphere DataStage extends, or pads, a string field to a fixed length.
Reporting
These environment variables are concerned with various aspects of InfoSphere DataStage jobs reporting their progress.
APT_DUMP_SCORE
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. Configures InfoSphere DataStage to print a report showing the operators, processes, and data sets in a running job.
APT_ERROR_CONFIGURATION
Controls the format of InfoSphere DataStage output messages. Note: Changing these settings can seriously interfere with InfoSphere DataStage logging. This variable's value is a comma-separated list of keywords (see table below). Each keyword enables a corresponding portion of the message. To disable that portion of the message, precede it with a "!". Default formats of messages displayed by InfoSphere DataStage include the keywords severity, moduleId, errorIndex, timestamp, opid, and message. The following table lists keywords, the length (in characters) of the associated components in the message, and the keyword's meaning. The characters "##" precede all messages. The keyword lengthprefix appears in three locations in the table. This single keyword controls the display of all length prefixes.
severity (length 1): Severity indication: F, E, W, or I.
vseverity (length 7): Verbose description of error severity (Fatal, Error, Warning, Information).
jobid (length 3): The job identifier of the job. This allows you to identify multiple jobs running at once. The default job identifier is 0.
moduleId (length 4): The module identifier. For InfoSphere DataStage-defined messages, this value is a four byte string beginning with T. For user-defined messages written to the error log, this string is USER. For all outputs from a subprocess, the string is USBP.
errorIndex: The index of the message specified at the time the message was written to the error log.
timestamp (length 13): The message time stamp. This component consists of the string "HH:MM:SS(SEQ)", at the time the message was written to the error log. Messages generated in the same second have ordered sequence numbers.
ipaddr (length 15): The IP address of the processing node generating the message. This 15-character string is in octet form, with individual octets zero filled, for example, 104.032.007.100.
lengthprefix (length 2): Length in bytes of the following field.
nodename (length variable): The node name of the processing node generating the message.
lengthprefix (length 2): Length in bytes of the following field.
opid (length variable): The string <main_program> for error messages originating in your main program (outside of a step or within the APT_Operator::describeOperator() override). The string <node_nodename> representing system error messages originating on a node, where nodename is the name of the node. The operator originator identifier, represented by "ident, partition_number", for errors originating within a step. This component identifies the instance of the operator that generated the message. ident is the operator name (with the operator index in parenthesis if there is more than one instance of it). partition_number defines the partition number of the operator issuing the message.
lengthprefix (length 5): Length, in bytes, of the following field. Maximum length is 15 KB.
message (length variable): Error text.
(no keyword, length 1): Newline character.
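For example, to keep only the severity, timestamp, and message text and suppress the remaining components, you might set something like the following (an illustrative value; adjust the keyword list as needed):
export APT_ERROR_CONFIGURATION='severity, !vseverity, !jobid, !moduleId, !errorIndex, timestamp, !ipaddr, !lengthprefix, !nodename, !opid, message'   # illustrative keyword list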
APT_MSG_FILELINE
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. Set this to have InfoSphere DataStage log extra internal information for parallel jobs.
APT_PM_PLAYER_MEMORY
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. Setting this variable causes each player process to report the process heap memory allocation in the job log when returning.
APT_PM_PLAYER_TIMING
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. Setting this variable causes each player process to report its call and return in the job log. The message with the return is annotated with CPU times for the player process.
APT_RECORD_COUNTS
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. Causes InfoSphere DataStage to print to the job log, for each operator player, the number of records input and output. Abandoned input records are not necessarily accounted for. Buffer operators do not print this information.
OSH_DUMP
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. If set, it causes InfoSphere DataStage to put a verbose description of a job in the job log before attempting to execute it.
OSH_ECHO
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. If set, it causes InfoSphere DataStage to echo its job specification to the job log after the shell has expanded all arguments.
OSH_EXPLAIN
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. If set, it causes InfoSphere DataStage to place a terse description of the job in the job log before attempting to run it.
OSH_PRINT_SCHEMAS
This environment variable is available in the InfoSphere DataStage Administrator under the Parallel > Reporting branch. If set, it causes InfoSphere DataStage to print the record schema of all data sets and the interface schema of all operators in the job log.
SAS Support
These environment variables are concerned with InfoSphere DataStage interaction with SAS.
APT_HASH_TO_SASHASH
The InfoSphere DataStage hash partitioner contains support for hashing SAS data. In addition, InfoSphere DataStage provides the sashash partitioner which uses an alternative non-standard hashing algorithm. Setting the APT_HASH_TO_SASHASH environment variable causes all appropriate instances of hash to be replaced by sashash. If the APT_NO_SAS_TRANSFORMS environment variable is set, APT_HASH_TO_SASHASH has no effect.
APT_NO_SASOUT_INSERT
This variable selectively disables the sasout operator insertions. It maintains the other SAS-specific transformations.
APT_NO_SAS_TRANSFORMS
InfoSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting an sasout operator and substituting sasRoundRobin for RoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents InfoSphere DataStage from making these transformations.
APT_SAS_ACCEPT_ERROR
When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for InfoSphere DataStage to terminate the operator with an error.
APT_SAS_CHARSET
When the -sas_cs option of a SAS-interface operator is not set and a SAS-interface operator encounters a ustring, InfoSphere DataStage interrogates this variable to determine what character set to use. If this variable is not set, but APT_SAS_CHARSET_ABORT is set, the operator will abort; otherwise the -impexp_charset option or the APT_IMPEXP_CHARSET environment variable is accessed. Its syntax is:
APT_SAS_CHARSET icu_character_set | SAS_DBCSLANG
APT_SAS_CHARSET_ABORT
Causes a SAS-interface operator to abort if InfoSphere DataStage encounters a ustring in the schema and neither the -sas_cs option nor the APT_SAS_CHARSET environment variable is set.
APT_SAS_COMMAND
Overrides the $PATH directory for SAS with an absolute path to the basic SAS executable. An example path is:
/usr/local/sas/sas8.2/sas
APT_SASINT_COMMAND
Overrides the $PATH directory for SAS with an absolute path to the International SAS executable. An example path is:
/usr/local/sas/sas8.2int/dbcs/sas
APT_SAS_DEBUG
Set this to set debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which might then be copied into the InfoSphere DataStage log. Use APT_SAS_DEBUG=1, APT_SAS_DEBUG_IO=1, and APT_SAS_DEBUG_VERBOSE=1 to get all debug messages.
APT_SAS_DEBUG_IO
Set this to set input/output debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which might then be copied into the InfoSphere DataStage log.
APT_SAS_DEBUG_LEVEL
Its syntax is:
APT_SAS_DEBUG_LEVEL=[0-3]
Specifies the level of debugging messages to output from the SAS driver. The values of 1, 2, and 3 duplicate the output for the -debug option of the SAS operator:
no, yes, and verbose.
APT_SAS_DEBUG_VERBOSE
Set this to set verbose debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which might then be copied into the InfoSphere DataStage log.
APT_SAS_NO_PSDS_USTRING
Set this to prevent InfoSphere DataStage from automatically converting SAS char types to ustrings in an SAS parallel data set.
APT_SAS_S_ARGUMENT
By default, InfoSphere DataStage executes SAS with -s 0. When this variable is set, its value is used instead of 0. Consult the SAS documentation for details.
APT_SAS_SCHEMASOURCE_DUMP
When using SAS Schema Source, causes the command line to be written to the log when executing SAS. You use it to inspect the data contained in a -schemaSource. Set this if you are getting an error when specifying the SAS data set containing the schema source.
APT_SAS_SHOW_INFO
Displays the standard SAS output from an import or export transaction. The SAS output is normally deleted since a transaction is usually successful.
APT_SAS_TRUNCATION
Its syntax is:
APT_SAS_TRUNCATION ABORT | NULL | TRUNCATE
Because a ustring of n characters does not fit into n characters of a SAS char value, the ustring value must be truncated beyond the space pad characters and \0. The sasin and sas operators use this variable to determine how to truncate a ustring value to fit into a SAS char field. TRUNCATE, which is the default, causes the ustring to be truncated; ABORT causes the operator to abort; and NULL exports a null field. For NULL and TRUNCATE, the first five occurrences for each column cause an information message to be issued to the log.
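For example, to export a null field instead of truncating when a ustring value does not fit into the SAS char field:
export APT_SAS_TRUNCATION=NULL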
Sorting
The following environment variables are concerned with how InfoSphere DataStage automatically sorts data.
APT_NO_SORT_INSERTION
InfoSphere DataStage automatically inserts sort components in your job to optimize the performance of the operators in your data flow. Set this variable to prevent this automatic insertion.
APT_SORT_INSERTION_CHECK_ONLY
When sorts are inserted automatically by InfoSphere DataStage, if this is set, the inserted sorts just check that the order is correct; they do not actually sort. This is a better alternative to shutting off the insertion of partition and sort components by using APT_NO_PART_INSERTION and APT_NO_SORT_INSERTION.
Sybase support
These environment variables are concerned with setting up access to Sybase databases from InfoSphere DataStage.
APT_SYBASE_NULL_AS_EMPTY
Set APT_SYBASE_NULL_AS_EMPTY to 1 to extract null values as empty, and to load null values as " " when reading or writing an IQ database.
APT_SYBASE_PRESERVE_BLANKS
Set APT_SYBASE_PRESERVE_BLANKS to preserve trailing blanks while writing to an IQ database.
Teradata Support
The following environment variables are concerned with InfoSphere DataStage interaction with Teradata databases.
APT_TERA_64K_BUFFERS
InfoSphere DataStage assumes that the terawrite operator writes to buffers whose maximum size is 32 KB. Enable the use of 64 KB buffers by setting this variable. The default is that it is not set.
APT_TERA_NO_ERR_CLEANUP
Setting this variable prevents removal of error tables and the partially written target table of a terawrite operation that has not successfully completed. Set this variable for diagnostic purposes only. In some cases, setting this variable forces completion of an unsuccessful write operation.
APT_TERA_NO_SQL_CONVERSION
Set this to prevent the SQL statements you are generating from being converted to the character set specified for your stage (character sets can be specified at project, job, or stage level). The SQL statements are converted to LATIN1 instead.
APT_TERA_NO_PERM_CHECKS
Set this to bypass permission checking on the several system tables that need to be readable for the load process. This can speed up the start time of the load process slightly.
APT_TERA_SYNC_DATABASE
Specifies the database used for the terasync table. By default, the database used for the terasync table is specified as part of APT_TERA_SYNC_USER. If you want the database to be different, set this variable. You must then give APT_TERA_SYNC_USER read and write permission for this database.
APT_TERA_SYNC_PASSWORD
Specifies the password for the user identified by APT_TERA_SYNC_USER.
APT_TERA_SYNC_USER
Specifies the user that creates and writes to the terasync table.
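For example, a typical setup might look like the following; the user, password, and database names are hypothetical:
export APT_TERA_SYNC_USER=syncuser       # hypothetical user
export APT_TERA_SYNC_PASSWORD=syncpw     # hypothetical password
export APT_TERA_SYNC_DATABASE=sync_db    # hypothetical database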
Transport Blocks
The following environment variables are all concerned with the block size used for the internal transfer of data as jobs run. The following variables are used only for fixed-length records:
v APT_MIN_TRANSPORT_BLOCK_SIZE
v APT_MAX_TRANSPORT_BLOCK_SIZE
v APT_DEFAULT_TRANSPORT_BLOCK_SIZE
v APT_LATENCY_COEFFICIENT
APT_LATENCY_COEFFICIENT
Specifies the number of writes to a block which transfers data between players. This variable allows you to control the latency of data flow through a step. The default value is 5. Specify a value of 0 to have a record transported immediately. This is only used for fixed length records. Note: Many operators have a built-in latency and are not affected by this variable.
APT_DEFAULT_TRANSPORT_BLOCK_SIZE
Set this environment variable to specify the default block size for transferring data between players. The APT_DEFAULT_TRANSPORT_BLOCK_SIZE environment variable is provided as part of the support for processing records longer than 128 KB. Set APT_DEFAULT_TRANSPORT_BLOCK_SIZE to a value between 8192 and 1048576. If necessary, the value is rounded to the nearest page size for the operating system.
APT_MAX_TRANSPORT_BLOCK_SIZE/ APT_MIN_TRANSPORT_BLOCK_SIZE
Specify the minimum and maximum allowable block size for transferring data between players. APT_MIN_TRANSPORT_BLOCK_SIZE cannot be less than 8192 which is its default value. APT_MAX_TRANSPORT_BLOCK_SIZE cannot be greater than 1048576 which is its default value. These variables are only meaningful when used in combination with APT_LATENCY_COEFFICIENT.
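For example, a job processing long fixed-length records might raise the default block size while keeping it within the allowed range (illustrative values only):
export APT_DEFAULT_TRANSPORT_BLOCK_SIZE=262144   # illustrative value, in bytes
export APT_MIN_TRANSPORT_BLOCK_SIZE=32768        # illustrative value, in bytes
export APT_MAX_TRANSPORT_BLOCK_SIZE=1048576      # documented maximum
export APT_LATENCY_COEFFICIENT=2                 # illustrative value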
Performance tuning
v APT_BUFFER_MAXIMUM_MEMORY
v APT_BUFFER_FREE_RUN
v TMPDIR. This defaults to /tmp. It is used for miscellaneous internal temporary data, including FIFO queues and Transformer temporary storage. As a minor optimization, it can be better to ensure that it is set to a file system separate from the InfoSphere DataStage install directory; see the example after this list.
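For example, to point temporary storage at a dedicated scratch file system (the path shown is hypothetical):
export TMPDIR=/scratch/ds_tmp   # hypothetical scratch file system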
Chapter 7. Operators
The parallel job stages are built on operators. These topics describe those operators and are intended for knowledgeable Orchestrate users. The first section describes how InfoSphere DataStage stages map to operators. Subsequent sections are an alphabetical listing and description of operators. Some operators are part of a library of related operators, and each of these has its own topic, as follows:
v The Import/Export Library
v The Partitioning Library
v The Collection Library
v The Restructure Library
v The Sorting Library
v The Join Library
v The ODBC Interface Library
v The SAS Interface Library
v The Oracle Interface Library
v The DB2 Interface Library
v The Informix Interface Library
v The Sybase Interface Library
v The SQL Server Interface Library
v The iWay Interface Library
In these descriptions, the term InfoSphere DataStage refers to the parallel engine that InfoSphere DataStage uses to execute the operators.
Table 7. Stage to Operator Mapping (continued) InfoSphere DataStage Stage External Source Operator Import Operator Options (where applicable) Comment -source -sourcelist External Target Export Operator -destination -destinationlist Complex Flat File Transformer BASIC Transformer Import Operator Transform Operator Represents server job transformer stage (gives access to BASIC transforms) Group Operator fullouterjoin Operator innerjoin Operator leftouterjoin Operator rightouterjoin Operator Merge Lookup Merge Operator Lookup Operator The oralookup Operator The db2lookup Operator The sybaselookup Operator for direct lookup in Oracle table (`sparse' mode) for direct lookup in DB2 table (`sparse' mode) for direct lookup in table accessed via iWay (`sparse' mode) for direct lookup in Sybase table (`sparse' mode)
Aggregator Join
The sybaselookup Operator The sqlsrvrlookup Operator Funnel Funnel Operators Sortfunnel Operator Sequence Operator Sort The psort Operator The tsort Operator Remove Duplicates Compress Expand Copy Modify Filter External Filter Change Capture Change Apply Difference Remdup Operator Pcompress Operator Pcompress Operator Generator Operator Modify Operator Filter Operator Changecapture Operator Changeapply Operator Diff Operator -compress -expand
Table 7. Stage to Operator Mapping (continued) InfoSphere DataStage Stage Compare Encode Decode Switch Generic Surrogate Key Column Import Column Export Make Subrecord Split Subrecord Combine records Promote subrecord Make vector Split vector Head Tail Sample Peek Row Generator Column generator Write Range Map SAS Parallel Data Set SAS Operator Compare Operator Encode Operator Encode Operator Switch Operator Surrogate key operator The field_import Operator The field_export Operator The makesubrec Operator The splitsubrec Operator The aggtorec Operator The makesubrec Operator The makevect Operator The splitvect Operator Head Operator Tail Operator Sample Operator Peek Operator Generator Operator Generator Operator Writerangemap Operator The sas Operator The sasout Operator The sasin Operator The sascontents Operator DB2/UDB Enterprise The db2read Operator The db2write and db2load Operators The db2upsert Operator The db2lookup Operator Oracle Enterprise The oraread Operator The oraupsert Operator The orawrite Operator The oralookup Operator Informix Enterprise The hplread operator The hplwrite Operator For in-memory (`normal') lookups For in-memory (`normal') lookups ( Represents Orchestrate parallel SAS data set. Any operator -encode -decode Options (where applicable) Comment
Table 7. Stage to Operator Mapping (continued) InfoSphere DataStage Stage Operator The infxread Operator The infxwrite Operator The xpsread Operator The xpswrite Operator Teradata Teraread Operator Terawrite Operator Sybase The sybasereade Operator The sybasewrite Operator The sybaseupsert Operator The sybaselookup Operator SQL Server The sqlsrvrread Operator The sqlsrvrwrite Operator The sqlsrvrupsert Operator The sybaselookup Operator iWay The iwayread Operator The iwaylookup Operator For in-memory (`normal') lookups For in-memory (`normal') lookups Options (where applicable) Comment
-create: This option specifies that the operator runs in create mode. The keys in the input data set are added to a bloom filter and are written to memory after the last record in the data set. This option can be used to create bloom filters from old static data that will eventually be used in future jobs that use the bloom filter in -process mode. Required: Yes
-data_date: This option specifies the date string in the yyyy-mm-dd format that the incoming data set is associated with. This number is appended to the filename of the associated bloom filter that is used for dropping older filters. If you do not specify this option in create mode, the -previous_days option cannot be used in process mode. Required: No
-filter: This option specifies the path and name of the fileset that is used to store the bloom filter information. Required: Yes
-filter_size: This option specifies the number of unique entries that you expect to insert into the bloom filter. Overestimate the total number of entries when you specify the value for this option. Required: Yes
-filter_phases: This option specifies the number of hash indexes that each key group will produce. A higher number of phases lowers the false positive percentage, but raises the memory requirements. The phase count that you use must match the phase count that is used to create static filters. Required: No
-key: This option specifies a key field to be used for the lookup with either the -create or -process option. At least one -key is required. Required: Yes
You can use the following syntax to add keys from the input data set and store them in a record after the last record in a bloom filter that is stored in memory. This option creates a new bloom filter if none exists in memory.
bloomfilter -create -data_date date -filter fileset -filter_size count -filter_phases count -key field
-process: This option specifies that the operator will run in process mode. The keys in the input data set are looked up against the bloom filters that are loaded in memory. Required: Yes
-previous_days: This option specifies the number of days of old bloom filters to use for the lookup. If not specified, all the existing filters will be used. Required: No
-drop_old: This option specifies that bloom filters older than the -previous_days count will be removed from the fileset. Required: No
-ref_date: This option is the reference date for the -previous_days option. Specify this value in yyyy-mm-dd format. Required: No
-filter: This option specifies the path and name of the fileset that is used to store the bloom filter information. Required: Yes
-filter_size: This option specifies the number of unique entries that you expect to insert into the bloom filter. Overestimate the total number of entries when you specify the value for this option. Required: Yes
-filter_phases: This option specifies the number of hash indexes that each key group will produce. A higher number of phases lowers the false positive percentage, but raises the memory requirements. The phase count that you use must match the phase count that is used to create static filters. Required: No
-flag_column: This option adds a duplicate column to the output schema. Required: No
-key: This option specifies a key field to be used for the lookup with either the -create or -process option. At least one -key is required. Required: Yes
-truncate: This option truncates the fileset. Required: No
You can use the following syntax to process your input data set against bloom filter records that are stored in memory.
bloomfilter -process -previous_days count -ref_date date -drop_old -filter fileset -filter_size count -filter_phases count -key field -truncate
Changeapply operator
The changeapply operator takes the change data set output from the changecapture operator and applies the encoded change operations to a before data set to compute an after data set. If the before data set is identical to the before data set that was input to the changecapture operator, then the output after data set for changeapply is identical to the after data set that was input to the changecapture operator. That is:
change := changecapture(before, after)
after := changeapply(before, change)
You use the companion operator changecapture to provide a data set that contains the changes in the before and after data sets.
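As an illustrative sketch of this round trip in osh (the key field acct_id, the value field balance, and the data set names are all hypothetical):
# field and data set names below are hypothetical
$ osh "changecapture -key acct_id -value balance < before.ds < after.ds > change.ds"
$ osh "changeapply -key acct_id -value balance < before.ds < change.ds > after_copy.ds"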
changeapply: properties
Table 10. changeapply Operator Properties
Number of input data sets: 2
Number of output data sets: 1
Input interface schema: beforeRec:*, changeRec:*
Output interface schema: afterRec:*
Transfer behavior: changeRec:*->afterRec:*, dropping the change_code field; beforeRec:*->afterRec:* with type conversions
Input partitioning style: keys in same partition
Output partitioning style: keys in same partition
Preserve-partitioning flag in output data set: propagated
Composite operator: no
The before input to changeapply must have the same fields as the before input that was input to changecapture, and an automatic conversion must exist between the types of corresponding fields. In addition, results are only guaranteed if the contents of the before input to changeapply are identical (in value and record order in each partition) to the before input that was fed to changecapture, and if the keys are unique.
The change input to changeapply must have been output from changecapture without modification. Because preserve-partitioning is set on the change output of changecapture (under normal circumstances you should not override this), the changeapply operator has the same number of partitions as the changecapture operator. Additionally, both inputs of changeapply are designated as same partitioning by the operator logic.
The changeapply operator performs these actions for each change record:
v If the before keys come before the change keys in the specified sort order, the before record is consumed and transferred to the output; no change record is consumed. This is a copy.
v If the before keys are equal to the change keys, the behavior depends on the code in the change_code field of the change record:
Insert: The change record is consumed and transferred to the output; no before record is consumed. If key fields are not unique, and there is more than one consecutive insert with the same key, then changeapply applies all the consecutive inserts before existing records. This record order might be different from the after data set given to changecapture.
Delete: The value fields of the before and change records are compared. If they are not the same, the before record is consumed and transferred to the output; no change record is consumed (copy). If the value fields are the same or if ignoreDeleteValues is specified, the change and before records are both consumed; no record is transferred to the output. If key fields are not unique, the value fields ensure that the correct record is deleted. If more than one record with the same keys have matching value fields, the first encountered is deleted. This might cause different record ordering than in the after data set given to the changecapture operator.
Edit: The change record is consumed and transferred to the output; the before record is just consumed. If key fields are not unique, then the first before record encountered with matching keys is edited. This might be a different record from the one that was edited in the after data set given to the changecapture operator, unless the -keepCopy option was used.
Copy: The change record is consumed. The before record is consumed and transferred to the output.
v If the before keys come after the change keys, behavior also depends on the change_code field:
Insert: The change record is consumed and transferred to the output; no before record is consumed (the same as when the keys are equal).
Delete: A warning is issued and the change record is consumed; no record is transferred to the output; no before record is consumed.
Edit or Copy: A warning is issued and the change record is consumed and transferred to the output; no before record is consumed. This is an insert.
v If the before input of changeapply is identical to the before input of changecapture and either keys are unique or copy records are used, then the output of changeapply is identical to the after input of changecapture. However, if the before input of changeapply is not the same (different record contents or ordering), or keys are not unique and copy records are not used, this fact is not detected and the rules described above are applied anyway, producing a result that might or might not be useful.
Schemas
The changeapply output data set has the same schema as the change data set, with the change_code field removed. The before interface schema is:
record (key:type; ... value:type; ... beforeRec:*;)
Transfer behavior
The change to after transfer uses an internal transfer adapter to drop the change_code field from the transfer. This transfer is declared first, so the schema of the change data set determines the schema of the after data set.
Note: The -checkSort option has been deprecated. By default, partitioner and sort components are now inserted automatically.
Table 11. Changeapply options Option -key Use -key input_field_name [-cs | ci] [-asc | -desc] [-nulls first | last] [-param params ] [-key input_field_name [-cs | ci] [-asc | -desc] [-nulls first | last] [-param params ] ...] Specify one or more key fields. You must specify at least one key for this option or specify the -allkeys option. These options are mutually exclusive.You cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci suboption specifies that the comparison of value keys is case insensitive. The -cs suboption specifies a case-sensitive comparison, which is the default. -asc and -desc specify ascending or descending sort order. -nulls first | last specifies the position of nulls. The -params suboption allows you to specify extra parameters for a key. Specify parameters using pr operty = value pairs separated by commas. -allkeys -allkeys [-cs | ci] [-asc | -desc] [-nulls first | last] [-param params] Specify that all fields not explicitly declared are key fields. The suboptions are the same as the suboptions described for the -key option above. You must specify either the -allkeys option or the -key option. They are mutually exclusive. -allvalues -allvalues [-cs | ci] [-param params] Specify that all fields not otherwise explicitly declared are value fields. The -ci suboption specifies that the comparison of value keys is case insensitive. The -cs suboption specifies a case-sensitive comparison, which is the default. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -allvalues option is mutually exclusive with the -value and -allkeys options. -codeField -codeField field_name The name of the change code field. The default is change_code. This should match the field name used in changecapture.
Table 11. Changeapply options (continued) Option -collation_sequence Use -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale. v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname. v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/ Collate_Intro.html -copyCode -copyCode n Specifies the value for the change_code field in the change record for the copy result. The n value is an int8. The default value is 0. A copy record means that the before record should be copied to the output without modification. -deleteCode -deleteCode n Specifies the value for the change_code field in the change record for the delete result. The n value is an int8. The default value is 2. A delete record means that a record in the before data set must be deleted to produce the after data set. -doStats -doStats Configures the operator to display result information containing the number of input records and the number of copy, delete, edit, and insert records. -dropkey -dropkey input_field_name Optionally specify that the field is not a key field. If you specify this option, you must also specify the -allkeys option. There can be any number of occurrences of this option. -dropvalue -dropvalue input_field_name Optionally specify that the field is not a value field. If you specify this option, you must also specify the -allvalues option. There can be any number of occurrences of this option.
Table 11. Changeapply options (continued) Option -editCode Use -editCode n Specifies the value for the change_code field in the change record for the edit result. The n value is an int8. The default value is 3. An edit record means that the value fields in the before data set must be edited to produce the after data set. -ignoreDeleteValues -ignoreDeleteValues Do not check value fields on deletes. Normally, changeapply compares the value fields of delete change records to those in the before record to ensure that it is deleting the correct record. The -ignoreDeleteValues option turns off this behavior. -insertCode -insertCode n Specifies the value for the change_code field in the output record for the insert result. The n value is an int8. The default value is 1. An insert means that a record must be inserted into the before data set to reproduce the after data set. -value -value field [-ci| -cs] [param params] Optionally specifies the name of a value field. The -value option might be repeated if there are multiple value fields. The value fields are modified by edit records, and can be used to ensure that the correct record is deleted when keys are not unique. Note that you cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci suboption specifies that the comparison of values is case insensitive. The -cs suboption specifies a case-sensitive comparison, which is the default. The -params suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -value and -allvalues options are mutually exclusive.
[Figure: example data flow in which the before and after data sets are each hash partitioned and sorted (tsort), with a copy operator in the flow]
Changecapture operator
The changecapture operator takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The operator produces a change data set, whose schema is transferred from the schema of the after data set with the addition of one field: a change code with values encoding the four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set. You can use the companion operator changeapply to combine the changes from the changecapture operator with the original before data set to reproduce the after data set. The changecapture operator is very similar to the diff operator described in "Diff Operator" .
[Figure: changecapture operator data flow; the output interface schema is change_code:int8; changeRec:*;]
Transfer behavior
In the insert and edit cases, the after input is transferred to output. In the delete case, an internal transfer adapter transfers the before keys and values to output. In the copy case, the after input is optionally transferred to output. Because an internal transfer adapter is used, no user transfer or view adapter can be used with the changecapture operator.
Determining differences
The changecapture output data set has the same schema as the after data set, with the addition of a change_code field. The contents of the output depend on whether the after record represents an insert, delete, edit, or copy to the before data set:
v Insert: a record exists in the after data set but not the before data set as indicated by the sorted key fields. The after record is consumed and transferred to the output. No before record is consumed. If key fields are not unique, changecapture might fail to identify an inserted record with the same key fields as an existing record. Such an insert might be represented as a series of edits, followed by an insert of an existing record. This has consequences for changeapply.
v Delete: a record exists in the before data set but not the after data set as indicated by the sorted key fields. The before record is consumed and the key and value fields are transferred to the output; no after record is consumed. If key fields are not unique, changecapture might fail to identify a deleted record if another record with the same keys exists. Such a delete might be represented as a series of edits, followed by a delete of a different record. This has consequences for changeapply.
v Edit: a record exists in both the before and after data sets as indicated by the sorted key fields, but the before and after records differ in one or more value fields. The before record is consumed and discarded; the after record is consumed and transferred to the output. If key fields are not unique, or sort order within a key is not maintained between the before and after data sets, spurious edit records might be generated for those records whose sort order has changed. This has consequences for changeapply.
v Copy: a record exists in both the before and after data sets as indicated by the sorted key fields, and furthermore the before and after records are identical in value fields as well. The before record is consumed and discarded; the after record is consumed and optionally transferred to the output. If no after record is transferred, no output is generated for the record; this is the default.
The operator produces a change data set, whose schema is transferred from the schema of the after data set, with the addition of one field: a change code with values encoding insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.
Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes. You must specify either one or more -key fields or the -allkeys option. You can parameterize each key field's comparison operation and specify the expected sort order (the default is ascending). Note: The -checkSort option has been deprecated. By default, partitioner and sort components are now inserted automatically.
Table 12. Changecapture options Option -key Use -key input_field_name [-cs | ci] [-asc | -desc] [-nulls first | last] [-param params] [-key input_field_name [-cs | ci] [-asc | -desc] [-nulls first | last] [-param params] ...] Specify one or more key fields. You must specify either the -allkeys option or at least one key for the -key option. These options are mutually exclusive.You cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. -asc and -desc specify ascending or descending sort order. -nulls first | last specifies the position of nulls. The -param suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. -allkeys -allkeys [-cs | ci] [-asc | -desc] [-nulls first | last] [-param params] Specify that all fields not explicitly declared are key fields. The suboptions are the same as the suboptions described for the -key option above. You must specify either the -allkeys option or the -key option. They are mutually exclusive. -allvalues -allvalues [-cs | ci] [-param params] Specify that all fields not otherwise explicitly declared are value fields. The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. The -param option allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -allvalues option is mutually exclusive with the -value and -allkeys options. You must specify the -allvalues option when you supply the -dropkey option. -codeField -codeField field_name Optionally specify the name of the change code field. The default is change_code.
Table 12. Changecapture options (continued) Option -collation_sequence Use -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/ Collate_Intro.html -copyCode -copyCode n Optionally specify the value of the change_code field in the output record for the copy result. The n value is an int8. The default value is 0. A copy result means that all keys and all values in the before data set are equal to those in the after data set. -deleteCode -deleteCode n Optionally specify the value for the change_code field in the output record for the delete result. The n value is an int8. The default value is 2. A delete result means that a record exists in the before data set but not in the after data set as defined by the key fields. -doStats -doStats Optionally configure the operator to display result information containing the number of input records and the number of copy, delete, edit, and insert records. -dropkey -dropkey input_field_name Optionally specify that the field is not a key field. If you specify this option, you must also specify the -allkeys option. There can be any number of occurrences of this option. -dropvalue -dropvalue input_field_name Optionally specify that the field is not a value field. If you specify this option, you must also specify the -allvalues option. There can be any number of occurrences of this option.
Table 12. Changecapture options (continued) Option -editCode Use -editCode n Optionally specify the value for the change_code field in the output record for the edit result. The n value is an int8. The default value is 3. An edit result means all key fields are equal but one or more value fields are different. -insertCode -insertCode n Optionally specify the value for the change_code field in the output record for the insert result. The n value is an int8. The default value is 1. An insert result means that a record exists in the after data set but not in the before data set as defined by the key fields. -keepCopy | -dropCopy -keepDelete | -dropDelete -keepEdit | -dropEdit -keepInsert | -dropInsert -keepCopy | -dropCopy -keepDelete | -dropDelete -keepEdit | -dropEdit -keepInsert | -dropInsert Optionally specifies whether to keep or drop copy records at output. By default, the operator creates an output record for all differences except copy. -value -value field_name [-ci | -cs] [-param params] Optionally specifies one or more value fields. When a before and after record are determined to be copies based on the difference keys (as defined by -key), the value keys can then be used to determine if the after record is an edited version of the before record. Note that you cannot use a vector, subrecord, or tagged aggregate field as a value key. The -ci option specifies that the comparison of values is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default. The -param option allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas. The -value and -allvalues options are mutually exclusive.
[Figure: example step in which the before and after data sets (with key and value fields) are input to the changecapture operator, whose output is input to a switch operator invoked with -key change_code]
You specify these key and value fields to the changecapture operator:
-key month -key customer -value balance
After you run the changecapture operator, you invoke the switch operator to divide the output records into data sets based on the result type. The switch operator in this example creates three output data sets: one for delete results, one for edit results, and one for insert results. It creates only three data sets,
because you have explicitly dropped copy results from the changecapture operator by specifying -dropCopy. By creating a separate data set for each of the three remaining result types, you can handle each one differently:
-deleteCode 0 -editCode 1 -insertCode 2
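Putting the pieces described above together, the job might be written in osh roughly as follows; the data set names are hypothetical:
# data set names below are hypothetical
$ osh "changecapture -key month -key customer -value balance -dropCopy -deleteCode 0 -editCode 1 -insertCode 2 < before.ds < after.ds | switch -key change_code > deletes.ds > edits.ds > inserts.ds"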
Checksum operator
You can use the checksum operator to add a checksum field to your data records. You can use the same operator later in the flow to validate the data.
[Figure: checksum operator data flow; input interface schema inRec:*; output interface schema outRec:*; checksum:string; crcbuffer:string;]
Properties
Number of input data sets: 1
Number of output data sets: 1
Input interface schema: inRec:*
Output interface schema: outRec:*; checksum:string; crcbuffer:string (note that the checksum field name can be changed, and the crcbuffer field is optional)
Transfer behavior: inRec -> outRec without record modification
Execution mode: parallel (default) or sequential
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
Composite operator: no
Combinable operator: yes
The checksum operator:
v Takes any single data set as input
v Has an input interface schema consisting of a single schema variable inRec and an output interface schema consisting of a single schema variable outRec
v Copies the input data set to the output data set, and adds one or possibly two fields.
Checksum: example
In this example you use checksum to add a checksum field named check that is calculated from the fields week_total, month_total, and quarter_total. The osh command is:
$ osh "checksum -checksum_name check -keepcol week_total -keepcol month_total -keepcol quarter_total < in.ds > out0.ds
Compare operator
The compare operator performs a field-by-field comparison of records in two presorted input data sets. This operator compares the values of top-level non-vector data types such as strings. All appropriate comparison parameters are supported, for example, case sensitivity and insensitivity for string comparisons. The compare operator does not change the schema, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the operator. The comparison results are also recorded in the output data set. By default, InfoSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the compare operator and other operators.
compare: properties
Table 14. Compare properties
Number of input data sets: 2
Number of output data sets: 1
Input interface schema: key0:type0; ... keyN:typeN; inRec:*;
Output interface schema: result:int8; first:subrec(rec:*;); second:subrec(rec:*;);
Transfer behavior: The first input data set is transferred to first.rec; the second input data set is transferred to second.rec
Execution mode: parallel (default) or sequential
Input partitioning style: keys in same partition
Partitioning method: same (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
Composite operator: no
The compare operator:
v Compares only scalar data types. See "Restrictions".
v Takes two presorted data sets as input and outputs one data set.
v Has an input interface schema consisting of the key fields and the schema variable inRec, and an output interface schema consisting of the result field of the comparison and a subrecord field containing each input record.
v Performs a field-by-field comparison of the records of the input data sets.
v Transfers the two input data sets to the single output data set without altering the input schemas, partitioning, or values.
v Writes to the output data set signed integers that indicate comparison results.
Restrictions
The compare operator:
v Compares only scalar data types, specifically string, integer, float, decimal, raw, date, time, and timestamp; you cannot use the operator to compare data types such as tagged aggregate, subrec, vector, and so on.
v Compares only fields explicitly specified as key fields, except when you do not explicitly specify any key field. In that case, the operator compares all fields that occur in both records.
Results field
The operator writes the following default comparison results to the output data set. In each case, you can specify an alternate value:
Description of comparison results and default values:
The record in the first input data set is greater than the corresponding record in the second input data set: 1
The record in the first input data set is equal to the corresponding record in the second input data set: 0
The record in the first input data set is less than the corresponding record in the second input data set: -1
The number of records in the first input data set is greater than the number of records in the second input data set: 2
The number of records in the first input data set is less than the number of records in the second input data set: -2
When this operator encounters any of the mismatches described above, you can force it to take one or both of the following actions:
v Terminate the remainder of the current comparison
v Output a warning message to the screen
Table 15. Compare options (continued) Option -gt | -eq | -lt Use -gt n | -eq n | -lt n Configures the operator to write n (a signed integer between -128 and 127) to the output data set if the record in the first input data set is: Greater than (-gt) the equivalent record in the second input data set. The default is 1. Equal to (-eq) the equivalent record in the second input data set. The default is 0. Less than (-lt) the equivalent record in the second input data set. The default is -1. -second -second n Configures the operator to write n (an integer between -128 and 127) to the output data set if the number of records in the first input data set exceeds the number of records in the second input data set. The default value is 2. -warnRecordCountMismatch -warnRecordCountMismatch Forces the operator to output a warning message when a comparison is aborted due to a mismatch in the number of records in the two input data sets.
The output record format for a successful comparison of records looks like this, assuming all default values are used:
result:0 first:name; second:age; third:gender;
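As a minimal sketch using the options in Table 15 (the field names and result codes here are illustrative, not taken from the original example), you might override the default result codes and request a warning on a record-count mismatch:

$ osh "compare -field gender -field age -eq 0 -lt -10 -gt 10 -warnRecordCountMismatch < inDS0.ds < inDS1.ds > outDS.ds"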
To force the operator to execute sequentially specify the [-seq] framework argument. When executed sequentially, the operator uses a collection method of any. A sequential operator using this collection method can have its collection method overridden by an input data set to the operator. Suppose you want to run the same job as shown in "Example 1: Running the compare Operator in Parallel" but you want the compare operator to run sequentially. Issue this osh command:
$ osh "compare -field gender -field age [-seq] < inDS0.ds < inDS1.ds > outDS.ds"
Copy operator
You can use the modify operator with the copy operator to modify the data set as the operator performs the copy operation. See "Modify Operator" for more information on modifying data.
Figure: data flow for the copy operator. The input data set (schema inRec:*;) is copied to one or more output data sets (schema outRec:*;).
Copy: properties
Table 16. Copy properties

Number of input data sets: 1
Number of output data sets: 0 or more (0 - n) set by user
Input interface schema: inRec:*
Output interface schema: outRec:*
Transfer behavior: inRec -> outRec without record modification
Execution mode: parallel (default) or sequential
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
Composite operator: no
Combinable operator: yes
v Takes any single data set as input
v Has an input interface schema consisting of a single schema variable inRec and an output interface schema consisting of a single schema variable outRec
v Copies the input data set to the output data sets without affecting the record schema or contents
Figure: the data flow for this example. A step contains an import operator whose output feeds a copy operator, which writes the persistent data set outDS.ds.
Here is the osh command:
$ osh "import -file inFile.dat -schema recordSchema | copy > outDS.ds"
This occurs:
1. The import operator reads the data file, inFile.dat, into a virtual data set. The virtual data set is written to a single partition because it reads a single data file. In addition, the import operator executes only on the processing node containing the file.
2. The copy operator runs on all processing nodes in the default node pool, because no constraints have been applied to the input operator. Thus, it writes one partition of outDS.ds to each processing node in the default node pool.
However, if InfoSphere DataStage removes the copy operator as part of optimization, the resultant persistent data set, outDS.ds, would be stored only on the processing node executing the import operator. In this example, outDS.ds would be stored as a single partition data set on one node. To prevent removal, specify the -force option. The copy operator then explicitly performs the repartitioning operation to spread the data over the system.
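A hedged sketch of the same import-and-copy job with optimization-time removal of the copy operator disabled (the file and schema names are carried over from the example above):

$ osh "import -file inFile.dat -schema recordSchema | copy -force > outDS.ds"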
Figure: a step in which a copy operator writes two outputs; one feeds a persistent data set and the other feeds a tsort operator.
Output data set 0 from the copy operator is written to outDS1.ds and output data set 1 is written to the tsort operator. The syntax is as follows:
$ osh "... | copy > outDS1.ds | tsort options ...
Diff operator
Note: The diff operator has been superseded by the changecapture operator. While the diff operator has been retained for backwards compatibility, you should use the changecapture operator for new development.
The diff operator performs a record-by-record comparison of two versions of the same data set (the before and after data sets) and outputs a data set whose records represent the difference between them. The operator assumes that the input data sets are hash-partitioned and sorted in ascending order on the key fields you specify for the comparison.
The comparison is performed based on a set of difference key fields. Two records are copies of one another if they have the same value for all difference keys. In addition, you can specify a set of value key fields. If two records are copies based on the difference key fields, the value key fields determine if one record is a copy or an edited version of the other.
The diff operator is very similar to the changecapture operator described in "Changecapture Operator". In most cases, you should use the changecapture operator rather than the diff operator.
By default, InfoSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the diff operator and other operators. The diff operator does not behave like Unix diff.
diff: properties
Table 18. Diff properties

Number of input data sets: 2
Number of output data sets: 1
Input interface schema:
  before data set: key0; ... keyn; value0; ... valuen; beforeRec:*;
  after data set: key0; ... keyn; value0; ... valuen; afterRec:*;
Output interface schema: diff:int8; beforeRec:*; afterRec:*;
Transfer behavior:
  before to output: beforeRec -> beforeRec without record modification
  after to output: afterRec -> afterRec without record modification
Execution mode: parallel (default) or sequential
Input partitioning style: keys in same partition
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
Composite operator: no
Transfer behavior
The operator produces a single output data set, whose schema is the catenation of the before and after input schemas. Each record of the output data set has the following format:
diff:int8; followed by all fields of the before record, followed by all fields of the after record
The usual name conflict resolution rules apply. The output data set contains a number of records in the range:
num_in_before <= num_in_output <= (num_in_before + num_in_after)
The number of records in the output data set depends on how many records are copies, edits, and deletes. If the before and after data sets are exactly the same, the number of records in the output data set equals the number of records in the before data set. If the before and after data sets are completely different, the output data set contains one record for each before and one record for each after data set record.
Key fields
The before data set's schema determines the difference key type. You can use an upstream modify operator to alter it. The after data set's key field(s) must have the same name as the before key field(s) and be either of the same data type or of a compatible data type. The same rule holds true for the value fields: the after data set's value field(s) must be of the same name and data type as the before value field(s). You can use an upstream modify operator to bring this about. Only top-level, non-vector, non-nullable fields can be used as difference keys. Only top-level, non-vector fields can be used as value fields. Value fields can be nullable.
Determining differences
The diff operator reads the current record from the before data set, reads the current record from the after data set, and compares the records of the input data sets using the difference keys. The comparison results are classified as follows:
v Insert: A record exists in the after data set but not the before data set. The operator transfers the after record to the output. The operator does not copy the current before record to the output but retains it for the next iteration of the operator. The data type's default value is written to each before field in the output. By default, the operator writes a 0 to the diff field of the output record.
v Delete: A record exists in the before data set but not the after data set. The operator transfers the before record to the output. The operator does not copy the current after record to the output but retains it for the next iteration of the operator. The data type's default value is written to each after field in the output. By default, the operator writes a 1 to the diff field of the output record.
v Copy: The record exists in both the before and after data sets and the specified value field values have not been changed. The before and after records are both transferred to the output. By default, the operator writes a 2 to the diff (first) field of the output record.
v Edit: The record exists in both the before and after data sets; however, one or more of the specified value field values have been changed. The before and after records are both transferred to the output. By default, the operator writes a 3 to the diff (first) field of the output record.
Options are provided to drop each kind of output record and to change the numerical value written to the diff (first) field of the output record.
In addition to the difference key fields, you can optionally define one or more value key fields. If two records are determined to be copies because they have equal values for all the difference key fields, the operator then examines the value key fields.
v Records whose difference and value key fields are equal are considered copies of one another. By default, the operator writes a 2 to the diff (first) field of the output record.
v Records whose difference key fields are equal but whose value key fields are not equal are considered edited copies of one another. By default, the operator writes a 3 to the diff (first) field of the output record.
Table 19. Diff options (continued)

-allValues
-allValues [-ci | -cs] [-param params]
Specifies that all fields other than the difference key fields identified by -key are used as value key fields. The operator does not use vector, subrecord, and tagged aggregate fields as value keys and skips fields of these data types.
When a before and after record are determined to be copies based on the difference keys, the value keys can then be used to determine if the after record is an edited version of the before record.
The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default.
The -params suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas.

-collation_sequence
-collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
v Specify a predefined IBM ICU locale
v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

-copyCode
-copyCode n
Specifies the value for the diff field in the output record when the before and after records are copies. The n value is an int8. The default value is 2. A copy means all key fields and all optional value fields are equal.

-deleteCode
-deleteCode n
Specifies the value for the diff field in the output record for the delete result. The n value is an int8. The default value is 1. A delete result means that a record exists in the before data set but not in the after data set as defined by the difference key fields.

-dropCopy | -dropDelete | -dropEdit | -dropInsert
-dropCopy -dropDelete -dropEdit -dropInsert
Specifies to drop the output record, meaning not generate it, for any one of the four difference result types. By default, an output record is always created by the operator. You can specify any combination of these four options.

-editCode
-editCode n
Specifies the value for the diff field in the output record for the edit result. The n value is an int8. The default value is 3. An edit result means all difference key fields are equal but one or more value key fields are different.

-insertCode
-insertCode n
Specifies the value for the diff field in the output record for the insert result. The n value is an int8. The default value is 0. An insert result means that a record exists in the after data set but not in the before data set as defined by the difference key fields.

-stats
-stats
Configures the operator to display result information containing the number of input records and the number of copy, delete, edit, and insert records.

-tolerateUnsorted
-tolerateUnsorted
Specifies that the input data sets are not sorted. By default, the operator generates an error and aborts the step when detecting unsorted inputs. This option allows you to process groups of records that might be arranged by the difference key fields but not sorted. The operator consumes input records in the order in which they appear on its input. If you use this option, no automatic partitioner or sort insertions are made.

-value
-value field [-ci | -cs]
Optionally specifies the name of a value key field. The -value option can be repeated if there are multiple value fields.
When a before and after record are determined to be copies based on the difference keys (as defined by -key), the value keys can then be used to determine if the after record is an edited version of the before record. Note that you cannot use a vector, subrecord, or tagged aggregate field as a value key.
The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default.
The -params suboption allows you to specify extra parameters for a key. Specify parameters using property=value pairs separated by commas.
Figure: data flow for this example. The before and after data sets feed the diff operator within a step; the output of diff feeds the switch operator (-key diff).
You specify these key and value fields to the diff operator:
key=month key=customer value=balance
After you run the diff operator, you invoke the switch operator to divide the output records into data sets based on the result type. The switch operator in this example creates three output data sets: one for delete results, one for edit results, and one for insert results. It creates only three data sets, because you have explicitly dropped copy results from the diff operation by specifying -dropCopy. By creating a separate data set for each of the three remaining result types, you can handle each one differently:
deleteCode=0 editCode=1 insertCode=2
$ osh "diff -key month -key customer -value balance -dropCopy -deleteCode 0 -editCode 1 -insertCode 2 < before.ds < after.ds | switch -key diff > outDelete.ds > outEdit.ds > outInsert.ds"
Encode operator
The encode operator encodes or decodes an InfoSphere DataStage data set using a UNIX encoding command that you supply. The operator can convert an InfoSphere DataStage data set from a sequence of records into a stream of raw binary data. The operator can also reconvert the data stream to an InfoSphere DataStage data set.
Figure: the encode/decode cycle. An encode operator with mode = encode converts an input data set (schema in:*;) into an encoded data set; a second encode operator with mode = decode converts the encoded data back into a data set (schema out:*;), the decoded data set.
In the figure shown above, the mode argument specifies whether the operator is performing an encoding or decoding operation. Possible values for mode are: v encode: encode the input data set v decode: decode the input data set
encode: properties
Table 20. encode properties

Number of input data sets: 1
Number of output data sets: 1
Input interface schema: mode = encode: in:*; mode = decode: none
Output interface schema: mode = encode: none; mode = decode: out:*;
Transfer behavior: in -> out without record modification for an encode/decode cycle
Execution mode: parallel (default) or sequential
Partitioning method: mode = encode: any; mode = decode: same
Collection method: any
Preserve-partitioning flag in output data set: mode = encode: sets; mode = decode: propagates
Composite operator: no
Combinable operator: no
encode -command command_line [-direction encode | decode] | [-mode encode | decode] | [-encode | -decode]

Table 21. Encode options

-command
-command command_line
Specifies the command line used for encoding/decoding. The command line must configure the UNIX command to accept input from stdin and write its results to stdout. The command must be located in the search path of your job and be accessible by every processing node on which the encode operator executes.

-direction or -mode
-direction encode | decode
-mode encode | decode
Specifies the mode of the operator. If you do not select a direction, it defaults to encode.

-encode
Specify encoding of the data set. Encoding is the default mode.

-decode
Specify decoding of the data set.
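As an illustrative sketch of the encoding direction (assuming the gzip command is available in the search path of every processing node; the data set names are assumptions), the following command compresses a data set and writes the encoded result to a persistent data set:

$ osh "encode -command gzip < inDS.ds > encodedDS.ds"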
For a decoding operation, the operator takes as input a previously encoded data set. The preserve-partitioning flag is propagated from the input to the output data set.
The following example decodes the previously compressed data set so that it can be used by other InfoSphere DataStage operators. To do so, you use an instance of the encode operator with a mode of decode. In a converse operation to the encoding, you specify the same command, gzip, with the -d option to decode its input. Here is the osh command for this example:
$ osh "encode -decode -command gzip -d < inDS.ds | op2 ..."
In this example, the command line uses the -d switch to specify the decompress mode of gzip.
Filter operator
The filter operator transfers the input records in which one or more fields meet the requirements you specify. If you request a reject data set, the filter operator transfers records that do not meet the requirements to the reject data set.
Figure: data flow for the filter operator. The input data set (schema inRec:*;) feeds the filter operator, which writes one or more output data sets and, optionally, a reject data set.
filter: properties
Table 22. filter properties

Number of input data sets: 1
Number of output data sets: 1 or more, and, optionally, a reject data set
Input interface schema: inRec:*;
Output interface schema: outRec:*;
Transfer behavior: inRec -> outRec without record modification
Execution mode: parallel by default, or sequential
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
Composite operator: no
Combinable operator: yes
Table 23. Filter options (continued)

-collation_sequence
-collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
v Specify a predefined IBM ICU locale
v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

-nulls
-nulls first | last
By default, nulls are evaluated first, before other values. To override this default, specify -nulls last.

-reject
-reject
By default, records that do not meet specified requirements are dropped. Specify this option to override the default. If you do, attach a reject output data set to the operator.

-target
-target dsNum
An optional sub-property of where. Use it to specify the target data set for a where clause. Multiple -where clauses can direct records to the same output data set. If a target data set is not specified for a particular -where clause, the output data set for that clause is implied by the order of all -where properties that do not have the -target sub-property. For example:

-where "field1 < 4"                     Data set 0
-where "field2 like bb"                 Data set 1
-where "field3 like aa" -target 2       Data set 2
-where "field4 > 10" -target 0          Data set 0
-where "field5 like c.*"                Data set 2
The output summary per criterion is included in the summary messages generated by InfoSphere DataStage as custom information. It is identified with:
name="CriterionSummary"
The XML tags, criterion, case and where, are used by the filter operator when generating business logic and criterion summary custom information. These tags are used in the example information below.
Expressions
The behavior of the filter operator is governed by expressions that you set. You can use the following elements to specify the expressions:
v Fields of the input data set
v Requirements involving the contents of the fields
v Optional constants to be used in comparisons
v The Boolean operators AND and OR to combine requirements
When a record meets the requirements, it is written unchanged to an output data set. Which of the output data sets it is written to is either implied by the order of your -where options or explicitly defined by means of the -target suboption.
The filter operator supports standard SQL expressions, except when comparing strings.
10. BOOLEAN is true
11. BOOLEAN is false
12. BOOLEAN is not true
13. BOOLEAN is not false
Any of these can be combined using AND or OR.
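For instance, a minimal sketch combining two of these predicate forms in a single -where clause (the field names active and flagged, and the data set names, are illustrative):

$ osh "filter -where 'active is true and flagged is not false' < inDS.ds > outDS.ds"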
Regular expressions
The description of regular expressions in this section has been taken from this publication: Rogue Wave, Tools.h++.
Order of association
As in SQL, expressions are associated left to right. AND and OR have the same precedence. You can group fields and expressions in parentheses to affect the order of evaluation.
String comparison
InfoSphere DataStage operators sort string values according to these general rules:
v Characters are sorted in lexicographic order
v Strings are evaluated by their ASCII value
v Sorting is case sensitive, that is, uppercase letters appear before lowercase letters in sorted data
v Null characters appear before non-null characters in a sorted data set, unless you specify the nulls last option
v Byte-for-byte comparison is performed
OSH syntax
osh " filter -where age >= 21 and income > 50000.00 -where income > 50000.00 -where age >= 21 -target 1 -where age >= 21 < [record@filter_example.schema] all12.txt 0>| AdultAndRich.txt 1>| AdultOrRich.txt 2>| Adult.txt "
The first -where option directs all records that have age >= 21 and income > 50000.00 to output 0, which is then directed to the file AdultAndRich.txt. The second -where option directs all records that have income > 50000.00 to output 1, which is then directed to AdultOrRich.txt. The third -where option directs all records that have age >= 21 also to output 1 (because of the expression -target 1) which is then directed to AdultOrRich.txt. The result of the second and third -where options is that records that satisfy either of the two conditions income > 50000.00 or age >= 21 are sent to output 1. A record that satisfies multiple -where options that are directed to the same output are only written to output once, so the effect of these two options is exactly the same as:
-where income > 50000.00 or age >= 21
The fourth -where option causes all records satisfying the condition age >= 21 to be sent to output 2, because it is the third -where option that has no -target suboption and is therefore assigned to the next output data set in order. This output is then sent to Adult.txt.
Input data
As a test case, the following twelve data records exist in an input file all12.txt.
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
William Mandella M 67 0040676.94 CA
Ann Claybourne F 29 0061774.32 FL
Frank Chalmers M 19 0004881.94 NY
Jane Studdock F 24 0075990.80 TX
Seymour Glass M 18 0051531.56 NJ
Laura Engels F 57 0015280.31 KY
John Boone M 16 0042729.03 CO
Jennifer Sarandon F 58 0081319.09 ND
William Tell M 73 0021008.45 SD
Ann Dillard F 21 0004552.65 MI
Outputs
The following output comes from running InfoSphere DataStage. Because of parallelism, the order of the records might be different for your installation. If order matters, you can apply the psort or tsort operator to the output of the filter operator. After the InfoSphere DataStage job is run, the file AdultAndRich.txt contains:
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
Ann Claybourne F 29 0061774.32 FL
Jane Studdock F 24 0075990.80 TX
Jennifer Sarandon F 58 0081319.09 ND
After the InfoSphere DataStage job is run, the file AdultOrRich.txt contains:
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
William Mandella M 67 0040676.94 CA
Ann Claybourne F 29 0061774.32 FL
Jane Studdock F 24 0075990.80 TX
Seymour Glass M 18 0051531.56 NJ
Laura Engels F 57 0015280.31 KY
Jennifer Sarandon F 58 0081319.09 ND
William Tell M 73 0021008.45 SD
Ann Dillard F 21 0004552.65 MI
After the InfoSphere DataStage job is run, the file Adult.txt contains:
John Parker M 24 0087228.46 MA
Susan Calvin F 24 0091312.42 IL
William Mandella M 67 0040676.94 CA
Ann Claybourne F 29 0061774.32 FL
Jane Studdock F 24 0075990.80 TX
Laura Engels F 57 0015280.31 KY
Jennifer Sarandon F 58 0081319.09 ND
William Tell M 73 0021008.45 SD
Ann Dillard F 21 0004552.65 MI
Funnel operators
The funnel operators copy multiple input data sets to a single output data set. This operation is useful for combining separate data sets into a single large data set. InfoSphere DataStage provides two funnel operators:
v The funnel operator combines the records of the input data in no guaranteed order.
v The sortfunnel operator combines the input records in the order defined by the value(s) of one or more key fields, and the order of the output records is determined by these sorting keys.
By default, InfoSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the sortfunnel operator and other operators.
Figure: multiple input data sets feed a funnel or sortfunnel operator, which writes a single output data set (schema outRec:*;).
sortfunnel: properties
Table 24. sortfunnel properties

Number of input data sets: N (set by user)
Number of output data sets: 1
Input interface schema: inRec:*
Output interface schema: outRec:*
Transfer behavior: inRec -> outRec without record modification
Execution mode: parallel (default) or sequential
Input partitioning style: sortfunnel operator: keys in same partition
Output partitioning style: sortfunnel operator: distributed keys
Partitioning method: funnel operator: round robin (parallel mode); sortfunnel operator: hash
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
Composite operator: no
Funnel operator
Figure: input data sets feed the funnel or sortfunnel operator, which writes a single output data set (schema outRec:*;).
Syntax
The funnel operator has no options. Its syntax is simply:
funnel
Note: We do not guarantee the output order of records transferred by means of the funnel operator. Use the sortfunnel operator to guarantee transfer order.
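For example, a minimal sketch that funnels two data sets into one (the data set names are illustrative):

$ osh "funnel < in0.ds < in1.ds > combined.ds"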
The sortfunnel operator requires that the record schema of all input data sets be identical. A parallel sortfunnel operator uses the keys in same partition input partitioning style and, by default, hash partitions its input on the sort keys. See "The Partitioning Library" for more information on partitioning styles.
Figure: three input data sets (data set 0, data set 1, and data set 2), each with a pointer to its current record, containing records such as Jane Smith 42, Paul Smith 34, and Mary Davis 42.
If the data sets shown above are sortfunneled on the primary key, LastName, and then on the secondary key, Age, the records are output in this order:
Mary Davis 42
Paul Smith 34
Jane Smith 42
Table 25. Funnel: syntax and options

-collation_sequence
-collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
v Specify a predefined IBM ICU locale
v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

-key
-key field [-cs | -ci] [-asc | -desc] [-nulls first | last] [-ebcdic] [-param params]
Specifies a key field of the sorting operation. The first -key defines the primary key field of the sort; lower-priority key fields are supplied on subsequent -key specifications.
You must define a single primary key to the sortfunnel operator. You can define as many secondary keys as are required by your job. For each key, select the option and supply the field name. Each record field can be used only once as a key. Therefore, the total number of primary and secondary keys must be less than or equal to the total number of fields in the record.
-cs | -ci are optional arguments for specifying case-sensitive or case-insensitive sorting. By default, the operator uses a case-sensitive algorithm for sorting, that is, uppercase strings appear before lowercase strings in the sorted data set. Specify -ci to override this default and perform case-insensitive sorting of string fields.
-asc | -desc are optional arguments for specifying ascending or descending sorting. By default, the operator uses ascending sorting order, that is, smaller values appear before larger values in the sorted data set. Specify -desc to sort in descending sorting order instead, so that larger values appear before smaller values in the sorted data set.
-nulls first | last: by default, fields containing null values appear first in the sorted data set. To override this default so that fields containing null values appear last in the sorted data set, specify nulls last.
-ebcdic: by default, data is represented in the ASCII character set. To represent data in the EBCDIC character set, specify this option.
The -param suboption allows you to specify extra parameters for a field. Specify parameters using property=value pairs separated by commas.
In this osh example, the sortfunnel operator combines two input data sets into one sorted output data set:
$ osh "sortfunnel -key Lastname -key Age < out0.v < out1.v > combined.ds
Generator operator
Often during the development of an InfoSphere DataStage job, you will want to test the job using valid data. However, you might not have any data available to run the test, your data might be too large to execute the test in a timely manner, or you might not have data with the characteristics required to test the job.
The InfoSphere DataStage generator operator lets you create a data set with the record layout that you pass to the operator. In addition, you can control the number of records in the data set, as well as the value of all record fields. You can then use this generated data set while developing, debugging, and testing your InfoSphere DataStage job. To generate a data set, you pass to the operator a schema defining the field layout of the data set and any information used to control generated field values. This topic describes how to use the generator operator, including information on the schema options you use to control the generated field values.
Figure: data flow for the generator operator. An optional input data set (schema inRec:*;) feeds the generator operator, which writes an output data set (schema outRec:*;).
generator: properties
Table 26. generator properties

Number of input data sets: 0 or 1
Number of output data sets: 1
Input interface schema: inRec:*
Output interface schema: supplied_schema; outRec:*
Transfer behavior: inRec -> outRec without record modification
Execution mode: sequential (default) or parallel
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
You must use either the -schema or the -schemafile argument to specify a schema to the operator. Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
Table 27. generator Operator Options

-schema
-schema schema
Specifies the schema for the generated data set. You must specify either -schema or -schemafile to the operator. If you supply an input data set to the operator, new fields with the specified schema are prepended to the beginning of each record.

-schemafile
-schemafile filename
Specifies the name of a file containing the schema for the generated data set. You must specify either -schema or -schemafile to the operator. If you supply an input data set to the operator, new fields with the supplied schema are prepended to the beginning of each record.

-records
-records num_recs
Specifies the number of records to generate. By default the operator generates an output data set with 10 records (in sequential mode) or 10 records per partition (in parallel mode). If you supply an input data set to the operator, any specification for -records is ignored. In this case, the operator generates one record for each record in the input data set.

-resetForEachEOW
-resetForEachEOW
Specifies that the cycle should be repeated for each EOW.
By default, the operator executes sequentially to generate a data set with a single partition containing 10 records. However, you can configure the operator to generate any number of records. If you configure the operator to execute in parallel, you control the number of records generated in each partition of the output data set. You can also pass an input data set to the operator. In this case, the operator prepends the generated record fields to the beginning of each record of the input data set to create the output.
This figure shows the data flow diagram for this example:
Figure: the generator operator writes the data set newDS.ds.
To use the generator operator, first configure the schema:
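A hedged sketch of this step, using an illustrative two-field schema held in an environment variable (the field names a and b are assumptions, not part of the original example):

$ rec_schema="record ( a:int32; b:string[10]; )"
$ osh "generator -schema $rec_schema -records 1000 > newDS.ds"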
This example defines an environment variable ($rec_schema) to hold the schema passed to the operator. Alternatively you can specify the name of a file containing the schema, as shown below:
$ osh "generator -schemafile s_file.txt -records 1000 > newDS.ds"
Note that the keyword [par] has been added to the example to configure the generator operator to execute in parallel.
The figure below shows the output record of the generator operator:
Generated fields
For example, you can enumerate the records of a data set by appending an int32 field that cycles from 0 upward. The generated fields are prepended to the beginning of each record. This means conflicts caused by duplicate field names in the generator schema and the input data set result in the field from the input data set being dropped. Note that InfoSphere DataStage issues a warning message to inform you of the naming conflict. You can use the modify operator to rename the fields in the input data set to avoid the name collision. See "Transform Operator" for more information.
In the absence of any other specifications in the schema, the operator assigns default values to the fields of the output data set. However, you can also include information in the schema to control the values of the generated fields. This section describes the default values generated for all InfoSphere DataStage data types and the use of options in the schema to control field values.
Note that you include the generator options as part of the schema definition for a field. The options must be included within braces and before the trailing semicolon. Use commas to separate options for fields that accept multiple options. This table lists all options for the different InfoSphere DataStage data types. Detailed information on these options follows the table.
Generator options for the schema, by data type:
v numeric (also decimal, date, time, timestamp): cycle = {init = init_val, incr = incr_val, limit = limit_val} or random = {limit = limit_val, seed = seed_val, signed}
v date: epoch = 'date', invalids = percentage, function = rundate
v decimal: zeros = percentage, invalids = percentage
v raw: no options available
v string: cycle = {value = 'string_1', value = 'string_2', ... }, alphabet = 'alpha_numeric_string'
v ustring: cycle = {value = 'string_1', value = 'string_2', ... }, alphabet = 'alpha_numeric_string'
v time: scale = factor, invalids = percentage
v timestamp: supports all valid options for both date and time fields
v nullable fields: nulls = percentage, nullseed = seed
Numeric fields
By default, the value of an integer or floating point field in the first record created by the operator is 0 (integer) or 0.0 (float). The field in each successive record generated by the operator is incremented by 1 (integer) or 1.0 (float).
The generator operator supports the use of the cycle and random options that you can use with integer and floating point fields (as well as with all other fields except raw and string). The cycle option generates a repeating pattern of values for a field. The random option generates random values for a field. These options are mutually exclusive; that is, you can only use one option with a field.
v cycle generates a repeating pattern of values for a field. Shown below is the syntax for this option:
cycle = {init = init_val, incr = incr_val, limit = limit_val}
where:
init_val is the initial field value (value of the first output record). The default value is 0.
incr_val is the increment value added to produce the field value in the next output record. The default value is 1 (integer) or 1.0 (float).
limit_val is the maximum field value. When the generated field value is greater than limit_val, it wraps back to init_val. The default value of limit_val is the maximum allowable value for the field's data type.
You can specify the keyword part or partcount for any of these three option values. Specifying part uses the partition number of the operator on each processing node for the option value. The partition number is 0 on the first processing node, 1 on the next, and so on. Specifying partcount uses the number of partitions executing the operator for the option value. For example, if the operator executes on four processing nodes, partcount corresponds to a value of 4.
v random generates random values for a field. Shown below is the syntax for this option (all arguments to random are optional):
random = {limit = limit_val, seed = seed_val, signed}
where: limit_val is the maximum generated field value. The default value of limit_val is the maximum allowable value for the field's data type.
seed_val is the seed value for the random number generator used by the operator for the field. You do not have to specify seed_val. By default, the operator uses the same seed value for all fields containing the random option. signed specifies that signed values are generated for the field (values between -limit_val and +limit_val.) Otherwise, the operator creates values between 0 and +limit_val. You can also specify the keyword part for seed_val and partcount for limit_val. For example, the following schema generates a repeating cycle of values for the AccountType field and a random number for balance:
record ( AccountType:int8 {cycle={init=0, incr=1, limit=24}}; Balance:dfloat {random={limit=100000, seed=34455}}; )
Date fields
By default, a date field in the first record created by the operator is set to January 1, 1960. The field in each successive record generated by the operator is incremented by one day. You can use the cycle and random options for date fields as shown above. When using these options, you specify the option values as a number of days. For example, to set the increment value for a date field to seven days, you use the following syntax:
record ( transDate:date {cycle={incr=7}}; transAmount:dfloat {random={limit=100000,seed=34455}}; )
In addition, you can use the following options: epoch, invalids, and functions. The epoch option sets the earliest generated date value for a field. You can use this option with any other date options. The syntax of epoch is:
epoch = date
where date sets the earliest generated date for the field. The date must be in yyyy-mm-dd format and leading zeros must be supplied for all portions of the date. If an epoch is not specified, the operator uses 1960-01-01. For example, the following schema sets the initial field value of transDate to January 1, 1998:
record ( transDate:date {epoch=1998-01-01}; transAmount:dfloat {random={limit=100000,seed=34455}}; )
You can also specify the invalids option for a date field. This option specifies the percentage of generated fields containing invalid dates:
invalids = percentage
where percentage is a value between 0.0 and 100.0. InfoSphere DataStage operators that process date fields can detect an invalid date during processing. The following example causes approximately 10% of transDate fields to be invalid:
record ( transDate:date {epoch=1998-01-01, invalids=10.0}; transAmount:dfloat {random={limit=100000, seed=34455}}; )
You can use the function option to set date fields to the current date:
function = rundate
There must be no other options specified to a field using function. The following schema causes transDate to have the current date in all generated records:
record ( transDate:date {function=rundate}; transAmount:dfloat {random={limit=100000, seed=34455}}; )
Decimal fields
By default, a decimal field in the first record created by the operator is set to 0. The field in each successive record generated by the operator is incremented by 1. The maximum value of the decimal is determined by the decimal's scale and precision. When the maximum value is reached, the decimal field wraps back to 0. You can use the cycle and random options with decimal fields. See "Numeric Fields" for information on these options. In addition, you can use the zeros and invalids options with decimal fields. These options are described below. The zeros option specifies the percentage of generated decimal fields where all bytes of the decimal are set to binary zero (0x00). Many operations performed on a decimal can detect this condition and either fail or return a flag signifying an invalid decimal value. The syntax for the zeros options is:
zeros = percentage
where percentage is a value between 0.0 and 100.0. The invalids option specifies the percentage of generated decimal fields containing an invalid representation of 0xFF in all bytes of the field. Any operation performed on an invalid decimal detects this condition and either fails or returns a flag signifying an invalid decimal value. The syntax for invalids is:
invalids = percentage
where percentage is a value between 0.0 and 100.0. If you specify both zeros and invalids, the percentage for invalids is applied to the fields that are not first made zero. For example, if you specify zeros=50 and invalids=50, the operator generates approximately 50% of all values to be all zeros and only 25% (50% of the remainder) to be invalid.
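For example, a sketch of a schema (the field name and precision are illustrative) that applies the zeros=50 and invalids=50 combination described above:

record ( price:decimal[8,2] {zeros=50, invalids=50}; )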
Raw fields
You can use the generator operator to create fixed-length raw fields or raw fields with a specified maximum length; you cannot use the operator to generate variable-length raw fields. If the field has a maximum specified length, the length of the field is a random number between 1 and the maximum length. Maximum-length raw fields are variable-length fields with a maximum length defined by the max parameter in the form:
max_r:raw [max=10];
By default, all bytes of a raw field in the first record created by the operator are set to 0x00. The bytes of each successive record generated by the operator are incremented by 1 until a maximum value of 0xFF is reached. The operator then wraps byte values to 0x00 and repeats the cycle. You cannot specify any options to raw fields.
String fields
You can use the generator operator to create fixed-length string and ustring fields or string and ustring fields with a specified maximum length; you cannot use the operator to generate variable-length string fields. If the field has a maximum specified length, the length of the string is a random number between 0 and the maximum length. Note that maximum-length string fields are variable-length fields with a maximum length defined by the max parameter in the form:
max_s: string [max=10];
In this example, the field max_s is variable length up to 10 bytes long. By default, the generator operator initializes all bytes of a string field to the same alphanumeric character. When generating a string field, the operator uses the following characters, in the following order:
abcdefghijklmnopqrstuvwxyz0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
After the last character, capital Z, values wrap back to lowercase a and the cycle repeats. Note: The alphabet property for ustring values accepts Unicode characters. You can use the alphabet property to define your own list of alphanumeric characters used for generated string fields:
alphabet = alpha_numeric_string
This option sets all characters in the field to successive characters in the alpha_numeric_string. For example, this field specification:
s: string[3] {alphabet=abc};
Note: The cycle option for ustring values accepts Unicode characters. The cycle option specifies the list of string values assigned to the generated string field:
cycle = { value = string_1, value = string_2, ... }
The operator assigns string_1 to the string field in the first generated record, string_2 to the field in the second generated record, and so on. A short sketch follows this list. In addition:
v If you specify only a single value, all string fields are set to that value.
v If the generated string field is fixed length, the value string is truncated or padded with the default pad character 0x00 to the fixed length of the string.
v If the string field contains a maximum length setting, the length of the string field is set to the length of the value string. If the length of the value string is longer than the maximum string length, the value string is truncated to the maximum length.
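Here is a minimal sketch (the field name and values are illustrative) of a fixed-length string field that cycles through three values; shorter values are padded to the fixed length with the default pad character:

record ( status:string[6] {cycle={value='low', value='medium', value='high'}}; )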
Time fields
By default, a time field in the first record created by the operator is set to 00:00:00 (midnight). The field in each successive record generated by the operator is incremented by one second. After reaching a time of 23:59:59, time fields wrap back to 00:00:00. You can use the cycle and random options with time fields. See "Numeric Fields" for information on these options. When using these options, you specify the options values in numbers of seconds. For example, to set the value for a time field to a random value between midnight and noon, you use the following syntax:
record ( transTime:time {random={limit=43200, seed=83344}}; )
For a time field, midnight corresponds to an initial value of 0 and noon corresponds to 43,200 seconds (12 hours * 60 minutes * 60 seconds). In addition, you can use the scale and invalids options with time fields. The scale option allows you to specify a multiplier to the increment value for time. The syntax of this option is:
scale = factor
The increment value is multiplied by factor before being added to the field. For example, the following schema generates two time fields:
record ( timeMinutes:time {scale=60}; timeSeconds:time; )
In this example, the first field increments by 60 seconds per record (one minute), and the second field increments by one second per record. You use the invalids option to specify the percentage of invalid time fields generated:
invalids = percentage
where percentage is a value between 0.0 and 100.0. The following schema generates two time fields with different percentages of invalid values:
record ( timeMinutes:time {scale=60, invalids=10}; timeSeconds:time {invalids=15}; )
Timestamp fields
A timestamp field consists of both a time and date portion. Timestamp fields support all valid options for both date and time fields. See "Date Fields" or "Time Fields" for more information.
By default, a timestamp field in the first record created by the operator is set to 00:00:00 (midnight) on January 1, 1960. The time portion of the timestamp is incremented by one second for each successive record. After reaching a time of 23:59:59, the time portion wraps back to 00:00:00 and the date portion increments by one day.
Null fields
By default, schema fields are not nullable. Specifying a field as nullable allows you to use the nulls and nullseed options within the schema passed to the generator operator.
Note: If you specify these options for a non-nullable field, the operator issues a warning and the field is set to its default value.
The nulls option specifies the percentage of generated fields that are set to null:
nulls = percentage
where percentage is a value between 0.0 and 100.0. The following example specifies that approximately 15% of all generated records contain a null for field a:
record ( a:nullable int32 {random={limit=100000, seed=34455}, nulls=15.0}; b:int16; )
The nullseed option sets the seed for the random number generator used to decide whether a given field will be null.
nullseed = seed
where seed specifies the seed value and must be an integer larger than 0. In some cases, you might have multiple fields in a schema that support nulls. You can set all nullable fields in a record to null by giving them the same nulls and nullseed values. For example, the following schema defines two fields as nullable:
record ( a:nullable int32 {nulls=10.0, nullseed=5663}; b:int16; c:nullable sfloat {nulls=10.0, nullseed=5663}; d:string[10]; e:dfloat; )
Since both fields a and c have the same settings for nulls and nullseed, whenever one field in a record is null the other is null as well.
Head operator
The head operator selects the first n records from each partition of an input data set and copies the selected records to an output data set. By default, n is 10 records. However, you can determine the following by means of options:
v The number of records to copy
v The partition from which the records are copied
v The location of the records to copy
v The number of records to skip before the copying operation begins.
This control is helpful in testing and debugging jobs with large data sets. For example, the -part option lets you see data from a single partition to ascertain if the data is being partitioned as you want. The -skip option lets you access a portion of a data set. The tail operator performs a similar operation, copying the last n records from each partition. See "Tail Operator" .
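As an illustrative sketch combining these options (the data set names are assumptions), the following command copies records 101 through 110 of partition 0 only:

$ osh "head -nrecs 10 -skip 100 -part 0 < in.ds > out.ds"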
Figure: data flow for the head operator. The input data set (schema inRec:*;) feeds the head operator, which writes an output data set (schema outRec:*;).
head: properties
Table 28. head properties

Number of input data sets: 1
Number of output data sets: 1
Input interface schema: inRec:*
Output interface schema: outRec:*
Transfer behavior: inRec -> outRec without record modification
Table 29. Head options (continued)

-nrecs
-nrecs count
Specify the number of records (count) to copy from each partition of the input data set to the output data set. The default value of count is 10. You cannot specify this option and the -all option at the same time.

-part
-part partition_number
Copy records only from the indicated partition, partition_number. By default, the operator copies records from all partitions. You can specify -part multiple times to specify multiple partition numbers. Each time you do, specify the option followed by the number of the partition.

-period
-period P
Copy every Pth record in a partition, where P is the period. You can start the copy operation after records have been skipped (as defined by -skip). P must equal or be greater than 1. The default value of P is 1.

-skip
-skip recs
Ignore the first recs records of each partition of the input data set, where recs is the number of records to skip. The default skip count is 0.
For example, if in.ds is a data set of one megabyte with 2500K records, out0.ds is a data set of 15.6 kilobytes with 4K records.
Lookup operator
With the lookup operator, you can create and use lookup tables to modify your input data set. For example, you could map a field that should contain a two-letter U.S. state postal code to the name of the state, adding a FullStateName field to the output schema.
The operator performs in two modes: lookup mode and create-only mode:
v In lookup mode, the operator takes as input a single source data set, one or more lookup tables represented as InfoSphere DataStage data sets, and one or more file sets. A file set is a lookup table that contains key-field information. There must be at least one lookup table or file set.
For each input record of the source data set, the operator performs a table lookup on each of the input lookup tables. The table lookup is based on the values of a set of lookup key fields, one set for each table. A source record and lookup record correspond when all of the specified lookup key fields have matching values.
Each record of the output data set contains all of the fields from a source record. Concatenated to the end of the output records are the fields from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key fields.
The reject data set is an optional second output of the operator. This data set contains source records that do not have a corresponding entry in every input lookup table.
v In create-only mode, you use the -createOnly option to create lookup tables without doing the lookup processing step. This allows you to make and save lookup tables that you expect to need at a later time, making for faster start-up of subsequent lookup operations.
The lookup operator is similar in function to the merge operator and the join operators. To understand the similarities and differences see "Comparison with Other Operators".
Figure: lookup operator data flows. In lookup mode, the source data set (source.ds), file sets (fileset0.ds ... filesetn.ds), and lookup tables (table0.ds ... tableN.ds) feed the lookup operator, which writes output.ds and an optional reject data set (reject.ds, schema rejectRec:*;). In create-only mode, the lookup operator writes the lookup tables to file sets (fileset0.ds ... filesetn.ds).
lookup: properties
Table 30. lookup properties

Number of input data sets
  Normal mode: T+1
  Create-only mode: T
Number of output data sets
  Normal mode: 1 or 2 (output and optional reject)
  Create-only mode: 0
Input interface schema, input data set
  Normal mode: key0:data_type; ... keyN:data_type; inRec:*;
  Create-only mode: n/a
Input interface schema, lookup data sets
  Normal mode: key0:data_type; ... keyM:data_type; tableRec:*;
  Create-only mode: key0:data_type; ... keyM:data_type; tableRec:*;
Output interface schema, output data set
  Normal mode: outRec:*; with lookup fields missing from the input data set concatenated
  Create-only mode: n/a
Output interface schema, reject data sets
  Normal mode: rejectRec:*;
  Create-only mode: n/a
Transfer behavior, source to output
  Normal mode: inRec -> outRec without record modification
  Create-only mode: n/a
Transfer behavior, lookup to output
  Normal mode: tableRecN -> tableRecN, minus lookup keys and other duplicate fields
  Create-only mode: n/a
Transfer behavior, source to reject
  Normal mode: inRec -> rejectRec without record modification (optional)
  Create-only mode: n/a
Transfer behavior, table to file set
  Normal mode: key-field information is added to the table
  Create-only mode: key-field information is added to the table
Partitioning method
  Normal mode: any (parallel mode); the default for table inputs is entire
  Create-only mode: any (default is entire)
Collection method
  Normal mode: any (sequential mode)
  Create-only mode: any
Preserve-partitioning flag in output data set
  Normal mode: propagated
  Create-only mode: n/a
Composite operator
  Normal mode: yes
  Create-only mode: yes
or
lookup [-fileset fileset_descriptor] [-collation_sequence locale |collation_file_pathname | OFF] [-table key_specifications [-allow_dups] -save fileset_descriptor [-diskpool pool]...] [-ifNotFound continue | drop | fail | reject]
where a fileset, or a table, or both, must be specified, and key_specifications is a list of one or more strings of this form:
-key field [-cs | -ci]
Table 31. Lookup options Option -collation_sequence Use -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. http://oss.software.ibm.com/icu/userguide/ Collate_Intro.html -createOnly -createOnly Specifies the creation of one or more lookup tables; no lookup processing is to be done. -fileset [-fileset fileset_descriptor ...] Specify the name of a fileset containing one or more lookup tables to be matched. These are tables that have been created and saved by an earlier execution of the lookup operator using the -createOnly option. In lookup mode, you must specify either the -fileset option, or a table specification, or both, in order to designate the lookup table(s) to be matched against. There can be zero or more occurrences of the -fileset option. It cannot be specified in create-only mode. Warning: The fileset already contains key specifications. When you follow -fileset fileset_descriptor by key_specifications, the keys specified do not apply to the fileset; rather, they apply to the first lookup table. For example, lookup -fileset file -key field, is the same as: lookup -fileset file1 -table -key field
150
Table 31. Lookup options (continued) Option -ifNotFound Use -ifNotFound continue | drop | fail | reject Specifies the operator action when a record of an input data set does not have a corresponding record in every input lookup table. The default action of the operator is to fail and terminate the step. continue tells the operator to continue execution when a record of an input data set does not have a corresponding record in every input lookup table. The input record is transferred to the output data set along with the corresponding records from the lookup tables that matched. The fields in the output record corresponding to the lookup table(s) with no corresponding record are set to their default value or null if the field supports nulls. drop tells the operator to drop the input record (refrain from creating an output record). fail sets the operator to abort. This is the default. reject tells the operator to copy the input record to the reject data set. In this case, a reject output data set must be specified.
Table 31. Lookup options (continued)

-table
  -table -key field [-ci | -cs] [-param parameters] [-key field [-ci | -cs] [-param parameters] ...] [-allow_dups] [-save fileset_descriptor [-diskpool pool]] ...
  Specifies the beginning of a list of key fields and other specifications for a lookup table. The first occurrence of -table marks the beginning of the key field list for lookup table1; the next occurrence of -table marks the beginning of the key fields for lookup table2, and so on. For example:
  lookup -table -key field -table -key field
  The -key option specifies the name of a lookup key field. The -key option must be repeated if there are multiple key fields. You must specify at least one key for each table. You cannot use a vector, subrecord, or tagged aggregate field as a lookup key.
  The -ci suboption specifies that the string comparison of lookup key values is to be case insensitive; the -cs suboption specifies case-sensitive comparison, which is the default.
  The -param suboption provides extra parameters for the lookup key. Specify property=value pairs, without curly braces.
  In create-only mode, the -allow_dups option causes the operator to save multiple copies of duplicate records in the lookup table without issuing a warning. Two lookup records are duplicates when all lookup key fields have the same value in the two records. If you do not specify this option, InfoSphere DataStage issues a warning message when it encounters duplicate records and discards all but the first of the matching records. In normal lookup mode, only one lookup table (specified by either -table or -fileset) can have been created with -allow_dups set.
  The -save suboption lets you specify the name of a fileset to write this lookup table to; if -save is omitted, tables are written as scratch files and deleted at the end of the lookup. In create-only mode, -save is required.
  The -diskpool suboption lets you specify a disk pool in which to create lookup tables. By default, the operator looks first for a "lookup" disk pool, then uses the default pool (""). Use this option to specify a different disk pool to use.
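As an illustration of how these options combine, the following sketch (with hypothetical data set and field names) performs a normal-mode lookup on a single table and uses -ifNotFound reject; the second output data set attached to the operator receives the unmatched source records:

$ osh "lookup -table -key custID -ifNotFound reject
       < source.ds < lookupTable.ds
       > matched.ds > unmatched.ds"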
The memory used to hold a lookup table is shared among the lookup processes running on each machine. Thus, on an SMP, all instances of a lookup operator share a single copy of the lookup table, rather than each process having a private copy of the table. This reduces memory consumption to that of a single sequential lookup process. Note that partitioning the data, which in a non-shared-memory environment saves memory by creating smaller tables, disables this memory sharing, so there is no benefit to partitioning lookup tables on an SMP or cluster.
Partitioning
Normally (and by default), lookup tables are partitioned using the entire partitioning method so that each processing node receives a complete copy of the lookup table. You can partition lookup tables using another partitioning method, such as hash, as long as you ensure that all records with the same lookup keys are partitioned identically. Otherwise, source records might be directed to a partition that doesn't have the proper table entry. For example, if you are doing a lookup on keys a, b, and c, having both the source data set and the lookup table hash partitioned on the same keys would permit the lookup tables to be broken up rather than copied in their entirety to each partition. This explicit partitioning disables memory sharing, but the lookup operation consumes less memory, since the entire table is not duplicated. Note, though, that on a single SMP, hash partitioning does not actually save memory. On MPPs, or where shared memory can be used only in a limited way, or not at all, it can be beneficial.
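A minimal sketch of explicit partitioning (data set and field names are hypothetical): hash-partition the lookup table and the source data set on the same keys, using a named virtual data set for the table input, so that matching records land in the same partition:

$ osh "hash -key a -key b -key c < table.ds > table_hashed.v;
       hash -key a -key b -key c < source.ds |
       lookup -table -key a -key b -key c < table_hashed.v > outDS.ds"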
Create-only mode
In its normal mode of operation, the lookup operator takes in a source data set and one or more data sets from which the lookup tables are built. The lookup tables are actually represented as file sets, which can be saved if you wish but which are normally deleted as soon as the lookup operation is finished. There is also a mode, selected by the -createOnly option, in which there is no source data set; only the data sets from which the lookup tables are to be built are used as input. The resulting file sets, containing lookup tables, are saved to persistent storage. This create-only mode of operation allows you to build lookup tables when it is convenient, and use them for doing lookups at a later time. In addition, initialization time for the lookup processing phase is considerably shorter when lookup tables already exist. For example, suppose you have data sets data1.ds and data2.ds and you want to create persistent lookup tables from them using the name and ID fields as lookup keys in one table and the name and accountType fields in the other. For this use of the lookup operator, you specify the -createOnly option and two -table options. In this case, two suboptions for the -table options are specified: -key and -save. In osh, use the following command:
$ osh " lookup -createOnly -table -key name -key ID -save fs1.fs -table -key name -key accountType -save fs2.fs < data1.ds < data2.ds"
[Figure: a source record and a lookup record, each with key fields name and ID (John, 27); the output record contains all of the source fields plus the lookup record's payload fields that are not already in the source record.]
This figure shows the source and lookup record and the resultant output record. A source record and lookup record are matched if they have the same values for the key field(s). In this example, both records have John as the name and 27 as the ID number. In this example, the lookup keys are the first fields in the record. You can use any field in the record as a lookup key. Note that fields in a lookup table that match fields in the source record are dropped. That is, the output record contains all of the fields from the source record plus any fields from the lookup record that were not in the source record. Whenever any field in the lookup record has the same name as a field in the source record, the data comes from the source record and the lookup record field is ignored. Here is the command for this example:
$ osh "lookup -table -key Name -key ID < inSrc.ds < inLU1.ds > outDS.ds"
[Figure: a source record matched against two lookup records on the key fields name and ID (John, 27); the output record contains the source fields, then the payload fields of lookup record 1 not already in the source record, then the payload fields of lookup record 2 not already in the source record or lookup record 1.]
The osh command for this example is:
$ osh " lookup -table -key name -key ID -table -key name -key ID < inSrc.ds < inLU1.ds < inLU2.ds > outDS.ds"
Note that in this example you specify the same key fields for both lookup tables. Alternatively, you can specify a different set of lookup keys for each lookup table. For example, you could use name and ID for the first lookup table and the fields accountType and minBalance (not shown in the figure) for the second lookup table. Each of the resulting output records would contain those four fields, where the values matched appropriately, and the remaining fields from each of the three input records. Here is the osh command for this example:
$ osh " lookup -table -key name -key ID -table -key accountType -key minBalance < inSrc.ds < inLU1.ds < inLU2.ds > outDS.ds"
[Figure: data-flow diagram of a lookup step; the lookup table input is partitioned with entire and the result is written to OutDS.ds.]
Since the interest rate is not represented in the source data set record schema, the interestRate field from the lookup record has been concatenated to the source record. Here is the osh code for this example:
$ osh " lookup -table -key accountType < customers.ds < interest.ds > outDS.ds"
To make the lookup table's interestRate field the one that is retained in the output, use a modify operator to drop interestRate from the source record. The interestRate field from the lookup table record is propagated to the output data set, because it is now the only field of that name. The following figure shows how to use a modify operator to drop the interestRate field:
[Figure: data-flow diagram in which the source data set passes through a modify operator before the lookup; the lookup table input is partitioned with entire and the step writes its output data set.]
The osh command for this example is:
$ osh " modify -spec drop interestRate; < customer.ds | lookup -table -key accountType < interest.ds > outDS.ds"
Note that this is unrelated to using the -allow_dups option on a table, which deals with the case where two records in a lookup table are identical in all the key fields.
Merge operator
The merge operator combines a sorted master data set with one or more sorted update data sets. The fields from the records in the master and update data sets are merged so that the output record contains all the fields from the master record plus any additional fields from the matching update record. A master record and an update record are merged only if both of them have the same values for the merge key field(s) that you specify. Merge key fields are one or more fields that exist in both the master and update records.
By default, InfoSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the merge operator and other operators.
As part of preprocessing your data for the merge operator, you must remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.
This section describes how to use the merge operator. Included in this topic are examples using the remdup operator to preprocess your data. The merge operator is similar in function to the lookup operator and the join operators. To understand the similarities and differences see "Comparison with Other Operators".
[Figure: data-flow diagram of the merge operator taking a master data set and update data sets update1 ... updaten, and producing the merged output plus optional reject data sets reject1 ... rejectn (rejectRec1:*; ... rejectRecn:*).]
merge: properties
Table 32. merge properties

Number of input data sets: 1 master; 1-n update
Number of output data sets: 1 output; 1-n reject (optional)
Input interface schema
  master data set: mKey0:data_type; ... mKeyk:data_type; masterRec:*;
  update data sets: mKey0:data_type; ... mKeyk:data_type; updateRecr:*;
Output interface schema
  output data set: masterRec:*; updateRec1:*; updateRec2:*; ... updateRecn:*;
  reject data sets: rejectRecr:*;
Transfer behavior
  master to output: masterRec -> masterRec without record modification
  update to output: updateRecn -> outputRec
  update to reject: updateRecn -> rejectRecn without record modification (optional)
Input partitioning style: keys in same partition
Output partitioning style: distributed keys
Execution mode: parallel (default) or sequential
Preserve-partitioning flag in output data set: propagated
Table 33. Merge options

-key
  -key field [-ci | -cs] [-asc | -desc] [-ebcdic] [-nulls first | last] [-param params] [-key field [-ci | -cs] [-asc | -desc] [-ebcdic] [-nulls first | last] [-param params] ...]
  Specifies the name of a merge key field. The -key option must be repeated if there are multiple merge key fields.
  The -ci option specifies that the comparison of merge key values is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default.
  -asc | -desc are optional arguments for specifying ascending or descending sorting. By default, the operator uses ascending sorting order, that is, smaller values appear before larger values in the sorted data set. Specify -desc to sort in descending sorting order instead, so that larger values appear before smaller values in the sorted data set.
  -nulls first | last. By default fields containing null values appear first in the sorted data set. To override this default so that fields containing null values appear last in the sorted data set, specify nulls last.
  -ebcdic. By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, specify this option.
  The -param suboption allows you to specify extra parameters for a field. Specify parameters using property=value pairs separated by commas.

-collation_sequence
  -collation_sequence locale | collation_file_pathname | OFF
  This option determines how your string data is sorted. You can:
  - Specify a predefined IBM ICU locale
  - Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
  - Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
  By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, see http://oss.software.ibm.com/icu/userguide/Collate_Intro.html

-dropBadMasters
  -dropBadMasters
  Rejected masters are not output to the merged data set.

-keepBadMasters
  -keepBadMasters
  Rejected masters are output to the merged data set. This is the default.
Table 33. Merge options (continued) Option -nowarnBadMasters Use -nowarnBadMasters Do not warn when rejecting bad masters. -nowarnBadUpdates -nowarnBadUpdates Do not warn when rejecting bad updates. -warnBadMasters -warnBadMasters Warn when rejecting bad masters. This is the default. -warnBadUpdates -warnBadUpdates Warn when rejecting bad updates. This is the default.
Merging records
The merge operator combines a master and one or more update data sets into a single, merged data set based upon the values of a set of merge key fields. Each record of the output data set contains all of the fields from a master record. Concatenated to the end of the output records are any fields from the corresponding update records that are not already in the master record. Corresponding master and update records have the same value for the specified merge key fields.
The action of the merge operator depends on whether you specify multiple update data sets or a single update data set. When merging a master data set with multiple update data sets, each update data set can contain at most one record for each master record. When merging with a single update data set, the update data set can contain multiple records for a single master record. The following sections describe merging a master data set with a single update data set and with multiple update data sets.
[Figure: a master record and an update record, each with key fields name and ID (John, 27); the output record contains all of the master fields plus the update fields that are not already in the master record.]
The figure shows the master and update records and the resultant merged record. A master record and an update record are merged only if they have the same values for the key field(s). In this example, both records have "John" as the Name and 27 as the ID value. Note that in this example the merge keys are the first fields in the record. You can use any field in the record as a merge key, regardless of its location. The schema of the master data set determines the data types of the merge key fields. The schemas of the update data sets might be dissimilar but they must contain all merge key fields (either directly or through adapters). The merged record contains all of the fields from the master record plus any fields from the update record which were not in the master record. Thus, if a field in the update record has the same name as a field in the master record, the data comes from the master record and the update field is ignored. The master data set of a merge must not contain duplicate records where duplicates are based on the merge keys. That is, no two master records can have the same values for all merge keys. For a merge using a single update data set, you can have multiple update records, as defined by the merge keys, for the same master record. In this case, you get one output record for each master/update record pair. In the figure above, if you had two update records with "John" as the Name and 27 as the value of ID, you would get two output records.
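A minimal sketch of a single-update merge in osh, using hypothetical data set names and the key fields from the figure:

$ osh "merge -key name -key ID < master.ds < update.ds > merged.ds"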
By default, InfoSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the merge operator and other operators. The following figure shows the merge of a master record and two update records (one update record from each of two update data sets):
[Figure: a master record and two update records, each with key fields name and ID (John, 27); the output record contains the master fields, then the fields of update record 1 not already in the master record, then the fields of update record 2 not already in the master record or update record 1.]
Any fields in the first update record not in the master record are concatenated to the output record. Then, any fields in the second update record not in the master record or the first update record are concatenated to the output record. For each additional update data set, any fields not already in the output record are concatenated to the end of the output record.
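A sketch of the same operation with two update data sets (names are hypothetical); remember that with multiple update data sets each one can contain at most one record per master record:

$ osh "merge -key name -key ID < master.ds < update1.ds < update2.ds > merged.ds"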
[Figure: data-flow diagram in which the master and update data sets each pass through an optional remdup operator before the merge, all within a single step.]
This diagram shows the overall process as one step. Note that the remdup operator is required only if you have multiple update data sets. If you have only a single update data set, the data set might contain more than one update record for each master record. Another method is to save the output of the remdup operator to a data set and then pass that data set to the merge operator, as shown in the following figure:
[Figure: data-flow diagram in which the master and update data sets are preprocessed by remdup operators in separate steps, producing preprocessed master and preprocessed update data sets that are then merged in a third step.]
This method has the disadvantage that you need the disk space to store the pre-processed master and update data sets and the merge must be in a separate step from the remove duplicates operator. However, the intermediate files can be checked for accuracy before merging them, or used by other processing steps that require records without duplicates.
[Figure: data-flow diagram of a merge step showing the record schemas of the master and update data sets and of the merged output.]
This data-flow diagram shows the record schema of the master and update data sets. The record schema of the master data set has five fields, and all five of these appear in the record schema of the output data set. The update data set also has five fields, but only two of these (f and g) are copied to the output data set because the remaining fields (a, b, and c) already exist in the master data set.
If the example above is extended to include a second update data set with a record schema containing the fields a, b, d, h, and i, then the fields in the merged output record are now a, b, c, e, d, f, g, h, and i, because the last two fields (h and i) occur only in the second update data set and not in the master or first update data set. The unique fields from each additional update data set are concatenated to the end of the merged output.
If there is a third update data set with a schema that contains the fields a, b, d, and h, it adds nothing to the merged output since none of its fields is unique.
Thus if a master and five update data sets are represented as M U1 U2 U3 U4 U5, where M represents the master data set and Un represents update data set n, and if the records in all six data sets contain a field named b, the output record has a value for b taken from the master data set. If a field named e occurs in the U2, U3, and U5 update data sets, the value in the output comes from the U2 data set since it is the first one encountered.
Therefore, the record schema of the merged output record is the sequential concatenation of the master and update schema(s) with overlapping fields having values taken from the master data set or from the first update record containing the field. The values for the merge key fields are taken from the master record, but are identical values to those in the update record(s).
National.ds schema:
customer:int16; month:string[3]; name:string[21]; balance:sfloat; salesman:string[8]; accountType:int8;

California.ds schema:
customer:int16; month:string[3]; name:string[21]; calBalance:sfloat; status:string[8];

[Figure: merge step combining National.ds (master) and California.ds (update) to produce combined.ds.]

combined.ds schema:
customer:int16; month:string[3]; name:string[21]; balance:sfloat; salesman:string[8]; accountType:int8; calBalance:sfloat; status:string[8];
National.ds schema:
customer:int16; month:string[3]; name:string[21]; balance:sfloat; salesman:string[8]; accountType:int8;

Sample National.ds record: 86111, JUN, "Jones, Bob", 345.98, Steve, 12

California.ds schema:
customer:int16; month:string[3]; name:string[21]; CalBalance:sfloat; status:string[8];

Sample California.ds record: 86111, JUN, "Jones, Bob", 637.04, Normal
[Figure: data-flow diagram in which NationalRaw.ds (master) and CaliforniaRaw.ds (update) each pass through a remdup operator, producing National.ds and California.ds, which are then merged in a single step to produce the output.]
For the remdup operators and for the merge operator you specify the same two key fields:
- Option: key, Value: Month
- Option: key, Value: Customer
The steps for this example have been written separately so that you can check the output after each step. Because each step is separate, it is easier to understand the entire process. Later all of the steps are combined together into one step. The separate steps, shown as osh commands, are:
# Produce National.ds
$ osh "remdup -key Month -key Customer < NationalRaw.ds > National.ds"
# Produce California.ds
$ osh "remdup -key Month -key Customer < CaliforniaRaw.ds > California.ds"
# Perform the merge
$ osh "merge -key Month -key Customer < National.ds < California.ds > Combined.ds"
This example takes NationalRaw.ds and CaliforniaRaw.ds and produces Combined.ds without creating the intermediate files. When combining these three steps into one, you use a named virtual data set to connect the operators.
$ osh "remdup -key Month -key Customer < CaliforniaRaw.ds > California.v; remdup -key Month -key Customer < NationalRaw.ds | merge -key Month -key Customer < California.v > Combined.ds"
In this example, California.v is a named virtual data set used as input to merge.
The following figure shows record schemas for the NationalRaw.ds and CaliforniaRaw.ds in which both schemas have a field named Balance:
Master data set NationalRaw.ds schema: customer:int16; month:string[3]; balance:sfloat; salesman:string[8]; accountType:int8;
Update data set CaliforniaRaw.ds schema: customer:int16; month:string[3]; balance:sfloat; CalBalance:sfloat; status:string[8];
If you want the Balance field from the update data set to be output by the merge operator, you have two alternatives, both using the modify operator. v Rename the Balance field in the master data set. v Drop the Balance field from the master record. In either case, the Balance field from the update data set propagates to the output record because it is the only Balance field. The following figure shows the data flow for both methods.
[Figure: data-flow diagram in which the master and update data sets each pass through a remdup operator; the master branch then passes through a modify operator before the merge with the named virtual data set California.v, producing the output.]
modify OldBalance = Balance | merge -key Month -key Customer < California.v > Combined.ds"
The name of the Balance field has been changed to OldBalance. The Balance field from the update data set no longer conflicts with a field in the master data set and is added to records by the merge.
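For comparison, here is a sketch of the other alternative, dropping Balance from the master record instead of renaming it (assuming the same data sets and merge keys as in the earlier single-step example):

$ osh "remdup -key Month -key Customer < CaliforniaRaw.ds > California.v;
       remdup -key Month -key Customer < NationalRaw.ds |
       modify 'DROP Balance;' |
       merge -key Month -key Customer < California.v > Combined.ds"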
# import the order file; store as a virtual data set.
import -schema $ORDER_SCHEMA -file orders.txt -readers 4 |
peek -name -nrecs 1 > orders.v;

# import the product lookup table; store as a virtual data set.
import -schema $LOOKUP_SCHEMA -file lookup_product_id.txt |
entire |                # entire partitioning only necessary in MPP environments
peek -name -nrecs 1 > lookup_product_id.v;

# merge customer data with order data; lookup product descriptions;
# store as a persistent data set.
merge -key cust_id
  -dropBadMasters       # customer did not place an order this period
  < customers.v
  < orders.v
  1>| orders_without_customers.ds |    # if not empty, we have a problem
lookup -key product_id
  -ifNotFound continue  # allow products that don't have a description
  < lookup_product_id.v |
peek -name -nrecs 10 >| customer_orders.ds;
Missing records
The merge operator expects that for each master record there exists a corresponding update record, based on the merge key fields, and vice versa. If the merge operator takes a single update data set as input, the update data set might contain multiple update records for a single master record. By using command-line options to the operator, you can specify the action of the operator when a master record has no corresponding update record (a bad master record) or when an update record has no corresponding master record (a bad update record).
[Figure: sample record: 86111, JUN, "Lee, Mary", 345.98, Steve, 12.]
In order to collect bad update records from an update data set, you attach one output data set, called a reject data set, for each update data set. The presence of a reject data set configures the merge operator to write bad update records to the reject data set. In the case of a merge operator taking as input multiple update data sets, you must attach a reject data set for each update data set if you want to save bad update records. You cannot selectively collect bad update records from a subset of the update data sets.
By default, the merge operator issues a warning message when it encounters a bad update record. You can use the -nowarnBadUpdates option to the operator to suppress this warning.
For example, suppose you have a data set named National.ds that has one record per the key field Customer. You also have an update data set named California.ds, which also has one record per Customer. If you now merge these two data sets, and include a reject data set, bad update records are written to the reject data set for all customer records from California.ds that are not already in National.ds. If the reject data set is empty after the completion of the operator, it means that all of the California.ds customers already have National.ds records. The following diagram shows an example using a reject data set.
[Figure: data-flow diagram of a merge step producing combined.ds and the reject data set CalReject.ds.]
After this step executes, CalReject.ds contains all records from update data set that did not have a corresponding record in the master data set. The following diagram shows the merge operator with multiple update sets (U1 and U2) and reject data sets (R1 and R2). In the figure, M indicates the master data set and O indicates the merged output data set.
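A sketch of the command the diagram represents, assuming the Month and Customer merge keys used earlier; the second output data set attached to the merge becomes the reject data set:

$ osh "merge -key Month -key Customer < National.ds < California.ds > Combined.ds > CalReject.ds"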
[Figure: the merge operator with master data set M, update data sets U1 and U2, merged output O, and reject data sets R1 and R2.]
Modify operator
The modify operator takes a single data set as input and alters (modifies) the record schema of the input data set to create the output data set. The modify operator changes the representation of data before or after it is processed by another operator, or both. Use it to modify:
- Elements of the record schema of an input data set to the interface required by the operator to which it is input
- Elements of an operator's output to those required by the data set that receives the results
The operator performs the following modifications:
- Keeping and dropping fields
- Renaming fields
- Changing a field's data type
- Changing the null attribute of a field
The modify operator has no usage string.
modify: properties
Table 34. modify properties

Number of input data sets: 1
Number of output data sets: 1
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: propagated
Composite operator: no
where each modify_spec specifies a conversion you want to perform. "Performing Conversions" describes the conversions that the modify operator can perform.
- Enclose the list of modifications in single quotation marks.
- Separate the modifications with a semi-colon.
- If modify_spec takes more than one argument, separate the arguments with a comma and terminate the argument list with a semi-colon, as in the following example:
modify keep field1,field2, ... fieldn;
Multi-byte Unicode character data is supported for fieldnames in the modify specifications below. The modify_spec can be one of the following:
- DROP
- KEEP
- replacement_spec
- NOWARN
To drop a field:
DROP fieldname [, fieldname ...]
To keep a field:
KEEP fieldname [, fieldname ...]
To change the name or data type of a field, or both, specify a replacement-spec, which takes the form:
new-fieldname [:new-type] = [explicit-conversion-spec] old-fieldname
Replace the old field name with the new one. The default type of the new field is the same as that of the old field, unless it is specified by the output type of the conversion-spec, if one is provided. Multiple new fields can be instantiated based on the same old field.
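For example, a minimal sketch of a replacement specification that renames a field and converts its type in one step (field and data set names are hypothetical; int32_from_decimal is one of the decimal conversions described later in this section):

$ osh "modify 'balanceInt:int32 = int32_from_decimal(balance);' < in.ds > out.ds"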
When there is an attempt to put a null in a field that has not been defined as nullable, InfoSphere DataStage issues an error message and terminates the job. However, a warning is issued at step-check time. To disable the warning, specify the NOWARN option.
Transfer behavior
Fields of the input data set that are not acted on by the modify operator are transferred to the output data set unchanged. In addition, changes made to fields are permanent. Thus:
- If you drop a field from processing by means of the modify operator, it does not appear in the output of the operation for which you have dropped it.
- If you use an upstream modify operator to change the name, type, or both of an input field, the change is permanent in the output unless you restore the field name, type, or both by invoking the modify operator downstream of the operation for whose sake the field name was changed.
In the following example, the modify operator changes field names upstream of an operation and restores them downstream of the operation, as indicated in the following table.
Upstream of operator:
  Source Field Name    Destination Field Name
  aField               field1
  bField               field2
  cField               field3

Downstream of operator:
  Source Field Name    Destination Field Name
  field1               aField
  field2               bField
  field3               cField
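A sketch of this round-trip in osh, using the field names from the table and a hypothetical placeholder operator (some_op) standing in for the operation whose interface requires field1, field2, and field3:

$ osh "modify 'field1 = aField; field2 = bField; field3 = cField;' < in.ds |
       some_op |
       modify 'aField = field1; bField = field2; cField = field3;' > out.ds"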
Performing conversions
The section "Allowed Conversions" provides a complete list of conversions you can effect using the modify operator. This section discusses these topics:
- "Performing Conversions"
- "Keeping and Dropping Fields"
- "Renaming Fields"
- "Duplicating a Field and Giving It a New Name"
- "Changing a Field's Data Type"
- "Default Data Type Conversion"
- "Date Field Conversions"
- "Decimal Field Conversions"
- "Raw Field Length Extraction"
- "String and Ustring Field Conversions"
- "String Conversions and Lookup Tables"
- "Time Field Conversions"
- "Timestamp Field Conversions"
- "The modify Operator and Nulls"
- "The modify Operator and Partial Schemas"
- "The modify Operator and Vectors"
- "The modify Operator and Aggregate Schema Components"
Renaming fields
To rename a field specify the attribution operator (=) , as follows:
modify newField1=oldField1; newField2=oldField2; ...newFieldn=oldFieldn;
where:
- a and c are the original field names; a_1 and a_2 are the duplicated field names; c_1 and c_2 are the duplicated and converted field names
- conversionSpec is the data type conversion specification, discussed in the next section
[Figure: a modify operator whose input and output data sets both have the schema field1:int32; field2:int16; field3:sfloat;]
[Table: matrix of source field types (int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, ustring, raw, date, time, timestamp) against destination field types, indicating for each pair whether a default conversion exists (d), whether a modify-operator conversion is available (m), or both (dm).]
where:
- destField is the field in the output data set
- dataType optionally specifies the data type of the output field. This option is allowed only when the output data set does not already have a record schema, which is typically the case.
- sourceField specifies the field in the input data set
- conversionSpec specifies the data type conversion specification; you need not specify it if a default conversion exists (see "Default Data Type Conversion"). A conversion specification can be double quoted, single quoted, or not quoted, but it cannot be a variable.
Note that once you have used a conversion specification to perform a conversion, InfoSphere DataStage performs the necessary modifications to translate a conversion result to the numeric data type of the destination. For example, you can use the conversion hours_from_time to convert a time to an int8, or to an int16, int32, dfloat, and so on.
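For example, a minimal sketch using the hours_from_time conversion mentioned above (field and data set names are hypothetical); assigning the result to an int16 field relies on the default numeric translation described here:

$ osh "modify 'hourOfDay:int16 = hours_from_time(punchTime);' < in.ds > out.ds"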
day of month from date.
day of week from date. originDay is a string specifying the day considered to be day zero of the week. You can specify the day using either the first three characters of the day name or the full day name. If omitted, Sunday is defined as day zero. The originDay can be either single- or double-quoted or the quotes can be omitted.
day of year from date (returned value 1-366).
days since date. Returns a value corresponding to the number of days from source_date to the contents of dateField. source_date must be in the form yyyy-mm-dd and can be quoted or unquoted.
Julian day from date.
month from date.
next weekday from date. The destination contains the date of the specified day of the week soonest after the source date (including the source date). day is a string specifying a day of the week. You can specify day by either the first three characters of the day name or the full day name. The day can be quoted in either single or double quotes or quotes can be omitted.
previous weekday from date. The destination contains the closest date for the specified day of the week earlier than the source date (including the source date). The day is a string specifying a day of the week. You can specify day using either the first three characters of the day name or the full day name. The day can be either single- or double-quoted or the quotes can be omitted.
strings and ustrings from date. Converts the date to a string or ustring representation using the specified date_format. By default, the string format is yyyy-mm-dd. date_format and date_uformat are described in "Date formats".
timestamp from date. The time argument optionally specifies the time to be used in building the timestamp result and must be in the form hh:nn:ss. If omitted, the time defaults to midnight.
A date conversion to or from a numeric field can be specified with any InfoSphere DataStage numeric data type. InfoSphere DataStage performs the necessary modifications and either translates a numeric field to the source data type shown above or translates a conversion result to the numeric data type of the destination. For example, you can use the conversion month_day_from_date to convert a date to an int8, or to an int16, int32, dfloat, and so on
date formats
Four conversions, string_from_date, ustring_from_date, date_from_string, and date_from_ustring, take as a parameter of the conversion a date format or a date uformat. These formats are described below. The default format of the date contained in the string is yyyy-mm-dd. The format string requires that you provide enough information for InfoSphere DataStage to determine a complete date (either day, month, and year, or year and day of year).
date uformat
The date uformat provides support for international components in date fields. Its syntax is:
String%macroString%macroString%macroString
where %macro is a date formatting macro such as %mmm for a 3-character English month. See the following table for a description of the date format macros. Only the String components of date uformat can include multi-byte Unicode characters.
date format
The format string requires that you provide enough information for InfoSphere DataStage to determine a complete date (either day, month, and year, or year and day of year). The format_string can contain one or a combination of the following elements:
Table 35. Date format tags

%d: Day of month, variable width. Value range: 1...31. Options: s. Variable-width availability: import.
%dd: Day of month, fixed width. Value range: 01...31. Options: s.
%ddd: Day of year. Value range: 1...366. Options: s, v. Variable-width availability: with v option.
%m: Month of year, variable width. Value range: 1...12. Options: s. Variable-width availability: import.
Table 35. Date format tags (continued)

%mm: Month of year, fixed width. Value range: 01...12. Options: s.
%mmm: Month of year, short name, locale specific. Values: Jan, Feb ... Options: t, u, w.
%mmmm: Month of year, full name, locale specific. Values: January, February ... Options: t, u, w, -N, +N.
%yy: Year of century. Value range: 00...99. Options: s.
%yyyy: Four digit year. Value range: 0001...9999.
%NNNNyy: Cutoff year plus year of century. Value range: yy = 00...99. Options: s.
%e: Day of week, Sunday = day 1. Value range: 1...7.
%E: Day of week, Monday = day 1. Value range: 1...7.
%eee: Weekday short name, locale specific. Values: Sun, Mon ... Options: t, u, w.
%eeee: Weekday long name, locale specific. Values: Sunday, Monday ... Options: t, u, w, -N, +N.
%W: Week of year (ISO 8601, Mon). Value range: 1...53. Options: s.
%WW: Week of year (ISO 8601, Mon). Value range: 01...53. Options: s.
When you specify a date format string, prefix each component with the percent symbol (%) and separate the string's components with a suitable literal character. The default date_format is %yyyy-%mm-%dd.
Where indicated the tags can represent variable-width data elements. Variable-width date elements can omit leading zeroes without causing errors.
The following options can be used in the format string where indicated in the table:
s: Specify this option to allow leading spaces in date formats. The s option is specified in the form %(tag,s), where tag is the format string. For example, %(m,s) indicates a numeric month of year field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the following date format property:
%(d,s)/%(m,s)/%yyyy
then the following dates would all be valid:
8/ 8/1958
08/08/1958
8/8/1958
v: Use this option in conjunction with the %ddd tag to represent day of year in variable-width format. So the date property %(ddd,v) represents values in the range 1 to 366. (If you omit the v option then the range of values would be 001 to 366.)
u: Use this option to render uppercase text on output.
w: Use this option to render lowercase text on output.
t: Use this option to render titlecase text (initial capitals) on output.
The u, w, and t options are mutually exclusive. They affect how text is formatted for output. Input dates will still be correctly interpreted regardless of case.
-N: Specify this option to left justify long day or month names so that the other elements in the date will be aligned.
+N: Specify this option to right justify long day or month names so that the other elements in the date will be aligned.
Names are left justified or right justified within a fixed width field of N characters (where N is between 1 and 99). Names will be truncated if necessary. The following are examples of justification in use:
%dd-%(mmmm,-5)-%yyyy produces 21-Augus-2005
%dd-%(mmmm,-10)-%yyyy produces 21-August    -2005
%dd-%(mmmm,+10)-%yyyy produces 21-    August-2005
The locale for determining the setting of the day and month names can be controlled through the locale tag. This has the format:
%(L,locale)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See IBM InfoSphere DataStage and QualityStage Globalization Guide for a list of locales. The default locale for month names and weekday names markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set).
Use the locale tag in conjunction with your time format, for example the format string:
%(L,'es')%eeee, %dd %mmmm %yyyy
specifies the Spanish locale and would result in a date with the following format:
miércoles, 21 septiembre 2005
The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the 'incompatible with' column. When some tags are used the format string requires that other tags are present too, as indicated in the 'requires' column.
Table 36. Format tag restrictions

year: numeric tags %yyyy, %yy, %[nnnn]yy; incompatible with week of year
month: numeric tags %mm, %m; text tags %mmm, %mmmm; requires year; incompatible with day of week, week of year
day of month: numeric tags %dd, %d; requires month
day of year: numeric tag %ddd; requires year; incompatible with day of month, day of week, week of year
day of week: numeric tags %e, %E; text tags %eee, %eeee; requires month, week of year; incompatible with day of year
week of year: numeric tag %WW; requires year; incompatible with month, day of month, day of year
When a numeric variable-width input tag such as %d or %m is used, the field to the immediate right of the tag (if any) in the format string cannot be either a numeric tag, or a literal substring that starts with a digit. For example, all of the following format strings are invalid because of this restriction:
%d%m-%yyyy
%d%mm-%yyyy
%(d)%(mm)-%yyyy
%h00 hours
The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029. On import and export, the year_cutoff is the base year. This property is mutually exclusive with days_since, text, and julian.
You can include literal text in your date format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:
%%: literal percent sign
\%: literal percent sign
\n: newline
\t: horizontal tab
\\: single backslash
For example, the format string %mm/%dd/%yyyy specifies that slashes separate the string's date components; the format %ddd-%yy specifies that the string stores the date as a value from 1 to 366, derives the year from the current year cutoff of 1900, and separates the two components with a dash (-).
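A minimal sketch of applying such a format string when parsing a date from a string field (field and data set names are hypothetical; date_from_string is described with the string conversions later in this section):

$ osh "modify 'orderDate:date = date_from_string[%mm/%dd/%yyyy](orderDateStr);' < in.ds > out.ds"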
The diagram shows the modification of a date field to three integers. The modify operator takes:
- The day of the month portion of a date field and writes it to an 8-bit integer
- The month portion of a date field and writes it to an 8-bit integer
- The year portion of a date field and writes it to a 16-bit integer
[Figure: a modify operator converting a date field into three integer fields (day of month, month, and year).]
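A sketch of the modification the figure describes, using hypothetical field and data set names and assuming the month_day_from_date, month_from_date, and year_from_date conversion names:

$ osh "modify 'dayField:int8 = month_day_from_date(dField);
               monthField:int8 = month_from_date(dField);
               yearField:int16 = year_from_date(dField);' < in.ds > out.ds"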
dfloat from decimal: dfloatField = dfloat_from_decimal[fix_zero](decimalField)
dfloat from decimal: dfloatField = mantissa_from_decimal(decimalField)
dfloat from dfloat: dfloatField = mantissa_from_dfloat(dfloatField)
int32 from decimal: int32Field = int32_from_decimal[r_type, fix_zero](decimalField)
int64 from decimal: int64Field = int64_from_decimal[r_type, fix_zero](decimalField)
string from decimal: stringField = string_from_decimal[fix_zero][suppress_zero](decimalField)
ustring from decimal: ustringField = ustring_from_decimal[fix_zero][suppress_zero](decimalField)
uint64 from decimal: uint64Field = uint64_from_decimal[r_type, fix_zero](decimalField)
A decimal conversion to or from a numeric field can be specified with any InfoSphere DataStage numeric data type. InfoSphere DataStage performs the necessary modification. For example, int32_from_decimal converts a decimal either to an int32 or to any numeric data type, such as int16 or uint32.
The scaled decimal from int64 conversion takes an integer field and converts the field to a decimal of the specified precision (p) and scale (s) by dividing the field by 10 raised to the power s. For example, the conversion:
Decfield:decimal[8,2]=scaled_decimal_from_int64(intfield)
where intfield = 12345678 would set the value of Decfield to 123456.78. The fix_zero specification causes a decimal field containing all zeros (normally illegal) to be treated as a valid zero. Omitting fix_zero causes InfoSphere DataStage to issue a conversion error when it encounters a decimal field containing all zeros. "Data Type Conversion Errors" discusses conversion errors. The suppress_zero argument specifies that the returned string value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1
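A minimal sketch (hypothetical field and data set names) that applies suppress_zero when converting a decimal to a string:

$ osh "modify 'amountStr:string = string_from_decimal[suppress_zero](amount);' < in.ds > out.ds"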
rounding type
You can optionally specify a value for the rounding type (r_type) of many conversions. The values of r_type are:
- ceil: Round the source field toward positive infinity. This mode corresponds to the IEEE 754 Round Up mode. Examples: 1.4 -> 2, -1.6 -> -1
- floor: Round the source field toward negative infinity. This mode corresponds to the IEEE 754 Round Down mode. Examples: 1.6 -> 1, -1.4 -> -2
- round_inf: Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. This mode corresponds to the COBOL ROUNDED mode. Examples: 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
- trunc_zero (default): Discard any fractional digits to the right of the right-most fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Examples: 1.6 -> 1, -1.6 -> -1
The diagram shows the conversion of a decimal field to a 32-bit integer with a rounding mode of ceil rather than the default mode of truncate to zero:
field1 = int32_from_decimal[ceil](dField);
[Figure: a modify operator applying this conversion (with the fix_zero argument) to its input data set.]
where fix_zero ensures that a source decimal containing all zeros is treated as a valid representation.
Use the following osh commands to specify the raw_length conversion of a field:
$ modifySpec="field1 = raw_length(aField); field2 = aField;"
$ osh " ... | modify $modifySpec |... "
Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.
rawField = raw_from_string(string): Returns string in raw representation.
rawField = u_raw_from_string(ustring): Returns ustring in raw representation.
int32Field = raw_length(raw): Returns the length of the raw field.
string_trim: You can use this function to remove the characters used to pad variable-length strings when they are converted to fixed-length strings of greater length. By default, these characters are retained when the fixed-length string is then converted back to a variable-length string. The character argument is the character to remove. It defaults to NULL. The value of the direction and justify arguments can be either begin or end; direction defaults to end, and justify defaults to begin. justify has no effect when the target string has variable length.
Examples:
name:string = string_trim[NULL, begin](name) removes all leading ASCII NULL characters from the beginning of name and places the remaining characters in an output variable-length string with the same name.
hue:string[10] = string_trim['Z', end, begin](color) removes all trailing Z characters from color, and left justifies the resulting hue fixed-length string.
Copies parts of strings and ustrings to shorter strings by string extraction. The starting_position specifies the starting location of the substring; length specifies the substring length. The arguments starting_position and length are uint16 types and must be positive (>= 0).
stringField = lookup_string_from_int16[tableDefinition](int16Field)
ustringField = lookup_ustring_from_int16[tableDefinition](int16Field)
int16Field = lookup_int16_from_string[tableDefinition](stringField)
int16Field = lookup_int16_from_ustring[tableDefinition](ustringField)
uint32Field = lookup_uint32_from_string[tableDefinition](stringField)
uint32Field = lookup_uint32_from_ustring[tableDefinition](ustringField)
stringField = lookup_string_from_uint32[tableDefinition](uint32Field)
ustringField = lookup_ustring_from_uint32[tableDefinition](uint32Field)
stringField = string_from_ustring(ustring): Converts ustrings to strings.
ustringField = ustring_from_string(string): Converts strings to ustrings.
decimalField = decimal_from_string(stringField): Converts strings to decimals.
decimalField = decimal_from_ustring(ustringField): Converts ustrings to decimals.
Converts decimals to strings. fix_zero causes a decimal field containing all zeros to be treated as a valid zero. suppress_zero specifies that the returned string value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1
Converts decimals to ustrings. See string_from_decimal above for a description of the fix_zero and suppress_zero arguments.

date from string or ustring:
dateField = date_from_string[date_format | date_uformat](stringField)
dateField = date_from_ustring[date_format | date_uformat](ustringField)
Converts the string or ustring field to a date representation using the specified date_format or date_uformat. By default, the string format is yyyy-mm-dd. date_format and date_uformat are described in "date Formats".
strings and ustrings from date:
stringField = string_from_date[date_format | date_uformat](dateField)
ustringField = ustring_from_date[date_format | date_uformat](dateField)
Converts the date to a string or ustring representation using the specified date_format or date_uformat. By default, the string format is yyyy-mm-dd. date_format and date_uformat are described in "date Formats".

string length:
int32Field = string_length(stringField)
int32Field = ustring_length(ustringField)
Returns an int32 containing the length of a string or ustring.

substring:
stringField = substring[startPosition,len](stringField)
ustringField = substring[startPosition,len](ustringField)
Converts long strings/ustrings to shorter strings/ustrings by string extraction. The startPosition specifies the starting location of the substring; len specifies the substring length. If startPosition is positive, it specifies the byte offset into the string from the beginning of the string. If startPosition is negative, it specifies the byte offset from the end of the string.

uppercase:
stringField = uppercase_string(stringField)
ustringField = uppercase_ustring(ustringField)
Convert strings and ustrings to all uppercase. Non-alphabetic characters are ignored in the conversion.

lowercase:
stringField = lowercase_string(stringField)
ustringField = lowercase_ustring(ustringField)
Convert strings and ustrings to all lowercase. Non-alphabetic characters are ignored in the conversion.
string and ustring from time:
stringField = string_from_time[time_format | time_uformat](timeField)
ustringField = ustring_from_time[time_format | time_uformat](timeField)
Converts the time to a string or ustring representation using the specified time_format or time_uformat. The time_format options are described below.
strings and ustrings from timestamp:
stringField = string_from_timestamp[timestamp_format | timestamp_uformat](tsField)
ustringField = ustring_from_timestamp[timestamp_format | timestamp_uformat](tsField)
Converts the timestamp to a string or ustring representation using the specified timestamp_format or timestamp_uformat. By default, the string format is %yyyy-%mm-%dd hh:mm:ss. The timestamp_format and timestamp_uformat options are described in "timestamp Formats".
timestamp from strings and ustrings:
tsField = timestamp_from_string[timestamp_format | timestamp_uformat](stringField)
tsField = timestamp_from_ustring[timestamp_format | timestamp_uformat](ustringField)
Converts the string or ustring to a timestamp representation using the specified timestamp_format or timestamp_uformat. By default, the string format is yyyy-mm-dd hh:mm:ss. The timestamp_format and timestamp_uformat options are described in "timestamp Formats".
time from ustring:
timeField = time_from_ustring[time_format | time_uformat](ustringField)
Converts the string or ustring to a time representation using the specified time_format or time_uformat. The time_uformat options are described below.
The diagram shows a modification that converts the name of aField to field1 and produces field2 from bField by extracting the first eight bytes of bField:
[Figure: a modify operator renaming aField to field1 and producing field2 from the first eight bytes of bField.]
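A sketch of the modification the figure describes, assuming the substring conversion is used to take the first eight bytes of bField (data set names are hypothetical):

$ osh "modify 'field1 = aField; field2:string[8] = substring[0,8](bField);' < in.ds > out.ds"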
The diagram shows the extraction of the string_length of aField. The length is included in the output as field1.
[Figure: a modify operator extracting the string_length of aField into field1.]
The following osh commands extract the length of the string in aField and place it in field1 of the output:
$ modifySpec="field1 = string_length(aField); field2 = aField;"
$ osh " ... | modify $modifySpec |... "
Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.
Each row of the lookup table specifies an association between a 16-bit integer or unsigned 32-bit integer value and a string or ustring. InfoSphere DataStage scans the Numeric Value or the String or Ustring column until it encounters the value or string to be translated. The output is the corresponding entry in the row.
The numeric value to be converted might be of the int16 or the uint32 data type. InfoSphere DataStage converts strings to values of the int16 or uint32 data type using the same table.
If the input contains a numeric value or string that is not listed in the table, InfoSphere DataStage operates as follows:
- If a numeric value is unknown, an empty string is returned by default. However, you can set a default string value to be returned by the string lookup table.
- If a string has no corresponding value, 0 is returned by default. However, you can set a default numeric value to be returned by the string lookup table.
Here are the options and arguments passed to the modify operator to create a lookup table:
OR:
intField = lookup_uint32_from_string[tableDefinition](source_stringField); |
intField = lookup_uint32_from_ustring[tableDefinition](source_ustringField);

stringField = lookup_string_from_int16[tableDefinition](source_intField); |
ustringField = lookup_ustring_from_int16[tableDefinition](source_intField);
OR:
stringField = lookup_string_from_uint32[tableDefinition] ( source_intField ); ustringField = lookup_ustring_from_uint32[tableDefinition] (source_intField);
where: tableDefinition defines the rows of a string or ustring lookup table and has the following form:
{propertyList} (string | ustring = value; string | ustring= value; ... )
where:
- propertyList is one or more of the following options; the entire list is enclosed in braces and properties are separated by commas if there are more than one:
  case_sensitive: perform a case-sensitive search for matching strings; the default is case-insensitive.
  default_value = defVal: the default numeric value returned for a string that does not match any of the strings in the table.
  default_string = defString: the default string returned for numeric values that do not match any numeric value in the table.
- string or ustring specifies a comma-separated list of strings or ustrings associated with value; enclose each string or ustring in quotes.
- value specifies a comma-separated list of 16-bit integer values associated with string or ustring.
The diagram shows an operator and data set requiring type conversion:
[Figure: a data set whose schema defines gender as a string, flowing through a modify operator into an operator (sample) whose interface defines gender as an 8-bit integer.]
Whereas gender is defined as a string in the input data set, the sample operator shown in the figure defines the field as an 8-bit integer. The default conversion operation cannot work in this case, because by default InfoSphere DataStage converts a string to a numeric representation and gender does not contain the character representation of a number. Instead the gender field contains the string values "male", "female", "m", or "f". You must therefore specify a string lookup table to perform the modification. The gender lookup table required by the example shown above is shown in the following table:
Numeric Value    String
0                "f"
0                "female"
1                "m"
1                "male"
The value f or female is translated to a numeric value of 0; the value m or male is translated to a numeric value of 1. The following osh code performs the conversion:
modify gender = lookup_int16_from_string[{default_value = 2} (f = 0; female = 0; m = 1; male = 1;)] (gender);
In this example, gender is the name of both the source and the destination fields of the translation. In addition, the string lookup table defines a default value of 2; if gender contains a string that is not one of "f", "female", "m", or "male", the lookup table returns a value of 2.
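Following the shell-variable pattern used elsewhere in this chapter, the complete invocation might look like the following sketch (the surrounding job steps are elided):

$ modifySpec="gender = lookup_int16_from_string[{default_value = 2}
              (f = 0; female = 0; m = 1; male = 1;)] (gender);"
$ osh " ... | modify $modifySpec | ... "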
timeField = time_from_ustring [time_format | time_uformat] (ustringField)
    Converts the string or ustring to a time representation using the specified time_format or time_uformat. The time format options are described below.

timeField = time_from_timestamp(tsField)
    time from timestamp.

tsField = timestamp_from_time [date](timeField)
    timestamp from time. The date argument is required. It specifies the date portion of the timestamp and must be in the form yyyy-mm-dd.
Time conversion to a numeric field can be used with any InfoSphere DataStage numeric data type. InfoSphere DataStage performs the necessary modifications to translate a conversion result to the numeric data type of the destination. For example, you can use the conversion hours_from_time to convert a time to an int8, or to an int16, int32, dfloat, and so on.
time Formats
Four conversions, string_from_time, ustring_from_time, time_from_string, and time_from_ustring, take as a parameter of the conversion a time format or a time uformat. These formats are described below. The default format of the time contained in the string is hh:mm:ss.
time Uformat
The time uformat provides support for international components in time fields. Its syntax is:

String%macroString%macroString%macroString
where %macro is a time formatting macro such as %hh for a two-digit hour. See "time Format" below for a description of the time format macros. Only the String components of time uformat can include multi-byte Unicode characters.
time Format
The string_from_time and time_from_string conversions take a format as a parameter of the conversion. The default format of the time in the string is hh:mm:ss. However, you can specify an optional format string defining the time format of the string field. The format string must contain a specification for hours, minutes, and seconds. The possible components of the time_format string are given in the following table:
Table 37. Time format tags

Tag       Variable width availability   Description                       Value range   Options
%h        import                        Hour (24), variable width         0...23        s
%hh                                     Hour (24), fixed width            0...23        s
%H        import                        Hour (12), variable width         1...12        s
%HH                                     Hour (12), fixed width            01...12       s
%n        import                        Minutes, variable width           0...59        s
%nn                                     Minutes, fixed width              0...59        s
%s        import                        Seconds, variable width           0...59        s
%ss                                     Seconds, fixed width              0...59        s
%s.N      import                        Seconds + fraction (N = 0...6)                  s, c, C
%ss.N                                   Seconds + fraction (N = 0...6)                  s, c, C
%SSS      with v option                 Milliseconds                      0...999       s, v
%SSSSSS   with v option                 Microseconds                      0...999999    s, v
%aa       German                        am/pm marker, locale specific     am, pm        u, w
By default, the format of the time contained in the string is %hh:%nn:%ss. However, you can specify a format string defining the format of the string field. You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).

Where indicated the tags can represent variable-width fields on import, export, or both. Variable-width time elements can omit leading zeroes without causing errors.

The following options can be used in the format string where indicated:

s   Specify this option to allow leading spaces in time formats. The s option is specified in the form:
    %(tag,s)
    Where tag is the format string. For example:
    %(n,s)
    indicates a minute field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the following time format property:
    %(h,s):%(n,s):%(s,s)
    Then the following times would all be valid:
    20: 6:58
    20:06:58
    20:6:58

v   Use this option in conjunction with the %SSS or %SSSSSS tags to represent milliseconds or microseconds in variable-width format. So the time property:
    %(SSS,v)
    represents values in the range 0 to 999. (If you omit the v option then the range of values would be 000 to 999.)

u   Use this option to render the am/pm text in uppercase on output.

w   Use this option to render the am/pm text in lowercase on output.

c   Specify this option to use a comma as the decimal separator in the %ss.N tag.

C   Specify this option to use a period as the decimal separator in the %ss.N tag.
The c and C options override the default setting of the locale. The locale for determining the setting of the am/pm string and the default decimal separator can be controlled through the locale tag. This has the format:
%(L,locale)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See IBM InfoSphere DataStage and QualityStage Globalization Guide for a list of locales. The default locale for am/pm strings and separator markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set).

Use the locale tag in conjunction with your time format, for example:

%L('es')%HH:%nn %aa

specifies the Spanish locale.

The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the 'incompatible with' column. When some tags are used the format string requires that other tags are present too, as indicated in the 'requires' column.
Table 38. Format tag restrictions

Element                Numeric format tags           Text format tags   Requires     Incompatible with
hour                   %hh, %h, %HH, %H              -                  -            -
am/pm marker           -                             %aa                hour (%HH)   hour (%hh)
minute                 %nn, %n                       -                  -            -
second                 %ss, %s                       -                  -            -
fraction of a second   %ss.N, %s.N, %SSS, %SSSSSS    -                  -            -
You can include literal text in your time format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:
Tag   Escape sequence
%%    literal percent sign
\%    literal percent sign
\n    newline
\t    horizontal tab
\\    single backslash
The following osh code converts the hours portion of tField to the int8 hoursField and the minutes portion to the int8 minField:
modify hoursField = hours_from_time(tField); minField = minutes_from_time(tField);
dfloatField = seconds_since_from_timestamp [timestamp](tsField)
    seconds_since from timestamp.

tsField = timestamp_from_seconds_since [timestamp](dfloatField)
    timestamp from seconds_since.

stringField = string_from_timestamp [timestamp_format | timestamp_uformat](tsField)
ustringField = ustring_from_timestamp [timestamp_format | timestamp_uformat](tsField)
    strings and ustrings from timestamp. Converts the timestamp to a string or ustring representation using the specified timestamp_format or timestamp_uformat. By default, the string format is %yyyy-%mm-%dd hh:mm:ss. The timestamp_format and timestamp_uformat options are described in "timestamp Formats".

int32Field = timet_from_timestamp(tsField)
    timet_from_timestamp. int32Field contains a timestamp as defined by the UNIX time_t representation.

dateField = date_from_timestamp(tsField)
    date from timestamp. Converts the timestamp to a date representation.

tsField = timestamp_from_string [timestamp_format | timestamp_uformat](stringField)
tsField = timestamp_from_ustring [timestamp_format | timestamp_uformat](ustringField)
    timestamp from strings and ustrings. Converts the string or ustring to a timestamp representation using the specified timestamp_format. By default, the string format is yyyy-mm-dd hh:mm:ss. The timestamp_format and timestamp_uformat options are described in "timestamp Formats".

tsField = timestamp_from_timet(int32Field)
    timestamp from time_t. int32Field must contain a timestamp as defined by the UNIX time_t representation.

tsField = timestamp_from_date [time](dateField)
    timestamp from date. The time argument optionally specifies the time to be used in building the timestamp result and must be in the form hh:mm:ss. If omitted, the time defaults to midnight.

tsField = timestamp_from_time [date](timeField)
    timestamp from time. The date argument is required. It specifies the date portion of the timestamp and must be in the form yyyy-mm-dd.

tsField = timestamp_from_date_time(date, time)
    timestamp from date and time. Returns a timestamp from date and time. The date specifies the date portion (yyyy-mm-dd) of the timestamp. The time argument specifies the time to be used when building the timestamp and must be in the hh:mm:ss format.
Timestamp conversion of a numeric field can be used with any InfoSphere DataStage numeric data type. InfoSphere DataStage performs the necessary conversions to translate a conversion result to the numeric data type of the destination. For example, you can use the conversion timet_from_timestamp to convert a timestamp to an int32, dfloat, and so on.
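For example, a minimal sketch (the field names tsField and secsField are hypothetical) that stores the UNIX time_t value of a timestamp in an int32 field:

$ modifySpec="secsField:int32 = timet_from_timestamp(tsField);"
$ osh " ... | modify $modifySpec | ... "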
timestamp formats
The string_from_timestamp, ustring_from_timestamp, timestamp_from_string, and timestamp_from_ustring conversions take a timestamp format or timestamp uformat argument. The default format of the timestamp contained in the string is yyyy-mm-dd hh:mm:ss. However, you can specify an optional format string defining the data format of the string field.
timestamp format
The format options of timestamp combine the formats of the date and time data types. The default timestamp format is as follows:
%yyyy-%mm-%dd %hh:%nn:%ss
timestamp uformat
For timestamp uformat, concatenate the date uformat with the time uformat. The two formats can be in any order, and their components can be mixed. These formats are described in "date Uformat" under "Date Field Conversions" and "time Uformat" under "Time Field Conversions".

The following diagram shows the conversion of a date field to a timestamp field. As part of the conversion, the operator sets the time portion of the timestamp to 10:00:00.
tsField = timestamp_from_date[10:00:00](dField);
To specify the timestamp_from_date conversion and set the time to 10:00:00, use the following osh command:
modify tsField=timestamp_from_date[10:00:00](dField);
The modify operator can change a null representation from an out-of-band null to an in-band null and from an in-band null to an out-of-band null. The record schema of an operator's input or output data set can contain fields defined to support out-of-band nulls. In addition, fields of an operator's interface might also be defined to support out-of-band nulls. The next table lists the rules for handling nullable fields when an operator takes a data set as input or writes to a data set as output.
Source Field    Destination Field    Result
not_nullable    not_nullable         Source value propagates to destination.
not_nullable    nullable             Source value propagates; destination value is never null.
nullable        not_nullable         If the source value is not null, the source value propagates.
                                     If the source value is null, a fatal error occurs, unless you
                                     apply the modify operator, as in "Out-of-Band to Normal
                                     Representation".
nullable        nullable             Source value or null propagates.
To change an out-of-band null to an in-band null, use the handle_null conversion, which (as the example below shows) takes the form:

destField[:dataType] = handle_null(sourceField, value)

where:
v destField is the destination field's name.
v dataType is its optional data type; use it if you are also converting types.
v sourceField is the source field's name.
v value is the value you wish to represent a null in the output.

The destField is converted from an InfoSphere DataStage out-of-band null to a value of the field's data type. For a numeric field value can be a numeric value; for decimal, string, time, date, and timestamp fields, value can be a string.

Conversion specifications are described in:
"Date Field Conversions"
"Decimal Field Conversions"
"String and Ustring Field Conversions"
"Time Field Conversions"
"Timestamp Field Conversions"

For example, the diagram shows the modify operator converting the InfoSphere DataStage out-of-band null representation in the input to an output value that is written when a null is encountered:
While in the input fields a null takes the InfoSphere DataStage out-of-band representation, in the output a null in aField is represented by -128 and a null in bField is represented by ASCII XXXX (0x58 in all bytes). To make the output aField contain a value of -128 whenever the input contains an out-of-band null, and the output bField contain a value of 'XXXX' whenever the input contains an out-of-band null, use the following osh code:
$ modifySpec = "aField = handle_null(aField, -128); bField = handle_null(bField, XXXX); " $ osh " ... | modify $modifySpec | ... "
Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.
To change an in-band null to an out-of-band null, use the make_null conversion, which (as the example below shows) takes the form:

destField[:dataType] = make_null(sourceField, value)

where:
v destField is the destination field's name.
v dataType is its optional data type; use it if you are also converting types.
v sourceField is the source field's name.
v value is the value of the source field when it is null.

When the source field contains value, the destination field is set to the InfoSphere DataStage out-of-band null. For a numeric field value can be a numeric value; for decimal, string, time, date, and timestamp fields, value can be a string.

For example, the diagram shows a modify operator converting the value representing a null in an input field (-128 or 'XXXX') to the InfoSphere DataStage single-bit null representation in the corresponding field of the output data set:
In the input a null value in aField is represented by -128 and a null value in bField is represented by ASCII XXXX, but in both output fields a null value is represented by InfoSphere DataStage's single bit. The following osh syntax causes the aField of the output data set to be set to the InfoSphere DataStage single-bit null representation if the corresponding input field contains -128 (in-band-null), and the bField of the output to be set to InfoSphere DataStage's single-bit null representation if the corresponding input field contains 'XXXX' (in-band-null).
$modifySpec = "aField = make_null(aField, -128); bField = make_null(bField, XXXX); " $ osh " ... | modify $modifySpec | ... "
Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator.
By default, the data type of the destination field is int8. Specify a different destination data type to override this default. InfoSphere DataStage issues a warning if the source field is not nullable or the destination field is nullable.
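For example, assuming this default applies to the notnull conversion listed under "Allowed conversions" (the field names here are hypothetical), a sketch that writes an int8 flag indicating whether aField contains a value:

$ modifySpec="isSetField = notnull(aField);"
$ osh " ... | modify $modifySpec | ... "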
[Diagram: the modify operator's interface schema in this example is:
fName:string; lName:string;
subField:subrec(subF1:int32; subF2:sfloat;);
tagField:tagged(tagF1:int32; tagF2:string;)]
In this example, purchase contains an item number and a price for a purchased item; date contains the date of purchase represented as either an integer or a string. You must translate the aggregate purchase to the interface component subField and the tagged component date to tagField.

To translate aggregates:
1. Translate the aggregate of an input data set to an aggregate of the output. To translate purchase, the corresponding output component must be a compatible aggregate type. The type is subrecord and the component is subField. The same principle applies to the elements of the subrecord.
2. Translate the individual fields of the data set's aggregate to the individual fields of the operator's aggregate.
If multiple elements of a tagged aggregate in the input are translated, they must all be bound to members of a single tagged component of the output's record schema. That is, all elements of tagField must be bound to a single aggregate in the input. Here is the osh code to rename purchase.price to subField.subF2.
$ modifySpec = "subField = purchase; subField.subF1 = purchase.itemNum; subField.subF2 = purchase.price; tagField = date; tagField.tagF1 = date.intDate; tagField.tagF2 = date.stringDate; ); " $ osh " ... | modify $ modifySpec | ... " # translate aggregate # translate field # translate field
Notice that a shell variable (modifySpec) has been defined containing the specifications passed to the operator. Aggregates might contain nested aggregates. When you translate nested aggregates, all components at one nesting level in an input aggregate must be bound to all components at one level in an output aggregate. The table shows sample input and output data sets containing nested aggregates. In the input data set, the record purchase contains a subrecord description containing a description of the item:
Schema 1 (for input data set):
  Level 0:  purchase: subrec (
  Level 1:    itemNum: int32;
  Level 1:    price: sfloat;
  Level 1:    description: subrec (
  Level 2:      color: int32;
  Level 2:      size: int8; ); );

Schema 2 (for output data set):
  Level n:    ... subField (
  Level n+1:    subF1;
  Level n+1:    subF2; ); ...
Note that:
v itemNum and price are at the same nesting level in purchase.
v color and size are at the same nesting level in purchase.
v subF1 and subF2 are at the same nesting level in subField.

You can bind:
v purchase.itemNum and purchase.price (both level 1) to subField.subF1 and subField.subF2, respectively
v purchase.description.color and purchase.description.size (both level 2) to subField.subF1 and subField.subF2, respectively

You cannot bind two elements of purchase at different nesting levels to subF1 and subF2. Therefore, you cannot bind itemNum (level 1) to subF1 and size (level 2) to subF2.

Note: InfoSphere DataStage features several operators that modify the record schema of the input data set and the level of fields within records. Two of them act on tagged subrecords. See the topic on the restructure operators.
Allowed conversions
The table lists all allowed data type conversions arranged alphabetically. The form of each listing is:
conversion_name (source_type, destination_type)

Conversion Specification:
date_from_days_since (int32, date)
date_from_julian_day (int32, date)
date_from_string (string, date)
date_from_timestamp (timestamp, date)
date_from_ustring (ustring, date)
days_since_from_date (date, int32)
decimal_from_decimal (decimal, decimal)
decimal_from_dfloat (dfloat, decimal)
decimal_from_string (string, decimal)
decimal_from_ustring (ustring, decimal)
dfloat_from_decimal (decimal, dfloat)
hours_from_time (time, int8)
int32_from_decimal (decimal, int32)
int64_from_decimal (decimal, int64)
julian_day_from_date (date, uint32)
lookup_int16_from_string (string, int16)
lookup_int16_from_ustring (ustring, int16)
lookup_string_from_int16 (int16, string)
lookup_string_from_uint32 (uint32, string)
lookup_uint32_from_string (string, uint32)
lookup_uint32_from_ustring (ustring, uint32)
lookup_ustring_from_int16 (int16, ustring)
lookup_ustring_from_uint32 (uint32, ustring)
lowercase_string (string, string)
lowercase_ustring (ustring, ustring)
mantissa_from_dfloat (dfloat, dfloat)
mantissa_from_decimal (decimal, dfloat)
microseconds_from_time (time, int32)
midnight_seconds_from_time (time, dfloat)
minutes_from_time (time, int8)
month_day_from_date (date, int8)
month_from_date (date, int8)
next_weekday_from_date (date, date)
notnull (any, int8)
null (any, int8)
previous_weekday_from_date (date, date)
raw_from_string (string, raw)
raw_length (raw, int32)
seconds_from_time (time, dfloat)
seconds_since_from_timestamp (timestamp, dfloat)
string_from_date (date, string)
string_from_decimal (decimal, string)
string_from_time (time, string)
string_from_timestamp (timestamp, string)
string_from_ustring (ustring, string)
string_length (string, int32)
substring (string, string)
time_from_midnight_seconds (dfloat, time)
time_from_string (string, time)
time_from_timestamp (timestamp, time)
time_from_ustring (ustring, time)
timestamp_from_date (date, timestamp)
timestamp_from_seconds_since (dfloat, timestamp)
timestamp_from_string (string, timestamp)
timestamp_from_time (time, timestamp)
timestamp_from_timet (int32, timestamp)
timestamp_from_ustring (ustring, timestamp)
timet_from_timestamp (timestamp, int32)
uint64_from_decimal (decimal, uint64)
uppercase_string (string, string)
uppercase_ustring (ustring, ustring)
u_raw_from_string (ustring, raw)
ustring_from_date (date, ustring)
ustring_from_decimal (decimal, ustring)
ustring_from_string (string, ustring)
ustring_from_time (time, ustring)
ustring_from_timestamp (timestamp, ustring)
ustring_length (ustring, int32)
u_substring (ustring, ustring)
weekday_from_date (date, int8)
year_day_from_date (date, int16)
year_from_date (date, int16)
year_week_from_date (date, int8)
pcompress operator
The pcompress operator uses the UNIX compress utility to compress or expand a data set. The operator converts an InfoSphere DataStage data set from a sequence of records into a stream of raw binary data; conversely, the operator reconverts the data stream into an InfoSphere DataStage data set.
[Diagram: in compress mode, pcompress (input interface schema in:*) converts an input data set to a compressed data set; in expand mode, pcompress (output interface schema out:*) converts a compressed data set back to a data set.]
pcompress: properties
Table 39. pcompress operator

Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Input interface schema                          mode = compress: in:*;
Output interface schema                         mode = expand: out:*;
Transfer behavior                               in -> out without record modification for a compress/decompress cycle
Execution mode                                  parallel (default) or sequential
Partitioning method                             mode = compress: any; mode = expand: same
Collection method                               any
Preserve-partitioning flag in output data set   mode = compress: sets; mode = expand: propagates
Composite operator                              yes: APT_EncodeOperator
Table 40. Pcompress options (continued)

-expand
    This option puts the operator in expand mode. The operator takes a compressed data set as input and produces an uncompressed data set as output.

-command "compress" | "gzip"
    Optionally specifies the UNIX command to be used to perform the compression or expansion.
    When you specify "compress" the operator uses the UNIX command, compress -f, for compression and the UNIX command, uncompress, for expansion.
    When you specify "gzip", the operator uses the UNIX command, gzip -, for compression and the UNIX command, gzip -d -, for expansion.
The default mode of the operator is -compress, which takes a data set as input and produces a compressed version as output. Specifying -expand puts the command in expand mode, which takes a compressed data set as input and produces an uncompressed data set as output.
For an expand operation, the operator takes as input a previously compressed data set. If the preserve-partitioning flag in this data set is not set, InfoSphere DataStage issues a warning message.
[Diagram: in step1, op1 feeds the pcompress operator (mode=compress), which writes compressDS.ds.]
In osh, the default mode of the operator is -compress, so you need not specify any option:
$ osh " ... op1 | pcompress > compressDS.ds "
In the next step, the pcompress operator expands the same data set so that it can be used by another operator.
[Diagram: in step2, compressDS.ds is read by the pcompress operator (mode=expand), whose output flows into op2.]
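A sketch of the corresponding osh command, using the -expand option described above (op2 stands for whatever operator consumes the expanded data):

$ osh "pcompress -expand < compressDS.ds | op2 ... "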
Peek operator
The peek operator lets you print record field values to the screen as the operator copies records from its input data set to one or more output data sets. This might be helpful for monitoring the progress of your job, or to diagnose a bug in your job.
[Diagram: the peek operator takes one input data set (interface schema inRec:*) and writes one or more output data sets (interface schema outRec:*).]
peek: properties
Table 41. peek properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      N (set by user)
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no
Table 42. Peek options (continued)

-nrecs numrec
    Specifies the number of records to print per partition. The default is 10.

-period p
    Causes the operator to print every pth record per partition, starting with the first record. p must be >= 1.

-part part_num
    Causes the operator to print the records for a single partition number. By default, the operator prints records from all partitions.

-skip n
    Specifies to skip the first n records of every partition. The default value is 0.

-var input_schema_var_name
    Explicitly specifies the name of the operator's input schema variable. This is necessary when your input data set contains a field named inRec.
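For instance, a sketch that prints five records from partition 0 only, using the options listed above (op1 and op2 are placeholders for the surrounding operators):

$ osh " ... op1 | peek -nrecs 5 -part 0 | op2 ... "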
[Diagram: within a step, op1 feeds the peek operator (name), whose output flows into op2.]
PFTP operator
The PFTP (parallel file transfer protocol) Enterprise operator transfers files to and from multiple remote hosts. It works by forking an FTP client executable to transfer a file to or from multiple remote hosts using a URI (Uniform Resource Identifier). This section describes the operator and also addresses issues such as restartability, describing how you can restart an FTP transfer from the point where it was stopped if a transfer fails. The restart occurs at the file boundary.
[Diagram: in get mode, pftp writes an output data set (outRec:*); in put mode, pftp reads an input data set.]
Operator properties
Property                                        Value
Number of input data sets                       0 <= N <= 1 (zero or one input data set in put mode)
Number of output data sets                      0 <= N <= 1 (zero or one output data set in get mode)
Input interface schema                          inputRec:* (in put mode)
Output interface schema                         outputRec:* (in get mode)
Transfer behavior                               inputRec:* is exported according to a user supplied schema and
                                                written to a pipe for ftp transfer. outputRec:* is imported
                                                according to a user supplied schema by reading a pipe written by ftp.
Default execution mode                          Parallel
Input partitioning style                        None
Output partitioning style                       None
Partitioning method                             None
Collection method                               None
Preserve-partitioning flag in input data set    Not Propagated
Preserve-partitioning flag in output data set   Not Propagated
Restartable                                     Yes
Combinable operator                             No
Consumption pattern                             None
Table 43. Pftp2 options (continued)

-uri uri1 [-uri uri2...]
    The URIs (Uniform Resource Identifiers) are used to transfer or access files from or to multiple hosts. There can be multiple URIs. You can specify one or more URIs, or a single URI with a wild card in the path to retrieve all the files matched by the wild card. pftp collects all the retrieved file names before starting the file transfer. pftp supports limited wild carding. Get commands are issued in sequence for files when a wild card is specified.
    You can specify an absolute or a relative pathname.
    For put operations the syntax for a relative path is:
    ftp://remotehost.domain.com/path/remotefilename
    Where path is the relative path of the user's home directory.
    For put operations the syntax for an absolute path is:
    ftp://remotehost.domain.com//path/remotefilename
    While connecting to the mainframe system, the syntax for an absolute path is:
    ftp://remotehost.domain.com/\path.remotefilename\
    Where path is the absolute path of the user's home directory.
    For get operations the syntax for a relative path is:
    ftp://host/path/filename
    Where path is the relative path of the user's home directory.
    For get operations the syntax for an absolute path is:
    ftp://host//path/filename
    While connecting to the mainframe system, the syntax for an absolute path is:
    ftp://host//\path.remotefilename\
    Where path is the absolute path of the user's home directory.

-open_command cmd
    Needed only if any operations need to be performed besides navigating to the directory where the file exists. This is a sub-option of the URI option. At most one open_command can be specified for a URI.
    Example: -uri ftp://remotehost/fileremote1.dat -open_command verbose
Table 43. Pftp2 options (continued)

-user username1 [-user username2...]
    With each URI you can specify the User Name to connect to the URI. If not specified, the ftp will try to use the .netrc file in the user's home directory. There can be multiple user names. User1 corresponds to URI1. When the number of usernames is less than the number of URIs, the last username is set for the remaining URIs.
    Example: -user User1 -user User2

-password password1 [-password password2...]
    With each URI you can specify the Password to connect to the URI. If not specified, the ftp will try to use the .netrc file in the user's home directory. There can be multiple passwords. Password1 corresponds to URI1. When the number of passwords is less than the number of URIs, the last password is set for the remaining URIs. Note: The number of passwords should be equal to the number of usernames.
    Example: -password Secret1 -password Secret2

-schema schema
    You can specify the schema for get or put operations. This option is mutually exclusive with -schemafile.
    Example: -schema record(name:string;)

-schemafile schemafile
    You can specify the schema for get or put operations in a schema file. This option is mutually exclusive with -schema.
    Example: -schemafile file.schema

-ftp_call cmd
    The ftp command to call for get or put operations. The default is 'ftp'. You can include an absolute path with the command.
    Example: -ftp_call /opt/gnu/bin/wuftp
Table 43. Pftp2 options (continued)

-force_config_file_parallelism
    Optionally limits the number of pftp players via the APT_CONFIG_FILE configuration file. The operator executes with a maximum degree of parallelism as determined by the configuration file. The operator will execute with a lesser degree of parallelism if the number of get arguments is less than the number of nodes in the configuration file. In some cases this might result in more than one file being transferred per player.

-overwrite
    Overwrites remote files in ftp put mode. When this option is not specified, the remote file is not overwritten.
Table 43. Pftp2 options (continued)

-restartable_transfer | -restart_transfer | -abandon_transfer
    This option is used to initiate a restartable ftp transfer. The restartability option in get mode will reinitiate ftp transfer at the file boundary. The transfer of the files that failed half way is restarted from the beginning or zero file location. The file URIs that were transferred completely are not transferred again. Subsequently, the downloaded URIs are imported to the data set from the downloaded temporary folder path.
    v A restartable pftp session is initiated as follows:
      osh "pftp -uri ftp://remotehost/file.dat -user user -password secret
           -restartable_transfer -jobid 100 -checkpointdir 'chkdir' -mode put < input.ds"
    v -restart_transfer: If the transfer fails, to restart the transfer again, the restartable pftp session is resumed as follows:
      osh "pftp -uri ftp://remotehost/file.dat -user user -password secret
           -restart_transfer -jobid 100 -checkpointdir 'chkdir' -mode put < input.ds"
    v -abandon_transfer: Used to abort the operation; the restartable pftp session is abandoned as follows:
      osh "pftp -uri ftp://remotehost/file.dat -user user -password secret
           -abandon_transfer -jobid 100 -checkpointdir 'chkdir' -mode put < input.ds"
Table 43. Pftp2 options (continued)

-job_id
    This is an integer to specify the job identifier of a restartable transfer job. This is a dependent option of -restartable_transfer, -restart_transfer, or -abandon_transfer.
    Example: -job_id 101

-checkpointdir
    This is the directory name/path of the location where the pftp restartable job id folder can be created. The checkpoint folder must exist.
    Example: -checkpointdir "/apt/linux207/orch_master/apt/folder"

-transfer_type
    This option is used to specify the data transfer type. You can either choose ASCII or Binary as the data transfer type.
    Example: -transfer_type binary

-xfer_mode
    This option is used to specify the data transfer protocol. You can either choose FTP or SFTP mode of data transfer.
    Example: -xfer_mode sftp
Restartability
You can specify that the FTP operation runs in restartable mode. To do this you:
1. Specify the -restartable_transfer option
2. Specify a unique job_id for the transfer
3. Optionally specify a checkpoint directory for the transfer using the -checkpointdir directory (if you do not specify a checkpoint directory, the current working directory is used)

When you run the job that performs the FTP operation, information about the transfer is written to a restart directory identified by the job id located in the checkpoint directory prefixed with the string "pftp_jobid_". For example, if you specify a job_id of 100 and a checkpoint directory of /home/bgamsworth/checkpoint the files would be written to /home/bgamsworth/checkpoint/pftp_jobid_100.

If the FTP operation does not succeed, you can rerun the same job with the option set to restart or abandon. For a production environment you could build a job sequence that performed the transfer, then tested whether it was successful. If it was not, another job in the sequence could use another PFTP operator with the restart transfer option to attempt the transfer again using the information in the restart directory.

For get operations, InfoSphere DataStage reinitiates the FTP transfer at the file boundary. The transfer of the files that failed half way is restarted from the beginning or zero file location. The file URIs that were transferred completely are not transferred again. Subsequently, the downloaded URIs are imported to the data set from the temporary folder path. If the operation repeatedly fails, you can use the abandon_transfer option to abandon the transfer and clear the temporary restart directory.
pivot operator
Use the Pivot Enterprise stage to pivot data horizontally.
The pivot operator maps a set of fields in an input row to a single column in multiple output records. This type of mapping operation is known as horizontal pivoting. The data output by the pivot operator usually has fewer fields, but more records than the input data. You can map several sets of input fields to several output columns. You can also output any of the fields in the input data with the output data. You can generate a pivot index that will assign an index number to each record with a set of pivoted data.
[Table: pivot properties - Execution mode, Partitioning method, Collection method, Preserve-partitioning flag in output data set, Composite operator, Combinable operator]
The pivot operator:
v Takes any single data set as input
v Has an input interface schema consisting of a single schema variable inRec
v Copies the input data set to the output data set, pivoting data in multiple input fields to single output fields in multiple records.
Table 44. Pivot options (continued)

-from field_name
    Specifies the name of the input field from which the output field is derived.

-type type
    Specifies the type of the output field.

-index field_name
    Specifies that an index field will be generated for pivoted data.
Pivot: examples
In this example you use the pivot operator to pivot the data shown in the first table to produce the data shown in the second table. This example has a single pivoted output field and a generated index added to the data.
Table 45. Simple pivot operation - input data

REPID   last_name   Jan_sales   Feb_sales   Mar_sales
100     Smith       1234.08     1456.80     1578.00
101     Yamada      1245.20     1765.00     1934.22
Table 46. Simple pivot operation - output data

REPID   last_name   Q1sales   Pivot_index
100     Smith       1234.08   0
100     Smith       1456.80   1
100     Smith       1578.00   2
101     Yamada      1245.20   0
101     Yamada      1765.00   1
101     Yamada      1934.22   2
In this example, you use the pivot operator to pivot the data shown in the first table to produce the data shown in the second table. This example has multiple pivoted output fields.
Table 47. Pivot operation with multiple pivot columns - input data

REPID   last_name   Q1sales   Q2sales   Q3sales   Q4sales
100     Smith       4268.88   5023.90   4321.99   5077.63
101     Yamada      4944.42   5111.88   4500.67   4833.22
Table 48. Pivot operation with multiple pivot columns - output data

REPID   last_name   halfyear1   halfyear2
100     Smith       4268.88     4321.99
100     Smith       5023.90     5077.63
101     Yamada      4944.42     4500.67
101     Yamada      5111.88     4833.22
Remdup operator
The remove-duplicates operator, remdup, takes a single sorted data set as input, removes all duplicate records, and writes the results to an output data set. Removing duplicate records is a common way of cleansing a data set before you perform further processing. Two records are considered duplicates if they are adjacent in the input data set and have identical values for the key field(s). A key field is any field you designate to be used in determining whether two records are identical. For example, a direct mail marketer might use remdup to aid in householding, the task of cleansing a mailing list to prevent multiple mailings going to several people in the same household. The input data set to the remove duplicates operator must be sorted so that all records with identical key values are adjacent. By default, InfoSphere DataStage inserts partition and sort components to meet the partitioning and sorting needs of the remdup operator and other operators.
[Diagram: the remdup operator takes one input data set (inRec:*) and writes one output data set (outRec:*).]
remdup: properties
Table 49. remdup properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Input partitioning style                        keys in same partition
Partitioning method                             same (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Restartable                                     yes
Composite operator                              no
Table 50. remdup options (continued)

-key field [-cs | -ci] [-ebcdic] [-hash] [-param params]
    Specifies the name of a key field. The -key option might be repeated for as many fields as are defined in the input data set's record schema.
    The -cs option specifies case-sensitive comparison, which is the default. The -ci option specifies a case-insensitive comparison of the key fields.
    By default data is represented in the ASCII character set. To represent data in the EBCDIC character set, specify the -ebcdic option.
    The -hash option specifies hash partitioning using this key.
    The -param suboption allows you to specify extra parameters for a field. Specify parameters using property=value pairs separated by commas.
In this example, InfoSphere DataStage-inserted partition and sort components guarantee that all records with the same key field values are in the same partition of the data set. For example, all of the January records for Customer 86111 are processed together as part of the same partition.
The default result of removing duplicate records on this data set is:
Month   Customer   Balance
Apr     86111      787.38
May     86111      134.66
Using the -last option, you can specify that the last duplicate record is to be retained rather than the first. This can be useful if you know, for example, that the last record in a set of duplicates is always the most recent record. For example, if the osh command is:
$ osh "remdup -key Month -key Customer -last < inDS.ds > outDS.ds"
If a key field is a string, you have a choice about how the value from one record is compared with the value from the next record. The default is that the comparison is case sensitive. If you specify the -ci option the comparison is case insensitive. In osh, specify the -key option with the command:

$ osh "remdup -key Month -ci < inDS.ds > outDS.ds"
With this option specified, month values of "JANUARY" and "January" match, whereas without the case-insensitive option they do not match. For example, if your input data set is:
Month   Customer   Balance
Apr     59560      787.38
Thus the two April records for customer 59560 are not recognized as a duplicate pair because they are not adjacent to each other in the input. To remove all duplicate records regardless of the case of the Month field, use the following statement in osh:
$ osh "remdup -key Month -ci -key Customer < inDS.ds > outDS.ds"
This example removes all records in the same month except the first record. The output data set thus contains at most 12 records.
The results differ from those of the previous example if the Month field has mixed-case data values such as "May" and "MAY". When the case-insensitive comparison option is used these values match and when it is not used they do not.
Sample operator
The sample operator is useful when you are building, testing, or training data sets for use with the InfoSphere DataStage data-modeling operators. The sample operator allows you to:
v Create disjoint subsets of an input data set by randomly sampling the input data set to assign a percentage of records to output data sets. InfoSphere DataStage uses a pseudo-random number generator to randomly select, or sample, the records of the input data set to determine the destination output data set of a record. You supply the initial seed of the random number generator. By changing the seed value, you can create different record distributions each time you sample a data set, and you can recreate a given distribution by using the same seed value. A record distribution is repeatable if you use the same:
  - Seed value
  - Number of output data sets
  - Percentage of records assigned to each data set
  No input record is assigned to more than one output data set. The sum of the percentages of all records assigned to the output data sets must be less than or equal to 100%.
v Alternatively, you can specify that every nth record be written to output data set 0.
[Diagram: the sample operator takes one input data set (inRec:*) and writes one or more output data sets (outRec:*).]
sample: properties
Table 51. sample properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      N (set by user)
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Partitioning method                             any (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no
Either the -percent option must be specified for each output data set or the -sample option must be specified.
Table 52. Sample options

-maxoutputrows maxout
    Optionally specifies the maximum number of rows to be output per process. Supply an integer >= 1 for maxout.
Table 52. Sample options (continued)

-percent percent output_port_num
    Specifies the sampling percentage for each output data set. You specify the percentage as an integer value in the range of 0, corresponding to 0.0%, to 100, corresponding to 100.0%. The sum of the percentages specified for all output data sets cannot exceed 100.0%.
    The output_port_num following percent is the output data set number.
    The -percent and -sample options are mutually exclusive. One must be specified.

-sample sample
    Specifies that each nth record is written to output 0. Supply an integer >= 1 for sample to indicate the value for n.
    The -sample and -percent options are mutually exclusive. One must be specified.

-seed seed_val
    Initializes the random number generator used by the operator to randomly sample the records of the input data set. seed_val must be a 32-bit integer.
    The operator uses a repeatable random number generator, meaning that the record distribution is repeatable if you use the same seed_val, number of output data sets, and percentage of records assigned to each data set.
[Diagram: InDS.ds flows into the sample operator within a step, which writes three output data sets receiving 5.0%, 10.0%, and 15.0% of the records.]
In this example, you specify a seed value of 304452, a sampling percentage for each output data set, and three output data sets.
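A sketch of the corresponding osh command, using the -seed and -percent options described in Table 52 (the data set names are illustrative):

$ osh "sample -seed 304452
       -percent 5 0
       -percent 10 1
       -percent 15 2
       < inDS.ds > out0.ds > out1.ds > out2.ds"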
Sequence operator
Using the sequence operator, you can copy multiple input data sets to a single output data set. The sequence operator copies all records from the first input data set to the output data set, then all the records from the second input data set, and so forth. This operation is useful when you want to combine separate data sets into a single large data set. This topic describes how to use the sequence operator.

The sequence operator takes one or more data sets as input and copies all input records to a single output data set. The operator copies all records from the first input data set to the output data set, then all the records from the second input data set, and so on. The record schema of all input data sets must be identical.

You can execute the sequence operator either in parallel (the default) or sequentially. Sequential mode allows you to specify a collection method for an input data set to control how the data set partitions are combined by the operator.

This operator differs from the funnel operator, described in "Funnel Operators", in that the funnel operator does not guarantee the record order in the output data set.
[Diagram: the sequence operator takes N input data sets and writes one output data set (outRec:*).]
sequence: properties
Table 53. sequence properties

Property                                        Value
Number of input data sets                       N (set by user)
Number of output data sets                      1
Input interface schema                          inRec:*
Output interface schema                         outRec:*
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  parallel (default) or sequential
Partitioning method                             round robin (parallel mode)
Collection method                               any (sequential mode)
Preserve-partitioning flag in output data set   propagated
Composite operator                              no
[Diagram: steps A, B, and C each write an output data set (outDS0.ds, outDS1.ds, and outDS2.ds); a following step reads all three into the sequence operator, whose output flows into op1.]
The osh command for the step beginning with the sequence operator is:
$ osh "sequence < outDS0.ds < outDS1.ds < outDS2.ds | op1 ... rees
Switch operator
The switch operator takes a single data set as input. The input data set must have an integer field to be used as a selector, or a selector field whose values can be mapped, implicitly or explicitly, to int8. The switch operator assigns each input record to one of multiple output data sets based on the value of the selector field.
[Diagram: the switch operator takes one input data set (inRec:*) and writes up to N output data sets (outRec:*), data set 0 through data set N.]
The switch operator is analogous to a C switch statement, which causes the flow of control in a C program to branch to one of several cases based on the value of a selector variable, as shown in the following C program fragment.
switch (selector)
{
    case 0:
        // if selector = 0,
        // write record to output data set 0
        break;
    case 1:
        // if selector = 1,
        // write record to output data set 1
        break;
    . . .
    case discard:
        // if selector = discard value,
        // skip record
        break;
    default:
        // if selector is invalid,
        // abort operator and end step
};
You can attach up to 128 output data sets to the switch operator corresponding to 128 different values for the selector field. Note that the selector value for each record must be mapped to a value in the range 0 to N - 1, where N is the number of data sets that are attached to the operator, or be mapped to the discard value. Invalid selector values normally cause the switch operator to terminate and the step containing the operator to return an error. However, you can set an option that allows records whose selector field does not correspond to that range to be either dropped silently or treated as allowed rejects. You can discard records by setting a discard value and mapping one or more selector values to the discard value. When the selector value for a record is mapped to the discard value, the record is not assigned to any output data set.
switch: properties
Table 54. switch properties

Property                                        Value
Number of input data sets                       1
Number of output data sets                      1 <= N <= 128
Input interface schema                          selector field:any data type; inRec:*
Output interface schema                         outRec:*
Preserve-partitioning flag in output data set   propagated
If the selector field is of type integer and has no more than 128 values, there are no required options, otherwise you must specify a mapping using the -case option.
Table 55. switch options

-allowRejects
    Rejected records (whose selector value is not in the range 0 to N-1, where N is the number of data sets attached to the operator, or equal to the discard value) are assigned to the last output data set. This option is mutually exclusive with the -ignoreRejects, -ifNotFound, and -hashSelector options.

-case mapping
    Specifies the mapping between actual values of the selector field and the output data sets. mapping is a string of the form "selector_value = output_ds", where output_ds is the number of the output data set to which records with that selector value should be written (output_ds can be implicit, as shown in the example below). You must specify an individual mapping for each value of the selector field you want to direct to one of the output data sets, thus -case is invoked as many times as necessary to specify the complete mapping.
    Multi-byte Unicode character data is supported for ustring selector values.
    Note: This option is incompatible with the -hashSelector option.
Table 55. switch options (continued)

-collation_sequence locale | collation_file_pathname | OFF
    This option determines how your string data is sorted. You can:
    v Specify a predefined IBM ICU locale
    v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
    v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
    By default, InfoSphere DataStage sorts strings using byte-wise comparisons.
    For more information, reference this IBM ICU site:
    http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm

-discard discard_value
    Specifies an integer value that causes a record to be discarded by the operator. For a record to be discarded, the value in the selector field of the record must be mapped to the discard value by using the case option.
    Note that discard_value must be outside the range 0 to N - 1, where N is the number of data sets that are attached to the operator.
    Note: This option is mutually exclusive with -hashSelector.

-hashSelector
    A boolean; when this is set, records are hashed on the selector field modulo the number of output data sets and assigned to an output data set accordingly. The selector field must be of a type that is convertible to uint32 and might not be nullable.
    Note: This option is incompatible with the -case, -discard, -allowRejects, -ignoreRejects, and -ifNotFound options.
Table 55. switch options (continued)

-ifNotFound {allow | fail | ignore}
    Specifies what the operator should do if a data set corresponding to the selector value does not exist:
    allow    Rejected records (whose selector value is not in the range 0 to N-1 or equal to the discard value) are assigned to the last output data set. If this option value is used, you might not explicitly assign records to the last data set.
    fail     When an invalid selector value is found, return an error and terminate. This is the default.
    ignore   Drop the record containing the out-of-range value and continue.
    Note: This option is incompatible with the -allowRejects, -ignoreRejects, and -hashSelector options.

-ignoreRejects
    Drop the record containing the out-of-range value and continue.
    Note: This option is mutually exclusive with the -allowRejects, -ifNotFound, and -hashSelector options.

-key field_name [-cs | -ci]
    Specifies the name of a field to be used as the selector field. The default field name is selector. This field can be of any data type that can be converted to int8, or any non-nullable type if case options are specified. Field names can contain multi-byte Unicode characters.
    Use the -ci flag to specify that field_name is case-insensitive. The -cs flag specifies that field_name is treated as case sensitive, which is the default.
In this example, you create a switch operator and attach three output data sets numbered 0 through 2. The switch operator assigns input records to each output data set based on the selector field, whose year values have been mapped to the numbers 0 or 1 by means of the -case option. A selector field value that maps to an integer other than 0 or 1 causes the operator to write the record to the last data set. You might not explicitly assign input records to the last data set if the -ifNotFound option is set to allow. With these settings, records whose year field has the value 1990, 1991, or 1992 go to outDS0.ds. Those whose year value is 1993 or 1994 go to outDS1.ds. Those whose year is 1995 are discarded. Those with any other year value are written to outDS2.ds, since rejects are allowed by the -ifNotFound setting. Note that because the -ifNotFound option is set to allow rejects, switch does not let you map any year value explicitly to the last data set (outDS2.ds), as that is where rejected records are written. Note also that it was unnecessary to specify an output data set for 1991 or 1992, since without an explicit mapping indicated, case maps values across the output data sets, starting from the first (outDS0.ds). You might map more than one selector field value to a given output data set.
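A hedged osh sketch of these settings (exact quoting of the -case mappings can vary, and every mapping is written out explicitly here rather than relying on the implicit numbering described above):

$ osh "switch -key year
       -case '1990 = 0' -case '1991 = 0' -case '1992 = 0'
       -case '1993 = 1' -case '1994 = 1'
       -case '1995 = 3'
       -discard 3
       -ifNotFound allow
       < inDS.ds > outDS0.ds > outDS1.ds > outDS2.ds"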
The operator also verifies that if a -case entry maps a selector field value to a number outside the range 0 to N-1, that number corresponds to the value of the -discard option.
[Diagram: inDS.ds, with schema income:int32; year:string; name:string; state:string;, flows into the switch operator (selector = year; interface schema selector:type; inRec:*), which writes three output data sets (outRec:*).]
Note that by default output data sets are numbered starting from 0. You could also include explicit data set numbers, as shown below:
$ osh "switch -discard 3 < inDS.ds 0> outDS0.ds 1> outDS1.ds 2> outDS2.ds "
The output summary per criterion is included in the summary messages generated by InfoSphere DataStage as custom information. It is identified with:
name="CriterionSummary"
The XML tags criterion, case and where are used by the switch operator when generating business logic and criterion summary custom information. These tags are used in the example information below.
Tail operator
The tail operator copies the last N records from each partition of its input data set to its output data set. By default, N is 10 records. However, you can determine the following by means of options:
v The number of records to copy
v The partition from which the records are copied

This control is helpful in testing and debugging jobs with large data sets. The head operator performs a similar operation, copying the first N records from each partition. See "Head Operator".
[Diagram: the tail operator takes one input data set (inRec:*) and writes one output data set (outRec:*).]
tail: properties
Table 56. tail properties

Property                     Value
Number of input data sets    1
Number of output data sets   1
Input interface schema       inRec:*
Output interface schema      outRec:*
Transfer behavior            inRec -> outRec without record modification
Table 57. Tail options (continued)

-part partition_number
    Copy records only from the indicated partition. By default, the operator copies records from all partitions.
    You can specify -part multiple times to specify multiple partition numbers. Each time you do, specify the option followed by the number of the partition.
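For example, a minimal sketch of the default invocation, which copies the last 10 records from each partition (the data set names are illustrative):

$ osh "tail < in.ds > out.ds"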
Partition 0 input:   0 9 18 19 23 25 36 37 40 47 51
Partition 0 output:  9 18 19 23 25 36 37 40 47 51
Partition 1 input:   3 5 11 12 13 14 15 16 17 35 42 46 49 50 53 57 59
Partition 1 output:  16 17 35 42 46 49 50 53 57 59
Transform operator
The transform operator modifies your input records, or transfers them unchanged, guided by the logic of the transformation expression you supply. You build transformation expressions using the Transformation Language, which is the language that defines expression syntax and provides built-in functions.

By using the Transformation Language with the transform operator, you can:
v Transfer input records to one or more outputs
v Define output fields and assign values to them based on your job logic
v Use local variables and input and output fields to perform arithmetic operations, make function calls, and construct conditional statements
v Control record processing behavior by explicitly calling writerecord dataset_number, droprecord, and rejectrecord as part of a conditional statement
[Data-flow diagram: the transform operator reads a single input data set plus, optionally, lookup tables supplied as filesets (fileset0 ... filesetN) or data sets (table0.ds ... tableN.ds), and writes one or more output data sets and, optionally, reject data sets.]

Transform: properties
Table 60. transform properties
Property                      Value
Number of input data sets     1 plus the number of lookup tables specified on the command line.
Number of output data sets    1 or more and, optionally, 1 or more reject data sets
Table 60. transform properties (continued)
Property                                          Value
Transfer behavior                                 See "Transfer Behavior"
Execution mode                                    parallel by default, or sequential
Partitioning method                               any (parallel mode)
Collection method                                 any (sequential mode)
Preserve-partitioning flag in output data set     propagated
Composite operator                                yes
Combinable operator                               yes
part_key_options are:
[-ci | -cs] [-param params ]
flag_compilation_options are:
[-dir dir_name_for_compilation ] [-name library_path_name ] [-optimize | -debug] [-verbose] [-compiler cpath ] [-staticobj absolute_path_name ] [-sharedobj absolute_path_name ] [-compileopt options ] [-linker lpath ] [-linkopt options ]
[-t
options ]
The -table and -fileset options allow you to use conditional lookups.
Note: The following option values can contain multi-byte Unicode values:
v the field names given to the -inputschema and -outputschema options, and the ustring values
v -inputschemafile and -outputschemafile files
v the -expression option string and the -expressionfile option filepath
v -sort and -part key-field names
v -compiler, -linker, and -dir pathnames
v the -name file name
v -staticobj and -sharedobj pathnames
Option -expression
Use -expression expression_string This option lets you specify expressions written in the Transformation Language. The expression string might contain multi-byte Unicode characters. Unless you choose the -flag option with run, you must use either the -expression or -expressionfile option. The -expression and -expressionfile options are mutually exclusive.
-expressionfile
-expressionfile expression_file This option lets you specify expressions written in the Transformation Language. The expression must reside in an expression_file, which gives the name and path of the file and might include multi-byte Unicode characters. Use an absolute path; otherwise the path is resolved relative to the current UNIX directory. Unless you choose the -flag option with run, you must choose either the -expression or -expressionfile option. The -expressionfile and -expression options are mutually exclusive.
Option -flag
Use -flag {compile | run | compileAndRun} suboptions
compile: This option indicates that you wish to check the Transformation Language expression for correctness, and compile it. An appropriate version of a C++ compiler must be installed on your computer. Field information used in the expression must be known at compile time; therefore, input and output schemas must be specified.
run: This option indicates that you wish to use a pre-compiled version of the Transformation Language code. You do not need to specify input and output schemas or an expression because these elements have been supplied at compile time. However, you must add the directory containing the pre-compiled library to your library search path; this is not done by the transform operator. You must also use the -name suboption to provide the name of the library where the pre-compiled code resides.
compileAndRun: This option indicates that you wish to compile and run the Transformation Language expression. This is the default value. An appropriate version of a C++ compiler must be installed on your computer. You can supply schema information in the following ways:
v You can omit all schema specifications. The transform operator then uses the up-stream operator's output schema as its input schema, and the schema for each output data set contains all the fields from the input record plus any new fields you create for a data set.
v You can omit the input data set schema, but specify schemas for all output data sets or for selected data sets. The transform operator then uses the up-stream operator's output schema as its input schema. Any output schemas specified on the command line are used unchanged, and output data sets without schemas contain all the fields from the input record plus any new fields you create for a data set.
v You can specify an input schema, but omit all output schemas or omit some output schemas. The transform operator then uses the input schema as specified. Any output schemas specified on the command line are used unchanged, and output data sets without schemas contain all the fields from the input record plus any new fields you create for a data set.
Use
The -flag option has the following suboptions:
v -dir dir_name lets you specify a compilation directory. By default, compilation occurs in the TMPDIR directory or, if this environment variable does not point to an existing directory, in the /tmp directory. Whether you specify it or not, you must make sure the directory for compilation is in the library search path.
v -name file_name lets you specify the name of the file containing the compiled code. If you use the -dir dir_name suboption, this file is in the dir_name directory.
The following examples show how to use the -dir and -name options in an osh command line.
For development:
osh "transform -inputschema schema -outputschema schema -expression expression -flag compile -dir dir_name -name file_name"
For your production machine:
osh "... | transform -flag run -name file_name | ..."
The library file must be copied to the production machine.
-flag compile and -flag compileAndRun have these additional suboptions:
v -optimize specifies the optimize mode for compilation.
v -debug specifies the debug mode for compilation.
v -verbose causes verbose messages to be output during compilation.
v -compiler cpath lets you specify the compiler path when the compiler is not in the default directory. The default compiler path for each operating system is:
Solaris: /opt/SUNPRO6/SUNWspro/bin/CC
AIX: /usr/vacpp/bin/xlC_r
Tru64: /bin/cxx
HP-UX: /opt/aCC/bin/aCC
v -staticobj absolute_path_name and -sharedobj absolute_path_name specify the location of your static and dynamic-linking C-object libraries. The file suffix can be omitted. See "External Global C-Function Support" for details.
v -compileopt options lets you specify additional compiler options. These options are compiler-dependent. Pathnames might contain multi-byte Unicode characters.
v -linker lpath lets you specify the linker path when the linker is not in the default directory. The default linker path of each operating system is the same as the default compiler path listed above.
v -linkopt options lets you specify link options to the compiler. Pathnames might contain multi-byte Unicode characters.
Option -inputschema
Use -inputschema schema Use this option to specify an input schema. The schema might contain multi-byte Unicode characters. An error occurs if an expression refers to an input field not in the input schema. The -inputschema and the -inputschemafile options are mutually exclusive. The -inputschema option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -inputschema or the -inputschemafile option. See the -flag option description in this table for information on the -compile suboption.
-inputschemafile
-inputschemafile schema_file Use this option to specify an input schema. An error occurs if an expression refers to an input field not in the input schema. To use this option, the input schema must reside in a schema_file, where schema_file is the name and path to the file which might contain multi-byte Unicode characters. You can use an absolute path, or by default the current UNIX directory. The -inputschemafile and the -inputschema options are mutually exclusive. The -inputschemafile option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -inputschema or the -inputschemafile option. See the -flag option description in this table for information on the -compile suboption.
-maxrejectlogs
-maxrejectlogs integer An information log is generated every time a record is written to the reject output data set. Use this option to specify the maximum number of output reject logs the transform operator generates. The default is 50. When you specify -1, an unlimited number of information logs are generated.
-oldnullhandling
-oldnullhandling Use this option to reinstate old-style null handling. This setting means that, when you use an input field in the derivation expression of an output field, you have to explicitly handle any nulls that occur in the input data. If you do not specify such handling, a null causes the record to be dropped or rejected. If you do not specify the -oldnullhandling option, then a null in the input field used in the derivation causes a null to be output.
Option -outputschema
Use -outputschema schema Use this option to specify an output schema. An error occurs if an expression refers to an output field not in the output schema. The -outputschema and -outputschemafile options are mutually exclusive. The -outputschema option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -outputschema or the -outputschemafile option. See the -flag option description in this table for information on the -compile suboption. For multiple output data sets, repeat the -outputschema or -outputschemafile option to specify the schema for all output data sets.
-outputschemafile
-outputschemafile schema_file Use this option to specify an output schema. An error occurs if an expression refers to an output field not in the output schema. To use this option, the output schema must reside in a schema_file which includes the name and path to the file. You can use an absolute path, or by default the current UNIX directory. The -outputschemafile and the -outputschema options are mutually exclusive. The -outputschemafile option is not required when you specify compileAndRun or run for the -flag option; however, when you specify compile for the -flag option, you must include either the -outputschema or the -outputschemafile option. See the -flag option description in this table for information on the -compile suboption. For multiple output data sets, repeat the -outputschema or -outputschemafile option to specify the schema for all output data sets.
-part
-part {-input | -output[ port ]} -key field_name [-ci | -cs] [-param params ] You can use this option 0 or more times. It indicates that the data is hash partitioned. The required field_name is the name of a partitioning key. Exactly one of the suboptions -input and -output[ port ] must be present. These suboptions determine whether partitioning occurs on the input data or the output data. The default for port is 0. If port is specified, it must be an integer which represents an output data set where the data is partitioned. The suboptions to the -key option are -ci for case-insensitive partitioning, or -cs for case-sensitive partitioning. The default is case-sensitive. The -param suboption lets you specify property=value pairs. Separate the pairs by commas (,).
Option -reject
Use -reject [-rejectinfo reject_info_column_name_string] This is optional. You can use it only once. When a null field is used in an expression, this option specifies that the input record containing the field is not dropped, but is sent to the output reject data set. The -rejectinfo suboption specifies the column name for the reject information.
-sort
-sort {-input | -output [ port ]} -key field_name [-ci | -cs] [-asc | -desc] [-nulls {first | last}] [-param params ] You can use this option 0 or more times. It indicates that the data is sorted for each partition. The required field_name is the name of a sorting key. Exactly one of the suboptions -input and -output[ port ] must be present. These suboptions determine whether sorting occurs on the input data or the output data. The default for port is 0. If port is specified, it must be an integer that represents the output data set where the data is sorted. You can specify -ci for a case-insensitive sort, or -cs for a case-sensitive sort. The default is case-sensitive. You can specify -asc for an ascending order sort or -desc for a descending order sort. The default is ascending. You can specify -nulls {first | last} to determine where null values should sort. The default is that nulls sort first. You can use -param params to specify any property = value pairs. Separate the pairs by commas (,).
Option -table
Use -table -key field [-ci | -cs] [-key field [-ci | -cs] ...] [-allow_dups] [-save fileset_descriptor] [-diskpool pool] [-schema schema | -schemafile schema_file] Specifies the beginning of a list of key fields and other specifications for a lookup table. The first occurrence of -table marks the beginning of the key field list for lookup table1; the next occurrence of -table marks the beginning of the key fields for lookup table2, and so on. For example: lookup -table -key field -table -key field The -key option specifies the name of a lookup key field. The -key option must be repeated if there are multiple key fields. You must specify at least one key for each table. You cannot use a vector, subrecord, or tagged aggregate field as a lookup key. The -ci suboption specifies that the string comparison of lookup key values is to be case insensitive; the -cs option specifies case-sensitive comparison, which is the default. In create-only mode, the -allow_dups option causes the operator to save multiple copies of duplicate records in the lookup table without issuing a warning. Two lookup records are duplicates when all lookup key fields have the same value in the two records. If you do not specify this option, InfoSphere DataStage issues a warning message when it encounters duplicate records and discards all but the first of the matching records. In normal lookup mode, only one lookup table (specified by either -table or -fileset) can have been created with -allow_dups set. The -save option lets you specify the name of a fileset to write this lookup table to; if -save is omitted, tables are written as scratch files and deleted at the end of the lookup. In create-only mode, -save is, of course, required. The -diskpool option lets you specify a disk pool in which to create lookup tables. By default, the operator looks first for a "lookup" disk pool, then uses the default pool (""). Use this option to specify a different disk pool to use. The -schema suboption specifies the schema that interprets the contents of the string or raw fields by converting them to another data type. The -schemafile suboption specifies the name of a file containing the schema that interprets the content of the string or raw fields by converting them to another data type. Specify either -schema or -schemafile; one of them is required when -flag compile is set, but neither is required for compileAndRun or run.
Option -fileset
Use [-fileset fileset_descriptor ...] Specify the name of a fileset containing one or more lookup tables to be matched. In lookup mode, you must specify either the -fileset option, or a table specification, or both, in order to designate the lookup table(s) to be matched against. There can be zero or more occurrences of the -fileset option. It cannot be specified in create-only mode. Warning: The fileset already contains key specifications. When you follow -fileset fileset_descriptor by key_specifications , the keys specified do not apply to the fileset; rather, they apply to the first lookup table. For example, lookup -fileset file -key field, is the same as: lookup -fileset file1 -table -key field
Transfer behavior
You can transfer your input fields to your output fields using any one of the following methods: v Set the value of the -flag option to compileAndRun. For example:
osh "... | transform -expression expression -flacompileAndRun -dir dir_name -name file_name | ..."
v Use schema variables as part of the schema specification. A partial schema might be used for both the input and output schemas. This example shows a partial schema in the output:
osh "transform -expression expression -inputschema record(a:int32;b:string[5];c:time) -outputschema record(d:dfloat:outRec:*;) -flag compile ..."
This example shows partial schemas in the input and the output:
osh "transform -expression expression -inputschema record(a:int32;b:string[5];c:time;Inrec:*) -outputschema record(d:dfloat:outRec:*;) -flag compile ..." osh "... | transform -flag run ... | ..."
Output 0 contains the fields d, a, b, and c, plus any fields propagated from the up-stream operator. v Use name matching between input and output fields in the schema specification. When input and output field names match and no assignment is made to the output field, the input field is transferred to the output data set unchanged. Any input field which doesn't have a corresponding output field is dropped. For example:
osh "transform -expression expression -inputschema record(a:int32;b:string[5];c:time) -outputschema record(a:int32;) -outputschema record(a:int32;b:string[5];c:time) -flag compile ..."
Field a is transferred from input to output 0 and output 1. Fields b and c are dropped in output 0, but are transferred from input to output 1.
v Specify a reject data set. In the Transformation Language, it is generally illegal to use a null field in expressions except in the following cases:
In function calls to notnull(field_name) and null(field_name)
In an assignment statement of the form a=b where a and b are both nullable and b is null
In these expressions:
if (null(a)) b=a else b=a+1
if (notnull(a)) b=a+1 else b=a
b=null(a)?a:a+1;
b=notnull(a)?a+1:a;
If a null field is used in an expression in other than these cases and a reject set is specified, the whole input record is transferred to the reject data set.
General structure
As in C, statements must be terminated by semi-colons and compound statements must be grouped by braces. Both C and C++ style comments are allowed.
// externs this C function: int my_times(int x, int y) { ... }
extern int32 my_times(int32 x, int32 y);
// externs this C function: void my_print_message(char *msg) { ... }
extern void my_print_message(string msg);
inputname 0 in0;
outputname 0 out0;
mainloop
{
   ...
}
The shared object file name has lib prepended to it and has a platform-dependent object-file suffix: .so for Sun Solaris and Linux, .sl for HP-UX, and .o for AIX. The file must reside in this directory:
/external-functions/dynamic
/external-functions/dynamic/libgenerate.so
Dynamically-linked libraries must be manually deployed to all running nodes. Add the library-file locations to your library search path. See "Example 8: External C Function Calls" for an example job that includes C header and source files, a Transformation Language expression file with calls to external C functions, and an osh script.
The name of any field in the lookup schema, other than key fields, can be used to access the field value, such as table1.field1. If a field is accessed when is_match() returns false, the value of the field is null if the field is nullable; otherwise, the field has its default value. Here is an example of lookup table usage:
transform -expressionfile trx1 -table -key a -fileset sKeyTable.fs < dataset.v < table.v > target.v

trx1:

inputname 0 in1;
outputname 0 out0;
tablename 0 tbl1;
tablename 1 sKeyTable;
mainloop
{
    // This code demonstrates the interface without doing anything really useful
    int nullCount;
    nullCount = 0;
    lookup(sKeyTable);
    if (!is_match(sKeyTable))   // if there's no match
    {
        lookup(tbl1);
        if (!is_match(tbl1))
        {
            out0.field2 = "missing";
        }
    }
    else
    {
        // Loop through the results
        while (is_match(sKeyTable))
        {
            if (is_null(sKeyTable.field1))
            {
                nullCount++;
            }
            next_match(sKeyTable);
        }
    }
    writerecord 0;
}
Because the transform operator accepts only a single input data set, the data set number for inputname is 0. You can specify 0 through (the number of output data sets - 1) for the outputname data set number. For example:
inputname 0 input-grades;
outputname 0 output-low-grades;
outputname 1 output-high-grades;
Data set numbers cannot be used to qualify field names. You must use the inputname and outputname data set names to qualify field names in your Transformation Language expressions. For example:
output-low-grades.field-a = input-grades.field-a + 10; output-high-grades.field-a = output-low-grades.field-a - 10;
Field names that are not qualified by a data set name always default to output data set 0. It is good practice to use the inputname data set name to qualify input fields in expressions, and use the
outputname data set name to qualify output fields, even though these fields have unique names among all data sets. The Transformation Language does not attempt to determine if an unqualified, but unique, name exists in another data set. The inputname and outputname statements must appear first in your Transformation Language code. For an example, see the Transformation Language section of "Example 2: Student-Score Distribution With a Letter Grade Added to Example 1".
Forms                      Meaning
floating point             Single- or double-precision floating-point value
string                     Variable-length character string
decimal[p, s]              Decimal value with precision p and scale s
date                       Date with year, month, and day
time                       Time with one second resolution
time[microseconds]         Time with one microsecond resolution
timestamp                  Date/time with one second resolution
timestamp[microseconds]    Date/time with one microsecond resolution
raw                        Variable length binary data
raw[max=n]                 Variable length binary data with at most n bytes
raw[n]                     Fixed length (n-byte) binary data
raw[align=k]               Variable length binary data, aligned on k-byte boundary (k = 1, 2, 4, or 8)
raw[max=n, align=k]        Variable length binary data with at most n bytes, aligned on k-byte boundary (k = 1, 2, 4, or 8)
raw[n, align=k]            Fixed length (n-byte) binary data, aligned on k-byte boundary (k = 1, 2, 4, or 8)
InfoSphere DataStage supports the following complex field types:
v vector fields
v subrecord fields
v tagged fields
Note: Tagged fields cannot be used in expressions; they can only be transferred from input to output.
Local variables
Local variables are used for storage apart from input and output records. You must declare and initialize them before use within your transformation expressions. The scope of local variables differs depending on which code segment defines them:
v Local variables defined within the global and initialize code segments can be accessed before, during, and after record processing.
v Local variables defined in the mainloop code segment are only accessible for the current record being processed.
v Local variables defined in the finish code segment are only accessible after all records have been processed.
v Local variables can represent any of the simple value types: int8, uint8, int16, uint16, int32, uint32, int64, uint64, sfloat, dfloat, decimal, string, date, time, timestamp, raw
Declarations are similar to C, as in the following examples:
v int32 a[100]; declares a to be an array of 100 32-bit integers
v dfloat b; declares b to be a double-precision float
v string c; declares c to be a variable-length string
v string[n] e; declares e to be a string of length n
v string[n] f[m]; declares f to be an array of m strings, each of length n
v decimal[p] g; declares g to be a decimal value with p (precision) digits
v decimal[p, s] h; declares h to be a decimal value with p (precision) digits and s (scale) digits to the right of the decimal
You cannot initialize variables as part of the declaration. They can only be initialized on a separate line. For example:
int32 a; a = 0;
The result is uncertain if a local variable is used without being initialized. There are no local variable pointers or structures, but you can use arrays.
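The following sketch, which is not from the original examples, shows how segment scope plays out in practice; the data set names in0 and out0 and the field name seq are assumptions, and the global, initialize, mainloop, and finish segments are assumed to behave as described in the scope rules above.

inputname 0 in0;
outputname 0 out0;
global
{
    // visible before, during, and after record processing
    int32 recCount;
}
initialize
{
    recCount = 0;
}
mainloop
{
    recCount = recCount + 1;
    out0.seq = recCount;     // assign the running count to an output field
    writerecord 0;
}
finish
{
    // recCount is still accessible here, after all records have been processed
}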
Expressions
The Transformation Language supports standard C expressions, with the usual operator precedence and use of parentheses for grouping. It also supports field names as described in "Data types and record fields" , where the field name is specified in the schema for the data set.
Language elements
The Transformation Language supports the following elements:
v Integer, character, floating point, and string constants
v Local variables
v Field names
v Arithmetic operators
v Function calls
v Flow control
v Record processing control
v Code segmentation
v Data set name specification
Operators
The Transformation Language supports several unary operators, which all apply only to simple value types.
~ (One's complement): applies to Integer. ~a returns an integer with the value of each bit reversed.
! (Logical not): applies to Integer. !a returns 1 if a is 0; otherwise returns 0.
+ (Unary plus): applies to Numeric. +a returns a.
- (Unary minus): applies to Numeric. -a returns the negative of a.
++ (Increment): a++ or ++a returns a + 1.
-- (Decrement): a-- or --a returns a - 1.
The Transformation Language supports a number of binary operators, and one ternary operator.
+ (Addition): applies to Numeric.
- (Subtraction): applies to Numeric.
* (Multiplication): applies to Numeric.
/ (Division): applies to Numeric.
% (Modulo): applies to Integers. a % b returns the remainder when a is divided by b.
<< (Left shift): applies to Integer. a << b returns a left-shifted b bit positions.
>> (Right shift): applies to Integer. a >> b returns a right-shifted b bit positions.
== (Equality): applies to any type; a and b must be numeric or of the same data type. a == b returns 1 (true) if a equals b and 0 (false) otherwise.
< (Less than): applies to the same types as ==. a < b returns 1 if a is less than b and 0 otherwise. (See the note that follows.)
> (Greater than): applies to the same types as ==. a > b returns 1 if a is greater than b and 0 otherwise. (See the note that follows.)
<= (Less than or equal to): applies to the same types as ==. a <= b returns 1 if a < b or a == b, and 0 otherwise. (See the note that follows.)
>= (Greater than or equal to): applies to the same types as ==. a >= b returns 1 if a > b or a == b, and 0 otherwise. (See the note that follows.)
!= (Inequality): applies to the same types as ==. a != b returns 1 if a is not equal to b, and 0 otherwise.
^ (Bitwise exclusive OR): applies to Integer. a ^ b returns an integer with bit value 1 in each bit position where the bit values of a and b differ, and a bit value of 0 otherwise.
& (Bitwise AND): applies to Integer. a & b returns an integer with bit value 1 in each bit position where the bit values of a and b are both 1, and a bit value of 0 otherwise.
| (Bitwise inclusive OR): applies to Integer. a | b returns an integer with a bit value 1 in each bit position where the bit value of a or b (or both) is 1, and 0 otherwise.
&& (Logical AND): a and b must be numeric or of the same data type. a && b returns 0 if either a == 0 or b == 0 (or both), and 1 otherwise.
|| (Logical OR): a and b must be numeric or of the same data type. a || b returns 1 if either a != 0 or b != 0 (or both), and 0 otherwise.
+ (Concatenation): applies to String. a + b returns the string consisting of substring a followed by substring b.
?: (Conditional, ternary): The ternary operator lets you write a conditional expression without using the if...else keyword. a ? b : c returns the value of b if a is true (non-zero) and the value of c if a is false.
= (Assignment): applies to any scalar; a and b must be numeric, numeric strings, or of the same data type. a = b places the value of b into a. Also, you can use "=" to do default conversions among integers, floats, decimals, and numeric strings.
Note: For the <, >, <=, and >= operators, if a and b are strings, lexicographic order is used. If a and b are date, time, or timestamp, temporal order is used. The expression a * b * c evaluates as (a * b) * c. We describe this by saying that multiplication has left to right associativity. The expression a + b * c evaluates as a + (b * c). We describe this by saying multiplication has higher precedence than addition. The following table describes the precedence and associativity of the Transformation Language operators. Operators listed in the same row of the table have the same precedence, and you use parentheses to force a particular order of evaluation. Operators in a higher row have a higher order of precedence than operators in a lower row.
Table 61. Precedence and Associativity of Operators
Operators                    Associativity
() []                        left to right
! ~ ++ -- + - (unary)        right to left
* / %                        left to right
+ - (binary)                 left to right
Table 61. Precedence and Associativity of Operators (continued)
Operators                    Associativity
<< >>                        left to right
< <= > >=                    left to right
== !=                        left to right
&                            left to right
^                            left to right
|                            left to right
&&                           left to right
|| :                         left to right for ||, right to left for :
?                            right to left
=                            right to left
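As a brief illustration of these operators in a derivation (not from the original examples; the field names are assumptions):

out0.full_name = in0.first_name + " " + in0.last_name;      // string concatenation
out0.status = (in0.balance >= 0) ? "ok" : "overdrawn";       // ternary conditional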
Conditional Branching
The Transformation Language provides facilities for conditional branching. The following sections describe constructs available for conditional branching.

if ... else
if (expression) statement1 else statement2;
If expression evaluates to a non-zero value (true) then statement1 is executed. If expression evaluates to 0 (false) then statement2 is executed. Both statement1 and statement2 can be compound statements. You can omit else statement2. In that case, if expression evaluates to 0 the if statement has no effect. Sample usage:
if (a < b) abs_difference = b - a; else abs_difference = a - b;
for Loop
As in C, the for loop takes the form for (expression1; expression2; expression3) statement. The order of execution is:
1. expression1. It is evaluated only once to initialize the loop variable.
2. expression2. If it evaluates to false, the loop terminates; otherwise, these expressions are executed in order:
   statement
   expression3
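A loop of this form, assuming sum, sum_squares, i, and n are int32 local variables, might look like this sketch:

sum = 0;
sum_squares = 0;
for (i = 1; i <= n; i++)
{
    sum = sum + i;
    sum_squares = sum_squares + i * i;
}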
This code sets sum to the sum of the first n integers and sum_squares to the sum of the squares of the first n integers.

While Loop
while ( expression ) statement ;
In a while loop, statement, which might be a compound statement, is executed repeatedly as long as expression evaluates to true. A sample usage is:
sum = 0;
i = 0;
while ((a[i] >= 0) && (i < n))
{
    sum = sum + a[i];
    i++;
}
This evaluates the sum of the array elements a[0] through a[n-1], or until a negative array element is encountered.

Break

The break command causes a for or while loop to exit immediately. For example, the following code does the same thing as the while loop shown immediately above:
sum = 0;
for (i = 0; i < n; i++)
{
    if (a[i] >= 0)
        sum = sum + a[i];
    else
        break;
}
Continue

The continue command is related to the break command, but used less often. It causes control to jump to the top of the loop. In the while loop, the test part is executed immediately. In a for loop, control passes to the increment step. If you want to sum all positive array entries in the array a[n], you can use the continue statement as follows:
sum = 0;
for (i = 0; i < n; i++)
{
    if (a[i] <= 0)
        continue;
    sum = sum + a[i];
}
This example could easily be written using an else statement rather than a continue statement. The continue statement is most useful when the part of the loop that follows is complicated, to avoid nesting the program too deeply.
Built-in functions
This section defines functions that are provided by the Transformation Language. It is presented in a series of tables that deal with data transformation functions of the following types:
v Lookup table functions
v Data conversion functions
v Mathematical functions
v String field functions
v Ustring field functions
v Bit manipulation functions
v Job monitoring functions
v Miscellaneous functions
When a function generates an output value, it returns the result. For functions with optional arguments, simply omit the optional argument to accept the default value. Default conversions among integer, float, decimal, and numeric string types are supported in the input arguments and the return value of the function. All integers can be signed or unsigned. The transform operator has default NULL handling at the record level with individual field "overrides". Options can be entered at the record level or the field level.
The default format of the date contained in the string is yyyy-mm-dd. However, you can specify an optional format string that defines another format. The format string requires that you provide enough information for InfoSphere DataStage to determine a complete date (either day, month, and year, or year and day of year). The format_string can contain one or a combination of the following elements:
Table 62. Date format tags
Tag         Variable width availability   Description                                    Value range             Options
%d          import                        Day of month, variable width                  1...31                  s
%dd                                       Day of month, fixed width                     01...31                 s
%ddd        with v option                 Day of year                                   1...366                 s, v
%m          import                        Month of year, variable width                 1...12                  s
%mm                                       Month of year, fixed width                    01...12                 s
%mmm                                      Month of year, short name, locale specific    Jan, Feb ...            t, u, w
%mmmm                                     Month of year, full name, locale specific     January, February ...   t, u, w, -N, +N
%yy                                       Year of century                               00...99                 s
%yyyy                                     Four digit year                               0001...9999
%NNNNyy                                   Cutoff year plus year of century              yy = 00...99            s
%e                                        Day of week, Sunday = day 1                   1...7
%E                                        Day of week, Monday = day 1                   1...7
%eee                                      Weekday short name, locale specific           Sun, Mon ...            t, u, w
%eeee                                     Weekday long name, locale specific            Sunday, Monday ...      t, u, w, -N, +N
%W          import                        Week of year (ISO 8601, Mon)                  1...53                  s
%WW                                       Week of year (ISO 8601, Mon)                  01...53                 s
When you specify a date format string, prefix each component with the percent symbol (%) and separate the string's components with a suitable literal character. The default date_format is %yyyy-%mm-%dd. Where indicated the tags can represent variable-width data elements. Variable-width date elements can omit leading zeroes without causing errors. The following options can be used in the format string where indicated in the table: s Specify this option to allow leading spaces in date formats. The s option is specified in the form:
%(tag,s)
Where tag is the format string. For example, %(m,s) indicates a numeric month of year field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the following date format property:
%(d,s)/%(m,s)/%yyyy
Then the following dates would all be valid:
8/ 8/1958
08/08/1958
8/8/1958
v Use this option in conjunction with the %ddd tag to represent day of year in variable-width format. So the following date property:
%(ddd,v)
represents values in the range 1 to 366. (If you omit the v option then the range of values would be 001 to 366.)
u Use this option to render uppercase text on output.
w Use this option to render lowercase text on output.
t Use this option to render titlecase text (initial capitals) on output.
The u, w, and t options are mutually exclusive. They affect how text is formatted for output. Input dates will still be correctly interpreted regardless of case.
-N Specify this option to left justify long day or month names so that the other elements in the date will be aligned.
+N Specify this option to right justify long day or month names so that the other elements in the date will be aligned.
Names are left justified or right justified within a fixed width field of N characters (where N is between 1 and 99). Names will be truncated if necessary. The following are examples of justification in use:
%dd-%(mmmm,-5)-%yyyy     21-Augus-2006
%dd-%(mmmm,-10)-%yyyy    21-August   -2005
%dd-%(mmmm,+10)-%yyyy    21-    August-2005
The locale for determining the setting of the day and month names can be controlled through the locale tag. This has the format:
%(L,locale)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See IBM InfoSphere DataStage and QualityStage Globalization Guide for a list of locales. The default locale for month names and weekday names markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set).
Use the locale tag in conjunction with your date format, for example the format string:
%(L,'es')%eeee, %dd %mmmm %yyyy
Specifies the Spanish locale and would result in a date with the following format:
miércoles, 21 septiembre 2005
The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the 'incompatible with' column. When some tags are used the format string requires that other tags are present too, as indicated in the 'requires' column.
Table 63. Format tag restrictions
Element        Numeric format tags       Text format tags   Requires               Incompatible with
year           %yyyy, %yy, %[nnnn]yy     -                  -                      -
month          %mm, %m                   %mmm, %mmmm        year                   week of year
day of month   %dd, %d                   -                  month                  day of week, week of year
day of year    %ddd                      -                  year                   day of month, day of week, week of year
day of week    %e, %E                    %eee, %eeee        month, week of year    day of year
week of year   %WW                       -                  year                   month, day of month, day of year
When a numeric variable-width input tag such as %d or %m is used, the field to the immediate right of the tag (if any) in the format string cannot be either a numeric tag, or a literal substring that starts with a digit. For example, all of the following format strings are invalid because of this restriction:
%d%m-%yyyy
%d%mm-%yyyy
%(d)%(mm)-%yyyy
%h00 hours
The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029. On import and export, the year_cutoff is the base year. This property is mutually exclusive with days_since, text, and julian.
You can include literal text in your date format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:
Tag    Escape sequence
%%     literal percent sign
\%     literal percent sign
\n     newline
\t     horizontal tab
\\     single backslash
date Uformat

The date uformat provides support for international components in date fields. Its syntax is:
String%macroString%macroString%macroString
where %macro is a date formatting macro such as %mmm for a 3-character English month. Only the String components of date uformat can include multi-byte Unicode characters. Note: Any argument that has to be double quoted cannot be a field name or a local variable. An argument must have the data format of its type.
date date_from_days_since(int32, "date" | format_variable): Returns date by adding the given integer to the baseline date. Converts an integer field into a date by adding the integer to the specified base date. The date must be in the format yyyy-mm-dd and must be either double quoted or a variable.
date date_from_julian_day(uint32): Returns the date given a Julian day.
date date_from_string(string, "date_format" | date_uformat | format_variable): Returns a date from the given string formatted in the optional format specification. By default the string format is yyyy-mm-dd. For format descriptions, see "date Format" and "date Uformat".
date date_from_ustring(ustring, "date_format" | date_uformat | format_variable): Returns a date from the given ustring formatted in the optional format specification. By default the ustring format is yyyy-mm-dd. For format descriptions, see "date Format" and "date Uformat".
string string_from_date(date, "date_format" | date_uformat): Converts the date to a string representation using the given format specification. By default the string format is yyyy-mm-dd. For format descriptions, see "date Format" and "date Uformat".
ustring ustring_from_date(date, "date_format" | date_uformat): Converts the date to a ustring representation using the given format specification. By default the ustring format is yyyy-mm-dd. For format descriptions, see "date Format" and "date Uformat".
date date_from_timestamp(timestamp): Returns the date from the given timestamp.
int32 days_since_from_date(date, "source_date" | format_variable): Returns a value corresponding to the number of days from source_date to date. source_date must be in the form yyyy-mm-dd and must be double quoted or be a variable.
uint32 julian_day_from_date(date): Returns a Julian date given the date.
int8 month_day_from_date(date): Returns the day of the month given the date. For example, the date 07-23-2001 returns 23.
int8 month_from_date(date): Returns the month from the given date. For example, the date 07-23-2001 returns 7.
date next_weekday_from_date(date, "day" | format_variable): The value returned is the date of the specified day of the week soonest after date (including the date). The day argument is optional. It is a string or variable specifying a day of the week. You can specify day by either the first three characters of the day name or the full day name. By default, the value is Sunday.
Returns the previous weekday date from date. The destination contains the closest date for the specified day of the week earlier than the source date (including the source date). The day argument is optional. It is a string or variable specifying a day of the week. You can specify day using either the first three characters of the day name or the full day name. By default, the value is Sunday.
Returns the day of the week from date. The optional argument origin_day is a string or variable specifying the day considered to be day zero of the week. You can specify the day using either the first three characters of the day name or the full day name. If omitted, Sunday is day zero.
Returns the day of the year (1-366) from date.
Returns the year from date. For example, the date 07-23-2001 returns 2001.
Returns the week of the year from date. For example, the date 07-23-2001 returns 30.
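For example, a derivation might parse a text field into a date and then extract the month. This sketch is not from the original examples; the field names in0.order_date_text, out0.order_date, and out0.order_month are assumptions:

out0.order_date = date_from_string(in0.order_date_text, "%dd/%mm/%yyyy");
out0.order_month = month_from_date(out0.order_date);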
decimal and float Field Functions

You can do the following transformations using the decimal and float field functions.
v Assign a decimal to an integer or float or numeric string, or compare a decimal to an integer or float or numeric string.
v Specify an optional fix_zero argument (int8) to cause a decimal field containing all zeros to be treated as a valid zero.
v Optionally specify a value for the rounding type (r_type) for many conversions. The values of r_type are:
ceil: Round the source field toward positive infinity. This mode corresponds to the IEEE 754 Round Up mode. Examples: 1.4 -> 2, -1.6 -> -1
floor: Round the source field toward negative infinity. This mode corresponds to the IEEE 754 Round Down mode. Examples: 1.6 -> 1, -1.4 -> -2
round_inf: Round or truncate the source field toward the nearest representable value, breaking ties by rounding positive values toward positive infinity and negative values toward negative infinity. This mode corresponds to the COBOL ROUNDED mode. Examples: 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
trunc_zero (default): Discard any fractional digits to the right of the right-most fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, round or truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Examples: 1.6 -> 1, -1.6 -> -1
Function decimal decimal_from_decimal ( decimal , " r_type " | format_variable ) Description Returns decimal in decimal representation, changing the precision and scale according to the returned type. The rounding type, r_type, might be ceil, floor, round_inf, or trunc_zero as described above this table. The default rtype is trunc_zero. Returns dfloat in decimal representation. The rounding type, r_type , might be ceil, floor, round_inf, or trunc_zero as described above this table. The default is trunc_zero. Returns string in decimal representation. The rounding type, r_type , might be ceil, floor, round_inf, or trunc_zero as described above this table. The default is trunc_zero. Returns ustring in decimal representation. The rounding type, r_type , might be ceil, floor, round_inf, or trunc_zero as described above this table. The default is trunc_zero. Returns decimal in dfloat representation. Returns int32 in decimal representation. The rounding type, r_type , might be ceil, floor, round_inf, or trunc_zero as described above this table. The default is trunc_zero. Returns int64 in decimal representation. The rounding type, r_type , might be ceil, floor, round_inf, or trunc_zero as described above this table. The default is trunc_zero. Returns uint64 in decimal representation. The rounding type, r_type , might be ceil, floor, round_inf, or trunc_zero as described above this table. The default is trunc_zero. Returns string in decimal representation. fix_zero causes a decimal field containing all zeros to be treated as a valid zero. suppress_zero argument specifies that the returned ustring value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1
dfloat dfloat_from_decimal ( decimal , "fix-zero" | format_variable ) int32 int32_from_decimal ( decimal , "r_type fix_zero")
Description Returns ustring in decimal representation. fix_zero causes a decimal field containing all zeros to be treated as a valid zero. suppress_zero argument specifies that the returned ustring value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1
Returns ustring in decimal representation. fix_zero causes a decimal field containing all zeros to be treated as a valid zero. suppress_zero argument specifies that the returned ustring value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1
Returns the mantissa (the digits right of the decimal point) from dfloat . Returns the mantissa (the digits right of the decimal point) from decimal .
raw Field Functions

Use the raw field functions to transform a string into a raw data type and to determine the length of a raw value.
raw raw_from_string(string): Returns string in raw representation.
raw u_raw_from_string(ustring): Returns ustring in raw representation.
int32 raw_length(raw): Returns the length of the raw field.
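For example, a derivation might store a string field as raw bytes and record its length. This sketch is not from the original examples; the field names in0.msg, out0.payload, and out0.payload_len are assumptions:

out0.payload = raw_from_string(in0.msg);
out0.payload_len = raw_length(out0.payload);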
time Uformat

The time uformat provides support for international components in time fields. Its syntax is:
String % macroString % macroString % macroString
where %macro is a time formatting macro such as %hh for a two-digit hour. See below for a description of the time format macros. Only the String components of time uformat can include multi-byte Unicode characters.

timestamp Uformat

This format is a concatenation of the date uformat and time uformat, which are described in "date Uformat" and "time Uformat". The order of the formats does not matter, but the two formats cannot be mixed.

time Format

The possible components of the time_format string are given in the following table:
Table 64. Time format tags
Tag        Variable width availability   Description                       Value range   Options
%h         import                        Hour (24), variable width         0...23        s
%hh                                      Hour (24), fixed width            0...23        s
%H         import                        Hour (12), variable width         1...12        s
%HH                                      Hour (12), fixed width            01...12       s
%n         import                        Minutes, variable width           0...59        s
%nn                                      Minutes, fixed width              0...59        s
%s         import                        Seconds, variable width           0...59        s
%ss                                      Seconds, fixed width              0...59        s
%s.N                                     Seconds + fraction (N = 0...6)                  s, c, C
%ss.N                                    Seconds + fraction (N = 0...6)                  s, c, C
%SSS       with v option                 Milliseconds                      0...999       s, v
%SSSSSS    with v option                 Microseconds                      0...999999    s, v
%aa                                      am/pm marker, locale specific     am, pm        u, w
By default, the format of the time contained in the string is %hh:%nn:%ss. However, you can specify a format string defining the format of the string field. You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%). Where indicated the tags can represent variable-fields on import, export, or both. Variable-width date elements can omit leading zeroes without causing errors.
The following options can be used in the format string where indicated:
s Specify this option to allow leading spaces in time formats. The s option is specified in the form:
%(tag,s)
Where tag is the format string. For example, %(n,s) indicates a minute field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the following time format property:
%(h,s):%(n,s):%(s,s)
Then the following times would all be valid:
20: 6:58
20:06:58
20:6:58
v Use this option in conjunction with the %SSS or %SSSSSS tags to represent milliseconds or microseconds in variable-width format. So the time property:
%(SSS,v)
represents values in the range 0 to 999. (If you omit the v option then the range of values would be 000 to 999.)
u Use this option to render the am/pm text in uppercase on output.
w Use this option to render the am/pm text in lowercase on output.
c Specify this option to use a comma as the decimal separator in the %ss.N tag.
C Specify this option to use a period as the decimal separator in the %ss.N tag.
The c and C options override the default setting of the locale. The locale for determining the setting of the am/pm string and the default decimal separator can be controlled through the locale tag. This has the format:
%(L,locale)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See IBM InfoSphere DataStage and QualityStage Globalization Guide for a list of locales. The default locale for am/pm string and separators markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set). Use the locale tag in conjunction with your time format, for example: %L('es')%HH:%nn %aa Specifies the Spanish locale. The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the 'incompatible with' column. When some tags are used the format string requires that other tags are present too, as indicated in the 'requires' column.
Table 65. Format tag restrictions
Element                Numeric format tags            Text format tags   Requires      Incompatible with
hour                   %hh, %h, %HH, %H               -                  -             -
am/pm marker           -                              %aa                hour (%HH)    hour (%hh)
minute                 %nn, %n                        -                  -             -
second                 %ss, %s                        -                  -             -
fraction of a second   %ss.N, %s.N, %SSS, %SSSSSS     -                  -             -
You can include literal text in your date format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:
Tag    Escape sequence
%%     literal percent sign
\%     literal percent sign
\n     newline
\t     horizontal tab
\\     single backslash
int8 hours_from_time(time): Returns the hour portion of the given time.
int32 microseconds_from_time(time): Returns the number of microseconds from the given time.
dfloat midnight_seconds_from_time(time): Returns the number of seconds from midnight to time.
int8 minutes_from_time(time): Returns the number of minutes from time.
dfloat seconds_from_time(time): Returns the number of seconds from time.
dfloat seconds_since_from_timestamp(timestamp, "source_timestamp_string" | format_variable): Returns the number of seconds from timestamp to the base timestamp, or optionally the second timestamp argument for the number of seconds between timestamps. The source_timestamp_string argument must be double quoted or be a variable.
Returns the time given the number of seconds (dfloat) since midnight.
Returns a time representation of string using the optional time_format, time_uformat, or format_variable. By default, the time format is hh:nn:ss. For format descriptions, see "time Format" and "time Uformat".
Returns a time representation of ustring using the optional time_format, time_uformat, or format_variable specification. By default, the time format is hh:nn:ss. For format descriptions, see "time Format" and "time Uformat".
Returns a string from time. The format argument is optional. The default time format is hh:nn:ss. For format descriptions, see "time Format" and "time Uformat".
Function string string_from_time ( time , " time_format " | format_variable | time_format ) time time_from_timestamp( timestamp ) date date_from_timestamp( timestamp ) timestamp timestamp_from_date_time ( date , time )
Description Returns a ustring from time . The format argument is optional.The default time format is hh:nn:ss. For format descriptions, see and "time Uformat" . Returns the time from timestamp . Returns the date from the given timestamp . Returns a timestamp from date and time . The date specifies the date portion ( yyyy - nn - dd ) of the timestamp. The time argument specifies the time to be used when building the timestamp. The time argument must be in the hh : nn :ss format. Returns the timestamp from the number of seconds ( dfloat ) from the base timestamp or the original_timestamp_string argument. The original_timestamp_string must be double quoted or be a variable. Returns a timestamp from string, in the optional timestamp_format , timestamp_uformat, or format_variable . The timestamp_format must be double quoted or be a variable. The default format is yyyy - nn - dd hh : nn : ss . timestamp_format is described in . Returns a timestamp from ustring , in the optional format specification. The timestamp_format must be a double quoted string, a uformat , or a variable. The default format is yyyy - nn - dd hh : nn : ss . timestamp_uformat is described in . Returns a string from timestamp . The formatting specification is optional. The default format is yyyy - mm - dd hh : mm : ss . Returns a ustring from timestamp . The formatting specification is optional. The default format is yyyy - mm - dd hh : mm : ss . Returns a timestamp from time . For format descriptions, see and "time Uformat" Returns the date from the given timestamp . Returns a timestamp from the given UNIX time_t representation ( int32 ). Returns the UNIX time_t representation of timestamp.
string string_from_timestamp ( timestamp , " timestamp_format " | format_variable ) ustring ustring_from_timestamp ( timestamp , " timestamp_format " | format_variable ) timestamp timestamp_from_time ( time , time_format | time_uformat ) date date_from_timestamp( timestamp ) timestamp timestamp_from_timet( int32 ) int32 timet_from_timestamp( timestamp )
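For example, a derivation might assemble a timestamp from separate date and time fields and also pull the hour back out. This sketch is not from the original examples; the field names in0.event_date, in0.event_time, out0.event_ts, and out0.event_hour are assumptions:

out0.event_ts = timestamp_from_date_time(in0.event_date, in0.event_time);
out0.event_hour = hours_from_time(in0.event_time);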
null handling functions

The following table lists the transformation functions for NULL handling. All data types support nulls. As part of processing a record, an operator can detect a null and take the appropriate action, for example, it can omit the null field from a calculation or signal an error condition. InfoSphere DataStage represents nulls in two ways.
v It allocates a single bit to mark a field as null. This type of representation is called an out-of-band null.
v It designates a specific field value to indicate a null, for example a numeric field's most negative possible value. This type of representation is called an in-band null. In-band null representation can be disadvantageous because you must reserve a field value for nulls and this value cannot be treated as valid data elsewhere. The null-handling functions can change a null representation from an out-of-band null to an in-band null and from an in-band null to an out-of-band null.
destination_field handle_null(source_field, value): Change the source_field NULL representation from an out-of-band representation to an in-band representation. The value field assigns the value that corresponds to NULL.
Changes source_field NULL representation from in-band NULL representation to out-of-band. The value field allows multiple valid NULL values to be inputted as arguments.
notnull(source_field): Returns 1 if source_field is not NULL, otherwise returns 0.
null(source_field): Returns 1 if source_field is NULL, otherwise returns 0.
set_null(): This function is used with "=" to set the left side output field, when it is nullable, to null. For example: a-field = set_null();
int8 is_dfloat_inband_null(dfloat): Returns 1 if dfloat is an inband null; otherwise it returns 0.
int8 is_int16_inband_null(int16): Returns 1 if int16 is an inband null; otherwise it returns 0.
int8 is_int32_inband_null(int32): Returns 1 if int32 is an inband null; otherwise it returns 0.
int8 is_int64_inband_null(int64): Returns 1 if int64 is an inband null; otherwise it returns 0.
int8 is_sfloat_inband_null(sfloat): Returns 1 if sfloat is an inband null; otherwise it returns 0.
int8 is_string_inband_null(string): Returns 1 if string is an inband null; otherwise it returns 0.
int8 u_is_string_inband_null(ustring): Returns 1 if ustring is an inband null; otherwise it returns 0.
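A sketch of how these functions might appear in a derivation (not from the original examples; in0.comment and out0.comment are assumed field names):

if (null(in0.comment))
    out0.comment = set_null();
else
    out0.comment = in0.comment;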
Mathematical functions
Function int32 abs( int32 ) dfloat acos( dfloat ) dfloat asin( dfloat ) dfloat atan( dfloat ) dfloat atan2( dfloat , dfloat ) Description Returns the absolute value of int32 . Returns the principal value of the arc cosine of dfloat . Returns the principal value of the arc sine of dfloat . Returns the principal value of the arc tangent of dfloat. Returns the principal value of the arc tangent of y/x (where y is the first argument).
Function dfloat ceil( decimal ) dfloat cos( dfloat ) dfloat cosh( dfloat ) dfloat exp( dfloat ) dfloat fabs( dfloat ) dfloat floor( decimal ) dfloat ldexp( dfloat , int32 ) uint64 llabs( int64 ) dfloat log( dfloat ) dfloat log10( dfloat ) int32 max( int32 , int32 ) int32 min( int32 , int32 ) dfloat pow( dfloat , dfloat ) uint32 rand()
Description Returns the smallest dfloat value greater than or equal to decimal . Returns the cosine of the given angle ( dfloat ) expressed in radians. Returns the hyperbolic cosine of dfloat . Returns the exponential of dfloat. Returns the absolute value of dfloat . Returns the largest dfloat value less than or equal to decimal . Reconstructs dfloat out of the mantissa and exponent of int32 . Returns the absolute value of int64 . Returns the natural (base e) logarithm of dfloat . Returns the logarithm to the base 10 of dfloat . Returns the larger of the two integers. Returns the smaller of the two integers. Returns the result of raising x (the first argument) to the power y (the second argument). Returns a pseudo-random integer between 0 and 2^32 - 1. The function uses a multiplicative congruential random-number generator with period 2^32. See the UNIX man page for rand for more details. Returns a random integer between 0 and 2^31 - 1. The function uses a nonlinear additive feedback random-number generator employing a default state array size of 31 long integers to return successive pseudo-random numbers. The period of this random-number generator is approximately 16 x (2^31 - 1). Compared with rand, random is slower but more random. See the UNIX man page for random for more details. Returns the sine of dfloat expressed in radians. Returns the hyperbolic sine of dfloat. Returns the square root of dfloat . Returns the value of the quotient after dfloat1 is divided by dfloat2 . Sets a new seed ( uint32 ) for the frand() or srand() random number generator. Sets a random seed for the random() number generator. See the UNIX man page for srandom for more details. Returns the tangent of the given angle ( dfloat ) expressed in radians. Returns the hyperbolic tangent of dfloat .
uint32 random()
dfloat sin( dfloat ) dfloat sinh( dfloat ) dfloat sqrt( dfloat ) int32 quotient_from_dfloat ( dfloat1 , dfloat2) srand( uint32 ) srandom( uint32 ) dfloat tan( dfloat ) dfloat tanh( dfloat )
Each row of the lookup table specifies an association between a 16-bit integer or unsigned 32-bit integer value and a string or ustring. InfoSphere DataStage scans the Numeric Value or the String or Ustring column until it encounters the value or string to be translated. The output is the corresponding entry in the row. The numeric value to be converted might be of the int16 or the uint32 data type. InfoSphere DataStage converts strings to values of the int16 or uint32 data type using the same table. If the input contains a numeric value or string that is not listed in the table, InfoSphere DataStage operates as follows: v If a numeric value is unknown, an empty string is returned by default. However, you can set a default string value to be returned by the string lookup table. v If a string has no corresponding value, 0 is returned by default. However, you can set a default numeric value to be returned by the string lookup table. A table definition defines the rows of a string or ustring lookup table and has the following form:
{propertyList} (string | ustring = value; string | ustring= value; ... )
where:
propertyList is one or more of the following options; the entire list is enclosed in braces and properties are separated by commas if there are more than one:
v case_sensitive: perform a case-sensitive search for matching strings; the default is case-insensitive.
v default_value = defVal: the default numeric value returned for a string that does not match any of the strings in the table.
v default_string = defString: the default string returned for numeric values that do not match any numeric value in the table.
v string or ustring specifies a comma-separated list of strings or ustrings associated with value; enclose each string or ustring in quotes.
v value specifies a comma-separated list of 16-bit integer values associated with string or ustring.
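For example, a table definition of the following form associates day names with numbers; the default_value and default_string properties supply the results for unmatched input (the day-name data is illustrative only):

{default_value = -1, default_string = "unknown"}
("Sunday" = 1; "Monday" = 2; "Tuesday" = 3;)

Such a definition can be passed, either as a double-quoted string or through a table variable, to functions such as lookup_int16_from_string and lookup_string_from_int16, which are described later in this section.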
Function int8 is_alnum( string ) int8 is_alpha( string ) int8 is_numeric( string ) int8 is_valid (" type_string ", " value_string ")
Description Returns 1 (true) if string consists entirely of alphanumeric characters. Returns 1 (true) if string consists entirely of alphabetic characters. Returns 1 (true) if string consists entirely of numeric characters, including decimal and sign. Returns 1 (true) if value_string is valid according to type_string , including NULL. The type_string argument is required. It must specify an InfoSphere DataStage schema data type. Integer types are checked to ensure the value_string is numeric (signed or unsigned), a whole number, and a valid value (for example, 1024 cannot be assigned to an int8 type). The value_string can contain leading spaces, but not trailing spaces. Decimal types are checked to ensure the value_string is numeric (signed or unsigned) and a valid value. Float types are checked to ensure the value_string is numeric (signed or unsigned) and a valid value (exponent is valid). The value_string can contain leading spaces, but not trailing spaces. String is always valid with the NULL exception below. For all types, if the field cannot be set to NULL and the string is NULL, 0 (false) is returned. Date, time, and timestamp types are checked to ensure they are correct, using the optional format argument, and valid values. Raw cannot be checked since the input is a string.
int16 lookup_int16_from_string ( string , " table_definition " | table_variable ) string lookup_string_from_int16 ( int16 , " table_definition " | table_variable ) string lookup_string_from_uint32 ( uint32 , " table_definition " | table_variable ) uint32 lookup_uint32_from_ string ( string , " table_definition " | table_variable ) string lower_case( string ) string string_from_date ( date , " date_format " | format_variable | date_uformat)
Returns an integer corresponding to string using table_definition string or variable. See "String Conversions and Lookup Tables" for more information. Returns a string corresponding to int16 using table_definition string or variable. See "String Conversions and Lookup Tables" for more information. Returns a string corresponding to uint32 using table_definition string or variable. See "String Conversions and Lookup Tables" for more information. Returns an unsigned integer from string using table_definition string or variable. Converts string to lowercase. Non-alphabetic characters are ignored in the transformation. Converts date to a string representation using the specified optional formatting specification. By default, the date format is yyyy - mm - dd . For format descriptions, see "date Uformat" .
Description Returns a string from decimal . fix_zero causes a decimal field containing all zeros to be treated as a valid zero. suppress_zero argument specifies that the returned ustring value will have no leading or trailing zeros. Examples: 000.100 -> 0.1; 001.000 -> 1; -001.100 -> -1.1 The formatting specification is optional.
string string_from_time ( time ," time_format " | format_variable | time_uformat) string string_from_timestamp ( timestamp , " timestamp_format " | format_variable ) string soundex ( input_string)
Returns a string from time . The format argument is optional.The default time format is hh : nn :ss. For format descriptions, see and "time Uformat" . Returns a string from timestamp . The formatting specification is optional. The default format is yyyy-mm-dd hh:mm:ss. Returns a string which represents the phonetic code for the input_string word. Input words that produce the same code are considered phonetically equivalent. Only alphabetic characters (a-z, A-Z) are processed. All other characters are ignored. If input_string is empty, or contains no alphabetic characters, an empty string is returned. For non-empty strings, the returned value contains four characters: v the first character is the first alphabetic character in the input string v the remaining three characters are digits representing the remaining alphabetic characters in the string
string upper_case( string ) string compact_whitespace ( string ) string pad_string ( string , pad_string , pad_length )
Converts string to uppercase. Non-alphabetic characters are ignored in the transformation. Returns a string after reducing all consecutive white space in string to a single space. Returns the string with the pad_string appended to the bounded length string for pad_length number of characters. pad_length is an int16. When the given string is a variable-length string, it defaults to a bounded-length of 1024 characters. If the given string is a fixed-length string, this function has no effect.
string strip_whitespace( string ) string trim_leading_trailing ( string ) string trim_leading( string ) string trim_trailing( string )
Returns string after stripping all white space in the string. Returns string after removing all leading and trailing white space. Returns a string after removing all leading white space. Returns a string after removing all trailing white space.
Function
Description
int32 string_order_compare ( string1 , string2 , justification )
Returns a numeric value specifying the result of the comparison. The numeric values are:
-1: string1 is less than string2
0: string1 is equal to string2
1: string1 is greater than string2
The string justification argument is either 'L' or 'R'. It defaults to 'L' if not specified. 'L' means a standard character comparison, left to right. 'R' means that any numeric substrings within the strings starting at the same position are compared as numbers. For example, an 'R' comparison of "AB100" and "AB99" indicates that AB100 is greater than AB99, since 100 is greater than 99. The comparisons are case sensitive.
string replace_substring ( expression1 , expression2 , string )
Returns a string value that contains the given string , with any characters in expression1 replaced by their corresponding characters in expression2 . For example: replace_substring ("ABC", "abZ", "AGDCBDA") returns "aGDZbDa", where any "A" gets replaced by "a", any "B" gets replaced by "b" and any "C" gets replaced by "Z". If expression2 is longer than expression1 , the extra characters are ignored. If expression1 is longer than expression2 , the extra characters in expression1 are deleted from the given string (the corresponding characters are removed). For example: replace_substring("ABC", "ab", "AGDCBDA") returns "aGDbDa".
int32 count_substring ( string , substring )
Returns the number of times that substring occurs in string . If substring is an empty string, the number of characters in string is returned.
int32 dcount_substring ( string , delimiter )
Returns the number of fields in string delimited by delimiter , where delimiter is a string. For example, dcount_substring("abcFdefFghi", "F") returns 3. If delimiter is an empty string, the number of characters in the string + 1 is returned. If delimiter is not empty, but does not exist in the given string, 1 is returned.
string double_quote_string ( expression )
Returns the given string expression enclosed in double quotes.
Description The string and delimiter arguments are string values, and the occurrence and numsubstr arguments are int32 values. This function returns numsubstr substrings from string , delimited by delimiter and starting at substring number occurrence . An example is: substring_by_delimiter ("abcFdefFghiFjkl", "F", 2, 2) The string "defFghi" is returned. If occurrence is < 1, then 1 is assumed. If occurrence does not point to an existing field, the empty string is returned. If numsubstr is not specified or is less than 1, it defaults to 1.
Returns the starting position of the nth occurrence of substring in string . The occurrence argument is an integer indicating the nth occurrence. If there is no nth occurrence or string doesn't contain any substring , -1 is returned. If substring is an empty string, -2 is returned.
Returns the first length characters of string . If length is 0, it returns the empty string. If length is greater than the length of the string, the entire string is returned. Returns the last length characters of string. If length is 0, it returns the empty string. If length is greater than the length of string , the entire string is returned. Returns a string containing count spaces. The empty string is returned for a count of 0 or less. Returns the expression string enclosed in single quotes. Returns a string containing count occurrences of string. The empty string is returned for a count of 0 or less. If only string is specified, all leading and trailing spaces and tabs are removed, and all multiple occurrences of spaces and tabs are reduced to a single space or tab. If string and character are specified, option defaults to 'R' The available option values are: 'A' remove all occurrences of character 'B' remove both leading and trailing occurrences of character. 'D' remove leading, trailing, and redundant white-space characters. 'E' remove trailing white-space characters 'F' remove leading white-space characters 'L' remove all leading occurrences of character 'R' remove all leading, trailing, and redundant occurrences of character 'T' remove all trailing occurrences of character
string string_of_space( count) string single_quote_string ( expression) string string_of_substring ( string , count) string trimc_string ( string [, character [, option ]])
string system_time_date()
Returns the current system time in this 24-hour format: hh:mm:ss dd:mmm:yyyy
Searches for the substring in the string beginning at character number position , where position is an uint32. Returns the starting position of the substring. This is a case-insensitive version of string_compare() below.
Function int8 string_compare ( string , string) int8 string_num_case_compare ( string , string, uint16) string string_num_concatenate ( string , string, uint16) int8 string_num_compare ( string , string, uint16) string string_num_copy ( string , uint16) int32 string_length( string ) string substring ( string , starting_position, length)
Description Compares two strings and returns the index (0 or 1) of the greater string. This is a case-insensitive version of string_num_compare() below. Returns a string after appending uint16 characters from the second string onto the first string. Compares first uint16 characters of two given strings and returns the index (0 or 1) of the greater string. Returns the first uint16 characters from the given string . Returns the length of the string. Copies parts of strings to shorter strings by string extraction. The starting_position specifies the starting location of the substring; length specifies the substring length. The starting_position and length are of type uint16. The starting_position is 0-based, and must be >= 0. Otherwise, the empty string is returned. The length must be > 0. Otherwise, the empty string is returned. An example use is:
string in;
string out;
in = "world";
out = substring( in, 0, 1 );   // out = "w"
out = substring( in, 0, 3 );   // out = "wor"
out = substring( in, 1, 3 );   // out = "orl"
out = substring( in, 3, 3 );   // out = "ld"
Copies parts of strings to shorter strings by string extraction. The starting_position specifies the starting location of the substring; length specifies the substring length. The starting_position and length are of type uint16. The starting_position is 1-based, and must be >= 1. Otherwise, the empty string is returned. The length must be > 0. Otherwise, the empty string is returned. An example use is:
string in;
string out;
in = "world";
out = substring_1( in, 1, 1 );   // out = "w"
out = substring_1( in, 1, 3 );   // out = "wor"
out = substring_1( in, 2, 3 );   // out = "orl"
out = substring_1( in, 4, 3 );   // out = "ld"
Description Returns an ASCII character from the given int32 . If given a value that is not associated with a character such as -1, the function returns a space. An example use is: char_from_num(38) which returns "&"
Returns the numeric value of the ASCII-character in the string. When this function is given an empty string, it returns 0; and when it is given a multi-character string, it uses the first character in the string. An example use is: num_from_char("&") which returns 38.
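As a short illustration, the following sketch (the field names in0.raw_name, out0.name, and out0.name_len are hypothetical) combines several of the string functions described above inside a mainloop:

string cleaned;
cleaned = compact_whitespace(in0.raw_name);
cleaned = trim_leading_trailing(cleaned);
out0.name = upper_case(cleaned);
out0.name_len = string_length(cleaned);
writerecord 0;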
Function int16 lookup_int16_from_ustring ( ustring , " table_definition " | table_variable ) ustring lookup_ustring_from_int16 ( int16 , " table_definition " | table_variable ) ustring lookup_ustring_from_uint32 ( uint32 , " table_definition " | table_variable ) uint32 lookup_uint32_from_ustring ( string , " table_definition " | table_variable ) int8 u_is_valid (" type_ustring ", " value_ustring ")
Description Returns an integer corresponding to ustring using table_definition string or variable. See "String Conversions and Lookup Tables" for more information. Returns a ustring corresponding to int16 using table_definition string or table_variable . See "String Conversions and Lookup Tables" for more information. Returns a ustring corresponding to uint32 using table_definition string or variable. See"String Conversions and Lookup Tables" for more information. Returns an unsigned integer from ustring using table_definition or table_variable . Returns 1 (true) if value_ustring is valid according to type_ustring , including NULL. The type_ustring argument is required. It must specify an InfoSphere DataStage schema data type. Integer types are checked to ensure the value_ustring is numeric (signed or unsigned), a whole number, and a valid value (for example, 1024 can not be assigned to an int8 type). The value_string can contain leading spaces, but not trailing spaces. Decimal types are checked to ensure the value_ustring is numeric (signed or unsigned) and a valid value. Float types are checked to ensure the value_ustring is numeric (signed or unsigned) and a valid value (exponent is valid). The value_string can contain leading spaces, but not trailing spaces. String is always valid with the NULL exception below. For all types, if the field cannot be set to NULL and the string is NULL, 0 (false) is returned. Date, time, and timestamp types are checked to ensure they are correct, using the optional format argument, and valid values. Raw cannot be checked since the input is a string.
ustring u_lower_case( ustring ) ustring u_upper_case( ustring ) ustring u_compact_whitespace ( ustring ) ustring u_pad_string ( ustring , pad_ustring , pad_length )
Converts ustring to lowercase. Non-alphabetic characters are ignored in the transformation. Converts ustring to uppercase. Non-alphabetic characters are ignored in the transformation. Returns the ustring after reducing all consecutive white space in ustring to a single space. Returns the ustring with pad_ustring appended to the bounded length string for pad_length number of characters. pad_length is an int16. When the given ustring is a variable-length string, it defaults to a bounded-length of 1024 characters. If the given ustring is a fixed-length string, this function has no effect.
Function ustring u_trim_leading_trailing ( ustring ) ustring u_trim_leading( ustring ) ustring u_trim_trailing ( ustring ) int32 u_string_order_compare ( ustring1 , ustring2, justification)
Description Returns ustring after removing all leading and trailing white space. Returns ustring after removing all leading white space. Returns a ustring after removing all trailing white space. Returns a numeric value specifying the result of the comparison. The numeric values are:
-1: ustring1 is less than ustring2
0: ustring1 is equal to ustring2
1: ustring1 is greater than ustring2
The string justification argument is either 'L' or 'R'. It defaults to 'L' if not specified. 'L' means a standard character comparison, left to right. 'R' means that any numeric substrings within the strings starting at the same position are compared as numbers. For example, an 'R' comparison of "AB100" and "AB99" indicates that AB100 is greater than AB99, since 100 is greater than 99. The comparisons are case sensitive.
Returns a ustring value that contains the given ustring , with any characters in expression1 replaced by their corresponding characters in expression2 . For example: u_replace_substring ("ABC", "abZ", "AGDCBDA") returns "aGDZbDa", where any "A" gets replaced by "a", any "B" gets replaced by "b" and any "C" gets replaced by "Z". If expression2 is longer than expression1 , the extra characters are ignored. If expression1 is longer than expression2, the extra characters in expression1 are deleted from the given string (the corresponding characters are removed.) For example: u_replace_substring("ABC", "ab", "AGDCBDA") returns "aGDbDa".
Returns the number of times that sub_ustring occurs in ustring . If sub_ustring is an empty string, the number of characters in ustring is returned. Returns the number of fields in ustring delimited by delimiter , where delimiter is a string. For example, dcount_substring("abcFdefFghi", "F") returns 3. If delimiter is an empty string, the number of characters in the string + 1 is returned. If delimiter is not empty, but does not exist in the given string, 1 is returned.
Description The delimiter argument is a ustring value, and the occurrence and numsubstr arguments are int32 values. This function returns numsubstr substrings from ustring , delimited by delimiter and starting at substring number occurrence . An example is: u_substring_by_delimiter ("abcFdefFghiFjkl", "F", 2, 2) The string "defFghi" is returned. If occurrence is < 1, then 1 is assumed. If occurrence does not point to an existing field, the empty string is returned. If numsubstr is not specified or is less than 1, it defaults to 1.
Returns the starting position of the nth occurrence of sub_ustring in ustring. The occurrence argument is an integer indicating the nth occurrence . If there is no nth occurrence, 0 is returned; if sub_ustring is an empty string, -2 is returned; and if ustring doesn't contain any sub_ustring , -1 is returned.
Returns the first length characters of ustring. If length is 0, it returns the empty string. If length is greater than the length of the ustring , the entire ustring is returned. Returns the last length characters of ustring . If length is 0, it returns the empty string. If length is greater than the length of ustring , the entire ustring is returned. Returns a ustring containing count spaces. The empty string is returned for a count of 0 or less. Returns expression enclosed in single quotes. Returns a ustring containing count occurrences of ustring. The empty string is returned for a count of 0 or less. If only ustring is specified, all leading and trailing spaces and tabs are removed, and all multiple occurrences of spaces and tabs are reduced to a single space or tab. If ustring and character are specified, option defaults to 'R' The available option values are: 'A' remove all occurrences of character 'B' remove both leading and trailing occurrences of character. 'D' remove leading, trailing, and redundant white-space characters. 'E' remove trailing white-space characters 'F' remove leading white-space characters 'L' remove all leading occurrences of character 'R' remove all leading, trailing, and redundant occurrences of character 'T' remove all trailing occurrences of character
ustring u_string_of_space( count) ustring u_single_quote_string ( expression) ustring u_string_of_substring ( ustring , count) ustring u_trimc_string ( ustring [, character [, option ]])
ustring u_system_time_date()
Returns the current system time in this 24-hour format: hh:mm:ss dd:mmm:yyyy
Searches for the sub_ustring in the ustring beginning at character number position, where position is an uint32. Returns the starting position of the substring. This is a case-insensitive version of u_string_compare() below.
Function int8 u_string_compare ( ustring , ustring) int8 u_string_num_case_compare ( ustring , ustring, uint16) ustring u_string_num_concatenate ( ustring , ustring, uint16) int8 u_string_num_compare ( ustring , ustring, uint16) ustring u_string_num_copy ( ustring , uint16) int32 u_string_length( ustring ) ustring u_substring ( ustring , starting_position, length)
Description Compares two ustrings and returns the index (0 or 1) of the greater string. This is a case-insensitive version of u_string_num_compare() below. Returns a ustring after appending uint16 characters from the second ustring onto the first ustring . Compares first uint16 characters of two given ustrings and returns the index (0 or 1) of the greater ustring . Returns the first uint16 characters from the given ustring . Returns the length of the ustring . Copies parts of ustrings to shorter strings by string extraction. The starting_position specifies the starting location of the substring; length specifies the substring length. The starting_position and length are of type uint16. The starting_position is 0-based, and must be >= 0. Otherwise, the empty string is returned. The length must be > 0. Otherwise, the empty string is returned. An example use is:
ustring in;
ustring out;
in = "world";
out = u_substring( in, 0, 1 );   // out = "w"
out = u_substring( in, 0, 3 );   // out = "wor"
out = u_substring( in, 1, 3 );   // out = "orl"
out = u_substring( in, 3, 3 );   // out = "ld"
Copies parts of ustrings to shorter strings by string extraction. The starting_position specifies the starting location of the substring; length specifies the substring length. The starting_position and length are of type uint16. The starting_position is 1-based, and must be >= 1. Otherwise, the empty string is returned. The length must be > 0. Otherwise, the empty string is returned. An example use is:
ustring in;
ustring out;
in = "world";
out = u_substring_1( in, 1, 1 );   // out = "w"
out = u_substring_1( in, 1, 3 );   // out = "wor"
out = u_substring_1( in, 2, 3 );   // out = "orl"
out = u_substring_1( in, 4, 3 );   // out = "ld"
Description Returns a ustring character value from the given int32 . If given a value that is not associated with a character such as -1, the function returns a space. An example use is: u_char_from_num(38) which returns "&"
Returns the numeric value of the character in the ustring . When this function is given an empty string, it returns 0; and when it is given a multi-character string, it uses the first character in the string. An example use is: u_num_from_char("&") which returns 38
Function string set_custom_summary_info ( name_string , description_string, value_string) ustring u_set_custom_summary_info ( name_ustring ,description_ustring, value_ustring) string send_custom_report ( name_string , description_string, value_string) ustring u_send_custom_report ( name_ustring , description_ustring , value_ustring) string set_custom_instance_report ( name_string , description_string, value_string) ustring u_set_custom_instance_report ( name_ustring , description_ustring, value_ustring)
Miscellaneous functions
The following table describes functions in the Transformation Language that do not fit into any of the above categories.
Function void force_error ( error_message_string ) u_force_error ( error_message_ustring ) string get_environment( string ) Description Terminates the data flow when an error is detected, and prints error_message_string to stderr. Terminates the data flow when an error is detected, and prints error_message_ustring to stderr. Returns the current value of string , a UNIX environment variable. The functionality is the same as the C getenv function. Returns the current value of ustring , a UNIX environment variable. The functionality is the same as the C getenv function. Returns the current partition number. Returns the number of partitions. Prints message_string to stdout. Prints message_ustring to stdout. Returns the actual size value when it is stored, not the length.
int16 get_partition_num() int16 get_num_of_partitions() void print_message ( message_string ) u_print_message ( message_ustring ) uint32 size_of( value )
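For example, a brief sketch (the message text is illustrative only) that combines the partition functions with print_message and a default int16-to-string conversion:

int16 partNum;
string partStr;
partNum = get_partition_num();
partStr = partNum;
print_message("Processing records on partition " + partStr);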
By default, the transform operator sets the precision and scale of any temporary internal decimal variable created during arithmetic operations to:
[TRX_DEFAULT_MAX_PRECISION=38, TRX_DEFAULT_MAX_SCALE=10] with RoundMode equal to eRoundInf.
You can override the default values using these environment variables:
v APT_DECIMAL_INTERM_PRECISION value
v APT_DECIMAL_INTERM_SCALE value
v APT_DECIMAL_INTERM_ROUNDMODE ceil | floor | round_inf | trunc_zero
Fatal errors might occur at runtime if the precision of the destination decimal is smaller than that of the source decimal.
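For example, settings such as the following (the values are illustrative only) could be exported in the shell before invoking osh:

export APT_DECIMAL_INTERM_PRECISION=20
export APT_DECIMAL_INTERM_SCALE=6
export APT_DECIMAL_INTERM_ROUNDMODE=round_inf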
Keywords
Keywords in the Transformation Language are not case sensitive.
Local variables
You cannot initialize variables as part of the declaration. There are no local variable pointers or structures. Enums are not supported.
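For example, a declaration and its initialization must be written as separate statements:

int32 recordCount;
recordCount = 0;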
Operators
Several C operators are not part of the language: v The comma operator v Composite assignment operators such as +=.
Flow control
The switch and case C keywords do not appear in the Transformation Language. The if ... else if construct for multiple branching is not supported. You can accomplish the same effect by using multiple if statements in sequence together with complex boolean expressions. For example, where in C you could write:
if (size < 7) tag = "S"; else if (size < 9) tag = "M"; else tag = "L";
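in the Transformation Language the same branching can be written as a sequence of if statements (a minimal equivalent sketch):

if (size < 7) tag = "S";
if (size >= 7 && size < 9) tag = "M";
if (size >= 9) tag = "L";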
After the records are processed, a report is printed that shows the number of students in each score category.
v Stage variables. How to declare stage variables and where to declare them. Examples are recNum0, recNum1, and recNum2.
v Record flow controls. How to use the conditional statement with the writerecord command.
v Local variables. How to declare and initialize local variables. Examples from the code are numOfPoorScores, numOfFairScores, and numOfGoodScores.
v Default conversions. How to do default conversions using the assignment operator. An example that converts an int32 to a string is: numOfGoodScores = recNum2
v Message logging. How to make function calls to print_message().
Transformation language
The record-processing logic can be expressed in the Transformation Language as follows. The expression file name for this example is score_distr_expr.
global
{
   // the global variable that contains the name for each job run
   string jobName;
}
initialize
{
   // the number of records in output 0
   int32 recNum0;
   // the number of records in output 1
   int32 recNum1;
   // the number of records in output 2
   int32 recNum2;
   // initialization
   recNum0 = 0;
   recNum1 = 0;
   recNum2 = 0;
}
mainloop
{
   // records in output 0
   if (score < 75)
   {
      recNum0++;
      writerecord 0;
   }
   // records of output 1
   if (score >= 75 && score <= 90)
   {
      recNum1++;
      writerecord 1;
   }
   // records of output 2
   if (score > 90)
   {
      recNum2++;
      writerecord 2;
   }
}
finish
{
   // define a string local variable to store the number of
   // students with poor scores
   string numOfPoorScores;
   numOfPoorScores = recNum0;
   // define a string local variable to store the number of
   // students with fair scores
   string numOfFairScores;
   numOfFairScores = recNum1;
   // define a string local variable to store the number of
   // students with good scores
   string numOfGoodScores;
   numOfGoodScores = recNum2;
   // Print out the number of records in each output data set
   print_message(jobName + " has finished running.");
   print_message("The number of students having poor scores are " + numOfPoorScores);
   print_message("The number of students having fair scores are " + numOfFairScores);
   print_message("The number of students having good scores are " + numOfGoodScores);
}
osh command
An example osh command to run this job is:
osh -params "jobName=classA" -f score_distr
osh script
The contents of score_distr are:
#compile the expression code transform -inputschema record(student_id:string[10];score:int32;) -outputschema record(student_id:string[10];score:int32;) -outputschema record(student_id:string[10];score:int32;) -outputschema record(student_id:string[10];score:int32;) -expressionfile score_distr_expr -flag compile -name score_map; #run the job import -schema record(student_id:string[10];score:int32;) -file [&jobName].txt | transform -flag run -name score_map 0> export -schema record(student_id:string[10];score:int32) -filename [&jobName]poor_score.out -overwrite 1> -export -schema record(student_id:string[10];score:int32) -file [&jobName]fair_score.out -overwrite 2> -export -schema record(student_id:string[10];score:int32) -file [&jobName]good_score.out -overwrite
classAfair_score.out:
80 75 85 82 87 89
classAgood_score.out:
A218925316 95
A238950561 92
The global variable jobName is initialized using the -params option. To determine the score distribution for class B, for example, assign jobName another value:
osh -params "jobName=classB" -f score_distr
Transformation language
The record-processing logic can be expressed in the Transformation Language as follows. The expression file name for this example is score_grade_expr.
inputname 0 in0; outputname 0 out0; outputname 1 out1; outputname 2 out2; global { // the global variable that contains the name for each job run string jobName; } initialize { // the number of records in the outputs int32 recNum0, recNum1, recNum2; // initialization recNum0 = 0; recNum1 = 0; recNum2 = 0; } mainloop { if (in0.score < 75) { recNum0++; out0.grade = "C"; writerecord 0; } if (in0.score >= 75 && in0.score <= 90) { recNum1++; out1.grade = "B"; writerecord 1; } if (in0.score > 90) { recNum2++; out2.grade = "A"; writerecord 2; } } finish { // define string local variables to store the number of // students having different letter grades string numOfCs, numOfBs, numOfAs; // default conversions using assignments numOfCs = recNum0; numOfBs = recNum1; numOfAs = recNum2; // Print out the number of records in each output data set print_message(jobName+ " has finished running."); print_message("The number of students getting C is " +numOfCs); print_message("The number of students getting B is " +numOfBs); print_message("The number of students getting A is " +numOfAs); }
osh command
An example osh command to run this job is:
osh -params "jobName=classA" -f score_grade
osh script
The contents of score_grade are:
#compile the expression code transform -inputschema record(student_id:string[10];score:int32;) -outputschema record (student_id:string[10];score:int32;grade:string[1]) -outputschema record (student_id:string[10];score:int32;grade:string[1]) -outputschema record (student_id:string[10];score:int32;grade:string[1]) -expressionfile score_grade_expr -flag compile -name score_map; #run the job import -schema record(student_id:string[10];score:int32;) -file [&jobName].txt | transform -flag run -name score_map 0> -export record(student_id:string[10];score:int32;grade:string[1]) -file [&jobName]poor_scores.out -overwrite 1> -export record(student_id:string[10];score:int32;grade:string[1]) -file [&jobName]fair_scores.out -overwrite 2> -export record(student_id:string[10];score:int32;grade:string[1]) -file [&jobName]good_scores.out -overwrite
classAfair_scores.out
A112387567 80 B
A731347820 75 B
A897382711 85 B
A327637289 82 B
A238967521 87 B
A763567100 89 B
classAgood_scores.out
A218925316 95 A
A238950561 92 A
The student class is determined by the first character of the student_id. If the first character is B, the student's class is Beginner; if the first character is I, the student's class is Intermediate; and if the first character is A, the student's class is Advanced. Records with the same class field are written to the same output. The score field is not only transferred from input to output, but is also changed. If the score field is less than 75, the output score is:
(in0.score+(200-in0.score*2)/4)

Otherwise, the output score is:

(in0.score+(200-in0.score*2)/3)
Transformation language
The record-processing logic can be expressed in the Transformation Language as follows. The expression file for this example is score_class_expr.
inputname 0 in0;
outputname 0 out0;
outputname 1 out1;
outputname 2 out2;
mainloop
{
   // define an int32 local variable to store the score
   int32 score_local;
   score_local = (in0.score < 75) ? (in0.score+(200-in0.score*2)/4) : (in0.score+(200-in0.score*2)/3);
   // define a string local variable to store the grade
string[1] grade_local; if (score_local < 75) grade_local = "C"; if (score_local >= 75 && score_local <= 90) grade_local = "B"; if (score_local > 90) grade_local = "A"; // define string local variables to check the class level string[max=15] class_local; string[1] class_init; class_init = substring(in0.student_id,0,1); if (class_init == "B") class_local = "Beginner"; if (class_init == "I") class_local = "Intermediate"; if (class_init == "A") class_local = "Advanced"; // outputs if (class_local == "Beginner") { out0.score = score_local; out0.grade = grade_local; out0.class = class_local; writerecord 0; } if (class_local == "Intermediate") { out1.score = score_local; out1.grade = grade_local; out1.class = class_local; writerecord 1; } if (class_local == "Advanced") { out2.score = score_local; out2.grade = grade_local; out2.class = class_local; writerecord 2; } }
osh command
The osh command to run this job is:
osh -f score_class
osh script
The contents of score_class are:
#compile the expression code transform -inputschema record(student_id:string[10];score:int32;) -outputschema record (student_id:string[10];score:int32;grade:string[1]; class:string[max=15]) -outputschema record (student_id:string[10];score:int32;grade:string[1]; class:string[max=15]) -outputschema record (student_id:string[10];score:int32;grade:string[1]; class:string[max=15]) -expressionfile score_class_expr -flag compile -name score_map; #run the job import -schema record(student_id:string[10];score:int32;) -file score.txt | transform -flag run -name score_map 0> -export record(student_id:string[10];score:int32;grade:string[1]; class:string[max=15]) -file beginner.out -overwrite 1> -export record(student_id:string[10];score:int32;grade:string[1]; class:string[max=15]) -filename intermediate.out -overwrite 2> -export record(student_id:string[10];score:int32;grade:string[1]; class:string[max=15]) -file advanced.out -overwrite
intermediate.out
I731347820 91 A Intermediate
I327637289 94 A Intermediate
I238967521 95 A Intermediate
advanced.out
A218925316 98 A Advanced
A619846201 85 B Advanced
A238950561 97 A Advanced
A763567100 96 A Advanced
Example 4. student record distribution with null score values and a reject
Job logic
The job logic in this example is the same as in "Example 3: Student-Score Distribution with a Class Field Added to Example 2" . The difference is that the input records contain null score fields, and records with null score fields are transferred to a reject data set. This example shows you how to specify and use a reject data set. The Transformation Language expression file is the same as that for Example 3.
osh command
The osh command to run this job is:
osh -f score_reject
osh script
The contents of score_reject are:
#compile the expression code transform -inputschema record(student_id:string[10]; score:nullable int32;) -outputschema record (student_id:string[10];score:nullable int32;grade:string[1]; class:string[max=15]) -outputschema record (student_id:string[10];score:nullable int32;grade:string[1]; class:string[max=15]) -outputschema record (student_id:string[10];score:nullable int32;grade:string[1]; class:string[max=15]) -expressionfile score_class_expr -flag compile -name score_map -reject; #run the job import -schema record(student_id:string[10];score:nullable int32 {null_field=NULL}) -file score_null.txt | transform -flag run -name score_map 0> -export -schema record(student_id:string[10]; score:nullable int32{null_field=NULL}; grade:string[1]; class:string[max=15]) -file beginner.out -overwrite 1> -export record(student_id:string[10]; score:nullable int32{null_field=NULL}; grade:string[1]; class:string[max=15]) -file intermediate.out -overwrite 2> -export record(student_id:string[10]; score:nullable int32{null_field=NULL}; grade:string[1]; class:string[max=15]) -file advanced.out -overwrite 3> -export record(student_id:string[10]; score:nullable int32{null_field=NULL};) -file reject.out -overwrite
B112387567 I218925316 A619846201 I731347820 B897382711 I327637289 A238950561 I238967521 B826381931 A763567100
intermediate.out
I218925316 98 A Intermediate
I731347820 91 A Intermediate
I238967521 95 A Intermediate
advanced.out
A619846201 85 B Advanced
A238950561 97 A Advanced
reject.out
B112387567 NULL
I327637289 NULL
A763567100 NULL
1. Use a function call to handle_null(). For example:
score_local = handle_null(in0.score, "-100");
2. Use a function call to null() and the ternary operator. For example:
score_local = null(in0.score)?-100:in0.score;
3. Use a function call to notnull() and the ternary operator. For example:
score_local = notnull(in0.score)?in0.score:-100;
Setting a nullable field to null occurs when the record is written to output. In this example, after the statement out0.score = score_local executes, out0.score no longer contains null values, since the old null value has been replaced by -100. To reinstate the null field, the function make_null() is called to mark that any field containing -100 is a null field. When the score is -100, grade should be null; therefore, set_null() is called.
Transformation language
The record processing logic can be expressed in the Transformation Language as follows. The expression file name for this example is score_null_handling_expr.
inputname 0 in0; outputname 0 out0; outputname 1 out1; outputname 2 out2; mainloop { // define an int32 local variable to store the score int32 score_local;
// handle the null score
score_local = handle_null(in0.score,"-100");
// alternatives:
// score_local = null(in0.score)?-100:in0.score;
// score_local = notnull(in0.score)?in0.score:-100;
// define a string local variable to store the grade
string[1] grade_local;
grade_local = "F";
if ( score_local < 60 && score_local >= 0 ) grade_local = "D";
if ( score_local < 75 && score_local >= 60 ) grade_local = "C";
if ( score_local >= 75 && score_local <= 90 ) grade_local = "B";
if ( score_local > 90 ) grade_local = "A";
// define string local variables to check the class level
string[max=15] class_local;
string[1] class_init;
class_init = substring(in0.student_id,0,1);
if ( class_init == "B" ) class_local = "Beginner";
if ( class_init == "I" ) class_local = "Intermediate";
if ( class_init == "A" ) class_local = "Advanced";
// outputs
if ( class_local == "Beginner" )
{
   out0.score = score_local;
   out0.score = make_null(out0.score,"-100");
   if ( grade_local == "F" ) out0.grade = set_null();
   else out0.grade = grade_local;
   out0.class = class_local;
   writerecord 0;
}
if ( class_local == "Intermediate" )
{
   out1.score = score_local;
   out1.score = make_null(out1.score,"-100");
   if ( grade_local == "F" ) out1.grade = set_null();
   else out1.grade = grade_local;
   out1.class = class_local;
   writerecord 1;
}
if ( class_local == "Advanced" )
{
   out2.score = score_local;
   out2.score = make_null(out2.score,"-100");
   if ( grade_local == "F" ) out2.grade = set_null();
   else out2.grade = grade_local;
   out2.class = class_local;
   writerecord 2;
}
}
osh command
The osh script to run this job is:
osh -f score_null_handling
osh script
The contents of score_null_handling are:
# compile the expression code transform -inputschema record(student_id:string[10];score:nullable int32;) -outputschema record(student_id:string[10];score:nullable int32;grade:nullable string[1];class:string[max=15]) -outputschema record(student_id:string[10];score:nullable int32;grade:nullable string[1];class:string[max=15]) -outputschema record(student_id:string[10];score:nullable int32;grade:nullable string[1];class:string[max=15]) -expressionfile score_null_handling_expr -flag compile -name score_map; # run the job import -schema record(student_id:string[10];score:nullable int32{null_field=NULL}) -file score_null.txt | transform -flag run -name score_map 0> -export record(student_id:string[10];score:nullable int32{null_field=NULL};grade:nullable string[1]{null_field=FAILED};class:string[max=5];) -file beginner.out -overwrite 1> -export record(student_id:string[10];score:nullable int32{null_field=NULL};grade:nullable string[1]{null_field=FAILED};class:string[max=5];) -file intermediate.out -overwrite 2> -export record(student_id:string[10];score:nullable int32{null_field=NULL};grade:nullable string[1]{null_field=FAILED};class:string[max=5];) -file advanced.out -overwrite
intermediate.out
I218925316 95 A Intermediate
I731347820 75 B Intermediate
I327637289 NULL FAILED Intermediate
I238967521 87 B Intermediate
advanced.out
A619846201 70 C Advanced
A238950561 92 A Advanced
A763567100 NULL FAILED Advanced
v Vector variables with flow loops. For example, while and for loops.
v Using flow loops with controls. For example, continue statements.
v Using vector variables in expressions. For example, if statements and assignments.
v Using a schema variable for an implicit transfer. For example, in the -inputschema and -outputschema declarations, the field term is transferred from the up-stream operator to the transform operator and from the transform input to its output.
Transformation language
The record-processing logic can be expressed in the Transformation Language as follows. The expression file name for this example is score_vector_expr.
inputname 0 in0; outputname 0 out0; outputname 1 out1; outputname 2 out2; mainloop { // define an int32 local vector variable to store the scores int32 score_local[5]; int32 vecLen; vecLen = 5; // handle null score int32 i; i = 0; while ( i < vecLen ) { score_local[i] = handle_null(in0.score[i],"-100"); // alternatives // score_local[i] = null(in0.score[i])?-100:in0.score[i]; // score_local[i] = notnull(in0.score[i])?in0.score[i]:-100; i++; } // define a string local vector variable to store the grades string[1] grade_local[5]; // define sfloat local variables to calculate GPA. sfloat tGPA_local, GPA_local; tGPA_local = 0.0; GPA_local = 0.0; // define an int8 to count the number of courses taken. int8 numOfScore; numOfScore = 0; for ( i = 0; i < vecLen; i++) { // Null score means the course is not taken, // and will not be counted. if ( score_local[i] == -100) { grade_local[i] = "S"; continue; } numOfScore++; if ( score_local[i] < 60 && score_local[i] >= 0 ) { grade_local[i] = "D"; tGPA_local = tGPA_local + 1.0; } if ( score_local[i] < 75 && score_local[i] >= 60 ) { grade_local[i] = "C"; tGPA_local = tGPA_local + 2.0; } if ( score_local[i] >= 75 && score_local[i] <= 90 ) { grade_local[i] = "B"; tGPA_local = tGPA_local + 3.0; } if ( score_local[i] > 90 ) { grade_local[i] = "A"; tGPA_local = tGPA_local + 4.0; } }
if ( numOfScore > 0 ) GPA_local = tGPA_local / numOfScore; // define string local variables to check the class level string[max=15] class_local; string[1] class_init; class_init = substring(in0.student_id,0,1); if ( class_init == "B" ) class_local = "Beginner"; if ( class_init == "I" ) class_local = "Intermediate"; if ( class_init == "A" ) class_local = "Advanced"; // outputs if (class_local == "Beginner") { for ( i = 0; i < vecLen; i++) { out0.score[i] = score_local[i]; out0.score[i] = make_null(out0.score[i],"-100"); out0.grade[i] = grade_local[i]; } out0.GPA = GPA_local; out0.class = class_local; writerecord 0; } if ( class_local == "Intermediate" ) { for ( i = 0; i < vecLen; i++) { out1.score[i] = score_local[i]; out1.score[i] = make_null(out1.score[i],"-100"); out1.grade[i] = grade_local[i]; } out1.GPA = GPA_local; out1.class = class_local; writerecord 1; } if ( class_local == "Advanced" ) { for ( i = 0; i < vecLen; i++) { out2.score[i] = score_local[i]; out2.score[i] = make_null(out2.score[i],"-100"); out2.grade[i] = grade_local[i]; } out2.GPA = GPA_local; out2.class = class_local; writerecord 2; }
osh command
The osh script to run this job is:
osh -f score_vector
osh script
The contents of score_vector are:
# compile the expression code transform -inputschema record(student_id:string[10];score[5]:nullable int32;inRec:*) -outputschema record(student_id:string[10];score[5]:nullable int32; grade[5]:string[1];GPA:sfloat;class:string[max=15];outRec:*) -outputschema record(student_id:string[10];score[5]:nullable int32;
grade[5]:string[1];GPA:sfloat;class:string[max=15];outRec:*) -outputschema record(student_id:string[10];score[5]:nullable int32; grade[5]:string[1];GPA:sfloat;class:string[max=15];outRec:*) -expressionfile score_vector_expr -flag compile -name score_map; # run the job import -schema record(student_id:string[10];score[5]:nullable int32{null_field=NULL};term:string) file score_vector.txt | transform -flag run -name score_map 0> -export record(student_id:string[10];score[5]:nullable int32{null_field=NULL};grade[5]:string[1];GPA:sfloat{out_format= %4.2g};class:string[max=15];term:string;) -file beginner.out -overwrite 1> -export record(student_id:string[10];score[5]:nullable int32{null_field=NULL};grade[5]:string[1];GPA:sfloat{out_format=%4.2g};class:string[max=15];ter m:string;) -file intermediate.out -overwrite 2> -export record(student_id:string[10];score[5]:nullable int32{null_field=NULL};grade[5]:string[1];GPA:sfloat{out_format=%4.2g};class:string[max=15];ter m:string;) -file advanced.out -overwrite
intermediate.out
I218925316 I731347820 I327637289 I238967521 95 NULL 91 88 NULL 75 NULL 89 93 95 B NULL NULL 88 92 76 87 NULL 86 NULL 82 A S S B S B S S A A B B B A A S S 3.7 Intermediate Midterm 3.5 Intermediate Final B 3.3 Intermediate Final B 3 Intermediate Midterm
advanced.out
A619846201 70 82 85 68 NULL C B B C S 2.5 Advanced Final A238950561 92 97 89 85 83 A A B B B 3.4 Advanced Final A763567100 NULL NULL 53 68 92 S S D C A 2.3 Advanced Final
Transformation language
The record-processing logic can be expressed in the Transformation Language as follows. The expression file name for this example is score_subrec_expr.
inputname 0 in0; outputname 0 out0; outputname 1 out1; outputname 2 out2; mainloop { // define an int32 local vector variable to store the scores int32 score_local[3]; // define an index to store vector or subrec length int32 vecLen, subrecLen; vecLen = 3; subrecLen = 2; // index int32 i, j; i = 0; j = 0; // define a string local vector variable to store the grades string[1] grade_local[3]; // define sfloat local vairalbes to calculate GPA. sfloat tGPA_local, GPA_local; tGPA_local = 0.0; GPA_local = 0.0; // define string local variables to check the class level string[1] class_init; string[max=15] class_local; class_init = substring(in0.student_id,0,1); if ( class_init == "B" )
class_local = "Beginner"; if ( class_init == "I" ) class_local = "Intermediate"; if ( class_init == "A" ) class_local = "Advanced"; // calculate grade and GPA // The outer loop controls subrec // The inner loop controls sub-fields for ( j = 0; j < subrecLen; j++) { for ( i = 0; i < vecLen; i++) { score_local[i] = in0.score_report[j].score[i]; if ( score_local[i] < 60 && score_local[i] >= 0 ) { grade_local[i] = "D"; tGPA_local = tGPA_local + 1.0; } if ( score_local[i] < 75 && score_local[i] >= 60 ) { grade_local[i] = "C"; tGPA_local = tGPA_local + 2.0; } if ( score_local[i] >= 75 && score_local[i] <= 90 ) { grade_local[i] = "B"; tGPA_local = tGPA_local + 3.0; } if ( score_local[i] > 90 ) { grade_local[i] = "A"; tGPA_local = tGPA_local + 4.0; } } GPA_local = tGPA_local / vecLen; // outputs if ( class_local == "Beginner" ) { for ( i = 0; i < vecLen; i++) { out0.score_report[j].score[i] = score_local[i]; out0.score_report[j].grade[i] = grade_local[i]; } out0.score_report[j].GPA = GPA_local; } if ( class_local == "Intermediate" ) { for ( i = 0; i < vecLen; i++) { out1.score_report[j].score[i] = score_local[i]; out1.score_report[j].grade[i] = grade_local[i]; } out1.score_report[j].GPA = GPA_local; } if ( class_local == "Advanced" ) { for ( i = 0; i < vecLen; i++) { out2.score_report[j].score[i] = score_local[i]; out2.score_report[j].grade[i] = grade_local[i]; } out2.score_report[j].GPA = GPA_local; } // intialize these variables for next subrec GPA_local = 0; tGPA_local = 0; }
// outputs if ( class_local == "Beginner" ) { out0.class = class_local; writerecord 0; } if ( class_local == "Intermediate" ) { out1.class = class_local; writerecord 1; } if ( class_local == "Advanced" ) { out2.class = class_local; writerecord 2; } }
osh command
osh -f score_subrec
osh script
The contents of score_subrec are:
# compile the expression code transform -inputschemafile score_subrec_input.schema -outputschemafile score_subrec_output.schema -outputschemafile score_subrec_output.schema -outputschemafile score_subrec_output.schema -expressionfile score_subrec_expr -flag compile -name score_map; # run the job import -schemafile score_subrec_input.schema -file score_subrec.txt | transform -flag run -name score_map > a.v >b.v >c.v; 0> -export -schemafile score_subrec_output.schema -file beginner.out -overwrite < a.v; 1> -export -schemafile score_subrec_output.schema -file intermediate.out -overwrite < b.v; 2> -export -schemafile score_subrec_output.schema -file advanced.out -overwrite < c.v;
Input schema
The contents of score_subrec_input.schema are:
record( student_id:string[10]; score_report[2]:subrec(score[3]:nullable int32;term:string;) )
Output schema
The contents of score_subrec_output.schema are:
record( student_id:string[10]; score_report[2]:subrec(score[3]:nullable int32; grade[3]:string[1]; GPA:sfloat{out_format=%4.2g}; term:string;) class:string[max=15]; )
B112387567 I218925316 A619846201 I731347820 B897382711 I327637289 A238950561 I238967521 B826381931 A763567100
90 95 70 75 85 88 92 87 66 53
87 91 82 89 90 92 97 86 73 68
62 88 85 93 96 76 89 82 82 92
Final 80 89 52 Midterm Midterm 92 81 78 Final Final 60 89 85 Midterm Final 85 79 92 Midterm Midterm 88 92 96 Final Final 82 96 86 Midterm Final 90 87 91 Midterm Midterm 97 96 92 Final Midterm 86 93 82 Final Final 48 78 92 Midterm
intermediate.out
I218925316 I731347820 I327637289 I238967521 95 75 88 87 91 89 92 86 88 93 76 82 A B B B A B A B B A B B 3.7 3.3 3.3 3 Midterm 92 81 78 Final 85 79 92 B Final 82 96 86 B Midterm 97 96 92 A B A A B A B A B 3.3 Final Intermediate 3.3 Midterm Intermediate 3.3 Midterm Intermediate A 4 Final Intermediate
advanced.out
A619846201 70 82 85 C B B A238950561 92 97 89 A A B A763567100 53 68 92 D C A 2.7 Final 60 89 85 C B B 3.7 Final 90 87 91 B B A 2.3 Final 48 78 92 D B A 2.7 Midterm Advanced 3.3 Midterm Advanced 2.7 Midterm Advanced
{
   unsigned int sum;
   sum = x + y;
   return sum;
}
void my_print_message(char* msg)
{
   printf("%s\n", msg);
   return;
}
#if defined(__alpha)
long my_square( long x, long y )
{
   long square;
   square = x*x + y*y;
   return square;
}
#else
long long my_square( long long x, long long y )
{
   long long square;
   square = x*x + y*y;
   return square;
}
#endif
Transformation language
The expression file name for this example is t_extern_func.
extern int32 my_times(int32 x, int32 y);
extern uint32 my_sum(uint32 x, uint32 y);
extern void my_print_message(string msg);
extern int64 my_square(int64 x, int64 y);
inputname 0 in0;
outputname 0 out0;
mainloop
{
   out0.times = my_times(in0.a1, in0.a2);
   out0.sum = my_sum(in0.a1, in0.a2);
   out0.square = my_square(in0.a1, in0.a2);
   my_print_message("HELLO WORLD!");
   writerecord 0;
}
osh script
transform -inputschema record(a1:int32;a2:int32) -outputschema record(times:int32;sum:uint32;square:int64;) -expressionfile t_extern_func -flag compile -name my_extern_func -staticobj /DIR/functions.o; generator -schema record(a1:int32;a2:int32) | transform -flag run -name my_extern_func | peek;
Writerangemap operator
The writerangemap operator takes an input data set produced by sampling and partition sorting a data set and writes it to a file in a form usable by the range partitioner. The range partitioner uses the sampled and sorted data set to determine partition boundaries.
(Figure: the writerangemap operator takes a single input data set and writes its output to the file newDS.ds.)
The operator takes a single data set as input. You specify the input interface schema of the operator using the -interface option. Only the fields of the input data set specified by -interface are copied to the output file.
writerangemap: properties
Table 66. writerangemap properties
Number of input data sets: 1
Number of output data sets: 0 (produces a data file as output)
Input interface schema: specified by the interface arguments
Output interface schema: none
Transfer behavior: inRec to outRec without modification
Execution mode: sequential only
Partitioning method: range
Preserve-partitioning flag in output set: set
Composite operator: no
Table 67. Writerangemap options

-key fieldname
Specifies an input field copied to the output file. Only information about the specified field is written to the output file. You only need to specify those fields that you use to range partition a data set. You can specify multiple -key options to define multiple fields. This option is mutually exclusive with -interface. You must specify either -key or -interface, but not both.

-collation_sequence locale | collation_file_pathname | OFF
This option determines how your string data is sorted. You can:
v Specify a predefined IBM ICU locale
v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
v By default, InfoSphere DataStage sorts strings using byte-wise comparisons.
For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm

-interface schema
Specifies the input fields copied to the output file. Only information about the specified fields is written to the output file. You only need to specify those fields that you use to range partition a data set. This option is mutually exclusive with -key; that is, you can specify -key or -interface, but not both.

-overwrite
Tells the operator to overwrite the output file, if it exists. By default, the operator does not overwrite the output file. Instead it generates an error and aborts the job if the file already exists.

-rangemap filename
Specifies the pathname of the output file which will contain the sampled and sorted data.
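For orientation only, here is a sketch (not taken from the original manual) of how a range map file might be produced. The data set name sampledDS.ds, the key field a, and the map file name sampledData.map are hypothetical, and in practice the input data set would first be reduced with a sampling operator, as the description above requires:
$ osh "tsort -key a < sampledDS.ds | writerangemap -key a -rangemap sampledData.map -overwrite"
The file written by -rangemap can then be supplied to the range partitioner when partitioning the full data set.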
Record schemas
A record schema is an implicit or explicit description of the layout and properties of the record-oriented data contained in an InfoSphere DataStage data set. Implicit (default) record schemas are discussed in "The Default Import Schema" and "The Default Export Schema" . When you invoke either the import or the export operator, you must explicitly specify a schema, as in the following two examples.
(Figures: the first example shows a step in which an import operator feeds a statistics operator; the second shows a step in which import, statistics, and export operators are connected in sequence.)
Below is a completely described data set record, where each byte of the record is contained in a field and all fields are accessible by InfoSphere DataStage because they have the required field definition information.
record (
a:sfloat;
b:int8;
c:int8;
d:dfloat;
e:sfloat;
f:decimal[1,0];
g:decimal[1,0];
h:int8;
)
If you wish to modify this example to specify that each record is to be delimited by a comma (,) you would add the record-level property delim = ',':
record {delim = ','} (
a:sfloat;
b:int8;
c:int8;
d:dfloat;
e:sfloat;
f:decimal[1,0];
g:decimal[1,0];
h:int8;
)
The same property delim = ',' could be used as a field-level property, for example:
record (
a:sfloat;
b:int8 {delim = ','};
c:int8;
d:dfloat;
e:sfloat;
f:decimal[1,0];
g:decimal[1,0];
h:int8;
)
In this case, only field b would be delimited by a comma. For export, the other fields would be followed by the default delimiter, the ASCII space (0x20). These brief examples only scratch the surface of schema definition. See "Import/Export Properties" for more information on record and field properties.
(Figure: a partial record schema in which only the fields c:int8 and f:decimal are defined; the rest of the record is left undescribed.)
The use of partial record schemas has advantages and disadvantages, which are summarized in the following table.
Schema type: partial
Advantages: Partial record schemas are simpler than complete record schemas, because you define only the fields of interest in the record. The import operator reads these records faster than records with a complete schema because it does not have to interpret as many field definitions.
Disadvantages: InfoSphere DataStage treats the entire imported record as an atomic object; that is, InfoSphere DataStage cannot add, remove, or modify the data storage within the record. (InfoSphere DataStage can, however, add fields to the beginning or end of the record.)

Schema type: complete
Advantages: InfoSphere DataStage can add fields to the body of the record and remove or modify any existing field. Removing fields from a record allows you to minimize its storage requirement.
Disadvantages: The import operator takes longer to read records with a complete schema than records with a partial schema.
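The schema that the next paragraph describes is not reproduced above. As a reconstruction only (modeled on the Name/Income example that follows later), an intact, field-free schema with those properties would look something like:
record { intact, record_length=82, record_delim_string='\r\n' } ()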
The record schema defines an 82-byte record as specified by the record_length=82 property. The 82 bytes includes 80 bytes of data plus two bytes for the newline and carriage-return characters delimiting the record as specified by record_delim='\r\n'. No information is provided about any field. On import, the two bytes for the carriage-return and line-feed characters are stripped from each record of the data set and each record contains only the 80 bytes of data.
An imported record with no defined fields can be used only with keyless partitioners, collectors, and operators. The term keyless means that no individual fields of the record must be accessed in order for it to be processed. InfoSphere DataStage supplies keyless operators to:
v Copy a data set
v Sample a data set to create one or more output data sets containing a random sample of records from the input data set
v Combine multiple input data sets into a single output data set
v Compress or decompress a data set.
In addition, you can use keyless partitioners, such as random or same, and keyless collectors, such as round-robin, with this type of record.
An imported record with no defined fields cannot be used for field-based processing such as:
v Sorting the data based on one or more key fields, such as a name or zip code
v Calculating statistics on one or more fields, such as income or credit card balance
v Creating analytic models to evaluate your data
A record with no field definitions might be passed to an operator that runs UNIX code. Such code typically has a built-in record and field processing ability.
You can define fields in partial schemas so that field-based processing can be performed on them. To do so, define only the fields to be processed. For example, the following partial record schema defines two fields of a record, Name and Income, thereby making them available for use as inputs to partitioners, collectors, and other operators.
record { intact, record_length=82, record_delim_string=\r\n } (Name: string[20] { position=12, delim=none }; Income: int32 { position=40, delim=,, text }; )
In this schema, the defined fields occur at fixed offsets from the beginning of the record. The Name field is a 20-byte string field starting at byte offset 12 and the Income field is a 32-bit integer, represented by an ASCII text string, that starts at byte position 40 and ends at the first comma encountered by the import operator.
Note: When variable-length fields occur in the record before any fields of interest, the variable-length fields must be described so that the import and export operators can determine the location of a field of interest.
For example, suppose that the record schema shown above is modified so that the Name and Income fields follow variable-length fields (V1, V2, and V3). Instead of fixed offsets, the Name and Income fields have relative offsets, based on the length of preceding variable-length fields. In this case, you specify the position of the Name and Income fields by defining the delimiters of the variable-length fields, as in the following example:
record { intact, record_delim_string=\r\n } ( v1: string { delim=, }; v2: string { delim=, }; Name: string[20] { delim=none }; v3: string { delim=, }; Income: int32 { delim=,, text }; )
In order to export this data set using the original record schema and properties, you specify an empty record schema, as follows:
record ()
Because there are no properties, no braces ({}) are needed. This empty record schema causes the export operator to use the original record schema and properties. However, defining any record property in the export schema overrides all record-level properties specified at import. That is, all record-level properties are ignored by the export operator. For example, if you specify the following record schema to the export operator:
record { record_delim=\n } ()
the intact record is exported with a newline character as the delimiter. The carriage-return character ('\r') is dropped. Each exported record is 81 bytes long: 80 bytes of data and one byte for the newline character.
Intact record with additional fields
About this task
When a schema being exported consists of multiple intact records or one intact record together with other fields, consider each intact record to be a separate field of the exported record. For example, consider a data set with the following record schema:
record ( a: int32; b: string; x: record {intact, record_delim=\r\n, text, delim=,,} (name: string[20] { position=0 }; address: string[30] { position=44 }; ); y: record {intact, record_length=244, binary, delim=none } (v1: int32 { position=84 }; v2: decimal(10,0) { position=13 }; ); )
The data set contains two intact fields, x and y, and two other fields, a and b. The intact fields retain all properties specified at the time of the import. When this type of data set is exported, the record-level properties for each intact field (specified in braces after the key word record) are ignored and you must define new record-level properties for the entire record. For example, if you want to export records formatted as follows:
v a two-byte prefix is written at the beginning of each record
v the fields are ordered x, y, and a; field b is omitted
v a comma is added after x
v y has no delimiter
v a is formatted as binary.
then the export schema is:
record { record_prefix=2 } ( x: record { intact } () { delim=, }; y: record { intact } () { delim=none }; a: int32 { binary }; )
Note that you do not include the fields of the x and y intact records in the export schema because the entire intact record is always exported. However, this export schema does contain field-level properties for the intact records to control delimiter insertion.
No schema is specified, so InfoSphere DataStage uses the default schema (See "The Default Import Schema" and "The Default Export Schema" ). The diagram shows the data flow model of this step. InfoSphere DataStage automatically performs the import operation required to read the flat file.
(Figure: the data flow for this step. The flat file is implicitly imported; a copy operator writes the data set outDS1.ds, and the copied data is also passed through a pcompress operator to produce compressed.ds.)
Implicit operator insertion also works for export operations. The following example shows the tsort operator writing its results to a data file, again with no schema specified:
$ osh "tsort -key a -hash -key b < inDS.ds > result.dat"
The output file does not end in the extension .ds, so InfoSphere DataStage inserts an export operator to write the output data set to the file. The examples of implicit import/export insertion shown so far all use the default record schemas. However, you can also insert your own record schema to override the defaults.
The next sections discuss:
v "The Default Import Schema"
v "The Default Export Schema"
v "Overriding the Defaults"
The following figure shows the record layout of the source data as defined by the default import schema:
(Figure: the default record layout of the source data; each record consists of the record contents followed by a newline character (\n).)
After the import operation, the destination data set contains a single variable-length string field named rec corresponding to the entire imported record, as follows:
(Figure: the destination record contains a single variable-length string field named rec.)
This record schema causes InfoSphere DataStage to export all fields as follows: v Each field except the last field in the record is delimited by an ASCII space (0x20) v The end of the record (and thus the end of the last field) is marked by the newline character (\n) v All fields are represented as text (numeric data is converted to a text representation). For example, the following figure shows the layout of a data set record before and after export when the default export schema is applied:
record in the source data set: int16 string[8] string[2] int32
record after export as stored in data file: int16 as text 0x20 string[8] 0x20 string[2] 0x20 int32 as text \n
v A text file and a reference to the file in the command line, as in the following example:
$ osh "copy < [record @schema_file] user.dat > outDS1.ds | pcompress > compressed.ds"
where schema_file is the name of the text file containing the record schema definition. You can specify a schema anywhere on an osh command line by using this notation:
[record @schema_file].
A schema variable defined upstream of a command, and subsequently input to the command, as in the following example:
$ import_schema="record ( Name:string; Age:int8 {default = 127}; Income:dfloat {skip = 2}; Phone:string; )"
$ osh "copy < [$import_schema] user.dat > outDS1.ds | pcompress > compressed.ds"
All implicit import/export operations in the command use that schema. However, you can override it for a particular data file by preceding the specification of a flat file with a different record schema. To define formatting properties of records as a whole rather than as individual fields in an export operation, use this form:
record {record_properties} ()
In this case, the export operator exports all the fields of the data set formatted according to the record properties specified in the braces.
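For instance, an illustrative sketch (not taken from the original text): the schema
record { delim=',', text } ()
names no fields, so the export operator writes every field of the data set as comma-delimited text.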
Failure
A failure means that both:
v The import or export operator failed to read or write the record's value correctly.
v The error makes it unlikely that the rest of the record can be interpreted.
An example of such a failure occurs when the import operator encounters an invalid length value of a variable-length field. On import, by default, an uninterpretable field is not written to the destination data set and processing continues to the next record. On export, by default, an uninterpretable field is not written to the destination file and processing continues to the next record.
However, you can configure the import and export operators to save records causing a failure in a reject data set. You can also configure the operators to terminate the job in the event of a failure. See "Import Operator" and "Export Operator".
InfoSphere DataStage issues a message in the case of a failure.
v The message appears for up to five records in a row, for each partition of the imported or exported data set, when the same field causes the failure.
v After the fifth failure and message, messages no longer appear.
v After a record is successfully processed, the message counter is reset to zero and up to five more error messages per partition can be generated for the same field.
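As an illustrative sketch only (the file name and schema are hypothetical), an import that saves failed records to a reject data set might be invoked as:
$ osh "import -file inFile.dat -schema record(a:int32; b:string;) -rejects save > outDS.ds > rejectDS.ds"
Here the second output data set (rejectDS.ds) receives the records that could not be imported.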
Note: No more than 25 failure messages can be output for the same failure condition during an import or export operation.
Warning
A warning means that the import operator or export operator failed to read or write the record's value correctly but that the import or export of that record can continue. An example of a warning condition is a numeric field represented as ASCII text that contains all blanks. When such a condition occurs, the import or export operator does not issue a message and by default drops the record. To override this behavior, define a default value for the field that causes the warning. If you have defined a default value for the record field that causes the warning: v The import operator sets the field to its default value, writes the record to the destination data set, and continues with the next record. v The export operator sets the field to its default value, exports the record, and continues with the next record.
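For example, a hedged sketch (the field name and default value are illustrative): a numeric text field that might arrive blank can be given a default so that such records are kept rather than dropped:
record ( balance:int32 { text, default=0 }; )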
EBCDIC to ASCII
The following table is an EBCDIC-to-ASCII conversion table that translates 8-bit EBCDIC characters to 7-bit ASCII characters. All EBCDIC characters that cannot be represented in 7 bits are represented by the ASCII character 0x1A. This translation is not bidirectional. Some EBCDIC characters cannot be translated to ASCII and some conversion irregularities exist in the table. See "Conversion Table Irregularities" for more information.
EBCDIC 00 01 02 03 04 05 06 07 08 09 0A 0B ASCII 00 01 02 03 1A 09 1A 7F 1A 1A 1A 0B EBCDIC Meaning NUL SOH STX ETX SEL HT RNL DEL GE SPS RPT VT EBCDIC 80 81 82 83 84 85 86 87 88 89 8A 8B ASCII 1A 61 62 63 64 65 66 67 68 69 1A 1A a b c d e f g h i EBCDIC Meaning
EBCDIC 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33
ASCII 0C 0D 0E 0F 10 11 12 13 1A 1A 08 1A 18 19 1A 1A 1C 1D 1E 1F 1A 1A 1A 1A 1A 0A 17 1B 1A 1A 1A 1A 1A 05 06 07 1A 1A 16 1A
EBCDIC Meaning FF CR SO SI DLE DC1 DC2 DC3 RES/ENP NL BS POC CAN EM UBS CU1 IFS IGS IRS ITB/IUS DS SOS FS WUS BYP/INP LF ETB ESC SA SFE SM/SW CSP MFA ENQ ACK BEL
EBCDIC 8C 8D 8E 8F 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF B0 B1
ASCII 1A 1A 1A 1A 1A 6A 6B 6C 6D 6E 6F 70 71 72 1A 1A 1A 1A 1A 1A 1A 7E 73 74 75 76 77 78 79 7A 1A 1A 1A 1A 1A 1A 1A 1A 1A 1A
EBCDIC Meaning
j k l m n o p q r
s t u v w x y z
SYN IR
B2 B3
EBCDIC 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B
ASCII 1A 1A 1A 04 1A 1A 1A 1A 14 15 1A 1A 20 1A 1A 1A 1A 1A 1A 1A 1A 1A 5B 2E 3C 28 2B 21 26 1A 1A 1A 1A 1A 1A 1A 1A 1A 5D 24
EBCDIC Meaning PP TRN NBS EOT SBS IT RFF CU3 DC4 NAK
EBCDIC B4 B5 B6 B7 B8 B9 BA BB BC BD BE
ASCII 1A 1A 1A 1A 1A 1A 1A 1A 1A 1A 1A 1A 7B 41 42 43 44 45 46 47 48 49 1A 1A 1A 1A 1A 1A 7D 4A 4B 4C 4D 4E 4F 50 51 52 1A 1A
EBCDIC Meaning
BF C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
A B C D E F G H I
. < ( + | &
CA CB CC CD CE CF D0 D1 D2 D3 D4 D5 D6 D7 D8 D9
J K L M N O P Q R
! $
DA DB
EBCDIC 5C 5D 5E 5F 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
ASCII 2A 29 3B 5E 2D 1A 1A 1A 1A 1A 1A 1A 1A 1A 7C 2C 25 5F 3E 3F 1A 1A 1A 1A 1A 1A 1A 1A 1A 60 3A 23 40 27 3D 22
EBCDIC Meaning * ) ; .. _ /
EBCDIC DC DD DE DF E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA
ASCII 1A 1A 1A 1A 5C 1A 53 54 55 56 57 58 59 5A 1A 1A 1A 1A 1A 1A 30 31 32 33 34 35 36 37 38 39 1A 1A 1A 1A 1A 1A
EBCDIC Meaning
S T U V W X Y Z
EB EC
> ?
ED EE EF F0 F1 F2 F3 F4 F5 F6 F7 F8 F9
0 1 2 3 4 5 6 7 8 9
: # @ ' = "
FA FB FC FD FE FF
ASCII to EBCDIC
The following table is an ASCII-to-EBCDIC conversion table that translates 7-bit ASCII characters to 8-bit EBCDIC characters. This translation is not bidirectional. Some EBCDIC characters cannot be translated to ASCII and some conversion irregularities exist in the table. For more information, see "Conversion Table Irregularities" .
Table 68. ASCII to EBCDIC Conversion ASCII 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F 20 21 22 EBCDIC 00 01 02 03 1A 09 1A ASCII Meaning ASCII 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60 61 62
EBCDIC
ASCII Meaning
Table 68. ASCII to EBCDIC Conversion (continued) ASCII 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F EBCDIC ASCII Meaning ASCII 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F EBCDIC ASCII Meaning
Import operator
The import operator reads one or more non-DataStage source files and converts the source data to a destination InfoSphere DataStage data set.
(Figure: the import operator. Output 0 carries the user-defined import schema; the optional reject output has the schema reject:raw;.)
import: properties
Table 69. import properties
Number of input data sets: 0
Number of output data sets: 1, or 2 if you specify optional data sets
Input interface schema: none
Output interface schema: output data set 0: user defined; reject data set: reject:raw;
Transfer behavior: none
Execution mode: parallel or sequential
Preserve-partitioning flag in output data set(s): clear by default
Composite operator: no
The import operator:
v Takes one or more files or named pipes as input. You can import data from one or more sources, provided all the imported data has the same record layout.
v Writes its results to a single output data set.
v Allows you to specify an optional output data set to hold records that are not successfully imported.
v Has an output interface schema corresponding to the import schema of the data files read by the operator.
The import operator can import source files containing several types of data:
v Data in variable-length blocked/spanned records: Imported records have a variable-length format. This format is equivalent to IBM format-V, format-VS, format-VB, and format-VBS files.
v Data in fixed-length records: This format is equivalent to IBM format-F and format-FB files.
v Data in prefixed records: Record fields are prefixed by a length indicator. Records can be of either fixed or variable length.
v Delimited records: Each record in the input file is separated by one or more delimiter characters (such as the ASCII newline character '\n'). Records can be of either fixed or variable length.
v Implicit: Data with no explicit record boundaries.
InfoSphere DataStage accepts a variety of record layouts. For example, a data file might represent integers as ASCII strings, contain variable-length records, or represent multi-byte data types (for example, 32-bit integer) in big-endian or little-endian format. See "Import/Export Properties" for more information on data representation.
When the import operator imports a fixed-length string or ustring whose length is less than or greater than the declared schema length, an error is generated.
The following option values can contain multi-byte Unicode values:
v the file names given to the -file, -schemafile, and -sourcelist options
v the base file name for the -fileset option
v -schema option schema
v the program name and arguments for the -source option
v the -filepattern pattern
v the -filter command
There are two types of required options.
v You must include one of the following options to specify the imported files: -file, -filepattern, -fileset, -source, -sourcelist.
v You must include either -schema or -schemafile to define the layout of the imported data unless you specify -fileset and it contains a schema. You can select only one of these three options.
You can optionally include other arguments.
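Putting the two requirements together, a minimal invocation might look like this sketch (the file name and schema are hypothetical):
$ osh "import -file /tmp/customers.txt -schema record(name:string; age:int8;) > outDS.ds"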
Table 70. Import options Option -checkpoint Use -checkpoint n Import n records per segment to the output data set. (A data segment contains all records written to a data set by a single InfoSphere DataStage step.) By default, the value of n is 0, that is, the entire input is imported for each iteration of the step. If the step containing the import operator performs multiple iterations, the output data set of the import operator contains multiple copies of the input data. The source files of the import operation must all be data files or source programs, that is, no pipes or other input devices can be used. In addition, you cannot specify a filter on the input. An import operator that creates a segmented output data set must be contained in a checkpointed step. -dontUseOffsetsWith Sources -dontUseOffsetsWithSources Start the source program's output stream at 0. Do not use this option with checkpointing. -file -file nodeName :file0 ... -file nodeName :filen Specifies an input file or pipe. You can include multiple -file options to specify multiple input files. The file names might contain multi-byte Unicode characters. Note: You can specify a hyphen to signal that import takes it's input from the stdin of osh. nodeName optionally specifies the name of the node on which the file is located. If you omit nodeName , the importer assumes that the file is located on the node on which the job is invoked. You cannot use -file with -filepattern, -fileset, -source, or -sourcelist.
Table 70. Import options (continued) Option -filepattern Use -filepattern pattern Specifies a group of files to import. For example, you can use the statement: -filepattern data*.txt You cannot use -filepattern with the -file, -fileset, -source, -sourcelist or -readers options. -fileset -fileset file_set .fs Specifies file_set .fs, a file containing a list of files, one per line, used as input by the operator. The suffix .fs identifies the file to InfoSphere DataStage as a file set. See "File Sets" . file_set might contain multi-byte Unicode characters. To delete a file set and the data files to which it points, invoke the InfoSphere DataStage data set administration utility, orchadmin. You cannot use -fileset with -file, -filepattern, -source, or -sourcelist. -filter -filter command Specifies a UNIX command to process input files as the data is read from a file and before it is converted into the destination data set. The command might contain multi-byte Unicode characters. For example, you could specify the following filter: -filter 'grep 1997' to filter out all records that do not contain the string "1997". Note that the source data to the import operator for this example must be newline-delimited for grep to function properly. InfoSphere DataStage checks the return value of the -filter process, and errors out when an exit status is not OK. You cannot use -filter with -source.
Table 70. Import options (continued) Option -first Use [-first n ] This option imports the first n records of a file. This option does not work with multiple nodes, filesets, or file patterns. It does work with multiple files and with the -source and -sourcefile options when file patterns are specified. Here are some osh command examples: osh "import -file file1 -file file2 -first 10 >| outputfile" osh "import -source 'cat file1' -first 10 >| outputfile" osh "import -sourcelist sourcefile -first 10 >| outputfile" osh "import -file file1 -first 5 >| outputfile" -firstLineColumnNames [-firstLineColumnNames] Specifies that the first line of a source file should not be imported. -keepPartitions -keepPartitions Partitions the imported data set according to the organization of the input file(s). By default, record ordering is not preserved, because the number of partitions of the imported data set is determined by the configuration file and any constraints applied to the data set. However, if you specify -keepPartitions, record ordering is preserved, because the number of partitions of the imported data set equals the number of input files and the preserve-partitioning flag is set in the destination data set. -missingFile -missingFile error | okay Determines how the importer handles a missing input file. By default, the importer fails and the step terminates if an input file is missing (corresponding to -missingFile error). Specify -missingFile okay to override the default. However, if the input file name has a node name prefix of "*", missing input files are ignored and the import operation continues. (This corresponds to -missingFile okay.) -multinode [ -multinode yes | no ] -multinode yes specifies the input file is to be read in sections from multiple nodes; -multinode no, the default, specifies the file is to be read by a single node.
Table 70. Import options (continued) Option -readers Use -readers numReaders Specifies the number of instances of the import operator on each processing node. The default is one operator per node per input data file. If numReaders is greater than one, each instance of the import operator reads a contiguous range of records from the input file. The starting record location in the file for each operator (or seek location) is determined by the data file size, the record length, and the number of instances of the operator, as specified by numReaders , which must be greater than zero. All instances of the import operator for a data file run on the processing node specified by the file name. The output data set contains one partition per instance of the import operator, as determined by numReaders . The imported data file(s) must contain fixed-length records (as defined by the record_length=fixed property). The data source must be a file or files. No other devices are allowed. This option is mutually exclusive with the filepattern option. -recordNumberField -recordNumberField recordNumberFieldName Adds a field with field name recordNumberFieldName with the record number as its value. -rejects -rejects { continue | fail | save } Configures operator behavior if a record is rejected. The default behavior is to continue. Rejected records are counted but discarded. The number of rejected records is printed as a log message at the end of the step. However, you can configure the operator to either fail and terminate the job or save, that is, create output data set 1 to hold reject records. If -rejects fail is specified and a record is not successfully imported, the import operator issues an error and the step terminates. If -rejects fail is not specified and a record is not successfully imported, the import operator issues a warning and the step does not terminate. -reportProgress -reportProgress { yes | no } By default (yes) the operator displays a progress report at each 10% interval when the importer can ascertain file size. Reporting occurs only if: import reads a file as opposed to a named pipe, the file is greater than 100 KB, records are fixed length, and there is no filter on the file. Disable this reporting by specifying -reportProgress no.
Table 70. Import options (continued)

-schema record_schema
Specifies the import record schema. The value to this option might contain multi-byte Unicode characters. You can also specify a file containing the record schema using the syntax: -schema record @'file_name' where file_name is the path name of the file containing the record schema. You cannot use -schema with -schemafile.

-schemafile file_name
Specifies the name of a file containing the import record schema. It might contain multi-byte Unicode characters. You cannot use -schemafile with -schema.

-sourceNameField sourceStringFieldName
Adds a field named sourceStringFieldName with the import source string as its value.
Table 70. Import options (continued) Option -source Use -source prog_name args Specifies the name of a program providing the source data to the import operator. InfoSphere DataStage calls prog_name and passes to it any arguments specified. prog_name and args might contain multi-byte Unicode characters. You can specify multiple -source arguments to the operator. InfoSphere DataStage creates a pipe to read data from each specified prog_name . You can prefix prog_name with either a node name to explicitly specify the processing node that executes prog_name or a node name prefix of "*", which causes InfoSphere DataStage to run prog_name on all processing nodes executing the operator. If this import operator runs as part of a checkpointed step, as defined by -checkpoint, InfoSphere DataStage calls prog_name once for each iteration of the step. InfoSphere DataStage always appends three arguments to prog_name as shown below: prog_name args -s H L where H and L are 32-bit integers. H and L are set to 0 for the first step iteration or if the step is not checkpointed. For each subsequent iteration of a checkpointed step, H and L specify the (64-bit) byte offset ( H = upper 32 bits, L = lower 32 bits) at which the source program's output stream should be restarted. At the end of each step iteration, prog_name receives a signal indicating a broken pipe. The prog_name can recognize this signal and perform any cleanup before exiting. InfoSphere DataStage checks the return value of the -source process, and errors out when an exit status is not OK. You cannot use -source with -filter, -file, -filepattern, or -fileset.
Table 70. Import options (continued) Option -sourcelist Use -sourcelist file_name Specifies a file containing multiple program names to provide source data for the import. This file contains program command lines, where each command line is on a separate line of the file. file_name might contain multi-byte Unicode characters. InfoSphere DataStage calls the programs just as if you specified multiple -source options. See the description of -source for more information. You cannot use -sourcelist with -filter, -file, -filepattern, or -fileset.
File sets
A file set is a text file containing a list of source files to import. The file set must contain one file name per line. The name of the file has the form file_name.fs, where .fs identifies the file to InfoSphere DataStage as a file set. Specify the -fileset option and the path name of the file set. Shown below is a sample file set:
--Orchestrate File Set v1
--LFile
node0:/home/user1/files/file0
node0:/home/user1/files/file1
node0:/home/user1/files/file2
--LFile
node1:/home/user1/files/file0
node1:/home/user1/files/file1
--Schema
record {record_delim="\n"} (
a:int32;
b:int32;
c:int16;
d:sfloat;
e:string[10];
)
The first line of the file set must be specified exactly as shown above. The list of all files on each processing node must be preceded by the line --LFile and each file name must be prefixed by its node name. A file set can optionally contain the record schema of the source files as the last section, beginning with --Schema. If you omit this part of the file set, you must specify a schema to the operator by means of either the -schema or -schemafile option.
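For example, a sketch only (the path of the file set is hypothetical): because a file set such as the one above contains a --Schema section, the import needs no -schema or -schemafile option:
$ osh "import -fileset /home/user1/files/list.fs > outDS.ds"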
File patterns
A file pattern specifies a group of similarly named files from which to import data. The wild card character allows for variations in naming. For example, the file pattern inFile*.data imports files that begin with inFile and end with .data. The file names can contain multi-byte Unicode characters. For example, to import the data from the files that match the file pattern state*.txt, use the statement:
-filepattern state*.txt
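A complete command using this pattern might look like the following sketch (the schema and output name are hypothetical):
$ osh "import -filepattern state*.txt -schema record(state:string[2]; population:int32;) > states.ds"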
The operator assumes that path_name is relative to the working directory from which the job was invoked. You can also indicate one of these: v A specific processing node on which the operator runs. Do this by specifying the node's name. When you do, InfoSphere DataStage creates an instance of the operator on that node to read the path_name. The nodeName must correspond to a node or fastname parameter of the InfoSphere DataStage configuration file. v All processing nodes on which the operator runs. Do this by specifying the asterisk wild card character (*). When you do, InfoSphere DataStage reads path_name on every node on which the operator is running. For example, you can supply the following specification as the file pattern on the osh command line:
*:inFile*.data
This imports all files of the form inFile*.data residing on all processing nodes of the default node pool. You can include a relative or absolute path as part of inFile. If you do not supply an absolute path, import searches for the files on all nodes using the same current working directory as the one from which you invoked the job.
(Figure: Example 1, a step containing a single import operator that reads the file inFile.dat.)
(Figure: Example 2, a step containing a single import operator that reads the files inFile0.dat and inFile1.dat.)
The format of the source file and the import schema are the same as those in "Example 1: Importing from a Single Data File" and the step similarly contains a single instance of the import operator. However, this example differs from the first one in that it imports data from two flat files instead of one and does not save records that cause an error into a reject data set. Specify the data flow with the following osh code:
$ osh "import -file inFile0.data -file inFile1.data -schema $example1_schema > outDS.ds "
Export operator
The export operator exports an InfoSphere DataStage data set to one or more UNIX data files or named pipes located on one or more disk drives.
(Figure: the export operator. The input interface schema is inRec:*; and the optional reject output carries outRec:*;.)
export: properties
Table 71. Export properties
Number of input data sets: 1
Number of output data sets: 0 or 1
Input interface schema: inRec:*; plus user-supplied export schema
Output interface schema: output files: none; reject data set: outRec:*;
Transfer behavior: inRec -> outRec if you specify a reject data set
Execution mode: parallel
Partitioning method: any
Collection method: any
Preserve-partitioning flag in output data set(s): clear
Composite operator: yes
The export operator:
v Takes a single InfoSphere DataStage data set as input.
v Writes its results to one or more output files.
v Allows you to specify an optional output data set to hold records that cause a failure.
v Takes as its input interface schema the schema of the exported data.
v Generates an error and terminates if you try to write data to a file that is not empty. (This behavior can be overridden.)
v Deletes files already written if the step of which it is a part fails. (This behavior can be overridden.)
The source data set might be persistent (stored on disk) or virtual (not stored on disk). The export operator writes several formats of exported data, including:
v Fixed-length records: a fixed-length format equivalent to IBM format-F and format-FB files.
v Data in prefixed records: Record fields are prefixed by a length indicator. Records can be of either fixed or variable length.
v Delimited records: A delimiter such as the ASCII line feed character is inserted after every record. Records can be of either fixed or variable length.
v Implicit: No explicit boundaries delimit the record.
Note: The export operator does not support variable-length blocked records (IBM format-VB or format-VBS).
InfoSphere DataStage accepts a variety of record layouts for the exported file. For example, a data file might represent integers as ASCII strings, contain variable-length records, or represent multi-byte data types in big-endian or little-endian format. See "Import/Export Properties" for more information.
Note: When your character setting is UTF-16, the export operator appends and prepends the byte-order mark 0xfe 0xff 0x00 to every column value.
The following option values can contain multi-byte Unicode characters:
v -file and -destinationlist file names and the base name of the file for the -fileset option
v program name and arguments to the -destination option
v -schema option schema and the schema filename given to the -schemafile option
v prefix value given to the -prefix option, and the suffix value given to the -suffix option
v the -filter option command
v the value given to the -diskpool option
There are two types of required options:
v You must include exactly one of the following to specify the exported files: -file, -fileset, -destination, or -destinationlist.
v You must include exactly one of -schema or -schemafile to define the layout of the exported data.
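A minimal invocation combining one destination option with one schema option might look like this sketch (the file name and schema are hypothetical):
$ osh "export -file /tmp/out.txt -schema record(name:string; age:int8;) < inDS.ds"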
Table 72. Export operator options Option -add_bom Use -add_bom { utf16be | utf16le | utf8 } With this option you can add a BOM to your exported file. The utf16be value specifies FE FF, utf16le specifies FF FE, and utf8 specifies EF BB BF. -append -append Append exported data to an existing file. By default the step terminates if you attempt to export data to a file that is not empty. This option overrides the default behavior. You cannot use this option with -overwrite. -destination -destination prog_name [args] In single quotation marks specify the name of a program that reads the data generated by the export operator. Specify the program's arguments, if any. InfoSphere DataStage calls prog_name and passes to it any specified arguments. You can specify multiple -destination options to the operator: for each -destination, specify the option and supply the prog_name and args (if any). The prog_name and args values might contain multi-byte Unicode values. If this export operator runs as part of a checkpointed step, InfoSphere DataStage calls prog_name once for each iteration of the step. InfoSphere DataStage always appends three additional arguments to prog_name: prog_name [ args ] -s H L where H and L are 32-bit integers. For the first step iteration, or if the step is not checkpointed, H and L are set to 0. For each subsequent iteration of a checkpointed step, H and L specify the (64-bit) byte offset (H = upper 32 bits, L = lower 32 bits) of the exported data in the total export stream from the operator. After all data has been written to the program, prog_name is called once more with an appended switch of -e (corresponding to end of file) and is not passed the -s switch. On last call prog_name can perform any final operation, for example, write a trailing label to a tape. If the export operation fails, InfoSphere DataStage calls prog_name once with the appended switch -c (cleanup) and no -s switch. This gives the program an opportunity to clean up. You cannot use this option with -filter. -destinationlist -destinationlist file_name Specifies in single quotation marks file_name, the name of a file containing the names of multiple destination programs, where each command line is listed on a separate line of the file. file_name might contain multi-byte Unicode characters. InfoSphere DataStage calls the programs as if you specified multiple -destination options. See the description of -destination for more information. -dontUseOffsetsWith Destinations -dontUseOffsetsWithDestinations Do not supply the -s, H, L arguments to destination programs. This means that the byte offset is always 0. See the -destination option for more information.
Table 72. Export operator options (continued) Option -file Use -file [nodeName:]outFile0 Supply the name of an output file or pipe. The file or pipe must be empty unless you specify either the -append or -overwrite option. You can include multiple -file options to specify multiple input files. For each one, specify -file and supply the file name. The file name can contain multi-byte Unicode characters. Note: You can specify a hyphen to signal that export writes its output to the stdout for osh. You cannot use this option with -fileset.
Table 72. Export operator options (continued) Option -fileset Use -fileset filesetName.fs {-create | -replace | -discard_records |-discard_schema_and_records} [-diskpool diskpool] [-maxFileSize numMB] [-prefix prefix] [-single[FilePerPartition]] [-suffix suffix] [-writeSchema | -omitSchema] Specifies the name of the file set, filesetName, a file into which the operator writes the names of all data files that it creates. The suffix .fs identifies the file to InfoSphere DataStage as a file set. filesetName can contain multi-byte Unicode characters. The name of each export file generated by the operator is written to filesetName.fs, one name per line. The suboptions are: -create: Create the file set. If it already exists, this option generates an error. -replace: Remove the existing fileset and replace it with a new one. -discard_records: Keep the existing files and schema listed in filesetName .fs but discard the records; create the file set if it does not exist. -discard_schema_and_records: Keep existing files listed in filesetName .fs but discard the schema and records; create the file set if it does not exist. The previous suboptions are mutually exclusive with each other and also with the -append option. -diskpool diskpool: Specify the name of the disk pool into which to write the file set. diskpool can contain multi-byte Unicode characters. -maxFileSize numMB: Specify the maximum file size in MB. Supply integers. The value of numMB must be equal to or greater than 1. -omitSchema: Omit the schema from filesetName .fs. The default is for the schema to be written to the file set. -prefix: Specify the prefix of the name of the file set components. It can contain multi-byte Unicode characters. If you do not specify a prefix, the system writes the following: export username, where username is your login. -replace: Remove the existing file set and create a new one. -singleFilePerPartition: Create one file per partition. The default is to create many files per partition. This can be shortened to -single. -suffix suffix: Specify the suffix of the name of the file set components. It can contain multi-byte Unicode characters. The operator omits the suffix by default. -writeSchema: Use only with -fileset. Write the schema to the file set. This is the default.9 You cannot use -fileset with -file or -filter. "File Sets" discusses file sets. -firstLineColumnNames [-firstLineColumnNames] Specifies that column names be written to the first line of the output file. -nocleanup -nocleanup Configures the operator to skip the normal data file deletion if the step fails. By default, the operator attempts to delete partial data files and perform other cleanup operations on step failure.
Table 72. Export operator options (continued) Option -overwrite Use -overwrite The default action of the operator is to issue an error if you attempt to export data to a file that is not empty. Select -overwrite to override the default behavior and overwrite the file. You cannot use this option with -append or -replace. -rejects -rejects continue | fail | save Configures operator behavior if a record is rejected. The default behavior is to continue. Rejected records are counted but discarded. The number of rejected records is printed as a log message at the end of the step. However, you can configure the operator to either fail and terminate the job or save, that is, create output data set 0 to hold reject records. If you use -rejects fail, osh generates an error upon encountering a record that cannot be successfully exported; otherwise osh generates a warning upon encountering a record that cannot be successfully exported. -schema -schema record_schema Specifies in single quotation marks the export record schema. You can also specify a file containing the record schema using the syntax: -schema record @file_name where file_name is the path name of the file containing the record schema. The file_name and record_schema can contain multi-byte Unicode characters. You cannot use this option with -schemafile. -schemafile -schemafile schema_file Specifies in single quotation marks the name of a file containing the export record schema. The file name can contain multi-byte Unicode characters. This is equivalent to: -schema record @ schema_file You cannot use this option with -schema. -filter -filter command Specifies a UNIX command to process all exported data after the data set is exported but before the data is written to a file. command can contain multi-byte Unicode characters. You cannot use this option with -fileset or -destination.
v The properties of exported fields
v The order in which they are exported
The export operator writes only those fields specified by the export schema and ignores other fields, as in the following example, where the schema of the source InfoSphere DataStage data set differs from that of the exported data:
Table 73. Example: Data Set Schema Versus Export Schema

Source InfoSphere DataStage data set schema:
record (
Name: string;
Address: string;
State: string[2];
Age: int8;
Gender: string[1];
Income: dfloat;
Phone: string;
)

Export schema:
record (
Gender: string[1];
State: string[2];
Age: int8;
Income: dfloat;
)
In the example shown above, the export schema drops the fields Name, Address, and Phone and moves the field Gender to the beginning of the record.
Note: If you do not provide a schema, InfoSphere DataStage assumes that exported data is formatted according to the default schema discussed in "The Default Export Schema".
Here is how you set up export schemas:
v To export all fields of the source record and format them according to the default export schema, specify the following:
record ( )
Refer to "The Default Export Schema".
v To export all fields of the source record but override one or more default record properties, add new properties, or do both, specify:
record {record_properties} ()
Refer to "Import/Export Properties" to learn about setting up record and field properties.
v To export selected fields of the source record, define them as part of the schema definition, as follows:
record ( field_definition0; ... field_definitionN;)
where field_definition0; ... field_definitionN are the fields chosen for export. No record-level properties have been defined and the default export schema is applied.
v You can define properties for records and for fields, as in the following example:
record {delim = none, binary} (
Name:string {delim = ,};
Age:int8 {default = 127};
Address:subrec (
Street:string[64];
City:string[24];
State:string[4];
);
Zip:string[12];
Phone:string[12];
)
Refer to "Import/Export Properties" to learn about setting up record and field properties. Refer to the following sections for a discussion of schemas as they pertain to import/export operations: "Record Schemas" and"Complete and Partial Schemas" .
v A program's input (see Destination program's input on page 357 ) v The input to several programs defined in a destination list (see List of destination programs on page 357 ). You can also explicitly specify the nodes and directories to which the operator exports data (see Nodes and directories on page 357 ).
File sets
The export operator can generate and name exported files, write them to their destination, and list the files it has generated in a file whose extension is .fs. The data files and the file that lists them are called a file set. They are established by means of the fileset option. This option is especially useful because some operating systems impose a 2 GB limit on the size of a file and you must distribute the exported files among nodes to prevent overruns. When you choose -fileset, the export operator runs on nodes that contain a disk in the export disk pool. If there is no export disk pool, the operator runs in the default disk pool, generating a warning when it does. However, if you have specified a disk pool other than export (by means of the -diskpool option), the operator does not fall back and the export operation fails. The export operator writes its results according to the same mechanism. However, you can override this behavior by means of the -diskpool option. The amount of data that might be stored in each destination data file is limited (typically to 2 GB) by the characteristics of the file system and the amount of free disk space available. The number of files created by a file set depends on: v The number of processing nodes in the default node pool v The number of disks in the export or default disk pool connected to each processing node in the default node pool v The size of the partitions of the data set You name the file set and can define some of its characteristics. The name of the file set has the form file_name.fs, where .fs identifies the file as a file set to InfoSphere DataStage. Specify the -fileset option and the path name of the file set. Note: When you choose -fileset, the export operator names the files it generates and writes the name of each one to the file set. By contrast, you have to name the files when you choose the -file option. The names of the exported files created by the operator have the following form:
nodeName:dirName/prefixPXXXXXX_FYYYYsuffix
where: v nodeName is the name of the node on which the file is stored and is written automatically by InfoSphere DataStage.
v dirName is the directory path and is written automatically by InfoSphere DataStage. v prefix is by default export.userName, where userName is your login name, and is written automatically by InfoSphere DataStage. However, you can either define your own prefix or suppress the writing of one by specifying "" as the prefix. v XXXXXX is a 6-digit hexadecimal string preceded by P specifying the partition number of the file written by export. The first partition is partition 000000. The partition numbers increase for every partition of the data set but might not be consecutive. v YYYY is a 4-digit hexadecimal string preceded by F specifying the number of the data file in a partition. The first file is 0000. The file numbers increase for every file but might not be consecutive. v InfoSphere DataStage creates one file on every disk connected to each node used to store an exported partition. v suffix is a user-defined file-name suffix. By default, it is omitted. For example, if you specify a prefix of "file" and a suffix of "_exported", the third file in the fourth partition would be named:
node1:dir_name/fileP000003_F0002_exported
Some data sets, such as sorted data sets, have a well-defined partitioning scheme. Because the files created by the export operator are numbered by partition and by data file within a partition, you can sort the exported files by partition number and file number to reconstruct your sorted data.
If you omit the optional node name, the operator exports files or named pipes only to the processing node on which the job was invoked. If you specify only the file name the operator assumes that path_name is relative to the working directory from which the job was invoked. You can also indicate one of these: v A specific processing node to which the output is written. Do this by specifying the node's name. The nodeName must correspond to a node or fastname parameter of the InfoSphere DataStage configuration file.
v All processing nodes on which the operator runs. Do this using the asterisk wild card character (*). When you do, InfoSphere DataStage writes one file to every node on which the operator is running. For example, you can supply the following as the exported file name on the osh command line:
*:outFile.data
This exports data to files called outFile.data, which are created on all processing nodes on which the operator runs. The name outFile.data is a relative path name and the exporter writes files using the same current working directory as the one from which you invoked the job.
The following osh code specifies the layout of the destination data file:
$ exp_example_1="record {delim = none, binary} (
Name:string {delim = ,};
Age:int8 {default = 127};
Address:subrec (
Street:string[64];
City:string[24];
State:string[4];
);
Zip:string[12];
Phone:string[12];
)"
In this schema, the export operator automatically inserts a newline character at the end of each record according to the default. However, numeric data is exported in binary mode and no field delimiters are present in the exported data file, except for the comma delimiting the Name field. Both properties are non-default (see "The Default Export Schema" ). They have been established through the definition of record and field properties. The record properties are delim = none and binary. The field Name is explicitly delimited by a comma (delim = ',') and this field property overrides the record-level property of delim = none. See "Import/Export Properties" for a full discussion of this subject.
The following osh code uses the specified schema to export the data:
$ osh "export -file outFile.dat -schema $exp_example_1 -rejects save < inDS.ds > errDS.ds"
The following osh code specifies the layout of the destination data files:
$ exp_example_2="record {record_length = fixed, delim = none, binary} (
Name:string[64];
Age:int8;
Income:dfloat;
Zip:string[12];
Phone:string[12];
)"
In this example, all fields are of specified length and the schema defines the fixed record length property. The following osh code uses the specified schema to export the data:
$ osh "export -fileset listFile.fs -schema $exp_example_2 -rejects save < inDS.ds > errDS.ds"
Upon completion of the step, listFile.fs contains a list of every destination data file created by the operator, one file name per line, and contains the record schema used for the export. Here are the contents of listFile.fs:
--Orchestrate File Set v1
--LFile
node0:/home/user1/sfiles/node0/export.user1.P000000_F0000
node0:/home/user1/sfiles/node0/export.user1.P000000_F0001
--LFile
node1:/home/user1/sfiles/node1/export.user1.P000001_F0000
node1:/home/user1/sfiles/node1/export.user1.P000001_F0001
node1:/home/user1/sfiles/node1/export.user1.P000001_F0002
--Schema
record {record_length = fixed, delim = none, binary} (
Name:string[64];
Age:int8;
Income:dfloat;
Zip:string[12];
Phone:string[12];
)
Import/export properties
You can add import/export properties to a schema when you import and export flat files. Properties define the layout of imported and exported data, including such things as how numbers, strings, times, and dates are represented. Properties can apply to records or to fields, and there are numerous default properties that you do not need to specify. The property determines how the data is represented in the file from which the data is imported or to which the data is exported.
Note: Properties apply only when you import or export data. They are not part of the internal InfoSphere DataStage representation of the data, which only needs to know the type of the data.
InfoSphere DataStage assumes that data to import or export is formatted according to default properties if:
v You do not specify either record or field properties for imported or exported data.
v InfoSphere DataStage performs an unmodified implicit import or export operation (see "Implicit Import/Export Operations with No Schemas Specified" ).
The defaults are discussed in "The Default Import Schema" and "The Default Export Schema" . You explicitly define properties to override these defaults.
Setting properties
You establish the properties of imported and exported data as part of defining the data's schema. Define record-level properties (which describe the format of the entire record) as follows:
record { prop1 [ = arg1 ], prop2 [ = arg2 ], ... propn [ = argn ] } (field_definitions ...;)
v Define record properties after the word record and before the enumeration of field definitions.
v Enclose the definition in braces ( { } ).
v Define properties in a comma-separated list if there is more than one property.
v Attribute values to the properties with the attribution sign ( = ). If the attributed values are strings, enclose them in single quotes.
v Specify special characters by starting with a backslash escape character. For example, \t represents an ASCII tab delimiter character.
For example, the following defines a record in which all fields are delimited by a comma except the final one, which is delimited by a comma followed by an ASCII space character:
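A sketch of such a schema (field names illustrative), using the delim and final_delim_string properties described later in this chapter:
record {delim = ',', final_delim_string = ', '} ( a:int32; b:string; c:int8; d:raw; )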
Define field-level properties (which describe the format of a single field) as follows:
field_definition { prop1 [ = arg1 ], prop2 [ = arg2 ], ... propn [ = argn ] };
where field_definition is the name and data type of the field.
v Define field properties after the field_definition and before its final semi-colon.
v Enclose the definition of properties in braces ( { } ).
v Define properties in a comma-separated list if there is more than one property.
v Attribute values to the properties with the attribution sign ( = ). If the attributed values are strings, enclose them in single quotes.
v Specify special characters by starting with a backslash escape character. For example, \t represents an ASCII tab delimiter character.
For example, the following specifies that the width of a string field is 40 bytes and that the ASCII space pad character is used when it is exported:
record (a:string {width = 40, padchar = ' '};)
You can specify many default values for properties at the record level. If you do, the defined property applies to all fields of that data type in the record, except where locally overridden. For example, if numeric data of imported or exported records are represented in binary format, you can define the binary property at record level, as in the following example:
record {binary, delim = none} (a:int32; b:int16; c:int8;)
With one exception, properties set for an individual field can override the default properties set at record level. The exception is the fill property, which is defined at record level but cannot be overridden by an individual field or fields. For example, the following schema sets the record length as fixed with no field delimiters and the layout of all fields as binary. However, the definition of the properties of the purposeOfLoan field is as follows: the field is formatted as text, its length is 5 bytes (and not the 4 bytes of the int32 data type), and it is delimited by a comma:
record {record_length = fixed, delim = none, binary} (
  checkingAccountStatus:int32;
  durationOfAccount:sfloat;
  creditHistory:int32;
  purposeOfLoan:int32 {width = 5, text, delim = ','};
  creditAmount:sfloat;
  savingsAccountAmount:int32;
  yearsEmployed:int32;
  installmentRate:sfloat;
)
Properties
Certain properties are associated with entire records, and other properties with fields of specific data types, but most can be used with fields of all data types and the records that contain them. This section contains the following topics:
v "Record-Level Properties"
v "Numeric Field Properties"
v "String Field Properties"
v "Ustring Field Properties"
v "Decimal Field Properties"
v "Date Field Properties"
v "Time Field Properties"
v "Timestamp Field Properties"
v "Raw Field Properties"
v "Vector Properties"
v "Nullable Field Properties"
v "Tagged Subrecord Field Properties"
Each property that is described in the following section is discussed in detail in "Properties: Reference Listing" .
Record-level properties
Some properties apply only to entire records. These properties, which establish the record layout of the imported or exported data or define partial schemas, can be specified only at the record level. If you include no properties, default import and export properties are used; see "The Default Import Schema" and "The Default Export Schema" . The following table lists the record-level properties that cannot be set at the field level.
v intact - Defines a partial schema
v check_intact - Verifies a partial schema defined by intact
v record_length - Defines the fixed length of a record
v record_prefix - Defines the record's length prefix as being 1, 2, or 4 bytes long
v record_format - With type = implicit, field length is determined by field content; with type = varying, defines IBM blocked or spanned format
v record_delim - Defines a single character delimiting a record
v record_delim_string - Defines one or more characters delimiting a record
v fill - Defines the value used to fill gaps between fields of an exported record. This property applies to export only and cannot be set at the field level.
Field Properties
You can define properties of imported and exported fields. This section lists field properties according to the data types that they can be used with, under the following categories:
v Numeric Field Properties
v String Field Properties
v Decimal Field Properties
v Date Field Properties
v Time Field Properties
v Timestamp Field Properties
v Raw Field Properties
v Vector Properties
v Nullable Field Properties
v Subrecord Field Properties
v Tagged Subrecord Field Properties
There is some overlap in these tables. For example, a field might be both decimal and nullable, and so field properties in both the decimal and nullable categories would apply to the field. Furthermore many properties such as delim apply to fields of all (or many) data types, and to avoid tedious repetition are not listed in any of the categories. However, they all appear in "Properties: Reference Listing" .
Numeric Field Properties
v c_format - Format of text-numeric translation to and from numeric fields
v in_format - Format of text translation to numeric field
v out_format - Format of numeric translation to text field
v padchar - Pad character of exported strings or numeric values
v width - Defines the exact width of the destination field
v max_width - Defines the maximum width of the destination field

String Field Properties
v padchar - Defines the pad character of exported strings or numeric values
v width - Defines the exact width of the destination field

Ustring Field Properties
v charset - At the field level, defines the character set to be used for ustring fields; at the record level, applies to the other import/export properties that support multi-byte Unicode character data
v max_width - Defines the maximum width of the destination field
v padchar - Defines the pad character of exported strings or numeric values
v width - Defines the exact width of the destination field

Decimal Field Properties
v fix_zero - Treat a packed decimal field containing all zeros (normally illegal) as a valid representation of zero
v max_width - Defines the maximum width of the destination field
v packed - Defines the field as containing a packed decimal
v precision - Defines the precision of a decimal
v round - Defines the rounding mode
v scale - Defines the scale of a decimal
v separate - Defines the field as containing an unpacked decimal with a separate sign byte
v width - Defines the exact width of the destination field
v zoned - Defines the field as containing an unpacked decimal in text format

Date Field Properties
v charset - Specifies the character set for the date
v date_format - Defines a text-string date format or uformat format other than the default; uformat can contain multi-byte Unicode characters
v days_since - The field stores the date as a signed integer containing the number of days since date_in_ISO_format or uformat
v default_date_format - Provides support for international date components
v julian - Defines the date as a binary numeric value containing the Julian day
v little_endian - Specifies the byte ordering as little-endian
v native_endian - Specifies the byte ordering as native-endian
v text - Specifies text as the data representation

Time Field Properties
v little_endian - Specifies the byte ordering as little-endian
v native_endian - Specifies the byte ordering as native-endian
v midnight_seconds - Represent the time field as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight
v time_format - Defines a text-string time format or uformat format other than the default
v text - Specifies text as the data representation

Timestamp Field Properties
v little_endian - Specifies the byte ordering as little-endian
v native_endian - Specifies the byte ordering as native-endian
v text - Specifies text as the data representation
v timestamp_format - Defines a text-string timestamp format or uformat other than the default; uformat can contain multi-byte Unicode characters
Vector Properties
InfoSphere DataStage supports vector fields of any data type except subrecord and tagged subrecord. InfoSphere DataStage vectors can be of either fixed or variable length. No special properties are required for fixed-length vectors. The import and export operators perform the number of import/export iterations defined by the number of elements in the vector. However, variable-length vectors can have the properties described in the following table.
v link - This field holds the number of elements in a variable-length vector
v prefix - Defines the length of prefix data, which specifies the length in bytes
v vector_prefix - Defines the length of vector prefix data, which specifies the number of elements
Nullable Field Properties
v null_field - Specifies the value representing null
v null_length - Specifies the value of the length prefix of a variable-length field that contains a null

Subrecord Field Properties
v fill - Defines the value used to fill gaps between fields of an exported record. This property applies to export only and cannot be set at the field level
v record_delim - Specifies a single ASCII or multi-byte Unicode character to delimit a record
v record_delim_string - Specifies a string or ustring to delimit a record
v record_format - With type = implicit, field length is determined by field content; with type = varying, defines IBM blocked or spanned format
v record_length - Defines the fixed length of a record
v record_prefix - Defines the record's length prefix as being 1, 2, or 4 bytes long

Tagged Subrecord Field Properties
v tagcase - Defines the active field in a tagged subrecord
The following list summarizes the record and field properties; each is described in detail in the alphabetic reference listing that follows.
v actual_length - Used with null_length; defines the actual number of bytes to skip (import) or fill (export) when a field contains a null
v ascii - Specifies that text-format fields use the ASCII character set (the default)
v big_endian - Specifies the byte ordering of multi-byte data types as big-endian
v binary - Field value represented in binary format; decimal represented in packed decimal format; julian day format; time represented as number of seconds from midnight; timestamp formatted as two 32-bit integers
v c_format - Format of text-numeric translation to and from numeric fields
v charset - Specifies a character set
v check_intact - Error checking of imported records with partial record schema (a suboption of intact)
v date_format - Defines a text-string date format or uformat format other than the default; uformat can contain multi-byte Unicode characters
v days_since - The imported or exported field stores the date as a signed integer containing the number of days since date_in_ISO_format or uformat
v decimal_separator - Specifies an ASCII character to separate the integer and fraction components of a decimal
v default - Default value for a field that causes an error
v default_date_format - Provides support for international date components in date fields
v default_time_format - Provides support for international time components in time fields
v delim - Trailing delimiter of all fields
v delim_string - One or more ASCII characters forming the trailing delimiter of all fields
v drop - Field dropped on import
v ebcdic - Character set of data is in EBCDIC format
v export_ebcdic_as_ascii - Exported field converted from EBCDIC to ASCII
v fill - Byte value to fill in gaps in exported record
v final_delim - Delimiter character trailing the last field of the record
v final_delim_string - Delimiter string trailing the last field of the record
v fix_zero - A packed decimal field containing all zeros (normally illegal) is treated as a valid representation of zero
v generate - Creation of exported field
v import_ascii_as_ebcdic - Translation of imported string field from ASCII to EBCDIC
v in_format - Format of text translation to numeric field
v intact - The record definition defines a partial record schema
v julian - The imported or exported field represents the date as a numeric value containing the Julian day
v link - A field holds the length of another, variable-length field of the record; the field might be a vector
v little_endian - Multi-byte data types are formatted as little-endian
v max_width - Maximum width of the destination field
v midnight_seconds - The field represents the time as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight
v native_endian - Multi-byte data types are formatted as defined by the native format of the machine; this is the default for import/export operations
v nofix_zero - A packed decimal field containing all zeros generates an error (the default)
v null_field - Value of a null field in the file to be imported; value to write to the exported file if the source field is null
v null_length - Imported length meaning a null value; exported field contains a null
v out_format - Format of numeric translation to text field
v overpunch - The field has a leading or ending byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed
v packed - The imported or exported field contains a packed decimal
v padchar - Pad character of exported strings or numeric values
v position - The byte offset of the field in the record
v precision - The precision of the packed decimal
v prefix - Prefix of all fields of the record; also length of the prefix holding the length in bytes of a vector
v print_field - For debugging; each imported value of the field generates a message
v quote - Field is enclosed in quotes or another ASCII character; useful for variable-length fields
v record_delim - Record delimited by a single ASCII character
v record_delim_string - Record delimited by one or more ASCII characters
v record_format - Variable-length blocked records or implicit records
v record_length - Fixed-length records
v record_prefix - Records with a length prefix stored as binary data
v reference - Name of the field containing the field length
v round - Rounding mode of the source decimal
v scale - The scale of the decimal
v separate - The imported or exported field contains an unpacked decimal with a separate sign byte
v skip - Number of bytes skipped from the end of the previous field to the beginning of this one
v tagcase - Defines the active field in a tagged subrecord
v text - Field represented as text-based data; decimal represented in string format; date, time, and timestamp representation is text-based
v time_format - The format of an imported or exported field representing a time as a string
v timestamp_format - The format of an imported or exported field representing a timestamp as a string
v vector_prefix - Prefix length for element count
v width - The exact width of the destination field
v zoned - The imported or exported field contains an unpacked decimal represented by either ASCII or EBCDIC text
The remainder of this topic consists of an alphabetic listing of all properties. In presenting the syntax, field_definition is the name and data type of the field whose properties are defined.
actual_length
Used with null_length ("null_length" ).
v On import, specifies the actual number of bytes to skip if the field's length equals the null_length.
v On export, specifies the number of bytes to fill with the null-field or specified pad character if the exported field contains a null.
Applies to
Nullable fields of all data types; cannot apply to record, subrec, or tagged.
Syntax
field_definition {actual_length = length};
where length is the number of bytes skipped by the import operator or filled with zeros by the export operator when a field has been identified as null.
See also
"null_length" .
Example
In this example:
v On import, the import operator skips the next ten bytes of the imported data if the imported field has a length prefix of 255.
v On export, the length prefix is set to 255 if source field a contains a null and the export operator fills the next ten bytes with one of these: zeros, the null-field specification, the pad-character specification for the field.
record {prefix = 2} ( a:nullable string {null_length = 255, actual_length = 10}; )
ascii
Specifies that the ASCII character set is used by text-format fields of imported or exported data. This property is equivalent to specifying the US_ASCII character set and is the default.
Applies to
All data types except raw, ustring; record, subrec, or tagged containing at least one non-raw field. For ustring, the same functionality is available using the charset property.
Syntax
record { ascii } field_definition { ascii };
See also
"ebcdic" .
Example
The following specification overrides the record-level property setting for field b:
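A sketch of such an override (field names illustrative), with an EBCDIC record-level setting overridden for field b:
record {ebcdic} ( a:int32; b:int16 {ascii}; c:int8; )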
big_endian
Specifies the byte ordering of multi-byte data types in imported or exported data as big-endian.
Applies to
Fields of the integer, date, time, or timestamp data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { big_endian } field_definition { big_endian };
Restrictions
InfoSphere DataStage ignores the endian property of a data type that is not formatted as binary. This property is mutually exclusive with little_endian and native_endian.
Examples
The following specification defines a record schema using a big-endian representation:
record {big_endian, binary, delim = none} ( a:int32; b:int16; c:int8; )
binary
Specifies the data representation format of a field in imported or exported data as binary.
Applies to
Fields of all data types except string, ustring, and raw; record, subrec or tagged containing at least one field that is neither string nor raw.
Syntax
record { binary } field_definition { binary };
This option specifies binary data; data is formatted as text by default (see "text" ).
Meanings
The binary property has different meanings when applied to different data types:
v For decimals, binary means packed (see "packed" ).
v For other numerical data types, binary means "not text".
v For dates, binary is equivalent to specifying the julian property for the date field (see "julian" ).
v For time, binary is equivalent to midnight_seconds (see "midnight_seconds" ).
v For timestamp, binary specifies that the first integer contains a Julian day count for the date portion of the timestamp and the second integer specifies the time portion of the timestamp as the number of seconds from midnight; on export, binary specifies to export a timestamp to two 32-bit integers.
Restrictions
If you specify binary as a property of a numeric field, the data type of an imported or exported field must be the same as the corresponding field defined in a record schema. No type conversions are performed among the different numeric data types (as would be the case if text was specified instead). This property is mutually exclusive with text, c_format, in_format, and out_format.
Examples
For example, the following defines a schema using binary representation for the imported or exported numeric fields with no delimiter between fields:
record { binary, delim = none } ( a:int32; b:int16; c:int8; )
c_format
Perform non-default conversion either of imported data from string to integer or floating-point data or of exported data from integer or floating-point data to a string. This property specifies a C-language format string used for both import and export. To specify separate import/export strings, see "in_format" and "out_format" .
Applies to
Imported or exported data of the integer or floating-point type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
field_definition { c_format= sscanf_or_sprintf_string };
sscanf_or_sprintf_string is a control string in single quotation marks containing formatting specifications for both sscanf() to convert string data to floating-point or integer data and sprintf() to convert floating-point or integer data to strings. See the appropriate C language documentation for details of this formatting string.
Discussion
A UNIX data file can represent numeric values as strings. By default, on import InfoSphere DataStage invokes the C strtol(), strtoll(), strtoul(), strtoull(), or strtod() function to convert a string to a numeric field.
An InfoSphere DataStage data file can represent numeric data in integer and floating-point data types. By default, on export InfoSphere DataStage invokes the C sprintf() function to convert a numeric field formatted as either integer or floating point data to a string.
If these functions do not output data in a satisfactory format, you can use the c_format property to specify a format that is used both by the sscanf() function to convert string data to integer or floating point data and by the sprintf() function to convert integral or floating point data to string data. The associated imported/exported field must represent numeric data as text.
Int64 values map to long long integers on all supported platforms. For all platforms the c_format is:
%[ padding_character ][ integer ]lld
The integer component specifies a minimum field width. The output column is printed at least this wide, and wider if necessary. If the column has fewer digits than the field width, it is padded on the left with padding_character to make up the field width. The default padding character is a space.
Examples
v For this example c_format specification:
%09lld
the padding character is zero (0), and the integers 123456 and 12345678 are printed out as 000123456 and 012345678.
v For this example:
record (a:int32 {c_format = '%x', width = 8};)
the meaning of the expression varies:
On import, the expression uses the c_format property to ensure that string data in the imported file is formatted in the InfoSphere DataStage record as field a, a 32-bit integer formatted as an 8-byte hexadecimal string.
On export, the same expression ensures that the field a, consisting of a 32-bit integer, is exported from the InfoSphere DataStage record as an 8-byte hexadecimal string.
charset
Specifies a character set defined by the International Components for Unicode (ICU).
Applies to
At the field level, this option applies only to ustrings. At the record level, this option applies to the fields that do not specify a character set and to these properties that support multi-byte Unicode character data:
delim, delim_string, record_delim, record_delim_string, final_delim, final_delim_string, quote, default, padchar, null_field, date_format, time_format, timestamp_format
Syntax
record { charset = charset } field_definition { charset = charset };
Example
record { charset=charset1 } (a:ustring { charset=ISO-8859-15 } {delim = xxx}, b:date { charset=ISO-8859-15 }, c:ustring)
Here the user-defined record charset, charset1, applies to field c and to the delim specification for field a. ISO-8859-15 applies to fields a and b. Notice that the character set setting for field a does not apply to its delim property.
check_intact
Used only with the intact property, performs error checking when records are imported with a partial record schema.
Applies to
record, if it is qualified by the intact record property. (See "intact" ).
Syntax
record { intact, check_intact }
By default, when InfoSphere DataStage imports records with a partial schema, it does not perform error checking in order to maximize the import speed. This property overrides the default. Error-checking in this case verifies, during the import operation, that the record contents conform to the field description. In addition, downstream of this verification, the operator that acts on the input fields verifies their value.
Example
For example, the following statement uses intact to define a partial record schema and uses check_intact to direct InfoSphere DataStage to perform error checking:
record {intact=rName, check_intact, record_length=82, record_delim_string='\r\n'} ()
date_format
Specifies the format of an imported or exported text-format date.
Applies to
Field of date data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { date_format = format_string | uformat }
uformat is described in "default_date_format" . The format_string can contain one or a combination of the following elements:
Table 74. Date format tags

v %d - Day of month, variable width (variable-width availability: import). Options: s
v %dd - Day of month, fixed width. Options: s
v %ddd - Day of year (variable width with the v option). Options: s, v
v %m - Month of year, variable width (variable-width availability: import). Options: s
v %mm - Month of year, fixed width. Options: s
v %mmm - Month of year, short name, locale specific (Jan, Feb ...). Options: t, u, w
v %mmmm - Month of year, full name, locale specific (January, February ...); available on import/export. Options: t, u, w, -N, +N
v %yy - Year of century (00...99). Options: s
v %yyyy - Four-digit year (0001...9999)
v %NNNNyy - Cutoff year plus year of century (yy = 00...99). Options: s
v %e - Day of week, Sunday = day 1 (1...7)
v %E - Day of week, Monday = day 1 (1...7)
v %eee - Weekday short name, locale specific (Sun, Mon ...). Options: t, u, w
v %eeee - Weekday long name, locale specific (Sunday, Monday ...); available on import/export. Options: t, u, w, -N, +N
v %W - Week of year (ISO 8601, Mon), variable width (1...53; variable-width availability: import). Options: s
v %WW - Week of year (ISO 8601, Mon), fixed width (01...53). Options: s
When you specify a date format string, prefix each component with the percent symbol (%) and separate the string's components with a suitable literal character. The default date_format is %yyyy-%mm-%dd. Where indicated the tags can represent variable-width data elements. Variable-width date elements can omit leading zeroes without causing errors. The following options can be used in the format string where indicated in the table:
s
Specify this option to allow leading spaces in date formats. The s option is specified in the form %(tag,s), where tag is the format string. For example, %(m,s) indicates a numeric month of year field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the following date format property:
%(d,s)/%(m,s)/%yyyy
Then the following dates would all be valid:
8/ 8/1958
08/08/1958
8/8/1958
v
Use this option in conjunction with the %ddd tag to represent day of year in variable-width format. So the following date property:
%(ddd,v)
represents values in the range 1 to 366. (If you omit the v option then the range of values would be 001 to 366.)
u
Use this option to render uppercase text on output.
w
Use this option to render lowercase text on output.
t
Use this option to render titlecase text (initial capitals) on output.
The u, w, and t options are mutually exclusive. They affect how text is formatted for output. Input dates will still be correctly interpreted regardless of case.
-N
Specify this option to left justify long day or month names so that the other elements in the date will be aligned.
+N
Specify this option to right justify long day or month names so that the other elements in the date will be aligned.
Names are left justified or right justified within a fixed width field of N characters (where N is between 1 and 99). Names will be truncated if necessary. The following are examples of justification in use:
%dd-%(mmmm,-5)-%yyyy produces: 21-Augus-2006
%dd-%(mmmm,-10)-%yyyy produces: 21-August    -2005
%dd-%(mmmm,+10)-%yyyy produces: 21-    August-2005
The locale for determining the setting of the day and month names can be controlled through the locale tag. This has the format:
%(L,locale)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See IBM InfoSphere DataStage and QualityStage Globalization Guide for a list of locales. The default locale for month names and weekday names markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set). Use the locale tag in conjunction with your date format, for example the format string:
%(L,'es')%eeee, %dd %mmmm %yyyy
Specifies the Spanish locale and would result in a date with the following format:
miércoles, 21 septiembre 2005
The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the 'incompatible with' column. When some tags are used the format string requires that other tags are present too, as indicated in the 'requires' column.
Table 75. Format tag restrictions

v year - numeric tags %yyyy, %yy, %[nnnn]yy
v month - numeric tags %mm, %m; text tags %mmm, %mmmm; requires: year; incompatible with: week of year
v day of month - numeric tags %dd, %d; requires: month; incompatible with: day of week, week of year
v day of year - numeric tag %ddd; requires: year; incompatible with: day of month, day of week, week of year
v day of week - numeric tags %e, %E; text tags %eee, %eeee; requires: month, week of year; incompatible with: day of year
v week of year - numeric tag %WW; requires: year; incompatible with: month, day of month, day of year
When a numeric variable-width input tag such as %d or %m is used, the field to the immediate right of the tag (if any) in the format string cannot be either a numeric tag, or a literal substring that starts with a digit. For example, all of the following format strings are invalid because of this restriction:
%d%m-%yyyy
%d%mm-%yyyy
%(d)%(mm)-%yyyy
%h00 hours
The year_cutoff is the year defining the beginning of the century in which all two-digit years fall. By default, the year cutoff is 1900; therefore, a two-digit year of 97 represents 1997. You can specify any four-digit year as the year cutoff. All two-digit years then specify the next possible year ending in the specified two digits that is the same or greater than the cutoff. For example, if you set the year cutoff to 1930, the two-digit year 30 corresponds to 1930, and the two-digit year 29 corresponds to 2029. On import and export, the year_cutoff is the base year. This property is mutually exclusive with days_since, text, and julian.
You can include literal text in your date format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:
Tag   Escape sequence
%%    literal percent sign
\%    literal percent sign
\n    newline
\t    horizontal tab
\\    single backslash
If the format string does not include a day, the day is set to the first of the month in the destination field. If the format string does not include the month and day, they default to January 1. Note that the format string must contain a month if it also contains a day; that is, you cannot omit only the month.
When using variable-width tags, it is good practice to enclose the date string in quotes. For example, the following schema:
f1:int32; f2:date { date_format=%eeee %mmmm %d, %yyyy, quote=double }; f3:int32;
The quotes are required because the parallel engine assumes that variable-width fields are space delimited and so might interpret a legitimate space in the date string as the end of the date.
See also
See the section on InfoSphere DataStage data types in the InfoSphere DataStage Parallel Job Developer Guide for more information on date formats.
days_since
The imported or exported field stores the date as a signed integer containing the number of days since date_in_ISO_format or uformat.
Applies to
Fields of the date data type ; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { days_since = date_in_ISO_format | uformat }
where date_in_ISO_format is in the form %yyyy-%mm-%dd; and uformat is the default date format as described in "default_date_format" . The imported or exported field is always stored as binary data. This property is mutually exclusive with date_format, julian, and text.
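As a sketch (field name and base date illustrative), assuming the ISO date literal can be written directly as in the syntax shown above, the following stores a date field as the number of days since 1 January 1970:
record {days_since = 1970-01-01} ( a:date; )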
decimal_separator
Specifies an ASCII character to separate the integer and fraction components of a decimal.
Applies to
record and decimal fields.
Syntax
record { decimal_separator = ASCII_character } field_definition { decimal_separator = ASCII_character };
Example
record { decimal_separator = ',' } (a:decimal; b:decimal { decimal_separator = '-' };)
where the decimal separator for field a is ',' and is '-' for field b.
default
Sets the default value of an imported or exported field:
v On import if the generate property is set
v On both import and export if an error occurs
Applies to
Fields of integer, float, string, ustring, and decimal data types. Cannot apply to record, subrecord, or tagged.
Syntax
field_definition { default = field_value };
where field_value is the default value of an imported or exported field. For ustring fields, field_value might contain multi-byte Unicode characters. On import, if you specify a default value, the import operator sets the destination field in the data set to the default value instead of rejecting an erroneous record. On export, the export operator sets the field in the destination data file to the default value instead of rejecting an erroneous record. The property is mutually exclusive with link.
Example
The following defines the default value of field c as 127:
record (a:int32; b:int16; c:int8 {default = 127};)
default_date_format
This property provides support for international date components in date fields.
Applies to
record and date fields.
Syntax
record {default_date_format = String%macroString%macroString%macroString}
where %macro is a formatting macro such as %mmm for a 3-character English month. The section "date_format" lists the date formatting macros. The String components can be strings or ustrings.
Example
record ( { default_date_format = jeudi%ddaoût%yyyy };)
default_time_format
This property provides support for international time components in time fields.
Applies to
record and time fields.
Syntax
record {default_time_format = String%macroString%macroString%macroString}
where %macro is a formatting macro such as %hh for a two-digit hour. The section "time_format" lists the time formatting macros. The String components can be strings or ustrings.
Example
record ( { default_time_format = %hh&nnA.M.};)
delim
Specifies the trailing delimiter of all fields in the record.
Applies to
record; any field of the record.
Syntax
record { delim = delim_char } field_definition { delim = delim_char };
where delim_char can be one of the following:
v ws to have the import operator skip all standard whitespace characters (space, tab, and newline) trailing after a field.
v end to specify that the last field in the record is composed of all remaining bytes until the end of the record.
v none to specify that fields have no delimiter.
v null to specify that the delimiter is the null character.
v You can specify an ASCII or multi-byte Unicode character enclosed in single quotation marks. To specify multiple characters, use delim_string (see "delim_string" ).
Discussion
By default, the import operator assumes that all fields but the last are whitespace delimited. By default, the export operator inserts a space after every exported field except the last.
On export, you can use delim to specify the trailing delimiter that the export operator writes after each field in the destination data file. Specify:
v ws to have the export operator insert an ASCII space (0x20) after each field
v none to write a field with no delimiter and no length prefix.
You can use a backslash (\) as an escape character to specify special characters. For example, \t represents an ASCII tab delimiter character. You can use final_delim to specify a different delimiter for the last field of the record. See "final_delim" . Mutually exclusive with prefix, delim_string, and reference.
Examples
The following statement specifies that all fields have no delimiter:
record {delim = none} ( a:int32; b:string; c:int8; d:raw; )
The following statement specifies a comma as a delimiter at record-level but overrides this setting for field d, which is composed entirely of bytes until the end of the record:
record {delim = ','} ( a:int32; b:string; c:int8; d:raw {delim = end}; )
Note that in this example, the record uses the default record delimiter of a newline character. This means that: v On import, field d contains all the bytes to the newline character. v On export, a newline character is inserted at the end of each record.
delim_string
Like delim, but specifies a string of one or more ASCII or multi-byte Unicode characters forming a trailing delimiter.
Applies to
record; any field of the record.
Syntax
record {delim_string = ASCII_string | multi_byte_Unicode_string } field_definition { delim_string = ASCII_string | multi_byte_Unicode_string };
You can use a backslash (\) as an escape character to specify special characters within a string. For example, \t represents an ASCII tab delimiter character. Enclose the string in single quotation marks. This property is mutually exclusive with prefix, delim, and reference. Note: Even if you have specified the character set as EBCDIC, ASCII_string is always interpreted as ASCII character(s).
Examples
The following statement specifies that all fields are delimited by a comma followed by a space:
record {delim_string = ', '} ( a:int32; b:string; c:int8; d:raw; )
In the following example, the delimiter setting of one comma is overridden for field b, which will be delimited by a comma followed by an ASCII space character, and for field d, which will be delimited by the end of the record:
record {delim = ','} ( a:int32; b:string {delim_string = ', '}; c:int8; d:raw {delim = end}; )
Note that in this example, the record uses the default record delimiter of a newline character. This means that: v On import field d contains all the bytes to the newline character. v On export a newline character is inserted at the end of each record.
drop
Specifies a field to be dropped on import and not stored in the InfoSphere DataStage data set.
Applies to
Fields of all data types. Cannot apply to record, subrec, or tagged.
Syntax
field_definition { drop };
You can use this property when you must fully define the layout of your imported data in order to specify field locations and boundaries but do not want to use all the data in the InfoSphere DataStage job.
Restrictions
This property is valid for import only and is ignored on export. This property is mutually exclusive with link.
See also
"padchar" .
Example
In this example, the variable-length string field, b, is skipped as the record is imported.
record (a:int32; b:string {drop}; c:int16;)
In the following example, all fields are written as strings to the InfoSphere DataStage data set on import and are delimited by a comma except the last field; records are delimited by the default newline character. The last four fields are dropped from every record on import, and generated on export. This technique is useful when fields of the source data are not processed in an InfoSphere DataStage job but place holders for those fields must be retained in the resultant file containing the exported results. In addition, all bytes of the generated fields in the exported data file will be filled with ASCII spaces, as defined by the padchar property:
record { delim = ',' } (
  first_name :string[20];
  middle_init :string[1];
  last_name :string[40];
  street :string[40];
  apt_num :string[4];
  city :string[30];
  state :string[2];
  prev_street :string[40] {drop, generate, padchar = ' '};
  prev_apt_num :string[4] {drop, generate, padchar = ' '};
  prev_city :string[30] {drop, generate, padchar = ' '};
  prev_state :string[2] {delim = end, drop, generate, padchar = ' '};
)
ebcdic
Specifies that the EBCDIC character set is used by text-format fields of imported or exported data. Setting this property is equivalent to specifying the EBCDIC equivalent of the US_ASCII character set.
Applies to
All data types except raw and ustring; record, subrec, or tagged if it contains at least one field of this type. For ustring, the same functionality is available through the charset property.
Syntax
record { ebcdic } field_definition { ebcdic };
This property is mutually exclusive with ascii. InfoSphere DataStage's default character set is ASCII-formatted. (InfoSphere DataStage supplies lookup tables for converting between ASCII and EBCDIC. See "ASCII and EBCDIC Conversion Tables" for more information.)
See also
"ascii" ."
export_ebcdic_as_ascii
Export a string formatted in the EBCDIC character set as an ASCII string.
Applies to
Fields of the string data type and records; record, subrec, or tagged if it contains at least one field of this type. This property does not apply to ustring; you can obtain the same functionality using the charset property.
Syntax
record { export_ebcdic_as_ascii } field_definition { export_ebcdic_as_ascii };
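As a sketch (field name illustrative), mirroring the import_ascii_as_ebcdic example later in this chapter:
record (x:string[20] {export_ebcdic_as_ascii};)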
fill
Specifies the byte value used to fill the gaps between fields of an exported record caused by field positioning.
Applies to
record; cannot be specified for individual fields.
Syntax
record {fill [ = fill_value ] }
where fill_value specifies a byte value that fills in gaps. By default the fill value is 0. The fill_value can also be one of these:
v a character or string in single quotation marks
v an integer between 0 and 255.
Restriction
You cannot override the record-level fill property with an individual field definition.
Examples
In the following example, the two-byte gap between fields a and b are to be filled with ASCII spaces:
record {fill = ' '} (a:int32; b:int16 {skip = 2};)
In the following example, the gaps are to be filled with the value 127:
record { fill = 127} (a:int32; b:int16 {skip = 2};)
final_delim
Specifies a delimiter for the last field. If specified, final_delim precedes the record_delim if both are defined.
Applies To
The last field of a record. When the last field in a record is a subrec, a space is added after the subrec instead of your non-space delim_value.
Syntax
record {final_delim = delim_value}
where delim_value is one of the following:
v ws (a white space)
v end (end of record, the default)
v none (no delimiter, field length is used)
v null (0x00 character)
v 'a' (specified ASCII or multi-byte Unicode character, enclosed in single quotation marks)
Example
In this example, commas delimit all fields except the last. Since end is specified as the final_delim, the record delimiter serves as the delimiter of the final field. This is the newline character. Note that it is not specified as the record_delim, because newline-delimited records are the default.
record {delim = ',', final_delim = end} (
  checkingAccountStatus :int32;
  durationOfAccount :sfloat;
  creditHistory :int32;
  purposeOfLoan :int32;
  creditAmount :sfloat;
  savingsAccountAmount :int32;
  yearsPresentlyEmployed :int32;
  installmentRate :sfloat;
)
By default, on export a space is now inserted after every field except the last field in the record. Prior to this release, a space was inserted after every field, including the last field; or, when the delim property was set, its value was used instead of the final_delim value.
Now when you specify the final_delim property for records that have a tagged or subrec field as the last field, your specification is correctly applied unless the subrec is a vector. By default, a space is added to the last field when it is a subrec vector. You can set the APT_PREVIOUS_FINAL_DELIM_COMPATIBLE environment variable to obtain the final_delim behavior prior to this release.
final_delim_string
Like final_delim, but specifies a string of one or more ASCII or multi-byte Unicode characters that are the delimiter string of the last field of the record. The final_delim_string property precedes the record_delim property, if both are used.
Applies to
The last field of a record.
Syntax
record {final_delim_string = ASCII_string | multi_byte_Unicode_string }
Example
For example, the following statement specifies that all fields are delimited by a comma, except the final field, which is delimited by a comma followed by an ASCII space character:
record {delim_string = ',', final_delim_string = ', '} ( a:int32; b:string; c:int8; d:raw; )
fix_zero
Treat a packed decimal field containing all zeros (normally illegal) as a valid representation of zero.
Applies to
Fields of the packed decimal data type on import; all decimal fields on export (exported decimals are always packed); record, subrec, or tagged if it contains at least one field of these types.
Syntax
field_definition { fix_zero };
Note: Omitting fix_zero causes the job to generate an error if it encounters a decimal containing all zeros.
This property overrides the nocheck option to packed. See also "packed" and "nofix_zero" . The latter option restores default behavior.
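As a field-level sketch (field name and precision illustrative) for a packed decimal field:
record (price:decimal[8,2] {packed, fix_zero};)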
generate
On export, creates a field and sets it to the default value.
Applies to
Fields of all data types. Cannot apply to record, subrec, or tagged.
Syntax
field_definition { generate, default = default_value };
where:
v default is the ordinary default property (see "default" ) and not a sub-option of generate
v default_value is the optional default value of the field that is generated; if you do not specify a default_value, InfoSphere DataStage writes the default value associated with that data type.
Discussion
You can specify both the drop property and the generate property in the same property list. When the schema is used for import it drops the field, and when it is used for export, it generates the field. This type of statement is useful if the field is not important when the data set is processed but you must maintain a place holder for the dropped field in the exported data.
Restriction
This property is valid for export only and is ignored on import.
Examples
The following statement creates a new field in an exported data set:
record ( a:int32; b:int16 {generate, default=0}; c:int8; )
The following statement causes field b to be:
v dropped on import
v generated as a 16-bit integer field, which is initialized to 0, on export.
record ( a:int32; b:int16 {drop, generate, default =0}; c:int8;)
import_ascii_as_ebcdic
Translate a string field from ASCII to EBCDIC as part of an import operation. This property makes it possible to import EBCDIC data into InfoSphere DataStage from ASCII text.
Applies to
Fields of the string data type; record, subrec, or tagged if it contains at least one field of this type. This property does not apply to ustring; you can obtain the same functionality using the charset property.
Syntax
field_definition { import_ascii_as_ebcdic };
Restriction
Use only in import operations. On export, you can use export_ebcdic_as_ascii to export the EBCDIC field back to ASCII. See "export_ebcdic_as_ascii" .
Example
In the following example, the string field x contains an ASCII string that is converted to EBCDIC on import:
record (x:string[20] {import_ascii_as_ebcdic};)
in_format
On import, perform non-default conversion of an input string to an integer or floating-point data. This property specifies a C-language format string used for import. See also "out_format" .
Applies to
Imported data of integer or floating-point type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
field_definition { in_format = sscanf_string };
where sscanf_string is a control string in single quotation marks interpreted by the C scanf() function to convert character data to numeric data. See the appropriate C language documentation for details of this formatting string.
Discussion
A UNIX data file can represent numeric values as strings. By default, on import InfoSphere DataStage invokes the C strtol(), strtoll(), strtoul(), strtoull(), or strtod() function to convert a string to a numeric field. If these functions do not output data in a satisfactory format, you can specify the in_format property. When you do, InfoSphere DataStage invokes the sscanf() function to convert the string to numeric data. You pass formatting arguments to the function. When strings are converted to 8-, 16-, or 32-bit signed integers, the sscanf_string must specify options as if the generated field were a 32-bit signed integer. InfoSphere DataStage converts the 32-bit integer to the proper destination format.
Restrictions
The export operator ignores this property; use it only for import operations. This property is mutually exclusive with binary.
Example
The following statement assumes that the data to be converted is represented as a string in the input flat file. The data is imported as field a, a 32-bit integer formatted as an 8-byte hexadecimal string:
record (a:int32 {in_format = '%x', width=8};)
intact
Define a partial record schema.
Applies to
record
Syntax
record {intact[= rName ] ... } ( field_definitions ; )
See also
"check_intact" .
Example
For example, the following statement uses intact to define a partial record schema:
record {intact=rName, record_length=82, record_delim_string='\r\n'} ()
In this example, the record schema defines an 82-byte record (80 bytes of data, plus one byte each for the carriage return and line feed characters delimiting the record). On import, the two bytes for the carriage return and line feed characters are removed from the record and not saved; therefore, each record of the data set contains only the 80 bytes of data.
julian
Specifies that the imported or exported field represents the date as a binary numeric value denoting the Julian day.
Applies to
Fields of the date data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { binary, julian } field_definition { binary, julian };
Note: The imported or exported value is always represented as binary data. A Julian day specifies the date as the number of days from 4713 BCE January 1, 12:00 hours (noon) GMT. For example, January 1, 1998 is Julian day count 2,450,815. In this context, binary has the same meaning as julian. This property is mutually exclusive with days_since, date_format, and text.
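As a sketch (field name illustrative) of a date field stored as a binary Julian day count:
record (d:date {binary, julian};)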
link
Specifies that a field holds either the length of another (variable-length) field in the record or the length of the tag value of a tagged.
Applies to
Numeric or string fields; cannot apply to a record, subrec, or tagged.
Syntax
field_definition { link };
Discussion
A variable-length field must specify the number of elements it contains by means either of a prefix of the vector (see "vector_prefix" ) or a link to another field.
v On import, the link field is dropped and does not appear in the schema of the imported data set.
v On export, the link field is not exported from the data set, but is generated from the length of the field referencing it and then exported.
The data type of a field defined by the link import/export property must:
v Be a uint32 or data type that can be converted to uint32; see "Modify Operator" for information on data type conversions.
v Have a fixed-length external format on export.
v Not be a vector.
This property is mutually exclusive with drop.
Examples
In this example, field a contains the element length of field c, a variable-length vector field.
record (a:uint32 {link}; b:int16; c[]:int16 {reference = a};)
In this example, field a contains the length of field c, a variable-length string field.
record (a:uint32 {link}; b:int16; c:string {reference = a};)
little_endian
Specify the byte ordering of multi-byte data types as little-endian. The default mode is native-endian.
Applies to
Fields of the integer, date, time, or timestamp data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { little_endian } field_definition { little_endian };
This property is mutually exclusive with big_endian and native_endian. Note: InfoSphere DataStage ignores the endian property if a data type is not formatted as binary.
Examples
The following specification defines a schema using a little-endian representation:
record {little_endian, binary, delim = none} ( a:int32; b:int16; c:int8; )
The following declaration overrides the big_endian record-level property in a definition of field b:
record {big_endian, binary, delim = none} ( a:int32; b:int16 {little_endian}; c:int8; )
max_width
Specifies the maximum number of 8-bit bytes of an imported or exported text-format field. Base your width specification on the value of your -impexp_charset option setting. If it is a fixed-width character set, you can calculate the maximum number of bytes exactly. If it is a variable-length encoding, calculate an adequate maximum width for your fields.
Applies to
Fields of all data types except date, time, timestamp, and raw; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { max_width = n } field_definition { max_width = n };
where n is the maximum number of bytes in the field; you can specify a maximum width of 255 bytes. This property is useful for a numeric field stored in the source or destination file in a text representation. If you specify neither width nor max_width, numeric fields exported as text have the following number of bytes as their maximum width:
v 8-bit signed or unsigned integers: 4 bytes
v 16-bit signed or unsigned integers: 6 bytes
v 32-bit signed or unsigned integers: 11 bytes
v 64-bit signed or unsigned integers: 21 bytes
v single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
v double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)
Restriction
On export, if you specify the max_width property with a dfloat field, the max_width must be at least eight characters long.
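As a sketch (field name illustrative) that caps a 32-bit integer exported as text at its default maximum width of 11 bytes:
record (a:int32 {max_width = 11};)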
midnight_seconds
Represent the imported or exported time field as a binary 32-bit integer containing the number of seconds elapsed from the previous midnight.
Applies to
The time data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { binary, midnight_seconds }; field_definition { binary, midnight_seconds };
Note: The imported or exported integer is always stored as binary data. This property is mutually exclusive with time_format and text.
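As a field-level sketch (field name illustrative):
record (t:time {binary, midnight_seconds};)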
native_endian
Specify the byte ordering of multi-byte data types as native-endian. This is the default mode for import/export operations.
Applies to
Fields of the integer, date, time, or timestamp data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { native_endian } field_definition { native_endian };
where native_endian specifies that all multi-byte data types are formatted as defined by the native format of the machine. This property is mutually exclusive with big_endian and little_endian. Note: InfoSphere DataStage ignores the endian property of a data type that is not formatted as binary.
Examples
The following defines a schema using a native-endian representation:
record {native_endian, binary, delim = none} ( a:int32; b:int16; c:int8; )
nofix_zero
Generate an error when a packed decimal field containing all zeros is encountered. This is the default behavior of import and export.
Applies to
Import behavior and export behavior differ:
v Fields of the packed decimal data type on import
v All decimal fields on export (exported decimals are always packed); record, subrec, or tagged if it contains at least one field of this type.
Syntax
field_definition { nofix_zero };
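As a field-level sketch (field name and precision illustrative) that restores the default error behavior for a packed decimal field:
record (amount:decimal[10,2] {packed, nofix_zero};)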
null_field
Specifies the value representing null.
Applies to
Fields whose data type is nullable; cannot apply to record, subrec, or tagged.
Syntax
field_definition { null_field = byte_value | multi_byte_Unicode_value};
where byte_value or multi_byte_Unicode_value is:
v On import, the value given to a field containing a null.
v On export, the value given to an exported field if the source field is set to null.
The byte_value or multi_byte_Unicode_value can take one of these forms:
v A number or string that defines the value to be written if the field contains a null. Enclose the string or ustring in single quotation marks.
v A standard C-style string literal escape character. For example, you can represent a byte value by \ooo, where each o is an octal digit 0 - 7 and the first o is < 4, or by \xhh, where each h is a hexadecimal digit 0 - F. You must use this form to encode non-printable byte values.
When you are specifying the null_field property at the record level, it is important to also specify the field-level null_field property to any nullable fields not covered by the record-level property. For example:
record { null_field = 'aaaa' } (
    field1:nullable int8 { null_field = -127 };
    field2:nullable string[4];
    field3:nullable string;
    field4:nullable ustring[8] { null_field = };
)
The record-level property above applies only to variable-length strings and fixed-length strings of four characters, field2 and field3; field1 and field4 are given field-level null_field properties because they are not covered by the record property. This property is mutually exclusive with null_length and actual_length. The null_field parameter can be used for imported or exported data. For a fixed-width data representation, you can use padchar to specify a repeated trailing character if byte_value is shorter than the fixed width of the field.
null_length
Specifies the value of the length prefix of a variable-length field that contains a null.
Applies to
Fields whose data type is nullable; cannot apply to record, subrec, or tagged.
Syntax
field_definition { null_length= length }
where length is the length in bytes of a variable-length field that contains a null. When a variable-length field is imported, a length of null_length in the source field indicates that it contains a null. When a variable-length field is exported, the export operator writes a length value of null_length if it contains a null. This property is mutually exclusive with null_field.
Restriction
Specifying null_length with a non-nullable InfoSphere DataStage field generates an error.
See also
"actual_length"
Example
For example, the following schema defines a nullable, variable-length string field, prefixed by a two-byte length:
record {prefix = 2} ( a:nullable string {null_length = 255}; )
Import and export results differ:
v On import, the imported field is assumed to be zero length and to contain a null if the length prefix contains 255; field a of the imported data is set to null.
v On export, the length prefix is set to 255 and the length of the actual destination is zero if field a contains null.
out_format
On export, perform non-default conversion of an integer or floating-point data type to a string. This property specifies a C-language format string used for export of a text field. See also "in_format" .
Applies to
Integer or floating-point data to be exported to numeric strings.
Syntax
field_definition { out_format = sprintf_string };
where sprintf_string is a control string in single quotation marks containing formatting specification for sprintf() to convert floating-point or integer data to strings. See the appropriate C language documentation for details of this formatting string. When 8-, 16-, or 32-bit signed integers are converted to strings, the sprintf_string must specify options as if the source field were a 32-bit integer. InfoSphere DataStage converts the source field to a 32-bit signed integer before passing the value to sprintf(). For 8-, 16-, and 32-bit unsigned integers, specify options as if the source field were a 32-bit unsigned integer.
Discussion
An InfoSphere DataStage data file can represent numeric data in integer and floating-point data types. By default, on export InfoSphere DataStage invokes the C sprintf() function to convert a numeric field formatted as either integer or floating point data to a string. If this function does not output data in a satisfactory format, you can specify the out_format property. When you do, InfoSphere DataStage invokes the sprintf() function to convert the numeric data to a string. You pass formatting arguments to the function.
Restrictions
The import operator ignores this property; use it only for export operations. The property is mutually exclusive with binary.
Example
The following statement defines an exported record containing two integer values written to the exported data file as a string:
record ( a:int32 {out_format = '%x', width = 8}; b:int16 {out_format = '%x', width = 4}; )
overpunch
Specifies that the imported or exported decimal field has a leading or ending byte that contains a character which specifies both the numeric value of that byte and whether the number as a whole is negatively or positively signed. This representation eliminates the need to add a minus or plus sign to the beginning or end of a number. All the digits besides the overpunched byte represent normal numbers.
Use one of these formats:
v To indicate that the overpunched value is in the leading byte, use this syntax: {overpunch}
For example, in the overpunched number B567, B indicates that the leading byte has a value of 2 and that the decimal is positively signed. It is imported as 2567.
v To indicate that the overpunched value is in the last byte, use this syntax: {trailing, overpunch}
For example, in the overpunched number 567K, K indicates that the last byte has a value of 2 and that the number is negatively signed. It is imported as -5672.
Applies to
Decimal fields.
Syntax
record { { overpunch | trailing, overpunch} option_1 .... option_n } field_definition { { overpunch | trailing, overpunch} option_1 ,... option_n };
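A brief sketch, with a hypothetical field name, of a decimal whose sign is overpunched in the trailing byte:
record ( amount:decimal[5,2] {trailing, overpunch}; )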
packed
Specifies that the imported or exported field contains a packed decimal.
Applies to
Fields of the decimal, string, and ustring data types; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { packed option_1 .... option_n} field_definition { packed option_1 ,...option_n};
where option_1 ... option_n is a list of one or more options; separate options with a comma if there are more than one. The options are:
v check (default): perform the InfoSphere DataStage data verification process on import/export. Note that this property is ignored if you specify fix_zero.
v nocheck: bypass the InfoSphere DataStage data verification process on import/export. Note that this property is ignored if you specify fix_zero.
The options check and nocheck are mutually exclusive.
v signed (default): use the sign of the source decimal on import or export.
v unsigned: generate a sign nibble of 0xf, meaning a positive value, regardless of the imported or exported field's actual sign. This format corresponds to the COBOL PICTURE 999 format (as opposed to S999).
Discussion
The precision and scale of either the source decimal on import or the destination decimal on export defaults to the precision and scale of the InfoSphere DataStage decimal field. Use the precision and scale properties to override these defaults. For example, the following schema specifies the InfoSphere DataStage decimal field has a precision of 5 and a scale of 2:
record {packed} ( a:decimal[5,2]; )
Import and export results differ:
v On import, field a is imported from a packed decimal representation three bytes long.
v On export, the field is written to three bytes in the destination.
padchar
Specifies a pad character used when InfoSphere DataStage strings or numeric values are exported to an external string representation.
Applies to
string, ustring, and numeric data types on export; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { padchar = 'char(s)' }
field_definition { padchar = 'char(s)' }
where char(s) is one or more pad characters, single byte for string fields and multi-byte Unicode for ustring fields. You can specify null to set the pad character to 0x00; the default pad character is 0x20 (ASCII space). Enclose the pad character in single quotation marks if it is not the value null. The pad character is used when the external string representation is larger than required to hold the exported field. In this case, the external string is filled with the pad character to its full length.
Restrictions
This property is ignored on import.
Example
In the following example, if string field a is less than 40 bytes long, the destination string is padded with spaces to its full 40-byte length.
record (a:string {width = 40, padchar = ' '};)
Notes
InfoSphere DataStage fixed-length strings also have a padchar property, which is part of a schema rather than an import/export property, as in the following example. Here the exported fixed-length string is also padded with spaces to the length of the external field:
record (a:string[20, padchar = ' '] {width = 40};)
You can globally override the default pad character using the InfoSphere DataStage environment variable APT_STRING_PADCHAR. The syntax for setting this environment variable is shown below:
export APT_STRING_PADCHAR=character      # ksh
setenv APT_STRING_PADCHAR character      # csh
where character is the default pad character enclosed in single quotation marks.
position
Specifies the starting position of a field in the imported source record or exported destination record. The starting position can be either an absolute byte offset from the first record position (0) or the starting position of another field.
Applies to
Fields of all data types; cannot apply to record, subrec, or tagged.
Syntax
field_definition { position = byte_offset | field_name };
where:
v byte_offset is an integer value that is greater than or equal to 0 and less than the record length and that indicates the absolute byte offset of the field from position 0 in the record.
v field_name specifies the name of another field in the record at whose beginning boundary the defined field also starts, as in the example below.
Discussion
Specifies the byte offset of a field in the imported source record or exported destination record, where the offset of the first byte is 0. If you omit this property, a field starts immediately after the preceding field. Note that a field can start at a position preceding the end of the previous field.
Examples
For example, the following defines a schema using this property:
record {binary, delim = none} (a:int32; b:int16 {position = 6};)
Import and export results differ:
v On import, field b starts at absolute byte offset 6 in the source record, that is, there is a 2-byte gap between fields a and b. Note that numeric values are represented in a binary format in this example. The default numeric format is text.
v On export, the export operator skips two bytes after field a, then writes field b. By default on export, any skipped bytes are set to zero in the destination record. You can use the record-level fill property ("fill") to specify a value for the skipped bytes.
You can specify a field_name as a position, as in the following example:
record {binary, delim = none} ( a:string {delim = ws}; b:int16; c:raw[2] {position = b}; )
Field c starts at field b. Import behavior and export behavior differ:
v On import, the schema creates two fields from field b of the imported record, interpreting the same external value using two different field types.
v On export, the schema first writes the contents of field b to destination field b and then to destination field c.
precision
Import behavior and export behavior differ:
v On import, specifies the precision of a source packed decimal.
v On export, specifies the precision of a destination string when a decimal is exported in text format.
Applies to
Imported strings representing packed decimals; exported packed decimals to be written as strings; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { precision= p } field_definition { precision= p };
where p is the precision of the source packed decimal on import and the precision of the destination string on export; p has no limit.
Discussion
When a source decimal is exported to a string representation, the export operator uses the precision and scale defined for the source decimal field to determine the length of the destination string. The precision and scale properties override this default ("precision" and "scale" ). When they are defined, the export operator truncates or pads the source decimal to fit the size of the destination string. If you include the width property ("width" ), the export operator truncates or pads the source decimal to fit the size specified by width.
Restriction
The precision property is ignored on export if you also specify text.
Example
The following example shows a schema used to import a source field with the same precision and scale as the destination decimal:
record ( a:decimal[6,2]; )
InfoSphere DataStage imports the source field to a decimal representation with a 6-digit precision. The following example shows a schema that overrides the default to import a source field with a 4-digit precision:
record ( a:decimal[6,2] {precision = 4}; )
prefix
Specifies that each imported/exported field in the data file is prefixed by 1, 2, or 4 bytes containing, as a binary value, either the field's length or the tag value for a tagged subrecord.
Applies to
All fields and record.
Syntax
record { prefix = prefix }
field_definition { prefix = prefix };
where prefix is the integer 1, 2, or 4, which denotes a 1-, 2-, or 4-byte prefix containing the field length or a character enclosed in single quotes.
Discussion
You can use this option with variable-length fields. Variable-length fields can be either delimited by a character or preceded by a 1-, 2-, or 4-byte prefix containing the field length.
Import behavior and export behavior differ:
v On import, the import operator reads the length prefix but does not include the prefix as a separate field in the imported data set.
v On export, the export operator inserts the prefix before each field.
This property is mutually exclusive with delim, delim_string, quote, final_delim, and reference.
Example
In the following example, fields a and b are both variable-length string fields preceded by a 2-byte string length:
record {prefix = 2} ( a:string; b:string; c:int32; )
In the following example, the 2-byte prefix is overridden for field b, where the prefix is a single byte:
record {prefix = 2} ( a:string; b:string {prefix = 1}; )
For tagged subrecords, the tag field might be either a prefix of the tagged aggregate or another field in the record associated with the tagged aggregate by the link property. Shown below is an example in which the tagged aggregate is preceded by a one-byte unsigned integer containing the tag:
record ( ... tagField:tagged {prefix=1} ( aField:string; bField:int32; cField:sfloat; ); )
print_field
For debugging purposes only; causes the import operator to display each value imported for the field.
Applies to
Fields of all data types and record.
Syntax
record { print_field } field_definition { print_field };
Discussion
This property causes import to write out a message for either selected imported fields or all imported fields, in the form:
Importing N: D
where:
v N is the field name.
v D is the imported data of the field. Non-printable characters contained in D are prefixed with an escape character and written as C string literals; if the field contains binary data, it is output in octal format.
Restrictions
This property is ignored on export.
Example
For example, the following schema uses print_field:
record {print_field} (a:string; b:int32;)
By default, imported numeric fields represent data as text. In this case, the import operator issues the following message:
Importing a: "the string"
Importing b: "4660"
The following schema specifies that the numeric data is represented in binary form:
record {binary, print_field} (a:string; b:int32;)
In this case, the import operator prints the binary data in an octal format as shown below:
Importing a: "a string"
Importing b: "\000\000\022\064"
quote
Specifies that a field is enclosed in single quotes, double quotes, or another ASCII or multi-byte Unicode character or pair of ASCII or multi-byte Unicode characters.
Applies to
Fields of any data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { quote = quotechar | quotechars } field_definition { quote = quotechar | quotechars };
where quotechar is one of the following: single, double, an ASCII or Unicode character, and quotechars is a pair of ASCII or multi-byte Unicode characters, for example, '[ ]'. Enclose quotechar or quotechars in single quotation marks. This property is mutually exclusive with prefix and reference.
Discussion
This property is useful for variable-length fields contained either in quotes or in other characters marking the start and end of a field.
v On import, the leading quote character is ignored and all bytes up to but not including the trailing quote character are imported.
v On export, the export operator inserts the leading quote character, the data, and a trailing quote character.
Quote characters are not counted as part of a field's length.
Example
The following example specifies that the data imported into the variable-length string field b is contained in double quotes:
record ( a:int32; b:string {quote = double}; c:int8; d:raw; )
record_delim
Specifies a single ASCII or multi-byte Unicode character to delimit a record.
Applies to
record; cannot be a field property.
Syntax
record {record_delim [= delim_char ]}
where delim_char can be a newline character, a null, or one ASCII or multi-byte Unicode character. If no argument is specified, the default is a newline character. This property is mutually exclusive with record_delim_string, record_prefix, and record_format.
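As an illustrative sketch (the delimiter character and fields are chosen arbitrarily), records delimited by a pipe character instead of the default newline could be declared as:
record {record_delim = '|'} ( a:int32; b:string; )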
record_delim_string
Specifies an ASCII or multi-byte Unicode string to delimit a record.
Applies to
record; cannot be a field property.
Syntax
record { record_delim_string = ASCII_string | multi_byte_Unicode_string }
Restrictions
You cannot specify special characters by starting with a backslash escape character. For example, specifying \t, which represents an ASCII tab delimiter character, generates an error. This property is mutually exclusive with record_delim, record_prefix, and record_format.
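A minimal sketch, with an arbitrarily chosen delimiter string and hypothetical fields:
record {record_delim_string = 'END'} ( a:int32; b:string; )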
record_format
Specifies that data consists of variable-length blocked records or implicit records.
Applies to
record; cannot be a field property.
Syntax
record { record_format = { type = type [, format = format ]}}
where type is either implicit or varying.
If you choose the implicit property, data is imported or exported as a stream with no explicit record boundaries. You cannot use the property delim = end with this format. On import, the import operator converts the input into records using schema information passed to it. Field boundaries are implied by the record schema passed to the operator. You cannot save rejected records with this record format. On export, the records are written with no length specifications or delimiters marking record boundaries. The end of the record is inferred when all of the fields defined by the schema have been parsed.
The varying property allows you to specify one of the following IBM blocked or spanned formats: V, VB, VS, VBS, or VR. Data is imported using that format.
This property is mutually exclusive with record_length, record_delim, record_delim_string, and record_prefix.
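As an illustrative sketch (the format and fields are hypothetical), a schema that imports IBM variable-blocked (VB) records might be written as:
record {record_format = {type = varying, format = VB}} ( a:int32; b:string[10]; )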
record_length
Import or export fixed-length records.
Applies to
record; cannot be a field property.
Syntax
record { record_length = fixed | nbytes }
where:
v fixed specifies fixed-length records; the record schema must contain only fixed-length elements so that InfoSphere DataStage can calculate the record length.
v nbytes explicitly specifies the record length in bytes if the record contains variable-length elements.
On export, the export operator pads the records to the specified length with either zeros or the fill character if one has been specified.
This property is mutually exclusive with record_format.
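A brief sketch, with hypothetical fields, of a fixed-length record whose length InfoSphere DataStage calculates from its fixed-length elements:
record {record_length = fixed, binary, delim = none} ( a:int32; b:int16; c:string[8]; )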
record_prefix
Specifies that a variable-length record is prefixed by a 1-, 2-, or 4-byte length prefix.
Applies to
record; cannot apply to fields.
Syntax
record {record_prefix [ = prefix ]}
where prefix is 1, 2, or 4. If you do not specify a value for prefix, the variable defaults to 1. This property is mutually exclusive with record_delim, record_delim_string, and record_format.
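A minimal sketch, with hypothetical fields, of records preceded by a 2-byte record length prefix:
record {record_prefix = 2} ( a:int32; b:string {prefix = 1}; )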
reference
Points to a link field containing the length of an imported/exported field.
Applies to
Variable-length vectors of all data types; cannot apply to record, subrec, or tagged.
Syntax
field_definition { reference = link_field};
where link_field is the name of a field of the same record that holds the length of the field defined by field_definition. Variable-length fields can specify the number of elements they contain by means of a link to another field that contains their length or the tag of a tagged subrecord. This property is mutually exclusive with prefix, delim_string, quote, and delim.
Example
The following statement specifies that the link field a contains the length of the variable-length string field c:
record {delim = none, binary} (a:int32 {link}; b:int16; c:string {reference = a};)
round
Round decimals on import or export.
Applies to
Fields of the decimal data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { round = rounding_type } field_definition { round = rounding_type }
where rounding_type can be one of the following:
v ceil: Round the source field toward positive infinity. This mode corresponds to the IEEE 754 Round Up mode. Examples: 1.4 -> 2, -1.6 -> -1
v floor: Round the source field toward negative infinity. This mode corresponds to the IEEE 754 Round Down mode. Examples: 1.6 -> 1, -1.4 -> -2
v round_inf: Round the source field toward the nearest representable value, breaking ties by rounding toward positive infinity or negative infinity. This mode corresponds to the COBOL ROUNDED mode. Examples: 1.4 -> 1, 1.5 -> 2, -1.4 -> -1, -1.5 -> -2
v trunc_zero (default): Truncate the source field toward zero. Discard fractional digits to the right of the right-most fractional digit supported in the destination, regardless of sign. For example, if the destination is an integer, all fractional digits are truncated. If the destination is another decimal with a smaller scale, truncate to the scale size of the destination decimal. This mode corresponds to the COBOL INTEGER-PART function. Examples: 1.6 -> 1, -1.6 -> -1
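For illustration (the field name is hypothetical), a decimal that is rounded to the nearest representable value on import or export could be declared as:
record ( a:decimal[6,2] {round = round_inf}; )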
scale
Specifies the scale of a packed decimal.
Applies to
Imported strings representing packed decimals; exported packed decimals to be written as strings; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { scale = s } field_definition { scale = s };
Discussion
By default, the import operator uses the scale defined for the InfoSphere DataStage decimal field to import the source field. You can change this. On import, the scale property specifies the scale of the source packed decimal. By default, when the export operator exports a source decimal to a string representation, it uses the precision and scale defined for the source decimal field to determine the length of the destination string. You can override the default by means of the precision and scale properties. When you do, the export operator truncates or pads the source decimal to fit the size of the destination string. If you include the width property, the export operator truncates or pads the source decimal to fit the size specified by width. See "width" .
Restrictions
The scale property is ignored on export if you also specify text. The value of scale must be less than the precision and greater than 0. The precision is specified by the precision property. See "precision" .
Example
The following example is a schema used to import a source field with the same precision and scale as the destination decimal:
record ( a:decimal[6,2]; )
InfoSphere DataStage imports the source field to a decimal representation with a 2-digit scale. The following schema overrides this default to import a source field with a 4-digit precision and a 1-digit scale:
record ( a:decimal[6,2] {precision = 4, scale = 1}; )
separate
Specifies that the imported or exported field contains an unpacked decimal with a separate sign byte.
Applies to
Fields of the decimal data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
field_definition { separate[, option] };
where option can be one of these:
v leading (default) - the sign is contained in the first byte
v trailing - the sign is contained in the last byte
Discussion
By default, the sign of an unpacked decimal is contained in the first byte of the imported string. The following table defines the legal values for the sign byte for both ASCII and EBCDIC:
Sign        ASCII               EBCDIC
positive    0x2B (ASCII "+")    0x43 (EBCDIC "+")
negative    0x2D (ASCII "-")    0x60 (EBCDIC "-")
Example
For example, the following schema specifies that the InfoSphere DataStage decimal field contains a leading sign and has a precision of 5 and a scale of 2:
record ( a:decimal[5,2] {separate}; )
Import and export results differ:
v On import, field a is imported from a decimal representation six bytes long with the sign in the first byte.
v On export, the field is written to six bytes in the destination: the five contained by the decimal and one byte to contain the sign.
skip
Skip a number of bytes from the end of the previous imported/exported field to the beginning of the field.
Applies to
Fields of all data types. Cannot apply to record, subrec, or tagged .
Syntax
field_definition { skip = nbytes };
where nbytes is the number of bytes to skip after the previous record. The value of nbytes can be negative but the absolute record offset computed from this and the previous field position must always be greater than or equal to 0. On export, any skipped bytes are set to zero by default. The record-level fill property specifies an explicit value for the skipped bytes.
Example
For example, the following statement defines a record schema for the import or export operator:
record (a:int32 {position = 4}; b:int16 {skip = 2};)
Import and export results differ:
v On import, this schema creates each record from an input data file by importing a 32-bit integer, beginning at byte 4 in each input record, skipping the next 2 bytes of the data file, and importing the next two bytes as a 16-bit integer.
v On export, the export operator fills in the first four bytes with zeros, writes out the 32-bit integer, fills the next two bytes with zeroes, and writes the 16-bit integer.
tagcase
Explicitly specifies the tag value corresponding to a subfield in a tagged. By default the fields are numbered 0 to N-1, where N is the number of fields.
Applies to
Fields within tagged.
Syntax
field_definition { tagcase = n }
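A brief sketch, with hypothetical field names and tag values, that assigns explicit tag values to the subfields of a tagged aggregate:
record ( tagField:tagged {prefix = 1} ( aField:string {tagcase = 1}; bField:int32 {tagcase = 2}; cField:sfloat {tagcase = 3}; ); )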
text
Specifies the data representation type of a field as being text rather than binary. Data is formatted as text by default.
Applies to
Fields of all data types except ustring; record. For ustring, the same functionality is available through the charset property.
Syntax
record { text }
field_definition { text [, ascii | ebcdic] [, import_ebcdic_as_ascii] [, export_ebcdic_as_ascii] }
Discussion
Data is formatted as text by default, as follows:
v For the date data type: text specifies that the source data on import, or the destination data on export, contains a text-based date in the form %yyyy-%mm-%dd or uformat. See "default_date_format" for a description of uformat.
v For the decimal data type: an imported or exported field represents a decimal in a string format with a leading space or '-' followed by decimal digits with an embedded decimal point if the scale is not zero. For import, the source string format is [+ | -]ddd[.ddd]. For export, the destination string format is [+ | -]ddd.[ddd]. Any precision and scale arguments are ignored on export if you specify text.
v For numeric fields (int8, int16, int32, uint8, uint16, uint32, sfloat, and dfloat): the import and export operators assume by default that numeric fields are represented as text; the import operator converts the text representation to a numeric format by means of C functions. (See "c_format", "in_format", and "out_format".)
v For the time data type: text specifies that the imported or exported field represents time in the text-based form %hh:%nn:%ss or uformat. See "default_time_format" for a description of uformat.
v For the timestamp data type: text specifies a text-based timestamp in the form %yyyy-%mm-%dd %hh:%nn:%ss or uformat, which is default_date_format and default_time_format concatenated. Refer to "default_date_format" and "default_time_format".
Example
If you specify the following record schema:
record {text} (a:decimal[5,2];)
import and export results are as follows:
v On import, the source decimal is read from a 7-byte string (five bytes for the precision, one for the sign and one for the decimal point).
v On export, the field is written out as a 7-byte string.
time_format
Specifies the format of an imported or exported field representing a time as a string or ustring.
Applies to
Fields of the time data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
field_definition { time_format = time_format | uformat};
uformat is described in "default_time_format" . The possible components of the time_format string are given in the following table:
Table 76. Time format tags

Tag        Variable width availability   Description                       Value range   Options
%h         import                        Hour (24), variable width         0...23        s
%hh                                      Hour (24), fixed width            0...23        s
%H         import                        Hour (12), variable width         1...12        s
%HH                                      Hour (12), fixed width            01...12       s
%n         import                        Minutes, variable width           0...59        s
%nn                                      Minutes, fixed width              0...59        s
%s         import                        Seconds, variable width           0...59        s
%ss                                      Seconds, fixed width              0...59        s
%s.N       import                        Seconds + fraction (N = 0...6)                  s, c, C
%ss.N                                    Seconds + fraction (N = 0...6)                  s, c, C
%SSS       with v option                 Milliseconds                      0...999       s, v
%SSSSSS    with v option                 Microseconds                      0...999999    s, v
%aa        German                        am/pm marker, locale specific     am, pm        u, w
By default, the format of the time contained in the string is %hh:%nn:%ss. However, you can specify a format string defining the format of the string field. You must prefix each component of the format string with the percent symbol. Separate the string's components with any character except the percent sign (%).
Where indicated, the tags can represent variable-width fields on import, export, or both. Variable-width time elements can omit leading zeroes without causing errors.
The following options can be used in the format string where indicated:
s
Specify this option to allow leading spaces in time formats. The s option is specified in the form %(tag,s), where tag is the format string. For example, %(n,s) indicates a minute field in which values can contain leading spaces or zeroes and be one or two characters wide. If you specified the time format property %(h,s):%(n,s):%(s,s), then the following times would all be valid: 20: 6:58, 20:06:58, and 20:6:58.
v
Use this option in conjunction with the %SSS or %SSSSSS tags to represent milliseconds or microseconds in variable-width format. So the time property %(SSS,v) represents values in the range 0 to 999. (If you omit the v option then the range of values would be 000 to 999.)
u
Use this option to render the am/pm text in uppercase on output.
w
Use this option to render the am/pm text in lowercase on output.
c
Specify this option to use a comma as the decimal separator in the %ss.N tag.
C
Specify this option to use a period as the decimal separator in the %ss.N tag.
The c and C options override the default setting of the locale. The locale for determining the setting of the am/pm string and the default decimal separator can be controlled through the locale tag. This has the format:
%(L,locale)
Where locale specifies the locale to be set using the language_COUNTRY.variant naming convention supported by ICU. See IBM InfoSphere DataStage and QualityStage Globalization Guide for a list of locales. The default locale for the am/pm string and separator markers is English unless overridden by a %L tag or the APT_IMPEXP_LOCALE environment variable (the tag takes precedence over the environment variable if both are set).
Use the locale tag in conjunction with your time format; for example, %L('es')%HH:%nn %aa specifies the Spanish locale.
The format string is subject to the restrictions laid out in the following table. A format string can contain at most one tag from each row. In addition some rows are mutually incompatible, as indicated in the 'incompatible with' column. When some tags are used the format string requires that other tags are present too, as indicated in the 'requires' column.
Table 77. Format tag restrictions

Element                 Numeric format tags            Text format tags   Requires      Incompatible with
hour                    %hh, %h, %HH, %H               -                  -             -
am/pm marker            -                              %aa                hour (%HH)    hour (%hh)
minute                  %nn, %n                        -                  -             -
second                  %ss, %s                        -                  -             -
fraction of a second    %ss.N, %s.N, %SSS, %SSSSSS     -                  -             -
You can include literal text in your time format. Any Unicode character other than null, backslash, or the percent sign can be used (although it is better to avoid control codes and other non-graphic characters). The following table lists special tags and escape sequences:
Tag    Escape sequence
%%     literal percent sign
\%     literal percent sign
\n     newline
\t     horizontal tab
\\     single backslash
When using variable-width tags, it is good practice to enclose the time string in quotes. For example, the following schema:
f1:int32; f2:time { time_format=%h:%n:%s, quote=double }; f3:int32;
The quotes are required because the parallel engine assumes that variable-width fields are space delimited and so might interpret a legitimate space in the time string as the end of the time.
Restrictions
You cannot specify time_format together with midnight_seconds or text; only one of these options can be specified.
Example
For example, you define a format string as %hh:%nn:%ss.3 to specify that the string contains the seconds to three decimal places, that is, to milliseconds. Alternatively you could define this as %hh:%nn:%SSS.
timestamp_format
Specifies the format of an imported or exported field representing a timestamp as a string.
Applies to
Fields of the timestamp data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record ( {timestamp_format = timestamp_format | uformat } ) field_definition { timestamp_format = timestamp_format |uformat };
uformat is default_date_format and default_time_format concatenated. The two formats can be in any order and date and time elements can be mixed. The uformat formats are described in "default_date_format" and "default_time_format" . The timestamp_format is the date format and the time format. Again the two formats can be in any order and their elements can be mixed. The formats are described in date_format on page 376 and time_format on page 408. You must prefix each component of the format string with the percent symbol (%). Enclose timestamp_format in single quotation marks.
Default
If you do not specify the format of the timestamp it defaults to the string %yyyy-%mm-%dd %hh:%nn:%ss.
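A minimal sketch (the format and field name are chosen for illustration) of a timestamp read from a day-first text representation:
record ( ts:timestamp {timestamp_format = '%dd-%mm-%yyyy %hh:%nn:%ss'}; )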
vector_prefix
Specifies a 1-, 2-, or 4-byte prefix containing the number of elements in the vector.
Applies to
Fields that are variable-length vectors, which are formatted accordingly.
Syntax
record { vector_prefix [= n ] } field_definition { vector_prefix [= n ] };
where n is the optional byte size of the prefix containing the number of elements in the vector; n can be 1 (the default), 2, or 4. If a vector_prefix is defined for the entire record, you can override the definition for individual vectors.
Discussion
Variable-length vectors must use either a prefix on the vector or a link to another field in order to specify the number of elements in the vector. If the variable-length vector has a prefix, you use the property vector_prefix to indicate the prefix length. By default, the prefix length is assumed to be one byte.
Behavior on import differs from that on export:
v On import, the source data file must contain a prefix of each vector containing the element count. The import operator reads the length prefix but does not include the prefix as a separate field in the imported data set.
v On export, the export operator inserts the element count as a prefix of each variable-length vector field.
For multi-byte prefixes, the byte ordering is determined by the setting of the little_endian ("little_endian"), big_endian ("big_endian"), or native_endian ("native_endian") properties.
Examples
The following schema specifies that all variable-length vectors are prefixed by a one-byte element count:
record {vector_prefix} (a[]:int32; b[]:int32; )
In the following record schema, the vector_prefix of the record (1 byte long by default) is overridden for field b, whose vector_prefix is two bytes long:
record {vector_prefix} (a[]:int32; b[]:int32 {vector_prefix = 2} )
The schema shown below specifies that the variable-length vector a is prefixed by a one-byte element count, and vector b is prefixed by a two-byte element count:
record (a[]:int32 {vector_prefix}; b[]:int32 {vector_prefix = 2};)
Import and export results differ:
v On import, the source data file must contain a prefix of each vector containing the element count.
v On export, the export operator inserts the element count as a prefix of each vector.
width
Specifies the number of 8-bit bytes of an imported or exported text-format field. Base your width specification on the value of your -impexp_charset option setting. If it's a fixed-width charset, you can calculate the number of bytes exactly. If it's a variable length encoding, base your calculation on the width and frequency of your variable-width characters.
Applies to
Fields of all data types except date, time, timestamp, and raw; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { width = n } field_definition { width = n };
where n is the number of bytes in the field; you can specify a maximum width of 255 bytes.
Discussion
This property is useful for numeric fields stored in the source or destination file in a text representation. If no width is specified and you do not use max_width to specify a maximum width, numeric fields exported as text have the following number of bytes of maximum width:
v 8-bit signed or unsigned integers: 4 bytes
v 16-bit signed or unsigned integers: 6 bytes
v 32-bit signed or unsigned integers: 11 bytes
v 64-bit signed or unsigned integers: 21 bytes
v single-precision float: 14 bytes (sign, digit, decimal point, 7 fraction, "E", sign, 2 exponent)
v double-precision float: 24 bytes (sign, digit, decimal point, 16 fraction, "E", sign, 3 exponent)
Restriction
On export, if you specify the width property with a dfloat field, the width must be at least eight bytes long.
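For illustration (the field names and widths are hypothetical), fixed text widths for numeric fields could be declared as:
record {delim = none} ( a:int32 {width = 10}; b:dfloat {width = 12}; )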
zoned
Specifies that the field contains an unpacked decimal using either ASCII or EBCDIC text.
Applies to
Fields of the decimal data type; record, subrec, or tagged if it contains at least one field of this type.
Syntax
record { zoned[, option ] } field_definition { zoned[, option ]};
where option can be either trailing or leading:
v trailing (default) specifies that the sign nibble is in the last byte
v leading specifies that the sign nibble is in the first byte
Discussion
Import and export behavior differ:
v On import, the field is read from a zoned representation of the same length, with zoning as defined by the property.
v On export, the field is written to the destination.
The following table defines how the sign is represented in both the ASCII and EBCDIC formats:
Sign        ASCII                                                          EBCDIC
positive    Indicated by representing the sign digit normally.             Upper nibble equal to: 0xA, 0xC, 0xE, 0xF
negative    Indicated by setting the 0x40 bit in the sign digit's byte.    Upper nibble equal to: 0xB, 0xD
            This turns "0" through "9" into "p" through "y".
Example
For example, the following schema specifies that the InfoSphere DataStage decimal field has a precision of 5 and a scale of 2 and that its sign nibble is found in the last byte of the field:
record ( a:decimal[5,2]{zoned}; )
The precision and scale of the source decimal on import, or the destination decimal on export, defaults to the precision and scale of the InfoSphere DataStage decimal field. You can use the precision ("precision" ) and scale ("scale" ) properties to override these defaults.
For example, the following figure shows an operator that uses the any partitioning method.
[Figure: an operator that uses the any partitioning method]
To override the any partitioning method of the operator and replace it by the entire partitioning method, you place the entire partitioner into the step as shown. The osh command for this example is:
$ osh "... | entire | op ... "
entire: properties
Table 78. entire Partitioner Properties

Property                                      Value
Number of input data sets                     1
Number of output data sets                    1
Input interface schema                        inRec:*
Output interface schema                       outRec:*
Transfer behavior                             inRec to outRec without modification
Execution mode                                parallel
Partitioning method                           entire
Preserve-partitioning flag in output set      set
Composite operator                            no
Syntax
The entire partitioner has no options. Its syntax is simply:
entire
[Figure: age values such as 36, 40, 22, 12, 18, 27, 35, 5, 60, 15, 44, 39, 10, 54, and 17 distributed among the partitions by hashing on the age field]
Figure 2. Hash Partitioning Example
As you can see in the diagram, the key values are randomly distributed among the different partitions. The partition sizes resulting from a hash partitioner are dependent on the distribution of records in the data set, so even though there are three keys per partition, the number of records per partition varies widely, because the distribution of ages in the population is non-uniform.
When hash partitioning, you should select hashing keys that create a large number of partitions. For example, hashing by the first two digits of a zip code produces a maximum of 100 partitions. This is not a large number for a parallel processing system. Instead, you could hash by five digits of the zip code to create up to 10,000 partitions. You also could combine a zip code hash with an age hash (assuming a maximum age of 150), to yield 1,500,000 possible partitions. Fields that can only assume two values, such as yes/no, true/false, male/female, are particularly poor choices as hash keys.
[Figure: data flow in which the hash partitioner is inserted before the operator]
In this example, fields a and b are specified as partitioning keys. Shown below is the osh command:
$ osh "... | hash -key a -key b | op ..."
By default, the hash partitioner uses a case-sensitive hashing algorithm. You can override this by using the -ci option to the partitioner, which is applied to the string field e in the following example:
$ osh "... | hash -key e -ci | op ..."
To prevent the output of the hash partitioner from being repartitioned, the hash partitioner sets the preserve-partitioning flag in its output.
hash: properties
Table 79. hash Operator Properties

Property                                      Value
Number of input data sets                     1
Number of output data sets                    1
Input interface schema                        inRec:*
Output interface schema                       outRec:*
Transfer behavior                             inRec to outRec without modification
Execution mode                                parallel
Partitioning method                           hash
Preserve-partitioning flag in output set      set
Composite operator                            no
Syntax
The syntax for the hash partitioner in an osh command is:
hash -key field [-ci | -cs] [-param params]
[-key field [-ci | -cs] [-param params] ...]
[-collation_sequence locale | collation_file_pathname | OFF]
There is one required option, -key. You can specify it multiple times.
Table 80. hash Partitioner Option

Option    Use
-key      -key field [-ci | -cs] [-param params]
          Specifies that field is a partitioning key field for the hash partitioner. You can designate multiple key fields; the order is unimportant. The key field must be a field of the data set using the partitioner.
          The data type of a partitioning key might be any InfoSphere DataStage data type including nullable data types.
          By default, the hash partitioner uses a case sensitive algorithm for hashing. This means that uppercase strings are distinct from lowercase strings. You can override this default to perform case insensitive hashing, by using the -ci option after the field name.
          The -param suboption allows you to specify extra parameters for a field. Specify parameters using property = value pairs separated by commas.
Table 80. hash Partitioner Option (continued)

Option                 Use
-collation_sequence    -collation_sequence locale | collation_file_pathname | OFF
                       This option determines how your string data is sorted. You can:
                       v Specify a predefined IBM ICU locale
                       v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
                       v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
                       By default, InfoSphere DataStage sorts strings using byte-wise comparisons.
                       For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.html
where:
v fieldname is a numeric field of the input data set.
v number_of_partitions is the number of processing nodes on which the partitioner executes. If a partitioner is executed on three processing nodes it has three partitions. InfoSphere DataStage automatically passes the number of partitions to partitioners, so you need not supply this information.
modulus
modulus: properties
Table 81. modulus Operator Properties

Property                                      Value
Number of input data sets                     1
Number of output data sets                    1
Input interface schema                        inRec:*
Output interface schema                       outRec:*
Transfer behavior                             inRec to outRec without modification
Execution mode                                parallel
Partitioning method                           modulus
Preserve-partitioning flag in output set      set
Composite operator                            no

Table 82. modulus Operator Option

Option    Use
key       key key_field
          Specifies the name of the key field on whose value the modulus will be calculated. The key field must be a numeric field, which is converted to a uint32 internally.
There is one option. It is required, and you can specify it only once.
Table 83. modulus Partitioner Option

Option    Use
-key      -key fieldname
          Supply the name of the key field on whose value the modulus is calculated. The key field must be a numeric field, which is converted to a uint64 internally.
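A minimal invocation sketch, assuming the key field a used in the example that follows:
$ osh "... | modulus -key a | op ..."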
Field a, of type uint32, is specified as the key field, on which the modulus operation is calculated. Here is the input data set. Each line represents a record:
64123 1960-03-30
61821 1960-06-27
44919 1961-06-18
22677 1960-09-24
90746 1961-09-15
21870 1960-01-01
87702 1960-12-22
4705 1961-12-13
47330 1961-03-21
88193 1962-03-12
The following table shows the output data set divided among four partitions by the modulus partitioner.
Partition 0    Partition 1          Partition 2          Partition 3
               61821 1960-06-27     21870 1960-01-01     64123 1960-03-30
               22677 1960-09-24     87702 1960-12-22     44919 1961-06-18
               4705 1961-12-13      47330 1961-03-21
               88193 1962-03-12     90746 1961-09-15
Here are three sample modulus operations, corresponding to three of the key values shown above:
22677 mod 4 = 1; the data is written to Partition 1. 47330 mod 4 = 2; the data is written to Partition 2. 64123 mod 4 = 3; the data is written to Partition 3.
None of the key fields can be divided evenly by 4, so no data is written to Partition 0.
[Figure: data flow in which the random partitioner is inserted before the operator]
To override the any partitioning method of op and replace it by the random partitioning method, you place the random partitioner into the step. Here is the osh command for this example:
$ osh "... | random | op ... "
random: properties
Table 84. random Operator Properties

Property                                      Value
Number of input data sets                     1
Number of output data sets                    1
Input interface schema                        inRec:*
Output interface schema                       outRec:*
Transfer behavior                             inRec to outRec without modification
Execution mode                                parallel
Partitioning method                           random
Preserve-partitioning flag in output set      cleared
Composite operator                            no
Syntax
The syntax for the random partitioner in an osh command is:
random
Figure 5. Range Partitioning Example
All partitions are of approximately the same size. In an ideal distribution, every partition would be exactly the same size. However, you typically observe small differences in partition size.
In order to size the partitions, the range partitioner orders the partitioning keys. The range partitioner then calculates partition boundaries based on the partitioning keys in order to evenly distribute records to the partitions. As shown above, the distribution of partitioning keys is often not even; that is, some partitions contain many partitioning keys, and others contain relatively few. However, based on the calculated partition boundaries, the number of records in each partition is approximately the same. Range partitioning is not the only partitioning method that guarantees equivalent-sized partitions. The random and roundrobin partitioning methods also guarantee that the partitions of a data set are equivalent in size. However, these partitioning methods are keyless; that is, they do not allow you to control how records of a data set are grouped together within a partition.
Procedure
1. Create a random sample of records from the data set to be partitioned using the sample partitioner. Your sample should contain at least 100 records per processing node in order to accurately determine the partition boundaries. See "Sample Operator" for more information on the sample partitioner.
2. Use the tsort partitioner, in sequential mode, to perform a complete sort of the sampled records using the partitioning keys as sorting keys. Since your sample size should typically be less than 25,600 (assuming a maximum of 256 nodes in your system), the sequential-mode sort is quick. You must sort the sampled data set using the same fields as sorting keys, and in the same order, as you specify the partitioning keys to the range partitioner. Also, the sorting keys must have the same characteristics for case sensitivity and ascending or descending ordering as you specified for the range partitioner.
3. Use the writerangemap operator to store the sorted, sampled data set to disk as a file. This file is called a range map. See "The writerangemap Operator" for more information on this operator.
4. Configure the range partitioner using the sorted, sampled data file. The range partitioner determines the partition boundaries for the entire data set based on this sample.
5. Use the -key argument to specify the partitioning keys to the range partitioner. Note that you must specify the same fields as the partitioning keys, and in the same order, as you specified as sorting keys above in Step 2. Also, the partitioning keys must have the same characteristics for case sensitivity and ascending or descending ordering as you specified for the sort.
Results
The diagram shows a data flow where the second step begins with a range partitioner:
[Figure: step 1 contains the sample operator, the tsort operator running in sequential mode, and the writerangemap operator; step 2 begins with the range partitioner followed by the operator]
Note the diagram shows that you sample and sort the data set used to configure the range partitioner in one InfoSphere DataStage step, and use the range partitioner in a second step. This is because all the processing of the sorted sample is not complete until the first step ends. This example shows a range partitioner configured from the same data set that you want to partition. However, you might have multiple data sets whose record distribution can be accurately modeled using a range partitioner configured from a single data set. In this case, you can use the same range partitioner for all the data sets.
You decide to create a range partition using fields a and c as the range partitioning keys. Your system contains 16 processing nodes. Here are the UNIX shell and osh commands for these two steps:
$ numSampled=1600                                                   # Line 1
$ numRecs=`dsrecords inDS.ds | cut -f1 -d' '`                       # Line 2
$ percent=`echo "10 k $numSampled $numRecs / 100 * p q" | dc`       # Line 3
$ osh "sample $percent < inDS.ds |
       tsort -key a -key c [seq] |
       writerangemap -rangemap sampledData -overwrite
                     -interface record(a:int32; c:string[5];)"
$ osh "range -sample sampledData -key a -key c < inDS.ds | op1 ..."
The sample size required by the range partitioner is at least 100 records per processing node. Since there are 16 processing nodes, you specify the sample size as 1600 records on Line 1. On Lines 2 and 3 you calculate the sample size as a percentage of the total number of records in the data set. This calculation is necessary because the sample operator requires the sample size to be expressed as a percentage. In order to calculate the percentage, you use the dsrecords utility to obtain the number of records in the input data set. The return value of dsrecords has the form "# records" where # is the number of records. Line 2 returns the record count and strips off the word "records" from the value. Line 3 then calculates a floating point value for the sample percentage from the 1600 records required by the sample and the number of records in the data set. This example uses the UNIX dc command to calculate the percentage. In this command, the term 10 k specifies that the result has 10 digits to the right of the decimal point. See the man page on dc for more information. The range partitioner in this example partitions an input data set based on fields a and c. Therefore, the writerangemap operator only writes fields a and c to the output file used to generate the range partitioner.
range: properties
Table 85. range Operator Properties

Property                                      Value
Number of input data sets                     1
Number of output data sets                    1
Input interface schema                        inRec:*
Output interface schema                       outRec:*
Transfer behavior                             inRec to outRec without modification
Execution mode                                parallel
Partitioning method                           range
Preserve-partitioning flag in output set      set
Composite operator                            no
Table 86. range Partitioner Options

Option                 Use
-key                   -key fieldname [-ci | -cs] [-asc | -desc] [-nulls first | last] [-ebcdic] [-params params]
                       Specifies that fieldname is a partitioning key field for the range partitioner. You can designate multiple partitioning key fields. You must specify the same fields, in the same order and with the same characteristics for ascending/descending and case sensitive/insensitive order, as used to sort the sampled data set.
                       The field name fieldname must be a field of the data set using the partitioner or a field created using an input field adapter on the data set.
                       By default the range partitioner uses a case-sensitive algorithm. You can perform case-insensitive partitioning by using the -ci option after the field name.
                       -ascending specifies ascending order sort; records with smaller values for fieldname are assigned to lower number partitions. This is the default.
                       -descending specifies descending order sort; records with larger values for fieldname are assigned to lower number partitions.
                       -nulls {first | last} specifies whether nulls appear first or last in the sorted partition. The default is first.
                       -ebcdic specifies that the EBCDIC collating sequence is used.
                       The -params suboption allows you to specify extra parameters for a field. Specify parameters using property = value pairs separated by commas.
-collation_sequence    -collation_sequence locale | collation_file_pathname | OFF
                       This option determines how your string data is sorted. You can:
                       v Specify a predefined IBM ICU locale
                       v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname
                       v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence.
                       By default, InfoSphere DataStage sorts strings using byte-wise comparisons.
                       For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm
-sample                -sample sorted_sampled_data_set
                       Specifies the file containing the sorted, sampled data set used to configure the range partitioner.
Table 87. range Operator Options Option key Use key field_name [Case: Sensitive | Insensitive] [Ascending | Descending] Specifies that field_name is a partitioning key field for the range operator. You can designate multiple partitioning key fields. You must specify the same fields, in the same order, and with the same characteristics for ascending/descending order and case sensitive/insensitive matching, as used to sort the sampled data set. The field name field_name must be a field of the data set using the partitioner or a field created using an input field adapter on the data set. By default, the range operator is case sensitive. This means that uppercase strings come before lowercase strings. You can override this default to perform case-insensitive partitioning by using the Case: Insensitive radio button under the field name. By default, the range partitioning operator uses ascending order, so that records with smaller values for field_name are assigned to lower number partitions than records with larger values. You can specify descending sorting order, so that records with larger values are assigned to lower number partitions, where the options are: Ascending specifies ascending order which is the default Descending specifies descending order sample -sample sorted_sampled_data_set Specifies the file containing the sorted, sampled data set used to create the operator.
Writerangemap operator
The writerangemap operator takes an input data set produced by sampling and partition sorting a data set and writes it to a file in a form usable by the range partitioner. The range partitioner uses the sampled and sorted data set to determine partition boundaries.
[Figure: the writerangemap operator writing its output file, newDS.ds]
The operator takes a single data set as input. You specify the input interface schema of the operator using the -interface option. Only the fields of the input data set specified by -interface are copied to the output file.
writerangemap: properties
Table 88. writerangemap properties
Property                                    Value
Number of input data sets                   1
Number of output data sets                  0 (produces a data file as output)
Input interface schema                      specified by the interface arguments
Output interface schema                     none
Transfer behavior                           inRec to outRec without modification
Execution mode                              sequential only
Partitioning method                         range
Preserve-partitioning flag in output set    set
Composite operator                          no
Table 89. Writerangemap options Option -key Use -key fieldname Specifies an input field copied to the output file. Only information about the specified field is written to the output file.You only need to specify those fields that you use to range partition a data set. You can specify multiple -key options to define multiple fields. This option is mutually exclusive with -interface. You must specify either -key or -interface, but not both. -collation_ sequence -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. v By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.htm -interface -interface schema Specifies the input fields copied to the output file. Only information about the specified fields is written to the output file.You only need to specify those fields that you use to range partition a data set. This option is mutually exclusive with -key; that is, you can specify -key or -interface, but not both. -overwrite -overwrite Tells the operator to overwrite the output file, if it exists. By default, the operator does not overwrite the output file. Instead it generates an error and aborts the job if the file already exists. -rangemap -rangemap filename Specifies the pathname of the output file which will contain the sampled and sorted data.
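As an illustration, here is a hedged osh sketch of a writerangemap invocation that uses only the options shown above; the data set and file names, and the choice of key fields a and c, are assumptions:

$ osh "writerangemap -rangemap sampledData -overwrite -key a -key c < sortedSampleDS.ds"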
where: v filename specifies the name of the file containing the range map, a file containing the sorted, sampled records used to configure a range partitioner. v fieldname specifies the field(s) of the input data used as sorting key fields. You can specify one or more key fields. Note that you must sort the sampled data set using the same fields as sorting keys, and in the same order, as you specify as partitioning keys. Also, the sorting keys must have the same ascending/descending and case sensitive/insensitive properties.
Table 90. makerangemap Utility Options Option -collation_sequence Use -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.htm -f -f Specifies to overwrite the output file, if it exists. By default, the utility does not overwrite the output file. Instead, it generates an error and aborts if the file already exists.
Table 90. makerangemap Utility Options (continued) Option -key Use -key fieldname [-ci | -cs][-asc | -desc][-ebcdic] Specifies that fieldname is a sorting key field. You can designate multiple sorting key fields. You must specify the same fields, in the same order and with the same characteristics for ascending/descending and case sensitive/insensitive order, as used by the range partitioner. The field name fieldname must be a field of the input data set. By default, the sort uses a case-sensitive algorithm. This means that uppercase strings come before lowercase strings. You can override this default to perform case-insensitive sorting, where: -cs specifies case sensitive (default) -ci specifies case insensitive -ebcdic specifies that fieldname be sorted using the EBCDIC collating sequence. By default, the sort uses ascending order, so that records with smaller values for fieldname come before records with larger values. You can specify descending sorting order, so that records with larger values come first, where: -asc specifies ascending, the default -desc specifies descending -percentage or -p -percentage percent Specifies the sample size of the input data set as a percentage. The sample size defaults to 100 records per processing node in the default node pool as defined by the InfoSphere DataStage configuration file. If specified, percent should be large enough to create a sample of 100 records for each processing node executing the operator using the range partitioner. The options -size and -percentage are mutually exclusive. -rangemap or -rm -rangemap filename Specifies the pathname of the output file containing the sampled and sorted data.
Table 90. makerangemap Utility Options (continued) Option -size or -s Use -size samplesize Specifies the size of the sample taken from the input data set. The size defaults to 100 records per processing node in the default node pool as defined by the InfoSphere DataStage configuration file. If specified, samplesize should be set to at least 100 multiplied by the number of processing nodes executing the operator using the range partitioner. The options -size and -percentage are mutually exclusive.
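For example, a sketch of a makerangemap command that samples an input data set and writes the range map file; the file and data set names are assumptions:

$ makerangemap -rangemap sampledData -key a -key c inDS.ds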
2. A persistent data set.
For example, Figure 7 shows an operator that uses the any partitioning method.
[Figure 7: a roundrobin partitioner inserted before an operator that uses the any partitioning method]
To override the any partitioning method of op and replace it by the round robin partitioning method, you place the roundrobin partitioner into the step as shown. Shown below is the osh command for this example:
$ osh "... | roundrobin | op ... "
roundrobin: properties
Table 91. roundrobin Operator Properties
Property                                    Value
Number of input data sets                   1
Number of output data sets                  1
Input interface schema                      inRec:*
Output interface schema                     outRec:*
Transfer behavior                           inRec to outRec without modification
Execution mode                              parallel
Partitioning method                         roundrobin
Preserve-partitioning flag in output set    cleared
Composite operator                          no
Syntax
The syntax for the roundrobin partitioner in an osh command is shown below. There are no options.
roundrobin
[Figure: the same partitioner inserted before an operator that uses the any partitioning method]
To override the any partitioning method of op and replace it by the same partitioning method, you place the same partitioner into the step as shown. Shown below is the osh command for this example:
$ osh "... | same | op ... "
same: properties
Table 92. same Operator Properties
Property                                    Value
Number of input data sets                   1
Number of output data sets                  1
Input interface schema                      inRec:*
Output interface schema                     outRec:*
Transfer behavior                           inRec to outRec without modification
Execution mode                              parallel
Partitioning method                         same
Preserve-partitioning flag in output set    propagated
Composite operator                          no
Syntax
The syntax for the same partitioner in an osh command is shown below. There are no options.
same
Ordered collecting
In ordered collection, the collector reads all records from the first input partition, then all records from the second input partition, and so on, to create the output data set. This collection method preserves the record order of each partition of the input data set and might be useful as a preprocessing action before exporting a sorted data set to a single data file. When you use the ordered collector, the output data set of the collector must be one of these: v A virtual data set connected to the input of a sequential collector using the any collection method. A virtual data set output by a collector overrides the collection method of a collector using the any collection method. v A persistent data set. If the data set exists, it must contain only a single partition unless a full overwrite of the data set is being performed. For example, the diagram shows an operator using the any collection method, preceded by an ordered collector.
[Figure: an ordered collector inserted before an operator that uses the any collection method]
The any collection method of the other operator is overridden by inserting an ordered collector into the step. The ordered collector takes a single partitioned data set as input and collects the input data set to create a single sequential output data set with one partition.
Syntax
The syntax for the ordered collector in an osh command is:
$ osh " ... | ordered | ... "
[Figure: a roundrobin collector inserted before an operator that uses the any collection method]
The any collection method of the other operator is overridden by inserting a roundrobin collector into the step. The roundrobin collector takes a single data set as input and collects the input data set to create a single sequential output data set with one partition.
Table 94. roundrobin Collector Properties (continued)
Property              Value
Composite operator    no
Syntax
The syntax for the roundrobin collector in an osh command is:
osh " ... | roundrobin | ... "
sortmerge
[Figure: an input data set with three partitions, each holding one record: Paul Smith 34, Mary Davis 42, and Jane Smith 42]
In this example, the records consist of three fields. The first-name and last-name fields are strings, and the age field is an integer. The following diagram shows the order of the three records read by the sortmerge collector, based on different combinations of collecting keys.
[Figure: the order in which the sortmerge collector reads the three records for different combinations of collecting keys]
You must define a single primary collecting key for the sortmerge collector, and you might define as many secondary keys as are required by your job. Note, however, that each record field can be used only once as a collecting key. Therefore, the total number of primary and secondary collecting keys must be less than or equal to the total number of fields in the record. The data type of a collecting key can be any InfoSphere DataStage type except raw, subrec, tagged, or vector. Specifying a collecting key of these types causes the sortmerge collector to issue an error and abort execution.
By default, the sortmerge collector uses ascending sort order and case-sensitive comparisons. Ascending order means that records with smaller values for a collecting field are processed before records with larger values. You also can specify descending sorting order, so records with larger values are processed first. With a case-sensitive algorithm, records with uppercase strings are processed before records with lowercase strings. You can override this default to perform case-insensitive comparisons of string fields.
sortmerge: properties
Table 95. sortmerge Collector Properties
Property                                         Value
Number of input data sets                        1
Number of output data sets                       1
Input interface schema                           inRec:*
Output interface schema                          outRec:*
Transfer behavior                                inRec to outRec without modification
Execution mode                                   sequential
Collection method                                sortmerge
Preserve-partitioning flag in output data set    propagated
Composite operator                               no
locale
You must specify at least one -key field to the sortmerge collector.
Table 96. sortmerge Collection Options Option -key Use -key field_name [-ci | -cs] [-asc | -desc] [-nulls first | last] [-ebcdic] [-param params ] Specifies that field_name is a collecting key field for the sortmerge collector. You can designate multiple key fields. The first key you list is the primary key. field_name must be a field of the data set using the collector or a field created using an input field adapter on the data set. By default, the sortmerge collector does case-sensitive comparisons. This means that records with uppercase strings are processed before records with lowercase strings. You can optionally override this default to perform case-insensitive collecting by using the -ci option after the field name. By default, the sortmerge collector uses ascending order, so that records with smaller values for field_name are processed before records with larger values. You can optionally specify descending sorting order, so that records with larger values are processed first, by using -desc after the field name. By default, the sortmerge collector sorts fields with null keys first. If you wish to have them sorted last, specify -nulls last after the field name. By default, data is represented in the ASCII character set. To represent data in the EBCDIC character set, specify the -ebcdic option. The -param suboption allows you to specify extra parameters for a field. Specify parameters using property = value pairs separated by commas. -collation_sequence -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: Specify a predefined IBM ICU locale Write your own collation sequence using ICU syntax, and supply its collation_file_pathname Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu /userguide/Collate_Intro.htm
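A minimal osh sketch that collects a partition-sorted data set with the sortmerge collector; the key fields lName and fName are assumptions:

$ osh " ... | sortmerge -key lName -key fName | ... "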
Operators in the restructure library:
v aggtorec: groups records that have the same key-field values into an output record
v field_export: combines the input fields specified in your output schema into a string- or raw-valued field
v field_import: exports an input string or raw field to the output fields specified in your import schema
v makesubrec: combines specified vector fields into a vector of subrecords
v makevect: combines specified fields into a vector of fields of the same type
v promotesubrec: converts input subrecord fields to output top-level fields
v splitsubrec: separates input subrecords into sets of output top-level vector fields
v splitvect: promotes the elements of a fixed-length vector to a set of similarly-named top-level fields
v tagbatch: converts tagged fields into output records whose schema supports all the possible fields of the tag cases
v tagswitch: converts the contents of tagged aggregates to InfoSphere DataStage-compatible records
Output formats
The format of a top-level output record depends on whether you specify the -toplevel option. The following simplified example demonstrates the two output formats. In both cases, the -key option value is field-b, and the -subrecname option value is sub.
The two forms of output are: v When the -toplevel option is not specified, each output top-level record consists entirely of subrecords, where each subrecord has exactly the same fields as its corresponding input record, and all subrecords in a top-level output record have the same values in their key fields.
sub:[0:(sub.field-a:3 sub.field-b:1)] sub:[0:(sub.field-a:4 sub.field-b:2) 1:(sub.field-a:5 sub.field-b:2)]
v When the -toplevel option is specified, the input key field or fields remain top-level fields in the output top-level record, and the non-key fields are placed in a subrecord.
field-b:1 sub:[0:(sub.field-a:3)] field-b:2 sub:[0:(sub.field-a:4) 1:(sub.field-a:5)]
aggtorec: properties
Table 98. aggtorec Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1
Input interface schema        inRec:*
Output interface schema       See the section "Output Formats".
Transfer behavior             See the section "Output Formats".
Execution mode                parallel (default) or sequential
-key
-key key_field [-ci|-cs] [-param params] This option is required. Specify one or more fields. If you do not specify -toplevelkeys, all records whose key fields contain identical values are gathered into the same record as subrecords. Each field becomes the element of a subrecord. If you specify the -toplevelkeys option, the key field appears as a top-level field in the output record. All non-key fields belonging to input records with that key field appear as elements of a subrecord in that key field's output record. You can specify multiple keys. For each one, specify the -key option and supply the key's name. By default, InfoSphere DataStage interprets the value of key fields in a case-sensitive manner if the values are strings. Specify -ci to override this default. Do so for each key you choose. For example: -key A -ci -key B -ci The -param suboption allows you to specify extra parameters for a field. Specify parameters using property =value pairs separated by commas.
-collation_sequence
-collation_sequence locale |collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.html
-subrecname
-subrecname subrecname This option is required. Specify the name of the subrecords that aggtorec creates.
-toplevelkeys
-toplevelkeys Optionally specify -toplevelkeys to create top-level fields from the field or fields you have chosen as keys.
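As a hedged sketch, invocations of the kind that produce the example outputs below; the data set names are assumptions, and the key fields follow the examples (grouping on field d alone, and on fields d and a with -toplevelkeys):

$ osh "aggtorec -key d -subrecname sub < inDS.ds > outDS.ds"
$ osh "aggtorec -key d -key a -toplevelkeys -subrecname sub < inDS.ds > outDS.ds"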
In the output, records with an identical value in field d have been gathered as subrecords in the same record, as shown here:
sub:[0:(sub.a:1 1:(sub.a:3 sub:[0:(sub.a:1 1:(sub.a:2 2:(sub.a:2 3:(sub.a:2 4:(sub.a:3 5:(sub.a:3 sub:[0:(sub.a:1 1:(sub.a:2 2:(sub.a:2 3:(sub.a:3 4:(sub.a:3 5:(sub.a:3 6:(sub.a:3 sub.b:00:11:01 sub.b:08:45:54 sub.b:12:59:01 sub.b:07:33:04 sub.b:12:00:00 sub.b:07:37:04 sub.b:07:56:03 sub.b:09:58:02 sub.b:11:43:02 sub.b:01:30:01 sub.b:11:30:01 sub.b:10:28:02 sub.b:12:27:00 sub.b:06:33:03 sub.b:11:18:22 sub.c:1960-01-02 sub.c:1946-09-15 sub.c:1955-12-22 sub.c:1950-03-10 sub.c:1967-02-06 sub.c:1950-03-10 sub.c:1977-04-14 sub.c:1960-05-18 sub.c:1980-06-03 sub.c:1985-07-07 sub.c:1985-07-07 sub.c:1992-11-23 sub.c:1929-08-11 sub.c:1999-10-19 sub.c:1992-11-23 sub.d:A) sub.d:A)] sub.d:B) sub.d:B) sub.d:B) sub.d:B) sub.d:B) sub.d:B)] sub.d:C) sub.d:C) sub.d:C) sub.d:C) sub.d:C) sub.d:C) sub.d:C)]
In the output, records with identical values in both field d and field a have been gathered as subrecords in the same record:
sub:[0:(sub.a:1 sub:[0:(sub.a:3 sub:[0:(sub.a:1 sub:[0:(sub.a:2 1:(sub.a:2 2:(sub.a:2 sub:[0:(sub.a:3 1:(sub.a:3 sub:[0:(sub.a:1 sub:[0:(sub.a:2 1:(sub.a:2 sub:[0:(sub.a:3 1:(sub.a:3 2:(sub.a:3 3:(sub.a:3 sub.b:00:11:01 sub.b:08:45:54 sub.b:12:59:01 sub.b:07:33:04 sub.b:12:00:00 sub.b:07:37:04 sub.b:07:56:03 sub.b:09:58:02 sub.b:11:43:02 sub.b:01:30:01 sub.b:11:30:01 sub.b:10:28:02 sub.b:12:27:00 sub.b:06:33:03 sub.b:11:18:22 sub.c:1960-01-02 sub.c:1946-09-15 sub.c:1955-12-22 sub.c:1950-03-10 sub.c:1967-02-06 sub.c:1950-03-10 sub.c:1977-04-14 sub.c:1960-05-18 sub.c:1980-06-03 sub.c:1985-07-07 sub.c:1985-07-07 sub.c:1992-11-23 sub.c:1929-08-11 sub.c:1999-10-19 sub.c:1992-11-23 sub.d:A)] sub.d:A)] sub.d:B)] sub.d:B) sub.d:B) sub.d:B)] sub.d:B) sub.d:B)] sub.d:C)] sub.d:C) sub.d:C)] sub.d:C) sub.d:C) sub.d:C) sub.d:C)]
In the output: v Fields a and d are written as the top-level fields of a record only once when they have the same value, although they can occur multiple times in the input data set. v Fields b and c are written as subrecords of the same record in which the key fields contain identical values. Here is a small example of input records:
a:2 b:01:30:01 c:1985-07-07 d:C a:2 b:11:30:01 c:1985-07-07 d:C
The specified fields are combined into a single raw output field. You use the -field option to specify the name of the output field, and specify -type string to export to a string field. You can optionally save rejected records in a separate reject data set. The default behavior is to continue processing and report a count of failures.
[Figure: the field_export operator, with input schema inRec:* and output schema r:string | raw; outRec:*, with the exported fields dropped]
field_export: properties
Table 100. field_export Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1, and optionally a reject data set
Input interface schema        inRec:*
Output interface schema       r:string OR ustring OR raw; outRec:* without the exported fields
Transfer behavior             inRec -> outRec (exported fields are dropped)
Table 101. field_export Operator Options (continued) Option -schema Use -schema schema_name Specifies the name of the export schema. The schema specifies how input fields are packed into the exported string- or raw-type field. You must specify either -schema or -schemafile. -schemafile -schemafile file_name Specifies the name of a file containing the export schema. The schema specifies how input fields are packed into the exported string- or raw-type field. You must specify either -schemafile or -schema. -type -type string | ustring | raw Specifies the data type of the output field; this is the type to which the operator converts input data. The operator converts the input data to raw-type data by default.
The operator exports the first two input fields to a single string output field in which the two items are displayed as text separated by a comma. Here is the export schema that guides this operation:
record{text,delim=,,final_delim=none} (SN:int16;value:decimal[5,2];)
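A hedged sketch of a corresponding osh invocation; here the export schema shown above is assumed to be saved in a file named export.schema, and the output field name exported and the data set names are assumptions:

$ osh "field_export -field exported -type string -schemafile export.schema < inDS.ds > outDS.ds"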
[Figure: a sample record with fields value:379.31, SN:103, and time:00:09:16]
Note that the operator has reversed the order of the value and SN fields and combined their contents. The time field is transferred to the output with no change, as in the following diagram:
[Figure: an input record with value:271.79, SN:303, and time:00:09:32, and its output record in which the time field is unchanged]
[Figure: the field_import operator, with output schema field1:type1; ... fieldn:typen; outRec:*, with the raw or string field dropped]
field_import: properties
Table 102. field_import Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1, and optionally a reject data set
Input interface schema        r:raw OR string; inRec:*
Output interface schema       field1:type1; ...fieldn:typen; outRec:* with the string or raw-type field dropped
Transfer behavior             inRec -> outRec (the string- or raw-type field is dropped)
Table 103. field_import Operator Properties (continued) Option -saveRejects Use -saveRejects Specifies that the operator continues when it encounters a reject record and writes the record to an output reject data set. The default action of the operator is to continue and report a count of the failures to the message stream. This option is mutually exclusive with -failRejects. -schema -schema schema_name Specifies the name of schema that interprets the string or raw field's contents by converting them to another data type. You must specify either -schema or -schemafile. -schemafile -schemafile file_name Specifies the name of a file containing the schema that interprets the string or raw field's contents by converting them to another data type. You must specify either -schema or -schemafile.
Here is the import schema that guides the import of the 16-byte raw field into four integer output fields. The schema also assures that the contents of the raw[16] field are interpreted as binary integers:
record {binary,delim=none} (a:int32; b:int32; c:int32; d:int32; )
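A hedged sketch of a matching osh invocation, with the import schema above assumed to be saved in a file named import.schema; the -field option naming the input raw field is an assumption based on the operator description, as are the data set names:

$ osh "field_import -field rawfield -schemafile import.schema < inDS.ds > outDS.ds"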
rawfield:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 rawfield:04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 rawfield:08 08 08 08 08 08 08 08 08 08 08 08 08 08 08 08 rawfield:01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 rawfield:05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 rawfield:09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 09 rawfield:02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 rawfield:06 06 06 06 06 06 06 06 06 06 06 06 06 06 06 06 rawfield:03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 rawfield:07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07
x:00087.2 x:00004.8 x:00042.7 x:00091.3 x:00075.9 x:00081.3 x:00040.6 x:00051.5 x:00061.7 x:00015.2
Note that the operator has imported four bytes of binary data in the input raw field as one decimal value in the output. The input field x (decimal[6,1]) is transferred to the output with no change, as in the following diagram:
raw: 07 07 07 07 07 07 07 07
07 07 07 07 07 07 07 07
x:00015.2
In the example above each byte contributes the value shown below, for a total of 117901063:

byte 07 * 2^24 = 117440512
byte 07 * 2^16 =    458752
byte 07 * 2^8  =      1792
byte 07 * 2^0  =         7
Total            117901063
makesubrec
outRec:*; subrecname[]:subrec(a:vtype; b:vtype; n:vtype;)
makesubrec: properties
Table 104. makesubrec Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1
Input interface schema        inRec:*; a[ ]:vtype; b[ ]:vtype; ...n[ ]:vtype;
Output interface schema       outRec:*; subrecname[ ]:subrec(a:vtype; b:vtype; ... n:vtype;)
Transfer behavior             See "Transfer Behavior".
Transfer behavior
The input interface schema is as follows:
record ( inRec:*; a[]:vtype; b[]:vtype;...n[]:vtype; )
The inRec schema variable captures all the input record's fields except those that are combined into subrecords. In the interface, each field that is combined in a subrecord is a vector of indeterminate type. The output interface schema is as follows:
record ( outRec:*; subrecname[]:subrec(a:vtype; b:vtype;...n:vtype;) )
Subrecord length
The length of the subrecord vector created by this operator equals the length of the longest vector field from which it is created. If a variable-length vector field was used in subrecord creation, the subrecord vector is also of variable length. Vectors that are smaller than the largest combined vector are padded with default values: NULL for nullable fields and the corresponding type-dependent value for non-nullable fields. For example, in the following record, the vector field a has five elements and the vector field b has four elements.
a:[0:0 1:2 2:4 3:6 4:8] b:[0:1960-08-01 1:1960-11-23 2:1962-06-25 3:1961-08-18]
The makesubrec operator combines the fields into a subrecord named sub as follows:
sub:[0:(sub.a:0 sub.b:1960-08-01) 1:(sub.a:2 sub.b:1960-11-23) 2:(sub.a:4 sub.b:1962-06-25) 3:(sub.a:6 sub.b:1961-08-18) 4:(sub.a:8 sub.b:1-1-0001)]
Subfield b of the subrecord's fifth element, shown in boldface type, contains the data type's default value. When the makesubrec operator encounters mismatched vector lengths, it displays the following message:
When checking operator: Not all vectors are of the same length. field_nameA and field_nameB differ in length. Some fields will contain their default values or Null, if nullable
where field_nameA and field_nameB are the names of vector fields to be combined into a subrecord. You can suppress this message by means of the -variable option. This operator treats scalar components as if they were a vector of length 1. This allows scalar and vector components to be combined.
Table 105. makesubrec Operator Options (continued) -name -name field_name Specify the name of the field to include in the subrecord. You can specify multiple fields to be combined into a subrecord. For each field, specify the option followed by the name of the field to include. -subrecname -subrecname subrecname Specify the name of the subrecord into which you want to combine the fields specified by the -name option. -variable -variable When the operator combines vectors of unequal length, it pads fields and displays a message to this effect. Optionally specify this option to disable display of the message. See "Subrecord Length" .
Here are the input and output data set record schemas:
a[5]:uint8; b[4]:decimal[2,1]; c[4]:string[1] b[4]:decimal[2,1]; sub [5](a:uint8; c:string[1];)
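A minimal osh sketch that would produce this transformation, combining fields a and c into the subrecord sub; the data set names are assumptions:

$ osh "makesubrec -name a -name c -subrecname sub < inDS.ds > outDS.ds"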
Note that vector field c was shorter than vector field a and so was padded with the data type's default value when the operator combined it in a subrecord. The default value is shown in boldface type.
makevect
makevect: properties
Table 106. makevect Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1
Input interface schema        inRec:*; name0:vtype; name1:vtype; ... namen:vtype;
Output interface schema       outRec:*; name[n+1]:vtype;
Transfer behavior             See "Transfer Behavior".
Execution mode                parallel (default) or sequential
Transfer Behavior
The input interface schema is:
record ( inRec:*; name0:vtype; name1:vtype; ... namen:vtype; )
The inRec schema variable captures all the input record's fields except those that are combined into the vector. The data type of the input fields determines that of the output vector. All fields to be combined in a vector must have compatible data types so that InfoSphere DataStage can cast them to a common type. Data type casting is based on the following rules:
v Any integer, signed or unsigned, when compared to a floating-point type, is converted to floating-point.
v Comparisons within a general type convert the smaller to the larger size (sfloat to dfloat, uint8 to uint16, and so on).
v When signed and unsigned integers are compared, unsigned are converted to signed.
v Decimal, raw, string, time, date, and timestamp do not figure in type conversions. When any of these is compared to another type, makevect returns an error and terminates.
The output data set has the schema:
record ( outRec:*; name[n]:vtype;)
The outRec schema variable does not capture the entire input record. It does not include the fields that have been combined into the vector.
Non-consecutive fields
If a field between field_name0 and field_namen is missing from the input, the operator stops writing fields to a vector at the missing number and writes the remaining fields as top-level fields. For example, data with this input schema:
record ( a0:uint8; a1:uint8; a2:uint8; a4:uint8; a5:uint8; )
is combined as follows:
record ( a4:uint8; a5:uint8; a[3]:uint8; )
The operator combines fields a0, a1, and a2 in a vector and writes fields a4 and a5 as top-level fields.
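A minimal osh sketch for combining the consecutively numbered fields a0, a1, a2 into the vector a; the data set names, and the assumption that the -name option names the common field prefix, are illustrative only:

$ osh "makevect -name a < inDS.ds > outDS.ds"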
Note that the operator did not write the output records in the same order as the input records, although all fields appear in the correct order.
Here is the output schema of the same data after the makevect operation.
record ( a4:uint8; c:decimal[2,1]; a[3]:uint8;)
Note that the operator: v Stopped combining 'a' fields into a vector when it encountered a field that was not consecutively numbered. v Wrote the remaining, non-consecutively numbered field (a4) as a top-level one. v Transferred the unmatched field (c) without alteration. v Did not write the output records in the same order as the input records, although all fields appear in the correct order.
Input Data Set a0:0 a1:6 a2:0 a4:8 c:0.0 a0:20 a1:42 a2:16 a4:28 c:2.8 a0:40 a1:78 a2:32 a4:48 c:5.6 a0:5 a1:15 a2:4 a4:13 c:0.7 a0:25 a1:51 a2:20 a4:33 c:3.5 a0:45 a1:87 a2:36 a4:53 c:6.3 a0:10 a1:24 a2:8 a4:18 c:1.4 a0:30 a1:60 a2:24 a4:38 c:4.2 a0:15 a1:33 a2:12 a4:23 c:2.0 a0:35 a1:69 a2:28 a4:43 c:4.9
Output Data Set a4:8 c:0.0 a:[0:0 1:6 2:0] a4:28 c:2.8 a:[0:20 1:42 2:16] a4:48 c:5.6 a:[0:40 1:78 2:32] a4:13 c:0.7 a:[0:5 1:15 2:4] a4:23 c:2.0 a:[0:15 1:33 2:12] a4:33 c:3.5 a:[0:25 1:51 2:20] a4:53 c:6.3 a:[0:45 1:87 2:36] a4:18 c:1.4 a:[0:10 1:24 2:8] a4:38 c:4.2 a:[0:30 1:60 2:24] a4:43 c:4.9 a:[0:35 1:69 2:28]
Here are the input and output of a makevect operation on the field a:
promotesubrec
outRec:*; rec:*;
promotesubrec: properties
Table 108. promotesubrec Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1
Input interface schema        inRec:*; subrecname[ ]:subrec (rec:*);
Output interface schema       outRec:*; rec:*;
Transfer behavior             See below.
where inRec does not include the subrecord to be promoted. The output interface schema is as follows:
record ( outRec:*; rec:*; )
where outRec includes the same fields as inRec (top-level fields), and rec includes the fields of the subrecord.
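A minimal osh sketch that promotes the fields of the subrecord sub to top-level fields; the data set names and the -subrecname option usage are assumptions:

$ osh "promotesubrec -subrecname sub < inDS.ds > outDS.ds"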
splitsubrec
outRec:*;fielda[]:atype; fieldb[]:btype;...fieldn[]:ntype;
splitsubrec properties
Table 110. splitsubrec Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1
Input interface schema        inRec:*; name[ ]:subrec (a:atype; b:btype; ...n:ntype;)
Output interface schema       outRec:*; a[]:atype; b[]:btype; ...n[]:ntype;
Transfer behavior             See below.
The inRec schema variable omits the input field whose elements are separated and promoted. The subfields to be promoted can be of any data type except tagged or vector. The output interface schema is as follows:
Each new vector field has the same name and data type as its corresponding input subfield. This operator also works with scalar components, treating them as if they were vectors of length 1.
v The splitsubrec operator separates and promotes subfields a and b of subrecord s. v The output schema is as follows:
record ( a[5]:uint8; b[5]:decimal[2,1]; )
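A minimal osh sketch for this example, splitting subrecord s into the top-level vector fields a and b; the data set names and the -subrecname option usage are assumptions:

$ osh "splitsubrec -subrecname s < inDS.ds > outDS.ds"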
4:(s.a:2 s.b:2.2)] s:[0:(s.a:0 s.b:3.6) 1:(s.a:2 s.b:-5.3) 2:(s.a:4 s.b:-0.5) 3:(s.a:6 s.b:-3.2) 4:(s.a:8 s.b:4.6)] s:[0:(s.a:4 s.b:8.5) 1:(s.a:6 s.b:-6.6) 2:(s.a:8 s.b:6.5) 3:(s.a:10 s.b:8.9) 4:(s.a:0 s.b:4.4)]
inRec:*; name[n]:vtype;
splitvect
splitvect: properties
Table 112. splitvect Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1
Input interface schema        inRec:*; name[n]:vtype;
Output interface schema       outRec:*; name0:vtype; name1:vtype; namen-1:vtype;
Transfer behavior             See below.
Execution mode                parallel (default) or sequential
The inRec schema variable captures all the input record's fields except the vector field whose elements are separated from each other and promoted. The data type of this vector is vtype, which denotes any valid InfoSphere DataStage data type that can figure in a vector, that is, any data type except tagged and * (schema variable). This data type determines the data type of the output top-level fields. The vector must be of fixed length. The output data set has the schema:
record ( outRec:*; name0:vtype; name1:vtype; namen-1:vtype; )
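A minimal osh sketch that promotes the elements of the fixed-length vector a to top-level fields; the data set names and the -name option usage are assumptions:

$ osh "splitvect -name a < inDS.ds > outDS.ds"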
Input Data Set a:[0:0 1:6 2:0 3:2 4:8] a:[0:20 1:42 2:16 3:10 4:28] a:[0:40 1:78 2:32 3:18 4:48] a:[0:15 1:33 2:12 3:8 4:23] a:[0:5 1:15 2:4 3:4 4:13] a:[0:25 1:51 2:20 3:12 4:33] a:[0:45 1:87 2:36 3:20 4:53] a:[0:35 1:69 2:28 3:16 4:43] a:[0:10 1:24 2:8 3:6 4:18] a:[0:30 1:60 2:24 3:14 4:38]
Output Data Set a0:0 a1:6 a2:0 a3:2 a4:8 a0:20 a1:42 a2:16 a3:10 a4:28 a0:40 a1:78 a2:32 a3:18 a4:48 a0:10 a1:24 a2:8 a3:6 a4:18 a0:5 a1:15 a2:4 a3:4 a4:13 a0:30 a1:60 a2:24 a3:14 a4:38 a0:15 a1:33 a2:12 a3:8 a4:23 a0:35 a1:69 a2:28 a3:16 a4:43 a0:25 a1:51 a2:20 a3:12 a4:33 a0:45 a1:87 a2:36 a3:20 a4:53
Here are a sample input and output of the operator: Note that the operator did not write the output records in the same order as the input records, although all fields appear in the correct order.
You typically use tagged fields when importing data from a COBOL data file, if the COBOL data definition contains a REDEFINES statement. A COBOL REDEFINES statement specifies alternative data types for a single field. A tagged field has multiple data types, called tag cases. Each case of the tagged is defined with both a name and a data type. The following example shows a record schema that defines a tagged field that has three cases corresponding to three different data types:
record ( tagField:tagged ( aField:string; bField:int32; cField:sfloat; ); )
The content of a tagged field can be any one of the subfields defined for the tagged field. In this example, the content of tagField is one of the following: a variable-length string, a 32-bit integer, a single-precision floating-point value. A tagged field in a record can have only one item from the list. This case is called the active case of the tagged. "Dot" addressing refers to the components of a tagged field: For example, tagField.aField references the aField case, tagField.bField references the bField case, and so on. Tagged fields can contain subrecord fields. For example, the following record schema defines a tagged field made up of three subrecord fields:
record ( tagField:tagged ( aField:subrec( aField_s1:int8; aField_s2:uint32;); bField:subrec( bField_s1:string[10]; bField_s2:uint32;); cField:subrec( cField_s1:uint8; cField_s2:dfloat;); ); )
In this example, each subrecord is a case of the tagged. You reference cField_s1 of the cField case as tagField.cField.cField_s1. See the topic on InfoSphere DataStage data sets in the InfoSphere DataStage Parallel Job Developer Guide for more information on tagged fields.
tagbatch
tagbatch: properties
Table 114. tagbatch Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1
Input interface schema        inRec:*; tField:tagged(c0:subrec(rec:*); c1:subrec(rec:*); ...);
Output interface schema       outRec:*; c0_present:int8; c0_rec:*; c1_present:int8; c1_rec:*; ...
Transfer behavior             See below.
By default, the inRec schema variable does not include the original tagged field. The output interface schema is:
outRec:*; c0_present:int8; c0_rec:*; c1_present:int8; c1_rec:*; ...
"Example 2: The tagbatch Operator, Missing and Duplicate Cases" shows the behavior described in the three preceding paragraphs.
Table 115. tagbatch Operator Options (continued) Option -collation_sequence Use -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide/Collate_Intro.htm -ifDuplicateCase -ifDuplicateCase drop | fail | keepFirst | keepLast Controls how the operator handles two records with the same key value that have the same active case for the tagged field. Suboptions include: drop: Drop the output record. This means that no record is output for that key value. See "Example 2: The tagbatch Operator, Missing and Duplicate Cases". fail: Fail the operator when two or more records have the same active case. keepFirst: Default. The operator outputs the case value from the first record processed by the operator. keepLast: Output the case from the last record. See "Example 2: The tagbatch Operator, Missing and Duplicate Cases". -nobatch -nobatch Optionally specify that a record is output for every input record. The default is to place input records into groups based on key values and output them as a single record.
Table 115. tagbatch Operator Options (continued) Option -ifMissingCase Use -ifMissingCase continue | drop | fail Optionally specifies how the operator handles missing tag cases for a specified key field. By default, the operator outputs a record with the corresponding present field set to 0 and the field of the tag case set to their default values (0 for numeric fields, 0 length for variable-length strings, and so on). The suboptions are: continue: Default. Outputs a record with the corresponding present field set to 0 and the field of the tag case set to their default values (0 for numeric fields, 0 length for variable-length strings, and so on). drop: Drop the record. See "Example 2: The tagbatch Operator, Missing and Duplicate Cases" . fail: Fail the operator and step. -tag -tag tag_field Specifies the name of the tagged field in the input data set used by the operator. If you do not specify this option, the input data set should have only one tagged field. If it has more than one tagged field, the operator processes only the first one and issues a warning when encountering others.
In the following data set representations, the elements of a tagged field are enclosed in angle brackets (<>); the elements of a subrecord are enclosed in parentheses (()). Here are three tagged records whose key field is equal:
key:11 t:<t.A:(t.A.fname:booker t.A.lname:faubus)> key:11 t:<t.B:(t.B.income:27000)> key:11 t:<t.C:(t.C.birth:1962-02-06 t.C.retire:2024-02-06)>
The fields A_present, B_present, and C_present now precede each output field and are set to 1, since the field that follows each contains a value. Here is the osh command for this example:
$ osh "... tagbatch -key key ..."
Here are the input and output data sets without field names:
Input Data Set 11 <(booker faubus)> 11 <(27000)> 11 <(1962-02-06 2024-0206)> 22 <(marilyn hall)> 22 <(35000)> 22 <(1950-09-20 2010-0920)> 33 <(gerard devries)> 33 <(50000)> 33 <(1944-12-23 2009-1223)> 44 <(ophelia oliver)> 44 <(65000)> 44 <(1970-04-11 2035-0411)> 55 <(omar khayam)> 55 <(42000)> 55 <(1980-06-06 2040-0606)> Output Data Set 11 1 booker faubus 1 27000 1 1962-02-06 2024-02-06
Here is the input data set without field names. The income field is missing from those containing a -key whose value is 33. The corresponding fields are shown in boldface type. The income field occurs twice in those containing a key whose value is 44. The corresponding fields are shown in italic type.
11 11 11 22 22 22 33 33 44 <(booker faubus)> <(27000)> <(1962-02-06 2024-02-06)> <(marilyn hall)> <(35000)> <(1950-09-20 2010-09-20)> <(gerard devries)> <(1944-12-23 2009-12-23)> <(ophelia oliver)>
44 44 44 55 55
The operator has dropped the field containing a key whose value is 33 and has written the last value of the duplicate field (shown in italic type).
principal engineer 3 72000 1 1950-04-16 111-11- 111100 senior engineer 3 42000 1 1960-11-12 000-00-0000 208 1 senior marketer 2 66000 1 1959- 10-16 222-22-2222 456 1 VP sales 1 120000 1 1950-04-16 333-33-3333
The tagbatch operator has processed the input data set according to both the fname and lname key fields, because choosing only one of these as the key field gives incomplete results. If you specify only the fname field as the key, the second set of tagged fields whose fname is Rosalind is dropped from the output data set. The output data set, which is as follows, does not contain the data corresponding to Rosalind Fleur:
Jane Doe 1 1991-11-01 137 1 principal engineer 3 72000 1 1950-04John Doe 1 1992-03-12 136 1 senior engineer 3 42000 1 1960-11-12 Rosalind Elias 1 1994-07-14 208 1 senior marketer 2 66000 1 195916 111-11- 111100 000-00-0000 10-16 222-22- 2222
If you specify only the lname field as the key, the second set of tagged fields whose lname is Doe is dropped from output data set. The output data set, which follows, does not contain the data corresponding to John Doe.
Jane Doe 1 1991-11-01 137 1 principal engineer 3 72000 1 1950-04Rosalind Elias 1 1994-07-14 208 1 senior marketer 2 66000 1 1959Rosalind Fleur 1 1998-01-03 456 1 VP sales 1 120000 1 1950-04-16 16 111-11- 111100 10-16 222-22- 2222 333-33-3333
[Figure: the tagswitch operator writing output data set 0, output data set 1, and an optional other data set, each with schema outRec:*; rec:*]
tagswitch: properties
Table 116. tagswitch Operator Properties
Property                      Value
Number of input data sets     1
Number of output data sets    1 <= n, where n is the number of cases; 1 <= n+1, if the operator transfers unlisted cases
Input interface schema        inRec:*; tField:tagged(c0:subrec(rec:*); c1:subrec(rec:*); ..);
Output interface schema       Output data set N: outRec:*; rec:*; Optional other output data set: otherRec:*;
Transfer behavior             See below.
By default, the inRec schema variable does not include the original tagged field or fields. However, you can override this default by means of the modify operator to force the operator to retain the tagged field, so that it is transferred from the input data set to the output data sets. The output interface schema is:
Output dataset N: outRec:*; rec:*; Optional other output dataset: otherRec:*;
Top-level fields are always copied from the input data set to all output data sets.
You can override the default and explicitly list tag cases to be processed by means of the -tag option. If you choose the -tag option, you can control how the operator behaves when it encounters an unlisted case.
Table 117. tagswitch Operator Options (continued) Option -tag Use -tag tag_field Specifies the name of the tagged field in the input data set used by the operator. If you do not specify a -tag, the operator acts on the first tag field it encounters.
tagswitch
Here are the input data set and output data sets without field names:
Input Data Set 11 11 11 22 22 22 33 33 33 44 44 44 55 55 55 <(booker faubus)> <(27000)> <(1962-02-06 2024-02-06)> <(marilyn hall)> <(35000)> <(1950-09-20 2010-09-20)> <(gerard devries)> <(50000)> <(1944-12-23 2009-12-23)> <(ophelia oliver)> <(65000)> <(1970-04-11 2035-04-11)> <(omar khayam)> <(42000)> <(1980-06-06 2040-06-06)> Output Data Sets Data Set # 1 (Case A) 11 booker faubus 22 marilyn hall 33 gerard devries 44 ophelia oliver 55 omar khayam Data Set # 2 (Case B) 11 27000 55 42000 22 35000 44 60000 44 65000 Data Set # 3 (Case C) 11 1962-02-06 2024-02-06 55 1980-06-06 2040-06-06 33 1944-12-23 2009-12-23 22 1950-09-20 2010-09-20 44 1970-04-11 2035-04-11
Each case (A, B, and C) has been promoted to a record. Each subrecord of the case has been promoted to a top-level field of the record.
tagswitch
Table 118. Input and output schemas Input Schema key:int32; t:tagged (A:subrec(fname:string; lname:string; ); B:subrec(income:int32; ); C:subrec(birth:date; retire:date; ); ) Output Schemas Data set # 1 (Case B) key:int32; income:int32; Data set # 2 (other) key:int32; t:tagged (A:subrec(fname:string; lname:string; ); B:subrec(income:int32; ); C:subrec(birth:date; retire:date; ); )
Note that the input schema is copied intact to the output schema. Case B has not been removed from the output schema. If it were, the remaining tags would have to be rematched. Here is the osh command for this example:
osh "tagswitch -case B -ifUnlistedCase other < in.ds > out0.ds 1> out1.ds >"
[Figure: a sequential tsort operator sorting a list of names into alphabetical order]
Typically, you use a parallel tsort operator as part of a series of operators that requires sorted partitions. For example, you can combine a sort operator with an operator that removes duplicate records from a data set. After the partitions of a data set are sorted, duplicate records in a partition are adjacent. To perform a parallel sort, you insert a partitioner in front of the tsort operator. This lets you control how the records of a data set are partitioned before the sort. For example, you could hash partition records by a name field, so that all records whose name field begins with the same one-, two-, or three-letter sequence are assigned to the same partition. See "Example: Using a Parallel tsort Operator" for more information. If you combine a parallel sort with a sort merge collector, you can perform a total sort. A totally sorted data set output is completely ordered, meaning the records in each partition of the output data set are ordered, and the partitions themselves are ordered. See "Performing a Total Sort" for more information.
If the input to tsort was partitioned using a range partitioner, you can use a partition-sorted data set as input to a sequential operator using an ordered collector to preserve the sort order of the data set. Using this collection method causes all the records from the first partition of a data set to be read first, then all records from the second partition, and so on. If the input to tsort was partitioned using a hash partitioner, you can use a partition-sorted data set as input to a sequential operator using a sortmerge collector with the keys. Using this collection method causes the records to be read in sort order.
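For example, here is a hedged osh sketch of a hash partition sort whose output is then read in key order by a sequential operator through a sortmerge collector; op stands for any sequential operator, and the field and data set names are assumptions:

$ osh "hash -key lName < inDS.ds | tsort -key lName | sortmerge -key lName | op [seq] > outDS.ds"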
[Figure: an input data set with fields fName, lName, and age, and the results of sorting it on different combinations of sorting keys]
This figure also shows the results of three sorts using different combinations of sorting keys. In this figure, the fName and lName fields are string fields and the age field is an integer. Here are the tsort defaults and optional settings:
v By default, the tsort operator uses a case-sensitive algorithm for sorting. This means that uppercase strings appear before lowercase strings in a sorted data set. You can use an option to the tsort operator to select case-insensitive sorting on string fields.
v By default, the tsort operator uses ascending sort order, so that smaller values appear before larger values in the sorted data set. You can use an option to the tsort operator to perform descending sorting, so that larger values appear before smaller values in the sorted data set.
v By default, the tsort operator uses the ASCII collating sequence. You can also set an option to tsort to specify EBCDIC.
v By default, nulls sort low. You can set tsort to specify sorting nulls high.
tsort: properties
Table 120. tsort Operator Properties
Property                                         Value
Number of input data sets                        1
Number of output data sets                       1
Input interface schema                           sortRec:*; the dynamic input interface schema lets you specify as many input key fields as are required by your sort operation
Output interface schema                          sortRec:*
Transfer behavior                                sortRec -> sortRec without record modification
Execution mode                                   parallel (default) or sequential
Partitioning method                              any (parallel mode)
Collection method                                any (sequential mode)
Preserve-partitioning flag in output data set    set
Composite operator                               no
Combinable operator                              yes
You must use -key to specify at least one sorting key to the operator.
Table 121. tsort Operator Options Option -key Use -key field [ci | cs] [-ebcdic] [-nulls first | last] [-asc | -desc] [-sorted | -clustered] [-param params] Specifies a key field for the sort. The first -key defines the primary key field for the sort; lower-priority key fields are supplied on subsequent key specifications. -key requires that field be a field of the input data set. -ci | -cs are optional arguments for specifying case-sensitive or case insensitive sorting. By default, the operator does case-sensitive sorting. This means that uppercase strings appear before lowercase strings in a sorted data set. You can override this default to perform case-insensitive sorting on string fields only. -asc| -desc are optional arguments for specifying ascending or descending sorting. By default, the operator uses ascending sort order, so that smaller values appear before larger values in the sorted data set. You can specify descending sorting order instead, so that larger values appear before smaller values in the sorted data set. -ebcdic (string fields only) specifies to use EBCDIC collating sequence for string fields. Note that InfoSphere DataStage stores strings as ASCII text; this property only controls the collating sequence of the string. For example, using the EBCDIC collating sequence, lowercase letters sort before uppercase letters (unless you specify the -ci option to select case-insensitive sorting). Also, the digits 0-9 sort after alphabetic characters. In the default ASCII collating sequence used by the operator, numbers come first, followed by uppercase, then lowercase letters. -sorted specifies that input records are already sorted by this field. The operator then sorts on secondary key fields, if any. This option can increase the speed of the sort and reduce the amount of temporary disk space when your records are already sorted by the primary key field(s) because you only need to sort your data on the secondary key field(s). -sorted is mutually exclusive with -clustered; if any sorting key specifies -sorted, no key can specify -clustered. continued
Table 121. tsort Operator Options (continued) Option -key (continued) Use If you specify -sorted for all sorting key fields, the operator verifies that the input data set is correctly sorted, but does not perform any sorting. If the input data set is not correctly sorted by the specified keys, the operator fails. -clustered specifies that input records are already grouped by this field, but not sorted. The operator then sorts on any secondary key fields. This option is useful when your records are already grouped by the primary key field(s), but not necessarily sorted, and you want to sort your data only on the secondary key field(s) within each group. -clustered is mutually exclusive with -sorted; if any sorting key specifies -clustered, no key can specify -sorted. -nulls specifies whether null values should be sorted first or last. The default is first. The -param suboption allows you to specify extra parameters for a field. Specify parameters using property =value pairs separated by commas. -collation_sequence -collation_sequence locale |collation_file_pathname | OFF This option determines how your string data is sorted. You can: Specify a predefined IBM ICU locale Write your own collation sequence using ICU syntax, and supply its collation_file_pathname Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.htm -flagCluster -flagCluster Tells the operator to create the int8 field clusterKeyChange in each output record. The clusterKeyChange field is set to 1 for the first record in each group where groups are defined by the -sorted or -clustered argument to -key. Subsequent records in the group have the clusterKeyChange field set to 0. You must specify at least one sorting key field that uses either -sorted or -clustered to use -flagCluster, otherwise the operator ignores this option. -flagKey Optionally specify whether to generate a flag field that identifies the key-value changes in output.
Table 121. tsort Operator Options (continued) Option -memory Use -memory num_megabytes Causes the operator to restrict itself to num_megabytes megabytes of virtual memory on a processing node. -memory requires that 1 < num_megabytes < the amount of virtual memory available on any processing node. We recommend that num_megabytes be smaller than the amount of physical memory on a processing node. -stable -stable Specifies that this sort is stable. A stable sort guarantees not to rearrange records that are already sorted properly in a data set. The default sorting method is unstable. In an unstable sort, no prior ordering of records is guaranteed to be preserved by the sorting operation. -stats -stats Configures tsort to generate output statistics about the sorting operation. -unique -unique Specifies that if multiple records have identical sorting key values, only one record is retained. If -stable is set, the first record is retained.
[Figure: data flow for this example; unSortedDS.ds is read by a tsort operator within a single step and written to sortedDS.ds]
This step uses a sequential tsort operator to completely sort the records of an input data set. The primary key field is a, the secondary sorting key field is e. Since the tsort operator runs by default in parallel, you use the [seq] framework argument to configure the operator to run sequentially. Shown below is the osh command line for this step:
$ osh "tsort -key a -key e -stable [seq] < unSortedDS.ds > sortedDS.ds "
InfoSphere DataStage supplies a hash partitioner operator that allows you to hash records by one or more fields. See "The hash Partitioner" for more information on the hash operator. You can also use any one of the supplied InfoSphere DataStage partitioning methods. The following example is a modification of the previous example, "Example: Using a Sequential tsort Operator" , to execute the tsort operator in parallel using a hash partitioner operator. In this example, the hash operator partitions records using the integer field a, the primary sorting key. Thus all records containing the same value for field a will be assigned to the same partition. The figure below shows this example:
[Figure: data flow for this example; unSortedDS.ds is read into a step containing a hash partitioner followed by a tsort operator, and the result is written to sortedDS.ds]
Shown below is the osh command corresponding to this example:
$ osh " hash -key a -key e < unSortedDS.ds | tsort -key a -key e > sortedDS.ds"
In contrast to a partition sort, you can also use the tsort operator to perform a total sort on a parallel data set. A totally sorted parallel data set means the records in each partition of the data set are ordered, and the partitions themselves are ordered.
The following figure shows a total sort performed on a data set with two partitions:
[Figure: two unsorted input partitions (Dick, Tom, Mary, Bob, Stacy, Nick and Bob, Sue, Ted, Jack, Harry, Patsy) and the totally sorted output partitions (Bob, Bob, Dick, Harry, Jack, Mary and Nick, Patsy, Stacy, Sue, Ted, Tom)]
In this example, the partitions of the output data set contain sorted records, and the partitions themselves are sorted. A total sort requires that all similar and duplicate records be located in the same partition of the data set. Similarity is based on the key fields in a record. Because each partition of a data set is sorted by a single processing node, another requirement must also be met before a total sort can occur. Not only must similar records be in the same partition, but the partitions themselves should be approximately equal in size so that no one node becomes a processing bottleneck. To meet these two requirements, use the range partitioner on the input data set before performing the actual sort. The range partitioner guarantees that all records with the same key fields are in the same partition, and calculates partition boundaries based on the key fields, in order to evenly distribute records across all partitions. See the next section for more information on using the range partitioner. You need to perform a total sort only when your job requires a completely ordered data set, either as output or as a preprocessing step. For example, you might want to perform a total sort on a mailing list so that the list is sorted by zip code, then by last name within zip code. For most jobs, a partition sort is the correct sorting approach. For example, if you are sorting a data set as a preprocessing step to remove duplicate records, you use a hash partitioner with a partition sort to partition the data set by the key fields; a total sort is unnecessary. A hash partitioner assigns records with the same key fields to the same partition, then sorts the partition. A second operator
can compare adjacent records in a partition to determine if they have the same key fields to remove any duplicates.
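As a sketch of that flow, the following command hash-partitions and sorts on the key and then removes duplicates; it assumes the remdup operator is used as the second operator and that field a is the duplicate-identifying key:
$ osh "hash -key a < inDS.ds | tsort -key a | remdup -key a > noDupsDS.ds"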
[Figure: two-step total sort; step 1 uses a sample operator, a sequential tsort (-key a), and writerangemap to create the sorted sample (range map); step 2 uses the range partitioner followed by a parallel tsort (-key a) to produce sortedDS.ds]
As the figure shows, this example uses two steps. The first step creates the sorted sample of the input data set used to create the range map required by the range partitioner. The second step performs the parallel sort using the range partitioner. InfoSphere DataStage supplies the UNIX command line utility, makerangemap, that performs the first of these two steps for you to create the sorted, sampled file, or range map. See "The range Partitioner" for more information on makerangemap.
Shown below are the commands for the two steps shown above using makerangemap:
$ makerangemap -rangemap sampledData -key a -key e inDS.ds
$ osh "range -sample sampledData -key a -key e < inDS.ds | tsort -key a -key e > sortDS.ds"
[Figure: sequential and parallel execution of the psort operator; run sequentially, psort produces a single, completely sorted output from an unsorted input; run in parallel, each psort instance sorts only its own partition]
Typically you use a parallel psort operator as part of a series of operators that requires sorted partitions. For example, you can combine a sort operator with an operator that removes duplicate records from a data set. After the partitions of a data set are sorted, duplicate records are adjacent in a partition. In order to perform a parallel sort, you specify a partitioning method to the operator, enabling you to control how the records of a data set are partitioned before the sort. For example, you could partition records by a name field, so that all records whose name field begins with the same one-, two-, or three-letter sequence are assigned to the same partition. See "Example: Using a Parallel Partition Sort Operator" for more information. If you combine a parallel sort with a range partitioner, you can perform a total sort. A totally sorted data set output is completely ordered. See "Performing a Total Sort" for more information. The example shown above for a parallel psort operator illustrates an input data set and an output data set containing the same number of partitions. However, the number of partitions in the output data set is determined by the number of processing nodes in your system configured to run the psort operator. The psort operator repartitions the input data set such that the output data set has the same number of partitions as the number of processing nodes. See the next section for information on configuring the psort operator.
[Figure: a sample data set with fName, lName, and age fields (primary key = lName), shown together with the results of three sorts that use different combinations of these fields as sorting keys]
This figure also shows the results of three sorts using different combinations of sorting keys. In this figure, the lName field represents a string field and the age field represents an integer. By default, the psort operator uses a case-sensitive algorithm for sorting. This means that uppercase strings appear before lowercase strings in a sorted data set. You can use an option to the psort operator to select case-insensitive sorting. You can use the member function APT_PartitionSortOperator::setKey() to override this default, to perform case-insensitive sorting on string fields. By default, the psort operator APT_PartitionSortOperator uses ascending sort order, so that smaller values appear before larger values in the sorted data set. You can use an option to the psort operator to select descending sorting.
[Figure: the psort operator, with input interface schema inRec:*; and output interface schema outRec:*;]
psort: properties
Table 122. psort Operator Properties
Property                                        Value
Number of input data sets                       1
Number of output data sets                      1
Input interface schema                          inRec:*; The dynamic input interface schema lets you specify as many input key fields as are required by your sort operation.
Output interface schema                         outRec:*;
Transfer behavior                               inRec -> outRec without record modification
Execution mode                                  sequential or parallel
Table 122. psort Operator Properties (continued)
Property                                        Value
Partitioning method                             parallel mode: any
Collection method                               sequential mode: any
Preserve-partitioning flag in output data set   set
Composite operator                              no
You must use -key to specify at least one sorting key to the operator. You use the -part option to configure the operator to run in parallel, and the -seq option to specify sequential operation. If you include the -ebcdic option, you must also include the -sorter option with a value of syncsort. When you do not include the -sorter option, the default sort is unix which is incompatible with the EBCDIC collation sequence. Example usage:
psort -sorter syncsort -key a -ebcdic
Table 123. psort Operator Options Option -key Use -key field_name [-ci | -cs] [-asc | -desc] [-ebcdic] If the -ebcdic suboption is specified, you must also include the -sorter option with a value of syncsort. The -key option specifies a key field for the sort. The first -key option defines the primary key field for the sort; lower-priority key fields are supplied on subsequent -key specifications. You must specify this option to psort at least once. -key requires that field_name be a field of the input data set. The data type of the field must be one of the following data types: int8 , int16 , int32 , int64, uint8, uint16, uint32 , uint64 sfloat, dfloat string[n] , where n is an integer literal specifying the string length -ci or -cs are optional arguments for specifying case-sensitive or case-insensitive sorting. By default, the operator uses a case-sensitive algorithm for sorting. This means that uppercase strings appear before lowercase strings in a sorted data set. You can override this default to perform case-insensitive sorting on string fields. -asc or -desc specify optional arguments for specifying ascending or descending sorting By default, the operator uses ascending sort order, so that smaller values appear before larger values in the sorted data set. You can use descending sorting order as well, so that larger values appear before smaller values in the sorted data set. -ebcdic (string fields only) specifies to use EBCDIC collating sequence for string fields. Note that InfoSphere DataStage stores strings as ASCII text; this property only controls the collating sequence of the string. If you include the -ebcdic option, you must also include the -sorter option with a value of syncsort. When you do not include the -sorter, the default sort is unix which is incompatible with the EBCDIC collation sequence. When you use the EBCDIC collating sequence, lowercase letters sort before upper-case letters (unless you specify the -ci option to select case-insensitive sorting). Also, the digits 0-9 sort after alphabetic characters. In the default ASCII collating sequence used by the operator, numbers come first, followed by uppercase, then lowercase letters. -extraOpts -extraOpts syncsort_options Specifies command-line options passed directly to SyncSort. syncsort_options contains a list of SyncSort options just as you would normally type them on the SyncSort command line.
Table 123. psort Operator Options (continued) Option -memory Use -memory num_megabytes Causes the operator to restrict itself to num_megabytes megabytes of virtual memory on a processing node. -memory requires that 1 < num_megabytes < the amount of virtual memory available on any processing node. We recommend that num_megabytes be smaller than the amount of physical memory on a processing node. -part -part partitioner This is a deprecated option. It is included for backward compatibility. -seq -seq This is a deprecated option. It is included for backward compatibility. -sorter -sorter unix | syncsort Specifies the sorting utility used by the operator. The default is unix, corresponding to the UNIX sort utility. -stable -stable Specifies that this sort is stable. A stable sort guarantees not to rearrange records that are already sorted properly in a data set. The default sorting method is unstable. In an unstable sort, no prior ordering of records is guaranteed to be preserved by the sorting operation, but processing might be slightly faster. -stats -stats Configures psort to generate output statistics about the sorting operation and to print them to the screen. -unique -unique Specifies that if multiple records have identical sorting key values, only one record is retained. If -stable is set, then the first record is retained. -workspace -workspace workspace Optionally supply a string indicating the workspace directory to be used by the sorter.
[Figure: data flow for this example; unSortedDS.ds is read by a psort operator within a single step and written to sortedDS.ds]
This step uses a sequential psort operator to completely sort the records of an input data set. The primary key field is a; the secondary sorting key field is e. By default, the psort operator executes sequentially; you do not have to perform any configuration actions. Note that case-sensitive and ascending sort are the defaults. Shown below is the osh command line for this step:
$ osh "psort -key a -key e -stable [seq] < unSortedDS.ds > sortedDS.ds "
InfoSphere DataStage supplies a hash partitioner operator that allows you to hash records by one or more fields. See "The hash Partitioner" for more information on the hash operator. You can also use any one of the supplied InfoSphere DataStage partitioning methods. The following example is a modification of the previous example, "Example: Using a Sequential Partition Sort Operator" , to execute the psort operator in parallel using a hash partitioner operator. In this example, the hash operator partitions records using the integer field a, the primary sorting key. Therefore, all records containing the same value for field a are assigned to the same partition. The figure below shows this example:
[Figure: data flow for this example; unSortedDS.ds is read into a step containing a hash partitioner followed by a psort operator, and the result is written to sortedDS.ds]
To configure the psort operator in osh to execute in parallel, use the [par] annotation. Shown below is the osh command line for this step:
$ osh " hash -key a -key e < unSortedDS.ds | psort -key a -key e [par] > sortedDS.ds"
[Figure: a total sort performed on a data set with two partitions; the unsorted input partitions (Dick, Tom, Mary, Bob, Stacy, Nick and Bob, Sue, Ted, Jack, Harry, Patsy) become ordered, sorted output partitions (Bob, Bob, Dick, Harry, Jack, Mary and Nick, Patsy, Stacy, Sue, Ted, Tom)]
In this example, the partitions of the output data set contain sorted records, and the partitions themselves are sorted. As you can see in this example, a total sort requires that all similar and duplicate records are located in the same partition of the data set. Similarity is based on the key fields in a record. Because each partition of a data set is sorted by a single processing node, another requirement must also be met before a total sort can occur. Not only must similar records be in the same partition, but the partitions themselves should be approximately equal in size so that no one node becomes a processing bottleneck. To meet these two requirements, you use the range partitioner on the input data set before performing the actual sort. The range partitioner guarantees that all records with the same key fields are in the same partition, but it does more. The range partitioner also calculates partition boundaries, based on the sorting keys, in order to evenly distribute records to the partitions. All records with sorting keys values between the partition boundaries are assigned to the same partition so that the partitions are ordered and that the partitions are approximately the same size. See the next section for more information on using the range partitioning operator. You need to perform a total sort only when your job requires a completely ordered data set, either as output or as a preprocessing step. For example, you might want to perform a total sort on a mailing list so that the list is sorted by zip code, then by last name within zip code. For most jobs, a partition sort is the correct sorting component. For example, if you are sorting a data set as a preprocessing step to remove duplicate records, you use a hash partitioner with a partition sort to partition the data set by the sorting key fields; a total sort is unnecessary. A hash partitioner assigns
records with the same sorting key fields to the same partition, then sorts the partition. A second operator can compare adjacent records in a partition to determine if they have the same key fields to remove any duplicates.
Range partitioning
The range partitioner guarantees that all records with the same key field values are in the same partition and it creates partitions that are approximately equal in size so that all processing nodes perform an equal amount of work when performing the sort. In order to do so, the range partitioner must determine distinct partition boundaries and assign records to the correct partition. To use a range partitioner, you first sample the input data set to determine the distribution of records based on the sorting keys. From this sample, the range partitioner determines the partition boundaries of the data set. The range partitioner then repartitions the entire input data set into approximately equal-sized partitions in which similar records are in the same partition. The following example shows a data set with two partitions:
[Figure: an input data set with two partitions, before range partitioning]
This figure shows range partitioning for an input data set with four partitions and an output data set with four:
[Figure: range partitioning of an input data set with four partitions into an output data set with four approximately equal-sized, ordered partitions]
In this case, the range partitioner calculated the correct partition size and boundaries, then assigned each record to the correct partition based on the sorting key fields of each record. See "The range Partitioner" for more information on range partitioning.
[Figure: two-step total sort with psort; step 1 uses a sample operator, a sequential psort (-key a), and writerangemap to create the sorted sample (range map); step 2 uses the range partitioner followed by a parallel psort (-key a) to produce sortedDS.ds]
As you can see in this figure, this example uses two steps. The first step creates the sorted sample of the input data set used to create the range map required by the range partitioner. The second step performs the parallel sort using the range partitioner and the psort operator. InfoSphere DataStage supplies the UNIX command line utility, makerangemap, that performs the first of these two steps for you to create the sorted, sampled file, or range map. See "The range Partitioner" for more information on makerangemap.
Shown below are the commands for the two steps shown above using makerangemap:
$ makerangemap -rangemap sampledData -key a -key e inDS.ds
$ osh "range -sample sampledData -key a -key e < inDS.ds | psort -key a -key e > srtDS.ds"
[Figure: join operator inputs; the innerjoin, leftouterjoin, and rightouterjoin operators accept two or more input data sets (input0, input1, ... inputn), while the fullouterjoin operator accepts exactly two (input0 and input1)]
Join: properties
Table 124. Join Operators Properties
Property                                  Value
Number of input data sets                 2 or more for the innerjoin, leftouterjoin, and rightouterjoin operators; exactly 2 for the fullouterjoin operator
Number of output data sets                1
Input interface schema                    key1a:type1a; ... key1n:type1n; input0Rec:*;
                                          key1a:type1a; ... key1n:type1n; input1Rec:*; ...
Output interface schema                   leftRec:*; rightRec:*;
Transfer behavior from source to output   leftRec -> leftRec; rightRec -> rightRec; with modifications
Composite operator                        yes
Input partitioning style                  keys in same partition
Transfer behavior
The join operators: v Transfer the schemas of the input data sets to the output and catenates them to produce a "join" data set. v Transfer the schema variables of the input data sets to the corresponding output schema variables. The effect is a catenation of the fields of the schema variables of the input data sets. For the fullouterjoin operator, duplicate field names are copied from the left and right data sets as leftRec_field and rightRec_field, where field is the name of the duplicated key field. v Transfer duplicate records to the output, when appropriate. v The innerjoin and leftouterjoin operators drop the key fields from the right data set. v The rightouterjoin operator drops the key fields from the left data set.
Memory use
For the right data set, for each value of the key group, the collection of records with that key value must fit comfortably in memory to prevent paging from slowing performance.
Table 125. Comparison of Joins, Lookup, and Merge (continued)
Number and names of inputs. Joins: 2 or more inputs. Lookup: 1 source and N lookup tables. Merge: 1 master table and N update tables.
Handling of duplicates in primary input. Joins: OK, produces a cross-product. Lookup: OK. Merge: Warning given. Duplicate will be an unmatched primary.
Handling of duplicates in secondary input. Joins: OK, produces a cross-product. Lookup: Warning given. The second lookup table entry is ignored. Merge: OK only when N = 1.
Options on unmatched primary. Joins: NONE. Lookup: Fail, continue, drop, or reject. Fail is the default. Merge: Keep or drop. Keep is the default.
Options on unmatched secondary. Joins: NONE. Lookup: NONE. Merge: Capture in reject sets.
On match, secondary entries are. Joins: reusable. Lookup: reusable. Merge: consumed.
Number of outputs. Joins: 1. Lookup: 1 output and optionally 1 reject. Merge: 1 output and 1 reject for each update table.
Captured in reject sets. Joins: Does not apply.
innerjoin operator
The innerjoin operator transfers records from both input data sets whose key fields contain equal values to the output data set. Records whose key fields do not contain equal values are dropped.
There is one required option, -key. You can specify it multiple times. Only top-level, non-vector fields can be keys.
Table 126. innerjoin Operator option Option -key Use -key field_name [-cs | ci] [-param params ] Specify the name of the key field or fields. You can specify multiple keys. For each one, specify the -key option and supply the key's name. By default, InfoSphere DataStage interprets the value of key fields in a case-sensitive manner. Specify -ci to override this default. Do so for each key you choose, for example: -key A -ci -key B -ci The -param suboption allows you to specify extra parameters for a field. Specify parameters using property = value pairs separated by commas. -collation_ sequence -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.html
left data set:
status field    price field
Sold            125
Sold            213
Offered         378
Pending         575
Pending         649
Offered         777
Offered         908
Pending         908

right data set:
price field     id field
113             NI6325
125             BR9658
285             CZ2538
628             RU5713
668             SA5680
777             JA1081
908             DE1911
908             FR2081
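A corresponding osh command for an innerjoin on these two data sets, using the price field as the key, might look like the following sketch; the data set names and the use of two input redirections for the left and right inputs are illustrative assumptions:
$ osh "innerjoin -key price < leftDS.ds < rightDS.ds > joinedDS.ds"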
leftouterjoin operator
The leftouterjoin operator transfers all values from the left data set and transfers values from the right data set only where key fields match. The operator drops the key field from the right data set. Otherwise, the operator writes default values.
There is one required option, -key. You can specify it multiple times. Only top-level, non-vector fields can be keys.
Table 127. leftouterjoin Operator Option Option -key Use -key field_name [-ci or -cs] [-param params ] Specify the name of the key field or fields. You can specify multiple keys. For each one, specify the -key option and supply the key's name. By default, InfoSphere DataStage interprets the value of key fields in a case-sensitive manner. Specify -ci to override this default. Do so for each key you choose, for example: -key A -ci -key B -ci The -param suboption allows you to specify extra parameters for a field. Specify parameters using property = value pairs separated by commas. -collation_ sequence -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, see reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.html
Example
In this example, the leftouterjoin operation is performed on two data sets using the price field as the key field. Equal values in the price field of each data set are shown with double underscores. Here are the input data sets:
left data set:
status field    price field
Sold            125
Sold            213
Offered         378
Pending         575
Pending         649
Offered         777
Offered         908
Pending         908

right data set:
price field     id field
113             NI6325
125             BR9658
285             CZ2538
628             RU5713
668             SA5680
777             JA1081
908             DE1911
908             FR2081
Here are the results of the leftouterjoin operation on the left and right data sets.
status field    price field    id field
Sold            125            BR9658
Sold            213            -
Offered         378            -
Pending         575            -
Pending         649            -
Offered         777            JA1081
Offered         908            DE1911
Offered         908            FR2081
Pending         908            DE1911
Pending         908            FR2081
Here is the syntax for the example shown above in an osh command:
$ osh "... leftouterjoin -key price ..."
rightouterjoin operator
The rightouterjoin operator transfers all values from the right data set and transfers values from the left data set only where key fields match. The operator drops the key field from the left data set. Otherwise, the operator writes default values.
There is one required option, -key. You can specify it multiple times. Only top-level, non-vector fields can be keys.
Table 128. rightouterjoin Operator Options Option -key Use -key field_name [-ci or -cs] [-param params ] Specify the name of the key field or fields. You can specify multiple keys. For each one, specify the -key option and supply the key's name. By default, InfoSphere DataStage interprets the value of key fields in a case-sensitive manner. Specify -ci to override this default. Do so for each key you choose, for example: -key A -ci -key B -ci The -param suboption allows you to specify extra parameters for a field. Specify parameters using property = value pairs separated by commas. -collation_sequence -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.html
left data set:
status field    price field
Sold            125
Sold            213
Offered         378
Pending         575
Pending         649
Offered         777
Offered         908
Pending         908

right data set:
price field     id field
113             NI6325
125             BR9658
285             CZ2538
628             RU5713
668             SA5680
777             JA1081
908             DE1911
908             FR2081
Here are the results of the rightouterjoin operation on the left and right data sets.
status field    price field    id field
-               113            NI6325
Sold            125            BR9658
-               285            CZ2538
-               628            RU5713
-               668            SA5680
Offered         777            JA1081
Offered         908            DE1911
Offered         908            FR2081
Pending         908            DE1911
Pending         908            FR2081
Here is the syntax for the example shown above in an osh command:
$ osh "... rightouterjoin -key price ..."
fullouterjoin operator
The fullouterjoin operator transfers records whose key fields are equal in both input data sets to the output data set. It also transfers records whose key fields contain unequal values from both input data sets to the output data set. The output data set: v Contains all input records, except where records match; in this case it contains the cross-product of each set of records with an equal key v Contains all input fields v Renames identical field names of the input data sets as follows: leftRec_field (left data set) and rightRec_field (right data set), where field is the field name v Supplies default values to the output data set, where values are not equal
There is one required option, -key. You can specify it multiple times. Only top-level, non-vector fields can be keys.
Table 129. fullouterjoin Operator Option Option -key Use -key field_name [-ci or -cs] [-param params ] Specify the name of the key field or fields. You can specify multiple keys. For each one, specify the -key option and supply the key's name. By default, InfoSphere DataStage interprets the value of key fields in a case-sensitive manner. Specify -ci to override this default. Do so for each key you choose, for example: -key A -ci -key B -ci The -param suboption allows you to specify extra parameters for a field. Specify parameters using property = value pairs separated by commas. -collation_sequence -collation_sequence locale | collation_file_pathname | OFF This option determines how your string data is sorted. You can: v Specify a predefined IBM ICU locale v Write your own collation sequence using ICU syntax, and supply its collation_file_pathname v Specify OFF so that string comparisons are made using Unicode code-point value order, independent of any locale or custom sequence. By default, InfoSphere DataStage sorts strings using byte-wise comparisons. For more information, reference this IBM ICU site: http://oss.software.ibm.com/icu/userguide /Collate_Intro.html
left data set:
status field    price field
Sold            125
Sold            213
Offered         378
Pending         575
Pending         649
Offered         777
Offered         908
Pending         908

right data set:
price field     id field
113             NI6325
125             BR9658
285             CZ2538
628             RU5713
668             SA5680
777             JA1081
908             DE1911
908             FR2081
Here are the results of the fullouterjoin operation on the left and right data sets.
status field    leftRec_price field    rightRec_price field    id field
-               -                      113                     NI6325
Sold            125                    125                     BR9658
Sold            213                    -                       -
-               -                      285                     CZ2538
Offered         378                    -                       -
Pending         575                    -                       -
-               -                      628                     RU5713
Pending         649                    -                       -
-               -                      668                     SA5680
Offered         777                    777                     JA1081
Offered         908                    908                     DE1911
Offered         908                    908                     FR2081
Pending         908                    908                     DE1911
Pending         908                    908                     FR2081
The syntax for the example shown above in an osh command is:
$ osh "... fullouterjoin -key price ..."
Procedure
1. The DataDirect ODBC drivers are installed in the directory $dshome/../branded_odbc. The shared library path is modified to include $dshome/../branded_odbc/lib. The ODBCINI environment variable is set to $dshome/.odbc.ini.
2. Start the external datasource.
3. Add $APT_ORCHHOME/branded_odbc to your PATH and $APT_ORCHHOME/branded_odbc/lib to your LIBPATH, LD_LIBRARY_PATH, or SHLIB_PATH. The ODBCINI environment variable must be set to the full path of the odbc.ini file.
4. Access the external datasource using a valid user name and password.
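For step 3, the settings might be made as in the following sketch for a Bourne or Korn shell; which library-path variable you set (LIBPATH, LD_LIBRARY_PATH, or SHLIB_PATH) depends on your platform:
$ export PATH=$PATH:$APT_ORCHHOME/branded_odbc
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$APT_ORCHHOME/branded_odbc/lib
$ export ODBCINI=$dshome/.odbc.ini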
v column names
v table and column aliases
v SQL*Net service names
v SQL statements
v file-name and directory paths
odbcread
odbcread: properties
Table 131. odbcread Operator Properties Property Number of input data sets Number of output data sets Input interface schema Output interface schema Transfer behavior Execution mode Partitioning method Collection method Preserve-partitioning flag in the output data set Composite Stage Value 1 1 None Determined by the SQL query None Parallel and sequential MOD based Not applicable Clear No
You must specify either the -query or -tablename option. You must also specify the -datasourcename, -username, and -password options.
Table 132. odbcread Operator Options Option -query Use -query sql_query Specify an SQL query to read from one or more tables. Note: The -query option is mutually exclusive with the -table option. -table -tablename table_name Specify the table to be read from. It might be fully qualified. This option has two suboptions: v -filter where_predicate: Optionally specify the rows of the table to exclude from the read operation. This predicate is appended to the where clause of the SQL statement to be executed. v -list select_predicate: Optionally specify the list of column names that appear in the select clause of the SQL statement to be executed. Note: This option is mutually exclusive with the -query option. -datasource -datasourcename data_source_name Specify the data source to be used for all database connections. This option is required. -user -username user_name Specify the user name used to connect to the data source. This option might or might not be required depending on the data source. -password -password password Specify the password used to connect to the data source. This option might or might not be required depending on the data source. -open -opencommand open_command Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node. -close -closecommand close_command Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node. -arraysize -arraysize n Specify the number of rows to retrieve during each fetch operation. The default number of rows is 1. -isolation_level --isolation_level read_uncommitted | read_committed | repeatable_read | serializable Optionally specify the transaction level for accessing data. The default transaction level is decided by the database or possibly specified in the data source. -db_cs -db_cs character_set Optionally specify the ICU code page which represents the database character set in use. The default is ISO-8859-1. -use_strings -use_strings If the -use_strings option is set, strings instead of ustrings are generated in the InfoSphere DataStage schema.
Table 132. odbcread Operator Options (continued) Option -partitioncol Use -partitioncol partition_column Use this option to run the ODBC read operator in parallel mode. The default execution mode is sequential. Specify the key column; the data type of this column must be integer. The column should preferably have the Monotonically Increasing Order property. If you use the -partitioncol option, then you must follow the sample OSH scripts below to specify your -query and -table options. v Sample OSH for the -query option: odbcread -data_source SQLServer -user sa -password asas -query 'select * from SampleTable where Col1 = 2 and %orchmodColumn%' -partitioncol Col1 >| OutDataSet.ds v Sample OSH for the -table option: odbcread -data_source SQLServer -user sa -password asas -table SampleTable -partitioncol Col1 >| OutDataset.ds
Operator action
Here are the chief characteristics of the odbcread operator: v It runs both in parallel and sequential mode. v It translates the query's result set (a two-dimensional array) row by row to an InfoSphere DataStage data set. v Its output is an InfoSphere DataStage data set you can use as input to a subsequent InfoSphere DataStage operator. v Its translation includes the conversion of external datasource data types to InfoSphere DataStage data types. v The size of external datasource rows can be greater than that of InfoSphere DataStage records. v It specifies either an external datasource table to read or directs InfoSphere DataStage to perform an SQL query. v It optionally specifies commands to be run before the read operation is performed and after it has completed. v It can perform a join operation between an InfoSphere DataStage data set and an external data source. There might be one or more tables.
v The column names are used except when the external datasource column name contains a character that InfoSphere DataStage does not support. In that case, two underscore characters replace the unsupported character. v Both external datasource columns and InfoSphere DataStage fields support nulls. A null contained in an external datasource column is stored as the keyword NULL in the corresponding InfoSphere DataStage field.
Note: Data types that are not listed in the table above generate an error.
v Specify the table name. This allows InfoSphere DataStage to generate a default query that reads all records in the table. v Explicitly specify the query. The read methods are described in the following two sections.
You can optionally specify these options to narrow the read operation: v -list specifies the columns of the table to be read. By default, InfoSphere DataStage reads all columns. v -filter specifies the rows of the table to exclude from the read operation. By default, InfoSphere DataStage reads all rows. You can optionally specify the -opencommand and -closecommand options. The open commands are executed by ODBC on the external data source before the table is opened; and the close commands are executed by ODBC on the external data source after it is closed.
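The following commands are sketches of the two read methods; the data source, table, column, and file names are hypothetical:
$ osh "odbcread -tablename inventory -datasource mySource -user user1 -password secret -filter 'price > 100' -list 'itemNum, price' >| items.ds"
$ osh "odbcread -query 'select itemNum, price from inventory where price > 100' -datasource mySource -user user1 -password secret >| items.ds"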
Join operations
You can perform a join operation between InfoSphere DataStage data sets and external data. First use the odbcread operator, and then the lookup operator or the join operator.
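For example, the following sketch reads an external table into a data set and then joins it to an existing data set on a shared key; the table, key, and data set names are hypothetical, and the order of the input redirections determines which data set is treated as the left input:
$ osh "odbcread -tablename customers -datasource mySource -user user1 -password secret >| customersDS.ds"
$ osh "leftouterjoin -key custID < ordersDS.ds < customersDS.ds > enrichedDS.ds"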
Odbcread example 1: reading an external data source table and modifying a field name
The following figure shows an external data source table used as input to an InfoSphere DataStage operator:
[Figure: an ODBC table with columns itemNum (101, 101, 220), price (1.95, 1.95, 5.95), and storeID (26, 34, 26) is read by odbcread, passed through a modify operator (field1 = itemNum), and then to a sample operator whose input interface schema is field1:int32; in:*;]
The external datasource table contains three columns. The operator converts the data types of these columns as follows: v itemNum of type NUMBER[3,0] is converted to int32. v price of type NUMBER[6,2] is converted to decimal[6,2]. v storeID of type NUMBER[2,0] is converted to int32. The schema of the InfoSphere DataStage data set created from the table is also shown in this figure. Note that the InfoSphere DataStage field names are the same as the column names of the external datasource table. However, the operator to which the data set is passed has an input interface schema containing the 32-bit integer field field1, while the data set created from the external datasource table does not contain a field with the same name. Therefore, the modify operator must be placed between the odbcread operator and the sample operator to translate the name of the itemNum field to field1. Here is the osh syntax for this example:
$ osh "odbcread -tablename table_name -datasource data_source_name -user user_name -password password | modify $modifySpec | ... " $modifySpec="field1 = itemNum;" modify (field1 = itemNum,;)
odbcwrite
[Figure: odbcwrite data flow; an input data set is written to an output table in the external data source]
odbcwrite: properties
Table 134. odbcwrite Operator Properties Property Number of input data sets Number of output data sets Input interface schema Output interface schema Transfer behavior Execution mode Partitioning method Collection method Preserve-partitioning flag Composite Stage Value 1 0 Derived from the input data set None None Sequential by default or parallel Not applicable Any The default is clear No
Operator action
Below are the chief characteristics of the odbcwrite operator: v Translation includes the conversion of InfoSphere DataStage data types to external datasource data types.
v The operator appends records to an existing table, unless you specify the create, replace, or truncate write mode. v When you write to an existing table, the input data set schema must be compatible with the table schema. v Each instance of a parallel write operator running on a processing node writes its partition of the data set to the external datasource table. You can optionally specify external datasource commands to be parsed and executed on all processing nodes before the write operation runs or after it is completed.
Table 135. Mapping of osh Data Types to ODBC Data Types (continued)
InfoSphere DataStage Datatype    ODBC Datatype
raw(n)                           SQL_BINARY
raw(max=n)                       SQL_VARBINARY
date                             SQL_TYPE_DATE
time[p]                          SQL_TYPE_TIME
timestamp[p]                     SQL_TYPE_TIMESTAMP
string[36]                       SQL_GUID
Write modes
The write mode of the operator determines how the records of the data set are inserted into the destination table. The four write modes are: v append: This is the default mode. The table must exist and the record schema of the data set must be compatible with the table. The odbcwrite operator appends new rows to the table. The schema of the existing table determines the input interface of the operator. v create: The odbcwrite operator creates a new table. If a table exists with the same name as the one being created, the step that contains the operator terminates with an error. The schema of the InfoSphere DataStage data set determines the schema of the new table. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other non-standard way, you can use the -createstmt option with your own create table statement. v replace: The operator drops the existing table and creates a new one in its place. If a table exists with the same name as the one you want to create, the existing table is overwritten. The schema of the InfoSphere DataStage data set determines the schema of the new table. v truncate: The operator retains the table attributes but discards existing records and appends new ones. The schema of the existing table determines the input interface of the operator. Each mode requires the specific user privileges shown in the table below. Note: If a previous write operation failed, you can try again. Specify the replace write mode to delete any information in the output table that might have been written by the previous attempt to run your program.
Table 136. Required External Data Source Privileges for External Data Source Writes Write Mode Append Create Replace Truncate Required Privileges INSERT on existing table TABLE CREATE INSERT and TABLE CREATE on existing table INSERT on existing table
3. If the input data set contains fields that do not have matching components in the table, the operator generates an error and terminates the step.
4. InfoSphere DataStage does not add new columns to an existing table if the data set contains fields that are not defined in the table. Note that you can use the odbcwrite -drop option to drop extra fields from the data set. Columns in the external datasource table that do not have corresponding fields in the input data set are set to their default value, if one is specified in the external datasource table. If no default value is defined for the external datasource column and it supports nulls, it is set to null. Otherwise, InfoSphere DataStage issues an error and terminates the step.
5. InfoSphere DataStage data sets support nullable fields. If you write a data set to an existing table and a field contains a null, the external datasource column must also support nulls. If not, InfoSphere DataStage issues an error message and terminates the step. However, you can use the modify operator to convert a null in an input field to another value.
Table 137. odbcwrite Operator Options and Values (continued) Option -mode Use -mode append | create | replace | truncate Specify the write mode as one of these modes: append: This operator appends new records into an existing table. create: This operator creates a new table. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. The schema of the new table is determined by the schema of the InfoSphere DataStage data set. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other non-standard way, you can use the -createstmt option with your own create table statement. replace: This operator drops the existing table and creates a new one in its place. The schema of the InfoSphere DataStage data set determines the schema of the new table. truncate: This operator deletes all records from an existing table before loading new records. -statement -statement create_statement Optionally specify the create statement to be used for creating the table when -mode create is specified. -drop -drop If this option is set, unmatched fields in the InfoSphere DataStage data set are dropped. An unmatched field is a field for which there is no identically named field in the datasource table. -truncate -truncate When this option is set, column names are truncated to the maximum size allowed by the ODBC driver. -truncateLength -truncateLength n Specify the length to truncate column names. -opencommand -opencommand open_command Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node. -closecommand -closecommand close_command Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node. -arraysize -arraysize n Optionally specify the size of the insert array. The default size is 2000 records.
Table 137. odbcwrite Operator Options and Values (continued) Option -rowCommitInterval Use -rowCommitInterval n Optionally specify the number of records to be committed before starting a new transaction. This option can only be specified if arraysize = 1. Otherwise rowCommitInterval = arraysize. This is because of the rollback retry logic that occurs when an array execute fails. -transactionLevels -transactionLevels read_uncommitted | read_committed | repeatable_read | serializable Optionally specify the transaction level for accessing data. The default transaction level is decided by the database or possibly specified in the data source. -db_cs -db_cs code_page_name Optionally specify the ICU code page which represents the database character set in use. The default is ISO-8859-1. -useNchar -useNchar Read all nchars/nvarchars from the database.
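The following commands are sketches of two common invocations, modeled on the example syntax shown later in this section; the table names and -dboptions values are hypothetical. The first creates a new table; the second appends to an existing table, dropping unmatched input fields and truncating over-long column names:
$ osh "... op1 | odbcwrite -table newTable -mode create -dboptions {user = user101, password = userPword}"
$ osh "... op1 | odbcwrite -table existingTable -drop -truncate -dboptions {user = user101, password = userPword}"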
[Figure: odbcwrite writes an input data set to an existing table with columns itemNumber (number[10,0]), price (number[6,2]), and storeID (number[5,0])]
[Figure: odbcwrite in create mode (mode = create) creates a table with columns age (number[5,0]), zip (char[5]), and income (number)]
To create the table, the odbcwrite operator names the external datasource columns the same as the fields of the input InfoSphere DataStage data set and converts the InfoSphere DataStage data types to external datasource data types.
Example 3: writing to an external data source table using the modify operator
The modify operator allows you to drop unwanted fields from the write operation and translate the name or data type of a field in the input data set to match the input interface schema of the operator.
[Figure: the input data set schema before modify is skewNum:int32; price:sfloat; store:int8; rating:string[4]; the modify operator feeds odbcwrite, which writes to a table with columns itemNumber (number[10,0]), price (number[6,2]), and storeID (number[5,0])]
In this example, the modify operator is used to: v Translate the field names of the input data set to the names of the corresponding fields of the operator's input interface schema, that is, skewNum to itemNum and store to storeID. v Drop the unmatched rating field so that no error occurs. Note: InfoSphere DataStage performs an automatic type conversion of store, promoting its int8 data type in the input data set to int16 in the odbcwrite input interface. Here is the osh syntax for this example:
$ modifySpec="itemNum = skewNum, storeID = store; drop rating" $ osh "... op1 | modify $modifySpec | odbcwrite -table table_2 -dboptions {user = user101, password = userPword}"
Other features
Quoted identifiers
All operators that accept SQL statements as arguments support quoted identifiers in those arguments. The quotes must be escaped with the \ character.
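For example, a hypothetical query argument that references a quoted column name might be written as the following sketch:
-query 'select * from orders where \"Order ID\" = 10'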
Stored procedures
The ODBC operator does not support stored procedures.
Transaction rollback
Because it is not possible for native transactions to span multiple processes, rollback is not supported in this release.
Unit of work
The unit-of-work operator has not been modified to support ODBC in this release.
odbcupsert
[Figure: odbcupsert data flow; an input data set is written to an output table in the external data source]
odbcupsert: properties
Table 138. odbcupsert Properties and Values
Property                                Value
Number of input data sets               1
Number of output data sets by default   None; 1 when you select the -reject option
Input interface schema                  Derived from your insert and update statements
Transfer behavior                       Rejected update records are transferred to an output data set when you select the -reject option
Execution mode                          Parallel by default, or sequential
Partitioning method                     Same. You can override this partitioning method; however, a partitioning method of entire cannot be used.
Collection method                       Any
Combinable stage                        Yes
Operator action
Here are the main characteristics of odbcupsert: v If an -insert statement is included, the insert is executed first. Any records that fail to be inserted because of a unique-constraint violation are then used in the execution of the update statement. v By default, InfoSphere DataStage uses host-array processing to enhance the performance of insert array processing. Each insert array is executed with a single SQL statement. Updated records are processed individually. v Use the -insertArraySize option to specify the size of the insert array. For example:
-insertArraySize 250
v The default length of the insert array is 500. To direct InfoSphere DataStage to process your insert statements individually, set the insert array size to 1:
-insertArraySize 1
v The record fields can be of variable-length strings. You can either specify a maximum length or use the default maximum length of 80 characters. The example below specifies a maximum length of 50 characters:
record(field1:string[max=50])
v When an insert statement is included and host array processing is specified, an InfoSphere DataStage update field must also be an InfoSphere DataStage insert field.
v The odbcupsert operator converts all values to strings before passing them to the external data source. The following osh data types are supported:
- int8, uint8, int16, uint16, int32, uint32, int64, and uint64
- dfloat and sfloat
- decimal
- strings of fixed and variable length
- timestamp
- date
v By default, odbcupsert does not produce an output data set. Using the -reject option, you can specify an optional output data set containing the records that fail to be inserted or updated. Its syntax is:
-reject filename
Exactly one occurrence of the -update option is required. All others are optional. Specify an ICU character set to map between external datasource char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to an external data source. The default character set is UTF-8, which is compatible with osh jobs that contain 7-bit US-ASCII data.
Table 139. odbcupsert Operator Options Option -datasourcename Use -datasourcename data_source_name Specify the data source to be used for all database connections. This option is required. -username -username user_name Specify the user name used to connect to the data source. This option might or might not be required depending on the data source. -password -password password Specify the password used to connect to the data source. This option might or might not be required depending on the data source. statement options You must specify at least one of the following options but not more than two. An error is generated if you do not specify any statement option or specify more than two. -updateStatement update_statement Optionally specify the update or delete statement to be executed. -insertStatement insert_statement Optionally specify the insert statement to be executed. -deleteStatement delete_statement Optionally specify the delete statement to be executed.
Table 139. odbcupsert Operator Options (continued) Option -mode Use -mode insert_update | update_insert | delete_insert Specify the upsert mode to be used when two statement options are specified. If only one statement option is specified, the upsert mode is ignored. insert_update: The insert statement is executed first. If the insert fails due to a duplicate key violation (if the record being inserted exists), the update statement is executed. This is the default upsert mode. update_insert: The update statement is executed first. If the update fails because the record does not exist, the insert statement is executed. delete_insert: The delete statement is executed first; then the insert statement is executed. -reject -reject If this option is set, records that fail to be updated or inserted are written to a reject data set. You must designate an output data set for this purpose. If this option is not specified, an error is generated if records fail to update or insert. -open -open open_command Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node. -close -close close_command Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node. -insertarraysize -insertarraysize n Optionally specify the size of the insert/update array. The default size is 2000 records. -rowCommitInterval -rowCommitInterval n Optionally specify the number of records to be committed before starting a new transaction. This option can only be specified if arraysize = 1. Otherwise rowCommitInterval = arraysize. This is because of the rollback logic/retry logic that occurs when an array execution fails.
Summarized below are the state of the ODBC table before the data flow is run, the contents of the input file, and the action that InfoSphere DataStage performs for each record in the input file.
Table 140. Example Data
Table before dataflow:
acct_id    acct_balance
073587     45.64
873092     2001.89
675066     3523.62
566678     89.72

Input file contents and the InfoSphere DataStage action taken for each record:
acct_id    acct_balance    InfoSphere DataStage Action
873092     67.23           Update
865544     8569.23         Insert
566678     2008.56         Update
678888     7888.23         Insert
073587     82.56           Update
995666     75.72           Insert
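An upsert that produces these actions might be written along the following lines. This is a sketch only: the table name, column names, connection values, and input data set are assumptions, and the ORCHESTRATE.field placeholders follow the binding convention shown for odbclookup later in this chapter:
$ osh "odbcupsert -datasourcename mySource -username user1 -password secret
-insertStatement 'insert into accounts values (ORCHESTRATE.acct_id, ORCHESTRATE.acct_balance)'
-updateStatement 'update accounts set acct_balance = ORCHESTRATE.acct_balance where acct_id = ORCHESTRATE.acct_id'
-mode insert_update < inputDS.ds"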
Because odbclookup can perform lookups against more than one external datasource table, it is useful for joining multiple external datasource tables in one query. The -statement option command corresponds to an SQL statement of this form:
select a,b,c from data.testtbl where Orchestrate.b = data.testtbl.c and Orchestrate.name = "Smith"
The odbclookup operator replaces each InfoSphere DataStage field name with a field value; submits the statement containing the value to the external data source; and outputs a combination of external data source and InfoSphere DataStage data. Alternatively, you can use the -key/-table options interface to specify one or more key fields and one or more external data source tables. The following osh options specify two keys and a single table:
-key a -key b -table data.testtbl
The resulting InfoSphere DataStage output data set includes the InfoSphere DataStage records and the corresponding rows from each referenced external data source table. When an external data source table has a column name that is the same as an InfoSphere DataStage data set field name, the external datasource column is renamed using the following syntax:
APT_integer_fieldname
An example is APT_0_lname. The integer component is incremented when duplicate names are encountered in additional tables. Note: If the external datasource table is not indexed on the lookup keys, the performance of this operator is likely to be poor.
odbclookup
odbclookup: properties
Table 142. odbclookup Operator Properties
Property                                            Value
Number of input data sets                           1
Number of output data sets                          1; 2 if you include the -ifNotFound reject option
Input interface schema                              Determined by the query
Output interface schema                             Determined by the SQL query
Transfer behavior                                   Transfers all fields from input to output
Execution mode                                      Sequential or parallel (default)
Partitioning method                                 Not applicable
Collection method                                   Not applicable
Preserve-partitioning flag in the output data set   Clear
Composite stage                                     No
You must specify either the -query option or one or more -table options with one or more -key fields. The odbclookup operator is parallel by default. The options are as follows:
Table 143. odbclookup Operator Options and Values Option -datasourcename Value -datasourcename data_source_name Specify the data source to be used for all database connections. This option is required. -user -user user_name Specify the user name used to connect to the data source. This option might or might not be required depending on the data source. -password -password password Specify the password used to connect to the data source. This option might or might not be required depending on the data source.
Table 143. odbclookup Operator Options and Values (continued) Option -tablename Value -tablename table_name Specify a table and key fields to be used to generate a lookup query. This option is mutually exclusive with the -query option. The -table option has three suboptions: -filter where_predicate: Specify the rows of the table to exclude from the read operation. This predicate is appended to the where clause of the SQL statement to be executed. -selectlist select_predicate: Specify the list of column names that will appear in the select clause of the SQL statement to be executed. -key field: Specify a lookup key. A lookup key is a field in the table that is used to join with a field of the same name in the InfoSphere DataStage data set. The -key option can be used more than once to specify more than one key field. -ifNotFound -ifNotFound fail | drop | reject | continue Specify an action to be taken when a lookup fails. Choose any of the following actions: fail: Stop job execution. drop: Drop the failed record from the output data set. reject: Put records that are not found into a reject data set. You must designate an output data set for this option. continue: Leave all records in the output data set (outer join). -query -query sql_query Specify a lookup query to be executed. This option is mutually exclusive with the -table option. -open -open open_command Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node. -close -close close_command Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statement is executed only once on the conductor node. -fetcharraysize -fetcharraysize n Specify the number of rows to retrieve during each fetch operation. The default number is 1.
Table 143. odbclookup Operator Options and Values (continued)

-db_cs code_page_name
Optionally specify the ICU code page which represents the database character set in use. The default is ISO-8859-1. This option has this suboption:
-use_strings: If this suboption is not set, strings, not ustrings, are generated in the InfoSphere DataStage schema.
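For illustration, here is a minimal osh sketch of an odbclookup invocation built from the options in Table 143. The data source name mydsn, the table name target, the key column lname, and the virtual data sets input.v, output.v, and rejects.v are all hypothetical placeholders, and the redirection syntax follows the style of the osh examples later in this guide rather than a verified command:

$ osh "odbclookup
         -datasourcename mydsn
         -user myuser -password mypass
         -tablename target -key lname
         -ifNotFound reject
         0< input.v 0> output.v 1> rejects.v"

Because -ifNotFound reject is specified, the second output data set (rejects.v) receives the input records whose key values are not found in the table.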
InfoSphere DataStage prints the lname, fname and DOB column names and values from the InfoSphere DataStage input data set, and the lname, fname, and DOB column names and values from the external datasource table. If a column name in the external datasource table has the same name as an InfoSphere DataStage output data set schema fieldname, the printed output shows the column in the external datasource table renamed using this format:
APT_integer_fieldname
For example, the left side of the next figure shows a sequential SAS application that reads its input from an RDBMS and writes its output to the same database. This application contains a number of sequential bottlenecks that can negatively impact its performance.
[Figure: on the left, a sequential SAS application consisting of a series of SAS steps reading from and writing to an RDBMS; on the right, the equivalent InfoSphere DataStage parallel processing model]
In a typical client/server environment, a sequential application such as SAS establishes a single connection to a parallel RDBMS in order to access the database. Therefore, while your database might be explicitly designed to support parallel access, SAS requires that all input be combined into a single input stream. One effect of this single input is that the SAS program can often process data much faster than it can access the data over the single connection, so the performance of a SAS program might be bound by the rate at which it can access data.
In addition, while a parallel system, either MPP or SMP, contains multiple CPUs, a single SAS job executes on only a single CPU; therefore, much of the processing power of your system is unused by the SAS application.
In contrast, InfoSphere DataStage's parallel processing model, shown above on the right, allows the database reads and writes, as well as the SAS program itself, to run simultaneously, reducing or eliminating the performance bottlenecks that might otherwise occur when SAS is run on a parallel computer.
InfoSphere DataStage enables SAS users to:
v Access, for reading or writing, large volumes of data in parallel from parallel relational databases, with much higher throughput than is possible using PROC SQL.
v Process parallel streams of data with parallel instances of SAS DATA and PROC steps, enabling scoring or other data transformations to be done in parallel with minimal changes to existing SAS code.
v Store large data sets in parallel, circumventing restrictions on data-set size imposed by your file system or physical disk-size limitations. Parallel data sets are accessed from SAS programs in the same way as conventional SAS data sets, but at much higher data I/O rates.
v Realize the benefits of pipeline parallelism, in which some number of InfoSphere DataStage sas operators run at the same time, each receiving data from the previous process as it becomes available.
In a configuration file with multiple nodes, the path for each resource disk must be unique in order to obtain the SAS parallel data sets. The following configuration file for an MPP system differentiates the resource disk paths by naming the final directory with a node designation; however, you can use any directory paths as long as they are unique across all nodes. Bold type is used for illustration purposes only to highlight the resource directory paths.
{ node "elwood1" { fastname "elwood" pools "" resource sas "/usr/local/sas/sas8.2" { } resource disk "/opt/IBMm/InformationServer/Server/Datasets/node1"
Chapter 15. The SAS interface library
559
{pools "" "sasdataset" } resource scratchdisk "/opt/IBMm/InformationServer/Server/Scratch" {pools ""} } node "solpxqa01-1" { fastname "solpxqa01" pools "" resource sas "/usr/local/sas/sas8.2" { } resource disk "/opt/IBMm/InformationServer/Server/Datasets/node2" {pools "" "sasdataset" } resource scratchdisk "/opt/IBMm/InformationServer/Server/Scratch" {pools ""} } {
The resource names sas and sasworkdisk and the disk pool name sasdataset are reserved words.
[Figure: data flow from db2read (standard DataStage data set) to sasin (SAS data set) to a sas operator executing a DATA step (SAS data set) to sasout (standard DataStage data set) to a non-SAS operator]
The InfoSphere DataStage db2read operator streams data from DB2 in parallel and converts it automatically into the standard InfoSphere DataStage data set format. The sasin operator then converts it, by default, into an internal SAS data set usable by the sas operator. Finally, the sasout operator inputs the SAS data set and outputs it as a standard InfoSphere DataStage data set. The SAS interface operators can also input and output data sets in Parallel SAS data set format. Descriptions of the data set formats are in the next section.
For example:
DataStage SAS data set
UstringFields[ISO-2022-JP];B_FIELD;C_FIELD;
LFile: eartha-gbit 0: /apt/eartha1sand0/jgeyer701/orch_master/apt/pds_files/node0/
test_saswu.psds.26521.135470472.1065459831/test_saswu.sas7bdat
Set the APT_SAS_NO_PSDS_USTRING environment variable to output a header file that does not list ustring fields. This format is consistent with previous versions of InfoSphere DataStage:
DataStage SAS data set
Lfile: HOSTNAME:DIRECTORY/FILENAME
Lfile: HOSTNAME:DIRECTORY/FILENAME
Lfile: HOSTNAME:DIRECTORY/FILENAME
. . .
Note: The UstringFields line does not appear in the header file when any one of these conditions is met:
v You are using a version of InfoSphere DataStage that is using a parallel engine prior to Orchestrate 7.1.
v Your data has no ustring schema values.
v You have set the APT_SAS_NO_PSDS_USTRING environment variable.
If one of these conditions is met, but your data contains multi-byte Unicode data, you must use the modify operator to convert the data schema type to ustring. This process is described in "Handling SAS char Fields Containing Multi-Byte Unicode Data".
To save your data as a parallel SAS data set, either add the framework argument [psds] to the output redirection or add the .psds extension to the outfile name on your osh command line. These two commands are equivalent, except that the resulting data sets have different file names. The first example outputs a file without the .psds extension and the second example outputs a file with the .psds extension.
$ osh ". . . sas $ osh ". . . sas options > [psds] outfile " options > outfile .psds "
You can likewise use a parallel SAS data set as input to the sas operator, either by specifying [psds] or by having the data set in a file with the .psds extension:
$ osh " . . . sas $ osh " . . . sas options < [psds] infile " options < infile .psds "
A persistent parallel SAS data set is automatically converted to an InfoSphere DataStage data set if the next operator in the flow is not a SAS interface operator.
[Figure: a sas operator whose SAS code reads the SAS data set temp.p_in and writes the SAS data set liborch.p_out]
In this example:
v libname temp '/tmp' specifies the library name of the SAS input
v data liborch.p_out specifies the output SAS data set
v temp.p_in is the input SAS data set.
The output SAS data set is named using the following syntax:
liborch.osds_name
As part of writing the SAS code executed by the operator, you must associate each input and output data set with an input and output of the SAS code. In the figure above, the sas operator executes a DATA step with a single input and a single output. Other SAS code to process the data would normally be included. In this example, the output data set is prefixed by liborch, a SAS library name corresponding to the Orchestrate SAS engine, while the input data set comes from another specified library; in this example the data set file would normally be named /tmp/p_in.ssd01.
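Assembled from the statements described above, the complete SAS source passed to the operator would look roughly like the following sketch; only the library and member names already mentioned in this example are used, and any additional processing logic would be added inside the DATA step:

libname temp '/tmp';
data liborch.p_out;
   set temp.p_in;
run;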
Getting input from an InfoSphere DataStage data set or a SAS data set
You might have data already formatted as a data set that you want to process with SAS code, especially if there are other InfoSphere DataStage operators earlier in the data flow. In that case, the SAS interface operators automatically convert a data set in the normal data set format into an SAS data set. However, you still need to provide the SAS step that connects the data to the inputs and outputs of the sas operator. For InfoSphere DataStage data sets or SAS data sets the liborch library is referenced. The operator creates an SAS data set as output. Shown below is a sas operator with one input and one output data set:
[Figure: a sas operator with one input (a DataStage or SAS data set) and one output; its SAS code contains the statement data liborch.p_out;]
As part of writing the SAS code executed by the operator, you must associate each input and output data set with an input and output of the SAS code. In the figure above, the sas operator executes a DATA step with a single input and a single output. Other SAS code to process the data would normally be included.
To define the input data set connected to the DATA step input, you use the -input option to the sas operator. This option has the following osh format:
-input in_port_# ds_name
where:
v ds_name is the member name of a standard InfoSphere DataStage data set or an SAS data set used as an input by the SAS code executed by the operator. You need only to specify the member name here; do not include any SAS library name prefix. You always include this member name in your sas operator SAS code, prefixed with liborch.
v in_port_# is an input data set number. Input data sets are numbered starting from 0.
You use the corresponding -output option to connect an output SAS data set to an output of the DATA step. The osh syntax of -output is:
-output out_port_# ods_name
For the example above, the settings for -input and -output are:
$ osh "... sas -input 0p_in -output 0p_out other_options "
Table 144. Conversions to SAS Data Types (continued)
InfoSphere DataStage Data Type -> SAS Data Type
decimal[p,s] (p is precision and s is scale) -> SAS numeric
int64 and uint64 -> not supported
int8, int16, int32, uint8, uint16, uint32, sfloat, dfloat -> SAS numeric
fixed-length raw, in the form: raw[n] -> SAS string with length n. The maximum string length is 200 bytes; strings longer than 200 bytes are truncated to 200 bytes.
variable-length raw, in the form: raw[max=n] -> SAS string with a length of the actual string length. The maximum string length is 200 bytes; strings longer than 200 bytes are truncated to 200 bytes.
variable-length raw, in the form: raw -> not supported
fixed-length string or ustring, in the form: string[n] or ustring[n] -> SAS string with length n. The maximum string length is 200 bytes; strings longer than 200 bytes are truncated to 200 bytes.
variable-length string or ustring, in the form: string[max=n] or ustring[max=n] -> SAS string with a length of the actual string length. The maximum string length is 200 bytes; strings longer than 200 bytes are truncated to 200 bytes.
variable-length string or ustring, in the form: string or ustring -> not supported
time -> SAS numeric time value
timestamp -> SAS numeric datetime value
Example applications
Example 1: parallelizing a SAS data step
This section contains an example that executes a SAS DATA step in parallel. Here is a figure describing this step:
The step takes a single SAS data set as input and writes its results to a single SAS data set as output. The DATA step recodes the salary field of the input data to replace a dollar amount with a salary-scale value. Here is the original SAS code:
libname prod "/prod"; data prod.sal_scl; set prod.sal; if (salary < 10000) then salary = 1; else if (salary < 25000) then salary = 2; else if (salary < 50000) then salary = 3; else if (salary < 100000) then salary = 4; else salary = 5; run;
This DATA step requires little effort to parallelize because it processes records without regard to record order or relationship to any other record in the input. Also, the step performs the same operation on every input record and contains no BY clauses or RETAIN statements. The following figure shows the InfoSphere DataStage data flow diagram for executing this DATA step in parallel:
[Figure: a sequential sas operator producing a SAS data set, followed by a parallel sas operator that executes the SAS DATA step, followed by sasout producing a DataStage data set (alternatively, the sas operator can output a parallel SAS data set)]
In this example, you:
v Get the input from a SAS data set using a sequential sas operator;
v Execute the DATA step in a parallel sas operator;
v Output the results as a standard InfoSphere DataStage data set (you must provide a schema for this) or as a parallel SAS data set. You might also pass the output to another sas operator for further processing.
The schema required might be generated by first outputting the data to a Parallel SAS data set, then referencing that data set. InfoSphere DataStage automatically generates the schema. The first sequential sas operator executes the following SAS code as defined by the -source option to the operator:
libname prod "/prod"; data liborch.out; set prod.sal; run;
The sas operator can then do one of three things: use the sasout operator with its -schema option to output the results as a standard InfoSphere DataStage data set, output the results as a Parallel SAS data set, or pass the output directly to another sas operator as a SAS data set. The default output format is SAS data set. Whether the output goes to a Parallel SAS data set, to another sas operator, or to a standard InfoSphere DataStage data set, the liborch library name must be used. Conversion of the output to a standard InfoSphere DataStage data set or a Parallel SAS data set is discussed in "SAS Data Set Format" and "Parallel SAS Data Set Format".
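For the parallel sas operator, the recoding logic of the original DATA step can be expressed against liborch members. The following is a sketch only; the member names in and sal_scl are illustrative and would be connected to the operator's data sets with the -input and -output options:

data liborch.sal_scl;
   set liborch.in;
   if (salary < 10000) then salary = 1;
   else if (salary < 25000) then salary = 2;
   else if (salary < 50000) then salary = 3;
   else if (salary < 100000) then salary = 4;
   else salary = 5;
run;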
[Figure: two hash operators partitioning the inputs, feeding a sas operator that executes the merge-and-score DATA step and writes the parallel SAS data set data.psds]
The sas operator in this example runs the following DATA step to perform the merge and score:
data liborch.emptabd;
   merge liborch.wltab liborch.emptab;
   by workloc;
   a_13 = (f1-f3)/2;
   a_15 = (f1-f5)/2;
   . . .
run;
Records are hashed based on the workloc field. In order for the merge to work correctly, all records with the same value for workloc must be sent to the same processing node and the records must be ordered. The merge is followed by a parallel InfoSphere DataStage sas operator that scores the data, then writes it out to a parallel SAS data set.
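A rough osh sketch of this flow, in the virtual data set style used elsewhere in this chapter, is shown below. The data set names, the two-input wiring, and the omission of the sort needed to order records by workloc are illustrative simplifications, not a verified command:

$ osh "hash -key workloc 0< wltab.v 0> wltab_h.v;
       hash -key workloc 0< emptab.v 0> emptab_h.v;
       sas -source 'merge-and-score DATA step shown above'
           -input 0 wltab -input 1 emptab
           -output 0 emptabd
           0< wltab_h.v 1< emptab_h.v 0> data.psds"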
by acctno;
data prod.up_trans (keep = acctno sum);
   set prod.s_trans;
   by acctno;
   if first.acctno then sum=0;
   if type = "D" then sum + amt;
   if type = "W" then sum - amt;
   if last.acctno then output;
run;
The SUM variable is reset at the beginning of each group of records where the record groups are determined by the account number field. Because this DATA step uses the BY clause, you use the InfoSphere DataStage hash partitioning operator with this step to make sure all records from the same group are assigned to the same node. Note that DATA steps using SUM without the BY clause view their input as one large group. Therefore, if the step used SUM but not BY, you would execute the step sequentially so that all records would be processed on a single node. Shown below is the data flow diagram for this example:
[Figure: a sas operator that imports the data as a SAS data set, a hash operator partitioning on acctno, and a parallel sas operator that executes the DATA step and outputs a parallel SAS data set]
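The SAS source for the first, sequential sas operator is not shown in the original example; a minimal sketch, inferred from the member names used in this example, might look like this:

libname prod "/prod";
data liborch.p_trans;
   set prod.s_trans;
run;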
The SAS DATA step executed by the second sas operator is:
data liborch.nw_trans (keep = acctno sum);
   set liborch.p_trans;
   by acctno;
   if first.acctno then sum=0;
   if type = "D" then sum + amt;
   if type = "W" then sum - amt;
   if last.acctno then output;
run;
Example applications
Example 1: parallelize PROC steps using the BY keyword
This example parallelizes a SAS application using PROC SORT and PROC MEANS. In this example, you first sort the input to PROC MEANS, then calculate the mean of all records with the same value for the acctno field. The following figure illustrates this SAS application:
[Figure: the sequential SAS application, in which PROC SORT feeds PROC MEANS and the results are written to a list file]
The BY clause in a SAS step signals that you want to hash partition the input to the step. Hash partitioning guarantees that all records with the same value for acctno are sent to the same processing node. The SAS PROC step executing on each node is thus able to calculate the mean for all records with the same value for acctno. Shown below is the implementation of this example:
[Figure: a sas operator that converts the SAS data set to a DataStage data set, a hash operator partitioning on acctno, and a parallel sas operator that runs the PROC MEANS step and writes its results to a list file]
PROC MEANS pipes its results to standard output, and the sas operator sends the results from each partition to standard output as well. Thus the list file created by the sas operator, which you specify using the -listds option, contains the results of the PROC MEANS sorted by processing node. Shown below is the SAS PROC step for this application:
proc means data=liborch.p_dhist;
   by acctno;
run;
libname prod "/prod"; proc summary data=prod.txdlst missing NWAY; CLASS acctno lstrxd fpxd; var xdamt xdcnt; output out=prod.xnlstr(drop=_type_ _freq_) sum=; run;
In order to parallelize this example, you hash partition the data based on the fields specified in the CLASS option. Note that you do not have to partition the data on all of the fields; you only need to specify enough fields that your data is correctly distributed to the processing nodes. For example, you can hash partition on acctno if it ensures that your records are properly grouped. Or you can partition on two of the fields, or on all three. An important consideration with hash partitioning is that you should specify as few fields as necessary to the partitioner, because every additional field requires additional overhead to perform partitioning. The following figure shows the InfoSphere DataStage application data flow for this example:
[Figure: a sas operator that converts the SAS data set to a DataStage data set, a hash operator partitioning on acctno, and a parallel sas operator that runs the PROC step and outputs the parallel data set data.psds]
The SAS code (DATA step) for the first sas operator is:
libname prod "/prod"; data liborch.p_txdlst set prod.txdlst; run;
Rules of thumb
There are rules of thumb you can use to specify how a SAS program runs in parallel. Once you have identified a program as a potential candidate for use in InfoSphere DataStage, you need to determine how to divide the SAS code itself into InfoSphere DataStage steps. The sas operator can be run either parallel or sequentially. Any converted SAS program that satisfies one of the four criteria outlined above will contain at least one parallel segment. How much of the program should be contained in this segment? Are there portions of the program that need to be implemented in sequential segments? When does a SAS program require multiple parallel segments? Here are some guidelines you can use to answer such questions.
1. Identify the slow portions of the sequential SAS program by inspecting the CPU and real-time values for each of the PROC and DATA steps in your application. Typically, these are steps that manipulate records (CPU-intensive) or that sort or merge (memory-intensive). You should be looking for times that are a relatively large fraction of the total run time of the application and that are measured in units of many minutes to hours, not seconds to minutes. You might need to set the SAS fullstimer option on in your config.sas612 or in your SAS program itself to generate a log of these sequential run times.
2. Start by parallelizing only those slow portions of the application that you have identified in Step 1 above. As you include more code within the parallel segment, remember that each parallel copy of your code (referred to as a partition) sees only a fraction of the data. This fraction is determined by the partitioning method you specify on the input or inpipe lines of your sas operator source code.
3. Any two sas operators should only be connected by one pipe. This ensures that all pipes in the InfoSphere DataStage program are simultaneously flowing for the duration of the execution. If two segments are connected by multiple pipes, each pipe must drain entirely before the next one can start.
4. Keep the number of sas operators to a minimum. There is a performance cost associated with each operator that is included in the data flow. Rule 3 takes precedence over this rule. That is, when reducing the number of operators means connecting any two operators with more than one pipe, don't do it.
5. If you are reading or writing a sequential file such as a flat ASCII text file or a SAS data set, you should include the SAS code that does this in a sequential sas operator. Use one sequential operator
for each such file. You will see better performance inputting one sequential file per operator than if you lump many inputs into one segment followed by multiple pipes to the next segment, in line with Rule 2 above.
6. When you choose a partition key or combination of keys for a parallel operator, you should keep in mind that the best overall performance of the parallel application occurs if each of the partitions is given approximately equal quantities of data. For instance, if you are hash partitioning by the key field year (which has five values in your data) into five parallel segments, you will end up with poor performance if there are big differences in the quantities of data for each of the five years. The application is held up by the partition that has the most data to process. If there is no data at all for one of the years, the application will fail because the SAS process that gets no data will issue an error statement. Furthermore, the failure will occur only after the slowest partition has finished processing, which might be well into your application. The solution might be to partition by multiple keys, for example, year and storeNumber, to use round-robin partitioning where possible, to use a partitioning key that has many more values than there are partitions in your InfoSphere DataStage application, or to keep the same key field but reduce the number of partitions. All of these methods should serve to balance the data distribution over the partitions.
7. Multiple parallel segments in your InfoSphere DataStage application are required when you need to parallelize portions of code that are sorted by different key fields. For instance, if one portion of the application performs a merge of two data sets using the patientID field as the BY key, this PROC MERGE will need to be in a parallel segment that is hash partitioned on the key field patientID. If another portion of the application performs a PROC MEANS of a data set using the procedureCode field as the CLASS key, this PROC MEANS will have to be in a parallel sas operator that is hash partitioned on the procedureCode key field.
8. If you are running a query that includes an ORDER BY clause against a relational database, you should remove it and do the sorting in parallel, either using SAS PROC SORT or an InfoSphere DataStage input line order statement. Performing the sort in parallel outside of the database removes the sequential bottleneck of sorting within the RDBMS.
9. A sort that has been performed in a parallel operator will order the data only within that operator. If the data is then streamed into a sequential operator, the sort order will be lost. You will need to re-sort within the sequential operator to guarantee order.
10. Within a parallel sas operator you might use only SAS work directories for intermediate writes to disk. SAS generates unique names for the work directories of each of the parallel operators. In an SMP environment this is necessary because it prevents the multiple CPUs from writing to the same work file. Do not use a custom-specified SAS library within a parallel operator.
11. Do not use a liborch directory within a parallel segment unless it is connected to an inpipe or an outpipe. A liborch directory might not be both written and read within the same parallel operator.
12. A liborch directory can be used only once for an input, inpipe, output or outpipe. If you need to read or write the contents of a liborch directory more than once, you should write the contents to disk via a SAS work directory and copy this as needed.
13. Remember that all InfoSphere DataStage operators in a step run simultaneously. This means that you cannot write to a custom-specified SAS library as output from one InfoSphere DataStage operator and simultaneously read from it in a subsequent operator. Connections between operators must be via InfoSphere DataStage pipes, which are virtual data sets normally in SAS data set format.
The basic SAS executable sold in Europe might require you to create workarounds for handling some European characters. InfoSphere DataStage has not made any modifications to SAS that eliminate the need for this special character handling.
Procedure
1. Specify the location of your basic SAS executable using one of the methods below. The SAS international syntax in these methods directs InfoSphere DataStage to perform European-language transformations.
v Set the APT_SASINT_COMMAND environment variable to the absolute path name of your basic SAS executable. For example:
APT_SASINT_COMMAND /usr/local/sas/sas8.2/sas
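In a Korn shell environment, for example, this could be set as follows (the path is illustrative):

export APT_SASINT_COMMAND=/usr/local/sas/sas8.2/sas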
2. Using the -sas_cs option of the SAS-interface operators, specify an ICU character set to be used by InfoSphere DataStage to map between your ustring values and the char data stored in SAS files. If your data flow also includes the sasin or the sasout operator, specify the same character set you specified to the sas operator. The syntax is:
-sas_cs icu_character_set
For example:
-sas_cs fr_CA-iso-8859
3. In the table of your sascs.txt file, enter the designations of the ICU character sets you use in the right columns of the table. There can be multiple entries. Use the left column of the table to enter a symbol that describes the character sets. The symbol cannot contain space characters, and both columns must be filled in. The platform-specific sascs.txt files are in:
$APT_ORCHHOME/apt/etc/platform/
The platform directory names are: sun, aix, hpux, and linux. For example:
$APT_ORCHHOME/etc/aix/sascs.txt
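An entry in sascs.txt might look like the following sketch; the symbol in the left column is an arbitrary single token of your choosing, and the ICU designation in the right column is illustrative (it reuses the character set shown in the -sas_cs example above):

CANADIAN_FRENCH    fr_CA-iso-8859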
-output 0 out_data
-schemaFile cl_ship
-workingdirectory . [seq] 0> Link2.v;
peek -all [seq] 0< Link2.v;
If you know the fields contained in the SAS file, the step can be written as:
osh sas -source libname curr_dir \.\;
        DATA liborch.out_data;
        SET curr_dir.cl_ship;
        RUN;
    -output 0 out_data
    -schema record(SHIP_DATE:nullable string[50];
                   DISTRICT:nullable string[10];
                   DISTANCE:nullable sfloat;
                   EQUIPMENT:nullable string[10];
                   PACKAGING:nullable string[10];
                   LICENSE:nullable string[10];
                   CHARGE:nullable sfloat;)
    [seq] 0> DSLink2.v;
peek -all [seq] 0< DSLink2.v;
It is also easy to write a SAS file from an InfoSphere DataStage virtual datastream. The following osh commands generated the SAS file described above.
osh import
    -schema record {final_delim=end, delim=,, quote=double}
        (Ship_Date:string[max=50];
         District:string[max=10];
         Distance:int32;
         Equipment:string[max=10];
         Packaging:string[max=10];
         License:string[max=10];
         Charge:int32;)
    -file /home/cleonard/sas_in.txt
    -rejects continue
    -reportProgress yes
    0> DSLink2a.v;
osh sas -source libname curr_dir \.\;
        DATA curr_dir.cl_ship;
        SET liborch.in_data;
        RUN;
    -input 0 in_data [seq] 0< DSLink2a.v;
to be invoked in international mode unless you have configured your system for European languages as described in "Using SAS with European Languages". See the next section, "Using -sas_cs to Determine SAS mode", for information on modes.
Note: If your data flow also includes the sasin or the sasout operator, use the -sas_cs option of these operators to specify the same character set you specified to the sas operator.
The sasin and sas operators use the -sas_cs option to determine what character-set encoding should be used when exporting ustring (UTF-16) values to SAS. The sas and sasout operators use this option to determine what character set InfoSphere DataStage uses to convert SAS char data to InfoSphere DataStage ustring values. The syntax for the -sas_cs option is:
-sas_cs icu_character_set | DBCSLANG
In $APT_ORCHHOME/apt/etc/platform/ there are platform-specific sascs.txt files that show the default ICU character set for each DBCSLANG setting. The platform directory names are: sun, aix, hpux, and linux. For example:
$APT_ORCHHOME/etc/aix/sascs.txt
When you specify a character setting, the sascs.txt file must be located in the platform-specific directory for your operating system. InfoSphere DataStage accesses the setting that is equivalent to your -sas_cs specification for your operating system. ICU settings can differ between platforms. For the DBCSLANG settings you use, enter your ICU equivalents.
DBCSLANG Setting: ICU Character Set
JAPANESE: icu_character_set
KATAKANA: icu_character_set
KOREAN: icu_character_set
HANGLE: icu_character_set
CHINESE: icu_character_set
TAIWANESE: icu_character_set
Note: If InfoSphere DataStage encounters a ustring value but you have not specified the -sas_cs option, SAS remains in standard mode and InfoSphere DataStage then determines what character setting to use when it converts your ustring values to Latin1 characters. InfoSphere DataStage first references your APT_SAS_CHARSET environment variable. If it is not set and your APT_SAS_CHARSET_ABORT environment variable is also not set, it uses the value of your -impexp_charset option or the value of your APT_IMPEXP_CHARSET environment variable.
your ustring data and step source code using multi-byte characters in your chosen character set. Multi-byte Unicode character sets are supported in SAS char fields, but not in SAS field names or data set file names. Your SAS system is capable of running in DBCS mode if your SAS log output has this type of header:
NOTE: SAS (r) Proprietary Software Release 8.2 (TS2MO DBCS2944)
If you do not specify the -sas_cs option for the sas operator, SAS is invoked in standard mode. In standard mode, SAS processes your string fields and step source code using single-byte LATIN1 characters. SAS standard mode is also called "the basic US SAS product". When you have invoked SAS in standard mode, your SAS log output has this type of header:
NOTE: SAS (r) Proprietary Software Release 8.2 (TS2MO)
When the APT_SAS_NO_PSDS_USTRING environment variable is not set, the .psds header file lists the SAS char fields that are to be converted to ustring fields in InfoSphere DataStage, making it unnecessary to use the modify operator to convert the data type of these fields. For more information, see "Parallel SAS Data Set Format" .
You can use the APT_SAS_SCHEMASOURCE_DUMP environment variable to see the SAS CONTENTS report used by the -schemaFile option. The output also displays the command line given to SAS to produce the report and the pathname of the report. The input file and output file created by SAS are not deleted when this variable is set. You can also obtain an InfoSphere DataStage schema by using the -schema option of the sascontents operator. Use the peek operator in sequential mode to print the schema to the screen. Example output is:
...<peek,0>Suggested schema for DataStage SAS data set cow
...<peek,0>record (a : nullable dfloat;
...<peek,0>   b : nullable string[10])
You can then fine-tune the schema and specify it to the -schema option to take advantage of performance improvements. Note: If both the -schemaFile option and the -sas_cs option are set, all of your SAS char fields are converted to InfoSphere DataStage ustring values. If the -sas_cs option is not set, all of your SAS char values are converted to InfoSphere DataStage string values. To obtain a mixture of string and ustring values, use the -schema option.
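For example, the suggested schema from the peek output above could be adjusted and passed to the -schema suboption of the sas operator's -output option (a sketch; out_data is an illustrative member name):

-output 0 out_data -schema record (a:nullable dfloat; b:nullable string[10])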
Specify a value for n_codepoint_units that is 2.5 or 3 times larger than the number of characters in the ustring. This forces the SAS char fixed-length field size to be the value of n_codepoint_units. The maximum value for n_codepoint_units is 200.
For SAS 6.12, InfoSphere DataStage handles SAS file and column names up to 8 characters.
Environment variables
These are the SAS-specific environment variables:
v APT_SAS_COMMAND absolute_path
Overrides the $PATH directory for SAS and the resource sas entry with an absolute path to the US SAS executable. An example path is: /usr/local/sas/sas8.2/sas.
v APT_SASINT_COMMAND absolute_path
Overrides the $PATH directory for SAS and the resource sasint entry with an absolute path to the International SAS executable. An example path is: /usr/local/sas/sas8.2int/dbcs/sas.
v APT_SAS_CHARSET icu_character_set
When the -sas_cs option is not set and a SAS-interface operator encounters a ustring, InfoSphere DataStage interrogates this variable to determine what character set to use. If this variable is not set, but APT_SAS_CHARSET_ABORT is set, the operator aborts; otherwise InfoSphere DataStage accesses the -impexp_charset option or the APT_IMPEXP_CHARSET environment variable.
v APT_SAS_CHARSET_ABORT
Causes a SAS-interface operator to abort if InfoSphere DataStage encounters a ustring in the schema and neither the -sas_cs option nor the APT_SAS_CHARSET environment variable is set.
v APT_SAS_TRUNCATION ABORT | NULL | TRUNCATE
Because a ustring of n characters does not fit into n bytes of a SAS char value, the ustring value might be truncated before the space pad characters and \0. The sasin and sas operators use this variable to determine how to truncate a ustring value to fit into a SAS char field. TRUNCATE, which is the default, causes the ustring to be truncated; ABORT causes the operator to abort; and NULL exports a null field. For NULL and TRUNCATE, the first five occurrences for each column cause an information message to be issued to the log.
v APT_SAS_ACCEPT_ERROR
When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for InfoSphere DataStage to terminate the operator with an error.
v APT_SAS_DEBUG_LEVEL=[0-2]
Specifies the level of debugging messages to output from the SAS driver. The values of 0, 1, and 2 duplicate the output for the -debug option of the sas operator: no, yes, and verbose.
v APT_SAS_DEBUG=1, APT_SAS_DEBUG_IO=1, APT_SAS_DEBUG_VERBOSE=1
Specifies various debug messages output from the SASToolkit API.
v APT_HASH_TO_SASHASH
The InfoSphere DataStage hash partitioner contains support for hashing SAS data. In addition, InfoSphere DataStage provides the sashash partitioner which uses an alternative non-standard hashing algorithm. Setting the APT_HASH_TO_SASHASH environment variable causes all appropriate instances of hash to be replaced by sashash. If the APT_NO_SAS_TRANSFORMS environment variable is set, APT_HASH_TO_SASHASH has no effect.
v APT_NO_SAS_TRANSFORMS
InfoSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting a sasout operator and substituting sasRoundRobin for eRoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents InfoSphere DataStage from making these transformations.
v APT_NO_SASOUT_INSERT
This variable selectively disables the sasout operator insertions. It maintains the other SAS-specific transformations.
v APT_SAS_SHOW_INFO
Displays the standard SAS output from an import or export transaction. The SAS output is normally deleted since a transaction is usually successful.
v APT_SAS_SCHEMASOURCE_DUMP
Displays the SAS CONTENTS report used by the -schemaFile option. The output also displays the command line given to SAS to produce the report and the pathname of the report. The input file and output file created by SAS are not deleted when this variable is set.
v APT_SAS_S_ARGUMENT number_of_characters
Overrides the value of the SAS -s option, which specifies how many characters should be skipped from the beginning of the line when reading input SAS source records. The -s option is typically set to 0, indicating that records be read starting with the first character on the line. This environment variable allows you to change that offset. For example, to skip the line numbers and the following space character in the SAS code below, set the value of the APT_SAS_S_ARGUMENT variable to 6.
0001 data temp; x=1; run;
0002 proc print; run;
v APT_SAS_NO_PSDS_USTRING
Outputs a header file that does not list ustring fields.
In addition to the SAS-specific debugging variables, you can set the APT_DEBUG_SUBPROC environment variable to display debug information about each subprocess operator.
Each release of SAS also has an associated environment variable for storing SAS system options specific to the release. The environment variable name includes the version of SAS it applies to; for example, SAS612_OPTIONS, SASV8_OPTIONS, and SASV9_OPTIONS. The usage is the same regardless of release. Any SAS option that can be specified in the configuration file or on the command line at startup can also be defined using the version-specific environment variable. SAS looks for the environment variable in the current shell and then applies the defined options to that session. The environment variables are defined as any other shell environment variable. A ksh example is:
export SASV8_OPTIONS='-xwait -news SAS_news_file'
An option set using an environment variable overrides the same option set in the configuration file, and an option set in SASx_OPTIONS is overridden by its setting on the SAS startup command line or the OPTIONS statement in the SAS code (as applicable to where options can be defined).
sasin
sasin: properties
Table 145. sasin Operator Properties
Number of input data sets: 1 standard data set
Number of output data sets: 1 SAS data set
Input interface schema: from the upstream operator, or specified by the sasin operator -schema option
Output interface schema: if no -key option is specified, record ( sasData:raw; ); if a -key option is specified, record ( sastsort:raw; sasData:raw; )
Transfer behavior: none
Execution mode: parallel by default, or sequential
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: set
... fieldN ]
Table 146. sasin Operator Options

-context prefix_string
Optionally specifies a string that prefixes any informational, warning, or error messages generated by the operator.

-debug yes | no | verbose
A setting of -debug yes causes the operator to ignore errors in the SAS program and continue execution of the application. This allows your application to generate output even if a SAS step has an error. By default, the setting is -debug no. Setting -debug verbose is the same as -debug yes, but in addition it causes the operator to echo the SAS source code executed by the operator.

-defaultlength length
Specifies the default length, in bytes, for all SAS numeric fields generated by the operator. The value must be between 3 and 8. This option allows you to control the length of SAS numeric fields when you know the range limits of your data. Using smaller lengths reduces the size of your data. You can override the default length for specific fields using the -length option.

-drop field0 field1 ... fieldN
Optionally specifies the fields of the input data set to be dropped. All fields not specified by -drop are written to the output data set. field designates an InfoSphere DataStage data set field name. It is the original name of an input field before any renaming, as performed by the -rename option, or name truncation for input field names longer than eight characters in SAS Version 6. You can also specify a range of fields to drop by specifying two field names separated by a hyphen, in the form fieldN - fieldM. In this case, all fields between, and including, fieldN and fieldM are dropped. This option is mutually exclusive with -keep.
Table 146. sasin Operator Options (continued)

-keep field0 field1 ... fieldN
Optionally specifies the fields of the InfoSphere DataStage data set to be retained on input. All fields not specified by -keep are dropped from input. field designates an InfoSphere DataStage data set field name. It is the original name of an input field before any renaming, as performed by the -rename option, or name truncation for input field names longer than eight characters in SAS Version 6. You can also specify a range of fields to keep by specifying two field names separated by a hyphen, in the form fieldN - fieldM. In this case, all fields between, and including, fieldN and fieldM are retained. This option is mutually exclusive with -drop.

-key field_name [-a | -d] [-cs | -ci]
Specifies a field of the input DataStage data set that you want to use to sort the SAS data set. The key must be the original field name before any renaming, as performed by the -rename option, or name truncation for input field names longer than eight characters for SAS Version 6. This option causes the sasin operator to create a new field in the output data set named sastsort. You specify this field as the sorting key to the tsort operator. Also, you must use a modify operator with the tsort operator to remove this field when it performs the sort. -a and -d specify ascending or descending sort order. The -ci option specifies that the comparison of value keys is case insensitive. The -cs option specifies a case-sensitive comparison, which is the default.
Table 146. sasin Operator Options (continued)

-length integer field0 field1 ... fieldN
Specifies the length, in bytes, of SAS numeric fields generated by the operator. The value must be between 3 and 8. This option allows you to control the length of SAS numeric fields when you know the range limits of your data. Using smaller lengths reduces the size of your data. By default, all SAS numeric fields are 8 bytes long, or the length specified by the -defaultlength option to the operator. The field name is the original name of an input field before any renaming, as performed by the -rename option, or name truncation for input field names longer than eight characters for SAS 6. You can also specify a range of fields by specifying two field names separated by a hyphen, in the form fieldN - fieldM. All fields between, and including, fieldN and fieldM use the specified length.

-rename schema_field_name SAS_field_name
Specifies a mapping of an input field name from the InfoSphere DataStage data set to a SAS variable name. For multiple mappings, use multiple -rename options. By default, InfoSphere DataStage truncates to eight characters the input fields with a name longer than eight characters in SAS Version 6. If truncation causes two fields to have the same name, the operator issues a syntax error. Aliases must be unique with all other alias specifications and with respect to the SAS data set field names.

-report
Causes the operator to output a report describing how it converts the input InfoSphere DataStage data set to the SAS data set.

-sas_cs icu_character_set | DBCSLANG
When your InfoSphere DataStage data includes ustring values, you can use the -sas_cs option to specify a character set that maps between InfoSphere DataStage ustrings and the char data stored in SAS files. Use the same -sas_cs character setting for all the SAS-interface operators in your data flow. See "Using -sas_cs to Specify a Character Set" for more details. For information on national language support, reference this IBM ICU site: http://oss.software.ibm.com/icu/charset
Table 146. sasin Operator Options (continued)

-schema schema_definition
Specifies the record schema of the input InfoSphere DataStage data set. Only those fields specified in the record schema are written to the output data set.
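As an illustration, the following osh fragment is a sketch of a sasin invocation that renames a long field, keeps only the fields of interest, and creates the sastsort key for a downstream tsort; all field names are hypothetical:

... sasin
      -rename account_number acctno
      -keep account_number balance
      -key acctno -a
      -debug yes ...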
sas
sas: properties
Table 147. sas Operator Properties
Number of input data sets: N (set by user). Can be either InfoSphere DataStage data sets or SAS data sets.
Table 147. sas Operator Properties (continued)
Number of output data sets: M (set by user). All output data sets can be either SAS data sets or Parallel SAS data sets. If you are passing output to another sas operator, the data should remain in SAS data set format. If requested, the SAS log file is written to the last output data set. For SAS 8.2, the log file contains a header and additional information at the beginning of the file. If requested, the SAS list file is written to the second to last output data set if you also request a log file output, and to the last data set if no log file is requested.
Input interface schema: derived from the input data set
Output interface schema: for output data sets, as specified by the -schema option or the -schemaFile option when the sasout operator is not used. When the sasout operator is used and the downstream operator expects an SAS data set, the schema is record ( sasData:raw; ). For list and log data sets, the schema is record ( partitionNumber:uint16; recordNumber:uint32; rec:string; )
Execution mode: parallel (default) or sequential
Partitioning method: any parallel mode except modulus
Collection method: any sequential mode
Preserve-partitioning flag in output data set: set on all output data sets; clear on log and list data sets
[-output out_port_# ods_name [-schemaFile schema_file_name (alias -schemaSource) | -schema schema_definition]]
[-debug yes | no | verbose]
Table 148. sas Operator Options
-source
-source SAS_code Specifies the SAS code to be executed by SAS. The SAS code might contain both PROC steps and DATA steps. You must specify either the -source or -sourcefile option.
-sourcefile
-sourcefile SAS_code_filepath Specifies the path to a file containing the SAS source code to be executed. The file path and file contents should be in the UTF-8 character set. You must specify either the -sourcefile or the -source option.
-debug
-debug yes | no | verbose A setting of -debug yes causes the sas operator to ignore errors in the SAS program and continue execution of the application. This allows your application to generate output even if a SAS step has an error. By default, the setting is -debug no, which causes the operator to abort when it detects an error in the SAS program. Setting -debug verbose is the same as -debug yes, but in addition it causes the operator to echo the SAS source code executed by the operator.
-input
[-input in_port_# sas_ds_name]
Specifies the name of the SAS data set, sas_ds_name, receiving its input from the data set connected to in_port_#. The operator uses -input to connect each input data set of the operator to an input of the SAS code executed by the operator. For example, your SAS code contains a DATA step whose input you want to read from input data set 0 of the operator. The following SAS statements might be contained within your code: libname temp '/tmp'; data liborch.parallel_out; set temp.parallel_in; In this case, you would use -input and set the in_port_# to 0, and the sas_ds_name to the member name parallel_in. sas_ds_name is the member name of a SAS data set used as an input to the SAS code executed by the operator. You only need to specify the member name here; do not include any SAS library name prefix. When referencing sas_ds_name as part of the SAS code executed by the operator, always prefix it with liborch, the name of the SAS engine. in_port_# is the number of an input data set of the operator. Input data sets are numbered from 0, thus the first input data set is data set 0, the next is data set 1, and so on.
Table 148. sas Operator Options (continued)

-listds file | dataset | none | display
Optionally specify that sas should generate a SAS list file. Specifying -listds file causes the sas operator to write the SAS list file generated by the executed SAS code to a plain text file in the working directory. The list is sorted before being written out. The name of the list file, which cannot be modified, is orchident.lst, where ident is the name of the operator, including an index in parentheses if there are more than one with the same name. For example, orchsas(1).lst is the list file from the second sas operator in a step. -listds dataset causes the list file to be written to the last output data set. If you also request that the SAS log file be written to a data set using -logds, the list file is written to the second to last output data set. The data set from a parallel sas operator containing the list information is not sorted. If you specify -listds none, the list is not generated. -listds display is the default. It causes the list to be written to standard error.

-logds file | dataset | none | display
Optionally specify that sas write a SAS log file. -logds file causes the operator to write the SAS log file generated by the executed SAS code to a plain text file in the working directory. The name of the log file, which cannot be modified, is orchident.log, where ident is the name of the operator, including its index in parentheses if there are more than one with the same name. For example, orchsas(1).log is the log file from the second sas operator in a step. For SAS 8.2, the log file contains a header and additional information at the beginning of the file. -logds dataset causes the log file to be written to the last output data set of the operator. The data set from a parallel sas operator containing the SAS log information is not sorted. If you specify -logds none, the log is not generated. -logds display is the default. It causes the log to be written to standard error.

-convertlocal
Optionally specify that the conversion phase of the sas operator (from the parallel data set format to Transitional SAS data set format) should run on the same nodes as the sas operator. If this option is not set, the conversion runs by default with the previous operator's degree of parallelism and, if possible, on the same nodes as the previous operator.

-noworkingdirectory or -nowd
Disables the warning message generated by Orchestrate when you omit the -workingdirectory option. If you omit the -workingdirectory argument, the SAS working directory is indeterminate and Orchestrate automatically generates a warning message. See the -workingdirectory option below. The two options are mutually exclusive.
Table 148. sas Operator Options (continued)

-options sas_options
Optionally specify a quoted string containing any options that can be specified to a SAS OPTIONS directive. These options are executed before the operator executes your SAS code. For example, you can use this argument to enable the SAS fullstimer. You can specify multiple options, separated by spaces. By default, the operator executes your SAS code with the SAS options notes and source. Specifying any string for sas_options configures the operator to execute your code using only the specified options. Therefore you must include notes and source in sas_options if you still want to use them.

-output out_port_# ods_name [-schemaFile schema_file_name | -schema schema_definition]
Optionally specify the name of the SAS data set, ods_name, writing its output to the data set connected to out_port_# of the operator. The operator uses -output to connect each output data set of the operator to an output of the SAS code executed by the operator. For example, your SAS code contains a DATA step whose output you want to write to output data set 0 of the operator. Here is the SAS expression contained within your code: data liborch.parallel_out; In this case, you would use -output and set the out_port_# to 0, and the ods_name to the member name parallel_out. ods_name corresponds to the name of a SAS data set used as an output by the SAS code executed by the operator. You only need to specify the member name here; do not include any SAS library name prefix. When referencing ods_name as part of the SAS code executed by the operator, always prefix it with liborch, the name of the SAS engine. out_port_# is the number of an output data set of the operator. Output data sets are numbered starting from 0. You use the -schemaFile suboption to specify the name of a SAS file containing the metadata column information which InfoSphere DataStage uses to generate an InfoSphere DataStage schema; or you use the -schema suboption followed by the schema definition. If both the -schemaFile option and the -sas_cs option are set, all of your SAS char fields are converted to InfoSphere DataStage ustring values. If the -sas_cs option is not set, all of your SAS char values are converted to InfoSphere DataStage string values. To obtain a mixture of string and ustring values use the -schema option. See "Specifying an Output Schema" for more details. Note: The -schemaFile and -schema suboptions are designated as optional because you should not specify them for the sas operator if you specify them for the sasout operator. It is an error to specify them for both the sas and sasout operators.
Table 148. sas Operator Options (continued)

-sas_cs icu_character_set | DBCSLANG
When your InfoSphere DataStage data includes ustring values, you can use the -sas_cs option to specify a character set that maps between InfoSphere DataStage ustrings and the char data stored in SAS files. Use the same -sas_cs character setting for all the SAS-interface operators in your data flow. See "Using -sas_cs to Specify a Character Set" for more details. For information on national language support, reference this IBM ICU site: http://oss.software.ibm.com/icu/charset

-workingdirectory dir_name or -wd dir_name
Specifies the name of the working directory on all processing nodes executing the SAS application. All relative pathnames in your application are relative to the specified directory. If you omit this argument, the directory is indeterminate and Orchestrate generates a warning message. You can use -noworkingdirectory to disable the warning. This option also determines the location of the file config.sasversion. By default, the operator searches the directory specified by -workingdirectory, then your home directory, then the SAS install directory for config.sasversion. Legal values for dir_name are fully qualified pathnames (which must be valid on all processing nodes) or "." (period), corresponding to the name of the current working directory on the workstation invoking the application. Relative pathnames for dir_name are illegal.
For example, if the InfoSphere DataStage SAS data set contains a SAS numeric field named a_field that you want to convert to an int16, you include the following line as part of the sasout record schema:
record ( ... a_field:int16; ... )
If you want to convert the field to a decimal, you would use the appropriate decimal definition, including precision and scale. When converting a SAS numeric field to an InfoSphere DataStage numeric, you can get a numeric overflow or underflow if the destination data type is too small to hold the value of the SAS field. By default, InfoSphere DataStage issues an error message and aborts the program if this occurs. However, if the record schema passed to sasout defines a field as nullable, the numeric overflow or underflow does not cause an error. Instead, the destination field is set to null and processing continues.
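For example, to convert the same field to a decimal with precision 10 and scale 2 (illustrative values), and to make it nullable so that an overflow or underflow produces a null rather than an error, the sasout record schema line would be:

record ( ... a_field:nullable decimal[10,2]; ... )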
sasout
sasout: properties
Table 149. sasout Operator Properties
Number of input data sets: 1 SAS data set
Number of output data sets: 1 InfoSphere DataStage data set
Input interface schema: none
Output interface schema: as specified by the -schema option or the -schemaFile option
Transfer behavior: none
Execution mode: parallel (default) or sequential
Partitioning method: any (parallel mode)
Collection method: any (sequential mode)
Preserve-partitioning flag in output data set: clear
Terms in italic typeface are option strings you supply. When your option string contains a space or a tab character, you must enclose it in single quotes.
sasout
   -schema schema_definition | -schemaFile filepath
   [-sas_cs icu_character_set | SAS_DBCSLANG]
   [-context prefix_string]
   [-debug no | yes | verbose]
   [-drop | -nodrop]
   [-rename schema_field_name SAS_name ...]
   [-report]

Table 150. sasout Operator Options

-schema schema_definition
Specifies the record schema of the output data set. Only those fields specified in the record schema are written to the output data set. You must specify either the -schema option or the -schemaFile option. See "Specifying an Output Schema" for more details.

-schemaFile filepath
You use the -schemaFile option to specify the name of a SAS file containing the metadata column information which InfoSphere DataStage uses to generate an InfoSphere DataStage schema; or you use the -schema option followed by the schema definition. See "Specifying an Output Schema" for more details. If both the -schemaFile option and the -sas_cs option are set, all of your SAS char fields are converted to InfoSphere DataStage ustring values. If the -sas_cs option is not set, all of your SAS char values are converted to InfoSphere DataStage string values. To obtain a mixture of string and ustring values use the -schema option.

-debug yes | no | verbose
A setting of -debug yes causes the operator to ignore errors in the SAS program and continue execution of the application. This allows your application to generate output even if a SAS step has an error. By default, the setting is -debug no, which causes the operator to abort when it detects an error in the SAS program. Setting -debug verbose is the same as -debug yes, but in addition it causes the operator to echo the SAS source code executed by the operator.

-drop
Specifies that sasout drop any input fields not included in the record schema. This is the default action of sasout. You can use the -nodrop option to cause sasout to pass all input fields to the output data set. The -drop and -nodrop options are mutually exclusive.
598
Table 150. sasout Operator Options (continued) Option -nodrop Use -nodrop Specifies the failure of the step if there are fields in the input data set that are not included in the record schema passed to sasout. You can use the -drop option to cause sasout to drop all input fields not included in the record schema. -rename -rename in_field_name ... Specifies a mapping of an input field name from the SAS data set to an InfoSphere DataStage data set field name. For multiple mappings, use multiple -rename options. Aliases must be unique with all other alias specifications and with respect to the SAS data set field names. -report -report Causes the operator to output a report describing how it converts the input SAS data set to the InfoSphere DataStage data set. -sas_cs -sas_cs icu_character_set | DBCSLANG When your InfoSphere DataStage data includes ustring values, you can use the -sas_cs option to specify a character set that maps between InfoSphere DataStage ustrings and the char data stored in SAS files. Use the same -sas_cs character setting for all the SAS-interface operators in your data flow. See "Using -sas_cs to Specify a Character Set" for more details. For information on national language support, reference this IBM ICU site: http://oss.software.ibm.com/icu/charset
sascontents
sascontents: properties
Table 151. sascontents Operator Properties
Number of input data sets: 1 self-describing data set
Number of output data sets: 1 or 2
  output 0: generated report
  output 1: an optional copy of the input data set with no record modification
Input interface schema: none
Output interface schema: none
Transfer behavior: input to output with no record modification
Execution mode: parallel (default) or sequential
Partitioning method: same (cannot be overridden)
Preserve-partitioning flag in output data set:
  output 0: not applicable (there is a single partition)
  output 1: set
Table 152. sascontents Operator Options

-name name_string_for_report
Optionally specifies a name string, typically the name of the data set. The name is used to make the generated report identifiable.

-recordcount
Optionally specifies the inclusion of record and partition counts in the report.

-schema
Optionally configures the operator to generate an InfoSphere DataStage record schema for the self-describing input data set instead of a report.
Example reports
The names in these examples can contain a maximum of 8 characters for SAS 6.12 and a maximum of 32 characters for SAS 8.2.
v Report with no options specified:
Version: 4    Created 2003-05-17 15:22:34
Field Count: 2    Record Length: 13

Number  Name  Type  Len  Offset  Format  Label
1       A     Num   8    0       6.      a
2       B     Char  5    8               b
v Report with the -schema, -recordcount and -name options specified. The name value is 'my data set':
Suggested schema for SAS dataset my data set
Number of partitions = 4
Number of data records = 10
Suggested schema for parallel dataset sasin[o0].v
record (A:int32; B:string[5])
Field descriptor records = 8
Total records in dataset = 22
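A hedged sketch of one possible osh invocation that could produce a report like the one above (the input data set and output names are illustrative, not taken from this example):

$ osh "sascontents -name 'my data set' -recordcount -schema < sas_data.ds > report.ds"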
Procedure
1. Set your ORACLE_HOME environment variable to your Oracle client installation, which must include the Oracle Database Utilities and the Oracle network software.
2. From your $APT_ORCHHOME/install directory, execute this command:
./install.liborchoracle
3. Verify that the Oracle library links in your $APT_ORCHHOME/lib directory have the correct version.
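For example, on a UNIX system the sequence might look like the following sketch (the ORACLE_HOME path is an assumption for illustration):

$ export ORACLE_HOME=/opt/oracle/client       # assumption: location of your Oracle client installation
$ cd $APT_ORCHHOME/install
$ ./install.liborchoracle
$ ls -l $APT_ORCHHOME/lib | grep -i oracle    # inspect the Oracle library links and their versions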
v The -insert and -update options when referring to an ORCHESTRATE name. For example:
-insert INSERT INTO tablename (A#,B$#) VALUES (ORCHESTRATE.A__035__, ORCHESTRATE.B__036____035__)
-update UPDATE tablename set B$# = ORCHESTRATE.B__036____035__ WHERE (A# = ORCHESTRATE.A__035__)
oraread
oraread: properties
Table 153. oraread Operator Properties
Number of input data sets: 0
Number of output data sets: 1
Input interface schema: none
Output interface schema: determined by the SQL query
Transfer behavior: none
Execution mode: sequential (default) or parallel
Partitioning method: not applicable
Collection method: not applicable
Preserve-partitioning flag in output data set: clear
Composite operator: no
Operator action
Here are the chief characteristics of the oraread operator:
v It reads only from non-partitioned and range-partitioned tables when in parallel mode.
v The operator functions sequentially, unless you specify the -part suboption of the -query or -table option.
v You can direct it to run in specific node pools. See "Where the oraread Operator Runs".
v It translates the query's result set (a two-dimensional array) row by row to an InfoSphere DataStage data set, as discussed in "Column Name Conversion".
v Its output is a data set that you can use as input to a subsequent operator.
v Its translation includes the conversion of Oracle data types to Orchestrate data types, as listed in "Data Type Conversion".
v The size of Oracle rows can be greater than that of InfoSphere DataStage records. See "Oracle Record Size".
v The operator specifies either an Oracle table to read or an SQL query to carry out. See "Specifying the Oracle Table Name" and "Specifying an SQL SELECT Statement".
v It optionally specifies commands to be run on all processing nodes before the read operation is performed and after it has completed.
v You can perform a join operation between InfoSphere DataStage data sets and Oracle data. See "Join Operations".
Note: An RDBMS such as Oracle does not guarantee deterministic ordering behavior unless an SQL operation constrains it to do so. Thus, if you read the same Oracle table multiple times, Oracle does not guarantee delivery of the records in the same order every time. While Oracle allows you to run queries against tables in parallel, not all queries should be run in parallel. For example, some queries perform an operation that must be carried out sequentially, such as a grouping operation or a non-collocated join operation. These types of queries should be executed sequentially in order to guarantee correct results.
(Example configuration file: it defines an Oracle resource pool named oracle_pool containing node0 and node1.) In this case, the option -server oracle_pool configures the operator to execute only on node0 and node1.
Note: Data types that are not listed in the table above generate an error.
You can specify optional parameters to narrow the read operation. They are as follows:
v The selectlist specifies the columns of the table to be read; by default, InfoSphere DataStage reads all columns.
v The filter specifies the rows of the table to exclude from the read operation; by default, InfoSphere DataStage reads all rows.
You can optionally specify -open and -close option commands. These commands are executed by Oracle on every processing node containing a partition of the table before the table is opened and after it is closed. See "Example 1: Reading an Oracle Table and Modifying a Field Name" for an example of using the table option.
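For example, a hedged osh sketch that narrows the read with a select list and a WHERE-clause conjunction (the table and column names are hypothetical):

$ osh "oraread -table table_1 -selectlist 'itemNum, price' -filter 'price > 5.00'
       -dboptions {user = user101, password = userPword} ..."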
Join operations
You can perform a join operation between InfoSphere DataStage data sets and Oracle data. First invoke the oraread operator and then invoke either the lookup operator or a join operator. See "Lookup Operator" and "The Join Library". Alternatively, you can use the oralookup operator described in "The oralookup Operator".
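A minimal sketch of this pattern, with hypothetical table, key, and data set names (see "Lookup Operator" for the exact lookup interface; this is not a definitive invocation):

$ osh "oraread -table customers -dboptions {user = user101, password = userPword} > customers.ds"
$ osh "lookup -key custID < orders.ds < customers.ds > joined.ds"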
You must specify either the -query or -table option. You must also specify -dboptions and supply the necessary information. The following table lists the options.
Table 155. oraread Operator Options

-close close_command
Specify a command, enclosed in single quotes, to be parsed and executed by Oracle on all processing nodes after InfoSphere DataStage completes processing the Oracle table and before it disconnects from Oracle. If you do not specify a close_command, InfoSphere DataStage terminates its connection to Oracle. There is no default close_command. You can include an Oracle stored procedure as part of close_command, in the form:
"execute procedureName;"

-db_cs icu_character_set
Specify an ICU character set to map between Oracle char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to Oracle. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, the -nchar option must also be specified. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset

-nchar_cs icu_character_set
Specify an ICU character set to map between Oracle nchar and nvarchar2 values and InfoSphere DataStage ustring data. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, the -db_cs option must also be specified. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset
-dboptions '{user=username, password=password}' | '{user=\'@filename\'}' [arraysize = num_records]
Specify either a user name and password for connecting to Oracle or a file containing the user name and password. You can optionally use -arraysize to specify the number of records in each block of data read from Oracle. InfoSphere DataStage reads records in as many blocks as required to read all source data from Oracle. By default, the array size is 1000 records. You can modify this parameter to tune the performance of your application.

-open open_command
Specify a command, enclosed in single quotes, to be parsed and executed by Oracle on all processing nodes before the table is opened.

-ora8partition partition_name
Specify the name of the specific Oracle table partition that contains the rows you want to read.

-part table_name
Specifies running the read operator in parallel on the processing nodes in the default node pool. The default execution mode is sequential. The table_name must be the name of the table specified by -table.

-query sql_query [-part table_name]
Specify an SQL query, enclosed in single quotes, to read a table. The query specifies the table and the processing that you want to perform on the table as it is read into InfoSphere DataStage. This statement can contain joins, views, database links, synonyms, and so on.

-server remote_server_name
remote_server_name must specify a remote connection. To specify a local connection, set your ORACLE_SID environment variable to a local server. See "Where the oraread Operator Runs" for more information.
-table table_name [-filter filter] [-selectlist list]
Specify the name of the Oracle table. The table must exist and you must have SELECT privileges on the table. The table name might contain a view name only if the operator executes sequentially. If your Oracle user name does not correspond to that of the owner of the table, prefix the table name with that of the table owner in the form: tableowner.table_name
The -filter suboption optionally specifies a conjunction, enclosed in single quotes, to the WHERE clause of the SELECT statement to specify the rows of the table to include or exclude from reading into InfoSphere DataStage. See "Targeting the Read Operation" for more information.
The suboption -selectlist optionally specifies an SQL select list, enclosed in single quotes, that can be used to determine which fields are read. You must specify the fields in list in the same order as the fields are defined in the record schema of the input table.
(Figure: an Oracle table with columns itemNum (101, 101, 220), price (1.95, 1.95, 5.95), and storeID (26, 34, 26) is read by oraread within a step, passed through a modify operator that maps field1 = itemNum, and then to a sample operator whose input interface schema is field1:int32; in:*;.)
The Oracle table contains three columns whose data types the operator converts as follows:
v itemNum of type NUMBER[3,0] is converted to int32
v price of type NUMBER[6,2] is converted to decimal[6,2]
v storeID of type NUMBER[2,0] is converted to int32
The schema of the InfoSphere DataStage data set created from the table is also shown in this figure. Note that the InfoSphere DataStage field names are the same as the column names of the Oracle table. However, the operator to which the data set is passed has an input interface schema containing the 32-bit integer field field1, while the data set created from the Oracle table does not contain a field of the same name. For this reason, the modify operator must be placed between oraread and the sample operator to translate the name of the field, itemNum, to the name field1. See "Modify Operator" for more information. Here is the osh syntax for this example:
$ modifySpec="field1 = itemNum;" $ osh "oraread -table table_1 -dboptions {user = user101, password = userPword} | modify $modifySpec | ... "
Example 2: reading from an Oracle table in parallel with the query option
The following query configures oraread to read the columns itemNum, price, and storeID from table_1 in parallel. One instance of the operator runs for each partition of the table. Here is the osh syntax for this example:
$ osh "oraread -query select itemNum, price, storeID from table_1 -part table_1 -dboptions {user = user101, password = userPword} ..."
orawrite
orawrite: properties
Table 156. orawrite Operator Properties
Number of input data sets: 1
Number of output data sets: 0
Input interface schema: derived from the input data set
Output interface schema: none
Transfer behavior: none
Execution mode: parallel (default) or sequential
Partitioning method: same (Note that you can override this partitioning method. However, a partitioning method of entire is not allowed.)
Collection method: any
Preserve-partitioning flag in output data set: not applicable
Composite operator: yes
Operator action
Here are the main characteristics of the orawrite operator:
v It translates an InfoSphere DataStage data set record by record to an Oracle table using Oracle's SQL*Loader Parallel Direct Path Load method.
v Translation includes the conversion of InfoSphere DataStage data types to Oracle data types. See "Data Type Conversion".
v The operator appends records to an existing table, unless you set another mode of writing. See "Write Modes".
v When you write to an existing table, the schema of the table determines the operator's input interface, and the input data set schema must be compatible with the table's schema. See "Matched and Unmatched Fields".
v Each instance of a parallel write operator running on a processing node writes its partition of the data set to the Oracle table. If any instance fails, all fail.
You can optionally specify Oracle commands to be parsed and executed on all processing nodes before the write operation runs or after it completes.
Indexed tables
Because the Oracle write operator writes to a table using the Parallel Direct Path Load method, the operator cannot write to a table that has indexes defined on it unless you include either the -index rebuild or the -index maintenance option. If you want to write to such a table and not use an -index option you must first delete the indexes and recreate them after orawrite has finished processing. The operator checks for indexes and returns an error if any are found. You can also write to an indexed table if the operator is run in sequential mode and the environment variable APT_ORACLE_LOAD_OPTIONS is set to 'OPTIONS (DIRECT=TRUE, PARALLEL=FALSE)'. Note that if you define the environment variable APT_ORACLE_LOAD_OPTIONS, InfoSphere DataStage allows you to attempt to write to an indexed table, regardless of how the variable is defined.
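For example, a minimal sketch of loading an indexed table sequentially by setting the load options (the table name is hypothetical; in osh, [seq] after an operator requests sequential execution):

$ export APT_ORACLE_LOAD_OPTIONS='OPTIONS (DIRECT=TRUE, PARALLEL=FALSE)'
$ osh "... | orawrite -table indexed_table -dboptions {user = user101, password = userPword} [seq]"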
(Remaining rows of the data type conversion table:)
string, 2096 bytes < length: not supported
time: DATE (does not support microsecond resolution)
timestamp: DATE (does not support microsecond resolution)

* The default length of VARCHAR is 32 bytes. This means all records of the table have 32 bytes allocated for each variable-length string field in the input data set. If an input field is longer than 32 bytes, the operator issues a warning. The -stringlength option modifies the default length.
InfoSphere DataStage data types not listed in this table generate an error. Invoke the modify operator to perform type conversions. See "Modify Operator". All InfoSphere DataStage and Oracle integer and floating-point data types have the same range and precision, and you need not worry about numeric overflow or underflow.
Write modes
The write mode of the operator determines how the records of the data set are inserted into the destination table. The write mode can have one of the following values:
v append: This is the default mode. The table must exist and the record schema of the data set must be compatible with the table. The write operator appends new rows to the table. The schema of the existing table determines the input interface of the operator.
v create: The operator creates a new table. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. The schema of the new table is determined by the schema of the InfoSphere DataStage data set. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other non-standard way, you can use the -createstmt option with your own create table statement.
v replace: The operator drops the existing table and creates a new one in its place. If a table exists of the same name as the one you want to create, it is overwritten. The schema of the new table is determined by the schema of the InfoSphere DataStage data set.
v truncate: The operator retains the table attributes but discards existing records and appends new ones. The schema of the existing table determines the input interface of the operator.
Note: If a previous write operation fails, you can retry your application specifying a write mode of replace to delete any information in the output table that might have been written by the previous attempt to run your program.
Each mode requires the specific user privileges shown in the table below:
Write Mode: Required Privileges
append: INSERT on existing table
create: TABLE CREATE
replace: INSERT and TABLE CREATE on existing table
truncate: INSERT on existing table
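For instance, a minimal osh sketch (the table name is hypothetical) that uses truncate mode to discard existing rows while keeping the table definition:

$ osh "... op1 | orawrite -table table_4 -mode truncate -dboptions {user = user101, password = userPword}"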
Table 157. orawrite Operator Options (continued)

-createstmt create_statement
Creates a table. You use this option only in create and replace modes. You must supply the create table statement, otherwise InfoSphere DataStage attempts to create it based on simple defaults. Here is an example create statement:
'create table test1 (A number, B char(10), primary key (A))'
You do not add the ending semicolon, as would normally be acceptable with an SQL statement.
When writing to a multibyte database, specify chars and varchars in bytes, with two bytes for each character. This example specifies 10 characters:
'create table orch_data(col_a varchar(20))'
When specifying nchar and nvarchar2 column types, specify the size in characters. This example specifies 10 characters:
'create table orch_data(col_a nvarchar2(10))'
This option is mutually exclusive with the -primaryKeys option.

-db_cs icu_character_set
Specify an ICU character set to map between Oracle char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to Oracle. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, you must also specify the -nchar option. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset
-nchar_cs icu_character_set
Specify an ICU character set to map between Oracle nchar and nvarchar2 values and InfoSphere DataStage ustring data. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, you must also specify the -db_cs option. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset

-dboptions '{user=username, password=password}' | '{user=\'@filename\'}'
Specify either a user name and password for connecting to Oracle or a file containing the user name and password. These options are required by the operator.

-disableConstraints
Disables all enabled constraints on a table, and then attempts to enable them again at the end of a load. When disabling the constraints, the cascade option is included. If Oracle cannot enable all the constraints, warning messages are displayed. If the -exceptionsTable option is included, ROWID information on rows that violate constraints is inserted into an exceptions table. In all cases, the status of all constraints on the table is displayed at the end of the operator run. This option applies to all write modes.

-drop
Specify dropping unmatched fields of the InfoSphere DataStage data set. An unmatched field is a field for which there is no identically named field in the Oracle table.
-exceptionsTable exceptionsTableName
Specify an exceptions table. exceptionsTableName is the name of a table where record ROWID information is inserted if a record violates a table constraint when an attempt is made to enable the constraint. This table must already exist. It is not created by the operator. Your Oracle installation should contain a script that can be executed to create an exceptions table named exceptions. This option can be included only in conjunction with the -disableConstraints option.

-index rebuild -computeStatus -nologging | -maintenance
Lets you perform a direct parallel load on an indexed table without first dropping the index. You can choose either the -maintenance or -rebuild option, although special rules apply to each (see below). The -index option is applicable only in append and truncate write modes, and in create mode only if a -createstmt option is provided.
rebuild: Skips index updates during table load and instead rebuilds the indexes after the load is complete using the Oracle alter index rebuild command. The table must contain an index, and the indexes on the table must not be partitioned. The -computeStatus and the -nologging options can be used only with the -index rebuild command. The -computeStatus option adds Oracle's COMPUTE STATISTICS clause to the -index rebuild command; and the -nologging option adds Oracle's NOLOGGING clause to the -index rebuild command.
maintenance: Results in each table partition being loaded sequentially. Because of the sequential load, the table index that exists before the table is loaded is maintained after the table is loaded. The table must contain an index and be partitioned, and the index on the table must be a local range-partitioned index that is partitioned according to the same range values that were used to partition the table. Note that in this case sequential means sequential per partition, that is, the degree of parallelism is equal to the number of partitions.
This option is mutually exclusive with the -primaryKeys option.
-mode create | replace | append | truncate
Specify the write mode of the operator.
v append (default): New records are appended to the table.
v create: Create a new table. InfoSphere DataStage reports an error if the Oracle table already exists. You must specify this mode if the Oracle table does not exist.
v truncate: The existing table attributes (including schema) and the Oracle partitioning keys are retained but existing records are discarded. New records are then appended to the table.
v replace: The existing table is dropped and a new table is created in its place. Oracle uses the default partitioning method for the new table.
See "Write Modes" for a table of required privileges for each mode.

-nologging
This option adds Oracle's NOLOGGING clause to the -index rebuild command. This option and the -computeStats option must be used in combination with the -index rebuild option as shown below:
-index rebuild -computeStats -nologging

-open open_command
Specifies any command, enclosed in single quotes, to be parsed and run by the Oracle database on all processing nodes. InfoSphere DataStage causes Oracle to run this command before opening the table. There is no default open_command. You can include an Oracle stored procedure as part of open_command in the form:
"execute procedureName;"

-ora8partition ora8part_name
Name of the Oracle 8 table partition that records are written to. The operator assumes that the data provided is for the partition specified.
-primaryKeys fieldname_1, fieldname_2, ... fieldname_n
This option can be used only with the create and replace write modes. The -createstmt option should not be included. Use the external column names for this option. By default, no columns are primary keys. This clause is added to the create table statement:
constraint tablename_PK primary key (fieldname_1, fieldname_2, ... fieldname_n)
The operator names the primary key constraint "tablename_PK", where tablename is the name of the table. If the -disableConstraints option is not also included, the primary key index will automatically be rebuilt at the end of the load. This option is mutually exclusive with the -index and -createstmt options.

-server remote_server_name
remote_server_name must specify a remote connection. To specify a local connection, set your ORACLE_SID environment variable to a local server.

-stringlength string_len
Set the default string length for variable-length strings written to an Oracle table. If you do not specify a length, InfoSphere DataStage uses a default size of 32 bytes. Variable-length strings longer than the set length cause an error. You can set the maximum length up to 2000 bytes. Note that the operator always allocates string_len bytes for a variable-length string. In this case, setting string_len to 2000 allocates 2000 bytes. Set string_len to the expected maximum length of your longest string.

-table table_name
Specify the name of the Oracle table. If you are writing to the table, and the output mode is create, the table must not exist. If the output mode is append, replace, or truncate, the table must exist. The Oracle write operator cannot write to a table that has indexes defined on it. If you want to write to such a table, you must first delete the index(es) and recreate them after orawrite completes. The operator checks for indexes and returns an error if one is found.
-truncate
Configure the operator to truncate InfoSphere DataStage field names to 30 characters. Oracle has a limit of 30 characters for column names. By default, if an InfoSphere DataStage field name is too long, InfoSphere DataStage issues an error. The -truncate option overrides this default.

-useNchar
Allows the creation of tables with nchar and nvarchar2 fields. This option has an effect only when the -mode option is create. If you do not specify the -useNchar option, the table is created with char and varchar fields.
(Figure: an input data set is written by orawrite to an Oracle table with columns itemNumber (number[10,0]), price (number[6,2]), and storeID (number[5,0]).)
The record schema of the InfoSphere DataStage data set and the row schema of the Oracle table correspond to one another, and field and column names are identical. Here are the input InfoSphere DataStage record schema and output Oracle row schema:
Note that since the write mode defaults to append, the mode option does not appear in the command.
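A minimal osh sketch consistent with this example (append is the default mode, so -mode is omitted; the table and user names are illustrative):

$ osh "... op1 | orawrite -table table_2 -dboptions {user = user101, password = userPword}"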
(Figure: an input data set with fields age, zip, and income is written with mode = create, producing an Oracle table with columns age (number[5,0]), zip (char[5]), and income (number).)
The orawrite operator creates the table, giving the Oracle columns the same names as the fields of the input InfoSphere DataStage data set and converting the InfoSphere DataStage data types to Oracle data types. See "Data Type Conversion" for a list of InfoSphere DataStage-to-Oracle conversions.
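A hedged sketch of the corresponding osh command (the table name is illustrative):

$ osh "... op1 | orawrite -table table_5 -mode create -dboptions {user = user101, password = userPword}"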
(Figure: the input data set is passed through a modify operator and then written by orawrite to an Oracle table with columns itemNumber (number[10,0]), price (number[6,2]), and storeID (number[5,0]).)
In this example, you use the modify operator to:
v Translate field names of the input data set to the names of corresponding fields of the operator's input interface schema, that is, skewNum to itemNum and store to storeID.
v Drop the unmatched rating field, so that no error occurs.
Note that InfoSphere DataStage performs automatic type conversion of store, promoting its int8 data type in the input data set to int16 in the orawrite input interface. Here is the osh syntax for this operator:
$ modifySpec="itemNum = skewNum, storeID = store;drop rating" $ osh "... op1 | modify $modifySpec | orawrite -table table_2 -dboptions {user = user101, password = userPword}"
This operator receives a single data set as input and writes its output to an Oracle table. You can request an optional output data set that contains the records that fail to be inserted or updated. The output reject link sqlcode field contains an int32 value which corresponds to the Oracle error code for the record. By default, oraupsert uses Oracle host-array processing to optimize the performance of inserting records.
oraupsert
oraupsert: properties
Table 158. oraupsert Properties
Number of input data sets: 1
Number of output data sets: by default, none; 1 when you select the -reject option
Input interface schema: derived from your insert and update statements
Transfer behavior: Rejected update records are transferred to an output data set when you select the -reject option.
Execution mode: parallel by default, or sequential
Partitioning method: same (You can override this partitioning method; however, a partitioning method of entire cannot be used.)
Collection method: any
Combinable operator: yes
Operator Action
Here are the main characteristics of oraupsert:
v An -update statement is required. The -insert and -reject options are mutually exclusive with using an -update option that issues a delete statement.
v If an -insert statement is included, the insert is executed first. Any records that fail to be inserted because of a unique-constraint violation are then used in the execution of the update statement.
v InfoSphere DataStage uses host-array processing by default to enhance the performance of insert array processing. Each insert array is executed with a single SQL statement. Update records are processed individually.
v You use the -insertArraySize option to specify the size of the insert array. For example:
-insertArraySize 250
The default length of the insert array is 500. To direct InfoSphere DataStage to process your insert statements individually, set the insert array size to 1:
-insertArraySize 1
v Your record fields can be variable-length strings. You can specify a maximum length or use the default maximum length of 80 characters. This example specifies a maximum length of 50 characters:
record(field1:string[max=50])
v When an insert statement is included and host array processing is specified, an InfoSphere DataStage update field must also be an InfoSphere DataStage insert field.
v The oraupsert operator converts all values to strings before passing them to Oracle. The following InfoSphere DataStage data types are supported:
- int8, uint8, int16, uint16, int32, uint32, int64, and uint64
- dfloat and sfloat
- decimal
- strings of fixed and variable length
- timestamp
- date
v By default, oraupsert produces no output data set. By using the -reject option, you can specify an optional output data set containing the records that fail to be inserted or updated. Its syntax is:
-reject filename
For a failed insert record, these sqlcodes cause the record to be transferred to your reject data set:
-1400: cannot insert NULL
-1401: inserted value too large for column
-1438: value larger than specified precision allows for this column
-1480: trailing null missing from string bind value
Note: An insert record that fails because of a unique constraint violation (sqlcode of -1) is used for updating. For a failed update record, these sqlcodes cause the record to be transferred to your reject data set:
-1: unique constraint violation
-1401: inserted value too large for column
-1403: update record not found
-1407: cannot update to null
-1438: value larger than specified precision allows for this column
-1480: trailing null missing from string bind value
A -1480 error occurs when a variable-length string is truncated because its length is not specified in the input data set schema and it is longer than the default maximum length of 80 characters. Note: When a record fails with an sqlcode other than those listed above, oraupsert also fails. Therefore, you must back up your Oracle table before running your data flow.
v APT_ORAUPSERT_COMMIT_TIME_INTERVAL: You can reset this variable to change the time interval between Oracle commits; the default value is 2 seconds.
v APT_ORAUPSERT_COMMIT_ROW_INTERVAL: You can reset this variable to change the record interval between Oracle commits; the default value is 5000 records.
Commits are made whenever the time interval period has passed or the row interval is reached, whichever occurs first.
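For example, the following sketch commits at most every 5 seconds or every 10000 rows, whichever comes first (the values are chosen only for illustration):

$ export APT_ORAUPSERT_COMMIT_TIME_INTERVAL=5
$ export APT_ORAUPSERT_COMMIT_ROW_INTERVAL=10000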
Exactly one occurrence of the -dboptions option and exactly one occurrence of the -update option are required. All other options are optional.
Table 159. Oraupsert options

-dboptions '{user=username, password=password}' | '{user=\'@filename\'}'

-db_cs icu_character_set
Specify an ICU character set to map between Oracle char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to Oracle. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, the -nchar option must also be specified. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset
-nchar_cs icu_character_set
Specify an ICU character set to map between Oracle nchar and nvarchar2 values and InfoSphere DataStage ustring data. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, the -db_cs option must also be specified. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset

-insert insert_statement
Optionally specify the insert statement to be executed. This option is mutually exclusive with using an -update option that issues a delete statement.

-insertArraySize n
Specify the size of the insert host array. The default size is 500 records. If you want each insert statement to be executed individually, specify 1 for this option.

-reject filename
If this option is set, records that fail are written to a reject data set. You must designate an output data set for this purpose. This option is mutually exclusive with using an -update option that issues a delete statement.

-server remote_server_name
remote_server_name must specify a remote connection. To specify a local connection, set your ORACLE_SID environment variable to a local server.

-update update_or_delete_statement
Use this required option to specify the update or delete statement to be executed. An example delete statement is:
-update 'delete from tablename where A = ORCHESTRATE.A'
A delete statement cannot be issued when using the -insert or -reject option.

-update_first
Specify this option to have the operator update first and then insert.
osh syntax
$ osh "import -schema record(acct_id:string[6]; acct_balance:dfloat;) -file input.txt | hash -key acct_id | tsort -key acct_id | oraupsert -dboptions {user=apt, password=test} -insert insert into accounts values(ORCHESTRATE.acct_id, ORCHESTRATE.acct_balance) -update update accounts set acct_balance = ORCHESTRATE.acct_balance where acct_id = ORCHESTRATE.acct_id -reject /user/home/reject/reject.ds" Table after dataflow acct_id acct_balance 073587 82.56 873092 67.23 675066 3523.62 566678 2008.56 865544 8569.23 678888 7888.23 995666 75.72
This operator is particularly useful for sparse lookups, that is, where the InfoSphere DataStage data set you are matching is much smaller than the Oracle table. If you expect to match 90% of your data, using the oraread and lookup operators is probably more efficient. Because oralookup can do lookups against more than one Oracle table, it is useful for joining multiple Oracle tables in one query. The -statement option command corresponds to an SQL statement of this form:
select a,b,c from data.testtbl where orchestrate.b = data.testtbl.c and orchestrate.name = "Smith"
The operator replaces each orchestrate.fieldname with a field value, submits the statement containing the value to Oracle, and outputs a combination of Oracle and InfoSphere DataStage data. Alternatively, you can use the -key/-table options interface to specify one or more key fields and one or more Oracle tables. The following osh options specify two keys and a single table:
-key a -key b -table data.testtbl
The resulting InfoSphere DataStage output data set includes the InfoSphere DataStage records and the corresponding rows from each referenced Oracle table. When an Oracle table has a column name that is the same as an InfoSphere DataStage data set field name, the Oracle column is renamed using the following syntax:
APT_integer_fieldname
An example is APT_0_lname. The integer component is incremented when duplicate names are encountered in additional tables. Note: If the Oracle table is not indexed on the lookup keys, the performance of this operator is likely to be poor.
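A minimal osh sketch of the -key/-table form described above (the input data set and output names are hypothetical; only the single-output case is shown, and this is a sketch rather than a definitive invocation):

$ osh "oralookup -dboptions {user = user101, password = userPword}
       -table data.testtbl -key a -key b
       -ifNotFound drop < input.ds > output.ds"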
oralookup
Properties
Table 160. oralookup Operator Properties
Number of input data sets: 1
Number of output data sets: 1; 2 if you include the -ifNotFound reject option
Input interface schema: determined by the query
Output interface schema: determined by the SQL query
Transfer behavior: transfers all fields from input to output
Execution mode: sequential or parallel (default)
Preserve-partitioning flag in output data set: clear
Composite operator: no
You must specify either the -query option or one or more -table options with one or more -key fields. Exactly one occurrence of the -dboptions option is required.
Table 161. oralookup Operator Options

-dboptions '{user=username, password=password}' | '{user=\'@filename\'}'

-db_cs icu_character_set
Specify an ICU character set to map between Oracle char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to Oracle. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, you must also specify the -nchar option. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset

-nchar_cs icu_character_set
Specify an ICU character set to map between Oracle nchar and nvarchar2 values and InfoSphere DataStage ustring data. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. If this option is specified, you must also specify the -db_cs option. InfoSphere DataStage maps your ICU character setting to its Oracle character-set equivalent. See "Mapping Between ICU and Oracle Character Sets" for the details. For information on national language support, reference this IBM ICU site: http://www.oss.software.ibm.com/icu/charset

-ifNotFound fail | drop | reject filename | continue
Optionally determines what action to take in case of a lookup failure. The default is fail. If you specify reject, you must designate an additional output data set for the rejected records.
-query sql_query
Specify an SQL query, enclosed in single quotes, to read a table. The query specifies the table and the processing that you want to perform on the table as it is read into InfoSphere DataStage. This statement can contain joins, views, database links, synonyms, and so on.

-statement 'statement'
A select statement corresponding to this lookup.

-server remote_server
remote_server_name must specify a remote connection. To specify a local connection, set your ORACLE_SID environment variable to a local server.

-table table_name -key1 field1 ...
Specify a table and key field(s) from which a SELECT statement is created. This is equivalent to specifying the SELECT statement directly using the -statement option. You can specify multiple instances of this option.
InfoSphere DataStage prints the lname, fname, and DOB column names and values from the InfoSphere DataStage input data set and also the lname, fname, and DOB column names and values from the Oracle table. If a column name in the Oracle table has the same name as an InfoSphere DataStage output data set schema fieldname, the printed output shows the column in the Oracle table renamed using this format:
APT_integer_fieldname
The DB2 environment variable, DB2INSTANCE, specifies the user name of the owner of the DB2 instance. DB2 uses DB2INSTANCE to determine the location of db2nodes.cfg. For example, if you set DB2INSTANCE to "Mary", the location of db2nodes.cfg is ~Mary/sqllib/db2nodes.cfg. The three methods of specifying the default DB2 database are listed here in order of precedence:
1. The -dbname option of the DB2 Interface read and write operators
2. The InfoSphere DataStage environment variable APT_DBNAME
3. The DB2 environment variable DB2DBDFT
Procedure
1. In your configuration file, include the node for the client and the node for the remote platform where DB2 is installed.
2. Set the client instance name using the -client_instance option.
3. Optionally set the server using the -server option. If the server is not set, you must set the DB2INSTANCE environment variable to the DB2 instance name.
4. Set the remote server database name using the -dbname option.
5. Set the client alias name for the remote database using the -client_dbname option.
6. Set the user name and password for connecting to DB2, using the -user and -password suboptions.
Note: If the name in the -user suboption of the -client_instance option differs from your UNIX user name, use an owner-qualified name when specifying the table for the -table, -part, -upsert, -update, and -query options. The syntax is:
owner_name.table_name
For example:
db2instance1.data23A.
v The upsert and update options when referring to an InfoSphere DataStage name. For example:
-insert INSERT INTO tablename (A#,B$#) VALUES (ORCHESTRATE.A__035__, ORCHESTRATE.B__036____035__)
-update UPDATE tablename set B$# = ORCHESTRATE.B__036____035__ WHERE (A# = ORCHESTRATE.A__035__)
The db2write operator automatically pads with null terminators, and the default pad character for the db2upsert and db2lookup operators is the null terminator. Therefore, you do not need to include the -padchar option when the db2write operator is used to insert rows into the table and the db2upsert or db2lookup operators are used to update or query that table. The -padchar option should not be used if the CHAR column in the table is a CHAR FOR BIT type, and any of the input records contain null terminators embedded within the string or ustring fields.
-ifNotFound reject >| lookup.out | db2load -table table_3 -mode replace -server remote_server_DB2instance_3 -dbname remote_database_2 -client_instance DB2instance_1 -client_dbname alias_database_3 -user username -password passwd"
In this example, db2read reads from table_1 of database_1 in DB2instance_1. The results read from table_1 are then piped to db2lookup to table_2 of remote database_1 in remote_server_DB2instance_2. The records rejected by db2lookup are then piped to db2load where they are put in table_3 of remote_database_3 in remote_server_DB2instance_3.
If your -db_cs ICU character setting is listed in db2_cs.txt, InfoSphere DataStage sets the DB2CODEPAGE environment variable to the corresponding code page. If your ICU character setting is not in db2_cs.txt, you can add it to that file or set the DB2CODEPAGE environment variable yourself. If there is no code-page mapping in db2_cs.txt and DB2CODEPAGE is not set, InfoSphere DataStage uses the DB2 defaults. DB2 converts between your character setting and the DB2 code page, provided that the appropriate language support is set in your operating system; however, no conversion is performed for the data of the db2load operator. Refer to your DB2 documentation for DB2 national language support.
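For example, if your -db_cs setting has no entry in db2_cs.txt, you might set the code page yourself before running the job (1208 is the DB2 code page for UTF-8; confirm the correct value for your character set in the DB2 documentation):

$ export DB2CODEPAGE=1208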
db2read
db2read: properties
Table 163. db2read Operator Properties
Number of input data sets: 0
Number of output data sets: 1
Input interface schema: none
Output interface schema: determined by the SQL query
Transfer behavior: none
Execution mode: determined by options passed to it (see "Targeting the Read Operation")
Partitioning method: not applicable
Collection method: not applicable
Preserve-partitioning flag in output data set: clear
Composite operator: no
Operator action
Here are the chief characteristics of the db2read operator:
v The read operation is carried out on either the default database or a database specified by another means. See "Configuring InfoSphere DataStage Access".
v Its output is a data set that you can use as input to a subsequent operator. The output data set is partitioned in the same way as the input DB2 table.
v It translates the query's result set row by row to an InfoSphere DataStage data set, as described in "Conversion of a DB2 Result Set to an InfoSphere DataStage Data Set".
v The translation includes the conversion of DB2 data types to Orchestrate data types, as listed in "DB2 Interface Operator Data Type Conversions".
v It specifies either a DB2 table to read or an SQL query to carry out. If it specifies a read operation of a table, the operation is executed in parallel, one instance of db2read on every partition of the table. See "Specifying the DB2 Table Name". If it specifies a query, the operation is executed sequentially, unless the query directs it to operate in parallel. See "Explicitly Specifying the Query".
v It optionally specifies commands to be executed on all processing nodes before the read operation is performed and after it has completed.
Note: An RDBMS such as DB2 does not guarantee deterministic ordering behavior unless an SQL operation constrains it to do so. Thus, if you read the same DB2 table multiple times, DB2 does not guarantee delivery of the records in the same order every time. In addition, while DB2 allows you to run queries against tables in parallel, not all queries should be run in parallel. For example, some queries perform an operation that must be carried out sequentially, such as a group-by operation or a non-collocated join operation. These types of queries should be executed sequentially in order to guarantee correct results and finish in a timely fashion.
v The DB2 read operators convert DB2 data types to InfoSphere DataStage data types, as in the next table:
Table 164. DB2 Interface Operator Data Type Conversions
DB2 Data Type: InfoSphere DataStage Data Type
CHAR(n): string[n] or ustring[n]
CHARACTER VARYING(n,r): string[max = n] or ustring[max = n]
DATE: date
DATETIME: time or timestamp with corresponding fractional precision for time. If the DATETIME starts with a year component, the result is a timestamp field. If the DATETIME starts with an hour, the result is a time field.
DECIMAL[p,s]: decimal[p,s] where p is the precision and s is the scale. The maximum precision is 32, and a decimal with floating scale is converted to a dfloat.
DOUBLE-PRECISION: dfloat
FLOAT: dfloat
INTEGER: int32
MONEY: decimal
GRAPHIC(n): string[n] or ustring[n]
VARGRAPHIC(n): string[max = n] or ustring[max = n]
LONG VARGRAPHIC: ustring
REAL: sfloat
SERIAL: int32
SMALLFLOAT: sfloat
SMALLINT: int16
VARCHAR(n): string[max = n] or ustring[max = n]
Note: Data types that are not listed in the table shown above generate an error.
You can specify optional parameters to narrow the read operation. They are as follows:
v -selectlist specifies the columns of the table to be read. By default, InfoSphere DataStage reads all columns.
v -filter specifies the rows of the table to exclude from the read operation. By default, Orchestrate reads all rows.
Note that the default query's WHERE clause contains the predicate:
nodenumber(colName)=current node
where colName corresponds to the first column of the table or select list. This predicate is automatically generated by Orchestrate and is used to create the correct number of DB2 read operators for the number of partitions in the DB2 table. You can optionally specify the -open and -close options. These commands are executed by DB2 on every processing node containing a partition of the table before the table is opened and after it is closed.
If you do not specify an open command and the read operation is carried out in parallel, InfoSphere DataStage runs the following default command:
lock table table_name in share mode;
This command locks the table until InfoSphere DataStage finishes reading the table. The table cannot be written to when it is locked; however, if you specify an explicit open command, the lock statement is not run. When DB2 is accessed sequentially, InfoSphere DataStage provides no default open command. The close command is executed by the operator after InfoSphere DataStage finishes reading the DB2 table and before it disconnects from DB2. If you do not specify a close command, the connection is immediately terminated after InfoSphere DataStage finishes reading the DB2 table. If DB2 has been accessed in parallel and -close is not specified, InfoSphere DataStage releases the table lock obtained by the default open_command.
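For example, a hedged sketch that supplies the lock statement explicitly as the open command (the table and database names are hypothetical), which preserves the default locking behavior while letting you control it yourself:

$ osh "db2read -table table_1 -open 'lock table table_1 in share mode' -dbname my_db ..."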
Table 165. db2read Options (continued)

-db_cs character_set
Specify the character set to map between DB2 char and Varchar values and InfoSphere DataStage ustring schema types and to map SQL statements for output to DB2. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data. For information on national language support, reference this IBM ICU site: http://oss.software.ibm.com/icu/charset

-dbname database_name
Specifies the name of the DB2 database to access. By default, the operator uses the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise. Specifying -dbname overrides APT_DBNAME and DB2DBDFT. See the DataStage 7.5 Installation and Administration Manual for information on creating the DB2 configuration.

-open open_command
Specifies any command to be parsed and executed by DB2. InfoSphere DataStage causes DB2 to execute this command on all processing nodes to be accessed before opening the table. See "Specifying Open and Close Commands" for more details.

-query sql_query [-part table_name]
Specifies an SQL query to read one or more tables. The query specifies the tables and the processing that you want to perform on the tables as they are read into InfoSphere DataStage. This statement can contain joins, views, database links, synonyms, and so on. The -part suboption specifies execution of the query in parallel on the processing nodes containing a partition of table_name. If you do not specify -part, the operator executes the query sequentially on a single node. If the name in the -user suboption of the -client_instance option differs from your UNIX user name, use an owner-qualified name when specifying the table. The syntax is: owner_name.table_name
The default open command for a parallel query uses the table specified by the -part option. Either the -table or -query option must be specified.
-server server_name
Optionally specifies the DB2 instance name for the table or query read by the operator. By default, InfoSphere DataStage uses the setting of the DB2INSTANCE environment variable. Specifying -server overrides DB2INSTANCE.

-table table_name [-filter filter] [-selectlist list]
-table specifies the name of the DB2 table. The table must exist and you must have SELECT privileges on the table. If your DB2 user name does not correspond to the name of the owner of table_name, you can prefix table_name with a table owner in the form: table_owner.table_name
If you do not include the table owner, the operator uses the following statement to determine your user name:
select user from sysibm.systables;
-filter specifies a conjunction to the WHERE clause of the SELECT statement to specify the rows of the table to include or exclude from reading into InfoSphere DataStage. If you do not supply a WHERE clause, all rows are read. -filter does not support the use of bind variables.
-selectlist specifies an SQL select list that determines which columns are read. Select list items must have DB2 data types that map to InfoSphere DataStage data types. You must specify fields in the same order as they are defined in the record schema of the input table. If you do not specify a list, all columns are read. The -selectlist option does not support the use of bind variables.
Either the -table or -query option must be specified.

-use_strings
Directs InfoSphere DataStage to import DB2 char and Varchar values to InfoSphere DataStage as InfoSphere DataStage strings without converting them from their ASCII or binary form. This option overrides the -db_cs option, which converts DB2 char and Varchar values as ustrings using a specified character set.
The db2read operator writes the table to an InfoSphere DataStage data set. The schema of the data set is also shown in this figure. The data set is the input of the next operator.
(Figure: a DB2 table with columns itemNum (101, 101, 220), price (1.95, 1.95, 5.95), and storeID (26, 34, 26) is read by db2read within a step, producing a data set with the schema itemNum:int32; price:decimal; storeID:int16;, which becomes the input of the next operator.)
The DB2 table contains three columns. Each DB2 column name becomes the name of an InfoSphere DataStage field. The operator converts the data type of each column to its corresponding InfoSphere DataStage data type, as listed in this table:
Column/Field Name    DB2 Data Type    Converted to InfoSphere DataStage Data Type
itemNum              INTEGER          int32
price                DECIMAL          decimal
storeID              SMALLINT         int16
where itemNum is the name of the first column of the table or select list. This predicate is automatically generated by InfoSphere DataStage and is used to create the correct number of DB2 read operators for the number of partitions in the DB2 table.
db2write | db2load
Table 166. db2write and db2load Operator Properties (continued)
Transfer behavior: none
Execution mode: parallel (default) or sequential
Partitioning method: db2part (see "The db2part Operator"). You can override this partitioning method; however, a partitioning method of entire is not allowed.
Collection method: any
Preserve-partitioning flag in output data set: not applicable
Composite operator: no
How InfoSphere DataStage writes the table: the default SQL INSERT statement
You cannot explicitly define an SQL statement for the write operation. Instead, InfoSphere DataStage generates an SQL INSERT statement that writes to the table. However, optional parameters of the write operators allow you to narrow the operation. When you specify the DB2 table name to a write operator, InfoSphere DataStage creates this INSERT statement:
insert into table_name [(selectlist)] values (?,?,?, ...);
where: v table_name specifies the name of a DB2 table. By default, if table_name does not exist, the step containing the write operator terminates, unless you use the -mode create option. To set the table name, you specify the -table option.
v selectlist optionally specifies a clause of the INSERT statement to determine the fields of the data set to be written. If you do not specify selectlist, InfoSphere DataStage writes all fields in the data set. You specify the write operator's -selectlist suboption of -table to set the select list. v ?,?,?, ... contains one input parameter for each column written by InfoSphere DataStage. By default, this clause specifies that all fields in the input data set are written to DB2. However, when you specify a selectlist, the default is modified to contain only the columns defined by the selectlist.
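For illustration, a minimal sketch of such a write (reusing the table_1, my_db, and db2Server names that appear in the append example later in this chapter) might be invoked as:
$ osh "... db2write -table table_1 -dbname my_db -server db2Server ... "
If table_1 has three columns and no -selectlist is given, the generated statement has the form insert into table_1 values (?,?,?); with one input parameter per column.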
Table 168. db2write and db2load Type Conversions (continued)
InfoSphere DataStage Data Type -> DB2 Data Type
int32 -> INTEGER
sfloat -> FLOAT
dfloat -> FLOAT
fixed-length string in the form string[n] and ustring[n]; length <= 254 bytes -> CHAR(n) where n is the string length
fixed-length string in the form string[n] and ustring[n]; 255 <= length <= 4000 bytes -> VARCHAR(n) where n is the string length
variable-length string, in the form string[max=n] and ustring[max=n]; maximum length <= 4000 bytes -> VARCHAR(n) where n is the maximum string length
variable-length string in the form string and ustring -> VARCHAR(32)*
string and ustring, 4000 bytes < length -> Not supported
time -> TIME
timestamp -> TIMESTAMP
* The default length of VARCHAR is 32 bytes. That is, 32 bytes are allocated for each variable-length string field in the input data set. If an input variable-length string field is longer than 32 bytes, the operator issues a warning. You can use the -stringlength option to modify the default length.
InfoSphere DataStage data types not listed in this table generate an error and terminate the step. Use the modify operator to perform type conversions. See "Modify Operator" . Any column in the DB2 table corresponding to a nullable data set field will be nullable. InfoSphere DataStage and DB2 integer and floating-point data types have the same range and precision. You need not worry about numeric overflow or underflow.
Write modes
The write mode of the operator determines how the records of the data set are inserted into the destination table. The write mode can have one of the following values: v append: The table must exist and the record schema of the data set must be compatible with the table. The write operator appends new rows to the table. This is the default mode. The schema of the existing table determines the input interface of the operator. See "Example 1: Appending Data to an Existing DB2 Table" . v create: The operator creates a new table. If a table exists of the same name as the one you want to create, the step that contains the operator terminates with an error. You must specify either create mode or replace mode if the table does not exist. The schema of the new table is determined by the schema of the InfoSphere DataStage data set. v By default, InfoSphere DataStage creates the table on all processing nodes in the default table space and uses the first column in the table, corresponding to the first field in the input data set, as the partitioning key. You can override these options for partitioning keys and table space by means of the -dboptions option. v replace: The operator drops the existing table and creates a new one in its place. If a table exists of the same name as the one you want to create, it is overwritten. DB2 uses the default partitioning method for the new table. The schema of the new table is determined by the schema of the InfoSphere DataStage data set.
v truncate: The operator retains the table attributes but discards existing records and appends new ones. The schema of the existing table determines the input interface of the operator. See "Example 2: Writing Data to a DB2 Table in truncate Mode" . Note: If a previous write operation fails, you can retry your application specifying a write mode of replace to delete any information in the output table that might have been written by the previous attempt to run your program. Each mode requires specific user privileges, as listed in this table:
Table 169. Privileges Required for db2write and db2load Write Modes
append: INSERT on existing table
create: TABLE CREATE
replace: INSERT and TABLE CREATE on existing table
truncate: INSERT on existing table
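As a hedged illustration of choosing a write mode, the -mode option is simply added to the operator invocation; for example, using the my_db database from the other examples in this chapter:
$ osh "... db2load -table table_5 -dbname my_db -mode truncate ... "
This retains the attributes of table_5 but discards its existing rows before loading, and it requires INSERT privilege on the table.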
-client_dbname database -user user_name -password password] [-close close_command] [-cpu integer] (db2load only) [-rowCommitInterval integer] (db2write only) [-db_cs character_set] [-dbname database_name] [-dboptions {[-tablespace = table_space,] [-key = field0, ... -key = fieldN]}] [-drop] [-exceptionTable table_name] (db2load only) [-mode create | replace | append | truncate] [-msgfile msgFile] (db2load only) [-nonrecoverable] (db2load only) [-omitPart] [-open open_command] [-server server_name] [-statistics stats_none | stats_exttable_only | stats_extindex_only | stats_exttable_index | stats_index | stats_table | stats_extindex_table | stats_all | stats_both] (db2load only) [-stringlength string_length] [-truncate] [-truncationLength n]
Table 170. db2write and db2load Options Option -ascii Use -ascii db2load only. Specify this option to configure DB2 to use the ASCII-delimited format for loading binary numeric data instead of the default ASCII-fixed format. This option can be useful when you have variable-length fields, because the database does not have to allocate the maximum amount of storage for each variable-length field. However, all numeric fields are converted to an ASCII format by DB2, which is a CPU-intensive operation. See the DB2 reference manuals for more information.
Table 170. db2write and db2load Options (continued) Option -cleanup Use -cleanup db2load only. Specify this option to deal with operator failures during execution that leave the tablespace it was loading in an inaccessible state. For example, if the following osh command was killed in the middle of execution: db2load -table upcdata -dbname my_db The tablespace in which this table resides will probably be left in a quiesced exclusive or load pending state. In order to reset the state to normal, run the above command specifying -cleanup, as follows: db2load -table upcdata -dbname my_db -cleanup The cleanup procedure neither inserts data into the table nor deletes data from it. You must delete rows that were inserted by the failed execution either through the DB2 command-level interpreter or by running the operator subsequently using the replace or truncate modes. -client_instance -client_instance client_instance_name [-client_dbname database] -user user_name -password password Specifies the client DB2 instance name. This option is required for a remote connection. The -client_dbname suboption specifies the client database alias name for the remote server database. If you do not specify this option, InfoSphere DataStage uses the value of the -dbname option, or the value of the APT_DBNAME environment variable, or DB2DBDFT; in that order. The required -user and -password suboptions specify a user name and password for connecting to DB2. -close -close close_command Specifies any command to be parsed and executed by the DB2 database on all processing nodes after InfoSphere DataStage finishes processing the DB2 table.
Table 170. db2write and db2load Options (continued) Option -cpu Use [-cpu integer] [-anyorder] Specifies the number of processes to initiate on every node. Its syntax is: [-cpu integer] [-anyorder] The default value for the -cpu option is 1. Specifying a 0 value allows db2load to generate an algorithm that determines the optimal number based on the number of CPUs available at runtime. Note that the resulting value might not be optimal because DB2 does not take into account the InfoSphere DataStage workload. The -anyorder suboption allows the order of loading to be arbitrary for every node, potentially leading to a performance gain. The -anyorder option is ignored if the value of the -cpu option is 1. -rowCommitInterval -rowCommitInterval integer db2write only. Specifies the size of a commit segment. Specify an integer that is 1 or larger. The specified number must be a multiple of the input array size. The default size is 2000. You can also use the APT_RDBMS_COMMIT_ROWS environment variable to specify the size of a commit. -db_cs -db_cs character_set Specify the character set to map between DB2 char and Varchar values and InfoSphere DataStage ustring schema types and to map SQL statements for output to DB2. The default character set is UTF-8 which is compatible with your osh jobs that contain 7-bit US-ASCII data. For information on national language support, reference this IBM ICU site: http://oss.software.ibm.com/icu/charset -dbname -dbname database_name Specifies the name of the DB2 database to access. By default, the operator uses the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise. Specifying -dbname overrides APT_DBNAME and DB2DBDFT.
Table 170. db2write and db2load Options (continued) Option -dboptions Use -dboptions { [-tablespace = table_space,] [-key = field ...]} Specifies an optional table space or partitioning key to be used by DB2 to create the table. You can specify this option only when you perform the write operation in either create or replace mode. The partitioning key must be the external column name. If the external name contains # or $ characters, the entire column name should be surrounded by single backslashed quotes. For example: -dboptions '{key=\'B##B$\'}' By default, InfoSphere DataStage creates the table on all processing nodes in the default table space and uses the first column in the table, corresponding to the first field in the input data set, as the partitioning key. You specify arguments to -dboptions as a string enclosed in braces. This string can contain a single -tablespace argument, a single -nodegroup argument, and multiple -key arguments, where: -tablespace defines the DB2 table space used to store the table. -nodegroup specifies the name of the node pool (as defined in the DataStage configuration file) of nodes on which DB2 is installed. -key specifies a partitioning key for the table. -drop -drop Causes the operator to silently drop all input fields that do not correspond to fields in an existing DB2 table. By default, the operator reports an error and terminates the step if an input field does not have a matching column in the destination table. -exceptionTable [-exceptionTable table_name] db2load only. Specifies a table for inserting records which violate load-table constraints. The exceptions table should be consistent with the FOR EXCEPTION option of the db2load utility. Refer to the DB2 documentation for instructions on how to create this table. The -exceptionTable option cannot be used with the create or replace modes because InfoSphere DataStage cannot recreate the table with all its applicable data constraints. When the mode is truncate, the tables for the -table and -exceptionTable options are truncated.
Table 170. db2write and db2load Options (continued) Option -mode Use -mode create | replace | append | truncate Specifies the write mode of the operator. append (default): New records are appended to an existing table. create: Create a new table. InfoSphere DataStage reports an error and terminates the step if the DB2 table already exists. You must specify this mode if the DB2 table does not exist. truncate: The existing table attributes (including schema) and the DB2 partitioning keys are retained, but any existing records are discarded. New records are then appended to the table. replace: The existing table is first dropped and an entirely new table is created in its place. DB2 uses the default partitioning method for the new table. -msgfile -msgfile msgFile db2load only. Specifies the file where the DB2 loader writes diagnostic messages. The msgFile can be an absolute path name or a relative path name. Regardless of the type of path name, the database instance must have read/write privilege to the file. By default, each processing node writes its diagnostic information to a separate file named APT_DB2_LOADMSG_nodenumber where nodenumber is the DB2 node number of the processing node. Specifying -msgFile configures DB2 to write the diagnostic information to the file msgFile_nodenumber. If you specify a relative path name, InfoSphere DataStage attempts to write the files to the following directories, in this order: 1. The file system specified by the default resource scratchdisk, as defined in the InfoSphere DataStage configuration file. 2. The directory specified by the tmpdir parameter. 3. The directory /tmp. -nonrecoverable -nonrecoverable db2load only. Specifies that your load transaction is marked as nonrecoverable. It will not be possible to recover your transaction with a subsequent roll forward action. The roll forward utility skips the transaction, and marks the table into which data was being loaded as "invalid". The utility also ignores any subsequent transactions against the table. After a roll forward is completed, the table can only be dropped. Table spaces are not put in a backup pending state following the load operation, and a copy of the loaded data is not made during the load operation.
Table 170. db2write and db2load Options (continued) Option -open Use -open open_command Specifies any command to be parsed and executed by the DB2 database on all processing nodes. DB2 runs this command before opening the table. -omitPart [-omitPart] Specifies that db2load and db2write should not automatically insert a partitioner. By default, the db2partitioner is inserted. -server -server server_name Optionally specifies the DB2 instance name for the table. By default, InfoSphere DataStage uses the setting of the DB2INSTANCE environment variable. Specifying -server overrides DB2INSTANCE. -statistics [-statistics stats_none | stats_exttable_only | stats_extindex_only | stats_exttable_index | stats_index | stats_table | stats_extindex_table | stats_all | stats_both] db2load only. Specifies which statistics should be generated upon load completion. -stringlength -stringlength string_length Sets the default string length of variable-length strings written to a DB2 table. If you do not specify a length, InfoSphere DataStage uses a default size of 32 bytes. Variable-length strings longer than the set length cause an error. The maximum length you can set is 4000 bytes. Note that the operator always allocates string_length bytes for a variable-length string. In this case, setting string_length to 4000 allocates 4000 bytes for every string. Therefore, you should set string_length to the expected maximum length of your largest string and no larger.
Table 170. db2write and db2load Options (continued) Option -table Use -table table_name [-selectlist list] Specifies the name of the DB2 table. If the output mode is create, the table must not exist and you must have TABLE CREATE privileges. By default, InfoSphere DataStage creates the table without setting partitioning keys. By default, DB2 uses the first column as the partitioning key. You can use -dboptions to override this default operation. If the output mode is append or truncate, the table must exist and you must have INSERT privileges. In replace mode, the table might already exist. In this case, the table is dropped and the table is created in create mode. The table name must not contain a view name. If your DB2 user name does not correspond to the owner of the table, you can optionally prefix table_name with a table owner in the form tableOwner.tableName. The tableOwner string must be 8 characters or less. If you do not specify tableOwner, InfoSphere DataStage uses the following statement to determine your user name: select user from sysibm.systables; -selectlist optionally specifies an SQL select list used to determine which fields are written. The list does not support the use of bind variables. -truncate -truncate Select this option to configure the operator to truncate InfoSphere DataStage field names to 18 characters. To specify a length other than 18, use the -truncationLength option along with this one. See "Field Conventions in Write Operations to DB2" for further information. -truncationLength -truncationLength n Use this option with the -truncate option to specify a truncation length other than 18 for DataStage field names too long to be DB2 column names.
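To tie several of these options together, a sketch of a db2load invocation (the option values here are illustrative, not prescribed) might be:
$ osh "... db2load -table table_1 -dbname my_db -server db2Server -mode replace -stringlength 64 -msgfile loadmsg ... "
This replaces table_1, allocates 64 bytes for each variable-length string, and writes loader diagnostics to files named loadmsg_nodenumber.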
The db2load operator requires that InfoSphere DataStage users, under their InfoSphere DataStage login name, have DBADM privilege on the DB2 database written by the operator.
The following figure shows the data flow and field and column names for this example. Columns that the operator appends to the DB2 table are shown in boldface type. Note that the order of the fields of the InfoSphere DataStage record and that of the columns of the DB2 rows are not the same but are nonetheless written successfully.
(Figure: the input data set, with field storeID containing 09, 57, and 26, is written by db2write | db2load to DB2 table_1; the storeID column of the table then contains 26, 34, 09, 57, and 26.)
$ osh "... db2_write_operator -table table_1 -server db2Server -dbname my_db ... "
(Figure: the input data set, with field storeID containing 18, 75, and 62, is written by db2write | db2load to DB2 table_5; the existing storeID column values 09, 57, and 26 are replaced by 18, 75, and 62.)
(Figure: the input data set has fields code (26, 14, 88) and storeID (18, 75, 62); db2write | db2load runs in truncate mode with the drop flag set, writing to DB2 table_5. The existing storeID column values 09, 57, and 26 are replaced by 18, 75, and 62, and the message 'Input field code dropped because it does not exist in table_5' is issued.)
The next figure shows the write operator writing to the DB2 table. As in the previous example, the order of the columns is not necessarily the same as that of the fields of the data set.
(Figure: the input data set, with field storeID containing 09, 57, and 26, is written by db2write | db2load to DB2 table_6; the storeID column of the table then contains 26, 34, 09, 57, and 26.)
The db2upsert operator takes advantage of the DB2 CLI (Call Level Interface) system to optimize performance. It builds an array of records and then executes them with a single statement. This operator receives a single data set as input and writes its output to a DB2 table. You can request an optional output data set that contains the records that fail to be inserted or updated.
db2upsert
db2upsert: properties
Table 171. db2upsert Operator Properties
Number of input data sets: 1
Number of output data sets: by default, none; 1 when you select the -reject option
Input interface schema: derived from your insert and update statements
Transfer behavior: Rejected update records are transferred to an output data set when you select the -reject option. A state field is added to each record. It contains a five-letter SQL code which identifies the reason the record was rejected.
Execution mode: parallel by default, or sequential
Partitioning method: any. Specifying any allows the db2part partitioner to be automatically inserted.
Collection method: any
Combinable operator: yes
Operator action
Here are the main characteristics of db2upsert: v An -update statement is required. The -insert and -reject options are mutually exclusive with using an -update option that issues a delete statement. v InfoSphere DataStage uses the DB2 CLI (Call Level Interface) to enhance performance by executing an array of records with a single SQL statement. In most cases, the CLI system avoids repetitive statement preparation and parameter binding. v The -insert statement is optional. If it is included, it is executed first. An insert record that does not receive a row status of SQL_PARAM_SUCCESS or SQL_PARAM_SUCCESS_WITH_INFO is then used in the execution of an update statement. v When you specify the -reject option, any update record that receives a status of SQL_PARAM_ERROR is written to your reject data set. Its syntax is:
-reject filename
v You use the -arraySize option to specify the size of the input array. For example:
-arraySize 600
The default array size is 2000. v The db2upsert operator commits its input record array, and then waits until a specified number of records have been committed or a specified time interval has elapsed, whichever occurs first, before committing the next input array. You use the -timeCommitInterval and -rowCommitInterval options to control intervals between commits. To specify between-commit intervals by time, specify a number of seconds using the -timeCommitInterval option. For example:
-timeCommitInterval 3
The default time-commit interval is 2 seconds. To specify between-commit intervals by the number of records committed, use the -rowCommitInterval option. Specify a multiple of the input array size. The default value is 2000. v You use the -open and -close options to define optional DB2 statements to be executed before and after the processing of the insert array. v At the end of execution, the db2upsert operator prints the number of inserted, updated, and rejected records for each processing node used in the step.
v Columns in the DB2 table that do not have corresponding fields in the input data set are set to their default value, if one is specified in the DB2 table. If no default value is defined for the DB2 column and it supports nulls, it is set to null. Otherwise, InfoSphere DataStage issues an error and terminates the step. v The db2upsert operator converts InfoSphere DataStage values to DB2 values using the type conversions detailed in Data type conversion on page 651. The required DB2 privileges are listed in Write modes on page 652.
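As a sketch (the table and field names are invented for illustration), a db2upsert invocation that inserts new rows, updates existing rows by key, and adjusts the commit intervals might look like:
$ osh "... db2upsert -dbname my_db -server db2Server -insert 'insert into table_1 values (ORCHESTRATE.itemNum, ORCHESTRATE.price, ORCHESTRATE.storeID)' -update 'update table_1 set price = ORCHESTRATE.price where itemNum = ORCHESTRATE.itemNum' -rowCommitInterval 4000 -timeCommitInterval 5 ... "
Each ORCHESTRATE.fieldname reference is replaced with the corresponding field value from the input record, as in the -update example shown in the options table below.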
Table 172. db2upsert Operator Options (continued) Options -client_instance Value -client_instance client_instance_name [-client_dbname database] -user user_name -password password Specifies the client DB2 instance name. This option is required for a remote connection. The -client_dbname suboption specifies the client database alias name for the remote server database. If you do not specify this option, InfoSphere DataStage uses the value of the -dbname option, or the value of the APT_DBNAME environment variable, or DB2DBDFT; in that order. The required -user and -password suboptions specify a user name and password for connecting to DB2. -db_cs -db_cs character_set Specify the character set to map between DB2 char and Varchar values and DataStage ustring schema types and to map SQL statements for output to DB2. The default character set is UTF-8 which is compatible with your osh jobs that contain 7-bit US-ASCII data. For information on national language support, reference this IBM ICU site: http://oss.software.ibm.com/icu/charset -dbname -dbname database_name Specifies the name of the DB2 database to access. By default, the operator uses the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise. Specifying -dbname overrides APT_DBNAME and DB2DBDFT. -insert -insert insert_statement Optionally specify the insert statement to be executed. This option is mutually exclusive with using an -update option that issues a delete statement. -omitPart [-omitPart] Specifies that db2upsert should not automatically insert a partitioner. By default, the db2partitioner is inserted. -open -open open_command Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the Conductor node. -padchar -padchar string Specify a string to pad string and ustring fields that are less than the length of the DB2 CHAR column. See "Using the -padchar Option" for more information on how to use this option.
Table 172. db2upsert Operator Options (continued) Options -reject Value -reject If this option is set, records that fail to be updated or inserted are written to a reject data set. You must designate an output data set for this purpose. This option is mutually exclusive with using an -update option that issues a delete statement. -rowCommitInterval -rowCommitInterval n Specifies the number of records that should be committed before starting a new transaction. The specified number must be a multiple of the input array size. The default is 2000. You can also use the APT_RDBMS_COMMIT_ROWS environment variable to specify the size of a commit. -server -server server_name Specify the name of the DB2 instance for the table. By default, InfoSphere DataStage uses the setting of the DB2INSTANCE environment variable. Specifying -server overrides DB2INSTANCE. -timeCommitInterval -timeCommitInterval n Specifies the number of seconds InfoSphere DataStage should allow between committing the input array and starting a new transaction. The default time period is 2 seconds. -update -update update_or_delete_statement Use this required option to specify the update or delete statement to be executed. An example delete statement is: -update 'delete from tablename where A = ORCHESTRATE.A' A delete statement cannot be issued when using the -insert or -reject option.
Using db2part can optimize operator execution. For example, if the input data set contains information that updates an existing DB2 table, you can partition the data set so that records are sent to the processing node that contains the corresponding DB2 rows. When you do, the input record and the DB2 row are local to the same processing node, and read and write operations entail no network activity.
Table 173. db2part Operator Options (continued) Option -dbname Use -dbname database_name Specify the name of the DB2 database to access. By default, the partitioner uses the setting of the environment variable APT_DBNAME, if defined, and DB2DBDFT otherwise. Specifying -dbname overrides APT_DBNAME and DB2DBDFT. -server -server server_name Specify the name of the DB2 instance for the table. By default, InfoSphere DataStage uses the setting of the DB2INSTANCE environment variable. Specifying -server overrides DB2INSTANCE.
(Figure: the data set inDS.ds is partitioned by db2part, using table_1, before being processed by a custom operator.)
The operator replaces each orchestrate.fieldname with a field value, submits the statement containing the value to DB2, and outputs a combination of DB2 and InfoSphere DataStage data. Alternatively, you can use the -key/-table options interface to specify one or more key fields and one or more DB2 tables. The following osh options specify two keys and a single table:
-key a -key b -table data.testtbl
The resulting InfoSphere DataStage output data set includes the InfoSphere DataStage records and the corresponding rows from each referenced DB2 table. When a DB2 table has a column name that is the same as an InfoSphere DataStage data set field name, the DB2 column is renamed using the following syntax:
APT_integer_fieldname
An example is APT_0_lname. The integer component is incremented when duplicate names are encountered in additional tables. Note: If the DB2 table is not indexed on the lookup keys, this operator's performance is likely to be poor.
db2lookup
db2lookup: properties
Table 174. db2lookup Operator Properties
Number of input data sets: 1
Number of output data sets: 1; 2 if you include the -ifNotFound reject option
Input interface schema: determined by the query
Output interface schema: determined by the SQL query
Transfer behavior: transfers all fields from input to output
Execution mode: sequential or parallel (default)
Preserve-partitioning flag in output data set: clear
Composite operator: no
[-dbname dbname] [-ifNotFound fail | drop| reject | continue] [-open open_command] [-padchar char] [-server remote_server_name]
You must specify either the -query option or one or more -table options with one or more -key fields.
Table 175. db2lookup Operator Options Option -table Use -table table_name [-selectlist selectlist] [-filter filter] -key field [-key field ...] Specify a table and key field(s) from which a SELECT statement is created. Specify either the -table option or the -query option. -query -query sql_query Specify an SQL query, enclosed in single quotes, to read a table. The query specifies the table and the processing that you want to perform on the table as it is read into InfoSphere DataStage. This statement can contain joins, views, database links, synonyms, and so on. -close -close close_command Optionally apply a closing command to execute on a database after InfoSphere DataStage execution. -client_instance -client_instance client_instance_name [-client_dbname database] -user user_name -password password Specifies the client DB2 instance name. This option is required for a remote connection. The -client_dbname suboption specifies the client database alias name for the remote server database. If you do not specify this option, InfoSphere DataStage uses the value of the -dbname option, or the value of the APT_DBNAME environment variable, or DB2DBDFT; in that order. The required -user and -password suboptions specify a user name and password for connecting to DB2. -db_cs -db_cs character_set Specify the character set to map between DB2 char and varchar values and InfoSphere DataStage ustring schema types and to map SQL statements for output to DB2. The default character set is UTF-8 which is compatible with your osh jobs that contain 7-bit US-ASCII data. For information on national language support, reference this IBM ICU site: http://oss.software.ibm.com/icu/charset -dbname -dbname dbname Optionally specify a database name.
Table 175. db2lookup Operator Options (continued) Option -ifNotFound Use -ifNotFound fail | drop | reject dataset | continue Determines what action to take in case of a lookup failure. The default is fail. If you specify reject, you must designate an additional output data set for the rejected records. -open -open open_command Optionally apply an opening command to execute on a database prior to InfoSphere DataStage execution. -padchar -padchar string Specify a string to pad string and ustring fields that are less than the length of the DB2 CHAR column. For more information see "Using the -padchar Option". -server -server remote_server_name Optionally supply the name of a current DB2 instance. -use_strings -use_strings Directs InfoSphere DataStage to import DB2 char and varchar values to InfoSphere DataStage as InfoSphere DataStage strings without converting them from their ASCII or binary form. This option overrides the db_cs option which converts DB2 char and Varchar values as ustrings using the specified character set.
Example of db2lookup
Suppose you want to connect to the APT81 server as user user101, with the password test. You want to perform a lookup between an InfoSphere DataStage data set and a table called target, on the key fields lname, fname, and DOB. You can configure db2lookup in either of two ways to accomplish this. Here is the osh command using the -table and -key options:
$ osh " db2lookup -server APT81 -table target -key lname -key fname -key DOB < data1.ds > data2.ds "
InfoSphere DataStage prints the lname, fname, and DOB column names and values from the InfoSphere DataStage input data set and also the lname, fname, and DOB column names and values from the DB2 table. If a column name in the DB2 table has the same name as an InfoSphere DataStage output data set schema fieldname, the printed output shows the column in the DB2 table renamed using this format:
APT_integer_fieldname
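The second way uses the -query option. A hedged sketch of an equivalent query form (assuming the same target table and key columns) is:
$ osh " db2lookup -server APT81 -query 'select * from target where lname = ORCHESTRATE.lname and fname = ORCHESTRATE.fname and DOB = ORCHESTRATE.DOB' < data1.ds > data2.ds "
Each ORCHESTRATE.fieldname placeholder is replaced with the field value from the input record before the statement is submitted to DB2.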
Considerations for reading and writing DB2 tables
Data translation anomalies
Translation anomalies exist under the following write and read operations:
Write operations
A data set is written to a DB2 table in create mode. The table's contents are then read out and reconverted to the InfoSphere DataStage format. The final InfoSphere DataStage field types differ from the original, as shown in the next table:
Table 176. DB2 Table Write Operations Translation Anomaly (original InfoSphere DataStage data type -> converted DB2 type -> read back and reconverted type)
fixed-length string field, 255 <= fixed length <= 4000 bytes -> VARCHAR(n), where n is the length of the string field -> string[max=n] and ustring[max=n], a variable-length string with a maximum length = n
int8 -> SMALLINT -> int16
sfloat -> FLOAT -> dfloat
Invoke the modify operator to convert the data type of the final InfoSphere DataStage field to its original data type. See "Modify Operator" .
Read Operations
A DB2 table is converted to a data set and reconverted to DB2 format. The final DB2 column type differs from the original, as shown in the next table:
Table 177. DB2 Table Read Operations Translation Anomaly Converted to InfoSphere DataStage Type dfloat
where:
v -table table_name specifies the DB2 table that determines the processing nodes on which the operator runs. InfoSphere DataStage runs the operator on all processing nodes containing a partition of table_name. v -dbname db optionally specifies the name of the DB2 database to access. By default, InfoSphere DataStage uses the setting of APT_DBNAME, if it is defined; otherwise, InfoSphere DataStage uses the setting of DB2DBDFT. Specifying -dbname overrides both APT_DBNAME and DB2DBDFT. v -server s_name optionally specifies the DB2 instance name for table_name. By default, InfoSphere DataStage uses the setting of the DB2INSTANCE environment variable. Specifying -server overrides DB2INSTANCE. The next example reads a table from DB2, then uses db2nodes to constrain the statistics operator to run on the nodes containing a partition of the table. Here is the osh syntax for this example:
$ osh "db2read -table table_1 | statistics -results /stats/results -fields itemNum, price [nodemap (db2nodes -table table_1)] > outDS.ds "
In this case, all data access on each node is local, that is, InfoSphere DataStage does not perform any network activity to process the table.
A read operation must specify the database on which the operation is performed. A read operation must also indicate either an SQL query of the database or a table to read. Read operations can specify Informix SQL statements to be parsed and run on all processing nodes before the read operation is prepared and run, or after the selection or query is completed.
hplread/infxread/xpsread
Execution mode
If you are using Informix version 9.4 or 10.0, you can control the processing nodes, which correspond to Informix servers, on which the read operators are run. You use the -part suboption of the -table or -query option to control the processing nodes.
The following figure shows an example of a read operation of an Informix result set. The operator specifies the -table option, with no further qualification, so that a single table is read in its entirety from Informix. The table is read into a result set, and the operator translates the result set into an InfoSphere DataStage data set. In this example, the table contains three rows. All rows are read.
(Figure: the Informix table column storeID, containing 26, 34, and 26, is read by hplread, infxread, or xpsread into an InfoSphere DataStage field named storeID with the same values.)
The following table shows the schemas of both the input Informix table rows and output InfoSphere DataStage records.
Table 180. Schemas for Input Informix table row and corresponding InfoSphere DataStage result set
Input Informix table row: itemNum INTEGER not null; price DECIMAL(3,2) not null; storeID SMALLINT not null
Output InfoSphere DataStage record: itemNum:int32; price:decimal[3,2]; storeID:int16;
In the above example the InfoSphere DataStage operator converts the Informix result set, which is a two-dimensional array, to an InfoSphere DataStage data set. The schema of the Informix result set corresponds to the record schema of the InfoSphere DataStage data set. Each row of the INFORMIX result set corresponds to a record of the InfoSphere DataStage data set; each column of each row corresponds to a field of the InfoSphere DataStage record. The InfoSphere DataStage field names are the same as the column names of the Informix table. Here is the osh command for this example:
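A minimal sketch of such a command (the table, database, and data set names are assumed here for illustration) would be:
$ osh "infxread -table store_info -dbname my_db > outDS.ds "
Because no -filter or -selectlist is given, the whole table is read into the output data set.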
hplwrite/infxwrite/xpswrite
Operator action
Here are the chief characteristics of the InfoSphere DataStage operators that write InfoSphere DataStage data sets to INFORMIX tables: v They specify the name of the database to write to. v They translate the InfoSphere DataStage data set record-by-record to INFORMIX table rows, as discussed in "Column Name Conversion" . v The translation includes the conversion of INFORMIX data types to InfoSphere DataStage data types, as listed in "Data Type Conversion" . v They append records to an existing table, unless another mode of writing has been set. "Write Modes" discusses these modes.
Operators optionally specify such information as the mode of the write operation and INFORMIX commands to be parsed and executed on all processing nodes before the write operation is executed or after it has completed. The discussion of each operator in subsequent sections includes an explanation of properties and options.
Execution mode
The infxwrite and hplwrite operators execute sequentially, writing to one server. The xpswrite operator executes in parallel.
decimal[p,s] (p=precision; s=scale) dfloat int8 int16 int32 raw[n] (fixed length) raw [max] (variable length) sfloat string[n] (fixed length) string[max] (variable length) subrec tagged aggregate time
Table 181. Informix interface write operators data type conversion (continued) InfoSphere DataStage Data Type timestamp Informix Data Type DATETIME The DATETIME data type starts with a year and contains 5 fractional digits if the timestamp field resolves to microseconds. For Informix versions 9.4 and 10.0, the timestamp data type can also be converted to a datetime year-to-month type. uint8, uint16, uint32 Not supported
Write modes
The write mode of the operator determines how the records of the data set are inserted into the destination table. The write mode can have one of the following values: v append: The table must exist and the record schema of the data set must be compatible with the table. A write operator appends new rows to the table; and the final order of rows in the table is determined by INFORMIX. This is the default mode. See "Example 2: Appending Data to an Existing INFORMIX Table" . v create: The operator creates a new table. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. You must specify either create mode or replace mode if the table does not exist. v replace: The operator drops the existing table and creates a new one in its place. If a table exists with the same name as the one you want to create, it is overwritten. v truncate: The operator retains the table attributes but discards existing records and appends new ones. Each mode requires specific user privileges.
Important: The default length of InfoSphere DataStage variable-length strings is 32 bytes, so all records of the Informix table have 32 bytes allocated for each variable-length string field in the input data set. If a variable-length field is longer than 32 bytes, the write operator issues an error, but you can use the -stringlength option to modify the default length, up to a maximum of 255 bytes.
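As a sketch, a truncate-mode write with a larger default string length (the table name is illustrative) might be invoked as:
$ osh "... infxwrite -table table_2 -mode truncate -stringlength 64 ... "
which keeps the table attributes, discards the existing rows, and allocates 64 bytes for each variable-length string field.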
Remember:
1. The names of InfoSphere DataStage fields can be of any length, but the names of Informix columns cannot exceed 128 characters for versions 9.4 and 10.0. If you write InfoSphere DataStage fields that exceed this limit, the operator issues an error and the step stops.
2. The data set cannot contain fields of these types: raw, tagged aggregate, subrecord, and unsigned integer (of any size).
Limitations
Write operations have the following limitations: v While the names of InfoSphere DataStage fields can be of any length, the names of INFORMIX columns cannot exceed 18 characters for versions 7.x and 8.x and 128 characters for version 9.x. If you write InfoSphere DataStage fields that exceed this limit, the operator issues an error and the step terminates. v The InfoSphere DataStage data set cannot contain fields of these types: raw, tagged aggregate, subrecord, or unsigned integer (of any size). v If an INFORMIX write operator tries to write a data set whose fields contain a data type listed above, the write operation terminates and the step containing the operator returns an error. You can convert unsupported data types by means of the modify operator. See "Modify Operator" for information.
The following diagram shows the operator action. Columns that the operator appends to the Informix table are shown in boldface type. The order of fields in the InfoSphere DataStage record and columns of the Informix rows are not the same but are written successfully.
(Figure: the input data set, with field storeID containing 09, 57, and 26, is written by hplwrite, infxwrite, or xpswrite to INFORMIX table_1; the storeID column of the table then contains 26, 34, 09, 57, and 26.)
(Figure: the input data set, with field storeID containing 18, 75, and 62, is written by hplwrite, infxwrite, or xpswrite in truncate mode; the existing storeID column values 09, 57, and 26 are replaced by 18, 75, and 62.)
In the example, which operates in truncate write mode, the field named code is dropped and a message to that effect is displayed.
(Figure: the input data set has fields code (26, 14, 88) and storeID (18, 75, 62); hplwrite, infxwrite, or xpswrite runs in truncate mode with the drop flag set. The existing storeID column values 09, 57, and 26 are replaced by 18, 75, and 62, and the code field is dropped.)
Table 183. Schemas for writing InfoSphere DataStage records to Informix tables with Unmatched Column
Input InfoSphere DataStage record: itemNum:int32; price:decimal[3,2]; storeID:int16;
Output Informix table row: code SMALLINT, price DECIMAL, itemNum INTEGER, storeID SMALLINT
(Figure: the input data set, with field storeID containing 09, 57, and 26, is written by hplwrite, infxwrite, or xpswrite to INFORMIX table_6; the storeID column of the table then contains 26, 34, 09, 57, and 26.)
hplread operator
Before you can use the hplread operator, you must set up the Informix onpload database. Run the Informix ipload utility to create the database. The hplread operator sets up a connection to an Informix database, queries the database, and translates the resulting two-dimensional array that represents the Informix result set to an InfoSphere DataStage data set. This operator works with Informix versions 9.4 and 10.0.
Note:
1. The High Performance Loader uses more shared memory than Informix does in general. If the High Performance Loader is unable to allocate enough shared memory, hplread might not work. For more information about shared memory limits, contact your system administrator.
2. If your step is processing a few records, you can use the infxread operator rather than hplread. In any case, there is no performance benefit from using hplread for very small data sets. If your data set size is less than about 10 times the number of processing nodes, you must set -dboptions smalldata on the hplwrite operator.
Troubleshooting configuration
Prerequisite: The remote computer must be cross-mounted with the local computer.
1. Verify that the Informix sqlhosts file on the remote machine has a TCP interface.
2. Copy the Informix etc/sqlhosts file from the remote computer to a directory on the local computer. Set the Informix HPLINFORMIXDIR environment variable to this directory. For example, if the directory on the local computer is /apt/informix, the sqlhosts file should be in the /apt/informix directory, and the HPLINFORMIXDIR variable should be set to /apt/informix.
3. Set the INFORMIXSERVER environment variable to the name of the remote Informix server.
4. Add the remote Informix server nodes to your node configuration file in $APT_ORCHHOME/../../config, and use a nodepool resource constraint to limit the execution of the hplread operator to these nodes. The nodepool for the remote nodes is arbitrarily named InformixServer. The configuration file must contain at least two nodes, one for the local machine and one for the remote machine.
5. The remote variable needs to be set in a startup script which you must create on the local machine. Here is a sample startup.apt file with HPLINFORMIXDIR being set to /usr/informix/9.4, the INFORMIX directory on the remote machine:
#! /bin/sh
INFORMIXDIR=/usr/informix/9.4
export INFORMIXDIR
INFORMIXSQLHOSTS=$INFORMIXDIR/etc/sqlhosts
export INFORMIXSQLHOSTS
shift 2
exec $*
6. Set the environment variable APT_STARTUP_SCRIPT to the full pathname of the startup.apt file.
You are now ready to run a job that uses the hplread operator to connect to a remote Informix server. If you are unable to connect to the remote server, try making either one or both of the following changes to the sqlhosts file on the local computer:
v In the fourth column in the row that corresponds to the remote Informix server name, replace the Informix server name with the Informix server port number in the /etc/services file on the remote computer. v The third column contains the host name of the remote computer. Change this value to the IP address of the remote computer.
hplread
Table 185. hplread operator options Option -close Use -close close_command Specify an Informix SQL statement to be parsed and executed on all processing nodes after the table selection or query is completed. This parameter is optional. -dbname -dbname database_name Specify the name of the Informix database to read from. -dboptions -dboptions smalldata Set this option if the number of records is fewer than 10 * number of processing nodes. -open -open open_command Specify an Informix SQL statement to be parsed and executed by the database on all processing nodes before the read query is prepared and executed. This parameter is optional. -query -query sql_query Specify a valid SQL query to submit to Informix to read the data. The sql_query variable specifies the processing that Informix performs as data is read into InfoSphere DataStage. The query can contain joins, views, synonyms, and so on. -server -server server_name Set the name of the Informix server that you want to connect to. If no server is specified, the value of the INFORMIXSERVER environment variable is used. -table -table table_name [-filter filter] [-selectlist list] Specify the name of the Informix table to read from. The table must exist. You can prefix the name of the table with a table owner in the form: table_owner.table_name The operator reads the entire table unless you limit its scope by using the -filter or -selectlist options, or both. With the -filter suboption, you can specify selection criteria to be used as part of an SQL statement WHERE clause, to specify the rows of the table to include in or exclude from the InfoSphere DataStage data set. With the -selectlist suboption, you can specify a list of columns to read, if you do not want all columns to be read.
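For example, a hedged sketch of an hplread invocation that reads only part of a table (the names are illustrative) might be:
$ osh "hplread -server my_server -dbname my_db -table store_info -filter 'storeID = 26' > outDS.ds "
The -filter suboption becomes part of the WHERE clause of the SELECT statement that the operator generates.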
Before you can use the hplwrite operator, you must set up the Informix onpload database. Run the Informix ipload utility to create the database. If the destination Informix table does not exist, the hplwrite operator creates a table. This operator runs with Informix versions 9.4 and 10.0.
Remember: The High Performance Loader uses more shared memory than Informix does in general. If the High Performance Loader is unable to allocate enough shared memory, hplwrite might not work. For more information about shared memory limits, contact your system administrator.
hplwrite
Table 186. hplwrite properties (continued)
Execution mode: sequential
Partitioning method: same
Composite operator: no
Table 187. hplwrite operator options (continued) Option -mode Use -mode append The hplwrite operator appends new records to the table. The database user who writes in this mode must have Resource privileges. This mode is the default mode. -mode create The hplwrite operator creates a new table. The database user who writes in this mode must have Resource privileges. An error is returned if the table already exists. -mode replace The hplwrite operator deletes the existing table and creates a new one. The database user who writes in this mode must have Resource privileges. -mode truncate The hplwrite operator retains the table attributes but discards existing records and appends new ones. The operator runs more slowly in this mode if the user does not have Resource privileges. -open -open open_command Specify an Informix SQL statement to be parsed and executed by Informix on all processing nodes before the table is accessed. -server -server server_name Specify the name of the Informix server that you want to connect to. -stringlength -stringlength length Set the default length of variable-length raw fields or string fields. If you do not specify a length, the default length is 32 bytes. You can specify a length up to 255 bytes. -table -table table_name [-selectlist selectlist] Specifies the name of the Informix table to write to. See the mode options for constraints on the existence of the table and the required user privileges. The -selectlist parameter specifies an SQL list that determines which fields are written. If you do not supply the list, the hplwrite operator writes all fields to the table.
Examples
Refer to the following sections for examples of INFORMIX write operator use: v "Example 2: Appending Data to an Existing INFORMIX Table" v "Example 3: Writing Data to an INFORMIX Table in Truncate Mode" v "Example 4: Handling Unmatched InfoSphere DataStage Fields in an INFORMIX Write Operation" v "Example 5: Writing to an INFORMIX Table with an Unmatched Column"
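In addition to those examples, a minimal hypothetical hplwrite invocation (the table and server names are invented) might look like:
$ osh "... hplwrite -table table_2 -server my_server -mode create ... "
which creates table_2 and writes all fields of the input data set to it.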
infxread operator
The infxread operator is used to read data from an Informix table. When you have only a few records to process, you can use the infxread operator instead of the hplread operator.
The infxread operator sets up a connection to an Informix database, queries the database, and translates the resulting two-dimensional array that represents the Informix result set to an InfoSphere DataStage data set. This operator runs with Informix versions 9.4 and 10.0.
Infxread
infxread: properties
The following table lists the infxread operator properties for reading data from an Informix database.
Table 188. infxread operator properties
Number of input data sets: 0
Number of output data sets: 1
Input interface schema: none
Output interface schema: determined by the SQL query
Transfer behavior: none
Execution mode: parallel
Partitioning method: not applicable
Preserve-partitioning flag in output data set: clear
Composite operator: no
-query sql_query [-close close_command] [-dbname database_name] [-open open_command] [-part table_name] [-server server_name]
Table 189. infxread operator options Option -close Use -close close_command Specify an Informix SQL statement to be parsed and executed on all processing nodes after the table selection or query is completed. This parameter is optional. -dbname -dbname database_name Specify the name of the Informix database to read from. This parameter is optional. -dboptions This option is deprecated and is retained only for compatibility with previous InfoSphere DataStage releases. -open -open open_command Specify an Informix SQL statement to be parsed and executed by the database on all processing nodes before the read query is prepared and executed. This parameter is optional. -part -part table_name If the table is fragmented, specify this option. This option improves performance by creating one instance of the operator per table fragment. If the table is fragmented across nodes, this option creates one instance of the operator per fragment per node. If the table is fragmented and you do not specify -part, the operator functions successfully, if more slowly. You must have Resource privilege to use this option. When this option is used with the -table option, the table name must be the same name as that specified for the -table option. -query -query sql_query Specify a valid SQL query to be submitted to Informix to read the data. The sql_query variable specifies the processing that Informix performs as data is read into InfoSphere DataStage. The query can contain joins, views, synonyms, and so on. -server -server server_name This option is maintained for compatibility with the xpsread operator. This parameter is optional.
Table 189. infxread operator options (continued) Option -table Use -table table_name [-filter filter ] [-selectlist selectlist ] Specify the name of the Informix table to read from. The table must exist. You can prefix the table name with a table owner in the form: table_owner.table_name The operator reads the entire table unless you limit its scope by using the -filter option or the -selectlist option, or both. You can specify selection criteria to use as part of an SQL statement WHERE clause, to specify the rows of the table to include in or exclude from the InfoSphere DataStage data set. If you use the -selectlist option, you can specify a list of columns to read, if you do not want all columns to be read.
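As an illustration (the table name is invented), infxread can be limited to particular columns with the -selectlist suboption:
$ osh "infxread -dbname my_db -table store_info -selectlist 'itemNum, storeID' > outDS.ds "
which reads only the itemNum and storeID columns of store_info into the output data set.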
infxwrite operator
When the number of records to be written to the Informix table is small, the infxwrite operator can be used instead of the hplwrite operator. The infxwrite operator sets up a connection to the Informix database and inserts records into a table. The operator takes a single input data set. The record schema of the data set and the write mode of the operator determine how the records of a data set are inserted into the table.
infxwrite
This operator does not function in parallel. For parallel operation, use the xpswrite or hplwrite operator when running with Informix versions 9.4 or 10.0.
-drop
Table 191. infxwrite operator options (continued) Option -mode Use -mode append | create | replace | truncate append: infxwrite appends new records to the table. The database user who writes in this mode must have Resource privileges. This is the default mode. create: infxwrite creates a new table; the database user who writes in this mode must have Resource privileges. InfoSphere DataStage returns an error if the table already exists. replace: infxwrite deletes the existing table and creates a new one in its place; the database user who writes in this mode must have Resource privileges. truncate: infxwrite retains the table attributes but discards existing records and appends new ones. The operator runs more slowly in this mode if the user does not have Resource privileges. -open -open open_command Specify an Informix SQL statement to be parsed and executed by Informix on all processing nodes before opening the table. -server -server server_name This option is maintained for compatibility with the xpswrite operator. -stringlength -stringlength length Set the default length of variable-length raw fields or string fields. If you do not specify a length, the default length is 32 bytes. You can specify a length up to 255 bytes. -table -table table_name [-selectlist list] Specifies the name of the Informix table to write to. See the mode options for constraints on the existence of the table and the required user privileges. The -selectlist option specifies an SQL list that determines which fields are written. If you do not supply the list, the infxwrite operator writes the input field records according to your chosen mode specification.
Examples
Refer to the following sections for examples of INFORMIX write operator use: v "Example 2: Appending Data to an Existing INFORMIX Table" v "Example 3: Writing Data to an INFORMIX Table in Truncate Mode" v "Example 4: Handling Unmatched InfoSphere DataStage Fields in an INFORMIX Write Operation" v "Example 5: Writing to an INFORMIX Table with an Unmatched Column"
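In addition to those examples, a minimal hypothetical infxwrite invocation (the table name is invented) might be:
$ osh "... infxwrite -table store_info -mode append ... "
which appends the records of the input data set to the existing store_info table.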
xpsread operator
The xpsread operator runs in parallel in Informix versions 9.4 and 10.0. The xpsread operator sets up a connection to an Informix database, queries the database, and translates the resulting two-dimensional array that represents the Informix result set to an InfoSphere DataStage data set. This operator works with Informix versions 9.4 and 10.0.
xpsread
xpsread -table table_name [-filter filter] [-selectlist list] | -query sql_query [-close close_command] [-dbname database_name] [-open open_command] [-part table_name] [-server server_name]
Table 193. xpsread operator options Option -close Use -close close_command Optionally specify an Informix SQL statement to be parsed and executed on all processing nodes after the query runs. -dbname -dbname database_name Specify the name of the Informix database to read from. -dboptions This option is deprecated and is retained only for compatibility with previous versions of InfoSphere DataStage. -open -open open_command Optionally specify an Informix SQL statement to be parsed and executed by the database on all processing nodes before the read query is prepared and executed. -part -part table_name If the table is fragmented, you can use this option to create one instance of the operator per table fragment, which can improve performance. If the table is fragmented across nodes, this option creates one instance of the operator per fragment per node. If the table is fragmented and you do not specify this option, the operator functions successfully, but more slowly. You must have Resource privilege to use this option. -query -query sql_query Specify a valid SQL query that is submitted to INFORMIX to read the data. The sql_query variable specifies the processing that Informix performs as data is read into InfoSphere DataStage. The query can contain joins, views, synonyms, and so on. The -query option is mutually exclusive with the -table option. -server -server server_name Specify the name of the Informix server that you want to connect to. If no server is specified, the value of the INFORMIXSERVER environment variable is used.
Table 193. xpsread operator options (continued) Option -table Use -table table_name [-filter filter] [-selectlist list] Specify the name of the INFORMIX table to read from. The table must exist. You can prefix table_name with a table owner in the form: table_owner.table_name With the -filter suboption, you can optionally specify selection criteria to be used as part of an SQL statement's WHERE clause, to specify the rows of the table to include in or exclude from the InfoSphere DataStage data set. If you do not want all columns to be read, use the -selectlist suboption to specify a list of columns to read.
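As a quick illustration of the options above, the following hedged osh sketch reads selected rows of a fragmented table in parallel; the database, server, and table names and the filter value are hypothetical, and the downstream stages are elided:
$ osh "xpsread -table inventory -filter 'storeID = 26' -dbname my_db -server my_xps_server -part inventory | ... "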
xpswrite operator
The xpswrite operator writes data to an Informix database and can run in parallel with Informix versions 9.4 and 10.0. The operator sets up a connection to Informix and inserts records into a table. It takes a single input data set. The record schema of the data set and the write mode of the operator determine how the records of the data set are inserted into the table.
xpswrite
Table 195. xpswrite operator options (continued) Option -mode Use -mode append | create | replace | truncate append: xpswrite appends new records to the table. The database user who writes in this mode must have Resource privileges. This is the default mode. create: xpswrite creates a new table; the database user who writes in this mode must have Resource privileges. InfoSphere DataStage returns an error if the table already exists. replace: xpswrite deletes the existing table and creates a new one in its place; the database user who writes in this mode must have Resource privileges. truncate: xpswrite retains the table attributes but discards existing records and appends new ones. The operator runs more slowly in this mode if the user does not have Resource privileges. -open -open open_command Specify an Informix SQL statement to be parsed and executed by INFORMIX on all processing nodes before opening the table. -server -server server_name Specify the name of the Informix server you want to connect to. -stringlength -stringlength length Set the default length of variable-length raw fields or string fields. If you do not specify a length, the default length is 32 bytes. You can specify a length up to 255 bytes. -table -table table_name [-selectlist list] Specifies the name of the Informix table to write to. See the mode options for constraints on the existence of the table and the user privileges required. The -selectlist option specifies an SQL list of the fields to be written. If you do not supply the list, xpswrite writes all fields to the table.
Examples
Refer to the following for examples of INFORMIX write operator use: v "Example 2: Appending Data to an Existing INFORMIX Table" v "Example 3: Writing Data to an INFORMIX Table in Truncate Mode" v "Example 4: Handling Unmatched InfoSphere DataStage Fields in an INFORMIX Write Operation" v "Example 5: Writing to an INFORMIX Table with an Unmatched Column"
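For orientation, here is a minimal osh sketch of an xpswrite invocation in truncate mode; the table and server names are hypothetical and the upstream stages are elided:
$ osh "... | xpswrite -table target_table -server my_xps_server -mode truncate"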
The database character set types are: v Latin: chartype=1. The character set for U.S. and European applications which limit character data to the ASCII or ISO 8859 Latin1 character sets. This is the default. v Unicode: chartype=2. 16-bit Unicode characters from the ISO 10646 Level 1 character set. This setting supports all of the ICU multi-byte character sets. v KANJISJIS: chartype=3. For Japanese third-party tools that rely on the string length or physical space allocation of KANJISJIS. v Graphic: chartype=4. Provided for DB2 compatibility. Note: The KANJI1: chartype=5 character set is available for Japanese applications that must remain compatible with previous releases; however, this character set will be removed in a subsequent release because it does not support the new string functions and will not support future character sets. We recommend that you use the set of SQL translation functions provided to convert KANJI1 data to Unicode.
For example:
terawrite -db_cs ISO-8859-5
Your database character set specification controls the following conversions: v SQL statements are mapped to your specified character set before they are sent to the database via the native Teradata APIs. v If you do not want your SQL statements converted to this character set, set the APT_TERA_NO_SQL_CONVERSION environment variable. This variable forces the SQL statements to be converted to Latin1 instead of the character set specified by the -db_cs option. v All char and varchar data read from the Teradata database is mapped from your specified character set to the ustring schema data type (UTF-16). If you do not specify the -db_cs option, string data is read into a string schema type without conversion. v The teraread operator converts a varchar(n) field to ustring[n/min], where min is the minimum size in bytes of the largest codepoint for the character set specified by -db_cs. v ustring schema type data written to a char or varchar column in the Teradata database is converted to your specified character set. v When writing a varchar(n) field, the terawrite operator schema type is ustring[n * max] where max is the maximum size in bytes of the largest codepoint for the character set specified by -db_cs. No other environment variables are required to use the Teradata operators. All the required libraries are in /usr/lib which should be on your PATH. To speed up the start time of the load process slightly, you can set the APT_TERA_NO_PERM_CHECKS environment variable to bypass permission checking on several system tables that need to be readable during the load process.
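If you want either of these behaviors, you can export the variables in the job environment before osh runs. This is a minimal sketch assuming a Bourne-style shell; setting the variables to 1 is a convention used here on the assumption that only their presence is checked:
export APT_TERA_NO_SQL_CONVERSION=1   # keep SQL statements in Latin1 rather than the -db_cs character set
export APT_TERA_NO_PERM_CHECKS=1      # bypass the permission checks on system tables at load start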
Teraread operator
The teraread operator sets up a connection to a Teradata database and reads the results of a query into a data set. As the operator requests records, the result set is translated row by row into the InfoSphere DataStage data set. The operator converts the data type of Teradata columns to corresponding InfoSphere DataStage data types, as listed in the table teraread Operator Data Type Conversions.
teraread
teraread: properties
Table 196. teraread Operator Properties
Number of output data sets: 1
Output interface schema: determined by the SQL query
Execution mode: parallel (default) or sequential
Note: Consider the following points: v An RDBMS such as Teradata does not guarantee deterministic ordering behavior, unless an SQL query constrains it to do so. v The teraread operator is not compatible with Parallel CLI, which should not be enabled. In any case, no performance improvements can be obtained by using Parallel CLI with teraread. By default, InfoSphere DataStage reads from the default database of the user who performs the read operation. If this database is not specified in the Teradata profile for the user, the user name is the default database. You can override this behavior by means of the -dbname option. You must specify either the -query or the -table option. The operator can optionally pass -open and -close option commands to the database. These commands are run once by Teradata on the conductor node either before a query is executed or after the read operation is completed.
By default, the operator displays a progress message for every 100,000 records per partition it processes, for example:
##I TTER000458 16:00:50(009) <teraread,0> 98300000 records processed.
However, you can either change the interval or disable the messages by means of the -progressInterval option.
You can use the modify operator to perform explicit data type conversion after the Teradata data type has been converted to an InfoSphere DataStage data type.
teraread restrictions
The teraread operator is a distributed FastExport and is subject to all the restrictions on FastExport. Briefly, these are: v There is a limit to the number of concurrent FastLoad and FastExport jobs. Each use of the Teradata operators (teraread and terawrite) counts toward this limit. v Aggregates and most arithmetic operators in the SELECT statement are not allowed. v The use of the USING modifier is not allowed. v Non-data access (that is, pseudo-tables like DATE or USER) is not allowed. v Single-AMP requests are not allowed. These are SELECTs satisfied by an equality term on the primary index or on a unique secondary index.
You must specify either the -query or the -table option. You must also specify the -server option to supply the server name, and specify the -dboptions option to supply connection information to log on to Teradata.
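The following osh sketch satisfies these requirements; the table, server, and credentials are hypothetical and the downstream stages are elided:
$ osh "teraread -table orders -server tdserv -dboptions '{-user = dsadm -password = secret}' | ... "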
Table 198. teraread Operator Options Option -close Use -close close_command; close_command Optionally specifies a Teradata command to be run once by Teradata on the conductor node after the query has completed. You can specify multiple SQL statements separated by semi-colons. Teradata will start a separate transaction and commit it for each statement. -db_cs -db_cs character_set Optionally specify the character set to be used for mapping strings between the database and the Teradata operators. The default value is Latin1. See "Specifying an InfoSphere DataStage ustring Character Set" for information on the data affected by this setting.
Table 198. teraread Operator Options (continued) Option -dbname Use -dbname database_name By default, the read operation is carried out in the default database of the Teradata user whose profile is used. If no default database is specified in that user's Teradata profile, the user name is the default database. This option overrides the default. If you supply the database_name, the database to which it refers musts exist and you must have the necessary privileges. -dboptions -dboptions '{-user = username -password = password workdb= work_database[-sessionsperplayer = nn ] [-requestedsessions = nn ] [-synctimeout=nnn] }' You must specify both the username and password with which you connect to Teradata. If the user does not have CREATE privileges on the default database, the workdb option allows the user to specify an alternate database where the error tables and work table will be created. The value of -sessionsperplayer determines the number of connections each player has to Teradata. Indirectly, it also determines the number of players. The number selected should be such that (sessionsperplayer * number of nodes * number of players per node) equals the total requested sessions. The default is 2. Setting the value of -sessionsperplayer too low on a large system can result in so many players that the step fails due to insufficient resources. In that case, -sessionsperplayer should be increased. The value of the optional -requestedsessions is a number between 1 and the number of vprocs in the database. The default is the maximum number of available sessions. synctimeout specifies the time that the player slave process waits for the control process. The default is 20 seconds. -open -open open_command; open_command Optionally specifies a Teradata command run once by Teradata on the conductor node before the query is initiated. You can specify multiple SQL statements separated by semi-colons. Teradata will start a separate transaction and commit it for each statement. -progressInterval -progressInterval number By default, the operator displays a progress message for every 100,000 records per partition it processes. Specify this option either to change the interval or disable the messages. To change the interval, specify a new number of records per partition. To disable the messages, specify 0.
Table 198. teraread Operator Options (continued) Option -query Use -query sqlquery Specifies a valid Teradata SQL query in single quotes that is submitted to Teradata. The query must be a valid Teradata SQL query that the user has the privileges to run. Do not include formatting characters in the query. A number of other restrictions apply; see "teraread Restrictions" . -server -server servername You must specify a Teradata server. -table -table table_name [-filter filter] [-selectlist list] Specifies the name of the Teradata table to read from. The table must exist, and the user must have the necessary privileges to read it. The teraread operator reads the entire table, unless you limit its scope by means of the -filter or -selectlist option, or both options. The -filter suboption optionally specifies selection criteria to be used as part of an SQL statement's WHERE clause. Do not include formatting characters in the query. A number of other restrictions apply; see "teraread Restrictions" The -selectlist suboption optionally specifies a list of columns to read. The items of the list must appear in the same order as the columns of the table.
Terawrite Operator
The terawrite operator sets up a connection to a Teradata database to write data to it from an InfoSphere DataStage data set. As the operator writes records, the InfoSphere DataStage data set is translated row by row to a Teradata table. The operator converts the data type of InfoSphere DataStage fields to corresponding Teradata types, as listed in Table 200. The mode of the operator determines how the records of a data set are inserted into the table.
terawrite
terawrite: properties
Table 199. terawrite Operator Properties
Number of input data sets: 1
Input interface schema: derived from the input data set
Execution mode: parallel (default) or sequential
Partitioning method: same. You can override this partitioning method; however, a partitioning method of entire is not allowed.
Note: The terawrite operator is not compatible with Parallel CLI, which should not be enabled. In any case, no performance improvements can be obtained by using Parallel CLI with terawrite. By default, InfoSphere DataStage writes to the default database of the user name used to connect to Teradata. The userid under which the step is running is not the basis of write authorization. If no default database is specified in that user's Teradata profile, the user name is the default database. You can override this behavior by means of the -dbname option. You must specify the table. By default, the operator displays a progress message for every 100,000 records per partition it processes, for example:
##I TTER000458 16:00:50(009) <terawrite,0> 98300000 records processed.
The -progressInterval option can be used to change the interval or disable the messages.
v The name and data type of an InfoSphere DataStage field are related to those of the corresponding Teradata column. However, InfoSphere DataStage field names are case sensitive and Teradata column names are not. Make sure that the field names in the data set are unique, regardless of case. v Both InfoSphere DataStage fields and Teradata columns support nulls, and an InfoSphere DataStage field that contains a null is stored as a null in the corresponding Teradata column. The terawrite operator automatically converts InfoSphere DataStage data types to Teradata data types as shown in the following table:
Table 200. terawrite Operator Data Type Conversions (InfoSphere DataStage Data Type -> Teradata Data Type)
date -> date
decimal(p,s) -> numeric(p,s)
dfloat -> double precision
int8 -> byteint
int16 -> smallint
int32 -> integer
int64 -> unsupported
raw -> varbyte(default)
raw[fixed_size] -> byte(fixed_size)
raw[max=n] -> varbyte(n)
sfloat -> unsupported
string -> varchar(default_length)
string[fixed_size] -> char(fixed_size)
string[max] -> varchar(max)
time -> not supported
timestamp -> not supported
uint8 -> unsupported
uint16 -> unsupported
uint32 -> unsupported
When terawrite tries to write an unsupported data type to a Teradata table, the operation terminates with an error. You can use the modify operator to perform explicit data type conversions before the write operation is initiated.
Since FastLoad creates error tables whenever it is run (see your FastLoad documentation for details), you might use BTEQ to examine the relevant error table and then correct the records that failed to load. FastLoad does not load duplicate records; however, such records are not loaded into the error tables. You can use the remdup operator to remove duplicate records before using terawrite.
Write modes
The write mode of the operator determines how the records of the data set are inserted into the destination table. The write mode can have one of the following values: v append: The terawrite operator appends new rows to the table; the database user who writes in this mode must have TABLE CREATE privileges and INSERT privileges on the database being written to. This is the default mode. v create: The terawrite operator creates a new table. The database user who writes in this mode must have TABLE CREATE privileges. If a table exists with the same name as the one you want to create, the step that contains terawrite terminates in error. v replace: The terawrite operator drops the existing table and creates a new one in its place. The database user who writes in this mode must have TABLE CREATE and TABLE DELETE privileges. If a table exists with the same name as the one you want to create, it is overwritten. v truncate: The terawrite operator retains the table attributes (including the schema) but discards existing records and appends new ones. The database user who writes in this mode must have DELETE and INSERT privileges on that table. Note: The terawrite operator cannot write to tables that have indexes other than the primary index defined on them. This applies to the append and truncate modes. Override the default append mode by means of the -mode option.
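For example, this hedged osh sketch reloads an existing table in truncate mode; the table, server, and credentials are hypothetical and the upstream stages are elided:
$ osh "... | terawrite -table daily_sales -server tdserv -dboptions '{-user = dsadm -password = secret}' -mode truncate"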
Writing fields
Fields of the InfoSphere DataStage data set are matched by name and data type to columns of the Teradata table but do not have to appear in the same order. The following rules determine which fields of an InfoSphere DataStage data set are written to a Teradata table: v If the InfoSphere DataStage data set contains fields for which there are no matching columns in the Teradata table, the step containing the operator terminates. However, you can remove an unmatched field from processing: either specify the -drop option and the operator drops any unmatched field, or use the modify operator to drop the extra field or fields before the write operation begins. v If the Teradata table contains a column that does not have a corresponding field in the InfoSphere DataStage data set, Teradata writes the column's default value into the field. If no default value is defined for the Teradata column, Teradata writes a null. If the field is not nullable, an error is generated and the step is terminated.
Limitations
Write operations have the following limitations: v A Teradata row can contain a maximum of 256 columns. v While the names of InfoSphere DataStage fields can be of any length, the names of Teradata columns cannot exceed 30 characters. If the names of InfoSphere DataStage fields exceed this limit, use the modify operator to change the InfoSphere DataStage field names.
v InfoSphere DataStage assumes that the terawrite operator writes to buffers whose maximum size is 32 KB. However, you can override this and enable the use of 64 KB buffers by setting the environment variable APT_TERA_64K_BUFFERS. v The InfoSphere DataStage data set cannot contain fields of the following types:
- int64
- Unsigned integer of any size
- String, fixed- or variable-length, longer than 32 KB
- Raw, fixed- or variable-length, longer than 32 KB
- Subrecord
- Tagged aggregate
- Vectors
If terawrite tries to write a data set whose fields contain a data type listed above, the write operation is not begun and the step containing the operator fails. You can convert unsupported data types by using the modify operator.
Restrictions
The terawrite operator is a distributed FastLoad and is subject to all the restrictions on FastLoad. In particular, there is a limit to the number of concurrent FastLoad and FastExport jobs. Each use of the teraread and terawrite operators counts toward this limit.
Table 201. terawrite Operator Options (continued) Option -dbname Usage and Meaning -dbname database_name By default, the write operation is carried out in the default database of the Teradata user whose profile is used. If no default database is specified in that user's Teradata profile, the user name is the default database. If you supply the database_name , the database to which it refers must exist and you must have necessary privileges. -dboptions -dboptions {-user = username , -password = password workdb= work_database[-sessionsperplayer = nn ] [-requestedsessions = nn ] [-synctimeout=nnn]} You must specify both the user name and password with which you connect to Teradata. If the user does not have CREATE privileges on the default database, the workdb option allows the user to specify an alternate database where the error tables and work table will be created. The value of -sessionsperplayer determines the number of connections each player has to Teradata. Indirectly, it also determines the number of players. The number selected should be such that (sessionsperplayer * number of nodes * number of players per node) equals the total requested sessions. The default is 2. Setting the value of -sessionsperplayer too low on a large system can result in so many players that the step fails due to insufficient resources. In that case, the value for -sessionsperplayer should be increased. The value of the optional -requestedsessions is a number between 1 and the number of vprocs in the database. synctimeout specifies the time that the player slave process waits for the control process. The default is 20 seconds. -drop -drop This optional flag causes the operator to silently drop all unmatched input fields.
Table 201. terawrite Operator Options (continued) Option -mode Usage and Meaning -mode append | create | replace | truncate append: Specify this option and terawrite appends new records to the table. The database user must have TABLE CREATE privileges and INSERT privileges on the table being written to. This mode is the default. create: Specify this option and terawrite creates a new table. The database user must have TABLE CREATE privileges. If a table exists of the same name as the one you want to create, the step that contains terawrite terminates in error. replace: Specify this option and terawrite drops the existing table and creates a new one in its place; the database user must have TABLE CREATE and TABLE DELETE privileges. If a table exists of the same name as the one you want to create, it is overwritten. truncate: Specify this option and terawrite retains the table attributes, including the schema, but discards existing records and appends new ones. The database user must have DELETE and INSERT privileges on the table. -open -open open_command Optionally specify a Teradata command to be parsed and executed by Teradata on all processing nodes before the table is populated. -primaryindex -primaryindex = ' field 1, field 2, ... fieldn ' Optionally specify a comma-separated list of field names that become the primary index for tables. Format the list according to Teradata standards and enclose it in single quotes. For performance reasons, the InfoSphere DataStage data set should not be sorted on the primary index. The primary index should not be a smallint, or a field with a small number of values, or a high proportion of null values. If no primary index is specified, the first field is used. All the considerations noted above apply to this case as well. -progressInterval -progressInterval number By default, the operator displays a progress message for every 100,000 records per partition it processes. Specify this option either to change the interval or to disable the messages. To change the interval, specify a new number of records per partition. To disable the messages, specify 0. -server -server servername You must specify the name of a Teradata server.
Table 201. terawrite Operator Options (continued) Option -stringlength Usage and Meaning -stringlength length Optionally specify the maximum length of variable-length raw or string fields. The default length is 32 bytes. The upper bound is slightly less than 32 KB. -table -table tablename [-selectlist list ] Specify the name of the table to write to. The table name must be a valid Teradata table name. -selectlist optionally specifies a list that determines which fields are written. If you do not supply the list, terawrite writes to all fields. Do not include formatting characters in the list.
sybaseread
Table 202. asesybasereade and sybasereade Properties (continued)
Transfer behavior: None
Partitioning method: Not applicable
Execution mode: Sequential
Collection method: Not applicable
Preserve-partitioning flag in output data set: Clear
Composite operator: No
Operator Action
The chief characteristics of the asesybasereade and sybasereade operators are: v You can direct them to run in specific node pools. v They translate the query's result set (a two-dimensional array) row by row to an InfoSphere DataStage data set. v Their output is an InfoSphere DataStage data set that you can use as input to a subsequent InfoSphere DataStage stage. v Their translation includes the conversion of Sybase data types to InfoSphere DataStage data types. v The size of Sybase rows can be greater than that of InfoSphere DataStage records. v The operators specify either a Sybase table to read or an SQL query to carry out. v They optionally specify commands to be run before the read operation is performed and after it has completed. v You can perform a join operation between an InfoSphere DataStage data set and Sybase data (from one or more tables).
Data Types that are not listed in the table above generate an error.
You can specify optional parameters to narrow the read operation. They are as follows: v The selectlist specifies the columns of the table to be read; by default, InfoSphere DataStage reads all columns. v The filter specifies the rows of the table to exclude from the read operation; by default, InfoSphere DataStage reads all rows. v You can optionally specify an -open and a -close command. These commands are executed by Sybase before the table is opened and after it is closed.
Join Operations
You can perform a join operation between InfoSphere DataStage data sets and Sybase data. First invoke the asesybasereade or sybasereade operator and then invoke either the lookup operator or a join operator. Alternatively, you can use the asesybaselookup or sybaselookup operator.
-server -- sybase server; exactly one occurrence required
server -- string
-table -- table to load; optional
tablename -- string
-db_name -- use database; optional
db_name -- string
-user -- sybase username; exactly one occurrence required
username -- string
-password -- sybase password; exactly one occurrence required
password -- string
-filter -- conjunction to WHERE clause that specifies which rows are read from the table; used with -table option; optional
filter -- string; default=all rows in table are read
-selectlist -- SQL select list that specifies which columns in table are read; used with -table option; optional
list -- string; default=all columns in table are read
-query -- SQL query to be used to read from table; optional
query -- string
-db_cs -- specifies database character set; optional
db_cs -- string
-open -- Command run before opening a table; optional
opencommand -- string
-close -- Command run after completion of processing a table; optional
closecommand -- string
Table 204. asesybasereade and sybasereade Options (continued) Option -selectlist Value list The suboption -selectlist optionally specifies a SQL select list, enclosed in single quotes, that can be used to determine which fields are read. -open opencommand Optionally specify a SQL statement to be executed before the insert array is processed. The statement is executed only once on the conductor node. -close closecommand Optionally specify a SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statement is executed only once on the conductor node. -query query Specify a SQL query to read from one or more tables. The -query option is mutually exclusive with the -table option. -db_cs db_cs Optionally specify the ICU code page, which represents the database character set in use. The default is ISO-8859-1.
The Sybase table contains the following rows:
itemNum  price  storeID
101      1.95   26
101      1.95   34
220      5.95   26
[Figure: a data flow step in which the Sybase read operator reads this table and passes it through a modify operator (field1 = itemNum) to the sample operator, whose input interface schema is field1:int32;in:*;]
The Sybase table contains three columns, the data types in the columns are converted by the operator as follows: v itemNum of type integer is converted to int32 v price of type NUMERIC [6,2] is converted to decimal [6,2] v storeID of type integer is converted to int32 The schema of the InfoSphere DataStage data set created from the table is also shown in this figure. Note that the InfoSphere DataStage field names are the same as the column names of the Sybase table. However, the operator to which the data set is passed has an input interface schema containing the 32-bit integer field field1, while the data set created from the Sybase table does not contain a field of the same name. For this reason, the modify operator must be placed between the read operator and sample operator to translate the name of the field, itemNum, to the name field1 Here is the osh syntax for this example:
$ osh "sybasereade -table table_1 -server server_name -user user1 -password user1 | modify $modifySpec | ... " $ modifySpec="field1 = itemNum;" modify (field1 = itemNum,;)
sybasewrite
Operator Action
Here are the chief characteristics of the asesybasewrite and sybasewrite operators: v Translation includes the conversion of InfoSphere DataStage data types to Sybase data types.
v The operator appends records to an existing table, unless you set another mode of writing. v When you write to an existing table, the input data set schema must be compatible with the table's schema. v Each instance of a parallel write operator running on a processing node writes its partition of the data set to the Sybase table. You can optionally specify Sybase commands to be parsed and executed on all processing nodes before the write operation runs or after it completes. v The asesybasewrite operator uses bcp to load data into a table. bcp can run in fast or slow mode. If any triggers or indexes have been defined on the table to write to, bcp automatically runs in slow mode, and you do not have to set any specific database properties. Otherwise, bcp runs in fast mode. However, bcp cannot run in fast mode unless you set the database property Select into/bulkcopy to True. To set this property, log in as a system administrator and run the following commands in the isql utility:
use master
go
sp_dboption <database name>, "select into/bulkcopy", true
go
use <database name>
go
checkpoint
go
Note: To send bad records down the reject link, you must set the environment variable APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL.
Write Modes
The write mode of the asesybasewrite or sybasewrite operator determines how the records of the data set are inserted into the destination table. The write mode can have one of the following values: v append: This is the default mode. The table must exist and the record schema of the data set must be compatible with the table. The write operator appends new rows to the table. The schema of the existing table determines the input interface of the operator. v create: The operator creates a new table. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. The schema of the InfoSphere DataStage data set determines the schema of the new table. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other nonstandard way, you can use the -createstmt option with your own create table statement. v replace: The operator drops the existing table and creates a new one in its place. If a table exists of the same name as the one you want to create, it is overwritten. The schema of the InfoSphere DataStage data set determines the schema of the new table.
v truncate: The operator retains the table attributes but discards existing records and appends new ones. The schema of the existing table determines the input interface of the operator. Each mode requires the specific user privileges shown in the table below: Note: If a previous write operation fails, you can retry your application specifying a write mode of replace to delete any information in the output table that might have been written by the previous attempt to run your program.
Table 207. Sybase privileges for the write operator (Write Mode: Required Privileges)
Append: INSERT on existing table
Create: TABLE CREATE
Replace: INSERT and TABLE CREATE on existing table
Truncate: INSERT on existing table
-truncate -- trucate field name to sybase max field name length; optional -truncateLength -- used with -truncate option to specify a truncation length; optional truncateLength -- integer; 1 or larger; default=1 -maxRejectRecords -- used with -reject option to specify maximum records that can be rejected; optional maximum reject records -- integer; 1 or larger; default=10 -mode -- mode: one of create, replace, truncate, or append.; optional create/replace/append/truncate -- string; value one of create, replace, append, truncate; default=append -createstmt -- create table statement; applies only to create and replace modes; optional statement -- string -db_cs -- specifies database character set; optional db_cs -- string -string_length -- specifies default data type length; optional string_length -- integer -drop -- drop input unmatched fields ; optional -reject -- mention the reject link; optional -open -- Command run before opening a table ; optional opencommand -- string -close -- Command run after completion of processing a table ; optional closecommand -- string Table 208. asesybasewrite options Option -db_name Value db_name Specify the data source to be used for all database connections. If you do not specify one, the default database is used. -user username Specify the user name used to connect to the data source. This option might not be required depending on the data source. -password password Specify the password used to connect to the data source. This option might not be required depending on the data source. -server server Specify the server name to be used for all database connections. -table table Specify the table to write to. May be fully qualified. -mode append | create | replace | truncate Specify the mode for the write operator as one of the following: v append: new records are appended into an existing table. v create: the operator creates a new table. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. The schema of the InfoSphere DataStage data set determines the schema of the new table. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other nonstandard way, you can use the -createstmt option with your own create table statement. v replace: The operator drops the existing table and creates a new one in its place. The schema of the InfoSphere DataStage data set determines the schema of the new table. v truncate: All records from an existing table are deleted before loading new records. -createstmt statement Optionally specify the create statement to be used for creating the table when -mode create is specified.
Table 208. asesybasewrite options (continued) Option -drop Value If this option is set unmatched fields of the InfoSphere DataStage data set will be dropped. An unmatched field is a field for which there is no identically named field in the datasource table. If this option is set column names are truncated to the maximum size allowed by the Sybase driver. n Specify the length to truncate column names to. maxRejectRecords maximum reject records Specify the maximum number of records that can be sent down the reject link. The default number is 10. You can specify the number one or higher. -open opencommand Optionally specify a SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node. -close closecommand Optionally specify a SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node. -string_length string_length Optionally specify the length of a string. The value is an integer. -reject -db_cs Optionally specify the reject link. db_cs Optionally specify the ICU code page, which represents the database character set in use. The default is ISO-8859-1.
-truncate -truncateLength
-open -- Command run before opening a table ; optional
opencommand -- string
-close -- Command run after completion of processing a table ; optional
closecommand -- string
Table 209. sybasewrite options Option -db_name Value db_name Specify the data source to be used for all database connections. This option is mandatory. -user username Specify the user name used to connect to the data source. This option might not be required depending on the data source. -password password Specify the password used to connect to the data source. This option might not be required depending on the data source. -server server Specify the server name used to connect to the database. This option is optional. -table table Specify the table to write to. May be fully qualified. -mode append/create/replace/truncate Specify the mode for the write operator as one of the following: v append: new records are appended into an existing table. v create: the operator creates a new table. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. The schema of the DataStage data set determines the schema of the new table. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other nonstandard way, you can use the -createstmt option with your own create table statement. v replace: The operator drops the existing table and creates a new one in its place. The schema of the DataStage data set determines the schema of the new table. v truncate: All records from an existing table are deleted before loading new records. -drop If this option is set unmatched fields of the DataStage data set will be dropped. An unmatched field is a field for which there is no identically named field in the datasource table. If this option is set column names are truncated to the maximum size allowed by the Sybase driver. n Specify the length to truncate column names to. -open opencommand Optionally specify a SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node. -close closecommand Optionally specify a SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node.
-truncate -truncateLength
Table 209. sybasewrite options (continued) Option -db_cs Value db_cs Optionally specify the ICU code page, which represents the database character set in use. The default is ISO-8859-1. -manual Optionally specify whether you wish to manually load data. -dumplocation string Optionally specify the control file and dump location. -string_length string_length Optionally specify the length of a string. The value is an integer.
[Figure: the sybasewrite operator writes to a Sybase table with columns itemNumber (number[10,0]), price (number[6,2]), and storeID (number[5,0]).]
The record schema of the InfoSphere DataStage data set and the row schema of the Sybase table correspond to one another, and field and column names are identical. Here are the input InfoSphere DataStage record schema and output Sybase row schema:
Table 210. Comparison of record schema and row schema (Input InfoSphere DataStage Record -> Output Sybase Table)
itemNum:int32; -> itemNum integer
price:decimal[3,2]; -> price decimal[3,2]
storeID:int16 -> storeID smallint
Note: Since the write mode defaults to append, the mode option does not appear in the command.
[Figure: the sybasewrite operator, with mode set to create, creates a Sybase table with columns age (number[5,0]), zip (char[5]), and income (number).]
Here is the osh syntax for this operator:
$ osh "... sybasewrite -table table_2 -mode create -dboptions {user = user101, password = userPword} ..."
The sybasewrite operator creates the table, giving the Sybase columns the same names as the fields of the input InfoSphere DataStage data set and converting the InfoSphere DataStage data types to Sybase data types.
[Figure: a data flow in which a modify operator renames and drops fields before the sybasewrite operator writes to a Sybase table with columns itemNumber (number[10,0]), price (number[6,2]), and storeID (number[5,0]).]
In this example, you use the modify operator to: v Translate field names of the input data set to the names of corresponding fields of the operator's input interface schema, that is skewNum to itemNum and store to storeID. v Drop the unmatched rating field, so that no error occurs. Note: InfoSphere DataStage performs automatic type conversion of store, promoting its int8 data type in the input data set to int16 in the sybasewrite input interface. Here is the osh syntax for this operator:
$ modifySpec="itemNum = skewNum, storeID = store;drop rating" $ osh "... op1 | modify $modifySpec | sybasewrite -table table_2 -dboptions {user = user101, password = UserPword}"
sybaseupsert
Operator Action
Here are the main characteristics of asesybaseupsert and sybaseupsert: v If an -insert statement is included, the insert is executed first. Any records that fail to be inserted because of a unique-constraint violation are then used in the execution of the update statement.
v InfoSphere DataStage uses host-array processing by default to enhance the performance of insert array processing. Each insert array is executed with a single SQL statement. Update records are processed individually. v You use the -insertArraySize option to specify the size of the insert array. For example:
-insertArraySize 250
The default length of the insert array is 500. To direct InfoSphere DataStage to process your insert statements individually, set the insert array size to 1:
-insertArraySize 1
v Your record fields can be variable-length strings. You can specify a maximum length or use the default maximum length of 80 characters. This example specifies a maximum length of 50 characters:
record (field1: string [max=50])
v When an insert statement is included and host array processing is specified, an InfoSphere DataStage update field must also be an InfoSphere DataStage insert field. v The operators convert all values to strings before passing them to Sybase. The following InfoSphere DataStage data types are supported:
- int8, uint8, int16, uint16, int32, uint32, int64, and uint64
- dfloat and sfloat
- decimal
- strings of fixed and variable length
- timestamp
- date
v By default, the operators produce no output data set. By using the -reject option, you can specify an optional output data set containing the records that fail to be inserted or updated. Its syntax is:
-reject filename
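Putting these pieces together, here is a hedged osh sketch of a sybaseupsert invocation; the table, statements, connection values, and reject data set are hypothetical, and the Orchestrate.fieldname placeholders are assumed to be resolved from the input record in the same way as described for the lookup operators later in this chapter:
$ osh "... | sybaseupsert -db_name mydb -server mysrv -user dsadm -password secret -update 'update accounts set balance = Orchestrate.balance where acct_id = Orchestrate.acct_id' -insert 'insert into accounts values (Orchestrate.acct_id, Orchestrate.balance)' -reject reject.ds"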
array size; optional integer; 1 or larger; default=1 row commit interval; optional integer; 1 or larger; default=1 Command run before opening a table ; optional string Command run after completion of processing a table ;optional string
Specify an ICU character set to map between Sybase char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to Sybase. The default character set is UTF-8, which is compatible with your osh jobs that contain 7-bit US-ASCII data.
Table 212. sybaseupsert options Options -db_name value -db_name database_name Specify the data source to be used for all database connections. If no database is specified the default is used. -server -server servername Specify the server name to be used for all database connections. This option is mandatory. -user -user user_ name Specify the user name used to connect to the data source. -password -password password Specify the password used to connect to the data source. Statement options The user must specify at least one of the following three options and no more than two. An error is generated if the user does not specify a statement option or specifies more than two. -update update_statement Optionally specify the update or delete statement to be executed. -insert -insert insert_statement Optionally specify the insert statement to be executed. -delete -delete delete_statement Optionally specify the delete statement to be executed.
-update
Table 212. sybaseupsert options (continued) Options -mode value -mode insert_update | update_insert | delete_insert Specify the upsert mode to be used when two statement options are specified. If only one statement option is specified the upsert mode will be ignored. v insert_update - The insert statement is executed first. If the insert fails due to a duplicate key violation (that is, record exists), the update statement is executed. This is the default upsert mode. v update_insert - The update statement is executed first. If the update fails because the record doesn't exist, the insert statement is executed. v delete_insert - The delete statement is executed first. Then the insert statement is executed. -reject -reject If this option is set, records that fail to be updated or inserted are written to a reject data set. You must designate an output data set for this purpose. If this option is not specified an error is generated if records fail to update or insert. -open -open open_command Optionally specify a SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node. -close -close close_command Optionally specify a SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node. -insertarraysize n Optionally specify the size of the insert/update array. The default size is 2000 records.
Table 213. Sybase table (continued)
Table before dataflow: 873092 2001.89; 675066 3523.62; 566678 89.72
Input file contents and resulting InfoSphere DataStage action:
566678 2008.56 (Update)
678888 7888.23 (Insert)
073587 82.56 (Update)
995666 75.72 (Insert)
The asesybaselookup and sybaselookup operators replace each Orchestrate.fieldname with a field value, submit the statement containing the value to Sybase, and output a combination of Sybase and InfoSphere DataStage data. Alternatively, you can use the -key/-table options interface to specify one or more key fields and one or more Sybase tables. The following osh options specify two keys and a single table:
-key a -key b -table data.testtbl
The resulting InfoSphere DataStage output data set includes the InfoSphere DataStage records and the corresponding rows from each referenced Sybase table. When a Sybase table has a column name that is the same as an InfoSphere DataStage data set field name, the Sybase column is renamed using the following syntax:
APT_integer_fieldname
An example is APT_0_lname. The integer component is incremented when duplicate names are encountered in additional tables. Note: If the Sybase ASE/IQ table is not indexed on the lookup keys, the performance of the asesybaselookup or sybaselookup operator is likely to be poor.
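For orientation, here is a hedged osh sketch that uses the -table and -key interface described above; the server, database, credentials, and key names are hypothetical, and the surrounding stages are elided:
$ osh "... | sybaselookup -server mysrv -db_name mydb -user dsadm -password secret -table data.testtbl -key a -key b -ifNotFound continue | ... "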
sybaselookup
You must specify either the -query option or one or more -table options with one or more -key fields.
Table 216. asesybaselookup Options Option -server Value server Specify the server to connect to. -db_name db_name Specify the data source to be used for all database connections. This option is required. -user username Specify the user name used to connect to the data source. This option might not be required depending on the data source. -password password Specify the password used to connect to the data source. This option might not be required depending on the data source. -table table Specify a table and key fields to be used to generate a lookup query. This option is mutually exclusive with the -query option. -filter filter Specify the rows of the table to exclude from the read operation. This predicate will be appended to the where clause of the SQL statement to be executed. -selectlist string Specify the list of column names that will appear in the select clause of the SQL statement to be executed. -transfer_adapter transfer_adapter Optionally specify the transfer adapter for table fields. -ifNotFound fail | drop | reject | continue Specify an action to be taken when a lookup fails. Can be one of the following: v fail: stop job execution v drop: drop failed record from the output data set v reject: put records that are not found into a reject data set. You must designate an output data set for this option v continue: leave all records in the output data set (outer join ) -query query Specify a lookup SQL query to be executed. This option is mutually exclusive with the -table option. -open opencommand Optionally specify a SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node.
Table 216. asesybaselookup Options (continued) Option -close Value closecommand Optionally specify a SQL statement to be executed after the insertarray is processed. You cannot commit work using this option. The statement is executed only once on the conductor node. -fetcharraysize n Specify the number of rows to retrieve during each fetch operation. Default is 1. -db_cs db_cs Optionally specify the ICU code page, which represents the database character set in use. The default is ISO-8859-1.
You must specify either the -query option or one or more -table options with one or more -key fields.
Table 217. sybaselookup Options Option -db_name Value -db_name database_name Specify the data source to be used for all database connections. If you do not specify a database, the default is used. -user -user user_name Specify the user name used to connect to the data source. This option might not be required depending on the data source -password -password password Specify the password used to connect to the data source. This option might not be required depending on the data source -table -table table_name Specify a table and key fields to be used to generate a lookup query. This option is mutually exclusive with the -query option. The -table option has 3 suboptions: v -filter where_predicate Specify the rows of the table to exclude from the read operation. This predicate will be appended to the where clause of the SQL statement to be executed. v -selectlist list Specify the list of column names that will appear in the select clause of the SQL statement to be executed v -key field Specify a lookup key. A lookup key is a field in the table that will be used to join against a field of the same name in the InfoSphere DataStage data set. The -key option can be specified more than once to specify more than one key field. -ifNotFound -ifNotFound fail| drop | reject | continue Specify an action to be taken when a lookup fails. Can be one of the following:fail: stop job execution v drop: drop failed record from the output data set v reject: put records that are not found into a reject data set. You must designate an output data set for this option v continue: leave all records in the output data set (outer join ) -query -query sql_query Specify a lookup query to be executed. This option is mutually exclusive with the -table option -open -open open_command Optionally specify a SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node.
Table 217. sybaselookup Options (continued) Option -close Value -close close_command Optionally specify a SQL statement to be executed after the insertarray is processed. You cannot commit work using this option. The statement is executed only once on the conductor node. -fetcharraysize -fetcharraysize n Specify the number of rows to retrieve during each fetch operation. Default is 1. -db_cs -db_cs character_set Optionally specify the ICU code page, which represents the database character set in use. The default is ISO-8859-1. This option has the following sub option: -use_strings -use_strings If this option is set strings (instead of ustrings) will be generated in the DataStage schema.
InfoSphere DataStage prints the lname, fname, and DOB column names and values from the InfoSphere DataStage input data set and also the lname, fname, and DOB column names and values from the Sybase table. If a column name in the Sybase table has the same name as an InfoSphere DataStage output data set schema fieldname, the printed output shows the column in the Sybase table renamed using this format:
APT_integer_fieldname
UNIX
About this task
v Ensure the shared library path includes $dshome/../branded_odbc/lib and set the ODBCINI environment variable to $dshome/.odbc.ini. v Start SQL Server. v Add $APT_ORCHHOME/branded_odbc to your PATH and $APT_ORCHHOME/branded_odbc/lib to your LIBPATH, LD_LIBRARY_PATH, or SHLIB_PATH. A sketch of these settings appears after this list. v Check that you can access SQL Server by using a valid user name and password.
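A minimal sketch of these settings for a Bourne-style shell follows; it assumes $dshome and $APT_ORCHHOME are already defined for your installation and uses LD_LIBRARY_PATH (substitute LIBPATH or SHLIB_PATH on platforms that use them):
export ODBCINI=$dshome/.odbc.ini
export PATH=$PATH:$APT_ORCHHOME/branded_odbc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$dshome/../branded_odbc/lib:$APT_ORCHHOME/branded_odbc/lib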
Windows
About this task
v Create a DSN for the SQL Server database using the ODBC Datasource Administrator. Select Microsoft SQL Server driver for a datasource which is to be used with sqlsrvrwrite. Select the Data Direct SQL Server Wire driver datasource for use with sqlsrvread, sqlsrvrlookup or sqlsrvrupsert. v Check you can access SQL server using a valid user name and password.
sqlsrvrread
sqlsrvrread: properties
Table 218. sqlsrvrread Operator Properties
Number of input data sets: 0
Number of output data sets: 1
Input interface schema: None
Output interface schema: Determined by the SQL query
Transfer behavior: None
Execution mode: Sequential
Partitioning method: Not applicable
Collection method: Not applicable
Preserve-partitioning flag in output data set: Clear
Composite operator: No
Operator action
Below are the chief characteristics of the sqlsrvrread operator: v You can direct it to run in specific node pools. v It translates the query's result set (a two-dimensional array) row by row to a data set. v Its output is a data set that you can use as input to a subsequent InfoSphere DataStage operator. v Its translation includes the conversion of SQL Server data types to InfoSphere DataStage data types. v The size of SQL Server rows can be greater than that of InfoSphere DataStage records. v The operator specifies either a SQL Server table to read or an SQL query to carry out.
v It optionally specifies commands to be run before the read operation is performed and after it has completed the operation. v You can perform a join operation between InfoSphere DataStage data set and SQL Server (there might be one or more tables) data.
Table 219. Mapping of SQL Server datatypes to InfoSphere DataStage datatypes (continued) SQL Server Data Type TIMESTAMP UNIQUEIDENTIFIER SMALLMONEY MONEY InfoSphere DataStage Data Type raw(8) string[36] decimal[10,4] decimal[19,4]
Note: Datatypes that are not listed in this table generate an error.
You can specify optional parameters to narrow the read operation. They are as follows: v The selectlist parameter specifies the columns of the table to be read. By default, InfoSphere DataStage reads all columns. v The filter parameter specifies the rows of the table to exclude from the read operation. By default, InfoSphere DataStage reads all rows. You can optionally specify an -open and a -close command. These commands are executed by sqlsrvrread on SQL Server before the table is opened and after it is closed.
Join operations
You can perform a join operation between InfoSphere DataStage data sets and SQL Server data. First invoke the sqlsrvrread operator and then invoke either the lookup operator (see Lookup Operator) or a join operator (see "The Join Library.")
You must specify either the -query or -table option. You must also specify the -data_source, -user, and -password options.
Table 220. sqlsrvrread operator options Option -data_source Value -data_source data_source_name Specify the data source to be used for all database connections. This option is required. -user -user user_name Specify the user name used to connect to the data source. This option might or might not be required depending on the data source. -password -password password Specify the password used to connect to the data source. This option might or might not be required depending on the data source. -tableName -tableName table_name Specify the table to read. May be fully qualified. The -table option is mutually exclusive with the -query option. This option has 2 suboptions: v -filter where_predicate. Optionally specify the rows of the table to exclude from the read operation. This predicate will be appended to the where clause of the SQL statement to be executed. v -selectlist select_predicate. Optionally specify the list of column names that will appear in the select clause of the SQL statement to be executed.
Table 220. sqlsrvrread operator options (continued)
-open open_command
Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node.
-close close_command
Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node.
-query sql_query
Specify an SQL query to read from one or more tables. The -query option is mutually exclusive with the -table option.
-fetcharraysize n
Specify the number of rows to retrieve during each fetch operation. The default is 1.
-isolation_level read_uncommitted | read_committed | repeatable_read | serializable
Optionally specify the isolation level for accessing data. The default isolation level is decided by the database or possibly specified in the data source.
-db_cs character_set
Optionally specify the ICU code page which represents the database character set in use. The default is ISO-8859-1. This option has the following suboption:
v -use_strings. If this option is set, strings will be generated in the InfoSphere DataStage schema.
Sqlsrvrread example 1: Reading a SQL Server table and modifying a field name
The following figure shows a SQL Server table used as input to an InfoSphere DataStage operator:
[Figure: a step in which sqlsrvrread feeds a modify operator (modification specification: field1 = itemNum) whose output schema is field1:int32;in:*;, which in turn feeds a sample operator.]
The SQL Server table contains three columns. The operator converts the data in those columns as follows:
v itemNum of type INTEGER is converted to int32.
v price of type DECIMAL[6,2] is converted to decimal[6,2].
v storeID of type SMALLINT is converted to int16.
The schema of the InfoSphere DataStage data set created from the table is also shown in this figure. Note that the InfoSphere DataStage field names are the same as the column names of the SQL Server table. However, the operator to which the data set is passed has an input interface schema containing the 32-bit integer field field1, while the data set created from the SQL Server table does not contain a field with that name. For this reason, the modify operator must be placed between sqlsrvrread and the sample operator to translate the name of the field itemNum to the name field1. Here is the osh syntax for this example:
$ osh "sqlsrvrread -tablename table_1 -data_source datasource1 -user user1 -password user1 | modify $modifySpec | ... " $ modifySpec="field1 = itemNum;" modify (field1 = itemNum,;)
sqlsrvrwrite
sqlsrvrwrite: properties
Table 221. sqlsrvrwrite Operator Properties
Number of input data sets: 1
Number of output data sets: 0
Input interface schema: Derived from the input data set
Output interface schema: None
Transfer behavior: None
Execution mode: Sequential default or parallel
Partitioning method: Not applicable
Collection method: Any
Preserve-partitioning flag: Default clear
Table 221. sqlsrvrwrite Operator Properties (continued)
Composite operator: No
Operator action
Here are the chief characteristics of the sqlsrvrwrite operator:
v Translation includes the conversion of InfoSphere DataStage data types to SQL Server data types.
v The operator appends records to an existing table, unless you set another writing mode.
v When you write to an existing table, the input data set schema must be compatible with the table's schema.
v Each instance of a parallel write operator running on a processing node writes its partition of the data set to the SQL Server table. You can optionally specify the SQL Server commands to be parsed and executed on all processing nodes before the write operation runs or after it completes.
Table 222. Mapping of InfoSphere DataStage data types to SQL Server data types (continued)
InfoSphere DataStage data type: SQL Server data type
int32: INTEGER
sfloat: REAL
dfloat: FLOAT
uint8 (0 or 1): BIT
uint8: TINYINT
int64: BIGINT
raw(n): BINARY
raw(max=n): VARBINARY
timestamp[p]: DATETIME
timestamp[p]: SMALLDATETIME
string[36]: UNIQUEIDENTIFIER
raw(8): TIMESTAMP
decimal[10,4]: SMALLMONEY
decimal[19,4]: MONEY
Write modes
The write mode of the operator determines how the records of the data set are inserted into the destination table. The write mode can have one of the following values:
v append: This is the default mode. The table must exist and the record schema of the data set must be compatible with the table. The write operator appends new rows to the table. The schema of the existing table determines the input interface of the operator.
v create: This operator creates a new table. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. The schema of the new table is determined by the schema of the InfoSphere DataStage data set. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other non-standard way, you can use the -createstmt option with your own create table statement.
v replace: This operator drops the existing table and creates a new one in its place. If a table exists with the same name as the one you want to create, the existing table is overwritten. The schema of the new table is determined by the schema of the InfoSphere DataStage data set.
v truncate: This operator retains the table attributes but discards existing records and appends new ones. The schema of the existing table determines the input interface of the operator.
Each mode requires the specific user privileges shown in the table below.
Note: If a previous write operation fails, you can retry your job specifying a write mode of replace to delete any information in the output table that might have been written by the previous attempt to run your program.
Table 223. Required SQL Server privileges for write operators
Append: INSERT on existing table
Create: TABLE CREATE
Replace: INSERT and TABLE CREATE on existing table
Truncate: INSERT on existing table
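For example, a retry of a failed load might use the replace mode so that rows written by the earlier attempt are discarded; this is only a sketch, and the data source, credentials, and table name are hypothetical:
$ osh "... op1 | sqlsrvrwrite -data_source datasource1 -user user1 -password secret -tableName target_tbl -mode replace"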
Table 224. sqlsrvrwrite Operator options
-data_source data_source_name
Specify the data source to be used for all database connections. This option is required.
-user user_name
Specify the user name used to connect to the data source. This option might or might not be required depending on the data source.
Table 224. sqlsrvrwrite Operator options (continued)
-password password
Specify the password used to connect to the data source. This option might or might not be required depending on the data source.
-tableName table_name
Specify the table to which to write. May be fully qualified.
-mode append | create | replace | truncate
Specify the mode for the write operator as one of the following:
append: New records are appended to an existing table.
create: A new table is created. If a table exists with the same name as the one you want to create, the step that contains the operator terminates with an error. The schema of the new table is determined by the schema of the InfoSphere DataStage data set. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other non-standard way, you can use the -createstmt option with your own create table statement.
replace: The operator drops the existing table and creates a new one in its place. The schema of the new table is determined by the schema of the InfoSphere DataStage data set.
truncate: All records from an existing table are deleted before loading new records.
-createstmt create_statement
Optionally specify the create statement to be used for creating the table when -mode create is specified.
-drop
If this option is set, unmatched fields of the InfoSphere DataStage data set will be dropped. An unmatched field is a field for which there is no identically named field in the data source table.
-truncate
If this option is set, column names are truncated to the maximum size allowed by the SQL Server driver.
-truncateLength n
Specify the length to which to truncate column names.
-open open_command
Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node.
-close close_command
Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node.
Table 224. sqlsrvrwrite Operator options (continued)
-batchsize n
Optionally specify the number of records that should be committed before starting a new transaction. The default value is 1000 records.
-isolation_level read_uncommitted | read_committed | repeatable_read | serializable
Optionally specify the isolation level for accessing data. The default isolation level is decided by the database or possibly specified in the data source.
-db_cs character_set
Optionally specify the ICU code page which represents the database character set in use. The default is ISO-8859-1.
[Figure: sqlsrvrwrite writes the input data set to a SQL Server table with columns itemNumber (INTEGER), price (DECIMAL), and storeID (SMALLINT).]
The record schema of the InfoSphere DataStage data set and the row schema of the SQL Server table correspond to one another, and field and column names are identical. Here are the input InfoSphere DataStage record schema and the output SQL Server row schema:
Table 225. Schemas for Example 1
InfoSphere DataStage: itemNum:int32; price:decimal[6,2]; storeID:int16;
SQL Server: itemNum INTEGER, price DECIMAL[6,2], storeID SMALLINT
Note: Since the write mode defaults to append, the mode option does not appear in the command.
[Figure: the write operator, run in create mode, creates a new table from an input data set with fields age (smallint), zip (char[5]), and income (float).]
Here is the osh syntax for this operator:
$ osh "... sqlsrvrwrite -table table_2 -mode create -dboptions {user = user101, password = userPword} ..."
The sqlsrvrwrite operator creates the table, giving the SQL Server columns the same names as the fields of the input InfoSphere DataStage data set and converting the InfoSphere DataStage data types to SQL Server data types.
[Figure: a modify operator feeds sqlsrvrwrite, which writes to a table with columns itemNumber (number[10,0]), price (number[6,2]), and storeID (number[5,0]).]
$ modifySpec="itemNum = skewNum, storeID = store;drop rating"
$ osh "... op1 | modify '$modifySpec' | sqlsrvrwrite -table table_2 -user username -password passwd"
sqlsrvrupsert
sqlsrvrupsert: properties
Table 226. sqlsrvrupsert Properties
Number of input data sets: 1
Number of output data sets by default: None; 1 when you select the -reject option
Input interface schema: Derived from your insert and update statements
Transfer behavior: Rejected update records are transferred to an output data set when you select the -reject option.
Execution mode: Parallel by default, or sequential
Partitioning method: Same. You can override this partitioning method. However, a partitioning method of entire cannot be used.
Collection method: Any
Combinable operator: Yes
Operator action
Here are the main characteristics of sqlsrvrupsert:
v If an -insert statement is included, the insert is executed first. Any records that fail to be inserted because of a unique-constraint violation are then used in the execution of the update statement.
v InfoSphere DataStage uses host-array processing by default to enhance the performance of insert array processing. Each insert array is executed with a single SQL statement. Update records are processed individually.
v You use the -insertArraySize option to specify the size of the insert array. For example: -insertArraySize 250.
v The default length of the insert array is 500. To direct InfoSphere DataStage to process your insert statements individually, set the insert array size to 1: -insertArraySize 1.
v Your record fields can be variable-length strings. You can specify a maximum length or use the default maximum length of 80 characters. This example specifies a maximum length of 50 characters:
record(field1:string[max=50])
v When an insert statement is included and host array processing is specified, an InfoSphere DataStage update field must also be an InfoSphere DataStage insert field.
v The sqlsrvrupsert operator converts all values to strings before passing them to SQL Server. The following InfoSphere DataStage data types are supported:
- int8, uint8, int16, uint16, int32, uint32, int64, and uint64
- dfloat and sfloat
- decimal
- strings of fixed and variable length
- timestamp
- date
v By default, sqlsrvrupsert produces no output data set. By using the -reject option, you can specify an optional output data set containing the records that fail to be inserted or updated. Its syntax is: -reject filename
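Putting these options together, an upsert invocation might look like the following sketch; the data source, credentials, table, statements, and reject data set names are hypothetical:
$ osh "... op1 | sqlsrvrupsert -data_source datasource1 -user user1 -password secret -insert 'insert into accounts values (ORCHESTRATE.acct_id, ORCHESTRATE.acct_balance)' -update 'update accounts set acct_balance = ORCHESTRATE.acct_balance where acct_id = ORCHESTRATE.acct_id' -mode insert_update -insertarraysize 250 -reject reject.ds"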
Specify an ICU character set to map between the SQL Server char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to SQL Server. The default character set is UTF-8 which is compatible with your osh jobs that contain 7-bit US-ASCII data.
Table 227. sqlsrvrupsert Operator Options
-data_source data_source_name
Specify the data source to be used for all database connections. This option is required.
-user user_name
Specify the user name used to connect to the data source. This option might or might not be required depending on the data source.
-password password
Specify the password used to connect to the data source. This option might or might not be required depending on the data source.
Statement options: The user must specify at least one of the options listed in the next three rows of this table. No more than two statements should be specified. An error is generated if the user does not specify a statement option or specifies more than two.
-update update_statement
Optionally specify the update or delete statement to be executed.
-insert insert_statement
Optionally specify the insert statement to be executed.
-delete delete_statement
Optionally specify the delete statement to be executed.
-mode insert_update | update_insert | delete_insert
Specify the upsert mode to be used when two statement options are specified. If only one statement option is specified, then the upsert mode will be ignored.
insert_update: The insert statement is executed first. If the insert fails due to a duplicate key violation (that is, if the record exists), the update statement is executed. This is the default upsert mode.
update_insert: The update statement is executed first. If the update fails because the record doesn't exist, the insert statement is executed.
delete_insert: The delete statement is executed first. Then the insert statement is executed.
-reject
If this option is set, records that fail to be updated or inserted are written to a reject data set. You must designate an output data set for this purpose. If this option is not specified, an error is generated if records fail to update or insert.
Table 227. sqlsrvrupsert Operator Options (continued)
-open open_command
Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node.
-close close_command
Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statements are executed only once on the conductor node.
-insertarraysize n
Optionally specify the size of the insert/update array. The default size is 2000 records.
-rowCommitInterval n
Optionally specify the number of records that should be committed before starting a new transaction. This option can only be specified if arraysize = 1. Otherwise rowCommitInterval = arraysize. This is because of the rollback and retry logic that occurs when an array execute fails.
Table 229. Table after Upsert (continued)
acct_id / acct_balance:
873092 / 67.23
675066 / 3523.62
566678 / 2008.56
865544 / 8569.23
678888 / 7888.23
995666 / 75.72
The operator replaces each fieldname with a field value, submits the statement containing the value to SQL Server, and outputs a combination of SQL Server and InfoSphere DataStage data. Alternatively, you can use the -key/-table options interface to specify one or more key fields and one or more SQL Server tables. The following osh options specify two keys and a single table:
-key a -key b -table data.testtbl
The resulting InfoSphere DataStage output data set includes the InfoSphere DataStage records and the corresponding rows from each referenced SQL Server table. When a SQL Server table has a column name that is the same as a data set field name, the SQL Server column is renamed using the following syntax:
APT_integer_fieldname
For example:
APT_0_lname.
The integer component is incremented when duplicate names are encountered in additional tables. Note: If the SQL Server table is not indexed on the lookup keys, this operator's performance is likely to be poor.
sqlsrvrlookup
sqlsrvrlookup: properties
Table 230. sqlsrvrlookup Operator Properties
Number of input data sets: 1
Number of output data sets: 1; 2 if you include the -ifNotFound reject option
Input interface schema: Determined by the query
Output interface schema: Determined by the SQL query
Transfer behavior: Transfers all fields from input to output
Execution mode: Sequential or parallel (default)
Partitioning method: Not applicable
Collection method: Not applicable
Table 230. sqlsrvrlookup Operator Properties (continued)
Preserve-partitioning flag in output data set: Clear
Composite operator: No
You must specify the -query option or one or more -table options with one or more -key fields. The sqlsrvrlookup operator is a parallel operator by default. The options are as follows:
Table 231. sqlsrvrlookup Operator Options
-data_source data_source_name
Specify the data source to be used for all database connections. This option is required.
-user user_name
Specify the user name used to connect to the data source. This option might or might not be required depending on the data source.
-password password
Specify the password used to connect to the data source. This option might or might not be required depending on the data source.
-tableName table_name
Specify a table and key fields to be used to generate a lookup query. This option is mutually exclusive with the -query option. The -tableName option has 2 suboptions:
v -filter where_predicate. Specify the rows of the table to exclude from the read operation. This predicate will be appended to the where clause of the SQL statement to be executed.
v -selectlist select_predicate. Specify the list of column names that will appear in the select clause of the SQL statement to be executed.
Table 231. sqlsrvrlookup Operator Options (continued)
-key field
Specify a lookup key. A lookup key is a field in the table that will be joined to a field of the same name in the InfoSphere DataStage data set. The -key option can be specified more than once to specify more than one key field.
-ifNotFound fail | drop | reject | continue
Specify an action to be taken when a lookup fails. Can be one of the following:
v fail: stop job execution
v drop: drop the failed record from the output data set
v reject: put records that are not found into a reject data set. You must designate an output data set for this option
v continue: leave all records in the output data set (outer join)
-query sql_query
Specify a lookup query to be executed. This option is mutually exclusive with the -table option.
-open open_command
Optionally specify an SQL statement to be executed before the insert array is processed. The statements are executed only once on the conductor node.
-close close_command
Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statement is executed only once on the conductor node.
-fetcharraysize n
Specify the number of rows to retrieve during each fetch operation. Default is 1.
-db_cs character_set
Optionally specify the ICU code page which represents the database character set in use. The default is ISO-8859-1. This option has the following suboption:
v -use_strings. If this option is set, strings (instead of ustrings) will be generated in the InfoSphere DataStage schema.
$ osh " sqlsrvrlookup - } -key lname -key fname -key DOB < data1.ds > data2.ds "
InfoSphere DataStage prints the lname, fname, DOB column names, and values from the InfoSphere DataStage input data set as well as the lname, fname, DOB column names, and values from the SQL Server table. If a column name in the SQL Server table has the same name as an InfoSphere DataStage output data set schema fieldname, the printed output shows the column in the SQL Server table renamed using this format:
APT_integer_fieldname
iwayread
iwayread: properties
Table 232. iwayread Operator Properties
Number of input data sets: 0
Number of output data sets: 1
Input interface schema: None
Output interface schema: outRec:*
Transfer behavior: The table columns are mapped to the InfoSphere DataStage underlying data types and the output schema is generated for the data set.
Execution mode: Sequential
Partitioning method: Not applicable
Collection method: Not applicable
Preserve-partitioning flag in output data set: Not applicable
Composite operator: No
Operator action
Below are the chief characteristics of the iwayread operator:
v You can direct it to run in specific node pools.
v Its output is a data set that you can use as input to a subsequent InfoSphere DataStage operator.
v Its translation includes the conversion of iWay data types to InfoSphere DataStage data types.
Table 233. Data Type Conversion
InfoSphere DataStage data type: iWay data type
int32: Integer
sfloat: Single Float
dfloat: Double Float
decimal (m,n): Decimal (m,n)
string [n]: Alphanumeric (length=n)
raw: Binary
date: Date
string: Text
Not supported: Graphic (DBCS)
time: Time
timestamp: Timestamp
Table 234. iwayread operator options and its values (continued)
-query sql_query
Specify an SQL query to read from one or more tables. The -query option is mutually exclusive with the -table option. It has one suboption:
v -query_type stmt_type. Optionally specify the type of statement. Can be one of: SQL - an SQL query; StoredProc - indicates that the statement specified is a FOCUS procedure; CMD - a command. The default is SQL.
-table table_name
Specify the table to read. May be fully qualified. The -table option is mutually exclusive with the -query option. This option has two suboptions:
v -filter where_predicate. Optionally specify the rows of the table to exclude from the read operation. This predicate will be appended to the where clause of the SQL statement to be executed.
v -selectlist select_predicate. Optionally specify the list of column names that will appear in the select clause of the SQL statement to be executed.
-timeout timeout
Optionally specify a timeout value (in seconds) for the statement specified with a -table or -query option. The default value is 0, which causes InfoSphere DataStage to wait indefinitely for the statement to execute.
-open open_command
Optionally specify an SQL statement to be executed before the read is carried out. The statements are executed only once on the conductor node. This option has two suboptions:
v -open_type stmt_type. Optionally specify the type of argument being supplied with -open. Can be one of: SQL - an SQL query; StoredProc - indicates that the statement specified is a FOCUS procedure; CMD - a command. The default is SQL.
v -open_timeout timeout. Optionally specify the timeout value (in seconds) for the statement specified with the -open option. The default is 0, which means InfoSphere DataStage will wait indefinitely for the statement to execute.
Table 234. iwayread operator options and its values (continued)
-close close_command
Optionally specify an SQL statement to be executed after the read is carried out. You cannot commit work using this option. The statements are executed only once on the conductor node. This option has two suboptions:
v -close_type stmt_type. Optionally specify the type of argument being supplied with -close. Can be one of: SQL - an SQL query; StoredProc - indicates that the statement specified is a FOCUS procedure; CMD - a command. The default is SQL.
v -close_timeout timeout. Optionally specify the timeout value (in seconds) for the statement specified with the -close option. The default is 0, which means InfoSphere DataStage will wait indefinitely for the statement to execute.
-db_cs code_page
Optionally specify the ICU code page that represents the character set used by the database you are accessing through iWay. The default is ISO-8859-1. This option has a single suboption:
v -use_strings. Set this to have InfoSphere DataStage generate strings instead of ustrings in the schema.
-data_password data_password
Optionally specify a table-level password for accessing the required table.
-eda_params name=value,... name=value
Optionally specify values for the iWay environment variables as a list of semi-colon separated name=value pairs.
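A minimal read through iWay might look like the following sketch; the server, credentials, and table name are hypothetical, and the connection options (-server, -user, -password) are assumed to match those shown for the related iWay operators:
$ osh "iwayread -server edasrv -user user1 -password secret -table SALES.ORDERS >| orders.ds"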
This operator is particularly useful for sparse lookups, that is, where the InfoSphere DataStage data set you are matching is much smaller than the table. If you expect to match 90% of your data, using the iwayread operator in conjunction with the lookup operator is probably more efficient. Because iwaylookup can do lookups against more than one table accessed through iWay, it is useful for joining multiple tables in one query. The -statement option corresponds to an SQL statement of this form:
select a,b,c from data.testtbl where Orchestrate.b = data.testtbl.c and Orchestrate.name = "Smith"
The operator replaces each fieldname with a field value, submits the statement containing the value to the database accessed through iWay, and outputs a combination of the two tables. Alternatively, you can use the -key/-table options interface to specify one or more key fields and one or more tables. The following osh options specify two keys and a single table:
-key a -key b -table data.testtbl
The resulting InfoSphere DataStage output data set includes the InfoSphere DataStage records and the corresponding rows from each referenced table. When a table has a column name that is the same as an InfoSphere DataStage data set field name, the column is renamed using the following syntax:
APT_integer_fieldname
For example:
APT_0_lname.
The integer component is incremented when duplicate names are encountered in additional tables. Note: If the table is not indexed on the lookup keys, this operator's performance is likely to be poor.
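For example, the following sketch uses the -key and -table form; the server, credentials, table, and key names are hypothetical:
$ osh "... op1 | iwaylookup -server edasrv -user user1 -password secret -table data.testtbl -key a -key b -IfNotFound continue > out.ds"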
iwaylookup
iwaylookup: properties
Table 235. iwaylookup Operator Properties
Number of input data sets: 1
Number of output data sets: 1; 2 if you include the -ifNotFound reject option
Input interface schema: key0:data_type; ... keyN:data_type; inRec:*;
Output interface schema (output data set): outRec:* with lookup fields missing from the input set concatenated
Output interface schema (reject data set): rejectRec:*
Transfer behavior: Transfers all fields from input to output
Execution mode: Sequential or parallel (default)
Partitioning method: Not applicable
Collection method: Not applicable
Preserve-partitioning flag in output data set: Propagate
Composite operator: No
[-close close_command [-close_type stmt_type] [-close_timeout timeout]] [-open open_command [-open_type stmt_type] [-open_timeout timeout]] [-db_cs character_set [-use_strings]] [-data_password table_password] [-eda_params name=value, ...]
You must specify either the -query or the -table option. The iwaylookup operator is a parallel operator by default. The options are as follows:
Table 236. iwaylookup operator options and its values
-server server_name
Specify the iWay server to be used for all data source connections. If this is not specified, the default server is used.
-user user_name
Specify the user name used to connect to the iWay server. This is not required if the iWay server has security mode off.
-password password
Specify the password used to connect to the iWay server. This is not required if the iWay server has security mode off.
-query sql_query
Specify an SQL query to read from one or more tables. The -query option is mutually exclusive with the -table option. It has one suboption:
v -query_type stmt_type. Optionally specify the type of statement. Can be one of: SQL - an SQL query; StoredProc - indicates that the statement specified is a FOCUS procedure; CMD - a command. The default is SQL.
-table table_name
Specify the table to read. May be fully qualified. The -table option is mutually exclusive with the -query option. This option has two suboptions:
v -filter where_predicate. Optionally specify the rows of the table to exclude from the read operation. This predicate will be appended to the where clause of the SQL statement to be executed.
v -selectlist select_predicate. Optionally specify the list of column names that will appear in the select clause of the SQL statement to be executed.
Table 236. iwaylookup operator options and its values (continued)
-IfNotFound fail | drop | reject | continue
Specify an action to take if a lookup fails. This is one of the following:
v fail. Stop job execution.
v drop. Drop the failed record from the output data set.
v reject. Put records that are not found in the lookup table into a reject data set. (You must designate an output data set for this option.)
v continue. Leave all records in the output data set (that is, perform an outer join).
-timeout timeout
Optionally specify a timeout value (in seconds) for the statement specified with a -table or -query option. The default value is 0, which causes InfoSphere DataStage to wait indefinitely for the statement to execute.
-open open_command
Optionally specify an SQL statement to be executed before the read is carried out. The statements are executed only once on the conductor node. This option has two suboptions:
v -open_type stmt_type. Optionally specify the type of argument being supplied with -open. Can be one of: SQL - an SQL query; StoredProc - indicates that the statement specified is a FOCUS procedure; CMD - a command. The default is SQL.
v -open_timeout timeout. Optionally specify the timeout value (in seconds) for the statement specified with the -open option. The default is 0, which means InfoSphere DataStage will wait indefinitely for the statement to execute.
-close close_command
Optionally specify an SQL statement to be executed after the read is carried out. You cannot commit work using this option. The statements are executed only once on the conductor node. This option has two suboptions:
v -close_type stmt_type. Optionally specify the type of argument being supplied with -close. Can be one of: SQL - an SQL query; StoredProc - indicates that the statement specified is a FOCUS procedure; CMD - a command. The default is SQL.
v -close_timeout timeout. Optionally specify the timeout value (in seconds) for the statement specified with the -close option. The default is 0, which means InfoSphere DataStage will wait indefinitely for the statement to execute.
Table 236. iwaylookup operator options and its values (continued)
-db_cs code_page
Optionally specify the ICU code page that represents the character set used by the database you are accessing through iWay. The default is ISO-8859-1. This option has a single suboption:
v -use_strings. Set this to have InfoSphere DataStage generate strings instead of ustrings in the schema.
-transfer_adapter transfer_string
Optionally specify new column names for the target data set (generally used when the table and the data set to be renamed have the same column names).
-data_password data_password
Optionally specify a table-level password for accessing the required table.
-eda_params name=value,... name=value
Optionally specify values for the iWay environment variables as a list of semi-colon separated name=value pairs.
nzload method
You can use this load method if the data in the source database is consistent; that is, it implements a single character set for the entire database. Also, the input schema for nzload must be the same as that of the target table in the Netezza Performance Server database. The prerequisite for using the nzload method is that the nzclient utilities and ODBC functionality must be installed on the same computer as the IBM InfoSphere Information Server engine. The nzload utility is provided by Netezza to load data into Netezza Performance Server. The data retrieved from the source database is fed to the nzload utility, which loads it into the destination database on the Netezza Performance Server.
Write modes
You can specify a write mode to determine how the records of the DataStage data set are inserted into the destination table in the Netezza Performance Server database.
append
Appends new rows to the specified table. To use this mode, you must have TABLE CREATE and INSERT privileges on the database that is being written to. Also, the table must exist and the record schema of the data set must be compatible with the table. This mode is the default mode.
create
Creates a new table in the database. To use this mode, you must have TABLE CREATE privileges. If a table already exists with the same name as the one that you want to create, the step that contains this mode ends in error. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or to create a customized table, you must use the -createstmt option with your own create table statement.
replace
Drops the existing table and creates a new one in its place. To use this mode, you must have TABLE CREATE and TABLE DELETE privileges. If another table exists with the same name as the one that you want to create, the existing table is overwritten.
truncate
Retains all the attributes of a table (including its schema), but discards the existing records and appends new records into the table. To use this mode, you must have DELETE and INSERT privileges on that table.
Error logs
You can view the error logs to identify errors that occur during any database operations, and view information about the success or failure of these operations. By default, the log files are created under /tmp in the root directory. While writing data to the Netezza Performance Server by using the nzload method, log files such as the nzlog and nzbad files are created in the /tmp directory on the client computer. The nzlog file is appended for every nzload command that loads the same table into the same database. The nzbad file contains the records that caused the errors. The system overwrites this file each time you invoke the nzload command for the same table and database name. The following names are used for the log files:
v /tmp/database name.table name.nzlog
v /tmp/database name.table name.nzbad
While writing data to the Netezza Performance Server by using the External Table method, the log files are created in the /tmp directory in the Netezza Performance Server. The following names are used for the log files:
v /tmp/external table name.log
v /tmp/external table name.bad
-delim <delimiter character>
Specify the delimiter character. The default delimiter is the @ (at sign) character. You can use any ASCII character except blank and hyphen (-). The hyphen is the default delimiter for the date/time/timestamp parameter.
-open <open_command>
Specify an SQL statement to run before the insert array is processed. The statements are run only once on the conductor node.
-close <close_command>
Specify an SQL statement to run after the insert array is processed. You cannot commit work by using this option. The statements are executed only once on the conductor node.
-createstmt <create_statement>
Specify the CREATE SQL statement for creating the table when -mode create is specified. Use this command when you want to override the default write statements of the Netezza write operator.
-truncate
If you set this option, column names are truncated to the maximum size allowed by the ODBC driver.
-truncatelen <n>
You can specify the length of the column names to which they should be truncated when the data is written into the destination table. This option can be used only with the -truncate option.
-drop
If you set this option, the Netezza write operator drops unmatched input fields of the DataStage data set. An unmatched field is a field for which there is no identically named field in the destination table.
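As a rough sketch only (the operator name nzwrite and the -mode flag are assumptions not documented in this excerpt, and the connection options are elided), a load that uses some of the options above might be invoked like this:
$ osh "... op1 | nzwrite ... -mode append -delim '|' -drop"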
Procedure
1. The DataDirect drivers must be installed in the directory $dshome/../branded_odbc. The shared library path is modified to include $dshome/../branded_odbc/lib. The Classic Federation environment variable will be set to $dshome/.odbc.ini.
2. Start the external data source.
3. Add $APT_ORCHHOME/branded_odbc to your PATH and $APT_ORCHHOME/branded_odbc/lib to your LIBPATH, LD_LIBRARY_PATH, or SHLIB_PATH. The Classic Federation environment variable must be set to the full path of the odbc.ini file. The environment variable is operating system specific.
4. Access the external data source by using a valid user name and password.
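For example, on a Linux engine tier the environment for step 3 might be set as in the following sketch; LD_LIBRARY_PATH applies to Linux only (use LIBPATH on AIX or SHLIB_PATH on HP-UX), and ODBCINI is shown as one common name for the odbc.ini variable, which can differ by driver and platform:
export PATH=$APT_ORCHHOME/branded_odbc:$PATH
export LD_LIBRARY_PATH=$APT_ORCHHOME/branded_odbc/lib:$LD_LIBRARY_PATH
export ODBCINI=$dshome/.odbc.ini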
v Schema, table, and index names
v User names and passwords
v Column names
v Table and column aliases
v SQL*Net service names
The classicfedread operator reads records from a federated table by using an SQL query. These records are read into an InfoSphere DataStage output data set. The operator can perform both parallel and sequential database reads. Here are the chief characteristics of the classicfedread operator:
v It translates the query result set (a two-dimensional array) row by row to an InfoSphere DataStage data set.
v The output is a DataStage data set that can be used as input to a subsequent InfoSphere DataStage operator.
v The size of external data source rows can be greater than that of InfoSphere DataStage records.
v It either reads an external data source table or directs InfoSphere DataStage to perform an SQL query.
v It optionally specifies commands to be run before the read operation is performed and after it completes.
v It can perform a join operation between a DataStage data set and an external data source. There can be one or more tables.
classicfedread: properties
The read operations are performed according to the classicfedread operator properties listed in the following table.
Table 238. ClassicFedRead operator properties
Number of input data sets: 0
Number of output data sets: 1
Input interface schema: None
Output interface schema: Determined by the SQL query
Transfer behavior: None
Execution mode: Parallel and sequential
Partitioning method: MOD based
Collection method: Not applicable
Preserve-partitioning flag in the output data set: Clear
Composite stage: No
-query sql_query | -table table_name [-filter where_predicate] [-list select_predicate] [-username user_name]
You must specify either the -query or the -table parameter, and you must specify the -datasource parameter. The -user and -password options are optional.
-close Optionally specify an SQL statement to be run after the insert array is processed. You cannot commit work using this option. The statements are run only once on the conductor node.
-datasource Specify the data source for all database connections. This option is required.
-db_cs Optionally specify the ICU code page that represents the database character set. The default is ISO-8859-1.
-fetcharraysize Specify the number of rows to retrieve during each fetch operation. The default number of rows is 1.
-isolation_level Optionally specify the transaction level for accessing data. The default transaction level is decided by the database or possibly specified in the data source.
-open Optionally specify an SQL statement to be run before the insert array is processed. The statements are run only once on the conductor node.
-partitioncol Run in parallel mode. The default execution mode is sequential. Specify the key column; the data type of this column must be integer. The entry in this column should be unique. The column should preferably have the Monotonically Increasing Order property. If you use the -partitioncol option, then you must follow the sample orchestrate shell scripts below to specify your -query and -table options.
-password Specify the password used to connect to the data source. This option might or might not be required depending on the data source.
-query Specify an SQL query to read from one or more tables. The -query and the -table option are mutually exclusive.
v Sample OSH for -query option:
classicfedread -data_source SQLServer -user sa -password asas -query 'select * from SampleTable where Col1 = 2 and %orchmodColumn%' -partitionCol Col1 >| OutDataSet.ds
-table Specify the table to be read from. It might be fully qualified. This option has two suboptions: v -filter where_predicate: Optionally specify the rows of the table to exclude from the read operation. This predicate is appended to the WHERE clause of the SQL statement to be run. v -list select_predicate: Optionally specify the list of column names that appear in the SELECT clause of the SQL statement to be run. This option and the -query option are mutually exclusive. v Sample OSH for -table option:
classicfedread -data_source SQLServer -user sa -password asas -table SampleTable -partitioncol Col1 >| OutDataset.ds
-user Specify the user name for connections to the data source. This option might be required depending on the data source.
-use_strings Generate strings instead of ustrings in the InfoSphere DataStage schema.
Table 239. Conversion of federation data types to OSH data types (continued)
Federated data type: DataStage data type
SQL_TYPE_DATE: DATE
SQL_TYPE_TIME: TIME[p]
SQL_TYPE_TIMESTAMP: TIMESTAMP[p]
SQL_GUID: STRING[36]
Statements to narrow the read operation:
-list Specifies the columns of the table to be read. By default, DataStage reads all columns.
-filter Specifies the rows of the table to exclude from the read operation. By default, DataStage reads all rows.
You can optionally specify the -open and -close options. The open commands are run on the external data source before the table is opened, and the close commands are run on the external data source after it is closed.
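For instance, a narrowed read might look like the following sketch; the data source, table, and column names are hypothetical, and the option spellings follow the option list above:
classicfedread -datasource SQLServer -user sa -password secret -table SampleTable -filter 'Col1 > 100' -list 'Col1, Col2' >| subset.ds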
Below are the chief characteristics of the classicfedwrite operator:
v Translation includes the conversion of InfoSphere DataStage data types to external data source data types.
v The classicfedwrite operator appends records to an existing table, unless you specify the create, replace, or truncate write mode.
v When you write to an existing table, the input data set schema must be compatible with the table schema.
v Each instance of a parallel write operator that runs on a processing node writes its partition of the data set to the external data source table. You can optionally specify external data source commands to be parsed and run on all processing nodes before the write operation runs or after it completes.
The default execution mode of the classicfedwrite operator is parallel. By default, the number of processing nodes is based on the configuration file. To run sequentially, specify the [seq] argument. You can optionally set the resource pool or a single node on which the operator runs by using the configuration file.
[-mode create | replace | append | truncate] [-createstmt statement] -table table_name [-transactionLevels read_uncommitted | read_committed | repeatable read | serializable] [-truncate] [-truncateLength n] [-username user_name] [-useNchar]
-arraysize Optionally specify the size of the insert array. The default size is 2000 records.
-close Optionally specify an SQL statement to be run after the insert array is processed. You cannot commit work by using this option. The statements are run only once on the conductor node.
-datasourcename Specify the data source for all database connections. This option is required.
-drop Specify that unmatched fields in the InfoSphere DataStage data set are dropped. An unmatched field is a field for which there is no identically named field in the data source table.
-db_cs Optionally specify the ICU code page that represents the database character set in use. The default is ISO-8859-1.
-open Optionally specify an SQL statement to be run before the insert array is processed. The statements are run only once on the conductor node.
-password Specify the password for connections to the data source. This option might not be required depending on the data source.
-rowCommitInterval Optionally specify the number of records to be committed before starting a new transaction. This option can be specified only if arraysize = 1. Otherwise, set the -rowCommitInterval parameter equal to the arraysize parameter.
-mode Specify the write mode as one of these modes:
append: This is the default mode. The table must exist, and the record schema of the data set must be compatible with the table. The classicfedwrite operator appends new rows to the table. The schema of the existing table determines the input interface of the operator.
create: The classicfedwrite operator creates a new table. If a table exists with the same name as the one being created, the step that contains the operator terminates with an error. The schema of the InfoSphere DataStage data set determines the schema of the new table. The table is created with simple default properties. To create a table that is partitioned, indexed, in a non-default table space, or in some other non-standard way, you can use the -createstmt option with your own CREATE TABLE statement.
replace: The operator drops the existing table and creates a new one in its place. If a table exists with the same name as the one you want to create, the existing table is overwritten. The schema of the InfoSphere DataStage data set determines the schema of the new table.
truncate: The operator retains the table attributes but discards existing records and appends new ones. The schema of the existing table determines the input interface of the operator.
Each mode requires the specific user privileges shown in the table below. If a previous write operation failed, you can try again. Specify the replace write mode to delete any information in the output table that might exist from the previous attempt to run your program.
Table 240. External data source privileges for external data source writes
Append: INSERT on existing tables
Create: TABLE CREATE
Replace: DROP, INSERT and TABLE CREATE on existing tables
Truncate: DELETE, INSERT on existing tables
-createstmt Optionally specify the CREATE TABLE statement to be used for creating the table when the -mode create parameter is specified.
-table Specify the table to write to. The table name can be fully qualified.
-transactionLevels Optionally specify the transaction level to access data. The default transaction level is decided by the database or possibly specified in the data source.
-truncate Specify that column names are truncated to the maximum size that is allowed by the ODBC driver.
-truncateLength Specify the length to truncate column names.
-username Specify the user name used for connections to the data source. This option might be required depending on the data source.
-useNchar Read all nchar and nvarchar data types from the database.
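Putting these options together, a write invocation might look like the following sketch; the data source, credentials, and table name are hypothetical:
$ osh "... op1 | classicfedwrite -datasourcename dsn1 -username user1 -password secret -table target_tbl -mode replace -drop"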
classicfedwrite: properties
The write operations are performed according to the classicfedwrite operator properties. The table below lists the properties of the classicfedwrite operator and their corresponding values.
Table 241. classicfedwrite operator properties
Number of input data sets: 1
Number of output data sets: 0
Input interface schema: Derived from the input data set
Output interface schema: None
Transfer behavior: None
Execution mode: Parallel by default or sequential
Partitioning method: Not applicable
Collection method: Any
Preserve-partitioning flag: The default is clear
Composite stage: No
v External data source column names are limited to 30 characters. If an InfoSphere DataStage field name is longer, you can do one of the following:
- Choose the -truncate or -truncateLength options to configure classicfedwrite to truncate InfoSphere DataStage field names to the maximum length of the data source column name. If you choose -truncateLength, you can specify the length to which field names are truncated. The number must be less than the maximum length that the data source supports.
- Use the modify operator to modify the field name. The modify operator is an external operator that alters the record schema of the input data set.
Procedure
1. Specifying char and varchar data types 2. Specifying nchar and nvarchar2 column sizes
Specifying char and varchar data types Specify chars and varchars in bytes, with two bytes for each character. The following example specifies 10 characters: create table orch_data(col_a varchar(20));
Specifying nchar and nvarchar2 column sizes Specify nchar and nvarchar2 columns in characters. The example below specifies 10 characters: create table orch_data(col_a nvarchar2(10));
v When an insert statement is included and host array processing is specified, an InfoSphere DataStage update field must also be an InfoSphere DataStage insert field.
v The classicfedupsert operator converts all values to strings before the operation passes them to the external data source. The following OSH data types are supported:
- int8, uint8, int16, uint16, int32, uint32, int64, and uint64
- dfloat and sfloat
- decimal
- strings of fixed and variable length
- timestamp
- date
v By default, classicfedupsert does not produce an output data set. Using the -reject option, you can specify an optional output data set containing the records that fail to be inserted or updated. Its syntax is:
-reject filename
classicfedupsert: Properties
Table 243. classicfedupsert Properties and Values
Number of input data sets: 1
Number of output data sets by default: None; 1 when you select the -reject option
Input interface schema: Derived from your insert and update statements
Transfer behavior: Rejected update records are transferred to an output data set when you select the -reject option
Execution mode: Parallel by default, or sequential
Partitioning method: Same. You can override this partitioning method; however, a partitioning method of entire cannot be used. The partitioning method can be overridden using the GUI.
Collection method: Any
Combinable stage: Yes
Exactly one occurrence of the -update option is required. All others are optional. Specify an International components for unicode character set to map between external data source char and varchar data and InfoSphere DataStage ustring data, and to map SQL statements for output to an external data source. The default character set is UTF-8, which is compatible with the orchestrate shell jobs that contain 7-bit US-ASCII data. -close Optionally specify an SQL statement to be run after the insert array is processed. You cannot commit work using this option. The statements are run only once on the conductor node. -datasourcename Specify the data source for all database connections. This option is required. -db_cs Optionally specify the ICU code page that represents the database character set in use. The default is ISO-8859-1.
-delete Optionally specify the delete statement to be run.
-insert Optionally specify the insert statement to be run.
-update Optionally specify the update or delete statement to be run.
-insertArraySize Optionally specify the size of the insert and update array. The default size is 2000 records.
-mode Specify the upsert mode when two statement options are specified. If only one statement option is specified, the upsert mode is ignored.
insert_update: The insert statement is run first. If the insert fails due to a duplicate key violation (that is, if the record being inserted exists), the update statement is run. This is the default mode.
update_insert: The update statement is run first. If the update fails because the record does not exist, the insert statement is run.
delete_insert: The delete statement is run first; then the insert statement is run.
-open Optionally specify an SQL statement to be run before the insert array is processed. The statements are run only once on the conductor node.
-password Specify the password for connections to the data source. This option might not be required depending on the data source.
-reject Specify that records that fail to be updated or inserted are written to a reject data set. You must designate an output data set for this purpose. If this option is not specified, an error is generated if records fail to update or insert.
-rowCommitInterval Optionally specify the number of records to be committed before starting a new transaction. This option can be specified only if arraysize = 1. Otherwise set the rowCommitInterval parameter to equal the insertArraySize.
-username Specify the user name used to connect to the data source. This option might or might not be required depending on the data source.
Table 244. Original data in the federated table before classicfedupsert runs
acct_id / acct_balance:
073587 / 45.64
873092 / 2001.89
675066 / 3523.62
566678 / 89.72
Table 245. Contents of the input file and their insert or update actions
acct_id (primary key) / acct_balance / classicfedupsert action:
073587 / 45.64 / Update
873092 / 2001.89 / Insert
675066 / 3523.62 / Update
566678 / 89.72 / Insert
Table 246. Example data in the federated table after classicfedupsert runs
acct_id / acct_balance:
073587 / 82.56
873092 / 67.23
675066 / 3523.62
566678 / 2008.56
865544 / 8569.23
678888 / 7888.23
995666 / 75.72
This Orchestrate shell syntax is an example of how to specify insert and update operations:
osh "import -schema record(acct_id:string[6] acct_balance:dfloat;) -file input.txt | hash -key acct_id | tsort -key acct_id | classicfedupsert -datasourcename dsn -username apt -password test -insert insert into accounts values(ORCHESTRATE.acct_id, ORCHESTRATE.acct_balance) -update update accounts set acct_balance = ORCHESTRATE.acct_balance where acct_id = ORCHESTRATE.acct_id -reject /user/home/reject/reject.ds"
This operator is particularly useful for sparse lookups, where the InfoSphere DataStage data set that you are trying to match is much smaller than the external data source table. If you expect to match 90% of your data, using the classicfedread operator in conjunction with the lookup operator is probably more efficient. Because classicfedlookup can perform lookups against more than one external data source table, the operation is useful when you join multiple external data source tables in one query. The -query parameter corresponds to an SQL statement of this form:
select a,b,c from data.testtbl where Orchestrate.b = data.testtbl.c and Orchestrate.name = "Smith"
The classicfedlookup operator replaces each InfoSphere DataStage field name with a field value, submits the statement containing the value to the external data source, and outputs a combination of external data source and InfoSphere DataStage data. Alternatively, you can use the -key and -table options to specify one or more key fields and one or more external data source tables. The following orchestrate shell options specify two keys and a single table:
-key a -key b -table data.testtbl
The resulting InfoSphere DataStage output data set includes the InfoSphere DataStage records and the corresponding rows from each referenced external data source table. When an external data source table has a column name that is the same as an InfoSphere DataStage data set field name, the external data source column is renamed with the following syntax:
APT_integer_fieldname
An example is APT_0_lname. The integer component is incremented when duplicate names are encountered in additional tables. If the external data source table is not indexed on the lookup keys, the performance of this operator is likely to be poor.
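As a sketch (the data source, credentials, table, and key names are hypothetical), a lookup that uses the -table and -key form might be run like this:
$ osh "classicfedlookup -datasourcename dsn1 -user user1 -password secret -table data.testtbl -key lname -key fname -ifNotFound continue < input.ds > output.ds"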
classicfedlookup: properties
The lookup operations are performed according to the classicfedlookup operator properties. The table below contains the list of properties and the corresponding values for the classicfedlookup operator.
Table 247. classicfedlookup Operator Properties
Number of input data sets: 1
Number of output data sets: 1; 2 if you include the -ifNotFound reject option
Input interface schema: Determined by the query
Output interface schema: Determined by the SQL query
Transfer behavior: Transfers all fields from input to output
Execution mode: Sequential or parallel (default)
Partitioning method: Not applicable
803
Table 247. classicfedlookup Operator Properties (continued) Property Collection method Preserve-partitioning flag in the output data set Composite stage Value Not applicable Clear No
You must specify either the -query option or one or more -table options with one or more -key fields. The classicfedlookup operator is parallel by default. The options are as follows:

-close
    Optionally specify an SQL statement to be executed after the insert array is processed. You cannot commit work using this option. The statement is executed only once on the conductor node.
-datasourcename
    Specify the data source for all database connections. This option is required.
-db_cs
    Optionally specify the ICU code page which represents the database character set in use. The default is ISO-8859-1. This option has this suboption:
    -use_strings
        If this suboption is not set, strings, not ustrings, are generated in the InfoSphere DataStage schema.
-fetcharraysize
    Specify the number of rows to retrieve during each fetch operation. The default number is 1.
-ifNotFound
    Specify an action to be taken when a lookup fails.
    fail
        Stop job execution.
    drop
        Drop the failed record from the output data set.
    reject
        Put records that are not found into a reject data set. You must designate an output data set for this option.
    continue
        Leave all records in the output data set (outer join).
-open
    Optionally specify an SQL statement to be run before the insert array is processed. The statements are run only once on the conductor node.
-password
    Specify the password for connections to the data source. This option might be required depending on the data source.
-query
    Specify a lookup query to be run. This option and the -table option are mutually exclusive.
-table
    Specify a table and key fields to generate a lookup query. This option and the -query option are mutually exclusive.
-filter where_predicate
    Specify the rows of the table to exclude from the read operation. This predicate is appended to the WHERE clause of the SQL statement.
-selectlist select_predicate
    Specify the list of column names that are in the SELECT clause of the SQL statement.
-key field
    Specify a lookup key. A lookup key is a field in the table that joins with a field of the same name in the InfoSphere DataStage data set. The -key option can be used more than once to specify more than one key field.
-user
    Specify the user name for connections to the data source. This option might not be required depending on the data source.
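For instance, the -table and -key options might be combined with -ifNotFound in a command like the following sketch, which is not taken from the product examples; the data source name (dsn), credentials, federated table name (target), and data set names are assumptions. A lookup of this kind on the lname, fname, and DOB key fields would produce output of the sort described below.

# Hypothetical -table/-key form: looks up the lname, fname, and DOB key fields
# against an assumed federated table named target; unmatched records are dropped.
osh "classicfedlookup -datasourcename dsn -user apt -password test
       -table target -key lname -key fname -key DOB
       -ifNotFound drop
     < indata.ds > outdata.ds"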
The output contains:
v lname, fname, and DOB column names and values from the InfoSphere DataStage input data set
v lname, fname, and DOB column names and the values from the federated table
If a column name in the external data source table has the same name as an InfoSphere DataStage output data set schema field name, the printed output shows the column in the external data source table renamed with this format:
APT_integer_fieldname
apt_util/archive.h
APT_Archive
APT_FileArchive
APT_MemoryArchive
apt_util/basicstring.h
APT_BasicString
APT_Date
apt_util/keygroup.h
APT_KeyGroup
apt_util/locator.h
APT_Locator
apt_util/persist.h
APT_Persistent
apt_util/proplist.h
APT_Property
APT_PropertyList
apt_util/random.h
APT_RandomNumberGenerator
apt_util/rtti.h
APT_TypeInfo
apt_util/ustring.h
APT_UString
apt_util/assert.h
APT_ASSERT()
APT_DETAIL_FATAL()
APT_DETAIL_FATAL_LONG()
APT_MSG_ASSERT()
APT_USER_REQUIRE()
APT_USER_REQUIRE_LONG()
apt_util/condition.h
CONST_CAST()
REINTERPRET_CAST()
apt_util/errlog.h
APT_APPEND_LOG()
APT_DUMP_LOG()
APT_PREPEND_LOG()
APT_DETAIL_LOGMSG()
APT_DETAIL_LOGMSG_LONG()
APT_DETAIL_LOGMSG_VERYLONG()
Product accessibility
You can get information about the accessibility status of IBM products. The IBM InfoSphere Information Server product modules and user interfaces are not fully accessible. The installation program installs the following product modules and components:
v IBM InfoSphere Business Glossary
v IBM InfoSphere Business Glossary Anywhere
v IBM InfoSphere DataStage
v IBM InfoSphere FastTrack
v IBM InfoSphere Information Analyzer
v IBM InfoSphere Information Services Director
v IBM InfoSphere Metadata Workbench
v IBM InfoSphere QualityStage
For information about the accessibility status of IBM products, see the IBM product accessibility information at http://www.ibm.com/able/product_accessibility/index.html.
Accessible documentation
Accessible documentation for InfoSphere Information Server products is provided in an information center. The information center presents the documentation in XHTML 1.0 format, which is viewable in most Web browsers. XHTML allows you to set display preferences in your browser. It also allows you to use screen readers and other assistive technologies to access the documentation.
Notices
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing IBM Corporation North Castle Drive Armonk, NY 10504-1785 U.S.A. For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to: Intellectual Property Licensing Legal and Intellectual Property Law IBM Japan Ltd. 1623-14, Shimotsuruma, Yamato-shi Kanagawa 242-8502 Japan The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Corporation J46A/G4 555 Bailey Avenue San Jose, CA 95141-1003 U.S.A. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs. Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.
If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at www.ibm.com/legal/ copytrade.shtml. The following terms are trademarks or registered trademarks of other companies: Adobe is a registered trademark of Adobe Systems Incorporated in the United States, and/or other countries. IT Infrastructure Library is a registered trademark of the Central Computer and Telecommunications Agency which is now part of the Office of Government Commerce. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. ITIL is a registered trademark, and a registered community trademark of the Office of Government Commerce, and is registered in the U.S. Patent and Trademark Office UNIX is a registered trademark of The Open Group in the United States and other countries. Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the United States, other countries, or both and is used under license therefrom. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. The United States Postal Service owns the following trademarks: CASS, CASS Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS and United States Postal Service. IBM Corporation is a non-exclusive DPV and LACSLink licensee of the United States Postal Service. Other company, product or service names may be trademarks or service marks of others.
Contacting IBM
You can contact IBM for customer support, software services, product information, and general information. You also can provide feedback to IBM about products and documentation. The following table lists resources for customer support, software services, training, and product and solutions information.
Table 248. IBM resources

Resource                      Description and location
IBM Support Portal            You can customize support information by choosing the products and the topics that interest you at www.ibm.com/support/entry/portal/Software/Information_Management/InfoSphere_Information_Server
Software services             You can find information about software, IT, and business consulting services, on the solutions site at www.ibm.com/businesssolutions/
My IBM                        You can manage links to IBM Web sites and information that meet your specific technical support needs by creating an account on the My IBM site at www.ibm.com/account/
Training and certification    You can learn about technical training and education services designed for individuals, companies, and public organizations to acquire, maintain, and optimize their IT skills at http://www.ibm.com/software/sw-training/
IBM representatives           You can contact an IBM representative to learn about solutions at www.ibm.com/connect/ibm/us/en/
Providing feedback
The following table describes how to provide feedback to IBM about products and product documentation.
Table 249. Providing feedback to IBM

Type of feedback          Action
Product feedback          You can provide general product feedback through the Consumability Survey at www.ibm.com/software/data/info/consumability-survey
Documentation feedback    To comment on the information center, click the Feedback link on the top right side of any topic in the information center. You can also send comments about PDF file books, the information center, or any other documentation in the following ways:
                          v Online reader comment form: www.ibm.com/software/data/rcf/
                          v E-mail: [email protected]
Index

A
aggtorec restructure operator 449 example with multiple key options 452 with toplevelkeys option 453 without toplevelkeys option 452 properties 450 syntax and options 450 toplevelkeys option 450 APT_BUFFER_DISK_ WRITE_INCREMENT 56, 62 APT_BUFFER_FREE_RUN 56 APT_BUFFER_MAXIMUM_TIMEOUT 56 APT_BUFFERING_POLICY 56 APT_CHECKPOINT_DIR 63 APT_CLOBBER_OUTPUT 63 APT_COLLATION_STRENGTH 70 APT_COMPILEOPT 58 APT_COMPILER 58 APT_CONFIG_FILE 63 APT_CONSISTENT_BUFFERIO_SIZE 62 APT_DATE_CENTURY_BREAK_YEAR 66 APT_DB2INSTANCE_HOME 58 APT_DB2READ_LOCK_TABLE 58 APT_DB2WriteOperator write mode APT_DbmsWriteInterface::eAppend 719 APT_DbmsWriteInterface::eReplace 719 APT_DbmsWriteInterface::eTruncate 719 APT_DbmsWriteInterface::eAppend 719 APT_DbmsWriteInterface::eReplace 719 APT_DbmsWriteInterface::eTruncate 719 APT_DBNAME 58 APT_DEBUG_OPERATOR 59 APT_DEBUG_PARTITION 59 APT_DEBUG_SIGNALS 60 APT_DEBUG_STEP 60 APT_DEFAULT_TRANSPORT_ BLOCK_SIZE 80 APT_DISABLE_COMBINATION 63 APT_DISABLE_ROOT_FORKJOIN 56 APT_DUMP_SCORE 74 APT_EBCDIC_VERSION 66 APT_EncodeOperator example encode 121 gzip and 121 APT_ERROR_CONFIGURATION 74 APT_EXECUTION_MODE 60, 63 APT_FILE_EXPORT_BUFFER_SIZE 73 APT_FILE_IMPORT_BUFFER_SIZE 73 APT_IMPEXP_CHARSET 71 APT_IMPORT_PATTERN_ USES_FILESET 73 APT_INPUT_CHARSET 71 APT_IO_MAP/APT_IO_NOMAP and APT_BUFFERIO_MAP/ APT_BUFFERIO_NOMAP 62 APT_IO_MAXIMUM_OUTSTANDING 69 APT_IOMGR_CONNECT_ATTEMPTS 69 APT_ISVALID_BACKCOMPAT 66 APT_LATENCY_COEFFICIENT 80 APT_LINKER 58 APT_LINKOPT 58 APT_MAX_TRANSPORT_BLOCK_SIZE/ APT_MIN_TRANSPORT_ BLOCK_SIZE 81 APT_MONITOR_SIZE 65 APT_MONITOR_TIME 65 APT_MSG_FILELINE 76 APT_NO_PART_INSERTION 72 APT_NO_PARTSORT_OPTIMIZATION 72 APT_NO_SAS_TRANSFORMS 78 APT_NO_SORT_INSERTION 79 APT_NO_STARTUP_SCRIPT 64 APT_OLD_BOUNDED_LENGTH 67 APT_OPERATOR_REGISTRY_PATH 67 APT_ORA_IGNORE_CONFIG_ FILE_PARALLELISM 72 APT_ORA_WRITE_FILES 72 APT_ORACLE_NO_OPS 71 APT_ORACLE_PRESERVE_BLANKS 72 APT_ORAUPSERT_COMMIT_ ROW_INTERVAL 72 APT_ORAUPSERT_COMMIT_ TIME_INTERVAL 72 APT_ORAUPSERT_COMMIT_ROW_INTERVAL 629 APT_ORCHHOME 64 APT_OS_CHARSET 71 APT_OUTPUT_CHARSET 71 APT_PARTITION_COUNT 72 APT_PARTITION_NUMBER 73 APT_PERFORMANCE_DATA 14, 65 APT_PM_CONDUCTOR_HOSTNAME 69 APT_PM_DBX 60 APT_PM_NO_NAMED_PIPES 67 APT_PM_NO_TCPIP 69 APT_PM_PLAYER_MEMORY 76 APT_PM_PLAYER_TIMING 76 APT_PM_STARTUP_PORT 70 APT_PM_XLDB 61 APT_PM_XTERM 61 APT_PREVIOUS_FINAL_ DELIMITER_COMPATIBLE 74 APT_PXDEBUGGER_FORCE_SEQUENTIAL 61 APT_RDBMS_COMMIT_ROWS 59 APT_RECORD_COUNTS 67, 76 APT_SAS_ACCEPT_ERROR 77 APT_SAS_CHARSET 77 APT_SAS_CHARSET_ABORT 77 APT_SAS_DEBUG 78 APT_SAS_DEBUG_IO 78 APT_SAS_DEBUG_LEVEL 78 APT_SAS_NO_PSDS_USTRING 78 APT_SAS_S_ARGUMENT 78 APT_SAS_SCHEMASOURCE_ DUMP 78 APT_SAVE_SCORE 67 APT_SHOW_COMPONENT_CALLS 68 APT_STACK_TRACE 68 APT_STARTUP_SCRIPT 64 APT_STARTUP_STATUS 64 APT_STRING_CHARSET 71 APT_STRING_PADCHAR 74 APT_SYBASE_NULL_AS_EMPTY 79 APT_SYBASE_PRESERVE_BLANKS 79 APT_TERA_64K_BUFFERS 79 APT_TERA_NO_ERR_CLEANUP 79 APT_TERA_NO_PERM_CHECKS 80 APT_TERA_NO_SQL_ CONVERSION 80 APT_TERA_SYNC_DATABASE 80
APT_TERA_SYNC_PASSWORD 80 APT_TERA_SYNC_USER 80 APT_THIN_SCORE 64 APT_TRANSFORM_COMPILE_OLD_NULL_HANDLING 68 APT_TRANSFORM_LOOP_WARNING_THRESHOLD 68 APT_USE_IPV4 70 APT_WRITE_DS_VERSION 68 asesybaselookup and sybaselookup operators example 749 properties 745 syntax 745 asesybaselookup and sybaselookup operatorsr 743 asesybasereade and sybasereade operators 722 action 723 examples reading a table and modifying a field name 728 properties 722 syntax 725 asesybaseupsert and sybaseupsert operators 739 action 739 example 742 properties 739 syntax 740 asesybasewrite and sybasewrite operators 729 action 729 examples creating a table 737 writing to a table using modify 738 writing to an existing table 736 properties 729 syntax 732 write modes 731 writing to a multibyte database 729 asesybasewrite operator options 733 associativity of operators 263
B
bottlenecks 23 build stage macros build stages 33 41
C
changeapply operator actions on change records 90 example 94 key comparison fields 91 schemas 90 transfer behavior 91 changecapture operator data-flow diagram 96 determining differences 96 example dropping output results 101 general 100 key and value fields 96 properties 96 syntax and options 97 transfer behavior 96 charts, performance analysis 15 checksum operator properties 103 syntax and options 103
Classic Federation interface operators accessing federated database 789 classicfedlookup 802 classicfedread 791 classicfedupsert 799 classicfedwrite 794 national language support 789 unicode character set 790 classicfedlookup operator options 804 classicfedread -query 794 column name conversion 793 data type conversion 793 classicfedread operator options 792 classicfedupsert operator action 799 classicfedwrite -drop 795 -truncate 798 char and varchar 799 data type conversion 798 classicfedwrite operator options 796 collecting key 441 collecting keys for sortmerge collecting 445 collection methods 442 collection operators roundrobin 442 sortmerge 444 collector ordered 441 compare operator data-flow diagram 104 example sequential execution 107 restrictions 105 results field 105 syntax and options 105 copy operator example 110 sequential execution 111 properties 108 syntax and options 109 CPU utilization 17 custom stages 33 resource estimation 20 customer support contacting 823
D
data set sorted RDBMS and 492, 506 DB2 configuring DataStage access 637 APT_DBNAME 638 DB2 required privileges 637 db2nodes.cfg system file 638 configuring Orchestrate access 676 db2lookup operator 672 environment variable DB2INSTANCE 638 joining DB2 tables and a DataStage data set
637
DB2 (continued) joining DB2 tables and an Orchestrate data set 672 operators that modify Orchestrate data types 651 Orchestrate field conventions 651 specifying DB2 table rows 644 specifying the SQL SELECT statement 644 using a node map 676 DB2 interface operators db2load 649 db2lookup 672 db2part 669 db2read 641 db2upsert 665 db2write 649 establishing a remote connection 638 handing # and $ in column names 638 national language support 640 running multiple operators in a single step 639 using the -padchar option 639 DB2 partitioning db2part operator 669 DB2DBDFT 59 DB2INSTANCE 638 db2load operator data-flow diagram 649 DB2 field conventions 651 default SQL INSERT statement 650 example appending data to a DB2 table. 661 DB2 table with an unmatched column 664 unmatched Orchestrate fields 663 writing data in truncate mode 662 matching fields 653 operators that modify Orchestrate data types for DB2 651 options 653 properties 649 requirements for using 661 specifying the select list 651 translation anomalies 676 write modes 652 db2lookup operator 672 properties 673 syntax and options 673 db2nodes command syntax 676 db2part operator 669 example 671 syntax and options 670 db2read operator data flow diagram 641 example reading a DB2 table sequentially with the query option 648 reading a DB2 table with the table option 647 reading a table in parallel with the query option 648 operator action 642 Orchestrate conversion DB2 column names 642 properties 641 specifying DB2 table rows 644 specifying the DB2 table name 643 specifying the SQL SELECT statement 644 syntax and options 645 translation anomaly 676 db2upsert operator operator action 666 partitioning 665 properties 665
db2write operator 649 data-flow diagram 649 DB2 field conventions 651 default SQL INSERT statement 650 example appending data to a DB2 table 661 DB2 table with an unmatched column 664 unmatched Orchestrate fields 663 writing data in truncate mode 662 matching fields 653 operators that modify Orchestrate data types for DB2 properties 649 specifying the select list 651 syntax and options 653 translation anomalies 676 write modes 652 diff operator data flow diagram 112 example dropping output results 117 typical 117 properties 112 syntax and options 114 disk space 17, 24 disk utilization 17 duplicates, removing 229 dynamic models 18 creating 17 examples 22, 23 overview 17
651
E
encode operator data flow diagram 119 encoded data sets and partitioning 120 encoding data sets 120 example 121 properties 119 encoded data sets and partitioning 120 encoding data sets 120 entire partitioner 415 data flow diagram 416 properties 417 syntax 417 using 416 environment variable APT_DBNAME 638 APT_ORAUPSERT_COMMIT_ROW_INTERVAL APT_PERFORMANCE_DATA 14, 65 DB2INSTANCE 638 environment variables APT_MONITOR_SIZE 65 example build stage 44 examples resource estimation bottlenecks 23 input projections 24 merging data 22 overview 21 sorting data 22 export utility operator data set versus export schema 355 example exporting to a single file 358 exporting to multiple files 359 export formats 350
629
export utility operator (continued) properties 349 specifying data destination 355 file sets 356 files and named pipes 356 list of destination programs 357 nodes and directories 357 specifying export schema 355 default exporting of all fields 355 defining properties 355 exporting all fields with overrides and new properties 355 exporting selected fields 355
F
field_export restructure operator data-flow diagram 454 example 456 properties 455 transfer behavior 454, 457 field_import restructure operator 457 data flow diagram 457 example 459 properties 458 syntax and options 458 transfer behavior 458 fields changing data type 177 data-type conversion errors 180 data-type conversions date field 180, 266 decimal field 186 raw field length extraction 188 string conversions and lookup tables 193, 280 string field 189 time field 196 timestamp field 199 default data-type conversion 178 duplicating with a new name 177 explicitly converting the data type 180 transfer behavior of the modify operator 176 filter operator 121 data flow diagram 121 example comparing two fields 126 evaluating input records 127 filtering wine catalog recipients 127 testing for a null 127 input data types 125 Job Monitor information 123 customizing messages 124 example messages 124 properties 122 supported boolean expressions 125 fullouterjoin operator 528 output data set 528 syntax and options 528 funnel operator syntax and options 131
general operators (continued) modify 174 generator operator data-flow diagram 134 defining record schema 138 example general 136 parallel execution 137 using with an input data set 137 generator options for DataStage data types properties 134 record schema date fields 140 decimal fields 141 null fields 144 numeric fields 139 raw fields 141 string fields 142 time fields 143 timestamp fields 143 supported data types 136 syntax and options 134 usage details 135 gzip UNIX utility using with the encode operator 121
138
H
hash partitioner 417 data flow diagram 419 example 418 properties 420 syntax and option 420 using 419 head operator example extracting records from a large data set syntax and options 145 hplread INFORMIX interface operator 690 data flow diagram 692 properties 692 special features 690 syntax and options 692 hplwrite INFORMIX interface operator 694 data flow diagram 694 listing of examples 696 properties 694 special features 694 syntax and options 695
147
I
IBM format files 337 exporting data 350 ICU character sets 790 import utility operator import source file data types 337 properties 337 syntax and options 337 import/export utility introduction ASCII and EBCDIC conversion 330 ASCII to EBCDIC 334 EBCDIC to ASCII 330 table irregularities 335 error handling 329 failure 329
G
general operators filter 121 lookup 147 89
import/export utility introduction (continued) error handling (continued) warning 330 format of imported data 319 implicit import and export 326 default export schema 328 default import schema 328 no specified schema 326 overriding defaults 328 record schema comparison of partial and complete 322 defining complete schema 321 defining partial schema 323 export example 320 exporting with partial schema 324 import example 320 import/export utility properties field properties 362 date 366 decimal 366 nullable 367 numeric 363 raw 366 tagged subrecord 368 time 366 timestamp 366 ustring 366 vector 367 property description 371 ascii 372 big_endian 373 binary 373 c_format 374 check_intact 376 date_format 376 days_since 380 default 380 delim 382 delim_string 383 drop 384 ebcdic 385 export_ebcdic_as_ascii 385 fill 385 final_delim 386 final_delim_string 387 fix_zero 387 generate 388 import_ascii_as_ebcdic 388 in_format 389 intact 389 julian 390 link 390 little_endian 391 max_width 392 midnight_seconds 392 native_endian 393 nofix_zero 393 null_field 393 null_length 394 out_format 395 packed 396 padchar 397 position 398 precision 399 prefix 399 print_field 400 quote 401
import/export utility properties (continued) property description (continued) record_delim 402 record_delim_string 368, 402 record_format 403 record_length 403 record_prefix 403 reference 404 round 404 scale 405 separate 406 skip 406 tagcase 407 text 407 time_format 408 timestamp_format 411 vector_prefix 412 width 392, 413 zoned 413 record-level properties 362 setting properties 360 defining field-level properties 361 table of all record and field properties 368 INFORMIX accessing 679 configuring your environment 679 INFORMIX interface operators hplread 690 hplwrite 694 infxread 697 xpsread 702 xpswrite 704 Informix interface read operators operator action 679 INFORMIX interface read operators column name conversion 681 data type conversion 681 example 682 operator action 680 INFORMIX interface write operators column name conversion 684 data type conversion 684 dropping unmatched fields 685 example appending data 686 handling unmatched fields 688 table with unmatched column 689 writing in truncate mode 687 execution modes 684 limitations column name length 686 field types 686 write modes 685 writing fields 685 infxread INFORMIX interface operator 697 data flow diagram 697 properties 697 syntax and options 697 infxwrite INFORMIX interface operator 699 data flow diagram 699 listing of examples 701 properties 700 syntax and options 700 innerjoin operator 522 example 523 syntax and option 523
input projections example 24 making 20 overview 17 iWay accessing 775 iWay operators 775 iwayread 775 iwaylookup operator 779 iwayread operator 775 syntax 777
J
job performance analysis 15 data design-time recording 14 runtime recording 14 overview 14 join operators fullouterjoin 528 general characteristics input data set requirements 521 memory use 521 multiple keys and values 521 properties 520 transfer behavior 521 innerjoin 522 leftouterjoin 524 rightouterjoin 526 See also individual operator entries [join operators zzz] 519
L
leftouterjoin operator 524 example 525 syntax and option 524 legal notices 819 lookup operator 147 create-only mode 153 dropping a field with the transform operator example handling duplicate fields 156 interest rate lookup 155 handling error conditions 155 lookup with single table record 154 operator behavior 147 properties 149 renaming duplicate fields 156 table characteristics 154, 155, 156
156
M
makerangemap utility creating a range map 437 example commands 437 specifying sample size 437 syntax and options 435 using with the range partitioner 437 makesubrec restructure operator 461 data flow diagram 461 properties 461 subrecord length 462 syntax and options 462
makesubrec restructure operator (continued) transfer behavior 462 makevect restructure operator data flow diagram 464 example general 466 missing input fields 466 non-consecutive fields 465 properties 465 syntax and option 465 transfer behavior 465 merge operator data flow diagram 158 example application scenario 170 handling duplicate fields 168, 169, 170 merging operation 164 merging records 160 example diagram 164 master record and update record 161 multiple update data sets 161 missing records 172 handling bad master records 172 handling bad update records 172 properties 158 syntax and options 158 typical data flow 163 models resource estimation accuracy 18 creating 17 dynamic 18 static 18 modify operator 174 aggregate schema components 205 changing field data type 177 data-flow diagram 175 data-type conversion errors 180 data-type conversion table 207 data-type conversions date field 180, 266 decimal field 186 raw field length extraction 188 string conversions and lookup tables 193, 280 string field 189 time field 196 timestamp field 199 default data-type conversion 178 duplicating a field with a new name 177 explicitly converting the data type 180 null support 201 partial schemas 204 properties 175 transfer behavior 176 vectors 205 modulus partitioner data flow diagram 421 example 422 syntax and option 422
N
node maps for DB2 interface operators non-IBM Web sites links to 817 676
201
O
ODBC interface operators 531 accessing ODBC from DataStage 531 national language support 531 odbclookup 551 odbcread 532 odbcupsert 547 odbcwrite 538 odbclookup operator 551 data flow diagram 552 example 555 options 553 properties 553 syntax 553 odbcread operator 532 action 535 column name conversion 535 data flow diagram 533 examples reading a table and modifying a field name 537 options 534 syntax 533 odbcread operatordata type conversion 536 odbcupsert operator 547 action 548 data flow diagram 547 example 550 options 549 properties 548 syntax 549 odbcwrite operator 538 action 539 data flow diagram 539 examples creating a table 545 writing to existing table 544 writing using the modify operator 546 options 542 properties 539 syntax 542 writing to a multibyte database 539 operator associativity 263 writerangemap 316, 432 operators DB2 interface 677 general 89 join See join operators 519 Oracle interface 635 precedence 263 sort 517 Teradata interface 721 Oracle accessing from DataStage PATH and LIBPATH requirements 603 data type restrictions 615 indexed tables 615 joining Oracle tables and a DataStage data set 631 Orchestrate to Oracle data type conversion 616 performing joins between data sets 609 reading tables SELECT statement 608
Oracle (continued) reading tables (continued) specifying rows 608 SQL query restrictions 609 writing data sets to column naming conventions 615 dropping data set fields 617 Oracle interface operators handing # and $ in column names 604 national language support 531, 604 oralookup 631 oraread 605 oraupsert 626 orawrite 613 preserving blanks in fields 603 oralookup operator 631 example 635 properties 633 syntax and options 633 oraread operator 605 column name conversion 607 data flow diagram 605 data-type conversion 608 operator action 606 Oracle record size 608 properties 606 specifying an Oracle SELECT statement 609 specifying processing nodes 607 specifying the Oracle table 608 syntax and options 609 oraupsert operator 626 data-flow diagram 626 environment variables 629 APT_ORAUPSERT_COMMIT_ROW_INTERVAL example 631 operator action 627, 666 properties 627, 665 orawrite operator 613 column naming conventions 615 data flow diagram 613 data-type restrictions 615 example creating an Oracle table 625 using the modify operator 625 writing to an existing table 624 execution modes 615 matched and unmatched fields 617 operator action 614 Orchestrate to Oracle data type conversion 616 properties 614 required privileges 617 syntax and options 618 write modes 616 writing to indexed tables 615 ordered collection operator properties 442 OSH_BUILDOP_CODE 57 OSH_BUILDOP_HEADER 57 OSH_BUILDOP_OBJECT 57 OSH_BUILDOP_XLC_BIN 57 OSH_CBUILDOP_XLC_BIN 57 OSH_DUMP 76 OSH_ECHO 76 OSH_EXPLAIN 76 OSH_PRELOAD_LIBS 69 OSH_PRINT_SCHEMAS 76 OSH_STDOUT_MSG 69 Index
629
P
Parallel SAS data set format 561 partition sort RDBMS and 506 partitioners entire 415 hash 417 random 423 range 425 roundrobin 437 same 439 pcompress operator compressed data sets partitioning 210 using orchadmin 211 data-flow diagram 209 example 211 mode compress 209 expand 209 properties 209 syntax and options 209 UNIX compress facility 210 peek operator syntax and options 213 performance analysis charts 15 overview 14 recording data at design time 14 at run time 14 PFTP operator 215 data flow diagram 216 properties 216 restartability 222 pivot operator properties 223 syntax and options 223 precedence of operators 263 product accessibility accessibility 813 product documentation accessing 815 projections, input example 24 making 20 overview 17 promotesubrec restructure operator data-flow diagram 467 properties 468 syntax and option 468 properties classicfedlookup 803 classicfedupsert 800 classicfedwrite 797 Properties classicfedread 791 psort operator data flow diagram 507 performing a total sort 512 properties 508 specifying sorting keys 506 syntax and options 508
Q
QSM_DISABLE_DISTRIBUTE_COMPONENT 64
R
random partitioner 423 data flow diagram 424 properties 425 syntax 425 using 424 range map creating with the makerangemap utility 437 range partitioner 425 algorithm 426 data flow diagram 430 example 429 properties 430 specifying partitioning keys 426 syntax and options 430 using 429 range partitioning and the psort operator 514 remdup operator data-flow diagram 225 data-flow with hashing and sorting 227 effects of case-sensitive sorting 228 example case-insensitive string matching 230 removing all but the first duplicate 230 using two keys 230 example osh command with hashing and sorting properties 226 removing duplicate records 227 syntax and options 226 usage 229 retaining the first duplicate record 228 retaining the last duplicate record 228 using the last option 230 remove duplicates 229 reports, resource estimation 21 resource estimation custom stages 20 examples finding bottlenecks 23 input projections 24 merging data 22 overview 21 sorting data 22 models accuracy 18 creating 17 dynamic 18 static 18 overview 17 projections 20 reports 21 restructure operators aggtorec 449 field_import 457 makesubrec 461 splitsubrec 469 tagswitch 481 rightouterjoin operator 526 example 527 syntax and option 526 roundrobin collection operator 442
227
roundrobin collection operator (continued) syntax 444 roundrobin partitioner 437 data flow diagram 438 properties 439 syntax 439 using 438
S
same partitioner 439 data flow diagram 440 properties 440 syntax 440 sample operator data-flow diagram 231 example 232 properties 231 syntax and options 231 SAS data set format 562 SAS DATA steps executing in parallel 567 SAS interface library configuring your system 559 converting between DataStage and SAS data types 564 DataStage example 566 example data flow 560 executing DATA steps in parallel 567 executing PROC steps in parallel 573 getting input from a DataStage or SAS data set 563 getting input from a SAS data set 562 overview 557 Parallel SAS data set format 561 parallelizing SAS code rules of thumb for parallelizing 577 SAS programs that benefit from parallelization 577 pipeline parallelism and SAS 559 SAS data set format 562 sequential SAS data set format 561 using SAS on sequential and parallel systems 557 writing SAS programs 557 SAS interface operators controlling ustring truncation 583 determining SAS mode 581 environment variables 585 generating a Proc Contents report 584 long name support 584 sas 591 sascontents 599 sasin 586 sasout 596 specifying an output schema 583 sas operator 591 data-flow diagram 591 properties 591 SAS operator syntax and options 592 SAS PROC steps executing in parallel 573 sascontents operator data flow diagram 599 data-flow diagram 599 properties 600 syntax and options 600 sasin operator data flow diagram 587 data-flow diagram 587
sasin operator (continued) properties 587 syntax and options 587 sasout operator 596 data flow diagram 597 properties 597 syntax and options 597 scratch space 17, 24 sequence operator example 234 properties 234 syntax and options 234 sequential SAS data set format 561 software services contacting 823 sort operators tsort 489 sortfunnel operator input requirements 130 setting primary and secondary keys 131 syntax and options 131 sortmerge collection operator 444 collecting keys 445 ascending order 447 case-sensitive 447 data types 446 primary 446 secondary 446 data flow diagram 445 properties 447 syntax and options 447 splitsubrec restructure operator 469 data-flow diagram 469 example 470 properties 469 syntax and option 470 splitvect restructure operator data-flow diagram 471 example 472 properties 471 syntax and option 472 SQL Server interface operators 751 accessing SQL Server from DataStage 751 national language support 751, 775 sqlsrvrread 752 sqlsrvrupsert 766 sqlsrvrwrite 758 SQL Server interface options sqlsrvrlookup 770 sqlsrvrlookup operator 770 data flow diagram 771, 781 example 773, 784 options 772 properties 771 syntax 772 sqlsrvrread operator 752 action 752, 776 column name conversion 753 data flow diagram 752, 776 data type conversion 753, 776 example 757 options 755, 777, 782 properties 752, 776 syntax 755 sqlsrvrupsert operator 766 action 766 data flow diagram 766 Index
sqlsrvrupsert operator (continued) example 769 options 768 properties 766 syntax 767 sqlsrvrwrite operator 758 action 759 data flow diagram 758 examples creating a table 764 writing to a table using modify 765 writing to an existing table 763 options 761 properties 758 syntax 761 write modes 760 static models 18 creating 17 overview 17 support customer 823 switch operator data flow diagram 235 discard value 236 example 239 Job Monitor information 241 example messages 241 properties 237 syntax and options 237 Sybase environment variables 79 Sybase interface operators 721 accessing Sybase from DataStage 721 asesybaselookup and sybaselookup 743 asesybasereade and sybasereade 722 asesybaseupsert and sybaseupsert 739 asesybasewrite and sybasewrite 729 national language support 722 sybaselookup operator options 747 sybasereade operator column name conversion 723 data type conversion 724 options 726 sybaseupsert operator options 741 sybasewrite operator data type conversion 731 syntax classicfedlookup 804 classicfedread 791 classicfedupsert 800 classicfedwrite 795
T
tagbatch restructure operator added, missing, and duplicate fields data-flow diagram 474 example with missing and duplicate cases with multiple keys 480 with simple flattening of tag cases input data set requirements 476 operator action and transfer behavior properties 475 syntax and options 476 tagged fields and operator limitations 475
473
tagswitch restructure operator 481 case option 482 data flow diagram 481 example with default behavior 484 with one case chosen 485 input and output interface schemas 482 properties 482 syntax and options 483 using 482 tail operator example default behavior 243 using the nrecs and part options 243 properties 242 Teradata writing data sets to data type restrictions 717 dropping data set fields 716 write mode 716 Teradata interface operators terawrite operator 713 teraread operator column name and data type conversion 710 data-flow diagram 709 national language support 707 properties 709 restrictions 711 specifying the query 709 syntax and options 711 terawrite operator 713 column name and data type conversion 714 correcting load errors 715 data-flow diagram 714 limitations 716 national language support 707 properties 714 restrictions 717 syntax and options 717 write mode append 716 create 716 replace 716 truncate 716 writing fields 716 trademarks list of 819 transform operator example applications 294 lookup-table functions 266 specifying lookup tables 257 syntax and options 245 Transformation Language built-in functions bit-manipulation functions 291 data conversion and data mapping 265 mathematical functions 278 miscellaneous functions 292 null handling functions 278 string functions 281 time and timestamp transformations 273 ustring functions 286 conditional branching 264 if ... else 264 data types and record fields 259 differences with the C language 293 casting 294
Transformation Language (continued) differences with the C language (continued) flow control 293 labeled statements 294 local variables 293 names and keywords 293 operators 293 pre-processor commands 294 expressions 261 language elements 261 operators 261 general structure 255 local variables 260 names and keywords 255 specifying datasets complex field types 260 simple fieldtypes 259 tsort operator configuring 491 data flow diagram 493 example sequential execution 497 performing a total sort 499 properties 493 sorting keys case sensitivity 493 restrictions 492 sort order 493 specifying sorting keys 492 syntax and options 494 using a sorted data set 491 using a sorted data set with a sequential operator
491
U
UNIX compress utility invoked from the pcompress operator 210
W
Web sites non-IBM 817 wrapped stages 33 writerangemap operator 316, 432 data-flow diagram 317, 432 properties 317, 433
X
xpsread INFORMIX interface operator 702 data-flow diagram 702 properties 702 syntax and options 702 xpswrite INFORMIX interface operator 704 data-flow diagram 704 listing of examples 706 properties 705 syntax and options 705
Printed in USA
SC19-3458-01