Assign Key
Assign Keys assigns a value to a surrogate key field in each record on the in port,
based on the value of a natural key field in that record, and then sends the record
to one or two of three output ports.
Gather Logs collects the output from the log ports of components for analysis of a
graph after execution.
Leading Records copies data records from input to output, stopping after the
given number of records.
Redefine Format copies data records from its input to its output without changing
the values in the data records. You can use Redefine Format to change or rename
fields in a record format without changing the values in the records.
Replicate arbitrarily combines all the data records it receives into a single flow
and writes a copy of that flow to each of its output flows.
Throttle copies data records from its input to its output, limiting the rate at which
records are processed.
Transitive Closure Recirculate and Compute Closure are the two halves of the
transitive closure macro. These components calculate the complete set of direct
and derived relationships among a set of input key-pairs.
Trash ends a flow by accepting all the data records in it and discarding them.
1. Assign Keys
Assign Keys reads input records on the in port and checks them against input
records on the key port. For each record on the in port, Assign Keys assigns a value to a
surrogate key field. The assigned value is based on the value of the natural key field in
the same input record. For example, based on the value of the customer_name natural
key field, Assign Keys can assign a value to the customer_id surrogate key field. Assign
Keys then sends the record to one or two of three output ports:
The first output port receives a record for each new surrogate key. You can use
this information to update the information source for the key port.
The new output port receives a record for each input record for which a new
surrogate key was generated and assigned.
The old output port receives a record for each input record to which an existing
surrogate key was assigned.
Parameters
natural_key (key specifier)
An existing field or set of fields in the records on the in port, used as the
natural key (see About Natural and Surrogate Keys) for the records on the in port,
and, if you do not set the override_natural_key parameter, for the records on the
key port.
surrogate_key (key specifier)
A key specifier consisting of the name of one field in the records on the in
port, and no modifiers.
Assign Keys uses this field as the surrogate key for the records on the in port,
and, if you do not set the override_surrogate_key parameter, for the records on
the key port. The specified field must be a decimal type with scale 0, or an integer
type.
override_surrogate_key (key specifier, optional)
A key specifier consisting of the name of one field in the records on the key port,
and no modifiers. If you specify a value for this parameter, Assign Keys uses it as
the surrogate key for the records on the key port, instead of using the value of the
surrogate_key parameter. The specified field must be a decimal type with scale
0, or an integer type.
override_natural_key (key specifier, optional)
If you specify a value for this parameter, Assign Keys uses it as the natural key
(see About Natural and Surrogate Keys) for the records on the key port, instead of
using the value of the natural_key parameter.
few_keys (boolean, optional)
Set to True to improve performance when you expect that all records
entering the key port can fit within the number of bytes specified in the max_core
parameter (see About the few_keys Parameter of Assign Keys for more
information).
Default is True.
max_core
If the total size of the intermediate results Assign Keys holds in memory
exceeds the number of bytes specified in the max_core parameter, Assign Keys
writes temporary files to disk.
About Natural and Surrogate Keys
A key is a field or set of fields that uniquely identifies a record in a file or table.
A natural key is a key that is meaningful in some business or real-world sense. For
example, a social security number for a person, or a serial number for a piece of
equipment, is a natural key.
A surrogate key is a field that is added to a record, either to replace the natural key
or in addition to it, and has no business meaning. Surrogate keys are frequently
added to records when populating a data warehouse, to help isolate the records in
the warehouse from changes to the natural keys by outside processes.
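As a small illustration of that isolation, the sketch below uses hypothetical customers and orders tables (the table and field names are assumptions, not part of Assign Keys). Orders refer to the customer by surrogate key, so a change to the natural key does not break the link.

```python
# Hypothetical warehouse tables; customer_name is the natural key,
# customer_id is the surrogate key.
customers = {101: {"customer_id": 101, "customer_name": "Acme Ltd"}}
orders = [{"order_no": 1, "customer_id": 101}]


def customer_name_for(order):
    """Resolve an order to its customer through the surrogate key."""
    return customers[order["customer_id"]]["customer_name"]


# An outside process renames the customer (the natural key changes).
customers[101]["customer_name"] = "Acme Holdings"

# The order still resolves, because it never stored the natural key.
resolved = customer_name_for(orders[0])
```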
For each record on the in port, Assign Keys does one of the following:
1. If the natural key value of the input record matches the natural key value of
a record on the key port, Assign Keys:
o Assigns the surrogate key value of the record on the key port to the
surrogate key field of the input record
o Sends the input record to the old port
2. If the natural key value of the input record does not match the natural key value of
any record on the key port, and this is the first occurrence of this natural key
value, Assign Keys:
o Creates a new value for the surrogate key field of the input record
o Sends the input record to both the new and the first port
3. If the natural key value of the input record does not match the natural key value of
any record on the key port, and this is not the first occurrence of this natural key
value, Assign Keys:
o Assigns the surrogate key value from the record on the first port that has
the same natural key value as the input record (see 2 above) to the
surrogate key field of the input record
o Sends the input record to the new port, but not the first port
If the key flow contains a group of records with duplicate natural key values,
Assign Keys uses only the first record of each such group to supply a surrogate key
value; it silently ignores the other records in the group.
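The routing rules above can be sketched in Python. This is a minimal serial model and not the component's actual implementation. It assumes records are plain dictionaries, the function name assign_keys is hypothetical, and new key values follow the serial-layout rule (start just above the largest existing key, or at 1).

```python
def assign_keys(in_records, key_records, natural_key, surrogate_key):
    """Route input records to the first, new, and old outputs (serial sketch)."""
    existing = {}
    for rec in key_records:
        # Only the first key-port record per natural key value is used;
        # later duplicates are silently ignored.
        existing.setdefault(rec[natural_key], rec[surrogate_key])

    # First positive integer above the largest existing key, or 1 if none.
    next_key = max(max(existing.values(), default=0), 0) + 1
    created = {}
    first, new, old = [], [], []

    for rec in in_records:
        out = dict(rec)
        nk = rec[natural_key]
        if nk in existing:
            # Case 1: natural key found on the key port.
            out[surrogate_key] = existing[nk]
            old.append(out)
        elif nk not in created:
            # Case 2: first occurrence of an unmatched natural key.
            created[nk] = next_key
            next_key += 1
            out[surrogate_key] = created[nk]
            new.append(out)
            first.append(dict(out))
        else:
            # Case 3: later occurrence of an unmatched natural key.
            out[surrogate_key] = created[nk]
            new.append(out)
    return first, new, old


key_port = [
    {"customer_name": "Acme", "customer_id": 7},
    {"customer_name": "Acme", "customer_id": 99},  # duplicate, ignored
]
in_port = [
    {"customer_name": "Acme", "customer_id": None},
    {"customer_name": "Zenith", "customer_id": None},
    {"customer_name": "Zenith", "customer_id": None},
]
first, new, old = assign_keys(in_port, key_port, "customer_name", "customer_id")
```

In this example the Acme record goes to old with key 7, both Zenith records go to new with key 8, and only the first Zenith record also goes to first.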
Upon completion, Assign Keys sends to the log port a record containing counts of
records read from and written to each of the other ports. Note that these counts are per
partition.
Input Ports
in       Yes
key      Yes
Output Ports
Port Name
first    Yes    No
new      Yes    No
old      Yes    No
         No     No
The record formats on the old and new ports must be the same as the record
format on the in port. The record format on the first port must either be the same as the
record format on the in port, or must contain a subset of the fields in the record format on
the in port.
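The record format rule for the first port can be sketched as a simple field-name check. This is an illustration only: the helper names are hypothetical, record formats are modeled as lists of field names, and field types are not compared.

```python
def first_port_format_is_valid(in_fields, first_fields):
    """True if the first port's fields are the in port's fields or a subset of them."""
    return set(first_fields) <= set(in_fields)


def project(record, fields):
    """Keep only the named fields, as a narrower first-port format would."""
    return {name: record[name] for name in fields}


in_format = ["customer_name", "customer_id", "region"]
first_format = ["customer_name", "customer_id"]

ok = first_port_format_is_valid(in_format, first_format)
narrowed = project(
    {"customer_name": "Zenith", "customer_id": 8, "region": "EU"}, first_format
)
```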
You do not need to partition the flows on the in port in any particular way, but
Assign Keys partitions the output from the first and new ports by natural key. The order
and partitioning of the output records on the old and new ports might not match the order
and partitioning of the records on the in port.
The partitioning of the records on the flow connected to the key port does not
matter, since these records are repartitioned inside Assign Keys. However, if you connect
a fan-in or fan-out flow to the key port, Assign Keys displays a yellow to-do cue. The
solution is to connect only a straight flow to the key port: a straight flow is the only type
of flow you ever need on this port.
If Assign Keys is running in a serial layout, it produces new surrogate key values
that are consecutive integers. These consecutive integers begin with the first
positive integer larger than the largest surrogate key value in the records entering
the key port. If no records enter the key port, the first new key value is 1.
If Assign Keys uses all the possible surrogate key values before it finishes
processing the records on the in port, it signals an error and stops the execution of the
graph. This could happen, for example, if the type of the surrogate key field is decimal(2)
and more than 100 records on the in port have natural key values that do not match the
natural key value of any record on the key port. Generally, you should choose a surrogate
key type that is wide enough to prevent this problem, such as integer(8).
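To make the exhaustion case concrete, here is a hedged sketch of a serial key allocator. The class name KeyAllocator is hypothetical, and the sketch assumes that a decimal field with N digits and scale 0 can hold positive values up to 10**N - 1.

```python
class KeyAllocator:
    """Hands out consecutive surrogate keys until the field type is exhausted."""

    def __init__(self, existing_keys, digits):
        # Assumption: a decimal(digits) field with scale 0 tops out at 10**digits - 1.
        self.max_value = 10 ** digits - 1
        # First positive integer above the largest existing key; 1 if there are none.
        self.next_key = max(max(existing_keys, default=0), 0) + 1

    def allocate(self):
        """Return the next key, or raise once no values are left."""
        if self.next_key > self.max_value:
            raise OverflowError("no surrogate key values left for this field type")
        key = self.next_key
        self.next_key += 1
        return key


# With decimal(2) and an empty key port, keys 1 through 99 are available.
allocator = KeyAllocator(existing_keys=[], digits=2)
keys = [allocator.allocate() for _ in range(99)]

try:
    allocator.allocate()
    exhausted = False
except OverflowError:
    exhausted = True
```

Under the same assumption, a wider type leaves far more headroom, which is the reason for the recommendation above.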
Assign Keys uses as much main memory as it needs, subject to the limit specified
in the max_core parameter, to store two internal tables relating natural and surrogate key
values. One table relates natural key values to existing surrogate key values; the other
relates natural key values to newly created surrogate key values. If either of these internal
tables requires more than half the number of bytes specified in the max_core parameter,
Assign Keys temporarily stores part or all of the table in temporary files on disk, in the
working directory specified by the layout of the component connected to the old port.
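The half-of-max_core rule can be expressed as a small check. This is a rough sketch for capacity planning, not a description of the component's internals. The function name and the idea of estimating table sizes in bytes up front are assumptions.

```python
def tables_spilling_to_disk(max_core, existing_table_bytes, new_table_bytes):
    """Return the names of the internal tables that exceed half of max_core."""
    half = max_core / 2
    sizes = {
        "existing": existing_table_bytes,  # natural key -> existing surrogate key
        "new": new_table_bytes,            # natural key -> newly created surrogate key
    }
    return [name for name, size in sizes.items() if size > half]


# With max_core of 1000 bytes, each table may use up to 500 bytes in memory.
spilled = tables_spilling_to_disk(1000, existing_table_bytes=400, new_table_bytes=600)
```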
The amount of disk space Assign Keys uses depends on the size of the flow on the
key port, and the number of natural keys in the records on the in port that are not found in
records on the key port.
About the few_keys Parameter of Assign Keys
If you set the few_keys parameter to True (the default), the component replicates
the key flow in each partition and avoids repartitioning the in flow.
This strategy usually works well if there are only a few thousand records on the
key port. If you have a large key flow, however, the replication process will
probably spill to disk, especially if the value of max_core is small. This slows the
execution of the graph.
If you set the few_keys parameter to False, the component partitions both the key
flow and the in flow by the natural key.
This strategy makes better use of main memory when there are hundreds
of thousands of records. However, if many of the input records have the same
natural key value, the partition that processes that key value has more work to do,
and thus can consume more disk space and processing time than would be
consumed if you set few_keys to True.
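The two strategies can be contrasted with a short sketch. It is illustrative only: the function names are hypothetical, and zlib.crc32 stands in for whatever partitioning function the component actually uses.

```python
import zlib


def partition_index(natural_key_value, n_partitions):
    """Pick a partition from the natural key value (crc32 is a stand-in hash)."""
    return zlib.crc32(str(natural_key_value).encode("utf-8")) % n_partitions


def replicate_keys(key_records, n_partitions):
    """few_keys = True: every partition receives a full copy of the key flow."""
    return [list(key_records) for _ in range(n_partitions)]


def partition_by_natural_key(records, natural_key, n_partitions):
    """few_keys = False: each record goes to the partition chosen by its natural key."""
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        parts[partition_index(rec[natural_key], n_partitions)].append(rec)
    return parts


key_flow = [{"customer_name": "Acme", "customer_id": 7}]
copies = replicate_keys(key_flow, 4)

# Skew: 1000 input records that share one natural key value all land in one partition.
skewed_in = [{"customer_name": "Zenith"} for _ in range(1000)]
sizes = sorted(len(p) for p in partition_by_natural_key(skewed_in, "customer_name", 4))
```

Here sizes shows three empty partitions and one holding all 1000 records, which is the uneven workload described above for few_keys set to False.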