Intel Tofino Native Architecture – Public Version
Application Note
Mar 2021
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis
concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any
patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this
document.
All information provided here is subject to change without notice. Contact your Intel representative to obtain the
latest Intel product specifications and roadmaps.
The products described may contain design defects or errors known as errata which may cause the product to
deviate from published specifications. Current characterized errata are available on request.
Intel, Tofino, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2021, Intel Corporation. All rights reserved.
7 Externs
7.1 Direct and indirect externs
7.1.1 Direct externs
7.1.2 Indirect externs
7.2 Packet lengths used by externs
7.3 Action Profile

Tables
Table 1: Fields with Validity
Table 2: Queue Admission Control drop behavior
Table 3: Summary of TNA Table Properties
Table 4: Summary of TNA Externs and Where They May be Instantiated
Table 5: Constants Used to Instantiate Pre-Defined CRC Polynomials
Table 6: Meter Color Encoding
Table 7: Intel Tofino Part Numbers and Port Numbers
Table 8: 4-pipe Intel Tofino Port Numbers
Table 9: 2-pipe Intel Tofino Port Numbers
Revision History
Date      Revision  Description
Mar 2019  0001      Initial Release of Public Version of Intel Tofino Native Architecture.
While there are many similarities between TNA and the Portable Switch Architecture
(PSA) published by P4.org, they are not the same.
Figure 1: 4-pipe Intel Tofino Block Diagram (pipes 0 through 3, each with its own
Input Ports block, processing engine, and Output Ports block)
Each of the 4 pipes is identical in its internal structure. In each pipe, the blocks
labeled “Input Ports” and “Output Ports” contain 16 Ethernet MACs. Each Ethernet
MAC can be configured independently as a single 100 Gigabit Ethernet channel, as
four 25 Gigabit Ethernet channels, or one of several other modes described in Section
12.
Pipe 0 is also connected to one additional 100Gbps Ethernet port, called the CPU
Ethernet port. Pipe 2 is connected to a DMA packet interface, called the CPU PCIe
port.
The control plane uses the APIs implemented in the driver software to configure
everything through a PCI Express interface, often using DMA channels to increase the
rate at which large tables can be configured.
In the data plane, a packet is first received by an Input Ports block, and then flows
from left to right through the blocks shown in Figure 1: 4-pipe Intel Tofino Block
Diagram. The Ingress Parser identifies which headers are present in the packet. The
headers of interest are determined by the P4 program. The Ingress Control is
responsible for the bulk of data packet processing. Among other things, it is
responsible for choosing the desired destination for the packet. By destination we
include possibilities such as unicast, multicast, dropping the packet, etc. The Ingress
Deparser emits the headers as specified in the P4 program and constructs the packet
by prepending these headers to the packet payload, which is the portion of the packet
left after parsing.
The Traffic Manager stores packets in a packet buffer, optionally replicates and then
enqueues the packet for the selected port(s). Later, the scheduler will dequeue the
packet, and send it to the Egress Parser in the selected port’s pipe.
The Egress Parser, Egress Control, and Egress Deparser perform similar operations as
their counterparts in the ingress pipeline, but the P4 code controlling their behavior is
often quite different. For example, the Egress Control cannot change the output port
of the packet. The Output Ports block physically transmits the signal representing the
output packet to the external physical interface.
Additional differences between ingress and egress will be explained throughout this
document, and all available packet paths are described in Section 5.
NOTE: Do not use these types in the packet headers that are visible outside of Intel
Tofino – these can lead to non-portable packet formats.
Note: Many intrinsic header definitions shown in this document omit padding fields
that are present in the definitions used by the P4 compiler. See Section 11.2 for a
description of the @padding annotation that is often used for such fields.
This section describes how to write P4 parsers and controls for an Intel Tofino device
using TNA. There are a few variations, depending upon whether you want all pipes
within an Intel Tofino device to process packets using the same P4 code, or whether
you wish some pipes to process packets differently from each other.
Section 5 describes in detail the intrinsic metadata fields that TNA provides. Many are
inputs to your P4 program, such as the port on which a packet arrived, or the time
when it arrived. Others are outputs from your P4 program, directing the behavior of
the device, such as whether the packet should be dropped, or sent to one output port,
or replicated to a group of output ports.
The P416 language specification defines several table properties such as key and
actions. Section 6 describes additional table properties provided by TNA that can be
used to modify the behavior of the table.
Section 7 defines the extern objects and functions provided by TNA. Section 9
describes options for configuring the behavior of the Intel Tofino packet generators.
Section 11 describes several P416 annotations that may help in writing some kinds of
programs for TNA.
The P4 developer must define a P4 struct that contains all headers of interest for their
use case in the ingress pipeline. They must also define a P4 struct that contains all
user-defined metadata fields of interest. The type names of these structs are up to the
choice of the P4 developer. The examples in this section will use the type names
my_ingress_headers_t and my_ingress_metadata_t. These same type names must
be used when defining the ingress parser, ingress control, and ingress deparser in the
program.
The sample P4 code below demonstrates defining an ingress parser, ingress control,
and ingress deparser. The names of the parsers and controls and the names of
parameters are up to the choice of the P4 developer. The things that must be exactly
as shown in the example are the order of the parameters, their directions (in, out, or
inout), the type packet_in, and any type whose name contains
intrinsic_metadata. The contents of those intrinsic metadata
types will be described in Section 5.
control MyIngress(
inout my_ingress_headers_t ig_hdr,
inout my_ingress_metadata_t ig_md,
in ingress_intrinsic_metadata_t ig_intr_md,
in ingress_intrinsic_metadata_from_parser_t ig_prsr_md,
inout ingress_intrinsic_metadata_for_deparser_t ig_dprsr_md,
inout ingress_intrinsic_metadata_for_tm_t ig_tm_md)
{
apply {
// ingress control code here
}
}
control MyIngressDeparser(
packet_out pkt,
inout my_ingress_headers_t ig_hdr,
in my_ingress_metadata_t ig_md,
in ingress_intrinsic_metadata_for_deparser_t ig_dprsr_md)
{
apply {
// emit headers for out-of-ingress packets here
}
}
The egress parser, egress control, and egress deparser must also have two struct
types defined for the purposes of holding headers and user-defined metadata. These
types can be different than the types used in the ingress code, but you may also use
the same type names for ingress and egress code if that is desired. The examples in
this section will use the type names my_egress_headers_t and
my_egress_metadata_t.
The sample P4 code below demonstrates defining an egress parser, egress control,
and egress deparser. As for ingress, the names of parsers and controls and
parameter names may be chosen by the P4 developer.
parser MyEgressParser(
packet_in pkt,
out my_egress_headers_t eg_hdr,
out my_egress_metadata_t eg_md,
out egress_intrinsic_metadata_t eg_intr_md)
{
state start {
// parser code begins here
transition accept;
}
}
control MyEgress(
inout my_egress_headers_t eg_hdr,
inout my_egress_metadata_t eg_md,
in egress_intrinsic_metadata_t eg_intr_md,
in egress_intrinsic_metadata_from_parser_t eg_prsr_md,
inout egress_intrinsic_metadata_for_deparser_t eg_dprsr_md,
inout egress_intrinsic_metadata_for_output_port_t eg_oport_md)
{
apply {
// egress control code here
}
}
control MyEgressDeparser(
packet_out pkt,
inout my_egress_headers_t eg_hdr,
in my_egress_metadata_t eg_md,
in egress_intrinsic_metadata_for_deparser_t eg_dprsr_md)
{
apply {
// emit desired egress headers here
}
}
After defining one of each of these six blocks, they are collected together into a
definition for a pipe, as shown in the example below. The arguments to the Pipeline
package must match the names used earlier when defining those parsers and
controls. The name of the Pipeline instance may be chosen by the P4 developer. The
example below uses the instance name pipe.
Pipeline(MyIngressParser(),
MyIngress(),
MyIngressDeparser(),
MyEgressParser(),
MyEgress(),
MyEgressDeparser()) pipe;
Now to finish the program, create the top-level instance. Following the P416 language
specification, this instance must be named main.
Switch(pipe) main;
This instantiation of the package Switch has a single argument, pipe. When the
package Switch is instantiated with a single argument, it means that the six
programmable blocks bundled together in the instance pipe will be duplicated in every
pipe of the target device.
The word “duplicated” is significant here. If the program above is compiled for a 4-
pipe Intel Tofino, any P4 tables or extern objects defined within the controls
MyIngress and MyEgress will have four separate instances, one per pipe. There are no
tables or extern objects that can be physically shared across separate pipes. This
automatic duplication is provided as a convenience. It is a modification of the P416
language specification.
The example code below shows the creation of a separate Pipeline instance named
pipe2, which uses this second set of six blocks.
Pipeline(MyIngressParser2(),
MyIngress2(),
MyIngressDeparser2(),
MyEgressParser2(),
MyEgress2(),
MyEgressDeparser2()) pipe2;
The example below has four parameters to the Switch package instantiation.
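A plausible instantiation consistent with this description (a sketch, assuming the
Pipeline instances pipe and pipe2 defined above, with one argument per pipe, in pipe
order):

Switch(pipe, pipe, pipe2, pipe2) main;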
This example will cause Intel Tofino pipes 0 and 1 to execute the P4 code of the first
set of six programmable blocks shown in the previous section, and pipes 2 and 3 to
execute the P4 code of the second set of six programmable blocks whose names end
with “2”.
Many variations of this example are also supported. For example, to have different
code for the ingress and egress control, but identical code for the parsers and
deparsers, there is no requirement to copy and paste the parser and deparser code.
Instead, one can define pipe3 as in the example below. Note that all parser and
deparser parameters are the same as the ones used when creating instance pipe.
Pipeline(MyIngressParser(),
MyIngress2(),
MyIngressDeparser(),
MyEgressParser(),
MyEgress2(),
MyEgressDeparser()) pipe3;
• Ingress parser
• Ingress control
• Ingress deparser
• Egress parser
• Egress control
• Egress deparser
The P416 language specification allows parsers to call sub-parsers, and controls to call
other controls. TNA supports this.
If a TNA program’s ingress control calls control C1, then C1 is considered part of the
ingress control, e.g. for the purposes of restrictions on what kinds of extern objects
that control C1 may instantiate and call, as documented in Section 7. The same
applies for all programmable blocks listed above.
Figure 2: Ingress Parser byte stream layout (8 bytes of ingress intrinsic metadata,
followed by 8 bytes of port metadata or resubmit metadata, followed by the normal
packet, e.g. Eth / IP / etc.)
As shown in Figure 2: Ingress Parser, the byte stream format depends on the packet
processing path. The first 8 bytes contain the ingress intrinsic metadata which is
common to all packet types. The ingress parser is expected to extract this metadata
into the argument of the type ingress_intrinsic_metadata_t so that this data will
be available for use in the ingress control.
The next 8 bytes hold either the port metadata or the resubmit metadata (depending
on the resubmit_flag bit inside the ingress intrinsic metadata). The contents of the
port metadata and resubmit metadata is defined by the P4 program. The size is
always 8 bytes. The ingress parser is expected to extract these headers or to skip
over them if they are not needed.
For packets from an Ethernet port, the Ethernet header begins immediately after the
port metadata or resubmit metadata.
Note: It is possible for a packet received from outside Intel Tofino on an Ethernet port
to begin with something other than an Ethernet header. The only absolute
requirement is that it be a sequence of bytes treated as valid by Intel Tofino’s
Ethernet MAC logic.
If a received packet was created by a packet generator, the packet will contain a 6-
byte pktgen header immediately after the port metadata header. See Section 9 for
more details on packet generation, and the contents of the headers and payloads it
creates.
The header definition below shows the fields within the ingress intrinsic metadata.
Extracting a header with this type will typically be the first step in your ingress parser
code.
header ingress_intrinsic_metadata_t {
bit<1> resubmit_flag; // Flag distinguishing original
// packets from resubmitted packets.
bit<2> packet_version; // Read-only Packet version.
PortId_t ingress_port; // Ingress physical port id.
// (Additional fields are omitted in this excerpt.)
}
Regardless of whether a packet was received from outside the device, from a packet
generator, or looped back internally, the packet may be resubmitted once by your P4
program. See Section 7.14 for instructions on doing so. When the resubmitted packet
begins ingress parsing again, it will contain resubmit metadata supplied by the P4
program when the resubmit operation was invoked, instead of port metadata. The
rest of the packet after the resubmit metadata will be the same as the original packet.
5.1.1 Parser
TNA supports the extract, lookahead, and advance methods on parameters of type
packet_in.
The function verify defined in the P416 language specification is not supported. You
may instead assign values to user-defined metadata fields in the parser code
recording error status when packet header contents are found that you consider illegal
or malformed.
For example, a concrete port metadata definition might be:
struct port_metadata_t {
bit<3> port_pcp;
bit<12> port_vid;
}
More generally, these types have the form:
header resubmit_metadata_t {
// user-defined fields, up to the maximum size supported
}
struct port_metadata_t {
// user-defined fields, up to the maximum size supported
}
struct my_ingress_metadata_t {
port_metadata_t port_meta;
}
struct my_ingress_headers_t {
resubmit_metadata_t resub;
ethernet_h ethernet;
}
parser MyIngressParser(
packet_in pkt,
out my_ingress_headers_t ig_hdr,
out my_ingress_metadata_t ig_md,
out ingress_intrinsic_metadata_t ig_intr_md)
{
state start {
pkt.extract(ig_intr_md);
transition select (ig_intr_md.resubmit_flag) {
1 : parse_resubmit;
default : parse_port_metadata; // assumed state name; extracts or
// skips the port metadata
}
}
// (The parse_resubmit and parse_port_metadata states are omitted
// in this excerpt.)
}
struct ingress_intrinsic_metadata_from_parser_t {
bit<48> global_tstamp;
bit<32> global_ver;
bit<16> parser_err;
}
The field global_tstamp contains the time in nanoseconds when the packet entered
the ingress parser. The field parser_err contains a code indicating any error that
occurred during ingress parsing. See Section 5.2.1 for a list of parser errors and their
causes. The field global_ver is reserved.
If multiple errors occur while parsing a packet, the value of the parser_err field will
be the bitwise OR of each of the error codes. For example, if both PARTIAL_HDR and
SRC_EXT errors occur, the value of parser_err will be (0x0002 | 0x0020) = 0x0022.
If the ingress control P4 code never refers to the parser_err field, the compiler
configures the ingress parser to drop all packets that experience any parsing error in
the list above. For such a P4 program, these parsing error packets will never execute
the code of the ingress control.
If the ingress control P4 code does refer to the parser_err field, packets that
experience a NO_TCAM parsing error are configured to be dropped in the ingress parser
and will never execute the code of the ingress control. If a NO_TCAM parsing error is
not encountered by the packet during ingress parsing, the packet will be processed by
the ingress control.
For packets that are parsed by the egress parser, the egress control always processes
the packet, regardless of whether the egress parser detected an error.
If the P4 code explicitly performs a transition to the reject state, parsing
terminates, but none of the parser error flags are set.
The P4 developer can choose to match on the parser_err field and perform any
desired actions. One simple approach to handling parser errors in TNA is to drop the
packet. In the ingress pipeline there are additional safe options: mirroring the packet,
or resubmitting.
Processing a packet that was parsed with errors can lead to unpredictable results,
because the hardware does not always complete all operations that appear in the P4
parser code before the parsing error was detected.
See below for descriptions of the reasons that each of these errors can occur.
PARSER_ERROR_ARAM_MBE: (action RAM multi-bit error) This error occurs if the parser
read a configuration memory during parsing, and error detection hardware logic
indicated that the contents must have been corrupted after they were written by driver
software when loading the compiled P4 program. In most cases this error can occur
only very briefly, as driver software will detect and correct the problem.
PARSER_ERROR_FCS: This error can occur for several different reasons. These reasons
all have in common that there was some error detected before the parser, and this
error is signaled to the parser. The parser then uses this error flag to pass on the
presence of any of those errors. One example is that an Ethernet frame was detected
to be corrupted because its FCS (Frame Check Sequence) did not match the one
calculated from the frame contents.
Table 1: Fields with Validity
Fields with validity   Struct type name in which the fields occur
ucast_egress_port      ingress_intrinsic_metadata_for_tm_t
mcast_grp_a            ingress_intrinsic_metadata_for_tm_t
mcast_grp_b            ingress_intrinsic_metadata_for_tm_t
When the ingress parser begins executing, all ingress fields with validity are initialized
to invalid. When the egress parser begins executing, all egress fields with validity are
initialized to invalid.
When the P4 code assigns a value to a field with validity, the field automatically
becomes valid. If the P4 program never assigns a value to a field with validity while
processing a packet, the field remains invalid. Calling the TNA extern function
invalidate on a field with validity forces it to become invalid, e.g.
invalidate(ig_tm_md.ucast_egress_port).
The validity of these fields is significant in determining what happens to the packet, as
will be described later.
5.4 Destinations
When a P4 program sends a packet to the Traffic Manager, there is a group of intrinsic
metadata fields that control what copies will be made of the packet, and where each
of these copies will go. This group of metadata fields is called a destination in this
document.
A destination contains fields to specify the following places to which a packet will be
sent; there are conditions described later in which a packet will be sent to only some
of the following places, not all of them.
• Two multicast group ids, mcast_grp_a and mcast_grp_b. Each may be valid or
invalid, independently of the other. Only the ones that are valid cause
multicast copies to be created. See Section 5.3 for the meaning of field
validity, and Section 5.7.3 for details on multicast packet replication. Note that
even a valid mcast_grp field can refer to a multicast group that is configured
to make 0 copies of the packet.
Note that a single packet sent to a destination can be sent to any subset of the four
places (unicast, multicast group A, multicast group B, copy-to-CPU).
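As a sketch (assuming the ingress control's Traffic Manager metadata parameter is
named ig_tm_md, as in the earlier examples), a unicast destination is chosen by
assigning the unicast egress port field, which also makes that field valid:

action set_unicast_dest(PortId_t port) {
    // Assigning the field makes it valid; the packet will be unicast
    // to this port unless drop_ctl later invalidates it.
    ig_tm_md.ucast_egress_port = port;
}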
struct ingress_intrinsic_metadata_for_deparser_t {
bit<3> drop_ctl;
bit<3> digest_type;
bit<3> resubmit_type;
MirrorType_t mirror_type;
}
See Section 7.7 for details on digest_type, Section 7.14 for details on
resubmit_type, Section 7.10 for details on mirror_type, and Section 5.6 for details
on drop_ctl.
Note that if a packet is resubmitted, it will only be resubmitted, and no packet will go
to the Traffic Manager. If the packet is not resubmitted, then both a mirror packet and
the normal packet will go to the Traffic Manager if a mirror operation is invoked.
See Section 7.7 for how to invoke a digest operation, Section 7.14 for how to invoke a
resubmit operation, and Section 7.10 for how to invoke a mirror operation. A mirror
packet’s destination is determined by looking up its mirror session id in a table in the
Traffic Manager.
• If bit 0 of drop_ctl is 1, then the unicast egress port and both multicast
groups of the normal packet’s destination are invalidated. If bit 0 of drop_ctl
is 0, then the validity of the unicast egress port and multicast groups are left
as specified in the destination.
• If bit 1 of drop_ctl is 1, then no copy-to-CPU will occur for the
normal packet. If bit 1 of drop_ctl is 0, then a copy-to-CPU will occur
if the copy_to_cpu field is 1.
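For example, a minimal ingress action that drops the normal packet (a sketch,
assuming the deparser metadata parameter is named ig_dprsr_md, as in the earlier
examples):

action drop_packet() {
    // Set bit 0: invalidate the unicast egress port and both
    // multicast groups, so no normal copy is made.
    ig_dprsr_md.drop_ctl = 0x1;
}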
struct ingress_intrinsic_metadata_for_tm_t {
// Fields that are part of the destination are omitted here.
bit<1> bypass_egress;
bit<1> deflect_on_drop;
bit<1> disable_ucast_cutthru;
bit<1> enable_mcast_cutthru;
// (Additional fields are omitted in this excerpt.)
}
The egress processing can be skipped by assigning the bypass_egress field to 1 in the
ingress pipeline. This applies to all copies that are made of the packet according to the
values in its destination.
• Normal packets from ingress. That is, packets from ingress that were not
created by a mirroring operation (see Section 7.10).
For normal packets from ingress, the P4 program sends the destination fields in the
ingress_intrinsic_metadata_for_tm_t structure. This structure is an output from
the ingress control.
If write admission control decides not to store the packet in the chosen pool, e.g.
because that pool is too full, then the rest of the packet’s destination is ignored, and
the packet is typically dropped.
If a packet from ingress cannot be stored in the chosen pool, and has the
deflect_on_drop field assigned to 1, the Traffic Manager will attempt to store the
packet in an alternate pool. If there is room in that alternate pool, the packet will be
stored in the alternate pool and sent to an alternate destination. The alternate pool
and destination for deflect-on-drop packets are configured by the control plane
software.
Every packet arriving to the PRE already has a destination (as described in Section
5.4): regular packets get their destination from the intrinsic metadata, and mirrored
packets get their destination from a mirror session lookup (see Section 5.7.1).
• copy_to_cpu flag
If the field mcast_grp_a is invalid, no copies of the packet descriptor are made for
mcast_grp_a. If mcast_grp_a is valid, then copies of the packet descriptor are made
based upon the configuration of multicast group mcast_grp_a, described in the next
section.
Multicast groups are configured by the control plane. For each multicast group id, the
control plane configures a list of Level 1 nodes (this list may be empty).
Each Level 2 node is either:
• Individual port
• A LAG, configured as a list of individual ports (must have at least one port,
duplicates are not allowed)
• n1.RID - A 16-bit replication id. The value 0 may be used, but if one avoids
using 0, then the egress_rid field of a packet’s egress intrinsic metadata (see
Section 5.8.1) makes it easy to distinguish multicast replicated packets from
other packets during egress processing.
Multicast groups contain a list of Individual Level 1 nodes (may be empty), and a list
of ECMP nodes (may be empty). When a Level 1 node (either individual or ECMP) is
associated with a multicast group, the following information must be provided as a
part of that association:
• Level 1 Node ID – This ID is used for identifying the node in the control plane
API and does not affect the data plane behavior.
A brief summary is that the PRE iterates over all level 1 nodes configured for the
multicast group, and for each level 1 node, the PRE iterates over all level 2 nodes
configured for that level 1 node. If a level 1 node is type ECMP, or a level 2 node is
type LAG, then a hash value from the packet’s destination is used to select exactly
one of the individual nodes in the ECMP/LAG node. Each node also has a “prune”
condition, which if true causes the PRE to make no copies of the packet for that node.
Note: Only values in the range [0, 287] are supported for pkt.level2_exclusion_id.
Every packet arriving to the Traffic Manager has a packet_color field, which comes
from the destination of the packet (see Section 5.4). The numeric encoding of the
packet_color field is the same as the default green, yellow, and red encodings for
meter colors (see Section 7.9.2).
Packets with color green are the highest importance. Packets with color yellow are
lower importance than green. Packets with color red are the lowest importance. The
ingress P4 code decides the value of the packet_color field for each packet, except
for mirrored packets, where the value of packet_color comes from the configuration
of the mirror session.
Every queue has three queue depth threshold values configured by control plane
software, in units of 80-byte cells. The red threshold is the smallest. The yellow
threshold is larger than the red threshold. The green threshold is the largest.
The table below describes the behavior of queue admission control when a packet
descriptor arrives to a queue, based upon that queue’s current depth, the queue’s
configured thresholds, and the packet_color of the packet.
The Queue Congestion State number is used for the enq_congest_stat and
deq_congest_stat intrinsic metadata fields. See Section 5.8.1.
There are many options for configuring the packet scheduler. For example, one of the
queues for an output port can be configured as strictly higher priority than all other
queues for the same output port. Multiple queues for the same output port may be
configured to use weighted fair queuing with relative weights configured for each
queue. Maximum shaping rates in bits per second may be configured for queues. The
full details will be documented in the future.
After dequeuing a packet descriptor, the packet scheduler fills in the egress intrinsic
metadata fields for this copy of the packet and appends after it the contents of the
packet read from the packet buffer. The space occupied by the packet in the packet
buffer is deallocated only when the last packet descriptor that points to it has been
dequeued.
A normal packet is one that is the result of ingress processing, and the packet was
unicast, multicast, copied via the copy_to_cpu flag, or deflected via the
deflect_on_drop flag.
A mirrored packet is either an ingress mirror packet created in the ingress deparser,
or an egress mirror packet created in the egress deparser. The details on mirroring
will be explained in Section 7.10.
All packets begin with egress intrinsic metadata. The egress parser P4 code should
extract the egress intrinsic metadata. After the intrinsic metadata, the rest of the
packet’s contents are as it was deparsed, either in the ingress or egress deparser.
For this reason, it is common for there to be one or more user-defined headers added
at the beginning of packets in the ingress deparser, to carry such fields. It is
customary to call these bridge header(s). Bridge headers can be defined, manipulated,
emitted in the ingress deparser, and parsed in the egress parser, just as any other
headers. TNA does not distinguish them from other headers in any way.
Bridge headers are optional. There is no size limit on bridge headers. However, using
a large bridge header can reduce the packet throughput. Intel Tofino is engineered to
maintain full packet throughput if the length of the bridge headers plus the packet
from ingress is at most 28 bytes longer than the original packet received by the
ingress parser (not counting the ingress intrinsic metadata and port metadata).
As for bridge headers, mirror headers are optional. Mirror headers may be at most 32
bytes long.
Figure 3: Normal & Mirrored Layout of Packets Received at the Egress Parser
Since bridge and mirror headers are optional, and their contents are user defined, it is
up to the P4 developer to design these header formats in such a way that one can
distinguish different kinds of packets arriving at the egress parser from each other.
For example, a P4 developer could choose a design where the first 4 bits of all packets
sent to the Traffic Manager contain a packet type value that identifies the format of
the packet’s first header.
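As a sketch of that convention (all names and field widths here are illustrative, not
part of TNA):

header bridge_h {
    bit<4>  pkt_type;          // identifies the format of the first header
    bit<4>  pad;
    bit<16> ingress_port_info; // example field carried from ingress to egress
}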
TNA does provide a few intrinsic metadata fields to distinguish between some kinds of
packets. See the description of the fields deflection_flag, egress_rid, and
egress_rid_first in the next section.
header egress_intrinsic_metadata_t {
PortId_t egress_port;
bit<19> enq_qdepth;
bit<2> enq_congest_stat;
bit<18> enq_tstamp;
bit<19> deq_qdepth;
bit<2> deq_congest_stat;
bit<8> app_pool_congest_stat;
bit<16> egress_rid;
bit<1> egress_rid_first;
QueueId_t egress_qid;
bit<3> egress_cos;
bit<1> deflection_flag;
bit<16> pkt_length;
}
The enq_qdepth and deq_qdepth fields give the depth of the queue that the packet
passed through at the time the packet was enqueued, and dequeued, respectively.
Queue depths are in units of cells, where a cell contains 80 bytes.
The enq_tstamp field gives the time when the packet was enqueued, in nanoseconds.
The control plane software configures a 3-bit egress Class Of Service value for each
hardware queue. This value is put into the field egress_cos.
The field app_pool_congest_stat contains 8 bits, divided into four 2-bit sub-fields.
One sub-field is in bit positions [7:6], another in bit positions [5:4], the third in bit
positions [3:2], and the last in bit positions [1:0]. Each sub-field is configured to
correspond with one application pool in the Traffic Manager’s packet buffer. See
Section 5.7.2.
Each Traffic Manager pool is configured with three thresholds, in units of 80-byte cells.
These three thresholds are similar to the queue thresholds discussed in Section
5.7.3.2, and they are also called a red threshold, a yellow threshold, and a green
threshold, each larger than the previous one.
At the time a packet is dequeued, the four configured application pools are checked
for their total occupancy, and each is compared against the three thresholds. A 2-bit
numeric value that has the same encoding and meaning as the Queue Congestion
Thresholds in Section 5.7.3.2 is calculated for the pool, and put into the configured
sub-field of app_pool_congest_stat.
struct egress_intrinsic_metadata_from_parser_t {
bit<48> global_tstamp;
bit<32> global_ver;
bit<16> parser_err;
}
The field global_tstamp contains the time in nanoseconds when the packet entered
the egress parser. The field parser_err contains a code indicating any error that
occurred during egress parsing. See Section 5.2.1 for a list of parser errors and their
causes. The field global_ver is reserved.
struct egress_intrinsic_metadata_for_deparser_t {
bit<3> drop_ctl;
MirrorType_t mirror_type;
// (Additional fields are omitted in this excerpt.)
}
As mentioned in Section 5.3, field mirror_type has field validity and is initially invalid.
TNA initializes drop_ctl to 0 for each packet.
See Section 7.10 for details on mirror_type, and Section 5.12 for details on
drop_ctl. The behavior of the other fields is an advanced topic to be documented in
the future. It is recommended to leave them with the values that TNA initializes them
to.
struct egress_intrinsic_metadata_for_output_port_t {
bit<1> capture_tstamp_on_tx;
bit<1> update_delay_on_tx;
bit<1> force_tx_error;
}
Note that it is possible for a mirror packet to be sent to the Traffic Manager and, at
the same time, for the normal packet to go to the output port.
See Section 7.10 for how to invoke a mirror operation. A mirror packet’s destination is
determined by looking up its mirror session id in a table in the Traffic Manager (see
Section 5.7.1).
All table properties must be assigned a compile-time known value of the supported
type. This example shows using a previously instantiated extern as the value of a
table property.
This is the recommended way to assign extern instances to table properties, because
it provides an explicit instance name for use by the generated control plane API.
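A sketch of this pattern, using a DirectCounter (Section 7.6) as the property value
(the action and table names here are illustrative):

DirectCounter<bit<32>>(CounterType_t.PACKETS) my_stats;

action set_port_props() {
    // Every action of the owner table must call count() (see Section 7.6).
    my_stats.count();
}

table port_tbl {
    key = { ig_intr_md.ingress_port : exact; }
    actions = { set_port_props; }
    counters = my_stats; // table property assigned the extern instance
}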
match_kind {
range, // Used to represent min..max intervals
selector // Used for implementing ActionSelector
}
A table key with range match type can match on values within a range [start, end].
The range is inclusive at the start and end, meaning any search key value x will
match if (start <= x) && (x <= end).
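For example, a sketch of a table keyed by a range of TCP destination ports (the
action names are illustrative):

table l4_filter {
    key = { hdr.tcp.dst_port : range; }
    actions = { permit; deny; }
    size = 64;
}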
A table key field with selector match type may only be used in a table implemented
via an ActionSelector (see Section 7.4 Action Selector). Such a table contains key
fields with other match types (e.g., exact, ternary, range), but must also have at
least one key field with the selector match type. The regular match types are used in
the match operation. The field(s) with match type selector are used to select one of
the multiple actions within one group of the action selector. If a key is used both as a
regular match type and a selector match type, it must be listed twice in the list of
table key fields.
Table 4: Summary of TNA Externs and Where They May be Instantiated (contents not
reproduced here; its columns include Switch and Pipeline)
The compiler rejects programs that instantiate externs, or attempt to call their
methods, from anywhere other than the places mentioned in Table 4: Summary of
TNA Externs and Where They May be Instantiated. For example, a Counter extern can
only be instantiated in either the ingress control or egress control. As described in
Section 4.3, any sub-parser or sub-control called by one of the six P4
programmable blocks is considered to be part of it.
TNA does not support using the same extern instance from both Ingress and Egress,
nor from more than one of the parsers or controls defined in TNA.
Note: There are several TNA externs not yet described in this document. They will be
added to this table when they are documented.
The externs that have these variants are counters, meters, and registers. In all cases
the direct variant of the extern has a name beginning with Direct, e.g.
DirectCounter, and the indirect variant of the extern has a name with no such
prefix, e.g. Counter.
Each direct extern has at least one method that updates the extern state associated
with a table entry, e.g. count() for a DirectCounter. The state belonging to a direct
extern can only be accessed from the P4 program via these methods, and these
methods may only be called from an action of the direct extern’s owner table. Every
time the owner table is applied and an entry is matched, the action associated with
the matching table entry is executed, updating the direct extern state for that entry.
Each indirect extern has at least one method that can update one of its entries, e.g.
execute(index) for a Meter. An entry of an indirect extern can only be
accessed from the P4 program via these methods. These methods may be called from
within actions of a table, or they may also be called from within the apply block of a
P4 control. Every call to such a method must specify the index of the entry to be
accessed.
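For example, a sketch of updating an indirect counter from the apply block of the
ingress control (this assumes a Counter extern parameterized by counter width and
index type, with a count(index) method; the instance name is illustrative):

Counter<bit<32>, bit<9>>(512, CounterType_t.PACKETS) in_port_pkts;

apply {
    // Count one packet against this packet's ingress port (a cast may be
    // required depending on how PortId_t is defined).
    in_port_pkts.count((bit<9>) ig_intr_md.ingress_port);
}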
In the ingress control, the packet length used by these externs includes all bytes from
the first byte after the port metadata or resubmit header, up to the end of the Frame
Check Sequence at the end of the Ethernet frame. For packets received from an
external Ethernet interface, the packet length is thus always at least 64 bytes, the
minimum Ethernet frame length.
In the egress control, the packet length used by these externs includes all bytes
emitted by the ingress deparser (or the egress deparser in the egress-to-egress
mirroring case) when the packet was sent to the Traffic Manager, plus the payload,
including the Frame Check Sequence. It does not include the egress intrinsic
metadata.
Note that the ingress packet length used does not include the effects of any headers
added or removed by your P4 program during the current execution of the ingress
control. If the packet was recirculated, it does include the effects of any headers
added or removed before the current execution of the ingress control.
The egress length used does include headers added or removed before it was last sent
to the Traffic Manager but does not include the effects of any headers added or
removed by your P4 program during the current execution of the egress control.
Figure 5: Action Profiles in TNA (part (a): a direct table whose entries each carry
their own action specification; part (b): the same table using an action profile, where
entries reference shared members such as m1 -> set_port(1) and m2 -> set_port(2))
Figure 5: Action Profiles in TNA, above, contrasts a direct table with a table that has
an action profile implementation. A direct table, as seen in Figure 5: Action Profiles in
TNA part (a) contains the action specification in each table entry. In this example, the
table has a match key consisting of an LPM on header field h.f. The action is to set
the port. As we can see, entries t1 and t3 have the same action, i.e. to set the port to
1. Action profiles enable sharing an action across multiple entries by using a separate
table as shown in Figure 5: Action Profiles in TNA part (b).
A table with an action profile implementation has entries that point to a member
reference instead of directly defining an action specification. A mapping from member
references to action specifications is maintained in a separate table that is part of the
action profile instance defined in the table implementation attribute. When a table
with an action profile implementation is applied, the member reference is resolved and
the corresponding action specification is applied to the packet.
The control plane can add, modify or delete member entries for a given action profile
instance. The controller-assigned member reference must be unique in the scope of
the action profile instance. An action profile instance may hold at most size entries as
defined in the constructor parameter. Table entries must specify the action using the
controller-assigned reference for the desired member entry. Directly specifying the
action as part of the table entry is not allowed for tables with an action profile
implementation.
7.3.1 Example
The P4 control block MyIngress in the example below instantiates an action profile ap1
that contains up to 1000 member entries. Table ipv4_dest uses instance ap1 by
assigning it to the implementation attribute. The control plane can add member
entries to ap1, where each member can specify either a send or drop action (plus the
action data). When a member is successfully added, the switch software returns a
member id. When adding or modifying entries in table ipv4_dest, the control plane
software must specify the action using this member id.
control MyIngress(
inout my_ingress_headers_t ig_hdr,
inout my_ingress_metadata_t ig_md,
in ingress_intrinsic_metadata_t ig_intr_md,
in ingress_intrinsic_metadata_from_parser_t ig_prsr_md,
inout ingress_intrinsic_metadata_for_deparser_t ig_dprsr_md,
inout ingress_intrinsic_metadata_for_tm_t ig_tm_md)
{
ActionProfile(1000) ap1;
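    // A sketch of the table described above (the key, the send and drop
    // action definitions, and the table size are assumed for illustration):
    table ipv4_dest {
        key = { ig_hdr.ipv4.dstAddr : exact; }
        actions = { send; drop; }
        implementation = ap1; // use the action profile instance
        size = 4096;
    }

    apply {
        ipv4_dest.apply();
    }
}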
extern ActionSelector {
// Construct an action selector of 'size' entries
ActionSelector(bit<32> size, Hash<_> hash, SelectorMode_t mode);
// (Additional constructors, such as the one taking an ActionProfile
// shown in the example below, are omitted in this excerpt.)
}
Figure 6: Action Selector in TNA, above, illustrates a table that has an action selector
implementation. In this example, the table has a match key using a longest-prefix
match.
A table with an action selector implementation consists of table entries that point to
either an action profile member reference or an action profile group reference. An
action selector instance can be logically visualized as two tables as shown in Figure 6:
Action Selector in TNA. The first table contains a mapping from group references to a
set of member references. The second table contains a mapping from member
references to action specifications.
In the figure, table entry t1 points at group g1, and group g1 contains members m1
and m2. Table entry t2, however, points directly at member m2, without referring to any
groups. This flexibility of allowing some table entries to point at groups, and others
directly at members, can be useful when implementing routing tables. You may also
notice that there is a group g3 in the group table, but there are no table entries that
point at this group. This is a common condition, at least temporarily, during routing
table updates. New groups must be created before table entries can point to them.
Action selector members may only specify action types defined in the actions attribute
of the implemented table. All members in a group must specify the same action name.
The action parameters for members in the same group may differ, and the action names
used by different groups in a selector may be different. An action selector instance
may be shared across multiple tables only if all such tables define the same set of
actions in their actions attribute. Furthermore, the selector match fields for such
tables must be identical and must be specified in the same order across all tables
sharing the selector. Tables with an action selector implementation cannot define a
default action. The default action for such tables is implicitly set to NoAction.
The dynamic selection algorithm requires a field list as an input for generating the
index to a member entry in a group. This field list is created by using the match type
selector when defining the table match key. The match fields of type selector are
composed into a field list in the order they are specified. The composed field list is
passed as an input to the action selector implementation. It is illegal to define a
selector type match field if the table does not have an action selector implementation.
The control plane can add, modify or delete member and group entries for a given
action selector instance. An action selector instance may hold at most size member
entries as defined in the constructor parameter. The number of groups may be at
most the size of the table that is implemented by the selector. Table entries must
specify the action using a reference to the desired member or group entry. Directly
specifying the action as part of the table entry is not allowed for tables with an action
selector implementation.
Table ipv4_lpm uses this instance by specifying the implementation table property as
shown. The control plane can add entries with action names and parameters to the
action profile ipv4_ecmp_ap. Each member can specify either a set_port or drop
action.
Then the control plane can add groups to the action selector ipv4_ecmp, and members
to those groups, where each member is a reference to an entry in ipv4_ecmp_ap.
When programming the table entries in table ipv4_lpm, the control plane does not
include the fields with match_kind selector in the key. The selector fields are
instead given as input to the hash_fn extern. In the example below, the fields
{hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.ipv4.protocol} are passed as input
to the CRC16 hash algorithm used for member selection by the action selector.
ActionProfile(128) ipv4_ecmp_ap;
Hash<bit<8>>(HashAlgorithm_t.CRC16) hash_fn;
ActionSelector(action_profile = ipv4_ecmp_ap,
hash = hash_fn,
mode = SelectorMode_t.FAIR,
max_group_size = 256,
num_groups = 2048) ipv4_ecmp;
table ipv4_lpm {
key = {
hdr.ipv4.dstAddr: lpm;
hdr.ipv4.srcAddr: selector;
hdr.ipv4.dstAddr: selector;
hdr.ipv4.protocol: selector;
}
actions = { set_port; drop; }
implementation = ipv4_ecmp;
default_action = drop;
size = 16384;
}
apply {
if (hdr.ipv4.isValid()) {
ipv4_lpm.apply();
}
}
7.5 Checksum
Intel Tofino checksum engines can verify header-only checksums, such as the IPv4
header checksum, in the parser, and re-compute the checksum from header or
metadata fields in the deparser.
For checksums that include the payload, such as the TCP checksum, which is computed
over the entire packet, Intel Tofino assumes that the original checksum is correct and
incrementally updates the checksum value as described in RFC 1624. For an
incremental checksum update, the P4 program must subtract, in the parser, any fields
that will be modified (including the checksum field itself), and then add the new
values to the checksum update in the deparser.
Intel Tofino checksum engines support only 16-bit ones' complement checksums.
extern Checksum {
// Constructor.
Checksum();
// (Methods such as add, subtract, verify, subtract_all_and_deposit,
// and update, used in the examples below, are omitted in this excerpt.)
}
parser IngressParser(
packet_in pkt,
out header_t hdr,
out metadata_t md,
out ingress_intrinsic_metadata_t ig_intr_md)
{
Checksum() ipv4_csum;
state parse_ipv4 {
pkt.extract(hdr.ipv4);
// This is equivalent to passing all the IPv4 fields
ipv4_csum.add(hdr.ipv4);
md.checksum_err = ipv4_csum.verify();
transition accept;
}
}
The partial program below demonstrates one way to use the Checksum extern to
calculate and then update the IPv4 header checksum. In this example, the checksum
is calculated in the ingress deparser block, with the assumption that all IPv4
modification is performed at ingress. In general, it is up to the P4 program author to
ensure that all relevant field modifications precede the checksum update location
(ingress or egress).
control IngressDeparser(
packet_out pkt,
inout header_t hdr,
in metadata_t md,
in ingress_intrinsic_metadata_for_deparser_t md_for_dprs)
{
Checksum() ipv4_csum;
apply {
if (hdr.ipv4.isValid()) {
hdr.ipv4.hdrChecksum = ipv4_csum.update({
hdr.ipv4.version, hdr.ipv4.ihl, hdr.ipv4.diffserv,
hdr.ipv4.total_len,
hdr.ipv4.identification,
hdr.ipv4.flags, hdr.ipv4.frag_offset,
hdr.ipv4.ttl, hdr.ipv4.protocol,
/* skip hdr.ipv4.hdrChecksum, */
hdr.ipv4.srcAddr,
hdr.ipv4.dstAddr});
}
pkt.emit(hdr);
}
}
The example below shows how to use the Checksum extern to compute an incremental
checksum for the TCP header for a common use case, like NAT, where the only fields
that are modified in the packet are the src/dst IP address and src/dst TCP port.
parser IngressParser(
packet_in pkt,
out header_t hdr,
out metadata_t md,
out ingress_intrinsic_metadata_t ig_intr_md)
{
Checksum() tcp_csum;
state parse_ipv4 {
pkt.extract(hdr.ipv4);
tcp_csum.subtract({hdr.ipv4.src_addr, hdr.ipv4.dst_addr});
transition select(hdr.ipv4.protocol) {
6 : parse_tcp;
default : accept;
}
}
state parse_tcp {
pkt.extract(hdr.tcp);
tcp_csum.subtract({hdr.tcp.checksum});
tcp_csum.subtract({hdr.tcp.src_port, hdr.tcp.dst_port});
tcp_csum.subtract_all_and_deposit(md.checksum);
transition accept;
}
}
control IngressDeparser(
packet_out pkt,
inout header_t hdr,
in metadata_t md,
in ingress_intrinsic_metadata_for_deparser_t md_for_dprs)
{
Checksum() tcp_csum;
apply {
hdr.tcp.checksum = tcp_csum.update({
hdr.ipv4.src_addr,
hdr.ipv4.dst_addr,
hdr.tcp.src_port,
hdr.tcp.dst_port,
md.checksum});
pkt.emit(hdr.ethernet);
pkt.emit(hdr.ipv4);
pkt.emit(hdr.tcp);
}
}
The checksum update can be done in the Ingress Deparser and/or the Egress
Deparser. If the residual metadata md.checksum is assigned in the Ingress Parser and
is needed in the Egress Deparser, then the metadata needs to be sent from ingress to
egress using bridge metadata. If such incremental checksum update happens for a
mirrored packet, the P4 programmer may need to add the metadata as a part of the
mirror header. More details of mirroring are given in Section 7.10.
parser IngressParser(
packet_in pkt,
out header_t hdr,
out metadata_t md,
out ingress_intrinsic_metadata_t ig_intr_md)
{
Checksum() tcp_csum;
state parse_ipv4 {
pkt.extract(hdr.ipv4);
tcp_csum.subtract({hdr.ipv4.src_addr,
hdr.ipv4.dst_addr,
8w0, // zero byte for TCP pseudo-header
hdr.ipv4.protocol,
hdr.ipv4.total_len});
<code omitted>
}
control IngressDeparser(
packet_out pkt,
inout header_t hdr,
in metadata_t md,
in ingress_intrinsic_metadata_for_deparser_t md_for_dprs)
{
Checksum() tcp_csum;
apply {
if (hdr.ipv4.isValid() && hdr.tcp.isValid()) {
hdr.tcp.checksum = tcp_csum.update({
hdr.ipv4.src_addr,
hdr.ipv4.dst_addr,
8w0, // zero byte for TCP pseudo-header
hdr.ipv4.protocol,
hdr.ipv4.total_len,
hdr.tcp.src_port,
hdr.tcp.dst_port,
hdr.tcp.seq_no,
hdr.tcp.ack_no,
hdr.tcp.data_offset,
hdr.tcp.res,
hdr.tcp.flags,
hdr.tcp.window,
hdr.tcp.urgent_ptr,
md.checksum});
}
// other ingress deparser code goes here
}
}
TNA supports both direct and indirect variants of counters. See Section 7.1 for a
discussion of the differences between these variants.
• A packet count
• A byte count
• Both a packet count and a byte count
All constructor methods for creating instances of a counter extern take a parameter of
type CounterType_t to specify this choice.
enum CounterType_t {
PACKETS,
BYTES,
PACKETS_AND_BYTES
}
The byte counts are always incremented by the length of the packet currently being
processed, as defined in Section 7.2, minus the value of the adjust_byte_count
parameter, if such a parameter is given.
extern DirectCounter<W> {
DirectCounter(CounterType_t type);
void count(@optional in bit<32> adjust_byte_count);
}
A direct counter instance must be assigned as the value of the counters table
property for exactly one table. That table is the direct counter’s owner. It is an error
to call the count method for a direct counter instance anywhere except inside an
action of its owner table. TNA requires that every action of the direct counter’s owner
table must call the direct counter’s count method exactly once.
The counter value updated by an invocation of count is always the one associated with
the table entry that matched.
A DirectCounter instance must have a counter value associated with its owner table
that is updated when there is a default action assigned to the table, and a search of
the table results in a miss. If there is no default action assigned to the table, then
there need not be any counter updated when a search of the table results in a miss.
By “a default action is assigned to a table”, we mean that either the table has a
default_action table property with an action assigned to it in the P4 program, or the
control plane has made an explicit call to assign the table a default action. If neither of
these is true, then there is no default action assigned to the table.
7.7 Digest
This section describes digest generation, a mechanism to send a message from the
data plane to the control plane.
The other main technique to send messages from data plane to control plane is
sending a packet to the CPU Ethernet port or CPU PCIe port. Sending packets to these
ports typically sends most of the original packet headers, and perhaps also the
payload, each as a separate message to be received and processed by the control
plane.
Using digests provides several advantages over sending packets to CPU ports:
• Digest messages are smaller than even the minimum sized Ethernet packets.
• Data can be automatically and flexibly packed into digest messages, without
the need to define header formats that can be placed into an Ethernet packet.
• TNA can aggregate multiple digest messages into larger batch messages in the
hardware, reducing the rate of receiving these messages in the control plane
software.
A digest message may contain any value from the data plane that is an input to the
ingress deparser. If a desired field is not already an input to the ingress deparser, a
P4 developer may add the field to their user-defined metadata type, and assign the
desired value to the field in the ingress control. When a Digest instance is created, the
P4 programmer specifies a type that holds the desired value(s), often a P4 struct type.
The compiler determines a good serialization format to send the digest contents to the
control plane software. TNA provides the control plane software the ability to
distinguish the messages created by different Digest instances from each other.
By default, no digests are generated. To generate a digest, assign a value in the range
0 through 7, inclusive, to the digest_type field in the ingress control, and then use
that value in the ingress deparser to choose between the values to send in the digest,
as shown in the example code below.
The digest_type field has field validity (see Section 5.3) and is initially invalid at the
beginning of ingress processing. If at one point in your ingress P4 code you assign a
value to digest_type, and then later want to undo the decision to generate a digest,
call the invalidate extern function on the digest_type field.
A digest is created by calling the pack method on an instance of the Digest extern.
The argument is the value to be included in the digest, and the type of this argument
must be the same as the type given when the Digest instance was constructed. Every
pack method call must be enclosed in an if statement with a condition of the form
(digest_type == constant). The recommended P4 code pattern to use the Digest
extern is as follows:
struct my_ingress_metadata_t {
// user-defined ingress metadata
PortId_t port;
bit<16> metadata_to_learn;
}
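// The deparser code below assumes two digest payload structs and two
// Digest instances declared in the ingress deparser (field widths are
// assumed for illustration):
struct digest_a_t {
    bit<48>  dst_addr;
    PortId_t port;
    bit<48>  src_addr;
}

struct digest_b_t {
    bit<48> dst_addr;
    bit<16> metadata_to_learn;
}

// Declared inside the ingress deparser control:
Digest<digest_a_t>() digest_a;
Digest<digest_b_t>() digest_b;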
apply {
// Generate a digest, if digest_type is set in ingress control.
if (ig_dprsr_md.digest_type == 1) {
// The fields in the list correspond to the fields of the
// struct with type digest_a_t.
digest_a.pack({ig_hdr.ethernet.dst_addr, ig_md.port,
ig_hdr.ethernet.src_addr});
}
if (ig_dprsr_md.digest_type == 2) {
// The fields in the list correspond to the fields of the
// struct with type digest_b_t.
digest_b.pack({ig_hdr.ethernet.dst_addr,
ig_md.metadata_to_learn});
}
// the rest of the ingress deparser code goes here
pkt.emit(ig_hdr.ethernet);
}
}
Regardless of whether a digest is generated or not, the rest of the ingress deparser
code still takes effect for determining whether a packet is resubmitted, mirrored,
and/or a normal packet is sent to the Traffic Manager, as described in Section 5.5.
7.8 Hash
The Hash extern can take any collection of header or metadata fields and calculate a
deterministic hash function of those values. The output of the Hash extern can be
used in arbitrary expressions, e.g. as an index for an indirect counter, meter, or
register. The Hash output can also be used as an input to an ActionSelector extern,
often used for implementing features like ECMP or LAG to load balance traffic across
several available network paths.
extern Hash<W> {
// Constructor
// @type_param W : width of the calculated hash.
// @param algo : The default algorithm used for hash calculation.
Hash(HashAlgorithm_t algo);
// Constructor
// @param poly : The default coefficient used for hash algorithm.
Hash(HashAlgorithm_t algo, CRCPolynomial<_> poly);
// (The get methods that compute the hash over a list of fields are
// omitted in this excerpt.)
}
extern CRCPolynomial<T> {
CRCPolynomial(T coeff, bool reversed, bool msb, bool extended,
T init, T xor);
}
When using the CRCPolynomial extern with the Hash extern, the HashAlgorithm_t
parameter should be HashAlgorithm_t.CUSTOM.
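A usage sketch (this assumes the Hash extern provides a get method that returns the
W-bit hash of its input fields; the instance name and field list are illustrative):

Hash<bit<16>>(HashAlgorithm_t.CRC16) flow_hash;

apply {
    // Hash a 3-tuple; the result can index a Counter, Meter, or Register.
    bit<16> idx = flow_hash.get({hdr.ipv4.src_addr,
                                 hdr.ipv4.dst_addr,
                                 hdr.ipv4.protocol});
}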
7.9 Meters
Meters (RFC 2698) provide a mechanism for measuring and detecting when a
sequence of packets in a flow has recently been arriving slower or faster than a
configured average rate (or two configured rates, for two rate meters). Each packet is
categorized into one of the three colors green, yellow, or red at the time the meter is
accessed and updated. See RFC 2698 for details on the conditions under which one of
these three results is returned.
RFC 2698 describes “color aware” and “color blind” meters. The Meter and
DirectMeter externs implement both behaviors. The only difference is in which
execute method you call when updating them. See the comments on the extern
definitions below.
Like counters, there are two variants of meters: direct and indirect. See Section 7.1
for a description of these variants.
The packet length used to update a byte-based meter is always the length of the
packet currently being processed, as defined in Section 7.2, minus the value of the
adjust_byte_count parameter, if such a parameter is given.
Note that TNA recognizes two different numeric values as an encoding for the color
yellow (see Table 6). TNA does not treat these two values any differently from each
other when a color value is needed.
This translation only applies to the return value from the Meter and
DirectMeter execute method calls. If you use an execute method that performs a
color-aware meter update, your program must use the numeric encoding given in
Table 6.
extern Meter<I> {
    Meter(bit<32> size, MeterType_t type);
    Meter(bit<32> size, MeterType_t type, bit<8> red, bit<8> yellow,
          bit<8> green);

    // Use this method call to perform a color aware meter update (see
    // RFC 2698). The color of the packet before the method call was
    // made is specified by the color parameter.
    bit<8> execute(in I index, in MeterColor_t color,
                   @optional in bit<32> adjust_byte_count);

    // Use this method call to perform a color blind meter update.
    bit<8> execute(in I index,
                   @optional in bit<32> adjust_byte_count);
}

extern DirectMeter {
    DirectMeter(MeterType_t type);
    DirectMeter(MeterType_t type, bit<8> red, bit<8> yellow,
                bit<8> green);

    // Color aware meter update.
    bit<8> execute(in MeterColor_t color,
                   @optional in bit<32> adjust_byte_count);

    // Color blind meter update.
    bit<8> execute(@optional in bit<32> adjust_byte_count);
}
A direct meter instance must appear as the value of the meters table property for
exactly one table. That table is the direct meter’s owner. It is an error to call the
execute method for a direct meter instance anywhere except inside an action of its
owner table. TNA requires that every action of the direct meter’s owner table must call
the direct meter’s execute method exactly once.
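A sketch of this pattern, with hypothetical table, action, and field names:

DirectMeter(MeterType_t.BYTES) flow_meter;

action set_color() {
    // Color blind update; the returned color is stored for later use.
    ig_md.color = flow_meter.execute();
}

table meter_tbl {
    key     = { hdr.ipv4.src_addr : exact; }
    actions = { set_color; }
    meters  = flow_meter;
    size    = 1024;
}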
7.10 Mirror
This section describes packet mirroring, a mechanism to create a copy of a packet
and send that copy to a specified destination (see Section 5.3 for the definition
of a destination). TNA supports two packet paths for mirroring: ingress-to-egress
mirroring and egress-to-egress mirroring.
In the ingress-to-egress mirroring path, the decision to mirror the packet is made
in the ingress control, and the Mirror extern emit method is called in the ingress
deparser.
The Traffic Manager uses a mirror session identifier (specified as the first parameter to
the Mirror extern emit call) to look up a mirror session table that contains the values
of intrinsic metadata indicating where the mirror packet should be sent. All mirror
sessions are configured by the control plane. The mirrored packet(s) will later begin
processing in the egress pipeline.
The egress-to-egress mirroring path is similar. The decision to mirror the packet is
made in the egress control, and the Mirror extern emit method is called in the egress
deparser. The main difference, compared to ingress-to-egress mirroring, is that the
Mirror extern makes a copy of the packet’s contents as it is produced by the egress
deparser, after all modifications made in the egress control and egress deparser are
completed. The Mirror extern then prepends a mirror header that you specify and
sends this packet to the Traffic Manager. The Traffic Manager operates on such
packets in the same way described above for the ingress-to-egress mirroring path.
When your code initiates a mirror operation, you specify one of 1023 mirroring
sessions to use. (Mirror session 0 is reserved and must not be used.) Each
mirroring session is independently configured by the control plane software. This
configuration data resides within the Traffic Manager.
For example, suppose a P4 program would like to send a packet replica to the CPU
Ethernet port. The P4 program and control plane software developers agree on using
mirroring session 27 for this purpose. The control plane software configures mirroring
session 27 with the ucast_egress_port attribute set to the CPU Ethernet port number.
Later the P4 program processes a packet and calls a Mirror extern emit method with
session 27, which causes the Traffic Manager to read session 27’s configuration data
and find the ucast_egress_port equal to the CPU Ethernet port number. The Traffic
Manager enqueues the mirrored packet for the CPU Ethernet port. Note that the P4
program cannot change the mirror session attributes.
TNA provides the Mirror extern to choose a mirror session and a user-defined header
for each mirrored packet. This ability to choose a mirror session is controlled by the
session_id argument of the emit method. The ability to choose a user-defined
header is controlled by the mirror_type intrinsic metadata field sent to ingress and
egress deparser and the hdr argument in the emit method. Next, we explain the
restrictions on the Mirror extern and the mirror_type field.
To mirror a packet in ingress, you must assign a value in the range 1 through 7,
inclusive, to the mirror_type intrinsic metadata field. mirror_type is initialized to 0
at the beginning of ingress, and 0 should not be used to mirror packets in ingress. If
at one point in your ingress P4 code you assign a value 1 through 7 to mirror_type,
and then later want to undo the decision to create an ingress-to-egress mirror,
assign the value 0 back to mirror_type.
To mirror a packet in egress, you must assign a value in the range 0 through 7,
inclusive, to the mirror_type intrinsic metadata field. In egress, the mirror_type
field has field validity (see Section 5.3), and is initially invalid at the beginning of
egress. If at one point in your egress P4 code you assign a value to mirror_type, and
then later want to undo the decision to create an egress-to-egress mirror, call the
invalidate extern function on the mirror_type field.
The Mirror extern may only be instantiated in the ingress deparser, egress deparser,
or in both. The compiler will reject any attempt to instantiate the Mirror extern
elsewhere.
The Mirror extern provides two emit methods. Calling the 1-argument emit method
will create a mirrored packet with no user-defined header, using the mirror session
specified by session_id. Calling the 2-argument emit method will create a mirrored
packet with a user-defined header hdr, using the mirror session specified by
session_id.
extern Mirror {
    Mirror();
    void emit(in MirrorId_t session_id);  // no user-defined header
    void emit<T>(in MirrorId_t session_id, in T hdr);
}
A P4 program may instantiate one or more Mirror externs. However, only one Mirror
extern may be used for any given packet. Every emit method call must be enclosed in
an if statement with a condition of the form (mirror_type == constant). The
recommended P4 code pattern to use the Mirror extern is as follows:
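(The code below is a sketch; the mirror header type and the metadata field names are illustrative.)

header mirror_h {
    MirrorId_t session_id;
    bit<6>     pad;
}

Mirror() mirror;

apply {
    // Create a mirrored copy if mirror_type was set during ingress.
    if (ig_dprsr_md.mirror_type == 1) {
        // The fields in the list correspond to the fields of mirror_h.
        mirror.emit<mirror_h>(ig_md.mirror_session,
                              {ig_md.mirror_session, 0});
    }
    // the rest of the deparser code goes here
    pkt.emit(ig_hdr.ethernet);
}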
7.11 ParserCounter
The ParserCounter extern maintains a counter that can be read and updated from
parser states, e.g. while parsing variable-length or TLV-style headers.
extern ParserCounter {
    // Constructor
    ParserCounter();
}
7.12 Random
The Random extern provides generation of pseudo-random numbers with a uniform
distribution. If you wish to generate numbers with a non-uniform distribution, you
may do so by first generating a uniformly distributed random value, and then
mapping that value onto the desired distribution, e.g. by using it as an index
into a precomputed lookup table.
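For reference, a sketch of the Random extern interface (see the tofino.p4 include file for the authoritative definition):

extern Random<W> {
    // Construct a generator of uniformly distributed W-bit values.
    Random();
    // Return the next pseudo-random value.
    W get();
}

For example, an instance declared as Random<bit<8>>() rng; returns a value uniformly distributed over 0 through 255 from each call of rng.get().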
7.13 Registers
Registers are stateful memories whose values can be read and written during packet
forwarding under the control of the P4 program. They are like counters and meters in
that their state can be modified as a result of processing packets, but they are far
more general in the behavior they can implement.
Like counters, there are two variants of registers: direct and indirect. See Section 7.1
for a description of these variants. Below are the definitions of the constructor
methods for creating instances of these externs.
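The indirect variant has the following constructors (a sketch matching the standard tofino.p4 definitions):

extern Register<T, I> {
    // Create a register with size entries of type T.
    Register(bit<32> size);
    // Create a register whose entries are initialized to initial_value.
    Register(bit<32> size, T initial_value);
}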
extern DirectRegister<T> {
    DirectRegister();
    DirectRegister(T initial_value);
}
The type argument I specifies the type of the index of an indirect register extern. This
type can typically be inferred by the compiler.
The type argument T specifies the type of each entry, i.e. the type of state stored in
each entry of the register. Register entries in TNA may be bit<8>, int<8>, bit<16>,
int<16>, bit<32>, or int<32> values, or may be pairs of one of those types. Any
struct with exactly two fields of identical type will be recognized as a pair. In addition,
entries containing a single value of type bit<1> can also be used.
The apply method in a RegisterAction may be declared with either one or two
arguments; the first inout argument is the value of the Register entry being read and
updated, while the second optional out argument is the value that will be returned by
the execute method when it is called in a table action.
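In outline, the RegisterAction extern has the following shape (a sketch; T is the entry type, I the index type, and U the type returned by execute):

extern RegisterAction<T, I, U> {
    RegisterAction(Register<T, I> reg);
    // Run the apply body on the entry selected by index.
    U execute(in I index);
    // Defined by the P4 program; executed by the Stateful ALU.
    abstract void apply(inout T value, @optional out U rv);
}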
Together Register and RegisterAction externs form the Stateful ALU portion of the
pipeline. RegisterActions are triggered from table actions by calling their execute
method, which causes the Stateful ALU to read the specified entry of the Register
storage, run the RegisterAction code, and write back the modified value to the same
entry of the Register storage. The simplest example of a Stateful ALU is a simple
packet counter.
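A sketch, assuming a 16-bit index supplied in user metadata:

Register<bit<32>, bit<16>>(32w65536) pkt_cnt;

RegisterAction<bit<32>, bit<16>, bit<32>>(pkt_cnt) incr_pkt_cnt = {
    void apply(inout bit<32> value) {
        value = value + 1;
    }
};

action count_packet() {
    // Reads, increments, and writes back the selected register entry.
    incr_pkt_cnt.execute(ig_md.flow_index);
}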
This illustrates the basic structure and use of a Stateful ALU. The
RegisterAction.apply method encapsulates all the behavior in the stateful ALU, and
the execute method triggers it to run from some other construct in the P4 program.
Within the Stateful ALU there are two comparison ALUs and four arithmetic/logical
ALUs that can be used for computation. The results of the comparisons may be used
to condition the outputs of the other ALUs. All the ALUs operate on the same word size
for a given RegisterAction which may be 8, 16, or 32 bits. The inputs to all the ALUs
may come from the memory word (either half if it is a pair) or from either of two PHV
registers accessed by the stateful ALU, or from up to 4 constant values. All
RegisterActions sharing a single Register also share these resources, so the
compiler will attempt to efficiently pack and reuse values used by RegisterActions in
the same stateful ALU.
The four arithmetic/logical ALUs are organized as two pairs of two, with one pair
writing its result to each word of the memory. The compiler will schedule the
operations specified in the apply method to the appropriate ALUs to perform the
specified computation, and will attempt to reuse values and pack things as efficiently
as possible. The optional output of the stateful ALU (second argument to the apply
method which will be returned by the execute method) must be a copy of one of the
values read as input by the stateful ALU or one of the values being written to memory.
The restrictions above are significant, but they can be used to construct very useful
stateful data plane algorithms. Once learned, writing RegisterAction code is
reasonably straightforward, using if-else statements and simple assignments to the
parameters of the apply call. The comparison ALUs can compare the sum of a memory
value, a packet value, and a constant, allowing tests for thresholds or limits, as
in the following burst-detection example.
struct burst_data {
    bit<32> timestamp;
    bit<32> count;
}

#define BURST_INTERPACKET_DELAY 10
#define BURST_SIZE 5
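A sketch of such a RegisterAction, assuming a per-flow register indexed by ig_md.flow_index and a 32-bit packet arrival timestamp in ig_md.timestamp (all instance and field names are illustrative):

Register<burst_data, bit<16>>(32w65536) burst_reg;

RegisterAction<burst_data, bit<16>, bit<1>>(burst_reg) detect_burst = {
    void apply(inout burst_data val, out bit<1> in_burst) {
        // Set the output flag to zero at the top of the apply method.
        in_burst = 0;
        if (ig_md.timestamp - val.timestamp > BURST_INTERPACKET_DELAY) {
            // Too long since the previous packet: start a new burst.
            val.count = 1;
        } else {
            // This packet extends the current burst; flag it if the
            // burst has reached BURST_SIZE packets (counting this one).
            if (val.count >= BURST_SIZE - 1) {
                in_burst = 1;
            }
            val.count = val.count + 1;
        }
        val.timestamp = ig_md.timestamp;
    }
};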
This RegisterAction can then be called from an action and the return value tested to
see if this packet belongs to a burst of 5 or more packets. The above code shows a
few basic principles for structuring RegisterAction code that will minimize problems:
• Always have tests involving the values from memory before modifying them
• Only one modification of each memory item on any given path through the
code
• When outputting a flag, set it to zero at the top of the apply method, and set
it to one when the condition you want to test is true
The arithmetic ALUs support normal addition and subtraction operations. They also
support saturating addition and subtraction operations (see the P4_16 language
specification for the “|+|” and “|-|” operators). All combinations of bitwise logical
operations are supported, as well as min and max.
If the RegisterAction entry type is bit<1>, most of the stateful ALU functionality is
disabled. No comparisons or PHV reads are allowed, and the only operations allowed
are setting the memory bit value to 0 or to 1, along with optionally returning the
previous bit value. This is mainly useful for updating ActionSelectors or various
Bloom-filter-like constructs.
enum MathOp_t {
    MUL,  // x
    SQR,  // x^2
    SQRT, // sqrt(x)
    DIV,  // 1/x
    RSQR, // 1/x^2
    RSQRT // 1/sqrt(x)
}

extern MathUnit<T> {
    // Configure a math unit for use in a register action.
    MathUnit(MathOp_t op, int factor);    // configure as factor * op(x)
    MathUnit(MathOp_t op, int A, int B);  // configure as (A/B) * op(x)
    T execute(in T x);
}
A given Register can only have one MathUnit extern, so if there are multiple
RegisterActions sharing a Register, they can only use one MathUnit between them
(it may be used in more than one RegisterAction).
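As an illustration, a RegisterAction might use a MathUnit to halve a stored value on each execution (a sketch; instance names are illustrative):

Register<bit<32>, bit<16>>(32w1024) value_reg;
// Configured as (1/2) * x, i.e. divide the input by two.
MathUnit<bit<32>>(MathOp_t.MUL, 1, 2) halve;

RegisterAction<bit<32>, bit<16>, bit<32>>(value_reg) decay = {
    void apply(inout bit<32> value) {
        value = halve.execute(value);
    }
};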
7.14 Resubmit
Packet resubmission is a mechanism to repeat ingress processing on a packet. One
example where this can be useful is when parsing MPLS packets, where the first
header that appears after the last MPLS header might be Ethernet or IPv4. The first
time the packet is parsed, after parsing the last MPLS header, if the next 4 bits are the
value 4 in decimal, then the next header may be either:
• an IPv4 header
• an Ethernet header where the value of the most significant 4 bits of the
destination MAC address are the value 4 in decimal
There is not enough information in the packet contents to distinguish between these
cases. A good approach is to guess that the header is IPv4 during parsing and
proceed; if ingress processing later determines that the guess was wrong, the
packet can be resubmitted and parsed differently the second time.
All not-resubmitted packets begin in the ingress parser with resubmit_flag=0 in the
ingress intrinsic metadata. This includes packets from a front panel port, the CPU, the
packet generator, and recirculated packets. All resubmitted packets have
resubmit_flag=1, to enable the P4 code to distinguish resubmitted packets from the
original packet.
The input port holds a copy of all received packets until the end of ingress processing.
When your program resubmits a packet, the input port sends the buffered packet to
the ingress pipeline again, with only small modifications described below.
To enable your P4 program to use some additional facts about what happened during
the first pass of processing the packet, during that pass you may call the emit
method on an instance of the Resubmit extern and pass it an argument with type
header (a header type that you define) that is up to 64 bits long. The contents of this
header will overwrite the port metadata of the original packet, and this resubmit
header can be extracted by the ingress parser when the resubmitted packet is parsed.
Other than the value of the resubmit_flag field and the contents of the port
metadata header being overwritten by the resubmit header, the rest of the contents of
the original packet remain as they were the first time it was processed.
You may resubmit each received packet at most once. Attempting to perform a
resubmit operation on a packet with resubmit_flag=1 causes the packet to be
dropped.
To resubmit a packet, you must assign a value in the range 0 through 7, inclusive, to
the resubmit_type intrinsic metadata field. The resubmit_type field has field validity
(see Section 5.3) and is initially invalid at the beginning of ingress processing. If at
one point in your ingress P4 code you assign a value to resubmit_type, and then later
want to undo the decision to resubmit the packet, call the invalidate extern function
on the resubmit_type field.
The Resubmit extern may only be instantiated in the ingress deparser. The Resubmit
extern provides two emit methods. Invoking the 0-argument emit method will create
a resubmitted packet with all zeroes in the resubmit header. The 1-argument emit
method takes an argument resub_hdr with type header. Invoking the 1-argument
emit method will create a resubmitted packet with resub_hdr as the contents of its
resubmit header.
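For reference, a sketch of the Resubmit extern and the ingress deparser pattern (extern shape per the tofino.p4 include file; metadata field names are illustrative):

extern Resubmit {
    Resubmit();
    // Resubmit the packet with an all-zero resubmit header.
    void emit();
    // Resubmit the packet with resub_hdr (up to 64 bits) as its
    // resubmit header.
    void emit<T>(in T resub_hdr);
}

Resubmit() resubmit;

apply {
    // Resubmit the packet if resubmit_type was set during ingress.
    if (ig_dprsr_md.resubmit_type == 1) {
        resubmit.emit(ig_md.resub_hdr);
    }
    // the rest of the ingress deparser code goes here
    pkt.emit(ig_hdr.ethernet);
}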
If the packet is resubmitted, no packet will be sent to the Traffic Manager. All calls to
pkt.emit in the code example above will be executed, but the value of pkt is
discarded.
The resubmit operation is only supported for ports in quads 0 through 15 of each
Intel Tofino pipe (see Section 12). It is not supported on ports in quads 16 or 17.
9 Packet Generator
When control plane software enables a packet generator application, it must
specify what kind of event (called a trigger) will start the process of packet
generation. The following four trigger types are available:
• One-time timer
• Periodic timer
• Port down
• Packet recirculation
Each packet generator also has a 16 KByte buffer that is shared by all of its
applications. The control plane configures the contents of this buffer. Each application
is configured with a starting byte offset O and length L in bytes within this buffer. All
packets created by the packet generator begin with a 6-byte pktgen header, followed
by L bytes copied from this buffer, starting at offset O, as configured for the
application that generated the packet.
Packets generated by the packet generator will enter the corresponding pipe through
the ports they are attached to as normal (non-resubmitted) packets.
The control plane configures the following parameters for any application that it
wishes to generate packets when the specified event occurs.
• the number of batches B to create, up to 64K (the current control plane API
requires you to supply the maximum batch id B-1)
• the time interval IBT between batch starts (in nanoseconds)
• the number of packets N to generate per batch, up to 64K (the current control
plane API requires you to supply the maximum packet id N-1)
• the time interval IPT between creating packets in the same batch (in
nanoseconds)
• the starting byte offset O and length L in bytes within the packet generator’s
buffer.
The pktgen header identifies the generated packet; it always contains a packet_id
field and, for most triggers, a batch_id field. The exact header format for each
trigger type is shown in the following sections.
For the simplest case of B=1 and N=1, the only packet created will have batch_id
and packet_id equal to 0 (recall that the control plane API requires you to configure a
value 1 less for B and N, so both 0 for this case).
In general, B*N packets will be created for each trigger event. These B*N packets
differ only in the time they are created, and the values in their batch_id field (if they
have one) and their packet_id field. The following pseudocode describes when each
packet is created, and what values of batch_id and packet_id will be used to create
that packet.
next_batch_start_time = trigger_event_time;
for (batch_id = 0; batch_id < B; batch_id++) {
next_packet_time = max(time_now, next_batch_start_time);
for (packet_id = 0; packet_id < N; packet_id++) {
wait until time next_packet_time;
// Note: do not wait at all if next_packet_time is in the past
create packet with current batch_id and packet_id values;
next_packet_time += IPT;
}
next_batch_start_time += IBT;
}
Note that it is possible to configure IPT to a time interval smaller than the time
required to create and transmit one packet. If it is configured this way, the
pseudocode statement “wait until time next_packet_time” will not wait any time at all,
because next_packet_time will already have passed. In this case, packets within a
batch will have no idle time gap between them.
Similarly, it is possible to configure IBT to a time interval smaller than the time to
create one batch’s worth of packets. In this case, any later batch will begin
immediately after the last packet of the previous batch was created.
The Intel Tofino ASIC also has configuration options where the IBT and IPT values can
be pseudo-randomly generated within a configured range of values.
The pktgen header for packets generated with a one-time timer trigger contains the
following fields:2
header pktgen_timer_header_t {
    bit<2>  pipe_id;   // Pipe id
    bit<3>  app_id;    // Application id
    bit<16> batch_id;  // Starts at 0 and increments to a
                       // programmed number
    bit<16> packet_id; // Starts at 0 and increments to a
                       // programmed number
}
For a periodic timer trigger, the control plane additionally configures:
• the time interval T between consecutive trigger firing events for this
application
The pktgen header for packets generated with a periodic timer trigger is the same as
the one described in Section 9.2.
When a port that is in the same pipe as the packet generator changes state from up to
down, the application’s trigger fires once, generating B batches of N packets each as
described in Section 9.1. The application is still enabled, but cannot fire again for the
same port, until the control plane enables it to fire again for that port.
The control plane must make a call for each (application id, port number)
combination to enable a port down event on that port number to cause packet
generation. This call is necessary for the trigger to fire the first time, and it
must be repeated after the trigger has fired for a given port number; otherwise
the trigger will not fire again for that port number.
2
Padding fields have been omitted from the definition of pktgen_timer_header_t for brevity.
header pktgen_port_down_header_t {
    bit<2>   pipe_id;   // Pipe id
    bit<3>   app_id;    // Application id
    PortId_t port_num;  // Port number
    bit<16>  packet_id; // Starts at 0 and increments to a
                        // programmed number
}
The port_num field contains the full port number, including the pipe id in the most
significant bits, as described in Section 12. Note that while a port down trigger can
generate more than one batch of packets, there is no batch_id field in the pktgen
header.
The packet generator examines all packets that are recirculated on its port (but not
any other recirculation ports). Let R be the first 32 bits of a recirculated packet as it
was created by the egress deparser, just before it was recirculated. R[31:24] is the
first byte, and R[7:0] is the fourth byte. For the packet generator, the first byte of a
recirculated packet does not include the ingress intrinsic or port metadata shown in
Section 5.1.
The control plane configures a 32-bit match value V and mask M for the
application. If (R & M) is equal to (V & M), the recirculated packet matches and
the trigger fires, initiating the creation of B batches of packets as described in
Section 9.1. The application remains enabled and will cause additional trigger
events if later recirculated packets match the configured value/mask, until the
control plane disables it.
The pktgen header for packets generated with a recirculation trigger contains the
following fields (note that the padding fields have been omitted from the definition of
pktgen_recirc_header_t for brevity).
header pktgen_recirc_header_t {
    bit<2>  pipe_id;   // Pipe id
    bit<3>  app_id;    // Application id
    bit<24> key;       // Key from the recirculated packet
    bit<16> packet_id; // Starts at 0 and increments to a
                       // programmed number
}
Regardless of whether these features are enabled, all time stamp metadata fields
within a single Intel Tofino device are synchronized with each other and can be used
to make precise latency measurements between packets reaching different points
within the same device.
For example, in egress processing one could use a formula like the following to
calculate the time interval starting when a packet began ingress parsing, until the
packet began egress parsing, in nanoseconds. Most of the variation in such a latency
measurement would be in the time the packet spent waiting in a queue in the Traffic
Manager.
eg_prsr_md.global_tstamp[31:0] - ig_prsr_md.global_tstamp[31:0]
where ig_prsr_md is the intrinsic metadata struct described in Section 5.2, and
eg_prsr_md is the struct described in Section 5.9. Making this calculation requires
carrying the value of ig_prsr_md.global_tstamp[31:0] with packets from ingress to
egress, e.g. in a bridge header (see Section 5.8).
11.1 @flexible
The annotation @flexible lets the compiler choose the bit-level format of a
header. Since the changes made by the P4 compiler to such a header’s format cannot
be predicted by the P4 developer, such headers should not be sent to an external
device. The other device will have no way to know the bit-level format of a
flexible header. Flexible headers are only intended to be used for headers sent
and received within the same Intel Tofino device. For example:
• In bridge metadata headers for unicast and multicast packets sent from
ingress to egress
• In mirrored packets
• In recirculated packets
11.2 @padding
The annotation @padding may be used on a field of a header or struct. Such a field’s
value can be changed at any time by the compiler. It is typically used on fields
whose only reason for being included is to make the total length of the field and
the field that follows it a multiple of 8 bits, or the size of one PHV container.
In many cases, the compiler can produce more efficient results if it knows that it
can overwrite these fields with arbitrary values at any time.
Note: using the @padding annotation on fields within headers that are transmitted on
an external Ethernet port can cause arbitrary bit patterns to be filled in for this field,
which will be visible to the device receiving such packets.
Table 7: Intel Tofino Part Numbers and Port Numbers

BFN-T10-032D         2    20    32    128
BFN-T10-032D-024     2    20    24     96
BFN-T10-032D-020     2    20    20     80
BFN-T10-032D-018     2    20    18     72
Within a pipe, ports are organized in 18 quads. Quads 0 through 15 are often
connected to front panel ports. Quad 16 is connected either to a CPU port or a
recirculation port, depending upon which pipe it is in (see the tables below). Quad 17
is connected to a packet generator and can also be used as a recirculation port.
When quad number Q within a pipe is configured with 1, 2, or 4 lanes, the port
number within the pipe of its first lane is always 4*Q. The port numbers within
the pipe of the remaining lanes depend upon the number of lanes:
• For 1 lane, the quad has one port number within the pipe: 4*Q.
• For 2 lanes, the quad has two port numbers within the pipe: 4*Q and 4*Q+2.
• For 4 lanes, the quad has four port numbers within the pipe: 4*Q, 4*Q+1,
4*Q+2, and 4*Q+3.
For a port number A within a pipe, in pipe number P, the full (devport) port
number is (128*P + A). The tables below show all port numbers in 4-pipe and 2-pipe
Intel Tofino part numbers.
Note 1: For Intel Tofino part number BFN-T10-032Q, all ports in pipes 2 and 3, quads
0 through 15, are recirculation ports.
Pipe        Quads 0..15        Quad 16        Quad 17
For a port number A within a pipe (as described in the previous section), in pipe
number P, the mcport number is (72*P + A).
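For example, if quad 5 of pipe 3 is configured with 4 lanes, its third lane has port number 4*5 + 2 = 22 within the pipe, devport number 128*3 + 22 = 406, and mcport number 72*3 + 22 = 238.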
Everywhere in this document that port numbers are involved, it will be explicitly
stated if it must be an mcport number. Otherwise it should be a devport number.