Intel® QuickAssist Technology Performance Optimization Guide
Revision 008
December 2021
Figures
Figure 1. Packet Decrypt and Encrypt
Revision History

330687, Revision 008 (December 2021): Updated the document with the new Intel logo.
330687, Revision 005 (September 2018): Removed references to using the coalescing timer and to Intel® Communications Chipset 8900 to 8920 Series.
330687, Revision 004 (January 2017): Updates to interrupt and epoll modes; other minor technical changes.
330687, Revision 003 (October 2015): Minor updates throughout. Added Section 3.2.3, Epoll Mode, and Section 3.2.4, Recommendations. Updated Section 4.2.6, Reducing Asymmetric Service Memory Usage.
1 Introduction
This performance optimization guide for Intel® QuickAssist Technology can be used both
during the architecture/design phases and the implementation/integration phases of a project
that involves the integration of the Intel® QuickAssist Technology software with an application
stack.
Accordingly, the guide is divided into two main sections:
• Software Design Guidelines – Architecture and design guidelines on how best to integrate
the Intel® QuickAssist Technology software into the application software stack. Trade-
offs between various design choices are described together with recommended
approaches.
• Application Tuning – Guidelines to further increase the performance of Intel® QuickAssist
Technology in the context of a full application.
The intended audience for this document includes software architects, developers and
performance engineers.
In this document, for convenience:
“Acceleration drivers” is used as a generic term for the software that allows the QuickAssist
Software Library APIs to access the Intel® QuickAssist Accelerator(s) integrated in the
following devices:
− Intel® Communications Chipset 8900 to 8920 Series
− Intel® Communications Chipset 8925 to 8955 Series
− Intel® Atom® processor C2000 product family
− Intel® Atom® processor C3000 product family
− Intel® C620 Series Chipsets
− Intel® Xeon® D-1500 processor
− Intel® Xeon® D-2100 processor
1.1 Terminology
Table 1. Terminology

C-States: CPU idle power states used to reduce power consumption when cores are idle.
ECDH: Elliptic Curve Diffie-Hellman.
IA: Intel® architecture CPU.
Intel® SpeedStep® Technology: An advanced means of enabling very high performance while also meeting the power-conservation needs of mobile systems.
LAC: LookAside Crypto.
Latency: The time between the submission of an operation via the QuickAssist API and the completion of that operation.
MSI: Message Signaled Interrupts.
NUMA: Non-Uniform Memory Access.
Offload Cost: The cost, in CPU cycles, of driving the hardware accelerator. This cost includes the cost of submitting an operation via the Intel® QuickAssist API and the cost of processing responses from the hardware.
PCH: Platform Controller Hub.
PKE: Public Key Encryption.
Throughput: The accelerator throughput, usually expressed in terms of either requests per second or bytes per second.
The acceleration driver interfaces to the hardware via hardware-assisted rings. These rings are
used as request and response rings. Rings are grouped into banks (16 rings per bank). Request
rings are used by the driver to submit requests to the accelerator and response rings are used
to retrieve responses back from the accelerator. The availability of responses can be indicated
to the driver using either interrupts or by having software poll the response rings.
At the Intel® QuickAssist Technology API, services are accessed via “instances.” A set of rings
is assigned to an instance and so any operations performed on a service instance will involve
communication over the rings assigned to that instance.
Each guideline will highlight its impact on performance. Specific performance numbers are not
given in this document since exact performance numbers depend on a variety of factors and
tend to be specific to a given workload, software, and platform configuration.
Software can either periodically poll the hardware accelerator for responses or enable the
generation of an interrupt when responses are available. Interrupt or polling mode can be
configured per instance via the platform-specific configuration file. Configuration parameter
details are available in the Programmer’s Guide for your platform (refer to Table 1).
The properties and performance characteristics of each mode are explained below followed by
recommendations on selecting a configuration.
3.1.1 Interrupt Mode
To reduce the number of interrupts generated, and hence the number of CPU cycles spent
processing interrupts, multiple responses can be coalesced together. The presence of the
multiple responses can be indicated via a single coalesced interrupt rather than having an
interrupt per response. The number of responses that are associated with a coalesced
interrupt is determined by an interrupt coalescing timer. When the accelerator places a
response in a response ring, it starts an interrupt coalescing timer. While the timer is running,
additional responses may be placed in the response ring. When the timer expires, an interrupt
is generated to indicate that responses are available. Details on how to configure interrupt
coalescing are available in the Programmer’s Guide for your platform (refer to Table 1).
Since interrupt coalescing is based on a timer, there is some variability in the number of
responses that are associated with an interrupt. The arrival rate of responses is a function of
the arrival rate of the associated requests and of the request size. Hence, the timer
configuration needed to coalesce X large requests is different from the timer configuration
needed to coalesce X small requests. It is recommended that the timer is tuned based on the
average expected request size.
The choice of timer configuration impacts throughput, latency, and offload cost:
• Configuring a very short timer period maximizes the throughput through the accelerator,
minimizing latency, but will increase the offload cost since there will be a higher number of
interrupts and hence more CPU cycles spent processing the interrupts. If this interrupt
processing becomes a performance bottleneck for the CPU, the overall system
throughput will be impacted.
• Configuring a very long timer period leads to reduced offload cost (due to the reduction in
the number of interrupts) but increased latency. If the timer period is very long and causes
the response rings to fill, the accelerator will stall and throughput will be impacted.
The appropriate coalescing timer configuration will depend on the characteristics of the
application. It is recommended that the value chosen is tuned to achieve optimal performance.
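For reference, the bank-level coalescing controls in some platform-specific configuration files look similar to the following. The parameter names and values shown here are illustrative assumptions; they vary by platform and driver release, so confirm the exact names in the Programmer’s Guide for your platform.

    # Illustrative bank-level interrupt coalescing settings
    # (names and units vary by platform; see the Programmer's Guide)
    InterruptCoalescingEnabled = 1
    InterruptCoalescingTimerNs = 10000
    InterruptCoalescingNumResponses = 0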
When using interrupts with the user space Intel® QuickAssist Technology driver, there is
significant overhead in propagating the interrupt to the user space process that the driver is
running in. This leads to an increased offload cost. Hence it is recommended that interrupts
should not be used with the user space Intel® QuickAssist Technology driver.
3.1.2 Polling Mode
The frequency of polling is a key performance parameter that should be fine-tuned based on
the application. This parameter has an impact on throughput, latency and on offload cost:
• If the polling frequency is too high, CPU cycles are wasted if there are no responses
available when the polling routine is called. This leads to an increased offload cost.
• If the polling frequency is too low, latency is increased and throughput may be impacted if
the response rings fill causing the accelerator to stall.
The choice of threading model has an impact on performance when using a polling approach.
There are two main threading approaches when polling:
• Creating a polling thread that periodically calls the polling API. This model is often the
simplest to implement, allows for maximum throughput, but can lead to increased offload
cost due to the overhead associated with context switching to/from the polling thread.
• Invoking the polling API and submitting new requests from within the same thread. This
model is characterized by having a “dispatch loop” that alternates between submitting
new requests and polling for responses. Additional steps are often included in the loop
such as checking for received network packets or transmitting network packets. This
approach often leads to the best performance since the polling rate can be easily tuned to
match the submission rate so that throughput is maximized and offload cost is minimized
(see the sketch below).
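The following is a minimal sketch of such a dispatch loop in C, assuming a user space cryptographic instance configured for polling. icp_sal_CyPollInstance() is the user space polling API; submit_pending_requests() is a hypothetical application routine, not part of the Intel® QuickAssist Technology API.

    #include "cpa.h"
    #include "icp_sal_poll.h"

    /* Hypothetical application routine that builds and submits new
     * requests on this instance (for example, from a packet queue). */
    extern void submit_pending_requests(CpaInstanceHandle cyInstance);

    /* Dispatch loop: alternate between submitting new requests and
     * polling for responses on the same thread. */
    void dispatch_loop(CpaInstanceHandle cyInstance, volatile int *running)
    {
        while (*running)
        {
            submit_pending_requests(cyInstance);

            /* A response quota of 0 processes all available responses.
             * Completions are delivered via the callbacks registered
             * at session initialization. */
            (void)icp_sal_CyPollInstance(cyInstance, 0);
        }
    }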
3.1.3 Epoll Mode
Epoll mode can only be used in user space. The following must be considered if opting to use
this mode (for example, over the standard polling mode in user space).
Epoll mode has two parts, and the kernel space part uses the legacy interrupt mode. Any delay
in the kernel interrupt (for example, introduced by changing the coalescing fields) therefore
causes a corresponding increase in the latency of delivering the event to user space.
The thread waiting for an event in epoll mode does not consume CPU time, but this latency
can impact performance. For higher packet loads, where the wait time for the next packet is
insignificant, polling mode is recommended.
You are limited to one instance (and one process) per bank in epoll mode.
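The sketch below shows the general shape of an epoll-mode event loop in user space. It assumes the driver exposes a file descriptor per instance via icp_sal_CyGetFileDescriptor(); verify the availability and exact name of this call in the documentation for your driver release.

    #include <sys/epoll.h>
    #include "cpa.h"
    #include "icp_sal_user.h"
    #include "icp_sal_poll.h"

    /* Epoll-mode sketch: sleep (consuming no CPU) until the kernel
     * signals that responses are available, then poll the instance. */
    void epoll_loop(CpaInstanceHandle cyInstance, volatile int *running)
    {
        int instFd = -1;
        int epFd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN };

        if (epFd < 0 ||
            icp_sal_CyGetFileDescriptor(cyInstance, &instFd) !=
                CPA_STATUS_SUCCESS)
            return;
        epoll_ctl(epFd, EPOLL_CTL_ADD, instFd, &ev);

        while (*running)
        {
            struct epoll_event event;
            if (epoll_wait(epFd, &event, 1, -1) > 0)
                (void)icp_sal_CyPollInstance(cyInstance, 0);
        }
    }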
3.1.4 Recommendations
Polling mode tends to be preferred in cases where traffic is steady (such as packet processing
applications) and can result in a minimal offload cost. Epoll mode is preferred for cases where
traffic is bursty, as the application can sleep until there is a response to process.
From a throughput and latency perspective, there is no difference in performance between the
Data Plane API and the traditional API.
From an offload cost perspective, the Data Plane API uses significantly fewer CPU cycles per
request than the traditional API; for example, the cryptographic Data Plane API has a lower
offload cost than the traditional cryptographic API.
Note: One constraint with using the Data Plane API is that interrupt mode is supported only if one
bank is served by only one thread.
Using the Data Plane API, batches of requests can be submitted to the accelerator using either
the cpaCySymDpEnqueueOp() or cpaCySymDpEnqueueOpBatch() API calls for the
symmetric cryptographic data plane API and using either the cpaDcDpEnqueueOp() or
cpaDcDpEnqueueOpBatch() API calls for the compression data plane API. In all cases,
requests are only submitted to the accelerator when the performOpNow parameter is set to
CPA_TRUE.
It is recommended to use the batch submission mode of operation where possible to reduce
offload cost.
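The following C sketch illustrates both submission patterns for the symmetric cryptographic data plane API, assuming an array of fully prepared CpaCySymDpOpData request descriptors (descriptor setup is omitted).

    #include "cpa.h"
    #include "cpa_cy_sym_dp.h"

    /* Submit a batch of pre-built requests in a single call, ringing
     * the hardware doorbell immediately (performOpNow = CPA_TRUE). */
    CpaStatus submit_batch(CpaCySymDpOpData *ops[], Cpa32U numOps)
    {
        return cpaCySymDpEnqueueOpBatch(numOps, ops, CPA_TRUE);
    }

    /* Alternative: enqueue requests one at a time, deferring the
     * doorbell write until the final request so its cost is amortized
     * across the whole batch. */
    CpaStatus submit_one_by_one(CpaCySymDpOpData *ops[], Cpa32U numOps)
    {
        CpaStatus status = CPA_STATUS_SUCCESS;
        Cpa32U i;

        for (i = 0; (i < numOps) && (status == CPA_STATUS_SUCCESS); i++)
        {
            CpaBoolean last = (i == numOps - 1) ? CPA_TRUE : CPA_FALSE;
            status = cpaCySymDpEnqueueOp(ops[i], last);
        }
        return status;
    }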
With synchronous mode, the traditional Intel® QuickAssist Technology API will block and not
return to the calling code until the acceleration operation has completed.
With asynchronous mode, the traditional or Data Plane Intel® QuickAssist Technology API will
return to the calling code once the request has been submitted to the accelerator. When the
accelerator has completed the operation, the completion is signaled via the invocation of a
callback function.
From a performance perspective, the accelerator requires multiple inflight requests per
acceleration engine to achieve maximum throughput. With synchronous mode of operation,
multiple threads must be used to ensure that multiple requests are inflight. The use of multiple
threads introduces an overhead of context switching between the threads which leads to an
increase in offload cost.
Hence, the use of asynchronous mode is the recommended approach for optimal
performance.
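As an illustration, the sketch below shows the asynchronous pattern with the traditional symmetric API: a completion callback matching the CpaCySymCbFunc prototype is registered when the session is initialized, and cpaCySymPerformOp() then returns as soon as the request is submitted. Session and buffer setup are omitted.

    #include "cpa.h"
    #include "cpa_cy_sym.h"

    /* Completion callback, invoked by the driver when the accelerator
     * has finished the operation (matches CpaCySymCbFunc). */
    static void symCallback(void *pCallbackTag, CpaStatus status,
                            const CpaCySymOp operationType, void *pOpData,
                            CpaBufferList *pDstBuffer,
                            CpaBoolean verifyResult)
    {
        /* Mark the request identified by pCallbackTag as complete,
         * e.g., by signaling the submitting thread. */
        (void)status; (void)operationType; (void)pOpData;
        (void)pDstBuffer; (void)verifyResult; (void)pCallbackTag;
    }

    /* The callback is registered at session creation:
     *   cpaCySymInitSession(cyInstance, symCallback,
     *                       &sessionSetupData, sessionCtx);
     * Subsequent cpaCySymPerformOp() calls return once the request is
     * submitted; completion is reported via symCallback, so a single
     * thread can keep many requests in flight. */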
Note: Specific performance numbers are not given in this document since exact performance
numbers depend on a variety of factors and tend to be specific to a given workload, software
and platform configuration.
When using the Data Plane API, it is possible to pass a flat buffer to the API instead of a buffer
list. This is the most efficient usage of system resources (mainly PCIe* bandwidth) and can
lead to lower latencies compared to using buffer lists.
It is recommended that the maximum number of concurrent requests is tuned to achieve the
correct balance between memory usage, throughput and latency for a given application. As a
guide, the maximum number configured should reflect the peak request rate that the
accelerator must handle.
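The maximum number of concurrent requests is set per instance in the platform-specific configuration file. The parameter names below are illustrative assumptions (they vary by platform and driver release); refer to the Programmer’s Guide for the exact names.

    # Illustrative per-instance settings
    Cy0NumConcurrentSymRequests = 512
    Cy0NumConcurrentAsymRequests = 64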
From a performance perspective, the cost of maintaining the state and the serialization
between the partial requests in a session has a negative impact on offload cost and throughput.
To maximize performance when using partial operations, multiple symmetric cryptographic
sessions must be used to ensure that sufficient requests are provided to the hardware to keep
it busy.
For optimal performance, it is recommended to avoid the use of partial requests if possible.
There are some situations where the use of partials cannot be avoided because the need to
maintain state between operations is inherent in the higher-level protocol (for example, the
use of the RC4 cipher with an SSL/TLS protocol stack).
If you are limited in the number of instances and want to run several different algorithms or
change keys for another session, one approach is to de-initialize the session and create a new
one. However, this approach impacts performance because it involves buffer disposal,
de-initialization, and so on.
Instead, the session can be reused by updating only the direction (encryption/decryption), the
key, or the symmetric algorithm to be used. This method does not dispose of buffers and can
significantly reduce the CPU cycles consumed.
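Some driver releases expose a session-update call for this purpose. The sketch below assumes a cpaCySymUpdateSession() API with an update descriptor that selects which fields to change; the exact function, structure, and flag names must be verified against the Cryptographic API Reference for your release.

    #include "cpa.h"
    #include "cpa_cy_sym.h"

    /* Session-reuse sketch (assumed API; verify names against the
     * Cryptographic API Reference for your driver release). */
    CpaStatus rekey_session(CpaCySymSessionCtx sessionCtx, Cpa8U *pNewKey)
    {
        CpaCySymSessionUpdateData updateData = {0};

        /* Request that only the cipher key be updated; the session,
         * its buffers, and the rest of its state are left in place. */
        updateData.flags = CPA_CY_SYM_SESUPD_CIPHER_KEY;
        updateData.pCipherKey = pNewKey;

        return cpaCySymUpdateSession(sessionCtx, &updateData);
    }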
For example, for a top-SKU Intel® Communications Chipset 8900 to 8920 Series device,
which has four cryptographic engines and two compression engines, a minimum of four
cryptographic service instances and two compression service instances are required to
maximize performance.
Note: Intel® Communications Chipset 8925 to 8955 Series and later products have multiple
cryptographic and compression engines, but the hardware can load balance and provide full
performance using only one service instance for cryptographic operations and one service
instance for compression operations.
It is also recommended to assign each service instance to a separate CPU core to balance the
load across the CPU and to ensure that there are sufficient CPU cycles to drive the
accelerators at maximum performance.
When using interrupts, the core affinity settings within the configuration file should be used to
steer the interrupts for a service instance to the appropriate core.
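For example, instances can be pinned to cores in the platform-specific configuration file with affinity parameters similar to the following (illustrative names; consult the Programmer’s Guide for your platform):

    # Pin service instance 0 to core 0 and instance 1 to core 1
    Cy0CoreAffinity = 0
    Cy1CoreAffinity = 1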
Detailed guidelines on load balancing and how to ensure maximum use of the available
hardware capacity are available in the Programmer’s Guide for your platform (refer to Table 1).
4 Application Tuning
This chapter describes techniques you may employ to optimize your application.
For optimal performance, all data passed to the Intel® QuickAssist Technology engines should
be aligned to 64B. The Intel® QuickAssist Technology Cryptographic API Reference and the
Intel® QuickAssist Technology Data Compression API Reference manuals (refer to Table 2)
document the memory alignment requirements of each data structure submitted for
acceleration.
Note: The driver, firmware, and hardware handle unaligned payload memory without any functional
issue, but performance will be impacted.
Note: It is common that packet payloads will not be aligned on a 64B boundary in memory, as the
alignment usually depends upon which packet headers are present. In general, the mitigation
for handling this is to adjust the buffer pointer, length and cipher offsets passed to hardware to
make the pointer aligned. This works on the assumption that there is a point in the packet,
before the payload, that is 64B aligned. See Figure 1 for an illustration of adjusted alignment
in the context of encrypt/decrypt of an IPsec packet.
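A minimal C sketch of the pointer/offset adjustment described in the note above is shown below; the function and variable names are illustrative.

    #include <stdint.h>

    /* Round the payload pointer down to the previous 64B boundary and
     * add the difference to the cipher offset, so the hardware still
     * operates on the same bytes. Assumes some point in the packet
     * before the payload is 64B aligned. */
    void align_for_hw(uint8_t *payload, uint32_t cipherOffset,
                      uint8_t **pHwPtr, uint32_t *pHwOffset)
    {
        uintptr_t addr = (uintptr_t)payload;
        uintptr_t aligned = addr & ~(uintptr_t)63;

        *pHwPtr = (uint8_t *)aligned;
        *pHwOffset = cipherOffset + (uint32_t)(addr - aligned);
    }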
Note: With the Intel® Communications Chipset 8900 to 8920 Series, to use Wireless Firmware you
must enable the dc service, even though it is not used.
Note: This section is relevant only for Intel® Communications Chipset 8900 to 8920 Series and
Intel® Atom® processor C2000 product family software.
Embedded SRAM can be used to reduce the PCIe* transactions that occur during dynamic
compression sessions. This will have a small positive impact on performance, but also frees up
PCIe* bandwidth for other services.
As described in Section 3.1.2, Polling Mode, the polling frequency impacts latency, offload
cost, and throughput. That section also describes two ways of polling:
• Polling via a separate thread.
• Polling within the same context as the submit thread.
With option 1, there is limited control over the poll interval, unless a real time operating system
is employed. With option 2, the user can control the interval to poll based on the number of
submissions made.
Whichever method is employed, the user should start with a high frequency of polling (a short
polling interval); this will ensure maximum throughput is achieved. Gradually increase the
polling interval until the throughput starts to drop. The polling interval just before the
throughput drops is the optimal setting for throughput and offload cost.
This method is only applicable where the submit rate is relatively stable and the average packet
size does not vary. To allow for variances, a larger ring size is recommended, but this in turn will
add to the maximum latency.
This section describes how to reduce the memory footprint required when using the
asymmetric cryptographic API.
The asymmetric cryptographic service requires a far larger memory pool compared to
symmetric cryptography and compression. The more logical instances that are defined in the
configuration file, the more memory is consumed by the driver. This memory usage may be
unnecessary if only the symmetric part of the cryptographic service is used. Alternatively, the
memory requirement may be reduced depending on the user’s requirements on the
asymmetric service.
Large values of the configuration file parameter described above increase memory
requirements, and defining more logical instances also increases memory usage. To minimize
memory usage, the user should make the following changes at build time:
• Set max_mr = 1. By default, the driver uses 50 rounds of Miller-Rabin to test primality; if
this amount of prime testing is not required, the max_mr environment variable can be set
at build time to reduce the number of rounds.
• In <ICP_ROOT>/quickassist/lookaside/access_layer/src/common/crypto/asym/include/lac_pke_utils.h,
set LAC_PKE_MAX_CHAIN_LENGTH = 1.