Signal Processing On Intel Architecture

Signal processing on the latest Intel processors is now a viable option due to continued improvements in multi-core architectures. This approach can be applied to many applications in aerospace (radar, sonar), communications infrastructure (baseband processing, transcoding) and healthcare (medical imaging). This paper describes an easy process that allows developers to quickly determine how fast the 2nd generation Intel(R) Core(TM) i7-2710QE processor will execute their signal processing algorithms.

Uploaded by

Rajani Panathala
Copyright
© Attribution Non-Commercial (BY-NC)

Signal Processing on Intel Architecture: Performance Analysis using Intel Performance Primitives

White Paper
Intel Advanced Vector Extensions (Intel AVX) Signal Processing Embedded Computing

Engineers can quickly determine whether Intel processor-based platforms with Intel Advanced Vector Extensions (Intel AVX) satisfy signal processing requirements.
Signal processing functions have often required special-purpose hardware such as DSPs and FPGAs. However, recent enhancements to Intel architecture processors are providing developers an alternative: execute signal processing workloads on an Intel processor. Signal processing on the latest Intel processors is now a viable option due to continued improvements in multi-core architectures. The increased parallelism from vector instructions, along with other continuing performance improvements, enables the efficient execution of data parallel workloads such as digital transforms and filters. Additionally, by consolidating signal processing functions with other workloads on a multi-core Intel processor, it is possible to save hardware cost, simplify the application development environment and reduce time to market. This approach can be applied to many applications in aerospace (radar, sonar), communications infrastructure (baseband processing, transcoding) and healthcare (medical imaging).

Umberto Santoni, Platform Architect, Embedded Communications Group
Thomas Long, Software Engineer, Embedded Communications Group

This paper describes an easy process that allows developers to quickly determine how fast the 2nd generation Intel Core i7-2710QE processor will execute their signal processing algorithms, based on performance data¹ that is relatively easy to obtain. The process is demonstrated with two simple examples: fast convolution and amplitude demodulation. The paper concludes by reviewing some of the development tools available to developers to conduct their own evaluations.

White Paper: Signal Processing on Intel Architecture: Performance Analysis using Intel Performance Primitives

Table of Contents
Why Intel Architecture for Signal Processing
SIMD Instructions Enhanced By Intel Advanced Vector Extensions (Intel AVX)
The Process for Evaluating Signal Processing Performance
Signal Processing Performance Data
    Overview of benchmark data
    A) Forward and inverse Fast Fourier Transform (FFT)
    B) 2D Complex to Complex FFT Throughput (GFLOPS/s and absolute time)
    C) Filter Execution Times
    D) Discrete Hilbert Transform
    E) Discrete Cosine Transform
Speedup with Intel Advanced Vector Extensions (Intel AVX)
Two Signal Processing Workload Examples
    Example 1: Fast Convolution using FFT
    Example 2: Discrete Envelope Detection / Amplitude Demodulation
Floating Point Speeds Development
Development Tools Overview
    Intel C++ Compiler
    Intel Math Kernel Library (Intel MKL)
    Intel Integrated Performance Primitives (Intel IPP)
    Intel VTune Performance Analyzer
    Intel Application Debugger
    Eclipse*-based Integrated Development Environment
Consider Intel Architecture Processors for Signal Processing
Appendix A: Test Configuration
Acronyms


Why Intel Architecture for Signal Processing


There is a natural tendency to assume that just about any signal processing application requires a DSP, FPGA or ASP. That's because traditionally, it was necessary to use specialized hardware in order to satisfy performance objectives. However, for hybrid designs utilizing a mix of specialized signal processing algorithms and a broader set of applications, implementing two separate computing architectures may pose some significant disadvantages, such as:
- Hardware and board space for two computing systems: higher product cost
- Multiple tool chains: additional technical training and project management complexity
- Multiple code bases: larger software management effort
- Power consumption for two computing systems: more expensive thermal design
- Intersystem communication: greater design complexity or possibility for bottlenecks
- Time to market: extra time needed to design, validate and integrate subsystems
- Two development teams: unique communication challenges (e.g., silos)

One alternative is to perform signal processing workloads on an existing Intel architecture processor in the system. Workload consolidation is a powerful concept that has been delivering significant payoffs in datacenters with respect to reduced server cost, power consumption and footprint. This is made possible by multi-core processors with scalable, efficient performance, coupled with significant memory and I/O bandwidth. This consolidation approach can be equally powerful in embedded systems, addressing issues around cost, software complexity, power, time and communication.

Often, performance efficiency is foremost on the minds of embedded system developers when running signal processing workloads. This is discussed in the next section, which presents performance data for key signal processing kernels running on 2nd generation Intel Core i7 processors.

Yet, for most embedded applications, raw performance isn't the only factor; it is also necessary to meet overall system cost goals, and the highly scalable family of embedded Intel architecture processors helps to do just that. The embedded Intel processor roadmap gives developers a wide choice with respect to the number of cores, cache and system memory size, I/O and footprint. In addition, there are many other technologies available for enhancing system capabilities, like virtualization technology, remote management and various security features. Nevertheless, it is the enhanced vector single-instruction, multiple-data (SIMD) instructions that open the door to using Intel architecture processors for signal processing.

More specifically, the Intel Advanced Vector Extensions (Intel AVX) available for the first time with 2nd generation Intel Core i7 processors provide significantly improved floating point performance (see sidebar). Engineers who code floating point algorithms for Intel architecture processors can leverage a mature software ecosystem that offers a very wide breadth and depth of development tools. Also available are Intel development tools and libraries that employ Intel AVX and Intel Streaming SIMD Extensions 4 (Intel SSE4) instructions. Equipment manufacturers can choose from many hardware vendors supplying commercial off-the-shelf (COTS) embedded boards and systems that support embedded lifecycles and benefit from the economics of the PC/server supply chain.

SIMD Instructions Enhanced By Intel Advanced Vector Extensions (Intel AVX)


Many signal processing applications are highly parallel, performing the same arithmetic operation on large sets of numbers. To speed up these workloads, single-instruction, multiple-data (SIMD) instructions were introduced in the mid 1990s. They perform the same operation on multiple data elements simultaneously, as illustrated below.

SIMD addition (one instruction, all elements at once):
    A:        2    3    5   11   20
    B:        9   11    2    1    5
    A + B:   11   14    7   12   25

The throughput of a SIMD instruction is a function of register size because larger registers translate into greater throughput. With the introduction of 2nd generation Intel Core i7 processors, the size of the 16 registers available for floating point operations doubles, increasing from 128 bits to 256 bits. Additionally, new three and four operand instructions establish a destination argument that results in fewer register copies, better register usage, faster execution and smaller code size. These are just some of the recent architectural enhancements, called Intel Advanced Vector Extensions (Intel AVX).
[Figure: the sixteen floating point registers (XMM0, XMM1, XMM2, ... XMM15) widen from 128 bits (Intel SSE4) to 256 bits (Intel AVX)]
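As a rough illustration of why register width matters, the sketch below computes how many single precision values fit in a 128-bit versus a 256-bit register and performs the element-wise addition from the SIMD illustration. The function names `simd_lanes` and `add_f32` are hypothetical and not part of any Intel library. The vector path uses standard compiler intrinsics and is only compiled when the compiler defines `__AVX__`; otherwise a scalar loop does the same work.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Number of elements of elem_bits width that fit in one SIMD register. */
int simd_lanes(int reg_bits, int elem_bits)
{
    assert(reg_bits > 0 && elem_bits > 0 && reg_bits % elem_bits == 0);
    return reg_bits / elem_bits;
}

/* Element-wise sum: out[i] = a[i] + b[i] for i in [0, n). */
void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#ifdef __AVX__
    /* Eight single precision additions per instruction (256 / 32 lanes). */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    /* Scalar loop: handles the tail, or everything when AVX is not enabled. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Calling `simd_lanes(128, 32)` and `simd_lanes(256, 32)` gives 4 and 8 lanes respectively, which is the doubling described above. In practice the Intel IPP functions discussed in this paper already select vectorized code paths, so hand-written intrinsics like these are only one of several options.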


The Process for Evaluating Signal Processing Performance


There are a number of ways to evaluate the performance of Intel processors, each requiring a different level of effort: using the Intel C++ and Intel Fortran Compilers, calling optimized performance libraries, or coding optimizations in assembly language and employing compiler intrinsics. The approach presented in this paper strikes a balance between effort and optimization to quickly achieve good estimates for the performance of signal processing algorithms. This is done in two ways.

The first is to utilize the Intel Integrated Performance Primitives (Intel IPP) library. It provides a quick way to assess the performance of hundreds of algorithms and math functions optimized for Intel architecture. The signal processing portion of the Intel IPP library includes over 250 functions, each often supporting multiple data types and in-place versus not in-place variants. There are also many image processing calls that are useful for DSP applications.

The Intel IPP library distribution includes a performance tool (with documentation) that allows developers to obtain performance metrics for all the functions in the library. For example, Figure 1 shows a shell script written to automatically collect performance data on the number of clocks, execution time, and in some cases MFLOPs, for a function. The .ini file in Figure 2 was used in conjunction with the shell script to report 2D FFT performance for specific input sizes.
#!/bin/bash
source /opt/intel/composerxe/bin/compilervars.sh intel64
DATE=`date +%Y-%m-%d`
OUTPUT_DIR=results_avx_${DATE}
OUTPUT_PATH=${PWD}/${OUTPUT_DIR}
PERF_TOOL_DIR=${IPPROOT}/tools/intel64/perfsys
FILE_EXT_1=_lin_avx_1
FILE_EXT_2=_lin_avx_2
mkdir ${OUTPUT_DIR}
#IPPS
${PERF_TOOL_DIR}/ps_ipps -r${OUTPUT_PATH}/ipps${FILE_EXT_1}.csv -o${OUTPUT_PATH}/ipps${FILE_EXT_1}.txt -N1 -YHIGH -TAVX -B
${PERF_TOOL_DIR}/ps_ipps -r${OUTPUT_PATH}/ipps${FILE_EXT_2}.csv -o${OUTPUT_PATH}/ipps${FILE_EXT_2}.txt -N1 -YHIGH -TAVX -B
#2DFFT
${PERF_TOOL_DIR}/ps_ippi -r${OUTPUT_PATH}/2dfft${FILE_EXT_1}.csv -o${OUTPUT_PATH}/2dfft${FILE_EXT_1}.txt -N1 -YHIGH -TAVX -B -i${PWD}/2dfft.ini -fippiFFTFwd_CToC_32fc_C
${PERF_TOOL_DIR}/ps_ippi -r${OUTPUT_PATH}/2dfft${FILE_EXT_2}.csv -o${OUTPUT_PATH}/2dfft${FILE_EXT_2}.txt -N1 -YHIGH -TAVX -B -i${PWD}/2dfft.ini -fippiFFTFwd_CToC_32fc_C

Figure 1. Sample shell script running IPP performance tool

[Perf System]
FFT_OrderXY=4x4; 5x5; 6x5; 6x6; 7x4; 7x5; 7x7; 8x3; 8x4; 8x6; 8x7; 8x8; 9x3; 9x5; 9x6; 9x8; 9x9; 10x4; 10x5; 10x7; 10x8; 10x10; 11x3; 11x4; 11x6; 11x7; 11x11; 11x12; 11x13; 11x14; 11x15; 12x3; 12x5; 12x6; 12x12; 13x4; 13x5; 13x13; 14x3; 14x4; 15x3; 15x4; 16x4; 17x3; 17x4;

Figure 2. Sample .ini file generating 2D FFT performance data

Another way to estimate signal processing performance is to focus the performance assessment on key kernels that are often used in signal processing workloads. By selecting a subset of the signal processing functions, developers can produce a manageable set of data that contains the most relevant functions and provides a reference with which to estimate the performance of other functions. For instance, this can be done by choosing forward and inverse FFTs of various sizes, both complex and real, along with FIR and IIR filters of varying complexities, and other useful functions such as discrete cosine and Hilbert transforms. Developers may certainly need data on functions other than the ones covered in this paper; in those cases, the Intel IPP performance tool can be used to gather the necessary data.

In summary, this process for evaluating the signal processing performance of Intel architecture utilizes Intel-collected performance data and gives developers a straightforward method to quickly estimate performance for their own workloads. Although the data provided is for the 2nd generation Intel Core i7-2710QE processor, the methods described here are extensible to the full range of Intel processors. With a manageable effort, this process gives developers a quick readout of the signal processing performance of next generation Intel processors and provides an estimate of how much general-purpose computing headroom is available for other applications. The next section reviews the performance data collected and demonstrates the process using two examples.

It is important to note that although using Intel IPP to assess the signal processing performance of Intel processors provides a good starting point that balances effort and optimization, it need not be the endpoint. Going beyond Intel IPP, it may be possible to capture significant performance improvements for specific algorithms through the use of compiler optimizations, primitives and assembly language programming. The flexibility of Intel architecture and its supporting software infrastructure gives developers all of these degrees of freedom, so they can ultimately identify the most appropriate tradeoff between performance and effort.

Signal Processing Performance Data


The following lists a sample of the signal processing performance data¹,² collected by Intel on 2nd generation Intel Core i7-2710QE processors. The algorithms were run on a single execution thread, on Linux* (Fedora* 13 distribution), and repeated until the results of successive iterations agreed to within 5 percent. Developers can create results for their own algorithms and functions of interest using an Intel compiler and the Intel IPP package, which includes sample code and the Intel IPP performance test tool. More details about the test configuration are provided in Appendix A.

This single threaded execution environment provides an approximation of the performance of a single core and an indication of available computing headroom. Processor cores not used for signal processing can be targeted at other algorithms or applications running on the system. Intel development tools and libraries also provide extensive support for development, validation and performance tuning of multi-threaded applications.

Overview of benchmark data

A. Forward and inverse Fast Fourier Transform (FFT)
Format: Complex-to-Complex, Real-to-Complex Conjugate Symmetric
Input range: 64B to 1024KB
Data type: Single and double precision floating point
Purpose: Decompose a discrete sequence into a set of frequencies.

Complex-to-complex FFT performance of the Intel IPP library is in the range of 16.1 to 23.7 single precision GFLOP/s for sizes between 64B and 4KB. Similarly, the performance of the real-to-complex conjugate symmetric FFT is in the range of 15 to 17.3 single precision GFLOP/s for sizes between 64B and 4KB. For larger sizes, FFT performance declines as the execution of the algorithm becomes less compute-bound and more memory-bound. Additionally, the FFT throughput of double precision floating point is approximately half that of single precision, and it scales with input size similarly to single precision.

Complex-To-Complex FFT

Input Size   SP Float (µsec)   DP Float (µsec)   SP Float (GFLOP/s)   DP Float (GFLOP/s)
64           0.08              0.15              23.7                 12.6
128          0.25              0.42              18.3                 10.6
256          0.53              0.94              19.3                 10.9
512          1.14              2.09              20.4                 11.1
1K           2.44              5.16              21.0                 10.0
2K           6.04              13.6              18.7                 8.3
4K           15.3              31.7              16.1                 7.8
8K           38.0              71.6              14.1                 7.5
16K          83.8              171               13.7                 6.7
32K          200               389               12.3                 6.3
64K          464               842               11.3                 6.2
128K         981               1970              11.4                 5.7
256K         2274              4966              10.4                 4.8
512K         5425              11450             9.2                  4.4
1024K        12800             25250             8.2                  4.2

Figure 3. Complex-to-Complex FFT and Inverse FFT
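The throughput and timing values for the complex FFT can be related with a short calculation. The sketch below assumes the common convention of counting 5·N·log2(N) floating point operations for an N-point complex FFT, and assumes that the charted times are in microseconds. Neither assumption is stated in the paper, but together they reproduce the charted values (for example, 1K points in 2.44 microseconds gives about 21.0 GFLOP/s). The function names are hypothetical.

```c
#include <assert.h>

/* Base-2 logarithm of a power-of-two FFT length (e.g. 1024 -> 10). */
unsigned log2_pow2(unsigned long n)
{
    unsigned order = 0;
    assert(n >= 2 && (n & (n - 1)) == 0);   /* power-of-two sizes only */
    while (n > 1) {
        n >>= 1;
        ++order;
    }
    return order;
}

/* Throughput implied by a measured time, using the ASSUMED operation
 * count of 5 * N * log2(N) for an N-point complex FFT. */
double fft_gflops(unsigned long n_points, double time_us)
{
    double flops = 5.0 * (double)n_points * (double)log2_pow2(n_points);
    assert(time_us > 0.0);
    /* flops / (time_us * 1e-6 s) / 1e9 simplifies to flops / (time_us * 1e3) */
    return flops / (time_us * 1.0e3);
}

/* Inverse relation: execution time implied by a given throughput. */
double fft_time_us(unsigned long n_points, double gflops)
{
    double flops = 5.0 * (double)n_points * (double)log2_pow2(n_points);
    assert(gflops > 0.0);
    return flops / (gflops * 1.0e3);
}
```

A calculation like this is mainly useful as a sanity check when reading throughput and time columns side by side, or when converting a throughput target for a given FFT size into a time budget.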

Real-To-CCS FFT

Input Size   SP Float (µsec)   DP Float (µsec)   SP Float (GFLOP/s)   DP Float (GFLOP/s)
128          0.14              0.23              16.4                 9.6
256          0.34              0.59              15.0                 8.7
512          0.73              1.24              15.7                 9.3
1K           1.51              2.72              17.0                 9.4
2K           3.26              6.39              17.3                 8.8
4K           7.56              16.3              16.3                 7.6
8K           18.4              37.5              14.5                 7.1
16K          44.7              85.1              12.9                 6.8
32K          98                197               12.6                 6.2
64K          229               447               11.5                 5.9
128K         528               965               10.6                 5.8
256K         1120              2295              10.6                 5.2
512K         2625              5525              9.5                  4.5
1024K        5989              12663             8.8                  4.1

Figure 4. Real-to-CCS (Complex Conjugate Symmetric) FFT


B. 2D Complex to Complex FFT Throughput (GFLOPS/s and absolute time)
Format: Complex-to-Complex, 2 dimensional array data
Input range: Various sizes ranging from 64B x 64B to 128KB x 16B
Data type: Single precision floating point
Purpose: Decompose a two dimensional array of discrete sequences into a set of frequencies.

Two dimensional complex-to-complex FFT performance of the Intel IPP library is in the range of 12.9 to 17.2 GFLOP/s for sizes ranging from 64B x 64B to 4KB x 64B. As with the one dimensional FFT case, 2D FFT performance declines for larger sizes as the FFT becomes more memory-bound. Table 1 contains sample data points of interest, and it is possible to generate 2D FFT throughput data for other sizes using the Intel IPP library.

Input Size    GFLOP/s   Time (µsec)
64 x 64       17.2      14.3
128 x 128     13.6      84.5
256 x 64      14.9      76.9
256 x 256     12.9      407
512 x 64      14.2      173
1K x 256      13.6      1744
2K x 128      13.0      1823
4K x 64       12.9      1834
8K x 32       11.3      2088
16K x 16      10.9      2163
32K x 16      8.5       5893
128K x 16     7.4       29838

Table 1. 2D Complex to Complex FFT Throughput (GFLOPS/s)
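To generate and interpret data points for other sizes, it helps to relate the order pairs in the .ini file to array dimensions and throughput. The sketch below assumes that each FFT_OrderXY entry is a pair of base-2 logarithms of the array dimensions (so 6x6 means 64 x 64), which is consistent with the sizes listed in Table 1. It also assumes an operation count of 5·N·log2(N), with N the total number of points, and times in microseconds. These assumptions are not stated in the paper, and the type and function names are hypothetical.

```c
#include <assert.h>

/* Array dimensions corresponding to one FFT_OrderXY entry. */
typedef struct {
    unsigned long size_x;
    unsigned long size_y;
} FftDims;

/* ASSUMPTION: an entry "AxB" means a 2^A by 2^B point array. */
FftDims dims_from_order(unsigned order_x, unsigned order_y)
{
    FftDims d;
    assert(order_x < 31 && order_y < 31);   /* keep the shifts in range */
    d.size_x = 1UL << order_x;
    d.size_y = 1UL << order_y;
    return d;
}

/* Throughput implied by a measured 2D FFT time, using the ASSUMED
 * operation count 5 * N * log2(N) with N = size_x * size_y. */
double fft2d_gflops(unsigned order_x, unsigned order_y, double time_us)
{
    FftDims d = dims_from_order(order_x, order_y);
    double points = (double)d.size_x * (double)d.size_y;
    double flops = 5.0 * points * (double)(order_x + order_y);
    assert(time_us > 0.0);
    return flops / (time_us * 1.0e3);       /* flops per µsec -> GFLOP/s */
}
```

For the 64 x 64 row, `fft2d_gflops(6, 6, 14.3)` evaluates to roughly 17.2, in line with Table 1.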

Speedup with Intel Advanced Vector Extensions (Intel AVX)


The improved performance from Intel Advanced Vector Extensions (Intel AVX) is illustrated in Figure 5, which shows the speedup compared to the prior generation Intel Streaming SIMD Extensions 4 (Intel SSE4) instructions. The comparison is for Complex-to-Complex FFT and Inverse FFT functions, which are averaged together and charted for both single and double precision floating point routines. For smaller input sizes, the speedup is over two times, and for very large inputs, the speedup is around 20 percent. The floating point performance improvements for 2nd generation Intel Core i7 processors are the result of architectural enhancements, which enable the processor to:
- Retire one floating point instruction per CPU clock cycle
- Dispatch up to 4 floating point instructions per CPU clock cycle

[Chart: speedup of Intel AVX vs. Intel SSE4 for Complex-to-Complex FFT and Inverse FFT (averaged). X-axis: input size, 64 to 1024K. Y-axis: speedup, 1 to 2.8 (higher is better). Series: single precision (SP) floating point and double precision (DP) floating point]

Figure 5. Intel Advanced Vector Extensions (Intel AVX) Speedup over Intel Streaming SIMD Extensions 4 (Intel SSE4)


C. Filter Execution Times
Format: Single precision floating point complex data, complex coefficients.
Finite Impulse Response Filter: 8, 32, and 128 taps
Infinite Impulse Response Filter: Orders ranging from 2 to 12
Inputs: Complex data ranging from 32B to 32KB in size.
Purpose: Suppress unwanted components from a discrete-time series.

Execution time (microseconds) by input size:

Filter          32      128     512     2K      8K      32K
8 Tap FIR       0.2     0.5     1.8     7.0     28.2    113.0
32 Tap FIR      0.6     2.0     4.2     15.3    59.4    244.0
128 Tap FIR     2.1     7.7     6.2     18.6    67.6    276.0
Order 2 IIR     0.2     0.7     2.4     9.7     38.5    156.0
Order 3 IIR     0.3     0.9     3.5     13.7    54.4    221.5
Order 4 IIR     0.4     1.1     3.8     14.9    59.4    239.0
Order 6 IIR     0.6     1.4     4.5     17.4    69.5    277.5
Order 7 IIR     0.7     1.6     5.2     20.3    81.3    330.0
Order 8 IIR     1.1     1.7     5.3     20.4    80.8    324.5
Order 10 IIR    1.1     2.1     6.3     24.2    97.0    387.0
Order 11 IIR    1.2     2.3     7.0     26.6    107.0   427.0
Order 12 IIR    1.2     2.5     7.3     27.5    112.0   445.0

Table 2. Filter Execution Time

D. Discrete Hilbert Transform
Format: Single precision floating point complex data.
Inputs: Complex data ranging from 128B to 32KB in size.
Purpose: Create analytic representation of a real-valued discrete signal.

Execution time (microseconds) by input size:

Conversion                      128     512     2K      8K      32K
INT16 to Complex Short FP       0.7     2.7     13.0    74.2    385.3
INT16 to Complex SP FP          0.6     2.4     11.5    64.1    353.6
SP Float to Complex SP Float    0.6     2.3     11.0    62.8    341.0

Table 3. Hilbert Transform Execution Times

E. Discrete Cosine Transform
Format:
Inputs: 128B to 32KB
Purpose: Express a discrete signal as a series of cosine frequencies that can be used for lossy signal compression.

Execution time (microseconds) by input size:

Transform           128     512     2K      8K      32K
SP Float Forward    0.8     3.7     20.6    111.8   563.8
SP Float Inverse    0.8     3.7     20.6    111.0   567.3
DP Float Forward    0.6     2.3     11.1    60.0    300.5
DP Float Inverse    0.6     2.3     11.0    58.9    301.8

Table 4. Discrete Cosine Transform Execution Times


Two Signal Processing Workload Examples


In this section, two generic signal processing workload examples are presented, and the performance of the 2nd generation Intel Core i7-2710QE processor is estimated in two ways, according to the methods described earlier. The first method is a simple manual approximation that adds the performance data of underlying functions obtained from the Intel IPP performance tool. Interpolation is used where measured data is not available. Though this is a rough approximation, it produces a quick performance estimate and a directional check of whether the system performance and headroom are adequate. The second method estimates the performance of the algorithms by coding them in C++ using Intel IPP and measuring their performance directly with hardware counters.

In the following examples, the results from both methods are presented, which provides an indication of how well the manual method approximates actual measured results. The objective of this exercise is to produce a reasonable estimate of performance in order to determine where effort is best applied during optimization. Ultimately, detailed design work may be needed to complete product development.
Example 1: Fast Convolution using FFT

The following example performs a fast convolution of two discrete signals, x(n) and y(n), shown in Figure 6. The example is also a frequency domain FIR filter when one of the input sequences represents the transfer function of an FIR filter. A sample C-code snippet using Intel IPP is provided in Figure 7.

[Block diagram: x(n) -> FFT and y(n) -> FFT; the two outputs are multiplied (X); the product -> Inverse FFT -> o(n)]

Figure 6. Fast Convolution using FFT Example

Inputs: x(n), y(n): 16KB input size
Output: o(n): 16KB output size
Operations: Single precision floating point in-place FFT, Complex Multiply, Inverse FFT

Sample C-code snippet for Example 1 using Intel Integrated Performance Primitives (Intel IPP):

    /* allocate and initialize specification structures */
    ippsFFTInitAlloc_C_32fc(&FFTspec1_p, order, IPP_FFT_DIV_FWD_BY_N, ippAlgHintFast);
    ippsFFTGetBufSize_C_32fc(FFTspec1_p, &BufSize);
    Buf1_p = ippsMalloc_8u(BufSize);   /* work buffer, BufSize bytes */

    /* compute in-place FFTs of input sequences */
    ippsFFTFwd_CToC_32fc_I(x_p, FFTspec1_p, Buf1_p);
    ippsFFTFwd_CToC_32fc_I(y_p, FFTspec1_p, Buf1_p);

    /* perform complex multiplication and inverse FFT */
    ippsMul_32fc(x_p, y_p, o_p, veclength);
    ippsFFTInv_CToC_32fc_I(o_p, FFTspec1_p, Buf1_p);

    /* free specification structures */
    ippsFFTFree_C_32fc(FFTspec1_p);
    ippsFree(Buf1_p);

Figure 7. Sample C-code Snippet for Example 1

Table 5 summarizes Fast Convolution execution times, calculated and measured, for various data sizes. The calculated times sum the execution times of individual functions, using times for Intel IPP signal processing functions obtained from the Intel IPP performance test tool running in batch mode. The measured times were generated by running the entire algorithm (Figure 6), which was coded in C++ and used Intel IPP, and by calculating the elapsed time based on the hardware clock count. The runtimes were averaged across 10,000 runs.

The calculated times were within 16 percent of the measured results. However, the calculated times took a few hours of unattended run time (no human effort aside from installing Intel IPP and running the aforementioned shell script) and less than an hour of calculating results in a spreadsheet. The measured results took an engineer familiar with Intel IPP and C++ programming a couple of days for coding, debugging and executing the runtime.

Component functions:

Size     FFT µsec   FFT clocks   iFFT µsec   iFFT clocks   Complex Mul µsec   Complex Mul clocks
64       7.96E-02   1.67E+02     8.02E-02    1.68E+02      5.25E-02           1.10E+02
128      2.30E-01   4.83E+02     2.30E-01    4.83E+02      1.05E-01           2.21E+02
256      5.04E-01   1.06E+03     4.92E-01    1.03E+03      1.84E-01           3.87E+02
512      1.07E+00   2.25E+03     1.04E+00    2.18E+03      3.17E-01           6.66E+02
1024     2.32E+00   4.87E+03     2.27E+00    4.77E+03      6.12E-01           1.29E+03
2048     5.91E+00   1.24E+04     5.93E+00    1.25E+04      1.18E+00           2.48E+03
4096     1.49E+01   3.13E+04     1.49E+01    3.13E+04      2.55E+00           5.34E+03
8192     3.65E+01   7.67E+04     3.60E+01    7.56E+04      5.46E+00           1.15E+04
16384    8.13E+01   1.71E+05     8.19E+01    1.72E+05      1.19E+01           2.50E+04
32768    2.00E+02   4.20E+05     2.02E+02    4.24E+05      2.57E+01           5.40E+04
65536    4.54E+02   9.53E+05     4.53E+02    9.51E+05      5.14E+01           1.08E+05

Fast Convolution totals:

Size     Calculated µsec   Calculated clocks   Measured µsec   Measured clocks   Delta
64       2.92E-01          6.13E+02            2.99E-01        6.28E+02          -2%
128      7.95E-01          1.67E+03            8.06E-01        1.69E+03          -1%
256      1.68E+00          3.54E+03            1.69E+00        3.55E+03          0%
512      3.50E+00          7.34E+03            3.80E+00        7.99E+03          -8%
1024     7.52E+00          1.58E+04            8.33E+00        1.75E+04          -10%
2048     1.89E+01          3.98E+04            2.11E+01        4.42E+04          -10%
4096     4.72E+01          9.92E+04            5.22E+01        1.10E+05          -10%
8192     1.14E+02          2.40E+05            1.28E+02        2.70E+05          -11%
16384    2.56E+02          5.38E+05            2.96E+02        6.21E+05          -13%
32768    6.28E+02          1.32E+06            7.47E+02        1.57E+06          -16%
65536    1.41E+03          2.97E+06            1.63E+03        3.43E+06          -14%

Table 5. Fast Convolution Execution Times

From this example, it is clear the simple manual approximation method (i.e., calculated times) delivers a performance estimate at a fraction of the effort required for the coding approach, and it provides an early read on where to invest optimization effort. Still, coding the algorithm using Intel IPP took significantly less effort than the alternative, which is to manually optimize custom libraries for each generation of Intel processor.
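The manual approximation can be sketched in a few lines. The code below assumes the calculated time is the sum of two forward FFTs, one complex multiply and one inverse FFT, which reproduces the calculated values in Table 5. It also assumes linear interpolation between neighboring measured sizes (the paper does not say which interpolation it used) and a 2.1 GHz clock, which is the ratio implied by the clock and time columns. All function names are hypothetical.

```c
#include <assert.h>

/* Convert a hardware clock count to microseconds.
 * ASSUMPTION: a fixed clock of clock_ghz (about 2.1 in Table 5). */
double clocks_to_us(double clocks, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return clocks / (clock_ghz * 1.0e3);
}

/* ASSUMED linear interpolation of a time between two measured sizes. */
double interpolate_us(double n0, double t0, double n1, double t1, double n)
{
    assert(n1 > n0 && n >= n0 && n <= n1);
    return t0 + (t1 - t0) * (n - n0) / (n1 - n0);
}

/* Calculated fast convolution time: FFT of x, FFT of y,
 * complex multiply, then inverse FFT. */
double fast_conv_estimate_us(double fft_us, double ifft_us, double cmul_us)
{
    return 2.0 * fft_us + cmul_us + ifft_us;
}

/* Signed difference of the estimate relative to the measurement, in percent. */
double delta_percent(double calculated_us, double measured_us)
{
    assert(measured_us > 0.0);
    return 100.0 * (calculated_us - measured_us) / measured_us;
}
```

For the 64-point row, `fast_conv_estimate_us(7.96e-2, 8.02e-2, 5.25e-2)` gives about 0.292 microseconds, and `delta_percent` against the measured 0.299 microseconds gives about -2 percent. A negative delta means the sum of isolated function times underestimates the time of the complete algorithm.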

One possible optimization is to parallelize the algorithm by going beyond vector instructions and executing the two FFTs on separate threads. Further parallelizing the execution, the threads can be dispatched to two processor cores and the results combined back to a single thread for the complex multiply and inverse FFT.

Floating Point Speeds Development


Voice quality remains a major concern for the many wireless, cable and internet service providers that rely on Voice over IP (VoIP) to deliver telephony services. To ensure quality is on par with the Public Switched Telephone Network (PSTN), service providers need an economical, automatic means of continuously testing calls in real time. A commonly used family of standards is PESQ³ (Perceptual Evaluation of Speech Quality), which defines a MOS voice quality score that correlates closely with the human listening experience, as shown in Figure 8. The algorithms perform voice encoding and measurements related to jitter, packet loss, time-clipping and channel errors. Many of the algorithms are computationally intensive and use floating point fast Fourier transforms.

For manufacturers of voice quality test equipment, passing industry conformance tests demands 32-bit float-like behavior throughout the application. This is exceptionally difficult to achieve with integer CPUs or FPGAs without a severe performance penalty. Fixed point add, multiply or divide is likewise not an option, because accuracy is lost every time an instruction discards a remainder. With a fixed point integer type, intermediate results of a multiply or add can grow beyond the range of the type, causing overflow and truncation errors, and small results can fall below the resolution of the fractional part. Either effect can lead to incorrect results, such as a PESQ score of 3.5 instead of 4.2.

Ixia*, a leading supplier of test and measurement equipment, chose multi-core Intel processors with high performance floating point units because they could run the PESQ code as-is, which minimized the migration effort. The processors delivered accurate PESQ results and proved to have floating point performance on par with, and even superior to, many floating point DSPs.

"Using Intel, we were able to get near-final performance numbers in just a few days, significantly lowering our project risk," says Bryan Rittmeyer, System Architect at Ixia.
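The loss of accuracy described above can be illustrated with a minimal C sketch. It assumes a Q15 fixed point format (16 bits, 15 of them fractional), which the paper does not specify, and the helper names (`q15_from_float`, `q15_mul`, `repeated_gain_q15`, `repeated_gain_f32`) are hypothetical. The same attenuation is applied repeatedly in fixed point and in 32-bit float so the two results can be compared.

```c
#include <assert.h>
#include <stdint.h>

/* Q15: signed 16-bit value with 15 fractional bits (an assumed format). */
typedef int16_t q15_t;

/* Convert a float in [0, 1) to Q15, truncating the excess precision.
   Restricted to non-negative inputs so the right shift in q15_mul is
   portable. */
static q15_t q15_from_float(float x)
{
    assert(x >= 0.0f && x < 1.0f);
    return (q15_t)(x * 32768.0f);
}

static float q15_to_float(q15_t x)
{
    return (float)x / 32768.0f;
}

/* Q15 multiply: the 32-bit product is shifted right by 15 bits, which
   throws away the low-order bits (the "remainder") on every call. */
static q15_t q15_mul(q15_t a, q15_t b)
{
    return (q15_t)(((int32_t)a * (int32_t)b) >> 15);
}

/* Apply `gain` to `start` n times using Q15 arithmetic. */
static float repeated_gain_q15(float start, float gain, int n)
{
    q15_t acc = q15_from_float(start);
    q15_t g = q15_from_float(gain);
    for (int i = 0; i < n; ++i)
        acc = q15_mul(acc, g);
    return q15_to_float(acc);
}

/* Apply `gain` to `start` n times using 32-bit floating point. */
static float repeated_gain_f32(float start, float gain, int n)
{
    float acc = start;
    for (int i = 0; i < n; ++i)
        acc *= gain;
    return acc;
}
```

With these definitions, `repeated_gain_q15(0.9f, 0.5f, 20)` returns exactly zero, because each halving discards the low bit until nothing is left, while `repeated_gain_f32` with the same arguments still returns a small positive value.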

R          MOS         User Satisfaction
90 - 100   4.3 - 5.0   Very Satisfied
80 - 90    4.0 - 4.3   Satisfied
70 - 80    3.6 - 4.0   Some Users Dissatisfied
60 - 70    3.1 - 3.6   Many Users Dissatisfied
50 - 60    2.6 - 3.1   Nearly All Users Dissatisfied
0 - 50     1.0 - 2.6   Not Recommended

Figure 8. MOS Diagram

Example 2: Discrete Envelope Detection / Amplitude Demodulation


For reference, Example 1 performs a fast convolution of two discrete signals, x(n) and y(n), shown in Figure 6. It also acts as a frequency domain FIR filter when one of the input sequences represents the transfer function of an FIR filter. A sample C-code snippet using Intel IPP is provided in Figure 7.

The second example is an envelope detector for a discrete time sequence, shown in Figure 9. The Hilbert transform produces the analytic representation of the signal. The magnitude of the analytic signal gives the envelope, which is then downsampled. The downsampling is done in two stages because the carrier frequency is 200x the message bandwidth; splitting the decimation keeps the FIR filters to reasonable sizes. Finally, the DC component is removed from the discrete output sequence. Figure 10 contains a code snippet of the MATLAB* model for the envelope detector and Figure 11 shows the results.


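Input signal: $x(n) = (A + M \cdot f(\omega_m n t_s)) \cdot \sin(\omega_c n t_s) + \text{noise}$

[Block diagram: x(n) → Discrete Hilbert Transform (includes a -N/2 delay element and summing nodes) → Envelope Formation |x_c| → Low Pass Filtering, Downsampling, & DC removal (LPF1 → D1 → LPF2 → D2 → DC removal) → o(m)]

Figure 9. Discrete Envelope Detection / Amplitude Demodulation Example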

Input: Amplitude modulated message.
Message bandwidth: 5 kHz.
Carrier frequency: 1000 kHz.
Input sampling frequency: 2200 kHz.
Output sampling frequency: 11 kHz.
LPF1: 128 tap FIR, 44 kHz cutoff frequency.
D1: Downsampling by 25.
LPF2: 128 tap FIR, 5.5 kHz cutoff frequency.
D2: Downsampling by 8.
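The parameters above can be cross-checked with a short C sketch. It computes the sampling rate after each decimation stage and tests whether each low pass cutoff stays at or below the new Nyquist frequency. The `decim_stage` type and both function names are hypothetical and are not part of Intel IPP; the check treats the filters as ideal, which a real 128 tap FIR is not.

```c
#include <assert.h>

/* One decimation stage: an anti-aliasing low pass filter followed by
   keeping every `factor`-th sample. Hypothetical type for illustration. */
typedef struct {
    double cutoff_hz;  /* low pass cutoff frequency */
    int    factor;     /* downsampling factor */
} decim_stage;

/* Sampling rate at the output of the stage. */
static double stage_output_rate(double input_rate_hz, decim_stage s)
{
    assert(s.factor > 0);
    return input_rate_hz / s.factor;
}

/* Returns 1 when the cutoff is at or below the Nyquist frequency of the
   decimated signal, so an ideal filter would prevent aliasing. */
static int stage_avoids_aliasing(double input_rate_hz, decim_stage s)
{
    return s.cutoff_hz <= stage_output_rate(input_rate_hz, s) / 2.0;
}
```

For the values in Figure 9, stage 1 (44 kHz cutoff, factor 25) takes 2200 kHz down to 88 kHz, and stage 2 (5.5 kHz cutoff, factor 8) takes 88 kHz down to the stated 11 kHz output rate. Both cutoffs sit exactly at the Nyquist frequency of their stage output.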

%Form analytic signal for envelope
inenv = abs(hilbert(in));
%Downsample, take out DC and LPF envelope
lpf1 = fir1(lpf1tap,cutoff1/(Fsi/2),'low',chebwin(lpf1tap+1));
out1 = fftfilt(lpf1,inenv);
out1 = downsample(out1,D1);
tout1 = downsample(tin,D1);
%Stage 2 Downsample & LPF envelope
lpf2 = fir1(lpf2tap,cutoff2/(Fs1/2),'low',chebwin(lpf2tap+1));
out = fftfilt(lpf2,out1);
out = downsample(out,D2);
out = out - mean(out);
out = out(17:length(out));
tout = downsample(tout1,D2);
tout = tout(17:length(tout));

Figure 10. Snippet of MATLAB* model
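Two steps of the MATLAB model in Figure 10, `downsample` and `out - mean(out)`, are simple enough to sketch in plain C. The version below is a minimal illustration with hypothetical function names, not the Intel IPP implementation. It assumes the caller provides an output buffer large enough for the decimated data.

```c
#include <assert.h>
#include <stddef.h>

/* Keep every `factor`-th sample starting with the first one, which is
   what MATLAB's downsample(x, factor) does. Returns the number of
   samples written to `out`. */
static size_t downsample(const float *in, size_t n, size_t factor, float *out)
{
    size_t m = 0;
    assert(factor > 0);
    for (size_t i = 0; i < n; i += factor)
        out[m++] = in[i];
    return m;
}

/* Remove the DC component in place by subtracting the mean. The sum is
   accumulated in double precision to limit rounding error. */
static void remove_dc(float *x, size_t n)
{
    double sum = 0.0;
    float mean;
    if (n == 0)
        return;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    mean = (float)(sum / (double)n);
    for (size_t i = 0; i < n; ++i)
        x[i] -= mean;
}
```

Downsampling the sequence 1, 2, ..., 7 by a factor of 3 keeps 1, 4 and 7, and removing the DC component from those three samples leaves -3, 0 and 3.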


Figure 11 shows the MATLAB results for a simulated noisy AM signal containing messages centered at 1 kHz, 3 kHz, and 4.75 kHz. The figure also shows the single-sided spectrum magnitude of the input, after LPF1, after LPF2 and downsampling D2, and at the final output. The frequency response of the LPFs (thin blue line) is superimposed on the frequency spectrum.

Figure 11. MATLAB* Results

Table 6 summarizes the execution times for the Amplitude Demodulation for various data sizes. As in Example 1, the calculated times are the sum of the execution times of the individual functions, and the measured times come from hardware clock count measurements collected while the algorithm executed (see Figure 12 for the code snippet). The runtimes were averaged across 10,000 runs. The calculated times are within 11 percent of the measured results. Here again, the calculated times took a few hours of unattended run time and less than an hour of spreadsheet calculation time. An engineer familiar with Intel IPP, C++ and the algorithm generated the measured results in three days, which included coding with Intel IPP functions, debugging and running the measurements. As in Example 1, the manual approximation method offers a good compromise between accuracy and effort, and it provides a relatively quick indication of performance and of focus areas for further optimization. As in the prior example, going to the next step of coding the algorithm using Intel IPP took an acceptable amount of effort, given the degree of optimization achieved and compared to manually optimizing libraries for Intel processors.
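The manual approximation described above amounts to summing the per-function times and comparing the total with the measurement. The C sketch below shows that arithmetic. The function names are hypothetical, the 2.1 GHz clock comes from the test configuration in Appendix A, and the sign convention for the delta (calculated minus measured, relative to measured) is inferred from the values in Table 6 rather than stated in the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Fixed core clock of the test platform (Appendix A). */
#define CPU_HZ 2.1e9

/* Convert a hardware clock count to microseconds. */
static double clocks_to_usec(double clocks)
{
    return clocks / CPU_HZ * 1e6;
}

/* Calculated estimate: sum of the individual function times. */
static double estimate_total(const double *times_usec, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += times_usec[i];
    return total;
}

/* Relative difference of the estimate from the measurement, in percent.
   Negative values mean the estimate is lower than the measurement. */
static double delta_percent(double calculated, double measured)
{
    assert(measured > 0.0);
    return (calculated - measured) / measured * 100.0;
}
```

As a usage example, the Table 6 entries for a 9000-sample input (Hilbert 70.9, magnitude 9.91, FIR #1 81.0, downsampling by 25 2.31, FIR #2 6.09 and downsampling by 8 0.0368, all in microseconds) sum to about 170 microseconds. Against the measured 175 microseconds this gives a delta of roughly -3 percent, in line with the table.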


Individual function execution times (µs)

Input Size   Hilbert Transf (32f - 32fc)   Magnitude   128 tap FIR #1   DownSampling by 8   DownSampling by 25
32           1.41E-01                      4.05E-02    4.49E-01         2.46E-02            7.69E-02
128          5.82E-01                      1.40E-01    1.63E+00         2.25E-02            7.03E-02
512          2.33E+00                      5.52E-01    6.39E+00         3.78E-02            1.18E-01
2048         1.09E+01                      2.20E+00    2.63E+01         9.67E-02            3.02E-01
8192         6.17E+01                      8.90E+00    7.61E+01         6.34E-01            1.98E+00
9000         7.09E+01                      9.91E+00    8.10E+01         7.41E-01            2.31E+00
18000        1.74E+02                      2.12E+01    1.36E+02         1.93E+00            6.03E+00
27000        2.76E+02                      3.25E+01    1.90E+02         3.12E+00            9.74E+00
32768        3.42E+02                      3.97E+01    2.25E+02         3.88E+00            1.21E+01
36000        3.76E+02                      4.36E+01    2.47E+02         4.26E+00            1.33E+01

Individual function execution times (clocks)

Input Size   Hilbert Transf (32f - 32fc)   Magnitude   128 tap FIR #1   DownSampling by 8   DownSampling by 25
32           2.96E+02                      8.51E+01    9.43E+02         5.17E+01            1.61E+02
128          1.22E+03                      2.94E+02    3.42E+03         4.73E+01            1.48E+02
512          4.89E+03                      1.16E+03    1.34E+04         7.94E+01            2.48E+02
2048         2.29E+04                      4.62E+03    5.52E+04         2.03E+02            6.35E+02
8192         1.30E+05                      1.87E+04    1.60E+05         1.33E+03            4.16E+03
9000         1.49E+05                      2.08E+04    1.70E+05         1.56E+03            4.86E+03
18000        3.64E+05                      4.45E+04    2.85E+05         4.05E+03            1.27E+04
27000        5.80E+05                      6.82E+04    3.99E+05         6.55E+03            2.05E+04
32768        7.18E+05                      8.34E+04    4.73E+05         8.15E+03            2.55E+04
36000        7.89E+05                      9.16E+04    5.19E+05         8.95E+03            2.80E+04

Second stage, after downsampling by 25

Input Size   128 tap FIR #2 (µs)   128 tap FIR #2 (clocks)   DownSampling by 8 (µs)   DownSampling by 8 (clocks)
327.68       5.69E+00              1.20E+04                  3.56E-02                 7.47E+01
360          6.09E+00              1.28E+04                  3.68E-02                 7.74E+01
720          1.57E+01              3.30E+04                  6.54E-02                 1.37E+02
1080         2.04E+01              4.28E+04                  7.92E-02                 1.66E+02
1310.72      2.34E+01              4.91E+04                  8.81E-02                 1.85E+02
1440         2.51E+01              5.26E+04                  9.30E-02                 1.95E+02

Complete demodulation

Input Size   Demod Calculated (µs)   Demod Calculated (clocks)   Demod Measured (µs)   Demod Measured (clocks)   Delta
8192         1.54E+02                3.24E+05                    1.59E+02              3.34E+05                  -3%
9000         1.70E+02                3.58E+05                    1.75E+02              3.67E+05                  -3%
18000        3.52E+02                7.39E+05                    3.68E+02              7.72E+05                  -4%
27000        5.29E+02                1.11E+06                    5.97E+02              1.25E+06                  -11%
32768        6.42E+02                1.35E+06                    7.18E+02              1.51E+06                  -11%
36000        7.05E+02                1.48E+06                    7.86E+02              1.65E+06                  -10%

Legend (cell shading in the original): Measured Data; Calculated Data; Interpolation From Measured Data

Table 6. Amplitude Demodulation Execution Times

/* allocate and initialize specification structures */
ippsHilbertInitAlloc_32f32fc(&HilbertSpec_p, insmpl, ippAlgHintNone);
ippsFIRInitAlloc_32f(&LPF1FIRState_p, (Ipp32f *)lpf1taps, LPF1TAPCNT, NULL);
ippsFIRInitAlloc_32f(&LPF2FIRState_p, (Ipp32f *)lpf2taps, LPF2TAPCNT, NULL);

/* Form the analytic signal and its envelope */
ippsHilbert_32f32fc((Ipp32f *)in_p, inenv_p, HilbertSpec_p);
ippsMagnitude_32fc(inenv_p, inenvabs_p, insmpl);

/* First stage of LPF and downsampling */
ippsFIR_32f_I(inenvabs_p, insmpl, LPF1FIRState_p);
ippsSampleDown_32f((Ipp32f *)inenvabs_p, insmpl, outD1_p, &D1smpl, D1, &phm);

/* Second stage of LPF and downsampling */
ippsFIR_32f_I(outD1_p, D1smpl, LPF2FIRState_p);
ippsSampleDown_32f(outD1_p, D1smpl, outD2_p, &D2smpl, D2, &phm);

/* Remove DC */
ippsMean_32f(outD2_p, D2smpl, &dcval, ippAlgHintNone);
ippsSubC_32f_I((Ipp32f)dcval, outD2_p, D2smpl);

/* free states */
ippsHilbertFree_32f32fc(HilbertSpec_p);
ippsFIRFree_32f(LPF1FIRState_p);
ippsFIRFree_32f(LPF2FIRState_p);

Figure 12. Sample C-code Snippet for Example 2 using Intel Integrated Performance Primitives (Intel IPP)

The single-sided spectrum magnitude output of the Intel IPP implementation is shown in Figure 13 and indicates the resulting envelope and message recovery at the 11 kHz sampling frequency.

Figure 13. Output Spectrum Magnitude for Intel Integrated Performance Primitives (Intel IPP) implementation.

Table 7 summarizes the measured execution time for various lengths of input sample sequences. Of note is the column labeled Utilization Rate. This is the algorithm's execution time divided by the duration of the input sample sequence, which provides a measure of core utilization over the time interval of the algorithm (i.e., before the next set of input samples needs to be processed). It indicates how much headroom the core has available for additional signal processing functions or for other applications.

Input Size in Samples   AVX Time (µs)   Utilization Rate (%)   AVX Speedup Over SSE
9000                    174.70          4.30%                  1.32x
18000                   367.64          4.50%                  1.30x
27000                   597.04          4.90%                  1.27x
36000                   785.95          4.80%                  1.29x
45000                   1,048.31        5.10%                  1.26x
54000                   1,223.71        5.00%                  1.27x
63000                   1,485.68        5.20%                  1.25x
72000                   1,659.40        5.10%                  1.26x
81000                   1,986.38        5.40%                  1.25x
90000                   2,155.75        5.30%                  1.26x

Table 7. Measured Execution Time for Envelope Detector
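The Utilization Rate column can be reproduced with the short C sketch below. It assumes the execution times are in microseconds and that the input is sampled at the 2200 kHz rate given with Figure 9. Both assumptions are inferred from the fact that they reproduce the published percentages, and the function names are hypothetical.

```c
#include <assert.h>

/* Duration of a block of input samples, in microseconds. */
static double input_duration_usec(unsigned samples, double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return (double)samples / sample_rate_hz * 1e6;
}

/* Execution time divided by the duration of the input block, in percent.
   Values well below 100 leave headroom on the core for other work. */
static double utilization_percent(double exec_usec, unsigned samples,
                                  double sample_rate_hz)
{
    return exec_usec / input_duration_usec(samples, sample_rate_hz) * 100.0;
}
```

For the first row of Table 7, 9000 samples at 2200 kHz last about 4091 microseconds, so an execution time of 174.70 microseconds corresponds to a utilization of roughly 4.3 percent.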


Development Tools Overview


Developers of signal processing applications have a wide choice of development tools from Intel and the broad Intel ecosystem. These comprehensive tool suites offer many benefits and affect every phase of the software development process.

Intel C++ Compiler
The Intel C++ Compilers for Linux and Microsoft* Windows* operating systems are optimized to harness key properties of Intel architecture processors and deliver optimal performance. They use a complex set of heuristics to decide which assembly instructions best optimize performance in various areas, including memory access, branch prediction, vectorization and floating point operations.

Intel Math Kernel Library (Intel MKL)
Intel MKL is a library of highly optimized, extensively threaded math routines for applications that rely heavily on floating point computations and need maximum performance. Core math functions include BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math and more.

Intel Integrated Performance Primitives (Intel IPP)
Intel IPP offers a rich set of library functions and codecs that speed up the development of highly optimized routines for handling multimedia formats and data of any kind. The functions have been hand optimized at a low level to provide maximum performance and ease of use on Intel architecture processor-based platforms.

Intel VTune Performance Analyzer
Designed to help developers find bottlenecks in their applications, this tool profiles how the application uses CPU time and computing platform resources throughout the code.

Intel Application Debugger
A rich and user friendly Eclipse* RCP-based graphical user interface, combined with OS signal and thread awareness, enables developers to cross-debug more easily by finding coding issues that affect application runtime behavior.

Eclipse*-based Integrated Development Environment
Intel software development products can be used with the Eclipse Integrated Development Environment (IDE).

Consider Intel Architecture Processors for Signal Processing


Although today's Intel architecture processors are already being used for signal processing workloads, the release of 2nd generation Intel Core i7 processors with Intel AVX makes this approach much more compelling. Intel AVX delivers over twice the performance¹ for some floating point-based workloads compared to prior generation Intel SSE instructions. Developers can evaluate the signal processing performance of next generation Intel architecture processors in a relatively straightforward way, using the data collected with Intel tools and libraries.


Appendix A: Test Configuration

Single thread execution
Emerald Lake Platform (Fab A)
BIOS American Megatrends 4.6.3.2 (Project Version ASNBCPT1.86C.0054.P00)
CPU: 2nd generation Intel Core i7-2710QE processor (4 core, 2.1 GHz, 6 MB LLC, Intel Hyper-Threading Technology off)
PCH: Mobile Intel QM67 Chipset, B0 stepping
2 GB RAM (2x1GB Samsung DIMM DDR3 1333, dual rank, PN: M471B2874EH1-CH9)
Western Digital 160GB HDD (WD1600AAJS)
Fedora* 13 Linux* 2.6.33.3-85.fc13.x86_64 operating system
Intel Composer XE 2011
Intel C++ Compiler Pro, version 12.0.1, build 107
Intel Integrated Performance Primitives (Intel IPP) version 7.0, build 205.23, September 2, 2010 (libippse9.so.7.0)
Intel IPP performance tool version 7.0 (part of the Intel IPP package)

All individual Intel IPP measurements were taken using the Intel IPP performance test tool. Standard batch mode (-B) input was used. The automatic timing mode with default accuracy was used. The tests were run with high priority (Y=HIGH) and on one thread only (N=1). More information on the command line parameters can be obtained by running the performance applications with the -hh switch.

Frequency domain FIR was compiled in release mode (Release x64) with the Intel C++ Compiler. The cache is warmed before the test. Optimizations are enabled using the /O3, -xHost, and -std=c99 compiler flags.

FDFIR data averaged among in place, fast, and no divide by N options.

Other data averaged among in place and not in place, fast & accurate switches, divide by N, divide by sqrt(n), and no divide by N, as applicable to each algorithm.

Data is at fixed CPU clock frequency and may change with Intel Turbo Boost Technology enabled.

Software libraries, drivers, operating systems, and compilers used are not fully tuned for performance and additional performance gains may be possible.
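The measurement procedure described in this paper (warm the cache, then average over many runs) can be sketched as a small C timing harness. This is a minimal illustration and not the harness used for the published numbers: it uses the standard `clock()` function as a portable stand-in for the hardware clock counter, and `kernel_fn`, `average_seconds`, `sum_ctx` and `sum_kernel` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Signature of the routine under test (hypothetical). */
typedef void (*kernel_fn)(void *ctx);

/* Run `fn` a few times untimed to warm the caches, then time
   `timed_runs` back-to-back calls and return the average per call,
   in seconds of CPU time. */
static double average_seconds(kernel_fn fn, void *ctx,
                              int warmup_runs, int timed_runs)
{
    clock_t start, stop;
    int i;
    assert(fn != NULL && timed_runs > 0);
    for (i = 0; i < warmup_runs; ++i)
        fn(ctx);
    start = clock();
    for (i = 0; i < timed_runs; ++i)
        fn(ctx);
    stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC / timed_runs;
}

/* Example kernel: sums an array. The volatile sink keeps the compiler
   from optimizing the work away. */
typedef struct {
    const float *data;
    int len;
    volatile float sink;
} sum_ctx;

static void sum_kernel(void *p)
{
    sum_ctx *c = p;
    float acc = 0.0f;
    for (int i = 0; i < c->len; ++i)
        acc += c->data[i];
    c->sink = acc;
}
```

Calling `average_seconds(sum_kernel, &ctx, 10, 10000)` mirrors the 10,000-run averaging used for the measured results. The resolution of `clock()` is coarse, so a cycle counter is a better choice for kernels that take well under a microsecond.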

Acronyms

ASIC   Application-specific integrated circuit
ASP    Application-specific processor
DSP    Digital signal processor
FIR    Finite impulse response
FFT    Fast Fourier transform
FPGA   Field-programmable gate array
IIR    Infinite impulse response
SIMD   Single-instruction, multiple data


¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

³ Source: PESQ website at http://www.pesq.org/

Copyright 2011 Intel Corporation. All rights reserved. Intel, the Intel logo and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.

*Other names and brands may be claimed as the property of others.

Printed in USA 0111/S2D/BM/XX/PDF Please Recycle 324910-001US
