Signal Processing On Intel Architecture
White Paper
Intel Advanced Vector Extensions (Intel AVX) Signal Processing Embedded Computing
Engineers can quickly determine whether Intel processor-based platforms with Intel Advanced Vector Extensions (Intel AVX) satisfy signal processing requirements
Signal processing functions have often required special-purpose hardware such as DSPs and FPGAs. However, recent enhancements to Intel architecture processors are providing developers an alternative: execute signal processing workloads on an Intel processor. Signal processing on the latest Intel processors is now a viable option due to continued improvements in multi-core architectures. The increased parallelism from vector instructions, along with other continuing performance improvements, enables the efficient execution of data parallel workloads such as digital transforms and filters. Additionally, by consolidating signal processing functions with other workloads on a multi-core Intel processor, it is possible to save hardware cost, simplify the application development environment and reduce time to market. This approach can be applied to many applications in aerospace (radar, sonar), communications infrastructure (baseband processing, transcoding) and healthcare (medical imaging).
Umberto Santoni, Platform Architect, Embedded Communications Group
Thomas Long, Software Engineer, Embedded Communications Group
This paper describes a process that lets developers quickly determine how fast the 2nd generation Intel Core i7-2710QE processor will execute their signal processing algorithms, based on performance data¹ that is relatively easy to obtain. Developers can complete the process in a straightforward manner, as demonstrated with two simple examples in this paper: fast convolution and amplitude demodulation. The paper concludes by reviewing some of the development tools available to developers to conduct their own evaluations.
White Paper: Signal Processing on Intel Architecture: Performance Analysis using Intel Performance Primitives
Table of Contents
Why Intel Architecture for Signal Processing
SIMD Instructions Enhanced By Intel Advanced Vector Extensions (Intel AVX)
The Process for Evaluating Signal Processing Performance
Signal Processing Performance Data
    Overview of benchmark data
    A) Forward and inverse Fast Fourier Transform (FFT)
    B) 2D Complex to Complex FFT Throughput (GFLOPS/s and absolute time)
    C) Filter Execution Times
    D) Discrete Hilbert Transform
    E) Discrete Cosine Transform
Speedup with Intel Advanced Vector Extensions (Intel AVX)
Two Signal Processing Workload Examples
    Example 1: Fast Convolution using FFT
    Example 2: Discrete Envelope Detection / Amplitude Demodulation
Floating Point Speeds Development
Development Tools Overview
    Intel C++ Compiler
    Intel Math Kernel Library (Intel MKL)
    Intel Integrated Performance Primitives (Intel IPP)
    Intel VTune Performance Analyzer
    Intel Application Debugger
    Eclipse*-based Integrated Development Environment
Consider Intel Architecture Processors for Signal Processing
Appendix A: Test Configuration
Acronyms
More specifically, the Intel Advanced Vector Extensions (Intel AVX) available for the first time with 2nd generation Intel Core i7 processors provide significantly improved floating point performance (see sidebar). Engineers who code floating point algorithms for Intel architecture processors can leverage a mature software ecosystem that offers a very wide breadth and depth of development tools. Also available are Intel development tools and libraries that employ Intel AVX and Intel Streaming SIMD Extensions 4 (Intel SSE4) instructions. Equipment manufacturers can choose from many hardware vendors supplying commercial off-the-shelf (COTS) embedded boards and systems that support embedded lifecycles and benefit from the economics of the PC/server supply chain.
The throughput of a SIMD instruction is a function of register size because larger registers translate into greater throughput. With the introduction of 2nd generation Intel Core i7 processors, the size of the 16 registers available for floating point operations doubles, increasing from 128 bits to 256 bits. Additionally, new three and four operand instructions establish a destination argument that results in fewer register copies, better register usage, faster execution and smaller code size. These are just some of the recent architectural enhancements, called Intel Advanced Vector Extensions (Intel AVX).
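The register-width arithmetic above can be made concrete with a small sketch. The helper name `simd_lanes` is hypothetical and not part of any Intel library; it only divides the register width by the element width to show how many values one instruction can operate on.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements of a given width that fit in one SIMD register.
   Hypothetical helper for illustration only. */
size_t simd_lanes(size_t register_bits, size_t element_bits)
{
    /* The element width must be non-zero and divide the register evenly. */
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

With 32-bit single precision elements, a 128-bit register holds 4 values and a 256-bit register holds 8; with 64-bit double precision elements, a 256-bit register holds 4.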
[Figure: SIMD register width, 128 bits (Intel SSE4) versus 256 bits (Intel AVX), for the registers up to XMM15]
[Perf System]
FFT_OrderXY=4x4; 5x5; 6x5; 6x6; 7x4; 7x5; 7x7; 8x3; 8x4; 8x6; 8x7; 8x8; 9x3; 9x5; 9x6; 9x8; 9x9; 10x4; 10x5; 10x7; 10x8; 10x10; 11x3; 11x4; 11x6; 11x7; 11x11; 11x12; 11x13; 11x14; 11x15; 12x3; 12x5; 12x6; 12x12; 13x4; 13x5; 13x13; 14x3; 14x4; 15x3; 15x4; 16x4; 17x3; 17x4;
Figure 2. Sample .ini file generating 2D FFT performance data
Another way to estimate signal processing performance is to focus the assessment on key kernels that are often used in signal processing workloads. By selecting a subset of the signal processing functions, developers can produce a manageable set of data that contains the most relevant functions and provides a reference with which to estimate the performance of other functions. For instance, this can be done by choosing forward and inverse FFTs of various sizes, both complex and real, along with FIR and IIR filters of varying complexities, and other useful functions such as discrete cosine and Hilbert transforms. Developers who need data on functions other than the ones covered in this paper can use the Intel IPP performance tool to gather it.
In summary, this process for evaluating the signal processing performance of Intel architecture utilizes Intel-collected performance data and gives developers a straightforward method to quickly estimate performance for their own workloads. Although the data provided is for the 2nd generation Intel Core i7-2710QE processor, the methods described here are extensible to the full range of Intel processors. With a manageable effort, this process gives developers a quick readout of the signal processing performance of next generation Intel processors and provides an estimate of how much general-purpose computing headroom is available for other applications. The next section reviews the performance data collected and demonstrates the process using two examples.
It is important to note that although using Intel IPP to assess the signal processing performance of Intel processors provides a good starting point that balances effort and optimization, it need not be the endpoint. Going beyond Intel IPP, it may be possible to capture significant performance improvements for specific algorithms through the use of compiler optimizations, primitives and assembly language programming. The flexibility of Intel architecture and its supporting software infrastructure gives developers all of these degrees of freedom, which can help identify the most appropriate tradeoff between performance and effort.
which includes sample code and the Intel IPP performance test tool. More details about the test configuration are provided in Appendix A. This single threaded execution environment provides an approximation of the performance of a single core and an indication of available computing headroom. Processor cores not used for signal processing can be targeted at other algorithms or applications running on the system. Intel development tools and libraries also provide extensive support for development, validation and performance tuning of multi-threaded applications.
Input range: 64B to 1024KB
Data type: Single and double precision floating point
Purpose: Decompose a discrete sequence into a set of frequencies.
Complex-to-complex FFT performance of the Intel IPP library is in the range of 16.1 to 23.7 single precision GFLOP/sec for sizes between 64B and 4KB. Similarly, the performance of the real-to-complex conjugate FFT is in the range of 15 to 17.3 single precision GFLOP/sec for sizes between 64B and 4KB. For larger sizes, FFT performance declines as the execution of the algorithm becomes less compute-bound and more memory-bound. Additionally, the FFT throughput of double precision floating point is approximately half that of single precision, and it scales with input size similarly to single precision.
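The relationship between an FFT execution time and a GFLOP/sec figure can be sketched as follows. This is a minimal sketch under two assumptions that the paper does not state: that throughput uses the conventional operation count of 5·N·log2(N) for a length-N complex FFT, and that the chart times are in microseconds. Both assumptions are consistent with the chart data point for the 128K single precision complex FFT (981 time units, 11.4 GFLOP/sec). The function names are hypothetical.

```c
#include <assert.h>

/* Integer log2 for power-of-two lengths (avoids needing the math library). */
unsigned ilog2_pow2(unsigned long n)
{
    unsigned bits = 0;
    assert(n >= 2 && (n & (n - 1)) == 0);   /* n must be a power of two */
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* Nominal throughput of a length-n complex FFT in GFLOP/s, given its
   execution time in microseconds.
   ASSUMPTION: 5*n*log2(n) floating point operations per transform. */
double fft_gflops(unsigned long n, double time_us)
{
    assert(time_us > 0.0);
    double flops = 5.0 * (double)n * (double)ilog2_pow2(n);
    /* flops per microsecond equals MFLOP/s; divide by 1000 for GFLOP/s. */
    return flops / (time_us * 1.0e3);
}
```

For example, `fft_gflops(131072, 981.0)` evaluates to about 11.4, matching the 128K entry. For a real-input FFT, half the operation count (2.5·N·log2(N)) is the usual convention.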
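[Figure: Complex-To-Complex FFT. Horizontal axis: Input Size. Vertical axes: Time (µsec, logarithmic scale) and GFLOP/sec. Series: SP Float (µsec), DP Float (µsec), SP Float (GFLOP/s), DP Float (GFLOP/s).]
Data points that survive from the chart's data table:
Input Size   SP Float (µsec)   DP Float (µsec)   SP Float (GFLOP/s)   DP Float (GFLOP/s)
128K         981               1970              11.4                 5.7
256K         2274              4966              10.4                 4.8
512K         5425              11450             9.2                  4.4
1024K        12800             25250             8.2                  4.2
[Figure: Real-To-CCS FFT. Same axes and series as the Complex-To-Complex FFT chart.]
Data points that survive from the chart's data table:
Input Size   SP Float (µsec)   DP Float (µsec)   SP Float (GFLOP/s)   DP Float (GFLOP/s)
512K         2625              5525              9.5                  4.5
1024K        5989              12663             8.8                  4.1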
B. 2D Complex to Complex FFT Throughput (GFLOPS/s and absolute time)
Format: Complex-to-Complex, 2 dimensional array data
Input range: Various sizes ranging from 64B x 64B to 128KB x 16B
Data type: Single precision floating point
Purpose: Decompose a two dimensional array of discrete sequences into a set of frequencies.
Two dimensional complex-to-complex FFT performance of the Intel IPP library is in the range of 12.9 to 17.2 GFLOP/sec for sizes ranging from 64B x 64B to 4KB x 64B. As with the one dimensional FFT case, 2D FFT performance declines for larger sizes as the FFT becomes more memory-bound. Table 1 contains sample data points of interest, and it is possible to generate 2D FFT throughput data for other sizes using the Intel IPP library.
Size         GFLOP/sec   Time (µsec)
64 x 64      17.2        14.3
256 x 64     14.9        76.9
256 x 256    12.9        407
512 x 64     14.2        173
4K x 64      12.9        1834
8K x 32      11.3        2088
[Figure: Speedup of Intel AVX vs. Intel SSE4 for Complex-to-Complex FFT and Inverse FFT (averaged). Horizontal axis: Input Size, 64 to 1024K. Vertical axis: speedup, 1 to 2.8 (higher is better). Series: single precision (SP) floating point and double precision (DP) floating point.]
Figure 5. Intel Advanced Vector Extensions (Intel AVX) Speedup over Intel Streaming SIMD Extensions 4 (Intel SSE4)
C. Filter Execution Times
Format: Single precision floating point complex data, complex coefficients.
Finite Impulse Response Filter: 8, 32, and 128 taps
Infinite Impulse Response Filter: Orders ranging from 2 to 12.
Inputs: Complex data ranging from 32B to 32KB in size.
Purpose: Suppress unwanted components from a discrete-time series.
Execution Time (Microseconds) by Input Size
Filter          32      128     512     2K      8K      32K
8 Tap FIR       0.2     0.5     1.8     7.0     28.2    113.0
32 Tap FIR      0.6     2.0     4.2     15.3    59.4    244.0
128 Tap FIR     2.1     7.7     6.2     18.6    67.6    276.0
Order 2 IIR     0.2     0.7     2.4     9.7     38.5    156.0
Order 3 IIR     0.3     0.9     3.5     13.7    54.4    221.5
Order 4 IIR     0.4     1.1     3.8     14.9    59.4    239.0
Order 6 IIR     0.6     1.4     4.5     17.4    69.5    277.5
Order 7 IIR     0.7     1.6     5.2     20.3    81.3    330.0
Order 8 IIR     1.1     1.7     5.3     20.4    80.8    324.5
Order 10 IIR    1.1     2.1     6.3     24.2    97.0    387.0
Order 12 IIR    1.2     2.5     7.3     27.5    112.0   445.0
Order 11 IIR    1.2     2.3     7.0     26.6    107.0   427.0
D. Discrete Hilbert Transform
Format: Single precision floating point complex data.
Inputs: Complex data ranging from 128B to 32KB in size.
Execution Time (Microseconds) by Input Size
Conversion                        128     512     2K      8K      32K
INT16 to Complex Short FP         0.7     2.7     13.0    74.2    385.3
INT16 to Complex SP FP            0.6     2.4     11.5    64.1    353.6
SP Float to Complex SP Float      0.6     2.3     11.0    62.8    341.0
Table 3. Hilbert Transform Execution Times
E. Discrete Cosine Transform
Purpose: Express a discrete signal as a series of cosine frequencies that can be used for lossy signal compression.
Execution Time (Microseconds) by Input Size
Transform            128     512     2K      8K      32K
SP Float Forward     0.8     3.7     20.6    111.8   563.8
SP Float Inverse     0.8     3.7     20.6    111.0   567.3
DP Float Forward     0.6     2.3     11.1    60.0    300.5
DP Float Inverse     0.6     2.3     11.0    58.9    301.8
[Figure 6: fast convolution block diagram, including input y(n) and the FFT stage]
Inputs: x(n), y(n): 16KB input size
Output: o(n): 16KB output size
Operations: Single precision floating point in-place FFT, Complex Multiply, Inverse FFT
Table 5 summarizes Fast Convolution execution times, calculated and measured, for various data sizes. The calculated times sum the execution times of the individual functions, using times for Intel IPP signal processing functions obtained from the Intel IPP performance test tool running in batch mode. The measured times were generated by running the entire algorithm (Figure 6), which was coded in C++ and used Intel IPP, and by calculating the elapsed time based on the hardware clock count. The runtimes were averaged across 10,000 runs. The calculated times were within 16 percent of the measured results. However, the calculated times took only a few hours of unattended run time (no human effort aside from installing the Intel IPP and running the aforementioned shell script) and less than an hour of calculating results in a spreadsheet, whereas the measured results took an engineer familiar with the Intel IPP and C++ programming a couple of days for coding, debugging and executing the runtime.
         FFT                   iFFT                  Complex Mul           Fast Convolution      Fast Convolution
                                                                           Calculated            Measured
Size     µsec      clocks      µsec      clocks      µsec      clocks      µsec      clocks      µsec      clocks      Delta
64       7.96E-02  1.67E+02    8.02E-02  1.68E+02    5.25E-02  1.10E+02    2.92E-01  6.13E+02    2.99E-01  6.28E+02    -2%
128      2.30E-01  4.83E+02    2.30E-01  4.83E+02    1.05E-01  2.21E+02    7.95E-01  1.67E+03    8.06E-01  1.69E+03    -1%
256      5.04E-01  1.06E+03    4.92E-01  1.03E+03    1.84E-01  3.87E+02    1.68E+00  3.54E+03    1.69E+00  3.55E+03    0%
512      1.07E+00  2.25E+03    1.04E+00  2.18E+03    3.17E-01  6.66E+02    3.50E+00  7.34E+03    3.80E+00  7.99E+03    -8%
1024     2.32E+00  4.87E+03    2.27E+00  4.77E+03    6.12E-01  1.29E+03    7.52E+00  1.58E+04    8.33E+00  1.75E+04    -10%
2048     5.91E+00  1.24E+04    5.93E+00  1.25E+04    1.18E+00  2.48E+03    1.89E+01  3.98E+04    2.11E+01  4.42E+04    -10%
4096     1.49E+01  3.13E+04    1.49E+01  3.13E+04    2.55E+00  5.34E+03    4.72E+01  9.92E+04    5.22E+01  1.10E+05    -10%
8192     3.65E+01  7.67E+04    3.60E+01  7.56E+04    5.46E+00  1.15E+04    1.14E+02  2.40E+05    1.28E+02  2.70E+05    -11%
16384    8.13E+01  1.71E+05    8.19E+01  1.72E+05    1.19E+01  2.50E+04    2.56E+02  5.38E+05    2.96E+02  6.21E+05    -13%
32768    2.00E+02  4.20E+05    2.02E+02  4.24E+05    2.57E+01  5.40E+04    6.28E+02  1.32E+06    7.47E+02  1.57E+06    -16%
65536    4.54E+02  9.53E+05    4.53E+02  9.51E+05    5.14E+01  1.08E+05    1.41E+03  2.97E+06    1.63E+03  3.43E+06    -14%
Table 5. Fast Convolution Execution Times
From this example, it is clear that the simple manual approximation method (i.e., calculated times) delivers a performance estimate at a fraction of the effort required for the coding approach, and it provides an early read on where to invest optimization effort. Still, coding the algorithm using Intel IPP took significantly less effort than the alternative, which is to manually optimize custom libraries for each generation of Intel processor.
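The calculated-time method for Example 1 can be sketched in a few lines. This is an illustration, not the authors' spreadsheet: it assumes the estimate is two forward FFTs plus one complex multiply plus one inverse FFT, that the times are in microseconds, and that clocks are time multiplied by the fixed 2.1 GHz core frequency listed in Appendix A. These assumptions reproduce the 64-point entries of Table 5. All function names are hypothetical.

```c
#include <assert.h>

/* Estimated fast convolution time in microseconds:
   two forward FFTs + one complex multiply + one inverse FFT (assumed). */
double fast_conv_estimate_us(double fft_us, double cmul_us, double ifft_us)
{
    assert(fft_us >= 0.0 && cmul_us >= 0.0 && ifft_us >= 0.0);
    return 2.0 * fft_us + cmul_us + ifft_us;
}

/* Relative difference of the calculated time versus the measured time,
   in percent (negative when the estimate is optimistic). */
double delta_percent(double calculated_us, double measured_us)
{
    assert(measured_us > 0.0);
    return 100.0 * (calculated_us - measured_us) / measured_us;
}

/* Convert microseconds to core clocks at a fixed frequency in MHz. */
double us_to_clocks(double time_us, double core_mhz)
{
    assert(core_mhz > 0.0);
    return time_us * core_mhz;
}
```

For the 64-point case, `fast_conv_estimate_us(7.96e-2, 5.25e-2, 8.02e-2)` gives roughly 0.292 µsec, `delta_percent(0.292, 0.299)` gives roughly -2 percent, and `us_to_clocks(7.96e-2, 2100.0)` gives roughly 167 clocks.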
One possible optimization is to parallelize the algorithm by going beyond vector instructions and executing the two FFTs on separate threads. Further parallelizing the execution, the threads can be dispatched to two processor cores and the results combined back to a single thread for the complex multiply and inverse FFT.
R            User Satisfaction                 MOS
90 to 100    Very Satisfied                    4.3 to 5.0
80 to 90     Satisfied                         4.0 to 4.3
70 to 80     Some Users Dissatisfied           3.6 to 4.0
60 to 70     Many Users Dissatisfied           3.1 to 3.6
50 to 60     Nearly All Users Dissatisfied     2.6 to 3.1
0 to 50      Not Recommended                   1.0 to 2.6
The second example is an envelope detector for a discrete time sequence, shown in Figure 9. The Hilbert transform produces the analytic representation of the signal, whose magnitude gives the envelope of the signal, which is then downsampled. The downsampling is done in two stages because the carrier frequency is 200x the message bandwidth; this keeps the FIRs to reasonable sizes. Finally, the DC component is removed from the discrete output sequence. Figure 10 contains a code snippet of the MATLAB* model for the envelope detector, and its results are presented after the code.
[Figure 9: envelope detector block diagram. Stages: Envelope Formation (|xc|), LPF1, D1, LPF2, D2, DC removal; output o(m)]
Input: Amplitude modulated message. Message bandwidth = 5KHz. Carrier frequency: 1000KHz. Input sampling frequency: 2200KHz. Output sampling frequency: 11KHz.
LPF1: 128 tap FIR, 44KHz cutoff frequency.
D1: Downsampling by 25.
LPF2: 128 tap FIR, 5.5KHz cutoff frequency.
D2: Downsampling by 8.
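The two-stage rate plan in these parameters can be checked with simple arithmetic. The sketch below uses hypothetical helper names; it computes the sample rate after each decimation stage and checks that each low-pass cutoff does not exceed the Nyquist frequency of the decimated stream.

```c
#include <assert.h>

/* Sample rate (kHz) after decimating by an integer factor. */
double decimated_rate_khz(double rate_khz, int factor)
{
    assert(rate_khz > 0.0 && factor > 0);
    return rate_khz / factor;
}

/* Returns 1 when the low-pass cutoff is at or below the Nyquist frequency
   (half the sample rate) of the decimated stream, 0 otherwise. */
int cutoff_fits_nyquist(double cutoff_khz, double rate_after_khz)
{
    return cutoff_khz <= rate_after_khz / 2.0;
}
```

With the values above, 2200 kHz decimated by 25 gives 88 kHz (Nyquist 44 kHz, matching the LPF1 cutoff), and 88 kHz decimated by 8 gives 11 kHz (Nyquist 5.5 kHz, matching the LPF2 cutoff and the stated output sampling frequency). The overall decimation of 25 x 8 = 200 matches the 200x ratio between the carrier frequency and the message bandwidth.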
%Form analytic signal for envelope
inenv = abs(hilbert(in));
%Downsample, take out DC and LPF envelope
lpf1 = fir1(lpf1tap,cutoff1/(Fsi/2),'low',chebwin(lpf1tap+1));
out1 = fftfilt(lpf1,inenv);
out1 = downsample(out1,D1);
tout1 = downsample(tin,D1);
%Stage 2 Downsample & LPF envelope
lpf2 = fir1(lpf2tap,cutoff2/(Fs1/2),'low',chebwin(lpf2tap+1));
out = fftfilt(lpf2,out1);
out = downsample(out,D2);
out = out - mean(out);
out = out(17:length(out));
tout = downsample(tout1,D2);
tout = tout(17:length(tout));
Figure 11 shows the MATLAB results for a simulated noisy AM signal containing messages centered at 1KHz, 3KHz, and 4.75KHz. The figure also shows the single-sided spectrum magnitude of the input, after LPF1, after LPF2 and downsampling D2, and at the final output. Superimposed on the frequency spectrum is the frequency response of the LPFs (thin blue line).
Table 6 summarizes the execution times for the Amplitude Demodulation for various data sizes. Similar to Example 1, the calculated times summed the execution times of the individual functions, and the measured times were generated from hardware clock count measurements collected while the algorithm executed (see Figure 10 for the code snippet). The runtimes were averaged across 10,000 runs. The calculated times are within 11 percent of the measured results. Here again, the calculated times took a few hours of unattended run time and less than an hour of spreadsheet calculation time. An engineer familiar with Intel IPP, C++ and the algorithm generated the measured results in three days, which included coding using Intel IPP functions, debugging and executing the runtime. This example is similar to Example 1, in that the manual approximation method offers a good compromise between accuracy and effort, and it provides a relatively quick indication of performance and focus areas for further optimization. As in the prior example, going to the next step of coding the algorithm using Intel IPP took an acceptable amount of effort, given the degree of optimization and compared to manually optimizing libraries to Intel processors.
First-stage functions, by input size:
Input    Hilbert Transf        Magnitude             128 tap FIR #1        DownSampling by 8     DownSampling by 25
Size     (32f - 32fc)
         µsec      clocks      µsec      clocks      µsec      clocks      µsec      clocks      µsec      clocks
32       1.41E-01  2.96E+02    4.05E-02  8.51E+01    4.49E-01  9.43E+02    2.46E-02  5.17E+01    7.69E-02  1.61E+02
128      5.82E-01  1.22E+03    1.40E-01  2.94E+02    1.63E+00  3.42E+03    2.25E-02  4.73E+01    7.03E-02  1.48E+02
512      2.33E+00  4.89E+03    5.52E-01  1.16E+03    6.39E+00  1.34E+04    3.78E-02  7.94E+01    1.18E-01  2.48E+02
2048     1.09E+01  2.29E+04    2.20E+00  4.62E+03    2.63E+01  5.52E+04    9.67E-02  2.03E+02    3.02E-01  6.35E+02
8192     6.17E+01  1.30E+05    8.90E+00  1.87E+04    7.61E+01  1.60E+05    6.34E-01  1.33E+03    1.98E+00  4.16E+03
9000     7.09E+01  1.49E+05    9.91E+00  2.08E+04    8.10E+01  1.70E+05    7.41E-01  1.56E+03    2.31E+00  4.86E+03
18000    1.74E+02  3.64E+05    2.12E+01  4.45E+04    1.36E+02  2.85E+05    1.93E+00  4.05E+03    6.03E+00  1.27E+04
27000    2.76E+02  5.80E+05    3.25E+01  6.82E+04    1.90E+02  3.99E+05    3.12E+00  6.55E+03    9.74E+00  2.05E+04
32768    3.42E+02  7.18E+05    3.97E+01  8.34E+04    2.25E+02  4.73E+05    3.88E+00  8.15E+03    1.21E+01  2.55E+04
36000    3.76E+02  7.89E+05    4.36E+01  9.16E+04    2.47E+02  5.19E+05    4.26E+00  8.95E+03    1.33E+01  2.80E+04
Second-stage functions (input size after downsampling by 25) and totals:
Input    Stage 2     128 tap FIR #2        DownSampling by 8     Demod Calculated      Demod Measured
Size     Input Size  µsec      clocks      µsec      clocks      µsec      clocks      µsec      clocks      Delta
8192     327.68      5.69E+00  1.20E+04    3.56E-02  7.47E+01    1.54E+02  3.24E+05    1.59E+02  3.34E+05    -3%
9000     360         6.09E+00  1.28E+04    3.68E-02  7.74E+01    1.70E+02  3.58E+05    1.75E+02  3.67E+05    -3%
18000    720         1.57E+01  3.30E+04    6.54E-02  1.37E+02    3.52E+02  7.39E+05    3.68E+02  7.72E+05    -4%
27000    1080        2.04E+01  4.28E+04    7.92E-02  1.66E+02    5.29E+02  1.11E+06    5.97E+02  1.25E+06    -11%
32768    1310.72     2.34E+01  4.91E+04    8.81E-02  1.85E+02    6.42E+02  1.35E+06    7.18E+02  1.51E+06    -11%
36000    1440        2.51E+01  5.26E+04    9.30E-02  1.95E+02    7.05E+02  1.48E+06    7.86E+02  1.65E+06    -10%
Table 6. Amplitude Demodulation Execution Times
/* allocate and initialize specification structures */
ippsHilbertInitAlloc_32f32fc(&HilbertSpec_p, insmpl, ippAlgHintNone);
ippsFIRInitAlloc_32f(&LPF1FIRState_p, (Ipp32f *)lpf1taps, LPF1TAPCNT, NULL);
ippsFIRInitAlloc_32f(&LPF2FIRState_p, (Ipp32f *)lpf2taps, LPF2TAPCNT, NULL);
/* Form the analytic signal and its envelope */
ippsHilbert_32f32fc((Ipp32f *)in_p, inenv_p, HilbertSpec_p);
ippsMagnitude_32fc(inenv_p, inenvabs_p, insmpl);
/* First stage of LPF and downsampling */
ippsFIR_32f_I(inenvabs_p, insmpl, LPF1FIRState_p);
ippsSampleDown_32f((Ipp32f *)inenvabs_p, insmpl, outD1_p, &D1smpl, D1, &phm);
/* Second stage of LPF and downsampling */
ippsFIR_32f_I(outD1_p, D1smpl, LPF2FIRState_p);
ippsSampleDown_32f(outD1_p, D1smpl, outD2_p, &D2smpl, D2, &phm);
/* Remove DC */
ippsMean_32f(outD2_p, D2smpl, &dcval, ippAlgHintNone);
ippsSubC_32f_I((Ipp32f)dcval, outD2_p, D2smpl);
/* free states */
ippsHilbertFree_32f32fc(HilbertSpec_p);
ippsFIRFree_32f(LPF1FIRState_p);
ippsFIRFree_32f(LPF2FIRState_p);
Figure 12. Sample C-code Snippet for Example 2 using Intel Integrated Performance Primitives (Intel IPP)
The single-sided spectrum magnitude output of the Intel IPP implementation is shown in Figure 13 and indicates the resulting envelope
Figure 13. Output Spectrum Magnitude for Intel Integrated Performance Primitives (Intel IPP) implementation.
Table 7 summarizes the measured execution time for various lengths of input sample sequences. Of note is the column labeled Utilization Rate. This is the algorithm's execution time divided by the duration of the input sample, which provides a measure of core utilization over the time interval of the algorithm (i.e., before the next set of input samples needs to be processed). It is an indication of the amount of headroom the core has available for additional signal processing functions, or perhaps, for other applications.
Input Size in Samples   AVX Time (µsec)   Utilization Rate (%)   AVX Speedup Over SSE
9000                    174.70            4.30%                  1.32x
18000                   367.64            4.50%                  1.30x
27000                   597.04            4.90%                  1.27x
36000                   785.95            4.80%                  1.29x
45000                   1,048.31          5.10%                  1.26x
54000                   1,223.71          5.00%                  1.27x
63000                   1,485.68          5.20%                  1.25x
72000                   1,659.40          5.10%                  1.26x
81000                   1,986.38          5.40%                  1.25x
90000                   2,155.75          5.30%                  1.26x
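The Utilization Rate column can be reproduced with a short calculation. This sketch assumes the AVX times are in microseconds and that the input sample duration follows from the 2200 kHz input sampling frequency given for Example 2; the function names are hypothetical.

```c
#include <assert.h>

/* Duration of a block of samples in microseconds, for a rate in kHz. */
double block_duration_us(long samples, double sample_rate_khz)
{
    assert(samples > 0 && sample_rate_khz > 0.0);
    return 1000.0 * (double)samples / sample_rate_khz;
}

/* Core utilization in percent: execution time divided by the time it
   takes for the same number of input samples to arrive. */
double utilization_percent(double exec_time_us, long samples,
                           double sample_rate_khz)
{
    assert(exec_time_us >= 0.0);
    return 100.0 * exec_time_us / block_duration_us(samples, sample_rate_khz);
}
```

For 9000 samples at 2200 kHz the block lasts about 4091 µsec, so an execution time of 174.70 µsec corresponds to roughly 4.3 percent utilization, in line with the first row of the table. The remainder of the interval is the headroom available on that core.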
Intel VTune Performance Analyzer
Designed to help developers find bottlenecks in their applications, the tool profiles how the application uses CPU time and computing platform resources throughout the code.
Intel Application Debugger
A rich and user-friendly Eclipse* RCP-based graphical user interface, combined with OS signal and thread awareness, enables developers to cross-debug more easily by finding coding issues that affect application runtime behavior.
Eclipse*-based Integrated Development Environment
Intel software development products can be used with the Eclipse Integrated Development Environment (IDE).
Appendix A: Test Configuration
Single thread execution
Emerald Lake Platform (Fab A)
BIOS American Megatrends 4.6.3.2 (Project Version ASNBCPT1.86C.0054.P00)
CPU: 2nd generation Intel Core i7-2710QE processor (4 core, 2.1GHz, 6MB LLC, Intel Hyper-Threading Technology off)
PCH: Mobile Intel QM67 Chipset, B0 stepping.
2 GB RAM (2x1GB Samsung DIMM DDR3 1333, dual rank, PN: M471B2874EH1-CH9)
Western Digital 160GB HDD (WD1600AAJS)
Fedora* 13 Linux* 2.6.33.3-85.fc13.x86_64 operating system
Intel Composer XE 2011 Intel C++ Compiler Pro, version 12.0.1, build 107.
Intel Integrated Performance Primitives (Intel IPP) version 7.0, build 205.23, September 2, 2010 (libippse9.so.7.0)
Intel IPP performance tool version 7.0 (part of the Intel IPP package)
All individual Intel IPP measurements were taken using the Intel IPP performance test tool. Standard batch mode (-B) input was used. The automatic timing mode with default accuracy was used. The tests were run with high priority (Y=HIGH) and on one thread only (N=1). More information on the command line parameters can be obtained by running the performance applications with the hh switch.
Frequency domain FIR was compiled in release mode (Release x64) with the Intel C++ Compiler. The cache is warmed before the test. Optimizations are enabled using the /O3, -xHost, and -std=c99 compiler flags.
FDFIR data averaged among in place, fast, and no divide by N options.
Other data averaged among in place and not in place, fast & accurate switches, divide by N, divide by sqrt(n), and no divide by N, as applicable to each algorithm.
Data is at fixed CPU clock frequency and may change with Intel Turbo Boost Technology enabled.
Software libraries, drivers, operating systems, and compilers used are not fully tuned for performance and additional performance gains may be possible.
Acronyms
ASIC    Application-specific integrated circuit
ASP     Application-specific processor
DSP     Digital signal processor
FIR     Finite impulse response
FFT     Fast Fourier transform
FPGA    Field-programmable gate array
IIR     Infinite impulse response
SIMD    Single-instruction, multiple data
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance
Source: PESQ website at http://www.pesq.org/
Copyright 2011 Intel Corporation. All rights reserved. Intel, the Intel logo and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries.
*Other names and brands may be claimed as the property of others. Printed in USA 0111/S2D/BM/XX/PDF Please Recycle 324910-001US