FX8010 - A DSP Chip Architecture For Audio Effects: Steve Hoge
Steve Hoge
Joint E-mu/Creative Technology Center
[email protected], http://www.emu.com, http://www.creaf.com
Abstract
FX8010 is a DSP chip architecture specifically designed for time-domain 3D audio and effects
processing. It is a 32-channel, 32-bit integer design that can deliver 100 MIPS at a 50 kHz audio
sample rate. It features powerful delay memory and I/O engines that execute in parallel with and are
decoupled from microprogram execution. Its highly regular architecture supports the simultaneous
execution of large numbers of separately compiled and downloaded programs with zero-overhead
signal patching. A compiler for FX8010 programs generates code from C-style expressions and
control-flow constructs. FX8010 has been implemented in two different ASICs for PC multimedia
and professional audio applications.
FX8010 is a real-time digital signal processing architecture specifically designed to implement time-domain digital audio effects and multichannel mixing. By coupling a highly regular four-operand, 32-bit integer architecture with independent delay memory and I/O engines, FX8010 delivers 100 MIPS at a 50 kHz sample rate and is capable of simultaneously executing up to eight high-quality reverberators or dozens of simpler algorithms.

[Figure: FX8010 block diagram. Blocks: Host Interface; 1K Instr Memory; 1K GPR Memory; Execution Unit; TRAM Engine with TRAM Buffer GPRs, Internal TRAM (EMU10K1) and External TRAM (to 1M samples); I/O Engine with I/O Buffer GPRs, 32 Chan Audio Inputs and 32 Chan Audio Outputs.]

The accessibility of both the MS and LS halves of the FX8010 67-bit accumulator allows the multiply/accumulator unit to perform either fractional or integer arithmetic, depending on which half is retrieved as the result operand and how it is saturated. This accessibility also makes possible double-precision operations, if necessary. With both integer and fractional multiplication available, the FX8010 compiler can generate coefficients that implement conventional left and right shift operators (<< and >>).

3.2 GPR Operand Architecture
The FX8010 execution unit is connected directly to a 1K GPR address space for operand storage. Each FX8010 instruction includes an opcode and four independent GPR addresses that define the instruction's three input operands A, X and Y and the result operand R. For multiply/accumulate instructions, A is the accumulator while X and Y are symmetrical multiplier inputs. There are no visible hazards in the FX8010 operand pipeline, so the result operand of one instruction can become any of the input operands on the next instruction cycle.

Many DSP architectures place input registers ahead of their execution units that must be kept filled with operands by the programmer (often with parallel move operations) in order to extract maximum compute bandwidth from the processor. Keeping these registers full at all times is a challenging problem of operand sequencing and sometimes even memory layout, and becomes one of the arcane skills of the DSP programmer. By contrast, the FX8010 math unit has no such input registers; instead, all instructions fetch their operands by directly addressing GPR memory. In this respect the FX8010 architecture is very programmer-friendly.

Similarly, most DSPs also have volatile accumulators or output registers that must be saved with data move instructions or recycled as "special"

MAC      Fractional multiply/add/subtract with optional saturation/word wrap
MACINT   Integer multiply/add/subtract with optional saturation/word wrap
ACC3     Accumulate 3 inputs with saturation
MACMV    Multiply/accumulate with additional data move
SKIP     Conditionally skip over instructions
ANDXOR   Multi-purpose bitwise logical instruction
TSTNEG   Test and conditionally negate the result
LIMIT    Test and conditionally output a higher/lower threshold
LOG      Convert linear to logarithmic representation
EXP      Convert logarithmic to linear representation
INTERP   Linear interpolate between two values

Table 1: Typical FX8010 opcodes

As seen in the table of opcodes, the FX8010's instructions implement traditional DSP arithmetic as well as some more unusual operations:

MACMV performs multiply/accumulation on X, Y and the accumulator, while in parallel moving the A operand to R. This simultaneously accomplishes the MAC and data shift required for FIR filtering.

ANDXOR (R = A & X ^ Y) allows the FX8010 compiler to take advantage of the 4-operand architecture and built-in ROM constant GPRs to synthesize bitwise AND, XOR, NOR, NAND, NOT and OR operations from a single opcode.

LOG and EXP perform transformations to and from a sign|exponent|mantissa representation with a programmable maximum exponent. Applications are data compression, dB conversion, waveshaping and log domain arithmetic approximating division and
roots. Interesting distortion effects are also possible, especially by modulating the exponent size.

INTERP, which performs the linear interpolation r = a*x + y*(1-x), allows single-instruction lowpass filters, parameter smoothing, and inversely proportional signal mixing (e.g., wet/dry or pan control).

LIMIT and TSTNEG are both forms of conditional move instructions, useful for threshold detection and control signal generation. The compiler synthesizes ABS() and SIGN() from TSTNEG by the right choice of GPR operands and ROM constants.

4. I/O Engine
32-channel signal input and output is accomplished in FX8010 through buffers which are mapped into GPR space. Since I/O is fully double-buffered, new input signals appear synchronously at the beginning of each sample period in the 32 input GPRs, and the contents of the 32 output GPRs disappear off-chip. In EMU10K1, physical input signals originate in the wavetable synthesizer and various AC97, I2S and S/PDIF codecs, and are output through codecs or back across PCI to the host. The RChip also supports I2S and S/PDIF, but mainly uses EMU32, a serial 32-bit, 32-channel interface, for I/O connections with other DSPs.

5. TRAM Engine
5.1 Circular Delay Addressing
The TRAM engine transfers samples between GPR-mapped buffers in the FX8010 and TRAM memory in a 1M off-chip address space. TRAM is used for delay lines as well as indexed table look-up. For delay lines, the TRAM engine uses a circular addressing mechanism, computing the absolute address of each TRAM access by adding a relative delay offset to a global base address counter modulo the entire delay address space, and decrementing the counter once per sample period. Since all delay lines recirculate within the same physical memory, modulo-addressing of individual delay lines is not required.

5.2 Decoupled TRAM Execution
Each TRAM access is implemented using a pair of buffer registers mapped into GPR memory and a third register that contains flag bits that control the type of TRAM access. One buffer GPR stores the TRAM address offset and the other stores an incoming or outgoing sample word (for reads or writes, respectively). By GPR-mapping the buffers, FX8010 programs can operate on TRAM data like any other GPR operand and can compute new TRAM offsets for modulated delay effects or table look-ups. Triples composed of these data, offset and flag registers are organized in an array of contiguous memory locations where they are operated on sequentially by the TRAM engine every sample period. If the flag register is viewed as an opcode and the data and address buffers as operands, then the TRAM engine can be seen as an independent execution unit that iterates its own simple microprogram once each sample period.

A dual-port memory architecture ensures that accesses by the execution unit and TRAM engine do not collide. Since operation of the TRAM engine is decoupled from program execution, 100% TRAM bandwidth utilization is guaranteed by design without stalling the execution unit or requiring the FX8010 programmer to manage memory transactions.

A hardware mechanism that returns zeros from each delay line until it contains valid data obviates the need for the time-consuming TRAM zeroing that would otherwise be required at program load time.

5.3 Physical TRAM Implementation
Variations in TRAM architecture constitute the major differences between the RChip and the EMU10K1 implementations of FX8010. TRAM on the RChip is implemented exclusively with external SRAMs, but in the EMU10K1 this off-chip memory can include system DRAM accessed across the PCI bus in the host processor's address space. To preserve PCI bandwidth, it has an additional block of TRAM located in a separate address space on-chip. All TRAM in the EMU10K1 is 16 bits wide, but the RChip can be programmed to accommodate 16, 24 or 32-bit wide memory. TRAM less than 32 bits wide can be accessed using a hardware-based encoding scheme that is transparent to the DSP programmer and extends the TRAM's effective dynamic range. This encoding contributes greatly to a low noise floor in recursive algorithms like reverberators, which are often plagued with noticeable truncation distortion and noise-like limit cycles, especially as feedback coefficients are increased.

While all GPRs are 32 bits wide, address offsets occupy only the top 21 bits of the TRAM buffer GPR. These MS 21 bits and the remaining LS 11 bits can be thought of as the integer and fractional part of the address, respectively. In a single MACINT instruction, the LS bits can be masked and left-shifted to become the coefficient to an INTERP instruction in order to implement a linear-interpolated delay line.

6. Microsequencer
6.1 Microprogram Control Flow
FX8010 microprograms are stored on-chip in an array of wide microinstruction memory which cannot be written to by the execution unit. The FX8010 executes in sample-locked fashion, so that the instruction rate is a fixed multiple of the sample rate. The FX8010 runs straight through its entire microinstruction array each sample period without jumps, branches, or subroutine calls, so that it is impossible to fall out of real-time operation.

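As a rough illustration of the four-operand, straight-line execution model described above, the following Python sketch runs a two-instruction microprogram once per sample period. The tuple encoding, the GPR indices, the `run_sample` and `process` helpers and the floating-point arithmetic are assumptions made for illustration only. The INTERP formula and the A/X/Y/R operand roles are taken from the text, and MAC is modelled simply as R = A + X*Y.

```python
# Hypothetical host-side model; not FX8010 microcode or an E-mu API.
# GPR indices chosen for this example only.
ZERO, IN, COEF, GAIN, STATE, OUT = range(6)

# Each instruction: (opcode, R, A, X, Y), i.e. four GPR addresses.
PROGRAM = [
    ("INTERP", STATE, STATE, COEF, IN),    # smooth the input
    ("MAC",    OUT,   ZERO,  STATE, GAIN), # scale the smoothed value
]


def run_sample(program, gpr):
    """Run the whole instruction array once, in order, with no branches."""
    for op, r, a, x, y in program:
        if op == "MAC":          # assumed semantics: R = A + X * Y
            gpr[r] = gpr[a] + gpr[x] * gpr[y]
        elif op == "INTERP":     # from the text: r = a*x + y*(1-x)
            gpr[r] = gpr[a] * gpr[x] + gpr[y] * (1.0 - gpr[x])
        else:
            raise ValueError("unknown opcode: " + op)


def process(samples, coef=0.5, gain=2.0):
    """Feed one input per sample period and collect the OUT register."""
    gpr = [0.0] * 6
    gpr[COEF], gpr[GAIN] = coef, gain
    outputs = []
    for s in samples:
        gpr[IN] = s              # stands in for the input GPR refresh
        run_sample(PROGRAM, gpr)
        outputs.append(gpr[OUT])
    return outputs


smoothed = process([1.0, 1.0, 1.0])
```

The second instruction reads STATE immediately after the first one writes it, so the sketch also relies on the hazard-free operand pipeline of Section 3.2.
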
While straight-line execution is an appropriate model for implementing linear time-invariant filters, many common audio effects require event processing at a regular sub-audio control rate or even asynchronously. FX8010 accommodates this by providing conditional move operations and conditional execution using the SKIP instruction. With this technique, the FX8010 does not actually skip forward over sequences of instructions but, based on tests of its Condition Code Register (CCR), converts these sequences into NOPs. This conditional mechanism also supports multiprogramming by skipping over areas where new programs are being loaded without disturbing other executing programs.

6.2 Condition Code Register
The FX8010 CCR holds 5 bits, some or all of which are updated after each instruction cycle:

• Z - set if the result is zero
• M - set if the result is negative (minus)
• N - set if the result is normalized (MSB=next MSB)
• S - set if the result saturated or wrapped
• B - set if a borrow occurred in a subtraction

Typical SKIP conditions such as M + Z (less than or equal to zero) or more exotic ones such as M + (S • ~M) (negative or saturated positive) can be specified by CCR test masks. By computing CCR masks and their inverses along with the proper skip counts, the compiler is able to generate if/then/else constructs nested to any depth from the familiar C-language source code syntax. While the CCR mask and skip count are typically compiler-generated constants, since they are stored in ordinary GPRs they can also be computed by the microprogram.

7. Effects Development
7.1 FX8010 Compiler
fxasm, the FX8010 program compiler, accepts source code files in a C-like expression syntax and generates a portable object code format. Currently, fxasm is integrated into an effects development system using Microsoft Developer Studio.

FX8010 programmers combine executable statements with declarations of input and output ports, GPRs, mix registers, constants, tables, and delay lines. These objects are linked symbolically to the high-level parameter control code running on the host processor through #include symbol files. All GPR, microcode and TRAM addresses emitted from the compiler are virtual; virtual-to-physical translation happens both at program load time and at run-time, when real-time parameter updates and queries are relocated on-the-fly.

7.2 Drivers
A driver stack written in C manages multi-program resource allocation, loading, patching and parameter control in real-time across multiple FX8010s. Pools of temporary GPRs and other resources are maintained for shared use by all loaded programs. An abstraction layer allows the same effects management code to run efficiently on top of both of the current FX8010 hardware implementations. For example, on a P166 running Win95, relocating and loading a typical reverberator requires approximately 2 ms.

Above the driver stack are self-contained effects software plug-ins which encapsulate the FX8010 program and the host code necessary for parameter control. In Win95 these plug-ins take the form of independent registered COM objects which can be further encapsulated in ActiveX wrappers for use by DirectShow-aware applications.

7.3 Benchmarks
The efficiency of the FX8010 architecture allows instruction counts for most algorithms to be estimated simply by the number of multiplies required. Thus a reverberator allpass filter takes two instructions and a direct form biquad requires five, as shown in the sample listings of FX8010 source code.

The implementation of FX8010 in the EMU10K1 is sufficiently powerful that the E-mu Audio Production Studio product is able to simultaneously run high-quality reverb and chorus algorithms as well as a flanger, echo, "auto-wah" envelope filter, distortion, compressor/limiter, pitch-shifter, 4 parametric EQs, 4 shelving EQs, and a large mixing matrix, all with smoothed parameter updates and ditherable outputs.

    // Allpass[] is automatically
    // allocated in TRAM
    DELAY Allpass[ 50msec ] ;
    GPR in, out ;

    out = in * .6 + Allpass[ 40msec ] ;
    Allpass[] = in - out * .6 ;

Listing 1: FX8010 source code for Reverb allpass

    // Output is in state[2]
    GPR state[4], in ;
    GPR a[3], b[2] ;   // A/B Coefficients

    ACC = in * a[0] ;
    state[1] = state[0], ACC += state[1] * a[2] ;
    state[0] = in,       ACC += state[0] * a[1] ;
    state[3] = state[2], ACC += state[3] * b[1] ;
    state[2] = ACC + state[2] * b[0] ;

Listing 2: FX8010 source code for Direct Form 1 Biquad

    // dB peak meter in 2 instr. (LOG+LIMIT)
    GPR in, peak ;
    TEMP GPR tmp ;
    tmp = ABS( LOG( in )) ;
    peak = tmp > peak ? tmp : peak ;

Listing 3: FX8010 source code for peak VU meter
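
Listing 1 can be checked on a host with a small floating-point model. The Python sketch below is an illustration under stated assumptions, not E-mu code: the `Tram` class, its method names, the 64-sample memory and the single `delay` parameter (the listing declares a 50 msec line and taps it at 40 msec) are all hypothetical. It combines the two statements of Listing 1 with the circular addressing of Section 5.1, in which a global base counter is decremented once per sample period.

```python
class Tram:
    """Hypothetical model of delay memory with one global base counter."""

    def __init__(self, size=64):
        self.mem = [0.0] * size   # starts silent, like an unfilled delay line
        self.base = 0             # global base address counter

    def _absolute(self, offset):
        # Absolute address = (base + relative offset) modulo the whole space.
        return (self.base + offset) % len(self.mem)

    def read(self, offset):
        return self.mem[self._absolute(offset)]

    def write(self, offset, sample):
        self.mem[self._absolute(offset)] = sample

    def tick(self):
        # Decrement the base once per sample period, so a word written at
        # offset 0 is found at offset k after k sample periods.
        self.base = (self.base - 1) % len(self.mem)


def allpass(samples, delay, g=0.6, tram_size=64):
    """Floating-point model of the two statements in Listing 1."""
    if not 0 < delay < tram_size:
        raise ValueError("delay must fit inside the modelled TRAM")
    tram = Tram(tram_size)
    out = []
    for x in samples:
        y = x * g + tram.read(delay)   # out = in * .6 + Allpass[ delay ]
        tram.write(0, x - y * g)       # Allpass[] = in - out * .6
        tram.tick()
        out.append(y)
    return out


impulse = [1.0] + [0.0] * 199
response = allpass(impulse, delay=4)
energy = sum(v * v for v in response)
```

With a coefficient of .6 the impulse response begins with .6, followed one delay period later by .64 and then -.384, and its total energy is 1, as expected of an allpass filter.
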