FX8010 - A DSP Chip Architecture For Audio Effects: Steve Hoge

1. FX8010 is a DSP chip architecture designed for audio effects processing with 32 channels and 32-bit integer operations capable of 100 million instructions per second. 2. It features parallel delay memory and I/O engines decoupled from program execution. Its architecture supports simultaneous execution of multiple independently compiled programs. 3. FX8010 has been implemented in ASICs for PC multimedia and professional audio applications, including a chip used in Creative Labs' SoundBlaster Live! products for 3D audio effects.


FX8010 - A DSP Chip Architecture for Audio Effects

Steve Hoge
Joint E-mu/Creative Technology Center
[email protected], http://www.emu.com, http://www.creaf.com
Abstract

FX8010 is a DSP chip architecture specifically designed for time-domain 3D audio and effects
processing. It is a 32-channel, 32-bit integer design that can deliver 100 MIPS at a 50 kHz audio
sample rate. It features powerful delay memory and I/O engines that execute in parallel with and are
decoupled from microprogram execution. Its highly regular architecture supports the simultaneous
execution of large numbers of separately compiled and downloaded programs with zero-overhead
signal patching. A compiler for FX8010 programs generates code from C-style expressions and
control-flow constructs. FX8010 has been implemented in two different ASICs for PC multimedia
and professional audio applications.

1. Introduction

FX8010 is a real-time digital signal processing architecture specifically designed to implement time-domain digital audio effects and multichannel mixing. By coupling a highly regular four-operand, 32-bit integer architecture with independent delay memory and I/O engines, FX8010 delivers 100 MIPS at a 50 kHz sample rate and is capable of simultaneously executing up to eight high-quality reverberators or dozens of simpler algorithms.

The FX8010 architecture has already been implemented in two different ASICs. The EMU10K1, a PCI-based wavetable synthesizer, DirectSound accelerator and audio interface chip, uses the FX8010 as part of its 3D and environmental audio effects engine and is a major element of the Creative Technology SBLive! and E-mu Audio Production Studio products. The RChip is a dedicated effects processor for embedded musical instrument and professional audio applications, and is one of several custom DSPs used in E-mu's Mantis digital audio mixing system. Numerous patents are pending on unique aspects of the FX8010 architecture.

2. Basic Architecture

The FX8010 design comprises a 32/67-bit execution unit, a 1K memory array of 32-bit General Purpose Registers (GPRs) for signals, coefficients, and addresses, and 32 channels of 32-bit signal I/O via double-buffered I/O GPRs. Microprogram storage is an array of 1K instruction words, each of which specifies one opcode and four GPR operand addresses.

The processor's opcodes include fractional and integer multiply/accumulate instructions, linear interpolation, bit-wise logical operations, conditional instruction execution and data movement, and single-cycle logarithmic and exponential conversion. The multiply/accumulate unit uses a 67-bit accumulator including 4 guard bits and can accommodate double-precision operations.

[Figure 1. Basic FX8010 Architecture. Blocks shown: host interface (data, address, control, IRQ); external TRAM (to 1M samples) and internal TRAM (EMU10K1), each served by a TRAM engine through TRAM buffer GPRs; 1K instruction memory; 1K GPR memory; execution unit; 32-channel audio inputs and 32-channel audio outputs through I/O buffer GPRs.]

Internal and external "tank" memory (TRAM) for audio delay lines and table look-up is managed by a TRAM engine that operates independently of and in parallel with microprogram execution. This engine transfers audio samples between external memory in a 1M word off-chip address space and internal dual-ported data buffer GPRs shared between the TRAM engine and the FX8010 execution unit. Up to 256 delay line or table accesses each sample period are implemented by this addressing and data move engine.

In contrast to most commercial DSP chips, the FX8010 is sample-locked and runs without jumps or branches, though conditional data movement is available and block-oriented control flow constructs can be implemented using conditional instruction execution. The architecture is specifically designed to support the execution of multiple simultaneous but independently compiled and loaded effects programs, a capability that is facilitated by the conditional execution mechanism, a lack of exposed pipelining, and the direct addressing of GPR operands.

Since FX8010 cannot operate alone but is designed to be controlled by a conventional microprocessor, a high-bandwidth host interface provides mapping of the FX8010's internal GPR and microinstruction memory directly into the host's address space. Interrupts from the FX8010 to the host can be generated under control of DSP programs and when
signal saturation (clipping) occurs. A debug facility allows the processor to be run in single-step mode.

3. Execution Unit

3.1 Arithmetic

The FX8010's 32-bit integer arithmetic exceeds the accuracy of single-precision floating point, and has sufficient dynamic range and precision for almost any audio processing or filtering operation. Total dynamic range is over 192 dB, and the center frequency resolution of a biquad filter using 32-bit coefficients is less than 1 Hz across the entire audio range. While all fractional coefficients must lie in the range of [-1.0..1.0], sufficient footroom exists to normalize most filter topologies so that their coefficients fall within this range. Limit cycles that can arise in recursive filters from asymmetric truncation of results towards -∞ are not typically an issue in FX8010 due to the small magnitude of the truncation error.

The accessibility of both the MS and LS halves of the FX8010 67-bit accumulator allows the multiply/accumulator unit to perform either fractional or integer arithmetic, depending on which half is retrieved as the result operand and how it is saturated. This accessibility also makes possible double-precision operations, if necessary. With both integer and fractional multiplication available, the FX8010 compiler can generate coefficients that implement conventional left and right shift operators (<< and >>).

3.2 GPR Operand Architecture

The FX8010 execution unit is connected directly to a 1K GPR address space for operand storage. Each FX8010 instruction includes an opcode and four independent GPR addresses that define the instruction's three input operands A, X and Y and the result operand R. For multiply/accumulate instructions, A is the accumulator while X and Y are symmetrical multiplier inputs. There are no visible hazards in the FX8010 operand pipeline, so the result operand of one instruction can become any of the input operands on the next instruction cycle.

Many DSP architectures place input registers ahead of their execution units that must be kept filled with operands by the programmer (often with parallel move operations) in order to extract maximum compute bandwidth from the processor. Keeping these registers full at all times is a challenging problem of operand sequencing and sometimes even memory layout, and becomes one of the arcane skills of the DSP programmer. By contrast, the FX8010 math unit has no such input registers; instead, all instructions fetch their operands by directly addressing GPR memory. In this respect the FX8010 architecture is very programmer-friendly.

Similarly, most DSPs also have volatile accumulators or output registers that must be saved with data move instructions or recycled as "special" input operands. While the FX8010 also has a volatile accumulator output register, FX8010 result operands are typically ordinary GPRs whose data movement is implicit in the result operand address. This accumulator can be reused by explicitly specifying its GPR-mapped address as the A operand, but this is only necessary when extra headroom (the accumulator guard bits) or precision (the LS 32 of the accumulator's 67 bits) needs to be retained through the next multiply/accumulate instruction.

The accumulator is one of several special registers that are GPR-mapped, including the condition code register (CCR), interrupt register, read-only delay line and table base address registers, and noise (dither) sources. In addition, FX8010 maps a collection of useful ROM constants into GPR space that are used as implicit operands in many instructions.

3.3 Instruction Set

Opcode    Operation
MAC       Fractional multiply/add/subtract with optional saturation/word wrap
MACINT    Integer multiply/add/subtract with optional saturation/word wrap
ACC3      Accumulate 3 inputs with saturation
MACMV     Multiply/accumulate with additional data move
SKIP      Conditionally skip over instructions
ANDXOR    Multi-purpose bitwise logical instruction
TSTNEG    Test and conditionally negate the result
LIMIT     Test and conditionally output a higher/lower threshold
LOG       Convert linear to logarithmic representation
EXP       Convert logarithmic to linear representation
INTERP    Linear interpolate between two values

Table 1: Typical FX8010 opcodes

As seen in the table of opcodes, the FX8010's instructions implement traditional DSP arithmetic as well as some more unusual operations:

MACMV performs multiply/accumulation on X, Y and the accumulator, while in parallel moving the A operand to R. This simultaneously accomplishes the MAC and data shift required for FIR filtering.

ANDXOR (R = A & X ^ Y) allows the FX8010 compiler to take advantage of the 4-operand architecture and built-in ROM constant GPRs to synthesize bitwise AND, XOR, NOR, NAND, NOT and OR operations from a single opcode.

LOG and EXP perform transformations to and from a sign|exponent|mantissa representation with a programmable maximum exponent. Applications are data compression, dB conversion, waveshaping and log domain arithmetic approximating division and
roots. Interesting distortion effects are also possible, especially by modulating the exponent size.

INTERP, which performs the linear interpolation r = a*x + y*(1-x), allows single-instruction lowpass filters, parameter smoothing, and inversely proportional signal mixing (e.g., wet/dry or pan control).

LIMIT and TSTNEG are both forms of conditional move instructions, useful for threshold detection and control signal generation. The compiler synthesizes ABS() and SIGN() from TSTNEG by the right choice of GPR operands and ROM constants.

4. I/O Engine

32-channel signal input and output is accomplished in FX8010 through buffers which are mapped into GPR space. Since I/O is fully double-buffered, new input signals appear synchronously at the beginning of each sample period in the 32 input GPRs, and the contents of the 32 output GPRs disappear off-chip. In EMU10K1, physical input signals originate in the wavetable synthesizer and various AC97, I2S and S/PDIF codecs, and are output through codecs or back across PCI to the host. The RChip also supports I2S and S/PDIF, but mainly uses EMU32, a serial 32-bit, 32-channel interface, for I/O connections with other DSPs.

5. TRAM Engine

5.1 Circular Delay Addressing

The TRAM engine transfers samples between GPR-mapped buffers in the FX8010 and TRAM memory in a 1M off-chip address space. TRAM is used for delay lines as well as indexed table look-up. For delay lines, the TRAM engine uses a circular addressing mechanism, computing the absolute address of each TRAM access by adding a relative delay offset to a global base address counter modulo the entire delay address space, and decrementing the counter once per sample period. Since all delay lines recirculate within the same physical memory, modulo-addressing of individual delay lines is not required.

5.2 Decoupled TRAM Execution

Each TRAM access is implemented using a pair of buffer registers mapped into GPR memory and a third register that contains flag bits that control the type of TRAM access. One buffer GPR stores the TRAM address offset and the other stores an incoming or outgoing sample word (for reads or writes, respectively). By GPR-mapping the buffers, FX8010 programs can operate on TRAM data like any other GPR operand and can compute new TRAM offsets for modulated delay effects or table look-ups. Triples composed of these data, offset and flag registers are organized in an array of contiguous memory locations where they are operated on sequentially by the TRAM engine every sample period. If the flag register is viewed as an opcode and the data and address buffers as operands, then the TRAM engine can be seen as an independent execution unit that iterates its own simple microprogram once each sample period.

A dual-port memory architecture ensures that accesses by the execution unit and TRAM engine do not collide. Since operation of the TRAM engine is decoupled from program execution, 100% TRAM bandwidth utilization is guaranteed by design without stalling the execution unit or requiring the FX8010 programmer to manage memory transactions.

A hardware mechanism that returns zeros from each delay line until it contains valid data obviates the need for the time-consuming TRAM zeroing that would otherwise be required at program load time.

5.3 Physical TRAM Implementation

Variations in TRAM architecture constitute the major differences between the RChip and the EMU10K1 implementations of FX8010. TRAM on the RChip is implemented exclusively with external SRAMs, but in the EMU10K1 this off-chip memory can include system DRAM accessed across the PCI bus in the host processor's address space. To preserve PCI bandwidth, the EMU10K1 has an additional block of TRAM located in a separate address space on-chip. All TRAM in the EMU10K1 is 16 bits wide, but the RChip can be programmed to accommodate 16, 24 or 32-bit wide memory. TRAM less than 32 bits wide can be accessed using a hardware-based encoding scheme that is transparent to the DSP programmer and extends the TRAM's effective dynamic range. This encoding contributes greatly to a low noise floor in recursive algorithms like reverberators, which are often plagued with noticeable truncation distortion and noise-like limit cycles, especially as feedback coefficients are increased.

While all GPRs are 32 bits wide, address offsets occupy only the top 21 bits of the TRAM buffer GPR. These MS 21 bits and the remaining LS 11 bits can be thought of as the integer and fractional part of the address, respectively. In a single MACINT instruction, the LS bits can be masked and left-shifted to become the coefficient to an INTERP instruction in order to implement a linear-interpolated delay line.

6. Microsequencer

6.1 Microprogram Control Flow

FX8010 microprograms are stored on-chip in an array of wide microinstruction memory which cannot be written to by the execution unit. The FX8010 executes in sample-locked fashion, so that the instruction rate is a fixed multiple of the sample rate. The FX8010 runs straight through its entire microinstruction array each sample period without jumps, branches, or subroutine calls, so that it is impossible to fall out of real-time operation.
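As a host-side illustration of the TRAM addressing described in Sections 5.1 and 5.3, the following Python sketch models a shared circular delay memory with a decrementing base counter, plus a fractional read built on the INTERP formula r = a*x + y*(1-x). It is a minimal model under stated assumptions, not the hardware's behavior: the names `Tram`, `read_fractional` and `run_ramp` are hypothetical, the memory is 1024 words rather than 1M, the offset word is treated simply as an integer part above an 11-bit fraction, and the choice of which neighbouring sample the fraction weights is a guess.

```python
TRAM_WORDS = 1024   # illustrative size; the paper describes a 1M-word space
FRAC_BITS = 11      # LS 11 bits of the offset GPR hold the fractional address


class Tram:
    """Shared delay memory addressed relative to a global base counter."""

    def __init__(self, words=TRAM_WORDS):
        self.mem = [0] * words   # zero-filled, like reads before valid data
        self.base = 0            # global base address counter

    def _addr(self, offset):
        # Absolute address = (base + offset) modulo the whole delay space.
        return (self.base + offset) % len(self.mem)

    def write(self, offset, sample):
        self.mem[self._addr(offset)] = sample

    def read(self, offset):
        return self.mem[self._addr(offset)]

    def tick(self):
        # The base counter is decremented once per sample period.
        self.base = (self.base - 1) % len(self.mem)


def interp(a, x, y):
    """INTERP as given in the paper: r = a*x + y*(1-x)."""
    return a * x + y * (1 - x)


def read_fractional(tram, offset_gpr):
    """Read at a delay held as integer part followed by an 11-bit fraction."""
    whole = offset_gpr >> FRAC_BITS
    frac = (offset_gpr & ((1 << FRAC_BITS) - 1)) / (1 << FRAC_BITS)
    # Assumed weighting: frac moves the tap toward the older sample.
    return interp(tram.read(whole + 1), frac, tram.read(whole))


def run_ramp(n_samples=10, delay=3):
    """Write a ramp at offset 0 and read it back `delay` samples later."""
    tram = Tram()
    out = []
    for n in range(n_samples):
        tram.write(0, 100 * n)
        out.append(tram.read(delay))
        tram.tick()
    return out
```

Because every access is relative to the moving base, a write at offset 0 is found at offset d exactly d sample periods later, and no per-line wrap logic is needed; `run_ramp(10, 3)` returns `[0, 0, 0, 0, 100, 200, 300, 400, 500, 600]`.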
While straight-line execution is an appropriate model for implementing linear time-invariant filters, many common audio effects require event processing at a regular sub-audio control rate or even asynchronously. FX8010 accommodates this by providing conditional move operations and conditional execution using the SKIP instruction. With this technique, the FX8010 does not actually skip forward over sequences of instructions but, based on tests of its Condition Code Register (CCR), converts these sequences into NOPs. This conditional mechanism also supports multiprogramming by skipping over areas where new programs are being loaded without disturbing other executing programs.

6.2 Condition Code Register

The FX8010 CCR holds 5 bits, some or all of which are updated after each instruction cycle:

• Z - set if the result is zero
• M - set if the result is negative (minus)
• N - set if the result is normalized (MSB=next MSB)
• S - set if the result saturated or wrapped
• B - set if a borrow occurred in a subtraction

Typical SKIP conditions such as M + Z (less than or equal to zero) or more exotic ones such as M + (S · ~M) (negative or saturated positive) can be specified by CCR test masks. By computing CCR masks and their inverses along with the proper skip counts, the compiler is able to generate if/then/else constructs nested to any depth from the familiar C-language source code syntax. While the CCR mask and skip count are typically compiler-generated constants, since they are stored in ordinary GPRs they can also be computed by the microprogram.

7. Effects Development

7.1 FX8010 Compiler

fxasm, the FX8010 program compiler, accepts source code files in a C-like expression syntax and generates a portable object code format. Currently, fxasm is integrated into an effects development system using Microsoft Developer Studio.

FX8010 programmers combine executable statements with declarations of input and output ports, GPRs, mix registers, constants, tables, and delay lines. These objects are linked symbolically to the high-level parameter control code running on the host processor through #include symbol files. All GPR, microcode and TRAM addresses emitted from the compiler are virtual; virtual-to-physical translation happens both at program load time and at run-time, when real-time parameter updates and queries are relocated on-the-fly.

7.2 Drivers

A driver stack written in C manages multi-program resource allocation, loading, patching and parameter control in real-time across multiple FX8010s. Pools of temporary GPRs and other resources are maintained for shared use by all loaded programs. An abstraction layer allows the same effects management code to run efficiently on top of both of the current FX8010 hardware implementations. For example, on a P166 running Win95, relocating and loading a typical reverberator requires approximately 2 ms.

Above the driver stack are self-contained effects software plug-ins which encapsulate the FX8010 program and the host code necessary for parameter control. In Win95 these plug-ins take the form of independent registered COM objects which can be further encapsulated in ActiveX wrappers for use by DirectShow-aware applications.

7.3 Benchmarks

The efficiency of the FX8010 architecture allows instruction counts for most algorithms to be estimated simply by the number of multiplies required. Thus a reverberator allpass filter takes two instructions and a direct form biquad requires five, as shown in the sample listings of FX8010 source code.

The implementation of FX8010 in the EMU10K1 is sufficiently powerful that the E-mu Audio Production Studio product is able to simultaneously run high-quality reverb and chorus algorithms as well as a flanger, echo, "auto-wah" envelope filter, distortion, compressor/limiter, pitch-shifter, 4 parametric EQs, 4 shelving EQs, and a large mixing matrix, all with smoothed parameter updates and ditherable outputs.

    // Allpass[] is automatically
    // allocated in TRAM
    DELAY Allpass[ 50msec ] ;
    GPR in, out ;

    out = in * .6 + Allpass[ 40msec ] ;
    Allpass[] = in - out * .6 ;

Listing 1: FX8010 source code for Reverb allpass

    // Output is in state[2]
    GPR state[4], in ;
    GPR a[3], b[2] ;            // A/B Coefficients

    ACC = in * a[0] ;
    state[1] = state[0], ACC += state[1] * a[2] ;   // x[n-2] term, then shift x
    state[0] = in,       ACC += state[0] * a[1] ;   // x[n-1] term, then store x
    state[3] = state[2], ACC += state[3] * b[1] ;   // y[n-2] term, then shift y
    state[2] = ACC + state[2] * b[0] ;              // y[n-1] term, new output

Listing 2: FX8010 source code for Direct Form 1 Biquad

    // dB peak meter in 2 instr. (LOG+LIMIT)
    GPR in, peak ;
    TEMP GPR tmp ;
    tmp = ABS( LOG( in )) ;
    peak = tmp > peak ? tmp : peak ;

Listing 3: FX8010 source code for peak VU meter
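To make the two statements of Listing 1 concrete, the following Python sketch models the same recurrence in floating point on a host. It is an illustration under assumptions rather than a description of fxasm output: `make_allpass` and `impulse_response` are hypothetical names, the delay is given in samples instead of milliseconds, a single delay length stands in for the 40 msec tap of the 50 msec line, and 32-bit fixed-point effects such as saturation and truncation are ignored.

```python
from collections import deque


def make_allpass(delay_samples, g=0.6):
    """Return a per-sample function modeling Listing 1 in floating point."""
    # Delay line pre-filled with zeros, mirroring TRAM reads that return
    # zero until the line holds valid data.
    line = deque([0.0] * delay_samples, maxlen=delay_samples)

    def step(x):
        delayed = line[0]          # Allpass[ tap ]: written delay_samples ago
        out = x * g + delayed      # out = in * .6 + Allpass[ tap ]
        line.append(x - out * g)   # Allpass[] = in - out * .6
        return out

    return step


def impulse_response(delay_samples, n, g=0.6):
    """First n output samples for a unit impulse input."""
    step = make_allpass(delay_samples, g)
    return [step(1.0 if i == 0 else 0.0) for i in range(n)]
```

Each call to `step` performs two multiplies, one per source statement, which matches the paper's estimate of two instructions for an allpass. The recurrence has the transfer function (g + z^-D) / (1 + g z^-D), so the impulse response carries unit energy: with a 4-sample delay it starts at about 0.6, then about 0.64 at sample 4 and about -0.384 at sample 8, with zeros in between.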
