
Proceedings of the International Computer Music Conference 2011, University of Huddersfield, UK, 31 July - 5 August 2011

SUPERNOVA - A SCALABLE PARALLEL AUDIO SYNTHESIS SERVER FOR SUPERCOLLIDER

Tim Blechmann
[email protected]

ABSTRACT

SuperCollider [5] is a computer music system based on an object-oriented real-time scripting language and a separate audio synthesis server. The synthesis server is programmed using a sequential programming model and can only use one CPU core for audio synthesis, so it does not make full use of today's multi-core CPUs. In order to overcome this limitation we have implemented Supernova, a drop-in replacement for the default synthesis server 'scsynth'. Supernova introduces extensions to the sequential programming model, exposing parallelism explicitly to the SuperCollider language. The multi-threaded audio synthesis engine of Supernova is scalable and optimized for low-latency real-time applications.

1. INTRODUCTION

For many years the number of transistors per CPU has increased exponentially, roughly doubling every 18 months to 2 years. This behavior is usually referred to as 'Moore's Law'. Since the early 2000s, however, this no longer necessarily increases CPU performance, since the techniques that caused these performance gains have been maxed out [7]. Processors have been 'pipelined', executing instructions in a sequence of stages, which increases the throughput at the cost of instruction latency; since the single stages require a reduced amount of logic, pipelined CPUs allow higher clock rates. These days, most processors use out-of-order execution engines, which can execute independent instructions in parallel. Using more transistors and further increasing the clock frequency of single CPU cores would cause a cubic growth in power consumption [2], imposing practical problems for cooling, particularly for mobile devices. While power consumption decreases with each die shrink, it has been suggested that this will reach its physical limits in the near future [3].

So instead of trying to increase the CPU speed, the computer industry started to increase the number of cores per CPU. These days, most mobile solutions use dual-core processors, while workstations are available with up to 8 cores. Some researchers expect the number of CPU cores to double every 18 months [1].

Parallel architectures are not necessarily new to computer music systems. In the early days of computer music, systems like the IRCAM Signal Processing Workstation (ISPW) [10], a NeXT-based computer with up to 24 Intel i860 coprocessors, were able to perform audio synthesis in real-time and were commonly used for the production of artistic works in the early 1990s. However, the computer music systems that are in use these days mostly use a sequential programming model.

This paper is divided into the following sections. Section 2 gives an introduction to the levels of parallelism in computer music systems. Section 3 proposes two extensions to the SuperCollider node graph. Section 4 gives a rough overview of the architecture of Supernova, a replacement for the SuperCollider server scsynth with a concurrent audio synthesis engine. Section 5 presents and discusses benchmark results.

2. PARALLELIZING COMPUTER MUSIC SYSTEMS

There are different types of parallelism that applications can make use of. When discussing parallelism for multi-core processors, we will focus on thread level parallelism, which describes the parallelization of an application into separate threads. However, many audio synthesis engines can also make use of data level parallelism via SIMD (single instruction, multiple data) instruction sets like SSE or AltiVec, which process multiple samples in a single instruction. SIMD instructions are hardware-dependent, so they are usually generated by the compiler (or the low-level developer). Many recent CPUs use out-of-order execution engines, which means that they can execute multiple independent instructions at the same time. This type of parallelism is called instruction level parallelism, and it is usually the responsibility of the compiler's instruction scheduler to make optimal use of it. However, some algorithms, such as Estrin's scheme [6], make better use of instruction level parallelism for polynomial evaluation than the commonly used Horner's scheme.
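As a brief illustration (a sketch of our own, not code from any of the systems discussed in this paper), consider evaluating the cubic polynomial a0 + a1*x + a2*x^2 + a3*x^3 in C++. Horner's scheme forms a serial chain in which every operation depends on the previous one, while Estrin's scheme computes independent sub-terms that an out-of-order core can execute concurrently:

    #include <array>

    // Horner's scheme: a serial dependency chain; each step needs the
    // result of the previous one, so instruction level parallelism is low.
    double horner(const std::array<double, 4>& a, double x) {
        return a[0] + x * (a[1] + x * (a[2] + x * a[3]));
    }

    // Estrin's scheme: (a0 + a1*x) and (a2 + a3*x) are independent of each
    // other and of x*x, so they can execute in parallel, shortening the
    // critical path.
    double estrin(const std::array<double, 4>& a, double x) {
        double x2 = x * x;
        return (a[0] + a[1] * x) + (a[2] + a[3] * x) * x2;
    }

Both functions compute the same value (up to rounding); the difference lies only in the shape of the dependency chains.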
Both data level and instruction level parallelism can be used with a sequential programming model. In this section we discuss different approaches to the use of thread level parallelism for audio synthesis engines.

2.1. Pipelining


The basic approach for introducing parallelism into a sequential application is to use pipelining. The algorithm is split into sequential stages, and each stage is computed in a separate thread. To achieve the optimal speedup, all pipeline stages should take the same CPU time and the number of pipeline stages should match the number of CPU cores. While pipelining is a simple technique to increase the throughput, it usually increases the latency. In terms of computer music systems this would increase the latency of a signal, which reduces its usability for real-time applications.

Nevertheless, pipelining is used in several parallel computer music systems. It was used in FTS [9], where subpatches could be assigned to certain CPUs, and crossing CPU boundaries introduced a delay of one signal vector. The same approach has recently been reimplemented in Pure Data using the pd~ object [11], which creates a separate process for a subpatch. Jack2 [8], a multi-processor version of the Jack Audio Connection Kit, also implements pipelining to increase the throughput of its clients. To avoid the additional latency, Jack2 splits a single signal block into smaller blocks and runs its clients on these smaller block sizes. This approach can only be used if the original block size is reasonably big, because the lowest reasonable size for the pipeline blocks is the size of a cache line. A second limiting factor is that the pipeline needs to be filled and emptied for processing each audio block.
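The following sketch (our own illustration, not code from Jack2) shows the scheduling pattern of such a sub-block pipeline. For clarity the pattern is unrolled in a single thread; in a real engine each stage would run on its own core, working on a different sub-block at the same time:

    #include <cstddef>

    const std::size_t num_stages    = 2;  // two processing stages connected in series
    const std::size_t num_subblocks = 4;  // e.g. a 256-sample block split into 64-sample parts

    // Placeholder for the DSP code of one stage, applied to one sub-block.
    void run_stage(std::size_t stage, float* subblock, std::size_t frames) {}

    // One DSP tick: sub-block k may enter stage s as soon as it has left
    // stage s-1. After a short fill phase, all stages are busy with
    // different sub-blocks, keeping the added latency below one full block.
    void process_block(float* block, std::size_t subblock_frames) {
        for (std::size_t step = 0; step < num_subblocks + num_stages - 1; ++step)
            for (std::size_t s = 0; s < num_stages; ++s) {
                std::size_t k = step - s;  // sub-block handled by stage s in this step
                if (step >= s && k < num_subblocks)
                    run_stage(s, block + k * subblock_frames, subblock_frames);
            }
    }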
2.2. Graph Parallelism

A fundamentally different approach is to traverse the signal graph in parallel. While this does not introduce any additional latency to the audio signal, it leads to some implementation problems. The signal graph may contain tens to thousands of nodes, depending on the granularity of the node graph; it therefore makes sense to distinguish between coarse grained and fine grained signal graphs. We do not give a precise definition, but as a rule of thumb, fine grained signal graphs can be seen as graphs with a significant node scheduling overhead.

2.2.1. Parallelizing Fine Grained Graphs

Because of the scheduling overhead of fine grained signal graphs, it is not feasible to schedule each graph node separately, especially since sequential scheduling can be implemented very efficiently by iterating over a linearized data structure (the topologically sorted signal graph). Therefore one would need to spend some effort on combining multiple graph nodes into a single entity. While Ulrich Reiter and Andreas Partzsch have published an algorithm [12] to achieve this, it is rather impractical to use, since it does not take shared resources into account.

2.2.2. Parallelizing Coarse Grained Graphs

Coarse grained graphs do not need any graph clustering step, since by definition the scheduling overhead of their nodes can be neglected. A simple dataflow scheduling algorithm for coarse grained graphs has been implemented in Jack2 [4]. Each graph node is annotated with an activation count (initially, the number of predecessors it has) and a list of its successors. After a node has been evaluated, it decrements the activation count of each of its successors. If an activation count drops to zero, the corresponding node is ready for execution.
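A minimal C++ sketch of this scheme (our own illustration, not the Jack2 source): each node carries an atomic activation counter that is reset to its predecessor count at the start of every DSP tick, and whichever thread decrements a successor's counter to zero marks that node as ready:

    #include <atomic>
    #include <vector>

    struct graph_node {
        std::atomic<int> activation_count;    // predecessors not yet evaluated
        int predecessor_count;                // reset value for each DSP tick
        std::vector<graph_node*> successors;

        void process() { /* placeholder for the node's DSP code */ }
    };

    // Evaluate one node, then activate every successor that becomes ready.
    // A plain vector serves as the ready set here for clarity; a concurrent
    // engine needs a thread-safe structure instead (cf. Section 4.3).
    void run_node(graph_node* node, std::vector<graph_node*>& ready) {
        node->process();
        for (graph_node* succ : node->successors)
            if (succ->activation_count.fetch_sub(1) == 1)  // removed the last dependency
                ready.push_back(succ);
    }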
2.2.3. Graph Parallelism & Resource Handling

When introducing parallelism to an application, one needs to make sure that the semantics of the original program do not change. The signal graphs of many applications only contain information about explicit dependencies, which are caused by the signal flow. When graph nodes of an application access shared resources, the access order of the sequential program is usually determined by the topological sorting of the graph. To ensure the semantic correctness of the parallelized program, these implicit dependencies need to be added to the dependency graph.

[Figure 1: Signal graph with an implicit dependency. One branch writes SinOsc.ar(440) to bus 53 via Out.ar(53); an independent branch reads the bus via In.ar(53) and writes it to Out.ar(0). The access order between Out.ar(53) and In.ar(53) is not expressed by the signal flow.]

Figure 1 shows a simple ugen graph which is prone to implicit dependencies. The graph has two parts that could be evaluated in parallel if only explicit dependencies were taken into account: it is not determined by the graph order whether In.ar(53) or Out.ar(53) should be evaluated first. If both parts of the graph were evaluated in parallel, it would be undefined whether the read or the write access to the bus happens first. In fact, this is a race condition: during some DSP ticks the bus is read before it is written, during others after. This can corrupt the audio stream, since blocks may be missed or played back twice. In order to keep the semantics of the sequential version of the program, the implicit dependency from the sequential program needs to be added to the dependency graph.
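One straightforward way to derive such implicit edges is to walk the nodes in their sequential order and remember the last node that touched each bus. The following simplified sketch of our own conservatively serializes all accesses to a bus instead of distinguishing readers from writers:

    #include <unordered_map>
    #include <vector>

    struct node {
        std::vector<node*> successors;
        int predecessor_count = 0;
    };

    void add_edge(node* from, node* to) {
        from->successors.push_back(to);
        to->predecessor_count += 1;
    }

    // Called for every bus access while visiting the nodes in their
    // sequential (topologically sorted) order. Each access is made to
    // depend on the previous access to the same bus, so the parallel
    // schedule preserves the sequential access order.
    void note_bus_access(std::unordered_map<int, node*>& last_access,
                         int bus, node* n) {
        auto it = last_access.find(bus);
        if (it != last_access.end())
            add_edge(it->second, n);          // the implicit dependency edge
        last_access[bus] = n;
    }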
3. EXTENDING THE SUPERCOLLIDER NODE GRAPH

The programming model of SuperCollider distinguishes between unit generators and synths. Unit generators are used to build larger entities (synthdefs), which can be instantiated on the server as synths. Using the terms of Section 2, unit generator graphs would qualify as fine grained graphs, while the synth graph can be seen as coarse grained.


However, SuperCollider does not have a proper notion of a dependency graph. Instead, its node graph models a tree hierarchy, with synths and groups as tree nodes and a group as its root. Groups are lists of nodes, which can be used to structure the audio synthesis and to address multiple nodes as one entity. Listing 1 shows a small code example that results in the node hierarchy shown in Figure 2.

Listing 1: Using SuperCollider's Group class

var generator_group, fx;
generator_group = Group.new;
4.do {
    Synth.head(generator_group,
        \myGenerator)
};
fx = Synth.after(generator_group,
    \myFx);

[Figure 2: Node hierarchy for Listing 1. The root group contains the group generator_group, which holds four myGenerator synths, followed by the synth myFx.]

Since the node graph is explicitly exposed to the language, the user needs to take care of the correct order of execution. While this imposes some responsibility on the user, it can be modified to explicitly specify parallelism. To achieve this, we propose two extensions to the SuperCollider node graph: parallel groups and satellite nodes.

3.1. Parallel Groups

The first approach to specifying parallelism is the concept of parallel groups. Parallel groups would be available as the ParGroup class in the SuperCollider language and have semantics similar to groups. Like groups, they can contain child nodes. However, instead of the child nodes being evaluated sequentially, their order of execution is undefined, so all nodes can be evaluated in parallel. This provides the user with a simple facility to explicitly specify parallelism. Assuming that the generators of the earlier example can be evaluated in parallel, the code could be implemented using the proposed ParGroup class as shown in Listing 2. While the node hierarchy would be the same as shown in Figure 2, the dependency graph, shown in Figure 3, is different.

Listing 2: Parallel Group Example

var generator_group, fx;
generator_group = ParGroup.new;
4.do {
    Synth.head(generator_group,
        \myGenerator)
};
fx = Synth.after(generator_group,
    \myFx);

[Figure 3: Dependency graph for Listing 2. The four myGenerator synths are mutually independent; all of them precede the synth myFx.]

Introducing parallel groups has the advantage of being compatible with scsynth: scsynth can simply emulate parallel groups with sequential groups, since sequential groups provide all the semantic guarantees of parallel groups.

3.2. Satellite Nodes

While parallel groups easily fit into the concept of the SuperCollider node graph, they impose some limitations on parallelism. Members of parallel groups are synchronized in two directions: they are evaluated after all predecessors of the parallel group have been evaluated, and before all its successors. For many applications one does not need to specify both directions of synchronization; a single synchronization constraint is sufficient. In the example above, the generators need to be evaluated before the effect node, but they do not necessarily depend on a result of one of the predecessors of the parent (parallel) group. To express such single dependency relations, we introduce another concept that we call satellite nodes. Satellite predecessors have to be evaluated before their reference node, while satellite successors are evaluated after their reference node.

Listing 3 shows how satellite predecessors can be used: all generator synths are instantiated as satellite predecessors of the effect node, so they would be initially runnable. Typical use cases for satellite successors would be audio analysis synths like peak meters for GUI applications, or sound file recorders.

Listing 3: Satellite Node example

var fx = Synth.new(\myFx);
4.do {
    Synth.preceding(fx,
        \myGenerator)
};

Since satellite nodes provide a facility to specify dependencies more accurately, the parallelism of many use cases can be increased. It would even be possible to dispatch satellite nodes at a lower priority in order to optimize graph throughput.

The combination of parallel groups and satellite nodes should provide sufficient means to parallelize many use cases. They still do not model a dependency graph with arbitrary dependencies, so certain dependency graphs may be cumbersome to formulate. But this limitation also avoids problems such as cyclic dependencies, and it integrates well into the node hierarchy.

4. SUPERNOVA

Supernova is a parallel implementation of the SuperCollider server that can be used as a drop-in replacement for scsynth. It implements an extended OSC interface, which supports the necessary commands to instantiate parallel groups and satellite nodes. Supernova can dynamically load SuperCollider unit generators, although their source code needs to be slightly modified if a unit generator accesses resources like buffers or busses.

4.1. Resource Access

In scsynth, unit generators are known to be evaluated sequentially. Obviously, this is not the case for unit generators in Supernova, so in order to ensure data consistency for concurrent execution, some care needs to be taken to achieve thread safety. The main data structures that are shared among unit generators are busses and buffers.


To allow multiple readers of a resource, the unit generator API has been extended with reader-writer spinlocks for each bus and buffer. Before a unit generator can access a resource, it needs to acquire the corresponding spinlock. Since some unit generators require access to multiple resources, some care needs to be taken in order to prevent deadlocks. Therefore a simple locking policy is used: a total order of all resources is defined, and locks have to be acquired in this order. If one lock cannot be acquired, all previously acquired locks are released before the acquisition is attempted again.
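A simplified sketch of this policy (our own illustration; std::shared_mutex merely stands in for Supernova's reader-writer spinlock, and exclusive locking stands in for both access modes):

    #include <algorithm>
    #include <shared_mutex>
    #include <vector>

    // Acquire all locks a unit generator needs. The pointer value defines
    // the global total order; sorting makes the acquisition order uniform
    // across all threads, which rules out lock-order deadlocks.
    void lock_resources(std::vector<std::shared_mutex*>& locks) {
        std::sort(locks.begin(), locks.end());

        for (;;) {
            std::size_t acquired = 0;
            while (acquired < locks.size() && locks[acquired]->try_lock())
                ++acquired;

            if (acquired == locks.size())
                return;                       // all locks held

            // Back off: release everything, so no thread ever spins while
            // holding a lock, then retry from the beginning.
            for (std::size_t i = 0; i != acquired; ++i)
                locks[i]->unlock();
        }
    }

A reader would use try_lock_shared/unlock_shared instead of the exclusive calls, and a production version would spin or yield briefly before retrying.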
While this ensures atomicity for write access, it does not take all responsibility away from the user. For example, two synths may use the Out.ar ugen in parallel to write to the same bus without problems, since the actual order of the write accesses does not matter; for ReplaceOut.ar, however, the semantics would differ.

4.2. Dependency Graph

Internally, Supernova does not interpret the node graph directly, as is done in scsynth; instead, the node graph is used to create a dependency graph data structure. This data structure no longer has any notion of groups and synths; its nodes contain sequences of synths. In this representation, sequential synths are combined into a single queue node to avoid the overhead of scheduling each synth as a separate entity. While the construction of the dependency graph introduces some run-time overhead whenever the signal graph is changed, benchmarks suggest that this overhead is reasonably low (tens of microseconds, depending on the size of the graph), so it does not significantly affect the real-time safety of the audio synthesis engine.
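A hypothetical sketch of such a queue node (our own illustration of the data structure described above, reusing the activation scheme from Section 2.2.2):

    #include <atomic>
    #include <vector>

    struct synth;                              // a synth instantiated on the server

    // One node of the dependency graph: a sequence of synths that is
    // scheduled as a single unit and evaluated strictly in order.
    struct queue_node {
        std::vector<synth*> synths;            // evaluated sequentially
        std::vector<queue_node*> successors;   // queue nodes depending on this one
        std::atomic<int> activation_count;     // predecessors not yet evaluated
        int predecessor_count;                 // reset value for every DSP tick
    };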
4.3. Limitations

Real-time computer music systems for low-latency applications require a low worst-case scheduling latency, so it is not feasible to use blocking primitives for the synchronization of the main audio thread and the audio helper threads. Instead, Supernova wakes all helper threads at the beginning of the main audio callback, and all threads poll a lock-free stack containing those queue nodes that are ready for execution. This greedy use of CPU resources is not friendly to other processes: depending on the structure of the node graph, a significant amount of CPU time can be spent in the busy-waiting loops. However, unless one uses a highly tuned system running the RT preemption patches for the Linux kernel, busy waiting seems to be the only way to dispatch threads quickly enough.
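The helper thread loop can be sketched as follows (our own simplified illustration; try_pop stands for the interface of a lock-free stack such as boost::lockfree::stack, since implementing one correctly requires ABA protection and is beyond the scope of this sketch):

    #include <atomic>

    struct queue_node;                      // cf. the sketch in Section 4.2
    void run_node(queue_node*);             // evaluate the node, activate successors

    // Assumed interface of the lock-free ready stack: returns false
    // when the stack is currently empty.
    bool try_pop(queue_node*&);

    std::atomic<bool> tick_done{false};     // set once the whole graph is evaluated

    // Busy-waiting loop of one helper thread for a single DSP tick:
    // poll for runnable queue nodes instead of blocking, so new work
    // is picked up within a few cycles of becoming ready.
    void helper_thread_tick() {
        queue_node* node;
        while (!tick_done.load(std::memory_order_acquire))
            if (try_pop(node))
                run_node(node);             // may push further ready nodes
    }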
5. EXPERIMENTAL RESULTS

Supernova is designed for low-latency real-time operation. In order to evaluate its performance, we measured the execution times of the audio callback and stored them in a histogram with microsecond granularity. This approach has the advantage that it does not only measure the throughput, but also shows more detailed performance characteristics. For real-time audio synthesis, the speedup of the worst case is more interesting than the speedup of the average case, since a missed deadline would result in a possibly audible audio dropout.
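A minimal sketch of this measurement approach (our own illustration; the function names are placeholders):

    #include <array>
    #include <chrono>
    #include <cstdint>

    std::array<std::uint64_t, 100000> histogram{};   // one bin per microsecond

    void audio_callback() { /* one DSP tick of the synthesis engine */ }

    // Wrap the audio callback with a monotonic clock and count how often
    // each execution time occurs. The bookkeeping is allocation-free, so
    // it can safely run inside the real-time thread.
    void timed_callback() {
        auto start = std::chrono::steady_clock::now();
        audio_callback();
        auto stop = std::chrono::steady_clock::now();

        auto us = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        if (us >= 0 && static_cast<std::size_t>(us) < histogram.size())
            ++histogram[static_cast<std::size_t>(us)];
    }

The worst-case execution time is then simply the highest non-empty bin, while the average case is the weighted mean over all bins.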
The tests were carried out on an Intel Core i7 workstation, running Linux with RT preemption patches; its worst-case scheduling latency was measured to be about 20 microseconds. Different test graphs were examined. Each graph layout was tested with up to 4 threads using parallel groups and compared against sequential groups.


[Figure 4: Execution Time Histogram, One Parallel Group with 256 Lightweight Synths. The horizontal axis shows the execution time of the audio synthesis in microseconds (0 to 700); the curves compare a sequential group with parallel groups running on 1 to 4 threads.]

Figure 4 shows a typical histogram. One can observe different aspects: the execution time for each test case shows little spread. Most of the histogram samples are found around one peak, with a second small peak roughly 20 microseconds after the first. The time difference between the first and the second peak is in the order of magnitude of the worst-case scheduling latency that can be achieved by the workstation, so it is most likely a result of hardware effects. Since no samples can be found beyond the second peak, the implementation can be considered real-time safe.

Using the histograms, average-case and worst-case execution times can be determined and speedups can be computed. Figures 5 and 6 show the computed speedups for different use cases, both average-case and worst-case.

[Figure 5: Average Case Speedup, for five different use cases, comparing a sequential group with parallel groups on 1 to 4 threads (speedup from 0 to 4).]

[Figure 6: Worst Case Speedup, for five different use cases, comparing a sequential group with parallel groups on 1 to 4 threads (speedup from 0 to 4).]

6. CONCLUSION

This paper introduces Supernova, a replacement for SuperCollider's default synthesis server scsynth. Supernova supports two extensions to the SuperCollider node graph, so that the user can explicitly express parallelism in the node hierarchy. Its multiprocessor-aware synthesis engine is optimized for real-time audio synthesis and scales well with the number of CPUs. The next release of SuperCollider will include Supernova as an alternative to scsynth.

7. ACKNOWLEDGEMENTS

This paper is a revisited and extended version of the paper presented at the SuperCollider Symposium 2010. Many thanks to Dr. Dan Stowell and Dr. M. Anton Ertl for their valuable feedback, and to James McCartney for creating SuperCollider.

8. REFERENCES

[1] A. Buttari, J. Dongarra, J. Kurzak, J. Langou, P. Luszczek, and S. Tomov, "The impact of multicore on math software," in Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing. Springer-Verlag, 2006, pp. 1-10.

[2] S. Gochman, R. Ronen, I. Anati, A. Berkovits, T. Kurts, A. Naveh, A. Saeed, Z. Sperber, and R. C. Valentine, "The Intel Pentium M Processor: Microarchitecture and Performance," Intel Technology Journal, vol. 7, no. 2, pp. 21-36, 2003.

[3] L. B. Kish, "End of Moore's law: thermal (noise) death of integration in micro and nano electronics," Physics Letters A, vol. 305, no. 3-4, pp. 144-149, 2002.

[4] S. Letz, Y. Orlarey, and D. Fober, "Jack audio server for multi-processor machines," in Proceedings of the International Computer Music Conference, 2005.

[5] J. McCartney, "SuperCollider, a new real time synthesis language," in Proceedings of the International Computer Music Conference, 1996.

[6] J.-M. Muller, Elementary Functions: Algorithms and Implementation. Birkhäuser, 2006.

[7] K. Olukotun and L. Hammond, "The Future of Microprocessors," Queue, vol. 3, no. 7, pp. 26-29, 2005.

[8] Y. Orlarey, S. Letz, and D. Fober, "Multicore Technologies in Jack and Faust," in Proceedings of the International Computer Music Conference, 2008.

[9] M. Puckette, "Combining Event and Signal Processing in the MAX Graphical Programming Environment," Computer Music Journal, vol. 15, no. 3, pp. 68-77, 1991.

[10] M. Puckette, "FTS: A Real-time Monitor for Multiprocessor Music Synthesis," Computer Music Journal, vol. 15, no. 3, pp. 58-67, 1991.


[11] M. Puckette, "Thoughts on Parallel Computing for Music," in Proceedings of the International Computer Music Conference, 2008.

[12] U. Reiter and A. Partzsch, "Multi Core / Multi Thread Processing in Object Based Real Time Audio Rendering: Approaches and Solutions for an Optimization Problem," in Audio Engineering Society 122nd Convention, 2007.
