Multicore Processing
Typographical conventions
Throughout this manual, we use certain typographical conventions to distinguish
technical terms. In general, the conventions we use conform to those found in IEEE
POSIX publications. The following table summarizes our conventions:
Reference                   Example
Code examples               if( stream == NULL )
Command options             -lR
Commands                    make
Environment variables       PATH
File and pathnames          /dev/null
Function names              exit()
Keyboard chords             Ctrl-Alt-Delete
Keyboard input              something you type
Keyboard keys               Enter
Program output              login:
Programming constants       NULL
Programming data types      unsigned short
Programming literals        0xFF, "message string"
Variable names              stdin
User-interface components   Cancel
We use an arrow (→) in directions for accessing menu items, like this:
CAUTION: Cautions tell you about commands or procedures that may have
unwanted or undesirable side effects.
Technical support
To obtain technical support for any QNX product, visit the Support area on our
website (www.qnx.com). You’ll find a wide range of support options, including
community forums.
• MIPS-based systems
• PowerPC-based systems
If you have one of these systems, then you’re probably itching to try it out, but are
wondering what you have to do to get Neutrino running on it. Well, the answer is not
much. The only part of Neutrino that’s different for a multiprocessor system is the
microkernel — another example of the advantages of a microkernel architecture!
To determine how many processors there are on your system, look at the num_cpu
entry of the system page. For more information, see “Structure of the system page” in
the Customizing Image Startup Programs chapter of Building Embedded Systems.
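As a minimal sketch, a program could read the processor count as shown below. The `cpu_count()` helper is a hypothetical name introduced for illustration. On Neutrino it reads the `num_cpu` entry through `_syspage_ptr` from `<sys/syspage.h>`; on other POSIX hosts it falls back to `sysconf()`, which is an assumption made only so that the sketch builds elsewhere.

```c
#include <assert.h>
#include <unistd.h>
#ifdef __QNXNTO__
#include <sys/syspage.h>
#endif

/* Hypothetical helper: return the number of processors the OS reports. */
static unsigned cpu_count(void)
{
    unsigned n;
#ifdef __QNXNTO__
    /* On Neutrino, read the num_cpu entry of the system page. */
    n = _syspage_ptr->num_cpu;
#else
    /* Assumption: on other hosts, sysconf() stands in for the system page. */
    long online = sysconf(_SC_NPROCESSORS_ONLN);
    n = (online > 0) ? (unsigned)online : 1u;
#endif
    assert(n >= 1);   /* a running system has at least one processor */
    return n;
}
```

A caller might use the result to decide how many worker threads to create.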
SMP lets you get the most performance out of your system, but you might need to use
BMP for the few applications that may not work under SMP, or if you want to
explicitly control the process-level distribution of CPU usage.
In this chapter. . .
Setting up the OS image
Trying symmetric multiprocessing
Trying bound multiprocessing
This chapter gives you a quick hands-on introduction to multicore processing. The
main steps are as follows:
2 Go to the directory that holds the buildfile for your system’s boot image (e.g.
/boot/build).
In a real buildfile, you can’t use a backslash (\) to break a long line into shorter pieces,
but we’ve done that here, just to make the command easier to read.
Although the multiprocessing version of procnto has “SMP” in its name, it also
supports BMP. You can even use bound and symmetric multiprocessing
simultaneously on the same system.
9 Put the new image in place. To ensure that you can still boot your system if
an error occurs, we recommend the following:
2 Start some processes that run indefinitely. For example, use the hogs utility to
display which processes are using the most CPU:
hogs -n -%10
3 Use pidin sched to see which processor your processes are running on.
If you’re using the IDE, you can use the System Information perspective to
watch the threads migrate.
return EXIT_SUCCESS;
}
On a uniprocessor system, this would consume all the processing time (unless
you’re using adaptive partitioning). On a multicore system, it consumes all the
time on one processor.
6 Use pidin sched to see which processor your other processes are running on.
They’re likely running on different processors from greedy.
on -C 0 ksh
2 Start some new processes from this shell. Note that they run only on the first
processor.
3 Use the -C or -R option (or both) to slay to change the runmask for one of
these processes. Note that the process runs only on the processors that you just
specified, while any children run on the processors you specified for the shell.
4 Use the -C or -R option (or both) and the -i option to slay to change the
runmask and inherit mask for one of these processes. Note that the process and
its children run only on the newly specified processors.
In this chapter. . .
Building a multicore image
The impact of multicore
Designing with multiprocessing in mind
[virtual=x86,bios] .bootstrap = {
    startup-bios
    PATH=/proc/boot procnto-smp
}
[+script] .script = {
    devc-con -e &
    reopen /dev/con1
    [+session] PATH=/proc/boot esh &
}
libc.so
[type=link] /usr/lib/ldqnx.so.2=/proc/boot/libc.so
[data=copy]
devc-con
esh
ls
After building the image, you proceed in the same way as you would with a
single-processor system.
It’s also possible to run the multicore kernel on a uniprocessor system, but it requires a
486 or higher on x86 architectures, and a multicore-capable implementation on MIPS
and PPC.
Thread affinity
One issue that often arises in a multicore environment can be put like this: “Can I
make it so that one processor handles the GUI, another handles the database, and the
other two handle the realtime functions?”
The answer is: “Yes, absolutely.”
This is done through the magic of thread affinity, the ability to associate certain
programs (or even threads within programs) with a particular processor or processors.
Thread affinity works like this. When a thread starts up, its affinity mask (or runmask)
is set to allow it to run on all processors. This implies that there’s no inheritance of the
thread affinity mask, so it’s up to the thread to use ThreadCtl() with the
_NTO_TCTL_RUNMASK control flag to set its runmask:
The runmask is simply a bitmap; each bit position indicates a particular processor. For
example, the runmask 0x05 (binary 00000101) allows the thread to run on processors
0 (the 0x01 bit) and 2 (the 0x04 bit).
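The bit arithmetic above can be sketched in a few lines of C. The helper names (`runmask_from_cpus()`, `runmask_allows()`, `apply_runmask()`) are hypothetical and are not part of the Neutrino API. The sketch assumes a runmask that fits in one unsigned integer. The `ThreadCtl()` call is compiled only on Neutrino and passes the mask as the data argument.

```c
#include <assert.h>
#include <limits.h>
#ifdef __QNXNTO__
#include <stdint.h>
#include <sys/neutrino.h>
#endif

#define MASK_BITS ((int)(sizeof(unsigned) * CHAR_BIT))

/* Build a one-word runmask from a list of CPU numbers (counted from 0). */
static unsigned runmask_from_cpus(const int *cpus, int count)
{
    unsigned mask = 0;
    for (int i = 0; i < count; i++) {
        assert(cpus[i] >= 0 && cpus[i] < MASK_BITS);
        mask |= 1u << cpus[i];
    }
    return mask;
}

/* Return nonzero if the runmask lets a thread run on the given CPU. */
static int runmask_allows(unsigned mask, int cpu)
{
    assert(cpu >= 0 && cpu < MASK_BITS);
    return (int)((mask >> cpu) & 1u);
}

#ifdef __QNXNTO__
/* Restrict the calling thread to the CPUs in mask (Neutrino only). */
static int apply_runmask(unsigned mask)
{
    return ThreadCtl(_NTO_TCTL_RUNMASK, (void *)(uintptr_t)mask);
}
#endif
```

With CPUs 0 and 2 in the list, `runmask_from_cpus()` produces 0x05, which matches the example in the text.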
The <sys/neutrino.h> file defines some macros that you can use to work with a
runmask:
RMSK_SET(cpu, p)
Set the bit for cpu in the mask pointed to by p.
RMSK_CLR(cpu, p)
Clear the bit for cpu in the mask pointed to by p.
RMSK_ISSET(cpu, p)
Determine if the bit for cpu is set in the mask pointed to by p.
The CPUs are numbered from 0. These macros work with runmasks of any length.
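To illustrate how bitmap macros of this kind can work on a mask that spans several unsigned integers, here is a hedged sketch. The `DEMO_RMSK_*` macros and `demo_count_cpus()` are hypothetical stand-ins written for this example; they are not the definitions from `<sys/neutrino.h>`, whose implementation may differ.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for RMSK_SET(), RMSK_CLR(), and RMSK_ISSET(). */
#define DEMO_BITS (sizeof(unsigned) * CHAR_BIT)
#define DEMO_RMSK_SET(cpu, p) \
    ((p)[(cpu) / DEMO_BITS] |= (1u << ((cpu) % DEMO_BITS)))
#define DEMO_RMSK_CLR(cpu, p) \
    ((p)[(cpu) / DEMO_BITS] &= ~(1u << ((cpu) % DEMO_BITS)))
#define DEMO_RMSK_ISSET(cpu, p) \
    (((p)[(cpu) / DEMO_BITS] >> ((cpu) % DEMO_BITS)) & 1u)

/* Count the CPUs enabled in a mask that covers num_cpus processors. */
static unsigned demo_count_cpus(const unsigned *mask, unsigned num_cpus)
{
    unsigned n = 0;
    assert(mask != NULL);
    for (unsigned cpu = 0; cpu < num_cpus; cpu++) {
        if (DEMO_RMSK_ISSET(cpu, mask)) {
            n++;
        }
    }
    return n;
}
```

Each macro picks the array element with a division and the bit within it with a remainder, which is why the same code works for a mask of any length.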
Bound multiprocessing (BMP) is a variation on SMP that lets you specify which
processors a process or thread and its children can run on. To specify this, you use an
inherit mask.
To set a thread’s inherit mask, you use ThreadCtl() with the
_NTO_TCTL_RUNMASK_GET_AND_SET_INHERIT control flag. Conceptually, the
structure that you pass with this command is as follows:
struct _thread_runmask {
    int size;
    unsigned runmask[size];
    unsigned inherit_mask[size];
};
If you set the runmask member to a nonzero value, ThreadCtl() sets the runmask of the
calling thread to the specified value. If you set the runmask member to zero, the
runmask of the calling thread isn’t altered.
If you set the inherit_mask member to a nonzero value, ThreadCtl() sets the calling
thread’s inherit mask to the specified value(s); if the calling thread creates any
children by calling pthread_create(), fork(), spawn(), vfork(), or exec(), the children
inherit this mask. If you set the inherit_mask member to zero, the calling thread’s
inherit mask isn’t changed.
If you look at the definition of _thread_runmask in <sys/neutrino.h>, you’ll
see that it’s actually declared like this:
struct _thread_runmask {
    int size;
    /* unsigned runmask[size]; */
    /* unsigned inherit_mask[size]; */
};
This is because the number of elements in the runmask and inherit_mask arrays
depends on the number of processors in your multicore system. You can use the
RMSK_SIZE() macro to determine how many unsigned integers you need for the
masks; pass the number of CPUs (found in the system page) to this macro.
Here’s a code snippet that shows how to set up the runmask and inherit mask:
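/* Requires <stdlib.h>, <string.h>, <sys/neutrino.h>, and <sys/syspage.h>. */
unsigned num_elements = 0;
int *rsizep, masksize_bytes, size;
unsigned *rmaskp, *imaskp;
void *my_data;
/* Determine how many unsigned integers each mask needs,
   based on the number of CPUs in the system. */
num_elements = RMSK_SIZE(_syspage_ptr->num_cpu);
masksize_bytes = num_elements * sizeof(unsigned);
/* Allocate space for the size member and the two masks. */
size = sizeof(int) + 2 * masksize_bytes;
if ((my_data = malloc(size)) == NULL) {
    /* Not enough memory. */
} else {
    memset(my_data, 0x00, size);
    /* Set up pointers to the "members" of the structure. */
    rsizep = (int *)my_data;
    rmaskp = (unsigned *)(rsizep + 1);
    imaskp = rmaskp + num_elements;
    *rsizep = num_elements;
    /* Set the runmask. Call this macro once for each processor
       the thread can run on. */
    RMSK_SET(cpu1, rmaskp);
    /* Set the inherit mask. Call this macro once for each
       processor the thread’s children can run on. */
    RMSK_SET(cpu1, imaskp);
    if ( ThreadCtl( _NTO_TCTL_RUNMASK_GET_AND_SET_INHERIT,
                    my_data) == -1) {
        /* Something went wrong. */
        .
        .
        .
    }
}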
You can also use the -C and -R options to the on command to launch processes with a
runmask (assuming they don’t set their runmasks programmatically); for example, use
on -C 1 io-pkt-v4 to start io-pkt-v4 and lock all threads to CPU 1. This
command sets both the runmask and the inherit mask.
You can also use the same options to the slay command to modify the runmask of a
running process or thread. For example, slay -C 0 io-pkt-v4 moves all of
io-pkt-v4’s threads to run on CPU 0. If you use the -C and -R options, slay sets
the runmask; if you also use the -i option, slay also sets the process’s or thread’s
inherit mask to be the same as the runmask.
This FIFO trick won’t work on an SMP system, because both threads may run
simultaneously on different processors. You’ll have to use the more “proper” thread
synchronization primitives (e.g. a mutex), or use BMP to tie the threads to specific
CPUs.
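As a minimal sketch of the mutex approach, the following code protects a shared counter that two threads update. The names `counter_lock`, `shared_counter`, `add_many()`, and `run_two_workers()` are hypothetical and chosen for this example. Because every update happens while the mutex is held, the result doesn't depend on the scheduling policy or on how many processors run the threads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;

/* Each worker adds to the shared counter. The mutex makes the
   read-modify-write safe even when workers run on different CPUs. */
static void *add_many(void *arg)
{
    long times = *(long *)arg;
    for (long i = 0; i < times; i++) {
        pthread_mutex_lock(&counter_lock);
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Run two workers to completion and return the final counter value. */
static long run_two_workers(long times)
{
    pthread_t a, b;
    int rc;

    shared_counter = 0;
    rc = pthread_create(&a, NULL, add_many, &times);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, add_many, &times);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

Without the lock, the two increments could interleave on separate processors and some updates would be lost.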
We recommend that you always use InterruptLock() and InterruptUnlock(), both in the
thread and in the ISR. The small amount of extra overhead on a single-processor box
is negligible.
Function                 Operation
atomic_add()             Add a number
atomic_add_value()       Add a number and return the original value of *loc
atomic_clr()             Clear bits
atomic_clr_value()       Clear bits and return the original value of *loc
atomic_set()             Set bits
atomic_set_value()       Set bits and return the original value of *loc
atomic_sub()             Subtract a number
atomic_sub_value()       Subtract a number and return the original value of *loc
atomic_toggle()          Toggle (complement) bits
atomic_toggle_value()    Toggle (complement) bits and return the original value of *loc
The *_value() functions may be slower on some systems, so don’t use them unless
you really want the return value.
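The difference between the plain and the *_value() forms can be illustrated with a portable sketch. It uses the C11 `<stdatomic.h>` functions as a stand-in, not the Neutrino atomic_*() functions listed above, so the calls and their signatures here are C11's. The wrapper names and the two variables are hypothetical.

```c
#include <stdatomic.h>

static atomic_uint event_count;   /* static storage, so it starts at zero */
static atomic_uint status_bits;

/* Add to the counter and ignore the result (compare atomic_add()). */
static void note_events(unsigned n)
{
    (void)atomic_fetch_add(&event_count, n);
}

/* Add to the counter and return the value it held before the addition
   (compare atomic_add_value()). */
static unsigned note_events_get_previous(unsigned n)
{
    return atomic_fetch_add(&event_count, n);
}

/* Set bits and return the original value (compare atomic_set_value()). */
static unsigned set_status(unsigned bits)
{
    return atomic_fetch_or(&status_bits, bits);
}

/* Clear bits and return the original value (compare atomic_clr_value()). */
static unsigned clear_status(unsigned bits)
{
    return atomic_fetch_and(&status_bits, ~bits);
}
```

In each case the update is a single indivisible operation, so no other thread or processor can observe or modify the variable halfway through it.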
Adaptive partitioning
You can use adaptive partitioning on a multicore system, but there are some
interactions to watch out for. For more information, see “Using adaptive partitioning
and multicore together” in the Adaptive Partitioning Scheduling Details chapter of the
Adaptive Partitioning User’s Guide.
(i.e. one flow doesn’t rely on the results of another), this can be a good candidate for
parallelization within the process by starting multiple threads. Consider the following
graphics program snippet:
do_graphics ()
{
    int x;
    for (x = 0; x < XRESOLUTION; x++)
        do_one_line (x);
}
In the above example, we’re doing ray-tracing. We’ve looked at the problem and
decided that the function do_one_line() only generates output to the screen — it
doesn’t rely on the results from any other invocation of do_one_line().
To make optimal use of a multicore system, you would start multiple threads, each
running on one processor.
The question then becomes how many threads to start. Obviously, starting
XRESOLUTION threads (where XRESOLUTION is far greater than the number of
processors, perhaps 1024 to 4) isn’t a particularly good idea — you’re creating a lot of
threads, all of which will consume stack resources and kernel resources as they
compete for the limited pool of CPUs.
A simple solution would be to find out the number of CPUs that you have available to
you (via the system page pointer) and divide the work up that way:
#include <sys/syspage.h>
int num_x_per_cpu;
do_graphics ()
{
int num_cpus;
int i;
pthread_t *tids;
void *
do_lines (void *arg)
{
int cpunum = (int) arg; // convert void * to an integer
int x;
The above approach lets the maximum number of threads run simultaneously on the
multicore system. There’s no point creating more threads than there are CPUs,
because they’ll simply compete with each other for CPU time.
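The following self-contained sketch shows one way to divide the lines among one thread per CPU. It is an illustration under assumptions, not the manual's listing: XRESOLUTION is given a hypothetical value, `do_one_line()` is a stand-in that only records that a line was processed, the `slice` structure is invented for this example, and the caller supplies the CPU count (for example, from the system page).

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define XRESOLUTION 1024                 /* hypothetical number of lines */

static unsigned char line_done[XRESOLUTION];

/* Hypothetical stand-in for the renderer: just record the line. */
static void do_one_line(int x)
{
    line_done[x] = 1;
}

struct slice { int first; int last; };   /* lines in [first, last) */

static void *do_lines(void *arg)
{
    const struct slice *s = arg;
    for (int x = s->first; x < s->last; x++) {
        do_one_line(x);
    }
    return NULL;
}

/* Give each CPU one contiguous slice of lines and process the slices
   in parallel. Returns 0 on success, -1 on failure. */
static int do_graphics(int num_cpus)
{
    assert(num_cpus >= 1);
    pthread_t *tids = malloc(num_cpus * sizeof *tids);
    struct slice *slices = malloc(num_cpus * sizeof *slices);
    int started = 0, rc = 0;

    if (tids == NULL || slices == NULL) {
        free(tids);
        free(slices);
        return -1;
    }
    for (int i = 0; i < num_cpus; i++) {
        /* Integer arithmetic spreads any remainder across the slices. */
        slices[i].first = (int)((long)XRESOLUTION * i / num_cpus);
        slices[i].last = (int)((long)XRESOLUTION * (i + 1) / num_cpus);
        if (pthread_create(&tids[i], NULL, do_lines, &slices[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    free(tids);
    free(slices);
    return rc;
}
```

Passing each thread a pointer to its own slice avoids converting an integer to and from a pointer, and the slice boundaries cover every line exactly once even when XRESOLUTION isn't a multiple of the CPU count.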
Note that in this example, we didn’t specify which processor to run each thread on.
We don’t need to in this case, because the READY thread with the highest priority
always runs on the next available processor. The threads will tend to run on different
processors (depending on what else is running in the system). You typically use the
same priority for all the worker threads if they’re doing similar work.
An alternative approach is to use a semaphore. You could preload the semaphore with
the count of available CPUs. Then, you create threads whenever the semaphore
indicates that a CPU is available. This is conceptually simpler, but involves the
overhead of creating and destroying threads for each iteration.
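A hedged sketch of the semaphore approach follows. The names `cpu_slots`, `render_line()`, and `render_all()` are hypothetical, NUM_LINES is an arbitrary value, and the per-line work is a placeholder. The semaphore starts at the CPU count, the creating thread waits on it before starting each worker, and each worker posts it when its work is done.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES 64                     /* hypothetical amount of work */

static sem_t cpu_slots;
static unsigned char rendered[NUM_LINES];

static void *render_line(void *arg)
{
    int x = (int)(intptr_t)arg;
    rendered[x] = 1;                     /* placeholder for the real work */
    sem_post(&cpu_slots);                /* hand the CPU slot back */
    return NULL;
}

/* Process every line, starting a thread only when a slot is free.
   Returns 0 on success, -1 on failure. */
static int render_all(unsigned num_cpus)
{
    pthread_t tids[NUM_LINES];
    int started = 0;

    assert(num_cpus >= 1);
    if (sem_init(&cpu_slots, 0, num_cpus) != 0) {
        return -1;
    }
    for (int x = 0; x < NUM_LINES; x++) {
        /* Block until a slot is free; retry if a signal interrupts us. */
        while (sem_wait(&cpu_slots) == -1 && errno == EINTR) {
        }
        if (pthread_create(&tids[x], NULL, render_line,
                           (void *)(intptr_t)x) != 0) {
            sem_post(&cpu_slots);
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    sem_destroy(&cpu_slots);
    return (started == NUM_LINES) ? 0 : -1;
}
```

The sketch creates one short-lived thread per line, which makes the thread creation overhead mentioned above easy to see.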
inherit mask
A bitmask that specifies which processors a thread’s children can run on. Contrast
runmask.
multicore system
A chip that has one physical processor with multiple CPUs interconnected over a
chip-level bus.
runmask
A bitmask that indicates which processors a thread can run on. Contrast inherit mask.