INTEL - The Parallel Universe - Issue 02 - 2010
THE PARALLEL UNIVERSE
Issue 2, March 2010

Where Are My Threads?
by Levent Akyil

DEVELOPER ROCK STAR: Levent Akyil

Advisor Origins
by Paul Petersen
Contents

Letter from the Editor
Think Parallel: Good Programming Starts with the Developer, by James Reinders    4
James Reinders, lead evangelist and director of Intel® Software Development Products, addresses recent innovations in apps and tools, highlights key 2010 milestones, and explores what's next in the new year and beyond.

Where Are My Threads? Intel® VTune™ Performance Analyzer and Finding Threading and Parallelism Issues, by Levent Akyil    6
Do you ever wonder how your parallel workload is distributed or scheduled across the available cores/processors? Explore how Intel® VTune™ Performance Analyzer helps make such analysis easy.

DEVELOPER ROCK STAR: Tony Mongkolsmai
APP EXPERTISE: Threading Performance Tools

Pump out performance gains. Become a developer rock star with Intel® Parallel Studio. Learn how to add parallelism to Microsoft Visual Studio* by visiting www.intel.com/software/products/eval for a free evaluation.
© 2010, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, Pentium, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
LETTER FROM THE EDITOR

Think Parallel: Good Programming Starts with the Developer
by James Reinders

It is no surprise that Intel® Parallel Studio is consistently cited in articles and analyst reports about software that helps with parallel programming. Of course, Intel has a long history with high-performance computing (HPC) through efforts with OpenMP* and MPI*. Recently, Intel led non-HPC efforts on projects like Intel® Threading Building Blocks (Intel® TBB) and Intel® Parallel Studio.

Offering real assistance for multicore programming has meant addressing modification of existing applications, addressing ease-of-use issues, and supplying tools to aid in the entire process of designing a program. Encouraged by the success of Intel TBB, Intel addressed the whole design cycle with Intel® Parallel Studio.

Both the Go-Parallel site and the multicore portal on the Intel® Software Developer Network offer a rich set of forums and training materials used by many developers to hone their knowledge and skills. And Intel® software tools, long recognized for offering high-performance solutions, deliver strong support for parallel programming.

Support for Microsoft Visual Studio* (VS) includes full support under VS 2005* and VS 2008*, and will expand to VS 2010* shortly after it is available from Microsoft. We will continue to offer developers a choice when it comes to deciding which version of VS they can use for parallel programming. We will also see tangible results from our acquisitions of
Where Are My Threads?
Intel® VTune™ Performance Analyzer and Finding Threading and Parallelism Issues
by Levent Akyil

...event. By leveraging this technology you can see how many events were sampled on each core, as well as which thread generated them. The Show/Hide CPU Information button (Figure 1) in the sampling toolbar displays collected samples and events per processor in the Process, Thread, Module, and Hotspot sampling views (Figure 2).

We now know that this particular program (sort_mt1.exe) was executed on two cores, and we can see the number of samples collected on each core. But what we don't know yet is how many threads this application created and how the threads executed on these cores. Selecting the Thread view (Figure 2) when the CPU button is also selected will show us the desired information. Figure 3 tells us that sort_mt1.exe created two threads (thread18 and thread13) and that each thread was executed on both cores (the OS scheduled these threads to run on each core) during the analysis. If you look at the clock ticks (i.e., CPU_CLK_UNHALTED.CORE) for thread18, it becomes clear that this particular thread was executed on each core while running most of the time on Processor 0.

Figure 1: The Show/Hide CPU Information button in the sampling toolbar
Figure 2: Thread view
Figure 3: SOT (sampling over time) view
Figure 4
...when the CPU button is also selected provides insight into the execution and scheduling of these threads (Figure 7). A closer look at thread9 and thread59, and how they are executed on the cores as shown in Figure 7, reveals how the OS (Windows XP SP3* in this particular case) is scheduling the threads on both cores. It also illustrates that each thread is running almost the same amount of time on each core.

Note: The RS_UOPS_DISPATCHED.CYCLES_NONE event shown in Figure 7 counts the cycles where no μops were dispatched (stall cycles). The overall formula is CPU_CLK_UNHALTED.CORE (clockticks) ~= RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE (stall cycles).

You can zoom in to any region on the timeline by identifying the region of interest with the mouse and selecting "Zoom In" from the context menu (right-click menu). Figure 8, which shows the zoomed region (0-1.8 secs), reveals how threads are actually tossed back and forth between the cores. The OS scheduler simply doesn't keep the threads on the same core (i.e., thread9 on core 0 and thread59 on core 1, or vice versa). This particular scheduling pattern might not be an issue for such a system, since the cores share the same second-level cache. For multi-socket systems, however, such a scheduling pattern will be a problem.

Figure 5: Manually setting thread affinity can create problems. Each thread is scheduled/pinned to Core/Processor 0.
Figure 6: The SOT shows a load imbalance issue.
Figure 7: Identify the stall cycles per thread per CPU.
Figure 8: The SOT view shows how the OS scheduled the particular application.

Thread Affinity

At this point, it is important to introduce the concept of thread affinity. Thread affinity restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer. Depending on the topology of the machine, thread affinity can have a dramatic effect on the execution speed of an application. However, you must have a good reason and be cautious before interfering with the OS scheduler's ability to schedule threads effectively across processors/cores. Most recent OSs and their schedulers have improved significantly; generally speaking, modern schedulers will perform efficiently.

The Intel® compiler OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units. Thread affinity is supported on Windows* OS systems and on versions of the Linux* OS that have kernel support for thread affinity. There are three types of interfaces you can use to specify this binding, collectively referred to as the Intel® OpenMP* Thread Affinity Interface. For more information, click here. The affinity types supported by the Intel OpenMP runtime library are: none (default), compact, disabled, explicit, and scatter.
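As a minimal sketch of how this binding is typically applied (assuming the KMP_AFFINITY environment variable, the environment-variable form of the affinity interface mentioned above), the affinity type can be switched between runs without changing the source:

// affinity_demo.cpp -- build with the Intel compiler: icl /Qopenmp affinity_demo.cpp
// Choose a binding at run time, for example:
//   set KMP_AFFINITY=compact   (pack OpenMP threads onto adjacent cores)
//   set KMP_AFFINITY=scatter   (spread OpenMP threads across cores)
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // With an affinity type set, each OpenMP thread stays on the
        // processor(s) the runtime assigned to it instead of migrating.
        std::printf("thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}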
Figure 9: Note the SOT view after setting the thread affinity.

After Setting the Affinity

For this exercise and for this particular system, setting the affinity as "scatter" or "compact" will not make any difference. Please see the information provided in the link above for more details.
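Outside of OpenMP, affinity can also be set explicitly through the OS. A hedged sketch of the Windows* API route (this is illustrative code, not the code used in the article's experiment):

#include <windows.h>

// Restrict the calling thread to a single logical processor. Pinning
// every thread to processor 0, as in Figure 5, serializes the workload;
// that is exactly the problem the figure illustrates.
void PinCurrentThreadTo(DWORD processorIndex) {
    DWORD_PTR mask = DWORD_PTR(1) << processorIndex;  // one bit per logical CPU
    SetThreadAffinityMask(GetCurrentThread(), mask);
}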
Figure 10 shows that both thread17 and thread64 remained on the same cores on which they were initially scheduled. Thread17 initially got scheduled to run on Core 0 and Core 1, but it stayed on Core 0 for the remainder of the run.

Figure 10: Results showing how setting thread affinity affects the application runtime.

BLOG highlights

Fun with Locks and Waits—Performance Tuning
by David Mackay, Ph.D.

At times threaded software requires some critical sections, mutexes, or locks. Do developers always know which of the objects in their code have the most impact? If I want to examine my software to minimize the impact, or to restructure data to eliminate some of these synchronization objects and improve performance, how do I know where I should make changes to get the biggest performance improvement? Intel® Parallel Amplifier can help me determine this.

Intel Parallel Amplifier provides three basic analysis types: hotspots (with call stack data), concurrency, and locks and waits. The locks and waits analysis highlights which synchronization objects block threads the longest. It is common for software to have too many or too few synchronization points. Insufficient synchronization points lead to race conditions and indeterminate results (if you have this problem you need Intel® Parallel Inspector, not Intel Parallel Amplifier; see this MSDN Web seminar for more on Intel Parallel Inspector: Got Memory Leaks?). If you have too many synchronization objects, you want to know which ones, if removed, would improve performance the most.
ADVISOR ORIGINS
By Paul Petersen

A blank sheet of paper. It's frightening, and also empowering. It's a license for unlimited creative freedom. As software developers, how often do you get this opportunity? If your work is like mine, it turns out to be less often than you might think.

Most software development is adapting an existing solution to serve a new purpose. It is optimizing an existing algorithm. It is enabling new execution models for solutions customers already find valuable.

Maybe you are one of the developers who can afford to recreate your current source code base as you seek to exploit parallel execution. You likely will have the biggest payoff, since you have unlimited freedom to design your algorithms and implementation to maximize the parallel execution benefit.

If you need to reuse large portions of an existing implementation, you still have a significant opportunity. In some ways, your job is much easier. Maybe you have a "diamond in the rough." You already know the "correct" definition you are trying to implement. The correct behavior is defined by the external behavior of your existing serial algorithms.

Your test suites validate your application against this correct behavior. As you change your application to make it ready to introduce parallelism, your test suite can be your biggest asset. After every transformation step you perform, the validity of the transformation can be checked by verifying that your application still passes your test suite.

If you introduce changes to your algorithms that change the behavior of your application, it is important to determine early if these changes are desirable. In such a case, you need to update your test suite to allow this new behavior. If the behavior is in error, you need to revert these changes and go back to the prior version.

Intel® Parallel Advisor (part of Intel® Parallel Studio) is designed to be your assistant as you analyze your existing sequential implementations to discover how they can be refactored or redesigned to exploit parallel execution of your application. This article explains some of the principles upon which the design of Intel Parallel Advisor is based.
Refactoring to Enable Relaxed Sequential Execution

If your application already embodies the "correct" definition, you want to preserve it. This means that you can use refactoring techniques to uncover the latent parallelism in your application. This parallelism is latent because using a serial language to express your algorithm over-constrains the dependences that are necessary for correct execution. Refactoring is the process of changing your application's internal structure without modifying its external functional behavior. This allows you to express your intentions more clearly and eliminate the implicit serial dependences that may be present in the sequential implementation of your algorithms.

T1 = F1(A + B)
T2 = F2(C * D)

Figure 1

parallel {
  task { T1 = F1(A + B) }
  task { T2 = F2(C * D) }
}

Figure 2

In Figure 1, the serial semantics are that first you must add A+B, apply the function F1, and then store this result into T1. Only then do you multiply C*D, apply the function F2, and store the result into T2. Assuming F1 and F2 are pure functions, mathematically the semantics are equivalent if you calculate T2 first, and then calculate T1.

The act of writing down these two statements creates a constraint that did not exist before the two statements were written down. Optimizing compilers (and out-of-order execution hardware) try to understand the real and false dependences to enable faster execution. By declaring task boundaries (i.e., the parallel actions shown as tasks in Figure 2), you can use the same technique statically in the source code, indicating when implied control or data dependences do not need to be enforced to ensure correct execution.
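The parallel/task notation in Figure 2 is pseudocode. As a sketch of the same two-task fork-join in a real framework (Intel® TBB is used here purely for illustration; F1, F2, and the variables are the placeholders from Figure 1):

#include <tbb/parallel_invoke.h>

double F1(double x) { return 2.0 * x; }  // placeholder for Figure 1's F1
double F2(double x) { return x + 1.0; }  // placeholder for Figure 1's F2

void compute(double A, double B, double C, double D) {
    double T1, T2;
    // Fork: the two tasks may run in either order, or concurrently.
    // Join: parallel_invoke returns only after both tasks finish.
    tbb::parallel_invoke(
        [&] { T1 = F1(A + B); },
        [&] { T2 = F2(C * D); });
    // Both T1 and T2 are valid here, satisfying the barrier semantics.
}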
for (int i = 0; i < N; ++i)
  A[i] = B[i] + C[i]

Figure 3

parallel {
  for (int i = 0; i < N; ++i)
    task { A[i] = B[i] + C[i] }
}

Figure 4

Figure 3 shows a similar situation. This loop is shorthand for the set of assignment statements where the variable i is in the range 0:N-1. The result generated by this loop produces each element of A containing the sum of the corresponding elements of B and C. In the serial program, this loop is over-constrained by specifying that the index variable i is calculated via the induction i=i+1. Declaring the task boundaries as shown in Figure 4 defines all iterations of the loop as logically separate tasks (capturing the value of i when each task is created).
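Again as a sketch, the per-iteration tasks of Figure 4 map directly onto a parallel loop in Intel TBB; each iteration (or chunk of iterations) becomes a task that captures its own value of i:

#include <tbb/parallel_for.h>

void vector_add(float* A, const float* B, const float* C, int N) {
    // Each index is logically a separate task; the runtime groups
    // indices into chunks and schedules the chunks onto threads.
    tbb::parallel_for(0, N, [=](int i) {
        A[i] = B[i] + C[i];
    });
}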
Introducing parallelism via refactoring performs the primary task of identifying the places where the serial semantics are over-constraining the problem you need to solve. Designing your parallel application via refactoring creates the key property that the original serial execution is just one trivial execution of the parallel program. To see that this is true, consider a parallel program. What happens when you execute this program on a single thread? If you have only relaxed or removed artificial serial dependences, then adding them back in by executing the parallel program on a single thread preserves the same behavior.

When you don't care about the order of execution, the program can execute in parallel. When you do care about the order of execution (e.g., the next statement has dependences on the prior computation), then you retain a serial execution.

Multi-Level Task Execution

Refactoring to identify opportunities for relaxed sequential execution is typically implemented via fork-join parallelism. You fork when you want to relax the constraints of serial execution, and you join (using a barrier) when you want to enforce the constraints of serial execution. The barrier at the join point allows any dependence from before the barrier to after the barrier to be satisfied.

In simple fork-join parallelism the application is either executing serially or it is executing in parallel. From Amdahl's law you know that the potential speedup is limited by the percentage of time the program is executing serially. Therefore, any time you transition from parallel to serial you are losing performance.
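For reference, Amdahl's law can be written as follows, where p is the fraction of execution time that can run in parallel and n is the number of processors:

S(n) = \frac{1}{(1 - p) + p/n}

Even with unlimited processors the speedup approaches 1/(1 - p), so every transition from parallel back to serial execution eats into the achievable gain.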
To overcome this problem, you can think hierarchically. The trick is to find a way to replicate the serializing algorithms at the inner level, and create multiple tasks at an outer level that each work independently. The QuickSort algorithm is a good example (Figure 5). The algorithm has a serial phase (i.e., sorting small arrays, choosing the pivot, and partitioning the array) and a parallel phase (i.e., sorting the two halves of the now-partitioned array). The hierarchy you can exploit is the recursive call to the QuickSort function.

Thinking hierarchically (sometimes recursively) expands your ability to specify independent work that can be exploited for parallel execution. If you only create parallelism from the two top-most independent QuickSort calls, then your speedup will be implicitly limited to, at most, a factor of 2x. Recursively subdividing these tasks into smaller tasks enables additional parallelism.

Another reason why hierarchical parallelism design is helpful is in increasing the number of tasks that can be launched. Problems best suited for parallelism have large collections of objects, each of which needs to be transformed independently by the application of a function. If you can transform your algorithm into this form you will achieve the best results.

Often, the objects are not this independent; you may only have a small collection of "top-level" objects that are independent. If you apply parallelism only to this "top-level" collection of objects you may get a large enough grain size to allow effective parallel execution, but your scalability will be limited to the number of "top-level" objects you have. If your algorithm runs long enough to warrant the use of parallelism, you may find another level of nested algorithms that also does an independent computation over a different small set of independent data items. Applying parallelism to both levels increases the scalability of your parallel algorithm.

void QuickSort( Value A[], int L, int H ) {
  if (H-L < TooSmallLimit) {
    SerialSort(A, L, H);
    return;
  }
  Value Pivot = A[ L+(H-L)/2 ];
  int L1 = L; int H1 = H;
  while (L1 < H1) {
    if (A[L1] < Pivot)
      ++L1;
    else if (A[H1] >= Pivot)
      --H1;
    else
      Swap(A[L1], A[H1]);
  }
  parallel {
    task { QuickSort(A, L, L1-1); }
    task { QuickSort(A, L1, H); }
  }
}

Figure 5
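Figure 5 again uses the parallel/task pseudocode. A hedged sketch of the same structure with Intel TBB follows; std::sort stands in for SerialSort, and a three-way std::partition replaces the hand-written partition loop so that both recursive ranges are guaranteed to shrink:

#include <tbb/parallel_invoke.h>
#include <algorithm>

const int TooSmallLimit = 32;  // illustrative cutoff for the serial phase

void QuickSort(int A[], int L, int H) {  // sorts A[L..H] inclusive
    if (H - L < TooSmallLimit) {
        std::sort(A + L, A + H + 1);     // serial phase: sort small arrays
        return;
    }
    int Pivot = A[L + (H - L) / 2];
    // Serial phase: partition into < Pivot, == Pivot, > Pivot regions.
    int* Mid1 = std::partition(A + L, A + H + 1,
                               [=](int x) { return x < Pivot; });
    int* Mid2 = std::partition(Mid1, A + H + 1,
                               [=](int x) { return x == Pivot; });
    // Parallel phase: the two remaining regions are independent, so they
    // can be sorted concurrently; parallel_invoke joins before returning.
    tbb::parallel_invoke(
        [=] { QuickSort(A, L, int(Mid1 - A) - 1); },
        [=] { QuickSort(A, int(Mid2 - A), H); });
}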
Choosing Tasks

You can imagine that in the serial execution of the program, tasks could exist at multiple levels. At one extreme, you could assume that the entire application is one task. At the other extreme, you could assume that every statement or expression is a task. Both extremes are usually incorrect when trying to pick where to add tasks to your application.

The potential concurrency of your application is bounded by the number of tasks you create. If you only create two tasks (your main program is also logically a task), your ideal parallel speed-up is only 2x faster than serial (ignoring serial cache or memory bandwidth effects), regardless of the number of processors you have available. Various forms of overhead will also reduce the potential benefit of creating parallel tasks, so you need to identify many more tasks than the available processors in order to achieve the best performance.

Having many tasks available also helps in another way. If you have two tasks, and one task is very short (e.g., one second) and the other task is very long (e.g., 59 seconds), then running these two tasks serially will take one minute. How fast will they run when executed in parallel? You might hope that you would achieve a 2x speedup and finish both tasks in 30 seconds. Unfortunately, the answer is that you will finish both tasks in 59 seconds. The minimum time to execute both tasks is the maximum time it would take to execute either task. Since the duration of the second task is 59 seconds, you must wait at least this long. If possible, it would be better to break up the second task much finer into a larger collection of tasks. For example, if instead of having one task that takes 59 seconds you could create 59 tasks that each take one second, you could potentially achieve a 2x speed-up and finish all 60 tasks in 30 seconds using two processors.

You can see that a larger number of smaller tasks is necessary to have enough potential concurrency to achieve the performance improvement you desire. Two effects serve to balance this and prohibit tasks that are too small.

First, scheduling a task onto a thread is inexpensive, but does have a cost. Scheduling 10 tasks, each with a single action, onto a thread will be slightly more expensive than combining all 10 of those actions into a single task and only scheduling that one task onto a thread. Most parallel frameworks have the capability to combine tasks to create larger chunks of work to be scheduled onto a thread for execution, as the sketch below illustrates. This is a simple mechanical process. However, the opposite—splitting a monolithic large task into many smaller tasks—is very difficult for the parallel framework, but may be fairly easy for you to specify.
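A sketch of that combining in Intel TBB: blocked_range hands each scheduled task a chunk of iterations rather than a single one, and the optional grain size (the 1024 here is only illustrative) caps how finely the range may be split:

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void scale(float* A, int N) {
    tbb::parallel_for(
        tbb::blocked_range<int>(0, N, 1024),   // grain size: 1024 iterations
        [=](const tbb::blocked_range<int>& r) {
            // One scheduled task processes a whole chunk of the range.
            for (int i = r.begin(); i != r.end(); ++i)
                A[i] *= 2.0f;
        });
}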
Second, a good task encapsulates a consistent computation on an object or set of objects. If you examine all of the variables accessed by a task, you will find that they fall into three cases:

1. The variable is already private to a task or read-only by all tasks.
2. The variable can be made private to a task.
3. The variables communicate values between tasks.

In the second half of this article, we examine these cases by showing how the way you use variables in your application affects the choices you will make as you understand how to transform your application to exploit available latent parallelism.
Understanding the Features of Intel® Parallel Inspector by Example
By Bradley J. Werth

Intel® Parallel Inspector eases the correctness burden on programmers. Explore how it helps maintain memory and thread integrity.

Intel® Parallel Inspector is a combination of tools that performs two important functions: verifying memory integrity and verifying thread correctness. Memory integrity has been a key challenge for software development throughout the history of computing. When some high-profile program gets hacked through a buffer overrun, that is a breach of memory integrity. Even without malicious intent, it is all too easy for a programmer to make a mistake using the complex memory syntax of a language like C. Likewise, multithreaded semantics provide plenty of leeway for programming mistakes. Intel Parallel Inspector aims to ease the correctness burden on programmers by providing some assurance that memory and thread integrity are maintained.

Accompanying this article is a sample C++ project that contains examples of all the memory and threading problems that Intel Parallel Inspector can detect, with clear labels of each problem. This article describes how Intel Parallel Inspector can be configured to catch those problems, which will help you find and fix similar problems in your own projects.

The data for graphs in this article were measured on a Windows 7* PC with an Intel® Core™ i7 Extreme Edition processor using the referenced example project. The time dilation or slowdown caused by Intel Parallel Inspector analysis will vary based on the platform, system configuration, and the project being analyzed.

Getting Started with Intel Parallel Inspector

Intel Parallel Inspector is installed into the Microsoft Visual Studio 2008* IDE and appears as a toolbar. Figure 1 shows how Intel Parallel Inspector appears after a typical install. The typical use of Intel Parallel Inspector is to build an application with debugging information included, and then run Intel Parallel Inspector memory analysis or threading analysis by choosing the type of analysis from the combo box and clicking the Inspect button.

Intel Parallel Inspector is a dynamic analysis tool, which means that it doesn't look at the source code but at the running program itself. This is both good and bad: good because many problems cannot be caught with just static analysis, but bad because this dynamic analysis can significantly slow down the execution of the program. Understanding this tradeoff, Intel Parallel Inspector provides "levels" of both memory and threading analysis. Levels range from 1 to 4 in order of increasing accuracy and decreasing performance. Figure 2 shows the interface for specifying the level of memory analysis.

When the user clicks the Run Analysis button from the configuration screen, Intel Parallel Inspector will begin analysis and run the program to completion, or until the Stop button is pressed on the toolbar. Some problems will not be detected properly if the Stop button is pressed. When analysis is complete, Intel Parallel Inspector displays the analysis results in a tab in the Microsoft Visual Studio IDE. Figure 3 shows the results of running Level 3 memory analysis on the sample project. Double-clicking on one of the errors will display the source of the problem as well as additional detail to help diagnose the problem.

Figure 1: Intel Parallel Inspector appears as a toolbar in the Microsoft Visual Studio 2008* IDE.
Figure 2: Intel Parallel Inspector allows configuration options for both memory and threading analysis.
Figure 3: Analysis results appear in a tab in the IDE.

Memory Analysis: Features, Accuracy, and Performance

The sample project contains 11 memory errors findable by Intel Parallel Inspector at the highest analysis level. Three of these errors are fundamentally unsafe in that they corrupt the heap or write to arbitrary memory. For this reason, those errors are only included under an optional preprocessor definition that is disabled by default. At level 1, Intel Parallel Inspector detects all of the memory leaks in the code, although deep memory leaks may not get the full call stack recorded. At level 2, read/write accesses to invalid memory are detected. At levels 3 and 4, the call stack depth for issues is progressively increased.

The following "safe" memory problems are present in the sample project:
>> 4 memory leaks at different call stack depths
>> 2 invalid memory reads
>> 1 uninitialized memory access
>> 1 mismatched allocation/deallocation
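The sample project itself is not reproduced in the article, but a hypothetical C++ fragment of the same flavor shows what the "safe" defects listed above look like (each commented line is the kind of problem a memory-analysis run flags):

void memory_mistakes() {
    int* leak = new int[100];   // memory leak: never deleted
    leak[0] = 1;

    char* buf = new char[8];
    char c = buf[3];            // uninitialized memory read
    (void)c;
    delete buf;                 // mismatched: new[] released with delete

    int* stale = new int[4];
    delete[] stale;
    int v = stale[0];           // invalid memory read (use after free)
    (void)v;
}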
Figure 4 shows the tradeoff between accuracy and runtime for memory analysis of the sample program, when run on a Windows 7 PC with an Intel Core i7 Extreme Edition processor. Analysis levels 1 through 4 are shown alongside the percentage of problems found at each level. Analysis level 0 represents the case where Intel Parallel Inspector is not run on the code at all (i.e., the raw performance of the code).

The data indicates that level 3 memory analysis has an excellent accuracy rate for an acceptable increase in runtime. For memory analysis, 88 percent of the problems in this project are found with the runtime expanding to 25x of normal. Level 4 analysis provides additional information (e.g., deeper call stacks and thread stack analysis), but does not find additional memory problems in this project. Your project may have a more complex memory access pattern that necessitates level 4 memory analysis, but for most projects level 3 is sufficient and is certainly the right level for regular testing.

Figure 4: Memory analysis accuracy improves with analysis level—to a point. (Memory Analysis Time and Accuracy by Level: level 0, with no analysis, runs in 0.11 seconds; level 3 finds 88 percent of the problems in 2.82 seconds, 25x the normal running time.)

Threading Analysis: Features, Accuracy, and Performance

The sample project contains five threading errors findable by Intel Parallel Inspector at the highest analysis level. Unlike the memory analysis, all of the threading errors are safe in the sense that they do not corrupt memory. At level 1, Intel Parallel Inspector only detects potential deadlocks. At level 2, data races are also detected. Levels 3 and 4 increase the call stack depth of the reported data races.

The sample project contains the following threading problems:
>> 3 heap data races
>> 1 stack data race
>> 1 deadlock
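As a hypothetical sketch (again, not the actual sample source), a heap data race of the kind counted above can be as small as two threads incrementing one shared counter without synchronization:

#include <windows.h>

static long* counter;  // shared, heap-allocated: the raced-on data

DWORD WINAPI worker(LPVOID) {
    for (int i = 0; i < 100000; ++i)
        ++*counter;    // data race: unsynchronized read-modify-write
    return 0;
}

int main() {
    counter = new long(0);
    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    // Threading analysis at level 2 or higher reports the race on *counter;
    // InterlockedIncrement(counter) would be one fix.
    delete counter;
    return 0;
}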
Figure 5 shows the tradeoff between accuracy and runtime for threading analysis of the sample program, when run on a Windows 7 PC with an Intel Core i7 Extreme Edition processor. Analysis levels 1 through 4 are shown alongside the percentage of problems found at each level. Analysis level 0 represents the case in which Intel Parallel Inspector is not run on the code at all (i.e., the raw performance of the code).

The data for threading analysis tells a story similar to the memory analysis. Level 2 analysis catches 80 percent of the problems with a runtime expanding to 18x of normal. Level 4 analysis will check for data races among stack variables. Stack variables are not meant to be shared, so such data races are not common. A lower-level analysis, such as level 2 or 3, is more commonly used for regular testing.

Figure 5: Threading analysis accuracy improves with analysis level—to a point.

Further Reading and Resources

The help files installed with Intel Parallel Inspector (accessible in Microsoft Visual Studio* from the Help Menu as Intel® Parallel Studio—Parallel Studio Help—Inspector Help) give additional examples of detectable errors in the section "Problem Type Reference."
By Naveen Gv

Investigate the basics of calling Intel® Integrated Performance Primitives (Intel® IPP) from .NET Framework languages such as C#.

This article is intended to educate Intel® Integrated Performance Primitives (Intel® IPP) users on the basics of calling Intel IPP from .NET Framework languages such as C#. The most important consideration is how to manage the calls between the managed .NET application and the unmanaged Intel® IPP library. Figure 1 provides a high-level perspective.

Intel® IPP is an unmanaged code library written in native programming languages and compiled to machine code that can run on a target computer directly. In this article, we consider the Platform Invocation services (P/Invoke) .NET interoperability mechanism specifically for calling the Intel IPP from C#. We describe basic concepts such as passing callback functions and arguments of different data types from managed to unmanaged code, and the pinning of array arguments.

Microsoft* .NET Framework Overview: .NET Framework Terminology

.NET Framework: The Microsoft* .NET Framework is a managed runtime environment for developing applications that target the common language runtime (CLR) layer. This layer consists of the runtime execution services needed to develop various types of software applications, such as ASP .NET, Windows* Forms, XML Web services, distributed applications, and others. Compilers targeting the CLR must conform to the common language specification (CLS) and common type system (CTS), which are sets of language features and types common among all the languages. These specifications enable type safety and cross-language interoperability: an object written in one programming language can be invoked from another object written in another language targeting the runtime.

C#: C# is a programming language designed as the main language of the Microsoft .NET Framework. It compiles to a common intermediate language (CIL) like all the other languages compliant with the .NET Framework. CIL provides code that runs under control of the CLR. This is managed code. All code that runs outside the CLR is referred to as unmanaged code. CIL is an element of an assembly.

Assembly: An assembly is a collection of types and resources that are built to work together and form a logical unit of functionality. Assemblies are the building blocks of .NET Framework applications, are stored in portable executable (PE) files, and can be a DLL or an EXE. Assemblies also contain metadata, the information used by the CLR to guarantee security, type safety, and memory safety for code execution.

Garbage collector: The .NET Framework's garbage collector (GC) service manages the allocation and release of memory for the managed objects in the application. The GC checks for objects in the managed heap that are no longer used by the application and performs the operations necessary to reclaim their memory. The data under control of the GC is managed data.

Unsafe code: C# code that uses pointers is called unsafe code. The keyword "unsafe" is a required modifier for callable members such as properties, methods, constructors, classes, or any block of code. Unsafe code is a C# feature for performing memory manipulation using pointers. Use the keyword fixed (to pin the object) to avoid movement of the managed object by the GC. Note that unsafe code must be compiled with the /unsafe compiler option.
.NET Framework Interoperability Mechanisms

The CLR supports the Platform Invocation Service (P/Invoke), which allows mapping a declaration of a managed method to an unmanaged method. The resulting declaration describes the stack frame but assumes the external method body from a native DLL.

P/Invoke Service

P/Invoke enables managed code to call C-style unmanaged functions in native DLLs. P/Invoke can be used in any .NET Framework-compliant language. It is important to be familiar with the attributes DllImport, MarshalAs, StructLayout, and their enumerations to use P/Invoke effectively.

When a P/Invoke call is initiated to call an unmanaged function in the native DLL, the P/Invoke service will perform the following steps:

1. Locate the DLL specified by the DllImport attribute by searching either in the working directory or in the directories and sub-directories specified in the PATH variable, and then load the DLL into memory.
2. Find the function declared as static extern in the DLL loaded to memory.
3. Push the arguments on the stack by performing marshalling and, if required, using the attributes MarshalAs and StructLayout.
4. Disable pre-emptive garbage collection.
5. Transfer control to the unmanaged function.

Declare Static Extern Method with the DllImport Attribute

An unmanaged method must be declared as static extern with the DllImport attribute. This attribute defines the name of the native shared library (native DLL) where the unmanaged function is located. The attribute DllImport and the function specifiers static extern specify the mapping that the .NET Framework uses to create the function and to marshal data. The DllImport attribute has parameters to specify correspondence rules between managed and native methods, such as CharSet (Unicode or Ansi), ExactSpelling (true or false), CallingConvention (Cdecl or StdCall), and EntryPoint. In the simplest case, the managed code can directly call a method foo() as illustrated in Figure 2.

[DllImport("custom.dll")]
static extern double
foo(double a, double b);

Figure 2

Marshalling for the Parameters and Return Values

The .NET Framework provides an interoperability marshaller to convert data between managed and unmanaged environments. When managed code calls a native method, parameters are passed on the call stack. These parameters represent data in both the CLR and native code; they have a managed type and a native type.

Some data types have identical data representations in both managed and unmanaged code. They are called isomorphic, or blittable, data types, and they do not need special handling or conversion when passed between managed and unmanaged code. Basic data types of this kind are float/double, integer, and one-dimensional arrays of isomorphic types. These are common types in Intel IPP.

However, some types have different representations in managed and unmanaged code. These types are classified as non-isomorphic, or non-blittable, data types and require conversion, or marshalling. Table 1 presents some non-isomorphic types commonly used in the .NET Framework.

Table 1
Managed    Unmanaged
Boolean    BOOL, Win32 BOOL, or Variant
Char       CHAR, or Win32 WCHAR
Object     Variant, or Interface
Array      SafeArray

In some cases, the default marshalling can be used. But not all parameters or return values can be marshalled with the default mechanism. In such cases, the default marshalling can be overridden with the appropriate marshalling information. Marshalling includes not only conversion of the data type, but other options such as the description of the data layout and the direction of parameter passing. There are some attributes for these purposes: MarshalAs, StructLayout, FieldOffset, InAttribute, and OutAttribute. MarshalAsAttribute and the Marshal class in the System.Runtime.InteropServices namespace can be used for marshalling non-isomorphic data between managed and unmanaged code.

Pinning Arrays and Other Objects

All objects of the .NET Framework are managed by the GC. The GC can relocate an object in memory asynchronously. If managed code passes to native code a reference to a managed object, this object must be prevented from being moved in memory. Its current location must be locked (or pinned) for the life of the native call.

Pinning can be performed manually or automatically. Manual, or explicit, pinning can be performed by creating a GCHandle: GCHandle pinnedObj = GCHandle.Alloc(anObj, GCHandleType.Pinned). Pinning can also be performed using the fixed statement in unsafe code. The /unsafe compilation option is required in this case.

The other method is automatic pinning. When the runtime marshaller meets managed code that passes an object to native code, it automatically locks the referenced objects in memory for the duration of the call. Automatic pinning enables calling native methods in the usual manner and does not require the /unsafe option during compiling. Declarations are similar for both methods (Figure 3).

[DllImport("custom.dll")]
unsafe static extern float
foo(float *x);

// automatic pinning
[DllImport("custom.dll")]
static extern float
foo(float[] x);

Figure 3: Pinning arrays and other objects: declarations are similar for both methods.

Note that aggressive pinning of short-lived objects is not good practice, because the GC cannot move a pinned object. This can cause the heap to become fragmented, which reduces the available memory.

Callback Function

A callback function is a managed or unmanaged function that is passed to an unmanaged DLL function as a parameter. To register a managed callback that an unmanaged function uses, declare a delegate with the same argument list and pass an instance of it through P/Invoke; on the unmanaged side, it appears as a function pointer (Figure 4). The CLR automatically pins the delegate for the duration of the native call. Moreover, there is no need to create a pinning handle for the delegate for an asynchronous call: in this case, the unmanaged function pointer actually refers to a native code stub dynamically generated to perform the transition and marshalling. This stub exists in fixed memory outside of the GC heap. The lifetime of the native code stub is directly related to the lifetime of the delegate.

The delegate instance, which is passed to unmanaged code, employs the StdCall calling convention. An attribute to specify the Cdecl calling convention is available only in the .NET Framework v2.0+.

BLOG highlights

Structured Parallel Programming
by Michael McCool

One way of looking at parallel patterns (sometimes called algorithmic skeletons) is through an analogy with "structured programming." The premise of structured programming is that a small number of control flow and data management patterns can be composed to implement the necessary control flow and data access logic in most serial programs. There is some evidence (see, for example, work by Skillicorn, Campbell, Cole, and the Berkeley Dwarfs) that a relatively small number of patterns can also express the necessary task and data organization in a large fraction of parallel programs.

Back in the '70s there was a heated argument about structured vs. unstructured control flow. Basically, you can obviously do anything you want with a conditional goto, which is usually the only control flow construct made available in machine language. From the point of view of completeness, this is all a programming language really needs to support. However, many computer scientists noted that there were certain maintainability advantages to restricting control flow to the composition of a small number of patterns supporting iteration (do/while, repeat/until, for) and selection (if/then/else, switch) in high-level languages.
public delegate void MyCallback();
[DllImport("custom.dll")]
public static extern void
MyFunction(MyCallback callback);

Figure 4

Intel IPP Overview

Intel IPP is a low-level software library. It provides a set of basic functions highly optimized for the IA-32, IA-64, and Intel® 64 architectures. The use of the library significantly speeds up a wide variety of software in different application areas: signal and image processing, speech (G.728, GSM-AMR, Echo Canceller), audio (MP3, AAC) and video (MPEG-2, MPEG-4, H.264, VC1) coding, image coding, data compression (BWT, MFT, RLE, LZSS, LZ77), cryptography (SHA, AES, RSA certified by NIST), text processing, and computer vision. The Intel IPP software runs on different operating systems: Windows* OS, Linux* OS, and Mac OS* X.

Intel IPP is a C-style API library. However, due to the StdCall calling convention, the primitives can be used in applications written in many other languages. For example, they work in applications written in Fortran, Java*, Visual Basic, C++, and C#. The Intel IPP functions can be used in the Microsoft .NET Framework managed environment to speed up the performance of applications on Intel® processors and compatible platforms.

C# Interface to Intel IPP: Namespace and Structures

The Intel IPP software, in addition to the libraries, provides special wrapper classes for the functions of several Intel IPP domains: signal and image processing, color conversion, cryptography, string processing, data compression, JPEG coding, and math and vector math. The wrappers allow Intel IPP users to call Intel IPP functions in C# applications. These classes, in the case of the signal processing (sp) and image processing (ip) functions, are declared as in Figure 5.

// ipps.cs
namespace ipp {
  public class sp {
    ...
  };
};

// ippi.cs
namespace ipp {
  public class ip {
    ...
  };
};

Figure 5

Enumerated data types that are used in the Intel IPP library must be declared in the wrapper classes with the keyword "enum". For example, the enumerator IppRoundMode is declared in ippdefs.cs as illustrated in Figure 6.

public enum IppRoundMode {
  ippRndZero = 0,
  ippRndNear = 1,
  ippRndFinancial = 2,
};

Figure 6

Many Intel IPP functions use structures as parameters. To pass structures from a managed environment to an unmanaged environment, the managed struct must comply with the corresponding structure in the unmanaged code. The attribute StructLayout is used with the value LayoutKind.Sequential. Figure 7 shows how to use the structure Ipp64fc, which is a double complex number type.

[StructLayout(LayoutKind.Sequential,CharSet=CharSet.Ansi)]
public struct Ipp64fc {
  public double re;
  public double im;
  public Ipp64fc( double re, double im ) {
    this.re = re;
    this.im = im;
  };
};

Figure 7

Almost all Intel IPP functions have pointers as parameters. The functions with pointers must be declared in the unsafe context. For example, the function ippiAndC_8u_C1R is declared as in Figure 8. This code must be compiled with the /unsafe option.

Figure 8: The unsafe declaration of ippiAndC_8u_C1R

Note that working with pointers in a C# application requires using the operator fixed. The fixed statement sets a pointer to a managed variable, and this variable is used during execution of the statement. Without the fixed statement, pointers to managed variables may be relocated unpredictably by the GC. For the class Bitmap, the methods LockBits and UnlockBits must be used.

namespace ExampleIP
{
  using System;
  using System.Windows.Forms;
  using System.Drawing;
  using System.Drawing.Imaging;
  using ipp;

  public class tip : System.Windows.Forms.Form {
    private System.Drawing.Bitmap bmpsrc, bmpdst;
    ...

Figure 9
Intel IPP Components—Image Processing Sample

Intel IPP samples include a C# interface and a demo application illustrating how C# developers can build applications with Intel IPP calls. The demo application performs filtering and morphological and geometric operations. Additionally, the image compression functions are used in the demo to read and write JPEG files. The application uses the wrapper classes for the image processing (ippi.cs) and image compression (ippj.cs) domains. The application launches the P/Invoke mechanism for unmanaged code in ippi-6.1.dll and loads the dispatcher of the processor-specific libraries. This dispatcher loads the most efficient library for a given processor. For example, the library ippiv8-6.1.dll is loaded on a system with an Intel® Core™2 Duo processor, and the library ippiw7-6.1.dll is loaded on a system with an Intel® Pentium® 4 processor.

Image ROI Processing

Special attention must be paid when working with functions that require border pixels (e.g., image filtering functions). The Intel IPP functions operate only on pixels that are part of the image. Therefore, a region of interest (ROI) is implied to be inside the image in such a way that all neighborhood pixels necessary for processing the ROI edges actually exist. For example, for filtering functions, the width of the border outside the ROI must not be less than half of the filter kernel size with the centered anchor cell.

When processing an image ROI, the developer has to perform two additional operations: shifting the pointer to the data and specifying an ROI size that is less than the image size. The sample code in Figure 9 illustrates how to work with an ROI (using the example of the Intel IPP function ippiFilterBox, which performs image blurring).

When calling functions with the same set of parameters, you can use the dynamic method of function search, as the demo code in Figure 10 shows.

namespace ExampleIP
{
  using System;
  using System.Windows.Forms;
  using System.Drawing;
  using System.Drawing.Imaging;
  using System.Reflection;
  using System.Collections;
  using ipp;

  private object ippi;
  private Type ippiType;
  private Hashtable hash;
  ...
  public tip(Bitmap bmp) {
    assembly = Assembly.LoadFrom("ippi_cs.dll");
    ippi = assembly.CreateInstance("ipp.ip");
    ippiType = ippi.GetType();
    ...
  };

  void CreateMenu() {
    hash = new Hashtable();
    MenuItem miBlur =
      new MenuItem("Blur", new EventHandler(MenuFilteringOnClick));
    hash.Add(miBlur, "ippiFilterBox_8u_C3R");
    MenuItem miMin =
      new MenuItem("Min Filter", new EventHandler(MenuFilteringOnClick));
    hash.Add(miMin, "ippiFilterMin_8u_C3R");
    ...
  };

  private void MenuFilteringOnClick(object sender, System.EventArgs e) {
    FilteringFunction((string)hash[sender]);
  };

  unsafe private void FilteringFunction(string func) { ...
    MethodInfo method = ippiType.GetMethod(func);
    const int ksize = 5, half = ksize/2;
    byte* pSrc = (byte*)bmpsrcdata.Scan0+(bmpsrcdata.Stride+3)*half,
          pDst = (byte*)bmpdstdata.Scan0+(bmpdstdata.Stride+3)*half;
    IppStatus st = (ipp.IppStatus)method.Invoke(null, new object[]
      {(IntPtr)pSrc, bmpsrcdata.Stride, (IntPtr)pDst, bmpdstdata.Stride,
       roi, new IppiSize(ksize,ksize), new IppiPoint(half,half)});
    ...};
  };
};

Figure 10

Intel IPP Performance from C#

Intel IPP functions are optimized for Intel® and compatible processors. This optimization can speed up the performance of various applications. However, specific features of using these libraries in the .NET Framework can decrease the performance of the application. Two possible reasons for a decrease in performance:

>> When a function is called from a DLL for the first time, the corresponding DLL must be loaded into memory (e.g., using LoadLibrary() on Windows* OSs), which takes 1,000 to 2,000 CPU clocks. More CPU clocks are needed to create the entry points for all functions that are exported by this DLL.

Table 2 shows the performance overhead numbers for Intel IPP image processing functions. Because Intel IPP functions are several times faster than corresponding C# implementations, it still makes sense to call Intel IPP despite the overhead of a C# call. To decrease the overhead effect, developers can create a component- or application-level interface in which one C# call leads to the execution of many IPP functions. An example of this approach is the .NET interface for DMIP. We can also compare the performance of a C# implementation against a C# call of the Intel IPP functions: for example, the C# .NET library Mirror costs 5.7 CPU cycles per pixel, while the Intel IPP-based C# call of the Mirror function costs 1.7 cycles per pixel.

Conclusion

The managed-unmanaged code interoperability provided in the .NET Framework environment can be performed in different ways. The best way to call C functions residing in a native DLL is to use the P/Invoke service, which is available in any managed language. P/Invoke provides powerful and flexible interoperability with inherited code. The DllImport attribute declares the external entry point. Marshalling attributes allow describing various options for the data conversion and data layout in addition to the default data marshalling. The use of SuppressUnmanagedCodeSecurityAttribute, InAttribute, and OutAttribute may significantly reduce the performance overhead. Automatic pinning of the objects passed prevents them from being garbage collected for the duration of the call. Manual pinning is also available if the pointer to the object is kept and used in native code after the call returns. A managed delegate type allows implementing callback functions.
DEVELOPER ROCK STAR: Steve Lionel
APP EXPERTISE: Fortran Compilers

For its fall 2009 webinar series, Intel invited Microsoft Visual Studio* C/C++ developers from a range of industry-leading companies to share the secrets behind their real-world successes using Intel® Parallel Studio. All webinars, including those from previous series, are available for immediate, on-demand download.

Check out a range of resources on a wide variety of software topics for a multitude of developer communities, ranging from manageability to parallel programming to virtualization and visual computing. This content-rich collection includes Intel® Software Network TV, popular blogs, videos, tools, and downloads.
Optimization Notice
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize
for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction
sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors.
For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they
implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library
routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other
microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and
Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will
get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree
for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and
Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on
Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine
which best meet your requirements. We hope to win your business by striving to offer the best performance of any
compiler or library; please let us know if you find we do not.
Notice revision #20101101
© 2010, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.