INTEL - The Parallel Universe - Issue 02 - 2010
THE PARALLEL UNIVERSE
Issue 2, March 2010

Where Are My Threads?
by Levent Akyil

DEVELOPER ROCK STAR: Levent Akyil

Advisor Origins
by Paul Petersen
Contents

Letter from the Editor
Think Parallel: Good Programming Starts with the Developer, by James Reinders    4
James Reinders, lead evangelist and director of Intel® Software Development Products, addresses recent innovations in apps and tools, highlights key 2010 milestones, and explores what's next in the new year and beyond.

Where Are My Threads? Intel® VTune™ Performance Analyzer and Finding Threading and Parallelism Issues, by Levent Akyil    6
Do you ever wonder how your parallel workload is distributed or scheduled across the available cores/processors? Explore how Intel® VTune™ Performance Analyzer helps make such analysis easy.

DEVELOPER ROCK STAR: Tony Mongkolsmai
APP EXPERTISE: Threading Performance Tools

Pump out performance gains. Become a developer rock star with Intel® Parallel Studio. Learn how to add parallelism to Microsoft Visual Studio* by visiting www.intel.com/software/products/eval for a free evaluation.
© 2010, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, Pentium, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
LETTER FROM THE EDITOR

Think Parallel: Good Programming Starts with the Developer
by James Reinders

It is no surprise that Intel® Parallel Studio is consistently cited in articles and analyst reports about software that helps with parallel programming. Of course, Intel has a long history with high-performance computing (HPC) through efforts with OpenMP* and MPI*. Recently, Intel led non-HPC efforts on projects like Intel® Threading Building Blocks (Intel® TBB) and Intel® Parallel Studio.

Offering real assistance for multicore programming has meant addressing modification of existing applications, addressing ease-of-use issues, and supplying tools to aid in the entire process of designing a program. Encouraged by the success of Intel TBB, Intel addressed the whole design cycle with Intel® Parallel Studio.

Both the Go-Parallel site and the multicore portal on the Intel® Software Developer Network offer a rich set of forums and training materials used by many developers to hone their knowledge and skills. And Intel® software tools, long recognized for offering high-performance solutions, deliver strong support for parallel programming.

Support for Microsoft Visual Studio* (VS) includes full support under VS 2005* and VS 2008*, and will expand to VS 2010* shortly after it is available from Microsoft. We will continue to offer developers a choice when it comes to deciding which version of VS they can use for parallel programming. We will also see tangible results from our acquisitions of
Where Are My Threads?
Intel® VTune™ Performance Analyzer and Finding Threading and Parallelism Issues
by Levent Akyil

...event. By leveraging this technology you can see how many events were sampled on each core, as well as which thread generated them. The Show/Hide CPU Information button (Figure 1) in the sampling toolbar displays collected samples and events per processor in the Process, Thread, Module, and Hotspot sampling views (Figure 2).

We now know that this particular program (sort_mt1.exe) was executed on two cores, and we can see the number of samples collected on each core. But what we don't know yet is how many threads this application created and how the threads executed on these cores. Selecting the Thread view (Figure 2) when the CPU button is also selected will show us the desired information. Figure 3 tells us that sort_mt1.exe created two threads (thread18 and thread13) and that each thread was executed on both cores (the OS scheduled these threads to run on each core) during the analysis. If you look at the clock ticks (i.e., CPU_CLK_UNHALTED.CORE) for thread18, it becomes clear that this particular thread was executed on each core while running most of the time on Processor 0.

Figure 1: The Show/Hide CPU Information button in the sampling toolbar
Figure 2: Thread view
Figure 3: SOT (sampling over time) view
Figure 4
...when the CPU button is also selected provides insight into the execution and scheduling of these threads (Figure 7). A closer look at thread9 and thread59, and how they are executed on the cores as shown in Figure 7, reveals how the OS (Windows XP SP3* in this particular case) is scheduling the threads on both cores. It also illustrates that each thread is running almost the same amount of time on each core.

Note: The RS_UOPS_DISPATCHED.CYCLES_NONE event shown in Figure 7 counts the cycles where no μops were dispatched (stall cycles). The overall formula is CPU_CLK_UNHALTED.CORE (clockticks) ~= RS_UOPS_DISPATCHED.CYCLES_ANY + RS_UOPS_DISPATCHED.CYCLES_NONE (stall cycles).

You can zoom in to any region on the timeline by identifying the region of interest with the mouse and selecting "Zoom In" from the context menu (right-click menu). Figure 8, which shows the zoomed region (0-1.8 secs), reveals how threads are actually tossed back and forth between the cores. The OS scheduler simply doesn't keep the threads on the same core (i.e., thread9 on core 0 and thread59 on core 1, or vice versa). This particular scheduling pattern might not be an issue for such a system, since the cores share the same second-level cache. For multi-socket systems, however, such a scheduling pattern will be a problem.

Figure 5: Manually setting thread affinity can create problems. Each thread is scheduled/pinned to Core/Processor 0.
Figure 6: The SOT shows a load imbalance issue.
Figure 7: Identify the stall cycles per thread per CPU.
Figure 8: The SOT view shows how the OS scheduled the particular application.

Thread Affinity

At this point, it is important to introduce the concept of thread affinity. Thread affinity restricts execution of certain threads to a subset of the physical processing units in a multiprocessor computer. Depending on the topology of the machine, thread affinity can have a dramatic effect on the execution speed of an application. However, you must have a good reason and be cautious before interfering with the OS scheduler's ability to schedule threads effectively across processors/cores. Most recent OSs and their schedulers have improved significantly; generally speaking, modern schedulers will perform efficiently.

The Intel® compiler OpenMP* runtime library has the ability to bind OpenMP threads to physical processing units. Thread affinity is supported on Windows* OS systems and on versions of the Linux* OS that have kernel support for thread affinity. There are three types of interfaces you can use to specify this binding, collectively referred to as the Intel® OpenMP* Thread Affinity Interface. For more information, click here. The affinity types supported by the Intel OpenMP runtime library are: none (default), compact, disabled, explicit, and scatter.
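As a minimal sketch of how this binding is typically applied (assuming the KMP_AFFINITY environment variable, the environment-variable form of the affinity interface mentioned above), the affinity type can be switched between runs without changing the source:

// affinity_demo.cpp -- build with the Intel compiler: icl /Qopenmp affinity_demo.cpp
// Choose a binding at run time, for example:
//   set KMP_AFFINITY=compact   (pack OpenMP threads onto adjacent cores)
//   set KMP_AFFINITY=scatter   (spread OpenMP threads across cores)
#include <omp.h>
#include <cstdio>

int main() {
    #pragma omp parallel
    {
        // With an affinity type set, each OpenMP thread stays on the
        // processor(s) the runtime assigned to it instead of migrating.
        std::printf("thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}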
Figure 9: Note the SOT view after setting the thread affinity.

After Setting the Affinity

For this exercise and for this particular system, setting the affinity as "scatter" or "compact" will not make any difference. Please see the information provided in the link above for more details.
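Outside of OpenMP, affinity can also be set explicitly through the OS. A hedged sketch of the Windows* API route (this is illustrative code, not the code used in the article's experiment):

#include <windows.h>

// Restrict the calling thread to a single logical processor. Pinning
// every thread to processor 0, as in Figure 5, serializes the workload;
// that is exactly the problem the figure illustrates.
void PinCurrentThreadTo(DWORD processorIndex) {
    DWORD_PTR mask = DWORD_PTR(1) << processorIndex;  // one bit per logical CPU
    SetThreadAffinityMask(GetCurrentThread(), mask);
}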
Figure 10 shows that both thread17 and thread64 remained on the same cores on which they were initially scheduled. Thread17 initially got scheduled to run on Core 0 and Core 1, but it stayed on Core 0 for the remainder of the run.

Figure 10: Results showing how setting thread affinity affects the application runtime.

BLOG highlights

Fun with Locks and Waits—Performance Tuning
by David Mackay, Ph.D.

At times threaded software requires some critical sections, mutexes, or locks. Do developers always know which of the objects in their code have the most impact? If I want to examine my software to minimize the impact, or to restructure data to eliminate some of these synchronization objects and improve performance, how do I know where I should make changes to get the biggest performance improvement? Intel® Parallel Amplifier can help me determine this.

Intel Parallel Amplifier provides three basic analysis types: hotspots (with call stack data), concurrency, and locks and waits. The locks and waits analysis highlights which synchronization objects block threads the longest. It is common for software to have too many or too few synchronization points. Insufficient synchronization points lead to race conditions and indeterminate results (if you have this problem you need Intel® Parallel Inspector, not Intel Parallel Amplifier; see this MSDN Web seminar for more on Intel Parallel Inspector: Got Memory Leaks?). If you have too many synchronization objects, you want to know which ones, if removed, would improve performance the most.
ADVISOR ORIGINS
By Paul Petersen

A blank sheet of paper. It's frightening, and also empowering. It's a license for unlimited creative freedom. As software developers, how often do you get this opportunity? If your work is like mine, it turns out to be less often than you might think.

Most software development is adapting an existing solution to serve a new purpose. It is optimizing an existing algorithm. It is enabling new execution models for solutions customers already find valuable.

Maybe you are one of the developers who can afford to recreate your current source code base as you seek to exploit parallel execution. You likely will have the biggest payoff, since you have unlimited freedom to design your algorithms and implementation to maximize the parallel execution benefit.

If you need to reuse large portions of an existing implementation, you still have a significant opportunity. In some ways, your job is much easier. Maybe you have a "diamond in the rough." You already know the "correct" definition you are trying to implement. The correct behavior is defined by the external behavior of your existing serial algorithms.

Your test suites validate your application against this correct behavior. As you change your application to make it ready to introduce parallelism, your test suite can be your biggest asset. After every transformation step you perform, the validity of the transformation can be checked by verifying that your application still passes your test suite.

If you introduce changes to your algorithms that change the behavior of your application, it is important to determine early if these changes are desirable. In such a case, you need to update your test suite to allow this new behavior. If the behavior is in error, you need to revert these changes and go back to the prior version.

Intel® Parallel Advisor (part of Intel® Parallel Studio) is designed to be your assistant as you analyze your existing sequential implementations to discover how they can be refactored or redesigned to exploit parallel execution of your application. This article explains some of the principles upon which the design of Intel Parallel Advisor is based.
Refactoring to Enable Relaxed Sequential Execution

If your application already embodies the "correct" definition, you want to preserve it. This means that you can use refactoring techniques to uncover the latent parallelism in your application. This parallelism is latent because using a serial language to express your algorithm over-constrains the dependences that are necessary for correct execution. Refactoring is the process of changing your application's internal structure without modifying its external functional behavior. This allows you to express your intentions more clearly and eliminate the implicit serial dependences that may be present in the sequential implementation of your algorithms.

T1 = F1(A + B)
T2 = F2(C * D)

Figure 1

parallel {
  task { T1 = F1(A + B) }
  task { T2 = F2(C * D) }
}

Figure 2

In Figure 1, the serial semantics are that first you must add A+B, apply the function F1, and then store this result into T1. Only then do you multiply C*D, apply the function F2, and store the result into T2. Assuming F1 and F2 are pure functions, mathematically the semantics are equivalent if you calculate T2 first, and then calculate T1.

The act of writing down these two statements creates a constraint that did not exist before the two statements were written down. Optimizing compilers (and out-of-order execution hardware) try to understand the real and false dependences to enable faster execution. By declaring task boundaries (i.e., the parallel actions shown as tasks in Figure 2), you can use the same technique statically in the source code, indicating when implied control or data dependences do not need to be enforced to ensure correct execution.
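The parallel/task notation in Figure 2 is pseudocode. As a sketch of the same two-task fork-join in a real framework (Intel® TBB is used here purely for illustration; F1, F2, and the variables are the placeholders from Figure 1):

#include <tbb/parallel_invoke.h>

double F1(double x) { return 2.0 * x; }  // placeholder for Figure 1's F1
double F2(double x) { return x + 1.0; }  // placeholder for Figure 1's F2

void compute(double A, double B, double C, double D) {
    double T1, T2;
    // Fork: the two tasks may run in either order, or concurrently.
    // Join: parallel_invoke returns only after both tasks finish.
    tbb::parallel_invoke(
        [&] { T1 = F1(A + B); },
        [&] { T2 = F2(C * D); });
    // Both T1 and T2 are valid here, satisfying the barrier semantics.
}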
for (int i = 0; i < N; ++i)
  A[i] = B[i] + C[i]

Figure 3

parallel {
  for (int i = 0; i < N; ++i)
    task { A[i] = B[i] + C[i] }
}

Figure 4

Figure 3 shows a similar situation. This loop is shorthand for the set of assignment statements where the variable i is in the range 0:N-1. The result generated by this loop produces each element of A containing the sum of the corresponding elements of B and C. In the serial program, this loop is over-constrained by specifying that the index variable i is calculated via the induction i=i+1. Declaring the task boundaries as shown in Figure 4 defines all iterations of the loop as logically separate tasks (capturing the value of i when each task is created).
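Again as a sketch, the per-iteration tasks of Figure 4 map directly onto a parallel loop in Intel TBB; each iteration (or chunk of iterations) becomes a task that captures its own value of i:

#include <tbb/parallel_for.h>

void vector_add(float* A, const float* B, const float* C, int N) {
    // Each index is logically a separate task; the runtime groups
    // indices into chunks and schedules the chunks onto threads.
    tbb::parallel_for(0, N, [=](int i) {
        A[i] = B[i] + C[i];
    });
}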
Introducing parallelism via refactoring performs the primary task of identifying the places where the serial semantics are over-constraining the problem you need to solve. Designing your parallel application via refactoring creates the key property that the original serial execution is just one trivial execution of the parallel program. To see that this is true, consider a parallel program. What happens when you execute this program on a single thread? If you have only relaxed or removed artificial serial dependences, then adding them back in by executing the parallel program on a single thread preserves the same behavior.

When you don't care about the order of execution, the program can execute in parallel. When you do care about the order of execution (e.g., the next statement has dependences on the prior computation), then you retain a serial execution.

Multi-Level Task Execution

Refactoring to identify opportunities for relaxed sequential execution is typically implemented via fork-join parallelism. You fork when you want to relax the constraints of serial execution, and you join (using a barrier) when you want to enforce the constraints of serial execution. The barrier at the join point allows any dependence from before the barrier to after the barrier to be satisfied.

In simple fork-join parallelism the application is either executing serially or it is executing in parallel. From Amdahl's law you know that the potential speedup is limited by the percentage of time the program is executing serially. Therefore, any time you transition from parallel to serial you are losing performance.
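For reference, Amdahl's law can be written as follows, where p is the fraction of execution time that can run in parallel and n is the number of processors:

S(n) = \frac{1}{(1 - p) + p/n}

Even with unlimited processors the speedup approaches 1/(1 - p), so every transition from parallel back to serial execution eats into the achievable gain.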
To overcome this problem, you can think hierarchically. The trick is to find a way to replicate the serializing algorithms at the inner level, and create multiple tasks at an outer level that each work independently. The QuickSort algorithm is a good example (Figure 5). The algorithm has a serial phase (i.e., sorting small arrays, choosing the pivot, and partitioning the array) and a parallel phase (i.e., sorting the two halves of the now-partitioned array). The hierarchy you can exploit is the recursive call to the QuickSort function.

Thinking hierarchically (sometimes recursively) expands your ability to specify independent work that can be exploited for parallel execution. If you only create parallelism from the two top-most independent QuickSort calls, then your speedup will be implicitly limited to, at most, a factor of 2x. Recursively subdividing these tasks into smaller tasks enables additional parallelism.

Another reason why hierarchical parallelism design is helpful is in increasing the number of tasks that can be launched. Problems best suited for parallelism have large collections of objects, each of which needs to be transformed independently by the application of a function. If you can transform your algorithm into this form you will achieve the best results.

Often, the objects are not this independent; you may only have a small collection of "top-level" objects that are independent. If you apply parallelism only to this "top-level" collection of objects you may get a large enough grain size to allow effective parallel execution, but your scalability will be limited to the number of "top-level" objects you have. If your algorithm runs long enough to warrant the use of parallelism, you may find another level of nested algorithms that also does an independent computation over a different small set of independent data items. Applying parallelism to both levels increases the scalability of your parallel algorithm.

void QuickSort( Value A[], int L, int H ) {
  if (H-L < TooSmallLimit) {
    SerialSort(A, L, H);
    return;
  }
  Value Pivot = A[ L+(H-L)/2 ];
  int L1 = L; int H1 = H;
  while (L1 < H1) {
    if (A[L1] < Pivot)
      ++L1;
    else if (A[H1] >= Pivot)
      --H1;
    else
      Swap(A[L1], A[H1]);
  }
  parallel {
    task { QuickSort(A, L, L1-1); }
    task { QuickSort(A, L1, H); }
  }
}

Figure 5
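Figure 5 again uses the parallel/task pseudocode. A hedged sketch of the same structure with Intel TBB follows; std::sort stands in for SerialSort, and a three-way std::partition replaces the hand-written partition loop so that both recursive ranges are guaranteed to shrink:

#include <tbb/parallel_invoke.h>
#include <algorithm>

const int TooSmallLimit = 32;  // illustrative cutoff for the serial phase

void QuickSort(int A[], int L, int H) {  // sorts A[L..H] inclusive
    if (H - L < TooSmallLimit) {
        std::sort(A + L, A + H + 1);     // serial phase: sort small arrays
        return;
    }
    int Pivot = A[L + (H - L) / 2];
    // Serial phase: partition into < Pivot, == Pivot, > Pivot regions.
    int* Mid1 = std::partition(A + L, A + H + 1,
                               [=](int x) { return x < Pivot; });
    int* Mid2 = std::partition(Mid1, A + H + 1,
                               [=](int x) { return x == Pivot; });
    // Parallel phase: the two remaining regions are independent, so they
    // can be sorted concurrently; parallel_invoke joins before returning.
    tbb::parallel_invoke(
        [=] { QuickSort(A, L, int(Mid1 - A) - 1); },
        [=] { QuickSort(A, int(Mid2 - A), H); });
}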
Choosing Tasks

You can imagine that in the serial execution of the program, tasks could exist at multiple levels. At one extreme, you could assume that the entire application is one task. At the other extreme, you could assume that every statement or expression is a task. Both extremes are usually incorrect when trying to pick where to add tasks to your application.

The potential concurrency of your application is bounded by the number of tasks you create. If you only create two tasks (your main program is also logically a task), your ideal parallel speed-up is only 2x faster than serial (ignoring serial cache or memory bandwidth effects), regardless of the number of processors you have available. Various forms of overhead will also reduce the potential benefit of creating parallel tasks, so you need to identify many more tasks than the available processors in order to achieve the best performance.

Having many tasks available also helps in another way. If you have two tasks, and one task is very short (e.g., one second) and the other task is very long (e.g., 59 seconds), then running these two tasks serially will take one minute. How fast will they run when executed in parallel? You might hope that you would achieve a 2x speedup and finish both tasks in 30 seconds. Unfortunately, the answer is that you will finish both tasks in 59 seconds. The minimum time to execute both tasks is the maximum time it would take to execute either task. Since the duration of the second task is 59 seconds, you must wait at least this long. If possible, it would be better to break up the second task much finer into a larger collection of tasks. For example, if instead of having one task that takes 59 seconds you could create 59 tasks that each take one second, you could potentially achieve a 2x speed-up and finish all 60 tasks in 30 seconds using two processors.

You can see that a larger number of smaller tasks is necessary to have enough potential concurrency to achieve the performance improvement you desire. Two effects serve to balance this and prohibit tasks that are too small.

First, scheduling a task onto a thread is inexpensive, but does have a cost. Scheduling 10 tasks, each with a single action, onto a thread will be slightly more expensive than combining all 10 of those actions into a single task and only scheduling that one task onto a thread. Most parallel frameworks have the capability to combine tasks to create larger chunks of work to be scheduled onto a thread for execution, as the sketch below illustrates. This is a simple mechanical process. However, the opposite—splitting a monolithic large task into many smaller tasks—is very difficult for the parallel framework, but may be fairly easy for you to specify.
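A sketch of that combining in Intel TBB: blocked_range hands each scheduled task a chunk of iterations rather than a single one, and the optional grain size (the 1024 here is only illustrative) caps how finely the range may be split:

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

void scale(float* A, int N) {
    tbb::parallel_for(
        tbb::blocked_range<int>(0, N, 1024),   // grain size: 1024 iterations
        [=](const tbb::blocked_range<int>& r) {
            // One scheduled task processes a whole chunk of the range.
            for (int i = r.begin(); i != r.end(); ++i)
                A[i] *= 2.0f;
        });
}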
Second, a good task encapsulates a consistent computation on an object or set of objects. If you examine all of the variables accessed by a task, you will find that they fall into three cases:

1. The variable is already private to a task or read-only by all tasks.
2. The variable can be made private to a task.
3. The variables communicate values between tasks.

In the second half of this article, we examine these cases by showing how the way you use variables in your application affects the choices you will make as you understand how to transform your application to exploit available latent parallelism.
Understanding the Features of Intel® Parallel Inspector by Example
By Bradley J. Werth

Intel® Parallel Inspector eases the correctness burden on programmers. Explore how it helps maintain memory and thread integrity.

Intel® Parallel Inspector is a combination of tools that performs two important functions: verifying memory integrity and verifying thread correctness. Memory integrity has been a key challenge for software development throughout the history of computing. When some high-profile program gets hacked through a buffer overrun, that is a breach of memory integrity. Even without malicious intent, it is all too easy for a programmer to make a mistake using the complex memory syntax of a language like C. Likewise, multithreaded semantics provide plenty of leeway for programming mistakes. Intel Parallel Inspector aims to ease the correctness burden on programmers by providing some assurance that memory and thread integrity are maintained.

Accompanying this article is a sample C++ project that contains examples of all the memory and threading problems that Intel Parallel Inspector can detect, with clear labels of each problem. This article describes how Intel Parallel Inspector can be configured to catch those problems, which will help you find and fix similar problems in your own projects.

The data for graphs in this article were measured on a Windows 7* PC with an Intel® Core™ i7 Extreme Edition processor using the referenced example project. The time dilation or slowdown caused by Intel Parallel Inspector analysis will vary based on the platform, system configuration, and the project being analyzed.

Getting Started with Intel Parallel Inspector

Intel Parallel Inspector is installed into the Microsoft Visual Studio 2008* IDE and appears as a toolbar. Figure 1 shows how Intel Parallel Inspector appears after a typical install. The typical use of Intel Parallel Inspector is to build an application with debugging information included, and then run Intel Parallel Inspector memory analysis or threading analysis by choosing the type of analysis from the combo box and clicking the Inspect button.

Intel Parallel Inspector is a dynamic analysis tool, which means that it doesn't look at the source code but at the running program itself. This is both good and bad: good because many problems cannot be caught with just static analysis, but bad because this dynamic analysis can significantly slow down the execution of the program. Understanding this tradeoff, Intel Parallel Inspector provides "levels" of both memory and threading analysis. Levels range from 1 to 4 in order of increasing accuracy and decreasing performance. Figure 2 shows the interface for specifying the level of memory analysis.

When the user clicks the Run Analysis button from the configuration screen, Intel Parallel Inspector will begin analysis and run the program to completion, or until the Stop button is pressed on the toolbar. Some problems will not be detected properly if the Stop button is pressed. When analysis is complete, Intel Parallel Inspector displays the analysis results in a tab in the Microsoft Visual Studio IDE. Figure 3 shows the results of running Level 3 memory analysis on the sample project. Double-clicking on one of the errors will display the source of the problem as well as additional detail to help diagnose the problem.

Figure 1: Intel Parallel Inspector appears as a toolbar in the Microsoft Visual Studio 2008* IDE.
Figure 2: Intel Parallel Inspector allows configuration options for both memory and threading analysis.
Figure 3: Analysis results appear in a tab in the IDE.

Memory Analysis: Features, Accuracy, and Performance

The sample project contains 11 memory errors findable by Intel Parallel Inspector at the highest analysis level. Three of these errors are fundamentally unsafe in that they corrupt the heap or write to arbitrary memory. For this reason, those errors are only included under an optional preprocessor definition that is disabled by default. At level 1, Intel Parallel Inspector detects all of the memory leaks in the code, although deep memory leaks may not get the full call stack recorded. At level 2, read/write accesses to invalid memory are detected. At levels 3 and 4, the call stack depth for issues is progressively increased.

The following "safe" memory problems are present in the sample project:
>> 4 memory leaks at different call stack depths
>> 2 invalid memory reads
>> 1 uninitialized memory access
>> 1 mismatched allocation/deallocation
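The sample project itself is not reproduced in the article, but a hypothetical C++ fragment of the same flavor shows what the "safe" defects listed above look like (each commented line is the kind of problem a memory-analysis run flags):

void memory_mistakes() {
    int* leak = new int[100];   // memory leak: never deleted
    leak[0] = 1;

    char* buf = new char[8];
    char c = buf[3];            // uninitialized memory read
    (void)c;
    delete buf;                 // mismatched: new[] released with delete

    int* stale = new int[4];
    delete[] stale;
    int v = stale[0];           // invalid memory read (use after free)
    (void)v;
}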
Figure 4 shows the tradeoff between accuracy and runtime for memory analysis of the sample program, when run on a Windows 7 PC with an Intel Core i7 Extreme Edition processor. Analysis levels 1 through 4 are shown alongside the percentage of problems found at each level. Analysis level 0 represents the case where Intel Parallel Inspector is not run on the code at all (i.e., the raw performance of the code).

The data indicates that level 3 memory analysis has an excellent accuracy rate for an acceptable increase in runtime. For memory analysis, 88 percent of the problems in this project are found with the runtime expanding to 25x of normal. Level 4 analysis provides additional information (e.g., deeper call stacks and thread stack analysis), but does not find additional memory problems in this project. Your project may have a more complex memory access pattern that necessitates level 4 memory analysis, but for most projects level 3 is sufficient and is certainly the right level for regular testing.

Figure 4: Memory analysis accuracy improves with analysis level—to a point. (Memory Analysis Time and Accuracy by Level: level 0, with no analysis, runs in 0.11 seconds; level 3 finds 88 percent of the problems in 2.82 seconds, 25x the normal running time.)

Threading Analysis: Features, Accuracy, and Performance

The sample project contains five threading errors findable by Intel Parallel Inspector at the highest analysis level. Unlike the memory analysis, all of the threading errors are safe in the sense that they do not corrupt memory. At level 1, Intel Parallel Inspector only detects potential deadlocks. At level 2, data races are also detected. Levels 3 and 4 increase the call stack depth of the reported data races.

The sample project contains the following threading problems:
>> 3 heap data races
>> 1 stack data race
>> 1 deadlock
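As a hypothetical sketch (again, not the actual sample source), a heap data race of the kind counted above can be as small as two threads incrementing one shared counter without synchronization:

#include <windows.h>

static long* counter;  // shared, heap-allocated: the raced-on data

DWORD WINAPI worker(LPVOID) {
    for (int i = 0; i < 100000; ++i)
        ++*counter;    // data race: unsynchronized read-modify-write
    return 0;
}

int main() {
    counter = new long(0);
    HANDLE t[2];
    t[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    t[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, t, TRUE, INFINITE);
    // Threading analysis at level 2 or higher reports the race on *counter;
    // InterlockedIncrement(counter) would be one fix.
    delete counter;
    return 0;
}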
Figure 5 shows the tradeoff between accuracy and runtime for threading analysis of the sample program, when run on a Windows 7 PC with an Intel Core i7 Extreme Edition processor. Analysis levels 1 through 4 are shown alongside the percentage of problems found at each level. Analysis level 0 represents the case in which Intel Parallel Inspector is not run on the code at all (i.e., the raw performance of the code).

The data for threading analysis tells a story similar to the memory analysis. Level 2 analysis catches 80 percent of the problems with a runtime expanding to 18x of normal. Level 4 analysis will check for data races among stack variables. Stack variables are not meant to be shared, so such data races are not common. A lower-level analysis, such as level 2 or 3, is more commonly used for regular testing.

Figure 5: Threading analysis accuracy improves with analysis level—to a point.

Further Reading and Resources

The help files installed with Intel Parallel Inspector (accessible in Microsoft Visual Studio* from the Help Menu as Intel® Parallel Studio—Parallel Studio Help—Inspector Help) give additional examples of detectable errors in the section "Problem Type Reference."
By Naveen Gv

Investigate the basics of calling Intel® Integrated Performance Primitives (Intel® IPP) from .NET Framework languages such as C#.

This article is intended to educate Intel® Integrated Performance Primitives (Intel® IPP) users on the basics of calling Intel IPP from .NET Framework languages such as C#. The most important consideration is how to manage the calls between the managed .NET application and the unmanaged Intel® IPP library. Figure 1 provides a high-level perspective.

Intel® IPP is an unmanaged code library written in native programming languages and compiled to machine code that can run on a target computer directly. In this article, we consider the Platform Invocation services (P/Invoke) .NET interoperability mechanism specifically for calling the Intel IPP from C#. We describe basic concepts such as passing callback functions and arguments of different data types from managed to unmanaged code, and the pinning of array arguments.

Microsoft* .NET Framework Overview: .NET Framework Terminology

.NET Framework: The Microsoft* .NET Framework is a managed runtime environment for developing applications that target the common language runtime (CLR) layer. This layer consists of the runtime execution services needed to develop various types of software applications, such as ASP .NET, Windows* Forms, XML Web services, distributed applications, and others. Compilers targeting the CLR must conform to the common language specification (CLS) and common type system (CTS), which are sets of language features and types common among all the languages. These specifications enable type safety and cross-language interoperability: an object written in one programming language can be invoked from another object written in another language targeting the runtime.

C#: C# is a programming language designed as the main language of the Microsoft .NET Framework. It compiles to a common intermediate language (CIL) like all the other languages compliant with the .NET Framework. CIL provides code that runs under control of the CLR. This is managed code. All code that runs outside the CLR is referred to as unmanaged code. CIL is an element of an assembly.

Assembly: An assembly is a collection of types and resources that are built to work together and form a logical unit of functionality. Assemblies are the building blocks of .NET Framework applications, are stored in portable executable (PE) files, and can be a DLL or an EXE. Assemblies also contain metadata, the information used by the CLR to guarantee security, type safety, and memory safety for code execution.

Garbage collector: The .NET Framework's garbage collector (GC) service manages the allocation and release of memory for the managed objects in the application. The GC checks for objects in the managed heap that are no longer used by the application and performs the operations necessary to reclaim their memory. The data under control of the GC is managed data.

Unsafe code: C# code that uses pointers is called unsafe code. The keyword "unsafe" is a required modifier for callable members such as properties, methods, constructors, classes, or any block of code. Unsafe code is a C# feature for performing memory manipulation using pointers. Use the keyword fixed (to pin the object) to avoid movement of the managed object by the GC. Note that unsafe code must be compiled with the /unsafe compiler option.
.NET Framework Interoperability Mechanisms

The CLR supports the Platform Invocation Service (P/Invoke), which allows mapping a declaration of a managed method to an unmanaged method. The resulting declaration describes the stack frame but assumes the external method body from a native DLL.

P/Invoke Service

P/Invoke enables managed code to call C-style unmanaged functions in native DLLs. P/Invoke can be used in any .NET Framework-compliant language. It is important to be familiar with the attributes DllImport, MarshalAs, StructLayout, and their enumerations to use P/Invoke effectively.

When a P/Invoke call is initiated to call an unmanaged function in the native DLL, the P/Invoke service will perform the following steps:

1. Locate the DLL specified by the DllImport attribute by searching either in the working directory or in the directories and sub-directories specified in the PATH variable, and then load the DLL into memory.
2. Find the function declared as static extern in the DLL loaded to memory.
3. Push the arguments on the stack by performing marshalling and, if required, using the attributes MarshalAs and StructLayout.
4. Disable pre-emptive garbage collection.
5. Transfer control to the unmanaged function.

Declare Static Extern Method with the DllImport Attribute

An unmanaged method must be declared as static extern with the DllImport attribute. This attribute defines the name of the native shared library (native DLL) where the unmanaged function is located. The attribute DllImport and the function specifiers static extern specify the mapping that the .NET Framework uses to create the function and to marshal data. The DllImport attribute has parameters to specify correspondence rules between managed and native methods, such as CharSet (Unicode or Ansi), ExactSpelling (true or false), CallingConvention (Cdecl or StdCall), and EntryPoint. In the simplest case, the managed code can directly call a method foo() as illustrated in Figure 2.

[DllImport("custom.dll")]
static extern double
foo(double a, double b);

Figure 2

Marshalling for the Parameters and Return Values

The .NET Framework provides an interoperability marshaller to convert data between managed and unmanaged environments. When managed code calls a native method, parameters are passed on the call stack. These parameters represent data in both the CLR and native code; they have a managed type and a native type.

Some data types have identical data representations in both managed and unmanaged code. They are called isomorphic, or blittable, data types, and they do not need special handling or conversion when passed between managed and unmanaged code. Basic data types of this kind are float/double, integer, and one-dimensional arrays of isomorphic types. These are common types in Intel IPP.

However, some types have different representations in managed and unmanaged code. These types are classified as non-isomorphic, or non-blittable, data types and require conversion, or marshalling. Table 1 presents some non-isomorphic types commonly used in the .NET Framework.

Table 1
Managed    Unmanaged
Boolean    BOOL, Win32 BOOL, or Variant
Char       CHAR, or Win32 WCHAR
Object     Variant, or Interface
Array      SafeArray

In some cases, the default marshalling can be used. But not all parameters or return values can be marshalled with the default mechanism. In such cases, the default marshalling can be overridden with the appropriate marshalling information. Marshalling includes not only conversion of the data type, but other options such as the description of the data layout and the direction of parameter passing. There are some attributes for these purposes: MarshalAs, StructLayout, FieldOffset, InAttribute, and OutAttribute. MarshalAsAttribute and the Marshal class in the System.Runtime.InteropServices namespace can be used for marshalling non-isomorphic data between managed and unmanaged code.

Pinning Arrays and Other Objects

All objects of the .NET Framework are managed by the GC. The GC can relocate an object in memory asynchronously. If managed code passes to native code a reference to a managed object, this object must be prevented from being moved in memory. Its current location must be locked (or pinned) for the life of the native call.

Pinning can be performed manually or automatically. Manual, or explicit, pinning can be performed by creating a GCHandle: GCHandle pinnedObj = GCHandle.Alloc(anObj, GCHandleType.Pinned). Pinning can also be performed using the fixed statement in unsafe code. The /unsafe compilation option is required in this case.

The other method is automatic pinning. When the runtime marshaller meets managed code that passes an object to native code, it automatically locks the referenced objects in memory for the duration of the call. Automatic pinning enables calling native methods in the usual manner and does not require the /unsafe option during compiling. Declarations are similar for both methods (Figure 3).

[DllImport("custom.dll")]
unsafe static extern float
foo(float *x);

// automatic pinning
[DllImport("custom.dll")]
static extern float
foo(float[] x);

Figure 3: Pinning arrays and other objects: declarations are similar for both methods.

Note that aggressive pinning of short-lived objects is not good practice, because the GC cannot move a pinned object. This can cause the heap to become fragmented, which reduces the available memory.

Callback Function

A callback function is a managed or unmanaged function that is passed to an unmanaged DLL function as a parameter. To register a managed callback that an unmanaged function uses, declare a delegate with the same argument list and pass an instance of it through P/Invoke; on the unmanaged side, it appears as a function pointer (Figure 4). The CLR automatically pins the delegate for the duration of the native call. Moreover, there is no need to create a pinning handle for the delegate for an asynchronous call: in this case, the unmanaged function pointer actually refers to a native code stub dynamically generated to perform the transition and marshalling. This stub exists in fixed memory outside of the GC heap. The lifetime of the native code stub is directly related to the lifetime of the delegate.

The delegate instance, which is passed to unmanaged code, employs the StdCall calling convention. An attribute to specify the Cdecl calling convention is available only in the .NET Framework v2.0+.

BLOG highlights

Structured Parallel Programming
by Michael McCool

One way of looking at parallel patterns (sometimes called algorithmic skeletons) is through an analogy with "structured programming." The premise of structured programming is that a small number of control flow and data management patterns can be composed to implement the necessary control flow and data access logic in most serial programs. There is some evidence (see, for example, work by Skillicorn, Campbell, Cole, and the Berkeley Dwarfs) that a relatively small number of patterns can also express the necessary task and data organization in a large fraction of parallel programs.

Back in the '70s there was a heated argument about structured vs. unstructured control flow. Basically, you can obviously do anything you want with a conditional goto, which is usually the only control flow construct made available in machine language. From the point of view of completeness, this is all a programming language really needs to support. However, many computer scientists noted that there were certain maintainability advantages to restricting control flow to the composition of a small number of patterns supporting iteration (do/while, repeat/until, for) and selection (if/then/else, switch) in high-level languages.
public delegate void MyCallback();
[DllImport("custom.dll")]
public static extern void
MyFunction(MyCallback callback);

Figure 4

Intel IPP Overview

Intel IPP is a low-level software library. It provides a set of basic functions highly optimized for the IA-32, IA-64, and Intel® 64 architectures. The use of the library significantly speeds up a wide variety of software in different application areas: signal and image processing, speech (G.728, GSM-AMR, Echo Canceller), audio (MP3, AAC) and video (MPEG-2, MPEG-4, H.264, VC1) coding, image coding, data compression (BWT, MFT, RLE, LZSS, LZ77), cryptography (SHA, AES, RSA certified by NIST), text processing, and computer vision. The Intel IPP software runs on different operating systems: Windows* OS, Linux* OS, and Mac OS* X.

Intel IPP is a C-style API library. However, due to the StdCall calling convention, the primitives can be used in applications written in many other languages. For example, they work in applications written in Fortran, Java*, Visual Basic, C++, and C#. The Intel IPP functions can be used in the Microsoft .NET Framework managed environment to speed up the performance of applications on Intel® processors and compatible platforms.

C# Interface to Intel IPP: Namespace and Structures

The Intel IPP software, in addition to the libraries, provides special wrapper classes for the functions of several Intel IPP domains: signal and image processing, color conversion, cryptography, string processing, data compression, JPEG coding, and math and vector math. The wrappers allow Intel IPP users to call Intel IPP functions in C# applications. These classes, in the case of the signal processing (sp) and image processing (ip) functions, are declared as in Figure 5.

// ipps.cs
namespace ipp {
  public class sp {
    ...
  };
};

// ippi.cs
namespace ipp {
  public class ip {
    ...
  };
};

Figure 5

Enumerated data types that are used in the Intel IPP library must be declared in the wrapper classes with the keyword "enum". For example, the enumerator IppRoundMode is declared in ippdefs.cs as illustrated in Figure 6.

public enum IppRoundMode {
  ippRndZero = 0,
  ippRndNear = 1,
  ippRndFinancial = 2,
};

Figure 6

Many Intel IPP functions use structures as parameters. To pass structures from a managed environment to an unmanaged environment, the managed struct must comply with the corresponding structure in the unmanaged code. The attribute StructLayout is used with the value LayoutKind.Sequential. Figure 7 shows how to use the structure Ipp64fc, which is a double complex number type.

[StructLayout(LayoutKind.Sequential,CharSet=CharSet.Ansi)]
public struct Ipp64fc {
  public double re;
  public double im;
  public Ipp64fc( double re, double im ) {
    this.re = re;
    this.im = im;
  };
};

Figure 7

Almost all Intel IPP functions have pointers as parameters. The functions with pointers must be declared in the unsafe context. For example, the function ippiAndC_8u_C1R is declared as in Figure 8. This code must be compiled with the /unsafe option.

Figure 8: The unsafe declaration of ippiAndC_8u_C1R

Note that working with pointers in a C# application requires using the operator fixed. The fixed statement sets a pointer to a managed variable, and this variable is used during execution of the statement. Without the fixed statement, pointers to managed variables may be relocated unpredictably by the GC. For the class Bitmap, the methods LockBits and UnlockBits must be used.

namespace ExampleIP
{
  using System;
  using System.Windows.Forms;
  using System.Drawing;
  using System.Drawing.Imaging;
  using ipp;

  public class tip : System.Windows.Forms.Form {
    private System.Drawing.Bitmap bmpsrc, bmpdst;
    ...

Figure 9
Intel IPP Components—Image Processing Sample

Intel IPP samples include a C# interface and a demo application illustrating how C# developers can build applications with Intel IPP calls. The demo application performs filtering and morphological and geometric operations. Additionally, the image compression functions are used in the demo to read and write JPEG files. The application uses the wrapper classes for the image processing (ippi.cs) and image compression (ippj.cs) domains. The application launches the P/Invoke mechanism for unmanaged code in ippi-6.1.dll and loads the dispatcher of the processor-specific libraries. This dispatcher loads the most efficient library for a given processor. For example, the library ippiv8-6.1.dll is loaded on a system with an Intel® Core™2 Duo processor, and the library ippiw7-6.1.dll is loaded on a system with an Intel® Pentium® 4 processor.

Image ROI Processing

Special attention must be paid when working with functions that require border pixels (e.g., image filtering functions). The Intel IPP functions operate only on pixels that are part of the image. Therefore, a region of interest (ROI) is implied to be inside the image in such a way that all neighborhood pixels necessary for processing the ROI edges actually exist. For example, for filtering functions, the width of the border outside the ROI must not be less than half of the filter kernel size with the centered anchor cell.

When processing an image ROI, the developer has to perform two additional operations: shifting the pointer to the data and specifying an ROI size that is less than the image size. The sample code in Figure 9 illustrates how to work with an ROI (using the example of the Intel IPP function ippiFilterBox, which performs image blurring).

When calling functions with the same set of parameters, you can use the dynamic method of function search, as the demo code in Figure 10 shows.

namespace ExampleIP
{
  using System;
  using System.Windows.Forms;
  using System.Drawing;
  using System.Drawing.Imaging;
  using System.Reflection;
  using System.Collections;
  using ipp;

  private object ippi;
  private Type ippiType;
  private Hashtable hash;
  ...
  public tip(Bitmap bmp) {
    assembly = Assembly.LoadFrom("ippi_cs.dll");
    ippi = assembly.CreateInstance("ipp.ip");
    ippiType = ippi.GetType();
    ...
  };

  void CreateMenu() {
    hash = new Hashtable();
    MenuItem miBlur =
      new MenuItem("Blur", new EventHandler(MenuFilteringOnClick));
    hash.Add(miBlur, "ippiFilterBox_8u_C3R");
    MenuItem miMin =
      new MenuItem("Min Filter", new EventHandler(MenuFilteringOnClick));
    hash.Add(miMin, "ippiFilterMin_8u_C3R");
    ...
  };

  private void MenuFilteringOnClick(object sender, System.EventArgs e) {
    FilteringFunction((string)hash[sender]);
  };

  unsafe private void FilteringFunction(string func) { ...
    MethodInfo method = ippiType.GetMethod(func);
    const int ksize = 5, half = ksize/2;
    byte* pSrc = (byte*)bmpsrcdata.Scan0+(bmpsrcdata.Stride+3)*half,
          pDst = (byte*)bmpdstdata.Scan0+(bmpdstdata.Stride+3)*half;
    IppStatus st = (ipp.IppStatus)method.Invoke(null, new object[]
      {(IntPtr)pSrc, bmpsrcdata.Stride, (IntPtr)pDst, bmpdstdata.Stride,
       roi, new IppiSize(ksize,ksize), new IppiPoint(half,half)});
    ...};
  };
};

Figure 10

Intel IPP Performance from C#

Intel IPP functions are optimized for Intel® and compatible processors. This optimization can speed up the performance of various applications. However, specific features of using these libraries in the .NET Framework can decrease the performance of the application. Two possible reasons for a decrease in performance:

>> When a function is called from a DLL for the first time, the corresponding DLL must be loaded into memory (e.g., using LoadLibrary() on Windows* OSs), which takes 1,000 to 2,000 CPU clocks. More CPU clocks are needed to create the entry points for all functions that are exported by this DLL.

Table 2 shows the performance overhead numbers for Intel IPP image processing functions. Because Intel IPP functions are several times faster than corresponding C# implementations, it still makes sense to call Intel IPP despite the overhead of a C# call. To decrease the overhead effect, developers can create a component- or application-level interface in which one C# call leads to the execution of many IPP functions. An example of this approach is the .NET interface for DMIP. We can also compare the performance of a C# implementation against a C# call of the Intel IPP functions: for example, the C# .NET library Mirror costs 5.7 CPU cycles per pixel, while the Intel IPP-based C# call of the Mirror function costs 1.7 cycles per pixel.

Conclusion

The managed-unmanaged code interoperability provided in the .NET Framework environment can be performed in different ways. The best way to call C functions residing in a native DLL is to use the P/Invoke service, which is available in any managed language. P/Invoke provides powerful and flexible interoperability with inherited code. The DllImport attribute declares the external entry point. Marshalling attributes allow describing various options for the data conversion and data layout in addition to the default data marshalling. The use of SuppressUnmanagedCodeSecurityAttribute, InAttribute, and OutAttribute may significantly reduce the performance overhead. Automatic pinning of the objects passed prevents them from being garbage collected for the duration of the call. Manual pinning is also available if the pointer to the object is kept and used in native code after the call returns. A managed delegate type allows implementing callback functions.
DEVELOPER ROCK STAR: Steve Lionel
APP EXPERTISE: Fortran Compilers

For its fall 2009 webinar series, Intel invited Microsoft Visual Studio* C/C++ developers from a range of industry-leading companies to share the secrets behind their real-world successes using Intel® Parallel Studio. All webinars, including those from previous series, are available for immediate, on-demand download.

Check out a range of resources on a wide variety of software topics for a multitude of developer communities, ranging from manageability to parallel programming to virtualization and visual computing. This content-rich collection includes Intel® Software Network TV, popular blogs, videos, tools, and downloads.
Optimization Notice
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize
for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction
sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors.
For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they
implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library
routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other
microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and
Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will
get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree
for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and
Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on
Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine
which best meet your requirements. We hope to win your business by striving to offer the best performance of any
compiler or library; please let us know if you find we do not.
Notice revision #20101101
© 2010, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.