
Intel OpenMP Webinar


Vasanth Tovinkere | Architect, Flow Graph Analyzer

Intel® Corporation

*Other names and brands may be claimed as the property of others


What will be covered today
Task-based parallelism and task graphs
• Challenges

Overview of Intel® Advisor - Flow Graph Analyzer (FGA)


Walking through a sample
Summary

Optimization Notice
Copyright © 2018, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
Task-based parallelism
Advantages of task-based parallelism
• Makes parallelization efficient for irregular and runtime-dependent execution
• Promotes higher level thinking
• Improves load balancing
Tasks with dependencies
• Fall into two categories: explicit and implicit
• Extends the expressiveness of task-based parallel programming
• Reduces the need for global synchronization mechanisms such as task barriers
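
As an illustration of the last point, here is a minimal sketch in which task dependencies alone order the work, with no barrier between the tasks. The function name and the variables a, b and c are hypothetical and are not taken from the webinar.

```cpp
// Hypothetical example: two independent producers and one consumer.
// Build with OpenMP enabled (for example -fopenmp on GCC or Clang); without
// it the pragmas are ignored and the code runs serially with the same result.
int sum_with_task_dependencies() {
    int a = 0, b = 0, c = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a) shared(a)
        { a = 1; }                  // producer of a
        #pragma omp task depend(out: b) shared(b)
        { b = 2; }                  // producer of b, independent of a
        // The consumer is ordered after both producers by the depend
        // clauses alone; no taskwait is placed between the tasks.
        #pragma omp task depend(in: a, b) depend(out: c) shared(a, b, c)
        { c = a + b; }
    }   // implicit barrier: all generated tasks have completed here
    return c;
}
```

The two producers may run concurrently, while the consumer waits only for the data it actually reads.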

Applications often contain multiple levels of parallelism
• Task parallelism / message passing (visible in FGA)
• Fork-join (visible in FGA)
• SIMD

Asynchronous task graphs (implicit vs. explicit)

OpenMP* (implicit): the dependency is derived from the depend clause, in this case on the variable 's'.

#pragma omp parallel
{
    #pragma omp single
    {
        std::string s;
        {
            #pragma omp task depend(out: s)
            {
                s = "Hello ";
                cout << s;
            }
            #pragma omp task depend(out: s)
            {
                s = "World!\n";
                cout << s;
            }
        }
    }
}

Threading Building Blocks (TBB) (explicit): the dependency is expressed through the make_edge() call.

graph g;
continue_node<continue_msg> h( g,
    []( const continue_msg & ) {
        cout << "Hello ";
    } );
continue_node<continue_msg> w( g,
    []( const continue_msg & ) {
        cout << "World!\n";
    } );
make_edge(h, w);
h.try_put(continue_msg());
g.wait_for_all();

Challenges with asynchronous task graphs

Creating implicit or explicit task graphs programmatically is easy
• Determining what was created is hard in many cases
New programming paradigm
Allows you to stream data through the graph, which makes debugging challenging
Graph algorithms can be latency-bound or throughput-bound
Parallelism is unstructured in certain types of graphs, so performance analysis can be challenging

Intel® Advisor – Flow Graph Analyzer
• Toolbar supporting basic file and editing operations, visualization and analytics that operate on the graph or performance traces
• General health of the graph displayed as a tree-map. The area of each square represents the CPU time taken by a node as a percentage of the application run, and the color indicates the concurrency observed when that node was active
• Palette of supported TBB node types organized in like groups
• Canvas for visualizing the graphs
• Hierarchical view of the graph displayed as a tree
• A pane that displays the execution trace data, graph statistics and output generated by custom analytics, and allows interactions with this data

Workflows and UI features

Workflows: Create, Debug, Visualize and Analyze
Design mode
• Allows you to create a graph topology interactively
• Validate the graph and explore what-if scenarios
• Add C/C++ code to the node body
• Export C++ code using Threading Building Blocks (TBB) flow graph API
Analysis mode
• Compile your application (with tracing enabled)
• Capture execution traces during the application run
• Visualize/analyze in Flow Graph Analyzer
Creating Asynchronous Task-graphs

Intel® Advisor – Flow Graph Analyzer (Design mode)
Graph Creation

Drag and Drop Support

Interactive Canvas

Analytics and Modeling

Let’s make this our “hello” node
Validation

Code Generation

Intel® Advisor – Flow Graph Analyzer (Design mode)
Serialization

GraphML* file format – uses extensions

C/C++ code generated from the graph

Challenges with asynchronous task graphs

✓ New programming paradigm

Intel® Advisor – Flow Graph Analyzer (Design mode)
Compiling and collecting traces

Path must be updated so fgtrun.bat and fgt2xml.exe can be run from the command line
>cl hello_world.cpp /O2 /DTBB_USE_THREADING_TOOLS ... /link tbb.lib /OUT:hello_world.exe

>set FGT_ROOT=<installation-directory>\fga\fgt

>set INTEL_LIBITTNOTIFY64=<installation-directory>\fga\fgt\windows\bin\intel64\<vc-version>\fgt.dll

>hello_world.exe

Traces are saved to a unique directory _fgt_<date>_<time>


>fgt2xml.exe <name-for-the-trace-data-file>

Automatically converts the latest timestamped directory

Understanding Graph Execution

Examining the trace data: what’s possible?
“hello” node in all views that represent different information.

Shows trace information for the case when 1 message is sent to the “hello” node.

How did we get the node names to be the same as what was in the C++ code?

Examining the trace data: correlation
“hello” node in all views that represent different information.

Shows trace information for the case when 25 messages are sent to the “hello” node.

Interacting with the canvas
Clicking on a node on the canvas can highlight the corresponding node’s tasks in the timeline. This is turned OFF by default.

Interacting with the timeline
Clicking on a task in the timeline will highlight the corresponding node in the canvas.

Clicking on a section with low concurrency will highlight the nodes that are active at that time.

These nodes would be the starting point of a cause-and-effect analysis to see if they were responsible for the lower concurrency.
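
To make the idea concrete, here is a minimal sketch of how concurrency over time, and the nodes active at a chosen time, could be derived from task intervals. The TaskInterval record and both function names are hypothetical; this is not the trace format or the algorithm used by Flow Graph Analyzer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical trace record: one executed task of a graph node.
struct TaskInterval {
    std::string node;    // node name, e.g. "hello"
    std::int64_t begin;  // start timestamp (arbitrary units)
    std::int64_t end;    // end timestamp, end > begin
};

// Returns (timestamp, running tasks) steps: each entry gives the number of
// tasks running from that timestamp until the next entry's timestamp.
std::vector<std::pair<std::int64_t, int>>
concurrency_profile(const std::vector<TaskInterval>& tasks) {
    std::vector<std::pair<std::int64_t, int>> events;
    for (const auto& t : tasks) {
        events.emplace_back(t.begin, +1);  // a task starts
        events.emplace_back(t.end, -1);    // a task ends
    }
    std::sort(events.begin(), events.end());

    std::vector<std::pair<std::int64_t, int>> profile;
    int running = 0;
    for (std::size_t i = 0; i < events.size(); ++i) {
        running += events[i].second;
        // Emit one step per distinct timestamp, after applying all its events.
        const bool last_at_time =
            i + 1 == events.size() || events[i + 1].first != events[i].first;
        if (last_at_time) profile.emplace_back(events[i].first, running);
    }
    return profile;
}

// Names of the nodes that have a task running at time t (begin <= t < end).
std::vector<std::string> active_nodes(const std::vector<TaskInterval>& tasks,
                                      std::int64_t t) {
    std::vector<std::string> names;
    for (const auto& task : tasks) {
        const bool running = task.begin <= t && t < task.end;
        const bool seen =
            std::find(names.begin(), names.end(), task.node) != names.end();
        if (running && !seen) names.push_back(task.node);
    }
    return names;
}
```

Calling active_nodes at a timestamp where the profile dips gives the candidate nodes for the cause-and-effect analysis described above.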

Examining the trace data through Trace Playback
Playback of execution traces shows how data flows through the graph and which sections of the graph result in good or poor scaling.

Examining the trace data: node view
Node view captures all execution traces for a given node and presents them in a single swimlane for the node.

Each node swimlane is composed of multiple swimlanes representing the threads that executed an instance of the node.

Provides a compact representation of a node’s execution.

Challenges with asynchronous task graphs

✓ New programming paradigm
✓ Allows you to stream data through the graph, which makes debugging challenging

Examining the trace data with data analysis
How do we know which instance of the Hello task is in response to which input message?

Helps answer the following questions:
• Are the tasks operating on data retiring in order?
• Are they out of order?

We need to track the data flowing through the graph.

Examining the trace data with data analysis, cont.
Harder to track the data in dependency graphs as the Data ID cannot be propagated from one node to the next
• continue_node requires an input of type continue_msg

continue_node<continue_msg> hello( hello_world_g0, []( const continue_msg & ) {
    cout << "Hello ";
} );

continue_node<continue_msg> world( hello_world_g0, []( const continue_msg & ) {
    cout << "World!\n";
} );

We are going to convert the Hello World example to use function_node instead so we can send the ID from one node to the next

Examining the trace data with data analysis, cont.
Data tracking using an experimental feature will allow you to track which task instance is for which inputs.

1. We changed our graph to use a function_node instead of a continue_node
2. We have a source_node that streams 25 messages/data through the graph
3. We modified the graph to emit the data ID from the source node to hello and from hello to world
4. We added a user event API call to tell the tool which data we are processing in each node

Gives you insight into scheduler behavior.

Examining the trace data with data analysis, cont.
Data tracking using an experimental feature will allow you to track which task instance is for which inputs.

Statistics for the graph are organized by the data operated on and can be seen in the Data Analysis tab under Statistics.

Using data analysis, the questions posed earlier can be answered.

You can examine the trace data to see if the data is retiring in-order or out-of-order.

If the algorithm is meant to be latency bound, then order is important. If it is throughput bound, data can retire out-of-order.
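
The in-order question can also be checked outside the tool once completion times are known per data ID. The following is a minimal sketch under that assumption; the Retirement record and the function name are hypothetical and do not describe the Data Analysis tab's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical record: when the last node finished processing a data item.
struct Retirement {
    int data_id;            // ID assigned at the source, increasing in send order
    std::int64_t finished;  // completion timestamp (arbitrary units)
};

// Returns true if items retire in the order they were sent, that is, if
// sorting by completion time also leaves the data IDs in increasing order.
bool retires_in_order(std::vector<Retirement> items) {
    std::stable_sort(items.begin(), items.end(),
                     [](const Retirement& a, const Retirement& b) {
                         return a.finished < b.finished;
                     });
    return std::is_sorted(items.begin(), items.end(),
                          [](const Retirement& a, const Retirement& b) {
                              return a.data_id < b.data_id;
                          });
}
```

For a latency-bound algorithm a false result is worth investigating; for a throughput-bound one it may be acceptable.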

Challenges with asynchronous task graphs

✓ New programming paradigm
✓ Allows you to stream data through the graph, which makes debugging challenging
✓ Graph algorithms can be latency-bound or throughput-bound

Understanding the performance

A simulation example

Goes through multiple time steps

Graph is created once programmatically and executed for each time step
• A message is sent to the graph to trigger each time step
• Wait for the graph to process the message (current time step) before the next time step is triggered
• Implemented as a dependency graph using TBB continue_node
Measured performance shows some scaling relative to the serial implementation

Example: performance analysis
A complex graph was created programmatically.

Graph has 1319 nodes and 3066 edges.

General health of the graph shows a mix of red, yellow and green.

Concurrency observed over time ranges from good, where all cores are kept busy, to poor, where very few are kept busy.

What do the colors mean?

Challenges with asynchronous task graphs

✓ Creating implicit or explicit task graphs programmatically is easy
✓ Determining what was created is hard in many cases
✓ New programming paradigm
✓ Allows you to stream data through the graph, which makes debugging challenging
✓ Graph algorithms can be latency-bound or throughput-bound

Example: identifying problem areas
What was run and how much was run?

Run captures 11 time steps.

Appears to have one node that consumes a lot of CPU time.

This node also has poor observed concurrency when it executes.

Clicking on the node takes you to the node in the graph visualization.

You can also sort on the appropriate column in the statistics table.
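
The same "which node dominates?" question can be sketched in code as an aggregation over task records followed by a sort. The record layout and names below are hypothetical, and the share is computed against the total task time of all nodes, which is an assumption and may differ from the percentage the tool reports.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical trace record: one executed task of a graph node.
struct NodeTask {
    std::string node;
    std::int64_t begin;
    std::int64_t end;
};

// Aggregated statistics for one node.
struct NodeStat {
    std::string node;
    std::int64_t cpu_time;  // summed task durations
    double share;           // percent of the total task time of all nodes
};

// Sums task durations per node and sorts nodes by CPU time, largest first.
std::vector<NodeStat> cpu_time_by_node(const std::vector<NodeTask>& tasks) {
    std::map<std::string, std::int64_t> total_per_node;
    std::int64_t total = 0;
    for (const auto& t : tasks) {
        total_per_node[t.node] += t.end - t.begin;
        total += t.end - t.begin;
    }
    std::vector<NodeStat> stats;
    for (const auto& entry : total_per_node) {
        const double share = total > 0 ? 100.0 * entry.second / total : 0.0;
        stats.push_back({entry.first, entry.second, share});
    }
    std::sort(stats.begin(), stats.end(),
              [](const NodeStat& a, const NodeStat& b) {
                  return a.cpu_time > b.cpu_time;
              });
    return stats;
}
```

The first element of the result corresponds to the largest square in the treemap, under the stated assumption about how the share is computed.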

Example: identifying problem areas, cont.
Clicking on the node takes you to the node in the graph visualization.

1. To see all tasks belonging to this node in the execution trace, you will have to enable this interaction.
2. Click on the Show/Hide tasks button
3. Now select the node in the canvas

When this node is executing, the resource utilization is very poor.

• Improving the performance of this one node will substantially improve the performance of the graph.

Example: critical path
Analysis features

1. Critical Path
2. Rule-check

Critical Path

Computes the Critical Path(s) for the graph using the execution trace information.

The most dominant task, the one with the maximum CPU time and a correspondingly low concurrency (continue_node_1009), is on this critical path.

Critical path reduces the complexity in large graphs by isolating a small set of nodes for analysis and tuning for performance improvements
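
For readers unfamiliar with the technique, a critical path in a dependency graph can be sketched as the longest path through a directed acyclic graph weighted by node execution time. The code below is a generic textbook version under that assumption, with hypothetical names; it is not the algorithm Flow Graph Analyzer implements.

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// Returns the node indices along the most time-consuming path of a DAG.
// cost[i] is the measured execution time of node i; each edge (u, v) means
// v depends on u. The graph is assumed to be acyclic.
std::vector<int> critical_path(const std::vector<double>& cost,
                               const std::vector<std::pair<int, int>>& edges) {
    const int n = static_cast<int>(cost.size());
    if (n == 0) return {};

    std::vector<std::vector<int>> successors(n);
    std::vector<int> indegree(n, 0);
    for (const auto& e : edges) {
        successors[e.first].push_back(e.second);
        ++indegree[e.second];
    }

    std::vector<double> dist(cost);  // cost of the best path ending at a node
    std::vector<int> pred(n, -1);    // predecessor on that best path
    std::queue<int> ready;           // nodes whose predecessors are all done
    for (int v = 0; v < n; ++v)
        if (indegree[v] == 0) ready.push(v);

    // Process nodes in topological order (Kahn's algorithm).
    while (!ready.empty()) {
        const int u = ready.front();
        ready.pop();
        for (int v : successors[u]) {
            if (dist[u] + cost[v] > dist[v]) {
                dist[v] = dist[u] + cost[v];
                pred[v] = u;
            }
            if (--indegree[v] == 0) ready.push(v);
        }
    }

    // Walk back from the node where the most expensive path ends.
    const int last = static_cast<int>(
        std::max_element(dist.begin(), dist.end()) - dist.begin());
    std::vector<int> path;
    for (int v = last; v != -1; v = pred[v]) path.push_back(v);
    std::reverse(path.begin(), path.end());
    return path;
}
```

On a graph with more than a thousand nodes, such a path narrows tuning effort to the few nodes that bound the end-to-end time.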

What else can we look at?

Example: performance analysis
Analysis features

1. Critical Path
2. Rule-check

Rule check

Rule-check runs registered rules that may include validation and performance rules.

Challenges with asynchronous task graphs

✓ Creating implicit or explicit task graphs programmatically is easy
✓ Determining what was created is hard in many cases
✓ New programming paradigm
✓ Allows you to stream data through the graph, which makes debugging challenging
✓ Graph algorithms can be latency-bound or throughput-bound
✓ Parallelism is unstructured in certain types of graphs, so performance analysis can be challenging

What does it look like in FGA?

Applications often contain multiple levels of parallelism
• Task parallelism / message passing (visible in FGA)
• Fork-join: #pragma omp parallel for, tbb::parallel_for (visible in FGA)
• SIMD

Fork-join parallelism: tbb::parallel_for
Captures the execution task graph for a fork-join construct and provides additional analytics that present information about the construct

1. Imbalance
2. Efficiency
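
The slide does not define these two metrics. As a rough sketch, assuming efficiency means the fraction of available thread time spent doing work and imbalance means how far the busiest thread exceeds the average, they could be computed as follows. Both definitions and all names are assumptions, not the tool's formulas.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Assumed definitions (not taken from Flow Graph Analyzer):
//   efficiency = total busy time / (number of threads * wall time)
//   imbalance  = longest busy time / mean busy time - 1
struct ForkJoinMetrics {
    double efficiency;  // 1.0 means every thread was busy the whole time
    double imbalance;   // 0.0 means all threads did the same amount of work
};

// busy[i] is the time thread i spent executing loop iterations;
// wall is the elapsed time of the whole fork-join construct.
ForkJoinMetrics fork_join_metrics(const std::vector<double>& busy, double wall) {
    if (busy.empty() || wall <= 0.0) return {0.0, 0.0};
    const double total = std::accumulate(busy.begin(), busy.end(), 0.0);
    const double mean = total / static_cast<double>(busy.size());
    const double longest = *std::max_element(busy.begin(), busy.end());
    const double efficiency = total / (static_cast<double>(busy.size()) * wall);
    const double imbalance = mean > 0.0 ? longest / mean - 1.0 : 0.0;
    return {efficiency, imbalance};
}
```

Under these assumed definitions, four threads with busy times 8, 2, 2 and 4 over a wall time of 8 give an efficiency of 0.5 and an imbalance of 1.0.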

Multi-level parallelism: graph level + fork-join

Timeline shows trace information for the graph and any nested parallelism that is present

Multi-level parallelism in OpenMP*
Double-click here on the parallel region node to see the activity within the region.

Top-level shows just one entity, which is a parallel region in this OpenMP* example.

Top-level treemap shows poor resource utilization.

Hovering the mouse over the treemap shows activity in the parallel region – double click to show the details.

Download through Intel® Advisor package

Intel® Advisor – Flow Graph Analyzer

Product feature in Intel® Parallel Studio XE 2019

Tool supports analysis and design of parallel applications using OpenMP* and Threading Building Blocks

Available for Windows*, Linux* and macOS*

https://software.intel.com/en-us/articles/getting-started-with-flow-graph-analyzer

Summary
Asynchronous task graphs improve the efficiency of irregular and runtime-dependent execution
• TBB and OpenMP* provide mechanisms to program in this manner
Flow Graph Analyzer helps you create, debug, visualize and analyze such graphs
• Critical path analysis is crucial in reducing the complexity of the analysis problem to a handful of nodes
• Runtime-specific analyses, such as the lightweight policy analysis for TBB, target additional performance improvements

Resources

Getting started with FGA
https://software.intel.com/en-us/articles/getting-started-with-flow-graph-analyzer

Driving Code Performance with Intel® Advisor’s Flow Graph Analyzer
https://software.intel.com/en-us/download/parallel-universe-magazine-issue-30-october-2017

IWOMP 2018: Visualization of OpenMP* Task Dependencies Using Intel® Advisor – Flow Graph Analyzer
https://link.springer.com/chapter/10.1007%2F978-3-319-98521-3_12

CPUs, GPUs, FPGAs: Managing the alphabet soup with Intel Threading Building Blocks
https://software.intel.com/en-us/videos/cpus-gpus-fpgas-managing-the-alphabet-soup-with-intel-threading-building-blocks

Legal Disclaimer & Optimization Notice
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance
tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any
change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully
evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete
information visit www.intel.com/benchmarks.

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS
ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS
FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY
RIGHT.

Copyright © 2018, Intel Corporation. All rights reserved. Intel, the Intel logo, Pentium, Xeon, Core, VTune, OpenVINO, Cilk, are trademarks of
Intel Corporation or its subsidiaries in the U.S. and other countries.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel
microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the
availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent
optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture
are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
