Using Modern C++ Techniques to Enhance Multicore Optimizations - Das's Edition
With multicore processors now commonplace in PCs, and core counts continually climbing, software
developers must adapt. By learning to tackle potential performance bottlenecks and issues with
concurrency, engineers can future-proof their code to seamlessly handle additional cores as they are
added to consumer systems.
To help with this effort, Intel software teams have created a graphics toolkit that shows how parallel
processing techniques are best applied to eight different graphics filters. The entire source code package
contains C++ files, header files, project files, filters, and database files. A DLL overlay with a simple user
interface shows the speed at which each filter can be applied, both in a single-core system and when
using parallel-processing techniques.
With applications across the computing spectrum, from database processing to image rendering,
parallel processing is a key concept for developers to understand. Readers are assumed to have some
experience and background in computer science to benefit from the concepts described here. The
source code was written for C++, but the concepts extend to other languages, and will be of interest to
anyone looking to better understand how to optimize their code for multicore processors.
An in-depth discussion of the Intel architecture is beyond the scope of this article. Software developers
should register at the Intel® Developer Zone and check out the documentation download page for Intel
architecture to read some of the following manuals:
• Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals
• Four-Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals
• Ten-Volume Set of Intel® 64 and IA-32 Architectures Software Developer's Manuals
• Software Optimization Reference Manual
• Uncore Performance Monitoring Reference Manuals
Getting Started
The system requirements to explore the Image Filters project solution are minimal. Any multicore
system with Microsoft Windows® 10 is sufficient.
This project assumes that you have a C++ toolkit, such as Microsoft Visual Studio* with the .NET
Framework. Freebyte* offers a full set of links if you want to explore different C++ tools. To simply
look through the code, you may want to use a free code viewer such as Notepad++* or a similar product.
1. Create an empty directory on your computer with a unique title such as “Image Filters Project”
or “Intel C++ DLL Project.”
Use whatever naming strategy you are comfortable with; you can include the year, for example.
2. Download the .zip file to your new directory.
Figure 1: You must build both projects inside the Solution Explorer.
6. From the Solution Explorer, select the ImageFiltersWPF project; then hold down the CTRL key
and select the ImageProcessing project.
7. Right-click on one of the highlighted files to pull up an Actions menu and select Build Selection.
This starts the compiler for both.
8. Wait while the system quickly compiles the existing source files into the binary, .DLL, and .EXE
files.
Multithreading Technique
By default, applications run on a single processing core of a system. Because virtually all new
computing systems feature a CPU with multiple cores and threads, complex calculations can be
distributed intelligently across them, greatly reducing computation times.
OpenMP* (Open Multi-Processing) is an API, first published in 1997 as Fortran 1.0, that supports
multiplatform shared-memory multiprocessing programming in C, C++, and Fortran on most
platforms, instruction set architectures, and operating systems. It consists of a set of compiler
directives, library routines, and environment variables that influence run-time behavior.
In the case of the C++ DLL, which is where the execution is actually happening, OpenMP involves using
compiler directives to execute the filtering routines in parallel. For picture processing, each pixel of the
input image has to be processed in order to apply the routine to that image. Parallelism offers an
interesting way of optimizing the execution time by spreading out the work across multiple threads,
which can work on different areas of the image.
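The idea can be sketched with a minimal loop (the function name and image layout here are illustrative, not the toolkit's actual code): the outer loop over rows is split across the available threads, and each thread writes to a disjoint set of rows, so no locking is needed.

```cpp
#include <omp.h>
#include <cstdint>
#include <vector>

// Invert every pixel of a width x height single-channel image.
// The row loop is distributed across threads by OpenMP; each thread
// touches different rows, so the iterations are independent.
void invertImage(std::vector<uint8_t>& pixels, int width, int height)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            pixels[y * width + x] = 255 - pixels[y * width + x];
        }
    }
}
```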
If you follow the comments in the code, “BoxBlur.cpp” is setting up offsets, handling calculations when
edge conditions make convolution impossible, and applying the convolution kernel to each element for
red, blue, and green colors.
In the second example of “omp parallel for” taken from “SobelEdgeDetector.cpp”, similar filtering
operations take place, with the edge detector working with grayscale pictures.
Memory Management
In software development, developers must be careful about memory management to avoid serious
impacts on application performance. In the case of the Harris corner detector and the Shi-Tomasi corner
detector, memory management is crucial to creating three matrices and storing the results of Sx, Sy and
Sxy.
Figure 5: Setting up temporary memory for the Shi-Tomasi corner detector filter.
Allocating such matrices for every pixel of the source would require considerable memory, and
multithreading would probably not be beneficial; in fact, it could even result in slower
calculations, due to the overhead of working with a large memory space.
In order to avoid memory issues and set up a scenario where multithreading makes the calculations
faster, these three matrices can be considered as a set of matrices, with each available thread
containing its own set of matrices. The application can then be set up to allocate, outside of the parallel
section, as many sets of matrices as there are available threads for this application. To get the maximum
number of threads the following function is used: “omp_get_max_threads()” from the “omp.h” file.
This file is found with the rest of the header files in the ImageProcessing > External Dependencies
directory.
As described at the Oracle support site, this function should not be confused with the similarly named
“omp_get_num_threads()”. The “max” call returns the maximum number of threads that can be put to
use in a parallel region. The “num” call returns the number of threads that are currently active and
performing work. Obviously, those are different numbers. In a serial region “omp_get_num_threads”
returns 1; in a parallel region it returns the number of threads that are being used.
The call “omp_set_num_threads” sets the maximum number of threads that can be used (equivalent to
setting the OMP_NUM_THREADS environment variable).
In the processing loop, each thread accesses its proper set of matrices. These sets are stored in a
dynamically allocated array of sets. To access the correct set the index of the actual thread is used with
the function “omp_get_thread_num()”. Once the routine is executed, the three matrices are reset to
their initial values, so that the next time the same thread has to execute the routine for another pixel,
the matrices are already prepared for use.
Principle of Convolution
Image filtering is a good showcase for multicore processing because it involves the principle of
convolution. Convolution is a mathematical operation that accumulates effects; think of it as starting
with two functions, such as f and g, to produce a third function that represents the amount of overlap as
one function is shifted over another.
In the case of image filtering, convolution works with a kernel, which is a matrix of numbers. This kernel
is then applied to every pixel of the image, with the center element of the convolution kernel placed
over the source pixel (Px). The source pixel is then replaced by the weighted sum of itself and nearby
pixels. Multithreading helps filter the image faster by breaking the process into pieces.
Figure 6: Convolution using weighting in a 3 x 3 kernel (source: GNU Image Manipulation Program).
In this example, the convolution kernel is a 3 x 3 matrix represented by the green box. The source pixel
is 50, shown in red at the center of the 3 x 3 matrix. All local neighbors, or nearby pixels, are the pixels
directly within the green square. The larger the kernel, the larger the number of neighbors.
In this case, the only nonzero weight is the second element of the first row, represented
by a 1. All other elements in the 3 x 3 matrix are 0. The operation multiplies each element by zero,
removing it, except for the single pixel represented by 42. So, the new source pixel is 42 x 1 = 42. Thus,
the pixel just above the source pixel is overlapped by the weight 1 of the convolution kernel.
If you imagine each weighting as a fraction rather than zero, you can picture how images could be
blurred by analyzing and processing each surrounding pixel.
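The operation is straightforward to sketch in code, assuming a single-channel image stored row-major and an interior pixel so that no edge handling is needed (the function name is invented for illustration):

```cpp
#include <vector>

// Apply a 3 x 3 convolution kernel to one interior pixel (x, y) of a
// single-channel image; the kernel's center element is placed over the
// source pixel, and the result is the weighted sum of its neighborhood.
float convolvePixel(const std::vector<float>& img, int width,
                    int x, int y, const float kernel[3][3])
{
    float sum = 0.0f;
    for (int ky = -1; ky <= 1; ++ky)
        for (int kx = -1; kx <= 1; ++kx)
            sum += kernel[ky + 1][kx + 1] * img[(y + ky) * width + (x + kx)];
    return sum;
}
```

Using the kernel from Figure 6 (a single 1 in the second element of the first row) on a neighborhood whose pixel above the source is 42 reproduces the worked example: the result is 42.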
Filtering Techniques
To see the result of a filtering technique, you’ll have to build the project as described in the “Getting
Started” section. Then double-click on the Image Filters > ImageFiltersWPF > Bin > Debug >
ImageFiltersWPF.exe file.
The interface is very simple. You can select images on your system using the directory search feature in
the gray box in the upper-right corner. Use the “Stretch” button to make sure an image completely fills
the graphical user interface (GUI). Select an image, then apply a filter. Watch the “Time in seconds”
calculations at the bottom-left of the interface to see how long a filter would take to apply in a multicore
system versus a system with a single core.
There are eight filters in all; each alters the original image, but some filters create more dramatic
changes.
The weights of the convolution kernel for the box blur are all the same. Assuming a 3 x 3
matrix, there are nine elements in the kernel in total, and the weight of each element is
calculated so that the sum of all elements is 1; that is, each weight is 1/9.
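Building such a kernel is a one-liner (the helper name is invented for illustration):

```cpp
#include <vector>

// Build an N x N box blur kernel: every weight is 1 / (N * N), so the
// weights sum to 1 and the overall image brightness is preserved.
std::vector<float> makeBoxKernel(int n)
{
    return std::vector<float>(n * n, 1.0f / (n * n));
}
```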
Figure 8: The original image on the left is vivid, with good detail, while the image on the right has had the Box Blur effect
applied.
When using the app, it was calculated that a single core system would take 0.1375 seconds to apply the
Box Blur while a multicore system, in this case with an Intel® Core™ i7-4220 processor, took 0.004
seconds.
Let’s look in depth at what is going on in BoxBlur.cpp to understand the multithreading principles.
#include "stdafx.h"
#include <fstream>
#include "omp.h"
extern "C" __declspec(dllexport) int __stdcall BoxBlur(BYTE* inBGR, BYTE* outBGR, int
stride, int width, int height, KVP* arr, int nArr)
{
// Pack the following structure on one-byte boundaries
// (smallest possible alignment).
// This allows us to use the minimal memory space for this type
// (exact fit - no padding).
First, the file includes the “stdafx.h” header, a standard system include, followed by <fstream> for
file stream support. Finally, “omp.h” is the header file that brings in the OpenMP instructions.
BoxBlur then declares the exported function with extern “C” so that its name is not mangled and it
can be called from the DLL's consumers. The rest of the C++ file is devoted to multicore
functionality. First, using “#pragma pack(push, 1)”, the file defines how to efficiently handle a
BGRA (blue green red alpha) color component packing structure on one-byte boundaries using the
smallest possible alignment.
Next, the file declares “#pragma pack(pop)” to restore the default packing mode, defines the Boolean
operator for whether multiple cores have been detected, sets up the convolution kernel, and allocates
memory.
Finally, if there are multiple cores (OpenMP = true), the file uses “#pragma omp parallel for if
(openMP)”. The code determines offsets and casts, and handles situations at the edges where
convolution is not possible. Results are written to the output image, and the allocated memory is
cleared for the convolution kernel. There are similar sections of code in each of the filters.
The weights of the elements of an N x N Gaussian matrix are calculated by the following:

w(x, y) = exp(-((x - c)^2 + (y - c)^2) / (2 * sigma^2)), where c = (N - 1) / 2

Here x and y are the coordinates of the element in the convolution kernel, c is the coordinate of the
kernel's center, and sigma is the standard deviation of the Gaussian. The top-left corner element is
at the coordinates (0, 0), and the bottom-right at the coordinates (N-1, N-1).
For the same reason as the Box Blur, the sum of every element has to be 1. Thus, at the end, we need
each element of the kernel to be divided by the total sum of the weights of the kernel.
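A sketch of that computation (the function name is invented; the normalization by the total sum is folded in as the text describes):

```cpp
#include <cmath>
#include <vector>

// Build an N x N Gaussian kernel. Element (x, y) is weighted by
// exp(-(dx^2 + dy^2) / (2 * sigma^2)), where (dx, dy) is the distance to
// the kernel's center; the kernel is then divided by the total sum of
// its weights so that the weights add up to 1.
std::vector<double> makeGaussianKernel(int n, double sigma)
{
    std::vector<double> k(n * n);
    const double c = (n - 1) / 2.0;   // center coordinate
    double sum = 0.0;
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
        {
            double w = std::exp(-((x - c) * (x - c) + (y - c) * (y - c))
                                / (2.0 * sigma * sigma));
            k[y * n + x] = w;
            sum += w;
        }
    for (double& w : k) w /= sum;      // normalize: weights sum to 1
    return k;
}
```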
The Sobel edge detector convolves the image with two 3 x 3 kernels, Gx (left) and Gy (right):

Gx:             Gy:
 1   0  -1       1   2   1
 2   0  -2       0   0   0
 1   0  -1      -1  -2  -1
This means that for each pixel there are two results of convolution, Gx and Gy. Considering them as
the two components of a 2D vector, the magnitude G is calculated as follows:

G = sqrt(Gx^2 + Gy^2)
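In code, combining the two convolution results is a one-line computation:

```cpp
#include <cmath>

// Combine the two Sobel convolution results Gx and Gy into the
// gradient magnitude G = sqrt(Gx^2 + Gy^2).
double sobelMagnitude(double gx, double gy)
{
    return std::sqrt(gx * gx + gy * gy);
}
```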
The Laplacian edge detector uses a single 3 x 3 convolution kernel:

-1  -1  -1
-1   8  -1
-1  -1  -1
Laplacian of Gaussian
The Laplacian edge detector is particularly sensitive to noise so, to get better results, we can apply a
Gaussian Blur to the whole image before applying the Laplacian filter. This technique is named
“Laplacian of Gaussian.”
Thus, Sx and Sy are calculated for every pixel in the window. Sxy is calculated as Sxy = Sx * Sy.
These results are stored in three different matrices. These matrices respectively represent the values Sx,
Sy, and Sxy for every pixel around the source pixel (and also itself).
A Gaussian matrix (the one used for the Gaussian blur) is then applied to these three matrices, which
results in three weighted values of Sx, Sy, and Sxy. We will name them Ix, Iy, and Ixy.
Then, if k is greater than a given threshold, the source pixel is turned into white, as part of a corner. If
not, it is set to black.
Once the matrix A is calculated, in the same way as above, the eigenvalues 𝜆1 and 𝜆2 of the matrix A are
calculated. The eigenvalues of a matrix are the solutions of the equation det(A - 𝜆I) = 0, where I is
the identity matrix.
As with the Harris corner detector, depending on the value of k, this source pixel will or will not be
considered part of a corner.
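For a symmetric 2 x 2 matrix the eigenvalues have a simple closed form, so no general eigensolver is needed. A sketch, with invented function names and the Ix, Iy, Ixy values from the text as inputs:

```cpp
#include <cmath>
#include <utility>

// Eigenvalues of the symmetric 2 x 2 structure matrix
//   A = | Ix   Ixy |
//       | Ixy  Iy  |
// i.e. the solutions of det(A - lambda * I) = 0. For a symmetric matrix
// they are mean +/- d, where mean = (Ix + Iy) / 2 and
// d = sqrt(((Ix - Iy) / 2)^2 + Ixy^2).
std::pair<double, double> eigenvalues2x2(double ix, double iy, double ixy)
{
    double mean = (ix + iy) / 2.0;
    double d = std::sqrt((ix - iy) * (ix - iy) / 4.0 + ixy * ixy);
    return { mean + d, mean - d };  // lambda1 >= lambda2
}

// The Shi-Tomasi detector scores a pixel by the smaller eigenvalue.
double shiTomasiScore(double ix, double iy, double ixy)
{
    return eigenvalues2x2(ix, iy, ixy).second;
}
```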
Figure 12: Example of Shi-Tomasi corner detector filter, resulting in almost total conversion to black pixels. The filtering took 61
seconds using a single core, versus 14 seconds using multiple cores.
Developers should not expect an N-times speedup when running a program parallelized using
OpenMP on an N-processor platform. There are several reasons why this is true:
● When a dependency exists, a process must wait until the data it depends on has been computed.
● When multiple processes share a resource that cannot be accessed in parallel (like a file to
write to), their requests are executed sequentially; each thread must wait until the other
threads release the resource.
● A large part of the program may not be parallelized by OpenMP, which means that the
theoretical upper limit of speedup is limited, according to Amdahl's law.
● N processors in symmetric multiprocessing (SMP) may have N times the computation power,
but the memory bandwidth usually does not scale up N times. Quite often, the original memory
path is shared by multiple processors, and performance degradation may be observed when
they compete for the shared memory bandwidth.
● Many other common problems affecting the final speedup in parallel computing also apply to
OpenMP, like load balancing and synchronization overhead.
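Amdahl's law is easy to evaluate directly. For example, even with 95% of a program parallelized, 18 cores yield a speedup of only about 9.7x, and the limit as the core count grows is 1 / (1 - p) = 20x (the function name below is invented for illustration):

```cpp
// Amdahl's law: if a fraction p of a program can be parallelized
// across n processors, the overall speedup is 1 / ((1 - p) + p / n).
double amdahlSpeedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```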
With current systems powered by processors such as the Intel® Core™ i9-7980XE Extreme Edition
processor, which has 18 cores and 36 threads, the advantages of developing code that is optimized for
multithreading are obvious. To learn more, download the app, analyze it with an integrated
development environment such as Microsoft Visual Studio, and get started on your own project.
ImageFileList.cs
This class’s primary function is to generate a list of image files that can be selected in order to apply
filters.
BitmapWrapper.cs
This class accepts a source bitmap image and turns it into a byte array that can be consumed by the
C++ DLL.
ImageProcessWrapper.cs
This class loads up the DLL and supplies the function to call, which passes the byte arrays to the DLL.
MainWindow.xaml.cs
This is the MainWindow GUI code. It gets the current filter name and sends the BitmapWrapper
instance to the DLL, twice: once for a single core, then again for multiple cores. After each run, it
updates the labels that show the number of seconds it took to process the image. Once the processing
is complete, the new image is displayed.
Resources
Intel Developer Zone: https://software.intel.com
OpenMP: http://www.openmp.org/
Notices
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is
granted by this document.
Intel disclaims all express and implied warranties, including without limitation, the implied
warranties of merchantability, fitness for a particular purpose, and non-infringement, as well
as any warranty arising from course of performance, course of dealing, or usage in trade.
This document contains information on products, services and/or processes in development.
All information provided here is subject to change without notice. Contact your Intel
representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may
cause deviations from published specifications. Current characterized errata are available on
request.
Copies of documents which have an order number and are referenced in this document may
be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.