
INSTRUCTION MIX SAMPLE

v2023.1.1 | September 2024


TABLE OF CONTENTS

Chapter 1. Introduction
Chapter 2. Application
Chapter 3. Configuration
Chapter 4. Initial version of the kernel
Chapter 5. Updated version of the kernel
Chapter 6. Resources

Chapter 1.
INTRODUCTION

This sample uses the Nsight Compute profiler to analyze a CUDA kernel that applies a simple Sobel edge detection filter
to an image in global memory. The profiler is used to identify a performance
bottleneck caused by an imbalanced instruction mix.

Chapter 2.
APPLICATION

This sample CUDA application applies a simple Sobel edge detection filter to an image
in global memory. The input and output images are at separate memory locations.
For simplicity, it only handles image sizes that are an integral multiple of the block size
(BLOCK_SIZE, defined in the source file "instructionMix.cu").
The instructionMix sample is available with Nsight Compute under
<nsight-compute-install-directory>/extras/samples/instructionMix.
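
The block-size constraint keeps the launch configuration simple, since the grid divides the image exactly.
A minimal sketch of the assumed host-side setup (variable names follow the kernel launch shown in
Chapter 4; the actual host code in the sample may differ):

dim3 block(BLOCK_SIZE, BLOCK_SIZE);                        // BLOCK_SIZE from instructionMix.cu
dim3 grid(imgWidth / BLOCK_SIZE, imgHeight / BLOCK_SIZE);  // exact division by assumption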

Chapter 3.
CONFIGURATION

The profiling results included in this document were collected on the following
configuration:
‣ Target system: Linux (x86_64) with an NVIDIA RTX A4500 (Ampere GA102) GPU
‣ Nsight Compute version: 2023.3.1
The Nsight Compute UI screenshots in this document were taken by opening the profiling
reports on a Windows 10 system.

Chapter 4.
INITIAL VERSION OF THE KERNEL

The Sobel operator performs a 2-D spatial gradient measurement on an image and
emphasizes regions of high spatial frequency that correspond to edges. Typically it is
used to find the approximate absolute gradient magnitude at each point in an input
grayscale image. Each thread applies the Sobel operator to one pixel of the input image
and generates one pixel of the output image. The operator uses two 3x3 kernels which
are convolved with the original image to calculate approximations of the derivatives - one


for horizontal changes, and one for vertical. The Sobel kernel is defined as a function
template that can be used as a generic function for different floating point precisions.

template<typename FLOAT_T>
__global__ void Sobel(
    uchar4* pOut,
    uchar4* pImg,
    const int imgWidth,
    const int imgHeight)
{
    const int tx = blockIdx.x * blockDim.x + threadIdx.x;
    const int ty = blockIdx.y * blockDim.y + threadIdx.y;
    const int outIdx = ty * imgWidth + tx;

    // 3x3 Sobel convolution kernels for horizontal (SX) and vertical (SY) gradients
    const int SX[] = {1, 2, 1, 0, 0, 0, -1, -2, -1};
    const int SY[] = {1, 0, -1, 2, 0, -2, 1, 0, -1};

    FLOAT_T sumX = 0.;
    FLOAT_T sumY = 0.;
    for (int j = -1; j <= 1; ++j)
    {
        for (int i = -1; i <= 1; ++i)
        {
            const auto idx = (j + 1) * 3 + (i + 1);
            const auto sx = SX[idx];
            const auto sy = SY[idx];

            const auto luminance = GetPixel(pImg, tx + i, ty + j, imgWidth, imgHeight);
            sumX += (FLOAT_T)luminance * (FLOAT_T)sx;
            sumY += (FLOAT_T)luminance * (FLOAT_T)sy;
        }
    }

    sumX /= (FLOAT_T)9.;
    sumY /= (FLOAT_T)9.;

    const FLOAT_T threshold = 24.;
    if (sumX > threshold || sumY > threshold)
    {
        pOut[outIdx] = make_uchar4(0, 255, 255, 0);
    }
}
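
The helper GetPixel is not shown in this excerpt. The following is only a hypothetical sketch of
what such a helper might look like, assuming a grayscale input (all channels equal) and
clamp-to-edge handling of out-of-range coordinates:

// Hypothetical helper - not part of the sample source as reproduced here.
__device__ unsigned char GetPixel(const uchar4* pImg, int x, int y,
                                  int imgWidth, int imgHeight)
{
    x = min(max(x, 0), imgWidth - 1);   // clamp to the image border
    y = min(max(y, 0), imgHeight - 1);
    return pImg[y * imgWidth + x].x;    // grayscale: any channel carries the luminance
}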

The initial version of the Sobel kernel performs the math operations on the grayscale
values in double precision floating point.

Sobel<double><<<grid, block>>>(pDstImage, pSrcImage, imgWidth, imgHeight);

Profile the initial version of the kernel


There are multiple ways to profile kernels with Nsight Compute. For full details see the
Nsight Compute Documentation. One example is to perform the following steps:

‣ Refer to the README distributed with the sample on how to build the application
‣ Run ncu-ui on the host system
‣ Use a local connection if the GPU is on the host system. If the GPU is on a remote
system, set up a remote connection to the target system
‣ Use the Profile activity to profile the sample application


‣ Choose the full section set


‣ Use defaults for all other options
‣ Set a report name and then click on Launch
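
The same profile can also be collected with the ncu command-line interface and opened in
ncu-ui afterwards. A hedged example, assuming the built binary is named instructionMix and
is run from the sample directory:

ncu --set full -o instructionMix_initial ./instructionMix

Here --set full selects the full section set and -o writes the report to a file, matching the
UI options described above.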

Summary page
The Summary page lists the profiled kernels and provides some key metrics for each. It
also lists the performance opportunities and the estimated speedup for each.

For this kernel it shows a hint for FP64/32 Utilization and suggests using 32-
bit floating point operations to improve performance. Click on the FP64/32
Utilization rule link to see more context on the Details page. It opens the GPU
Speed of Light Throughput section on the Details page.

Details page - GPU Speed Of Light Throughput


The Details page GPU Speed Of Light Throughput section provides a high-level
overview of the throughput for compute and memory resources of the GPU used by the
kernel.
The initial version of the kernel has a duration of 628.03 microseconds and this is used as
the baseline for further optimizations.


For this kernel it shows a hint for High Throughput and FP64/32 Utilization and
suggests looking at the Compute Workload Analysis section. We can also see the GPU
Throughput Breakdown tables at the bottom for Compute Throughput and Memory
Throughput. The Compute Throughput Breakdown table shows that the SM FP64 pipe
throughput is high (85.25%). Click on Compute Workload Analysis to analyze the
usage of the compute resources of the streaming multiprocessors (SM).

Details page - Compute Workload Analysis section


The Compute Workload Analysis section shows a hint for Very High
Utilization. It shows that FP64 is the highest-utilized pipeline (86.44%). The FP64
pipeline executes 64-bit floating point operations. It mentions that the pipeline is over-
utilized and likely a performance bottleneck. The guidance provided is to try to
decrease the utilization of the FP64 pipeline.

Chapter 5.
UPDATED VERSION OF THE KERNEL

Based on the profiler hint of high FP64 pipeline utilization, we modify the code to use
single precision floating point instead of double precision. Since our input image has a
very limited value range and the Sobel operator is not sensitive to minor differences in
precision, switching the computations from double to single precision has no negative
impact on its functionality.

Sobel<float><<<grid, block>>>(pDstImage, pSrcImage, imgWidth, imgHeight);
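
Note that the template keeps all arithmetic in FLOAT_T because every floating point literal in
the kernel is cast or converted to FLOAT_T before it is used in arithmetic. An uncast double
literal would silently reintroduce FP64 instructions even when FLOAT_T is float, as the
illustrative comparison below shows (not taken from the sample source):

sumX = sumX / 9.0;           // the double literal promotes the division to FP64
sumX = sumX / (FLOAT_T)9.;   // stays in the kernel's chosen precision, as in the sample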

Profile the updated kernel


The kernel duration is reduced from 628.03 microseconds to 31.87 microseconds. We
can set the initial version of the kernel as a baseline and compare the profiling results.


We can confirm from the Compute Workload Analysis section that no pipeline has a
high utilization.

It shows a message that the pipe utilization is balanced and that the ALU is now the highest-
utilized pipeline (57.65%). From the pipeline utilization chart we see that the FP64
pipeline utilization is reduced from 86.44% to 0% and the single precision FMA pipeline
utilization has increased from 0.89% to 33.25%.

Chapter 6.
RESOURCES

‣ Instruction Optimization section of the CUDA C++ Best Practices Guide


‣ Nsight Compute Documentation

Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.

Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.

Copyright
© 2023-2024 NVIDIA Corporation and affiliates. All rights reserved.

This product includes software developed by the Syncro Soft SRL (http://
www.sync.ro/).
