Data Parallel C++
Programming Accelerated Systems Using C++ and SYCL
Second Edition

James Reinders
Ben Ashbaugh
James Brodman
Michael Kinsner
John Pennycook
Xinmin Tian
Foreword by Erik Lindahl, GROMACS and Stockholm University

Data Parallel C++: Programming Accelerated Systems Using C++ and SYCL, Second Edition
James Reinders, Beaverton, OR, USA
Ben Ashbaugh, Folsom, CA, USA
James Brodman, Marlborough, MA, USA
Michael Kinsner, Halifax, NS, Canada
John Pennycook, San Jose, CA, USA
Xinmin Tian, Fremont, CA, USA

ISBN-13 (pbk): 978-1-4842-9690-5
ISBN-13 (electronic): 978-1-4842-9691-2

https://doi.org/10.1007/978-1-4842-9691-2

Copyright © 2023 by Intel Corporation

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on
microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0
International License (https://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes
were made.
The images or other third party material in this book are included in the book’s Creative Commons license, unless indicated
otherwise in a credit line to the material. If material is not included in the book’s Creative Commons license and your intended
use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of
a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the
trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such,
is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Intel, the Intel logo, Intel Optane, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. OpenCL
and the OpenCL logo are trademarks of Apple Inc. in the U.S. and/or other countries. OpenMP and the OpenMP logo are
trademarks of the OpenMP Architecture Review Board in the U.S. and/or other countries. SYCL, the SYCL logo, Khronos and
the Khronos Group logo are trademarks of the Khronos Group Inc. The open source DPC++ compiler is based on a published
Khronos SYCL specification. The current conformance status of SYCL implementations can be found at
https://www.khronos.org/conformance/adopters/conformant-products/sycl.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests are measured using specific computer systems, components, software, operations and functions. Any change
to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you
in fully evaluating your contemplated purchases, including the performance of that product when combined with other
products. For more complete information visit https://www.intel.com/benchmarks. Performance results are based on testing
as of dates shown in configuration and may not reflect all publicly available security updates. See configuration disclosure for
details. No product or component can be absolutely secure. Intel technologies’ features and benefits depend on system
configuration and may require enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at
www.intel.com.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the
authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The
publisher makes no warranty, express or implied, with respect to the material contained herein.
Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Susan McDermot
Development Editor: James Markham
Coordinating Editor: Jessica Vakili
Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 NY Plaza, New York, NY 10004.
Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit https://www.springeronline.com.
Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM
Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected]; for reprint, paperback, or audio rights,
please e-mail [email protected].
Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also
available for most titles. For more information, reference our Print and eBook Bulk Sales web page at
https://www.apress.com/bulk-sales.
Any source code or other supplementary material referenced by the author in this book is available to readers on the GitHub
repository: https://github.com/Apress/Data-Parallel-CPP. For more detailed information, please visit
https://www.apress.com/gp/services/source-code.
Table of Contents
About the Authors
Preface
Foreword
Acknowledgments

Chapter 1: Introduction
Read the Book, Not the Spec
SYCL 2020 and DPC++
Why Not CUDA?
Why Standard C++ with SYCL?
Getting a C++ Compiler with SYCL Support
Hello, World! and a SYCL Program Dissection
Queues and Actions
It Is All About Parallelism
Throughput
Latency
Think Parallel
Amdahl and Gustafson
Scaling
Heterogeneous Systems
Data-Parallel Programming
Key Attributes of C++ with SYCL
Single-Source
Host
Devices
Kernel Code
Asynchronous Execution
Race Conditions When We Make a Mistake
Deadlock
C++ Lambda Expressions
Functional Portability and Performance Portability
Concurrency vs. Parallelism
Summary

Chapter 2: Where Code Executes
Single-Source
Host Code
Device Code
Choosing Devices
Method#1: Run on a Device of Any Type
Queues
Binding a Queue to a Device When Any Device Will Do
Method#2: Using a CPU Device for Development, Debugging, and Deployment
Method#3: Using a GPU (or Other Accelerators)
Accelerator Devices
Device Selectors
Method#4: Using Multiple Devices
Method#5: Custom (Very Specific) Device Selection
Selection Based on Device Aspects
Selection Through a Custom Selector
Creating Work on a Device
Introducing the Task Graph
Where Is the Device Code?
Actions
Host tasks
Summary

Chapter 3: Data Management
Introduction
The Data Management Problem
Device Local vs. Device Remote
Managing Multiple Memories
Explicit Data Movement
Implicit Data Movement
Selecting the Right Strategy
USM, Buffers, and Images
Unified Shared Memory
Accessing Memory Through Pointers
USM and Data Movement
Buffers
Creating Buffers
Accessing Buffers
Access Modes
Ordering the Uses of Data
In-order Queues
Out-of-Order Queues
Choosing a Data Management Strategy
Handler Class: Key Members
Summary

Chapter 4: Expressing Parallelism
Parallelism Within Kernels
Loops vs. Kernels
Multidimensional Kernels
Overview of Language Features
Separating Kernels from Host Code
Different Forms of Parallel Kernels
Basic Data-Parallel Kernels
Understanding Basic Data-Parallel Kernels
Writing Basic Data-Parallel Kernels
Details of Basic Data-Parallel Kernels
Explicit ND-Range Kernels
Understanding Explicit ND-Range Parallel Kernels
Writing Explicit ND-Range Data-Parallel Kernels
Details of Explicit ND-Range Data-Parallel Kernels
Mapping Computation to Work-Items
One-to-One Mapping
Many-to-One Mapping
Choosing a Kernel Form
Summary

Chapter 5: Error Handling
Safety First
Types of Errors
Let’s Create Some Errors!
Synchronous Error
Asynchronous Error
Application Error Handling Strategy
Ignoring Error Handling
Synchronous Error Handling
Asynchronous Error Handling
The Asynchronous Handler
Invocation of the Handler
Errors on a Device
Summary

Chapter 6: Unified Shared Memory
Why Should We Use USM?
Allocation Types
Device Allocations
Host Allocations
Shared Allocations
Allocating Memory
What Do We Need to Know?
Multiple Styles
Deallocating Memory
Allocation Example
Data Management
Initialization
Data Movement
Queries
One More Thing
Summary

Chapter 7: Buffers
Buffers
Buffer Creation
What Can We Do with a Buffer?
Accessors
Accessor Creation
What Can We Do with an Accessor?
Summary

Chapter 8: Scheduling Kernels and Data Movement
What Is Graph Scheduling?
How Graphs Work in SYCL
Command Group Actions
How Command Groups Declare Dependences
Examples
When Are the Parts of a Command Group Executed?
Data Movement
Explicit Data Movement
Implicit Data Movement
Synchronizing with the Host
Summary

Chapter 9: Communication and Synchronization
Work-Groups and Work-Items
Building Blocks for Efficient Communication
Synchronization via Barriers
Work-Group Local Memory
Using Work-Group Barriers and Local Memory
Work-Group Barriers and Local Memory in ND-Range Kernels
Sub-Groups
Synchronization via Sub-Group Barriers
Exchanging Data Within a Sub-Group
A Full Sub-Group ND-Range Kernel Example
Group Functions and Group Algorithms
Broadcast
Votes
Shuffles
Summary

Chapter 10: Defining Kernels
Why Three Ways to Represent a Kernel?
Kernels as Lambda Expressions
Elements of a Kernel Lambda Expression
Identifying Kernel Lambda Expressions
Kernels as Named Function Objects
Elements of a Kernel Named Function Object
Kernels in Kernel Bundles
Interoperability with Other APIs
Summary

Chapter 11: Vectors and Math Arrays
The Ambiguity of Vector Types
Our Mental Model for SYCL Vector Types
Math Array (marray)
Vector (vec)
Loads and Stores
Interoperability with Backend-Native Vector Types
Swizzle Operations
How Vector Types Execute
Vectors as Convenience Types
Vectors as SIMD Types
Summary

Chapter 12: Device Information and Kernel Specialization
Is There a GPU Present?
Refining Kernel Code to Be More Prescriptive
How to Enumerate Devices and Capabilities
Aspects
Custom Device Selector
Being Curious: get_info<>
Being More Curious: Detailed Enumeration Code
Very Curious: get_info plus has()
Device Information Descriptors
Device-Specific Kernel Information Descriptors
The Specifics: Those of “Correctness”
Device Queries
Kernel Queries
The Specifics: Those of “Tuning/Optimization”
Device Queries
Kernel Queries
Runtime vs. Compile-Time Properties
Kernel Specialization
Summary

Chapter 13: Practical Tips
Getting the Code Samples and a Compiler
Online Resources
Platform Model
Multiarchitecture Binaries
Compilation Model
Contexts: Important Things to Know
Adding SYCL to Existing C++ Programs
Considerations When Using Multiple Compilers
Debugging
Debugging Deadlock and Other Synchronization Issues
Debugging Kernel Code
Debugging Runtime Failures
Queue Profiling and Resulting Timing Capabilities
Tracing and Profiling Tools Interfaces
Initializing Data and Accessing Kernel Outputs
Multiple Translation Units
Performance Implication of Multiple Translation Units
When Anonymous Lambdas Need Names
Summary

Chapter 14: Common Parallel Patterns
Understanding the Patterns
Map
Stencil
Reduction
Scan
Pack and Unpack
Using Built-In Functions and Libraries
The SYCL Reduction Library
Group Algorithms
Direct Programming
Map
Stencil
Reduction
Scan
Pack and Unpack
Summary
For More Information

Chapter 15: Programming for GPUs
Performance Caveats
How GPUs Work
GPU Building Blocks
Simpler Processors (but More of Them)
Simplified Control Logic (SIMD Instructions)
Switching Work to Hide Latency
Offloading Kernels to GPUs
SYCL Runtime Library
GPU Software Drivers
GPU Hardware
Beware the Cost of Offloading!
GPU Kernel Best Practices
Accessing Global Memory
Accessing Work-Group Local Memory
Avoiding Local Memory Entirely with Sub-Groups
Optimizing Computation Using Small Data Types
Optimizing Math Functions
Specialized Functions and Extensions
Summary
For More Information

Chapter 16: Programming for CPUs
Performance Caveats
The Basics of Multicore CPUs
The Basics of SIMD Hardware
Exploiting Thread-Level Parallelism
Thread Affinity Insight
Be Mindful of First Touch to Memory
SIMD Vectorization on CPU
Ensure SIMD Execution Legality
SIMD Masking and Cost
Avoid Array of Struct for SIMD Efficiency
Data Type Impact on SIMD Efficiency
SIMD Execution Using single_task
Summary

Chapter 17: Programming for FPGAs
Performance Caveats
How to Think About FPGAs
Pipeline Parallelism
Kernels Consume Chip “Area”
When to Use an FPGA
Lots and Lots of Work
Custom Operations or Operation Widths
Scalar Data Flow
Low Latency and Rich Connectivity
Customized Memory Systems
Running on an FPGA
Compile Times
The FPGA Emulator
FPGA Hardware Compilation Occurs “Ahead-of-Time”
Writing Kernels for FPGAs
Exposing Parallelism
Keeping the Pipeline Busy Using ND-Ranges
Pipelines Do Not Mind Data Dependences!
Spatial Pipeline Implementation of a Loop
Loop Initiation Interval
Pipes
Custom Memory Systems
Some Closing Topics
FPGA Building Blocks
Clock Frequency
Summary

Chapter 18: Libraries
Built-In Functions
Use the sycl:: Prefix with Built-In Functions
The C++ Standard Library
oneAPI DPC++ Library (oneDPL)
SYCL Execution Policy
Using oneDPL with Buffers
Using oneDPL with USM
Error Handling with SYCL Execution Policies
Summary

Chapter 19: Memory Model and Atomics
What’s in a Memory Model?
Data Races and Synchronization
Barriers and Fences
Atomic Operations
Memory Ordering
The Memory Model
The memory_order Enumeration Class
The memory_scope Enumeration Class
Querying Device Capabilities
Barriers and Fences
Atomic Operations in SYCL
Using Atomics with Buffers
Using Atomics with Unified Shared Memory
Using Atomics in Real Life
Computing a Histogram
Implementing Device-Wide Synchronization
Summary
For More Information

Chapter 20: Backend Interoperability
What Is Backend Interoperability?
When Is Backend Interoperability Useful?
Adding SYCL to an Existing Codebase
Using Existing Libraries with SYCL
Using Backend Interoperability for Kernels
Interoperability with API-Defined Kernel Objects
Interoperability with Non-SYCL Source Languages
Backend Interoperability Hints and Tips
Choosing a Device for a Specific Backend
Be Careful About Contexts!
Access Low-Level API-Specific Features
Support for Other Backends
Summary

Chapter 21: Migrating CUDA Code
Design Differences Between CUDA and SYCL
Multiple Targets vs. Single Device Targets
Aligning to C++ vs. Extending C++
Terminology Differences Between CUDA and SYCL
Similarities and Differences
Execution Model
Memory Model
Other Differences
Features in CUDA That Aren’t In SYCL… Yet!
Global Variables
Cooperative Groups
Matrix Multiplication Hardware
Porting Tools and Techniques
Migrating Code with dpct and SYCLomatic
Summary
For More Information

Epilogue: Future Direction of SYCL

Index

About the Authors
James Reinders is an Engineer at Intel Corporation with more than four
decades of experience in parallel computing and is an author/coauthor/
editor of more than ten technical books related to parallel programming.
James has a passion for system optimization and teaching. He has had the
great fortune to help make contributions to several of the world’s fastest
computers (#1 on the TOP500 list) as well as many other supercomputers
and software developer tools.

Ben Ashbaugh is a Software Architect at Intel Corporation where he has
worked for over 20 years developing software drivers and compilers for
Intel graphics products. For the past ten years, Ben has focused on parallel
programming models for general-purpose computation on graphics
processors, including SYCL and the DPC++ compiler. Ben is active in the
Khronos SYCL, OpenCL, and SPIR working groups, helping define industry
standards for parallel programming, and he has authored numerous
extensions to expose unique Intel GPU features.

James Brodman is a Principal Engineer at Intel Corporation working on
runtimes and compilers for parallel programming, and he is one of the
architects of DPC++. James has a Ph.D. in Computer Science from the
University of Illinois at Urbana-Champaign.

Michael Kinsner is a Principal Engineer at Intel Corporation developing
parallel programming languages and compilers for a variety of
architectures. Michael contributes extensively to spatial architectures and
programming models and is an Intel representative within The Khronos
Group where he works on the SYCL and OpenCL industry standards for
parallel programming. Mike has a Ph.D. in Computer Engineering from
McMaster University and is passionate about programming models that
cross architectures while still enabling performance.

John Pennycook is a Software Enabling and Optimization Architect
at Intel Corporation, focused on enabling developers to fully utilize
the parallelism available in modern processors. John is experienced
in optimizing and parallelizing applications from a range of scientific
domains, and previously served as Intel’s representative on the steering
committee for the Intel eXtreme Performance User’s Group (IXPUG).
John has a Ph.D. in Computer Science from the University of Warwick.
His research interests are varied, but a recurring theme is the ability to
achieve application “performance portability” across different hardware
architectures.

Xinmin Tian is an Intel Fellow and Compiler Architect at Intel Corporation
and serves as Intel’s representative on the OpenMP Architecture Review
Board (ARB). Xinmin has been driving OpenMP offloading, vectorization,
and parallelization compiler technologies for Intel architectures. His
current focus is on LLVM-based OpenMP offloading, SYCL/DPC++
compiler optimizations for CPUs/GPUs, and tuning HPC/AI application
performance. He has a Ph.D. in Computer Science from Tsinghua
University, holds 27 US patents, has published over 60 technical papers
with more than 1,300 citations of his work, and has coauthored two books that
span his expertise.

Preface
If you are new to parallel programming, that is okay. If you have never
heard of SYCL or the DPC++ compiler, that is also okay.
Compared with programming in CUDA, C++ with SYCL offers
portability beyond NVIDIA, and portability beyond GPUs, plus a tight
alignment to enhance modern C++ as it evolves too. C++ with SYCL offers
these advantages without sacrificing performance.
C++ with SYCL allows us to accelerate our applications by harnessing
the combined capabilities of CPUs, GPUs, FPGAs, and processing devices
of the future without being tied to any one vendor.
SYCL is an industry-driven Khronos Group standard adding
advanced support for data parallelism with C++ to exploit accelerated
(heterogeneous) systems. SYCL provides mechanisms for C++ compilers
that are highly synergistic with C++ and C++ build systems. DPC++ is an
open source compiler project based on LLVM that adds SYCL support.
All examples in this book should work with any C++ compiler supporting
SYCL 2020 including the DPC++ compiler.
If you are a C programmer who is not well versed in C++, you are in
good company. Several of the authors of this book happily share that
they picked up much of C++ by reading books that utilized C++ like this
one. With a little patience, this book should also be approachable by C
programmers with a desire to write modern C++ programs.

Second Edition
With the benefit of feedback from a growing community of SYCL users, we
have been able to add content to help learn SYCL better than ever.

This edition teaches C++ with SYCL 2020. The first edition preceded
the SYCL 2020 specification, which differed only slightly from what the
first edition taught (the most obvious changes for SYCL 2020 in this edition
are the header file location, the device selector syntax, and dropping an
explicit host device).

Important resources for updated SYCL information, including any
known book errata, include the book GitHub
(https://github.com/Apress/data-parallel-CPP), the Khronos Group SYCL
standards website (www.khronos.org/sycl), and a key SYCL
education website (https://sycl.tech).

Chapters 20 and 21 are additions encouraged by readers of the first
edition of this book.
We added Chapter 20 to discuss backend interoperability. One of
the key goals of the SYCL 2020 standard is to enable broad support for
hardware from many vendors with many architectures. This required
expanding beyond the OpenCL-only backend support of SYCL 1.2.1. While
generally “it just works,” Chapter 20 explains this in more detail for those
who find it valuable to understand and interface at this level.
For experienced CUDA programmers, we have added Chapter 21 to
explicitly connect C++ with SYCL concepts to CUDA concepts both in
terms of approach and vocabulary. While the core issues of expressing
heterogeneous parallelism are fundamentally similar, C++ with SYCL offers
many benefits because of its multivendor and multiarchitecture approach.
Chapter 21 is the only place we mention CUDA terminology; the rest of this
book teaches using C++ and SYCL terminology with its open multivendor,
multiarchitecture approaches. In Chapter 21, we strongly encourage
looking at the open source tool “SYCLomatic” (github.com/oneapi-src/SYCLomatic),
which helps automate migration of CUDA code. Because it
is helpful, we recommend it as the preferred first step in migrating code.
Developers using C++ with SYCL have been reporting strong results on
NVIDIA, AMD, and Intel GPUs, both for code ported from CUDA and for
code originally written in C++ with SYCL. The resulting C++ with SYCL
offers portability that is not possible with NVIDIA CUDA.
The evolution of C++, SYCL, and compilers including DPC++
continues. Prospects for the future are discussed in the Epilogue, after
we have taken a journey together to learn how to create programs for
heterogeneous systems using C++ with SYCL.
It is our hope that this book supports and helps grow the SYCL
community and helps promote data-parallel programming in C++
with SYCL.

Structure of This Book

This book takes us on a journey through what we need to know to be an
effective programmer of accelerated/heterogeneous systems using C++
with SYCL.

Chapters 1–4: Lay Foundations

Chapters 1–4 are important to read in order when first approaching C++
with SYCL.
Chapter 1 lays the first foundation by covering core concepts that are
either new or worth refreshing in our minds.
Chapters 2–4 lay a foundation of understanding for data-parallel
programming in C++ with SYCL. When we finish reading Chapters 1–4,
we will have a solid foundation for data-parallel programming in C++.
Chapters 1–4 build on each other and are best read in order.

Chapters 5–12: Build on Foundations

With the foundations established, Chapters 5–12 fill in vital details by
building on each other to some degree while being easy to jump between
as desired. All these chapters should be valuable to all users of C++
with SYCL.

Chapters 13–21: Tips/Advice for SYCL in Practice

These final chapters offer advice and details for specific needs. We
encourage at least skimming them all to find content that is important to
your needs.

Epilogue: Speculating on the Future

The book concludes with an Epilogue that discusses likely and potential
future directions for C++ with SYCL, and the Data Parallel C++ compiler
for SYCL.
We wish you the best as you learn to use C++ with SYCL.

Foreword
SYCL 2020 is a milestone in parallel computing. For the first time we have
a modern, stable, feature-complete, and portable open standard that can
target all types of hardware, and the book you hold in your hand is the
premier resource to learn SYCL 2020.
Computer hardware development is driven by our needs to solve
larger and more complex problems, but those hardware advances are
largely useless unless programmers like you and me have languages that
allow us to implement our ideas and exploit the power available with
reasonable effort. There are numerous examples of amazing hardware,
and the first solutions to use them have often been proprietary since it
saves time not having to bother with committees agreeing on standards.
However, in the history of computing, they have eventually always ended
up as vendor lock-in—unable to compete with open standards that allow
developers to target any hardware and share code—because ultimately the
resources of the worldwide community and ecosystem are far greater than
any individual vendor, not to mention how open software standards drive
hardware competition.
Over the last few years, my team has had the tremendous privilege
of contributing to shaping the emerging SYCL ecosystem through our
development of GROMACS, one of the world’s most widely used scientific
HPC codes. We need our code to run on every supercomputer in the
world as well as our laptops. While we cannot afford to lose performance,
we also depend on being part of a larger community where other teams
invest effort in libraries we depend on, where there are open compilers
available, and where we can recruit talent. Since the first edition of this
book, SYCL has matured into such a community; in addition to several
vendor-provided compilers, we now have a major community-driven
implementation[1] that targets all hardware, and there are thousands of
developers worldwide sharing experiences, contributing to training
events, and participating in forums. The outstanding power of open
source—whether it is an application, a compiler, or an open standard—is
that we can peek under the hood to learn, borrow, and extend. Just as we
repeatedly learn from the code in the Intel-led LLVM implementation,2
the community-driven implementation from Heidelberg University, and
several other codes, you can use our public repository3 to compare CUDA
and SYCL implementations in a large production codebase or borrow
solutions for your needs—because when you do so, you are helping to
further extend our community.
Perhaps surprisingly, data-parallel programming as a paradigm is
arguably far easier than classical solutions such as message-passing
communication or explicit multithreading—but it poses special challenges
to those of us who have spent decades in the old paradigms that focus on
hardware and explicit data placement. On a small scale, it was trivial for
us to explicitly decide how data is moved between a handful of processes,
but as the problem scales to thousands of units, it becomes a nightmare to
manage the complexity without introducing bugs or having the hardware
sit idle waiting for data. Data-parallel programming with SYCL solves
this by striking the balance of primarily asking us to explicitly express the
inherent parallelism of our algorithm, but once we have done that, the
compiler and drivers will mostly handle the data locality and scheduling
over tens of thousands of functional units. To be successful in data-parallel
programming, it is important not to think of a computer as a single unit
executing one program, but as a collection of units working independently

1 Community-driven implementation from Heidelberg University: tinyurl.com/HeidelbergSYCL
2 DPC++ compiler project: github.com/intel/llvm
3 GROMACS: gitlab.com/gromacs/gromacs/

to solve parts of a large problem. As long as we can express our problem as
an algorithm where each part does not have dependencies on other parts,
it is in theory straightforward to implement it, for example, as a parallel
for-loop that is executed on a GPU through a device queue. However, for
more practical examples, our problems are frequently not large enough
to use an entire device efficiently, or we depend on performing tens of
thousands of iterations per second where latency in device drivers starts
to be a major bottleneck. While this book is an outstanding introduction to
performance-portable GPU programming, it goes far beyond this to show
how both throughput and latency matter for real-world applications and how
SYCL can be used to exploit the unique features of CPUs, GPUs, SIMD
units, and FPGAs. It also covers the caveat that, for good performance,
we need to understand the characteristics of each type of hardware and
possibly adapt our code to them. In doing so, it is not merely a great tutorial
on data-parallel programming, but an authoritative text that anybody interested
in programming modern computer hardware in general should read.
One of SYCL’s key strengths is the close alignment to modern C++.
This can seem daunting at first; C++ is not an easy language to fully master
(I certainly have not), but Reinders and coauthors take our hand and lead
us on a path where we only need to learn a handful of C++ concepts to get
started and be productive in actual data-parallel programming. However,
as we become more experienced, SYCL 2020 allows us to combine this
with the extreme generality of C++17 to write code that can be dynamically
targeted to different devices, or that relies on heterogeneous parallelism that
uses CPU, GPU, and network units in parallel for different tasks. SYCL is
not a separate bolted-on solution to enable accelerators but instead holds
great promise to be the general way we express data parallelism in C++.
The SYCL 2020 standard now includes several features previously only
available as vendor extensions, for example, Unified Shared Memory,
sub-groups, atomic operations, reductions, simpler accessors, and many
other concepts that make code cleaner and facilitate both development
and porting from standard C++17 or CUDA so that your code can target
more diverse hardware. This book provides a wonderful and accessible
introduction to all of them, and you will also learn how SYCL is expected to
evolve together with the rapid development C++ is undergoing.
This all sounds great in theory, but how portable is SYCL in practice?
Our application is an example of a codebase that is quite challenging to
optimize since data access patterns are random, the amount of data to
process in each step is limited, we need to achieve thousands of iterations
per second, and we are limited by memory bandwidth, floating-point,
and integer operations—it is an extreme opposite of a simple data-parallel
problem. We spent over two decades writing assembly SIMD instructions
and native implementations for several GPU architectures, and our
very first encounters with SYCL involved both pains with adapting to
differences and reporting performance regressions to driver and compiler
developers. However, as of spring 2023, our SYCL kernels can achieve
80–100% of native performance on all GPU architectures not only from a
single codebase but even a single precompiled binary.
SYCL is still young and has a rapidly evolving ecosystem. There are
a few things not yet part of the language, but SYCL is unique as the only
performance-portable standard available that successfully targets all
modern hardware. Whether you are a beginner wanting to learn parallel
programming, an experienced developer interested in data-parallel
programming, or a maintainer needing to port 100,000 lines of proprietary
API code to an open standard, this second edition is the only book you will
need to become part of this community.

Erik Lindahl
Professor of Biophysics
Dept. Biophysics & Biochemistry
Science for Life Laboratory
Stockholm University

Acknowledgments
We have been blessed with an outpouring of community input for this
second edition of our book. Much inspiration came from interactions with
developers as they use SYCL in production, classes, tutorials, workshops,
conferences, and hackathons. SYCL deployments that include NVIDIA
hardware, in particular, have helped us enhance the inclusiveness and
practical tips in our teaching of SYCL in this second edition.
The SYCL community has grown a great deal—and consists of
engineers implementing compilers and tools, and a much larger group of
users that adopt SYCL to target hardware of many types and vendors. We
are grateful for their hard work and shared insights.
We thank the Khronos SYCL Working Group that has worked diligently
to produce a highly functional specification. In particular, Ronan Keryell
has been the SYCL specification editor and a longtime vocal advocate
for SYCL.
We are indebted to the numerous people in the SYCL community who gave
us feedback in all these ways. We are also deeply grateful to
those who helped with the first edition a few years ago, many of whom we
named in the acknowledgments of that edition.
The first edition received feedback via GitHub,1 which we did review
but we were not always prompt in acknowledging (imagine six coauthors
all thinking “you did that, right?”). We did benefit a great deal from that
feedback, and we believe we have addressed all the feedback in the
samples and text for this edition. Jay Norwood was the most prolific at
commenting and helping us—a big thank you to Jay from all the authors!

1 github.com/apress/data-parallel-CPP

Other feedback contributors include Oscar Barenys, Marcel Breyer, Jeff
Donner, Laurence Field, Michael Firth, Piotr Fusik, Vincent Mierlak, and
Jason Mooneyham. Regardless of whether we recalled your name here or
not, we thank everyone who has provided feedback and helped refine our
teaching of C++ with SYCL.
For this edition, a handful of volunteers tirelessly read draft
manuscripts and provided insightful feedback for which we are incredibly
grateful. These reviewers include Aharon Abramson, Thomas Applencourt,
Rod Burns, Joe Curley, Jessica Davies, Henry Gabb, Zheming Jin, Rakshith
Krishnappa, Praveen Kundurthy, Tim Lewis, Erik Lindahl, Gregory Lueck,
Tony Mongkolsmai, Ruyman Reyes Castro, Andrew Richards, Sanjiv Shah,
Neil Trevett, and Georg Viehöver.
We all enjoy the support of our family and friends, and we cannot
thank them enough. As coauthors, we have enjoyed working as a team,
challenging each other and learning together along the way. We appreciate
our collaboration with the entire Apress team in getting this book
published.
We are sure that there are more than a few people whom we have failed
to mention explicitly who have positively impacted this book project. We
thank all who helped us.
As you read this second edition, please do provide feedback if you find
any way to improve it. Feedback via GitHub can open up a conversation,
and we will update the online errata and book samples as needed.
Thank you all, and we hope you find this book invaluable in your
endeavors.
