Vulkan 101

Tom Olson
Director, Graphics Research, ARM
Chair, Vulkan Working Group
© Copyright Khronos Group 2016 - Page 8
What is Vulkan?
• A 3D graphics API for the next 20 years
- Logical successor to OpenGL / OpenGL ES
- Modern, efficient design
- An open, industry-controlled standard

• Here, now
- Released in February 2016
- Available today for Windows / Linux
- Shipping in Samsung Galaxy S7
- Support announced in Android ‘N’

• Different!
- Fundamental change in philosophy
- Requires corresponding changes in applications
Why did we do this?
• Traditional APIs had issues…
• Developers weren’t happy

http://www.joshbarczak.com/blog/?p=154

http://richg42.blogspot.com/2014/05/things-that-drive-me-nuts-about-opengl.html


Problems with OpenGL / OpenGL ES
• Programming model doesn’t match GPU HW
- Especially in mobile
- Driver magic hides the mismatch

• CPU intensive
- Lots of state validation, dependency tracking

• Complex, buggy, unpredictable drivers


- Different bugs and fast-paths on every GPU

• Fundamentally single-threaded
- Can’t use multi-core CPUs effectively

• …not to mention twenty years of legacy cruft


Enter Vulkan…
• Design discussions start in October 2012

• Moves into high gear in July/August 2014


- Commitment from key ISVs
- AMD donation of Mantle

• A lot of very hard work follows…

• Release to public in February 2016


- Conformant drivers from four IHVs
- GLSL to SPIR-V compiler
- Debug and validation tools



Vulkan in one slide

Resources (textures, buffers)

Memory
Instance Device
Queues

Command Buffers



Vulkan in one slide two slides
[Diagram: a command buffer holding copy and sync commands and a render pass, together with draw calls, pipelines, shaders, and descriptor sets. Regions of the diagram are labelled with the speakers who cover them: Andrew, Neil / Hans-Kristian, Tobias, Michael, and Jesse.]


The principle of Explicit Control
• You promise to tell the driver
- What you are going to do
- In sufficient detail that it doesn’t have to guess
- When the driver needs to know it

OpenGL lets you specify important information very late, and change it at any time. It’s convenient, but has huge performance costs.

• In return, driver promises to do
- What you asked for
- When you asked for it
- Very quickly

OpenGL drivers often defer work until later, move it to another thread, or even ignore your commands, based on guesses about your intent. Vulkan drivers won’t.

• No driver magic!


Loader, layers, and extensions
• Vulkan has no dependencies on external APIs
- ICD loader is built-in
- Window system binding is (semi) built-in

• A side benefit: Layers


- Loader can install intercept libraries (“layers”)
- E.g. trace, debug

• Extensions
- Must be enabled at initialization time



Multithreading
• All objects visible / accessible to all threads

• Most operations are externally synchronized


- Application must prevent unsafe concurrent access
- E.g., recording to the same command buffer
- E.g., submitting to the same queue
- Application must manage object lifetimes
- Note, many objects are immutable
- Concurrent read access is OK

• Allocation / creation are internally synchronized and may block


- Per-thread pool allocators keep this reasonably cheap
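The "externally synchronized" rule above can be sketched without any Vulkan code. The example below is only an illustration under stated assumptions: `FakeCommandBuffer` and `FakeQueue` are hypothetical stand-ins, not Vulkan types. Each thread records into its own command buffer, so recording needs no lock, while submissions to the shared queue are serialized by a mutex that the application owns.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Vulkan types.
struct FakeCommandBuffer {
    std::vector<int> commands;  // recorded "commands"
};

struct FakeQueue {
    std::mutex lock;             // owned by the application, not the driver
    std::vector<int> submitted;  // everything submitted so far

    void submit(const FakeCommandBuffer &cb) {
        // Submitting to the same queue from several threads must be serialized.
        std::lock_guard<std::mutex> guard(lock);
        submitted.insert(submitted.end(), cb.commands.begin(), cb.commands.end());
    }
};

// Each worker records into its own command buffer (nothing is shared, so no
// lock is needed) and takes the queue lock only for the submit itself.
std::size_t record_and_submit(FakeQueue &queue, int thread_count, int commands_per_thread) {
    std::vector<FakeCommandBuffer> buffers(static_cast<std::size_t>(thread_count));
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&queue, &buffers, t, commands_per_thread] {
            for (int i = 0; i < commands_per_thread; ++i)
                buffers[static_cast<std::size_t>(t)].commands.push_back(t * commands_per_thread + i);
            queue.submit(buffers[static_cast<std::size_t>(t)]);
        });
    }
    for (auto &w : workers)
        w.join();
    return queue.submitted.size();
}
```

Calling `record_and_submit(queue, 4, 100)` records on four threads in parallel and contends only on the short submit step, which is the pattern the slide recommends.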



Error handling
• Vulkan is optimized for correct applications
- Does not (generally) check for invalid usage
- Does not track dependencies
- Does not (generally) provide thread safety
- Breaking the rules results in undefined behavior

• Vulkan does check for errors you can’t predict


- Out of memory
- Device lost
- Other system errors…

• Layers to the rescue!


- Can enable validation layers during development



Community
• A new attitude
- ISV member input drove key decisions
- Consulted with hundreds of developers

• Strong commitment to open source


- Loader
- Validation and other layers
- SPIR-V tools: compiler, validator, …
- Conformance tests
- Specification

• All at https://github.com/KhronosGroup


Should you be using Vulkan?
• Challenges
- Verbose and complex
- Lots of exposed sharp edges
- Lots to learn

• Opportunities
- Much lower driver overhead
- …which you can spread across multiple threads
- More predictable performance
- Mobile friendly

• Realities
- Ecosystem is still immature
- Will need to ship GL/DX versions for years to come



Command Buffers and Pipelines
Michael Worcester – Driver Engineer
([email protected])
26 May 2016 www.imgtec.com
Command Buffers – Deferring the work

• OpenGL is immediate (ignoring display lists)
- Driver does not know how much work is incoming
- Has to guess
- Bad!
• Vulkan splits recording of work from submission of work
- Removes guesswork from driver
- Reducing hitching
- Helps eliminate unexplained resource usage

© Imagination Technologies
Command Buffers – Pooling Resource
• Command Buffers always belong to a Command Pool
- Buffers are allocated from pools
- Pools provide lightweight synchronisation
- Pools can be reset, reclaiming all resources
• Two flavours of pool:
- Individual reset of command buffers
- Group reset only
Command Buffers – Going wide

Single Thread OpenGL Context

Thread 1 VkCommandBuffer

Thread 2 VkCommandBuffer

Thread N VkCommandBuffer

Command Buffers – Command Types

• Deferred recording of commands
- Transfer
- Graphics
- Compute
- Synchronisation
Command Buffers – Transfers

• Transfer commands are raw copies
- However, they can change the tiling of an image (this is the only way!)
• CPU -> GPU
- Texture upload
- Static buffer data
• GPU -> CPU
- Read back of data
• GPU -> GPU
- Pipelined updates of data
- Mipgen
Command Buffers – “Inside” or “Out”

Transfer Compute RenderPass Compute


Graphics Graphics Graphics

Dispatch BindPipeline BindDescriptors BeginRenderPass PushConstants Draw

Command Buffers – Secondaries

Primary Transfer Compute RenderPass Compute


ExecuteCommands ExecuteCommands

Secondaries BindPipeline BindDescriptors Draw BindPipeline BindDescriptors Draw Draw

Command Buffers – Reuse

Camera

Command Buffers – Lifetime
[Diagram: command buffer lifetime and ownership between CPU and GPU, with states and transitions labelled Allocated, Begin, Record, End, Submit, Pending, Active, and Wait.]
Pipelines - An anatomy

VI IA VS CS TS ES GS VP RS MS DS FS CB

• Fixed Function States
• Programmable Shaders
• Descriptor Layout
• Renderpass (more later)
• Dynamic State
Pipelines – Fixed Function States

VI IA VS CS TS ES GS VP RS MS DS FS CB

• Everything that isn’t a shader
- Buffer formats/layouts
• States:
- VertexInput
- InputAssembly
- Tessellation
- Viewport
- Raster
- Multisample
- DepthStencil
- ColorBlend
Pipelines – Shader Stages

VI IA VS CS TS ES GS VP RS MS DS FS CB

• Currently same as OpenGL
- Vertex
- Control
- Evaluation
- Geometry
- Fragment
• Note: Tessellation and Geometry are optional features
Pipelines – Descriptor Layout

• Describes the set of resources that a shader can access
- Uniforms
- Storage Buffers
- Images
- Samplers
- Push Constants
Pipelines – Dynamic State
• Per-draw state
• Tedious to compile each one
• Combinatorial explosion
• Dynamic state!
• Opt-in
• Only use when required

• Available dynamic states:
- Viewport
- Scissor
- Line Width
- Depth Bias
- Blend Constant Colour
- Depth Bounds
- Stencil Compare
- Stencil Write
- Stencil Reference
Pipelines – The Cache

• Share common state
• Load/Store
Introduction to SPIR-V Shaders
Neil Hickey
Compiler Engineer, ARM
SPIR History



SPIR-V Purpose

[Diagram: front ends that parse HLSL, GLSL, OpenCL C, ISPC, and static C++ each build a SPIR-V CFG; the CFG can be optimized and printed as SPIR-V, which the IHV compiler turns into a binary.]


Developer Ecosystem

• Multiple Developer Advantages:
- Same front-end compiler for multiple platforms
- Reduces runtime kernel compilation time
- Don’t have to ship shader/kernel source code
- Drivers are simpler and more reliable


Vulkan and OpenCL
• LLVM Interaction
- SPIR 1.2: Uses LLVM 3.2
- SPIR 2.0: Uses LLVM 3.4
- SPIR-V 1.0: 100% Khronos defined, round-trip lossless conversion
• Compute Constructs
- SPIR 1.2: Metadata/Intrinsics
- SPIR 2.0: Metadata/Intrinsics
- SPIR-V 1.0: Native
• Graphics Constructs
- SPIR 1.2: No
- SPIR 2.0: No
- SPIR-V 1.0: Native
• Supported Language Feature Sets
- SPIR 1.2: OpenCL C 1.2
- SPIR 2.0: OpenCL C 1.2, OpenCL C 2.0
- SPIR-V 1.0: OpenCL C 1.2 – 2.0, OpenCL C++ and GLSL
• OpenCL Ingestion
- SPIR 1.2: OpenCL C 1.2 Extension
- SPIR 2.0: OpenCL C 2.0 Extension
- SPIR-V 1.0: OpenCL 2.1 Core, OpenCL 1.2 / 2.0 Extensions
• Vulkan Ingestion
- SPIR 1.2: -
- SPIR 2.0: -
- SPIR-V 1.0: Vulkan 1.0 Core


Compiler flow
[Diagram: compiler flow. GLSL, OpenCL C, OpenCL C++, and third party kernel and shader languages are compiled to SPIR-V. The SPIR-V Tools comprise the SPIR-V Validator and the SPIR-V (Dis)Assembler, and a bi-directional LLVM to SPIR-V translator connects SPIR-V to LLVM and other intermediate forms. Khronos has open sourced some of these tools and translators and plans to open source others soon.]

SPIR-V
• 32-bit word stream
• Extensible and easily parsed
• Retains data object and control flow information for effective code generation and translation

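Because SPIR-V is a stream of 32-bit words, its fixed five-word header is easy to decode. The following is a minimal sketch, assuming the words are already in host byte order; the names `SpirvHeader` and `parse_spirv_header` are made up for this example and are not part of any Khronos library.

```cpp
#include <cstdint>
#include <vector>

// The five words that open every SPIR-V module.
struct SpirvHeader {
    uint32_t version_major;
    uint32_t version_minor;
    uint32_t generator;  // identifies the tool that produced the module
    uint32_t bound;      // every result <id> in the module is below this value
    uint32_t schema;
};

constexpr uint32_t kSpirvMagic = 0x07230203u;

// Fills `out` and returns true if `words` starts with a SPIR-V header.
// Returns false if the stream is too short or the magic number is wrong.
// Assumption: the words are in host byte order (no byte swapping is attempted).
bool parse_spirv_header(const std::vector<uint32_t> &words, SpirvHeader &out) {
    if (words.size() < 5 || words[0] != kSpirvMagic)
        return false;
    out.version_major = (words[1] >> 16) & 0xffu;  // version word layout: 0 | major | minor | 0
    out.version_minor = (words[1] >> 8) & 0xffu;
    out.generator = words[2];
    out.bound = words[3];
    out.schema = words[4];
    return true;
}
```

A disassembler prints the same fields as comments at the top of its output (version, generator, bound, schema).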


SPIR-V Capabilities
• OpenCL and Vulkan
• Capabilities define feature sets
• Separate capabilities for Vulkan shaders and OpenCL kernels
• Validation layer checks correct capabilities requested

OpCapability Addresses
OpCapability Linkage
OpCapability Kernel
OpCapability Vector16
OpCapability Int16


SPIR-V Extensions
• OpExtension
• New functionality
• New instructions
• New semantics

OpExtInstImport "OpenCL.std"


Vulkan shaders vs. GL shaders
• GL shaders
- Program GLSL/ESSL shaders in high level language
- Ship high level source with application
- Graphics drivers compile at runtime
- Each driver needs a full compilation tool chain

• Vulkan shaders
- Shaders in binary format
- Compile offline
- Ship intermediate language with application
- Graphics drivers “just” lower from IL
- Higher level compilation can be shared among vendors (provided by Khronos)


Vulkan shaders vs. GL shaders
GLSL source:

#version 310 es
precision mediump float;
uniform sampler2D s;
in vec2 texcoord;
out vec4 color;
void main()
{
color = texture(s, texcoord);
}

SPIR-V disassembly:

; SPIR-V
; Version: 1.0
; Generator: Khronos Glslang Reference Front End; 1
; Bound: 20
; Schema: 0
OpCapability Shader
%1 = OpExtInstImport "GLSL.std.450"
OpMemoryModel Logical GLSL450
OpEntryPoint Fragment %4 "main" %9 %17
OpExecutionMode %4 OriginUpperLeft
OpSource ESSL 310
OpName %4 "main"
OpName %9 "color"
OpName %13 "s"
OpName %17 "texcoord"
OpDecorate %9 RelaxedPrecision
OpDecorate %13 RelaxedPrecision
OpDecorate %13 DescriptorSet 0
OpDecorate %14 RelaxedPrecision
OpDecorate %17 RelaxedPrecision
OpDecorate %18 RelaxedPrecision
OpDecorate %19 RelaxedPrecision
%2 = OpTypeVoid
%3 = OpTypeFunction %2
%6 = OpTypeFloat 32
%7 = OpTypeVector %6 4
%8 = OpTypePointer Output %7
%9 = OpVariable %8 Output
%10 = OpTypeImage %6 2D 0 0 0 1 Unknown
%11 = OpTypeSampledImage %10
%12 = OpTypePointer UniformConstant %11
%13 = OpVariable %12 UniformConstant
%15 = OpTypeVector %6 2
%16 = OpTypePointer Input %15
%17 = OpVariable %16 Input
%4 = OpFunction %2 None %3
%5 = OpLabel
%14 = OpLoad %11 %13
%18 = OpLoad %15 %17
%19 = OpImageSampleImplicitLod %7 %14 %18
OpStore %9 %19
OpReturn
OpFunctionEnd


Khronos SPIR-V Tools
• Reference frontend (glslang)
glslangValidator -V -o shader.spv shader.frag

• SPIR-V disassembler (spirv-dis)
spirv-dis -o shader.spvasm shader.spv

• SPIR-V assembler (spirv-as)
spirv-as -o shader.spv shader.spvasm

• SPIR-V reflection (spirv-cross)
spirv-cross shader.spv


Vulkan shaders in a high level language

• GL_KHR_vulkan_glsl

• Exposes SPIR-V features

• Similar to GLSL with some changes

• Extends #version 140 and higher on desktop and #version 310 es for mobile content


Vulkan_glsl removed features
• Default uniforms

• Atomic-counter bindings

• Subroutines

• Packed block layouts



Vulkan_glsl new features
• Push constants

• Separate textures and samplers

• Descriptor sets

• Specialization constants

• Subpass inputs



Push Constants
• Push constants replace non-opaque uniforms
- Think of them as small, fast-access uniform buffer memory
• Update in Vulkan with vkCmdPushConstants
// New
layout(push_constant, std430) uniform PushConstants {
mat4 MVP;
vec4 MaterialData;
} RegisterMapped;

// Old, no longer supported in Vulkan GLSL
uniform mat4 MVP;
uniform vec4 MaterialData;

// Opaque uniform, still supported
uniform sampler2D sTexture;


Separate textures and samplers
• sampler contains just filtering information
• texture contains just image information
• combined in code at the point of texture lookup

uniform sampler s;
uniform texture2D t;
in vec2 texcoord;
...
void main()
{
fragColor = texture(sampler2D(t,s), texcoord);
}



Descriptor sets
• Bound objects can optionally define a descriptor set
• Allows bound objects to be updated in one block
• Allows objects in other descriptor sets to remain the same
• Enabled with the set = ... syntax in the layout specifier

layout(set = 0, binding = 0) uniform sampler s;


layout(set = 1, binding = 0) uniform texture2D t;



Specialization constants
• Allows for special constants to be created whose value is overridable at pipeline creation time.
• Can be used in expressions
• Can be combined with other constants to form new specialization constants
• Declared using layout(constant_id=...)
• Can have a default value if not overridden at runtime

layout(constant_id = 1) const int arraySize = 12;

vec4 data[arraySize];



Specialization constants(2)
• gl_WorkGroupSize can be specialized with values for the x,y and z component.

layout(local_size_x_id = 2, local_size_z_id = 3) in;

• These specialization constants can be set at pipeline creation time by using VkSpecializationInfo, whose map entries are described by VkSpecializationMapEntry

const VkSpecializationMapEntry entries[] =
{
{ 1, // constantID
0*sizeof(uint32_t), // offset
sizeof(uint32_t) // size
},
};


Specialization constants(3)
const uint32_t data[] = { 16};
const VkSpecializationInfo info =
{
1, // mapEntryCount
entries, // pMapEntries
1*sizeof(uint32_t), // dataSize
data, // pData
};
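How the map entries and the data blob fit together can be sketched without the Vulkan headers. In the example below, `MapEntry` and `SpecInfo` are hypothetical local mirrors of VkSpecializationMapEntry and VkSpecializationInfo, and `resolve_u32` is an invented helper that imitates the lookup a driver has to perform: find the entry for a constant ID, read the value at its offset, and fall back to the shader's default otherwise.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical mirrors of VkSpecializationMapEntry / VkSpecializationInfo,
// redeclared here so the sketch compiles without the Vulkan headers.
struct MapEntry {
    uint32_t constantID;
    uint32_t offset;
    std::size_t size;
};

struct SpecInfo {
    uint32_t mapEntryCount;
    const MapEntry *pMapEntries;
    std::size_t dataSize;
    const void *pData;
};

// Returns the 32-bit value supplied for `constant_id`, or `default_value`
// (the value written in the shader) if the application did not override it.
uint32_t resolve_u32(const SpecInfo &info, uint32_t constant_id, uint32_t default_value) {
    for (uint32_t i = 0; i < info.mapEntryCount; ++i) {
        const MapEntry &e = info.pMapEntries[i];
        if (e.constantID != constant_id)
            continue;
        // Reject entries that are not 32 bits wide or that point past the data blob.
        if (e.size != sizeof(uint32_t) || e.offset + e.size > info.dataSize)
            break;
        uint32_t value = 0;
        std::memcpy(&value, static_cast<const unsigned char *>(info.pData) + e.offset, sizeof(value));
        return value;
    }
    return default_value;
}
```

With the slide's data (constant ID 1 mapped to offset 0, blob `{ 16 }`), the shader's `arraySize` default of 12 is overridden with 16, while any constant ID without a map entry keeps its default.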



Subpass Inputs
• Vulkan supports subpasses within render passes
• Standardized GL_EXT_shader_pixel_local_storage!

// GLSL
#extension GL_EXT_shader_pixel_local_storage : require
__pixel_local_inEXT GBuffer {
layout(rgba8) vec4 albedo;
layout(rgba8) vec4 normal;
...
} pls;

// Vulkan
layout(input_attachment_index = 0) uniform subpassInput albedo;
layout(input_attachment_index = 1) uniform subpassInput normal;
...



Acknowledgements
• Hans-Kristian Arntzen – ARM
• Benedict Gaster – University of the West of England
• Neil Henning – Codeplay



Using SPIR-V in practice with
SPIRV-Cross
Hans-Kristian Arntzen
Engineer, ARM
Contents
• Moving to offline compilation of SPIR-V
• Creating pipeline layouts with SPIRV-Cross
- Descriptor sets
- Push constants
- Multipass input attachments
• Making SPIR-V portable to other graphics APIs
• Debugging complex shaders with your C++ debugger of choice



Offline Compilation to SPIR-V
• Shader compilation can be part of your build system
• Catching compilation bugs in build time is always a plus
• Strict, mature GLSL frontends available
- glslang: https://github.com/KhronosGroup/glslang
- shaderc: https://github.com/google/shaderc
• Full freedom for other languages in the future

# Makefile rules
FRAG_SHADERS := $(wildcard *.frag)
SPIRV_FILES := $(FRAG_SHADERS:.frag=.frag.spv)

shaders: $(SPIRV_FILES)

%.frag.spv: %.frag
	glslc -o $@ $< $(GLSL_FLAGS) -std=310es


Vulkan Pipeline Layouts
• Need to know the “function signature” of our shaders

pipelineInfo.layout = <layout goes here>;


vkCreateGraphicsPipelines(..., &pipelineInfo, ..., &pipeline);



The Contents of a Pipeline Layout
layout(set = 0, binding = 1) uniform UBO {
mat4 MVP;
};
layout(set = 1, binding = 2) uniform sampler2D uTexture;
layout(push_constant) uniform PushConstants {
vec4 FastConstant;
} constants;

• Signature
- 16 bytes of push constant space
- Two descriptor sets
- Set #0 has one UBO at binding #1
- Set #1 has one combined image sampler at binding #2
• Need to figure this out automatically, or write every layout by hand
- Latter is fine for tiny applications
- Vulkan does not provide reflection here; after all, this is vendor-neutral information


Introducing SPIRV-Cross
• SPIRV-Cross is a new tool hosted by Khronos
- https://github.com/KhronosGroup/SPIRV-Cross
• Extensive reflection
• Decompilation to high level languages

Khronos SPIR-V Toolbox: glslang, SPIRV-Tools, SPIRV-LLVM, SPIRV-Cross


Reflecting Uniforms and Samplers
• SPIRV-Cross has a simple API to retrieve resources

using namespace spirv_cross;
using namespace std;

vector<uint32_t> spirv_binary = load_spirv_file();
Compiler comp(move(spirv_binary));

// The SPIR-V is now parsed, and we can perform reflection on it.
ShaderResources resources = comp.get_shader_resources();

for (auto &u : resources.uniform_buffers)
{
uint32_t set = comp.get_decoration(u.id, spv::DecorationDescriptorSet);
uint32_t binding = comp.get_decoration(u.id, spv::DecorationBinding);
printf("Found UBO %s at set = %u, binding = %u!\n",
u.name.c_str(), set, binding);
}


Stepping it up with Push Constants
• SPIRV-Cross can figure out which push constant elements are in use
- Push constant blocks are typically shared across the various stages
- Only parts of the push constant block are referenced in a single stage

layout(push_constant) uniform PushConstants {
mat4 MVPInVertex;
vec4 ColorInFragment;
} constants;

FragColor = constants.ColorInFragment; // Fragment only uses element #1.

uint32_t id = resources.push_constant_buffers[0].id;
vector<BufferRange> ranges = comp.get_active_buffer_ranges(id);
for (auto &range : ranges)
{
printf("Accessing member #%u, offset %zu, size %zu\n",
range.index, range.offset, range.range);
}

// Possible to get names for struct members as well

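The range reported for the fragment stage can be cross-checked on the CPU side. This is a small sketch, not SPIRV-Cross code: `PushConstants` is a hand-written mirror of the block above (a mat4 is 16 floats and a vec4 is 4 floats, so the struct has no padding), and `Range` and `fragment_range` are names invented for the example.

```cpp
#include <cstddef>
#include <cstdint>

// CPU-side mirror of the PushConstants block. Plain float arrays keep the
// layout free of padding: mat4 = 16 floats (64 bytes), vec4 = 4 floats (16 bytes).
struct PushConstants {
    float MVPInVertex[16];
    float ColorInFragment[4];
};

// Hypothetical record holding the same three fields the reflection loop prints.
struct Range {
    uint32_t index;
    uint32_t offset;
    uint32_t size;
};

// The fragment stage only touches member #1, so only this byte range of the
// push constant block needs to be visible to that stage.
Range fragment_range() {
    Range r;
    r.index = 1;
    r.offset = static_cast<uint32_t>(offsetof(PushConstants, ColorInFragment));
    r.size = static_cast<uint32_t>(sizeof(PushConstants::ColorInFragment));
    return r;
}
```

For this block the fragment range starts at byte 64 and is 16 bytes long, which is the kind of offset and size pair a push constant range for the fragment stage would be built from.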

Subpass Input Attachments
• Subpass attachments are similar to regular images
- Set
- Binding
- Input attachment index

layout(set = 0, binding = 0, input_attachment_index = 0) uniform subpassInput uAlbedo;


layout(set = 0, binding = 1, input_attachment_index = 1) uniform subpassInput uNormal;

vec4 lastColor = subpassLoad(uLastPass);

for (auto &attachment : resources.subpass_inputs)


{
// ...
}



Taking SPIR-V Beyond Vulkan
• SPIR-V is a great format to rally around
- Makes sense to be able to use it in older graphics APIs as well
• Will take some time before exclusive Vulkan support is mainstream
• How to make use of Vulkan features while being compatible?
- Push constants
- Subpass
- Descriptor sets
• Without tools, Vulkan features will be harder to take advantage of



GL + GLES + Vulkan Pipeline
• Implemented in our internal demo engine
• Write shaders in Vulkan GLSL
• Use Vulkan features directly
• No need for platform #ifdefs
• Can target mobile and desktop GL from same SPIR-V binary


Subpasses in OpenGL
• The subpass attachment is really just a texture read from gl_FragCoord
- Enables reading directly from tile memory on tiled architectures
- Great for deferred rendering and programmable blending

// Vulkan GLSL
uniform subpassInput uAlbedo;
...
FragColor = accumulateLight(
subpassLoad(uAlbedo),
subpassLoad(uNormal).xyz,
subpassLoad(uDepth).x);

// Translated to GLSL in SPIRV-Cross


uniform sampler2D uAlbedo;
...
FragColor = accumulateLight(
texelFetch(uAlbedo, ivec2(gl_FragCoord.xy), 0),
texelFetch(uNormal, ivec2(gl_FragCoord.xy), 0).xyz,
texelFetch(uDepth, ivec2(gl_FragCoord.xy), 0).x);



Push Constants in OpenGL
• Push constants bundle up old-style uniforms into buffer blocks
- Translates directly to uniform structs
- Use reflection to stamp out a list of glUniform() calls

// Vulkan GLSL
layout(push_constant) uniform PushConstants {
vec4 Material;
} constants;

FragColor = constants.Material;

// Translated to GLSL in SPIRV-Cross


struct PushConstants {
vec4 Material;
};
uniform PushConstants constants;

FragColor = constants.Material;



Descriptor Sets in OpenGL
• OpenGL has a binding space per type
• Find some remapping scheme that fits your application
• SPIRV-Cross can tweak bindings before decompiling to GLSL

// Vulkan GLSL
layout(set = 1, binding = 1) uniform sampler2D uTexture;

// SPIRV-Cross
uint32_t newBinding = 4;
glsl.set_decoration(texture.id, spv::DecorationBinding, newBinding);
glsl.unset_decoration(texture.id, spv::DecorationDescriptorSet);
string glslSource = glsl.compile();

// GLSL
layout(binding = 4) uniform sampler2D uTexture;
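One possible remapping scheme is to give every descriptor set a fixed-size slice of the flat GL binding space. The sketch below is only an assumption about how an application might do this: `kBindingsPerSet` and `flatten_binding` are invented names, and the stride of 3 is chosen purely so that (set 1, binding 1) lands on binding 4, as in the slide.

```cpp
#include <cstdint>

// Hypothetical policy: the application guarantees that no descriptor set uses
// more than kBindingsPerSet bindings of any one resource type.
constexpr uint32_t kBindingsPerSet = 3;

// Maps a Vulkan (set, binding) pair to a single flat OpenGL binding index.
constexpr uint32_t flatten_binding(uint32_t set, uint32_t binding) {
    return set * kBindingsPerSet + binding;
}

// The slide's example: set = 1, binding = 1 becomes GL binding 4.
static_assert(flatten_binding(1, 1) == 4, "matches newBinding in the slide");
```

The result of `flatten_binding` is what would be passed as `newBinding` before decompiling to GLSL. Since OpenGL keeps a separate binding space per resource type, the same formula can be reused for UBOs, textures, and so on without collisions between types.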



gl_InstanceIndex in OpenGL
• Vulkan adds the base instance to the instance ID
- GL does not
- Workaround is to have GL backend pass in the base index as a uniform

// Vulkan GLSL
layout(set = 0, binding = 0) uniform UBO {
mat4 MVPs[MAX_INSTANCES];
};

gl_Position = MVPs[gl_InstanceIndex] * Position;

// GLSL through SPIRV-Cross


layout(binding = 0) uniform UBO {
mat4 MVPs[MAX_INSTANCES];
};
uniform int SPIRV_Cross_BaseInstance; // Supplied by application

gl_Position = MVPs[(gl_InstanceID + SPIRV_Cross_BaseInstance)] * Position;
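The difference is easy to see by listing the values each API hands to the vertex shader for one instanced draw. This is a plain CPU-side sketch with invented function names; it models the built-in variables rather than calling any graphics API.

```cpp
#include <cstdint>
#include <vector>

// Values of gl_InstanceIndex in Vulkan: the base instance is included.
std::vector<uint32_t> vulkan_instance_indices(uint32_t instance_count, uint32_t base_instance) {
    std::vector<uint32_t> out;
    for (uint32_t i = 0; i < instance_count; ++i)
        out.push_back(base_instance + i);
    return out;
}

// Values of gl_InstanceID in OpenGL: always counts from zero, whatever the base instance is.
std::vector<uint32_t> gl_instance_ids(uint32_t instance_count) {
    std::vector<uint32_t> out;
    for (uint32_t i = 0; i < instance_count; ++i)
        out.push_back(i);
    return out;
}

// The workaround: the application passes the base instance in a uniform
// and the translated shader adds it back to gl_InstanceID.
std::vector<uint32_t> gl_with_base_uniform(uint32_t instance_count, uint32_t base_instance_uniform) {
    std::vector<uint32_t> out = gl_instance_ids(instance_count);
    for (uint32_t &v : out)
        v += base_instance_uniform;
    return out;
}
```

For a draw of three instances with base instance 10, the Vulkan shader sees 10, 11, 12 and the untranslated GL shader sees 0, 1, 2; adding the uniform makes the two agree.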



Debugging Shaders in C++
• If you have thought …
- “I wish I could assert() in a compute shader”
- “I wish I could instrument a shader with logging”
- “I wish I could use clang address sanitizer to debug out-of-bounds access”
- “I want to reproduce a shader bug outside the driver”
- “I want to run regression tests when optimizing a shader”
- “I want to step through a compute thread in <insert C++ debugger here>”
• … the C++ backend in SPIRV-Cross could be interesting
• Still a very experimental feature
• Hope to expand this further in the future



Basic Idea
• With GLM, C++ can be near GLSL compatible
• Reuse the GLSL backend to emit code which also works in C++
- Minor differences like references vs. in/out, etc
• Add some scaffolding to redirect shader resources
- Easily done with macros, the actual C++ output is kept clean
• The C++ output implements a simple C-compatible interface
• Add instrumentation to the C++ file as desired
• Compile C++ file to a dynamic library with debug symbols
• Instantiate from test program, bind buffers and invoke
- And have fun running shadertoy raymarchers at seconds per frame



On the Command Line

# Compile to SPIR-V
glslc -o test.spv test.comp

# Create C++ interface
spirv-cross --output test.cpp test.spv --cpp

# Add some instrumentation to the shader if you want
$EDITOR test.cpp

# Build library
g++ -o test.so -shared test.cpp -O0 -g -Iinclude/spirv_cross

# Run your test app
./<my app> --shader test.so


Another tool supporting Vulkan:
Mali Graphics Debugger is an advanced API tracer tool for Vulkan, OpenGL ES, EGL and OpenCL. It allows developers to trace their graphics and compute applications to debug issues and analyze the performance.

• Vulkan Support
- Trace all the function calls in the spec.
- Allows you to see exactly what calls compose your application.
- Contact the Mali forums and we would love to get you set up.
https://community.arm.com/groups/arm-mali-graphics


Investigation with the Mali Graphics Debugger
[Figure: Mali Graphics Debugger user interface, with panels labelled Frame Statistics, Assets View, Frame Outline, Frame Capture, API Trace, States, Uniforms, Vertex Attributes, Buffers, Framebuffers, Textures, Shaders, and Dynamic Help.]


References
• SPIRV-Cross
- https://github.com/KhronosGroup/SPIRV-Cross
• Glslang
- https://github.com/KhronosGroup/glslang
• Shaderc
- https://github.com/google/shaderc
• SPIRV-Tools
- https://github.com/KhronosGroup/SPIRV-Tools
• Mali Graphics Debugger
- http://malideveloper.arm.com/resources/tools/mali-graphics-debugger/


Feeding Your Shaders

Jesse Barker
Principal Software Engineer

Moving to Vulkan: How to make your 3D graphics more explicit

May 26, 2016


© ARM 2016
What is a Vulkan Resource?
• Shader Input/Output
• Referenced via Descriptors
• Some are specialized in the hardware

• Buffers
• Images
• Samplers
• Input Attachments
• Vertex Input Attributes
• Render Targets
What are Vulkan Descriptors?

[Diagram: a descriptor pairs a handle with a type, for example handle myImageView with type SAMPLED_IMAGE. The image view refers to an image, which is backed by device memory.]
What are Descriptor Sets?
// uniform blocks:
layout(set = 0, binding = 0) uniform Type0 { ... } ubo0;
// textures:
layout(set = 0, binding = 1) uniform sampler2D tex0;
// SSBO:
layout(set = 0, binding = 2) buffer Type2 { ... } ssbo0;
void main() {
// ...
}

binding   type             stages
0         Uniform Buffer   Graphics
1         Image/Sampler    Graphics
2         Storage Buffer   Graphics
What is a Descriptor Pool?
• Parent object of a Descriptor Set
• Allows Descriptor Set management to be threaded
• Manages memory for hardware descriptors

typedef struct VkDescriptorPoolSize {
VkDescriptorType type;
uint32_t descriptorCount;
} VkDescriptorPoolSize;

typedef struct VkDescriptorPoolCreateInfo {
VkStructureType sType;
const void* pNext;
VkDescriptorPoolCreateFlags flags;
uint32_t maxSets;
uint32_t poolSizeCount;
const VkDescriptorPoolSize* pPoolSizes;
} VkDescriptorPoolCreateInfo;
Allocating Descriptor Sets
 Define desired layouts of descriptors
 Ask the Descriptor Pool to allocate a Descriptor Set per layout

What is a Pipeline Layout?
// uniform blocks:
layout(set = 0, binding = 0) uniform Type0 { ... } ubo0;
layout(set = 0, binding = 0) uniform Type1 { ... } ubo1;
// textures:
layout(set = 0, binding = 1) uniform sampler2D tex0;
layout(set = 1, binding = 0) uniform sampler2D tex1;
// SSBO:
layout(set = 1, binding = 1) buffer Type2 { ... } ssbo0;
void main() {
// ...
}

Descriptor Set 0
binding   type             stages
0         Uniform Buffer   Graphics
0         Uniform Buffer   Graphics
1         Image/Sampler    Graphics

Descriptor Set 1
binding   type             stages
0         Image/Sampler    Graphics
1         Storage Buffer   Graphics
How do Descriptors get into Descriptor Sets?
VKAPI_ATTR void VKAPI_CALL vkUpdateDescriptorSets(
VkDevice device,
uint32_t descriptorWriteCount,
const VkWriteDescriptorSet* pDescriptorWrites,
uint32_t descriptorCopyCount,
const VkCopyDescriptorSet* pDescriptorCopies);

typedef struct VkWriteDescriptorSet {
VkStructureType sType;
const void* pNext;
VkDescriptorSet dstSet;
uint32_t dstBinding;
uint32_t dstArrayElement;
uint32_t descriptorCount;
VkDescriptorType descriptorType;
const VkDescriptorImageInfo* pImageInfo;
const VkDescriptorBufferInfo* pBufferInfo;
const VkBufferView* pTexelBufferView;
} VkWriteDescriptorSet;

typedef struct VkCopyDescriptorSet {
VkStructureType sType;
const void* pNext;
VkDescriptorSet srcSet;
uint32_t srcBinding;
uint32_t srcArrayElement;
VkDescriptorSet dstSet;
uint32_t dstBinding;
uint32_t dstArrayElement;
uint32_t descriptorCount;
} VkCopyDescriptorSet;
Finally, I’m ready to use my Descriptor Sets
VKAPI_ATTR void VKAPI_CALL vkCmdBindDescriptorSets(
VkCommandBuffer commandBuffer,
VkPipelineBindPoint pipelineBindPoint,
VkPipelineLayout layout,
uint32_t firstSet,
uint32_t descriptorSetCount,
const VkDescriptorSet* pDescriptorSets,
uint32_t dynamicOffsetCount,
const uint32_t* pDynamicOffsets);

• Bound sets must match pipeline layout
• Graphics or compute?
• Simple layout is best
What about Vertex Input?

Vertex Input Description
If your shader declares:

in vec3 position;
in uvec2 texcoord;

Your C code declares:

struct Position
{
float x, y, z;
};

struct Texcoord
{
uint8_t u, v;
};

const VkVertexInputBindingDescription binding[] =
{
{
0, // binding
sizeof(float) * 3, // stride
VK_VERTEX_INPUT_RATE_VERTEX // inputRate
},
{
1, // binding
sizeof(uint8_t) * 2, // stride
VK_VERTEX_INPUT_RATE_VERTEX // inputRate
},
};

const VkVertexInputAttributeDescription attributes[] =
{
{
0, // location
binding[0].binding, // binding
VK_FORMAT_R32G32B32_SFLOAT, // format
0 // offset
},
{
1, // location
binding[1].binding, // binding
VK_FORMAT_R8G8_UINT, // format (integer, to match uvec2)
0 // offset
}
};

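The strides in the binding descriptions must agree with the C structs that fill the vertex buffers. A minimal sketch of that check follows; `BindingDesc` is a hypothetical two-field mirror of VkVertexInputBindingDescription (declared locally so the example builds without the Vulkan headers) and `vertex_offset` is an invented helper.

```cpp
#include <cstddef>
#include <cstdint>

struct Position { float x, y, z; };
struct Texcoord { uint8_t u, v; };

// Hypothetical mirror of the binding and stride fields of
// VkVertexInputBindingDescription, for illustration only.
struct BindingDesc {
    uint32_t binding;
    uint32_t stride;
};

// Deriving the stride from the struct keeps the two definitions in sync.
constexpr BindingDesc kBindings[] = {
    {0, sizeof(Position)},
    {1, sizeof(Texcoord)},
};

// Guard against unexpected padding: the structs must be tightly packed.
static_assert(sizeof(Position) == sizeof(float) * 3, "Position must be 12 bytes");
static_assert(sizeof(Texcoord) == sizeof(uint8_t) * 2, "Texcoord must be 2 bytes");

// Byte offset of vertex `vertex_index` within the buffer bound at `b`.
constexpr std::size_t vertex_offset(const BindingDesc &b, std::size_t vertex_index) {
    return vertex_index * b.stride;
}
```

Using `sizeof(Position)` rather than a hand-written `sizeof(float) * 3` means a later change to the struct shows up in the stride automatically, and the static assertions catch the case where the compiler adds padding.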
Questions?

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM
Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured
may be trademarks of their respective owners.
Copyright © 2016 ARM Limited

Vulkan Subpasses
or
The Frame Buffer is Lava

Andrew Garrard
Samsung R&D Institute UK

UK Khronos Chapter meet, May 2016


Vulkan: Making use of the GPU more efficient
• Vulkan aims to reduce the overheads of keeping the GPU busy

UK Khronos Chapter meet, May 2016 Vulkan subpasses — Page 96


Vulkan: Making use of the GPU more efficient
• Vulkan aims to reduce the overheads of keeping the GPU busy
- Efficient generation of work on multiple CPU cores
[Diagram: command buffer recording. Cores 1 to 3 each record a series of command buffers (CmdBuf) in parallel, while Core 4 issues the submits.]


Vulkan: Making use of the GPU more efficient
• Vulkan aims to reduce the overheads of keeping the GPU busy
- Efficient generation of work on multiple CPU cores
- Reuse of command buffers to avoid CPU build time
[Diagram: recording a secondary (2ry) command buffer, then recording a primary command buffer that invokes the secondary four times.]


Vulkan:
Click Making
to edit Masteruse
titleof the
GPU more efficient
style
•Vulkan aims to reduce the overheads of
keeping the GPU busy
- Efficient generation of work on multiple CPU cores
- Reuse of command buffers to avoid CPU build time
vkQueueSubmit vkQueueSubmit vkQueueSubmit

Record command buffer CmdBuf CmdBuf CmdBuf

Record command buffer CmdBuf CmdBuf

Record command buffer CmdBuf

UK Khronos Chapter meet, May 2016 Vulkan subpasses — Page 99


Vulkan:
Click Making
to edit Masteruse
titleof the
GPU more efficient
style
•Vulkan aims to reduce the overheads of
keeping the GPU busy
- Efficient generation of work on multiple CPU cores
- Reuse of command buffers to avoid CPU build time
- Potentially more efficient memory management
Heap 1 Heap 2
User-defined memory reuse
Pool 1 Pool 2
Explicit state transitions
Image 1 Image 2 Image 3 Cost invoked at defined points

View 1 View 2

UK Khronos Chapter meet, May 2016 Vulkan subpasses — Page 100


Vulkan:
Click Making
to edit Masteruse
titleof the
GPU more efficient
style
•Vulkan aims to reduce the overheads of
keeping the GPU busy
- Efficient generation of work on multiple CPU cores
- Reuse of command buffers to avoid CPU build time
- Potentially more efficient memory management
- Avoiding unpredictable shader compilation
Compile to SPIR-V (slow) Offline

Record command buffer (slow-ish) 2ry thread

Submit command buffer (fast) Submitting thread

UK Khronos Chapter meet, May 2016 Vulkan subpasses — Page 101


Vulkan:
Click Making
to edit Masteruse
titleof the
GPU more efficient
style
•Vulkan aims to reduce the overheads of
keeping the GPU busy
- Efficient generation of work on multiple CPU cores
- Reuse of command buffers to avoid CPU build time
- Potentially more efficient memory management
- Avoiding unpredictable shader compilation
•Mostly, the message has been that if you’re entirely
limited by shader performance or bandwidth, Vulkan
can’t help you (there is no magic wand)
UK Khronos Chapter meet, May 2016 Vulkan subpasses — Page 102

Vulkan: Making use of the GPU more efficient
•Actually, that’s not entirely true...
•APIs like OpenGL were designed when the GPU looked very different (or was partly software)
•The way to design an efficient mobile GPU is not a perfect match for OpenGL
- Think a CPU’s command decode unit/microcode
•But the translation isn’t always perfectly efficient

Tiled GPUs
•Most (not all) mobile GPUs use tiling
- It’s all about the bandwidth (size and power limits)
[Diagram: scene description, binning pass, shading pass]

•On-chip tile memory is much faster than the main frame buffer

Not everything reaches memory
•Rendering requires lots of per-pixel data
- Z, stencil
- Full multisample resolution
•We usually only care about the final image

[Diagram: Z, stencil and RGB buffers]
- We can throw away Z and stencil
- We only need a downsampled (A)RGB
- Don’t need to load anything from a previous frame
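
The saving described above can be put into numbers. The following is a rough sketch, assuming a fixed number of bytes per sample for colour and for depth/stencil; the function names and the example figures are hypothetical and ignore any compression a real GPU might apply.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes written per frame if every sample of colour and depth/stencil is stored. */
uint64_t full_store_bytes(uint32_t width, uint32_t height, uint32_t samples,
                          uint32_t color_bytes, uint32_t depth_stencil_bytes)
{
    assert(samples > 0);
    return (uint64_t)width * height * samples *
           ((uint64_t)color_bytes + depth_stencil_bytes);
}

/* Bytes written per frame if only the resolved colour (one value per pixel) is stored. */
uint64_t resolved_store_bytes(uint32_t width, uint32_t height, uint32_t color_bytes)
{
    return (uint64_t)width * height * color_bytes;
}
```

For a 1920x1080 target with 4 samples, 4 bytes of colour and 4 bytes of depth/stencil per sample, the first function gives 66,355,200 bytes per frame and the second gives 8,294,400 bytes, a factor of eight.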

Sometimes we want the results of rendering
•Output from one rendering job can be used by
the next
•Z buffer for shadow maps
•Rendering for environment maps
•HDR bloom

•These can have low resolution and may not take much bandwidth

Sometimes you do need framebuffer resolution
•Deferred shading
[Diagram: a lightweight render stores per-surface content at each fragment (Z, diffuse/α, specular/specularity, normal); a full-screen quad is then rendered to perform fragment shading]
•Deferred lighting
[Diagram: a lightweight render writes the lighting inputs (Z, specularity, normal); a full-screen quad is rendered to calculate the lighting output (diffuse, specular); the scene is then re-rendered with full fragment shading, using the lighting output as input]
•Order-independent transparency
•HDR tone mapping

Rendering outputs separately
•Rendering to each surface separately is bad

•Geometry has a per-bin cost


- Sometimes the cost is low, but it’s there
- Vertices in multiple bins get processed repeatedly
- Rendering the scene repeatedly is painful
•Even immediate-mode renderers hate this!

Multiple render targets don’t help much
•Using MRTs means multiple buffers in one pass
[Diagram: single scene traversal writing to multiple buffers]
Note: This is a typical approach for immediate-mode renderers (e.g. desktop/console systems)
•Reduces the geometry load (only process once)
•Still writing a lot of data off-chip
- Tilers are all about trying not to do this!
- Increased use of shader resources may slow some h/w

Pixel Local Storage (OpenGL ES extension)
•Tiler-friendly (at last)
- Store only the current tile values
- Read them later in the tile processing
•But not portable!
- Not practical on immediate renderers
- Debugging on desktop won’t work!
- Capabilities vary between devices
- Driver doesn’t have visibility
- Data access is restricted

Vulkan: Explicit dependencies
•Vulkan has direct support for this type of
rendering work load
•By telling the driver how you intend to use the
rendered results, the driver can produce a
better mapping to the hardware
- The extra information is a little verbose, but simpler
than handling all possible cases yourself!


Vulkan render passes and subpasses
•A render pass groups dependent operations
- All images written in a render pass are the same size
[Diagram: geometry, lighting and fragment work inside a single render pass]
•A render pass contains a number of subpasses
- Subpasses describe access to attachments
- Dependencies can be defined between subpasses
[Diagram: subpass 1: geometry; subpass 2: lighting; subpass 3: fragment]
•Each render pass instance has to be contained within a single command buffer (unit of work)
- Some tilers schedule by render pass

Defining a render pass
•VkRenderPassCreateInfo
- VkAttachmentDescription *pAttachments
- Just the descriptions, not the actual attachments!
- VkSubpassDescription *pSubpasses
- VkSubpassDependency *pDependencies
•vkCreateRenderPass(device, createInfo,.. pass)
- Gives you a VkRenderPass object
- This is a template that you can use repeatedly
- When we use it, we get a render pass instance

Describing attachments for a render pass
•VkAttachmentDescription
- format/samples
- loadOp
- VK_ATTACHMENT_LOAD_OP_LOAD to preserve
- VK_ATTACHMENT_LOAD_OP_DONT_CARE for overwrites
- VK_ATTACHMENT_LOAD_OP_CLEAR uniform clears (e.g. Z)
- storeOp
- VK_ATTACHMENT_STORE_OP_STORE to output it
- VK_ATTACHMENT_STORE_OP_DONT_CARE may discard after
the render pass

Defining a subpass
•VkSubpassDescription
- pInputAttachments
- Which of the render pass’s attachments this subpass reads
- pColorAttachments
- Which ones this subpass writes (1:1 - optional)
- pResolveAttachments
- Which ones this subpass writes (resolving multisampling)
- pPreserveAttachments
- Which attachments need to persist across this subpass
- Subpasses are numbered and ordered

Defining subpass dependencies
•VkSubpassDependency
- srcSubpass
- dstSubpass
- Where the dependency applies (can be external)
- srcStageMask
- dstStageMask
- Execution dependencies between subpasses
- srcAccessMask
- dstAccessMask
- Memory dependencies between subpasses
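
Because subpasses are numbered and ordered, a dependency between two subpasses of the same render pass has to point forwards (or at the same subpass), unless one end is external. The following is a small sketch of that ordering rule only. It does not use the Vulkan headers: `SKETCH_SUBPASS_EXTERNAL` is a stand-in for `VK_SUBPASS_EXTERNAL`, and the function name is hypothetical. Stage and access masks are not checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for VK_SUBPASS_EXTERNAL, which the Vulkan headers define as (~0U). */
#define SKETCH_SUBPASS_EXTERNAL (~0U)

/* Checks only the srcSubpass/dstSubpass ordering of one dependency. */
bool dependency_order_is_valid(uint32_t src_subpass, uint32_t dst_subpass,
                               uint32_t subpass_count)
{
    bool src_external = (src_subpass == SKETCH_SUBPASS_EXTERNAL);
    bool dst_external = (dst_subpass == SKETCH_SUBPASS_EXTERNAL);

    assert(subpass_count > 0);

    /* At least one end must be a subpass of this render pass. */
    if (src_external && dst_external)
        return false;

    /* Non-external indices must name an existing subpass. */
    if (!src_external && src_subpass >= subpass_count)
        return false;
    if (!dst_external && dst_subpass >= subpass_count)
        return false;

    /* A dependency that crosses the render pass boundary has no ordering rule here. */
    if (src_external || dst_external)
        return true;

    /* Inside the render pass: earlier (or the same) subpass feeds a later one. */
    return src_subpass <= dst_subpass;
}
```

For a render pass with three subpasses, a dependency from subpass 0 to subpass 1 passes this check, while one from subpass 1 to subpass 0 does not.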

Vulkan framebuffers
•A VkFramebuffer defines the set of
attachments used by a render pass instance
•VkFramebufferCreateInfo
- renderPass
- pAttachments
- These are actual VkImageViews this time!
- width
- height
- layers

Starting to use a render pass
•vkCmdBeginRenderPass/vkCmdEndRenderPass
- Starts a render pass instance in a command buffer
- You start in the first (maybe only) subpass implicitly
- pRenderPassBegin contains configuration
•VkRenderPassBeginInfo
- VkRenderPass renderPass
- The render pass “template”
- VkFramebuffer framebuffer
- Specifies targets for rendering

Putting it all together…
[Diagram: arrays of VkAttachmentDescription, VkSubpassDescription and VkSubpassDependency are referenced by VkRenderPassCreateInfo, which is passed to vkCreateRenderPass to construct a VkRenderPass. VkImageViews and the VkRenderPass are referenced by VkFramebufferCreateInfo, which is passed to vkCreateFramebuffer to construct a VkFramebuffer. The VkRenderPass and VkFramebuffer are referenced by VkRenderPassBeginInfo, which is passed with a VkCommandBuffer to vkCmdBeginRenderPass.]
Key:
• Objects are dark grey
• Functions are light grey
• Arrows between objects are references of some sort
• Arrows into functions are arguments
• Arrows out of functions are constructed objects

Simple rendering
•vkAllocateCommandBuffers (VK_COMMAND_BUFFER_LEVEL_PRIMARY)
•vkBeginCommandBuffer
•vkCmdBeginRenderPass
•vkCmdDraw (etc.)
•vkCmdEndRenderPass
•vkEndCommandBuffer
•vkQueueSubmit
[Diagram: a command buffer containing one render pass with four draws, submitted to a queue]

Multiple render passes
•You can have more than one render pass in a command buffer
- Yes, Leeloo multipass, we know…
- So a command buffer can render to many outputs
- E.g. you could render to the same shadow and environment maps every frame by reusing the same command buffer
- But it must be the same outputs each time you submit
- A specific render pass instance has fixed VkFramebuffers!
[Diagram: a command buffer containing two render passes, each containing draws]

Two limitations…
•Different render passes ⇒ independent outputs
- Rendering goes off-chip, there’s no PLS-style on-chip reuse of pixel contents
•You can’t reuse the same command buffer with a different render target
- E.g. for double buffering or streamed content
- We’ll come back to this…
•Still sometimes all you need, though!

More than one subpass
•vkCmdNextSubpass moves to the next subpass
- Implicitly start in the first subpass of the render pass
- Dependencies say what you’re accessing from previous subpasses
- Same render pass so accesses stay on chip (if possible)
[Diagram: a command buffer containing one render pass; draws, a new subpass, then more draws]

Using multiple subpasses
•vkBeginCommandBuffer
•vkCmdBeginRenderPass
•vkCmdDraw (etc.)
•vkCmdNextSubpass
•vkCmdDraw (etc.)
•vkCmdEndRenderPass
•vkEndCommandBuffer
[Diagram: a command buffer containing one render pass; draws, a new subpass, then more draws]

Accessing subpass output in fragment shaders
•In SPIR-V, previous subpass content is read
with OpImageRead
- Coordinates are sample-relative, and need to be 0
- OpTypeImage Dim = SubpassData
•In GLSL (using GL_KHR_vulkan_glsl):
- Types for subpass access are [ui]subpassInput(MS)
- layout(input_attachment_index = i, …) uniform subpassInput t; to select an input attachment
- subpassLoad() to access the pixel
Note: C.f. __pixel_localEXT layouts in EXT_shader_pixel_local_storage when using OpenGL ES

Avoiding unnecessary allocations
•If we’re using subpasses, we likely don’t need
the images in memory
- A tiler may be able to process the subpasses entirely
on-chip, without needing an allocation
- Still need to “do the allocation” in case the tiler can’t
handle the request/on an immediate-mode renderer!
- Won’t commit resources unless it actually needs to
•vkCreateImage usage flag for “lazy committal”
- VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT
- Pair it with memory that has VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT, where the device offers it

Vulkan subpasses: advantages
•The driver knows what you’re doing
- It can reorder subpasses
- It can change the tile size
- It can balance resources between subpasses
- It will fall back to memory for you if it has to
- Under the hood, mechanism likely matches PLS
Note: EXT_shader_pixel_local_storage is actually more explicit than Vulkan here (and may still be offered as an extension)
•Works on immediate mode renderers
- Probably MRTs and normal external writes
- Desktop debugging tools will work!

There’s more: Secondary command buffers
•Vulkan has two levels of command buffers
- Determined by vkAllocateCommandBuffers
•VK_COMMAND_BUFFER_LEVEL_PRIMARY
- Main command buffer, as we’ve seen so far
•VK_COMMAND_BUFFER_LEVEL_SECONDARY
- Command buffer that can be invoked from the
primary command buffer


Use of secondary command buffers
•vkBeginCommandBuffer
- Takes a VkCommandBufferBeginInfo
•VkCommandBufferBeginInfo
- flags include:
- VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT
- pInheritanceInfo
•VkCommandBufferInheritanceInfo
- renderPass and subpass
- framebuffer (can be null, more efficient if known)

Secondary command buffers and passes
•Why do we need the “continue bit”?
- Render passes (and subpasses) can’t start in a
secondary command buffer
- Non-render pass stuff can be in a secondary buffer
- You can run a compute shader outside a render pass
- Otherwise, the render pass is inherited from the
primary command buffer


Secondary command buffers and passes
•Why specify render pass/framebuffer?
- Command buffers need to know this when recording
- Some operations depend on render pass info (e.g. format)
- Framebuffer is optional (can just inherit)
- If you can specify the actual framebuffer, the command
buffer can be less generic and therefore may be faster


Invoking the secondary command buffer
•You can’t submit a secondary command buffer
•You have to invoke it from a primary command
buffer with vkCmdExecuteCommands
[Diagram: vkCmdExecuteCommands (vkCEC) calls inside the render passes of a primary command buffer, one of them after a new subpass, invoke three secondary buffers, each containing two draws]

Secondary command buffer code
•vkBeginCommandBuffer
•vkCmdBeginRenderPass
•vkCmdExecuteCommands
•vkCmdNextSubpass
•vkCmdExecuteCommands
•vkCmdEndRenderPass
•vkEndCommandBuffer
[Diagram: a primary command buffer whose render pass contains a vkCmdExecuteCommands (vkCEC) call, a new subpass and a second vkCEC call; each call invokes a secondary buffer containing two draws]

Performance and parallelism
•Creating a command buffer can be slow
- Lots of state to check, may require compilation
- This happens in GLES as well, you just don’t control when!
•So create secondary command buffers on
different threads
- Lots of 4- and 8-core CPUs in cell phones these days
•Invoking the secondary buffer is lightweight
- Primary command buffer generation is quick(er)

What does this have to do with passes?
•Remember:
- Render passes exist within (primary) command buffers
- The command buffer sets up the GPU for the render pass
- On-chip rendering happens within a render pass
- If you want content to persist between render passes, it’ll
reach memory (or at least cache), not stay in the tile buffer
- You can’t use multiple threads to build work for a
primary command buffer in parallel
- You can build many secondary command buffers at once


You can’t mix and match
•Within a subpass you can either (but not both):
- Execute rendering commands directly in the primary command buffer
- VK_SUBPASS_CONTENTS_INLINE
[Diagram: a command buffer whose render pass contains four draws]
- Invoke secondary command buffers from the primary command buffer with vkCmdExecuteCommands
- VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS
[Diagram: a primary command buffer whose render pass contains two vkCEC calls, each invoking a secondary buffer containing two draws]
- Chosen by vkCmdBeginRenderPass/vkCmdNextSubpass
- Remember: you can only do these in a primary command buffer!

Command buffer reuse: even faster
•Primary command buffers work with a fixed
render pass and framebuffer
- You can reuse a primary command buffer, but it will
always access the same images – often good enough
- May have to wait for execution to end; can’t be “one-time”
•What if you want to access different targets?
- E.g. a cycle of framebuffers or streamed content?
- You can round-robin several command buffers
- Or you can use secondary command buffers!
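
The round-robin option can be as simple as indexing a small array of pre-recorded primary command buffers by frame number. This is a minimal sketch under that assumption; the function name is hypothetical, and a real application would also wait (for example on a fence) until the chosen command buffer has finished executing before submitting it again.

```c
#include <assert.h>
#include <stdint.h>

/* Picks which of buffer_count pre-recorded command buffers to submit for a frame. */
uint32_t command_buffer_for_frame(uint64_t frame_index, uint32_t buffer_count)
{
    assert(buffer_count > 0);
    return (uint32_t)(frame_index % buffer_count);
}
```

With three command buffers, one per swapchain image for instance, frames 0, 1, 2, 3 and 4 would use slots 0, 1, 2, 0 and 1.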

Compatible render passes and frame buffers
•The render pass a secondary command buffer
uses needn’t be the one it was recorded with
- It can be “compatible”
- Same formats, number of sub-passes, etc.
•You can have primary command buffers with
different outputs, and they can re-use
secondary command buffers
- The primary has to be different to record new targets
- The primary may have to patch secondary addresses
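
The idea of “compatible” can be pictured as a field-by-field comparison of two render pass descriptions. The sketch below is a deliberate simplification, assuming compatibility depends only on the subpass count and on each attachment’s format and sample count; the Vulkan specification defines the full rules. All types, names and the numeric format ids are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down attachment description. */
typedef struct AttachmentSketch {
    uint32_t format;   /* arbitrary format id, not a real VkFormat value */
    uint32_t samples;  /* samples per pixel */
} AttachmentSketch;

/* Hypothetical, trimmed-down render pass description. */
typedef struct RenderPassSketch {
    const AttachmentSketch *attachments;
    size_t attachment_count;
    uint32_t subpass_count;
} RenderPassSketch;

/* True if the two sketches agree on subpass count, formats and sample counts. */
bool sketches_compatible(const RenderPassSketch *a, const RenderPassSketch *b)
{
    assert(a != NULL && b != NULL);
    if (a->subpass_count != b->subpass_count ||
        a->attachment_count != b->attachment_count)
        return false;
    for (size_t i = 0; i < a->attachment_count; ++i) {
        if (a->attachments[i].format != b->attachments[i].format ||
            a->attachments[i].samples != b->attachments[i].samples)
            return false;
    }
    return true;
}

/* Example data: colour plus depth, single-sampled and 4x multisampled. */
const AttachmentSketch color_depth[] = { { 1, 1 }, { 2, 1 } };
const AttachmentSketch color_depth_msaa[] = { { 1, 4 }, { 2, 4 } };
const RenderPassSketch pass_a = { color_depth, 2, 2 };
const RenderPassSketch pass_b = { color_depth, 2, 2 };
const RenderPassSketch pass_msaa = { color_depth_msaa, 2, 2 };
```

Here `pass_a` and `pass_b` would count as compatible, so a secondary command buffer recorded against one could be invoked inside the other, while `pass_msaa` differs in sample count and would not.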

Almost-free use with changing framebuffers
•No cost for secondary command buffers
•Primary command buffer is simple and quick
[Diagram: two primary command buffers, one for target image 1 and one for target image 2, each containing a render pass with two vkCmdExecuteCommands (CEC) calls that invoke the same two secondary command buffers]

So I can do bloom/DoF/rain/motion blur…!
•No! Remember, you can only access the
current pixel
•Tilers process one tile at a time
- If you could try to access a different pixel, the tile containing it may not be there
- You have to write out the whole image to do this
- Slow, painful, last resort!
- Yes, we can think of possible solutions too
- Give it time (lots of different hardware out there)

Coming out of the shadow(buffer)s
•Render passes are integral to the Vulkan API
- Reflects modern, high-quality rendering approaches
•The driver has more information to work with
- It can do more for you
- Remember this if you complain it’s verbose!
•Hardware resource management is hard
- Expect drivers to get better over time
•Another tool for better mobile gaming

Thank you
•Over to you…

Andrew Garrard
a.garrard at samsung.com



Keeping your GPU fed
without getting bitten
Tobias Hector
May 2016
© Copyright Khronos Group 2016 - Page 150
Introduction
• You have delicious draw calls
- Yummy!

• Your GPU wants to eat them
- It’s really hungry

• Keep it fed at all times
- So it keeps making pixels

• Don’t want it biting your hand
- Look at those teeth!


Keeping it fed
• GPU needs a constant supply of food
- It doesn’t want to wait

• Certain foods are tough to digest


- Provide multiple operations to hide stalls

• Draw calls provide a variety of nutrition


- Vertex work, raster work, tessellation, vitamins A-K, etc.

[Diagram: system timeline with CPU and GPU rows, each processing work items 0, 1 and 2]
[Diagram: GPU timeline with vertex and fragment rows, each processing work items 0, 1 and 2]


Not getting bitten
• GPU eating from lots of different plates
- Don’t touch anything it’s using!

• It doesn’t want a mouthful of beef choc chip ice cream


- Don’t change data whilst it’s accessing a resource

• Hey I’m eating that!


- Don’t delete resources whilst the GPU is still using them

On to the serious bits…



Terminology
• Operation
- Anything that can be executed
- Includes synchronization and memory barriers

• Execution Dependency
- Operations waiting on other operations
- All synchronization expresses these

• Memory Barrier
- Flush/invalidate caches
- Determination of access and visibility

• Memory Dependency
- Execution dependency involving a Memory Barrier

Note: Memory barrier does not mean quite the same thing as GL’s memory barrier, though there is some relation.


Synchronization Types
• 3 types of explicit synchronization in Vulkan

• Pipeline Barriers, Events and Subpass Dependencies
- Within a queue
- Explicit memory dependencies

• Semaphores
- Between Queues

• Fences
- Whole queue operations to CPU

Note: OpenGL has just two, very coarse synchronization primitives: memory barriers and fences. They are loosely similar to the equivalently named concepts in Vulkan.


Pipeline Barriers
• Pipeline Barriers
- Precise set of pipeline stages
- Memory Barriers to execute
- Single point in time

void vkCmdPipelineBarrier(
    VkCommandBuffer commandBuffer,
    VkPipelineStageFlags srcStageMask,
    VkPipelineStageFlags dstStageMask,
    VkDependencyFlags dependencyFlags,
    uint32_t memoryBarrierCount,
    const VkMemoryBarrier* pMemoryBarriers,
    uint32_t bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier* pBufferMemoryBarriers,
    uint32_t imageMemoryBarrierCount,
    const VkImageMemoryBarrier* pImageMemoryBarriers);

Note: Executing a pipeline barrier is roughly equivalent to a glMemoryBarrier call, though with much more control.


Events
• Events
- Same info as Pipeline Barriers
- …but operate over a range

void vkCmdSetEvent(
    VkCommandBuffer commandBuffer,
    VkEvent event,
    VkPipelineStageFlags stageMask);

void vkCmdResetEvent(
    VkCommandBuffer commandBuffer,
    VkEvent event,
    VkPipelineStageFlags stageMask);

void vkCmdWaitEvents(
    VkCommandBuffer commandBuffer,
    uint32_t eventCount,
    const VkEvent* pEvents,
    VkPipelineStageFlags srcStageMask,
    VkPipelineStageFlags dstStageMask,
    uint32_t memoryBarrierCount,
    const VkMemoryBarrier* pMemoryBarriers,
    uint32_t bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier* pBufferMemoryBarriers,
    uint32_t imageMemoryBarrierCount,
    const VkImageMemoryBarrier* pImageMemoryBarriers);

• CPU interaction
- No explicit CPU wait
- No Memory Barriers

VkResult vkSetEvent(
    VkDevice device,
    VkEvent event);

VkResult vkResetEvent(
    VkDevice device,
    VkEvent event);

VkResult vkGetEventStatus(
    VkDevice device,
    VkEvent event);

• Warning!
- OS may apply a timeout
- Set events soon after submission
- Could you just defer submission?


Pipeline Barriers vs Events
• Use pipeline barriers for point synchronization
- When the first operation immediately precedes the operation that depends on it
- May be more efficient than a set/wait event pair

• Use events if other work is possible between the two operations
- Set immediately after the first operation
- Wait immediately before the operation that depends on it

• Use events for CPU/GPU synchronization
- Memory accesses between processors
- Late latching of data to reduce latency


Memory Barrier Types
• Global Memory Barrier
- All memory-backed resources

• Buffer Barrier
- For a single buffer range

• Image Barrier
- For a single image subresource range

Note: OpenGL’s memory barriers imply execution dependencies, which Vulkan memory barriers do not – execution barriers are provided by a pipeline barrier, event or subpass dependency.


Global Memory Barriers
• Global Memory Barriers
- All memory used by accessed stages
- Effectively flushes entire caches

• Use when many resources transition
- Cheaper than one-by-one
- Don’t transition unnecessarily!

• User must define prior access
- Driver not tracking for you

typedef struct VkMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
} VkMemoryBarrier;


Buffer Barriers
• Buffer Barriers
- A single buffer range
- Defines access stages
- Defines queue ownership

• User must define prior access
- Driver not tracking for you

typedef struct VkBufferMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
    uint32_t srcQueueFamilyIndex;
    uint32_t dstQueueFamilyIndex;
    VkBuffer buffer;
    VkDeviceSize offset;
    VkDeviceSize size;
} VkBufferMemoryBarrier;


Image Barriers
• Image Barriers
- A single image subresource range
- Defines access stages
- Defines queue ownership
- Defines image layout

• User must define prior access
- Driver not tracking for you
- For images, this includes prior layout

• Appropriate layouts allow compression
- GPU may use image compression
- Saves bandwidth
- Use GENERAL instead of switching frequently

typedef struct VkImageMemoryBarrier {
    VkStructureType sType;
    const void* pNext;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
    VkImageLayout oldLayout;
    VkImageLayout newLayout;
    uint32_t srcQueueFamilyIndex;
    uint32_t dstQueueFamilyIndex;
    VkImage image;
    VkImageSubresourceRange subresourceRange;
} VkImageMemoryBarrier;

Subpass Dependencies
• Subpass dependencies
- Similar info to Pipeline Barriers
- Explicitly between two subpasses

• Memory barriers
- Implicit for attachments
- Explicit for other resources

• Pixel local dependencies
- Same fragment/sample location
- Cheap for most implementations
- Use region dependency flag:
- VK_DEPENDENCY_BY_REGION_BIT

typedef struct VkSubpassDependency {
    uint32_t srcSubpass;
    uint32_t dstSubpass;
    VkPipelineStageFlags srcStageMask;
    VkPipelineStageFlags dstStageMask;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
    VkDependencyFlags dependencyFlags;
} VkSubpassDependency;


Subpass Dependencies
• Subpass self-dependencies
- Subpasses can wait on themselves
- A pipeline barrier (vkCmdPipelineBarrier) in the subpass
- Declared with a VkSubpassDependency, as for any other subpass dependency

• Forward progress only
- Can’t wait on later stages
- Must wait on earlier or same stage

• Pixel local only between fragments
- Must use flag:
- VK_DEPENDENCY_BY_REGION_BIT


Subpass Dependencies
• Subpass external dependencies
- Wait on ‘external’ operations
- vkCmdWaitEvents in the subpass
- Events set outside the render pass

typedef struct VkSubpassDependency {
    uint32_t srcSubpass;
    uint32_t dstSubpass;
    VkPipelineStageFlags srcStageMask;
    VkPipelineStageFlags dstStageMask;
    VkAccessFlags srcAccessMask;
    VkAccessFlags dstAccessMask;
    VkDependencyFlags dependencyFlags;
} VkSubpassDependency;

void vkCmdWaitEvents(
    VkCommandBuffer commandBuffer,
    uint32_t eventCount,
    const VkEvent* pEvents,
    VkPipelineStageFlags srcStageMask,
    VkPipelineStageFlags dstStageMask,
    uint32_t memoryBarrierCount,
    const VkMemoryBarrier* pMemoryBarriers,
    uint32_t bufferMemoryBarrierCount,
    const VkBufferMemoryBarrier* pBufferMemoryBarriers,
    uint32_t imageMemoryBarrierCount,
    const VkImageMemoryBarrier* pImageMemoryBarriers);



Example – Texture Upload
// Transition the buffer from host write to transfer read
bufferBarrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;
bufferBarrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;

// Transition the image to transfer destination
imageBarrier.srcAccessMask = 0;
imageBarrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
imageBarrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
imageBarrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;

// No global memory barriers, one buffer barrier, one image barrier
vkCmdPipelineBarrier(commandBuffer, VK_PIPELINE_STAGE_HOST_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                     0, NULL, 1, &bufferBarrier, 1, &imageBarrier);

vkCmdCopyBufferToImage(commandBuffer, srcBuffer, image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &copy);

// Transition the image from transfer destination to shader read
imageBarrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
imageBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
imageBarrier.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
imageBarrier.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

// Only the image barrier is needed this time
vkCmdPipelineBarrier(commandBuffer, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, 0,
                     0, NULL, 0, NULL, 1, &imageBarrier);
Example – Compute to Draw Indirect
// Add a subpass dependency to express the wait on an external event
externalDependency.srcSubpass = VK_SUBPASS_EXTERNAL;
externalDependency.srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
externalDependency.dstStageMask = VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT;
externalDependency.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
externalDependency.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;

// Dispatch a compute shader that generates indirect command structures
vkCmdDispatch(...);

// Set an event that can be later waited on (same source stage).
vkCmdSetEvent(commandBuffer, event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT);

vkCmdBeginRenderPass(...);

// Transition the buffer from shader write to indirect command read
bufferBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
bufferBarrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
bufferBarrier.buffer = indirectBuffer;

// Wait on one event; pass a single buffer memory barrier
vkCmdWaitEvents(commandBuffer, 1, &event, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                0, NULL, 1, &bufferBarrier, 0, NULL);

vkCmdDrawIndirect(commandBuffer, indirectBuffer, ...);


Semaphores
• Semaphores
- Used to synchronize queues
- Not necessary for single-queue

• Fairly coarse grain
- Per submission batch
- E.g. a set of command buffers
- Multiple per submit command

• Implicit memory guarantees
- Effects visible to future operations on the same device
- Not guaranteed visible to host

typedef struct VkSubmitInfo {
    VkStructureType sType;
    const void* pNext;
    uint32_t waitSemaphoreCount;
    const VkSemaphore* pWaitSemaphores;
    const VkPipelineStageFlags* pWaitDstStageMask;
    uint32_t commandBufferCount;
    const VkCommandBuffer* pCommandBuffers;
    uint32_t signalSemaphoreCount;
    const VkSemaphore* pSignalSemaphores;
} VkSubmitInfo;



Example – Acquire and Present
// Acquire an image. Pass in a semaphore to be signalled
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, acquireSemaphore, VK_NULL_HANDLE, &imageIndex);

// Submit command buffers: wait on the acquire semaphore, signal the graphics semaphore
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &acquireSemaphore;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &graphicsSemaphore;

vkQueueSubmit(graphicsQueue, 1, &submitInfo, fence);

// Present images to the display once the graphics semaphore is signalled
presentInfo.waitSemaphoreCount = 1;
presentInfo.pWaitSemaphores = &graphicsSemaphore;
presentInfo.swapchainCount = 1;
presentInfo.pSwapchains = &swapchain;
presentInfo.pImageIndices = &imageIndex;

vkQueuePresentKHR(presentQueue, &presentInfo);



Example – Acquire and Present (same queue)
// Acquire an image. Pass in a semaphore to be signalled
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, acquireSemaphore, VK_NULL_HANDLE, &imageIndex);

// Submit command buffers
submitInfo.waitSemaphoreCount = 1;
submitInfo.pWaitSemaphores = &acquireSemaphore;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
submitInfo.signalSemaphoreCount = 0;

vkQueueSubmit(universalQueue, 1, &submitInfo, fence);

// Present images to the display
presentInfo.waitSemaphoreCount = 0;
presentInfo.swapchainCount = 1;
presentInfo.pSwapchains = &swapchain;
presentInfo.pImageIndices = &imageIndex;

vkQueuePresentKHR(universalQueue, &presentInfo);



Fences
• Fences
- Used to synchronize queue to CPU

• Very coarse grain
- Per queue submit command

• Implicit memory guarantees
- Effects visible to future operations on the same device
- Not guaranteed visible to host

GL’s fences are like a combination of a semaphore and a fence in Vulkan – they can synchronize GPU and CPU in multiple ways at a coarse granularity.

VkResult vkQueueSubmit(
    VkQueue queue,
    uint32_t submitCount,
    const VkSubmitInfo* pSubmits,
    VkFence fence);

VkResult vkResetFences(
    VkDevice device,
    uint32_t fenceCount,
    const VkFence* pFences);

VkResult vkGetFenceStatus(
    VkDevice device,
    VkFence fence);

VkResult vkWaitForFences(
    VkDevice device,
    uint32_t fenceCount,
    const VkFence* pFences,
    VkBool32 waitAll,
    uint64_t timeout);
Example – Multi-buffering
// Have enough resources and fences to have one per in-flight-frame, usually the swapchain image count
VkBuffer buffers[swapchainImageCount];
VkFence fence[swapchainImageCount];

// Can use the index from the presentation engine - 1:1 mapping between swapchain images and resources
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, semaphore, VK_NULL_HANDLE, &nextIndex);

// Make absolutely sure that the work has completed
// (create the fences with VK_FENCE_CREATE_SIGNALED_BIT so the very first wait returns)
vkWaitForFences(device, 1, &fence[nextIndex], VK_TRUE, UINT64_MAX);

// Reset the fences we waited on, so they can be re-used
vkResetFences(device, 1, &fence[nextIndex]);

// Change the data in your per-frame resources (with appropriate events/barriers!)
...

// Submit any work to the queue, with those fences being re-used for the next time around
vkQueueSubmit(graphicsQueue, 1, &submitInfo, fence[nextIndex]);



Wait Idle
• Ensures execution completes
- VERY heavy-weight

• vkQueueWaitIdle
- Wait for queue operations to finish
- Equivalent to waiting on a fence

• vkDeviceWaitIdle
- Waits for device operations to finish
- Includes vkQueueWaitIdle for queues

These are a lot like glFinish, and should be treated similarly – use them VERY SPARINGLY.

VkResult vkQueueSubmit(
    VkQueue queue,
    uint32_t submitCount,
    const VkSubmitInfo* pSubmits,
    VkFence fence);

VkResult vkResetFences(
    VkDevice device,
    uint32_t fenceCount,
    const VkFence* pFences);

VkResult vkGetFenceStatus(
    VkDevice device,
    VkFence fence);

VkResult vkWaitForFences(
    VkDevice device,
    uint32_t fenceCount,
    const VkFence* pFences,
    VkBool32 waitAll,
    uint64_t timeout);



Wait Idle
• Useful primarily at teardown
- Use it to quickly ensure all work is done

• Favour other synchronization at all other times


- Extremely heavyweight, will cause serialization!



Programmer Guidelines
• Specify EXACTLY the right amount of synchronization
- Too much and you risk starving your GPU
- Miss any and your GPU will bite you

• Use the validation layers to help!


- Won’t catch everything yet, but improving over time

• Pay particular attention to the pipeline stages


- Fiddly but become intuitive as you use them

• Consider Image Layouts


- If your GPU can save bandwidth it will

• Different behaviour depending on implementation


- Test/Tune on every platform you can find!
Keep your GPU fed without getting bitten!

Questions?



Swapchains Unchained!
(What you need to know about Vulkan WSI)
Alon Or-bach, Chair, Vulkan System
Integration Sub-Group – May 2016
@alonorbach (disclaimers apply!)
Intro to Vulkan Window System Integration
• Explicit control for acquisition and presentation of images
- Designed to fit the Vulkan API and today’s compositing window systems
• Not all extensions are supported by every platform
- You MUST check and enable the extensions your app/engine uses!!!
• Today’s presentation should help you get presentation working
- Learn how to present through a swapchain
- Overview of Vulkan objects used by the WSI extensions

WSI Jargon Buster
• Platform: Our terminology for an OS / window system e.g. Android, Windows, Wayland, X11 via XCB
• Presentation Engine: The platform’s compositor or display engine
• Application: Your app or game engine



How many WSI extensions are there?
• Two cross-platform instance extensions
- VK_KHR_surface
- VK_KHR_display
• Six (platform) instance extensions
- VK_KHR_android_surface
- VK_KHR_mir_surface
- VK_KHR_wayland_surface
- VK_KHR_win32_surface
- VK_KHR_xcb_surface
- VK_KHR_xlib_surface
• Two cross-platform device extensions
- VK_KHR_swapchain
- VK_KHR_display_swapchain



Vulkan Surfaces
• VkSurfaceKHR
- Vulkan’s way to encapsulate a native window / surface
• Platform-independent surface queries
- Find out crucial information about your surface’s properties
- Such as format, transform, image usage
- Some platforms provide additional queries
• Presentation support is per queue family
- An implementation may support multiple platforms e.g. both xlib and xcb
- Or may not support presentation at all

Unlike an EGLSurface, creating a Vulkan Surface doesn’t mean you’ve got your render targets created …yet

[Diagram: Physical Devices A, B and C, each with its own queue families, connected to Platform X and Platform Y]

Vulkan Swapchains: VK_KHR_swapchain
• Array of presentable images associated with a surface
- Application requests a minimum number of presentable images
- Implementation creates at least that number
- Implementation may have a limit
• Upfront allocation of presentable images
- No allocation hitching at crucial moment
- Pre-record fixed content command buffers
• Present mode determines behavior
- FIFO support mandatory
- Platforms can offer mailbox, immediate, FIFO relaxed

const VkSwapchainCreateInfoKHR createInfo =
{
    VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR, // sType
    NULL,                                        // pNext
    0,                                           // flags
    mySurface,                                   // surface
    desiredNumberOfPresentableImages,            // minImageCount
    surfaceFormat,                               // imageFormat
    surfaceColorSpace,                           // imageColorSpace
    myExtent,                                    // imageExtent
    1,                                           // imageArrayLayers
    VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,         // imageUsage
    VK_SHARING_MODE_EXCLUSIVE,                   // imageSharingMode
    0,                                           // queueFamilyIndexCount
    NULL,                                        // pQueueFamilyIndices
    surfaceProperties.currentTransform,          // preTransform
    VK_COMPOSITE_ALPHA_INHERIT_BIT_KHR,          // compositeAlpha
    swapchainPresentMode,                        // presentMode
    VK_TRUE,                                     // clipped
    VK_NULL_HANDLE                               // oldSwapchain
};

FIFO is like eglSwapInterval = 1
Mailbox/Immediate is like eglSwapInterval = 0
FIFO relaxed is like EXT_swap_control_tear
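
The image-count negotiation described on this slide can be sketched in plain C. This is an illustrative sketch only: `SurfaceImageLimits` and `choose_min_image_count` are hypothetical names, standing in for the `minImageCount` / `maxImageCount` fields an application would read from `VkSurfaceCapabilitiesKHR`, where a maximum of 0 means the implementation imposes no upper limit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the two limits reported for a surface. */
typedef struct SurfaceImageLimits {
    uint32_t min_image_count;
    uint32_t max_image_count; /* 0 means "no upper limit" */
} SurfaceImageLimits;

/* Clamp the number of images the application would like to the range
 * the implementation supports. The result is what would be passed as
 * minImageCount when creating the swapchain. */
uint32_t choose_min_image_count(uint32_t desired, SurfaceImageLimits limits)
{
    assert(limits.max_image_count == 0u ||
           limits.max_image_count >= limits.min_image_count);

    uint32_t count = desired;
    if (count < limits.min_image_count) {
        count = limits.min_image_count;
    }
    if (limits.max_image_count != 0u && count > limits.max_image_count) {
        count = limits.max_image_count;
    }
    return count;
}
```

The implementation may still create more images than requested, so the application should query the actual count after creating the swapchain rather than assume it.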
Vulkan Swapchains: They’re good!
• Application knows which image within a swapchain it is presenting
- Content of image preserved between presents
• Application is responsible for explicitly recreating swapchains - no surprises
- Platform informs app if current swapchain is:
- Suboptimal: e.g. after window resize, swapchain still usable for present via image scaling
- Surface Lost: swapchain no longer usable for present
- Application is responsible for creating a new swapchain

Similar to, but neater than, how EGL_KHR_partial_update / EGL_EXT_buffer_age and preserved behavior achieve this

In EGL, the EGLSurface may be resized by the platform after an eglSwapBuffers call.
Vulkan requires the application to intervene
Vulkan Swapchains: They’re jolly good!
• Presenting and acquiring are separate operations
- No need to submit a new image to acquire another one, unless presentation engine cannot release it
• Application must only modify presentable images it has acquired
• Presentation engine must only display presentable images that have been presented!

Stalls in frame loop are very bad!

In EGL, calling eglSwapBuffers both presents the current back buffer and acquires a new one
Vulkan splits this up into separate operations



Steps to set up your presentable images
1 – Create a native window/surface (platform-specific APIs)
2 – Create a Vulkan surface (VK_KHR_<platform>_surface)
3 – Query information about your surface (VK_KHR_surface)
4 – Create a Vulkan swapchain (VK_KHR_swapchain)
5 – Get your presentable images (VK_KHR_swapchain)
Vulkan Frame Loop – as easy as 1-2-3!

0 – Create your swapchain
1 – Acquire the next presentable image
2 – Submit command buffer(s) for that image
3 – Present the image

[Diagram: steps 1 to 3 form a loop labelled VK_KHR_swapchain. Legend: Setup; Steady-state; Response to suboptimal / surface_lost]



Vulkan Displays: VK_KHR_display
• Vulkan’s way to discover display devices (screens, panels) outside a window system
- Reminder: Not supported on all platforms
• Defines VkDisplayKHR and VkDisplayModeKHR objects
- Represent the display devices and the modes they support connected to a VkPhysicalDevice
- Determine if a display supports multiple planes that are blended together
• Enables creation of a VkSurfaceKHR to represent a display plane

A Vulkan display represents an actual display!
(Whereas an EGLDisplay is actually just a connection to a driver – like a Vulkan Device)

[Diagram: a physical device with planes 0 to 2, connected to Display 0 and Display 1, each with display modes 0 and 1; a surface is attached to a plane]
VK_KHR_display_swapchain
• Extends the information provided at vkQueuePresentKHR
- What region to present from the swapchain image
- What region to present to on the display
- Whether the display should persist the image
• Adds ability to create a shared swapchain
- Swapchain that takes multiple VkSwapchainCreateInfoKHR structs
- Allows multiple displays to be presented to simultaneously
- No guarantee that presents are atomic ...presently!



Any questions?
[email protected]
@alonorbach
Moving To Vulkan
Asynchronous Compute
Chris Hebert, Dev Tech Software Engineer, Professional Visualization
Who am I?
Chris Hebert
@chrisjhebert

Dev Tech Software Engineer- Pro Vis


20 years in the industry
Joined NVIDIA in March 2015.
Real time graphics makes me happy
I also like helicopters
Chris Hebert - Circa 1974

NVIDIA/KHRONOS CONFIDENTIAL

Agenda
• Some Context
• Sharing The Load
• Pipeline Barriers


Some Context

GPU Architecture
In a nutshell
[Diagram: NVIDIA Maxwell 2 SM, showing the register file, cores and load/store units]

Execution Model
Thread Hierarchies

[Diagram: the logical view of a work group maps to the hardware view on an SMM as warps of 32 threads each]

Resource Partitioning
Resources Are Limited

Key resources impacting local execution:

• Program Counters (partitioned amongst threads)
• Registers (partitioned amongst threads)
• Shared Memory (partitioned amongst work groups)

e.g. GTX 980 Ti
64k 32-bit registers per SM
96 KB shared memory per SM

Resource Partitioning
Registers

The more registers a kernel uses, the fewer warps can be resident on the SM

Fewer registers: more threads
More registers: fewer threads

Resource Partitioning
Shared Memory

The more shared memory a work group uses, the fewer work groups can be resident on the SM

Less SMEM: more groups
More SMEM: fewer groups
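
As a rough illustration of the two partitioning slides, the sketch below divides the per-SM budgets quoted earlier (64k 32-bit registers and 96 KB of shared memory) by a kernel's usage. The function names are made up for this example, 64k is taken to mean 65536, and real hardware applies further limits (maximum resident warps, allocation granularity) that are ignored here.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE            32u            /* threads per warp */
#define REGISTERS_PER_SM     65536u         /* "64k 32-bit registers per SM" */
#define SHARED_BYTES_PER_SM  (96u * 1024u)  /* "96 KB shared memory per SM" */

/* Upper bound on resident warps if registers were the only limit. */
uint32_t warps_limited_by_registers(uint32_t registers_per_thread)
{
    assert(registers_per_thread > 0u);
    return REGISTERS_PER_SM / (registers_per_thread * WARP_SIZE);
}

/* Upper bound on resident work groups if shared memory were the only limit. */
uint32_t groups_limited_by_shared_memory(uint32_t shared_bytes_per_group)
{
    assert(shared_bytes_per_group > 0u);
    return SHARED_BYTES_PER_SM / shared_bytes_per_group;
}
```

Doubling the registers per thread halves the register-limited warp count, which is the trade-off the two slides draw.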

Keeping It Moving
Occupancy

• Some small kernels may have low occupancy


• Depending on the algorithm
• Compute resources are limited
• Shared across threads or work groups on a per SM basis
• Warps stall when they have to wait for resources
• This latency can be hidden
• If there are other warps ready to execute.

Keeping It Moving
Occupancy – Simple Theoretical Example

• Simple kernel that updates positions of 20480 particles


• 1 FMAD - ~20 cycles (instruction latency)
• 20480 particles = 640 warps
• To hide this latency, according to Little’s Law
• Required Warps = Latency x Throughput
• Throughput should be 32 threads * 16 SMs = 512 to keep GPU busy
• Required warps is 20*512 = 10240
• ….oh….

Keeping It Moving
Occupancy – Simple Theoretical Example

• Simple kernel that updates positions of 20480 particles


• 1 FMAD - ~20 cycles (instruction latency)
• 20480 particles = 640 warps
• To hide this latency, according to Little’s Law – but only on 1 SM...
• Required Warps = Latency x Throughput
• Throughput should be 32 threads * 1 SM = 32 to keep GPU busy
• Required warps is 20*32 = 640
• And we theoretically have 15 SMs to use for other stuff.
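
The arithmetic on these two slides can be reproduced with a few lines of C. This is only a sketch of the slide's own numbers: the function names are invented for the example, throughput is taken as 32 lanes per SM per cycle as on the slide, and the product of latency and throughput is returned in the same unit the slide uses.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE 32u /* threads per warp */

/* Number of warps needed to cover `count` work items, rounded up. */
uint32_t warps_for_items(uint32_t count)
{
    return (count + WARP_SIZE - 1u) / WARP_SIZE;
}

/* Little's Law as applied on the slide:
 * required parallelism = latency x throughput,
 * with throughput = WARP_SIZE lanes on each of sm_count SMs. */
uint32_t required_in_flight(uint32_t latency_cycles, uint32_t sm_count)
{
    assert(sm_count > 0u);
    return latency_cycles * WARP_SIZE * sm_count;
}
```

With a latency of 20 cycles, `required_in_flight(20, 16)` gives the 10240 from the first slide and `required_in_flight(20, 1)` gives the 640 from the second, matching `warps_for_items(20480)`.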

Queuing It Up
Working with 1 Queue
• Scheduler will distribute work across all SMs
• Kernels execute in sequence (there may be some overlap)
• Low occupancy kernels will waste GPU time

[Diagram: command buffers feed a single command queue that runs kernels one after another; transfers arrive in a separate command buffer]


Sharing The Load

Queuing It Up
Working with N Queues

• NVIDIA hardware gives you 16 all-powerful queues
• 1 queue family that supports all operations
• 16 queues available for use

[Diagram: command buffers submitted to command queues #1, #2 and #3, each running its own sequence of kernels]

Queuing It Up
Working with N Queues

• Application decides which queues for which kernels
• Load balance for best performance
• Profile (Nsight) to gain insights

[Diagram: command buffers submitted to command queues #1, #2 and #3, each running its own sequence of kernels]

Queuing It Up
Compute and Graphics In Harmony

• Some hardware can even run compute and graphics work concurrently
• Needs fast context switching and at high granularity (not just at draw commands)
• Simple Graphics work tends to have high occupancy
• Complex graphics work can reduce occupancy
• Profile for performance insights

Queuing It Up
Compute and Graphics In Harmony

• Profile to understand occupancy of both graphics and compute workloads
• Queues can support both compute and graphics

[Diagram: compute and graphics command buffers submitted to command queues #1, #2 and #3, each running its own sequence of kernels]

An Example
Compute and Graphics In Harmony

Free Surface Navier Stokes Solver


• 11 Compute Kernels
• 4 Shaders

• The output of each kernel is the input to the next


• Some kernels have very low occupancy
• Still opportunities for concurrency with compute

An Example
Many discretized operations are separable

Examples
• Fluid Sims
• Gaussian Blurs
• Convolution Kernels

[Diagram: one command queue processes the X axis (and half the Z), a second processes the Y axis (and the other half of Z); the driver handles dispatching groups to the SMs; use semaphores to synchronize]

An Example
Compute and graphics run concurrently
[Diagram: compute work and graphics work on separate command queues share the SMs, pipelined across frames N to N+4 and synchronized with a semaphore]
An Example
Putting it all together
[Diagram: two compute command queues (process X axis and half the Z; process Y axis and the other half of Z) and a graphics command queue share the SMs, pipelined across frames N to N+4 and synchronized with semaphores]
Memory Transfers
More opportunity for concurrency

• Memory transfers are handled by the MMU
• Can run concurrently with kernels
• As long as the current kernel isn’t using the memory

Why do this?

[Diagram: command queue #1 alternating Kernel and Transfer; the MMU may be idle during kernels and the ALUs may be idle during transfers]

Memory Transfers
More opportunity for concurrency

When you can do this
• DtoH and HtoD transfers can run concurrently

Examples
• Large image processing
• Video processing

[Diagram: a host-to-device queue, a compute queue and a device-to-host queue running transfers and kernels in parallel]


Conclusion
Takeaways

There is more than 1 queue available


Keep registers and shared memory to a minimum
Low occupancy leads to an underutilized GPU
Maximize GPU utilization by running kernels concurrently
Profile to understand the occupancy profiles of kernels and shaders
Some hardware can run kernels AND shaders concurrently
Use Semaphores to synchronize between queues
Be sensible at the beer festival


Thank You Enjoy Vulkan!!

Questions?
Chris Hebert, Dev Tech Software Engineer, Professional Visualization
Porting to Vulkan
Hans-Kristian Arntzen
Engineer, ARM
(Credit for slides: Marius Bjørge)
Agenda
• API flashback
• Engine design
- Command buffers
- Pipelines
- Render passes
- Memory management



API Flashback

[Diagram: two Application / Driver stacks side by side, labelled "Logic shift"]



API Flashback
[Diagram: vkDevice owns vkQueue and vkCommandPool; a vkCommandBuffer records vkBeginRenderPass, vkCmdBindXXX, vkCmdBindPipeline, vkCmdBindDescriptorSets, vkCmdDraw and vkEndRenderPass. The commands reference vkRenderPass (vkFramebuffer, vkImageView), vkBuffer (vkBufferView), vkPipeline (state, shaders, vkRenderPass) and vkDescriptorSet (vkImageView, vkSampler), backed by vkDeviceMemory (heap) and vkDescriptorPool]



Porting from OpenGL to Vulkan?
• Most graphics engines today are designed around the principles of implicit driver
behaviour
- A direct port to Vulkan won’t necessarily give you a lot of benefits

• Approach it differently
- Re-design for Vulkan, and then port that to OpenGL



Allocating Memory
• Memory is first allocated and then bound to Vulkan objects
- Different Vulkan objects may have different memory requirements
- Allows for aliasing memory across different Vulkan objects
• Driver does no ref counting of any objects in Vulkan
- Cannot free memory until you are sure it is never going to be used again
- Also applies to API handles!

• Most of the memory allocated during run-time is transient


- Allocate, write and use in the same frame
- Block based memory allocator



Block Based Memory Allocator
• Relaxes memory reference counting
• Only entire blocks are freed/recycled
• Sub-allocations take refcount on block
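
A minimal sketch of the idea on this slide, assuming a simple bump allocator inside each block. It tracks offsets only and never touches real device memory; `Block`, `block_alloc` and `block_release` are hypothetical names, not part of the Vulkan API or of the engine described in the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_ALLOC_FAILED ((size_t)-1)

typedef struct Block {
    size_t   size;     /* total bytes in the block */
    size_t   offset;   /* next free byte (bump pointer) */
    uint32_t refcount; /* number of live sub-allocations */
} Block;

void block_init(Block *block, size_t size)
{
    block->size = size;
    block->offset = 0;
    block->refcount = 0;
}

/* Sub-allocate `bytes` with a power-of-two alignment.
 * Each successful sub-allocation takes a reference on the block. */
size_t block_alloc(Block *block, size_t bytes, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    size_t start = (block->offset + alignment - 1) & ~(alignment - 1);
    if (start > block->size || bytes > block->size - start) {
        return BLOCK_ALLOC_FAILED;
    }
    block->offset = start + bytes;
    block->refcount++;
    return start;
}

/* Drop one reference. Only when the last reference goes away is the
 * whole block recycled; returns 1 in that case, 0 otherwise. */
int block_release(Block *block)
{
    assert(block->refcount > 0);
    if (--block->refcount == 0) {
        block->offset = 0;
        return 1;
    }
    return 0;
}
```

In a real engine the release would be deferred until the GPU has finished with the frame that used the block, for example after the corresponding fence has signalled.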



Command Buffers
• Request command buffers on the fly
- Allocated using ONE_TIME_SUBMIT_BIT
- Recycled

• Separate command pools per


- Thread
- Frame
- Primary/secondary
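
One way to organise the pools listed above is a flat array indexed by thread, frame and command buffer level. The sketch below only computes indices; the names and the layout are assumptions made for illustration, and the actual pools would be created with vkCreateCommandPool.

```c
#include <assert.h>
#include <stddef.h>

typedef enum CommandLevel {
    LEVEL_PRIMARY = 0,
    LEVEL_SECONDARY = 1,
    LEVEL_COUNT
} CommandLevel;

/* Total number of pools needed for the given configuration. */
size_t command_pool_count(size_t threads, size_t frames_in_flight)
{
    return threads * frames_in_flight * (size_t)LEVEL_COUNT;
}

/* Index of the pool owned by (thread, frame, level) in a flat array. */
size_t command_pool_index(size_t thread, size_t frame, CommandLevel level,
                          size_t frames_in_flight)
{
    assert(frame < frames_in_flight);
    assert(level < LEVEL_COUNT);
    return (thread * frames_in_flight + frame) * (size_t)LEVEL_COUNT
           + (size_t)level;
}
```

Because every thread owns its pools, recording needs no locking, and all pools belonging to one frame can be reset together once that frame's work has completed.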



Secondary Command Buffers

[Diagram: the main thread owns a vkCommandPool and a primary vkCommandBuffer that records vkBeginRenderPass, vkCmdExecuteCommands and vkEndRenderPass; threads 0, 1 and 2 each own a vkCommandPool and record a secondary command buffer]



Shaders
• Standardize on SPIR-V binary shaders
• Extensively use the Khronos SPIRV-Cross library
- Cross compiling back to GLSL
- Provides shader reflection for
- Vertex attributes
- Subpass attachments
- Pipeline layouts
- Push constants



Pipelines

Pipeline state
• Dynamic state
• Shaders
• Render pass
• Blend state
• Pipeline layout
• Rasterizer state
• Vertex input
• Depth/stencil state
• Input assembly



Pipelines
• Not trivial to create all required pipeline state objects upfront

• Our approach:
- Keep track of all pipeline state per command buffer
- Flush pipeline creation when required
- In our case this is implemented as an async operation

[Diagram: the public interface on the command buffer is SetRenderState(), SetShaders(), SetVertexBuffer(), SetIndexBuffer() and Draw(); internally Draw() triggers Flush, RequestPipeline and CreateNewPipeline]
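
The request-or-create step on this slide can be sketched as a small hash table keyed by the tracked state. Everything here is hypothetical: `PipelineKey` stands in for whatever state the engine tracks, the returned handle is just a counter standing in for a real pipeline object, and creation is synchronous whereas the talk describes an asynchronous implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SLOTS 64u

/* Hypothetical summary of the state that selects a pipeline.
 * All fields are uint32_t, so the struct has no padding bytes. */
typedef struct PipelineKey {
    uint32_t shader_id;
    uint32_t blend_state;
    uint32_t raster_state;
    uint32_t depth_stencil_state;
    uint32_t render_pass_id;
} PipelineKey;

typedef struct PipelineTable {
    PipelineKey keys[TABLE_SLOTS];
    uint32_t    handles[TABLE_SLOTS]; /* 0 marks an empty slot */
    uint32_t    created;              /* number of pipelines created */
} PipelineTable;

void pipeline_table_init(PipelineTable *table)
{
    memset(table, 0, sizeof *table);
}

/* FNV-1a hash over the bytes of the key. */
uint32_t hash_key(const PipelineKey *key)
{
    const unsigned char *bytes = (const unsigned char *)key;
    uint32_t hash = 2166136261u;
    for (size_t i = 0; i < sizeof *key; ++i) {
        hash ^= bytes[i];
        hash *= 16777619u;
    }
    return hash;
}

/* Return the handle for `key`, creating a new entry on first use.
 * Returns 0 if the table is full. Uses linear probing. */
uint32_t request_pipeline(PipelineTable *table, const PipelineKey *key)
{
    assert(table != NULL && key != NULL);
    uint32_t slot = hash_key(key) % TABLE_SLOTS;
    for (uint32_t probe = 0; probe < TABLE_SLOTS; ++probe) {
        uint32_t i = (slot + probe) % TABLE_SLOTS;
        if (table->handles[i] == 0u) {
            table->keys[i] = *key;
            table->handles[i] = ++table->created; /* stand-in for creation */
            return table->handles[i];
        }
        if (memcmp(&table->keys[i], key, sizeof *key) == 0) {
            return table->handles[i];
        }
    }
    return 0u;
}
```

Requesting the same state twice returns the same handle and creates only one pipeline, which is the property the flush step relies on.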



Pipelines
• In an ideal world…
- All pipeline combinations should be created upfront

• …but this requires detailed knowledge of every potential shader/state combination that
you might have in your scene
- As an example, one of our fragment shaders has ~9000 combinations
- Every one of these shaders can use different render state
- We also have to make sure the pipelines are bound to compatible render passes
- An explosion of combinations!



Pipeline cache
• Vulkan has built-in support for pipeline caching
- Store to disk and re-use on next run

• Can also speed up pipeline creation during run-time


- If the pipeline state is already in the cache it can be re-used

[Diagram: pipeline state (dynamic state, shaders, render pass, blend state, pipeline layout, rasterizer state, vertex input, depth/stencil state, input assembly) feeds a vkPipelineCache, which is stored to and loaded from disk]



Pipeline layout
• Defines what kind of resources are in each binding slot in your shaders
- Textures, samplers, buffers, push constants, etc
• Can be shared among different pipeline objects



Pipeline layout
• Use SPIRV-Cross to automatically get binding information from SPIR-V shaders

[Diagram: a SPIR-V shader passes through SPIRV-Cross to produce the pipeline layout, made up of descriptor set layouts and push constant ranges]



Descriptor Sets
• Textures, uniform buffers, etc. are bound to shaders in descriptor sets
- Hierarchical invalidation
- Order descriptor sets by update frequency

• Ideally all descriptors are pre-baked during level load


- Keep track of low level descriptor sets per material
- But, this is not trivial



Descriptor Sets
• Our solution:
- Keep track of bindings and update descriptor sets when necessary
- Keep cache of descriptor sets used with immutable Vulkan objects

[Diagram: the public interface on the command buffer is SetShaders(), SetConstantData(), SetTexture() and Draw(); internally Draw() requests cached descriptor sets, allocates descriptor sets from the descriptor pool, writes descriptor sets according to the descriptor set layouts, then calls BindDescriptorSets]



Descriptor Set emulation
• We also need to support this in OpenGL

• Our solution:
- Emulate descriptor sets in our OpenGL backend
- SPIRV-Cross collapses and serializes bindings



Descriptor Set emulation
Shader descriptor sets
Set 0: 0 GlobalVSData, 1 GlobalFSData
Set 1: 0 MeshData
Set 2: 0 MaterialData, 1 TexAlbedo, 2 TexNormal, 3 TexEnvmap

SPIR-V library to GLSL

Uniform block bindings: 0 GlobalVSData, 1 GlobalFSData, 2 MeshData
Texture bindings: 0 TexAlbedo, 1 TexNormal, 2 TexEnvmap
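
The collapsing shown in this diagram can be sketched as follows. This is not SPIRV-Cross code; it is a hypothetical stand-alone illustration that walks resources in (set, binding) order and hands out consecutive OpenGL binding points per resource type. The placement of the textures in set 2 and the binding given to MaterialData are assumptions, since the slide does not spell them out.

```c
#include <assert.h>
#include <stddef.h>

typedef enum ResourceKind {
    KIND_UNIFORM_BLOCK = 0,
    KIND_TEXTURE = 1,
    KIND_COUNT
} ResourceKind;

typedef struct Resource {
    unsigned     set;        /* Vulkan descriptor set index */
    unsigned     binding;    /* binding within that set */
    ResourceKind kind;
    const char  *name;
    unsigned     gl_binding; /* output: flattened OpenGL binding point */
} Resource;

/* Assign consecutive binding points per resource kind.
 * `resources` must already be sorted by (set, binding). */
void flatten_bindings(Resource *resources, size_t count)
{
    unsigned next[KIND_COUNT] = {0};
    for (size_t i = 0; i < count; ++i) {
        assert(resources[i].kind < KIND_COUNT);
        resources[i].gl_binding = next[resources[i].kind]++;
    }
}
```

Fed the resources from the slide, this assigns MeshData uniform block binding 2 and TexEnvmap texture binding 2, matching the table above.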



Push Constants
• Push constants replace non-opaque uniforms
- Think of them as small, fast-access uniform buffer memory
• Update in Vulkan with vkCmdPushConstants
• Directly mapped to registers on Mali GPUs

// New
layout(push_constant, std430) uniform PushConstants {
mat4 MVP;
vec4 MaterialData;
} RegisterMapped;

// Old, no longer supported in Vulkan GLSL


uniform mat4 MVP;
uniform vec4 MaterialData;



Push Constant Emulation
• But again, we need to support OpenGL as well

• Our solution:
- Use SPIRV-Cross to turn push constants into regular non-opaque uniforms
- Logic in our OpenGL/Vulkan backends redirects the push constant data appropriately



Render pass
• Used to denote beginning and end of rendering to a framebuffer

• Can be re-used but must be compatible
- Attachments: Framebuffer format, image layout, MSAA?
- Subpasses
- Attachment load/store

[Diagram: a render pass with DepthStencil and color target attachments]

[Diagram: the public BeginRenderPass call on the command buffer maps internally to RequestFramebuffer, RequestRenderPass, CreateCompatibleRenderPass, CreateFramebuffer and BeginRenderPass]



Subpass Inputs
• Vulkan supports subpasses within render passes
• Standardized GL_EXT_shader_pixel_local_storage!
• Also useful for desktop GPUs

// GLSL
#extension GL_EXT_shader_pixel_local_storage : require
__pixel_local_inEXT GBuffer {
layout(rgba8) vec4 albedo;
layout(rgba8) vec4 normal;
...
} pls;

// Vulkan
layout(input_attachment_index = 0) uniform subpassInput albedo;
layout(input_attachment_index = 1) uniform subpassInput normal;
...



Subpass Input Emulation
• Supporting subpasses in GL is not trivial, and probably not feasible on a lot of
implementations

• Our solution:
- Use SPIRV-Cross to rewrite subpass inputs to Pixel Local Storage variables or texture
lookups
- This will only support a subset of the Vulkan subpass features, but good enough for our
current use



Synchronization
• Submitted work is completed out of order by the GPU
• Dependencies must be tracked by the application and handled explicitly
- Using output from a previous render pass
- Using output from a compute shader
- Etc
• Synchronization primitives in Vulkan
- Pipeline barriers and events
- Fences
- Semaphores



Render passes and pipeline barriers
• Most of the time the application knows upfront how the output of a render pass is going to be used afterwards
• Internally we have a couple of usage flags that we assign to a render pass
- On EndRenderPass we implicitly trigger a pipeline barrier

[Diagram: the public interface on the command buffer is BeginRenderPass (with render pass usage flags), DrawSomething and EndRenderPass; the usage flags determine the pipeline stages and memory domains; internally EndRenderPass issues vkCmdEndRenderPass followed by vkCmdPipelineBarrier]



Image Layout Transitions
• Must match how the image is used at any time
• Pedantic or relaxed
- Some implementations will require careful tracking of previous and new layout to achieve
optimal performance
- For Mali we can be quite relaxed with this – most of the time we can keep the image
layout as VK_IMAGE_LAYOUT_GENERAL



Summary
• Don’t allocate or release during runtime
• Batching still applies
• Multi-thread your code!
• Use push-constants as much as possible
• Multi-pass is fantastic on mobile GPUs

