Vulkan 101
Tom Olson
Director, Graphics Research, ARM
Chair, Vulkan Working Group
© Copyright Khronos Group 2016 - Page 8
What is Vulkan?
• A 3D graphics API for the next 20 years
- Logical successor to OpenGL / OpenGL ES
- Modern, efficient design
- An open, industry-controlled standard
• Here, now
- Released in February 2016
- Available today for Windows / Linux
- Shipping in Samsung Galaxy S7
- Support announced in Android ‘N’
• Different!
- Fundamental change in philosophy
- Requires corresponding changes in applications
Why did we do this?
• Traditional APIs had issues…
• Developers weren’t happy
http://www.joshbarczak.com/blog/?p=154
http://richg42.blogspot.com/2014/05/things-that-drive-me-nuts-about-opengl.html
• CPU intensive
- Lots of state validation, dependency tracking
• Fundamentally single-threaded
- Can’t use multi-core CPUs effectively
[Diagram: Vulkan object overview: Instance, Device, Queues, Memory, Command Buffers, Render Pass, Descriptor Sets, Pipeline, Shaders, Draw Call, Copy, Sync]
Michael Jesse
• No driver magic!
• Extensions
- Must be enabled at initialization time
• All at https://github.com/KhronosGroup
• Opportunities
- Much lower driver overhead
- …which you can spread across multiple threads
- More predictable performance
- Mobile friendly
• Realities
- Ecosystem is still immature
- Will need to ship GL/DX versions for years to come
© Imagination Technologies
Command Buffers – Pooling Resource
Command Buffers always belong to a Command Pool
Buffers are allocated from pools
Pools provide lightweight synchronisation
Pools can be reset, reclaiming all resources
Two flavours of pool:
Individual reset of command buffers
Group reset only
Command Buffers – Going wide
Thread 1 VkCommandBuffer
Thread 2 VkCommandBuffer
Thread N VkCommandBuffer
Command Buffers – Command Types
Command Buffers – Transfers
Command Buffers – “Inside” or “Out”
Command Buffers – Secondaries
Command Buffers – Reuse
Camera
Command Buffers – Lifetime
Ownership
CPU GPU
Allocated
Begin
Pipelines - An anatomy
VI IA VS CS TS ES GS VP RS MS DS FS CB
Pipelines – Fixed Function States
VI IA VS CS TS ES GS VP RS MS DS FS CB
• Everything that isn’t a shader
• Buffer formats/layouts
• Fixed function states:
- VertexInput
- InputAssembly
- Tessellation
- Viewport
- Raster
- Multisample
- DepthStencil
- ColorBlend
Pipelines – Shader Stages
VI IA VS CS TS ES GS VP RS MS DS FS CB
Pipelines – Descriptor Layout
Pipelines – Dynamic State
• Per-draw state
- Tedious to compile each one
- Combinatorial explosion
• Dynamic state!
- Opt-in
- Only use when required
• States that can be dynamic:
- Viewport
- Scissor
- Line Width
- Depth Bias
- Blend Constant Colour
- Depth Bounds
- Stencil: Compare, Write, Reference
Pipelines – The Cache
Introduction to SPIR-V Shaders
Neil Hickey
Compiler Engineer, ARM
SPIR History
[Diagram: SPIR-V ecosystem. Front ends: Parse HLSL, Parse GLSL, Parse OpenCL C, Parse ISPC, Parse Static C++. Tools: SPIR-V Tools, SPIR-V Validator, SPIR-V (Dis)Assembler, LLVM to SPIR-V Bi-directional Translator. Also shown: LLVM, other intermediate forms.]
SPIR-V
• 32-bit word stream
• Extensible and easily parsed
• Retains data object and control flow information for effective code generation and translation
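The "32-bit word stream" point can be sketched in a few lines of C. This is a minimal illustration, not part of the original slides: it assumes the module is already in memory as host-endian words, and the function name `spirv_count_instructions` and the hand-assembled `example_module` are hypothetical. It relies only on the SPIR-V layout of a five-word header followed by instructions whose first word packs the word count (high 16 bits) and the opcode (low 16 bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPIRV_MAGIC 0x07230203u /* first word of every SPIR-V module */
#define SPIRV_HEADER_WORDS 5u   /* magic, version, generator, bound, schema */

/* Counts the instructions in a SPIR-V module held as host-endian words.
 * Returns -1 if the stream is too short, has the wrong magic number,
 * or contains an instruction whose length runs past the end. */
static int spirv_count_instructions(const uint32_t *words, size_t word_count)
{
    assert(words != NULL);
    if (word_count < SPIRV_HEADER_WORDS || words[0] != SPIRV_MAGIC)
        return -1;

    int count = 0;
    size_t i = SPIRV_HEADER_WORDS;
    while (i < word_count) {
        /* High 16 bits: instruction length in words; low 16 bits: opcode. */
        uint32_t length = words[i] >> 16;
        if (length == 0 || length > word_count - i)
            return -1;
        i += length;
        ++count;
    }
    return count;
}

/* Hand-assembled two-instruction module, for illustration only. */
static const uint32_t example_module[] = {
    SPIRV_MAGIC, 0x00010000u, 0u, 1u, 0u, /* header: version 1.0, bound 1 */
    (2u << 16) | 17u, 1u,                 /* OpCapability Shader */
    (3u << 16) | 14u, 0u, 1u,             /* OpMemoryModel Logical GLSL450 */
};
```

Because every instruction carries its own length, a tool can skip opcodes it does not understand, which is one reason the format is easy to parse and extend.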
• GL_KHR_vulkan_glsl
• Extends #version 140 and higher on desktop and #version 310 es for mobile content
• Removed functionality
- Atomic-counter bindings
- Subroutines
• New functionality
- Descriptor sets
- Specialization constants
- Subpass inputs
uniform sampler s;
uniform texture2D t;
in vec2 texcoord;
...
void main()
{
fragColor = texture(sampler2D(t,s), texcoord);
}
vec4 data[arraySize];
// GLSL
#extension GL_EXT_shader_pixel_local_storage : require
__pixel_local_inEXT GBuffer {
layout(rgba8) vec4 albedo;
layout(rgba8) vec4 normal;
...
} pls;
// Vulkan
layout(input_attachment_index = 0) uniform subpassInput albedo;
layout(input_attachment_index = 1) uniform subpassInput normal;
...
# Makefile rules
shaders: $(SPIRV_FILES)
%.frag.spv: %.frag
	glslc -o $@ $< $(GLSL_FLAGS) -std=310es
uint32_t id = resources.push_constant_buffers[0].id;
vector<BufferRange> ranges = comp.get_active_buffer_ranges(id);
for (auto &range : ranges)
{
    printf("Accessing member #%u, offset %u, size %u\n",
           range.index, range.offset, range.range);
}
// Vulkan GLSL
uniform subpassInput uAlbedo;
...
FragColor = accumulateLight(
subpassLoad(uAlbedo),
subpassLoad(uNormal).xyz,
subpassLoad(uDepth).x);
// Vulkan GLSL
layout(push_constant) uniform PushConstants {
vec4 Material;
} constants;
FragColor = constants.Material;
// Vulkan GLSL
layout(set = 1, binding = 1) uniform sampler2D uTexture;
// SPIRV-Cross
uint32_t newBinding = 4;
glsl.set_decoration(texture.id, spv::DecorationBinding, newBinding);
glsl.unset_decoration(texture.id, spv::DecorationDescriptorSet);
string glslSource = glsl.compile();
// GLSL
layout(binding = 4) uniform sampler2D uTexture;
// Vulkan GLSL
layout(set = 0, binding = 0) uniform UBO {
mat4 MVPs[MAX_INSTANCES];
};
# Compile to SPIR-V
glslc -o test.spv test.comp
# Build library
g++ -o test.so -shared test.cpp -O0 -g -Iinclude/spirv_cross
• Vulkan Support
- Trace all the function calls in the spec.
- Allows you to see exactly what calls compose your application.
- Contact the Mali forums and we would love to get you set up.
https://community.arm.com/groups/arm-mali-graphics
[Screenshot: tool views: Frame Outline, Frame Capture, API Trace, States, Uniforms, Vertex Attributes, Buffers, Framebuffers, Textures, Shaders, Dynamic Help]
Jesse Barker
Principal Software Engineer
83 © ARM 2016
What are Vulkan Descriptors?
Handle Type
myImageView SAMPLED_IMAGE
Image View
Image Device
Memory
What are Descriptor Sets?
// uniform blocks:
layout(set = 0, binding = 0) uniform Type0 { ... } ubo0;
// textures:
layout(set = 0, binding = 1) uniform sampler2D tex0;
// SSBO:
layout(set = 0, binding = 2) buffer Type2 { ... } ssbo0;
void main() {
    // ...
}

binding  type            stages
0        Uniform Buffer  Graphics
1        Image/Sampler   Graphics
2        Storage Buffer  Graphics
What is a Descriptor Pool?
• Parent object of a Descriptor Set
• Allows Descriptor Set management to be threaded
• Manages memory for hardware descriptors
typedef struct VkDescriptorPoolSize {
    VkDescriptorType type;
    uint32_t descriptorCount;
} VkDescriptorPoolSize;
typedef struct VkDescriptorPoolCreateInfo {
    VkStructureType sType;
    const void* pNext;
    VkDescriptorPoolCreateFlags flags;
    uint32_t maxSets;
    uint32_t poolSizeCount;
    const VkDescriptorPoolSize* pPoolSizes;
} VkDescriptorPoolCreateInfo;
Allocating Descriptor Sets
Define desired layouts of descriptors
Ask the Descriptor Pool to allocate a Descriptor Set per layout
What is a Pipeline Layout?
// uniform blocks:
layout(set = 0, binding = 0) uniform Type0 { ... } ubo0;
layout(set = 0, binding = 0) uniform Type1 { ... } ubo1;
// textures:
layout(set = 0, binding = 1) uniform sampler2D tex0;
layout(set = 1, binding = 0) uniform sampler2D tex1;
// SSBO:
layout(set = 1, binding = 1) buffer Type2 { ... } ssbo0;
void main() {
    // ...
}

Descriptor Set 0
binding  type            stages
0        Uniform Buffer  Graphics
0        Uniform Buffer  Graphics
1        Image/Sampler   Graphics

Descriptor Set 1
binding  type            stages
0        Image/Sampler   Graphics
1        Storage Buffer  Graphics
How do Descriptors get into Descriptor Sets?
VKAPI_ATTR void VKAPI_CALL vkUpdateDescriptorSets(
    VkDevice device,
    uint32_t descriptorWriteCount,
    const VkWriteDescriptorSet* pDescriptorWrites,
    uint32_t descriptorCopyCount,
    const VkCopyDescriptorSet* pDescriptorCopies);

typedef struct VkWriteDescriptorSet {
    VkStructureType sType;
    const void* pNext;
    VkDescriptorSet dstSet;
    uint32_t dstBinding;
    uint32_t dstArrayElement;
    uint32_t descriptorCount;
    VkDescriptorType descriptorType;
    const VkDescriptorImageInfo* pImageInfo;
    const VkDescriptorBufferInfo* pBufferInfo;
    const VkBufferView* pTexelBufferView;
} VkWriteDescriptorSet;
Finally, I’m ready to use my Descriptor Sets
VKAPI_ATTR void VKAPI_CALL vkCmdBindDescriptorSets(
    VkCommandBuffer commandBuffer,
    VkPipelineBindPoint pipelineBindPoint,
    VkPipelineLayout layout,
    uint32_t firstSet,
    uint32_t descriptorSetCount,
    const VkDescriptorSet* pDescriptorSets,
    uint32_t dynamicOffsetCount,
    const uint32_t* pDynamicOffsets);
• Bound sets must match pipeline layout
• Graphics or compute?
• Simple layout is best
What about Vertex Input?
Vertex Input Description
If your shader declares:
in vec3 position;
in vec2 texcoord;   // VK_FORMAT_R8G8_UNORM is read as normalized floats

Your C code declares:
struct Position
{
    float x, y, z;
};

struct Texcoord
{
    uint8_t u, v;
};

const VkVertexInputBindingDescription binding[] =
{
    {
        0,                              // binding
        sizeof(float) * 3,              // stride
        VK_VERTEX_INPUT_RATE_VERTEX     // inputRate
    },
    {
        1,                              // binding
        sizeof(uint8_t) * 2,            // stride
        VK_VERTEX_INPUT_RATE_VERTEX     // inputRate
    },
};
const VkVertexInputAttributeDescription attributes[] =
{
    {
        0,                              // location
        binding[0].binding,             // binding
        VK_FORMAT_R32G32B32_SFLOAT,     // format
        0                               // offset
    },
    {
        1,                              // location
        binding[1].binding,             // binding
        VK_FORMAT_R8G8_UNORM,           // format
        0                               // offset
    }
};
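The stride and offset values in a vertex input description have to agree with the C structures the vertex data is actually stored in. The sketch below is an illustration only: it deliberately avoids the Vulkan headers, and `struct BindingLayout` and the helper functions are hypothetical stand-ins for the layout-dependent fields. It shows one way to derive those numbers from the structs with `sizeof` and `offsetof` instead of counting bytes by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two vertex streams from the slide. */
struct Position { float x, y, z; };
struct Texcoord { uint8_t u, v; };

/* Hypothetical stand-in for the layout-dependent fields of
 * VkVertexInputBindingDescription; not the real Vulkan type. */
struct BindingLayout {
    uint32_t binding; /* binding slot the buffer is bound to */
    uint32_t stride;  /* bytes from one vertex to the next */
};

/* Stride for binding 0, taken from the struct rather than hand-counted. */
static struct BindingLayout position_layout(void)
{
    /* Three tightly packed floats: no padding is expected here. */
    assert(sizeof(struct Position) == 3 * sizeof(float));
    struct BindingLayout layout = { 0u, (uint32_t)sizeof(struct Position) };
    return layout;
}

/* Stride for binding 1. */
static struct BindingLayout texcoord_layout(void)
{
    struct BindingLayout layout = { 1u, (uint32_t)sizeof(struct Texcoord) };
    return layout;
}

/* Attribute offsets are byte offsets of the first component in its struct. */
static uint32_t position_offset(void) { return (uint32_t)offsetof(struct Position, x); }
static uint32_t texcoord_offset(void) { return (uint32_t)offsetof(struct Texcoord, u); }
```

With these helpers the strides come out as 12 and 2 bytes and both offsets as 0, matching the hand-written values; the benefit is that the description follows automatically if a struct later gains a member.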
Questions?
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM
Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured
may be trademarks of their respective owners.
Copyright © 2016 ARM Limited
Vulkan Subpasses
or
The Frame Buffer is Lava
Andrew Garrard
Samsung R&D Institute UK
[Diagram labels: Invoke (×4), 2ry (×4), View 1, View 2]
[Diagram: deferred shading. Geometry: lightweight render storing per-surface content at each fragment (Diffuse/ɑ, Specular/Specularity, Normal). Lighting: render full-screen quad and perform fragment shading.]
VkFramebuffer
vkCmdBeginRenderPass
•vkQueueSubmit
UK Khronos Chapter meet, May 2016 Vulkan subpasses — Page 125
Multiple render passes
•You can have more than one render pass in a command buffer
- Yes, Leeloo multipass, we know…
- accesses stay on chip (if possible)
•vkCmdNextSubpass
•vkCmdDraw (etc.)
•vkCmdEndRenderPass
•vkEndCommandBuffer
[Diagram: a command buffer containing two render passes; within a render pass, each new subpass starts another group of draws]
Accessing subpass output in fragment shaders
•In SPIR-V, previous subpass content is read with OpImageRead
- Coordinates are sample-relative, and need to be 0
- OpTypeImage Dim = SubpassData
•In GLSL (using GL_KHR_vulkan_glsl):
- Types for subpass access are [ui]subpassInput(MS)
- layout(input_attachment_index = i, …) uniform subpassInput t; to select a subpass
- subpassLoad() to access the pixel
C.f. __pixel_localEXT layouts in EXT_shader_pixel_local_storage when using OpenGL ES
•vkCmdExecuteCommands
•vkCmdNextSubpass
[Diagram: a primary command buffer whose render pass runs secondary command buffers through vkCmdExecuteCommands (CEC), with new subpasses in between; the render passes write to target image 1 and target image 2]
Andrew Garrard
a.garrard at samsung.com
[Diagrams: frames 0, 1 and 2 on the system's CPU and GPU timelines, and on the GPU's vertex and fragment stages]
• Memory Barrier
- Flush/invalidate caches
- Determination of access and visibility
• Memory Dependency
- Execution dependency involving a Memory Barrier
• Semaphores
- Between Queues
• Fences
- Whole queue operations to CPU
OpenGL has just two, very coarse synchronization primitives: memory barriers and fences. They are loosely similar to the equivalently named concepts in Vulkan.
void vkCmdWaitEvents(
VkCommandBuffer commandBuffer,
uint32_t eventCount,
const VkEvent* pEvents,
VkPipelineStageFlags srcStageMask,
VkPipelineStageFlags dstStageMask,
uint32_t memoryBarrierCount,
const VkMemoryBarrier* pMemoryBarriers,
uint32_t bufferMemoryBarrierCount,
const VkBufferMemoryBarrier* pBufferMemoryBarriers,
uint32_t imageMemoryBarrierCount,
const VkImageMemoryBarrier* pImageMemoryBarriers);
• Image Barrier
- For a single image subresource range
vkCmdBeginRenderPass(...);
vkQueuePresentKHR(presentQueue, &presentInfo);
presentInfo.swapchainCount = 1;
presentInfo.pSwapchains = &swapchain;
presentInfo.pImageIndices = &imageIndex;
vkQueuePresentKHR(universalQueue, &presentInfo);
VkResult vkWaitForFences(
    VkDevice device,
    uint32_t fenceCount,
    const VkFence* pFences,
    VkBool32 waitAll,
    uint64_t timeout);
GL’s fences are like a combination of a semaphore and a fence in Vulkan: they can synchronize GPU and CPU in multiple ways at a coarse granularity.
Example – Multi-buffering
// Have enough resources and fences to have one per in-flight-frame, usually the swapchain image count
VkBuffer buffers[swapchainImageCount];
VkFence fence[swapchainImageCount];
// Can use the index from the presentation engine - 1:1 mapping between swapchain images and resources
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, semaphore, VK_NULL_HANDLE, &nextIndex);
// Submit any work to the queue, with those fences being re-used for the next time around
vkQueueSubmit(graphicsQueue, 1, &sSubmitInfo, fence[nextIndex]);
These are a lot like glFinish, and should be treated similarly: use them VERY SPARINGLY.
VkResult vkWaitForFences(
    VkDevice device,
    uint32_t fenceCount,
    const VkFence* pFences,
    VkBool32 waitAll,
    uint64_t timeout);
Questions?
2 – Create a Vulkan surface (VK_KHR_<platform>_surface)
3 – Query information about your surface (VK_KHR_surface)
4 – Create a Vulkan swapchain (VK_KHR_swapchain)
5 – Get your presentable images
Vulkan Frame Loop – as easy as 1-2-3!
0 – Create your swapchain (VK_KHR_swapchain)
2 – Submit command buffer(s) for that image
Legend: Setup; Steady-state; Response to suboptimal / surface_lost
NVIDIA/KHRONOS CONFIDENTIAL
• Some Context
GPU Architecture
In a nutshell
NVIDIA Maxwell 2
Register File
Core
Load Store Unit
Execution Model SMM
Thread Hierarchies
32 threads
32 threads
32 threads
Work Group Warps
Resource Partitioning
Resources Are Limited
Resource Partitioning
Registers
The more registers a kernel uses, the fewer warps can be resident on the SM
Resource Partitioning
Shared Memory
The more shared memory a work group uses, the fewer work groups can be resident on the SM
Keeping It Moving
Occupancy
Keeping It Moving
Occupancy – Simple Theoretical Example
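The relationship between register use, shared memory use and resident warps can be worked through numerically. The following is a rough sketch under stated assumptions: the per-SM limits passed in are illustrative inputs chosen by the caller, not vendor specifications, and real hardware also applies allocation granularities that are ignored here. The names `SmLimits`, `resident_warps` and `occupancy_percent` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative per-SM limits; the caller supplies the numbers. */
struct SmLimits {
    uint32_t registers;        /* 32-bit registers in the register file */
    uint32_t max_warps;        /* maximum resident warps */
    uint32_t threads_per_warp; /* 32 on the hardware discussed in the talk */
    uint32_t shared_bytes;     /* shared memory available to work groups */
};

static uint32_t min_u32(uint32_t a, uint32_t b) { return a < b ? a : b; }

/* Resident warps for one kernel: the tightest of the register,
 * warp-slot and shared-memory limits decides how many work groups fit. */
static uint32_t resident_warps(const struct SmLimits *sm,
                               uint32_t regs_per_thread,
                               uint32_t warps_per_group,
                               uint32_t shared_per_group)
{
    assert(regs_per_thread > 0 && warps_per_group > 0);
    uint32_t regs_per_group =
        regs_per_thread * sm->threads_per_warp * warps_per_group;
    uint32_t groups = sm->registers / regs_per_group;
    groups = min_u32(groups, sm->max_warps / warps_per_group);
    if (shared_per_group > 0)
        groups = min_u32(groups, sm->shared_bytes / shared_per_group);
    return groups * warps_per_group;
}

/* Occupancy as a percentage of the maximum resident warps. */
static uint32_t occupancy_percent(const struct SmLimits *sm, uint32_t warps)
{
    return 100u * warps / sm->max_warps;
}
```

For example, with assumed limits of 65536 registers, 64 warps, 32 threads per warp and 96 KB of shared memory, a 256-thread work group at 32 registers per thread fills all 64 warp slots; doubling to 64 registers per thread halves that to 32 warps, and asking for 48 KB of shared memory per group leaves room for only two groups (16 warps).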
Queuing It Up
Working with 1 Queue
• Scheduler will distribute work across all SMs
• Kernels execute in sequence (there may be some overlap)
• Low occupancy kernels will waste GPU time
[Diagram: command buffers and transfers feeding a single command queue]
Queuing It Up
Working with N Queues
• NVIDIA hardware gives you 16 all-powerful queues
• Application decides which queues for which kernels
[Diagram: command buffers submitted to multiple queues]
Queuing It Up
Compute and Graphics In Harmony
• Some hardware can even run compute and graphics work concurrently
• Needs fast, fine-grained context switching (not just at draw commands)
• Simple Graphics work tends to have high occupancy
• Complex graphics work can reduce occupancy
• Profile for performance insights
• Profile to understand occupancy of both graphics and compute workloads
An Example
Compute and Graphics In Harmony
Many discretized operations are separable
[Diagram: rows of SMs shared between two columns of work on frames N+1 to N+4, offset by one frame and linked by a semaphore]
An Example
Putting it all together
[Diagram: compute and graphics sharing the SMs, with compute working one frame ahead of graphics (frames N+2 to N+4 against N+1 to N+3) and semaphores linking the two]
Memory Transfers
More opportunity for concurrency
Conclusion
Takeaways
Questions?
Chris Hebert, Dev Tech Software Engineer, Professional Visualization
Porting to Vulkan
Hans-Kristian Arntzen
Engineer, ARM
(Credit for slides: Marius Bjørge)
Agenda
• API flashback
• Engine design
- Command buffers
- Pipelines
- Render passes
- Memory management
[Diagram: logic shifts from the driver into the application]
[Diagram: vkQueue, vkCommandPool and Heap; a vkCommandBuffer recording vkCmdBeginRenderPass, vkCmdBindXXX, vkCmdBindPipeline, vkCmdBindDescriptorSets, vkCmdDraw, vkCmdEndRenderPass]
• Approach it differently
- Re-design for Vulkan, and then port that to OpenGL
[Diagram: vkCommandPool and vkCommandBuffer on the main thread: vkCmdBeginRenderPass, vkCmdExecuteCommands, vkCmdEndRenderPass]
[Diagram: public interface (pipeline state, SetIndexBuffer(), Draw()) feeds the command buffer; internally: Flush, RequestPipeline, CreateNewPipeline]
• …but this requires detailed knowledge of every potential shader/state combination that you might have in your scene
- As an example, one of our fragment shaders has ~9000 combinations
- Every one of these shaders can use different render state
- We also have to make sure the pipelines are bound to compatible render passes
- An explosion of combinations!
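One common response to this kind of combinatorial explosion is to create pipelines lazily and look them up by a key built from the state that was actually used. The sketch below illustrates that request-then-create pattern in plain C. It is not the engine's implementation: `PipelineKey`, `PipelineCache` and `request_pipeline` are hypothetical names, the key is heavily reduced, and "creating" a pipeline is simulated by handing out a fresh integer handle where a real backend would call vkCreateGraphicsPipelines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 64u /* power of two; hypothetical capacity */

/* Hypothetical, heavily reduced state key. A real key would also cover
 * blend, depth/stencil, vertex layout and so on. */
struct PipelineKey {
    uint32_t shader_id;
    uint32_t render_pass_id; /* pipelines are tied to a compatible render pass */
    uint32_t state_bits;
};

struct PipelineCache {
    struct PipelineKey keys[CACHE_SLOTS];
    uint32_t handles[CACHE_SLOTS]; /* 0 marks an empty slot */
    uint32_t created;              /* how many pipelines were "compiled" */
};

/* FNV-1a-style mixing, one 32-bit field at a time. */
static uint32_t hash_key(const struct PipelineKey *k)
{
    const uint32_t fields[3] = { k->shader_id, k->render_pass_id, k->state_bits };
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < 3; ++i) {
        h ^= fields[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns the handle for this state, creating it on first use. */
static uint32_t request_pipeline(struct PipelineCache *c, const struct PipelineKey *k)
{
    uint32_t slot = hash_key(k) & (CACHE_SLOTS - 1u);
    while (c->handles[slot] != 0) { /* linear probing */
        const struct PipelineKey *s = &c->keys[slot];
        if (s->shader_id == k->shader_id &&
            s->render_pass_id == k->render_pass_id &&
            s->state_bits == k->state_bits)
            return c->handles[slot]; /* hit: reuse the existing pipeline */
        slot = (slot + 1u) & (CACHE_SLOTS - 1u);
    }
    /* Miss: keep one slot free so the probe loop always terminates. */
    assert(c->created + 1u < CACHE_SLOTS);
    c->keys[slot] = *k;
    c->handles[slot] = ++c->created; /* stand-in for real pipeline creation */
    return c->handles[slot];
}
```

Only the combinations that a scene actually draws with are ever created, at the cost of a possible hitch the first time a new combination appears.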
Pipeline state
vkPipelineCache
Disk
SPIR-V shader
SetShaders()
SetConstantData()
SetTexture()
Draw() Internal
BindDescriptorSets
• Our solution:
- Emulate descriptor sets in our OpenGL backend
- SPIRV-Cross collapses and serializes bindings
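To make "collapses and serializes bindings" concrete, here is a minimal sketch of one possible mapping from Vulkan-style (set, binding) pairs to a single flat binding index of the kind OpenGL expects. It is an assumption made for illustration, not the SPIRV-Cross API and not necessarily the scheme the authors used: `flatten_binding` and `MAX_SETS` are hypothetical, and the sketch ignores the fact that OpenGL keeps separate binding namespaces per resource type.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_SETS 4u /* hypothetical upper bound on descriptor sets */

/* Maps (set, binding) to a flat index by laying the sets out back to back:
 * all bindings of set 0 first, then set 1, and so on. */
static uint32_t flatten_binding(const uint32_t bindings_per_set[MAX_SETS],
                                uint32_t set, uint32_t binding)
{
    assert(set < MAX_SETS);
    assert(binding < bindings_per_set[set]);
    uint32_t flat = binding;
    for (uint32_t s = 0; s < set; ++s)
        flat += bindings_per_set[s]; /* skip past every earlier set */
    return flat;
}
```

With three bindings in set 0 and two in set 1, (set = 0, binding = 2) maps to 2 and (set = 1, binding = 1) maps to 4. The backend then only has to remember the per-set counts to translate a descriptor set bind into a run of flat GL bindings.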
// New
layout(push_constant, std430) uniform PushConstants {
mat4 MVP;
vec4 MaterialData;
} RegisterMapped;
• Our solution:
- Use SPIRV-Cross to turn push constants into regular non-opaque uniforms
- Logic in our OpenGL/Vulkan backends redirects the push constant data appropriately
[Diagram: public interface BeginRenderPass feeds the command buffer; internally: RequestFramebuffer, RequestRenderPass, CreateCompatibleRenderPass, CreateFramebuffer, BeginRenderPass]
// GLSL
#extension GL_EXT_shader_pixel_local_storage : require
__pixel_local_inEXT GBuffer {
layout(rgba8) vec4 albedo;
layout(rgba8) vec4 normal;
...
} pls;
// Vulkan
layout(input_attachment_index = 0) uniform subpassInput albedo;
layout(input_attachment_index = 1) uniform subpassInput normal;
...
• Our solution:
- Use SPIRV-Cross to rewrite subpass inputs to Pixel Local Storage variables or texture
lookups
- This will only support a subset of the Vulkan subpass features, but good enough for our
current use
Public interface
BeginRenderPass
EndRenderPass
Command
Buffer Internal
vkCmdEndRenderPass
vkCmdPipelineBarrier