DirectX
Application Developers’ Guide
for
Adreno™ GPUs
Confidential and Proprietary – Qualcomm Technologies, Inc.
Restricted Distribution: Not to be distributed to anyone who is not an employee of either Qualcomm or
its subsidiaries without the express approval of Qualcomm’s Configuration Management.
Not to be used, copied, reproduced, or modified in whole or in part, nor its contents revealed in any
manner to others without the express written permission of Qualcomm Technologies, Inc.
Qualcomm and Snapdragon are trademarks of QUALCOMM Incorporated, registered in the United
States and other countries. All QUALCOMM Incorporated trademarks are used with permission. Other
product and brand names may be trademarks or registered trademarks of their respective owners. ARM
is a registered trademark of ARM Limited.
This technical data may be subject to U.S. and international export, re-export or transfer (“export”) laws.
Diversion contrary to U.S. and international law is strictly prohibited.
This document is a guide for developing and optimizing DirectX applications for
Windows RT on Snapdragon™ S4-based mobile platforms that include the Adreno™
series GPUs (Graphics Processing Units). The Adreno 225 GPU is embedded within
Qualcomm’s Snapdragon S4 (MSM8960) processor, which is the first Snapdragon
processor to support Microsoft’s Windows RT platform.
The Adreno GPU is explained from an application development perspective. Sample code
within this document is copyright of QUALCOMM® Incorporated.
This document focuses on programming the GPU with Direct3D 11.1 on Windows RT
and touches upon other related components like CPU and memory that are needed for
graphics application development.
The following sections address this game dataflow diagram one system component at a
time.
CPU
The data flow in Figure 2-2 starts from the left with multiple threads within the game
being processed by the dual-core CPU. Assume this hypothetical game has a
coarse-threaded game engine architecture. In a coarse-threaded architecture, each
game engine function, such as artificial intelligence or game physics, runs as an
individual thread that the game engine assigns to a CPU core. A
similar flow is possible for a fine-grained threaded game engine, where each of these
functions is divided into multiple threads. Each thread in a function (part of the function,
instead of a complete function) is then processed by a CPU core. The coarse-threaded
game engine tends to be more serial in flow on a CPU, whereas a fine-grained threaded
engine tends to be parallel in flow.
In Figure 2-2, each game engine function is processed from left to right on the first core
in a serial fashion. After processing the last function, the game state is stored in local
memory, and the second CPU core taking that state as a starting point begins
processing the next series of functions. While these CPU cores are processing the
functions, each of the game engine functions can also use the VeNum co-processor for
the math calculations. VeNum is a general-purpose SIMD (Single Instruction, Multiple
Data) architecture extension that is used for efficient Vector Floating Point (VFP)
processing of mathematical algorithms for these game engine functions. VeNum is
instructionally compatible with the standard ARM® NEON™ general-
purpose SIMD engine. There are multiple ways to use VeNum, including vectorizing
compilers, C intrinsics, or assembler. For more details, see ARM’s
reference page: http://www.arm.com/products/processors/technologies/neon.php
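As a concrete illustration, here is a minimal sketch of the C-intrinsics route; AddFloat4 is a hypothetical helper, and the intrinsics shown are standard NEON calls:

#include <arm_neon.h>

// Add two float arrays four lanes at a time (assumes count is a multiple of 4)
void AddFloat4( const float* a, const float* b, float* out, int count )
{
    for ( int i = 0; i < count; i += 4 )
    {
        float32x4_t va = vld1q_f32( a + i );        // load 4 floats from a
        float32x4_t vb = vld1q_f32( b + i );        // load 4 floats from b
        vst1q_f32( out + i, vaddq_f32( va, vb ) );  // add lanes and store
    }
}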
System Memory
GPU
Once the CPU completes its computational cycle, the game state data is stored in
memory, while processed geometry, texture data pointers and command pointers are
passed onto the GPU for further processing. There are multiple important aspects of
Adreno GPUs, as shown in Figure 2-2 and discussed later in this document. Here, the
focus is on two components: GMEM (Internal Graphics Memory), sometimes also referred
to as OCMEM (On-Chip Memory), which is a tile buffer for storing color, depth, and
stencil values; and the shader processing unit.
Adreno GPUs have an internal local memory buffer that can store z, stencil, and color
values. The tile-based rendering architecture in Adreno GPUs divides the final frame
buffer into smaller portions, called bins. Resolving these bins one at a time in this local
buffer is what makes the rendering tile-based. The completely resolved final frame
buffer is then stored in system memory as shown.
Display
The final processed pixel data from the frame buffer is updated onto the LCD screen by
the display driver.
Texturing Features
Texture Formats
The following table lists the texture formats supported by Direct3D11.1 feature level 9_3.
For each format, the table specifies whether the format can be used for a 2D, 3D, or
Cubemap texture. The Filter column specifies whether the format can be used with a
sampler that has non-point min or mag filtering enabled. The Mip column specifies
whether the format can be mipmapped, and the GenMip column specifies whether the
driver can auto-generate mipmaps for the format. Finally, the Render column lists
whether the format can be used as a render target.
Texture Compression
Compressing textures can significantly improve the performance and load time of
graphics applications, since compression reduces texture memory and bus bandwidth
use. Important compressed texture formats supported by Adreno GPUs in Direct3D 11.1 are:
BC1 – DXT1 texture compression format (4-bit per pixel for RGB with 0 or 1-bit
of alpha)
BC2 – DXT2 texture compression format (8-bit per pixel for RGBA). In BC2,
the alpha is stored as an uncompressed 4-bit value.
BC3 – DXT5 texture compression format (8-bit per pixel for RGBA). In BC3,
the alpha is compressed using a block compression technique.
In previous versions of Direct3D, the D3DX library provided useful utility methods for
loading compressed textures from DDS files. However, Microsoft deprecated D3DX and
did not include it in the Windows 8 SDK. Instead, Qualcomm provides support for
compressing textures to the BC1/BC2/BC3 formats in the Adreno Texture Converter tool,
which is part of the Adreno SDK.
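As a sketch of how a pre-compressed texture might then be created at runtime (pBC1Data, nWidth, and nHeight are assumed to come from the application’s asset loader; BC1 stores 8 bytes per 4x4 block):

CD3D11_TEXTURE2D_DESC texDesc( DXGI_FORMAT_BC1_UNORM, nWidth, nHeight,
                               1,    // arraySize
                               1,    // mipLevels
                               D3D11_BIND_SHADER_RESOURCE,
                               D3D11_USAGE_IMMUTABLE );
D3D11_SUBRESOURCE_DATA initData;
initData.pSysMem = pBC1Data;
initData.SysMemPitch = ( nWidth / 4 ) * 8;  // BC1: 8 bytes per row of 4x4 blocks
initData.SysMemSlicePitch = 0;
ComPtr<ID3D11Texture2D> compressedTexture;
if ( FAILED( pD3DDevice->CreateTexture2D( &texDesc, &initData, &compressedTexture ) ) )
{
    // handle error
}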
Geometry Features
Vertex Formats
The following table lists the formats that can be used for vertex data in Direct3D 11.1
Feature Level 9.3.
Index Formats
Index buffers can be stored as 16-bit USHORT or 32-bit UINT using the formats
DXGI_FORMAT_R16_UINT and DXGI_FORMAT_R32_UINT.
Geometry instancing is the practice of rendering multiple copies of the same mesh or
geometry in a scene at once. This technique is used primarily for objects like trees,
grass, or buildings that can be represented as repeated geometry without appearing
unduly repetitive, though geometry instancing may also be used for characters.
Although vertex data is duplicated across all instanced meshes, each instance may have
other differentiating parameters (like color, transforms, lighting, etc.) changed to reduce
the appearance of repetition. For example, in Figure 2-3 all barrels in the scene could
use a single set of vertex data that is instanced multiple times instead of using unique
geometry for each one.
Geometry instancing offers significant savings in memory usage. It allows the GPU to
draw the same geometry multiple times in a single draw call with different positions,
while storing only a single copy of the vertex data, a single copy of the index data and an
additional stream containing one copy of each instance’s transform. Without instancing,
the GPU would have to store multiple copies of the vertex and index data.
Geometry instancing can be implemented using
ID3D11DeviceContext::DrawIndexedInstanced() in Direct3D11 Feature Level 9_3.
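A minimal sketch of such a draw, assuming Vertex and InstanceData are application-defined structures, the instance stream was declared in the input layout with D3D11_INPUT_PER_INSTANCE_DATA, and all buffers are already created:

// Per-instance transform stream declared in the input layout with, e.g.:
//   { "TEXCOORD", 1, DXGI_FORMAT_R32G32B32A32_FLOAT, 1, 0,
//     D3D11_INPUT_PER_INSTANCE_DATA, 1 }
ID3D11Buffer* buffers[2] = { pVertexBuffer.Get(), pInstanceBuffer.Get() };
UINT strides[2] = { sizeof(Vertex), sizeof(InstanceData) };
UINT offsets[2] = { 0, 0 };
pD3DDeviceContext->IASetVertexBuffers( 0, 2, buffers, strides, offsets );
pD3DDeviceContext->IASetIndexBuffer( pIndexBuffer.Get(), DXGI_FORMAT_R16_UINT, 0 );
// One call draws nInstances copies of the mesh
pD3DDeviceContext->DrawIndexedInstanced( nIndexCount, nInstances, 0, 0, 0 );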
Among the various techniques for reducing aliasing effects, multisampling is the one
supported by Adreno GPU. Multisampling divides every pixel into a set of samples,
each of which is treated like a “mini-pixel” during rasterization. Each sample has its own
color, depth, and stencil value, and those values are preserved until the image is ready
for display. When it is time to compose the final image, the samples are resolved into the
final pixel color, as depicted in Figure 2-4. The Adreno GPU supports two or four samples
per pixel.
In Metro applications, MSAA cannot be performed on the main back buffer. Rather, the
application must render to a multisampled offscreen render target, resolve it to a non-
multisampled render target using ID3D11DeviceContext::ResolveSubresource(), and
then render the non-multisample version to the backbuffer as a texture bound to a
fullscreen quad. This technique is demonstrated in the MSAA example in the Adreno
SDK for DirectX.
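A sketch of the resolve step, assuming pMSAATarget was created with SampleDesc.Count = 4 and pResolveTarget with SampleDesc.Count = 1, both with the same format:

pD3DDeviceContext->ResolveSubresource(
    pResolveTarget.Get(), 0,          // destination (single-sampled)
    pMSAATarget.Get(), 0,             // source (multisampled)
    DXGI_FORMAT_R8G8B8A8_UNORM );
// The resolved texture is then drawn to the back buffer as a fullscreen quad
// through its shader resource view.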
This section describes developer tools from Microsoft as well as from Qualcomm that
are available for graphics application development and analysis on Windows RT
platforms powered by Snapdragon. Microsoft’s tools are available on Microsoft’s MSDN
website http://msdn.microsoft.com while Qualcomm-provided tools are available for free
download on the Qualcomm Developer Network: http://developer.qualcomm.com.
In previous versions of Direct3D, it was possible for the application to compile the HLSL
shader code at runtime using D3DX. However, in Metro applications, the HLSL must be
compiled ahead of time using fxc. The build process highlighted above will produce a
binary shader file (usually named ‘.cso’ for Compiled Shader Object) that can be loaded
at runtime. Microsoft does provide a debugging D3DCompiler API that can be used in
debug builds of an application to compile the HLSL code at runtime, but this API cannot
be used by released Metro applications.
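For example, a pixel shader with entry point main targeting feature level 9_3 might be compiled offline with a command line such as the following (file names here are placeholders):

fxc /T ps_4_0_level_9_3 /E main /Fo SimplePixelShader.cso SimplePixelShader.hlsl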
The DirectX Control Panel is also available directly from Visual Studio 2012 under the
Debug > Graphics menu.
Overview
The Adreno SDK includes support for emulation and other utilities that are important for
graphics application development. The SDK is provided as a development environment
for Qualcomm’s Adreno graphics processors. It is intended for a broad spectrum of
developers, from those who want to learn technologies like DirectX, to those who want to
utilize the more advanced features of Snapdragon’s Adreno graphics solution.
The Adreno SDK can be downloaded by visiting the Adreno section on the Qualcomm
Developer Network at https://developer.qualcomm.com/mobile-development/mobile-
technologies/gaming-graphics-optimization-adreno/tools-and-resources.
Features
The Adreno SDK includes the following features:
Emulation support in a desktop environment
An SDK help system
SDK browser
Overview
The Adreno Profiler is a leading PC-based tool used by 3D content developers to test,
analyze, profile and optimize embedded 3D games and applications on commercial
Snapdragon-based devices without having to make changes to the application.
The Adreno Profiler provides valuable, time-saving feedback that improves application
performance and efficiency. This feedback includes:
GPU and system-level performance metrics
DirectX11 API call tracing and emulation
Real-time driver overrides
Shader analysis functionality for estimating source shader complexity
The Adreno Profiler can be downloaded free by visiting the Adreno section on the
Qualcomm Developer Network at https://developer.qualcomm.com/mobile-
development/mobile-technologies/gaming-graphics-optimization-adreno/tools-and-
resources.
For more details, please refer to the documentation within the Adreno Profiler.
// Request feature level 9_3, the level this guide targets on Adreno GPUs
UINT creationFlags = 0;  // may include D3D11_CREATE_DEVICE_DEBUG in debug builds
D3D_FEATURE_LEVEL featureLevels[] = { D3D_FEATURE_LEVEL_9_3 };
Microsoft::WRL::ComPtr<ID3D11Device> d3dDevice;
Microsoft::WRL::ComPtr<ID3D11DeviceContext> d3dDeviceContext;
if ( FAILED( D3D11CreateDevice(
         nullptr,                    // default adapter
         D3D_DRIVER_TYPE_HARDWARE,
         nullptr,                    // no software rasterizer module
         creationFlags,
         featureLevels,
         ARRAYSIZE(featureLevels),
         D3D11_SDK_VERSION,
         &d3dDevice,
         nullptr,                    // created feature level not needed here
         &d3dDeviceContext ) ) )
{
    // error..
}
D3D11_FEATURE_DATA_ARCHITECTURE_INFO info;
d3dDevice->CheckFeatureSupport(D3D11_FEATURE_ARCHITECTURE_INFO,
&info, sizeof(info));
if ( info.TileBasedDeferredRenderer )
{
// TBR present
}
The Adreno graphics driver will set TileBasedDeferredRenderer to TRUE. Many other
features can be queried at runtime using ID3D11Device::CheckFeatureSupport(). For
the full list of available queries, please see http://msdn.microsoft.com/en-
us/library/windows/desktop/ff476497(v=vs.85).aspx.
Figure 5-1 shows that vertices and pixels are processed in groups of four as a vector, or
a thread. When a thread stalls, the shader ALUs can be reassigned.
In a unified shader architecture, there is no separate hardware for the vertex and fragment
shaders, as illustrated in Figure 5-2. This allows for greater flexibility in pixel/vertex load
balancing.
Early Z Rejection
One of the important hardware features in Adreno GPUs, Early Z rejection provides a
fast occlusion method as well as the rejection of unwanted render passes for objects that
are not visible (hidden) from the view position. Figure 5-3 shows the red circle as an
object that is hidden behind the green object, which is represented here by a green
block. The rendering pass for this hidden object, which is not visible from the camera
viewpoint, is avoided using the early Z rejection feature.
Consider the example in Figure 5-4, which shows a color buffer represented as a grid,
and each block represented as a pixel. The rendered pixel area on this grid is colored
black. The Z-buffer value for these rendered black pixels is 1. If you are trying to render
a new primitive onto the same pixels of the existing color buffer that has the Z-buffer
value of 2 (as shown in the second grid with green blocks), the conflicting pixels in this
new primitive will be rejected as shown in the third grid representing the final color
buffer. Adreno GPU can reject occluded pixels at up to four times the drawn pixel fill
rate.
To get maximum benefit from this feature, we recommend drawing your scene with
primitives sorted front-to-back, i.e., near-to-far. This ensures that the Z-reject rate is
higher for the far primitives, which is very useful for applications with high depth
complexity.
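A minimal sketch of such a sort on the CPU, assuming a hypothetical DrawRecord that carries a precomputed view-space depth for each opaque draw:

#include <algorithm>
#include <vector>

// Hypothetical record for one opaque draw, with view-space depth precomputed
struct DrawRecord
{
    float viewDepth;
    // mesh, material, transform, ...
};

void SortFrontToBack( std::vector<DrawRecord>& draws )
{
    std::sort( draws.begin(), draws.end(),
               []( const DrawRecord& a, const DrawRecord& b )
               { return a.viewDepth < b.viewDepth; } );  // nearest first
}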
In the traditional desktop style rendering method, which is also referred to as direct
rendering, the intermediate step of processing and rendering the scene per bin is
eliminated. The geometry is rendered directly into the final frame buffer, which resides
in system memory.
Tile-based rendering is very useful for cases with considerable depth complexity or
overdraw with alpha blending. It significantly reduces memory bandwidth requirements
by processing the pixels of each bin in fast internal graphics memory for these blending
operations. Using direct rendering to perform the same blending operations requires
reading from and writing to system memory (where the final frame buffer resides) for
each pixel that is processed for blending. Reading from and writing to system memory
for every pixel is a costly operation that comes at the expense of battery life.
The Adreno GPU’s tile-based rendering algorithm is transparent to the developer, but
there are certain considerations you can make in your game design to take advantage
of it.
The guidelines below include ways to take advantage of these hardware features.
A GMEM store is necessary when switching render targets or prior to presenting the
frame. Minimizing render target switches is key to achieving good performance on any
tile-based renderer. GMEM loads are necessary only when a render target is bound and
is not cleared or discarded prior to drawing to it. A flush is treated the same as binding
a render target; to avoid a GMEM load after a flush, the render target must be cleared
or discarded prior to drawing. A typical scenario that requires a GMEM load is copying
from a render target into a new buffer for postprocessing effects. A typical scenario that
does not require a GMEM load, but is often overlooked by application developers, is
rendering to an offscreen render target, then switching to the swap chain and rendering
the scene, without calling clear/discard on either render target.
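A sketch of the recommended pattern, clearing each render target after it is bound so the driver can skip the GMEM load (pOffscreenRTV and pDSV are assumed to exist):

const float clearColor[4] = { 0.0f, 0.0f, 0.0f, 1.0f };
pD3DDeviceContext->OMSetRenderTargets( 1, pOffscreenRTV.GetAddressOf(), pDSV.Get() );
pD3DDeviceContext->ClearRenderTargetView( pOffscreenRTV.Get(), clearColor );
pD3DDeviceContext->ClearDepthStencilView( pDSV.Get(),
    D3D11_CLEAR_DEPTH | D3D11_CLEAR_STENCIL, 1.0f, 0 );
// ... draw the offscreen pass ...
// On ID3D11DeviceContext1, DiscardView() can replace a clear when the
// previous contents are not needed.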
There are a number of operations in Direct3D11.1 that can cause GMEM load. Some
examples are listed below:
ID3D11DeviceContext::CopyResource or
ID3D11DeviceContext::CopySubresourceRegion where the source is a
render target
ID3D11DeviceContext::CopyResource or
ID3D11DeviceContext::CopySubresourceRegion where the destination was
already used as a texture for rendering to the current render target
Use Intrinsics
Intrinsic functions that are part of the HLSL language should be used whenever feasible:
there is no need to reinvent the wheel by writing your own version, and there is
a good chance that they are optimized for specific shader profiles. Consult the HLSL
Intrinsic Functions list for all supported functions: http://msdn.microsoft.com/en-
us/library/windows/desktop/ff471376(v=vs.85).aspx.
Use the Appropriate Data Type
By using the appropriate data type in the code, the compiler and the optimizer in the
driver can optimize code and pair shader instructions. For example, using a float4 data
type where a float would suffice can prevent the compiler from arranging the output so
that instructions can be co-issued on the hardware. Small mistakes can sometimes have
a large impact on performance. For example, the following code should consume a
single instruction slot:
int4 ResultOfA(int4 a)
{
return a + 1;
}
Whereas the following code might consume 8 instruction slots because of the 1.0.
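Presumably the variant in question is the same function with a float literal, which forces int-to-float and float-to-int conversions:

int4 ResultOfA(int4 a)
{
    return a + 1.0;  // float literal forces type conversions
}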
After this, each tile is broken up into aligned 2x2 quads. Some of these quads might
contain “dead” pixels, as shown in red in Figure 5-6. Very small triangles that are less
than 2x2 pixels in size can waste at least 60% of the GPU’s processing power. Consider
using geometry LODs that use only bigger triangles. For example, a particle system
implemented with very small triangles could suffer reduced efficiency.
Texture Sampling
The fact that the hardware works on 2x2 pixels at the same time offers a challenge on
the pixel level. Four or more of those quads are grouped into a pixel vector (aka pixel
thread). Optimizing meshes so that pixel vectors are nicely grouped together makes
more efficient use of the texture cache.
Other strategies to avoid texture stalls are:
Avoid random access
Avoid 3D volume textures
Avoid fetching from 7 textures in one pixel shader; a more appropriate
number is 4
Use compressed formats everywhere: much better memory usage, fewer stalls
Use mipmaps when possible
In general, trilinear and anisotropic filtering are much more expensive than bilinear
filtering, while there is no cost difference between point and bilinear filtering.
Texture filtering can influence the speed of texture sampling. A bilinear texture look-up in
a 2D texture on a 32-bit format costs a single cycle. Adding trilinear filtering doubles that
cost to two cycles. Mipmap clamping may reduce this to bilinear cost though, so the
average cost might be lower in real-world cases. Adding anisotropic filtering multiplies
with the degree of anisotropy. That means a 16x anisotropic lookup can be 16 times
slower than a regular isotropic lookup. Because anisotropic filtering is adaptive, this hit is
taken only on pixels that require anisotropic filtering that might end up being only a few
pixels total. A rule of thumb for real-world cases is that anisotropic filtering will be less
than twice the cost on average.
Different texture formats have a large impact on texture sampling performance. A 32-bit
texture costs 1 cycle to fetch. So all 32-bit and smaller formats (including all the
compressed formats) are single-cycle. A 64-bit format takes two cycles and a 128-bit
format takes four cycles to fetch.
Cube maps and projected texture look-ups do not incur any extra cost, while shader-
specific gradients (based on ddx()/ddy()) cost an extra cycle. That means a regular
bilinear lookup that normally takes one cycle takes two with shader-specific gradients.
Note that these gradients cannot be stored across lookups, so a texture lookup with the
same gradients in the same sampler costs the extra cycle again.
3D Textures
Use of 3D textures can have a severe impact on memory and cache usage. Making
effective use of texture memory is already a major task for real-time 3D content. Adding
an additional dimension to a texture map increases its memory footprint significantly.
There are four major techniques that help to improve 3D texture mapping performance:
Keep textures small (< 32 texels in each direction)
Repeat and mirror where appropriate. Volume detail textures can be repeated to
great effect. Use mirroring address (D3D11_TEXTURE_ADDRESS_MIRROR)
modes for symmetrical textures like volumetric light maps. The mirror once
(D3D11_TEXTURE_ADDRESS_MIRROR_ONCE) addressing mode is
especially useful with 3D light maps.
Keep texture bit-depth as low as possible. Volumetric detail textures and light
maps are often grayscale. For these, use a single-channel texture.
Use compression. Developers should make the appropriate size/quality tradeoff
for their application.
Branching
Static branching performs well, but at the risk of extra GPRs. Using dynamic branching
in a shader has a certain, non-constant overhead that depends on the exact shader
code. Therefore, using dynamic branching is not always a performance win. Since
multiple pixels are processed as one, in the event that some pixels take a branch while
others do not, the GPU must execute both paths (all instructions really do execute, but
there’s a masking bit that controls the output so that only appropriate pixels are
affected). In this case, the performance will appear worse than if each pixel was
processed individually. If all pixels take the same path, the GPU is capable of really
taking the branch, which is good for performance.
Level-of-Detail Shaders
To improve the efficiency of shaders, a shader Level-of-Detail (LOD) system can be
implemented, similar to geometric Level-of-Detail. Based on the distance from the
camera, quality and cost are traded for energy efficiency: the further the object is from
the camera, the lower the shader cost and quality, and the greater the energy
efficiency.
Granularity of the LOD system can be based per-frame or per-object. The LOD value
algorithm would follow the distance from the camera. For a lighting system, the LOD
levels could be:
1. Normal-mapped per-pixel lighting with additional detailed maps attached
2. Normal-mapped per-pixel lighting without detailed map
3. Vertex lit with a color texture only
Avoid Uber-Shaders
Uber-shaders combine multiple shaders into a single shader that uses static branching.
Using them makes good sense if you are trying to reduce state changes and batch draw
calls. However, this often comes at the expense of an increased GPR count, which has
an impact on performance.
If some pixels in a thread are killed, and others are not, the shader still
executes.
If clip is used, Early-Z must be disabled by the driver. The reason for this is
because if Early-Z is enabled and a pixel is killed in the pixel shader, the depth
buffer value would be incorrectly updated for that pixel. The driver must
therefore disable Early-Z in the case when a shader uses the clip() function.
It is recommended to only use the clip() instruction in the pixel shader when it is
absolutely necessary to achieve an effect like alpha test. Otherwise, do not use clip() as
a performance optimization because it may result in reduced performance.
Applications should always use indexed rendering and create index buffers for holding
the indices. If an application chooses not to use indexed rendering, it will not benefit
from the vertex reuse described in Figure 5-2. The best way to achieve optimal reuse of
the post-transform vertex cache is to order the index buffers using either triangle strips
or strip-ordered indexed triangle lists (both should achieve roughly equivalent
performance).
Occlusion Queries
When in tile-based rendering mode, occlusion queries are batched along with the
rendering for a given render target. Since rendering for a given render target is not
processed until a flush condition is encountered, it is possible that the GPU hasn’t
processed the query by the time Present is called, even if ID3D11DeviceContext::Begin
and ID3D11DeviceContext::End were issued near the start of the frame. The most
efficient way to use occlusion queries is to issue the query on a previous frame, when
possible. Applications can call Flush or GetData to give the driver a hint that the
application needs the pending query, but this may incur an additional GMEM load and
GMEM store (see section 5.2).
If the occlusion data must be used in the same frame as the query, then after Begin and
End are issued for rendering on a given render target, it is best to wait until issuing
rendering commands on another render target before calling GetData. This way, the
natural flow of the driver will process the render target and the query, and there will not
be any additional GMEM loads or GMEM stores.
Never spin in a tight loop waiting for a query to be satisfied. While this works
functionally, it does not allow the application or driver to continue to feed commands to
the GPU, and it will hurt performance. Applications should have a fallback path to take
in case a query is not ready when GetData is called, as in the following sketch.
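A sketch of a non-blocking check with a fallback; the query is assumed to have been created at load time and begun/ended around the occluder draw on an earlier frame:

// Created once at load time:
//   D3D11_QUERY_DESC queryDesc = { D3D11_QUERY_OCCLUSION, 0 };
//   pD3DDevice->CreateQuery( &queryDesc, &pQuery );
// Issued on an earlier frame around the occluder geometry:
//   pD3DDeviceContext->Begin( pQuery.Get() ); ...draw... End( pQuery.Get() );
UINT64 samplesPassed = 0;
HRESULT hr = pD3DDeviceContext->GetData( pQuery.Get(), &samplesPassed,
                                         sizeof(samplesPassed),
                                         D3D11_ASYNC_GETDATA_DONOTFLUSH );
bool visible = true;  // fallback: treat the object as visible if not ready
if ( hr == S_OK )
{
    visible = ( samplesPassed > 0 );
}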
Updating Data
The ideal way to modify vertex and/or index buffer data is for an application to manage
its own buffers. The following application pseudo code will run optimally in the driver.
1. Create a buffer of a known size, preferably large enough to hold at least one
frame of data.
2. Keep a count of how much data has been added to that buffer.
3. If there is space in the buffer for your data, map the buffer using
D3D11_MAP_WRITE_NO_OVERWRITE to get a pointer to the surface. Write
the data to that location and increment the data count.
4. If there is not enough space in the buffer for your data, map the buffer using
D3D11_MAP_WRITE_DISCARD and reset the data count to zero.
5. Copy the new data into the mapped buffer, then unmap it.
The path described above is fairly commonly used by applications, and is optimized in
the driver.
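A sketch of steps 2 through 5 for a dynamic vertex buffer, where kBufferSize, nBytesUsed, and pDynamicVB are assumed application state:

D3D11_MAPPED_SUBRESOURCE mapped;
D3D11_MAP mapType = D3D11_MAP_WRITE_NO_OVERWRITE;
if ( nBytesUsed + nNewDataSize > kBufferSize )
{
    // Buffer is full: let the driver rename it and start from the beginning
    mapType = D3D11_MAP_WRITE_DISCARD;
    nBytesUsed = 0;
}
if ( SUCCEEDED( pD3DDeviceContext->Map( pDynamicVB.Get(), 0, mapType, 0, &mapped ) ) )
{
    memcpy( static_cast<BYTE*>( mapped.pData ) + nBytesUsed, pNewData, nNewDataSize );
    pD3DDeviceContext->Unmap( pDynamicVB.Get(), 0 );
    nBytesUsed += nNewDataSize;
}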
5.3.3 Textures
Compress Textures
Texture cache friendliness requires that textures be compressed whenever possible.
Note that render targets are not compressed. While CPU time could be spent to
compress resolved render targets, the CPU-GPU synchronization and the actual
compression time would make real-time rendering impossible. Use external tools like
Adreno Texture Converter to compress textures and normal maps.
Use Mipmaps
Mipmaps have a two-fold benefit when lower LOD mips can be used. When they are
fetched from system memory, they are smaller and consume less bandwidth. Also,
since lower LOD mips are significantly smaller, they are more likely to reside in the
texture cache, which eliminates any need to fetch data over the system memory bus.
If an application decides to use GenerateMips, it should do so when loading a scene to
avoid potential hitches during rendering.
Use Multi-Texturing
On the Adreno GPUs, multiple textures (as many as 16) can be used in a single render
pass. Blending is an inexpensive effect, so use multiple textures for possible effects
such as static lighting instead of dynamic lights.
Texture updates
When running a Direct3D11.1 Level 9 application, applications should consider avoiding
UpdateSubresource calls. The Direct3D runtime will create a temporary resource and
copy the system memory data into it prior to calling the driver to blt the data. For more
direct control over how the update is handled, applications should create and manage
their own resources and perform the updates themselves.
This chapter is a guide for developers migrating applications from OpenGL ES 2.0/3.0 to
Direct3D11.1 feature level 9_3. While providing similar functionality to OpenGL ES,
Direct3D11.1 is a substantially different API. The following sections detail how various
parts of the OpenGL ES API map to Direct3D11.1.
Feature                                  OpenGL ES 2.0       OpenGL ES 3.0  Direct3D11.1 FL 9_3
Line Width                               X                   X
Point Sprites                            X                   X
Alpha To Coverage                        X                   X
Non-Power-of-2 Textures                  X                   X              X
  (no mipmapping, limited wrap modes)
3D Textures                              GL_OES_texture_3D   X              X
BC1/BC2/BC3 Texture Compression                                             X
2D Texture Arrays                                            X
Occlusion Queries                                            X              X
Geometry Instancing                                          X              X
Transform Feedback                                           X
RGB9E5 shared-exponent textures                              X
In OpenGL ES, a vertex buffer object (VBO) for vertex data is created with code such as the following:
GLuint hBufferHandle;
glGenBuffers( 1, &hBufferHandle );
glBindBuffer( GL_ARRAY_BUFFER, hBufferHandle );
glBufferData( GL_ARRAY_BUFFER, nSize, pVertexData, GL_STATIC_DRAW ); // app-provided size/data
An ID3D11Buffer used for storing the same vertex data could be created with the
following code in Direct3D11:
// Describe the buffer and supply the initial data ('vertices' is the
// application's vertex array)
CD3D11_BUFFER_DESC bdesc( sizeof(vertices), D3D11_BIND_VERTEX_BUFFER );
D3D11_SUBRESOURCE_DATA vertexBufferData = { vertices, 0, 0 };
ComPtr<ID3D11Buffer> buffer;
if ( FAILED( pD3DDevice->CreateBuffer( &bdesc, &vertexBufferData, &buffer) ) )
{
    return FALSE;
}
A similar mapping of code exists between creating an OpenGL ES index buffer object
(IBO) and an ID3D11Buffer in Direct3D11 for storing indices. In OpenGL ES, VBOs are
bound for rendering using glBindBuffer() and then bound to vertex attributes using
glVertexAttribPointer() and glEnableVertexAttribArray(). In Direct3D11, vertex and
index buffers are bound to the Input Assembler (IA) using code such as the following:
UINT32 offset = 0;
pD3DDeviceContext->IASetVertexBuffers(
0, // StartSlot
1, // NumBuffers
pVertexBuffer.GetAddressOf(), // VertexBuffers
&m_nVertexSize, // Strides
&offset); // Offsets
One big difference between OpenGL ES and Direct3D11 is the way in which vertex data
gets bound as inputs to the vertex shader. In Direct3D11, an ID3D11InputLayout object
must be created with a reference to the vertex shader byte code. The
ID3D11InputLayout object defines how the vertex data is laid out in memory and also
how it binds to the HLSL semantics provided in the vertex shader.
For example, let’s say that there is a vertex shader that declares the following set of
inputs:
struct VertexShaderInput
{
float2 vVertexPos : POSITION;
float4 vVertexColor: COLOR0;
float2 vVertexTex : TEXCOORD0;
};
D3D11_INPUT_ELEMENT_DESC inputDesc[] =
{
    {"POSITION", 0, DXGI_FORMAT_R32G32_FLOAT, 0, 0, D3D11_INPUT_PER_VERTEX_DATA, 0},
    {"COLOR", 0, DXGI_FORMAT_R32G32B32A32_FLOAT, 0, 8, D3D11_INPUT_PER_VERTEX_DATA, 0},
    {"TEXCOORD", 0, DXGI_FORMAT_R32G32_FLOAT, 0, 24, D3D11_INPUT_PER_VERTEX_DATA, 0},
};
ComPtr<ID3D11InputLayout> inputLayout;
result = !FAILED(
pD3DDevice->CreateInputLayout(
inputDesc,
3,
vertexShaderByteCode,
vertexShaderByteCodeSize,
&inputLayout) );
At render time, the vertex layout must be set to the input assembler as follows:
pD3DDeviceContext->IASetInputLayout(inputLayout.Get());
6.2.2 Textures
In OpenGL ES 2.0, texture and sampler state were bound together in a single object
called a texture object. OpenGL ES 3.0 introduces sampler objects that separate
sampler state from texture data. In Direct3D11.1, much like OpenGL ES 3.0, textures
and sampler state are separate. Additionally, in Direct3D11.1 an
ID3D11ShaderResourceView is needed to be able to access a texture in a shader. The
Direct3D11.1 separation of texture data, sampler state, and shader resource view allow
for increased flexibility (especially at higher feature levels), but also require more setup
code than OpenGL ES.
The following code creates a texture object in OpenGL ES 2.0, loads it with data (a
single mip level), and sets sampler state:
glGenTextures( 1, (GLuint*)pTexId );
glBindTexture( GL_TEXTURE_2D, *pTexId );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT );
glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA, nMipWidth, nMipHeight,
nBorder, GL_RGBA, GL_UNSIGNED_BYTE,
pInitialData );
The equivalent setup in Direct3D11.1 creates the texture, a shader resource view, and a sampler state object:
D3D11_SUBRESOURCE_DATA textureSubresourceData[1];
ZeroMemory(&textureSubresourceData[0], sizeof(D3D11_SUBRESOURCE_DATA));
textureSubresourceData[0].pSysMem = pInitialData;
textureSubresourceData[0].SysMemPitch = nMipWidth * nBPP / 8;
textureSubresourceData[0].SysMemSlicePitch = 0;
// Texture and view descriptions (format and mip count assumed for this example)
CD3D11_TEXTURE2D_DESC textureDesc( DXGI_FORMAT_R8G8B8A8_UNORM,
                                   nMipWidth, nMipHeight, 1, 1 );
CD3D11_SHADER_RESOURCE_VIEW_DESC textureViewDesc( D3D11_SRV_DIMENSION_TEXTURE2D,
                                                  textureDesc.Format );
ComPtr<ID3D11Texture2D> texture;
ComPtr<ID3D11ShaderResourceView> textureView;
ComPtr<ID3D11SamplerState> sampler;
if ( FAILED(
pD3DDevice->CreateTexture2D(
&textureDesc,
&textureSubresourceData[0],
&texture ) ) )
{
return FALSE;
}
if ( FAILED( pD3DDevice->CreateShaderResourceView(
texture.Get(),
&textureViewDesc,
&textureView
) ) )
{
return FALSE;
}
// Create a sampler
D3D11_SAMPLER_DESC samplerDesc;
ZeroMemory(&samplerDesc, sizeof(samplerDesc));
samplerDesc.Filter = D3D11_FILTER_MIN_MAG_MIP_LINEAR;
samplerDesc.MaxAnisotropy = 0;
samplerDesc.AddressU = D3D11_TEXTURE_ADDRESS_WRAP;
samplerDesc.AddressV = D3D11_TEXTURE_ADDRESS_WRAP;
samplerDesc.AddressW = D3D11_TEXTURE_ADDRESS_WRAP;
samplerDesc.MipLODBias = 0.0f;
samplerDesc.MinLOD = 0;
samplerDesc.MaxLOD = D3D11_FLOAT32_MAX;
samplerDesc.ComparisonFunc = D3D11_COMPARISON_NEVER;
samplerDesc.BorderColor[0] = 0.0f;
samplerDesc.BorderColor[1] = 0.0f;
samplerDesc.BorderColor[2] = 0.0f;
samplerDesc.BorderColor[3] = 0.0f;
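A sketch of the creation call for the sampler state described above:

if ( FAILED( pD3DDevice->CreateSamplerState( &samplerDesc, &sampler ) ) )
{
    return FALSE;
}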
pD3DDeviceContext->PSSetSamplers(
    nTextureUnit,
    1,
    sampler.GetAddressOf());
The following OpenGL ES code enables back-face culling, depth testing, and alpha blending:
glEnable( GL_CULL_FACE );
glCullFace( GL_BACK );
glEnable( GL_DEPTH_TEST );
glDepthFunc( GL_LEQUAL );
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
Setting the same set of render state in Direct3D11 requires creating several render state
objects and setting them to be current as shown next.
ComPtr<ID3D11RasterizerState> rasterizerState;
D3D11_RASTERIZER_DESC rdesc = CD3D11_RASTERIZER_DESC(D3D11_DEFAULT);
rdesc.CullMode = D3D11_CULL_BACK;
rdesc.FrontCounterClockwise = TRUE;
pD3DDevice->CreateRasterizerState(&rdesc, &rasterizerState);
ComPtr<ID3D11DepthStencilState> depthStencilState;
D3D11_DEPTH_STENCIL_DESC dsdesc = CD3D11_DEPTH_STENCIL_DESC(D3D11_DEFAULT);
dsdesc.DepthFunc = D3D11_COMPARISON_LESS_EQUAL;
dsdesc.DepthEnable = TRUE;
pD3DDevice->CreateDepthStencilState(&dsdesc, &depthStencilState);
ComPtr<ID3D11BlendState> blendState;
D3D11_BLEND_DESC bdesc = CD3D11_BLEND_DESC(D3D11_DEFAULT);
bdesc.RenderTarget[0].BlendOp = D3D11_BLEND_OP_ADD;
bdesc.RenderTarget[0].SrcBlend = D3D11_BLEND_SRC_ALPHA;
bdesc.RenderTarget[0].DestBlend = D3D11_BLEND_INV_SRC_ALPHA;
bdesc.RenderTarget[0].BlendEnable = TRUE;
pD3DDevice->CreateBlendState(&bdesc, &blendState);
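To make these objects current, the corresponding set calls look like the following (the blend factor and sample mask shown are typical defaults):

pD3DDeviceContext->RSSetState( rasterizerState.Get() );
pD3DDeviceContext->OMSetDepthStencilState( depthStencilState.Get(), 0 );
const float blendFactor[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
pD3DDeviceContext->OMSetBlendState( blendState.Get(), blendFactor, 0xFFFFFFFF );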
It is important to understand that in order to make the most efficient use of render state
objects in Direct3D11, the application should minimize the number of times the objects
are set. Further, the render state objects should ideally be created at load/startup time
rather than dynamically. At runtime, the application should only be setting render state
objects current rather than creating new ones. This is not always possible, but ideally
this method should be employed.
6.2.4 Uniforms/Constants
In OpenGL ES 2.0, all shader constants are declared as uniform variables in GLSL and
set with values using the glUniform*() APIs. Loading uniform data is a frequent
bottleneck in OpenGL ES 2.0 applications, especially when dealing with large volumes
of data such as in matrix palette skinning. Much like with the render state API, the
Direct3D11 designers took an approach aimed at efficiency. All constant data is loaded
to shaders in Direct3D11 through the use of constant buffers. A constant buffer is
basically a grouping of constants that will be updated at a similar frequency. For
example, the following shows a constant buffer declaration in HLSL:
cbuffer BumpedReflectionConstantBuffer : register(b0)
{
float4x4 MatModelViewProj;
float4x4 MatModelView;
float4 LightPos;
float4 EyePos;
float FresnelPower;
float SpecularPower;
};
A matching constant buffer resource is then created with the D3D11_BIND_CONSTANT_BUFFER flag:
D3D11_SUBRESOURCE_DATA constantBufferData;
constantBufferData.pSysMem = pSrcConstants;
constantBufferData.SysMemPitch = 0;
constantBufferData.SysMemSlicePitch = 0;
ComPtr<ID3D11Buffer> buffer;
CD3D11_BUFFER_DESC cbDesc( nBufferSize, D3D11_BIND_CONSTANT_BUFFER );
if ( FAILED(
    pD3DDevice->CreateBuffer(
        &cbDesc,
        &constantBufferData,
        &buffer) ) )
{
    return FALSE;
}
pD3DDeviceContext->PSSetConstantBuffers( bufferIndex, 1,
                                         buffer.GetAddressOf());
To create a render target that can be used as a texture in Direct3D11, we need to create
a texture with both a shader resource view and a render target view. The following block
of code demonstrates creating a render target that has a texture color buffer and a
depth/stencil buffer.
D3D11_TEXTURE2D_DESC textureDesc;
ZeroMemory( &textureDesc, sizeof(textureDesc) );
textureDesc.Width = nWidth;
textureDesc.Height = nHeight;
textureDesc.MipLevels = 1;
textureDesc.ArraySize = 1;
textureDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
textureDesc.SampleDesc.Count = 1;
textureDesc.Usage = D3D11_USAGE_DEFAULT;
textureDesc.BindFlags = D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE;
textureDesc.CPUAccessFlags = 0;
textureDesc.MiscFlags = 0;
// Render target view
D3D11_RENDER_TARGET_VIEW_DESC renderTargetViewDesc;
ZeroMemory(&renderTargetViewDesc, sizeof(renderTargetViewDesc));
renderTargetViewDesc.Format = textureDesc.Format;
renderTargetViewDesc.ViewDimension = D3D11_RTV_DIMENSION_TEXTURE2D;
renderTargetViewDesc.Texture2D.MipSlice = 0;
// Sampler description
D3D11_SAMPLER_DESC samplerDesc;
ZeroMemory(&samplerDesc, sizeof(samplerDesc));
samplerDesc.Filter = D3D11_FILTER_MIN_MAG_MIP_POINT;
samplerDesc.MaxAnisotropy = 0;
samplerDesc.AddressU = D3D11_TEXTURE_ADDRESS_CLAMP;
samplerDesc.AddressV = D3D11_TEXTURE_ADDRESS_CLAMP;
samplerDesc.AddressW = D3D11_TEXTURE_ADDRESS_CLAMP;
samplerDesc.MipLODBias = 0.0f;
samplerDesc.MinLOD = 0;
samplerDesc.MaxLOD = D3D11_FLOAT32_MAX;
samplerDesc.ComparisonFunc = D3D11_COMPARISON_NEVER;
samplerDesc.BorderColor[0] = 0.0f;
samplerDesc.BorderColor[1] = 0.0f;
samplerDesc.BorderColor[2] = 0.0f;
samplerDesc.BorderColor[3] = 0.0f;
// Create Texture
ComPtr<ID3D11Texture2D> pTexture;
if ( FAILED(pD3DDevice->CreateTexture2D( &textureDesc, NULL, &pTexture) ) )
{
return FALSE;
}
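The render target view and the depth texture referenced below are created along the same lines; a sketch, assuming a typical D24S8 depth/stencil format:

ComPtr<ID3D11RenderTargetView> pRenderTargetView;
if ( FAILED(pD3DDevice->CreateRenderTargetView( pTexture.Get(),
                                                &renderTargetViewDesc,
                                                &pRenderTargetView ) ) )
{
    return FALSE;
}
// Depth texture for the depth/stencil view below
CD3D11_TEXTURE2D_DESC depthDesc( DXGI_FORMAT_D24_UNORM_S8_UINT, nWidth, nHeight,
                                 1, 1, D3D11_BIND_DEPTH_STENCIL );
ComPtr<ID3D11Texture2D> pDepthTexture;
if ( FAILED(pD3DDevice->CreateTexture2D( &depthDesc, NULL, &pDepthTexture ) ) )
{
    return FALSE;
}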
ComPtr<ID3D11DepthStencilView> pDepthStencilView;
D3D11_DEPTH_STENCIL_VIEW_DESC depthStencilViewDesc;
ZeroMemory(&depthStencilViewDesc, sizeof(depthStencilViewDesc));
depthStencilViewDesc.Format = DXGI_FORMAT_D24_UNORM_S8_UINT; // must be a depth format matching the depth texture
depthStencilViewDesc.ViewDimension = D3D11_DSV_DIMENSION_TEXTURE2D;
depthStencilViewDesc.Texture2D.MipSlice = 0;
if ( FAILED(pD3DDevice->CreateDepthStencilView( pDepthTexture.Get(),
&depthStencilViewDesc,
&pDepthStencilView ) ) )
return FALSE;
Depth Textures
A common use of render targets is to render to a depth texture for an effect such as
shadow mapping. The Adreno GPU supports depth textures and this functionality is
exposed in OpenGL ES 2.0 via the GL_OES_depth_texture and is part of the OpenGL
ES 3.0 core functionality. However, Microsoft chose not to expose depth textures in
Direct3D11.1 feature level 9_3. As a consequence, the easiest way to achieve the
equivalent effect is to render to a single-channel 32-bit color texture
(DXGI_FORMAT_R32_FLOAT) and write a shader that computes the depth value and
writes it to the color buffer. To get the best use of precision, it is recommended to write
the value 1.0 - (z/w) to each pixel. The following block of HLSL code writes the
depth value to an interpolator that is sent to the pixel shader:
vso.Position = mul(MatModelViewProj, Position);
// Store depth for pixel shader
// Depth is Z/W.
// Use 1 - z/w to get better precision
vso.Depth = (1.0 - vso.Position.z / vso.Position.w);
Several examples in the DirectX Adreno SDK make use of this functionality including the
DepthOfField, ShadowMap, and CascadedShadowMap examples. Please see the
source code for these examples for more details on how to emulate depth textures.
void main()
{
vec4 Position = g_PositionIn;
vec2 TexCoord = g_TexCoordIn;
vec3 Tangent = g_TangentIn;
vec3 Binormal = g_BinormalIn;
vec3 Normal = g_NormalIn;
g_TexCoord = TexCoord;
g_Normal = Normal;
This same vertex shader in HLSL using input and output semantics is shown below.
cbuffer MaterialConstantBuffer
{
float4x4 MatModelViewProj;
float4x4 MatModel;
float4 LightPos;
float4 EyePos;
};
struct VertexShaderInput
{
float4 Position : POSITION;
float2 TexCoord : TEXCOORD0;
float3 Tangent : TANGENT;
float3 Binormal : BINORMAL;
float3 Normal : NORMAL;
};
struct PixelShaderInput
{
float4 Position : SV_POSITION;
float2 TexCoord : TEXCOORD0;
float4 LightVec : TEXCOORD1;
float3 ViewVec : TEXCOORD2;
float3 Normal : TEXCOORD3;
};
return vso;
}
Another difference you will notice between these two vertex shaders is the use of the
mul() intrinsic function for multiplying the MatModelViewProj matrix times the Position.
In GLSL, the * operator is used for multiplication by matrices, but in HLSL it must be
done explicitly using the mul() intrinsic function.
Textures
As mentioned in section 6.2.2, Direct3D11 separates sampler state from texture state.
In HLSL, you must declare both a Texture and SamplerState in order to be able to fetch
from a texture. The following block of code shows fetching from a texture in GLSL in
OpenGL ES 2.0:
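A minimal sketch of the equivalent pair, with BaseTexture and BaseSampler as hypothetical names:

uniform sampler2D BaseTexture;
...
vec4 color = texture2D( BaseTexture, g_TexCoord );

And in HLSL for Direct3D11:

Texture2D BaseTexture : register(t0);
SamplerState BaseSampler : register(s0);
...
float4 color = BaseTexture.Sample( BaseSampler, input.TexCoord );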
Note the use of the register(##) keyword. This is used to bind the texture and sampler to
a specific unit. In GLSL, this is done via setting the value of the uniform, but in HLSL it is
done explicitly in the shader code.
7 References

R1  Programming Guide for Direct3D 11, Microsoft.
    http://msdn.microsoft.com/en-us/library/windows/desktop/ff476345(v=vs.85).aspx

R2  Visual C++ and WinRT/Metro - Some fundamentals, by Nish Sivakumar, CodeProject.
    http://www.codeproject.com/Articles/262151/Visual-Cplusplus-and-WinRT-Metro-Some-fundamentals

R3  Programming Windows, 6th Edition, by Charles Petzold, Microsoft Press.