Permutation Coding for Vertex-Blend Attribute Compression
Christoph Peters, Bastian Kuth, and Quirin Meyer
Compression of vertex attributes is crucial to keep bandwidth requirements in real-time rendering low. We
present a method that encodes any given number of blend attributes for skinning at a fixed bit rate while
keeping the worst-case error small. Our method exploits that the blend weights are sorted. With this knowledge,
no information is lost when the weights get shuffled. Our permutation coding thus encodes additional data,
e.g. about bone indices, into the order of the weights. We also transform the weights linearly to ensure full
coverage of the representable domain. Through a thorough error analysis, we arrive at a nearly optimal
quantization scheme. Our method is fast enough to decode blend attributes in a vertex shader and also to
encode them at runtime, e.g. in a compute shader. Our open source implementation supports up to 13 weights
in up to 64 bits.
Additional Key Words and Phrases: skinning, linear vertex blend animation, vertex buffer compression, vertex
blend attribute compression, permutation coding, bone weights, bone indices, simplex, tetrahedron
Proc. ACM Comput. Graph. Interact. Tech., Vol. 5, No. 1, Article 5763368. Publication date: May 2022.
5763368:2 Christoph Peters, Bastian Kuth, and Quirin Meyer
1 INTRODUCTION
Graphics hardware keeps improving rapidly but its computational power grows faster than the
available memory bandwidth. Thus, it is increasingly important in real-time rendering to encode
the scene representation as compactly as possible. This compact representation has to be usable
for rendering directly. For textures, block compression accomplishes this goal. For geometry, the
least intrusive approach is to compress each vertex individually at a fixed bit rate. Such methods
are widely used because except for some packing code for vertex buffers and unpacking code in
vertex shaders, they require no changes to the renderer. Simple fixed-point quantization works well
for vertex positions and texture coordinates. For vertex normals, octahedral maps are a popular
approach [Meyer et al., 2010].
Skinned meshes need additional attributes for blending: Each vertex stores multiple indices of
bones and corresponding weights defining the influences of these bones. Storing only 4 influences
at 8 bits per weight or index already takes 64 bits. A reasonably quantized vertex format without
blend attributes takes 128 bits (see Sec. 4.3). Thus, the space requirements of blend attributes
are significant. Nonetheless, their compression has not received attention until recently [Kuth
and Meyer, 2021]. This first work is focused on meshes with up to four weights per vertex. The
restriction to four weights is common in game engines, even though dense-weight blend skinning
with many influences per vertex is known to give clear visual improvements [Le and Deng, 2013].
Kuth and Meyer [2021] also formalize naive techniques that generalize to arbitrarily many weights
but these are far from optimal.
We present a more general and arguably more elegant solution. Our technique works for arbi-
trarily many weights and our GPU implementation supports up to 13 weights encoded in up to
64 bits per vertex. Like prior work [Kuth and Meyer, 2021], we store each tuple of bone indices
only once in a table. Thus, the data stored per vertex consist of the weights themselves and a single
tuple index (Sec. 3.1).
Our core insight is that no information is lost when we shuffle a strictly ordered sequence of
blend weights 𝑤 0 < . . . < 𝑤 𝑁 −1 before storage. We recover the original sequence efficiently through
sorting. As we do so, we also recover the permutation that was applied to it. We then turn this
permutation into an index from 0 to 𝑁 ! − 1. Since we are in control of what permutation we apply,
this scheme allows us to hide log2 (𝑁 !) bits of information in the blend weights without requiring
any additional storage (Sec. 3.3). In particular, we store (part of) the tuple index for the bone indices
this way.
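The idea can be made concrete with a toy Python sketch (ours, not the paper's codec): to hide a message m ∈ {0, …, N! − 1}, apply the m-th permutation (in lexicographic order) to the sorted weights; sorting the shuffled sequence recovers both the weights and m. Enumerating all N! permutations is only feasible for tiny N; the paper instead ranks permutations with Lehmer codes.

```python
from itertools import permutations

def hide_bits(message, sorted_vals):
    """Shuffle strictly increasing values with the message-th permutation
    of their positions, in lexicographic order."""
    n = len(sorted_vals)
    sigma = list(permutations(range(n)))[message]
    return [sorted_vals[i] for i in sigma]

def recover_bits(shuffled):
    """Sorting recovers both the original values and the hidden message."""
    n = len(shuffled)
    inv = sorted(range(n), key=lambda p: shuffled[p])  # inverse permutation
    sigma = [0] * n
    for j, p in enumerate(inv):
        sigma[p] = j
    message = list(permutations(range(n))).index(tuple(sigma))
    return sorted(shuffled), message
```

For N = 3, each of the 3! = 6 messages round-trips through shuffle and sort, so log2(6) ≈ 2.6 bits travel for free.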
It is suboptimal to quantize the individual weights 𝑤 0, . . . , 𝑤 𝑁 −1 directly because they are subject
to inequalities. Therefore, many possible codes do not correspond to meaningful weights (Fig. 1 left).
We address this issue with a linear transform that expands the space but preserves the ordering
(Fig. 1 middle, Sec. 3.4). The impact of this transform on the quantization error requires a careful
analysis, which also reveals a shortcoming of prior work [Kuth and Meyer, 2021] (Sec. 3.5). We
overcome this shortcoming and derive nearly optimal quantization schemes for different weight
counts (Sec. 3.6). In the end, all quantized numbers get coded into up to 64 bits (Sec. 3.7).
For four weights, our method is faster and more accurate than the best prior work [Kuth and
Meyer, 2021]. It naturally supports more weights and scales well. We find that 48 and 64 bits provide
sufficient accuracy for eight and 13 weights, respectively (Sec. 4). Our supplemental includes full
source code for our renderer and our experiments.
Proc. ACM Comput. Graph. Interact. Tech., Vol. 5, No. 1, Article 5763368. Publication date: May 2022.
Permutation Coding for Vertex-Blend Attribute Compression 5763368:3
2 RELATED WORK
The main incentives for GPU data compression are memory savings and reduced bandwidth and
power consumption. Since textures typically consume most memory, GPUs provide hardware-
support for random read-access from various lossy compressed texture formats [Garrard, 2020,
Nystad et al., 2012]. Additionally, GPUs utilize on-the-fly compression techniques by default
[McAllister et al., 2014].
Compression of mesh topology and vertex positions is well-studied but even the methods that
emphasize random access usually need to decompress several faces at once [Maglo et al., 2015].
Calver [2002, 2004] introduced quantization for vertex-attribute compression by decoding vertex
data in the vertex shader. Purnomo et al. [2005] carefully determine the number of bits allocated for
each attribute channel. Quantization techniques are now commonly used in games [Geffroy et al.,
2020, Karis et al., 2021, Persson, 2012]. Special compression schemes for unit vectors [Keinert et al.,
2015, Meyer et al., 2010, Rousseau and Boubekeur, 2020] and tangent frames [Frey and Herzeg,
2011, Geffroy et al., 2020] exploit their particular properties and allow efficient decoding in vertex
shaders.
Vertex blending, also known as skinning, animates a dense mesh using a hierarchy of bones
[Magnenat-Thalmann et al., 1989]. In each frame, each bone holds a transformation T𝑖 ∈ R4×4 .
Mesh vertices carry a rest position p ∈ R4 (in homogeneous coordinates), indices of relevant
bones j_0, …, j_N and corresponding weights w_0, …, w_N ≥ 0. Linear vertex blending applies the transformation of each relevant bone and computes the animated position as the convex combination Σ_{i=0}^{N} w_i T_{j_i} p. It maps well to vertex shaders.
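As a minimal sketch of this convex combination (in Python with NumPy for clarity; function names are ours):

```python
import numpy as np

def blend_position(p, transforms, indices, weights):
    """Animated position as sum_i w_i * T_{j_i} * p, where p is a
    homogeneous rest position and transforms[j] the 4x4 matrix of bone j."""
    return sum(w * (transforms[j] @ p) for j, w in zip(indices, weights))

# Two bones: identity and a translation by (1, 0, 0).
T0 = np.eye(4)
T1 = np.eye(4)
T1[0, 3] = 1.0
p = np.array([0.0, 0.0, 0.0, 1.0])
q = blend_position(p, [T0, T1], indices=[0, 1], weights=[0.25, 0.75])
```

With weights 0.25 and 0.75, the vertex moves three quarters of the way along the translated bone.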
More sophisticated methods for the representation and combination of the transformations
address issues such as elbow collapse, joint bulging or candy wrapper artifacts [Alexa, 2002, Kavan
et al., 2008, Le and Hodgins, 2016]. When using optimized virtual bones, which blend influences
of multiple bones, two weights per vertex suffice [Le and Deng, 2013]. Direct delta mush [Le and
Lewis, 2019] efficiently approximates Laplacian smoothing to reduce the demands on artist-defined
blend weights. It replaces scalar weights by 4 × 4 matrices, which can be compressed using a coarse
direct delta mush and a fine vertex-blend model [Le et al., 2021]. To reduce memory demands of
animations, compression of bone transformations is viable [Fréchette, 2017].
the tetrahedron and stores this index. Unlike the previous methods, OSS does not generalize to
more weights naturally. Decoding requires the solution of a polynomial equation of degree 𝑁 . For
𝑁 > 4, that might be impossible in closed form [Abel, 1826].
Since the greatest weight 𝑤 𝑁 can be computed from the others, we do not store it explicitly.
Most of the time, nearby vertices use exactly the same bone indices. Therefore, it is inefficient
to store these indices per vertex. Like prior work [Kuth and Meyer, 2021], we create a table of
all tuples of bone indices in a mesh instead. Then storing the tuple of bone indices per vertex is
accomplished by storing a tuple index referencing the matching entry in this table. The table would
be smaller if we were to sort by bone indices instead of sorting by weights but the benefits of
sorting by weights turn out to be greater (Sec. 4.1).
For creation of the table, we employ a few novel optimizations. If 𝑤 𝑁 = 1 after decoding, only
one bone influences the vertex. In this case, the tuple index is used as bone index directly and we
skip use of the table. Additionally, we exploit that bone indices for zero weights are irrelevant as
we reuse tuples. Fig. 2 illustrates our strategy. We treat irrelevant indices as ∞ (or as 2^16 − 1 in our implementation with 16-bit indices). Then we sort lexicographically, taking the index for the
largest weight as most significant. If a tuple allows reuse, it is found at the beginning of a run of
matching tuples. We find all of them with a single scan over the sorted array.
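One way to realize this reuse scan in Python (a sketch under our own naming; the constant INVALID stands in for ∞ with 16-bit indices): after the lexicographic sort, a tuple whose tail consists of irrelevant indices directly follows the run of full tuples sharing its concrete prefix, so one backward-looking pass suffices.

```python
INVALID = 0xFFFF  # plays the role of "infinity" for 16-bit bone indices

def build_tuple_table(vertex_tuples):
    """Deduplicate bone-index tuples; tuples whose trailing entries are
    irrelevant (INVALID) reuse an existing full tuple when possible."""
    table, remap = [], {}
    for t in sorted(set(vertex_tuples)):
        prefix = tuple(i for i in t if i != INVALID)
        if len(prefix) < len(t) and table and table[-1][:len(prefix)] == prefix:
            remap[t] = table[-1]  # irrelevant indices may take any value
        else:
            table.append(t)
            remap[t] = t
    return table, remap
```

For example, (3, 7, INVALID) and (3, INVALID, INVALID) both reuse an entry (3, 7, 5) and contribute no table entries of their own.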
Our goal is to store the blend weights 𝑤 0, . . . , 𝑤 𝑁 and the corresponding tuple index in as little
memory as possible. The amount of memory should be fixed to accommodate restrictions for vertex
shader inputs. The absolute worst-case error in the vector of blend weights (w_0, …, w_N) should be small in terms of the 2-norm, and any weight count N + 1 should be supported.
Our encoder first applies a linear transform to the vector of weights (𝑤 0, . . . , 𝑤 𝑁 −1 ) to make it fill
a greater portion of the hypercube [0, 1] 𝑁 (Fig. 1 left and middle). This way, we avoid many invalid
codes. The resulting vector gets quantized entry by entry. Due to the transform, quantizing each
entry with the same precision gives lower accuracy for greater weights. Therefore, we determine
suitable precision factors 𝐵 0, . . . , 𝐵 𝑁 −1 ∈ N. Entry 𝑖 ∈ {0, . . . , 𝑁 − 1} gets quantized into an integer
𝑎𝑖 𝐵𝑖 + 𝑏𝑖 where the more significant 𝑎𝑖 ∈ {0, . . . , 𝐴 − 1} provides the same precision for each entry
and 𝑏𝑖 ∈ {0, . . . , 𝐵𝑖 − 1} provides additional precision as indicated by 𝐵𝑖 .
We encode the tuple index and 𝑏 0, . . . , 𝑏 𝑁 −1 into a single integer 𝑝 ∈ {0, . . . , 𝑃 − 1}, which we
call the payload. All that is left to do is to store 𝑝 and 𝑎 0, . . . , 𝑎 𝑁 −1 as compactly as possible. Due to
the sorted weights and the specifics of our quantization, we know 𝑎 0 < 𝑎 1 < . . . < 𝑎 𝑁 −1 . Therefore,
we lose no information if we shuffle this sequence before storage using one of the 𝑁 ! possible
permutations. We use log2 (𝑁 !) bits of the payload 𝑝 to pick this permutation. The remaining
log2 𝑃 − log2 (𝑁 !) bits are stored separately (we pick the precision factors such that 𝑃 ≥ 𝑁 !).
During decoding, we sort the sequence and thus recover the original 𝑎 0, . . . , 𝑎 𝑁 −1 as well as the
log2 (𝑁 !) bits of the payload. Thus, we have saved log2 (𝑁 !) bits of memory by finding a good use
for all the codes, which correspond to sequences that are not sorted (Fig. 1 right).
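The split of the payload follows directly from this accounting; a small Python sketch (function names ours) with the low part selecting one of the N! permutations and the high part stored explicitly:

```python
import math

def split_payload(p, n):
    """Split payload p: the low log2(n!) bits pick a permutation of the
    n quantized values; the remainder is stored separately."""
    return p % math.factorial(n), p // math.factorial(n)

def merge_payload(perm_index, stored, n):
    """Recombine the permutation index recovered by sorting with the
    separately stored remainder."""
    return stored * math.factorial(n) + perm_index
```

The capacity gained matches Table 1: for 13 weights (N = 12), log2(12!) ≈ 28.8 bits of the payload ride along in the ordering.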
Table 1. Key quantities concerning the efficiency of our permutation coding. 13! does not fit into 32 bits.

Weight count N + 1       2    3    4    5    6    7     8     9     10    11    12    13
log2(N!)                 0    1    2.6  4.6  6.9  9.5   12.3  15.3  18.5  21.8  25.3  28.8
Sorting network size     0    1    3    5    9    12    16    19    25    29    35    39
Sorting network depth    0    1    3    3    5    5     6     6     7     9     8     9
To realize this scheme, we have to choose a mapping 𝜑. That means we have to pick an ordering
of the set of permutations S𝑁 . All (𝑁 !)! possible choices would work but we seek one that lets us
evaluate 𝜑 and 𝜑 −1 efficiently. We choose to order the permutations 𝜎 ∈ S𝑁 through a lexicographic
ordering of the index tuples (𝜎 (0), . . . , 𝜎 (𝑁 − 1)). Algorithm 1 implements 𝜑 −1 (𝑟 ). It constructs
the index tuple from right to left. Step 𝑖 extracts a base-𝑖 digit 𝑑 from the input 𝑟 to choose the
next entry. Since more significant digits determine entries further to the left, the ordering is
indeed lexicographic. Algorithm 2 reverts this process by extracting digits from left to right. Its
implementation with a bitmask is GPU friendly. These algorithms were described in more detail by
Lehmer [1960] and first proposed by his father in 1906. Thus, we call 𝜑 (𝜎) a Lehmer code.
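A Python sketch of both mappings (our transliteration of the algorithms described above, not the paper's GLSL): decoding extracts factorial-base digits from r, most significant digit choosing the leftmost entry; encoding counts, for each entry, how many smaller entries follow it.

```python
def index_to_permutation(r, n):
    """Return the r-th permutation of (0, ..., n-1) in lexicographic
    order, r in {0, ..., n!-1}, via its Lehmer code."""
    digits = []
    for i in range(1, n + 1):      # base-i digits, least significant first
        r, d = divmod(r, i)
        digits.insert(0, d)        # most significant digit goes leftmost
    remaining = list(range(n))
    return [remaining.pop(d) for d in digits]

def permutation_to_index(perm):
    """Inverse mapping: accumulate the Lehmer digits with Horner's rule."""
    n = len(perm)
    r = 0
    for i, p in enumerate(perm):
        smaller_after = sum(1 for q in perm[i + 1:] if q < p)
        r = r * (n - i) + smaller_after
    return r
```

For n = 4, enumerating all 24 indices reproduces the permutations in exactly lexicographic order.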
The shuffling itself is a bit tricky to do on GPUs because when an array entry is accessed using a
dynamically computed index, that causes costly register spilling. To generate the shuffled sequence
𝑎𝜎 −1 (0) , . . . , 𝑎𝜎 −1 (𝑁 −1) , we write the indices 𝜎 (0), . . . , 𝜎 (𝑁 − 1) into the most significant bits of
𝑎 0, . . . , 𝑎 𝑁 −1 and then sort this sequence. To recover the permutation during sorting, we first write
the indices 0, . . . , 𝑁 − 1 into the least significant bits of 𝑎𝜎 −1 (0) , . . . , 𝑎𝜎 −1 (𝑁 −1) (after shifting other
bits to the left). After sorting, the least significant bits are 𝜎 (0), . . . , 𝜎 (𝑁 − 1), i.e. the input to
Algorithm 2. Sorting is done by optimal sorting networks [Knuth, 1998] (see Table 1).
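The bit-packing trick can be modeled in Python as follows (a sketch with our names; in the shader, `sorted()` is replaced by an optimal sorting network):

```python
def shuffle_by_sorting(a, sigma, value_bits):
    """Send a[i] to position sigma[i]: tag each value with sigma[i] in the
    most significant bits, sort, then strip the tags."""
    tagged = sorted((sigma[i] << value_bits) | a[i] for i in range(len(a)))
    return [t & ((1 << value_bits) - 1) for t in tagged]

def recover_by_sorting(shuffled, index_bits):
    """Tag each value with its position in the least significant bits;
    after sorting, the tags read off sigma(0), ..., sigma(N-1)."""
    tagged = sorted((v << index_bits) | p for p, v in enumerate(shuffled))
    mask = (1 << index_bits) - 1
    return [t >> index_bits for t in tagged], [t & mask for t in tagged]
```

Since the values a_0 < … < a_{N−1} are strictly increasing, the tags never influence the relative order of distinct values, and no dynamically indexed array accesses are needed.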
In the special case 𝑁 = 3, the simplex is a tetrahedron [Kuth and Meyer, 2021] (Fig. 1 left).
Now consider the linear transform defined by

u_i := (N + 1 − i) w_i + Σ_{j=0}^{i−1} w_j,    (2)

which maps the corners of the simplex of valid weight vectors to vectors of the form

(0, …, 0, 1, …, 1)^T ∈ [0, 1]^N with k trailing ones, for k ∈ {0, …, N}.
These vectors are exactly the corners of the unit hypercube [0, 1] 𝑁 at which the coordinates happen
to be sorted. Thus, the simplex of valid weight vectors gets mapped to the larger simplex of sorted
tuples in the unit hypercube (Fig. 1 middle). After this transform, each coordinate 𝑢𝑖 covers the
full range [0, 1] as intended. Furthermore, by shuffling entries of the vector u := (𝑢 0, . . . , 𝑢 𝑁 −1 ) T ,
we can attain any point in the unit hypercube, which is an indication that this scheme could be
optimal (Fig. 1 right).
Appendix B proves that the following formula provides the inverse of the above transform:

w_i = u_i / (N + 1 − i) − Σ_{j=0}^{i−1} u_j / ((N + 1 − j)(N − j)).    (3)
Both transforms take only linear time to compute since the sums for different outputs share common
prefixes.
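Both directions can be sketched with running prefix sums in a few lines of Python (our naming, not the paper's implementation):

```python
def weights_to_u(w):
    """Equation (2): u_i = (N+1-i) w_i + sum_{j<i} w_j."""
    n = len(w)
    u, prefix = [], 0.0
    for i, wi in enumerate(w):
        u.append((n + 1 - i) * wi + prefix)
        prefix += wi          # shared prefix sum of the weights
    return u

def u_to_weights(u):
    """Equation (3): w_i = u_i/(N+1-i) - sum_{j<i} u_j/((N+1-j)(N-j))."""
    n = len(u)
    w, acc = [], 0.0
    for i, ui in enumerate(u):
        w.append(ui / (n + 1 - i) - acc)
        acc += ui / ((n + 1 - i) * (n - i))  # shared prefix of weighted u_j
    return w
```

For sorted weights such as (0.05, 0.1, 0.15, 0.3) with the implied w_N = 0.4, the transform yields a nondecreasing vector in [0, 1]^N and the inverse recovers the weights exactly up to floating-point rounding.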
3.6 Quantization
With this error analysis, we are prepared to quantize 𝑢 0, . . . , 𝑢 𝑁 −1 in a nearly optimal fashion. Two
aspects require special care. Firstly, entries of u can be equal but the quantized values a_i entering permutation coding must be strictly ordered, i.e. a_0 < … < a_{N−1}. Secondly, the integers a_0, …, a_{N−1} should all cover the same range but we need different precision for different entries of u.
To allow for different precision, we define precision factors 𝐵 0, . . . , 𝐵 𝑁 −1 ∈ N per entry. Then
𝑢𝑖 is stored by 𝑎𝑖 ∈ {0, . . . , 𝐴 − 1}, where 𝐴 ∈ N with 𝐴 > 𝑁 , and a less significant extra value
𝑏𝑖 ∈ {0, . . . , 𝐵𝑖 − 1}. We determine 𝑎𝑖 and 𝑏𝑖 through division with remainder such that
a_i B_i + b_i = ⌊(A − N) B_i u_i + (i + 1) B_i − 1/2⌋.    (6)
This formula is carefully designed to stay within the allowable range:

(A − N) B_i u_i + (i + 1) B_i − 1/2 ≥ (i + 1) B_i − 1/2 ≥ 0,
(A − N) B_i u_i + (i + 1) B_i − 1/2 ≤ (A + i + 1 − N) B_i − 1/2 < A B_i.
2 2
As long as B_0 ≤ … ≤ B_{N−1}, we also get a_{i+1} > a_i for all i ∈ {0, …, N − 2} because

(A − N) B_{i+1} u_{i+1} + (i + 2) B_{i+1} − 1/2 ≥ (A − N) B_i u_i + (i + 1) B_i − 1/2 + B_i.
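A Python sketch of this quantization (ours; it assumes the rounding operator lost in Equation (6) is a floor, which is the choice consistent with the range bounds above):

```python
import math

def quantize(u, A, B):
    """Quantize each u_i per Equation (6); returns (a_i, b_i) pairs
    obtained by division with remainder."""
    n = len(u)
    pairs = []
    for i, (ui, Bi) in enumerate(zip(u, B)):
        q = math.floor((A - n) * Bi * ui + (i + 1) * Bi - 0.5)
        pairs.append(divmod(q, Bi))  # (a_i, b_i)
    return pairs

def dequantize(pairs, A, B):
    """Midpoint reconstruction: invert the affine map in Equation (6)."""
    n = len(pairs)
    return [(a * Bi + b + 1 - (i + 1) * Bi) / ((A - n) * Bi)
            for i, ((a, b), Bi) in enumerate(zip(pairs, B))]
```

With nondecreasing u and B_0 ≤ … ≤ B_{N−1}, the resulting a_i are strictly increasing and each u_i is reconstructed to within half a quantization step, i.e. 1/(2(A − N)B_i).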
The decoder has to undo these steps. To this end, we perform repeated division with remainder in
reverse order. Note that the 𝐶𝑖 are compile time constants. If some of them are chosen as powers of
two, the compiler will implement the division through a right shift. We reward that in our brute
force search by allowing 0.7% greater error for each such division. This choice leads to considerably
more power-of-two divisors at the expense of tiny increases in error. For other divisors, compilers
perform similar optimizations but they are more costly nonetheless.
Our GLSL implementation supports encoding into up to 64 bits, i.e. into two 32-bit unsigned integers. Multiplication and addition with carry are natively supported by GLSL through umulExtended() and uaddCarry(). For division with remainder by a number C_i < 2^16, we implement a multiword division that operates on 16 bits at a time and implements carry using the less significant 16 bits of a 32-bit integer [Warren, 2012]. Our codec only uses 64-bit operations for the steps where the most significant 32 bits are potentially non-zero.
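The multiword division can be modeled in Python as long division over 16-bit limbs (a sketch in the spirit of the routine cited from Warren [2012], with our naming; every intermediate value fits into 32 bits because the carried remainder is below C_i < 2^16):

```python
def divmod_64_by_small(hi, lo, c):
    """Divide the 64-bit value (hi << 32) | lo by c < 2**16, processing
    16-bit limbs so that no intermediate exceeds 32 bits."""
    assert 0 < c < (1 << 16)
    limbs = [hi >> 16, hi & 0xFFFF, lo >> 16, lo & 0xFFFF]
    quotient, rem = 0, 0
    for limb in limbs:
        cur = (rem << 16) | limb          # fits in 32 bits since rem < c
        quotient = (quotient << 16) | (cur // c)
        rem = cur % c
    return quotient, rem
```

The result agrees with a native 64-bit divmod for any divisor below 2^16.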
Table 2. Potential savings in bits per vertex when sorting bone indices instead of blend weights for the scene
in Fig. 4 (201 k vertices) reduced to different numbers of weights per vertex. The benefit of a smaller table is
not enough to outweigh the increased cost for storing weights. The last row evaluates our optimizations from
Sec. 3.1.
Weight count 𝑁 + 1 4 6 8
Table size 𝑇 (sorted indices) 2327 2433 2467
Table size 𝑇 (sorted weights) 5977 6465 6482
Saving for the table 0.12 0.19 0.26
Saving for the tuple index 1.36 1.41 1.39
Saving for the weights -4.58 -9.49 -15.30
𝑇 as in Kuth and Meyer [2021] 9539 11187 11329
4 RESULTS
We now evaluate our technique. We start with data justifying our choice to sort blend weights
instead of bone indices (Sec. 4.1). Then we analyze the worst-case error of our technique in
comparison to prior work [Kuth and Meyer, 2021] with regard to the norm in Equation (4). Errors
in vertex positions of actual models reflect these theoretical numbers (Sec. 4.2). Finally, we report
frame times (Sec. 4.3) and timings for encoding and decoding on its own (Sec. 4.4).
Table 3. 2-norm errors across all weights (see Equations (8) and (9)) for different blend attribute compression
techniques. All values were scaled by 1000 to improve readability. Our technique always has the lowest error
and the advantage grows considerably for more weights.
Bit count 24 32 32 48 48 48 48
Weight count 𝑁 + 1 4 4 5 6 7 8 9
Supported table size 𝑇 1024 1024 2048 4096 2048 8192 4096
Unit cube sampling [2021] 115.47 13.64 72.13 21.56 51.43 120.70 282.84
POT AABB [2021] 27.94 6.82 36.07 10.78 25.72 60.35 68.43
Any AABB [2021] 23.73 3.72 17.75 5.00 10.87 25.63 37.55
OSS [2021] 13.53 2.06 - - - - -
Permutation coding, ours 9.28 1.34 4.97 1.00 1.78 3.70 4.85
Table 3 compares these worst-case errors for different techniques operating at different bit counts
and with different table sizes 𝑇 . Even the best prior work for 𝑁 = 3 [Kuth and Meyer, 2021] has a
46% to 53% greater error than our permutation coding. For greater N, OSS is not applicable and the advantage of our work is even greater; it grows with log2(N!). For 13 weights, our error is an
order of magnitude smaller than that of the best prior work and two orders of magnitude smaller
than that of unit cube sampling (i.e. simple fixed-point quantization of 𝑤 0, . . . , 𝑤 𝑁 −1 ∈ [0, 1]).
Halving the error takes roughly 𝑁 bits, so it would take another 42 bits to reach the same error
with any AABB [Kuth and Meyer, 2021].
Of course, the improved worst-case error compared to OSS [Kuth and Meyer, 2021] is due to our chosen error metric. With regard to the error in w_0, …, w_{N−1} alone, OSS is optimal by design. Fig. 3 demonstrates that our metric, which additionally accounts for the largest weight w_N, is indeed more relevant. The rounding errors in vertex positions are quite noisy but errors of our permutation coding tend to be lower than with OSS at equal bit count.

(a) Shaded (b) POT AABB (c) OSS (d) Ours
Fig. 3. Vertex positions for this skinned character (1.9 m tall) have been computed using compressed weights and weights provided as 32-bit floats. All compression techniques use four bytes for four weights and the tuple index. We color code the error in the world space positions. Model from mixamo.
Fig. 4. Our benchmark scene with 1400 character models from mixamo, rendered at 1280×1024.
Table 4. Total frame times in milliseconds for rendering Fig. 4 with various techniques. Bit counts refer to
weights and (tuple) indices, errors are as in Table 3.
Table 5 shows the results. We note that our technique for four weights is considerably less expensive than OSS. The cost of our technique scales roughly linearly with the number of weights.
Encoding is slightly faster than decoding. Although encoding is more commonly done on CPU,
such a fast GPU implementation could come in handy for interactive editors or procedural content.
Table 5. Timings for the computations of encoding or decoding blend attributes in picoseconds per vertex.
The combined overhead for the encoding and decoding tests (generating weights from the thread index and
tying them to outputs) is reported separately.
𝑁 +1 4 (OSS) 4 4 5 6 7 8 9 10 11 12 13
Bit count 32 24 32 32 48 48 48 48 64 64 64 64
Encoder - 10.2 10.3 13.7 22.3 26.2 32.7 37.3 49.2 56.5 65.2 73.6
Decoder 17.6 10.2 10.3 15.5 27.2 32.0 38.0 42.2 56.3 62.0 80.6 82.6
Overhead 5.4 5.4 5.4 6.4 7.2 8.5 9.6 10.5 11.7 12.6 13.8 14.9
Our table construction could also be implemented on GPU using standard parallel sorting and
scanning procedures.
5 CONCLUSIONS
The quest for greater fidelity and more detail in real-time rendering is never ending. Nonetheless, it
is still common to restrict artists to use only four bones per vertex. Our permutation coding makes
it viable to lift this restriction. For example, supporting eight weights per vertex with table sizes up
to 𝑇 = 8192 and an accuracy equivalent to 10 bits per weight costs only 48 bits per vertex. This
bandwidth is easily affordable and so is the computational cost.
Similar methods could be applied for compression of barycentric coordinates λ_0, …, λ_N ≥ 0 with Σ_{i=0}^{N} λ_i = 1 (without sorting). The corresponding simplex has vertices at the canonical basis vectors e_0, …, e_{N−1} ∈ R^N. Thus, the pendant for the transform in Equation (2) is u_i := Σ_{j=0}^{i} λ_j. The trick in Equation (5) does not carry over, so optimal quantization becomes more challenging but permutation coding would be as effective as for sorted blend weights.
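The barycentric pendant of the transform is just a prefix sum; a hedged Python sketch (our names, storing only the first N coordinates since the last is implied by the sum):

```python
from itertools import accumulate

def bary_to_u(lams):
    """Pendant of Equation (2): u_i = lambda_0 + ... + lambda_i.
    The last coordinate lambda_N is implied by the sum being 1."""
    return list(accumulate(lams[:-1]))

def u_to_bary(u):
    """Inverse: lambda_i = u_i - u_{i-1}; lambda_N is the remainder."""
    lams = [u[0]] + [u[i] - u[i - 1] for i in range(1, len(u))]
    return lams + [1.0 - u[-1]]
```

Each u_i again lies in [0, 1] and the sequence is nondecreasing, which is what makes the quantization and coding machinery applicable.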
REFERENCES
Niels Henrik Abel. 1826. Beweis der Unmöglichkeit, algebraische Gleichungen von höheren Graden als dem vierten allgemein
aufzulösen. Journal für die reine und angewandte Mathematik 1, 1 (1826), 65–84. https://doi.org/10.1515/9783112347386-009
Marc Alexa. 2002. Linear Combination of Transformations. ACM Trans. Graph. 21, 3 (2002). https://doi.org/10.1145/566654.
566592
Dean Calver. 2002. Vertex Decompression in a Shader. In Direct3D ShaderX – Vertex and Pixel Shader Tips and Tricks,
Wolfgang F. Engel (Ed.). Wordware Publishing, Inc., 172–187.
Dean Calver. 2004. Using Vertex Shaders for Geometry Compression. In ShaderX2 : Shader Programming Tips and Tricks
with DirectX 9.0, Wolfgang F. Engel (Ed.). Wordware Publishing, Inc., 3–12.
Nicolas Fréchette. 2017. Simple and Powerful Animation Compression. https://www.gdcvault.com/play/1024009/Simple-and-
Powerful-Animation Game Developers Conference.
Ivo Zoltan Frey and Ivo Herzeg. 2011. Spherical Skinning with Dual Quaternions and QTangents. In ACM SIGGRAPH 2011
Talks. https://doi.org/10.1145/2037826.2037841
Andrew Garrard. 2020. Khronos Data Format Specification v1.3.1. https://www.khronos.org/registry/DataFormat/specs/1.3/
dataformat.1.3.html#_compressed_texture_image_formats
Jean Geffroy, Axel Gneiting, and Yixin Wang. 2020. Rendering the Hellscape of Doom Eternal. In ACM SIGGRAPH ’20: ACM
SIGGRAPH 2020 Courses. https://advances.realtimerendering.com/s2020
Brian Karis, Rune Stubbe, and Graham Wihlidal. 2021. Nanite — A Deep Dive. In ACM SIGGRAPH ’21: ACM SIGGRAPH 2021
Courses. http://advances.realtimerendering.com/s2021
Ladislav Kavan, Steven Collins, Jiří Žára, and Carol O’Sullivan. 2008. Geometric Skinning with Approximate Dual Quaternion
Blending. ACM Trans. Graph. 27, 4 (2008). https://doi.org/10.1145/1409625.1409627
Benjamin Keinert, Matthias Innmann, Michael Sänger, and Marc Stamminger. 2015. Spherical Fibonacci Mapping. ACM
Trans. Graph. (Proc. SIGGRAPH Asia) 34, 6 (2015). https://doi.org/10.1145/2816795.2818131
Donald E. Knuth. 1998. The Art of Computer Programming, Volume 3 - Sorting and Searching, 2nd Edition. Addison-Wesley
Professional.
Bastian Kuth and Quirin Meyer. 2021. Vertex-Blend Attribute Compression. In High-Performance Graphics - Symposium
Papers. The Eurographics Association. https://doi.org/10.2312/hpg.20211282 Best paper.
Binh Huy Le and Zhigang Deng. 2013. Two-Layer Sparse Compression of Dense-Weight Blend Skinning. ACM Trans. Graph.
(Proc. SIGGRAPH) 32, 4 (2013). https://doi.org/10.1145/2461912.2461949
Binh Huy Le and Jessica K. Hodgins. 2016. Real-Time Skeletal Skinning with Optimized Centers of Rotation. ACM Trans.
Graph. (Proc. SIGGRAPH) 35, 4 (2016). https://doi.org/10.1145/2897824.2925959
Binh Huy Le and J P Lewis. 2019. Direct Delta Mush Skinning and Variants. ACM Trans. Graph. (Proc. SIGGRAPH) 38, 4
(2019). https://doi.org/10.1145/3306346.3322982
Binh Huy Le, Keven Villeneuve, and Carlos Gonzalez-Ochoa. 2021. Direct Delta Mush Skinning Compression with
Continuous Examples. ACM Trans. Graph. (Proc. SIGGRAPH) 40, 4 (2021). https://doi.org/10.1145/3450626.3459779
Derrick H. Lehmer. 1960. Teaching combinatorial tricks to a computer. Proceedings of Symposia in Applied Mathematics 10
(1960), 179–193. https://doi.org/10.1090/psapm/010
Adrien Maglo, Guillaume Lavoué, Florent Dupont, and Céline Hudelot. 2015. 3D Mesh Compression: Survey, Comparisons,
and Emerging Trends. ACM Comput. Surv. 47, 3 (2015). https://doi.org/10.1145/2693443
Nadia Magnenat-Thalmann, Richard Laperrière, and Daniel Thalmann. 1989. Joint-Dependent Local Deformations for Hand
Animation and Object Grasping. In Proceedings on Graphics Interface ’88. Canadian Information Processing Society.
David K. McAllister, Alexandre Joly, and Peter Tong. 2014. Lossless Frame Buffer Color Compression. United States Patent
8670613.
Quirin Meyer, Jochen Süßmuth, Gerd Sußner, Marc Stamminger, and Günther Greiner. 2010. On Floating-Point Normal
Vectors. Computer Graphics Forum (Proc. EGSR) 29, 4 (2010). https://doi.org/10.1111/j.1467-8659.2010.01737.x
Jorn Nystad, Anders Lassen, Andy Pomianowski, Sean Ellis, and Tom Olson. 2012. Adaptive Scalable Texture Compression.
In Eurographics/ACM SIGGRAPH Symposium on High Performance Graphics. The Eurographics Association. https:
//doi.org/10.2312/EGGH/HPG12/105-114
Emil Persson. 2012. Creating Vast Game Worlds: Experiences from Avalanche Studios. In ACM SIGGRAPH 2012 Talks. Article
32. https://doi.org/10.1145/2343045.2343089
Budirijanto Purnomo, Jonathan Bilodeau, Jonathan D. Cohen, and Subodh Kumar. 2005. Hardware-Compatible Vertex
Compression Using Quantization and Simplification. In Graphics Hardware. The Eurographics Association. https:
//doi.org/10.2312/EGGH/EGGH05/053-062
Sylvain Rousseau and Tamy Boubekeur. 2020. Unorganized Unit Vectors Sets Quantization. Journal of Computer Graphics
Techniques (JCGT) 9, 4 (2020). https://jcgt.org/published/0009/04/02/
Henry S. Warren, Jr. 2012. Hacker's Delight, 2nd Edition. Addison-Wesley Professional.
Then if Equation (10) holds for sums with one term less, we find:

Σ_{k=j+1}^{i−1} 1/((N + 1 − k)(N − k))
    = 1/((N + 2 − i)(N + 1 − i)) + Σ_{k=j+1}^{i−2} 1/((N + 1 − k)(N − k))
(10)= 1/((N + 2 − i)(N + 1 − i)) + 1/(N + 2 − i) − 1/(N − j)
    = 1/(N + 1 − i) − 1/(N − j).    □
B INVERSE TRANSFORM
Proof. To prove Equation (3), we substitute Equation (2) into its right-hand side and apply Equation (10):

u_i/(N + 1 − i) − Σ_{j=0}^{i−1} u_j/((N + 1 − j)(N − j))
    = w_i + (1/(N + 1 − i)) Σ_{j=0}^{i−1} w_j − Σ_{j=0}^{i−1} w_j/(N − j) − Σ_{j,k=0, j<k}^{i−1} w_j/((N + 1 − k)(N − k))
    = w_i + Σ_{j=0}^{i−1} (1/(N + 1 − i) − 1/(N − j) − Σ_{k=j+1}^{i−1} 1/((N + 1 − k)(N − k))) w_j
(10)= w_i.    □
C ERROR METRIC
Proof. To prove Equation (5), we start at the right-hand side, substitute in Equation (2), expand and apply Equation (10):

Σ_{i=0}^{N−1} ũ_i²/((N + 1 − i)(N − i))
    = Σ_{i=0}^{N−1} (1/((N + 1 − i)(N − i))) ((N + 1 − i) w̃_i + Σ_{j=0}^{i−1} w̃_j)²
    = Σ_{i=0}^{N−1} ((N + 1 − i)/(N − i)) w̃_i² + 2 Σ_{i=0}^{N−1} Σ_{j=0}^{i−1} w̃_i w̃_j/(N − i) + Σ_{i=0}^{N−1} (1/((N + 1 − i)(N − i))) Σ_{j,k=0}^{i−1} w̃_j w̃_k
    = Σ_{i=0}^{N−1} ((N + 1 − i)/(N − i) − 1/(N − i)) w̃_i² + Σ_{i,j=0}^{N−1} w̃_i w̃_j/(N − max{i, j}) + Σ_{j,k=0}^{N−1} Σ_{i=max{j,k}+1}^{N−1} w̃_j w̃_k/((N + 1 − i)(N − i))
(10)= Σ_{i=0}^{N−1} w̃_i² + Σ_{i,j=0}^{N−1} w̃_i w̃_j/(N − max{i, j}) + Σ_{j,k=0}^{N−1} (1/(N + 1 − N) − 1/(N − max{j, k})) w̃_j w̃_k
    = Σ_{i=0}^{N−1} w̃_i² + Σ_{j,k=0}^{N−1} w̃_j w̃_k.    □