Richard Fabian Data Oriented Design Software Engineering For Limited
Richard Fabian
October 8, 2018
1 Data-Oriented Design
1.1 It’s all about the data
1.2 Data is not the problem domain
1.3 Data and statistics
1.4 Data can change
1.5 How is data formed?
1.6 The framework
1.7 Conclusions and takeaways
2 Relational Databases
2.1 Complex state
2.2 The framework
2.3 Normalising your data
2.4 Normalisation
2.5 Operations
2.6 Summing up
2.7 Stream Processing
2.8 Why does database technology matter?
3 Existential Processing
3.1 Complexity
3.2 Debugging
3.3 Why use an if
3.4 Types of processing
3.5 Don’t use booleans
3.6 Don’t use enums quite as much
3.7 Prelude to polymorphism
3.8 Dynamic runtime polymorphism
3.9 Event handling
6 Searching
6.1 Indexes
6.2 Data-oriented Lookup
6.3 Finding low and high
6.4 Finding random
7 Sorting
7.1 Do you need to?
7.2 Maintaining
7.3 Sorting for your platform
8 Optimisations
8.1 When should we optimise?
8.2 Feedback
8.3 A strategy
8.4 Tables
8.5 Transforms
8.6 Spatial sets
8.7 Lazy evaluation
8.8 Necessity
8.9 Varying length sets
8.10 Joins as intersections
8.11 Data-driven techniques
8.12 Structs of arrays
10 Concurrency
10.1 Thread-safety
10.2 Inherently concurrent
10.3 Gateways
12 In Practice
12.1 Data-manipulation
12.2 Game entities
15 Hardware
15.1 Sequential data
15.2 Deep pipes
15.3 Microcode
15.4 SIMD
15.5 Predictability
15.6 How it works
16 Sourcecode
16.1 The basics
16.2 Linked lists
16.3 Branch prediction
16.4 Cache size effect
16.5 False sharing
16.6 Hot, cold, access
16.7 Key lookup
16.8 Matrix transpose
16.9 Modifying memory
16.10 SIMD
16.11 Speculative waste
16.12 Finite State Machines
Introduction
Data-Oriented Design
The time was right in 2009. The hardware was ripe for a
change in how to develop. Potentially very fast computers were
hindered by a hardware ignorant programming paradigm.
The way game programmers coded at the time made many
engine programmers weep. The times have changed. Many
mobile and desktop solutions now seem to need the data-
oriented design approach less, not because the machines are
better at mitigating an ineffective approach, but because the
games being designed are less demanding and less complex.
The trend for mobile seems to be moving towards AAA
development, which should bring back the need for managing
complexity and getting the most out of the hardware.
Other than not having grids where they make sense, many
modern games also seem to carry instances for each and every
item in the game. An instance for each rather than a variable
storing the number of items. For some games this is an op-
timisation, as creation and destruction of objects is a costly
activity, but the trend is worrying, as these ways of storing
information about the world make the world impenetrable to
simple interrogation.
There was a time when game developers were seen as
cutting-edge programmers, inventing new technology as the
need arose, but
with the advent of less adventurous hardware (most notably
in the x86 based recent 8th generations), there has been a
shift away from ingenious coding practices, and towards a
standardised process. This means game development can be
tuned to ensure the release date will coincide with marketing
dates. There will always be an element of randomness in high
profile game development. There will always be an element
of innovation that virtually guarantees you will not be able
to predict how long the project, or at least one part of the
project, will take. Even if data-oriented design isn’t needed to
make your game go faster, it can be used to make your game
development schedule more regular.
Relational Databases
In order to lay your data out better, it’s useful to have an un-
derstanding of the methods available to convert your existing
structures into something linear. The problems we face when
applying data-oriented approaches to existing code and data
layouts usually stem from the complexity of state inherent in
data-hiding or encapsulating programming paradigms. These
paradigms hide away internal state so you don’t have to think
about it, but they hinder when it comes to reconfiguring data
layouts. This is not because they don’t abstract enough to
allow changes to the underlying structure without impacting
the correctness of the code that uses it, but instead because
they have connected and given meaning to the structure of
the data. That type of coupling can be hard to remove.
We’re going to work with a level file for a game where you hunt
for keys to unlock doors in order to get to the exit room. The
level file is a sequence of script calls which create and con-
figure a collection of different game objects which represent
a playable level of the game, and the relationships between
those objects. First, we’ll assume it contains rooms (some
trapped, some not), with doors leading to other rooms which
can be locked. It will also contain a set of pickups, some let
the player unlock doors, some affect the player’s stats (like
health potions and armour), and all the rooms have lovely
textured meshes, as do all the pickups. One of the rooms is
marked as the exit, and one has a player start point.
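Such a level file might look like the following sketch. The call names, argument order, and comment style are illustrative assumptions rather than the book's actual script format, though the ids and values match the tables in this chapter:

```
CreateRoom( r1, ( 0, 0 ) )
MarkRoomAsStart( r1 )
CreateRoom( r2, ( -20, 10 ) )
SetRoomTrap( r2, 10 )   // trap dealing 10hp
CreateRoom( r3, ( -10, 20 ) )
CreateDoor( r1, r2 )
CreateDoor( r1, r3, locked_with = k1 )
CreatePickup( k1, KEY, msh_key, tex_key, Copper, anim_keybob )
CreatePickup( a1, ARMOUR, msh_arm, tex_arm )
```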
Primed with this data, it’s now possible for us to create the
Pickups. We convert the calls to CreatePickup into the tables
in table 2.2. Notice that there was a pickup which did not
specify a colour tint, and this means we need to use a NULL
to represent not giving details about that aspect of the row.
The same applies to animations. Only keys had animations,
so there needs to be NULL entries for all non-key rows.
Textures
TextureID      TextureName
tex_rm         "roomtexture"
tex_rmstart    "roomtexturestart"
tex_rmtrapped  "roomtexturetrapped"
tex_key        "keytexture"
tex_pot        "potiontexture"
tex_arm        "armourtexture"
Animations
AnimID       AnimName
anim_keybob  "keybobanim"
Pickups
PickupID  MeshID   TextureID  PickupType  ColourTint  Anim
k1        msh_key  tex_key    KEY         Copper      anim_keybob
k2        msh_key  tex_key    KEY         Silver      anim_keybob
k3        msh_key  tex_key    KEY         Gold        anim_keybob
p1        msh_pot  tex_pot    POTION      Green       NULL
p2        msh_pot  tex_pot    POTION      Purple      NULL
a1        msh_arm  tex_arm    ARMOUR      NULL        NULL
2.4 Normalisation
Back when SQL was first created there were only three well-
defined stages of data normalisation. There are many more
now, including six numbered normal forms. To get the most
out of a database, it is important to know most of them, or at
least get a feel for why they exist. They teach you about data
dependency and can hint at reinterpretations of your data lay-
out. For game structures, BCNF (Boyce-Codd normal form is
explained later) is probably as far as you normally would need
to take your methodical process. Beyond that, you might wish
to normalise your data for hot/cold access patterns, but that
kind of normalisation is not part of the standard literature on
database normalisation. If you’re interested in more than this
book covers on the subject, a very good read, and one which
introduces the phrase “The key, the whole key, and nothing
but the key.” is the article A Simple Guide to Five Normal Forms
in Relational Database Theory[9] by William Kent.
Now consider rooms. If you use all the columns other than
the RoomID of the room table, you will find the combination
can be used to uniquely define the room. If you consider an
alternative, where a row had the same combination of values
making up the room, it would in fact be describing the same
room. From this, it can be claimed that the RoomID is being
used as an alias for the rest of the data. We have stuck the
RoomID in the table, but where did it come from? To start
with, it came from the setup script. The script had a RoomID,
but we didn’t need it at that stage. We needed it for the
destination of the doors. In another situation, where nothing
connected logically to the room, we would not need a RoomID
as we would not need an alias to it.
PickupTints
PickupID ColourTint
k1 Copper
k2 Silver
k3 Gold
p1 Green
p2 Purple
PickupAnims
PickupID Anim
k1 anim_keybob
k2 anim_keybob
k3 anim_keybob
PickupInstances
RoomID PickupID
r2 k1
r3 k2
r3 a1
r3 p1
r4 k3
r4 p2
Doors
FromRoom ToRoom
r1 r2
r1 r3
r2 r1
r2 r4
r3 r1
r3 r2
r3 r5
r4 r2
LockedDoors
FromRoom ToRoom LockedWith
r1 r3 k1
r2 r4 k2
r3 r5 k3
Traps
RoomID Trapped
r2 10hp
r5 25hp
Laying out the data in this way takes less space in larger
projects as the number of NULL entries or arrays would have
only increased with increased complexity of the level file. By
laying out the data this way, we can add new features with-
out having to revisit the original objects. For example, if we
wanted to add monsters, normally we would not only have to
add a new object for the monsters, but also add them to the
room objects. In this format, all we need to do is add a new
table such as in table 2.7.
Monsters
MonsterID Attack HitPoints StartRoom
M1 2 5 r3
M2 2 5 r4
Weapons
WeaponType WeaponQuality WeaponDamage
Sword Rusty 2d4
Sword Average 2d6
Sword Masterwork 2d8
Lance Average 2d6
Lance Masterwork 3d6
Hammer Rusty 2d4
Hammer Average 2d4+4
WeaponDamageTypes
WeaponType WeaponDamageType
Sword Slashing
Lance Piercing
Hammer Crushing
Normalising to 1NF:
Pickups 1NF
PickupType MeshID TextureID
KEY mkey tkey
POTION mpot tpot
ARMOUR marm tarm
TintedPickups 1NF
PickupType ColourTint
KEY Copper
KEY Silver
KEY Gold
POTION Green
POTION Purple
Pickups
PickupID PickupType
k1 KEY
k2 KEY
k3 KEY
p1 POTION
p2 POTION
a1 ARMOUR
PickupTints
PickupID ColourTint
k1 Copper
k2 Silver
k3 Gold
p1 Green
p2 Purple
PickupAssets
PickupType  MeshID   TextureID
KEY         msh_key  tex_key
POTION      msh_pot  tex_pot
ARMOUR      msh_arm  tex_arm
PickupAnims
PickupType  AnimID
KEY         anim_keybob
Rooms
RoomID WorldPos IsStart IsExit
r1 0,0 true false
r2 -20,10 false false
r3 -10,20 false false
r4 -30,20 false false
r5 20,10 false true
Rooms
IsStart  HasTrap  TextureID
true     false    tex_rmstart
false    false    tex_rm
false    true     tex_rmtrap
AssetLookupTable
AssetID        StubbedName
ast_room       "room%s"
ast_roomstart  "room%sstart"
ast_roomtrap   "room%strapped"
ast_key        "key%s"
ast_pot        "potion%s"
ast_arm        "armour%s"
2.4.7 Reflections
2.5 Operations
void AddClosedDoor( Door d ) {
    gDoors.push_back( d );
}
void AddOpenDoor( Door d ) {
    gDoors.insert( gDoors.begin() + gDoors_firstClosedDoor, d );
    gDoors_firstClosedDoor += 1;
}
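A self-contained version of the two functions, with an assumed Door type and a pair of consumers, shows how the partition point replaces an isOpen flag entirely: the first gDoors_firstClosedDoor elements are the open doors, the rest are closed.

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// The Door fields and global names are assumptions matching the listing.
struct Door { int fromRoom; int toRoom; };

std::vector<Door> gDoors;
std::size_t gDoors_firstClosedDoor = 0;

void AddClosedDoor( Door d ) {
    // closed doors live at the back of the array
    gDoors.push_back( d );
}

void AddOpenDoor( Door d ) {
    // open doors live in [0, gDoors_firstClosedDoor)
    gDoors.insert( gDoors.begin() + gDoors_firstClosedDoor, d );
    gDoors_firstClosedDoor += 1;
}

// Consumers never test a boolean; the partition point is the boolean.
std::size_t CountOpenDoors()   { return gDoors_firstClosedDoor; }
std::size_t CountClosedDoors() { return gDoors.size() - gDoors_firstClosedDoor; }
```

Iterating only the open doors is then a loop over the front of the array, which touches memory sequentially.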
This is the idea that "the primary key is also the data" men-
tioned much earlier. If the player enters a room and picks up
a Pickup, then the entry matching the room is deleted while
the inventory is updated to include the new PickupID.
2.6 Summing up
Now that we realise all the game data and the game runtime
can be implemented in a database-like approach, we can also
see that all game data can be implemented as streams. Our
persistent storage is a database, and our runtime data is in
the same format as it was on disk, so what do we gain from this?
Databases can be thought of as collections of rows, or col-
lections of columns, but it’s also possible to think about the
tables as sets. The set is the set of all possible permutations
of the attributes.
What’s become clear over the last decade is that most of the
high-level data processing techniques which are proving to be
useful are a combination of hardware-aware data manipula-
tion layers being used by functional programming style high-
level algorithms. As the hardware in your PC becomes more
and more like the internet itself, these techniques will begin
to dominate on personal hardware, whether it be personal
computers, phones, or whatever the next generation brings.
Data-oriented design was inspired by a realisation that the
hardware had moved on to the point where the techniques we
used to use to defend against latency from CPU to hard drive,
now apply to memory. In the future, if we raise processing
power by the utilisation of hordes of isolated, unreliable
computation units, then the techniques for distributing
computing across servers that we’re developing in this era
will apply to the desktops of the next.
Chapter 3
Existential Processing
If you saw there weren’t any apples in stock, would you still
haggle over their price?
3.2 Debugging
To-do lists are great because you can set an end goal and
then add in subtasks that make a large and long distant goal
seem more doable. Adding in estimates can provide a little
urgency that is usually missing when the deadline is so far
away. Many companies use software to support tracking of
tasks, and this software often comes with features allowing
the producers to determine critical paths, expected developer
hours required, and sometimes even the balance of skills re-
quired to complete a project. Not using this kind of software
is often a sign that a company isn’t overly concerned with ef-
ficiency, or waste. If you’re concerned about efficiency and
waste in your program, lists of tasks seem like a good way to
start analysing where the costs are coming from. If you keep
track of these lists by logging them, you can look at the data
and see the general shape of the processing your software is
performing. Without this, it can be difficult to tell where the
real bottlenecks are, as it might not be the processing that is
the problem, but the requirement to process data itself which
has gotten out of hand.
Slowness also comes from not being able to see how much
work needs to be done, and therefore not being able to priori-
tise or scale the work to fit what is possible within the given
time-frame. Without a to-do list, and an ability to estimate
the amount of time each task will take, it is difficult to decide
the best course of action to take in order to reduce overhead
while maintaining feedback to the user.
Object-oriented programming works very well when there
are few patterns in the way the program runs: when the
program is working with only a small amount of data, or
when the data is so heterogeneous that there are as many
classes of things as there are things.
Transforms
Type        Constraint                                Description
Mutation    in == out                                 Handles input data. Produces one item of output for every item of input.
Filter      in >= out                                 Handles input data. Produces up to one item of output for every item of input.
Emission    out = 0 when in = 0, out >= 0 otherwise   Handles input data. Produces an unknown number of items per item of input. With no input, output is also empty.
Generation  in = 0 ∧ out >= 0                         Does not read data. Produces an unknown number of items just by running.
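The four kinds can be sketched as functions over simple arrays. The function names and the int payloads are ours, for illustration only:

```cpp
#include <vector>
#include <cassert>

// Mutation: in == out, exactly one output per input.
std::vector<int> Mutate( const std::vector<int>& in ) {
    std::vector<int> out;
    out.reserve( in.size() );
    for ( int v : in ) out.push_back( v * 2 );
    return out;
}

// Filter: out <= in, up to one output per input.
std::vector<int> FilterPositive( const std::vector<int>& in ) {
    std::vector<int> out;
    for ( int v : in ) if ( v > 0 ) out.push_back( v );
    return out;
}

// Emission: zero or more outputs per input; no input means no output.
std::vector<int> EmitRepeats( const std::vector<int>& in ) {
    std::vector<int> out;
    for ( int v : in )
        for ( int i = 0; i < v; ++i ) out.push_back( v );
    return out;
}

// Generation: reads no data, produces items just by running.
std::vector<int> GenerateCount( int n ) {
    std::vector<int> out;
    for ( int i = 0; i < n; ++i ) out.push_back( i );
    return out;
}
```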
• Once you have been shot, it takes some time until you
begin regenerating.
last dealt. If you want to find out someone’s health, then you
only need to look and see if they have an entityHealth row, or
a row in the deadEntities table. The reason this works is that
an entity has an implicit boolean hidden in the row existing
in the table. For the entityDamages table, that implicit
boolean is the isHurt variable from the first function. For the
deadEntities table, the boolean isDead is now implicit, and it
also implies a health value of 0, which can reduce processing
for many other systems. If you don’t have to load a float and
check whether it is 0 or below, then you’re saving a floating
point comparison or a conversion to boolean.
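A minimal sketch of the idea, using the table names from the text but container choices and function names of our own:

```cpp
#include <unordered_map>
#include <unordered_set>
#include <cassert>

using EntityID = int;

// Existence in a table carries the fact: no isHurt or isDead booleans.
std::unordered_map<EntityID, float> entityHealth; // row exists => hurt but alive
std::unordered_set<EntityID> deadEntities;        // row exists => health is 0

void ApplyDamage( EntityID e, float fullHealth, float damage ) {
    // an entity with no entityHealth row is implicitly at full health
    float h = entityHealth.count( e ) ? entityHealth[e] : fullHealth;
    h -= damage;
    if ( h <= 0.0f ) {
        entityHealth.erase( e );
        deadEntities.insert( e );   // isDead becomes implicit
    } else {
        entityHealth[e] = h;        // isHurt becomes implicit
    }
}

bool IsDead( EntityID e ) { return deadEntities.count( e ) != 0; }
bool IsHurt( EntityID e ) { return entityHealth.count( e ) != 0; }
```

Systems that only care about the dead iterate deadEntities alone, never loading or testing a health float.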
Though this works, all the pointers to the old class are
now invalid. Using handles would mitigate these worries, but
add another layer of indirection in most cases, dragging down
performance even further.
defined by the tables they belong to, then you can switch
between tables at runtime. This allows you to change be-
haviour without any tricks, without the complexity of man-
aging a union to carry all the different data around for all the
states you need. If you compose your class from different at-
tributes and abilities then need to change them post creation,
you can. If you’re updating tables, the fact that the pointer
address of an entity has changed will mean little to you. It’s
normal for an entity to move around memory in table-based
processing, so there are fewer surprises. Looking at it from
a hardware point of view, in order to implement this form of
polymorphism you need a little extra space for the reference
to the entity in each of the class attributes or abilities, but
you don’t need a virtual table pointer to find which function
to call. You can run through all entities of the same type in-
creasing cache effectiveness, even though it provides a safe
way to change type at runtime.
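A minimal sketch of this table-membership polymorphism, with illustrative table and field names of our own. Changing an entity's type is moving its row, and the swap-and-pop move is one reason entity addresses change:

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

using EntityID = int;

// The "type" of an entity is which table its row lives in.
struct FlyingAircraft { EntityID id; float speed; };
struct CrashedWreck  { EntityID id; float smokeTimer; };

std::vector<FlyingAircraft> flying;
std::vector<CrashedWreck>  wrecks;

void Crash( EntityID id ) {
    for ( std::size_t i = 0; i < flying.size(); ++i ) {
        if ( flying[i].id == id ) {
            // swap-and-pop: rows are expected to move around in memory
            flying[i] = flying.back();
            flying.pop_back();
            wrecks.push_back( { id, 10.0f } );
            return;
        }
    }
}
```

Each processing loop then runs over one whole table of same-typed rows, with no virtual dispatch and no per-entity type test.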
Component Based Objects
class Player {
public:
    Player();
    ~Player();
    // ...
    // ... the member functions
    // ...
private:
    PlayerPhysical physical;
    PlayerGameplay gameplay;
    EntityAnim anim;
    PlayerControl control;
    EntityRender render;
    EntityInWorld inWorld;
    Inventory inventory;
};
What happens when we let more than just the player use
these arrays? Normally we’d have some separate logic for
handling player fire until we refactored the weapons to be
generic, with NPCs using the same code, probably by
introducing a new weapon class that can be pointed to by
either the player or an NPC. Instead, what we have here is a
way to split off the weapon firing code so the player and the
NPC share it without inventing a new class to hold the firing.
In fact, what we’ve done is split the firing up into the
different tasks it really contains.
Hierarchical Level of Detail and Implicit-state
[Diagram: two level-of-detail hierarchies. A Murder expands into individual Crows; a Blip expands into Squadrons, each Squadron into Aircraft, and each Aircraft into an EjectingPilot, a Fuselage, and a Wing.]
When the aircraft are shot at, they switch to a taken dam-
age type. They are full health enemy aircraft unless they take
damage. If an AI reacts to damage with fear, they may eject,
adding another entity to the world. If the wing of the plane
is shot off, then that also becomes a new entity in the world.
Once a plane has crashed, it can delete its entity and replace
it with a smoking wreck entity that will be much simpler to
process than an aerodynamic simulation, faked or not.
If things get out of hand and the player can’t keep the
aircraft at bay, and their numbers increase so much that no
normal level of detail system can kick in to mitigate it,
collective lodding can still help by returning aircraft to
squadrons and flying them around the base attacking as a
group rather than as individual aircraft. In the board game
Warhammer Fantasy Battle, there were often so many troops
firing arrows at each other that players would think of
attacks by squads as collections of attacks. Rather than
rolling for each individual soldier, rat, or orc, they counted
up how many troops they had and rolled that many dice to
see how many attacks got through.
This is what is meant by attacking as a squadron. The air-
craft no longer attack, instead, the likelihood an attack will
succeed is calculated, dice are rolled, and that many attacks
get through. The level of detail heuristic can be tuned so the
nearest and front-most squadron are always the highest level
of detail, effectively making them roll individually, and the
ones behind the player maintain a very simplistic representa-
tion.
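The squadron-level roll can be collapsed into a single binomial draw: instead of resolving each aircraft's attack, roll once for how many of n attacks succeed at probability p. This is our illustration of the dice-counting idea, not the book's code:

```cpp
#include <random>
#include <cassert>

// One draw stands in for "count the dice that hit": the number of
// successes out of `attackers` trials, each succeeding with `hitChance`.
int SquadronAttacksLanded( int attackers, double hitChance, std::mt19937& rng ) {
    std::binomial_distribution<int> dist( attackers, hitChance );
    return dist( rng );
}
```

The nearest squadron can still resolve per aircraft at full detail, while distant squadrons use one draw per attack phase.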
5.2 Mementos
[Diagram: a high-detail Vehicle stores its state into a Memento and is later extracted from it; a PassengerStub expands into three Characters, each seeded from and extracted via its own Memento.]
As with all things, take away an assumption and you can find
other uses for a tool. Whenever you read about, or work with a
level of detail system, you will be aware that the constraint on
what level of detail is shown has always been some distance
function in space. It’s now time to take the assumption, dis-
card it, and analyse what is really happening.
It’s an ugly term, and I hope one day someone comes up with
a better one, but it’s a technique that didn’t need a name
until people stopped doing it. Over the time it has taken
to write this book, games have started to have too many in-
stances. We’re not talking about games that have hundreds of
enemy spacecraft, battling each other in a desperate fight for
superiority, firing off missile after missile, generating visual
effects which spawn multiple GPU particles. We’re talking
about simple seeming games. We’re talking about your aver-
age gardening simulator, where for some reason, every leaf on
your plants is modeled as an instance, and every insect going
around pollinating is an instance, and every plot of land in
which your plants can grow is an instance, and every seed
you sow is an instance, and each has its own lifetime,
components, animations, and its own internal state, adding
to the ever-growing complexity of the system as a whole.
Searching
6.1 Indexes
[Figure: full animation keys laid out as contiguous floats, the sequence t tx ty tz sx sy sz rs ri rj rk repeating. Each key is 11 floats, so keys straddle cache-line boundaries (assuming 16 floats per line):]
cacheline: t  tx ty tz sx sy sz rs ri rj rk t  tx ty tz sx
cacheline: sy sz rs ri rj rk t  tx ty tz sx sy sz rs ri rj
cacheline: rk t  tx ty tz sx sy sz rs ri rj rk t ...
There are ways to organise the data better still, but any
further optimisation requires a complexity or space-time
trade-off. A basic binary search will home in on the correct data
quite quickly, but each of the first steps will cause a new cache
line to be read in. If you know how big your cache line is, then
you can check all the values that have been loaded for free
while you wait for the next cache line to load in. Once you
have got near the destination, most of the data you need is in
the cache and all you’re doing from then on is making sure you
have found the right key. In a cache line aware engine, all this
can be done behind the scenes with a well-optimised search
algorithm usable all over the game code. It is worth mention-
ing again, every time you break out into larger data struc-
tures, you deny your proven code the chance to be reused.
The size of a cache line defines what you get to read for free
with each request. In this particular case, there was enough
cache line left to store another 11 floating point values, which
are used as a place to store something akin to a skip list.
times keys n s0 s1 s2
cacheline
s3 s4 s5 s6 s7 s8 s9 s10
Using the fact that these keys would be loaded into mem-
ory, we give ourselves the opportunity to interrogate some
data for free. In listing 6.3 you can see it uses a linear search
instead of a binary search, and yet it still manages to make
the original binary search look slow by comparison, and we
must assume, as with most things on modern machines, it
is because the path the code is taking is using the resources
better, rather than being better in a theoretical way, or using
fewer instructions.
i5-4430 @ 3.00GHz
Average 13.71ms [Full anim key - linear search]
Average 11.13ms [Full anim key - binary search]
Average 8.23ms [Data only key - linear search]
Average 7.79ms [Data only key - binary search]
Average 1.63ms [Pre-indexed - binary search]
Average 1.45ms [Pre-indexed - linear search]
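A reconstruction of the kind of scan behind the "data only key" results (the book's listing 6.3 is not reproduced here, so this is our sketch): with the key times packed tightly in a sorted array, a forward linear scan touches memory in exactly the order the prefetcher expects.

```cpp
#include <vector>
#include <cstddef>
#include <cassert>

// Returns the index of the last key whose time is <= t.
// Sequential reads over a dense float array are cache friendly, which is
// why this can beat a binary search over the same data.
std::size_t FindKeyIndex( const std::vector<float>& keyTimes, float t ) {
    std::size_t i = 0;
    while ( i + 1 < keyTimes.size() && keyTimes[i + 1] <= t )
        ++i;
    return i;
}
```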
For some tables, the values change very often. For a tree rep-
resentation to be high performance, it’s best not to have a high
number of modifications as each one could trigger the need
for a rebalance. Of course, if you do all your modifications
in one stage of processing, then rebalance, and then all your
reads in another, then you’re probably going to be okay still
using a tree.
Sorting
Whatever you need to sort for, make sure you need to sort at
all, as sorting is usually a highly memory-intensive business.
Depending on what you need the list sorted for, you could sort
while modifying. If the sort is for some AI function that cares
about priority, then you may as well insertion sort as the base
heuristic commonly has completely orthogonal inputs. If the
inputs are related, then a post insertion table wide sort might
be in order, but there’s little call for a full-scale sort.
http://seven-degrees-of-freedom.blogspot.co.uk/2010/07/question-of-sorts.html
7.3 Sorting for your platform
Radix sort is the fastest serial sort. If you can use it, it is
very fast because it generates a list of starting points for
data of different values in a first pass, then operates using
that data in a second pass. This allows the sorter to drop
each element into a container based on a translation table,
one that returns an offset for a given data value. If you build
a list from a known small value space, then radix sort can
operate very fast to give a coarse first pass. The reason radix sort is
serial, is that it has to modify the table it is reading from in
order to update the offsets for the next element that will be put
in the same bucket. If you ran multiple threads giving them
part of the work each, then you would find they were non-
linearly increasing in throughput as they would be contending
to write and read from the same memory, and you don’t want
to have to use atomic updates in your sorting algorithm.
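The two passes just described can be sketched for a single 8-bit key as follows; RadixPass is our name for it, and a full radix sort would repeat this per digit of a wider key.

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>
#include <cassert>

std::vector<uint8_t> RadixPass( const std::vector<uint8_t>& in ) {
    // First pass: histogram, then prefix sum to get each value's start point.
    std::size_t offsets[257] = { 0 };
    for ( uint8_t v : in ) ++offsets[v + 1];
    for ( int i = 1; i < 257; ++i ) offsets[i] += offsets[i - 1];

    // Second pass: drop each element into its bucket. Note offsets[v]++
    // mutates the table being read, which is what keeps this step serial.
    std::vector<uint8_t> out( in.size() );
    for ( uint8_t v : in ) out[offsets[v]++] = v;
    return out;
}
```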
2 It might be wise to have some inline sort function templates in your own
utility header so you can utilise the benefits of miniaturisation, but don’t
drop in a bloated std::sort.
[Diagram: a single compare-exchange on two wires, A and B, producing outputs A′ and B′.]
a’ <= MAX(a,b)
b’ <= MIN(a,b)
[Diagram: a four-wire sorting network over A, B, C, D. Stage 1 compare-exchanges the A/C and B/D pairs, stage 2 the A/B and C/D pairs, and stage 3 the B/C pair, producing A′, B′, C′, D′.]
What you may notice here is that the critical path is not
long (just three stages in total). The first stage is two
concurrent sortings of the A/C and B/D pairs. The second
stage sorts the A/B and C/D pairs. The final cleanup sorts
the B/C pair. As these are all branch-free functions, the performance
is regular over all data permutations. With such a regular
performance profile, we can use the sort in ways where the
variability of sorting time length gets in the way, such as just-
in-time sorting for subsections of rendering. If we had radix
sorted our renderables, we can network sort any final required
ordering as we can guarantee a consistent timing.
a’ <= MAX(a,c)
c’ <= MIN(a,c)
b’ <= MAX(b,d)
d’ <= MIN(b,d)
a’’ <= MAX(a’,b’)
b’’ <= MIN(a’,b’)
c’’ <= MAX(c’,d’)
d’’ <= MIN(c’,d’)
b’’’ <= MAX(b’’,c’’)
c’’’ <= MIN(b’’,c’’)
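Assuming compare-exchanges built from min and max as in the equations above, the whole network might look like this sketch (the names are ours; it sorts into descending order to match the MAX-first convention):

```cpp
#include <algorithm>
#include <cassert>

// One compare-exchange: larger value to the first wire. Branch-free when
// std::min/std::max compile to conditional moves.
void Exchange( int& hi, int& lo ) {
    int a = std::max( hi, lo );
    int b = std::min( hi, lo );
    hi = a;
    lo = b;
}

// Three-stage, four-element sorting network, descending.
void SortNetwork4( int& a, int& b, int& c, int& d ) {
    Exchange( a, c ); Exchange( b, d ); // stage 1: independent pairs
    Exchange( a, b ); Exchange( c, d ); // stage 2: independent pairs
    Exchange( b, c );                   // stage 3: cleanup
}
```

Because every input permutation executes the same instructions, the running time is constant, which is what makes the network usable for just-in-time sorting.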
Optimisations and
Implementations
8.2 Feedback
Building budgets into how you work means you can set
realistic budgets for systems early and have them work at a
certain level throughout development, knowing they will not
cause grief later. On a project without budgets, frame spikes
may only become apparent near release dates, as it is only
then that all systems are coming together
to create the final product. A system which was assumed to
be quite cheap, could cause frame spikes in the final prod-
uct, without any evidence being previously apparent. When
you finally find out which system causes the spikes, it may be
that it was caused by a change from a very long time ago, but
as resources were plentiful in the early times of development
on the project, the spikes caused by the system would have
1 From THE VALUE OF A MILLISECOND: FINDING THE OPTIMAL SPEED OF
Build or get yourself a profiler that runs all the time. En-
sure your profiler can report the overall state of the game
when the frame time goes over budget. It’s highly benefi-
cial to make it respond to any single system going over bud-
get. Sometimes you need the data from a number of frames
around when a violation occurred to really figure out what is
going on. If you have AI in your game, consider running con-
tinuous testing to capture performance issues as fast as your
build machine churns out testable builds. In all cases, un-
less you’re letting real testers run your profiler, you’re never
going to get real world profiling data. If real testers are going
to be using your profiling system, it’s worth considering how
you gather data from it. If it’s possible for you, see if you can
get automatically generated profile data sent back to an an-
alytics or metrics server, to capture issues without requiring
user intervention.
8.3.2 Measure
8.3.4 Implement
If you are not sure the optimisation will work out first time,
then the time saved by not doing a full implementation can be
beneficial, as a localised experiment can be worked on faster.
It can also be a good place to start when trying to build an
example for third parties to provide support, as a smaller ex-
ample of the problem will be easier to communicate through.
8.3.5 Confirm
Create a report of what you have done, and what you have
found. The benefits of doing this are twofold. First, you have
the benefit of sharing knowledge of a technique for optimisa-
tion, which clearly can help others hitting the same kind of
issue. The second is that creating the report can identify any
errors of measurement, or any steps which can be tested to
ensure they were actually pertinent to the final changes com-
mitted.
8.3.6 Summary
8.4 Tables
struct nodes {
    std::vector<PosInfo> posInfos;
    std::vector<vec3> colors;
    std::vector<LifetimeInfo> lifetimeInfos;
} nodesystem;
// ...
Listing 8.1: Mixing hot reads with hot and cold writes
date values which are used both for read and write, but are
close neighbours of data which is only used for reading.
There are other reasons why you might prefer to not store
data in trivial SoA format, such as if the data is commonly
subject to insertions and deletions. Keeping free lists around
to stop deletions from mutating the arrays can help alleviate
the pressure, but being unable to guarantee every element
void ProcessJoin( Func functionToCall ) {
    TableIterator A = t1Table.begin();
    TableIterator B = t2Table.begin();
    TableIterator C = t3Table.begin();
    while ( !A.finished && !B.finished && !C.finished ) {
        if ( A == B && B == C ) {
            functionToCall( A, B, C );
            ++A; ++B; ++C;
        } else {
            if ( A < B || A < C ) ++A;
            if ( B < A || B < C ) ++B;
            if ( C < A || C < B ) ++C;
        }
    }
}
to use such a trivial join, then you will need an alternative
strategy.
2 dependent on the target hardware, how many rows and columns, and
whether you want the process to run without trashing too much cache
8.5 Transforms
When you normalise your data you reduce the chance of an-
other multifaceted problem of object-oriented development.
C++’s implementation of objects forces unrelated data to
share cache lines.
Every virtual call loads in the cache line that contains the
virtual-table pointer of the instance. If the function doesn’t
use any of the class’s early data, then that will be cache line
utilisation in the region of only 4%. That’s a memory through-
put waste, and cannot be recovered without rethinking how
you dispatch your functions. Adding a final keyword to your
class can help when your class calls into its own virtual func-
tions, but cannot help when they are called via a base type.
[Figure: parallel prefix sum, upsweep pass — from inputs A B C D to
partial sums a, ab, c, cd, and then to a, ab, abc, abcd]
Then once you have the last element, backfill all the other
elements you didn’t finish on your way to making the last el-
ement. When you come to write this in code, you will find
these backfilled values can be done in parallel while making
the longest chain. They have no dependency on the final value
so can be given over to another process, or managed by some
clever use of SIMD.
[Figure: parallel prefix sum, backfill pass — from a, ab, c, abcd to
the complete sums a, ab, abc, abcd]
Parallel prefix sums provide a way to reduce latency, but
are not a general solution which is better than doing a lin-
ear prefix sum. A linear prefix sum uses far fewer machine
resources to do the same thing, so if you can handle the la-
tency, then simplify your code and do the sum linearly.
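For reference, the linear version is just a single running total; a minimal sketch (the function name is chosen for illustration):

```cpp
#include <vector>
#include <cstddef>

// Linear (serial) prefix sum: each output element is the sum of all
// inputs up to and including that index. One pass, one addition per
// element, and perfectly predictable memory access.
std::vector<int> PrefixSum( const std::vector<int> &in ) {
    std::vector<int> out( in.size() );
    int running = 0;
    for ( std::size_t i = 0; i < in.size(); ++i ) {
        running += in[i];
        out[i] = running;
    }
    return out;
}
```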
Also, for cases where the entity count can rise and fall,
you need a way of adding and deleting without causing any
hiccups. For this, if you intend to transform your data in
place, you need to handle the case where one thread can be
reading and using the data you’re deleting. To do this in a
system where objects’ existence was based on their memory
being allocated, it would be very hard to delete objects that
were being referenced by other transforms. You could use
smart pointers, but in a multi-threaded environment, thread-safe
smart pointers cost at least an atomic operation, and sometimes
a lock, on every reference and dereference. This is a high cost
to pay, so how do we avoid it?
There are at least two ways.
Apart from finite state machines, there are some other com-
mon forms of data-driven coding practices. Some are not very
obvious, such as callbacks. Some are very obvious, such as
scripting. In both these cases, data causing the flow of code to
change will cause the same kind of cache and pipeline prob-
lems as seen in virtual calls and finite state machines.
void SIMD_SSE_UpdateParticles( particle_buffer *pb, float delta_time ) {
    float g = pb->gravity;
    float f_gd = g * delta_time;
    float f_gd2 = pb->gravity * delta_time * delta_time * 0.5f;
    // ...
}
Doing this means that for a 128 key stream, the key times
only take up 8 cache lines in total, and a binary search is go-
ing to pull in at most three of them, and the data lookup is
guaranteed to only require one, or two at most if your data
straddles two cache lines due to choosing memory space effi-
ciency over performance.
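The lookup into the packed key times can be sketched as a binary search over a plain sorted array; this assumes 32-bit key times, which matches the arithmetic above (128 keys × 4 bytes = 8 × 64-byte cache lines), with the function name chosen for illustration:

```cpp
#include <algorithm>
#include <vector>
#include <cstdint>
#include <cstddef>

// Find the index of the last key whose time is <= t. The key times are
// stored on their own, densely packed and sorted, so the binary search
// touches only the compact key-time array, not the key values.
std::size_t FindKeyIndex( const std::vector<uint32_t> &keyTimes, uint32_t t ) {
    auto it = std::upper_bound( keyTimes.begin(), keyTimes.end(), t );
    if ( it == keyTimes.begin() )
        return 0;
    return static_cast<std::size_t>( ( it - keyTimes.begin() ) - 1 );
}
```

The matching key values would live in a separate parallel array indexed by the result, so the value fetch touches at most one or two further cache lines.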
Writing code like this means you will see where the com-
piler has to branch more easily, but also, you can make your
writes more explicit, which means that where a compiler
might have had to break away from writing to memory, you
can force it to write in all cases, making your processing more
homogeneous, and therefore more likely to stream better.
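The explicit-write idea can be sketched as two versions of a clamp loop (names are illustrative):

```cpp
#include <vector>
#include <cstddef>

// Branchy version: the store only happens for some elements, so the
// compiler must guard the write and may break away from streaming.
void ClampBranchy( std::vector<float> &v, float limit ) {
    for ( std::size_t i = 0; i < v.size(); ++i )
        if ( v[i] > limit )
            v[i] = limit;
}

// Explicit-write version: every element is written every iteration, so
// the loop body is identical for all elements, a better candidate for
// select instructions and homogeneous streaming stores.
void ClampExplicit( std::vector<float> &v, float limit ) {
    for ( std::size_t i = 0; i < v.size(); ++i )
        v[i] = v[i] > limit ? limit : v[i];
}
```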
9.4 Aliasing
int q = 10;
int p[10];
for ( int i = 0; i < q; ++i )
    p[i] = i;
i5-4430 @ 3.00GHz
Average 11.31ms [Simple, check the map]
Average 9.62ms [Partially cached query (25%)]
Average 8.77ms [Partially cached presence (50%)]
Average 3.71ms [Simple, cache presence]
Average 1.51ms [Partially cached query (95%)]
Average 0.30ms [Fully cached query]
The idea is that multiple threads will want to read from and
write to the same cache line, but not necessarily the same
memory addresses in the cache line. It’s relatively easy to
avoid this by ensuring any rapidly updated variables are kept
local to the thread, whether on the stack or in thread local
storage. Other data, as long as it’s not updated regularly, is
highly unlikely to cause a collision.
void LocalAccumulator() {
    int sum = 0;
#pragma omp parallel num_threads( NUM_THREADS )
    {
        int me = omp_get_thread_num();
        int local_accumulator = 0;
        for ( int i = me; i < ELEMENT_COUNT; i += NUM_THREADS ) {
            local_accumulator += CalcValue( i );
        }
#pragma omp atomic
        sum += local_accumulator;
    }
}
So, how can you tell if this problem is real or not? If your
multi-threaded code is not scaling linearly as you add cores,
then you might be suffering from false sharing. Look at where
your threads are writing, and try to remove writes to shared
memory where possible until the last step. The common example
given is adding up some arrays and updating the sum value in
some globally shared location, such as in listing 9.4.
i5-4430 @ 3.00GHz
Average 4.40ms [Random branching]
Average 1.15ms [Sorted branching]
Average 0.80ms [Trivial Random branching]
Average 0.76ms [Trivial Sorted branching]
in 0.8ms. The trivial version, which was likely mostly compiled into CMOVs,
ran in 0.4ms both sorted and unsorted
9.10 Don’t get evicted
It’s very simple advice. Not only is small code less likely to
be evicted, but if it’s run in bursts, it will have had a chance
to get through a reasonable amount of work before being overwritten.
Some cache architectures don’t have any way to tell if the el-
ements in the cache have been used recently, so they rely on
when they were added as a metric for what should be evicted
first. In particular, some Intel CPUs can have their L1 and
L2 cache lines evicted because of L3 needing to evict, but L3
doesn’t have full access to LRU information. The Intel CPUs
in question have some other magic that reduces the likelihood
of this happening, but it does happen.
A thing to watch out for is making sure the loops are trivial
and always run their course. If a loop has to break based on
data, then it won’t be able to commit to doing all elements of
the processing, and that means it has to do each element at
a time. In listing 9.8 the introduction of a break based on the
data turns the function from a fast parallel SIMD operation
auto-vectorisable loop, into a single stepping loop. Note that
branching in and of itself does not cause a breakdown in vec-
torisation, but the fact the loop is exited based on data. For
example, in listing 9.9, the branch can be turned into other
operations. It’s also the case that calling out to a function can
often break the vectorisation, as side effects cannot normally
be guaranteed. If the function is a constexpr, then there’s a
typedef float f16 __attribute__(( aligned(16) ));
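The difference between a data-dependent exit and a branch that can become other operations can be sketched as two loops; these are illustrative functions, and they deliberately compute different things, since one stops early and the other masks:

```cpp
#include <vector>
#include <cstddef>

// Early-exit version: the loop trip count depends on the data, so the
// compiler cannot safely process several elements per step.
int SumUntilNegative( const std::vector<int> &v ) {
    int sum = 0;
    for ( std::size_t i = 0; i < v.size(); ++i ) {
        if ( v[i] < 0 )
            break;                       // data-dependent exit defeats vectorisation
        sum += v[i];
    }
    return sum;
}

// Masked version: the trip count is fixed, the branch becomes a
// select, and the loop is a candidate for auto-vectorisation.
int SumOnlyPositive( const std::vector<int> &v ) {
    int sum = 0;
    for ( std::size_t i = 0; i < v.size(); ++i )
        sum += v[i] >= 0 ? v[i] : 0;     // branch turned into a select
    return sum;
}
```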
Over the next decade, compilers will get better and bet-
ter. Clang already attempts to unroll loops far more than
GCC does, and many new ways to detect and optimise sim-
ple code will likely appear. At the time of writing, the online
Compiler Explorer provided by Matt Godbolt2 provides a good
way to see how your code will be compiled into assembly, so
you can see what can and will be vectorised, optimised out,
rearranged, or otherwise mutated into the machine-readable
form. Remember that the number of assembly instructions
is not a good metric for fast code, that SIMD operations are
not inherently faster in all cases, and measuring the code run-
ning cannot be replaced by stroking your chin3 while thinking
about whether the instructions look cool, and you should be
okay.
2 https://godbolt.org/
3 or even stroking a beard, or biting a pencil (while making a really serious
Concurrency
int shared = 0;
void foo() {
    int a = shared;
    a += RunSomeCalculation();
    shared = a;
}
// directly modifying
int RunSomeCalculation() {
    int val = 4 + ++shared;
    return val;
}
// indirectly modifying
int foo2() {
    sharedMutex.acquire();
    // oops, the base thread is the same, so a reentrant or
    // recursive lock doesn't block.
    shared += 1;
    sharedMutex.release();
    return shared;
}
int RunSomeCalculation() {
    int val = foo2() + 9;
    return val;
}
ing its data via this same call. Every time one of the hardware
threads encounters this code, it stops all processing until the
mutex is acquired. Once it’s acquired, no other hardware
thread can enter into these instructions until the current
thread releases the mutex at the far end. That is the only
guarantee though, as it could be that RunSomeCalculation
changes shared either directly or indirectly, or something
changes shared without invoking the mutexes. See the exam-
ples in listing 10.3.
3 You can see the figures on latency in Latency Numbers Every Programmer
For example, the physics system can update while the ren-
derer and the AI rely on the positions and velocities of the
current frame. The AI can update while the animation sys-
tem can rely on the previous set of states.
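This overlap is usually achieved by double buffering the shared state; a minimal sketch, with illustrative names:

```cpp
#include <vector>
#include <utility>

// Double-buffered positions: writers fill `next` while readers only
// ever see `current`. A swap at a known sync point publishes the new
// frame, so no reader and writer ever touch the same buffer.
struct Positions {
    std::vector<float> current;   // read by the renderer / AI this frame
    std::vector<float> next;      // written by physics this frame

    void Publish() { std::swap( current, next ); }
};
```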
When you don’t know how many items you are going to get out
of a transform, such as when you filter a table to find only the
X which are Y, you need to run a reduce on the output to make
the table non-sparse. Doing this can be log(N) latent, that
is, pairing up rows takes serial time proportional to log(N). But, if
you are generating more data than your input data, then you
need to handle it very differently. Mapping a table out onto a
larger set can be managed by generating into gateways, which
once the whole map operation is over, can provide input to the
reduce stage. The gateways are queues which are only ever
written to by the Mapping functions, and only ever read by the
gathering tasks such as Reduce. There is always one gateway
per Map runtime. That way, there can be no sharing across
threads. A Map can write that it has put up more data, and
can set the content of that data, but it cannot delete it or mark
any written data as having been read. The gateway manages a
read head, which can be compared with the write head to find
out if there are any waiting elements. Given this, a gathering
gateway or reduce gateway can be made by cycling through
all known gateways and popping any data from the gateway
read heads. This is a fully concurrent technique and would
be just as at home in variable CPU timing solutions as it is
in standard programming practices as it implies a consistent
state through ownership of their respective parts. A write will
stall until the read has allowed space in the queue. A read
will either return “no data” or stall until the write head shows
there is something more to read.
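A gateway as described here — one writer, one reader, a read head and a write head — can be sketched as a single-producer single-consumer ring buffer. This is an illustrative sketch under those assumptions, not a lifted implementation:

```cpp
#include <atomic>
#include <cstddef>

// One gateway: the Map task only ever advances the write head, the
// gathering task only ever advances the read head, so no locks are
// needed and no data is shared for writing across threads.
template <typename T, std::size_t N>
struct Gateway {
    T buffer[N];
    std::atomic<std::size_t> writeHead { 0 };
    std::atomic<std::size_t> readHead { 0 };

    bool TryPush( const T &item ) {     // called only by the producer
        std::size_t w = writeHead.load( std::memory_order_relaxed );
        if ( w - readHead.load( std::memory_order_acquire ) == N )
            return false;               // full: the writer must stall
        buffer[w % N] = item;
        writeHead.store( w + 1, std::memory_order_release );
        return true;
    }
    bool TryPop( T &out ) {             // called only by the consumer
        std::size_t r = readHead.load( std::memory_order_relaxed );
        if ( r == writeHead.load( std::memory_order_acquire ) )
            return false;               // "no data"
        out = buffer[r % N];
        readHead.store( r + 1, std::memory_order_release );
        return true;
    }
};
```

Comparing the two heads answers "is there anything waiting?" without any locking, which is the property the gathering gateway relies on when it cycles through all known gateways.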
Take for example the idea of a game which tries to get the
lowest possible latency between player control pad and avatar
reaction on-screen. In a lot of games, you have to put up
with the code reading the pad state, then the pad state being
used to adjust animations, then the animations adjusting the
renderables, then the rendering system rastering and finally,
the raster system swapping buffers. An example is shown in
diagram 10.1.
Figure 10.1: The effect of input can sometimes take three
frames to be seen on screen
tions http://interviews.slashdot.org/story/15/01/19/159242/
first conceived, have negative consequences with today’s hard-
ware.
[Figure: a finite state machine with states START, AT HOME, GOING TO
FIELD, WORK THE FIELD, and GOING HOME, with labelled transitions
between them]
mented without a central state, but implicit state based on input alone
through hierarchical states, in other cases it is circumvented
by introducing more states simultaneously, and in yet other
cases, states are pushed onto a stack, so as to be able to re-
turn to them later. All these solutions provide techniques to
get around the limitations of the core tenet of state being sin-
gular. If we could get around this limitation, then we may
not need lots of clever techniques. The solution can be found
in where the state is stored. The reason there is only one
state at once is there is normally a single variable in which
the state is held. When we free state from the confines of the
container, in effect, having the state own the machine, not
the machine owning the state, we expand the opportunities
to create different numbers of simultaneous states as the out-
come of a transition. We can have states which lead to simul-
taneous states, or no state at all, or states which are aware
of their child states, or states that know they need to return
to another state, all within the same system, but defined by
a data-driven process. Potentially new finite state machine
techniques can be developed and implemented, without a call
for new code to be written to handle the new technique.
Keeping the state as a table entry can also let the FSM do
some more advanced work, such as managing level of detail or
culling. If your renderables have a state of potentially visible,
then you can run the culling checks on just these entities
rather than those not even up for consideration this frame.
Using collective level-of-detail management with FSMs allows
for game-flow logic, such as letting a triggered game state
change emit the state’s linked entities, which could also
provide a good point for debugging any problems.
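Keeping the potentially visible set as its own table might look like this; the names and the stand-in cull check are illustrative:

```cpp
#include <vector>
#include <cstddef>

// The "potentially visible" state is a table of entity ids, so culling
// only iterates the entities currently in that state rather than
// testing every renderable in the game.
struct VisibilityTables {
    std::vector<int> potentiallyVisible;   // entity ids in this state
    std::vector<int> hidden;

    // Move entities that fail the check out of the hot table,
    // swap-and-pop style so the table stays dense.
    template <typename CullFunc>
    void Cull( CullFunc isCulled ) {
        for ( std::size_t i = 0; i < potentiallyVisible.size(); ) {
            if ( isCulled( potentiallyVisible[i] ) ) {
                hidden.push_back( potentiallyVisible[i] );
                potentiallyVisible[i] = potentiallyVisible.back();
                potentiallyVisible.pop_back();
            } else {
                ++i;
            }
        }
    }
};
```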
[Figure: data flowing into a transform and producing new data]
In Practice
12.1 Data-manipulation
When the server went live, it wasn’t the services that died,
it was the login. Nginx is amazing, but under that amount of
load on a single server, with so many of the requests requiring
a lock on an SQL DB backend for Facebook integration, the
machine reached its limit very quickly. For once, we think
PHP itself wasn’t to blame. We had to redesign all the ser-
vices so they could work in three different situations so as to
allow the server to become a distributed service. By not lock-
ing down data into contexts, it was relatively easy to change
the way the data was processed, to reconfigure the single
service that previously did all the data consumption, colla-
tion, and serving, into three different services which handled
incoming data, merging the multiple instances, and serving
the data on the instances. In part, this was due to a very
procedural approach, but it was also down to data separa-
tion. The data, even between levels of detail, was not linked
together or bound by an object context. The lack of bind-
ing allows for simpler recombination of procedures, simpler
rewriting of procedures, and simpler repurposing of proce-
dures from related services. This is something the object-
oriented approach makes harder because you can be easily
tempted to start adding base classes and inheriting to gain
common functionality or algorithms. As soon as you do that,
1 anyone remembering developing on a PS2 will likely attest to the minimal
benefit you get from optimisations when your main bottleneck is the VU and
GS. The same was true here, we had a simple bottleneck of the size of the
data and the operations to run on it. Optimisations had minimal impact on
processing speed, but they did impact our ability to debug the service when
it did crash.
you start tying things together by what they mean rather than
what they are, and then you lose the ability to reuse code.
3 sometimes multiple cameras belonged to the same scene and the scene
4 dead men don’t care about bullets and don’t really care that much about
Another example from the god game prototype was the use
of duck-typing. Instead of adding a base class for people and
houses, just to see if they were meant to be under control
of the local player, we used a function template to extract a
boolean value. Duck typing doesn’t require anything more
than certain member functions or variables being available.
They don’t have to be in the same place, or come bundled into
a base class: they just have to be present.
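A sketch of that duck-typed check; the member name and the local-player constant are made up for this example, not taken from the prototype:

```cpp
// Any type with a `controllingPlayer` member works, no shared base
// class required: the template only needs the member to be present.
const int localPlayer = 0;

template <typename T>
bool IsLocallyControlled( const T &entity ) {
    return entity.controllingPlayer == localPlayer;
}

// Two unrelated types with no common base:
struct Person { int controllingPlayer; float x, y; };
struct House  { int controllingPlayer; int rooms; };
```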
6 profile guided optimisation might have saved a lot of time here, but the
best solution is to give the profiler and optimiser a better chance by leaving
them less to do
very little7 . At this point, I added the bootstrap code to gener-
ate the scene. Some components required other components
at runtime, so I added a requires trait to the components,
much like #include, except to make this work for teardown, I
also added a count so components went away much like ref-
erence counted smart pointers folded away. The initial demo
was a simple animated character running around a 3D envi-
ronment, colliding with props via the collision Handler.
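The requires-with-a-count idea might be sketched like this; all names are illustrative, not the prototype’s actual code:

```cpp
#include <map>
#include <string>

// Requiring a component bumps its reference count (creating it on
// first use); releasing drops the count and tears the component down
// at zero, much like a reference-counted smart pointer folding away.
struct ComponentRegistry {
    std::map<std::string, int> refCounts;

    void Require( const std::string &name ) {
        ++refCounts[name];                 // create on first require
    }
    void Release( const std::string &name ) {
        if ( --refCounts[name] == 0 )
            refCounts.erase( name );       // teardown at zero
    }
    bool IsAlive( const std::string &name ) const {
        return refCounts.count( name ) != 0;
    }
};
```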
8 http://cowboyprogramming.com/2007/01/05/evolve-your-heirachy/
Chapter 13
Maintenance and
reuse
13.2.1 Lifetimes
Bugs have a lot to do with not being in the right state. Debug-
ging, therefore, becomes a case of finding out how the game
got into its current, broken state.
13.3 Reusability
13.6 Refactoring
What’s wrong?
faster than C++. There are some arguments against the results, but there
are others backing it up. Read, make up your own mind.
Some of this comes from the initial interpretation of what
object-oriented means, as game developers tended to believe
that object-oriented meant you had to map instances of every-
thing you cared about into the code as instances of objects.
This form of object-oriented development could be interpreted
as instance-oriented development, and it puts the singular
unique entity ahead of the program as a whole. When put this
way, it is easier to see some of the problems that can arise.
Performance of an individual is very hard to decry as poor,
as object methods are hard to time accurately, and unlikely
to be timed at all. When your development practices promote
individual elements above the program as a whole, you will
also pay the mental capacity penalty, as you have to consider
all operations from the point of view of the actors, with their
hidden state, not from a point of view of value semantics.
Claim: Virtuals don’t cost much, but if you call them a lot it can
add up.
aka - death by a thousand paper cuts
#include <cstdio>

class B {
public:
    B() {}
    virtual ~B() {}
    virtual void Call() { printf( "Base\n" ); }
    void LocalCall() {
        Call();
    }
};
// the derived class used below; marking it final lets the compiler
// turn calls made through a D* into direct calls
class D final : public B {
public:
    void Call() override { printf( "Derived\n" ); }
};
B *pb;
D *pd;
int main() {
    D *d = new D;
    pb = pd = d;
    pb->LocalCall();
    // prints "Derived" via virtual call
    pd->LocalCall();
    // prints "Derived" via direct call
}
one real class, and any others had to be pure virtual interface
classes as per the Java definition.
The takeaway from this, however, is that generic code still
needs to be learnt in order for the coder to be efficient, or
to not cause accidental performance bottlenecks. If you go with
the STL, then at least you have a lot of documentation on
your side. If your game company implements an amazingly
complex template library, don’t expect any coders to use it
until they’ve had enough time to learn it. That means, if you
write generic code, expect people not to use it unless they
come across it accidentally, or have been explicitly told to,
as they won’t know it’s there, or won’t trust it. In other
words, starting out by writing generic code is a good way to
write a lot of code quickly without adding any value to your
development.
Chapter 15
Looking at hardware
You will find many CPUs will do a lot of work out of order
if they can, and the possibility of doing things out of order is
something worth striving for. Consider the well-known evil of
a linked list. The reason why the linked list is worse than an
array for lookups isn’t just to do with all the jumping around
in memory, but also the fact that it cannot start work on items
many steps ahead. If it was all about jumping around in mem-
ory, then an array of pointers to objects would also be around
the same cost, but in tests, it’s shown that when accessing
an array versus a linked list, the array of pointers to objects
comes out closer to the array for performance than you would
expect if it was the mere pointer dereferencing that was the
cost. Instead, the cost stems from the fact that the next ele-
ment cannot be deduced without loading in the current ele-
ment. That is where the true cost lies. In the source code for
linked lists 16.2, the array lookup is clearly the fastest, but on
some hardware, the array of pointers approach, which offers
some of the benefits of a linked list, cuts the time to process
by more than 20%.
i5-4430 @ 3.00GHz
Average 24.35ms [Linked List Sum]
Average 19.03ms [Pointer Array Sum]
Average 4.37ms [Array Sum]
The sad thing is, these days, most compilers will build you
a better bit counter if you write your code the dumb way, as on
some CPUs there are instructions to count bits, which leaves
all your clever code wasting time, while also being impenetra-
ble to read. In this case, be aware of your target hardware,
and the capabilities of your compiler, as you might be shooting
yourself in the foot.
So, look to your types, and see if you can add a bit of SIMD
to your development without even breaking out the vector in-
uint32_t CountBitsClever( uint32_t v ) {
    v = v - ((v >> 1) & 0x55555555);                // reuse input as temporary
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp
    uint32_t c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
    return c;
}
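For comparison, the "dumb way" mentioned above is just a loop; modern compilers can often recognise this pattern and emit a single bit-count instruction on hardware that has one:

```cpp
#include <cstdint>

// Count set bits one at a time. Readable, obviously correct, and a
// pattern compilers may replace with a POPCNT-style instruction.
uint32_t CountBitsSimple( uint32_t v ) {
    uint32_t count = 0;
    while ( v ) {
        count += v & 1u;
        v >>= 1;
    }
    return count;
}
```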
Learn how your hardware really works, how big each cache
is, what would cause memory to be dropped to disk, how long
it takes to send data to another machine, how many hops on
your protocol. Learn about the speed that information can,
and must flow, for the user experience to be within tolerances.
Source code
https://github.com/raspofabs/dodbooksourcecode
This repository contains the tests performed, the supporting
code, and the testing harness.
In this source, the idea is to test and prove that linked lists
cost more than arrays because of the way they need memory
loads to continue their work, as opposed to just being slow
because of memory access.
struct A {
int val ;
int pad1 ;
int pad2 ;
int pad3 ;
};
struct Alink {
Alink * next ;
int val ;
int pad1 ;
int pad2 ;
int pad3 ;
};
A * aArray ;
A ** aPointerArray ;
Alink * aLinkedList ;
const int ELEMENT_COUNT = 4 * 1024 * 1024;
void TestSumArray () {
int accumulator = 0;
for ( int i = 0; i < ELEMENT_COUNT ; i +=1 ) {
accumulator += aArray [ i ]. val ;
}
}
void TestSumArrayPointer() {
    int accumulator = 0;
    for ( int i = 0; i < ELEMENT_COUNT; i += 1 ) {
        accumulator += aPointerArray[i]->val;
    }
}
void TestSumLinkedList() {
    int accumulator = 0;
    Alink *link = aLinkedList;
    while ( link != nullptr ) {
        accumulator += link->val;
        link = link->next;
    }
}
Listing 16.2: Linked Lists
In this source, we’re looking for how the size of the working
set hits the size of the cache.
void TestSummingSimple() {
    int sum = 0;
    for ( int i = 0; i < ELEMENT_COUNT; ++i ) {
        sum += a[i];
    }
}
void TestSummingBackwards() {
    int sum = 0;
    for ( int i = ELEMENT_COUNT - 1; i >= 0; --i ) {
        sum += a[i];
    }
}
void TestSummingStrides() {
    int sum = 0;
    const int STRIDE = 16;
    for ( int offset = 0; offset < STRIDE; offset += 1 ) {
        for ( int i = offset; i < ELEMENT_COUNT; i += STRIDE ) {
            sum += a[i];
        }
    }
}
template <int byte_limit>
void TestWriteRangeLimited() {
    int mask = ( byte_limit / sizeof( c[0] ) ) - 1;
    for ( int i = 0; i < ELEMENT_COUNT * 16; i += 16 ) {
        c[i & mask] = i;
    }
}
template <int byte_limit>
void TestModifyRangeLimited() {
    int mask = ( byte_limit / sizeof( c[0] ) ) - 1;
    for ( int i = 0; i < ELEMENT_COUNT * 16; i += 16 ) {
        c[i & mask] += 1;
    }
}
16.10 SIMD
These are the source files for testing the finite state machine
variants.
namespace FSMSimple {
enum State {
S_sleeping ,
S_hunting ,
S_eating ,
S_exploring ,
};
struct Machine {
State state ;
float sleepiness ;
float hunger ;
float huntTimer ;
float eatTimer ;
};
struct Data {
Machine machine [ NUM_MACHINES ];
Data () {
pcg32_random_t rng;
pcg32_srandom_r( &rng, 1234, 5678 );
for ( int m = 0; m < NUM_MACHINES; ++m ) {
    Machine &M = machine[m];
    M.state = S_sleeping;
    M.sleepiness = pcg32_random_r_rangef( &rng, 0.0f, 0.2f );
    M.hunger = pcg32_random_r_rangef( &rng, 0.5f, 0.9f );
M . huntTimer = HUNTING_TIME ;
M . eatTimer = 0.0 f ;
}
}
void Update ( float deltaTime ) {
for ( int m = 0; m < NUM_MACHINES ; ++ m ) {
Machine & M = machine [ m ];
switch ( M . state ) {
case S_sleeping :
{
M . hunger += deltaTime * SLEEP_HUNGER ;
M . sleepiness += deltaTime * SLEEP_SLEEP ;
if ( M . sleepiness <= 0.0 f ) {
M . sleepiness = 0.0 f ;
if ( M . eatTimer > 0.0 f ) {
M . state = S_eating ;
} else {
if ( M.hunger > HUNGER_TRIGGER ) {
M . state = S_hunting ;
M . huntTimer = HUNTING_TIME ;
} else {
M . state = S_exploring ;
}
}
}
} break ;
case S_hunting :
{
M . hunger += deltaTime * HUNT_HUNGER ;
M . sleepiness += deltaTime * HUNT_SLEEP ;
M . huntTimer -= deltaTime ;
if ( M . huntTimer <= 0.0 f ) {
M . eatTimer = EATING_TIME ;
if ( M . sleepiness > SLEEP_TRIGGER ) {
M . state = S_sleeping ;
} else {
M . state = S_eating ;
}
} else {
}
} break ;
case S_eating :
{
M . hunger += deltaTime * EAT_HUNGER ;
M . sleepiness += deltaTime * EAT_SLEEP ;
M . eatTimer -= deltaTime ;
if ( M . sleepiness > SLEEP_TRIGGER ) {
M . state = S_sleeping ;
} else {
if ( M . eatTimer <= 0.0 f ) {
if ( M.hunger > HUNGER_TRIGGER ) {
M . state = S_hunting ;
M . huntTimer = HUNTING_TIME ;
} else {
M . state = S_exploring ;
}
}
}
} break ;
case S_exploring :
{
M.hunger += deltaTime * EXPLORE_HUNGER;
M . sleepiness += deltaTime * EXPLORE_SLEEP ;
if ( M.hunger > HUNGER_TRIGGER ) {
M . state = S_hunting ;
M . huntTimer = HUNTING_TIME ;
}
else {
if ( M . sleepiness > SLEEP_TRIGGER ) {
M . state = S_sleeping ;
}
}
} break ;
}
}
}
};
}
namespace FSMOOState {
struct State ;
struct Machine {
State * state = nullptr ;
float sleepiness ;
float hunger ;
float huntTimer ;
float eatTimer ;
inline void UpdateState ( State * newState ) ;
inline ~ Machine () ;
};
struct State {
virtual State * Update ( Machine &M , float deltaTime ) = 0;
virtual const char * GetName () { return " Base " ; }
};
struct Sleeping final : public State {
State * Update ( Machine &M , float deltaTime ) override ;
const char * GetName () override { return " Sleeping " ; }
};
struct Hunting final : public State {
State * Update ( Machine &M , float deltaTime ) override ;
const char * GetName () override { return " Hunting " ; }
};
struct Eating final : public State {
State * Update ( Machine &M , float deltaTime ) override ;
virtual const char * GetName () override { return " Eating " ; }
};
struct Exploring final : public State {
State * Update ( Machine &M , float deltaTime ) override ;
const char * GetName () override { return " Exploring " ; }
};
Sleeping m_commonSleeping;
Hunting m_commonHunting;
Eating m_commonEating;
Exploring m_commonExploring;
struct Data {
Machine machine [ NUM_MACHINES ];
Data () {
pcg32_random_t rng;
pcg32_srandom_r( &rng, 1234, 5678 );
for ( int m = 0; m < NUM_MACHINES; ++m ) {
    Machine &M = machine[m];
    M.state = &m_commonSleeping;
    M.sleepiness = pcg32_random_r_rangef( &rng, 0.0f, 0.2f );
    M.hunger = pcg32_random_r_rangef( &rng, 0.5f, 0.9f );
M . huntTimer = HUNTING_TIME ;
M . eatTimer = 0.0 f ;
}
}
void Update ( float deltaTime ) {
for ( int m = 0; m < NUM_MACHINES ; ++ m ) {
Machine & M = machine [ m ];
State * newState = M . state - > Update ( M , deltaTime ) ;
M . UpdateState ( newState ) ;
}
}
int StateObjectToStateIndex( State *s ) {
    if ( strcmp( s->GetName(), m_commonSleeping.GetName() ) == 0 )
        return 0;
    if ( strcmp( s->GetName(), m_commonHunting.GetName() ) == 0 )
        return 1;
    if ( strcmp( s->GetName(), m_commonEating.GetName() ) == 0 )
        return 2;
    if ( strcmp( s->GetName(), m_commonExploring.GetName() ) == 0 )
        return 3;
    return -1;
}
};
// inlines
inline void Machine :: UpdateState ( State * newState ) {
if ( newState ) {
state = newState ;
}
}
inline Machine ::~ Machine () {
state = nullptr ;
}
State * Sleeping::Update( Machine &M, float deltaTime ) {
    M.hunger += deltaTime * SLEEP_HUNGER;
    M.sleepiness += deltaTime * SLEEP_SLEEP;
    if ( M.sleepiness <= 0.0f ) {
        M.sleepiness = 0.0f;
        if ( M.eatTimer > 0.0f ) {
            return &m_commonEating;
        } else {
            if ( M.hunger > HUNGER_TRIGGER ) {
                M.huntTimer = HUNTING_TIME;
                return &m_commonHunting;
            } else {
                return &m_commonExploring;
            }
        }
    }
    return nullptr;
}
State * Hunting::Update( Machine &M, float deltaTime ) {
    M.hunger += deltaTime * HUNT_HUNGER;
    M.sleepiness += deltaTime * HUNT_SLEEP;
    M.huntTimer -= deltaTime;
    if ( M.huntTimer <= 0.0f ) {
        M.eatTimer = EATING_TIME;
        if ( M.sleepiness > SLEEP_TRIGGER ) {
            return &m_commonSleeping;
        } else {
            return &m_commonEating;
        }
    }
    return nullptr;
}
State * Eating::Update( Machine &M, float deltaTime ) {
    M.hunger += deltaTime * EAT_HUNGER;
    M.sleepiness += deltaTime * EAT_SLEEP;
    M.eatTimer -= deltaTime;
    if ( M.sleepiness > SLEEP_TRIGGER ) {
        return &m_commonSleeping;
    } else {
        if ( M.eatTimer <= 0.0f ) {
            if ( M.hunger > HUNGER_TRIGGER ) {
                M.huntTimer = HUNTING_TIME;
                return &m_commonHunting;
            } else {
                return &m_commonExploring;
            }
        }
    }
    return nullptr;
}
State * Exploring::Update( Machine &M, float deltaTime ) {
    M.hunger += deltaTime * EXPLORE_HUNGER;
    M.sleepiness += deltaTime * EXPLORE_SLEEP;
    if ( M.hunger > HUNGER_TRIGGER ) {
        M.huntTimer = HUNTING_TIME;
        return &m_commonHunting;
    } else {
        if ( M.sleepiness > SLEEP_TRIGGER ) {
            return &m_commonSleeping;
        }
    }
    return nullptr;
}
}
namespace FSMTableState {
struct Machine {
float sleepiness ;
float hunger ;
float huntTimer ;
float eatTimer ;
};
typedef std :: vector < Machine > MachineVector ;
struct Data {
MachineVector sleeps ;
MachineVector hunts ;
MachineVector eats ;
MachineVector explores ;
Data () {
pcg32_random_t rng;
pcg32_srandom_r( &rng, 1234, 5678 );
for ( int m = 0; m < NUM_MACHINES; ++m ) {
    Machine M;
    M.sleepiness = pcg32_random_r_rangef( &rng, 0.0f, 0.2f );
    M.hunger = pcg32_random_r_rangef( &rng, 0.5f, 0.9f );
M . huntTimer = HUNTING_TIME ;
M . eatTimer = 0.0 f ;
sleeps . push_back ( M ) ;
}
}
void Update ( float deltaTime ) {
MachineVector pendingSleep ;
MachineVector pendingHunt ;
MachineVector pendingEat ;
MachineVector pendingExplore;
{
for ( MachineVector :: iterator iter = sleeps . begin () ; iter
!= sleeps . end () ; ) {
Machine & M = * iter ;
M . hunger += deltaTime * SLEEP_HUNGER ;
M . sleepiness += deltaTime * SLEEP_SLEEP ;
if ( M . sleepiness <= 0.0 f ) {
M . sleepiness = 0.0 f ;
if ( M . eatTimer > 0.0 f ) {
pendingEat . push_back ( M ) ;
} else {
if ( M.hunger > HUNGER_TRIGGER ) {
    M.huntTimer = HUNTING_TIME;
    pendingHunt.push_back( M );
} else {
    pendingExplore.push_back( M );
}
}
* iter = sleeps . back () ; sleeps . pop_back () ;
} else {
++ iter ;
}
}
for ( MachineVector :: iterator iter = hunts . begin () ; iter !=
hunts . end () ; ) {
Machine & M = * iter ;
M . hunger += deltaTime * HUNT_HUNGER ;
M . sleepiness += deltaTime * HUNT_SLEEP ;
M . huntTimer -= deltaTime ;
if ( M . huntTimer <= 0.0 f ) {
M . eatTimer = EATING_TIME ;
if ( M . sleepiness > SLEEP_TRIGGER ) {
pendingSleep . push_back ( M ) ;
} else {
pendingEat . push_back ( M ) ;
}
* iter = hunts . back () ; hunts . pop_back () ;
} else {
++ iter ;
}
}
for ( MachineVector :: iterator iter = eats . begin () ; iter !=
eats . end () ; ) {
Machine & M = * iter ;
M . hunger += deltaTime * EAT_HUNGER ;
M . sleepiness += deltaTime * EAT_SLEEP ;
M . eatTimer -= deltaTime ;
if ( M . sleepiness > SLEEP_TRIGGER ) {
pendingSleep . push_back ( M ) ;
* iter = eats . back () ; eats . pop_back () ;
} else {
if ( M . eatTimer <= 0.0 f ) {
if ( M.hunger > HUNGER_TRIGGER ) {
M . huntTimer = HUNTING_TIME ;
pendingHunt . push_back ( M ) ;
} else {
pendingExplore.push_back( M );
}
* iter = eats . back () ; eats . pop_back () ;
} else {
++ iter ;
}
}
}
for ( MachineVector :: iterator iter = explores . begin () ; iter
!= explores . end () ; ) {
Machine & M = * iter ;
M.hunger += deltaTime * EXPLORE_HUNGER;
M.sleepiness += deltaTime * EXPLORE_SLEEP;
if ( M.hunger > HUNGER_TRIGGER ) {
M . huntTimer = HUNTING_TIME ;
pendingHunt . push_back ( M ) ;
* iter = explores . back () ; explores . pop_back () ;
} else {
if ( M . sleepiness > SLEEP_TRIGGER ) {
pendingSleep . push_back ( M ) ;
* iter = explores . back () ; explores . pop_back () ;
} else {
++ iter ;
}
}
}
}
sleeps.insert( sleeps.end(), pendingSleep.begin(), pendingSleep.end() );
hunts.insert( hunts.end(), pendingHunt.begin(), pendingHunt.end() );
eats.insert( eats.end(), pendingEat.begin(), pendingEat.end() );
explores.insert( explores.end(), pendingExplore.begin(), pendingExplore.end() );
}
};
}