How Ubisoft Montreal Develops Games For Multicore - Before and After C++11 - Jeff Preshing - CppCon 2014

HOW UBISOFT MONTREAL

DEVELOPS GAMES FOR MULTICORE


Before & After C++11

Jeff Preshing
Technical Architect
Ubisoft Montreal
Watch Dogs / Assassin's Creed Unity / Rainbow Six: Siege / Far Cry 4


MULTICORE
several CPU cores on a single processor

§ Same instruction set
§ Same address space
§ Existing threads can run on any core

A popular way to offer more processing power


Game Industry + C++ Community

We all want to exploit multicore!


PART ONE: Multicore Programming at Ubisoft
PART TWO: The C++11 Atomic Library
Part One
Multicore Programming at Ubisoft
GENERAL-PURPOSE HARDWARE THREADS
available for game use

PlayStation 2:  1 MIPS
Xbox:           1 x86
Xbox 360:       6 PowerPC
PlayStation 3:  2 PowerPC + 6 SPU
PlayStation 4:  6 x64
Xbox One:       6 x64
SINGLE-THREADED MAIN LOOP
in the early 2000s

Engine → Graphics → Engine → Graphics → ...


THREE THREADING PATTERNS
to exploit multicore

Pipelining Work

Dedicated Threads

Task Schedulers
CONCURRENT OBJECTS
Multiple threads operate concurrently on the object with at least one
thread modifying its state.

Pattern

Concurrent Object

Platform primitives and atomic operations
Pipelining Work
PIPELINED GRAPHICS

Engine Engine Engine

Graphics Graphics Graphics


PIPELINED GRAPHICS

Engine Engine Engine

Semaphore Semaphore
Graphics Graphics Graphics
PIPELINED GRAPHICS
how to avoid concurrent modifications?

Engine

game objects

Graphics
DOUBLE-BUFFERED GRAPHICS STATE
One approach

struct Object
{
    ...
    Matrix xform[2];
    ...
};
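The flip logic isn't shown on the slide; as a hedged sketch of the double-buffering idea (the one-float `Matrix` and the `simulateFrames` driver are illustrative stand-ins, not Ubisoft's code), the engine writes one slot of `xform[2]` while graphics reads the other, and the slots swap each frame:

```cpp
#include <cstddef>

// Illustrative stand-in for a real 4x4 transform.
struct Matrix { float tx; };

struct Object {
    Matrix xform[2];   // [writeIndex] = engine's copy, [1 - writeIndex] = graphics' copy
};

// Simulate n frames; returns the value graphics reads on the last frame,
// which is the transform the engine wrote on the frame before it.
float simulateFrames(int n) {
    Object obj = { { { -1.0f }, { -1.0f } } };
    int writeIndex = 0;
    float lastRead = -1.0f;
    for (int frame = 0; frame < n; ++frame) {
        obj.xform[writeIndex].tx = float(frame);   // engine updates this frame's transform
        lastRead = obj.xform[1 - writeIndex].tx;   // graphics renders the previous frame's
        writeIndex = 1 - writeIndex;               // flip slots at end of frame
    }
    return lastRead;
}
```

Because the engine and graphics never touch the same slot during a frame, no locking is needed on the transform itself.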
SEPARATE GRAPHIC OBJECTS
Another approach

struct Object
{
    ...
    Matrix xform;
    GraphicObject* gfxObject;
    ...
};

struct GraphicObject
{
    ...
    Matrix xform;
    ...
};

Copy at start of frame


Dedicated Threads
CONTENT STREAMING
We don’t load the entire game environment in memory at once.
DEDICATED LOADING THREAD

request
Loading
sleep sleep
Load chunk of world content
DEDICATED LOADING THREAD

Queue

Loading
WAKING UP THE LOADING THREAD

ThreadSafeQueue<Request> requests;
Event workAvailable;

Engine Thread:
requests.push(r);
workAvailable.signal();

Loading Thread:
for (;;)
{
    workAvailable.waitAndReset();
    while (r = requests.tryPop())
    {
        processLoadRequest(r);
    }
}

Event signaled → threads pass through
Event reset → threads wait
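The slide's `Event` and `ThreadSafeQueue` are Ubisoft's own classes; a rough stand-in built on C++11 primitives might look like this. All names are illustrative, and the shutdown handling (`done` flag) is added purely for the demo:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Auto-reset event, approximating the slide's Event class.
class Event {
    std::mutex m_mutex;
    std::condition_variable m_cond;
    bool m_signaled = false;
public:
    void signal() {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_signaled = true;
        m_cond.notify_one();
    }
    void waitAndReset() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this] { return m_signaled; });
        m_signaled = false;   // auto-reset, as on the slide
    }
};

// Toy driver: the "engine" pushes numRequests requests and signals;
// the "loading thread" waits, drains the queue, and counts what it processed.
int runLoadingDemo(int numRequests) {
    std::queue<int> requests;   // stand-in for ThreadSafeQueue<Request>
    std::mutex queueMutex;
    Event workAvailable;
    int processed = 0;
    bool done = false;

    std::thread loader([&] {
        for (;;) {
            workAvailable.waitAndReset();
            for (;;) {
                std::lock_guard<std::mutex> lock(queueMutex);
                if (requests.empty())
                    break;
                requests.pop();        // "processLoadRequest(r)"
                ++processed;
            }
            std::lock_guard<std::mutex> lock(queueMutex);
            if (done && requests.empty())
                return;
        }
    });

    for (int i = 0; i < numRequests; ++i) {
        { std::lock_guard<std::mutex> lock(queueMutex); requests.push(i); }
        workAvailable.signal();
    }
    { std::lock_guard<std::mutex> lock(queueMutex); done = true; }
    workAvailable.signal();
    loader.join();
    return processed;
}
```

Because the signaled flag persists until consumed, a signal sent while the loader is still draining is not lost; the loader simply wakes once more and finds the queue empty.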
IMPROVING ON THE QUEUE
Many design choices

Cancel requests
Interrupt requests
Re-prioritize requests
Task Schedulers
FINE-GRAINED PARALLELISM
Motivation for a task scheduler

Input Logic Physics Animation


FINE-GRAINED PARALLELISM
Motivation for a task scheduler

Worker

Worker

Worker

Worker
SIMPLE TASK QUEUE
Queue
WAKING UP THE WORKER THREADS
ThreadSafeQueue<Task> tasks;
Event workAvailable[numThreads];

Submitting Thread:
tasks.push(t);
for (int i = 0; i < numThreads; i++)
    workAvailable[i].signal();

Worker Thread:
for (;;)
{
    workAvailable[thread].waitAndReset();
    while (t = tasks.tryPop())
    {
        t->Run();
    }
}

One event for each worker thread


TASK GROUPS
Grouping work units together into larger tasks

Input, Logic, Physics, Animation

class TaskGroup
{
private:
    Array<Item*> m_Items;
    ...
};

Each TaskGroup keeps an array of items to update in parallel.


TASK GROUPS
class TaskGroup
{
private:
    vector<Item*> m_Items;
    volatile int m_Index;

public:
    void Run()
    {
        for (;;)
        {
            int index = AtomicIncrement(m_Index);
            if (index >= m_Items.size())
                break;
            m_Items[index]->Run();
        }
    }
};

Multiple threads work on the same TaskGroup.
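A C++11 sketch of the same claim-the-next-index loop, using `std::atomic` in place of the game library's `AtomicIncrement` (the `int` items and the demo driver are illustrative; the real engine runs `Item*` objects):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each worker atomically claims the next index until the array is exhausted.
class TaskGroup {
    std::vector<int>* m_Items;      // stand-in for vector<Item*>
    std::atomic<int> m_Index{-1};   // fetch_add returns the old value, so first claim is 0
    std::atomic<int> m_RunCount{0};
public:
    explicit TaskGroup(std::vector<int>& items) : m_Items(&items) {}
    void Run() {
        for (;;) {
            int index = m_Index.fetch_add(1, std::memory_order_relaxed) + 1;
            if (index >= (int)m_Items->size())
                break;
            (*m_Items)[index] += 1;   // "run" the item; each index claimed by one thread
            m_RunCount.fetch_add(1, std::memory_order_relaxed);
        }
    }
    int runCount() const { return m_RunCount.load(); }
};

// Run one group across several worker threads; returns total items executed.
int runTaskGroupDemo(int numItems, int numWorkers) {
    std::vector<int> items(numItems, 0);
    TaskGroup group(items);
    std::vector<std::thread> workers;
    for (int i = 0; i < numWorkers; ++i)
        workers.emplace_back([&] { group.Run(); });
    for (auto& w : workers) w.join();
    return group.runCount();   // every item claimed exactly once
}
```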


NOT A SIMPLE QUEUE ANYMORE


Could be a queue with separate tails for each worker.


MANAGING DEPENDENCIES
Input

Logic

Physics

Animation

No physics tasks before all logic tasks.


MANAGING DEPENDENCIES
class TaskGroup
{
private:
    vector<Item*> m_Items;
    volatile int m_Index;
    volatile int m_RemainingCount;
    ...

public:
    void Run()
    {
        int count = 0;
        for (;;)
        {
            int index = AtomicIncrement(m_Index);
            if (index >= m_Items.size())
                break;
            m_Items[index]->Run();
            count++;
        }
        if (count > 0 && AtomicSubtract(m_RemainingCount, count) == 0)
            AddDependencies();
    }
};

The thread that finishes the last item schedules the next TaskGroup.
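The dependency-count trick above can be sketched in C++11 as follows. This is a hedged stand-in (names like `CountedGroup` and the fire counter are for the demo only): `remaining` starts at the item count, each worker subtracts what it ran, and exactly one worker observes the count hit zero:

```cpp
#include <atomic>
#include <thread>
#include <vector>

struct CountedGroup {
    std::vector<int> items;
    std::atomic<int> index{-1};
    std::atomic<int> remaining;
    std::atomic<int> dependencyFires{0};

    explicit CountedGroup(int n) : items(n, 0), remaining(n) {}

    void Run() {
        int count = 0;
        for (;;) {
            int i = index.fetch_add(1, std::memory_order_relaxed) + 1;
            if (i >= (int)items.size())
                break;
            items[i] += 1;   // "run" the item
            ++count;
        }
        // fetch_sub returns the previous value; previous - count is the new value.
        if (count > 0 &&
            remaining.fetch_sub(count, std::memory_order_acq_rel) - count == 0)
            dependencyFires.fetch_add(1);   // would schedule the next TaskGroup here
    }
};

// Returns how many workers saw the count reach zero; should always be 1.
int runDependencyDemo(int numItems, int numWorkers) {
    CountedGroup group(numItems);
    std::vector<std::thread> workers;
    for (int i = 0; i < numWorkers; ++i)
        workers.emplace_back([&] { group.Run(); });
    for (auto& w : workers) w.join();
    return group.dependencyFires.load();
}
```

Workers that ran zero items skip the subtraction, so the per-worker counts always sum to the item total and the zero crossing happens exactly once.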
IMPROVING ON THE TASK SCHEDULER
Many design choices

Centralized / per-thread task list
Priorities
Affinities ("pin" threads to cores)
Batching
Profiler integration
Atomic Operations
GAME ATOMICS
Typical portable library

Declaration:        volatile int A;
Load/Store:         A = 1;
                    int a = A;
Ordering:           LIGHTWEIGHT_FENCE();
                    FULL_FENCE();
Read-Modify-Write:  AtomicIncrement(A);
                    AtomicCompareExchange(A, …, …);
                    ...
FENCE MACROS
What’s the difference?

LIGHTWEIGHT_FENCE();  (used more often)
Orders loads from memory
Orders stores to memory
Does the job of:
    atomic_thread_fence(memory_order_acquire);
    atomic_thread_fence(memory_order_release);
    atomic_thread_fence(memory_order_acq_rel);

FULL_FENCE();
... all that, plus: commits stores before the next load
Does the job of:
    atomic_thread_fence(memory_order_seq_cst);
HOW THEY’RE IMPLEMENTED
on processors we care about

                                 x86/64              PowerPC        ARMv7
A = 1;                           mov %, %            st %, %        str %, %
int a = A;                       mov %, %            ld %, %        ldr %, %
LIGHTWEIGHT_FENCE();             (compiler barrier)  lwsync         dmb
FULL_FENCE();                    mfence              hwsync         dmb
COMPILER_BARRIER();
AtomicIncrement(A);              lock inc            lwarx/stwcx    ldrex/strex
AtomicCompareExchange(A, …, …);  lock cmpxchg        ...            ...
ATOMIC OPERATIONS
How we ended up using them

Pattern

Concurrent Object

Atomic
operations
Game
Industry
EXAMPLE
Capped wait-free queue

template <class T, int size>
class CappedSPSCQueue
{
private:
    T m_items[size];
    volatile int m_writePos;
    int m_readPos;

public:
    CappedSPSCQueue() : m_writePos(0), m_readPos(0) {}
    bool tryPush(const T& item) { ... }
    bool tryPop(T& item) { ... }
};

Single producer, single consumer


EXAMPLE
Capped wait-free queue (BROKEN: loads and stores can reorder)

bool tryPush(const T& item)
{
    int w = m_writePos;
    if (w >= size)
        return false;
    m_items[w] = item;
    m_writePos = w + 1;    // can reorder before the item store
    return true;
}

bool tryPop(T& item)
{
    int w = m_writePos;
    if (m_readPos >= w)
        return false;
    item = m_items[m_readPos];    // can reorder before the m_writePos load
    m_readPos++;
    return true;
}
EXAMPLE
Capped wait-free queue (fixed with lightweight fences)

bool tryPush(const T& item)
{
    int w = m_writePos;
    if (w >= size)
        return false;
    m_items[w] = item;
    LIGHTWEIGHT_FENCE();
    m_writePos = w + 1;
    return true;
}

bool tryPop(T& item)
{
    int w = m_writePos;
    if (m_readPos >= w)
        return false;
    LIGHTWEIGHT_FENCE();
    item = m_items[m_readPos];
    m_readPos++;
    return true;
}
RECAP:
Multicore programming at Ubisoft

§ Three threading patterns
§ Lots of custom concurrent objects
§ Atomic operations for high-contention objects
§ We learned by doing
Part Two
The C++11 Atomic Library
ATOMIC OPERATIONS
in C++11

Pattern

Concurrent Object

Atomic
operations

C++11: portable principles


C++11 FORBIDS DATA RACES
If multiple threads access the same variable concurrently, and at least one
thread modifies it, all threads must use C++11 atomic operations.

Thread 1 Thread 2

int  X;  

OK!
C++11 FORBIDS DATA RACES
If multiple threads access the same variable concurrently, and at least one
thread modifies it, all threads must use C++11 atomic operations.

Thread 1 Thread 2

int  X;  
Data Race! Undefined Behavior
C++11 FORBIDS DATA RACES
If multiple threads access the same variable concurrently, and at least one
thread modifies it, all threads must use C++11 atomic operations.

Thread 1 Thread 2

atomic<int>  X;  

OK!

§ That’s how you know when you must use atomic<>.


C++11 FORBIDS DATA RACES
One reason they’re bad

Thread 1: X = 0x80004;        Thread 2: c = X;

int X;

If the machine can only write 16 bits at a time and X is 32 bits wide,
Thread 2 can observe a "torn write", for example 0x80000.
C++11 FORBIDS DATA RACES
If multiple threads access the same variable concurrently, and at least one
thread modifies it, all threads must use C++11 atomic operations.

Thread 1 Thread 2

volatile  int  X;  

We break this rule all the time.


We know that int is atomic.
IT’S ACTUALLY TWO ATOMIC LIBRARIES
Masquerading under one API

Sequentially Consistent Atomics
§ Similar to Java volatiles
§ Used in literature/books
§ All about interleaving statements
easier, but slower

Low-Level Atomics
§ Similar to C/C++ volatiles
§ Much like game atomics
difficult, but faster
SEQUENTIALLY CONSISTENT ATOMICS
Example #1

atomic<int>  A(0);  
atomic<int>  B(0);  

Thread 1 (store, then load)        Thread 2 (store, then load)
A = 1;                             B = 1;
c = B;                             d = A;

Possible interleavings always run A = 1 before c = B, and B = 1 before d = A.

Outcomes (c, d): (0, 0) is impossible; (0, 1), (1, 0) and (1, 1) are possible.
LOW-LEVEL ATOMICS
Example #1

atomic<int>  A(0);  
atomic<int>  B(0);  

Thread 1                                   Thread 2
A.store(1, memory_order_relaxed);          B.store(1, memory_order_relaxed);
c = B.load(memory_order_relaxed);          d = A.load(memory_order_relaxed);

Doing the same thing, but now every outcome is possible, including c = 0, d = 0.

You can prevent it with "full memory fences":
atomic_thread_fence(memory_order_seq_cst);
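Example #1 is easy to run for real. The sketch below (the trial-loop harness is mine, not from the talk) uses the default sequentially consistent stores and loads, under which the outcome c = 0, d = 0 is forbidden; switching the operations to `memory_order_relaxed` would make that outcome possible on hardware with store buffering, such as x86/64:

```cpp
#include <atomic>
#include <thread>

// Runs Example #1 repeatedly with seq_cst atomics and reports whether the
// forbidden outcome c == 0 && d == 0 was ever observed.
bool seqCstForbidsBothZero(int trials) {
    for (int t = 0; t < trials; ++t) {
        std::atomic<int> A(0), B(0);
        int c = -1, d = -1;
        std::thread t1([&] { A.store(1); c = B.load(); });   // seq_cst by default
        std::thread t2([&] { B.store(1); d = A.load(); });
        t1.join();
        t2.join();
        if (c == 0 && d == 0)
            return false;   // would violate sequential consistency
    }
    return true;
}
```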
SEQUENTIALLY CONSISTENT ATOMICS
How to write them

atomic<int> A;

A.store(1, memory_order_seq_cst);
c = A.load(memory_order_seq_cst);

...is the same as (memory_order_seq_cst is the default argument):

A.store(1);
c = A.load();

...and the same as (operator overloading):

A = 1;
c = A;

All other ordering constraints are low-level.
SEQUENTIALLY CONSISTENT ATOMICS
Example #2

atomic<int>  A(0);  
atomic<int>  B(0);  

Thread 1 (two stores)        Thread 2 (two loads)
A = 1;                       c = B;
B = 1;                       d = A;

Possible interleavings always run A = 1 before B = 1, and c = B before d = A.

Outcomes (c, d): (1, 0) is impossible; (0, 0), (0, 1) and (1, 1) are possible.
LOW-LEVEL ATOMICS
Example #2

atomic<int>  A(0);  
atomic<int>  B(0);  

Thread 1                                   Thread 2
A.store(1, memory_order_relaxed);          c = B.load(memory_order_relaxed);
B.store(1, memory_order_relaxed);          d = A.load(memory_order_relaxed);

Doing the same thing, but now c = 1, d = 0 is possible.

This is the bug from Part One! You can fix it with "lightweight fences":
atomic_thread_fence(memory_order_acquire);
atomic_thread_fence(memory_order_release);
VISUALIZING LOW-LEVEL ATOMICS
Imagine each thread having its own private copy of memory.

Thread 1                                   Thread 2
A.store(1, memory_order_relaxed);          B.store(1, memory_order_relaxed);
c = B.load(memory_order_relaxed);          d = A.load(memory_order_relaxed);

In Thread 1's copy of memory, A = 1 and B = 0; in Thread 2's copy, A = 0 and B = 1.
Now c = 0, d = 0 is trivial.
VISUALIZING LOW-LEVEL ATOMICS
This analogy corresponds to each CPU core having its own cache.

Thread 1's cache holds A = 1, B = 0; Thread 2's cache holds A = 0, B = 1.
VISUALIZING LOW-LEVEL ATOMICS
Eventually, changes propagate between threads, but the timing is
unpredictable.

Both copies eventually converge to A = 1, B = 1.
SEQUENTIALLY CONSISTENT ATOMICS
The magic compilers use to implement them

          Load                      Store
x86/64    mov %, %                  lock xchg %, %

PowerPC   hwsync                    hwsync
          ld %, %                   st %, %
          cmp %, 0
          bc #
          isync

ARMv7     ldr %, %                  dmb
          dmb                       str %, %
                                    dmb

ARMv8     ldar %, %                 stlr %, %

Itanium   ld.acq %, %               st.rel %, %
                                    mf

http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
HOW TO CONVERT GAME ATOMICS
to low-level C++11 atomics

Game Atomics                      Low-Level C++11 Atomics
volatile int A;                   atomic<int> A;
A = 1;                            A.store(1, memory_order_relaxed);
int a = A;                        int a = A.load(memory_order_relaxed);
LIGHTWEIGHT_FENCE();              atomic_thread_fence(memory_order_acquire/release);
FULL_FENCE();                     atomic_thread_fence(memory_order_seq_cst);
AtomicIncrement(A);               A.fetch_add(1, memory_order_relaxed);
AtomicCompareExchange(A, …, …);   A.compare_exchange_strong(…, …, memory_order_relaxed);
...                               ...
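As a concrete check of one row of the table above, `AtomicIncrement(A)` on a `volatile int` becomes `fetch_add` on `atomic<int>`. Even with relaxed ordering the increment itself stays atomic; relaxed only drops ordering against surrounding memory operations. A small demo (the driver function is mine):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several threads hammer one counter with relaxed fetch_add;
// no increments are ever lost.
int relaxedCounterDemo(int numThreads, int incrementsPerThread) {
    std::atomic<int> counter(0);
    std::vector<std::thread> threads;
    for (int i = 0; i < numThreads; ++i)
        threads.emplace_back([&] {
            for (int j = 0; j < incrementsPerThread; ++j)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : threads) t.join();
    return counter.load();
}
```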
EXAMPLE
Capped wait-free queue in C++11

template <class T, int size>
class CappedSPSCQueue
{
private:
    T m_items[size];
    atomic<int> m_writePos;
    int m_readPos;

public:
    CappedSPSCQueue() : m_writePos(0), m_readPos(0) {}
    bool tryPush(const T& item) { ... }
    bool tryPop(T& item) { ... }
};
 
EXAMPLE
Low-level with standalone fences
bool tryPush(const T& item)
{
    int w = m_writePos.load(memory_order_relaxed);   // could even be non-atomic
    if (w >= size)
        return false;
    m_items[w] = item;
    atomic_thread_fence(memory_order_release);
    m_writePos.store(w + 1, memory_order_relaxed);
    return true;
}

bool tryPop(T& item)
{
    int w = m_writePos.load(memory_order_relaxed);
    if (m_readPos >= w)
        return false;
    atomic_thread_fence(memory_order_acquire);
    item = m_items[m_readPos];
    m_readPos++;
    return true;
}
EXAMPLE
Low-level with standalone fences
When the load of m_writePos in tryPop sees the value written by the store in
tryPush, the fences synchronize-with each other (§29.8.2, N3337).

bool tryPush(const T& item)
{
    int w = m_writePos.load(memory_order_relaxed);
    if (w >= size)
        return false;
    m_items[w] = item;
    atomic_thread_fence(memory_order_release);
    m_writePos.store(w + 1, memory_order_relaxed);
    return true;
}

bool tryPop(T& item)
{
    int w = m_writePos.load(memory_order_relaxed);
    if (m_readPos >= w)
        return false;
    atomic_thread_fence(memory_order_acquire);
    item = m_items[m_readPos];
    m_readPos++;
    return true;
}
EXAMPLE
Low-level ordering constraints
When the acquire load sees the value written by the release store, the store
synchronizes-with the load (§29.3.2).

bool tryPush(const T& item)
{
    int w = m_writePos.load(memory_order_relaxed);
    if (w >= size)
        return false;
    m_items[w] = item;
    m_writePos.store(w + 1, memory_order_release);
    return true;
}

bool tryPop(T& item)
{
    int w = m_writePos.load(memory_order_acquire);
    if (m_readPos >= w)
        return false;
    item = m_items[m_readPos];
    m_readPos++;
    return true;
}
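The release/acquire variant above is complete enough to run. Here is a self-contained version with a toy producer/consumer driver (the driver, the `4096` capacity, and the summing check are mine; the queue matches the slide's "capped" design, meaning single producer, single consumer, fixed capacity, no wraparound):

```cpp
#include <atomic>
#include <thread>

template <class T, int size>
class CappedSPSCQueue {
    T m_items[size];
    std::atomic<int> m_writePos;
    int m_readPos;
public:
    CappedSPSCQueue() : m_writePos(0), m_readPos(0) {}

    bool tryPush(const T& item) {
        int w = m_writePos.load(std::memory_order_relaxed);
        if (w >= size)
            return false;
        m_items[w] = item;
        m_writePos.store(w + 1, std::memory_order_release);  // publishes the item
        return true;
    }

    bool tryPop(T& item) {
        int w = m_writePos.load(std::memory_order_acquire);  // sees published items
        if (m_readPos >= w)
            return false;
        item = m_items[m_readPos];
        m_readPos++;
        return true;
    }
};

// Push 0..n-1 from one thread, pop and sum from another (n must fit capacity).
long long sumThroughQueue(int n) {
    CappedSPSCQueue<int, 4096> queue;
    long long total = 0;
    std::thread producer([&] {
        for (int i = 0; i < n; ++i)
            while (!queue.tryPush(i)) {}
    });
    std::thread consumer([&] {
        int popped = 0, v;
        while (popped < n)
            if (queue.tryPop(v)) { total += v; ++popped; }
    });
    producer.join();
    consumer.join();
    return total;   // 0 + 1 + ... + (n-1)
}
```

The release store in `tryPush` and the acquire load in `tryPop` are exactly the synchronizes-with pair from the slide, so the consumer never reads `m_items[i]` before the producer's write to it is visible.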
EXAMPLE
Using sequentially consistent atomics
When the load reads from the store, they synchronize-with each other (§29.3.1).

bool tryPush(const T& item)
{
    int w = m_writePos;
    if (w >= size)
        return false;
    m_items[w] = item;
    m_writePos = w + 1;
    return true;
}

bool tryPop(T& item)
{
    int w = m_writePos;
    if (m_readPos >= w)
        return false;
    item = m_items[m_readPos];
    m_readPos++;
    return true;
}
EXAMPLE
Capped wait-free queue in C++11

template <class T, int size>
class CappedSPSCQueue
{
private:
    T m_items[size];
    atomic<int> m_writePos;
    alignas(64) int m_readPos;

public:
    CappedSPSCQueue() : m_writePos(0), m_readPos(0) {}
    bool tryPush(const T& item) { ... }
    bool tryPop(T& item) { ... }
};

All other variables can remain non-atomic because there is no data race.
 
BENCHMARKS
nanoseconds per operation (0 to 100 ns scale; bar values not reproduced here)

Intel Core i7 quad-core / 2.3 GHz / Xcode 5.1.1 Release
ARM Cortex-A9 dual-core / 800 MHz / Xcode 5.1.1 Release

Each chart compares the three queue variants (standalone fences, low-level
constraints, sequentially consistent) across four scenarios: push alone,
pop alone, concurrent push and concurrent pop.
RECAP:
The C++11 Atomic Library

§ C++11 forbids "data races"
§ Two atomic libraries
§ Pass non-atomic information by synchronizing-with
THANKS

Hugo Allaire, Charles Bloom, Hans Boehm, Dominic Couture, Bruce Dawson,
Peter Dimov, Jean-François Dubé, Dominique Duvivier, Maurice Herlihy,
Michael Lavaire, Paul McKenney, Jean-Sébastien Pelletier, Rémi Quenin,
Peter Sewell, Herb Sutter, James Therien, Dmitry Vyukov, Anthony Williams
Jeff Preshing
@preshing
[email protected]

Preshing on Programming
http://preshing.com/
