Exploring Mutexes

1) Oracle databases use mutexes, latches, and locks for synchronization between processes accessing shared memory structures. Mutexes were introduced in Oracle 11g for submicrosecond synchronization and use a retrial spinlock approach. 2) Mutexes initially used a test-and-set spinlock but evolved to a test-and-test-and-set approach, polling the mutex location non-atomically before using atomic instructions to acquire it. If unsuccessful after 255 spins, the mutex yields the CPU and sleeps before retrying. 3) The document discusses the history and operation of Oracle mutexes, and notes that further research is needed to better understand and tune mutex performance.

Exploring mutexes, the Oracle® RDBMS retrial spinlocks
Nikolaev A. S.
[email protected], http://andreynikolaev.wordpress.com
RDTEX LTD, Protvino, Russia

Spinlocks are widely used in contemporary database engines. KGX mutexes are new retrial spinlocks that appeared in contemporary Oracle® versions for submicrosecond synchronization. Mutex contention is frequently observed in high-concurrency OLTP environments.
This work explores how Oracle mutexes operate, spin, and sleep. It develops a predictive mathematical model and discusses the parameters and statistics related to mutex performance tuning, as well as the results of contention experiments.

I. Introduction

According to the Oracle® documentation [1], a mutex is: "A mutual exclusion object ... that prevents an object in memory from aging out or from being corrupted ...".

A huge Oracle RDBMS instance contains thousands of processes accessing shared memory. This shared memory, named the "System Global Area" (SGA), consists of millions of cache, metadata and result structures. Simultaneous access to these structures is synchronized by Locks, Latches and KGX Mutexes:

Fig. 1. Oracle® RDBMS architecture.

Latches and mutexes are the Oracle realizations of the spinlock concept. My previous article [31] explored latches, the traditional Oracle spinlocks known since the 1980s. A process requesting a latch spins for 20000 cycles polling its location. If unsuccessful, the process joins a queue and uses wait-posting to be awakened on latch release.

The goal of this work is to explore the newest Oracle spinlocks, the mutexes. Mutexes were introduced in 2006 for synchronization inside the Oracle Library Cache. Table 1 compares the Oracle internal synchronization mechanisms.

Wikipedia defines the spinlock as "... a lock where the thread simply waits in a loop ("spins") repeatedly checking until the lock becomes available. As the thread remains active but isn't performing a useful task, the use of such a lock is a kind of busy waiting".

The use of spinlocks for multiprocessor synchronization was first introduced by Edsger Dijkstra in [2]. Since that time, mutual exclusion algorithms have advanced significantly. Various sophisticated spinlock realizations (TS, TTS, Delay, MCS, Anderson, etc.) were proposed and evaluated. A contemporary review of these algorithms may be found in [3].

Two general spinlock types exist:

- System spinlocks that protect critical OS structures. The kernel thread cannot wait or yield the CPU. It must loop until success. Most mathematical models explore this spinlock type. The major metrics to optimize for system spinlocks are the frequency of atomic operations (or Remote Memory References) and shared bus utilization.
- User application spinlocks, like Oracle latches and mutexes, that protect user-level structures. It is more efficient to poll the mutex for several microseconds than to preempt the thread with a context switch that costs about 1 millisecond. The metrics to optimize are the CPU and elapsed times of spinlock acquisition.

In Oracle versions 10.2 to 11.2.0.1 (or, equivalently, from 2006 to 2010) the mutex spun using atomic "test-and-set" operations. According to the Anderson classification [4] it was a TS spinlock. Such spinlocks may saturate the shared bus and affect the performance of other memory operations.

Since version 11.2.0.2, processes poll the mutex location nonatomically and only use atomic instructions to finally acquire it. The contemporary mutex became a TTS ("test-and-test-and-set") spinlock.

System spinlocks frequently use more complex structures than TTS. Such algorithms, like the famous MCS spinlocks [5], were optimized for 100% utilization. For the current state of system spinlock theory see [6].

If a user spinlock is held for a long time, for example due to OS preemption, pure spinning becomes ineffective and wastes CPU. To overcome this, after 255 spin cycles the mutex sleeps, yielding the processor to other workloads, and then retries. The sleep timeouts are determined by the mutex wait scheme.

Such spin-sleeping was first introduced in [7] to achieve a balance between the CPU time lost to spinning and the context switch overhead.
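The difference between nonatomic polling and atomic acquisition can be illustrated with a toy model. The sketch below is an illustration only and is not Oracle code: Python has no user-level atomic test-and-set, so a `threading.Lock` stands in for the atomic instruction, and the class name `SpinSleepMutex`, the helper `run_demo` and the 1 ms sleep are assumptions made for this example. Only the 255-cycle spin count is taken from the text.

```python
import threading
import time


class SpinSleepMutex:
    """Toy TTS lock with spin-then-sleep retrials (illustrative, not Oracle code)."""

    def __init__(self, spin_count=255, sleep_s=0.001):
        self._held = False                # the polled "mutex location"
        self._atomic = threading.Lock()   # stand-in for an atomic instruction
        self.spin_count = spin_count      # 255 is the spin count quoted in the text
        self.sleep_s = sleep_s            # assumed sleep timeout

    def _test_and_set(self):
        # Atomically read the old value and mark the mutex as held.
        with self._atomic:
            old = self._held
            self._held = True
            return old

    def acquire(self):
        while True:
            for _ in range(self.spin_count):
                # "Test": plain read, no atomic operation while the mutex is busy.
                if not self._held and not self._test_and_set():
                    return                # "test-and-set" succeeded
            time.sleep(self.sleep_s)      # spin failed: give up the CPU, then retry

    def release(self):
        self._held = False


def run_demo(n_threads=4, n_iter=200):
    """Increment a shared counter under the toy mutex and return the final count."""
    mutex, counter = SpinSleepMutex(), [0]

    def worker():
        for _ in range(n_iter):
            mutex.acquire()
            value = counter[0]            # read-modify-write protected by the mutex
            counter[0] = value + 1
            mutex.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

If mutual exclusion holds, `run_demo(n_threads, n_iter)` returns `n_threads * n_iter`. A TS variant would call `_test_and_set()` on every spin cycle instead of reading `_held` first, which is the bus-saturating behaviour the text attributes to the pre-11.2.0.2 mutex.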



International Conference on Informatics MEDIAS2012, Cyprus, Limassol, May 7–14, 2012
|               | Locks         | Latches              | Mutexes         |
|---------------|---------------|----------------------|-----------------|
| Access        | Several Modes | Types and Modes      | Operations      |
| Acquisition   | FIFO          | SIRO spin, FIFO wait | SIRO            |
| SMP Atomicity | No            | Yes                  | Yes             |
| Timescale     | Milliseconds  | Microseconds         | Submicroseconds |
| Life cycle    | Dynamic       | Static               | Dynamic         |

Table 1. Serialization mechanisms in Oracle

From the queuing theory point of view, such systems with repeated attempts are retrial queues. More precisely, in a retrial system a request that finds the server busy upon arrival leaves the service area and joins a retrial group (orbit). After some time this request has a chance to try its luck again. There exists an extensive literature on retrial queues; see [8, 9] and references therein.

The mutex retrial spin-sleeping algorithm differs significantly from the FIFO spin-blocking used by Oracle latches [31]. Spin-blocking was explored in [10, 11, 12]. Its robustness in contemporary environments was recently investigated in [13].

Historically, mutex contention issues were hard to diagnose and resolve [29]. Mutexes are much less documented than needed and evolve rapidly. Support engineers definitely need more mainstream science support to predict the results of changing mutex parameters. This paper summarizes the author's work on the subject. Additional details may be found in my blog [30].

II. Oracle® RDBMS Performance Tuning overview

Before discussing the mutexes, we need some introduction. During the last 33 years, Oracle evolved from the first one-user SQL database to the most advanced contemporary RDBMS engine. Each version introduced performance and concurrency advances:

v. 2 (1979): the first commercial SQL RDBMS.
v. 3 (1983): the first database to support SMP.
v. 4 (1984): read-consistency, Database Buffer Cache.
v. 5 (1986): Client-Server, Clustering, Distributed Database, SGA.
v. 6 (1988): procedural language (PL/SQL), undo/redo, latches.
v. 7 (1992): Library Cache, Shared SQL, Stored procedures, 64bit.
v. 8/8i (1999): Object types, Java, XML.
v. 9i (2000): Dynamic SGA, Real Application Clusters.
v. 10g (2003): Enterprise Grid Computing, Self-Tuning, mutexes.
v. 11g (2008): Results Cache, SQL Plan Management, Exadata.
v. 11gR2 (2011): ... Mutex wait schemes. Hot object copies.
v. 12c (2012): Cloud ...

As of now, Oracle is the most widely used SQL RDBMS. In the majority of workloads it works perfectly. However, a quick search finds more than 100 books devoted to Oracle performance tuning on Amazon [17, 18, 19]. Dozens of conferences cover this topic every year. Why does Oracle need such tuning?

The main reason is complex and variable workloads. Oracle works in very different environments, ranging from huge OLTPs and petabyte OLAPs to hundreds of multitenant instances running on one server. Every high-end database is unique.

To be able to work in such diverse conditions, Oracle RDBMS has complex internals. To get the most out of the hardware we need precise tuning. Working at Support, I cannot overestimate the importance of educating developers and database administrators in this field.

In order to diagnose performance problems Oracle instrumented its software. Every Oracle session keeps many statistics counters describing "what was done". Oracle Wait Interface (OWI) [19] events describe "why the session waits" and complement the statistics.

Statistics, OWI and data from internal ("fixed") X$ tables are used by Oracle diagnostics and visualization tools.

This is the traditional framework of Oracle performance tuning. However, it was not effective enough in troubleshooting spinlocks.

DTrace.
To observe how mutexes work and to catch short-duration events, we need something like a stroboscope in physics. Luckily, such a tool exists in Oracle Solaris™. This is DTrace, the Solaris 10 Dynamic Tracing framework [21].

DTrace is event-driven, kernel-based instrumentation that can see and measure all OS activity. It defines probes to trap and handlers (actions) written in a dynamically interpreted C-like language. No application changes are needed to use DTrace. This is very similar to triggers in database technologies.

DTrace can:
- Catch any event inside Solaris and any function call inside Oracle.
- Read and change any address location in-flight.
- Count the mutex spins, trace the mutex waits, perform experiments.
- Measure times and distributions up to microsecond precision.

Unlike standard tracing tools, DTrace works in the Solaris kernel. When a process enters a probe function, execution passes to the Solaris kernel and DTrace fills its buffers with the data. Kernel-based tracing is more
stable and has less overhead than userland tracing. DTrace sees all the system activity and can account for the time associated with kernel calls, scheduling, etc.

In the following, sections describing Oracle performance tuning are interleaved with mathematical estimations.

III. Mutex spin model

The Oracle mutex workflow is shown schematically in Fig. 2.

Fig. 2. Oracle mutex workflow.

The Oracle process:
- Uses an atomic hardware instruction for the mutex Get.
- If it misses, the process spins by polling the mutex location during the spin get.
- The number of spin cycles is bounded by the spin count.
- If the spin get does not succeed, the process acquiring the mutex sleeps.
- During the sleep the process may wait for an already free mutex.

Oracle counts Gets and Sleeps, and we can measure Utilization.

This section introduces the mathematical model used to forecast mutex behaviour. It extends the model used in [10, 31] to a general holding time distribution and TTS spinlock concurrency.

Consider a general stream of mutex holding events. The mutex memory location is changed by sessions at times $H_k$, $k \in \mathbb{N}$, using an atomic instruction. This instruction blocks the shared bus and succeeds only when the memory location is free.

After acquisition the session holds the mutex for a time $x_k$ distributed with p.d.f. $p(t)$. I assume that the incoming stream is Poisson with rate $\lambda$ and that $H_k$ (and $x_k$) are generally independent, forming a renewal process. Furthermore, I assume here the existence of at least second moments for all the distributions.

The mutex acquisition request at time $T_m$, $m \in \mathbb{N}$, succeeds immediately if it finds the mutex free. Due to the Serve-In-Random-Order nature of spinlocks, there is no simple relation between $m$ and $k$.

If the mutex was busy at time $T_m$:

$$H_k < T_m < H_k + x_k \quad \text{for some } k,$$

then a miss occurred. According to the PASTA property, the miss probability is equal to the mutex utilization $\rho$.

The missing session spins, polling the mutex location up to a time $\Delta$ determined by the _mutex_spin_count parameter. The initial "contention free" approximation assumes that no other requests arrive during the spin ($\lambda\Delta \ll 1$) and that the session acquires the mutex if it becomes free while spinning. Therefore the spin succeeds if:

$$T_m + \Delta > H_k + x_k .$$

If the mutex was not released during $\Delta$, the session sleeps.

According to classic considerations of renewal theory [23, 24], incoming requests pick up the holding intervals with p.d.f.:

$$p_h(x) = \frac{1}{S}\, x\, p(x), \tag{1}$$

and observe the transformed mutex holding time distribution. Here $S = E(x)$ is the average mutex holding time. The p.d.f. and the average of the residual holding time are well known:

$$p_r(x) = \frac{1}{S}\int_x^\infty p(t)\,dt, \qquad S_r = \frac{1}{2S}\,E(t^2). \tag{2}$$

The spin time distribution (conditioned on a miss) follows the c.d.f. $P_r$, but has a discontinuity [31] at $t = \Delta$ because the session acquiring the mutex never spins longer than $\Delta$. The magnitude of this discontinuity is the overall probability that the residual mutex holding time is greater than $\Delta$. The corresponding p.d.f. is:

$$p_{sg}(x) = p_r(x)\,H(\Delta - x) + Q_r(\Delta)\,\delta(x - \Delta). \tag{3}$$

Here $H(x)$ and $\delta(x)$ are the Heaviside step and delta ("bump") functions, respectively, and $Q(t) = \int_t^\infty p(x)\,dx$, $Q_r(t) = \int_t^\infty p_r(x)\,dx$ are the complementary c.d.f.s of the holding and residual holding times.

The spinlock observables.
Oracle statistics allow measuring the spin inefficiency (or sleep ratio) coefficient $k$. This is the probability of not acquiring the mutex during the spin. Another crucial quantity is $\Gamma$, the average CPU time spent spinning for the mutex:

$$\begin{cases}
k_0 = Q_r(\Delta) = \displaystyle\int_\Delta^\infty p_r(t)\,dt = \frac{1}{S}\int_\Delta^\infty Q(t)\,dt \\[2ex]
\Gamma = \displaystyle\int_0^\infty t\,p_{sg}(t)\,dt = \frac{1}{S}\int_0^\Delta dt \int_t^\infty Q(z)\,dz
\end{cases} \tag{4}$$

Here the subscript 0 denotes the "contention free" approximation. Using (2), for distributions with finite dispersion these expressions can be rewritten in two ways [31].

Low spin efficiency region.
The first form is suitable for the region of low spin efficiency $\Delta \ll S$:

$$\begin{cases}
k_0 = 1 - \dfrac{\Delta}{S} + \dfrac{1}{S}\displaystyle\int_0^\Delta (\Delta - t)\,p(t)\,dt \\[2ex]
\Gamma = \Delta - \dfrac{\Delta^2}{2S} + \dfrac{1}{2S}\displaystyle\int_0^\Delta (\Delta - t)^2\,p(t)\,dt
\end{cases} \tag{5}$$
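As a sanity check of the contention-free expressions, the sketch below evaluates the form (5) numerically and compares it with a closed form. It assumes an exponential holding time $p(t) = e^{-t/S}/S$, for which (4) reduces to $k_0 = e^{-\Delta/S}$ and $\Gamma = S\,(1 - e^{-\Delta/S})$. The exponential distribution, the function names and the midpoint integration rule are assumptions made for illustration; the model itself is stated for a general holding time distribution.

```python
import math


def midpoint(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def spin_observables(p, S, delta):
    """Contention-free k0 and Gamma from the low-efficiency form (5).

    p     -- holding time p.d.f. (callable)
    S     -- average holding time E(x)
    delta -- maximal spin time
    """
    k0 = 1.0 - delta / S + midpoint(lambda t: (delta - t) * p(t), 0.0, delta) / S
    gamma = (delta - delta ** 2 / (2.0 * S)
             + midpoint(lambda t: (delta - t) ** 2 * p(t), 0.0, delta) / (2.0 * S))
    return k0, gamma


def exponential_pdf(S):
    """Assumed holding time distribution, used only for this illustration."""
    return lambda t: math.exp(-t / S) / S


if __name__ == "__main__":
    S = 1.0
    for delta in (0.1, 1.0, 3.0):
        k0, gamma = spin_observables(exponential_pdf(S), S, delta)
        # Closed forms for the exponential case, printed next to the numeric values.
        print(delta, k0, math.exp(-delta / S), gamma, S * (1.0 - math.exp(-delta / S)))
```

For small `delta` the printed values show the behaviour used in the low-efficiency region: `k0` is close to `1 - delta/S` and `gamma` is close to `delta`.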
From the above expressions it is clear that the spin probes the mutex holding time distribution around the origin. Other parts of the mutex holding time p.d.f. affect spin efficiency and CPU consumption only through the average holding time $S$. This allows us to estimate how these quantities depend on a change of _mutex_spin_count (or $\Delta$). If processes never release the mutex immediately ($p(0) = 0$) then

$$\begin{cases}
k = 1 - \dfrac{\Delta}{S} + O(\Delta^3) \\[1.5ex]
\Gamma = \Delta - \dfrac{\Delta^2}{2S} + O(\Delta^4)
\end{cases}$$

For Oracle performance tuning purposes we need to know what happens if we double $\Delta$: in the low efficiency region, doubling the spin count doubles the number of efficient spins and also doubles the CPU consumption.

High spin efficiency region.
In the high efficiency region the sleep cuts off the tail of the spinlock holding time distribution:

$$\begin{cases}
k_0 = \dfrac{1}{S}\displaystyle\int_\Delta^\infty (t - \Delta)\,p(t)\,dt \\[2ex]
\Gamma = \dfrac{1}{2S}E(t^2) - \dfrac{1}{2S}\displaystyle\int_\Delta^\infty (t - \Delta)^2\,p(t)\,dt = S_r - T_r
\end{cases} \tag{6}$$

Here $T_r$ is the residual after-spin holding time. This quantity will be used later.

Oracle normally operates in this region of small sleep ratio. Here the spin count is greater than the number of instructions protected by the mutex, $\Delta > S$. The spin time is bounded by both the "residual holding time" and the spin count:

$$\Gamma < \min(S_r, \Delta)$$

The sleep prevents the process from wasting CPU by spinning on the heavy tail of the mutex holding time distribution.

Concurrency model.
In the real world several processes may spin on different processors concurrently. After the mutex release all these sessions issue atomic Test-and-Set instructions to acquire the mutex. Only one instruction succeeds. What should the other sessions do? This is the principal question for hybrid TTS spinlocks.

The session may either continue to spin up to _mutex_spin_count or sleep immediately. The sleep seems reasonable because the session knows that the spinlock has just become busy.

For further estimations I will use the second scenario. After the mutex release only one spinning session acquires it according to the SIRO discipline, while all other sessions sleep. This is an interesting queuing discipline that, to my knowledge, has not been explored in the literature. Its C pseudocode looks like:

    while (1) {
        i = 0;
        /* "test": poll the mutex location without atomic instructions */
        while (lock != 0 && i < spin_count) i++;
        /* "test-and-set": a single atomic attempt; true means acquired */
        if (Test_and_Set(&lock)) return SUCCESS;
        /* the spin expired or another session won the race */
        Sleep();
    }

The time just after the mutex release is a Markov regeneration point. All the spinning behavior after this time is independent of the previous history.

Consider a mutex holding interval of length $x$ containing at least one (tagged) incoming request for the mutex from a Poisson stream with rate $\lambda$. The conditional probability that this interval contains exactly $n$ incoming requests is:

$$\frac{1}{1 - e^{-\lambda x}}\,\frac{(\lambda x)^n}{n!}\,e^{-\lambda x}, \qquad n \geqslant 1.$$

Under these conditions the session acquires the mutex with probability $1/n$. The overall probability for the tagged session to acquire the mutex is:

$$F_c(\lambda x) = \frac{1}{e^{\lambda x} - 1}\sum_{n=1}^{\infty}\frac{1}{n}\,\frac{(\lambda x)^n}{n!} = -\frac{\mathrm{Ein}(-\lambda x)}{e^{\lambda x} - 1} \tag{7}$$

Here $\mathrm{Ein}(z) = \int_0^z (1 - e^{-y})\,\frac{dy}{y}$ is the Entire Exponential integral [25].

The concurrency formfactor $F_c(x)$ is a smooth monotonically decreasing function with asymptotics $F_c(x) = 1 - x/4 + O(x^2)$ around 0 and $F_c(x) \approx 1/x + O(1/x^2)$ as $x \to \infty$. It may be efficiently approximated by rational functions [26]. Fig. 3 shows its log-log plot.

Fig. 3. The concurrency formfactor $F_c(x)$.

Normally the mutex operates in the region $\lambda x \ll 1$ and the value of the formfactor $F_c$ is very close to 1.

According to (1), the probability that a missing request observes a holding interval from $x$ to $x + dx$, that its residual holding time is from $t$ to $t + dt$, $t \leqslant x$, and that it concurrently acquires the mutex on release is:

$$dP = \frac{1}{S}\,p(x)\,F_c(\lambda \min(x, \Delta))\,dx\,dt$$

Therefore, the spin inefficiency, or the probability of not acquiring the mutex by spinning, is:

$$k = 1 - \frac{1}{S}\int_0^\Delta dt \int_t^\infty p(x)\,F_c(\lambda \min(x, \Delta))\,dx \tag{8}$$

Changing the order of integration we have:

$$k = 1 - \frac{1}{S}\int_0^\infty \min(x, \Delta)\,p(x)\,F_c(\lambda \min(x, \Delta))\,dx \tag{9}$$
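The formfactor (7) and the spin inefficiency (9) are easy to evaluate numerically. The sketch below sums the series in (7) directly and integrates (9) by the midpoint rule. It assumes an exponential holding time with mean $S$ purely for illustration; the function names, the series stopping rule and the integration step are choices made for this example and are not taken from the paper.

```python
import math


def formfactor(z):
    """Concurrency formfactor F_c(z) = sum_{n>=1} z^n / (n * n!) / (e^z - 1), eq. (7)."""
    if z <= 0.0:
        return 1.0                      # limit of F_c at zero contention
    term, total, n = 1.0, 0.0, 0
    while True:
        n += 1
        term *= z / n                   # term == z**n / n!
        total += term / n
        # Stop once past the largest term and the remaining terms are negligible.
        if n > z and term / n < 1e-17 * total:
            break
    return total / math.expm1(z)


def spin_inefficiency(rho, delta, S=1.0, n=2000):
    """Spin inefficiency k from eq. (9) for an assumed exponential holding time.

    rho   -- mutex utilization, so that lambda = rho / S (Little's law)
    delta -- maximal spin time
    """
    lam = rho / S
    h = delta / n
    # Part of (9) with x < delta, where min(x, delta) = x.
    integral = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        integral += x * (math.exp(-x / S) / S) * formfactor(lam * x) * h
    # Part of (9) with x >= delta, where min(x, delta) = delta and Q(delta) = exp(-delta/S).
    tail = delta * formfactor(lam * delta) * math.exp(-delta / S)
    return 1.0 - (integral + tail) / S


if __name__ == "__main__":
    for z in (0.01, 0.1, 1.0, 10.0, 50.0):
        print(z, formfactor(z))
    for rho in (0.0, 0.1, 0.3, 0.5):
        print(rho, spin_inefficiency(rho, delta=2.0))
```

With `rho = 0` the formfactor equals 1 and `spin_inefficiency` reduces to the contention-free value $k_0 = e^{-\Delta/S}$; increasing `rho` increases `k`, which is the contention contribution described by the model.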
Comparing with (6), we can outline the contention contribution:

$$k = k_0 + \frac{1}{S}\int_0^\Delta Q(x)\,\frac{\partial}{\partial x}\Bigl(x\,\bigl(1 - F_c(\lambda x)\bigr)\Bigr)\,dx \tag{10}$$

Using the formfactor asymptotics in the low contention region $\lambda\Delta \ll 1$ and Little's law $\rho = \lambda S$, we can estimate how the spin efficiency depends on mutex utilization:

$$k = k_0 + \frac{\rho}{2S^2}\int_0^\Delta x\,Q(x)\,dx + O\bigl((\lambda\Delta)^2\bigr)$$

Data of mutex contention experiments (Fig. 4) roughly agree with this linear approximation.

Fig. 4. $k(\rho)$.

IV. How Oracle requests the mutex

This section discusses Oracle mutex internals beyond the documentation. In order to explore mutexes we need reproducible testcases.

Each time an Oracle session executes a SQL statement, it needs to pin the cursor in the library cache using a mutex. "True" mutex contention arises when the same SQL statement executes concurrently at high frequency. Therefore, the simplest testcase for "Cursor: pin S" contention should look like:

    begin
      for i in 1..1000000 loop
        execute immediate 'select 1 from dual where 1=2';
      end loop;
    end;
    /

The script uses a PL/SQL loop to execute a fast SQL statement one million times. Pure "Cursor: pin S" mutex contention arises when I execute this script in several concurrent sessions.

It is worth noting that the session_cached_cursors parameter value must be nonzero to avoid soft parses. Otherwise, we will also see contention for the "Library Cache" and "hash table" mutexes. Indeed, it is enough to disable the session cursor cache and add a dozen versions of the SQL to induce "Cursor: mutex S" contention.

Similarly, "Library cache: mutex X" contention arises when an anonymous PL/SQL block executes concurrently at high frequency:

    begin
      for i in 1..1000000 loop
        execute immediate 'begin demo_proc();end;';
      end loop;
    end;
    /

Many other mutex contention scenarios are possible. See blog [30].

Table 2 describes the types of mutexes in contemporary Oracle. The "Cursor Pin" mutexes act as pin counters for library cache objects (e.g. child cursors) to prevent their aging out of the shared pool. "Library cache" cursor and bucket mutexes protect KGL locks and static library cache hash structures. The "Cursor Parent" and "hash table" mutexes protect parent cursors during parsing and reloading.

| Type id | Mutex type                 | Protects                 |
|---------|----------------------------|--------------------------|
| 7       | Cursor Pin                 | Pin cursor in memory     |
| 8       | hash table                 | Cursor management        |
| 6       | Cursor Parent              | ...                      |
| 5       | Cursor Stat                | ...                      |
| 4       | Library Cache              | Library cache management |
| 3       | HT bucket mutex (kdlwl ht) | SecureFiles management   |
| 2       | SHT bucket mutex           | ...                      |
| 1       | HT bucket mutex            | ...                      |
| 0       | FSO mutex                  | ...                      |

Table 2. Mutex types in Oracle 11.2.0.3

The mutex address can be obtained from the x$mutex_sleep_history Oracle table. Such "fixed" tables externalize internal Oracle structures to SQL. Due to the dynamic nature of mutexes, Oracle does not have any fixed table like v$latch containing data about all mutexes.

According to the Oracle documentation, this table is a circular buffer containing data about the latest mutex waits. However, my experiments demonstrated that it is actually a hash array in the SGA. The hash key of this array is likely to depend on the mutex address and the ID of the blocking session. The row for each next sleep for the same mutex and blocking session replaces the row for the previous sleep.

Mutexes in memory.
We can examine a mutex using the oradebug peek command. It shows the memory contents:

    SQL> oradebug peek 0x3F119B5A8 24
    [3F119B5A8, 3F119B5CC) =
    00000016 00000001 0000001D 000015D7 382DA701 03
    SID      refcnt   gets     sleeps   idn      op

According to the Oracle documentation the mutex structure contains:
- An atomically modified value that consists of two parts:
  Holding SID. The top 4 bytes contain the SID of the session currently holding the mutex eXclusively or modifying it. It was session number 0x16 in the above example.
  Reference count. The lower 4 bytes represent the number of sessions currently holding the mutex in Shared mode (or the mutex is in-flux).
- GETS: the number of times the mutex was requested.
- SLEEPS: the number of times sessions slept for the mutex.
- IDN: the mutex identifier. This is the hash value of the library cache object protected by the mutex, or the hash bucket number.
- OP: the current mutex operation.

An Oracle session changes the mutex state through a dynamic structure called the Atomic Operation Log (AOL).

Fig. 5. Mutex AOL

The AOL contains information about the mutex operation in progress. To operate on a mutex, a session first creates an AOL, fills it with data about the mutex and the desired operation, and calls one of the mutex acquisition routines. Each session has an array of references to the AOLs it is using. Fig. 5 illustrates this. AOLs are also used during mutex recovery if a session crashes.

Mutex modes and states.
A mutex can be held in three modes:
- "Shared" (SHRD in traces) mode allows the mutex to be held by several sessions simultaneously. It allows read (execute) access to the structure protected by the mutex. In shared mode the lower 4 bytes of the mutex value represent the number of sessions holding the mutex. The upper bytes are zero.
- "eXclusive" (EXCL) mode is incompatible with all other modes. Only one session can hold the mutex in exclusive mode. It gives the session exclusive access to the structure protected by the mutex. In X mode the upper bytes of the mutex value are equal to the holder SID. The lower bytes are zero.
- "Examine" (SHRD EXAM in dumps) mode indicates that the mutex or its protected structure is in transition. In E mode the upper bytes of the mutex value are equal to the holder SID. The lower bytes represent the number of sessions simultaneously holding the mutex in S mode. A session can acquire the mutex in E mode or upgrade it to E mode even if other sessions are holding the mutex in S mode. No other session can change the mutex at that time.

My experiments demonstrated that the mutex state transition diagram looks like an infinite fence containing shared states 0,1,2,3,... and corresponding examine states 0,1,2,3,... There are also EXCL and LONG EXCL states (Fig. 6).

Fig. 6. Mutex state transitions.

Not all operations are used by each mutex type. The "Cursor Pin" mutex pins the cursor in the Library Cache during parse and execution in a figure-eight-like way:

Fig. 7. "Cursor Pin" mutex state diagram.

Here the E and S modes effectively act for the "Cursor Pin" mutex as the exclusive and free states. The "Library Cache" mutex uses X mode only. The math in this paper is targeted at these mutex types.

The "hash table" mutexes utilize both X and S modes. Such "read-write" spinlocks will be investigated in a separate paper.

When a session is waiting for a mutex, it registers the wait in the Oracle Wait Interface [19]. The most frequently observed waits are named "cursor: pin S", "cursor: pin S wait on X" and "library cache: mutex X". The naming scheme is presented in Table 3. Here mutex is the name of the mutex type.

| Held | get S   | get X, LX         | get E   |
|------|---------|-------------------|---------|
| S    | -       | mutex S           | ?       |
| X    | mutex X | mutex X           | mutex X |
| E    | -       | mutex S wait on X | mutex S |

Table 3. Mutex waits in Oracle Wait Interface

Experimental setup to explore mutex wait.
Unlike the latch, the details of the mutex wait were not documented by Oracle. We need to explore it using DTrace. To explore the latch in [31], I acquired it directly by calling the kslgetl function. This is not possible for a mutex. However, by changing memory I can make a mutex "busy" artificially. The Oracle oradebug utility allows changing any address inside the SGA:

    SQL>oradebug poke <mutex addr> 8 0x100000001
    BEFORE: [3A9371338, 3A9371340) =
    00000000 00000000
    AFTER: [3A9371338, 3A9371340) =
    00000001 00000001

This looks exactly as if the session with SID 1 is holding the mutex in E mode. I wrote several scripts that simulate a busy mutex in S, X and E modes. In
these scripts one session artificially holds the mutex for 50 s. Another session tries to acquire the mutex and "statically" waits for the "cursor: pin S" event during 49 s. DTrace allowed me to explore how Oracle actually waits for the mutex.

Original Oracle 10g mutex busy wait.
Oracle introduced the mutexes in version 10.2.0.2. Running the script against this version I saw that the waiting process consumed one of my CPUs completely. Oracle showed millions of microsecond-long waits that accounted for only 3 seconds of the actual 49-second wait. The wait trace looks like:

    ... spin 255 cycles
    yield()
    spin 255 cycles
    yield()
    ... repeated 1910893 times

The session waiting for the mutex repeatedly spins 255 times polling the mutex location and then issues the yield() OS syscall. This syscall just allows other processes to run.

Oracle 10.2-11.1 counts as wait time only the time spent off the CPU waiting for other processes. If the system had free CPU power, Oracle considered that it was not waiting at all, and the mutex contention was invisible.

Therefore the old version of the mutex was a "classic" spinlock without sleeps. If the mutex holding time is always small, this algorithm minimizes the elapsed time to acquire the mutex. A spinning session acquires the mutex immediately after its release.

Such spinlocks are vulnerable to variability of the holding time. If sessions hold the mutex for a long time, pure spinning wastes CPU. Spinning sessions can aggressively consume all the CPUs and affect performance through priority inversion and CPU starvation.

Mutex wait with Patch 6904068.
If long "cursor: pin S" waits are consistently observed in Oracle 10.2-11.1, then the system does not have enough spare CPU for busy waiting. For such a case, Oracle provides the possibility to convert the "busy" mutex wait into a "standard" sleep. This enhancement was named "Patch 6904068: High CPU usage when there are "cursor: pin S" waits". With this patch the mutex wait trace became:

    ... spin 255 times
    semsys() timeout=10 ms
    ... repeated 4748 times

The semtimedop() is a "normal" OS sleep. The patch significantly decreases the CPU consumed by spinning. Its drawback is a larger elapsed time to obtain the mutex. Ten milliseconds is a long wait on the Oracle timescale.

One can adjust the sleep time with centisecond granularity and even set it to 0 dynamically. In such a case the instance behaves exactly as it does without the patch. It makes sense to install patch 6904068 in 10.2-11.1 OLTP environments proactively.

IV. Mutex statistics

Mutex statistics are the tools to diagnose its efficiency. Oracle internally counts the numbers of gets and sleeps for a mutex. However, there is no fixed table containing the current statistics. The x$mutex_sleep_history table shows the statistics as they were at the time of the last sleep. This is not enough.

Fortunately, Oracle provides us with the x$ksmmem fixed table. It shows the contents of any address inside the SGA. The mutex value, its gets and its sleeps can be read directly from Oracle memory. By repeatedly sampling the mutex value we can estimate another key mutex statistic, the Utilization $U$. Little's law $U = \lambda S$ then allows computing the average mutex holding time $S$.

Unlike latches [31], a mutex does not count its misses and spin gets. The miss ratio $\rho$ should be estimated from the PASTA (Poisson Arrivals See Time Averages) property: $\rho \approx U$.

Oracle counts only the first mutex get, but all the secondary sleeps. Therefore, the spin inefficiency coefficient $k$ differs from the sleep ratio $\kappa = \Delta\text{sleeps}/\Delta\text{misses}$. If sleeps are much longer than the mutex correlation time, then every spin-and-sleep cycle observes an independent picture. Because the sleep probability is $k\rho$, one can estimate:

$$\kappa = k + (k\rho)k + (k\rho)^2 k + \ldots = \frac{k}{1 - k\rho}$$

The following table summarizes the mutex statistics and their relations.

| Description                   | Definition                                          | Relations                                                     |
|-------------------------------|-----------------------------------------------------|---------------------------------------------------------------|
| Mutex requests arrival rate   | $\lambda = \Delta\text{gets}/\Delta\text{time}$     |                                                               |
| Sleeps rate                   | $\omega = \Delta\text{sleeps}/\Delta\text{time}$    | $\omega = \kappa\rho\lambda$                                  |
| Miss ratio (PASTA estimation) | $\rho = \Delta\text{misses}/\Delta\text{gets}$      | $\rho \approx U$                                              |
| Sleeps ratio                  | $\kappa = \Delta\text{sleeps}/\Delta\text{misses}$  | $\kappa = \dfrac{\omega}{\lambda\rho} = \dfrac{k}{1 - k\rho}$ |
| Avg. holding time             | $S = U/\lambda$                                     | (Little's law)                                                |
| Mutex spin inefficiency       | $k = \Delta\text{sleeps}/\Delta\text{spins}$        | $k = \dfrac{\kappa}{1 + \kappa\rho}$                          |

The corresponding script mutex_statistics.sql to measure mutex statistics is available in [30].

The spin time $\Delta$ can be obtained in my experiments by counting the "spin-and-yield" cycles per second. Contemporary Oracle versions can adjust $\Delta$ using the parameter _mutex_spin_count. Therefore, we can compute the spin and yield times separately by linear regression.

Typical no-contention values for the spin, the yield and the mutex holding time $S$ in exclusive mode on some platforms are summarized in Table 4.
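The relations between the mutex statistics can be collected in one small function. The sketch below is only an illustration of the formulas: the function name `mutex_statistics`, its argument names and the sample numbers are hypothetical, and it assumes that the counter deltas and the sampled utilization were already obtained, for example by sampling the mutex memory as described above.

```python
def mutex_statistics(d_gets, d_sleeps, d_time, utilization):
    """Derive mutex statistics from counter deltas over an interval.

    d_gets, d_sleeps -- increments of the mutex GETS and SLEEPS counters
    d_time           -- length of the sampling interval in seconds
    utilization      -- sampled fraction of time the mutex was busy (U)
    """
    lam = d_gets / d_time                 # arrival rate of mutex requests
    omega = d_sleeps / d_time             # sleeps rate
    rho = utilization                     # miss ratio, PASTA estimate rho ~ U
    holding_time = utilization / lam      # Little's law: S = U / lambda
    kappa = omega / (lam * rho)           # sleeps ratio
    k = kappa / (1.0 + kappa * rho)       # spin inefficiency
    return {
        "lambda": lam,
        "omega": omega,
        "rho": rho,
        "S": holding_time,
        "kappa": kappa,
        "k": k,
    }


if __name__ == "__main__":
    # Hypothetical sample: 10^6 gets and 5000 sleeps in 10 s at 10% utilization.
    for name, value in mutex_statistics(1_000_000, 5_000, 10.0, 0.10).items():
        print(name, value)
```

The returned `k` and `kappa` satisfy `kappa == k / (1 - k * rho)` up to rounding, which is the geometric series relation given in the text.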
8 (MEDIAS2012) Nikolaev A. S.

    Platform    Library cache    Cursor pin     spin      yield()
    Exadata     0.3-5 µs         0.1-2 µs       1.8 µs    0.7 µs
    Sparc T2    2.5-12 µs        3.2-11 µs      8.7 µs    9.5 µs

Table 4. Average mutex spin and yield() times.

Compare these microsecond times with the default mutex sleep of 10 ms duration. Indeed, the mutex sleep is of the order of 10^4 times longer than the spin.

V. "Mean Value Analysis" of mutex retrials

"Mean Value Analysis" (MVA) is an elegant approach for queuing systems invented by M. Reiser, et al. [27]. Recent work [28] discussed the MVA for retrial queues. Though not directly applicable to the non-Markovian mutex, this approach can be useful for estimations.

The important point of the following approximation is the replacement of the fixed-time mutex sleep by an exponential memoryless distribution. According to PASTA, a request arriving with frequency λ finds the mutex busy with probability ρ and goes to orbit (sleeps) for time T with probability kρ.

The waiting time consists of the spin and the sleep in the orbit:

$$ W = W_s + W_{orb} \tag{11} $$

The process acquires the mutex during repeating spins. The overall spin time is:

$$ W_s = \rho\Gamma + (k\rho)\rho\Gamma + (k\rho)^2\rho\Gamma + \dots = \Gamma\,\frac{\rho}{1 - k\rho} \tag{12} $$

The request retries from the orbit while the mutex is busy and while it is idle (Fig. 2):

$$ W_{orb} = W_b + W_i $$

In steady state the busy mutex wait time is the time needed to serve all requests currently in the system:

$$ W_b = L_{orb} S + \rho T_r $$

Here T_r is the residual after-spin holding time that already appeared in (6). According to Little's law, L_orb = λ W_orb, L_b = λ W_b and ρ = λS. Therefore:

$$ W_b = \rho\,(W_{orb} + T_r) \tag{13} $$

Flows per second going to and from the orbit should be balanced. For the exponential sleep approximation:

$$ k\lambda\rho + k\,\frac{\lambda W_b}{T} = \frac{\lambda W_{orb}}{T} $$

Here T is the average time to sleep. One can substitute, in the spirit of MVA, (13) into this expression and estimate the average wait time spent on orbit as:

$$ W_{orb} = \frac{k\rho}{1 - k\rho}\,(T + T_r) $$

The overall wait time becomes:

$$ W = \frac{\rho}{1 - k\rho}\,\bigl(\Gamma + k\,(T + T_r)\bigr) \tag{14} $$

In order to compare this to mutex experimental data it should be noted that, unlike queuing theory, the Oracle Wait Interface (OWI) does not treat the first spin as a part of the wait [19]. The wait time registered by OWI will be:

$$ W_o = \frac{k\rho}{1 - k\rho}\,(T + \rho\Gamma + T_r) $$

Oracle performance tuning frequently uses the "average mutex wait duration" metric from the AWR report [1] as a contention signature. This is the OWI waiting time normalized by the number of waits:

$$ w_o = \frac{1}{1 - k\rho}\,(T + \rho\Gamma + T_r) $$

If this quantity differs significantly from 1 cs, it may be a sign of abnormal mutex utilization or holding time. Usually in Oracle 11.2 the huge sleep time T dominates in these formulas and limits the mutex wait performance:

$$ T \sim 10^4 \times \{\Gamma, \Delta, T_r\}, \qquad k > 0.1 $$

Of course, such estimations do not account for OS scheduling and are not applicable when the number of active processes exceeds the number of CPUs.

VI. 11.2.0.2.2 Mutex waits diversity

Since April 2011 the latest Oracle versions use a completely new concept of mutex waits.
The My Oracle Support site [16] described this in the note "Patch 10411618 Enhancement to add different "Mutex" wait schemes". The enhancement allows one of three concurrency wait schemes and introduces 3 parameters to control the mutex waits:

- _mutex_wait_scheme: which wait scheme to use:
    0 Always YIELD.
    1 Always SLEEP for _mutex_wait_time.
    2 Exponential Backoff up to _mutex_wait_time.
- _mutex_spin_count: the number of times to spin. Default value is 255.
- _mutex_wait_time: sleep timeout depending on scheme. Default is 1.

The note also mentioned that this fix effectively supersedes the patch 6904068 described above.

The SLEEPS. Mutex wait scheme 1
In mutex wait scheme 1 the session repeatedly requests 1 ms sleeps:

    kgxSharedExamine()
    yield()
    pollsys() timeout=1 ms    repeated 25637 times

The _mutex_wait_time parameter controls the sleep timeout in milliseconds. This scheme differs from patch 6904068 by one additional spin-and-yield cycle at the beginning and a smaller timeout. Actually, the millisecond sleep will be rounded to a centisecond on platforms like Solaris, Windows and the latest HP-UX. This is because most operating systems cannot sleep for very short times.

Performance of this scheme is sensitive to _mutex_wait_time tuning. Fig. 8 demonstrates that at moderate concurrency smaller mutex sleeps perform better and result in higher throughput.

[Fig. 8. Mutex wait scheme 1 throughput. Plot of throughput (tps) versus number of threads N for _mutex_wait_scheme=1 on Exadata X2-2, with _mutex_wait_time = 1, 8, 32 and 64 ms.]

MVA estimations for mutex wait scheme 1 result in:

$$ W_1 \approx \frac{\rho}{1 - k\rho}\,\bigl(\Gamma + k\,(kT + T_r)\bigr) $$

You see that the additional spin at the beginning effectively reduces the wait time T, multiplying it by k. This increases the performance.

Default "Exponential Backoff" scheme 2
Oracle uses scheme 2 by default. This scheme is named "Exponential backoff" in the documentation. Unlike in previous versions, the contemporary mutex wait does not consume CPU. Surprisingly, DTrace shows that there is no exponential behavior by default. The session repeatedly sleeps with 1 cs duration:

    yield() call repeated 2 times
    semsys() timeout=10 ms    repeated 4237 times

To reveal the exponential backoff one needs to increase the _mutex_wait_time parameter.

    SQL> alter system set "_mutex_wait_time"=30;
    ...
    yield() call repeated 2 times
    semsys() timeout=10 ms    repeated 2 times
    semsys() timeout=30 ms    repeated 2 times
    semsys() timeout=80 ms
    semsys() timeout=70 ms
    semsys() timeout=160 ms
    semsys() timeout=150 ms
    semsys() timeout=300 ms    repeated 159 times

This closely resembles the Oracle 8i latch acquisition algorithm [31, 30]. In scheme 2 the _mutex_wait_time controls the maximum wait time in centiseconds. Due to exponentiality, mutex wait scheme 2 is insensitive to its value. Indeed, only the sleep after the fifth unsuccessful spin is affected by this parameter [30].

The default mutex wait scheme 2 differs from patch 6904068 by two yield() syscalls at the beginning. These two spin-and-yields change the mutex wait performance drastically (Fig. 9). They effectively multiply the centisecond wait time T by k^2:

$$ W_2 \approx \frac{\rho}{1 - k\rho}\,\bigl(\Gamma + k\,(k^2 T + T_r)\bigr) $$

Classic YIELDS. Mutex wait scheme 0
The mutex wait scheme 0 mostly consists of repeating spin-and-yield cycles.

    yield() call repeated 99 times
    pollsys() timeout=1 ms
    yield() call repeated 99 times
    pollsys() timeout=1 ms
    ...

It differs from the aggressive mutex waits used in previous Oracle versions by a 1 ms sleep after each 99 yields. This sleep significantly reduces CPU consumption and increases robustness. Unfortunately, the previous MVA-style analysis is not applicable to this wait scheme.

Scheme 0 is very flexible [30]. The sleep duration and yield frequency are tunable by the _wait_yield_sleep_time_msecs ... parameters. One can also specify different wait modes for standard and high priority processes. This allows almost any combination of yields and sleeps, including the 10g and patch 6904068 behaviors.

Comparison of mutex wait schemes
Fig. 9 compares the performance of the "Library Cache" mutex contention testcase on the Exadata platform for all 3 wait schemes, as well as for the patch 6904068 and 10g mutex algorithms.
The graphics demonstrate that:

- Default scheme 2 is well balanced in all concurrency regions.
- Wait scheme 1 should be used when the system is constrained by CPU.
- Wait scheme 0 has throughput close to 10g in the medium concurrency region and may be recommended when there is plenty of free CPU.

- The 10g mutex algorithm had the fastest performance in medium concurrency workloads. However, its throughput fell when the number of contending threads exceeded the number of CPU cores. CPU consumption increased rapidly beyond this point. This excessive CPU consumption starves processors and impacts other database workloads.
- Patch 6904068 results in very low CPU consumption, but the largest elapsed time and the worst throughput.

[Fig. 9. Mutex wait schemes performance. Panels: elapsed time, CPU per thread and throughput versus number of threads N, for wait schemes 2, 1 and 0, the 10g algorithm and patch 6904068.]

VII. Mutex Contention
Mutex contention occurs when the mutex is requested by several sessions at the same time. When diagnosing mutex contention we should always remember Little's law:

$$ U = \lambda S $$

Therefore, contention can be a consequence of either:

- Long mutex holding time S due to, for example, high SQL version count, bugs causing long mutex holding time, or CPU starvation and preemption issues.
- High mutex exclusive utilization U. Mutexes may be overutilized by too high a SQL and PL/SQL execution rate or by bugs causing excessive requests.

Mutex statistics help to diagnose what actually happens.
The latest Oracle versions include many fixes for mutex related bugs and allow flexible wait schemes, adjustment of the spin count and cloning of hot library cache objects. My blog [30] continuously discusses related enhancements.

Traditionally, tuning of mutex performance problems was focused on changing the application and reducing the mutex demand. To achieve this one needs to tune the SQL operators, change the physical schema, raise a bug with Oracle Support, etc. [17, 18, 19, 29].
However, such tuning may be too expensive and may even require a complete application rewrite. This article discusses one tuning possibility that is not widely used: changing the mutex spin count. This was commonly treated as old-style tuning, which should be avoided by any means. The public opinion is that increasing the spin count leads to a waste of CPU. However, nowadays CPU power is cheap. We may already have enough free resources. It makes sense to know when spin count tuning may be beneficial.

Mutex spin count tuning.
Long mutex holding time may cause mutex contention. The default _mutex_spin_count=255 may be too small. Longer spinning may alleviate this. If the mutex holding time distribution has an exponential tail:

$$ Q(t) \sim C \exp(-t/\tau) $$
$$ k \sim C \exp(-t/\tau) $$
$$ \Gamma \sim S_r - C\tau \exp(-t/\tau) $$

It is easy to see that if the "sleep ratio" is small enough (k ≪ 1), then doubling the spin count will square the "sleep ratio" and will only add a part of order k to the spin CPU consumption.
In other words, if the spin is already efficient, it is worth increasing the spin count. Fig. 10 demonstrates the effect of spin count adjustment for the "Library Cache" mutex contention testcase.
The spin count tuning is very effective. Elapsed time fell rapidly while CPU increased smoothly. The number of mutex waits demonstrates almost linear behavior in log scale. This confirms the scaling rule.

[Fig. 10. Spin count tuning. Panels: elapsed and CPU time, throughput, mutex waits and wait time, each versus spin count.]

Conclusions
This work investigated the possibilities to diagnose and tune mutexes, retrial Oracle spinlocks. Using

DTrace, it explored how the mutex works, its spin-waiting schemes, corresponding parameters and statistics. The mathematical model was developed to predict the effect of mutex tuning.
The results are important for performance tuning of highly loaded Oracle OLTP databases.

Acknowledgements
Thanks to Professor S.V. Klimenko for kindly inviting me to the MEDIAS 2012 conference.
Thanks to RDTEX CEO I.G. Kunitsky for financial support. Thanks to RDTEX Technical Support Centre Director S.P. Misiura for years of encouragement and support of my investigations.
Thanks to my colleagues for discussions and all our customers for participating in the mutex troubleshooting.

References
[1] Oracle® Database Concepts 11g Release 2 (11.2). 2010.
[2] E. W. Dijkstra. 1965. Solution of a problem in concurrent programming control. Commun. ACM 8, 9 (September 1965), 569-. DOI=10.1145/365559.365617
[3] J.H. Anderson, Yong-Jik Kim, "Shared-memory Mutual Exclusion: Major Research Trends Since 1986". 2003.
[4] T. E. Anderson. 1990. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Trans. Parallel Distrib. Syst. 1, 1 (January 1990), 6-16. DOI=10.1109/71.80120
[5] John M. Mellor-Crummey and Michael L. Scott. 1991. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst. 9, 1 (February 1991), 21-65. DOI=10.1145/103727.103729
[6] M. Herlihy and N. Shavit. 2008. The Art of Multiprocessor Programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. ISBN: 978-0123705914. "Chapter 07 Spin Locks and Contention."
[7] J. K. Ousterhout. Scheduling techniques for concurrent systems. In Proc. Conf. on Dist. Computing Systems, 1982.
[8] G.I. Falin, J.G.C. Templeton. 1997. Retrial Queues, Chapman and Hall, London. ISBN 978-0412785504.
[9] J.R. Artalejo, A. Gomez-Corral. 2008. Retrial Queueing Systems: A Computational Approach, Springer, ISBN: 978-3-540-78724-2.
[10] Beng-Hong Lim and Anant Agarwal. 1993. Waiting algorithms for synchronization in large-scale multiprocessors. ACM Trans. Comput. Syst. 11, 3 (August 1993), 253-294. DOI=10.1145/152864.152869
[11] Anna R. Karlin, Mark S. Manasse, Lyle A. McGeoch, and Susan Owicki. 1990. Competitive randomized algorithms for non-uniform problems. In Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms (SODA '90). Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 301-309.
[12] L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshtein. 1994. Optimal strategies for spinning and blocking. J. Parallel Distrib. Comput. 21, 2 (May 1994), 246-254. DOI=10.1006/jpdc.1994.1056
[13] Ryan Johnson, Manos Athanassoulis, Radu Stoica, and Anastasia Ailamaki. 2009. A new look at the roles of spinning and blocking. In Proceedings of the Fifth International Workshop on Data Management on New Hardware (DaMoN '09). ACM, New York, NY, USA, 21-26. DOI=10.1145/1565694.1565700
[14] B. Sinharoy, et al. 1996. Improving Software MP Efficiency for Shared Memory Systems. Proc. of the 29th Annual Hawaii International Conference on System Sciences.
[15] T. E. Anderson, D. D. Lazowska, and H. M. Levy. 1989. The performance implications of thread management alternatives for shared-memory multiprocessors. SIGMETRICS Perform. Eval. Rev. 17, 1 (April 1989), 49-60. DOI=10.1145/75372.75378
[16] My Oracle Support, Oracle official electronic on-line support service. http://support.oracle.com. 2011.
[17] Lewis J. 2011. Oracle Core: Essential Internals for DBAs and Developers. Apress. ISBN: 978-1430239543
[18] Adams S. 1999. Oracle8i Internal Services for Waits, Latches, Locks, and Memory. O'Reilly Media. ISBN: 978-1565925984
[19] Millsap C., Holt J. 2003. Optimizing Oracle performance. O'Reilly & Associates. ISBN: 978-0596005276.
[20] Richmond Shee, Kirtikumar Deshpande, K. Gopalakrishnan. 2004. Oracle Wait Interface: A Practical Guide to Performance Diagnostics & Tuning. McGraw-Hill Osborne Media. ISBN: 978-0072227291
[21] Bryan M. Cantrill, Michael W. Shapiro, and Adam H. Leventhal. 2004. Dynamic instrumentation of production systems. In Proceedings of the annual conference on USENIX Annual Technical Conference (ATEC '04). USENIX Association, Berkeley, CA, USA, 2-2.
[22] Anna R. Karlin, Kai Li, Mark S. Manasse, and Susan Owicki. 1991. Empirical studies of competitive spinning for a shared-memory multiprocessor. SIGOPS Oper. Syst. Rev. 25, 5 (September 1991), 41-55. DOI=10.1145/121133.286599
[23] L. Kleinrock, Queueing Systems, Theory, Volume I, ISBN 0471491101. Wiley-Interscience, 1975.
[24] Cox D. 1970. Renewal Theory. London: Methuen & Co. pp. 142. ISBN 0-412-20570-X.
[25] Olver F., et al. 2010. NIST Handbook of Mathematical Functions, National Institute of Standards and Technology and Cambridge University Press, ISBN 978-0-521-19225-5, http://dlmf.nist.gov/6.20
[26] Luke Y. 1969. The Special Functions and their Approximations. Vol. 2, Academic Press, New York. ISBN 978-0-124-59902-4
[27] Reiser, M. and S. Lavenberg, "Mean Value Analysis of Closed Multichain Queueing Networks," JACM 27 (1980) pp. 313-322
[28] Artalejo J.R., Resing J.A.C., "Mean Value Analysis of Single Server retrial Queues". APJOR 27(3): 335-345. 2010
[29] Tanel Poder blog. Core IT for Geeks and Pros. http://blog.tanelpoder.com
[30] Andrey Nikolaev blog, Latch, mutex and beyond. http://andreynikolaev.wordpress.com
[31] Nikolaev A. S. 2011. Exploring Oracle RDBMS latches using Solaris DTrace. In Proc. of MEDIAS 2011 Conf. ISBN 978-5-88835-032-4. http://arxiv.org/abs/1111.0594v1

About the author
Andrey Nikolaev is an expert at RDTEX First Line Oracle Support Center, Moscow. His contact email is [email protected].
