Troubleshooting the Most Complex Performance
Issue I’ve ever seen
Tanel Poder
http://blog.tanelpoder.com
http://tech.e2sn.com
Intro: About me
* Tanel Põder
* Oracle Database Performance geek
* Exadata Performance geek
* Hadoop Performance geek
* Just moved to Dallas
* After Tallinn -> Stockholm -> London -> ... -> Singapore
* Expert Oracle Exadata book (with Kerry Osborne and Randy Johnson of Enkitec)
* Enkitec
* Consultant
* Researcher
* Technology Evangelist
Two issues - actually
* For warm-up:
* cursor: pin S wait events and sporadic CPU spikes
* Read more from my blog entry:
...spikes-and-systematic-troubleshooting/
* Or just google for “cursor pin s”
Environment
* High-concurrency, high-visibility OLTP database
* Oracle 11.1.0.7 single-instance, dedicated server processes
* HP-UX on Itanium
* 32 CPUs, 128 GB RAM
* Thousands of end users
* Multiple WebLogic application servers talking to database via
connection pools
The problem
* Sporadic extreme slowness of Oracle DB and the server
* Slowness lasts for 1.. 20 minutes at a time...
* Queries don’t respond or are extremely slow
* Can’t even log on to the OS during that time
  * New SSH connections succeeded once the spike was over
* It takes minutes to run simple OS commands during the problem time
* This is a global server-wide problem — everyone complains!
Let’s pick and diagnose one occurrence of this problem
* The database response times were extremely bad again around
18:10, and this lasted for about 5 minutes...
* If it’s the users who report the problem (as opposed to application-side
measurements), then there may be some discrepancies between the
user-reported times and the actual problem time
Initial AWR Report
              Snap Id  Snap Time
Begin Snap:     61921  30-Oct-10 18:00:10
End Snap:       61922  30-Oct-10 18:20:20
Elapsed:        20.17 (mins)
DB Time:       559.31 (mins)
Event                            Waits  Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read      2,135,668   21,468             10       64.0  User I/O
DB CPU                                    5,860                      17.5
log file sync                   92,720    1,498             16        4.5  Commit
read by other session           92,676    1,307             14        3.9  User I/O
SQL*Net message from dblink        525      ...            ...        ...  ...
Host CPU (CPUs: 32 Cores: 32 Sockets: 32)
ASH data (shown in OEM)
* Average active sessions showed something different
* Note that this data is from another period of time when a similar spike happened
* In worst times there were up to 220 active sessions trying to be on CPU!
* Thanks to better granularity we see the spikes instead of some 20-minute or hourly
averages...
* The problem with ASH samples is that they look at session state from inside Oracle
* Perhaps the starvation is due to some other application / instance on the server?
[Figure: OEM Top Activity chart of active sessions over time, 2010/10/31]
How many logons were done?
              Snap Id  Snap Time
Begin Snap:     61921  30-Oct-10 18:00:10
End Snap:       61922  30-Oct-10 18:20:20
Elapsed:        20.17 (mins)
DB Time:       559.31 (mins)
Statistic                                 Total  per Second
index fetch by key                   24,174,148    19,971.0
index scans kdiixs1                  24,565,055    20,293.9
leaf node 90-10 splits                    5,865         4.8
leaf node splits                         14,529        12.0
lob reads                                34,480        28.5
lob writes                            1,623,273     1,341.0
lob writes unaligned                  1,623,266     1,341.0
logons cumulative                         2,550         2.1
messages received                       133,740       110.5
messages sent                           133,740       110.5
min active SCN optimization appl        538,358       444.8
no buffer to keep pinned count            6,331         5.2
no work - consistent read gets      146,703,542   121,196.1
opened cursors cumulative             4,168,700     3,443.9
OS level metrics don’t lie (well, they do, but less ;-)
Date       | Time     | CPU % | Phys IO Rt | Phys KB Rt | Memory % | Pg Out Rate
10/30/2010 | 17:45:00 | 25.06 |     7249.5 |   145817.6 |    74.07 |         0.0
10/30/2010 | 17:50:00 | 24.77 |        ... |    60928.0 |      ... |         0.0
10/30/2010 | 17:55:00 |   ... |        ... |   196368.0 |      ... |         0.0
10/30/2010 | 18:00:00 |   ... |        ... |        ... |      ... |         0.0
10/30/2010 | 18:05:00 | 22.88 |        ... |        ... |    74.15 |         0.0
10/30/2010 | 18:10:00 | 66.89 |        ... |        ... |    76.97 |         0.0
10/30/2010 | 18:15:00 |   ... |     7544.9 |        ... |    76.40 |         0.0
10/30/2010 | 18:20:00 | 25.47 |     5144.0 |        ... |    75.04 |         0.0
10/30/2010 | 18:25:00 | 28.38 |    10239.5 |   151952.0 |    74.37 |         0.0
1) What does the Time 18:10:00 mean, beginning of the
monitoring interval or end?
2) 66.89% busy during 5 minutes may actually mean 100% busy
during ~3 minutes out of 5, but we don’t know that for sure
without measuring in more detail (better granularity)...
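The averaging effect in point 2 can be checked with a little arithmetic. The following is a minimal sketch, assuming the CPU sat at a flat baseline for part of the 5-minute interval and at 100% for the rest. The 25% baseline is an assumption taken from the neighbouring samples (22.88% to 28.38%), and `spike_minutes` is a hypothetical helper name.

```python
def spike_minutes(avg_pct, baseline_pct, interval_min=5.0, spike_pct=100.0):
    """Minutes spent at spike_pct that would produce avg_pct over the
    interval, assuming the rest of the interval sits at baseline_pct."""
    if not baseline_pct <= avg_pct <= spike_pct:
        raise ValueError("average must lie between baseline and spike level")
    return interval_min * (avg_pct - baseline_pct) / (spike_pct - baseline_pct)

# 66.89% is the 18:10:00 sample; 25% is an assumed baseline.
minutes = spike_minutes(66.89, 25.0)
print(f"{minutes:.1f} minutes at 100% CPU")  # prints "2.8 minutes at 100% CPU"
```

Under these assumptions the 66.89% average corresponds to roughly 2.8 minutes of full saturation, in line with the "~3 minutes out of 5" estimate. Only finer-grained sampling can confirm the real shape of the spike.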
Measuring CPU utilization in more detail
* The spike lasted from 18:11 to 18:14 (3 min)
* Around 90% in Kernel mode!!!
[Figure: CPU utilization chart at finer granularity]
Checkpoint - measured evidence so far
* Fact: We have a 100% CPU utilization spike, lasting 3 minutes
* Fact: 90% of it is spent in KERNEL mode
* Fact: We have over 2,500 logons during the 20-minute
period
* 2.1 logons / second on average (which doesn’t sound bad)
* Kernel mode CPU usage is usually caused by system calls
Diagnosing 90% kernel-mode CPU usage spikes...
1. Systematic
* Break down this 90% of Kernel mode CPU usage
* Profiling!
  * Oh, this is a production system and the problem is acute & ongoing
  * On Solaris, I’d have used DTrace stack() probe to record OS kernel
    stack traces most common on CPU (google for dstackprof)
  * Or lockstat, as it reports spins on spinlocks (which consume kernel CPU)
* But this was HP-UX and I didn’t know the tools needed
* But I knew what numbers I wanted to see!
* We sent a request to HP-UX support:
“How do we measure & break down where is kernel mode CPU used?”
2. Check for usual suspects
* Fast, cheap checks to rule out or find known troublemakers
Kernel mode CPU usage spikes - the usual suspects
Before starting the systematic troubleshooting & drilldown,
do quick checks for usual suspects
* Remember, the client has a business problem, time is of the essence...
1. Logon (or logoff) storms
* Spawning, initializing new processes, opening files and attaching to
SGA means system calls, kernel CPU usage
2. Oracle code getting into some crazy loop (due to a bug)
* semop(), yield(), read /proc/..., getrusage(), etc loop
3. OS kernel spinlock contention
* Often due to bugs in OS or some kernel module
Measuring logon storms
* Use the AUD$ records or “logons cumulative” number from
V$SYSSTAT or AWR, right?
  * Wrong!
* The logons cumulative number is incremented by the session itself,
after it has logged on (the same applies to audit records!):
1. After the listener connection has been established...
2. The process has been started...
3. It has attached to SGA SHM segments...
4. Audit file has been written (if needed)...
5. Process, session SGA structures have been created
* Memory from OS and shared pool allocated (shared pool latches!)
6. Session has been authenticated
7. Then the logons cumulative is incremented!
Measuring logon storms
* Logon storms should be measured at the listener level
* Process listener.log using a script:
$ tail listener.log
30-OCT-2010 23:22:03 * (CONNECT_DATA=..) * establish * E2SNDB * 0
30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0
30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0
30-OCT-2010 23:22:09 * service_update * E2SNDB * 0
$ fgrep "30-OCT-2010" listener.log | fgrep "establish" | \
awk '{ print $1 " " $2 }' | awk ...
* ... directory entry scan) became very slow, and it’s done in kernel mode
* A spinlock was held during the directory entry scan
* Other new Oracle processes also wanted to do the directory scan,
resulting in spinlock contention and further Kernel mode CPU usage
4. When the DB got slow, app servers fired up hundreds of
new connections to “make things faster”
* This all fed back into the problem: even more contention & spinning
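The listener-level measurement described earlier can also be sketched in Python. This is a rough illustration, not the script from the talk: it assumes listener.log lines start with a `DD-MON-YYYY HH:MI:SS` timestamp and contain ` * establish * ` for new connections, as in the sample shown above. The names `connections_per_second` and `busiest_seconds` are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# 30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0
ESTABLISH = re.compile(
    r"^(\d{2}-[A-Z]{3}-\d{4}) (\d{2}:\d{2}:\d{2}) \* .* \* establish \* "
)

def connections_per_second(lines):
    """Count 'establish' records per one-second timestamp."""
    counts = Counter()
    for line in lines:
        match = ESTABLISH.match(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts

def busiest_seconds(counts, top=5):
    """Return the seconds with the most new connections, busiest first."""
    return counts.most_common(top)

# Usage (assumed file name):
#   with open("listener.log") as log:
#       for second, n in busiest_seconds(connections_per_second(log)):
#           print(second, n)
```

Counting per second rather than per 20-minute snapshot is what makes a burst visible. An average of 2.1 logons per second can hide a few seconds with hundreds of connection attempts.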
Limiting logon storms
Use Oracle Listener connection rate limiter (11gR1+)
listener.ora:
LISTENER=
 (ADDRESS_LIST=
  (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5))
  (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=...))
  (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523))
 )
Oracle Documentation: Oracle Net Listener Parameters (listener.ora)
http://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm
Also, it is possible to limit logoff storm rate
_logout_storm_rate parameter (instance-wide)
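To get a feel for what a connect rate limit does to a burst, here is a deliberately simplified model. It assumes that at most `rate_limit` queued connection requests are accepted per second and the rest simply wait. This is an illustration only, not the listener's actual algorithm; `drain_burst` and the burst size in the usage note are hypothetical.

```python
def drain_burst(arrivals_per_sec, rate_limit):
    """Return the number of connections accepted in each second.

    arrivals_per_sec: connection attempts arriving in each second.
    rate_limit: maximum connections accepted per second (must be > 0).
    """
    if rate_limit <= 0:
        raise ValueError("rate_limit must be positive")
    backlog = 0
    accepted = []
    for arrived in arrivals_per_sec:
        backlog += arrived
        served = min(backlog, rate_limit)
        accepted.append(served)
        backlog -= served
    # Keep draining the remaining backlog after arrivals stop.
    while backlog > 0:
        served = min(backlog, rate_limit)
        accepted.append(served)
        backlog -= served
    return accepted
```

In this model a hypothetical burst of 100 attempts in one second, `drain_burst([100], 5)`, is spread over 20 seconds at 5 new processes per second. The server never has to spawn 100 processes at once.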
Troubleshooting sporadic system performance issues
Right Data !!!
* Right scope — if your problem lasts for seconds, this should be
the granularity of your data too
* OS level data, in addition to the database metrics
* Ideally OS level metrics sampled multiple times per minute
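Sampling several times per minute needs nothing elaborate. The sketch below is a generic sampler, assuming you supply your own `read_metric` function. The commented usage reads the load average via `os.getloadavg`, which is only a stand-in: it does not give the user/kernel CPU split, for which a platform tool is still needed.

```python
import time

def sample(read_metric, interval_sec=10.0, count=6, sleep=time.sleep):
    """Call read_metric() count times, interval_sec apart.

    Returns a list of (unix_timestamp, value) pairs.
    """
    samples = []
    for i in range(count):
        samples.append((time.time(), read_metric()))
        if i < count - 1:  # no need to wait after the last sample
            sleep(interval_sec)
    return samples

# Example (Unix only): six load-average samples, 10 seconds apart.
#   import os
#   for ts, load in sample(os.getloadavg):
#       print(time.strftime("%H:%M:%S", time.localtime(ts)), load)
```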
Conclusions
* Logon storms are evil!
* They will amplify any performance hiccups as they cause extra load
just when the resources are scarcest
* Connection pools firing up hundreds of new connections are
evil!
  * Know your limits (both max connections and max connect rate / sec)
* Here’s a thought:
  * If you have planned the servers’ capacity to support N-thousand
    connections anyway (by allowing connection pools to grow that high), why
    not create this number of connections right away?
  * This would avoid logon storms during the worst times, as all connections
    have already been created!