
Tanel Poder Troubleshooting Complex Oracle Performance Issues Part 1

Troubleshooting the Most Complex Performance Issue I've ever seen

Tanel Poder
http://blog.tanelpoder.com
http://tech.e2sn.com
Enkitec

Intro: About me

* Tanel Põder
  * Oracle Database Performance geek
  * Exadata Performance geek
  * Hadoop Performance geek
* Enkitec
  * Consultant
  * Researcher
  * Technology Evangelist
* Just moved to Dallas
  * After Tallinn -> Stockholm -> London -> Singapore
* Expert Oracle Exadata book (with Kerry Osborne and Randy Johnson of Enkitec)

Two issues - actually

* For warm-up:
  * cursor: pin S wait events and sporadic CPU spikes
  * Read more from my blog entry: …spikes-and-systematic-troubleshooting/
  * Or just google for "cursor pin s"

Environment

* High-concurrency, high-visibility OLTP database
* Oracle 11.1.0.7 single-instance, dedicated server processes
* HP-UX on Itanium
* 32 CPUs, 128 GB RAM
* Thousands of end users
* Multiple WebLogic application servers talking to the database via connection pools

The problem

* Sporadic extreme slowness of the Oracle DB and the server
* Slowness lasts for 1 to 20 minutes at a time
* Queries don't answer or are extremely slow
* Can't even log on to the OS during that time
  * New SSH connections succeeded once the spike was over
  * It takes minutes to run simple OS commands during the problem time
* This is a global, server-wide problem: everyone complains!

Let's pick and diagnose one occurrence of this problem

* The database response times got extremely bad again around 18:10 and this lasted for about 5 minutes
  * If it's the users who report the problem (as opposed to application-side measurements), there may be some discrepancies between the user-reported times and the actual problem time

Initial AWR Report

              Snap Id  Snap Time
Begin Snap:     61921  30-Oct-10 18:00:10
End Snap:       61922  30-Oct-10 18:20:20
Elapsed:        20.17 (mins)
DB Time:       559.31 (mins)

Event                         Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read       2,135,668  21,468   10             64.0       User I/O
DB CPU                                   5,860                   17.5
log file sync                 92,720     1,498    16             4.5        Commit
read by other session         92,676     1,307    14             3.9        User I/O
SQL*Net message from dblink   525        1,132    2155

Host CPU (CPUs: 32 Cores: 32)

ASH data (shown in OEM)

* Average active sessions showed something different
  * Note that this data is from another period of time when a similar spike happened
* In the worst times there were up to 220 active sessions trying to be on CPU!
* Thanks to better granularity we see the spikes instead of some 20-minute or hourly averages
* The problem with ASH samples is that they look at session state from inside Oracle
  * Perhaps the starvation is due to some other application / instance on the server?

[Figure: OEM Top Activity chart]

How many logons were done?
              Snap Id  Snap Time
Begin Snap:     61921  30-Oct-10 18:00:10
End Snap:       61922  30-Oct-10 18:20:20
Elapsed:        20.17 (mins)
DB Time:       559.31 (mins)

Statistic                            Total        per Second
index fetch by key                   24,174,148   19,971.0
index scans kdiixs1                  24,565,055   20,293.9
leaf node 90-10 splits               5,865        4.8
leaf node splits                     14,529       12.0
lob reads                            34,480       28.5
lob writes                           1,623,273    1,341.0
lob writes unaligned                 1,623,266    1,341.0
logons cumulative                    2,550        2.1
messages received                    133,740      110.5
messages sent                        133,740      110.5
min active SCN optimization appl     538,358      444.8
no buffer to keep pinned count       6,331        5.2
no work - consistent read gets       146,703,542  121,196.1
opened cursors cumulative            4,168,700    3,443.9

OS level metrics don't lie (well, they do, but less ;-)

Date        Time      CPU %   Phys IO Rt  Phys KB Rt  Memory %  Pg Out Rate
10/30/2010  17:45:00  25.06   7249.5      145817.6    74.07     0.0
10/30/2010  17:50:00  24.77   5334.?      60928.0     ?         0.0
10/30/2010  17:55:00  ?       ?           196368.0    ?         0.0
10/30/2010  18:00:00  ?       ?           ?           ?         0.0
10/30/2010  18:05:00  22.88   ?           ?           74.15     0.0
10/30/2010  18:10:00  66.89   ?           ?           76.97     0.0
10/30/2010  18:15:00  ?       7544.9      ?           76.40     0.0
10/30/2010  18:20:00  25.47   5144.0      ?           75.04     0.0
10/30/2010  18:25:00  28.38   10239.5     151952.0    74.37     0.0
(? = value not legible in the source)

1) What does the Time 18:10:00 mean: the beginning of the monitoring interval or the end?
2) 66.89% busy during 5 minutes may actually mean 100% busy during ~3 minutes out of 5, but we don't know that for sure without measuring in more detail (better granularity)

Measuring CPU utilization in more detail

* The spike lasted from 18:11 to 18:14 (3 min)
* Around 90% in Kernel mode!!!
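The averaging effect noted for the 5-minute OS samples can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the spike ran at a flat 100% and that the rest of the interval sat at a baseline of about 25% busy (roughly what the neighbouring 5-minute samples show). Neither assumption is a measured value, and the function names are made up for this example.

```python
def interval_average(spike_minutes, interval_minutes=5.0,
                     spike_busy=100.0, baseline_busy=25.0):
    """Average CPU % over an interval that contains one flat spike.

    baseline_busy=25.0 is an assumption based on the neighbouring
    5-minute samples (about 22-28% busy); it is not a measured value.
    """
    quiet_minutes = interval_minutes - spike_minutes
    total = spike_minutes * spike_busy + quiet_minutes * baseline_busy
    return total / interval_minutes


def spike_minutes_for(average_busy, interval_minutes=5.0,
                      spike_busy=100.0, baseline_busy=25.0):
    """Invert interval_average: spike length that explains an average."""
    return (interval_minutes * (average_busy - baseline_busy)
            / (spike_busy - baseline_busy))


if __name__ == "__main__":
    # 3 minutes at 100% plus 2 minutes at the assumed 25% baseline
    print(round(interval_average(3), 2))
    # spike length implied by the reported 66.89% five-minute average
    print(round(spike_minutes_for(66.89), 2))
```

Under these assumptions a 3-minute spike yields a 70% five-minute average, and the reported 66.89% corresponds to a spike of a little under 3 minutes. That is consistent with the 18:11 to 18:14 spike, but only the finer-grained measurement can confirm it.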
[Figure: CPU utilization chart]

Checkpoint - measured evidence so far

* Fact: We have a 100% CPU utilization spike, lasting 3 minutes
* Fact: 90% of it is spent in KERNEL mode
* Fact: We have over 2500 logons done during a 20-minute period
  * 2.1 logons / second on average (which doesn't sound bad)
* Kernel mode CPU usage is usually caused by system calls

Diagnosing 90% kernel-mode CPU usage spikes...

1. Systematic
   * Break down this 90% of kernel-mode CPU usage
   * Profiling!
     * Oh, this is a production system and the problem is acute & ongoing
     * On Solaris, I'd have used the DTrace stack() probe to record the OS kernel stack traces most common on CPU (google for dstackprof)
     * Or lockstat, as it reports spins on spinlocks (which consume kernel CPU)
   * But this was HP-UX and I didn't know the tools needed
     * But I knew what numbers I wanted to see!
   * We sent a request to HP-UX support: "How do we measure & break down where kernel-mode CPU is used?"
2. Check for usual suspects
   * Fast, cheap checks to rule out or find known troublemakers

Kernel mode CPU usage spikes - the usual suspects

Before starting the systematic troubleshooting & drilldown, do quick checks for usual suspects
* Remember, the client has a business problem, time is of the essence...

1. Logon (or logoff) storms
   * Spawning, initializing new processes, opening files and attaching to SGA means system calls, kernel CPU usage
2. Oracle code getting into some crazy loop (due to a bug)
   * semop(), yield(), read /proc/..., getrusage(), etc. loop
3. OS kernel spinlock contention
   * Often due to bugs in the OS or some kernel module

Measuring logon storms

* Use the AUD$ records or the "logons cumulative" number from V$SYSSTAT or AWR, right?
  * Wrong!
* The logons cumulative number is incremented by the session itself, after it has logged on; the same applies to audit records!

1. After the listener connection has been established...
2. The process has been started...
3. It has attached to SGA SHM segments...
4. Audit file has been written (if needed)...
5. Process, session SGA structures have been created
   * Memory from OS and shared pool allocated (shared pool latches!)
6. Session has been authenticated
7. Then the logons cumulative is incremented!

Measuring logon storms

* Logon storms should be measured at the listener level
* Process listener.log using a script:

    $ tail listener.log
    30-OCT-2010 23:22:03 * (CONNECT_DATA=..) * establish * E2SNDB * 0
    30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0
    30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0
    30-OCT-2010 23:22:09 * service_update * E2SNDB * 0

    $ fgrep "30-OCT-2010" listener.log | fgrep "establish" | \
        awk '{ print $1 " " $2 }' | awk …

… directory entry scan) became very slow, and it's done in kernel mode
   * A spinlock was held during the directory entry scan
   * Other new Oracle processes also wanted to do the directory scan, resulting in spinlock contention and further kernel-mode CPU usage
4. When the DB got slow, app servers fired up hundreds of new connections to "make things faster"
   * This all fed back to the problem: even more contention & spinning

Limiting logon storms

Use the Oracle Listener connection rate limiter (11gR1+)

listener.ora:

    LISTENER=
      (ADDRESS_LIST=
        (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5))
        (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=…))
        (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523))
      )

Oracle Documentation: Oracle Net Listener Parameters (listener.ora)
http://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm

Also, it is possible to limit the logoff storm rate
* _logout_storm_rate parameter (instance-wide)

Troubleshooting sporadic system performance issues

* Right Data !!!
  * Right scope - if your problem lasts for seconds, this should be the granularity of your data too
  * OS level data, in addition to the database metrics
  * Ideally OS level metrics sampled multiple times per minute

Conclusions

* Logon storms are evil!
  * They will amplify any performance hiccups as they cause extra load just when the resources are scarcest
* Connection pools firing up hundreds of new connections are evil!
  * Know your limits (both max connections and max connect rate / sec)
* Here's a thought:
  * If you have planned the servers' capacity to support N-thousand connections anyway (by allowing connection pools to grow that high), why not create this number of connections right away?
  * This would avoid logon storms during the worst times as all connections have already been created!
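The listener-level measurement and the "know your max connect rate / sec" advice can be sketched in a few lines of Python. This is a minimal sketch, not the script used in the talk: it assumes listener.log entries of the shape shown in the sample (a timestamp followed by ` * `-separated fields, with `establish` as the event), and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def connects_per_second(lines):
    """Count 'establish' events per one-second timestamp.

    Assumes entries such as
    '30-OCT-2010 23:22:03 * (CONNECT_DATA=..) * establish * E2SNDB * 0'.
    Entries without an 'establish' field (e.g. service_update) are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.split(" * ")]
        if "establish" not in fields:
            continue
        try:
            stamp = datetime.strptime(fields[0], "%d-%b-%Y %H:%M:%S")
        except ValueError:
            continue  # not a timestamped log entry
        counts[stamp] += 1
    return counts


def peak_and_average(counts):
    """Return (peak connects in one second, average per elapsed second)."""
    if not counts:
        return 0, 0.0
    span_seconds = (max(counts) - min(counts)).total_seconds() + 1
    return max(counts.values()), sum(counts.values()) / span_seconds


if __name__ == "__main__":
    sample = [
        "30-OCT-2010 23:22:03 * (CONNECT_DATA=..) * establish * E2SNDB * 0",
        "30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0",
        "30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0",
        "30-OCT-2010 23:22:09 * service_update * E2SNDB * 0",
    ]
    peak, average = peak_and_average(connects_per_second(sample))
    print(peak, average)
```

On the four sample lines this counts three connections, with a peak of 2 in one second and an average of 0.5 per second over the 6-second span. Comparing the two numbers is the point: an average such as 2.1 logons per second over 20 minutes says nothing about how many connections arrived in the worst second.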
