Tanel Poder Troubleshooting Complex Oracle Performance Issues Part 1

Troubleshooting the Most Complex Performance Issue I've ever seen
Tanel Poder
http://blog.tanelpoder.com
http://tech.e2sn.com

Intro: About me
* Tanel Põder
  * Oracle Database Performance geek
  * Exadata Performance geek
  * Hadoop Performance geek
* Enkitec
  * Consultant
  * Researcher
  * Technology Evangelist
* Just moved to Dallas
  * After Tallinn -> Stockholm -> London -> [illegible] -> Singapore
* Expert Oracle Exadata book (with Kerry Osborne and Randy Johnson of Enkitec)

Two issues - actually
* For warm-up:
  * cursor: pin S wait events and sporadic CPU spikes
  * Read more from my blog entry: ...-systematic-troubleshooting/
  * Or just google for "cursor pin s"

Environment
* High-concurrency, high-visibility OLTP database
* Oracle 11.1.0.7 single-instance, dedicated server processes
* HP-UX on Itanium
* 32 CPUs, 128 GB RAM
* Thousands of end users
* Multiple WebLogic application servers talking to the database via connection pools

The problem
* Sporadic extreme slowness of Oracle DB and the server
* Slowness lasts for 1..20 minutes at a time
* Queries don't respond or are extremely slow
* Can't even log on to the OS during that time
  * New SSH connections succeeded once the spike was over
* It takes minutes to run simple OS commands during the problem time
* This is a global server-wide problem - everyone complains!

Let's pick and diagnose one occurrence of this problem
* The database response times were extremely bad again around 18:10 and this lasted for about 5 minutes
* If it's the users who report the problem (as opposed to application-side measurements), the user-reported times may differ from the actual problem time

Initial AWR Report

              Snap Id   Snap Time
  Begin Snap:   61921   30-Oct-10 18:00:10
  End Snap:     61922   30-Oct-10 18:20:20
  Elapsed:      20.17 (mins)
  DB Time:     559.31 (mins)

  Event                          Waits       Time(s)  Avg wait (ms)  % DB time  Wait Class
  db file sequential read        2,135,668   21,468   10             64.0       User I/O
  DB CPU                                     5,860                   17.5
  log file sync                  92,720      1,498    16             4.5        Commit
  read by other session          92,676      1,307    14             3.9        User I/O
  SQL*Net message from dblink    525         1,132    2155           ...        ...

  Host CPU (CPUs: 32 Cores: 32); the load average, %User and %System values are not legible in the source

ASH data (shown in OEM)
* Average active sessions showed something different
* Note that this data is from another period of time when a similar spike happened
* In the worst times there were up to 220 active sessions trying to be on CPU!
* Thanks to better granularity we see the spikes instead of some 20-minute or hourly averages
* The problem with ASH samples is that they look at session state from inside Oracle
* Perhaps the starvation is due to some other application / instance on the server?

[Figure: OEM Top Activity chart of average active sessions, 2010/10/31]

How many logons were done?

              Snap Id   Snap Time
  Begin Snap:   61921   30-Oct-10 18:00:10
  End Snap:     61922   30-Oct-10 18:20:20
  Elapsed:      20.17 (mins)
  DB Time:     559.31 (mins)

  Statistic                            Total          per Second
  index fetch by key                   24,174,148     19,971.0
  index scans kdiixs1                  24,565,055     20,293.9
  leaf node 90-10 splits               5,865          4.9
  leaf node splits                     14,529         12.0
  lob reads                            34,480         28.5
  lob writes                           1,623,273      1,341.0
  lob writes unaligned                 1,623,266      1,341.0
  logons cumulative                    2,550          2.1
  messages received                    133,740        110.5
  messages sent                        133,740        110.5
  min active SCN optimization appl     538,358        444.8
  no buffer to keep pinned count       6,331          5.2
  no work - consistent read gets       146,703,542    121,196.1
  opened cursors cumulative            4,168,700      3,443.9

OS level metrics don't lie (well, they do, but less ;-)

  Date        Time      CPU %   Phys IO Rt  Phys KB Rt  Memory %  Pg Out Rate
  10/30/2010  17:45:00  25.06   7249.5      145817.6    74.07     0.0
  10/30/2010  17:50:00  24.77   ...         ...         ...       0.0
  10/30/2010  ...       22.88   ...         ...         74.15     0.0
  10/30/2010  18:10:00  66.89   ...         ...         76.97     0.0
  10/30/2010  18:15:00  ...     7544.9      ...         76.40     0.0
  10/30/2010  18:20:00  25.47   5144.0      ...         75.04     0.0
  10/30/2010  18:25:00  28.38   10239.5     151952.0    74.37     0.0

1) What does the Time 18:10:00 mean: the beginning of the monitoring interval or the end?
2) 66.89% busy during 5 minutes may actually mean 100% busy during ~3 minutes out of 5, but we don't know that for sure without measuring in more detail (better granularity)...

Measuring CPU utilization in more detail
* The spike lasted from 18:11 to 18:14 (3 min)
* Around 90% of the CPU time was spent in kernel mode!!!

[Figure: fine-grained CPU utilization chart showing the 18:11 to 18:14 spike]
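The relationship between the 5-minute average (66.89%) and the 3-minute spike can be checked with simple arithmetic. The sketch below is only an illustration: the function name is made up, and the 25% baseline is an assumption taken from the neighbouring 5-minute samples (25.06, 24.77, 25.47), not a measured value for the two quiet minutes.

```python
def implied_spike_utilization(avg_pct, interval_min, spike_min, baseline_pct):
    """Utilization during the spike that would produce the given interval average.

    Assumes the interval consists of `spike_min` minutes at an unknown level
    and the remaining minutes at `baseline_pct` (an assumption, not a measurement).
    """
    total_busy = avg_pct * interval_min                       # percent-minutes in the interval
    quiet_busy = baseline_pct * (interval_min - spike_min)    # percent-minutes outside the spike
    return (total_busy - quiet_busy) / spike_min


# 66.89% averaged over 5 minutes, with a 3-minute spike and ~25% otherwise
implied = implied_spike_utilization(66.89, 5, 3, 25.0)
print(f"{implied:.1f}%")  # 94.8%
```

Under these assumptions a 66.89% five-minute average corresponds to roughly 95% busy during the three spike minutes, which is why the averaged number hides how bad the spike really was.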
Checkpoint - measured evidence so far
* Fact: We have a 100% CPU utilization spike, lasting 3 minutes
* Fact: 90% of it is spent in KERNEL mode
* Fact: We have over 2500 logons done during the 20-minute period
  * 2.1 logons / second on average (which doesn't sound bad)
* Kernel mode CPU usage is usually caused by system calls

Diagnosing 90% kernel-mode CPU usage spikes...
1. Systematic
   * Break down this 90% of kernel-mode CPU usage
     * Profiling!
     * Oh, this is a production system and the problem is acute & ongoing
     * On Solaris, I'd have used the DTrace stack() action to record the OS kernel stack traces most common on CPU (google for dstackprof)
     * Or lockstat, as it reports spins on spinlocks (which consume kernel CPU)
   * But this was HP-UX and I didn't know the tools needed
     * But I knew what numbers I wanted to see!
   * We sent a request to HP-UX support: "How do we measure & break down where kernel mode CPU is used?"
2. Check for usual suspects
   * Fast, cheap checks to rule out or find known troublemakers

Kernel mode CPU usage spikes - the usual suspects
* Before starting the systematic troubleshooting & drilldown, do quick checks for usual suspects
  * Remember, the client has a business problem, time is of the essence...
1. Logon (or logoff) storms
   * Spawning, initializing new processes, opening files and attaching to SGA means system calls, kernel CPU usage
2. Oracle code getting into some crazy loop (due to a bug)
   * semop(), yield(), read /proc/..., getrusage(), etc. in a loop
3. OS kernel spinlock contention
   * Often due to bugs in the OS or some kernel module

Measuring logon storms
* Use the AUD$ records or the "logons cumulative" number from V$SYSSTAT or AWR, right?
  * Wrong!
* The "logons cumulative" number is incremented by the session itself, after it has logged on; the same applies to audit records!
1. After the listener connection has been established...
2. The process has been started...
3. It has attached to SGA SHM segments...
4. Audit file has been written (if needed)...
5. Process, session SGA structures have been created
   * Memory from OS and shared pool allocated (shared pool latches!)
6. Session has been authenticated
7. Then the logons cumulative is incremented!

Measuring logon storms
* Logon storms should be measured at the listener level
* Process listener.log using a script:

    $ tail listener.log
    30-OCT-2010 23:22:03 * (CONNECT_DATA=..) * establish * E2SNDB * 0
    30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0
    30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0
    30-OCT-2010 23:22:09 * service_update * E2SNDB * 0

    $ fgrep "30-OCT-2010" listener.log | fgrep "establish" | \
      awk '{ print $1 " " $2 }' | awk ...

[...]

   * [...] directory entry scan) became very slow, and it's done in kernel mode
   * A spinlock was held during the directory entry scan
   * Other new Oracle processes also wanted to do the directory scan, resulting in spinlock contention and further kernel-mode CPU usage
4. When the DB got slow, app servers fired up hundreds of new connections to "make things faster"
   * This all fed back to the problem: even more contention & spinning

Limiting logon storms
* Use the Oracle Listener connection rate limiter (11gR1+)

    listener.ora:

    LISTENER=
      (ADDRESS_LIST=
        (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521)(RATE_LIMIT=5))
        (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1522)(RATE_LIMIT=...))
        (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1523))
      )

* Oracle Documentation: Oracle Net Listener Parameters (listener.ora)
  http://download.oracle.com/docs/cd/B28359_01/network.111/b28317/listener.htm
* Also, it is possible to limit the logoff storm rate
  * _logout_storm_rate parameter (instance-wide)

Troubleshooting sporadic system performance issues
* Right Data!!!
  * Right scope - if your problem lasts for seconds, this should be the granularity of your data too
  * OS level data, in addition to the database metrics
  * Ideally OS level metrics sampled multiple times per minute

Conclusions
* Logon storms are evil!
  * They will amplify any performance hiccups as they cause extra load just when the resources are scarcest
* Connection pools firing up hundreds of new connections are evil!
  * Know your limits (both max connections and max connect rate / sec)
* Here's a thought:
  * If you have planned the servers' capacity to support N-thousand connections anyway (by allowing connection pools to grow that high), why not create this amount of connections right away?
  * This would avoid logon storms during the worst times as all connections have already been created!
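Knowing your connect rate means counting connections where they arrive, at the listener. The sketch below is one possible way to do that in Python and is not the script from the slides (that command is truncated in the source). It assumes the simplified " * "-separated log line layout shown earlier; the function name and the sample lines are illustrative only.

```python
from collections import Counter


def connects_per_second(lines):
    """Count listener 'establish' events per timestamp (one-second granularity).

    Assumes each log line is a ' * '-separated record whose first field is
    the timestamp, as in the listener.log excerpt shown earlier.
    """
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.split(" * ")]
        if "establish" in fields:
            counts[fields[0]] += 1
    return counts


# Sample lines modelled on the excerpt from the slides (illustrative data)
sample = [
    "30-OCT-2010 23:22:03 * (CONNECT_DATA=..) * establish * E2SNDB * 0",
    "30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0",
    "30-OCT-2010 23:22:08 * (CONNECT_DATA=..) * establish * E2SNDB * 0",
    "30-OCT-2010 23:22:09 * service_update * E2SNDB * 0",
]

per_second = connects_per_second(sample)
peak_time, peak_count = max(per_second.items(), key=lambda item: item[1])
print(peak_time, peak_count)  # 30-OCT-2010 23:22:08 2
```

Against a real log you would pass an open file object instead of the sample list. The peak per-second count, rather than the 20-minute average, is the number to compare against the connect rate the server can sustain.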
