Advanced Oracle Troubleshooting Guide: When the wait interface is not enough
If I ever manage to post any more entries, the type and style of content will be pretty much like this one: some Oracle
problem diagnosis and troubleshooting techniques with a touch of OS and hardware in them. Mmm, internals ;-)
Nevertheless, I am also a fan of systematic approaches and methods, so I plan to propose some lesser-known OS and
Oracle techniques for reducing guesswork in advanced troubleshooting even further.
I worked on a project for which I needed to read data through an external table from a Unix pipe (ever wanted to load
compressed flat file contents into Oracle on-the-fly? ;-)
$ mknod /tmp/tmp_pipe p
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options
USERNAME     INSTANCE_NAME    HOST_NAME                 VER        STARTED      SID SERIAL# SPID
------------ ---------------- ------------------------- ---------- -------- ------- ------- -------
TANEL        SOL01            solaris01                 10.2.0.2.0 20070618     470      14 724
Tanel@Sol01> CREATE DIRECTORY dir AS '/tmp';
Directory created.
Tanel@Sol01> CREATE TABLE ext (
    value number
)
ORGANIZATION EXTERNAL (
    TYPE oracle_loader
    DEFAULT DIRECTORY dir
    ACCESS PARAMETERS (
        FIELDS TERMINATED BY ';'
        MISSING FIELD VALUES ARE NULL
        (value)
    )
    LOCATION ('tmp_pipe')
)
;
Table created.
Tanel@Sol01> select * from ext;
So far so good. Unfortunately, this select statement never returned any results. As it turned out later, the gunzip over
a remote ssh link, which should have fed the Unix pipe with flat file data, had got stuck.
Without realizing that, I approached this potential session hang with the first obvious check, a select from
V$SESSION_WAIT:
    SID EVENT                          STATE                     SEQ# SECONDS_IN_WAIT         P1         P2         P3
------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ----------
    470 db file sequential read        WAITED KNOWN TIME          164            7338          1       1892          1
Tanel@Sol01> /
    SID EVENT                          STATE                     SEQ# SECONDS_IN_WAIT         P1         P2         P3
------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ----------
    470 db file sequential read        WAITED KNOWN TIME          164            7353          1       1892          1
Tanel@Sol01> /
    SID EVENT                          STATE                     SEQ# SECONDS_IN_WAIT         P1         P2         P3
------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ----------
    470 db file sequential read        WAITED KNOWN TIME          164            7374          1       1892          1
Tanel@Sol01>
The STATE and SECONDS_IN_WAIT columns in V$SESSION_WAIT say we have been crunching the CPU for the last two
hours, right? (WAITED means NOT waiting on any event; in this case EVENT just shows the last event on
which we waited before getting on CPU.)
Hmm... let's check it out:
$ prstat -p 724
   PID USERNAME  SIZE   RSS STATE  PRI NICE
   724 oracle    621M  533M sleep   59    0
prstat reports that this process is currently in the sleep state, is not using CPU, and has used virtually no CPU during its 2-hour run time!
Let's check with ps (which is actually quite a powerful tool):
$ ps -o user,pid,s,pcpu,time,etime,wchan,comm -p 724
USER       PID S %CPU        TIME     ELAPSED            WCHAN COMMAND
oracle     724 S  0.0       00:01    02:18:08 ffffffff8135cadc oracleSOL01
ps also confirms that process 724 has existed for over 2 hours 18 minutes (ELAPSED), but has only used roughly 1
second of CPU time (TIME). The state column S also indicates the sleeping status.
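To put those two ps columns in proportion, here is a minimal shell sketch that converts the TIME and ELAPSED values to seconds and prints the CPU share. The helper name to_seconds is hypothetical (it is not part of ps), and the sketch assumes the [[dd-]hh:]mm:ss format shown in the ps output above.

```shell
# Hypothetical helper (not part of ps): convert a ps TIME/ELAPSED value
# in [[dd-]hh:]mm:ss format to a number of seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        # Walk the fields right to left: seconds, minutes, hours, days.
        split("1 60 3600 86400", mult, " ")
        total = 0
        for (i = 0; i < NF; i++)
            total += $(NF - i) * mult[i + 1]
        print total
    }'
}

# Values copied from the ps output above.
cpu_seconds=$(to_seconds "00:01")
elapsed_seconds=$(to_seconds "02:18:08")

# Share of wall-clock time actually spent on CPU, as a percentage.
cpu_share=$(awk -v c="$cpu_seconds" -v e="$elapsed_seconds" \
    'BEGIN { printf "%.3f", 100 * c / e }')

echo "CPU: ${cpu_seconds}s of ${elapsed_seconds}s elapsed (${cpu_share}%)"
```

A process that had really been "on CPU" for two hours would show a share close to 100%, not a small fraction of one percent.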
So, either Oracle's V$SESSION_WAIT or the standard Unix tools are lying to us. From the above evidence it is pretty clear that it's
Oracle who's lying (also, in cases like this, lower-level instrumentation always has a better chance to know what's really
going on at the upper level than vice versa).
So, let's use truss (or strace on Linux, tusc on HP-UX) to see if our code is making any system calls or is sleeping within
a system call:
$ truss -p 724
read(14, 0xFFFFFD7FFD6FDE0F, 524273) (sleeping...)
Hmm, as no follow-up is printed on this line, it looks like the process is waiting for a read operation on file descriptor 14
to complete.
Which file is this fd 14 about?
$ pfiles 724
724:    oracleSOL01 (LOCAL=NO)
...snip...
  14: S_IFIFO mode:0644 dev:274,2 ino:4036320452 uid:100 gid:300 size:0
      O_RDONLY|O_LARGEFILE
      /tmp/tmp_pipe
...snip...
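pfiles is a Solaris tool. On Linux, a rough equivalent is to resolve the symlink under /proc/<pid>/fd. This is an assumption added for illustration, since the session above ran on Solaris. The function name fd_target is hypothetical, and reading another user's /proc entries normally requires being that user or root.

```shell
# Hypothetical helper: print what file descriptor $2 of process $1 points to.
# Assumes a Linux-style /proc filesystem; this is not the Solaris pfiles tool.
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Self-contained demonstration: open a scratch file on descriptor 9 of
# this shell, then look that descriptor up the same way one would look
# up descriptor 14 of the stuck Oracle server process.
scratch_file=$(mktemp)
exec 9< "$scratch_file"
resolved=$(fd_target "$$" 9)
echo "fd 9 of PID $$ -> $resolved"

# Clean up: close the descriptor and remove the scratch file.
exec 9<&-
rm -f "$scratch_file"
```

For the session in this post the call would be fd_target 724 14, which on a Linux host should point at the named pipe.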
So from here it's already pretty obvious where the problem is: there is no data coming from tmp_pipe. This led me to
check what my gunzip was doing on the other end of the pipe, and it was stuck, in turn waiting for ssh to feed more data
into it. And ssh had got stuck due to some network transport issue.
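The same kind of hang can be reproduced without Oracle at all. The sketch below is an illustration under stated assumptions, not the original setup: cat stands in for the external table read, a silent sleep stands in for the stuck gunzip, and the pipe lives in a scratch directory rather than /tmp/tmp_pipe.

```shell
# Create a scratch named pipe (mkfifo is the usual equivalent of "mknod ... p").
demo_dir=$(mktemp -d)
mkfifo "$demo_dir/tmp_pipe"

# Stand-in for the stuck gunzip: hold the write end open, send nothing.
sleep 2 > "$demo_dir/tmp_pipe" &
writer_pid=$!

# Stand-in for the external table read: blocks in read() until data or EOF.
cat "$demo_dir/tmp_pipe" > /dev/null &
reader_pid=$!

# One second later the reader should still be alive, asleep in read().
sleep 1
if kill -0 "$reader_pid" 2>/dev/null; then
    reader_state="blocked"
else
    reader_state="finished"
fi
echo "reader is $reader_state"

# When the writer exits, the reader sees end-of-file and returns.
wait "$writer_pid" "$reader_pid"
rm -r "$demo_dir"
```

While the reader is blocked, tools like ps, truss or strace would show it sleeping in a read call, just like the Oracle server process above.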
The bottom line is that you can rely on low-level (OS) tools to identify what's really going on when higher-level tools (like
the Oracle wait interface) provide weird or contradicting information. In this case the Oracle wait interface was not recording
external table read wait events. I reported this to Oracle people and I think it has been filed as a bug by now.
This was only a simple demo identifying a pretty clear case of a session hang, albeit using a pretty intrusive tool
(I would not attach truss to a busy production instance process without thinking twice).
However, there are other options. In the next part of this guide (when I manage to write it) I will deal with more complex
problems, like what to do when the session is not reporting significant waits and is spinning heavily on CPU. Using Oracle
and Unix tools it is quite easy to figure out the execution profile of a spinning server process, even without connecting to
Oracle at all (do I hear pstack, mdb and stack tracing? ;-)
As I've just started blogging, I would appreciate any feedback, including about things like blog layout, font sizes,
readability, understandability etc. Also, I think it will take a few days before I manage to post Part 2 of this
troubleshooting guide.
Thank you for your patience reading through this :-)
17 Comments
1.
Welcome Tanel!
An interesting post and well presented too.
Seems like we'll be getting an education here at your blog.
I guess my only complaint was that, unless I made a mistake, I had to be logged in to WordPress to make a comment,
and that necessitated me creating a WordPress account which I don't really need. Can you configure the blog to
allow people to comment without requiring a WordPress account, or was that an intentional move on your part?
Cheers
Jeff Moss
http://oramossoracle.blogspot.com
Comment by oramoss June 19, 2007 @ 5:11 am
2.
Thanks Jeff!
I changed the setting and am just testing commenting without logging on, now.
If you see this message, then it worked!
:)
Comment by Tanel June 19, 2007 @ 9:30 am
3.
Hi Tanel
Welcome aboard the great blogosphere.
Great post, and after seeing some of your assistance on Oracle-L, I expect we will see some interesting items
here.
Cheers
Peter
Comment by Peter McLarty June 19, 2007 @ 11:29 am
4.
Tanel,
A very welcome addition to the list of Oracle blogs. Keep up the good work.
Cheers,
Doug
Comment by Doug Burns
5.
Welcome to the blogging world Tanel. I for one am looking forward to some great internals insights. I have added your
blog to my Oracle blogs aggregator at http://www.petefinnigan.com/news/blogs
cheers
Pete
Comment by Pete Finnigan
6.
Tanel,
Keep up the good work. Great that you started a blog. Maybe it's an idea to ask Eddie Awad to add this site to the
www.orana.info aggregator so we can keep up ;-)
Marco
Comment by Marco Gralike June 19, 2007 @ 6:56 pm
7.
Welcome Tanel.
I think it would be fine if you added an RSS feed link to your home page.
Nice site, nice post.
Cheers.
Carlos.
Comment by carlosal June 19, 2007 @ 9:26 pm
8.
Thanks Carlos, good idea!
Will check out how to do it.
Comment by tanelp June 19, 2007 @ 9:53 pm
9.
Tanel,
Thanks for the tutorial! To me, the lesson learned is to start at the top (higher-level tools like the Oracle wait interface)
and work your way down (low-level OS tools).
The difficulty is going to be remembering the tools, commands, and interpretation of the output. Stuff like this doesn't
happen to us unless it's a production environment and the end-users are screaming. And we've been unfortunate in
that our predecessors shielded us from these issues.
Regards,
Gus
Comment by Gus Spier June 20, 2007 @ 1:31 am
10.
Thank you for this very educational entry, and welcome to the database blogosphere.
Keep blogging
Coskan
Comment by coskan June 20, 2007 @ 2:39 pm
11.
Welcome Tanel.
Hope you take some time from your busy schedule to write something in your blog frequently.
Jaffar
Comment by Syed Jaffar Hussain June 20, 2007 @ 3:40 pm
12.
[...] A new Oracle blog drops: look for core IT for geeks and pros from Tanel Poder. His first real post is a
troubleshooting guide for a wait interface issue. [...]
Pingback by Log Buffer #50: A Carnival of the Vanities for DBAs Eye on Oracle June 22, 2007 @ 11:42 pm
13.
14.
Hi Hemant,
Looks like you didn't read my post thoroughly ;)
I deliberately included multiple samples from v$session_wait to show that SEQ# was not increasing.
The key point of this post was exactly that v$session_wait.state shows WAITED (sic!), which means NOT WAITING,
which means ready to be on CPU, which should result in the process consuming all the CPU time it can, as there were
no other wait events reported.
However, that was not the case, as I have proved; it is an Oracle instrumentation bug instead.
I normally do not use the WAIT_TIME column in my scripts and demos as it just confuses things. The
V$SESSION_WAIT.STATE column already shows all that is needed (e.g. waiting or not waiting). By the way, STATE
column values are derived from exactly the same column in x$ksusest as WAIT_TIME, so it is physically impossible
to have a state other than WAITING (as in my demo) and WAIT_TIME=0 at the same time.
Comment by tanelp June 23, 2007 @ 11:41 pm
15.
The STATE of WAITED KNOWN TIME may be a bug. It should have shown WAITING. The bug could be that the
state did not change to WAITING although SECONDS_IN_WAIT for the current wait (as indicated by SEQ# being
the same) was being incremented. The bug could be with db file sequential read {why SEQUENTIAL READ?
Shouldn't it have been SCATTERED READ when doing an FTS?} when reading from a PIPE, not a regular file, as an
External Table.
If it should have been a db file scattered read on the SELECT *, then even the event showing as db file sequential
read could be wrong.
I was referring to the fact that we should read both columns WAIT_TIME and SECONDS_IN_WAIT.
From the documentation :
WAIT_TIME NUMBER A nonzero value is the session's last wait time. A zero value means the session is currently
waiting.
SECONDS_IN_WAIT NUMBER If WAIT_TIME = 0, then SECONDS_IN_WAIT is the seconds spent in the current
wait condition. If WAIT_TIME > 0, then SECONDS_IN_WAIT is the seconds since the start of the last wait, and
SECONDS_IN_WAIT - WAIT_TIME / 100 is the active seconds since the last wait ended.
Further, the doc on STATE= WAITED KNOWN TIME says WAIT_TIME = duration of last wait
WAIT_TIME gets posted on completion of the wait.
SECONDS_IN_WAIT, obviously, is continuously incremented.
So I was interested in knowing what WAIT_TIME was showing as well, along with SECONDS_IN_WAIT. That is to
say, we should read WAIT_TIME with SECONDS_IN_WAIT, when in doubt about what V$SESSION_WAIT.STATE
shows us.
Maybe one was for the db file sequential read (that could have occurred _before_ actually starting to do the FTS)
and the other for the db file scattered read that I would have expected.
Like you, I, too, use SECONDS_IN_WAIT in my current_waits reporting script.
Comment by Hemant K Chitale June 24, 2007 @ 5:53 pm
16.
This IS a bug, as I have proven already. Oracle IS WAITING in an OS read syscall and the only correct state in
V$SESSION_WAIT.STATE is WAITING, regardless of values in any other columns. This contradiction alone proves
that Oracle behaves incorrectly.
Regarding what the documentation says, the section about the wait interface in Oracle has always been lacking
concreteness, thus the myths about having to use WAIT_TIME and such. Just the fact that something is
documented doesn't make it a fact ;-)
SECONDS_IN_WAIT is an unfortunate misnaming; it should really be called SECONDS_IN_STATE, as it's just
the time since the last wait state change (e.g. from waiting to not-waiting or vice versa). It's measured and incremented
even if being constantly on CPU without any waiting at all. Every time LGWR runs, it calculates the time delta since
the last wait state change for each session and updates the fixed table under V$SESSION_WAIT.
Normally when you enter a wait, this wait state change timestamp is reset and SEQ# is incremented. What happens
in my test case, however, is that both last SEQ# and SECONDS_IN_WAIT are not changed at all, their values persist
17.
[...] read a great quote from Tanel Poder's blog recently. I think I'll call it "Tanel's dictum": lower-level instrumentation
always has a better chance to know what's really going on at [...]
Pingback by Case Study: Statspack/AWR Latch Waits (Part 2) : Ardent Performance Computing June 30, 2007 @ 12:21 am