Java EE Support Patterns: Thread Dump

9.28.2012

Log4j Thread Deadlock – A Case Study

This case study describes the complete root cause analysis and resolution of an Apache Log4j thread race problem affecting a Weblogic Portal 10.0 production environment. It will also demonstrate the importance of proper Java classloader knowledge when developing and supporting Java EE applications.

This article is also another opportunity for you to improve your thread dump analysis skills and understand thread race conditions.

Environment specifications

  • Java EE server: Oracle Weblogic Portal 10.0
  • OS: Solaris 10
  • JDK: Oracle/Sun HotSpot JVM 1.5
  • Logging API: Apache Log4j 1.2.15
  • RDBMS: Oracle 10g
  • Platform type: Web Portal

Troubleshooting tools

  • Quest Foglight for Java (monitoring and alerting)
  • Java VM Thread Dump (thread race analysis)

Problem overview

Major performance degradation was observed from one of our Weblogic Portal production environments. Alerts were also sent from the Foglight agents indicating a significant surge in Weblogic threads utilization up to the upper default limit of 400.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

  • What is the client impact? HIGH
  • Recent change of the affected platform? Yes, a recent deployment was performed involving minor content changes and some Java library changes & refactoring
  • Any recent traffic increase to the affected platform? No
  • How long has this problem been observed? It is a new problem observed following the deployment
  • Did a restart of the Weblogic server resolve the problem? No, any restart attempt resulted in an immediate surge of threads
  • Did a rollback of the deployment changes resolve the problem? Yes

Conclusion #1: The problem appears to be related to the recent changes. However, the team was initially unable to pinpoint the root cause. This is what we will discuss for the rest of the article.

Weblogic hogging thread report

The initial thread surge problem was reported by Foglight. As you can see below, the threads utilization was significant (up to 400) leading to a high volume of pending client requests and ultimately major performance degradation.

[Figure: Foglight monitoring view showing Weblogic threads utilization surging to the 400 limit]

As usual, thread problems require proper thread dump analysis in order to pinpoint the source of thread contention. Lacking this critical analysis skill will prevent you from going any further in the root cause analysis.

For our case study, a few thread dump snapshots were generated from our Weblogic servers using the simple Solaris OS command kill -3 <Java PID>. Thread Dump data was then extracted from the Weblogic standard output log files.
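
As a side note (not part of the original troubleshooting steps), a similar snapshot can also be captured programmatically from within the JVM when sending an OS-level signal is not practical. The minimal sketch below relies only on the standard Thread.getAllStackTraces() API available since Java 5; the class name is illustrative.

## SimpleThreadDumper (illustrative sketch)
import java.util.Map;

public class SimpleThreadDumper {

       // Prints a rough thread dump of the current JVM to standard output,
       // similar in spirit to the output produced by kill -3 <Java PID>
       public static void dumpAllThreads() {
              Map<Thread, StackTraceElement[]> stacks = Thread.getAllStackTraces();
              for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
                     Thread thread = entry.getKey();
                     System.out.println("\"" + thread.getName() + "\" daemon=" + thread.isDaemon()
                                  + " state=" + thread.getState());
                     for (StackTraceElement element : entry.getValue()) {
                            System.out.println("\tat " + element);
                     }
                     System.out.println();
              }
       }

       public static void main(String[] args) {
              dumpAllThreads();
       }
}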

Thread Dump analysis

The first step of the analysis was to perform a fast scan of all stuck threads and pinpoint a problem “pattern”. We found 250 threads stuck in the following execution path:

"[ACTIVE] ExecuteThread: '20' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x03c4fc38 nid=0xe6 waiting for monitor entry [0x3f99e000..0x3f99f970]
       at org.apache.log4j.Category.callAppenders(Category.java:186)
       - waiting to lock <0x8b3c4c68> (a org.apache.log4j.spi.RootCategory)
       at org.apache.log4j.Category.forcedLog(Category.java:372)
       at org.apache.log4j.Category.log(Category.java:864)
       at org.apache.commons.logging.impl.Log4JLogger.debug(Log4JLogger.java:110)
       at org.apache.beehive.netui.util.logging.Logger.debug(Logger.java:119)
       at org.apache.beehive.netui.pageflow.DefaultPageFlowEventReporter.beginPageRequest(DefaultPageFlowEventReporter.java:164)
       at com.bea.wlw.netui.pageflow.internal.WeblogicPageFlowEventReporter.beginPageRequest(WeblogicPageFlowEventReporter.java:248)
       at org.apache.beehive.netui.pageflow.PageFlowPageFilter.doFilter(PageFlowPageFilter.java:154)
       at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
       at com.bea.p13n.servlets.PortalServletFilter.doFilter(PortalServletFilter.java:336)
       at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
       at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
       at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
       at <App>.AppRedirectFilter.doFilter(RedirectFilter.java:83)
       at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
       at <App>.AppServletFilter.doFilter(PortalServletFilter.java:336)
       at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
       at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3393)
       at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
       at weblogic.security.service.SecurityManager.runAs(Unknown Source)
       at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2140)
       at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2046)
       at weblogic.servlet.internal.ServletRequestImpl.run(Unknown Source)
       at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200)
       at weblogic.work.ExecuteThread.run(ExecuteThread.java:172)


As you can see, it appears that all the threads are waiting to acquire a lock on an Apache Log4j object monitor (org.apache.log4j.spi.RootCategory) when attempting to log debug information to the configured appender and log file. How did we figure that out from this thread stack trace? Let’s dissect this thread stack trace so you can better understand this thread race condition, i.e. 250 threads attempting to acquire the same object monitor concurrently.
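
To better visualize this pattern outside of the production code, find below a simplified and hypothetical reproduction (not the actual application or Log4j code): many threads funneling their work through a single shared object monitor. A thread dump captured while this program runs will show most threads "waiting to lock" the same monitor, exactly like the signature above.

## MonitorContentionDemo (hypothetical reproduction)
public class MonitorContentionDemo {

       private static final Object SHARED_MONITOR = new Object();

       // All threads compete for the same lock, similar to Category.callAppenders() under load
       private static void log(String message) {
              synchronized (SHARED_MONITOR) {
                     try {
                            Thread.sleep(5); // simulate slow appender I/O while holding the lock
                     } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                     }
              }
       }

       public static void main(String[] args) {
              for (int i = 0; i < 250; i++) {
                     new Thread(new Runnable() {
                            public void run() {
                                   while (true) {
                                          log("debug event");
                                   }
                            }
                     }, "Worker-" + i).start();
              }
       }
}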


At this point the main question is: why are we seeing this problem all of a sudden? An increase of the logging level or of the load was also ruled out at this point after proper verification. The fact that the rollback of the previous changes did fix the problem naturally led us to perform a deeper review of the promoted changes. Before we get to the final root cause section, we will perform a code review of the affected Log4j code, i.e. the code exposed to thread race conditions.

Apache Log4j 1.2.15 code review


## org.apache.log4j.Category
/**
        * Call the appenders in the hierrachy starting at <code>this</code>. If no
        * appenders could be found, emit a warning.
        *
        * <p>
        * This method calls all the appenders inherited from the hierarchy
        * circumventing any evaluation of whether to log or not to log the
        * particular log request.
        *
        * @param event
        *            the event to log.
        */
       public void callAppenders(LoggingEvent event) {
             int writes = 0;

             for (Category c = this; c != null; c = c.parent) {
                    // Protected against simultaneous call to addAppender,
                    // removeAppender,...
                    synchronized (c) {
                           if (c.aai != null) {
                                 writes += c.aai.appendLoopOnAppenders(event);
                           }
                           if (!c.additive) {
                                 break;
                           }
                    }
             }

              if (writes == 0) {
                     repository.emitNoAppenderWarning(this);
              }
       }
As you can see, Category.callAppenders() is using a synchronized block at the Category level, which can lead to a severe thread race condition under heavy concurrent load. In this scenario, the usage of a re-entrant read / write lock would have been more appropriate (i.e. such a lock strategy allows concurrent "reads" but a single "writer"). You can find a reference to this known Apache Log4j limitation below along with some possible solutions.
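
For illustration purposes only, find below a rough sketch of the general shape such a re-entrant read / write lock strategy could take (class and interface names are made up for this example; this is not the actual Log4j implementation or fix). Logging threads only acquire the shared read lock, while the rare appender add / remove operations acquire the exclusive write lock.

## RwLockedCategory (illustrative sketch)
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RwLockedCategory {

       public interface Appender {
              void append(String event);
       }

       private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
       private final List<Appender> appenders = new ArrayList<Appender>();

       // Many threads may log concurrently: they only need the shared read lock
       public void callAppenders(String event) {
              lock.readLock().lock();
              try {
                     for (Appender appender : appenders) {
                            appender.append(event);
                     }
              } finally {
                     lock.readLock().unlock();
              }
       }

       // Configuration changes are rare: they take the exclusive write lock
       public void addAppender(Appender appender) {
              lock.writeLock().lock();
              try {
                     appenders.add(appender);
              } finally {
                     lock.writeLock().unlock();
              }
       }
}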


Is the above Log4j behaviour the actual root cause of our problem? Not so fast…
Let’s remember that this problem got exposed only following a recent deployment. The real question is: which application change triggered this problem & side effect from the Apache Log4j logging API?

Root cause: a perfect storm!

Deep dive analysis of the recently deployed changes did reveal that some Log4j libraries at the child classloader level were removed along with the associated "child first" policy. This refactoring exercise ended up moving the delegation of both Commons Logging and Log4j to the parent classloader level. What is the problem?

Before this change, the logging events were split between Weblogic Beehive Log4j calls at the parent classloader and web application logging events at the child classloader. Since each classloader had its own copy of the Log4j objects, the thread race condition problem was split in half and not exposed (masked) under the current load conditions. Following the refactoring, all Log4j calls were moved to the parent classloader (Java EE app), adding a significant concurrency level to Log4j components such as Category. This increased concurrency level, along with the known Category.java thread race / deadlock behaviour, was a perfect storm for our production environment.
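
A quick way to confirm which classloader is serving the Log4j classes after such a refactoring is to print the classloader of a loaded Log4j class from code running inside the web application (e.g. a test servlet or JSP). The standalone sketch below is a simplified illustration of that check, not code taken from the affected application.

## ClassLoaderCheck (illustrative sketch)
import org.apache.log4j.Category;

public class ClassLoaderCheck {

       public static void main(String[] args) {
              // Which classloader loaded the Log4j Category class? (parent vs child / web application)
              System.out.println("Log4j Category loaded by: " + Category.class.getClassLoader());

              // Walk up the classloader delegation chain of the current class
              ClassLoader current = ClassLoaderCheck.class.getClassLoader();
              while (current != null) {
                     System.out.println("Delegation chain: " + current);
                     current = current.getParent();
              }
       }
}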

In order to mitigate this problem, 2 immediate solutions were applied to the environment:

  • Rollback the refactoring and split Log4j calls back between parent and child classloader
  • Reduce logging level for some appenders from DEBUG to WARNING
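
For the second mitigation step above, the logging level can be lowered either from the Log4j configuration or programmatically via the Log4j 1.2 API. The sketch below is only an illustration; the logger name is a placeholder and is not taken from the affected application.

## LoggingLevelTuning (illustrative sketch)
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class LoggingLevelTuning {

       public static void main(String[] args) {
              // Lower a verbose application logger and the root logger from DEBUG to WARN
              Logger.getLogger("com.myapp").setLevel(Level.WARN);
              Logger.getRootLogger().setLevel(Level.WARN);
       }
}

In practice this is typically done through the Log4j configuration rather than in code, but the effect on the callAppenders() contention is the same: fewer DEBUG events reach the synchronized block.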

This problem case again reinforces the importance of performing proper testing and impact assessment when applying changes such as library and classloader related changes. Such changes can appear simple at the "surface" but can trigger some deep execution pattern changes, exposing your application(s) to known thread race conditions.

A future upgrade to Apache Log4j 2 (or another logging API) will also be explored, as it is expected to bring some performance enhancements which may address some of these thread race & scalability concerns.

Please provide any comment or share your experience on thread race related problems with logging APIs.

7.31.2012

Oracle Service Bus – Stuck Thread Case Study

This case study describes the complete root cause analysis process of a stuck thread problem experienced with Oracle Service Bus 11g running on AIX 6.1 and IBM Java VM 1.6.

This article is also a great opportunity for you to improve your thread dump analysis skills and I highly recommend that you study and properly understand the following analysis approach. It will also demonstrate the importance of proper data gathering as opposed to premature middleware (Weblogic) restarts.

Environment specifications

  • Java EE server: Oracle Service Bus 11g
  • OS: AIX 6.1
  • JDK: IBM JRE 1.6.0 @64-bit
  • RDBMS: Oracle 10g
  • Platform type: Enterprise Service Bus

Troubleshooting tools

  • Quest Software Foglight for Java (monitoring and alerting)
  • Java VM Thread Dump (IBM JRE javacore format)

Problem overview

Major performance degradation was observed from our Oracle Service Bus Weblogic environment. Alerts were also sent from the Foglight agents indicating a significant surge in Weblogic threads utilization.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

  • What is the client impact? HIGH
  • Recent change of the affected platform? Yes, the logging level was changed in the OSB console for a few business services prior to the outage report
  • Any recent traffic increase to the affected platform? No
  • How long has this problem been observed? It is a new problem observed following the logging level changes
  • Did a restart of the Weblogic server resolve the problem? Yes

Conclusion #1: The logging level changes applied earlier on some OSB business services appear to have triggered this stuck thread problem. However, the root cause remains unknown at this point.

Weblogic threads monitoring: Foglight for Java

Foglight for Java (from Quest Software) is a great monitoring tool allowing you to completely monitor any Java EE environment along with full alerting capabilities. This tool is used in our production environment to monitor the middleware (Weblogic) data, including threads, for each of the Oracle Service Bus managed servers. You can see below a consistent increase of the threads along with a pending request queue. 

[Figure: Foglight view showing the steady increase of Weblogic hogging threads along with the pending request queue]

For your reference, Weblogic slow running threads are identified as “Hogging Threads” and can eventually be promoted to “STUCK” status if running for several minutes (as per your configured threshold).

Now what should your next course of action be? A Weblogic restart? Definitely not…
Your first "reflex" for this type of problem is to capture a JVM thread dump. Such data is critical for you to perform proper root cause analysis and understand the potential hanging condition. Once such data is captured, you can then proceed with Weblogic server recovery actions such as a full managed server (JVM) restart.

Stuck Threads: Thread Dump to the rescue!

The next course of action in this outage scenario was to quickly generate a few thread dump snapshots from the IBM JVM before attempting to recover the affected Weblogic instances. The thread dump was generated using kill -3 <Java PID>, which generated a few javacore files at the root of the Weblogic domain.

javacore.20120610.122028.15149052.0001.txt

Once the production environment was back up and running, the team quickly proceeded with the analysis of the captured thread dump files as per the steps below.

Thread Dump analysis step #1 – identify a thread execution pattern

The first analysis step is to quickly go through all the Weblogic threads and attempt to identify a common problematic pattern such as threads waiting for remote external systems, threads in deadlock state, threads waiting for other threads to complete their tasks etc.

The analysis quickly revealed many threads involved in the same blocking situation, as per the sample below. In this sample, we can see an Oracle Service Bus thread in a blocked state within the TransactionManager Java class (OSB kernel code).

[ACTIVE] ExecuteThread: '292' for queue: 'weblogic.kernel.Default (self-tuning)'"
 J9VMThread:0x0000000139B76B00, j9thread_t:0x000000013971C9A0,
 java/lang/Thread:0x07000000F9D80630, state:B, prio=5
 (native thread ID:0x2C700D1, native priority:0x5, native policy:UNKNOWN)
 Java callstack:

  at com/bea/wli/config/transaction/TransactionManager._beginTransaction(TransactionManager.java:547(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager.beginTransaction(TransactionManager.java:409(Compiled Code))
  at com/bea/wli/config/derivedcache/DerivedResourceManager.getDerivedValueInfo(DerivedResourceManager.java:339(Compiled Code))
  at com/bea/wli/config/derivedcache/DerivedResourceManager.get(DerivedResourceManager.java:386(Compiled Code))
  at com/bea/wli/sb/resources/cache/DefaultDerivedTypeDef.getDerivedValue(DefaultDerivedTypeDef.java:106(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterRuntimeCache.getRuntime(RouterRuntimeCache.java(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterManager.getRouterRuntime(RouterManager.java:640(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterContext.getInstance(RouterContext.java:172(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterManager.processMessage(RouterManager.java:579(Compiled Code))
  at com/bea/wli/sb/transports/TransportManagerImpl.receiveMessage(TransportManagerImpl.java:375(Compiled Code))
  at com/bea/wli/sb/transports/local/LocalMessageContext$1.run(LocalMessageContext.java:179(Compiled Code))
  at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:363(Compiled Code))
  at weblogic/security/service/SecurityManager.runAs(SecurityManager.java:146(Compiled Code))
  at weblogic/security/Security.runAs(Security.java:61(Compiled Code))
  at com/bea/wli/sb/transports/local/LocalMessageContext.send(LocalMessageContext.java:144(Compiled Code))
  at com/bea/wli/sb/transports/local/LocalTransportProvider.sendMessageAsync(LocalTransportProvider.java:322(Compiled Code))
  at sun/reflect/GeneratedMethodAccessor980.invoke(Bytecode PC:58(Compiled Code))
  at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37(Compiled Code))
  at java/lang/reflect/Method.invoke(Method.java:589(Compiled Code))
  at com/bea/wli/sb/transports/Util$1.invoke(Util.java:83(Compiled Code))
  at $Proxy111.sendMessageAsync(Bytecode PC:26(Compiled Code))
……………………………




Thread Dump analysis step #2 – review the blocked threads chain

The next step was to review the affected and blocked thread chains involved in our identified pattern. As we saw in part 4 of the thread dump analysis series, the IBM JVM thread dump format contains a separate section that provides a full breakdown of all blocked thread chains, i.e. the Java object monitor pool locks.

A quick analysis revealed the culprit thread below. As you can see, Weblogic thread #16 is the actual culprit, with 300+ threads waiting to acquire a lock on the shared object monitor TransactionManager@0x0700000001A51610/0x0700000001A51628.

2LKMONINUSE      sys_mon_t:0x000000012CCE2688 infl_mon_t: 0x000000012CCE26C8:
3LKMONOBJECT       com/bea/wli/config/transaction/TransactionManager@0x0700000001A51610/0x0700000001A51628: Flat locked by "[ACTIVE] ExecuteThread: '16' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CA3C800), entry count 1
3LKWAITERQ            Waiting to enter:
3LKWAITER                "[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011C785C00)
3LKWAITER                "[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011CA93200)
3LKWAITER                "[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011B3F2B00)
3LKWAITER                "[STUCK] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011619B300)
3LKWAITER                "[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CBE8000)
3LKWAITER                "[STUCK] ExecuteThread: '21' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012BE91200)
..................

Thread Dump analysis step #3 – thread culprit deeper analysis

Once you identify a primary culprit thread, the next step is to perform a deeper review of the computing task this thread is currently performing. Simply go back to the raw thread dump data and start analyzing the culprit thread stack trace from bottom-up.

As you can see below, the thread stack trace for our problem case was quite revealing. It showed that thread #16 was attempting to commit a change made at the Weblogic / Oracle Service Bus level. The problem is that the commit operation was hanging and taking too much time, causing thread #16 to retain the shared TransactionManager object monitor lock for too long and "starving" the other Oracle Service Bus Weblogic threads.

"[ACTIVE] ExecuteThread: '16' for queue: 'weblogic.kernel.Default (self-tuning)'"
J9VMThread:0x000000012CA3C800, j9thread_t:0x000000012C9F0F40, java/lang/Thread:0x0700000026FCE120, state:P, prio=5
(native thread ID:0x35B0097, native priority:0x5, native policy:UNKNOWN)
 Java callstack:
 
  at sun/misc/Unsafe.park(Native Method)
  at java/util/concurrent/locks/LockSupport.park(LockSupport.java:184(Compiled Code))
  at java/util/concurrent/locks/AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:822)
  at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:853(Compiled Code))
  at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1189(Compiled Code))
  at java/util/concurrent/locks/ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:911(Compiled Code))
  at com/bea/wli/config/derivedcache/DerivedCache$Purger.changesCommitted(DerivedCache.java:80)
  at com/bea/wli/config/impl/ResourceListenerNotifier.afterEnd(ResourceListenerNotifier.java:120)
  at com/bea/wli/config/transaction/TransactionListenerWrapper.afterEnd(TransactionListenerWrapper.java:90)
  at com/bea/wli/config/transaction/TransactionManager.notifyAfterEnd(TransactionManager.java:1154(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager.commit(TransactionManager.java:1519(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager._endTransaction(TransactionManager.java:842(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager.endTransaction(TransactionManager.java:783(Compiled Code))
  at com/bea/wli/config/deployment/server/ServerDeploymentReceiver$2.run(ServerDeploymentReceiver.java:275)
  at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:321(Compiled Code))
  at weblogic/security/service/SecurityManager.runAs(SecurityManager.java:120(Compiled Code))
  at com/bea/wli/config/deployment/server/ServerDeploymentReceiver.commit(ServerDeploymentReceiver.java:260)
  at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer.doCommitCallback(DeploymentReceiverCallbackDeliverer.java:195)
  at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer.access$100(DeploymentReceiverCallbackDeliverer.java:13)
  at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer$2.run(DeploymentReceiverCallbackDeliverer.java:68)
  at weblogic/work/SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:528(Compiled Code))
  at weblogic/work/ExecuteThread.execute(ExecuteThread.java:203(Compiled Code))
  at weblogic/work/ExecuteThread.run(ExecuteThread.java:171(Compiled Code))

Root cause: connecting the dots

At this point, the collection of facts and the thread dump analysis allowed us to determine the chain of events as per below:

  • A logging level change was applied by the production Oracle Service Bus administrator
  • Weblogic deployment thread #16 was unable to commit the change properly
  • Weblogic runtime threads executing client requests quickly started to queue up and wait for a lock on the shared object monitor (TransactionManager)
  • The Weblogic instances ran out of threads, generating alerts and forcing the production support team to shut down and restart the affected JVM processes

Our team is planning to open an Oracle SR shortly to share this OSB deployment behaviour along with the hard dependency between the client requests (threads) and the OSB logging layer. In the meantime, no OSB logging level change will be attempted outside the maintenance window period until further notice.

Conclusion

I hope this article has helped you understand and appreciate how powerful thread dump analysis can be to pinpoint the root cause of stuck thread problems, and the importance for any Java EE production support team of capturing such crucial data in order to prevent future re-occurrences.

Please do not hesitate to post any comment or question.

7.05.2012

How to analyze Thread Dump – Part 5: Thread Stack Trace

This article is part 5 of our Thread Dump analysis series. So far you have learned the basic principles of threads and their interactions with your Java EE container & JVM. You have also learned the different Thread Dump formats for the HotSpot and IBM Java VMs. It is now time for you to deep dive into the analysis process.

** UPDATE: Thread Dump analysis tutorial videos now available here.

In order for you to quickly identify a problem pattern from a Thread Dump, you first need to understand how to read a Thread Stack Trace and how to get the "story" right. This means that if I ask you to tell me what Thread #38 is doing, you should be able to answer precisely, including whether the Thread Stack Trace is showing a healthy (normal) vs. hang condition.

Java Stack Trace revisited

Most of you are familiar with Java stack traces. This is typical data that we find in server and application log files when a Java Exception is thrown. In this context, a Java stack trace gives us the code execution path of the Thread that triggered the Java Exception, such as a java.lang.NoClassDefFoundError, java.lang.NullPointerException etc. Such a code execution path allows us to see the different layers of code that ultimately led to the Java Exception.

Java stack traces must always be read from the bottom up:
-        The line at the bottom will expose the originator of the request, such as a Java / Java EE container Thread.
-        The first line at the top of the stack trace will show you the Java class where the Exception was triggered.

Let’s go through this process via a simple example. We created a sample Java program simply executing some class method calls and throwing an Exception. The program output generated is as per below:

JavaStrackTraceSimulator
Author: Pierre-Hugues Charbonneau

Exception in thread "main" java.lang.IllegalArgumentException:
        at org.ph.javaee.training.td.Class2.call(Class2.java:12)
        at org.ph.javaee.training.td.Class1.call(Class1.java:14)
        at org.ph.javaee.training.td.JavaSTSimulator.main(JavaSTSimulator.java:20)

-        The Java program JavaSTSimulator is invoked (via the “main” Thread)
-        The simulator then invokes the call() method of Class1
-        Class1 method call() then invokes Class2 method call()
-        Class2 method call() throws a Java Exception: java.lang.IllegalArgumentException
-        The Java Exception is then displayed in the log / standard output

As you can see, the code execution path that led to this Exception is always displayed from the bottom up.
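
For reference, find below a minimal sketch of what such a simulator program could look like. The class and package names follow the output above, but the exact source code (and therefore the line numbers shown in the stack trace) is assumed for illustration, not taken from the original program.

## JavaSTSimulator (illustrative sketch)
package org.ph.javaee.training.td;

public class JavaSTSimulator {

       public static void main(String[] args) {
              new Class1().call();   // appears at the bottom of the stack trace
       }
}

class Class1 {
       void call() {
              new Class2().call();   // middle of the stack trace
       }
}

class Class2 {
       void call() {
              // deepest call: appears at the top of the stack trace
              throw new IllegalArgumentException();
       }
}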

The above analysis process should be well known to any Java programmer. What you will see next is that the Thread Dump Thread stack trace analysis process is very similar to the above Java stack trace analysis.

Thread Dump: Thread Stack Trace analysis

A Thread Dump generated from the JVM provides you with a code-level execution snapshot of all the “created” Threads of the entire JVM process. Created does not mean that all these Threads are actually doing something. In a typical Thread Dump snapshot generated from a Java EE container JVM:

-        Some Threads could be performing raw computing tasks such as XML parsing, IO / disk access etc.
-        Some Threads could be waiting on blocking IO calls such as a remote Web Service call, a DB / JDBC query etc.
-        Some Threads could be involved in garbage collection at that time e.g. GC Threads
-        Some Threads will be waiting for some work to do (Threads not doing any work typically go into a wait() state)
-        Some Threads could be waiting for other Threads’ work to complete e.g. Threads waiting to acquire a monitor lock (synchronized block{}) on some object (see the small example after this list)
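
As a simple illustration of the last two categories (a standalone, hypothetical example, not tied to any Java EE container), the snippet below creates one Thread parked in wait() and one Thread blocked on an object monitor held by the “main” Thread. A Thread Dump taken while it runs shows these Threads in WAITING and BLOCKED state respectively.

## ThreadStateDemo (illustrative sketch)
public class ThreadStateDemo {

       private static final Object LOCK = new Object();

       public static void main(String[] args) throws InterruptedException {
              Thread waiting = new Thread(new Runnable() {
                     public void run() {
                            synchronized (LOCK) {
                                   try {
                                          LOCK.wait(); // releases LOCK and waits -> WAITING state
                                   } catch (InterruptedException e) {
                                          Thread.currentThread().interrupt();
                                   }
                            }
                     }
              }, "Waiting-Thread");
              waiting.start();
              Thread.sleep(500); // let Waiting-Thread enter wait() and release LOCK

              synchronized (LOCK) { // "main" now holds LOCK
                     Thread blocked = new Thread(new Runnable() {
                            public void run() {
                                   synchronized (LOCK) { // cannot enter while "main" holds it -> BLOCKED state
                                          System.out.println("lock acquired");
                                   }
                            }
                     }, "Blocked-Thread");
                     blocked.start();
                     Thread.sleep(500);

                     System.out.println("Waiting-Thread state: " + waiting.getState());
                     System.out.println("Blocked-Thread state: " + blocked.getState());
                     LOCK.notifyAll(); // wake up Waiting-Thread before releasing LOCK
              }
       }
}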

I will get back to the above with more diagrams in my next article, but for now let’s focus on the stack trace analysis process. Your next task is to be able to read a Thread stack trace and understand what it is doing, to the best of your knowledge.

A Thread stack trace provides you with a snapshot of its current execution. The first line typically includes native information about the Thread such as its name, state, address etc. The current execution stack trace has to be read from the bottom up. Please follow the analysis process below. The more experience you get with Thread Dump analysis, the faster you will be able to read and identify the work performed by each Thread:

-        Start reading the Thread stack trace from the bottom
-        First, identify the originator (Java EE container Thread, custom Thread, GC Thread, JVM internal Thread, standalone Java program “main” Thread etc.)
-        The next step is to identify the type of request the Thread is executing (WebApp, Web Service, JMS, Remote EJB (RMI), internal Java EE container etc.)
-        The next step is to identify from the execution stack trace the application module(s) involved e.g. the actual core work the Thread is trying to perform. The complexity of the analysis will depend on the layers of abstraction of your middleware environment and application
-        The next step is to look at the last ~10-20 lines prior to the first line. Identify the protocol or work the Thread is involved with e.g. HTTP call, Socket communication, JDBC or raw computing tasks such as disk access, class loading etc.
-        The next step is to look at the first line. The first line usually tells you a LOT about the Thread state since it is the current piece of code executed at the time you took the snapshot
-        The combination of the last 2 steps is what will give you the core information to conclude what work and / or hanging condition the Thread is involved with

Now find below a visual breakdown of the above steps using a real example of a Thread Dump Thread stack trace captured from a JBoss 5 production environment. In this example, many Threads were showing a similar problem pattern of excessive IO when creating new JAX-WS Service instances.

[Figure: annotated Thread stack trace from a JBoss 5 production environment illustrating the breakdown steps above]

As you can see, the last 10 lines along with the first line will tell us what hanging or slow condition the Thread is involved with, if any. The lines from the bottom will give us details about the originator and the type of request.

I hope this article has helped you understand the importance of proper Thread stack trace analysis. I will get back with many more Thread stack trace examples when we cover the most common Thread Dump problem patterns in future articles. The next article will teach you how to break down the Thread Dump threads into logical silos and come up with a potential list of root cause “suspects”.

Please feel free to post any comment and question.