Java EE Support Patterns: OSB

7.31.2012

Oracle Service Bus – Stuck Thread Case Study

This case study describes the complete root cause analysis process of a stuck thread problem experienced with Oracle Service Bus 11g running on AIX 6.1 and IBM Java VM 1.6.

This article is also a great opportunity for you to improve your thread dump analysis skills and I highly recommend that you study and properly understand the following analysis approach. It will also demonstrate the importance of proper data gathering as opposed to premature middleware (Weblogic) restarts.

Environment specifications

  • Java EE server: Oracle Service Bus 11g
  • OS: AIX 6.1
  • JDK: IBM JRE 1.6.0 @64-bit
  • RDBMS: Oracle 10g
  • Platform type: Enterprise Service Bus

Troubleshooting tools

  • Quest Software Foglight for Java (monitoring and alerting)
  • Java VM Thread Dump (IBM JRE javacore format)

Problem overview

Major performance degradation was observed in our Oracle Service Bus Weblogic environment. Alerts were also sent from the Foglight agents indicating a significant surge in Weblogic thread utilization.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

  • What is the client impact? HIGH
  • Recent change of the affected platform? Yes, the logging level was changed in the OSB console for a few business services prior to the outage report
  • Any recent traffic increase to the affected platform? No
  • How long has this problem been observed? This is a new problem, observed following the logging level changes
  • Did a restart of the Weblogic server resolve the problem? Yes
Conclusion #1: The logging level changes applied earlier on some OSB business services appear to have triggered this stuck thread problem. However, the root cause remains unknown at this point.

Weblogic threads monitoring: Foglight for Java

Foglight for Java (from Quest Software) is a great monitoring tool that provides complete monitoring and alerting capabilities for any Java EE environment. This tool is used in our production environment to monitor the middleware (Weblogic) data, including threads, for each of the Oracle Service Bus managed servers. Below you can see a consistent increase of the thread count along with a growing pending request queue.


For your reference, Weblogic slow running threads are identified as “Hogging Threads” and can eventually be promoted to “STUCK” status if running for several minutes (as per your configured threshold).

Now what should be your next course of action? Weblogic restart? Definitely not…
Your first “reflex” for this type of problem is to capture a JVM Thread Dump. Such data is critical for you to perform proper root cause analysis and understand the potential hanging condition. Once such data is captured, you can then proceed with Weblogic server recovery actions such as a full managed server (JVM) restart.

Stuck Threads: Thread Dump to the rescue!

The next course of action in this outage scenario was to quickly generate a few thread dump snapshots from the IBM JVM before attempting to recover the affected Weblogic instances. The thread dumps were generated using kill -3 <Java PID>, which produced a few javacore files at the root of the Weblogic domain.

javacore.20120610.122028.15149052.0001.txt
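
For reference, below is a minimal sketch of how such a series of snapshots can be captured against the IBM JVM; the PID (taken from the javacore file name above) and the number of iterations are assumptions to adapt to your own environment.

## Sketch: capture a few javacore snapshots a few seconds apart (IBM JVM)
JAVA_PID=15149052
for i in 1 2 3
do
   kill -3 $JAVA_PID     # non-destructive on the IBM JVM; triggers a javacore file
   sleep 15              # short delay so thread progression can be compared between snapshots
done
## javacore.<date>.<time>.<pid>.<sequence>.txt files are written at the root of the Weblogic domain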

Once the production environment was back up and running, the team quickly proceeded with the analysis of the captured thread dump files as per below steps.

Thread Dump analysis step #1 – identify a thread execution pattern

The first analysis step is to quickly go through all the Weblogic threads and attempt to identify a common problematic pattern, such as threads waiting for responses from remote external systems, threads in deadlock state, threads waiting for other threads to complete their tasks, etc.

The analysis quickly revealed many threads involved in the same blocking situation, as per the sample below. In this sample, we can see an Oracle Service Bus thread in blocked state within the TransactionManager Java class (OSB kernel code).

"[ACTIVE] ExecuteThread: '292' for queue: 'weblogic.kernel.Default (self-tuning)'"
 J9VMThread:0x0000000139B76B00, j9thread_t:0x000000013971C9A0,
 java/lang/Thread:0x07000000F9D80630, state:B, prio=5
 (native thread ID:0x2C700D1, native priority:0x5, native policy:UNKNOWN)
 Java callstack:

  at com/bea/wli/config/transaction/TransactionManager._beginTransaction(TransactionManager.java:547(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager.beginTransaction(TransactionManager.java:409(Compiled Code))
  at com/bea/wli/config/derivedcache/DerivedResourceManager.getDerivedValueInfo(DerivedResourceManager.java:339(Compiled Code))
  at com/bea/wli/config/derivedcache/DerivedResourceManager.get(DerivedResourceManager.java:386(Compiled Code))
  at com/bea/wli/sb/resources/cache/DefaultDerivedTypeDef.getDerivedValue(DefaultDerivedTypeDef.java:106(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterRuntimeCache.getRuntime(RouterRuntimeCache.java(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterManager.getRouterRuntime(RouterManager.java:640(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterContext.getInstance(RouterContext.java:172(Compiled Code))
  at com/bea/wli/sb/pipeline/RouterManager.processMessage(RouterManager.java:579(Compiled Code))
  at com/bea/wli/sb/transports/TransportManagerImpl.receiveMessage(TransportManagerImpl.java:375(Compiled Code))
  at com/bea/wli/sb/transports/local/LocalMessageContext$1.run(LocalMessageContext.java:179(Compiled Code))
  at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:363(Compiled Code))
  at weblogic/security/service/SecurityManager.runAs(SecurityManager.java:146(Compiled Code))
  at weblogic/security/Security.runAs(Security.java:61(Compiled Code))
  at com/bea/wli/sb/transports/local/LocalMessageContext.send(LocalMessageContext.java:144(Compiled Code))
  at com/bea/wli/sb/transports/local/LocalTransportProvider.sendMessageAsync(LocalTransportProvider.java:322(Compiled Code))
  at sun/reflect/GeneratedMethodAccessor980.invoke(Bytecode PC:58(Compiled Code))
  at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37(Compiled Code))
  at java/lang/reflect/Method.invoke(Method.java:589(Compiled Code))
  at com/bea/wli/sb/transports/Util$1.invoke(Util.java:83(Compiled Code))
  at $Proxy111.sendMessageAsync(Bytecode PC:26(Compiled Code))
……………………………




Thread Dump analysis step #2 – review the blocked threads chain

The next step was to review the affected and blocked thread chains involved in our identified pattern. As we saw in thread dump analysis part 4, the IBM JVM thread dump (javacore) format contains a separate section that provides a full breakdown of all blocked thread chains, e.g. the Java object monitor pool locks.

A quick analysis revealed the thread culprit below. As you can see, Weblogic thread #16 is the actual culprit, with 300+ threads waiting to acquire a lock on the shared object monitor TransactionManager@0x0700000001A51610/0x0700000001A51628.

2LKMONINUSE      sys_mon_t:0x000000012CCE2688 infl_mon_t: 0x000000012CCE26C8:
3LKMONOBJECT       com/bea/wli/config/transaction/TransactionManager@0x0700000001A51610/0x0700000001A51628: Flat locked by "[ACTIVE] ExecuteThread: '16' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CA3C800), entry count 1
3LKWAITERQ            Waiting to enter:
3LKWAITER                "[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011C785C00)
3LKWAITER                "[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011CA93200)
3LKWAITER                "[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011B3F2B00)
3LKWAITER                "[STUCK] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011619B300)
3LKWAITER                "[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CBE8000)
3LKWAITER                "[STUCK] ExecuteThread: '21' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012BE91200)
..................
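
If you are dealing with a large javacore file, a quick way to isolate this lock section is to search for the standard javacore lock tags shown above; a simple grep sketch against the captured file is shown below.

## Sketch: isolate the object monitor lock details from the captured javacore file
grep -n "3LKMONOBJECT" javacore.20120610.122028.15149052.0001.txt    # locked monitors and their owner threads
grep -c "3LKWAITER" javacore.20120610.122028.15149052.0001.txt       # count of threads waiting on monitors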

Thread Dump analysis step #3 – thread culprit deeper analysis

Once you identify a primary culprit thread, the next step is to perform a deeper review of the computing task this thread is currently performing. Simply go back to the raw thread dump data and start analyzing the culprit thread stack trace from the bottom up.

As you can see below, the thread stack trace for our problem case was quite revealing. It shows that thread #16 was attempting to commit a change made at the Weblogic / Oracle Service Bus level. The problem is that the commit operation was hanging and taking too much time, causing thread #16 to retain the shared TransactionManager object monitor lock for too long and “starving” the other Oracle Service Bus Weblogic threads.

"[ACTIVE] ExecuteThread: '16' for queue: 'weblogic.kernel.Default (self-tuning)'"
J9VMThread:0x000000012CA3C800, j9thread_t:0x000000012C9F0F40, java/lang/Thread:0x0700000026FCE120, state:P, prio=5
(native thread ID:0x35B0097, native priority:0x5, native policy:UNKNOWN)
 Java callstack:
 
  at sun/misc/Unsafe.park(Native Method)
  at java/util/concurrent/locks/LockSupport.park(LockSupport.java:184(Compiled Code))
  at java/util/concurrent/locks/AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:822)
  at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:853(Compiled Code))
  at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1189(Compiled Code))
  at java/util/concurrent/locks/ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:911(Compiled Code))
  at com/bea/wli/config/derivedcache/DerivedCache$Purger.changesCommitted(DerivedCache.java:80)
  at com/bea/wli/config/impl/ResourceListenerNotifier.afterEnd(ResourceListenerNotifier.java:120)
  at com/bea/wli/config/transaction/TransactionListenerWrapper.afterEnd(TransactionListenerWrapper.java:90)
  at com/bea/wli/config/transaction/TransactionManager.notifyAfterEnd(TransactionManager.java:1154(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager.commit(TransactionManager.java:1519(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager._endTransaction(TransactionManager.java:842(Compiled Code))
  at com/bea/wli/config/transaction/TransactionManager.endTransaction(TransactionManager.java:783(Compiled Code))
  at com/bea/wli/config/deployment/server/ServerDeploymentReceiver$2.run(ServerDeploymentReceiver.java:275)
  at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:321(Compiled Code))
  at weblogic/security/service/SecurityManager.runAs(SecurityManager.java:120(Compiled Code))
  at com/bea/wli/config/deployment/server/ServerDeploymentReceiver.commit(ServerDeploymentReceiver.java:260)
  at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer.doCommitCallback(DeploymentReceiverCallbackDeliverer.java:195)
  at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer.access$100(DeploymentReceiverCallbackDeliverer.java:13)
  at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer$2.run(DeploymentReceiverCallbackDeliverer.java:68)
  at weblogic/work/SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:528(Compiled Code))
  at weblogic/work/ExecuteThread.execute(ExecuteThread.java:203(Compiled Code))
  at weblogic/work/ExecuteThread.run(ExecuteThread.java:171(Compiled Code))
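
If you need to pull the full entry of a single thread such as thread #16 out of the raw javacore file, a simple grep with a few trailing context lines is usually enough; the context size below (40 lines) is an arbitrary assumption.

## Sketch: extract the culprit thread stack trace from the raw javacore data
grep -A 40 "ExecuteThread: '16' for queue" javacore.20120610.122028.15149052.0001.txt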

Root cause: connecting the dots

At this point the collection of the facts and thread dump analysis did allow us to determine the chain of events as per below:

  • Logging level change applied by the production Oracle Service Bus administrator
  • Weblogic deployment thread #16 was unable to commit the change properly
  • Weblogic runtime threads executing client requests quickly started to queue up and wait for a lock on the shared object monitor (TransactionManager)
  • The Weblogic instances ran out of threads, generating alerts and forcing the production support team to shut down and restart the affected JVM processes
Our team is planning to open an Oracle SR shortly to share this OSB deployment behaviour along with the hard dependency between client requests (threads) and the OSB logging layer. In the meantime, no OSB logging level change will be attempted outside the maintenance window until further notice.

Conclusion

I hope this article has helped you understand and appreciate how powerful thread dump analysis can be to pinpoint the root cause of stuck thread problems, and the importance for any Java EE production support team of capturing such crucial data in order to prevent future recurrences.

Please do not hesitate to post any comment or question.

2.08.2012

Too many open files – Case Study

This case study describes the complete root cause analysis and resolution of a File Descriptor (Too many open files) related problem that we faced following a migration from Oracle ALSB 2.6 running on Solaris OS to Oracle OSB 11g running on AIX.

This article will also provide you with proper AIX OS commands you can use to troubleshoot and validate the File Descriptor configuration of your Java VM process.

Environment specifications

-        Java EE server: Oracle Service Bus 11g
-        Middleware OS: IBM AIX 6.1
-        Java VM: IBM JRE 1.6.0 SR9 – 64 bit
-        Platform type: Service Bus – Middle Tier

Problem overview

-        Problem type: java.net.SocketException: Too many open files errors were observed under heavy load, causing our Oracle OSB managed servers to suddenly hang

This problem was observed only during high load and required our support team to take corrective action, e.g. shutting down and restarting the affected Weblogic OSB managed servers.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH; Full JVM hang
·        Recent change of the affected platform? Yes, recent migration from ALSB 2.6 (Solaris OS) to Oracle OSB 11g (AIX OS)
·        Any recent traffic increase to the affected platform? No
·        What is the health of the Weblogic server? Affected managed servers were no longer responsive along with closure of the Weblogic HTTP (Server Socket) port
·        Did a restart of the Weblogic server resolve the problem? Yes, but only temporarily

-        Conclusion #1: The problem appears to be load related

Weblogic server log files review

A quick review of the affected managed server logs revealed the error below:

java.net.SocketException: Too many open files

This error indicates that our Java VM process was running out of File Descriptors. This is a severe condition that affects the whole Java VM process and causes Weblogic to close its internal Server Socket port (HTTP/HTTPS port), preventing any further inbound & outbound communication to the affected managed server(s).

File Descriptor – Why so important for an Oracle OSB environment?

The File Descriptor capacity is quite important for your Java VM process. The key concept you must understand is that File Descriptors are not only required for pure File Handles but also for inbound and outbound Socket communication. Each new Java Socket created to (inbound) or from (outbound) your Java VM by the Weblogic kernel Socket Muxer requires a File Descriptor allocation at the OS level.

An Oracle OSB environment can require a significant number of Sockets depending on how much inbound load it receives and how many outbound connections (Java Sockets) it has to create in order to send and receive data from external / downstream systems (System End Points).

For that reason, you must ensure that you allocate enough File Descriptors / Sockets to your Java VM process in order to support your daily load, including problematic scenarios such as a sudden slowdown of external systems, which typically increases the demand on the File Descriptor allocation.
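
To illustrate this point, the sketch below shows how you can quickly break down the File Descriptors of your Java VM process per type (plain files vs. Sockets), using standard lsof / awk commands against the lsof TYPE column.

## Sketch: break down the JVM File Descriptors per type (REG = file, IPv4/IPv6/SOCK = Socket, etc.)
lsof -p <Java PID> | awk '{print $5}' | sort | uniq -c | sort -rn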

Runtime File Descriptor capacity check for Java VM and AIX OS

Following the discovery of this error, our technical team performed a quick review of the current runtime File Descriptor capacity & utilization of our OSB Java VM processes. This can be done easily via the AIX commands procfiles <Java PID> | grep rlimit and lsof -p <Java PID> | wc -l, as per the example below:

## Java VM process File Descriptor total capacity

>> procfiles 5425732 | grep rlimit
  Current rlimit: 2000 file descriptors

## Java VM process File Descriptor current utilization

>> lsof -p <Java PID> | wc -l
  1920

As you can see, the current capacity was found to be 2000, which is quite low for a medium-size Oracle OSB environment. The average utilization under heavy load was also found to be quite close to the upper limit of 2000.
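
A simple way to keep an eye on this utilization during peak load periods is sketched below; the PID and the polling interval are assumptions to adjust for your own environment.

## Sketch: track the File Descriptor utilization of the OSB Java VM process over time
JAVA_PID=5425732
while true
do
   echo "`date '+%H:%M:%S'` open file descriptors: `lsof -p $JAVA_PID | wc -l`"
   sleep 60
done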

The next step was to verify the default AIX OS File Descriptor limit via the ulimit -S -n command:

>> ulimit -S -n
  2000

-        Conclusion #2: The current File Descriptor limit for both the OS and the OSB Java VM appears to be quite low, set at 2000. The File Descriptor utilization was also found to be quite close to this upper limit, which explains why so many JVM failures were observed at peak load

Weblogic File Descriptor configuration review

The File Descriptor limit can typically be overridden when you start your Weblogic Java VM. Such configuration is managed by the WLS core layer, and the relevant script can be found at the following location:

<WL_HOME>/wlserver_10.3/common/bin/commEnv.sh

..................................................
resetFd() {
  if [ ! -n "`uname -s |grep -i cygwin || uname -s |grep -i windows_nt || \
       uname -s |grep -i HP-UX`" ]
  then
    ofiles=`ulimit -S -n`
    maxfiles=`ulimit -H -n`
    if [ "$?" = "0" -a  `expr ${maxfiles} : '[0-9][0-9]*$'` -eq 0 -a `expr ${ofiles} : '[0-9][0-9]*$'` -eq 0 ]; then
      ulimit -n 4096
    else
      if [ "$?" = "0" -a `uname -s` = "SunOS" -a `expr ${maxfiles} : '[0-9][0-9]*$'` -eq 0 ]; then
        if [ ${ofiles} -lt 65536 ]; then
          ulimit -H -n 65536
        else
          ulimit -H -n 4096
        fi
      fi
    fi
  fi
.................................................

Root cause: File Descriptor override only working for Solaris OS!

As you can see in the resetFd() snippet above, the override of the File Descriptor limit via ulimit is only applicable to Solaris OS (SunOS), which explains why our current OSB Java VM running on AIX OS ended up with the default value of 2000 vs. our older ALSB 2.6 environment running on Solaris OS, which had a File Descriptor limit of 65536.


Solution: script tweaking for AIX OS

The resolution of this problem was done by modifying the Weblogic commEnv script as per the sketch below. This change ensured a configuration of 65536 File Descriptors (from 2000), including for AIX OS:
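
The original post showed the modified script as a screenshot; below is a minimal sketch of the kind of change applied to the resetFd() function, assuming a hard-coded limit of 65536 is acceptable for your environment (always validate against your own commEnv.sh version).

## Sketch of the resetFd() tweak (not the verbatim Oracle script)
resetFd() {
  if [ ! -n "`uname -s |grep -i cygwin || uname -s |grep -i windows_nt || \
       uname -s |grep -i HP-UX`" ]
  then
    # Raise the soft File Descriptor limit on all remaining platforms, including AIX,
    # instead of relying on the Solaris-only branch of the original script
    ulimit -n 65536
  fi
}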


** Please note that the activation of any change to the Weblogic File Descriptor configuration requires a restart of both the Node Manager (if used) and the managed servers. **

A runtime validation was also performed following the activation of the new configuration which did confirm the new active File Descriptor limit:

>> procfiles 6416839 | grep rlimit
  Current rlimit: 65536 file descriptors

No failure has been observed since then.

Conclusion and recommendations

-        When upgrading your Weblogic Java EE container to a new version, please ensure that you verify your current File Descriptor limit as per the above case study
-        From a capacity planning perspective, please ensure that you monitor your File Descriptor utilization on a regular basis in order to identify any potential capacity problem, Socket leak, etc.

Please don’t hesitate to post any comment or question on this subject if you need any additional help.

10.30.2011

OutOfMemoryError – Oracle Service Bus Alerts

This case study describes the complete steps from root cause analysis to resolution of a sudden OutOfMemoryError problem experienced with Oracle Service Bus 2.6 running on Weblogic Server 9.2.
This case study will also demonstrate how to combine Thread Dump analysis and Solaris PRSTAT to pinpoint culprit Thread(s) consuming a large amount of memory on your application Java Heap.

Environment specifications

-         Java EE server: Oracle Service Bus 2.6 & Weblogic Server 9.2
-         OS: Solaris 10
-         JDK: Sun HotSpot 1.5 SR7
-         RDBMS: Oracle 10g
-         Platform type: Enterprise Service Bus

Troubleshooting tools

-         Quest Foglight for Java EE
-         Java VM Thread Dump (for stuck Thread analysis)
-         Solaris PRSTAT (for Thread CPU contributors breakdown)

Problem overview

-         Problem type: OutOfMemoryError

An OutOfMemoryError problem was observed in our Oracle Service Bus Weblogic logs along with major performance degradation.

A preventive restart was then implemented, since a Java Heap memory leak was initially suspected; this did not prove to be effective.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH
·        Recent change of the affected platform? No
·        Any recent traffic increase to the affected platform? No
·        How long has this problem been observed? For several months, with no identified root cause
·        Is the OutOfMemoryError happening suddenly or over time? It was observed that the Java Heap suddenly increases and can then take several hours before failing with an OutOfMemoryError
·        Did a restart of the Weblogic server resolve the problem? No, a preventive restart was attempted which did not prove to be effective

-         Conclusion #1: The sudden increase of the Java Heap could indicate long running Thread(s) consuming a large amount of memory in a short amount of time

-         Conclusion #2: A Java Heap leak cannot be ruled out at this point

Java Heap monitoring using Foglight

The Quest Foglight monitoring tool was used to monitor the trend of the Oracle Service Bus Java Heap for each of the managed servers. Below, a sudden increase of the Java Heap OldGen space can be observed.

This type of memory problem normally does not indicate a linear or long-term memory leak but rather rogue Thread(s) consuming a large amount of memory in a short amount of time. In this scenario, my recommendation is to first perform Thread Dump analysis before jumping too quickly to JVM Heap Dump analysis.
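
As a reminder, on Sun HotSpot the thread dump triggered via kill -3 is written to the managed server standard output (.out) log rather than to separate javacore files. Below is a minimal capture sketch; the PID is the one visible in the PRSTAT output further down, and the snapshot count / interval are assumptions.

## Sketch: capture a few HotSpot thread dump snapshots during the sudden Java Heap increase
JAVA_PID=9116
for i in 1 2 3 4 5
do
   kill -3 $JAVA_PID     # thread dump is appended to the managed server standard out log
   sleep 30
done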


Thread Dump analysis

Thread Dump snapshots were generated during the sudden Java Heap increase, and the analysis revealed a primary culprit as you can see below.

Analysis of the different Thread Dump snapshots revealed a very long elapsed time for this Thread along with a perfect correlation with our Java Heap increase. As you can see, this Thread is triggered by the ALSB alerting layer and is involved in read/write operations against the Weblogic internal diagnostic File Store.

"pool-2-thread-1" prio=10 tid=0x01952650 nid=0x4c in Object.wait() [0x537fe000..0x537ff8f0]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:474)
        at weblogic.common.CompletionRequest.getResult(CompletionRequest.java:109)
        - locked <0xf06ca710> (a weblogic.common.CompletionRequest)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:657)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:630)
        at weblogic.diagnostics.archive.wlstore.PersistentRecordIterator.fill(PersistentRecordIterator.java:87)
        at weblogic.diagnostics.archive.RecordIterator.fetchMore(RecordIterator.java:157)
        at weblogic.diagnostics.archive.RecordIterator.hasNext(RecordIterator.java:130)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.getAlerts(AlertLoggerImpl.java:157)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.updateAlertEvalTime(AlertLoggerImpl.java:140)
        at com.bea.wli.monitoring.alert.AlertLog.updateAlertEvalTime(AlertLog.java:248)
        at com.bea.wli.monitoring.alert.AlertManager._evaluateSingleRule(AlertManager.java:992)
        at com.bea.wli.monitoring.alert.AlertManager.intervalCompleted(AlertManager.java:809)
        at com.bea.wli.monitoring.alert.AlertEvalTask.execute(AlertEvalTask.java:65)
        at com.bea.wli.monitoring.alert.AlertEvalTask.run(AlertEvalTask.java:37)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)

Solaris PRSTAT

Solaris PRSTAT was also captured in order to understand the amount of CPU consumed and determine if this Thread was stuck or performing a lot of processing.

As you can see below, Thread id #76 was identified as the primary CPU contributor with 8.6 %.

prstat -L <Java PID>

  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
  9116 bea      3109M 2592M sleep   59    0  21:52:59 8.6% java/76
  9116 bea      3109M 2592M sleep   57    0   4:28:05 0.3% java/40
  9116 bea      3109M 2592M sleep   59    0   6:52:02 0.3% java/10774
  9116 bea      3109M 2592M sleep   59    0   6:50:00 0.3% java/84
  9116 bea      3109M 2592M sleep   58    0   4:27:20 0.2% java/43
  9116 bea      3109M 2592M sleep   59    0   7:39:42 0.2% java/41287
  9116 bea      3109M 2592M sleep   59    0   3:41:44 0.2% java/30503
  9116 bea      3109M 2592M sleep   59    0   5:48:32 0.2% java/36116
  9116 bea      3109M 2592M sleep   59    0   6:15:52 0.2% java/36118
  9116 bea      3109M 2592M sleep   59    0   2:44:02 0.2% java/36128
  9116 bea      3109M 2592M sleep   59    0   5:53:50 0.1% java/36111
  9116 bea      3109M 2592M sleep   59    0   4:27:55 0.1% java/55
  9116 bea      3109M 2592M sleep   59    0   9:51:19 0.1% java/23479
  9116 bea      3109M 2592M sleep   59    0   4:57:33 0.1% java/36569
  9116 bea      3109M 2592M sleep   59    0   9:51:08 0.1% java/23477
  9116 bea      3109M 2592M sleep   59    0  10:15:13 0.1% java/4339
  9116 bea      3109M 2592M sleep   59    0  10:13:37 0.1% java/4331
  9116 bea      3109M 2592M sleep   59    0   4:58:37 0.1% java/36571
  9116 bea      3109M 2592M sleep   59    0   3:13:46 0.1% java/41285
  9116 bea      3109M 2592M sleep   59    0   4:27:32 0.1% java/48
  9116 bea      3109M 2592M sleep   59    0   5:25:28 0.1% java/30733

Thread Dump and PRSTAT correlation approach

In order to correlate the PRSTAT data please follow the steps below:

-         Identify the primary culprit / contributor Thread(s) and locate the Thread Id (DECIMAL format) in the last column
-         Convert the decimal value to HEXA, since the Thread Dump native Thread id (nid) is in HEXA format
-         Search your captured Thread Dump data and locate the Thread native Id >> nid=<Thread Id HEXA>

For our case study:

-         Thread Id #76 identified as the primary culprit
-         Thread Id #76 was converted to HEXA format >> 0x4c
-         Thread Dump data Thread native Id >> nid=0x4c
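
The same correlation can be scripted in a couple of lines, as sketched below; the thread dump file name is an assumption.

## Sketch: convert the PRSTAT LWP id to HEXA and locate the matching thread in the dump
LWP_ID=76
NID_HEX=`printf '0x%x' $LWP_ID`             # 76 -> 0x4c
grep "nid=$NID_HEX " threaddump_osb.txt     # matches "pool-2-thread-1" ... nid=0x4c ...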

This correlation confirmed that the rogue Thread was actively processing (reading/writing to the File Store) and allocating a large amount of memory on the Java Heap in a short amount of time.

Root cause

The above analysis, along with the findings from Mark Smith's Oracle Blog article, confirmed the root cause.

Oracle Service Bus allows Alert actions to be configured within the message flow (pipeline alerts). These pipeline alert actions generate alerts, based on the message context in a pipeline, to be sent to an alert destination. Such an alert destination is the actual Weblogic diagnostic File Store, which means this structure will grow over time depending on the volume of Alerts that your OSB application is generating.

It is located under >> //domain_name/servers/server_name/data/store/diagnostics/
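
A quick way to verify the current on-disk size of this diagnostic File Store is sketched below; <DOMAIN_HOME> is a placeholder for your own Weblogic domain root directory.

## Sketch: check the size of the Weblogic diagnostic File Store
du -sh <DOMAIN_HOME>/servers/server_name/data/store/diagnostics/
ls -l  <DOMAIN_HOME>/servers/server_name/data/store/diagnostics/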

In our case, the File Store size was around 800 MB.

Such an increase of the diagnostic File Store size over time leads to an increase of the elapsed time of the Thread involved in read/write operations, which allocates a large amount of memory on the Java Heap. Such memory cannot be garbage collected until Thread completion, which leads to OutOfMemoryError and performance degradation.

Solution

Two actions were required to resolve this problem:

-         Reset the diagnostic File Store by renaming the existing data file and forcing Weblogic to create a fresh one
-         Review and reduce the Oracle Service Bus alerting level to the minimum acceptable level

The reset of the diagnostic File Store (see the sketch below) brought immediate relief by ensuring short and optimal diagnostic File Store operations and reducing the impact on the Java Heap.
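
For reference, the reset itself is a simple rename of the data file performed with the managed server stopped; the file name below (WLS_DIAGNOSTICS000000.DAT) is the usual default but should be treated as an assumption and verified against the content of your own directory.

## Sketch: reset the diagnostic File Store (managed server stopped)
cd <DOMAIN_HOME>/servers/server_name/data/store/diagnostics/
mv WLS_DIAGNOSTICS000000.DAT WLS_DIAGNOSTICS000000.DAT.old   # Weblogic recreates a fresh store at next startup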

The level of OSB alerting is still in review and will be reduced shortly in order to prevent this problem at the source.

Regular monitoring and purging of the diagnostic File Store will also be implemented in order to prevent such performance degradation going forward; as per Oracle recommendations.

Conclusion and recommendations

I hope this article has helped you understand how to identify and resolve this Oracle Service Bus problem, and appreciate how powerful Thread Dump analysis and the Solaris PRSTAT tool can be to pinpoint this type of OutOfMemoryError problem.

Please do not hesitate to post any comment or question.