Java EE Support Patterns: Oracle Weblogic

12.17.2012

QOTD: Weblogic threads

This is the first post of a new series that will bring you frequent and short “question of the day” articles. This new format will complement my existing writing and is designed to provide you with fast answers for common problems and questions that I often get from work colleagues, IT clients and readers.

Each of these QOTD posts will be archived and you will have access to the entire list via a separate page. I also encourage you to post your feedback and recommendations, along with your own complementary answer for each question.

Now let’s get started with our first question of the day!

Question:

What is the difference between Oracle Weblogic thread states & attributes (Total, Standby, Active, idle, Hogging, Stuck)?

Answer:

Weblogic thread states can be a particularly confusing topic for Weblogic administrators and individuals starting to learn and monitor Weblogic threads. The thread monitoring section can be accessed for each managed server under the Monitoring > Threads tab.

As you can see, the thread monitoring tab provides a complete view of each Weblogic thread along with its state. Now let’s review each section and state so you can properly understand how to assess the health of your Weblogic threads.

# Summary section

  • Execute Thread Total Count: This is the total number of threads “created” from the Weblogic self-tuning pool and visible in the JVM Thread Dump. This value corresponds to the sum of Active + Standby threads
  • Active Execute Threads: This is the number of threads “eligible” to process a request. When thread demand goes up, Weblogic promotes threads from Standby to Active state, which enables them to process future client requests
  • Standby Thread Count: This is the number of threads waiting to be marked “eligible” to process client requests. These threads are created and visible in the JVM Thread Dump but not yet available to process a client request
  • Hogging Thread Count: This is the number of threads taking much more time than the current average execution time calculated by the Weblogic kernel
  • Execute Thread Idle Count: This is the number of Active threads currently “available” to process a client request

In the above snapshots, we have:

  • A total of 43 threads: 29 in Standby state and 14 in Active state
  • Out of the 14 Active threads, we have 1 Hogging thread and 7 Idle threads, i.e. 7 threads “available” for request processing
  • Another way to see the situation: we have a total of 7 threads currently “processing” client requests, with 1 out of 7 in the Hogging state (i.e. taking more time than the current calculated average)
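
If you prefer to track these counters outside of the admin console, e.g. for alerting purposes, the same values are exposed via JMX through the Weblogic ThreadPoolRuntime MBean. Below is a minimal Java sketch, assuming remote JMX access to the managed server runtime MBean server; the host, port, credentials and server name (MS1) are placeholders you would need to adapt, and the Weblogic JMX client libraries (e.g. wljmxclient.jar) must be on the classpath:

import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import javax.naming.Context;

public class ThreadPoolMonitor {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: adjust host & port for your managed server
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:iiop://localhost:7001/jndi/weblogic.management.mbeanservers.runtime");
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.SECURITY_PRINCIPAL, "weblogic");    // placeholder user
        env.put(Context.SECURITY_CREDENTIALS, "password");  // placeholder password
        env.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");

        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection con = connector.getMBeanServerConnection();
            // MS1 is a placeholder managed server name
            ObjectName pool = new ObjectName(
                "com.bea:ServerRuntime=MS1,Name=ThreadPoolRuntime,Type=ThreadPoolRuntime");
            System.out.println("Total:   " + con.getAttribute(pool, "ExecuteThreadTotalCount"));
            System.out.println("Idle:    " + con.getAttribute(pool, "ExecuteThreadIdleCount"));
            System.out.println("Standby: " + con.getAttribute(pool, "StandbyThreadCount"));
            System.out.println("Hogging: " + con.getAttribute(pool, "HoggingThreadCount"));
        } finally {
            connector.close();
        }
    }
}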

# Thread matrix

This matrix gives you a view of each thread along with its current state. There is one more state that you must also understand:

  • STUCK: A thread is flagged as stuck by Weblogic when it has been executing its current request for longer than the configured stuck thread time (default is 600 seconds). When facing slowdown conditions, you will normally see Weblogic threads transitioning from the Hogging state to STUCK, depending on how long these threads remain stuck executing their current request

3.20.2012

Java Thread deadlock – Case Study

This article will describe the complete root cause analysis of a recent Java deadlock problem observed from a Weblogic 11g production system running on the IBM JVM 1.6.

This case study will also demonstrate the importance of mastering Thread Dump analysis skills; including for the IBM JVM Thread Dump format.

Environment specifications

-        Java EE server: Oracle Weblogic Server 11g & Spring 2.0.5
-        OS: AIX 5.3
-        Java VM: IBM JRE 1.6.0
-        Platform type: Portal & ordering application

Monitoring and troubleshooting tools

-        JVM Thread Dump (IBM JVM format)
-        Compuware Server Vantage (Weblogic JMX monitoring & alerting)

Problem overview

A major stuck Threads problem was reported by Compuware Server Vantage, affecting 2 of our Weblogic 11g production managed servers and causing application impact and timeout conditions for our end users.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? MEDIUM (only 2 managed servers / JVM affected out of 16)
·        Recent change of the affected platform? Yes (new JMS related asynchronous component)
·        Any recent traffic increase to the affected platform? No
·        How does this problem manifest itself? A sudden increase of Threads was observed, leading to rapid Thread depletion
·        Did a Weblogic managed server restart resolve the problem? Yes, but the problem returned after a few hours (unpredictable & intermittent pattern)

-        Conclusion #1: The problem is related to an intermittent stuck Threads behaviour affecting only a few Weblogic managed servers at a time
-        Conclusion #2: Since the problem is intermittent, a global root cause such as a non-responsive downstream system is not likely

Thread Dump analysis – first pass

The first thing to do when dealing with stuck Thread problems is to generate a JVM Thread Dump. This is a golden rule regardless of your environment specifications & problem context. A JVM Thread Dump snapshot provides you with crucial information about the active Threads and what type of processing / tasks they are performing at that time.
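
As a side note, if you only need a quick programmatic snapshot of what every thread is doing inside a JVM you control, the standard java.lang.Thread API can produce a basic equivalent of a Thread Dump; a minimal sketch:

import java.util.Map;

public class ThreadDumper {
    public static void main(String[] args) {
        // Prints the name, state and stack of every live thread in the current JVM;
        // a lightweight complement to kill -3 / javacore output
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            System.out.println("\"" + thread.getName() + "\" state: " + thread.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("\tat " + frame);
            }
        }
    }
}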

Now back to our case study: an IBM JVM Thread Dump (javacore.xyz format) was generated, which revealed the following Java Thread deadlock condition:

1LKDEADLOCK    Deadlock detected !!!
NULL           ---------------------
NULL          
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CC08B00)
3LKDEADLOCKWTR    is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x0000000126171DF8 infl_mon_t: 0x0000000126171E38:
4LKDEADLOCKOBJ      weblogic/jms/frontend/FESession@0x07000000198048C0/0x07000000198048D8:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012E560500)
3LKDEADLOCKWTR    which is waiting for:
4LKDEADLOCKMON      sys_mon_t:0x000000012884CD60 infl_mon_t: 0x000000012884CDA0:
4LKDEADLOCKOBJ      weblogic/jms/frontend/FEConnection@0x0700000019822F08/0x0700000019822F20:
3LKDEADLOCKOWN    which is owned by:
2LKDEADLOCKTHR  Thread "[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CC08B00)

This deadlock situation can be translated as per below:

-        Weblogic Thread #8 is waiting to acquire an Object monitor lock owned by Weblogic Thread #10
-        Weblogic Thread #10 is waiting to acquire an Object monitor lock owned by Weblogic Thread #8

Conclusion: both Weblogic Threads #8 & #10 are waiting on each other; forever!

Now before going any deeper into this root cause analysis, let me provide you with a high-level overview of Java Thread deadlocks.

Java Thread deadlock overview

Most of you are probably familiar with Java Thread deadlock principles, but have you ever actually experienced a true deadlock problem?

From my experience, true Java deadlocks are rare; I have only seen ~5 occurrences over the last 10 years. The reason is that most stuck Thread problems are due to Thread hanging conditions (waiting on a remote IO call etc.) and do not involve a true deadlock condition with other Thread(s).

A Java Thread deadlock is a situation where, for example, Thread A is waiting to acquire an Object monitor lock held by Thread B, which is itself waiting to acquire an Object monitor lock held by Thread A. Both Threads will wait for each other forever. This situation can be visualized as per the below diagram:


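Since the original diagram cannot be reproduced here, find below a minimal, self-contained Java sketch (hypothetical code, not from the affected application) that recreates this exact pattern: two threads acquiring the same two monitor locks in opposite order. Running it and capturing a Thread Dump (kill -3) will show the same deadlock detection output as above:

public class DeadlockDemo {
    private static final Object lockA = new Object();
    private static final Object lockB = new Object();

    public static void main(String[] args) {
        // Thread A: acquires lockA first, then waits forever for lockB
        new Thread(new Runnable() {
            public void run() {
                synchronized (lockA) {
                    pause(100); // give Thread B time to grab lockB
                    synchronized (lockB) { }
                }
            }
        }, "Thread-A").start();

        // Thread B: acquires lockB first, then waits forever for lockA
        new Thread(new Runnable() {
            public void run() {
                synchronized (lockB) {
                    pause(100); // give Thread A time to grab lockA
                    synchronized (lockA) { }
                }
            }
        }, "Thread-B").start();
    }

    private static void pause(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
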
Thread deadlock is confirmed…now what can you do?

Once the deadlock is confirmed (most JVM Thread Dump implementations will highlight it for you), the next step is to perform a deeper dive analysis by reviewing each Thread involved in the deadlock situation along with their current task & wait condition.
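
Note that, in addition to reading the Thread Dump, a monitor deadlock can also be confirmed programmatically via the standard java.lang.management API, which can be useful for automated health checks; a minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockDetector {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns the ids of threads deadlocked on object monitors or
        // ownable synchronizers, or null if no deadlock exists
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No deadlock detected");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.println(info.getThreadName() + " is blocked on " + info.getLockName()
                + " owned by " + info.getLockOwnerName());
        }
    }
}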

Find below the partial Thread Stack Trace from our problem case for each Thread involved in the deadlock condition:

** Please note that the real application Java package name was renamed for confidentiality purposes **

Weblogic Thread #8

"[STUCK] ExecuteThread: '8' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000012CC08B00, j9thread_t:0x00000001299E5100, java/lang/Thread:0x070000001D72EE00, state:B, prio=1
(native thread ID:0x111200F, native priority:0x1, native policy:UNKNOWN)
Java callstack:
       at weblogic/jms/frontend/FEConnection.stop(FEConnection.java:671(Compiled Code))
       at weblogic/jms/frontend/FEConnection.invoke(FEConnection.java:1685(Compiled Code))
       at weblogic/messaging/dispatcher/Request.wrappedFiniteStateMachine(Request.java:961(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.syncRequest(DispatcherImpl.java:184(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.dispatchSync(DispatcherImpl.java:212(Compiled Code))
       at weblogic/jms/dispatcher/DispatcherAdapter.dispatchSync(DispatcherAdapter.java:43(Compiled Code))
       at weblogic/jms/client/JMSConnection.stop(JMSConnection.java:863(Compiled Code))
       at weblogic/jms/client/WLConnectionImpl.stop(WLConnectionImpl.java:843)
       at org/springframework/jms/connection/SingleConnectionFactory.closeConnection(SingleConnectionFactory.java:342)
       at org/springframework/jms/connection/SingleConnectionFactory.resetConnection(SingleConnectionFactory.java:296)
       at org/app/JMSReceiver.receive()
……………………………………………………………………

Weblogic Thread #10

"[STUCK] ExecuteThread: '10' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000012E560500, j9thread_t:0x000000012E35BCE0, java/lang/Thread:0x070000001ECA9200, state:B, prio=1
 (native thread ID:0x4FA027, native priority:0x1, native policy:UNKNOWN)
Java callstack:
       at weblogic/jms/frontend/FEConnection.getPeerVersion(FEConnection.java:1381(Compiled Code))
       at weblogic/jms/frontend/FESession.setUpBackEndSession(FESession.java:755(Compiled Code))
       at weblogic/jms/frontend/FESession.consumerCreate(FESession.java:1025(Compiled Code))
       at weblogic/jms/frontend/FESession.invoke(FESession.java:2995(Compiled Code))
       at weblogic/messaging/dispatcher/Request.wrappedFiniteStateMachine(Request.java:961(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.syncRequest(DispatcherImpl.java:184(Compiled Code))
       at weblogic/messaging/dispatcher/DispatcherImpl.dispatchSync(DispatcherImpl.java:212(Compiled Code))
       at weblogic/jms/dispatcher/DispatcherAdapter.dispatchSync(DispatcherAdapter.java:43(Compiled Code))
       at weblogic/jms/client/JMSSession.consumerCreate(JMSSession.java:2982(Compiled Code))
       at weblogic/jms/client/JMSSession.setupConsumer(JMSSession.java:2749(Compiled Code))
       at weblogic/jms/client/JMSSession.createConsumer(JMSSession.java:2691(Compiled Code))
       at weblogic/jms/client/JMSSession.createReceiver(JMSSession.java:2596(Compiled Code))
       at weblogic/jms/client/WLSessionImpl.createReceiver(WLSessionImpl.java:991(Compiled Code))
       at org/springframework/jms/core/JmsTemplate102.createConsumer(JmsTemplate102.java:204(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.doReceive(JmsTemplate.java:676(Compiled Code))
       at org/springframework/jms/core/JmsTemplate$10.doInJms(JmsTemplate.java:652(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.execute(JmsTemplate.java:412(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.receiveSelected(JmsTemplate.java:650(Compiled Code))
       at org/springframework/jms/core/JmsTemplate.receiveSelected(JmsTemplate.java:641(Compiled Code))
       at org/app/JMSReceiver.receive()
……………………………………………………………

As you can see in the above Thread Stack Traces, the deadlock originated from our application code, which uses the Spring framework API for the JMS consumer implementation (very useful when not using MDBs). The Stack Traces are quite revealing: both Threads are in a race condition against the same Weblogic JMS consumer session / connection, leading to a deadlock situation:

-        Weblogic Thread #8 is attempting to reset and close the current JMS connection
-        Weblogic Thread #10 is attempting to use the same JMS Connection / Session in order to create a new JMS consumer
-        Thread deadlock is triggered!

Root cause: non-Thread-safe Spring JMS SingleConnectionFactory implementation

A code review and a quick search of the Spring JIRA bug database revealed the following Thread safety defect, correlating perfectly with the above analysis:

# SingleConnectionFactory's resetConnection is causing deadlocks with underlying OracleAQ's JMS connection

A patch for the Spring SingleConnectionFactory was released back in 2009, adding a proper synchronized{} block in order to prevent Thread deadlock in the event of a JMS Connection reset operation:

synchronized (connectionMonitor) {
  //if condition added to avoid possible deadlocks when trying to reset the target connection
  if (!started) {
    this.target.start();
    started = true;
  }
}

Solution

Our team is currently planning to integrate this Spring patch into our production environment shortly. The initial tests performed in our test environment are positive.

Conclusion

I hope this case study has helped you understand a real-life Java Thread deadlock problem and how proper Thread Dump analysis skills allow you to quickly pinpoint the root cause of stuck Thread problems at the code level. Please don’t hesitate to post any comment or question.

2.22.2012

Weblogic Free Consultation

This post is to inform you that I provide free Weblogic consulting services via this Blog’s root cause analysis forum & email. I will do my very best to provide you with guidance and share my tips on how to resolve common Weblogic problems, along with best practices.

My expertise and background are as per below:

·        10 years of experience with development, support and hardening of Java EE systems and currently working as a full time employee at CGI Inc. Canada
·        Deep understanding and experience with Java EE containers such as Oracle Weblogic, Red Hat JBoss, Tomcat etc.
·        Deep understanding and experience of JVM tuning (Sun HotSpot, IBM JRE, Oracle JRockit)
·        Specialized in root cause analysis, performance tuning and advanced troubleshooting

Why is my IT consulting service from this Blog free? My goal is to create a solid knowledge base of Java EE support patterns on this Blog and share it with Java EE production support individuals across the world. You can reward me simply by sharing this Blog with your work colleagues and friends. With your approval, I might also use your problem case to create an article.
 
Weblogic consulting – a simple 2-step process

·        Email me @phcharbonneau with your captured performance data or a download link (Thread Dump, app & server logs etc. captured during your incident). Please also provide the specifications of your environment (Java EE server vendor & version, Java VM vendor & version etc.) and a high-level description of your problem.
·        Wait for my initial reply. I will do my best to provide you with a quick answer in less than 24 hours. An IM session can be scheduled, if required.
  
I'm looking forward to hearing from you!

Regards,

2.08.2012

Too many open files – Case Study

This case study describes the complete root cause analysis and resolution of a File Descriptor (Too many open files) related problem that we faced following a migration from Oracle ALSB 2.6 running on Solaris OS to Oracle OSB 11g running on AIX.

This article will also provide you with proper AIX OS commands you can use to troubleshoot and validate the File Descriptor configuration of your Java VM process.

Environment specifications

-        Java EE server: Oracle Service Bus 11g
-        Middleware OS: IBM AIX 6.1
-        Java VM: IBM JRE 1.6.0 SR9 – 64 bit
-        Platform type: Service Bus – Middle Tier

Problem overview

-        Problem type: a java.net.SocketException: Too many open files error was observed under heavy load, causing our Oracle OSB managed servers to suddenly hang

This problem was observed only during high load and required our support team to take corrective action, e.g. shutting down and restarting the affected Weblogic OSB managed servers.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH; Full JVM hang
·        Recent change of the affected platform? Yes, recent migration from ALSB 2.6 (Solaris OS) to Oracle OSB 11g (AIX OS)
·        Any recent traffic increase to the affected platform? No
·        What is the health of the Weblogic server? Affected managed servers were no longer responsive, along with closure of the Weblogic HTTP (Server Socket) port
·        Did a restart of the Weblogic Integration server resolve the problem? Yes, but only temporarily

-        Conclusion #1: The problem appears to be load related

Weblogic server log files review

A quick review of the affected managed servers’ logs revealed the error below:

java.net.SocketException: Too many open files

This error indicates that our Java VM process was running out of File Descriptors. This is a severe condition that affects the whole Java VM process and causes Weblogic to close its internal Server Socket port (HTTP/HTTPS port), preventing any further inbound & outbound communication to the affected managed server(s).
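
For illustration, the condition itself is easy to reproduce in isolation: the hypothetical sketch below simply opens file handles until the process rlimit is reached. Do not run this against a production JVM:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class FdExhaustionDemo {
    public static void main(String[] args) {
        List<FileInputStream> handles = new ArrayList<FileInputStream>();
        try {
            while (true) {
                // Each successful open consumes one File Descriptor of the process
                handles.add(new FileInputStream("/etc/hosts"));
            }
        } catch (IOException e) {
            // Once the rlimit is reached, this fails with "Too many open files"
            System.out.println("Failed after " + handles.size() + " open files: " + e.getMessage());
        }
    }
}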

File Descriptor – Why so important for an Oracle OSB environment?

The File Descriptor capacity is quite important for your Java VM process. The key concept you must understand is that File Descriptors are required not only for pure File Handles but also for inbound and outbound Socket communication. Each new Java Socket created to (inbound) or from (outbound) your Java VM by the Weblogic kernel Socket Muxer requires a File Descriptor allocation at the OS level.

An Oracle OSB environment can require a significant number of Sockets, depending on how much inbound load it receives and how many outbound connections (Java Sockets) it has to create in order to send and receive data from external / downstream systems (System End Points).

For that reason, you must ensure that you allocate enough File Descriptors / Sockets to your Java VM process in order to support your daily load, including problematic scenarios such as a sudden slowdown of external systems, which typically increases the demand on File Descriptor allocation.

Runtime File Descriptor capacity check for Java VM and AIX OS

Following the discovery of this error, our technical team performed a quick review of the observed runtime File Descriptor capacity & utilization of our OSB Java VM processes. This can be done easily via the AIX procfiles <Java PID> | grep rlimit and lsof -p <Java PID> | wc -l commands, as per the below example:

## Java VM process File Descriptor total capacity

>> procfiles 5425732 | grep rlimit
  Current rlimit: 2000 file descriptors

## Java VM process File Descriptor current utilization

>> lsof -p <Java PID> | wc -l
  1920

As you can see, the current capacity was found to be 2000, which is quite low for a medium-size Oracle OSB environment. The average utilization under heavy load was also found to be quite close to this upper limit of 2000.

The next step was to verify the default AIX OS File Descriptor limit via the ulimit -S -n command:

>> ulimit -S -n
  2000

-        Conclusion #2: The current File Descriptor limit for both the OS and the OSB Java VM was quite low and set at 2000. The File Descriptor utilization was also found to be quite close to the upper limit, which explains why so many JVM failures were observed at peak load
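
As a complement to the above OS commands, the File Descriptor capacity & utilization can also be sampled from within the Java VM itself via the com.sun.management extension, where the JVM vendor exposes it (availability varies per JVM vendor & version, so treat this as an optional sketch):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdMonitor {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // The Unix-specific sub-interface exposes FD counters on supporting JVMs
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unixOs =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            System.out.println("Open File Descriptors: " + unixOs.getOpenFileDescriptorCount());
            System.out.println("Max File Descriptors:  " + unixOs.getMaxFileDescriptorCount());
        } else {
            System.out.println("FD counters not exposed by this JVM");
        }
    }
}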

Weblogic File Descriptor configuration review

The File Descriptor limit can typically be overwritten when you start your Weblogic Java VM. Such configuration is managed by the WLS core layer; the relevant script can be found at the following location:

<WL_HOME>/wlserver_10.3/common/bin/commEnv.sh

..................................................
resetFd() {
  if [ ! -n "`uname -s |grep -i cygwin || uname -s |grep -i windows_nt || \
       uname -s |grep -i HP-UX`" ]
  then
    ofiles=`ulimit -S -n`
    maxfiles=`ulimit -H -n`
    if [ "$?" = "0" -a  `expr ${maxfiles} : '[0-9][0-9]*$'` -eq 0 -a `expr ${ofiles} : '[0-9][0-9]*$'` -eq 0 ]; then
      ulimit -n 4096
    else
      if [ "$?" = "0" -a `uname -s` = "SunOS" -a `expr ${maxfiles} : '[0-9][0-9]*$'` -eq 0 ]; then
        if [ ${ofiles} -lt 65536 ]; then
          ulimit -H -n 65536
        else
          ulimit -H -n 4096
        fi
      fi
    fi
  fi
.................................................

Root cause: File Descriptor override only working for Solaris OS!

As you can see in the script above, the override of the File Descriptor limit via ulimit is only applied for Solaris OS (SunOS), which explains why our OSB Java VM running on AIX OS ended up with the default value of 2000, vs. our older ALSB 2.6 environment running on Solaris OS, which had a File Descriptor limit of 65536.


Solution: script tweaking for AIX OS

The resolution of this problem was achieved by modifying the Weblogic commEnv script as per below. This change ensured a configuration of 65536 File Descriptors (up from 2000), including for the AIX OS:


** Please note that the activation of any change to the Weblogic File Descriptor configuration requires a restart of both the Node Manager (if used) and the managed servers. **

A runtime validation was also performed following the activation of the new configuration, which confirmed the new active File Descriptor limit:

>> procfiles 6416839 | grep rlimit
  Current rlimit: 65536 file descriptors

No failure has been observed since then.

Conclusion and recommendations

-        When upgrading your Weblogic Java EE container to a new version, please ensure that you verify your current File Descriptor limit, as per the above case study
-        From a capacity planning perspective, please ensure that you monitor your File Descriptor utilization on a regular basis in order to identify any potential capacity problem, Socket leak etc.

Please don’t hesitate to post any comment or question on this subject if you need any additional help.

12.23.2011

HashMap infinite loop problem – Case Study

This article will provide you with the complete root cause analysis and solution of a java.util.HashMap infinite loop problem affecting an Oracle OSB 11g environment running on the IBM JRE 1.6 JVM.

This case study will also demonstrate how you can combine the AIX ps -mp command and Thread Dump analysis to pinpoint your top CPU contributor Threads within your Java VM(s). It will also demonstrate how dangerous using a non-Thread-safe HashMap data structure can be within a multi-Threaded environment / Java EE container.

Environment specifications

-        Java EE server: Oracle Service Bus 11g
-        Middleware OS: AIX 6.1
-        Java VM: IBM JRE 1.6 SR9 – 64-bit
-        Platform type: Service Bus

Monitoring and troubleshooting tools

-        AIX nmon & topas (CPU monitoring)
-        AIX ps -mp (CPU and Thread breakdown OS command)
-        IBM JVM Java core / Thread Dump (thread analysis and ps -mp data correlation)

Problem overview

-        Problem type: Very High CPU observed from our production environment

A high CPU problem was observed via AIX nmon monitoring of the system hosting a Weblogic Oracle Service Bus 11g middleware environment.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH
·        Recent change of the affected platform? Yes, platform was recently migrated from Oracle ALSB 2.6 (Solaris & HotSpot 1.5) to Oracle OSB 11g (AIX OS & IBM JRE 1.6)
·        Any recent traffic increase to the affected platform? No
·        How does this high CPU manifest itself? A sudden CPU increase was observed and did not go down, even after the load went down, i.e. to a near-zero level
·        Did an Oracle OSB recycle resolve the problem? Yes, but the problem returned after a few hours or a few days (unpredictable pattern)

-        Conclusion #1: The high CPU problem appears to be intermittent rather than purely correlated with load
-        Conclusion #2: Since the high CPU remains after the load goes down, this indicates either that a JVM threshold was triggered past a point of no return, and / or the presence of some hung or infinitely looping Threads

AIX CPU analysis

The AIX nmon & topas OS commands were used to monitor the CPU utilization of the system and the Java process. The CPU utilization was confirmed to go as high as 100% (saturation level).

Such a high CPU level remained until the JVM was recycled.

AIX CPU Java Thread breakdown analysis

One of the best troubleshooting approaches for this type of issue is to generate an AIX ps -mp snapshot combined with a Thread Dump. This was achieved by executing the command below:

ps -mp <Java PID> -o THREAD

Then immediately execute:

kill -3 <Java PID>

** This will generate an IBM JRE Thread Dump / Java core file (javacorexyz..) **

The AIX ps -mp command output was generated as per below:

USER      PID     PPID       TID ST  CP PRI SC    WCHAN        F     TT BND COMMAND
user 12910772  9896052         - A    97  60 98        *   342001      -   - /usr/java6_64/bin/java -Dweblogic.Nam
-        -        -   6684735 S    0  60  1 f1000f0a10006640  8410400      -   - -
-        -        -   6815801 Z    0  77  1        -   c00001      -   - -
-        -        -   6881341 Z    0 110  1        -   c00001      -   - -
-        -        -   6946899 S    0  82  1 f1000f0a10006a40  8410400      -   - -
-        -        -   8585337 S    0  82  1 f1000f0a10008340  8410400      -   - -
-        -        -   9502781 S    2  82  1 f1000f0a10009140  8410400      -   - -
-        -        -  10485775 S    0  82  1 f1000f0a1000a040  8410400      -   - -
-        -        -  10813677 S    0  82  1 f1000f0a1000a540  8410400      -  
-        -        -  21299315 S    95  62  1 f1000a01001d0598   410400      -   - -
-        -        -  25493513 S    0  82  1 f1000f0a10018540  8410400      -   - -
-        -        -  25690227 S    0  86  1 f1000f0a10018840  8410400      -   - -
-        -        -  25755895 S    0  82  1 f1000f0a10018940  8410400      -   - -
-        -        -  26673327 S    2  82  1 f1000f0a10019740  8410400      -  

As you can see in the above snapshot, 1 primary culprit Thread Id (21299315) was found consuming ~95% of the entire CPU.

Thread Dump analysis and ps -mp correlation

Once the primary culprit Thread was identified, the next step was to correlate this data with the Thread Dump data and identify the source / culprit at the code level.

But first, we had to convert the Thread Id from decimal to hexadecimal format, since IBM JRE Thread Dumps print native Thread Ids in hexadecimal format.

Culprit Thread Id 21299315 >> 0x1450073 (hexadecimal format)
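
This conversion is trivial to automate; for example, a small Java helper (the hard-coded TID is from the above ps -mp output):

public class TidToHex {
    public static void main(String[] args) {
        long tid = 21299315L; // TID column from the AIX ps -mp output
        // IBM javacore files print the native thread ID in hexadecimal
        System.out.println("0x" + Long.toHexString(tid)); // prints 0x1450073
    }
}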

A quick search within the generated Thread Dump file revealed the culprit Thread, as per below.

The Weblogic ExecuteThread #97 Stack Trace is as follows:

3XMTHREADINFO      "[STUCK] ExecuteThread: '97' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x00000001333FFF00, j9thread_t:0x0000000117C00020, java/lang/Thread:0x0700000043184480, state:CW, prio=1
3XMTHREADINFO1            (native thread ID:0x1450073, native priority:0x1, native policy:UNKNOWN)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/util/HashMap.findNonNullKeyEntry(HashMap.java:528(Compiled Code))
4XESTACKTRACE                at java/util/HashMap.putImpl(HashMap.java:624(Compiled Code))
4XESTACKTRACE                at java/util/HashMap.put(HashMap.java:607(Compiled Code))
4XESTACKTRACE                at weblogic/socket/utils/RegexpPool.add(RegexpPool.java:20(Compiled Code))
4XESTACKTRACE                at weblogic/net/http/HttpClient.resetProperties(HttpClient.java:129(Compiled Code))
4XESTACKTRACE                at weblogic/net/http/HttpClient.openServer(HttpClient.java:374(Compiled Code))
4XESTACKTRACE                at weblogic/net/http/HttpClient.New(HttpClient.java:252(Compiled Code))
4XESTACKTRACE                at weblogic/net/http/HttpURLConnection.connect(HttpURLConnection.java:189(Compiled Code))
4XESTACKTRACE                at com/bea/wli/sb/transports/http/HttpOutboundMessageContext.send(HttpOutboundMessageContext.java(Compiled Code))
4XESTACKTRACE                at com/bea/wli/sb/transports/http/wls/HttpTransportProvider.sendMessageAsync(HttpTransportProvider.java(Compiled Code))
4XESTACKTRACE                at sun/reflect/GeneratedMethodAccessor2587.invoke(Bytecode PC:58(Compiled Code))
4XESTACKTRACE                at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37(Compiled Code))
4XESTACKTRACE                at java/lang/reflect/Method.invoke(Method.java:589(Compiled Code))
4XESTACKTRACE                at com/bea/wli/sb/transports/Util$1.invoke(Util.java(Compiled Code))
4XESTACKTRACE                at $Proxy115.sendMessageAsync(Bytecode PC:26(Compiled Code))
4XESTACKTRACE                at com/bea/wli/sb/transports/LoadBalanceFailoverListener.sendMessageAsync(LoadBalanceFailoverListener.java:141(Compiled Code))
4XESTACKTRACE                at com/bea/wli/sb/transports/LoadBalanceFailoverListener.onError(LoadBalanceFailoverListener.java(Compiled Code))
4XESTACKTRACE                at com/bea/wli/sb/transports/http/wls/HttpOutboundMessageContextWls$RetrieveHttpResponseWork.handleResponse(HttpOutboundMessageContextWls.java(Compiled Code))
4XESTACKTRACE                at weblogic/net/http/AsyncResponseHandler$MuxableSocketHTTPAsyncResponse$RunnableCallback.run(AsyncResponseHandler.java:531(Compiled Code))
4XESTACKTRACE                at weblogic/work/ContextWrap.run(ContextWrap.java:41(Compiled Code))
4XESTACKTRACE                at weblogic/work/SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:528(Compiled Code))
4XESTACKTRACE                at weblogic/work/ExecuteThread.execute(ExecuteThread.java:203(Compiled Code))
4XESTACKTRACE                at weblogic/work/ExecuteThread.run(ExecuteThread.java:171(Compiled Code))

Thread Dump analysis – HashMap infinite loop condition!

As you can see from the above Thread Stack Trace of Thread #97, the Thread is currently stuck in an infinite loop / Thread race condition over a java.util.HashMap object (IBM JRE implementation).

This finding was quite interesting given that this HashMap is actually created / owned by the Weblogic 11g kernel code itself >> weblogic/socket/utils/RegexpPool

Root cause: non-Thread-safe HashMap in Weblogic 11g (10.3.5.0) code!

Following this finding and data gathering exercise, our team created an SR with Oracle support, which confirmed this defect within the Weblogic 11g code base.

As you may already know, usage of a non-Thread-safe / non-synchronized HashMap under concurrent Thread conditions is very dangerous and can easily lead to internal HashMap index corruption and / or infinite looping. This is also a golden rule for any middleware software, such as Oracle Weblogic, IBM WAS and Red Hat JBoss, which rely heavily on HashMap data structures for various Java EE and caching services.

The most common solution is to use the ConcurrentHashMap data structure, which is designed for this type of concurrent Thread execution context.
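
For illustration only (the actual Weblogic patch source is not public), a minimal sketch of the safe pattern, using a hypothetical pool keyed by String patterns:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeRegexpPool {
    // A plain HashMap can corrupt its internal bucket chains under concurrent
    // put() calls, which may manifest as an infinite loop inside get()/put().
    // ConcurrentHashMap provides thread-safe, lock-striped access instead.
    private final Map<String, Object> pool = new ConcurrentHashMap<String, Object>();

    public void add(String pattern, Object value) {
        pool.put(pattern, value); // safe under concurrent Threads
    }

    public Object get(String pattern) {
        return pool.get(pattern);
    }
}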

Solution

Since this problem was also affecting other Oracle Weblogic 11g customers, Oracle support was quite fast in providing us with a patch for our target WLS 11g version. Please find the patch description and details below:

Content:
========
This patch contains Smart Update patch AHNT for WebLogic Server 10.3.5.0

Description:
============
HIGH CPU USAGE AT HASHMAP.PUT() IN REGEXPPOOL.ADD()

Patch Installation Instructions:
================================
- copy content of this zip file with the exception of README file to your SmartUpdate cache directory (MW_HOME/utils/bsu/cache_dir by default)
- apply patch using Smart Update utility

Conclusion

I hope this case study has helped you understand how to pinpoint culprit high CPU Threads at the code level when using AIX & the IBM JRE, and the importance of proper Thread-safe data structures for highly concurrent Thread / processing applications.

Please don’t hesitate to post any comment or question.