Impact2012 - DataPower Troubleshooting PDF

DataPower Troubleshooting TSE-1116

Matthias David Siebler


DataPower L3 Team Lead

Session Number: 1116


Title: Troubleshooting DataPower Appliances in the Field
Abstract: The WebSphere DataPower family of appliances has a wealth of tools available for troubleshooting problems in the field. However, putting all these tools together is a difficult task for even experienced developers of the platform. Customers are often overwhelmed by the volume of data and how to interpret the data properly. This session will describe how to systematically approach troubleshooting several common scenarios.
Track: SOA, Connectivity & Integration

Agenda

Introduction
Must-Gather & Error Reports
Packet Captures
Status Providers
Out-of-Memory (OoM)
Large Debug Logs
Advanced Techniques
Q&A
Summary

Introduction
As a closed system, the primary responsibility for DataPower problem determination lies with the
IBM support team.

Historically, little enablement of client self diagnosis and repair has been part of the architecture for
DataPower.

To facilitate improvements in problem determination, known as RAS (Reliability, Availability, and Serviceability), we are:

building our software to provide the data and information required to resolve problems when they occur: FFDC (First Failure Data Capture)

defining tools, best practices, and standards that allow the efficient analysis of problems within a product or solution by the system, customer, or IBM Support

analyzing problems to continually modify our processes and procedures to improve software quality and prevent problems from occurring.

Must Gather
http://www-01.ibm.com/support/docview.wss?uid=swg21515489
Error reports contain most status providers

Some cannot fit into the error report due to size or time constraints
Error report content is continually being updated & improved
Reports can be useful even some time after the event
A status snapshot taken after the fact should be augmented by historical trends from before the event

Best practice is to keep some minimal archives & trend graphs

But beware: do not monitor the boxes to death!
The data is the same regardless of the access method: SNMP, CLI, WebGUI, etc.

Error Report Analysis


Tools are available on the support website to help parse the data
IBM Support Assistant plug-ins

The audit log will have a history of restarts

Grep for errors; keep a history of 'expected' errors so that unexpected ones stand out
e.g. load balancer health checks
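As a rough sketch of the "expected versus unexpected" idea, the following Python filter keeps only error-level log lines that match none of a list of known patterns. The pattern list and the set of level names are assumptions for illustration; adjust both to what your own logs actually contain.

```python
import re

# Hypothetical patterns for errors that are known and expected on this system.
EXPECTED_PATTERNS = [
    re.compile(r"health[- ]?check", re.IGNORECASE),
]

# Assumed level names; lines look like "<timestamp> [category][level] ...".
LEVEL_RE = re.compile(r"\]\[(error|critic|alert|emerg)\]")


def unexpected_errors(lines, expected=EXPECTED_PATTERNS):
    """Return error-level lines that match none of the expected patterns."""
    result = []
    for line in lines:
        if not LEVEL_RE.search(line):
            continue  # not an error-level message
        if any(p.search(line) for p in expected):
            continue  # known noise, e.g. load balancer health checks
        result.append(line)
    return result
```

Growing the expected list over time gives the "history of expected errors" a concrete form that can be reused on every new error report.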

What to always know


Minimal: (every 5 minutes)
Memory
Load
Established TCP connections

The Throttler status log is an easy way to collect all of this

Create a dedicated log target (file or syslog) so that it does not rotate away
Logs persist after a crash, but status provider data is lost
Know what is typical for your system

What Else?
All log files:
Additional logs not put into the error report will be in 'logtemp'
Top level & for the specific domain(s); unless too many domains
It is helpful to build automated scripts in advance that fetch the files via CLI or SOMA
Log files can be under 'logstore' if using the log to RAID option
<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
  <env:Body>
    <dp:request xmlns:dp="http://www.datapower.com/schemas/management">
      <dp:get-file name="logtemp:default-log"/>
    </dp:request>
  </env:Body>
</env:Envelope>
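A script built in advance could generate this request for any file name. The Python sketch below builds the same envelope and decodes a reply. It assumes the reply carries the file base64-encoded in a dp:file element; the function names are hypothetical, and sending the request to the XML management interface is left out.

```python
import base64
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
DP_NS = "http://www.datapower.com/schemas/management"


def build_get_file_request(file_name):
    """Build a SOMA get-file request such as name='logtemp:default-log'."""
    ET.register_namespace("env", SOAP_NS)
    ET.register_namespace("dp", DP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{DP_NS}}}request")
    ET.SubElement(request, f"{{{DP_NS}}}get-file", {"name": file_name})
    return ET.tostring(envelope, encoding="unicode")


def extract_file(response_xml):
    """Decode the file content, assuming a base64-encoded dp:file element."""
    root = ET.fromstring(response_xml)
    node = root.find(f".//{{{DP_NS}}}file")
    if node is None or node.text is None:
        raise ValueError("no dp:file element in response")
    return base64.b64decode(node.text)
```

Looping over a list of log file names with `build_get_file_request` is then enough to collect everything under 'logtemp' for the default domain and any application domains.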

Status Provider
These provide information about the system
E.g. filesystem, environment sensor, domain status.

Information can be accessed through

WebGUI
CLI
SOMA
SNMP

Memory statistics
Show load

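xi50(config)# show load

[Sample output garbled in extraction: a per-task table with the columns Task Name, Load, Work List, CPU, and Memory, and rows for the tasks main, wtx, and ssh. Surviving values: main 2, 0, 33, 109; wtx 20; ssh 29.]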

Show memory

xi50(config)# show mem


Memory Usage: 9 %
Total Memory: 4148536 kilobytes
Used Memory: 385722 kilobytes
Free Memory: 3762814 kilobytes
Requested Memory: 1197680 kilobytes
Hold Memory: 811958 kilobytes
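For trend archives it helps to turn this output into numbers. The sketch below parses the 'show mem' text shown above and runs two sanity checks. The function names are hypothetical, and the checks are only relationships that hold in this sample, not documented guarantees.

```python
import re

# Matches lines such as "Total Memory: 4148536 kilobytes" or "Memory Usage: 9 %".
FIELD_RE = re.compile(r"^\s*([A-Za-z ]+?):\s*(\d+)\s*(?:%|kilobytes)?\s*$")


def parse_show_mem(text):
    """Return a dict mapping each field label to its integer value."""
    values = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            values[match.group(1).strip()] = int(match.group(2))
    return values


def check_consistency(mem):
    """Return a list of problems found; an empty list means the sample adds up."""
    problems = []
    if mem["Used Memory"] + mem["Free Memory"] != mem["Total Memory"]:
        problems.append("used + free != total")
    percent = 100 * mem["Used Memory"] / mem["Total Memory"]
    if abs(percent - mem["Memory Usage"]) > 1:
        problems.append("usage percent does not match used / total")
    return problems
```

In the sample above, Used plus Free equals Total, and Requested Memory equals Used plus Hold (385722 + 811958 = 1197680).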

File Count

Memory Statistics
Usage log:
20120314T102536Z [slm][debug] throttle(Throttler): tid(943):
Memory(3570524/4098982kB 87.107579 free) Pool(1041874)
Ports(31756/31850) Temporary-FS(224/242MB 92.561983 free) File(OK)

Memory: same as 'Free Memory' from 'show memory'


Pool: same as 'Hold Memory' from 'show memory'
Ports: number of free ports (internal structure; 'show connections')
File: generic test of all filesystems access (4 possible answers)
Cannot access filesystem due to low memory
Router has too many open files (may need to reload)
System has too many open files (may need to reboot)
other?
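The throttle line is regular enough to parse into a time series. The following sketch extracts the fields from the sample message above; the regular expression is derived from that one sample only, so other firmware levels may format the line differently.

```python
import re

THROTTLE_RE = re.compile(
    r"Memory\((?P<mem_free_kb>\d+)/(?P<mem_total_kb>\d+)kB (?P<mem_free_pct>[\d.]+) free\)\s+"
    r"Pool\((?P<pool_kb>\d+)\)\s+"
    r"Ports\((?P<ports_free>\d+)/(?P<ports_total>\d+)\)\s+"
    r"Temporary-FS\((?P<tmp_free_mb>\d+)/(?P<tmp_total_mb>\d+)MB (?P<tmp_free_pct>[\d.]+) free\)\s+"
    r"File\((?P<file>[^)]*)\)"
)


def parse_throttle(message):
    """Parse one throttle status message; return None if it does not match."""
    flat = " ".join(message.split())  # undo line wrapping in copied logs
    match = THROTTLE_RE.search(flat)
    if match is None:
        return None
    fields = {}
    for key, value in match.groupdict().items():
        if key == "file":
            fields[key] = value
        elif key.endswith("_pct"):
            fields[key] = float(value)
        else:
            fields[key] = int(value)
    return fields
```

Feeding every throttle line from a dedicated log target through `parse_throttle` yields the memory, port, and temporary filesystem history needed for trend graphs.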

Discrepancies
What to look for in the status providers:
'show tcp'; 'show connections' & 'show handles'
All give slightly different results, but they roughly map one-to-one
If one is out of range by an order of magnitude, that could indicate an issue

'show load' vs. 'show cpu'


Load is an instantaneous measure
CPU is averaged
Load can jump around a lot; however, a mismatch can indicate 'spinning' ports
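The order-of-magnitude comparison can be automated once a count has been taken from each provider. This sketch flags any count that differs from the median by a factor of ten or more; the provider names and numbers in the usage note are made up, and how the counts are collected is left open.

```python
from statistics import median


def out_of_range(counts, factor=10.0):
    """Return the names whose count differs from the median by >= factor."""
    mid = median(counts.values())
    flagged = []
    for name, value in counts.items():
        high = max(value, mid)
        low = max(min(value, mid), 1)  # guard against division by zero
        if high / low >= factor:
            flagged.append(name)
    return flagged
```

For example, `out_of_range({"show tcp": 120, "show connections": 135, "show handles": 2400})` flags only "show handles".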

Packet capture
Why packet captures?

Packet Captures

Definitive answer to protocol interoperability.


Not necessarily the same as the Probe!
Can now capture on loopback, VLANs, or all interfaces at once.
tcpdump format; viewable in Wireshark, etc.

Packet Capture Filtering


Expression format should follow 'pcap-filter(7)'
http://www.unix.com/man-page/FreeBSD/7/pcap-filter
e.g.

Supports basic and advanced filtering capabilities


Provides ability to filter on

IP address
Port
MAC address
and many other qualifiers
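To illustrate the qualifiers listed above, the sketch below composes a filter expression from the standard pcap-filter primitives `host`, `port`, and `ether host`. The helper name is hypothetical, and the addresses in the usage note are documentation examples, not values from the session.

```python
def pcap_filter(host=None, port=None, mac=None):
    """Combine the given qualifiers into one pcap-filter expression."""
    parts = []
    if host is not None:
        parts.append(f"host {host}")       # match source or destination IP
    if port is not None:
        parts.append(f"port {port}")       # match source or destination port
    if mac is not None:
        parts.append(f"ether host {mac}")  # match source or destination MAC
    return " and ".join(parts)
```

Calling `pcap_filter(host="192.0.2.10", port=443)` gives `host 192.0.2.10 and port 443`; with no arguments the result is an empty expression, which matches all packets.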

FFDC background packet capture


Always-on background packet capture has low overhead
Captures packets on all interfaces simultaneously
A capture is generated automatically when
the system experiences an outage, such as a crash
a user requests it (Must-Gather operation)

When FFDC triggers report generation, the captured information is current

The packet capture can be compressed and stored automatically
in an Error Report and optionally sent off-box

Service Probe
The Multistep Probe shows the payload as it moves through the processing policy; it is not meant to show what is on the wire

OoM
DataPower does not have virtual memory
Pro: performance
Con: 4GB is shared by all domains & transactions

First step is to determine trigger of the OoM event


Memory leak
Traffic spike

Historical graphs are necessary to determine root cause


can indicate correlation of memory increase to high load
can indicate correlation of memory increase to backend latency

Spikes:
An increase in traffic arriving at the device
An increase in delay at backends or in sidecalls
Can be detected if Throttle status log option is enabled
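Given historical samples taken at the same times (for example every 5 minutes), a simple correlation coefficient can show whether memory growth tracks load or backend latency. This is a minimal sketch of the Pearson coefficient in plain Python; the choice of metric is an assumption, and the series are whatever your archives provide.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long series, in the range -1 to 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)
```

A value near 1 between used memory and backend latency points toward a spike caused by slow backends; a value near 0 with steadily rising memory points more toward a leak.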

Memory or other resource leaks


Generally, you must have a baseline
What can be leaked?

Memory
File handles/sockets/file descriptors
Ports (slightly different from sockets)
Inodes (very rare)

Tracing must be turned on before the resource is leaked


Currently, leak detection requires a reboot; development is planning for always-on resource leak tracing

Memory logs
Each log message captures a snapshot at that time
Not cumulative; can go up & down
Not exhaustive; some actions or protocols can allocate memory outside

Added in 3.8.2: units are in bytes


20110224T110806Z [memory-report][debug] mpgw(sender): tid(8464): Response
Finished: memory used 21968644
20101108T232236Z [memory-report][debug] mpgw(sftp-ftp-mpgw):
tid(6000)[response][9.42.102.172]: Processing [Rule (sftp-ftp_policy_rule_1),
Action ('sftp-ftp_policy_rule_1_results_output_0', results()), Input(INPUT),
Output(NULL)] finished: memory used 595732
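Because each message is a snapshot, it is useful to pull the numbers out and look at them per service or per transaction. The sketch below parses the two sample messages above; the pattern is derived from those samples only and may need adjusting for other message variants.

```python
import re

MEMORY_REPORT_RE = re.compile(
    r"^(?P<timestamp>\d{8}T\d{6}Z) \[memory-report\]\[debug\] "
    r"(?P<kind>[\w-]+)\((?P<service>[^)]*)\): tid\((?P<tid>\d+)\)"
    r".*memory used (?P<bytes>\d+)\s*$"
)


def parse_memory_report(message):
    """Return (timestamp, service, tid, bytes) or None if the line differs."""
    flat = " ".join(message.split())  # undo line wrapping in copied logs
    match = MEMORY_REPORT_RE.match(flat)
    if match is None:
        return None
    return (
        match.group("timestamp"),
        match.group("service"),
        int(match.group("tid")),
        int(match.group("bytes")),
    )


def peak_by_service(messages):
    """Return the largest 'memory used' value seen for each service."""
    peaks = {}
    for message in messages:
        parsed = parse_memory_report(message)
        if parsed is not None:
            _, service, _, used = parsed
            peaks[service] = max(peaks.get(service, 0), used)
    return peaks
```

Since the values are not cumulative and can go up and down, the peak per service is usually a more useful summary than the last value seen.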

Services Memory Status Provider

Leak reports
Always best to have a baseline
Tracing must be always on!
Active transactions can cause noise in the data capture; best to turn off traffic if at all possible

Also try to capture 10-15% memory growth between snapshots

By default data is captured to NFS; the snapshots can be large


A CLI option is available if necessary; in some obvious cases it can be sufficient

Shows top ten users

In some cases leak reports may not be enough; certain types of memory allocations are not captured

Scalability concerns

SLM vs. Monitors


Determining scalability requires a methodical and sensible approach
Debugging can be surprisingly tricky
How much traffic is the box actually taking?

How many requests?


What kind?
How big?
What actions?

Message Count Monitors


Most accurate method for determining exactly how much traffic a
service is processing
More lightweight than SLM
Does not have as many options and features

SLM
Shaping can be used to smooth traffic
Should not be used to hide a broken backend
Plan on shaping for a few seconds; not minutes

Reliability should be end-to-end; not hop-by-hop


There is no free lunch! (well maybe in Vegas)

Audit log
Polling for uptime is best practice for monitoring restarts
Except when the 32 bit counter wraps

Monitoring the audit log is also useful


If the uptime goes down then the box has rebooted; otherwise it reloaded
But note this message:
20120217T083717Z [eventlog][failure] (SYSTEM:default:*:*): Booting build 205760
on 2011/11/15 11:02:50 count 32. Uptime 777830
A Booting message with type 'failure' is just an indication that the audit log has rotated; not that the box has restarted
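A poller that compares successive uptime samples needs to tell a real reboot from a counter wrap. The sketch below assumes the uptime is read as a 32-bit counter in hundredths of a second (as SNMP sysUpTime is); the function name and the tolerance are illustrative choices, not part of the product.

```python
WRAP = 2 ** 32           # a 32-bit counter wraps back to zero here
TICKS_PER_SECOND = 100   # assumption: uptime counted in hundredths of a second


def classify_uptime(prev_ticks, curr_ticks, elapsed_seconds, tolerance_seconds=60):
    """Classify two successive samples as 'no-reboot', 'counter-wrap', or 'reboot'."""
    if curr_ticks >= prev_ticks:
        return "no-reboot"  # uptime kept growing (a reload does not reset it)
    # Uptime went down: work out where the counter would be after a plain wrap.
    expected = (prev_ticks + int(elapsed_seconds * TICKS_PER_SECOND)) % WRAP
    distance = (curr_ticks - expected) % WRAP
    distance = min(distance, WRAP - distance)  # circular distance
    if distance <= tolerance_seconds * TICKS_PER_SECOND:
        return "counter-wrap"
    return "reboot"
```

With this unit the counter wraps roughly every 497 days, so the wrap case is rare but worth handling to avoid a false reboot alert.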

Log CLI Trigger


Should have the ability to execute any CLI command or CLI script
Needs to match on Log Message ID and message text using an optional regular expression
Examples of ability to execute any CLI command or CLI scripts:
start a packet capture on a specific event
stop a packet capture on the next occurrence of the same event
perform a must-gather Error Report generation

Large Debug Logs


The goal is that we want to capture all possible data.
Debug logging will do that, with a few exceptions:

RBM (optional)

webGUI (optional)

logging about logging (not possible.)

In the default domain; create a new log target

type file

format text

timestamp numeric

archive rotate

event all debug

maximum total is 50MB times 100 = 5 GB

Do we have that much?

Make sure the space is available!

Large Debug Logs - Onbox

Best practice: log to RAID


Rotate; do not archive files
Better to pull via HTTP rather than push
If push must be used FTP is the preferred approach
The file log target cannot rotate more than once per second
The log file size must be large enough to hold more than 1 second's worth of data; otherwise you will certainly lose messages.

Dropped messages are also reported in the log file: Buffer Overflow: X event(s) lost
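The once-per-second rotation limit translates into a lower bound on the log file size. This sketch does the arithmetic from a measured message rate and average message size; both inputs and the safety factor are assumptions you would replace with figures from your own system.

```python
import math


def min_log_file_bytes(messages_per_second, avg_message_bytes, safety_factor=2.0):
    """Smallest file size that holds more than one second of log data.

    safety_factor > 1 leaves headroom for bursts above the average rate.
    """
    if safety_factor <= 1:
        raise ValueError("safety_factor must be greater than 1")
    return math.ceil(messages_per_second * avg_message_bytes * safety_factor)
```

For example, 2000 messages per second at 500 bytes each with the default factor needs a file of at least 2,000,000 bytes.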

Large Debug Logs


Always check to make sure they work
Check on the log target status
Should have zero dropped events

Large Debug Logs - Offbox


Best practice: syslog over UDP
Using syslog-tcp may cause a bottleneck (on firmware 4.0.2 or earlier)

DataPower opens many simultaneous connections


Can bring down some servers

Always set a static route to the syslog servers to force outbound traffic
over the correct interface
Adding a syslog log target is a lightweight addition to a busy box
Note: UDP syslog may truncate some longer messages

Support Resources

IBM WebSphere DataPower SOA Appliance Handbook


IBM Support Portal for DataPower
http://www.ibm.com/support/entry/portal/Overview/Software/WebSphere/WebSphere_DataPower_SOA_Appliances

developerWorks articles
WebCasts
Forum: https://www.ibm.com/developerworks/forums/forum.jspa?forumID=1198
User Groups: http://www.websphere.org/websphere/Site?page=ugdetail&groupId=165

Feedback?
Comments?

Copyright & Trademarks


IBM Corporation 2012. All Rights Reserved.
IBM, the IBM logo, and ibm.com are trademarks or registered
trademarks of International Business Machines Corp., registered in
many jurisdictions worldwide. Other product and service names
might be trademarks of IBM or other companies. A current list of
IBM trademarks is available on the Web at "Copyright and
trademark information" at www.ibm.com/legal/copytrade.shtml.
