Time is Money!
Use WebSphere MQ Shared Queues to
Reduce Outages
Lyn Elkins, IBM
16 March 2010
Session - 1613
Agenda
• Outages are evil
• What are shared queues?
• How are they different from other availability techniques?
• How can I reduce outages using shared queues?
• How much do they cost?
• To Duplex or not to Duplex
Outages are evil
• What kinds of outages are there?
• Planned – maintenance windows
• Unplanned:
• Hardware failures
• Systems Software failures
• Application Software failures
• Human failures
• Process failures
• Other disasters
Outages are evil
Unscheduled outages by cause (source: Gartner Group)
• Operating systems – 10.0%
• Hardware – 10.0%
• Application – 40.0%
• Process – 40.0%
Outages are evil
• Both planned and unplanned outages can be costly
• Lost business
• One customer estimated losses of $10M after a 2 hour unplanned
outage
• Do you know what an outage costs your business?
• Lost productivity – outages and the recovery impact other
areas of business
• Development productivity is commonly lost as the result of a
production outage
• “We will have meetings until we know what caused this!”
• Lost sleep – for those of us who have to deal with the event
Outages are evil - Notes
• If you do not have a cost calculation for planned and unplanned
outages, it can be beneficial to develop one.
• Frequently based on historical data, like the $10M figure from one
customer. In their case they averaged $5M in transactional revenue
for every peak hour. The real cost was probably higher, as the
outage hit an extremely high volume period.
• If you cannot quantify the cost of an outage, you will not be allowed
to spend money on preventing it.
• Losses in other areas may be caused by shifting prime workload onto
development hardware, etc.
• More productivity losses are caused by the inevitable meetings that
follow an outage
• ‘Root cause’ can be difficult to determine and politically charged.
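A minimal sketch of the cost calculation described in these notes, using the customer figures above ($5M per peak hour, 2 hour outage). The function name and the optional productivity term are illustrative assumptions, not part of any IBM tool; a real model would add the terms your own business can measure.

```python
def outage_cost(revenue_per_hour, outage_hours, productivity_cost=0.0):
    """Estimate outage cost as lost revenue plus any known productivity loss.

    productivity_cost is a hypothetical placeholder for meeting time,
    displaced development work, and similar measured losses.
    """
    if revenue_per_hour < 0 or outage_hours < 0 or productivity_cost < 0:
        raise ValueError("inputs must be non-negative")
    return revenue_per_hour * outage_hours + productivity_cost


# Figures from the customer example: $5M per peak hour, 2 hour outage.
customer_loss = outage_cost(5_000_000, 2)
print(f"Estimated lost revenue: ${customer_loss:,.0f}")
```

Even a rough number like this gives you something to set against the cost of prevention.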
What are Shared Queues?
• Shared queues are only available on WMQ for z/OS
• Not Linux for System z, not AIX, not….
• From an application perspective, shared queues allow any
application attached to a Queue Sharing Group (QSG)
queue manager to treat a shared queue as local
• Both MQPUTs and MQGETs are allowed
• An application can run on any LPAR and connect to any
queue manager in the QSG to access the same queue and
to get messages.
What are Shared Queues? Notes
• Shared queues are only available on WMQ for z/OS
• Not Linux for System z, not AIX, not….
• Repetition might allow this to soak in
• The ability to get from a shared queue means that an
application can run on any LPAR in the ‘plex and access
the MQ resources it needs to process.
• Of course the other resources the application needs must be
in place, too.
• If an outage should occur, other instances of the application
can pick up the work, without alteration.
• Applications can be moved more easily between systems.
What are Shared Queues?
• From a WMQ perspective, a queue sharing group is a
logical association of queue managers within a z/OS
Parallel Sysplex.
• From a systems perspective, shared queues are built on
the Parallel Sysplex Coupling Facility (CF).
• WMQ uses:
• CF List structures to host the queues
• DB2 Data sharing to store information about the shared
resources
• DB2 data sharing to store large messages (over 63K)
Sample WMQ Queue Sharing Group
• Queue sharing group: MQHG
• LPAR SC49: WMQ MQH1, Broker MQH1BRK, DB2 DSG D9HG
• LPAR SC54: WMQ MQH2, Broker MQH2BRK, DB2 DSG D9HG
• Coupling facility CF04 structures: CSQ_ADMIN, APPLICATION1,
PERSISTENT, DUPLEXED
• Coupling facility CF05 structures: DUPLEXED
How are Shared Queues different from other
availability techniques?
• Common Availability Solutions
• Failover solutions
• Tools like PowerHA, ARM, etc.
• Multi-instance queue managers – on distributed with WMQ V7.0.1
• When a problem is detected the queue manager restarts on
another image or another ‘box’
• Depends on shared storage
• Multi-instance queue managers require lease based sharing
How are Shared Queues different from other
availability techniques?
• Common Availability Solutions (continued)
• Queue Manager Clusters
• Multiple targets for inbound, new messages
• “Poor man’s” availability
• Failover solutions and queue manager clusters include
the possibility of ‘marooned messages’
How are Shared Queues different from other
availability techniques?
• The queues are not owned by an individual queue
manager, but by the coupling facility
• As queue managers are pulled from service, messages
on the shared queues continue to be available for
processing
• ‘Old message availability’
• All the queue managers in the QSG can be out of
service, and the queues and messages survive
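The difference between marooned messages and ‘old message availability’ can be sketched as a toy simulation. This is an illustration only: the round-robin distribution, the queue manager names and the function names are assumptions, and no MQ API is involved.

```python
from collections import deque


def cluster_distribute(messages, queue_managers):
    """Spread messages round-robin onto one private queue per queue manager."""
    queues = {qm: deque() for qm in queue_managers}
    for i, msg in enumerate(messages):
        queues[queue_managers[i % len(queue_managers)]].append(msg)
    return queues


def reachable_clustered(queues, failed):
    """Split messages into those still reachable and those marooned
    on the private queues of failed queue managers."""
    available = [m for qm, q in queues.items() if qm not in failed for m in q]
    marooned = [m for qm, q in queues.items() if qm in failed for m in q]
    return available, marooned


def reachable_shared(messages, queue_managers, failed):
    """With a shared queue the messages live in the CF, so none are lost
    when a queue manager fails. Processing them still needs at least one
    surviving queue manager, so those are returned as well."""
    survivors = [qm for qm in queue_managers if qm not in failed]
    return list(messages), survivors


qmgrs = ["QM1", "QM2", "QM3", "QM4"]   # hypothetical names
msgs = [f"Msg{n}" for n in range(1, 9)]

available, marooned = reachable_clustered(cluster_distribute(msgs, qmgrs), {"QM4"})
print("cluster, marooned on QM4:", marooned)

kept, survivors = reachable_shared(msgs, qmgrs, {"QM4"})
print("shared queue, still held in CF:", len(kept), "served by", survivors)
```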
Clustering for Availability
[Figure: Q Mgr 0 (application A) sends messages through
SYSTEM.CLUSTER.TRANSMIT.QUEUE to instances of Queue 1 on Q Mgr 1 to
Q Mgr 4, each served by application B. Msg1 to Msg8 are spread across
the four instances; Msg4 is marooned on Q Mgr 4.]
Reducing Outages by using QSGs
[Figure: queue sharing group QSG1. Q Mgr 0 (application A) and Q Mgr 1
to Q Mgr 4 (application B) work with a single Queue 1 held in the CF.
Msg1 to Msg11 are shown, with Msg1 marked as processed.]
How can I reduce outages using shared
queues?
• Planned and unplanned outages of a queue manager
• Work can continue for both old and new messages
• As long as the processing programs are available on other LPARs
• Often a queue manager or LPAR outage can go undetected from a user
perspective
• CICS 4.1 supports group attach to WMQ
• Cloned environments
• Horizontal scaling across members of a Sysplex
• LPARs should include:
• Queue managers in the QSG
• CICS regions hosting the same transactions
• DB2 Data Sharing
• Ability to run the same batch jobs
• VSAM RLS
How can I reduce outages using shared
queues?
• CICS 4.1 and Group Attach
• New connection method to WMQ
• MQCONN resource definition
• Can specify the QSG name rather than a specific queue manager
• Queue manager must be in the same LPAR as the CICS region
• If more than one QMGR available, random number algorithm is
used for selection
• If a QMGR becomes unavailable, CICS will automatically
reconnect to another member of the QSG on the same LPAR
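The selection rule on this slide can be sketched as follows. This is a toy model, not CICS code: the member list, the MQH3 name and the function name are hypothetical, and Python's random module stands in for whatever random number algorithm CICS actually uses.

```python
import random


def pick_queue_manager(qsg_members, cics_lpar, rng=random):
    """Pick one available QSG member on the same LPAR as the CICS region.

    qsg_members is a list of dicts with 'name', 'lpar' and 'available' keys.
    Returns None when no local member is available.
    """
    candidates = [
        m["name"]
        for m in qsg_members
        if m["lpar"] == cics_lpar and m["available"]
    ]
    if not candidates:
        return None
    return rng.choice(candidates)


# Hypothetical QSG: two members on SC49, one on SC54.
qsg = [
    {"name": "MQH1", "lpar": "SC49", "available": True},
    {"name": "MQH3", "lpar": "SC49", "available": True},
    {"name": "MQH2", "lpar": "SC54", "available": True},
]

first = pick_queue_manager(qsg, "SC49")
# Simulate MQH1 becoming unavailable: only MQH3 is left on SC49.
qsg[0]["available"] = False
reconnect = pick_queue_manager(qsg, "SC49")
print("initial:", first, "after failure:", reconnect)
```

Members on other LPARs are never candidates, which is why each LPAR needs its own QSG member for group attach to help.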
Other availability improvements
• Inbound Shared channels
• Connections can be made to QSG
• Sysplex distributor will route connection to best available
CHIN
• CHINs listen on a generic port and a ‘fixed’ port
• Outbound Shared channels
• Uses a shared transmission queue
• Any CHIN in the QSG can service outbound messages
• ‘Least busy’ will process more outbound messages
Client Connections to Inbound shared channel
[Figure: Client 0, Client 1 and Client 2 each issue MQCONN(QSG1).
Sysplex Distributor routes each connection to one of QMG1 to QMG4,
running on LPAR 1 to LPAR 4 in queue sharing group QSG1. All four
queue managers access Queue 1 in the CF.]
Potential availability issues with shared
queues
• The Coupling Facility is normally very robust, but like any
other resource can run into issues
• The structure can run out of room
• The whole CF can crash
• Not typical if the CF is external and uses an independent power supply
• CF structures can become damaged
• Though this has become rarer, it can happen
• CF structures and persistent messages can be recovered
• CF Links can become unavailable
• Currently this will cause a queue sharing group outage
• The lab is aware of this and is working towards a solution
• We’ll talk about system managed duplexing later
Potential issues with shared queues
• Damage to CF structures is becoming rarer
• INJERROR is a test tool that injects a structure error, making
it easy to test how MQ will behave
• CF Links becoming unavailable is often associated
with system maintenance
• Recently seen at customers who were applying CFCC
maintenance, replacing OSA cards, and upgrading the
operating system
• The whole CF becoming unavailable, while impacting
WMQ, is usually a much larger problem
How much do they cost?
• Well……It depends
• Availability is not free
• CPU utilization is higher for shared queues
• Use of CF resources
• Large messages, over 63K, cost a great deal more
• Throughput can suffer – especially for large messages
• System management changes
• BACKUP CFSTRUCT
• RECOVER CFSTRUCT
Shared Queue Costs
• From MP16, shared queue CPU costs compared to a
private queue:
• Are 78% more for 1000 byte non-persistent messages
• Are 77% more for 5000 byte non-persistent messages
• Then progressively down to 59% more for 63KB non-persistent
messages
• Each queue manager in the QSG increases the CPU cost by about
5%
• YOUR MILEAGE WILL VARY!
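A rough sketch of how the figures above could be applied to your own numbers. Several things here are assumptions: the function and table names are made up for illustration, 63KB is taken as 63 × 1024 bytes, and the 5% per queue manager is applied as a simple linear uplift for each member beyond the first. MP16 itself is the authority on how the numbers combine.

```python
# Shared queue CPU overhead relative to a private queue, non-persistent
# messages, as quoted from MP16 on the slide. Keys are message sizes in
# bytes; 63KB is assumed to mean 63 * 1024.
MP16_OVERHEAD = {1000: 0.78, 5000: 0.77, 63 * 1024: 0.59}


def shared_queue_cpu(private_cpu, message_bytes, extra_qmgrs=0, per_qmgr=0.05):
    """Estimate shared queue CPU cost from a measured private queue cost.

    Only the three published sizes are accepted; other sizes raise KeyError
    rather than guessing at an interpolation. The per-queue-manager uplift
    is modelled linearly, which is an assumption.
    """
    if extra_qmgrs < 0:
        raise ValueError("extra_qmgrs must be non-negative")
    overhead = MP16_OVERHEAD[message_bytes]
    return private_cpu * (1 + overhead) * (1 + per_qmgr * extra_qmgrs)


# Example: a private queue cost of 100 CPU units per message (made-up number).
single = shared_queue_cpu(100.0, 1000)
three_members = shared_queue_cpu(100.0, 1000, extra_qmgrs=2)
print(f"one member: {single:.1f}, three members: {three_members:.1f}")
```

Treat the output as a planning estimate only; measure in your own environment before committing to capacity.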
Shared Queue Costs – Notes
• This information is from MP16, which may be found at:
http://www-01.ibm.com/support/docview.wss?rs=171&uid=swg24007421&loc=en_US&cs=utf-8&lang=en
Duplexing CF Structures
• WMQ can take advantage of System Managed Duplexing
• DB2 provides user managed duplexing for some structures
• Duplexing CF structures
• Like DASD mirroring, two copies of every update are made
• Includes both persistent and non-persistent messages
• Requires two CFs; storage is allocated in both
• Protects the QSG from:
• CF Link failure
• CF Outage
• Increases availability seamlessly – neither the queue
manager nor the application detects an outage.
• No need for RECOVER CFSTRUCT
Sample WMQ Queue Sharing Group
• Queue sharing group: MQHG
• LPAR SC49: WMQ MQH1, Broker MQH1BRK, DB2 DSG D9HG
• LPAR SC54: WMQ MQH2, Broker MQH2BRK, DB2 DSG D9HG
• Coupling facility CF04 structures: CSQ_ADMIN, APPLICATION1,
PERSISTENT, DUPLEXED
• Coupling facility CF05 structures: DUPLEXED
Sample WMQ Policy Definition
• STRUCTURE NAME(MQHGDUPLEXED)
SIZE(30000)
INITSIZE(20000)
MINSIZE(20000)
PREFLIST(CF04,CF05)
FULLTHRESHOLD(85)
ALLOWAUTOALT(YES)
DUPLEX(ENABLED)
Duplexing CF Structures
• Duplexing is defined in the structure definitions of the
Coupling Facility Resource Management (CFRM) policy. The
options are:
• DUPLEX(DISABLED) – the default, duplexing is not
permitted for this structure
• DUPLEX(ENABLED) – for WMQ structures, system
managed duplexing will be used
• DUPLEX(ALLOWED) – more about this later
Duplexed Structure – Coupling Facility
CF04
Duplexed Structure – Coupling Facility
CF05
Dynamic Duplexing
• The ‘DUPLEX(ALLOWED)’ option
• Storage is needed in both the primary and secondary CF at
definition time
• Duplexing – both user and system managed – is not active
until enabled. It will remain active until disabled
• SETXCF START,REBUILD,DUPLEX
• SETXCF STOP,REBUILD,DUPLEX
• Current recommendation is to use this option and turn
duplexing on prior to major maintenance, turning it off again
after the work is complete
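The maintenance procedure above can be sketched as a small helper that builds the operator commands for one structure. The structure name comes from the sample policy definition; the STRNAME and KEEP operands reflect the usual form of these SETXCF commands and should be checked against the z/OS system commands reference for your release. The helper itself is hypothetical and only formats text; it does not issue anything.

```python
def duplex_command(action, strname, keep=None):
    """Build a SETXCF duplexing command string for one CF structure.

    action: 'START' or 'STOP'.
    keep:   for STOP, which structure instance to retain ('OLD' or 'NEW').
    """
    action = action.upper()
    if action == "START":
        return f"SETXCF START,REBUILD,DUPLEX,STRNAME={strname}"
    if action == "STOP":
        if keep not in ("OLD", "NEW"):
            raise ValueError("STOP needs keep='OLD' or keep='NEW'")
        return f"SETXCF STOP,REBUILD,DUPLEX,STRNAME={strname},KEEP={keep}"
    raise ValueError(f"unknown action: {action}")


# Before major maintenance: turn duplexing on for the sample structure.
before = duplex_command("START", "MQHGDUPLEXED")
# After the work is complete: turn it off again, keeping the original copy.
after = duplex_command("STOP", "MQHGDUPLEXED", keep="OLD")
print(before)
print(after)
```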
Shameless Promotion
• New Redbook coming out:
• High Availability in WebSphere Messaging Solutions - SG24-7839
• Lab using Shared queues tomorrow at the off site lab
• IBM Building room 9027, let Lyn or Ralph know if you will be
attending