SHARE Boston 2013 Common zOS Problems
Jerry Ng
Patty Little
IBM Poughkeepsie
Trademarks
• MVS
• OS/390®
• z/Architecture®
• z/OS®
Contents
A Reminder: Health Checker for z/OS
IBM Health Checker for z/OS is a component of MVS that identifies potential
problems before they impact your availability or, in worst cases, cause outages. It
checks the current active z/OS and sysplex settings and definitions for a system
and compares the values to those suggested by IBM or defined by you. It is not
meant to be a diagnostic or monitoring tool, but rather a continuously running
preventative that finds deviations from best practices. IBM Health Checker for
z/OS produces output in the form of detailed messages to let you know of both
potential problems and suggested actions to take.
CDS Inconsistency
The impact of ‘splitting’ XCF’s knowledge of the CDSs varies depending on what
updates are made while the split exists. Wait states on some or all of the
systems usually occur.
CDS Inconsistency
http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/WP102281
Please review the above paper and session material to follow the best practices
for CDS’s. This will help to avoid disasters in your systems.
OMVS Services failing
Most z/OS systems have multiple userids with UID(0).
If a UID(0) userid is created or altered so that its RACF default group does not
have a GID, then the RACF commands will get a message like:
What can happen is that, at some point, a program that needs UID(0) causes
OMVS to ask RACF who UID(0) is. If this incomplete userid happens to be found
first by RACF, then RACF sees that the default group has no GID and gives
back a non-zero return code because the userid's OMVS definition is incomplete.
This can cause a variety of problems that are not particularly easy to diagnose.
OMVS Services failing
• What-to-do:
In RACF, the database has an application identity mapping (AIM) "stage" level,
0 through 3. You move it via the IRRIRA00 utility.
Once at Stage 2 or Stage 3 (with APAR OA39645), RACF can guarantee that
any request to translate a UID or GID will return the same answer every time.
Customers who are not in those environments are not guaranteed a consistent
answer.
ICSF/Crypto Master Keys
When customers using ICSF migrate to a new mainframe, they usually want to
continue to use their existing Key Data Sets (i.e., CKDS and PKDS). To use these
data sets on the new box, you need to enter the correct Master Keys: DES & AES
for the CKDS, and RSA & ECC for the PKDS. If the Master Keys are forgotten,
then these Key Data Sets, and all of the keys in them, can't be used.
If the Master Keys are forgotten or lost, the only option is to power up the old box,
enter new Master Keys, and re-encipher.
ICSF/Crypto Master Keys
• What-to-do:
Logger CF Structure
Structures which are too large may lead to various efficiency issues... Allocation
or Offload may take too long to complete, backing up other work.
Logger CF Structure
Use the Logger CFSizer!
http://www-947.ibm.com/systems/support/z/cfsizer/
• System z Coupling Facility Structure Sizer Tool (CFSizer).
CFSizer is a web-based application that will return structure sizes
based on the latest CFLEVEL for the IBM products that exploit the
coupling facility.
• Easy to use
• Minimal input data required
• Specify peak usage input
Defining the Logger CF Structure
PFA INI JAVAPATH
Problem: Error messages issued at PFA
modeling time
Potential error messages that may appear at PFA modeling times, depending
upon which checks PFA is running
AIR022I REQUEST TO INVOKE MODELING FAILED FOR
CHECK NAME= PFA_LOGREC_ARRIVAL_RATE
UNIX SIGNAL RECEIVED= 00000000 EXIT VALUE= 00000002
AIR022I REQUEST TO INVOKE MODELING FAILED FOR
CHECK NAME= PFA_MESSAGE_ARRIVAL_RATE
UNIX SIGNAL RECEIVED= 00000000 EXIT VALUE= 00000002
AIR033I PFA has detected that SMF is not running and has stopped
processing the PFA_SMF_ARRIVAL_RATE check. Processing will resume
after SMF restarts.
AIR022I REQUEST TO INVOKE MODELING FAILED FOR
CHECK NAME= PFA_ENQUEUE_REQUEST_RATE
UNIX SIGNAL RECEIVED= 00000000 EXIT VALUE= 00000002
PFA INI JAVAPATH
Problem: PFA modeling will fail if JAVAPATH
is incorrectly defined in either
• /etc/PFA/ini or
• PFA EXEC
PGM=AIRAMBGN,REGION=0K,TIME=NOLIMIT,
PARM='path=(/usr/lpp/bcp)'
PFA INI JAVAPATH
Problem: Example of ini file:
PFA INI JAVAPATH
Explanation:
• The JAVAPATH statement identifies the location
where PFA’s Java code used for modeling resides
• It does NOT represent where JAVA 6.0 code
resides
PFA INI JAVAPATH
Solution:
• Check the ini file to ensure that it does not point to
Java 6.0 code, but rather PFA’s Java modeling
code
FTP’ing Problem Doc
FTP’ing Problem Doc
RASP using AUX Slots
• Problem: Growth in Auxiliary paging slots owned by
the RSM Address Space (RASP ASID 3)
*IRA206I JMONDB2A ASID 0089 FRAMES 0002307735 SLOTS 0000555697 % OF AUX 13.2
*IRA206I DBNGDBM1 ASID 01E1 FRAMES 0000592191 SLOTS 0000323722 % OF AUX 7.7
*IRA206I DBNGDBM1 ASID 00D7 FRAMES 0000139593 SLOTS 0000219606 % OF AUX 5.2
*IRA206I DBNGDBM1 ASID 01A0 FRAMES 0000111173 SLOTS 0000200963 % OF AUX 4.7
*IRA206I RASP ASID 0003 FRAMES 0000000468 SLOTS 0000192729 % OF AUX 4.5
When you are in an auxiliary storage shortage condition, you will receive
messages indicating the top users of aux slots. RASP (ASID 3) may own few
real frames but a large number of aux slots. This is not an indication that RASP
has a problem.
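As an illustration of reading the messages above, the following is a minimal Python sketch that parses IRA206I lines and picks out the address space with the most aux slots per real frame. The regular expression is derived only from the sample lines shown on this slide, and the `slots_per_frame` ratio is a hypothetical helper for illustration, not an IBM-defined metric.

```python
import re

# Pattern based only on the sample IRA206I lines shown above (an assumption,
# not the documented message layout): job name, hex ASID, frames, slots, % of aux.
IRA206I = re.compile(
    r"IRA206I\s+(?P<job>\S+)\s+ASID\s+(?P<asid>[0-9A-F]{4})\s+"
    r"FRAMES\s+(?P<frames>\d+)\s+SLOTS\s+(?P<slots>\d+)\s+"
    r"% OF AUX\s+(?P<pct>\d+(?:\.\d+)?)"
)

def parse_ira206i(lines):
    """Return one dict per IRA206I line; lines that do not match are skipped."""
    users = []
    for line in lines:
        m = IRA206I.search(line)
        if m:
            users.append({
                "job": m["job"],
                "asid": int(m["asid"], 16),
                "frames": int(m["frames"]),
                "slots": int(m["slots"]),
                "pct_aux": float(m["pct"]),
            })
    return users

def slots_per_frame(user):
    """Hypothetical ratio: aux slots owned per real frame owned."""
    return user["slots"] / max(user["frames"], 1)

log = [
    "*IRA206I JMONDB2A ASID 0089 FRAMES 0002307735 SLOTS 0000555697 % OF AUX 13.2",
    "*IRA206I RASP ASID 0003 FRAMES 0000000468 SLOTS 0000192729 % OF AUX 4.5",
]
users = parse_ira206i(log)
top = max(users, key=slots_per_frame)
print(top["job"], top["asid"])
```

A very high ratio for RASP is the pattern described in the note above: few real frames, many aux slots, and no problem in RASP itself.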
RASP using AUX Slots
Explanation:
• When High Virtual Shared or Common storage (above the
2Gig bar) is used by any job, frames used to back this area
of storage are owned by the job which obtained the storage
area
• When REAL storage is low enough to drive paging, these
High Virtual pages that are paged to AUX slots are given to
and owned by the RSM address space (RASP)
• Need to find out which jobs are using High Virtual Shared
or Common storage (and if the amount is higher than
normal)
• One way is to get a dump and use IPCS RSMDATA (see next
pages)
The RASP aux slot counts are actually slots used for high virtual shared or
common pages belonging to jobs in the system. One way to identify the owner of
these pages is to take a dump and issue the IPCS RSMDATA command. See
next 2 pages for examples.
IPCS RSMDATA HVSHRDATA
Example
S START VSA END VSA ST K F VT JOBNAME ASID CREATE TIME REQUESTOR RQAS
J7D42 0190
J7D09 0154
J7D09S 0198
DB2DIST 005E
DB2DBM1 005C
DB2MSTR 0041
The output of the IPCS RSMDATA HVSHRDATA command shows the high
virtual shared pages owned by jobs in the system. Please see z/OS MVS
Diagnosis: Reference for details.
IPCS RSMDATA HVCOMMON
Example
START VSA END VSA Size St T K F L JOBNAME JOBID CREATE TIME REQUESTOR RQAS
The output of the IPCS RSMDATA HVCOMMON command shows the high
virtual common pages owned by jobs in the system. Please see z/OS MVS
Diagnosis: Reference for details.
RSU in IEASYSxx
If you specify a value of 1-9999 without a qualifier (M, G, T, or %), the value is
treated as a number of units, each one storage increment in size.
For example, if your machine has a storage increment size of 64
megabytes, specifying 20 causes 20 units of 64M (1.25G in total) to be set aside
for storage reconfiguration. Note that the storage increment size is entirely
hardware dependent, based not only on the hardware model, but possibly also on
the amount of real storage installed on the physical machine (not the LPAR). This
means using an unqualified value of 1-9999 can have unexpected results,
because its meaning can change dramatically with a simple upgrade to the
amount of real storage on the system.
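The arithmetic in the note above can be sketched in a few lines of Python. The 64M increment and the value 20 come from the example in the text; the 256M increment is a hypothetical post-upgrade value chosen only to show how the same unqualified RSU value changes meaning.

```python
def rsu_reserved_mb(rsu_units, increment_mb):
    """Storage set aside for reconfiguration when RSU is coded as a bare number."""
    return rsu_units * increment_mb

# Example from the text: RSU=20 with a 64M storage increment.
before_mb = rsu_reserved_mb(20, 64)

# Hypothetical: the same RSU=20 after an upgrade changes the increment to 256M.
after_mb = rsu_reserved_mb(20, 256)

print(before_mb / 1024, "GB before;", after_mb / 1024, "GB after")
```

Coding the value with an explicit M, G, or % qualifier avoids this dependence on the increment size.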
RSU in IEASYSxx
Explanation:
• For best performance, it is recommended that
RSU=0 be coded (Health check: RSM_RSU)
• If you need to code an RSU value, use units of M, G
or %, instead of a number (which means storage
increments)
• Storage increment size can change after a
machine upgrade or increase in real storage (see
PR/SM Planning Guide)
PROGxx REFRPROT to protect code
Problem:
• Overlays to code are difficult to debug and can
cause serious system impact.
Example:
• Recently a customer experienced a 1-bit overlay to
authorized code living in Key0 private storage in a CICS
region.
• This 1-bit code overlay led to a 5-word overlay of code in
Key0 CSA storage.
• Recurring ABEND0C1 errors in the CSA-resident code
had significant system impact.
8/9/2013 Copyright IBM 2013 28
PROGxx REFRPROT to protect code
Recommendation:
• Use the REFRPROT statement type to specify
that REFR programs are to be protected from
modification by placing them in key 0, non-fetch
protected storage, and page protecting the full
pages.
• Place REFRPROT in PROGxx parmlib member
OR
• SETPROG REFRPROT
• REFRPROT protects all REFReshable modules,
regardless of APF authorization
For more information on protection of REFR programs, see z/OS MVS Program
Management: User's Guide and Reference.
PROGxx REFRPROT to protect code
Explanation:
• Use the PROGxx REFRPROT option in test
environments to surface such issues before the
problem code makes it to production.
• Page protects all full-page portions of load modules
linked as REFReshable.
• Any attempt to alter page-protected storage results in
an ABEND0C4 PIC4 and the overlay is averted.
• Dump/logrec of the ABEND0C4 can be used to
determine the culprit.
• Problem program may produce dump/logrec as a result of the
ABEND0C4.
• SLIP can be used to gather documentation on a recurrence.
SDUMP MAXSPACE
Problem:
DB2 dump was partial due to reaching
MAXSPACE. What should I set MAXSPACE
to?
Since dump processing will write captured storage to a dump data set on DASD
as soon as the dump data capture completes, the presence of captured data for
multiple dumps would imply an issue with obtaining the storage needed to
allocate dump data sets.
SDUMP MAXSPACE
Explanation:
• MAXSPACE parameter acts as a throttle to
limit the maximum amount of virtual storage
that SDUMP can “capture” at any given time.
• Storage can belong to one or more captured
SDUMPs
• MAXSPACE set via CHNGDUMP (CD) command
• CD SET,SDUMP,MAXSPACE=yyyyyyyyM
(value in megabytes; default = 500M, can range from 1-99999999)
Since dump processing will write captured storage to a dump data set on DASD
as soon as the dump data capture completes, the presence of captured data for
multiple dumps would imply an issue with obtaining the DASD storage needed to
allocate dump data sets.
SDUMP MAXSPACE
Solution:
1. Check sizes of your largest dumps. Given
these sizes, what seems like a reasonable
value for MAXSPACE?
SDUMP MAXSPACE
Solution:
4. If Answer1 > Answer2, then you need to
make a decision.
• To minimize the likelihood of a partial dump,
increase your AUX storage definition to at least 3
times the MAXSPACE that you require.
• If you are not in a position to increase your aux
storage definition, then you will need to lower
MAXSPACE to one third of the defined aux storage size.
Considerations:
• Partial dumps compromise the ability to diagnose
critical problems
• SDUMP tries to dump storage strategically by starting
with the more critical areas of storage
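The sizing guideline in the bullets above (AUX storage at least 3 times the required MAXSPACE, otherwise MAXSPACE lowered to one third of the defined AUX) can be sketched as a small Python check. The function name, the sample figures, and the use of megabytes for both inputs are assumptions for illustration; the sketch does not claim to reproduce the Answer1/Answer2 steps, which are not shown in full here.

```python
def maxspace_check(required_maxspace_mb, aux_defined_mb, factor=3):
    """Apply the 'AUX at least 3 times MAXSPACE' guideline from the bullets above."""
    aux_needed_mb = required_maxspace_mb * factor      # AUX needed to keep the desired MAXSPACE
    max_supported_mb = aux_defined_mb // factor        # MAXSPACE the current AUX can support
    return {
        "fits": aux_defined_mb >= aux_needed_mb,
        "aux_needed_mb": aux_needed_mb,
        "max_supported_maxspace_mb": max_supported_mb,
    }

# Hypothetical figures: largest dumps suggest MAXSPACE=6000M, 12000M of AUX is defined.
result = maxspace_check(required_maxspace_mb=6000, aux_defined_mb=12000)
print(result)
```

With these sample numbers the check does not fit, so the choice is the one described above: grow AUX to the needed size, or lower MAXSPACE to the supported value and accept a higher chance of partial dumps.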
SDUMP AUXMGMT
Problem:
I ran into AUX storage issues when taking an
SVC dump. I'm using a reasonable MAXSPACE.
Why did this happen ?
SDUMP AUXMGMT
Explanation:
Even with a properly set MAXSPACE, SDUMP
can still trigger an AUX storage condition if the
overall system is using a sizeable amount of
AUX storage. The AUXMGMT parameter offers
additional system protection.
SDUMP AUXMGMT
Solution:
Use AUXMGMT parameter!
SDUMP AUXMGMT
Problem:
AUXMGMT protection detected aux storage
usage greater than 50% and is preventing any
new SVC dumps from being taken. How do I
recover my system’s ability to take a dump?
SDUMP AUXMGMT
Explanation:
A low threshold of 35% must be attained
before SDUMP processing is
allowed to resume.
• Resetting AUXMGMT=OFF after AUX
storage utilization has reached the 50%
threshold will *not* relieve the above low
threshold requirement! Once you hit the
AUXMGMT ON limit you MUST hit the low
limit (35%) before SDUMPs will again be
allowed.
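The two-threshold behavior described above is a form of hysteresis, and a minimal Python model may make it easier to reason about. The class name is hypothetical, and the exact boundary comparisons (at or above 50% to block, at or below 35% to resume) are assumptions; only the 50% and 35% values come from the text.

```python
class AuxmgmtGate:
    """Toy model of the AUXMGMT thresholds: block at the high limit, resume at the low limit."""

    HIGH_PCT = 50.0   # from the text: new SVC dumps are prevented at this AUX usage
    LOW_PCT = 35.0    # from the text: usage must fall to this level before dumps resume

    def __init__(self):
        self.dumps_allowed = True

    def update(self, aux_pct):
        """Feed the current AUX utilization and return whether new dumps are allowed."""
        if self.dumps_allowed and aux_pct >= self.HIGH_PCT:
            self.dumps_allowed = False
        elif not self.dumps_allowed and aux_pct <= self.LOW_PCT:
            self.dumps_allowed = True
        return self.dumps_allowed

gate = AuxmgmtGate()
samples = [40, 52, 45, 36, 34, 40]
history = [gate.update(pct) for pct in samples]
print(history)
```

Note that in this model 45% and 36% still block dumps once the high limit has been hit, which mirrors the point that dropping back under 50% is not enough.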
SDUMP AUXMGMT
Solution:
There are two ways to attain the low limit:
1. CANCEL the address spaces that have pages on AUX, or wait for
them to free the storage or for the jobs to end
OR
2. Add page datasets such that the percentage of
overall AUX slots in use is then below 35%. If you
hit an AUXMGMT limit and cannot add additional
page datasets, you will have to revert to option 1.
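For option 2, a rough Python sketch of the arithmetic: given the slots currently in use and the total slots defined, it computes the smallest number of additional slots that brings utilization strictly below 35%. The function name and sample counts are hypothetical, and the sketch assumes the slots in use stay constant while page datasets are added.

```python
def slots_to_add(used_slots, total_slots, low_pct=35):
    """Smallest number of extra AUX slots so that used / (total + extra) < low_pct percent."""
    # Integer arithmetic: need used * 100 < low_pct * (total + extra).
    needed_total = used_slots * 100 // low_pct + 1
    return max(0, needed_total - total_slots)

# Hypothetical system: 1,000,000 slots defined, 520,000 in use (52%).
extra = slots_to_add(used_slots=520_000, total_slots=1_000_000)
print(extra, 520_000 / (1_000_000 + extra))
```

If the result is more page space than can be added, option 1 remains the only way to reach the low limit.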