Ciss1 4om
September, 2011
Release 1.4
1. Introduction
1.1 Summary
A Production Operations Manual (POM) defines the specific technical and operational processes that
must be carried out on a daily, weekly, monthly, or yearly basis. A POM is an application/system-specific
document containing detailed topology, dependencies, monitoring specifics, maintenance windows, etc.
Additionally, it contains the system’s scheduled events (regular production jobs, performance reporting,
maintenance windows, etc.). The POM provides Field Operations staff with the necessary instructions to
operate and support production computer systems.
Production support for the CISS/OHRS Production System is divided or shared between the
Enterprise Operations & Infrastructure (EOI) and Product Development groups within the Office of
Information & Technology (OI&T), and Corporate Data Center Operations (CDCO).
1.2 Purpose
The purpose of this document is to:
• Be used as a reference manual for the daily operation and maintenance of CISS/OHRS
• Assist support personnel in the resolution of system issues
• Assist in the capacity, maintenance, and upgrade planning of CISS/OHRS
1.3 Scope
The scope of this document is limited to CISS/OHRS. References to external systems are made only to
describe an interface and how the interface and the external system affect the operation of CISS/OHRS,
or to describe a tool that may be used for system monitoring or for support and issue resolution.
The MOU serves as the signatory document that invokes the SLA. The SLA/SLRs are referenced in the
appendix of the MOU, allowing them to be managed or modified without renegotiating the entire MOU.
CISS/OHRS Production Operations Manual September, 2011
After the SLR is negotiated, it results in an agreed Service Level Target (SLT) with metrics, measurement
techniques, and assumptions. The SLA and SLTs are a combined document.
All key functions are assigned to one or more responsible parties, and activities are clearly defined in
order to maintain and support the applications and system components throughout their life cycle. These
roles and responsibilities are displayed in a tabular RACI format at the end of each section of the plan to
further define Responsibility, Accountability, Consultation, and Information roles.
While implementing the CISS framework and the OHRS application, the CISS project team follows an
agile software methodology to support rapid programming and short six-month releases to production.
For more information please view the Agile Software Development Methodology and other documents
available on the CISS TSPR page:
(http://tspr.vista.REDACTED/warboard/anotebk.asp?proj=1256&Type=Active).
This document contains instructions to help System Operators administer and troubleshoot the delivered
software. System Operators are defined as IT staff at the data centers where CISS is deployed.
The architectural design of each of the three groups consists of different redundancies:
• The database servers are to be clustered at the OS level and at the database application level. The
database servers are connected to a SAN, for additional storage, redundancy, and availability.
• The two web servers are designed to function identically, though they are not clustered. OS-level
synchronization keeps the two servers consistent.
• The two application servers are not clustered at the OS level, but are clustered at the Application
level. OS level synchronization and application implemented clustering maintain the redundancies.
The systems currently implemented are HP ProLiant DL380 G5 servers with dual quad-core 64-bit
Intel® Xeon® E5420 CPUs @ 2.50 GHz, dual power supplies, dual Gigabit network interfaces, an
iLO2 (Integrated Lights-Out) management port, six RAID-controlled HDDs, and 16 GB of memory.
The operating systems are Microsoft Windows 2003 Enterprise and Red Hat Enterprise Linux 5.x.
All systems are attached to the site’s Gb network. The iLO ports have not yet been implemented;
initially the core switches did not have enough available ports.
Six of the Servers reside at Falling Waters, WV (CDCO), the Production site. Seven other Servers are
located at Hines, IL. The Hines site is considered the Disaster Recovery (DR) site.
The Hines, IL data center’s initial implementation did not have SAN storage available to attach to the
database servers; the alternative was for the database servers to run on their local storage and leverage
mirroring between the two database servers. Another difference is an additional MS Windows server
that acts as the MS SQLServer “Witness” server. The “Witness” server monitors the two Hines database
servers and designates which server is the primary node and which is the standby node.
The Applications:
• BEA / Oracle WebLogic 10.3.2
• Microsoft SQLServer 2005
• Apache 2.2.3
• VistALink 1.5
o Local Drives
C: 40 GB
D: 96 GB
E: 292 GB
F: 254 GB
o SAN Attached ( If attached, and if the Active server in the Cluster )
G: 102 GB
H: 102 GB
J: 102 GB
L: 34 GB
M: 17 GB
O: 85 GB
Q: 500 MB
• RHEL Application server
o /dev/mapper/rootvg-root 992M /
o /dev/mapper/rootvg-opt 3.9G /opt
o /dev/mapper/rootvg-var 3.9G /var
o /dev/mapper/rootvg-tmp 3.9G /tmp
o /dev/mapper/rootvg-usr 3.9G /usr
o /dev/mapper/rootvg-home 2.0G /home
o /dev/cciss/c0d0p1 251M /boot
o tmpfs 7.9G /dev/shm
o /dev/mapper/rootvg-u01 97G /u01
o /dev/mapper/rootvg-u02 9.9G /u02
o /dev/mapper/rootvg-u03 9.9G /u03
o /dev/mapper/rootvg-u04 9.9G /u04
Numerous scripts are involved in monitoring and synchronizing the server systems.
The SQL backups are stored on the mapped, SAN-attached H: drive. The database servers run an OS-level
backup at 5:00 A.M. every day. The DOS batch script checksums the latest backups, uses XCOPY to copy
the files to the DR servers, and purges any files that are more than five days old.
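The purge step of that job can be sketched as follows. This is an illustrative Python sketch, not the production DOS batch script: the function name, the directory argument, and the use of file modification time to judge age are all assumptions.

```python
import os
import time

FIVE_DAYS_SECONDS = 5 * 24 * 60 * 60


def purge_old_backups(backup_dir, max_age_seconds=FIVE_DAYS_SECONDS, now=None):
    """Delete regular files in backup_dir that are older than max_age_seconds.

    Age is judged by modification time (an assumption). Returns the sorted
    names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(backup_dir):
        path = os.path.join(backup_dir, name)
        if not os.path.isfile(path):
            continue  # leave subdirectories untouched
        if now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return sorted(removed)
```

Passing `now` explicitly makes the cutoff reproducible, which helps when rehearsing the purge against a copy of the backup directory.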
The Linux application and web servers each have scheduled jobs:
The web servers monitor any PAID files that arrive and rename the file with a Date/Time stamp. If
processed files are found, after OHRS has uploaded the PAID content into the OHRS database, the files
are TAR GZIP’d into an archive file.
Example Crontab:
#* * * * * [command to be executed]
#- - - - -
#| | | | |
#| | | | +----- day of week (0 - 6) (Sunday=0)
#| | | +------- month (1 - 12)
#| | +--------- day of month (1 - 31)
#| +----------- hour (0 - 23)
#+------------- min (0 - 59)
*/10 * * * * /bin/bash /usr/local/bin/cissPAID_update_filename.bsh
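The renaming step performed by the scheduled script can be sketched in Python. This is a hypothetical illustration, not the contents of cissPAID_update_filename.bsh: the function names, the timestamp format, and the use of the file's modification time as its Date/Time stamp are assumptions.

```python
import os
import time


def timestamped_name(path, fmt="%Y%m%d%H%M%S"):
    """Return the new path for an uploaded file, named by its modification time."""
    directory, filename = os.path.split(path)
    _, ext = os.path.splitext(filename)  # e.g. ".PAID"
    stamp = time.strftime(fmt, time.localtime(os.path.getmtime(path)))
    return os.path.join(directory, stamp + ext)


def rename_uploaded_file(path):
    """Rename path in place and return the new path; refuse to overwrite."""
    new_path = timestamped_name(path)
    if os.path.exists(new_path):
        raise FileExistsError(new_path)
    os.rename(path, new_path)
    return new_path
```

Refusing to overwrite an existing target is a design choice in this sketch so that two uploads with the same timestamp are never silently merged.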
The web servers also run a similar process for VAADERS data.
################################################################
#* * * * * command to be executed
#- - - - -
#| | | | |
#| | | | +----- day of week (0 - 6) (Sunday=0)
#| | | +------- month (1 - 12)
#| | +--------- day of month (1 - 31)
#| +----------- hour (0 - 23)
#+------------- min (0 - 59)
ROOT -- Webservers
1 0 * * * /bin/bash /var/www/cgi-bin/SAMBA_update.bsh -R
PBM -- Webservers
Weblogic
2 7 * * * /bin/bash /u01/app/bea/weblogic/domains/CISSDomain_Prod/bin/check_vlj_connectors.bsh
*/4 * * * * ~/weblogic/common/bin/wlst.sh /usr/local/etc/Connections.py ~/ciss.properties >/dev/null 2>&1
2 7 * * * ~/weblogic/common/bin/wlst.sh /u01/app/bea/weblogic/domains/CISSDomain_Prod/WLST_scripts/vljMonitor.py ciss.properties ALL
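To make the five schedule fields in the crontab entries above easier to read, the following Python sketch evaluates whether a schedule fires at a given time. It is a simplified illustration and not part of the delivered scripts: it handles only `*`, `*/n`, comma lists, and plain numbers (no ranges or names), and it requires all five fields to match, which differs from cron when both day fields are restricted.

```python
from datetime import datetime


def field_matches(field, value, low=0):
    """Match one crontab field; `low` is the smallest legal value of the field."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if (value - low) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def schedule_matches(schedule, when):
    """Return True if the five-field schedule fires at the datetime `when`."""
    minute, hour, dom, month, dow = schedule.split()
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(dom, when.day, low=1)
        and field_matches(month, when.month, low=1)
        and field_matches(dow, when.isoweekday() % 7)  # Sunday=0, as in the legend
    )
```

Read this way, `*/10 * * * *` fires at minutes 0, 10, 20, and so on of every hour, and `2 7 * * *` fires once a day at 07:02.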
3. Routine Operations
Linux bash scripts extract data from the different servers and systems; the data is gathered, parsed,
and output as CSV, XML, or flat files, or sent directly by email. The systems administrator monitors
WebLogic JVM (Java Virtual Machine) memory, file system usage, and VistALink adapter connectivity
via dashboards, consoles, or received emails. The systems administrator also deploys new artifacts
during planned outages, stops and starts the WebLogic managed servers, and monitors system backups.
Routine OS patches and updates are performed via mechanisms standard to the OS.
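As an illustration of the file system usage monitoring mentioned above, the following Python sketch flags mount points that are close to full. It is not one of the production bash scripts; the function names and the 90 percent threshold are assumptions chosen for the example.

```python
import shutil


def percent_used(total, used):
    """Return used space as a percentage of total, rounded to one decimal place."""
    return round(100.0 * used / total, 1)


def over_threshold(mount_points, threshold_percent=90.0, usage=shutil.disk_usage):
    """Return (mount_point, percent) pairs whose usage meets or exceeds the threshold."""
    flagged = []
    for mount in mount_points:
        stats = usage(mount)  # named tuple with total, used, free (bytes)
        pct = percent_used(stats.total, stats.used)
        if pct >= threshold_percent:
            flagged.append((mount, pct))
    return flagged
```

A call such as `over_threshold(["/u01", "/var", "/tmp"])` would check some of the application server mount points listed earlier; the result could then be written to a CSV file or emailed.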
The database administrator will monitor database growth, replication, and backups. The database
administrator will perform updates, upgrades, and maintenance to the database or database engine.
cd ${HOME};
getstatus.sh ciss.properties ;
Run the lsof command, looking for the Admin server port 7100:
o Command Line:
cd ${HOME};
startservers.sh ciss.properties ALL ;
getstatus.sh ciss.properties ;
o Via the Admin console:
Log into the Admin console using the host name and admin port for the URL:
Example: http://vhancrcissa901:7100/console
Select the Control tab
Select the Check box next to the managed servers ( Srv1, Srv2, … )
Click the ‘Start’ Button, located above the list of managed servers.
Press F5 to refresh or click the button with the curved circular arrows, to check the Status
• Verify that Apache is running:
• Either via command line or console, stop the WebLogic managed servers
o If the Admin server was not selected, repeat the three previous steps to shut down the WebLogic
Admin server.
• Shut down the Linux server OS.
o The SQLServer Database engine and any other processes will be brought down normally through
the system services.
• Shut down the Windows server
• Click the Start button
WARNING: Completing the next set of screens will HALT the server. It cannot be powered back on
unless iLO access has been configured or a person is physically present at the server.
• Click OK.
o Database Administrator can elaborate. The internal backup schedule backs up the transaction
log and all databases, each exported to its own separate directory named for the corresponding
database.
4:00 A.M.
• SAN Device
o Full backups are run every evening at 6:00 P.M., and the following directories are backed
up to tape (mentioned in the section Storage and Rotation).
o Daily Linear Tape-Open (LTO) tapes are stored locally at the site.
o An official request must be made to the EMC staff to restore data from tape and an alternate
location must be established to restore the file(s).
• Database
o A full database backup is performed each morning at 4:00 A.M. immediately followed by a
Transaction Log backup. The same process is repeated at 9:00 P.M. The backup files are stored on
a drive that is distinct from the database files. The database files are archived on to the disaster
recovery server, and the storage area network engineers perform server level backups that send
the files to a secure off-site facility. The OHRS database is backed up and recovered using SQL
Server Maintenance Plan backups. Recovery is achieved by using SQL Server Enterprise Manager,
selecting “Restore Database” under database tasks, and selecting the latest backup and
transaction log file.
• Both FW and Hines also use new Quantum i6000 tape libraries as tape backup systems. Each of
these new tape library units provides six Linear Tape-Open (LTO)-4 tape drives to back up data
to tape (an LTO-4 tape holds 800 GB of data native or 1,600 GB compressed).
• Backups are performed daily/nightly via the network for most systems.
• Times in which backups are performed are based on the requirements and input of the corresponding
project owner, Database Administrator (DBA), application owners, etc.
• EMC staff performs a monthly backup which is retained per VA long-term retention policies.
• Tape rotation: all short-term retention tape backups (90 days or less) rotate based on the need and
when tapes have expired.
• All short-term retention tape backups are temporarily stored in a secure location onsite (once the tapes
have been ejected from the tape library) until they have expired and can be recycled/reused.
• All long-term retention tape backups are stored at a secure offsite facility (Iron Mountain). Iron
Mountain usually makes one scheduled pickup per week, so tapes are shipped offsite on a routine
basis.
• Retention is based solely on the requirements of the specific project – the standard for the retention
of all backup data is 90 days, and then the tapes are recycled. Monthly backups are retained for two
years (at Iron Mountain) under the current standard VA retention policy (mentioned above).
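The retention rule stated above can be expressed as a short Python sketch. The 90-day and two-year periods come from the text; the function name, the category labels, and the treatment of two years as 730 days are assumptions made for illustration.

```python
from datetime import date, timedelta

# Retention periods taken from the policy text above; the key names are
# illustrative assumptions.
RETENTION = {
    "daily": timedelta(days=90),
    "monthly": timedelta(days=2 * 365),  # "two years", ignoring leap days
}


def can_recycle(backup_date, kind, today):
    """Return True once a tape's retention period has fully elapsed."""
    return today - backup_date > RETENTION[kind]
```

For example, a daily backup written on 1 January 2011 could be recycled by 1 September 2011, while a monthly backup from the same date could not.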
Users are identified by their roles in the project, by the VA and contracting staff. Roles and permissions
are determined by the Systems Admin and Database Admin, and are communicated to the team.
• WebLogic:
The user requests access to the WebLogic application from the Systems Administrator (SA). The SA
determines the access needs and reviews alternate means of access that could satisfy the user’s request.
If console access is required, READ-ONLY/MONITOR access is granted unless more access is necessary.
• Portal:
Users request access from their local occupation administrative staff, and the request is communicated to
the VA PM for approval. The VA PM grants the user access by using LDAP to associate the user’s VA ID
with the CISS/OHRS LDAP organizational unit (OU).
• OHRS application:
A user’s portal access establishes the user’s roles and privileges and the options available within the
OHRS application.
o Users are identified by their role in the project or dependency, and granted the minimum access to
achieve their task or role.
The account is created with details of the person’s name and title.
Specific rights are given to the user’s log in, and any additional sudo (Linux) rights and
privileges are configured.
(Linux) The amount of time the user requires access determines the account expiration.
• WebLogic:
All users can access the portal using their VA user ID and password.
• OHRS:
o Only a user ID associated with the OHRS portal is given the ‘OHRS’ button to access the OHRS
application.
o Each person is granted a role or set of roles in the OHRS application, and depending on the role,
each person has access to different aspects of the OHRS application and permission to execute
different tasks within the application.
o The Linux servers have standard monitoring scripts that send emails to the root user. Those
standard scripts, in conjunction with Multi Router Traffic Grapher (MRTG) and some custom
scripts, inform the Systems Administrator and the Database Administrator of any issues or
pending issues.
o No tool can monitor every aspect, so custom scripts are created to run against Windows, Linux
SNMP, and the WebLogic Scripting Tool (WLST).
o MRTG is used for its ability to chart any numerical data and has other underlying abilities.
o Using PHP, WLST, MRTG, custom scripts, and system generated emails, the Systems
Administrator has a wealth of options and avenues of monitoring the systems.
• Database Administration :
o The Microsoft SQLServer has some internal Reporting and Monitoring capabilities.
o Emails are sent to the Systems Admin and Database Admin for the nightly backups.
o The Replication manager tool within the Enterprise Studio is used in order to monitor the
performance of replication between the CISS-OHRS server and DR servers.
o The traffic will pass through the Load Balancer to the Apache servers.
o The WebLogic plug-in will validate the connectivity to the WebLogic-managed server located on
the Application server.
If the application is available, the network traffic is passed to the WebLogic-managed server.
• Data can be analyzed actively, ‘On demand’, or post event. Each is made available via different
means.
• Network usage
• WebLogic can monitor JMS messaging and the Console can show problems. The need for monitoring
this has been very minimal; no further action has been requested.
o SDS update
An Email is sent to all projects that officially use the SDS database.
A manual update is done by the Database Administrator and requires the OHRS application
to be recycled. This allows clean connections to the updated data.
o PAID Data
o The updates are automated at the source, and can be uploaded multiple times a day.
o A Linux cron job executes a custom script to check for any new files and rename the
uploaded DATA.PAID files to the {Time stamp of when the file was created}.PAID.
o This custom script also initiates a Re-sync to all of the Production and DR web servers.
Twice a month, the OHRS application internal job loads sequentially all *.PAID files,
renaming the files {Time stamp of Processed}.PAID.{Time processed by OHRS
app}.processed
o A final MD5 checksum is performed on the final files before removing the uncompressed
processed files.
o Database extracts are performed periodically for use in the PROD Mirror and SQA testing
Databases. DBA to expand.
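The final PAID housekeeping steps described in this section (archiving processed files, checksumming them with MD5, and removing the uncompressed copies) can be sketched in Python as follows. This is a hypothetical illustration rather than the production script: the function names are invented, and verifying each archived member against its checksum before deletion is an assumed ordering.

```python
import hashlib
import os
import tarfile


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_and_remove(paths, archive_path):
    """Pack files into a .tar.gz, verify each member's MD5, then delete the originals."""
    checksums = {os.path.basename(p): md5_of_file(p) for p in paths}
    with tarfile.open(archive_path, "w:gz") as archive:
        for p in paths:
            archive.add(p, arcname=os.path.basename(p))
    # Re-read the archive and compare digests before anything is removed.
    with tarfile.open(archive_path, "r:gz") as archive:
        for name, expected in checksums.items():
            data = archive.extractfile(name).read()
            if hashlib.md5(data).hexdigest() != expected:
                raise ValueError("checksum mismatch for " + name)
    for p in paths:
        os.remove(p)
    return checksums
```

Deleting the originals only after the archive has been re-read means a truncated or corrupt archive leaves the processed files in place.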
4. Exception Handling
4.1 Routine Errors
Like most systems, CISS/OHRS may generate a small set of errors that may be considered “routine.”
These errors are routine in the sense that they have minimal impact on the user and do not compromise
the operational state of the system. Most of the errors are transient in nature and only require the user to
retry an operation. The following sub-sections describe these errors, their causes, and what, if any,
response an operator needs to take.
While the occasional occurrence of these errors may be routine, getting a large number of an individual
error over a short period of time is an indication of a more serious problem. In that case, the error needs to
be treated as an exceptional condition.
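One way to tell routine occurrences from a burst is to count how many times an error appears within a sliding time window. The Python sketch below shows the idea; this document does not define a threshold or a window length, so both are parameters the operator would need to choose, and the function name is hypothetical.

```python
from collections import deque


def first_burst(timestamps, threshold, window_seconds):
    """Return the time at which `threshold` events fall within `window_seconds`, else None.

    `timestamps` are event times in seconds, in ascending order.
    """
    recent = deque()
    for ts in timestamps:
        recent.append(ts)
        # Drop events that have aged out of the window ending at ts.
        while ts - recent[0] > window_seconds:
            recent.popleft()
        if len(recent) >= threshold:
            return ts
    return None
```

The timestamps would typically be parsed from the relevant log file for a single error message.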
4.1.1 Security
• Reverse mapping checking getaddrinfo for xxxx-lt.vha.REDACTED failed - POSSIBLE
BREAK-IN ATTEMPT!
o This error is normal because of the slowness of the VPN DNS updates.
o When a user connects via VPN into the VA network, the DNS system should be updated so that
the connecting workstation/laptop hostname is associated with its IP address. DNS is slow to
propagate the changes across the VA network, so when the Linux server does a reverse lookup of
the requesting IP, the hostnames do not match.
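The consistency check behind this message can be approximated in Python: look up the hostname for the connecting IP, resolve that hostname forward again, and confirm that the original IP is among the results. This is a sketch of the general technique, not the SSH daemon's own code; the function name is an assumption, and the resolver functions are passed in as parameters so the logic can be exercised without live DNS.

```python
import socket


def reverse_lookup_consistent(ip, reverse=socket.gethostbyaddr,
                              forward=socket.gethostbyname_ex):
    """Return True if the reverse-lookup hostname for `ip` resolves back to `ip`."""
    try:
        hostname = reverse(ip)[0]          # (hostname, aliases, addresses)
        addresses = forward(hostname)[2]   # (hostname, aliases, addresses)
    except (socket.herror, socket.gaierror):
        return False  # no reverse record, or the hostname does not resolve
    return ip in addresses
```

A stale DNS entry for a VPN client shows up here as a hostname whose forward lookup returns a different address, which is the discrepancy described above.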
4.1.2 Time-outs
• Each user has a 15-minute inactive timer on the Linux servers. Once the time has expired, their
current session is logged out.
• VistALink Adapters have timeouts while attempting to connect to the configured Port and IP for the
associated Station ID.
4.1.3 Concurrency
Not applicable
• Application Errors, related to Database Connection, JVM sizes, and deployable artifact problems.
o All Apache logs are written to the directory /var/log/httpd/, unless otherwise specified in the
Apache configuration files.
• Security / SSH
o /var/log/secure
• General Messages
o /var/log/messages
4.2.3.1 Database
• One error situation with SQL Server databases, sometimes encountered when a DBA has not been
continuously monitoring the database, occurs when the Transaction Log file grows so large that
there is a danger of running out of disk space. This is a serious issue. The size of the database
files must be monitored so that this situation does not occur.
• The correct way to control the growth of the Transaction Log file is to regularly perform a full
database backup and IMMEDIATELY follow it with a Transaction Log backup. This process may
even be repeated once, back-to-back; this should shrink the size of the Transaction Log.
• Note however, in a replicated scenario like OHRS, if the replication queue is broken, then the
transaction log grows in size, and it does not shrink even with the backup. In this case, the replication
needs to be restored first so that the queues get flushed.
• A rarely encountered but serious error situation happens if the SQL Server database were to ever go
into suspect status. In this situation, the only way to recover would be to have the latest database and
transaction log files available, and run the following steps one at a time:
o DBCC checkdb('CISS');
o Root cause
exception: REDACTED.vistalink.security.m.SecurityTooManyInvalidLoginAttemptsFaultException:
Fault Code: 'Server'; Fault String: 'Logon Failed'; Fault Actor: ''; Code: '183005'; Type: '';
Message: 'Logon failure: 'Device/IP address is locked due to too many invalid signon attempts.''
4.2.3.4 Network
Network failures are reported in numerous locations: the DMESG system, /var/log/messages, the
application logs, and the WebLogic-managed server’s logs. The nature of the problem determines where
the error will be reported.
o Linux
The /var/log/secure files will report any issues connecting and authenticating the user.
o Samba
The /var/log/samba/smbd.log file will report any authentication or connectivity issues.
o VSFTP
The /var/log/vsftp.log file will report any errors connecting to the VSFTP service.
• Verify that the application server is started and that NFS is mounted to the web server, then start
the WebLogic Admin server and the managed servers.
5. Continuity of Operations
Not applicable
6. Disaster Recovery
DR is a manual process.
6.1 Required:
Access to Falling Waters and Hines via the following servers:
6.2 Assumptions:
Production environment, passwords, SQL server DBA skills, RHEL / MS Windows 2003 SA skills,
proper user level permissions, all servers are running at proper/normal runtime levels.
These procedures assume that the fail-over is planned and all sites are operational. They do not discuss
returning to normal operations, which could be done by executing the same procedures again with Hines
as the initial site and Falling Waters as the destination.
If the primary data center is down, you will need to rely on the replicated database at the DR site.
Rolling text:
Rolling text:
• Click Commit:
If grayed out (like above), you must contact the Load Balancer Administrator to grant you more
permissions.
• Select “fw”
• Click Commit:
If grayed out (like above), you must contact the Load Balancer Administrator to grant you more
permissions.
• Select “fw2”
• Click Commit:
If grayed out (like above), you must contact the Load Balancer Administrator to grant you more
permissions.
• Verify that the normal front door is accessible
o http://vaww.ciss.REDACTED/
7. System Support
An understanding of how CISS/OHRS is supported by various organizations within the VA is important
to operators and administrators of the system. If you are unable to resolve an issue, you need to
understand how to obtain support through OI&T’s system support organizations. The following
sections describe the support structure and provide procedures for obtaining support.
The information in these sub-sections is a summary of parts of the CISS/OHRS O&M plan. This
document is available in ClearCase and should be used if additional information is required.
• Servers Hardware
After a system administrator has evaluated a problem and determined that it is a hardware
problem or beyond the administrator’s ability to repair, the hardware vendor, HP, needs to be
contacted. HP staff will require additional information to troubleshoot the problem.
• Operating systems
Each vendor (Microsoft and Red Hat) has a designated VA representative and that person should
be contacted initially to help escalate the issues to the vendor’s support systems.
• WebLogic
An Oracle Support ID and login are required to contact Oracle support. There is a POC in the VA
who manages the Oracle Support Identifiers.
• SQLServer
Please follow up with the Microsoft Representative.