
Operation and Maintenance Guide for OSIX and Jupiter

Sun Cellular Signalling Monitoring Tool Operation & Maintenance Guide


Rev 2.0: January 19, 2013

Table of Contents
1. Introduction
2. Polystar Support Process and Infrastructure Workflow
3. System Architecture and Data Flow
4. System Health Monitoring
5. Regular O&M Tasks

1. Introduction
This document highlights the processes and tasks required to maintain the reliability of the Sun Cellular Signalling Monitoring Tool. These are:
- Polystar Support Process and Infrastructure
- System Architecture and Data Flow
- System Health Monitoring
- Regular and Periodic Checks

SUN performs the above tasks to support efficient problem management and resolution.

2. Polystar Support Process and Infrastructure Workflow


Polystar provides comprehensive support to maximize system uptime for the SUN Signaling Monitoring Tool. This support mechanism is highlighted in the diagram below.

From the moment SUN SMT reports a problem, whether by logging a trouble report in the Vision Project or by telephone, the case is handled and tracked until it is closed according to the SLA between SUN SMT and Polystar.

Incident handling and case management are provided by a three-level support infrastructure.

While the responsibility for finding the root cause of a problem and fixing it lies with Polystar, Sun Cellular can help improve the effectiveness of trouble-ticket handling and reduce the resolution time for every reported problem. This is achieved by performing the high-level checks and system status procedures described in the next sections.

Support Contact Number and Email:
Lam Nguyen
Telephone: +65-81579599
Email: [email protected]

3. System Architecture and Data Flow


The key to effective problem reporting and first-level troubleshooting is understanding the building blocks of the solution, its dependencies, and the data flow within the system.

The diagram above shows the logical connections and data flow of the SUN Signalling Monitoring Tool:
1. Data message signalling units (MSUs) enter the system through the Media Probes (Sigtran MAP/CAP) and E1 LIMs (SS7 MAP).
2. The MSUs collected by the Media Probes (MP) and E1 LIMs (LIM) are distributed and load-balanced by the Routers (RTR) to the Probe Servers (PRS).
3. The PRS decodes the MSUs, stores them (SOS DB), generates XDRs, and streams the XDRs to the QPS Servers.
4. The QPS Servers perform XDR enrichment and KPI aggregation. The results are stored on the QSS Storage Servers.
5. The Raw QSS Storage Servers store the XDRs, and the Agg QSS Storage Servers store the KPIs.
6. Information is presented to the user through two (2) different application servers, namely:
   a. GLS for the OSIX Client
   b. Jupiter Web Server for the Reporting Application
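The stages above can be summarized in a short Python sketch. The component names follow the guide; the structure itself is illustrative only and not the product's actual code:

```python
# Illustrative outline of the OSIX data flow, from capture to presentation.
PIPELINE = [
    ("MP/LIM", "capture MSUs from Sigtran (MAP/CAP) and SS7 E1 links"),
    ("RTR", "distribute and load-balance MSUs across Probe Servers"),
    ("PRS", "decode MSUs, store them (SOS DB), generate and stream XDRs"),
    ("QPS", "enrich XDRs and aggregate KPIs"),
    ("QSS", "store XDRs (Raw) and KPIs (Agg)"),
    ("GLS/Jupiter", "present data to the OSIX Client and Reporting users"),
]

def data_path():
    """Return the ordered component names an MSU traverses."""
    return [name for name, _ in PIPELINE]
```

Walking `data_path()` gives the same order as the numbered steps above, which is useful when deciding at which stage a data-loss problem should be investigated first.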

4. System Health Monitoring


This section describes the system KPIs that the Operations NOC Team must monitor to ensure reliable operation of the SUN Signalling Monitoring Tool and maximize system uptime. They are divided into four (4) areas, each with an associated alarm status:
a. Server (OS Tab): Minor
b. Processes Runtime (Runtime Tab): Critical
c. CDR Buffering and Discards (CDR Tab): Major for Buffering, Critical for Discard
d. Collector Information (Collector Information Tab): Critical

The hosts node lists all the OSIX servers and their status (Green, Yellow, Red). A tab's status button turns Yellow or Red depending on the status of the underlying KPIs behind that tab.
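The roll-up behaviour can be illustrated with a minimal sketch. It assumes a Green < Yellow < Red severity order; the product's actual aggregation logic is internal to OSIX:

```python
# Assumed severity ordering for the host-node status roll-up.
SEVERITY = {"Green": 0, "Yellow": 1, "Red": 2}

def host_status(tab_statuses):
    """A host node shows the worst (highest-severity) status of its KPI tabs."""
    return max(tab_statuses, key=SEVERITY.__getitem__)
```

For example, a host whose tabs report Green, Yellow, and Green would show Yellow overall.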

KPI Tab Status

Host Node Status

SERVER
KPI to Monitor: Used Disk Space %
a. Threshold: Warning = 70%, Critical = 80%
b. Action: Inform System Owner

KPI Definition:
Server Type displays the different server types, that is, QPS, QSS, GLS, RTR, PRS, CTR, or SYS.
Used Disk Space (%) displays the used disk space in percent.
Used Physical Memory (%) displays the used physical memory size in percent.
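As a quick illustration, the disk-space threshold check could be scripted as follows. This is a hedged sketch for NOC-side verification; OSIX applies these thresholds internally:

```python
def disk_space_status(used_pct, warning=70.0, critical=80.0):
    """Classify Used Disk Space % against the Warning/Critical thresholds above."""
    if used_pct >= critical:
        return "Critical"  # inform the system owner immediately
    if used_pct >= warning:
        return "Warning"   # inform the system owner
    return "OK"
```

A server at 75% used disk space would therefore show a Warning, and at 85% a Critical.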

RUNTIME
KPI to Monitor: Uptime
a. Threshold: Warning = 0, Critical = 0
b. Action: The processes (QPS, QSS, GLS, RTR, PRS, CTR, or SYS) are restarted automatically by the system if they stop running, so there is no need for a manual restart. If uptime is zero, the process has failed and you need to raise a trouble ticket to Polystar.

KPI Definition: Runtime shows information for a specific process. Uptime displays how long the process has been running in days, hours, and minutes. Restart Count shows the number of times the process has been restarted since the last update.
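The runtime check above reduces to a simple rule: zero uptime means the automatic restart did not recover the process. The intermediate "Check" state below is an assumption of this sketch (a nonzero restart count suggests instability worth watching), not a documented OSIX status:

```python
def runtime_alarm(uptime_minutes, restart_count=0):
    """Classify a process from its uptime and restart count (illustrative only)."""
    if uptime_minutes == 0:
        return "Critical"  # process is down despite auto-restart: raise a ticket
    if restart_count > 0:
        return "Check"     # assumed state: recovered, but has been restarting
    return "OK"
```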

CDR
KPI to Monitor: Buffered Count a. Threshold b. Action : Warning = 100,000 : Critical = 200,000 : Raise Trouble Ticket to Polystar

KPI to Monitor: Discarded Count
a. Threshold: Warning = 0, Critical = 100
b. Action: Raise Trouble Ticket to Polystar

Raise a trouble ticket if the buffered count continues to increase or the discarded count does not return to zero.

KPI Definition: CDR Count displays the number of stored transactions since the server was restarted or the counter was last reset. Buffered Count is the current number of buffered CDRs in the CDR queue. Discarded Count displays the number of discarded transactions, that is, transactions not stored to disk, since the server was restarted or the counter was last reset. Discarded transactions can be caused by a lack of disk space or by too high an I/O load.
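The CDR thresholds and the ticketing rule above can be sketched together. This is an illustrative reading of the section, assuming the buffered count is sampled periodically:

```python
def cdr_status(buffered, discarded):
    """Apply the Buffered/Discarded Count thresholds from this section."""
    if discarded >= 100 or buffered >= 200_000:
        return "Critical"
    if discarded > 0 or buffered >= 100_000:
        return "Warning"
    return "OK"

def needs_ticket(buffered_samples, discarded):
    """Ticket if buffering keeps growing or discards never drop back to zero."""
    growing = all(b2 > b1 for b1, b2 in zip(buffered_samples, buffered_samples[1:]))
    return growing or discarded > 0
```

For instance, a buffered count of 150,000 with zero discards is a Warning; a steadily decreasing buffer with zero discards needs no ticket.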

COLLECTOR INFORMATION
KPI to Monitor: Throughput (Mbps)
a. Threshold: Warning = 0 (40), Critical = 0 (15)
b. Action: Raise Trouble Ticket to Polystar

KPI Definition: Displays a summary of all data received by an RTR or PRS from the Media Probe or E1 LIM. Throughput is shown in Mbit per second.
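The threshold notation above is ambiguous as printed, but the clear case is a reading of 0 Mbps: the RTR or PRS is receiving no data from its Media Probe or E1 LIM. A minimal sketch under that hedged reading:

```python
def collector_alarm(throughput_mbps):
    """Zero throughput means no data is arriving from the MP/E1 LIM:
    raise a trouble ticket to Polystar (hedged reading of the thresholds)."""
    return "Critical" if throughput_mbps == 0 else "OK"
```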

Job Failed Delta


KPI to Monitor: Job Failed Delta
a. Threshold: Warning > 0, Critical > 0
b. Action: Raise Trouble Ticket to Polystar

KPI Definition: The number of configured links (SS7 or IP) that the system is failing to monitor.
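Since the Warning and Critical thresholds are both "> 0", the check reduces to any nonzero value. An illustrative sketch:

```python
def job_failed_alarm(failed_links):
    """Any configured SS7/IP link the system fails to monitor warrants
    a trouble ticket to Polystar."""
    return "Critical" if failed_links > 0 else "OK"
```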

5. Regular O&M Tasks


The O&M tasks include the following:
1. Daily: Configuration Database Backup. This is performed automatically every day; no action is required from the user.
2. Daily: Verify that all components are running from the System Status.

3. Weekly: Verify the OSIX and Jupiter servers' average system load and hard disk usage.
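The weekly check in step 3 can be gathered with standard-library calls alone. This is a local sketch for a Unix-like OSIX/Jupiter host, not a product feature:

```python
import os
import shutil

def weekly_health_snapshot(path="/"):
    """Collect the average system load and hard disk usage figures
    that the weekly O&M check asks for (Unix-like hosts only)."""
    load1, load5, load15 = os.getloadavg()
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return {
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
        "disk_used_pct": round(used_pct, 1),
    }
```

The `disk_used_pct` value can be fed straight into the Used Disk Space % thresholds from the System Health Monitoring section (Warning at 70%, Critical at 80%).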
