Table of Contents
1. Introduction
2. Polystar Support Process and Infrastructure Workflow
3. System Architecture and Data Flow
4. System Health Monitoring
5. Regular O&M Tasks
1. Introduction
This document describes the processes and tasks required to maintain the reliability of the Sun Cellular Signalling Monitoring Tool. These are:
a. Polystar Support Process and Infrastructure
b. System Architecture and Data Flow
c. System Health Monitoring
d. Regular and Periodic Checks
SUN will perform the above tasks to ensure efficient problem management and resolution.
From the time SUN SMT reports a problem, either by logging a trouble report in the Vision Project or by telephone, the case is handled and tracked until it is closed, in accordance with the SLA between SUN SMT and Polystar.
Incident handling and case management are carried out through a three-level support infrastructure.
While the responsibility for finding the root cause of a problem and fixing it lies with Polystar, Sun Cellular can help improve the handling of each trouble ticket and reduce the resolution time of every reported problem. This is achieved by performing the high-level checks and the system status check procedures described in the following sections.

Support Contact Numbers and Email:
Lam Nguyen
Telephone: +65-81579599
Email: [email protected]
The diagram above shows the logical connections and data flow of the SUN Signalling Monitoring Tool.
1. The signalling messages (message signal units, MSUs) enter the system through the Media Probes (Sigtran MAP/CAP) and the E1 LIMs (SS7 MAP).
2. The MSUs collected by the Media Probes (MP) and E1 LIMs (LIM) are distributed across the Probe Servers (PRS) by the Routers (RTR), which act as load balancers.
3. The PRS decode the MSUs, store them in the SOS DB, generate XDRs and stream the XDRs to the QPS servers.
4. The QPS servers perform XDR enrichment and KPI aggregation, and store the results in the QSS storage servers.
5. The Raw QSS storage servers store the XDRs, and the Agg QSS storage servers store the KPIs.
6. Information is presented to the user through two (2) application servers:
a. GLS for the OSIX Client
b. Jupiter Web Server for the Reporting Application
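The six steps above form a pipeline: collect, route, decode and store, aggregate, present. The toy model below traces three MSUs through that pipeline to show where each kind of data ends up. It is only an illustration under stated assumptions: the class and function names are hypothetical, the real RTR load-balancing policy is not described in this document (round-robin is assumed), and the "KPI" is a simple XDR count per PRS.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class ProbeServer:
    """Hypothetical stand-in for a PRS: keeps raw MSUs (in place of the SOS DB) and emits XDRs."""
    name: str
    sos_db: list = field(default_factory=list)

    def process(self, msu: str) -> dict:
        self.sos_db.append(msu)                # store the decoded MSU
        return {"prs": self.name, "msu": msu}  # greatly simplified XDR


def route(msus, servers):
    """RTR step: spread MSUs across the PRS pool (round-robin is an assumed policy)."""
    pool = cycle(servers)
    return [next(pool).process(msu) for msu in msus]


def aggregate(xdrs):
    """QPS step: a toy KPI, the number of XDRs produced per PRS."""
    kpis = {}
    for xdr in xdrs:
        kpis[xdr["prs"]] = kpis.get(xdr["prs"], 0) + 1
    return kpis


servers = [ProbeServer("PRS-1"), ProbeServer("PRS-2")]
raw_qss = route(["msu-a", "msu-b", "msu-c"], servers)  # XDRs -> Raw QSS
agg_qss = aggregate(raw_qss)                           # KPIs -> Agg QSS
print(agg_qss)  # {'PRS-1': 2, 'PRS-2': 1}
```

The point of the split is visible even in the toy: raw XDRs and aggregated KPIs go to different stores, which is why the Raw QSS and Agg QSS servers are monitored separately.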
The Hosts node lists all the OSIX servers and their status (Green, Yellow or Red). Each tab's status button turns Yellow or Red depending on the status of the KPIs behind that tab.
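The text says a tab turns Yellow or Red depending on its underlying KPIs, but does not state the exact rule. A common convention is "worst status wins"; the sketch below assumes that rule for tabs and, as a further assumption, applies the same rule when rolling tabs up to a host. The `Status` enum and both function names are hypothetical.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that a higher value means a more severe state.
    GREEN = 0
    YELLOW = 1
    RED = 2


def tab_status(kpi_statuses):
    """Assumed roll-up rule: a tab shows the worst status of its KPIs (Green if it has none)."""
    return max(kpi_statuses, default=Status.GREEN)


def host_status(tabs):
    """Assumed: a host shows the worst status among its tabs."""
    return max((tab_status(kpis) for kpis in tabs.values()), default=Status.GREEN)


tabs = {
    "Server": [Status.GREEN, Status.YELLOW],
    "Runtime": [Status.GREEN],
    "CDR": [Status.GREEN, Status.GREEN],
}
print(tab_status(tabs["Server"]).name)  # YELLOW
print(host_status(tabs).name)           # YELLOW
```

Under this rule a single Yellow KPI is enough to flag its tab, so the tab to open first is the one whose button is not Green.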
SERVER
KPI to Monitor: Used Disk Space (%)
a. Threshold: Warning = 70%, Critical = 80%
b. Action: Inform System Owner
KPI Definition:
The Server tab displays the server type: QPS, QSS, GLS, RTR, PRS, CTR or SYS.
Used Disk Space (%) displays the used disk space in percent.
Used Physical Memory (%) displays the used physical memory in percent.
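The threshold table above can be expressed as a small classifier. The sketch assumes that reaching a threshold (rather than strictly exceeding it) triggers that level, since the table does not say. `used_disk_pct` is a hypothetical local cross-check using Python's `shutil.disk_usage`; it is not how OSIX itself computes the KPI.

```python
import shutil

# Thresholds for Used Disk Space (%) from the SERVER tab.
WARNING_PCT = 70.0
CRITICAL_PCT = 80.0


def classify_used_space(used_pct, warning=WARNING_PCT, critical=CRITICAL_PCT):
    """Return "OK", "Warning" or "Critical"; assumes reaching a threshold triggers it."""
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError(f"percentage out of range: {used_pct}")
    if used_pct >= critical:
        return "Critical"
    if used_pct >= warning:
        return "Warning"
    return "OK"


def action_for(level):
    # Warning and Critical share the same action in the table.
    return "Inform System Owner" if level in ("Warning", "Critical") else "No action"


def used_disk_pct(path="/"):
    """Hypothetical cross-check: used space of the filesystem holding `path`.

    shutil.disk_usage counts root-reserved blocks as neither used nor free,
    so this can read slightly lower than `df`.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


for pct in (55.0, 72.5, 91.0):
    level = classify_used_space(pct)
    print(pct, level, action_for(level))
```

For example, calling `classify_used_space(used_disk_pct("/"))` on a server gives a quick local reading to compare against the value shown in the Server tab.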
RUNTIME
KPI to Monitor: Uptime
a. Threshold: Warning = 0, Critical = 0
b. Action: Processes such as QPS, QSS, GLS, RTR, PRS, CTR or SYS are automatically restarted by the system if they stop running, so there is no need for a manual restart. If the uptime is zero, the process has failed and you need to raise a trouble ticket to Polystar.
KPI Definition:
Runtime shows information for a specific process, such as:
Uptime displays how long the server has been running, in days, hours and minutes.
Restart Count shows the number of times the process has been restarted since the last update.
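The Runtime rule reduces to one check: because processes restart automatically, only an uptime of zero calls for a ticket. The sketch below encodes that rule and shows how a duration breaks down into days, hours and minutes; the function names and the exact output format (`3d 4h 5m`) are assumptions, not the OSIX display format.

```python
from datetime import timedelta


def format_uptime(uptime: timedelta) -> str:
    """Render uptime as days, hours and minutes (display format is an assumption)."""
    total_minutes = int(uptime.total_seconds()) // 60
    days, rest = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rest, 60)
    return f"{days}d {hours}h {minutes}m"


def runtime_action(uptime: timedelta) -> str:
    """Processes restart automatically, so only a zero uptime needs a ticket."""
    if uptime <= timedelta(0):
        return "Raise trouble ticket to Polystar"
    return "No action"


print(format_uptime(timedelta(days=3, hours=4, minutes=5)))  # 3d 4h 5m
print(runtime_action(timedelta(0)))                          # Raise trouble ticket to Polystar
```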
CDR
KPI to Monitor: Buffered Count
a. Threshold: Warning = 100,000, Critical = 200,000
b. Action: Raise Trouble Ticket to Polystar
KPI to Monitor: Discarded Count
a. Threshold: Warning = 0, Critical = 100
b. Action: Raise Trouble Ticket to Polystar
Raise a trouble ticket if the buffered count continues to increase or the discarded count does not return to zero.
KPI Definition:
CDR Count displays the number of stored transactions since the server was restarted or the counter was last reset.
Buffered Count displays the current number of buffered CDRs in the CDR queue.
Discarded Count displays the number of discarded transactions, that is, transactions not stored to disk, since the server was restarted or the counter was last reset. Discarded transactions can be caused by a lack of disk space or by excessive I/O load.
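The CDR checks combine fixed thresholds with a trend rule over successive polls. The sketch assumes a count strictly above a threshold triggers that level (so any discarded transaction at all is a Warning), and it interprets "continues to increase" as rising over the last `window` polls; `window=3` is a hypothetical choice, as are all names below.

```python
# Thresholds from the CDR tab.
BUFFERED = {"warning": 100_000, "critical": 200_000}
DISCARDED = {"warning": 0, "critical": 100}


def cdr_level(value, thresholds):
    """Assumes a count strictly above a threshold triggers that level."""
    if value > thresholds["critical"]:
        return "Critical"
    if value > thresholds["warning"]:
        return "Warning"
    return "OK"


def should_raise_ticket(buffered_samples, discarded_samples, window=3):
    """Apply the trend rule to successive polls, oldest first.

    Ticket if the buffered count rose on every one of the last `window` polls,
    or if the latest discarded count is still above zero.
    """
    if window < 2:
        raise ValueError("window must cover at least two polls")
    recent = buffered_samples[-window:]
    still_rising = len(recent) == window and all(a < b for a, b in zip(recent, recent[1:]))
    not_cleared = bool(discarded_samples) and discarded_samples[-1] > 0
    return still_rising or not_cleared


print(cdr_level(150_000, BUFFERED))  # Warning
print(cdr_level(250, DISCARDED))     # Critical
print(should_raise_ticket([90_000, 120_000, 160_000], [0, 0, 0]))  # True
```

Checking the trend as well as the level matters because a buffered count just above 100,000 that is falling is draining normally, while one that keeps climbing points to a stalled writer.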
COLLECTOR INFORMATION
KPI to Monitor: Throughput (Mbps)
a. Threshold: Warning = 0 (40), Critical = 0 (15)
b. Action: Raise Trouble Ticket to Polystar
KPI Definition:
Collector Information displays a summary of all data received by an RTR or PRS from the Media Probes or E1 LIMs.
Throughput displays the collector throughput in Mbit per second.
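Throughput in Mbit per second is bytes received times 8, divided by the interval in seconds, divided by 1,000,000. The table lists "Warning = 0 (40)" and "Critical = 0 (15)" without saying what the bracketed figures mean; the sketch below treats them, as an assumption only, as lower bounds in Mbps, with 0 Mbps meaning the feed has stopped. The counter-based calculation and both function names are hypothetical.

```python
def throughput_mbps(bytes_start, bytes_end, seconds):
    """Convert a received-bytes counter delta over an interval into Mbit/s (1 Mbit = 10**6 bits)."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (bytes_end - bytes_start) * 8 / seconds / 1_000_000


def throughput_level(mbps, warning_floor, critical_floor):
    """Assumed reading: throughput at or below a floor triggers that level."""
    if mbps <= critical_floor:
        return "Critical"
    if mbps <= warning_floor:
        return "Warning"
    return "OK"


rate = throughput_mbps(0, 75_000_000, 10)     # 600,000,000 bits in 10 s = 60.0 Mbit/s
print(rate, throughput_level(rate, 40, 15))  # 60.0 OK
print(throughput_level(0.0, 0, 0))           # Critical: the feed has stopped
```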
3. Weekly: Verify OSIX and Jupiter Server Average System Load and Hard Disk Usage