A. Material 1) Failure Modes and Effect Analysis: I. Aterial and Ethod
field to identify various failure modes and their potential failure effects that could occur at any time. FMEA is widely used in design, in manufacturing or assembly processes, and in products and services to identify all possible failures. Risk identification is the critical first step of risk management. This paper is a case study of XYZ University, which is trying to implement risk management; it focuses only on how to identify risks using FMEA. FMEA requires three parameters to be defined: severity, likelihood of occurrence, and detection. The Risk Priority Number (RPN) is a metric that indicates potential risk. It is calculated by multiplying the three components and is used to decide which risks should be handled first, starting from the highest RPN value. The data were collected by filtering the ticketing system and mapping the recorded incidents to the current business processes, and were validated through interviews with end users. The result differs from the initial expectation that business processes such as the LMS or network facilities would receive the highest RPN value: after the full analysis, telecommunication was found to rank highest. This gives risk management a new perspective for handling potential risks more effectively.
INTRODUCTION
Risk is a situation involving exposure to danger or harm. In an organization, risk is a part of doing business and a challenge for every company. Risks can have consequences for finances, reputation, development, and performance [1]. An organization that manages risk effectively is better able to perform well in an uncertain environment. Risk does not only have a negative impact on an organization; it can also have a positive one. A positive risk, if managed well, can turn into an opportunity for the organization [2].
Risk management is a combination of the science and art of identifying, analyzing, and responding to uncertain conditions. In project management, risk remains present throughout the life of a project, until the project meets its objectives. Risk management helps the project manager and team achieve their goals by improving the process of determining project scope, developing realistic estimates, and selecting good projects [3]. Failure to manage risk can harm the organization as well as the clients involved in the project. In the case of Davy Corporation, a project was completed two years behind schedule, and poor management of project risk damaged not only Davy but also its clients and Trafalgar House, which bought Davy Corporation [4]. Risk management approaches have become more complex because they integrate many additional risk domains [5], but this paper involves only the Information Technology (IT) and business domains. Risk management across these domains has been shown to increase the chance of business sustainability, enhance business processes, and add value. Studies show that a framework for organizing detailed information efficiently helps the risk assessment process and assists decision making at C level [6], [7].
The objective of this paper is to present the use of Failure Mode and Effect Analysis (FMEA) to define and identify risks specifically related to network infrastructure, and to show how the collected data are converted into useful information for identifying and evaluating the three factors of FMEA. The aim is to produce a Risk Priority Number (RPN) that acts as an indicator for determining proper action, together with a table of risk severity, so that the most critical risks can be prioritized and mitigated.
Severity (S) is a numerical estimate of how severely the end user will be affected by the failure event. It is a 10-point scale; ten is the highest.
Occurrence (O), or likelihood, is a numerical estimate of the likelihood that a failure mode will occur during production. It is a 10-point scale; ten is the highest.
Detection (D), which can also be termed effectiveness, is a numerical estimate of how effective the existing controls are at detecting the root cause or failure mode before it happens, or at preventing it from happening. It is a 10-point scale; ten is the highest.
The Risk Priority Number indicates the seriousness of a potential risk: the larger the RPN value, the more serious the risk. If two RPN values are identical, the product of Severity (S) and Occurrence (O), S x O, should be considered, because occurrence plays the largest role after severity in risk management [11].
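The ranking rule described above (RPN as the product of the three scores, with S x O as the tie-breaker) can be illustrated with a minimal Python sketch. The class name `RiskItem`, the function `prioritize`, and all scores below are hypothetical and chosen only for illustration; they are not taken from the case study data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskItem:
    """One FMEA row; every score is on the 10-point scale (10 = highest)."""
    name: str
    severity: int    # S
    occurrence: int  # O
    detection: int   # D

    def __post_init__(self):
        # Reject scores that fall outside the 10-point scale.
        for label, score in (("severity", self.severity),
                             ("occurrence", self.occurrence),
                             ("detection", self.detection)):
            if not 1 <= score <= 10:
                raise ValueError(f"{label} must be between 1 and 10, got {score}")

    @property
    def rpn(self) -> int:
        # RPN is the product of the three components.
        return self.severity * self.occurrence * self.detection

    @property
    def s_times_o(self) -> int:
        # Tie-breaker used when two RPN values are identical.
        return self.severity * self.occurrence


def prioritize(items):
    """Sort by RPN (highest first); break ties by S x O (highest first)."""
    return sorted(items, key=lambda r: (r.rpn, r.s_times_o), reverse=True)


# Hypothetical scores, for illustration only (not from the paper).
items = [
    RiskItem("service A", severity=6, occurrence=2, detection=5),  # RPN 60, S x O 12
    RiskItem("service B", severity=5, occurrence=4, detection=3),  # RPN 60, S x O 20
    RiskItem("service C", severity=8, occurrence=3, detection=4),  # RPN 96, S x O 24
]
ranked = prioritize(items)
print([r.name for r in ranked])  # ['service C', 'service B', 'service A']
```

Services A and B share an RPN of 60, so the sketch ranks B first because its S x O product is larger.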
TABLE I
PROJECT RISK MANAGEMENT PROCESS
Project Risk Management Process | Description
Identify Risks | This is a critical step. It produces a register of potential risks that may occur in every field of work. This step should identify the risks that may occur in all areas, as well as their impact on the current system.
Analyze and Evaluate Risks | After the risk register is produced, the next step is to analyze and evaluate it. FMEA is used to analyze each identified risk based on its failure mode, which in this paper is the reason for outages. It also helps to evaluate the level of risk based on severity, likelihood, and detection. Not all risks carry the same level of loss. Evaluating the severity and likelihood of occurrence provides a new perspective on potential risks, so that decisions become better reasoned.
Risk Mitigation | Once the risks have been analyzed and evaluated, a strategy for responding to them is needed in order to reduce their impact or likelihood of occurrence to an acceptable level. This strategy is known as risk mitigation. The options for handling risk include assume or accept, avoid, control, transfer, and watch or monitor.
Risk Monitoring | The purpose of this step is to track and ascertain whether the organization has followed its policies. Although strategies for managing risk have already been decided within the organization, risk monitoring assesses whether those strategies are still effective.
C. Related Work
Risk management has been widely researched using various models, covering identification, handling, mitigation, and monitoring. In addition, to compete with its competitors, a company needs to maintain the quality and reliability of its product while continuing to identify and manage the risks that occur.
A risk mitigation plan should facilitate the identification of surrounding opportunities and of the risks themselves [20]. Risk mitigation can be conducted by reviewing and identifying the risk mitigation activities, technologies, techniques, procedures, people, processes, and methods used in IT Governance [21]. Most existing models and frameworks for mitigating risks cannot support IT Governance in mitigating risks. Thus, risk mitigation components and metrics are important and should be considered when mitigating risks [22].
FMEA has also been used to analyze the current state of services within an acute care healthcare organization, and its use is recommended before the implementation of any new service as an effective method for achieving the highest quality [23].
However, identifying and assessing risk is a much easier task than proposing a proper mitigation plan, because risk handling needs to be prepared in advance so that a risk does not grow once it appears. Therefore, once risks have been ranked by priority, handling and mitigation strategies need to be determined for each risk factor as well [24]. Torabi and Giahi also used FMEA to enhance risk assessment, which is one of the main parts of a business continuity management system [25]. FMEA was also studied by Tejaskumar S. Parsana and Mihi T. Patel to improve quality and efficiency in the manufacturing industry; preventive actions were proposed according to the risk prioritization [26].
Enterprise risk management (ERM), used by XYZ University, is a systematic and integrated approach to managing all forms of risk that the organization may face, directly or indirectly. It focuses on the whole enterprise as a big picture. FMEA, however, can help to assess the possible ways in which the current system might fail, assess the impact of each failure, and understand what preventive action can be taken before such failures occur [27].
D. Methodology
The data were collected from the ticketing system and validated through user interviews. Risks are classified according to the IT service catalog provided by the IT division. All incidents were mapped to the service catalog in brainstorming sessions involving every risk-related team to ensure validity. The data show that the main business processes related to IT infrastructure and services at XYZ University are as follows:
1. Attendance system is required to record the attendance of all employees. The system uses SAP. Each employee can record attendance at any site of XYZ University, either by fingerprint or through the web-based SAP Portal.
2. Learning Management System (LMS) for Regular Program Student is an information system used by lecturers and regular students to support academic learning. It is used by all regular program students, at both bachelor and master degree level and in all majors. Attendance, discussion forums, course material, and information regarding financial status can be accessed through the LMS.
3. Financial System uses a web-based SAP application and has several functions to support the tasks of finance officers, such as purchase requests, leave requests, medical billing, and budgeting.
4. Learning Management System (LMS) for Online Program Student is an information system for students who prefer online program classes to the way regular classes are conducted every day. It is used by lecturers and students to support academic learning. It is used by all online program students, at both bachelor and master degree level and in all majors. Attendance, discussion forums, course material, and information regarding financial status can be accessed through the LMS.
5. Email acts as a communication medium between internal and external parties. The purpose of using official email is to build a professional image with external parties. Email is divided into two services, one for students and lecturers and one for employees. Because of the large number of students, email for students and lecturers uses Office 365, which is integrated with the LMS through single sign-on. Meanwhile, email for employees uses an on-premises Exchange server, because the number of employees is much smaller than the number of students.
6. Video conference is a service supported by the IT division to accommodate online meetings, distance learning across locations, and students of the online program. It uses several platforms: Cisco WebEx, Skype for Business, Cisco TelePresence, and Cisco Jabber. Cisco WebEx is used to support online classes. Cisco TelePresence is used for meetings between different campus locations. Skype for Business and Cisco Jabber are used for meetings with external parties. Video conferences are also held for meetings with other universities, both at home and abroad.
7. Telecommunication reduces costs by using telephones, so that employees at different locations can communicate quickly and at lower cost. This service also supports communication with external parties through a billing system. The service is vital for marketing staff, whose daily tasks involve communicating with external parties. It also plays a vital part in customer service, helping customers efficiently and ensuring that they are satisfied.
8. Online Admission acts as a virtual representative where prospective students can register online.
9. Network facilities are provided so that employees can access the internal system. Wired access is typically used by employees, while lecturers and students use wireless access. The two have different rules, but both can access the internal system in accordance with their respective regulations.
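The step of mapping ticketing-system incidents to service catalog entries can be sketched in Python as below. In the study this mapping was done by the risk-related teams in brainstorming sessions, so the keyword matching here is only a stand-in assumption. The keyword table, the function `map_ticket_to_service`, and the sample ticket subjects are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword table. The paper's mapping was done manually by the
# risk-related teams; keyword matching is an assumption made for this sketch.
SERVICE_KEYWORDS = {
    "Telecommunication": {"phone", "telephone", "extension"},
    "Email": {"email", "mailbox"},
    "Network facilities": {"wifi", "network", "lan"},
}


def map_ticket_to_service(subject: str) -> str:
    """Return the first catalog service whose keywords appear in the subject."""
    words = set(re.findall(r"[a-z0-9]+", subject.lower()))
    for service, keywords in SERVICE_KEYWORDS.items():
        if words & keywords:
            return service
    return "Unmapped"


# Hypothetical ticket subjects, for illustration only.
tickets = [
    "Phone extension dead in marketing room",
    "Cannot send email to external address",
    "WiFi down in library",
    "Phone line noisy",
    "Projector broken",
]

# Count incidents per service; these counts feed the occurrence scoring.
incidents_per_service = Counter(map_ticket_to_service(t) for t in tickets)
print(incidents_per_service)
```

Tickets that match no keyword are kept under "Unmapped" rather than dropped, so that they can still be reviewed by hand.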
In this paper, the author renamed some columns of the RPN table: potential failure mode becomes reason for outages (RFO), and potential failure effect becomes impact. The names were changed because they are better suited to the educational field.
Severity criteria are determined by assessing whether the risk impact disturbs the business process as a whole or affects only some individuals in a division. The frequency of occurrence is determined by how often the same incident happens. These incidents are recorded, and counting the records gives the frequency level, that is, the likelihood that an incident will happen. Detection reflects the university's perspective: if an incident happens, how badly it affects the quality of service, whether the service remains usable, and how effectively the current design controls work. The author used table models based on Tejaskumar’s work, as follows [23]:
TABLE II
TABLE OF SEVERITY
Table 2 shows the severity criteria used to specify the impact of a potential risk on the business processes explained above. The criteria range from a failure that affects no one, through a failure that affects some people, to the loss of the ability to run the existing business process, resulting in a business loss.
The next step is to define the table of occurrence, shown as follows:
TABLE III
TABLE OF OCCURRENCE
The table of occurrence consists of the periods in which an incident may occur. The period is measured from the existing ticketing system: when the tickets are sorted by time and mapped to recurring incidents, a specific pattern of periods emerges. The criteria range from an incident that occurs only once a year or at unpredictable moments to one that occurs several times a day and disturbs most people. The result is shown in Table 3.
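As a rough illustration of how incident frequency from the ticketing system could be turned into an occurrence score on the 10-point scale, the Python sketch below uses a threshold lookup. The body of Table 3 is not reproduced in this text, so the thresholds in `UPPER_BOUNDS` and the function `occurrence_score` are hypothetical assumptions, not the criteria used in the study.

```python
import bisect

# Hypothetical upper bounds (incidents per year) for scores 1 to 9; anything
# above the last bound scores 10. These values are NOT taken from Table 3.
UPPER_BOUNDS = [1, 2, 4, 12, 24, 52, 104, 365, 730]


def occurrence_score(incidents_per_year: int) -> int:
    """Map a yearly incident count to an occurrence score from 1 to 10."""
    if incidents_per_year < 0:
        raise ValueError("incident count cannot be negative")
    # bisect_left returns the index of the first bound >= the count,
    # so a count equal to a bound still falls into that bound's score.
    return bisect.bisect_left(UPPER_BOUNDS, incidents_per_year) + 1


print(occurrence_score(1))     # once a year
print(occurrence_score(12))    # about once a month
print(occurrence_score(1000))  # several times a day
```

Keeping the thresholds in a single list makes it straightforward to replace them with the periods agreed on by the risk-related team.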
The next step is to determine the detection score for the current condition. Detection is a relative measure of the ability of the design controls to detect or prevent a potential cause or reason for outages. In this paper, the author records the pre-incident response, which gives a picture of the detection controls the university has already put in place to prevent failures.
TABLE IV
TABLE OF DETECTION
The data discussed previously can be used to calculate the Risk Priority Number (RPN) by multiplying the three elements, Severity (S), Occurrence (O), and Detection (D). The formula is:
RPN = S x O x D
The RPN helps the risk-related team to identify potential risks and to plan the action to be taken for each: mitigate it, eliminate it, or transfer it to a professional company.
According to the interviews, communication by phone has become very important for daily activities. It is considered more efficient than email. It also shortens response time when critical conditions occur and when students or parents want information right away. For some people, it is essential that the line of communication is always available, especially where phone service is concerned.
The marketing team needs a good and reliable telecommunication service so that it can approach prospective students through telemarketing. This increases effectiveness and efficiency, and also the chance of raising student intake. A personal approach and more trusted relationships can be built through telecommunication. Although there are many ways to connect with customers, such as instant messaging, we believe that good communication begins with warm personal communication over the phone.
The Learning Management System (LMS) also acts as a bridge between the university and its students: any information related to courses, exams, or financial status can be accessed there. The LMS plays a huge role in the learning process, because a failure affects learning on that day and its effect can last for several days, disrupting learning activities such as uploading and downloading learning materials or accessing the forum. The LMS is often disrupted during score announcements, during the announcement of new class schedules, and at project submission deadlines. The precautions taken so far are therefore adding server capacity and using load balancing techniques to distribute the workload across many servers. There are two LMSs because they run different systems, are maintained by different teams, and serve different functions, so two separate systems must run simultaneously.
Most of the services depend on network infrastructure, which has become an essential part of current services. However, the discussion here concerns how to break down all potential failures of a product, process, or system, identify the actions that could eliminate or reduce the chance of potential failures, and document the process.
Table 5 shows the matrix calculation of RPN. Weights should be assigned to each identified business process so that the RPN score can be obtained by multiplication. A small RPN is always better than a high RPN. It is recommended to begin action with the highest RPN and work in descending order; the objective is to reduce one or more of the potential risks that make up the RPN. Once the RPN scores are calculated, a Pareto chart gives a picture of the most important areas to be resolved as soon as possible.
As explained earlier, the weights in the severity, occurrence, and detection columns are obtained through brainstorming and deliberation by the risk-related teams. The Reason for Outages (RFO) is determined from the incident records. It covers something that has already happened, or something that could act as an agent of failure in the main business process in the future. The impact column describes what can happen in the event of failure and the effect on the ongoing business process. The data in this column are taken from previous incidents, supplemented by the results of discussions on the broader impact should the failure occur. Pre-incident responses are precautions that have been planned, carried out, and are still running to date to minimize the effects of a failure; they can also be described as current controls. This column was added to the FMEA in this paper to make it easier to determine the score of each column. The risk managers and the team discussed every criterion in the RPN matrix. The result of the brainstorming to fill in the weight of each RPN criterion is as follows:
The left vertical axis of the Pareto chart represents RPN values. Each vertical bar represents a service, with the highest RPN values on the left. The cumulative line adds up the percentage contributed by each bar. It shows how much of the total risk would be addressed by fixing the top few services in the Pareto chart.
We identify the services that contribute 80% of the risk. Figure 2 shows that telecommunication, LMS for regular students, LMS for online students, and online admission contribute 80% of the risk of the system. Therefore, we advise prioritizing these four services and resolving them as soon as possible.
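The 80% selection described above can be sketched in Python as a cumulative-share calculation over services sorted by RPN. The function `pareto_cutoff` and the RPN values below are hypothetical and purely illustrative; they are not the values from Table 5 or Figure 2.

```python
def pareto_cutoff(rpn_by_service, threshold=0.80):
    """Return the highest-RPN services whose cumulative share of the total
    RPN first reaches the threshold (80% by default)."""
    total = sum(rpn_by_service.values())
    if total <= 0:
        raise ValueError("total RPN must be positive")
    # Sort services from highest to lowest RPN, as on a Pareto chart.
    ranked = sorted(rpn_by_service.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0
    for service, rpn in ranked:
        selected.append(service)
        cumulative += rpn
        if cumulative >= threshold * total:
            break
    return selected


# Hypothetical RPN values, for illustration only (not from Table 5).
rpn_by_service = {
    "service A": 400,
    "service B": 300,
    "service C": 150,
    "service D": 100,
    "service E": 50,
}
priority = pareto_cutoff(rpn_by_service)
print(priority)  # ['service A', 'service B', 'service C']
```

With these made-up values the first two services cover 70% of the total RPN and the third brings the cumulative share to 85%, so three services are selected.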
I. CONCLUSIONS
In this paper, the author used FMEA to identify various failure modes and their potential failure effects in the current business processes of an educational institution. FMEA is more commonly used in the industrial field to enhance a process or to improve the steps of making a product.
This paper traces failures to their source by interviewing end users and classifying every incident reported in the ticketing system, which yields the reasons for outages and makes their impact on the current business processes explicit. Although network facilities or the LMS appeared to be the prime candidates in the list of risks, FMEA gives another point of view on the current business processes: the result shows that telecommunication is at the top of the priority list.
More precautions should be taken in the telecommunication area so that a failure there does not interfere with the current business processes. If a catastrophic event occurs, media coverage may affect the campus’s reputation, posing a threat to future admissions and therefore to financial strength. The risks faced by the university are diverse, and the potential losses are enormous. The risk manager must be vigilant in protecting the organization's assets from both direct and indirect potential losses. By developing and implementing a comprehensive risk management plan, a university will hold a dynamic tool that can serve as a road map for identifying and managing risk exposures.
However, this paper focuses only on the part of the risk management process concerned with risk assessment. Further work should concentrate on mitigating and monitoring risk, with the aim of minimizing potential negative risks and maximizing potential positive risks.
We would like to acknowledge the support of Bina Nusantara University in making this work possible under the Bina Nusantara Graduate Program, Magister of Information Technology.