

Chapter 18

Reliability

Introduction
Reliability equals consistency. It can be defined as the probability that an item
will perform a required function, under specified conditions and without failure, for
a specified amount of time, according to its intended design. The reliability pro-
gram is a valuable means of achieving better operational performance in an
aircraft maintenance environment, and it is designed to decrease maintenance-
related issues and increase flight safety. The intent of this program is to deal
systematically with problems as they arise instead of trying to cure immediate
symptoms. The program is normally customized by each operator to reflect its
specific operational requirements. Although the word
reliability has many meanings, in this book we will define the terms that have
specialized meanings to aviation maintenance and engineering. In the case of
reliability, we first must discuss one important difference in the application of
the term.
There are two main approaches to the concept of reliability in the aviation
industry. One looks essentially at the whole airline operation or the M&E oper-
ation within the whole, and the other looks at the maintenance program in par-
ticular. There is nothing wrong with either of these approaches, but they differ
somewhat, and that difference must be understood.
The first approach is to look at the overall airline reliability. This is measured
essentially by dispatch reliability; that is, by how often the airline achieves an
on-time departure1 of its scheduled flights. Airlines using this approach track
delays. Reasons for the delay are categorized as maintenance, flight operations,
air traffic control (ATC), etc. and are logged accordingly. The M&E organization
is concerned only with those delays caused by maintenance.

1. On-time departure means that the aircraft has been “pushed back” from the gate within 15 minutes of the scheduled departure time.


Very often, airlines using this approach to reliability overlook any mainte-
nance problems (personnel or equipment related) that do not cause delays, and
they track and investigate only those problems that do cause delays. This is only
partially effective in establishing a good maintenance program.
The second approach (which we should actually call the primary approach)
is to consider reliability as a program specifically designed to address the prob-
lems of maintenance—whether or not they cause delays—and provide analysis
of and corrective actions for those items to improve the overall reliability of the
equipment. This contributes to the dispatch reliability, as well as to the overall
operation.
We are not going to overlook the dispatch reliability, however. This is a dis-
tinct part of the reliability program we discuss in the following pages. But we
must make the distinction and understand the difference. We must also real-
ize that not all delays are caused by maintenance or equipment even though
maintenance is the center of attention during such a delay. Nor can we only
investigate equipment, maintenance procedures, or personnel for those dis-
crepancies that have caused a delay. As you will see through later discussions,
dispatch reliability is a subset of overall reliability.

Types of Reliability
The term reliability can be used in various respects. You can talk about the over-
all reliability of an airline’s activity, the reliability of a component or system, or
even the reliability of a process, function, or person. Here, however, we will dis-
cuss reliability in reference to the maintenance program specifically.
There are four types of reliability one can talk about related to the mainte-
nance activity. They are (a) statistical reliability, (b) historical reliability, (c) event-
oriented reliability, and (d) dispatch reliability. Although dispatch reliability is
a special case of event-oriented reliability, we will discuss it separately due to
its significance.

Statistical reliability
Statistical reliability is based upon collection and analysis of failure, removal,
and repair rates of systems or components. From this point on, we will refer to
these various types of maintenance actions as “events.” Event rates are calcu-
lated on the basis of events per 1000 flight hours or events per 100 flight cycles.
This normalizes the parameter for the purpose of analysis. Other rates may be
used as appropriate.
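As a minimal sketch of this normalization (the function name and the figures in the example are invented for illustration, not drawn from any particular fleet):

```python
def event_rate(events, exposure, per=1000):
    """Normalize a raw event count to events per `per` units of exposure
    (flight hours here; use per=100 with flight cycles for cycle rates)."""
    return events / exposure * per

# Hypothetical month: 23 unscheduled removals in 30,250 fleet flight hours
print(f"{event_rate(23, 30_250):.2f} removals per 1000 flight hours")  # 0.76
```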
Many airlines use statistical analysis, but some give the statistics more
credence than they deserve. For example, airlines with 10 or more aircraft
tend to use the statistical approach, even though most teachers and books on
statistics tell us that, for any data set with fewer than about 30 data points,
the statistical calculations are not very significant. Another case of improper
use of statistics was presented at an aviation industry seminar on reliability.

The airline representative used this as an example of why his airline was going
to stop using statistical reliability. Here is his example.

We use weather radar only 2 months of the year. When we calculate the mean
value of failure rates and the alert level in the conventional manner [discussed in
detail later in this chapter] we find that we are always on alert. This, of course, is
not true.

The gentleman was correct in identifying an error in this method, and he was
correct in determining that—at least in this one case—statistics was not a valid
approach. Figure 18-1 shows why.
The top curve in Fig. 18-1 shows the 2 data points for data collected when the
equipment was in service. It also shows 10 zero data points for those months
when the equipment was not used and no data were collected (12-month column).
These zeros are not valid statistical data points. They do not represent zero fail-
ures; they represent “no data” and therefore should not be used in the calcula-
tion. Using these data, however, has generated a mean value (lower, dashed line)
of 4.8 and an alert level at two standard deviations above the mean (upper, solid
line) of 27.6.
One thing to understand about mathematics is that the formulas will work,
will produce numerical answers, whether or not the input data are correct.
Garbage in, garbage out. The point is, you have only two valid data points here,
shown in the bottom curve of Fig. 18-1 (2-month data). The only meaningful
statistic here is the average of the two numbers, 29 (dashed line). One can
calculate a standard deviation (SD) here using the appropriate formula or a
calculator, but the parameter has no meaning for just two data points.

Data        12 Mos   2 Mos
Jan          0        —
Feb          0        —
Mar          0        —
Apr          0        —
May          0        —
Jun          0        —
Jul          0        —
Aug          0        —
Sep         26       26
Oct         32       32
Nov          0        —
Dec          0        —
Sum         58       58
n           12        2
Avg.         4.8     29.0
Std. Dev.   11.4      4.2
A.L.        27.6     37.5

Figure 18-1 Comparison of alert level calculation methods. (Two plots of monthly failure rate against the mean value and alert level: the upper plot uses all 12 months, invalid zeros included; the lower plot uses only the 2 months of valid data.)



The alert level set by using this calculation is 37.5 (solid line). For this
particular example, statistical reliability is not useable, but historical
reliability is quite useful. We will discuss that subject in the next section.

Historical reliability
Historical reliability is simply a comparison of current event rates with those
of past experience. In the example of Fig. 18-1, the data collected show fleet fail-
ures of 26 and 32 for the 2 months the equipment was in service. Is that good
or bad? Statistics will not tell you but history will. Look at last year’s data for
the same equipment, same time period. Use the previous year’s data also, if
available. If current rates compare favorably with past experience, then every-
thing is okay; if there is a significant difference in the data from one year to the
next, that would be an indication of a possible problem. That is what a relia-
bility program is all about: detecting and subsequently resolving problems.
Historical reliability can be used in other instances, also. The most common one
is when new equipment is being introduced (components, systems, engines, air-
craft) and there is no previous data available on event rates, no information on
what sort of rates to expect. What is “normal” and what constitutes “a problem”
for this equipment? In historical reliability we merely collect the appropriate
data and literally “watch what happens.” When sufficient data are collected to
determine the “norms,” the equipment can be added to the statistical reliability
program.
Historical reliability can also be used by airlines wishing to establish a sta-
tistically based program. Data on event rates kept for 2 or 3 years can be tal-
lied or plotted graphically and analyzed to determine what the normal or
acceptable rates would be (assuming no significant problems were incurred).
Guidelines can then be established for use during the next year. This will be cov-
ered in more detail in the reliability program section below.

Event-oriented reliability
Event-oriented reliability is concerned with one-time events such as bird strikes,
hard landings, overweight landings, in-flight engine shutdowns, lightning strikes,
ground or flight interruption, and other accidents or incidents. These are events
that do not occur on a daily basis in airline operations and, therefore, produce
no usable statistical or historical data. Nevertheless, they do occur from time
to time, and each occurrence must be investigated to determine the cause and
to prevent or reduce the possibility of recurrence of the problem.
In ETOPS2 operations, certain events are handled differently than in
conventional reliability programs: historical data and alert levels are used to
determine whether an investigation is necessary and whether a problem can be
reduced or eliminated by changing the maintenance program.

2. Requirements for extended range operations with two-engine airplanes (ETOPS) are outlined in FAA Advisory Circular AC 120-42B and are also discussed in Appendix E of this book.

Events that are related to ETOPS flights are designated by the FAA as actions
to be tracked by an “event-oriented reliability program” in addition to any sta-
tistical or historical reliability program. Not all the events are investigated, but
everything is continually monitored in case a problem arises.

Dispatch reliability
Dispatch reliability is a measure of the overall effectiveness of the airline oper-
ation with respect to on-time departure. It receives considerable attention from
regulatory authorities, as well as from airlines and passengers, but it is really
just a special form of the event-oriented reliability approach. It is a simple
calculation based on events per 100 flights, which makes it convenient to
express the dispatch rate as a percentage. An example of the dispatch rate
calculation follows.
If eight delays and cancellations are experienced in 200 flights, that would mean
that there were four delays per 100 flights, or a 4 percent delay rate. A 4 percent
delay rate would translate to a 96 percent dispatch rate (100 percent − 4 percent
delayed = 96 percent dispatched on time). In other words, the airline dispatched
96 percent of its flights on time.
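This arithmetic can be captured in a few lines; the sketch and its function name are ours, not a standard routine:

```python
def dispatch_rates(flights, delays_and_cancellations):
    """Return (delay rate, dispatch rate) in percent, i.e., per 100 flights."""
    delay_rate = delays_and_cancellations / flights * 100
    return delay_rate, 100 - delay_rate

delay_pct, dispatch_pct = dispatch_rates(200, 8)
print(delay_pct, dispatch_pct)  # 4.0 96.0 -> 96 percent dispatched on time
```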
The use of dispatch reliability at the airlines is, at times, misinterpreted.
Passengers are concerned with timely dispatch for obvious reasons, and in
responding to FAA pressure on dispatch rates, airlines often overreact. Some air-
line maintenance reliability programs track only dispatch reliability; that is,
they only track and investigate problems that resulted in a delay or a cancel-
lation of a flight. But this is only part of an effective program and dispatch reli-
ability involves more than just maintenance. An example will bear this out.
The aircraft pilot in command is 2 hours from his arrival station when he expe-
riences a problem with the rudder controls. He writes up the problem in the air-
craft logbook and reports it by radio to the flight following unit at the base. Upon
arrival at the base, the maintenance crew meets the plane and checks the log
for discrepancies. They find the rudder control write-up and begin trou-
bleshooting and repair actions. The repair takes a little longer than the sched-
uled turnaround time and, therefore, causes a delay. Since maintenance is at
work and the rudder is the problem, the delay is charged to maintenance and
the rudder system would be investigated for the cause of the delay.
This is an improper response. Did maintenance cause the delay? Did the
rudder equipment cause the delay? Or was the delay caused by poor airline pro-
cedures? To put it another way: could a change of airline procedures eliminate
the delay? Let us consider the events as they happened and how we might
change them for the better.
If the pilot and the flight operations organization knew about the problem
2 hours before landing, why wasn’t maintenance informed at the same time? If
they had been informed, they could have spent the time prior to landing in
studying the problem and performing some troubleshooting analysis. It is quite
possible, then, that when the airplane landed, maintenance could have met it
with a fix in hand. Thus, this delay could have been prevented by procedural
changes. The procedure should be changed to avoid such delays in the future.

While the maintenance organization and the airline could benefit from this
advance warning of problems, it will not always eliminate delays. The impor-
tant thing to remember is that if a delay is caused by procedure, it should be
attributed to procedure and it should be avoided in the future by altering the
procedure. That is what a reliability program is about: detecting where the
problems are and correcting them, regardless of who or what is to blame.
Another fallacy in overemphasizing dispatch delay is that some airlines will
investigate each delay (as they should), but if an equipment problem is involved,
the investigation may or may not take into account other similar failures that
did not cause delays. For example, if you had 12 write-ups of rudder problems
during the month and only one of these caused a delay, you actually have two
problems to investigate: (a) the delay, which could be caused by problems other
than the rudder equipment and (b) the 12 rudder write-ups that may, in fact,
be related to an underlying maintenance problem. One must understand that
dispatch delay constitutes one problem and the rudder system malfunction
constitutes another. They may indeed overlap but they are two different prob-
lems. The delay is an event-oriented reliability problem that must be investi-
gated on its own; the 12 rudder problems (if this constitutes a high failure
rate) should be addressed by the statistical (or historical) reliability program.
The investigation of the dispatch delays should look at the whole operation.
Equipment problems—whether or not they caused delays—should be investi-
gated separately.

A Reliability Program
A reliability program for our purposes is, essentially, a set of rules and practices
for managing and controlling a maintenance program. The main function of a reli-
ability program is to monitor the performance of the vehicles and their associated
equipment and call attention to any need for corrective action. The program has
two additional functions: (a) to monitor the effectiveness of those corrective actions
and (b) to provide data to justify adjusting the maintenance intervals or mainte-
nance program procedures whenever those actions are appropriate.

Elements of a Reliability Program


A good reliability program consists of seven basic elements as well as a number
of procedures and administrative functions. The basic elements (discussed in
detail below) are (a) data collection; (b) problem area alerting; (c) data display;
(d) data analysis; (e) corrective actions; (f) follow-up analysis; and (g) a monthly
report. We will look at each of these seven program elements in more detail.

Data collection
We will list 10 data types that can be collected, although they may not necessarily
be collected by all airlines. Other items may be added at the airline’s discretion.

The data collection process gives the reliability department the information
needed to observe the effectiveness of the maintenance program. Those items that
are doing well might be eliminated from the program simply because the data
show that there are no problems. On the other hand, items not being tracked may
need to be added to the program because there are serious problems related to
those systems. Basically, you collect the data needed to stay on top of your oper-
ation. The data types normally collected are as follows:

1. Flight time and cycles for each aircraft
2. Cancellations and delays over 15 minutes
3. Unscheduled component removals
4. Unscheduled engine removals
5. In-flight shutdowns of engines
6. Pilot reports or logbook write-ups
7. Cabin logbook write-ups
8. Component failures (shop maintenance)
9. Maintenance check package findings
10. Critical failures

We will discuss each of these in detail below.

Flight time and flight cycles. Most reliability calculations are “rates” and are
based on flight hours or flight cycles; e.g., 0.76 failures per 1000 flight hours or
0.15 removals per 100 flight cycles.
Cancellations and delays over 15 minutes. Some operators collect data on all such
events, but maintenance is concerned primarily with those that are maintenance
related. The 15-minute time frame is used because that amount of time can usu-
ally be made up in flight. Longer delays may cause schedule interruptions or
missed connections, thus the need for rebookings. This parameter is usually con-
verted to a “dispatch rate” for the airline as discussed above.
Unscheduled component removals. This is the unscheduled maintenance
mentioned earlier and is definitely a concern of the reliability program. The rate
at which aircraft components are removed may vary widely depending on the
equipment or system involved. If the rate is not acceptable, an investigation
should be made and some sort of corrective action must be taken. Components
that are removed and replaced on schedule—e.g., HT items and certain OC
items—are not included here, but these data may be collected to aid in justify-
ing a change in the HT or OC interval schedule.
Unscheduled removals of engines. These are tracked the same way as component
removals, but an engine removal involves a considerable amount of time and
manpower; therefore, these data are tallied separately.
In-flight shutdown (IFSD) of engines. This malfunction is probably one of the most
serious in aviation, particularly if the airplane only has two engines (or one).

The FAA requires a report of IFSD within 72 hours.3 The report must include
the cause and the corrective action. The ETOPS operators are required to track
IFSDs and respond to excessive rates as part of their authorization to fly ETOPS.
However, non-ETOPS operators also have to report shutdowns and should also
be tracking and responding to high rates through the reliability program.
Pilot reports or logbook write-ups. These are malfunctions or degradations in
airplane systems noted by the flight crew during flight. Tracking is usually by
ATA Chapter numbers using two, four, or six digits. This allows pinpointing of
the problems to the system, subsystem, or component level as desired.
Experience will dictate what levels to track for specific equipment.
Cabin logbook write-ups. These discrepancies may not be as serious as those
the flight crew deals with, but passenger comfort and the ability of the cabin
crew to perform their duties may be affected. These items may include cabin
safety inspection, operational check of cabin emergency lights, first aid kits, and
fire extinguishers. If any abnormality is found, these items are written up by
the flight crew in the maintenance logbook as a discrepancy item.
Component failures. Any problems found during shop maintenance visits are
tallied for the reliability program. This refers to major components within the
black boxes (avionics) or parts and components within mechanical systems.
Maintenance check package findings. Systems or components found to be in need
of repair or adjustment during normal scheduled maintenance checks (non-
routine items) are tracked by the reliability program.
Critical failures. Failures involving a loss of function or secondary damage that
could have a direct adverse effect on operating safety.

Problem detection—an alerting system


The data collection system allows the operator to compare present performance
with past performance in order to judge the effectiveness of maintenance and
the maintenance program. An alerting system should be in place to quickly
identify those areas where the performance is significantly different from
normal. These are items that might need to be investigated for possible prob-
lems. Standards for event rates are set according to an analysis of past
performance, and deviations from these standards signal potential problems.
This alert level is based on a statistical analysis of the event rates of the pre-
vious year, offset by 3 months. The mean value of the failure rates and the stan-
dard deviation from the mean are calculated, and an alert level is set at one to
three standard deviations above that mean rate (more on setting and adjusting
alert levels later). This value, the upper control limit (UCL), is commonly referred
to as the alert level. However, there is an additional calculation that can be made
to smooth the curve and help eliminate “false alerts.” This is the 3-month rolling
average, or trend line. The position of these two lines (the monthly rate and the
3-month average) relative to the UCL is used to determine alert status.

3. See Federal Aviation Regulation 121.703, Mechanical Reliability Report.

Figure 18-2 Calculation of new alert levels. (Plot of monthly event rates from Jan-99 through Mar-01: the Jan-99 through Dec-99 rates are used to compute the mean value and the UCL (mean + 2 SD) applied to the new data year, and the Jan-00 through Mar-00 points, shown as diamonds, form the 3-month offset.)

Setting and adjusting alert levels


It is recommended that alert levels be recalculated yearly. The data used to
determine alert level are the event rates for the previous year offset by 3 months.
The reason for this will be explained shortly.
Figure 18-2 shows the data used and the results in graphic form. In this
example, we are establishing a new alert level for the year April 2000 through
March 2001. This level is represented in Fig. 18-2 as the upper straight line.
These data were obtained using the actual event rates for January 1999 through
December 1999 shown on the left of the figure. The three data points between
(shown as diamonds for January to March 2000 in Fig. 18-2) will be used in cal-
culating a 3-month rolling average to be used during the collection of new data.
This will be discussed later.
Basic statistics are used for the calculations. From the original data
(January–December 1999) we calculate the mean and the standard deviation
of these data points. The mean is used as a baseline for the new data and is
shown as the dashed line on the right side of Fig. 18-2. The solid line on the right
of Fig. 18-2 is the alert level that we have chosen for these data and is equal to
the calculated mean plus two standard deviations. Event rates for the new year,
then, will be plotted and measured relative to these guidelines.
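In code, the calculation just described might look like the following sketch; the monthly rates are invented for illustration, and the multiplier k (one to three standard deviations) is the judgment call discussed later in this chapter.

```python
from statistics import mean, stdev

def alert_level(base_year_rates, k=2):
    """Return (mean, UCL) from the 12 monthly event rates of the base
    year (offset 3 months from the new data year); the alert level is
    set k standard deviations above the mean."""
    m = mean(base_year_rates)
    return m, m + k * stdev(base_year_rates)

# Invented Jan-99..Dec-99 event rates (events per 1000 flight hours)
rates_1999 = [0.35, 0.41, 0.38, 0.44, 0.40, 0.37,
              0.46, 0.39, 0.42, 0.36, 0.43, 0.40]
baseline, ucl = alert_level(rates_1999)
print(f"mean = {baseline:.3f}, alert level = {ucl:.3f}")
```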

Reading alert status


The data shown in Fig. 18-3 show 1 year of event rates (solid jagged line with
triangles) along with the mean value (bottom straight line) and the alert level
(upper straight line). As you can see, the event rate swings above the alert level
several times through the year (February, June, October, and December).

Figure 18-3 Reading alert status. (Plot of one year of monthly event rates, Jan through Dec, against the mean value, the alert level (UCL), and the 3-month rolling average.)

Of course, it is easy to see the pattern as we look at the year’s events. But in
reality, you will only see 1 month at a time and the preceding months. Information on what
is going to happen the next month is not available to you.
When the event rate goes above the alert level (as in February), it is not nec-
essarily a serious matter. But if the rate stays above the alert level for 2 months
in succession, then it may warrant an investigation. The preliminary investi-
gation may indicate a seasonal variation or some other one-time cause, or it may
suggest the need for a more detailed investigation. More often than not, it can
be taken for what it was intended to be—an “alert” to a possible problem. The
response would be to wait and see what happens next month. In Fig. 18-3, the
data show that, in the following month (March) the rate went below the line;
thus, no real problem exists. In other words, when the event rate penetrates the
alert level, it is not an indication of a problem; it is merely an “alert” to the pos-
sibility of a problem. Reacting too quickly usually results in unnecessary time
and effort spent in investigation. This is what we call a “false alert.”
If experience shows that the event rate for a given item varies widely from
month to month above and below the UCL as in Fig. 18-3—and this is common
for some equipment—many operators use a 3-month rolling average. This is
shown as the dashed line in Fig. 18-3. For the first month of the new data year,
the 3-month average is determined by using the offset data points in Fig. 18-2.
(Actually, only 2 months offset is needed, but we like to keep things on a quar-
terly basis.) The purpose for the offset is to ensure that the plotted data for the
new year do not contain any data points that were used to determine the mean
and alert levels we use for comparison.
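A sketch of the trend-line computation follows; the two offset months are passed in explicitly so that averages can be formed from the first month of the new data year onward.

```python
def rolling_3_month(offset_months, new_year_rates):
    """Return one 3-month rolling average per month of the new data year,
    seeding the window with the two months preceding the data year."""
    series = list(offset_months)[-2:] + list(new_year_rates)
    return [sum(series[i - 2:i + 1]) / 3 for i in range(2, len(series))]

# e.g., trend = rolling_3_month([0.42, 0.39], new_year_rates)
```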

While the event rate swings above and below the alert level, the 3-month
rolling average (dashed line) stays below it—until October. This condition—
event rate and 3-month average above the UCL—indicates a need to watch the
activity more closely. In this example, the event rate went back down below the
UCL in November, but the 3-month average stayed above the alert level. This
is an indication that the problem should be investigated.
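The reading rules just described can be condensed into a small classifier, given the monthly rates, the rolling-average trend values, and the UCL. This is a deliberate simplification of what an analyst actually weighs (seasonal effects, severity, fleet size), not a complete decision procedure.

```python
def alert_status(monthly_rates, trend, ucl):
    """Classify the latest month against the alert level (UCL)."""
    if trend[-1] > ucl:
        return "investigate"  # 3-month rolling average above the alert level
    if len(monthly_rates) >= 2 and min(monthly_rates[-2:]) > ucl:
        return "investigate"  # rate above the UCL two months in succession
    if monthly_rates[-1] > ucl:
        return "watch"        # single excursion: possible false alert
    return "clear"
```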

Setting alert levels


These upper control limits, or alert levels, and the mathematics that produced
them are not magical by any means. They will not tell you when you have a def-
inite problem nor will they tell you where or what to investigate. What they will
do is provide you with intelligent guidelines for making your own decisions
about how to proceed. The whole process begins with your intellect and your
ability to set these alert levels effectively.
Earlier in this chapter, we talked about an airline that was rejecting statis-
tical reliability and gave an example of why. Another of the reasons the gentle-
man gave for this decision was that “we know we have problems with engines,
but engines are never on alert.” If you use the UCL concept to alert you to pos-
sible problems and you do not get an alert indication when you know you have
problems, then it should not take much thought to make you realize that your
chosen alert level is wrong. This alert level is a very important parameter and
it must be set to a useable level, a level that will indicate to you that a problem
exists or may be developing. If not set properly, the alert level is useless. And
that is not the fault of statistics.
This use of an alert level is designed to tell you when you have (or may have)
a problem developing that requires investigation. But you have to know what
conditions constitute a possible problem and set the alert level accordingly. You
have to know your equipment and its failure patterns to determine when you
should proceed with an investigation and when to refrain from investigating.
You have to recognize “false alerts.” You also have to know whether or not the
event rate data points for a particular item are widely or narrowly distributed;
i.e., if it has a large or small standard deviation. This knowledge is vital to set-
ting useable alert levels.
Many airlines set all alert levels at two standard deviations above the mean.
Applied indiscriminately, this is not a good practice. It is a good place to start,
but adjustments must be made in some cases to provide the most useable data
and to avoid false alerts.
As we discussed in Chap. 1, not everything fails at the same rate or in the
same pattern. Event rates tracked by a reliability program can be quite erratic,
as the data in Fig. 18-3 show. For other rates, the numbers can be more stable.
This characteristic of the data is depicted by the statistical parameter of stan-
dard deviation—the measure of the distribution of data points around the
mean. A large standard deviation means wide distribution, a large variation
in point values. A small standard deviation means that the points are closer
together.

Figure 18-4 Dispersion of data points. (Panel A: data with wide dispersion; panel B: data with narrow dispersion. Both data sets center near the same mean, but their spreads differ markedly.)

Figure 18-4 shows the difference between two data sets. The data points in
(A) are widely scattered or distributed about the mean while those in (B) are all very
close together around the mean. Note that the averages of these two data sets are
nearly equal but the standard deviations are quite different. Figure 18-5 shows the
bell-shaped distribution curve. One, two, and three standard deviations in each
case are shown on the graph. You can see here that, at one SD only 68 percent of
the valid failure rates are included. At two standard deviations above the mean,
you still have not included all the points in the distribution. In fact, two stan-
dard deviations above and below the mean encompass only 95.5 percent of the
points under the curve; i.e., just over 95 percent of the valid failure rates. This
is why we do not consider an event rate in this range a definite problem. If it
remains above this level in the following month it may suggest a possible problem.
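The coverage figures quoted here follow directly from the normal distribution; a quick check (rounding explains the small differences from Fig. 18-5):

```python
from math import erf, sqrt

def coverage(k):
    """Fraction of a normal distribution within +/- k standard deviations."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, f"{coverage(k):.2%}")  # 68.27%, 95.45%, 99.73%
```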

Figure 18-5 Standard bell-shaped curve. One, two, and three standard deviations about the mean encompass 68.26, 95.46, and 99.73 percent of the distribution, respectively. (Source: The Standard Handbook for Aeronautical and Astronautical Engineers, New York, NY: McGraw-Hill, 2003.)

On the other hand, if the event rate data you are working with had a small stan-
dard deviation, it would be difficult to distinguish between two and three SDs.
In this case, the alert level should be set at three SDs.
This alert level system can be overdone at times. The statistics used are not
exact. We are assuming that the event rates will always have a distribution
depicted by the bell-shaped curve. We assume that our data are always accu-
rate and that our calculations are always correct. But this may not be true. These
alert levels are merely guidelines to identifying what should be investigated and
what can be tolerated. Use of the alert level is not rocket science but it helps
ease the workload in organizations with large fleets and small reliability staffs.
Some airlines, using only event rates, will investigate perhaps the 10 highest
rates; but this does not always include the most important or the most signifi-
cant equipment problems. The alert level approach allows you to prioritize these
problems and work the most important ones first.

Data display
Several methods for displaying data are utilized by the reliability department to
study and analyze the data they collect. Most operators have personal computers
available so that data can easily be displayed in tabular and graphical forms. The
data are presented as events per 100 or 1000 flight hours or flight cycles. Some,
such as delays and cancellations, are presented as events per 100 departures. The
value of 100 allows easy translation of the rate into a percentage.
Tabular data allow the operator to compare event rates with other data on
the same sheet, as well as to compare quarterly or yearly data (see
Table 18-1). Graphs, on the other hand, allow the operator to view the month-
to-month performance and note, more readily, those items that show increasing

TABLE 18-1 Pilot Reports per 100 Landings (by ATA Chapter)

ATA                                                                   Three-month                 Alert
Chapter  System                  PIREPS  June-99  July-99  August-99  average      UCL    Mean   status
21       Air conditioning        114     3.65     3.77     3.80       3.74         3.75   2.70   YE
22       Auto flight             43      1.80     1.48     1.45       1.58         1.39   1.21   WA
23       Communications          69      3.44     2.75     2.33       2.84         2.80   2.30   CL
24       Electrical power        29      1.15     0.87     0.98       1.00         0.94   0.60   AL
25       Equip/furnishings       104     4.17     3.69     3.52       3.79         5.43   4.38
26       Fire protection         30      1.80     1.30     1.01       1.37         2.19   1.14
27       Flight controls         48      0.99     3.07     1.62       1.89         1.94   1.26
28       Fuel                    36      0.65     1.16     1.22       1.01         2.32   1.27
29       Hydraulic power         17      0.73     0.43     0.57       0.58         1.58   0.82
30       Ice & rain protection   12      0.61     0.65     0.41       0.56         0.72   0.56
31       Instruments             49      1.76     1.48     1.66       1.63         2.46   1.66
32       Landing gear            67      2.41     2.06     2.27       2.25         2.72   1.76
33       Lights                  72      3.48     3.15     2.43       3.02         3.32   2.42
34       Navigation              114     4.81     6.62     3.85       5.09         5.58   4.70
35       Oxygen                  19      0.31     0.67     0.64       0.54         0.41   0.23   YE
36       Pneumatics              25      1.11     0.80     0.85       0.92         1.19   0.77
38       Water & waste           16      0.42     0.36     0.54       0.44         1.10   0.56
49       Aux. power              42      1.41     1.48     1.42       1.44         1.63   1.38
51       Structures              0       0.00     0.00     0.00       0.00         0.16   0.09
52       Doors                   31      1.41     1.05     1.05       1.17         1.62   0.92
53       Fuselage                0       0.00     0.00     0.00       0.00         0.33   0.02
54       Nacelles & pylons       1       0.00     0.00     0.08       0.03         0.22   0.10
55       Stabilizers             0       0.00     0.00     0.00       0.00         0.16   0.09
56       Windows                 0       0.00     0.04     0.00       0.01         0.09   0.06
57       Wings                   0       0.00     0.00     0.00       0.00         0.33   0.15
71       Power plant             11      0.65     0.54     0.37       0.52         1.30   0.91
72       Engine                  4       0.31     0.29     0.14       0.25         0.47   0.22
73       Fuel & controls         17      0.96     0.47     0.57       0.67         0.84   0.61
74       Ignition                11      0.08     0.40     0.37       0.28         0.46   0.30
75       Air                     53      1.52     1.63     1.79       1.65         1.11   0.66   RA
76       Engine control          3       0.23     0.14     0.10       0.16         0.33   0.15
77       Engine indicating       22      0.53     0.76     0.74       0.68         0.96   0.68
78       Exhaust                 3       0.50     0.43     0.10       0.34         0.90   0.64
79       Oil                     5       0.19     0.22     0.17       0.19         0.83   0.48
80       Starting                3       0.27     0.29     0.10       0.22         0.28   0.17   CL
Total                            1070

NOTE: Alert status codes: CL = clear from alert; YE = yellow alert; AL = red alert; RA = remains in alert; WA = watch.

rates and appear to be heading for alert status (see Fig. 18-3). This is a great
help in analysis. Some of the data collected may be compared on a monthly basis,
by event, or by sampling.
Table 18-1 is a listing of pilot reports (PIREPS) or maintenance logbook
entries recorded by a typical airline for 1 month of operation for a fleet of air-
craft. The numbers are examples only and do not represent any particular oper-
ator, aircraft, or fleet size. For these data, a tally is kept by ATA Chapter, and
event rates are calculated as PIREPS per 100 landings. The chart shows data
for the current month (August '99) and the two previous months along with the

3-month rolling average. The alert level or UCL and the mean value of event
rate, calculated as discussed in the text, are also included. Seven of these ATA
Chapters have alert indications noted in the last column.
Chapter 21 has had an event rate above the UCL for 2 months running (July,
August); therefore, this represents a yellow alert (YE). Depending on the sever-
ity of the problem, this may or may not require an immediate investigation.
Chapter 24, however, is different. For June, the event rate was high, 1.15. If this
were the first time for such a rate, it would have been listed in the report for
that month as a watch (WA). The rate went down in July but has gone up again
in August. In the current report, then, it is a full alert condition. It is not only
above the alert level, it has been above it for 2 of the 3 months, and it appears
somewhat erratic. It is left as an exercise for the student to analyze the other alert
status items. What about ATA Chapter 38?

Data analysis
Whenever an item goes into alert status, the reliability department does a pre-
liminary analysis to determine if the alert is valid. If it is valid, a notice of the
on-alert condition is sent to engineering for a more detailed analysis. The engi-
neering department is made up of experienced people who know maintenance
and engineering. Their job relative to these alerts is to troubleshoot the prob-
lem, determine the required action that will correct the problem, and issue an
engineering order (EO) or other official paperwork that will put this solution in
place.
At first, this may seem like a job for maintenance. After all, troubleshooting
and corrective action is their job. But we must stick with our basic philosophy
from Chap. 7 of separating the inspectors from the inspected. Engineering can
provide an analysis of the problem that is free from any unit bias and be free
to look at all possibilities. A unit looking into its own processes, procedures, and
personnel may not be so objective. The engineering department should provide
analysis and corrective action recommendations to the airline Maintenance
Program Review Board (discussed later) for approval and initiation.
Note: Appendix C discusses the troubleshooting process that applies to engi-
neers as well as mechanics; and Appendix D outlines additional procedures for
reliability and engineering alert analysis efforts.

Corrective action
Corrective actions can vary from one-time efforts correcting a deficiency in a pro-
cedure to the retraining of mechanics to changes in the basic maintenance pro-
gram. The investigation of these alert conditions commonly results in one or
more of the following actions: (a) modifications of equipment; (b) change in or
correction to line, hangar, or shop processes or practices; (c) disposal of defec-
tive parts (or their suppliers); (d) training of mechanics (refresher or upgrade);
(e) addition of maintenance tasks to the program; or (f) decreases in maintenance

intervals for certain tasks. Engineering then produces an engineering order for
implementation of whatever action is applicable. Engineering also tracks the
progress of the order and offers assistance as needed. Completion of the cor-
rective action is noted in the monthly reliability report (discussed later).
Continual monitoring by reliability determines the effectiveness of the selected
corrective action.
Corrective actions should be completed within 1 month of issuance of the EO.
Completion may be deferred if circumstances warrant, but action should be
completed as soon as possible to make the program effective. Normally, the
Maintenance Program Review Board (MPRB) will require justification in writ-
ing for extensions of this period; the deferral, and the reason for deferral, will
be noted in the monthly report.

Follow-up analysis
The reliability department should follow up on all actions taken relative to
on-alert items to verify that the corrective action taken was indeed effective.
This should be reflected in decreased event rates. If the event rate does not
improve after action has been taken, the alert is reissued and the investiga-
tion and corrective action process is repeated, with engineering taking a dif-
ferent approach to the problem. If the corrective action involves lengthy
modifications to numerous vehicles, the reduction in the event rate may not
be noticeable for some time. In these cases, it is important to continue mon-
itoring the progress of the corrective action in the monthly report along with
the ongoing event rate until corrective action is completed on all vehicles.
Then follow-up observation is employed to judge the effectiveness (wisdom)
of the action. If no significant change is noted in the rates within a reason-
able time after a portion of the fleet has been completed, the problem and the
corrective action should be reanalyzed.

Data reporting
A reliability report is issued monthly. Some organizations issue quarterly and
yearly reports in summary format. The most useful report, however, is the
monthly. This report should not contain an excessive amount of data and graphs
without a good explanation of what this information means to the airline and
to the reader of the report. The report should concentrate on the items that have
just gone on alert, those items under investigation, and those items that are in
or have completed the corrective action process. The progress of any items that
are still being analyzed or implemented will also be noted in the report, show-
ing status of the action and percent of fleet completed if applicable. These items
should remain in the monthly report until all action has been completed and the
reliability data show positive results.
Other information, such as a list of alert levels (by ATA Chapter or by item)
and general information on fleet reliability will also be included in the monthly
report. Such items as dispatch rates, reasons for delays and/or cancellations,

flight hours and cycles flown and any significant changes in the operation that
affect the maintenance activity would also be included. The report should be
organized by fleet; i.e., each airplane model would be addressed in a separate
section of the report.
The monthly reliability report is not just a collection of graphs, tables, and
numbers designed to dazzle higher-level management. Nor is it a document
left on the doorstep of others, such as QA or the FAA, to see if they can detect
any problems you might have. This monthly report is a working tool for main-
tenance management. Besides providing operating statistics, such as the
number of aircraft in operation, the number of hours flown, and so forth, it also
provides management with a picture of what problems are encountered (if any)
and what is being done about those problems. It also tracks the progress and
effectiveness of the corrective action. The responsibility for writing the report
rests with the reliability department, not engineering.

Other Functions of the Reliability Program


Investigation of the alert items by engineering often results in the need to
change the maintenance program. This can mean (a) changes in specific tasks;
(b) adjustments in the interval at which maintenance tasks are performed; or
(c) changes in the maintenance processes (HT, OC, and CM) to which compo-
nents are assigned. A change in the task may mean rewriting maintenance
and/or test procedures or implementing new, more effective procedures.
Adjustments in the maintenance interval may be a solution to a given problem.
A maintenance action currently performed at, say, a monthly interval may, in
fact, need to be done weekly or even daily to reduce the event rate. The reliability pro-
gram should provide the rules and processes used to adjust these intervals. The
MPRB must approve these changes and, in certain instances, the regulatory
authority must also approve. Generally, though, the change to a greater frequency
(shorter interval) is not difficult. One should keep in mind, however, that this
means higher cost of maintenance due to the increase in maintenance activity.
This cost must be offset by the reduction in the event rate that generated the
change and a reduction in the maintenance requirements resulting from the
change. The economics of this change is one of the concerns engineering must
address during the investigation of the alert condition. The cost of the change may
or may not be offset by the gain in reliability or performance (see objective 5 in
Chap. 3).

Administration and Management of the Reliability Program
On the administration and management side, a reliability program will include
written procedures for changing maintenance program tasks, as well as processes
and procedures for changing maintenance intervals (increasing or decreasing
them). Identification, calculation, establishment, and adjustment of alert levels

and the determination of what data to track are basic functions of the reliabil-
ity section. Collecting data is the responsibility of various M&E organizations,
such as line maintenance (flight hours and cycles, logbook reports, etc.); overhaul
shops (component removals); hangar (check packages); and material (parts
usage). Some airlines use a central data collection unit for this, located in M&E
administration, or some other unit such as engineering or reliability. Other air-
lines have provisions for the source units to provide data to the reliability depart-
ment on paper or through the airline computer system. In either case, reliability
is responsible for collecting, collating, and displaying these data and performing
the preliminary analysis to determine alert status.
The reliability department analyst, in conjunction with MCC, keeps a watch-
ful eye on the aircraft fleet and its systems for any repeat maintenance dis-
crepancies. The analyst reviews reliability reports and items on a daily basis,
including aircraft daily maintenance, time-deferred maintenance items,
MEL, and other out-of-service events with any type of repeat mechanical
discrepancies.
The analyst plans a sequence of repair procedures when an aircraft has repeated
a maintenance discrepancy three times or more and routine attempts to rid the
aircraft of the discrepancy have been exhausted. The analyst is normally
in contact with the MCC and local aircraft maintenance management to coordi-
nate a plan of attack with the aircraft manufacturer’s maintenance help desk to
ensure proper tracking and documenting of the actual maintenance discrepancy
and corrective action planned or maintenance performed. These types of
communication are needed for an airline to run a successful maintenance operation
and to keep the aircraft maintenance downtime to a minimum. This normally
occurs when a new type of aircraft is added to the airline’s fleet. Sometimes
maintenance needs help fixing a recurring problem.

Maintenance program review board


The solution of reliability problems is not the exclusive domain of the reliabil-
ity section or the engineering section; it is a maintenance and engineering
organization-wide function. This group approach ensures that all aspects of the
problem have been addressed by those who are most familiar with the situation.
Therefore, oversight of the program is assigned to an MPRB that is made up of
key personnel in M&E. Based on the typical organization of Chap. 7, the MPRB
would consist of the following personnel:

1. Director of MPE as chairman
2. Permanent members
a. Director of technical services
b. Director of airplane maintenance
c. Director of overhaul shops
d. Director of QA and QC
e. Manager of QA and QC

f. Manager of engineering
g. Manager of reliability
3. Adjunct members are representatives of affected M&E departments
a. Engineering supervisors (by ATA Chapter or specialty)
b. Airplane maintenance (line, hangar)
c. Overhaul shops (avionics, hydraulics, etc.)
d. Production planning and control
e. Material
f. Training

The head of MPE is the one who deals directly with the regulatory authority,
so as chairman of the MPRB, he or she would coordinate any recommended
changes requiring regulatory approval.
The MPRB meets monthly to discuss the overall status of the maintenance reli-
ability and to discuss all items that are on alert. The permanent members, or their
designated assistants, attend every meeting; the adjunct members attend those
meetings where items that relate to their activities will be discussed. Items
coming into alert status for the recent month are discussed first to determine if
a detailed investigation by engineering is needed. Possible problems and solu-
tions may be offered. If engineering is engaged in or has completed investigation
of certain problems, these will be discussed with the MPRB members. Items that
are currently in work are then discussed to track and analyze their status and
to evaluate the effectiveness of the corrective action. If any ongoing corrective
actions involve long-term implementation, such as modifications to the fleet that
must be done at the “C” check interval, the progress and effectiveness of the cor-
rective action should be studied to determine (if possible) whether or not the
chosen action appears to be effective. If not, a new approach would be discussed
and subsequently implemented by a revision to the original engineering order.
Other activities of the MPRB include the establishment of alert levels and the
adjustment of these levels as necessary for effective management of problems.
The rules governing the reliability program are developed with the approval of the
MPRB. Rules relating to the change of maintenance intervals, alert levels, and
all other actions addressed by the program must be approved by the MPRB. The
corrective actions and the subsequent EOs developed by the engineering depart-
ment are also approved by the MPRB before they are issued.

Reliability program document


The Maintenance Review Board (MRB) process, described in FAA Advisory
Circular AC 121-22B, provides guidelines the aviation industry uses to establish
minimum scheduled interval/tasking requirements for derivative and/or newly
type-certificated aircraft and their power plants for FAA approval. AC 121-22B
also refers to these scheduled interval requirements as the Maintenance Review
Board Report (MRBR). After receiving approval from the FAA, an operator may
develop its own maintenance program for its particular fleet type.

The air carrier may use this AC’s provisions, along with its own or other
maintenance information, to standardize, develop, implement, and update the
FAA-approved minimum schedule of maintenance and/or inspection requirements,
which becomes the final written report for each certificate holder.
The MRB revision issued by the manufacturer is sent to the fleet mainte-
nance manager (FMM) or a maintenance person assigned by the air carrier. In
some cases, this is the director of maintenance (DOM). The FMM/ DOM inter-
faces with the aircraft maintenance and production department to advise them
about the MRB program updates and revisions. The air carrier normally tracks
each revision by fleet type to ensure the corrective action plan has been rec-
ommended to bring the maintenance production department into compliance.
The MRB process runs concurrently with the continuous analysis and surveillance
system (CASS) and reliability-centered maintenance (RCM) and is applied using
the maintenance steering group (MSG-3) system. MSG-3 originated with the Air
Transport Association of America (ATA). The ATA coding system (detailed in
Chap. 5) divides the aircraft into distinct ATA units, and every ATA unit is
analyzed for regulatory purposes; the results are then passed on to an aviation
industry steering group/committee. After the data have been reviewed by the
steering committee and approved by the regulatory board for the MRB, the
results are published as part of the aircraft maintenance manual.
This document also includes detailed discussion of the data collection, problem
investigation, corrective action implementation, and follow-up actions. It also
includes an explanation of the methods used to determine alert levels; the
rules relative to changing maintenance process (HT, OC, CM), or MPD task
intervals; when to initiate an investigation; definitions of MPRB activities and
responsibilities; and the monthly report format. The document also includes
such administrative elements as responsibility for the document, revision
status, a distribution list, and approval signatures.
The reliability program document is a control document and thus contains
a revision status sheet and a list of effective pages, and it has limited distri-
bution within the airline. It is usually a separate document but can be included
as part of the TPPM.

FAA interaction
It is customary, in the United States, to invite the FAA to sit in on the MPRB
meetings as a nonvoting member. (They have, in a sense, their own voting
power.) Since each U.S. airline has a principal maintenance inspector (PMI)
assigned and usually on site, it is convenient for the FAA to attend these meet-
ings. Airlines outside the United States that do not have an on-site
representative may not find it as easy to comply. But the invitation should
be extended nevertheless. This lets the regulatory authority know that the air-
line is attending to its maintenance problems in an orderly and systematic
manner and gives the regulatory people an opportunity to provide any assistance
that may be required.
