Chapter 18

Reliability
Introduction
Reliability equals consistency. It can be defined as the probability that an item
will perform a required function, under specified conditions without failure, for
a specified amount of time according to its intended design. The reliability pro-
gram is a valuable means of achieving better operational performance in an
aircraft maintenance environment, and it is designed to decrease maintenance-
related issues and increase flight safety. The intent of this program is to deal
systematically with problems as they arise instead of trying to cure immediate
symptoms. This program is normally customized, depending on the operators,
to accurately reflect the specific operation’s requirements. Although the word
reliability has many meanings, in this book we will define the terms that have
specialized meanings to aviation maintenance and engineering. In the case of
reliability, we first must discuss one important difference in the application of
the term.
There are two main approaches to the concept of reliability in the aviation
industry. One looks essentially at the whole airline operation or the M&E oper-
ation within the whole, and the other looks at the maintenance program in par-
ticular. There is nothing wrong with either of these approaches, but they differ
somewhat, and that difference must be understood.
The first approach is to look at the overall airline reliability. This is measured
essentially by dispatch reliability; that is, by how often the airline achieves an
on-time departure1 of its scheduled flights. Airlines using this approach track
delays. Reasons for the delay are categorized as maintenance, flight operations,
air traffic control (ATC), etc. and are logged accordingly. The M&E organization
is concerned only with those delays caused by maintenance.
1. On-time departure means that the aircraft has been “pushed back” from the gate within 15 minutes of the scheduled departure time.
Very often, airlines using this approach to reliability overlook any mainte-
nance problems (personnel or equipment related) that do not cause delays, and
they track and investigate only those problems that do cause delays. This is only
partially effective in establishing a good maintenance program.
The second approach (which we should actually call the primary approach)
is to consider reliability as a program specifically designed to address the prob-
lems of maintenance—whether or not they cause delays—and provide analysis
of and corrective actions for those items to improve the overall reliability of the
equipment. This contributes to the dispatch reliability, as well as to the overall
operation.
We are not going to overlook the dispatch reliability, however. This is a dis-
tinct part of the reliability program we discuss in the following pages. But we
must make the distinction and understand the difference. We must also real-
ize that not all delays are caused by maintenance or equipment even though
maintenance is the center of attention during such a delay. Nor can we only
investigate equipment, maintenance procedures, or personnel for those dis-
crepancies that have caused a delay. As you will see through later discussions,
dispatch reliability is a subset of overall reliability.
Types of Reliability
The term reliability can be used in various respects. You can talk about the over-
all reliability of an airline’s activity, the reliability of a component or system, or
even the reliability of a process, function, or person. Here, however, we will dis-
cuss reliability in reference to the maintenance program specifically.
There are four types of reliability one can talk about related to the mainte-
nance activity. They are (a) statistical reliability, (b) historical reliability, (c) event-
oriented reliability, and (d) dispatch reliability. Although dispatch reliability is
a special case of event-oriented reliability, we will discuss it separately due to
its significance.
Statistical reliability
Statistical reliability is based upon collection and analysis of failure, removal,
and repair rates of systems or components. From this point on, we will refer to
these various types of maintenance actions as “events.” Event rates are calcu-
lated on the basis of events per 1000 flight hours or events per 100 flight cycles.
This normalizes the parameter for the purpose of analysis. Other rates may be
used as appropriate.
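Since the calculation is nothing more than a normalization, it can be sketched in a few lines. The following Python fragment is illustrative only; the function name and the figures used are invented, not taken from the text:

    # Illustrative sketch: normalize maintenance "events" to a common rate
    # so that fleets with different utilization can be compared.

    def event_rate(events: int, flight_hours: float, per: float = 1000.0) -> float:
        """Events per `per` flight hours (flight cycles work the same way)."""
        return events / flight_hours * per

    # Hypothetical example: 19 removals in 24,850 fleet flight hours
    print(f"{event_rate(19, 24_850):.2f} events per 1000 FH")   # -> 0.76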
Many airlines use statistical analysis, but some give the statistics more credence than they deserve. For example, airlines with 10 or more aircraft tend to use the statistical approach, yet most teachers and books on statistics tell us that statistical calculations are not very significant for any data set with fewer than about 30 data points. Another case of improper use of statistics was presented at an aviation industry seminar on reliability.
The airline representative used this as an example of why his airline was going
to stop using statistical reliability. Here is his example.
We use weather radar only 2 months of the year. When we calculate the mean
value of failure rates and the alert level in the conventional manner [discussed in
detail later in this chapter] we find that we are always on alert. This, of course, is
not true.
The gentleman was correct in identifying an error in this method, and he was correct in determining that, at least in this one case, statistics was not a valid approach. Figure 18-1 shows why.
The top curve in Fig. 18-1 shows the 2 data points for data collected when the
equipment was in service. It also shows 10 zero data points for those months
when the equipment was not used and no data were collected (12-month column).
These zeros are not valid statistical data points. They do not represent zero fail-
ures; they represent “no data” and therefore should not be used in the calcula-
tion. Using these data, however, has generated a mean value (lower, dashed line)
of 4.8 and an alert level at two standard deviations above the mean (upper, solid
line) of 27.6.
One thing to understand about mathematics is that the formulas will work and will produce numerical answers whether or not the input data are correct: garbage in, garbage out. The point is that there are only two valid data points here, shown in the bottom curve of Fig. 18-1 (the 2-month data). The only meaningful statistic is the average of the two numbers, 29 (dashed line). One can calculate a standard deviation (SD) using the appropriate formula or a calculator, but the parameter has no meaning for just two data points. The alert level set
by using this calculation is 37.5 (solid line). For this particular example, statistical reliability is not useable, but historical reliability is quite useful. We will discuss that subject in the next section.

Figure 18-1 Weather radar failure rates computed two ways. The equipment is in service only 2 months of the year (Sep: 26 failures, Oct: 32); the remaining months contribute no data and appear as zeros in the top (12-month) curve.

                   12-month calc.    2-month calc.
    Sum                 58                58
    n                   12                 2
    Avg.               4.8              29.0
    Std. Dev.         11.4               4.2
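The arithmetic of this example is easy to reproduce. The following Python sketch, assuming the figures from Fig. 18-1, contrasts the invalid 12-month calculation with the 2-month calculation:

    # Sketch of the Fig. 18-1 arithmetic: mean and alert level (mean + 2 SD),
    # first with the invalid zero "data" points included, then without them.
    from statistics import mean, stdev

    twelve_month = [0] * 10 + [26, 32]   # zeros mean "no data," not zero failures
    two_month = [26, 32]                 # the only valid observations

    for label, data in (("12-month", twelve_month), ("2-month", two_month)):
        m, sd = mean(data), stdev(data)
        print(f"{label}: mean={m:.1f}  SD={sd:.1f}  alert={m + 2 * sd:.1f}")

    # 12-month: mean=4.8  SD=11.4  alert=27.6  (always "on alert" -- wrong)
    # 2-month:  mean=29.0  SD=4.2  alert=37.5  (an SD of 2 points is meaningless)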
Historical reliability
Historical reliability is simply a comparison of current event rates with those
of past experience. In the example of Fig. 18-1, the data collected show fleet fail-
ures of 26 and 32 for the 2 months the equipment was in service. Is that good
or bad? Statistics will not tell you but history will. Look at last year’s data for
the same equipment, same time period. Use the previous year’s data also, if
available. If current rates compare favorably with past experience, then every-
thing is okay; if there is a significant difference in the data from one year to the
next, that would be an indication of a possible problem. That is what a relia-
bility program is all about: detecting and subsequently resolving problems.
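As a minimal sketch of the idea (the baseline data, the 20 percent threshold, and all names are invented for illustration), a historical comparison can be as simple as:

    # Illustrative sketch of historical reliability: compare this year's
    # in-service months against the same months in previous years.

    history = {2010: {"Sep": 27, "Oct": 30}, 2011: {"Sep": 25, "Oct": 33}}
    current = {"Sep": 26, "Oct": 32}

    for month, events in current.items():
        baseline = sum(year[month] for year in history.values()) / len(history)
        change = (events - baseline) / baseline * 100
        verdict = "possible problem" if abs(change) > 20 else "okay"
        print(f"{month}: {events} vs. baseline {baseline:.1f} ({change:+.1f}%) -> {verdict}")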
Historical reliability can be used in other instances, also. The most common one
is when new equipment is being introduced (components, systems, engines, air-
craft) and there is no previous data available on event rates, no information on
what sort of rates to expect. What is “normal” and what constitutes “a problem”
for this equipment? In historical reliability we merely collect the appropriate
data and literally “watch what happens.” When sufficient data are collected to
determine the “norms,” the equipment can be added to the statistical reliability
program.
Historical reliability can also be used by airlines wishing to establish a sta-
tistically based program. Data on event rates kept for 2 or 3 years can be tal-
lied or plotted graphically and analyzed to determine what the normal or
acceptable rates would be (assuming no significant problems were incurred).
Guidelines can then be established for use during the next year. This will be cov-
ered in more detail in the reliability program section below.
Event-oriented reliability
Event-oriented reliability is concerned with one-time events such as bird strikes, hard landings, overweight landings, in-flight engine shutdowns, lightning strikes, ground or flight interruptions, and other accidents or incidents. These are events
that do not occur on a daily basis in airline operations and, therefore, produce
no usable statistical or historical data. Nevertheless, they do occur from time
to time, and each occurrence must be investigated to determine the cause and
to prevent or reduce the possibility of recurrence of the problem.
In ETOPS2 operations, certain events associated with this program differ
from conventional reliability programs, and they do rely on historical data and
alert levels to determine if an investigation is necessary to establish whether a
problem can be reduced or eliminated by changing the maintenance program.
2. Requirements for extended range operations with two-engine airplanes (ETOPS) are outlined in FAA Advisory Circular AC 120-42B, and also discussed in Appendix E of this book.
Events that are related to ETOPS flights are designated by the FAA as actions
to be tracked by an “event-oriented reliability program” in addition to any sta-
tistical or historical reliability program. Not all the events are investigated, but
everything is continually monitored in case a problem arises.
Dispatch reliability
Dispatch reliability is a measure of the overall effectiveness of the airline oper-
ation with respect to on-time departure. It receives considerable attention from
regulatory authorities, as well as from airlines and passengers, but it is really
just a special form of the event-oriented reliability approach. It is a simple cal-
culation based on 100 flights. This makes it convenient to relate dispatch rate
in percent. An example of the dispatch rate calculation follows.
If eight delays and cancellations are experienced in 200 flights, that would mean
that there were four delays per 100 flights, or a 4 percent delay rate. A 4 percent
delay rate would translate to a 96 percent dispatch rate (100 percent − 4 percent
delayed = 96 percent dispatched on time). In other words, the airline dispatched
96 percent of its flights on time.
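In code, the calculation is one line. This hypothetical sketch simply restates the arithmetic of the example above:

    # Sketch of the dispatch-rate arithmetic from the example above.

    def dispatch_rate(delays_and_cancellations: int, departures: int) -> float:
        """Percent of departures dispatched on time (within 15 minutes)."""
        return 100.0 - delays_and_cancellations / departures * 100.0

    print(dispatch_rate(8, 200))   # 96.0 -> a 96 percent dispatch rate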
The use of dispatch reliability at the airlines is, at times, misinterpreted.
The passengers are concerned with timely dispatch for obvious reasons. To
respond to FAA pressures on dispatch rate, airlines often overreact. Some air-
line maintenance reliability programs track only dispatch reliability; that is,
they only track and investigate problems that resulted in a delay or a cancel-
lation of a flight. But this is only part of an effective program and dispatch reli-
ability involves more than just maintenance. An example will bear this out.
The aircraft pilot in command is 2 hours from his arrival station when he expe-
riences a problem with the rudder controls. He writes up the problem in the air-
craft logbook and reports it by radio to the flight following unit at the base. Upon
arrival at the base, the maintenance crew meets the plane and checks the log
for discrepancies. They find the rudder control write-up and begin trou-
bleshooting and repair actions. The repair takes a little longer than the sched-
uled turnaround time and, therefore, causes a delay. Since maintenance is at
work and the rudder is the problem, the delay is charged to maintenance and
the rudder system would be investigated for the cause of the delay.
This is an improper response. Did maintenance cause the delay? Did the
rudder equipment cause the delay? Or was the delay caused by poor airline pro-
cedures? To put it another way: could a change of airline procedures eliminate
the delay? Let us consider the events as they happened and how we might
change them for the better.
If the pilot and the flight operations organization knew about the problem
2 hours before landing, why wasn’t maintenance informed at the same time? If
they had been informed, they could have spent the time prior to landing in
studying the problem and performing some troubleshooting analysis. It is quite
possible, then, that when the airplane landed, maintenance could have met it
with a fix in hand. Thus, this delay could have been prevented by procedural
changes. The procedure should be changed to avoid such delays in the future.
While the maintenance organization and the airline could benefit from this
advance warning of problems, it will not always eliminate delays. The impor-
tant thing to remember is that if a delay is caused by procedure, it should be
attributed to procedure and it should be avoided in the future by altering the
procedure. That is what a reliability program is about: detecting where the
problems are and correcting them, regardless of who or what is to blame.
Another fallacy in overemphasizing dispatch delay is that some airlines will
investigate each delay (as they should), but if an equipment problem is involved,
the investigation may or may not take into account other similar failures that
did not cause delays. For example, if you had 12 write-ups of rudder problems
during the month and only one of these caused a delay, you actually have two
problems to investigate: (a) the delay, which could be caused by problems other
than the rudder equipment and (b) the 12 rudder write-ups that may, in fact,
be related to an underlying maintenance problem. One must understand that
dispatch delay constitutes one problem and the rudder system malfunction
constitutes another. They may indeed overlap but they are two different prob-
lems. The delay is an event-oriented reliability problem that must be investi-
gated on its own; the 12 rudder problems (if this constitutes a high failure
rate) should be addressed by the statistical (or historical) reliability program.
The investigation of the dispatch delays should look at the whole operation.
Equipment problems—whether or not they caused delays—should be investi-
gated separately.
A Reliability Program
A reliability program for our purposes is, essentially, a set of rules and practices
for managing and controlling a maintenance program. The main function of a reli-
ability program is to monitor the performance of the vehicles and their associated
equipment and call attention to any need for corrective action. The program has
two additional functions: (a) to monitor the effectiveness of those corrective actions
and (b) to provide data to justify adjusting the maintenance intervals or mainte-
nance program procedures whenever those actions are appropriate.
Data collection
We will list 10 data types that can be collected, although they may not necessarily
be collected by all airlines. Other items may be added at the airline’s discretion.
The data collection process gives the reliability department the information
needed to observe the effectiveness of the maintenance program. Those items that
are doing well might be eliminated from the program simply because the data
show that there are no problems. On the other hand, items not being tracked may
need to be added to the program because there are serious problems related to
those systems. Basically, you collect the data needed to stay on top of your oper-
ation. The data types normally collected are as follows:
Flight time and flight cycles. Most reliability calculations are “rates” and are
based on flight hours or flight cycles; e.g., 0.76 failures per 1000 flight hours or
0.15 removals per 100 flight cycles.
Cancellations and delays over 15 minutes. Some operators collect data on all such
events, but maintenance is concerned primarily with those that are maintenance
related. The 15-minute time frame is used because that amount of time can usu-
ally be made up in flight. Longer delays may cause schedule interruptions or
missed connections, thus the need for rebookings. This parameter is usually con-
verted to a “dispatch rate” for the airline as discussed above.
Unscheduled component removals. This is the unscheduled maintenance
mentioned earlier and is definitely a concern of the reliability program. The rate
at which aircraft components are removed may vary widely depending on the
equipment or system involved. If the rate is not acceptable, an investigation
should be made and some sort of corrective action must be taken. Components
that are removed and replaced on schedule—e.g., HT items and certain OC
items—are not included here, but these data may be collected to aid in justify-
ing a change in the HT or OC interval schedule.
Unscheduled removals of engines. This is the same as component removals, but
obviously an engine removal constitutes a considerable amount of time and
manpower; therefore, these data are tallied separately.
In-flight shutdown (IFSD) of engines. This malfunction is probably one of the most
serious in aviation, particularly if the airplane only has two engines (or one).
The FAA requires a report of IFSD within 72 hours.3 The report must include
the cause and the corrective action. The ETOPS operators are required to track
IFSDs and respond to excessive rates as part of their authorization to fly ETOPS.
However, non-ETOPS operators also have to report shutdowns and should also
be tracking and responding to high rates through the reliability program.
Pilot reports or logbook write-ups. These are malfunctions or degradations in
airplane systems noted by the flight crew during flight. Tracking is usually by
ATA Chapter numbers using two, four, or six digits. This allows pinpointing of the problems to the system, subsystem, or component level as desired. Experience will dictate what levels to track for specific equipment (a short tallying sketch follows this list of data types).
Cabin logbook write-ups. These discrepancies may not be as serious as those
the flight crew deals with, but passenger comfort and the ability of the cabin
crew to perform their duties may be affected. These items may include cabin
safety inspection, operational check of cabin emergency lights, first aid kits, and
fire extinguishers. If any abnormality is found, these items are written up by
the flight crew in the maintenance logbook as a discrepancy item.
Component failures. Any problems found during shop maintenance visits are
tallied for the reliability program. This refers to major components within the
black boxes (avionics) or parts and components within mechanical systems.
Maintenance check package findings. Systems or components found to be in need
of repair or adjustment during normal scheduled maintenance checks (non-
routine items) are tracked by the reliability program.
Critical failures. Failures involving a loss of function or secondary damage that
could have a direct adverse effect on operating safety.
3. See Federal Aviation Regulation 121.703, Mechanical Reliability Report.
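To make the pilot-report tracking described above concrete, the following sketch tallies write-ups by ATA chapter at two- or four-digit granularity. The ATA codes shown are invented for illustration:

    # Illustrative sketch: tally pilot reports (PIREPs) by ATA chapter, at
    # whatever digit level (2, 4, or 6) experience says is useful.
    from collections import Counter

    pireps = ["27-21-01", "27-21-02", "27-30-00", "21-50-00"]  # hypothetical codes

    def tally(reports, digits=2):
        """Count write-ups by the first `digits` digits of the ATA code."""
        return Counter(r.replace("-", "")[:digits] for r in reports)

    print(tally(pireps, 2))   # Counter({'27': 3, '21': 1})  -- system level
    print(tally(pireps, 4))   # Counter({'2721': 2, '2730': 1, '2150': 1})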
Figure 18-2 Monthly event rate with mean value, UCL (mean + 2 SD), and offset, plotted from Jan-99 through Mar-01.

Figure 18-3 Monthly event rate and 3-month rolling average (dashed line) for one data year (Jan through Dec), plotted against the alert level (UCL).
In Fig. 18-3 it is easy to see the pattern as we look at the year's events. But in reality, you will see only 1 month at a time plus the preceding months. Information on what is going to happen the next month is not available to you.
When the event rate goes above the alert level (as in February), it is not nec-
essarily a serious matter. But if the rate stays above the alert level for 2 months
in succession, then it may warrant an investigation. The preliminary investi-
gation may indicate a seasonal variation or some other one-time cause, or it may
suggest the need for a more detailed investigation. More often than not, it can
be taken for what it was intended to be—an “alert” to a possible problem. The
response would be to wait and see what happens next month. In Fig. 18-3, the
data show that, in the following month (March) the rate went below the line;
thus, no real problem exists. In other words, when the event rate penetrates the
alert level, it is not an indication of a problem; it is merely an “alert” to the pos-
sibility of a problem. Reacting too quickly usually results in unnecessary time
and effort spent in investigation. This is what we call a “false alert.”
If experience shows that the event rate for a given item varies widely from
month to month above and below the UCL as in Fig. 18-3—and this is common
for some equipment—many operators use a 3-month rolling average. This is
shown as the dashed line in Fig. 18-3. For the first month of the new data year,
the 3-month average is determined by using the offset data points in Fig. 18-2.
(Actually, only 2 months offset is needed, but we like to keep things on a quar-
terly basis.) The purpose for the offset is to ensure that the plotted data for the
new year do not contain any data points that were used to determine the mean
and alert levels we use for comparison.
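A possible implementation of this smoothing, with invented numbers, is sketched below; the two offset points seed the averages for January and February:

    # Sketch of the 3-month rolling average used to smooth noisy event rates.
    # The first points of a new data year are averaged with the final months
    # of the prior year (the "offset") so that the plotted year never mixes in
    # points that were used to set the mean and alert level.

    def rolling_3mo(rates, offset):
        """`offset` supplies the last 2 months of the prior data year."""
        padded = offset[-2:] + rates
        return [sum(padded[i:i + 3]) / 3 for i in range(len(rates))]

    prior = [0.41, 0.45]             # hypothetical Nov/Dec offset points
    year = [0.52, 0.61, 0.44, 0.39]  # hypothetical Jan-Apr event rates
    print(rolling_3mo(year, prior))  # approximately [0.46, 0.53, 0.52, 0.48]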
While the event rate swings above and below the alert level, the 3-month
rolling average (dashed line) stays below it—until October. This condition—
event rate and 3-month average above the UCL—indicates a need to watch the
activity more closely. In this example, the event rate went back down below the
UCL in November, but the 3-month average stayed above the alert level. This
is an indication that the problem should be investigated.
Figure 18-4 Two data sets, (A) and (B), with nearly equal means but different standard deviations.
Figure 18-4 shows the difference between two data sets. The data points in
(A) are widely scattered or distributed about the mean while those in (B) are all very
close together around the mean. Note that the averages of these two data sets are
nearly equal but the standard deviations are quite different. Figure 18-5 shows the
bell-shaped distribution curve. One, two, and three standard deviations in each
case are shown on the graph. You can see here that, at one SD only 68 percent of
the valid failure rates are included. At two standard deviations above the mean,
you still have not included all the points in the distribution. In fact, two stan-
dard deviations above and below the mean encompass only 95.5 percent of the
points under the curve; i.e., just over 95 percent of the valid failure rates. This
is why we do not consider an event rate in this range a definite problem. If it
remains above this level in the following month it may suggest a possible problem.
Figure 18-5 Standard bell-shaped curve: ±1σ encompasses 68.26 percent of the distribution, ±2σ encompasses 95.46 percent, and ±3σ encompasses 99.73 percent. (Source: The Standard Handbook for Aeronautical and Astronautical Engineers, New York, NY: McGraw-Hill, 2003.)
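The percentages in Fig. 18-5 follow directly from the normal distribution; a short check (rounding differences aside) reproduces them:

    # The fraction of a normal distribution within k standard deviations
    # of the mean, matching the percentages shown in Fig. 18-5.
    import math

    for k in (1, 2, 3):
        print(f"within +/-{k} SD: {math.erf(k / math.sqrt(2)) * 100:.2f}%")
    # -> 68.27%, 95.45%, 99.73%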
On the other hand, if the event rate data you are working with had a small stan-
dard deviation, it would be difficult to distinguish between two and three SDs.
In this case, the alert level should be set at three SDs.
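One plausible way to encode that guideline (the 10 percent cutoff below is invented purely for illustration, not a standard) is:

    # Sketch: set the alert level (UCL) from historical rates, widening to
    # three SDs when the spread is small, per the guideline above.
    from statistics import mean, stdev

    def alert_level(history, small_sd_fraction=0.10):
        m, sd = mean(history), stdev(history)
        k = 3 if sd < small_sd_fraction * m else 2   # illustrative rule of thumb
        return m + k * sd

    rates = [0.42, 0.45, 0.40, 0.44, 0.43, 0.41]     # hypothetical monthly rates
    print(f"UCL = {alert_level(rates):.2f}")         # -> UCL = 0.48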
This alert level system can be overdone at times. The statistics used are not
exact. We are assuming that the event rates will always have a distribution
depicted by the bell-shaped curve. We assume that our data are always accu-
rate and that our calculations are always correct. But this may not be true. These
alert levels are merely guidelines to identifying what should be investigated and
what can be tolerated. Use of the alert level is not rocket science but it helps
ease the workload in organizations with large fleets and small reliability staffs.
Some airlines, using only event rates, will investigate perhaps the 10 highest
rates; but this does not always include the most important or the most signifi-
cant equipment problems. The alert level approach allows you to prioritize these
problems and work the most important ones first.
Data display
Several methods for displaying data are utilized by the reliability department to
study and analyze the data they collect. Most operators have personal computers
available so that data can easily be displayed in tabular and graphical forms. The
data are presented as events per 100 or 1000 flight hours or flight cycles. Some,
such as delays and cancellations, are presented as events per 100 departures. The
value of 100 allows easy translation of the rate into a percentage.
Tabular data allow the operator to compare event rates with other data on
the same sheet. It also allows the comparison of quarterly or yearly data (see
Table 18-1). Graphs, on the other hand, allow the operator to view the month-
to-month performance and note, more readily, those items that show increasing
TABLE 18-1 Pilot Reports per 100 Landings (by ATA Chapter)
NOTE: Alert status codes: CL = clear from alert; YE = yellow alert; AL = red alert; RA = remains in alert; WA = watch.
rates and appear to be heading for alert status (see Fig. 18-3). This is a great
help in analysis. Some of the data collected may be compared on a monthly basis,
by event, or by sampling.
Table 18-1 is a listing of pilot reports (PIREPS) or maintenance logbook
entries recorded by a typical airline for 1 month of operation for a fleet of air-
craft. The numbers are examples only and do not represent any particular oper-
ator, aircraft, or fleet size. For these data, a tally is kept by ATA Chapter, and
event rates are calculated as PIREPS per 100 landings. The chart shows data
for the current month (August '99) and the two previous months along with the
3-month rolling average. The alert level or UCL and the mean value of event
rate, calculated as discussed in the text, are also included. Seven of these ATA
Chapters have alert indications noted in the last column.
Chapter 21 has had an event rate above the UCL for 2 months running (July,
August); therefore, this represents a yellow alert (YE). Depending on the sever-
ity of the problem, this may or may not require an immediate investigation.
Chapter 24, however, is different. For June, the event rate was high, 1.15. If that had been the first time for such a rate, it would have been listed in that month's report as a watch (WA). The rate went down in July but has gone up again in August. In the current report, then, it is a full alert condition. It is not only above the alert level, it has been above it in 2 of the past 3 months, and it appears somewhat erratic. It is left as an exercise for the student to analyze the other alert status items. What about ATA Chapter 38?
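The alert status codes of Table 18-1 can be thought of as a small decision rule. The sketch below is one plausible encoding of the behavior described in the text, not the book's definitive algorithm:

    # Sketch of the alert-status logic described above (codes per Table 18-1).
    # Exact rules vary by operator.

    def status(last3, ucl, prior="CL"):
        """last3 = event rates, oldest to newest; prior = last month's code."""
        above = [r > ucl for r in last3]
        if prior in ("YE", "AL", "RA") and above[-1]:
            return "RA"   # remains in alert
        if above[-2] and above[-1]:
            return "YE"   # above the UCL 2 months running -> yellow alert
        if sum(above) >= 2:
            return "AL"   # above in 2 of 3 months, erratic -> (red) alert
        if above[-1]:
            return "WA"   # first penetration -> watch
        return "CL"       # clear of alert

    print(status([0.30, 0.55, 0.58], ucl=0.50))   # 'YE', like ATA 21
    print(status([0.55, 0.42, 0.57], ucl=0.50))   # 'AL', like ATA 24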
Data analysis
Whenever an item goes into alert status, the reliability department does a pre-
liminary analysis to determine if the alert is valid. If it is valid, a notice of the
on-alert condition is sent to engineering for a more detailed analysis. The engi-
neering department is made up of experienced people who know maintenance
and engineering. Their job relative to these alerts is to troubleshoot the prob-
lem, determine the required action that will correct the problem, and issue an
engineering order (EO) or other official paperwork that will put this solution in
place.
At first, this may seem like a job for maintenance. After all, troubleshooting
and corrective action is their job. But we must stick with our basic philosophy
from Chap. 7 of separating the inspectors from the inspected. Engineering can
provide an analysis of the problem that is free from any unit bias and be free
to look at all possibilities. A unit looking into its own processes, procedures, and
personnel may not be so objective. The engineering department should provide
analysis and corrective action recommendations to the airline Maintenance
Program Review Board (discussed later) for approval and initiation.
Note: Appendix C discusses the troubleshooting process that applies to engi-
neers as well as mechanics; and Appendix D outlines additional procedures for
reliability and engineering alert analysis efforts.
Corrective action
Corrective actions can vary from one-time efforts correcting a deficiency in a pro-
cedure to the retraining of mechanics to changes in the basic maintenance pro-
gram. The investigation of these alert conditions commonly results in one or
more of the following actions: (a) modifications of equipment; (b) change in or
correction to line, hangar, or shop processes or practices; (c) disposal of defec-
tive parts (or their suppliers); (d) training of mechanics (refresher or upgrade);
(e) addition of maintenance tasks to the program; or (f) decreases in maintenance
intervals for certain tasks. Engineering then produces an engineering order for
implementation of whatever action is applicable. Engineering also tracks the
progress of the order and offers assistance as needed. Completion of the cor-
rective action is noted in the monthly reliability report (discussed later).
Continual monitoring by reliability determines the effectiveness of the selected
corrective action.
Corrective actions should be completed within 1 month of issuance of the EO.
Completion may be deferred if circumstances warrant, but action should be
completed as soon as possible to make the program effective. Normally, the
Maintenance Program Review Board (MPRB) will require justification in writ-
ing for extensions of this period; the deferral, and the reason for deferral, will
be noted in the monthly report.
Follow-up analysis
The reliability department should follow up on all actions taken relative to
on-alert items to verify that the corrective action taken was indeed effective.
This should be reflected in decreased event rates. If the event rate does not
improve after action has been taken, the alert is reissued and the investiga-
tion and corrective action process is repeated, with engineering taking a dif-
ferent approach to the problem. If the corrective action involves lengthy
modifications to numerous vehicles, the reduction in the event rate may not
be noticeable for some time. In these cases, it is important to continue mon-
itoring the progress of the corrective action in the monthly report along with
the ongoing event rate until corrective action is completed on all vehicles.
Then follow-up observation is employed to judge the effectiveness (wisdom)
of the action. If no significant change is noted in the rates within a reason-
able time after a portion of the fleet has been completed, the problem and the
corrective action should be reanalyzed.
Data reporting
A reliability report is issued monthly. Some organizations issue quarterly and
yearly reports in summary format. The most useful report, however, is the
monthly. This report should not contain an excessive amount of data and graphs
without a good explanation of what this information means to the airline and
to the reader of the report. The report should concentrate on the items that have
just gone on alert, those items under investigation, and those items that are in
or have completed the corrective action process. The progress of any items that
are still being analyzed or implemented will also be noted in the report, show-
ing status of the action and percent of fleet completed if applicable. These items
should remain in the monthly report until all action has been completed and the
reliability data show positive results.
Other information, such as a list of alert levels (by ATA Chapter or by item)
and general information on fleet reliability will also be included in the monthly
report. Such items as dispatch rates, reasons for delays and/or cancellations,
flight hours and cycles flown and any significant changes in the operation that
affect the maintenance activity would also be included. The report should be
organized by fleet; i.e., each airplane model would be addressed in a separate
section of the report.
The monthly reliability report is not just a collection of graphs, tables, and
numbers designed to dazzle higher-level management. Nor is it a document
left on the doorstep of others, such as QA or the FAA, to see if they can detect
any problems you might have. This monthly report is a working tool for main-
tenance management. Besides providing operating statistics, such as the
number of aircraft in operation, the number of hours flown, and so forth, it also
provides management with a picture of what problems are encountered (if any)
and what is being done about those problems. It also tracks the progress and
effectiveness of the corrective action. The responsibility for writing the report
rests with the reliability department, not engineering.
The determination of what data to collect and what data to track is a basic function of the reliability section. Collecting data is the responsibility of various M&E organizations,
such as line maintenance (flight hours and cycles, logbook reports, etc.); overhaul
shops (component removals); hangar (check packages); and material (parts
usage). Some airlines use a central data collection unit for this, located in M&E
administration, or some other unit such as engineering or reliability. Other air-
lines have provisions for the source units to provide data to the reliability depart-
ment on paper or through the airline computer system. In either case, reliability
is responsible for collecting, collating, and displaying these data and performing
the preliminary analysis to determine alert status.
The reliability department analyst, in conjunction with the maintenance control center (MCC), keeps a watchful eye on the aircraft fleet and its systems for any repeat maintenance discrepancies. The analyst reviews reliability reports and items on a daily basis, including aircraft daily maintenance, time-deferred maintenance items, MEL items, and other out-of-service events involving any type of repeat mechanical discrepancy.
The analyst plans a sequence of repair procedures when an aircraft has repeated a maintenance discrepancy three times or more and the usual fixes have been exhausted. The analyst is normally
in contact with the MCC and local aircraft maintenance management to coordi-
nate a plan of attack with the aircraft manufacturer’s maintenance help desk to
ensure proper tracking and documenting of the actual maintenance discrepancy
and corrective action planned or maintenance performed. These types of communication are needed for an airline to run a successful maintenance operation and to keep aircraft maintenance downtime to a minimum. This normally
occurs when a new type of aircraft is added to the airline’s fleet. Sometimes
maintenance needs help fixing a recurring problem.
f. Manager of engineering
g. Manager of reliability
3. Adjunct members are representatives of affected M&E departments
a. Engineering supervisors (by ATA Chapter or specialty)
b. Airplane maintenance (line, hangar)
c. Overhaul shops (avionics, hydraulics, etc.)
d. Production planning and control
e. Material
f. Training
The head of MPE is the one who deals directly with the regulatory authority,
so as chairman of the MPRB, he or she would coordinate any recommended
changes requiring regulatory approval.
The MPRB meets monthly to discuss the overall status of the maintenance reli-
ability and to discuss all items that are on alert. The permanent members, or their
designated assistants, attend every meeting; the adjunct members attend those meetings where items that relate to their activities will be discussed. Items
coming into alert status for the recent month are discussed first to determine if
a detailed investigation by engineering is needed. Possible problems and solu-
tions may be offered. If engineering is engaged in or has completed investigation
of certain problems, these will be discussed with the MPRB members. Items that
are currently in work are then discussed to track and analyze their status and
to evaluate the effectiveness of the corrective action. If any ongoing corrective
actions involve long-term implementation, such as modifications to the fleet that
must be done at the “C” check interval, the progress and effectiveness of the cor-
rective action should be studied to determine (if possible) whether or not the
chosen action appears to be effective. If not, a new approach would be discussed
and subsequently implemented by a revision to the original engineering order.
Other activities of the MPRB include the establishment of alert levels and the
adjustment of these levels as necessary for effective management of problems.
The rules governing the reliability program are developed with approval by the
MPRB. Rules relating to the change of maintenance intervals, alert levels, and
all other actions addressed by the program must be approved by the MPRB. The
corrective actions and the subsequent EOs developed by the engineering depart-
ment are also approved by the MPRB before they are issued.
The air carrier may use this AC’s provisions, along with its own or other maintenance information, to standardize, develop, implement, and update the FAA-approved minimum schedule of maintenance and/or inspection requirements, which becomes a final written report for each type of certificate holder.
The MRB revision issued by the manufacturer is sent to the fleet mainte-
nance manager (FMM) or a maintenance person assigned by the air carrier. In
some cases, this is the director of maintenance (DOM). The FMM/DOM interfaces with the aircraft maintenance and production department to advise them about MRB program updates and revisions. The air carrier normally tracks each revision by fleet type to ensure that a corrective action plan has been recommended to bring the maintenance production department into compliance.
The MRB process runs concurrently with the continuous analysis and surveillance system (CASS) and with reliability-centered maintenance (RCM), and it is applied using the Maintenance Steering Group (MSG-3) system. MSG-3 originated with the Air Transport Association of America (ATA). The ATA coding system (detailed in Chap. 5) divides the aircraft into distinct ATA units, and each ATA unit is analyzed for regulatory purposes; the results of that analysis are then passed on to an aviation industry steering group/committee. After the data have been reviewed by the steering committee and approved by the regulatory board for the MRB, the results are published as part of the aircraft maintenance manual.
This document also includes detailed discussion of the data collection, problem
investigation, corrective action implementation, and follow-up actions. It also
includes an explanation of the methods used to determine alert levels; the
rules relative to changing maintenance process (HT, OC, CM), or MPD task
intervals; when to initiate an investigation; definitions of MPRB activities and
responsibilities; and the monthly report format. The document also includes
such administrative elements as responsibility for the document, revision
status, a distribution list, and approval signatures.
The reliability program document is a control document and thus contains
a revision status sheet and a list of effective pages, and it has limited distri-
bution within the airline. It is usually a separate document but can be included
as part of the TPPM.
FAA interaction
It is customary, in the United States, to invite the FAA to sit in on the MPRB
meetings as a nonvoting member. (They have, in a sense, their own voting
power.) Since each U.S. airline has a principal maintenance inspector (PMI)
assigned and usually on site, it is convenient for the FAA to attend these meet-
ings. Airlines outside the United States, whose regulatory authorities do not keep an on-site representative at each airline, may not find it as easy to comply. But the invitation should
be extended nevertheless. This lets the regulatory authority know that the air-
line is attending to its maintenance problems in an orderly and systematic
manner and gives the regulatory people an opportunity to provide any assistance
that may be required.