Abstract—In modern manufacturing systems, maintenance operations are key to improving machines’ reliability and availability, and hence to improving system productivity and quality. Maintenance can be roughly categorized into Corrective Maintenance (CM) and Preventive Maintenance (PM). Since the production system is highly stochastic and maintenance actions could either be perfect or imperfect, it is a complex decision to make on when, where, and which type of maintenance action should be taken. In this paper, a maintenance control law is proposed to schedule cost-effective maintenance, either CM or PM, in a real-time fashion. The control cost consists of a resource cost, and the immediate and potential production losses due to the stoppage caused by the maintenance action. A data-driven model is used to evaluate the production losses. A case study is performed to demonstrate the effectiveness of the proposed control method by comparing three different maintenance policies.

Index Terms—Maintenance management, Deteriorating system, Real-time control, Opportunity window, Data-driven modeling

I. INTRODUCTION

Most manufacturing systems are subject to deterioration due to usage and aging of machines and operations in real industry practices. Upon random failures, machines will stop working unexpectedly, forcing the plant management to react to the disruption passively. Such reactive action is corrective maintenance (CM). The outcomes and costs of such unexpected failures are very hard to predict and sometimes could be far beyond control. To avoid such uncontrollable situations, it is essential to conduct preventive maintenance (PM) before the failure actually happens. Effective and timely maintenance decision making is therefore not a trivial matter in improving the competitiveness of a manufacturing organization.

Optimal maintenance policies have been intensively investigated during the past decades. Researchers have found that in reality, the maintenance action, either CM or PM, is not necessarily a complete replacement [1]. Three different maintenance types are defined according to the improvement degrees. A perfect maintenance action recovers the machine to the status of “as good as new”, while a minimal maintenance action only resumes the functionality of the machine without changing its aging status. An imperfect maintenance recovers the machine to an extent between the two. Based on these maintenance types, many maintenance policies have been proposed for a single machine or a single-unit system [2]. For instance, Liao et al. [3] used a search algorithm to determine a reliability threshold, at which a preventive maintenance action will be taken, thus maximizing the availability of the machine. These studies provide great insights into machine reliability analysis.

However, as modern manufacturing systems are characterized by their complex structures of strongly interconnected machines and stochastic dynamics, existing single-machine maintenance policies cannot be directly applied to machines in a multi-stage production line. These machines are interconnected with each other, and both the upstream and downstream impacts need to be fully examined [4]. As a result, any maintenance decision, either CM or PM, should consider not only its effect on the single machine’s degradation, but also the impact on the overall system.

To address the complexity, some simulation-based studies have been conducted to find the optimal maintenance schedule, especially for PM [5]. Based on simulation, Roux et al. [6] optimized the interval of periodic PM to ensure a low level of failures and minimize the unavailability of the system. One drawback of this approach is that in multi-stage manufacturing systems, the temporary stoppage of one machine does not necessarily lead to production losses. Arab et al. [7] addressed this issue by incorporating remaining reliability of machines and work-in-process inventories into the simulation model to search for the optimal maintenance schedule. However, intensive computing resources are required to search in the huge solution space, especially when the system is scaled up. Changes on the process and equipment, which are the norm in today’s manufacturing industry, will lead to corresponding changes and reconstructions in the simulation models, and subsequently a higher resource
distribution. The failure rates of the machines are mutually independent and increasing with time.
2) Corrective maintenance (CM) is taken upon failure; preventive maintenance (PM) is taken when the machine is still operational to prevent future failures.
3) A perfect maintenance action will recover the machine to “as good as new”. A minimal maintenance action will only resume the machine without changing its aging status.
4) An imperfect maintenance action recovers the machine to somewhere between old and new. The recovery effect is not stochastic and can be described by a deterministic improvement factor.
5) The duration of each maintenance type is assumed to be deterministic.
6) CM can be perfect, imperfect, or minimal; PM can be perfect or imperfect.

B. Production system model

with time until it receives a preventive maintenance or breaks down and a corrective maintenance has to be imposed. As Fig. 2 shows, the machine receives its $j$-th maintenance (either CM or PM) at time $t_{ij}$ with a duration of $d_{ij}$, and at time $t_{ij} + d_{ij}$ the machine resumes operation.

Figure 2. Maintenance actions on machine $S_i$
$$C_{ij} = C_{res} + c_p \left(PL_{ij} + PPL_{ij}\right) \tag{7}$$

$C_{res}$ is the cost of the resources used during maintenance, including part replacement and other consumable expenses. It varies with the maintenance type. Generally, the resource cost of a perfect maintenance is larger than that of an imperfect maintenance, and the resource cost of an imperfect maintenance is larger than that of a minimal one. The second term is the profit loss due to the unavailability of the machine caused by the maintenance action. $PL_{ij}$ and $PPL_{ij}$ are the immediate production loss and the future potential production loss respectively, and they will be discussed in the following sections.

Since this cost is a one-time cost, it is unfair to compare the costs of different maintenance types directly. The cost of a perfect maintenance action might be much higher than that of an imperfect maintenance action, but the former will allow the system to operate for a longer period of time before failure. Hence a cost rate function $CR_i$ is obtained by normalizing the cost by the expected value of the lifetime $Z_{ij}$ after the action, i.e.

$$CR_i(t) = \frac{C_{res} + c_p \left(PL_{ij} + PPL_{ij}\right)}{E[Z_{ij}]}, \quad t \in [t_{ij}, t_{i(j+1)}) \tag{8}$$

where $E[Z_{ij}]$ can be evaluated by the expectation of the (conditional) Weibull distribution.

$$E[Z_{ij}] = \exp\left(\left(\frac{g_{ij}}{\alpha_i}\right)^{\beta_i}\right) \int_0^{\infty} \exp\left(-\left(\frac{t^* + g_{ij}}{\alpha_i}\right)^{\beta_i}\right) dt^* \tag{9}$$

The real-time cost rate of the whole production line is

$$CR(t) = \sum_{i=1}^{M} CR_i(t) \tag{10}$$

This cost rate function is used as the control cost function to guide maintenance decision making.

B. Immediate production loss evaluation

Any maintenance action taken on a machine will require that machine to stop operating for a period of time. In most related works, a stoppage is directly counted toward unavailability time and integrated into the cost function. However, Chang et al. found that the stoppage incurs permanent production loss only if it impedes (blocks or starves) the last slowest machine in the line [12]. In other words, not all stoppages cause permanent production loss.

The largest possible stoppage time of machine $S_i$ that will not lead to permanent production loss is referred to as the opportunity window, denoted as $OW_i$.

$$OW_i(T_d) = \sup\left\{ d \geq 0 : \ \text{s.t.}\ \exists\, T^*(d),\ \int_0^{T} s_M(t)\,dt = \int_0^{T} \tilde{s}_M(t; \vec{e})\,dt,\ \forall\, T \geq T^*(d) \right\} \tag{11}$$

where $\int_0^{T} \tilde{s}_M(t; \vec{e})\,dt$ and $\int_0^{T} s_M(t)\,dt$ are the production volumes of the end-of-line machine $S_M$ at time $T$, with and without the disruption event $\vec{e} = (i, m, t, d)$, respectively. $T^*(d)$ signifies the potential dependency of $T^*$ on $d$.

Given a maintenance action $\vec{e}_{ij} = (i, m_{ij}, t_{ij}, d_{ij})$ taken on machine $S_i$, the immediate permanent production loss is

$$PL_{ij} = \max\left\{ \frac{d_{ij} - OW_i(t_{ij})}{T_{M^*}},\ 0 \right\} \tag{12}$$

where $d_{ij}$ is the duration of the maintenance action, $OW_i(t_{ij})$ is the opportunity window evaluated at time $t_{ij}$ when the action is taken, and $T_{M^*}$ is the cycle time of the slowest machine $S_{M^*}$.

C. Potential production loss evaluation

The size of the opportunity window heavily relies on the buffer levels (if upstream) or buffer vacancies (if downstream) between the machine of interest and the slowest machine. The opportunity window accumulates when the machine operates faster than the slowest machine, and shrinks when the machine stops while the slowest machine operates.

Given a maintenance action $\vec{e}_{ij}$ imposed on machine $S_i$, the stoppage will propagate to nearby machines sequentially since it will cause blockage or starvation. The opportunity windows of nearby machines gradually shrink. Considering subsequent random failures on these machines, the permanent production losses may be amplified by the reduced opportunity windows. In principle, these losses cannot be directly attributed to the initial action $\vec{e}_{ij}$, but $\vec{e}_{ij}$ does indirectly contribute to the losses of subsequent failures. Therefore, a maintenance action may alter the health status of the overall system to some extent. To incorporate this impact, we refer to the difference of future production losses with and without $\vec{e}_{ij}$ as the potential production loss, denoted as $PPL_{ij}$.

As the manufacturing system is highly nonlinear and stochastic, it is nearly impossible to have a precise expression for the potential production loss. Following Zou et al. [17], the production loss of the very first random failure event within a short look-ahead window $\Delta t$ is taken as an indicator.

$$PPL_{ij} = \eta \times FPL_{ij} \tag{13}$$

where $FPL_{ij}$ is the production loss incurred by the very first random failure of the whole line after the maintenance action $\vec{e}_{ij}$. A proper discount ratio $\eta$ can be given based on past experience.

Suppose that after time $t_{ij}$, the very first random failure $\vec{e}_k^*$ occurs on machine $S_k$ ($k = 1, 2, \ldots, M$) at time $t^*$ and a perfect corrective maintenance is taken. Since there is no other random failure between the events $\vec{e}_{ij}$ and $\vec{e}_k^*$, the buffer levels $\boldsymbol{b}(t_{ij} + t^*)$ can be exactly computed, and so can $OW_k(t_{ij} + t^*)$. Let $PL_k^*(t^*)$ denote the production loss incurred by $\vec{e}_k^*$, then

$$PL_k^*(t^*) = \max\left\{ \frac{d_{1c} - OW_k(t_{ij} + t^*)}{T_{M^*}},\ 0 \right\} \tag{14}$$

The probability associated with $PL_k^*(t^*)$ is the probability $p(k, t_{ij}, t^*)$ that the very first random failure occurs on machine $S_k$ at time $t^*$ after time $t_{ij}$. The machines are independent of each other regarding reliability. Therefore $p(k, t_{ij}, t^*)$ is the joint probability over the $M$ machines, i.e.

$$p(k, t_{ij}, t^*) = p_k(t_{ij}, t^*) \prod_{l=1,\, l \neq k}^{M} \left[ 1 - \int_0^{t^*} p_l(t_{ij}, \tau)\,d\tau \right] \tag{15}$$

Finally, $FPL_{ij}$ can be evaluated as the expected production loss.

$$FPL_{ij} = \sum_{k=1}^{M} \int_0^{\Delta t} PL_k^*(t^*)\, p(k, t_{ij}, t^*)\,dt^* \tag{16}$$

where $\Delta t$ is the look-ahead window.
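As an illustration of how Eqs. (8), (9) and (12) fit together, the following Python sketch evaluates the immediate production loss, the expected residual lifetime, and the resulting cost rate for a single action. It is only a sketch under stated assumptions: the function names are hypothetical, $g_{ij}$ is treated as the machine's effective age right after the action, the Weibull survival function is taken as $\exp(-(t/\alpha)^{\beta})$ with $\beta \geq 1$, the infinite integral in Eq. (9) is truncated and evaluated with Simpson's rule, and the opportunity window and potential production loss are supplied as plain numbers rather than computed from the line state.

```python
import math


def immediate_production_loss(duration, opportunity_window, slowest_cycle_time):
    """Eq. (12): parts permanently lost when the stoppage exceeds the window."""
    return max((duration - opportunity_window) / slowest_cycle_time, 0.0)


def expected_residual_life(age, alpha, beta, steps=20000):
    """Eq. (9): expected remaining life of a Weibull machine at effective age `age`.

    Assumption: beta >= 1 (increasing failure rate), so the integrand is below
    exp(-40) beyond alpha * 40**(1/beta) and the tail can be dropped.
    `steps` must be even for Simpson's rule.
    """
    if beta < 1 or steps % 2:
        raise ValueError("this sketch needs beta >= 1 and an even step count")
    offset = (age / alpha) ** beta
    upper = alpha * 40.0 ** (1.0 / beta)
    h = upper / steps

    def integrand(t):
        # exp((g/alpha)^beta) * exp(-((t+g)/alpha)^beta), merged for stability
        return math.exp(offset - ((t + age) / alpha) ** beta)

    total = integrand(0.0) + integrand(upper)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * integrand(k * h)
    return total * h / 3.0


def cost_rate(resource_cost, profit_per_part, pl, ppl, expected_life):
    """Eq. (8): one-time cost of Eq. (7) divided by the expected lifetime."""
    return (resource_cost + profit_per_part * (pl + ppl)) / expected_life


if __name__ == "__main__":
    # Illustrative numbers only: a 15 min action, a hypothetical 5 min
    # opportunity window, and a slowest cycle time of 1.0 min.
    pl = immediate_production_loss(15.0, 5.0, 1.0)
    life = expected_residual_life(age=0.0, alpha=900.0, beta=2.0)
    print(pl, life, cost_rate(120.0, 300.0, pl, 0.0, life))
```

For `age = 0` the integral reduces to the ordinary Weibull mean $\alpha\,\Gamma(1 + 1/\beta)$, which gives a simple sanity check on the numerical integration.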
IV. REAL-TIME MAINTENANCE CONTROL LAW

The maintenance cost heavily depends on the real-time state of the production line. However, it is nearly impossible to completely predict the state at any future time, so it is extremely difficult to find an optimal maintenance control policy for a global time horizon. A feasible proposal is to develop a control scheme in a real-time fashion to optimize the maintenance cost rate, thus obtaining a near-optimal maintenance schedule. Our control objective is to minimize the real-time cost rate, i.e.

$$\min \sum_{i=1}^{M} CR_i(t) \tag{17}$$

Let $CR_i^m(t)$, $m = 1p, 2p, 1c, 2c, 3c$, denote the cost rate assuming a maintenance action of type $m$ is taken on machine $S_i$ at time $t$. Then $CR_i^m(t)$ can be evaluated by inserting the corresponding maintenance action $\vec{e}_i^* = (i, m, t, d_m)$ into Eq. (8). The control law incorporates both CM and PM control procedures.

1) CM control procedure
If machine $S_i$ fails at time $t$, a CM action has to be imposed in order to resume the machine. The eligible maintenance types are $1c$, $2c$, and $3c$. The maintenance type chosen is the one that minimizes the real-time cost rate of machine $S_i$, i.e.

$$m_i = \arg\min \left\{ CR_i^m(t),\ m = 1c, 2c, 3c \right\} \tag{18}$$

2) PM control procedure
A small time step $\sigma$ is chosen, which is much smaller than the lifetimes of the machines. If machine $S_i$ is operational at the current step, then machine $S_i$ is eligible for a PM. The cost rate of PM is $CR_i^m(t)$, $m = 1p, 2p$.

We can also decide not to take any action on the machine. Then the machine may fail within the next step with probability $P_i(t, \sigma)$.

$$P_i(t, \sigma) = \int_0^{\sigma} p_i(t, t^*)\,dt^* \tag{19}$$

The future system state at time $t + \sigma$ is uncertain. In order to evaluate the production losses at that time, one can use the worst-case estimate of the opportunity window of machine $S_i$ at time $t + \sigma$, which is the situation where machine $S_i$ stops operating for $\sigma$ units of time. Thus, for a potential failure at the next time step, we can find the minimum of $CR_i^m(t + \sigma)$, $m = 1c, 2c, 3c$. Specifically, let $CR_i^0(t)$ denote the expected CM cost rate if we take no action at the current step and machine $S_i$ fails at the next step, then

$$CR_i^0(t) = P_i(t, \sigma) \cdot \min \left\{ CR_i^m(t + \sigma),\ m = 1c, 2c, 3c \right\} \tag{20}$$

The maintenance decision $m_i$ at the current step is

$$m_i = \arg\min \left\{ CR_i^m(t),\ m = 1p, 2p, 0 \right\} \tag{21}$$

V. CASE STUDY

In order to demonstrate the effectiveness of the real-time maintenance control law, a case study is presented to compare the overall system profit under three maintenance policies. The overall profit over the whole production horizon $T$ is roughly calculated as:

$$\text{Overall Profit} = c_p \cdot X_M(T) - \text{Maintenance Costs} \tag{22}$$

The production line used in this case study consists of 6 machines and 5 buffers (Fig. 2).

Figure 2. System structure of the serial production line

The system parameters are shown in Table I. The profit per part is assumed to be $c_p = \$300/\text{part}$. In this case study, the look-ahead window for future potential production loss estimation is $\Delta t = 100$ min and the discount ratio is selected as $\eta = 0.3$. The time step for deciding preventive maintenance is $\sigma = 25$ min. The case study is carried out in simulation, and the duration is four weeks, i.e. $T = 40320$ min. The maintenance parameters are presented in Table II.

TABLE I. PARAMETERS FOR THE PRODUCTION LINE

Machine                                  $S_1$   $S_2$   $S_3$   $S_4$   $S_5$   $S_6$
Cycle time $T_i$ (min)                   0.92    0.88    1.0     0.87    0.87    0.92
Initial age $v_i$ (min)                  100     400     20      300     0       50
Characteristic life $\alpha_i$ (min)     680     720     900     800     700     750
Shape parameter $\beta_i$                2       2       2       2       2       2

Buffer                                   $B_2$   $B_3$   $B_4$   $B_5$   $B_6$
Buffer capacity $B_i$                    20      20      20      20      20
Initial buffer level $b_i(0)$            2       3       5       8       2

TABLE II. PARAMETERS FOR MAINTENANCE ACTIONS

Maintenance type $m$                     $1p$    $2p$    $1c$    $2c$    $3c$
Resource cost $C_{res}$ (US dollars)     120     70      200     100     50
Duration $d_m$ (min)                     15      10      35      20      10
Improvement factor $q_m$                 0       0.7     0       0.7     1

Three policies are compared using simulation, i.e.

Policy 1: failure limit policy [18]
A perfect PM action ($1p$) will be imposed once the machine reaches a pre-determined failure rate threshold $\lambda$, and failures before that will be corrected with minimal CM ($3c$). The threshold is chosen as $\lambda = 1/400$.

Policy 2: empirical policy
Policy 2 mimics the common practice in current industry, where preventive maintenance is only conducted between shifts. Given a failure rate threshold $\lambda = 1/400$, a machine will receive perfect PM ($1p$) if it reaches the failure limit between shifts. If the machine fails during a shift, the maintenance action is perfect CM ($1c$) if the machine exceeds the failure limit threshold or minimal CM ($3c$) if it is under the threshold. The shift length is 8 hours and the duration of the PM break between shifts is 30 min.
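To make the PM control procedure of Section IV more concrete, the following Python sketch walks through Eqs. (19) to (21) for one machine at one decision step. It is a minimal illustration rather than the implementation used in the case study: the function names are hypothetical, the failure probability assumes a Weibull machine with survival function $\exp(-(t/\alpha)^{\beta})$ and a known effective age, and the candidate cost rates are passed in as precomputed numbers (in the control law they would come from Eq. (8)).

```python
import math


def failure_probability(age, sigma, alpha, beta):
    """Eq. (19) under a Weibull assumption: probability that a machine of
    effective age `age` fails within the next `sigma` time units."""
    return 1.0 - math.exp((age / alpha) ** beta - ((age + sigma) / alpha) ** beta)


def pm_decision(cr_pm, cr_cm_next, p_fail):
    """Eqs. (20) and (21): choose among the PM types and the no-action option "0".

    cr_pm      -- dict of PM cost rates at time t, e.g. {"1p": ..., "2p": ...}
    cr_cm_next -- dict of CM cost rates at time t + sigma (worst-case window)
    p_fail     -- failure probability over the next step, Eq. (19)
    Returns the chosen option and the no-action cost rate.
    """
    cr_no_action = p_fail * min(cr_cm_next.values())  # Eq. (20)
    candidates = dict(cr_pm)
    candidates["0"] = cr_no_action
    choice = min(candidates, key=candidates.get)       # Eq. (21)
    return choice, cr_no_action


if __name__ == "__main__":
    # alpha, beta and sigma follow Table I (machine S5) and the case-study
    # time step; the cost rates below are made-up numbers for illustration.
    p = failure_probability(age=0.0, sigma=25.0, alpha=700.0, beta=2.0)
    pm_rates = {"1p": 4.0, "2p": 3.5}
    cm_rates_next = {"1c": 6.0, "2c": 5.0, "3c": 5.5}
    print(p, pm_decision(pm_rates, cm_rates_next, p))
```

Because the no-action cost rate is scaled by the failure probability, a young machine with a small $P_i(t, \sigma)$ will typically be left running, and PM becomes attractive only as the failure probability grows.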
