A Practical MTBF Estimate For PCB Design Considering Component and Non-Component Failures
SUMMARY & CONCLUSIONS

Accurate reliability prediction of MTBF (mean time between failures) is always desirable before a new product is ramped up for customer shipment. In reality it is often difficult to obtain an accurate MTBF estimate for a new product because of limited testing time in the pilot line and limited field failure data. In this paper, a practical reliability prediction model is presented for predicting the MTBF of a PCB (printed circuit board) in the design phase. Unlike conventional reliability prediction models, which usually focus on parts failure rates, the method presented here incorporates not only component failures but also non-component failures, which include design, software, manufacturing and process issues. Component failure rates are computed using either historical data or the nominal failure rates together with operating conditions such as temperature [...] to the design of a DC/analog instrument board that is used in semiconductor testing equipment.

1. INTRODUCTION

Design for reliability in electronic products has been practiced since the introduction of reliability prediction tools in the 1960s. The two most widely used tools are Mil-HDBK-217 and Bellcore/Telcordia TR-332. Based on these two standards, various commercial software applications were implemented to facilitate the estimation of product reliability. Like the Mil-HDBK and Bellcore standards, most conventional MTBF prediction methods are based on component failure rates and the bill of materials (BOM) of the product. In this paper the words PCB and product are used interchangeably, as are component and part.

An MTBF estimated from component failure rates alone is generally quite optimistic. In field operation a PCB could fail due to defective components, non-component issues, or both. This is particularly true for a new product during the early introduction phase. These non-component failures include design errors, poor solder joints, software bugs and process-related problems. For some products, non-component failures even dominate the failure rate early in the product life. The competitive electronics market, on the other hand, requires a new product to be released to customers in the shortest time frame, leaving no time for design engineers to eliminate all potential failures in the pilot line, especially non-component problems (e.g. software bugs). Therefore PCBs in the field exhibit lower reliability than the high MTBF that is estimated only from component failure rates.

[Figure 1 chart: "Failures Breakdown by Root-Cause Category". Bars for PCB-A, PCB-B and PCB-C over the categories Software, Design, Update, Component, Process and Others; vertical axis in percent of quantity.]
Figure 1 PCB failures breakdown by root-cause category

Figure 1 presents the failure data from three different PCB product lines. Data are broken down by root-cause category: component, design, manufacturing, process, software and others. Data were collected from field returns during the first year after first customer shipment. An interesting observation is that component failures, though the largest bar in the Pareto chart, contribute less than 45% of failures for all three types of products. For PCB-B, component failures account for only 25% of field failures. The second largest category is Others (i.e. No-Failure-Found or NFF), which averages 25% of all field returns. An NFF is a failure that happened at the customer site but cannot be duplicated at the factory repair center. Figure 1 indicates that if non-component failures are ignored, the resulting prediction would be two to four times higher than the actual MTBF in the field.

Authorized licensed use limited to: University of Pisa. Downloaded on January 15,2024 at 11:39:55 UTC from IEEE Xplore. Restrictions apply.
1-4244-0008-2/06/$20.00 (C) 2006 IEEE

As the component fabrication process continues to
improve, component failure rates have steadily declined over the years to the point where non-component failure sources dominate the failure rate of a PCB. Figure 2 presents the MTBF run chart over time for PCB-A of Figure 1. The reliability target for this product is 40,000 hours MTBF. Based on the conventional prediction model, the target could be reached in week 41 if only component failure rates are considered. The actual MTBF is the smooth line that lies below the prediction curve. This product in fact takes 57 weeks to reach the target MTBF, after key non-component failure modes are removed through corrective actions on design, software, manufacturing, and handling.

[Figure 2 chart: "MTBF Run Chart for a PCB Product". Vertical axis: MTBF (hours). Curves: component failures only; comp, design, software failures; all failures.]

[...] the expected target value, a more accurate, yet practical, reliability estimate should be developed such that it not only incorporates component failures but also considers non-component issues due to design, manufacturing, software, process, etc. Very few papers have been published that link PCB failures with design, manufacturing, software and process issues. Gullo [2] and Johnson and Gullo [3] described an in-service reliability prediction tool, HIRAP, developed by Honeywell engineers. Using a top-down approach, HIRAP breaks failures into seven categories using similarity coefficients between the predecessor products and the new design. Categories 1-5 consist of broad component types that have historically demonstrated similar failure rates. Categories 6 and 7 are used to address process, manufacturing and design errors. If historical failure data are well maintained and the new product development is evolutionary, not revolutionary, HIRAP is a convenient and accurate tool to predict the new product failure rate or MTBF in the design phase.
[...] component and non-component failure rates. For components used in a PCB, some have known failure rates because they can be derived from historical field data. Others may not have explicit failure rates because they are new and used for the first time in the design. Unlike component failure rates, non-component failure rates are more project-specific. For instance, design problems are related to the design experience of the engineers and the complexity of the product. In manufacturing, solder joint defects are correlated with the number of solder joints on the PCB and the reflow temperature profile. The larger the number of solder joints on the PCB, the more likely poor solder joints become. In the following paragraphs, failure rates for components and non-components are derived based on the availability of historical data, component suppliers' inputs, temperature and stress variations, and root-cause categories.

3.1 Components with Historical Data

Let $\lambda$ denote the failure rate for a particular component type used in the product. If historical failure data such as cumulative hours and the number of defects are available, the failure rate can be directly estimated by

$$\lambda = \frac{\text{total faults}}{\text{cumulative hours}} \tag{1}$$

For example, assume a type of 5V relay is used in an existing PCB product, say PCB-A. Each PCB-A uses 20 relays and a total of 150 PCBs are installed in the field. Assume that in the past year seven boards were returned from the field, and four of them were diagnosed with bad relays, with one defective relay on each board (the other three boards failed due to non-component causes). Also assuming that each board operates 8760 hours a year, the failure rate for this type of relay, denoted $\lambda_r$, can be estimated as

$$\lambda_r = \frac{4}{8760 \times 20 \times 150} = 1.5 \times 10^{-7} \text{ faults/hour} \tag{2}$$

For a new product under design, say PCB-B, which uses 30 of these 5V relays, the failure rate for this type of relay can be taken as $1.5 \times 10^{-7}$ faults/hour if the operating conditions are similar to those of the predecessor product. The typical lifetime of semiconductor test equipment is only five to seven years because of technology obsolescence, so component failure rates can be treated as time-independent before the components enter the wear-out phase.

3.2 Components without Historical Data

If components are new with little failure data available, [...] derating factor. The temperature factor for an electronic component is usually modeled by the Arrhenius equation as follows,

$$\pi(T) = \pi_T = \exp\left[\frac{E_a}{k}\left(\frac{1}{T_0} - \frac{1}{T}\right)\right] \tag{4}$$

where
$E_a$ = activation energy (eV)
$k$ = Boltzmann constant ($8.62 \times 10^{-5}$ eV/K)
$T_0$ = reference temperature (313 K)
$T$ = operating ambient temperature (K)

Similarly, the electrical stress or derating factor $\pi_E$ can be estimated by the following equation,

$$\pi(p) = \pi_E = e^{m(p - p_0)} \tag{5}$$

where
$p$ = actual applied electrical stress percentage
$p_0$ = reference stress, equal to 50%
$m$ = fitting parameter

Equation 5 is used to model the electrical factor given in Telcordia TR-332 [6]. The fitting value $m$ ranges from 0.006 to 0.059. For details users can refer to [6]. It is very common that the same type of component is used repeatedly on one board. For example, a 15V ceramic capacitor can be used in multiple places within a board. When the ambient temperatures of the components do not vary by more than a few degrees, a single temperature reading $T$ or the average temperature can be substituted into equation 4 to estimate $\pi_T$. In reality, ambient temperatures may vary over a range of 10 to 30 degrees among components of the same type [5]. The average temperature is then not sufficient to estimate $\pi_T$. To quantify the uncertainty of the component failure rate caused by temperature variation, $\pi_T$ should be treated as a random variable. The mean and the variance of $\pi_T$ can be estimated as follows when the temperature distribution is known,

$$E[\pi_T] = \int_0^{+\infty} \exp\left[\frac{E_a}{k}\left(\frac{1}{T_0} - \frac{1}{x}\right)\right] f_T(x)\,dx \tag{6}$$

$$E[\pi_T^2] = \int_0^{+\infty} \exp\left[\frac{2E_a}{k}\left(\frac{1}{T_0} - \frac{1}{x}\right)\right] f_T(x)\,dx \tag{7}$$

$$\mathrm{Var}(\pi_T) = E[\pi_T^2] - \left(E[\pi_T]\right)^2 \tag{8}$$

where $f_T(x)$ = probability density function (pdf) of $T$.

[Figure 3 chart: "ASIC Temperature Distribution". Histogram (left axis: Quantity) overlaid with the fitted pdf (right axis).]
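The relay example of equation (2) and the temperature-factor moments of equations (6)-(8) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a normal temperature pdf with the ASIC mean of 76.3 °C and standard deviation of 5.7 °C reported for Figure 3, truncates the integral to six standard deviations around the mean, and uses an activation energy of 0.7 eV that is purely illustrative. All function names are hypothetical.

```python
import math

K_BOLTZMANN = 8.62e-5  # eV/K, value quoted with equation (4)
T0_KELVIN = 313.0      # reference temperature T0 from equation (4)


def field_failure_rate(faults, hours_per_board, parts_per_board, boards):
    """Equation (1): faults divided by cumulative part-hours."""
    return faults / (hours_per_board * parts_per_board * boards)


def arrhenius_factor(temp_kelvin, ea_ev):
    """Equation (4): temperature factor relative to T0."""
    return math.exp(ea_ev / K_BOLTZMANN * (1.0 / T0_KELVIN - 1.0 / temp_kelvin))


def normal_pdf(x, mean, std):
    """Assumed temperature pdf f_T(x); the normal shape is an assumption."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def temperature_factor_moments(mean_kelvin, std_kelvin, ea_ev, steps=4000):
    """Equations (6)-(8) by the midpoint rule over mean +/- 6 std."""
    low = mean_kelvin - 6.0 * std_kelvin
    width = 12.0 * std_kelvin / steps
    first = second = 0.0
    for i in range(steps):
        x = low + (i + 0.5) * width
        weight = normal_pdf(x, mean_kelvin, std_kelvin) * width
        factor = arrhenius_factor(x, ea_ev)
        first += factor * weight
        second += factor * factor * weight
    return first, second - first * first


# Relay example from Section 3.1: 4 faults, 150 boards, 20 relays each.
relay_rate = field_failure_rate(4, 8760, 20, 150)

# ASIC profile; Ea = 0.7 eV is an illustrative assumption, not from the paper.
asic_mean_k = 76.3 + 273.15
point_factor = arrhenius_factor(asic_mean_k, ea_ev=0.7)
mean_factor, var_factor = temperature_factor_moments(asic_mean_k, 5.7, ea_ev=0.7)

print(f"relay failure rate: {relay_rate:.3e} faults/hour")
print(f"pi_T at mean temperature: {point_factor:.2f}")
print(f"E[pi_T]: {mean_factor:.2f}, Var(pi_T): {var_factor:.2f}")
```

Because the Arrhenius factor is convex in temperature over this range, the expected factor comes out larger than the factor evaluated at the mean temperature, which is the reason the text treats $\pi_T$ as a random variable instead of plugging in the average reading.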
Temperature distribution for components can be measured using thermometers with the probe wire placed close to the component surface. Boards need to be warmed up and operated under normal conditions, as if used in the field. Multiple readings are recommended for each component so that outliers can be removed. Figure 3 plots the temperature profile for a type of ASIC (application-specific IC) measured on one PCB. This chart was originally presented in [5] and is reproduced here. It shows that device temperature varies from 65 °C to 90 °C. Components operating above 80 °C are expected to have much higher failure rates than those operating at lower temperatures. The temperature profile can be approximated by a normal distribution with a mean of 76.3 °C and a standard deviation of 5.7 °C.

Similarly, for the same type of component used in a PCB, the electrical derating $p$ may not be the same for all instances. It may vary from one component to another depending on the circuit requirements and functionality. It is therefore more reasonable to treat $p$ as a random variable whose value is uniformly distributed within an interval. The probability density function for $p$ can be described as follows,

$$f_p(x) = \begin{cases} \dfrac{1}{v - u} & u \le x \le v \\ 0 & \text{otherwise} \end{cases} \tag{9}$$

Here $u$ and $v$ represent the minimum and the maximum derating percentages. The mean and the variance of $\pi_E$ can be estimated by

$$E[\pi_E] = \int_u^v e^{m(x - p_0)} f_p(x)\,dx = \frac{e^{m(v - p_0)} - e^{m(u - p_0)}}{m(v - u)} \tag{10}$$

$$E[\pi_E^2] = \int_u^v e^{2m(x - p_0)} f_p(x)\,dx = \frac{e^{2m(v - p_0)} - e^{2m(u - p_0)}}{2m(v - u)} \tag{11}$$

$$\mathrm{Var}[\pi_E] = E[\pi_E^2] - \left(E[\pi_E]\right)^2 \tag{12}$$

If the component derating profile is better fit by another distribution, $f_p(x)$ in equations (10) and (11) can be adjusted accordingly. Component derating percentages can be obtained from design engineers after the product's BOM is determined. Throughout the paper, the uniform distribution is used because it requires no specific information about $p$ while still capturing the derating variation. Finally, the mean and variance of the component failure rate can be computed as

$$E[\lambda] = \lambda_0 E[\pi_T] E[\pi_E] \tag{13}$$

$$\mathrm{var}(\lambda) = \lambda_0^2 \left( E[\pi_T^2] E[\pi_E^2] - (E[\pi_T])^2 (E[\pi_E])^2 \right) \tag{14}$$

3.3 Non-component Failure Rate Estimate

Non-component failure rates are usually more difficult to estimate than component failure rates. They are determined by various factors, some of which are human-related. For example, the design and software failure rates are correlated with the design experience and the similarity to the predecessor product. The manufacturing failure rate is more likely associated with the complexity of the board, production experience, etc. The definition of the non-component failure rate $\lambda$ is similar to that of the component failure rate, which is

$$\lambda = \frac{\text{non-component faults}}{\text{cumulative hours}} \tag{15}$$

Taking the same example as in section 3.1, if the three remaining failing boards returned from the field are root-caused to cold solder joints, then the manufacturing failure rate, denoted $\lambda_m$, is

$$\lambda_m = \frac{3}{8760 \times 150} = 2.283 \times 10^{-6} \text{ faults/hour} \tag{16}$$

By the same token, equation 15 can be used to compute other types of non-component failure rates such as design errors, software bugs and process issues. Unlike component failures, whose failure modes repeat across different product lines, non-component failures usually do not show such a high degree of repetition. For example, most design errors and software bugs may be unique to one product. In other words, failures that occurred on predecessor products may not recur in the new design. Therefore the useful dataset for non-component failures is scarce when applied to the analysis of a new design. The triangular distribution is often used as a subjective model of a population for which there is only limited failure data or no data available [7]. The triangular distribution for $\lambda$ is defined by $a$, the smallest possible value of the failure rate, $b$, the largest possible value, and $c$, the mode or most likely value. The probability density function of the unknown failure rate $\lambda$ is given in Figure 4.

[Figure 4 chart: triangular pdf $g(\lambda)$ plotted against $\lambda$, with peak height $h$ at the mode $c$ and upper limit $b$.]
Figure 4 Triangular Distribution for Failure Rate λ

The mathematical expression for $g(\lambda)$ in Figure 4 is as follows:

$$g(\lambda) = \begin{cases} \dfrac{2(\lambda - a)}{(c - a)(b - a)} & a \le \lambda \le c \\ \dfrac{2(\lambda - b)}{(c - b)(b - a)} & c < \lambda \le b \\ 0 & \text{otherwise} \end{cases} \tag{17}$$

Parameters for the triangular distribution can be derived from the dataset that it is intended to describe or model. Provided the dataset does not contain any anomalous points, the minimum and maximum can be obtained by sorting the values in ascending order and selecting the first and last points as $a$ and $b$. Then the mode can be estimated by

$$c = 3\bar{\lambda} - b - a \tag{18}$$

Here $\bar{\lambda}$ is the sample mean of the dataset, i.e. the average failure rate. For example, to estimate the manufacturing failure rate of a new product PCB-D, we find three predecessor products that are close to PCB-D in terms of component usage, manufacturing process, and total defect opportunities per board. Historical data show that the manufacturing failure rates for these three products are $1.2 \times 10^{-6}$, $1.4 \times 10^{-6}$ and $2.4 \times 10^{-6}$. Then the sample mean $\bar{\lambda} = (1.2 \times 10^{-6} + 1.4 \times 10^{-6} + 2.3 \times 10^{-6})/3 = 1.6 \times 10^{-6}$. The min $a = 1.2 \times 10^{-6}$, the max $b = 2.4 \times 10^{-6}$, and the mode $c = 1.3 \times 10^{-6}$. The mean and the variance of the failure rate $\lambda$ can be obtained as

$$E[\lambda] = \int_a^b x\,g(x)\,dx \tag{19}$$

$$E[\lambda^2] = \int_a^b x^2 g(x)\,dx \tag{20}$$

[...] Sigma criteria can be used to calculate the confidence levels for MTBF. Based on the Six-Sigma criterion, the maximum achievable MTBF with 99.7% confidence can be estimated by

$$\Pr\left\{ \lambda_{PCB} \le \frac{1}{MTBF} \right\} \ge 99.7\% \tag{24}$$

Quite often the confidence levels of the board MTBF are desired before volume manufacturing. The lower bound of MTBF can be improved by using components with lower failure rates, by reducing the variability of failure rates (temperature and derating controls), or by enhancing the design, software testing, manufacturing and process.
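The triangular model of equations (17)-(20) and the manufacturing rate of equation (16) can be sketched in a few lines. This is an illustrative check only: the closed-form mean (a + b + c)/3 and variance (a² + b² + c² - ab - ac - bc)/18 are the standard results for a triangular distribution and are not quoted in the paper, and the parameter values a, b and c are the ones stated in the PCB-D example. Function names are hypothetical.

```python
def triangular_pdf(x, a, b, c):
    """Equation (17): density with minimum a, maximum b and mode c (a < c < b)."""
    if a <= x <= c:
        return 2.0 * (x - a) / ((c - a) * (b - a))
    if c < x <= b:
        return 2.0 * (x - b) / ((c - b) * (b - a))
    return 0.0


def triangular_moments(a, b, c):
    """Closed-form mean and variance of the triangular distribution."""
    mean = (a + b + c) / 3.0
    variance = (a * a + b * b + c * c - a * b - a * c - b * c) / 18.0
    return mean, variance


def numeric_moments(a, b, c, steps=20000):
    """Equations (19) and (20) evaluated with the midpoint rule."""
    width = (b - a) / steps
    first = second = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * width
        weight = triangular_pdf(x, a, b, c) * width
        first += x * weight
        second += x * x * weight
    return first, second - first * first


# Equation (16): three cold-solder returns over 150 boards running 8760 h each.
manufacturing_rate = 3 / (8760 * 150)

# Parameters as stated in the PCB-D example.
a, b, c = 1.2e-6, 2.4e-6, 1.3e-6
mean_closed, var_closed = triangular_moments(a, b, c)
mean_numeric, var_numeric = numeric_moments(a, b, c)

print(f"lambda_m = {manufacturing_rate:.3e} faults/hour")
print(f"E[lambda] = {mean_closed:.3e}, var(lambda) = {var_closed:.3e}")
```

The numerical integration is included only to confirm that the piecewise density of equation (17) agrees with the closed-form moments; in practice the closed form is sufficient.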
[...] gives $E[\lambda_{PCB}] = 1.01 \times 10^{-5}$ and $\mathrm{var}(\lambda_{PCB}) = 2.88 \times 10^{-12}$. Substituting the new mean and variance into equation 24 again yields a lower MTBF bound of 67,600 hours, which exceeds the target by 2,600 hours. In practice, instead of attacking one bucket of issues, design engineers usually work with cross-functional teams to resolve multiple issues covering the whole failure spectrum.

5. CONCLUSIONS

In the semiconductor testing equipment industry, fast time-to-market is essential for gaining market share, while high reliability keeps the product competitive in the long term. To maintain the competitiveness of new products, reliability must be designed in during the early prototype phase, when the product is still in the concept development stage. Both component and non-component risks need to be addressed and appropriately mitigated by identifying and eliminating high-risk items. This might create extra workload through design for reliability in the pilot line and reliability testing, but the final reward is high when we ramp up and ship high-reliability products to our customers. This means tremendous cost savings and makes the new product more competitive than those of our competitors. Meanwhile, engineering staff can be transferred to new product design without the need to allocate extra resources to monitor and improve the reliability of existing products.
1 1MB EPROM 14 37.4 1.4 43.7 1.7 60 80 0.024 1.90 0.53 1.63 1.04
2 0.2W 1% Resistor 32 3.33 1.2 47.6 8.3 50 70 0.019 5.08 7.11 1.22 0.53
3 FPGA 26 6.1 0.7 56.2 4.1 60 85 0.024 3.74 1.16 1.74 1.18
4 Switch Transistor 26 5.81 1.2 51.8 2.8 60 75 0.024 5.37 2.03 1.53 0.91
5 Op Amp 3 9.00 1.2 46.6 5.2 55 75 0.024 3.17 2.46 1.45 0.83
6 Clock Driver 14 0.823 1.2 65.1 1.9 65 80 0.024 27.86 6.51 1.73 1.13
7 Switch Diode 4 11.25 1.0 63.2 6.4 60 80 0.029 15.74 11.08 1.81 1.25
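The header of the component table above did not survive, so the following sketch rests on an assumption about its columns: that each row lists, among other values, the derating limits u and v (in percent), the fitting parameter m, and the resulting E[π_E]. Under that reading, equations (10)-(12) with p0 = 50% can be evaluated directly and compared against the second-to-last value in a few rows. The function name and the choice of rows are illustrative.

```python
import math

P0 = 50.0  # reference stress percentage p0 from equation (5)


def derating_moments(u, v, m, p0=P0):
    """Equations (10)-(12) for p uniformly distributed on [u, v]."""
    span = m * (v - u)
    first = (math.exp(m * (v - p0)) - math.exp(m * (u - p0))) / span
    second = (math.exp(2 * m * (v - p0)) - math.exp(2 * m * (u - p0))) / (2 * span)
    return first, second - first * first


# (u, v, m, tabulated value assumed to be E[pi_E]); column meaning is an assumption.
rows = {
    "1MB EPROM": (60, 80, 0.024, 1.63),
    "0.2W 1% Resistor": (50, 70, 0.019, 1.22),
    "Switch Diode": (60, 80, 0.029, 1.81),
}

results = {}
for name, (u, v, m, tabulated) in rows.items():
    mean_pi_e, var_pi_e = derating_moments(u, v, m)
    results[name] = (mean_pi_e, var_pi_e, tabulated)
    print(f"{name}: E[pi_E] = {mean_pi_e:.3f} (table: {tabulated})")
```

The computed means agree with the tabulated values to two decimals for these rows, which supports the assumed column reading for u, v, m and E[π_E]; the meaning of the remaining columns is not verified here.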
Table 4 Mean and Variance of Failure Rates for Components, Non-Components and New PCB

Root-Cause  Components                 Non-component                                     New PCB
Category    New comps  Existing comps  Design    Software  Mfg       Process   Others
mean        6.52E-6    2.42E-6         4.76E-7   9.00E-8   2.45E-7   3.62E-7   6.41E-7   1.08E-5
variance    2.42E-12   0               2.47E-13  8.66E-15  6.07E-14  1.44E-13  4.16E-13  3.30E-12
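The New PCB column of Table 4 can be reproduced by adding the category means and variances, and the bound of equation (24) can then be approximated. The sketch below makes two assumptions that are not confirmed by the surviving text: that the categories are independent (so variances add) and that the board failure rate is approximately normal, so the 99.7% level corresponds to a one-sided normal quantile. The function names are hypothetical.

```python
from statistics import NormalDist

# Category means and variances copied from Table 4 (faults/hour and its square).
means = {
    "New comps": 6.52e-6, "Existing comps": 2.42e-6, "Design": 4.76e-7,
    "Software": 9.00e-8, "Mfg": 2.45e-7, "Process": 3.62e-7, "Others": 6.41e-7,
}
variances = {
    "New comps": 2.42e-12, "Existing comps": 0.0, "Design": 2.47e-13,
    "Software": 8.66e-15, "Mfg": 6.07e-14, "Process": 1.44e-13, "Others": 4.16e-13,
}


def board_totals(skip=()):
    """Sum means and variances over categories (independence is assumed)."""
    mean = sum(value for name, value in means.items() if name not in skip)
    variance = sum(value for name, value in variances.items() if name not in skip)
    return mean, variance


def mtbf_lower_bound(mean, variance, confidence=0.997):
    """Equation (24) under an assumed normal approximation for lambda_PCB."""
    z = NormalDist().inv_cdf(confidence)
    return 1.0 / (mean + z * variance ** 0.5)


total_mean, total_var = board_totals()
bound_all = mtbf_lower_bound(total_mean, total_var)

# Dropping the Others column reproduces the 1.01E-5 and 2.88E-12 quoted in Section 4.
reduced_mean, reduced_var = board_totals(skip=("Others",))
bound_reduced = mtbf_lower_bound(reduced_mean, reduced_var)

print(f"all categories: mean={total_mean:.3e}, var={total_var:.3e}, MTBF>={bound_all:,.0f} h")
print(f"without Others: mean={reduced_mean:.3e}, var={reduced_var:.3e}, MTBF>={bound_reduced:,.0f} h")
```

Under these assumptions the bound with all categories falls somewhat below 65,000 hours and the bound without the Others category lands close to the 67,600 hours quoted in the text; the small difference is consistent with rounding and with the approximation used here.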