How Reliable Is Your Product?
Book Excerpt
50 Ways to Improve Product Reliability
By Mike Silverman
Foreword by Patrick O'Connor
E-mail: [email protected]
20660 Stevens Creek Blvd., Suite 210
Cupertino, CA 95014
Contents
NOTE: This is the Table of Contents (TOC) from the book for your reference. The eBook TOC (below) differs in page count from the tradebook TOC.

Foreword by Patrick O'Connor . . . . 1
Preface: Why Am I Writing This Book? . . . . 3

Part I      Introduction . . . . 5

Chapter 1   Guidelines, Not Rules . . . . 7
Chapter 2   What is Design for Reliability (DFR)? . . . . 9
Chapter 3   Reliability Integration Provides Integrity . . . . 11
Chapter 4   Balance between In-House and Outside Help for Your Program . . . . 15
Lessons Learned: Sounds Like School, Only Much Better . . . . 263
Let's Take a Look at Our Warranty Returns . . . . 267
Management Needs a Report . . . . 271
Statistics: More Than a Four-Letter Word . . . . 275
When Faced With Obsolete Parts, Turn To EDA . . . . 279

Software Reliability Growth . . . . 289
Summary of HALT Results at a HALT Lab . . . . 295
References . . . . 305
Glossary of Terms . . . . 307
Acronyms . . . . 335
About the Author . . . . 341
Other Happy About Books . . . . 349
Foreword
P. Drucker, W.A. Shewhart, W.E. Deming, and J.M. Juran. The Japanese also applied methods for design for reliability, notably Quality Function Deployment (QFD) and failure modes and effects analysis (FMEA).

By the turn of the century, methods of design for reliability and for manufacturing quality excellence had become refined. Most of the U.S. military standards were discontinued. More practical and effective methods were applied almost universally, particularly by industries whose products faced international competition or other drivers, such as high costs of failure or strict customer requirements. However, some still cling to unrealistic mathematical precision for predicting and measuring reliability, as well as to bureaucratic approaches to quality management.

In the same time frame, there have been improvements in design capabilities with advances in computer-aided engineering, as well as in materials and in manufacturing processes. We have seen dramatic improvements in the reliability of products as diverse as automobiles, telecommunications, domestic equipment, and spacecraft. How many readers have experienced a failure of a microprocessor or an automobile engine?

I am pleased to endorse and recommend this new book. Mike Silverman presents a wealth of practical, experience-based wisdom in a way that is easy to read and apply. He has avoided detailed descriptions of methods, emphasizing instead the management and team aspects of applying cost-effective reliability improvement tools in ways that work. The main methods he covers include reliability planning, design techniques (FMEA, fault tree analysis), test (particularly highly accelerated life test, or HALT), and design of experiments, as well as methods for reliability prediction, stress derating, vendor reliability, failure reporting and analysis, and others. The whole product life cycle is considered, from initial design through prototype test, manufacturing, and field service, to obsolescence. He emphasizes the need for integration of reliability efforts to ensure their effective application. The fifty chapters all include brief case histories that illustrate this.

I recommend the book as an excellent guide for engineering project managers and their teams, as well as for reliability specialists. It demystifies the sometimes difficult methods and helps specialists to communicate with managers, designers, and other engineers. It will make your products more reliable.

Patrick O'Connor
November 2010
Preface
Eye" collects a bit more information and raises its probability to 51%. The President then decides to take action based on this and authorizes an attack on the alleged terrorist group. What is that really telling us? In fact, a probability of 51% means that there is a 49% chance that the conclusion is incorrect based on the data. Likewise, with reliability tests, you need to make decisions based on test data from a sample of the population. You will never have enough data to be 100% certain of any decision, so you should gain as much confidence as you can with the time and money that you have. That is the art of reliability testing. I structured the book in 50 easy-to-read chapters. Each chapter has some background on the reliability technique, its usefulness, and in some cases, its limitations. In addition, when applicable, I compare the technique in question to other techniques to show you when to use which technique. Starting in Chapter 3, I introduce the topic of Reliability Integration, and for each chapter onwards, I comment on how you can use the concept of Reliability Integration with that particular technique. I will talk a lot about Reliability Integration. It is one of the most valuable takeaways from this book. In each chapter, I will provide one or more case studies from clients we have worked with and discuss how we utilized the specific technique in question. I didn't use the names of people or companies, but all of the case studies are real. Tips on how to best use this book: If a phrase is highlighted in bold italics, that means the term is a main technique of that chapter and is in the table of contents as well as the index. I will also capitalize the phrase throughout the rest of the book as an indication that it is an important technique. For all other important terms, check the index for other places I have used the same term. I included a guide to acronyms. The field of reliability uses a lot of acronyms and I know how frustrating it can be reading a book filled with them. I included a glossary of terms. If you feel I missed something or you have information to add to a particular topic, I'd love to hear from you. I hope you enjoy my book.
Chapter 3: Reliability Integration Provides Integrity
Reliability Integration is the process of seamlessly and cohesively integrating reliability techniques to maximize reliability at the lowest possible cost. This means you should think of your reliability program as a set of techniques that are used together, rather than as a collection of individual activities. You are building a system, and a system is made up of different components and assemblies; there are different disciplines involved (some of the main ones are electrical, mechanical, software, firmware, optical, and chemical). All of the individual pieces make up the system, so don't forget about the interactions, and make sure that you think about reliability from a system perspective. In Figure 3.1, we illustrate this point using the disciplines of electrical, mechanical, and software.
This is especially true of the software and hardware disciplines. Most companies work on Software Reliability and Hardware Reliability separately and don't integrate the two; when failures occur, the result is finger-pointing rather than synergy. The same holds for the electrical and mechanical disciplines. We see more synergy between these two groups during programs than between software and hardware; however, at the beginning, they rarely get together to discuss common reliability goals and how to apportion them down to each major area of the system. Product development teams view reliability within each of the separate sub-domains of mechanical, electrical, and software issues. Your customers, by contrast, view reliability as a system-level issue, with minimal concern for the distinction between mechanical, electrical, and software. Your customer wants the whole product and all its parts to work together perfectly. Since the primary measure of reliability is made by your customer and their end users, engineering teams should maintain a balance of both views (system and sub-domain) in order to develop a reliable product.
Figure 3.2: System Reliability versus Cost (program and warranty costs plotted against reliability)

In Figure 3.2, it is evident that:

1. Program costs go up as you spend more on reliability. At a certain point, you won't get your return on investment (ROI) because the reliability has reached a point where it is becoming increasingly more difficult to improve. That is why it is important to know what the goal is; it can be just as detrimental to your company to produce a product that is too reliable as one that is not reliable enough. The product that is too reliable usually comes with increased costs; your customers may not need this level of reliability and will opt for the less expensive product. When was the last time you purchased a $200 blender or toaster?
2. Warranty costs go up as reliability goes down.
3. Software has no associated manufacturing costs (other than perhaps the cost of CDs and manuals and the cost of personnel to test the product in production), so warranty costs and savings are almost entirely allocated to hardware. If there is no cost savings associated with improving Software Reliability, why not leave it as is and focus on improving hardware reliability to save money? You shouldn't do this for two reasons:
   a. Our experience is that for typical systems, software failures outnumber hardware failures by a ratio of 10:1 (see Section 31.1 for more details). Customers buy integrated systems, not just hardware.
   b. The benefits of a Software Reliability program aren't in direct cost savings. Instead, the benefits are in:
      i. Increased software/firmware staff availability with reduced operational schedules, resulting in fewer corrective maintenance events.
      ii. Increased customer goodwill based on improved customer satisfaction.
CASE STUDY: Linking Electrical, Mechanical, and Software Reliability Together

We were working with a semiconductor equipment company to help improve reliability on their next generation product. First, we provided a Design for Reliability (DFR) seminar for each of the three disciplines: the electrical group, the mechanical group, and the software group. Then, we met with the electrical, mechanical, and software team leads and developed reliability goals. We started with high-level system goals and then apportioned the goals down to each subsystem: electrical, mechanical, and software. Each group lead then took the goal for his subsystem and broke it down further within his area. We worked with each group lead to put together a reliability program plan to meet his subsystem goals, and we rolled each of these subsystem plans into an overall reliability plan for the product. We worked with each group lead throughout the product development process to ensure he was on track for meeting his subsystem goals. The end result was that our client was able to achieve the reliability goals for each subsystem and for the system as a whole.
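To make the goal-apportionment step concrete, here is a minimal sketch in Python. It is not from the book or the client engagement: the MTBF goal and the weights are hypothetical, and it assumes a simple series-system model in which subsystem failure rates add up to the system failure rate.

```python
# Hypothetical apportionment of a system failure-rate goal to subsystems.
# For a series system, subsystem failure rates add: lambda_sys = sum(lambda_i).
SYSTEM_MTBF_GOAL_HOURS = 10_000.0             # assumed system-level MTBF goal
lambda_system = 1.0 / SYSTEM_MTBF_GOAL_HOURS  # system failure-rate goal (failures/hour)

# Assumed weights reflecting the relative risk of each discipline:
weights = {"electrical": 0.4, "mechanical": 0.3, "software": 0.3}

for subsystem, w in weights.items():
    lam = w * lambda_system  # failure-rate budget for this subsystem
    print(f"{subsystem:10s}: lambda = {lam:.2e}/h  (MTBF goal ~ {1 / lam:,.0f} h)")
```

In practice, the weights would come from complexity, field history, or negotiation between the group leads, and each lead would then break his budget down further within his own area, as described above.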
Appendix A: Software Reliability Growth
The following section on Software Reliability Growth was provided by Mark Turner of Enecsys. It is referenced from Section 31.3.5 in the main section of this book.

You can measure and manage the reliability growth of any software (or even hardware) development using an appropriate model. Many such models exist, and some are more suitable than others; perhaps the most suitable for software development is the Rayleigh model.

Software design and development is a continuous process in which you provide functionality using source code. Unfortunately, despite the best intentions of engineers, you may introduce defects as you create source code. Therefore, you will benefit from modeling the creation, identification, and elimination of code defects as a function of time. Throughout the software development process, there are numerous opportunities for you to introduce defects.

We provide a typical Rayleigh curve in Figure A.1, which illustrates the defect insertion and discovery process. This shows how you can identify and address defects that you introduced at an earlier phase of the software development. There will come a point in any software development program where the defect discovery rate reaches its maximum, after which the quantity of remaining defects falls over time. Here the leading bar graph illustrates code defect insertion, which often begins at the start of the development project. Because code defects are closely related to the amount of engineering effort, the rate at which you introduce them is often directly proportional to that effort.
Figure A.1: Rayleigh Estimation Model with Effort and Defect Curves (project probability density function; left axis: coding effort and defect insertion; right axis: defect removal rate; horizontal axis: time interval)

The lagging curve illustrates the defect removal rate, with problems being addressed at a later date than when you originally inserted them, which can hinder project progress and negatively impact customer satisfaction. You can partially mitigate this by conducting design reviews, code inspections, and early module testing; these activities will often help you discover inserted defects as early as possible, thus moving the defect discovery and correction curve to the left.

Eventually, the quantity of defects still present in the code will equate to the original reliability target, and as you discover and address further defects, the Software Reliability increases, or grows. You can manage the rate at which you address defects by setting software defect targets. Begin by estimating how many defects are likely to occur, then address those defects by implementing a Software Reliability growth management program in which you plan and schedule the necessary resources to ensure you achieve your reliability target.
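To show the shape being described, here is a minimal sketch in Python of a Rayleigh defect-insertion rate with a lagged removal curve. The parameter values are placeholders, not from the book, and it assumes the standard Rayleigh rate form f(t) = N * (t / t_m^2) * exp(-t^2 / (2 t_m^2)), which peaks at t = t_m and integrates to N.

```python
import math

def rayleigh_rate(t, total_defects, t_peak):
    """Rayleigh defect rate at time t: peaks at t_peak and
    integrates to total_defects over [0, infinity)."""
    if t <= 0:
        return 0.0
    return total_defects * (t / t_peak**2) * math.exp(-t**2 / (2 * t_peak**2))

# Illustrative parameters only (assumed, not from the book):
N_DEFECTS = 500   # total defects expected over the project
T_PEAK = 5.0      # months until the insertion rate peaks
LAG = 2.0         # months the removal curve lags insertion

for month in range(1, 16):
    inserted = rayleigh_rate(month, N_DEFECTS, T_PEAK)
    removed = rayleigh_rate(month - LAG, N_DEFECTS, T_PEAK)  # lagged copy
    print(f"month {month:2d}: insertion {inserted:5.1f}/mo, removal {removed:5.1f}/mo")
```

Shifting the removal curve left (a smaller lag) is exactly the effect of the early reviews, inspections, and module testing described above.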
This model is particularly suited to Software Reliability modeling as it provides a good representation of the vector sum of a large number of random sources of defects, none of which dominates. The Rayleigh model supports an effective iterative design process in which feedback is inherently part of the solution, and in fact it closely approximates the actual profile of defect data collected during software development programs.

Monitoring software development defect metrics can provide valuable input into planning engineering and Root Cause Analysis (RCA) efforts, and it helps you to quantify the maturity of the software you are developing. Collecting defect metrics related to engineering effort, project duration, and defect type over several development projects provides a great opportunity to analyze trends, which can then give you more accurate resource predictions for new projects. If you lack such trend data (typically a problem when you first deploy the Rayleigh model), you may have to use industry data as an alternative guide. While this alternative approach may not factor in the abilities of your actual design team, it does at least provide a reasonable estimate to begin with. After your organization completes multiple projects, you will benefit from reviewing predicted versus actual defect counts, as this enables you to refine the original estimates and improve the model for future development projects.

As the project progresses, compare the initial defect count estimates with the quantity of defects you actually address. If the actual defect count is significantly higher than predicted, the model has generated an early indication that a significant problem may exist. On the other hand, if the actual defect count is significantly less than the initial prediction, you should confirm that the identification process is sufficient to detect the anticipated defects. Once you confirm this, you can conclude that your defect insertion rate is actually less than predicted.

In using the Rayleigh model, you should determine the parameters for the total anticipated engineering effort, the total number of defects that you expect to insert into the code, and the time period to reach the peak estimate. Knowing these parameters will enable you to plot the cumulative distribution function (CDF). We've shown an example of a CDF chart in Figure A.2 and a CDF table in Figure A.3. The Rayleigh model parameters of Figure A.3 are 55 man-months of engineering effort, 755 inserted defects, and 4 months to reach the peak estimate. For the defect plot, you must define an additional value for the estimated lag behind the start of the project effort to account for defect detection and correction, which in Figure A.3 is 4 months.
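As a rough sketch of how such a CDF table can be computed, the following Python uses the parameters quoted above for Figure A.3 (755 inserted defects, a 4-month peak, and a 4-month lag) with the standard Rayleigh CDF F(t) = 1 - exp(-t^2 / (2 t_m^2)). The exact parameterization behind Figures A.2 and A.3 may differ, and the 55 man-months of effort would scale the effort curve, which this sketch omits.

```python
import math

def rayleigh_cdf(t, t_peak):
    """Fraction of the total defect count inserted by time t,
    assuming the standard Rayleigh CDF with mode at t_peak."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-t**2 / (2 * t_peak**2))

TOTAL_DEFECTS = 755  # total inserted defects (Figure A.3 parameter)
T_PEAK = 4.0         # months to reach the peak estimate (Figure A.3 parameter)
LAG = 4.0            # lag of defect detection/correction (Figure A.3 parameter)

for month in range(1, 16):
    inserted = TOTAL_DEFECTS * rayleigh_cdf(month, T_PEAK)
    removed = TOTAL_DEFECTS * rayleigh_cdf(month - LAG, T_PEAK)
    print(f"month {month:2d}: ~{inserted:4.0f} inserted, "
          f"~{removed:4.0f} removed, ~{inserted - removed:4.0f} still open")
```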
Figure A.2: Rayleigh Cumulative Distribution Function (CDF) Chart

Figures A.2 and A.3 illustrate the relationship between the project effort and the number of defects that you insert into the code, which enables you to make decisions about the impact that any code changes are likely to have and about changes to the code delivery date. From this example, you can conclude that a delivery schedule of eight months would be completely unrealistic, as a significant number of defects will still be present in your code, whereas a delivery schedule of twelve to fifteen months is more realistic. Delivery schedules in between require you to make a schedule versus reliability tradeoff.
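Under the same assumed CDF form and parameters as the sketch above, you can compare candidate ship dates directly. The absolute counts depend on the assumed lag and curve shape, but the qualitative conclusion (eight months leaves many defects open, twelve to fifteen far fewer) holds.

```python
import math

T_PEAK, LAG, TOTAL = 4.0, 4.0, 755  # Figure A.3 parameters, as above

def open_defects(month):
    """Estimated defects not yet removed, using the lagged Rayleigh CDF."""
    t = max(month - LAG, 0.0)
    return TOTAL * math.exp(-t**2 / (2 * T_PEAK**2))

for ship_month in (8, 12, 15):
    print(f"ship at month {ship_month:2d}: ~{open_defects(ship_month):3.0f} defects still open")
```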
Figure A.3: Rayleigh CDF Table

However, if early delivery is unavoidable, the CDF can aid in planning reliability growth activities and managing customer expectations where multiple deliveries are viable.
About the Author
Mike Silverman

Mike is founder and managing partner at Ops A La Carte LLC, a professional consulting company that has an intense focus on helping clients with reliability throughout their product life cycle. Mike has over 25 years' experience in reliability engineering, reliability management, and reliability training. He is an experienced leader in reliability improvement through analysis and testing. Through Ops A La Carte, Mike has had extensive experience as a consultant to high-tech companies. A few of the main industries are aerospace and defense, clean technology, consumer electronics, medical, networking, oil and gas, semiconductor equipment, and telecommunications. Most of the examples in this book have been taken from Mike's experiences. Mike is an expert in accelerated reliability techniques and owns HALT and HASS Labs, one of the oldest and most experienced reliability labs in the world. Mike has authored and published 25 papers on reliability techniques
and has presented these in countries around the world, including Canada, China, Germany, Japan, Korea, Singapore, Taiwan, and the USA. He has also developed and currently teaches over 30 courses on reliability techniques. Mike has a BS degree in electrical and computer engineering from the University of Colorado at Boulder, and is a Certified Reliability Engineer (CRE) through the American Society for Quality (ASQ). Mike is a member of the American Society for Quality (ASQ), Institute of Electrical and Electronics Engineers (IEEE), Product Realization Group (PRG), Professional and Technical Consultants Association (PATCA), and IEEE Consulting Society. Mike is currently the IEEE Reliability Society Santa Clara Valley Chapter Chair. You can contact Mike via the Ops A La Carte website at http://www.opsalacarte.com.