Design For Reliability
Human Error
During 2004 in the United States, pilot error was listed as the primary cause of 78.6% of fatal general aviation accidents, and as the primary cause of 75.5% of general aviation accidents overall. For scheduled air transport, pilot error typically accounts for just over half of worldwide accidents with a known cause.
Authorities in many fields ascribe 70-90% of all accidents (mistakes) to human error.
FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF YEARS...
Reliability considerations
Availability
Recoverability
Maintainability
Reliability
What are failures, where do they come from and why
Examples of good and bad design
Specific issues: two windows; similar names; same login details
Thinking about what might go wrong
Thinking about the impact when something goes wrong
Thinking about how to minimise the impact
Thinking about making the possible impossible
Do we do it on purpose?
Balance desire for incident resolution against the need for good design
Fire-fighting: hero culture, many faults
Prevention: reward planning, fewer faults
Fire-fighting is a natural state of mind for many people; when the adrenaline is pumping, it is exciting
How often do the fire-fighters get rewarded and thanked? If the culture encourages fire-fighting, then fire-fighting is what will happen
Long-lasting, embedded change can only happen if the culture is changed
Phrases to dread . . .
That wasn't supposed to happen
It's never done that before
It didn't do that in test
It couldn't possibly be anything to do with my change
It wasn't a big change
Of course I know what I'm doing
Human Error
Valve not re-opened after a test 48 hrs before the accident
Monitors placed incorrectly and didn't show real output
There is consensus that the accident was exacerbated by wrong decisions made because the operators were overwhelmed with information, much of it irrelevant, misleading or incorrect
A "positive feedback" lamp in the control room indicating the true position of the valve was eliminated in original construction to save time, but has been retrofitted onto all similar plants after the accident
NASA concluded that human error was to blame for the failure of the $247m (£124m) Mars Global Surveyor spacecraft (MGS).
NASA lost its $125 million Mars Climate Orbiter spacecraft as a result of a mistake that would shame a first-year physics student: failing to convert Imperial units to metric.
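One way to make this kind of mistake harder is to stop passing bare numbers between systems. The sketch below is illustrative only: the `Impulse` class and `apply_thruster_impulse` function are hypothetical names, and the choice of impulse (pound-force seconds versus newton-seconds) reflects the mismatch widely reported for the Mars Climate Orbiter. The conversion factor is the standard definition of the pound-force.

```python
from dataclasses import dataclass

# Exact by definition: 1 lbf = 4.4482216152605 N
LBF_TO_NEWTON = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse, always stored internally in newton-seconds (SI)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The caller must state the unit, so the conversion cannot be forgotten.
        return cls(value * LBF_TO_NEWTON)


def apply_thruster_impulse(impulse: Impulse) -> float:
    """Hypothetical consumer: accepts only an Impulse, never a bare number."""
    if not isinstance(impulse, Impulse):
        raise TypeError("impulse must be an Impulse, not a bare number")
    return impulse.newton_seconds
```

A caller who writes `apply_thruster_impulse(1.0)` gets an immediate error instead of a silently wrong trajectory, which is an instance of making the possible impossible.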
Causes of Errors
Organisational: complex hand-overs; unclear responsibilities (or gaps); organisational culture
Technological: poor design
Individual: stress; care; physiological
YMNEMNM YNMEMNM
Production instance was shut down and upgraded, causing a significant service outage (22 hours) while the database was recovered from the overnight backup. The business lost a further 8 hours of productivity in recreating the data.
Roaming profiles meant that, on logging into a production server, the desktop looked like a normal desktop Citrix session
Email administrator restarted the Exchange server rather than the local session by mistake
Human Error
People make mistakes; that is inevitable
Designers forget the above rule: if somebody makes a mistake due to poor design, who is to blame?
Take Aways
Dealing with the individual is punishment; it doesn't benefit the organisation
Need to fix the root cause that allowed the fault to occur or allowed the mistake to be made
OK, so what?
Change control process
Change plan printed out and each stage ticked off, in pen, when done
Two sets of eyes for critical changes: a second person sits in to monitor the change
The second person should be: at least as senior; not directly involved in the change; technically knowledgeable enough to check what is happening
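The first two criteria for the second person can be checked mechanically before a critical change is scheduled. The following is a minimal sketch under stated assumptions: `Person`, the numeric `seniority` scale and `can_observe` are hypothetical names invented for illustration. Whether the observer has enough technical knowledge remains a human judgement and is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    seniority: int  # hypothetical scale: a higher number means more senior


def can_observe(observer: Person, implementer: Person, involved_names: set) -> bool:
    """Return True if the observer meets the two checkable criteria."""
    # Must not be the implementer or anyone directly involved in the change.
    if observer.name == implementer.name or observer.name in involved_names:
        return False
    # Must be at least as senior as the person making the change.
    return observer.seniority >= implementer.seniority
```

For example, `can_observe(Person("Sam", 3), Person("Alex", 2), {"Alex", "Jo"})` returns `True`, while swapping in `Person("Jo", 5)` as the observer returns `False` because Jo is involved in the change.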
Monitoring
Make sure only the most important events are critical
Correctly prioritise everything
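As a rough illustration of the monitoring advice, the sketch below keeps the list of critical events deliberately short and sorts what the operator sees by priority. The event names and the `classify` and `operator_view` functions are hypothetical; a real monitoring tool would have its own rule format.

```python
from enum import IntEnum


class Priority(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical rule tables: only events needing immediate action are critical.
CRITICAL_EVENTS = {"database_down", "disk_full"}
WARNING_EVENTS = {"disk_80_percent", "backup_slow"}


def classify(event_name: str) -> Priority:
    """Anything not explicitly listed defaults to INFO, not CRITICAL."""
    if event_name in CRITICAL_EVENTS:
        return Priority.CRITICAL
    if event_name in WARNING_EVENTS:
        return Priority.WARNING
    return Priority.INFO


def operator_view(event_names):
    """Return events most urgent first, so critical ones are not buried."""
    return sorted(event_names, key=classify, reverse=True)
```

Defaulting unknown events to the lowest priority is a design choice aimed at the information overload described earlier; some teams may prefer unknown events to raise a warning instead.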
Initial Design
Naming conventions that clearly identify test and development
Labelling on front and back of servers
Make the profiles on different servers, or different accounts, look different (grey normal; blue production; green test; red critical)
Different passwords/user names for test/dev and production
Script repeated activities, especially if several complex stages are involved
Consider multiple administration accounts (normal user, admin user, server restart user) to prevent restarts being done by mistake (with controls, login could be prevented during normal business hours)
Involve support teams at the design stage to get their input on what might go wrong
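Several of these design points can be combined in a small script. The sketch below assumes a hypothetical naming convention in which the hostname starts with an environment prefix (`prd-`, `tst-`, `dev-`); the prefixes and function names are invented for illustration. The banner colours follow the scheme listed above, and the restart guard asks the operator to retype the hostname before a production restart.

```python
# Hypothetical naming convention: hostnames start with an environment prefix.
ENVIRONMENT_PREFIXES = {"prd": "production", "tst": "test", "dev": "development"}

# Colours follow the scheme listed above; development has no colour there.
BANNER_COLOURS = {"production": "blue", "test": "green"}


def environment_of(hostname: str) -> str:
    """Derive the environment from the hostname prefix, or fail loudly."""
    prefix = hostname.split("-", 1)[0].lower()
    if prefix not in ENVIRONMENT_PREFIXES:
        raise ValueError(f"hostname {hostname!r} does not follow the naming convention")
    return ENVIRONMENT_PREFIXES[prefix]


def confirm_restart(hostname: str, typed_confirmation: str) -> bool:
    """Allow a production restart only if the operator retyped the exact hostname."""
    if environment_of(hostname) != "production":
        return True
    return typed_confirmation == hostname
```

With this guard, `confirm_restart("prd-mail-01", "")` returns `False`, so a restart intended for a local session would not go through on a production server without a deliberate extra step.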
Take Aways
Mistakes WILL happen
Dealing with the individual is punishment; it doesn't benefit the organisation
Need to fix the root cause that allowed the mistake to be made
Ensure organisational learning is strongly embedded within the teams
Learning from your own mistakes is good; learning from somebody else's is even better