
Design For Reliability

Human error is listed as the primary cause of over 75% of general aviation accidents and just over half of commercial aviation accidents. Authorities in many fields ascribe 70-90% of all accidents to human error. Examples are provided of costly human errors, such as omitting a single letter in a command that caused the Phobos 1 Mars probe to be lost, and failing to convert units that led to the loss of NASA's Mars Climate Orbiter. Designing systems with clear naming conventions, separate user accounts, change control processes, and root cause analysis can help minimize human errors.

Uploaded by

bhavnish_kamboj
Copyright
© Attribution Non-Commercial (BY-NC)

Getting Things Right

Design for Reliability: how to avoid human errors

Human Error
During 2004 in the United States, pilot error was listed as the primary cause of 78.6% of fatal general aviation accidents, and as the primary cause of 75.5% of general aviation accidents overall. For scheduled air transport, pilot error typically accounts for just over half of worldwide accidents with a known cause.

Authorities in many fields ascribe 70-90% of all accidents (mistakes) to human error

When Nothing Else Matters


Concentrated Design
Focus on only the most important thing
Nothing else matters

When Nothing Else Matters

Can you read this?


Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe

How many Fs are there?

FINISHED FILES ARE THE RESULT OF YEARS OF SCIENTIFIC STUDY COMBINED WITH THE EXPERIENCE OF YEARS...

Reliability considerations
Availability
Recoverability
Maintainability
Reliability

What are failures, where do they come from and why
Examples of good and bad design
Specific issues
  Two windows
  Similar names
  Same login details

What can we learn
Designing fail-safe processes



Design for Reliability


Problem Management
Remove the cause of the fault; don't get better at responding to it
Understand the impact on the business process, not on IS
Balance the desire for incident resolution against the need for problem resolution

Thinking about what might go wrong
Thinking about the impact when something goes wrong
Thinking about how to minimise the impact
Thinking about making the possible impossible


Do we do it on purpose?
Balance desire for incident resolution against the need for good design
Fire-fighting: hero culture, many faults
Prevention: reward planning, fewer faults

Fire-fighting is a natural state of mind for many people: when the adrenaline is pumping, it is exciting
How often do the fire-fighters get rewarded and thanked?
If the culture encourages fire-fighting, then fire-fighting is what will happen

Long-lasting, embedded change can only happen if the culture is changed

Phrases to dread . . .
That wasn't supposed to happen
It's never done that before
It didn't do that in test
It couldn't possibly be anything to do with my change
It wasn't a big change
Of course I know what I'm doing
There won't be any impact
There's something you should know . . .


Three Mile Island

28 March through early April of 1979, biggest nuclear accident in America

Human error: a valve was not re-opened after a test 48 hrs before the accident
Monitors were placed incorrectly and didn't show the real output
There is consensus that the accident was exacerbated by wrong decisions made because the operators were overwhelmed with information, much of it irrelevant, misleading or incorrect
A "positive feedback" lamp in the control room indicating the true position of the valve was eliminated in the original construction to save time, but was retrofitted onto all similar plants after the accident


Phobos 1 Mars probe


In 1988, the Soviet Union's Phobos 1 satellite was lost on its way to Mars. Why? According to Science magazine, "not long after the launch, a ground controller omitted a single letter in a series of digital commands sent to the spacecraft. And by malignant bad luck, that omission caused the code to be mistranslated in such a way as to trigger the test sequence" (the test sequence was stored in ROM, but was intended to be used only during checkout of the spacecraft while on the ground). Phobos went into a tumble from which it never recovered.


$372M for (lack of a) calculator


The US space agency, NASA, has said that human error was to blame for the failure of the $247m (£124m) Mars Global Surveyor spacecraft (MGS).

NASA lost its $125 million Mars Climate Orbiter spacecraft as a result of a mistake that would shame a first-year physics student: failing to convert Imperial units to metric.
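
One way to make a missed unit conversion harder is to carry the unit with the number, so a bare value cannot be consumed without passing through an explicit conversion. The sketch below is illustrative only and is not taken from any flight software: the `Impulse` class, its constructors and `apply_thruster_impulse` are hypothetical names, and the example assumes a value reported in pound-force seconds that must be used in newton-seconds.

```python
from dataclasses import dataclass

# Standard conversion factor: 1 pound-force = 4.4482216152605 newtons.
LBF_TO_NEWTON = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse stored internally in newton-seconds (hypothetical helper)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary where the value enters.
        return cls(value * LBF_TO_NEWTON)


def apply_thruster_impulse(impulse: Impulse) -> float:
    """Accept only an Impulse, never a bare number with an implied unit."""
    if not isinstance(impulse, Impulse):
        raise TypeError("pass an Impulse, not a bare number")
    return impulse.newton_seconds


reading = Impulse.from_pound_force_seconds(10.0)
print(round(apply_thruster_impulse(reading), 3))  # 44.482
```

The design choice is that the caller has to say which unit a number is in before it can be used at all; forgetting the conversion becomes a type error rather than a silent factor-of-4.45 discrepancy.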


Changing a disk: it can't be that hard

Five operators performed a basic repair task: replacing a failed disk in a software RAID system
All five people participating in the experiments were trained on how to perform the repair and were given printed step-by-step instructions
Each person performed several trials of the repair process
Even in a low-stress setting (no alarms, no angry customers or bosses breathing down their necks), operators made fatal errors that caused data loss up to 10% of the time
Even the best-intentioned reliability technologies (such as RAID) can become impotent in the face of the human capacity for error


Causes of Errors
Organisational
  Complex hand-overs
  Unclear responsibilities (or gaps)
  Organisational culture

Technological
  Poor design
  Complex systems
  Too little or too much data

Individual
  Stress
  Care
  Physiological

Some Specifics (from real-world examples)

Naming Conventions
Production Oracle instance: YMNEMNM
Test Oracle instance: YNMEMNM

The production instance was shut down and upgraded, causing a significant service outage (22 hours) while the database was recovered from the overnight backup, and the business lost a further 8 hrs of productivity recreating the data
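
A naming convention can also be checked mechanically before a new name is accepted. The sketch below is an assumption about how such a check might look, not something from the original slides: it uses the standard-library `difflib.SequenceMatcher` to flag pairs of instance names that are nearly identical. The function name `confusable_pairs`, the 0.8 threshold and the alternative names `PRODORA1` and `TESTORA1` are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def confusable_pairs(names, threshold=0.8):
    """Return the pairs of names whose similarity ratio reaches the threshold."""
    flagged = []
    for a, b in combinations(names, 2):
        # ratio() is 1.0 for identical strings and 0.0 for nothing in common.
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            flagged.append((a, b))
    return flagged


# The two instance names from the incident differ only by two swapped letters.
print(confusable_pairs(["YMNEMNM", "YNMEMNM"]))

# Names that put the environment first are far easier to tell apart.
print(confusable_pairs(["PRODORA1", "TESTORA1"]))
```

A check like this could run whenever a new server or database name is registered, rejecting any name too close to an existing one in a different environment.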


Some Specifics (from real-world examples)

Access to Production and Test systems using the same user ID
Test and Production systems on different servers were set up with the same username and password. A UNIX administrator confused which server they were logged into and made changes to the production environment.

Roaming profiles meant that, on logging into a production server, the desktop looked like a normal desktop Citrix session. An email administrator restarted the Exchange server rather than the local session by mistake.

Some Specifics (from real-world examples)

Simultaneous access to Production and Test systems
A SAN engineer had windows to the production and test SANs open at the same time. When tidying up in the test environment, the engineer inadvertently used the wrong window and deleted the SAN configuration from production.
The result was a major service outage impacting many systems, leading to a significant SLA breach and a major relationship issue.

Human Error
People make mistakes: that is inevitable
Designers forgot the above rule: if somebody makes a mistake due to poor design, who is to blame?
Take Aways
Dealing with the individual is punishment; it doesn't benefit the organisation
Need to fix the root cause that allowed the fault to occur or allowed the mistake to be made

OK, so what?
Change control process
Change plan printed out and each stage ticked off, in pen, when done
Two sets of eyes for critical changes: a second person sits in to monitor the change
  At least as senior
  Not directly involved in the change
  Enough technical knowledge to check what is happening
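
The three rules for the second pair of eyes on a critical change can be written down as a simple check, for example in a change-management tool. This is a minimal sketch under stated assumptions: the `Person` and `Change` records, the numeric seniority scale and the single `required_skill` field are all hypothetical simplifications, not part of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    name: str
    seniority: int                       # assumed scale: higher is more senior
    skills: set = field(default_factory=set)


@dataclass
class Change:
    implementer: Person
    contributors: list                   # everyone who helped prepare the change
    required_skill: str                  # assumed: one skill describes the change


def observer_problems(change, observer):
    """List the reasons this observer fails the second-pair-of-eyes rules."""
    problems = []
    if observer.seniority < change.implementer.seniority:
        problems.append("less senior than the implementer")
    involved = [change.implementer.name] + [p.name for p in change.contributors]
    if observer.name in involved:
        problems.append("directly involved in the change")
    if change.required_skill not in observer.skills:
        problems.append("lacks the technical knowledge to check the change")
    return problems


alice = Person("alice", 3, {"oracle"})
bob = Person("bob", 4, {"oracle"})
change = Change(implementer=alice, contributors=[], required_skill="oracle")
print(observer_problems(change, bob))    # [] means bob is an acceptable observer
```

Returning the list of failed rules, rather than a plain yes or no, tells the change owner what to fix when an observer is rejected.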

Monitoring
Make sure only the most important events are critical
Correctly prioritise everything


Initial Design
Naming conventions that clearly identify test and development
Labelling on front and back of servers
Make the profiles on different servers, or different accounts, look different (grey = normal; blue = production; green = test; red = critical)
Different passwords/user names for test/dev and production
Script repeated activities, especially if several complex stages are involved
Consider multiple administration accounts (normal user, admin user, server restart user) to prevent restarts being done by mistake (with controls, login could be prevented during normal business hours)
Involve support teams at the design stage: get their input on what might go wrong
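
Several of these design points can be combined in the script that performs a restart. The following is a minimal sketch, assuming weekday business hours of 08:00 to 18:00 and a rule that the operator must retype the server name; the function `restart_allowed` and its parameters are hypothetical, and a real control would sit in the login or scheduling layer rather than in a helper function.

```python
from datetime import datetime, time

BUSINESS_START = time(8, 0)    # assumed business hours, weekdays only
BUSINESS_END = time(18, 0)


def restart_allowed(environment, server_name, typed_confirmation, now):
    """Decide whether a restart may go ahead.

    Only servers explicitly marked "test" or "dev" restart freely. Anything
    else is treated as production: the operator must retype the server name,
    and restarts are refused on weekdays during business hours.
    """
    if environment in ("test", "dev"):
        return True
    if typed_confirmation != server_name:
        return False
    in_hours = now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END
    return not in_hours


# Wednesday 3 January 2024, 10:00: refused even with the right confirmation.
print(restart_allowed("production", "mailprod01", "mailprod01",
                      datetime(2024, 1, 3, 10, 0)))   # False
# Same day, 22:00: allowed once the server name is retyped correctly.
print(restart_allowed("production", "mailprod01", "mailprod01",
                      datetime(2024, 1, 3, 22, 0)))   # True
```

Treating any unrecognised environment as production is deliberate: a mislabelled server then fails safe instead of restarting by mistake.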

Root Cause Analysis: learning-based organisation

Institute a mature problem management process with a full RCA stage
In an RCA, human error is not an acceptable root cause: what condition or control was missing that allowed the inevitable to happen?


Take Aways
Mistakes WILL happen
Dealing with the individual is punishment; it doesn't benefit the organisation
Need to fix the root cause that allowed the mistake to be made
Ensure organisational learning is strongly embedded within the teams

Learning from your own mistakes is good; learning from somebody else's is even better
