Reliability Availability Serviceability

The document discusses reliability, availability, and serviceability (RAS) in computer processors. It defines RAS concepts like reliability, availability, and serviceability. It then summarizes three papers on implementing RAS: (1) DIVA uses check-compute units to detect errors and recompute instructions, (2) the Ultra Enterprise 10000 has redundant cooling, power, interconnects and can hot-swap components, (3) IBM's S/390 G5 uses multiple processors, extensive error checking and correcting to achieve 100% soft error correction.

Uploaded by

Arturo Castellanos

EE482: Advanced Computer Organization Lecture #16

Processor Architecture
Stanford University Friday, 27 May 2000

Reliability, Availability, and Serviceability

Lecture #16: Tuesday, 22 May 2000
Lecturer: David Lie
Scribe: John Maly

1 Overview of RAS
Definition: RAS

RAS stands for Reliability, Availability, and Serviceability. It actually encompasses much
more than just the processor of a system, but this discussion will primarily be confined
to the processor-related aspects of RAS (although some discussion of redundant power
supply, cooling fans, and interconnect does take place at the end).

Why is RAS important?

Reliability is crucial in modern systems, even if it comes at the expense of system
performance. Slowness can be an acceptable trait of a system, but failure and data loss
are almost never acceptable. Downtime is equally unacceptable, which makes the
importance of availability obvious. Finally, serviceability contributes to both of the
aforementioned traits, and should help to reduce the ongoing cost of running the system.
Some definitions are given below.

Definition: Reliability

Reliability is a function of time that expresses the probability at time t+1 that a system
is still working, given that it was working at time t.

Definition: Availability

Availability is the measure of how often the system is available for use (such as a system’s
up-time percentage). Availability and reliability may sound like the same thing, but it
is worth noting that a system can have great availability but no reliability. An Internet
router is a good example of this: it stores no state data, so it is one of the few systems
in which data loss is acceptable, as long as high availability is maintained.
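
As a rough illustration of the up-time percentage mentioned above, the following sketch computes availability as the fraction of observed time during which the system was usable. The function name and the sample figures (one year of operation with 8.76 hours of downtime) are assumptions chosen for the example, not values from the lecture.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Return the fraction of the observed time during which the system was usable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical figures: one year of operation with 8.76 hours of downtime.
HOURS_PER_YEAR = 8760.0
downtime = 8.76
a = availability(HOURS_PER_YEAR - downtime, downtime)
print(f"availability = {a:.3%}")
```

Note that this number says nothing about data loss: a router that drops packets but is never down would still score well here, which is the distinction the definition draws.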

Definition: Serviceability

Serviceability is a broad term describing how easily a system can be serviced or
repaired. For example, a system with modular, hot-swappable components would have a good
level of serviceability.

Note that implementing hot-swappable components contributes to all three of the above
qualities, not just serviceability.

Some Trends Currently Affecting RAS in Processors:

1. Shrinking architectures and capacitances
2. Increased processor complexity (difficult to fully verify)
3. Extensive use of dynamic logic (more discharging)
(The latter is an issue with static logic also, but dynamic logic has lower margins for
error)

What are the two major themes in these papers for implementing RAS?

Data integrity: The ability to isolate failures is critical. We want failures to be
safe (also known as fail-stop).

Redundancy: Redundancy in functional units, interconnects, and backup processors is
an effective way to assure reliability and availability, and to prevent the customer from
needing to service the product.

What are the three steps to failure recovery?

1. Detect the failure
2. Isolate it from causing further problems
3. Fix the failure and any associated side effects
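
The three steps above can be sketched as a small loop over a pool of components. This is only an illustrative model: the `Unit` class, the `self_test` check, and the instant "repair" are hypothetical stand-ins for whatever detection and repair mechanisms a real system provides.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A hypothetical replaceable component."""
    name: str
    healthy: bool = True
    online: bool = True


def self_test(unit: Unit) -> bool:
    """Step 1 (detect): report whether the unit passes its check."""
    return unit.healthy


def recover(units):
    """Apply detect -> isolate -> fix to every unit and return a log of actions."""
    log = []
    for unit in units:
        if self_test(unit):
            continue
        unit.online = False          # Step 2 (isolate): stop the unit affecting others
        log.append(f"isolated {unit.name}")
        unit.healthy = True          # Step 3 (fix): stand-in for a real repair or swap
        unit.online = True
        log.append(f"repaired {unit.name}")
    return log


pool = [Unit("cpu0"), Unit("cpu1", healthy=False)]
print(recover(pool))
```

The point of the ordering is that the faulty unit is taken offline before any repair is attempted, so the failure cannot propagate while it is being fixed.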

2 DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design
How does DIVA work?

DIVA uses check-compute units. After each instruction, these units are fed the end
result. They then compute the instruction result a second time, and compare their result
to the supplied (initial) result. If the results do not match, the processor goes back and
performs the instruction again.

DIVA was designed with simplicity in mind. To this end, the checker is always assumed
correct (no voting takes place). Keeping the checker simple helps to reduce the
possibility of design faults in it. DIVA is an improvement on Tandem, because it will
actually catch design errors (which, in identical replicated units, would theoretically
produce agreeing, incorrect results).
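
The check-and-compare behaviour described above can be modelled in a few lines. This is a simplified sketch, not the DIVA hardware: the three-operation instruction format, the `OPS` table, and the function names are assumptions made for the example. Because the notes state that the checker is always assumed correct, the sketch commits the checker's value on a mismatch and counts the event; re-execution in the core is not modelled.

```python
import operator

# Hypothetical instruction set: (opcode, operand_a, operand_b).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def check_and_commit(instr, core_result):
    """Recompute one instruction; return (committed_value, mismatch_detected)."""
    opcode, a, b = instr
    checker_result = OPS[opcode](a, b)
    if checker_result == core_result:
        return core_result, False
    # No voting: the checker is trusted, so its value is the one committed.
    return checker_result, True


def run(program, core_results):
    """Check every core result in program order; return (committed values, mismatches)."""
    committed, mismatches = [], 0
    for instr, core_result in zip(program, core_results):
        value, mismatch = check_and_commit(instr, core_result)
        committed.append(value)
        mismatches += int(mismatch)
    return committed, mismatches


program = [("add", 2, 3), ("mul", 4, 5), ("sub", 9, 1)]
core_results = [5, 21, 8]            # the second result is deliberately wrong
print(run(program, core_results))    # ([5, 20, 8], 1)
```

Since the checker recomputes from the operands rather than mirroring the core's logic, a design error in the core shows up as a mismatch instead of as two agreeing wrong answers.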

Definitions: Validation vs. Verification

While often used interchangeably, these terms mean different things. Validation refers
to making sure a design functions according to specification. This can be done with a
simulator in software. Conversely, verification refers to making sure a finished product
(e.g., a processor) operates as intended (or previously simulated).

What does a Watchdog Timer do?

A watchdog timer exists to catch deadlock situations and restart execution if they occur
(that is, when the timer reaches a certain high value). This guarantees that the processor
core makes forward progress. The timer is reset whenever the instruction it is timing
retires.
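
A minimal software model of such a timer is sketched below. The class name, the cycle-by-cycle `tick` interface, and the limit of three cycles are assumptions chosen to keep the example short; a real watchdog is a hardware counter with a much larger limit.

```python
class WatchdogTimer:
    """Counts cycles without a retirement and fires when a limit is reached."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0
        self.restarts = 0

    def tick(self, retired: bool) -> bool:
        """Advance one cycle; return True if the watchdog fired on this cycle."""
        if retired:
            self.count = 0           # forward progress was made, so start over
            return False
        self.count += 1
        if self.count >= self.limit:
            self.count = 0           # fire: the core would be restarted here
            self.restarts += 1
            return True
        return False


wd = WatchdogTimer(limit=3)
# True means something retired during that cycle.
trace = [True, False, False, True, False, False, False, False]
fired = [wd.tick(retired) for retired in trace]
print(fired, wd.restarts)
```

In this trace the timer fires once, on the third consecutive cycle without a retirement; the earlier two-cycle stall is cleared by the retirement that follows it.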

What are the main sources of performance impact in DIVA?

1. Number of memory ports (extra ports yield better performance)
2. Latency of detector (if this is really long, instructions cannot retire quickly, and
the instruction window fills up)
3. Varied exception rates (higher numbers of exceptions mean that less real work is
getting done)

Final note: The in-order core has almost no dependencies. Ideally, it has 100 percent
prediction because it runs "in the wake" of the superscalar processor. There is no
forwarding (except for the comm-checking bypass; we cannot perform checks without the
correct value here).

3 Ultra Enterprise 10000 Server: SunTrust Reliability, Availability, and Serviceability
Features:
1. Interconnect redundancy (uses crossbar interconnect)
2. System service (a separate processor monitors the whole system, looking for and
correcting errors)
3. Dynamic configuration (hot-swap ability and corresponding software support)

Note that hot-swapping entails a fairly complex procedure. The system must first evict
data, then evict each process after its next context switch, and finally shut down power
to the component before it can be removed. Processes are migrated over to other
components and execution is resumed.

Fault-Tolerant Cooling and Power

This system has redundant power supplies and power lines. The power supplies overlap
(current comes from all of them simultaneously). The system uses "n+1 redundancy",
meaning that one power supply can fail and the system continues to function, as it needs
only n working supplies at a given time.
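
The n+1 rule can be stated as a one-line check, sketched here together with the load sharing that follows from all supplies delivering current at once. The value n = 4 and the 1000 W load are hypothetical figures for illustration; the notes do not give the actual numbers for this machine.

```python
def system_powered(required: int, working: int) -> bool:
    """The system runs as long as at least `required` supplies are working."""
    return working >= required


def tolerated_failures(required: int, installed: int) -> int:
    """Number of supplies that can fail before the system loses power."""
    return max(installed - required, 0)


def load_per_supply(total_load_watts: float, working: int) -> float:
    """All working supplies share the load equally."""
    if working <= 0:
        raise ValueError("no working supplies")
    return total_load_watts / working


n = 4                      # hypothetical number of supplies the load requires
installed = n + 1          # n+1 redundancy: one spare, normally sharing the load
print(tolerated_failures(n, installed))          # 1
print(system_powered(n, installed - 1))          # True: one failure is survivable
print(system_powered(n, installed - 2))          # False: a second failure is not
print(load_per_supply(1000.0, installed), load_per_supply(1000.0, installed - 1))
```

With these figures each supply carries 200 W when all five work and 250 W after one fails, so every supply must be rated for the higher number.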

The system also has redundant cooling fans. When one fan dies, the remaining fans
are actually sped up to compensate for the loss of airflow!

Partitioning

Partitioning was, prior to this machine, only available in mainframes. Essentially, the
system is divided into domains. Processes are distributed to the various domains, and
redistributed when failure necessitates it. The system uses a crossbar interconnect to
facilitate this redistribution.

Automatic System Reconfiguration

The Enterprise 10000 shares peripherals, IO, and the backplane. Memory is not
contiguous, as it is spread across different boards. The backplane is the only device
that is not hot-swappable.

Design note: It is generally inadvisable to place active devices onto backplanes when
designing a system. Such devices cannot be replaced without replacing the backplane,
which in most cases means the machine must come down for prolonged service.

4 IBM’s S/390 G5 Microprocessor Design

What makes this processor reliable?

IBM claims that this processor is 100 percent effective in correcting soft errors. This
reliability comes from several features:
1. Multiple processors in 2 regions (processes can swap out if one processor goes down);
12 real processors, 12 "spare". (These are, however, in a multi-chip module, so a single
defective processor cannot be replaced. It is assumed that the 12 extra processors will
be enough redundancy for the life of the system.)
2. Implemented error correction and recovery (see below)
3. Implemented array recovery for hard errors (see below)
4. Uses CMOS instead of bipolar transistors
5. Uses decimal ALUs instead of binary for more precision
6. 20 percent of all logic in the processor is devoted to error-checking

Hardware error recovery takes place in a series of steps:

a. Stop the system upon error detection
b. Any queued data is written into the L2 cache
c. Most of the machine’s functional units are reset
d. Use error-correcting codes:
   If correct, continue.
   If not, update the register file with the error-correcting code (ECC)
   result, and assume that it is the correct value.
e. Repeat step d (one time).

It is interesting to note that parity is used in the L1 cache, and ECC is used in the L2
cache. The rationale for this is that redundant data needs only parity; if an error is
detected, a new copy of the data can be brought in. If the data is unique, however,
error-correcting codes are used to attempt to reconstruct or correct it.
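
The difference between detecting and correcting can be seen in a small sketch. A single even-parity bit reveals that one bit flipped but not which one, while a Hamming(7,4) code locates the flipped bit so it can be repaired. The Hamming(7,4) code is used here only because it is the smallest textbook ECC; it is an assumption for illustration and not the code used in the S/390 G5.

```python
def parity_bit(bits):
    """Even-parity bit over a list of 0/1 values."""
    return sum(bits) % 2


def parity_ok(bits, stored_parity):
    """Detect (but not locate) an odd number of flipped bits."""
    return parity_bit(bits) == stored_parity


def hamming74_encode(data):
    """Encode [d1, d2, d3, d4] as positions 1..7 = p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4            # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4            # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Fix at most one flipped bit; return (data bits, error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome gives the 1-based error position
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], position


data = [1, 0, 1, 1]

# Parity: the flip is noticed, so a clean copy must be fetched from elsewhere.
stored = parity_bit(data)
corrupted = [1, 1, 1, 1]
print(parity_ok(corrupted, stored))          # False

# ECC: the flip is located and undone, so no second copy is needed.
word = hamming74_encode(data)
word[4] ^= 1                                  # flip position 5
print(hamming74_correct(word))                # ([1, 0, 1, 1], 5)
```

This mirrors the rationale in the notes: parity is enough where another copy of the data exists, and the more expensive correcting code is reserved for data that cannot be refetched.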
