Recommended Reading: Issue 4, May 2005
NOVEMBER 2004
Introduction
Rodger Richey
Senior Applications Manager

This issue of the EMC Newsletter will focus on software techniques to make your application more robust. Software techniques do not preclude using any hardware solutions, but should be used to enhance the hardware.

You have probably heard this before, but sometimes the best defense is a good offense. With respect to EMC, this means that the more proactive you are in designing for EMC at the start of a design, the better the results at the end will be. This is true not only of hardware, but also of software.

Applications can have three basic responses to EMC. The first is to continue operating as if nothing happened. The second is to interrupt operation and then recover to a normal operating state. The third response is to completely halt operation and not recover. We all strive to achieve the first response. In most cases, the second response is adequate, depending on the type of EMC event. The third, however, is not acceptable. The techniques presented in this issue may make the difference between a complete system failure and one that recovers.

One feature that exists on all Microchip MCUs is a Watchdog Timer or WDT. The WDT is an excellent peripheral to help recover from EMC events. Its effectiveness, however, is directly related to how it is used. It is very important to detect the type of reset, and then properly put the microcontroller into a known good state before executing any code.

When using a WDT in your application, you should analyze how the WDT is enabled or disabled, what operations can be performed on the WDT, and what the time-out period is, among other things. Understanding this operation will allow you to better use the WDT to recover from EMC events.

Your feedback regarding this newsletter is welcomed and valued. Please send any comments on this newsletter or prior issues, or any ideas for future articles to: [email protected].

In This Issue
Software and Hardware Techniques for Reducing EEPROM Memory Corruption
Fault Tolerant Software Techniques
Power-up Detection
Detection and Recovery

Recommended Reading
EMC Techniques for Microprocessor Software
by D.R. Coulson, IEEE Colloquium on Electromagnetic Compatibility of Software – 98/471, available for download from IEEE.org for $35.00

This paper describes a simple strategy for making microcontroller-based systems more robust through software techniques. The author discusses an approach that considers identifying the hazard, the risk to the application, the vulnerabilities of the system, the failure mechanisms, and the appropriate measures that can be taken to make the application more robust.

Tips and Tricks

One of the best ways to help recover from EMC events when using the Watchdog Timer (WDT) is to fill all unused memory locations with a GOTO $ instruction.

This instruction can be used with the Microchip assembler and tells the microcontroller to branch back to the same address. If for some reason the microcontroller tries to execute instructions in unused program memory locations, the GOTO $ instruction will create an infinite loop that does not clear the WDT. The WDT will eventually expire and reset the microcontroller. You can then use techniques discussed in other articles of this issue to detect the reset and then put the microcontroller into a known good state.

One useful assembler directive for the Microchip assembler to automate filling memory with GOTO $ is the FILL command. The format of the FILL command is: FILL expr, count, where expr is the value you want to place in program memory (it is typically enclosed in parentheses as shown in the example below), and count is the number of times you want the expression repeated. The following example illustrates the use of the FILL command for the PIC18F252 microcontroller.

            #include p18f252.inc
            org     0x12
    failsafe goto   $
            org     0x100
            fill    (goto failsafe), (0x8000-$)/2
            end

This example shows how the FILL command could be used to fill from address 0x100 to the end of program memory. The reason that the count parameter is divided by 2 is that the goto failsafe instruction takes up two words of program memory.

If you are using a C compiler such as MPLAB® C18, you can create an object file using the assembler and then link it into your project. Use the MAP output of the linker to find the open areas in your program memory, and then create the assembly file to fill those gaps. You would normally use this procedure once your code has been written and debugged.
• SPI™
The SPI bus is a 4-wire (Clock, Data In, Data Out, and Chip Select) synchronous bus. From the data point of view, the key features of the SPI bus that improve the reliability and robustness of a system are the Chip Select and the hardware write protection. While the Chip Select pin is deasserted, the device does not respond to any bus commands, which is an advantage in a system with noise. Built into the SPI bus command structure is the Write Enable latch. Before data can be written to a memory location, the device must be write enabled. After the write has completed, this bit is cleared in hardware and must be set again before any subsequent write operation.
• Microwire
The Microwire bus shares many features with the SPI bus. The main exception is the software write enable feature. With Microwire, this bit, once enabled, remains enabled until changed under software control.
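
The difference between the two write-enable behaviors can be made concrete with a small host-side model. This is only a sketch: the `eeprom_model` type and the `eeprom_*` functions are hypothetical names invented for illustration and do not correspond to any Microchip driver or to the real bus commands. It models only the latch, where `auto_clear = true` stands for the SPI behavior and `auto_clear = false` for the Microwire behavior.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a serial EEPROM's write-enable latch.
 * auto_clear = true  : latch is cleared by hardware after each write (SPI).
 * auto_clear = false : latch stays set until software clears it (Microwire). */
typedef struct {
    uint8_t mem[16];
    bool    write_enabled;
    bool    auto_clear;
} eeprom_model;

static void eeprom_init(eeprom_model *e, bool auto_clear)
{
    memset(e->mem, 0xFF, sizeof e->mem);  /* erased state */
    e->write_enabled = false;             /* writes are blocked after power-up */
    e->auto_clear = auto_clear;
}

static void eeprom_write_enable(eeprom_model *e)  { e->write_enabled = true; }
static void eeprom_write_disable(eeprom_model *e) { e->write_enabled = false; }

/* Returns true if the byte was stored, false if the write was ignored. */
static bool eeprom_write(eeprom_model *e, uint8_t addr, uint8_t value)
{
    if (!e->write_enabled || addr >= sizeof e->mem) {
        return false;                     /* stray write is ignored */
    }
    e->mem[addr] = value;
    if (e->auto_clear) {
        e->write_enabled = false;         /* must be re-enabled before the next write */
    }
    return true;
}
```

With the auto-clearing latch, a stray write caused by runaway code is ignored unless the enable step happened immediately before it. With the Microwire-style latch, the software has to call the disable step itself after each write to get the same protection.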
Fault Tolerant Software Techniques

There are many applications that place a microcontroller into a very harsh environment. Some of these environments can cause electrical transients inside the microcontroller. One example is outer space, where the presence of ionizing radiation can create some very challenging problems. Another harsh environment is in low-cost appliances, where cost-saving measures can change an otherwise benign environment into one with very harsh electromagnetic transients. The problems seen in these environments usually manifest themselves as randomly flipped bits inside the microcontroller.

Managing the environmentally induced faults is the responsibility of the design engineer. For terrestrial applications, the problem can usually be resolved by adjustments to the circuit or shielding. However, this is not always possible, so this article discusses how software may be able to mitigate the effects of these soft errors.

Inside a microcontroller there are many registers that affect the function of the controller or the software. Special function registers typically affect how the controller will behave, while RAM locations are used to store variables affecting the behavior of the software. What is the effect on the application when one of these registers changes in value? The following table shows possible results from a few of the registers:

    Register                        Possible result
    Program Counter                 Code flow will jump randomly.
    Status Register                 Banking, math, and decision errors.
    Interrupt status                Random interrupts created, masked or unmasked.
    I/O Direction Registers         Inputs become outputs or outputs become inputs.
    Stack Pointer or Stack memory   Functions and interrupts generate random return addresses.

This table does not show a comprehensive list of errors or results, but it should provide an idea of the possible severity of the problem.

Most of these problems can be detected or corrected by careful attention to detail in the software design. However, we must remember that even for the problems we cannot correct, we must still assure fail-safe operation.

Note: Fail-safe is a term used to describe a system where any failure will result in an output that is safe. A failure in a saw must leave the blade stopped for the system to be termed fail-safe.

One of the most difficult errors to handle is a random change to the program counter. Here are a few ways to deal with this problem.
1. Make critical code as small as possible to minimize the chances of random entry.
2. Place infinite loops around critical sections to catch runaway code before it enters the critical area.
3. Do not locate critical tasks or data near the beginning or end of memory.
4. Use “passkeys” located throughout RAM to enable a critical section. Clear the “passkeys” after use.
5. Break up critical operations into smaller discrete tasks and scatter them around memory.
6. Leave a trail of completed steps in the RAM. Before starting any task, verify that the previous steps are complete by looking at the trail. If the program counter is corrupted, you will detect this by seeing missing steps.
7. Fill unused program memory with a jump to a code trap. The trap can be used to reset by jumping to the start of memory, by waiting for the Watchdog Timer, or even by triggering an external device to power cycle the controller.

One common problem is errors with the I/O registers. If a bit flips in a PORT register, an output will suddenly change. One solution is to keep RAM copies of all the special function registers. The registers can be updated with the correct values from RAM in a tightly controlled area of code. Registers that are set and never changed should be refreshed from program memory, while the other registers are refreshed from RAM.

It is possible to corrupt any place in memory, so care should be taken to ensure that the values are correct. Fortunately, it is a simple matter to use software checksums or parity to verify that the data in RAM is correct.

State machines can be kept in good health by creating a table of valid to/from states. If a state machine is leaving state A and entering state C, a quick check of the table can make sure that A->C is a valid combination. If that transition is invalid, it is time to reset or go to our code trap.

A stack failure will look exactly like a random program counter upset. Adding the code for program counter upsets will solve a few problems.
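
The table of valid to/from states described in this article can be sketched in a few lines of C. This is a minimal illustration, not code from the newsletter: the three states, the contents of `valid_transition`, and the function names are all hypothetical, and a real application would fill the table from its own state diagram.

```c
#include <stdbool.h>

/* Hypothetical three-state machine used only for illustration. */
typedef enum { STATE_A, STATE_B, STATE_C, STATE_COUNT } state_t;

/* valid_transition[from][to] is true when the move is allowed.
 * Example table (an assumption): A->B, B->C and C->A are the only legal moves. */
static const bool valid_transition[STATE_COUNT][STATE_COUNT] = {
    /* to:          A      B      C    */
    /* from A */ { false, true,  false },
    /* from B */ { false, false, true  },
    /* from C */ { true,  false, false },
};

/* Range-check first: a corrupted state variable may hold any value. */
static bool transition_is_valid(unsigned from, unsigned to)
{
    if (from >= STATE_COUNT || to >= STATE_COUNT) {
        return false;
    }
    return valid_transition[from][to];
}

/* Updates *current only when the transition is in the table.
 * On false, the caller should reset or jump to its code trap. */
static bool change_state(state_t *current, state_t next)
{
    if (!transition_is_valid((unsigned)*current, (unsigned)next)) {
        return false;
    }
    *current = next;
    return true;
}
```

In this example table, a request to go from A directly to C is rejected, which matches the A->C check described in the text. Keeping the table in program memory (here, `const`) means the check itself does not depend on RAM that may have been corrupted.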
Power-up Detection

When using a microcontroller in an Electrical Fast Transients (EFT) rich environment, the hardware is typically designed to handle an EFT event of a particular magnitude. Whatever the hardware is designed to handle, there is always the possibility of an EFT event that exceeds this limit. What happens when an event like this occurs, and is there anything we can do to help protect the system’s integrity? The answer is that we don’t always know what is going to happen, but there are ways to protect the system through the use of software and some of the microcontroller’s peripherals.

Some of the possible errors that may occur due to an EFT event are: RAM corruption, an error with the program counter causing an unwanted jump in the code, a randomized interrupt, or a reset of the microcontroller, just to name a few. Special care must be taken when designing the system to deal with these types of situations. In this article, we will look specifically at the Reset condition and how to properly identify what type of reset has occurred.

Determining what type of reset has occurred can be advantageous to the designer. Once the type of reset is known, the software can respond appropriately to the event. Anyone who is accustomed to programming with our baseline parts is probably already familiar with this practice. Since the baseline parts don’t have an interrupt vector, any time the part comes out of Sleep it will go through the reset vector. In a typical low-power application, the baseline part will be put to Sleep and will wait for an external event to bring the part out of Sleep, such as a wake-on-change on an I/O pin, a Watchdog time-out, or for the PIC10F200 series, a wake-on-change on the comparator. Each one of these examples could have different code to execute based on what brought the part out of Sleep.

The same technique can be used to help protect the system’s integrity from an EFT event. The peripherals that will help us do this are:
• Watchdog Timer (WDT)
• Brown-out Reset (BOR)
• Power-on Reset (POR)
• Master Clear (MCLR)

The WDT can be used to help determine if the program counter was corrupted. This can be done by filling up the unused portions of your program space with infinite loops. If one of these loops is executed due to an error with the program counter, the code will stay in the loop until a WDT time-out occurs. The BOR can be used to help detect possible RAM corruption. When a BOR is detected, a parity check of the RAM can be done to see if there is corruption, or the RAM can just be updated. A bit flipped by an EFT event can be difficult to detect, yet it takes the least amount of EFT energy to cause. By using the BOR and the WDT, we can help protect against this. The more severe EFT events will cause either an MCLR reset (by inducing a voltage on the MCLR pin) or a POR reset. By monitoring how many reset events happen and what type of reset occurs, you can gauge the strength of the EFT environment and decide whether or not additional hardware protection is needed.

To determine which type of reset has occurred, there are a few bits that need to be monitored, and these are specific to each microcontroller. In the case of most baseline parts, there are two bits that need to be monitored from the STATUS register: TO and PD. The following table shows the condition of these bits with respect to a particular reset, where ‘u’ indicates no change in the bit.

    Condition                        TO   PD
    Power-on Reset                   1    1
    MCLR Wake-up (from Sleep)        1    0
    WDT Reset (normal operation)     0    1
    WDT Wake-up (from Sleep)         0    0
    MCLR Reset (normal operation)    u    u

With our mid-range parts, we will also have to take into consideration the POR and BOR bits in the PCON register. Depending on how these bits are set, we can determine what caused the device to reset.

    Condition                        POR  BOD  TO   PD
    Power-on Reset                   0    u    1    1
    Brown-out Detect                 1    0    1    1
    WDT Reset                        u    u    0    u
    WDT Wake-up                      u    u    0    0
    MCLR Reset (during Sleep)        u    u    1    0
    MCLR Reset (normal operation)    u    u    u    u

These bits must be checked at the start of your program before any instruction is performed that could affect the status of these bits. Monitoring these bits alone still cannot ensure that your system hasn’t been compromised. Rather, by monitoring these bits, you are adding an additional layer of protection to your system at no hardware cost to the design.
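
The mid-range reset table can be turned into a small decision function. The sketch below is an illustration under stated assumptions, not device code: `reset_cause` and `decode_reset` are hypothetical names, and the four arguments are the bit values as read at the very start of the program. Because ‘u’ means "unchanged", the decode only works if the software wrote 1 to the POR and BOD bits after handling the previous reset, and if TO and PD were both 1 during normal operation. How the bits are actually read is device specific, so consult the data sheet.

```c
/* Hypothetical names for the rows of the mid-range reset table. */
typedef enum {
    RESET_POWER_ON,
    RESET_BROWN_OUT,
    RESET_WDT,          /* WDT time-out during normal operation */
    RESET_WDT_WAKEUP,   /* WDT time-out during Sleep */
    RESET_MCLR_SLEEP,   /* MCLR asserted during Sleep */
    RESET_MCLR_NORMAL   /* MCLR asserted during normal operation */
} reset_cause;

/* Each argument is 0 or 1, sampled before any instruction that could
 * alter the bits. Assumes POR and BOD were set to 1 by software after
 * the previous reset, so a 0 means the event has just happened. */
static reset_cause decode_reset(unsigned por, unsigned bod,
                                unsigned to, unsigned pd)
{
    if (!por) {
        return RESET_POWER_ON;      /* POR = 0; BOD is not meaningful here */
    }
    if (!bod) {
        return RESET_BROWN_OUT;     /* POR = 1, BOD = 0 */
    }
    if (!to) {
        /* Watchdog expired; PD tells whether the part was asleep. */
        return pd ? RESET_WDT : RESET_WDT_WAKEUP;
    }
    if (!pd) {
        return RESET_MCLR_SLEEP;    /* TO = 1, PD = 0 */
    }
    return RESET_MCLR_NORMAL;       /* nothing changed: all bits still 1 */
}
```

The order of the tests follows the table from top to bottom, so the power-on case is checked first, since a power-on reset leaves BOD unchanged. A counter per cause, kept in memory that survives the reset, would give the reset statistics that the article suggests using to gauge the EFT environment.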
Detection and Recovery

The most important requirement for any defensive software technique is to search for (detect) the failing mechanism. Once the failing mechanism is detected, it may be easier to rescue (recover) the system.

One of the easiest failing modes to detect is a Reset. If the code jumps to a reset vector, it’s usually an indication of a Reset. Microchip’s PICmicro® devices provide various status bits to easily identify the type of reset. The article, “Power-up Detection” in this issue, explains the reset source identification in detail.

On the other hand, it’s much tougher to detect a failed RAM bit. In simplest terms, a RAM bit is a flip-flop. If a noise event flips the state of this flip-flop, you may end up with a RAM bit corruption issue.

One way to detect RAM bit corruption is to put a known data pattern in one of the RAM locations and check it for corruption. Typically, most devices have hundreds of bytes or a few Kbytes of RAM. The pattern in one RAM location may not be enough to detect a potential corruption issue within the entire RAM. Depending on the RAM size, you need to reserve more than one RAM location (a few bytes) for pattern storage and match checking. The more locations you reserve, the better the chances of catching the issue, but at a higher resource cost. I also recommend that you reserve a couple of bytes of RAM per memory bank for pattern storage.

Once RAM corruption has been detected, there are multiple ways to recover. Most of them involve some kind of reinitialization. If your code has a function that initializes all usable RAM variables and SFRs, then it should call that function and start over. Another way to recover is to do some kind of controlled Reset (RESET instruction or WDT time-out). This may put most SFRs in a known state. However, it is recommended that you refer to the PICmicro device data sheet for the SFR values in the case of such a reset. Many applications tend to initialize RAM variables at Reset. This technique is popularly known as reset-based recovery.

If you have any critical variables whose corruption can have a huge impact on software (e.g., current state information for state machine-based logic), then maintain more than one copy of the same variable in different memory banks. Then, prior to performing an operation, match the value of both locations. If they do not match, it’s an indication of corruption, and you will need to take recovery actions. The software needs to have a recovery plan that covers what to do if things go wrong and what the best way to recover is.

One possible failure mechanism is port I/O corruption. You can maintain another copy of the I/O status in RAM and compare the two to detect corruption. A PICmicro device has the capability to read its pin regardless of its definition as an input or output. This feature helps when comparing the value of the current output with the intended output. Another option is to assume that an I/O will become corrupted and restore the value quickly. In this approach, port I/Os are refreshed at regular intervals. This can be done by executing a refresh once in a main routine or on some fixed timer event.

Another, less frequent failure related to an EMC event is runaway code. If the program counter gets corrupted, the device may jump to an unwanted location. Typically, you need a very serious EMC event for this to happen. If all of the unused locations are filled with a goto $ instruction, the device will become stuck in an endless loop in the event that runaway code occurs. The Watchdog Timer can be used to safely recover from this state.

Another way to detect code misexecution is to use token passing or a subroutine counter. This is mainly used with safety-critical systems where you must ensure proper flow of a critical program, as failure to do so may have serious consequences. In the simplest implementation, successful execution of each subroutine will add a certain value to a variable. If any subroutine is not executed due to code misexecution, then the subroutine counter will contain a different value. This will help in detecting code misexecution.

We’ve discussed some ways to detect typical failing modes and ways to recover. More error correction methods and detection logic exist that offer a variety of options. For example, Cyclic Redundancy Check (CRC) logic on program/data memory can be used to check for errors. Some modified CRC logic exists that allows correction of single or multiple bit errors. Most error correction logic relies on some redundant data storage. Therefore, it may result in additional resource requirements.

Finally, a designer needs to analyze all options (both hardware and software) to find the optimum solution for their system.
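
The subroutine counter described in this article might look like the following in C. This is a minimal sketch under stated assumptions: the three subroutine names, the token values, and the helper functions are all hypothetical, and the subroutine bodies are left empty because only the flow check is being illustrated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical token values; each subroutine adds its own on success.
 * Distinct values make a skipped or repeated step change the total. */
#define TOKEN_READ_INPUTS    0x11u
#define TOKEN_COMPUTE        0x22u
#define TOKEN_WRITE_OUTPUTS  0x44u
#define TOKEN_EXPECTED \
    (TOKEN_READ_INPUTS + TOKEN_COMPUTE + TOKEN_WRITE_OUTPUTS)

static volatile uint8_t flow_counter;

/* Clear the counter at the start of each pass through the critical flow. */
static void flow_begin(void) { flow_counter = 0; }

static void read_inputs(void)
{
    /* ... application work would go here ... */
    flow_counter += TOKEN_READ_INPUTS;
}

static void compute(void)
{
    /* ... application work would go here ... */
    flow_counter += TOKEN_COMPUTE;
}

static void write_outputs(void)
{
    /* ... application work would go here ... */
    flow_counter += TOKEN_WRITE_OUTPUTS;
}

/* True only if every subroutine ran exactly once since flow_begin(). */
static bool flow_ok(void) { return flow_counter == TOKEN_EXPECTED; }
```

A main loop would call `flow_begin()`, run the three subroutines, and then test `flow_ok()` before committing any critical action. If the test fails, the code takes its recovery path, for example a controlled Reset. A plain sum does not detect subroutines that run in the wrong order. If ordering matters, each step can check the counter value it expects on entry before adding its own token.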
04/20/05