Acceptable Failure

by Dr. Howard Johnson. First printed in EDN magazine, March, 2000

It was midnight in the middle of a blizzard. My garage door opener had failed. It was snowing heavily. The thermometer read 15°F. I felt like an idiot standing there in the dark chipping ice off the front-door handle and trying to get inside without waking anyone.

The next morning the opener worked perfectly. It took me days to figure out that cold weather had caused the problem. Whenever I leave my truck outside for a long time, the transponder gets cold, and the transmitting frequency drifts off-center. That drift causes the opener to malfunction; however, the trip home is usually long enough to warm up the transmitting oscillator, so it works by the time I get to my garage door. The night of the blizzard, the truck was parked at the barn a few hundred yards from the house. The trip home was so short that the oscillator never warmed up.

If this had been an isolated incident, I wouldn't have to worry about it. Unfortunately, my magic crystal ball tells me that, as electronic control becomes more pervasive, I will encounter more and more erratic, unreliable, and unresponsive systems.

I'm not asking for everything to be perfect. You certainly know as well as I that certain unavoidable failure modes afflict all electronic systems. Common examples include the probability of metastable failure that can happen when sampling an asynchronous input signal, the probability of "bit loss" in a big memory system, and the probability of undetected packet errors that can occur in a noisy communications channel.

The standard cures for these problems can reduce the rate of failure but cannot eliminate it. In the case of metastability, chaining together several synchronizing registers slashes but does not eliminate the probability of metastable failure. Similarly, even the best RAM test-and-self-repair strategies merely lessen bit loss. Longer and more sophisticated checksums only reduce the probability of undetected packet errors. In every case, regardless of how much failure-rate protection you use, a small but quantifiable residual probability of failure still exists. That brings me to an important point: Without clearly quantified limits on the "acceptable probability of failure," you never know whether you have implemented too little or too much of your favorite failure-rate cure.
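
To put rough numbers on the metastability case, here is a minimal sketch in Python. It assumes the commonly used synchronizer model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), and it further assumes, as a simplification, that each register in the chain contributes a fixed amount of resolution time. Every device value below (tau, the metastability window, the clock and data rates, the slack per stage) is a hypothetical placeholder, not a figure from any real part.

```python
import math

def synchronizer_mtbf_s(resolve_time_s, tau_s, window_s, clock_hz, data_hz):
    """Mean time between metastable failures, in seconds.

    Uses the common model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data).
    """
    return math.exp(resolve_time_s / tau_s) / (window_s * clock_hz * data_hz)

# Hypothetical values chosen only for illustration.
TAU_S = 200e-12            # flip-flop resolution time constant
WINDOW_S = 100e-12         # metastability window T_w
CLOCK_HZ = 400e6           # sampling clock rate
DATA_HZ = 1e6              # rate of asynchronous input transitions
SLACK_PER_STAGE_S = 2e-9   # resolution time each register adds (simplified)

def mtbf_for_stages(stages):
    """MTBF when `stages` registers are chained, under the assumptions above."""
    resolve_time_s = stages * SLACK_PER_STAGE_S
    return synchronizer_mtbf_s(resolve_time_s, TAU_S, WINDOW_S, CLOCK_HZ, DATA_HZ)

for stages in (1, 2, 3, 4):
    print(f"{stages} stage(s): MTBF = {mtbf_for_stages(stages):.3g} s")
```

Under this model each added register multiplies the MTBF by the same large factor, exp(slack / tau), yet the result stays finite no matter how many registers you add.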

Defining an acceptable probability of failure is difficult because it depends greatly upon your application. For example, let's imagine that you design remote-control subsystems for big-screen TVs. You might have heard about metastability. You might be concerned that some couch potato, somewhere in the United States, would pick up his remote control one night and press "Channel 4." The remote-control signal might propagate through the air, arriving at the TV receiver at just the wrong instant, causing a metastable failure of the channel-changing logic. The TV might switch to Channel 5 instead of Channel 4.

What would happen in this scenario? Probably the viewer would grab his beer, take a deep slug, and mash "Channel 4" again—this time, a little harder. This time, the TV would probably do the right thing. As you can see, the consequence of failure in this system is small. The human in the loop gets the system back on track when it fails, and the customer's expectations for this kind of product are low anyway. No harm done.

On the other hand, people who work in the fields of medical electronics, avionics, or nuclear controls have a different feeling about what constitutes adequate reliability. In those industries, typical designers might reason that if everyone on the planet bought their product and if everyone ran the product at full speed with test diagnostics continuously for 24 hours a day, the designers wouldn't want to see a metastable failure anywhere within, say, the first million years. This high level of reliability might sound ridiculous to you, but maybe it isn't so far-fetched. If you can get this level of reliability by using just a couple of synchronizing registers, why not?
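
The "everyone on the planet for a million years" target can be turned into a per-unit requirement with a few lines of arithmetic. The sketch below assumes independent units with a constant failure rate, so a fleet of N units sees its first failure, on average, N times sooner than a single unit would. It also assumes the common synchronizer model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data). The population of 6 billion and all device values are illustrative assumptions, not data from any real product.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def required_unit_mtbf_s(units, fleet_years):
    """Per-unit MTBF needed so the whole fleet averages one failure per fleet_years."""
    return units * fleet_years * SECONDS_PER_YEAR

def required_resolve_time_s(unit_mtbf_s, tau_s, window_s, clock_hz, data_hz):
    """Invert MTBF = exp(t_r / tau) / (T_w * f_clk * f_data) to solve for t_r."""
    return tau_s * math.log(unit_mtbf_s * window_s * clock_hz * data_hz)

# Assumed fleet size and target (illustrative).
UNITS = 6e9          # roughly everyone on the planet
FLEET_YEARS = 1e6    # no failure anywhere in the first million years

# Hypothetical device values.
TAU_S = 200e-12      # flip-flop resolution time constant
WINDOW_S = 100e-12   # metastability window T_w
CLOCK_HZ = 400e6     # sampling clock rate
DATA_HZ = 1e6        # rate of asynchronous input transitions

target_s = required_unit_mtbf_s(UNITS, FLEET_YEARS)
t_r = required_resolve_time_s(target_s, TAU_S, WINDOW_S, CLOCK_HZ, DATA_HZ)

print(f"per-unit MTBF target: {target_s:.2e} s")
print(f"resolution time needed: {t_r * 1e9:.1f} ns ({t_r / TAU_S:.0f} time constants)")
```

Because the required resolution time grows only with the logarithm of the MTBF target, even this extreme goal asks for a modest number of time constants, which is why such a demanding requirement is not as far-fetched as it first sounds.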

Calculations involving probability-of-error effects are well worth developing and have a real impact on how your product functions. I encourage you to learn all you can about these little probabilistic effects and think carefully about what constitutes an acceptable probability of error for your system. Also think carefully about other failure modes that may affect system operation—things like whether a garage-door opener will work in cold, snowy conditions.