Margin Testing

by Dr. Howard Johnson. First printed in EDN magazine, March 3, 2005

After reading my recent editorial about the difficulty of probing high-speed serial links ("The Perfect Probe," EDN, Oct 14, 2004), my friend JP Miller called to commiserate. It turns out he has been thinking about the same problem—how to develop reliable high-speed serial links and buses and how to test them once they are manufactured.

JP, what piqued your interest in chip testing?

It all began with one simple question: How can you tell whether you properly designed a system, whether you properly built it, and whether it continues to work properly in the field? That question got me thinking about how noise propagates between interfaces through electrical crosstalk, ground bounce, power-supply noise, and certain forms of software interference. My experience has taught me that testing a link in isolation is never sufficient; you must test links in combination with other noise sources. For example, suppose you have two chips talking with, say, PCI Express ports, but when one chip exercises its DRAM interface, it induces additional power-system noise, causing errors in the other. At high speeds, everything can become linked.

How do you propose isolating the sources and sensitivities to noise, so you can design robust systems?

I propose a three-point plan of attack. First, every chip must accumulate a meaningful record of conditions internal to the chip. This record includes the number of errors you observe, errors you correct, and errors that various filters categorize. These counts presume that your chip incorporates error detection. During development testing, provide a chip input that accurately starts and stops these error counts, so you can correlate the error counts against asynchronous events elsewhere in the system.
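To make the gated-counter idea concrete, here is a minimal software model of it. It is only a sketch: the class name, the field names, and the error categories ("crc", "framing") are invented for illustration and do not describe any particular chip. The gate flag stands in for the external start/stop input, so counts accumulate only during the window you want to correlate with some other system event.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ErrorCounters:
    """Hypothetical model of on-chip error counters with a gate input."""
    gate_open: bool = False   # models the external start/stop pin
    detected: int = 0         # every error seen while the gate is open
    corrected: int = 0        # subset of detected errors that were corrected
    by_category: Counter = field(default_factory=Counter)

    def set_gate(self, is_open: bool) -> None:
        """Open or close the counting window."""
        self.gate_open = is_open

    def record(self, category: str, corrected: bool) -> None:
        """Count one error event, but only while the gate is open."""
        if not self.gate_open:
            return
        self.detected += 1
        if corrected:
            self.corrected += 1
        self.by_category[category] += 1


counters = ErrorCounters()
counters.record("crc", corrected=True)       # gate closed: ignored
counters.set_gate(True)                      # e.g. start of a DRAM burst
counters.record("crc", corrected=True)
counters.record("framing", corrected=False)
counters.set_gate(False)                     # end of the burst
counters.record("crc", corrected=False)      # gate closed: ignored
```

Only the two events inside the gated window are counted, which is what lets you attribute them to whatever else the system was doing during that window.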

Second, convey your error indications through observable interfaces in a flexible, register-controlled manner. Implement a real-time-hardware error-indicator pin that triggers your scope when errors occur. Temporarily disable your error-correction systems to allow software more visibility into link performance. In general, open up the chip architecture so you can directly control hidden layers at the bottom of the protocol stack. To facilitate link testing, equip all your chips with special test patterns for receiver testing in the face of worst-case adjacent-channel noise.

Third, provide on-chip features for stress-testing every link. A simple register structure can kick off test one, test two, and test three and then report the results. The monitoring software need not know how the tests are conducted or what they mean, only that the three tests ran and what numerical results they produced.
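One way to picture such a register structure is the sketch below. Everything in it is an assumption made for illustration: the register names and offsets (TEST_SELECT, TEST_STATUS, TEST_RESULT), the simulated chip, and the canned result values. The point is only that the monitoring loop treats each test as an opaque number in, number out.

```python
class SimulatedLinkChip:
    """Stand-in for a chip; register names and addresses are hypothetical."""
    TEST_SELECT = 0x00   # write a test number here to start that test
    TEST_STATUS = 0x04   # reads 1 when the result is ready
    TEST_RESULT = 0x08   # numerical result of the most recent test

    def __init__(self, canned_results):
        self._canned = dict(canned_results)
        self._regs = {self.TEST_SELECT: 0, self.TEST_STATUS: 0, self.TEST_RESULT: 0}

    def write(self, addr, value):
        self._regs[addr] = value
        if addr == self.TEST_SELECT:
            # A real chip would run the test in hardware; here it finishes at once.
            self._regs[self.TEST_RESULT] = self._canned[value]
            self._regs[self.TEST_STATUS] = 1

    def read(self, addr):
        return self._regs[addr]


def run_tests(chip, test_ids, max_polls=1000):
    """Start each test, poll for completion, and collect the raw results."""
    results = {}
    for test_id in test_ids:
        chip.write(chip.TEST_SELECT, test_id)
        for _ in range(max_polls):
            if chip.read(chip.TEST_STATUS) == 1:
                break
        else:
            raise TimeoutError(f"test {test_id} did not complete")
        results[test_id] = chip.read(chip.TEST_RESULT)
    return results


chip = SimulatedLinkChip({1: 37, 2: 12, 3: 5})   # made-up result values
results = run_tests(chip, [1, 2, 3])
```

Note that run_tests never interprets the numbers. It only logs them, so the same monitoring code keeps working if the chip designers later change what a given test measures.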

Fruitful possibilities for stress testing include moving the receiver thresholds, skewing the internal clock, reducing the transmitted-signal amplitude, and messing up equalizer settings. Stress testing continuously increases a level of stress until the system begins to fail and records the maximum level of safe stress the system can handle.
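The sweep described here can be sketched in a few lines. This is an illustration under stated assumptions, not a prescribed procedure: link_passes is a hypothetical hook that applies one stress setting (for example, a receiver-threshold offset) and reports whether the link ran error-free, and the stress units are arbitrary and uncalibrated.

```python
def find_margin(link_passes, stress_levels):
    """Return the highest stress level that passed before the first failure.

    link_passes:   callable taking a stress level and returning True if the
                   link ran error-free at that level (hypothetical hook).
    stress_levels: increasing sequence of stress settings, arbitrary units.
    Returns None if the link fails even at the first level.
    """
    last_safe = None
    for level in stress_levels:
        if not link_passes(level):
            break               # first failure: stop increasing the stress
        last_safe = level
    return last_safe


def simulated_link(offset):
    """Pretend link that tolerates up to 23 units of threshold offset."""
    return offset <= 23


# Sweep the offset from 0 to 60 in steps of 4 units.
margin = find_margin(simulated_link, range(0, 64, 4))
```

With a step of 4 units, the sweep reports 20 as the last safe setting even though the simulated link actually tolerates 23. The step size sets the resolution of the margin figure, so a coarse sweep understates the margin slightly but never overstates it.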

I like stress tests. Simple go/no-go testing merely indicates when you have fallen off a cliff. Stress testing reveals how close you are to the edge.

Even a simple stress test provides useful feedback. Many industries use non-calibrated stress testing to great effect.

Yes, here's an example from personal experience. I have a 5.4-GHz wireless-Internet-access dish on the top of my barn, exposed to the weather. It points straight across the river valley to my access point. My wireless-Internet-service installer, Larry, tells me that my dish shows 400 units of margin on a clear day, but that it drops to only 150 units when the dish gets partially iced up, as it was yesterday when he came out to check it. Around 100 units is generally the minimum workable amount. Nobody can say what a unit of margin is, but the tests clearly indicate that ice is a major factor in performance. A little roof over the antenna partially shielding it from snow and ice would probably improve the reliability of my service.

Precisely my point. Simple margin testing can be very effective. If the guys down at Larry's TV shop can do that, surely we digital folks can learn the same trick.

JP Miller works on advanced systems architecture at Hewlett Packard. He was an early engineer at Compaq.