The Parity Jammer of Death

The qualification of CLARiiON has always included a set of quality software suites that induced just the wrong event to occur at just the wrong time. Originally called the Disk Array Qualifier (DAQ), these tests have run millions of times over nearly two decades, powering down drives, intentionally corrupting data, and purposefully shutting down systems.

CLARiiON also used a piece of hardware that was more random in nature. One of our lab techs created what he called “The Parity Jammer”. With his knowledge of the SCSI protocol, he built a device that plugged into the SCSI bus and generated all sorts of nasty SCSI signals (e.g. random SCSI bus resets). The device could run itself, and it also had buttons to manually wreak havoc if desired.

This device tormented our CLARiiON SCSI device driver guy. We internally referred to it as the “Parity Jammer of Death”. When we were qualifying the write cache, however, the parity jammer helped us find a quality problem that filtered past the SCSI device driver and all the way up to the RAID level. What was the problem?

A Lost Write.

A lost write occurs when a disk drive receives a write command, returns good status, and then completely loses the data. I first heard about lost writes (although they were called something else) at the very first RAID-5 design review I held, in 1989. Several of the disk drive experts in my organization at the time had seen such a phenomenon. It was rare, but the fact that it occurred at all was enough for me.

They advised the addition of a “read-after-write” option to confirm correct placement of the data.  So I added it to the code. But it was never really used (nor requested) by customers. As the years went by I didn’t think about it much.
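For readers who have not seen one, a read-after-write check can be sketched in a few lines. This is only an illustration of the idea, not the CLARiiON code: the SimulatedDisk class, its drop_next_write fault switch, LostWriteError, and write_with_verify are all hypothetical names invented for this example, and the sketch assumes the read is served from the media rather than from a drive cache.

```python
class LostWriteError(Exception):
    """Raised when a block read back does not match what was just written."""


class SimulatedDisk:
    """Toy in-memory block device (hypothetical, for illustration only)."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # Fault injection: when True, the next write reports success
        # but never reaches the "media".
        self.drop_next_write = False

    def write(self, lba, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block long")
        if self.drop_next_write:
            self.drop_next_write = False
            return True  # good status, data silently discarded
        self.blocks[lba] = bytes(data)
        return True

    def read(self, lba):
        return self.blocks[lba]


def write_with_verify(disk, lba, data):
    """Write one block, then read it back and compare (read-after-write)."""
    if not disk.write(lba, data):
        raise IOError(f"write to block {lba} returned bad status")
    if disk.read(lba) != data:
        # The drive said the write succeeded, but the data is not there.
        raise LostWriteError(f"lost write detected at block {lba}")
```

The cost is an extra read for every write, which is likely one reason an option like this tends to stay switched off until something goes wrong.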

Write Cache Qual, Circa 1993

When modifying CLARiiON’s software to support a write cache, we were also modifying the CLARiiON enclosure, as well as integrating the latest disk drive technology. With all of these changes happening at the same time, we relied on the DAQ and Parity Jammer more than ever. We added specific cache tests to the DAQ, and we also built several more Parity Jammers to run against a larger number of systems.

Sure enough, one of our DAQ tests started failing on systems with parity jammers attached. And when the tests started failing with the write cache turned OFF, we began to suspect that the problem wasn’t with our software. The RAID algorithms had been rock solid for several years by this point, and they also had been left relatively unchanged during the addition of the write cache.

Disk Drive Vendor Discussions

We started to have discussions with the disk drive vendors at this point. We were second sourcing the disk drives, so we asked both vendors: “Have you seen any data integrity issues with your disk drives during bizarre SCSI scenarios?” Both vendors responded in the negative. So we began thinking of ways to catch a drive “in the act”. We dedicated several CLARiiON enclosures to the effort, and soon we had isolated the problem down to a particular vendor’s drives. But we still had no “proof”.

And then I remembered the design review from 1989. We had the built-in ability to request read-after-write. So we turned it on, and set a breakpoint on the low-level detection of a “lost write”. And we hit it! This was proof that the disk was returning incorrect data.

We attached a SCSI-bus analyzer alongside the parity jammer. We couldn’t trigger the analyzer to stop tracing on any given event (it took anywhere from 30 minutes to 2 hours to duplicate), so we started up several systems and manually poised our fingers on the “stop trace” button for when we saw the breakpoint hit! We printed out the results and faxed them to the drive vendor.

Surprisingly, they already knew about the problem and were working on a fix. It turns out that when the parity jammer caused a SCSI bus reset (in a particular window) immediately after the drive returned a write acknowledgement, the data would be lost. This was disconcerting, because it indicated that the write was not making it to the media (and CLARiiON explicitly disables all disk write caching).  The vendor just couldn’t BELIEVE that we had all the conditions in place to cause the problem to surface.

In any case, the problem was fixed.

Qualification Confidence

Flushing the bugs out of our own implementation of the write cache, and then going on to find problems in disk vendor firmware, went a long way towards bolstering our opinion of overall product quality.

These tests have grown over the years, and the quality framework pointed at every new release of CLARiiON is now quite substantial. The tests are a big reason for the trusted nature of the CLARiiON brand.

Steve