Testing CLARiiON’s Write Cache

This is my third post in a series of blogs regarding the origins of the CLARiiON Write Cache feature. The development of the write cache algorithms followed a similar path to the development of the RAID-5 algorithms. For RAID-5 we employed state machines, and to test these state machines we created a unique piece of software called the Disk Array Qualifier, or the DAQ.

“Design for test” was central to building the write cache solution. I had sleepless nights designing the RAID algorithms; for the write cache, those nights were accompanied by sleepless mornings. The worry was that holes in the implementation could result in data loss or data unavailability.

The end result is well known: hundreds of thousands of write-cache-enabled CLARiiONs have shipped and are running around the world.

This post describes the foundational quality tests for the CLARiiON write cache.

Patient DAQ

The DAQ is famous for fault insertion. It can kill drives and corrupt data on disk, for example. These techniques were used against the cache algorithms for sure. But the DAQ also needed to become more patient and wait for power-cycling to finish and then verify that the data dumped to the CLARiiON vault was retrieved, available, and correct. We initially had a minimum of four different systems in order to test the different page sizes of the write cache (the write cache memory was configurable into “pages” of 2K, 4K, 8K, and 16K). A whole suite of “friendly” tests (some in the DAQ, some in other tools) focused on testing the cache without inserting failure conditions:

  • infinite write-read-verify tests: these assumed that a person or an automated power-cycling device was in the background turning the CLARiiON off and on, over and over again.
  • dual-SP tests: the CLARiiON write cache was both mirrored and active. Tests would run simultaneously against each storage processor (SP).
  • alternating-path tests: the CLARiiON microcode (FLARE) could sense when one SP was “overworked” and dynamically dedicate up to 85% of its cache to one side. Alternating traffic between the SPs would test these “page rebalancing” algorithms.
  • trespass tests: the CLARiiON LUN ownership model allowed access to a LUN from only one SP; if the LUN was moved to the peer SP (a la PowerPath failover), FLARE had to ensure that the LUN’s mirrored data became owned by the peer.
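
As an illustration only (this is not DAQ code), a write-read-verify pass can be sketched in a few lines. The FakeLun class, the 512-byte block size, and the hash-derived pattern are all assumptions made for the sketch; a real test would issue I/O to the array and keep running across power cycles.

```python
import hashlib

BLOCK_SIZE = 512  # bytes per block; an assumption for this sketch


def pattern_for(lba: int, generation: int) -> bytes:
    """Deterministic block contents derived from the block address and pass number."""
    digest = hashlib.sha256(f"{lba}:{generation}".encode()).digest()  # 32 bytes
    return (digest * (BLOCK_SIZE // len(digest)))[:BLOCK_SIZE]


class FakeLun:
    """Hypothetical in-memory stand-in for a LUN."""

    def __init__(self, blocks: int):
        self.data = [bytes(BLOCK_SIZE)] * blocks

    def write(self, lba: int, payload: bytes) -> None:
        self.data[lba] = payload

    def read(self, lba: int) -> bytes:
        return self.data[lba]


def write_read_verify(lun, blocks: int, passes: int) -> list:
    """Write every block, read it back, and return (pass, lba) pairs that mismatched."""
    mismatches = []
    for generation in range(passes):
        for lba in range(blocks):
            lun.write(lba, pattern_for(lba, generation))
        for lba in range(blocks):
            if lun.read(lba) != pattern_for(lba, generation):
                mismatches.append((generation, lba))
    return mismatches
```

Because the expected contents can be recomputed from the block address and pass number alone, the verifier needs no saved copy of what it wrote, which is convenient when the system under test is being power cycled.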

I can still remember the size and texture of the power buttons on the early versions of CLARiiON. Many of us would run the DAQ against the systems and walk around shutting the power off over and over again.

Faults

The boundary and failure conditions were the things that kept us up at night (as well as “in early”). In a previous post I mentioned that the original CLARiiON write cache could only be used when three preconditions were satisfied: (1) two healthy storage processors, (2) five healthy “vault” disks, and (3) a fully charged battery backup unit (BBU). A set of tests was written that performed the following types of fault insertions:

  • Vault drive failures – the vault is the “fast destage” area for CLARiiON’s cache. It required five disks. If one of them failed, then it wasn’t safe to continue caching. Continually powering off a vault disk, and then re-enabling it, all while I/O was running to the array, would continually enable and disable the cache.
  • Vault failures and PowerFail – after powering off a vault drive, it was also important to kill power. This tested that FLARE correctly implemented the vault as a RAID-3 device. Four of the disks would contain the actual cache pages, and the fifth would contain a parity calculation generated from the cache memory. Similarly, a vault disk could be powered off during reboot to test FLARE’s ability to read the cache from a degraded vault during boot.
  • BBU reset – FLARE required a fully charged BBU. It also tested the BBU as part of a daily schedule. We wrote a test that repeatedly caused the BBU to run diagnostics (and thus take itself off-line) and then re-enable itself after the diags completed.
  • SP reboot – given that the original caching implementation required two storage processors, it made sense to create a test which continually and disruptively rebooted alternate SPs. This tested interrupted cache transfers, heartbeats, vaulting, and the flushing of dirty pages on the surviving SP.
  • Classic DAQ – it was important to continue to torture the original RAID algorithms running underneath the cache. In theory the software boundaries between the cache and the RAID algorithms were well-defined; running the full suite of DAQ tests ensured this was the case.
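
The four-data-plus-one-parity vault layout described above can be illustrated with a small sketch. This is not FLARE code: the byte-interleaved striping, the XOR parity, and the function names are assumptions chosen to show why a dump can still be recovered when one vault disk is missing.

```python
from functools import reduce

DATA_DISKS = 4  # per the post: four disks hold cache pages, a fifth holds parity


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def dump_to_vault(cache_image: bytes) -> list:
    """Interleave the image across four data disks and append an XOR parity disk.

    Byte-level interleaving is an assumption of this sketch.
    """
    if len(cache_image) % DATA_DISKS:
        raise ValueError("image length must be a multiple of the data disk count")
    disks = [cache_image[i::DATA_DISKS] for i in range(DATA_DISKS)]
    disks.append(reduce(xor_bytes, disks))
    return disks


def restore_from_vault(disks: list) -> bytes:
    """Rebuild the cache image; at most one entry may be None (a failed disk)."""
    missing = [i for i, d in enumerate(disks) if d is None]
    if len(missing) > 1:
        raise ValueError("cannot recover from more than one failed vault disk")
    disks = list(disks)
    if missing:
        # XOR of the four surviving disks regenerates the missing one.
        survivors = [d for d in disks if d is not None]
        disks[missing[0]] = reduce(xor_bytes, survivors)
    image = bytearray(len(disks[0]) * DATA_DISKS)
    for i in range(DATA_DISKS):
        image[i::DATA_DISKS] = disks[i]
    return bytes(image)
```

A degraded-vault check in this model is simply: dump, set any one of the five entries to None, restore, and compare against the original image.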

Cache Algorithms

With all the emphasis on the boundary cases it might be easy to forget that yes, we wrote a whole new set of cache management algorithms that actually stored and retrieved data!  We experimented with write and read patterns that we knew would stress the boundary conditions in the software:

  • Cache holes: disable the write cache, write a known 8K pattern to disk, enable the write cache, write a new pattern to every other block, and then read the entire 8K. The result should be a mixture of the original pattern and the new pattern. This tests FLARE’s ability to preread and fill holes within cache pages.
  • Page spanning: write odd-size patterns that catch the tail end of a cache page, fill up the entire next page, and then stop at the beginning of the page after that.
  • Variable page sizes: run the previous two tests and test all possible page sizes.
  • Write through: the SCSI protocol contains a bit known as “FUA”, or force-unit-access. If this bit was set, the protocol dictated that the data must be written to disk before the command completed. This tested a “write-through” cache algorithm.
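
The cache-holes test, repeated across page sizes, can be modelled with a toy cache. Everything here is a sketch under stated assumptions rather than FLARE’s design: the Disk and WriteCache classes, the 512-byte block size, and the per-page set of valid blocks are all hypothetical.

```python
BLOCK = 512  # bytes per block; an assumption for this sketch


class Disk:
    """Hypothetical in-memory disk."""

    def __init__(self, size: int):
        self.data = bytearray(size)

    def write(self, offset: int, payload: bytes) -> None:
        self.data[offset:offset + len(payload)] = payload

    def read(self, offset: int, length: int) -> bytes:
        assert offset + length <= len(self.data), "read past end of disk"
        return bytes(self.data[offset:offset + length])


class WriteCache:
    """Toy write-back cache: each page tracks which of its blocks hold data."""

    def __init__(self, disk: Disk, page_size: int):
        self.disk = disk
        self.page_size = page_size
        self.pages = {}  # page number -> (buffer, set of valid block indexes)

    def _page(self, page_no: int):
        return self.pages.setdefault(page_no, (bytearray(self.page_size), set()))

    def write_block(self, offset: int, payload: bytes) -> None:
        assert offset % BLOCK == 0 and len(payload) == BLOCK
        page_no, in_page = divmod(offset, self.page_size)
        buf, valid = self._page(page_no)
        buf[in_page:in_page + BLOCK] = payload
        valid.add(in_page // BLOCK)

    def _fill_holes(self, page_no: int) -> None:
        """Preread from disk every block of the page that the cache does not hold."""
        buf, valid = self._page(page_no)
        base = page_no * self.page_size
        for idx in range(self.page_size // BLOCK):
            if idx not in valid:
                start = idx * BLOCK
                buf[start:start + BLOCK] = self.disk.read(base + start, BLOCK)
                valid.add(idx)

    def read(self, offset: int, length: int) -> bytes:
        out, pos, end = bytearray(), offset, offset + length
        while pos < end:
            page_no, in_page = divmod(pos, self.page_size)
            self._fill_holes(page_no)
            take = min(self.page_size - in_page, end - pos)
            out += self.pages[page_no][0][in_page:in_page + take]
            pos += take
        return bytes(out)


def cache_hole_test(page_size: int, region: int = 8192) -> bool:
    """Old pattern on disk, new pattern cached on every other block, read it all."""
    old, new = b"\xAA" * BLOCK, b"\x55" * BLOCK
    disk = Disk(max(region, page_size))
    for off in range(0, region, BLOCK):          # cache "disabled": write to disk
        disk.write(off, old)
    cache = WriteCache(disk, page_size)          # cache "enabled"
    for off in range(0, region, 2 * BLOCK):      # every other block
        cache.write_block(off, new)
    expected = (new + old) * (region // (2 * BLOCK))
    return cache.read(0, region) == expected
```

Calling cache_hole_test with 2048, 4096, 8192, and 16384 mirrors the idea of repeating the same test for each configurable page size.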

I hope this gives readers a good feel for the lengths that the CLARiiON team went to in performing automated, abusive, and thorough quality tests to make sure customer data was protected. It wasn’t until all these tests had been running flawlessly for months that the engineers felt comfortable releasing the product.

After the write cache shipped I was ready to move out of the FLARE team. Interest in working on FLARE was building within Data General, and some top talent was already modifying the software. I transferred out of the FLARE group and into a group that worked on software outside of the CLARiiON. More on that in a future post.

Steve

2 Comments

  1. Vinay Rao

    Does the CLARiiON throw away the read cache or write cache or both if a LUN or a target reset is performed?
    If so, could there be any corruption due to keeping the cache or throwing away the cache?
    Thanks

  2. Hey Vinay,
    No, the CLARiiON doesn’t throw anything away on any sort of reset.
    Steve
