In a previous post I highlighted the need for the CLARiiON product to add a write cache to address the use case of restore from tape. The slow restore speeds accelerated a decision that we knew we’d have to make anyway. In order to increase adoption of RAID-5 technology we’d have to add a write cache. It would eventually become table stakes.
Building a write cache in a disk array ranks right up there as a “hard thing to do”. When discussing the importance of customer data, a write cache can be a tough sell. Not only was there a need for a quality design but a quality test suite as well.
So how did we go about building the CLARiiON write cache?
Let’s step through the decisions.
Mirror Mirror
The first decision we made was to mirror any write request to CLARiiON’s peer storage processor. This second processor was not only redundant, but it was also active. Once the new application data had been written into local memory and safely sent to the peer, we could notify the application that the data was safely stored.
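To make the ordering concrete, here is a minimal Python sketch of the "store locally, mirror to the peer, then acknowledge" sequence described above. It is a toy model, not FLARE code: the `StorageProcessor` class, its `write` and `receive_mirror` methods, and the `"ACK"` return value are all hypothetical names chosen for illustration.

```python
class StorageProcessor:
    """Toy model of one SP; a dict stands in for cache memory (assumption)."""

    def __init__(self, name):
        self.name = name
        self.cache = {}      # logical block address -> data
        self.peer = None     # the other SP, set by pair()
        self.healthy = True

    def receive_mirror(self, lba, data):
        # Peer side: store the mirrored copy, or fail if this SP is down.
        if not self.healthy:
            raise ConnectionError(f"{self.name} is not available")
        self.cache[lba] = data

    def write(self, lba, data):
        # 1. Place the new data in local memory.
        self.cache[lba] = data
        # 2. Send a copy to the peer SP; an exception here means no ACK.
        self.peer.receive_mirror(lba, data)
        # 3. Only now tell the application the data is safely stored.
        return "ACK"


def pair(a, b):
    """Wire two SPs together as peers."""
    a.peer, b.peer = b, a


spa, spb = StorageProcessor("SP-A"), StorageProcessor("SP-B")
pair(spa, spb)
result = spa.write(100, b"application data")
```

The point of the ordering is that the acknowledgement is the last step: if the peer copy cannot be made, the application never hears that the write is safe.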
The implication to the customer was that they needed two healthy storage processors (SPs) in their CLARiiON unit in order to take advantage of write caching. If one of the SPs failed, the write cache would be disabled. The original dual-SP requirement for caching has changed over the years and CLARiiON has shipped a variety of single-SP caching configurations.
What was the implication for the FLARE software (CLARiiON’s microcode) itself? Well, it required significant modification to a piece of software known as “CMI”. The meaning of this acronym has changed over the years, but it originally stood for “Configuration Manager Interface”. It was a messaging system that the two storage processors used to swap information about the configuration and to request permission to perform new configuration operations. This messaging system was modified to support data transfer from one physical location to another. Significant changes to the back-end SCSI driver were also made to support these transfers.
The final piece of the mirroring solution required CLARiiON to add more memory to its storage processors. Sizes were typically multiples of 2 GB.
Batteries Required
The second decision that we made was to change the hardware to support a battery-backup unit (BBU). Mirroring customer data across storage processors provided redundancy against storage processor failures but not power failures. If power failed, the BBU needed to power FLARE long enough for FLARE to protect the data (I’ll cover this below).
FLARE’s environmental monitoring code needed to be modified to read the signals on the BBU. Healthy BBU status became a pre-requisite for caching (in addition to dual-SP health). FLARE also introduced a schedule of BBU tests (typically after midnight). The cache would periodically be disabled and the BBU instructed to test itself. This allowed FLARE to feel confident that the battery was not emitting “false positives”.
How long should the battery hold up FLARE? Did it need to hold up all the disks as well? These questions brought us to our next decision.
The Vault
We decided to dedicate private space on the first five disk drives in the CLARiiON as an “emergency de-stage” area for the cache. The first four drives would be a direct mapping of the cache memory onto the disks, and the fifth drive would contain RAID-3 style parity information. So if and when a power failure occurred, it was only necessary to hold up the storage processors (and the vault disks) long enough to safely store the customer’s data. Then the BBU could shut itself off.
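The four-data-plus-one-parity layout can be illustrated with a short Python sketch. This is only an illustration of XOR parity, assuming the cache image is split into four equal, zero-padded chunks; the real on-disk mapping is not described here, and the function names `vault` and `rebuild` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def vault(cache_image, data_drives=4):
    """Split the cache image across the data drives and append a parity block.

    Assumption: equal-sized contiguous chunks, zero-padded at the end.
    """
    chunk = -(-len(cache_image) // data_drives)          # ceiling division
    padded = cache_image.ljust(chunk * data_drives, b"\0")
    data = [padded[i * chunk:(i + 1) * chunk] for i in range(data_drives)]
    return data + [xor_blocks(data)]                     # fifth "drive" is parity


def rebuild(drives, lost):
    """Recover the contents of one lost drive from the four survivors."""
    survivors = [d for i, d in enumerate(drives) if i != lost]
    return xor_blocks(survivors)


image = b"customer data in cache!"
drives = vault(image)
recovered = rebuild(drives, lost=2)
```

Because parity is the XOR of the four data chunks, any single vault drive (data or parity) can be regenerated from the other four, which is what makes the vault tolerant of one drive failure.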
The reason for this “vaulting” technique is related to the reason we added the cache in the first place. If a 16 GB write cache was full of random application data destined for a RAID-5 configuration, it could take a LOOOOONG time to put the data where it ultimately belongs. A worst case scenario could take minutes. There was no cost-effective way to ship a BBU that could power all of the disks until the cache was empty.
On a related note, if one of the storage processors failed, it’s much safer to quickly vault (and thus protect) the write cache before some other quirky event might occur (e.g. software/hardware failure of the surviving SP).
The vault resulted in a new piece of software being developed for FLARE known as the ATM. I believe it is not an acronym but a wordplay on “cache”, “cash”, “vaulting”, and “deposits/withdrawals”. This piece of software was responsible for monitoring the three conditions for write caching: (1) healthy dual-SP, (2) fully charged battery, (3) presence of 5 healthy vault drives. When any one of those entities failed, the ATM was responsible for generating the RAID-3 parity and vaulting the contents to disk. The ATM was also involved in restoring the contents of the vault to SP memory during boot.
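The three conditions the ATM watched can be written down as a simple predicate. The Python sketch below is hypothetical: `SystemHealth`, its field names, and both functions are invented for illustration and do not reflect FLARE's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class SystemHealth:
    """Hypothetical snapshot of what the ATM monitors."""
    healthy_sps: int = 2
    battery_charged: bool = True
    healthy_vault_drives: int = 5


def failed_conditions(h):
    """Return a list naming each write-caching condition that is not met."""
    failures = []
    if h.healthy_sps < 2:
        failures.append("dual-SP")
    if not h.battery_charged:
        failures.append("battery")
    if h.healthy_vault_drives < 5:
        failures.append("vault drives")
    return failures


def caching_allowed(h):
    """Write caching requires all three conditions to hold at once."""
    return not failed_conditions(h)


ok = caching_allowed(SystemHealth())
degraded = failed_conditions(SystemHealth(healthy_sps=1, battery_charged=False))
```

In this toy model, any non-empty result from `failed_conditions` would be the trigger for generating parity and vaulting the cache contents.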
A State Machine (of course)
One of the more influential design decisions we made in the RAID-5 implementation was to use a state machine approach. This worked well because it broke down a complex problem into a series of states that were manageable (and testable). We decided to use the same technique for the design of the CLARiiON write cache.
We started with two obvious states: ENABLED and DISABLED. The ENABLED state meant that all of the pre-conditions for write caching were satisfied (healthy SPs, BBU, and vault disks). The DISABLED state meant that either the user had turned caching OFF, or that one or more pre-conditions for write caching were not satisfied.
The design progressed by identifying the events that transitioned the write cache between the enabled and disabled states. Clearly a change in the health of the system might cause a transition. Along the way we discovered the need for intermediate states like ENABLING, QUIESCING, and VAULTING. As application writes entered the CLARiiON system, the “state” of the write cache would influence how the request was handled. We documented all the states and all the events.
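A table-driven state machine along these lines might look like the Python sketch below. Only the five state names come from the description above; the event names, the specific transitions, and the way `handle_write` treats each state are assumptions made for illustration, not the documented FLARE design.

```python
from enum import Enum, auto


class State(Enum):
    DISABLED = auto()
    ENABLING = auto()
    ENABLED = auto()
    QUIESCING = auto()
    VAULTING = auto()


class Event(Enum):               # hypothetical event names
    PRECONDITIONS_MET = auto()
    ENABLE_DONE = auto()
    FAULT = auto()
    QUIESCE_DONE = auto()
    VAULT_DONE = auto()


# Assumed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.DISABLED, Event.PRECONDITIONS_MET): State.ENABLING,
    (State.ENABLING, Event.ENABLE_DONE): State.ENABLED,
    (State.ENABLING, Event.FAULT): State.DISABLED,
    (State.ENABLED, Event.FAULT): State.QUIESCING,
    (State.QUIESCING, Event.QUIESCE_DONE): State.VAULTING,
    (State.VAULTING, Event.VAULT_DONE): State.DISABLED,
}


class WriteCache:
    def __init__(self):
        self.state = State.DISABLED

    def on_event(self, event):
        # Events with no entry for the current state are ignored.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state

    def handle_write(self):
        # Assumed policy: cache when ENABLED, bypass the cache when
        # DISABLED, and hold the request during intermediate states.
        if self.state is State.ENABLED:
            return "cache"
        if self.state is State.DISABLED:
            return "write-through"
        return "hold"


cache = WriteCache()
trace = [cache.on_event(e) for e in (Event.PRECONDITIONS_MET, Event.ENABLE_DONE)]
```

Keeping the transitions in one table makes every (state, event) pair something a test harness can enumerate, which fits the emphasis on testability.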
Finally we relied on another successful technique from the past: we updated the disk array qualifier (DAQ) to test state transitions in an automated fashion.
I haven’t described the details of the cache algorithms themselves, but I plan to in future posts. They were also state-based and complex, and the DAQ was used to automatically test them thousands upon thousands of times.
Steve


Great post Steve – very informative!
One sentence out of this article raises a question my teammates and I have been debating for some time…
“The cache would periodically be disabled and the BBU instructed to test itself.”
So, on a CX3 series array, does write cache actually get disabled if/when a single SPS fails?
Thanks!
Dave,
Great question. You correctly point out that my historical blogging goes back to the days of only one BBU, whereas today’s product uses two SPS (standby power supplies).
FLARE’s behavior in today’s systems is to keep the cache enabled when one of the SPS faults or tests itself.
Cheers,
Steve
Hi Steve,
I don’t know much about Clariion so this might be a silly question. Is the Clariion write cache actually a separate hardware cache, or is it part of the main memory of the storage processors?
And when there is a power failure, the BBU shuts down everything else but the SPs and the first 5 disks, is that so?
Anand
Anand,
The CLARiiON write cache is part of the main memory of the storage processors. So FLARE has to manage carving the memory between caching usage and FLARE’s operational usage.
Regarding the BBU, as designed it needs to keep up the storage processors and 5 disks only. Keep in mind that today’s CLARiiON systems use two SPS devices (standby power supplies) that operate in a similar fashion.
Steve
I suppose the 2 GB “module” for the CLARiiON dual storage processor is from more modern days than 1994, isn’t it? Just curious about the real number…
Hi Andres,
Yes, you are correct. The latest CLARiiON to ship in August of 2008 has a maximum of 32 GB of cache.
Regards,
Steve