CLARiiON NVRAM

I was wondering if there was a website out there which listed some of the many non-standard RAID levels. I found one here.

It’s a good read. I’ll often follow up by diving into more specific web descriptions of the items on this list. Inevitably I’ll read a description which compares itself to CLARiiON RAID, and sometimes CLARiiON gets labeled as an “NVRAM-based solution”. This is misleading. It makes it sound like CLARiiON storage processors are loaded up with gigabytes of non-volatile RAM. That’s not true.

Historically there’s always been a small NVRAM component inside every CLARiiON product. But it’s not for performance acceleration.

It’s for data integrity acceleration.

NVRAM versus Write Cache

I’m fairly confident that the authors referring to CLARiiON are using the word NVRAM to describe the CLARiiON write cache. The CLARiiON write cache was not implemented using NVRAM parts, but it certainly does provide for performance acceleration. Exactly how would make a good blog topic.

The CLARiiON NVRAM actually pre-dates the CLARiiON write cache. As far as I know it’s been present on every version of CLARiiON hardware ever built.  It’s used for data integrity purposes.  So in essence, the CLARiiON developers gave priority to data integrity acceleration before performance acceleration.  It does no good for a storage system to be fast and wrong (or fast and broken). The write cache came later.

Interrupted Writes

Given that write operations to a RAID-5 block device end up writing data in one location and updating parity in another location, it’s clear that interrupted writes can cause parity integrity problems. Should an event such as a power failure occur, it’s quite likely that in-flight writes won’t complete, and the parity for a given write could be incorrect. Incorrect parity, if left undetected, can lead to data loss. The “Parity Lost and Parity Regained” paper points out that in some implementations it can lead to data corruption. The longer that incorrect parity lingers, the more chance for problems to occur.
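To make the interrupted-write problem concrete, here is a minimal Python sketch. It assumes a toy stripe of three data blocks plus one XOR parity block; the block contents and the helper name xor_blocks are invented for illustration and are not taken from any CLARiiON code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# A toy stripe: three data blocks plus one parity block (illustrative values).
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)

# Interrupted write: new data lands on disk 0, but power fails
# before the matching parity update is written.
data[0] = b"\xff\xff"
parity_is_consistent = xor_blocks(data) == parity

# If disk 1 later fails, rebuilding it from the stale parity
# produces contents that were never written to it.
rebuilt_disk1 = xor_blocks([data[0], data[2], parity])
rebuild_is_correct = rebuilt_disk1 == b"\x10\x20"

print(parity_is_consistent, rebuild_is_correct)
```

Both flags come out False: the parity no longer matches the data, and a rebuild that trusts it returns the wrong block. Nothing looks wrong until a disk is actually lost, which is why stale parity is dangerous the longer it lingers.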

Scrubbing can fix incorrect parity, but not right away. It’s better to fix the incorrect parity as soon as possible after a failure. Enter the NVRAM, a persistent memory part that can retain meta-data through a power failure. Its role is to accelerate the correction of interrupted writes.

Data Integrity Acceleration

A design decision was made to insert a small entry into an NVRAM just before a new write was committed to disk. This entry consisted of (a) the LUN number, (b) the block address, and (c) the size of the write. When both data and parity were successfully updated, the entry was removed from the NVRAM. If a power failure or other type of system failure occurred, the entry would remain in the NVRAM.
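The entry lifecycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names IntentEntry, IntentLog, raid5_write and PowerFailure are hypothetical, an in-memory set stands in for the NVRAM part, and the disk writes are simulated with callables.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IntentEntry:
    lun: int      # (a) the LUN number
    block: int    # (b) the block address
    size: int     # (c) the size of the write, in blocks

class IntentLog:
    """Stand-in for the NVRAM: a small set of entries that outlives a crash."""
    def __init__(self):
        self.entries = set()

class PowerFailure(Exception):
    """Simulated failure in the middle of a write."""

def raid5_write(log, entry, write_data, write_parity):
    log.entries.add(entry)       # record intent just before committing to disk
    write_data()                 # either step may be interrupted
    write_parity()
    log.entries.discard(entry)   # data and parity both updated: remove entry

def ok():
    pass

def power_fails():
    raise PowerFailure("power lost before parity was updated")

log = IntentLog()

# A write that completes leaves nothing behind.
raid5_write(log, IntentEntry(lun=0, block=100, size=8), ok, ok)
entries_after_clean_write = len(log.entries)

# A write interrupted between data and parity leaves its entry in the log.
try:
    raid5_write(log, IntentEntry(lun=3, block=2048, size=16), ok, power_fails)
except PowerFailure:
    pass

print(entries_after_clean_write, log.entries)
```

The point of the design is that the log stays tiny: it only ever holds the writes that are in flight, so whatever survives a failure is exactly the set of regions whose parity is suspect.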

Upon system restart, this NVRAM would be examined before allowing any I/O or scrubbing to occur. The system must be smart enough to recognize that the NVRAM is authentic (e.g. the storage processor wasn’t swapped out). The NVRAM entries are traversed, parity is examined, and any inconsistencies are cleaned up before further problems have a chance to occur. I/O is then allowed to resume. Scrubbing is not required to “find” the inconsistencies.
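A rough Python sketch of that restart pass follows. It makes several simplifying assumptions that are not from the original design: each block address maps to exactly one stripe, “cleaning up” means recomputing parity from the data currently on disk, and the authenticity check is skipped. The names recover and xor_blocks and the layout of luns are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def recover(entries, luns):
    """Check parity only for stripes named by surviving entries, then clear them."""
    repaired = []
    for lun, block, size in entries:
        for addr in range(block, block + size):
            stripe = luns[lun][addr]
            expected = xor_blocks(stripe["data"])
            if stripe["parity"] != expected:
                stripe["parity"] = expected      # assumed repair policy
                repaired.append((lun, addr))
    entries.clear()                              # log is empty; I/O may resume
    return repaired

# Toy state after a crash: the stripe at address 5 has stale parity, 6 is fine.
luns = {
    0: {
        5: {"data": [b"\x0f", b"\xf0"], "parity": b"\x00"},
        6: {"data": [b"\x01", b"\x02"], "parity": b"\x03"},
    }
}
entries = [(0, 5, 2)]   # (LUN, block address, size) left behind in the log

repaired = recover(entries, luns)
print(repaired)
```

Only the two stripes covered by the surviving entry are touched, which is the acceleration: the cost of recovery scales with the number of in-flight writes at the moment of failure, not with the size of the LUN.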

I’m leaving out some details about validating that the NVRAM is correct, failover considerations, and NVRAM replacement, but I think you get the picture. NVRAM is yet another foundational data integrity feature present in CLARiiON since the beginning.

NVRAM and Disk Failures

One good question that often arises with the use of this technique revolves around disk failures.  If a disk failure occurs, followed by a power failure, won’t parity be impossible to repair? It’s a good and valid question.

In the case of disk failures CLARiiON stops putting entries into the NVRAM, because it stops updating parity.  This technique, known as “parity shedding”, will be a topic for a future post.

Steve