Back To The Future

I recently discussed the “Parity Lost and Parity Regained” paper from the FAST conference. At the end of my post I stated that I wished I had a time machine to take the concepts from the paper and bring them back to the past.

In this post I’ll do the reverse. I’ll carry ideas forward from the past so that they can be added to a future paper. The current paper accurately models several important protection techniques and failure conditions. As CLARiiON was being built, I ran into many of these, and then some.

What are some additional protection techniques and error scenarios from the 80s that can be added to the modeling techniques used in 2008?

Let’s take a ride in the DeLorean.

“Simple” Disk Errors

There’s not a lot I disagree with in the paper, but there is a sentence in the introduction that might mislead customers. It states that “designing protection schemes to cope with disk errors is not overly challenging”. There are two clarifying points from the past that seem relevant here:

#1: The low-level disk-failure handling algorithms that underlie a RAID protection technique are an integral part of that protection. If a disk is misbehaving, any decision to “kill” or “power-down” that disk may unnecessarily move the model into a “Data Loss” state.

#2: A disk failure in a CLARiiON results in a completely different RAID protection technique being employed. In particular, when a disk fails, the CLARiiON software transitions from a “subtractive parity” technique to a “parity shedding” algorithm.

Modeling both of these situations would go a long way towards helping customers to understand what techniques may put their data at risk when a disk error occurs.

Modeling Power Failures and System Panics

Most customers understand that power failures can leave RAID5 parity disks in an inconsistent state. System panics can also leave parity disks in an inconsistent state. These types of events would be very valuable additions to the model. Additionally, there are a variety of ways to recover from these types of failures, and each of them has different ramifications when it comes to data integrity. Here is a list of questions that were identified at a very early stage of RAID development:

  • What detection algorithm(s) are used to identify that parity is indeed corrupt?
  • What cleanup technique(s) are used to identify where parity is corrupt?
  • If scrubbing is used, what techniques are used for customer writes to scrubbed versus non-scrubbed areas?

All of these techniques are interrelated, which makes the resulting model non-trivial. But a model should be built nonetheless, because power failures and other types of system failures are common.
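To make the inconsistency concrete, here is a minimal sketch of my own. It is not taken from the paper and it is not CLARiiON code. It assumes plain XOR parity over one-byte blocks, and the helper names (xor_blocks, parity_is_consistent) are invented for illustration. It shows a write that reaches a data disk just before the power fails, so the parity update never happens.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_is_consistent(data_blocks, parity: bytes) -> bool:
    """Detection by brute force: recompute parity and compare."""
    return xor_blocks(*data_blocks) == parity


# A three-data-disk stripe with one parity block (illustrative values).
data = [b"\x01", b"\x02", b"\x04"]
parity = xor_blocks(*data)
consistent_before = parity_is_consistent(data, parity)

# New data lands on disk 0, then power fails before parity is rewritten.
data[0] = b"\x08"
consistent_after_crash = parity_is_consistent(data, parity)

# If disk 1 now fails, rebuilding it from the stale parity returns garbage.
original_disk1 = b"\x02"
rebuilt_disk1 = xor_blocks(parity, data[0], data[2])

print(consistent_before, consistent_after_crash, rebuilt_disk1 == original_disk1)
```

The recompute-and-compare check is only one possible detection approach. It says nothing about where to look after a crash, which is exactly what the questions above are asking.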

Parallel Operations in a RAID Stripe

The model in the paper describes single points of failure for a given read or write operation. The primary goal of RAID, however, is to allow for multiple parallel operations to occur on all disks.  This is not easy.  RAID vendors have multiple choices for implementing parallel operations.

One possibility is “stripe locking”. This minimizes the potential for data integrity issues, but it can also hurt performance. There are other techniques that can be modeled. For example, a “read and lock” approach can be employed. Similarly, a transition from “subtractive” to “additive” parity can occur when failures happen in the midst of parallel operations. Modeling all of these choices is important. Showing which ones minimize or eliminate data corruption or loss is critical.
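For readers who have not seen the two parity styles side by side, here is a small sketch. It assumes plain XOR parity, and the function names and sample bytes are mine, chosen for illustration rather than taken from any product. The subtractive update reads the old data and old parity; the additive update reads the other data blocks in the stripe instead.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def subtractive_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Remove the old data from parity, then fold in the new data."""
    return xor_blocks(old_parity, old_data, new_data)


def additive_parity(new_data: bytes, other_data) -> bytes:
    """Recompute parity from the new data plus every other data block."""
    return xor_blocks(new_data, *other_data)


# Illustrative stripe: three data blocks and their parity.
stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(*stripe)

# Overwrite block 0 and compute the new parity both ways.
new_block = b"\xaa\x55"
sub = subtractive_parity(parity, stripe[0], new_block)
add = additive_parity(new_block, stripe[1:])
parities_match = sub == add
print(parities_match)
```

Both paths agree only when the old data and old parity on disk are trustworthy. If a failure has made either one unreadable or stale, the subtractive path has nothing valid to subtract from, which is one reason an implementation might switch to the additive path mid-flight.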

Protection Techniques During Disk Rebuild

I mentioned that customers should be wary of viewing full disk errors as “simple”. The same can be said about “disk replacement” or “hot spares”. The instant that a disk in a RAID set begins rebuilding, RAID implementors have a variety of choices on how to (a) perform the actual rebuild, and (b) perform read and write operations during the rebuild. The choices vary depending on whether a write request maps before, after, or across the current rebuild checkpoint.
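As a rough sketch of that three-way split, the snippet below classifies a request against a rebuild checkpoint. Everything here is hypothetical: the function name, the block-address arithmetic, and the assumption that the rebuild proceeds from low addresses to high. What an implementation actually does in each case is the design choice that needs modeling.

```python
def classify_against_checkpoint(start: int, length: int, checkpoint: int) -> str:
    """Place a request relative to the rebuild checkpoint.

    Assumes blocks below `checkpoint` are already rebuilt and blocks at or
    above it are not (a low-to-high rebuild, which is an assumption).
    """
    end = start + length  # exclusive end of the request
    if end <= checkpoint:
        return "rebuilt region"
    if start >= checkpoint:
        return "not yet rebuilt"
    return "straddles checkpoint"


checkpoint = 1000
before = classify_against_checkpoint(200, 64, checkpoint)
after = classify_against_checkpoint(4096, 64, checkpoint)
across = classify_against_checkpoint(990, 64, checkpoint)
print(before, after, across, sep=" | ")
```

The straddling case is the awkward one, since the checkpoint may also advance while the request is in flight.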

Disk rebuild protection algorithms also overlap with techniques for “parallel operations in a RAID stripe”. Writes or reads that are occurring during a rebuild operation can use additive or subtractive techniques, but they may also choose whole-stripe techniques, depending on the location of the rebuild checkpoint. The choice of algorithm could in theory corrupt the rebuild, or propagate data loss. It’s another area that could use a model.

Protection Techniques for Cached Writes

Write caching is clearly a favorite with customers who wish to avoid the RAID-5 write penalty. Storing newly written data in secure cache memory allows an application to see much better response times. The data in cached memory can be put onto the RAID device at a later time.

Write caches can be implemented using a variety of different techniques. All of these techniques should be modeled, and the resulting state transitions should be understood.

Grad School!!! Modeling Double Failures

The FAST paper clearly states up front (in the abstract) that the research is exploring “single error conditions”. This is a great place to start, because it yields a concise and clear picture of a complex problem.

And I hope it can be extended to cover double failures as well.

Because in my experience it’s the double failures that separate the weak from the strong.

There are protection techniques that will absolutely lose data if a second failure occurs.  There are other techniques that won’t.  And this makes all the difference to a customer.  Consider the case where a RAID set is running in a degraded mode (a disk has failed).  Imagine that a second disk fails (for whatever reason).  Now imagine a customer was able to revive the second disk (e.g. pull it out, pray, plug it back in).

Wouldn’t the customer want a protection technique that guarantees that no data was lost?  Well, which technique would that be?

Let’s model it and find out.

Steve