Regen vs Rebuild

The ability of a storage system to recover from a disk failure is table stakes for block storage vendors. Customers expect fast disk rebuild times with minimal (if any) application performance disruption.

CLARiiON is no exception. Customers that have bought an EMC CLARiiON box are familiar with how fast it recovers from disk failure.

How about EMC Centera?

Why am I interested in this topic? Two reasons, really. First, customers who own both products compare how long each one takes to rebuild after a disk failure. Second, I had worked on CLARiiON rebuild, and when I joined Centera I became curious about how object-based rebuild functioned.

And I’m finding that the evolution of Centera disk regeneration is very CLARiiON-like.

While many customers find product benchmarking irrelevant to their actual configuration, they are interested in performance in the face of failures. They are buying highly available storage solutions for a reason; if a single point of failure occurs, they want their business to keep running! And “keep running” means run at a satisfactory level of performance for their application.

CLARiiON’s block-based rebuild algorithm first appeared in 1991, and for 17 years it has been continually tweaked, optimized, and improved. Customers who buy CLARiiON, or other block-based products, have gotten used to a deterministic rebuild time based on the size of the disk drive and the type of application traffic they are running.

What about Centera disk regeneration? Sometimes it’s faster. Like when the Centera has a relatively small number of objects in it.

Fill up the Centera with hundreds of millions of X-rays, or e-mails, or scanned check images, and guess what? It could take longer (than block) to rebuild a disk of the same size. How much longer? Well, that gets into an interesting topic in the world of object storage: rebuild determinism.

I Want to Know How It’s Built

In my blog, I like to talk about building software for the storage industry. Having written the original rebuild software for CLARiiON, of course I’m interested to compare the two approaches! I find disk rebuild for object-based storage a fascinating topic, because it’s interesting to hear about new ways of solving similar problems. I say similar, because block-based rebuild and object-based rebuild are not the same. But for a customer, block-based rebuild and object-based rebuild are the same. A disk fails, and the customer wants to get back to 100% protected. It’s the job of a software engineer to make disk rebuild run as fast as possible, whether it’s object or block based.

CLARiiON Rebuild

So let me start by roughly describing CLARiiON’s block-based rebuild. There’s one word I use to separate it from Centera: locality.

CLARiiON’s rebuild is local. Everything happens on one storage processor. All the disks can transfer into common buffers. All the disks involved in the rebuild are directly accessible, and the exact same block offset on those disks is read via parallelized commands.
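To make "same block offset, read in parallel" concrete, here is a minimal sketch of a local, block-based rebuild. It assumes a RAID-5-style XOR parity layout, which is an illustration and not a description of CLARiiON's actual code; the function names and the in-memory "disks" are hypothetical.

```python
from functools import reduce


def rebuild_stripe(surviving_blocks):
    """XOR equal-length blocks read from the same offset on each surviving disk."""
    if not surviving_blocks:
        raise ValueError("need at least one surviving block")
    length = len(surviving_blocks[0])
    if any(len(block) != length for block in surviving_blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*surviving_blocks))


def rebuild_disk(surviving_disks, block_size):
    """Walk the block offsets in order; the work depends only on disk size."""
    disk_len = len(surviving_disks[0])
    rebuilt = bytearray()
    for offset in range(0, disk_len, block_size):
        # Assumption: every surviving disk is directly readable at this offset.
        blocks = [disk[offset:offset + block_size] for disk in surviving_disks]
        rebuilt += rebuild_stripe(blocks)
    return bytes(rebuilt)
```

Notice that the loop never asks what is stored in the blocks. Under this assumption the amount of work is fixed by the disk capacity, which is where the deterministic rebuild time comes from.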

This rebuild has been optimized. When CLARiiON first shipped, disk capacities were smaller, and rebuilds went relatively quickly. And we learned how to make them faster over time. Multiple read requests were sent to the disk queue to facilitate elevator seek algorithms and leverage read-ahead caching. Rebuild memory buffers could be hard-wired. Application I/O could be throttled, if so desired. And so on.

CLARiiON rebuild has been a constant evolution.

Centera Regeneration

I think it’s a good thing that Centera uses the word “regeneration” as opposed to “rebuild”. There’s a paradigm in the block industry that “rebuild” means “replace”. A rebuilt new disk “replaces” what was on the old. The new disk looks exactly like the old one would have had it not failed.

Centera doesn’t work this way. The contents of the failed disk do not become duplicated onto a replacement disk. Instead, the contents of the failed disk are regenerated onto multiple, surviving disks.

Having said this, there are two words I’d like to use to describe Centera’s object-based regen: (1) file-based, and (2) distributed.

Centera regen is file-based. Object fragments are stored in files. When a disk fails, the files are gone. They need to be regenerated into other file systems on different, healthy Centera disks. One ramification of file-based rebuild is that the files can be of variable lengths. This is vastly different from a fixed-length disk blocksize. The size of these files and the number of files will impact regeneration speeds.
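A rough cost model shows why the number of files matters as much as their total size. This is a back-of-the-envelope sketch, not a Centera formula: the fixed per-file overhead (lookup, open, create) and the transfer rates are hypothetical parameters chosen only for illustration.

```python
def block_rebuild_seconds(disk_bytes, stream_bytes_per_s):
    """Block rebuild: time depends only on capacity and streaming rate."""
    return disk_bytes / stream_bytes_per_s


def object_regen_seconds(file_sizes, per_file_overhead_s, transfer_bytes_per_s):
    """File-based regen: a fixed cost per file plus the time to move the bytes."""
    return len(file_sizes) * per_file_overhead_s + sum(file_sizes) / transfer_bytes_per_s


# Same 1 GB of content, stored two different ways (all numbers are made up).
few_large = [100_000_000] * 10        # 10 files of 100 MB
many_small = [10_000] * 100_000       # 100,000 files of 10 KB

t_few = object_regen_seconds(few_large, per_file_overhead_s=0.01, transfer_bytes_per_s=50e6)
t_many = object_regen_seconds(many_small, per_file_overhead_s=0.01, transfer_bytes_per_s=50e6)
print(f"few large files: {t_few:.1f} s, many small files: {t_many:.1f} s")
```

With these made-up numbers the byte-transfer term is identical in both cases; the difference comes entirely from the per-file term, which grows with the object count.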

Centera regen is distributed. A Centera is a clustered system that contains dozens of network-separated nodes. Each node in the system participates in regeneration. Centera reads files from, and writes files to, disks that are distributed across this network. Compare this to localized rebuild efficiencies. Efficient streaming of blocks off of a disk and directly onto another is not applicable in the Centera model. “Copying to a local buffer” is now “send via a TCP stack”.

Centera regen is continually being optimized. Centera software engineers are applying the same types of methods that the CLARiiON team did, and they are doing it via a consistent, release-by-release, year-over-year approach. Here are the types of optimizations that have been released over the years:

Batching of disk operations. Whether it’s a database request, or a file system operation, regeneration goes much more quickly when many objects are processed at the same time. This is similar to the CLARiiON technique of submitting multiple requests for an elevator-based seek algorithm.
Pipelining of disk and network transfers. Decouple disk and network accesses so that they can occur in parallel instead of one after the other.
Reduce network database lookups. Localize the list of all file locations instead of making multiple remote network calls to find files.
Improve file system efficiency. Tune the placement of files to enable fast iteration.
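
The pipelining idea in the list above can be sketched with two threads and a bounded queue: one thread reads files from disk while the other sends files that have already been read. This is only an illustration of the general technique, not Centera code; `read_file`, `send_file`, and `depth` are hypothetical names supplied by the caller.

```python
import queue
import threading


def pipelined_regen(file_names, read_file, send_file, depth=4):
    """Overlap disk reads and network sends through a bounded hand-off queue."""
    handoff = queue.Queue(maxsize=depth)  # bounds how far reads run ahead of sends
    done = object()                       # sentinel marking the end of the stream
    sent, failed = [], []

    def reader():
        try:
            for name in file_names:
                handoff.put((name, read_file(name)))
        finally:
            # Always signal completion so the sender cannot wait forever,
            # even if a read raised and stopped this thread early.
            handoff.put(done)

    def sender():
        while True:
            item = handoff.get()
            if item is done:
                return
            name, data = item
            try:
                send_file(name, data)
                sent.append(name)
            except Exception as exc:
                # Record the failure and keep draining so the reader is not blocked.
                failed.append((name, exc))

    threads = [threading.Thread(target=reader), threading.Thread(target=sender)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sent, failed
```

While the sender is busy with file N, the reader is already fetching file N+1, so disk latency and network latency overlap instead of adding up. The bounded queue keeps memory use in check when one side is much faster than the other.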

Each release of Centera software has implemented new improvements for disk regeneration, and thus improved regeneration times.

In addition to the software optimizations mentioned above, there have been hardware improvements as well. The earliest versions of Centera hardware and software experienced longer regeneration times, and a second hardware failure would sometimes cause “cascading” regenerations.

Mirrored and 6+1 Regeneration

Centera offers two levels of protection for customer data: content protection mirrored (CPM) and content protection parity (CPP). The CPP approach is a RAID-like 6+1 scheme that lets customers trade capacity savings against regeneration time.
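
The arithmetic behind that trade-off is easy to work through. The sketch below assumes that mirroring stores two full copies and that a 6+1 scheme behaves like single parity, so rebuilding one lost fragment means reading the six surviving ones. Both are simplifying assumptions for illustration, and the function names are hypothetical.

```python
from fractions import Fraction


def raw_bytes_per_user_byte(data_fragments, redundancy_fragments):
    """Raw capacity consumed for each byte of customer data."""
    return Fraction(data_fragments + redundancy_fragments, data_fragments)


def fragments_read_per_lost_fragment(data_fragments, mirrored):
    """Fragments that must be read to regenerate one lost fragment.

    Assumption: a mirror copies the one surviving replica, while a
    single-parity scheme needs every other fragment in the group.
    """
    return 1 if mirrored else data_fragments


cpm_overhead = raw_bytes_per_user_byte(1, 1)  # mirrored: 2 raw bytes per user byte
cpp_overhead = raw_bytes_per_user_byte(6, 1)  # 6+1: 7/6 raw bytes per user byte
print(cpm_overhead, cpp_overhead)
print(fragments_read_per_lost_fragment(1, mirrored=True),
      fragments_read_per_lost_fragment(6, mirrored=False))
```

Under these assumptions parity protection uses a little over half the raw capacity of mirroring, but each lost fragment costs six reads, spread across the network, instead of one.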

The end goal: determinism

What’s the upshot of all of this?

Well, I’ve highlighted the significant differences between handling disk failures for block and object storage systems. I’ve hopefully made it clear that the speed of object-based rebuild is based on the number of objects in the system (and their size). And most importantly, a customer is less concerned with these details and more concerned with fast and consistent rebuild times, no matter what product experiences the failure.

The new hardware, the tuning, and the evolution of Centera’s object-based regeneration algorithms are yielding the desired result for customers: deterministic recovery from disk failure while maintaining adequate system performance.

Steve


Steve Todd
