Being a software architect at a big company (EMC) has me thinking about how hard it can be to agree on an architectural direction.
All the companies purchased by EMC have software architects that are familiar with their own software design. The Office of the CTO works on global reference architectures as a framework for convergence. I’ve worked on at least ten different software architectures that have been shipped to the storage industry. I’ve written posts about how the architects from PowerPath worked with the architects for RSA.
I’ve seen a lot of architectural proposals, and I’ve adopted a set of criteria for determining whether an architectural proposal is “good” or “not good”. There’s one type of architecture that I favor.
The Customer Architecture.
One definition of a customer architecture might be “an architecture that results in a product that customers like to buy”. That’s all well and good, but I can’t use that as a criterion when I’m trying to weigh in on whether or not I like a given software architecture. Another definition that moves in the right direction (for me) would be “an architecture that results in a product that works as advertised”.
For me it comes down to correctness. I evaluate an architecture based on whether or not I believe that we could ship it to the field and have it work correctly. And when I see an architectural proposal at EMC, there are three criteria I like to use to determine whether or not it’s a “customer architecture”.
#1: Show Me the Money
This criterion actually translates into “show me the use cases and requirements that generated this architecture”. Customers will likely purchase products based on a proposed architecture if it correctly satisfies the use cases and requirements. I like to dive into every piece of an architecture and ask “what is the use case or requirement that led to this piece of the architecture?” This is just good software development methodology. In fact, any architectural analysis without the use cases is academic. Fun and controversial, perhaps, but academic. It’s impossible to determine what’s missing from an architecture without knowledge of how customers would use it.
Case in point: CLARiiON. Use cases? Performance and data integrity. Disk speeds weren’t keeping up with CPU speeds, and customers in the 90s started asking for disk systems that delivered performance without sacrificing data integrity in the face of an increased chance of disk failure.
#2: Bounded Complexity
I’ve yet to implement an architecture that could be termed “easy”. EMC products tend to tackle issues that are “hard”, and therefore require a software architecture that must handle some fairly sticky error scenarios. If these error scenarios occur and cannot be handled cleanly by a bounded portion of the architecture, then I would argue that the proposal is not a customer architecture. The original CLARiiON architecture can again serve as a case in point.
Without doubt the “beast” of any RAID-5 implementation is the handling of errors and failures, whether they be disk failures, processor failures, or power failures. The original CLARiiON architecture had a component known as the “message dispatcher”, or MD. The MD was a process that received messages to perform RAID-5 read and write operations. It “owned” one of the disks, and it was responsible for communicating with other MD processes to coordinate multi-disk RAID operations. The entirety of RAID-5 failure handling was found in this piece of the architecture. In my opinion this is a good attribute for an architecture. Why? See criterion #3.
#3: Automated Complexity Testing
If the complexity for implementing a specific customer requirement is contained within one piece of the architecture (and not spread throughout the architecture), then it opens the door for a set of automated tests that target the complexity and verify the correct operation of the architecture.
The MD process in CLARiiON contained all the logic for handling disk hard errors, disk soft errors, disk failures, disk rebuilds, multi-disk read and write operations, and RAID-5 algorithms. This architectural component was implemented as a state machine. Certain states in the state machine were “boundary conditions” that would rarely happen. A “good” architecture should allow an automated test suite to visit all states and verify correctness.
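To make the "visit all states" idea concrete, here is a minimal Python sketch. The states, events, and transition table are hypothetical and chosen only for illustration; they are not the actual MD state machine. The sketch uses a breadth-first search to find an event sequence that reaches every state, then replays each sequence to confirm that the machine lands where expected.

```python
from collections import deque

# Hypothetical states and events for a toy RAID group.
# This is an illustrative assumption, not the real CLARiiON MD design.
TRANSITIONS = {
    ("HEALTHY", "disk_fail"): "DEGRADED",
    ("DEGRADED", "disk_replaced"): "REBUILDING",
    ("DEGRADED", "disk_fail"): "FAILED",
    ("REBUILDING", "rebuild_done"): "HEALTHY",
    ("REBUILDING", "disk_fail"): "FAILED",
}
ALL_STATES = {"HEALTHY", "DEGRADED", "REBUILDING", "FAILED"}


def step(state, event):
    """Apply one event; events with no matching transition leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def paths_to_all_states(start="HEALTHY"):
    """Breadth-first search: map each reachable state to a shortest event sequence."""
    paths = {start: []}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for (src, event), dst in TRANSITIONS.items():
            if src == state and dst not in paths:
                paths[dst] = paths[state] + [event]
                queue.append(dst)
    return paths


def replay(events, start="HEALTHY"):
    """Drive the state machine through a sequence of events and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    paths = paths_to_all_states()
    # Any state missing here is one an automated suite could never exercise.
    unreachable = ALL_STATES - set(paths)
    print("unreachable states:", unreachable)
    for target, events in paths.items():
        print(target, "<-", events, "ok" if replay(events) == target else "MISMATCH")
```

Because the failure handling lives in one table, a test suite can enumerate it mechanically. If the same logic were spread across many components, no single search like this could claim to have covered it.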
The original architecture as proposed was flexible enough to allow the intrusion of “test hooks” such as the ability to power down disks, the ability to corrupt data, and the ability to know the location of rebuild checkpoints and RAID data layout. The Disk Array Qualifier (DAQ) used these mechanisms to move the RAID algorithms from state to state at exactly the right time. And of course the architecture must be flexible enough to allow for random fault insertions no matter what the internal state of the architecture. For example, a “parity jammer” device was inserted onto the back-end SCSI bus to make noise whenever it felt like it.
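The test-hook idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how DAQ or the CLARiiON code worked: `ToyStripe`, `power_down`, and `random_fault_trial` are hypothetical names, and the stripe is a single row of data blocks plus one XOR parity block, which is the standard RAID-5 parity relationship. The hook powers down a randomly chosen disk, and the test checks that every data block can still be read back correctly.

```python
import random
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class ToyStripe:
    """One stripe: N data blocks plus one parity block, each on its own 'disk'."""

    def __init__(self, data_blocks):
        self.blocks = list(data_blocks) + [xor_blocks(data_blocks)]
        self.failed = set()

    def power_down(self, disk):
        """Test hook (hypothetical): simulate losing a disk."""
        self.failed.add(disk)

    def read(self, disk):
        """Return a block, rebuilding it from the survivors if its disk is down."""
        if disk not in self.failed:
            return self.blocks[disk]
        if len(self.failed) > 1:
            raise IOError("data unavailable: more than one disk failed")
        survivors = [b for i, b in enumerate(self.blocks) if i != disk]
        return xor_blocks(survivors)


def random_fault_trial(rng, n_data=4, block_size=8):
    """Fail one randomly chosen disk, then verify that all data is still readable."""
    data = [bytes(rng.randrange(256) for _ in range(block_size)) for _ in range(n_data)]
    stripe = ToyStripe(data)
    stripe.power_down(rng.randrange(n_data + 1))  # may hit a data disk or the parity disk
    return all(stripe.read(i) == data[i] for i in range(n_data))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a failing trial can be reproduced
    print(all(random_fault_trial(rng) for _ in range(100)))
```

The fixed seed is a deliberate choice: random fault insertion is only useful for verification if a failing run can be replayed exactly.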
That’s Where I Start
In general that’s what I look for in any architectural proposal. How do the needs of the customers get reflected in the architecture? Do all of the customers’ needs get reflected? Was the architecture structured in a way that bounds complexity and allows for the verification of correctness?
I’ve applied this approach to CLARiiON, Navisphere, PowerPath, Storage Scope, Centera, etc., etc., etc. Are the internal architectures of these products customer architectures? I believe that the sales numbers for these products indicate that “the products are working the way customers expect them to”, which I’ve defined as a key aspect of a customer architecture. EMC also vigorously tracks the DU/DL (Data Unavailable/Data Loss) rates of these products, and they are all quite low, which is great news for users of these products. DU/DL rates are a way to evaluate architectures “after the fact”.
I’d be interested to hear other software architects weigh in on their criteria for architecture evaluation, especially (but not limited to) the information industry.
Steve

