Window into the Decision

One of the more interesting discussion points about CLARiiON is the fact that it runs embedded Windows XP inside.  I’ve had several readers ask me about this, and I’ve also seen a lot of Google queries land on my site looking for more information about this topic.

As a result there’s a lot of curiosity: why (and when) was the Windows decision made?  The CLARiiON of the 90s was not a Windows solution.

Well, I’ve got answers.  Mind you, they may vary slightly from other answers you might get. I mean, it’s been ten years or so since the decision was made. My recollection is from a software architect’s point of view. As I recall there was a need to change the CLARiiON architecture for one main reason.

To compete against EMC.

In the late 90s CLARiiON was a product built by Data General. I had been involved in creating the software architecture for CLARiiON’s microcode, known as FLARE. Starting in the 80s my areas of responsibility were the RAID algorithms, the write cache, the failure handling (hot repair, failover/failback), and anything data integrity.

A Process-Based Architecture

The original FLARE software architecture was a collection of processes. There was a process for each disk. There was a process for handling read and write requests from application servers. There was a rebuild process, a background verify process, a configuration process, etc.  The operating environment for these processes was a homegrown proprietary operating system known as HEMI (because it used a “hemispherical” scheduling algorithm).

For the first six or seven years of its lifetime, we added features to FLARE. We added support for the SCSI protocol. We added hot spare support. We added a write cache, supported new RAID levels, improved performance, and added support for a graphical user interface. We moved from a static number of disks to a variable number of disks via shelf expansion.  We added SAN support (e.g. LUN masking). The original architecture handled all this and proved scalable and maintainable.

And as the storage market began to really take off in the mid-to-late 90s, the CLARiiON business folks began to eye a move towards the lucrative profits available for higher end storage systems.  Systems like EMC Symmetrix.  The products were already bumping into each other in sales situations for customers considering purchases at the high-end of the mid-range and/or the bottom portion of the high-end.

Symm vs CLARiiON

Symmetrix was clearly a powerhouse, and one of the greater difficulties in competing with Symm would be the fact that Symm had a huge number of server connections (with a huge cache backing them). But it wasn’t the hardware that DG was concerned about so much as the latest software features that were generating huge amounts of revenue: SRDF and TimeFinder.  CLARiiON didn’t have these features. For more and more customers remote mirroring and snap copy were becoming table stakes, and CLARiiON didn’t have them.

In CLARiiON’s favor there were several standout advantages. Symm (at that time) did not have the RAID-5 algorithms, did not have a mirrored write-cache, and did not have an easy-to-use graphical user interface (Navisphere).  But without remote mirroring and snap copy, CLARiiON would not be able to compete. And remote mirroring and snap copy presented some problems for the process-based CLARiiON software architecture.

Layers versus Processes

As I mentioned, FLARE had a front-end process to handle incoming requests from application servers. This process also contained the write cache. When moving data from cache to disk, the front-end process needed to “open” a connection to the appropriate process and send a “message”. More and more processes had been added to FLARE as the number of disks grew from 20, to 30, to 40, and eventually 120. Connections between all these processes resulted in lots of pipes with lots of messages.

Deciding where to insert remote mirroring and snap copy functionality was not straightforward. Adding new processes would introduce more “hops”. Modifying existing processes would blur the lines of the architecture and violate encapsulation of function.  So there was a worry about both performance and architectural maintenance.
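The "more hops" worry can be sketched with a toy model. This is only an illustration under stated assumptions, not FLARE code: the `Process` class, the `write_path` helper, and the process names (`front_end`, `remote_mirror`, `disk_7`) are all hypothetical. Each process has an inbox, and every message handed from one process to the next lengthens the path a write has to travel.

```python
from collections import deque


class Process:
    """Toy stand-in for one process with an inbox (hypothetical, not FLARE code)."""

    def __init__(self, name, next_process=None):
        self.name = name
        self.next_process = next_process
        self.inbox = deque()

    def send(self, message):
        # Record that the message reached this process, then queue it.
        message["path"].append(self.name)
        self.inbox.append(message)

    def run(self):
        # Drain the inbox; forward each message to the next process, if any.
        delivered = []
        while self.inbox:
            message = self.inbox.popleft()
            if self.next_process is None:
                delivered.append(message)
            else:
                self.next_process.send(message)
                delivered.extend(self.next_process.run())
        return delivered


def write_path(process_names):
    """Chain the named processes and push one write request through them."""
    chain = None
    for name in reversed(process_names):
        chain = Process(name, chain)
    chain.send({"op": "write", "lba": 0, "path": []})
    return chain.run()[0]["path"]


before = write_path(["front_end", "disk_7"])
after = write_path(["front_end", "remote_mirror", "disk_7"])
print(len(before), len(after))
```

In this model, adding a mirroring process between the front end and the disk process makes every write pass through one more inbox, which is the performance concern described above.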

It would be nice if FLARE could support the “insertion” of this functionality in a more performant manner, without “disturbing” the functionality found in components such as the write cache and RAID algorithms.

Kind of like a layered device stack.

Hey, doesn’t Windows have a nice layered device stack model?
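To make the layering idea concrete, here is a minimal sketch of a layered stack. It is not the Windows driver model or the actual CLARiiON code; `BottomLayer`, `MirrorLayer`, and `build_stack` are made-up names, and the "remote" copy is just a dictionary. The point it shows is that a new layer can be placed on top of existing functionality and pass requests down with a direct call, while the code underneath stays untouched.

```python
class BottomLayer:
    """Stands in for the existing cache/RAID code; stores blocks in a dict."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba, data):
        self.blocks[lba] = data


class MirrorLayer:
    """Hypothetical filter layer: copies each write, then passes it down."""

    def __init__(self, lower, remote):
        self.lower = lower
        self.remote = remote

    def write(self, lba, data):
        self.remote[lba] = data      # replicate to the "remote" copy
        self.lower.write(lba, data)  # direct call into the layer below


def build_stack(with_mirror):
    """Return (top, bottom, remote); the mirror layer is optional."""
    bottom = BottomLayer()
    remote = {}
    top = MirrorLayer(bottom, remote) if with_mirror else bottom
    return top, bottom, remote


top, bottom, remote = build_stack(with_mirror=True)
top.write(5, b"abc")
print(bottom.blocks == remote)
```

Callers always talk to `top`, so inserting or removing `MirrorLayer` changes nothing in `BottomLayer`. That is the encapsulation the process-based design made hard to preserve.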

Reactions to the Decision

As you can imagine, the announcement to move FLARE to an embedded Windows solution was met with internal skepticism. Questions about whether or not FLARE would start blue-screening at customer sites were jokingly answered with “We won’t ship a screen with it” ;>).  Saying that it was a controversial internal decision would be an understatement.

Customers were also skeptical. One of our VPs met with many customers who would repeatedly challenge him about quality levels in the new product. To his credit he admitted that any move to a new architecture is initially going to experience new quality issues. However, DG didn’t have to dodge the quality issue, and had a rather compelling answer in its back pocket known as the DAQ. The Disk Array Qualifier could still run its suite of data integrity tests and hammer the new Windows-based solution. The DAQ ran against both the old and new architectures because the “quality” hooks that had always been part of FLARE still functioned on Windows. The write cache could still be fully tested.

This satisfied customers’ concerns. And when they realized that a layered device stack provided (a) better performance, and (b) layered features without disturbing existing code, they began investing in the new architecture.

It Worked

CLARiiON software developers did a fine job componentizing different parts of the HEMI-based process architecture into different layers in a Windows driver stack. Performance was excellent (especially since the write cache was ported intact). New remote mirroring and snap copy device drivers were layered on top of the “HEMI-based” functionality.

The bold decision paid off. DG was acquired by EMC. How well did EMC sell the Windows-based solution?  Check out the latest press release about the CX announcement in early August.  EMC reports over 300,000 CLARiiON units sold, with 99.999% uptime.  Quality, performance, adoption, and revenue.  Quite frankly, the fact that more than 300,000 CLARiiONs are running world-wide still amazes me. That’s what happens when your storage product is bought by a storage company.

Additional layered drivers have been developed over the years, some of which I hope to write about in future posts.

Steve