Over two years ago I wrote about the advantages of using Isilon and VMware for Hadoop implementations via the installation of the Hadoop Starter Kit. At the time I learned that thousands of Isilon customers could “Hadoop-enable” their existing file stores simply by installing the starter kit on top of the raw files they had been collecting for years.
As I visit customers that are interested in deploying a particular Hadoop distribution (e.g. Hortonworks, Cloudera, Pivotal, etc.), I notice that they are deploying the distribution on top of commodity storage, and I wonder if they have considered all of the angles of using Isilon instead.
I spent some time looking into Isilon/Hadoop features with Ryan Peterson, and I came away with the following benefits that any shop considering Hadoop on commodity storage should weigh.
The Benefits of In-Place Multi-Protocol
Underneath the Isilon covers, data is stored in its raw format. This means that any data ingested via HDFS can be exported via any of the wide variety of protocols supported by Isilon, including CIFS, NFS, HTTP, FTP, and Swift. This multi-protocol support is provided in-place (without copying the data).
Users of Hadoop on commodity storage should consider whether access to ingested data over another protocol could ever be useful, because if that use case surfaces, the data would likely need to be copied to another system.
The Benefit of Changing Your Hadoop Distribution
Choosing a Hadoop distribution involves trade-offs. Over time the choice of a particular Hadoop distribution may prove to be problematic for a variety of reasons.
For example, the licensing agreement for a given Hadoop distribution may become a financial burden for a business. If commodity storage has been chosen, switching to another distribution is non-trivial, because the data formats essentially lock you into that distribution.
Similarly, the feature set of a given Hadoop distribution may be lacking in comparison to other distributions. Installing a second distribution on top of the existing one is a non-starter.
Using Isilon at the beginning of the process solves all of these problems. For example, Cloudera is a great partner of Isilon. If, however, Pivotal has a feature that is not present in Cloudera, the Pivotal distribution can simply be installed and used. Or if Hortonworks has a more favorable licensing model, that distribution can be installed.
Isilon’s approach of storing data in its raw format not only supports multi-protocol access, but it also supports multi-distribution access for Hadoop.
The Benefit of Sandbox Non-Duplication
A common workflow for Data Scientists using Hadoop is to copy a data set into a sandbox for analytic purposes (e.g. conditioning, exploring, modeling, etc.).
Consider a 1TB Hadoop data set. Using commodity storage and the typical 200% protection overhead (see Cloudera article), the total capacity consumed, including other data such as Hadoop’s intermediate shuffle files, would approach 4TB. Creating another analytic sandbox would therefore bring the total close to 8TB.
With Isilon there is no need to duplicate the data; a simple metadata update is all that is required. The end result is that the capacity change for an Isilon Hadoop sandbox is essentially negligible.
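The commodity-storage arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the function name is hypothetical, and the 1TB allowance for intermediate files is an assumption chosen to match the "close to 4TB" estimate, not a measured figure.

```python
def commodity_footprint_tb(dataset_tb, replicas=3, scratch_tb=1.0):
    """Raw capacity consumed by one copy of a data set on replicated storage.

    replicas=3 corresponds to 200% protection overhead. scratch_tb is an
    assumed allowance for intermediate (e.g. shuffle) files.
    """
    return dataset_tb * replicas + scratch_tb


original = commodity_footprint_tb(1.0)  # the 1TB data set
sandbox = commodity_footprint_tb(1.0)   # a full copy made for the sandbox
total = original + sandbox

print(f"one copy: {original:.1f} TB, with sandbox: {total:.1f} TB")
```

Under these assumptions every additional full-copy sandbox adds roughly another 4TB, which is the cost that a metadata-only sandbox avoids.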
The sandbox use case is further bolstered by Isilon’s protection approach.
The Benefit of Lower Protection Overhead
Hadoop’s 3X protection scheme can leave as little as 25% of raw capacity usable once additional overhead is included. Isilon instead uses parity schemes that typically leave about 80% of raw capacity usable, because its operating system dedicates a smaller portion of the overall capacity to redundancy (the exact figure depends on the parity scheme used and the width of the Isilon cluster). For example, a Reed-Solomon approach with 5 Isilon nodes results in 80% usage, and a 10-node N+2 scheme also results in approximately 80% usage. These numbers offset much of the cost benefit of commodity storage. In some cases Isilon comes in as less expensive than commodity.
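The 80% figures can be checked with a simplified model. The sketch below assumes that a stripe spans every node in the cluster and that the 5-node example gives up one node's worth of capacity to parity; both are assumptions made for illustration, the function names are hypothetical, and real clusters may differ depending on the protection policy in use.

```python
def parity_usable_fraction(nodes, parity_nodes):
    """Usable share of raw capacity when parity_nodes' worth of space
    out of a cluster of `nodes` is reserved for redundancy."""
    if not 0 < parity_nodes < nodes:
        raise ValueError("parity_nodes must be between 1 and nodes - 1")
    return (nodes - parity_nodes) / nodes


def replication_usable_fraction(replicas):
    """Usable share of raw capacity when every block is stored `replicas` times."""
    return 1 / replicas


five_node = parity_usable_fraction(5, 1)     # assumed N+1 on 5 nodes
ten_node = parity_usable_fraction(10, 2)     # N+2 on 10 nodes
triple_copy = replication_usable_fraction(3) # before any additional overhead

print(f"5 nodes, N+1:  {five_node:.0%}")
print(f"10 nodes, N+2: {ten_node:.0%}")
print(f"3X replication: {triple_copy:.0%}")
```

Three-way replication alone leaves about a third of raw capacity usable; the 25% figure quoted above includes additional overhead on top of that.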
The Future Benefit of vOneFS
Going forward, the extraction of Isilon’s vOneFS operating system to run on top of commodity storage will open up new use cases for Hadoop deployments. The benefits highlighted above allow for large, multi-node Isilon analytic deployments at the core of a business. With a virtualized vOneFS, smaller deployments on the edge of an enterprise can do edge-based Hadoop analytics and leverage the same rich set of Isilon features found in the core.
Other Benefits
This article has covered some of the more important benefits of using Isilon for Hadoop, but I have not covered them all. Other important comparisons include:
- HA Name Node redundancy (N-to-N versus Active/Passive)
- Isilon’s ability to edit files/objects
- Other protocols (SMB, Object)
- Simultaneous access across all protocols
- Better performance (e.g. on TPC-DS)
- Snapshot functionality
- Dedup functionality
- WORM (SEC Rule 17a-4)
- Independent scaling of storage/compute
Steve
EMC Fellow

