Over the last few posts I’ve described an infrastructure that is highly compatible with new mobile, social, and analytic applications: the data lake infrastructure. After introducing the topic I dove a bit deeper into data lake considerations such as server-level storage, HDD tiering, and secure tiering to cloud service providers.
Chuck Hollis likes to describe the data lake with the metaphor of navigating on top of the water in a boat. If a new application wishes to provision storage resources in a data lake, how does it “drop anchor” on top of the right mix of storage infrastructure (e.g. server-based, HDD-based, cloud storage)?
Perhaps the correct form of the question is “How does it quickly drop anchor?”
The trend in the industry is to use a programmatic approach. The Hadoop Starter Kit described by Ed Walsh is a great use case: Hadoop is installed and usable in a matter of hours. James Ruddy talks about two use cases for quick deployment: object-based storage for OpenStack Swift, and file-based deployment for S3fs. In both cases James used the ViPR API to create the storage infrastructure.
James, Ed, and the rest of the OIL team are providing the industry with guidance on quick storage provisioning for the types of storage infrastructure required by new applications. HDFS, Swift Object, and S3fs are all examples of storage technology that play well with these new apps. The team is essentially creating a catalogue of storage services and dropping them down onto whatever infrastructure happens to be present in a data center.
Their results point to the trend of a catalogue-based approach to agile storage deployment. Storage services become separate software entities and are placed in a catalogue. In response to a storage-provisioning request, the assets can then be deployed on top of the right mix of server and HDD technology. The diagram below highlights this approach (underneath the ViPR API).
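To make the catalogue idea a little more concrete, here is a minimal Python sketch of how a provisioning request might be matched against a catalogue and placed on a tier. Everything in it is an assumption made for illustration: the `StorageService` class, the `CATALOGUE` entries, the tier preferences, and the `provision` function are hypothetical names. None of it is the ViPR API or the OIL team's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageService:
    """One catalogue entry (hypothetical structure, for illustration only)."""
    name: str             # catalogue key, e.g. "hdfs"
    interface: str        # access style: "hdfs", "object", or "file"
    allowed_tiers: tuple  # tiers the service may land on, in order of preference


# Illustrative catalogue. The tier preferences are assumptions, not measured guidance.
CATALOGUE = {
    "hdfs": StorageService("hdfs", "hdfs", ("server", "hdd")),
    "swift": StorageService("swift", "object", ("hdd", "cloud")),
    "s3fs": StorageService("s3fs", "file", ("hdd", "cloud")),
}


def provision(service_name, size_gb, free_gb):
    """Place a request on the first preferred tier with enough free capacity.

    free_gb maps tier name -> free capacity in GB and is updated in place.
    """
    service = CATALOGUE.get(service_name)
    if service is None:
        raise KeyError(f"no catalogue entry named {service_name!r}")
    for tier in service.allowed_tiers:
        if free_gb.get(tier, 0) >= size_gb:
            free_gb[tier] -= size_gb  # reserve the capacity
            return {
                "service": service.name,
                "interface": service.interface,
                "tier": tier,
                "size_gb": size_gb,
            }
    raise RuntimeError(f"no tier has {size_gb} GB free for {service_name!r}")
```

A caller would pass the service name, the requested size, and a view of free capacity per tier, for example `provision("hdfs", 50, {"server": 100, "hdd": 1000, "cloud": 10000})`. The point of the sketch is only that the application asks for a service by name, and the catalogue decides which infrastructure it lands on.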
Data lakes are a conglomeration of different sets of data sources. If a data lake is being designed from scratch, a typical first use case is to set up HDFS with an in-memory data grid on top. Using the techniques outlined above, it’s very easy to install and get that type of configuration up and running.
However, why not design the data lake administrative interface to support the traditional (e.g. second platform) application and storage connections (e.g. apps running on top of block and/or file storage interfaces)? This brings all data types into the data lake.
Or why not migrate existing applications into the lake by using software-defined replication or mobility services?
The creation of an agile storage deployment architecture is a third step in this continuing look at design considerations for a third platform data center infrastructure.
A catalogue of storage assets, as depicted above, creates an interesting new scenario in terms of data protection: the provisioning (e.g. block, object, file, HDFS) and data protection (backup, replication) are all brought together with data movement technologies (dedup, mobility).
In an upcoming post I will dive a bit deeper into the lake to determine the potential for new data protection approaches in a data lake architecture.
Steve
Twitter: @SteveTodd
EMC Fellow



