I’ve been writing about the creation of a new IT infrastructure that’s more suitable for mobile, social, and analytic applications (and about some of the problems that are likely to come with it). At the core of the new infrastructure is an HDFS-based data lake, with an in-memory data grid on the surface. In my last post I discussed future architectures and described how block, file, object, and SQL storage support could be added to that infrastructure with agility.
EMC’s plan is to provide one API for provisioning this mix of storage capability: ViPR.
The ViPR diagram from my previous post illustrates the convergence of storage provisioning and data protection into a common API layer.
There are some who would argue that IT infrastructure no longer needs to provide data protection services for applications that run on top of technologies like HDFS. Their reasoning is two-fold: (a) more and more applications have built-in data protection, and (b) HDFS itself can make multiple copies of data.
I’ve asked several customers for their thoughts and the prevailing attitude is that applications and HDFS are highly capable of corrupting data and have no problem triplicating the corruption in an attempt to make multiple copies.
Regardless of how one feels, the bottom line is that a mix of application workloads running on a data lake architecture will surely contain applications that need data protection provided by the infrastructure. In many cases it will have to be there.
The problem, however, is that the volume of data in these new platforms will be enormously heavy, and traditional backup and restore architectures (e.g. move the data from primary storage, up to the backup server, and down or over to the protection storage) will not apply.
In other words, don’t bring the monster to the surface.
So what should an IT architect consider when building a data protection infrastructure for big, heavy data that is difficult to move around?
The answer, as my colleague Stephen Manley likes to say, is all about the metadata. The goal is to leverage application and infrastructure-based metadata to minimize movement of heavy data.
Here are some points to keep in mind about a forward-looking data protection architecture:
- Application requests to provision new storage should call a centralized software-defined storage portal.
- These requests can (and should) contain data protection policies.
- The protection storage (based on the policy) should be configured at the same time that the data source is configured.
- A set of data management services for both the data and the metadata should be made available through the same portal.
If these steps are followed, a smart software-defined storage layer can leverage the fact that configuration metadata catalogs and data protection metadata catalogs will be more in sync than ever before. In fact, the protection storage can essentially be thought of as a peer to the primary storage within a data lake infrastructure.
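To make the four points above concrete, here is a minimal sketch of a provisioning call that carries its protection policy with it. Everything in it is hypothetical: `StoragePortal`, `ProvisionRequest`, `ProtectionPolicy`, and the field names are invented for illustration and are not the ViPR API. The only idea it tries to show is that primary and protection storage are configured in one step and recorded in one catalog.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List

# All names below are hypothetical; this is not the ViPR API.

@dataclass(frozen=True)
class ProtectionPolicy:
    copies: int          # how many protection copies to configure
    rpo_minutes: int     # recovery point objective
    retention_days: int  # how long protection copies are kept

@dataclass(frozen=True)
class ProvisionRequest:
    app: str
    storage_type: str    # "block", "file", "object", or "hdfs"
    size_gb: int
    policy: ProtectionPolicy

@dataclass
class CatalogEntry:
    primary_id: str
    protection_ids: List[str]
    request: ProvisionRequest

class StoragePortal:
    """Toy stand-in for a centralized software-defined storage portal."""

    SUPPORTED = {"block", "file", "object", "hdfs"}

    def __init__(self) -> None:
        # One catalog holds configuration and protection metadata together.
        self.catalog: Dict[str, CatalogEntry] = {}

    def provision(self, request: ProvisionRequest) -> CatalogEntry:
        if request.storage_type not in self.SUPPORTED:
            raise ValueError(f"unsupported storage type: {request.storage_type}")
        primary_id = f"primary-{uuid.uuid4().hex[:8]}"
        # Protection storage is configured in the same call as the data source.
        protection_ids = [
            f"protect-{uuid.uuid4().hex[:8]}" for _ in range(request.policy.copies)
        ]
        entry = CatalogEntry(primary_id, protection_ids, request)
        self.catalog[primary_id] = entry
        return entry

portal = StoragePortal()
entry = portal.provision(
    ProvisionRequest(
        app="clickstream-analytics",
        storage_type="hdfs",
        size_gb=50_000,
        policy=ProtectionPolicy(copies=2, rpo_minutes=60, retention_days=30),
    )
)
print(entry.primary_id, entry.protection_ids)
```

Because the policy travels with the request, the catalog never has a primary volume without a matching protection record, which is the "in sync" property described above.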
Historically, data protection software has sat above the storage layer, making smart routing decisions between block, file, and object storage systems. It can now place itself in the storage provisioning layer instead. This gives it easier access to configuration metadata, enabling intelligent optimization of data movement between the primary data source and the data protection storage.
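As a rough illustration of what "optimizing data movement from metadata" might look like, the sketch below compares two catalogs and decides what needs to move without reading any of the data itself. The function name `plan_movement` and the catalog shape (object name mapped to a version fingerprint) are assumptions made for this example, not a description of any shipping product.

```python
from typing import Dict, List, Tuple

def plan_movement(
    primary_meta: Dict[str, str],
    protected_meta: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Compare catalogs only; no heavy data is read or moved here.

    Each catalog maps an object name to a version fingerprint (hypothetical).
    Returns (objects to copy, protection copies with no primary counterpart).
    """
    # Copy anything that is new or whose fingerprint has changed.
    to_copy = sorted(
        name for name, fp in primary_meta.items() if protected_meta.get(name) != fp
    )
    # Flag protected objects that no longer exist on primary storage.
    orphaned = sorted(name for name in protected_meta if name not in primary_meta)
    return to_copy, orphaned

primary = {"a": "v1", "b": "v2", "c": "v1"}
protected = {"a": "v1", "b": "v1", "old": "v7"}

to_copy, orphaned = plan_movement(primary, protected)
print(to_copy)   # ['b', 'c']
print(orphaned)  # ['old']
```

Only `b` and `c` would be moved; `a` is left where it is because the two catalogs already agree on it.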
Looking even further forward, one can visualize the protection storage actually moving inside the primary storage as a protection tier:
Deploying a new infrastructure (such as a data lake) is a great opportunity to break old paradigms and deploy new ones. From what I am seeing in the industry, the rush to deploy HDFS is exactly that: a scramble to create a massive data sink that data scientists can run analytics against. As part of this scramble, data protection is often ignored in order to save time.
As a result, when data protection does become critical for analytic projects, the traditional backup and recovery architecture is completely ineffective. Bringing all that data to the surface is a bad idea.
Data Protection is topic number four in building a 3rd platform infrastructure:
In a future post I’ll move above the storage level and discuss how to automate the placement of what could be millions of applications running on top of this infrastructure.
Steve
EMC Fellow







