Data Lake Surface Lures

In recent posts I’ve discussed a simplified data lake architecture that is starting to appear more frequently. This architecture enables the development of faster analytic applications by balancing an in-memory data grid on top of the Hadoop Distributed File System (HDFS). A diagram of the architecture, and the apps that sit on top of it, is depicted below.

GemfireHDFS

These new applications can essentially fish the surface of a data lake by performing fast analytics on recently arrived data. GemFireXD is great at enabling this. There are also some really cool things coming with the Spring IO Platform that I will explore in future posts.

While the architecture looks simple, the deployment is not as simple as it seems. This is especially true at the server level. In this post I’d like to highlight where some of the main problems are surfacing and then lay out some directions for solving them.

Servers represent the “hot edge” of persistent storage, delivering latencies that are orders of magnitude lower than spinning disk. GemFireXD is indeed a storage product. Although it uses memory, you can establish relationships with underlying non-volatile stores and replicate data in multiple ways. But in a data lake architecture GemFireXD is not the permanent resting place for the data. Eventually GemFireXD will move the data down into a persistent HDFS infrastructure.
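To make the staging idea concrete, here is a minimal Python sketch of a memory tier that keeps recently arrived data and spills the oldest entries down to a persistent store. Everything in it is an assumption made for illustration: `HotTier` is a hypothetical class, the `cold_store` dict merely stands in for HDFS, and none of this reflects how GemFireXD actually implements the movement.

```python
from collections import OrderedDict


class HotTier:
    """Toy in-memory tier that spills its oldest entries to a cold store.

    Hypothetical illustration only: `cold_store` is a plain dict standing in
    for HDFS, and none of these names come from GemFireXD.
    """

    def __init__(self, capacity, cold_store):
        self.capacity = capacity
        self.cold_store = cold_store
        self.hot = OrderedDict()  # insertion order == arrival order

    def put(self, key, value):
        # Re-inserting a key makes it the most recent arrival.
        self.hot.pop(key, None)
        self.hot[key] = value
        # Spill the oldest entries once the hot tier is over capacity.
        while len(self.hot) > self.capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold_store[old_key] = old_value

    def get(self, key):
        # Serve from memory first, then fall back to the cold store.
        if key in self.hot:
            return self.hot[key]
        return self.cold_store.get(key)


cold = {}
tier = HotTier(capacity=2, cold_store=cold)
for i in range(4):
    tier.put(f"event-{i}", i)
```

After the loop, the two newest events are still in memory and the two oldest have been moved down. An application "fishing the surface" would mostly hit the `hot` dictionary and only occasionally reach into the cold store.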

In many cases this infrastructure is turning out to be hundreds (or thousands) of server-based flash storage devices balanced underneath GemFireXD:

ServerStorage

So in addition to understanding the staging between GemFireXD and HDFS, it is also necessary to understand scale-out data protection strategies for HDFS data at the server level. This week I asked several customers about data protection for HDFS. I received three types of answers:

  1. App takes care of it.
  2. HDFS takes care of it.
  3. Infrastructure works with HDFS to take care of it.

Regardless of the approach, one of the biggest concerns I’ve heard is: “If data gets corrupted, HDFS may triplicate the corruption.”
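One general way to think about this concern is to record a checksum when data first arrives and refuse to replicate any block that no longer matches it. The Python sketch below illustrates that idea only. The function names are hypothetical, the replicas are just in-memory copies, and this is not a description of how HDFS or any particular product handles integrity. It also cannot catch corruption that happened before the checksum was recorded.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest used as the block's fingerprint."""
    return hashlib.sha256(data).hexdigest()


def replicate(block: bytes, expected: str, copies: int = 3) -> list:
    """Make `copies` replicas, refusing if the block fails its checksum."""
    if checksum(block) != expected:
        raise ValueError("block failed checksum; refusing to replicate")
    return [bytes(block) for _ in range(copies)]


original = b"sensor reading 42"
recorded = checksum(original)         # captured when the data first arrived
corrupted = b"sensor reading 4\x00"   # the same block after some damage

replicas = replicate(original, recorded)

try:
    replicate(corrupted, recorded)
    detected = False
except ValueError:
    detected = True
```

The clean block is copied three times, while the damaged block is rejected before any copy is made. Without the check, the same loop would happily produce three copies of the bad data.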

For decades a primary function of the disk array has been to provide data services that enable backup, replication, and mobility of data.

How then do these data services move up to the server level and scale horizontally?

HorizontalDataServices

Solutions to this problem are taking different forms:

VSAN

VMware vSAN can take hot edge servers with a balance of GemFireXD and flash, wrap them into VMware’s cloud orchestration infrastructure, and then apply existing horizontal data services on top. The main advantage of this solution is that the provisioning is incredibly easy, and existing administrators who are already comfortable with VMware management tools have very little to learn to make it work.

ScaleIO

For customers having a mix of cloud orchestration platforms (e.g. VMware, Microsoft, OpenStack, etc.), horizontal data services will be extended across thousands of nodes. More work needs to be done here to pool ScaleIO resources, distribute them amongst different orchestration platforms using a software-defined API, and then assign different levels of data services (e.g. snap, replication) among them.
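As a rough illustration of what pooling resources and assigning data service levels per platform might look like, here is a small Python sketch. The `Node` and `StoragePool` classes, the node names, and the capacities are all hypothetical. This is not the ScaleIO API, just one way to picture the bookkeeping such a software-defined layer would need.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    platform: str      # e.g. "vmware" or "openstack"
    capacity_gb: int


@dataclass
class StoragePool:
    nodes: dict = field(default_factory=dict)      # node name -> Node
    services: dict = field(default_factory=dict)   # platform -> set of service names

    def add_node(self, node):
        self.nodes[node.name] = node

    def assign_services(self, platform, *names):
        # Every node on this platform gets the same data service level.
        self.services[platform] = set(names)

    def services_for(self, node_name):
        node = self.nodes[node_name]
        return self.services.get(node.platform, set())

    def capacity_by_platform(self):
        totals = {}
        for node in self.nodes.values():
            totals[node.platform] = totals.get(node.platform, 0) + node.capacity_gb
        return totals


pool = StoragePool()
pool.add_node(Node("esx-01", "vmware", 800))
pool.add_node(Node("esx-02", "vmware", 800))
pool.add_node(Node("nova-01", "openstack", 400))
pool.assign_services("vmware", "snap", "replication")
pool.assign_services("openstack", "snap")
```

Here the pool spans two orchestration platforms, and each platform carries its own service level. A real system would of course also need to enforce those services, not just record them.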

HotandCold

But there is yet another problem that needs solving. What is the relationship between the server-based hot-edge and the HDD-based cold core? Certainly many use cases will require a spinning disk tier that far exceeds the capacity present on 1000s of server-based storage devices, yet there will be an expectation that the horizontal data services will seamlessly spill over onto the cold core tier (in addition to automatic data movement between the two).
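The expectation of "seamless" services across the two tiers can be pictured with a short Python sketch: data is demoted from hot to cold by age, and a snapshot covers both tiers so that the result does not depend on where a record currently lives. The function names, the age threshold, and the sample records are all assumptions for illustration, with plain dicts standing in for the flash and HDD tiers.

```python
import copy


def demote(hot, cold, now, max_age):
    """Move entries older than `max_age` seconds from hot to cold."""
    stale = [key for key, (ts, _) in hot.items() if now - ts > max_age]
    for key in stale:
        cold[key] = hot.pop(key)


def snapshot(hot, cold):
    """Point-in-time copy covering both tiers."""
    merged = dict(cold)
    merged.update(hot)  # the hot copy wins if a key exists in both tiers
    return copy.deepcopy(merged)


# Each value is (arrival timestamp in seconds, payload).
hot = {"a": (100, "old"), "b": (190, "new")}
cold = {"z": (10, "ancient")}

before = snapshot(hot, cold)
demote(hot, cold, now=200, max_age=60)
after = snapshot(hot, cold)
```

Record "a" moves to the cold tier during demotion, yet the snapshot taken afterwards is identical to the one taken before. That invariance is the property a horizontal data service would need to preserve while data moves between tiers.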

Bottom line? It’s not as easy as slapping a few flash cards into the server tier! A thoughtful data protection architecture needs to be created first. Technologists building a data lake architecture will need consultation on all these issues. I’m interested in gathering thoughts on how best to deploy not only a horizontal storage stack at the server level, but horizontal data services as well.

Addressing horizontal storage and data services is step 1 in building a 3rd platform infrastructure. In future posts I will take a look at many other areas that need consideration.

3P1

Steve

https://stevetodd.tech

Twitter: @SteveTodd

EMC Fellow