Data Lake Addendum

In “Data Lake, Simplified,” a simple picture of an emerging infrastructure was presented: in-memory data grids on top of HDFS.

The reason for this new infrastructure is that “second platform” (e.g. client-server era) data center infrastructures aren’t the best fit for the new breed of analytic applications. Two big reasons for the mismatch are:

  1. As applications have evolved to be located further and further away from application data, latency requirements of many analytic apps can no longer be satisfied. The application nearness principle at the beginning of the second platform era has evolved to the application axis: geographic separation of applications and their data.
  2. Application data has spread from being stored on one physical spindle to being (a) spread across multiple data centers, (b) moved around the globe as part of a GEO-store like Atmos, and (c) stored in hybrid cloud environments. The graphical depiction below highlights this progression.

[Figure: Span]

But when I say “Data Lake, Simplified” I am not implying it is simple. It is true that a GemFireXD/HDFS infrastructure can quickly be built, and it is also true that new forms of analytic applications can be balanced on top of it. This approach allows businesses to quickly gain insight into massive amounts of real-time (GemFireXD) and historical (HDFS) content.

But it is also true that this architecture can end up being built as a silo that is separate from legacy infrastructures. My EMC colleague Ken Durazzo introduced this discussion to me, and we labeled it “doing three things at once.”

Data Lake Addendum

In my opinion the real data lake conversation is actually an addendum that builds upon the simplified architecture. The addendum covers a variety of topics that address legacy interoperability and legacy migration. Here is a subset of the questions that businesses are beginning to ask:

  • How does one build out horizontal storage stacks and data services at the “hot edge” (e.g. the server level)? In other words, how are resiliency services like backup, migration, replication, etc., implemented horizontally at the server level, and how do they scale?
  • As solid state becomes more prevalent at the server level, what is the role of HDD? How do the horizontal storage stack and services spill over onto the HDD infrastructure? What is the best way to handle the growing disparity between server flash performance and spinning HDD? Is it possible to outsource all HDD to a service provider (and how would you run the new apps if you did)?
  • What is the data protection strategy for a data lake? Given the massive amounts of data being stored in a data lake architecture, the exact same backup and recovery paradigms used in the 2nd platform infrastructure will not apply.
  • Given the sheer scale of data on 3rd platform architectures, how is it secured? How can it become more automated and data-driven? How can customers respond to threat identification?
  • How are non-analytic applications deployed onto the same infrastructure (e.g. legacy applications that ran on the 2nd platform, or new applications that are not analytic in nature)?
  • If a data lake architecture will host potentially millions of applications, how are the applications (and the data they store) placed automatically on top of the infrastructure components? How are they automatically monitored/moved?
  • How is the speed of analytic application development/deployment accelerated from months down to days?
  • How does a data lake architecture support thousands of mobile devices and hundreds of millions of “things”? How is their data ingested? How is mobile data most efficiently analyzed?
  • How does one deterministically (i.e. within known network bandwidth limits) move applications and data from a legacy data center into the data lake?
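To make the last question concrete, here is a rough back-of-the-envelope sketch of how long a bulk move might take over a fixed link. The function name `migration_days`, the 70% default utilization, and the example figures are my own illustrative assumptions, not measurements from any customer environment, and the estimate ignores protocol overhead, retries, and competing traffic.

```python
def migration_days(data_terabytes: float, link_gbps: float,
                   utilization: float = 0.7) -> float:
    """Estimate the days needed to push a data set over a network link.

    All parameters are hypothetical inputs for illustration:
    data_terabytes -- size of the data set in decimal terabytes (10**12 bytes)
    link_gbps      -- nominal link speed in gigabits per second
    utilization    -- fraction of the link assumed usable for the migration
    """
    if data_terabytes < 0:
        raise ValueError("data_terabytes must be non-negative")
    if link_gbps <= 0 or not 0 < utilization <= 1:
        raise ValueError("link_gbps must be > 0 and utilization in (0, 1]")

    total_bits = data_terabytes * 1e12 * 8           # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization    # usable bits per second
    seconds = total_bits / effective_bps
    return seconds / 86_400                          # seconds -> days


if __name__ == "__main__":
    # Example: 1 PB (1000 TB) over a 10 Gb/s link at the assumed 70% utilization.
    print(f"{migration_days(1000, 10):.1f} days")
```

Even this simple arithmetic shows why the question matters: under these assumptions a single petabyte ties up a 10 Gb/s link for roughly two weeks, so any migration plan has to budget bandwidth explicitly.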

In an attempt to start a dialogue on these questions I plan on sharing some insights that I’m starting to hear from customers world-wide. There is a pattern forming which is essentially a building-block approach for establishing a robust and complete 3rd-platform infrastructure. As part of measuring the completeness of the vision I’ll rely on the icon below to gauge the progress of the discussion.

[Icon: 3PL0]

One common starting point for 3rd platform discussions is the shift of storage to the server layer. In a subsequent post I will highlight how this shift presents quite a few subtle problems, and lay out different approaches that address these problems.

Steve

https://stevetodd.tech

Twitter: @SteveTodd

EMC Fellow