Centera and Data Lineage

Today at EMC World the veil was lifted and unfettered access was granted to developers who are actively participating in EMC’s innovation process.

In recent posts I’ve talked about the “idea contest” sponsored by EMC in 2007, and how this idea has been turned into running code. Today that software was shown outside of EMC for the first time.

I’m assuming most of you weren’t there to see it. So allow me to describe the idea to you.

I’d like to start by saying:

Have I told you that I think Centera is cool?

Executive Pitch

When presenting this idea to the EMC Execs back in October we cut back on the nerd-speak and focused on the business value. My opening line went like this:

For any given piece of content in your business, wouldn’t it be nice to have an accurate bibliography describing all the sources of information that were used to generate said content? Wouldn’t it be nice to not only have a bibliography, but to also have the ability to navigate back to those sources, right down to the file, page, paragraph, or subfile level? Businesses struggle to create their own solutions for accurate and consistent data lineage information.

So now I’ve set up the main business problem behind the idea, which is the ability to trace the ancestry, or lineage, of a given piece of content. To highlight the usefulness of having such a solution, my co-inventor Dan and I described some specific customer use cases.

Financial: for a given earnings report, what (and where) were all the inputs used to generate the earnings report? What was the algorithm used to generate the results? How can I navigate back to a given input that might have been incorrect, fix it, and regenerate the earnings report?
Scientific: for a given set of weather patterns, tornado analysis algorithms predicted a result of “no tornadoes”, yet tornadoes occurred. How can a scientist navigate from the “incorrect result” back to the original weather pattern inputs and find out where the analysis went wrong? (Read about the research at Indiana University School of Informatics.)
Governmental: given a governmental decision based on input from intelligence and legal agencies, how can a “decision document” trace back to specific paragraphs in other documents that were used to make the decision?
Royalties: if three pieces of copyrighted information are used to form a new document, how can the copyrights be traced in order to enable royalty distributions?

Why Is This Problem So Hard?

The opening of our pitch mentioned that businesses struggle to maintain accurate and consistent lineage records. Why is this? In our opinion it’s because data lineage places a lot of requirements on the customer infrastructure:

Immutability: if the source inputs change, the bibliography is less than useful (ditto for modifying the bibliography).
Retention: if the source inputs are deleted, the bibliography is less than useful. Allowing the bibliography to be deleted would not be good.
Meta-data: a bibliography is “extra” meta-data and requires a strategy for managing the bibliography along with related content.
Graphing: a “navigation map” must be built to allow traversal from a given piece of content back to the original inputs.
Authentication: can businesses “prove” that the content, the sources, the graphs, and the bibliography have not been tampered with?

Can you see why businesses would struggle to build a complete solution like this? A solution that is lacking in any of these areas is really a solution with a hole big enough to drive a truck through.
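To make the graphing requirement a little more concrete, here is a minimal sketch of a navigation map and a walk from a piece of content back to its original inputs. It is only an illustration: the dictionary layout, the content names, and the two helper functions are all made up for this example and are not taken from the software shown at EMC World.

```python
from collections import deque

# Hypothetical navigation map: each piece of content lists the inputs it was
# generated from. Every name here is invented for illustration.
LINEAGE = {
    "earnings_report": ["sales_ledger", "expense_ledger", "report_algorithm"],
    "sales_ledger": ["regional_sales_east", "regional_sales_west"],
    "expense_ledger": [],
    "report_algorithm": [],
    "regional_sales_east": [],
    "regional_sales_west": [],
}


def trace_sources(content_id, lineage):
    """Return every ancestor of content_id, breadth-first, without duplicates."""
    seen = []
    queue = deque(lineage.get(content_id, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen


def original_inputs(content_id, lineage):
    """Return only the ancestors that have no recorded inputs of their own."""
    return [c for c in trace_sources(content_id, lineage) if not lineage.get(c)]


if __name__ == "__main__":
    print(trace_sources("earnings_report", LINEAGE))
    print(original_inputs("earnings_report", LINEAGE))
```

The walk itself is the easy part. The map is only trustworthy if nobody can edit or delete its entries, which is where the other four requirements come in.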

By no stretch of the imagination is the industry unaware of the scope of this problem. In fact, the academic community has already done quite a bit of research and collaboration in this area. Check out the research on Provenance Aware Storage Systems being sponsored at Harvard.

The Centera Difference

Can anyone think of a product out there that specializes in immutability, retention, meta-data, and authentication? It’s called Centera. There’s not a lot of research focusing on the use of content addressable storage (CAS) systems as the foundation of a data lineage solution. But it makes a lot of sense.

Centera meta-data can be used for both the bibliography and the navigation map.
The bibliography and the maps are immutable.
The bibliography and the maps are non-deletable via Centera retention.
The bibliography and the maps are authentic and tamperproof because they are protected by Centera’s content addresses.
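For readers who haven't worked with content addressing, the points above can be sketched with a toy in-memory store. To be clear about the assumptions: this is not the Centera API. The ToyCAS class, the helper functions, the JSON record layout, and the choice of SHA-256 for addresses are all invented for illustration, and retention is not modeled at all.

```python
import hashlib
import json


class ToyCAS:
    """In-memory stand-in for a content addressable store (not the Centera API)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself (SHA-256 is an
        # assumption made for this sketch).
        address = hashlib.sha256(data).hexdigest()
        # Existing content is never overwritten; identical content maps to
        # the same address.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Re-hash on read: content that was altered no longer matches.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("content does not match its address")
        return data


def put_bibliography(store, content_address, source_addresses):
    """Store a bibliography that names its sources by content address."""
    record = {"content": content_address, "sources": sorted(source_addresses)}
    return store.put(json.dumps(record, sort_keys=True).encode("utf-8"))


def get_bibliography(store, address):
    """Fetch a bibliography record and verify it against its address."""
    return json.loads(store.get(address))


if __name__ == "__main__":
    store = ToyCAS()
    sales = store.put(b"Q1 sales figures")
    expenses = store.put(b"Q1 expense figures")
    report = store.put(b"Q1 earnings report")
    bibliography = put_bibliography(store, report, [sales, expenses])
    print(get_bibliography(store, bibliography))
```

Because the bibliography refers to its sources by address, and its own address is computed from its bytes, changing either the record or a source it points to shows up as a mismatch when the content is read back.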

Notice that I’ve only mentioned the bibliography and the maps. I haven’t mentioned the content or the source inputs. Customers could in theory use Centera to implement data lineage on their existing, non-CAS infrastructures (e.g. documents in file systems, on the web, etc.).

But if the original content, the source input, and the transforming algorithms are all stored on Centera, they will also benefit from the immutability, retention, and authenticity that Centera provides.

Like To See A Picture?

I’ve got one. For a future post. I also hope to provide some feedback from EMC World attendees.

Steve

Steve Todd
