Valuation Via Content Processing

In a previous post I introduced five different methods of data valuation. Valuation as a storage service can be implemented in a variety of locations within a data center ecosystem. The picture below highlights these five approaches.

[Figure: DataValueFiveApproaches]

In this post I’d like to focus on the first method of valuation: content processing. This approach extracts value from data by directly analyzing the content held in a storage repository (e.g., a Data Lake). For well over a decade, many of my co-workers have built content analysis toolsets as part of our Enterprise Content Division (e.g., technologies that came in through acquisitions such as Documentum and X-Hive).

The algorithms used for data valuation via content processing generate “value scores” using the approach highlighted below.

[Figure: ContentProcessing]

The steps that can be followed to produce valuation scores directly from content are as follows:

  • Text extraction: natural language processing (NLP) techniques are applied to a text document
  • Language identification: algorithms are run to identify the language being parsed
  • Linguistic analysis via tokenization: the content is organized into tokens using techniques such as stemming and lemmatization
  • Token annotation: the resulting tokens are compared to various business domains and tagged if they are relevant

This approach requires line-of-business owners to create domain-specific taxonomies that are of importance to their business.
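The tokenization and annotation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TAXONOMY` contents, the function names, and the crude plural-stripping `normalize` step (a stand-in for real stemming or lemmatization) are all hypothetical, the input is assumed to be English so language identification is skipped, and none of this reflects the actual Enterprise Content Division toolsets.

```python
import re

# Hypothetical taxonomy a line-of-business owner might supply:
# domain name -> set of normalized terms that matter to that domain.
TAXONOMY = {
    "finance": {"invoice", "payment", "ledger"},
    "legal": {"contract", "liability"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def normalize(token):
    """Crude stand-in for stemming: strip a single trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def annotate(text, taxonomy=TAXONOMY):
    """Return (token, domain) pairs for every token found in the taxonomy."""
    annotations = []
    for raw in tokenize(text):
        token = normalize(raw)
        for domain, terms in taxonomy.items():
            if token in terms:
                annotations.append((token, domain))
    return annotations


tags = annotate("The contracts list three overdue invoices and one payment.")
print(tags)
# [('contract', 'legal'), ('invoice', 'finance'), ('payment', 'finance')]
```

A production pipeline would likely replace `normalize` with a proper stemmer or lemmatizer and detect the language before choosing one; the point here is only that the taxonomy drives which tokens get tagged.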

The resulting annotated tokens then allow valuation algorithms to extract business value from the content. The equation below describes this function:

V(c,x) = f({outside-factors},{domain-specific-tokens},{domain-specific-token-metadata})

The value (V) of content “c” in context “x” is the result of combining the tokens, metadata about these tokens, and the context (outside-factors).
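To make the shape of this function concrete, here is one possible form of f, written as a sketch rather than as the algorithm itself, which this post does not specify. The assumptions are mine: each token's metadata carries a hypothetical numeric `weight`, the weights are summed, and each outside factor acts as a multiplier on that sum.

```python
def value(outside_factors, tokens, token_metadata):
    """One hypothetical f: sum per-token weights, then scale by context.

    outside_factors: dict of factor name -> multiplier (assumed form)
    tokens:          list of domain-specific tokens found in the content
    token_metadata:  dict of token -> {"weight": float} (assumed form)
    """
    # Tokens with no metadata contribute nothing to the base score.
    base = sum(token_metadata.get(t, {}).get("weight", 0.0) for t in tokens)

    # Each outside factor scales the score up or down.
    multiplier = 1.0
    for factor in outside_factors.values():
        multiplier *= factor

    return base * multiplier


tokens = ["contract", "invoice", "payment"]
metadata = {
    "contract": {"weight": 3.0},
    "invoice": {"weight": 1.5},
    "payment": {"weight": 1.5},
}
context = {"regulatory_audit": 2.0}

print(value(context, tokens, metadata))  # 12.0
```

The same content scores 6.0 with no outside factors and 12.0 during the hypothetical audit, which illustrates why the value is written as V(c,x) and not simply V(c).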

In a future post this algorithm will be described in more detail.

Steve

https://stevetodd.tech

Twitter: @SteveTodd

EMC Fellow