The 100 Year Paper Clip
Digital Composite

One of the requirements faced by digital archivists is the guarantee that archived content will still be accessible and available for years to come. SNIA has written a report on this topic. The crisis described in the paper needs a solution. Will the digital content of today still be viewable and useful in 100 years?

There are many challenges to be solved in building such an archive. One of the more daunting challenges is to manage the relationships between the content being archived (e.g. a photograph or scanned document) and the meta-data describing the content. Wouldn’t it be useful to have some sort of giant paper clip to permanently join together all of the disparate artifacts that relate to the digitized content?

Instead of pounding away at this problem with a hammer, let’s use paper clips. And let’s use the latest in paper-clip technology.

XAM.

Industry standards mandate a structure for the long-term archival of content and associated metadata.  The OAIS standard defines an archival information package (AIP) as follows:

  • Data Object: a digital representation of what is being archived (e.g. video, audio, document).
  • Representation Information: a Data Object is a blob of bits; the representation information describes how to interpret said bits.
  • Reference Information: identifiers used to refer to the content.
  • Provenance Information: the annotated history of the content, such as the initial archivist, information about migration events, etc.
  • Context Information: more meta-data (e.g. why the content was created and how it relates to other data objects).
  • Fixity Information: authentication and authenticity keys needed to confirm that the content has not been altered.
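
To make the six-part structure concrete, here is a minimal Python sketch of an AIP as a plain record. The class name, field names, field types, and the `missing_parts` helper are all hypothetical, chosen only to mirror the list above; OAIS does not prescribe any particular data structure, and using a SHA-256 digest as the fixity value is an assumption for illustration.

```python
import hashlib
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ArchivalInformationPackage:
    """Hypothetical container for the six AIP parts listed above."""
    data_object: bytes                 # the bits being archived
    representation_information: str    # how to interpret the bits
    reference_information: str         # identifier used to refer to the content
    provenance_information: tuple      # annotated history, oldest event first
    context_information: str           # why the content exists, related objects
    fixity_information: str            # assumed here: SHA-256 digest of data_object


def missing_parts(aip):
    """Return the names of any AIP parts that were left empty."""
    return [f.name for f in fields(aip) if not getattr(aip, f.name)]


scan = b"\x89PNG placeholder bytes"
aip = ArchivalInformationPackage(
    data_object=scan,
    representation_information="image/png",
    reference_information="example-photo-0001",
    provenance_information=("ingested by archivist",),
    context_information="",  # deliberately left empty
    fixity_information=hashlib.sha256(scan).hexdigest(),
)
print(missing_parts(aip))  # ['context_information']
```

Even in this toy form, the point stands: an archive has to track all six parts together, and a missing part should be easy to detect.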

Current Implementations

My brief description covers only six of the items standardized by OAIS; each item can be further decomposed and described. I’ve only scratched the surface of the Archival Information Package.

My larger point, however, is that the effective preservation of digital content requires the collation of at least six different pieces of information. What technology can be used to solve this problem today?

File systems and databases are logical choices. The content itself might be stored in a file system. Some of the metadata might be stored in a database. Other pieces of information (e.g. the Provenance Information) might be stored in a log.

Imagine trying to manage these locations and relationships over 100 years, and then imagine how difficult it is to ensure that files and records don’t get moved, deleted, and/or modified. Consider the number of technology refresh events that would occur over 100 years.

XAM as a Paper Clip

I agree with SNIA’s conclusion that XAM holds promise for addressing the 100 year issue. The XAM paper clip that can hold all of the content and meta-data together is known as an “XSET”. Consider a XAM application that gathers all these streams together as an XSET, and then commits that XSET to an archival device. I’ve whipped together some XAM pseudo-code to illustrate my point (don’t try to compile it ;>)).

XSET.createXStream("DataObject");
XSET.createXStream("RepresentationInformation");
XSET.createXStream("ReferenceInformation");
XSET.createXStream("ProvenanceInformation");
XSET.createXStream("ContextInformation");
XSET.createXStream("FixityInformation");
myXUID = XSET.commit();

This example, though trivial, highlights the point I’m making. XAM applications gather variable amounts of fixed content (data/meta-data not likely to change) and create an “XSET paper clip”. They receive a “handle” in return. The handle is known as a XUID, and it is the sole piece of information required to retrieve all six pieces of content.
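
For readers who want something they can actually run, the same flow can be imitated with a toy in-memory model in Python. This is only a sketch of the idea, not the XAM API: `ToyXSystem`, `ToyXSet`, and their method names are hypothetical, and the handle here is just a random UUID standing in for a XUID.

```python
import uuid


class ToyXSystem:
    """In-memory stand-in for an archive; not a real XAM storage system."""

    def __init__(self):
        self._committed = {}  # handle -> {stream name: bytes}

    def create_xset(self):
        return ToyXSet(self)

    def open_xset(self, xuid):
        # Return a copy so callers cannot modify committed content.
        return dict(self._committed[xuid])


class ToyXSet:
    """Collects named streams, then hands them to the archive in one commit."""

    def __init__(self, system):
        self._system = system
        self._streams = {}

    def create_xstream(self, name, payload):
        self._streams[name] = bytes(payload)

    def commit(self):
        # The handle is opaque: it says nothing about where the data lives.
        xuid = uuid.uuid4().hex
        self._system._committed[xuid] = dict(self._streams)
        return xuid


archive = ToyXSystem()
xset = archive.create_xset()
for part in ("DataObject", "RepresentationInformation", "ReferenceInformation",
             "ProvenanceInformation", "ContextInformation", "FixityInformation"):
    xset.create_xstream(part, ("placeholder for " + part).encode())

my_xuid = xset.commit()
restored = archive.open_xset(my_xuid)
print(sorted(restored))
```

The application keeps only `my_xuid`; everything else about storage is the archive’s business.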

Applications become unaware of the “location” of their content. This is a key characteristic of XAM which helps to solve an important issue in the 100-year problem. Because applications never deal in file names or paths, there are no files to be renamed, moved, altered, or deleted over time.

XUIDs provide location-independent naming schemes; digitized objects can cross-reference each other using these XUIDs (as opposed to location-dependent absolute file names).

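XUIDs can embed cryptographic hashes that assist in proving that archived content is original and authentic. (The XAM standard does not require this; the contents of the XUID are left to the storage vendor.)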
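
Where an identifier does carry a hash of the content, a client can check what it gets back against what it stored. The Python sketch below illustrates that idea only. The `content_id` and `verify` functions are hypothetical, and the scheme (SHA-256 over length-prefixed stream names and payloads) is an assumption of mine, not the format of any real XUID.

```python
import hashlib
import hmac


def content_id(streams):
    """Derive an identifier from stream names and payloads (assumed scheme)."""
    digest = hashlib.sha256()
    for name in sorted(streams):  # fixed order so the result is repeatable
        encoded_name = name.encode("utf-8")
        payload = streams[name]
        # Length prefixes keep adjacent fields from running together.
        digest.update(len(encoded_name).to_bytes(8, "big"))
        digest.update(encoded_name)
        digest.update(len(payload).to_bytes(8, "big"))
        digest.update(payload)
    return digest.hexdigest()


def verify(streams, expected_id):
    """Return True if the retrieved streams still match the identifier."""
    return hmac.compare_digest(content_id(streams), expected_id)


stored = {"DataObject": b"scanned page", "ContextInformation": b"1908 ledger"}
handle = content_id(stored)

retrieved_ok = dict(stored)
retrieved_bad = dict(stored, DataObject=b"scanned page (edited)")

print(verify(retrieved_ok, handle))   # True
print(verify(retrieved_bad, handle))  # False
```

A check like this only covers the content that was hashed when the identifier was created, so anything that is allowed to change after commit would have to be left out of the hash.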

XSETs can be stored with retention periods (e.g. store this content forever).

If a vendor supports XAM by building a XAM-enabled archive (known in XAM as an “XSYSTEM”), their system can become a target for technology refresh and migration of XSETs.

Beginning Stages

While I like the technology and what it offers, XAM is still in the early stages. Several storage vendors have been involved in the evolution of XAM (Sun, NetApp, Hitachi, IBM, HP, to name a few). Many of them are in the process of building their own Vendor Implementation Modules (VIMs) that plug into XAM. EMC has already productized their first VIM.

SNIA is currently evangelizing XAM and educating the masses on its capabilities. I’d like to see application vendors begin writing XAM applications that provide archiving functionality compliant with the OAIS standard.

I personally am going to have a chat with the Documentum folks ;>)

Steve

2 Comments

  1. While including hashes (or other cryptographically verifiable representations of the stored content) as part of the XAM XUID is a good idea, the XAM standard does not require their inclusion. The XAM Storage System vendor is free to define the vendor-specific contents of the XUID.
    This is a good idea because it allows a client application to independently verify trust (or for the local VIM to do it on the application’s behalf). If I just rely on an external system to say that it is trustworthy, I have no way to prove that what it is saying is true. By storing the “proof” locally, I can verify that the data it returned is indeed the data that I stored. While a client application could hash the data before storing it, store the hash locally, and then verify the hash on data retrieval, having the hash as an intrinsic part of the storage identifier has numerous advantages.
    Of course, if you don’t do it in the VIM, you need either a utility routine to extract the hash so that the application can verify the data itself, or a routine to verify the data using the XUID, but this is trivial to provide.
    Also note that because XSETs can contain both binding (immutable) and non-binding (changeable) fields, any hashes in the XUID can only allow verification of the contents of the binding fields.

  2. Enabling a repository to prove to a client that a file is intact is a problem that RSA Labs has been focusing on recently. It’s tricky: A hash does give an indication of the file contents at the time of storage, but of course doesn’t prove that the file remains intact. We’re developing techniques that enable a repository to prove that a file is completely intact (i.e., no corruptions or erasures even at the bit level)—while transmitting only some tens of bytes to a client. (Details are at http://www.rsa.com/rsalabs/node.asp?id=3357 and http://eprint.iacr.org/2008/175.)
