One of my EMC co-workers in China, Hang Guo, told me about a library in Beijing that is so amazing that he and his wife go there just to hang out on the weekends. I decided to look into this library myself and uncovered some pretty cool statistics.
The Peking University Library, founded in 1902, is the largest university library in all of Asia. I assume that the word "largest" (as taken from this web page) refers to the number of holdings contained within. According to the university there are 6.5 million different artifacts, including rare books, periodicals, rubbings from inscriptions on bronze and stone tablets, and various historical journals. The library has also made great strides with digital assets. All in all it sounds like a great place to visit and a well-run, famous institution.
If a physical building can house millions of physical objects as part of a library collection, then housing hundreds of millions of digital objects in a digitized library shouldn't be a problem, right? After all, the artifacts all have the same characteristics; they are a sequence of bits.
There are many articles like this that lament the fact that digital preservation is "too hard". One of the reasons seems to be the very fact that the artifacts all have the same characteristics! The main difference between the actual artifacts themselves is bit patterns and bit length!
Perhaps one of the most concise problem summaries that I've come across is from the CASPAR project (Cultural, Artistic, and Scientific knowledge for Preservation, Access, and Retrieval). A training lecture on the topic can be found here. I pulled a quote from the website which describes the problem well:
The peculiar characteristics of digital objects imply several difficulties in their preservation.
Even if it is obviously necessary to preserve the bits, this is not enough since more information, at many different levels (e.g. the format in which they are encoded, the semantics necessary for their intelligibility, the rights which are on them, etc.) is required in order to reach the ability to maintain and reuse these objects.
Moreover, digital preservation undergoes several threats which are related to financial and legal aspects, technological obsolescence, changes in environment and in the people knowledge base, trustworthiness of repositories.
Object-based storage can solve some (but not all) of these problems. The first decade of the new millennium saw the rise of object-based storage (there are now thousands of customer-purchased, object-based systems containing multiple petabytes of fixed content). These types of systems excel at the grouping of archived content and the metadata that helps preserve it.
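To make the idea of grouping content with its metadata a bit more concrete, here is a minimal Python sketch. It is not the API of XAM or of any particular product; the names `ArchivedObject` and `ObjectStore`, the metadata keys, and the choice of SHA-256 as the content address are all my own assumptions for illustration. It shows an object that carries its bits together with the format, semantics, and rights information the CASPAR quote calls out, and why the bits alone are not enough to get the original text back.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ArchivedObject:
    """Hypothetical archive record: raw bits plus the metadata needed to reuse them."""
    content: bytes
    metadata: dict = field(default_factory=dict)

    @property
    def address(self) -> str:
        # Address derived from the bits alone (SHA-256 is an assumption of this sketch).
        return hashlib.sha256(self.content).hexdigest()


class ObjectStore:
    """Hypothetical in-memory store keyed by content address."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes, **metadata) -> str:
        obj = ArchivedObject(content, dict(metadata))
        self._objects[obj.address] = obj
        return obj.address

    def get(self, address: str) -> ArchivedObject:
        obj = self._objects[address]
        # Recompute the address on read to detect damaged bits.
        if obj.address != address:
            raise ValueError("content does not match its address")
        return obj


title = "\u5317\u4eac"  # "Beijing" written in Chinese characters
store = ObjectStore()
address = store.put(
    title.encode("utf-8"),
    encoding="utf-8",                 # format: how the bits are encoded
    description="catalog title",      # semantics: what the bits mean
    rights="illustrative value only"  # rights: who may use the object
)

record = store.get(address)
# With the format metadata, the original text is recoverable.
recovered = record.content.decode(record.metadata["encoding"])
# Without it, a reader who guesses the wrong encoding gets different text from the same bits.
misread = record.content.decode("latin-1")
```

The bits in `record.content` are identical in both decodes; only the metadata tells a future reader which interpretation is the right one.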
I've long felt that XAM technology can lend a helping hand when it comes to digital preservation, and I've laid my reasoning out in a research paper. Indeed, research at Penn State is testing out XAM in the context of digital archiving. I also found it interesting that Dell threw their hat into the ring and announced XAM support for their new object-based system.
In 2000 there were zero commercially available, well-known object-based storage systems. I'm certain that by 2020 we'll see an incredibly wide and diverse range of object-based storage, and it will be interesting to see if XAM has taken hold as the industry-standard way to interoperate and preserve the world's digital information.
Steve
http://stevetodd.typepad.com
Twitter: @SteveTodd
EMC Intrapreneur

