Research Papers Moving to the Cloud

The ACM has been publishing scholarly works since 1954. They have been diligently maintaining an online library of research papers that has continually benefited the technical community. For years much of their budget has been spent on print-based services as well.

I read an announcement by the ACM last week that they have decided to scale back their investment in print-based services and focus instead on the long-term digital preservation of ACM content as part of a public cloud.

Public clouds that specifically target digital preservation have a different set of requirements than a public cloud like Amazon EC2, for example. The focus in a "preservation cloud" is longevity, and the administrators of said cloud must think like digital curators.

It's an interesting exercise to study the ACM's choices of (a) system and (b) curator. The ACM has gone with CLOCKSS and Portico, respectively. Both CLOCKSS and Portico are not-for-profit organizations.

CLOCKSS

LOCKSS stands for "Lots of Copies Keep Stuff Safe", and the 'C' stands for "controlled". The LOCKSS software framework was developed at Stanford University and follows a peer-to-peer, decentralized model.

The CLOCKSS initiative is run by "the world’s leading scholarly publishers and research libraries" with a goal of ensuring "the long-term survival of Web-based scholarly publications for the benefit of the greater global research community". CLOCKSS ingest boxes are located at Rice, Indiana, and Stanford universities. As libraries and researchers submit content to these "ingest boxes", it is stored in a normalized, maintainable format in triplicate across the sites. Once all of the ingest boxes have cross-audited the content, the artifacts are moved to "archive nodes" spread throughout the globe. These nodes continually audit one another to verify the authenticity of the content, and create new copies when hardware fails (very similar to a RAIN architecture).
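To make the cross-audit idea concrete, here is a minimal sketch of how replicas could be compared and a damaged copy repaired. It is my own simplification, not the actual LOCKSS polling protocol: I assume a simple majority vote over SHA-256 digests, and the function names and node names are purely illustrative.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of one replica's content."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Compare all replicas and overwrite minority copies with the majority copy.

    `replicas` maps a (hypothetical) node name to the bytes it holds.
    Returns the names of the nodes that were repaired.
    """
    digests = {node: digest(data) for node, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]

    # Without a strict majority we cannot tell which copy is authentic.
    if votes <= len(replicas) // 2:
        raise ValueError("no majority among replicas")

    good_copy = next(data for node, data in replicas.items()
                     if digests[node] == winner)

    repaired = []
    for node in list(replicas):
        if digests[node] != winner:
            replicas[node] = good_copy  # replace the damaged copy
            repaired.append(node)
    return repaired


# Three copies, one of which has silently rotted.
boxes = {"rice": b"paper", "indiana": b"paper", "stanford": b"papeX"}
repaired_nodes = audit_and_repair(boxes)
```

With three copies, one damaged replica is outvoted by the other two. If all three disagree, the sketch refuses to guess, which is the conservative choice for an archive.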

Interestingly enough, the solution is a "dark archive": the content is not initially accessible to the general public. Ingest and preservation within CLOCKSS are focused on maintaining content for the long term. When a "trigger event" occurs, however, content is made public by migrating it to the newest format and storing it on publicly available nodes at Stanford and the University of Edinburgh.

A description of how CLOCKSS works can be found here.

Portico

Portico can be thought of as a third-party partner that supplies the people and processes behind the CLOCKSS solution. The process they bring to the table is best explained in one of their brochures:

Diagram from the Portico brochure

The CLOCKSS and Portico solution relies heavily on migration of data formats as the years roll by. I've always wondered about the feasibility of such an approach. Can it scale to billions of documents? Does the system have to continually upgrade all documents to newer file formats as these new formats become available? One thing I picked up as I learned about their solution is that the format conversion occurs when the format goes from "dark" to "publicly available". In other words, there's no need to continually upgrade file formats.
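Here is a rough sketch of what "convert only on trigger" could look like. Everything in it is hypothetical (the DarkArchive class, the format names, the toy converter). It only illustrates the design point that conversion cost is paid per triggered document, not per stored document.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    doc_id: str
    fmt: str        # format the bytes are stored in
    payload: bytes


def sgml_to_html(payload: bytes) -> bytes:
    """Toy converter standing in for a real format migration."""
    return b"<html>" + payload + b"</html>"


# Hypothetical registry: (stored format, target format) -> converter.
CONVERTERS = {("sgml", "html"): sgml_to_html}


class DarkArchive:
    def __init__(self, current_format: str):
        self.current_format = current_format
        self._dark = {}       # preserved as ingested, not publicly readable
        self.public = {}      # content released after a trigger event
        self.conversions = 0  # how many migrations were actually performed

    def ingest(self, artifact: Artifact) -> None:
        """Store the artifact untouched; no format upgrade happens here."""
        self._dark[artifact.doc_id] = artifact

    def trigger(self, doc_id: str) -> Artifact:
        """Release one document, migrating it to the current format if needed."""
        art = self._dark[doc_id]
        if art.fmt != self.current_format:
            convert = CONVERTERS[(art.fmt, self.current_format)]
            art = Artifact(doc_id, self.current_format, convert(art.payload))
            self.conversions += 1
        self.public[doc_id] = art
        return art


archive = DarkArchive(current_format="html")
archive.ingest(Artifact("a", "sgml", b"first paper"))
archive.ingest(Artifact("b", "sgml", b"second paper"))
released = archive.trigger("a")  # only "a" is converted; "b" stays dark
```

In this sketch, the number of conversions tracks the number of trigger events rather than the size of the archive, which is why the billions-of-documents worry mostly goes away.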

The alternative approach to format conversion is something I've been researching here at EMC. As an entire class of documents is imported into an archive, can a Virtual Machine also be imported that is able to "read" all of these documents, both now and in the future? This would require some type of virtual machine "player" that has infinite playback capability (no easy task).

I hope to learn more about the server/storage implementation of CLOCKSS at locations around the world.

Steve

http://stevetodd.typepad.com

Twitter: @SteveTodd

EMC Intrapreneur