The team at the JFK Library scanned over 70 photos and nearly 70,000 documents in approximately six months before they ran out of space (and thus more capacity was added to their Centera).
When I looked around I only saw two scanners. I saw zero scanning elves (magical creatures that run the scanners all night long). I wanted to understand how that many documents ended up on Centera that quickly.
How were the documents initially organized? Were they re-organized? What’s the scanning process?
The answer to these questions helps to map the JFK archival process on top of the EMC storage infrastructure.
I mentioned in my last post that it was incredible to actually see a physical document that came across JFK’s desk. The document that I saw was located in a folder. The folder had a serial number. This particular folder was from JFK’s office files, and the serial number was something along the lines of “JFKPOF-001-001”. This serial number can be broken down into three parts.
Collection Name
The collection name “JFKPOF” stands for “John F. Kennedy’s Presidential Office Files”. So all folders prefixed with “JFKPOF” contain documents (or audiovisual items such as photos) that came from President Kennedy’s office. There are other collections in the library as well. For example, the collection with the prefix “JFKNSF” contains JFK’s national security files.
Box Name
All of the folders for a given collection end up in boxes. So if I look at the folder bearing the serial number JFKPOF-001-001, I’m looking at the first folder in the first box of JFK’s Presidential Office Files collection.
If you want to get a feel for the enormity of this digitization effort, scroll down to the bottom of this JFK website page and click on any of the “Series”. For example, Series 2 highlights boxes 27-33 of the President’s office files.
Folder Name
By now you know that the collection has boxes which have folders which have the actual documents and audiovisuals needing to be archived. But who put the documents into the folders in the first place? That depends.
In the case of JFK’s Presidential Office Files, it was Evelyn Lincoln, President Kennedy’s personal secretary. Often it is an archivist who organizes the documents. Either way, I was told that it’s the library’s policy to retain the original order (an archival tenet known as “respect des fonds”).
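To make the collection/box/folder breakdown concrete, here is a minimal Python sketch that splits a serial number into its three parts. The exact format is my assumption: I only know the serial number was "something along the lines of" JFKPOF-001-001, so the uppercase prefix, the zero-padded three-digit fields, and the names `FolderId` and `parse_folder_id` are all hypothetical.

```python
import re
from typing import NamedTuple


class FolderId(NamedTuple):
    """The three parts of a folder serial number (hypothetical type)."""
    collection: str  # e.g. "JFKPOF" (Presidential Office Files)
    box: int         # box number within the collection
    folder: int      # folder number within the box

    def __str__(self) -> str:
        # Assumption: box and folder are zero-padded to three digits.
        return f"{self.collection}-{self.box:03d}-{self.folder:03d}"


# Assumed layout: uppercase letters, then two dash-separated numbers.
_SERIAL = re.compile(r"^([A-Z]+)-(\d+)-(\d+)$")


def parse_folder_id(serial: str) -> FolderId:
    """Split a serial such as 'JFKPOF-001-001' into its parts."""
    match = _SERIAL.match(serial)
    if match is None:
        raise ValueError(f"not a folder serial number: {serial!r}")
    collection, box, folder = match.groups()
    return FolderId(collection, int(box), int(folder))


first = parse_folder_id("JFKPOF-001-001")
print(first.collection, first.box, first.folder)
```

Because the serial number round-trips through `str()`, a scanned file named after it can always be traced back to the physical folder it came from.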
Given the organization of collections->boxes->folders, the JFK library team designed a process for taking every folder and digitally preserving its contents. Some key decisions made in this process ultimately influenced the storage infrastructure:
Keep The Originals
Digitally preserving all of the documents doesn’t mean that the JFK library will dispose of the originals (some digital preservation projects do). It also means that the library would not use equipment that might damage the documents (such as document feeders that automate the scanning of multiple documents).
One ramification of the decision to keep the originals was that the library had to name the folders (using the above-mentioned naming scheme). This allows a digitally preserved document to link back to the original.
High Quality Document Capture
Some of JFK’s documents (many written on a typewriter) contained additional handwritten notes on them. Some notes were written lightly using a pencil. It was imperative that the scanning process resulted in a very high document resolution. For this reason the team chose to capture documents as 600 DPI TIFF files. This decision clearly influences storage capacity.
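To get a rough feel for how 600 DPI drives capacity, here is a back-of-the-envelope Python sketch. Everything beyond the 600 DPI figure is my assumption, not something the library told me: a letter-size page (8.5 x 11 inches), 24-bit color, no compression, no TIFF header overhead, and one page per document.

```python
def uncompressed_scan_bytes(width_in: float, height_in: float,
                            dpi: int = 600, bits_per_pixel: int = 24) -> int:
    """Raw pixel data size of one scanned page, ignoring file headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumption: letter-size page, 24-bit color, uncompressed.
page_bytes = uncompressed_scan_bytes(8.5, 11.0)
print(f"one page: {page_bytes / 1e6:.1f} MB")        # one page: 101.0 MB

# Assumption: 70,000 single-page documents.
total_bytes = 70_000 * page_bytes
print(f"70,000 pages: {total_bytes / 1e12:.2f} TB")  # 70,000 pages: 7.07 TB
```

Real numbers will differ (compression, grayscale scans, and multi-page documents all move the total), but the order of magnitude shows why resolution is a storage decision as much as an archival one.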
Additional Metadata
Nowhere in JFK’s documents does the phrase “Cuban Missile Crisis” appear. Scanned documents that only get run through OCR software would therefore be missing the additional metadata that history has generated. The team felt it was important to define a step in the process that provided for the addition of relevant metadata.
Quality Controls
The archival team had two quality goals: the quality of the images being scanned and the accuracy of the metadata being entered. Quality controls were defined for every document being digitally preserved.
Time To Market!
Adding metadata to every document was unrealistic, given the enormous number of boxes and folders contained in all of JFK’s collections (the documents possessed by the library number in the millions). So a decision was made to add the additional metadata at the folder level.
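Here is a small Python sketch of what folder-level metadata buys you: a term that never appears in any document can still lead a researcher to the right scans. The `Folder` class, the `search` function, the subject terms, and the file names are all hypothetical; this is not the library’s actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Folder:
    """A folder with metadata entered once, shared by all of its scans."""
    serial: str
    subjects: List[str] = field(default_factory=list)  # added by an archivist
    scans: List[str] = field(default_factory=list)     # scanned file names


def search(folders: List[Folder], term: str) -> List[str]:
    """Return every scan whose folder carries the given subject term."""
    wanted = term.lower()
    hits: List[str] = []
    for folder in folders:
        if any(wanted == subject.lower() for subject in folder.subjects):
            hits.extend(folder.scans)
    return hits


# Hypothetical example data.
folders = [
    Folder("JFKPOF-001-001",
           subjects=["Cuban Missile Crisis"],
           scans=["JFKPOF-001-001-p0001.tif", "JFKPOF-001-001-p0002.tif"]),
    Folder("JFKPOF-001-002",
           subjects=["Correspondence"],
           scans=["JFKPOF-001-002-p0001.tif"]),
]

print(search(folders, "cuban missile crisis"))
```

The trade-off is precision: every scan in a matching folder comes back, whether or not that particular page is relevant.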
Ready to scan? Hold on.
The JFK Library was about to create a trusted digital repository (TDR). The OCLC (Online Computer Library Center) and RLG (Research Libraries Group) created a paper describing the attributes and requirements of a TDR. The paper recommends an emerging international standard known as an Open Archival Information System (OAIS).
The next step in the process is understanding how (and if!) an OAIS can be mapped onto a storage infrastructure.
But before I end, I’d like to again thank the folks from NARA that helped me on my visit to the library!
Steve

